fast_eda.fast_eda

Functions

describe_function(df)

Generate summary statistics for numeric columns in the DataFrame.

distribution_plots(df, c, r[, figsize, col_ovr])

Plots distributions of columns from a DataFrame using Matplotlib subplots and Seaborn plots.

count_nulls(df)

Count missing values in each column of the DataFrame.

correlation_matrix_viz(df)

Generate a correlation matrix visualization for numeric columns in a DataFrame.

Module Contents

fast_eda.fast_eda.describe_function(df)[source]

Generate summary statistics for numeric columns in the DataFrame.

This function computes basic statistics such as mean, median, standard deviation, minimum, and maximum for each numeric column in the DataFrame, providing an overview of the central tendency and spread of the data.
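The statistics named above can be reproduced with plain pandas. The following is a minimal sketch, not the function's actual implementation: the layout of the real return value is not specified here, and the `label` column is added only to illustrate the assumption that non-numeric columns are skipped.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [10, 20, 30, 40, 50],
    "label": ["v", "w", "x", "y", "z"],  # illustrative non-numeric column
})

# Keep only the numeric columns, then compute the statistics listed above.
numeric = df.select_dtypes(include="number")
summary = numeric.agg(["mean", "median", "std", "min", "max"])
print(summary)
```

In this sketch each statistic is a row and each numeric column is a column of `summary`; the real function may arrange its output differently.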

Parameters:

df (pandas.DataFrame) – The input DataFrame containing numeric columns.

Returns:

A DataFrame containing the calculated summary statistics for each numeric column.

Return type:

pandas.DataFrame

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]})
>>> describe_function(df)

fast_eda.fast_eda.distribution_plots(df, c, r, figsize=(10, 6), col_ovr=None)[source]

Plots distributions of columns from a DataFrame using Matplotlib subplots and Seaborn plots.

This function creates a grid of subplots to visualize the distributions of specified columns from a DataFrame. Numeric columns are plotted using histograms, while string columns are plotted using bar plots.

Parameters:
  • df (pandas.DataFrame) – The DataFrame containing the data to be plotted. Must not be empty.

  • c (int) – The number of columns in the subplot grid.

  • r (int) – The number of rows in the subplot grid.

  • figsize (tuple of int, optional, default=(10, 6)) – The size of the figure in inches (width, height).

  • col_ovr (list of str, optional) – A list of column names to plot. If None, all columns in the DataFrame are used. Must be a subset of the DataFrame’s columns.

Returns:

  • fig (matplotlib.figure.Figure) – The Matplotlib figure object containing the subplots.

  • axes (numpy.ndarray of matplotlib.axes._subplots.AxesSubplot) – An array of Axes objects corresponding to the subplots.

Raises:

AssertionError – If input validation fails for any of the parameters.

Notes

  • The function handles numeric and string columns differently:
    • Numeric columns: Plotted using Seaborn’s histplot.

    • String columns: Plotted using Seaborn’s barplot without error bars.

  • Any unused subplot axes are hidden to prevent empty plots.
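The column handling described above can be sketched without drawing anything. The helper below is hypothetical (`plan_distribution_grid` is not part of fast_eda): it mirrors the documented constraints with assertions and decides which kind of plot each column would get. The grid-capacity check and the row-major placement are assumptions of this sketch, not documented behaviour.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def plan_distribution_grid(df, c, r, col_ovr=None):
    """Hypothetical helper: map each selected column to a grid cell and plot kind."""
    # Assumed checks, modelled on the documented parameter constraints.
    assert not df.empty, "df must not be empty"
    cols = list(df.columns) if col_ovr is None else list(col_ovr)
    assert set(cols) <= set(df.columns), "col_ovr must be a subset of df.columns"
    assert len(cols) <= c * r, "grid is too small for the selected columns"

    plan = []
    for i, name in enumerate(cols):
        # Numeric columns get a histogram, everything else a bar plot.
        kind = "hist" if is_numeric_dtype(df[name]) else "bar"
        # Row-major placement is an assumption of this sketch.
        plan.append({"column": name, "row": i // c, "col": i % c, "kind": kind})
    return pd.DataFrame(plan)


df = pd.DataFrame({
    "numeric_col": [1, 2, 3, 4, 5],
    "string_col": ["a", "b", "a", "c", "b"],
})
plan = plan_distribution_grid(df, c=2, r=1)
print(plan)
```

Any grid cell that does not appear in `plan` corresponds to an unused subplot, which the real function hides.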

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({
...     'numeric_col': [1, 2, 3, 4, 5],
...     'string_col': ['a', 'b', 'a', 'c', 'b']
... })
>>> fig, axes = distribution_plots(df, c=2, r=1)

fast_eda.fast_eda.count_nulls(df)[source]

Count missing values in each column of the DataFrame.

This function calculates the number of missing (NaN) values in each column of the DataFrame, assisting in identifying columns that need cleaning or imputation.
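A minimal sketch of the same idea in plain pandas is shown below. `count_nulls_sketch` is a hypothetical stand-in written for illustration, and it assumes the real function is equivalent to `DataFrame.isna().sum()` with a type check in front.

```python
import pandas as pd


def count_nulls_sketch(df):
    """Hypothetical stand-in: count missing values per column."""
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame.")
    # isna() marks missing cells as True; summing counts them per column.
    return df.isna().sum()


df = pd.DataFrame({"A": [1, None, 3], "B": [None, 2, 3]})
nulls = count_nulls_sketch(df)
print(nulls)
```

The result is a Series indexed by column name, which makes it easy to filter, for example `nulls[nulls > 0]` to list only the columns that need attention.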

Parameters:

df (pandas.DataFrame) – The input DataFrame to be analyzed.

Returns:

A Series with column names as the index and the count of missing values in each column as the values.

Return type:

pandas.Series

Raises:

ValueError – If the input is not a pandas DataFrame.

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, None, 3], 'B': [None, 2, 3]})
>>> count_nulls(df)

fast_eda.fast_eda.correlation_matrix_viz(df)[source]

Generate a correlation matrix visualization for numeric columns in a DataFrame.

This function computes the Spearman correlation coefficients between all numeric columns in the provided DataFrame. The resulting correlation matrix is transformed into a long-form DataFrame suitable for visualization, and an interactive Altair scatter plot is created to display the correlations.

The visualization includes:

  • X-axis and Y-axis: The pair of features being compared.

  • Circle size: The magnitude of the absolute correlation value, indicating the strength of the relationship.

  • Color: The direction and strength of the correlation (positive or negative), represented using a diverging color scale.

Parameters:

df (pandas.DataFrame) – The input DataFrame containing numeric columns for correlation analysis.

Returns:

An interactive Altair chart visualizing the correlation matrix.

Return type:

alt.Chart

Notes

  • Self-correlations (diagonal values) are set to 0 to avoid cluttering the plot.

  • Non-numeric columns are ignored in the computation.
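The data preparation described above (Spearman correlation, zeroed diagonal, long-form reshape) can be sketched with pandas and NumPy alone. This is an illustration under assumptions, not the function's actual code: the column names `feature_x`, `feature_y`, `correlation`, and `abs_correlation` are hypothetical, and the Altair charting step is left out.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [2, 4, 6],
    "C": [5, 3, 1],
    "name": ["x", "y", "z"],  # illustrative non-numeric column, ignored below
})

# Spearman correlation over the numeric columns only.
corr = df.select_dtypes(include="number").corr(method="spearman")

# Zero out self-correlations on a copy of the underlying array.
values = corr.to_numpy(copy=True)
np.fill_diagonal(values, 0.0)
corr = pd.DataFrame(values, index=corr.index, columns=corr.columns)

# Reshape to long form: one row per (feature_x, feature_y) pair.
long_form = (
    corr.rename_axis("feature_x")
    .reset_index()
    .melt(id_vars="feature_x", var_name="feature_y", value_name="correlation")
)
# Absolute value drives circle size; the signed value drives color.
long_form["abs_correlation"] = long_form["correlation"].abs()
print(long_form)
```

A long-form table like this has one row per mark, which is the shape a chart needs to encode the two feature names on the axes and the correlation as size and color.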

Examples

>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 4, 6], 'C': [5, 3, 1]})
>>> correlation_matrix_viz(df)