eflow.data_analysis.null_analysis

Functions

display(*objs[, include, exclude, metadata, …])

Display a Python object in all frontends.

generate_meta_data(df, output_folder_path, …)

Creates files representing the shape and feature types of the dataframe.

missing_values_table(df)

Creates a pandas dataframe based on the missing data inside the

Classes

DataAnalysis(dataset_name[, overwrite_full_path])

All objects in data_analysis folder of eflow are related to this object.

DataFrameSnapshot([compare_shape, …])

Attempts to get a “snapshot” of a dataframe by extracting varying data of the pandas dataframe object; then generates a file to later compare in a set directory.

FeatureAnalysis(df_features[, …])

Analyzes the feature data of a pandas Dataframe object.

GRAPH_DEFAULTS

alias of eflow._hidden.constants.Enum

NullAnalysis(df_features[, dataset_sub_dir, …])

Analyzes a pandas DataFrame object for null data; creates visuals such as graphs and tables.

Exceptions

SnapshotMismatchError([error_message])

class NullAnalysis(df_features, dataset_sub_dir='', dataset_name='Default Dataset Name', overwrite_full_path=None, notebook_mode=False)[source]

Analyzes a pandas DataFrame object for null data; creates visuals such as graphs and tables.

feature_analysis_of_null_data(df, dataset_name, target_features=None, display_visuals=True, display_print=True, save_file=True, suppress_runtime_errors=True, aggregate_target_feature=True, selected_features=None, extra_tables=True, statistical_analysis_on_aggregates=True, nan_features=[])[source]

Runs all public methods that feature analysis uses to generate visualizations/insights, applied to an aggregation of the null data in a feature.

Note:

Pretty much my personal lazy button for running the entire object without specifying any method in particular.

Args:
df: pd.DataFrame

Pandas DataFrame object.

dataset_name: string

The dataset’s name; this creates a sub-directory in which your generated graphs are nested.

target_features: collection of strings or None

Feature names that exist both in the df_features passed at init and in the passed dataframe.

Note

If set to None, df_features will try to extract the target feature.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Whether or not to print the function’s embedded print statements.

save_file: bool

Whether or not to save the file.

suppress_runtime_errors: bool

If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.

extra_tables: bool

When handling two types of features, if set to True this will generate any extra tables that might be helpful.

Note

These graphics may create duplicates if you already applied an aggregation in ‘perform_analysis’.

statistical_analysis_on_aggregates: bool

If set to True, the function ‘statistical_analysis_on_aggregates’ will run, which aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.

aggregate_target_feature: bool

Aggregate the data of the target feature if the data is non-continuous.

Note

In the future I will have this also working with continuous data.

selected_features: collection object of features

Will only focus on these selected features and will ignore the other given features.

nan_features: collection of strings

Feature names that must contain nan data to aggregate on.

Raises:

Raises an error if an empty dataframe is passed to this function or if the same dataframe is passed to it.
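The aggregation idea described by nan_features, aggregate_target_feature and statistical_analysis_on_aggregates can be sketched with plain pandas. This is only an illustration under assumptions, not eflow's implementation: the toy columns (age, survived, fare) and the bin edges are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" is the feature with nulls, "survived" is the target.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 58.0, np.nan],
    "survived": [0, 1, 1, 1, 0, 0],
    "fare": [7.25, 71.3, 8.05, 53.1, 26.55, 8.46],
})

nan_feature = "age"
target_feature = "survived"

# Keep only the rows where the nan feature is actually null.
null_rows = df[df[nan_feature].isnull()]

# Discrete data: aggregate by counting each distinct value.
discrete_counts = null_rows[target_feature].value_counts().sort_index()

# Continuous data: bin the values first, then count per bin.
fare_bins = pd.cut(null_rows["fare"], bins=[0, 10, 50, 100])
binned_counts = fare_bins.value_counts().sort_index()
```

Here discrete_counts describes how the target is distributed among rows missing an age, and binned_counts does the same for a continuous column after labeling it into intervals.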

missing_values_table(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True)[source]

Creates/Saves a Pandas DataFrame object giving the percentage of the null data for the original DataFrame columns.

Args:
df: pd.DataFrame

Pandas DataFrame object.

dataset_name: string

The dataset’s name; this creates a sub-directory in which your generated graphs are nested.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Whether or not to print the function’s embedded print statements.

filename: string

If set to None, defaults to a pre-defined string; otherwise the given filename is used.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. This helps ensure that data generated in that directory is correctly associated with a dataframe.

suppress_runtime_errors: bool

If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
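As a rough picture of the kind of table this method describes, the sketch below computes per-column null percentages with plain pandas. The helper name null_percentage_table and the column labels are hypothetical; the actual layout produced by eflow may differ.

```python
import numpy as np
import pandas as pd


def null_percentage_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column null counts and percentages."""
    null_counts = df.isnull().sum()
    table = pd.DataFrame({
        "Missing Values": null_counts,
        "% of Total Values": 100 * null_counts / len(df),
    })
    # Keep only columns that actually have nulls, largest share first.
    table = table[table["Missing Values"] > 0]
    return table.sort_values("% of Total Values", ascending=False)


df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
table = null_percentage_table(df)
```

Column c has no nulls, so it is left out of the resulting table.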

perform_analysis(df, dataset_name, display_visuals=True, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, null_features_only=False)[source]

Performs all public methods of the NullAnalysis object except feature_analysis_of_null_data.

Args:
df: pd.DataFrame

Pandas DataFrame object.

dataset_name: string

The dataset’s name; this creates a sub-directory in which your generated graphs are nested.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Whether or not to print the function’s embedded print statements.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. This helps ensure that data generated in that directory is correctly associated with a dataframe.

suppress_runtime_errors: bool

If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.

null_features_only: bool

If set to True, only features containing null data are passed on to the visualizations.
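A call to perform_analysis might look like the sketch below, which uses only the parameter names shown in the signatures on this page. It rests on assumptions: the import path is taken from the module name at the top of this page, df_features is whatever feature-type object your own eflow setup already builds (its construction is not covered here), and the wrapper function run_null_analysis and the example dataframe are hypothetical. The wrapper is defined but not invoked.

```python
import numpy as np
import pandas as pd


def run_null_analysis(df, df_features, dataset_name="Example Dataset"):
    """Sketch of a full run; df_features must come from your own eflow setup."""
    # Imported lazily so this sketch can be loaded without eflow installed.
    from eflow.data_analysis.null_analysis import NullAnalysis

    analyzer = NullAnalysis(df_features,
                            dataset_name=dataset_name,
                            notebook_mode=False)
    analyzer.perform_analysis(
        df,
        dataset_name=dataset_name,
        display_visuals=False,    # do not display the visualizations
        save_file=True,           # save the generated files
        null_features_only=True,  # see the parameter description above
    )
    return analyzer


# Hypothetical dataframe with nulls in two of its three columns.
example_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "cabin": [None, "C85", None],
    "fare": [7.25, 71.28, 8.05],
})
```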

plot_null_bar_graph(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, null_features_only=False, figsize=(24, 10), fontsize=16, labels=None, log=False, color='#072F5F', inline=False, filter=False, n=0, p=0, sort=None)[source]
Desc (Taken from missingno):

Creates a bar graph visualization of the nullity of the given DataFrame, then pushes the image to the output folder.

Args:
df: pd.DataFrame

Pandas DataFrame object.

dataset_name: string

The dataset’s name; this creates a sub-directory in which your generated graphs are nested.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Whether or not to print the function’s embedded print statements.

filename: string

If set to None, defaults to a pre-defined string; otherwise the given filename is used.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. This helps ensure that data generated in that directory is correctly associated with a dataframe.

suppress_runtime_errors: bool

If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.

null_features_only: bool

If set to True, only features containing null data are passed on to the visualizations.

Please read the official documentation for more about the parameters: Link - https://github.com/ResidentMario/missingno

Note -

Changed the default color of the bar graph because I thought it was ugly.
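The numbers behind such a bar graph can be approximated with plain pandas. The sketch below assumes that null_features_only=True means keeping only columns with at least one null, and that each bar reflects the count of non-null entries in a column, which is my understanding of the missingno bar chart; the toy dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": [None, "C85", None, None],
    "fare": [7.25, 71.28, 8.05, 53.1],
})

# Restrict to columns holding at least one null
# (the assumed meaning of null_features_only=True).
null_only = df.loc[:, df.isnull().any()]

# Assumed bar height: the number of non-null entries per column.
non_null_counts = null_only.notnull().sum()
```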

plot_null_dendrogram_graph(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, null_features_only=False, method='average', filter=None, n=0, p=0, orientation=None, figsize=(24, 10), fontsize=16, inline=False)[source]
Desc (Taken from missingno):

Fits a scipy hierarchical clustering algorithm to the given DataFrame’s variables and visualizes the results as a scipy dendrogram.

Args:
df: pd.DataFrame

Pandas DataFrame object.

dataset_name: string

The dataset’s name; this creates a sub-directory in which your generated graphs are nested.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Whether or not to print the function’s embedded print statements.

filename: string

If set to None, defaults to a pre-defined string; otherwise the given filename is used.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. This helps ensure that data generated in that directory is correctly associated with a dataframe.

suppress_runtime_errors: bool

If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.

null_features_only: bool

If set to True, only features containing null data are passed on to the visualizations.

Please read the official documentation for more about the parameters: Link: https://github.com/ResidentMario/missingno
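To make the description above concrete, the sketch below clusters features by their null patterns using scipy directly. It is an approximation under assumptions, not the code eflow or missingno runs: the toy dataframe is hypothetical, and no figure is drawn because no_plot=True is passed.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

df = pd.DataFrame({
    "age":   [22.0, np.nan, 35.0, np.nan, 40.0],
    "cabin": ["A1", None, "B2", None, "C3"],
    "fare":  [7.25, 71.28, np.nan, 53.1, 8.05],
})

# One row per feature, one column per observation; 1 marks a null entry.
nullity = df.isnull().astype(int).to_numpy().T

# Average-linkage clustering, mirroring the method='average' default
# shown in the signature above.
linkage_matrix = hierarchy.linkage(nullity, method="average")

# no_plot=True returns the tree layout without drawing anything.
tree = hierarchy.dendrogram(linkage_matrix,
                            labels=list(df.columns),
                            no_plot=True)
```

Features whose nulls occur in exactly the same rows (age and cabin here) merge at distance zero, which is what a dendrogram of nullity makes visible.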

plot_null_heatmap_graph(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, inline=False, filter=None, n=0, p=0, sort=None, figsize=(24, 10), fontsize=16, labels=True, cmap='RdBu', vmin=-1, vmax=1, cbar=True)[source]
Desc (Taken from missingno):

Presents a seaborn heatmap visualization of nullity correlation in the given DataFrame.

Args:
df: pd.DataFrame

Pandas DataFrame object.

dataset_name: string

The dataset’s name; this creates a sub-directory in which your generated graphs are nested.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Whether or not to print the function’s embedded print statements.

filename: string

If set to None, defaults to a pre-defined string; otherwise the given filename is used.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. This helps ensure that data generated in that directory is correctly associated with a dataframe.

suppress_runtime_errors: bool

If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.

Please read the official documentation for more about the parameters: Link: https://github.com/ResidentMario/missingno

Note:

Changed the default color of the bar graph because I thought it was ugly.
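Nullity correlation, the quantity the heatmap displays, can be sketched with plain pandas as the correlation between per-column null indicators. This is an illustration under assumptions rather than the library's own code; the toy dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":   [22.0, np.nan, 35.0, np.nan, 40.0],
    "cabin": ["A1", None, "B2", None, "C3"],
    "fare":  [7.25, 71.28, np.nan, 53.1, np.nan],
    "id":    [1, 2, 3, 4, 5],
})

nullity = df.isnull()

# Columns that are never null (or always null) have zero variance,
# so a correlation is undefined for them; drop them first.
varying = nullity.loc[:, nullity.nunique() > 1]

# Pearson correlation between the null indicators; values lie in [-1, 1].
nullity_corr = varying.astype(int).corr()
```

A value near 1 means two features tend to be null in the same rows, and a negative value means one tends to be present where the other is missing. The vmin=-1 and vmax=1 defaults in the signature match this range.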

plot_null_matrix_graph(df, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, null_features_only=False, filter=None, n=0, p=0, sort=None, figsize=(24, 10), width_ratios=(15, 1), color=(0.027, 0.184, 0.373), fontsize=16, labels=None, sparkline=True, inline=False, freq=None)[source]
Desc (Taken from missingno):

Creates a matrix visualization of the nullity of the given DataFrame, then pushes the image to the output folder.

Args:
df: pd.DataFrame

Pandas DataFrame object.

dataset_name: string

The dataset’s name; this creates a sub-directory in which your generated graphs are nested.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Whether or not to print the function’s embedded print statements.

save_file: bool

Whether or not to save the file.

filename: string

If set to None, defaults to a pre-defined string; otherwise the given filename is used.

sub_dir: string

Specify the sub-directory to append to the pre-defined folder path.

dataframe_snapshot: bool

Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. This helps ensure that data generated in that directory is correctly associated with a dataframe.

suppress_runtime_errors: bool

If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.

null_features_only: bool

If set to True, only features containing null data are passed on to the visualizations.

Please read the official documentation for more about the parameters: Link: https://github.com/ResidentMario/missingno

Note:

Changed the default color of the matrix graph because I thought it was ugly.
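The data underlying a nullity matrix can be sketched with plain pandas: a boolean grid of filled cells plus a per-row completeness count, which to my understanding is what the sparkline on the side of the missingno matrix summarizes. The toy dataframe is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": [None, "C85", None, None],
    "fare": [7.25, 71.28, 8.05, np.nan],
})

# True where a cell holds data, False where it is null.
filled = df.notnull()

# Number of non-null cells in each row.
row_completeness = filled.sum(axis=1)

# Index labels of the most and least complete rows (first match on ties).
most_complete = row_completeness.idxmax()
least_complete = row_completeness.idxmin()
```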