eflow.data_analysis.null_analysis

Functions

`display`(*objs[, include, exclude, metadata, …])	Display a Python object in all frontends.
`generate_meta_data`(df, output_folder_path, …)	Creates files representing the shape and feature types of the dataframe.
`missing_values_table`(df)	Creates a pandas dataframe based on the missing data inside the

Classes

`DataAnalysis`(dataset_name[, overwrite_full_path])	All objects in data_analysis folder of eflow are related to this object.
`DataFrameSnapshot`([compare_shape, …])	Attempts to get a “snapshot” of a dataframe by extracting varying data of the pandas dataframe object; then generates a file to later compare in a set directory.
`FeatureAnalysis`(df_features[, …])	Analyzes the feature data of a pandas Dataframe object.
`GRAPH_DEFAULTS`	alias of `eflow._hidden.constants.Enum`
`NullAnalysis`(df_features[, dataset_sub_dir, …])	Analyzes a pandas dataframe’s object for null data; creates visuals like graphs and tables.

Exceptions

SnapshotMismatchError([error_message])

class NullAnalysis(df_features, dataset_sub_dir='', dataset_name='Default Dataset Name', overwrite_full_path=None, notebook_mode=False)

Analyzes a pandas dataframe’s object for null data; creates visuals like graphs and tables.

feature_analysis_of_null_data(df, dataset_name, target_features=None, display_visuals=True, display_print=True, save_file=True, suppress_runtime_errors=True, aggregate_target_feature=True, selected_features=None, extra_tables=True, statistical_analysis_on_aggregates=True, nan_features=[])

Runs all the public methods that feature analysis uses to generate visualizations and insights, applied to data aggregated by the null values of a feature.

Note:

Pretty much my personal lazy button for running the entire object without specifying any method in particular.

Args:

df: pd.DataFrame

Pandas DataFrame object.

dataset_name: string

The dataset’s name; this creates a sub-directory in which the generated graphs are nested.

target_features: collection of strings or None

Feature names that exist in both the df_features passed at init and the passed dataframe.

Note: If left as ‘None’, df_features will try to extract the target feature.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Whether or not to print the function’s embedded print statements.

save_file: bool

Whether or not to save the file.

suppress_runtime_errors: bool

If set to true, suppresses any runtime errors raised while generating graphs so the program can keep running.

extra_tables: bool

When handling two types of features, setting this to true generates any extra tables that might be helpful.

Note: These graphics may create duplicates if you already applied an aggregation in ‘perform_analysis’.

statistical_analysis_on_aggregates: bool

If set to true, the function ‘statistical_analysis_on_aggregates’ will run, which aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.

aggregate_target_feature: bool

Aggregates the data of the target feature if that data is non-continuous.

Note: In the future I will have this also working with continuous data.

selected_features: collection of features

Focuses only on these selected features and ignores the other given features.

nan_features: collection of strings

Feature names that must contain NaN data to aggregate on.

Raises:

Raises an error if an empty dataframe is passed to this function or if the same dataframe is passed to it.
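As a rough illustration of what aggregating on the null data of a feature can look like, the sketch below uses plain pandas rather than eflow itself. The helper `summarize_target_by_nullity` and the column names `age` and `survived` are hypothetical; eflow's internal implementation may differ.

```python
import numpy as np
import pandas as pd


def summarize_target_by_nullity(df, nan_feature, target_feature):
    """Hypothetical helper: summarize the target split by nullity of one feature."""
    # Label each row by whether the chosen feature is missing.
    nullity = pd.Series(
        np.where(df[nan_feature].isnull(), "null", "not_null"),
        index=df.index,
        name=f"{nan_feature}_nullity",
    )
    # Aggregate the target feature within each nullity group.
    return df.groupby(nullity)[target_feature].agg(["count", "mean"])


df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 58.0],
    "survived": [1, 0, 1, 1, 0],
})
summary = summarize_target_by_nullity(df, "age", "survived")
print(summary)
```

Comparing the two rows of `summary` gives a first hint of whether missingness in a feature is related to the target.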

missing_values_table(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True)

Creates and optionally saves a pandas DataFrame giving the percentage of null data in each column of the original DataFrame.

Args:

df: pd.DataFrame: Pandas DataFrame object.
dataset_name: string: The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to ‘None’, defaults to a pre-defined string; otherwise the given filename is used.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to true, suppresses any runtime errors raised while generating graphs so the program can keep running.
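To make the idea of a missing-values table concrete, here is a minimal pandas-only sketch. It is not eflow's implementation: the helper name `sketch_missing_values_table` and the output column labels are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def sketch_missing_values_table(df):
    """Hypothetical helper: null counts and percentages per column."""
    null_counts = df.isnull().sum()
    table = pd.DataFrame({
        "Missing Values": null_counts,
        "% of Total Values": 100 * null_counts / len(df),
    })
    # Keep only columns with at least one null, most-missing first.
    table = table[table["Missing Values"] > 0]
    return table.sort_values("% of Total Values", ascending=False)


df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, "x", "y", "z"],
    "c": [1, 2, 3, 4],
})
table = sketch_missing_values_table(df)
print(table)
```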

perform_analysis(df, dataset_name, display_visuals=True, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, null_features_only=False)

Performs all public methods of the NullAnalysis object except feature_analysis_of_null_data.

Args:

df: pd.DataFrame: Pandas DataFrame object.
dataset_name: string: The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to true, suppresses any runtime errors raised while generating graphs so the program can keep running.
null_features_only: bool: If set to true, only features containing null data are passed on to the visualizations.
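The effect of `null_features_only` can be pictured with a small pandas sketch. This is an assumption about the intent of the flag (restricting the dataframe to columns that contain at least one null value), not a copy of eflow's code; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1, 2, 3],
    "c": [None, "x", None],
})

# Keep only the columns that contain at least one null value.
null_features = [col for col in df.columns if df[col].isnull().any()]
df_null_only = df[null_features]
print(list(df_null_only.columns))
```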

plot_null_bar_graph(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, null_features_only=False, figsize=(24, 10), fontsize=16, labels=None, log=False, color='#072F5F', inline=False, filter=False, n=0, p=0, sort=None)

Desc (Taken from missingno):

Creates a bar graph visualization of the nullity of the given DataFrame, then saves the image to the output folder.

Args:

df: pd.DataFrame: Pandas DataFrame object.
dataset_name: string: The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to ‘None’, defaults to a pre-defined string; otherwise the given filename is used.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to true, suppresses any runtime errors raised while generating graphs so the program can keep running.
null_features_only: bool: If set to true, only features containing null data are passed on to the visualizations.

Please read the official documentation for more about the parameters: https://github.com/ResidentMario/missingno

Note: Changed the default color of the bar graph because I thought it was ugly.
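The `filter`, `n`, and `p` parameters are passed through to missingno and are not described here. The sketch below shows one plausible reading of them, under the assumption that `filter` selects the most ('top') or least ('bottom') complete columns, `p` is a completeness ratio cutoff, and `n` caps the number of columns. The helper `sketch_nullity_filter` is hypothetical; check the missingno documentation for the authoritative behavior.

```python
import numpy as np
import pandas as pd


def sketch_nullity_filter(df, filter=None, p=0, n=0):
    """Hypothetical helper: keep the most ('top') or least ('bottom') complete columns."""
    if filter not in ("top", "bottom"):
        return df  # No filtering requested.

    if p:
        completeness = df.count() / len(df)  # Fraction of non-null values per column.
        mask = completeness >= p if filter == "top" else completeness <= p
        df = df.loc[:, mask]
    if n:
        counts = df.count()
        chosen = counts.nlargest(n) if filter == "top" else counts.nsmallest(n)
        # Preserve the original column order.
        df = df.loc[:, [col for col in df.columns if col in chosen.index]]
    return df


df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})
top_half = sketch_nullity_filter(df_demo, filter="top", p=0.5)
least_complete = sketch_nullity_filter(df_demo, filter="bottom", n=1)
unfiltered = sketch_nullity_filter(df_demo, filter=False)
```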

plot_null_dendrogram_graph(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, null_features_only=False, method='average', filter=None, n=0, p=0, orientation=None, figsize=(24, 10), fontsize=16, inline=False)

Desc (Taken from missingno):

Fits a scipy hierarchical clustering algorithm to the given DataFrame’s variables and visualizes the results as a scipy dendrogram.

Args:

df: pd.DataFrame: Pandas DataFrame object.
dataset_name: string: The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to ‘None’, defaults to a pre-defined string; otherwise the given filename is used.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to true, suppresses any runtime errors raised while generating graphs so the program can keep running.
null_features_only: bool: If set to true, only features containing null data are passed on to the visualizations.

Please read the official documentation for more about the parameters: https://github.com/ResidentMario/missingno
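The clustering step described above can be sketched directly with scipy, without eflow or missingno. This is a simplified illustration that assumes each feature is represented by its 0/1 null pattern and clustered with the default `method='average'`; the dataframe is made up, and the real plotting code may prepare the data differently.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0],  # Same null pattern as "a".
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan],
    "d": [1.0, 2.0, 3.0, 4.0, np.nan],
})

# One row per feature, one column per record: 1 where the value is null.
nullity = df.isnull().astype(int).values.T

# Cluster features by how similar their null patterns are.
linkage = hierarchy.linkage(nullity, method="average")

# no_plot=True returns the tree layout without drawing anything.
tree = hierarchy.dendrogram(linkage, labels=list(df.columns), no_plot=True)
print(tree["ivl"])
```

Features whose null patterns are identical, such as `a` and `b` here, merge at distance zero, which is what makes the dendrogram useful for spotting columns that go missing together.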

plot_null_heatmap_graph(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, inline=False, filter=None, n=0, p=0, sort=None, figsize=(24, 10), fontsize=16, labels=True, cmap='RdBu', vmin=-1, vmax=1, cbar=True)

Desc (Taken from missingno):

Presents a seaborn heatmap visualization of nullity correlation in the given DataFrame.

Args:

df: pd.DataFrame: Pandas DataFrame object.
dataset_name: string: The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to ‘None’, defaults to a pre-defined string; otherwise the given filename is used.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to true, suppresses any runtime errors raised while generating graphs so the program can keep running.

Please read the official documentation for more about the parameters: https://github.com/ResidentMario/missingno

Note: Changed the default color of the bar graph because I thought it was ugly.
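Nullity correlation, the quantity the heatmap displays, can be approximated in a few lines of pandas. The sketch below is an illustration under the assumption that the correlation is computed between the 0/1 null indicators of each column; the dataframe is invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan],  # Null exactly where "a" is null.
    "c": [np.nan, 2.0, np.nan, 4.0],  # Null exactly where "a" is present.
    "d": [1.0, 2.0, 3.0, 4.0],        # Never null.
})

nullity = df.isnull()

# Columns that are always or never null have no variance, so their
# correlation is undefined; drop them before correlating.
varying = nullity.loc[:, nullity.nunique() > 1]
corr = varying.astype(int).corr()
print(corr)
```

Values range from -1 (one feature is null exactly when the other is present) to 1 (the features are always null together), which matches the `vmin=-1` and `vmax=1` defaults in the signature.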

plot_null_matrix_graph(df, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, null_features_only=False, filter=None, n=0, p=0, sort=None, figsize=(24, 10), width_ratios=(15, 1), color=(0.027, 0.184, 0.373), fontsize=16, labels=None, sparkline=True, inline=False, freq=None)

Desc (Taken from missingno):

Creates a matrix visualization of the nullity of the given DataFrame, then saves the image to the output folder.

Args:

df: pd.DataFrame: Pandas DataFrame object.
dataset_name: string: The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
save_file: bool: Whether or not to save the file.
filename: string: If set to ‘None’, defaults to a pre-defined string; otherwise the given filename is used.
sub_dir: string: The sub-directory to append to the pre-defined folder path.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to true, suppresses any runtime errors raised while generating graphs so the program can keep running.
null_features_only: bool: If set to true, only features containing null data are passed on to the visualizations.

Please read the official documentation for more about the parameters: https://github.com/ResidentMario/missingno

Note: Changed the default color of the matrix graph because I thought it was ugly.
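The default `color=(0.027, 0.184, 0.373)` in the signature above appears to be the same dark blue as the `'#072F5F'` default of `plot_null_bar_graph`, written as RGB fractions instead of a hex string. A quick standard-library check, using a hypothetical helper `hex_to_rgb_fractions`:

```python
def hex_to_rgb_fractions(hex_color):
    """Convert '#RRGGBB' to a tuple of floats in [0, 1], rounded to 3 places."""
    hex_color = hex_color.lstrip("#")
    # Read each two-character channel as base-16 and scale by 255.
    return tuple(round(int(hex_color[i:i + 2], 16) / 255, 3) for i in (0, 2, 4))


matrix_default = hex_to_rgb_fractions("#072F5F")
print(matrix_default)  # (0.027, 0.184, 0.373)
```

The same helper can be used to pass a custom hex color to `plot_null_matrix_graph` in the tuple form its `color` default uses.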