eflow.data_analysis.null_analysis
Functions
|
Display a Python object in all frontends. |
|
Creates files representing the shape and feature types of the dataframe. |
|
Creates a pandas dataframe based on the missing data inside the |
Classes
|
All objects in data_analysis folder of eflow are related to this object. |
|
Attempts to get a “snapshot” of a dataframe by extracting varying data of the pandas dataframe object; then generates a file to later compare in a set directory. |
|
Analyzes the feature data of a pandas Dataframe object. |
|
alias of |
|
Analyzes a pandas dataframe’s object for null data; creates visuals like graphs and tables. |
Exceptions
|
-
class
NullAnalysis
(df_features, dataset_sub_dir='', dataset_name='Default Dataset Name', overwrite_full_path=None, notebook_mode=False)
Analyzes a pandas DataFrame for null data; creates visuals such as graphs and tables.
-
feature_analysis_of_null_data
(df, dataset_name, target_features=None, display_visuals=True, display_print=True, save_file=True, suppress_runtime_errors=True, aggregate_target_feature=True, selected_features=None, extra_tables=True, statistical_analysis_on_aggregates=True, nan_features=[])
Runs every public method that feature analysis uses to generate visualizations and insights, applied to an aggregation of the null data in a feature.
- Note:
Pretty much my personal lazy button for running the entire object without specifying any method in particular.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- dataset_name: string
The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
- target_features: collection of string or None
Target feature name(s) that exist both in the init df_features and in the passed dataframe.
- Note
If initialized to ‘None’, df_features will try to extract the target feature.
- display_visuals: bool
Whether to display visualizations.
- display_print: bool
Whether to print the function’s embedded print statements.
- save_file: bool
Whether to save the file.
- suppress_runtime_errors: bool
If True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- extra_tables: bool
When handling two types of features, if set to True this will generate any extra tables that might be helpful.
- Note
These graphics may create duplicates if you already applied an aggregation in ‘perform_analysis’.
- statistical_analysis_on_aggregates: bool
If True, the function ‘statistical_analysis_on_aggregates’ will run; it aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.
- aggregate_target_feature: bool
Aggregate the data of the target feature if the data is non-continuous data.
- Note
In the future I will have this also working with continuous data.
- selected_features: collection object of features
Focuses only on these selected features and ignores the other given features.
- nan_features: collection of strings
Feature names that must contain nan data to aggregate on.
- Raises:
An error is raised if an empty dataframe is passed to this function or if the same dataframe is passed to it.
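One way to picture "an aggregation of null data in a feature" is to select the rows where a feature is missing and then look at the target feature inside that subset. The sketch below uses plain pandas only; the column names and data are hypothetical, and it is not a reproduction of eflow's internal logic.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "age" plays the role of a nan feature,
# "survived" the role of the target feature.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 54.0, np.nan],
    "survived": [0, 1, 1, 1, 0, 0],
})

# Keep only the rows where the nan feature is actually missing.
null_rows = df[df["age"].isnull()]

# Distribution of the target feature within that null subset.
target_counts_when_null = null_rows["survived"].value_counts().sort_index()
print(target_counts_when_null)
```

Comparing this distribution with the one over the full dataframe hints at whether missingness in a feature is related to the target.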
-
missing_values_table
(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True)
Creates and saves a pandas DataFrame giving the percentage of null data in each column of the original DataFrame.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- dataset_name: string
The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
- display_visuals: bool
Whether to display visualizations.
- display_print: bool
Whether to print the function’s embedded print statements.
- filename: string
Name of the output file; if set to ‘None’, a pre-defined string is used.
- save_file: bool
Whether to save the file.
- dataframe_snapshot: bool
Whether to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If True, suppresses any runtime errors raised while generating graphs so the program can keep running.
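For intuition, a table of null percentages per column can be built with plain pandas as below. This is a minimal sketch under assumptions: the data and the column labels "Missing Values" and "% of Total Values" are hypothetical, and the table eflow actually produces may be laid out differently.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe with nulls in two of its three columns.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# Count nulls per column and express them as a share of all rows.
missing_counts = df.isnull().sum()
missing_table = pd.DataFrame({
    "Missing Values": missing_counts,
    "% of Total Values": 100 * missing_counts / len(df),
})

# Keep only columns that contain nulls, most incomplete first.
missing_table = (
    missing_table[missing_table["Missing Values"] > 0]
    .sort_values("% of Total Values", ascending=False)
)
print(missing_table)
```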
-
perform_analysis
(df, dataset_name, display_visuals=True, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, null_features_only=False)
Performs all public methods of the NullAnalysis object except feature_analysis_of_null_data.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- dataset_name: string
The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
- display_visuals: bool
Whether to display visualizations.
- display_print: bool
Whether to print the function’s embedded print statements.
- save_file: bool
Whether to save the file.
- dataframe_snapshot: bool
Whether to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- null_features_only: bool
If True, only the features that contain nulls are passed on to the visualizations.
-
plot_null_bar_graph
(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, null_features_only=False, figsize=(24, 10), fontsize=16, labels=None, log=False, color='#072F5F', inline=False, filter=False, n=0, p=0, sort=None)
- Desc (Taken from missingno):
A bar graph visualization of the nullity of the given DataFrame; the image is then pushed to the output folder.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- dataset_name: string
The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
- display_visuals: bool
Whether to display visualizations.
- display_print: bool
Whether to print the function’s embedded print statements.
- filename: string
Name of the output file; if set to ‘None’, a pre-defined string is used.
- save_file: bool
Whether to save the file.
- dataframe_snapshot: bool
Whether to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- null_features_only: bool
If True, only the features that contain nulls are passed on to the visualizations.
Please read the official documentation for more about the parameters: Link - https://github.com/ResidentMario/missingno
- Note -
Changed the default color of the bar graph because I thought it was ugly.
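The effect of null_features_only can be approximated in plain pandas by dropping every column that has no missing values before plotting. The snippet below is only a sketch of that idea with hypothetical data; it does not call eflow or missingno, and the filtering eflow performs internally may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataframe; column "b" has no nulls at all.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1, 2, 3],
    "c": [np.nan, np.nan, 1.0],
})

# Boolean mask over columns: True where the column holds at least one null.
has_nulls = df.isnull().any()

# Keep only the features that contain nulls.
df_null_only = df.loc[:, has_nulls]

# Per-feature null counts, the kind of quantity a nullity bar chart summarizes.
null_counts = df_null_only.isnull().sum()
print(null_counts)
```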
-
plot_null_dendrogram_graph
(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, null_features_only=False, method='average', filter=None, n=0, p=0, orientation=None, figsize=(24, 10), fontsize=16, inline=False)
- Desc (Taken from missingno):
Fits a scipy hierarchical clustering algorithm to the given DataFrame’s variables and visualizes the results as a scipy dendrogram.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- dataset_name: string
The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
- display_visuals: bool
Whether to display visualizations.
- display_print: bool
Whether to print the function’s embedded print statements.
- filename: string
Name of the output file; if set to ‘None’, a pre-defined string is used.
- save_file: bool
Whether to save the file.
- dataframe_snapshot: bool
Whether to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- null_features_only: bool
If True, only the features that contain nulls are passed on to the visualizations.
Please read the official documentation for more about the parameters: Link: https://github.com/ResidentMario/missingno
-
plot_null_heatmap_graph
(df, dataset_name, display_visuals=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, display_print=True, inline=False, filter=None, n=0, p=0, sort=None, figsize=(24, 10), fontsize=16, labels=True, cmap='RdBu', vmin=-1, vmax=1, cbar=True)
- Desc (Taken from missingno):
Presents a seaborn heatmap visualization of nullity correlation in the given DataFrame.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- dataset_name: string
The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
- display_visuals: bool
Whether to display visualizations.
- display_print: bool
Whether to print the function’s embedded print statements.
- filename: string
Name of the output file; if set to ‘None’, a pre-defined string is used.
- save_file: bool
Whether to save the file.
- dataframe_snapshot: bool
Whether to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If True, suppresses any runtime errors raised while generating graphs so the program can keep running.
Please read the official documentation for more about the parameters: Link: https://github.com/ResidentMario/missingno
- Note:
Changed the default color of the bar graph because I thought it was ugly.
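Nullity correlation measures how strongly the presence or absence of one feature tracks that of another, on a scale from -1 to 1 (matching the vmin and vmax defaults above). A minimal pandas sketch of that quantity, using hypothetical data and without calling eflow or missingno, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a" and "c" are missing on the same rows,
# "b" is missing exactly where "a" is present, "d" is never missing.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [5.0, np.nan, 7.0, np.nan],
    "d": [1, 2, 3, 4],
})

# 1 where a value is missing, 0 where it is present.
nullity = df.isnull().astype(int)

# Columns that are never (or always) null have zero variance and no
# defined correlation, so they are dropped first.
nullity = nullity.loc[:, nullity.var() > 0]

# Pairwise Pearson correlation of the missingness indicators.
nullity_corr = nullity.corr()
print(nullity_corr)
```

A value near 1 means two features tend to be missing together; a value near -1 means one tends to be present when the other is missing.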
-
plot_null_matrix_graph
(df, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, null_features_only=False, filter=None, n=0, p=0, sort=None, figsize=(24, 10), width_ratios=(15, 1), color=(0.027, 0.184, 0.373), fontsize=16, labels=None, sparkline=True, inline=False, freq=None)
- Desc (Taken from missingno):
A matrix visualization of the nullity of the given DataFrame; the image is then pushed to the output folder.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- dataset_name: string
The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
- display_visuals: bool
Whether to display visualizations.
- display_print: bool
Whether to print the function’s embedded print statements.
- save_file: bool
Whether to save the file.
- filename: string
Name of the output file; if set to ‘None’, a pre-defined string is used.
- sub_dir: string
Specify the sub-directory to append to the pre-defined folder path.
- dataframe_snapshot: bool
Whether to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- null_features_only: bool
If True, only the features that contain nulls are passed on to the visualizations.
Please read the official documentation for more about the parameters: Link: https://github.com/ResidentMario/missingno
- Note:
Changed the default color of the bar graph because I thought it was ugly.
-