FeatureAnalysis

from eflow.data_analysis.feature_analysis import FeatureAnalysis

class FeatureAnalysis(df_features, dataset_sub_dir='', dataset_name='', overwrite_full_path=None, notebook_mode=False)

Analyzes the feature data of a pandas DataFrame object. (Null data is ignored when displaying data, and 2D graphics are created with 2 features. In the future I might add 3D graphics with 3 features.)

analyze_feature(df, feature_name, dataset_name, target_feature=None, display_visuals=True, display_print=True, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), extra_tables=True)

Generates all graphics for the given feature and its relationship to the target feature.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
target_feature: string: Graphics involving this feature and the main feature ‘feature_name’ will be created.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Saves the file if set to True; doesn’t if set to False.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.
extra_tables: bool: When handling two types of features, if set to True this will generate any extra tables that might be helpful.

Note: These graphics may create duplicates if you already applied an aggregation in ‘perform_analysis’.

Raises:

Raises error if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
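
The snapshot check behind dataframe_snapshot can be pictured with a minimal sketch. This is not eflow’s implementation: the function name `check_snapshot`, the JSON layout, and the decision to record only column names and dtype strings are all assumptions made for illustration.

```python
import json
import os
import tempfile


def check_snapshot(snapshot_path, column_dtypes):
    """Write a snapshot on first use; afterwards raise if it no longer matches.

    `column_dtypes` is a plain dict of column name -> dtype string, a
    hypothetical stand-in for whatever a real snapshot records.
    """
    if not os.path.exists(snapshot_path):
        with open(snapshot_path, "w") as handle:
            json.dump(column_dtypes, handle, sort_keys=True)
        return True

    with open(snapshot_path) as handle:
        saved = json.load(handle)
    if saved != column_dtypes:
        raise ValueError("Snapshot does not match the given dataframe.")
    return True


# Demo in a throwaway directory: the first call writes, the second compares.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "snapshot.json")
first_run = check_snapshot(demo_path, {"age": "int64", "sex": "object"})
second_run = check_snapshot(demo_path, {"age": "int64", "sex": "object"})
```

Passing a dataframe whose columns or dtypes have changed to the same directory would then fail loudly instead of silently mixing outputs from two different datasets.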

descr_table(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True)

Creates and saves a pandas dataframe describing a feature’s numerical data: standard deviation, mean, Q1-Q5, median, variance, etc.

Note
Creates a png of the table.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to None, defaults to a pre-defined string; otherwise the given filename is used.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Saves the file if set to True; doesn’t if set to False.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
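
As a rough picture of what such a description table contains, the sketch below summarizes one feature with Python’s standard `statistics` module, skipping nulls and raising when nothing is left. The function name `describe_feature` and the chosen set of statistics are assumptions; the real method works on a pandas DataFrame and may compute a different set.

```python
import statistics


def describe_feature(values):
    """Summarize the non-null numeric values of one feature."""
    clean = [v for v in values if v is not None]
    if not clean:
        raise ValueError("Feature data contains only nulls.")
    if len(clean) < 2:
        raise ValueError("Need at least two non-null values.")
    # Quartile cut points; the middle one equals the median.
    q1, _, q3 = statistics.quantiles(clean, n=4)
    return {
        "count": len(clean),
        "mean": statistics.mean(clean),
        "median": statistics.median(clean),
        "std": statistics.stdev(clean),
        "var": statistics.variance(clean),
        "min": min(clean),
        "q1": q1,
        "q3": q3,
        "max": max(clean),
    }


summary = describe_feature([1, 2, 3, 4, None])
```

Note that `statistics.stdev` and `statistics.variance` are the sample (n - 1) versions, which is also the pandas default.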

group_by_feature_value_count_table(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True)

Creates/Saves a pandas dataframe of features and their found types in the dataframe.

Note
Creates a png of the table.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
other_feature_name: string: Feature to compare to.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to None, defaults to a pre-defined string; otherwise the given filename is used.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Saves the file if set to True; doesn’t if set to False.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
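
Going by the method name, the table plausibly holds the value counts of ‘feature_name’ within each group of ‘other_feature_name’. That reading is an assumption, as are the helper name `group_value_counts` and the list-of-dicts input used in place of a DataFrame in this minimal sketch.

```python
from collections import Counter, defaultdict


def group_value_counts(rows, feature_name, other_feature_name):
    """Count values of `feature_name` within each group of `other_feature_name`.

    Rows where either value is None are skipped.
    """
    table = defaultdict(Counter)
    for row in rows:
        group = row.get(other_feature_name)
        value = row.get(feature_name)
        if group is None or value is None:
            continue
        table[group][value] += 1
    return {group: dict(counts) for group, counts in table.items()}


rows = [
    {"sex": "f", "survived": 1},
    {"sex": "f", "survived": 1},
    {"sex": "m", "survived": 0},
    {"sex": "m", "survived": 1},
    {"sex": None, "survived": 0},  # skipped: null group
]
counts = group_value_counts(rows, "survived", "sex")
```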

perform_analysis(df, dataset_name, target_features=None, display_visuals=True, display_print=True, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), aggregate_target_feature=True, selected_features=None, extra_tables=True, statistical_analysis_on_aggregates=True)

Performs all public methods that generate visualizations/insights about the data.

Note:

This is pretty much my personal lazy button: it runs the entire object without you having to call any method in particular.

Args:

df: pd.DataFrame

Pandas DataFrame object.

dataset_name: string

The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.

target_features: collection of strings or None

Feature names that exist in both the df_features passed at init and the passed dataframe.

Note: If set to None, the target feature is extracted from df_features where possible.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Whether or not to print the function’s embedded print statements.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.

suppress_runtime_errors: bool

If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.

extra_tables: bool

When handling two types of features, if set to True this will generate any extra tables that might be helpful.

Note: These graphics may create duplicates if you already applied an aggregation in ‘perform_analysis’.

aggregate_target_feature: bool

Aggregate the data of the target feature if the data is non-continuous.

Note: In the future I will have this also working with continuous data.

selected_features: collection object of features

Only these selected features are analyzed; the other given features are ignored.

statistical_analysis_on_aggregates: bool

If set to True, the function ‘statistical_analysis_on_aggregates’ will run; it aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.

Raises:

Raises an error if an empty dataframe is passed to this function, or if the same dataframe is passed to it.
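
The effect of suppress_runtime_errors when many graphs are generated in one call can be sketched as a loop that records failures and carries on. The helper `run_all`, the step names, and the choice to catch every `Exception` are assumptions for illustration; the source does not say which exception types are actually suppressed.

```python
def run_all(steps, suppress_runtime_errors=True):
    """Run each (name, callable) step; optionally keep going after failures."""
    failures = []
    for name, step in steps:
        try:
            step()
        except Exception as error:
            if not suppress_runtime_errors:
                raise
            # Remember what failed, then continue with the next step.
            failures.append((name, repr(error)))
    return failures


completed = []
steps = [
    ("count_graph", lambda: completed.append("count_graph")),
    ("pie_graph", lambda: 1 / 0),  # deliberately fails
    ("value_counts", lambda: completed.append("value_counts")),
]
failures = run_all(steps)
```

With suppression switched off, the first failing step would stop the whole run, which is the better setting while debugging a single graph.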

plot_count_graph(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), flip_axis=False, palette='PuBu')

Display a seaborn barplot of a feature’s value counts, colored by rank, and save the graph in the correct directory structure.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: Name to give the file.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.
figsize: tuple: Size of the given plot.
flip_axis: bool: Flip the plotting axis from x to y if set to True.
palette: dict or string: String representation of the seaborn color palette used for the ranking.

Credit to seaborn’s author: Michael Waskom Git username: mwaskom Link: http://tinyurl.com/y4pzrgcf

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
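
The “color ranking” idea can be illustrated without any plotting: count the values, order them by frequency, and give each a rank that could then index into a sequential palette such as ‘PuBu’. The helper `ranked_value_counts` and its tie-handling rule are assumptions, not eflow’s actual logic.

```python
from collections import Counter


def ranked_value_counts(values):
    """Return (value, count, rank) tuples, most frequent first; nulls skipped.

    Rank 0 is the most frequent value; tied counts share the rank of the
    first value seen at that count.
    """
    counts = Counter(v for v in values if v is not None)
    ranked = []
    previous_count, rank = None, -1
    for position, (value, count) in enumerate(counts.most_common()):
        if count != previous_count:
            rank = position
            previous_count = count
        ranked.append((value, count, rank))
    return ranked


ranking = ranked_value_counts(["a", "b", "a", None, "c", "b", "a"])
```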

plot_distance_graph(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), bins=None, norm_hist=True, hist=True, kde=True, colors=None, fit=None, fit_kws=None)

Display a distribution plot (seaborn’s distplot) and save the graph in the correct directory.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: Name to give the file.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.
figsize: tuple: The given size of the plot.
bins: int: Specification of hist bins, or None to use the Freedman-Diaconis rule.
norm_hist: bool: If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
hist: bool: Whether to plot a (normed) histogram.
kde: bool: Whether to plot a gaussian kernel density estimate.
colors: matplotlib color: Color to plot everything but the fitted curve in.
fit: functional method: An object with a fit method, returning a tuple that can be passed to a pdf method as positional arguments following a grid of values to evaluate the pdf on.
fit_kws: dictionaries, optional: Keyword arguments for underlying plotting functions.

Credit to seaborn’s author: Michael Waskom Git username: mwaskom Doc Link: http://tinyurl.com/ycco2hok

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
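
When bins is None the Freedman-Diaconis rule picks the bin count: the bin width is 2 * IQR / n^(1/3), and the number of bins is the data range divided by that width, rounded up. The sketch below is a stdlib-only approximation; the function name and the square-root fallback for a zero IQR are assumptions, and seaborn’s own implementation may differ in detail.

```python
import math
import statistics


def freedman_diaconis_bins(values):
    """Estimate a histogram bin count with the Freedman-Diaconis rule."""
    clean = sorted(v for v in values if v is not None)
    if len(clean) < 2:
        return 1
    # "inclusive" interpolates linearly between the observed data points.
    q1, _, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    width = 2 * (q3 - q1) / len(clean) ** (1 / 3)
    if width == 0:
        # Degenerate spread: fall back to a square-root rule (an assumption).
        return max(1, int(math.sqrt(len(clean))))
    return max(1, math.ceil((clean[-1] - clean[0]) / width))


bins = freedman_diaconis_bins([1, 2, 3, 4, 5, 6, 7, 20])
```

For this sample the quartiles are 2.75 and 6.25, so the bin width is 2 * 3.5 / 2 = 3.5 and the range of 19 needs 6 bins.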

plot_jointplot_graph(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), color=None, kind='scatter and kde', ratio=5)

Display a joint plot and save the graph in the correct directory.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
other_feature_name: string: Feature to compare to.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to None, defaults to a pre-defined string; otherwise the given filename is used.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.
figsize: tuple: Tuple representing the plot/image’s size. Because jointplot only accepts a single value for the figure size, the greater of the two values is used.
color: string: Seaborn/matplotlib color or hex color for the graph.
kind: string (scatter, reg, resid, kde, hex, scatter and kde): Kind of plot to draw.
ratio: numeric: Ratio of joint axes height to marginal axes height. (Determines the dimensions of the distplot-like marginal plots.)

Credit to seaborn’s author: Michael Waskom Git username: mwaskom Link: http://tinyurl.com/v9pxsoy

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
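
The interplay of figsize and ratio can be worked through numerically. The sketch below collapses the (width, height) tuple to its larger value, as the figsize description states, and then splits that side between the joint and marginal axes according to ratio. The helper name `jointplot_layout` is hypothetical, and spacing between the axes is ignored.

```python
def jointplot_layout(figsize, ratio=5):
    """Collapse a (width, height) figsize into one side length and split it.

    Returns (side, joint_side, marginal_side), ignoring any spacing
    between the joint and marginal axes.
    """
    side = max(figsize)
    marginal_side = side / (ratio + 1)
    joint_side = side - marginal_side
    return side, joint_side, marginal_side


side, joint_side, marginal_side = jointplot_layout((13, 10), ratio=5)
```

With the defaults, a side of 13 gives the marginal plots roughly 2.17 units and the joint plot roughly 10.83, a 5 to 1 split.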

plot_multi_bar_graph(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), colors=None, stacked=False)

Display a multi bar graph and save the graph in the correct directory.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
other_feature_name: string: Feature to compare to.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to None, defaults to a pre-defined string; otherwise the given filename is used.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.
figsize: tuple: Size of the plot.
colors: dict or string: Dictionary of all feature values to hex color values.
stacked: bool: Determines if the multi bar graph should be stacked or not.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.

plot_pie_graph(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), pallete=None)

Display a pie graph and save the graph in the correct directory.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to None, defaults to a pre-defined string; otherwise the given filename is used.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.
figsize: tuple: Size of the plot.
pallete: dict or string: Dictionary of all feature values to hex color values.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
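
The data behind a pie graph is simply each value’s share of the non-null rows. The sketch below computes those shares and mirrors the documented error for an all-null feature; `pie_slices` is a hypothetical helper name, not part of eflow.

```python
from collections import Counter


def pie_slices(values):
    """Return {value: percentage of non-null rows}; raise if all are null."""
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("Feature data contains only nulls.")
    return {value: 100 * count / total for value, count in counts.items()}


slices = pie_slices(["yes", "no", "yes", None, "yes"])
```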

plot_ridge_graph(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), palette=None)

Display a ridge plot and save the graph in the correct directory.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
other_feature_name: string: Feature to compare to.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to None, defaults to a pre-defined string; otherwise the given filename is used.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.
figsize: tuple: Tuple representing the plot/image’s size.
palette: dict or string: Dictionary of all feature values to hex color values.

Note: A large part of this was taken from: http://tinyurl.com/tuou2cn

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.

plot_violin_graph(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), order=None, cut=2, scale='area', gridsize=100, width=0.8, palette=None, saturation=0.75)

Display a violin plot and save the graph in the correct directory.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name to compare to y.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
other_feature_name: string: Specified feature column name to compare to x.
display_visuals: bool: Whether or not to display visualizations.
filename: string: Name to give the file.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Whether or not to save the file.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.
display_print: bool: Whether or not to print the function’s embedded print statements.
figsize: tuple: Size of the given plot.
order: list of strings: Order to plot the categorical levels in, otherwise the levels are inferred from the data objects.
cut: float: Distance, in units of bandwidth size, to extend the density past the extreme datapoints. Set to 0 to limit the violin range within the range of the observed data. (i.e., to have the same effect as trim=True in ggplot.)
scale: string: {area, count, width} The method used to scale the width of each violin. If area, each violin will have the same area. If count, the width of the violins will be scaled by the number of observations in that bin. If width, each violin will have the same width.
gridsize: int: Number of points in the discrete grid used to compute the kernel density estimate.
width: float: Width of a full element when not using hue nesting, or width of all the elements for one level of the major grouping variable.
palette: dict or string: Colors to use for the different levels of the hue variable. Should be something that can be interpreted by color_palette(), or a dictionary mapping hue levels to matplotlib colors.
saturation: float: Proportion of the original saturation to draw colors at. Large patches often look better with slightly desaturated colors, but set this to 1 if you want the plot colors to perfectly match the input color spec.

Credit to seaborn’s author: Michael Waskom Git username: mwaskom Doc link: http://tinyurl.com/y3hxxzgv

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
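
The cut and gridsize parameters together define the points on which each violin’s density is evaluated: the observed range is widened by cut bandwidths on each side and sampled at gridsize evenly spaced points. The sketch below assumes Scott’s rule of thumb for the bandwidth and a hypothetical helper name `violin_support`; it illustrates the parameters rather than reproducing seaborn’s code.

```python
import statistics


def violin_support(values, cut=2, gridsize=100):
    """Grid of points on which a violin's density would be evaluated."""
    clean = [v for v in values if v is not None]
    # Scott's rule of thumb for a 1-D bandwidth (an assumed choice).
    bandwidth = statistics.stdev(clean) * len(clean) ** (-1 / 5)
    low = min(clean) - cut * bandwidth
    high = max(clean) + cut * bandwidth
    step = (high - low) / (gridsize - 1)
    return [low + i * step for i in range(gridsize)]


# cut=0 keeps the violin inside the observed range; cut=2 extends past it.
trimmed = violin_support([1.0, 2.0, 4.0, 7.0], cut=0, gridsize=5)
extended = violin_support([1.0, 2.0, 4.0, 7.0], cut=2, gridsize=5)
```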

statistical_analysis_on_aggregates(df, target_features, dataset_name, dataframe_snapshot=True)

Aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.

Args:

df: pd.DataFrame: Pandas DataFrame object.
target_features: list of strings: Specified target features.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.

Note:

This function has a lot going on and is still in its infancy, so I am purposely not giving it suppress_runtime_errors; that way people will find problems with it and report them to me.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
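
Aggregating by a target feature, either on its discrete values or on labeled bins of a continuous one, can be sketched as follows. The helper names, the half-open bin labels, the list-of-dicts input, and the use of the mean as the aggregate are all assumptions; the source does not say which statistics the method computes.

```python
from collections import defaultdict
import statistics


def bin_label(value, edges):
    """Label `value` with the half-open interval [lo, hi) that contains it."""
    for lo, hi in zip(edges, edges[1:]):
        if lo <= value < hi:
            return f"[{lo}, {hi})"
    return None  # outside every bin


def aggregate_by_target(rows, target, feature, edges=None):
    """Mean of `feature` per target group; bin the target if `edges` is given."""
    groups = defaultdict(list)
    for row in rows:
        key, value = row.get(target), row.get(feature)
        if key is None or value is None:
            continue
        if edges is not None:
            key = bin_label(key, edges)
            if key is None:
                continue
        groups[key].append(value)
    return {key: statistics.mean(vals) for key, vals in groups.items()}


rows = [
    {"age": 5, "fare": 10.0},
    {"age": 15, "fare": 20.0},
    {"age": 17, "fare": 40.0},
    {"age": 30, "fare": 50.0},
]
by_age_bin = aggregate_by_target(rows, "age", "fare", edges=[0, 10, 20, 40])
```

Leaving `edges` as None groups on the raw target values, which is the discrete case.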

value_counts_table(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True)

Creates and saves a value counts table of the given feature’s data.

Note
Creates a png of the table.

Args:

df: pd.DataFrame: Pandas DataFrame object.
feature_name: string: Specified feature column name.
dataset_name: string: The dataset’s name; a sub-directory with this name is created, and the generated graphs are nested inside it.
display_visuals: bool: Whether or not to display visualizations.
display_print: bool: Whether or not to print the function’s embedded print statements.
filename: string: If set to None, defaults to a pre-defined string; otherwise the given filename is used.
sub_dir: string: Sub-directory to append to the pre-defined folder path.
save_file: bool: Saves the file if set to True; doesn’t if set to False.
dataframe_snapshot: bool: Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
suppress_runtime_errors: bool: If set to True, runtime errors raised while generating graphs are suppressed so the program can keep running.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.