FeatureAnalysis

from eflow.data_analysis.feature_analysis import FeatureAnalysis

class FeatureAnalysis(df_features, dataset_sub_dir='', dataset_name='', overwrite_full_path=None, notebook_mode=False)

Analyzes the feature data of a pandas DataFrame object. Null data is ignored when displaying data, and the generated graphics are 2D and involve at most two features. 3D graphics with three features may be added in the future.

analyze_feature(df, feature_name, dataset_name, target_feature=None, display_visuals=True, display_print=True, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), extra_tables=True)

Generates all graphics for the given feature and its relationship to the target feature.

Args:
df: pd.DataFrame

Pandas DataFrame object

feature_name: string

Specified feature column name.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

target_feature: string

Will create graphics involving this feature with the main feature ‘feature_name’.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Saves file if set to True; doesn’t if set to False.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

extra_tables: bool

When handling two types of features, if set to True this will generate any extra tables that might be helpful.

Note

These graphics may create duplicates if you already applied an aggregation in ‘perform_analysis’.

Raises:

Raises error if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
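
The snapshot check behind dataframe_snapshot can be pictured roughly as follows. This is only a sketch of the idea, not eflow's implementation: the helper names make_snapshot and check_snapshot, the fields stored in the snapshot, and the use of a plain dict of columns in place of a pandas DataFrame are all assumptions made for illustration.

```python
import json
import os


def make_snapshot(columns):
    """Summarize a table given as a dict of column name -> list of values."""
    return {
        "feature_names": sorted(columns),
        "row_count": len(next(iter(columns.values()), [])),
    }


def check_snapshot(columns, snapshot_path):
    """Write a snapshot on first use; afterwards compare against it.

    Raises ValueError when the stored snapshot does not match the data.
    """
    current = make_snapshot(columns)
    if not os.path.exists(snapshot_path):
        # First run for this directory: remember what the data looked like.
        with open(snapshot_path, "w") as handle:
            json.dump(current, handle)
        return True
    with open(snapshot_path) as handle:
        stored = json.load(handle)
    if stored != current:
        raise ValueError("Snapshot does not match the given data.")
    return True
```

Calling check_snapshot twice with the same data succeeds both times; calling it with data of a different shape raises, which mirrors the error described under Raises.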

descr_table(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True)

Creates/Saves a pandas dataframe of a feature’s numerical data. Standard deviation, mean, Q1-Q5, median, variance, etc.

Note

Creates a png of the table.

Args:
df: pd.DataFrame

Pandas DataFrame object

feature_name: string

Specified feature column name.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

filename: string

If set to ‘None’ will default to a pre-defined string; unless it is set to an actual filename.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Saves file if set to True; doesn’t if set to False.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
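
As a rough picture of what such a description table holds, the sketch below computes a few of the listed statistics with Python's standard library while skipping nulls. It is not eflow's code: the function name describe_feature, the choice of the "inclusive" quartile method, and the exact set of keys are assumptions.

```python
import math
import statistics


def describe_feature(values):
    """Return summary statistics for a numeric column, skipping nulls."""
    clean = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not clean:
        raise ValueError("Feature data contains only nulls.")
    if len(clean) == 1:
        # Quartiles and spread are undefined for one point; fall back.
        q1 = median = q3 = clean[0]
        std = var = 0.0
    else:
        q1, median, q3 = statistics.quantiles(clean, n=4, method="inclusive")
        std = statistics.stdev(clean)
        var = statistics.variance(clean)
    return {
        "count": len(clean),
        "mean": statistics.mean(clean),
        "std": std,
        "var": var,
        "min": min(clean),
        "Q1": q1,
        "median": median,
        "Q3": q3,
        "max": max(clean),
    }
```

A column made up entirely of nulls raises ValueError, matching the behaviour described under Raises.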

group_by_feature_value_count_table(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True)

Creates/Saves a pandas dataframe of features and their found types in the dataframe.

Note

Creates a png of the table.

Args:
df: pd.DataFrame

Pandas DataFrame object

feature_name: string

Specified feature column name.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

other_feature_name: string

Feature to compare to.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

filename: string

If set to ‘None’ will default to a pre-defined string; unless it is set to an actual filename.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Saves file if set to True; doesn’t if set to False.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
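
Going by the method name, the table appears to hold value counts of one feature grouped by another. The sketch below shows that kind of grouping in plain Python. It is an interpretation rather than eflow's implementation, and the function name group_value_counts and the list-of-dicts input are assumptions.

```python
from collections import Counter, defaultdict


def group_value_counts(rows, feature_name, other_feature_name):
    """Count other_feature_name values within each feature_name group.

    `rows` is a list of dicts standing in for DataFrame rows; rows with a
    null (None) in either column are skipped.
    """
    table = defaultdict(Counter)
    for row in rows:
        key = row.get(feature_name)
        other = row.get(other_feature_name)
        if key is None or other is None:
            continue
        table[key][other] += 1
    return {key: dict(counts) for key, counts in table.items()}
```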

perform_analysis(df, dataset_name, target_features=None, display_visuals=True, display_print=True, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), aggregate_target_feature=True, selected_features=None, extra_tables=True, statistical_analysis_on_aggregates=True)

Performs all public methods that generate visualizations/insights about the data.

Note:

Pretty much my personal lazy button for running the entire object without specifying any method in particular.

Args:
df: pd.DataFrame

Pandas dataframe object

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

target_features: collection of strings or None

A feature name that both exists in the init df_features and the passed dataframe.

Note

If set to ‘None’, then df_features will try to extract the target feature.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

extra_tables: bool

When handling two types of features, if set to True this will generate any extra tables that might be helpful.

Note

These graphics may create duplicates if you already applied an aggregation in ‘perform_analysis’.

aggregate_target_feature: bool

Aggregate the data of the target feature if the data is non-continuous data.

Note

In the future I will have this also working with continuous data.

selected_features: collection object of features

Will only focus on these selected features and will ignore the other given features.

statistical_analysis_on_aggregates: bool

If set to true then the function ‘statistical_analysis_on_aggregates’ will run; which aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.

Raises:

Raises an error if an empty dataframe is passed to this function or if the same dataframe is passed to it.
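
The effect of suppress_runtime_errors when many graphics are generated in one call can be sketched as below. This is an illustration of the pattern only: the function name run_all and the decision to catch every Exception subclass are assumptions, and eflow may catch a narrower set of errors.

```python
def run_all(steps, suppress_runtime_errors=True):
    """Run every (name, callable) step in order.

    With suppress_runtime_errors=True a failing step is recorded and the
    remaining steps still run; otherwise the first error propagates.
    """
    failures = []
    for name, step in steps:
        try:
            step()
        except Exception as error:  # assumption: any exception counts
            if not suppress_runtime_errors:
                raise
            failures.append((name, repr(error)))
    return failures
```

Returning the list of failures keeps suppressed errors visible to the caller instead of discarding them silently.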

plot_count_graph(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), flip_axis=False, palette='PuBu')

Display a barplot with color ranking from a feature’s value counts using the seaborn library and save the graph in the correct directory structure.

Args:
df: pd.DataFrame

Pandas dataframe object.

feature_name: string

Specified feature column name.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

filename: string

Name to give the file.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

figsize: tuple

Size for the given plot.

flip_axis: bool

Flips the plotting axis from x to y if set to ‘True’.

palette: dict or string

String representation of the color palette used for the ranking, taken from seaborn’s palettes.

Credit to seaborn’s author: Michael Waskom Git username: mwaskom Link: http://tinyurl.com/y4pzrgcf

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
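
One way to read "color ranking" is that bars for more frequent values are drawn in darker shades of the palette. The sketch below shows that mapping in plain Python. It is an assumption about the intent, not eflow's or seaborn's code, and the function name rank_colors and the list-of-colors palette are hypothetical.

```python
from collections import Counter


def rank_colors(values, palette):
    """Map each non-null value to a palette color by count rank.

    `palette` is a list of colors ordered from light to dark; the most
    frequent value receives the darkest color.
    """
    counts = Counter(v for v in values if v is not None)
    if not counts:
        raise ValueError("Feature data contains only nulls.")
    ordered = sorted(counts, key=counts.get)  # least frequent first
    last_rank = max(len(ordered) - 1, 1)
    last_color = len(palette) - 1
    # Spread the ranks evenly across the available palette entries.
    return {
        label: palette[rank * last_color // last_rank]
        for rank, label in enumerate(ordered)
    }
```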

plot_distance_graph(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), bins=None, norm_hist=True, hist=True, kde=True, colors=None, fit=None, fit_kws=None)

Display a distribution plot (seaborn’s distplot) and save the graph in the correct directory.

Args:
df: pd.DataFrame

Pandas dataframe object

feature_name: string

Specified feature column name.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

filename: string

Name to give the file.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

figsize: tuple

The given size of the plot.

bins: int

Specification of hist bins, or None to use Freedman-Diaconis rule.

norm_hist: bool

If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.

hist: bool

Whether to plot a (normed) histogram.

kde: bool

Whether to plot a gaussian kernel density estimate.

colors: matplotlib color

Color to plot everything but the fitted curve in.

fit: functional method

An object with a fit method, returning a tuple that can be passed to a pdf method as positional arguments following a grid of values to evaluate the pdf on.

fit_kws: dictionary, optional

Keyword arguments for underlying plotting functions.

Credit to seaborn’s author: Michael Waskom Git username: mwaskom Doc Link: http://tinyurl.com/ycco2hok

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
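
When bins is None the Freedman-Diaconis rule is used, which sets the bin width to 2 * IQR / n^(1/3). The sketch below turns that width into a bin count using only the standard library. The function name freedman_diaconis_bins, the "inclusive" quartile method, and the fallback to a single bin when the width is zero are assumptions, so the result may differ slightly from what seaborn computes.

```python
import math
import statistics


def freedman_diaconis_bins(values):
    """Number of histogram bins suggested by the Freedman-Diaconis rule."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        return 1
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    width = 2 * (q3 - q1) / n ** (1 / 3)  # bin width from the rule
    if width == 0:
        return 1  # assumption: degenerate data gets a single bin
    return max(1, math.ceil((data[-1] - data[0]) / width))
```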

plot_jointplot_graph(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), color=None, kind='scatter and kde', ratio=5)

Display a joint plot and save the graph in the correct directory.

Args:
df: pd.DataFrame

Pandas DataFrame object.

feature_name: string

Specified feature column name.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

other_feature_name: string

Feature to compare to.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

filename: string

If set to ‘None’ will default to a pre-defined string; unless it is set to an actual filename.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

figsize: tuple

Tuple object to represent the plot/image’s size. Because jointplot only accepts a single value for the figure size, the greater of the two values is used.

color: string

Seaborn/matplotlib color or hex color for representing the graph.

kind: string (scatter,reg,resid,kde,hex,scatter and kde)

Kind of plot to draw.

ratio: numeric

Ratio of joint axes height to marginal axes height. (Determines distplot like plots dimensions.)

Credit to seaborn’s author: Michael Waskom Git username: mwaskom Link: http://tinyurl.com/v9pxsoy

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.

plot_multi_bar_graph(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), colors=None, stacked=False)

Display a multi bar graph and save the graph in the correct directory.

Args:
df: pd.DataFrame

Pandas DataFrame object.

feature_name: string

Specified feature column name.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

other_feature_name: string

Feature to compare to.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

filename: string

If set to ‘None’ will default to a pre-defined string; unless it is set to an actual filename.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Boolean value to determine whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

figsize: tuple

Size of the plot.

colors: dict or string

Dictionary of all feature values to hex color values.

stacked: bool

Determines if the multi bar graph should be stacked or not.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.

plot_pie_graph(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), pallete=None)

Display a pie graph and save the graph in the correct directory.

Args:
df: pd.DataFrame

Pandas DataFrame object.

feature_name: string

Specified feature column name.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

filename: string

If set to ‘None’ will default to a pre-defined string; unless it is set to an actual filename.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Boolean value to determine whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

figsize: tuple

Size of the plot.

pallete: dict or string

Dictionary of all feature values to hex color values.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
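
The data behind a pie graph is each value's share of the non-null total, optionally paired with a color looked up in the value-to-hex dictionary. The sketch below computes those slices. It is illustrative only: the function name pie_slices is hypothetical, and it reuses the argument spelling pallete from the signature above.

```python
from collections import Counter


def pie_slices(values, pallete=None):
    """Return (label, percent, color) tuples, largest slice first.

    `pallete` maps feature values to hex colors; values missing from it
    get None so a plotting layer could fall back to its defaults.
    """
    counts = Counter(v for v in values if v is not None)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("Feature data contains only nulls.")
    pallete = pallete or {}
    return [
        (label, 100 * count / total, pallete.get(label))
        for label, count in counts.most_common()
    ]
```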

plot_ridge_graph(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), palette=None)

Display a ridge plot and save the graph in the correct directory.

Args:
df: pd.DataFrame

Pandas DataFrame object.

feature_name: string

Specified feature column name.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

other_feature_name: string

Feature to compare to.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

filename: string

If set to ‘None’ will default to a pre-defined string; unless it is set to an actual filename.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

figsize: tuple

Tuple object to represent the plot/image’s size.

palette: dict or string

Dictionary of all feature values to hex color values.

Note

A large part of this was taken from: http://tinyurl.com/tuou2cn

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.

plot_violin_graph(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), order=None, cut=2, scale='area', gridsize=100, width=0.8, palette=None, saturation=0.75)

Display a violin plot and save the graph in the correct directory.

Args:
df: pd.DataFrame

Pandas dataframe object

feature_name: string

Specified feature column name to compare to y.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

other_feature_name: string

Specified feature column name to compare to x.

display_visuals: bool

Whether or not to display visualizations.

filename: string

Name to give the file.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Whether or not to save the file.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

display_print: bool

Determines whether or not to print function’s embedded print statements.

figsize: tuple

Size of the given plot.

order: list of strings

Order to plot the categorical levels in, otherwise the levels are inferred from the data objects.

cut: float

Distance, in units of bandwidth size, to extend the density past the extreme datapoints. Set to 0 to limit the violin range within the range of the observed data. (i.e., to have the same effect as trim=True in ggplot.)

scale: string

{area, count, width} The method used to scale the width of each violin. If area, each violin will have the same area. If count, the width of the violins will be scaled by the number of observations in that bin. If width, each violin will have the same width.

gridsize: int

Number of points in the discrete grid used to compute the kernel density estimate.

width: float

Width of a full element when not using hue nesting, or width of all the elements for one level of the major grouping variable.

palette: dict or string

Colors to use for the different levels of the hue variable. Should be something that can be interpreted by color_palette(), or a dictionary mapping hue levels to matplotlib colors.

saturation: float

Proportion of the original saturation to draw colors at. Large patches often look better with slightly desaturated colors, but set this to 1 if you want the plot colors to perfectly match the input color spec.

Credit to seaborn’s author: Michael Waskom Git username: mwaskom Doc link: http://tinyurl.com/y3hxxzgv

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
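
The cut and gridsize arguments together determine the points at which the density is evaluated: the range of the data is widened by cut bandwidths on each side and then sampled at gridsize evenly spaced points. The sketch below shows that arithmetic. The function name density_support is hypothetical, and the bandwidth is passed in directly rather than estimated from the data as seaborn would do.

```python
def density_support(values, bandwidth, cut=2, gridsize=100):
    """Evenly spaced grid on which a kernel density would be evaluated.

    The grid spans the observed range widened by cut * bandwidth on each
    side; cut=0 limits it to the range of the observed data.
    """
    if gridsize < 2:
        raise ValueError("gridsize must be at least 2.")
    low = min(values) - cut * bandwidth
    high = max(values) + cut * bandwidth
    step = (high - low) / (gridsize - 1)
    return [low + index * step for index in range(gridsize)]
```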

statistical_analysis_on_aggregates(df, target_features, dataset_name, dataframe_snapshot=True)

Aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.

Args:
df: pd.DataFrame

Pandas DataFrame object.

target_features: list of strings

Specified target features.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

Note:

This function has a lot going on and is in its infancy, so I am purposely not giving it suppress_runtime_errors so that people will find problems with it and report them to me.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
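
Binning/labeling continuous data means replacing each number with the label of the interval it falls into, so that the labels can then be aggregated like discrete values. The sketch below uses equal-width bins. The function name bin_and_label, the equal-width strategy, and the label format are assumptions, and eflow may choose its bins differently.

```python
def bin_and_label(values, n_bins=3):
    """Assign each non-null value to one of n_bins equal-width intervals."""
    clean = [v for v in values if v is not None]
    if not clean:
        raise ValueError("Feature data contains only nulls.")
    low, high = min(clean), max(clean)
    width = (high - low) / n_bins or 1  # avoid a zero width for constant data
    labels = []
    for value in clean:
        # The maximum value belongs to the last bin, not a bin past it.
        index = min(int((value - low) / width), n_bins - 1)
        start = low + index * width
        labels.append(f"{start:g}-{start + width:g}")
    return labels
```

The returned labels can be counted or grouped in the same way as the values of a discrete target feature.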

value_counts_table(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True)

Creates a value counts table of the feature’s given data.

Note

Creates a png of the table.

Args:
df: pd.DataFrame

Pandas DataFrame object

feature_name: string

Specified feature column name.

dataset_name: string

The dataset’s name; this will create a sub-directory in which your generated graph will be inner-nested in.

display_visuals: bool

Whether or not to display visualizations.

display_print: bool

Determines whether or not to print function’s embedded print statements.

filename: string

If set to ‘None’ will default to a pre-defined string; unless it is set to an actual filename.

sub_dir: string

Specify the sub directory to append to the pre-defined folder path.

save_file: bool

Saves file if set to True; doesn’t if set to False.

dataframe_snapshot: bool

Boolean value to determine whether or not generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated to a dataframe.

suppress_runtime_errors: bool

If set to true; when generating any graphs will suppress any runtime errors so the program can keep running.

Raises:

Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
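
A value counts table lists each distinct non-null value with the number of times it occurs, most frequent first. The sketch below builds such a table with the standard library. The function name value_counts is hypothetical and the code is not taken from eflow.

```python
from collections import Counter


def value_counts(values):
    """Return (value, count) pairs for non-null values, most frequent first."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        raise ValueError("Feature data contains only nulls.")
    return counts.most_common()
```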