FeatureAnalysis
from eflow.data_analysis.feature_analysis import FeatureAnalysis
-
class
FeatureAnalysis
(df_features, dataset_sub_dir='', dataset_name='', overwrite_full_path=None, notebook_mode=False) Analyzes the feature data of a pandas DataFrame object. Null data is ignored when displaying data, and graphics are 2D with two features. (3D graphics with three features may be added in the future.)
-
analyze_feature
(df, feature_name, dataset_name, target_feature=None, display_visuals=True, display_print=True, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), extra_tables=True) Generates all graphics for the given feature and its relationship to the target feature.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- target_feature: string
Creates graphics relating this feature to the main feature ‘feature_name’.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Saves the file if set to True; doesn’t if set to False.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- extra_tables: bool
When handling two types of features, setting this to True generates any extra tables that might be helpful.
- Note
These graphics may create duplicates if you already applied an aggregation in ‘perform_analysis’.
- Raises:
Raises error if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
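The snapshot check behind `dataframe_snapshot` can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not eflow's implementation: the helper names `make_snapshot` and `check_snapshot`, the snapshot fields, and the file name `snapshot.json` are all hypothetical, and a plain dict of columns stands in for a pandas DataFrame.

```python
import json
import os
import tempfile


def make_snapshot(columns):
    """Summarize a table given as a dict of column name -> list of values.

    Hypothetical snapshot layout: sorted column names plus the row count.
    """
    return {
        "columns": sorted(columns),
        "row_count": len(next(iter(columns.values()), [])),
    }


def check_snapshot(columns, directory, filename="snapshot.json"):
    """Write the snapshot on first use; afterwards compare against it."""
    path = os.path.join(directory, filename)
    current = make_snapshot(columns)
    if not os.path.exists(path):
        with open(path, "w") as handle:
            json.dump(current, handle)
        return True
    with open(path) as handle:
        saved = json.load(handle)
    if saved != current:
        raise ValueError("Snapshot does not match the given table.")
    return True


with tempfile.TemporaryDirectory() as tmp:
    table = {"age": [22, 38, None], "sex": ["m", "f", "f"]}
    check_snapshot(table, tmp)  # first call writes the snapshot
    check_snapshot(table, tmp)  # same table: passes
    try:
        check_snapshot({"age": [22]}, tmp)  # different table
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
```

The point of the pattern is that output written into a dataset's directory stays tied to one table: a second, different table pointed at the same directory fails fast instead of silently mixing results.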
-
descr_table
(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True) Creates and saves a pandas dataframe summarizing a feature’s numerical data: standard deviation, mean, Q1-Q5, median, variance, etc.
- Note
Creates a png of the table.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- filename: string
If set to None, defaults to a pre-defined string; otherwise the given filename is used.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Saves the file if set to True; doesn’t if set to False.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
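As an illustration of the kind of numerical summary described here, the following sketch computes comparable statistics with Python's standard library. It is not eflow's code: the function name `describe`, the exact set of statistics, and the use of `None` for nulls are assumptions made for the example.

```python
import statistics


def describe(values):
    """Return summary statistics for a numeric column, skipping nulls (None)."""
    clean = [v for v in values if v is not None]
    if not clean:
        raise ValueError("Feature contains only null values.")
    # quantiles(n=4) returns the three cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(clean, n=4)
    return {
        "count": len(clean),
        "mean": statistics.mean(clean),
        "std": statistics.stdev(clean),
        "var": statistics.variance(clean),
        "min": min(clean),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(clean),
    }


summary = describe([1, 2, None, 3, 4, 5])
```

Note that `statistics.quantiles` and `statistics.stdev` need at least two non-null values, so a real implementation would have to decide how to treat single-value columns.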
-
group_by_feature_value_count_table
(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True) Creates and saves a pandas dataframe of features and their found types in the dataframe.
- Note
Creates a png of the table.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- other_feature_name: string
Feature to compare to.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- filename: string
If set to None, defaults to a pre-defined string; otherwise the given filename is used.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Saves the file if set to True; doesn’t if set to False.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
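The description of this method is terse, so the sketch below is only one plausible reading of its name: counting the values of `other_feature_name` within each group of `feature_name`. The function `group_value_counts`, the row-of-dicts input, and the sample columns are hypothetical and do not come from eflow.

```python
from collections import Counter, defaultdict


def group_value_counts(rows, feature_name, other_feature_name):
    """Count other_feature_name values within each feature_name group.

    Rows where either value is null (None) are skipped.
    """
    counts = defaultdict(Counter)
    for row in rows:
        key, other = row.get(feature_name), row.get(other_feature_name)
        if key is None or other is None:
            continue
        counts[key][other] += 1
    return {key: dict(counter) for key, counter in counts.items()}


rows = [
    {"sex": "f", "survived": 1},
    {"sex": "f", "survived": 1},
    {"sex": "m", "survived": 0},
    {"sex": "m", "survived": 1},
    {"sex": None, "survived": 0},
]
table = group_value_counts(rows, "sex", "survived")
```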
-
perform_analysis
(df, dataset_name, target_features=None, display_visuals=True, display_print=True, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), aggregate_target_feature=True, selected_features=None, extra_tables=True, statistical_analysis_on_aggregates=True) Performs all public methods that generate visualizations/insights about the data.
- Note:
A convenience method for running the entire object without calling any method in particular.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- target_features: collection of strings or None
Feature names that exist in both the init df_features and the passed dataframe.
- Note
If set to None, df_features will try to extract the target feature.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- save_file: bool
Whether or not to save the file.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- extra_tables: bool
When handling two types of features, setting this to True generates any extra tables that might be helpful.
- Note
These graphics may create duplicates if you already applied an aggregation in ‘perform_analysis’.
- aggregate_target_feature: bool
Aggregates the data of the target feature if the data is non-continuous.
- Note
In the future I will have this also working with continuous data.
- selected_features: collection object of features
Focuses only on these selected features and ignores the other given features.
- statistical_analysis_on_aggregates: bool
If set to True, the function ‘statistical_analysis_on_aggregates’ will run, which aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.
- Raises:
Raises an error if an empty dataframe is passed to this function or if the same dataframe is passed to it.
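The effect of `suppress_runtime_errors` when many graphs are generated in one call can be sketched as below. This is an assumption-laden illustration: the source does not say whether eflow catches exceptions or silences warnings, and the names `run_steps` and `broken_plot` are made up for the example.

```python
def run_steps(steps, suppress_runtime_errors=True):
    """Run each (name, callable) step; optionally keep going after failures."""
    completed, failed = [], []
    for name, step in steps:
        try:
            step()
            completed.append(name)
        except Exception as error:  # broad on purpose for this sketch
            if not suppress_runtime_errors:
                raise
            failed.append((name, repr(error)))
    return completed, failed


def broken_plot():
    """Stand-in for a plotting call that fails."""
    raise RuntimeError("could not draw graph")


steps = [
    ("value_counts_table", lambda: None),
    ("plot_pie_graph", broken_plot),
    ("plot_count_graph", lambda: None),
]
completed, failed = run_steps(steps)
```

With suppression on, the failing step is recorded and the remaining steps still run; with it off, the first failure propagates to the caller.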
-
plot_count_graph
(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), flip_axis=False, palette='PuBu') Displays a seaborn bar plot, with color ranking, of a feature’s value counts and saves the graph in the correct directory structure.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- filename: string
Name to give the file.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Whether or not to save the file.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- figsize: tuple
Size for the given plot.
- flip_axis: bool
Flips the plotting axis from x to y if set to True.
- palette: dict or string
String representation of a seaborn color palette used for the ranking.
Credit to seaborn’s author: Michael Waskom Git username: mwaskom Link: http://tinyurl.com/y4pzrgcf
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
-
plot_distance_graph
(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), bins=None, norm_hist=True, hist=True, kde=True, colors=None, fit=None, fit_kws=None) Displays a distribution plot (seaborn’s distplot) and saves the graph in the correct directory.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- filename: string
Name to give the file.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Whether or not to save the file.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- figsize: tuple
The given size of the plot.
- bins: int
Specification of hist bins, or None to use the Freedman-Diaconis rule.
- norm_hist: bool
If True, the histogram height shows a density rather than a count. This is implied if a KDE or fitted density is plotted.
- hist: bool
Whether to plot a (normed) histogram.
- kde: bool
Whether to plot a Gaussian kernel density estimate.
- colors: matplotlib color
Color to plot everything but the fitted curve in.
- fit: functional method
An object with a fit method, returning a tuple that can be passed to a pdf method as positional arguments following a grid of values to evaluate the pdf on.
- fit_kws: dictionaries, optional
Keyword arguments for underlying plotting functions.
Credit to seaborn’s author: Michael Waskom Git username: mwaskom Doc Link: http://tinyurl.com/ycco2hok
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
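The `bins` argument falls back to the Freedman-Diaconis rule when it is None. The rule sets the bin width to h = 2 * IQR / n^(1/3) and uses ceil((max - min) / h) bins. The sketch below applies that formula with the standard library; the function name, the quartile method, and the single-bin fallback for zero spread are assumptions of this example rather than a copy of seaborn's or eflow's code.

```python
import math
import statistics


def freedman_diaconis_bins(values):
    """Number of histogram bins from the Freedman-Diaconis rule.

    Bin width h = 2 * IQR / n ** (1 / 3); bins = ceil((max - min) / h).
    Nulls (None) are ignored.
    """
    clean = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(clean, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return 1  # degenerate spread: fall back to a single bin (assumption)
    width = 2 * iqr / len(clean) ** (1 / 3)
    return math.ceil((clean[-1] - clean[0]) / width)


bins = freedman_diaconis_bins(list(range(1, 11)))  # ten evenly spaced values
```

Because the width depends on the interquartile range rather than the standard deviation, a few extreme outliers change the bin count less than they would under rules based on variance.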
-
plot_jointplot_graph
(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), color=None, kind='scatter and kde', ratio=5) Displays a joint plot and saves the graph in the correct directory.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- other_feature_name: string
Feature to compare to.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- filename: string
If set to None, defaults to a pre-defined string; otherwise the given filename is used.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Whether or not to save the file.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- figsize: tuple
Tuple representing the plot/image’s size. Because jointplot only accepts a single value for the figure size, the greater of the two values is used.
- color: string
Seaborn/matplotlib color or hex color for representing the graph.
- kind: string (scatter,reg,resid,kde,hex,scatter and kde)
Kind of plot to draw.
- ratio:
Ratio of joint axes height to marginal axes height. (Determines the dimensions of the distplot-like marginal plots.)
Credit to seaborn’s author: Michael Waskom Git username: mwaskom Link: http://tinyurl.com/v9pxsoy
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
-
plot_multi_bar_graph
(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), colors=None, stacked=False) Displays a multi-bar graph and saves the graph in the correct directory.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- other_feature_name: string
Feature to compare to.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- filename: string
If set to None, defaults to a pre-defined string; otherwise the given filename is used.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Whether or not to save the file.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- figsize: tuple
Size of the plot.
- colors: dict or string
Dictionary of all feature values to hex color values.
- stacked: bool
Determines if the multi-bar graph should be stacked or not.
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
-
plot_pie_graph
(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), pallete=None) Displays a pie graph and saves the graph in the correct directory.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- filename: string
If set to None, defaults to a pre-defined string; otherwise the given filename is used.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Whether or not to save the file.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- figsize: tuple
Size of the plot.
- pallete: dict or string
Dictionary of all feature values to hex color values.
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
-
plot_ridge_graph
(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), palette=None) Displays a ridge plot and saves the graph in the correct directory.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- other_feature_name: string
Feature to compare to.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- filename: string
If set to None, defaults to a pre-defined string; otherwise the given filename is used.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Whether or not to save the file.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- figsize: tuple
Tuple representing the plot/image’s size.
- palette: dict or string
Dictionary of all feature values to hex color values.
- Note -
A large part of this was taken from: http://tinyurl.com/tuou2cn
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
-
plot_violin_graph
(df, feature_name, dataset_name, other_feature_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True, figsize=(13, 10), order=None, cut=2, scale='area', gridsize=100, width=0.8, palette=None, saturation=0.75) Displays a violin plot and saves the graph in the correct directory.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name to compare to y.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- other_feature_name: string
Specified feature column name to compare to x.
- display_visuals: bool
Whether or not to display visualizations.
- filename: string
Name to give the file.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Whether or not to save the file.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- figsize: tuple
Size of the given plot.
- order: list of strings
Order to plot the categorical levels in; otherwise the levels are inferred from the data objects.
- cut: float
Distance, in units of bandwidth size, to extend the density past the extreme datapoints. Set to 0 to limit the violin range within the range of the observed data (i.e., to have the same effect as trim=True in ggplot).
- scale: string
{area, count, width} The method used to scale the width of each violin. If area, each violin will have the same area. If count, the width of the violins will be scaled by the number of observations in that bin. If width, each violin will have the same width.
- gridsize: int
Number of points in the discrete grid used to compute the kernel density estimate.
- width: float
Width of a full element when not using hue nesting, or width of all the elements for one level of the major grouping variable.
- palette: dict or string
Colors to use for the different levels of the hue variable. Should be something that can be interpreted by color_palette(), or a dictionary mapping hue levels to matplotlib colors.
- saturation: float
Proportion of the original saturation to draw colors at. Large patches often look better with slightly desaturated colors, but set this to 1 if you want the plot colors to perfectly match the input color spec.
Credit to seaborn’s author: Michael Waskom Git username: mwaskom Doc link: http://tinyurl.com/y3hxxzgv
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
-
statistical_analysis_on_aggregates
(df, target_features, dataset_name, dataframe_snapshot=True) Aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- target_features: list of strings
Specified target features.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- Note:
This function has a lot going on and is in its infancy, so I am purposely not giving it suppress_runtime_errors; that way people will find problems with it and report them to me.
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
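The two aggregation modes mentioned for this method, grouping by discrete values and binning/labeling continuous data, can be illustrated with a small standard-library sketch. Everything here is hypothetical (the helper names `bin_label` and `aggregate_mean`, the bin edges, the labels, and the choice of the mean as the statistic); the source does not specify how eflow bins or which statistics it computes.

```python
from collections import defaultdict
import statistics


def bin_label(value, edges, labels):
    """Return the label of the half-open bin [low, high) that holds value."""
    for low, high, label in zip(edges, edges[1:], labels):
        if low <= value < high:
            return label
    return None  # outside every bin


def aggregate_mean(rows, target, feature, edges=None, labels=None):
    """Mean of `feature` per target group; bin the target when edges are given."""
    groups = defaultdict(list)
    for row in rows:
        key, value = row[target], row[feature]
        if key is None or value is None:
            continue  # nulls are skipped
        if edges is not None:
            key = bin_label(key, edges, labels)
            if key is None:
                continue
        groups[key].append(value)
    return {key: statistics.mean(vals) for key, vals in groups.items()}


rows = [
    {"age": 5, "survived": 1, "fare": 10.0},
    {"age": 30, "survived": 0, "fare": 20.0},
    {"age": 45, "survived": 1, "fare": 30.0},
    {"age": 70, "survived": 0, "fare": 40.0},
]
by_discrete = aggregate_mean(rows, "survived", "fare")
by_binned = aggregate_mean(
    rows, "age", "fare", edges=[0, 18, 65, 120], labels=["child", "adult", "senior"]
)
```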
-
value_counts_table
(df, feature_name, dataset_name, display_visuals=True, display_print=True, filename=None, sub_dir=None, save_file=True, dataframe_snapshot=True, suppress_runtime_errors=True) Creates and saves a pandas dataframe of the value counts of the feature’s data.
- Note
Creates a png of the table.
- Args:
- df: pd.DataFrame
Pandas DataFrame object.
- feature_name: string
Specified feature column name.
- dataset_name: string
The dataset’s name; this creates a sub-directory in which the generated graphs are nested.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Whether or not to print the function’s embedded print statements.
- filename: string
If set to None, defaults to a pre-defined string; otherwise the given filename is used.
- sub_dir: string
Sub-directory to append to the pre-defined folder path.
- save_file: bool
Saves the file if set to True; doesn’t if set to False.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset’s directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to True, suppresses any runtime errors raised while generating graphs so the program can keep running.
- Raises:
Raises error if the feature data is filled with only nulls or if the json file’s snapshot of the given dataframe doesn’t match the given dataframe.
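A value counts table of the kind described here, including the error for a feature that holds only nulls, might look like the following in plain Python. The function name `value_counts`, the use of `None` for nulls, and the sample values are assumptions for illustration only.

```python
from collections import Counter


def value_counts(values):
    """Count non-null values, most frequent first."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        raise ValueError("Feature contains only null values.")
    return counts.most_common()


counts = value_counts(["S", "C", "S", None, "Q", "S", "C"])
```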
-