ClassificationAnalysis¶
from eflow.model_analysis.classification_analysis import ClassificationAnalysis
-
class
ClassificationAnalysis
(dataset_name, model, model_name, feature_order, target_feature, pred_funcs_dict, df_features, sample_data, project_sub_dir='Classification Analysis', overwrite_full_path=None, target_classes=None, save_model=True, notebook_mode=False)[source]¶ Analyzes a classification model's results based on the prediction function(s) passed to it. Creates graphs and tables to be saved in the directory structure.
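The central input is `pred_funcs_dict`, a mapping of display names to prediction callables. A minimal sketch of what such a dictionary might look like (the toy functions below are assumptions for illustration, standing in for a real model's `predict`/`predict_proba`; they are not part of eflow's API):

```python
# Toy stand-ins for a trained model's prediction functions.
def predict(X):
    # Label 1 when the first feature is at least 0.5, else 0.
    return [1 if row[0] >= 0.5 else 0 for row in X]

def predict_proba(X):
    # Two-class probability estimates derived from the first feature.
    return [[1 - row[0], row[0]] for row in X]

# Keys become the names referenced by 'pred_name' in the methods below.
pred_funcs_dict = {"Predictions": predict, "Probabilities": predict_proba}
```

These named callables would be passed as `pred_funcs_dict` when constructing `ClassificationAnalysis`, and each analysis method then selects one via its `pred_name` argument.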
-
classification_correct_analysis
(X, y, pred_name, dataset_name, thresholds=None, display_visuals=True, save_file=True, aggerate_target=False, display_print=True, suppress_runtime_errors=True, aggregate_target_feature=True, selected_features=None, extra_tables=True, statistical_analysis_on_aggregates=True)[source]¶ Compares the actual target values to the predicted values and performs feature analysis on the correctly predicted data.
- Args:
- X: np.matrix or lists of lists
Feature matrix.
- y: list or np.array
Target data vector.
- pred_name: str
The name of the prediction function in question, stored in 'self.__pred_funcs_dict'.
- dataset_name: str
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- feature_order: collection object
Feature names in the proper order to re-create the pandas dataframe.
- thresholds:
If the model outputs a probability list/numpy array, thresholds are applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Determines whether or not to print the function's embedded print statements.
- save_file: bool
Whether or not to save the file.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset's directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to true, any runtime errors raised while generating graphs are suppressed so the program can keep running.
- extra_tables: bool
When handling two types of features, if set to true this will generate any extra tables that might be helpful. Note: these graphics may create duplicates if you already applied an aggregation in 'perform_analysis'.
- aggregate_target_feature: bool
Aggregate the data of the target feature if it is non-continuous.
- Note
In the future this will also work with continuous data.
- selected_features: collection object of features
Will only focus on these selected features and will ignore the other given features.
- statistical_analysis_on_aggregates: bool
If set to true, the function 'statistical_analysis_on_aggregates' will run, which aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.
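The `thresholds` argument is easiest to see on a toy probability row. The scheme below is only one plausible interpretation (per-class cutoffs with an argmax fallback); it is an illustrative sketch, not necessarily eflow's exact rule:

```python
def apply_thresholds(proba_row, thresholds):
    # Return the first class whose probability meets its per-class
    # threshold; fall back to the most probable class if none do.
    # (An assumed scheme for illustration only.)
    for label, (p, t) in enumerate(zip(proba_row, thresholds)):
        if p >= t:
            return label
    return max(range(len(proba_row)), key=lambda i: proba_row[i])
```

For example, with probabilities `[0.4, 0.6]` and strict thresholds `[0.9, 0.9]`, no class clears its cutoff, so the fallback picks the argmax.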
-
classification_error_analysis
(X, y, pred_name, dataset_name, thresholds=None, display_visuals=True, save_file=True, aggerate_target=False, display_print=True, suppress_runtime_errors=True, aggregate_target_feature=True, selected_features=None, extra_tables=True, statistical_analysis_on_aggregates=True)[source]¶ Compares the actual target values to the predicted values and performs feature analysis on the incorrectly predicted data.
- Args:
- X: np.matrix or lists of lists
Feature matrix.
- y: list or np.array
Target data vector.
- pred_name: str
The name of the prediction function in question, stored in 'self.__pred_funcs_dict'.
- dataset_name: str
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- feature_order: collection object
Feature names in the proper order to re-create the pandas dataframe.
- thresholds:
If the model outputs a probability list/numpy array, thresholds are applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- display_visuals: bool
Whether or not to display visualizations.
- display_print: bool
Determines whether or not to print the function's embedded print statements.
- save_file: bool
Whether or not to save the file.
- dataframe_snapshot: bool
Whether or not to generate and compare a snapshot of the dataframe in the dataset's directory structure. Helps ensure that data generated in that directory is correctly associated with a dataframe.
- suppress_runtime_errors: bool
If set to true, any runtime errors raised while generating graphs are suppressed so the program can keep running.
- extra_tables: bool
When handling two types of features, if set to true this will generate any extra tables that might be helpful. Note: these graphics may create duplicates if you already applied an aggregation in 'perform_analysis'.
- aggregate_target_feature: bool
Aggregate the data of the target feature if it is non-continuous.
- Note
In the future this will also work with continuous data.
- selected_features: collection object of features
Will only focus on these selected features and will ignore the other given features.
- statistical_analysis_on_aggregates: bool
If set to true, the function 'statistical_analysis_on_aggregates' will run, which aggregates the data of the target feature either by discrete values or by binning/labeling continuous data.
-
classification_metrics
(X, y, pred_name, dataset_name, thresholds=None, display_visuals=True, save_file=True, title='', custom_metrics_dict={}, ignore_metrics=[], average_scoring=['micro', 'macro', 'weighted'])[source]¶ Creates a dataframe based on the prediction metrics of the feature matrix and target vector.
- Args:
- X:
Feature matrix.
- y: list or np.array
Target data vector.
- pred_name:
The name of the prediction function in question, stored in 'self.__pred_funcs_dict'.
- dataset_name:
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- thresholds:
If the model outputs a probability list/numpy array, thresholds are applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- display_visuals:
Display tables.
- save_file:
Determines whether or not to save the generated document.
- title:
Adds to the column ‘Metric Score’.
- custom_metrics_dict:
Pass the name of metric(s) and the function definition(s) in a dictionary.
- ignore_metrics:
Specify which of the default metrics to exclude from the classification analysis:
Precision
MCC
Recall
F1-Score
Accuracy
- average_scoring:
- Determines the type of averaging performed on the data.
micro
macro
weighted
- Returns:
A dataframe object of the metric values.
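The `custom_metrics_dict` argument pairs a metric name with a callable that takes the true and predicted labels. A minimal sketch (the accuracy helper below is a hypothetical example, not an eflow builtin):

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Metric name -> function definition, as expected by custom_metrics_dict.
custom_metrics_dict = {"Accuracy": accuracy}
```

Each named function would then appear as its own row of the returned metrics dataframe, alongside the defaults that were not ignored.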
-
classification_report
(X, y, pred_name, dataset_name, thresholds=None, display_visuals=True, save_file=True)[source]¶ Creates a report of each target class's metric evaluations based on the model's prediction output, using sklearn's classification report.
- Args:
- X:
Feature matrix.
- y:
Target data vector.
- pred_name:
The name of the prediction function in question, stored in 'self.__pred_funcs_dict'.
- dataset_name:
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- thresholds:
If the model outputs a probability list/numpy array, thresholds are applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- display_visuals:
Visualize graph if needed.
- save_file:
Whether or not to save the file.
-
graph_model_importances
(feature_order, feature_importances, display_visuals=True)[source]¶ Graphs the given model's feature importances.
- Args:
- feature_order: list
Feature names in the proper order to re-create the pandas dataframe.
- feature_importances: list
List of floats representing each corresponding feature's importance.
- display_visuals: bool
Visualize graph if needed.
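The relationship between `feature_order` and `feature_importances` is positional: the i-th importance belongs to the i-th feature name. Pairing and ranking them might look like this (the feature names and values here are hypothetical):

```python
feature_order = ["age", "income", "tenure"]    # hypothetical feature names
feature_importances = [0.2, 0.5, 0.3]          # one float per feature

# Pair names with importances positionally, then rank descending,
# which is essentially the ordering a feature-importance bar chart shows.
ranked = sorted(zip(feature_order, feature_importances),
                key=lambda pair: pair[1], reverse=True)
```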
-
perform_analysis
(X, y, dataset_name, thresholds_matrix=None, classification_error_analysis=False, classification_correct_analysis=False, ignore_metrics=[], custom_metrics_dict={}, average_scoring=['micro', 'macro', 'weighted'], display_visuals=True)[source]¶ Runs all available analysis functions on the models predicted data.
- Args:
- X:
Feature matrix.
- y:
Target data vector.
- dataset_name:
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- thresholds_matrix:
List of lists/matrix of thresholds. If the model outputs a probability list/numpy array, each set of thresholds is applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- classification_error_analysis: bool
Perform feature analysis on data that was incorrectly predicted.
- classification_correct_analysis: bool
Perform feature analysis on data that was correctly predicted.
- figsize:
Dimensions of all plots.
- ignore_metrics:
Specify which of the default metrics to exclude from the classification analysis:
Precision
MCC
Recall
F1-Score
Accuracy
- custom_metrics_dict:
Pass the name of metric(s) with the function definition(s) in a dictionary.
- average_scoring:
Determines the type of averaging performed on the data.
- display_visuals:
Controls the visual display of the error analysis if it is able to run.
- Returns:
Performs all classification functionality with the provided feature data and target data, including:
plot_precision_recall_curve
classification_evaluation
plot_confusion_matrix
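The `average_scoring` options differ in how per-class scores are combined into one number. A pure-Python sketch of macro vs. micro precision (illustrative only; eflow presumably delegates to sklearn's metric implementations):

```python
def precision_per_class(y_true, y_pred, classes):
    # Precision for each class: true positives / predicted positives.
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        scores[c] = tp / predicted if predicted else 0.0
    return scores

y_true, y_pred = [0, 0, 1, 1], [0, 1, 1, 1]
per_class = precision_per_class(y_true, y_pred, [0, 1])

# 'macro': unweighted mean of per-class scores.
macro = sum(per_class.values()) / len(per_class)
# 'micro': pool all decisions globally (for single-label data this
# equals plain accuracy).
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

'weighted' is the remaining option: like macro, but each class's score is weighted by its support (number of true instances).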
-
plot_confusion_matrix
(X, y, pred_name, dataset_name, thresholds=None, display_visuals=True, save_file=True, title=None, normalize=False, hide_zeros=False, hide_counts=False, x_tick_rotation=0, ax=None, figsize=(13, 10), cmap='Blues', title_fontsize='large', text_fontsize='medium')[source]¶ From scikit-plot documentation Link: http://tinyurl.com/y3ym5pyc Creates a confusion matrix plot based on the model's predictions.
- Args:
- X:
Feature matrix.
- y:
Target data vector.
- pred_name:
The name of the prediction function in question, stored in 'self.__pred_funcs_dict'.
- dataset_name:
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- thresholds:
If the model outputs a probability list/numpy array, thresholds are applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- display_visuals:
Visualize graph if needed.
- save_file:
Whether or not to save the file.
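What the plot summarizes can be computed by hand: a matrix whose rows are true classes and whose columns are predicted classes. A minimal sketch:

```python
def confusion_matrix(y_true, y_pred, classes):
    # rows: true class, columns: predicted class.
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix
```

Off-diagonal cells are the misclassifications; with `normalize=True` the plotted values would be each row divided by its total.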
-
plot_cumulative_gain
(X, y, pred_name, dataset_name, thresholds=None, display_visuals=True, save_file=True, title=None, ax=None, figsize=(13, 10), title_fontsize='large', text_fontsize='medium')[source]¶ From scikit-plot documentation Link: http://tinyurl.com/y3ym5pyc Generates the cumulative gains plot from labels and scores/probabilities.
- Args:
- X:
Feature matrix.
- y:
Target data vector.
- pred_name:
The name of the prediction function in question, stored in 'self.__pred_funcs_dict'.
- dataset_name:
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- thresholds:
If the model outputs a probability list/numpy array, thresholds are applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- display_visuals:
Visualize graph if needed.
- save_file:
Whether or not to save the file.
-
plot_ks_statistic
(X, y, pred_name, dataset_name, thresholds=None, display_visuals=True, save_file=True, title=None, ax=None, figsize=(13, 10), title_fontsize='large', text_fontsize='medium')[source]¶ From scikit-plot documentation Link: http://tinyurl.com/y3ym5pyc Generates the KS Statistic plot from labels and scores/probabilities.
- Args:
- X:
Feature matrix.
- y:
Target data vector.
- pred_name:
The name of the prediction function in question, stored in 'self.__pred_funcs_dict'.
- dataset_name:
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- thresholds:
If the model outputs a probability list/numpy array, thresholds are applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- display_visuals:
Visualize graph if needed.
- save_file:
Whether or not to save the file.
-
plot_lift_curve
(X, y, pred_name, dataset_name, thresholds=None, display_visuals=True, save_file=True, title=None, ax=None, figsize=(13, 10), title_fontsize='large', text_fontsize='medium')[source]¶ From scikit-plot documentation Link: http://tinyurl.com/y3ym5pyc The lift curve is used to determine the effectiveness of a binary classifier. A detailed explanation can be found at http://tinyurl.com/csegj9. The implementation here works only for binary classification.
- Args:
- X:
Feature matrix.
- y:
Target data vector.
- pred_name:
The name of the prediction function in question, stored in 'self.__pred_funcs_dict'.
- dataset_name:
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- thresholds:
If the model outputs a probability list/numpy array, thresholds are applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- display_visuals:
Visualize graph if needed.
- save_file:
Whether or not to save the file.
-
plot_precision_recall_curve
(X, y, pred_name, dataset_name, thresholds=None, display_visuals=True, save_file=True, title=None, plot_micro=True, classes_to_plot=None, ax=None, figsize=(13, 10), cmap='nipy_spectral', title_fontsize='large', text_fontsize='medium')[source]¶ From scikit-plot documentation Link: http://tinyurl.com/y3ym5pyc Plots a precision-recall curve based on the model's predictions.
- Args:
- X:
Feature matrix.
- y:
Target data vector.
- pred_name:
The name of the prediction function in question, stored in 'self.__pred_funcs_dict'.
- dataset_name:
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- thresholds:
If the model outputs a probability list/numpy array, thresholds are applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- display_visuals:
Visualize graph if needed.
- save_file:
Whether or not to save the file.
-
plot_roc_curve
(X, y, pred_name, dataset_name, thresholds=None, display_visuals=True, save_file=True, title=None, ax=None, figsize=(13, 10), title_fontsize='large', text_fontsize='medium')[source]¶ From scikit-plot documentation Link: http://tinyurl.com/y3ym5pyc Creates ROC curves from labels and predicted probabilities.
- Args:
- X:
Feature matrix.
- y:
Target data vector.
- pred_name:
The name of the prediction function in question, stored in 'self.__pred_funcs_dict'.
- dataset_name:
The dataset's name; this will create a sub-directory in which your generated graphs will be nested.
- thresholds:
If the model outputs a probability list/numpy array, thresholds are applied to the model's output. For classification only; will not affect the direct output of the probabilities.
- display_visuals:
Visualize graph if needed.
- save_file:
Whether or not to save the file.
-