- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to evaluate a suite of probabilistic models and improve their performance with appropriate data preparation.
- How to fit a final model and use it to predict probabilities for specific cases.
- Haberman Breast Cancer Survival Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Probabilistic Models
- Probabilistic Algorithm Evaluation
- Model Evaluation With Scaled Inputs
- Model Evaluation With Power Transforms

- Make Prediction on New Data
- The age of the patient at the time of the operation.
- The two-digit year of the operation.
- The number of “positive axillary nodes” detected, a measure of whether a cancer has spread.
- Haberman’s Survival Data (haberman.names)
- Download Haberman’s Survival Data (haberman.csv)
- BrierSkillScore = 1.0 – (ModelBrierScore / ReferenceBrierScore)
- Logistic Regression (LogisticRegression)
- Linear Discriminant Analysis (LinearDiscriminantAnalysis)
- Quadratic Discriminant Analysis (QuadraticDiscriminantAnalysis)
- Gaussian Naive Bayes (GaussianNB)
- Multinomial Naive Bayes (MultinomialNB)
- Gaussian Process Classifier (GaussianProcessClassifier)
- pandas.read_csv API
- pandas.DataFrame.describe API
- pandas.DataFrame.hist API
- sklearn.model_selection.RepeatedStratifiedKFold API.
- sklearn.preprocessing.LabelEncoder API.
- sklearn.preprocessing.PowerTransformer API
- Haberman’s Survival Data Set, UCI Machine Learning Repository.
- Haberman’s Survival Data Set CSV File
- Haberman’s Survival Data Set Names File
- Imbalanced classification is specifically hard because of the severely skewed class distribution and the unequal misclassification costs.
- The difficulty of imbalanced classification is compounded by properties such as dataset size, label noise, and data distribution.
- How to develop an intuition for the compounding effects on modeling difficulty posed by different dataset properties.
- Why Imbalanced Classification Is Hard
- Compounding Effect of Dataset Size
- Compounding Effect of Label Noise
- Compounding Effect of Data Distribution
- Skewed Class Distribution
- Unequal Cost of Misclassification Errors
- Dataset Size.
- Label Noise.
- Data Distribution.
- Concept Learning And The Problem Of Small Disjuncts, 1989.
- Learning from Imbalanced Data Sets, 2018.
- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- sklearn.datasets.make_classification API.
- One-class classification is a field of machine learning that provides techniques for outlier and anomaly detection.
- How to adapt one-class classification algorithms for imbalanced classification with a severely skewed class distribution.
- How to fit and evaluate one-class classification algorithms such as SVM, isolation forest, elliptic envelope, and local outlier factor.
- One-Class Classification for Imbalanced Data
- One-Class Support Vector Machines
- Isolation Forest
- Minimum Covariance Determinant
- Local Outlier Factor
- **Negative Case**: Normal or inlier.
- **Positive Case**: Anomaly or outlier.
- **Inlier Prediction**: +1
- **Outlier Prediction**: -1
- Estimating the Support of a High-Dimensional Distribution, 2001.
- Isolation Forest, 2008.
- Isolation-Based Anomaly Detection, 2012.
- A Fast Algorithm for the Minimum Covariance Determinant Estimator, 2012.
- Minimum Covariance Determinant and Extensions, 2017.
- LOF: Identifying Density-based Local Outliers, 2000.
- Learning from Imbalanced Data Sets, 2018.
- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- Novelty and Outlier Detection, scikit-learn API.
- sklearn.svm.OneClassSVM API.
- sklearn.ensemble.IsolationForest API.
- sklearn.covariance.EllipticEnvelope API.
- sklearn.neighbors.LocalOutlierFactor API.
- Outlier, Wikipedia.
- Anomaly detection, Wikipedia.
- One-class classification, Wikipedia.
- The default threshold for interpreting probabilities to class labels is 0.5, and tuning this hyperparameter is called threshold moving.
- How to calculate the optimal threshold for the ROC Curve and Precision-Recall Curve directly.
- How to manually search threshold values for a chosen model and model evaluation metric.
- Converting Probabilities to Class Labels
- Threshold-Moving for Imbalanced Classification
- Optimal Threshold for ROC Curve
- Optimal Threshold for Precision-Recall Curve
- Optimal Threshold Tuning
- Prediction < 0.5 = Class 0
- Prediction >= 0.5 = Class 1
- The predicted probabilities are not calibrated, e.g. those predicted by an SVM or decision tree.
- The metric used to train the model is different from the metric used to evaluate a final model.
- The class distribution is severely skewed.
- The cost of one type of misclassification is more important than another type of misclassification.
- 1. Fit Model on the Training Dataset.
- 2. Predict Probabilities on the Test Dataset.
- 3. For each threshold in Thresholds:
- 3a. Convert probabilities to Class Labels using the threshold.
- 3b. Evaluate Class Labels.
- 3c. If Score is Better than Best Score.
- 3ci. Adopt Threshold.

- 4. Use Adopted Threshold When Making Class Predictions on New Data.

- Sensitivity = TruePositive / (TruePositive + FalseNegative)
- Specificity = TrueNegative / (FalsePositive + TrueNegative)
- Sensitivity = True Positive Rate
- Specificity = 1 – False Positive Rate
- G-Mean = sqrt(Sensitivity * Specificity)
- J = Sensitivity + Specificity – 1
- J = Sensitivity + (1 – FalsePositiveRate) – 1
- J = TruePositiveRate – FalsePositiveRate
- F-Measure = (2 * Precision * Recall) / (Precision + Recall)
- Machine Learning from Imbalanced Data Sets 101, 2000.
- Training Cost-sensitive Neural Networks With Methods Addressing The Class Imbalance Problem, 2005.
- Learning from Imbalanced Data Sets, 2018.
- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- sklearn.metrics.roc_curve API.
- imblearn.metrics.geometric_mean_score API.
- sklearn.metrics.precision_recall_curve API.
- Discrimination Threshold, Yellowbrick.
- Youden’s J statistic, Wikipedia.
- Receiver operating characteristic, Wikipedia.
- How gradient boosting works from a high level and how to develop an XGBoost model for classification.
- How the XGBoost training algorithm can be modified to weight error gradients proportional to positive class importance during training.
- How to configure the positive class weight for the XGBoost training algorithm and how to grid search different configurations.
- Imbalanced Classification Dataset
- XGBoost Model for Classification
- Weighted XGBoost for Class Imbalance
- Tune the Class Weighting Hyperparameter
- **Small Gradient**: Small error or correction to the model.
- **Large Gradient**: Large error or correction to the model.
- scale_pos_weight = total_negative_examples / total_positive_examples
- 1 (default)
- 10
- 25
- 50
- 75
- 99 (recommended)
- 100
- 1000
- XGBoost: A Scalable Tree Boosting System, 2016.
- Learning from Imbalanced Data Sets, 2018.
- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- sklearn.datasets.make_classification API.
- xgboost.XGBClassifier API.
- XGBoost Parameters, API Documentation.
- Notes on Parameter Tuning, API Documentation.

## How to Develop a Probabilistic Model of Breast Cancer Patient Survival

Developing a probabilistic model is challenging in general, although it is made more so when there is skew in the distribution of cases, referred to as an imbalanced dataset.

The **Haberman Dataset** describes the five-year or greater survival of breast cancer patients in the 1950s and 1960s and mostly contains patients that survived. This standard machine learning dataset can be used as the basis of developing a probabilistic model that predicts the probability of survival of a patient given a few details of their case.

Given the skewed distribution in cases in the dataset, careful attention must be paid to both the choice of predictive models to ensure that calibrated probabilities are predicted, and to the choice of model evaluation to ensure that the models are selected based on the skill of their predicted probabilities rather than crisp survival vs. non-survival class labels.

In this tutorial, you will discover how to develop a model to predict the probability of patient survival on an imbalanced dataset.

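After completing this tutorial, you will know:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to evaluate a suite of probabilistic models and improve their performance with appropriate data preparation.
- How to fit a final model and use it to predict probabilities for specific cases.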


Let’s get started.

## Tutorial Overview

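This tutorial is divided into five parts; they are:

- Haberman Breast Cancer Survival Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Probabilistic Models
    - Probabilistic Algorithm Evaluation
    - Model Evaluation With Scaled Inputs
    - Model Evaluation With Power Transforms
- Make Prediction on New Data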

## Haberman Breast Cancer Survival Dataset

In this project, we will use a small breast cancer survival dataset, referred to generally as the “Haberman Dataset.”

The dataset describes breast cancer patient data and the outcome is patient survival: specifically, whether the patient survived for five years or longer, or did not survive.

This is a standard dataset used in the study of imbalanced classification. According to the dataset description, the operations were conducted between 1958 and 1970 at the University of Chicago’s Billings Hospital.

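There are 306 examples in the dataset, and there are 3 input variables; they are:

- The age of the patient at the time of the operation.
- The two-digit year of the operation.
- The number of “positive axillary nodes” detected, a measure of whether a cancer has spread.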

As such, we have no control over the selection of cases that make up the dataset or features to use in those cases, other than what is available in the dataset.

Although the dataset describes breast cancer patient survival, given the small dataset size and the fact the data is based on breast cancer diagnosis and operations many decades ago, any models built on this dataset are not expected to generalize.

**To be crystal clear**, we are not “*solving breast cancer*.” We are exploring a standard imbalanced classification dataset.

You can learn more about the dataset here:

- Haberman’s Survival Data (haberman.names)
- Download Haberman’s Survival Data (haberman.csv)

We will choose to frame this dataset as the prediction of a probability of patient survival.

That is:

Given patient breast cancer surgery details, what is the probability of survival of the patient to five years or more?

This will provide the basis for exploring probabilistic algorithms that can predict a probability instead of a class label and metrics for evaluating models that predict probabilities instead of class labels.

Next, let’s take a closer look at the data.

## Explore the Dataset

First, download the dataset and save it in your current working directory with the name “*haberman.csv*”.

Review the contents of the file.

The first few lines of the file should look as follows:

```
30,64,1,1
30,62,3,1
30,65,0,1
31,59,2,1
31,65,4,1
33,58,10,1
33,60,0,1
34,59,0,2
34,66,9,2
34,58,30,1
...
```

We can see that the patients have an age like 30 or 31 (column 1), that operations occurred in years like 64 and 62 for 1964 and 1962 respectively (column 2), and that “*axillary nodes*” has values like 1 and 0 (column 3).

All values are numeric; specifically, they are integers. There are no missing values marked with a “*?*” character.

We can also see that the class label (column 4) has a value of either 1 for patient survival or 2 for patient non-survival.
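As a quick check of the column layout just described, the sample rows can be parsed with Python's standard csv module. This is only a minimal sketch: the rows are pasted inline from the listing above rather than read from haberman.csv, and the column names are our own labels, since the file has no header line.

```python
# parse the first few sample rows and confirm the column layout
from csv import reader
from io import StringIO

# first five rows of the file, pasted inline (assumption: same layout as haberman.csv)
sample = "30,64,1,1\n30,62,3,1\n30,65,0,1\n31,59,2,1\n31,65,4,1\n"
# column names are our own labels; the file has no header line
columns = ['age', 'year', 'nodes', 'class']
# convert every field to an integer and pair it with its column name
rows = [dict(zip(columns, map(int, fields))) for fields in reader(StringIO(sample))]
for row in rows:
    print(row)
```

Each printed dictionary shows the three inputs followed by the class label in the fourth position.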

Firstly, we can load the CSV dataset and summarize each column using a five-number summary. The dataset can be loaded as a *DataFrame* using the read_csv() Pandas function, specifying the location and the names of the columns as there is no header line.

```python
...
# define the dataset location
filename = 'haberman.csv'
# define the dataset column names
columns = ['age', 'year', 'nodes', 'class']
# load the csv file as a data frame
dataframe = read_csv(filename, header=None, names=columns)
```

We can then call the describe() function to create a report of the five-number summary of each column and print the contents of the report.

A five-number summary for a column includes useful details like the min and max values; the mean and standard deviation, which are useful if the variable has a Gaussian distribution; and the 25th, 50th, and 75th percentiles, which are useful if the variable does not have a Gaussian distribution.

```python
...
# summarize each column
report = dataframe.describe()
print(report)
```
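To make the summary statistics concrete, the same quantities can be computed by hand with the standard library. The sketch below uses only the ages from the first ten rows of the file, not the full column, and assumes that the 'inclusive' quantile method (linear interpolation between data points) matches the Pandas default.

```python
# five-number summary of a small sample using only the standard library
from statistics import mean, stdev, quantiles

# ages from the first ten rows of the file (a sample, not the full column)
ages = [30, 30, 30, 31, 31, 33, 33, 34, 34, 34]
# method='inclusive' interpolates linearly between data points
q1, q2, q3 = quantiles(ages, n=4, method='inclusive')
print('min=%d, 25%%=%.2f, 50%%=%.2f, 75%%=%.2f, max=%d' % (min(ages), q1, q2, q3, max(ages)))
# sample standard deviation (divides by n - 1)
print('mean=%.2f, std=%.2f' % (mean(ages), stdev(ages)))
```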

Tying this together, the complete example of loading and summarizing the dataset columns is listed below.

```python
# load and summarize the dataset
from pandas import read_csv
# define the dataset location
filename = 'haberman.csv'
# define the dataset column names
columns = ['age', 'year', 'nodes', 'class']
# load the csv file as a data frame
dataframe = read_csv(filename, header=None, names=columns)
# summarize each column
report = dataframe.describe()
print(report)
```

Running the example loads the dataset and reports a five-number summary for each of the three input variables and the output variable.

Looking at the age, we can see that the youngest patient was 30 and the oldest was 83; that is quite a spread. The mean patient age was about 52 years. If the occurrence of cancer is somewhat random, we might expect this distribution to be Gaussian.

We can see that all operations were performed between 1958 and 1969. If the number of breast cancer patients is somewhat fixed over time, we might expect this variable to have a uniform distribution.

We can see nodes have values between 0 and 52. This might be a cancer diagnostic related to lymphatic nodes.

```
              age        year       nodes       class
count  306.000000  306.000000  306.000000  306.000000
mean    52.457516   62.852941    4.026144    1.264706
std     10.803452    3.249405    7.189654    0.441899
min     30.000000   58.000000    0.000000    1.000000
25%     44.000000   60.000000    0.000000    1.000000
50%     52.000000   63.000000    1.000000    1.000000
75%     60.750000   65.750000    4.000000    2.000000
max     83.000000   69.000000   52.000000    2.000000
```

All variables are integers. Therefore, it might be helpful to look at each variable as a histogram to get an idea of the variable distribution.

This might be helpful in case we choose models later that are sensitive to the data distribution or scale of the data, in which case, we might need to transform or rescale the data.

We can create a histogram of each variable in the DataFrame by calling the hist() function.

The complete example is listed below.

```python
# create histograms of each variable
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'haberman.csv'
# define the dataset column names
columns = ['age', 'year', 'nodes', 'class']
# load the csv file as a data frame
dataframe = read_csv(filename, header=None, names=columns)
# create a histogram plot of each variable
dataframe.hist()
pyplot.show()
```

Running the example creates a histogram for each variable.

We can see that age appears to have a Gaussian distribution, as we might have expected. We can also see that year has a uniform distribution, mostly, with an outlier in the first year showing nearly double the number of operations.

We can see nodes has an exponential type distribution with perhaps most examples showing 0 nodes, with a long tail of values after that. A transform to un-bunch this distribution might help some models later on.

Finally, we can see the two-class values with an unequal class distribution, showing perhaps 2- or 3-times more survival than non-survival cases.

It may be helpful to know how imbalanced the dataset actually is.

We can use the Counter object to count the number of examples in each class, then use those counts to summarize the distribution.

The complete example is listed below.

```python
# summarize the class ratio
from pandas import read_csv
from collections import Counter
# define the dataset location
filename = 'haberman.csv'
# define the dataset column names
columns = ['age', 'year', 'nodes', 'class']
# load the csv file as a data frame
dataframe = read_csv(filename, header=None, names=columns)
# summarize the class distribution
target = dataframe['class'].values
counter = Counter(target)
for k, v in counter.items():
    per = v / len(target) * 100
    print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))
```

Running the example summarizes the class distribution for the dataset.

We can see that class 1 for survival has the most examples at 225, or about 74 percent of the dataset. We can see class 2 for non-survival has fewer examples at 81, or about 26 percent of the dataset.

The class distribution is skewed, but it is not severely imbalanced.

```
Class=1, Count=225, Percentage=73.529%
Class=2, Count=81, Percentage=26.471%
```

Now that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

## Model Test and Baseline Result

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain 306/10 or about 30 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is about 74 percent to 26 percent survival and non-survival.

Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This means a single model will be fit and evaluated 10 * 3, or 30, times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
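To build intuition for what "repeated" and "stratified" mean here, the sketch below deals the indices of each class out to ten folds and repeats the partition three times. It is a simplified stand-in written for illustration, not the algorithm scikit-learn uses; the helper name stratified_folds and the per-repeat seeds are our own choices.

```python
# simplified stratified fold assignment (illustrative; not scikit-learn's exact algorithm)
from random import Random

def stratified_folds(labels, k, seed):
    # shuffle the indices of each class, then deal them out to the folds in turn
    rng = Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

# class labels with the same counts as the dataset: 225 negative, 81 positive
labels = [0] * 225 + [1] * 81
n_splits, n_repeats = 10, 3
total_evaluations = 0
for repeat in range(n_repeats):
    # a different seed per repeat gives a different partition each time
    folds = stratified_folds(labels, n_splits, seed=repeat)
    total_evaluations += len(folds)
    if repeat == 0:
        for fold in folds:
            positives = sum(labels[i] for i in fold)
            print('size=%d, positive=%d (%.1f%%)' % (len(fold), positives, 100.0 * positives / len(fold)))
print('Total evaluations: %d' % total_evaluations)
```

Every fold ends up with roughly 30 examples and a positive rate close to the 26 percent seen in the full dataset, and the model would be fit and scored once per fold per repeat.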

Given that we are interested in predicting a probability of survival, we need a performance metric that evaluates the skill of a model based on the predicted probabilities. In this case, we will use the Brier score, which calculates the mean squared error between the predicted probabilities and the expected probabilities.

This can be calculated using the brier_score_loss() scikit-learn function. This score is minimized, with a perfect score of 0.0. We can invert the score to be maximizing by comparing a predicted score to a reference score, showing how much better the model is compared to the reference, from 0.0 for the same skill to 1.0 for perfect skill. Any model that achieves a score less than 0.0 has less skill than the reference model. This is called the Brier Skill Score, or BSS for short.

- BrierSkillScore = 1.0 - (ModelBrierScore / ReferenceBrierScore)
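The two scores can also be computed by hand, which makes the scale of the skill score easier to see. The sketch below is plain Python rather than the scikit-learn function; the function names are our own, and the outcomes and predictions are made up purely for illustration.

```python
# Brier score and Brier Skill Score computed by hand
def brier_score(y_true, y_prob):
    # mean squared error between predicted probabilities and 0/1 outcomes
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def skill_score(y_true, y_prob, ref_prob):
    # compare the model to a reference that always predicts ref_prob
    bs_ref = brier_score(y_true, [ref_prob] * len(y_true))
    bs_model = brier_score(y_true, y_prob)
    return 1.0 - (bs_model / bs_ref)

# toy outcomes (made up for illustration): one positive case in four
y_true = [0, 0, 0, 1]
base_rate = sum(y_true) / len(y_true)
# made-up predictions: a perfect model, the reference itself, and a poor model
perfect = [0.0, 0.0, 0.0, 1.0]
reference = [base_rate] * 4
poor = [0.9, 0.8, 0.7, 0.1]
for name, probs in [('perfect', perfect), ('reference', reference), ('poor', poor)]:
    print('%s: BSS=%.3f' % (name, skill_score(y_true, probs, base_rate)))
```

The perfect predictions score 1.0, the reference scores 0.0 against itself, and the poor predictions score below 0.0.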

It is customary for an imbalanced dataset to model the minority class as the positive class. In this dataset, the positive class represents non-survival. This means that we will be predicting the probability of non-survival and will need to calculate the complement of the predicted probability in order to get the probability of survival.

As such, we can map the 1 class values (survival) to the negative case with a 0 class label, and the 2 class values (non-survival) to the positive case with a class label of 1. This can be achieved using the LabelEncoder class.

For example, the *load_dataset()* function below will load the dataset, split the variable columns into input and outputs, and then encode the target variable to 0 and 1 values.

```python
# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y
```
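The effect of the label encoding step can be reproduced in a few lines of plain Python: the distinct class values are sorted and numbered from zero. This is a sketch of the idea rather than the LabelEncoder implementation, and the short list of raw labels is made up for illustration.

```python
# what label encoding does to the class column, in plain Python
# class values as they appear in the file: 1 = survival, 2 = non-survival
raw = [1, 1, 1, 2, 2, 1]
# sort the distinct values and number them from zero
classes = sorted(set(raw))
mapping = {value: index for index, value in enumerate(classes)}
encoded = [mapping[value] for value in raw]
print(mapping)
print(encoded)
```

Survival (1) becomes the negative class 0 and non-survival (2) becomes the positive class 1.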

Next, we can calculate the Brier skill score for a model.

First, we need a Brier score for a reference prediction. A reference prediction for a problem in which we are predicting probabilities is the probability of the positive class label in the dataset.

In this case, the positive class label represents non-survival and makes up about 26% of the dataset. Therefore, predicting about 0.26471 represents the worst-case or baseline performance for a predictive model on this dataset. Any model that has a Brier score lower (better) than this reference has some skill, whereas any model that has a Brier score higher than the reference has no skill. The Brier Skill Score captures this important relationship. We can calculate the Brier score for this default prediction strategy automatically for each test fold in the k-fold cross-validation process, then use it as a point of comparison for a given model.

```python
...
# calculate reference brier score
ref_probs = [0.26471 for _ in range(len(y_true))]
bs_ref = brier_score_loss(y_true, ref_probs)
```
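For a constant prediction equal to the fraction of positive cases, the reference Brier score has a simple closed form. The short derivation below uses our own notation, which does not appear elsewhere in the tutorial: $N$ cases, outcomes $y_i \in \{0, 1\}$, and a constant predicted probability $p$ equal to the positive rate.

$$
\mathrm{BS}_{\mathrm{ref}} = \frac{1}{N}\sum_{i=1}^{N}(p - y_i)^2 = p\,(1 - p)^2 + (1 - p)\,p^2 = p\,(1 - p)
$$

With $p = 81/306 \approx 0.26471$ on the full dataset, this gives $\mathrm{BS}_{\mathrm{ref}} \approx 0.1946$, so a model needs a Brier score below roughly 0.195 to show positive skill. Within individual folds the positive rate, and therefore the reference score, will differ slightly.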

The Brier score can then be calculated for the predictions from a model and used in the calculation of the Brier Skill Score.

The *brier_skill_score()* function below implements this and calculates the Brier Skill Score for a given set of true labels and predictions on the same test set. Any model that achieves a BSS above 0.0 means it shows skill on this dataset.

```python
from numpy import count_nonzero
from sklearn.metrics import brier_score_loss

# calculate brier skill score (BSS)
def brier_skill_score(y_true, y_prob):
    # calculate reference brier score from the fraction of positive cases
    pos_prob = count_nonzero(y_true) / len(y_true)
    ref_probs = [pos_prob for _ in range(len(y_true))]
    bs_ref = brier_score_loss(y_true, ref_probs)
    # calculate model brier score
    bs_model = brier_score_loss(y_true, y_prob)
    # calculate skill score
    return 1.0 - (bs_model / bs_ref)
```

Next, we can make use of the *brier_skill_score()* function to evaluate a model using repeated stratified k-fold cross-validation.

To use our custom performance metric, we can use the make_scorer() scikit-learn function that takes the name of our custom function and creates a metric that we can use to evaluate models with the scikit-learn API. We will set the *needs_proba* argument to True so that the models being evaluated make predictions using the *predict_proba()* function and give probabilities instead of class labels. Note that newer scikit-learn releases deprecate *needs_proba* in favor of *response_method='predict_proba'*, so check which argument your installed version expects.

```python
...
# define the model evaluation metric
metric = make_scorer(brier_skill_score, needs_proba=True)
```

The *evaluate_model()* function below defines the evaluation procedure with our custom evaluation metric, taking the entire training dataset and model as input, then returns the sample of scores across each fold and each repeat.

```python
# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(brier_skill_score, needs_proba=True)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores
```

Finally, we can use our three functions and evaluate a model.

First, we can load the dataset and summarize the input and output arrays to confirm they were loaded correctly.

```python
...
# define the location of the dataset
full_path = 'haberman.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
```

In this case, we will evaluate the baseline strategy of predicting the distribution of positive examples in the training set as the probability of each case in the test set.

This can be implemented automatically using the DummyClassifier class and setting the “*strategy*” to “*prior*” that will predict the prior probability of each class in the training dataset, which for the positive class we know is about 0.26471.

```python
...
# define the reference model
model = DummyClassifier(strategy='prior')
```

We can then evaluate the model by calling our *evaluate_model()* function and report the mean and standard deviation of the results.

```python
...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean BSS: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Tying this all together, the complete example of evaluating the baseline model on the Haberman breast cancer survival dataset using the Brier Skill Score is listed below.

We would expect the baseline model to achieve a BSS of 0.0, e.g. the same as the reference model because it is the reference model.

```python
# baseline model and test harness for the haberman dataset
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import brier_score_loss
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# calculate brier skill score (BSS)
def brier_skill_score(y_true, y_prob):
    # calculate reference brier score
    ref_probs = [0.26471 for _ in range(len(y_true))]
    bs_ref = brier_score_loss(y_true, ref_probs)
    # calculate model brier score
    bs_model = brier_score_loss(y_true, y_prob)
    # calculate skill score
    return 1.0 - (bs_model / bs_ref)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(brier_skill_score, needs_proba=True)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'haberman.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='prior')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean BSS: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example first loads the dataset and reports the number of cases correctly as 306 and the distribution of class labels for the negative and positive cases as we expect.

The *DummyClassifier* with our default strategy is then evaluated using repeated stratified k-fold cross-validation and the mean and standard deviation of the Brier Skill Score is reported as 0.0. This is as we expected, as we are using the test harness to evaluate the reference strategy.

```
(306, 3) (306,) Counter({0: 225, 1: 81})
Mean BSS: -0.000 (0.000)
```

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

## Evaluate Probabilistic Models

In this section, we will use the test harness developed in the previous section to evaluate a suite of algorithms and then improvements to those algorithms, such as data preparation schemes.

### Probabilistic Algorithm Evaluation

We will evaluate a suite of models that are known to be effective at predicting probabilities.

Specifically, these are models that are fit under a probabilistic framework and explicitly predict a calibrated probability for each example. As such, this makes them well-suited to this dataset, even with the class imbalance.

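We will evaluate the following six probabilistic models implemented with the scikit-learn library:

- Logistic Regression (LogisticRegression)
- Linear Discriminant Analysis (LinearDiscriminantAnalysis)
- Quadratic Discriminant Analysis (QuadraticDiscriminantAnalysis)
- Gaussian Naive Bayes (GaussianNB)
- Multinomial Naive Bayes (MultinomialNB)
- Gaussian Process Classifier (GaussianProcessClassifier)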

We are interested in directly comparing the results from each of these algorithms. We will compare each algorithm based on the mean score, as well as based on their distribution of scores.

We can define a list of the models that we want to evaluate, each with its default configuration or configured so as not to produce a warning.

```python
...
# define models
models = [LogisticRegression(solver='lbfgs'), LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(), GaussianNB(), MultinomialNB(),
    GaussianProcessClassifier()]
```

We can then enumerate each model, record a short name for the model (the first seven characters of its class name), evaluate it, report the mean BSS, and store the results for the end of the run.

```python
...
names, values = list(), list()
# evaluate each model
for model in models:
    # get a name for the model
    name = type(model).__name__[:7]
    # evaluate the model and store results
    scores = evaluate_model(X, y, model)
    # summarize and store
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    names.append(name)
    values.append(scores)
```
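One detail of the naming step is worth seeing in isolation: truncating class names to seven characters does not keep them all distinct. The small sketch below applies the same slice to the six class names as plain strings, so it runs without scikit-learn; the variable names are our own.

```python
# how the [:7] slice shortens each class name
class_names = ['LogisticRegression', 'LinearDiscriminantAnalysis',
    'QuadraticDiscriminantAnalysis', 'GaussianNB', 'MultinomialNB',
    'GaussianProcessClassifier']
short_names = [name[:7] for name in class_names]
print(short_names)
# names that occur more than once after truncation
duplicates = sorted({name for name in short_names if short_names.count(name) > 1})
print(duplicates)
```

Two of the models share the same short label, so when reading the results, the order of the models list is what tells them apart.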

At the end of the run, we can then create a box and whisker plot that shows the distribution of results from each algorithm, where the box shows the 25th, 50th, and 75th percentiles of the scores and the triangle shows the mean result. The whiskers of each plot give an idea of the extremes of each distribution.

```python
...
# plot the results
pyplot.boxplot(values, labels=names, showmeans=True)
pyplot.show()
```

Tying this together, the complete example is listed below.

```python
# compare probabilistic models on the haberman dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import brier_score_loss
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.gaussian_process import GaussianProcessClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# calculate brier skill score (BSS)
def brier_skill_score(y_true, y_prob):
    # calculate reference brier score
    ref_probs = [0.26471 for _ in range(len(y_true))]
    bs_ref = brier_score_loss(y_true, ref_probs)
    # calculate model brier score
    bs_model = brier_score_loss(y_true, y_prob)
    # calculate skill score
    return 1.0 - (bs_model / bs_ref)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(brier_skill_score, needs_proba=True)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'haberman.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models = [LogisticRegression(solver='lbfgs'), LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(), GaussianNB(), MultinomialNB(),
    GaussianProcessClassifier()]
names, values = list(), list()
# evaluate each model
for model in models:
    # get a name for the model
    name = type(model).__name__[:7]
    # evaluate the model and store results
    scores = evaluate_model(X, y, model)
    # summarize and store
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    names.append(name)
    values.append(scores)
# plot the results
pyplot.boxplot(values, labels=names, showmeans=True)
pyplot.show()
```

Running the example first summarizes the mean and standard deviation of the BSS for each algorithm (larger scores are better).

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, the results suggest that only two of the algorithms are not skillful, showing negative scores, and that perhaps the LogisticRegression (LR) and LinearDiscriminantAnalysis (LDA) algorithms are the best performing.

```
>Logisti 0.064 (0.123)
>LinearD 0.067 (0.136)
>Quadrat 0.027 (0.212)
>Gaussia 0.011 (0.209)
>Multino -0.213 (0.380)
>Gaussia -0.141 (0.047)
```

A box and whisker plot is created summarizing the distribution of results.

Interestingly, most if not all algorithms show a spread indicating that they may be unskillful on some of the runs. The distributions of the two top-performing models appear roughly equivalent, so choosing a model based on mean performance might be a good start.
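To make the meaning of a zero or negative BSS concrete, the following is a minimal sketch on made-up labels and probabilities. The arrays are hypothetical and the helper takes the reference probability as an argument, which differs from the hard-coded value used in the tutorial code.

```python
# Toy illustration of the Brier Skill Score (BSS); the labels and
# probabilities below are made up for demonstration only.
from numpy import array, full
from sklearn.metrics import brier_score_loss

def brier_skill_score(y_true, y_prob, ref_prob):
    # Brier score of a reference forecast that always predicts ref_prob
    bs_ref = brier_score_loss(y_true, full(len(y_true), ref_prob))
    # Brier score of the forecast being assessed
    bs_model = brier_score_loss(y_true, y_prob)
    return 1.0 - (bs_model / bs_ref)

# hypothetical labels: 2 positives out of 8 (base rate 0.25)
y_true = array([0, 0, 0, 0, 0, 0, 1, 1])
base_rate = y_true.mean()

# predicting the base rate everywhere reproduces the reference, so BSS is 0
bss_reference = brier_skill_score(y_true, full(8, base_rate), base_rate)
# probabilities that lean the right way give a positive BSS
good_probs = array([0.1, 0.1, 0.2, 0.1, 0.2, 0.3, 0.7, 0.8])
bss_good = brier_skill_score(y_true, good_probs, base_rate)
# probabilities that lean the wrong way give a negative BSS
bad_probs = array([0.6, 0.7, 0.5, 0.6, 0.4, 0.5, 0.2, 0.1])
bss_bad = brier_skill_score(y_true, bad_probs, base_rate)
print('reference=%.3f good=%.3f bad=%.3f' % (bss_reference, bss_good, bss_bad))
```

A fold where a model scores below 0.0 is therefore one where always predicting the base rate would have given better probabilities.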

This is a good start; let’s see if we can improve the results with basic data preparation.

### Model Evaluation With Scaled Inputs

It can be a good practice to scale data for some algorithms if the variables have different units of measure, as they do in this case.

Algorithms like LR and LDA are sensitive to the scale of the data and assume a Gaussian distribution for the input variables, which we don’t have in all cases.

Nevertheless, we can test the algorithms with standardization, where each variable is shifted to a zero mean and unit standard deviation. We will drop the *MultinomialNB* algorithm as it does not support negative input values.

We can achieve this by wrapping each model in a Pipeline where the first step is a StandardScaler, which will correctly be fit on the training dataset and applied to the test dataset within each k-fold cross-validation evaluation, preventing any data leakage.

```python
...
# create a pipeline
pip = Pipeline(steps=[('t', StandardScaler()),('m',model)])
# evaluate the model and store results
scores = evaluate_model(X, y, pip)
```
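As a small aside on what the StandardScaler step does to the inputs, here is a hedged sketch on a few made-up rows laid out like the dataset (age, year, nodes). The values are hypothetical and are not taken from haberman.csv.

```python
# Sketch: standardization shifts each column to zero mean and unit
# standard deviation. The rows below are invented for illustration.
from numpy import array
from sklearn.preprocessing import StandardScaler

# hypothetical rows: age, year of operation, number of nodes
X_demo = array([[30.0, 64.0, 1.0],
                [45.0, 60.0, 0.0],
                [52.0, 66.0, 9.0],
                [61.0, 58.0, 3.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)

# per-column statistics after scaling
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means.round(6))
print(col_stds.round(6))
```

Inside the Pipeline, the same fit happens on the training folds only, and the learned means and standard deviations are then applied to the held-out fold.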

The complete example of evaluating the five remaining algorithms with standardized input data is listed below.

```python
# compare probabilistic models with standardized input on the haberman dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import brier_score_loss
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# calculate brier skill score (BSS)
def brier_skill_score(y_true, y_prob):
    # calculate reference brier score
    ref_probs = [0.26471 for _ in range(len(y_true))]
    bs_ref = brier_score_loss(y_true, ref_probs)
    # calculate model brier score
    bs_model = brier_score_loss(y_true, y_prob)
    # calculate skill score
    return 1.0 - (bs_model / bs_ref)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(brier_skill_score, needs_proba=True)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'haberman.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models = [LogisticRegression(solver='lbfgs'), LinearDiscriminantAnalysis(),
    QuadraticDiscriminantAnalysis(), GaussianNB(), GaussianProcessClassifier()]
names, values = list(), list()
# evaluate each model
for model in models:
    # get a name for the model
    name = type(model).__name__[:7]
    # create a pipeline
    pip = Pipeline(steps=[('t', StandardScaler()),('m',model)])
    # evaluate the model and store results
    scores = evaluate_model(X, y, pip)
    # summarize and store
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    names.append(name)
    values.append(scores)
# plot the results
pyplot.boxplot(values, labels=names, showmeans=True)
pyplot.show()
```

Running the example again summarizes the mean and standard deviation of the BSS for each algorithm.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, we can see that the standardization has not had much of an impact on the algorithms, except the Gaussian Process Classifier (GPC). The performance of the GPC with standardization has shot up and is now the best-performing technique. This highlights the importance of preparing data to meet the expectations of each model.

```
>Logisti 0.065 (0.121)
>LinearD 0.067 (0.136)
>Quadrat 0.027 (0.212)
>Gaussia 0.011 (0.209)
>Gaussia 0.092 (0.106)
```

Box and whisker plots for each algorithm’s results are created, showing the difference in mean performance (green triangles) and the similar spread in scores between the three top-performing methods.

This suggests all three probabilistic methods are discovering the same general mapping of inputs to probabilities in the dataset.

Further data preparation, such as power transforms, can make the input variables more Gaussian.

### Model Evaluation With Power Transform

Power transforms, such as the Box-Cox and Yeo-Johnson transforms, are designed to change the distribution to be more Gaussian.

This will help with the “*age*” input variable in our dataset and may help with the “*nodes*” variable and un-bunch the distribution slightly.

We can use the PowerTransformer scikit-learn class to perform the Yeo-Johnson transform and automatically determine the best parameters to apply based on the dataset, e.g. how to best make each variable more Gaussian. Importantly, this transformer will also standardize the dataset as part of the transform, ensuring we keep the gains seen in the previous section.
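As a rough illustration of both effects, the sketch below applies a PowerTransformer with its default settings to a synthetic right-skewed variable. The exponential sample and the `skewness` helper are assumptions made for the demonstration; they are not part of the tutorial code or the Haberman data.

```python
# Sketch: a power transform reduces skew and, by default, also
# standardizes the result. The input is synthetic, not the real dataset.
from numpy.random import default_rng
from sklearn.preprocessing import PowerTransformer

def skewness(values):
    # third standardized moment of a 1D array
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# synthetic right-skewed column, loosely mimicking a count-like variable
rng = default_rng(1)
x = rng.exponential(scale=4.0, size=(1000, 1))

# defaults are method='yeo-johnson' and standardize=True
pt = PowerTransformer()
x_t = pt.fit_transform(x)

skew_before = skewness(x[:, 0])
skew_after = skewness(x_t[:, 0])
print('skew before=%.3f, after=%.3f' % (skew_before, skew_after))
print('mean=%.3f, std=%.3f' % (x_t.mean(), x_t.std()))
```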

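The power transform may make use of a *log()* function, which does not work on zero values. The Box-Cox method in particular requires strictly positive inputs, whereas the Yeo-Johnson method that PowerTransformer uses by default also accepts zero and negative values. We have zero values in our dataset; nevertheless, we will first rescale each variable to the range 0 to 1 using a MinMaxScaler so that all inputs are on a common, non-negative scale prior to the power transform.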
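A quick way to check how a given power transform method treats zero values is to try it on a small column that contains some. The sketch below uses made-up values; it shows scikit-learn's Yeo-Johnson method accepting zeros while the Box-Cox method rejects them with a ValueError.

```python
# Sketch: how the two PowerTransformer methods handle zero values.
# The column below is hypothetical and only stands in for a count variable.
from numpy import array, isfinite
from sklearn.preprocessing import PowerTransformer

x = array([[0.0], [0.0], [1.0], [3.0], [8.0], [21.0]])

# Yeo-Johnson is defined for zero and negative inputs
yj = PowerTransformer(method='yeo-johnson').fit_transform(x)
yj_ok = bool(isfinite(yj).all())

# Box-Cox requires strictly positive inputs, so the zeros are rejected
try:
    PowerTransformer(method='box-cox').fit_transform(x)
    boxcox_ok = True
except ValueError:
    boxcox_ok = False

print('yeo-johnson ok: %s, box-cox ok: %s' % (yj_ok, boxcox_ok))
```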

Again, we can use this transform in a Pipeline to ensure it is fit on the training dataset and applied to the train and test datasets correctly, without data leakage.

```python
...
# create a pipeline
pip = Pipeline(steps=[('t1', MinMaxScaler()), ('t2', PowerTransformer()),('m',model)])
# evaluate the model and store results
scores = evaluate_model(X, y, pip)
```

We will focus on the three top-performing methods, in this case, LR, LDA, and GPC.

The complete example is listed below.

```python
# compare probabilistic models with power transforms on the haberman dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import brier_score_loss
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# calculate brier skill score (BSS)
def brier_skill_score(y_true, y_prob):
    # calculate reference brier score
    ref_probs = [0.26471 for _ in range(len(y_true))]
    bs_ref = brier_score_loss(y_true, ref_probs)
    # calculate model brier score
    bs_model = brier_score_loss(y_true, y_prob)
    # calculate skill score
    return 1.0 - (bs_model / bs_ref)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(brier_skill_score, needs_proba=True)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'haberman.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models = [LogisticRegression(solver='lbfgs'), LinearDiscriminantAnalysis(),
    GaussianProcessClassifier()]
names, values = list(), list()
# evaluate each model
for model in models:
    # get a name for the model
    name = type(model).__name__[:7]
    # create a pipeline
    pip = Pipeline(steps=[('t1', MinMaxScaler()), ('t2', PowerTransformer()),('m',model)])
    # evaluate the model and store results
    scores = evaluate_model(X, y, pip)
    # summarize and store
    print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores)))
    names.append(name)
    values.append(scores)
# plot the results
pyplot.boxplot(values, labels=names, showmeans=True)
pyplot.show()
```

Running the example again summarizes the mean and standard deviation of the BSS for each algorithm.

Your specific results will vary given the stochastic nature of the learning algorithms; consider running the example a few times.

In this case, we can see a further lift in model skill for the three models that were evaluated. We can see that the LR appears to have out-performed the other two methods.

```
>Logisti 0.111 (0.123)
>LinearD 0.106 (0.147)
>Gaussia 0.103 (0.096)
```

Box and whisker plots are created for the results from each algorithm, suggesting perhaps a smaller and more focused spread for LR compared to the LDA, which was the second-best performing method.

All methods still show skill on average; however, the distribution of scores shows runs that drop below 0.0 (no skill) in some cases.

## Make Prediction on New Data

We will select the Logistic Regression model with a power transform on the input data as our final model.

We can define and fit this model on the entire training dataset.

```python
...
# fit the model
model = Pipeline(steps=[('t1', MinMaxScaler()), ('t2', PowerTransformer()),('m',LogisticRegression(solver='lbfgs'))])
model.fit(X, y)
```

Once fit, we can use it to make predictions for new data by calling the *predict_proba()* function. This will return two probabilities for each prediction, the first for survival and the second for non-survival, i.e. its complement.

For example:

```python
...
row = [31,59,2]
yhat = model.predict_proba([row])
# get percentage of survival
p_survive = yhat[0, 0] * 100
```
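The column order returned by *predict_proba()* follows the model's `classes_` attribute, so column 0 corresponds to the label encoded as 0. The sketch below checks this on a tiny made-up dataset with a single hypothetical feature; it is only meant to show the shape and ordering of the output, not the behavior of the final model.

```python
# Sketch: shape and column order of predict_proba() on invented data.
from numpy import array
from sklearn.linear_model import LogisticRegression

# one hypothetical feature; labels already encoded as 0 and 1
X_demo = array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression(solver='lbfgs').fit(X_demo, y_demo)
probs = clf.predict_proba(array([[0.5], [4.5]]))

# one row per input, one column per class, in the order of clf.classes_
print(clf.classes_)
print(probs.shape)
# each row sums to 1, so the second column is the complement of the first
row_sums = probs.sum(axis=1)
print(row_sums)
```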

To demonstrate this, we can use the fit model to make some predictions of probability for a few cases where we know there is survival and a few where we know there is not.

The complete example is listed below.

```python
# fit a model and make predictions on the haberman dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# define the location of the dataset
full_path = 'haberman.csv'
# load the dataset
X, y = load_dataset(full_path)
# fit the model
model = Pipeline(steps=[('t1', MinMaxScaler()), ('t2', PowerTransformer()),('m',LogisticRegression(solver='lbfgs'))])
model.fit(X, y)
# some survival cases
print('Survival Cases:')
data = [[31,59,2], [31,65,4], [34,60,1]]
for row in data:
    # make prediction
    yhat = model.predict_proba([row])
    # get percentage of survival
    p_survive = yhat[0, 0] * 100
    # summarize
    print('>data=%s, Survival=%.3f%%' % (row, p_survive))
# some non-survival cases
print('Non-Survival Cases:')
data = [[44,64,6], [34,66,9], [38,69,21]]
for row in data:
    # make prediction
    yhat = model.predict_proba([row])
    # get percentage of survival
    p_survive = yhat[0, 0] * 100
    # summarize
    print('data=%s, Survival=%.3f%%' % (row, p_survive))
```

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the probability of survival for cases where we know the patient survived, chosen from the dataset file. We can see that for the chosen survival cases, the probability of survival was high, between 77 percent and 86 percent.

Then some cases of non-survival are used as input to the model and the probability of survival is predicted. As we might have hoped, the probability of survival is modest, hovering around 53 percent to 63 percent.

```
Survival Cases:
>data=[31, 59, 2], Survival=83.597%
>data=[31, 65, 4], Survival=77.264%
>data=[34, 60, 1], Survival=86.776%
Non-Survival Cases:
data=[44, 64, 6], Survival=63.092%
data=[34, 66, 9], Survival=63.452%
data=[38, 69, 21], Survival=53.389%
```

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### APIs

### Dataset

## Summary

In this tutorial, you discovered how to develop a model to predict the probability of patient survival on an imbalanced dataset.

Specifically, you learned:

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop a Probabilistic Model of Breast Cancer Patient Survival appeared first on Machine Learning Mastery.

## Why Is Imbalanced Classification Difficult?

Imbalanced classification is primarily challenging as a predictive modeling task because of the severely skewed class distribution.

This is the cause for poor performance with traditional machine learning models and evaluation metrics that assume a balanced class distribution.

Nevertheless, there are additional properties of a classification dataset that are not only challenging for predictive modeling but also increase or compound the difficulty when modeling imbalanced datasets.

In this tutorial, you will discover data characteristics that compound the challenge of imbalanced classification.

After completing this tutorial, you will know:

Let’s get started.

## Tutorial Overview

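This tutorial is divided into four parts; they are:

- Why Imbalanced Classification Is Hard
- Compounding Effect of Dataset Size
- Compounding Effect of Label Noise
- Compounding Effect of Data Distribution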

## Why Imbalanced Classification Is Hard

Imbalanced classification is defined by a dataset with a skewed class distribution.

This is often exemplified by a binary (two-class) classification task where most of the examples belong to class 0 with only a few examples in class 1. The distribution may range in severity from 1:2, 1:10, 1:100, or even 1:1000.

Because the class distribution is not balanced, most machine learning algorithms will perform poorly and require modification to avoid simply predicting the majority class in all cases. Additionally, metrics like classification accuracy lose their meaning, and alternate methods for evaluating predictions on imbalanced examples are required, like ROC area under curve.
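To see why accuracy can mislead here, consider a model that always predicts the majority class. The following is a minimal sketch on a hypothetical 1:100 label vector, using scikit-learn's DummyClassifier as the stand-in for such a model; the features are placeholders because this classifier ignores its inputs.

```python
# Sketch: a majority-class predictor looks accurate but has no ranking skill.
# The labels are constructed by hand to give a 1:100 style distribution.
from numpy import array, zeros
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# 990 majority (class 0) and 10 minority (class 1) examples
y = array([0] * 990 + [1] * 10)
# placeholder features; the dummy model does not use them
X = zeros((len(y), 1))

model = DummyClassifier(strategy='most_frequent').fit(X, y)

# accuracy rewards predicting class 0 everywhere
acc = accuracy_score(y, model.predict(X))
# ROC AUC uses the predicted probability of class 1, which is constant here
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
print('accuracy=%.3f, roc_auc=%.3f' % (acc, auc))
```

The accuracy is high only because of the class distribution, while the ROC AUC shows the model cannot separate the two classes at all.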

This is the foundational challenge of imbalanced classification.

An additional level of complexity comes from the problem domain from which the examples were drawn.

It is common for the majority class to represent a normal case in the domain, whereas the minority class represents an abnormal case, such as a fault, fraud, outlier, anomaly, disease state, and so on. As such, the interpretation of misclassification errors may differ across the classes.

For example, misclassifying an example from the majority class as an example from the minority class, called a false positive, is often not desired, but it is less critical than classifying an example from the minority class as belonging to the majority class, a so-called false negative.

This is referred to as cost sensitivity of misclassification errors and is a second foundational challenge of imbalanced classification.
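One way to picture cost sensitivity is to weight each type of error by a cost and compare two sets of predictions that make the same number of mistakes. The costs below (1 for a false positive, 10 for a false negative) and the label vectors are assumptions chosen only for illustration.

```python
# Sketch: two predictions with equal error counts but unequal total cost.
# COST_FP and COST_FN are hypothetical values, not taken from any domain.
from numpy import array
from sklearn.metrics import confusion_matrix

COST_FP = 1.0   # assumed cost of a false positive
COST_FN = 10.0  # assumed cost of a false negative

def total_cost(y_true, y_pred):
    # unpack the 2x2 confusion matrix for the labels 0 and 1
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * COST_FP + fn * COST_FN

# 8 majority examples followed by 2 minority examples
y_true = array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# predictions A: two false positives, no false negatives
y_pred_a = array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
# predictions B: no false positives, two false negatives
y_pred_b = array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

cost_a = total_cost(y_true, y_pred_a)
cost_b = total_cost(y_true, y_pred_b)
print('cost A=%.1f, cost B=%.1f' % (cost_a, cost_b))
```

Both sets of predictions are wrong on two examples, yet under these assumed costs the second is ten times more expensive.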

These two aspects, the skewed class distribution and cost sensitivity, are typically referenced when describing the difficulty of imbalanced classification.

Nevertheless, there are other characteristics of the classification problem that, when combined with these properties, compound their effect. These are general characteristics of classification predictive modeling that magnify the difficulty of the imbalanced classification task.

Class imbalance was widely acknowledged as a complicating factor for classification. However, some studies also argue that the imbalance ratio is not the only cause of performance degradation in learning from imbalanced data.

— Page 253, Learning from Imbalanced Data Sets, 2018.

There are many such characteristics, but perhaps three of the most common include:

- Dataset Size.
- Label Noise.
- Data Distribution.

It is important to not only acknowledge these properties but to also specifically develop an intuition for their impact. This will allow you to select and develop techniques to address them in your own predictive modeling projects.

Understanding these data intrinsic characteristics, as well as their relationship with class imbalance, is crucial for applying existing and developing new techniques to deal with imbalance data.

— Pages 253-254, Learning from Imbalanced Data Sets, 2018.

In the following sections, we will take a closer look at each of these properties and their impact on imbalanced classification.

## Compounding Effect of Dataset Size

Dataset size simply refers to the number of examples collected from the domain to fit and evaluate a predictive model.

Typically, more data is better as it provides more coverage of the domain, perhaps to a point of diminishing returns.

Specifically, more data provides better representation of combinations and variance of features in the feature space and their mapping to class labels. From this, a model can better learn and generalize a class boundary to discriminate new examples in the future.

If the ratio of examples in the majority class to the minority class is somewhat fixed, then we would expect that we would have more examples in the minority class as the size of the dataset is scaled up.

This is good if we can collect more examples.

It is a problem typically because data is hard or expensive to collect and we often collect and work with a lot less data than we might prefer. As such, this can dramatically impact our ability to gain a large enough or representative sample of examples from the minority class.

A problem that often arises in classification is the small number of training instances. This issue, often reported as data rarity or lack of data, is related to the “lack of density” or “insufficiency of information”.

— Page 261, Learning from Imbalanced Data Sets, 2018.

For example, for a modest classification task with a balanced class distribution, we might be satisfied with thousands or tens of thousands of examples in order to develop, evaluate, and select a model.

A balanced binary classification with 10,000 examples would have 5,000 examples of each class. An imbalanced dataset with a 1:100 distribution with the same number of examples would only have 100 examples of the minority class.

As such, the size of the dataset dramatically impacts the imbalanced classification task, and datasets that are thought large in general are, in fact, probably not large enough when working with an imbalanced classification problem.
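The arithmetic behind this comparison can be sketched in a few lines. The helper name is hypothetical, and a 1:100 distribution is treated as a 1 percent minority class, matching the approximation used in this tutorial.

```python
# Sketch: number of minority class examples for a given dataset size.
def minority_count(n_examples, minority_fraction):
    # expected number of examples that fall in the minority class
    return round(n_examples * minority_fraction)

# balanced dataset: half of the examples in each class
balanced = minority_count(10000, 0.5)
# imbalanced dataset: about 1 percent of examples in the minority class
imbalanced = minority_count(10000, 0.01)
print('balanced=%d, imbalanced=%d' % (balanced, imbalanced))

# the same calculation across a range of dataset sizes
counts = [minority_count(n, 0.01) for n in [100, 1000, 10000, 100000]]
print(counts)
```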

Without a sufficient large training set, a classifier may not generalize characteristics of the data. Furthermore, the classifier could also overfit the training data, with a poor performance in out-of-sample tests instances.

— Page 261, Learning from Imbalanced Data Sets, 2018.

To help, let’s make this concrete with a worked example.

We can use the make_classification() scikit-learn function to create a dataset of a given size with a ratio of about 1:100 examples (1 percent to 99 percent) in the minority class to the majority class.

```python
...
# create the dataset
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
```

We can then create a scatter plot of the dataset and color the points for each class with a separate color to get an idea of the spatial relationship for the examples.

```python
...
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
```

This process can then be repeated with different dataset sizes to show how the class imbalance is impacted visually. We will compare datasets with 100, 1,000, 10,000, and 100,000 examples.

The complete example is listed below.

```python
# vary the dataset size for a 1:100 imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
# dataset sizes
sizes = [100, 1000, 10000, 100000]
# create and plot a dataset with each size
for i in range(len(sizes)):
    # determine the dataset size
    n = sizes[i]
    # create the dataset
    X, y = make_classification(n_samples=n, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print('Size=%d, Ratio=%s' % (n, counter))
    # define subplot
    pyplot.subplot(2, 2, 1+i)
    pyplot.title('n=%d' % n)
    pyplot.xticks([])
    pyplot.yticks([])
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
# show the figure
pyplot.show()
```

Running the example creates and plots the same dataset with a 1:100 class distribution using four different sizes.

First, the class distribution is displayed for each dataset size. We can see that with a small dataset of 100 examples, we only get one example in the minority class as we might expect. Even with 100,000 examples in the dataset, we only get 1,000 examples in the minority class.

```
Size=100, Ratio=Counter({0: 99, 1: 1})
Size=1000, Ratio=Counter({0: 990, 1: 10})
Size=10000, Ratio=Counter({0: 9900, 1: 100})
Size=100000, Ratio=Counter({0: 99000, 1: 1000})
```

Scatter plots are created for each differently sized dataset.

We can see that it is not until very large sample sizes that the underlying structure of the class distributions becomes obvious.

These plots highlight the critical role that dataset size plays in imbalanced classification. It is hard to see how a model given 990 examples of the majority class and 10 of the minority class could hope to do well on the same problem depicted after 100,000 examples are drawn.

## Compounding Effect of Label Noise

Label noise refers to examples that belong to one class that are assigned to another class.

This can make determining the class boundary in feature space problematic for most machine learning algorithms, and this difficulty typically increases in proportion to the percentage of noise in the labels.

Two types of noise are distinguished in the literature: feature (or attribute) and class noise. Class noise is generally assumed to be more harmful than attribute noise in ML […] class noise somehow affects the observed class values (e.g., by somehow flipping the label of a minority class instance to the majority class label).

— Page 264, Learning from Imbalanced Data Sets, 2018.

The cause is often inherent in the problem domain, such as ambiguous observations on the class boundary or even errors in the data collection that could impact observations anywhere in the feature space.

For imbalanced classification, noisy labels have an even more dramatic effect.

Given that examples in the positive class are so few, losing some to noise reduces the amount of information available about the minority class.

Additionally, having examples from the majority class incorrectly marked as belonging to the minority class can fragment the minority class into disjoint groups, a class that is already sparse because of the lack of observations.

We can imagine that if there are examples along the class boundary that are ambiguous, we could identify and remove or correct them. Examples marked for the minority class that are in areas of the feature space that are high density for the majority class are also likely easy to identify and remove or correct.

It is the case where observations for both classes are sparse in the feature space where this problem becomes particularly difficult in general, and especially for imbalanced classification. It is these situations where unmodified machine learning algorithms will define the class boundary in favor of the majority class at the expense of the minority class.

Mislabeled minority class instances will contribute to increase the perceived imbalance ratio, as well as introduce mislabeled noisy instances inside the class region of the minority class. On the other hand, mislabeled majority class instances may lead the learning algorithm, or imbalanced treatment methods, focus on wrong areas of input space.

— Page 264, Learning from Imbalanced Data Sets, 2018.

We can develop an example to give a flavor of this challenge.

We can hold the dataset size constant as well as the 1:100 class ratio and vary the amount of label noise. This can be achieved by setting the “*flip_y*” argument to the *make_classification()* function, which is the fraction of examples whose class label is assigned at random.

We will explore values of 0 percent, 1 percent, 5 percent, and 7 percent.

The complete example is listed below.

```python
# vary the label noise for a 1:100 imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
# label noise ratios
noise = [0, 0.01, 0.05, 0.07]
# create and plot a dataset with different label noise
for i in range(len(noise)):
    # determine the label noise
    n = noise[i]
    # create the dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=n, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print('Noise=%d%%, Ratio=%s' % (int(n*100), counter))
    # define subplot
    pyplot.subplot(2, 2, 1+i)
    pyplot.title('noise=%d%%' % int(n*100))
    pyplot.xticks([])
    pyplot.yticks([])
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
# show the figure
pyplot.show()
```

Running the example creates and plots the same dataset with a 1:100 class distribution using four different amounts of label noise.

First, the class distribution is printed for each dataset with differing amounts of label noise. We can see that, as we might expect, as the noise is increased, the number of examples in the minority class is increased, most of which are incorrectly labeled.

We might expect these additional 31 examples in the minority class with 7 percent label noise to be quite damaging to a model trying to define a crisp class boundary in the feature space.

```
Noise=0%, Ratio=Counter({0: 990, 1: 10})
Noise=1%, Ratio=Counter({0: 983, 1: 17})
Noise=5%, Ratio=Counter({0: 963, 1: 37})
Noise=7%, Ratio=Counter({0: 959, 1: 41})
```
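A back-of-the-envelope calculation helps explain why the minority count grows so quickly in the counts above. The sketch assumes that each example is selected for noise with probability *flip_y* and that a selected example then receives a label drawn uniformly from the two classes; this is our reading of how the argument behaves, and the helper name is hypothetical.

```python
# Sketch: expected minority class size under random label reassignment.
# Assumes selected examples get a label drawn uniformly across classes.
def expected_minority(n_samples, minority_weight, flip_y, n_classes=2):
    # minority examples that are left untouched by the noise
    kept = (1.0 - flip_y) * minority_weight
    # examples of any class that are reassigned to the minority label
    reassigned = flip_y / n_classes
    return n_samples * (kept + reassigned)

# expected counts for the noise levels explored in this section
expected = [expected_minority(1000, 0.01, f) for f in [0.0, 0.01, 0.05, 0.07]]
for f, e in zip([0.0, 0.01, 0.05, 0.07], expected):
    print('flip_y=%.2f, expected minority=%.1f' % (f, e))
```

Under this assumption the expected values are in the same range as the observed counts, and most of the growth comes from majority examples that are relabeled as minority.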

Scatter plots are created for each dataset with the differing label noise.

In this specific case, we don’t see many examples of confusion on the class boundary. Instead, we can see that as the label noise is increased, the number of examples in the mass of the minority class (orange points in the blue area) increases, representing false positives that really should be identified and removed from the dataset prior to modeling.

## Compounding Effect of Data Distribution

Another important consideration is the distribution of examples in feature space.

If we think about feature space spatially, we might like all examples in one class to be located on one part of the space, and those from the other class to appear in another part of the space.

If this is the case, we have good class separability and machine learning models can draw crisp class boundaries and achieve good classification performance. This holds on datasets with a balanced or imbalanced class distribution.

This is rarely the case, and it is more likely that each class has multiple “*concepts*” resulting in multiple different groups or clusters of examples in feature space.

… it is common that the “concept” beneath a class is split into several sub-concepts, spread over the input space.

— Page 255, Learning from Imbalanced Data Sets, 2018.

These groups are formally referred to as “*disjuncts*,” a term from rule-based systems for a rule that covers a group of cases comprised of sub-concepts. A small disjunct is one that relates to or “*covers*” few examples in the training dataset.

Systems that learn from examples do not usually succeed in creating a purely conjunctive definition for each concept. Instead, they create a definition that consists of several disjuncts, where each disjunct is a conjunctive definition of a subconcept of the original concept.

— Concept Learning And The Problem Of Small Disjuncts, 1989.

This grouping makes class separability hard, requiring each group or cluster to be identified and included in the definition of the class boundary, implicitly or explicitly.

In the case of imbalanced datasets, this is a particular problem if the minority class has multiple concepts or clusters in the feature space. This is because the density of examples in this class is already sparse and it is difficult to discern separate groupings with so few examples. It may look like one large sparse grouping.

This lack of homogeneity is particularly problematic in algorithms based on the strategy of dividing-and-conquering […] where the sub-concepts lead to the creation of small disjuncts.

— Page 255, Learning from Imbalanced Data Sets, 2018.

For example, we might consider data that describes whether a patient is healthy (majority class) or sick (minority class). The data may capture many different types of illnesses, and there may be groups of similar illnesses, but if there are so few cases, then any grouping or concepts within the class may not be apparent and may look like a diffuse set mixed in with healthy cases.

To make this concrete, we can look at an example.

We can use the number of clusters in the dataset as a proxy for “*concepts*” and compare a dataset with one cluster of examples per class to a second dataset with two clusters per class.

This can be achieved by varying the “*n_clusters_per_class*” argument for the *make_classification()* function used to create the dataset.

We would expect that in an imbalanced dataset, such as one with a 1:100 class distribution, the increase in the number of clusters is obvious for the majority class, but not so for the minority class.

The complete example is listed below.

```python
# vary the number of clusters for a 1:100 imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
# number of clusters
clusters = [1, 2]
# create and plot a dataset with different numbers of clusters
for i in range(len(clusters)):
    c = clusters[i]
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=c, weights=[0.99], flip_y=0, random_state=1)
    counter = Counter(y)
    # define subplot
    pyplot.subplot(1, 2, 1+i)
    pyplot.title('Clusters=%d' % c)
    pyplot.xticks([])
    pyplot.yticks([])
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
# show the figure
pyplot.show()
```

Running the example creates and plots the same dataset with a 1:100 class distribution using two different numbers of clusters.

In the first scatter plot (left), we can see one cluster per class. The majority class (blue) quite clearly has one cluster, whereas the structure of the minority class (orange) is less obvious. In the second plot (right), we can again clearly see that the majority class has two clusters, and again the structure of the minority class (orange) is diffuse and it is not apparent that samples were drawn from two clusters.

This highlights the relationship between the size of the dataset and its ability to expose the underlying density or distribution of examples in the minority class. With so few examples, generalization by machine learning models is challenging, if not very problematic.
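As a rough numerical check of this point, the sketch below counts how many examples each class contributes when the dataset definition from the example above is used with one and two clusters per class. Dividing each class count evenly by the number of clusters is a simplifying assumption made here for illustration; it is not something the plots themselves report.

```python
# count examples per class for one and two clusters per class
from collections import Counter
from sklearn.datasets import make_classification

per_cluster = {}
for c in [1, 2]:
    # same dataset definition as the plotting example above
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=c, weights=[0.99], flip_y=0, random_state=1)
    counter = Counter(y)
    # assumption: examples are split roughly evenly across the clusters of a class
    per_cluster[c] = {label: count / c for label, count in counter.items()}
    print('Clusters=%d, class counts=%s, approx. per cluster=%s' % (c, dict(counter), per_cluster[c]))
```

With two clusters, each minority cluster is described by roughly half as many points as before, which is one way to see why its structure is harder to make out in the scatter plot.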

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers

### Books

### APIs

## Summary

In this tutorial, you discovered data characteristics that compound the challenge of imbalanced classification.

Specifically, you learned:

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Why Is Imbalanced Classification Difficult? appeared first on Machine Learning Mastery.

## One-Class Classification Algorithms for Imbalanced Datasets

Outliers or anomalies are rare examples that do not fit in with the rest of the data.

Identifying outliers in data is referred to as outlier or anomaly detection and a subfield of machine learning focused on this problem is referred to as one-class classification. These are unsupervised learning algorithms that attempt to model “*normal*” examples in order to classify new examples as either normal or abnormal (e.g. outliers).

One-class classification algorithms can be used for binary classification tasks with a severely skewed class distribution. These techniques can be fit on the input examples from the majority class in the training dataset, then evaluated on a holdout test dataset.

Although not designed for these types of problems, one-class classification algorithms can be effective for imbalanced classification datasets where there are none or very few examples of the minority class, or datasets where there is no coherent structure to separate the classes that could be learned by a supervised algorithm.

In this tutorial, you will discover how to use one-class classification algorithms for datasets with severely skewed class distributions.

After completing this tutorial, you will know:

Let’s get started.

## Tutorial Overview

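This tutorial is divided into five parts; they are:

1. One-Class Classification for Imbalanced Data
2. One-Class Support Vector Machines
3. Isolation Forest
4. Minimum Covariance Determinant
5. Local Outlier Factor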

## One-Class Classification for Imbalanced Data

Outliers are both rare and unusual.

Rarity suggests that they have a low frequency relative to non-outlier data (so-called inliers). Unusual suggests that they do not fit neatly into the data distribution.

The presence of outliers can cause problems. For example, a single variable may have an outlier far from the mass of examples, which can skew summary statistics such as the mean and variance.
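A small sketch can make this effect concrete. The measurements below are hypothetical values chosen for illustration, and the population variance from the standard library is used for simplicity.

```python
# effect of a single outlier on the mean and variance
from statistics import mean, pvariance

# hypothetical measurements clustered around 10
values = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0, 10.0]
# the same measurements with one extreme value appended
with_outlier = values + [100.0]

mean_clean, var_clean = mean(values), pvariance(values)
mean_out, var_out = mean(with_outlier), pvariance(with_outlier)
print('Without outlier: mean=%.2f, variance=%.2f' % (mean_clean, var_clean))
print('With outlier: mean=%.2f, variance=%.2f' % (mean_out, var_out))
```

In this made-up sample, one extreme value doubles the mean from 10 to 20 and inflates the variance by several orders of magnitude.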

Fitting a machine learning model may require the identification and removal of outliers as a data preparation technique.

The process of identifying outliers in a dataset is generally referred to as anomaly detection, where the outliers are “*anomalies*,” and the rest of the data is “*normal*.” Outlier detection or anomaly detection is a challenging problem and is comprised of a range of techniques.

In machine learning, one approach to tackling the problem of anomaly detection is one-class classification.

One-Class Classification, or OCC for short, involves fitting a model on the “*normal*” data and predicting whether new data is normal or an outlier/anomaly.

A one-class classifier aims at capturing characteristics of training instances, in order to be able to distinguish between them and potential outliers to appear.

— Page 139, Learning from Imbalanced Data Sets, 2018.

A one-class classifier is fit on a training dataset that only has examples from the normal class. Once prepared, the model is used to classify new examples as either normal or not-normal, i.e. outliers or anomalies.

One-class classification techniques can be used for binary (two-class) imbalanced classification problems where the negative case (class 0) is taken as “*normal*” and the positive case (class 1) is taken as an outlier or anomaly.

Given the nature of the approach, one-class classifications are most suited for those tasks where the positive cases don’t have a consistent pattern or structure in the feature space, making it hard for other classification algorithms to learn a class boundary. Instead, treating the positive cases as outliers, it allows one-class classifiers to ignore the task of discrimination and instead focus on deviations from normal or what is expected.

This solution has proven to be especially useful when the minority class lack any structure, being predominantly composed of small disjuncts or noisy instances.

— Page 139, Learning from Imbalanced Data Sets, 2018.

It may also be appropriate where the number of positive cases in the training set is so few that they are not worth including in the model, such as a few tens of examples or fewer. Or for problems where no examples of positive cases can be collected prior to training a model.

To be clear, this adaptation of one-class classification algorithms for imbalanced classification is unusual but can be effective on some problems. The downside of this approach is that any examples of outliers (positive cases) we have during training are not used by the one-class classifier and are discarded. This suggests that perhaps an inverse modeling of the problem (e.g. model the positive case as normal) could be tried in parallel. It also suggests that the one-class classifier could provide an input to an ensemble of algorithms, each of which uses the training dataset in different ways.

One must remember that the advantages of one-class classifiers come at a price of discarding all of available information about the majority class. Therefore, this solution should be used carefully and may not fit some specific applications.

— Page 140, Learning from Imbalanced Data Sets, 2018.

The scikit-learn library provides a handful of common one-class classification algorithms intended for use in outlier or anomaly detection and change detection, such as One-Class SVM, Isolation Forest, Elliptic Envelope, and Local Outlier Factor.

In the following sections, we will take a look at each in turn.

Before we do, we will devise a binary classification dataset to demonstrate the algorithms. We will use the make_classification() scikit-learn function to create 10,000 examples with 10 examples in the minority class and 9,990 in the majority class, or a 0.1 percent vs. 99.9 percent, or about 1:1000 class distribution.

The example below creates and summarizes this dataset.

```python
# Generate and plot a synthetic imbalanced classification dataset
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
# summarize class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

Running the example first summarizes the class distribution, confirming the imbalance was created as expected.

Counter({0: 9990, 1: 10})

Next, a scatter plot is created and examples are plotted as points colored by their class label, showing a large mass for the majority class (blue) and a few dots for the minority class (orange).

This severe class imbalance with so few examples in the positive class and the unstructured nature of the few examples in the positive class might make a good basis for using one-class classification methods.

## One-Class Support Vector Machines

The support vector machine, or SVM, algorithm developed initially for binary classification can be used for one-class classification.

If used for imbalanced classification, it is a good idea to evaluate the standard SVM and weighted SVM on your dataset before testing the one-class version.

When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers. This modification of SVM is referred to as One-Class SVM.

… an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero.

— Estimating the Support of a High-Dimensional Distribution, 2001.

The scikit-learn library provides an implementation of one-class SVM in the OneClassSVM class.

The main difference from a standard SVM is that it is fit in an unsupervised manner and does not provide the normal hyperparameters for tuning the margin like *C*. Instead, it provides a hyperparameter “*nu*” that sets an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. It should be tuned to the approximate ratio of outliers in the data, e.g. 0.01 (1 percent).

```python
...
# define outlier detection model
model = OneClassSVM(gamma='scale', nu=0.01)
```
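To get a feel for what “*nu*” controls, the sketch below fits the model with two different values on hypothetical Gaussian data and reports the fraction of training examples that the fitted model labels as outliers. The data, the random seed, and the two values compared are assumptions made for illustration; scikit-learn documents “*nu*” as an upper bound on the fraction of training errors, so the reported fractions should only be read as approximate.

```python
# effect of nu on the fraction of training examples flagged as outliers
from numpy.random import RandomState
from sklearn.svm import OneClassSVM

# hypothetical "normal" data: 1,000 examples from a 2D Gaussian
rng = RandomState(1)
X = rng.normal(size=(1000, 2))

fractions = {}
for nu in [0.01, 0.2]:
    model = OneClassSVM(gamma='scale', nu=nu)
    # predict() returns -1 for examples the model treats as outliers
    yhat = model.fit(X).predict(X)
    fractions[nu] = (yhat == -1).mean()
    print('nu=%.2f, fraction flagged=%.3f' % (nu, fractions[nu]))
```

A larger value of “*nu*” should flag a larger share of the training data, which is why it is worth matching it loosely to the expected proportion of outliers.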

The model can be fit on all examples in the training dataset or just those examples in the majority class. Perhaps try both on your problem.

In this case, we will try fitting on just those examples in the training set that belong to the majority class.

```python
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
```

Once fit, the model can be used to identify outliers in new data.

When calling the *predict()* function on the model, it will output a +1 for normal examples, so-called inliers, and a -1 for outliers.

```python
...
# detect outliers in the test set
yhat = model.predict(testX)
```

If we want to evaluate the performance of the model as a binary classifier, we must change the labels in the test dataset from 0 and 1 for the majority and minority classes respectively, to +1 and -1.

```python
...
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
```
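Note that the in-place assignments depend on their order: if class 0 were mapped to 1 first, every label would then match the second assignment and end up as -1. As an alternative sketch, the mapping can be done in one step with *numpy.where()*, which returns a new array and leaves the original labels untouched. The small label array and the name *testy_outlier* are hypothetical and used only for illustration.

```python
# map class labels {0, 1} to outlier labels {+1, -1} without modifying the input
from numpy import array, where

# hypothetical test labels: 0 = majority (normal), 1 = minority (outlier)
testy = array([0, 0, 1, 0, 1])
# build a new array instead of overwriting testy in place
testy_outlier = where(testy == 1, -1, 1)
print(testy_outlier)
```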

We can then compare the predictions from the model to the expected target values and calculate a score. Given that we have crisp class labels, we might use a score like precision, recall, or a combination of both, such as the F-measure (F1-score).

In this case, we will use F-measure score, which is the harmonic mean of precision and recall. We can calculate the F-measure using the *f1_score()* function and specify the label of the minority class as -1 via the “*pos_label*” argument.

```python
...
# calculate score
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
```
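To see what this score measures, the harmonic mean can be worked through by hand. The counts below are hypothetical and are not taken from the results in this tutorial.

```python
# F-measure as the harmonic mean of precision and recall
# hypothetical counts for the outlier class (label -1)
tp = 1   # outliers correctly flagged
fp = 13  # normal examples wrongly flagged as outliers
fn = 4   # outliers that were missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print('Precision: %.3f, Recall: %.3f, F1: %.3f' % (precision, recall, f1))
```

Because the harmonic mean is dominated by the smaller of the two values, a model that flags many normal examples as outliers will have a low F-measure even if it catches some of the true outliers.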

Tying this together, we can evaluate the one-class SVM algorithm on our synthetic dataset. We will split the dataset in two and use half to train the model in an unsupervised manner and the other half to evaluate it.

The complete example is listed below.

```python
# one-class svm for imbalanced binary classification
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# define outlier detection model
model = OneClassSVM(gamma='scale', nu=0.01)
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
# detect outliers in the test set
yhat = model.predict(testX)
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
# calculate score
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
```

Running the example fits the model on the input examples from the majority class in the training set. The model is then used to classify examples in the test set as inliers and outliers.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.

In this case, an F1 score of 0.123 is achieved.

F1 Score: 0.123

## Isolation Forest

Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.

… Isolation Forest (iForest) which detects anomalies purely based on the concept of isolation without employing any distance or density measure

— Isolation-Based Anomaly Detection, 2012.

It is based on modeling the normal data in such a way to isolate anomalies that are both few in number and different in the feature space.

… our proposed method takes advantage of two anomalies’ quantitative properties: i) they are the minority consisting of fewer instances and ii) they have attribute-values that are very different from those of normal instances.

— Isolation Forest, 2008.

Tree structures are created to isolate anomalies. The result is that isolated examples have a relatively short depth in the trees, whereas normal data is less isolated and has a greater depth in the trees.

… a tree structure can be constructed effectively to isolate every single instance. Because of their susceptibility to isolation, anomalies are isolated closer to the root of the tree; whereas normal points are isolated at the deeper end of the tree.

— Isolation Forest, 2008.
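The idea that anomalies are isolated after fewer splits can be illustrated with a deliberately simplified, one-dimensional sketch. This is not the scikit-learn implementation: the function *isolation_depth()*, the data, and the number of trials are all assumptions made for illustration.

```python
# average number of random splits needed to isolate a point (simplified 1D sketch)
from random import Random

def isolation_depth(point, data, rng):
    # repeatedly split the data at a random value and keep the side holding the point
    depth = 0
    while len(data) > 1:
        lo, hi = min(data), max(data)
        if lo == hi:
            break
        split = rng.uniform(lo, hi)
        data = [x for x in data if (x < split) == (point < split)]
        depth += 1
    return depth

rng = Random(1)
# hypothetical data: 200 "normal" values near zero plus one distant value
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [10.0]
# a value near the middle of the data stands in for a normal example
inlier = sorted(data)[100]
trials = 200
avg_outlier = sum(isolation_depth(10.0, data, rng) for _ in range(trials)) / trials
avg_inlier = sum(isolation_depth(inlier, data, rng) for _ in range(trials)) / trials
print('Average depth, outlier: %.1f' % avg_outlier)
print('Average depth, inlier: %.1f' % avg_inlier)
```

The distant value is typically separated from the rest after only a few random splits, whereas a value in the middle of the data needs many more, which mirrors the short and long path lengths described above.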

The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest class.

Perhaps the most important hyperparameters of the model are the “*n_estimators*” argument that sets the number of trees to create and the “*contamination*” argument, which is used to help define the number of outliers in the dataset.

The true ratio of positive cases to negative cases in our dataset is about 0.1 percent (0.001). In this case, we will set the “*contamination*” argument to a somewhat larger value of 0.01 (1 percent).

```python
...
# define outlier detection model
# (the deprecated 'behaviour' argument has been removed from recent scikit-learn releases)
model = IsolationForest(contamination=0.01)
```
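As a rough sketch of what the “*contamination*” argument does, the example below fits the model on hypothetical Gaussian data and reports the fraction of training examples labeled as outliers for two settings. The data and the fixed *random_state* values are assumptions added here for repeatability.

```python
# contamination sets the approximate fraction of training examples labeled as outliers
from numpy.random import RandomState
from sklearn.ensemble import IsolationForest

# hypothetical "normal" data: 2,000 examples from a 2D Gaussian
rng = RandomState(1)
X = rng.normal(size=(2000, 2))

fractions = {}
for contamination in [0.01, 0.1]:
    model = IsolationForest(contamination=contamination, random_state=1)
    # predict() returns -1 for examples scored below the learned threshold
    yhat = model.fit(X).predict(X)
    fractions[contamination] = (yhat == -1).mean()
    print('contamination=%.2f, fraction flagged=%.3f' % (contamination, fractions[contamination]))
```

The flagged fraction should track the argument closely, because the value is used to place the score threshold that separates inliers from outliers.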

The model is probably best trained on examples that exclude outliers. In this case, we fit the model on the input features for examples from the majority class only.

```python
...
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
```

Like one-class SVM, the model will predict an inlier with a label of +1 and an outlier with a label of -1, therefore, the labels of the test set must be changed before evaluating the predictions.

Tying this together, the complete example is listed below.

```python
# isolation forest for imbalanced classification
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import IsolationForest
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# define outlier detection model
# (the deprecated 'behaviour' argument has been removed from recent scikit-learn releases)
model = IsolationForest(contamination=0.01)
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
# detect outliers in the test set
yhat = model.predict(testX)
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
# calculate score
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
```

Running the example fits the isolation forest model on the training dataset in an unsupervised manner, then classifies examples in the test set as inliers and outliers and scores the result.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.

In this case, an F1 score of 0.154 is achieved.

F1 Score: 0.154

**Note**: the contamination is quite low and may result in many runs with an F1 Score of 0.0.

To improve the stability of the method on this dataset, try increasing the contamination to 0.05 or even 0.1 and re-run the example.
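One way to follow this suggestion is to average the score over several random seeds for each contamination value, rather than relying on a single run. The sketch below assumes the same dataset and split as the complete example; the choice of five seeds and the three contamination values are assumptions made for illustration, and no particular scores are implied.

```python
# compare contamination values, averaging the F1 score over several seeds
from numpy import mean, where
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import IsolationForest

# same dataset and split as the complete example above
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# fit on majority class only; mark inliers 1, outliers -1
trainX = trainX[trainy == 0]
testy = where(testy == 1, -1, 1)

results = {}
for contamination in [0.01, 0.05, 0.1]:
    scores = []
    for seed in range(5):
        # a different seed gives a different set of random trees
        model = IsolationForest(contamination=contamination, random_state=seed)
        model.fit(trainX)
        yhat = model.predict(testX)
        scores.append(f1_score(testy, yhat, pos_label=-1))
    results[contamination] = mean(scores)
    print('contamination=%.2f, mean F1=%.3f' % (contamination, results[contamination]))
```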

## Minimum Covariance Determinant

If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.

For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.

This approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data, and data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.

It is unusual to have such well-behaved data, but if this is the case for your dataset, or you can use power transforms to make the variables Gaussian, then this approach might be appropriate.

The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. […] It also serves as a convenient and efficient tool for outlier detection.

— Minimum Covariance Determinant and Extensions, 2017.
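The intuition of measuring how far a point lies from a fitted Gaussian can be sketched with the Mahalanobis distance. Note that this sketch uses the plain sample mean and covariance rather than the robust MCD estimate, and that the data, the two query points, and the helper *mahalanobis()* are assumptions made for illustration.

```python
# distance from a fitted Gaussian as an outlier score (classical, non-robust estimate)
from numpy import array, cov, sqrt
from numpy.linalg import inv
from numpy.random import RandomState

# hypothetical "normal" data: two correlated Gaussian variables
rng = RandomState(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=1000)

# estimate the location and scatter of the data
mu = X.mean(axis=0)
inv_cov = inv(cov(X, rowvar=False))

def mahalanobis(x):
    # distance from the mean, scaled by the covariance of the data
    d = x - mu
    return sqrt(d @ inv_cov @ d)

near = array([0.5, 0.5])   # lies along the main axis of the data
far = array([2.0, -2.0])   # lies against the correlation of the data
print('Near point: %.2f' % mahalanobis(near))
print('Far point: %.2f' % mahalanobis(far))
```

Points whose distance exceeds a chosen cut-off fall outside the ellipsoid and would be treated as outliers; a robust estimate such as MCD aims to keep the fitted ellipsoid from being distorted by the outliers themselves.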

The scikit-learn library provides access to this method via the EllipticEnvelope class.

It provides the “*contamination*” argument that defines the expected ratio of outliers to be observed in practice. The true ratio in our synthetic dataset is about 0.1 percent (0.001); in this case, we will set the argument to 0.01 (1 percent).

```python
...
# define outlier detection model
model = EllipticEnvelope(contamination=0.01)
```

The model can be fit on the input data from the majority class only in order to estimate the distribution of “*normal*” data in an unsupervised manner.

```python
...
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
```

The model will then be used to classify new examples as either normal (+1) or outliers (-1).

```python
...
# detect outliers in the test set
yhat = model.predict(testX)
```

Tying this together, the complete example of using the elliptic envelope outlier detection model for imbalanced classification on our synthetic binary classification dataset is listed below.

```python
# elliptic envelope for imbalanced classification
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.covariance import EllipticEnvelope
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# define outlier detection model
model = EllipticEnvelope(contamination=0.01)
# fit on majority class
trainX = trainX[trainy==0]
model.fit(trainX)
# detect outliers in the test set
yhat = model.predict(testX)
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
# calculate score
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
```

Running the example fits the elliptic envelope model on the training dataset in an unsupervised manner, then classifies examples in the test set as inliers and outliers and scores the result.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a number of times.

In this case, an F1 score of 0.157 is achieved.

F1 Score: 0.157

## Local Outlier Factor

A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space.

This can work well for feature spaces with low dimensionality (few features), although it can become less reliable as the number of features is increased, referred to as the curse of dimensionality.

The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score of how isolated it is, or how likely it is to be an outlier, based on the size of its local neighborhood. Those examples with the largest scores are more likely to be outliers.

We introduce a local outlier (LOF) for each object in the dataset, indicating its degree of outlier-ness.

— LOF: Identifying Density-based Local Outliers, 2000.

The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.

The model can be defined and requires that the expected ratio of outliers in the dataset be indicated. In this case, we will use 0.01 (1 percent), although the true ratio in our synthetic dataset is about 0.1 percent.

```python
...
# define outlier detection model
model = LocalOutlierFactor(contamination=0.01)
```

In its default configuration, the model does not provide a separate *predict()* function for new data. Instead, a “*normal*” dataset is used as the basis for identifying outliers in new data via a single call to *fit_predict()*.

To use this model to identify outliers in our test dataset, we must first prepare the training dataset to only have input examples from the majority class.

```python
...
# get examples for just the majority class
trainX = trainX[trainy==0]
```

Next, we can concatenate these examples with the input examples from the test dataset.

```python
...
# create one large dataset
composite = vstack((trainX, testX))
```

We can then make a prediction by calling *fit_predict()* and retrieve only those labels for the examples in the test set.

```python
...
# make prediction on composite dataset
yhat = model.fit_predict(composite)
# get just the predictions on the test set
yhat = yhat[len(trainX):]
```

To make things easier, we can wrap this up into a new function with the name *lof_predict()* listed below.

```python
# make a prediction with a lof model
def lof_predict(model, trainX, testX):
    # create one large dataset
    composite = vstack((trainX, testX))
    # make prediction on composite dataset
    yhat = model.fit_predict(composite)
    # return just the predictions on the test set
    return yhat[len(trainX):]
```

The predicted labels will be +1 for normal and -1 for outliers, like the other outlier detection algorithms in scikit-learn.

Tying this together, the complete example of using the LOF outlier detection algorithm for classification with a skewed class distribution is listed below.

```python
# local outlier factor for imbalanced classification
from numpy import vstack
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor

# make a prediction with a lof model
def lof_predict(model, trainX, testX):
    # create one large dataset
    composite = vstack((trainX, testX))
    # make prediction on composite dataset
    yhat = model.fit_predict(composite)
    # return just the predictions on the test set
    return yhat[len(trainX):]

# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# define outlier detection model
model = LocalOutlierFactor(contamination=0.01)
# get examples for just the majority class
trainX = trainX[trainy==0]
# detect outliers in the test set
yhat = lof_predict(model, trainX, testX)
# mark inliers 1, outliers -1
testy[testy == 1] = -1
testy[testy == 0] = 1
# calculate score
score = f1_score(testy, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
```

Running the example uses the local outlier factor model with the training dataset in an unsupervised manner to classify examples in the test set as inliers and outliers, then scores the result.

In this case, an F1 score of 0.138 is achieved.

F1 Score: 0.138
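As an alternative to the composite-dataset approach, recent versions of scikit-learn allow the *LocalOutlierFactor* class to be used in a novelty-detection mode by setting the “*novelty*” argument to *True*. In that mode, the model is fit on the majority class and *predict()* is then called on new data. The sketch below assumes a version of scikit-learn that supports this argument and reuses the dataset and split from the example above; its score will generally differ from the one reported above because the test examples no longer take part in fitting.

```python
# local outlier factor in novelty-detection mode
from numpy import where
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.neighbors import LocalOutlierFactor

# generate dataset and split into train/test sets, as in the example above
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.999], flip_y=0, random_state=4)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# novelty=True enables predict() on data not seen during fit()
model = LocalOutlierFactor(contamination=0.01, novelty=True)
# fit on majority class only
model.fit(trainX[trainy == 0])
# detect outliers in the test set
yhat = model.predict(testX)
# mark inliers 1, outliers -1
testy_outlier = where(testy == 1, -1, 1)
# calculate score
score = f1_score(testy_outlier, yhat, pos_label=-1)
print('F1 Score: %.3f' % score)
```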

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers

### Books

### APIs

### Articles

## Summary

In this tutorial, you discovered how to use one-class classification algorithms for datasets with severely skewed class distributions.

Specifically, you learned:

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post One-Class Classification Algorithms for Imbalanced Datasets appeared first on Machine Learning Mastery.

## A Gentle Introduction to Threshold-Moving for Imbalanced Classification

Classification predictive modeling typically involves predicting a class label.

Nevertheless, many machine learning algorithms are capable of predicting a probability or scoring of class membership, and this must be interpreted before it can be mapped to a crisp class label. This is achieved by using a threshold, such as 0.5, where all values equal or greater than the threshold are mapped to one class and all other values are mapped to another class.

For those classification problems that have a severe class imbalance, the default threshold can result in poor performance. As such, a simple and straightforward approach to improving the performance of a classifier that predicts probabilities on an imbalanced classification problem is to tune the threshold used to map probabilities to class labels.

In some cases, such as when using ROC Curves and Precision-Recall Curves, the best or optimal threshold for the classifier can be calculated directly. In other cases, it is possible to use a grid search to tune the threshold and locate the optimal value.

In this tutorial, you will discover how to tune the optimal threshold when converting probabilities to crisp class labels for imbalanced classification.

After completing this tutorial, you will know:

Let’s get started.

## Tutorial Overview

This tutorial is divided into five parts; they are:

## Converting Probabilities to Class Labels

Many machine learning algorithms are capable of predicting a probability or a scoring of class membership.

This is useful generally as it provides a measure of the certainty or uncertainty of a prediction. It also provides additional granularity over just predicting the class label that can be interpreted.

Some classification tasks require a crisp class label prediction. This means that even though a probability or scoring of class membership is predicted, it must be converted into a crisp class label.

The decision for converting a predicted probability or scoring into a class label is governed by a parameter referred to as the “*decision threshold*,” “*discrimination threshold*,” or simply the “*threshold*.” The default value for the threshold is 0.5 for normalized predicted probabilities or scores in the range between 0 and 1.

For example, on a binary classification problem with class labels 0 and 1, normalized predicted probabilities and a threshold of 0.5, then values less than the threshold of 0.5 are assigned to class 0 and values greater than or equal to 0.5 are assigned to class 1.
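This rule can be written in a line or two of code. The helper name *to_labels()* and the probabilities below are hypothetical and are used only to illustrate the mapping.

```python
# convert predicted probabilities to crisp class labels with a threshold
from numpy import array

def to_labels(probs, threshold):
    # values greater than or equal to the threshold map to class 1, others to class 0
    return (probs >= threshold).astype('int')

# hypothetical predicted probabilities for class 1
probs = array([0.1, 0.49, 0.5, 0.51, 0.9])
labels_default = to_labels(probs, 0.5)
labels_lower = to_labels(probs, 0.3)
print(labels_default)
print(labels_lower)
```

Lowering the threshold from 0.5 to 0.3 moves the example with a probability of 0.49 from class 0 to class 1, while the other predictions are unchanged.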

The problem is that the default threshold may not represent an optimal interpretation of the predicted probabilities.

This might be the case for a number of reasons, such as:

Worse still, some or all of these reasons may occur at the same time, such as the use of a neural network model with uncalibrated predicted probabilities on an imbalanced classification problem.

As such, there is often the need to change the default decision threshold when interpreting the predictions of a model.

… almost all classifiers generate positive or negative predictions by applying a threshold to a score. The choice of this threshold will have an impact in the trade-offs of positive and negative errors.

— Page 53, Learning from Imbalanced Data Sets, 2018.

## Threshold-Moving for Imbalanced Classification

There are many techniques that may be used to address an imbalanced classification problem, such as resampling the training dataset and developing customized versions of machine learning algorithms.

Nevertheless, perhaps the simplest approach to handle a severe class imbalance is to change the decision threshold. Although simple and very effective, this technique is often overlooked by practitioners and research academics alike, as was noted by Foster Provost in his 2000 article titled “Machine Learning from Imbalanced Data Sets 101.”

The bottom line is that when studying problems with imbalanced data, using the classifiers produced by standard machine learning algorithms without adjusting the output threshold may well be a critical mistake.

— Machine Learning from Imbalanced Data Sets 101, 2000.

There are many reasons to choose an alternative to the default decision threshold.

For example, you may use ROC curves to analyze the predicted probabilities of a model and ROC AUC scores to compare and select a model, although you require crisp class labels from your model. How do you choose the threshold on the ROC Curve that results in the best balance between the true positive rate and the false positive rate?

Alternately, you may use precision-recall curves to analyze the predicted probabilities of a model, precision-recall AUC to compare and select models, and require crisp class labels as predictions. How do you choose the threshold on the Precision-Recall Curve that results in the best balance between precision and recall?

You may use a probability-based metric to train, evaluate, and compare models like log loss (cross-entropy) but require crisp class labels to be predicted. How do you choose the optimal threshold from predicted probabilities more generally?

Finally, you may have different costs associated with false positive and false negative misclassification, a so-called cost matrix, but wish to use and evaluate cost-insensitive models and later evaluate their predictions using a cost-sensitive measure. How do you choose a threshold that finds the best trade-off for predictions using the cost matrix?

Popular way of training a cost-sensitive classifier without a known cost matrix is to put emphasis on modifying the classification outputs when predictions are being made on new data. This is usually done by setting a threshold on the positive class, below which the negative one is being predicted. The value of this threshold is optimized using a validation set and thus the cost matrix can be learned from training data.

— Page 67, Learning from Imbalanced Data Sets, 2018.

The answer to these questions is to search a range of threshold values in order to find the best threshold. In some cases, the optimal threshold can be calculated directly.
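As one illustration of a threshold that can be calculated directly, consider the cost matrix case. If the predicted probabilities are well calibrated and correct predictions cost nothing (both are assumptions of this sketch, not claims from the tutorial), then predicting positive has expected cost (1 - p) * cost_fp and predicting negative has expected cost p * cost_fn, so predicting positive is cheaper whenever p >= cost_fp / (cost_fp + cost_fn). The function names and cost values below are hypothetical.

```python
# minimal sketch: cost-based threshold, assuming calibrated probabilities
# and zero cost for correct predictions (both are assumptions)

def cost_threshold(cost_fp, cost_fn):
    # predict positive when p * cost_fn >= (1 - p) * cost_fp
    return cost_fp / (cost_fp + cost_fn)

def expected_cost(p, predict_positive, cost_fp, cost_fn):
    # expected cost of one decision for an example with P(positive) = p
    return (1.0 - p) * cost_fp if predict_positive else p * cost_fn

# hypothetical costs: a missed positive is 9 times worse than a false alarm
threshold = cost_threshold(cost_fp=1.0, cost_fn=9.0)
print('Threshold=%.3f' % threshold)
```

With these illustrative costs the threshold falls well below 0.5, which matches the intuition that expensive false negatives should make the model more willing to predict the positive class.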

Tuning or shifting the decision threshold in order to accommodate the broader requirements of the classification problem is generally referred to as “*threshold-moving*,” “*threshold-tuning*,” or simply “*thresholding*.”

It has been stated that trying other methods, such as sampling, without trying by simply setting the threshold may be misleading. The threshold-moving method uses the original training set to train [a model] and then moves the decision threshold such that the minority class examples are easier to be predicted correctly.

— Page 72, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

The process involves first fitting the model on a training dataset and making predictions on a test dataset. The predictions are in the form of normalized probabilities or scores that are transformed into normalized probabilities. Different threshold values are then tried and the resulting crisp labels are evaluated using a chosen evaluation metric. The threshold that achieves the best evaluation metric is then adopted for the model when making predictions on new data in the future.

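We can summarize this procedure below.

- Fit the model on the training dataset.
- Make probability predictions on the test dataset.
- For each candidate threshold, convert the probabilities to crisp class labels and evaluate them with the chosen metric.
- Adopt the threshold with the best score when making class predictions on new data.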

Although simple, there are a few different approaches to implementing threshold-moving depending on your circumstance. We will take a look at some of the most common examples in the following sections.

## Optimal Threshold for ROC Curve

A ROC curve is a diagnostic plot that evaluates a set of probability predictions made by a model on a test dataset.

A set of different thresholds are used to interpret the true positive rate and the false positive rate of the predictions on the positive (minority) class, and the scores are plotted in a line of increasing thresholds to create a curve.

The false-positive rate is plotted on the x-axis and the true positive rate is plotted on the y-axis and the plot is referred to as the Receiver Operating Characteristic curve, or ROC curve. A diagonal line on the plot from the bottom-left to top-right indicates the “*curve*” for a no-skill classifier (one whose predicted probabilities carry no information about the class, such as random guessing), and a point in the top left of the plot indicates a model with perfect skill.

The curve is useful to understand the trade-off in the true-positive rate and false-positive rate for different thresholds. The area under the ROC Curve, so-called ROC AUC, provides a single number to summarize the performance of a model in terms of its ROC Curve with a value between 0.5 (no-skill) and 1.0 (perfect skill).

The ROC Curve is a useful diagnostic tool for understanding the trade-off for different thresholds and the ROC AUC provides a useful number for comparing models based on their general capabilities.

If crisp class labels are required from a model under such an analysis, then an optimal threshold is required. This would be a threshold on the curve that is closest to the top-left of the plot.

Thankfully, there are principled ways of locating this point.
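One literal reading of "closest to the top-left" is the point with the smallest Euclidean distance to the corner at (0, 1). The sketch below shows that reading on small made-up arrays that stand in for the output of roc_curve(); the values are hypothetical, and this distance criterion is added for illustration rather than being the approach the tutorial goes on to use.

```python
from numpy import array, argmin, sqrt

# hypothetical ROC points, standing in for the arrays returned by roc_curve()
fpr = array([0.0, 0.0, 0.1, 0.3, 1.0])
tpr = array([0.0, 0.5, 0.9, 1.0, 1.0])
thresholds = array([1.0, 0.8, 0.4, 0.1, 0.0])
# distance from each point to the perfect-skill corner (fpr=0, tpr=1)
dist = sqrt(fpr ** 2 + (1.0 - tpr) ** 2)
# index of the point nearest the corner
ix = argmin(dist)
print('Closest Threshold=%.2f, Distance=%.3f' % (thresholds[ix], dist[ix]))
```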

First, let’s fit a model and calculate a ROC Curve.

We can use the make_classification() function to create a synthetic binary classification problem with 10,000 examples (rows), 99 percent of which belong to the majority class and 1 percent belong to the minority class.

    ...
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)

We can then split the dataset using the train_test_split() function and use half for the training set and half for the test set.

    ...
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

We can then fit a LogisticRegression model and use it to make probability predictions on the test set and keep only the probability predictions for the minority class.

    ...
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    # predict probabilities
    lr_probs = model.predict_proba(testX)
    # keep probabilities for the positive outcome only
    lr_probs = lr_probs[:, 1]

We can then use the roc_curve() function to calculate the true-positive rate and false-positive rate for the predictions across a set of thresholds; these can then be used to create a ROC Curve plot.

    ...
    # calculate roc curve
    fpr, tpr, thresholds = roc_curve(testy, lr_probs)

We can tie this all together, defining the dataset, fitting the model, and creating the ROC Curve plot. The complete example is listed below.

    # roc curve for logistic regression model
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_curve
    from matplotlib import pyplot
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    # predict probabilities
    yhat = model.predict_proba(testX)
    # keep probabilities for the positive outcome only
    yhat = yhat[:, 1]
    # calculate roc curves
    fpr, tpr, thresholds = roc_curve(testy, yhat)
    # plot the roc curve for the model
    pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
    pyplot.plot(fpr, tpr, marker='.', label='Logistic')
    # axis labels
    pyplot.xlabel('False Positive Rate')
    pyplot.ylabel('True Positive Rate')
    pyplot.legend()
    # show the plot
    pyplot.show()

Running the example fits a logistic regression model on the training dataset, then evaluates it using a range of thresholds on the test set, creating the ROC Curve.

We can see that there are a number of points or thresholds close to the top-left of the plot.

Which is the threshold that is optimal?

There are many ways we could locate the threshold with the optimal balance between false positive and true positive rates.

First, the true positive rate is called the Sensitivity. The complement of the false-positive rate is called the Specificity.

Where:

- Sensitivity = TruePositiveRate
- Specificity = 1 – FalsePositiveRate

The Geometric Mean or G-Mean is a metric for imbalanced classification that, if optimized, will seek a balance between the sensitivity and the specificity.

- G-Mean = sqrt(Sensitivity * Specificity)

One approach would be to test the model with each threshold returned from the call to roc_curve() and select the threshold with the largest G-Mean value.

Given that we have already calculated the Sensitivity (TPR) and the complement to the Specificity when we calculated the ROC Curve, we can calculate the G-Mean for each threshold directly.

    ...
    # calculate the g-mean for each threshold
    gmeans = sqrt(tpr * (1-fpr))

Once calculated, we can locate the index for the largest G-mean score and use that index to determine which threshold value to use.

    ...
    # locate the index of the largest g-mean
    ix = argmax(gmeans)
    print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))

We can also re-draw the ROC Curve and highlight this point.

The complete example is listed below.

    # roc curve for logistic regression model with optimal threshold
    from numpy import sqrt
    from numpy import argmax
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_curve
    from matplotlib import pyplot
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    # predict probabilities
    yhat = model.predict_proba(testX)
    # keep probabilities for the positive outcome only
    yhat = yhat[:, 1]
    # calculate roc curves
    fpr, tpr, thresholds = roc_curve(testy, yhat)
    # calculate the g-mean for each threshold
    gmeans = sqrt(tpr * (1-fpr))
    # locate the index of the largest g-mean
    ix = argmax(gmeans)
    print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))
    # plot the roc curve for the model
    pyplot.plot([0,1], [0,1], linestyle='--', label='No Skill')
    pyplot.plot(fpr, tpr, marker='.', label='Logistic')
    pyplot.scatter(fpr[ix], tpr[ix], marker='o', color='black', label='Best')
    # axis labels
    pyplot.xlabel('False Positive Rate')
    pyplot.ylabel('True Positive Rate')
    pyplot.legend()
    # show the plot
    pyplot.show()

Running the example first locates the optimal threshold and reports this threshold and the G-Mean score.

In this case, we can see that the optimal threshold is about 0.016153.

Best Threshold=0.016153, G-Mean=0.933

The threshold is then used to locate the true and false positive rates, then this point is drawn on the ROC Curve.

We can see that the point for the optimal threshold is a large black dot and it appears to be closest to the top-left of the plot.

It turns out there is a simpler way to get the same result here, called Youden’s J statistic.

The statistic is calculated as:

- J = Sensitivity + Specificity – 1

Given that we have Sensitivity (TPR) and the complement of the specificity (FPR), we can calculate it as:

- J = Sensitivity + (1 – FalsePositiveRate) – 1

Which we can restate as:

- J = TruePositiveRate – FalsePositiveRate

We can then choose the threshold with the largest J statistic value. For example:

    ...
    # calculate roc curves
    fpr, tpr, thresholds = roc_curve(testy, yhat)
    # get the best threshold
    J = tpr - fpr
    ix = argmax(J)
    best_thresh = thresholds[ix]
    print('Best Threshold=%f' % (best_thresh))

Plugging this in, the complete example is listed below.

    # roc curve for logistic regression model with optimal threshold
    from numpy import argmax
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_curve
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    # predict probabilities
    yhat = model.predict_proba(testX)
    # keep probabilities for the positive outcome only
    yhat = yhat[:, 1]
    # calculate roc curves
    fpr, tpr, thresholds = roc_curve(testy, yhat)
    # get the best threshold
    J = tpr - fpr
    ix = argmax(J)
    best_thresh = thresholds[ix]
    print('Best Threshold=%f' % (best_thresh))

We can see that this simpler approach locates the same optimal threshold directly.

Best Threshold=0.016153

## Optimal Threshold for Precision-Recall Curve

Unlike the ROC Curve, a precision-recall curve focuses on the performance of a classifier on the positive (minority) class only.

Precision is the ratio of the number of true positives divided by the sum of the true positives and false positives. It describes how good a model is at predicting the positive class. Recall is calculated as the ratio of the number of true positives divided by the sum of the true positives and the false negatives. Recall is the same as sensitivity.

A precision-recall curve is calculated by creating crisp class labels for probability predictions across a set of thresholds and calculating the precision and recall for each threshold. A line plot is created for the thresholds in ascending order with recall on the x-axis and precision on the y-axis.

A no-skill model is represented by a horizontal line with a precision that is the ratio of positive examples in the dataset (e.g. Positives / (Positives + Negatives)), or 0.01 on our synthetic dataset. A perfect skill classifier has full precision and recall, shown as a dot in the top-right corner.

We can use the same model and dataset from the previous section and evaluate the probability predictions for a logistic regression model using a precision-recall curve. The precision_recall_curve() function can be used to calculate the curve, returning the precision and recall scores for each threshold as well as the thresholds used.

    ...
    # calculate pr-curve
    precision, recall, thresholds = precision_recall_curve(testy, yhat)

Tying this together, the complete example of calculating a precision-recall curve for a logistic regression on an imbalanced classification problem is listed below.

    # pr curve for logistic regression model
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_recall_curve
    from matplotlib import pyplot
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    # predict probabilities
    yhat = model.predict_proba(testX)
    # keep probabilities for the positive outcome only
    yhat = yhat[:, 1]
    # calculate pr-curve
    precision, recall, thresholds = precision_recall_curve(testy, yhat)
    # plot the precision-recall curve for the model
    no_skill = len(testy[testy==1]) / len(testy)
    pyplot.plot([0,1], [no_skill,no_skill], linestyle='--', label='No Skill')
    pyplot.plot(recall, precision, marker='.', label='Logistic')
    # axis labels
    pyplot.xlabel('Recall')
    pyplot.ylabel('Precision')
    pyplot.legend()
    # show the plot
    pyplot.show()

Running the example calculates the precision and recall for each threshold and creates a precision-recall plot showing that the model has some skill across a range of thresholds on this dataset.

If we required crisp class labels from this model, which threshold would achieve the best result?

If we are interested in a threshold that results in the best balance of precision and recall, then this is the same as optimizing the F-measure that summarizes the harmonic mean of both measures.

As in the previous section, the naive approach to finding the optimal threshold would be to calculate the F-measure for each threshold. We can achieve the same effect by converting the precision and recall measures to F-measure directly; for example:

    ...
    # convert to f score
    fscore = (2 * precision * recall) / (precision + recall)
    # locate the index of the largest f score
    ix = argmax(fscore)
    print('Best Threshold=%f, F-Score=%.3f' % (thresholds[ix], fscore[ix]))
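Two details of this calculation are worth guarding against on other datasets. The precision and recall arrays returned by precision_recall_curve() have one more element than the thresholds array (a final point with precision 1 and recall 0 that has no threshold), and the F-measure is undefined (0 / 0) wherever precision and recall are both zero. The sketch below is one defensive variant; the helper name and the small input arrays are hypothetical and only mimic the shape of the real output.

```python
from numpy import array, argmax, divide, zeros_like

def best_f_threshold(precision, recall, thresholds):
    # drop the final (precision=1, recall=0) point, which has no threshold
    p, r = precision[:-1], recall[:-1]
    denom = p + r
    # F-measure, defined here as 0 where precision + recall is 0
    fscore = divide(2 * p * r, denom, out=zeros_like(denom), where=denom > 0)
    ix = argmax(fscore)
    return thresholds[ix], fscore[ix]

# hypothetical arrays shaped like the output of precision_recall_curve()
precision = array([0.2, 0.5, 0.0, 1.0])
recall = array([1.0, 0.5, 0.0, 0.0])
thresholds = array([0.1, 0.4, 0.9])
best_thresh, best_f = best_f_threshold(precision, recall, thresholds)
print('Best Threshold=%f, F-Score=%.3f' % (best_thresh, best_f))
```

Without the guard, a NaN in the F-measure array would be selected by argmax(), which is why the undefined entries are set to zero here.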

We can then plot the point on the precision-recall curve.

The complete example is listed below.

    # optimal threshold for precision-recall curve with logistic regression model
    from numpy import argmax
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_recall_curve
    from matplotlib import pyplot
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    # predict probabilities
    yhat = model.predict_proba(testX)
    # keep probabilities for the positive outcome only
    yhat = yhat[:, 1]
    # calculate pr-curve
    precision, recall, thresholds = precision_recall_curve(testy, yhat)
    # convert to f score
    fscore = (2 * precision * recall) / (precision + recall)
    # locate the index of the largest f score
    ix = argmax(fscore)
    print('Best Threshold=%f, F-Score=%.3f' % (thresholds[ix], fscore[ix]))
    # plot the precision-recall curve for the model
    no_skill = len(testy[testy==1]) / len(testy)
    pyplot.plot([0,1], [no_skill,no_skill], linestyle='--', label='No Skill')
    pyplot.plot(recall, precision, marker='.', label='Logistic')
    pyplot.scatter(recall[ix], precision[ix], marker='o', color='black', label='Best')
    # axis labels
    pyplot.xlabel('Recall')
    pyplot.ylabel('Precision')
    pyplot.legend()
    # show the plot
    pyplot.show()

Running the example first calculates the F-measure for each threshold, then locates the score and threshold with the largest value.

In this case, we can see that the best F-measure was 0.756 achieved with a threshold of about 0.25.

Best Threshold=0.256036, F-Score=0.756

The precision-recall curve is plotted, and this time the threshold with the optimal F-measure is plotted with a larger black dot.

This threshold could then be used when making probability predictions in the future that must be converted from probabilities to crisp class labels.
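Applying a chosen threshold to new data might look like the following sketch. The helper name predict_labels() and the stand-in model class are hypothetical and exist only to keep the example self-contained; in practice the model would be the fitted classifier, and the sketch assumes the positive class is column 1 of predict_proba().

```python
from numpy import array

def predict_labels(model, X, threshold):
    # probability of the positive class is assumed to be column 1
    probs = model.predict_proba(X)[:, 1]
    # values at or above the threshold become class 1, the rest class 0
    return (probs >= threshold).astype('int')

class FixedProbModel:
    # hypothetical stand-in for a fitted classifier, for demonstration only
    def predict_proba(self, X):
        pos = array([row[0] for row in X], dtype=float)
        return array([1.0 - pos, pos]).T

# three made-up rows whose single feature doubles as the positive probability
new_X = [[0.10], [0.30], [0.70]]
print(predict_labels(FixedProbModel(), new_X, threshold=0.256))
```

Note that the second row is labeled positive at a threshold of 0.256 but would be labeled negative at the default of 0.5.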

## Optimal Threshold Tuning

Sometimes, we simply have a model and we wish to know the best threshold directly.

In this case, we can define a set of thresholds and then evaluate predicted probabilities under each in order to find and select the optimal threshold.

We can demonstrate this with a worked example.

First, we can fit a logistic regression model on our synthetic classification problem, then predict class labels and evaluate them using the F-Measure, which is the harmonic mean of precision and recall.

This will use the default threshold of 0.5 when interpreting the probabilities predicted by the logistic regression model.

The complete example is listed below.

    # logistic regression for imbalanced classification
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    # predict labels
    yhat = model.predict(testX)
    # evaluate the model
    score = f1_score(testy, yhat)
    print('F-Score: %.5f' % score)

Running the example, we can see that the model achieved an F-Measure of about 0.70 on the test dataset.

F-Score: 0.70130

Now we can use the same model on the same dataset and instead of predicting class labels directly, we can predict probabilities.

    ...
    # predict probabilities
    yhat = model.predict_proba(testX)

We only require the probabilities for the positive class.

    ...
    # keep probabilities for the positive outcome only
    probs = yhat[:, 1]

Next, we can define a set of thresholds to evaluate the probabilities. In this case, we will test all thresholds between 0.0 and 1.0 with a step size of 0.001, that is, we will test 0.0, 0.001, 0.002, 0.003, and so on to 0.999.

    ...
    # define thresholds
    thresholds = arange(0, 1, 0.001)

Next, we need a way of using a single threshold to interpret the predicted probabilities.

This can be achieved by mapping all values equal to or greater than the threshold to 1 and all values less than the threshold to 0. We will define a *to_labels()* function to do this that will take the probabilities and threshold as arguments and return an array of integers in {0, 1}.

    # apply threshold to positive probabilities to create labels
    def to_labels(pos_probs, threshold):
        return (pos_probs >= threshold).astype('int')

We can then call this function for each threshold and evaluate the resulting labels using the *f1_score()*.

We can do this in a single line, as follows:

    ...
    # evaluate each threshold
    scores = [f1_score(testy, to_labels(probs, t)) for t in thresholds]

We now have an array of scores that evaluate each threshold in our array of thresholds.

All we need to do now is locate the array index that has the largest score (best F-Measure) and we will have the optimal threshold and its evaluation.

    ...
    # get best threshold
    ix = argmax(scores)
    print('Threshold=%.3f, F-Score=%.5f' % (thresholds[ix], scores[ix]))

Tying this all together, the complete example of tuning the threshold for the logistic regression model on the synthetic imbalanced classification dataset is listed below.

    # search thresholds for imbalanced classification
    from numpy import arange
    from numpy import argmax
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score

    # apply threshold to positive probabilities to create labels
    def to_labels(pos_probs, threshold):
        return (pos_probs >= threshold).astype('int')

    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    # predict probabilities
    yhat = model.predict_proba(testX)
    # keep probabilities for the positive outcome only
    probs = yhat[:, 1]
    # define thresholds
    thresholds = arange(0, 1, 0.001)
    # evaluate each threshold
    scores = [f1_score(testy, to_labels(probs, t)) for t in thresholds]
    # get best threshold
    ix = argmax(scores)
    print('Threshold=%.3f, F-Score=%.5f' % (thresholds[ix], scores[ix]))

Running the example reports the optimal threshold as 0.251 (compared to the default of 0.5) that achieves an F-Measure of about 0.75 (compared to 0.70).

Threshold=0.251, F-Score=0.75556

You can use this example as a template when tuning the threshold on your own problem, allowing you to substitute your own model, metric, and even resolution of thresholds that you want to evaluate.
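To make the substitution of metric and threshold resolution concrete, the search can be wrapped in a small function that accepts any metric with the signature metric(y_true, y_pred). This is a sketch only: tune_threshold() and balanced_accuracy() are hypothetical helpers written for this example, and the labels and probabilities are made up.

```python
from numpy import arange, argmax, array

def tune_threshold(y_true, probs, metric, thresholds):
    # score the crisp labels produced by each candidate threshold
    scores = [metric(y_true, (probs >= t).astype('int')) for t in thresholds]
    ix = argmax(scores)
    return thresholds[ix], scores[ix]

def balanced_accuracy(y_true, y_pred):
    # mean of sensitivity and specificity, written out to stay self-contained
    sens = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()
    spec = ((y_pred == 0) & (y_true == 0)).sum() / (y_true == 0).sum()
    return (sens + spec) / 2.0

# hypothetical labels and predicted probabilities for the positive class
y_true = array([0, 0, 0, 0, 0, 0, 1, 1])
probs = array([0.02, 0.11, 0.12, 0.21, 0.31, 0.46, 0.37, 0.62])
# a coarser grid than the tutorial's 0.001 step, chosen only for brevity
best_t, best_score = tune_threshold(y_true, probs, balanced_accuracy, arange(0.0, 1.0, 0.05))
print('Threshold=%.2f, Score=%.3f' % (best_t, best_score))
```

When several thresholds tie for the best score, argmax() returns the first, so the smallest such threshold is reported.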

## Summary

In this tutorial, you discovered how to tune the optimal threshold when converting probabilities to crisp class labels for imbalanced classification.

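Specifically, you learned:

- How to calculate the optimal threshold for the ROC Curve and the Precision-Recall Curve directly.
- How to manually search threshold values for a chosen model and model evaluation metric.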

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Threshold-Moving for Imbalanced Classification appeared first on Machine Learning Mastery.

## [D] Rescaling a dataset with consistent feature with DTW(?)

Hi all, Currently, my dataset t1 consists of 85 values. I want to stretch it to a length of 120. By doing this, I would like to keep the shape of the original dataset. As you can see at the plot above, the rescaled dataset with zoom destroys the original shape. However, as you can see below, the rescaled dataset looks completely different than the original.

https://preview.redd.it/oh239ugydsf41.png?width=1200&format=png&auto=webp&s=ffce1053f6810011dfd45d0b2f60efb84258e822

submitted by /u/xman236
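For reference, one simple way to stretch a series while keeping its overall shape is plain linear interpolation onto a longer index. This is not DTW and may not match what the poster tried; it is a minimal sketch in which the sine wave is a hypothetical stand-in for t1 and resample_linear() is an illustrative helper name.

```python
from numpy import interp, linspace, sin, pi

def resample_linear(values, new_length):
    # place old and new samples on a shared 0..1 axis and interpolate linearly
    old_x = linspace(0.0, 1.0, len(values))
    new_x = linspace(0.0, 1.0, new_length)
    return interp(new_x, old_x, values)

# hypothetical 85-point series standing in for t1
t1 = sin(linspace(0.0, 2.0 * pi, 85))
t1_stretched = resample_linear(t1, 120)
print(len(t1_stretched))
```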

## [D] Propose a Math function for this scaling please

I have the following Linear Scaling. Basically a range from 1 to 60m, with a 128 amount subdivision

https://preview.redd.it/vh1kpm3xccf41.png?width=1029&format=png&auto=webp&s=7f35493893cfbf4e58113131fd5856e9c37c76eb

I need the objects that are closer towards my camera to have a higher "density", and as I move towards 60, have a lower density but still maintain the 128 subdivision. Not sure what is the right math function I should use.

submitted by /u/soulslicer0
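One candidate for this kind of spacing is a geometric (logarithmic) subdivision, where each interval is a constant factor wider than the one before it. The sketch below assumes that "128 subdivision" means 128 intervals between 1 and 60, which gives 129 boundaries; that reading is an assumption, as is the choice of a geometric progression over other nonlinear mappings.

```python
from numpy import geomspace, diff

# 128 intervals between 1 and 60 need 129 boundaries (assumed interpretation)
edges = geomspace(1.0, 60.0, num=129)
# interval widths grow with distance, so density is highest near the camera
widths = diff(edges)
print('first width=%.4f, last width=%.4f' % (widths[0], widths[-1]))
```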

## How to Configure XGBoost for Imbalanced Classification

The XGBoost algorithm is effective for a wide range of regression and classification predictive modeling problems.

It is an efficient implementation of the stochastic gradient boosting algorithm and offers a range of hyperparameters that give fine-grained control over the model training procedure. Although the algorithm performs well in general, even on imbalanced classification datasets, it offers a way to tune the training algorithm to pay more attention to misclassification of the minority class for datasets with a skewed class distribution.

This modified version of XGBoost is referred to as Class Weighted XGBoost or Cost-Sensitive XGBoost and can offer better performance on binary classification problems with a severe class imbalance.

In this tutorial, you will discover weighted XGBoost for imbalanced classification.

After completing this tutorial, you will know:

Let’s get started.

## Tutorial Overview

This tutorial is divided into four parts; they are:

## Imbalanced Classification Dataset

Before we dive into XGBoost for imbalanced classification, let’s first define an imbalanced classification dataset.

We can use the make_classification() scikit-learn function to define a synthetic imbalanced two-class classification dataset. We will generate 10,000 examples with an approximate 1:100 minority to majority class ratio.

    ...
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=2, weights=[0.99], flip_y=0, random_state=7)

Once generated, we can summarize the class distribution to confirm that the dataset was created as we expected.

    ...
    # summarize class distribution
    counter = Counter(y)
    print(counter)

Finally, we can create a scatter plot of the examples and color them by class label to help understand the challenge of classifying examples from this dataset.

    ...
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Tying this together, the complete example of generating the synthetic dataset and plotting the examples is listed below.

    # Generate and plot a synthetic imbalanced classification dataset
    from collections import Counter
    from sklearn.datasets import make_classification
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=2, weights=[0.99], flip_y=0, random_state=7)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Running the example first creates the dataset and summarizes the class distribution.

We can see that the dataset has an approximate 1:100 class distribution with a little less than 10,000 examples in the majority class and 100 in the minority class.

Counter({0: 9900, 1: 100})

Next, a scatter plot of the dataset is created showing the large mass of examples for the majority class (blue) and a small number of examples for the minority class (orange), with some modest class overlap.

## XGBoost Model for Classification

XGBoost is short for **Extreme Gradient Boosting** and is an efficient implementation of the stochastic gradient boosting machine learning algorithm.

The stochastic gradient boosting algorithm, also called gradient boosting machines or tree boosting, is a powerful machine learning technique that performs well or even best on a wide range of challenging machine learning problems.

Tree boosting has been shown to give state-of-the-art results on many standard classification benchmarks.

— XGBoost: A Scalable Tree Boosting System, 2016.

It is an ensemble algorithm built from decision trees, in which new trees correct the errors made by the trees that are already part of the model. Trees are added until no further improvements can be made to the model.

XGBoost provides a highly efficient implementation of the stochastic gradient boosting algorithm and access to a suite of model hyperparameters designed to provide control over the model training process.

The most important factor behind the success of XGBoost is its scalability in all scenarios. The system runs more than ten times faster than existing popular solutions on a single machine and scales to billions of examples in distributed or memory-limited settings.

— XGBoost: A Scalable Tree Boosting System, 2016.

XGBoost is an effective machine learning model, even on datasets where the class distribution is skewed.

Before any modification or tuning is made to the XGBoost algorithm for imbalanced classification, it is important to test the default XGBoost model and establish a baseline in performance.

Although the XGBoost library has its own Python API, we can use XGBoost models with the scikit-learn API via the XGBClassifier wrapper class. An instance of the model can be instantiated and used just like any other scikit-learn class for model evaluation. For example:

    ...
    # define model
    model = XGBClassifier()

We will use repeated cross-validation to evaluate the model, with three repeats of 10-fold cross-validation.

The model performance will be reported using the mean ROC area under curve (ROC AUC) averaged over repeats and all folds.

    ...
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    # summarize performance
    print('Mean ROC AUC: %.5f' % mean(scores))

Tying this together, the complete example of defining and evaluating a default XGBoost model on the imbalanced classification problem is listed below.

    # fit xgboost on an imbalanced classification dataset
    from numpy import mean
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from xgboost import XGBClassifier
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=2, weights=[0.99], flip_y=0, random_state=7)
    # define model
    model = XGBClassifier()
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    # summarize performance
    print('Mean ROC AUC: %.5f' % mean(scores))

Running the example evaluates the default XGBoost model on the imbalanced dataset and reports the mean ROC AUC.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

We can see that the model has skill, achieving a ROC AUC above 0.5, in this case achieving a mean score of 0.95724.

Mean ROC AUC: 0.95724

This provides a baseline for comparison for any hyperparameter tuning performed for the default XGBoost algorithm.

## Weighted XGBoost for Class Imbalance

Although the XGBoost algorithm performs well for a wide range of challenging problems, it offers a large number of hyperparameters, many of which require tuning in order to get the most out of the algorithm on a given dataset.

The implementation provides a hyperparameter designed to tune the behavior of the algorithm for imbalanced classification problems; this is the **scale_pos_weight** hyperparameter.

By default, the *scale_pos_weight* hyperparameter is set to the value of 1.0 and has the effect of weighing the balance of positive examples, relative to negative examples when boosting decision trees. For an imbalanced binary classification dataset, the negative class refers to the majority class (class 0) and the positive class refers to the minority class (class 1).

XGBoost is trained to minimize a loss function and the “*gradient*” in gradient boosting refers to the steepness of this loss function, e.g. the amount of error. A small gradient means a small error and, in turn, a small change to the model to correct the error. A large error gradient during training in turn results in a large correction.

Gradients are used as the basis for fitting subsequent trees added to boost or correct errors made by the existing state of the ensemble of decision trees.

The *scale_pos_weight* value is used to scale the gradient for the positive class.

This has the effect of scaling errors made by the model during training on the positive class and encourages the model to over-correct them. In turn, this can help the model achieve better performance when making predictions on the positive class. Pushed too far, it may result in the model overfitting the positive class at the cost of worse performance on the negative class or both classes.
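The effect can be illustrated with a small sketch. It assumes the binary logistic loss and uses a hypothetical helper, *weighted_logistic_gradient()*, that multiplies both the gradient and the second-order term (hessian) by the weight for positive examples only. This is a simplification for intuition, not the library's actual code.

```python
# simplified sketch of how a positive class weight scales the logistic loss gradient
from math import exp

def sigmoid(z):
    """Map a raw score to a probability."""
    return 1.0 / (1.0 + exp(-z))

def weighted_logistic_gradient(y_true, raw_score, scale_pos_weight=1.0):
    """Return (gradient, hessian) of the log loss for one example, scaled for positives."""
    p = sigmoid(raw_score)
    grad = p - y_true
    hess = p * (1.0 - p)
    # assumption of this sketch: only positive examples (label 1) are re-weighted
    weight = scale_pos_weight if y_true == 1 else 1.0
    return weight * grad, weight * hess

# the model currently gives both examples a raw score of 0.0 (a probability of 0.5)
for label in (0, 1):
    for w in (1.0, 99.0):
        g, h = weighted_logistic_gradient(label, 0.0, scale_pos_weight=w)
        print('label=%d scale_pos_weight=%.0f gradient=%.2f hessian=%.2f' % (label, w, g, h))
```

With the same prediction error, the positive example contributes a gradient 99 times larger when the weight is 99, while the negative example is unaffected.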

As such, the *scale_pos_weight* can be used to train a class-weighted or cost-sensitive version of XGBoost for imbalanced classification.

A sensible default value to set for the *scale_pos_weight* hyperparameter is the inverse of the class distribution. For example, for a dataset with a 1 to 100 ratio for examples in the minority to majority classes, the *scale_pos_weight* can be set to 100. This will give classification errors made by the model on the minority class (positive class) 100 times more impact, and in turn, 100 times more correction than errors made on the majority class.

For example:

    ...
    # define model
    model = XGBClassifier(scale_pos_weight=100)

The XGBoost documentation suggests a fast way to estimate this value using the training dataset as the total number of examples in the majority class divided by the total number of examples in the minority class.

For example, we can calculate this value for our synthetic classification dataset. We would expect this to be about 100, or more precisely, 99 given the weighting we used to define the dataset.

    ...
    # count examples in each class
    counter = Counter(y)
    # estimate scale_pos_weight value
    estimate = counter[0] / counter[1]
    print('Estimate: %.3f' % estimate)

The complete example of estimating the value for the *scale_pos_weight* XGBoost hyperparameter is listed below.

    # estimate a value for the scale_pos_weight xgboost hyperparameter
    from sklearn.datasets import make_classification
    from collections import Counter
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=2, weights=[0.99], flip_y=0, random_state=7)
    # count examples in each class
    counter = Counter(y)
    # estimate scale_pos_weight value
    estimate = counter[0] / counter[1]
    print('Estimate: %.3f' % estimate)

Running the example creates the dataset and estimates the value of the *scale_pos_weight* hyperparameter as 99, as we expected.

Estimate: 99.000
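The value is 99 rather than 100 because the ratio is majority count over minority count, not one over the minority fraction. The short sketch below reproduces the arithmetic from the dataset definition alone. It assumes the classes are split exactly 99 percent to 1 percent, which is consistent with the estimate reported above; the variable names are illustrative.

```python
# derive the expected class counts and ratio from the dataset definition
n_samples = 10000
majority_fraction = 0.99  # corresponds to weights=[0.99] in make_classification

# assumption: the split is exact because no label noise is added (flip_y=0)
n_majority = round(n_samples * majority_fraction)
n_minority = n_samples - n_majority
estimate = n_majority / n_minority
print('Majority: %d, Minority: %d' % (n_majority, n_minority))
print('Estimate: %.3f' % estimate)
```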

We will use this value directly in the configuration of the XGBoost model and evaluate its performance on the dataset using repeated k-fold cross-validation.

We would expect some improvement in ROC AUC, although this is not guaranteed depending on the difficulty of the dataset and the chosen configuration of the XGBoost model.

The complete example is listed below.

    # fit balanced xgboost on an imbalanced classification dataset
    from numpy import mean
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from xgboost import XGBClassifier
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=2, weights=[0.99], flip_y=0, random_state=7)
    # define model
    model = XGBClassifier(scale_pos_weight=99)
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    # summarize performance
    print('Mean ROC AUC: %.5f' % mean(scores))

Running the example prepares the synthetic imbalanced classification dataset, then evaluates the class-weighted version of the XGBoost training algorithm using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see a modest lift in performance from a ROC AUC of about 0.95724 with *scale_pos_weight=1* in the previous section to a value of 0.95990 with *scale_pos_weight=99*.

Mean ROC AUC: 0.95990

## Tune the Class Weighting Hyperparameter

The heuristic for setting the *scale_pos_weight* is effective for many situations.

Nevertheless, it is possible that better performance can be achieved with a different class weighting, and this too will depend on the choice of performance metric used to evaluate the model.

In this section, we will grid search a range of different class weightings for class-weighted XGBoost and discover which results in the best ROC AUC score.

We will try the following weightings for the positive class:

- 1
- 10
- 25
- 50
- 75
- 99
- 100
- 1000

These can be defined as grid search parameters for the GridSearchCV class as follows:

    ...
    # define grid
    weights = [1, 10, 25, 50, 75, 99, 100, 1000]
    param_grid = dict(scale_pos_weight=weights)

We can perform the grid search on these parameters using repeated cross-validation and estimate model performance using ROC AUC:

    ...
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid search
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv, scoring='roc_auc')

Once executed, we can summarize the best configuration as well as all of the results as follows:

    ...
    # report the best configuration
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    # report all configurations
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Tying this together, the example below grid searches eight different positive class weights for the XGBoost algorithm on the imbalanced dataset.

We might expect that the heuristic class weighting is the best performing configuration.

    # grid search positive class weights with xgboost for imbalanced classification
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import RepeatedStratifiedKFold
    from xgboost import XGBClassifier
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=2, weights=[0.99], flip_y=0, random_state=7)
    # define model
    model = XGBClassifier()
    # define grid
    weights = [1, 10, 25, 50, 75, 99, 100, 1000]
    param_grid = dict(scale_pos_weight=weights)
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define grid search
    grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1, cv=cv, scoring='roc_auc')
    # execute the grid search
    grid_result = grid.fit(X, y)
    # report the best configuration
    print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
    # report all configurations
    means = grid_result.cv_results_['mean_test_score']
    stds = grid_result.cv_results_['std_test_score']
    params = grid_result.cv_results_['params']
    for mean, stdev, param in zip(means, stds, params):
        print("%f (%f) with: %r" % (mean, stdev, param))

Running the example evaluates each positive class weighting using repeated k-fold cross-validation and reports the best configuration and the associated mean ROC AUC score.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the *scale_pos_weight=99* positive class weighting achieved the best mean ROC AUC score. This matches the configuration for the general heuristic.

It’s interesting to note that all values larger than the default value of 1 have a better mean ROC AUC, even the aggressive value of 1,000. It’s also interesting to note that a value of 99 performed better than the value of 100, which I might have used had I not calculated the heuristic as suggested in the XGBoost documentation.

    Best: 0.959901 using {'scale_pos_weight': 99}
    0.957239 (0.031619) with: {'scale_pos_weight': 1}
    0.958219 (0.027315) with: {'scale_pos_weight': 10}
    0.958278 (0.027438) with: {'scale_pos_weight': 25}
    0.959199 (0.026171) with: {'scale_pos_weight': 50}
    0.959204 (0.025842) with: {'scale_pos_weight': 75}
    0.959901 (0.025499) with: {'scale_pos_weight': 99}
    0.959141 (0.025409) with: {'scale_pos_weight': 100}
    0.958761 (0.024757) with: {'scale_pos_weight': 1000}
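When reading results like these, it may help to compare the gap between configurations with the spread of scores across folds. The sketch below copies the mean and standard deviation values from the grid search results reported above and expresses each configuration's gap to the best score in units of its own standard deviation. This is a rough sanity check, not a formal significance test, and the variable names are illustrative.

```python
# compare the gap between configurations with the spread across folds
# (mean, std) pairs copied from the grid search results reported above
results = {
    1: (0.957239, 0.031619),
    10: (0.958219, 0.027315),
    25: (0.958278, 0.027438),
    50: (0.959199, 0.026171),
    75: (0.959204, 0.025842),
    99: (0.959901, 0.025499),
    100: (0.959141, 0.025409),
    1000: (0.958761, 0.024757),
}
# find the weight with the highest mean ROC AUC
best_weight = max(results, key=lambda w: results[w][0])
best_mean = results[best_weight][0]
print('Best weight: %d' % best_weight)
# report how far each configuration is from the best, relative to its own std
for weight, (mean_auc, std_auc) in results.items():
    gap = best_mean - mean_auc
    print('scale_pos_weight=%-4d gap=%.6f (%.2f std)' % (weight, gap, gap / std_auc))
```

On these numbers, every gap is a small fraction of one standard deviation, so the ordering of nearby weights such as 99 and 100 may well change with a different random seed.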

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers

### Books

### APIs

## Summary

In this tutorial, you discovered weighted XGBoost for imbalanced classification.

Specifically, you learned:

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Configure XGBoost for Imbalanced Classification appeared first on Machine Learning Mastery.

## [P] Web-based Interactive Tool for Visualizing Adversarial Attacks on Imagenet SOTA models in Pytorch

Hello! I've made an educational interactive Flask app to generate adversarial examples using some common attack strategies (more to be added). https://github.com/dsgiitr/adversarial_lab

It is based on another Adversarial DNN Playground, but that was implemented on the MNIST handwriting dataset only.

[Image: FGSM on resnet]

You can upload images and get the perturbed image and the perturbation, along with the top 5 predicted labels for both images. The GAE library contains all the attacks. It is a simple and easy-to-understand implementation of popular strategies. Feedback appreciated! Enjoy!

submitted by /u/KONOHA_

## [discussion] AI most promising open source projects for 2020

Want to be up to date with the latest open source projects in AI? Drop a comment or submit a pull-request, if you believe that a relevant project was left behind. https://preview.redd.it/i5bv2l3bs4e41.png?width=1500&format=png&auto=webp&s=fe428c37f955519a1bc62cdc48418e535b7d4962

submitted by /u/haggais