Grumpy, euphoric, and smart classifiers (interactive)
There are several reasons why you might be interested in evaluating the performance of a classification algorithm. Maybe you are a machine learning researcher who wants to show the superiority of a new method, maybe you are a tech-savvy physician who put together a neural net to detect a rare disease in patients, or maybe you are an investor who needs to evaluate the soundness of a proposed venture. If you fit into the first category, this blog post is probably nothing new. However, if you see yourself as a machine learning practitioner, or you need to evaluate someone else’s work, it is helpful to be able to interpret performance metrics such as accuracy or AUPRC with respect to trivial classifiers. This blog post aims to help you with exactly that.
Evaluation Metrics
We set our scene in binary classification land, where we can find a plethora of evaluation metrics. Although they appear in the table below, I don’t see the need to list all their definitions. That said, I will briefly discuss the confusion matrix, the building block underlying all of those metrics:
Truth \ Prediction | Positive Class (predicted) | Negative Class (predicted) |
---|---|---|
Positive Class (true) | True Positive (TP) | False Negative (FN) |
Negative Class (true) | False Positive (FP) | True Negative (TN) |
Confusion matrices appear in a variety of scientific fields, and sometimes it is confusing (hence the name) how their entries are defined (e.g., is a healthy patient a positive, since she/he does not have the disease, or a negative, since she/he tested negative?). In machine learning, the definitions are fairly easy to remember: TP: a sample that was Truly predicted to belong to the Positive class, FN: a sample that was Falsely predicted to belong to the Negative class, and so on. A commonly used evaluation metric we can build from these values is accuracy, defined as $\text{acc}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$.
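To make the formula concrete, here is a minimal Python sketch that computes accuracy from the four confusion-matrix entries; the counts are made up purely for illustration.

```python
# Made-up confusion-matrix counts, purely for illustration.
tp, fn, fp, tn = 40, 10, 5, 45

# Accuracy: fraction of all samples that were classified correctly.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.2f}")  # 0.85
```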
The big caveat with some of these metrics is that they can be unstable under class imbalance, and a high performance value might be misleading. This is why it is important to understand how a trivial classifier would perform on the imbalanced data set. Furthermore, you should always consider the problem you want to solve and what the most relevant metric(s) are. For example, if you want a predictor that reliably rules out that a patient has a certain disease, you should focus on the Negative Predictive Value.
Trivial Classifiers
I introduce four trivial classifiers whose performance metrics you should know and take into consideration when you judge your model (a minimal sketch of all four follows the list). If, for example, your algorithm reaches the same high accuracy as the grumpy classifier (thanks, Michael Moor, for this nice neologism), it had better outperform it by far in other metrics:
- The random classifier: For each sample, the positive/negative class is predicted with probability 0.5.
- The grumpy classifier: For each sample, the negative class is predicted with probability 1.0.
- The euphoric classifier: For each sample, the positive class is predicted with probability 1.0.
- The “smart” classifier: For each sample, the positive class is predicted with probability equal to the prevalence of positives in the data set.
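As a rough sketch (not the code behind the interactive table below), the four classifiers could be implemented like this; predictions are boolean arrays where True denotes the positive class.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_clf(n_samples):
    """Predict the positive class with probability 0.5."""
    return rng.random(n_samples) < 0.5

def grumpy_clf(n_samples):
    """Always predict the negative class."""
    return np.zeros(n_samples, dtype=bool)

def euphoric_clf(n_samples):
    """Always predict the positive class."""
    return np.ones(n_samples, dtype=bool)

def smart_clf(n_samples, prevalence):
    """Predict the positive class with probability equal to the prevalence."""
    return rng.random(n_samples) < prevalence
```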
Interactive Metrics Calculator
Below, you will find a table that contains a variety of metrics for each of the four classifiers. You can reduce the prevalence of the positive class step by step and make the data set less balanced. A “good” metric stays similar under prevalence shift. Notice, for example, how the grumpy classifier’s accuracy increases with growing class imbalance (a small numerical sketch of this effect follows the table).
(Notice that the performance value of each “random” classifier can vary. I introduce random noise to the predictions each time you change the slider to have a more visually pleasing ROC and PR-Curve.)
Metric | Random | Grumpy | Euphoric | "Smart" |
---|---|---|---|---|
Accuracy | - | - | - | - |
Balanced Accuracy | - | - | - | - |
Precision | - | - | - | - |
Negative Predictive Value | - | - | - | - |
Sensitivity | - | - | - | - |
Specificity | - | - | - | - |
AUROC | - | - | - | - |
AUPRC | - | - | - | - |
F1 | - | - | - | - |
MCC | - | - | - | - |
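If you cannot play with the slider, the following sketch (assuming scikit-learn is available) reproduces the grumpy column numerically: its accuracy climbs as the positive class becomes rarer, while its balanced accuracy stays at 0.5.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

rng = np.random.default_rng(0)
n = 10_000  # number of samples in the simulated data set

for prevalence in [0.5, 0.3, 0.1, 0.01]:
    y_true = rng.random(n) < prevalence   # positives occur with the given prevalence
    y_grumpy = np.zeros(n, dtype=bool)    # the grumpy classifier always says "negative"
    print(
        f"prevalence={prevalence:.2f}  "
        f"accuracy={accuracy_score(y_true, y_grumpy):.2f}  "
        f"balanced accuracy={balanced_accuracy_score(y_true, y_grumpy):.2f}"
    )
```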
Curves
Typically, we summarize the performance of a classifier with the Receiver Operating Characteristic (ROC) and/or the Precision-Recall Curve (PRC) and their respective areas under the curve (AUC): AUROC & AUPRC. Recall and True Positive Rate are synonyms, and hence the difference between the curves lies in the second dimension, i.e., FPR vs. Precision. Taking a look at their definitions, $\text{FPR}=\frac{\text{FP}}{\text{FP}+\text{TN}}$ and $\text{Precision}=\frac{\text{TP}}{\text{TP}+\text{FP}}$, we find that precision gives us an idea of how many of the samples that we predicted to be positive are actual positives. This is especially helpful for identifying a “euphoric” classifier that has perfect sensitivity but low precision in an imbalanced setting.
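A quick numerical sanity check (a sketch assuming scikit-learn; the 5% prevalence is made up) shows this effect for the euphoric classifier: sensitivity and FPR are both 1, while precision collapses to the prevalence.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n = 10_000
prevalence = 0.05  # assumed, heavily imbalanced setting

y_true = rng.random(n) < prevalence
y_euphoric = np.ones(n, dtype=bool)  # always predict the positive class

fpr = np.sum(~y_true & y_euphoric) / np.sum(~y_true)  # FP / (FP + TN)
print(f"Sensitivity: {recall_score(y_true, y_euphoric):.2f}")    # 1.00
print(f"FPR:         {fpr:.2f}")                                 # 1.00
print(f"Precision:   {precision_score(y_true, y_euphoric):.2f}") # ~0.05 (the prevalence)
```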
An important property of the PR curve is that its baseline (dashed black line) shifts with the prevalence of the positive class (try it out!). This means that if we want to evaluate a PR curve, we need to look at the gain over the prevalence (the average trivial classifier has AUPRC = prevalence). This also means that we should take care that the minority class is always the positive class. Otherwise, a trivial classifier might reach an AUPRC of 0.9 if 90% of our data set belongs to the positive class. With the ROC, on the other hand, we always compare the AUC to 0.5, which is the mean performance of a trivial classifier.
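The same point can be checked numerically; the sketch below (assuming scikit-learn, with average precision used as an AUPRC estimate) scores completely uninformative random predictions: AUROC lands near 0.5 regardless of prevalence, while AUPRC lands near the prevalence itself.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 10_000

for prevalence in [0.5, 0.1, 0.01]:
    y_true = rng.random(n) < prevalence
    y_score = rng.random(n)  # random, uninformative scores
    print(
        f"prevalence={prevalence:.2f}  "
        f"AUROC={roc_auc_score(y_true, y_score):.2f}  "           # ~0.50
        f"AUPRC={average_precision_score(y_true, y_score):.2f}"   # ~prevalence
    )
```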