The Case Against Precision as a Model Selection Criterion

Recently, I have introduced sensitivity and specificity as performance measures for model selection. Besides these measures, there is also the notion of recall and precision. Precision and recall originate from information retrieval but are also used in machine learning settings. However, the use of precision and recall can be problematic in some situations. In this post, I discuss the shortcomings of recall and precision and show why sensitivity and specificity are generally more useful.

Definitions

For a binary classification problems with classes 0 and 1, the resulting confusion matrix has the following structure:

Prediction/Reference 1 0
1 TP FP
0 FN TN

where TP indicates the number of true positives (model correctly predicts positive class), FP indicates the number of false positives (model incorrectly predicts positive class), FN indicates the number of false negatives (model incorrectly predicts negative class), and TN indicates the number of true negatives (model predicts negative class correctly). The definitions of sensitivity (recall), precision (positive predictive value, PPV), and specificity (true negative rate, TNV) are as follows:

sensitivity=recall=TPR=TPTP+FNprecision=PPV=TPTP+FPspecificity=TNR=1FPR=1FPFP+TN

Sensitivity and precision are related in that they are both using TP in the enumerator. While sensitivity identifies the rate at which observations from the positive class are correctly predicted, precision indicates the rate at which positive predictions are correct. Specificity, on the other hand, is based on the number of false positives and indicates the rate at which observations from the negative class are correctly predicted.

The advantage of sensitivity and specificity

Evaluating a model based on both, sensitivity and specificity, is appropriate for most data sets because these measures consider all entries in the confusion matrix. While sensitivity deals with true positives and false negatives, specificity deals with false positives and true negatives. This means that the combination of sensitivity and specificity is a holistic measure when both true positives and true negatives should be considered.

Sensitivity and specificity can be summarized by a single quantity, the balanced accuracy, which is defined as the mean of both measures:

balanced accuracy=sensitivity + specificity2

The balanced accuracy is in the range [0,1] where a values of 0 and 1 indicate whe worst-possible and the best-possible classifier, respectively.

The disadvantage of recall and precision

Evaluating a model using recall and precision does not use all cells of the confusion matrix. Recall deals with true positives and false negatives and precision deals with true positives and false positives. Thus, using this pair of performance measures, true negatives are never taken into account. Thus, precision and recall should only be used in situations, where the correct identification of the negative class does not play a role. This is why these measures originate from information retrieval where precision can be defined as

precision=|{relevant documents}{retrieved documents}||{retrieved documents}|.

Here, it does not matter at which rate irrelevant documents are correctly discarded (true negative rate) because it is of no consequence.

Precision and recall are often summarized as a single quantity, the F1-score, which is the harmonic mean of both measures:

F1=2recallprecisionrecall+precision

F1 is in the range [0,1] and will be 1 for a classifier maximizing precision and recall. Since it is based on the harmonic mean, the F1-score is very sensitive towards disparate values for precision and recall. Assume a classifier has a sensitivity of 90% and a precision of 30%. Then the conventional mean would be 0.9+0.32=0.6 but the harmonic mean (F1 score) would be 20.90.30.9+0.3=0.45.

Examples

Here, I provide two examples. The first examples investigates what can go wrong when precision is used as a performance metric. The second example shows a setting in which the use of precision is adequate.

What can go wrong when using precision?

Precision is a particularly bad measure when there are few observations that belong to the positive class. Let us assume a clinical data set in which 90% of persons are diseased (positive class) and only 10% are healthy (negative class). Let us assume we have developed two tests for classifying whether a patient is diseased or healthy. Both tests have an accuracy of 80% but make different types of errors.

# to use waffle, you need 
#   o FontAwesome
#   o register the fonts using extrafont::font_import()
library(waffle)
ref.colors <- c("#c14141", "#1853b2")
false.colors <- c("#9b3636", "#0e3168")
true.colors <- c("#f75959", "#2474f2")
iron(
    waffle(c("Diseased" = 90, "Healthy" = 10), rows = 5, use_glyph = "child", 
        glyph_size = 5, title = "Reference", colors = ref.colors),
    waffle(c("Diseased (TP)" = 80, "Healthy (FN)" = 10, "Diseased (FP)" = 10), 
        rows = 5, use_glyph = "child", 
        glyph_size = 5, title = "Clinical Test 1", colors = c(true.colors[1], false.colors[2], false.colors[1])),
    waffle(c("Diseased (TP)" = 70, "Healthy (FN)" = 20, "Healthy (TN)" = 10), 
        rows = 5, use_glyph = "child", 
        glyph_size = 5, title = "Clinical Test 2", colors = c(true.colors[1], false.colors[2], true.colors[2]))
)

Confusion matrix for the first test

Prediction/Reference Diseased Healthy
Diseased TP = 80 FP = 10
Healthy FN = 10 TN = 0

Confusion matrix for the second test

Prediction/Reference Diseased Healthy
Diseased TP = 70 FP = 0
Healthy FN = 20 TN = 10

Comparison of the two tests

Let us compare the performance of the two tests:

Measure Test 1 Test 2
Sensitivity (Recall) 88.9% 77.7%
Specificity 0% 100%
Precision 88.9% 100%

Considering sensitivity and specificity, we would not select the first test because its balanced accuracy is merely 0+0.8892=44.5%, while that of the second test is 0.777+12=88.85%.

Using precision and recall, however, the first test would have an F1-score of 20.8890.8890.889+0.889=0.889, while the second test has a lower score of 20.77710.777+10.87. Thus, we would find the first test to be superior over the second test although its specificity is a 0%. Thus, when using this test, all of the healthy patients would be classified as diseased. This would be a big problem because all of these patients would undergo severe psychological stress and expensive treatment due to the misdiagnosis. If we had used specificity instead, we would have selected the second model, which does not produce any false postives at a competitive sensitivity.

Use of precision when true negatives do not matter

Let us consider an example from information retrieval to illustrate when precision is a useful criterion. Assume that we want to compare two algorithms for document retrieval that both have have an accuracy of 80%.

library(waffle)
colors <- c("#c14141", "#1853b2")
iron(
    waffle(c("Relevant" = 30, "Irrelevant" = 70), rows = 5, use_glyph = "file", 
        glyph_size = 5, title = "Reference", colors = ref.colors),
    waffle(c("Relevant (TP)" = 25, "Irrelevant (FN)" = 5, "Relevant (FP)" = 15, "Irrelevant (TN)" = 55), 
        rows = 5, use_glyph = "file", 
        glyph_size = 5, title = "Retrieval Algorithm 1", colors = c(true.colors[1], false.colors[2], false.colors[1], true.colors[2])),
    waffle(c("Relevant (TP)" = 20, "Irrelevant (FN)" = 10, "Relevant (FP)" = 10, "Irrelevant (TN)" = 60), 
        rows = 5, use_glyph = "file", 
        glyph_size = 5, title = "Retrieval Algorithm 2", colors = c(true.colors[1], false.colors[2], false.colors[1], true.colors[2]))
)

Confusion matrix for the first algorithm

Prediction/Reference Relevant Irrelevant
Relevant TP = 25 FP = 15
Irrelevant FN = 5 TN = 55

Confusion matrix for the second algorithm

Prediction/Reference Relevant Irrelevant
Relevant TP = 20 FP = 10
Irrelevant FN = 10 TN = 60

Comparison of the two algorithms

Let us calculate the performance of the two algorithms from the confusion matrix:

Measure Algorithm 1 Algorithm 2
Sensitivity (Recall) 83.3% 66.7%
Specificity 78.6% 85.7%
Precision 62.5% 66.7%
Balanced accuracy 80.95% 76.2%
F1-score 71.4% 66.7%

In this example, both balanced accuracy and the F1-score would lead to prefering the first over the second algorithm. Note that the reported balanced accuracy is decidedly larger than the F1-score. This is because specificity is high for both algorithms due to the large number of discarded observations from the negative class. Since the F1-score does not consider the rate of true negatives, precision and recall are more appropriate than sensitivity and specificity for this task.

Summary

In this post, we have seen that performance measures should be carefully selected. While sensitivity and specificity generally perform well, precision and recall should only be used in circumstances where the true negative rate does not play a role.