# Lecture Notes in Pattern Recognition: Evaluation Measures and ROC Analysis

These are the lecture notes for FAU’s YouTube Lecture “Pattern Recognition”. This is a full transcript of the lecture video and matching slides. We hope you enjoy this as much as the videos. Of course, this transcript was created largely automatically with deep learning techniques, and only minor manual modifications were performed. Try it yourself! If you spot mistakes, please let us know!

Welcome everybody to a new round of pattern recognition questions and answers. Today we want to have a short chat about measures for evaluating classification systems and about how a classification system can be altered in order to change the evaluation outcome. I think you will find this interesting, and I also brought a very interesting example to discuss the ideas of classification systems.

In this round of questions and answers, we want to talk in particular about the questions that you sent in by email or posted in the forum. The key question that came up lately was: can you discuss how to evaluate a classification system?

We talked about this in the lecture, but I found that we need some additional explanation to actually understand what classification systems do and how we can evaluate them. Of course, we will discuss this with an example. Everybody is talking about diseases nowadays, and it seems everybody is becoming an expert on disease classification. So I thought it would be interesting to talk about a typical disease, and I took the zombie disease.

Okay, let’s get back to evaluating our classifier. A classification system can be evaluated with many different measures, and what you can see here are different ways to evaluate this classifier. They all start from the four counts of the confusion matrix: true positives, false positives, true negatives, and false negatives. There are of course many different ways to combine these values in order to compute different rates.

A very typical one is known as the true positive rate, also called hit rate, recall, or sensitivity. It is the number of true positives divided by the number of actual positives, which is given as the true positives plus the false negatives. Then there is the false positive rate, which is essentially the false alarm rate. It is given as the number of false positives divided by the number of actual negatives, that is, the false positives plus the true negatives. Also highly relevant is the positive predictive value, also called precision, which is given as the number of true positives over the number of true positives plus false positives. Then there is the negative predictive value, which is the number of true negatives divided by the number of true negatives plus false negatives. Last but not least, there is the true negative rate or specificity, which is calculated as one minus the false positive rate.

So this all seems a little bit complex, and I’m trying to break these things down a little. Remember, one does not simply understand sensitivity and specificity. But I will try to make this easier by going through an example. Let’s assume we have a total of 100,000 people, and we will now populate the entire table with these 100,000. We know that 10,000 are actually infected with the zombie disease, so this number goes here, and the other 90,000 have not been infected. They’re just regular people, so this number goes here in our table. Now we want to see how good our test is, and let’s assume that we know the sensitivity.
If we know the sensitivity, we can compute the number of true positives: we take the number of positives in our set and multiply it by the sensitivity. This yields 9,970 true positives, which corresponds to a sensitivity of 99.7 percent, and this number then goes here. We can use this to also compute the number of false negatives, which is the remaining 30, and of course I can also put it in our table. The other value that is very relevant for determining the actual values in our table is the specificity. The specificity allows us to compute the number of true negatives: we take the number of negatives in the reference, so the actual number of true humans, and multiply it by the specificity. This gives us 89,280, which corresponds to a specificity of 99.2 percent, and we can put that in our table here. This allows us to compute the number of false positives, which is 720. So this value goes here.
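To make the bookkeeping concrete, here is a minimal sketch that fills the table from the reference counts and then derives the rates defined earlier. The names `Confusion`, `from_rates`, and `rates` are hypothetical, and the two rates (99.7 and 99.2 percent) are an assumption inferred from the counts in the example: 9,970 true positives among 10,000 positives, and 720 false positives among 90,000 negatives.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # zombies correctly flagged
    fn: int  # zombies the test missed
    fp: int  # regular people wrongly flagged
    tn: int  # regular people correctly cleared


def from_rates(positives, negatives, sensitivity, specificity):
    """Fill the table from the reference counts and the two test rates."""
    tp = round(positives * sensitivity)
    tn = round(negatives * specificity)
    return Confusion(tp=tp, fn=positives - tp, fp=negatives - tn, tn=tn)


def rates(c):
    """Derive the evaluation measures from the four counts."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),          # true positive rate, recall
        "specificity": c.tn / (c.tn + c.fp),          # true negative rate
        "false_positive_rate": c.fp / (c.fp + c.tn),  # false alarm rate
        "precision": c.tp / (c.tp + c.fp),            # positive predictive value
        "npv": c.tn / (c.tn + c.fn),                  # negative predictive value
    }


# Assumed rates, inferred from the counts in the example.
table = from_rates(10_000, 90_000, sensitivity=0.997, specificity=0.992)
print(table)
for name, value in rates(table).items():
    print(f"{name}: {value:.4f}")
```

Note that the precision is noticeably lower than the sensitivity here, because the 720 false positives are compared against only 9,970 true positives.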

So let’s think about whether we can do something to decrease the number of false positives, such that we are not harming many hungover people. If you just have a bad hangover, it would not be so great if your neighbors decided to use countermeasures against zombies on you. So let’s think a bit about our test. We don’t know who the real zombies are; we just get the result from the test. But this tells us at least that in the cases where the test is positive, we may want to test again just to be really sure.

So what would happen if we do this retest and just run the same test again? In this case, this is our test result: we now have observed positives and observed negatives, and they are determined only by the previous test results. We look into the case where we retest only the positive cases. So let’s retest those observed positives. Among the observed positives are of course the true positives, and these we have to multiply by the sensitivity, which again tells us whether the result is positive or negative. This means that most of the true positives test positive again, but another 30 are now assigned to the false negatives. Now let’s consider the false positives. We had 720, we retest them, and now we have to consider the specificity. We see that of the 720, we can identify 714 as negative. So they will be assigned to the negatives, and this means that we only have six false positives left. So we get a new confusion matrix here. These are the new results, where we have already considered the retesting procedure. You see that the number of false negatives has increased to 60, while the number of false positives is reduced to only six.
So this means we have a new testing procedure in which we test only the positives twice. It has a reduced true positive rate, but in exchange it has a much better true negative rate. The false positives are greatly reduced, and you see that the specificity is now much higher, but this comes at the cost of reduced sensitivity. Of course, we could also go in the other direction.
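The retest of the observed positives can be sketched as follows. The function name `retest_positives` is hypothetical, and the sketch assumes, as the calculation in the example does, that the second test result follows the same sensitivity and specificity as the first one. The rates of 99.7 and 99.2 percent are inferred from the counts in the example.

```python
def retest_positives(tp, fn, fp, tn, sensitivity, specificity):
    """Retest everyone who tested positive; only a second positive result counts."""
    tp_again = round(tp * sensitivity)    # infected people who test positive again
    fp_cleared = round(fp * specificity)  # healthy people cleared by the second test
    return {
        "tp": tp_again,
        "fn": fn + (tp - tp_again),       # infected people lost in the retest
        "fp": fp - fp_cleared,
        "tn": tn + fp_cleared,
    }


original = dict(tp=9_970, fn=30, fp=720, tn=89_280)
after = retest_positives(**original, sensitivity=0.997, specificity=0.992)
print(after)
print("sensitivity:", after["tp"] / (after["tp"] + after["fn"]))
print("specificity:", after["tn"] / (after["tn"] + after["fp"]))
```

The counts reproduce the 60 false negatives and six false positives from the example: sensitivity drops slightly, specificity rises.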

So now we retest only the observed negatives. In this case, we want to make sure that we don’t miss any of the zombies, so we retest everyone who was observed as negative. Again we have to check our sensitivity. Amongst the observed negatives, there were only 30 cases that were false negatives, and these multiplied by the sensitivity give, after rounding, 30 correct classifications. So our false negative number goes to zero. Now let’s consider the other error. We do that with the true negatives that are inside our set, and here we have to use the specificity. We multiply this number by one minus the specificity and obtain another 714 false positives. So now we can update our confusion matrix, which then looks like this one.

So we introduced a retesting procedure that gives us a sensitivity of 100 percent. We didn’t miss a single zombie, but it comes at the cost that we now have 1,434 false positives. The number of false positives is almost doubled by introducing this procedure, but we don’t have any false negatives anymore. Remember, we only had ten thousand positives, and now we have more than a thousand four hundred false positives. This means that out of the roughly eleven thousand four hundred observed positives, more than ten percent weren’t actually zombies. So you’re getting rid of more than a thousand four hundred regular people as the price for not missing anyone. You’re taking countermeasures against more than a thousand four hundred people who have not actually been infected by the disease. This might be a pretty harsh thing to do, and maybe not every government will be able to pull this off. Well, we have now developed three different testing procedures, and you may wonder: are they the same, which one is better, and which one is worse?
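The opposite procedure, retesting only the observed negatives, can be sketched in the same way. The function name `retest_negatives` is hypothetical, and the rates of 99.7 and 99.2 percent are again an assumption inferred from the counts in the example.

```python
def retest_negatives(tp, fn, fp, tn, sensitivity, specificity):
    """Retest everyone who tested negative; one positive result is enough to be flagged."""
    fn_caught = round(fn * sensitivity)  # missed zombies found by the second test
    tn_kept = round(tn * specificity)    # healthy people who test negative again
    return {
        "tp": tp + fn_caught,
        "fn": fn - fn_caught,
        "fp": fp + (tn - tn_kept),       # healthy people flagged by the second test
        "tn": tn_kept,
    }


original = dict(tp=9_970, fn=30, fp=720, tn=89_280)
after = retest_negatives(**original, sensitivity=0.997, specificity=0.992)
observed_positives = after["tp"] + after["fp"]
false_alarm_share = after["fp"] / observed_positives
print(after)
print(f"{false_alarm_share:.1%} of {observed_positives} observed positives are not infected")
```

The share of regular people among the observed positives is one minus the precision, which is why a procedure with perfect sensitivity can still flag many people wrongly.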

So let’s have a look at them and compare them. The typical tool to do so is the so-called receiver operating characteristic curve, the ROC curve. I’m showing the space in which this is evaluated here on the left-hand side. In this plot, we plot the sensitivity over one minus the specificity. You can see a couple of important points in this plot. On the top left, indicated with the one, you see the perfect classifier. This one has 100 percent specificity and 100 percent sensitivity, so it always makes the right decision. Another thing you could do is simply always decide for positive. Then you would be at the top right point of the plot. The very opposite is that you always decide for negative. Then you’re in the bottom left corner of the plot. There is a line connecting those two points, and this is essentially a random decision with varying thresholds. If you just roll dice, you are somewhere on the line connecting those two points. This means you make your decision completely uninformed by any observation, and this yields the following line.

If you are in the top left triangle, you have built a classification system that actually works. So I would expect any working classification system in the top left triangle, above the diagonal. There’s also the space below the diagonal, here indicated by four. If your classifier is located somewhere there, it’s more wrong than right. This means that in a two-class system, if you just do the opposite of whatever your classification system proposes, you actually end up in the top left triangle. So this is the space in which we now have to locate our classifiers, and we see that our vanilla zombie classifier wasn’t doing badly at all. It is very close to the top left, so very close to the one, because it has both very high sensitivity and very high specificity.
Now we changed our classification system and reran all the positives, so we changed sensitivity and specificity to yield this point here. We are moving slightly away from the original classification system: we improved on one rate but sacrificed on the other. Then we did the exact opposite and reran all of the negative cases, and we get this observation here. Again we increased one rate but decreased the other. So you see that by adjusting the classification system and rerunning the decisions, I can alter the outcome. Actually, there is a whole continuum of different solutions for sensitivity and specificity. This is shown here by the curve in green, and we can essentially move along this green curve with a given classification system. Here you can see that this classification system is very good, because the area under the curve is close to one. If I had the perfect classifier, the area under the green curve would be exactly one, and we are very close to one here. So we have a very good classification system, and because it’s the same system that we’re using, it has to be described by this curve. Now you may wonder how I get this green curve.
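To place the three procedures in this plot, each confusion matrix can be converted into a point with one minus the specificity on the horizontal axis and the sensitivity on the vertical axis. This is a small sketch with a hypothetical helper `roc_point`. The true negative counts for the two retesting procedures are derived from the counts in the example and do not appear in the lecture itself.

```python
def roc_point(tp, fn, fp, tn):
    """Return (1 - specificity, sensitivity), the coordinates used in the ROC plot."""
    return fp / (fp + tn), tp / (tp + fn)


procedures = {
    "single test": dict(tp=9_970, fn=30, fp=720, tn=89_280),
    "retest positives": dict(tp=9_940, fn=60, fp=6, tn=89_994),
    "retest negatives": dict(tp=10_000, fn=0, fp=1_434, tn=88_566),
}

for name, counts in procedures.items():
    fpr, tpr = roc_point(**counts)
    side = "above" if tpr > fpr else "on or below"
    print(f"{name}: fpr={fpr:.5f}, tpr={tpr:.4f} ({side} the diagonal)")
```

All three points sit close to the top left corner. They differ only in how they trade sensitivity against specificity.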

Well, actually our classification system stems from two distributions, the distributions of the positives and the negatives, and the actual classification result is determined simply by picking a threshold on the test statistic, whatever test you do. Here in the case of zombies, I hear the eye pressure is a very good measure, and to determine whether somebody is a zombie or not, you apply a threshold to it. So this threshold here is important, and I can sample the entire green curve simply by varying it. So I wouldn’t have to go through the retesting procedure. Running the same test twice won’t change anything fundamental, because you are simply adjusting your classifier to a different threshold. So that wouldn’t help a lot. Well, what can we do about this?
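The threshold sweep can be sketched with two hypothetical Gaussian score distributions. The means and standard deviations below are illustrative assumptions, not values from the lecture. The sketch uses `statistics.NormalDist` from the Python standard library and approximates the area under the curve with the trapezoidal rule.

```python
from statistics import NormalDist

# Hypothetical distributions of the test statistic (illustrative values only).
negatives = NormalDist(mu=0.0, sigma=1.0)
positives = NormalDist(mu=3.0, sigma=1.0)


def roc_curve(neg, pos, thresholds):
    """Return (false positive rate, true positive rate) for each threshold.

    A sample is called positive when its score exceeds the threshold.
    """
    return [(1.0 - neg.cdf(t), 1.0 - pos.cdf(t)) for t in thresholds]


def area_under_curve(points):
    """Trapezoidal area under the ROC points, sorted by false positive rate."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


thresholds = [-6.0 + 0.05 * i for i in range(301)]  # sweep from -6 to 9
curve = roc_curve(negatives, positives, thresholds)
print(f"AUC = {area_under_curve(curve):.4f}")
```

Each threshold gives one point on the curve. A low threshold moves toward the top right corner (always positive) and a high threshold toward the bottom left corner (always negative).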

So let’s think about a couple of ideas. One very clear idea is that you pick the threshold according to the scenario. If you want to pick the threshold, you have to determine the cost of the decision. Every decision comes with a cost, and depending on, for example, the prevalence of the disease, the cost may be very different. If you are in the early phase, you want to stop the zombies at all costs, and it would be very expensive to miss one, because the zombie would just keep on spreading the disease. In later phases of the disease, this may be very different. So you probably want to adjust your threshold according to the current testing scenario, and you can do that if you know the cost of the decision, because that allows you to weigh what is more costly, the false positive or the false negative. Another idea is to involve another test. You test with two different tests, and if they’re statistically independent, you can combine the two in order to get a better decision. This then leads to concepts like boosting, bagging, and ensemble classifiers that we will also talk about in this class. To help you understand these ensembling properties, I want to show you a specific example.
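Picking the threshold from the decision costs can be sketched as a search for the threshold with the lowest expected cost. Everything numeric here is an assumption made for illustration: the two Gaussian score distributions, the prevalence, and the cost values. The helper names `expected_cost` and `best_threshold` are hypothetical.

```python
from statistics import NormalDist

negatives = NormalDist(mu=0.0, sigma=1.0)  # hypothetical scores of healthy people
positives = NormalDist(mu=3.0, sigma=1.0)  # hypothetical scores of infected people


def expected_cost(threshold, prevalence, cost_fn, cost_fp):
    """Average cost per tested person when scores above the threshold count as positive."""
    miss_rate = positives.cdf(threshold)               # false negative rate
    false_alarm_rate = 1.0 - negatives.cdf(threshold)  # false positive rate
    return (prevalence * miss_rate * cost_fn
            + (1.0 - prevalence) * false_alarm_rate * cost_fp)


def best_threshold(prevalence, cost_fn, cost_fp):
    """Grid search over candidate thresholds from -3 to 6."""
    candidates = [-3.0 + 0.01 * i for i in range(901)]
    return min(candidates, key=lambda t: expected_cost(t, prevalence, cost_fn, cost_fp))


# Early phase: missing a zombie is assumed to be 100 times as costly as a false alarm.
early = best_threshold(prevalence=0.1, cost_fn=100.0, cost_fp=1.0)
# Later phase: both errors are assumed to cost the same.
late = best_threshold(prevalence=0.1, cost_fn=1.0, cost_fp=1.0)
print(f"early-phase threshold: {early:.2f}, later-phase threshold: {late:.2f}")
```

With a high cost for a missed case, the selected threshold moves down, which trades more false positives for fewer false negatives.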

So what you typically do is run a second test statistic. With the second test statistic, you can then create plots like this one and find a decision boundary that is orthogonal to one of the two tests. This means that you can harness the power of the two tests. Whenever we do boosting and ensembling, we typically assume that the tests are statistically independent. This is a key ingredient, so let’s look at an example where this is the case.

So I have test statistic 1, and our class 1 and class 2, or positive and negative, are distributed in the following way. There’s a significant overlap. I have another test where, of course, the same two classes appear again, and neither test separates the classes very well on its own. Now I want to combine the two. Let’s pick a configuration of observations, and these are now indicated as the true class and the true other class, so the positives and negatives. Here you can see that I essentially have to find points whose projections onto the x-axis and y-axis follow the original distributions. I did that here, and you can see that in this particular case the distribution of points follows the respective one-dimensional projections if I just move them to the one side or to the other side. By combining the two tests, I can get this very clear separation here. So this is very useful. In this case, we really have independence of the two tests, and because they are independent, there is potentially a very high yield from combining them.

Now let’s look at an example where the classes are much better separated. Here I have one class and the other class, and now the overlap is very small. I do the same on the other axis, and the overlap is again very small. This means that the blue dots have to be distributed somewhere where they get projected onto the blue curve, and the orange dots have to be located somewhere where they get projected onto the orange curve. This is a bit problematic, because you see that we are very close to the diagonal, and we actually have a correlation between test one and test two. So the two results are not completely independent. Whenever I already have this kind of very good separation, together with the assumption that the points are Gaussian distributed on the test statistic, I run into this problem. The problem is that because the two tests are correlated, there is very little yield from combining the two. You see a decision boundary here. It’s a bit better, but it doesn’t solve the problem as neatly as we’ve seen in the previous case. So if the two are correlated, it’s much harder to tease the two tests apart and to benefit from the other test statistic, because they are almost measuring the same thing.
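One way to quantify the yield from combining two tests is the following sketch. It rests on assumptions that go beyond the lecture: both classes are Gaussian with the same covariance, each test has unit variance, the classes are equally likely, and `rho` is the correlation of the two tests within a class. Under these assumptions, the separation of the combined tests is the Mahalanobis distance between the class means, and the error of the best linear decision boundary follows from it. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def combined_separation(d1, d2, rho):
    """Mahalanobis distance between the class means for two unit-variance tests.

    d1, d2: distance between the class means on each test alone.
    rho: within-class correlation of the two tests, strictly between -1 and 1.
    """
    return sqrt((d1 * d1 - 2.0 * rho * d1 * d2 + d2 * d2) / (1.0 - rho * rho))


def error_rate(separation):
    """Error of the best linear boundary for equal priors and a shared covariance."""
    return NormalDist().cdf(-separation / 2.0)


single = error_rate(2.0)
independent = error_rate(combined_separation(2.0, 2.0, rho=0.0))
correlated = error_rate(combined_separation(2.0, 2.0, rho=0.9))
print(f"one test alone:        {single:.4f}")
print(f"two independent tests: {independent:.4f}")
print(f"two correlated tests:  {correlated:.4f}")
```

In this model, two independent tests of equal quality increase the separation by a factor of the square root of two, while strongly correlated tests leave it almost unchanged. That matches the observation that tests measuring almost the same thing add little.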

Now of course this doesn’t mean that it’s impossible. What is relevant is this area here, and if I’m lucky with the tests, I can find a configuration that solves the problem much better. But of course, only the small area where the two classes are actually connected is the area where I can get yield. This is very hard to describe with the statistics on only one of the two axes, so you really have to know the complete distribution of the classes and their joint probability distribution. Otherwise you won’t be able to determine this so easily. Well, I think these were quite interesting observations that gave you a little bit of insight into evaluating a classifier’s test statistics and also into the implications of the sensitivity and specificity of a classification system.