So you’ve finally finished training your AI’s last epoch. It’s always a great feeling after hours of cleaning data
and waiting for those epochs to finish up.
Now comes one of the most important parts of finishing up a new model … evaluating its performance.
Most tutorials online use % accuracy to measure a model’s performance. Yet despite its popularity, that’s not always the
best option.
The Downfalls of Accuracy
To show how accuracy can be a misleading metric, I’m going to use a model that detects COVID-19 as an example.
Let’s take the world’s COVID data: the current population of the Earth is about 7.8 billion people, and there have been
roughly 19.7 million confirmed cases.
Now imagine a model that diagnosed every single person as COVID free. It’s pretty obvious that this is the worst
possible model to deploy … right?
19,700,000 / 7,800,000,000 ≈ 0.253% error
This model would have a 0.253% error rate. That means a 99.747% accuracy.
Take that in … a model can achieve 99.747% accuracy by saying nobody in the world has coronavirus!
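If you want to sanity-check that number, here’s a minimal sketch in Python using scikit-learn’s accuracy_score. The counts are scaled down to 253 cases per 100,000 people (the same ratio as above) and are purely illustrative:

import numpy as np
from sklearn.metrics import accuracy_score

# Scaled-down, illustrative version of the numbers above:
# 253 positives per 100,000 people ≈ 19.7 million cases out of 7.8 billion.
n_people = 100_000
n_positive = 253

y_true = np.zeros(n_people, dtype=int)
y_true[:n_positive] = 1                  # people who actually have COVID-19

y_pred = np.zeros(n_people, dtype=int)   # the "everybody is COVID-free" model

print(accuracy_score(y_true, y_pred))    # 0.99747 -> 99.747% accuracy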
Trusting models like this to diagnose diseases would be a disaster.
So it’s obvious that some models need a different way of measuring their performance.
The Confusion Matrix
One popular way to visualize the performance of a model is the confusion matrix.
These matrices show that not all model errors are the same. There are actually two types of errors: False Positives (Type
1) and False Negatives (Type 2).
Depending on what your model is predicting, one can be more dangerous than the other.
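As a rough sketch of what that looks like in code, scikit-learn’s confusion_matrix hands you all four counts directly. The labels and predictions below are made up purely for illustration:

from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = has COVID-19, 0 = COVID-free.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]   # made-up model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("True positives: ", tp)   # sick people correctly flagged
print("False positives:", fp)   # healthy people flagged as sick (Type 1)
print("False negatives:", fn)   # sick people cleared as healthy (Type 2)
print("True negatives: ", tn)   # healthy people correctly cleared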
Precision and Recall
These two metrics focus on tracking one type of error (Type 1 or 2).
Precision is used to avoid false positives. Spam detectors would optimize for high precision, because every false
positive means an important email gets deleted. Precision-focused models reduce false positives, even if that increases
false negatives.
Recall is used when false negatives are dangerous. Medical diagnosis models should optimize for high recall; otherwise,
they’d send sick patients home undiagnosed. Recall-focused models reduce false negatives, even if that increases false
positives.
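Both metrics are a single call in scikit-learn. Here’s a quick sketch reusing the same made-up labels as the confusion matrix example:

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Precision = TP / (TP + FP): of everything flagged positive, how much was real?
# Recall    = TP / (TP + FN): of everything actually positive, how much was caught?
print(precision_score(y_true, y_pred))   # 2 / (2 + 1) ≈ 0.67
print(recall_score(y_true, y_pred))      # 2 / (2 + 1) ≈ 0.67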
ROC Curves
An ROC (Receiver Operating Characteristic) curve measures how well a model differentiates between two classes.
That’s often more informative than a single overall-correctness number like accuracy, which lumps every kind of right
and wrong answer together.
When a model makes a prediction, it outputs a confidence score between 0 and 1. The model classifies an example as
positive only if that confidence is above a certain threshold.
An ROC curve is a graph of the true positive rate versus the false positive rate at every possible threshold.
If you take the area underneath an ROC curve, you get a number between 0 and 1 (the AUC). The closer it is to 1, the
better your model separates the two classes.
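Here’s a minimal sketch of plotting an ROC curve and computing that area (the AUC) with scikit-learn and matplotlib. The labels and scores below are randomly generated stand-ins for whatever your real model outputs:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Fake data: y_scores would normally come from model.predict_proba(X)[:, 1].
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_scores = np.clip(0.3 * y_true + 0.7 * rng.random(1000), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # one point per threshold
auc = roc_auc_score(y_true, y_scores)                # area under the curve, 0-1

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guessing")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()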
Next time you’re evaluating a model, choose a metric based on what’s actually important for it to achieve.