Text Classification Basics Part 3: Confusion Matrix
When evaluating a classification model, there are two kinds of labels to consider: actual and predicted.
- Actual refers to the true label of a test sample, matching real-world conditions (e.g., a text message is SPAM).
- Predicted refers to the output generated by the machine learning model (e.g., the model predicts SPAM).
After testing, each prediction falls into one of four cases (illustrated in the sketch after this list):
1. Correctly classified as HAM: True HAM
2. Correctly classified as SPAM: True SPAM
3. Incorrectly classified as HAM: False HAM
4. Incorrectly classified as SPAM: False SPAM
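To make these four cases concrete, here is a minimal Python sketch (the labels are made up purely for illustration) that compares a list of actual labels against a list of predicted labels and tallies each case:

```python
# Hypothetical test results: actual labels vs. the model's predictions.
actual    = ["HAM", "HAM", "SPAM", "SPAM", "HAM", "SPAM", "HAM", "SPAM"]
predicted = ["HAM", "SPAM", "SPAM", "HAM", "HAM", "SPAM", "HAM", "SPAM"]

# Tally the four possible outcomes of each prediction.
true_ham   = sum(a == "HAM"  and p == "HAM"  for a, p in zip(actual, predicted))
true_spam  = sum(a == "SPAM" and p == "SPAM" for a, p in zip(actual, predicted))
false_ham  = sum(a == "SPAM" and p == "HAM"  for a, p in zip(actual, predicted))  # predicted HAM, actually SPAM
false_spam = sum(a == "HAM"  and p == "SPAM" for a, p in zip(actual, predicted))  # predicted SPAM, actually HAM

print("True HAM:", true_ham, "| True SPAM:", true_spam,
      "| False HAM:", false_ham, "| False SPAM:", false_spam)
```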
Terminology:
- True Positive (TP): When an actual HAM message is correctly predicted as HAM by the model.
- False Negative (FN) (Type 2 Error): When an actual HAM message is incorrectly predicted as SPAM.
- False Positive (FP) (Type 1 Error): When an actual SPAM message is incorrectly predicted as HAM.
- True Negative (TN): When an actual SPAM message is correctly predicted as SPAM.
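If scikit-learn is available (an assumption on my part, not a requirement of this series), the same four counts can be read directly out of its confusion matrix. A minimal sketch, keeping HAM as the positive class as in the terminology above:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical test results (same illustrative labels as before).
actual    = ["HAM", "HAM", "SPAM", "SPAM", "HAM", "SPAM", "HAM", "SPAM"]
predicted = ["HAM", "SPAM", "SPAM", "HAM", "HAM", "SPAM", "HAM", "SPAM"]

# Rows are actual labels and columns are predicted labels,
# in the order given by `labels`; HAM is listed first as the positive class.
cm = confusion_matrix(actual, predicted, labels=["HAM", "SPAM"])
tp, fn, fp, tn = cm.ravel()

print(cm)
print("TP:", tp, "FN:", fn, "FP:", fp, "TN:", tn)
```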
Example: Confusion Matrix
Using the confusion matrix above, let’s calculate the accuracy rate and error rate.
- Accuracy Rate: This measures how often the model makes correct predictions. It is calculated as:
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Error Rate: This measures how frequently the model makes incorrect predictions. It is calculated as:
  Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy
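As a quick sketch of both formulas, the counts below reuse the illustrative values from the earlier examples; substitute the numbers from your own confusion matrix:

```python
# Illustrative counts; replace with the values from your confusion matrix.
tp, fn, fp, tn = 3, 1, 1, 3

total = tp + tn + fp + fn
accuracy   = (tp + tn) / total   # fraction of correct predictions
error_rate = (fp + fn) / total   # fraction of incorrect predictions, equal to 1 - accuracy

print(f"Accuracy:   {accuracy:.2f}")    # 0.75
print(f"Error rate: {error_rate:.2f}")  # 0.25
```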
In summary, the accuracy rate tells us how well the model performs overall, while the error rate highlights how often the model makes incorrect predictions.