In Edge AI, where models are deployed on resource-constrained devices like microcontrollers, evaluation metrics are critical. They ensure that your model performs well in terms of accuracy and runs efficiently on the target hardware. By understanding these metrics, you can fine-tune your models to achieve the best balance between performance and resource usage.

These metrics serve several important purposes:
Model Comparison: Metrics allow you to compare different models and see which one performs better.
Model Tuning: They help you adjust and improve your model by showing where it might be going wrong.
Model Validation: Metrics ensure that your model generalizes well to new data, rather than just memorizing the training data (a problem known as overfitting).
Classification metrics are used for problems where the output is a category, such as detecting whether a sound is a cough or not:
Accuracy: Measures the percentage of correct predictions out of all predictions. For instance, in a model that classifies sounds on a wearable device, accuracy tells you how often the model gets it right. (Read on Scikit-learn Accuracy | Read on TensorFlow Accuracy)

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where:
TP: True Positives
TN: True Negatives
FP: False Positives
FN: False Negatives
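As a minimal sketch, accuracy can be computed with scikit-learn's `accuracy_score`; the labels below are made up purely for illustration (1 = cough, 0 = not cough):

```python
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth and predicted labels (1 = cough, 0 = not cough)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Fraction of predictions that match the true labels:
# (TP + TN) / (TP + TN + FP + FN)
print(accuracy_score(y_true, y_pred))  # 0.75
```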
Precision: The percentage of true positive predictions out of all positive predictions made by the model. This is crucial in cases where false positives can have significant consequences, such as in health monitoring devices. (Read on Scikit-learn Precision | Read on TensorFlow Precision)

$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall: The percentage of actual positive instances that the model correctly identified. For example, in a fall detection system, recall is vital because missing a fall could lead to serious consequences. (Read on Scikit-learn Recall | Read on TensorFlow Recall)

$$\text{Recall} = \frac{TP}{TP + FN}$$
F1 Score: The harmonic mean of precision and recall, useful when you need to balance the trade-offs between false positives and false negatives. (Read on Scikit-learn F1 Score | Read on TensorFlow F1 Score)

$$\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
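A quick sketch of how these three metrics might be computed with scikit-learn, again on made-up binary labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels (1 = positive event, e.g. a detected fall; 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # 2 * (precision * recall) / (precision + recall)

print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
```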
Confusion Matrix: A table that shows the number of correct and incorrect predictions made by the model. It helps visualize the model’s performance across different classes. (Read on Scikit-learn Confusion Matrix | Read on TensorFlow Confusion Matrix)
Confusion Matrix for Activity Recognition (UCI HAR Dataset - Simulated)
This confusion matrix helps evaluate the performance of the model by showing where it is performing well (high values along the diagonal) and where it is making mistakes (off-diagonal values). Here's how to interpret it:
Labels: The “True label” on the Y-axis represents the actual class labels of the activities. The “Predicted label” on the X-axis represents the class labels predicted by the model.
Classes: The matrix covers three classes, labeled 0, 1, and 2, corresponding to different human activities.
Matrix Cells: The cells in the matrix contain the number of samples classified in each combination of actual versus predicted class.
For instance: The top-left cell (44) indicates that the model correctly predicted class 0 for 44 instances where the true label was also 0.
The off-diagonal cells represent misclassifications. For example, the cell at row 0, column 1 (29) shows that 29 samples were true class 0 but were incorrectly predicted as class 1.
Color Scale: The color scale on the right represents the intensity of the values in the cells, with lighter colors indicating higher values and darker colors indicating lower values.
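As a rough sketch, a matrix like the one above can be generated with scikit-learn and Matplotlib; the three-class labels here are fabricated for illustration rather than taken from the actual UCI HAR data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Fabricated three-class activity labels (0, 1, 2)
y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 0, 0, 2, 2, 0])

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Plot with the true label on the Y-axis and the predicted label on the X-axis
ConfusionMatrixDisplay(cm).plot()
plt.show()
```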
ROC-AUC: The area under the receiver operating characteristic curve, showing the trade-off between true positive rate and false positive rate. (Read on Scikit-learn ROC-AUC | Read on TensorFlow AUC)
The ROC curve plots the True Positive Rate (Recall) against the False Positive Rate (FPR), where:

$$\text{FPR} = \frac{FP}{FP + TN}$$
ROC Curve for Walking vs Rest (UCI HAR Dataset - Simulated)
The ROC (Receiver Operating Characteristic) curve is a commonly used tool for evaluating the performance of binary classification models. The ROC curve plots the trade-off between the true positive rate (TPR or Recall) and the false positive rate (FPR) for different threshold values.
True Positive Rate (Y-axis): This is the proportion of actual positives (walking instances) that the model correctly identifies (recall).
False Positive Rate (X-axis): This is the proportion of actual negatives (rest instances) that the model incorrectly identifies as positives (false positives).
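A minimal sketch of plotting a ROC curve with scikit-learn, using hypothetical labels and predicted probabilities for the positive (walking) class:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels (1 = walking, 0 = rest) and predicted probabilities of "walking"
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.8, 0.4, 0.35, 0.6, 0.1, 0.55, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # FPR and TPR at each threshold
auc = roc_auc_score(y_true, y_score)               # area under the ROC curve

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.legend()
plt.show()
```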
Precision-Recall Curve: Useful in evaluating binary classification models, especially when dealing with imbalanced datasets, like in the context of walking vs resting activities. The Precision-Recall curve shows the trade-off between precision and recall for various threshold settings of the classifier.
Precision (Y-axis): Precision measures the proportion of true positive predictions among all positive predictions made by the model. High precision means that when the model predicts “Walking,” it is correct most of the time.
Recall (X-axis): Recall (or True Positive Rate) measures the proportion of actual positives (walking instances) that the model correctly identifies. High recall indicates that the model successfully identifies most instances of walking.
Precision-Recall Curve for Walking vs Rest (UCI HAR Dataset - Simulated)
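A similar sketch for the Precision-Recall curve, with hypothetical, deliberately imbalanced labels (far fewer walking samples than rest samples):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Hypothetical imbalanced labels (1 = walking, 0 = rest) and predicted probabilities
y_true = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.85, 0.2, 0.4, 0.1, 0.7, 0.3, 0.15, 0.6, 0.25, 0.05])

# Precision and recall at each decision threshold
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.show()
```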
Log Loss: The negative log-likelihood of the true labels given the model predictions. (Read on Scikit-learn Log Loss | Read on TensorFlow Log Loss)

$$\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$$

where $N$ is the number of samples, $y_i$ is the true label, and $p_i$ is the predicted probability of the positive class for sample $i$.
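A minimal sketch of computing log loss with scikit-learn on hypothetical predicted probabilities:

```python
from sklearn.metrics import log_loss

# Hypothetical true labels and predicted probabilities for the positive class
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.1, 0.6, 0.3]

# Average negative log-likelihood of the true labels under the predictions;
# confident wrong predictions are penalized heavily
print(log_loss(y_true, y_prob))
```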
Choosing the right metric depends on the specific task and application requirements:
For classification: In an Edge AI application like sound detection on a wearable device, precision might be more important if you want to avoid false alarms, while recall might be critical in safety applications where missing a critical event could be dangerous.
For regression: If you’re predicting energy usage in a smart home, MSE might be preferred because it penalizes large errors more, ensuring your model’s predictions are as accurate as possible.
For object detection: If you’re working on an edge-based animal detection camera, mAP with a higher IoU threshold might be crucial for ensuring the camera accurately identifies and locates potential animals.
Evaluation metrics like mAP and recall provide useful insights into the performance of machine learning models, particularly in object detection tasks. By understanding and appropriately focusing on the correct metrics, you can ensure that your models are robust, accurate, and effective for real-world deployment.