Data augmentation is a method that can help improve the accuracy of machine learning models. A data augmentation system makes small, random changes to your training data during the training process.
Being exposed to these variations during training can help prevent your model from taking shortcuts by "memorizing" superficial clues in your training data, meaning it may better reflect the deep underlying patterns in your dataset.
Data augmentation will not work with every dataset
As with most things in machine learning, data augmentation is effective for some datasets and models but not for others. While experimenting with data augmentation, bear in mind that it is not guaranteed to improve your results.
Data augmentation is likely to make the biggest difference when used with small datasets. Large datasets may already contain enough variation that the model is able to identify the true underlying patterns and avoid overfitting to the training data.
The types of data augmentation that you apply will depend on your data type and use case.
For images, you might apply geometric transformations (rotations, scaling, flipping, cropping), adjust color aspects (brightness, contrast, hue, saturation), inject noise, or apply more advanced augmentations such as mixing images with strategies like CutMix or mixup.
For audio, you might apply transformations directly to the raw audio that include mixing in background noise, altering the pitch, perturbing the speed or volume, or randomly cropping and splitting your samples. Rather than altering the raw audio, you might instead apply transformations to audio features, for example spectrograms generated by MFCC or MFE processing, with techniques like SpecAugment.
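As an illustration, here is a minimal sketch of image augmentation using Keras preprocessing layers (the layer choices and ranges are arbitrary placeholders, not Edge Impulse defaults):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Random augmentations; these layers are only active during training
# and pass data through unchanged at inference time.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),   # geometric: mirror images
    layers.RandomRotation(0.1),        # geometric: rotate by up to ~36 degrees
    layers.RandomZoom(0.1),            # geometric: zoom in/out by up to 10%
    layers.RandomContrast(0.2),        # color: adjust contrast
])

# Apply to a batch of images, e.g. shape (batch, height, width, channels):
# augmented = augment(images, training=True)
```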
Data augmentation occurs only during training. It will have no impact on the memory usage or latency of your model once it has been deployed.
Here is a step-by-step guide to getting the most out of data augmentation.
There is no guarantee that data augmentation will improve the performance of your model. Before you start experimenting, it's important to train a model without data augmentation and attempt to get the best possible performance. You can use this model as a baseline to understand whether data augmentation improves the performance of your model or not.
It's helpful to be able to compare model performance side by side. To allow this, create a second model that has the same settings as the first, with the exception of enabling data augmentation. If there are parameter options for the augmentation, leave the defaults in place.
Often, the beneficial effects of data augmentation are only seen after training a network for longer. Increase the number of training epochs for your second model. A good rule of thumb might be to double the number of training epochs compared to your baseline model. You can look at the training output after your first run to determine if the model still seems to be improving and can be trained longer.
Now that you've trained a model with data augmentation, compare it to your baseline model by checking performance metrics. If the second model is more accurate or has a lower loss value, augmentation was successful.
Whether it was successful or not, you may be able to find settings that work better. If available, you can try other combinations of data augmentation parameter options. You can also try adjusting the architecture of your model. Since data augmentation can help prevent overfitting, you may be able to improve accuracy by increasing the size of your model while applying augmentation.
Once you have several model variants, you can run model testing for each. You might find that a model trained with data augmentation performs better on your test dataset even if its accuracy during training is similar to your baseline model, so it's always worth checking your models against test data.
It's also worth comparing the confusion matrices for each model. Data augmentation may affect the performance of your model on different labels in different ways. For example, precision may improve for one pair of classes but be reduced for another.
With Edge Impulse you can easily augment your dataset. Depending on your data type and learning block selection, data augmentation settings are available directly in Studio while configuring your learning block.
If you are an advanced user who is familiar with Python and Keras, data augmentation techniques can be applied programmatically in Studio through the use of Expert mode. Alternatively, you can leverage the Python SDK for full flexibility over your data augmentation and training pipeline.
If you are not familiar with optimizers, here is a brief overview of what an optimizer is and its role in machine learning, particularly in neural networks.
An optimizer is an algorithmic entity designed to minimize a specific function called the loss function. The loss function quantitatively expresses the difference between the predicted output of the neural network and the actual target values. Simply put, an optimizer's role is to change the attributes of the neural network, such as weights and learning rate, to reduce this loss, thereby enhancing the network's accuracy.
Optimizers work through an iterative process. They start by calculating the gradient of the loss function, that is, its partial derivatives with respect to each weight. This gradient indicates how the weights need to be adjusted to minimize the loss. The optimizer then updates the weights in the opposite direction of the gradient. This process is repeated over multiple iterations or epochs until the loss is minimized and the model's predictions become as accurate as possible.
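For example, plain gradient descent applies the following update at each step:

\( w_{t+1} = w_t - \eta \, \nabla_w L(w_t) \)

where \( w_t \) are the weights at step \( t \), \( \eta \) is the learning rate, and \( \nabla_w L(w_t) \) is the gradient of the loss with respect to the weights.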
Each optimizer has its unique way of navigating the path to minimize loss. Here are a few:
Adam: Known for its adaptability, it's especially good for large datasets. It is used by default in Edge Impulse.
VeLO: VeLO represents a novel approach where the optimizer is itself a neural network that is trained on prior training jobs. See the dedicated Learned Optimizer (VeLO) page.
Gradient Descent: Works by iteratively adjusting the values of parameters in the function until the minimum value is reached. In other words, it involves moving downhill along the steepest slope of the function towards its lowest point, hence its name "descent."
Stochastic Gradient Descent (SGD): A more dynamic cousin of Gradient Descent, updating weights more frequently for quicker learning.
RMSprop and Adagrad: These optimizers bring their own tweaks to the learning rate, making the journey smoother in specific scenarios.
Not sure which optimizer to use? Have a look at the Learned Optimizer (VeLO)!
Once you have settled on the overall model structure but want to achieve an even better model, it can be appropriate to test another optimizer. This is classic hyperparameter fine-tuning, where you try different options and see what works best. Any of these optimizers may achieve superior results, though getting there can sometimes require a lot of tuning. Note that each optimizer has its own parameters that you can customize.
In Edge Impulse, you can change the learning rate settings directly in the Neural Network settings section. To change the optimizer, you can do this using the expert mode (see the section below).
When using the Expert Mode in Edge Impulse, you can access the full Keras API:
Import the necessary libraries:
First, make sure to import the necessary modules from Keras. You'll need the model you're working with (like Sequential) and the optimizer you want to use.
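A typical set of imports might look like this (assuming a simple dense classifier; adjust to the layers and optimizer you actually use):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
```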
Define your neural network architecture:
Define your model architecture as you normally would. For example, using Sequential:
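A minimal sketch, where input_length and num_classes are placeholders for your feature count and number of labels:

```python
model = Sequential([
    Dense(64, activation='relu', input_shape=(input_length,)),  # hidden layer 1
    Dense(32, activation='relu'),                               # hidden layer 2
    Dense(num_classes, activation='softmax'),                   # output layer
])
```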
Select an optimizer
Choose the optimizer you wish to use. You can use one of the built-in optimizers in Keras, and even customize its parameters. For example, to use the Adam optimizer with a custom learning rate:
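A sketch; the learning rate value here is only illustrative:

```python
opt = Adam(learning_rate=0.0005)  # Keras default is 0.001; tune as needed
```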
Alternatively, you can use other optimizers like SGD or RMSprop in a similar way:
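For example (parameter values are illustrative):

```python
from tensorflow.keras.optimizers import SGD, RMSprop

opt = SGD(learning_rate=0.01, momentum=0.9)
# or
opt = RMSprop(learning_rate=0.001)
```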
Compile and train your model with your optimizer
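A sketch, assuming your data is available as train_dataset and validation_dataset (variable names in your expert-mode project may differ):

```python
model.compile(optimizer=opt,
              loss='categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_dataset,
          validation_data=validation_dataset,
          epochs=30,
          verbose=2)
```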
An activation function is a mathematical equation that determines the output of a neural network node, or "neuron." It adds non-linearity to the network, allowing it to learn complex patterns in the data. Without activation functions, a neural network would simply be a linear regression model, incapable of handling complex tasks like image recognition or language processing.
Several activation functions are used in neural networks, each with its characteristics and typical use cases. Some of the most common include:
ReLU (Rectified Linear Unit): It allows only positive values to pass through, introducing non-linearity. ReLU is efficient and widely used in deep learning. It is used by default in Edge Impulse for hidden layers.
Sigmoid: This function maps values into a range between 0 and 1, making it ideal for binary classification problems.
Tanh (Hyperbolic Tangent): Similar to the sigmoid but maps values between -1 and 1. It is useful in hidden layers of a neural network.
Softmax: Often used in the output layer of a neural network for multi-class classification; it turns logits into probabilities that sum to one.
Leaky ReLU: A variation of ReLU, it allows a small, non-zero gradient when the unit is not active.
The choice of activation function depends on the specific task and the characteristics of the input and output data. For instance:
ReLU and its variants are generally preferred in hidden layers due to their computational efficiency.
Sigmoid or Softmax functions are often used in the output layer for binary and multi-class classification tasks, respectively.
Note that for regression tasks, the last layer is connected directly to the target variable y_pred. Thus, there is no need for an activation function in the output layer (like sigmoid or softmax).
Please note that the default activation functions in Edge Impulse have been selected to work well for your project tasks. We would advise you to primarily focus on your dataset quality and neural network architecture to improve your model performance.
In Edge Impulse, the Expert Mode allows for advanced customization, including the use of custom activation functions. Here is how you can do it:
Import the necessary libraries
Define your neural network architecture
When adding layers to your model, specify the activation function you want to use:
Compile and train your model:
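Putting these steps together, a minimal sketch (layer sizes, activation choices, and dataset variable names are placeholders):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu', input_shape=(input_length,)),  # ReLU in hidden layers
    Dense(32, activation='tanh'),                               # any supported activation works here
    Dense(num_classes, activation='softmax'),                   # softmax for multi-class output
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_dataset, validation_data=validation_dataset, epochs=30, verbose=2)
```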
A loss function, also known as a cost function, is a method to measure the performance of a machine learning model. Essentially, it calculates the difference between the model's predictions and the actual target values. The goal of training a neural network is to minimize this difference, thereby improving the model's accuracy.
The loss function quantifies how well the model is performing. A higher loss indicates greater deviation from the actual values, while a lower loss signifies that the model's predictions are closer to the target values.
Loss Function: The loss function is a mathematical expression that measures the difference or 'error' between the actual output (prediction) of a model and the desired output (label). It helps us evaluate how well our model is performing. In other words, it quantifies the cost of misclassification.
Optimizer: An optimizer is an algorithmic entity designed to minimize the loss function. Its goal is to adjust the parameters (weights and biases) of a neural network in such a way that the loss is minimized. This is typically done through iterative processes like gradient descent or its variations. The optimizer calculates the partial derivative of the loss with respect to each parameter, which indicates the direction and magnitude of changes needed to reduce the loss.
So, while the loss function quantifies how 'wrong' our model is, the optimizer tries to minimize this error by changing the parameters of the model.
Each type of neural network task generally has a different loss function that is most suitable for it. Here are some of the common loss functions used:
Mean Squared Error (MSE): Used primarily for regression problems. It calculates the average of the squared differences between the predicted values and the actual values. It can be used for both single-step prediction tasks and time series forecasting problems. The goal is to minimize this average error, resulting in more accurate predictions. It is used by default in Edge Impulse regression learning blocks.
Mean Absolute Error (MAE): The MAE is another regression loss function that measures the average absolute difference between the predicted and actual target values. Unlike MSE, which squares the errors and therefore amplifies large deviations, MAE weights all errors linearly, making it less sensitive to outliers. This makes it a good choice for problems with skewed data or noticeable outliers.
Binary Cross-Entropy Loss: Ideal for binary classification problems. It measures the difference between the predicted probabilities and the actual labels by minimizing the sum of the losses for each sample. Note that this loss function is commonly used in conjunction with the sigmoid activation function.
Categorical Cross-Entropy: Similar to the Binary Cross-Entropy, the Categorical Cross-Entropy is mostly used for multi-class classification. It measures the difference between the predicted probabilities and the actual labels for each class in a sample. The sum of these losses across all samples is then minimized. It is used by default in Edge Impulse classification learning blocks. Note that this loss function is commonly used in conjunction with the softmax activation function (also used by default in Edge Impulse for classification problems).
Huber Loss: A combination of MSE and MAE (Mean Absolute Error). It is less sensitive to outliers than MSE. It starts as the square of the difference between the predicted and actual values for small errors, similar to MSE. However, once the error exceeds a certain threshold, it switches to a linear relationship like MAE. This makes Huber loss more robust against outliers compared to MSE, while still maintaining its smoothness.
Log Loss: Similar to cross-entropy loss, it measures the performance of a classification model where the output is a probability value between 0 and 1.
Choosing the right loss function is an integral part of model design. The choice depends on the type of problem (regression, classification, etc.) and the specific requirements of your application (like sensitivity to outliers).
Just as with optimizers, once you have settled on your overall model structure and chosen an appropriate loss function, you may want to fine-tune the settings further to achieve even better performance. This can involve testing different loss functions or adjusting their parameters to see what works best for your specific task.
In Edge Impulse, by default, we use:
Mean Squared Error (MSE) for regression tasks.
Categorical Cross-Entropy for classification tasks.
You can change them in the Expert Mode (see below). Please note that the default loss functions in Edge Impulse have been selected to work well with most tasks. We would advise you to primarily focus on your dataset quality and neural network architecture to improve your model performance.
In Edge Impulse, the Expert Mode allows for advanced customization, including the use of custom loss functions. Here is how you can do it:
Import the necessary libraries
Define your neural network architecture
Select a loss function
Choose the loss function that suits your problem.
For instance, for a regression problem, you might choose Mean Squared Error:
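For example:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.MeanSquaredError()
```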
For a binary classification problem, Binary Cross-Entropy might be more appropriate:
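For example:

```python
loss_fn = tf.keras.losses.BinaryCrossentropy()
```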
Compile and train your model with your chosen loss function
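A sketch, again assuming train_dataset and validation_dataset hold your data:

```python
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
model.fit(train_dataset, validation_data=validation_dataset, epochs=30, verbose=2)
```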
Machine learning model development involves several critical choices, such as the type of problem (classification or regression), the model architecture (e.g., dense layers, convolutions), and the available data. However, one often overlooked choice is the optimizer. This component is essential in the training loop, which typically involves:
Starting with a model with randomly initialized weights.
Passing labeled data through the model and comparing the output with the correct output using a "loss function".
Using an optimizer to make adjustments to the model weights based on the loss function results.
Repeating the process until the model's performance ceases to improve.
While there are various optimizers available, Adam [1] has become a default choice for many projects due to its general effectiveness. Unlike traditional optimizers, which are human-designed functions, VeLO represents a novel approach where the optimizer is itself a neural network that is trained on prior training jobs.
If you are not familiar with optimizers, see this page: Optimizers
VeLO (Versatile Learned Optimizers) is an innovative concept where the optimizer is trained using a large number of training jobs, as detailed in the paper "VeLO: Training Versatile Learned Optimizers by Scaling Up" [2]. This approach contrasts with traditional optimizers, like Adam, which are handcrafted functions.
The learned optimizer can help you get some extra performance for certain models. For optimal results with VeLO, it is recommended to use as large a batch size as possible, potentially equal to the dataset's size. This approach, however, may lead to out-of-memory issues for some projects. Here are some pros and cons of using the learned optimizer:
Pros
VeLO generally requires less tuning compared to Adam.
The learned optimizer works well across various scenarios without specific adjustments.
Cons
VeLO comprises a large LSTM model, often larger than the models it trains. This requires more computational resources, particularly for GPU-intensive models like vision models.
The Learned Optimizer can be enabled in Edge Impulse as an option on the training page.
The simplest way to use VeLO in expert mode is to enable the flag for a project and then switch to expert mode. This will pre-fill the needed lines of code in the expert mode.
To use VeLO in expert mode for an existing project:
Remove any existing optimizer creation, model.compile, or model.fit calls.
Replace them with the train_keras_model_with_velo method.
Consider the following graph which shows several runs of Adam vs VeLO:
The most influential hyperparameter of Adam is the learning rate.
If the learning rate is too low (e.g., the red graph "adam_0.0001"), the model takes too long to make progress. If the learning rate is too high (e.g., the blue graph "adam_0.05"), the optimization becomes unstable.
One of the benefits of VeLO is that it doesn't require a learning rate to do well. In this example we see VeLO (purple graph "velo") doing as well as the best Adam learning rate variant.
As a side note, VeLO was designed for training large models. In this example to get the best result, the batch size was equal to the dataset size.
The following projects contain both a learning block with and without the learned optimizer so you can easily see the differences:
Image classification using transfer learning: Microscope - VeLO
Vibration analysis: Coffee Machine Stages - Multi-label data - VeLO
Neural network architectures can be composed of multiple layers, each with specific roles and functions. These layers act as the building blocks of the network. The configuration and interaction of these layers define the capabilities of different neural network architectures, allowing them to learn from data and perform a wide array of tasks. From the initial data reception in the input layer through various transformation stages in hidden layers, and finally to the output layer where results are produced, each layer contributes to the network's overall intelligence and performance.
How can I make sure these layers will work on an edge device?
In Edge AI applications, these layers need to be optimized not just for accuracy, but also for computational and memory efficiency to perform well within the constraints of edge devices. Some architectures may not be suitable for constrained devices because of their computational complexity, resource requirements, or unsupported operators. See the supported operators for more details.
If you don't know where to start, try out the EON Tuner, our device-aware Auto ML tool.
Also, feel free to profile your models for edge deployments using our BYOM feature or using our Python SDK.
With this page, we want to provide an overview of various neural network layers commonly used in edge machine learning.
The Input Layer serves as the initial phase of the neural network. It is responsible for receiving all the input data for the model. This layer does not perform any computation or transformation. It simply passes the features to the subsequent layers. The dimensionality of the Input Layer must match the shape of the data you're working with. For instance, in image processing tasks, the input layer's shape would correspond to the dimensions of the image, including the width, height, and color channels.
A Dense layer, often referred to as a fully connected layer, is the most basic form of a layer in neural networks. Each neuron in a dense layer receives input from all the neurons of the previous layer, hence the term "fully connected". It's a common layer that can be used to process data that has been flattened or transformed from a higher to a lower dimension.
The Reshape layer is used to change the shape of the input data without altering its contents. It's particularly useful when you need to prepare the dataset for certain types of layers that require the input data to be in a particular shape.
Flatten layers are used to convert multi-dimensional data into a one-dimensional array. This is typically done before feeding the data into a Dense layer.
The Dropout layer is a regularization technique that reduces the risk of overfitting in neural networks. It does so by randomly setting a fraction of the input units to zero during each update of the training phase, which helps to make the network more robust and less sensitive to the specific weights of neurons.
The 1D Convolution layer is specifically designed for analyzing sequential data, such as audio signals or time-series data. This type of layer applies a series of filters to the input data to extract features. These filters slide over the data to produce a feature map, capturing patterns like trends or cycles that span over a sequence of data points.
Complementing the 1D Convolution layer, the 1D Pooling layer aims to reduce the spatial size of the feature maps, thus reducing the number of parameters and computation in the network. It works by aggregating the information within a certain window, usually by taking the maximum (Max Pooling) or the average (Average Pooling) of the values. This operation also helps to make the detection of features more invariant to scale and orientation changes in the input data.
The 2D Convolution layer is used primarily for image data and other two-dimensional input (like spectrograms). This layer operates with filters that move across the input image's height and width to detect patterns like edges, corners, or textures. Each filter produces a 2D activation map that represents the locations and strength of detected features in the input.
The 2D Pooling layer serves a similar purpose as its 1D counterpart but in two dimensions. After the convolution layer has extracted features from the input, the pooling layer reduces the spatial dimensions of these feature maps. It summarizes the presence of features in patches of the feature map and reduces sensitivity to the exact location of features. Max Pooling and Average Pooling are common types of pooling operations used in 2D Pooling layers.
The Output Layer is the final layer in a neural network architecture, responsible for producing the results based on the learned features and representations from the previous layers. Its design is closely aligned with the specific objective of the neural network, such as classification, regression, or even more complex tasks like image segmentation or language translation.
Customizing Layers in Edge Impulse
There are two options to modify the layers with Edge Impulse Studio. Either directly from the Neural Network Architecture panel where you can choose from a wide range of predefined layers, or using the expert mode to access the TensorFlow/Keras APIs. See below to understand how to build a model with multiple layers in Expert Mode.
If you are an experienced ML practitioner, you can also bring your own model or bring your own architecture.
Import the necessary libraries
Define your neural network architecture
Compile and train your model
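As an illustration, here is a minimal sketch of a small 2D-convolutional model in Expert Mode combining several of the layers described above (the input shape, layer sizes, and num_classes are placeholders):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

model = Sequential([
    # Input: 96x96 grayscale images (placeholder shape)
    Conv2D(16, kernel_size=3, activation='relu', input_shape=(96, 96, 1)),
    MaxPooling2D(pool_size=2),          # 2D pooling: halve the spatial dimensions
    Conv2D(32, kernel_size=3, activation='relu'),
    MaxPooling2D(pool_size=2),
    Flatten(),                          # convert 2D feature maps to a 1D vector
    Dropout(0.25),                      # regularization to reduce overfitting
    Dense(num_classes, activation='softmax'),  # output layer for classification
])

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_dataset, validation_data=validation_dataset, epochs=30, verbose=2)
```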
Neural networks are a set of algorithms, modeled loosely after the human brain, designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling, or clustering of raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text, or time series, must be translated.
Please note that K-Means and Gaussian Mixture Models (GMM) are not neural networks. They are algorithms used in unsupervised machine learning, specifically for clustering tasks.
In Edge Impulse, neural networks can be used for supervised learning tasks such as image or audio classification, regression, and object detection, either using transfer learning, using pre-set neural network architectures, or by designing your own.
Neural networks consist of layers of interconnected nodes, also known as neurons.
Each node receives input from its predecessors, processes it, and passes its output to succeeding nodes. The processing involves weighted inputs, a bias (threshold), and an activation function that determines whether and to what extent the signal should progress further through the network.
Neurons are organized into layers: input, hidden, and output layers. The complexity of the network depends on the number and size of these layers.
Input Layer: Receives raw input data.
Hidden Layers: Perform computations using weighted inputs.
Output Layer: Produces the final output.
Neural networks can vary widely in architecture, adapting to different types of problems and data.
The power of neural networks lies in their ability to learn. Learning occurs through a process called training, where the network adjusts its weights based on the difference between its output and the desired output. This process is facilitated by an optimizer, which guides the network in adjusting its weights to minimize error (the loss).
Training: Neural networks learn by adjusting weights based on the error in predictions. This process is repeated over many training cycles, or epochs, using training data.
Backpropagation: A key mechanism where the network adjusts its weights starting from the output layer and moving backward through the hidden layers, minimizing error with each pass.
In Edge AI, neural networks operate under constraints of lower computational power and energy efficiency. They need to be optimized for speed and size without compromising too much on accuracy. This often involves techniques like feature extraction, neural network architectures, transfer learning, quantization, and model pruning.
Feature Extraction: Extracting meaningful features from the raw data that can be effectively processed by the neural network on resource-constrained devices.
Neural Networks Architectures: Selecting a model architecture that is designed to run efficiently on the type of processor you are targeting, and fit within memory constraints.
Transfer Learning: Using a pre-trained model and retraining it with a specific smaller dataset relevant to the edge application.
Quantization: Reducing the precision of the numbers used in the model to decrease the computational and storage burden.
Model Pruning: Reducing the size of the model by eliminating unnecessary nodes and layers.
Neural networks, in the context of Edge AI, must be designed and optimized to function efficiently in resource-constrained environments, balancing the trade-off between accuracy and performance.
To learn more about Neural Networks, see the “Introduction to Neural Networks” video in our “Introduction to Embedded Machine Learning” course:
In Edge AI, where models are deployed on resource-constrained devices like microcontrollers, evaluation metrics are critical. They ensure that your model performs well in terms of accuracy and runs efficiently on the target hardware. By understanding these metrics, you can fine-tune your models to achieve the best balance between performance and resource usage.
These metrics serve several important purposes:
Model Comparison: Metrics allow you to compare different models and see which one performs better.
Model Tuning: They help you adjust and improve your model by showing where it might be going wrong.
Model Validation: Metrics ensure that your model generalizes well to new data, rather than just memorizing the training data (a problem known as overfitting).
Choosing the right metric depends on your specific task and the application's requirements:
Precision: Needed when avoiding false positives, such as in medical diagnosis. (Read on Scikit-learn Precision | Read on TensorFlow Precision)
Recall: Vital when missing detections is costly, like in security applications. (Read on Scikit-learn Recall | Read on TensorFlow Recall)
Lower IoU Thresholds: Suitable for tasks where rough localization suffices.
Higher IoU Thresholds: Necessary for tasks requiring precise localization.
Understanding these metrics in context ensures that your models are not only accurate but also suitable for their intended applications.
Used for problems where the output is a category, such as detecting whether a sound is a cough or not:
Accuracy: Measures the percentage of correct predictions out of all predictions. For instance, in a model that classifies sounds on a wearable device, accuracy tells you how often the model gets it right. (Read on Scikit-learn Accuracy | Read on TensorFlow Accuracy)
\( \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \)

where:

\( TP \): True Positives
\( TN \): True Negatives
\( FP \): False Positives
\( FN \): False Negatives
Precision: The percentage of true positive predictions out of all positive predictions made by the model. This is crucial in cases where false positives can have significant consequences, such as in health monitoring devices. (Read on Scikit-learn Precision | Read on TensorFlow Precision)
Recall: The percentage of actual positive instances that the model correctly identified. For example, in a fall detection system, recall is vital because missing a fall could lead to serious consequences. (Read on Scikit-learn Recall | Read on TensorFlow Recall)
F1 Score: The harmonic mean of precision and recall, useful when you need to balance the trade-offs between false positives and false negatives. (Read on Scikit-learn F1 Score | Read on TensorFlow F1 Score)
Confusion Matrix: A table that shows the number of correct and incorrect predictions made by the model. It helps visualize the model's performance across different classes. (Read on Scikit-learn Confusion Matrix | Read on TensorFlow Confusion Matrix)
This confusion matrix helps evaluate the performance of the model by showing where it is performing well (high values along the diagonal) and where it is making mistakes (off-diagonal values).
Here's how to interpret it:
Labels: The "True label" on the Y-axis represents the actual class labels of the activities. The "Predicted label" on the X-axis represents the class labels predicted by the model.
Classes: The dataset seems to have three classes, represented as 0, 1, and 2. These likely correspond to different human activities.
Matrix Cells: The cells in the matrix contain the number of samples classified in each combination of actual versus predicted class.
For instance: The top-left cell (44) indicates that the model correctly predicted class 0 for 44 instances where the true label was also 0.
The off-diagonal cells represent misclassifications. For example, the cell at row 0, column 1 (29) shows that 29 samples were true class 0 but were incorrectly predicted as class 1.
Color Scale: The color scale on the right represents the intensity of the values in the cells, with lighter colors indicating higher values and darker colors indicating lower values.
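As a quick illustration, these classification metrics can be computed with scikit-learn (the label arrays below are hypothetical):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [0, 0, 1, 1, 2, 2, 2, 1]   # hypothetical ground-truth labels
y_pred = [0, 1, 1, 1, 2, 2, 0, 1]   # hypothetical model predictions

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average='macro'))
print(recall_score(y_true, y_pred, average='macro'))
print(f1_score(y_true, y_pred, average='macro'))
print(confusion_matrix(y_true, y_pred))
```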
ROC-AUC: The area under the receiver operating characteristic curve, showing the trade-off between true positive rate and false positive rate. (Read on Scikit-learn ROC-AUC | Read on TensorFlow AUC)
The ROC (Receiver Operating Characteristic) curve is a commonly used tool for evaluating the performance of binary classification models. It plots the trade-off between the True Positive Rate (TPR, or Recall) and the False Positive Rate (FPR) for different threshold values, where:
True Positive Rate (Y-axis): This is the proportion of actual positives (walking instances) that the model correctly identifies (recall).
False Positive Rate (X-axis): This is the proportion of actual negatives (rest instances) that the model incorrectly identifies as positives (false positives).
Precision-Recall Curve: Useful in evaluating binary classification models, especially when dealing with imbalanced datasets, like in the context of walking vs resting activities. The Precision-Recall curve shows the trade-off between precision and recall for various threshold settings of the classifier.
Precision (Y-axis): Precision measures the proportion of true positive predictions among all positive predictions made by the model. High precision means that when the model predicts "Walking," it is correct most of the time.
Recall (X-axis): Recall (or True Positive Rate) measures the proportion of actual positives (walking instances) that the model correctly identifies. High recall indicates that the model successfully identifies most instances of walking.
Log Loss: The negative log-likelihood of the true labels given the model predictions. (Read on Scikit-learn Log Loss | Read on TensorFlow Log Loss)
\( \text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \)

where:

\( y_i \): Actual label
\( p_i \): Predicted probability
\( N \): Number of samples
Used for problems where the output is a continuous value, like predicting the temperature from sensor data:
Mean Squared Error (MSE): The average of the squared differences between the predicted values and the actual values. In an edge device that predicts temperature, MSE penalizes larger errors more heavily, making it crucial for ensuring accurate predictions. (Read on Scikit-learn MSE | Read on TensorFlow MSE)
\( \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \)

where:

\( y_i \): Actual value
\( \hat{y}_i \): Predicted value
\( N \): Number of samples
Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values, providing a straightforward measure of prediction accuracy. This is useful in energy monitoring systems where predictions need to be as close as possible to the actual values. (Read on Scikit-learn MAE | Read on TensorFlow MAE)
R-Squared (R2): Measures how well your model explains the variability in the data. A higher R2 indicates a better model fit, which is useful when predicting variables like energy consumption in smart homes. (Read on Scikit-learn R2 Score | Read on TensorFlow R2 Score (Custom Implementation))
\( R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \)

where \( \bar{y} \) is the mean of the actual values.
Used for problems where the goal is to identify and locate objects in an image, such as detecting pedestrians in a self-driving car system.
Focusing on the COCO mAP Score:
The COCO mAP (Mean Average Precision) score is a key metric used to evaluate the performance of an object detection model. It measures the model's ability to correctly identify and locate objects within images.
For example, a result showing a mAP of 0.3 may seem low, but it can still accurately reflect the model's performance. The mAP is averaged over Intersection over Union (IoU) thresholds from 0.5 to 0.95, capturing the model's ability to localize objects with varying degrees of precision.
How It Works
Detection and Localization: The model attempts to detect objects in an image and draws a bounding box around each one.
Intersection over Union (IoU): IoU calculates the overlap between the predicted bounding box and the actual (true) bounding box, as shown in the sketch after this list. An IoU of 1 indicates perfect overlap, while 0 means no overlap.
Precision Across Different IoU Thresholds: The mAP score averages the precision (the proportion of correctly detected objects) across different IoU thresholds (e.g., 0.5, 0.75). This demonstrates the model's performance under both lenient (low IoU) and strict (high IoU) conditions.
Final Score: The final mAP score is the average of these precision values. A higher mAP score indicates that the model is better at correctly detecting and accurately placing bounding boxes around objects in various scenarios.
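To make the IoU computation concrete, here is a short sketch for axis-aligned bounding boxes (the coordinates in the example are hypothetical):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    # Union = sum of both areas minus the intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

# Example: predicted box vs. ground-truth box
print(iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39
```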
IoU Thresholds
mAP@IoU=0.5 (AP50): A less strict metric, useful for broader applications where rough localization is acceptable.
mAP@IoU=0.75 (AP75): A stricter metric requiring higher overlap between predicted and true bounding boxes, ideal for tasks needing precise localization.
mAP@[IoU=0.5:0.95]: The average of AP values computed at IoU thresholds ranging from 0.5 to 0.95. This primary COCO challenge metric provides a balanced view of the model's performance.
Area-Based Evaluation
mAP can also be broken down by object size—small, medium, and large—to assess performance across different object scales:
Small Objects: Typically smaller than 32x32 pixels.
Medium Objects: Between 32x32 and 96x96 pixels.
Large Objects: Larger than 96x96 pixels.
Models generally perform better on larger objects, but understanding performance across all sizes is crucial for applications like aerial imaging or medical diagnostics.
Recall Metrics
Recall in object detection measures the ability of a model to find all relevant objects in an image:
Recall@[max_detections=1, 10, 100]: These metrics measure recall when considering only the top 1, 10, or 100 detections per image, providing insight into the model's performance under different detection strictness levels.
Recall by Area: Similar to mAP, recall can also be evaluated based on object size, helping to understand how well the model recalls objects of different scales.
Evaluation metrics serve multiple purposes in the impulse lifecycle:
Model Selection: They enable you to compare different models and choose the one that best suits your needs.
Model Tuning: Metrics guide you in fine-tuning models by providing feedback on their performance.
Model Interpretation: Metrics help understand how well a model performs and where it might need improvement.
Model Deployment: Before deploying a model in real-world applications, metrics are used to ensure it meets the required standards.
Model Monitoring: After deployment, metrics continue to monitor the model's performance over time.
Choosing the right metric depends on the specific task and application requirements:
For classification: In an Edge AI application like sound detection on a wearable device, precision might be more important if you want to avoid false alarms, while recall might be critical in safety applications where missing a critical event could be dangerous.
For regression: If you're predicting energy usage in a smart home, MSE might be preferred because it penalizes large errors more, ensuring your model's predictions are as accurate as possible.
For object detection: If you're working on an edge-based animal detection camera, mAP with a higher IoU threshold might be crucial for ensuring the camera accurately identifies and locates potential animals.
Evaluation metrics like mAP and recall provide useful insights into the performance of machine learning models, particularly in object detection tasks. By understanding and appropriately focusing on the correct metrics, you can ensure that your models are robust, accurate, and effective for real-world deployment.
An epoch (also known as training cycle) in machine learning is a term used to describe one complete pass through the entire training dataset by the learning algorithm. During an epoch, the machine learning model is exposed to every example in the dataset once, allowing it to learn from the data and adjust its parameters (weights) accordingly. The number of epochs is a hyperparameter that determines the number of times the learning algorithm will work through the entire training dataset.
The number of epochs is an important hyperparameter for the training process of a machine learning model. Too few epochs can result in an underfitted model, where the model has not learned enough from the training data to make accurate predictions. On the other hand, too many epochs can lead to overfitting, where the model has learned too well from the training data, including the noise, making it perform poorly on new, unseen data.
Selecting the appropriate number of epochs is a balance between underfitting and overfitting.
Underfitting: One of the most straightforward indicators of underfitting is if the model performs poorly on the training data. This can be observed in Edge Impulse Studio through metrics such as accuracy, or loss, depending on the type of problem (classification or regression). If these metrics indicate poor performance, it suggests that the model has not learned the patterns of the data well. In that case, increasing the number of epochs can improve your model performance. Please note that other solutions exist such as increasing your neural network architecture complexity, changing the preprocessing technique or reducing regularization.
Overfitting: Detecting overfitting involves recognizing when the model has learned too much from the training data, including its noise and outliers, to the detriment of its performance on new, unseen data. Overfitting is characterized by the model performing exceptionally well on the training data but poorly on the validation or test data. It can be evaluated by comparing the performance of the model on the training set and the validation set during training. When the performance on the validation set starts to degrade, it might indicate that the model is beginning to overfit the training data. In that case, decreasing the number of epochs can improve your model performance. As with underfitting, other solutions exist to reduce overfitting, such as increasing the amount of training data, adding regularization techniques that penalize large weights, adding dropout layers, simplifying the model architecture, or using early stopping.
During each epoch, the dataset is typically divided into smaller batches. This approach, known as batch training, allows for more efficient and faster processing, especially with large datasets. The learning algorithm iterates through these batches, making predictions, calculating errors, and updating model parameters using an optimizer. An epoch consists of the following steps:
Initialization: Before training begins, the model's internal parameters (weights) are typically initialized randomly or according to a specific strategy.
Forward pass: For each example in the training dataset, the model makes a prediction (forward pass). This involves calculating the output of the model given its current weights and the input data.
Loss calculation: After making a prediction, the model calculates the loss (or error) by comparing its prediction to the actual target value using a loss function. The loss function quantifies how far the model's prediction is from the target.
Backward pass (backpropagation): The model updates its weights to reduce the loss. This is done through a process called backpropagation, where the gradient of the loss function with respect to each weight is computed. The gradients indicate how the weights should be adjusted to minimize the loss.
Weight update: Using an optimization algorithm (such as Gradient Descent, Adam, etc.), the model adjusts its weights based on the gradients calculated during backpropagation. The goal is to reduce the loss by making the model's predictions more accurate.
Iteration over batches: An epoch consists of iterating over all batches in the dataset, performing the forward pass, loss calculation, backpropagation, and weight update for each batch.
Completion of an epoch: Once the model has processed all batches in the dataset, one epoch is complete. The model has now seen each example in the dataset exactly once.
When training neural networks, both epochs and batch sizes are fundamental concepts, yet they serve distinct roles in the training process. An epoch represents one complete pass through the entire training dataset, where the model has the opportunity to learn from every example within the dataset once. This means that if you set the training to run for, say, 10 epochs, the entire dataset will be passed through the neural network 10 times, allowing the model to refine its weights and biases to improve its accuracy with each pass.
On the other hand, the batch size refers to the number of training examples utilized in one iteration of the training process. Instead of passing the entire dataset through the network at once (which can be computationally expensive and memory-intensive), the dataset is divided into smaller batches. For example, if you have a dataset of 2000 examples and choose a batch size of 100, it would take 20 iterations (batches) to complete one epoch. The batch size affects the updating of model parameters; with smaller batch sizes leading to more frequent updates, potentially increasing the granularity of the learning process but also introducing more variance in the updates. Conversely, larger batch sizes provide a more stable gradient estimate, but with less frequent updates, it could lead to slower convergence.
In Edge Impulse, you can specify the number of training cycles in the training settings for your neural network-based models. Adjusting this parameter allows you to fine-tune the training process, aiming for the best possible model performance on your specific dataset. It's important to monitor both training and validation loss to determine the optimal number of epochs for your model.
When using the Expert Mode in Edge Impulse, you can access the full Keras API:
You can modify the following line in the expert mode to change the number of training cycles:
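In the expert-mode code, the training-cycle count is typically defined as a single variable; a hypothetical example (the exact pre-filled line in your project may differ):

```python
EPOCHS = 30  # number of training cycles
```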
When compiling and training your model, specify the number of epochs in the model.fit() function as follows:
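A sketch (the dataset variable names are assumptions based on the expert-mode template):

```python
model.fit(train_dataset,
          validation_data=validation_dataset,
          epochs=EPOCHS,  # number of training cycles
          verbose=2)
```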
The following approach allows your model to stop training as soon as it starts overfitting, or if further training doesn't lead to better performance, making your training process more efficient and potentially leading to better model performance.
Import EarlyStopping from tensorflow.keras.callbacks.
Instantiate an EarlyStopping callback, specifying the metric to monitor (e.g., val_loss or val_accuracy), the minimum change that qualifies as an improvement (min_delta), the number of epochs with no improvement after which training will be stopped (patience), and whether the weights from the best epoch should be restored when training stops (restore_best_weights).
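A minimal sketch (the monitored metric and threshold values are illustrative):

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor='val_loss',         # metric to watch
    min_delta=0.001,            # smallest change that counts as an improvement
    patience=5,                 # epochs without improvement before stopping
    restore_best_weights=True   # roll back to the best-performing weights
)

model.fit(train_dataset,
          validation_data=validation_dataset,
          epochs=100,                  # upper bound; training may stop earlier
          callbacks=[early_stopping],
          verbose=2)
```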
Find the full early stopping documentation on Keras documentation or have a look at this Edge Impulse public project as an example.