Machine learning model development involves several critical choices, such as the type of problem (classification or regression), the model architecture (e.g., dense layers, convolutions), and the available data. However, one often overlooked choice is the optimizer. This component is essential in the training loop, which typically involves:
Starting with a model with randomly initialized weights.
Passing labeled data through the model and comparing the output with the correct output using a "loss function".
Using an optimizer to make adjustments to the model weights based on the loss function results.
Repeating the process until the model's performance ceases to improve.
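The four steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the "model" is a single weight `w` fitted to `y = w * x`, the loss is the mean squared error, and the optimizer is plain gradient descent. The function name `train` and its parameters are hypothetical and are not part of Edge Impulse or Keras.

```python
import random

def train(xs, ys, learning_rate=0.01, tolerance=1e-9, max_steps=10_000, seed=0):
    """Fit y ~= w * x with gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)              # step 1: randomly initialized weight
    previous_loss = float("inf")
    loss = previous_loss
    for _ in range(max_steps):
        # Step 2: pass the data through the model and compute the loss.
        predictions = [w * x for x in xs]
        loss = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(xs)
        # Step 4: stop once the loss no longer improves noticeably.
        if previous_loss - loss < tolerance:
            break
        previous_loss = loss
        # Step 3: optimizer update, moving the weight against the gradient.
        gradient = sum(2 * (p - y) * x for p, y, x in zip(predictions, ys, xs)) / len(xs)
        w -= learning_rate * gradient
    return w, loss

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # labels generated by y = 2 * x
w, loss = train(xs, ys)
```

After training, `w` ends up close to 2 and the loss close to 0. Real projects replace the single weight with a full neural network and the hand-written update with an optimizer such as Adam.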
While there are various optimizers available, Adam [1] has become a default choice for many projects due to its general effectiveness. Unlike traditional optimizers such as Adam, which are described by human-designed functions, VeLO represents a novel approach where the optimizer is itself a neural network that is trained on prior training jobs.
If you are not familiar with optimizers, see this page: Optimizers
VeLO (Versatile Learned Optimizers) is an innovative concept where the optimizer is trained using a large number of training jobs, as detailed in the paper "VeLO: Training Versatile Learned Optimizers by Scaling Up" [2]. This approach contrasts with traditional optimizers, like Adam, which are handcrafted functions.
The learned optimizer can help you get some extra performance for certain models. For optimal results with VeLO, it is recommended to use as large a batch size as possible, potentially equal to the dataset's size. This approach, however, may lead to out-of-memory issues for some projects. Here are some pros and cons of using the learned optimizer:
Pros
VeLO generally requires less tuning compared to Adam.
The learned optimizer works well across various scenarios without specific adjustments.
Cons
VeLO comprises a large LSTM model, often larger than the models it trains. This requires more computational resources, particularly for GPU-intensive models like vision models.
The Learned Optimizer can be enabled in Edge Impulse as an option on the training page.
The simplest way to use VeLO in expert mode is to enable the flag for a project and then switch to expert mode. This will pre-fill the needed lines of code in the expert mode.
To use VeLO in expert mode for an existing project:
Remove any existing optimizer creation, model.compile, or model.fit calls.
Replace with the train_keras_model_with_velo method.
Consider the following graph which shows several runs of Adam vs VeLO:
The most influential hyperparameter of Adam is the learning rate.
If the learning rate is too low (e.g. the red graph "adam_0.0001"), the model takes too long to make progress. If the learning rate is too high (e.g. the blue graph "adam_0.05"), the optimization becomes unstable.
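The effect of the learning rate can be reproduced on a toy problem. The sketch below is an assumption-laden stand-in: it uses plain gradient descent (not Adam) on the function f(w) = w², and the three learning rates are chosen for illustration only; they are not the values from the graph. The helper name `minimize` is hypothetical.

```python
def minimize(learning_rate, steps=50, start=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final |w|."""
    w = start
    for _ in range(steps):
        gradient = 2 * w                 # derivative of w**2
        w -= learning_rate * gradient
    return abs(w)

too_low = minimize(0.0001)   # barely moves away from the starting point
good = minimize(0.1)         # approaches the minimum at w = 0
too_high = minimize(1.1)     # overshoots further on every step and diverges
```

With these settings, the low learning rate leaves `w` almost unchanged after 50 steps, the moderate one brings it close to zero, and the high one makes `|w|` grow at every step.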
One of the benefits of VeLO is that it doesn't require a learning rate to do well. In this example we see VeLO (purple graph "velo") doing as well as the best Adam learning rate variant.
As a side note, VeLO was designed for training large models. In this example to get the best result, the batch size was equal to the dataset size.
The following projects contain both a learning block with and without the learned optimizer so you can easily see the differences:
Image classification using transfer learning: Microscope - VeLO
Vibration analysis: Coffee Machine Stages - Multi-label data - VeLO
An activation function is a mathematical equation that determines the output of a neural network node, or "neuron." It adds non-linearity to the network, allowing it to learn complex patterns in the data. Without activation functions, a neural network would simply be a linear regression model, incapable of handling complex tasks like image recognition or language processing.
Several activation functions are used in neural networks, each with its characteristics and typical use cases. Some of the most common include:
ReLU (Rectified Linear Unit): It allows only positive values to pass through, introducing non-linearity. ReLU is efficient and widely used in deep learning. It is used by default in Edge Impulse for hidden layers.
Sigmoid: This function maps values into a range between 0 and 1, making it ideal for binary classification problems.
Tanh (Hyperbolic Tangent): Similar to the sigmoid but maps values between -1 and 1. It is useful in hidden layers of a neural network.
Softmax: Often used in the output layer of a neural network for multi-class classification; it turns logits into probabilities that sum to one.
Leaky ReLU: A variation of ReLU, it allows a small, non-zero gradient when the unit is not active.
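To make the definitions above concrete, here is a minimal sketch of these activation functions in plain Python. It is meant for reading, not for production use; frameworks such as Keras provide optimized versions. The `negative_slope` default of 0.01 is an assumed illustrative value.

```python
import math

def relu(x):
    """Pass positive values through, clamp negatives to zero."""
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but keep a small slope for negative inputs."""
    return x if x > 0 else negative_slope * x

def sigmoid(x):
    """Map any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Map any real value into the range (-1, 1)."""
    return math.tanh(x)

def softmax(logits):
    """Turn a list of logits into probabilities that sum to one."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
```

Subtracting the largest logit before exponentiating does not change the result of softmax, but it avoids overflow for large inputs.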
The choice of activation function depends on the specific task and the characteristics of the input and output data. For instance:
ReLU and its variants are generally preferred in hidden layers due to their computational efficiency.
Sigmoid or Softmax functions are often used in the output layer for binary and multi-class classification tasks, respectively.
Note that for regression tasks, the last layer is connected to the target variable y_pred. Thus, there is no need for an activation function in the output layer (like sigmoid or softmax).
Please note that the default activation functions in Edge Impulse have been selected to work well for your project tasks. We would advise you to primarily focus on your dataset quality and neural network architecture to improve your model performances.
Import the necessary libraries
Define your neural network architecture
When adding layers to your model, specify the activation function you want to use:
Compile and train your model:
A loss function, also known as a cost function, is a method to measure the performance of a machine learning model. Essentially, it calculates the difference between the model's predictions and the actual target values. The goal of training a neural network is to minimize this difference, thereby improving the model's accuracy.
The loss function quantifies how well the model is performing. A higher loss indicates greater deviation from the actual values, while a lower loss signifies that the model's predictions are closer to the target values.
Loss Function: The loss function is a mathematical expression that measures the difference or 'error' between the actual output (prediction) of a model and the desired output (label). It helps us evaluate how well our model is performing. In other words, it quantifies the cost of misclassification.
Optimizer: An optimizer is an algorithmic entity designed to minimize the loss function. Its goal is to adjust the parameters (weights and biases) of a neural network in such a way that the loss is minimized. This is typically done through iterative processes like gradient descent or its variations. The optimizer calculates the partial derivative of the loss with respect to each parameter, which indicates the direction and magnitude of changes needed to reduce the loss.
So, while the loss function quantifies how 'wrong' our model is, the optimizer tries to minimize this error by changing the parameters of the model.
Each type of neural network task generally has a different loss function that is most suitable for it. Here are some of the common loss functions used:
Mean Squared Error (MSE): Used primarily for regression problems. It calculates the average of the squared differences between the predicted values and the actual values. It can be used for both single-step prediction tasks and time series forecasting problems. The goal is to minimize this average error, resulting in more accurate predictions. It is used by default in Edge Impulse.
Mean Absolute Error (MAE): The MAE is another regression loss function that measures the average absolute difference between the predicted and actual target values. Unlike MSE, which squares the errors, MAE uses the absolute value of each error directly, so large errors are not amplified and the loss is less sensitive to outliers. This makes it a good choice for problems with skewed or imbalanced data.
Binary Cross-Entropy Loss: Ideal for binary classification problems. It measures the difference between the predicted probabilities and the actual labels by minimizing the sum of the losses for each sample. Note that this loss function is commonly used in conjunction with the sigmoid activation function.
Categorical Cross-Entropy: Similar to the Binary Cross-Entropy, the Categorical Cross-Entropy is mostly used for multi-class classification. It measures the difference between the predicted probabilities and the actual labels for each class in a sample. The sum of these losses across all samples is then minimized. It is used by default in Edge Impulse. Note that this loss function is commonly used in conjunction with the softmax activation function (also used by default in Edge Impulse for classification problems).
Huber Loss: A combination of MSE and MAE (Mean Absolute Error). It is less sensitive to outliers than MSE. It starts as the square of the difference between the predicted and actual values for small errors, similar to MSE. However, once the error exceeds a certain threshold, it switches to a linear relationship like MAE. This makes Huber loss more robust against outliers compared to MSE, while still maintaining its smoothness.
Log Loss: Similar to cross-entropy loss, it measures the performance of a classification model where the output is a probability value between 0 and 1.
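The sketch below implements some of the loss functions listed above in plain Python, to show how they react to the same predictions. The Huber threshold `delta=1.0` and the clipping constant `eps` are assumed illustrative values, and the sample data is made up.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for errors up to delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        error = abs(t - p)
        if error <= delta:
            total += 0.5 * error ** 2
        else:
            total += delta * (error - 0.5 * delta)
    return total / len(y_true)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of binary labels."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)      # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y_true)

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 14.0]               # the last prediction is an outlier
losses = (mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```

On this data the single outlier dominates the MSE (25.0), while the MAE (2.5) and the Huber loss (2.375) grow only linearly with the size of the error.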
Choosing the right loss function is an integral part of model design. The choice depends on the type of problem (regression, classification, etc.) and the specific requirements of your application (like sensitivity to outliers).
Just as with optimizers, once you have settled on your overall model structure and chosen an appropriate loss function, you may want to fine-tune the settings further to achieve even better performance. This can involve testing different loss functions or adjusting their parameters to see what works best for your specific task.
In Edge Impulse, by default, we use:
Mean Squared Error (MSE) for regression tasks.
Categorical Cross-Entropy for classification tasks.
You can change them in the Expert Mode (see below). Please note that the default loss functions in Edge Impulse have been selected to work well with most tasks. We would advise you to primarily focus on your dataset quality and neural network architecture to improve your model performances.
Import the necessary libraries
Define your neural network architecture
Select a loss function
Choose the loss function that suits your problem.
For instance, for a regression problem, you might choose Mean Squared Error:
For a binary classification problem, Binary Cross-Entropy might be more appropriate:
Compile and train your model with your chosen loss function
An epoch (also known as training cycle) in machine learning is a term used to describe one complete pass through the entire training dataset by the learning algorithm. During an epoch, the machine learning model is exposed to every example in the dataset once, allowing it to learn from the data and adjust its parameters (weights) accordingly. The number of epochs is a hyperparameter that determines the number of times the learning algorithm will work through the entire training dataset.
The number of epochs is an important hyperparameter for the training process of a machine learning model. Too few epochs can result in an underfitted model, where the model has not learned enough from the training data to make accurate predictions. On the other hand, too many epochs can lead to overfitting, where the model has learned too well from the training data, including the noise, making it perform poorly on new, unseen data.
Selecting the appropriate number of epochs is a balance between underfitting and overfitting.
Underfitting: One of the most straightforward indicators of underfitting is if the model performs poorly on the training data. This can be observed in Edge Impulse Studio through metrics such as accuracy, or loss, depending on the type of problem (classification or regression). If these metrics indicate poor performance, it suggests that the model has not learned the patterns of the data well. In that case, increasing the number of epochs can improve your model performance. Please note that other solutions exist such as increasing your neural network architecture complexity, changing the preprocessing technique or reducing regularization.
Overfitting: Detecting overfitting involves recognizing when the model has learned too much from the training data, including its noise and outliers, to the detriment of its performance on new, unseen data. Overfitting is characterized by the model performing exceptionally well on the training data but poorly on the validation or test data. Evaluating overfitting can be achieved by comparing the performance of the model between the training set and the validation set during training. When the performance on the validation set starts to degrade, it might indicate that the model is beginning to overfit the training data. In that case, decreasing the number of epochs can improve your model performance. As with underfitting, other solutions exist to reduce overfitting, such as increasing the amount of training data, adding regularization techniques that penalize large weights, adding dropout layers, simplifying the model architecture, and even using early stopping.
During each epoch, the dataset is typically divided into smaller batches. This approach, known as batch training, allows for more efficient and faster processing, especially with large datasets. The learning algorithm iterates through these batches, making predictions, calculating errors, and updating model parameters using an optimizer. An epoch consists of the following steps:
Initialization: Before training begins, the model's internal parameters (weights) are typically initialized randomly or according to a specific strategy.
Forward pass: For each example in the training dataset, the model makes a prediction (forward pass). This involves calculating the output of the model given its current weights and the input data.
Backward pass (backpropagation): To reduce the loss, the model needs to know how each weight contributes to it. This is done through a process called backpropagation, where the gradient of the loss function with respect to each weight is computed. The gradients indicate how the weights should be adjusted to minimize the loss.
Iteration over batches: An epoch consists of iterating over all batches in the dataset, performing the forward pass, loss calculation, backpropagation, and weight update for each batch.
Completion of an epoch: Once the model has processed all batches in the dataset, one epoch is complete. The model has now seen each example in the dataset exactly once.
When training neural networks, both epochs and batch sizes are fundamental concepts, yet they serve distinct roles in the training process. An epoch represents one complete pass through the entire training dataset, where the model has the opportunity to learn from every example within the dataset once. This means that if you set the training to run for, say, 10 epochs, the entire dataset will be passed through the neural network 10 times, allowing the model to refine its weights and biases to improve its accuracy with each pass.
On the other hand, the batch size refers to the number of training examples utilized in one iteration of the training process. Instead of passing the entire dataset through the network at once (which can be computationally expensive and memory-intensive), the dataset is divided into smaller batches. For example, if you have a dataset of 2000 examples and choose a batch size of 100, it would take 20 iterations (batches) to complete one epoch. The batch size affects how the model parameters are updated: smaller batch sizes lead to more frequent updates, potentially increasing the granularity of the learning process but also introducing more variance in the updates. Conversely, larger batch sizes provide a more stable gradient estimate, but the less frequent updates can lead to slower convergence.
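The relationship between dataset size, batch size, and epochs can be checked with a short sketch. The helper names `iterations_per_epoch` and `batches` are hypothetical, and the dataset is just a list of integers standing in for training examples.

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)

def batches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

dataset = list(range(2000))      # stand-in for 2000 training examples
epochs = 10
total_updates = 0
for _ in range(epochs):
    for batch in batches(dataset, 100):
        total_updates += 1       # one weight update per batch
```

With 2000 examples and a batch size of 100, each epoch contains 20 iterations, so 10 epochs perform 200 weight updates. When the batch size does not divide the dataset evenly (for example 64), the last batch of each epoch is simply smaller.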
You can modify the following line in the expert mode to change the number of training cycles:
When compiling and training your model, specify the number of epochs in the model.fit() function as follows:
The following approach allows your model to stop training as soon as it starts overfitting, or if further training doesn't lead to better performance, making your training process more efficient and potentially leading to better model performance.
Import EarlyStopping from tensorflow.keras.callbacks.
Instantiate an EarlyStopping callback, specifying the metric to monitor (e.g., val_loss or val_accuracy), the minimum change (min_delta) that qualifies as an improvement, the number of epochs with no improvement after which training will be stopped (patience), and whether to restore the model weights from the epoch with the best value of the monitored metric (restore_best_weights).
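The stopping rule behind these parameters can be sketched without any framework. The function below is a simplified illustration of the idea and is not the Keras implementation: it takes a list of made-up per-epoch validation losses and reports at which epoch training would stop and which epoch had the best loss. The name `early_stopping_epoch` is hypothetical, and epochs are counted from zero.

```python
def early_stopping_epoch(val_losses, min_delta=0.0, patience=3):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if best_loss - loss > min_delta:
            # Improvement large enough to count: remember it, reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(history, min_delta=0.001, patience=3)
```

Here the validation loss stops improving after epoch 3, so with a patience of 3 training stops at epoch 6. Restoring the best weights would then roll the model back to its state at epoch 3.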
Neural network architectures can be composed of multiple layers, each with specific roles and functions. These layers act as the building blocks of the network. The configuration and interaction of these layers define the capabilities of different neural network architectures, allowing them to learn from data and perform a wide array of tasks. From the initial data reception in the input layer through various transformation stages in hidden layers, and finally to the output layer where results are produced, each layer contributes to the network's overall intelligence and performance.
In Edge AI applications, these layers need to be optimized not just for accuracy, but also for computational and memory efficiency to perform well within the constraints of edge devices. Some architectures may not be suitable for constrained devices because of the computational complexity, resource availability or unsupported operators. See the Edge Impulse documentation for more details.
If you don't know where to start, try out the EON Tuner, our device-aware Auto ML tool.
Also, feel free to profile your models for edge deployments using our feature or using our .
With this page, we want to provide an overview of various neural network layers commonly used in edge machine learning.
The Input Layer serves as the initial phase of the neural network. It is responsible for receiving all the input data for the model. This layer does not perform any computation or transformation. It simply passes the features to the subsequent layers. The dimensionality of the Input Layer must match the shape of the data you're working with. For instance, in image processing tasks, the input layer's shape would correspond to the dimensions of the image, including the width, height, and color channels.
A Dense layer, often referred to as a fully connected layer, is the most basic form of a layer in neural networks. Each neuron in a dense layer receives input from all the neurons of the previous layer, hence the term "fully connected". It's a common layer that can be used to process data that has been flattened or transformed from a higher to a lower dimension.
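As a rough illustration, the forward pass of a dense layer is a weighted sum per neuron, plus a bias, followed by an activation function. The sketch below uses plain Python lists; the function name `dense`, the weight layout, and the numeric values are assumptions chosen for the example.

```python
def dense(inputs, weights, biases, activation=lambda z: z):
    """Forward pass of a fully connected layer.

    weights[j][i] is the weight connecting input i to neuron j.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        # Every neuron sees every input: this is the "fully connected" part.
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

def relu(z):
    return max(0.0, z)

outputs = dense(
    [1.0, 2.0, 3.0],
    weights=[[0.5, -1.0, 0.25], [1.0, 1.0, 1.0]],
    biases=[0.0, -1.0],
    activation=relu,
)
```

The first neuron computes 0.5 - 2.0 + 0.75 = -0.75, which ReLU clamps to 0.0; the second computes 6.0 - 1.0 = 5.0.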
The Reshape layer is used to change the shape of the input data without altering its contents. It's particularly useful when you need to prepare the dataset for certain types of layers that require the input data to be in a particular shape.
Flatten layers are used to convert multi-dimensional data into a one-dimensional array. This is typically done before feeding the data into a Dense layer.
The Dropout layer is a regularization technique that reduces the risk of overfitting in neural networks. It does so by randomly setting a fraction of the input units to zero during each update of the training phase, which helps to make the network more robust and less sensitive to the specific weights of neurons.
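A minimal sketch of this behavior is shown below. It assumes the common "inverted dropout" variant, where the surviving values are scaled up by 1 / (1 - rate) during training so that the expected sum stays the same, and where the layer does nothing at inference time. The function name `dropout` and the fixed random seed are illustrative choices.

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Randomly zero a fraction `rate` of the values during training."""
    if not training or rate == 0.0:
        return list(values)               # no-op at inference time
    keep_scale = 1.0 / (1.0 - rate)       # compensate for the dropped units
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(42)                   # fixed seed so the example is repeatable
activations = [1.0] * 1000
dropped = dropout(activations, rate=0.5, rng=rng)
unchanged = dropout(activations, rate=0.5, training=False)
```

With a rate of 0.5, roughly half of the values become 0.0 and the rest become 2.0, while the inference-time call returns the input unchanged.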
The 1D Convolution layer is specifically designed for analyzing sequential data, such as audio signals or time-series data. This type of layer applies a series of filters to the input data to extract features. These filters slide over the data to produce a feature map, capturing patterns like trends or cycles that span over a sequence of data points.
Complementing the 1D Convolution layer, the 1D Pooling layer aims to reduce the size of the feature maps, thus reducing the number of parameters and computation in the network. It works by aggregating the information within a certain window, usually by taking the maximum (Max Pooling) or the average (Average Pooling) of the values. This operation also helps to make the detection of features less sensitive to small shifts in their position in the input data.
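The following sketch shows a 1D convolution followed by max pooling on a tiny signal. It makes several simplifying assumptions: a single hand-picked filter instead of learned ones, no padding, a stride of 1 for the convolution, and non-overlapping pooling windows. The function names are hypothetical.

```python
def conv1d(signal, kernel):
    """Slide the kernel over the signal (no padding, stride 1)."""
    width = len(kernel)
    return [
        sum(signal[i + k] * kernel[k] for k in range(width))
        for i in range(len(signal) - width + 1)
    ]

def max_pool1d(feature_map, pool_size=2):
    """Keep the maximum of each non-overlapping window."""
    return [
        max(feature_map[i:i + pool_size])
        for i in range(0, len(feature_map) - pool_size + 1, pool_size)
    ]

signal = [0, 0, 1, 1, 1, 0, 0, 0]
edge_kernel = [-1, 1]                      # responds to a rising step
feature_map = conv1d(signal, edge_kernel)  # [0, 1, 0, 0, -1, 0, 0]
pooled = max_pool1d(feature_map, 2)        # [1, 0, 0]
```

The filter produces a positive response where the signal rises and a negative one where it falls. Pooling shortens the feature map while keeping the strongest response in each window.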
The 2D Convolution layer is used primarily for image data and other two-dimensional input (like spectrograms). This layer operates with filters that move across the input image's height and width to detect patterns like edges, corners, or textures. Each filter produces a 2D activation map that represents the locations and strength of detected features in the input.
The 2D Pooling layer serves a similar purpose as its 1D counterpart but in two dimensions. After the convolution layer has extracted features from the input, the pooling layer reduces the spatial dimensions of these feature maps. It summarizes the presence of features in patches of the feature map and reduces sensitivity to the exact location of features. Max Pooling and Average Pooling are common types of pooling operations used in 2D Pooling layers.
The Output Layer is the final layer in a neural network architecture, responsible for producing the results based on the learned features and representations from the previous layers. Its design is closely aligned with the specific objective of the neural network, such as classification, regression, or even more complex tasks like image segmentation or language translation.
Import the necessary libraries
Define your neural network architecture
Compile and train your model
Neural networks are a set of algorithms, modeled loosely after the human brain, designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling, or clustering of raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text, or time series, must be translated.
Please note that and are not neural networks. They are algorithms used in unsupervised machine learning, specifically for clustering tasks.
In Edge Impulse, Neural Networks can be used for supervised learning tasks such as , , either using Transfer Learning, using pre-set neural network architectures or by designing your own.
Neural networks consist of layers of interconnected nodes, also known as neurons.
Each node receives input from its predecessors, processes it, and passes its output to succeeding nodes. The processing involves weighted inputs, a bias (threshold), and an activation function that determines whether and to what extent the signal should progress further through the network.
Input Layer: Receives raw input data.
Hidden Layers: Perform computations using weighted inputs.
Output Layer: Produces the final output.
Neural networks can vary widely in architecture, adapting to different types of problems and data.
Backpropagation: A key mechanism where the network adjusts its weights starting from the output layer and moving backward through the hidden layers, minimizing error with each pass.
In Edge AI, neural networks operate under constraints of lower computational power and energy efficiency. They need to be optimized for speed and size without compromising too much on accuracy. This often involves techniques like feature extraction, neural network architectures, transfer learning, quantization, and model pruning.
Feature Extraction: Extracting meaningful features from the raw data that can be effectively processed by the neural network on resource-constrained devices.
Neural Networks Architectures: Selecting a model architecture that is designed to run efficiently on the type of processor you are targeting, and fit within memory constraints.
Transfer Learning: Using a pre-trained model and retraining it with a specific smaller dataset relevant to the edge application.
Quantization: Reducing the precision of the numbers used in the model to decrease the computational and storage burden.
Model Pruning: Reducing the size of the model by eliminating unnecessary nodes and layers.
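To give an intuition for quantization, the sketch below maps floating-point weights to 8-bit integers and back. It assumes a simplified symmetric scheme with a single scale factor for all values; deployment toolchains typically use more elaborate schemes, so treat this as an illustration only. The function names and sample weights are made up.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float values from the integers."""
    return [q * scale for q in quantized]

weights = [-1.27, -0.5, 0.0, 0.3, 1.27]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale factor per value.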
Neural networks, in the context of Edge AI, must be designed and optimized to function efficiently in resource-constrained environments, balancing the trade-off between accuracy and performance.
To learn more about Neural Networks, see the “Introduction to Neural Networks” video in our “Introduction to Embedded Machine Learning” course:
If you are not familiar with optimizers, here is a brief overview of what an optimizer is and its role in machine learning, particularly in neural networks.
An optimizer is an algorithmic entity designed to minimize a specific function called the loss function. The loss function quantitatively expresses the difference between the predicted output of the neural network and the actual target values. Simply put, an optimizer's role is to change the attributes of the neural network, such as weights and learning rate, to reduce this loss, thereby enhancing the network's accuracy.
Optimizers work through an iterative process. They start by calculating the gradient, which is the vector of partial derivatives of the loss function with respect to the weights. This gradient indicates how much the weights need to be adjusted to minimize the loss. The optimizer then updates the weights in the opposite direction of the gradient. This process is repeated over multiple iterations or epochs until the loss is minimized, and the model's predictions become as accurate as possible.
Each optimizer has its unique way of navigating the path to minimize loss. Here are a few:
Adam: Known for its adaptability, it's especially good for large datasets. It is used by default in Edge Impulse.
VeLO: VeLO represents a novel approach where the optimizer is itself a neural network that is trained on prior training jobs. See dedicated page.
Gradient Descent: Works by iteratively adjusting the values of parameters in the function until the minimum value is reached. In other words, it involves moving downhill along the steepest slope of the function towards its lowest point, hence its name "descent."
Stochastic Gradient Descent (SGD): A more dynamic cousin of Gradient Descent, updating weights more frequently for quicker learning.
RMSprop and Adagrad: These optimizers bring their own tweaks to the learning rate, making the journey smoother in specific scenarios.
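To show how two of these optimizers differ in their update rule, here is a sketch of plain gradient descent and Adam minimizing the toy function f(w) = (w - 3)². The Adam update follows the standard published formulation with its usual default decay rates; the learning rate, step count, function names, and the toy problem itself are assumptions made for this example.

```python
import math

def sgd_minimize(grad_fn, w=0.0, learning_rate=0.1, steps=1000):
    """Plain gradient descent: step against the gradient."""
    for _ in range(steps):
        w -= learning_rate * grad_fn(w)
    return w

def adam_minimize(grad_fn, w=0.0, learning_rate=0.1,
                  beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Adam: scale each step using running averages of the gradient."""
    m = 0.0   # running average of gradients
    v = 0.0   # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)      # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

def gradient(w):
    return 2 * (w - 3.0)                  # derivative of (w - 3)**2

w_sgd = sgd_minimize(gradient)
w_adam = adam_minimize(gradient)
```

Both optimizers end up close to the minimum at w = 3. The difference is that Adam normalizes each step by the recent gradient magnitude, which is what makes it less dependent on the scale of the gradients.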
Not sure which optimizer to use? Have a look at the !
Once you have settled on the overall model structure but want to achieve an even better model it can be appropriate to test another optimizer. This is classic hyperparameter fine-tuning where you try and see what works best. Any of these optimizers may achieve superior results, though getting there can sometimes require a lot of tuning. Note that each optimizer has its own parameters that you can customize.
In Edge Impulse, you can change the learning rate settings directly in the section. To change the optimizer, you can do this using the expert mode (see the section below).
When using the expert mode in Edge Impulse, you can access the full Keras API:
Import the necessary libraries:
First, make sure to import the necessary modules from Keras. You'll need the model you're working with (like Sequential) and the optimizer you want to use.
Define your neural network architecture:
Define your model architecture as you normally would. For example, using Sequential:
Select an optimizer
Choose the optimizer you wish to use. You can use one of the built-in optimizers in Keras, and even customize its parameters. For example, to use the Adam optimizer with a custom learning rate:
Alternatively, you can use other optimizers like SGD or RMSprop in a similar way:
Compile and train your model with your optimizer
In Edge Impulse, the Expert Mode allows for advanced customization, including the use of custom activation functions. Here is how you can do it:
In Edge Impulse, the Expert Mode allows for advanced customization, including the use of custom loss functions. Here is how you can do it:
Loss calculation: After making a prediction, the model calculates the loss (or error) by comparing its prediction to the actual target value using a loss function. The loss function quantifies how far the model's prediction is from the target.
Weight update: Using an algorithm (such as Gradient Descent, Adam, etc.), the model adjusts its weights based on the gradients calculated during backpropagation. The goal is to reduce the loss by making the model's predictions more accurate.
In Edge Impulse, you can specify the number of training cycles in the for your neural network-based models. Adjusting this parameter allows you to fine-tune the training process, aiming for the best possible model performance on your specific dataset. It's important to monitor both training and validation loss to determine the optimal number of epochs for your model.
When using the expert mode in Edge Impulse, you can access the full Keras API:
Find the full early stopping documentation on or have a look at as an example.
There are two options to modify the layers with Edge Impulse Studio. Either directly from the panel where you can choose from a wide range of predefined layers, or using the to access the TensorFlow/Keras APIs. See below to understand how to .
If you are an experienced ML practitioner, you can also or .
Neurons are organized into layers: input, hidden, and output layers. The complexity of the network depends on the number and size of these layers.
The power of neural networks lies in their ability to learn. Learning occurs through a process called training, where the network adjusts its weights based on the difference between its output and the desired output. This process is facilitated by an optimizer, which guides the network in adjusting its weights to minimize error (the loss).
Training: Neural networks learn by adjusting weights based on the error in predictions. This process is repeated over many epochs, using training data.