If you are not familiar with optimizers, here is a brief overview of what an optimizer is and its role in machine learning, particularly in neural networks.
An optimizer is an algorithmic entity designed to minimize a specific function called the loss function. The loss function quantitatively expresses the difference between the predicted output of the neural network and the actual target values. Simply put, an optimizer's role is to change the attributes of the neural network, such as weights and learning rate, to reduce this loss., thereby enhancing the network's accuracy.
Optimizers work through an iterative process. They start by calculating the gradient, which is a partial derivative of the loss function. This gradient indicates how much the weights need to be adjusted to minimize the loss. The optimizer then updates the weights in the opposite direction of the gradient. This process is repeated over multiple iterations or epochs until the loss is minimized, and the model's predictions become as accurate as possible.
Each optimizer has its unique way of navigating the path to minimize loss. Here are a few:
Adam: Known for its adaptability, it's especially good for large datasets. It is used by default in Edge Impulse.
VeLO: VeLO represents a novel approach where the optimizer is itself a neural network that is trained on prior training jobs. See Learned Optimizer (VeLO) dedicated page.
Gradient Descent: Works by iteratively adjusting the values of parameters in the function until the minimum value is reached. In other words, it involves moving downhill along the steepest slope of the function towards its lowest point, hence its name "descent."
Stochastic Gradient Descent (SGD): A more dynamic cousin of Gradient Descent, updating weights more frequently for quicker learning.
RMSprop and Adagrad: These optimizers bring their own tweaks to the learning rate, making the journey smoother in specific scenarios.
Not sure which optimizer to use? Have a look at the Learned Optimizer (VeLO)!
Once you have settled on the overall model structure but want to achieve an even better model it can be appropriate to test another optimizer. This is classic hyperparameter fine-tuning where you try and see what works best. Any of these optimizers may achieve superior results, though getting there can sometimes require a lot of tuning. Note that each optimizer has its own parameters that you can customize.
In Edge Impulse, you can change the learning rate settings directly in the Neural Network settings section. To change the optimizer, you can do this using the expert mode (see the section below).
When using the Expert Mode in Edge Impulse, you can access the full Keras API:
Import the necessary libraries:
First, make sure to import the necessary modules from Keras. You'll need the model you're working with (like Sequential) and the optimizer you want to use.
Define your neural network architecture:
Define your model architecture as you normally would. For example, using Sequential:
Select an optimizer
Choose the optimizer you wish to use. You can use one of the built-in optimizers in Keras, and even customize its parameters. For example, to use the Adam optimizer with a custom learning rate:
Alternatively, you can use other optimizers like SGD or RMSprop in a similar way:
Compile and train your model with your optimizer
Machine learning model development involves several critical choices, such as the type of problem (classification or regression), the model architecture (e.g., dense layers, convolutions), and the available data. However, one often overlooked choice is the optimizer. This component is essential in the training loop, which typically involves:
Starting with a model with randomly initialized weights.
Passing labeled data through the model and comparing the output with the correct output using a "loss function".
Using an optimizer to make adjustments to the model weights based on the loss function results.
Repeating the process until the model's performance ceases to improve.
While there are various optimizers available, Adam [1] has become a default choice for many projects due to its general effectiveness. Unlike these traditional optimizers which are described by human-designed function, VeLO represents a novel approach where the optimizer is itself a neural network that is trained on prior training jobs.
If you are not familiar with optimizers, see this page: Optimizers
VeLO (Versatile Learned Optimizers) is an innovative concept where the optimizer is trained using a large number of training jobs, as detailed in the paper "VeLO: Training Versatile Learned Optimizers by Scaling Up" [2]. This approach contrasts with traditional optimizers, like Adam, which are handcrafted functions.
The learned optimizer can help you get some extra performance for certain models. For optimal results with VeLO, it is recommended to use as large a batch size as possible, potentially equal to the dataset's size. This approach, however, may lead to out-of-memory issues for some projects. Here are some pros and cons of using the learned optimizer:
Pros
VeLO generally requires less tuning compared to Adam.
The learned optimizer works well across various scenarios without specific adjustments.
Cons
VeLO comprises a large LSTM model, often larger than the models it trains. This requires more computational resources, particularly for GPU-intensive models like vision models.
The Learned Optimizer can be enabled in Edge Impulse as an option on the training page.
The simplest way to use VeLO in expert mode is to enable the flag for a project and then switch to expert mode. This will pre-fill the needed lines of code in the expert mode.
To use VeLO in expert mode for an existing project:
Remove any existing optimizer creation, model.compile
, or model.fit
calls.
Replace with the train_keras_model_with_velo
method.
Consider the following graph which shows several runs of Adam vs VeLO:
The most influential hyperparameter of Adam is the learning rate.
If the learning rate is too low ( e.g. the red graph "adam_0.0001" ) then the model takes too long to make progress. If the learning rate is too high (e.g. the blue graph "adam_0.05") then the optimization becomes unstable.
One of the benefits of VeLO is that it doesn't require a learning rate to do well. In this example we see VeLO (purple graph "velo") doing as well as the best Adam learning rate variant.
As a side note, VeLO was designed for training large models. In this example to get the best result, the batch size was equal to the dataset size.
The following projects contain both a learning block with and without the learned optimizer so you can easily see the differences:
Image classification using transfer learning: Microscope - VeLO
Vibration analysis: Coffee Machine Stages - Multi-label data - VeLO