Optimizers

If you are not familiar with optimizers, here is a brief overview of what an optimizer is and its role in machine learning, particularly in neural networks.

An optimizer is an algorithm designed to minimize a specific function called the loss function. The loss function quantitatively expresses the difference between the predicted output of the neural network and the actual target values. Simply put, an optimizer's role is to adjust the parameters of the neural network, chiefly its weights (and, in adaptive optimizers, the effective learning rate), to reduce this loss, thereby enhancing the network's accuracy.

Optimizers work through an iterative process. They start by calculating the gradient, the vector of partial derivatives of the loss function with respect to each weight. The gradient indicates the direction in which the loss increases fastest, so the optimizer updates the weights in the opposite direction, scaled by the learning rate. This process is repeated over multiple iterations or epochs until the loss stops decreasing and the model's predictions are as accurate as the optimizer can make them.
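
The iterative process described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two-weight loss function, the starting point, and the learning rate are hypothetical and are not taken from Edge Impulse or Keras.

```python
# Hypothetical toy loss with two weights; its minimum is at w1 = 1, w2 = -2.
def loss(w1, w2):
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 2.0) ** 2

# The gradient is the vector of partial derivatives of the loss.
def gradient(w1, w2):
    return 2.0 * (w1 - 1.0), 4.0 * (w2 + 2.0)

w1, w2 = 0.0, 0.0          # starting weights (arbitrary)
learning_rate = 0.1        # step size (assumed value)
loss_history = [loss(w1, w2)]

for _ in range(100):
    g1, g2 = gradient(w1, w2)
    # Move each weight in the direction opposite to its partial derivative.
    w1 -= learning_rate * g1
    w2 -= learning_rate * g2
    loss_history.append(loss(w1, w2))

print(f"w1={w1:.4f}, w2={w2:.4f}, final loss={loss_history[-1]:.2e}")
```

Each pass through the loop is one iteration: compute the gradient, then step against it. The recorded loss values shrink as the weights approach the minimum.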

Each optimizer has its unique way of navigating the path to minimize loss. Here are a few:

Adam: Known for its adaptability, it's especially good for large datasets. It is used by default in Edge Impulse.
VeLO: A novel approach where the optimizer is itself a neural network that is trained on prior training jobs. See the dedicated Learned Optimizer (VeLO) page.
Gradient Descent: Works by iteratively adjusting the values of parameters in the function until the minimum value is reached. In other words, it involves moving downhill along the steepest slope of the function towards its lowest point, hence its name "descent."
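Stochastic Gradient Descent (SGD): A variant of Gradient Descent that estimates the gradient from a single sample or a small batch rather than the whole dataset, so it updates the weights more frequently for quicker learning.
RMSprop and Adagrad: These optimizers adapt the learning rate for each parameter based on the history of past gradients, which can make training smoother in specific scenarios.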
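
To make the differences between these optimizers concrete, here is a minimal sketch of their textbook update rules applied to a single weight. It is not the Keras or Edge Impulse implementation: the toy loss (w - 3)^2, the hyperparameter values, and all function names are assumptions chosen for illustration.

```python
import math

def grad(w):
    # Gradient of the hypothetical toy loss (w - 3)^2, minimized at w = 3.
    return 2.0 * (w - 3.0)

def sgd_momentum(steps, lr=0.01, momentum=0.9):
    w, velocity = 0.0, 0.0
    for _ in range(steps):
        # Velocity accumulates past gradients to smooth the descent.
        velocity = momentum * velocity - lr * grad(w)
        w += velocity
    return w

def adagrad(steps, lr=0.5, eps=1e-7):
    w, accum = 0.0, 0.0
    for _ in range(steps):
        g = grad(w)
        accum += g * g  # running sum of squared gradients
        w -= lr * g / (math.sqrt(accum) + eps)
    return w

def rmsprop(steps, lr=0.05, rho=0.9, eps=1e-7):
    w, avg_sq = 0.0, 0.0
    for _ in range(steps):
        g = grad(w)
        # Exponential moving average of squared gradients.
        avg_sq = rho * avg_sq + (1.0 - rho) * g * g
        w -= lr * g / (math.sqrt(avg_sq) + eps)
    return w

def adam(steps, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-7):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g        # first moment (mean)
        v = beta2 * v + (1.0 - beta2) * g * g    # second moment
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

results = {
    "sgd_momentum": sgd_momentum(500),
    "adagrad": adagrad(500),
    "rmsprop": rmsprop(500),
    "adam": adam(500),
}
for name, w in results.items():
    print(f"{name}: w = {w:.4f}")
```

All four start from w = 0 and head toward the same minimum. They differ only in how each gradient is turned into a step, which is what the per-optimizer parameters control.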

When to change the optimizer & parameters?

Not sure which optimizer to use? Have a look at the Learned Optimizer (VeLO)!

Once you have settled on the overall model structure but want to achieve an even better model, it can be appropriate to test another optimizer. This is classic hyperparameter fine-tuning, where you try different options and see what works best. Any of these optimizers may achieve superior results, though getting there can sometimes require a lot of tuning. Note that each optimizer has its own parameters that you can customize.

In Edge Impulse, you can change the learning rate directly in the Neural Network settings section. To change the optimizer itself, use the expert mode (see the section below).

Changing the Optimizer in Expert Mode

When using the Expert Mode in Edge Impulse, you can access the full Keras API:

Import the necessary libraries:

First, make sure to import the necessary modules from Keras. You'll need the model you're working with (like Sequential) and the optimizer you want to use.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam, SGD, RMSprop  # Import optimizers

Define your neural network architecture:

Define your model architecture as you normally would. For example, using Sequential:

model = Sequential()
# Add model layers like Dense, Conv2D, etc.

Select an optimizer:

Choose the optimizer you wish to use. You can use one of the built-in optimizers in Keras, and even customize its parameters. For example, to use the Adam optimizer with a custom learning rate:

optimizer = Adam(learning_rate=0.001)

Alternatively, you can use other optimizers like SGD or RMSprop in a similar way:

# Option 1: SGD with momentum
optimizer = SGD(learning_rate=0.01, momentum=0.9)
# Option 2: RMSprop (keep only the assignment you want to use)
optimizer = RMSprop(learning_rate=0.001, rho=0.9)

Compile and train your model with your optimizer:

model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=10)

Learned Optimizer (VeLO)

Machine learning model development involves several critical choices, such as the type of problem (classification or regression), the model architecture (e.g., dense layers, convolutions), and the available data. However, one often overlooked choice is the optimizer. This component is essential in the training loop, which typically involves:

Starting with a model with randomly initialized weights.
Passing labeled data through the model and comparing the output with the correct output using a "loss function".
Using an optimizer to make adjustments to the model weights based on the loss function results.
Repeating the process until the model's performance ceases to improve.
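
The four steps above can be sketched as a small training loop in plain Python. The toy dataset, the one-weight linear model, the learning rate, and the patience-based stopping rule are all assumptions made for illustration; they are not how Edge Impulse or Keras implement training.

```python
import random

random.seed(0)

# Hypothetical labeled data: inputs x with targets y = 2x + 1.
data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]

# Step 1: start with randomly initialized weights.
weight, bias = random.uniform(-1, 1), random.uniform(-1, 1)

def mean_squared_error(weight, bias):
    # Step 2: compare the model output with the correct output.
    return sum((weight * x + bias - y) ** 2 for x, y in data) / len(data)

learning_rate = 0.1
patience, min_delta = 5, 1e-9   # assumed stopping criteria
best_loss, stale_epochs, epochs_run = float("inf"), 0, 0

for epoch in range(1000):
    # Gradients of the mean squared error with respect to each weight.
    grad_w = sum(2.0 * (weight * x + bias - y) * x for x, y in data) / len(data)
    grad_b = sum(2.0 * (weight * x + bias - y) for x, y in data) / len(data)

    # Step 3: the optimizer (plain gradient descent here) adjusts the weights.
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

    # Step 4: repeat until the loss stops improving.
    current_loss = mean_squared_error(weight, bias)
    epochs_run = epoch + 1
    if best_loss - current_loss > min_delta:
        best_loss, stale_epochs = current_loss, 0
    else:
        stale_epochs += 1
        if stale_epochs >= patience:
            break

print(f"stopped after {epochs_run} epochs: weight={weight:.3f}, bias={bias:.3f}")
```

Swapping the optimizer only changes the two update lines in step 3, which is why it can be replaced without touching the model or the data.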

While there are various optimizers available, Adam [1] has become a default choice for many projects due to its general effectiveness. Unlike these traditional optimizers, which are described by human-designed functions, VeLO represents a novel approach where the optimizer is itself a neural network that is trained on prior training jobs.

What is an optimizer?

If you are not familiar with optimizers, see this page: Optimizers

VeLO: A Learned Optimizer

VeLO (Versatile Learned Optimizers) is an innovative concept where the optimizer is trained using a large number of training jobs, as detailed in the paper "VeLO: Training Versatile Learned Optimizers by Scaling Up" [2]. This approach contrasts with traditional optimizers, like Adam, which are handcrafted functions.

When to use the Learned Optimizer?

The learned optimizer can help you get some extra performance for certain models. For optimal results with VeLO, it is recommended to use as large a batch size as possible, potentially equal to the dataset's size. This approach, however, may lead to out-of-memory issues for some projects. Here are some pros and cons of using the learned optimizer:

Pros

VeLO generally requires less tuning compared to Adam.
The learned optimizer works well across various scenarios without specific adjustments.

Cons

VeLO comprises a large LSTM model, often larger than the models it trains. This requires more computational resources, particularly for GPU-intensive models like vision models.

Studio Integration

The Learned Optimizer can be enabled in Edge Impulse as an option on the training page.

Using VeLO in Expert Mode

The simplest way to use VeLO in expert mode is to enable the flag for a project and then switch to expert mode. This will pre-fill the needed lines of code in the expert mode.

To use VeLO in expert mode for an existing project:

Remove any existing optimizer creation, model.compile, or model.fit calls.
Replace them with a call to the train_keras_model_with_velo method.

from ei_tensorflow.velo import train_keras_model_with_velo

history = train_keras_model_with_velo(
    keras_model=model,
    training_data=train_dataset,
    validation_data=validation_dataset,
    loss_fn=tf.keras.metrics.categorical_crossentropy,  # depending on your model
    num_epochs=num_epochs,
    callbacks=callbacks
)
print("history", history)

How does VeLO compare to Adam?

Consider the following graph which shows several runs of Adam vs VeLO:

The most influential hyperparameter of Adam is the learning rate.

If the learning rate is too low (e.g. the red graph "adam_0.0001"), the model takes too long to make progress. If the learning rate is too high (e.g. the blue graph "adam_0.05"), the optimization becomes unstable.
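
The same effect can be reproduced on a toy problem. The sketch below uses plain gradient descent rather than Adam, on a hypothetical loss (w - 3)^2, with learning rates chosen only to show the three regimes. It does not reproduce the runs in the graph.

```python
def final_loss(learning_rate, steps=50, start=0.0):
    """Run plain gradient descent on the toy loss (w - 3)^2."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)
    return (w - 3.0) ** 2

initial_loss = (0.0 - 3.0) ** 2

# Too low, reasonable, and too high (assumed values for this toy loss).
losses = {lr: final_loss(lr) for lr in (0.001, 0.1, 1.1)}

for lr, value in losses.items():
    print(f"learning_rate={lr}: loss after 50 steps = {value:.3e}")
```

With the lowest rate the loss has barely moved after 50 steps, with the middle rate it is close to zero, and with the highest rate every step overshoots the minimum so the loss grows instead of shrinking.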

One of the benefits of VeLO is that it doesn't require a learning rate to be tuned to do well. In this example, we see VeLO (purple graph "velo") doing as well as the best Adam learning rate variant.

As a side note, VeLO was designed for training large models. In this example, the best result was obtained with a batch size equal to the dataset size.

Examples

The following projects each contain two learning blocks, one with and one without the learned optimizer, so you can easily see the differences:

Image classification using transfer learning: Microscope - VeLO
Vibration analysis: Coffee Machine Stages - Multi-label data - VeLO

Resources