An activation function is a mathematical equation that determines the output of a neural network node, or "neuron." It adds non-linearity to the network, allowing it to learn complex patterns in the data. Without activation functions, a neural network would simply be a linear regression model, incapable of handling complex tasks like image recognition or language processing.

Types of Activation Functions in Neural Networks

Several activation functions are used in neural networks, each with its own characteristics and typical use cases. Some of the most common include:

**ReLU (Rectified Linear Unit)**: It allows only positive values to pass through, introducing non-linearity. ReLU is efficient and widely used in deep learning. It is used by default in Edge Impulse for hidden layers.

**Sigmoid**: This function maps values into a range between 0 and 1, **making it ideal for binary classification problems.**

**Tanh (Hyperbolic Tangent)**: Similar to the sigmoid but maps values between -1 and 1. It is useful in hidden layers of a neural network.

**Softmax**: Often used in the output layer of a neural network **for multi-class classification**; it turns logits into probabilities that sum to one.

**Leaky ReLU**: A variation of ReLU, it allows a small, non-zero gradient when the unit is not active.
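To make these definitions concrete, here is a minimal NumPy sketch of the functions listed above. It is illustrative only: the function names and the `alpha=0.01` slope for Leaky ReLU are assumptions chosen for this example, not Edge Impulse or Keras internals.

```python
import numpy as np

def relu(x):
    # Negative inputs become 0, positive inputs pass through unchanged.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but negative inputs keep a small slope (alpha is an assumed value).
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes any real value into the range (-1, 1).
    return np.tanh(x)

def softmax(logits):
    # Subtracting the maximum keeps the exponentials numerically stable.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

x = np.array([-2.0, 0.0, 3.0])
print("relu:      ", relu(x))
print("leaky_relu:", leaky_relu(x))
print("sigmoid:   ", sigmoid(x))
print("tanh:      ", tanh(x))
print("softmax:   ", softmax(x), "sum =", softmax(x).sum())
```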

When to Use Different Activation Functions

The choice of activation function depends on the specific task and the characteristics of the input and output data. For instance:

**ReLU** and its variants are generally preferred **in hidden layers** due to their computational efficiency.

**Sigmoid** or **Softmax** functions are often used in the **output layer** for binary and multi-class classification tasks, respectively.

Note that for regression tasks, the last layer directly outputs the predicted value `y_pred`. Thus, there is no need for an activation function in the output layer (like sigmoid or softmax).

Please note that the default activation functions in Edge Impulse have been selected to work well for your project tasks. We would advise you to primarily focus on your dataset quality and neural network architecture to improve your model performance.

Implementing Activation Functions in Expert Mode

In Edge Impulse, the Expert Mode allows for advanced customization, including the use of custom activation functions. Here is how you can do it:

Import the necessary libraries

Define your neural network architecture

When adding layers to your model, specify the activation function you want to use:

Compile and train your model:
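The four steps above can be sketched with the Keras API as follows. This is a minimal example under stated assumptions: the random `features` and `labels` arrays are hypothetical stand-ins for your project's data, and the layer sizes and epoch count are arbitrary.

```python
# Step 1: import the necessary libraries.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# Hypothetical stand-in data: 64 samples, 10 features, 3 one-hot encoded classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 10)).astype("float32")
labels = np.eye(3, dtype="float32")[rng.integers(0, 3, size=64)]

# Steps 2 and 3: define the architecture and pick an activation per layer.
model = Sequential([
    Input(shape=(10,)),
    Dense(16, activation="relu"),    # hidden layer with ReLU
    Dense(8, activation="tanh"),     # hidden layer with Tanh
    Dense(3, activation="softmax"),  # output layer for multi-class classification
])

# Step 4: compile and train the model.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(features, labels, epochs=2, batch_size=16, verbose=0)

# Each row of the softmax output is a probability distribution over the classes.
probabilities = model.predict(features[:4], verbose=0)
print(probabilities.sum(axis=1))
```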

If you are not familiar with optimizers, here is a brief overview of what an optimizer is and its role in machine learning, particularly in neural networks.

An optimizer is an algorithmic entity designed to minimize a specific function called the loss function. The loss function quantitatively expresses the difference between the predicted output of the neural network and the actual target values. Simply put, an optimizer's role is to change the attributes of the neural network, such as weights and learning rate, to reduce this loss, thereby enhancing the network's accuracy.

Optimizers work through an iterative process. They start by calculating the gradient, which is the set of partial derivatives of the loss function with respect to each weight. This gradient indicates in which direction, and by how much, the weights need to be adjusted to reduce the loss. The optimizer then updates the weights in the opposite direction of the gradient. This process is repeated over multiple iterations or epochs until the loss stops decreasing and the model's predictions become as accurate as possible.
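The iterative process described above can be illustrated with plain gradient descent on a one-feature linear model. This is a sketch for intuition only: the function name, learning rate, epoch count, and toy data are assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.1, epochs=500):
    """Fit y ~ w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        error = (w * x + b) - y
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * np.sum(error * x)
        grad_b = 2.0 / n * np.sum(error)
        # Step in the opposite direction of the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1, so the fit should approach w = 2, b = 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
w, b = gradient_descent(x, y)
print(f"w = {w:.3f}, b = {b:.3f}")
```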

Each optimizer has its unique way of navigating the path to minimize loss. Here are a few:

**Adam:** Known for its adaptability, it's especially good for large datasets. **It is used by default in Edge Impulse**.

**VeLO:** VeLO represents a novel approach where the optimizer is itself a neural network that is trained on prior training jobs. See the dedicated Learned Optimizer (VeLO) page.

**Gradient Descent:** Works by iteratively adjusting the values of parameters in the function until the minimum value is reached. In other words, it involves moving downhill along the steepest slope of the function towards its lowest point, hence its name "descent."

**Stochastic Gradient Descent (SGD):** A variant of Gradient Descent that updates the weights after each sample or mini-batch rather than after a full pass over the dataset, which usually speeds up learning.

**RMSprop and Adagrad:** These optimizers adapt the learning rate for each parameter during training, which can make training smoother in specific scenarios.

When to change the optimizer & parameters?

Not sure which optimizer to use? Have a look at the Learned Optimizer (VeLO)!

Once you have settled on the overall model structure but want to achieve an even better model, it can be appropriate to test another optimizer. This is classic hyperparameter fine-tuning, where you try different options and see what works best. Any of these optimizers may achieve superior results, though getting there can sometimes require a lot of tuning. Note that each optimizer has its own parameters that you can customize.

In Edge Impulse, you can change the **learning rate settings** directly in the Neural Network settings section. To change the optimizer, use the Expert Mode (see the section below).

Changing the Optimizer in Expert Mode

When using the Expert Mode in Edge Impulse, you can access the full Keras API:

Import the necessary libraries:

First, make sure to import the necessary modules from Keras. You'll need the model you're working with (like Sequential) and the optimizer you want to use.

Define your neural network architecture:

Define your model architecture as you normally would. For example, using Sequential:

Select an optimizer

Choose the optimizer you wish to use. You can use one of the built-in optimizers in Keras, and even customize its parameters. For example, to use the Adam optimizer with a custom learning rate:

Alternatively, you can use other optimizers like SGD or RMSprop in a similar way:

Compile and train your model with your optimizer
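As an illustration of the steps above, the following sketch compiles the same small model with three built-in Keras optimizers. The random data, layer sizes, and hyperparameter values (learning rates, momentum, epochs) are assumptions for the example and are not tuned recommendations.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam, SGD, RMSprop

# Hypothetical stand-in data: 64 samples, 10 features, 3 one-hot encoded classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 10)).astype("float32")
labels = np.eye(3, dtype="float32")[rng.integers(0, 3, size=64)]

def build_model():
    # A fresh model per optimizer so the runs do not share weights.
    return Sequential([
        Input(shape=(10,)),
        Dense(16, activation="relu"),
        Dense(3, activation="softmax"),
    ])

# Built-in optimizers with customized parameters (values are illustrative).
optimizers = {
    "adam": Adam(learning_rate=0.0005),
    "sgd": SGD(learning_rate=0.01, momentum=0.9),
    "rmsprop": RMSprop(learning_rate=0.001),
}

final_losses = {}
for name, optimizer in optimizers.items():
    model = build_model()
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(features, labels, epochs=2, batch_size=16, verbose=0)
    final_losses[name] = history.history["loss"][-1]

print(final_losses)
```

With random labels the losses themselves are not meaningful; the point is only where the optimizer object is created and passed to `model.compile`.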

Neural network architectures can be composed of multiple layers, each with specific roles and functions. These layers act as the building blocks of the network. The configuration and interaction of these layers define the capabilities of different neural network architectures, allowing them to learn from data and perform a wide array of tasks. From the initial data reception in the input layer through various transformation stages in hidden layers, and finally to the output layer where results are produced, each layer contributes to the network's overall intelligence and performance.

How can I make sure these layers will work on an edge device?

In Edge AI applications, these layers need to be optimized not just for accuracy, but also for computational and memory efficiency to perform well within the constraints of edge devices. Some architectures may not be suitable for constrained devices because of the computational complexity, resource availability or unsupported operators. See the for more details.

If you don't know where to start, try out the , our device-aware Auto ML tool.

Also, feel free to profile your models for edge deployments using our feature or using our .

With this page, we want to provide an overview of various neural network layers commonly used in edge machine learning.

Input Layer

The Input Layer serves as the initial phase of the neural network. It is responsible for receiving all the input data for the model. This layer does not perform any computation or transformation. It simply passes the features to the subsequent layers. The dimensionality of the Input Layer must match the shape of the data you're working with. For instance, in image processing tasks, the input layer's shape would correspond to the dimensions of the image, including the width, height, and color channels.

Dense Layer (or Fully Connected Layer)

A Dense layer, often referred to as a fully connected layer, is the most basic form of a layer in neural networks. Each neuron in a dense layer receives input from all the neurons of the previous layer, hence the term "fully connected". It's a common layer that can be used to process data that has been flattened or transformed from a higher to a lower dimension.

Reshape Layer

The Reshape layer is used to change the shape of the input data without altering its contents. It's particularly useful when you need to prepare the dataset for certain types of layers that require the input data to be in a particular shape.

Flatten Layer

Flatten layers are used to convert multi-dimensional data into a one-dimensional array. This is typically done before feeding the data into a Dense layer.

Dropout Layer

The Dropout layer is a regularization technique that reduces the risk of overfitting in neural networks. It does so by randomly setting a fraction of the input units to zero during each update of the training phase, which helps to make the network more robust and less sensitive to the specific weights of neurons.
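The mechanism can be sketched in a few lines of NumPy. This sketch assumes the common "inverted dropout" formulation, in which the surviving values are scaled up during training so that nothing needs to change at inference time; the function name and the rate of 0.3 are chosen for illustration.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    # At inference time (or with rate 0) the input passes through unchanged.
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    # Each unit is kept with probability keep_prob and zeroed otherwise.
    mask = rng.random(activations.shape) < keep_prob
    # Scale the survivors so the expected sum of the activations is preserved.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(10)
print("training: ", dropout(activations, rate=0.3, rng=rng))
print("inference:", dropout(activations, rate=0.3, rng=rng, training=False))
```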

1D Convolution Layer

The 1D Convolution layer is specifically designed for analyzing sequential data, such as audio signals or time-series data. This type of layer applies a series of filters to the input data to extract features. These filters slide over the data to produce a feature map, capturing patterns like trends or cycles that span over a sequence of data points.

1D Pooling Layer

Complementing the 1D Convolution layer, the 1D Pooling layer aims to reduce the length of the feature maps, thus reducing the amount of computation and the number of parameters in the layers that follow. It works by aggregating the information within a certain window, usually by taking the maximum (Max Pooling) or the average (Average Pooling) of the values. This operation also helps to make the detection of features more robust to small shifts in the input data.
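To illustrate how a 1D convolution and a 1D max pooling step fit together, here is a small NumPy sketch. The helper names, the two-tap edge-detecting kernel, and the toy signal are assumptions for this example; a real layer learns its filter values during training and applies many filters at once.

```python
import numpy as np

def conv1d(signal, kernel):
    # Slide the kernel over the signal without padding ("valid" mode).
    window = len(kernel)
    steps = len(signal) - window + 1
    return np.array([np.dot(signal[i:i + window], kernel) for i in range(steps)])

def max_pool1d(feature_map, pool_size=2):
    # Drop any trailing values that do not fill a complete window.
    usable = len(feature_map) // pool_size * pool_size
    return feature_map[:usable].reshape(-1, pool_size).max(axis=1)

signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0])
edge_kernel = np.array([-1.0, 1.0])  # responds to rising (+1) and falling (-1) edges

feature_map = conv1d(signal, edge_kernel)
pooled = max_pool1d(feature_map, pool_size=2)
print("feature map:", feature_map)
print("pooled:     ", pooled)
```

The feature map is one element shorter than the signal because the kernel spans two samples, and pooling with a window of 2 roughly halves its length again.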

2D Convolution Layer

The 2D Convolution layer is used primarily for image data and other two-dimensional input (like spectrograms). This layer operates with filters that move across the input image's height and width to detect patterns like edges, corners, or textures. Each filter produces a 2D activation map that represents the locations and strength of detected features in the input.

2D Pooling Layer

The 2D Pooling layer serves a similar purpose as its 1D counterpart but in two dimensions. After the convolution layer has extracted features from the input, the pooling layer reduces the spatial dimensions of these feature maps. It summarizes the presence of features in patches of the feature map and reduces sensitivity to the exact location of features. Max Pooling and Average Pooling are common types of pooling operations used in 2D Pooling layers.

Output layer

The Output Layer is the final layer in a neural network architecture, responsible for producing the results based on the learned features and representations from the previous layers. Its design is closely aligned with the specific objective of the neural network, such as classification, regression, or even more complex tasks like image segmentation or language translation.

Building a Model with Multiple Layers in Expert Mode

Import the necessary libraries

Define your neural network architecture

Compile and train your model
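The three steps above could look like the following sketch, which chains several of the layers described on this page. The data is random and hypothetical (32 windows of 150 flattened features, interpreted as 50 time steps of 3 axes), and the layer sizes are arbitrary assumptions rather than a recommended architecture.

```python
# Import the necessary libraries.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Input, Reshape, Conv1D, MaxPooling1D, Dropout, Flatten, Dense,
)

# Hypothetical stand-in data: 32 samples, 150 features, 4 one-hot encoded classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(32, 150)).astype("float32")
labels = np.eye(4, dtype="float32")[rng.integers(0, 4, size=32)]

# Define the neural network architecture.
model = Sequential([
    Input(shape=(150,)),                          # flat feature vector
    Reshape((50, 3)),                             # 50 time steps x 3 axes
    Conv1D(8, kernel_size=3, activation="relu"),  # output shape (48, 8)
    MaxPooling1D(pool_size=2),                    # output shape (24, 8)
    Dropout(0.25),                                # regularization, active in training only
    Flatten(),                                    # 24 * 8 = 192 values
    Dense(16, activation="relu"),
    Dense(4, activation="softmax"),               # output layer, one probability per class
])

# Compile and train the model.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(features, labels, epochs=2, batch_size=8, verbose=0)

predictions = model.predict(features[:2], verbose=0)
print(model.output_shape, predictions.shape)
```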

Machine learning model development involves several critical choices, such as the type of problem (classification or regression), the model architecture (e.g., dense layers, convolutions), and the available data. However, one often overlooked choice is the optimizer. This component is essential in the training loop, which typically involves:

Starting with a model with randomly initialized weights.

Passing labeled data through the model and comparing the output with the correct output using a "loss function".

Using an optimizer to make adjustments to the model weights based on the loss function results.

Repeating the process until the model's performance ceases to improve.

While there are various optimizers available, Adam has become a default choice for many projects due to its general effectiveness. Unlike traditional optimizers such as Adam, which are described by human-designed functions, VeLO represents a novel approach where the optimizer is itself a neural network that is trained on prior training jobs.

VeLO: A Learned Optimizer

VeLO (Versatile Learned Optimizers) is an innovative concept where the optimizer is trained using a large number of training jobs, as detailed in the paper "VeLO: Training Versatile Learned Optimizers by Scaling Up". This approach contrasts with traditional optimizers, like Adam, which are handcrafted functions.

When to use the Learned Optimizer?

The learned optimizer can help you get some extra performance for certain models. For optimal results with VeLO, **it is recommended to use as large a batch size as possible, potentially equal to the dataset's size**. This approach, however, may lead to out-of-memory issues for some projects. Here are some pros and cons of using the learned optimizer:

**Pros**

VeLO generally requires less tuning compared to Adam.

The learned optimizer works well across various scenarios without specific adjustments.

**Cons**

VeLO comprises a large LSTM model, often larger than the models it trains. This requires more computational resources, particularly for GPU-intensive models like vision models.

Studio Integration

The Learned Optimizer can be enabled in Edge Impulse as an option on the training page.

Using VeLO in Expert Mode

The simplest way to use VeLO in expert mode is to enable the flag for a project and then switch to expert mode. This will pre-fill the needed lines of code in the expert mode.

To use VeLO in expert mode for an existing project:

Remove any existing optimizer creation, `model.compile`, or `model.fit` calls.

Replace with the `train_keras_model_with_velo` method.

How does VeLO compare to Adam?

Consider the following graph which shows several runs of Adam vs VeLO:

The most influential hyperparameter of Adam is the learning rate.

If the learning rate is too low (e.g. the red graph "adam_0.0001") then the model takes too long to make progress. If the learning rate is too high (e.g. the blue graph "adam_0.05") then the optimization becomes unstable.

One of the benefits of VeLO is that it doesn't require a learning rate to do well. In this example we see VeLO (purple graph "velo") doing as well as the best Adam learning rate variant.

As a side note, VeLO was designed for training large models. In this example to get the best result, the batch size was equal to the dataset size.

Examples

The following projects contain both a learning block with and without the learned optimizer so you can easily see the differences:

Resources

A loss function, also known as a cost function, is a method to measure the performance of a machine learning model. Essentially, it calculates the difference between the model's predictions and the actual target values. The goal of training a neural network is to minimize this difference, thereby improving the model's accuracy.

The loss function quantifies how well the model is performing. A higher loss indicates greater deviation from the actual values, while a lower loss signifies that the model's predictions are closer to the target values.

What's the difference between the loss function and the optimizer?

**Loss Function:** The loss function is a mathematical expression that measures the difference or 'error' between the actual output (prediction) of a model and the desired output (label). It helps us evaluate how well our model is performing. In other words, it quantifies the cost of incorrect predictions.

**Optimizer:** An optimizer is an algorithmic entity designed to minimize the loss function. Its goal is to adjust the parameters (weights and biases) of a neural network in such a way that the loss is minimized. This is typically done through iterative processes like gradient descent or its variations. The optimizer calculates the partial derivative of the loss with respect to each parameter, which indicates the direction and magnitude of changes needed to reduce the loss.

So, while the loss function quantifies how 'wrong' our model is, the optimizer tries to minimize this error by changing the parameters of the model.

Types of Loss Functions

Each type of neural network task generally has a different loss function that is most suitable for it. Here are some of the common loss functions used:

**Mean Squared Error (MSE)**: Used primarily for regression problems. It calculates the average of the squared differences between the predicted values and the actual values. It can be used for both single-step prediction tasks and time series forecasting problems. The goal is to minimize this average error, resulting in more accurate predictions. It is **used by default in Edge Impulse** for regression tasks.

**Mean Absolute Error (MAE)**: The MAE is another regression loss function that measures the average absolute difference between the predicted and actual target values. Unlike MSE, which squares the errors, MAE uses the absolute value of the error, making it less sensitive to outliers. This makes it a good choice for problems with skewed or imbalanced data.

**Binary Cross-Entropy Loss**: Ideal for binary classification problems. It measures the difference between the predicted probabilities and the actual labels; training minimizes the sum of these losses over all samples. Note that this loss function is commonly used in conjunction with the sigmoid activation function.

**Categorical Cross-Entropy**: Similar to the Binary Cross-Entropy, the Categorical Cross-Entropy is mostly used for multi-class classification. It measures the difference between the predicted probabilities and the actual labels for each class in a sample. The sum of these losses across all samples is then minimized. It is **used by default in Edge Impulse** for classification tasks. Note that this loss function is commonly used in conjunction with the softmax activation function (also used by default in Edge Impulse for classification problems).

**Huber Loss**: A combination of MSE and MAE. For small errors it behaves like MSE, growing with the square of the difference between the predicted and actual values. Once the error exceeds a certain threshold, it switches to a linear relationship like MAE. This makes Huber loss more robust against outliers compared to MSE, while still maintaining its smoothness.

**Log Loss**: Closely related to cross-entropy loss (the two terms are often used interchangeably), it measures the performance of a classification model where the output is a probability value between 0 and 1.
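To show how these losses behave on actual numbers, here is a minimal NumPy sketch. The function names, the Huber threshold `delta=1.0`, the clipping constant `eps`, and the toy values are assumptions for illustration. The regression example contains one outlier so that the difference in outlier sensitivity between MSE, MAE, and Huber loss is visible.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                 # used when |error| <= delta
    linear = delta * (abs_error - 0.5 * delta)   # used when |error| > delta
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

def binary_cross_entropy(y_true, p, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

def categorical_cross_entropy(y_true, p, eps=1e-7):
    p = np.clip(p, eps, 1.0)        # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(p), axis=-1))

# Regression: three perfect predictions and one outlier with an error of 10.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 14.0])
print("MSE:  ", mse(y_true, y_pred))    # 25.0
print("MAE:  ", mae(y_true, y_pred))    # 2.5
print("Huber:", huber(y_true, y_pred))  # 2.375

# Classification: a maximally uncertain binary prediction and a 3-class prediction.
print("BCE:", binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5])))
print("CCE:", categorical_cross_entropy(np.array([[1.0, 0.0, 0.0]]),
                                        np.array([[0.5, 0.25, 0.25]])))
```

The single outlier dominates the MSE, while MAE and Huber loss grow only linearly with it.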

When to change the loss function?

Choosing the right loss function is an integral part of model design. The choice depends on the type of problem (regression, classification, etc.) and the specific requirements of your application (like sensitivity to outliers).

Just as with , once you have settled on your overall model structure and chosen an appropriate loss function, you may want to fine-tune the settings further to achieve even better performance. This can involve testing different loss functions or adjusting their parameters to see what works best for your specific task.

In Edge Impulse, by default, we use:

Mean Squared Error (MSE) for regression tasks.

Categorical Cross-Entropy for classification tasks.

You can change them in the **Expert Mode** (see below). Please note that the default loss functions in Edge Impulse have been selected to work well with most tasks. We would advise you to primarily focus on your dataset quality and neural network architecture to improve your model performance.

Customizing Loss Function in Expert Mode

Import the necessary libraries