In this tutorial, you'll use machine learning to build a system that can recognize objects in your house through a camera connected to your Sony Spresense - a task known as image classification. Adding sight to your embedded devices lets them tell poachers from elephants, do quality control on factory lines, or let your RC cars drive themselves. You'll learn how to collect images for a well-balanced dataset, how to apply transfer learning to train a neural network, and how to deploy the system to your Spresense.
At the end of this tutorial, you'll have a firm understanding of how to classify images using Edge Impulse.
You can view the finished project, including all data, signal processing and machine learning blocks here: Tutorial: adding sight to your sensors.
For this tutorial you'll need a Sony Spresense with the camera add-on.
If you don't have a device yet, you can also upload an existing dataset through the Uploader. After this tutorial you can then deploy your trained machine learning model as a C++ library and run it on your device by following the Running your impulse locally tutorial.
You may also want to create a new Edge Impulse project for this tutorial, or use an existing one. You can create a new project on your project dashboard.
In this tutorial we'll build a model that can distinguish between two objects in your house - we've used a plant and a lamp, but feel free to pick two other objects. To make your machine learning model see, it's important that you capture a lot of example images of these objects. During training, these example images teach the model to distinguish between them. Because there are (hopefully) a lot more objects in your house than just lamps or plants, you also need to capture images that are neither a lamp nor a plant to make the model work well.
Capture the following amount of data - make sure you capture a wide variety of angles and zoom levels:
50 images of a lamp.
50 images of a plant.
50 images of neither a plant nor a lamp - make sure to capture a wide variation of random objects in the same room as your lamp or plant.
You can collect data from your Spresense using the Edge Impulse CLI. Make sure you have followed the Getting Started guide for the Spresense, then run the Edge Impulse daemon.
Once connected, follow the guide on Collecting Image Data from Studio to build your dataset. Alternatively, you can capture your images using another camera, and then upload them by going to Data acquisition and clicking the 'Upload' icon.
Afterwards you should have a well-balanced dataset listed under Data acquisition in your Edge Impulse project. You can switch between your training and testing data with the two buttons above the 'Data collected' widget.
With the training set in place you can design an impulse. An impulse takes the raw data, adjusts the image size, uses a preprocessing block to manipulate the image, and then uses a learning block to classify new data. Preprocessing blocks always return the same values for the same input (e.g. convert a color image into a grayscale one), while learning blocks learn from past experiences.
For this tutorial we'll use the 'Images' preprocessing block. This block takes in the color image, optionally makes the image grayscale, and then turns the data into a features array. If you want to do more interesting preprocessing steps - like finding faces in a photo before feeding the image into the network -, see the Building custom processing blocks tutorial. Then we'll use a 'Transfer Learning' learning block, which takes all the images in and learns to distinguish between the three ('plant', 'lamp', 'unknown') classes.
In the studio go to Create impulse, set the image width and image height to 96, and add the 'Images' and 'Transfer Learning (Images)' blocks. Then click Save impulse.
To configure your processing block, click Images in the menu on the left. This will show you the raw data on top of the screen (you can select other files via the drop down menu), and the results of the processing step on the right. You can use the options to switch between 'RGB' and 'Grayscale' mode, but for now leave the color depth on 'RGB' and click Save parameters.
This will send you to the 'Feature generation' screen. In here you'll:
Resize all the data.
Apply the processing block on all this data.
Create a 3D visualization of your complete dataset.
Click Generate features to start the process.
Afterwards the 'Feature explorer' will load. This is a plot of all the data in your dataset. Because images have a lot of dimensions (here: 96x96x3=27,648 features) we run a process called 'dimensionality reduction' on the dataset before visualizing this. Here the 27,648 features are compressed down to just 3, and then clustered based on similarity. Even though we have little data you can already see some clusters forming (lamp images are all on the right), and can click on the dots to see which image belongs to which dot.
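The feature count quoted above follows directly from flattening the resized image. The sketch below is a minimal illustration, not the Studio's actual encoding: the function name `image_to_features`, the scaling to the range 0..1, and the BT.601 grayscale weights are all assumptions made for the example.

```python
def image_to_features(pixels, grayscale=False):
    """Flatten a height x width grid of (r, g, b) tuples into a 1-D feature list.

    Hypothetical helper for illustration; the real 'Images' block may encode
    pixels differently.
    """
    features = []
    for row in pixels:
        for r, g, b in row:
            if grayscale:
                # ITU-R BT.601 luma weights, a common grayscale conversion (assumed)
                features.append((0.299 * r + 0.587 * g + 0.114 * b) / 255.0)
            else:
                # One feature per color channel
                features.extend((r / 255.0, g / 255.0, b / 255.0))
    return features


WIDTH = HEIGHT = 96
# A dummy single-color image stands in for a real photo
image = [[(128, 64, 32)] * WIDTH for _ in range(HEIGHT)]

rgb_features = image_to_features(image)
gray_features = image_to_features(image, grayscale=True)
print(len(rgb_features))   # 27648
print(len(gray_features))  # 9216
```

Switching to grayscale cuts the number of features to a third, which is one reason the color depth setting affects model size.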
With all data processed it's time to start training a neural network. Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. The network that we're training here will take the image data as an input, and try to map this to one of the three classes.
It's very hard to build a good working computer vision model from scratch, as you need a wide variety of input data to make the model generalize well, and training such models can take days on a GPU. To make this easier and faster we are using transfer learning. This lets you piggyback on a well-trained model, only retraining the upper layers of a neural network, leading to much more reliable models that train in a fraction of the time and work with substantially smaller datasets.
To configure the transfer learning model, click Transfer learning in the menu on the left. Here you can select the base model (the one selected by default will work, but you can change this based on your size requirements), optionally enable data augmentation (images are randomly manipulated to make the model perform better in the real world), and the rate at which the network learns.
Set:
Transfer learning model to MobileNetV1 96x96 0.25.
Number of training cycles to 20.
Learning rate to 0.0005.
Data augmentation: enabled.
Minimum confidence rating: 0.7.
Then click Start training. When training is done you'll see accuracy numbers, a confusion matrix, and predicted on-device performance at the bottom. You have now trained your model!
Important: The full example firmware for the Sony Spresense supports a variety of sensor interfaces, data ingestion, and other capabilities. As a result, this firmware supports image transfer learning networks up to MobileNetV1 96x96 0.25. Custom firmware may allow larger models to run on the device. To learn more, see Running your impulse locally to deploy trained Edge Impulse models with your firmware, and check out the EON Tuner documentation to learn how to optimize your model's performance and memory usage.
With the model trained let's try it out on some test data. When collecting the data we split the data up between a training and a testing dataset. The model was trained only on the training data, and thus we can use the data in the testing dataset to validate how well the model will work in the real world. This will help us ensure the model has not learned to overfit the training data, which is a common occurrence.
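A train/test split like the one described above can be sketched as follows. This is only an illustration: the helper name `train_test_split`, the 20% test fraction, the fixed seed, and the file names are assumptions, and the Studio may split differently (for example per label).

```python
import random


def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle the samples and hold out a fraction of them for testing."""
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for the collected images
all_samples = [f"image_{i}.jpg" for i in range(150)]
train, test = train_test_split(all_samples)
print(len(train), len(test))  # 120 30
```

Because the held-out samples never reach the training step, accuracy measured on them is a fairer estimate of real-world behavior.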
To validate your model, go to Model testing, select the checkbox next to 'Sample name' and click Classify selected. Here we hit 89% accuracy, which is great for a model with so little data.
To see a classification in detail, click the three dots next to an item, and select Show classification. This brings you to the Live classification screen with much more details on the file (if you collected data with your mobile phone you can also capture new testing data directly from here). This screen can help you determine why items were misclassified.
With the impulse designed, trained and verified you can deploy this model back to your device. This makes the model run without an internet connection, minimizes latency, and runs with minimum power consumption. Edge Impulse can package up the complete impulse - including the preprocessing steps, neural network weights, and classification code - in a single C++ library that you can include in your embedded software.
To run your impulse click on Deployment in the menu. Then under 'Build firmware' select the Spresense development board, and click Build. This will export the impulse, and build a binary that will run on your development board in a single step. After building is completed you'll get prompted to download a binary. Save this on your computer.
When you click the Build button, you'll see a pop-up with text and video instructions on how to deploy the binary to your particular device. Follow these instructions. Once you are done, we are ready to test your impulse out.
If you run into issues flashing your device, follow the steps and troubleshooting information in the Getting Started guide.
We can connect to the board's newly flashed firmware over serial. Open a terminal and run:
Congratulations! You've added sight to your sensors. Now that you've trained your model you can integrate your impulse in the firmware of your own embedded device; see Running your impulse locally. There are examples for Mbed OS, Arduino, STM32CubeIDE, Zephyr, and any other target that supports a C++ compiler.
Or if you're interested in more, see our tutorials on Continuous motion recognition or Adding sight to your sensors. If you have a great idea for a different project, that's fine too. Edge Impulse lets you capture data from any sensor, build custom processing blocks to extract features, and you have full flexibility in your Machine Learning pipeline with the learning blocks.
We can't wait to see what you'll build! 🚀
This tutorial is for the Syntiant hardware only. For other development boards, you can follow the standard Continuous Motion Recognition tutorial.
In this tutorial, you'll use machine learning to build a gesture recognition system that runs on the Syntiant TinyML board. This is a hard task to solve using rule-based programming, as people don't perform gestures in exactly the same way every time. But machine learning can handle these variations with ease. You'll learn how to collect high-frequency data from an IMU, build a neural network classifier, and deploy your model back to a device. At the end of this tutorial, you'll have a firm understanding of applying machine learning on the Syntiant TinyML board using Edge Impulse.
Before starting the tutorial
After signing up for a free Edge Impulse account, clone the finished project, including all training data, signal processing and machine learning blocks here: Syntiant - Circular Motion. At the end of the process you will have the full project that comes pre-loaded with training and test datasets.
For this tutorial you'll need:
An SD Card to perform IMU data acquisition
Follow the steps to connect your development board to Edge Impulse.
If your device is connected under Devices in the studio you can proceed:
Device compatibility
Edge Impulse can ingest data from any device - including embedded devices that you already have in production. See the documentation for the Ingestion service for more information.
With your device connected, we can collect some data. In the studio go to the Data acquisition tab. This is the place where all your raw data is stored, and - if your device is connected to the remote management API - where you can start sampling new data.
Under Record new data, select your Syntiant device, set the label to circular, the sample length to 2000, the sensor to Inertial and the frequency to 100 Hz. This indicates that you want to record data for 2 seconds, and label the recorded data as circular. You can later edit these labels if needed.
After you click Start sampling, move your device in a circular motion. In about twelve seconds the device should complete sampling and upload the file back to Edge Impulse. You'll see a new line appear under 'Collected data' in the studio. When you click it you'll see the raw data graphed out. As the accelerometer on the development board has three axes you'll notice three different lines, one for each axis.
Continuous movement
It's important to do continuous movements as we'll later slice up the data in smaller windows.
Machine learning works best with lots of data, so a single sample won't cut it. Now is the time to start building your own dataset. For example, use the following two classes, and record around 3 minutes of data per class:
Circular - circular movements
Z_Openset - random movements that are not circular
Variations
Make sure to perform variations on the motions. For example, do both slow and fast movements and vary the orientation of the board. You never know how your user will use the device.
With the training set in place you can design an impulse. An impulse takes the raw data, slices it up in smaller windows, uses signal processing blocks to extract features, and then uses a learning block to classify new data. Signal processing blocks always return the same values for the same input and are used to make raw data easier to process, while learning blocks learn from past experiences.
For this tutorial we'll use the 'IMU Syntiant' signal processing block. This block rescales raw data to 8-bit values to match the NDP chip input requirements. Then we'll use a 'Neural Network' learning block, which takes these generated features and learns to distinguish between our different classes (circular or not).
In the studio go to Create impulse, set the window size to 1800 (you can click on the 1800 ms. text to enter an exact value), the window increase to 80, and add the 'IMU Syntiant' and 'Classification (Keras)' blocks. Then click Save impulse.
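To make the window settings concrete, here is a minimal sketch of how a recording could be sliced with a window size of 1800 ms and a window increase of 80 ms. The helper `window_starts` is hypothetical, and it assumes only full windows are kept; the Studio may handle the tail of a sample differently.

```python
def window_starts(sample_length_ms, window_size_ms, window_increase_ms):
    """Return the start offset (in ms) of every full window that fits in a sample."""
    if sample_length_ms < window_size_ms:
        return []  # the sample is too short for even one window
    last_start = sample_length_ms - window_size_ms
    return list(range(0, last_start + 1, window_increase_ms))


# A 2000 ms recording with the settings from this tutorial
starts = window_starts(2000, 1800, 80)
print(starts)  # [0, 80, 160]
```

Under these assumptions each 2-second recording yields three overlapping windows, which is why continuous movement during recording matters.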
Window size
The Syntiant NDP101 chip requires the number of generated features to be divisible by 4. In our example, we have 6 axes sampled at 100 Hz with a window of 1800 ms, leading to 1080 (180x6) features, which is divisible by 4.
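The arithmetic behind this constraint can be checked with a few lines of Python before you pick a window size. The function name `features_per_window` is an assumption made for this sketch.

```python
def features_per_window(window_size_ms, frequency_hz, axes):
    """Number of raw values in one window: samples per window times axes."""
    samples = window_size_ms * frequency_hz // 1000
    return samples * axes


n = features_per_window(1800, 100, 6)
print(n, n % 4 == 0)  # 1080 True

# A window size that would violate the divisible-by-4 requirement
m = features_per_window(1850, 100, 6)
print(m, m % 4 == 0)  # 1110 False
```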
To configure your signal processing block, click Syntiant IMU in the menu on the left. This will show you the raw data on top of the screen (you can select other files via the drop down menu), and the processed features on the right.
The Scale 16 bits to 8 bits option converts your raw data to 8 bits and normalizes it to the range [-1, 1]. The circular motion public project's dataset is already rescaled, so you need to disable the option in this case.
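As a rough idea of what such a rescaling step involves, the sketch below drops the low byte of a signed 16-bit reading and then divides by 128. This is an assumption about one plausible approach, not the block's actual implementation, and both function names are hypothetical.

```python
def scale_16_to_8_bits(value):
    """Map a signed 16-bit reading to a signed 8-bit one by dropping the low byte."""
    if not -32768 <= value <= 32767:
        raise ValueError("expected a signed 16-bit value")
    # Arithmetic right shift keeps the sign; the result is in [-128, 127]
    return value >> 8


def normalize(value_8bit):
    """Scale a signed 8-bit value into the range [-1, 1)."""
    return value_8bit / 128.0


print(scale_16_to_8_bits(32767))              # 127
print(normalize(scale_16_to_8_bits(-32768)))  # -1.0
```

Applying this twice to data that is already rescaled would squash every value toward zero, which is why the option should be disabled for a dataset that has been rescaled once.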
Click Save parameters. This will send you to the 'Feature generation' screen.
Click Generate features to start the process.
Afterwards the 'Feature explorer' will load. This is a plot of all the extracted features against all the generated windows. You can use this graph to compare your complete data set. A good rule of thumb is that if you can visually separate the data on a number of axes, then the machine learning model will be able to do so as well.
With all data processed it's time to start training a neural network. Neural networks are algorithms, modeled loosely after the human brain, that can learn to recognize patterns that appear in their training data. The network that we're training here will take the processing block features as an input, and try to map this to one of the two classes — 'circular' or 'z_openset'.
Click on NN Classifier in the left hand menu. You'll see the following page:
With everything in place, click Start training. You'll see a lot of text flying past in the Training output panel, which you can ignore for now. Training will take a few minutes. When it's complete, you'll see the Last training performance panel appear at the bottom of the page:
Congratulations, you've trained a neural network with Edge Impulse and are ready to deploy it on the Syntiant TinyML Board! But what do all these numbers mean?
At the start of training, 20% of the training data is set aside for validation. This means that instead of being used to train the model, it is used to evaluate how the model is performing. The Last training performance panel displays the results of this validation, providing some vital information about your model and how well it is working. Bear in mind that your exact numbers may differ from the ones in this tutorial.
On the left hand side of the panel, Accuracy refers to the percentage of windows of motion data that were correctly classified. The higher the number the better, although an accuracy approaching 100% is unlikely, and is often a sign that your model has overfit the training data. You will find out whether this is true in the next stage, during model testing. For many applications, an accuracy above 85% can be considered very good.
The Confusion matrix is a table showing the balance of correctly versus incorrectly classified windows. To understand it, compare the values in each row. For example, in the above screenshot, 100% of the circular motion samples were classified correctly, and 99.6% for the openset samples.
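The row-wise reading of a confusion matrix can be reproduced in a few lines. The counts below are hypothetical, chosen only so that the percentages match the ones quoted above; the helper names are also assumptions.

```python
def row_percentages(matrix):
    """Convert each row of counts (actual label -> predicted label) to percentages."""
    result = {}
    for actual, row in matrix.items():
        total = sum(row.values())
        result[actual] = {pred: round(100.0 * n / total, 1) for pred, n in row.items()}
    return result


def accuracy(matrix):
    """Fraction of all windows whose predicted label equals the actual label."""
    correct = sum(row.get(label, 0) for label, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


# Hypothetical validation counts: rows are actual labels, columns are predictions
counts = {
    "circular": {"circular": 250, "z_openset": 0},
    "z_openset": {"circular": 1, "z_openset": 249},
}

percentages = row_percentages(counts)
print(percentages["circular"]["circular"])    # 100.0
print(percentages["z_openset"]["z_openset"])  # 99.6
print(accuracy(counts))                       # 0.998
```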
From the statistics in the previous step we know that the model works against our training data, but how well would the network perform on new data? Click on Live classification in the menu to find out. Your device should (just like in step 2) show as online under 'Classify new data'. Set the 'Sample length' to 2000 (2 seconds), click Start sampling and start doing movements. Afterward, you'll get a full report on what the network thought that you did.
If the network performed great, fantastic! But what if it performed poorly? There could be a variety of reasons, but the most common ones are:
There is not enough data. Neural networks need to learn patterns in data sets, and the more data the better.
The data does not look like other data the network has seen before. This is common when someone uses the device in a way that you didn't add to the training set. You can add the current file to the training set by clicking ⋮, then selecting Move to training set. Make sure to update the label under 'Data acquisition' before training.
The model has not been trained enough. Up the number of epochs to 50 and see if performance increases (the classified file is stored, and you can load it through 'Classify existing validation sample').
The model is overfitting and thus performs poorly on new data. Try reducing the learning rate or adding more data.
The neural network architecture is not a great fit for your data. Play with the number of layers and neurons and see if performance improves.
As you see there is still a lot of trial and error when building neural networks, but we hope the visualizations help a lot. You can also run the network against the complete validation set through 'Model validation'. Think of the model validation page as a set of unit tests for your model!
With a working model in place, we can look at places where our current impulse performs poorly.
With the impulse designed, trained and verified you can deploy this model back to your device. This makes the model run without an internet connection, minimizes latency, and runs with minimum power consumption.
To export your model, click on Deployment in the menu. Then under 'Build firmware' select the Syntiant development board.
The final step before building the firmware is to configure the posterior handler parameters of the Syntiant chip.
Pre-configured posterior parameters
For the Syntiant Circular Motion project, we've already pre-configured the posterior parameters so you can just go to the 'Build' output step.
Those parameters are used to tune the precision and recall of the neural network activations, to minimize the False Rejection Rate (FRR) and the False Activation Rate (FAR). You can manually edit those parameters in JSON format or use Find posterior parameters to search for the best values:
Select the classes you want to detect (the z_openset class should be omitted except for testing purposes).
Select a calibration method: either no calibration (fastest), or FAR optimized (FAR is optimized for an FRR target < 0.2).
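To show what trading off the two error rates means, here is a simplified sketch that scans a single confidence threshold and keeps the one with the lowest false activation rate while the false rejection rate stays under 0.2. It is not the search the Studio or the Syntiant tooling performs; the function names, the single-threshold model, and the example scores are all assumptions.

```python
def far_frr(scores, labels, threshold):
    """Compute (false activation rate, false rejection rate) at one threshold.

    scores: model confidence for the target class, one per window.
    labels: True where the target class is really present.
    """
    false_activations = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    false_rejections = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    negatives = sum(1 for y in labels if not y)
    positives = sum(1 for y in labels if y)
    far = false_activations / negatives if negatives else 0.0
    frr = false_rejections / positives if positives else 0.0
    return far, frr


def pick_threshold(scores, labels, frr_target=0.2):
    """Return (threshold, far, frr) with the lowest FAR among thresholds with FRR < target."""
    best = None
    for threshold in [i / 100 for i in range(1, 100)]:
        far, frr = far_frr(scores, labels, threshold)
        if frr < frr_target and (best is None or far < best[1]):
            best = (threshold, far, frr)
    return best


# Hypothetical confidences: five real 'circular' windows, five other windows
scores = [0.95, 0.9, 0.85, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
labels = [True, True, True, True, True, False, False, False, False, False]

print(far_frr(scores, labels, 0.5))  # (0.2, 0.2)
best = pick_threshold(scores, labels)
```

Raising the threshold lowers false activations but rejects more genuine gestures, which is the tension the calibration step tries to balance.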
Once optimized parameters have been found, you can click Build. This will build a Syntiant package that will run on your development board. After building is completed you'll get prompted to download a zipfile. Save this on your computer. A pop-up video will show how the download process works.
After unzipping the downloaded file, run the appropriate flashing script for your platform (Linux, Mac, or Win 10) to flash the Syntiant TinyML Board with the Syntiant Circular Motion model and associated firmware. You might see a Microsoft Defender screen pop up when the script is run on Windows 10. It's safe to proceed so select 'More info' and continue.
We can connect to the board's newly flashed firmware over serial. Open a terminal and run:
Serial daemon
If the device is connected via the Edge Impulse serial daemon, you'll need to stop the daemon first. Only one application can connect to the development board at a time.
This will sample data from the sensor, run the signal processing code, and then classify the data:
Victory! You've now built your first on-device machine learning model.
Congratulations! Now that you've trained your model you can integrate your impulse in the firmware of your TinyML board; see Running your impulse locally on Syntiant TinyML Board.
Or if you're interested in more, see our tutorial on Responding to your voice - Syntiant - RC Commands.
We can't wait to see what you'll build! 🚀
In this tutorial, you'll use machine learning to build a system that can recognize audible events, particularly your voice through audio classification. The system you create will work similarly to "Hey Siri" or "OK, Google" and is able to recognize keywords or other audible events, even in the presence of other background noise or background chatter.
You'll learn how to collect audio data from microphones, use signal processing to extract the most important information, and train a deep neural network that can tell you whether your keyword was heard in a given clip of audio. Finally, you'll deploy the system to an embedded device and evaluate how well it works.
At the end of this tutorial, you'll have a firm understanding of how to classify audio using Edge Impulse.
You can view the finished project, including all data, signal processing and machine learning blocks here: .
Detect non-voice audio?
We have a tutorial for that too! See .
For this tutorial you'll need a supported device. Follow the steps to connect your development board to Edge Impulse.
- the easiest option if you don't have one of the above.
If your device is connected under Devices in the studio you can proceed:
Device compatibility
In this tutorial we want to build a system that recognizes keywords, so your first job is to think of a great one. It can be your name, an action, or even a growl - it's your party. Do keep in mind that some keywords are harder to distinguish from others, and especially keywords with only one syllable (like 'One') might lead to false-positives (e.g. when you say 'Gone'). This is the reason that Apple, Google and Amazon all use at least three-syllable keywords ('Hey Siri', 'OK, Google', 'Alexa'). A good one would be "Hello world".
To collect your first data, go to Data acquisition, set your keyword as the label, set your sample length to 10 seconds, your sensor to 'microphone' and your frequency to 16 kHz. Then click Start sampling and start saying your keyword over and over again (with a pause in between).
Afterwards you have a file like this, clearly showing your keywords, separated by some noise.
![10 seconds of "Hello world" data](/.gitbook/assets/68a7c9e-screenshot_2020-11-19_at_221857.png "Screenshot 2020-11-19 at 22.18.57.png")
This data is not suitable for machine learning yet though. You will need to cut out the parts where you say your keyword. This is important because you only want the actual keyword to be labeled as such, and not accidentally label noise, or incomplete sentences (e.g. only "Hello"). Fortunately the Edge Impulse Studio can do this for you. Click ⋮ next to your sample, and select Split sample.
If you have a short keyword, enable Shift samples to randomly shift the sample around in the window, and then click Split. You now have individual 1-second samples in your dataset. Perfect!
Now that you know how to collect data we can consider other data we need to collect. In addition to your keyword we'll also need audio that is not your keyword, such as background noise or the TV playing ('noise' class), and humans saying other words ('unknown' class). This is required because a machine learning model has no idea about right and wrong (unless those are your keywords), but only learns from the data you feed into it. The more varied your data is, the better your model will work.
For each of these three classes ('your keyword', 'noise', and 'unknown') you want to capture an even amount of data (balanced datasets work better) - and for a decent keyword spotting model you'll want at least 10 minutes in each class (but, the more the better).
Thus, collect 10 minutes of samples for your keyword - do this in the same manner as above. The fastest way is probably through your mobile phone, collecting 1 minute clips, then automatically splitting this data. Make sure to capture wide variations of the keyword: leverage your family and your colleagues to help you collect the data, make sure you cover high and low pitches, and slow and fast speakers.
To import this data, go to Data acquisition, click the Upload icon, select a number of 'noise' or 'unknown' samples (there are 25 minutes of each class, but you can select fewer files if you want), and click Begin upload. The data is automatically labeled and added to your project.
If you've collected all your training data through the 'Record new data' widget you'll have all your keywords in the 'Training' dataset. This is not great, because you want to keep 20% of your data separate to validate the machine learning model. To mitigate this you can go to Dashboard and select Perform train/test split. This will automatically split your data between a training set (80%) and a testing set (20%). Afterwards you should see something like this:
With the data set in place you can design an impulse. An impulse takes the raw data, slices it up in smaller windows, uses signal processing blocks to extract features, and then uses a learning block to classify new data. Signal processing blocks always return the same values for the same input and are used to make raw data easier to process, while learning blocks learn from past experiences.
For this tutorial we'll use the "MFCC" signal processing block. MFCC stands for Mel Frequency Cepstral Coefficients. This sounds scary, but it's basically just a way of turning raw audio—which contains a large amount of redundant information—into simplified form. Edge Impulse has many other processing blocks for audio, including "MFE" and the "Spectrogram" blocks for non-voice audio, but the "MFCC" block is great for dealing with human speech.
We'll then pass this simplified audio data into a Neural Network block, which will learn to distinguish between the three classes of audio.
In the Studio, go to the Create impulse tab, and add a Time series data, an Audio (MFCC) and a Classification (Keras) block. Leave the window size at 1 second (as that's the length of our audio samples in the dataset) and click Save Impulse.
Now that we've assembled the building blocks of our Impulse, we can configure each individual part. Click on the MFCC tab in the left hand navigation menu. You'll see a page that looks like this:
In the spectrogram the vertical axis represents the frequencies (the number of frequency bands is controlled by 'Number of coefficients' parameter, try it out!), and the horizontal axis represents time (controlled by 'frame stride' and 'frame length'). The patterns visible in a spectrogram contain information about what type of sound it represents. For example, the spectrogram in this image shows "Hello world":
And the spectrogram in this image shows "On":
These differences are not necessarily easy for a person to describe, but fortunately they are enough for a neural network to learn to identify.
It's interesting to explore your data and look at the types of spectrograms it results in. You can use the dropdown box near the top right of the page to choose between different audio samples to visualize, or play with the parameters to see how the spectrogram changes.
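To get a feel for how the frame length, frame stride and number of coefficients shape the spectrogram, the sketch below counts frames for a 1-second window. The specific values (20 ms frames, 20 ms stride, 13 coefficients) are assumptions chosen for illustration, and the helper `mfcc_shape` is hypothetical; the block's real implementation may pad or round differently.

```python
def mfcc_shape(window_ms, frame_length_ms, frame_stride_ms, num_coefficients):
    """Return (frames, coefficients) for a window, counting only full frames."""
    if window_ms < frame_length_ms:
        return (0, num_coefficients)
    frames = (window_ms - frame_length_ms) // frame_stride_ms + 1
    return (frames, num_coefficients)


frames, coefficients = mfcc_shape(1000, 20, 20, 13)
print(frames, coefficients, frames * coefficients)  # 50 13 650

# Halving the stride roughly doubles the number of time steps
print(mfcc_shape(1000, 20, 10, 13))  # (99, 13)
```

A shorter stride gives finer time resolution at the cost of more features for the network to process.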
In addition, you can see the performance of the MFCC block on your microcontroller below the spectrogram. This is the complete time that it takes on a low-power microcontroller (Cortex-M4F @ 80MHz) to analyze 1 second of data.
The spectrograms generated by the MFCC block will be passed into a neural network architecture that is particularly good at learning to recognize patterns in this type of tabular data. Before training our neural network, we'll need to generate MFCC features for all of our windows of audio. To do this, click the Generate features button at the top of the page, then click the green Generate features button. This will take a minute or so to complete.
Afterwards you're presented with one of the most useful features in Edge Impulse: the feature explorer. This is a 3D representation showing your complete dataset, with each data-item color-coded to its respective label. You can zoom in to every item, find anomalies (an item that's in a wrong cluster), and click on items to listen to the sample. This is a great way to check whether your dataset contains wrong items, and to validate whether your dataset is suitable for ML (it should separate nicely).
With all data processed it's time to start training a neural network. Neural networks are algorithms, modeled loosely after the human brain, that can learn to recognize patterns that appear in their training data. The network that we're training here will take the MFCC as an input, and try to map this to one of three classes—your keyword, noise or unknown.
Click on NN Classifier in the left hand menu. You'll see the following page:
A neural network is composed of layers of virtual "neurons", which you can see represented on the left hand side of the NN Classifier page. An input (in our case, an MFCC spectrogram) is fed into the first layer of neurons, which filters and transforms it based on each neuron's unique internal state. The first layer's output is then fed into the second layer, and so on, gradually transforming the original input into something radically different. In this case, the spectrogram input is transformed over four intermediate layers into just three numbers: the probabilities that the input represents your keyword, 'noise', or 'unknown'.
During training, the internal state of the neurons is gradually tweaked and refined so that the network transforms its input in just the right ways to produce the correct output. This is done by feeding in a sample of training data, checking how far the network's output is from the correct answer, and adjusting the neurons' internal state to make it more likely that a correct answer is produced next time. When done thousands of times, this results in a trained network.
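The feed, compare, adjust loop described above can be shown with a single artificial neuron trained by gradient descent. This is a toy sketch with made-up two-feature data, not the network Edge Impulse trains; every name and value in it is an assumption for illustration.

```python
import math


def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, bias, x):
    """Output of one neuron: weighted sum of the inputs passed through a sigmoid."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(samples, labels, epochs=1000, learning_rate=0.5):
    """Repeatedly nudge the weights so that predictions move toward the labels."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict(weights, bias, x) - y  # how far off was the output?
            # Move each weight a small step in the direction that reduces the error
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


# Toy data: label 1 for points near (1, 1), label 0 for points near (0, 0)
samples = [[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]]
labels = [0, 0, 1, 1]

weights, bias = train(samples, labels)
print(predict(weights, bias, [0.9, 1.0]) > 0.5)  # True
print(predict(weights, bias, [0.0, 0.1]) < 0.5)  # True
```

A real network repeats the same idea across many neurons and layers, with the error propagated backwards through each layer.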
A particular arrangement of layers is referred to as an architecture, and different architectures are useful for different tasks. The default neural network architecture provided by Edge Impulse will work well for our current project, but you can also define your own architectures. You can even import custom neural network code from tools used by data scientists, such as TensorFlow and Keras (click the three dots at the top of the page).
Before you begin training, you should change some values in the configuration. Change the Minimum confidence rating to 0.6. This means that when the neural network makes a prediction (for example, that there is 0.8 probability that some audio contains "hello world") Edge Impulse will disregard it unless it is above the threshold of 0.6.
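To make the effect of the threshold concrete, here is a minimal sketch of how an application could apply it to a set of class probabilities. The function name `apply_threshold` and the `uncertain` label are hypothetical and only illustrate the idea; they are not part of the Edge Impulse SDK.

```python
MIN_CONFIDENCE = 0.6  # the threshold set in this tutorial

def apply_threshold(probabilities, min_confidence=MIN_CONFIDENCE):
    """Return the winning label, or 'uncertain' if no class clears the threshold."""
    label, score = max(probabilities.items(), key=lambda item: item[1])
    return label if score > min_confidence else "uncertain"

confident = apply_threshold({"helloworld": 0.8, "noise": 0.15, "unknown": 0.05})
unsure = apply_threshold({"helloworld": 0.45, "noise": 0.35, "unknown": 0.20})
print(confident, unsure)
```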
Next, enable 'Data augmentation'. When enabled your data is randomly mutated during training. For example, by adding noise, masking time or frequency bands, or warping your time axis. This is a very quick way to make your dataset work better in real life (with unpredictable sounds coming in), and prevents your neural network from overfitting (as the data samples are changed every training cycle).
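As a rough illustration of one of these mutations, the sketch below masks a random time band in a small spectrogram. It is an assumption-laden toy (the function `mask_time_band`, the band width, and the 3 x 6 spectrogram are invented for the example) and does not reproduce the augmentation Edge Impulse applies internally.

```python
import random

def mask_time_band(spectrogram, max_width=2, rng=random):
    """Zero out a random run of consecutive time frames (columns)."""
    frames = len(spectrogram[0])
    width = rng.randint(1, max_width)
    start = rng.randint(0, frames - width)
    return [
        [0.0 if start <= t < start + width else value for t, value in enumerate(row)]
        for row in spectrogram
    ]

# Hypothetical spectrogram: 3 frequency bands by 6 time frames.
spectrogram = [[1.0] * 6 for _ in range(3)]
augmented = mask_time_band(spectrogram, rng=random.Random(0))
for row in augmented:
    print(row)
```

Because a different band is masked each time, the network never sees exactly the same sample twice, which is what makes overfitting harder.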
With everything in place, click Start training. You'll see a lot of text flying past in the Training output panel, which you can ignore for now. Training will take a few minutes. When it's complete, you'll see the Last training performance panel appear at the bottom of the page:
Congratulations, you've trained a neural network with Edge Impulse! But what do all these numbers mean?
At the start of training, 20% of the training data is set aside for validation. This means that instead of being used to train the model, it is used to evaluate how the model is performing. The Last training performance panel displays the results of this validation, providing some vital information about your model and how well it is working. Bear in mind that your exact numbers may differ from the ones in this tutorial.
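The hold-out step can be pictured with a short sketch. The shuffling, the fixed seed, and the function name `split_validation` are assumptions made for the example; the Studio performs its own split during training.

```python
import random

def split_validation(samples, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and hold back a fraction for validation."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * validation_fraction)
    return shuffled[cut:], shuffled[:cut]

# 100 placeholder sample IDs stand in for windows of training data.
train_set, validation_set = split_validation(list(range(100)))
print(len(train_set), len(validation_set))
```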
On the left hand side of the panel, Accuracy refers to the percentage of windows of audio that were correctly classified. The higher the number, the better, although an accuracy approaching 100% is unlikely, and is often a sign that your model has overfit the training data. You will find out whether this is true in the next stage, during model testing. For many applications, an accuracy above 85% can be considered very good.
The Confusion matrix is a table showing the balance of correctly versus incorrectly classified windows. To understand it, compare the values in each row. For example, in the above screenshot, 96 of the helloworld audio windows were classified as helloworld, while 10 of them were incorrectly classified as unknown or noise. This appears to be a great result.
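Reading a row of the confusion matrix amounts to a simple ratio. The sketch below uses the counts quoted above for the helloworld row; the dictionary layout and the combined `noise_or_unknown` key are assumptions for illustration only.

```python
def row_accuracy(row, true_label):
    """Fraction of a class's windows that were assigned the correct label."""
    return row[true_label] / sum(row.values())

# Counts from the helloworld row described above.
helloworld_row = {"helloworld": 96, "noise_or_unknown": 10}
accuracy = row_accuracy(helloworld_row, "helloworld")
print(f"{accuracy:.1%}")
```

So 96 correct windows out of 106 corresponds to roughly 90.6% for this class.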
The On-device performance region shows statistics about how the model is likely to run on-device. Inferencing time is an estimate of how long the model will take to analyze one second of data on a typical microcontroller (an Arm Cortex-M4F running at 80MHz). Peak RAM usage gives an idea of how much RAM will be required to run the model on-device.
The performance numbers in the previous step show that our model is working well on its training data, but it's extremely important that we test the model on new, unseen data before deploying it in the real world. This will help us ensure the model has not learned to overfit the training data, which is a common occurrence.
Fortunately we've put aside 20% of our data already in the 'Test set' (see Data acquisition). This is data that the model has never seen before, and we can use this to validate whether our model actually works on unseen data. To run your model against the test set, head to Model testing, select all items and click Classify selected.
To drill down into a misclassified sample, click the three dots (⋮) next to a sample and select Show classification. You're then transported to the classification view, which lets you inspect the sample, and compare the sample to your training data. This way you can inspect whether this was actually a classification failure, or whether your data was incorrectly labeled. From here you can either update the label (when the label was wrong), or move the item to the training set to refine your model.
Misclassifications and uncertain results
It's inevitable that even a well-trained machine learning model will sometimes misclassify its inputs. When you integrate a model into your application, you should take into account that it will not always give you the correct answer.
For example, if you are classifying audio, you might want to classify several windows of data and average the results. This will give you better overall accuracy than assuming that every individual result is correct.
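A minimal sketch of this averaging, assuming each window's result is available as a dictionary of class probabilities (the function name and the values are made up for the example):

```python
def average_windows(window_results):
    """Average per-class probabilities over several classified windows."""
    labels = window_results[0].keys()
    count = len(window_results)
    return {label: sum(r[label] for r in window_results) / count for label in labels}

# Hypothetical results for three consecutive windows of audio.
windows = [
    {"keyword": 0.9, "noise": 0.1},
    {"keyword": 0.4, "noise": 0.6},
    {"keyword": 0.8, "noise": 0.2},
]
averaged = average_windows(windows)
best = max(averaged, key=averaged.get)
print(best, round(averaged[best], 2))
```

The middle window on its own would have been labeled noise, but the averaged result still favors the keyword.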
With the impulse designed, trained and verified you can deploy this model back to your device. This makes the model run without an internet connection, minimizes latency, and runs with minimum power consumption. Edge Impulse can package up the complete impulse - including the MFCC algorithm, neural network weights, and classification code - in a single C++ library that you can include in your embedded software.
Mobile phone
To export your model, click on Deployment in the menu. Then under 'Build firmware' select your development board, and click Build. This will export the impulse, and build a binary that will run on your development board in a single step. After building is completed you'll get prompted to download a binary. Save this on your computer.
When you click the Build button, you'll see a pop-up with text and video instructions on how to deploy the binary to your particular device. Follow these instructions. Once you are done, we are ready to test your impulse out.
We can connect to the board's newly flashed firmware over serial. Open a terminal and run:
Serial daemon
If the device is not connected over WiFi, but instead connected via the Edge Impulse serial daemon, you'll need to stop the daemon. Only one application can connect to the development board at a time.
This will capture audio from the microphone, run the MFCC code, and then classify the spectrogram:
Great work! You've captured data, trained a model, and deployed it to an embedded device. You can now control LEDs, activate actuators, or send a message to the cloud whenever you say a keyword!
Is your model working properly in the Studio, but does not recognize your keyword when running in continuous mode on your device? Then this is probably due to dataset imbalance (a lot more unknown / noise data compared to your keyword) in combination with our moving average code to reduce false positives.
When running in continuous mode we run a moving average over the predictions to prevent false positives. For example, if we do 3 classifications per second, you'll see your keyword potentially classified three times (once at the start of the audio file, once in the middle, once at the end). However, if your dataset is unbalanced (there's a lot more noise / unknown data than keyword data in your dataset), the ML model typically only manages to find your keyword in the 'center' window, and thus we filter it out as a false positive.
You can fix this by either:
Adding more data :-)
Or, by disabling the moving average filter by going into ei_run_classifier.h (in the edge-impulse-sdk directory) and removing:
Note that this might increase the number of false positives the model detects.
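To see why a keyword that is only found in the 'center' window gets filtered out, consider the simplified sketch below. It is not the code in ei_run_classifier.h; the function `smoothed_detections`, the window of 3, the 0.6 threshold, and the score sequences are all assumptions chosen to illustrate the behavior.

```python
from collections import deque

def smoothed_detections(keyword_scores, window=3, threshold=0.6):
    """Average the most recent scores and flag where the mean clears the threshold."""
    recent = deque(maxlen=window)
    flags = []
    for score in keyword_scores:
        recent.append(score)
        flags.append(sum(recent) / len(recent) > threshold)
    return flags

# Keyword found in three consecutive classifications vs. only in the center one.
found_three_times = smoothed_detections([0.1, 0.9, 0.9, 0.9, 0.1])
center_only = smoothed_detections([0.1, 0.2, 0.9, 0.2, 0.1])
print(any(found_three_times), any(center_only))
```

With three strong classifications the average clears the threshold; a single strong classification is diluted by its neighbors and never triggers a detection.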
We can't wait to see what you'll build! 🚀
Some of the officially supported Edge Impulse boards require specific tutorial steps and thus separate documentation pages for their hardware; these tutorials are listed below:
Responding to your voice
This tutorial is for the Syntiant TinyML Board only. For other development boards, you can follow the standard tutorial.
In this tutorial, you'll use machine learning to build a system that can recognize audible events, particularly your voice, through audio classification. The system you create will respond to spoken commands such as 'go' and 'stop', as if you were controlling a remote control (RC) car, even in the presence of background noise or chatter. For the best user experience please use the Chrome browser.
You'll learn how to collect audio data from the microphone, use signal processing to extract the most important information, and train a deep neural network that can tell you whether your keyword was heard in a given clip of audio. Finally, you'll deploy the system to your Syntiant TinyML Board and evaluate how well it works.
At the end of this tutorial, you'll have a firm understanding of how to classify audio using Edge Impulse for Syntiant hardware.
Before starting the tutorial
After signing up for a free Edge Impulse account, clone the finished project, including all training data, signal processing and machine learning blocks here: . At the end of the process you will have the full project that comes pre-loaded with approx. 2.5 hours of training data.
Production ready systems
While the example Syntiant-RC tutorial project provides good performance, it is not intended to be production ready and is designed to maximize the out-of-box user experience. It is in essence a training vehicle for users of both Edge Impulse and Syntiant to understand the entire workflow. Production ready systems need a much more robust training data set to further reduce the likelihood of false positives and false negatives.
For this tutorial you'll need the:
The Syntiant TinyML Board shows up as a USB microphone once plugged in, and Edge Impulse can use this interface to record audio directly.
Device compatibility
Edge Impulse can ingest data from any device - including embedded devices that you already have in production. See the documentation for the Ingestion service for more information.
In this tutorial we want to build a system that recognizes keywords that resemble giving commands to a remote control car such as 'go' and 'stop'. Although the aforementioned public project comes pre-loaded with approx. 2.5 hours of training data, in order to add additional audio samples including your own, we'll show you how you can record audio samples directly from the Syntiant TinyML Board's audio interface.
To collect your own voice samples, ensure you have selected your system's microphone interface as the "Arduino MKRZero". Then, go to Devices -> Connect a new device, and choose the option of "Use Your Computer" and allow access to your microphone.
Set your keyword as the label and set your sample length to 10s. Then click Start Recording and start saying your keyword over and over again (with a slight pause in between each utterance).
Afterwards you have a file like this, clearly showing your keywords, separated by some noise. The new data sample will show up in the appropriate Training or Test data bucket.
This data is not suitable for Machine Learning yet though. You will need to cut out the parts where you say your keyword. This is important because you only want the actual keyword to be labeled as such, and not accidentally labeled as noise, or incomplete parts of the utterance. Fortunately the Edge Impulse Studio can do this for you. Click ⋮ next to your sample, and select Split sample.
If you have a short keyword, enable Shift samples to randomly shift the sample around in the window, and then click Split. You now have individual 1-second samples in your dataset. Perfect!
About this section
This section gives general guidance for collecting your audio data from scratch. Apart from adding your own voice samples, not all of these steps are required for the out-of-box experience workflow. Read on to understand more detail about how the dataset was constructed.
Now that you know how to collect keyword data, let's consider the other data we need to collect. In addition to your keyword we'll also need audio that is not your keyword, such as background noise, the TV playing, and humans saying other words, all of which go into the openset class. This class is labeled 'z_openset' and will be referred to by that name from here on. It is required because a machine learning model has no idea about right and wrong (unless those are your keywords), but only learns from the data you feed into it. The more varied your data is, the better your model will work.
For each of your classes (in this case 'go', 'stop', and 'z_openset') you want to capture an even amount of data (balanced datasets work better) - and for a decent keyword spotting model you'll want at the VERY minimum 10 minutes in each class if you're building your dataset from scratch. For the Syntiant-RC project we've used a subset of the 'Google 30' speech command set for 'go' and 'stop'.
To make this model more responsive to your own voice, record your own samples in the same manner as above. One way of doing this would be to collect 10-second clips, then automatically split this data. Make sure to capture wide variations of the keyword and cover high and low pitches.
This is entirely optional, but to import this data, go to Data acquisition, click the Upload icon, select a number of 'noise' or 'unknown' samples (there are 25 minutes of each class, but you can select fewer files if you want), and click Begin upload. The data is automatically labeled and added to your project. Make sure you label the samples 'z_openset' before uploading so that they go into the pre-existing 'z_openset' category.
If you've collected all your training data through the 'Record new data' widget you'll have all your keywords in the 'Training' dataset. This is not great, because you want to keep 20% of your data separate to validate the machine learning model. To mitigate this you can go to Dashboard and select Perform train/test split. This will automatically split your data between a training set (80%) and a testing set (20%). Afterwards you should see something like this:
Next steps
In the next steps we walk you through in detail how the Syntiant signal processing and neural network blocks were configured. If you imported the project, these are already pre-configured for you so you can just read on to understand more details.
With the data set in place you can design an impulse. An impulse takes the raw data, slices it up in smaller windows, uses signal processing blocks to extract features, and then uses a learning block to classify new data. Signal processing blocks always return the same values for the same input and are used to make raw data easier to process, while learning blocks learn from past experiences.
We'll then pass this simplified audio data into a Neural Network block, which will learn to distinguish between the three classes of audio.
In the Studio, go to the Create impulse tab, add a Time series data, an Audio (Syntiant) and a Neural Network (Keras) block. Set the window size to 968 ms and the window increase to 484 ms (we'll explain why later), verify that sampling frequency is set to 16000 Hz and click Save Impulse.
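The window size and window increase determine how a longer recording is sliced into overlapping windows. The sketch below computes the start offsets of every full window for a given sample length; it is an illustration under the assumption that only complete windows are used, and how the Studio treats any leftover audio is not covered here.

```python
def window_starts(sample_ms, window_ms=968, increase_ms=484):
    """Start offsets (in ms) of every full window that fits in the sample."""
    if sample_ms < window_ms:
        return []
    return list(range(0, sample_ms - window_ms + 1, increase_ms))

one_second = window_starts(1000)
two_seconds = window_starts(2000)
print(one_second, two_seconds)
```

A 1-second sample yields a single window, while a 2-second sample yields three windows that each overlap the previous one by half.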
Now that we've assembled the building blocks of our Impulse, we can configure each individual part. Click on the Syntiant tab in the left hand navigation menu. You'll see a page that looks like this:
Syntiant block parameters
The number of generated features has to be 1,600, which corresponds to the Syntiant Neural Network input layer. To generate 1,600 features, you have to verify the following equation: window size = 1000 x (39 x frame stride + frame length). For our example: window size = 968 ms = 1000 x (39 x 0.024 + 0.032).
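The equation can be checked with a quick calculation. The sketch below derives the number of frames from the window size, frame length and frame stride (all in milliseconds), and multiplies it by the number of coefficients. The value of 40 coefficients is an assumption used to arrive at 1,600 features; check the 'Number of coefficients' parameter in your own project.

```python
def frame_count(window_ms, frame_length_ms, frame_stride_ms):
    """Number of frames that fit in one window."""
    return (window_ms - frame_length_ms) // frame_stride_ms + 1

NUM_COEFFICIENTS = 40  # assumed value of the 'Number of coefficients' parameter

frames = frame_count(window_ms=968, frame_length_ms=32, frame_stride_ms=24)
features = frames * NUM_COEFFICIENTS
print(frames, features)
```

The 39 in the equation is the number of strides after the first frame, which gives 40 frames in total.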
In the spectrogram the vertical axis represents the frequencies (the number of frequency bands is controlled by 'Number of coefficients' parameter), and the horizontal axis represents time (controlled by 'frame stride' and 'frame length'). The patterns visible in a spectrogram contain information about what type of sound it represents. For example, the spectrogram in this image shows "Go":
These differences are not necessarily easy for a person to describe, but fortunately they are enough for a neural network to learn to identify.
It's interesting to explore your data and look at the types of spectrograms it results in. You can use the dropdown box near the top right of the page to choose between different audio samples to visualize, or play with the parameters to see how the spectrogram changes.
The spectrograms generated by the Syntiant block will be passed into a neural network architecture that is particularly good at learning to recognize patterns in this type of tabular data. Before training our neural network, we'll need to generate Syntiant features for all of our windows of audio. To do this, click the Generate features button at the top of the page, then click the green Generate features button. This will take a minute or so to complete.
Afterwards you're presented with one of the most useful features in Edge Impulse: the feature explorer. This is a 3D representation showing your complete dataset, with each data-item color-coded to its respective label. You can zoom in to every item, find anomalies (an item that's in a wrong cluster), and click on items to listen to the sample. This is a great way to check whether your dataset contains wrong items, and to validate whether your dataset is suitable for ML (it should separate nicely).
With all data processed it's time to start training a neural network. Neural networks are algorithms, modeled loosely after the human brain, that can learn to recognize patterns that appear in their training data. The network that we're training here will take the processing block features as an input, and try to map this to one of the three classes — 'go', 'stop', or 'z_openset'.
Click on NN Classifier in the left hand menu. You'll see the following page:
The neural network architecture is automatically filled with the Syntiant configuration: 3 Dense Layers of 256 neurons and 3 Dropout layers.
With everything in place, click Start training. You'll see a lot of text flying past in the Training output panel, which you can ignore for now. Training will take a few minutes. When it's complete, you'll see the Last training performance panel appear at the bottom of the page:
Congratulations, you've trained a neural network with Edge Impulse and are ready to deploy it to the Syntiant TinyML Board! But what do all these numbers mean?
At the start of training, 20% of the training data is set aside for validation. This means that instead of being used to train the model, it is used to evaluate how the model is performing. The Last training performance panel displays the results of this validation, providing some vital information about your model and how well it is working. Bear in mind that your exact numbers may differ from the ones in this tutorial.
On the left hand side of the panel, Accuracy refers to the percentage of windows of audio that were correctly classified. The higher the number, the better, although an accuracy approaching 100% is unlikely, and is often a sign that your model has overfit the training data. You will find out whether this is true in the next stage, during model testing. For many applications, an accuracy above 85% can be considered very good.
The Confusion matrix is a table showing the balance of correctly versus incorrectly classified windows. To understand it, compare the values in each row. For example, in the above screenshot, 98.8% of the 'go' audio windows were classified as 'go', while 1.8% of them were classified as 'stop'. This appears to be a great result.
The performance numbers in the previous step show that our model is working well on its training data, but it's extremely important that we test the model on new, unseen data before deploying it in the real world. This will help us ensure the model has not learned to overfit the training data, which is a common occurrence.
Fortunately we've put aside 20% of our data already in the 'Test set' (see Data acquisition). This is data that the model has never seen before, and we can use this to validate whether our model actually works on unseen data. To run your model against the test set, head to Model testing, and click Classify all.
To drill down into a misclassified sample, click the three dots (⋮) next to a sample and select Show classification. You're then transported to the classification view, which lets you inspect the sample, and compare the sample to your training data. This way you can inspect whether this was actually a classification failure, or whether your data was incorrectly labeled. From here you can either update the label (when the label was wrong), or move the item to the training set to refine your model.
Misclassifications and uncertain results
It's inevitable that even a well-trained machine learning model will sometimes misclassify its inputs. When you integrate a model into your application, you should take into account that it will not always give you the correct answer.
For example, if you are classifying audio, you might want to classify several windows of data and average the results. This will give you better overall accuracy than assuming that every individual result is correct.
With the impulse designed, trained and verified you can deploy this model back to your device. This makes the model run without an internet connection, minimizes latency, and runs with minimum power consumption.
To export your model, click on Deployment in the menu. Then under 'Build firmware' select the Syntiant development board.
The final step before building the firmware is to configure the posterior handler parameters of the Syntiant chip.
Pre-configured posterior parameters
For the Syntiant-RC project, we've already pre-configured the posterior parameters so you can just go to the 'Build' output step. We recommend skipping to Step 9, but read on for more details about the process of posterior search.
These parameters are used to tune the precision and recall of the neural network activations, to minimize the False Rejection Rate and False Activation Rate. You can manually edit these parameters in JSON format or use Find posterior parameters to search for the best values:
Select the classes you want to detect (the z_openset class should be omitted except for testing purposes)
Select a calibration dataset: either no calibration (fastest and recommended), the reference dataset with common English words from a radio program, or your own calibration dataset.
This calibration dataset is made of 2 files: an audio wav file and a csv file which contains the audio transcript of this program. This audio transcript should ideally contain the classes you want to detect. For example, you can imagine the following csv file: go,stop,this,is,an,example,transcript,for,optimizing,the,posterior,parameters,,,it,will,optimize,activations,for,the,go,stop,keywords,
You can also simplify the csv file and include only the keywords/classes you are interested in optimizing. For instance, if your audio wav file contains only 2 occurrences each of 'go' and 'stop': go,stop,go,stop,
Tuning the audio gain
After generating the posterior parameters, you can change the audio gain by editing the "audio_pdm_gain" value at the end of the JSON file. Default value is set to 12 dB.
Once optimized parameters have been found, you can click Build. This will build a Syntiant package that will run on your development board. After building is completed you'll get prompted to download a zipfile. Save this on your computer. A pop-up video will show how the download process works.
After unzipping the downloaded file, run the appropriate flashing script for your platform (Linux, Mac, or Win 10) to flash the Syntiant TinyML Board with the Syntiant-RC model and associated firmware. You might see a Microsoft Defender screen pop up when the script is run on Windows 10. It's safe to proceed so select 'More info' and continue.
We can connect to the board's newly flashed firmware over serial. Open a terminal, connect to the Syntiant TinyML board using the appropriate COM port with 115200 8-N-1 settings. Utter the words 'go', 'stop' and you should see something like the following:
You can also print out predictions by running the edge-impulse-run-impulse command in your terminal.
We can't wait to see what you'll build! 🚀
Note: Data collection from a development board might be slow, you can use your mobile phone as a sensor to make this much faster.
For the noise and unknown datasets you can either collect this yourself, or make your life a bit easier by using a dataset of both 'noise' (all kinds of background noise) and 'unknown' (random words) data that we built for you here: .
This page allows you to configure the MFCC block, and lets you preview how the data will be transformed. The right of the page shows a visualization of the MFCC's output for a piece of audio, which is known as a spectrogram. An MFCC spectrogram is a specially tuned spectrogram which highlights frequencies which are common in human speech (Edge Impulse also has normal spectrograms if that's more your thing).
You might think based on this number that we can only classify 2 or 3 windows per second, but we continuously build up the spectrogram (as it has a time component), which takes less time, and we can thus continuously listen for events 5-6x a second, even on a 40MHz processor. This is already implemented on all , and on your own device.
Your mobile phone can build and download the compiled impulse directly from the mobile client. See 'Deploying back to device' on the page.
Congratulations! You've used Edge Impulse to train a neural network model capable of recognizing audible events. There are endless applications for this type of model, from monitoring industrial machinery to recognizing voice commands. Now that you've trained your model you can integrate your impulse in the firmware of your own embedded device, see . There are examples for Mbed OS, Arduino, STM32CubeIDE, Zephyr, and any other target that supports a C++ compiler.
Or if you're interested in more, see our tutorials on or . If you have a great idea for a different project, that's fine too. Edge Impulse lets you capture data from any sensor, build to extract features, and you have full flexibility in your Machine Learning pipeline with the learning blocks.
Note: You can use your mobile phone as a sensor as well.
To add additional data to the 'z_openset' dataset label you can either collect this yourself, or make your life a bit easier by using a dataset of both 'noise' (all kinds of background noise) and 'unknown' (random words) data that we built for you here: . In the Syntiant-RC project we used short 1 second clips of NPR radio recordings. We consider the 'z_openset' a 'negative' class since it does not contain any of the keywords that are of interest such as 'go' and 'stop' (which are considered 'positive' classes).
For this tutorial we'll use the "Syntiant" signal processing block. The Syntiant processing block is similar to the block, but using a log-MEL scale plus other transformations specific to the Syntiant audio front-end.
This page allows you to configure some parameters of the Syntiant block, and lets you preview how the data will be transformed. The right of the page shows a visualization of the Syntiant's output for a piece of audio. The Syntiant processing block extracts speech features using filterbanks. To get a better understanding of parameters, have a look at the documentation.
Congratulations! Now that you've trained your model you can integrate your impulse in the firmware of your TinyML board, see .
Or if you're interested in more, see our tutorial on .
This tutorial will guide you through a people counting reference design built using the Silabs xG24 dev kit and the Arducam Mini 2MP Plus. The design showcases:
The Silabs xG24 Dev Kit featuring the EFR32 chipset with AI/ML accelerator providing:
Up to 3x speed increases in image-based ML processing (when compared to running a non-accelerated model),
An extremely low AI/ML and BT stack footprint allowing for concurrent inference and BT communication, and
Edge Impulse’s own FOMO algorithm providing:
image-based object detection at the lower end (ARM Cortex®-M33, 256kB RAM) of TinyML compute,
the ability to train object detection models using only ~100, instead of thousands of images,
the ability to detect objects at extremely low resolutions (64x64 pixels)
The diagram below depicts the ML lifecycle architecture defined for our people counting reference design. We used a single xG24 Dev Kit to implement either a collection or an inference flow, recursively as required, during the development process.
This guide assumes that you have already completed the getting started guide for the Silabs xG24 Dev Kit and have trained a model.
In order to replicate this reference design, you will also need:
An xG24 Dev Kit from Silabs
An Arducam Mini 2MP Plus
An Edge Impulse Studio account with a clone of the people counting project.
A development computer with the Edge Impulse CLI installed
For this project, we attached an Arducam mini 2MP plus to the xG24 Dev Kit in order to capture low-res images of people flow from a real environment. This can be achieved by connecting the two devices as specified in the table below:
Head over to your cloned Edge Impulse project, and go to Deployment. From here you can create the full firmware package built with all required libraries and dependencies. This includes the Silabs' Bluetooth stack which can broadcast inference results to nearby devices. Select Silabs xG24 Dev Kit and click Build to build the firmware. Then download and extract the .zip file.
You can use your cloned project and xG24 Dev Kit camera assembly as a starting point to develop your own object detection project by following our FOMO guide.
xG24 Dev Kit pin | Arducam Mini 2MP Plus |
---|---|
1 | GND |
2 | |
3 | SDA |
4 | MOSI |
5 | |
6 | MISO |
7 | |
8 | SCK |
9 | |
10 | |
11 | SCL |
12 | |
13 | CS |
14 | |
15 | |
16 | |
17 | |
18 | VSS |
19 | |
20 | |
In this tutorial, you'll use machine learning to build a system that can recognize when a particular sound is happening—a task known as audio classification. The system you create will be able to recognize the sound of water running from a faucet, even in the presence of other background noise.
You'll learn how to collect audio data from microphones, use signal processing to extract the most important information, and train a deep neural network that can tell you whether the sound of running water can be heard in a given clip of audio. Finally, you'll deploy the system to an embedded device and evaluate how well it works.
At the end of this tutorial, you'll have a firm understanding of how to classify audio using Edge Impulse.
You can view the finished project, including all data, signal processing and machine learning blocks here: Tutorial: recognize sounds from audio.
For this tutorial you'll need a supported device. Follow the steps to connect your development board to Edge Impulse.
Any mobile phone - if you don't have a development kit
If your device is connected under Devices in the studio you can proceed:
Device compatibility
Edge Impulse can ingest data from any device - including embedded devices that you already have in production. See the documentation for the Ingestion service for more information.
To build this project, you'll need to collect some audio data that will be used to train the machine learning model. Since the goal is to detect the sound of a running faucet, you'll need to collect some examples of that. You'll also need some examples of typical background noise that doesn't contain the sound of a faucet, so the model can learn to discriminate between the two. These two types of examples represent the two classes we'll be training our model to detect: background noise, or running faucet.
You can use your device to collect some data. In the studio, go to the Data acquisition tab. This is the place where all your raw data is stored, and - if your device is connected to the remote management API - where you can start sampling new data.
Let's start by recording an example of background noise that doesn't contain the sound of a running faucet. Under Record new data, select your device, set the label to noise, the sample length to 1000, and the sensor to Built-in microphone. This indicates that you want to record 1 second of audio, and label the recorded data as noise. You can later edit these labels if needed.
After you click Start sampling, the device will capture a second of audio and transmit it to Edge Impulse. The LED will light while recording is in progress, then light again during transmission.
When the data has been uploaded, you will see a new line appear under 'Collected data'. You will also see the waveform of the audio in the 'RAW DATA' box. You can use the controls underneath to listen to the audio that was captured.
Since you now know how to capture audio with Edge Impulse, it's time to start building a dataset. For a simple audio classification model like this one, we should aim to capture around 10 minutes of data. We have two classes, and it's ideal if our data is balanced equally between each of them. This means we should aim to capture the following data:
5 minutes of background noise, with the label "noise"
5 minutes of running faucet noise, with the label "faucet"
In the real world, there are usually additional sounds present alongside the sounds we care about. For example, a running faucet is often accompanied by the sound of dishes being washed, teeth being brushed, or a conversation in the kitchen. Background noise might also include the sounds of television, kids playing, or cars driving past outside.
It's important that your training data contains these types of real world sounds. If your model is not exposed to them during training, it will not learn to take them into account, and it will not perform well during real-world usage.
For this tutorial, you should try to capture the following:
Background noise
2 minutes of background noise without much additional activity
1 minute of background noise with a TV or music playing
1 minute of background noise featuring occasional talking or conversation
1 minute of background noise with the sounds of housework
Running faucet noise
1 minute of a faucet running
1 minute of a different faucet running
1 minute of a faucet running with a TV or music playing
1 minute of a faucet running with occasional talking or conversation
1 minute of a faucet running with the sounds of housework
It's okay if you can't get all of these, as long as you still obtain 5 minutes of data for each class. However, your model will perform better in the real world if it was trained on a representative dataset.
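As a quick sanity check, the recording plan above can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary layout and the names plan, minutes_per_class and balanced are made up for this example and are not part of Edge Impulse.

```python
# Recording plan from the tutorial, in minutes per scenario.
# The structure and names here are illustrative only.
plan = {
    "noise": {
        "quiet background": 2,
        "TV or music": 1,
        "talking": 1,
        "housework": 1,
    },
    "faucet": {
        "faucet": 1,
        "different faucet": 1,
        "faucet + TV or music": 1,
        "faucet + talking": 1,
        "faucet + housework": 1,
    },
}

# Total minutes per class, overall total, and whether the classes are balanced.
minutes_per_class = {label: sum(parts.values()) for label, parts in plan.items()}
total_minutes = sum(minutes_per_class.values())
balanced = len(set(minutes_per_class.values())) == 1

print(minutes_per_class)        # {'noise': 5, 'faucet': 5}
print(total_minutes, balanced)  # 10 True
```

If you swap scenarios in or out, the same tally shows whether both classes still add up to 5 minutes.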
Dataset diversity
There's no guarantee your model will perform well in the presence of sounds that were not included in its training set, so it's important to make your dataset as diverse and representative of real-world conditions as possible.
The amount of audio that can be captured in one go varies depending on a device's memory. The ST B-L475E-IOT01A developer board has enough memory to capture 60 seconds of audio at a time, and the Arduino Nano 33 BLE Sense has enough memory for 16 seconds. To capture 60 seconds of audio, set the sample length to 60000. Because the board transmits data quite slowly, it will take around 7 minutes before a 60 second sample appears in Edge Impulse.
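To plan a recording session, it can help to work out how many separate captures each board needs to reach 5 minutes per class. The sketch below does that arithmetic using the two per-capture limits quoted above; the helper recordings_needed is a hypothetical name used only for this example.

```python
import math

def recordings_needed(target_seconds, max_sample_seconds):
    """Number of separate recordings needed to reach the target duration."""
    return math.ceil(target_seconds / max_sample_seconds)

target = 5 * 60  # 5 minutes per class, in seconds

print(recordings_needed(target, 60))  # 5  (60 s per capture)
print(recordings_needed(target, 16))  # 19 (16 s per capture)
print(60 * 1000)                      # 60000, the sample length in ms for 60 s
```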
Once you've captured around 10 minutes of data, it's time to start designing an Impulse.
Alternatively, you can load an example test set that has about ten minutes of data in these classes (but how much fun is that?). See the Running faucet dataset for more information.
With the training set in place you can design an impulse. An impulse takes the raw data, slices it up in smaller windows, uses signal processing blocks to extract features, and then uses a learning block to classify new data. Signal processing blocks always return the same values for the same input and are used to make raw data easier to process, while learning blocks learn from past experiences.
For this tutorial we'll use the "MFE" signal processing block. MFE stands for Mel Frequency Energy. This sounds scary, but it's basically just a way of turning raw audio, which contains a large amount of redundant information, into a simplified form.
Spectrogram block
Edge Impulse supports three different blocks for audio classification: MFCC, MFE and spectrogram blocks. If your accuracy is not great using the MFE block you can switch to the spectrogram block, which is not tuned to frequencies for the human ear.
We'll then pass this simplified audio data into a Neural Network block, which will learn to distinguish between the two classes of audio (faucet and noise).
In the studio, go to the Create impulse tab. You'll see a Time Series Data block, like this one.
As mentioned above, Edge Impulse slices up the raw samples into windows that are fed into the machine learning model during training. The Window size field controls how long, in milliseconds, each window of data should be. A half second audio sample will be enough to determine whether a faucet is running or not, so you should make sure Window size is set to 500 ms. You can either drag the slider or type a new value directly.
Each raw sample is sliced into multiple windows, and the Window increase field controls the offset of each subsequent window from the first. For example, a Window increase value of 500 ms would result in each window starting half a second after the start of the previous one.
By setting a Window increase that is smaller than the Window size, we can create windows that overlap. This is actually a great idea. Although they may contain similar data, each overlapping window is still a unique example of audio that represents the sample's label. By using overlapping windows, we can make the most of our training data. For example, with a Window size of 500 ms and a Window increase of 100 ms, we can extract 16 unique windows from only 2 seconds of data.
Make sure the Window increase field is set to 300 ms. The Time series data block should match the screenshot above.
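The windowing arithmetic can be checked with a short sketch. It assumes that only windows fitting entirely inside the sample are kept (no zero-padding of a trailing partial window); the function window_starts is a hypothetical helper, not an Edge Impulse API.

```python
def window_starts(sample_ms, window_ms, increase_ms):
    """Start offsets (ms) of every window that fits entirely inside the sample."""
    if sample_ms < window_ms:
        return []
    return list(range(0, sample_ms - window_ms + 1, increase_ms))

# 2 seconds of audio, 500 ms windows, 100 ms increase (heavily overlapping).
overlapping = window_starts(2000, 500, 100)
print(len(overlapping), overlapping[:3], overlapping[-1])  # 16 [0, 100, 200] 1500

# Settings used in this tutorial: 500 ms windows, 300 ms increase.
print(window_starts(2000, 500, 300))  # [0, 300, 600, 900, 1200, 1500]

# No overlap: increase equal to the window size.
print(window_starts(2000, 500, 500))  # [0, 500, 1000, 1500]
```

A smaller Window increase yields more training windows from the same recording, at the cost of more processing time and more similar-looking examples.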
Next, click Add a processing block and choose the 'MFE' block. Once you're done with that, click Add a learning block and select 'Classification (Keras)'. Finally, click Save impulse. Your impulse should now look like this:
Now that we've assembled the building blocks of our Impulse, we can configure each individual part. Click on the MFE tab in the left hand navigation menu. You'll see a page that looks like this:
This page allows you to configure the MFE block, and lets you preview how the data will be transformed. The right of the page shows a visualization of the MFE's output for a piece of audio, which is known as a spectrogram.
The MFE block transforms a window of audio into a table of data where each row represents a range of frequencies and each column represents a span of time. The value contained within each cell reflects the amplitude of its associated range of frequencies during that span of time. The spectrogram shows each cell as a colored block, the intensity of which varies depending on the amplitude.
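To make the "rows are frequency ranges, columns are spans of time" idea concrete, here is a deliberately simplified sketch. It is not the MFE algorithm: it uses a naive discrete Fourier transform and a handful of plain rectangular frequency bands, whereas the real block uses mel-spaced filters. The function band_energy_table, the band edges and the 8 kHz test tone are all assumptions made for this illustration.

```python
import cmath
import math

def band_energy_table(signal, sample_rate, frame_len, frame_step, band_edges_hz):
    """Return table[band][frame]: energy of each frequency band in each time frame."""
    frames = [signal[s:s + frame_len]
              for s in range(0, len(signal) - frame_len + 1, frame_step)]
    n_bands = len(band_edges_hz) - 1
    table = [[0.0] * len(frames) for _ in range(n_bands)]
    for col, frame in enumerate(frames):
        for k in range(frame_len // 2 + 1):          # non-negative frequency bins
            freq = k * sample_rate / frame_len
            coeff = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n, x in enumerate(frame))
            power = abs(coeff) ** 2
            for row in range(n_bands):
                if band_edges_hz[row] <= freq < band_edges_hz[row + 1]:
                    table[row][col] += power
    return table

# 0.1 s of a pure 1 kHz tone sampled at 8 kHz (made-up test signal).
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(800)]

table = band_energy_table(tone, sample_rate, frame_len=160, frame_step=80,
                          band_edges_hz=[0, 500, 1500, 2500, 4001])

# For each column (time frame), find the row (frequency band) with the most energy.
loudest = [max(range(len(table)), key=lambda r: table[r][c])
           for c in range(len(table[0]))]
print(loudest)  # every frame peaks in row 1, the 500-1500 Hz band
```

A running faucet or background noise spreads energy across many rows instead of one, and it is that pattern across the table that the neural network later learns from.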
The patterns visible in a spectrogram contain information about what type of sound it represents. For example, the spectrogram in this image shows a pattern typical of background noise:
You can tell that it is slightly different from the following spectrogram, which shows a pattern typical of a running faucet:
These differences are not necessarily easy for a person to describe, but fortunately they are enough for a neural network to learn to identify.
It's interesting to explore your data and look at the types of spectrograms it results in. You can use the dropdown box near the top right of the page to choose between different audio samples to visualize, and drag the white window on the audio waveform to select different windows of data:
There are a lot of different ways to configure the MFE block, as shown in the Parameters box. Here, we are going to lower the Filter number to 32 in order to lower the overhead of running the MFE block. You can play around with other parameters such as the noise floor to see how they impact the spectrogram.
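To get a feel for what the Filter number controls, the sketch below spaces 32 filters evenly on the mel scale and converts their edges back to Hz. It assumes the widely used formula mel = 2595 * log10(1 + f / 700) and a frequency range of 0 to 8000 Hz (the Nyquist limit for 16 kHz audio); the exact formula and range used inside the MFE block may differ, and the function names are hypothetical.

```python
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(num_filters, low_hz, high_hz):
    """Return num_filters + 2 edge frequencies, evenly spaced on the mel scale.

    Filter i spans edges[i] to edges[i + 2] and is centered on edges[i + 1].
    """
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + i * step) for i in range(num_filters + 2)]

edges = mel_band_edges(32, 0.0, 8000.0)
centers = edges[1:-1]

print(len(centers))  # 32 filters, one spectrogram row each
# Filters are packed closely at low frequencies and spread out at high ones.
print(round(centers[1] - centers[0]), "Hz between the two lowest centers")
print(round(centers[-1] - centers[-2]), "Hz between the two highest centers")
```

Fewer filters means fewer rows per spectrogram, which is why lowering the Filter number reduces the processing overhead.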
For the first run through this tutorial, set your MFE parameters to match the image below:
After testing out training with the parameters above, check out the EON Tuner tutorial to learn how to use Edge Impulse to automatically choose the best DSP parameters for your dataset.
The spectrograms generated by the MFE block will be passed into a neural network architecture that is particularly good at learning to recognize patterns in this type of tabular data. Before training our neural network, we'll need to generate MFE features for all of our windows of audio. To do this, click the Generate features button at the top of the page, then click the green Generate features button. If you have a full 10 minutes of data, the process will take a while to complete:
Once this process is complete the feature explorer shows a visualization of your dataset. Here dimensionality reduction is used to map your features onto a 3D space, and you can use the feature explorer to see if the different classes separate well, or find mislabeled data (if it shows in a different cluster). You can find more information in visualizing complex datasets.
Next, we'll configure the neural network and begin training.
With all data processed it's time to start training a neural network. Neural networks are algorithms, modeled loosely after the human brain, that can learn to recognize patterns that appear in their training data. The network that we're training here will take the MFE as an input, and try to map this to one of two classes—noise, or faucet.
Click on NN Classifier in the left hand menu. You'll see the following page:
A neural network is composed of layers of virtual "neurons", which you can see represented on the left hand side of the NN Classifier page. An input—in our case, an MFE spectrogram—is fed into the first layer of neurons, which filters and transforms it based on each neuron's unique internal state. The first layer's output is then fed into the second layer, and so on, gradually transforming the original input into something radically different. In this case, the spectrogram input is transformed over four intermediate layers into just two numbers: the probability that the input represents noise, and the probability that the input represents a running faucet.
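The final step, turning the last layer's raw outputs into two probabilities, is commonly done with a softmax function. The sketch below assumes that is what happens here, and the raw scores are made up for illustration.

```python
import math

def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["faucet", "noise"]
raw_scores = [2.0, 0.0]  # made-up outputs of the final layer
probabilities = softmax(raw_scores)

for label, p in zip(labels, probabilities):
    print(f"{label}: {p:.2f}")
# faucet: 0.88
# noise: 0.12
```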
During training, the internal state of the neurons is gradually tweaked and refined so that the network transforms its input in just the right ways to produce the correct output. This is done by feeding in a sample of training data, checking how far the network's output is from the correct answer, and adjusting the neurons' internal state to make it more likely that a correct answer is produced next time. When done thousands of times, this results in a trained network.
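The feed, check, adjust loop described above can be sketched with a single artificial neuron. This is a toy: it is logistic regression trained with stochastic gradient descent on four made-up examples, not the multi-layer network that Edge Impulse trains, and every name and number in it is an assumption for illustration.

```python
import math

def predict(weights, bias, features):
    """Probability that the features belong to class 1 (faucet)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: two made-up features per example; label 1 = faucet, 0 = noise.
data = [([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.1, 0.2], 0), ([0.2, 0.1], 0)]

weights, bias, learning_rate = [0.0, 0.0], 0.0, 0.5

def mean_loss():
    """Average cross-entropy loss over the toy dataset (lower is better)."""
    total = 0.0
    for features, label in data:
        p = predict(weights, bias, features)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(data)

loss_before = mean_loss()
for epoch in range(1000):
    for features, label in data:
        # How far is the output from the correct answer?
        error = predict(weights, bias, features) - label
        # Nudge each parameter in the direction that reduces that error.
        weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
        bias -= learning_rate * error
loss_after = mean_loss()

print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
print(f"faucet-like input: {predict(weights, bias, [0.9, 0.8]):.2f}")
print(f"noise-like input:  {predict(weights, bias, [0.1, 0.2]):.2f}")
```

The "internal state" in this toy is just two weights and a bias; a real network has many thousands of such values, adjusted by the same kind of small repeated corrections.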
A particular arrangement of layers is referred to as an architecture, and different architectures are useful for different tasks. The default neural network architecture provided by Edge Impulse will work well for our current project, but you can also define your own architectures. You can even import custom neural network code from tools used by data scientists, such as TensorFlow and Keras.
The default settings should work, and to begin training, click Start training. You'll see a lot of text flying past in the Training output panel, which you can ignore for now. Training will take a few minutes. When it's complete, you'll see the Model panel appear at the right side of the page:
Congratulations, you've trained a neural network with Edge Impulse! But what do all these numbers mean?
At the start of training, 20% of the training data is set aside for validation. This means that instead of being used to train the model, it is used to evaluate how the model is performing. The Last training performance panel displays the results of this validation, providing some vital information about your model and how well it is working. Bear in mind that your exact numbers may differ from the ones in this tutorial.
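A hold-out split like this can be sketched as follows. The tutorial does not say how the studio picks its validation windows, so the random shuffle, the fixed seed and the function name split_train_validation are assumptions for illustration only.

```python
import random

def split_train_validation(windows, validation_fraction=0.2, seed=42):
    """Shuffle the windows and hold out a fraction of them for validation."""
    shuffled = list(windows)
    random.Random(seed).shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

windows = [f"window_{i}" for i in range(100)]
train, validation = split_train_validation(windows)

print(len(train), len(validation))  # 80 20
```

The important property is that the two lists never share a window, so validation accuracy is measured on data the model was not trained on.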
On the left hand side of the panel, Accuracy refers to the percentage of windows of audio that were correctly classified. The higher the number the better, although an accuracy approaching 100% is unlikely, and is often a sign that your model has overfit the training data. You will find out whether this is true in the next stage, during model testing. For many applications, an accuracy above 80% can be considered very good.
The Confusion matrix is a table showing the balance of correctly versus incorrectly classified windows. To understand it, compare the values in each row. For example, in the above screenshot, all of the faucet audio windows were classified as faucet, but a few noise windows were misclassified. This appears to be a great result though.
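Here is a minimal sketch of how a confusion matrix and the accuracy figure relate to each other. The ten labels are made up and do not come from the tutorial's screenshot; rows are the actual class and columns the predicted class.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] = number of windows of class labels[i] predicted as labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

labels = ["faucet", "noise"]
actual = ["faucet"] * 4 + ["noise"] * 6
predicted = ["faucet"] * 4 + ["noise"] * 5 + ["faucet"]  # one noise window misclassified

matrix = confusion_matrix(actual, predicted, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)

print(matrix)    # [[4, 0], [1, 5]]
print(accuracy)  # 0.9
```

The diagonal holds the correctly classified windows, so accuracy is the diagonal sum divided by the total number of windows.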
The On-device performance region shows statistics about how the model is likely to run on-device. Inferencing time is an estimate of how long the model will take to analyze one second of data on a typical microcontroller (here: an Arm Cortex-M4F running at 80MHz). Peak memory usage gives an idea of how much RAM will be required to run the model on-device.
The performance numbers in the previous step show that our model is working well on its training data, but it's extremely important that we test the model on new, unseen data before deploying it in the real world. This will help us ensure the model has not learned to overfit the training data, which is a common occurrence.
Edge Impulse provides some helpful tools for testing our model, including a way to capture live data from your device and immediately attempt to classify it. To try it out, click on Live classification in the left hand menu. Your device should show up in the 'Classify new data' panel. Capture 5 seconds of background noise by clicking Start sampling:
The sample will be captured, uploaded, and classified. Once this has happened, you'll see a breakdown of the results:
Once the sample is uploaded, it is split into windows–in this case, a total of 41. These windows are then classified. As you can see, our model classified all 41 windows of the captured audio as noise. This is a great result! Our model has correctly identified that the audio was background noise, even though this is new data that was not part of its training set.
Of course, it's possible some of the windows may be classified incorrectly. Since our model was 99% accurate based on its validation data, you can expect that at least 1% of windows will be classified wrongly—and likely much more than this, since our validation data doesn't represent every possible type of background or faucet noise. If your model didn't perform perfectly, don't worry. We'll get to troubleshooting later.
Misclassifications and uncertain results
It's inevitable that even a well-trained machine learning model will sometimes misclassify its inputs. When you integrate a model into your application, you should take into account that it will not always give you the correct answer.
For example, if you are classifying audio, you might want to classify several windows of data and average the results. This will give you better overall accuracy than assuming that every individual result is correct.
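Averaging over several windows can be sketched like this. The per-window scores are made up, and average_scores is a hypothetical helper rather than part of any Edge Impulse library.

```python
def average_scores(window_scores):
    """Average per-class probabilities across several classified windows."""
    labels = window_scores[0].keys()
    count = len(window_scores)
    return {label: sum(w[label] for w in window_scores) / count for label in labels}

# Made-up results for four consecutive windows; the third is a one-off misfire.
recent = [
    {"faucet": 0.9, "noise": 0.1},
    {"faucet": 0.8, "noise": 0.2},
    {"faucet": 0.3, "noise": 0.7},
    {"faucet": 0.8, "noise": 0.2},
]

averaged = average_scores(recent)
decision = max(averaged, key=averaged.get)

print({label: round(p, 2) for label, p in averaged.items()})  # {'faucet': 0.7, 'noise': 0.3}
print(decision)  # faucet
```

Acting on the averaged result means the single misclassified window does not flip the application's decision.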
Using the Live classification tab, you can easily try out your model and get an idea of how it performs. But to be really sure that it is working well, we need to do some more rigorous testing. That's where the Model testing tab comes in. If you open it up, you'll see the sample we just captured listed in the Test data panel:
In addition to its training data, every Edge Impulse project also has a test dataset. Samples captured in Live classification are automatically saved to the test dataset, and the Model testing tab lists all of the test data.
To use the sample we've just captured for testing, we should correctly set its expected outcome. Click the ⋮ icon and select Edit expected outcome, then enter noise. Now, select the sample using the checkbox to the left of the table and click Classify selected:
You'll see that the model's accuracy has been rated based on the test data. Right now, this doesn't give us much more information than just classifying the same sample in the Live classification tab. But if you build up a big, comprehensive set of test samples, you can use the Model testing tab to measure how your model is performing on real data.
Ideally, you'll want to collect a test set that contains at least 25% of the amount of data in your training set. So, if you've collected 10 minutes of training data, you should collect at least 2.5 minutes of test data. You should make sure this test data represents a wide range of possible conditions, so that it evaluates how the model performs with many different types of inputs. For example, collecting test audio for several different faucets is a good idea.
You can use the Data acquisition tab to manage your test data. Open the tab, and then click Test data at the top. Then, use the Record new data panel to capture a few minutes of test data, including audio for both background noise and faucet. Make sure the samples are labelled correctly. Once you're done, head back to the Model testing tab, select all the samples, and click Classify selected:
The screenshot shows classification results from a large number of test samples (there are more on the page than would fit in the screenshot). The panel shows that our model is performing at 85% accuracy, which is lower than how it performed on validation data. It's normal for a model to perform less well on entirely fresh data, so this is a successful result. Our model is working well!
For each test sample, the panel shows a breakdown of its individual performance. For example, one of the samples was classified with only 62% accuracy. Samples that contain a lot of misclassifications are valuable, since they have examples of types of audio that our model does not currently fit. It's often worth adding these to your training data, which you can do by clicking the ⋮ icon and selecting Move to training set. If you do this, you should add some new test data to make up for the loss!
Testing your model helps confirm that it works in real life, and it's something you should do after every change. However, if you often make tweaks to your model to try to improve its performance on the test dataset, your model may gradually start to overfit to the test dataset, and it will lose its value as a metric. To avoid this, continually add fresh data to your test dataset.
Data hygiene
It's extremely important that data is never duplicated between your training and test datasets. Your model will naturally perform well on the data that it was trained on, so if there are duplicate samples then your test results will indicate better performance than your model will achieve in the real world.
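One simple way to check for exact duplicates before uploading is to compare content hashes of the files in both sets. The sketch below uses in-memory byte strings and made-up file names as stand-ins for real audio files. It only catches byte-for-byte copies, not two different recordings of the same moment.

```python
import hashlib

def content_hash(data):
    """SHA-256 digest of a sample's raw bytes."""
    return hashlib.sha256(data).hexdigest()

def find_duplicates(training, testing):
    """Return names of test samples whose bytes also appear in the training set."""
    training_hashes = {content_hash(d) for d in training.values()}
    return sorted(name for name, d in testing.items()
                  if content_hash(d) in training_hashes)

# Stand-ins for audio files: name -> raw bytes (made up for this example).
training = {"faucet.01.wav": b"\x01\x02\x03", "noise.01.wav": b"\x04\x05\x06"}
testing = {"faucet.test.wav": b"\x01\x02\x03", "noise.test.wav": b"\x07\x08\x09"}

print(find_duplicates(training, testing))  # ['faucet.test.wav']
```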
If the network performed great, fantastic! But what if it performed poorly? There could be a variety of reasons, but the most common ones are:
The data does not look like other data the network has seen before. This is common when someone uses the device in a way that you didn't add to the training set. You can add the current file to the training set by adding the correct label in the 'Expected outcome' field, clicking ⋮, then selecting Move to training set.
The model has not been trained enough. Increase the number of epochs to 200 and see if performance increases (the classified file is stored, and you can load it through 'Classify existing validation sample').
The model is overfitting and thus performs poorly on new data. Try reducing the number of epochs, reducing the learning rate, or adding more data.
The neural network architecture is not a great fit for your data. Play with the number of layers and neurons and see if performance improves.
As you see, there is still a lot of trial and error when building neural networks. One place to start improving the performance of your model is the EON Tuner tutorial. The tuner will automatically test different DSP and NN parameters to improve performance with your dataset.
With the impulse designed, trained and verified you can deploy this model back to your device. This makes the model run without an internet connection, minimizes latency, and runs with minimum power consumption. Edge Impulse can package up the complete impulse - including the MFE algorithm, neural network weights, and classification code - in a single C++ library that you can include in your embedded software.
Mobile phone
Your mobile phone can build and download the compiled impulse directly from the mobile client. See 'Deploying back to device' on the Using your mobile phone page.
To export your model, click on Deployment in the menu. Then under 'Build firmware' select your development board, and click Build. This will export the impulse, and build a binary that will run on your development board in a single step. After building is completed you'll get prompted to download a binary. Save this on your computer.
When you click the Build button, you'll see a pop-up with text and video instructions on how to deploy the binary to your particular device. Follow these instructions. Once you are done, we are ready to test your impulse out.
We can connect to the board's newly flashed firmware over serial. Open a terminal and run the Edge Impulse CLI command edge-impulse-run-impulse.
Serial daemon
If the device is not connected over WiFi, but is instead connected via the Edge Impulse serial daemon, you'll need to stop the daemon. Only one application can connect to the development board at a time.
This will capture audio from the microphone, run the MFE code, and then classify the spectrogram:
Great work! You've captured data, trained a model, and deployed it to an embedded device. It's time to celebrate by pouring yourself a nice glass of water, and checking whether the sound is correctly classified by your model.
Congratulations! You've used Edge Impulse to train a neural network model capable of recognizing a particular sound. There are endless applications for this type of model, from monitoring industrial machinery to recognizing voice commands. Now that you've trained your model, you can integrate your impulse into the firmware of your own embedded device; see Running your impulse locally. There are examples for Mbed OS, Arduino, STM32CubeIDE, Zephyr, and any other target that supports a C++ compiler.
Or if you're interested in more, see our tutorials on Continuous motion recognition or Adding sight to your sensors. If you have a great idea for a different project, that's fine too. Edge Impulse lets you capture data from any sensor, build custom processing blocks to extract features, and you have full flexibility in your Machine Learning pipeline with the learning blocks.
We can't wait to see what you'll build! 🚀