Responding to your voice
This tutorial is for the Syntiant hardware only. For other development boards, you can follow the standard Responding to your voice tutorial
In this tutorial, you'll use machine learning to build a system that can recognize audible events, particularly your voice through audio classification. The system you create will resemble giving audio commands to control a remote control (RC) car such as 'go' and 'stop', even in the presence of other background noise or background chatter. For the best user experience please use the Chrome browser.
You'll learn how to collect audio data from the microphone, use signal processing to extract the most important information, and train a deep neural network that can tell you whether your keyword was heard in a given clip of audio. Finally, you'll deploy the system to your Syntiant TinyML Board and evaluate how well it works.
At the end of this tutorial, you'll have a firm understanding of how to classify audio using Edge Impulse for Syntiant hardware.
Before starting the tutorial
After signing up for a free Edge Impulse account, clone the finished project, including all training data, signal processing and machine learning blocks here: Tutorial: Syntiant-RC. At the end of the process you will have the full project that comes pre-loaded with approx. 2.5 hours of training data.
Production ready systems
While the example Syntiant-RC tutorial project provides good performance, it is not intended to be production ready and is designed to maximize the out-of-box user experience. It is in essence a training vehicle for users of both Edge Impulse and Syntiant to understand the entire workflow. Production ready systems need to have a much more robust training data set to further reduce the likelihood of false positive and negatives.
For this tutorial you'll need the:
The Syntiant TinyML Board shows up as USB microphone once plugged in, and Edge Impulse can use this interface to record audio directly.
Device compatibility
Edge Impulse can ingest data from any device - including embedded devices that you already have in production. See the documentation for the Ingestion service for more information.
In this tutorial we want to build a system that recognizes keywords that resemble giving commands to a remote control car such as 'go' and 'stop'. Although the aforementioned public project comes pre-loaded with approx. 2.5 hours of training data, in order to add additional audio samples including your own, we'll show you how you can record audio samples directly from the Syntiant TinyML Board's audio interface.
To collect your own voice samples, ensure you have selected your system's microphone interface as the "Arduino MKRZero". Then, go to Devices -> Connect a new device, and choose the option of "Use Your Computer" and allow access to your microphone.
Set your keyword as the label and set your sample length to 10s. Then click Start Recording and start saying your keyword over and over again (with a slight pause in between each utterance).
Note: You can use your Mobile phone as a sensor as well.
Afterwards you have a file like this, clearly showing your keywords, separated by some noise. The new data sample will show up in the appropriate Training or Test data bucket.
This data is not suitable for Machine Learning yet though. You will need to cut out the parts where you say your keyword. This is important because you only want the actual keyword to be labeled as such, and not accidentally labeled as noise, or incomplete parts of the utterance. Fortunately the Edge Impulse Studio can do this for you. Click ⋮
next to your sample, and select Split sample.
If you have a short keyword, enable Shift samples to randomly shift the sample around in the window, and then click Split. You now have individual 1s. long samples in your dataset. Perfect!
About this section
This section goes through general guidance for collecting your audio data from scratch and not all steps are required for the out-of-box experience workflow, other than adding your own voice samples. Read on to understand more detail about how the dataset was constructed.
Now that you know how to collect keyword data let's consider other data we need to collect. In addition to your keyword we'll also need audio that is not your keyword. Like background noise, the TV playing, and humans saying other words, all of which go into the openset class. This class is labeled as and will be referred to as 'z_openset' from here on out. This is required because a machine learning model has no idea about right and wrong (unless those are your keywords), but only learns from the data you feed into it. The more varied your data is, the better your model will work.
For each of your classes (in this case 'go', 'stop', and 'z_openset') you want to capture an even amount of data (balanced datasets work better) - and for a decent keyword spotting model you'll want at the VERY minimum 10 minutes in each class if you're building your dataset from scratch. For the Syntiant-RC project we've used a subset of the 'Google 30' speech command set for 'go' and 'stop'.
To make this model more responsive to your own voice, do this in the same manner as above. One way of doing this would be to collect 10 seconds clips, then automatically split this data. Make sure to capture wide variations of the keyword and cover high and low pitches.
To add additional data to the 'z_openset' dataset label you can either collect this yourself, or make your life a bit easier by using dataset of both 'noise' (all kinds of background noise) and 'unknown' (random words) data that we built for you here: Pre-built datasets > Keyword spotting. In the Syntiant-RC project we used short 1 second clips of NPR radio recordings. We consider the 'z_openset' a 'negative' class since it does not contain any of the keywords that are of interest such as 'go' and 'stop' (which are considered 'positive' classes).
This is entirely optional, but to import this data, go to Data acquisition, click the Upload icon, and select a number of 'noise' or 'unknown' samples (there's 25 minutes of each class, but you can select less files if you want), and clicking Begin upload. The data is automatically labeled and added to your project. Make sure you label them 'z_openset' before uploading so that they go into the pre-existing 'z_openset' category.
If you've collected all your training data through the 'Record new data' widget you'll have all your keywords in the 'Training' dataset. This is not great, because you want to keep 20% of your data separate to validate the machine learning model. To mitigate this you can go to Dashboard and select Perform train/test split. This will automatically split your data between a training class (80%) and a testing class (20%). Afterwards you should see something like this:
Next steps
In the next steps we walk you through in detail how the Syntiant signal processing and neural network blocks were configured. If you imported the project, these are already pre-configured for you so you can just read on to understand more details.
With the data set in place you can design an impulse. An impulse takes the raw data, slices it up in smaller windows, uses signal processing blocks to extract features, and then uses a learning block to classify new data. Signal processing blocks always return the same values for the same input and are used to make raw data easier to process, while learning blocks learn from past experiences.
For this tutorial we'll use the "Syntiant" signal processing block. The Syntiant processing block is similar to the Audio MFE block, but using a log-MEL scale plus other transformations specific to the Syntiant audio front-end.
We'll then pass this simplified audio data into a Neural Network block, which will learn to distinguish between the three classes of audio.
In the Studio, go to the Create impulse tab, add a Time series data, an Audio (Syntiant) and a Neural Network (Keras) block. Set the window size to 968 ms and the window increase to 484 ms (we'll explain why later), verify that sampling frequency is set to 16000 Hz and click Save Impulse.
Now that we've assembled the building blocks of our Impulse, we can configure each individual part. Click on the Syntiant tab in the left hand navigation menu. You'll see a page that looks like this:
This page allows you to configure some parameters of the Syntiant block, and lets you preview how the data will be transformed. The right of the page shows a visualization of the Syntiant's output for a piece of audio. The Syntiant processing block extracts speech features using filterbanks. To get a better understanding of parameters, have a look at the Audio MFE documentation.
Syntiant block parameters
The number of generated features has to be 1,600, which corresponds to the Syntiant Neural Network input layer. To generate 1,600 features, you have to verify the following equation: window size = 1000 x (39 x frame stride + frame length). For our example: window size = 968 ms = 1000 x (39 x 0.024 + 0.032).
In the spectrogram the vertical axis represents the frequencies (the number of frequency bands is controlled by 'Number of coefficients' parameter), and the horizontal axis represents time (controlled by 'frame stride' and 'frame length'). The patterns visible in a spectrogram contain information about what type of sound it represents. For example, the spectrogram in this image shows "Go":
These differences are not necessarily easy for a person to describe, but fortunately they are enough for a neural network to learn to identify.
It's interesting to explore your data and look at the types of spectrograms it results in. You can use the dropdown box near the top right of the page to choose between different audio samples to visualize, or play with the parameters to see how the spectrogram changes.
The spectrograms generated by the Syntiant block will be passed into a neural network architecture that is particularly good at learning to recognize patterns in this type of tabular data. Before training our neural network, we'll need to generate Syntiant blocks for all of our windows of audio. To do this, click the Generate features button at the top of the page, then click the green Generate features button. This will take a minute or so to complete.
Afterwards you're presented with one of the most useful features in Edge Impulse: the feature explorer. This is a 3D representation showing your complete dataset, with each data-item color-coded to its respective label. You can zoom in to every item, find anomalies (an item that's in a wrong cluster), and click on items to listen to the sample. This is a great way to check whether your dataset contains wrong items, and to validate whether your dataset is suitable for ML (it should separate nicely).
With all data processed it's time to start training a neural network. Neural networks are algorithms, modeled loosely after the human brain, that can learn to recognize patterns that appear in their training data. The network that we're training here will take the processing block features as an input, and try to map this to one of the three classes — 'go', 'stop', or 'z_openset'.
Click on NN Classifier in the left hand menu. You'll see the following page:
The neural network architecture is automatically filled with the Syntiant configuration: 3 Dense Layers of 256 neurons and 3 Dropout layers.
With everything in place, click Start training. You'll see a lot of text flying past in the Training output panel, which you can ignore for now. Training will take a few minutes. When it's complete, you'll see the Last training performance panel appear at the bottom of the page:
Congratulations, you've trained a neural network with Edge Impulse and ready to deploy on the Syntiant TinyML Board! But what do all these numbers mean?
At the start of training, 20% of the training data is set aside for validation. This means that instead of being used to train the model, it is used to evaluate how the model is performing. The Last training performance panel displays the results of this validation, providing some vital information about your model and how well it is working. Bear in mind that your exact numbers may differ from the ones in this tutorial.
On the left hand side of the panel, Accuracy refers to the percentage of windows of audio that were correctly classified. The higher number the better, although an accuracy approaching 100% is unlikely, and is often a sign that your model has overfit the training data. You will find out whether this is true in the next stage, during model testing. For many applications, an accuracy above 85% can be considered very good.
The Confusion matrix is a table showing the balance of correctly versus incorrectly classified windows. To understand it, compare the values in each row. For example, in the above screenshot, 98.8% of the 'go' audio windows were classified as 'go', while 1.8% of them were classified to be in the 'stop'. This appears to be a great result.
The performance numbers in the previous step show that our model is working well on its training data, but it's extremely important that we test the model on new, unseen data before deploying it in the real world. This will help us ensure the model has not learned to overfit the training data, which is a common occurrence.
Fortunately we've put aside 20% of our data already in the 'Test set' (see Data acquisition). This is data that the model has never seen before, and we can use this to validate whether our model actually works on unseen data. To run your model against the test set, head to Model testing, and click Classify all.
To drill down into a misclassified sample, click the three dots (⋮
) next to a sample and select Show classification. You're then transported to the classification view, which lets you inspect the sample, and compare the sample to your training data. This way you can inspect whether this was actually a classification failure, or whether your data was incorrectly labeled. From here you can either update the label (when the label was wrong), or move the item to the training set to refine your model.
Misclassifications and uncertain results
It's inevitable that even a well-trained machine learning model will sometimes misclassify its inputs. When you integrate a model into your application, you should take into account that it will not always give you the correct answer.
For example, if you are classifying audio, you might want to classify several windows of data and average the results. This will give you better overall accuracy than assuming that every individual result is correct.
With the impulse designed, trained and verified you can deploy this model back to your device. This makes the model run without an internet connection, minimizes latency, and runs with minimum power consumption.
To export your model, click on Deployment in the menu. Then under 'Build firmware' select the Syntiant development board,
The final step before building the firmware is to configure the posterior handler parameters of the Syntiant chip.
Pre-configured posterior parameters
For the Syntiant-RC project, we've already pre-configured the posterior parameters so you can just go to the 'Build' output step. We recommend skipping to Step 9, but read on for more details about the process of posterior search.
Those parameters are used to tune the precision and recall of the neural network activations, to minimize False Rejection Rate and False Activation Rate. You can manually edit those parameters in JSON format or use the Find posterior parameters to search for the best values:
Select the classes you want to detect (the z_openset class should be omitted except for testing purpose)
Select a calibration dataset: either no calibration (fastest and recommended), the reference dataset with common english words from a radio program, or your own calibration dataset.
This calibration dataset is made of 2 files: an audio wav file and a csv file which contains the audio transcript of this program. This audio transcript should ideally have the classes you want to detect. For example you can imagine the following csv file: go,stop,this,is,an,example,transcript,for,optimizing,the,posterior,parameters,,,it,will,optimize,activations,for,the,go,stop,keywords,
You can also simplify the csv file and include only the keywords/classes you are interested in optimizing. For instance, if your audio wav files contains only 2 occurrences of 'go' and 'stop': go,stop,go,stop,
Tuning the audio gain
After generating the posterior parameters, you can change the audio gain by editing the "audio_pdm_gain" value at the end of the JSON file. Default value is set to 12 dB.
Once optimized parameters have been found, you can click Build. This will build a Syntiant package that will run on your development board. After building is completed you'll get prompted to download a zipfile. Save this on your computer. A pop-up video will show how the download process works.
After unzipping the downloaded file, run the appropriate flashing script for your platform (Linux, Mac, or Win 10) to flash the Syntiant TinyML Board with the Syntiant-RC model and associated firmware. You might see a Microsoft Defender screen pop up when the script is run on Windows 10. It's safe to proceed so select 'More info' and continue.
We can connect to the board's newly flashed firmware over serial. Open a terminal, connect to the Syntiant TinyML board using the appropriate COM port with 115200 8-N-1 settings. Utter the words 'go', 'stop' and you should see something like the following:
You can also print out predictions by running the edge-impulse-run-impulse
command in your terminal.
Congratulations! Now that you've trained your model you can integrate your impulse in the firmware of your TinyML board, see Running your impulse locally on Syntiant TinyML Board.
Or if you're interested in more, see our tutorial on Motion Recognition - Syntiant.
We can't wait to see what you'll build! 🚀