AI generated image
Wearable
Small device, huge potential
EI data acquisition
ElevenLabs API Key
dog_howling, dog_barking and environment respectively. For each prompt, I used a prompt influence of 0.6 (this generated the best sounds), set “Number of samples” to 6, “Minimum length (seconds)” to 1, “Frequency (Hz)” to 16000, and “Upload to category” to training. With this configuration, clicking the “Generate data” button in Edge Impulse Studio generates six 1-second audio samples for one class. To generate sounds for another class, we simply enter its prompt and leave the other fields unchanged. Using this configuration, I generated around 39 minutes of audio consisting of dogs barking, dogs howling and environment (e.g., city streets, construction sites, people talking) sounds.
However, after experimenting with various models, I noticed significant bias toward the dog barking class: the models classified almost any unheard sound as a dog bark (in other words, they were overpredicting the dog bark class). To address this, I created another class, noise, consisting of 10-minute recordings from quiet environments with conversations, silence, and low machine sounds such as a refrigerator and a fan. I uploaded the recordings to the Edge Impulse project and used the split data tool to extract 1-second audio samples from each recording. After several experiments, I observed that the model actually performed best with only 3 classes: dog barking, dog howling and noise. I therefore disabled the environment class audio files in the dataset, and this class was ignored during pre-processing, model training and deployment.
Wrong predictions from the model (bias)
noise sound recordings, I had around 36 minutes of sound data for both training and testing. In AI, the more representative data you have, the better the model will perform; for this demonstration project, I found the dataset size to be adequate.
Dataset
EI project dataset
Edge Impulse Experiments
First Impulse design
Autotune parameters
First Impulse features
First Impulse model training
First Impulse model testing
Create new impulse
Second Impulse design
Second Impulse features
Third Impulse design
Third Impulse features
Third Impulse model training
Fourth Impulse design
Fourth Impulse features
Fourth Impulse model training
Impulse 1 deployment
esp32_microphone example code. The deployment steps are also documented in the XIAO ESP32S3 deployment tutorial. Once we open the esp32_microphone sketch, we need to change the I2S library, update the microphone functions, and enable the ESP-NN accelerator as described by MJRoBot (Marcelo Rovai) in Step 6. You can also obtain the complete updated code from this GitHub repository. Before uploading the code, we can follow the Seeed Studio documentation to install the ESP32 boards package in the Arduino IDE and then select the XIAO ESP32S3 board. With the XIAO ESP32S3 board still connected to the computer, we can open the Serial Monitor and see the inference results. The Digital Signal Processing (DSP) takes around 475 ms and the model takes around 90 ms to classify the sound, which is very impressive. However, when I played YouTube videos of dog sounds in front of the XIAO ESP32S3, like this one, the model did not correctly classify dog barks, and most of the confidence was on noise. Although this appears to be a model issue, it more likely stems from the difference in sound quality between training and inference: the test with synthetic data performed well, but deployment performance did not match. The sounds captured during inference contain background noise, the volume of the dog sounds differs, and overall the recordings are not as clear as the dataset samples.
Impulse 1 inference results
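For reference, the sketch below condenses those modifications into a minimal example. It is a sketch under stated assumptions, not the exact deployed code: it assumes the 2.0.x ESP32 Arduino core's I2S.h API used in the tutorial, the XIAO ESP32S3 Sense's PDM microphone pins (clock on GPIO42, data on GPIO41), and an Edge Impulse Arduino library exported with the hypothetical header name dog_sound_inferencing.h.

```cpp
#include <I2S.h>                     // 2.0.x ESP32 Arduino core I2S library
#include <dog_sound_inferencing.h>   // hypothetical name of the exported library

// One window of raw audio (16 kHz x 1 s with the settings used here).
static int16_t sample_buffer[EI_CLASSIFIER_RAW_SAMPLE_COUNT];

// Let the SDK pull float samples out of the int16 buffer on demand.
static int get_audio_data(size_t offset, size_t length, float *out_ptr) {
    numpy::int16_to_float(&sample_buffer[offset], out_ptr, length);
    return 0;
}

void setup() {
    Serial.begin(115200);
    // XIAO ESP32S3 Sense PDM microphone: clock on GPIO42, data on GPIO41.
    I2S.setAllPins(-1, 42, 41, -1, -1);
    if (!I2S.begin(PDM_MONO_MODE, EI_CLASSIFIER_FREQUENCY, 16)) {
        Serial.println("Failed to initialize I2S!");
        while (1) yield();
    }
    // Per the tutorial, ESP-NN acceleration is enabled separately by editing
    // ei_classifier_config.h inside the exported library.
}

void loop() {
    // Fill one inference window, skipping the invalid readings the
    // 2.0.x I2S library returns between valid samples.
    size_t i = 0;
    while (i < EI_CLASSIFIER_RAW_SAMPLE_COUNT) {
        int sample = I2S.read();
        if (sample != 0 && sample != -1 && sample != 1) {
            sample_buffer[i++] = (int16_t)sample;
        }
    }

    signal_t signal;
    signal.total_length = EI_CLASSIFIER_RAW_SAMPLE_COUNT;
    signal.get_data = &get_audio_data;

    ei_impulse_result_t result = { 0 };
    if (run_classifier(&signal, &result, false) != EI_IMPULSE_OK) {
        return;
    }

    // Print per-class confidences, as seen in the Serial Monitor.
    for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
        ei_printf("%s: %.5f\n", result.classification[ix].label,
                  result.classification[ix].value);
    }
}
```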
model-parameters and tflite-model folders to the first Impulse's Arduino library folder, overwriting the existing files with the updated model parameters. Unfortunately, this model is not able to run on the ESP32S3 board and we get the error "failed to allocate tensor arena". The tensor arena is the block of RAM that TensorFlow Lite Micro reserves for a model's input, output and intermediate tensors, so this error means that we have run out of RAM on the ESP32S3.
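A quick way to confirm the memory shortage is to print the board's free memory before the classifier attempts the allocation. This is a minimal diagnostic sketch assuming the Arduino-ESP32 core APIs:

```cpp
// Minimal diagnostic: inspect available memory on the ESP32S3 before the
// classifier tries to allocate its tensor arena (Arduino-ESP32 core APIs).
void setup() {
    Serial.begin(115200);
    delay(2000);  // give the Serial Monitor time to attach

    Serial.printf("Free internal heap:        %u bytes\n", ESP.getFreeHeap());
    Serial.printf("Largest allocatable block: %u bytes\n", ESP.getMaxAllocHeap());
    Serial.printf("Free PSRAM:                %u bytes\n", ESP.getFreePsram());
}

void loop() {}
```

If the largest allocatable block is smaller than the arena the model needs, the allocation fails even when the total free heap looks sufficient.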
Lastly, I experimented with deploying the MFCC Impulse. This pre-processing algorithm works best for speech recognition, but model training and testing showed that it also performs well for detecting dog sounds. Following similar steps, I deployed the fourth Impulse using the EON Compiler and the Quantized (int8) model optimizations. Surprisingly, this Impulse (using the MFCC processing block) delivers the best performance, even compared to the MFE pre-processing block: the Digital Signal Processing (DSP) takes approximately 285 ms, with classification taking about 15 ms. This Impulse identifies dog sounds accurately and with high confidence, demonstrating the impact that the choice of DSP block has on model performance!
Impulse 4 inference results
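The DSP and classification times quoted above come straight from the SDK: run_classifier() fills in the timing member of ei_impulse_result_t. A small helper like this hedged sketch (again assuming the hypothetical dog_sound_inferencing.h library name) prints the same kind of report:

```cpp
#include <dog_sound_inferencing.h>  // hypothetical name of the exported library

// Print the per-stage timing and per-class confidences that run_classifier()
// stores in an ei_impulse_result_t (fields from the Edge Impulse C++ SDK).
void print_inference_report(const ei_impulse_result_t &result) {
    ei_printf("DSP: %d ms, classification: %d ms\n",
              result.timing.dsp, result.timing.classification);
    for (size_t ix = 0; ix < EI_CLASSIFIER_LABEL_COUNT; ix++) {
        ei_printf("  %s: %.5f\n",
                  result.classification[ix].label,
                  result.classification[ix].value);
    }
}
```

Calling this after each run_classifier() call makes it easy to compare the MFE and MFCC Impulses on the same hardware.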
Wearable parts
Attaching straps using super glue
Wearable on stand
Soldering battery wires
Connecting the OLED to the XIAO ESP32S3
Using wearable case
Assembled electronics
Wearable classifying safe environment
Wearable classifying dog barking sound
Wearable classifying dog howling sound
Wearable
EON Tuner