Like the Spectrogram block, the Audio MFE processing block extracts time and frequency features from a signal. However, it uses a non-linear scale in the frequency domain, called the Mel-scale. It performs well on audio data, mostly in non-voice recognition use cases where the sounds to be classified can be distinguished by the human ear.
MFE spectrogram of an alarm (1-sec window)
Mel-filterbank energy features
- Frame length: The length of each frame in seconds
- Frame stride: The step between successive frames in seconds
- Filter number: The number of triangular filters applied to the spectrogram
- FFT length: The FFT size
- Low frequency: Lowest band edge of Mel-scale filterbanks
- High frequency: Highest band edge of Mel-scale filterbanks
- Noise floor (dB): Signal below this level is dropped
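To get a feel for how Frame length, Frame stride, and FFT length shape the output, the following minimal sketch computes the frame and frequency-bin counts for a window. It assumes no padding at the window edges, and the function name `frame_layout` and the example values are hypothetical; the block's actual framing may differ.

```python
def frame_layout(num_samples, sample_rate, frame_length_s, frame_stride_s, fft_length):
    """Return (samples per frame, samples per stride, frame count, FFT bin count).

    Assumption: frames are taken without padding, so a trailing partial
    frame is discarded.
    """
    frame_len = int(round(frame_length_s * sample_rate))
    stride = int(round(frame_stride_s * sample_rate))
    if num_samples < frame_len:
        num_frames = 0
    else:
        num_frames = 1 + (num_samples - frame_len) // stride
    # A real-input FFT of size N yields N // 2 + 1 non-redundant bins.
    num_bins = fft_length // 2 + 1
    return frame_len, stride, num_frames, num_bins


# Hypothetical example: 1-second window at 16 kHz, 20 ms frames, 10 ms stride.
layout = frame_layout(16000, 16000, 0.02, 0.01, 512)
print(layout)  # (320, 160, 99, 257)
```

A shorter stride gives more frames (finer time resolution), while a larger FFT length gives more frequency bins before the Mel filters are applied.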
Feature extraction is similar to the Spectrogram block (the Frame length, Frame stride, and FFT length parameters are the same), but it adds two extra steps.
After the spectrogram is computed, triangular filters are applied on a Mel-scale to extract frequency bands. The Filter number, Low frequency, and High frequency parameters set the number of frequency features to extract and the frequency range they cover. The Mel-scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The idea is to extract more features (more filter banks) in the lower frequencies and fewer in the higher frequencies, which is why it performs well on sounds that the human ear can distinguish.
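The triangular filters can be sketched with NumPy as follows. This is an illustrative sketch, not the block's implementation: it assumes the common formula mel = 2595 · log10(1 + f / 700) for the Hz-to-Mel conversion, and the function names and example parameter values are hypothetical.

```python
import numpy as np


def hz_to_mel(hz):
    # Assumed conversion formula; other Mel variants exist.
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(num_filters, fft_length, sample_rate, low_hz, high_hz):
    """Return (filters, center_hz); filters has shape (num_filters, fft_length // 2 + 1)."""
    # num_filters + 2 edge points, equally spaced on the Mel-scale.
    mel_points = np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz), num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    # Frequency (in Hz) of each FFT bin.
    bin_freqs = np.fft.rfftfreq(fft_length, d=1.0 / sample_rate)

    filters = np.zeros((num_filters, bin_freqs.size))
    for i in range(num_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: rises from left edge to center, falls to right edge, zero elsewhere.
        filters[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return filters, hz_points[1:-1]


# Hypothetical settings: 40 filters, FFT length 256, 16 kHz audio, 0 Hz to 8 kHz.
filters, centers = mel_filterbank(40, 256, 16000, 0.0, 8000.0)

# Applying the filterbank to a power spectrogram of shape (num_frames, 129):
power_spec = np.ones((99, 129))
mel_energies = power_spec @ filters.T  # shape (99, 40)
```

Because the edge points are evenly spaced in Mel rather than in Hz, the filter centers sit close together at low frequencies and spread out at high frequencies, which is the "more features in the lower frequencies" behavior described above.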
The last step is to perform a local mean normalization of the signal, applying the Noise floor value to the power spectrum.
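As a rough illustration of how a noise floor can be applied, the sketch below converts the energies to decibels, clamps everything below the floor, and rescales the result to the range 0 to 1. This is one plausible interpretation under stated assumptions (0 dB as the upper reference, a negative floor value), not the block's exact normalization; the function name `apply_noise_floor` and the value of -52 dB are hypothetical.

```python
import numpy as np


def apply_noise_floor(mel_energies, noise_floor_db=-52.0, eps=1e-10):
    """Clamp energies below the noise floor and scale to [0, 1].

    Assumptions: noise_floor_db is negative and 0 dB is treated as the
    maximum level. The actual block may normalize differently.
    """
    if noise_floor_db >= 0:
        raise ValueError("this sketch assumes a negative noise floor")
    # Convert power to decibels; eps avoids log10(0).
    db = 10.0 * np.log10(np.maximum(mel_energies, eps))
    # Anything quieter than the floor is dropped (clamped to the floor).
    db = np.maximum(db, noise_floor_db)
    # Map [noise_floor_db, 0 dB] to [0, 1].
    scaled = (db - noise_floor_db) / (-noise_floor_db)
    return np.clip(scaled, 0.0, 1.0)


# Energies at -80 dB, -30 dB, and 0 dB.
normalized = apply_noise_floor(np.array([1e-8, 1e-3, 1.0]))
```

With a floor of -52 dB, the -80 dB value maps to 0, the 0 dB value maps to 1, and the -30 dB value lands in between at roughly 0.42.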