Like the Spectrogram block, the Audio MFE processing block extracts time and frequency features from a signal. However, it uses a non-linear scale in the frequency domain, called the Mel-scale. It performs well on audio data, mostly for non-voice recognition use cases where the sounds to be classified can be distinguished by the human ear.
Mel-filterbank energy features
- Frame length: The length of each frame in seconds
- Frame stride: The step between successive frames in seconds
- Filter number: The number of triangular filters applied to the spectrogram
- FFT length: The FFT size
- Low frequency: Lowest band edge of Mel-scale filterbanks
- High frequency: Highest band edge of Mel-scale filterbanks
- Window size: The size of the sliding window for local cepstral mean normalization. The window size must be odd
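As a rough illustration of how the parameters above relate to one another, the sketch below groups them in a hypothetical `MFEConfig` container. The class name, field names, and default values are placeholders chosen for this example, not the block's actual API or defaults. The frame count formula assumes no padding at the end of the signal, which may not match what the block does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MFEConfig:
    """Hypothetical container for the MFE parameters (illustrative defaults)."""
    frame_length_s: float = 0.02    # Frame length, in seconds
    frame_stride_s: float = 0.01    # Frame stride, in seconds
    num_filters: int = 40           # Filter number
    fft_length: int = 512           # FFT length, in points
    low_freq_hz: float = 300.0      # Low frequency (lowest band edge)
    high_freq_hz: float = 8000.0    # High frequency (highest band edge)
    window_size: int = 101          # Normalization window size

    def validate(self, sample_rate_hz: int) -> None:
        # The documentation requires an odd window size.
        if self.window_size % 2 == 0:
            raise ValueError("Window size must be odd")
        # Band edges above half the sample rate (Nyquist) carry no information.
        if not 0 <= self.low_freq_hz < self.high_freq_hz <= sample_rate_hz / 2:
            raise ValueError("Need 0 <= low < high <= sample_rate / 2")

    def frame_count(self, num_samples: int, sample_rate_hz: int) -> int:
        # Convert seconds to samples, then count full frames (no padding assumed).
        frame_len = round(self.frame_length_s * sample_rate_hz)
        stride = round(self.frame_stride_s * sample_rate_hz)
        if num_samples < frame_len:
            return 0
        return 1 + (num_samples - frame_len) // stride


cfg = MFEConfig()
cfg.validate(sample_rate_hz=16000)
# One second of 16 kHz audio, 20 ms frames, 10 ms stride.
print(cfg.frame_count(num_samples=16000, sample_rate_hz=16000))
```

Under these assumptions, each frame yields one feature per filter, so the output size is the frame count multiplied by the filter number.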
Feature extraction is similar to the Spectrogram block (the Frame length, Frame stride, and FFT length parameters are the same), but it adds two extra steps.
After computing the spectrogram, triangular filters are applied on a Mel-scale to extract frequency bands. They are configured with the Filter number, Low frequency, and High frequency parameters, which select the frequency range and the number of frequency features to be extracted. The Mel-scale is a perceptual scale of pitches judged by listeners to be equal in distance from one another. The idea is to extract more features (more filter banks) in the lower frequencies and fewer in the higher frequencies, which is why it performs well on sounds that can be distinguished by the human ear.
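The filtering step can be sketched as follows. This is a minimal illustration, not the block's implementation: it assumes one commonly used Hz-to-Mel formula (`2595 * log10(1 + f / 700)`), and all function names are hypothetical. Band edges are spaced evenly on the Mel-scale, so the filters become wider as frequency increases.

```python
import math


def hz_to_mel(freq_hz):
    # One common Mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_filters, low_hz, high_hz):
    # num_filters triangles need num_filters + 2 edge points,
    # spaced evenly in Mel and converted back to Hz.
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(num_filters + 2)]


def triangular_filterbank(num_filters, fft_length, sample_rate_hz, low_hz, high_hz):
    edges = mel_band_edges(num_filters, low_hz, high_hz)
    num_bins = fft_length // 2 + 1        # bins of a one-sided spectrum
    bin_hz = sample_rate_hz / fft_length  # frequency spacing between bins
    bank = []
    for i in range(num_filters):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        row = []
        for k in range(num_bins):
            f = k * bin_hz
            if left < f <= center:        # rising slope
                weight = (f - left) / (center - left)
            elif center < f < right:      # falling slope
                weight = (right - f) / (right - center)
            else:
                weight = 0.0
            row.append(weight)
        bank.append(row)
    return bank


def apply_filterbank(power_spectrum, bank):
    # One energy value per filter for a single spectrogram frame.
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]


edges = mel_band_edges(num_filters=10, low_hz=0.0, high_hz=8000.0)
widths = [b - a for a, b in zip(edges, edges[1:])]
print([round(w) for w in widths])  # widths grow with frequency
```

Printing the widths shows narrow bands at low frequencies and wide bands at high frequencies, which is the non-linear resolution described above.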
The last step is to perform a local mean normalization of the signal.
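A minimal sketch of local mean normalization over a one-dimensional sequence is shown below. It assumes plain mean subtraction with the window truncated at the edges of the sequence; the function name is hypothetical, and the block's actual edge handling and any additional scaling are not specified here.

```python
def local_mean_normalize(values, window_size):
    """Subtract from each value the mean of a window centered on it."""
    if window_size % 2 == 0:
        raise ValueError("Window size must be odd")
    half = window_size // 2
    n = len(values)
    normalized = []
    for i in range(n):
        # Truncate the window at the sequence boundaries (an assumption).
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        local_mean = sum(values[lo:hi]) / (hi - lo)
        normalized.append(values[i] - local_mean)
    return normalized


print(local_mean_normalize([1.0, 2.0, 3.0, 4.0, 5.0], window_size=3))
# [-0.5, 0.0, 0.0, 0.0, 0.5]
```

An odd window size keeps the window symmetric around the value being normalized, which is one reason for the odd-size requirement.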
Note: The Window size parameter corresponds to a number of samples. For example, for a 16 kHz signal and a window size of 301, the local normalization is performed on an 18.8 ms window (301 × 1/16,000 s). This window can be smaller or larger than the frame length.
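The arithmetic in the note can be checked with a short helper. The function name is hypothetical and only restates the conversion from a sample count to a duration.

```python
def window_duration_ms(window_size, sample_rate_hz):
    # Duration of window_size samples, in milliseconds.
    return 1000.0 * window_size / sample_rate_hz


duration = window_duration_ms(window_size=301, sample_rate_hz=16000)
print(duration)  # 18.8125
```

Comparing this value with the frame length (for example, 20 ms) shows whether the normalization window is shorter or longer than a frame.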