Anomaly detection (K-means)
Neural networks are great, but they have one big flaw: they are terrible at dealing with data they have never seen before (like a new gesture). A neural network cannot judge this, as it is only aware of its training data. If you give it something unlike anything it has seen before, it will still classify it as one of the four classes.
K-means clustering helps with this. The method looks at the data points in a dataset and groups those that are similar into a predefined number K of clusters. A threshold value can be added to detect anomalies: if the distance between a data point and its nearest centroid is greater than the threshold value, the data point is considered an anomaly.
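The threshold rule above can be illustrated with a minimal sketch. The centroids and the threshold value below are hypothetical and chosen only for illustration; in practice the centroids come from training and the threshold is tuned on your own data.

```python
import math

def nearest_centroid_distance(point, centroids):
    """Return the Euclidean distance from point to its closest centroid."""
    return min(math.dist(point, c) for c in centroids)

def is_anomaly(point, centroids, threshold):
    """Flag the point when it is farther than threshold from every centroid."""
    return nearest_centroid_distance(point, centroids) > threshold

# Hypothetical centroids, as if produced by an earlier K-means run with K = 2.
centroids = [(0.0, 0.0), (10.0, 10.0)]
threshold = 2.0  # assumed value, not a recommended default

print(is_anomaly((0.5, 0.5), centroids, threshold))  # close to the first centroid
print(is_anomaly((5.0, 5.0), centroids, threshold))  # far from both centroids
```

A larger threshold makes the detector more tolerant of unusual data, while a smaller one flags more points as anomalies.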
The main difficulty resides in choosing K, since data in a time series is always changing and different values of K might be ideal at different times. Besides, in more complex scenarios where there are both local and global outliers, many outliers might pass under the radar and be assigned to a cluster.
K-Means has some overlap with Gaussian Mixture Models (GMM). However, GMMs work on the assumption that the samples within a dataset can be modeled using different Gaussian distributions. If this is not the case for your dataset, K-Means will likely be a better option for you.
In most of our DSP blocks, you have the option to calculate the feature importance. Edge Impulse Studio will then output a Feature Importance list that will help you determine which axes generated from your DSP block are most significant to analyze when you want to do anomaly detection.
The K-Means anomaly detection learning block has two adjustable parameters: the Cluster count and the Axes.
- Cluster count: the number of clusters K.
- Axes: The different axes correspond to the features generated by the pre-processing block. The features on the chosen axes are used as the input data for training.
Click on Start training to trigger the learning process. Once trained you will obtain a view that looks like the following:
Anomaly detection view
Note: By definition, there should not be any anomalies in the training dataset, and thus accuracy is not calculated during training. Run Model testing to learn more about the model performance. Additionally, you can also select a test data sample in the Anomaly Explorer directly on this page.
In the picture above, known clusters are shown in blue and the newly classified data point is shown in orange. It is clearly outside of any known cluster and can thus be tagged as an anomaly.
Here is the process in the background:
- Create K clusters and group all the data.
- For each of these clusters, we store the center and the size of the cluster.
- During inference, we find the closest cluster for a new data point and report the distance from the edge of that cluster. If the point is within a cluster (no anomaly), you thus get a value below 0.
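The three steps above can be sketched in plain Python. This is an illustrative approximation, not the Studio's actual implementation: the function names are hypothetical, the cluster "size" is assumed to be the distance from the center to its farthest training point, and the initial centers are picked at evenly spaced positions in the data only to keep the example reproducible.

```python
import math

def nearest_index(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

def fit_clusters(points, k, iterations=20):
    """Plain K-means; returns a list of (center, radius) pairs."""
    # Assumed initialization: evenly spaced training points.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_index(p, centers)].append(p)
        # Move each center to the mean of its members (keep it if empty).
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    # Store the size of each cluster as the distance to its farthest member.
    radii = [0.0] * k
    for p in points:
        i = nearest_index(p, centers)
        radii[i] = max(radii[i], math.dist(p, centers[i]))
    return list(zip(centers, radii))

def anomaly_score(point, clusters):
    """Distance from the edge of the closest cluster; below 0 means inside."""
    centers = [center for center, _ in clusters]
    center, radius = clusters[nearest_index(point, centers)]
    return math.dist(point, center) - radius

# Hypothetical training data: two small, well-separated groups.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
            (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]
clusters = fit_clusters(training, k=2)

print(anomaly_score((0.5, 0.5), clusters))  # inside a cluster: negative
print(anomaly_score((5.0, 5.0), clusters))  # between clusters: positive
```

Storing only a center and a size per cluster keeps the model small, which suits inference on constrained devices: scoring a new point needs only K distance computations and one subtraction.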