Data augmentation

What is data augmentation?

Data augmentation is a method that can help improve the accuracy of machine learning models. A data augmentation system makes small, random changes to your training data during the training process.

Being exposed to these variations during training can help prevent your model from taking shortcuts by "memorizing" superficial clues in your training data, so it is more likely to learn the deeper underlying patterns in your dataset.

Data augmentation will not work with every dataset

As with most things in machine learning, data augmentation is effective for some datasets and models but not for others. While experimenting with data augmentation, bear in mind that it is not guaranteed to improve your results.

Data augmentation is likely to make the biggest difference when used with small datasets. Large datasets may already contain enough variation that the model is able to identify the true underlying patterns and avoid overfitting to the training data.

Data augmentation techniques

The types of data augmentation that you apply will depend on your data type and use case.

For images, you might apply geometric transformations (rotations, scaling, flipping, cropping), adjust color aspects (brightness, contrast, hue, saturation), inject noise, or apply more advanced augmentations such as mixing images with strategies like CutMix or mixup.
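As a rough illustration, here is a minimal sketch of image augmentation using standard Keras preprocessing layers. The specific layers and parameter values are illustrative choices, not a recommended recipe.

```python
# A minimal sketch of image augmentation with standard Keras preprocessing
# layers (assumes TensorFlow 2.x; layers and parameter values are illustrative).
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),   # geometric: mirror images left/right
    layers.RandomRotation(0.1),        # geometric: rotate by up to ~10% of a full turn
    layers.RandomZoom(0.1),            # geometric: zoom in/out by up to 10%
    layers.RandomContrast(0.2),        # color: vary contrast by up to 20%
    layers.GaussianNoise(0.05),        # inject a small amount of noise
])

# Augmentation is only active when training=True; at inference time these
# layers pass data through unchanged.
# augmented_batch = augment(image_batch, training=True)
```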

For audio, you might apply transformations directly to the raw audio that include mixing in background noise, altering the pitch, perturbing the speed or volume, or randomly cropping and splitting your samples. Rather than altering the raw audio, you might instead apply transformations to audio features, for example spectrograms generated by MFCC or MFE processing, with techniques like SpecAugment.
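For illustration, here is a minimal NumPy sketch of SpecAugment-style masking applied to a single spectrogram. The spectrogram shape and mask sizes are assumptions chosen for the example.

```python
# A minimal NumPy sketch of SpecAugment-style masking on a spectrogram of
# shape (time_steps, freq_bins). Mask sizes are illustrative.
import numpy as np

def spec_augment(spec, max_time_mask=10, max_freq_mask=8, rng=np.random):
    spec = spec.copy()
    time_steps, freq_bins = spec.shape

    # Time mask: zero out a random contiguous range of time steps.
    t = rng.randint(0, max_time_mask + 1)
    t0 = rng.randint(0, max(1, time_steps - t))
    spec[t0:t0 + t, :] = 0.0

    # Frequency mask: zero out a random contiguous range of frequency bins.
    f = rng.randint(0, max_freq_mask + 1)
    f0 = rng.randint(0, max(1, freq_bins - f))
    spec[:, f0:f0 + f] = 0.0

    return spec
```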

Data augmentation and model deployment

Data augmentation occurs only during training. It will have no impact on the memory usage or latency of your model once it has been deployed.
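As a sketch of what this looks like in a typical Keras training script (assuming a tf.data pipeline with placeholder datasets train_ds and val_ds, and reusing the augment model from the earlier image example), augmentation is applied only to the training dataset, so the model you export contains no augmentation operations:

```python
# Sketch: augmentation lives in the training input pipeline, not in the model
# you deploy. Assumes train_ds and val_ds are tf.data.Dataset objects of
# (image, label) pairs and `augment` is the Sequential defined earlier.
import tensorflow as tf

train_ds_augmented = train_ds.map(
    lambda x, y: (augment(x, training=True), y),
    num_parallel_calls=tf.data.AUTOTUNE,
)

# The validation set (and the deployed model) never sees augmentation.
# model.fit(train_ds_augmented, validation_data=val_ds, epochs=30)
```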

Example workflow for using data augmentation

Here is a step-by-step guide to getting the most out of data augmentation.

1. Train a model without data augmentation

There is no guarantee that data augmentation will improve the performance of your model. Before you start experimenting, it's important to train a model without data augmentation and attempt to get the best possible performance. You can use this model as a baseline to understand whether data augmentation improves the performance of your model or not.
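As a rough sketch, assuming a TensorFlow/Keras workflow with the placeholder datasets train_ds and val_ds from earlier, a baseline run might look like this (the architecture and hyperparameters are illustrative):

```python
# Sketch of a baseline model trained without augmentation. The architecture,
# dataset variables (train_ds, val_ds), and hyperparameters are placeholders.
import tensorflow as tf
from tensorflow.keras import layers

baseline = tf.keras.Sequential([
    layers.Input(shape=(96, 96, 1)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
baseline.compile(optimizer="adam",
                 loss="sparse_categorical_crossentropy",
                 metrics=["accuracy"])

# Train on the original, unaugmented data to establish a baseline.
baseline_history = baseline.fit(train_ds, validation_data=val_ds, epochs=30)
```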

2. Train a second model with data augmentation

It's helpful to be able to compare model performance side by side. To allow this, create a second model that has the same settings as the first, with the exception of enabling data augmentation. If there are parameter options for the augmentation, leave the defaults in place.
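Continuing the earlier sketch, one way to create an identical second model in Keras is to clone the baseline architecture and compile it with the same settings (the clone starts from freshly initialized weights):

```python
# Sketch: a second model with identical architecture and compile settings;
# the only difference will be the augmented training data (see earlier sketch).
augmented_model = tf.keras.models.clone_model(baseline)  # same layers, fresh weights
augmented_model.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])
```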

3. Increase the number of training epochs

Often, the beneficial effects of data augmentation are only seen after training a network for longer. Increase the number of training epochs for your second model. A good rule of thumb might be to double the number of training epochs compared to your baseline model. You can look at the training output after your first run to determine if the model still seems to be improving and can be trained longer.
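Continuing the sketch, the second model is then trained on the augmented pipeline for roughly twice the baseline epoch count (the epoch numbers here are illustrative):

```python
# Sketch: train the augmented model for roughly twice as many epochs as the
# baseline (epoch counts are illustrative).
baseline_epochs = 30
augmented_history = augmented_model.fit(
    train_ds_augmented,           # training data with augmentation applied
    validation_data=val_ds,       # validation data is never augmented
    epochs=baseline_epochs * 2,   # rule of thumb: double the baseline epochs
)
```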

4. Compare and iterate

Now that you've trained a model with data augmentation, compare it to your baseline model by checking performance metrics. If the second model is more accurate or has a lower loss value, augmentation was successful.

Whether it was successful or not, you may be able to find settings that work better. If available, you can try other combinations of data augmentation parameter options. You can also try adjusting the architecture of your model. Since data augmentation can help prevent overfitting, you may be able to improve accuracy by increasing the size of your model while applying augmentation.
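Continuing the earlier sketch, the History objects returned by fit make a quick side-by-side comparison straightforward:

```python
# Sketch: compare the two models using their final validation metrics.
# Assumes the History objects from the earlier training runs.
print("Baseline  val_accuracy:", baseline_history.history["val_accuracy"][-1])
print("Augmented val_accuracy:", augmented_history.history["val_accuracy"][-1])
print("Baseline  val_loss:    ", baseline_history.history["val_loss"][-1])
print("Augmented val_loss:    ", augmented_history.history["val_loss"][-1])
```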

5. Check model performance using your test dataset

Once you have several model variants, you can run model testing for each. You might find that a model trained with data augmentation performs better on your test dataset even if its accuracy during training is similar to that of your baseline model, so it's always worth checking your models against test data.

It's also worth comparing the confusion matrices for each model. Data augmentation may affect the performance of your model on different labels in different ways. For example, precision may improve for one pair of classes but be reduced for another.
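Continuing the sketch, and assuming test data held in NumPy arrays x_test and y_test (with integer labels) and scikit-learn available, you might compare confusion matrices like this:

```python
# Sketch: compare per-class behaviour with confusion matrices on the test set.
# Assumes x_test / y_test NumPy arrays and scikit-learn installed.
import numpy as np
from sklearn.metrics import confusion_matrix

baseline_preds = np.argmax(baseline.predict(x_test), axis=1)
augmented_preds = np.argmax(augmented_model.predict(x_test), axis=1)

print("Baseline confusion matrix:\n", confusion_matrix(y_test, baseline_preds))
print("Augmented confusion matrix:\n", confusion_matrix(y_test, augmented_preds))
```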

Data augmentation in Edge Impulse

With Edge Impulse you can easily augment your dataset. Depending on your data type and learning block selection, data augmentation settings are available directly in Studio when configuring your learning block.

If you are an advanced user who is familiar with Python and Keras, data augmentation techniques can be applied programmatically in Studio through Expert mode. Alternatively, you can leverage the Python SDK for full flexibility over your data augmentation and training pipeline.
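As a purely generic illustration of what programmatic augmentation looks like in TensorFlow (this is not the actual Expert mode template, which Edge Impulse generates for you), you might apply transforms inside a tf.data map:

```python
# Illustration only: generic TensorFlow code of the kind you might write when
# customizing augmentation in a Keras training script yourself.
import tensorflow as tf

def augment_fn(image, label):
    image = tf.image.random_flip_left_right(image)   # geometric flip
    image = tf.image.random_brightness(image, 0.2)   # brightness jitter
    return image, label

# train_ds = train_ds.map(augment_fn, num_parallel_calls=tf.data.AUTOTUNE)
```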
