Big problem, not enough data?

TL;DR Recommending a paper on image data augmentation… and asking about others’ experience with data augmentation generally.

In the days of Big Data it might seem odd, but often the data we have is not enough to train a good model. Datasets can be unexpectedly biased and our models can then fail to generalise well.

For example, most photos are taken under a limited set of lighting conditions. A classification or segmentation model might fail if the image to be analysed after training was taken under monochromatic light that did not feature in the training data set (the old bright yellow British streetlights used sodium vapour, with a single dominant wavelength). In this case, the complete absence of information in some colour channels might cause the model to fail to recognise anything.
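To make the streetlight example concrete, here is a rough sketch of how such an image could be faked from an ordinary RGB photo, so that the lighting condition does appear in training. It assumes NumPy and a float image with values in [0, 1]; `simulate_monochromatic` and the `SODIUM_TINT` value are names and numbers I made up for illustration, not a calibrated model of a sodium lamp.

```python
import numpy as np

# Hypothetical tint: a rough guess at sodium-lamp yellow, with no blue at all.
SODIUM_TINT = np.array([1.0, 0.8, 0.0])

def simulate_monochromatic(image, tint=SODIUM_TINT):
    """Collapse an H x W x 3 float image (values in [0, 1]) to brightness
    only, then recolour it with a single tint."""
    # Rec. 601 luma weights turn RGB into one brightness value per pixel.
    luminance = image @ np.array([0.299, 0.587, 0.114])
    # Every pixel now has the same hue and differs only in brightness.
    return luminance[..., None] * tint

rng = np.random.default_rng(0)
photo = rng.random((4, 5, 3))
street_lit = simulate_monochromatic(photo)
```

With this tint the blue channel of `street_lit` is zero everywhere, which is exactly the missing-channel situation described above.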

If we can’t just get more data, one way to address potentially biased datasets is to generate more data from the data we have - provided we do it in controlled ways that preserve the features of interest and vary things that are not as important (or should be ignored entirely), such as the orientation of an object in a classification model.
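As a minimal sketch of that idea for orientation, assuming NumPy and images stored as H x W x C arrays: the function below returns a randomly mirrored copy rotated by a multiple of 90 degrees. `random_flip_rotate` is my own illustrative name, not something from PerceptiLabs, and this is only appropriate when orientation really is irrelevant to the label.

```python
import numpy as np

def random_flip_rotate(image, rng):
    """Return a randomly rotated and mirrored copy of an H x W x C image."""
    k = int(rng.integers(0, 4))              # number of 90-degree turns: 0 to 3
    out = np.rot90(image, k=k, axes=(0, 1))  # rotate in the image plane only
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)           # horizontal mirror, half the time
    return out.copy()

rng = np.random.default_rng(0)
image = np.arange(2 * 3 * 3, dtype=np.float32).reshape(2, 3, 3)
augmented = [random_flip_rotate(image, rng) for _ in range(8)]
```

Each augmented copy contains exactly the same pixels as the original, only rearranged, so the features of interest are preserved while the orientation varies.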

We may already have random rotation in PerceptiLabs, and other transformations are possible (@birdstream has made a Feature Request for random shear), but there is more that could be done.

Some might be very specialised and best managed outside PerceptiLabs or as custom code within it; some might be good candidates for further feature requests. If you have any ideas for data augmentation of any kind you could add your own feature request or just raise it for discussion here…

That said, I just came across an interesting survey article - A survey on Image Data Augmentation for Deep Learning (Connor Shorten and Taghi M. Khoshgoftaar).

It’s quite long but not too technical - and it has plenty of references if you want to dive deeper into this. I quite like the PCA colour augmentation (see also here).
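For anyone curious what PCA colour augmentation looks like in practice, here is a small sketch under my own assumptions: NumPy, a float RGB image with values in [0, 1], and the principal components computed from the single image rather than from a whole training set. The function name `pca_colour_augment` and the `sigma` parameter are mine, so treat this as an illustration rather than the method exactly as described in the survey.

```python
import numpy as np

def pca_colour_augment(image, rng, sigma=0.1):
    """Shift every pixel of an H x W x 3 float image (values in [0, 1])
    along the principal components of the image's own RGB distribution."""
    pixels = image.reshape(-1, 3)
    # 3 x 3 covariance of the R, G, B values (np.cov centres the data itself).
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # One random scale per principal component, drawn once per image.
    alphas = rng.normal(0.0, sigma, size=3)
    # The same RGB offset is added to every pixel.
    shift = eigvecs @ (alphas * eigvals)
    return np.clip(image + shift, 0.0, 1.0)

rng = np.random.default_rng(42)
image = rng.random((32, 32, 3))
augmented = pca_colour_augment(image, rng)
unchanged = pca_colour_augment(image, rng, sigma=0.0)
```

Because the offset follows the directions in which the image's colours already vary most, the result tends to look like a plausible change in lighting rather than arbitrary colour noise.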

Anyone have experience of any of these techniques - or others for non-image data?
