Improve generalization by recording more data

Your first model will probably perform much better against the training data than the test data. It has not yet been taught to ignore features that should be ignored, such as clothing or daylight. Generalization is improved by adding more varied data.

The simplest way to approach this is to keep on adding more data until doing so produces no further improvement. 

Instead of spending time trying to improve the test performance directly, you can simply record another set of data in different conditions. Here, this second dataset is called day2. To train a model on the combined datasets from day 1 and day 2, merge their directory structures. The helper script does this automatically:

python day1+2 day1 day2

The above creates a new directory called day1+2 and copies all the images in day1/*/*.png and day2/*/*.png into it, preserving the classification into subdirectories. Now it is straightforward to use the script to train a new model on the combined dataset, then compare its performance on all three datasets (day1, day2 and test1) with the previous model. For example:

[Figure: simple bar chart comparing accuracy on day1, day2 and test1]

Adding a second day of data with different lighting and clothing not only improved the performance on the unseen test set, it also improved the performance on the day1 data. Notice that although the model trained only on day1 performed reasonably on the test data, it got only 36% of its predictions correct on the day2 data.

Both day1 and day2 use the randomized white balance, which makes classification harder, so accuracy on these sets will always be lower than on the test set. Even so, such a low score suggests that training on day1 alone was not enough to generalize to new situations.

Is 95% accuracy good enough? For some applications it might be, but mispredicting 1 frame in 20 means, at 10 FPS, an error roughly every two seconds. Depending on when these errors occur, this could be very problematic. Fortunately, it is easy to keep adding data to improve the accuracy. Applying the same process again on a third day looks like this:

python day3
python day3
python day1+2+3 day1 day2 day3
python day1+2+3
python day1+2+3 val_day1+2+3
python day1+2+3 day1
python day1+2+3 day2
python day1+2+3 day3
python day1+2+3 test1
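The comparison step above can be sketched as a small helper that runs one trained model over several datasets and reports per-dataset accuracy. This is illustrative only: `predict` stands in for the real model (e.g. day1+2+3/model.h5 loaded with Keras), and representing each dataset as a list of (sample, true_label) pairs is an assumption.

```python
def accuracy_per_dataset(predict, datasets):
    """datasets maps a name like 'day1' to a list of (sample, true_label)
    pairs; returns the fraction of correct predictions for each dataset."""
    results = {}
    for name, samples in datasets.items():
        correct = sum(1 for sample, label in samples if predict(sample) == label)
        results[name] = correct / len(samples)
    return results
```

Plotting the returned dictionary as a bar chart reproduces the comparisons shown in this guide.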
  • Advanced information

    To prepare the demo system shown in the video at the start of this guide, 3-4 minutes of video were recorded every day for three days. Half of this was filmed in dim lighting and half in bright lighting. Each of the five actions was repeated with several variations, and the results were classified and saved.

    After training a ConvNet from scratch on the day1 dataset, it achieved an accuracy of 98% on its training data and 98% on the validation set. The high validation accuracy shows that the model has not simply memorized the input data; however, because the validation data was drawn from the same day, it cannot tell us whether the model will generalize to different clothes, ambient daylight, and so on.
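    A small from-scratch ConvNet of the kind described above might look like the following Keras sketch. The layer sizes and input resolution are illustrative guesses, not the guide's actual architecture; only the output size (five action classes) comes from the text.

    ```python
    import tensorflow as tf

    def build_model(num_classes=5, input_shape=(96, 96, 3)):
        # Minimal ConvNet for classifying action images into num_classes;
        # layer widths are assumptions for illustration.
        return tf.keras.Sequential([
            tf.keras.layers.Rescaling(1.0 / 255, input_shape=input_shape),
            tf.keras.layers.Conv2D(16, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Conv2D(32, 3, activation="relu"),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(num_classes, activation="softmax"),
        ])
    ```

    After compiling with a cross-entropy loss, such a model can be trained on the per-class image directories and saved as model.h5.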

    The test set performance is good but not great. Making a mistake 9% of the time leads to many misinterpreted gestures:

    [Figure: simple bar chart showing accuracy on day1 and test1]

    As more data is added from subsequent days, the performance improves across every dataset:

    [Figure: bar chart comparing accuracy of all three days against the test data]

    Here we can see that although neither day1/model.h5 nor day1+2/model.h5 was trained on the day3 data, having seen two days' worth of data meant that day1+2/model.h5 handled it significantly better. Adding a third day of data almost halved the error on the test set, and examining the remaining errors reveals many pictures that were arguably mislabelled by hand during the classification process.

    For our demo video, this was sufficient for reliable predictions, but every use case will vary. To get the best performance, keep adding extra datasets until training on them yields no further benefit.
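    That stopping rule can be sketched as a simple loop: fold in one more dataset at a time and stop when the test accuracy no longer improves by a meaningful margin. `train_and_score` is a hypothetical callable that trains on the given datasets and returns test-set accuracy; the margin `min_gain` is an assumption.

    ```python
    def enough_data(train_and_score, datasets, min_gain=0.005):
        """Return the shortest prefix of datasets after which adding more
        data stops improving test accuracy by at least min_gain."""
        best = train_and_score(datasets[:1])
        for n in range(2, len(datasets) + 1):
            score = train_and_score(datasets[:n])
            if score - best < min_gain:
                # The extra day added no real benefit; stop before it
                return datasets[:n - 1], best
            best = score
        return datasets, best
    ```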

    This process ends with a single model file, such as day1+2+3/model.h5, which performs well on all the data seen so far as well as on the test set. All that remains is to deploy it.
