Train a neural network from scratch

Training a convolutional network is very compute-intensive and will take a long time on a Raspberry Pi 3. It will be quicker to copy the files to a laptop or desktop and run the script there. To do this you will need to install TensorFlow on your laptop or desktop by following this guide.

To train a neural network from scratch with the LeNet-like model using your training data and validation data, use this command: 

python day1 val_day1

The script prints regular progress updates while training the model, as shown here:

Terminal output once you've trained the neural network from scratch

Although the output suggests that the network will train for 100 epochs, this is the maximum value, and in practice it finishes earlier than this. The day1/model.h5 file is updated whenever a better result is achieved, so it is possible to leave training running, then copy the best model.h5 file that gets produced to the Raspberry Pi and try it out to decide whether it is already good enough for real-world use.

Note: A GPU will speed this up but is not necessary. With 2500 images, the models train in under an hour on a 2017 MacBook Pro.

What are we doing here?

Let's take a look at what's going on in the training script.

The simple LeNet architecture features blocks of convolution, activation, and max pooling followed by one or more dense layers. This architecture works well for a wide range of applications and is small enough to run at around 10 FPS on a Raspberry Pi 3.

The training script sets up a simple convolutional network following this pattern:

a section of code from the file

You can increase or decrease the capability of the network by changing the number of channels (the first argument to each Conv2D call) and the size of the dense layer. In the code shown, these are set to 32, 32, 64, and 64 respectively. To detect just one gesture (such as pointing at a light to turn it on or off), a network using 16, 16, 32, 16 trains and runs twice as fast with no loss in accuracy, so feel free to experiment with these values.
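The pattern described above can be sketched as follows. This is an illustrative reconstruction, not the original script: the kernel sizes, 64x64 RGB input shape, dropout rate, and two-class output are assumptions, while the channel counts (32, 32, 64) and dense size (64) match the values quoted in the text.

```python
# A LeNet-like model sketch: blocks of convolution, activation, and
# max pooling, followed by a dense head. Input shape and class count
# are assumptions for illustration.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, Activation, MaxPooling2D,
                                     Flatten, Dense, Dropout)

NUM_CLASSES = 2  # e.g. "gesture" vs "no gesture" (assumption)

model = Sequential([
    # Block 1: two convolutions, then downsample
    Conv2D(32, (3, 3), input_shape=(64, 64, 3)),
    Activation('relu'),
    Conv2D(32, (3, 3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),
    # Block 2: wider convolution, then downsample again
    Conv2D(64, (3, 3)),
    Activation('relu'),
    MaxPooling2D(pool_size=(2, 2)),
    # Dense head
    Flatten(),
    Dense(64),
    Activation('relu'),
    Dropout(0.5),
    Dense(NUM_CLASSES),
    Activation('softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
```

Shrinking the Conv2D arguments and the first Dense size to 16, 16, 32, 16 is all it takes to try the faster variant mentioned above.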

Once a good model has been found, it can be instructive to come back and explore the effect of different activation functions such as relu and selu, the amount of dropout, batch normalization, or additional layers. In general this is not necessary, as the defaults above already work well across multiple tasks.

The script used in the previous tutorial loaded all the data into memory before training, which limited the number of images that could be used. This version uses Keras' ImageDataGenerator to stream the images from disk without loading them all at once:

Using the ImageDataGenerator from Keras

This code also uses ImageDataGenerator's data augmentation to randomly shear and zoom each image in the training set by up to 20% each time it is seen by the neural network. This helps to make sure the neural network does not overfit to specific locations or sizes without having to move the camera between each recording.

When training a convolutional neural network from scratch instead of just fitting a classifier to features as in the previous tutorial, it helps to use a few extra tricks:

The callbacks set up in the training script

The three callbacks shown here each help improve training and generalization performance:

  • ModelCheckpoint: this ensures that the final model.h5 file saved is the one with the best score on the validation dataset, even if the model overfit the training data in subsequent epochs.
  • EarlyStopping: this stops training after validation performance does not improve for more than 10 epochs to help prevent overfitting the validation dataset itself.
  • ReduceLROnPlateau: this decreases the learning rate when training performance levels off and improves the overall performance of the model without having to fine-tune the learning-rate parameter.
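The three callbacks above can be sketched like this. The monitored metrics, patience of the learning-rate reduction, and reduction factor are plausible defaults rather than values confirmed by the source; the 10-epoch early-stopping patience and the `day1/model.h5` path come from the text.

```python
# The three training callbacks described above; exact parameters are
# assumptions except the 10-epoch early-stopping patience.
from tensorflow.keras.callbacks import (ModelCheckpoint, EarlyStopping,
                                        ReduceLROnPlateau)

callbacks = [
    # Keep only the weights with the best validation score on disk.
    ModelCheckpoint('day1/model.h5', monitor='val_loss',
                    save_best_only=True),
    # Stop once validation performance has not improved for 10 epochs.
    EarlyStopping(monitor='val_loss', patience=10),
    # Halve the learning rate when training loss plateaus.
    ReduceLROnPlateau(monitor='loss', factor=0.5, patience=3),
]
# These would be passed as model.fit(..., callbacks=callbacks).
```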

You can explore the full source code to see how these pieces fit together.

  • Advanced information

    In Episode 1, we use a pre-trained MobileNet as a feature extractor, but one of the limitations of this approach is that most real-world gesture recognition situations do not look much like ImageNet, shown here:

    collage of ImageNet images

(Image © Alec Radford, 2015. Used under an MIT licence.)

This is a well-known problem when using models trained on ImageNet for transfer learning. The ImageNet data mostly consists of well-lit, centered photographs with no noise or camera distortion. Because of this, networks trained on this data do not perform very well on images that do not share those characteristics.

    A typical image used for gesture recognition in a dimly-lit office from a Raspberry Pi looks more like this:

    blurred and dark image of a raised hand

From this image, can you tell that the subject has their left hand raised? Here, the hand and arm make up only 2% of the total pixels and have very low contrast against the background. This dissimilarity to ImageNet images suggests that training a neural network from scratch will be a better approach than trying to transfer ImageNet learning.

    Training a convolutional neural network from scratch can be easy. There are many clever architectures that can be used to get the best possible performance on complex datasets, but to recognize a few gestures in a single location a simple architecture works fine.
