In this section of the guide, we explain the reasoning behind our model choice.
In this guide, we are working in the embedded space. Therefore, it is important to be aware of the limitations our target device may have, for example, the amount of available flash memory or RAM. These limitations should be considered when deciding what neural network architecture to use. For example, a model with a large memory footprint may be too big to fit in the available memory on your device.
In this guide, we use the MobilenetV2 model architecture from Google. This family of models provides good accuracy, a small memory footprint, and efficient execution on performance-constrained devices. Because MobilenetV2 is a family of models, it is possible to select a variant that fits the restrictions of a particular device. This makes it an ideal model choice for an embedded platform.
The trade-off between accuracy, speed, and size requirements is something that must always be considered when choosing what model to use. To give an example of this, our requirements are:
- Inference time under 500ms
- A model size of under 1MB
- To be as accurate as possible
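The selection logic implied by these requirements can be sketched as a simple filter: discard any candidate that misses the latency or size budget, then keep the most accurate survivor. The candidate names and figures below are hypothetical placeholders for illustration, not measurements from this guide.

```python
# Hypothetical candidates: (name, inference_ms, size_bytes, top1_accuracy).
# These numbers are illustrative placeholders, not real benchmark results.
candidates = [
    ("model_a_160", 720, 1_900_000, 0.70),
    ("model_b_128", 460, 900_000, 0.65),
    ("model_c_96",  310, 700_000, 0.60),
]

MAX_LATENCY_MS = 500        # inference time under 500 ms
MAX_SIZE_BYTES = 1_000_000  # model size under 1 MB

def pick_model(models):
    """Return the most accurate model that meets both budgets, or None."""
    viable = [m for m in models
              if m[1] < MAX_LATENCY_MS and m[2] < MAX_SIZE_BYTES]
    return max(viable, key=lambda m: m[3]) if viable else None

print(pick_model(candidates)[0])  # -> model_b_128
```

Here `model_a_160` is rejected on latency and `model_b_128` wins over `model_c_96` on accuracy, mirroring the "as accurate as possible within the budgets" rule above.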
To meet those requirements, we first focused on the model size. The model size is determined by the number of parameters in the model. As mentioned in *What happens in an MCU face recognition model?*, we remove the final classification Conv2D layer of the model after training. This is because we only need the tensors that go into this layer as the fingerprint for our face recognition.
The following image shows the structure of the last few layers of Mobilenet v2.
This last Conv2D layer accounts for (1001x1x1x1280) + 1001 = 1.28M parameters. In our quantized model, each parameter uses 1 byte of memory, and our size requirement is for the model to use under 1MB of memory. Therefore, by subtracting 1.28M from each model's total parameter count, we can calculate the memory size of the trimmed model and see which models fit our size requirements. The memory size column in the table below shows that all the models below float_v2_0.75_96 fit our size requirements.
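The arithmetic above can be checked in a few lines of Python. The 1MB budget comes from our requirements; the layer shape (a 1x1 kernel over 1280 input channels for each of the 1001 output classes, plus one bias per class) follows the figures in the text.

```python
# Parameters of MobileNet v2's final classification Conv2D:
# (1001 x 1 x 1 x 1280) weights plus 1001 biases.
classifier_params = 1001 * 1 * 1 * 1280 + 1001
print(classifier_params)  # prints 1282281, i.e. roughly 1.28M

# After 8-bit quantization each parameter occupies one byte, so the
# on-device footprint of the trimmed model is its remaining parameter
# count in bytes.
def fits_size_budget(total_params, budget_bytes=1_000_000):
    """True if the model fits in the 1 MB budget once the classifier is removed."""
    return total_params - classifier_params < budget_bytes
```

For example, a model with 2.0M total parameters fits the budget after trimming (about 0.72MB remain), while one with 2.5M does not.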
We then must decide on an input resolution. Working with higher resolutions can give us better accuracy but comes at the expense of slower inference times. Inference times for these different input resolutions were measured on an Arm microcontroller-based device, such as a Cortex-M4 or a Cortex-M7. The highest resolution that satisfied our inference time requirement was selected, which was a 128x128 input size.
Note: Both model size and input resolution affect inference time, so it can be worth checking different combinations to find the sweet spot. A smaller model with higher resolution might give us faster inference times with similar accuracy.
The following table shows the results that were obtained when we altered the model version and image resolution:
|Model version and image resolution|Quantized|MACs (M)|Parameters (M)|Memory size|Top 1 accuracy|Top 5 accuracy|
|---|---|---|---|---|---|---|