This section of the guide describes how we use quantization in our model. Quantization approximates a neural network that computes with floating-point numbers by one that computes with fixed-point integers. We apply it to our model because it substantially reduces both the memory requirement and the computational cost of running the network.
A deep neural network contains many parameters, known as weights; for example, the well-known VGG network has over 100 million of them. In most cases, the bottleneck in running a deep neural network is transferring the weights and data between main memory and the compute cores. With quantization, we store each weight in 8 bits rather than 32. The model therefore shrinks to a quarter of its original size, and the same memory bandwidth can move the weights about four times faster. This also brings other benefits, including faster inference.
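To make the 32-bit to 8-bit reduction concrete, here is a minimal sketch of affine (scale and zero-point) quantization in plain Python. The helper names `quantize_int8` and `dequantize_int8` are hypothetical and chosen for illustration only. This is one common mapping scheme, not necessarily the exact one used in our model or by TensorFlow.

```python
import struct


def quantize_int8(weights):
    """Map floats to signed 8-bit integers using a scale and a zero point."""
    # Widen the range to include 0.0 so that zero is represented exactly.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    # One integer step corresponds to `scale` in float units.
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    # Round to the nearest step, shift by the zero point, clamp to int8 range.
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Recover approximate float values from the 8-bit representation."""
    return [(v - zero_point) * scale for v in q]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(q, scale, zero_point)

# Storage comparison: 4 bytes per float32 weight versus 1 byte per int8 weight.
float_bytes = struct.pack(f"{len(weights)}f", *weights)
int8_bytes = struct.pack(f"{len(q)}b", *q)
print(len(float_bytes), len(int8_bytes))
```

The packed int8 buffer is a quarter of the size of the float32 buffer, at the cost of a rounding error of at most about one quantization step (`scale`) per weight.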
We investigated two quantization methods offered by TensorFlow: post-training quantization and quantization-aware training (QAT). Both produce a fully quantized model in which the weights and the calculations use fixed-point arithmetic.
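As a rough illustration of how the two methods are typically invoked, the sketch below assumes a Keras model, the TensorFlow Lite converter, and the separate `tensorflow_model_optimization` package. The guide does not state which APIs were used, so these are assumptions. The function names `post_training_int8` and `quantization_aware_model` are hypothetical, and behaviour may depend on the installed TensorFlow and Keras versions.

```python
def post_training_int8(keras_model, representative_dataset):
    """Post-training full-integer quantization (assumed TensorFlow Lite workflow)."""
    import tensorflow as tf  # imported lazily; requires TensorFlow

    converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Calibration data: a callable returning a generator that yields
    # lists of sample input arrays, used to estimate activation ranges.
    converter.representative_dataset = representative_dataset
    # Restrict the converter to integer-only kernels and integer inputs/outputs.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # serialized model as bytes


def quantization_aware_model(keras_model):
    """Wrap a Keras model so training simulates 8-bit quantization (assumed QAT workflow)."""
    import tensorflow_model_optimization as tfmot  # separate pip package

    return tfmot.quantization.keras.quantize_model(keras_model)
```

Under these assumptions, the model returned by `quantization_aware_model` would still need to be compiled, fine-tuned, and then passed through the converter to obtain the final fixed-point model.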