With the launch of TensorFlow Lite, TensorFlow has been updated with quantization techniques and tools that you can use to improve the performance of your network.

This guide shows you how to quantize a network so that it uses 8-bit data types during training, using features that are available in TensorFlow 1.9 or later.

Devices can execute 8-bit integer models faster than 32-bit floating-point models because there is less data to move and simpler integer arithmetic operations can be used for multiplication and accumulation.
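To make the data-size argument concrete, the following is a minimal NumPy sketch of 8-bit affine quantization. It is an illustration only, not TensorFlow's implementation: the helper names `quantize_uint8` and `dequantize`, the per-tensor scale and zero point, and the random weight tensor are all assumptions made for this example.

```python
import numpy as np

def quantize_uint8(x):
    """Map float32 values to uint8 using one scale and zero point per tensor.

    Hypothetical helper for illustration; not a TensorFlow API.
    """
    lo, hi = float(x.min()), float(x.max())
    # Extend the range so that it always contains zero.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for all-zero input
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float32 values from the uint8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

# Stand-in for a layer's weights (random data, assumed for the example).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=(256, 256)).astype(np.float32)

q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)

size_ratio = weights.nbytes / q.nbytes          # float32 bytes per uint8 byte
max_error = float(np.abs(weights - restored).max())
print("size ratio:", size_ratio)
print("max round-trip error:", max_error, "quantization step:", scale)
```

The 8-bit tensor occupies a quarter of the memory of the 32-bit tensor, which is where the reduction in data movement comes from. The cost is a round-trip error on the order of one quantization step.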

If you are deploying TensorFlow models using CoreML, Arm recommends that you convert the 32-bit unquantized model to CoreML. To convert the model to CoreML, use https://github.com/tf-coreml/tf-coreml and then use the CoreML quantization tools to optimize the model for deployment. Check the Apple Developer website for updates on this.

It is not currently possible to deploy 8-bit quantized TensorFlow models via CoreML on iOS. However, you can use the same technique to reduce the compressed model size for distribution, using the round_weights transform described in the TensorFlow GitHub repository, or to deploy 8-bit models using the TensorFlow C++ interface.
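The following NumPy sketch illustrates why rounding weights reduces the compressed size even though the values are still stored as 32-bit floats. It only mimics the idea behind round_weights and is not the transform's actual code: the function `round_weights` below, the choice of 256 levels, the random weights, and the use of zlib as the compressor are assumptions made for this example.

```python
import zlib
import numpy as np

def round_weights(w, num_steps=256):
    """Snap float32 weights to num_steps evenly spaced levels.

    Hypothetical stand-in for the idea behind the round_weights transform.
    The result is still float32, so the uncompressed size does not change.
    """
    lo, hi = float(w.min()), float(w.max())
    if hi == lo:
        return w.copy()
    step = (hi - lo) / (num_steps - 1)
    return (np.round((w - lo) / step) * step + lo).astype(np.float32)

# Stand-in for a model's weights (random data, assumed for the example).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=100_000).astype(np.float32)
rounded = round_weights(weights)

raw_size = len(zlib.compress(weights.tobytes()))
rounded_size = len(zlib.compress(rounded.tobytes()))
print("uncompressed bytes:", weights.nbytes, "->", rounded.nbytes)
print("compressed bytes:  ", raw_size, "->", rounded_size)
```

Because the rounded tensor contains at most 256 distinct values, its byte patterns repeat and a general-purpose compressor can shrink it considerably, while the in-memory size stays the same.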