Overview
With the launch of TensorFlow Lite, TensorFlow has been updated with quantization techniques and tools that you can use to improve the performance of your network.
This guide shows you how to quantize a network so that it uses 8-bit data types during training, using features that are available from TensorFlow 1.9 or later.
Devices can execute 8-bit integer models faster than 32-bit floating-point models because there is less data to move and simpler integer arithmetic operations can be used for multiplication and accumulation.
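As an illustration of the arithmetic involved, 8-bit quantization maps each floating-point value to an unsigned 8-bit integer through a scale and a zero point. The following numpy sketch is illustrative only (the function names and the chosen range are not part of the TensorFlow toolchain):

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Map float values to uint8 using an affine (scale, zero-point) scheme."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Recover approximate float values from their uint8 representation."""
    return scale * (q.astype(np.int32) - zero_point)

# Represent values in [-1.0, 1.0] with 8 bits: the scale covers the range,
# and zero_point marks where 0.0 lands on the uint8 axis.
scale = 2.0 / 255.0
zero_point = 128

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)

# The round-trip error is bounded by half of one quantization step.
assert np.max(np.abs(x - x_hat)) <= scale / 2 + 1e-9
```

Because each value occupies one byte instead of four, and the multiply-accumulate operations run on integers, the quantized model is both smaller and faster to execute.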
If you are deploying TensorFlow models using CoreML, Arm recommends that you convert the 32-bit unquantized model to CoreML. To convert the model to CoreML, use https://github.com/tf-coreml/tf-coreml and then use the CoreML quantization tools to optimize the model for deployment. Check the Apple Developer website for updates on this.
Note that it is not currently possible to deploy 8-bit quantized TensorFlow models via CoreML on iOS. However, you can use the same technique to reduce the compressed model size for distribution using the round_weights transform described in the TensorFlow GitHub repository, or to deploy 8-bit models using the TensorFlow C++ interface.
Before you begin
Before you can use the TensorFlow Lite quantization tools, you must:
 Install TensorFlow 1.9 or later. Arm tested TensorFlow version 1.15.

To follow the CifarNet examples in this article, clone the tensorflow/models repository from GitHub using the command:
git clone https://github.com/tensorflow/models.git
Use the master branch. Arm tested commit d4e1f97fd8b929deab5b65f8fd2d0523f89d5b44, which you can select with:
git checkout d4e1f97fd8b929deab5b65f8fd2d0523f89d5b44
 Prepare your network for quantization:

Remove operations that the TensorFlow quantization toolchain does not yet support. Note that this support will change over time. See the TensorFlow documentation for details.
To remove unsupported operations from CifarNet, remove lines 68 and 71 from models/research/slim/nets/cifarnet.py.
Add fake quantization layers to the model graph before you initialize the optimizer. Call tf.contrib.quantize.create_training_graph() on the finished graph to add these layers.
For a CifarNet example, modify models/research/slim/train_image_classifier.py and add tf.contrib.quantize.create_training_graph(quant_delay=90000) before the code that configures the optimization procedure on line 514. The quant_delay parameter specifies how many steps the network is allowed to train in 32-bit floating-point (FP32) before quantization effects are introduced. The value 90000 means that the model trains for 90000 steps with floating-point weights and activations before the quantization process begins. You can also load the weights of an existing trained model and fine-tune it for quantization. In this case, add the code that loads the weights and set quant_delay to 0 so that quantization begins immediately.
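Conceptually, each fake quantization layer snaps weights and activations onto their 8-bit grid and immediately converts them back to float in the forward pass, so the network learns to tolerate the rounding error. A minimal numpy sketch of that quantize-dequantize round trip (illustrative only; the real fake-quant nodes also learn the min/max range during training):

```python
import numpy as np

def fake_quant(x, x_min, x_max, num_bits=8):
    """Simulate a fake-quantization node: snap x onto its 8-bit grid,
    then return the values as floats again."""
    levels = 2 ** num_bits - 1                  # 255 representable steps
    scale = (x_max - x_min) / levels
    q = np.round((np.clip(x, x_min, x_max) - x_min) / scale)
    return q * scale + x_min                    # back to float, on the 8-bit grid

w = np.array([-0.73, 0.01, 0.42])
w_q = fake_quant(w, x_min=-1.0, x_max=1.0)

# The output stays floating-point, so the rest of the training pipeline is
# unchanged, but the values only ever take 256 distinct levels.
```

This is why the network can be trained (or fine-tuned) as usual while still ending up robust to the rounding that real 8-bit inference introduces.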

Train or Fine-tune Your Model
Train your model using your training data and compare the accuracy with the original 32bit network. The training process varies by model.
To fine-tune an existing model using quantization, load the weights from your trained model into a graph that you have prepared with the create_training_graph() function, as described in the Before you begin section. When you have loaded the weights, allow the model to continue training for a smaller number of steps.
When the training is complete, compare the accuracy with the original 32bit network.
For CifarNet, you can use the following commands to download the training data to /tmp/cifar10
and train the network for 100000 steps:
cd models/research/slim/
bash scripts/train_cifarnet_on_cifar10.sh
The fake quantization layers that tf.contrib.quantize.create_training_graph() adds become active after 90000 steps and ensure that the final network is fine-tuned to work with quantized weights. The training process creates a /tmp/cifarnet-model directory that contains the graph and checkpoint weights.
To view the training progress:

Run the following command to start TensorBoard:
tensorboard --logdir=/tmp/cifarnet-model/
 Open port 6006 on the training server in a browser. If you’re training on your laptop or desktop, this is http://localhost:6006/.
Training takes approximately 40 minutes on a p3.2xlarge EC2 instance.
The final accuracy is approximately 85% for CifarNet.
Prepare the Graph for Inference
To prepare the graph for inference with TensorFlow Lite or Arm NN, optimize the graph for inference, and freeze it:

Add fake quantization layers to the graph. This modifies the way that the inference graph is exported, to make sure that it is exported with the quantization information in the right format. To add the fake quantization layers, call tf.contrib.quantize.create_eval_graph() on the inference-ready graph before saving it.
For CifarNet, you can do this by modifying the file models/research/slim/export_inference_graph.py and adding tf.contrib.quantize.create_eval_graph() before graph_def = graph.as_graph_def() on line 118.
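In outline, the modification to the export script looks like this (pseudocode for a TensorFlow 1.x script, not runnable as-is; the variable names follow export_inference_graph.py):

```
with tf.Graph().as_default() as graph:
    # ... the script builds the CifarNet inference graph here ...

    # Added line: rewrite the graph with fake quantization layers so that
    # the exported protobuf carries the quantization range information.
    tf.contrib.quantize.create_eval_graph()

    graph_def = graph.as_graph_def()   # existing line 118
```

The only requirement is that create_eval_graph() runs after the graph is fully built and before the graph definition is serialized.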
Export the inference graph to a protobuf file. For CifarNet this is done using:
python export_inference_graph.py \
--model_name=cifarnet \
--dataset_name=cifar10 \
--output_file=/tmp/cifarnet_inf_graph.pb
At this point, the graph does not contain your trained weights.

Freeze the graph using the freeze_graph tool. You can specify any checkpoint during training for this. The command to freeze the graph is:
python -m tensorflow.python.tools.freeze_graph \
--input_graph=<your_graph_location> \
--input_checkpoint=<your_chosen_checkpoint> \
--input_binary=true \
--output_graph=<output_directory> \
--output_node_names=<output_nodes>
For CifarNet, using the last checkpoint, you can do this with the commands:
export LAST_CHECKPOINT=`head -n1 /tmp/cifarnet-model/checkpoint | cut -d'"' -f2`
python -m tensorflow.python.tools.freeze_graph \
--input_graph=/tmp/cifarnet_inf_graph.pb \
--input_checkpoint=${LAST_CHECKPOINT} \
--input_binary=true \
--output_graph=/tmp/frozen_cifarnet.pb \
--output_node_names=CifarNet/Predictions/Softmax
Quantize the Graph
Use the TensorFlow Lite Converter, tflite_convert, to optimize the TensorFlow graph and convert it to the TensorFlow Lite format for 8-bit inference. This tool is installed as standard in your path with TensorFlow 1.9 or later.
To use the TensorFlow Lite Converter:

Use the tflite_convert command-line program using the command:
tflite_convert \
--graph_def_file=<your_frozen_graph> \
--output_file=<your_chosen_output_location> \
--input_format=TENSORFLOW_GRAPHDEF \
--output_format=TFLITE \
--inference_type=QUANTIZED_UINT8 \
--output_arrays=<your_output_arrays> \
--input_arrays=<your_input_arrays> \
--mean_values=<mean of input training data> \
--std_dev_values=<standard deviation of input training data>
For CifarNet, this command is:
tflite_convert \
--graph_def_file=/tmp/frozen_cifarnet.pb \
--output_file=/tmp/quantized_cifarnet.tflite \
--input_format=TENSORFLOW_GRAPHDEF \
--output_format=TFLITE \
--inference_type=QUANTIZED_UINT8 \
--output_arrays=CifarNet/Predictions/Softmax \
--input_arrays=input \
--mean_values=121 \
--std_dev_values=64
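The mean_values and std_dev_values parameters tell the converter how the uint8 input bytes relate to the floating-point values the network was trained on: real_value = (quantized_value - mean) / std_dev. A quick numpy check of the CifarNet values (121 and 64 come from the command above; the float range they imply is derived here only for illustration):

```python
import numpy as np

mean, std_dev = 121.0, 64.0

# The converter interprets each uint8 input byte q as (q - mean) / std_dev.
q = np.arange(0, 256, dtype=np.float64)
real = (q - mean) / std_dev

# With mean=121 and std_dev=64, the full uint8 range maps to roughly
# [-1.89, 2.09] in float, which should bracket the training-data range.
print(real.min(), real.max())
```

If these values do not match the preprocessing used during training, the quantized model will see systematically shifted inputs and its accuracy will suffer.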
This command creates an output file that is one quarter of the size of the 32-bit frozen input file.
For more information on using the TensorFlow Lite Converter, see the TensorFlow GitHub.

Check the accuracy of your result to ensure that it is comparable to that of the original 32-bit graph.
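One simple way to quantify this comparison is to measure how often the quantized model picks the same class as the 32-bit model on the same evaluation images. The sketch below uses made-up probability arrays; in practice they would be the softmax outputs of the frozen 32-bit graph and of the .tflite model:

```python
import numpy as np

def top1_agreement(float_probs, quant_probs):
    """Fraction of examples where the quantized model predicts the
    same class as the 32-bit model."""
    return np.mean(np.argmax(float_probs, axis=1) ==
                   np.argmax(quant_probs, axis=1))

# Illustrative stand-ins for per-image class probabilities (3 images, 3 classes).
float_probs = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.3, 0.3, 0.4]])
quant_probs = np.array([[0.6, 0.3, 0.1],
                        [0.2, 0.7, 0.1],
                        [0.4, 0.3, 0.3]])   # last prediction flips class

agreement = top1_agreement(float_probs, quant_probs)
```

Alongside absolute accuracy on a labeled test set, a high agreement rate confirms that quantization has not materially changed the model's behavior.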