Generate an optimized 8-bit model

Most trained models use 32-bit floating-point numbers to represent their weights. Research has shown that for many networks you can reduce the precision of these numbers and store them in 8-bit integers, reducing the model size by a factor of 4. This has benefits when distributing and loading models on mobile devices in particular.
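
The arithmetic behind that factor of 4 is simple: each float32 weight occupies four bytes, while a uint8 value occupies one, plus a handful of bytes per tensor to store the range used for the mapping. The sketch below is a rough numpy illustration of linear (affine) quantization, not the exact scheme TensorFlow applies internally; the weight array is made up for the example.

import numpy as np

# Toy float32 "weights" standing in for a real layer's parameters.
weights = np.random.randn(1000).astype(np.float32)

# Affine quantization: map [min, max] onto the 256 levels of uint8 and
# remember the range so approximate float values can be recovered later.
w_min, w_max = weights.min(), weights.max()
scale = (w_max - w_min) / 255.0
quantized = np.round((weights - w_min) / scale).astype(np.uint8)
dequantized = quantized.astype(np.float32) * scale + w_min

print(weights.nbytes, quantized.nbytes)      # 4000 vs 1000 bytes: 4x smaller
print(np.abs(weights - dequantized).max())   # worst-case rounding error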

In theory, 8-bit integer models can also execute faster than 32-bit floating-point models because there is less data to move and simpler integer arithmetic operations can be used for multiplication and accumulation. However, not all commonly-used layers have 8-bit implementations in TensorFlow 1.4, which means that a quantized model may spend more time converting data between 8-bit and 32-bit formats for different layers than it saves through faster execution.

The optimal balance between speed, size and accuracy will vary by model, application and hardware platform, so it's best to quantize all deployed models and compare them to the unquantized version on the deployment hardware itself. You can quantize a neural network using TensorFlow's graph_transforms tool with the following command:

bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
--in_graph=resnetv1_50.pb \
--out_graph=optimized_resnetv1_50_int8.pb \
--inputs='Placeholder' \
--outputs='resnet_v1_50/predictions/Reshape_1' \
--transforms='
  add_default_attributes
  strip_unused_nodes(type=float, shape="1,224,224,3")
  remove_nodes(op=Identity, op=CheckNumerics)
  fold_constants(ignore_errors=true)
  fold_batch_norms
  fold_old_batch_norms
  quantize_weights
  quantize_nodes
  strip_unused_nodes
  sort_by_execution_order'
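
To act on the comparison advice above, a short TensorFlow 1.x script along the following lines can run both graphs on the deployment hardware and report latency and output differences. It assumes the optimized float graph from the previous step (optimized_resnetv1_50_fp32.pb) sits alongside the quantized one, uses the input and output tensor names from the command, and feeds a random image purely as a stand-in for real data.

import time
import numpy as np
import tensorflow as tf

def run_graph(path, image, n_runs=20):
    # Load a frozen GraphDef, run it n_runs times, return (predictions, seconds per run).
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(path, 'rb') as f:
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name='')
        inp = graph.get_tensor_by_name('Placeholder:0')
        out = graph.get_tensor_by_name('resnet_v1_50/predictions/Reshape_1:0')
        with tf.Session(graph=graph) as sess:
            sess.run(out, feed_dict={inp: image})          # warm-up run
            start = time.time()
            for _ in range(n_runs):
                preds = sess.run(out, feed_dict={inp: image})
            return preds, (time.time() - start) / n_runs

image = np.random.rand(1, 224, 224, 3).astype(np.float32)   # stand-in for a real image
float_preds, float_time = run_graph('optimized_resnetv1_50_fp32.pb', image)
int8_preds, int8_time = run_graph('optimized_resnetv1_50_int8.pb', image)
print('float32: %.1f ms/run, int8: %.1f ms/run' % (float_time * 1e3, int8_time * 1e3))
print('max difference in predictions: %f' % np.abs(float_preds - int8_preds).max())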

The quantized network should be significantly smaller than the trained model: in this case, optimized_resnetv1_50_fp32.pb is 97MB whereas optimized_resnetv1_50_int8.pb is 25MB. Quite apart from any inference speed improvements, this matters if the model will be distributed as part of a mobile application. If size is the primary concern, read the TensorFlow documentation and try the round_weights transform, which reduces the size of the compressed model for deployment without improving speed and with little effect on accuracy.
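
The reason round_weights helps is that weights snapped to a small set of levels repeat far more often, so generic compressors such as gzip or app-store packaging can squeeze them much harder, even though the file still stores float32 values and executes exactly as before. A toy numpy/zlib illustration of the effect, not the transform itself:

import zlib
import numpy as np

weights = np.random.randn(100000).astype(np.float32)

# Round each weight onto 256 evenly spaced levels but keep float32 storage,
# which is roughly what round_weights does to the graph's weight tensors.
w_min, w_max = weights.min(), weights.max()
step = (w_max - w_min) / 255.0
rounded = (np.round((weights - w_min) / step) * step + w_min).astype(np.float32)

print(len(zlib.compress(weights.tobytes())))   # raw floats barely compress
print(len(zlib.compress(rounded.tobytes())))   # rounded floats compress far better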

Note that it is not currently possible to deploy 8-bit quantized TensorFlow models via CoreML on iOS. It is, however, possible to use the same technique to reduce the compressed model size for distribution by applying the round_weights transform described above, or to deploy 8-bit models using the TensorFlow C++ interface.

There are several variants of this step, such as performing extra fine-tuning passes on the model or running instrumented passes to determine quantization ranges. To learn more about these options and the corresponding deployment workflow with TensorFlow Lite, see the TensorFlow Quantization Guide.
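
As one example of the instrumented approach, later TensorFlow 1.x releases ship a tf.contrib.quantize module that rewrites a training graph with "fake quantization" ops to learn activation ranges, then rewrites the same model for 8-bit inference. The sketch below assumes that module is available (it postdates TensorFlow 1.4) and uses a deliberately tiny stand-in network; substitute your own model and training loop.

import tensorflow as tf

def build_model():
    # Tiny stand-in network; replace with your real model definition.
    images = tf.placeholder(tf.float32, [None, 224, 224, 3], name='images')
    labels = tf.placeholder(tf.int64, [None], name='labels')
    net = tf.layers.conv2d(images, 8, 3, activation=tf.nn.relu)
    net = tf.reduce_mean(net, axis=[1, 2])
    logits = tf.layers.dense(net, 10)
    return tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# Training graph: insert fake-quantization ops that record activation ranges.
train_graph = tf.Graph()
with train_graph.as_default():
    loss = build_model()
    tf.contrib.quantize.create_training_graph(input_graph=train_graph, quant_delay=2000)
    train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)
    # ...run the usual training loop with a tf.Session here...

# Inference graph: the same model rewritten for 8-bit deployment, ready to be
# frozen and converted with TensorFlow Lite.
eval_graph = tf.Graph()
with eval_graph.as_default():
    build_model()
    tf.contrib.quantize.create_eval_graph(input_graph=eval_graph)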

Benchmark models