Generate an optimized 8-bit model

Most trained models use 32-bit floating-point numbers to represent their weights. Research has shown that for many networks you can reduce the precision of these numbers and store them in 8-bit integers, reducing the model size by a factor of 4. This has benefits when distributing and loading models on mobile devices.
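The mapping from 32-bit floats to 8-bit integers is typically an affine (scale and offset) transformation computed from the range of values in a tensor. A minimal sketch of the idea in plain Python (this illustrates the arithmetic only, not TensorFlow's internal implementation):

```python
def quantize(values, num_bits=8):
    """Affine-quantize a list of floats to unsigned integer codes.

    Maps [min(values), max(values)] linearly onto [0, 2**num_bits - 1].
    Returns the codes plus the (scale, offset) needed to dequantize.
    """
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels or 1.0  # avoid division by zero for constant tensors
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [c * scale + lo for c in codes]

weights = [-0.51, 0.0, 0.27, 1.0]
codes, scale, lo = quantize(weights)
approx = dequantize(codes, scale, lo)
# Each recovered value is within half a quantization step of the original.
assert all(abs(a - w) <= scale / 2 + 1e-9 for a, w in zip(approx, weights))
```

Each 8-bit code occupies a quarter of the space of a 32-bit float, which is where the factor-of-4 size reduction comes from; the price is the small rounding error checked in the final assertion.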

In theory, 8-bit integer models can also execute faster than 32-bit floating-point models because there is less data to move and simpler integer arithmetic operations can be used for multiplication and accumulation. However, not all commonly used layers have 8-bit implementations in TensorFlow 1.4. This means that a quantized model may spend more time converting data between 8-bit and 32-bit formats for different layers than it saves through faster execution.

The optimal balance between speed, size, and accuracy varies by model, application, and hardware platform, so it is best to quantize each deployed model and compare it with the unquantized version on the deployment hardware itself. You can quantize a neural network using TensorFlow's graph_transforms tool with the following command:

bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
--in_graph=resnet_v2_50.pb \
--out_graph=optimized_resnet_v2_50_int8.pb \
--inputs='Placeholder' \
--outputs='resnet_v2_50/predictions/Reshape_1' \
--transforms='
  strip_unused_nodes(type=float, shape="1,224,224,3")
  remove_nodes(op=Identity, op=CheckNumerics)
  quantize_weights
  quantize_nodes'

The quantized network should be significantly smaller than the trained model: in this case, optimized_resnet_v2_50_fp32.pb is 102MB, whereas optimized_resnet_v2_50_int8.pb is 27MB. Quite apart from any inference speed improvements, this size reduction matters when the model is distributed as part of a mobile application. If size is the primary concern, read the TensorFlow documentation and try the round_weights transform, which reduces the size of the compressed model for deployment without improving speed or affecting accuracy.
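The intuition behind round_weights can be demonstrated without TensorFlow: snapping weights to a small set of levels leaves the tensor the same size on disk but makes it far more compressible. A hypothetical illustration using only Python's standard library (the 256-level rounding mirrors the idea of the transform; the random data is just a stand-in for real weights):

```python
import gzip
import random
import struct

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def compressed_size(values):
    """Serialize floats as raw 32-bit values, then gzip them."""
    raw = struct.pack(f'{len(values)}f', *values)
    return len(gzip.compress(raw))

# Snap each weight to one of 256 evenly spaced levels, but keep float storage,
# so the uncompressed size is unchanged while redundancy increases.
lo, hi = min(weights), max(weights)
step = (hi - lo) / 255
rounded = [lo + round((w - lo) / step) * step for w in weights]

print(compressed_size(weights))   # raw random floats barely compress
print(compressed_size(rounded))   # rounded floats compress much better
```

Because the rounded tensor contains only 256 distinct bit patterns, gzip (and the compression applied to app packages) finds far more repetition, which is why the transform shrinks the compressed download without changing inference speed.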

There are several variants of this step, which can include performing extra fine-tuning passes on the model, or instrumented runs to determine quantization ranges. To learn more about these options and the corresponding deployment workflow with TensorFlow Lite, see the TensorFlow Quantization Guide.
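The instrumented (calibration) runs mentioned above boil down to recording the observed range of each activation tensor over representative inputs, then deriving quantization parameters from those ranges. A simplified sketch of that bookkeeping, with hypothetical names and independent of TensorFlow's actual implementation:

```python
class RangeTracker:
    """Accumulates min/max statistics for one tensor across calibration batches."""

    def __init__(self):
        self.lo = float('inf')
        self.hi = float('-inf')

    def observe(self, batch):
        """Widen the tracked range to cover one batch of observed values."""
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))

    def quant_params(self, num_bits=8):
        """Scale and zero point for an affine 8-bit mapping of the seen range."""
        scale = (self.hi - self.lo) / (2 ** num_bits - 1)
        zero_point = round(-self.lo / scale) if scale else 0
        return scale, zero_point

tracker = RangeTracker()
for batch in ([0.1, 0.9, 0.4], [0.0, 1.2, 0.7], [0.3, 0.5, 1.1]):
    tracker.observe(batch)        # one call per calibration batch
scale, zero_point = tracker.quant_params()
```

Fixing ranges from calibration data avoids recomputing min/max at inference time, at the cost of clipping any value that falls outside the range seen during calibration.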
