Benchmark the optimized models
Benchmark these models on the hardware you intend to deploy to. TensorFlow includes optimized 8-bit routines for Arm CPUs but not for x86, so 8-bit models run much slower on an x86-based laptop than on a mobile Arm device. Benchmarking varies by platform; on Android, build the TensorFlow benchmark application with this command:
bazel build -c opt --cxxopt=--std=c++11 --crosstool_top=//external:android/crosstool --cpu=armeabi-v7a --host_crosstool_top=@bazel_tools//tools/cpp:toolchain tensorflow/tools/benchmark:benchmark_model
With the Android deployment device connected (this example uses a HiKey 960), run:
adb shell "mkdir -p /data/local/tmp"
adb push bazel-bin/tensorflow/tools/benchmark/benchmark_model /data/local/tmp
adb push optimized_resnetv1_50_fp32.pb /data/local/tmp
adb push optimized_resnetv1_50_int8.pb /data/local/tmp
The benchmarks are run on a single core (num_threads=1) or four cores (num_threads=4). The commands below show the single-core runs; change --num_threads=1 to --num_threads=4 for the four-core runs:
adb shell '/data/local/tmp/benchmark_model \
  --num_threads=1 \
  --graph=/data/local/tmp/optimized_resnetv1_50_fp32.pb \
  --input_layer="Placeholder" \
  --input_layer_shape="1,224,224,3" \
  --input_layer_type="float" \
  --output_layer="resnet_v1_50/predictions/Reshape_1"'
adb shell '/data/local/tmp/benchmark_model \
  --num_threads=1 \
  --graph=/data/local/tmp/optimized_resnetv1_50_int8.pb \
  --input_layer="Placeholder" \
  --input_layer_shape="1,224,224,3" \
  --input_layer_type="float" \
  --output_layer="resnet_v1_50/predictions/Reshape_1"'
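Running every model and thread-count combination by hand is easy to get wrong, so it can help to generate the commands. The following is a minimal sketch that only builds and prints the four adb invocations; the flags are copied from the commands above, while the helper names (`benchmark_command`, `all_commands`) and the constants are hypothetical and chosen for illustration.

```python
import shlex

# Paths and file names taken from the adb push steps above.
DEVICE_DIR = "/data/local/tmp"
MODELS = ["optimized_resnetv1_50_fp32.pb", "optimized_resnetv1_50_int8.pb"]


def benchmark_command(model, num_threads):
    """Return the adb argument list for one benchmark_model run.

    Only --graph and --num_threads vary; the other flags are the ones
    used in the commands above.
    """
    device_cmd = " ".join([
        f"{DEVICE_DIR}/benchmark_model",
        f"--num_threads={num_threads}",
        f"--graph={DEVICE_DIR}/{model}",
        '--input_layer="Placeholder"',
        '--input_layer_shape="1,224,224,3"',
        '--input_layer_type="float"',
        '--output_layer="resnet_v1_50/predictions/Reshape_1"',
    ])
    return ["adb", "shell", device_cmd]


def all_commands():
    """Build one command per (model, thread count) pair."""
    return [benchmark_command(m, t) for m in MODELS for t in (1, 4)]


if __name__ == "__main__":
    for cmd in all_commands():
        # Print a copy-pasteable command line; nothing is executed here.
        print(shlex.join(cmd))
```

To execute the runs instead of printing them, each argument list could be passed to `subprocess.run(cmd, check=True)` on a machine with the device attached.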
Alternatively, deploy the models directly in your application, on iOS, Linux or Android, and use real-world performance measurements to compare the models.
Accuracy should always be evaluated using your own data, as the impact of quantization on accuracy can vary. In terms of compute performance on our HiKey 960 development platform, we see the following:
|Model type||Processor||Model size||1-batch inference time|
|32-bit floating point||Arm Cortex A73||97 MB||1794 ms|
|8-bit integer||Arm Cortex A73||25 MB||935 ms|
|32-bit floating point||4x Arm Cortex A73||97 MB||567 ms|
|8-bit integer||4x Arm Cortex A73||25 MB||522 ms|
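To read the table as relative gains, the latencies can be turned into speedup ratios. The sketch below uses only the four latency figures reported above; the dictionary layout and the `speedup` helper are illustrative assumptions, not part of any TensorFlow tooling.

```python
# Latencies in milliseconds, copied from the table above.
# Keys are (model type, number of cores).
latency_ms = {
    ("fp32", 1): 1794,
    ("int8", 1): 935,
    ("fp32", 4): 567,
    ("int8", 4): 522,
}


def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline latency to optimized latency (>1 means faster)."""
    return baseline_ms / optimized_ms


# Gain from quantization at a fixed core count.
quant_1core = speedup(latency_ms[("fp32", 1)], latency_ms[("int8", 1)])
quant_4core = speedup(latency_ms[("fp32", 4)], latency_ms[("int8", 4)])

# Gain from going from one core to four for a fixed model.
cores_fp32 = speedup(latency_ms[("fp32", 1)], latency_ms[("fp32", 4)])
cores_int8 = speedup(latency_ms[("int8", 1)], latency_ms[("int8", 4)])

print(f"int8 vs fp32, 1 core:  {quant_1core:.2f}x")
print(f"int8 vs fp32, 4 cores: {quant_4core:.2f}x")
print(f"4 cores vs 1, fp32:    {cores_fp32:.2f}x")
print(f"4 cores vs 1, int8:    {cores_int8:.2f}x")
```

On these figures, quantization gives roughly a 1.9x speedup on one core but only about 1.1x on four, so the benefit on this platform appears to depend strongly on the thread count.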
Improving model performance
ResNet-50 on a 224x224x3 image uses around 7 billion operations per inference. Consider whether your application really needs a high input resolution to capture fine details: running ResNet-50 on a 160x160 image would almost halve the number of operations (160²/224² ≈ 0.51) and roughly double the speed. For even faster inference in image processing workloads, investigate performance-optimized models such as the MobileNet family of networks. The MobileNet family allows you to scale computation down by a factor of a thousand, so you can size the model to meet a wide range of FPS targets on existing hardware for a modest accuracy penalty.
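The resolution estimate can be checked with a short calculation. This is a rough sketch that assumes the operation count of a convolutional network scales with the number of input pixels, and that latency scales with the operation count; neither holds exactly in practice. The 7 billion figure comes from the text above, and the function name `relative_ops` is hypothetical.

```python
BASE_SIDE = 224   # input side length used for the 7 billion figure
BASE_OPS = 7e9    # approximate operations per inference at 224x224


def relative_ops(side, base_side=BASE_SIDE):
    """Approximate op-count ratio, assuming cost scales with pixel count."""
    return (side / base_side) ** 2


ratio = relative_ops(160)       # (160 / 224)^2 = 25 / 49
est_ops = BASE_OPS * ratio      # estimated operations at 160x160

print(f"relative cost at 160x160: {ratio:.2f}")
print(f"estimated operations:     {est_ops / 1e9:.2f} billion")
print(f"ideal speedup:            {1 / ratio:.2f}x")
```

Under these assumptions the 160x160 input needs about 51% of the original operations, which is where the "almost halve" estimate comes from; measured speedups on a device may differ.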