Benchmark the optimized models

It is important to benchmark these models on real hardware. TensorFlow contains optimized 8-bit routines for Arm CPUs but not for x86, so 8-bit models will run much more slowly on an x86-based laptop than on an Arm-based mobile device. Benchmarking varies by platform; on Android you can build the TensorFlow benchmark application with this command:

bazel build -c opt --cxxopt=--std=c++11 --crosstool_top=//external:android/crosstool --cpu=armeabi-v7a --host_crosstool_top=@bazel_tools//tools/cpp:toolchain tensorflow/tools/benchmark:benchmark_model
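If the deployment device runs 64-bit Android, you may be able to target the 64-bit ABI instead. This is a sketch, assuming your NDK toolchain supports the arm64-v8a target:

bazel build -c opt --cxxopt=--std=c++11 --crosstool_top=//external:android/crosstool --cpu=arm64-v8a --host_crosstool_top=@bazel_tools//tools/cpp:toolchain tensorflow/tools/benchmark:benchmark_model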

With the Android deployment device connected (this example uses a HiKey 960), run:

adb shell "mkdir -p /data/local/tmp"
adb push bazel-bin/tensorflow/tools/benchmark/benchmark_model /data/local/tmp
adb push optimized_resnetv1_50_fp32.pb /data/local/tmp
adb push optimized_resnetv1_50_int8.pb /data/local/tmp
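On some devices the pushed binary is not executable by default; marking it executable avoids a "permission denied" error when it is run:

adb shell chmod +x /data/local/tmp/benchmark_model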

Run the benchmarks on a single core with --num_threads=1, as in the two commands below, or on four cores by setting --num_threads=4 (a four-core example follows the single-core commands):

adb shell '/data/local/tmp/benchmark_model \
 --num_threads=1 \
 --graph=/data/local/tmp/optimized_resnetv1_50_fp32.pb \
 --input_layer="Placeholder" \
 --input_layer_shape="1,224,224,3" \
 --input_layer_type="float" \
 --output_layer="resnet_v1_50/predictions/Reshape_1"'
adb shell '/data/local/tmp/benchmark_model \
 --num_threads=1 \
 --graph=/data/local/tmp/optimized_resnetv1_50_int8.pb \
 --input_layer="Placeholder" \
 --input_layer_shape="1,224,224,3" \
 --input_layer_type="float" \
 --output_layer="resnet_v1_50/predictions/Reshape_1"'

Alternatively, deploy the models directly in your application on iOS, Linux, or Android and compare them using real-world performance measurements.
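If your target is a Linux machine, a similar approach is to build and run the same benchmark tool natively on the target. This is a sketch, assuming a working Bazel setup on the target and the optimized graphs in the current directory:

bazel build -c opt tensorflow/tools/benchmark:benchmark_model
./bazel-bin/tensorflow/tools/benchmark/benchmark_model \
 --num_threads=1 \
 --graph=optimized_resnetv1_50_fp32.pb \
 --input_layer="Placeholder" \
 --input_layer_shape="1,224,224,3" \
 --input_layer_type="float" \
 --output_layer="resnet_v1_50/predictions/Reshape_1"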

Accuracy should always be evaluated using your own data, as the impact of quantization on accuracy can vary. In terms of compute performance on our HiKey 960 development platform, we see the following:

Model type             Processor          Model size  1-batch inference time
32-bit floating point  Arm Cortex-A73     97 MB       1794 ms
8-bit integer          Arm Cortex-A73     25 MB       935 ms
32-bit floating point  4x Arm Cortex-A73  97 MB       567 ms
8-bit integer          4x Arm Cortex-A73  25 MB       522 ms

Improving model performance

ResNet-50 on a 224x224x3 image uses around 7 billion operations per inference. It is worth considering whether your application really needs that input resolution to resolve fine details: because the convolutional work scales roughly with the number of input pixels, running ResNet-50 on a 160x160 image (about half as many pixels) would almost halve the number of operations and roughly double the speed. For even faster inference on image-processing workloads, investigate performance-optimized models such as the MobileNet family, whose width and resolution multipliers let you scale computation down by up to a factor of a thousand, making it possible to meet a wide range of FPS targets on existing hardware for a modest accuracy penalty. These models can be benchmarked with the same tool, as in the sketch below.
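As a sketch, assuming you have downloaded the standard frozen MobileNet V1 graph (mobilenet_v1_1.0_224_frozen.pb), whose input and output tensors are conventionally named "input" and "MobilenetV1/Predictions/Reshape_1", you could benchmark it on the device like this:

adb push mobilenet_v1_1.0_224_frozen.pb /data/local/tmp
adb shell '/data/local/tmp/benchmark_model \
 --num_threads=1 \
 --graph=/data/local/tmp/mobilenet_v1_1.0_224_frozen.pb \
 --input_layer="input" \
 --input_layer_shape="1,224,224,3" \
 --input_layer_type="float" \
 --output_layer="MobilenetV1/Predictions/Reshape_1"'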
