There are many ways to deploy a trained neural network model to a mobile or embedded device. Several frameworks support Arm, including TensorFlow, PyTorch, Caffe2, MxNet, and CNTK, on various platforms such as Android, iOS, and Linux. The deployment process for each is similar, but every framework and operating system may use different tools. This walkthrough looks specifically at preparing TensorFlow models for deployment on Android, Linux, and iOS.
Optimization of a trained neural network model with TensorFlow follows these steps:
- Determine the names of the input and output nodes in the graph and the dimensions of the input data.
- Generate an optimized 32-bit model using TensorFlow's transform_graph tool.
- Generate an optimized 8-bit model that is more efficient but less accurate using TensorFlow's transform_graph tool.
- Benchmark the optimized models on-device and select the one that best meets your deployment needs.
This tutorial goes through each step in turn, using a pretrained ResNet-50 model (resnet_v2_50.pb). The process is the same for other models, although input and output node names differ.
At the end of this tutorial, you will be ready to deploy your model on your chosen platform.
Before you begin
This tutorial assumes you already have a TensorFlow .pb model file using 32-bit floating point weights. If your model is in a different format and you want to deploy it using TensorFlow, you need to use a tool to convert it to the TensorFlow format first.
Various projects and resources are building up around converting model formats. Both MMdnn and Deep learning model converter are useful resources, and the ONNX format has the potential to vastly simplify this in the future.
The most important preparation that you can do is to ensure that the size and complexity of your trained model is suitable for the device that you intend to run it on. To ensure this:
- If you are using a pre-trained model for feature extraction or transfer learning, then you should consider using mobile-optimized versions such as MobileNet, TinyYolo, and so on.
- If you designed the architecture yourself, then you should consider adapting the architecture for faster execution and smaller size. An example of this is using depth-separable convolutions where possible, as in MobileNet. Adapting the architecture gives better performance and accuracy than post-processing the model file after training.
Follow the TensorFlow documentation to install Bazel and other dependencies, and download the TensorFlow source code. At the root of your TensorFlow source tree, run ./configure to configure the system build.
This tutorial uses TensorFlow's graph_transforms tool, which is built from the TensorFlow source with this command:
bazel build tensorflow/tools/graph_transforms:transform_graph
For more details on how to install and build TensorFlow, see the TensorFlow documentation.
Determine the names of input and output nodes
Skip this step if you can determine the names of the input and output nodes from the provider of your model or the training code. However, this step also demonstrates how to visualize the computational graph in a neural network model. This will help you to understand what will be executed at runtime and how the various transform_graph operations affect the structure of the model in practice.
The simplest way to visualize graphs is to use TensorBoard. To install TensorBoard, enter the following on the command line:
pip3 install tensorboard
Download resnet_v2_50.pb from here. Use the script provided in the TensorFlow source distribution to import model (.pb) files to TensorBoard by entering the following on the command line:
python3 tensorflow_core/python/tools/import_pb_to_tensorboard.py --model_dir resnet_v2_50.pb --log_dir /tmp/tensorboard
Do not let the name of the argument model_dir confuse you. A .pb file is an acceptable target.
Important: If you repeat this command for importing multiple models, empty the /tmp/tensorboard directory after each import to prevent confusion.
Start TensorBoard by running tensorboard --logdir=/tmp/tensorboard. Then, in a browser, navigate to http://localhost:6006/ and select the GRAPHS tab to see the model's graph. If the graph is collapsed under a single node, use the expand control until you get to the actual operations.
In this network, we can see that Placeholder is the only input node and Reshape_1 is the output node. To get their full names, select them and look at the details box on the right.
Here, the name of the input node is import/input/Placeholder and the output node is import/resnet_v2_50/predictions/Reshape_1.
The details box also shows that this model has been trained with the standard 224x224x3 input size which is typical for a ResNet architecture. The transform_graph tool assumes the import namespace, so these are simplified to:
- Input name: Placeholder.
- Input dimensions: 1x224x224x3.
- Output name: resnet_v2_50/predictions/Reshape_1.
We can also use the summarize_graph tool to inspect the model and provide guesses about likely input and output nodes, as well as other information that is useful for debugging. Here is an example of how to use it on the resnet_v2_50 graph:
$ bazel build tensorflow/tools/graph_transforms:summarize_graph
$ bazel-bin/tensorflow/tools/graph_transforms/summarize_graph --in_graph=resnet_v2_50.pb
Found 1 possible inputs: (name=input, type=float(1), shape=[1,224,224,3])
No variables spotted.
Found 1 possible outputs: (name=resnet_v2_50/predictions/Reshape_1, op=Reshape)
Found 25615936 (25.61M) const parameters, 0 (0) variable parameters, and 0 control_edges
Op types used: 328 Const, 272 Identity, 147 Mul, 114 Add, 54 Conv2D, 49 Relu, 49 Rsqrt, 49 Sub, 22 BiasAdd, 4 MaxPool, 4 Pad, 2 Reshape, 1 Mean, 1 Placeholder, 1 Softmax, 1 Squeeze
To use with tensorflow/tools/benchmark:benchmark_model try these arguments:
bazel run tensorflow/tools/benchmark:benchmark_model -- --graph=resnet_v2_50.pb --show_flops --input_layer=input --input_layer_type=float --input_layer_shape=1,224,224,3 --output_layer=resnet_v2_50/predictions/Reshape_1
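If you want to reuse the node names in later commands without copying them by hand, the text output of summarize_graph can be parsed. The following is a minimal sketch that assumes the output keeps the layout shown above; the regular expressions and the parse_summary helper are illustrative and are not part of TensorFlow.

```python
import re

# Sample text copied from the summarize_graph output shown above.
SAMPLE_OUTPUT = """\
Found 1 possible inputs: (name=input, type=float(1), shape=[1,224,224,3])
No variables spotted.
Found 1 possible outputs: (name=resnet_v2_50/predictions/Reshape_1, op=Reshape)
"""

# Assumed layout of one input entry and one output entry.
INPUT_RE = re.compile(r"\(name=([^,]+), type=(\w+)\(\d+\), shape=\[([\d,?-]*)\]\)")
OUTPUT_RE = re.compile(r"\(name=([^,]+), op=(\w+)\)")


def parse_summary(text):
    """Return (inputs, outputs) parsed from summarize_graph text output."""
    inputs, outputs = [], []
    for line in text.splitlines():
        if "possible inputs" in line:
            for name, dtype, shape in INPUT_RE.findall(line):
                # The shape stays a comma-separated string, which is the
                # form that the benchmark --input_layer_shape flag expects.
                inputs.append({"name": name, "type": dtype, "shape": shape})
        elif "possible outputs" in line:
            for name, op in OUTPUT_RE.findall(line):
                outputs.append({"name": name, "op": op})
    return inputs, outputs


inputs, outputs = parse_summary(SAMPLE_OUTPUT)
print(inputs)
print(outputs)
```

The parsed names should still be checked against the graph in TensorBoard, because summarize_graph only reports its best guesses.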
Generate an optimized 32-bit model
This step primarily removes unnecessary nodes and ensures that the operations that are used are available in the TensorFlow distributions on mobile devices. One way it does this is by removing training-specific operations in the model's computational graph.
To generate your model, enter the following on the command line:
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=resnet_v2_50.pb \
  --out_graph=optimized_resnet_v2_50_fp32.pb \
  --inputs='Placeholder' \
  --outputs='resnet_v2_50/predictions/Reshape_1' \
  --transforms='
    strip_unused_nodes(type=float, shape="1,224,224,3")
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms'
The input and output names depend on your model. The shape is also included here as part of the strip_unused_nodes command.
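Because the input name, output name, and shape change from model to model, it can help to assemble the command from parameters. The sketch below only builds and prints the command line; the transform_graph_command helper is a hypothetical name introduced for illustration, and joining several node names with commas is an assumption about how the tool accepts multiple inputs or outputs.

```python
import shlex

# Path of the tool as built earlier in this tutorial.
TOOL = "bazel-bin/tensorflow/tools/graph_transforms/transform_graph"

# The same transform list as the command above.
FP32_TRANSFORMS = [
    'strip_unused_nodes(type=float, shape="1,224,224,3")',
    "fold_constants(ignore_errors=true)",
    "fold_batch_norms",
    "fold_old_batch_norms",
]


def transform_graph_command(in_graph, out_graph, inputs, outputs, transforms):
    """Build the argument list for one transform_graph run (hypothetical helper)."""
    return [
        TOOL,
        f"--in_graph={in_graph}",
        f"--out_graph={out_graph}",
        f"--inputs={','.join(inputs)}",    # assumption: comma-separated names
        f"--outputs={','.join(outputs)}",  # assumption: comma-separated names
        "--transforms=" + " ".join(transforms),
    ]


argv = transform_graph_command(
    in_graph="resnet_v2_50.pb",
    out_graph="optimized_resnet_v2_50_fp32.pb",
    inputs=["Placeholder"],
    outputs=["resnet_v2_50/predictions/Reshape_1"],
    transforms=FP32_TRANSFORMS,
)

# Print a shell-quoted form for inspection. To execute it instead, pass the
# list to subprocess.run(argv, check=True) from the TensorFlow source root.
print(shlex.join(argv))
```

Passing the arguments as a list avoids shell quoting problems with the parentheses and double quotes inside the transforms string.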
If you encounter any problems, the TensorFlow documentation covers this step in more detail.
The transformed model does not differ greatly in size or speed from the training model. The difference is that it is capable of being loaded by the TensorFlow distribution for mobile devices, which may not implement all training operators.
It is a good idea to verify that this inference-ready model runs with the same accuracy as your trained one. How you do this depends on your training/test workflow and datasets.
You can now distribute this model and deploy it with TensorFlow on both mobile and embedded or low-power Linux devices.
Generate an optimized 8-bit model
Most trained models use 32-bit floating-point numbers to represent their weights. Research has shown that for many networks you can reduce the precision of these numbers and store them in 8-bit integers, reducing the model size by a factor of 4. This has benefits when distributing and loading models on mobile devices.
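To make the idea concrete, the following sketch quantizes a handful of made-up weights with a generic min/max linear scheme and then restores them. It is an illustration of the principle only, under the assumption of one range per tensor; it is not claimed to match the exact algorithm that TensorFlow's transforms use, and the weight values are invented.

```python
import struct


def quantize(weights):
    """Map floats onto 256 evenly spaced levels between their min and max."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for a constant tensor
    codes = bytes(round((w - lo) / scale) for w in weights)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate float values from the 8-bit codes."""
    return [lo + c * scale for c in codes]


# Invented example weights, not taken from any real model.
weights = [-0.51, -0.02, 0.0, 0.13, 0.49, 0.75]

codes, lo, scale = quantize(weights)
restored = dequantize(codes, lo, scale)

# Storage cost: 4 bytes per float32 value versus 1 byte per 8-bit code.
fp32_bytes = struct.pack(f"<{len(weights)}f", *weights)
size_ratio = len(fp32_bytes) / len(codes)

# Rounding to the nearest level keeps the error within half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("size ratio:", size_ratio)
print("step size:", scale)
print("max round-trip error:", max_error)
```

The round-trip error is bounded by half the step size, which is why networks whose weights fall in a narrow range tend to tolerate this kind of quantization well.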
In theory, 8-bit integer models can also execute faster than 32-bit floating-point models because there is less data to move and simpler integer arithmetic operations can be used for multiplication and accumulation. However, not all commonly used layers have 8-bit implementations in TensorFlow 1.4. This means that a quantized model may spend more time converting data between 8-bit and 32-bit formats for different layers than it saves through faster execution.
The optimal balance between speed, size, and accuracy varies by model, application, and hardware platform. So, it is best to quantize all deployed models and compare them to the unquantized version on the deployment hardware itself. You can quantize a neural network using TensorFlow's graph_transforms tool with the following command:
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=resnet_v2_50.pb \
  --out_graph=optimized_resnet_v2_50_int8.pb \
  --inputs='Placeholder' \
  --outputs='resnet_v2_50/predictions/Reshape_1' \
  --transforms='
    add_default_attributes
    strip_unused_nodes(type=float, shape="1,224,224,3")
    remove_nodes(op=Identity, op=CheckNumerics)
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms
    quantize_weights
    quantize_nodes
    strip_unused_nodes
    sort_by_execution_order'
The quantized network should be significantly smaller than the trained model. In this case, optimized_resnet_v2_50_fp32.pb is 102MB whereas optimized_resnet_v2_50_int8.pb is 27MB. The smaller size matters in its own right when the model is distributed as part of a mobile application, quite apart from any inference speed improvements. If size is the primary concern, read the TensorFlow documentation and try the round_weights transform, which reduces the size of the compressed model for deployment without improving speed or affecting accuracy.
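These file sizes can be sanity-checked against the parameter count that summarize_graph reported earlier (25615936 const parameters). The sketch below assumes that the weights dominate the file size and that 1MB means one million bytes; the helper name is illustrative.

```python
# Parameter count taken from the summarize_graph output earlier in this tutorial.
CONST_PARAMETERS = 25_615_936


def weights_megabytes(num_parameters, bytes_per_parameter):
    """Estimate weight storage in MB, assuming 1 MB = 1,000,000 bytes."""
    return num_parameters * bytes_per_parameter / 1e6


fp32_mb = weights_megabytes(CONST_PARAMETERS, 4)  # 32-bit floats: 4 bytes each
int8_mb = weights_megabytes(CONST_PARAMETERS, 1)  # 8-bit integers: 1 byte each

print(f"float32 weights: {fp32_mb:.1f} MB")
print(f"8-bit weights:   {int8_mb:.1f} MB")
```

The float estimate comes to roughly 102MB, in line with the 32-bit file. The 8-bit estimate is slightly below the 27MB file, which is consistent with the quantized graph also storing extra data such as value ranges and graph structure, although this tutorial does not break that overhead down.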
There are several variants of this step, which can include performing extra fine-tuning passes on the model or instrumented runs to determine quantization ranges. To learn more about this and the corresponding deployment workflow with TensorFlow Lite, see the TensorFlow Quantization Guide.
Benchmark the optimized models
It is important to benchmark these models on real hardware. TensorFlow contains optimized 8-bit routines for Arm CPUs but not for x86, so 8-bit models run much more slowly on an x86-based laptop than on a mobile Arm device. Benchmarking varies by platform; on Android you can build the TensorFlow benchmark application with this command:
bazel build -c opt --cxxopt='--std=c++11' --config=android_arm tensorflow/tools/benchmark:benchmark_model
With the Android deployment device connected (this example uses a HiKey 960), run:
adb shell "mkdir -p /data/local/tmp"
adb push bazel-bin/tensorflow/tools/benchmark/benchmark_model /data/local/tmp
adb push optimized_resnet_v2_50_fp32.pb /data/local/tmp
adb push optimized_resnet_v2_50_int8.pb /data/local/tmp
The benchmarks are run on a single core (num_threads=1) or four cores (num_threads=4). The following commands use a single core; change the value to --num_threads=4 to run on four cores:
adb shell '/data/local/tmp/benchmark_model \
  --num_threads=1 \
  --graph=/data/local/tmp/optimized_resnet_v2_50_fp32.pb \
  --input_layer="Placeholder" \
  --input_layer_shape="1,224,224,3" \
  --input_layer_type="float" \
  --output_layer="resnet_v2_50/predictions/Reshape_1"'
adb shell '/data/local/tmp/benchmark_model \
  --num_threads=1 \
  --graph=/data/local/tmp/optimized_resnet_v2_50_int8.pb \
  --input_layer="Placeholder" \
  --input_layer_shape="1,224,224,3" \
  --input_layer_type="float" \
  --output_layer="resnet_v2_50/predictions/Reshape_1"'
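To cover both models and both thread counts without editing the command by hand, the four runs can be generated in a loop. This is a minimal sketch that only builds and prints the adb command lines; the benchmark_command helper is a hypothetical name, and the flags are the ones used in the commands above.

```python
import shlex

DEVICE_DIR = "/data/local/tmp"
MODELS = ["optimized_resnet_v2_50_fp32.pb", "optimized_resnet_v2_50_int8.pb"]
THREAD_COUNTS = [1, 4]


def benchmark_command(model, num_threads):
    """Build one 'adb shell' invocation of benchmark_model (hypothetical helper)."""
    remote = " ".join([
        f"{DEVICE_DIR}/benchmark_model",
        f"--num_threads={num_threads}",
        f"--graph={DEVICE_DIR}/{model}",
        '--input_layer="Placeholder"',
        '--input_layer_shape="1,224,224,3"',
        '--input_layer_type="float"',
        '--output_layer="resnet_v2_50/predictions/Reshape_1"',
    ])
    # The whole remote command is passed to adb as a single argument.
    return ["adb", "shell", remote]


commands = [benchmark_command(m, t) for m in MODELS for t in THREAD_COUNTS]

for argv in commands:
    # Printed for inspection. To execute, use subprocess.run(argv, check=True)
    # with the device connected.
    print(shlex.join(argv))
```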
Alternatively, deploy the models directly in your application, on iOS, Linux or Android, and use real-world performance measurements to compare the models.
Accuracy should always be evaluated using your own data, as the impact of quantization on accuracy can vary. In terms of compute performance on our HiKey 960 development platform, we see the following:
|Model type||Processor||Model size||1-batch inference|
|32-bit floating point||Arm Cortex-A73||97MB||1794ms|
|8-bit integer||Arm Cortex-A73||25MB||935ms|
|32-bit floating point||4x Arm Cortex-A73||102MB||583ms|
|8-bit integer||4x Arm Cortex-A73||27MB||524ms|
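The relative gains are easier to compare as ratios. The sketch below computes them from the latencies in the table, taking the final row to be the 8-bit model on four cores; the dictionary layout and variable names are illustrative.

```python
# Latency in milliseconds, keyed by (model type, number of cores),
# copied from the table above.
LATENCY_MS = {
    ("fp32", 1): 1794,
    ("int8", 1): 935,
    ("fp32", 4): 583,
    ("int8", 4): 524,
}


def speedup(baseline, candidate):
    """Return how many times faster the candidate is than the baseline."""
    return LATENCY_MS[baseline] / LATENCY_MS[candidate]


quantization_gain_1_core = speedup(("fp32", 1), ("int8", 1))
quantization_gain_4_cores = speedup(("fp32", 4), ("int8", 4))
threading_gain_fp32 = speedup(("fp32", 1), ("fp32", 4))
threading_gain_int8 = speedup(("int8", 1), ("int8", 4))

print(f"8-bit vs 32-bit, 1 core:   {quantization_gain_1_core:.2f}x")
print(f"8-bit vs 32-bit, 4 cores:  {quantization_gain_4_cores:.2f}x")
print(f"4 cores vs 1 core, 32-bit: {threading_gain_fp32:.2f}x")
print(f"4 cores vs 1 core, 8-bit:  {threading_gain_int8:.2f}x")
```

On these figures, quantization nearly doubles single-core speed but gives only about a 10% gain on four cores, where the 32-bit model scales better with the extra threads.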
Improving model performance
ResNet-50 on a 224x224x3 image uses around 7 billion operations per inference. It is worth considering whether your application requires a high resolution for fine details in the input, as running ResNet-50 on a 160x160 image would almost halve the number of operations and double the speed. For even faster inference of image-processing workloads, investigate performance-optimized models such as the MobileNet family of networks. The MobileNet family allows you to scale computation down by a factor of a thousand, enabling you to scale the model to meet a wide range of FPS targets on existing hardware for a modest accuracy penalty.
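The effect of input resolution can be estimated with simple arithmetic. This sketch assumes that the operation count of a convolutional network scales with the number of input pixels, which is a rough approximation that ignores layers with fixed cost; the 7 billion figure is the one quoted above.

```python
BASE_SIDE = 224  # input width and height used in this tutorial
BASE_OPS = 7e9   # approximate operations per inference at 224x224


def relative_cost(new_side, base_side=BASE_SIDE):
    """Ratio of pixel counts between two square input sizes."""
    return (new_side / base_side) ** 2


ratio = relative_cost(160)
estimated_ops = BASE_OPS * ratio

print(f"160x160 needs about {ratio:.0%} of the operations")
print(f"estimated operations per inference: {estimated_ops:.2e}")
```

The ratio comes out at about 0.51, which is where the estimate of almost halving the operation count comes from.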
Deploy the optimized models
The exact deployment method depends on your platform. Use the links in the table to access resources for deploying on your platform:
|Android||TensorFlow Java or C++ interface|
|iOS||CoreML converter (no 8-bit)|
|iOS||TensorFlow C++ interface|
|Linux||TensorFlow C++ interface|
Here are some resources related to material in this guide:
This guide demonstrates how to use the TensorFlow Graph Transform Tool to optimize a frozen TensorFlow graph before deploying it in production. There are different types of optimization. One makes the model smaller and faster without losing accuracy. The other reduces the precision of the weights, usually from FP32 to FP16 or INT8. When deploying your model to a phone or embedded device, you can optimize away batch normalization and other training-only features. The TensorFlow Graph Transform framework offers a suite of tools for modifying computational graphs, and a framework that makes it easy to write your own modifications.