PyArmNN object recognition

This section of the guide describes how to use PyArmNN to perform object recognition, using the PyArmNN Object Detection Sample Application as an example.

The sample application takes a model and video file or camera feed as input. The application then runs inference on each frame. Finally, the application draws bounding boxes around detected objects, with the corresponding labels and confidence scores overlaid.

The following image shows an example of one frame from the resulting video file with bounding boxes and confidence scores:

The PyArmNN Object Detection Sample Application performs the following steps:

  1. Initialization

    1. Read from the video source
    2. Prepare labels and model-specific functions
  2. Create a network

    1. Create the parser and import a graph
    2. Optimize the graph for the compute device
    3. Create input and output binding information
  3. Object detection pipeline

    1. Preprocess the captured frame
    2. Make the input and output tensors
    3. Execute inference
  4. Postprocessing

    1. Decode and process the inference output
    2. Draw the bounding boxes
    3. Get the example code
    4. Run the application

The following subsections describe these steps.

Read from the video source

The application parses the supplied user arguments and loads the specified video file or stream into an OpenCV cv2.VideoCapture object. The application then uses this object to capture frames from the source with the read() function.

The VideoCapture object also provides information about the source, like the frame rate and resolution of the input video. The application uses this information to create a cv2.VideoWriter object. This object is used at the end of every loop to write the processed frame to an output video file.
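The capture and writer setup described above can be sketched as follows. This is a minimal illustration, not the sample application's own code: the function names open_video_io and process_video, the 'mp4v' codec, and the process_frame callback are assumptions introduced here. Only cv2.VideoCapture, cv2.VideoWriter, and read() come from the text.

```python
def open_video_io(video_path, output_path):
    """Open a video source and a writer that matches its frame rate and size."""
    # Imported inside the function so the sketch loads on machines without OpenCV.
    import cv2

    video = cv2.VideoCapture(video_path)
    if not video.isOpened():
        raise RuntimeError(f"Cannot open video source: {video_path}")

    # Query the source properties needed to configure the writer.
    fps = video.get(cv2.CAP_PROP_FPS)
    width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # 'mp4v' is an assumed codec choice; the sample may use a different one.
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_path, fourcc, fps, (width, height))
    return video, writer


def process_video(video_path, output_path, process_frame=lambda frame: frame):
    """Read every frame, apply process_frame, and write the result."""
    video, writer = open_video_io(video_path, output_path)
    try:
        while True:
            ok, frame = video.read()
            if not ok:  # End of stream or read failure.
                break
            writer.write(process_frame(frame))
    finally:
        video.release()
        writer.release()
```

In a full pipeline, process_frame would run preprocessing, inference, and drawing for each frame.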

Prepare labels and model-specific functions

To interpret the inference result on the loaded network, an application must load the labels that are associated with the model. In the sample application, the dict_labels() function creates a dictionary that is keyed on the classification index at the output node of the model. The values in the dictionary map each label to a randomly generated RGB color. This mapping means that each class has a unique color, which is useful when plotting the bounding boxes of detected objects in a frame.
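As an illustration of this mapping, the sketch below builds a dictionary from class index to a (label, color) pair. The name build_label_dict, the seed parameter, and the example label names are hypothetical; the sample's dict_labels() function may read and store labels differently.

```python
import random


def build_label_dict(label_names, seed=None):
    """Map each class index to a (label, RGB color) pair."""
    rng = random.Random(seed)  # A seed is optional; it keeps colors stable between runs.
    labels = {}
    for index, name in enumerate(label_names):
        color = tuple(rng.randint(0, 255) for _ in range(3))
        labels[index] = (name.strip(), color)
    return labels


# Example label names are placeholders, not the contents of a real labels file.
labels = build_label_dict(["person", "bicycle", "car"], seed=0)
label, color = labels[2]
```

Looking up labels[class_index] then yields both the text and the color to use when drawing a detection.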

The application uses the user-specified model name to look up the functions that decode and process the inference output, and returns them along with a resize factor. This resize factor is used when plotting bounding boxes, to ensure that they are scaled to their correct position in the original frame.

Create the parser and import a graph

A PyArmNN application must import a graph from a file using an appropriate parser. Arm NN provides parsers for various model file types, including TFLite, TensorFlow, and ONNX. These parsers are libraries that load neural networks of various formats into the Arm NN runtime.

Because both the YOLO v3 and SSD models are in the TFLite format, the sample application uses the TFLite parser armnnTfLiteParser to process the models.

The CreateNetworkFromBinaryFile() function creates the parser and loads the network file. The parser then constructs the underlying Arm NN graph from the network file.

Optimize the graph for the compute device

Arm NN supports optimized execution on multiple CPU, GPU, and Ethos-N NPU devices. Before executing a graph, the application must select the appropriate device context by using IRuntime() to create a runtime context with default options.

We can optimize the imported graph by specifying a list of backends in order of preference and implementing backend-specific optimizations. A unique string identifies each one of these backends. For example:

  • CpuAcc represents the CPU backend.
  • GpuAcc represents the GPU backend.
  • CpuRef represents the CPU reference kernels.

Arm NN splits the entire graph into subgraphs based on these backends. Each subgraph is then optimized, and the corresponding subgraph in the original graph is substituted with its optimized version.

The Optimize() function optimizes the graph for inference, then LoadNetwork() loads the optimized network onto the compute device. The LoadNetwork() function also creates the backend-specific workloads for the layers and a backend-specific workload factory.
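The parser, runtime, optimization, and loading steps from this subsection and the previous one can be combined into one short sketch. It follows the calls named in the text; the function name load_network and the default backend order are assumptions, and the import is placed inside the function only so that the sketch loads on a machine without PyArmNN installed.

```python
def load_network(model_path, backend_names=("CpuAcc", "CpuRef")):
    """Parse a TFLite model, optimize it, and load it onto the compute device."""
    # Imported here so the sketch can be loaded without PyArmNN installed.
    import pyarmnn as ann

    # Create the parser and import the graph from the model file.
    parser = ann.ITfLiteParser()
    network = parser.CreateNetworkFromBinaryFile(model_path)

    # Create a runtime context with default options.
    options = ann.CreationOptions()
    runtime = ann.IRuntime(options)

    # Backends are tried in order of preference (an assumed order here).
    preferred_backends = [ann.BackendId(name) for name in backend_names]
    opt_network, messages = ann.Optimize(
        network,
        preferred_backends,
        runtime.GetDeviceSpec(),
        ann.OptimizerOptions(),
    )

    # Load the optimized network; this creates the backend-specific workloads.
    net_id, _ = runtime.LoadNetwork(opt_network)
    return parser, runtime, net_id
```

Listing CpuRef last gives a fallback for layers that the accelerated backend does not support.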

Create input and output binding information

Parsers extract the input information for the network. The GetSubgraphInputTensorNames() function extracts all the input names and the GetNetworkInputBindingInfo() function obtains the input binding information of the graph.

The input binding information contains all the essential information about the input. This information is a tuple consisting of:

  • Integer identifiers for bindable layers
  • Tensor information including:

    • Data type
    • Quantization information
    • Number of dimensions
    • Total number of elements

Similarly, we can get the output binding information for an output layer by using the parser to retrieve output tensor names and calling the GetNetworkOutputBindingInfo() function.
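A minimal sketch of collecting the binding information is shown below. The helper name get_binding_info is hypothetical, and the sketch assumes a single input layer and a graph ID of 0, which may not hold for every model. The parser argument is expected to be a parser object that has already loaded a network.

```python
def get_binding_info(parser, graph_id=0):
    """Return the input binding info and a list of output binding infos."""
    # Assumes the model has exactly one input layer.
    input_names = parser.GetSubgraphInputTensorNames(graph_id)
    input_binding_info = parser.GetNetworkInputBindingInfo(graph_id, input_names[0])

    # Detection models usually expose several outputs, so collect all of them.
    output_names = parser.GetSubgraphOutputTensorNames(graph_id)
    output_binding_info = [
        parser.GetNetworkOutputBindingInfo(graph_id, name) for name in output_names
    ]
    return input_binding_info, output_binding_info
```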

Preprocess the captured frame

Each frame that is captured from the video source is read as an ndarray in BGR format. Each frame must then be preprocessed before being passed into the network.

This preprocessing step consists of the following:

  1. Swap channels. In this example, swap BGR to RGB.
  2. Resize the frame to the required resolution.
  3. Expand the dimensions of the array and perform data type conversion to match the model input layer.

You can read input_binding_info to obtain information about the shape and the data type of the input tensor. For example, SSD MobileNet V1 takes an input tensor with shape [1, 300, 300, 3] and data type uint8.
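The three preprocessing steps can be sketched with NumPy alone. The function name preprocess is hypothetical, and the nearest-neighbor index sampling is a dependency-free stand-in for a proper image resize (the sample application uses OpenCV and may also preserve aspect ratio). The 300 x 300 uint8 target matches the SSD MobileNet V1 example above.

```python
import numpy as np


def preprocess(frame, input_height, input_width, dtype=np.uint8):
    """Convert a BGR frame into a batched RGB tensor of the model's input size."""
    # 1. Swap channels: reverse the last axis to turn BGR into RGB.
    rgb = frame[..., ::-1]

    # 2. Resize by nearest-neighbor sampling (a stand-in for an OpenCV resize).
    src_height, src_width = rgb.shape[:2]
    rows = np.arange(input_height) * src_height // input_height
    cols = np.arange(input_width) * src_width // input_width
    resized = rgb[rows[:, None], cols]

    # 3. Add the batch dimension and convert to the model's input data type.
    return np.expand_dims(resized, axis=0).astype(dtype)


# A synthetic 640 x 480 frame that is pure blue in BGR order.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
frame[..., 0] = 255
batch = preprocess(frame, 300, 300)
```

After the channel swap, the blue value sits in the last channel of each pixel, and batch has the shape [1, 300, 300, 3].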

Make the input and output tensors

The make_input_tensors() function produces the input workload tensors.

The make_output_tensors() function produces the output workload tensors.

Execute inference

After the workload tensors are created, the application calls the EnqueueWorkload() function of the runtime context so that the compute device performs inference for the loaded network. Calling the workload_tensors_to_ndarray() function obtains the inference results as a list of ndarrays.
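The tensor creation and inference calls can be sketched together as below. The wrapper name run_inference and its argument layout are assumptions; it expects a runtime, a network ID, and binding information obtained in the earlier steps, plus a preprocessed input array. The import sits inside the function only so that the sketch loads without PyArmNN installed.

```python
def run_inference(runtime, net_id, input_binding_info, output_binding_info, input_data):
    """Run one inference pass and return the results as a list of ndarrays."""
    # Imported here so the sketch can be loaded without PyArmNN installed.
    import pyarmnn as ann

    # Wrap the preprocessed frame in input workload tensors.
    input_tensors = ann.make_input_tensors([input_binding_info], [input_data])

    # Allocate output workload tensors, one per output binding.
    output_tensors = ann.make_output_tensors(output_binding_info)

    # Execute the loaded network on the compute device.
    runtime.EnqueueWorkload(net_id, input_tensors, output_tensors)

    # Convert the output workload tensors into NumPy arrays.
    return ann.workload_tensors_to_ndarray(output_tensors)
```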

Decode and process the inference output

The output from inference must be decoded to obtain information about detected objects in the frame.

The example includes implementations of two networks, but you can implement your own network decoding solution. For more information, see Implementing Your Own Network.

For SSD MobileNet V1 models, the application decodes the results to obtain the bounding box positions, classification index, confidence, and number of detections in the input frame.

For YOLO v3 tiny models, the application decodes the output and performs non-maximum suppression. This suppression filters out any weak detections below a confidence threshold and any redundant bounding boxes above an intersection-over-union (IoU) threshold.

Experiment with different threshold values for confidence and IoU to achieve the best visual results.

Detection results are returned as a list with the following form:

[class index, [box positions], confidence score]

Where [box positions] contains bounding box coordinates in the following form:

[x_min, y_min, x_max, y_max]
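To make the thresholding concrete, the following sketch applies a greedy non-maximum suppression to detections in the list form shown above. It is an illustration under stated assumptions, not the sample's implementation: the function names, the default thresholds of 0.5, and the choice to compare only boxes of the same class are all assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x_min, y_min, x_max, y_max] boxes."""
    x_min = max(box_a[0], box_b[0])
    y_min = max(box_a[1], box_b[1])
    x_max = min(box_a[2], box_b[2])
    y_max = min(box_a[3], box_b[3])
    intersection = max(0.0, x_max - x_min) * max(0.0, y_max - y_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, confidence_threshold=0.5, iou_threshold=0.5):
    """Drop weak detections, then drop boxes that overlap a stronger same-class box."""
    candidates = sorted(
        (d for d in detections if d[2] >= confidence_threshold),
        key=lambda d: d[2],
        reverse=True,
    )
    kept = []
    for detection in candidates:
        class_index, box, _ = detection
        # Keep the box unless a stronger kept box of the same class overlaps it.
        if all(class_index != k[0] or iou(box, k[1]) <= iou_threshold for k in kept):
            kept.append(detection)
    return kept


detections = [
    [0, [10, 10, 110, 110], 0.9],    # Strongest box for class 0.
    [0, [12, 12, 112, 112], 0.8],    # Nearly identical box: redundant.
    [1, [200, 200, 260, 260], 0.7],  # Different class and location.
    [0, [300, 300, 350, 350], 0.2],  # Below the confidence threshold.
]
kept = non_max_suppression(detections)
```

Here kept contains the first and third detections. Raising iou_threshold keeps more overlapping boxes, and lowering confidence_threshold admits weaker detections.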

Draw the bounding boxes

The draw_bounding_boxes() function takes the inference results and draws bounding boxes around detected objects. This function also adds the associated label and confidence score. The labels dictionary that we created in Prepare labels and model-specific functions uses the class index of the detected object as a key to return the associated label and color for that class. The resize factor that we defined in Prepare labels and model-specific functions scales the bounding box coordinates to their correct positions in the original frame.

The processed frames are then written to file or displayed in a separate window.
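The scaling and label lookup that precede drawing can be sketched without OpenCV. The names scale_box and annotate, the clamping to the frame edges, and the text format are assumptions made for illustration; the sample's draw_bounding_boxes() function may differ in these details.

```python
def scale_box(box, resize_factor, frame_width, frame_height):
    """Map a box from model-input coordinates back to the original frame."""
    x_min, y_min, x_max, y_max = (int(value * resize_factor) for value in box)
    # Clamp so the rectangle never extends past the frame edges.
    return (
        max(0, x_min),
        max(0, y_min),
        min(frame_width, x_max),
        min(frame_height, y_max),
    )


def annotate(detection, labels, resize_factor, frame_width, frame_height):
    """Return the scaled box, caption text, and color for one detection."""
    class_index, box, confidence = detection
    label, color = labels[class_index]  # Labels dictionary keyed on class index.
    scaled = scale_box(box, resize_factor, frame_width, frame_height)
    text = f"{label} {confidence * 100:.1f}%"
    return scaled, text, color


# Placeholder label, color, detection, and resize factor for a 640 x 480 frame.
labels = {0: ("person", (255, 0, 0))}
detection = [0, [-5, 60, 150, 250], 0.87]
annotation = annotate(detection, labels, 2.0, 640, 480)
```

The returned box, text, and color could then be passed to OpenCV drawing calls such as cv2.rectangle() and cv2.putText().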

Get the example code

You can find code for our example application, and more instructions, on our GitHub repository.

To use the example:

  1. Install git:
    $ sudo apt install git
  2. Clone the repo:
    $ git clone
  3. Move to the object detection example:
    $ cd armnn/python/pyarmnn/examples/object_detection/
  4. Install the following dependencies on your system:
    $ sudo apt update
    $ sudo apt install libopencv-dev python3-opencv python3-venv
  5. Create a virtual environment:
    $ python3.8 -m venv devenv --system-site-packages
    $ source devenv/bin/activate
  6. Install the following dependencies on the virtual environment:
    $ pip install -r requirements.txt
  7. Download the object detection model from our GitHub repository.
  8. Download a video of your choice as an MP4.

Run the application

Run the model on the video you downloaded:

$ python --video_file_path _video_file_path_ 
    --model_file_path _model_file_path_ 
$ python --video_file_path _video_file_path_ 
    --model_file_path ssd_mobilenet_v1.tflite --model_name ssd_mobilenet_v1 
    --label_path labels.txt
$ python --video_file_path _video_file_path_ 
    --model_file_path yolo_v3_tiny_darknet_fp32.tflite --model_name yolo_v3_tiny 
    --label_path labels.txt
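The flags used in the commands above could be parsed with argparse along the following lines. This is a sketch, not the sample's own argument handling: which flags are required, the help text, and the restriction of --model_name to the two values shown above are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the example commands."""
    parser = argparse.ArgumentParser(description="Object detection on a video file")
    parser.add_argument("--video_file_path", required=True,
                        help="Path to the input video file")
    parser.add_argument("--model_file_path", required=True,
                        help="Path to the TFLite model file")
    # Treated as optional here because the first command above omits them.
    parser.add_argument("--model_name",
                        choices=["ssd_mobilenet_v1", "yolo_v3_tiny"],
                        help="Selects the model-specific decoding functions")
    parser.add_argument("--label_path",
                        help="Path to the labels file for the model")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args([
    "--video_file_path", "input.mp4",
    "--model_file_path", "ssd_mobilenet_v1.tflite",
    "--model_name", "ssd_mobilenet_v1",
    "--label_path", "labels.txt",
])
```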