Arm NN C++ API object recognition

This section of the guide describes how to use the Arm NN public C++ API to perform object recognition. We use the Arm NN Object Detection Example to illustrate the process.

The sample application takes a model and video file or camera feed as input. The application then runs inference on each frame. Finally, the application draws bounding boxes around detected objects, with the corresponding labels and confidence scores overlaid.

The Arm NN Object Detection Example performs the following steps:

  1. Initialization

    1. Read from the video source
    2. Prepare labels and model-specific functions
  2. Create a network

    1. Create the parser and import a graph
    2. Optimize the graph for the compute device
    3. Create input and output binding information
  3. Object detection pipeline

    1. Preprocess the captured frame
    2. Make the input and output tensors
    3. Execute inference
  4. Postprocessing

    1. Decode and process the inference output
    2. Draw the bounding boxes
  5. Run the application

The following subsections describe these steps.

Read from the video source

After parsing user arguments, the application loads the chosen video file or stream into an OpenCV cv::VideoCapture object. The main function uses the IFrameReader interface and the OpenCV-specific implementation CvVideoFrameReader to capture frames from the source using the ReadFrame() function.

The CvVideoFrameReader object also provides information about the input video. Based on this information and the application arguments, the application creates one of the implementations of the IFrameOutput interface: CvVideoFileWriter or CvWindowOutput. The created object is used at the end of every loop to do one of the following:

  • CvVideoFileWriter uses cv::VideoWriter with an ffmpeg backend to write the processed frame to an output video file.
  • CvWindowOutput uses the cv::imshow() function to write the processed frame to a GUI window. See the GetFrameSourceAndSink function in Main.cpp for more details.
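The sink-selection logic described above can be sketched with plain C++ polymorphism. The following is a simplified stand-in, not the example's real code: `Frame`, `MakeSink`, and the method names are illustrative, the real implementations wrap `cv::VideoWriter` and `cv::imshow()` on `cv::Mat` frames.

```cpp
#include <memory>
#include <string>

// Simplified stand-in for a decoded video frame (the real code uses cv::Mat).
struct Frame { int width = 0; int height = 0; };

// Minimal sketch of the IFrameOutput interface; method names are illustrative.
struct IFrameOutput
{
    virtual ~IFrameOutput() = default;
    virtual void WriteFrame(const Frame& frame) = 0;
    virtual std::string Description() const = 0;
};

// Writes frames to a video file (the real CvVideoFileWriter wraps cv::VideoWriter).
struct FileWriterSink : IFrameOutput
{
    void WriteFrame(const Frame&) override { /* encode and append to the file */ }
    std::string Description() const override { return "file"; }
};

// Displays frames in a GUI window (the real CvWindowOutput calls cv::imshow()).
struct WindowSink : IFrameOutput
{
    void WriteFrame(const Frame&) override { /* render to the window */ }
    std::string Description() const override { return "window"; }
};

// Chooses a sink the way GetFrameSourceAndSink does: write to a file when an
// output path was given, otherwise show a window.
std::unique_ptr<IFrameOutput> MakeSink(const std::string& outputPath)
{
    if (!outputPath.empty())
    {
        return std::make_unique<FileWriterSink>();
    }
    return std::make_unique<WindowSink>();
}
```

Keeping the loop body written against `IFrameOutput` means the per-frame code is identical whether the destination is a file or a window.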

Prepare labels and model-specific functions

To interpret the result of running inference on the loaded network, the application must load the labels associated with the model. In the provided example code, the AssignColourToLabel function creates a vector of [label, color] pairs. The vector is ordered according to the object class index at the output node of the model. Each label is assigned a randomly generated RGB color. This ensures that each class has a unique color, which is helpful when plotting the bounding boxes of various detected objects in a frame.
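The colour assignment can be sketched with standard C++ alone. `AssignColours` below is a hypothetical stand-in for the example's AssignColourToLabel; a fixed seed replaces true randomness so that the sketch is reproducible, but the ordering guarantee is the same: entry `i` corresponds to class index `i`.

```cpp
#include <array>
#include <cstdint>
#include <random>
#include <string>
#include <tuple>
#include <vector>

// One RGB colour per label, indexed by the model's class index.
using LabelColour = std::tuple<std::string, std::array<uint8_t, 3>>;

std::vector<LabelColour> AssignColours(const std::vector<std::string>& labels)
{
    // Fixed seed so each class keeps the same colour across runs of this sketch.
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> channel(0, 255);

    std::vector<LabelColour> result;
    result.reserve(labels.size());
    for (const auto& label : labels)
    {
        std::array<uint8_t, 3> rgb = {static_cast<uint8_t>(channel(gen)),
                                      static_cast<uint8_t>(channel(gen)),
                                      static_cast<uint8_t>(channel(gen))};
        result.emplace_back(label, rgb);
    }
    return result;
}
```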

Depending on the model that is being used, the CreatePipeline function returns a specific implementation of the object detection pipeline.

Create a network

All operations with Arm NN and networks are encapsulated in the ArmnnNetworkExecutor class.

Create the parser and import a graph

Arm NN SDK imports the graph from a file using the appropriate parser.

Arm NN SDK provides parsers for reading graphs from various model formats. The example application focuses on models in the .tflite, .pb, and .onnx formats.

Based on the extension of the provided model file, the corresponding parser is created, and the network file loaded with the CreateNetworkFromBinaryFile() method. The parser creates the underlying Arm NN graph.
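The extension-based dispatch can be illustrated with the following sketch. The parser class names (ITfLiteParser, ITfParser, IOnnxParser) are the real Arm NN ones, but `ParserForModel` and its return-a-name approach are simplifications for illustration; the real code constructs the parser object and calls CreateNetworkFromBinaryFile() on it.

```cpp
#include <stdexcept>
#include <string>

// Illustrative sketch of choosing a parser family from the model file
// extension. Returning the parser's name stands in for constructing it.
std::string ParserForModel(const std::string& modelPath)
{
    auto dot = modelPath.find_last_of('.');
    std::string ext = (dot == std::string::npos) ? "" : modelPath.substr(dot);
    if (ext == ".tflite") { return "ITfLiteParser"; }
    if (ext == ".pb")     { return "ITfParser"; }
    if (ext == ".onnx")   { return "IOnnxParser"; }
    throw std::invalid_argument("Unsupported model format: " + modelPath);
}
```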

The example application accepts .tflite format model files, using ITfLiteParser:

#include "armnnTfLiteParser/ITfLiteParser.hpp"

armnnTfLiteParser::ITfLiteParserPtr parser = armnnTfLiteParser::ITfLiteParser::Create();
armnn::INetworkPtr network = parser->CreateNetworkFromBinaryFile(modelPath.c_str());

Optimize the graph for the compute device

Arm NN supports optimized execution on multiple CPU and GPU devices. Before executing a graph, the application must select the appropriate device context. The example application creates a runtime context with default options using IRuntime::Create(), as shown in the following code:

#include "armnn/ArmNN.hpp"
auto runtime = armnn::IRuntime::Create(armnn::IRuntime::CreationOptions());

The application optimizes the imported graph by specifying a list of backends in order of preference and implementing backend-specific optimizations. A unique string identifies each of the backends, for example CpuAcc, GpuAcc, CpuRef.

For example, the example application specifies backend optimizations, as shown in the following code:

std::vector<armnn::BackendId> backends{"CpuAcc", "GpuAcc", "CpuRef"};

Internally and transparently, Arm NN splits the graph into subgraphs based on the specified backends. Arm NN optimizes each of the subgraphs and, if possible, substitutes the corresponding subgraph in the original graph with its optimized version.
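The partitioning idea can be illustrated with a toy sketch: assign each layer to the first backend in the preference list that supports it, falling back to the reference backend. This is a deliberately simplified model of what Optimize() does internally; `AssignBackends` and the `supports` predicate are illustrative stand-ins, not Arm NN's real capability query.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy illustration of backend selection: each layer goes to the first backend
// in the preference list that supports it, mirroring how the optimizer
// partitions the graph into per-backend subgraphs.
std::vector<std::string> AssignBackends(
    const std::vector<std::string>& layers,
    const std::vector<std::string>& preferredBackends,
    const std::function<bool(const std::string& layer,
                             const std::string& backend)>& supports)
{
    std::vector<std::string> assignment;
    assignment.reserve(layers.size());
    for (const auto& layer : layers)
    {
        std::string chosen = "CpuRef"; // reference backend as the fallback
        for (const auto& backend : preferredBackends)
        {
            if (supports(layer, backend)) { chosen = backend; break; }
        }
        assignment.push_back(chosen);
    }
    return assignment;
}
```

A layer unsupported by an accelerated backend simply lands on the next backend in the list, which is why putting CpuRef last guarantees every layer can run somewhere.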

The application uses the Optimize() function to optimize the graph for inference, then loads the optimized network onto the compute device with LoadNetwork(). The LoadNetwork() function creates:

  • The backend-specific workloads for the layers
  • A backend-specific workload factory which creates the workloads.

The example application contains the following code:

armnn::IOptimizedNetworkPtr optNet = Optimize(*network,
                                              backends,
                                              runtime->GetDeviceSpec());
armnn::NetworkId networkId;
std::string errorMessage;
if (runtime->LoadNetwork(networkId, std::move(optNet), errorMessage) != armnn::Status::Success)
{
    std::cerr << errorMessage << std::endl;
}

Create input and output binding information

Parsers can also extract input information for the network. The application calls GetSubgraphInputTensorNames to extract all the input names, then GetNetworkInputBindingInfo binds the input points of the graph. The example application contains the following code:

std::vector<std::string> inputNames = parser->GetSubgraphInputTensorNames(0);
auto inputBindingInfo = parser->GetNetworkInputBindingInfo(0, inputNames[0]);

The input binding information contains all the essential information about the input.

This information is a tuple consisting of:

  • Integer identifiers for bindable layers
  • Tensor information including:

    • Data type
    • Quantization information
    • Number of dimensions
    • Total number of elements

Similarly, the application gets the output binding information for an output layer by using the parser to retrieve output tensor names and calling GetNetworkOutputBindingInfo().
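One practical use of the tensor information in the binding tuple is sizing input and output buffers: the total element count is the product of the dimensions. The sketch below shows that arithmetic with plain C++; the shape [1, 300, 300, 3] is the SSD MobileNet V1 input used later in this guide.

```cpp
#include <cstddef>
#include <vector>

// Total element count of a tensor is the product of its dimensions; multiply
// by the element size of the data type to get the buffer size in bytes.
std::size_t TotalElements(const std::vector<unsigned int>& shape)
{
    std::size_t count = 1;
    for (unsigned int dim : shape) { count *= dim; }
    return count;
}
```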

Object detection pipeline

The generic object detection pipeline contains the following three steps:

  1. Perform data pre-processing.
  2. Run inference.
  3. Decode inference results in the post-processing step.

See ObjDetectionPipeline and the implementations for MobileNetSSDv1 and YoloV3Tiny for more details.

Preprocess the captured frame

The application reads each frame captured from the source as a cv::Mat in BGR format. The frame reader code swaps the channels to RGB, and the application then invokes the pipeline's preprocessing step:

cv::Mat processed;
objectDetectionPipeline->PreProcessing(frame, processed);

The preprocessing step consists of resizing the frame to the required resolution, padding, and converting the data types to match the model input layer. For example, the example application uses SSD MobileNet V1 which takes an input tensor with shape [1, 300, 300, 3] and data type uint8.

The preprocessing step returns a cv::Mat object containing data ready for inference.
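The resize-and-pad arithmetic can be sketched independently of OpenCV. The following is an illustrative letterbox calculation, assuming a square model input (300x300 for SSD MobileNet V1) and aspect-preserving scaling; `PlanResize` is a hypothetical helper, not the example's actual preprocessing code.

```cpp
#include <algorithm>

// Scale the frame to fit the model input while preserving aspect ratio; the
// remainder of the square input is filled with padding.
struct ResizePlan
{
    int scaledWidth;
    int scaledHeight;
    int padX; // total horizontal padding, split left/right by the caller
    int padY; // total vertical padding, split top/bottom by the caller
};

ResizePlan PlanResize(int frameWidth, int frameHeight, int modelSize)
{
    double scale = std::min(static_cast<double>(modelSize) / frameWidth,
                            static_cast<double>(modelSize) / frameHeight);
    ResizePlan plan{};
    plan.scaledWidth  = static_cast<int>(frameWidth * scale);
    plan.scaledHeight = static_cast<int>(frameHeight * scale);
    plan.padX = modelSize - plan.scaledWidth;
    plan.padY = modelSize - plan.scaledHeight;
    return plan;
}
```

For a 640x480 frame and a 300x300 input, the frame scales to 300x225 and 75 rows of padding fill the rest.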

Execute inference

The following code shows how the application executes inference:

od::InferenceResults results;
objectDetectionPipeline->Inference(processed, results);

The inference step calls the ArmnnNetworkExecutor::Run method that prepares input tensors and executes inference. A compute device performs inference for the loaded network using the EnqueueWorkload() function of the runtime context. For example:

//const void* inputData = ...;
//outputTensors were pre-allocated before

armnn::InputTensors inputTensors = {{
    inputBindingInfo.first, armnn::ConstTensor(inputBindingInfo.second, inputData)}};
runtime->EnqueueWorkload(0, inputTensors, outputTensors);

The application allocates memory for output data once and maps it to output tensor objects. After successful inference, the application reads data from the pre-allocated output data buffer. See ArmnnNetworkExecutor::ArmnnNetworkExecutor and ArmnnNetworkExecutor::Run for more information.

Decode and process the inference output

The application must decode the output from inference to obtain information about the detected objects in the frame. The example application contains implementations for two networks, or you can implement your own network decoding solution.

For SSD MobileNet V1 models, the application decodes the results to obtain the bounding box positions, classification index, confidence, and number of detections in the input frame. See SSDResultDecoder for more details.

For YOLO V3 Tiny models, the application decodes the output and performs non-maximum suppression. This suppression filters out weak detections below a confidence threshold and any redundant bounding boxes above an intersection-over-union (IoU) threshold. See YoloResultDecoder for more details.

Experiment with different threshold values for confidence and IoU to achieve the best visual results.

The detection results are always returned as a vector of DetectedObject, with the box positions list containing bounding box coordinates in the following form:

[x_min, y_min, x_max, y_max]
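The IoU test and the greedy suppression loop described above can be sketched in standard C++. This is a minimal illustration of the technique, not the example's YoloResultDecoder: `Box`, `Iou`, and `Nms` are hypothetical names, and boxes use the [x_min, y_min, x_max, y_max] form shown above.

```cpp
#include <algorithm>
#include <vector>

// A detection box in [x_min, y_min, x_max, y_max] form, plus its confidence.
struct Box { float xMin, yMin, xMax, yMax, score; };

// Intersection-over-union of two boxes: overlap area divided by union area.
float Iou(const Box& a, const Box& b)
{
    float ix = std::max(0.0f, std::min(a.xMax, b.xMax) - std::max(a.xMin, b.xMin));
    float iy = std::max(0.0f, std::min(a.yMax, b.yMax) - std::max(a.yMin, b.yMin));
    float inter = ix * iy;
    float areaA = (a.xMax - a.xMin) * (a.yMax - a.yMin);
    float areaB = (b.xMax - b.xMin) * (b.yMax - b.yMin);
    float uni = areaA + areaB - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

// Greedy non-maximum suppression: drop weak detections below the confidence
// threshold, then keep boxes highest-score-first, suppressing any candidate
// whose IoU with an already-kept box exceeds the IoU threshold.
std::vector<Box> Nms(std::vector<Box> boxes, float confThreshold, float iouThreshold)
{
    boxes.erase(std::remove_if(boxes.begin(), boxes.end(),
                               [&](const Box& b) { return b.score < confThreshold; }),
                boxes.end());
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.score > b.score; });

    std::vector<Box> kept;
    for (const Box& candidate : boxes)
    {
        bool suppressed = false;
        for (const Box& k : kept)
        {
            if (Iou(candidate, k) > iouThreshold) { suppressed = true; break; }
        }
        if (!suppressed) { kept.push_back(candidate); }
    }
    return kept;
}
```

Raising the IoU threshold keeps more overlapping boxes; raising the confidence threshold discards more marginal detections, which is the trade-off behind the tuning advice above.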

Draw the bounding boxes

The post-processing step accepts a callback function which is invoked when decoding finishes. The application uses this callback function to draw detections on the initial frame. The example application uses the output detections and the AddInferenceOutputToFrame function to draw bounding boxes around detected objects and add the associated label and confidence score. The following code shows the post-processing step in detail:

objectDetectionPipeline->PostProcessing(results,
    [&frame, &labels](od::DetectedObjects detects) -> void {
        AddInferenceOutputToFrame(detects, *frame, labels);
    });

The processed frames are written to a file or displayed in a separate window.

Run the application

After building the application executable, you can run it with the following command-line options:

--video-file-path: Specifies the path to the video file. This option is required.
--model-file-path: Specifies the path to the object detection model. This option is required.
--label-path: Specifies the path to the label set for the model file. This option is required.
--model-name: Specifies the name of the model used for object detection. Valid values are SSD_MOBILE and YOLO_V3_TINY. This option is required.
--output-video-file-path: Specifies the path to the output video file. This option is optional. The default is /tmp/output.avi.
--preferred-backends: Specifies the backends in preference order, separated by a comma. Valid values include CpuAcc, CpuRef, and GpuAcc. This option is optional. The default is CpuRef, the reference kernel on CPU.
--help: Displays all the available command-line options.

To run object detection on a video file and output the result to another video file, use the following command:

./object_detection_example --label-path /path/to/labels/file 
       --video-file-path /path/to/video/file --model-file-path /path/to/model/file
       --model-name [YOLO_V3_TINY | SSD_MOBILE] 
       --output-video-file-path /path/to/output/file

To run object detection on a video file and output the result to a window GUI, use the following command:

./object_detection_example --label-path /path/to/labels/file
       --video-file-path /path/to/video/file --model-file-path /path/to/model/file
       --model-name [YOLO_V3_TINY | SSD_MOBILE]