PyArmNN object recognition
This section of the guide describes how to use PyArmNN to perform object recognition, using the PyArmNN Object Detection Sample Application as an example.
The sample application takes a model and video file or camera feed as input. The application then runs inference on each frame. Finally, the application draws bounding boxes around detected objects, with the corresponding labels and confidence scores overlaid.
The following image shows an example of one frame from the resulting video file with bounding boxes and confidence scores:
The PyArmNN Object Detection Sample Application performs the following steps:
- Read from the video source
- Prepare labels and model-specific functions
- Create a network
  - Create the parser and import a graph
  - Optimize the graph for the compute device
  - Create input and output binding information
- Object detection pipeline
  - Preprocess the captured frame
  - Make the input and output tensors
  - Execute inference
  - Decode and process the inference output
  - Draw the bounding boxes
- Run the application
The following subsections describe these steps.
Read from the video source
The application parses the supplied user arguments and loads the specified video file or stream into an OpenCV cv2.VideoCapture object. The application then uses this object to capture frames from the source with the read() function. The VideoCapture object also provides information about the source, like the frame rate and resolution of the input video. The application uses this information to create a cv2.VideoWriter object. This object is used at the end of every loop to write the processed frame to an output video file.
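The capture-and-write loop described above can be sketched as follows, assuming the OpenCV Python bindings are installed. The helper name copy_video and the mp4v codec are illustrative choices rather than details of the sample application, and the per-frame processing is left as a placeholder.

```python
def copy_video(src_path, dst_path):
    """Read every frame from src_path and write it to dst_path (hypothetical helper)."""
    import cv2  # imported lazily so this module loads even without OpenCV

    video = cv2.VideoCapture(src_path)
    if not video.isOpened():
        raise RuntimeError(f"Cannot open video source: {src_path}")

    # Source properties reported by the VideoCapture object.
    fps = video.get(cv2.CAP_PROP_FPS)
    width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))

    # Assumption: 'mp4v' is an arbitrary codec choice for this sketch.
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(dst_path, fourcc, fps, (width, height))

    frame_count = 0
    try:
        while True:
            ok, frame = video.read()  # frame is a BGR ndarray
            if not ok:
                break  # end of the stream
            # ... run inference and draw on `frame` here ...
            writer.write(frame)
            frame_count += 1
    finally:
        video.release()
        writer.release()
    return frame_count
```

The function returns the number of frames written, which is a convenient sanity check against the length of the source video.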
Prepare labels and model-specific functions
To interpret the inference result on the
loaded network, an application must load the labels that are associated with
the model. In the sample application, the
dict_labels() function creates a
dictionary that is keyed on the classification index at the output node of the
model. The values in the dictionary map each label to a randomly generated RGB
color. This mapping means that each class has a unique color, which is useful when
plotting the bounding boxes of detected objects in a frame.
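A minimal sketch of such a labels dictionary is shown below. The function name build_label_colors and the seed parameter are hypothetical and are not the sample's dict_labels() implementation. Randomly drawn colors are not strictly guaranteed to be distinct, although collisions are unlikely for a typical number of classes.

```python
import random


def build_label_colors(label_names, seed=None):
    """Map class index -> (label, (r, g, b)) with a random color per class."""
    rng = random.Random(seed)  # seed only makes the sketch reproducible
    labels = {}
    for index, name in enumerate(label_names):
        color = (rng.randint(0, 255), rng.randint(0, 255), rng.randint(0, 255))
        labels[index] = (name.strip(), color)
    return labels


# Example: three classes, looked up by the class index from the model output.
labels = build_label_colors(["person", "bicycle", "car"], seed=0)
label, color = labels[2]
```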
The application uses the user-specified model name to look up and return the functions that decode and process the inference output, along with a resize factor. This resize factor is used when plotting bounding boxes, to ensure that they are scaled to their correct position in the original frame.
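One way to picture this lookup is a dictionary keyed on the model name. The sketch below is illustrative only: select_model_processing, the placeholder decoders, and the resize-factor formula (longest frame side divided by the model input size) are assumptions, not the sample application's code.

```python
def decode_ssd(output):
    """Placeholder for an SSD decoder (hypothetical)."""
    raise NotImplementedError


def decode_yolo(output):
    """Placeholder for a YOLO v3 tiny decoder (hypothetical)."""
    raise NotImplementedError


def select_model_processing(model_name, frame_size, model_input_size):
    """Return (decode function, resize factor) for the named model."""
    decoders = {"ssd": decode_ssd, "yolo_v3_tiny": decode_yolo}
    if model_name not in decoders:
        raise ValueError(f"Unsupported model name: {model_name}")
    frame_width, frame_height = frame_size
    # Assumption: boxes are expressed in model-input pixels and the frame
    # was scaled so that its longest side matches the model input size.
    resize_factor = max(frame_width, frame_height) / model_input_size
    return decoders[model_name], resize_factor


decode, factor = select_model_processing("ssd", (1920, 1080), 300)
```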
Create the parser and import a graph
A PyArmNN application must import a graph from file using an appropriate parser. Arm NN provides parsers for various model file types, including TFLite, TF, and ONNX. These parsers are libraries for loading neural networks of various formats into the Arm NN runtime.
Because both the YOLO v3 and SSD models are in the TFLite format, the sample application uses the TFLite parser armnnTfLiteParser to process the models. The application creates the parser and loads the network file. The parser then constructs the underlying Arm NN graph from the network file.
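As a rough sketch of this step, assuming PyArmNN is installed, a TFLite model can be imported as follows. The wrapper name import_tflite_graph is hypothetical, and the choice of ITfLiteParser with CreateNetworkFromBinaryFile() is an assumption about how the sample performs the import.

```python
def import_tflite_graph(model_path):
    """Create a TFLite parser and import the graph stored at model_path."""
    import pyarmnn as ann  # imported lazily: requires PyArmNN

    parser = ann.ITfLiteParser()
    # The parser builds the underlying Arm NN graph from the file.
    network = parser.CreateNetworkFromBinaryFile(model_path)
    return parser, network
```

The parser is returned alongside the network because later steps query it for the input and output binding information.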
Optimize the graph for the compute device
Arm NN supports optimized execution on multiple CPU, GPU, and Ethos-N NPU devices. Before executing a graph, the application must select the appropriate device context by using IRuntime() to create a runtime context with default options.
We can optimize the imported graph by specifying a list of backends in order of preference and implementing backend-specific optimizations. A unique string identifies each one of these backends. For example:
- CpuAcc represents the CPU backend.
- GpuAcc represents the GPU backend.
- CpuRef represents the CPU reference kernels.
Arm NN splits the entire graph into subgraphs based on these backends. Each subgraph is then optimized, and the corresponding subgraph in the original graph is substituted with its optimized version.
The Optimize() function optimizes the graph for inference, then LoadNetwork() loads the optimized network onto the compute device. The LoadNetwork() function also creates the backend-specific workloads for the layers and a backend-specific workload factory.
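The runtime creation, optimization, and loading steps can be sketched as below, assuming PyArmNN is installed. The wrapper name optimize_and_load and the default backend order are illustrative assumptions; the network argument is the graph imported by the parser.

```python
def optimize_and_load(network, backend_names=("CpuAcc", "CpuRef")):
    """Optimize an imported network and load it onto the compute device."""
    import pyarmnn as ann  # imported lazily: requires PyArmNN

    # Runtime context with default options.
    options = ann.CreationOptions()
    runtime = ann.IRuntime(options)

    # Backends in order of preference; each is identified by a string.
    preferred_backends = [ann.BackendId(name) for name in backend_names]

    opt_network, _messages = ann.Optimize(
        network,
        preferred_backends,
        runtime.GetDeviceSpec(),
        ann.OptimizerOptions(),
    )

    # LoadNetwork returns a network identifier used later for inference.
    net_id, _ = runtime.LoadNetwork(opt_network)
    return runtime, net_id
```

Listing CpuRef last gives a fallback for layers that the accelerated backend does not support.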
Create input and output binding information
Parsers extract the input information for the network. The GetSubgraphInputTensorNames() function extracts all the input names and the GetNetworkInputBindingInfo() function obtains the input binding information of the graph.
The input binding information contains all the essential information about the input. This information is a tuple consisting of:
- Integer identifiers for bindable layers
- Tensor information including:
  - Data type
  - Quantization information
  - Number of dimensions
  - Total number of elements
Similarly, we can get the output binding information for an output layer by using the parser to retrieve output tensor names and calling the GetNetworkOutputBindingInfo() function.
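The following sketch collects the input and output binding information from a parser that has already imported a graph. The helper name get_binding_info, the use of the last subgraph, and the use of only the first input are assumptions made for illustration.

```python
def get_binding_info(parser):
    """Return (input binding info, list of output binding info)."""
    # Assumption: the graph of interest is the last subgraph in the model.
    graph_id = parser.GetSubgraphCount() - 1

    input_names = parser.GetSubgraphInputTensorNames(graph_id)
    # Assumption: the model has a single input tensor.
    input_binding_info = parser.GetNetworkInputBindingInfo(
        graph_id, input_names[0]
    )

    output_names = parser.GetSubgraphOutputTensorNames(graph_id)
    output_binding_info = [
        parser.GetNetworkOutputBindingInfo(graph_id, name)
        for name in output_names
    ]
    return input_binding_info, output_binding_info
```

Each binding information tuple can be unpacked as layer_id, tensor_info. With PyArmNN, the tensor information object exposes accessors such as GetShape(), GetDataType(), GetNumDimensions(), and GetNumElements().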
Preprocess the captured frame
Each frame that is captured from the video source is read as an ndarray in BGR format. Each frame must then be preprocessed before being passed into the network.
This preprocessing step consists of the following:
- Swap channels. In this example, swap BGR to RGB.
- Resize the frame to the required resolution.
- Expand the dimensions of the array and perform data type conversion to match the model input layer.
The application uses input_binding_info to obtain information about the shape and the data type of the input tensor. For example, SSD MobileNet V1 takes an input tensor with shape [1, 300, 300, 3] and data type uint8.
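The three preprocessing steps can be sketched with NumPy alone. This is not the sample application's implementation: the function name preprocess is hypothetical, and nearest-neighbour index sampling stands in for an OpenCV resize so that the sketch has no OpenCV dependency.

```python
import numpy as np


def preprocess(frame_bgr, height, width, dtype=np.uint8):
    """Convert a BGR frame of shape (H, W, 3) to a (1, height, width, 3) RGB tensor."""
    # 1. Swap channels: BGR -> RGB.
    frame_rgb = frame_bgr[:, :, ::-1]

    # 2. Resize with nearest-neighbour sampling (stand-in for cv2.resize).
    src_h, src_w = frame_rgb.shape[:2]
    rows = np.arange(height) * src_h // height
    cols = np.arange(width) * src_w // width
    resized = frame_rgb[rows[:, None], cols]

    # 3. Add the batch dimension and convert to the model's data type.
    return np.expand_dims(resized, axis=0).astype(dtype)


# Example: a 480x640 frame prepared for a [1, 300, 300, 3] uint8 input.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
input_data = preprocess(frame, 300, 300)
```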
Make the input and output tensors
The make_input_tensors() function produces the input workload tensors. The make_output_tensors() function produces the output workload tensors.
Execute inference
After creating the workload tensors, the compute device performs inference for the loaded network using the EnqueueWorkload() function of the runtime context. Calling the workload_tensors_to_ndarray() function obtains the inference results as a list of ndarrays.
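Tensor creation and inference together can be sketched as below, assuming PyArmNN is installed. The wrapper name run_inference is hypothetical. Its arguments are the runtime context, the network identifier returned by LoadNetwork(), the binding information obtained from the parser, and one preprocessed frame.

```python
def run_inference(runtime, net_id, input_binding_info,
                  output_binding_info, input_data):
    """Run one inference and return the outputs as a list of ndarrays."""
    import pyarmnn as ann  # imported lazily: requires PyArmNN

    # Workload tensors wrap the input data and preallocate the outputs.
    input_tensors = ann.make_input_tensors([input_binding_info], [input_data])
    output_tensors = ann.make_output_tensors(output_binding_info)

    runtime.EnqueueWorkload(net_id, input_tensors, output_tensors)

    return ann.workload_tensors_to_ndarray(output_tensors)
```

Here output_binding_info is expected to be a list with one entry per output layer.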
Decode and process the inference output
The output from inference must be decoded to obtain information about detected objects in the frame.
The example includes implementations of two networks, but you can implement your own network decoding solution. For more information, see Implementing Your Own Network.
For SSD MobileNet V1 models, the application decodes the results to obtain the bounding box positions, classification index, confidence, and number of detections in the input frame.
For YOLO v3 tiny models, the application decodes the output and performs non-maximum suppression. This suppression filters out any weak detections below a confidence threshold and any redundant bounding boxes above an intersection-over-union (IoU) threshold.
Experiment with different threshold values for confidence and IoU to achieve the best visual results.
Detection results are returned as a list with the following form:
[class index, [box positions], confidence score]
[box positions] contains bounding box
coordinates in the following form:
[x_min, y_min, x_max, y_max]
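Using the detection format above, the filtering described for YOLO v3 tiny can be illustrated with a generic greedy, per-class non-maximum suppression. This is a textbook sketch under assumed default thresholds of 0.5, not the sample application's decoder.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x_min, y_min, x_max, y_max] boxes."""
    x_min = max(box_a[0], box_b[0])
    y_min = max(box_a[1], box_b[1])
    x_max = min(box_a[2], box_b[2])
    y_max = min(box_a[3], box_b[3])
    intersection = max(0.0, x_max - x_min) * max(0.0, y_max - y_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, confidence_threshold=0.5,
                        iou_threshold=0.5):
    """Filter [class index, [box positions], confidence score] detections."""
    # Drop weak detections, then visit the rest from most to least confident.
    candidates = sorted(
        (d for d in detections if d[2] >= confidence_threshold),
        key=lambda d: d[2],
        reverse=True,
    )
    kept = []
    for det in candidates:
        # Keep a box unless it overlaps a stronger box of the same class.
        if all(det[0] != k[0] or iou(det[1], k[1]) <= iou_threshold
               for k in kept):
            kept.append(det)
    return kept


detections = [
    [0, [0, 0, 10, 10], 0.9],
    [0, [1, 1, 10, 10], 0.8],    # overlaps the first box, same class
    [1, [1, 1, 10, 10], 0.7],    # same position, different class
    [0, [50, 50, 60, 60], 0.3],  # below the confidence threshold
]
kept = non_max_suppression(detections)
```

Raising the confidence threshold removes more weak detections, while lowering the IoU threshold suppresses more overlapping boxes.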
Draw the bounding boxes
The draw_bounding_boxes() function takes the inference results and draws bounding boxes around detected objects. This function also adds the associated label and confidence score. The labels dictionary that we created in Prepare labels and model-specific functions uses the class index of the detected object as a key to return the associated label and color for that class. The resize factor that we defined in Prepare labels and model-specific functions scales the bounding box coordinates to their correct positions in the original frame.
The processed frames are then written to file or displayed in a separate window.
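The scaling and labelling logic, without the actual drawing calls, can be sketched as follows. The helper names scale_box and caption, the clamping to the frame bounds, and the percentage formatting are illustrative assumptions rather than the behaviour of draw_bounding_boxes().

```python
def scale_box(box, resize_factor, frame_width, frame_height):
    """Scale [x_min, y_min, x_max, y_max] to frame pixels and clamp to the frame."""
    x_min, y_min, x_max, y_max = box
    return [
        max(0, int(x_min * resize_factor)),
        max(0, int(y_min * resize_factor)),
        min(frame_width, int(x_max * resize_factor)),
        min(frame_height, int(y_max * resize_factor)),
    ]


def caption(detection, labels):
    """Return (text, color) for a [class index, box, confidence] detection."""
    class_index, _box, confidence = detection
    label, color = labels[class_index]
    return f"{label} {confidence * 100:.1f}%", color


labels = {2: ("car", (0, 128, 255))}
detection = [2, [10, 20, 150, 310], 0.875]
box = scale_box(detection[1], 2.0, 640, 480)
text, color = caption(detection, labels)
```

With OpenCV, the scaled coordinates could then be passed to cv2.rectangle() and the caption to cv2.putText().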
Run the application
To run the sample application on a video file with the YOLO v3 model with PyArmNN, use the following command:
python3 run_video_file.py --video_file_path <your_video> --model_file_path yolo_v3_tiny_darknet_fp32.tflite --model_name yolo_v3_tiny
To run the SSD model with PyArmNN, use the following command:
python3 run_video_file.py --video_file_path <your_video> --model_file_path ssd_mobilenet_v1.tflite --model_name ssd
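The three options used in these commands could be parsed with argparse as sketched below. This covers only the flags shown above. The real script may accept further options, and marking every flag as required is an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line flags used in the commands above."""
    parser = argparse.ArgumentParser(
        description="Object detection on a video file (illustrative sketch)"
    )
    parser.add_argument("--video_file_path", required=True,
                        help="Path to the input video file")
    parser.add_argument("--model_file_path", required=True,
                        help="Path to the TFLite model file")
    parser.add_argument("--model_name", required=True,
                        choices=["ssd", "yolo_v3_tiny"],
                        help="Selects the decoding functions for the model")
    return parser.parse_args(argv)


args = parse_args([
    "--video_file_path", "input.mp4",
    "--model_file_path", "ssd_mobilenet_v1.tflite",
    "--model_name", "ssd",
])
```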