Object detection model structure

Most deep learning-based object detection models have two parts:

  • An encoder, which takes an image as input and runs it through a series of blocks and layers that extract features.
  • A decoder, which takes the features produced by the encoder and uses them to predict bounding boxes and labels for each object.

The simplest decoder is a pure regressor. The regressor connects to the output of the encoder and directly predicts the location and size of each bounding box. The output of the model is an (X, Y) coordinate pair for each object along with its extent (width and height). The disadvantage of a pure regressor is that you must define the number of predicted objects ahead of time.
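As a sketch of the pure-regressor idea, the following assumes a flattened feature vector from the encoder and a single linear layer as the decoder head; the function name, shapes, and box format are illustrative, not from any particular library. Note that `num_boxes` is baked into the weight shape, which is exactly the fixed-output limitation described above.

```python
import numpy as np

def regressor_head(features, weights, bias, num_boxes=3):
    """Hypothetical pure-regressor decoder: a single linear layer
    mapping an encoder feature vector to num_boxes * 4 values,
    interpreted as one (x, y, w, h) box per object."""
    out = features @ weights + bias        # linear projection
    return out.reshape(num_boxes, 4)       # fixed number of boxes

# Example with random weights, just to show the shapes involved
rng = np.random.default_rng(0)
features = rng.standard_normal(128)            # encoder output
weights = rng.standard_normal((128, 3 * 4))    # 3 boxes x 4 values
bias = np.zeros(3 * 4)
boxes = regressor_head(features, weights, bias, num_boxes=3)
```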

An extension of the regressor approach is a region proposal network. In this type of decoder, the model proposes regions of an image where it believes an object might reside. The pixels in these regions are fed into a classification network to determine a matching label. The region proposal network is a more accurate and flexible model that can process an arbitrary number of regions.

Single Shot Detectors (SSDs) seek to provide a middle ground between pure regressors and region proposal networks. Rather than using a subnetwork to propose regions, SSDs rely on a set of predetermined regions. A grid of anchor points is laid over the input image. At each anchor point, boxes of multiple shapes and sizes serve as regions.
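The anchor grid can be sketched as follows; the grid size, scales, and aspect ratios are illustrative assumptions, and real SSD implementations typically generate anchors at several feature-map resolutions.

```python
import numpy as np

def make_anchors(grid_size, image_size, scales=(0.1, 0.2), ratios=(1.0, 2.0)):
    """Generate (cx, cy, w, h) anchor boxes on a regular grid.
    At each anchor point, one box is created per (scale, ratio) pair."""
    step = image_size / grid_size
    anchors = []
    for i in range(grid_size):
        for j in range(grid_size):
            cx = (j + 0.5) * step          # anchor point center (x)
            cy = (i + 0.5) * step          # anchor point center (y)
            for s in scales:
                for r in ratios:
                    w = image_size * s * np.sqrt(r)
                    h = image_size * s / np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

# A 4x4 grid over a 128-pixel image, 2 scales x 2 ratios per point
anchors = make_anchors(grid_size=4, image_size=128)
```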

For each box at each anchor point, the SSD model outputs:

  • A prediction of whether an object exists within the region
  • Adjustments to the location and size of the box, to make it fit the object more closely

Because there are multiple boxes at each anchor point and anchor points might be close together, SSDs produce many overlapping potential detections. Post-processing must be applied to the SSD outputs to prune away most of these predictions and keep only the best one for each object.
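This post-processing step is commonly implemented as non-maximum suppression (NMS): keep the highest-scoring box, discard any remaining box that overlaps it too much, and repeat. A minimal sketch, assuming boxes in (x1, y1, x2, y2) format and a hand-picked overlap threshold:

```python
import numpy as np

def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]       # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Drop boxes that overlap the kept box above the threshold
        overlaps = np.array([box_iou(boxes[i], boxes[j]) for j in rest])
        order = rest[overlaps < iou_thresh]
    return keep

# Two heavily overlapping detections plus one far away:
# the lower-scoring duplicate is suppressed.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```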

Object detectors output the location and label for each object. To benchmark model performance, the most commonly used metric for localization is intersection-over-union (IOU). Given two bounding boxes, you compute the area of their intersection and divide it by the area of their union. Metric values range from 0 (no intersection) to 1 (perfectly overlapping). For labels, you can use a simple percentage correct.
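The IOU computation described above can be written directly from its definition; this sketch assumes boxes given as (x1, y1, x2, y2) corner coordinates.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

For example, two identical boxes give 1.0, disjoint boxes give 0.0, and two 2x2 boxes offset by one pixel in each direction share 1 unit of area out of 7, giving 1/7.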

YOLO and MobileNet-SSD

Several models belong to the SSD family. The main differences between these variants are their encoders and the specific configuration of predetermined anchors.

YOLO v3 is a fast, real-time object detection system. YOLO stands for You Only Look Once. Techniques like multi-scale predictions and improved backbone classifiers enable this fast performance. YOLO trains a single neural network model that directly predicts bounding boxes and class labels for each bounding box. You can find more details in the paper YOLOv3: An Incremental Improvement.

MobileNet-SSD models feature a MobileNet-based encoder, making them a good choice for models destined for mobile or embedded devices. For more information, see MobileNetV2 + SSDLite with Core ML.

Region-Based Convolutional Neural Network

The Region-Based Convolutional Neural Network (R-CNN) family of methods does the following:

  1. Generate candidate bounding boxes
  2. Extract features from each candidate region using a deep convolutional neural network
  3. Classify the features as one of the known classes
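The three steps above can be sketched as a pipeline. Everything here is a hypothetical stand-in: a real R-CNN uses a region-proposal method such as selective search, a convolutional network for feature extraction, and a trained classifier, whereas the toy functions below use fixed boxes, mean pixel intensity, and a threshold.

```python
import numpy as np

def propose_regions(image):
    """Stage 1 (stub): candidate boxes as (x1, y1, x2, y2)."""
    return [(0, 0, 8, 8), (8, 8, 16, 16)]

def extract_features(crop):
    """Stage 2 (stub): toy 'feature' = mean pixel intensity."""
    return np.array([crop.mean()])

def classify(features):
    """Stage 3 (stub): toy classifier thresholding the feature."""
    score = float(features[0])
    return ("object", score) if score > 0.5 else ("background", score)

def rcnn_detect(image):
    """Run the three R-CNN stages over every candidate region."""
    detections = []
    for (x1, y1, x2, y2) in propose_regions(image):
        feats = extract_features(image[y1:y2, x1:x2])
        label, score = classify(feats)
        if label != "background":
            detections.append(((x1, y1, x2, y2), label, score))
    return detections

# A bright patch in the top-left corner is detected;
# the empty bottom-right region is classified as background.
image = np.zeros((16, 16))
image[0:8, 0:8] = 1.0
detections = rcnn_detect(image)
```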

R-CNN is a relatively straightforward approach. Several popular object detection models belong to the R-CNN family, including Fast R-CNN and Mask R-CNN.
