Analysis of Deep Learning-based Object Detection

Monica Dommaraju
May 16, 2020


Introduction

Oftentimes I misplace things like my keys or mobile phone and spend a good amount of time searching for them. With object detection models and algorithms becoming more popular and finding use in real-world applications such as security, autonomous driving, and face detection, I believe machine learning can help solve problems like these. Deep convolutional neural networks and the computing power of GPUs play a major role in developing and running these models. In this article, let's look at various algorithms used for object detection.

Considerable progress has been made in general object detection thanks to the development of deep learning models and the improvement of GPU computing power. Let's introduce some representative object detection architectures for beginners to get started in this domain.

In this article, I will discuss two kinds of object detectors: two-stage detectors, the most representative being Faster R-CNN, and one-stage detectors such as YOLO and SSD.

Two-stage detectors achieve high localization and object recognition accuracy, whereas one-stage detectors achieve high inference speed. One-stage detectors predict boxes directly from the input image without a separate region proposal step, which saves time and makes them usable for real-time detection.

Two Stage Detectors

R-CNN

R-CNN is a region-based CNN detector composed of four modules. The first generates category-independent region proposals. The second extracts a fixed-length feature vector from each region proposal produced by the first. The third is a set of class-specific linear SVMs used to classify the objects in an image. The last is a bounding-box regressor that refines the predicted bounding boxes.
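To make the four-module structure concrete, here is a minimal sketch of the R-CNN pipeline. The helper functions are hypothetical stand-ins, not a real library API; real implementations use selective search, a CNN backbone such as AlexNet, per-class linear SVMs, and a learned bounding-box regressor.

```python
import numpy as np

# Hypothetical stand-ins for the four R-CNN modules (dummy logic for illustration).
def propose_regions(image):                  # 1) category-independent region proposals
    h, w = image.shape[:2]
    return [(0, 0, w // 2, h // 2), (10, 10, w - 1, h - 1)]

def extract_feature(image, box):             # 2) fixed-length feature vector per proposal
    return np.random.rand(4096)              # e.g. a 4096-d fully connected feature

def classify(feature):                       # 3) class-specific linear SVM scores
    return np.random.rand(20)                # e.g. 20 PASCAL VOC classes

def refine_box(feature, box):                # 4) bounding-box regression
    return box

image = np.zeros((224, 224, 3))
detections = []
for box in propose_regions(image):
    feature = extract_feature(image, box)    # every proposal goes through the CNN separately
    detections.append((refine_box(feature, box), classify(feature)))
print(len(detections))
```

The key point the sketch makes is that every proposal is processed independently, which is exactly the redundancy that Fast R-CNN removes.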

Fig. 1. Architecture of two-stage detectors: a region proposal network feeds region proposals into a classifier and a regressor.

Fast R-CNN

Fast R-CNN is a faster version of R-CNN. R-CNN spends a long time on SVM classification because it runs a forward pass for every region proposal without sharing computation, which makes training expensive. Fast R-CNN, in contrast, feeds the whole input image to the CNN once to generate a convolutional feature map and extracts each proposal's features from that map. A Region of Interest (RoI) pooling layer then converts each proposal's features into a fixed-size representation that is fed to the fully connected layers for classification. This saves significant CNN processing time and also a large amount of disk storage. Training is a single-stage, end-to-end process that uses a multi-task loss on each labeled RoI to jointly train the network. To further improve detection time, truncated SVD can be used to accelerate the forward pass through the fully connected layers.
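Below is a minimal sketch of the RoI pooling step using torchvision's `roi_pool`. The feature-map size, channel count, and box coordinates are illustrative values, and the boxes are assumed to already be in feature-map coordinates (`spatial_scale=1.0`).

```python
import torch
from torchvision.ops import roi_pool

# Convolutional feature map for one image, e.g. produced by a backbone CNN.
feature_map = torch.randn(1, 256, 50, 50)        # (N, C, H, W)

# Region proposals as (batch_index, x1, y1, x2, y2); spatial_scale would map
# image coordinates to feature-map coordinates if the boxes were in image space.
rois = torch.tensor([[0.,  4.,  4., 20., 28.],
                     [0., 10., 12., 44., 40.]])

# Each RoI is pooled to a fixed 7x7 grid so it can feed the fully connected layers.
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)                              # torch.Size([2, 256, 7, 7])
```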

Faster R-CNN

Fast R-CNN uses selective search to propose RoIs, which is slow and takes about as much time as the detection network itself. To speed this up, Faster R-CNN introduces a novel region proposal network (RPN): a fully convolutional network that predicts region proposals over a wide range of scales and aspect ratios. This is made possible by sharing full-image convolutional features through a common set of convolutional layers, which greatly accelerates the generation of region proposals.

A novel approach for detecting objects of different sizes is to use multi-scale anchors as references. Anchors greatly simplify the generation of region proposals of different sizes without requiring multiple scales of input images or feature maps.
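A minimal numpy sketch of generating anchors at a single feature-map location is shown below. The scales (128, 256, 512) and aspect ratios (0.5, 1, 2) are the commonly cited Faster R-CNN defaults, stated here as assumptions; the exact convention for the ratio varies between implementations.

```python
import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred at (cx, cy).

    3 scales x 3 aspect ratios = 9 anchors per location, all derived from the
    same feature-map cell, so no image pyramid is needed.
    """
    boxes = []
    for s in scales:
        for r in ratios:                 # r is the width-to-height aspect ratio
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)           # keeps the anchor area at s * s
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)

print(anchors_at(300, 300).shape)        # (9, 4)
```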

Fig. 2. Comparison of different R-CNN models.

One Stage Detectors

YOLO

You Only Look Once (YOLO) is mainly used for real-time detection on images and webcam video streams. It predicts fewer than 100 bounding boxes per image, while Fast R-CNN's selective search proposes around 2,000 regions per image. YOLO extracts features from the input image and directly predicts class probabilities and bounding boxes, treating detection as a single regression problem. The YOLO network runs at 45 frames per second with no batch processing.
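The regression view becomes clearer by looking at the output tensor size. The settings below (S=7 grid, B=2 boxes per cell, C=20 classes for PASCAL VOC) come from the original YOLO paper rather than this article, so treat them as assumptions.

```python
# YOLO regresses a single S x S x (B*5 + C) tensor: each of the S*S grid cells
# predicts B boxes (x, y, w, h, confidence) plus C class probabilities.
S, B, C = 7, 2, 20                      # original YOLO settings on PASCAL VOC
output_shape = (S, S, B * 5 + C)
print(output_shape)                     # (7, 7, 30)
print(S * S * B)                        # 98 boxes per image, i.e. fewer than 100
```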

YOLOv2

YOLOv2 is an improved version of YOLO. It adds a Batch Normalization layer ahead of each convolutional layer, which accelerates convergence and helps regularize the model.

  • High-Resolution Classifier: In the previous version, the classifier is trained at an input resolution of 224 × 224, which is later increased to 448 for detection, so the network has to adjust to the new resolution when switching to the detection task. To avoid this, YOLOv2 fine-tunes the classification network at 448 × 448 for 10 epochs on the ImageNet dataset.
  • Convolutional with anchor boxes: it removes the fully connected layers and predicts class and objectness for every anchor box.
  • K-means clustering on the training-set bounding boxes is used to obtain good priors automatically; these dimension clusters determine the size and aspect ratio of the anchor boxes.
  • Fine-Grained Features: YOLOv2 combines higher-resolution features with the low-resolution features by stacking adjacent features into different channels.
  • Multi-Scale Training: to make the network robust, a new input size is chosen randomly from {320, 352, …, 608} every 10 batches, so the network learns to predict detections at different resolutions (a minimal sketch follows this list). YOLOv2 also proposes a new classification backbone, Darknet-19, with 19 convolutional layers and 5 max-pooling layers, which requires fewer operations to process an image while achieving high accuracy.
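Here is a minimal sketch of that multi-scale training schedule. The resize uses `torch.nn.functional.interpolate`, and the dummy batch, step loop, and default 416 resolution are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

# Valid input sizes are multiples of 32 (the network's total stride): 320, 352, ..., 608.
SIZES = list(range(320, 609, 32))

images = torch.randn(8, 3, 416, 416)          # dummy batch at the default resolution

for step in range(1, 31):
    if step % 10 == 0:                        # every 10 batches, switch resolution
        size = random.choice(SIZES)
        resized = F.interpolate(images, size=(size, size),
                                mode="bilinear", align_corners=False)
        print(step, resized.shape)
```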

YOLOv3

YOLOv3 is an improved version of YOLOv2.

  • It uses multi-label classification to adapt to more complex datasets that contain many overlapping labels.
  • It predicts bounding boxes from feature maps at three different scales; the last convolutional layer predicts a 3-d tensor that encodes class predictions, objectness, and bounding box coordinates.
  • It also introduces a deeper and more robust feature extractor, Darknet-53, inspired by ResNet.

Thanks to its multi-scale predictions, YOLOv3 can detect small objects well, but it performs comparatively worse on large and medium-sized objects.
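To make the three-scale prediction concrete, the sketch below computes the output shapes for a 416 × 416 input with 3 anchors per scale and 80 COCO classes; these numbers come from the YOLOv3 paper, not this article, so treat them as assumptions.

```python
# Output tensor shapes of YOLOv3's three detection scales; each box carries
# 4 coordinates + 1 objectness score + 80 class scores.
input_size, anchors_per_scale, num_classes = 416, 3, 80
channels = anchors_per_scale * (4 + 1 + num_classes)          # 255

for stride in (32, 16, 8):                                    # coarse -> fine scales
    grid = input_size // stride
    print(f"stride {stride:2d}: {grid} x {grid} x {channels}")
# stride 32: 13 x 13 x 255   (coarsest grid)
# stride 16: 26 x 26 x 255
# stride  8: 52 x 52 x 255   (finest grid, helps with small objects)
```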

SSD

The Single Shot Detector (SSD) predicts category scores and box offsets for a fixed set of default bounding boxes of different scales at each location in several feature maps of different resolutions, as shown in Fig. 3(a). The scales of these default boxes are spaced regularly between the lowest and the highest layer, so that each feature map learns to respond to objects of a particular scale. For every default box, SSD predicts both the offsets and the confidences for all object categories.
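The regular spacing of scales follows the SSD formula s_k = s_min + (s_max − s_min)(k − 1)/(m − 1). The values s_min = 0.2, s_max = 0.9, and m = 6 feature maps below are the defaults reported in the SSD paper, stated here as assumptions.

```python
# Scales of SSD default boxes, spaced regularly between the lowest and highest
# feature maps: s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1).
def default_box_scales(m, s_min=0.2, s_max=0.9):
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]

# With m = 6 feature maps (as in SSD300), each map handles one object scale.
print([round(s, 2) for s in default_box_scales(6)])
# [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```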

Fig 3. Four methods utilizing features for different sized object prediction.

DSSD

The Deconvolutional Single Shot Detector (DSSD) is a revised version of SSD. It adopts ResNet-101 as the backbone and adds a prediction module and a deconvolution module; Fig. 3(b) above shows the architecture. The deconvolution module increases the resolution of the feature maps to strengthen the features, and each deconvolution layer is followed by a prediction module, which helps predict objects of a variety of sizes.

There are many other detectors, such as M2Det, RefineDet, DCNv2, and NAS-FPN, that are explained in the survey paper.

Performance of benchmark datasets and their Metrics for Object Detection

Benchmark and challenging datasets are significant in many areas of research: they allow a standard comparison between different algorithms and set goals for solutions.

PASCAL VOC dataset

It contains 20 object categories (such as bottle, person, bicycle, bird, and dog) spread over 11,000 images. These 20 categories fall into four main groups: household objects, vehicles, animals, and people.

Fig.4. Annotated sample images from the PASCAL VOC dataset

MS COCO benchmark

The Microsoft Common Objects in Context (MS COCO) dataset has 91 common object categories, 82 of which have more than 5,000 labeled instances. In total, the dataset has 2,500,000 labeled instances in 328,000 images.

Fig.5. MS COCO dataset including iconic objects, scenes and non-iconic objects.

ImageNet benchmark

The ILSVRC object detection challenge evaluates an algorithm's ability to name and localize all instances of all target object classes present in an image. The dataset has about 200 object classes, 450k training images, 20k validation images, and 40k test images.

ImageNet uses a loosened threshold compared to the fixed overlap threshold of the PASCAL VOC dataset, computed as:

t = min(0.5, wh / ((w + 10)(h + 10)))

where w and h are the width and height of a ground-truth box, respectively. This threshold allows the annotations to extend, on average, up to 5 pixels in each direction around the object.

A comparison between the ILSVRC object detection dataset and the PASCAL VOC dataset is shown in the table below.

Analysis Of General Image Object Detection Methods

  • Deep neural network-based object detection pipelines have four main steps: image pre-processing, feature extraction, classification and localization, and post-processing.
  • First, raw images from the dataset cannot be fed directly into the network. They need to be processed: resized to the required size and made clearer, for example by enhancing color, brightness, and contrast.
  • Data augmentation can be used for flipping, rotation, scaling, cropping, translation, and adding Gaussian noise; in addition, Generative Adversarial Networks (GANs) can be used to generate new images (a minimal augmentation sketch follows this list).
  • Second, feature extraction is performed for further detection. The feature quality determines the upper bound for the subsequent tasks, such as classification and localization.
  • Third, the detector head proposes and refines bounding boxes, producing classification scores and bounding-box coordinates.
  • The final post-processing step removes weak detection results, typically with non-maximum suppression (NMS); a sketch follows this list.
  • To obtain precise detection results, several methods can be used in combination with others or on their own, as clearly described in the survey paper.
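For the augmentation step mentioned above, here is a minimal torchvision pipeline covering flipping, rotation, scaling/cropping, translation, and color changes; the specific transform parameters and the 416 input size are illustrative assumptions.

```python
from torchvision import transforms

# A minimal image-augmentation pipeline for the operations listed above.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=10, translate=(0.1, 0.1), scale=(0.8, 1.2)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomResizedCrop(size=416),       # crop and resize to the network input size
    transforms.ToTensor(),
])
# Note: for detection, the same geometric transforms must also be applied to the
# bounding boxes (or use a detection-aware augmentation library).
```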
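And for the post-processing step, below is a minimal sketch of confidence filtering followed by non-maximum suppression using torchvision's `nms`; the boxes, scores, and thresholds are illustrative values.

```python
import torch
from torchvision.ops import nms

# Dummy raw detections: (x1, y1, x2, y2) boxes with confidence scores.
boxes = torch.tensor([[10., 10., 60., 60.],
                      [12., 12., 62., 62.],      # heavy overlap with the first box
                      [100., 100., 160., 150.]])
scores = torch.tensor([0.90, 0.75, 0.60])

# 1) drop weak detections below a confidence threshold (illustrative value).
keep = scores > 0.5
boxes, scores = boxes[keep], scores[keep]

# 2) non-maximum suppression: discard boxes that overlap a higher-scoring box
#    by more than the IoU threshold.
kept_idx = nms(boxes, scores, iou_threshold=0.5)
print(boxes[kept_idx])                           # the first and third boxes survive
```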

Applications

  • Object detection is widely used in many fields to assist people and also for important tasks.
  • In the military field, remote sensing object detection, topographic survey, flyer detection, etc., are representative applications.
  • In the security field, it is mainly used for face detection, fingerprint identification, fraud detection, anomaly detection etc.
  • In the transportation field, license plate recognition, automatic driving, traffic sign recognition, etc., greatly facilitate people's lives.
  • Object detection has a wide range of application scenarios, and research in this domain spans a large variety of branches, such as highlight detection, edge detection, object detection in videos, and 2D/3D pose detection (a sample image is provided below).
Fig. 6. Some examples of multi-person pose estimation.

Conclusion

Object detection has been growing rapidly with the continuous upgrade of powerful computing equipment, and achieving highly accurate and efficient detectors is the ultimate goal of this task. Researchers have pursued a series of directions: constructing new architectures, extracting rich features, exploiting good representations, improving processing speed, training from scratch, anchor-free methods, solving sophisticated scene issues (small objects, occluded objects), combining one-stage and two-stage detectors, improving the NMS post-processing method, solving the negative-positive imbalance issue, and increasing localization accuracy to enhance classification confidence. With the increasing need for powerful object detectors in the security, military, transportation, medical, and everyday-life fields, the applications of object detection are gradually extending. In addition, a variety of branches in the detection domain are arising. Although recent achievements in this domain have been effective, there is still much room for further development.

I hope my attempt to explain multiple object detection techniques and the comparison between them is useful to you.

Reference

A Survey of Deep Learning-based Object Detection

Survey Authors: Licheng Jiao, Fan Zhang, Shuyuan Yang, Lingling Li, Zhixi Feng, Rong Qu (IEEE)

Article Link: https://arxiv.org/pdf/1907.09408.pdf

Date published: 13th May 2020
