Object detection: an introduction to Faster R-CNN and SSD (Part 1)
To understand an image completely, we should not only classify it, but also precisely estimate the categories and locations of the objects it contains. This task is referred to as object detection, which usually consists of different subtasks such as face detection, pedestrian detection, and skeleton detection. As one of the fundamental computer vision problems, object detection provides valuable information for the semantic understanding of images and videos, and is related to many applications including image classification, human behavior analysis, face recognition, and autonomous driving. In this post I will review some milestone networks in object detection.
Generic object detection
Generic object detection aims at locating and classifying the objects present in an image, and labeling them with rectangular bounding boxes together with a confidence of existence. The frameworks of generic object detection methods can mainly be categorized into two types (see Figure 1). One follows the traditional object detection pipeline, first generating region proposals and then classifying each proposal into different object categories. The other regards object detection as a regression or classification problem, adopting a unified framework to obtain the final results (categories and locations) directly. The region proposal based methods mainly include R-CNN, SPP-net, Fast R-CNN, Faster R-CNN, R-FCN, FPN, and Mask R-CNN, some of which are related to each other (e.g. SPP-net modifies R-CNN with an SPP layer). The regression/classification based methods mainly include MultiBox, AttentionNet, G-CNN, YOLO, SSD, YOLOv2, DSSD, and DSOD. These two pipelines are bridged by the anchors introduced in Faster R-CNN. The evolution of these methods is shown below.
Figure 1, Generic object detection
Before Faster R-CNN (Figure 2), state-of-the-art object detection networks mainly relied on additional methods, such as Selective Search and EdgeBoxes, to generate a candidate pool of isolated region proposals. This region proposal computation was also a bottleneck for efficiency. To solve this problem, Ren et al. introduced an additional Region Proposal Network (RPN), which is nearly cost-free because it shares full-image conv features with the detection network.
Figure 2, The architecture of Faster R-CNN
The RPN is implemented as a fully convolutional network that predicts object bounds and scores at each position simultaneously. It takes an image of arbitrary size and generates a set of rectangular object proposals. The RPN operates on a specific conv layer, with the preceding layers shared with the object detection network.
Figure 3, The RPN in Faster R-CNN. k predefined anchor boxes are convolved with each sliding window to produce fixed-length vectors, which are taken by the cls and reg layers to obtain the corresponding outputs.
The architecture of the RPN is shown in Figure 3. The network slides over the conv feature map and is fully connected to an $n \times n$ spatial window. A low-dimensional vector (512-d for VGG16) is obtained in each sliding window and fed into two sibling FC layers, namely the box-classification layer (cls) and the box-regression layer (reg). This architecture is implemented with an $n \times n$ conv layer followed by two sibling $1 \times 1$ conv layers.
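To make the sizes involved more concrete, here is a minimal sketch in plain Python of how the anchors and the head outputs could be counted. It is only an illustration, not the authors' code. The three scales, three aspect ratios, and feature stride of 16 are the defaults I recall from Ren et al.; the function names and the half-stride offset for anchor centres are my own assumptions. With k anchors per position, the cls layer outputs 2k scores (object / not object) and the reg layer outputs 4k box coordinates.

```python
import math
from itertools import product


def make_base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchor sizes as (width, height)."""
    anchors = []
    for scale, ratio in product(scales, ratios):
        # Keep the area equal to scale**2 while setting height / width = ratio.
        width = scale / math.sqrt(ratio)
        height = scale * math.sqrt(ratio)
        anchors.append((width, height))
    return anchors


def anchors_for_feature_map(feat_h, feat_w, stride=16, base_anchors=None):
    """Place every base anchor at each sliding-window position.

    Returns boxes as (x1, y1, x2, y2) in input-image pixels. Centring each
    anchor in the middle of its stride cell is an assumption of this sketch.
    """
    base_anchors = base_anchors or make_base_anchors()
    boxes = []
    for row, col in product(range(feat_h), range(feat_w)):
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for width, height in base_anchors:
            boxes.append((cx - width / 2, cy - height / 2,
                          cx + width / 2, cy + height / 2))
    return boxes


def rpn_head_channels(k, mid_channels=512):
    """Output channels per position for each layer of the RPN head."""
    return {
        "conv_nxn": mid_channels,  # n x n conv, 512-d for VGG16
        "cls": 2 * k,              # 1 x 1 conv: two scores per anchor
        "reg": 4 * k,              # 1 x 1 conv: four coordinates per anchor
    }


if __name__ == "__main__":
    base = make_base_anchors()
    k = len(base)
    # A roughly 1000 x 600 input with stride 16 gives about a 60 x 40 feature map.
    boxes = anchors_for_feature_map(feat_h=40, feat_w=60)
    print(k, len(boxes), rpn_head_channels(k))
```

With these defaults there are k = 9 anchors per position, so a 60 x 40 feature map yields 21,600 candidate boxes before any filtering. Because the head is purely convolutional, the same weights are reused at every position, which is why the RPN can accept images of arbitrary size.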
With the proposal of Faster R-CNN, region proposal based CNN architectures for object detection can truly be trained in an end-to-end way. It also achieves a frame rate of 5 FPS (frames per second) on a GPU, with state-of-the-art object detection accuracy on PASCAL VOC 2007 and 2012.
Figure 4, Detection mAP (%) of Faster R-CNN on PASCAL VOC 2007 test set and 2012 test set using different training data. The model is VGG-16. “COCO” denotes that the COCO trainval set is used for training.
- Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in Neural Information Processing Systems. 2015: 91-99.
- Zhao Z Q, Zheng P, Xu S, et al. Object detection with deep learning: A review[J]. IEEE Transactions on Neural Networks and Learning Systems, 2019.