Frustratingly Easy Trade-off Optimization between Single-Stage and Two-Stage Deep Object Detectors

Petru Soviany, Radu Tudor Ionescu; Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0-0

Abstract


There are mainly two types of state-of-the-art object detectors. On one hand, we have two-stage detectors, such as Faster R-CNN (Region-based Convolutional Neural Networks) or Mask R-CNN, that (i) use a Region Proposal Network to generate regions of interest in the first stage and (ii) send the region proposals down the pipeline for object classification and bounding-box regression. Such models reach the highest accuracy rates, but are typically slower. On the other hand, we have single-stage detectors, such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector), that treat object detection as a simple regression problem, by taking an input image and learning the class probabilities and bounding box coordinates. Such models reach lower accuracy rates, but are much faster than two-stage object detectors. In this paper, we propose and evaluate four simple and straightforward approaches to achieve an optimal trade-off between accuracy and speed in object detection. All the approaches are based on separating the test images into two batches, an easy batch that is fed to a faster single-stage detector and a difficult batch that is fed to a more accurate two-stage detector. The difference between the four approaches is the criterion used for splitting the images into two batches. The criteria are the image difficulty score (easier images go into the easy batch), the number of detected objects (images with fewer objects go into the easy batch), the average size of the detected objects (images with bigger objects go into the easy batch), and the number of detected objects divided by their average size (images with fewer and bigger objects go into the easy batch). The first approach is based on an image difficulty predictor, while the other three approaches employ a faster single-stage detector to determine the approximate number of objects and their sizes.
Our experiments on PASCAL VOC 2007 show that using image difficulty compares favorably to a random split of the images. However, splitting the images based on the number of objects divided by their average size, an approach that is frustratingly easy to implement, produces even better results. Remarkably, it shortens the processing time by nearly half, while reducing the mean Average Precision of Faster R-CNN by only 0.5%.
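
The splitting criterion based on the number of detected objects divided by their average size can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Box`, `count_over_size` and `split_easy_hard` are hypothetical, normalizing box area by image area is an assumption, and so is treating an image with no detections as the easiest case. The boxes are assumed to come from a preliminary pass of a fast single-stage detector, as the abstract describes.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def count_over_size(boxes: Sequence[Box], image_area: float) -> float:
    """Number of detected objects divided by their average size.

    Lower scores mean fewer and bigger objects, i.e. an "easier" image.
    Assumption: size is the box area as a fraction of the image area.
    Assumption: an image with no detections gets the lowest score (0.0).
    """
    if not boxes:
        return 0.0
    avg_size = sum(b.area() for b in boxes) / len(boxes) / image_area
    return len(boxes) / max(avg_size, 1e-9)  # guard against zero-area boxes


def split_easy_hard(
    scores: Sequence[float], easy_fraction: float
) -> Tuple[List[int], List[int]]:
    """Return (easy, hard) image indices; the lowest-scoring images are easy."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n_easy = int(easy_fraction * len(scores))
    return order[:n_easy], order[n_easy:]


# Toy example on three 100x100 images with made-up detections.
image_area = 100.0 * 100.0
detections = [
    [Box(0, 0, 50, 50)],                         # one large object
    [Box(0, 0, 10, 10), Box(20, 20, 30, 30)],    # two small objects
    [],                                          # nothing detected
]
scores = [count_over_size(boxes, image_area) for boxes in detections]
easy, hard = split_easy_hard(scores, easy_fraction=0.5)
# The easy indices would be sent to the single-stage detector and the
# hard indices to the two-stage detector.
```

Varying `easy_fraction` between 0 and 1 moves the operating point between running only the two-stage detector and running only the single-stage detector, which is the accuracy-versus-speed trade-off the abstract refers to.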

Related Material


[bibtex]
@InProceedings{Soviany_2018_ECCV_Workshops,
author = {Soviany, Petru and Ionescu, Radu Tudor},
title = {Frustratingly Easy Trade-off Optimization between Single-Stage and Two-Stage Deep Object Detectors},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV) Workshops},
month = {September},
year = {2018}
}