TITLE: Speed/accuracy trade-offs for modern convolutional object detectors
AUTHOR: Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy
ASSOCIATION: Google Research
In this paper, the trade-off between accuracy and speed is studied when building an object detection system based on convolutional neural networks.
Three main families of detectors — Faster R-CNN, R-FCN and SSD which are viewed as “meta-architectures” are considered. Each of these can be combined with different kinds of feature extractors, such as VGG, Inception or ResNet. other parameters, such as the image resolution, and the number of box proposals are also varied to compare how they perform in the task of detecting objects. The main findings are summarized as follows.
The following figure shows the accuracy vs time of different configurations.
Generally speaking, R-FCN and SSD models are faster on average while Faster R-CNN tends to lead to slower but more accurate models, requiring at least 100 ms per image.
There is an intuition that stronger performance on classification should be positively correlated with stronger performance detection. It is true for Faster R-CNN and R-FCN, but it is less apparant for SSD as the following figure illustrated.
Not surprisingly, all methods do much better on large objects. Even though SSD models typically have (very) poor performance on small objects, they are competitive with Faster RCNN and R-FCN on large objects, even outperforming these meta-architectures for the faster and more lightweight feature extractors as the following figure illustrated.
Decreasing resolution by a factor of two in both dimensions consistently lowers accuracy (by 15.88% on average) but also reduces inference time by a relative factor of 27.4% on average.
Strong performance on small objects implies strong performance on large objects, but not vice-versa as SSD models do well on large objects but not small.
For Faster R-CNN, reducing proposals help accelerating prediction significantly because the computation of box classifier is correlated with the number of the proposals. The interesting thing is that Inception Resnet, which has 35.4% mAP with 300 proposals can still have surprisingly high accuracy (29% mAP) with only 10 proposals. The sweet spot is probably at 50 proposals.
For R-FCN, computational savings from using fewer proposals are minimal, because the box classifier is only run once per image. At 100 proposals, the speed and accuracy for Faster R-CNN models with ResNet becomes roughly comparable to that of equivalent R-FCN models which use 300 proposals in both mAP and GPU speed.
The following figure shows the observation, in which solid lines shows the relation between number of proposals and mAP, while dotted lines shows that of GPU inference time.
The paper also discussed the FLOPs and Memories. The observations in these parts are sort of obvious for the practitioner. Another observation is that Good localization at .75 IOU means good localization at all IOU thresholds.
TITLE: Fully Convolutional Instance-aware Semantic Segmentation
AUTHOR: Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, Yichen Wei
ASSOCIATION: Microsoft Research Asia, Tsinghua University
An end-to-end fully convolutional approach for instance-aware semantic segmentation is proposed. The underlying convolutional representation and the score maps are fully shared for the mask prediction and classification sub-tasks, via a novel joint formulation with no extra parameters. The network structure is highly integrated and efficient. The per-ROI computation is simple, fast, and does not involve any warping or resizing operations.
The proposed method is highly related with a previous work or R-FCN. The following figure gives an illustration:
Different from the mentioned previous work, this work predicts two maps, ROI inside map and ROI outside map. The two score maps jointly account for mask prediction and classification sub-tasks. For mask prediction, a softmax operation produces the per-pixel foreground probability. For mask clssification, a max operation produces the per-pixel likelihood of "belonging to the object category".
For an input image, 300 ROIs with highest scores are generated from RPN. They pass through the bbox regression branch and give rise to another 300 ROIs. For each ROI, its classification scores and foreground mask (in probability) is predicted for all categories. NMS with an IoU threshold is used to filter out highly overlapping ROIs. The remaining ROIs are classified as the categories with highest classification scores. Their foreground masks are obtained by mask voting. For an ROI under consideration, the ROIs (from the 600) are found with IoU scores higher than 0.5. Their foreground masks of the category are averaged on a per-pixel basis, weighted by their classification scores. The averaged mask is binarized as the output.