Pub. Date: Dec. 2, 2016, 7:47 p.m. Topic: Computer Vision Tag: Reading Note

TITLE: Speed/accuracy trade-offs for modern convolutional object detectors

AUTHOR: Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy

ASSOCIATION: Google Research

FROM: arXiv:1611.10012

CONTRIBUTIONS

In this paper, the trade-off between accuracy and speed is studied when building an object detection system based on convolutional neural networks.

Summary

Three main families of detectors, Faster R-CNN, R-FCN and SSD, which are viewed as “meta-architectures”, are considered. Each of these can be combined with different kinds of feature extractors, such as VGG, Inception or ResNet. Other parameters, such as the image resolution and the number of box proposals, are also varied to compare how the resulting detectors perform. The main findings are summarized as follows.
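
To make the comparison space concrete, the sketch below enumerates the kind of configuration grid the paper sweeps. This is not the authors' code; the exact extractor list, resolutions, and proposal counts are illustrative assumptions.

```python
# A minimal sketch of the configuration grid such a study sweeps.
# The extractor list, resolutions, and proposal counts are assumptions.
from itertools import product

meta_architectures = ["Faster R-CNN", "R-FCN", "SSD"]
feature_extractors = ["VGG-16", "Inception v2", "Inception v3",
                      "ResNet-101", "Inception ResNet v2", "MobileNet"]
resolutions = [300, 600]              # e.g., 300x300 vs 600x600 inputs
proposal_counts = [10, 50, 100, 300]  # proposal stage exists only for Faster R-CNN / R-FCN

configs = []
for meta, extractor, res in product(meta_architectures, feature_extractors, resolutions):
    if meta == "SSD":
        configs.append((meta, extractor, res, None))  # SSD has no proposal stage
    else:
        configs.extend((meta, extractor, res, p) for p in proposal_counts)

print(len(configs), "candidate configurations")
```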

Accuracy vs time

The following figure shows the accuracy vs time of different configurations.

Generally speaking, R-FCN and SSD models are faster on average, while Faster R-CNN tends to produce slower but more accurate models, requiring at least 100 ms per image.

Critical points on the optimality frontier

  1. SSD models with Inception v2 and MobileNet feature extractors are the most accurate among the fastest models.
  2. R-FCN models using ResNet feature extractors strike the best balance between speed and accuracy.
  3. Faster R-CNN with dense-output Inception ResNet models attains the best possible accuracy.

The effect of the feature extractor

There is an intuition that stronger performance on classification should be positively correlated with stronger performance on detection. This holds for Faster R-CNN and R-FCN, but it is less apparent for SSD, as the following figure illustrates.

The effect of object size

Not surprisingly, all methods do much better on large objects. Even though SSD models typically have (very) poor performance on small objects, they are competitive with Faster R-CNN and R-FCN on large objects, even outperforming these meta-architectures when paired with the faster and more lightweight feature extractors, as the following figure illustrates.

The effect of image size

Decreasing resolution by a factor of two in both dimensions consistently lowers accuracy (by 15.88% on average) but also reduces inference time by a relative factor of 27.4% on average.

Strong performance on small objects implies strong performance on large objects, but not vice versa: SSD models do well on large objects but not on small ones.

The effect of the number of proposals

For Faster R-CNN, reducing the number of proposals accelerates prediction significantly, because the box classifier's computation scales with the number of proposals. Interestingly, Inception ResNet, which achieves 35.4% mAP with 300 proposals, still attains surprisingly high accuracy (29% mAP) with only 10 proposals. The sweet spot is probably around 50 proposals.

For R-FCN, the computational savings from using fewer proposals are minimal, because the box classifier is run only once per image. At 100 proposals, Faster R-CNN models with ResNet become roughly comparable to equivalent R-FCN models using 300 proposals, in both mAP and GPU speed.
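
To see why the two meta-architectures respond so differently, here is a toy cost model. All constants are invented for illustration, not measured; the point is only the scaling: Faster R-CNN runs its box classifier once per proposal, while R-FCN computes its score maps once and does only cheap per-ROI pooling.

```python
# Toy cost model; all millisecond constants are hypothetical.
FEATURE_MS = 80.0             # shared feature extraction cost (assumed)
FRCNN_PER_PROPOSAL_MS = 0.5   # per-proposal box-classifier cost (assumed)
RFCN_PER_PROPOSAL_MS = 0.02   # per-ROI pooling/voting cost (assumed)

def faster_rcnn_ms(n_proposals: int) -> float:
    # Box classifier runs once per proposal, so cost grows roughly linearly.
    return FEATURE_MS + FRCNN_PER_PROPOSAL_MS * n_proposals

def rfcn_ms(n_proposals: int) -> float:
    # Position-sensitive score maps are computed once; per-ROI work is tiny.
    return FEATURE_MS + RFCN_PER_PROPOSAL_MS * n_proposals

for n in (10, 50, 100, 300):
    print(f"{n:>3} proposals: Faster R-CNN ~{faster_rcnn_ms(n):6.1f} ms, "
          f"R-FCN ~{rfcn_ms(n):6.1f} ms")
```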

The following figure shows this observation: solid lines show the relation between the number of proposals and mAP, while dotted lines show the relation with GPU inference time.

Others

The paper also discusses FLOPs and memory usage; the observations in these parts are fairly obvious to practitioners. Another observation is that good localization at 0.75 IoU implies good localization at all IoU thresholds.


Pub. Date: Nov. 29, 2016, 9:35 p.m. Topic: Life Discovery Tag: Odds and Ends

That Spring Festival, on a sudden impulse I worked up the courage to visit, one by one, the homes of my childhood playmates.

Some were already married and, holding their children, told me about the takings of the beef stall they kept at the night market. One had become a fisherman; while talking with me he kept unconsciously edging backwards, asking, “Doesn't the smell get to you?” One had opened a garment factory and become a boss; over dinner he kept pressing me to drink Moutai aged however many years, then, flushed with drink, grabbed hold of me and boomed: “We're brothers, aren't we? If we're brothers, you don't look down on me for being a bumpkin, and I won't look down on you for being poor. Let's drink...”

Only then did I understand that what I had said to Wenzhan in that letter, that “the childhood playmates really ought to get together,” was a naive proposal. Everyone has come to live a different life, and those different lives leave many of us unable to meet in any shared state within this time and place. Only when we have all grown old, when old age once again wipes away everything else and becomes the mark that defines each of us, might such a gathering actually come true.

—— Cai Chongda (蔡崇达), Pi Nang (《皮囊》)

It feels as if everyone else has grown up, and only I am still living in the old days, assuming my companions are just as they were when we were small. Reading this passage, I suddenly recalled an awkward episode from years ago: a primary-school playmate, Xiao A, and I had gone on to different middle schools, and at a get-together in the second year I made a joke we used to make all the time in primary school, but Xiao A no longer reacted the way he used to. Thinking back now, each of us had already become someone different long before that.

These days, only those companions I have kept in touch with all along, and whose schooling ran much like mine, remain close. As a generation of only children, perhaps each of us is a little withdrawn while at the same time longing for a band of brothers and sisters. We are like the six characters in Friends, propping one another up among our peers, gathering to have fun, to pour out grievances, and to share the amusing bits of our lives. Yet gradually, as when the show ends, once everyone has started a family of their own, it seems the time comes for us to drift somewhat apart. No one proposes a spur-of-the-moment barbecue run any more, perhaps for fear of intruding on someone's life, perhaps feeling one should not slip away from one's family just like that. Maybe that is why everyone feels we have changed.


Without quite noticing, I have bought ten books on all sorts of subjects, and I look forward to getting through them all soon.


Pub. Date: Nov. 28, 2016, 11:45 p.m. Topic: Computer Vision Tag: Reading Note

TITLE: Fully Convolutional Instance-aware Semantic Segmentation

AUTHOR: Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, Yichen Wei

ASSOCIATION: Microsoft Research Asia, Tsinghua University

FROM: arXiv:1611.07709

CONTRIBUTIONS

An end-to-end fully convolutional approach for instance-aware semantic segmentation is proposed. The underlying convolutional representation and the score maps are fully shared for the mask prediction and classification sub-tasks, via a novel joint formulation with no extra parameters. The network structure is highly integrated and efficient. The per-ROI computation is simple, fast, and does not involve any warping or resizing operations.

METHOD

The proposed method is closely related to a previous work, R-FCN. The following figure gives an illustration:

Different from that previous work, this work predicts two score maps: an ROI-inside map and an ROI-outside map. The two score maps jointly account for the mask prediction and classification sub-tasks. For mask prediction, a softmax operation produces the per-pixel foreground probability. For classification, a max operation produces the per-pixel likelihood of “belonging to the object category”.
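
Here is a minimal NumPy sketch of this joint formulation. The map size and the score values are invented for illustration, and the assembly of these per-ROI maps from position-sensitive score maps is omitted.

```python
# Sketch of the inside/outside score-map formulation; values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
H, W = 21, 21                       # assembled per-ROI score map size (assumed)
inside = rng.normal(size=(H, W))    # ROI-inside score map for one category
outside = rng.normal(size=(H, W))   # ROI-outside score map for the same category

# Mask prediction: per-pixel softmax over the (inside, outside) pair
# gives the foreground probability.
foreground_prob = np.exp(inside) / (np.exp(inside) + np.exp(outside))

# Classification: per-pixel max over the pair gives the likelihood that the
# pixel belongs to the category at all; pooling it over the ROI yields the
# ROI's classification score for this category.
per_pixel_likelihood = np.maximum(inside, outside)
roi_score = per_pixel_likelihood.mean()

print(foreground_prob.shape, float(roi_score))
```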

For an input image, the 300 ROIs with the highest scores are generated by the RPN. They pass through the bbox regression branch, giving rise to another 300 ROIs. For each ROI, its classification scores and foreground mask (in probability) are predicted for all categories. NMS with an IoU threshold is used to filter out highly overlapping ROIs. Each remaining ROI is classified as the category with the highest classification score, and its foreground mask is obtained by mask voting (see the sketch below): among the 600 ROIs, those with an IoU higher than 0.5 with the ROI under consideration are found, their foreground masks of that category are averaged on a per-pixel basis, weighted by their classification scores, and the averaged mask is binarized as the output.
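
Below is a rough sketch of the mask-voting step. It assumes, for simplicity, that all candidate masks have already been warped onto a common pixel grid; the helper names and shapes are invented, not taken from the released implementation.

```python
# Sketch of mask voting; shapes and helpers are illustrative assumptions.
import numpy as np

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-9)

def vote_mask(kept_box, boxes, masks, scores, thresh=0.5):
    """kept_box: (4,); boxes: (N, 4); masks: (N, H, W) foreground
    probabilities on a common grid (a simplification of this sketch);
    scores: (N,) classification scores for the predicted category."""
    sel = [i for i, b in enumerate(boxes) if iou(kept_box, b) > thresh]
    w = scores[sel] / scores[sel].sum()         # score-weighted averaging
    avg = np.tensordot(w, masks[sel], axes=1)   # weighted per-pixel mean
    return avg >= 0.5                           # binarize the voted mask
```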

ADVANTAGES

  1. End-to-end training and testing contribute to the simplicity of the system.
  2. By utilizing the idea of R-FCN, the method achieves high efficiency: the per-ROI computation is cheap and the heavy computation is shared.