Only 18 years remain until the story of I, Robot takes place...
TITLE: Pixel Objectness
AUTHOR: Suyog Dutt Jain, Bo Xiong, Kristen Grauman
ASSOCIATION: The University of Texas at Austin
An end-to-end learning framework for foreground object segmentation is proposed. Given a single novel image, a pixel-level mask is produced for all “object-like” regions even for object categories never seen during training.
Given an RGB image of size $m \times n \times c$ as input, the problem is formulated as densely labeling each pixel in the image as either "object" or "background". The output is a binary map of size $m \times n$.
Two kinds of datasets are used: 1) one with explicit boundary-level annotations and 2) one with implicit image-level object category annotations.
The network is first trained on a large-scale object classification task, such as ImageNet 1000-category classification. This stage can be regarded as training on an implicitly labeled dataset. The resulting image representation has a strong notion of objectness built into it, even though the network never observes any segmentation annotations.
The network is then trained on the PASCAL 2012 segmentation dataset, which is an explicitly labeled dataset. The 20 object labels are discarded and mapped instead to a single generic "object-like" (foreground) label for training.
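The label collapsing described above is simple enough to sketch. The snippet below is only an illustration, not the authors' code: it assumes the mask stores 0 for background, 1 to 20 for the object classes, and 255 for void/boundary pixels that a loss would ignore. The function name `to_objectness_mask` is made up for this example.

```python
VOID_LABEL = 255  # assumption: value marking void/boundary pixels


def to_objectness_mask(class_mask):
    """Map a per-pixel class-index mask to a binary object/background mask.

    0 stays background, any class index above 0 becomes 1 (foreground),
    and void pixels are passed through so that a loss can skip them.
    """
    binary = []
    for row in class_mask:
        binary.append([
            VOID_LABEL if v == VOID_LABEL else (1 if v > 0 else 0)
            for v in row
        ])
    return binary


# Two object classes (15 and 7) collapse into the same foreground label.
mask = [
    [0, 0, 15, 15],
    [0, 7, 7, 255],
]
print(to_objectness_mask(mask))  # [[0, 0, 1, 1], [0, 1, 1, 255]]
```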
TITLE: Towards Accurate Multi-person Pose Estimation in the Wild
AUTHOR: George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, Kevin Murphy
A method for multi-person detection and 2D keypoint localization in the wild is proposed.
The multi-person pose estimation system is a two-step cascade, as illustrated in the following figure.
In the first stage, a person detector is used to produce a bounding box around each person instance. In the second stage, a pose estimator is applied to the image crop extracted around each detected person instance in order to localize its keypoints.
A Faster-RCNN system based on the ResNet-Inception architecture is used for person box detection. The detector is first trained on the 80 categories of the COCO dataset. The model is then further finetuned on a dataset that contains only person bounding boxes.
A combined classification and regression approach is adopted. Each spatial position is first classified as being in the vicinity of each of the K keypoint types or not, which yields a K-channel “heatmap”. A 2-D local offset vector is then predicted to get a more precise estimate of the corresponding keypoint location. The following figure illustrates the procedure.
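To make the heatmap-plus-offset idea concrete, here is a minimal sketch for a single keypoint type. It rests on assumptions that are not in the notes above: the vicinity is a disk of a hypothetical `radius`, and the final location is a plain average of the per-pixel votes (the paper's aggregation may well differ). `make_targets` and `decode` are names invented for this example.

```python
import math


def make_targets(height, width, keypoint, radius):
    """Build a binary heatmap and an offset field for one keypoint.

    keypoint is (x, y) in pixels; radius is a hypothetical vicinity size.
    heatmap[y][x] is 1 inside the disk around the keypoint, else 0.
    offsets[y][x] is the (dx, dy) vector from the pixel to the keypoint.
    """
    kx, ky = keypoint
    heatmap = [[0] * width for _ in range(height)]
    offsets = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = kx - x, ky - y
            if math.hypot(dx, dy) <= radius:
                heatmap[y][x] = 1
                offsets[y][x] = (dx, dy)
    return heatmap, offsets


def decode(heatmap, offsets):
    """Let every in-vicinity pixel vote for a location, then average the votes.

    Assumes at least one pixel is marked as being near the keypoint.
    """
    votes = [
        (x + offsets[y][x][0], y + offsets[y][x][1])
        for y in range(len(heatmap))
        for x in range(len(heatmap[0]))
        if heatmap[y][x] > 0.5
    ]
    n = len(votes)
    return (sum(v[0] for v in votes) / n, sum(v[1] for v in votes) / n)


# A sub-pixel keypoint is recovered even though the heatmap itself is coarse.
heatmap, offsets = make_targets(8, 8, keypoint=(3.4, 4.7), radius=2.0)
print(decode(heatmap, offsets))
```

The heatmap alone can only say "somewhere around here"; the offsets are what bring the estimate back to sub-pixel precision.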
The bounding box is first adjusted to a fixed aspect ratio (height/width = 1.37); the patch is then cropped from the image and resized to 353×257 (height × width). A ResNet with 101 layers is used to produce the heatmaps and offsets. The following figure shows an input and ground-truth output of the network.
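The aspect-ratio adjustment can be sketched as follows. This is an illustration under one assumption: the box is only ever grown around its center (in height or in width), never shrunk, so that no part of the detected person is cut off. `fix_aspect` is a hypothetical helper name.

```python
TARGET_ASPECT = 1.37  # height / width, from the text (353 / 257 is about 1.37)


def fix_aspect(box, target=TARGET_ASPECT):
    """Grow a box (x_min, y_min, x_max, y_max) around its center
    until height / width equals target. The box is never shrunk."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    if h / w < target:
        h = w * target  # too wide: extend the height
    else:
        w = h / target  # too tall: extend the width
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# A square 100 x 100 detection keeps its width and becomes 137 pixels tall.
print(fix_aspect((10, 20, 110, 120)))
```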
TITLE: YOLO9000: Better, Faster, Stronger
AUTHOR: Joseph Redmon, Ali Farhadi
ASSOCIATION: University of Washington, Allen Institute for AI
The authors summarize the work as a better, faster and stronger version of YOLO.
Batch Normalization is used in this work. The authors claim that it gives YOLO an improvement of more than 2% in mAP. That said, based on my own experience with BN, I doubt it would help in real-world applications, and it might even hurt performance.
High Resolution Classifier
Instead of being finetuned on 224×224 images, the classification network is finetuned on 448×448 images, which helps the network perform better at higher resolution. This high-resolution classification network gives an increase of almost 4% mAP.
Convolutional With Anchor Boxes
In YOLOv2, anchor boxes and a fully convolutional (FCN-style) design are also adopted. This enables YOLO to generate many more boxes, which improves recall from 81% (at 69.5 mAP) to 88% (at 69.2 mAP).
Prior works usually define the anchor boxes by hand, for example 1:1, 1:2 (2:1) or 1:3 (3:1) in SSD. In this work, the anchor boxes are defined by clustering. K-means clustering is used with a distance metric based on IOU, which removes the dependence on the absolute size of the boxes: with Euclidean distance, larger boxes generate more error than smaller ones.
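Here is a small sketch of k-means with an IOU-based distance, d(box, centroid) = 1 - IOU(box, centroid). It is not the authors' code. It assumes boxes are reduced to (width, height) pairs that share a corner, that centroids are updated with a plain mean, and that initial centroids are sampled at random; the toy box list is made up.

```python
import random


def iou_wh(a, b):
    """IOU of two boxes given as (width, height), both anchored at the origin."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_iou(boxes, k, iters=50, seed=0):
    """Cluster (width, height) pairs with distance d = 1 - IOU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Nearest centroid = highest IOU = smallest 1 - IOU.
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if not members:  # keep the centroid of an empty cluster
                new_centroids.append(centroids[i])
                continue
            new_centroids.append((
                sum(m[0] for m in members) / len(members),
                sum(m[1] for m in members) / len(members),
            ))
        if new_centroids == centroids:  # assignments have stabilized
            break
        centroids = new_centroids
    return centroids


# Toy data: three small tall boxes and three large wide boxes.
boxes = [(10, 20), (12, 22), (11, 19), (100, 50), (110, 55), (95, 48)]
anchors = sorted(kmeans_iou(boxes, k=2))
print(anchors)
```

Because IOU is a ratio, a 10% mismatch costs the same for a small box as for a large one, which is exactly the property the Euclidean distance lacks.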
Direct Location Prediction
Instead of predicting offsets to the center of the bounding box, YOLO9000 predicts location coordinates relative to the location of the grid cell, which bounds the ground truth to fall between 0 and 1. This constrained location prediction is easier to learn.
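A short sketch of how such a prediction is decoded, following the parameterization in the paper: a sigmoid keeps the center inside the responsible grid cell, and the width and height rescale the anchor (prior) box through an exponential. The names `decode_box`, `cell` and `prior` are chosen for this example, and all quantities are in grid-cell units.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t, cell, prior):
    """Decode raw outputs t = (tx, ty, tw, th) into a box in grid-cell units.

    cell  = (cx, cy): top-left corner of the responsible grid cell
    prior = (pw, ph): width and height of the anchor (prior) box
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    bx = cx + sigmoid(tx)   # center x stays within the cell
    by = cy + sigmoid(ty)   # center y stays within the cell
    bw = pw * math.exp(tw)  # width rescales the prior
    bh = ph * math.exp(th)  # height rescales the prior
    return bx, by, bw, bh


# All-zero outputs give the cell center and the unmodified prior.
print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(6, 4), prior=(3.0, 2.0)))
# (6.5, 4.5, 3.0, 2.0)
```

However extreme `tx` and `ty` get, the center cannot leave its cell, which is what makes early training more stable than with unbounded offsets.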
In order to utilize finer-grained features for localizing smaller objects, the authors add a passthrough layer that brings in features from an earlier layer. This is similar to what has been done in ResNet.
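The passthrough layer is essentially a space-to-depth rearrangement: adjacent spatial positions are stacked into channels so that the finer map matches the coarser one and can be concatenated with it (in the paper, a 26 × 26 × 512 map becomes 13 × 13 × 2048). The sketch below uses nested lists to stay dependency-free; the channel ordering inside each block is an assumption.

```python
def space_to_depth(fmap, block=2):
    """Rearrange an H x W x C map into (H/block) x (W/block) x (C*block*block).

    fmap is a nested list indexed as fmap[y][x][c]; H and W must be
    divisible by block.
    """
    h, w = len(fmap), len(fmap[0])
    out = []
    for y in range(0, h, block):
        row = []
        for x in range(0, w, block):
            stacked = []
            for dy in range(block):
                for dx in range(block):
                    stacked.extend(fmap[y + dy][x + dx])
            row.append(stacked)
        out.append(row)
    return out


# A 4 x 4 map with a single channel that holds its own flat index.
fine = [[[y * 4 + x] for x in range(4)] for y in range(4)]
coarse = space_to_depth(fine)
print(len(coarse), len(coarse[0]), len(coarse[0][0]))  # 2 2 4
print(coarse[0][0])  # [0, 1, 4, 5]
```

No value is lost or averaged: the spatial resolution halves while the channel count quadruples.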
Images of different resolutions are used to train the network. This regime forces the network to learn to predict well across a variety of input dimensions, so the same network can predict detections at different resolutions.
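A sketch of such a resolution schedule is shown below. The concrete numbers follow the paper (a new size every 10 batches, drawn from the multiples of 32 between 320 and 608), but they are not stated in the notes above, so treat them as assumptions. `resolution_schedule` is a hypothetical helper.

```python
import random

STRIDE = 32  # the network downsamples by 32, so sizes are multiples of it
SIZES = list(range(320, 608 + 1, STRIDE))  # 320, 352, ..., 608


def resolution_schedule(num_batches, every=10, seed=0):
    """Return the square input size used for each batch.

    A new size is drawn at random every `every` batches.
    """
    rng = random.Random(seed)
    size = rng.choice(SIZES)
    schedule = []
    for batch in range(num_batches):
        if batch > 0 and batch % every == 0:
            size = rng.choice(SIZES)
        schedule.append(size)
    return schedule


schedule = resolution_schedule(30)
print(schedule[0], schedule[10], schedule[20])
```

This works only because the network is fully convolutional: the same weights apply at any input size that the stride divides.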
A hierarchical prediction scheme is built. Nodes are added to form a tree, and each node defines a semantic category at some level of the hierarchy. Images of different objects may therefore share one label when they belong to the same higher-level semantic category.
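In the paper, each node predicts a probability conditioned on its parent, and the absolute probability of a node is the product of the conditionals along the path to the root. The sketch below illustrates that on a made-up toy hierarchy; the tree, the numbers and the function name are all hypothetical, and the root probability is taken to be 1.

```python
# Hypothetical toy hierarchy: child -> parent (the root has parent None).
PARENT = {
    "physical object": None,
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
    "cat": "animal",
}


def absolute_probability(node, conditional):
    """Multiply the conditionals Pr(node | parent) along the path to the root.

    The root itself is assumed to have probability 1.
    """
    prob = 1.0
    while PARENT[node] is not None:
        prob *= conditional[node]
        node = PARENT[node]
    return prob


# Made-up conditional probabilities for one image.
conditional = {"animal": 0.9, "dog": 0.8, "cat": 0.2, "terrier": 0.5}
for name in ("animal", "dog", "terrier"):
    print(name, round(absolute_probability(name, conditional), 3))
```

The deeper the node, the smaller its absolute probability, so a prediction can stop at "dog" when the evidence for a specific breed is weak.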
Joint Classification and Detection
Two datasets are used to train the large-scale detector. One is a traditional classification dataset, which contains a large number of categories. The other is a detection dataset. When a detection image is seen, the loss is backpropagated as normal. For the classification loss, the loss is backpropagated only at or above the corresponding level of the label.
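"At or above the corresponding level of the label" means that only the labeled node and its ancestors contribute to the classification loss; anything more specific than the label is left alone. A minimal sketch on a made-up toy tree (the tree and the name `loss_nodes` are hypothetical):

```python
# Hypothetical toy hierarchy: child -> parent (the root has parent None).
PARENT = {
    "physical object": None,
    "animal": "physical object",
    "dog": "animal",
    "terrier": "dog",
    "cat": "animal",
}


def loss_nodes(label, parent=PARENT):
    """Return the label and all of its ancestors: the nodes that receive loss."""
    nodes = []
    node = label
    while node is not None:
        nodes.append(node)
        node = parent[node]
    return nodes


# An image labeled only "dog" says nothing about which breed it is,
# so "terrier" gets no classification loss.
print(loss_nodes("dog"))  # ['dog', 'animal', 'physical object']
```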