I am no longer maintaining this website; the blog has moved to https://joshua19881228.github.io. I will keep writing there, and you are welcome to visit.
Pub. Date:July 10, 2017, 4:18 p.m. Topic:Computer Vision Tag:Reading Note

TITLE: Optimizing Deep CNN-Based Queries over Video Streams at Scale

AUTHOR: Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, Matei Zaharia

ASSOCIATION: Stanford InfoLab

FROM: arXiv:1703.02529


  1. NOSCOPE, the first data management system that accelerates CNN-based classification queries over video streams at scale.
  2. CNN-specific techniques for difference detection across frames and model specialization for a given stream and query, as well as a cost-based optimizer that can automatically identify the best combination of these filters for a given accuracy target.
  3. An evaluation of NOSCOPE on fixed-angle binary classification showing up to 3,200x speedups on real-world data.


The workflow of NoScope is shown in the following figure. Briefly, NoScope's optimizer selects a different configuration of difference detectors and specialized models for each video stream, so that binary classification runs as quickly as possible and the full target CNN is called only when necessary.

Overall Framework of NoScope
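
The flow described above can be sketched as a simple cascade. This is only a minimal illustration in plain Python, not NoScope's actual code: the mean-squared-error difference detector, the `low`/`high` confidence thresholds and every function name here are assumptions made for the example.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]  # assumption: a frame flattened to pixel intensities


def mean_squared_error(a: Frame, b: Frame) -> float:
    """Mean squared pixel difference between two equally sized frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def classify_stream(
    frames: List[Frame],
    reference: Frame,
    specialized_model: Callable[[Frame], float],
    target_cnn: Callable[[Frame], bool],
    diff_threshold: float,
    low: float,
    high: float,
) -> List[bool]:
    """Label each frame, calling target_cnn only when the cheap filters are unsure."""
    labels = []
    for frame in frames:
        # 1. Difference detector: a frame close to the empty reference is negative.
        if mean_squared_error(frame, reference) < diff_threshold:
            labels.append(False)
            continue
        # 2. Specialized model: accept its answer when the score is confident.
        score = specialized_model(frame)
        if score <= low:
            labels.append(False)
        elif score >= high:
            labels.append(True)
        else:
            # 3. Otherwise fall back to the expensive target CNN.
            labels.append(target_cnn(frame))
    return labels
```

The point of the ordering is that each stage is cheaper than the next, so most frames should be resolved before the target CNN is ever reached.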

The system has three main components: Difference Detectors, Specialized Models and a Cost-based Optimizer.

  1. Difference Detectors attempt to detect differences between images. They determine whether the current frame is significantly different from another image whose label is known. Two forms are supported: difference detection against a fixed reference image for the video stream that is known to contain no objects, and difference detection against an earlier frame, some configured time in the past.
  2. Specialized Models are small CNNs specialized for each video and query. They are designed using different combinations of numbers of channels and layers. They can be thought of as expert classifiers or detectors for different videos. For static cameras, a specialized model does not need to consider samples that would only appear in other cameras.
  3. The Cost-based Optimizer combines difference detectors and model specialization so as to maximize throughput subject to an accuracy constraint, e.g. target FP and FN rates.
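
The optimizer's job in item 3 can be pictured as a constrained search. The sketch below assumes each candidate configuration has already been measured on held-out video; the `Config` fields and the `pick_config` helper are hypothetical names for illustration, and the paper's real optimizer does more than this selection step.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    name: str          # hypothetical label for a detector/model combination
    throughput: float  # frames per second measured on held-out video
    fp_rate: float     # false positive rate against the target CNN's labels
    fn_rate: float     # false negative rate against the target CNN's labels


def pick_config(
    candidates: Iterable[Config], max_fp: float, max_fn: float
) -> Optional[Config]:
    """Return the fastest candidate that satisfies both error budgets, if any."""
    feasible = [
        c for c in candidates if c.fp_rate <= max_fp and c.fn_rate <= max_fn
    ]
    return max(feasible, key=lambda c: c.throughput, default=None)
```

If no candidate meets the budgets the function returns `None`, which in a real system would mean falling back to the target CNN alone.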


  1. This scheme is suitable for fixed views, but if the input changes frequently, this scheme may work less efficiently or effectively.

TITLE: Learning Spatial Regularization with Image-level Supervisions for Multi-label Image Classification

AUTHOR: Feng Zhu, Hongsheng Li, Wanli Ouyang, Nenghai Yu, Xiaogang Wang

ASSOCIATION: University of Science and Technology of China, University of Sydney, The Chinese University of Hong Kong

FROM: arXiv:1702.05891


  1. An end-to-end deep neural network for multi-label image classification is proposed, which exploits both semantic and spatial relations of labels by training learnable convolutions on the attention maps of labels. Such relations are learned with only image-level supervisions. Investigation and visualization of learned models demonstrate that our model can effectively capture semantic and spatial relations of labels.
  2. The proposed algorithm has great generalization capability and works well on data with different types of labels.


The proposed Spatial Regularization Net (SRN) takes visual features from the main net as inputs and learns to regularize spatial relations between labels. Such relations are exploited based on the learned attention maps for the multiple labels. Label confidences from both main net and SRN are aggregated to generate final classification confidences. The whole network is a unified framework and is trained in an end-to-end manner.

The scheme of SRN is illustrated in the following figure.

Overall Framework of SRN

To train the network,

  1. Finetune only the main net on the target dataset. Both $ f_{cnn} $ and $ f_{cls} $ are learned with cross-entropy loss for classification.
  2. Fix $ f_{cnn} $ and $ f_{cls} $. Train $ f_{att} $ and $ conv1 $ with cross-entropy loss for classification.
  3. Train $ f_{sr} $ with cross-entropy loss for classification by fixing all other sub-networks.
  4. The whole network is jointly finetuned with joint loss.
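
The four steps amount to a schedule of which sub-networks are updated at each stage. Below is a small framework-free sketch of that schedule; the stage labels and the `frozen_subnets` helper are my own naming, and a real implementation would set learning rates or gradient flags in the training framework instead.

```python
# Sub-network names as used in the note.
SUBNETS = ("f_cnn", "f_cls", "f_att", "conv1", "f_sr")

# (stage description, sub-networks whose weights are updated in that stage)
STAGES = [
    ("finetune main net", {"f_cnn", "f_cls"}),
    ("train attention branch", {"f_att", "conv1"}),
    ("train spatial regularization", {"f_sr"}),
    ("joint finetuning", set(SUBNETS)),
]


def frozen_subnets(stage_index: int) -> list:
    """Return the sub-networks that are not updated in the given stage (0-based)."""
    _, trainable = STAGES[stage_index]
    return [name for name in SUBNETS if name not in trainable]
```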

The main network follows the structure of ResNet-101 and is finetuned on the target dataset. The Attention Map and the Confidence Map each have $ C $ channels, the same as the number of categories. In step 2, the two maps are merged by element-wise multiplication and average-pooled into a feature vector. In step 3, $ f_{sr} $ replaces the average pooling. $ f_{sr} $ is implemented as three convolution layers with ReLU nonlinearity followed by one fully-connected layer, as shown in the following figure.
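
The merge used in step 2 is simple enough to write out. This is a plain-Python sketch under the description above (element-wise product, then a spatial average per label); the nested-list representation and the function name are assumptions for illustration, and any normalization the paper applies to the maps is left out.

```python
from typing import List

Map = List[List[float]]  # one H x W map for a single label


def attention_pooled_scores(
    attention: List[Map], confidence: List[Map]
) -> List[float]:
    """For each of the C labels, multiply the two maps element-wise and average over space."""
    scores = []
    for att_map, conf_map in zip(attention, confidence):
        products = [
            a * s
            for att_row, conf_row in zip(att_map, conf_map)
            for a, s in zip(att_row, conf_row)
        ]
        scores.append(sum(products) / len(products))
    return scores
```

The result has one value per label, so it can be trained with the same classification loss as the main net.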

Structure of $ f_{sr} $

$ conv4 $ is composed of single-channel filters. In Caffe, it can be implemented using the "group" parameter. The design is motivated by the observation that one label may be semantically related to only a small number of other labels, so measuring spatial relations with the unrelated attention maps is unnecessary.
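
To see what grouping buys, it helps to count weights. The helper below is a generic calculation for a grouped 2-D convolution, not code from the paper, and the kernel size and channel counts in the example are illustrative values I picked rather than figures taken from this note.

```python
def conv_weight_count(
    kernel_size: int, in_channels: int, out_channels: int, groups: int = 1
) -> int:
    """Number of weights (biases excluded) in a 2-D convolution with square kernels."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output filter only sees in_channels / groups input channels.
    return kernel_size * kernel_size * (in_channels // groups) * out_channels


# Illustrative sizes only (assumed for the example).
dense = conv_weight_count(kernel_size=14, in_channels=512, out_channels=2048)
grouped = conv_weight_count(
    kernel_size=14, in_channels=512, out_channels=2048, groups=512
)
```

When `groups` equals the number of input channels, every filter is single-channel and the weight count shrinks by a factor of `groups`.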

Pub. Date:July 5, 2017, 10:08 p.m. Topic:Life Discovery Tag:Little Things

The last flower in this month.


Pub. Date:June 29, 2017, 8:33 p.m. Topic:Life Discovery Tag:Little Things

I've had less input recently, and so less output. This month has been really busy. It's time to catch up!

I think I am a single-thread processor. It's really hard for me to handle multiple tasks simultaneously. I get worried about another task while I'm working on one, which means I might mess up the current one. I don't know whether anyone can do it better. Besides the single-thread thing, sometimes I become too anxious to be in the mood to do anything, like keeping a diary, reading papers or picking up a habit. Maybe I'm too narrow-minded?

Pub. Date:June 26, 2017, 11:01 p.m. Topic:Life Discovery Tag:Little Things

Another week.