PaperNotes: Saliency-based Sequential Image Attention with Multiset Prediction

Posted by Simon Duan on July 13, 2018

Unfinished



A hierarchical visual architecture that operates on a saliency map and uses a novel attention mechanism to sequentially focus on salient regions and take additional glimpses within those regions.

The architecture is motivated by human visual attention, and is used for multi-label image classification on a novel multiset task, demonstrating that it achieves high precision and recall while localizing objects with its attention.

Research Objective

Multiset Classification:

Multi-label classification tasks can be categorized by whether the labels are lists, sets, or multisets. By “multi-label”, we usually refer to the first case (lists as labels).

A multiset prediction problem is a generalization of classification, where a target is not a single class but a multiset of classes. The goal is to find a mapping from an input $x$ to a multiset $Y = \{ y_{1}, \dots, y_{|Y|} \}$, where $y_{k} \in C$. Some of the core properties of multiset prediction are:

  1. the input $x$ is an arbitrary vector.
  2. there is no predefined order among the items $y_{i}$ in the target multiset $Y$.
  3. the size of $Y$ may vary depending on the input $x$.
  4. each item in the class set $C$ may appear more than once in $Y$.
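These properties map naturally onto Python's `collections.Counter`, which makes a handy mental model. A minimal sketch (the "dog"/"cat" labels are illustrative, not from the paper):

```python
from collections import Counter

# A multiset target represented as a Counter: classes may repeat,
# and there is no inherent order among them.
# Hypothetical example: an image containing two dogs and one cat.
target = Counter({"dog": 2, "cat": 1})

# Property 2: no predefined order -- two orderings of the same
# labels denote the same multiset.
assert Counter(["dog", "cat", "dog"]) == Counter(["cat", "dog", "dog"])

# Property 4: items from the class set C may appear more than once.
assert target["dog"] == 2

# Property 3: |Y| varies with the input -- here |Y| = 3.
assert sum(target.values()) == 3
```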

From the paper: the loss function for multiset prediction.
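The figure from the paper is not reproduced here. As a rough sketch of the sequential multiset-loss idea (prediction proceeds step by step; at each step the "oracle" target distribution is uniform over the labels not yet predicted, and the per-step loss is the cross-entropy against that oracle), the code below is an assumption about the general recipe, not a transcription of the paper's exact loss:

```python
import math
from collections import Counter

def multiset_step_loss(log_probs, remaining):
    """Per-step multiset loss sketch: cross-entropy between a uniform
    'oracle' distribution over the labels still to be predicted and the
    model's predicted class log-probabilities.
    log_probs: dict mapping class -> model log-probability.
    remaining: Counter of target labels not yet predicted."""
    total = sum(remaining.values())
    # The oracle places mass count/total on each remaining class.
    return -sum((cnt / total) * log_probs[c] for c, cnt in remaining.items())

# Toy usage: a uniform model over three classes, with "cat" left twice
# and "dog" once in the target multiset.
remaining = Counter({"dog": 1, "cat": 2})
uniform = {c: math.log(1 / 3) for c in ("dog", "cat", "bird")}
loss = multiset_step_loss(uniform, remaining)  # equals log 3
```

At each subsequent step, the predicted label would be removed from `remaining` and the loss recomputed, until the multiset is exhausted.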


  • Meta-Controller: the image is passed through the Saliency Model to produce a Saliency Map, which an RNN with a Gaussian attention mechanism turns into an attention mask;
  • Interface: combined with the Activation Model, it converts the attention mask into a Priority Map and a Glimpse Vector;
  • Controller: an RNN that takes glimpse vectors as input, classifies them, and outputs the result.

Details of the Network

The full architecture, expanded for one meta-controller time-step. As input to the system, an initial saliency map and activation volume are generated by the saliency and activation models, respectively. At each meta-controller step, an updated saliency map is mapped to a covert attention mask. The interface forms an initial glimpse vector and priority map from the attention mask. Based on the priority map and the initial glimpse vector, the controller chooses k glimpse locations, then classifies. Note that k + 1 controller steps occur for every meta-controller step.
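The data flow of one meta-controller step can be sketched as a control-flow skeleton. Everything below is a stand-in: the "models" are replaced by trivial array reductions so the sketch runs, and the helper names are invented for illustration; only the structure (mask → initial glimpse + priority map → k glimpses → classify, i.e. k + 1 controller inputs) mirrors the description above:

```python
import numpy as np

def top_k_locations(priority_map, k):
    """Indices of the k largest entries of a 2-D priority map."""
    flat = np.argsort(priority_map, axis=None)[::-1][:k]
    return [np.unravel_index(i, priority_map.shape) for i in flat]

def meta_controller_step(saliency_map, activation_volume, k):
    """One meta-controller step as a structural sketch.
    saliency_map: (H, W); activation_volume: (H, W, C)."""
    # Covert attention mask (stand-in: normalized saliency).
    mask = saliency_map / (saliency_map.max() + 1e-8)
    # Priority map from mask and activations (stand-in reduction).
    priority_map = mask * activation_volume.mean(axis=-1)
    # Initial glimpse vector: mask-weighted pooling of the activations.
    glimpses = [(activation_volume * mask[..., None]).mean(axis=(0, 1))]
    # k controller steps: one glimpse per chosen location.
    for y, x in top_k_locations(priority_map, k):
        glimpses.append(activation_volume[y, x])
    # The (k + 1) glimpse vectors would feed the controller RNN,
    # which then classifies.
    return np.stack(glimpses)
```

For `k` glimpse locations plus the initial glimpse, the returned stack has `k + 1` rows, matching the "k + 1 controller steps per meta-controller step" note above.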