Attention mechanism
https://zhuanlan.zhihu.com/p/33345791
https://zhuanlan.zhihu.com/p/106662375
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
- Why rescale with \(\frac{1}{\sqrt{512}}\) (more generally, \(\frac{1}{\sqrt{d_k}}\))? See the sketch below this list.
- Why Layer Norm?
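A quick numerical sketch of both questions, with toy sizes assumed: dot products of \(d_k\)-dimensional random vectors have standard deviation around \(\sqrt{d_k}\), so without the \(\frac{1}{\sqrt{d_k}}\) factor the softmax saturates; the LayerNorm question is illustrated with the usual residual-plus-LayerNorm sublayer pattern.

```python
import torch
import torch.nn.functional as F

d_k = 512
q, k, v = (torch.randn(100, d_k) for _ in range(3))

# Unscaled logits have std ~ sqrt(d_k) ~ 22.6, which pushes the softmax towards
# one-hot outputs and kills gradients; rescaling keeps them near unit variance.
logits = q @ k.t()
print(logits.std().item(), (logits / d_k ** 0.5).std().item())

attn_out = F.softmax(logits / d_k ** 0.5, dim=-1) @ v

# The standard Transformer sublayer: a residual connection plus LayerNorm keeps
# activations well-scaled as many attention blocks are stacked.
layer_norm = torch.nn.LayerNorm(d_k)
y = layer_norm(q + attn_out)
```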
Relation between FC layer and non-local block
Relation between self-attention and non-local block.
Considering the 1D non-local case, an FC layer can be seen as a matrix multiplication with a fixed learned weight matrix, whereas the non-local (self-attention) block multiplies by an affinity matrix computed from the input.
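A minimal sketch of that comparison on a toy 1D sequence; the sizes and the matrix stand-ins for 1x1 convolutions are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

N, C = 16, 32                      # sequence length, channels (toy sizes)
x = torch.randn(N, C)

# FC layer mixing positions: a FIXED learned N x N matrix, independent of the input.
W_fixed = torch.randn(N, N)
fc_out = W_fixed @ x               # (N, C)

# 1D non-local / self-attention: the N x N mixing matrix is COMPUTED FROM the input.
theta, phi, g = (torch.randn(C, C) for _ in range(3))       # 1x1-conv equivalents
affinity = F.softmax((x @ theta) @ (x @ phi).t(), dim=-1)    # (N, N), input-dependent
nl_out = affinity @ (x @ g)        # (N, C)
```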
Relation between gram matrix and non-local block.
check https://arxiv.org/pdf/1701.01036.pdf
Tensor Low-Rank Reconstruction for Semantic Segmentation,ECCV20
Feature Pyramid Transformer,ECCV20
Object-Centric Learning with Slot Attention,Arxiv2006
dataset!
Exploring Self-attention for Image Recognition,CVPR20
- A replacement for the convolution block.
- These architectures – SAN10, SAN15, and SAN19 – are in rough correspondence with ResNet26, ResNet38, and ResNet50.
- Position encoding is important, see Table 8.
- Zero-shot robustness to rotated images and adversarial attacks is a key selling point; worth digging into.
- A story-telling paper: no comparison with ResNeXt or SKNet, and no comparison with a hybrid of conv blocks (low-level) and self-attention blocks (high-level) as in the non-local paper.
Dynamic Graph Message Passing Networks,CVPR20oral
Sample at different rates (scales), then perform random walks to propagate the information.
Most of the heads can be removed by the stochastic gates.
Gumbel softmax
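For reference, a minimal sketch of the Gumbel-softmax relaxation that such stochastic gates typically rely on; the temperature and the 8-head keep/drop setup are illustrative assumptions, and torch.nn.functional.gumbel_softmax provides the same thing built in.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    # Add Gumbel noise to the logits, then take a temperature-controlled softmax:
    # a differentiable approximation of sampling a discrete gate.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

gate_logits = torch.zeros(8, 2)                              # keep/drop logits for 8 heads
gates = gumbel_softmax_sample(gate_logits, tau=0.5)[:, 0]    # soft "keep" probabilities
```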
DCANet: Learning Connected Attentions for Convolutional Neural Networks,ECCV20reject
Normalized Attention Without Probability Cage,Arxiv2005
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks,ICML19
SYNTHESIZER:Rethinking Self-Attention in Transformer Models,Arxiv2005
Replace \(Q(x)K(x)^{T}\) with a function \(F(x)\) that maps directly from dimension \(d\) to sequence length \(l\).
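A minimal sketch in the spirit of the dense Synthesizer variant; the two-layer MLP for \(F\) and the single-head, single-layer setup are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizerAttention(nn.Module):
    # Attention logits are predicted per token by an MLP d -> l,
    # instead of being computed as pairwise dot products Q K^T.
    # Note: l (the sequence length) must be fixed in advance for this variant.
    def __init__(self, d, l):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, l))
        self.value = nn.Linear(d, d)

    def forward(self, x):            # x: (batch, l, d)
        logits = self.f(x)           # (batch, l, l): no token-token dot products
        attn = F.softmax(logits, dim=-1)
        return attn @ self.value(x)  # (batch, l, d)
```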
Cross Attention Network for Few-shot Classification,NeurIPS19
For the fusion layer, the input is WH x H x W. It seems that they apply a second round of attention, between WH x WH, via 2D convolution. This module can also be injected into the non-local block. Overall, you can see the shadow of the non-local block in this paper. Some differences I feel are:
- multiple image pairs
- a fusion layer to further reason over the WHxWH correlations, beyond the torch.matmul(WHxC,CxWH) in the non-local block; after all operations, a softmax is appended to finalize the whole attention mechanism.
- You can even see a residual connection after the fusion layer, which is also observed in the non-local block!
- The non-local block is not bidirectional, while the authors' network is: the two branches can influence each other because the correlation layer generates two mutually transposed matrices via torch.matmul(WHxC,CxWH), as sketched below.
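A tiny sketch, with toy sizes, of how one matmul in such a correlation layer yields two mutually transposed attention maps:

```python
import torch

WH, C = 25, 64            # flattened spatial size and channels (toy values)
P = torch.randn(WH, C)    # e.g., support-image features, flattened to WH x C
Q = torch.randn(WH, C)    # e.g., query-image features

R_pq = P @ Q.t()          # (WH, WH): support positions attending to query positions
R_qp = R_pq.t()           # the transpose attends in the opposite direction,
                          # so the two branches can influence each other
```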
Effective Approaches to Attention-based Neural Machine Translation,EMNLP15
Global vs. Local Attention
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,ICML15
Introduces hard and soft attention.
Variants
Multi Head Attention,NeurIPS17
Concatenate the per-head results of \(\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{n}}\right)V\) and send them into a linear layer to remap back to the original shape.
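A compact sketch of multi-head attention as described above, with the per-head dimension written as \(d_k = d_{model}/h\); the default sizes are the usual toy values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.h, self.d_k = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)    # remaps the concatenated heads

    def forward(self, x):                         # x: (B, L, d_model)
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, h, L, d_k)
        q, k, v = (t.view(B, L, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        heads = attn @ v                                     # (B, h, L, d_k)
        concat = heads.transpose(1, 2).reshape(B, L, -1)     # concatenate the heads
        return self.out(concat)                              # linear remap to original shape
```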
A2-nets-double-attention-networks,NIPS18
For SENet, global average pooling is used in the gathering process, while the resulting single global feature is distributed to all locations, ignoring the different needs across locations. Seeing these shortcomings, we introduce a generic formulation and propose the Double Attention block.
check demo code: https://github.com/gjylt/DoubleAttentionNet
check the tensor-size illustration figure: Google Drive.
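A minimal sketch of the gather-then-distribute idea behind the double attention block; the reduced channel sizes c_m, c_n and the residual add are assumptions, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleAttention(nn.Module):
    """Sketch of a double attention block: (1) gather global descriptors with one
    attention map, (2) distribute them back to every location with a second one."""
    def __init__(self, in_channels, c_m, c_n):
        super().__init__()
        self.to_a = nn.Conv2d(in_channels, c_m, 1)   # features to be gathered
        self.to_b = nn.Conv2d(in_channels, c_n, 1)   # gathering attention
        self.to_v = nn.Conv2d(in_channels, c_n, 1)   # distribution attention
        self.proj = nn.Conv2d(c_m, in_channels, 1)

    def forward(self, x):
        B, _, H, W = x.shape
        A = self.to_a(x).flatten(2)                          # (B, c_m, HW)
        Bmap = F.softmax(self.to_b(x).flatten(2), dim=-1)    # attend over positions
        V = F.softmax(self.to_v(x).flatten(2), dim=1)        # per-location mixture of descriptors
        G = torch.bmm(A, Bmap.transpose(1, 2))               # gather: (B, c_m, c_n)
        Z = torch.bmm(G, V)                                  # distribute: (B, c_m, HW)
        return x + self.proj(Z.view(B, -1, H, W))            # residual add (assumed)
```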
Graph-Based Global Reasoning Networks,CVPR19
check tensor size fig
Different from the recently proposed Non-local Neural Networks (NL-Nets) and Double Attention Networks which only focus on delivering information and rely on convolution layers for reasoning, our proposed model is able to directly reason on relations over regions. Similarly, Squeeze-and-Excitation Networks (SE-Nets) only focus on incorporating image-level features via global average pooling, leading to an interaction graph containing only one node. It is not designed for regional reasoning as our proposed method. Extensive experiments show that inserting our GloRe can consistently boost performance of state-of-the-art CNN architectures on diverse tasks including image classification, semantic segmentation and video action recognition.
Uses max pooling and average pooling together.
An Empirical Study of Spatial Attention Mechanisms in Deep Networks,ICCV19
CCNet: Criss-Cross Attention for Semantic Segmentation,ICCV19
- The two criss-cross attention modules (applied one after the other) share the same parameters to avoid adding too many extra parameters.
- The attention matrix is \((H+W-1)\times WH\); the softmax is applied along the \(H+W-1\) axis.
Local Relation Networks for Image Recognition,ICCV19
Check Fig 2.
Uses non-local operations (i.e., self-attention) to help compute local relations?
Dynamic Graph Message Passing Networks
Quite similar to deformable convolution. A fundamental difference from deformable convolution is that it only learns the offsets dependent on the input feature, while the filter weights are fixed for all inputs. In contrast, this model learns the random walk, weights, and affinities as all being dependent on the input. This property makes the weights and affinities position-specific, whereas deformable convolution shares the same weight across all convolution positions in the feature map. It learns to sample a set of K nodes (where K = 9) for message passing globally from the whole feature map, which allows the model to capture a larger receptive field than deformable convolution.
Adaptive Pyramid Context Network for Semantic Segmentation,CVPR19
Unlike the non-local block with three inputs, the module here has only two inputs. Check Fig 2. Reshaping from \(H\times W \times S^{2}\) to \(HW \times S^{2}\) to form the affinity matrix is a little counter-intuitive.
Divide the feature map X of image I into s×s subregions, running adaptive pooling to generate 1×1, 2×2, 3×3, 4×4, … region-pooled features (sketched below).
No softmax is applied to the affinity matrix.
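A minimal sketch of the multi-scale adaptive pooling step; the scale set here is the 1×1 … 4×4 from the note, which is an assumption rather than the paper's exact choice.

```python
import torch
import torch.nn.functional as F

def region_pooled_features(x, scales=(1, 2, 3, 4)):
    # x: (B, C, H, W). For each scale s, adaptive average pooling divides the
    # feature map into s x s subregions and pools each one, giving (B, C, s, s).
    return [F.adaptive_avg_pool2d(x, s) for s in scales]

feats = region_pooled_features(torch.randn(2, 256, 64, 64))
```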
Long-Term Feature Banks for Detailed Video Understanding,CVPR19
Asymmetric Non-local Neural Networks for Semantic Segmentation,ICCV19
check Fig 1.
CARAFE: Content-Aware ReAssembly of FEatures,ICCV19
CARAFE can be seamlessly integrated into existing frameworks where upsampling operators are needed
CARAFE works as a reassembly operator with content-aware kernels. It consists of two steps. The first step is to predict a reassembly kernel for each target location according to its content, and the second step is to reassemble the features with the predicted kernels.
check code
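A rough sketch of those two steps (kernel prediction, then content-aware reassembly), assuming an upsample factor of 2, k_up = 5, and a small compressed channel width; it trades memory for simplicity and is not the paper's optimized implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFESketch(nn.Module):
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, 1)
        self.encoder = nn.Conv2d(c_mid, scale * scale * k_up * k_up,
                                 k_enc, padding=k_enc // 2)

    def forward(self, x):
        B, C, H, W = x.shape
        s, k = self.scale, self.k_up
        # Step 1: predict a k x k reassembly kernel for every target location.
        kernels = self.encoder(self.compress(x))    # (B, s^2 * k^2, H, W)
        kernels = F.pixel_shuffle(kernels, s)       # (B, k^2, sH, sW)
        kernels = F.softmax(kernels, dim=1)         # normalise each kernel
        # Step 2: gather each source location's k x k neighbourhood, replicate it
        # to its s x s target block, and take the kernel-weighted sum.
        patches = F.unfold(x, k, padding=k // 2).view(B, C * k * k, H, W)
        patches = F.interpolate(patches, scale_factor=s, mode='nearest')
        patches = patches.view(B, C, k * k, s * H, s * W)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)   # (B, C, sH, sW)
```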
Spatial Pyramid Based Graph Reasoning for Semantic Segmentation,CVPR20
select kernel between \(3 \times 3\) and \(5 \times 5\).
ResNeSt: Split-Attention Networks,Arxiv2002
As in ResNeXt blocks, the input feature-map can be divided into several groups along the channel dimension, and the number of feature-map groups is given by a cardinality hyperparameter K. We refer to the resulting feature-map groups as cardinal groups. We introduce a new radix hyperparameter R that dictates the number of splits within a cardinal group. Then the block input X is split into G = KR groups along the channel dimension.
Can be seen as a combination of ResNeXt and SKNet; a sketch of the split attention within one cardinal group follows.
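A sketch of the radix-softmax split attention within a single cardinal group; the reduction width of the small gating MLP is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Sketch of radix-softmax split attention for ONE cardinal group."""
    def __init__(self, channels, radix=2, reduce=4):
        super().__init__()
        self.radix = radix
        inter = max(channels // reduce, 8)
        self.fc1 = nn.Conv2d(channels, inter, 1)
        self.fc2 = nn.Conv2d(inter, channels * radix, 1)

    def forward(self, x):                                   # x: (B, radix*channels, H, W)
        B, RC, H, W = x.shape
        C = RC // self.radix
        splits = x.view(B, self.radix, C, H, W)             # the R splits of this group
        gap = splits.sum(dim=1).mean(dim=(2, 3), keepdim=True)   # pooled descriptor (B, C, 1, 1)
        attn = self.fc2(F.relu(self.fc1(gap)))              # (B, R*C, 1, 1)
        attn = F.softmax(attn.view(B, self.radix, C, 1, 1), dim=1)  # radix-softmax across splits
        return (attn * splits).sum(dim=1)                   # fused group: (B, C, H, W)
```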
Improving Convolutional Networks with Self-Calibrated Convolutions,CVPR20
An improvement based on SENet with fewer parameters and better performance. Check Fig 2. Doing a 1D convolution with kernel size k along the feature dimension (\(1 \times 1 \times C\)) is a little counter-intuitive to me; there should be no concept of 'neighbourhood' in the feature dimension.
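For reference, a minimal sketch of the 1D-convolution-over-channels gating the note describes; this is my reconstruction of the idea, not necessarily the paper's exact module, and the kernel size k is a hyperparameter.

```python
import torch
import torch.nn as nn

class ChannelConv1dGate(nn.Module):
    # Squeeze to a 1 x 1 x C descriptor, run a k-sized 1D conv across channels
    # (treating the channel axis as a sequence), then gate the input with a sigmoid.
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                     # (B, C) global average pooling
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # 1D conv over the channel axis
        return x * torch.sigmoid(y)[:, :, None, None]
```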
DANet: Dual Attention Network for Scene Segmentation,CVPR19
Bilinear Attention Networks,NIPS18
Compact Generalized Non-local Network,NIPS18
Stand-Alone Self-Attention in Vision Models,NIPS19
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond,ICCV19
BAM: Bottleneck Attention Module,BMVC18
Applications
Few-shot detection: Context-Transformer: Tackling Object Confusion for Few-Shot Detection,AAAI20
Self-Attention Generative Adversarial Networks
copy non-local block into GAN.
Actor-Transformers for Group Activity Recognition,CVPR20
Group Activity Recognition: use non-local block to fuse optical flow, pose, RGB. check Fig 2.
Gate Mechanism
RNN
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
LSTM
https://en.wikipedia.org/wiki/Long_short-term_memory
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!” An LSTM has three of these gates, to protect and control the cell state.
Intuition: tanh pushes the values to be between −1 and 1, and multiplying by the output of the sigmoid gate (values between 0 and 1) means we only output the parts we decided to.
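A minimal single-step LSTM cell written out to expose the gates; the weight shapes follow the standard formulation, and this is a sketch rather than a drop-in replacement for nn.LSTMCell.

```python
import torch

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)."""
    gates = W @ x + U @ h_prev + b
    i, f, o, g = gates.chunk(4)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates in (0, 1)
    g = torch.tanh(g)                     # candidate cell values in (-1, 1)
    c = f * c_prev + i * g                # forget part of the old state, write new content
    h = o * torch.tanh(c)                 # output only the parts the output gate allows
    return h, c

# Toy usage with random weights.
D, Hd = 10, 20
x, h, c = torch.randn(D), torch.zeros(Hd), torch.zeros(Hd)
W, U, b = torch.randn(4 * Hd, D), torch.randn(4 * Hd, Hd), torch.zeros(4 * Hd)
h, c = lstm_cell_step(x, h, c, W, U, b)
```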