Attention mechanism
https://zhuanlan.zhihu.com/p/33345791
https://zhuanlan.zhihu.com/p/106662375
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
- Why rescale with \(\frac{1}{\sqrt{512}}\) (more generally, \(\frac{1}{\sqrt{d_k}}\))? See the sketch below this list.
- Why Layer Norm?
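A quick numerical sketch of both questions, with toy sizes assumed: dot products of \(d_k\)-dimensional random vectors have standard deviation around \(\sqrt{d_k}\), so without the \(\frac{1}{\sqrt{d_k}}\) factor the softmax saturates; the LayerNorm question is illustrated with the usual residual-plus-LayerNorm sublayer pattern.

```python
import torch
import torch.nn.functional as F

d_k = 512
q, k, v = (torch.randn(100, d_k) for _ in range(3))

# Unscaled logits have std ~ sqrt(d_k) ~ 22.6, which pushes the softmax towards
# one-hot outputs and kills gradients; rescaling keeps them near unit variance.
logits = q @ k.t()
print(logits.std().item(), (logits / d_k ** 0.5).std().item())

attn_out = F.softmax(logits / d_k ** 0.5, dim=-1) @ v

# The standard Transformer sublayer: a residual connection plus LayerNorm keeps
# activations well-scaled as many attention blocks are stacked.
layer_norm = torch.nn.LayerNorm(d_k)
y = layer_norm(q + attn_out)
```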
Relation between FC layer and non-local block
Relation between self-attention and non-local block.
Considering the 1D non-local case, an FC layer can be seen as a matrix multiplication with a fixed learned weight matrix, whereas the non-local (self-attention) block multiplies by an affinity matrix computed from the input.
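A minimal sketch of that comparison on a toy 1D sequence; the sizes and the matrix stand-ins for 1x1 convolutions are arbitrary assumptions.

```python
import torch
import torch.nn.functional as F

N, C = 16, 32                      # sequence length, channels (toy sizes)
x = torch.randn(N, C)

# FC layer mixing positions: a FIXED learned N x N matrix, independent of the input.
W_fixed = torch.randn(N, N)
fc_out = W_fixed @ x               # (N, C)

# 1D non-local / self-attention: the N x N mixing matrix is COMPUTED FROM the input.
theta, phi, g = (torch.randn(C, C) for _ in range(3))       # 1x1-conv equivalents
affinity = F.softmax((x @ theta) @ (x @ phi).t(), dim=-1)    # (N, N), input-dependent
nl_out = affinity @ (x @ g)        # (N, C)
```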
Relation between gram matrix and non-local block.
check https://arxiv.org/pdf/1701.01036.pdf
Tensor Low-Rank Reconstruction for Semantic Segmentation,ECCV20
Feature Pyramid Transformer,ECCV20
Object-Centric Learning with Slot Attention,Arxiv2006
dataset!
Exploring Self-attention for Image Recognition,CVPR20
- A replacement for the convolution block.
- These architectures – SAN10, SAN15, and SAN19 – are in rough correspondence with ResNet26, ResNet38, and ResNet50.
- Position encoding is important, see Table 8.
- Zero-shot robustness to rotated images and adversarial attacks is a key selling point; worth digging into.
- A story-telling paper: no comparison with ResNeXt or SKNet, and no comparison with a hybrid of conv blocks (low-level) and self-attention blocks (high-level) as in the non-local paper.
Dynamic Graph Message Passing Networks,CVPR20oral
Sample at different rates (scales), then perform random walks to propagate the information.
Most of the heads can be removed by the stochastic gates.
Gumbel softmax
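For reference, a minimal sketch of the Gumbel-softmax relaxation that such stochastic gates typically rely on; the temperature and the 8-head keep/drop setup are illustrative assumptions, and torch.nn.functional.gumbel_softmax provides the same thing built in.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_sample(logits, tau=1.0):
    # Add Gumbel noise to the logits, then take a temperature-controlled softmax:
    # a differentiable approximation of sampling a discrete gate.
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / tau, dim=-1)

gate_logits = torch.zeros(8, 2)                              # keep/drop logits for 8 heads
gates = gumbel_softmax_sample(gate_logits, tau=0.5)[:, 0]    # soft "keep" probabilities
```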
DCANet: Learning Connected Attentions for Convolutional Neural Networks,ECCV20reject
Normalized Attention Without Probability Cage,Arxiv2005
Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks,ICML19
SYNTHESIZER:Rethinking Self-Attention in Transformer Models,Arxiv2005
Replace \(Q(x)K(x)^{T}\) with a function \(F(x)\) that maps directly from dimension \(d\) to sequence length \(l\).
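A minimal sketch in the spirit of the dense Synthesizer variant; the two-layer MLP for \(F\) and the single-head, single-layer setup are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseSynthesizerAttention(nn.Module):
    # Attention logits are predicted per token by an MLP d -> l,
    # instead of being computed as pairwise dot products Q K^T.
    # Note: l (the sequence length) must be fixed in advance for this variant.
    def __init__(self, d, l):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, l))
        self.value = nn.Linear(d, d)

    def forward(self, x):            # x: (batch, l, d)
        logits = self.f(x)           # (batch, l, l): no token-token dot products
        attn = F.softmax(logits, dim=-1)
        return attn @ self.value(x)  # (batch, l, d)
```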
Cross Attention Network for Few-shot Classification,NeurIPS19
For the fusion layer, the input is WH x H x W. It seems that they apply a second round of attention, between WH x WH, via 2D convolution. This module can also be injected into the non-local block. Overall, you can see the shadow of the non-local block in this paper. Some differences I feel are:
- multiple image pairs
- a fusion layer to further reason over the WHxWH correlations, beyond the torch.matmul(WHxC,CxWH) in the non-local block; after all operations, a softmax is appended to finalize the whole attention mechanism.
- You can even see a residual connection after the fusion layer, which is also observed in the non-local block!
- The non-local block is not bidirectional, while the authors' network is: the two branches can influence each other because the correlation layer generates two mutually transposed matrices via torch.matmul(WHxC,CxWH), as sketched below.
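A tiny sketch, with toy sizes, of how one matmul in such a correlation layer yields two mutually transposed attention maps:

```python
import torch

WH, C = 25, 64            # flattened spatial size and channels (toy values)
P = torch.randn(WH, C)    # e.g., support-image features, flattened to WH x C
Q = torch.randn(WH, C)    # e.g., query-image features

R_pq = P @ Q.t()          # (WH, WH): support positions attending to query positions
R_qp = R_pq.t()           # the transpose attends in the opposite direction,
                          # so the two branches can influence each other
```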
Effective Approaches to Attention-based Neural Machine Translation,EMNLP15
Global vs. Local Attention
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention,ICML15
Introduces hard and soft attention.
Variants
Multi Head Attention,NeurIPS17
Concatenate the per-head results of \(\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{n}}\right)V\) and send them into a linear layer to remap back to the original shape.
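A compact sketch of multi-head attention as described above, with the per-head dimension written as \(d_k = d_{model}/h\); the default sizes are the usual toy values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.h, self.d_k = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)    # remaps the concatenated heads

    def forward(self, x):                         # x: (B, L, d_model)
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (B, h, L, d_k)
        q, k, v = (t.view(B, L, self.h, self.d_k).transpose(1, 2) for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        heads = attn @ v                                     # (B, h, L, d_k)
        concat = heads.transpose(1, 2).reshape(B, L, -1)     # concatenate the heads
        return self.out(concat)                              # linear remap to original shape
```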
A2-nets-double-attention-networks,NIPS18
For SENet, global average pooling is used in the gathering process, while the resulting single global feature is distributed to all locations, ignoring the different needs across locations. Seeing these shortcomings, we introduce a generic formulation and propose the Double Attention block.
check demo code: https://github.com/gjylt/DoubleAttentionNet
check the tensor-size illustration figure: Google Drive.
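A minimal sketch of the gather-then-distribute idea behind the double attention block; the reduced channel sizes c_m, c_n and the residual add are assumptions, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleAttention(nn.Module):
    """Sketch of a double attention block: (1) gather global descriptors with one
    attention map, (2) distribute them back to every location with a second one."""
    def __init__(self, in_channels, c_m, c_n):
        super().__init__()
        self.to_a = nn.Conv2d(in_channels, c_m, 1)   # features to be gathered
        self.to_b = nn.Conv2d(in_channels, c_n, 1)   # gathering attention
        self.to_v = nn.Conv2d(in_channels, c_n, 1)   # distribution attention
        self.proj = nn.Conv2d(c_m, in_channels, 1)

    def forward(self, x):
        B, _, H, W = x.shape
        A = self.to_a(x).flatten(2)                          # (B, c_m, HW)
        Bmap = F.softmax(self.to_b(x).flatten(2), dim=-1)    # attend over positions
        V = F.softmax(self.to_v(x).flatten(2), dim=1)        # per-location mixture of descriptors
        G = torch.bmm(A, Bmap.transpose(1, 2))               # gather: (B, c_m, c_n)
        Z = torch.bmm(G, V)                                  # distribute: (B, c_m, HW)
        return x + self.proj(Z.view(B, -1, H, W))            # residual add (assumed)
```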
Graph-Based Global Reasoning Networks,CVPR19
check tensor size fig
Different from the recently proposed Non-local Neural Networks (NL-Nets) and Double Attention Networks which only focus on delivering information and rely on convolution layers for reasoning, our proposed model is able to directly reason on relations over regions. Similarly, Squeeze-and-Excitation Networks (SE-Nets) only focus on incorporating image-level features via global average pooling, leading to an interaction graph containing only one node. It is not designed for regional reasoning as our proposed method. Extensive experiments show that inserting our GloRe can consistently boost performance of state-of-the-art CNN architectures on diverse tasks including image classification, semantic segmentation and video action recognition.
Uses max pooling and average pooling together.
An Empirical Study of Spatial Attention Mechanisms in Deep Networks,ICCV19
CCNet: Criss-Cross Attention for Semantic Segmentation,ICCV19
- The two criss-cross attention modules (applied one after the other) share the same parameters to avoid adding too many extra parameters.
- The attention matrix is \((H+W-1)\times WH\); the softmax is applied along the \(H+W-1\) axis.
Local Relation Networks for Image Recognition,ICCV19
Check Fig 2.
Uses non-local operations (i.e., self-attention) to help compute local relations?
Dynamic Graph Message Passing Networks
Quite similar to deformable convolution. A fundamental difference from deformable convolution is that it only learns the offsets dependent on the input feature, while the filter weights are fixed for all inputs. In contrast, this model learns the random walk, weights, and affinities as all being dependent on the input. This property makes the weights and affinities position-specific, whereas deformable convolution shares the same weight across all convolution positions in the feature map. It learns to sample a set of K nodes (where K = 9) for message passing globally from the whole feature map, which allows the model to capture a larger receptive field than deformable convolution.
Adaptive Pyramid Context Network for Semantic Segmentation,CVPR19
Unlike the non-local block with three inputs, the module here has only two inputs. Check Fig 2. Reshaping from \(H\times W \times S^{2}\) to \(HW \times S^{2}\) to form the affinity matrix is a little counter-intuitive.
Divide the feature map X of image I into s×s subregions, running adaptive pooling to generate 1×1, 2×2, 3×3, 4×4, … region-pooled features (sketched below).
No softmax is applied to the affinity matrix.
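A minimal sketch of the multi-scale adaptive pooling step; the scale set here is the 1×1 … 4×4 from the note, which is an assumption rather than the paper's exact choice.

```python
import torch
import torch.nn.functional as F

def region_pooled_features(x, scales=(1, 2, 3, 4)):
    # x: (B, C, H, W). For each scale s, adaptive average pooling divides the
    # feature map into s x s subregions and pools each one, giving (B, C, s, s).
    return [F.adaptive_avg_pool2d(x, s) for s in scales]

feats = region_pooled_features(torch.randn(2, 256, 64, 64))
```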
Long-Term Feature Banks for Detailed Video Understanding,CVPR19
Asymmetric Non-local Neural Networks for Semantic Segmentation,ICCV19
check Fig 1.
CARAFE: Content-Aware ReAssembly of FEatures,ICCV19
CARAFE can be seamlessly integrated into existing frameworks where upsampling operators are needed
CARAFE works as a reassembly operator with content-aware kernels. It consists of two steps. The first step is to predict a reassembly kernel for each target location according to its content, and the second step is to reassemble the features with the predicted kernels.
check code
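A rough sketch of those two steps (kernel prediction, then content-aware reassembly), assuming an upsample factor of 2, k_up = 5, and a small compressed channel width; it trades memory for simplicity and is not the paper's optimized implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFESketch(nn.Module):
    def __init__(self, channels, scale=2, k_up=5, k_enc=3, c_mid=64):
        super().__init__()
        self.scale, self.k_up = scale, k_up
        self.compress = nn.Conv2d(channels, c_mid, 1)
        self.encoder = nn.Conv2d(c_mid, scale * scale * k_up * k_up,
                                 k_enc, padding=k_enc // 2)

    def forward(self, x):
        B, C, H, W = x.shape
        s, k = self.scale, self.k_up
        # Step 1: predict a k x k reassembly kernel for every target location.
        kernels = self.encoder(self.compress(x))    # (B, s^2 * k^2, H, W)
        kernels = F.pixel_shuffle(kernels, s)       # (B, k^2, sH, sW)
        kernels = F.softmax(kernels, dim=1)         # normalise each kernel
        # Step 2: gather each source location's k x k neighbourhood, replicate it
        # to its s x s target block, and take the kernel-weighted sum.
        patches = F.unfold(x, k, padding=k // 2).view(B, C * k * k, H, W)
        patches = F.interpolate(patches, scale_factor=s, mode='nearest')
        patches = patches.view(B, C, k * k, s * H, s * W)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)   # (B, C, sH, sW)
```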
Spatial Pyramid Based Graph Reasoning for Semantic Segmentation,CVPR20
select kernel between \(3 \times 3\) and \(5 \times 5\).
ResNeSt: Split-Attention Networks,Arxiv2002
As in ResNeXt blocks, the input feature-map can be divided into several groups along the channel dimension, and the number of feature-map groups is given by a cardinality hyperparameter K. We refer to the resulting feature-map groups as cardinal groups. We introduce a new radix hyperparameter R that dictates the number of splits within a cardinal group. Then the block input X is split into G = KR groups along the channel dimension.
Can be seen as a combination of ResNeXt and SKNet; a sketch of the split attention within one cardinal group follows.
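A sketch of the radix-softmax split attention within a single cardinal group; the reduction width of the small gating MLP is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    """Sketch of radix-softmax split attention for ONE cardinal group."""
    def __init__(self, channels, radix=2, reduce=4):
        super().__init__()
        self.radix = radix
        inter = max(channels // reduce, 8)
        self.fc1 = nn.Conv2d(channels, inter, 1)
        self.fc2 = nn.Conv2d(inter, channels * radix, 1)

    def forward(self, x):                                   # x: (B, radix*channels, H, W)
        B, RC, H, W = x.shape
        C = RC // self.radix
        splits = x.view(B, self.radix, C, H, W)             # the R splits of this group
        gap = splits.sum(dim=1).mean(dim=(2, 3), keepdim=True)   # pooled descriptor (B, C, 1, 1)
        attn = self.fc2(F.relu(self.fc1(gap)))              # (B, R*C, 1, 1)
        attn = F.softmax(attn.view(B, self.radix, C, 1, 1), dim=1)  # radix-softmax across splits
        return (attn * splits).sum(dim=1)                   # fused group: (B, C, H, W)
```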
Improving Convolutional Networks with Self-Calibrated Convolutions,CVPR20
An improvement based on SENet with fewer parameters and better performance. Check Fig 2. Doing a 1D convolution with kernel size k along the feature dimension (\(1 \times 1 \times C\)) is a little counter-intuitive to me; there should be no concept of 'neighbourhood' in the feature dimension.
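For reference, a minimal sketch of the 1D-convolution-over-channels gating the note describes; this is my reconstruction of the idea, not necessarily the paper's exact module, and the kernel size k is a hyperparameter.

```python
import torch
import torch.nn as nn

class ChannelConv1dGate(nn.Module):
    # Squeeze to a 1 x 1 x C descriptor, run a k-sized 1D conv across channels
    # (treating the channel axis as a sequence), then gate the input with a sigmoid.
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                          # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                     # (B, C) global average pooling
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # 1D conv over the channel axis
        return x * torch.sigmoid(y)[:, :, None, None]
```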
DANet: Dual Attention Network for Scene Segmentation,CVPR19
Bilinear Attention Networks,NIPS18
Compact Generalized Non-local Network,NIPS18
Stand-Alone Self-Attention in Vision Models,NIPS19
GCNet: Non-local Networks Meet Squeeze-Excitation Networks and Beyond,ICCV19
BAM: Bottleneck Attention Module,BMVC18
Applications
Few-shot detection: Context-Transformer: Tackling Object Confusion for Few-Shot Detection,AAAI20
Self-Attention Generative Adversarial Networks
copy non-local block into GAN.
Actor-Transformers for Group Activity Recognition,CVPR20
Group Activity Recognition: use non-local block to fuse optical flow, pose, RGB. check Fig 2.
Gate Mechanism
RNN
https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
LSTM
https://en.wikipedia.org/wiki/Long_short-term_memory
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation. The sigmoid layer outputs numbers between zero and one, describing how much of each component should be let through. A value of zero means “let nothing through,” while a value of one means “let everything through!” An LSTM has three of these gates, to protect and control the cell state.
Intuition: tanh pushes the values to be between −1 and 1, and multiplying by the output of the sigmoid gate (values between 0 and 1) means we only output the parts we decided to.
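A minimal single-step LSTM cell written out to expose the gates; the weight shapes follow the standard formulation, and this is a sketch rather than a drop-in replacement for nn.LSTMCell.

```python
import torch

def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)."""
    gates = W @ x + U @ h_prev + b
    i, f, o, g = gates.chunk(4)
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates in (0, 1)
    g = torch.tanh(g)                     # candidate cell values in (-1, 1)
    c = f * c_prev + i * g                # forget part of the old state, write new content
    h = o * torch.tanh(c)                 # output only the parts the output gate allows
    return h, c

# Toy usage with random weights.
D, Hd = 10, 20
x, h, c = torch.randn(D), torch.zeros(Hd), torch.zeros(Hd)
W, U, b = torch.randn(4 * Hd, D), torch.randn(4 * Hd, Hd), torch.zeros(4 * Hd)
h, c = lstm_cell_step(x, h, c, W, U, b)
```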