Classification, Segmentation, Detection

Detection

For the main progress, check Survey 2019.

Mainstream progress

TIDE: A General Toolbox for Identifying Object Detection Errors,ECCV20,spotlight

End-to-End Object Detection with Transformers,Arxiv2005

EfficientDet: Scalable and Efficient Object Detection,CVPR20

Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training,Arxiv2004

Dynamic R-CNN adjusts the label assignment criterion (IoU threshold) and the shape of the regression loss function (parameters of the SmoothL1 loss) automatically, based on the statistics of proposals during training.
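A minimal sketch of what "based on the statistics of proposals" could look like. The specific statistics used here (mean of the k-th largest IoU per image for the threshold, median of the k-th smallest regression error for the SmoothL1 parameter) and all function names are assumptions for illustration, not taken from these notes.

```python
from statistics import mean, median


def update_iou_threshold(ious_per_image, k):
    """Assumed rule: mean over images of the k-th largest proposal IoU."""
    kth_largest = [
        sorted(ious, reverse=True)[min(k, len(ious)) - 1]
        for ious in ious_per_image
    ]
    return mean(kth_largest)


def update_smooth_l1_beta(errors_per_image, k):
    """Assumed rule: median over images of the k-th smallest regression error."""
    kth_smallest = [
        sorted(errors)[min(k, len(errors)) - 1]
        for errors in errors_per_image
    ]
    return median(kth_smallest)


def smooth_l1(x, beta):
    """SmoothL1 loss: quadratic for |x| < beta, linear beyond it."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


# As proposals improve during training, the k-th largest IoU rises,
# so the positive/negative threshold rises with it.
threshold = update_iou_threshold([[0.9, 0.7, 0.5], [0.8, 0.6, 0.4]], k=2)
beta = update_smooth_l1_beta([[0.1, 0.3, 0.2], [0.4, 0.2, 0.6]], k=2)
```

A smaller beta makes the loss behave like L1 over a wider range of errors, which keeps gradients from vanishing once the regression errors become small.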

YOLOv4

Creating a CNN that operates in real-time on a conventional GPU, and for which training requires only one conventional GPU.

Need a careful check of the practical tricks for the detection task.

zhihu

Mosaic
Self-adversarial training
Cross mini-batch normalization
Pointwise SAM

Anchor-free

FCOS: A Simple and Strong Anchor-free Object Detector

CenterNet

No anchors any more: there is only one positive “anchor” per object, so Non-Maximum Suppression (NMS) is not needed. The output resolution is larger (output stride of 4) than in traditional object detectors (output stride of 16). A single network predicts the keypoints, the offset (to recover the discretization error caused by the output stride), and the size (regressing the width and height of bboxes). The network predicts a total of C + 4 outputs at each location. All outputs share a common fully-convolutional backbone network.

In contrast, CornerNet and ExtremeNet require a combinatorial grouping stage after keypoint detection, which significantly slows down each algorithm.
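A rough sketch of how the C + 4 outputs per location could be decoded into boxes without NMS. The function name, the nested-list tensor layout, the score threshold, and the assumption that sizes are expressed in output-map units are all illustrative choices, not details from the notes.

```python
def decode_centers(heatmap, offset, size, stride=4, threshold=0.3):
    """Decode boxes from CenterNet-style outputs.

    heatmap: [C][H][W] keypoint scores, offset: [2][H][W], size: [2][H][W].
    Returns a list of (class_id, score, x1, y1, x2, y2) in input-image pixels.
    """
    detections = []
    for class_id, hm in enumerate(heatmap):
        height, width = len(hm), len(hm[0])
        for y in range(height):
            for x in range(width):
                score = hm[y][x]
                if score < threshold:
                    continue
                # Keep only local maxima of the 3x3 neighbourhood;
                # this peak test plays the role NMS has in anchor-based detectors.
                neighbours = [
                    hm[j][i]
                    for j in range(max(0, y - 1), min(height, y + 2))
                    for i in range(max(0, x - 1), min(width, x + 2))
                ]
                if score < max(neighbours):
                    continue
                # The offset recovers the sub-pixel position lost to the stride.
                cx = (x + offset[0][y][x]) * stride
                cy = (y + offset[1][y][x]) * stride
                w = size[0][y][x] * stride
                h = size[1][y][x] * stride
                detections.append(
                    (class_id, score, cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
                )
    return detections
```

Each object yields a single peak, so there is no grouping step and no pairwise box comparison.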

CornerNet,ECCV18

A convolutional network outputs a heatmap for all top-left corners, a heatmap for all bottom-right corners, and an embedding vector for each detected corner. The network is trained to predict similar embeddings for corners that belong to the same object.
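The grouping step can be sketched as follows: every top-left corner is compared with every bottom-right corner, and a pair is kept when the geometry is valid and the embeddings are close. The scalar embedding, the distance threshold of 0.5, the score averaging, and the function name are assumptions made for this sketch.

```python
def group_corners(top_left, bottom_right, max_distance=0.5):
    """Pair corners into boxes by embedding similarity.

    Each corner is a tuple (x, y, score, embedding) with a scalar embedding.
    Returns a list of (x1, y1, x2, y2, score).
    """
    boxes = []
    for x1, y1, s1, e1 in top_left:
        for x2, y2, s2, e2 in bottom_right:
            # The bottom-right corner must lie below and to the right.
            if x2 <= x1 or y2 <= y1:
                continue
            # Corners of the same object are trained to have similar embeddings.
            if abs(e1 - e2) > max_distance:
                continue
            boxes.append((x1, y1, x2, y2, (s1 + s2) / 2))
    return boxes
```

The nested loop over all corner pairs is the combinatorial stage that single-keypoint detectors such as CenterNet avoid.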

ExtremeNet,CVPR19

Stitcher: Feedback-driven Data Provider for Object Detection,Arxiv2004

zhihu

similar to Mosaic tricks in YOLOv4

feedback-driven data provider is interesting

ECCV20,oral

AP-Loss for Accurate One-Stage Object Detection,PAMI

One-stage object detectors are trained by optimizing classification-loss and localization-loss simultaneously, with the former suffering much from extreme foreground-background class imbalance issue due to the large number of anchors. This paper alleviates this issue by proposing a novel framework to replace the classification task in one-stage detectors with a ranking task, and adopting the Average-Precision loss (AP-loss) for the ranking problem. Due to its non-differentiability and non-convexity, the AP-loss cannot be optimized directly. For this purpose, we develop a novel optimization algorithm…..
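To make the ranking view concrete, here is a small sketch that computes 1 - AP directly from anchor scores and binary labels. The function name and the strict tie handling are assumptions, and this is only the loss value, not the paper's optimization algorithm.

```python
def ap_loss(scores, labels):
    """Return 1 - AP for one set of anchor scores with binary labels (1 = foreground)."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    if not positives:
        return 0.0
    total_precision = 0.0
    for i in positives:
        # Rank among all anchors and among positives only.
        # Both use a hard comparison (a step function), which is why
        # the loss is not differentiable with respect to the scores.
        rank_all = 1 + sum(
            1 for j in range(len(scores)) if j != i and scores[j] > scores[i]
        )
        rank_pos = 1 + sum(
            1 for j in positives if j != i and scores[j] > scores[i]
        )
        total_precision += rank_pos / rank_all
    return 1.0 - total_precision / len(positives)
```

Because only the relative order of scores matters, a huge number of easy background anchors ranked below all positives contributes nothing to the loss, which is how the ranking formulation sidesteps the class imbalance.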

Different det heads

check related work part,

Faster-RCNN, \(1024\times 7\times 7\)
Light-head RCNN,Arxiv: generates thin feature maps (small channel number) with 490 (10 × 7 × 7) channels; kernel size=15, Cmid=64, Cout=490 (10 × 7 × 7), followed by conventional RoI warping; large kernel + separable convolution.
R-FCN: 3969 (81 × 7 × 7) channels. The score maps have size \(k^{2}(C+1)\times W\times H\); after RoI pooling we obtain \(k^{2}(C+1) \times 7 \times 7\), check Fig 2. Aside from the above \(k^{2}(C +1)\) convolutional layer for bbox classification, a sibling \(4k^{2}\) convolutional layer is appended for bounding box regression. The position-sensitive RoI pooling is performed on this bank of \(4k^{2}\) maps, producing a \(4k^{2}\) vector for each RoI. Then it is aggregated into a 4-d vector by average voting. Noticeably, there is no learnable layer after the RoI layer, enabling nearly cost-free region-wise computation and speeding up both training and inference. Similar ideas in segmentation are FCIS and InstanceFCN.
Double head,CVPR20: check Fig 1.
Cascade R-CNN,CVPR18: It consists of a sequence of detectors trained with increasing IoU thresholds, to be sequentially more selective against close false positives. See Fig 3.
IoUNet,ECCV18: Fig 2 shows that the classification score alone is not enough for detection and that a separate localization confidence is needed. The IoU estimator can be used as an early-stop condition to implement iterative refinement with adaptive steps.
Mask Scoring RCNN,CVPR19: similar intuition to IoUNet. In most instance segmentation pipelines, such as Mask R-CNN and MaskLab, the score of the instance mask is shared with the box-level classification confidence, which is predicted by a classifier applied on the proposal feature. It is inappropriate to use classification confidence to measure mask quality, since it only serves to distinguish the semantic categories of proposals and is not aware of the actual quality and completeness of the instance mask. The paper focuses on designing an extra head to predict the mask score.
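The channel counts quoted in the list above can be checked with a few lines of arithmetic, together with a sketch of R-FCN's average voting. The helper names are hypothetical, C=80 is inferred from the "81" in the notes, and the layout of the pooled \(4k^{2}\) vector (grouped per box coordinate) is an assumption.

```python
def rfcn_channels(num_classes, k):
    """Channels of the classification and regression score maps in R-FCN."""
    cls_channels = k * k * (num_classes + 1)  # k^2 (C + 1)
    reg_channels = 4 * k * k                  # 4 k^2
    return cls_channels, reg_channels


def light_head_channels(k, thin=10):
    """Channels of the thin feature map in Light-head RCNN (thin * k * k)."""
    return thin * k * k


def average_vote(pooled, k):
    """Aggregate a 4*k*k pooled vector into a 4-d box vector by averaging.

    Assumes the vector is laid out as k*k bins for each of the 4 coordinates.
    """
    bins = k * k
    return [sum(pooled[c * bins:(c + 1) * bins]) / bins for c in range(4)]


cls_channels, reg_channels = rfcn_channels(num_classes=80, k=7)  # 3969 and 196
thin_channels = light_head_channels(k=7)                         # 490
```

Voting is a plain average with no weights, which is why nothing learnable sits after the RoI layer.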

RoI Pooling

RoI Pooling
RoI Align
PrRoI Pooling

NMS

IoU-NMS from IoUNet
Soft-NMS
learning to NMS
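For reference, a minimal sketch of Soft-NMS with Gaussian re-weighting: instead of deleting boxes that overlap the current best box, their scores are decayed according to the overlap. The sigma and score threshold values, and the (x1, y1, x2, y2) box format, are illustrative assumptions.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (box, score) pairs in the order they were selected."""
    remaining = list(zip(boxes, scores))
    kept = []
    while remaining:
        # Pick the highest-scoring remaining box.
        best = max(range(len(remaining)), key=lambda i: remaining[i][1])
        box, score = remaining.pop(best)
        kept.append((box, score))
        # Decay the scores of the others according to their overlap with it.
        rescored = []
        for other, other_score in remaining:
            other_score *= math.exp(-(iou(box, other) ** 2) / sigma)
            if other_score >= score_threshold:
                rescored.append((other, other_score))
        remaining = rescored
    return kept
```

Hard NMS is the special case where any box above an IoU threshold has its score set to zero.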

Classification

ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness,ICLR19,oral

Spatially Attentive Output Layer for Image Classification,CVPR20

waiting for code.

NetVLAD: CNN architecture for weakly supervised place recognition,CVPR16

Adversarial Examples Improve Image Recognition,CVPR20

Proposes two sets of batch norm statistics, one for clean images and an auxiliary one for adversarial examples. The two batch norms properly disentangle the two distributions at normalization layers for accurate statistics estimation. The paper shows this distribution disentangling is crucial: it makes it possible to improve, rather than degrade, model performance with adversarial examples.
The first to show that adversarial examples can improve model performance in the fully-supervised setting on the large-scale ImageNet dataset.
a simple auxiliary BN design, check Fig 3.
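A toy sketch of the auxiliary BN idea on a single scalar feature: the same layer keeps separate statistics for a clean branch and an adversarial branch, so neither distribution pollutes the other's normalization. The class name, the branch keys, and the omission of affine parameters and running averages are simplifications assumed here.

```python
from statistics import fmean, pvariance


class DualBatchNorm1d:
    """Batch norm over one scalar feature with per-branch statistics."""

    def __init__(self, eps=1e-5):
        self.eps = eps
        # Last batch (mean, variance) seen by each branch.
        self.stats = {"clean": None, "adv": None}

    def __call__(self, batch, branch):
        batch_mean = fmean(batch)
        batch_var = pvariance(batch, batch_mean)
        self.stats[branch] = (batch_mean, batch_var)
        scale = (batch_var + self.eps) ** 0.5
        return [(x - batch_mean) / scale for x in batch]


bn = DualBatchNorm1d()
clean_out = bn([0.0, 2.0], branch="clean")   # normalised with clean statistics
adv_out = bn([10.0, 14.0], branch="adv")     # normalised with auxiliary statistics
```

With a single shared BN, the mixed batch would be normalised by one mean and variance that fits neither distribution well.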

Architecture

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,ICML19

HRNet,PAMI20

check Fig 2.

Res2Net,PAMI20

The Res2Net strategy exposes a new dimension, namely scale (the number of feature groups in the Res2Net block), as an essential factor in addition to existing dimensions of depth, width, and cardinality.

DHM,CVPR20

Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution,ICCV19

reddit discussion

Multigrid Neural Architectures,CVPR17

CAM

grad-CAM

Semantic Segmentation

Check the survey here.

CascadePSP: Toward Class-Agnostic and Very High-Resolution Segmentation via Global and Local Refinement,CVPR20

CFNet:Co-occurrent Features in Semantic Segmentation,CVPR19

similar to non-local block

PSANet,ECCV18

DANet

non-local on channel and spatial.

Context Prior,CVPR20: learns a \(WH \times WH\) affinity matrix. Using k=11 in the context aggregation is vital for Context Prior to function; without k=11, CP does not work.

Affinity matrix construction is similar to Adaptive Pyramid Context Network for Semantic Segmentation,CVPR19
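To make the \(WH \times WH\) shape concrete, here is a sketch of a binary affinity matrix built from a label map, where entry (i, j) is 1 when pixels i and j share a class. Using ground-truth labels as the target for the learned affinity is an assumption of this sketch, and the function name is hypothetical.

```python
def ideal_affinity(label_map):
    """Build a (W*H) x (W*H) same-class indicator matrix from an H x W label map."""
    flat = [label for row in label_map for label in row]
    return [[1 if a == b else 0 for b in flat] for a in flat]


# A 2 x 2 label map gives a 4 x 4 affinity matrix.
affinity = ideal_affinity([[0, 1], [1, 0]])
```

The matrix grows quadratically with the number of pixels, so in practice it would be computed on a downsampled feature map.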

Class-wise Dynamic Graph Convolution for Semantic Segmentation,ECCV20

Improving Semantic Segmentation via Decoupled Body and Edge Supervision,ECCV20

code