Classification, Segmentation, Detection
Detection
For the main progress, check the Survey 2019.
Mainstream progress
TIDE: A General Toolbox for Identifying Object Detection Errors,ECCV20,spotlight
End-to-End Object Detection with Transformers,Arxiv2005
EfficientDet: Scalable and Efficient Object Detection,CVPR20
Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training,Arxiv2004
Dynamic R-CNN adjusts the label assignment criterion (the IoU threshold) and the shape of the regression loss function (the parameters of the Smooth L1 loss) automatically, based on the statistics of proposals during training.
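As a minimal, hedged sketch of that idea (not the released implementation): track a running statistic of proposal IoUs and raise the assignment threshold as proposals improve. The function name, momentum, top-k, and lower bound below are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def dynamic_iou_threshold(proposal_ious, prev_thr, momentum=0.9,
                          top_k=75, lower_bound=0.4):
    """Illustrative update of the IoU threshold used for label assignment.

    Dynamic R-CNN raises the positive/negative IoU threshold as proposal
    quality improves during training; here "quality" is approximated by the
    mean of the top-k proposal IoUs in the current batch, smoothed with a
    running average. All hyper-parameters here are illustrative.
    """
    batch_stat = np.sort(proposal_ious)[::-1][:top_k].mean()
    new_thr = momentum * prev_thr + (1.0 - momentum) * batch_stat
    return max(new_thr, lower_bound)

# usage: start from the static Faster R-CNN threshold and let it drift upward
thr = 0.5
for step in range(3):
    ious = np.random.beta(2 + step, 2, size=512)  # fake, gradually improving proposals
    thr = dynamic_iou_threshold(ious, thr)
    print(f"step {step}: IoU threshold = {thr:.3f}")
```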
YOLOv4: creating a CNN that operates in real time on a conventional GPU, and for which training requires only one conventional GPU.
Its practical tricks for the detection task need a careful check:
- Mosaic (see the sketch after this list)
- Self-adversarial training
- Cross mini-batch normalization
- Pointwise SAM
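A minimal image-only sketch of Mosaic, referenced from the list above: four images are stitched around a random center into one training canvas. Bounding-box remapping and clipping, which real detection pipelines need, are omitted; the function name, output size, and 114 padding value are illustrative.

```python
import random
import numpy as np

def mosaic(images, out_size=640):
    """Stitch four HxWx3 uint8 images around a random center point.

    Simplified: each image is naively cropped to its quadrant; real
    implementations resize/jitter the images and remap their boxes.
    """
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    # target regions: top-left, top-right, bottom-left, bottom-right
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h, w = y2 - y1, x2 - x1
        crop = img[:h, :w]
        canvas[y1:y1 + crop.shape[0], x1:x1 + crop.shape[1]] = crop
    return canvas
```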
Anchor-free
FCOS: A Simple and Strong Anchor-free Object Detector
Objects as Points (CenterNet),Arxiv1904
No anchors any more; there is only one positive “anchor” per object, hence no need for Non-Maximum Suppression (NMS). It uses a larger output resolution (output stride of 4) compared to traditional object detectors (output stride of 16). A single network predicts the keypoint heatmap, the offset (to recover the discretization error caused by the output stride), and the size (regressing the width and height of bboxes). The network predicts a total of C + 4 outputs at each location, and all outputs share a common fully-convolutional backbone network.
In contrast, CornerNet and ExtremeNet require a combinatorial grouping stage after keypoint detection, which significantly slows down each algorithm.
CornerNet: a convolutional network outputs a heatmap for all top-left corners, a heatmap for all bottom-right corners, and an embedding vector for each detected corner. The network is trained to predict similar embeddings for corners that belong to the same object.
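A minimal sketch of the center-point decoding described above for Objects as Points: local maxima of the keypoint heatmap replace NMS, and boxes are composed from the offset and size maps. The max-pooling peak trick is standard, but the stride handling and top-k value here are illustrative assumptions, not the reference code.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, offset, size, stride=4, k=100):
    """heatmap: (1, C, H, W) after sigmoid; offset and size: (1, 2, H, W)."""
    # keep only local maxima of the heatmap, which replaces NMS
    peaks = (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)).float()
    heat = heatmap * peaks

    B, C, H, W = heat.shape
    scores, idx = heat.view(B, -1).topk(k)                     # over C*H*W
    cls = torch.div(idx, H * W, rounding_mode="floor")
    spatial = idx % (H * W)
    ys = torch.div(spatial, W, rounding_mode="floor").float()
    xs = (spatial % W).float()

    off = offset.view(B, 2, -1)[0, :, spatial[0]]              # (2, k)
    wh = size.view(B, 2, -1)[0, :, spatial[0]]                 # (2, k)
    cx, cy = (xs[0] + off[0]) * stride, (ys[0] + off[1]) * stride
    w, h = wh[0] * stride, wh[1] * stride                      # assumed size scaling
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
    return boxes, scores[0], cls[0]

hm = torch.rand(1, 80, 128, 128)        # C = 80 classes; the head outputs C + 4 maps in total
boxes, scores, classes = decode_centers(hm, torch.rand(1, 2, 128, 128),
                                         torch.rand(1, 2, 128, 128))
print(boxes.shape, scores.shape, classes.shape)
```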
Stitcher: Feedback-driven Data Provider for Object Detection,Arxiv2004
Similar to the Mosaic trick in YOLOv4.
The feedback-driven data provider is interesting.
AP-Loss for Accurate One-Stage Object Detection,PAMI
One-stage object detectors are trained by optimizing the classification loss and the localization loss simultaneously, with the former suffering much from the extreme foreground-background class imbalance caused by the large number of anchors. This paper alleviates the issue by proposing a novel framework that replaces the classification task in one-stage detectors with a ranking task, adopting the Average Precision loss (AP-loss) for the ranking problem. Due to its non-differentiability and non-convexity, the AP-loss cannot be optimized directly; for this purpose, the authors develop a novel optimization algorithm…
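To see why the AP-loss resists direct optimization, here is a plain AP computation over anchor scores and labels; the ranking step is piecewise constant, so its gradient with respect to the scores is zero almost everywhere. This only illustrates the metric, not the paper's optimization algorithm.

```python
import numpy as np

def average_precision(scores, labels):
    """scores: (N,) anchor confidences; labels: (N,) 1 for foreground, 0 otherwise."""
    order = np.argsort(-scores)              # ranking step: non-differentiable
    labels = labels[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    # average precision = mean precision at the positions of the positives
    return float((precision * labels).sum() / max(labels.sum(), 1))

print(average_precision(np.array([0.9, 0.8, 0.3, 0.1]),
                        np.array([1, 0, 1, 0])))   # 0.833...
```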
Different det heads
Check the related work part of the papers below:
- Faster-RCNN, \(1024\times 7\times 7\)
- Light-Head R-CNN,Arxiv: generates feature maps with a small channel number (thin feature maps) of 490 (\(10\times 7\times 7\)) using a large separable convolution (kernel size = 15, \(C_{mid}=64\), \(C_{out}=490\)), followed by conventional RoI warping.
- R-FCN: 3969 (\(81\times 7\times 7\)) channels, i.e. position-sensitive score maps of shape \(k^{2}(C+1)\times W\times H\); position-sensitive RoI pooling then yields a \((C+1)\times 7\times 7\) output per RoI (with \(k=7\)), check Fig 2; see also the PS-RoI pooling sketch after this list. Aside from the above \(k^{2}(C+1)\)-channel convolutional layer for bbox classification, a sibling \(4k^{2}\)-channel convolutional layer is appended for bounding box regression. Position-sensitive RoI pooling is performed on this bank of \(4k^{2}\) maps, producing a \(4k^{2}\)-d vector for each RoI, which is then aggregated into a 4-d vector by average voting. Notably, there is no learnable layer after the RoI layer, enabling nearly cost-free region-wise computation and speeding up both training and inference. Similar ideas in segmentation: FCIS, InstanceFCN.
- Double head,CVPR20: check Fig 1.
- Cascade R-CNN,CVPR18: consists of a sequence of detectors trained with increasing IoU thresholds, so that they are sequentially more selective against close false positives; see Fig 3.
- IoU-Net,ECCV18: Fig 2 shows that the classification score alone is not enough for detection and that a separate localization confidence (the predicted IoU) is needed. The IoU estimator can also be used as an early-stop condition to implement iterative refinement with adaptive steps.
- Mask Scoring R-CNN,CVPR19: similar intuition to IoU-Net. In most instance segmentation pipelines, such as Mask R-CNN and MaskLab, the score of the instance mask is shared with the box-level classification confidence, which is predicted by a classifier applied on the proposal feature. It is inappropriate to use the classification confidence to measure mask quality, since it only serves to distinguish the semantic categories of proposals and is not aware of the actual quality and completeness of the instance mask. The paper focuses on designing an extra head to predict the mask score.
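The PS-RoI pooling sketch referenced in the R-FCN item above, using torchvision's ps_roi_pool just to check the head shapes; the score maps and the RoI are random, and the 1/16 spatial scale is an assumption.

```python
import torch
from torchvision.ops import ps_roi_pool

C, k = 80 + 1, 7                                  # 80 classes + background, 7x7 bins
score_maps = torch.randn(1, k * k * C, 50, 50)    # k^2(C+1) x H x W, as in the note above
rois = torch.tensor([[0, 10.0, 10.0, 200.0, 160.0]])  # (batch_idx, x1, y1, x2, y2)

# position-sensitive pooling: each of the k*k bins reads its own group of C+1 channels
pooled = ps_roi_pool(score_maps, rois, output_size=k, spatial_scale=1.0 / 16)
print(pooled.shape)                               # torch.Size([1, 81, 7, 7])

# per-class score by average voting over the k*k bins; no learnable layer after the RoI layer
cls_scores = pooled.mean(dim=(2, 3))              # (num_rois, C+1)
print(cls_scores.shape)
```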
RoI Pooling
- RoI Pooling
- RoI Align
- PrRoI Pooling
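torchvision ships implementations of the first two variants (PrRoI Pooling is a separate third-party op and is omitted here); a quick interface and shape check with random features and an assumed 1/16 spatial scale:

```python
import torch
from torchvision.ops import roi_pool, roi_align

feat = torch.randn(1, 256, 32, 32)                    # backbone feature map
rois = torch.tensor([[0, 4.3, 7.9, 120.6, 90.2]])     # (batch_idx, x1, y1, x2, y2) in image coords

# quantized pooling (RoI Pooling) vs. bilinear sampling without quantization (RoI Align)
p = roi_pool(feat, rois, output_size=7, spatial_scale=1.0 / 16)
a = roi_align(feat, rois, output_size=7, spatial_scale=1.0 / 16,
              sampling_ratio=2, aligned=True)
print(p.shape, a.shape)    # both torch.Size([1, 256, 7, 7])
```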
NMS
- IoU-NMS from IoUNet
- Soft-NMS
- learning to NMS
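A minimal sketch of Soft-NMS with Gaussian score decay, for reference against the list above; the box format (x1, y1, x2, y2) and the sigma/threshold values are illustrative.

```python
import numpy as np

def iou(box, boxes):
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.001):
    """boxes: (N, 4) as x1,y1,x2,y2; scores: (N,). Returns kept boxes and decayed scores."""
    boxes, scores = boxes.copy(), scores.copy()
    keep_boxes, keep_scores = [], []
    while len(boxes) > 0 and scores.max() > score_thr:
        i = scores.argmax()
        keep_boxes.append(boxes[i]); keep_scores.append(scores[i])
        boxes = np.delete(boxes, i, axis=0); scores = np.delete(scores, i)
        if len(boxes):
            # Gaussian decay: the higher the overlap with the kept box, the lower the score
            scores = scores * np.exp(-iou(keep_boxes[-1], boxes) ** 2 / sigma)
    return np.array(keep_boxes), np.array(keep_scores)
```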
Classification
Spatially Attentive Output Layer for Image Classification,CVPR20
Waiting for the code release.
NetVLAD: CNN architecture for weakly supervised place recognition,CVPR16
Adversarial Examples Improve Image Recognition,CVPR20
- proposes two sets of batch norm statistics, one for clean images and one auxiliary set for adversarial examples; the two batch norms properly disentangle the two distributions at the normalization layers for accurate statistics estimation. This distribution disentangling is crucial, enabling adversarial examples to improve, rather than degrade, model performance.
- the first to show adversarial examples can improve model performance in the fully-supervised setting on the large-scale ImageNet dataset.
- a simple auxiliary BN design, check Fig 3.
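A minimal sketch of the auxiliary-BN design from the bullets above: two BatchNorm branches with shared conv weights elsewhere in the network, and the caller routes clean vs. adversarial batches to the matching branch. An illustration of the idea in Fig 3, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    """BatchNorm with separate statistics for clean and adversarial batches."""

    def __init__(self, num_features):
        super().__init__()
        self.bn_clean = nn.BatchNorm2d(num_features)
        self.bn_adv = nn.BatchNorm2d(num_features)   # auxiliary BN

    def forward(self, x, adversarial: bool = False):
        return self.bn_adv(x) if adversarial else self.bn_clean(x)

# usage: same conv weights, different normalization statistics per distribution
conv = nn.Conv2d(3, 16, 3, padding=1)
dbn = DualBatchNorm2d(16)
clean, adv = torch.randn(4, 3, 32, 32), torch.randn(4, 3, 32, 32)
out_clean = dbn(conv(clean), adversarial=False)
out_adv = dbn(conv(adv), adversarial=True)
```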
Architecture
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,ICML19
check Fig 2.
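The compound scaling behind Fig 2 in one line of math: \(d=\alpha^{\phi}\), \(w=\beta^{\phi}\), \(r=\gamma^{\phi}\) with \(\alpha\beta^{2}\gamma^{2}\approx 2\), so FLOPs grow roughly by \(2^{\phi}\). A tiny sketch that just evaluates the multipliers with the paper's constants (\(\alpha=1.2\), \(\beta=1.1\), \(\gamma=1.15\)):

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth / width / resolution bases from the paper

def compound_scaling(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(4):
    d, w, r = compound_scaling(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```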
Res2Net: A New Multi-scale Backbone Architecture,PAMI
The Res2Net strategy exposes a new dimension, namely scale (the number of feature groups in the Res2Net block), as an essential factor in addition to the existing dimensions of depth, width, and cardinality.
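A minimal sketch of the hierarchical split that creates the scale dimension: after a 1x1 conv the channels are chunked into s groups, and each group's 3x3 conv also receives the previous group's output. Simplified (no BN/ReLU, SE block, or stride handling), not the reference implementation.

```python
import torch
import torch.nn as nn

class Res2NetBlock(nn.Module):
    def __init__(self, channels, scale=4):
        super().__init__()
        assert channels % scale == 0
        self.scale, width = scale, channels // scale
        self.conv_in = nn.Conv2d(channels, channels, 1, bias=False)
        # one 3x3 conv per group except the first, which is passed through untouched
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=1, bias=False) for _ in range(scale - 1))
        self.conv_out = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        xs = torch.chunk(self.conv_in(x), self.scale, dim=1)
        ys = [xs[0]]
        for i, conv in enumerate(self.convs):
            inp = xs[i + 1] if i == 0 else xs[i + 1] + ys[-1]  # hierarchical connection
            ys.append(conv(inp))
        return x + self.conv_out(torch.cat(ys, dim=1))          # residual connection

block = Res2NetBlock(64, scale=4)
print(block(torch.randn(2, 64, 32, 32)).shape)
```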
Multigrid Neural Architectures,CVPR17
Visualization-related
CAM
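A minimal CAM sketch for a global-average-pooling classifier: the class activation map is the FC-weight-weighted sum of the last conv feature maps. It assumes the torchvision ResNet layout; the model choice and target class index are illustrative.

```python
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()

def cam(images, target_class):
    """images: (B, 3, H, W). Returns (B, h, w) class activation maps."""
    # run the backbone up to the last conv stage (before global average pooling)
    x = images
    for name, module in model.named_children():
        if name == "avgpool":
            break
        x = module(x)                                  # -> (B, 512, h, w)
    w = model.fc.weight[target_class]                  # (512,)
    # weighted sum over channels (often clamped at zero for visualization)
    return torch.einsum("bchw,c->bhw", x, w)

heat = cam(torch.randn(1, 3, 224, 224), target_class=281)
print(heat.shape)   # torch.Size([1, 7, 7])
```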
Semantic Segmentation
Check the survey here.
CFNet:Co-occurrent Features in Semantic Segmentation,CVPR19
Similar to the non-local block, applying non-local attention over both the channel and spatial dimensions.
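Since the comparison above is against the non-local block, here is a minimal spatial (embedded-Gaussian) non-local block as the reference point; CFNet's actual co-occurrent feature module builds and uses the affinity differently, so this is only the baseline being compared to.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block over spatial positions."""

    def __init__(self, channels, reduction=2):
        super().__init__()
        inner = channels // reduction
        self.theta = nn.Conv2d(channels, inner, 1)
        self.phi = nn.Conv2d(channels, inner, 1)
        self.g = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)      # (B, HW, C')
        k = self.phi(x).flatten(2)                        # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)          # (B, HW, C')
        attn = torch.softmax(q @ k, dim=-1)               # (B, HW, HW) pairwise affinity
        y = (attn @ v).transpose(1, 2).reshape(B, -1, H, W)
        return x + self.out(y)                            # residual connection

print(NonLocalBlock(64)(torch.randn(1, 64, 16, 16)).shape)
```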
Context Prior,CVPR20: learns a \(WH \times WH\) affinity matrix; the k=11 convolution in the context aggregation module is vital for the functionality of the Context Prior, and without it CP cannot work.
Affinity matrix construction is similar to Adaptive Pyramid Context Network for Semantic Segmentation,CVPR19
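To make the \(WH \times WH\) affinity concrete: the learned prior in Context Prior is supervised by an ideal affinity matrix built from downsampled ground-truth labels, where entry (i, j) is 1 iff pixels i and j share a class. A simplified construction (ignore-label handling is reduced to a clamp):

```python
import torch
import torch.nn.functional as F

def ideal_affinity(gt, num_classes, size):
    """gt: (B, H, W) integer label map. Returns a (B, h*w, h*w) binary affinity matrix."""
    gt_small = F.interpolate(gt[:, None].float(), size=size, mode="nearest").long()
    onehot = F.one_hot(gt_small.squeeze(1).clamp(0, num_classes - 1),
                       num_classes).float()              # (B, h, w, C)
    flat = onehot.flatten(1, 2)                          # (B, h*w, C)
    return flat @ flat.transpose(1, 2)                   # 1 where two pixels share a class

A = ideal_affinity(torch.randint(0, 19, (2, 64, 64)), num_classes=19, size=(16, 16))
print(A.shape)   # torch.Size([2, 256, 256])
```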
Class-wise Dynamic Graph Convolution for Semantic Segmentation,ECCV20
Improving Semantic Segmentation via Decoupled Body and Edge Supervision,ECCV20
Instance Segmentation
Deep Snake for Real-Time Instance Segmentation,CVPR20,oral
Human-object interaction
Spatial Priming for Detecting Human-Object Interactions,Arxiv2004