标签归档:Object Detection

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Although using convolutional neural networks (CNNs) as backbones achieves great successes in computer vision, this work investigates a simple backbone network useful for many dense prediction tasks without convolutions. Unlike the recently-proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose Pyramid Vision Transformer~(PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to prior arts. (1) Different from ViT that typically has low-resolution outputs and high computational and memory cost, PVT can be not only trained on dense partitions of the image to achieve high output resolution, which is important for dense predictions but also using a progressive shrinking pyramid to reduce computations of large feature maps. (2) PVT inherits the advantages from both CNN and Transformer, making it a unified backbone in various vision tasks without convolutions by simply replacing CNN backbones. (3) We validate PVT by conducting extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection, semantic, and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future researches.


虽然以卷积神经网络(CNNs)作为主干的模型在计算机视觉领域获得了巨大的成功,我们这篇文章会提出一个非卷积的简单网络,它将能够运用在许多预测任务上。不像最近提出的Transformer模型(例如ViT)是为了分类任务设计的,我们提出金字塔视觉Transformer (PVT). 我们的模型能够解决Transformer应用在密集预测任务时的种种困难。相比现有模型,PVT拥有以下优点:(1)不像现有ViT模型使用低分辨率输入且要求较大的计算量,PVT不仅仅能够在密集的图像区块上达到高分辨率输出,而且还运用渐进收缩金字塔去降低对于大尺寸特征图的计算量;(2)PVT从CNNs和Transformer那里继承了优点,这使得在许多视觉任务上统一简单将CNN主干进行替换无卷积的主干架构成为可能。(3)我们在例如目标检测、语义和实例分割任务等下游任务上对PVT模型进行了验证,实验结果说明我们的模型是SOTA的。

Improving Object Detection in Art Images Using Only Style Transfer

Despite recent advances in object detection using deep learning neural networks, these neural networks still struggle to identify objects in art images such as paintings and drawings. This challenge is known as the cross depiction problem and it stems in part from the tendency of neural networks to prioritize identification of an object’s texture over its shape. In this paper we propose and evaluate a process for training neural networks to localize objects – specifically people – in art images. We generate a large dataset for training and validation by modifying the images in the COCO dataset using AdaIn style transfer. This dataset is used to fine-tune a Faster R-CNN object detection network, which is then tested on the existing People-Art testing dataset. The result is a significant improvement on the state of the art and a new way forward for creating datasets to train neural networks to process art images.


虽然最近深度学习在目标检测领域有了长足发展,但是这些网络在艺术作品如画作等数据上的表现不佳。这个问题主要是因为神经网络倾向于通过目标的纹理而非形状进行推断。在本文中我们提出并且验证一种训练检测器的流程,这个流程训练的是对于艺术作品中的人物。我们使用AdaIn风格迁移将COCO数据集构建成一个庞大的数据集,然后在People-Art testing数据集上进行测试。结果显示我们的方法有效地提高了现有检测器在艺术作品上的检测表现。

Self-Attention Based Context-Aware 3D Object Detection

Most existing point-cloud based 3D object detectors use convolution-like operators to process information in a local neighbourhood with fixed-weight kernels and aggregate global context hierarchically. However, recent work on non-local neural networks and self-attention for 2D vision has shown that explicitly modeling global context and long-range interactions between positions can lead to more robust and competitive models. In this paper, we explore two variants of self-attention for contextual modeling in 3D object detection by augmenting convolutional features with self-attention features. We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors and show consistent improvement over strong baseline models while simultaneously significantly reducing their parameter footprint and computational cost. We also propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations. This not only allows us to scale explicit global contextual modeling to larger point-clouds, but also leads to more discriminative and informative feature descriptors. Our method can be flexibly applied to most state-of-the-art detectors with increased accuracy and parameter and compute efficiency. We achieve new state-of-the-art detection performance on KITTI and nuScenes datasets.



ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

Recently, channel attention mechanism has demonstrated to offer great potential in improving the performance of deep convolutional neural networks (CNNs). However, most existing methods dedicate to developing more sophisticated attention modules for achieving better performance, which inevitably increase model complexity. To overcome the paradox of performance and complexity trade-off, this paper proposes an Efficient Channel Attention (ECA) module, which only involves a handful of parameters while bringing clear performance gain. By dissecting the channel attention module in SENet, we empirically show avoiding dimensionality reduction is important for learning channel attention, and appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution. Furthermore, we develop a method to adaptively select kernel size of 1D convolution, determining coverage of local cross-channel interaction. The proposed ECA module is efficient yet effective, e.g., the parameters and computations of our modules against backbone of ResNet50 are 80 vs. 24.37M and 4.7e-4 GFLOPs vs. 3.86 GFLOPs, respectively, and the performance boost is more than 2% in terms of Top-1 accuracy. We extensively evaluate our ECA module on image classification, object detection and instance segmentation with backbones of ResNets and MobileNetV2. The experimental results show our module is more efficient while performing favorably against its counterparts.


近来,通道注意力机制已经被证明可以极大的提升深度卷积神经网络的性能。但是现有的方法为了获得更好的性能往往采用更为复杂的注意力模型因此反而增加了模型的复杂度。为了解决性能和计算复杂度的平衡,本文提出了高效通道注意力模块 (ECA) 使得使用较少的参数而可以获得明显的性能提升。通过研究SENet的通道注意力模型,我们从经验上推论减少维度对于学习通道注意力至关重要,适当的通道间交互可以在保持性能的同时极大地提高模型的复杂度。因此我们提出了一种通过一维卷积实现的本地跨通道交互策略而不需要减少维度。另外,我们还提出一种自适应选择一维卷积核尺寸的方法以衡量通道之间的交互的程度。ECA模块相较于基本的RseNet50有较大提升并可以用于多种如图像识别,目标检测以及实例分割等以ResNet和MobileNetV2为框架的任务中。

VinVL: Making Visual Representations Matter in Vision-Language Models

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, and utilize an improved approach to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to public.



Toward Transformer-Based Object Detection

Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that as compared to convolutional networks, transformer-based architectures can achieve competitive results on benchmark classification tasks. However, the computational complexity of the attention operator means that we are limited to low-resolution inputs. For more complex tasks such as detection or segmentation, maintaining a high input resolution is crucial to ensure that models can properly identify and reflect fine details in their output. This naturally raises the question of whether or not transformer-based architectures such as the Vision Transformer are capable of performing tasks other than classification. In this paper, we determine that Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results. The model that we propose, ViT-FRCNN, demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance. We also investigate improvements over a standard detection backbone, including superior performance on out-of-domain images, better performance on large objects, and a lessened reliance on non-maximum suppression. We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.


Transformers类的模型已经成为主导NLP领域的模型,借助在大量的数据上进行预训练,它可以通过fine-tuning迁移到规模更小更加细分的领域。ViT是第一个将图像直接输入Transformer中,结果显示它可以达到与CNN比肩的性能。但是,计算复杂度限制了输入图像的分辨率,在例如目标检测或者分割领域有不可忽视的缺点。在本文中,我们提出了一种以ViT为主干的通用目标检测模型ViT-FRCNN,它可在COCO数据集以达到有竞争力的性能。它还保留了传统Transformer的优点:大规模预训练潜力,快速fine-Tuning性能。我们还研究了相较于标准的检测模型架构,基于ViT的模型的提升之处,包括:在域外图像上更好的性能,在大尺寸目标上更好的性能以及对于非最大抑制更少的以来。我们认为ViT-FRCNN是 将Transformer应用到一般机器视觉任务的重要一步。

RandAugment: Practical automated data augmentation with a reduced search space

Recent work has shown that data augmentation has the potential to significantly improve the generalization of deep learning models. Recently, automated augmentation strategies have led to state-of-the-art results in image classification and object detection. While these strategies were optimized for improving validation accuracy, they also led to state-of-the-art results in semi-supervised learning and improved robustness to common corruptions of images. An obstacle to a large-scale adoption of these methods is a separate search phase which increases the training complexity and may substantially increase the computational cost. Additionally, due to the separate search phase, these approaches are unable to adjust the regularization strength based on model or dataset size. Automated augmentation policies are often found by training small models on small datasets and subsequently applied to train larger models. In this work, we remove both of these obstacles. RandAugment has a significantly reduced search space which allows it to be trained on the target task with no need for a separate proxy task. Furthermore, due to the parameterization, the regularization strength may be tailored to different model and dataset sizes. RandAugment can be used uniformly across different tasks and datasets and works out of the box, matching or surpassing all previous automated augmentation approaches on CIFAR-10/100, SVHN, and ImageNet. On the ImageNet dataset we achieve 85.0% accuracy, a 0.6% increase over the previous state-of-the-art and 1.0% increase over baseline augmentation. On object detection, RandAugment leads to 1.0-1.3% improvement over baseline augmentation, and is within 0.3% mAP of AutoAugment on COCO. Finally, due to its interpretable hyperparameter, RandAugment may be used to investigate the role of data augmentation with varying model and dataset size. Code is available online.



Adversarial Patch Camouflage against Aerial Detection

Media Tweets by Adam Harvey (@adamhrv) | Twitter

Detection of military assets on the ground can be performed by applying deep learning-based object detectors on drone surveillance footage. The traditional way of hiding military assets from sight is camouflage, for example by using camouflage nets. However, large assets like planes or vessels are difficult to conceal by means of traditional camouflage nets. An alternative type of camouflage is the direct misleading of automatic object detectors. Recently, it has been observed that small adversarial changes applied to images of the object can produce erroneous output by deep learning-based detectors. In particular, adversarial attacks have been successfully demonstrated to prohibit person detections in images, requiring a patch with a specific pattern held up in front of the person, thereby essentially camouflaging the person for the detector. Research into this type of patch attacks is still limited and several questions related to the optimal patch configuration remain open. This work makes two contributions. First, we apply patch-based adversarial attacks for the use case of unmanned aerial surveillance, where the patch is laid on top of large military assets, camouflaging them from automatic detectors running over the imagery. The patch can prevent automatic detection of the whole object while only covering a small part of it. Second, we perform several experiments with different patch configurations, varying their size, position, number and saliency. Our results show that adversarial patch attacks form a realistic alternative to traditional camouflage activities, and should therefore be considered in the automated analysis of aerial surveillance imagery.