Tag Archive: Object Detection

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Although using convolutional neural networks (CNNs) as backbones achieves great success in computer vision, this work investigates a simple backbone network useful for many dense prediction tasks without convolutions. Unlike the recently-proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks. PVT has several merits compared to prior arts. (1) Different from ViT, which typically has low-resolution outputs and high computational and memory cost, PVT can not only be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computation of large feature maps. (2) PVT inherits the advantages of both CNN and Transformer, making it a unified backbone for various vision tasks without convolutions, simply by replacing CNN backbones. (3) We validate PVT by conducting extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection, semantic, and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinaNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT could serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.

https://arxiv.org/abs/2102.12122

Although models built on convolutional neural network (CNN) backbones have achieved great success in computer vision, this paper investigates a simple convolution-free backbone that can serve many dense prediction tasks. Unlike the recently proposed Transformer models (e.g., ViT), which are designed for image classification, the proposed Pyramid Vision Transformer (PVT) overcomes the difficulties of applying Transformers to dense prediction. Compared with existing models, PVT has the following merits: (1) unlike ViT, which produces low-resolution outputs and requires heavy computation, PVT can not only be trained on dense partitions of the image to produce high-resolution outputs, but also uses a progressive shrinking pyramid to reduce the computation on large feature maps; (2) PVT inherits the strengths of both CNNs and Transformers, making it possible to use a single convolution-free backbone across many vision tasks simply by swapping out the CNN backbone; (3) PVT is validated on downstream tasks such as object detection and semantic and instance segmentation, and the experiments show that it clearly boosts their performance.
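
The abstract describes the progressive shrinking pyramid only at a high level, so here is a minimal sketch of the idea: each stage re-embeds the feature map with a strided patch embedding before running Transformer blocks, so the token count shrinks stage by stage while the channel width grows. The PyramidStage module, strides, and channel widths below are illustrative assumptions, not PVT's actual configuration.

```python
import torch
import torch.nn as nn

class PyramidStage(nn.Module):
    """One illustrative stage: strided patch embedding + a tiny Transformer encoder."""
    def __init__(self, in_ch, out_ch, stride, num_heads=1, depth=1):
        super().__init__()
        # A strided conv acts as the patch embedding that shrinks spatial resolution.
        self.patch_embed = nn.Conv2d(in_ch, out_ch, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(d_model=out_ch, nhead=num_heads,
                                           dim_feedforward=out_ch * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.patch_embed(x)                # (B, C, H/s, W/s)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# Four stages produce a feature pyramid at 1/4, 1/8, 1/16, 1/32 of the input,
# which is what lets dense-prediction heads consume the backbone like a CNN.
stages = nn.ModuleList([
    PyramidStage(3, 64, stride=4),
    PyramidStage(64, 128, stride=2),
    PyramidStage(128, 256, stride=2),
    PyramidStage(256, 512, stride=2),
])

x = torch.randn(1, 3, 64, 64)
pyramid = []
for stage in stages:
    x = stage(x)
    pyramid.append(x)
print([f.shape for f in pyramid])
```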

Improving Object Detection in Art Images Using Only Style Transfer

Despite recent advances in object detection using deep learning neural networks, these neural networks still struggle to identify objects in art images such as paintings and drawings. This challenge is known as the cross depiction problem and it stems in part from the tendency of neural networks to prioritize identification of an object’s texture over its shape. In this paper we propose and evaluate a process for training neural networks to localize objects – specifically people – in art images. We generate a large dataset for training and validation by modifying the images in the COCO dataset using AdaIn style transfer. This dataset is used to fine-tune a Faster R-CNN object detection network, which is then tested on the existing People-Art testing dataset. The result is a significant improvement on the state of the art and a new way forward for creating datasets to train neural networks to process art images.

https://arxiv.org/abs/2102.06529

Although deep learning has made great strides in object detection, these networks still perform poorly on art images such as paintings and drawings. The problem stems largely from the tendency of neural networks to rely on an object's texture rather than its shape. This paper proposes and evaluates a pipeline for training detectors to localize people in art images: a large dataset is built by applying AdaIN style transfer to the COCO images, a Faster R-CNN detector is fine-tuned on it, and the result is evaluated on the People-Art testing dataset. The experiments show that this approach substantially improves detection performance on art images over the existing state of the art.
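
For readers unfamiliar with the style-transfer step named above, AdaIN re-normalizes content features so their per-channel statistics match those of style features; the snippet below shows only that operation (the VGG encoder/decoder and the COCO processing pipeline are assumed and omitted).

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5):
    """Adaptive Instance Normalization over (B, C, H, W) feature maps.

    Shifts the per-channel mean/std of the content features to those of the
    style features, which is the core operation used to stylize natural images.
    """
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean

content = torch.randn(2, 512, 32, 32)   # e.g. encoder features of a COCO photo
style = torch.randn(2, 512, 32, 32)     # features of a painting
stylized = adain(content, style)
print(stylized.shape)                   # torch.Size([2, 512, 32, 32])
```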

Self-Attention Based Context-Aware 3D Object Detection

Most existing point-cloud based 3D object detectors use convolution-like operators to process information in a local neighbourhood with fixed-weight kernels and aggregate global context hierarchically. However, recent work on non-local neural networks and self-attention for 2D vision has shown that explicitly modeling global context and long-range interactions between positions can lead to more robust and competitive models. In this paper, we explore two variants of self-attention for contextual modeling in 3D object detection by augmenting convolutional features with self-attention features. We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors and show consistent improvement over strong baseline models while simultaneously significantly reducing their parameter footprint and computational cost. We also propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations. This not only allows us to scale explicit global contextual modeling to larger point-clouds, but also leads to more discriminative and informative feature descriptors. Our method can be flexibly applied to most state-of-the-art detectors with increased accuracy and parameter and compute efficiency. We achieve new state-of-the-art detection performance on KITTI and nuScenes datasets.

https://arxiv.org/pdf/2101.02672.pdf

Most existing point-cloud-based 3D object detectors rely on convolution-like operators, processing local neighbourhoods with fixed-weight kernels and aggregating global context hierarchically. However, recent work on non-local neural networks and self-attention in 2D vision shows that explicitly modeling global context and long-range interactions between positions leads to more robust and competitive models. This paper studies two self-attention variants for contextual modeling in 3D object detection, both of which augment convolutional features with self-attention features. First, pairwise self-attention is incorporated into current state-of-the-art BEV, voxel-based, and point-based detectors, yielding consistent gains over strong baselines while greatly reducing parameter count and computational cost. Second, a self-attention variant is proposed that samples the most representative features by learning deformations over randomly sampled locations, which lets explicit global context modeling scale to larger point clouds and produces more discriminative, informative feature descriptors. The method can be applied to most state-of-the-art detectors and achieves new state-of-the-art results on the KITTI and nuScenes datasets.
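
The pairwise variant can be pictured as ordinary multi-head self-attention applied over a set of per-voxel or per-point convolutional features, with a residual connection so those features are only augmented; where exactly the block sits inside each detector is not specified here and is an assumption.

```python
import torch
import torch.nn as nn

class PairwiseSelfAttention(nn.Module):
    """Augments a set of convolutional features with global self-attention.

    Input: (B, N, C) features, e.g. one vector per voxel/pillar/keypoint.
    Output: same shape, refined by residual pairwise attention.
    """
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        refined, _ = self.attn(feats, feats, feats)   # pairwise interactions across all N positions
        return self.norm(feats + refined)             # residual: conv features are augmented, not replaced

feats = torch.randn(2, 1024, 64)    # 1024 voxel features, 64 channels each
module = PairwiseSelfAttention(64)
print(module(feats).shape)          # torch.Size([2, 1024, 64])
```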

ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks

Recently, the channel attention mechanism has been demonstrated to offer great potential in improving the performance of deep convolutional neural networks (CNNs). However, most existing methods are dedicated to developing more sophisticated attention modules for achieving better performance, which inevitably increases model complexity. To overcome the paradox of the performance and complexity trade-off, this paper proposes an Efficient Channel Attention (ECA) module, which only involves a handful of parameters while bringing a clear performance gain. By dissecting the channel attention module in SENet, we empirically show that avoiding dimensionality reduction is important for learning channel attention, and appropriate cross-channel interaction can preserve performance while significantly decreasing model complexity. Therefore, we propose a local cross-channel interaction strategy without dimensionality reduction, which can be efficiently implemented via 1D convolution. Furthermore, we develop a method to adaptively select the kernel size of the 1D convolution, determining the coverage of local cross-channel interaction. The proposed ECA module is efficient yet effective, e.g., the parameters and computations of our module against a ResNet50 backbone are 80 vs. 24.37M and 4.7e-4 GFLOPs vs. 3.86 GFLOPs, respectively, and the performance boost is more than 2% in terms of Top-1 accuracy. We extensively evaluate our ECA module on image classification, object detection and instance segmentation with backbones of ResNets and MobileNetV2. The experimental results show our module is more efficient while performing favorably against its counterparts.

https://arxiv.org/abs/1910.03151

Recently, channel attention mechanisms have been shown to substantially improve the performance of deep convolutional neural networks. However, most existing methods pursue better performance with ever more sophisticated attention modules, which inevitably increases model complexity. To balance performance against complexity, this paper proposes the Efficient Channel Attention (ECA) module, which achieves a clear performance gain with only a handful of parameters. By dissecting the channel attention module of SENet, the authors show empirically that avoiding dimensionality reduction is important for learning channel attention, and that appropriate cross-channel interaction can preserve performance while greatly reducing model complexity. They therefore propose a local cross-channel interaction strategy without dimensionality reduction, implemented efficiently with a 1D convolution, along with a method to adaptively select the 1D kernel size, which controls how far the cross-channel interaction reaches. The ECA module delivers a clear gain over a plain ResNet50 at negligible cost, and is evaluated on image classification, object detection, and instance segmentation with ResNet and MobileNetV2 backbones.
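
A minimal sketch of the ECA module, following the description above: global average pooling, a 1D convolution across channels (no dimensionality reduction), and a sigmoid gate. The kernel-size rule below (gamma=2, b=1) is the commonly used setting and is an assumption rather than a quote from the abstract.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: GAP -> 1D conv across channels -> sigmoid gate."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Adaptive kernel size derived from the channel count (assumed gamma/b setting).
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1              # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                 # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1))          # local cross-channel interaction, no dimensionality reduction
        w = self.sigmoid(y).squeeze(1)         # per-channel attention weights
        return x * w.unsqueeze(-1).unsqueeze(-1)

x = torch.randn(2, 256, 14, 14)
print(ECA(256)(x).shape)                       # torch.Size([2, 256, 14, 14])
```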

VinVL: Making Visual Representations Matter in Vision-Language Models

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, and utilize an improved approach to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to public.

https://arxiv.org/pdf/2101.00529.pdf

This paper presents a detailed study of improving visual representations for vision-language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared with the widely used bottom-up and top-down models, the proposed model is larger, better suited to VL tasks, and pre-trained on much larger corpora that combine multiple public annotated object detection datasets. It can therefore generate richer representations of visual objects and concepts. Whereas previous VL research focused mainly on the vision-language fusion model and left the object detection model untouched, this work shows that visual features matter greatly for VL models. In the experiments, the visual features produced by the new object detection model are fed into a Transformer-based VL fusion model, which is pre-trained with an improved approach and fine-tuned on a wide range of downstream VL tasks. The results show that the new visual features effectively improve performance on all VL tasks and set new state-of-the-art results on seven public benchmarks. The new object detection model will be released to the community.
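
A rough, hedged sketch of what "feeding detector features into a Transformer-based VL fusion model" can look like: region features are projected to the text embedding width and concatenated with token embeddings before a joint Transformer encoder. All module names and dimensions are illustrative assumptions, not VinVL's actual design.

```python
import torch
import torch.nn as nn

class SimpleVLFusion(nn.Module):
    """Joint Transformer over text tokens and detector region features (sketch only)."""
    def __init__(self, vocab_size=30522, hidden=256, region_dim=2048, num_layers=2):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, hidden)
        self.region_proj = nn.Linear(region_dim, hidden)   # map detector features to the text width
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids, region_feats):
        # token_ids: (B, T) word ids; region_feats: (B, R, region_dim) from the object detector
        seq = torch.cat([self.text_embed(token_ids), self.region_proj(region_feats)], dim=1)
        return self.encoder(seq)                           # fused features for downstream VL heads

tokens = torch.randint(0, 30522, (2, 16))
regions = torch.randn(2, 36, 2048)                         # e.g. 36 region features per image
print(SimpleVLFusion()(tokens, regions).shape)             # torch.Size([2, 52, 256])
```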

Toward Transformer-Based Object Detection

Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that as compared to convolutional networks, transformer-based architectures can achieve competitive results on benchmark classification tasks. However, the computational complexity of the attention operator means that we are limited to low-resolution inputs. For more complex tasks such as detection or segmentation, maintaining a high input resolution is crucial to ensure that models can properly identify and reflect fine details in their output. This naturally raises the question of whether or not transformer-based architectures such as the Vision Transformer are capable of performing tasks other than classification. In this paper, we determine that Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results. The model that we propose, ViT-FRCNN, demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance. We also investigate improvements over a standard detection backbone, including superior performance on out-of-domain images, better performance on large objects, and a lessened reliance on non-maximum suppression. We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.

https://arxiv.org/abs/2012.09958

Transformer models have come to dominate NLP: pre-trained on massive amounts of data, they can be transferred to smaller, more specialized tasks via fine-tuning. ViT was the first major attempt to feed images directly into a pure Transformer, and the results show it can match CNN performance on classification benchmarks. However, the computational complexity of attention limits the input resolution, which is a real drawback for tasks such as object detection and segmentation. This paper proposes ViT-FRCNN, a model that uses ViT as the backbone for a common detection head and achieves competitive performance on the COCO dataset. It also retains the familiar strengths of Transformers: large pre-training capacity and fast fine-tuning. Compared with a standard detection backbone, the ViT-based model brings several improvements, including better performance on out-of-domain images, better performance on large objects, and a reduced reliance on non-maximum suppression. ViT-FRCNN is an important step toward applying pure Transformers to general vision tasks such as object detection.
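
One way to picture the ViT-as-backbone idea is to reshape the output patch tokens back into a 2D feature map that a detection head can consume; the sketch below shows just that reshaping step, with the projection layer and grid size as illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class TokenToFeatureMap(nn.Module):
    """Turns ViT patch tokens back into a spatial map that a detection head can read."""
    def __init__(self, embed_dim=768, out_channels=256):
        super().__init__()
        self.proj = nn.Conv2d(embed_dim, out_channels, kernel_size=1)

    def forward(self, tokens, grid_hw):
        # tokens: (B, N, D) patch tokens (class token already dropped); grid_hw = (H, W) with H * W == N
        b, n, d = tokens.shape
        h, w = grid_hw
        fmap = tokens.transpose(1, 2).reshape(b, d, h, w)
        return self.proj(fmap)          # (B, out_channels, H, W) backbone feature for an R-CNN style head

# With a 224x224 input and 16x16 patches, a ViT yields a 14x14 grid of tokens.
tokens = torch.randn(1, 14 * 14, 768)
fmap = TokenToFeatureMap()(tokens, (14, 14))
print(fmap.shape)                       # torch.Size([1, 256, 14, 14])
```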

RandAugment: Practical automated data augmentation with a reduced search space

Recent work has shown that data augmentation has the potential to significantly improve the generalization of deep learning models. Recently, automated augmentation strategies have led to state-of-the-art results in image classification and object detection. While these strategies were optimized for improving validation accuracy, they also led to state-of-the-art results in semi-supervised learning and improved robustness to common corruptions of images. An obstacle to a large-scale adoption of these methods is a separate search phase which increases the training complexity and may substantially increase the computational cost. Additionally, due to the separate search phase, these approaches are unable to adjust the regularization strength based on model or dataset size. Automated augmentation policies are often found by training small models on small datasets and subsequently applied to train larger models. In this work, we remove both of these obstacles. RandAugment has a significantly reduced search space which allows it to be trained on the target task with no need for a separate proxy task. Furthermore, due to the parameterization, the regularization strength may be tailored to different model and dataset sizes. RandAugment can be used uniformly across different tasks and datasets and works out of the box, matching or surpassing all previous automated augmentation approaches on CIFAR-10/100, SVHN, and ImageNet. On the ImageNet dataset we achieve 85.0% accuracy, a 0.6% increase over the previous state-of-the-art and 1.0% increase over baseline augmentation. On object detection, RandAugment leads to 1.0-1.3% improvement over baseline augmentation, and is within 0.3% mAP of AutoAugment on COCO. Finally, due to its interpretable hyperparameter, RandAugment may be used to investigate the role of data augmentation with varying model and dataset size. Code is available online.

https://arxiv.org/abs/1909.13719

Recent research shows that data augmentation can greatly improve the generalization of deep learning models, and automated augmentation strategies have recently produced notable gains in image classification and object detection. Although these strategies are optimized for validation accuracy, they also improve semi-supervised learning and robustness to corrupted images. The obstacle to adopting them at scale is the separate search phase, which increases training complexity and computational cost; because of this separate phase, the regularization strength cannot be adapted to the model or dataset size, and policies found by training small models on small datasets are simply reused for larger models. This paper proposes RandAugment to address these problems: its greatly reduced search space allows it to be trained directly on the target task without a separate proxy task, and because of its parameterization the regularization strength can be tailored to different model and dataset sizes. RandAugment works uniformly across tasks and datasets, matching or surpassing previous automated augmentation methods on CIFAR-10/100, SVHN, ImageNet, and COCO.
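
The policy itself is small enough to sketch: sample N transforms uniformly from a fixed list and apply each at a shared magnitude M, so the whole search space reduces to the pair (N, M). The operation list and magnitude scaling below are illustrative, not the paper's exact operation set.

```python
import random
from PIL import Image, ImageEnhance, ImageOps

# A reduced, illustrative operation set; each op maps (image, magnitude in [0, 10]) to an image.
def _rotate(img, m):   return img.rotate(30 * m / 10)
def _contrast(img, m): return ImageEnhance.Contrast(img).enhance(1 + 0.9 * m / 10)
def _solarize(img, m): return ImageOps.solarize(img, 256 - int(256 * m / 10))
def _equalize(img, m): return ImageOps.equalize(img)

OPS = [_rotate, _contrast, _solarize, _equalize]

def rand_augment(img: Image.Image, n: int = 2, m: int = 9) -> Image.Image:
    """Apply N uniformly sampled ops, each at the shared magnitude M."""
    for op in random.choices(OPS, k=n):   # uniform sampling with replacement: no learned policy needed
        img = op(img, m)
    return img

augmented = rand_augment(Image.new("RGB", (224, 224), "gray"), n=2, m=9)
print(augmented.size)
```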

Adversarial Patch Camouflage against Aerial Detection


Detection of military assets on the ground can be performed by applying deep learning-based object detectors on drone surveillance footage. The traditional way of hiding military assets from sight is camouflage, for example by using camouflage nets. However, large assets like planes or vessels are difficult to conceal by means of traditional camouflage nets. An alternative type of camouflage is the direct misleading of automatic object detectors. Recently, it has been observed that small adversarial changes applied to images of the object can produce erroneous output by deep learning-based detectors. In particular, adversarial attacks have been successfully demonstrated to prohibit person detections in images, requiring a patch with a specific pattern held up in front of the person, thereby essentially camouflaging the person for the detector. Research into this type of patch attacks is still limited and several questions related to the optimal patch configuration remain open. This work makes two contributions. First, we apply patch-based adversarial attacks for the use case of unmanned aerial surveillance, where the patch is laid on top of large military assets, camouflaging them from automatic detectors running over the imagery. The patch can prevent automatic detection of the whole object while only covering a small part of it. Second, we perform several experiments with different patch configurations, varying their size, position, number and saliency. Our results show that adversarial patch attacks form a realistic alternative to traditional camouflage activities, and should therefore be considered in the automated analysis of aerial surveillance imagery.

http://arxiv.org/abs/2008.13671

Military assets on the ground can be detected by applying deep-learning-based object detectors to drone surveillance footage. The traditional countermeasure is camouflage, for example covering assets with camouflage nets, but large assets such as planes and vessels are hard to conceal that way. An alternative form of camouflage is to directly mislead the detectors: it has recently been shown that small adversarial changes can make deep-learning detectors produce wrong outputs, and in particular adversarial attacks have been demonstrated to suppress person detections simply by having the person hold up a patch with a specific pattern. Research on this kind of patch attack is still limited, and questions about the optimal patch configuration remain open. This paper makes two contributions. First, patch-based adversarial attacks are applied to unmanned aerial surveillance: an adversarial patch laid on top of a large military asset camouflages it from detectors running over the imagery, preventing detection of the whole object while covering only a small part of it. Second, experiments are run with different patch configurations, varying patch size, position, number, and saliency. The results show that adversarial patches are a realistic alternative to traditional camouflage and should be taken into account in the automated analysis of aerial surveillance imagery.
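
At its core, a patch attack like the one described above optimizes the patch pixels so that the detector's confidence on patched images drops. The sketch below uses a toy stand-in for the detector and a fixed patch location; it is a generic illustration of the optimization loop, not the configuration studied in the paper.

```python
import torch

def apply_patch(images, patch, top=50, left=50):
    """Paste an adversarial patch onto a fixed location of each image (sketch)."""
    patched = images.clone()
    ph, pw = patch.shape[-2:]
    patched[:, :, top:top + ph, left:left + pw] = patch.clamp(0, 1)
    return patched

def optimize_patch(detector_score_fn, images, patch_size=64, steps=200, lr=0.01):
    """Generic patch attack: minimize the detector's confidence on patched images.

    detector_score_fn is a stand-in for any differentiable function that maps an
    image batch to per-image detection confidences (not a specific library API).
    """
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        scores = detector_score_fn(apply_patch(images, patch))
        loss = scores.mean()            # lower confidence == better camouflage
        opt.zero_grad()
        loss.backward()
        opt.step()
    return patch.detach().clamp(0, 1)

# Toy stand-in detector: "confidence" is just the mean activation of a small conv net.
toy_detector = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3), torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten())
images = torch.rand(2, 3, 256, 256)
patch = optimize_patch(lambda x: toy_detector(x).mean(dim=1), images, steps=10)
print(patch.shape)                      # torch.Size([3, 64, 64])
```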