Although convolutional neural networks (CNNs) have achieved great success as backbones in computer vision, this work investigates a simple, convolution-free backbone network that is useful for many dense prediction tasks. Unlike the recently proposed Transformer model (e.g., ViT), which is specially designed for image classification, we propose the Pyramid Vision Transformer~(PVT), which overcomes the difficulties of porting Transformers to various dense prediction tasks. PVT has several merits compared to prior art. (1) Unlike ViT, which typically has low-resolution outputs and high computational and memory cost, PVT can not only be trained on dense partitions of the image to achieve high output resolution, which is important for dense prediction, but also uses a progressive shrinking pyramid to reduce the computation of large feature maps. (2) PVT inherits the advantages of both CNNs and Transformers, making it a unified, convolution-free backbone for various vision tasks that can simply replace CNN backbones. (3) We validate PVT through extensive experiments, showing that it boosts the performance of many downstream tasks, e.g., object detection, semantic segmentation, and instance segmentation. For example, with a comparable number of parameters, RetinaNet+PVT achieves 40.4 AP on the COCO dataset, surpassing RetinaNet+ResNet50 (36.3 AP) by 4.1 absolute AP. We hope PVT can serve as an alternative and useful backbone for pixel-level predictions and facilitate future research.
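Although models built on convolutional neural network (CNN) backbones have achieved great success in computer vision, this paper proposes a simple convolution-free network that can be applied to many prediction tasks. Unlike the recently proposed Transformer models (e.g., ViT), which are designed for classification, we propose the Pyramid Vision Transformer (PVT). Our model resolves the difficulties that arise when applying Transformers to dense prediction tasks. Compared with existing models, PVT has the following advantages: (1) unlike ViT, which produces low-resolution outputs and requires a large amount of computation, PVT can not only reach high output resolution by working on dense image patches but also uses a progressive shrinking pyramid to reduce the computation on large feature maps; (2) PVT inherits the advantages of both CNNs and Transformers, which makes it possible to use one unified, convolution-free backbone across many vision tasks by simply replacing the CNN backbone; (3) we validate PVT on downstream tasks such as object detection, semantic segmentation, and instance segmentation, and the experimental results show that our model achieves state-of-the-art performance.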
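To make the "progressive shrinking pyramid" idea concrete, the following minimal sketch counts the tokens a pyramid backbone would process at each stage and compares them with a single-scale design. The abstract does not state the stage strides, so the values 4, 8, 16, and 32, the 16-pixel single-scale patch size, and the function names `pyramid_shapes` and `single_scale_tokens` are all assumptions chosen for illustration, not the authors' implementation.

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32)):
    """Return (stride, H_i, W_i, num_tokens) for each pyramid stage.

    The strides are assumed values that mirror a typical CNN feature
    pyramid; the abstract itself does not specify them.
    """
    shapes = []
    for stride in strides:
        h, w = height // stride, width // stride
        shapes.append((stride, h, w, h * w))
    return shapes


def single_scale_tokens(height, width, patch=16):
    """Token count of a single-scale, ViT-style model (assumed patch size)."""
    return (height // patch) * (width // patch)


if __name__ == "__main__":
    for stride, h, w, tokens in pyramid_shapes(224, 224):
        print(f"stride {stride:>2}: {h}x{w} feature map, {tokens} tokens")
    print("single scale:", single_scale_tokens(224, 224), "tokens")
```

Under these assumptions, the first stage keeps a fine 56x56 map for dense prediction, while each later stage has a quarter of the tokens of the one before it, which is how the pyramid limits the cost of large feature maps.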