TransReID: Transformer-based Object Re-Identification

In this paper, we explore the Vision Transformer (ViT), a pure transformer-based model, for the object re-identification (ReID) task. With several adaptations, a strong baseline ViT-BoT is constructed with ViT as the backbone, which achieves results comparable to convolutional neural network (CNN)-based frameworks on several ReID benchmarks. Furthermore, two modules are designed in consideration of the specialties of ReID data: (1) It is natural and simple for the Transformer to encode non-visual information, such as camera or viewpoint, into vector embedding representations. By plugging in these embeddings, ViT gains the ability to eliminate the bias caused by diverse cameras or viewpoints. (2) We design a Jigsaw branch, parallel to the Global branch, to facilitate the training of the model in a two-branch learning framework. In the Jigsaw branch, a jigsaw patch module is designed to learn robust feature representations and to help the training of the Transformer by shuffling the patches. With these novel modules, we propose a pure-transformer framework dubbed TransReID, which, to the best of our knowledge, is the first work to use a pure Transformer for ReID research. Experimental results of TransReID are promising, achieving state-of-the-art performance on both person and vehicle ReID benchmarks.
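Point (1) of the abstract can be made concrete with a short sketch. Below is a minimal PyTorch illustration of how a non-visual ID (camera or viewpoint) might be encoded as a learnable embedding and added to the ViT token sequence; the module and parameter names (`SideInfoEmbedding`, `num_ids`, `sie_coef`) are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch: encode non-visual side information (camera/viewpoint IDs)
# as learnable embeddings added to the ViT token sequence. Names such as
# SideInfoEmbedding, num_ids, and sie_coef are illustrative assumptions.
import torch
import torch.nn as nn

class SideInfoEmbedding(nn.Module):
    def __init__(self, num_ids: int, embed_dim: int, sie_coef: float = 1.0):
        super().__init__()
        # One learnable vector per camera (or viewpoint) ID.
        self.embed = nn.Parameter(torch.zeros(num_ids, embed_dim))
        self.sie_coef = sie_coef  # weight balancing visual vs. non-visual cues

    def forward(self, tokens: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) token embeddings (incl. [cls]); ids: (B,) int64
        return tokens + self.sie_coef * self.embed[ids].unsqueeze(1)

# Usage: apply right after patch + position embeddings, before the encoder.
sie = SideInfoEmbedding(num_ids=8, embed_dim=768)
tokens = torch.randn(4, 197, 768)       # a batch of ViT token sequences
cam_ids = torch.tensor([0, 2, 5, 7])    # camera ID of each image
tokens = sie(tokens, cam_ids)
```

Because every image from a given camera receives the same added vector, the downstream ReID losses can learn representations in which camera-specific bias is factored out.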

https://arxiv.org/abs/2102.04378
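To make point (2) concrete as well, here is a hedged sketch of what a jigsaw patch module could look like: patch tokens are shifted and shuffled, split into groups, and each group (with the [cls] token prepended) passes through a shared transformer layer to yield a local feature. The group count, shift amount, and the use of a random permutation are illustrative assumptions; the paper's exact shuffle pattern may differ.

```python
# Hedged sketch of a jigsaw patch module: shuffle patch tokens, split them
# into groups, and extract one local feature per group. Hyperparameters
# (num_groups, shift) and the random permutation are assumptions.
import torch
import torch.nn as nn

class JigsawPatchModule(nn.Module):
    def __init__(self, embed_dim: int = 768, num_groups: int = 4, shift: int = 5):
        super().__init__()
        self.num_groups = num_groups
        self.shift = shift
        # One shared layer processes every shuffled group.
        self.layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True)

    def forward(self, x: torch.Tensor):
        # x: (B, 1 + N, D) -- the [cls] token followed by N patch tokens
        cls_tok, patches = x[:, :1], x[:, 1:]
        # Shift, then shuffle, so each group mixes patches from distant regions.
        patches = torch.roll(patches, shifts=self.shift, dims=1)
        perm = torch.randperm(patches.size(1), device=x.device)
        patches = patches[:, perm]
        local_feats = []
        for group in patches.chunk(self.num_groups, dim=1):
            out = self.layer(torch.cat([cls_tok, group], dim=1))
            local_feats.append(out[:, 0])  # per-group [cls] as local feature
        return local_feats

jpm = JigsawPatchModule()
local_feats = jpm(torch.randn(4, 197, 768))  # four (B, 768) local features
```

Training would then apply the ReID objectives to the global feature and to each local feature, so the shuffled groups push the model toward representations that are robust to perturbation and occlusion.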
