Can Vision Transformers Learn without Natural Images?

Can we complete pre-training of Vision Transformers (ViT) without natural images and human-annotated labels? Although pre-training a ViT seems to rely heavily on large-scale datasets and human-annotated labels, recent large-scale datasets raise several problems concerning privacy violations, inadequate fairness protection, and labor-intensive annotation. In the present paper, we pre-train ViT without any collected images or annotation labor. We experimentally verify that our proposed framework partially outperforms sophisticated self-supervised learning (SSL) methods such as SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. Moreover, although a ViT pre-trained without natural images produces visualizations that differ somewhat from those of an ImageNet pre-trained ViT, it can still interpret natural image datasets to a large extent. For example, the accuracies on the CIFAR-10 dataset are: our proposal 97.6 vs. SimCLRv2 97.4 vs. ImageNet pre-training 98.0.

https://arxiv.org/abs/2103.13023
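
The abstract does not spell out the framework, but the paper pre-trains ViT on FractalDB-style synthetic data: fractal images rendered from randomly sampled iterated function systems (IFS), with labels defined by the generating formula rather than by humans. Below is a minimal sketch of that kind of formula-driven image synthesis; the parameter ranges, contraction rescaling, and function names are illustrative assumptions, not the authors' released code.

```python
import numpy as np
from PIL import Image

def random_ifs(n_maps=4, rng=None):
    """Sample n_maps random contractive affine maps (A, b): x -> A @ x + b."""
    rng = rng or np.random.default_rng()
    maps = []
    for _ in range(n_maps):
        A = rng.uniform(-1.0, 1.0, size=(2, 2))
        # Rescale so each map is a contraction; this keeps the attractor
        # bounded (a stability assumption, not the paper's exact sampling rule).
        A *= 0.8 / max(np.linalg.norm(A, 2), 1e-9)
        b = rng.uniform(-1.0, 1.0, size=2)
        maps.append((A, b))
    return maps

def render_fractal(maps, n_points=100_000, size=224, rng=None):
    """Chaos-game rendering: iterate randomly chosen maps, histogram the orbit."""
    rng = rng or np.random.default_rng()
    x = np.zeros(2)
    pts = np.empty((n_points, 2))
    for i in range(n_points):
        A, b = maps[rng.integers(len(maps))]
        x = A @ x + b
        pts[i] = x
    pts = pts[100:]  # drop burn-in iterations before the orbit settles
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    pix = ((pts - lo) / (hi - lo + 1e-9) * (size - 1)).astype(int)
    img = np.zeros((size, size), dtype=np.uint8)
    img[pix[:, 1], pix[:, 0]] = 255  # binary fractal silhouette
    return Image.fromarray(img)

# Each sampled IFS defines one "class"; its renderings are that class's images,
# so both the data and the labels come from the formula, not from annotation.
img = render_fractal(random_ifs(rng=np.random.default_rng(0)))
img.save("fractal_sample.png")
```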

Can we complete the training of Vision Transformers without natural images and human annotation? Although ViT pre-training seems to depend heavily on large-scale datasets and human-annotated labels, recent large-scale datasets suffer from problems such as privacy violations, inadequate fairness protection, and labor-intensive annotation. In this paper, the authors pre-train ViT without any large-scale annotated data. They verify that the proposed framework partially outperforms several self-supervised learning methods even though no natural images are involved in pre-training. Moreover, although no natural images are used in pre-training, the resulting ViT produces visualizations that differ from those of an ImageNet pre-trained ViT while still interpreting natural image datasets to a large extent.
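
The CIFAR-10 comparison quoted above implies a standard evaluation protocol: initialize a ViT from the synthetically pre-trained weights, then fine-tune on the target dataset. The following is a rough sketch of such a setup, assuming a timm ViT and a hypothetical checkpoint path; the paper's actual recipe, schedule, and augmentations are not given in the abstract.

```python
import torch
import timm
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# ViT backbone with a fresh 10-way head; the weights would come from
# fractal pre-training rather than ImageNet (the path is hypothetical).
model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=10)
state = torch.load("fractaldb_vit_pretrain.pth", map_location="cpu")
model.load_state_dict(state, strict=False)  # head shape differs from pre-training
model.to(device)

tfm = transforms.Compose([
    transforms.Resize(224),  # CIFAR-10 is 32x32; this ViT expects 224x224
    transforms.ToTensor(),
    transforms.Normalize((0.5,) * 3, (0.5,) * 3),
])
train = datasets.CIFAR10("data", train=True, download=True, transform=tfm)
loader = torch.utils.data.DataLoader(train, batch_size=64, shuffle=True)

opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss()

model.train()
for x, y in loader:  # one epoch shown; the paper's schedule is unknown
    x, y = x.to(device), y.to(device)
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```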
