Tag archive: Knowledge Distillation

Training data-efficient image transformers & distillation through attention

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption by the larger community. In this work, with an adequate training scheme, we produce a competitive convolution-free transformer by training on Imagenet only. We train it on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. We share our code and models to accelerate community advances on this line of research. Additionally, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 84.4% accuracy) and when transferring to other tasks. 

https://arxiv.org/pdf/2012.12877v1.pdf

Summary: Recently, neural networks based purely on attention have been shown to handle image-understanding tasks such as image classification. However, these vision transformers (ViT) are pre-trained on hundreds of millions of images using expensive infrastructure, which limits their adoption by the wider community. In this paper, the authors use an appropriate training procedure to produce a competitive convolution-free Transformer trained on ImageNet alone; the full training run takes less than 3 days on a single machine. Their reference model (86M parameters) reaches 83.1% top-1 accuracy on ImageNet with no external data, and the code and models are released to accelerate community research. In addition, they introduce a teacher-student strategy specific to Transformers: it relies on a distillation token that ensures the student learns from the teacher through attention. They demonstrate the value of this token-based distillation, especially when a convnet is used as the teacher. This yields results competitive with convnets both on ImageNet (up to 84.4% accuracy) and when transferring to other tasks.
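To make the distillation-token idea concrete, below is a minimal numpy sketch of the hard-label distillation objective the paper describes: the class-token head is supervised by the true label, while the distillation-token head is supervised by the teacher's predicted (argmax) label, with the two cross-entropy terms averaged. The function names are illustrative, not from the official DeiT code.

```python
import numpy as np

def cross_entropy(logits, target):
    # Numerically stable softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, label):
    """Hard-label distillation sketch (names are illustrative):
    the class token learns from the ground-truth label, and the
    distillation token learns from the teacher's hard prediction."""
    teacher_label = int(np.argmax(teacher_logits))
    return 0.5 * cross_entropy(cls_logits, label) \
         + 0.5 * cross_entropy(dist_logits, teacher_label)
```

At inference time, the paper fuses the two heads (e.g., by averaging their softmax outputs), so the distillation token contributes at test time as well.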