Training data-efficient image transformers & distillation through attention
Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
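The distillation token described in the abstract is an extra learned token appended to the patch tokens, so it interacts with them through self-attention and its output is supervised by the teacher's predictions. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of the idea, with illustrative names such as `DistilledTokenEmbedding`.

```python
# Minimal sketch (not the DeiT source code) of appending a distillation token
# to a ViT-style token sequence so it can learn from the teacher via attention.
import torch
import torch.nn as nn


class DistilledTokenEmbedding(nn.Module):
    """Prepends a class token and a distillation token to patch embeddings."""

    def __init__(self, embed_dim: int = 768, num_patches: int = 196):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # +2 positions: one for the class token, one for the distillation token.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        # The transformer blocks let both extra tokens attend to the patch
        # tokens; the distillation token's output is matched to the teacher.
        return torch.cat((cls, dist, patch_tokens), dim=1) + self.pos_embed


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)        # dummy patch embeddings (14x14 patches)
    tokens = DistilledTokenEmbedding()(x)
    print(tokens.shape)                 # torch.Size([2, 198, 768])
```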

Citation
```bibtex
@InProceedings{pmlr-v139-touvron21a,
  title     = {Training data-efficient image transformers & distillation through attention},
  author    = {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
  booktitle = {International Conference on Machine Learning},
  pages     = {10347--10357},
  year      = {2021},
  volume    = {139},
  month     = {July}
}
```
Pretrained models
The pre-trained models are converted from the official repo. The teacher used for the distilled DeiT models is RegNetY-16GF.
ImageNet-1k
| Model | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| DeiT-tiny* | 5.72 | 1.08 | 72.13 | 91.13 | config | model |
| DeiT-tiny distilled* | 5.72 | 1.08 | 74.51 | 91.90 | config | model |
| DeiT-small* | 22.05 | 4.24 | 79.83 | 94.95 | config | model |
| DeiT-small distilled* | 22.05 | 4.24 | 81.17 | 95.40 | config | model |
| DeiT-base* | 86.57 | 16.86 | 81.79 | 95.59 | config | model |
| DeiT-base distilled* | 86.57 | 16.86 | 83.33 | 96.49 | config | model |
Models with * are converted from other repos.
Fine-tuned models
The fine-tuned models are converted from the official repo.
ImageNet-1k
| Model | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| DeiT-base 384px* | 86.86 | 49.37 | 83.04 | 96.31 | config | model |
| DeiT-base distilled 384px* | 86.86 | 49.37 | 85.55 | 97.35 | config | model |
Models with * are converted from other repos.
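The much higher FLOPs of the 384px models come mainly from the larger number of patch tokens: with the 16×16 patches used by DeiT-base, the token count grows roughly with the square of the input resolution. A quick back-of-the-envelope check:

```python
# Rough check (not from the repo) of how patch-token count scales with input
# resolution for a 16x16 patch size, which drives the FLOPs gap in the table.
def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    return (image_size // patch_size) ** 2

print(num_patch_tokens(224))  # 196 tokens at 224px
print(num_patch_tokens(384))  # 576 tokens at 384px, ~2.9x more
```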
MMClassification does not support training the distilled versions of DeiT; the distilled checkpoints are provided for inference only.
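For inference with any of the checkpoints above, a minimal sketch using the MMClassification Python API is shown below (assuming mmcls 0.x is installed); the config and checkpoint paths are placeholders, so substitute the files linked in the tables.

```python
# Minimal single-image inference sketch with the MMClassification API.
# The config/checkpoint paths are placeholders, not actual file names.
from mmcls.apis import inference_model, init_model

config_file = 'configs/deit/deit-base_in1k.py'   # placeholder config path
checkpoint_file = 'deit-base_in1k.pth'           # placeholder checkpoint path
model = init_model(config_file, checkpoint_file, device='cuda:0')

result = inference_model(model, 'demo/demo.JPEG')  # any test image
print(result['pred_class'], result['pred_score'])
```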