Training data-efficient image transformers & distillation through attention
Abstract
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
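The distillation token described in the abstract is an extra learned token appended to the patch tokens, so it interacts with them through self-attention and its output is supervised by the teacher's predictions. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration of the idea, with illustrative names such as `DistilledTokenEmbedding`.

```python
# Minimal sketch (not the DeiT source code) of appending a distillation token
# to a ViT-style token sequence so it can learn from the teacher via attention.
import torch
import torch.nn as nn


class DistilledTokenEmbedding(nn.Module):
    """Prepends a class token and a distillation token to patch embeddings."""

    def __init__(self, embed_dim: int = 768, num_patches: int = 196):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # +2 positions: one for the class token, one for the distillation token.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.shape[0]
        cls = self.cls_token.expand(b, -1, -1)
        dist = self.dist_token.expand(b, -1, -1)
        # The transformer blocks let both extra tokens attend to the patch
        # tokens; the distillation token's output is matched to the teacher.
        return torch.cat((cls, dist, patch_tokens), dim=1) + self.pos_embed


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)        # dummy patch embeddings (14x14 patches)
    tokens = DistilledTokenEmbedding()(x)
    print(tokens.shape)                 # torch.Size([2, 198, 768])
```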

Citation
```bibtex
@InProceedings{pmlr-v139-touvron21a,
  title     = {Training data-efficient image transformers & distillation through attention},
  author    = {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
  booktitle = {International Conference on Machine Learning},
  pages     = {10347--10357},
  year      = {2021},
  volume    = {139},
  month     = {July}
}
```
Pretrained models
The pre-trained models are converted from the official repo. The teacher used for the distilled DeiT models is RegNetY-16GF.
ImageNet-1k
| Model | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| DeiT-tiny* | 5.72 | 1.08 | 72.13 | 91.13 | config | model |
| DeiT-tiny distilled* | 5.72 | 1.08 | 74.51 | 91.90 | config | model |
| DeiT-small* | 22.05 | 4.24 | 79.83 | 94.95 | config | model |
| DeiT-small distilled* | 22.05 | 4.24 | 81.17 | 95.40 | config | model |
| DeiT-base* | 86.57 | 16.86 | 81.79 | 95.59 | config | model |
| DeiT-base distilled* | 86.57 | 16.86 | 83.33 | 96.49 | config | model |
Models with * are converted from other repos.
Fine-tuned models
The fine-tuned models are converted from the official repo.
ImageNet-1k
| Model | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
|---|---|---|---|---|---|---|
| DeiT-base 384px* | 86.86 | 49.37 | 83.04 | 96.31 | config | model |
| DeiT-base distilled 384px* | 86.86 | 49.37 | 85.55 | 97.35 | config | model |
Models with * are converted from other repos.
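The much higher FLOPs of the 384px models come mainly from the larger number of patch tokens: with the 16×16 patches used by DeiT-base, the token count grows roughly with the square of the input resolution. A quick back-of-the-envelope check:

```python
# Rough check (not from the repo) of how patch-token count scales with input
# resolution for a 16x16 patch size, which drives the FLOPs gap in the table.
def num_patch_tokens(image_size: int, patch_size: int = 16) -> int:
    return (image_size // patch_size) ** 2

print(num_patch_tokens(224))  # 196 tokens at 224px
print(num_patch_tokens(384))  # 576 tokens at 384px, ~2.9x more
```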
MMClassification does not support training the distilled versions of DeiT; the distilled checkpoints are provided for inference only.
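For inference with any of the checkpoints above, a minimal sketch using the MMClassification Python API is shown below (assuming mmcls 0.x is installed); the config and checkpoint paths are placeholders, so substitute the files linked in the tables.

```python
# Minimal single-image inference sketch with the MMClassification API.
# The config/checkpoint paths are placeholders, not actual file names.
from mmcls.apis import inference_model, init_model

config_file = 'configs/deit/deit-base_in1k.py'   # placeholder config path
checkpoint_file = 'deit-base_in1k.pth'           # placeholder checkpoint path
model = init_model(config_file, checkpoint_file, device='cuda:0')

result = inference_model(model, 'demo/demo.JPEG')  # any test image
print(result['pred_class'], result['pred_score'])
```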