# Training data-efficient image transformers & distillation through attention

## Abstract

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
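
The key mechanism is token-based distillation: the student ViT carries an extra distillation token whose output head is trained against the teacher's predictions, while the class token is trained on the ground-truth labels. The sketch below illustrates the hard-distillation variant described in the paper; the function and argument names are hypothetical and are not the mmpretrain implementation:

```python
import torch.nn.functional as F

# Illustrative sketch of DeiT's hard token-based distillation
# (hypothetical names, not the mmpretrain implementation).
def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    # The class-token head learns from the ground-truth labels...
    loss_cls = F.cross_entropy(cls_logits, labels)
    # ...while the distillation-token head learns from the teacher's
    # hard decisions, propagated to the student through attention.
    loss_dist = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=1))
    return 0.5 * (loss_cls + loss_dist)
```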

## Citation

```bibtex
@InProceedings{pmlr-v139-touvron21a,
  title     = {Training data-efficient image transformers & distillation through attention},
  author    = {Touvron, Hugo and Cord, Matthieu and Douze, Matthijs and Massa, Francisco and Sablayrolles, Alexandre and Jegou, Herve},
  booktitle = {International Conference on Machine Learning},
  pages     = {10347--10357},
  year      = {2021},
  volume    = {139},
  month     = {July}
}
```

## Pretrained models

The pre-trained models are converted from the official repo. The teacher of the distilled DeiT models is RegNetY-16GF.

### ImageNet-1k

| Model                  | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :--------------------- | :-------: | :------: | :-------: | :-------: | :----: | :------: |
| DeiT-tiny\*            |   5.72    |   1.08   |   72.13   |   91.13   | config |  model   |
| DeiT-tiny distilled\*  |   5.72    |   1.08   |   74.51   |   91.90   | config |  model   |
| DeiT-small\*           |   22.05   |   4.24   |   79.83   |   94.95   | config |  model   |
| DeiT-small distilled\* |   22.05   |   4.24   |   81.17   |   95.40   | config |  model   |
| DeiT-base\*            |   86.57   |  16.86   |   81.79   |   95.59   | config |  model   |
| DeiT-base distilled\*  |   86.57   |  16.86   |   83.33   |   96.49   | config |  model   |

Models with * are converted from other repos.
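
These checkpoints can be used for inference through the mmpretrain Python API. A minimal sketch, assuming mmpretrain is installed; the exact model name below is an assumption, so query the registry first:

```python
from mmpretrain import inference_model, list_models

# Confirm the registered DeiT model names (the name below is an assumption).
print(list_models('deit'))

# Classify a single image with a pre-trained DeiT-base checkpoint.
result = inference_model('deit-base_16xb64_in1k', 'demo/demo.JPEG')
print(result['pred_class'], result['pred_score'])
```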

## Fine-tuned models

The fine-tuned models are converted from the official repo.

### ImageNet-1k

| Model                       | Params(M) | Flops(G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :-------------------------- | :-------: | :------: | :-------: | :-------: | :----: | :------: |
| DeiT-base 384px\*           |   86.86   |  49.37   |   83.04   |   96.31   | config |  model   |
| DeiT-base distilled 384px\* |   86.86   |  49.37   |   85.55   |   97.35   | config |  model   |

Models with * are converted from other repos.
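
The 384px checkpoints expect 384×384 inputs. A minimal sketch for loading one as a plain PyTorch module, assuming `get_model` resolves the name (the model name is an assumption; verify it with `list_models('deit')` first):

```python
import torch
from mmpretrain import get_model

# The model name is an assumption; verify it with list_models('deit').
model = get_model('deit-base_in1k-384px', pretrained=True)
model.eval()

# DeiT-base 384px expects 384x384 inputs.
with torch.no_grad():
    scores = model(torch.rand(1, 3, 384, 384))
print(scores.shape)  # expected: (1, 1000) class scores for ImageNet-1k
```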

MMClassification doesn't support training the distilled version of DeiT; the distilled checkpoints are provided for inference only.