# ConvMixer

> [Patches Are All You Need?](https://arxiv.org/abs/2201.09792)

## Abstract

Although convolutional networks have been the dominant architecture for vision tasks for many years, recent experiments have shown that Transformer-based models, most notably the Vision Transformer (ViT), may exceed their performance in some settings. However, due to the quadratic runtime of the self-attention layers in Transformers, ViTs require the use of patch embeddings, which group together small regions of the image into single input features, in order to be applied to larger image sizes. This raises a question: Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation? In this paper, we present some evidence for the latter: specifically, we propose the ConvMixer, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network. In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps. Despite its simplicity, we show that the ConvMixer outperforms the ViT, MLP-Mixer, and some of their variants for similar parameter counts and data set sizes, in addition to outperforming classical vision models such as the ResNet.
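The architecture described above reduces to a handful of standard convolutions. Below is a minimal PyTorch sketch in the spirit of the paper's reference implementation; the `dim`/`depth` pair corresponds to the model names in the results table (e.g. ConvMixer-768/32 uses `dim=768, depth=32`), while the kernel and patch sizes shown are illustrative defaults rather than the exact values of every released checkpoint.

```python
import torch
from torch import nn


class Residual(nn.Module):
    """Adds a skip connection around an arbitrary module."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x


def conv_mixer(dim, depth, kernel_size=9, patch_size=7, num_classes=1000):
    """ConvMixer-dim/depth: a patch embedding followed by `depth` mixing blocks."""
    return nn.Sequential(
        # Patch embedding: a strided conv maps each patch to a `dim`-dimensional token.
        nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size),
        nn.GELU(),
        nn.BatchNorm2d(dim),
        # Each block first mixes spatial locations (large-kernel depthwise conv + residual),
        # then mixes channels (1x1 pointwise conv). Size and resolution stay constant.
        *[nn.Sequential(
            Residual(nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(dim),
            )),
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        ) for _ in range(depth)],
        # Global average pooling and a linear classifier head.
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(dim, num_classes),
    )


if __name__ == "__main__":
    model = conv_mixer(dim=768, depth=32)
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```

The depthwise convolution plays the role of token mixing in ViT/MLP-Mixer, and the pointwise convolution plays the role of the channel MLP, which is why the whole model can be written with nothing but standard convolution layers.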

## Results and models

### ImageNet-1k

| Model               | Params (M) | Flops (G) | Top-1 (%) | Top-5 (%) | Config | Download |
| :-----------------: | :--------: | :-------: | :-------: | :-------: | :----: | :------: |
| ConvMixer-768/32\*  |   21.11    |   19.62   |   80.16   |   95.08   | config |  model   |
| ConvMixer-1024/20\* |   24.38    |   5.55    |   76.94   |   93.36   | config |  model   |
| ConvMixer-1536/20\* |   51.63    |   48.71   |   81.37   |   95.61   | config |  model   |

*Models marked with \* are converted from the official repo. The config files of these models are only for inference; we have not verified their training accuracy and welcome you to contribute your reproduction results.*
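Since these configs are inference-only, a quick way to try the converted weights is mmpretrain's high-level inference API. The snippet below is a hedged sketch: the model name follows the `*_3rdparty_in1k` naming convention typically used in the metafile for converted checkpoints and should be verified with `list_models`, and the image path is a placeholder.

```python
# Sketch of single-image inference with a converted ConvMixer checkpoint.
# The model name below is an assumption based on the metafile naming convention;
# check the output of list_models() for the names registered in your installation.
from mmpretrain import inference_model, list_models

# Show the ConvMixer entries known to the metafile.
print(list_models('convmixer'))

# Classify a single image; pretrained weights are downloaded on first use.
result = inference_model('convmixer-768-32_3rdparty_in1k', 'path/to/image.jpg')
print(result['pred_class'], result['pred_score'])
```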

## Citation

```bibtex
@misc{trockman2022patches,
      title={Patches Are All You Need?},
      author={Asher Trockman and J. Zico Kolter},
      year={2022},
      eprint={2201.09792},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```