# MAE

> [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)

<!-- [ALGORITHM] -->

## Abstract

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3× or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

<div align="center">
<img src="https://user-images.githubusercontent.com/30762564/150733959-2959852a-c7bd-4d3f-911f-3e8d8839fe67.png" width="40%"/>
</div>
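
The two core designs described in the abstract are easy to prototype. The snippet below is a minimal PyTorch-style sketch of the random-masking step, assuming patch tokens of shape `(batch, num_patches, embed_dim)`; the function name, shapes, and the 75% default are illustrative assumptions, not the implementation used by the configs in this repo.

```python
import torch


def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens, in the spirit of MAE's masking.

    `patches` is assumed to be (batch, num_patches, embed_dim); only the kept
    (visible) tokens would be passed to the encoder, without any mask tokens.
    """
    b, n, d = patches.shape
    num_keep = int(n * (1 - mask_ratio))

    # Shuffle patch indices independently per sample and keep the first few.
    noise = torch.rand(b, n, device=patches.device)
    ids_keep = noise.argsort(dim=1)[:, :num_keep]

    # Gather the visible tokens; the lightweight decoder later re-inserts
    # mask tokens at the dropped positions and reconstructs the pixels.
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_keep


# Example: 14x14 = 196 patches from a 224x224 image, embed dim 768.
tokens = torch.randn(2, 196, 768)
visible, ids_keep = random_masking(tokens)
print(visible.shape)  # torch.Size([2, 49, 768]) -- only 25% of tokens kept
```

Because the encoder only ever sees the visible 25% of tokens, its cost per image drops sharply, which is what makes pre-training at a high mask ratio efficient.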
## Models and Benchmarks

Here, we report the results of the model, which is pre-trained on ImageNet-1K for 400 epochs. The details are as follows:

| Backbone | Pre-train epoch | Fine-tuning Top-1 | Pre-train Config | Fine-tuning Config | Download |
| :------: | :-------------: | :---------------: | :--------------: | :----------------: | :------: |
| ViT-B/16 | 400 | 83.1 | [config](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/mae/mae_vit-b-p16_8xb512-coslr-400e_in1k.py) | [config](https://github.com/open-mmlab/mmselfsup/blob/master/configs/benchmarks/classification/imagenet/vit-b-p16_ft-8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/mae/mae_vit-base-p16_8xb512-coslr-400e_in1k-224_20220223-85be947b.pth) \| [log](https://download.openmmlab.com/mmselfsup/mae/mae_vit-base-p16_8xb512-coslr-300e_in1k-224_20220210_140925.log.json) |
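
After downloading the checkpoint from the "model" link above, a quick sanity check is to load it with PyTorch and list a few parameter names. The local filename and the `state_dict` key below are assumptions about the usual OpenMMLab checkpoint layout, not something this page guarantees.

```python
import torch

# Assumed local path: the file downloaded from the "model" link in the table.
ckpt = torch.load(
    'mae_vit-base-p16_8xb512-coslr-400e_in1k-224_20220223-85be947b.pth',
    map_location='cpu',
)

# OpenMMLab checkpoints typically nest weights under 'state_dict'; fall back
# to the raw dict if this particular file is laid out differently.
state_dict = ckpt.get('state_dict', ckpt)

print(f'{len(state_dict)} parameter tensors')
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```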
## Citation

```bibtex
@article{He2021MaskedAA,
  title={Masked Autoencoders Are Scalable Vision Learners},
  author={Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross B. Girshick},
  journal={ArXiv},
  year={2021}
}
```