mmsegmentation/configs/mae/README.md

# MAE

[Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)

## Introduction

<!-- [BACKBONE] -->

<a href="https://github.com/facebookresearch/mae">Official Repo</a>

<a href="https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#L46">Code Snippet</a>

## Abstract

<!-- [ABSTRACT] -->

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/24582831/165456416-1cba54bf-b1b5-4bdf-ad86-d6390de7f342.png" width="70%"/>
</div>

## Citation

```bibtex
@article{he2021masked,
  title={Masked autoencoders are scalable vision learners},
  author={He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Doll{\'a}r, Piotr and Girshick, Ross},
  journal={arXiv preprint arXiv:2111.06377},
  year={2021}
}
```

## Usage

To use other repositories' pre-trained models, it is necessary to convert keys.

We provide a script [`beit2mmseg.py`](../../tools/model_converters/beit2mmseg.py) in the tools directory to convert the key of MAE model from [the official repo](https://github.com/facebookresearch/mae) to MMSegmentation style.

```shell
python tools/model_converters/beit2mmseg.py ${PRETRAIN_PATH} ${STORE_PATH}
```

E.g.

```shell
python tools/model_converters/beit2mmseg.py https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth pretrain/mae_pretrain_vit_base_mmcls.pth
```

This script convert model from `PRETRAIN_PATH` and store the converted model in `STORE_PATH`.

In our default setting, pretrained models could be defined below:

| pretrained models               | original models                                                                                  |
| ------------------------------- | ------------------------------------------------------------------------------------------------ |
| mae_pretrain_vit_base_mmcls.pth | ['mae_pretrain_vit_base'](https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth) |

Verify the single-scale results of the model:

```shell
sh tools/dist_test.sh \
configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py \
upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```

Since relative position embedding requires the input length and width to be equal, the sliding window is adopted for multi-scale inference. So we set min_size=512, that is, the shortest edge is 512. So the multi-scale inference of config is performed separately, instead of '--aug-test'. For multi-scale inference:

```shell
sh tools/dist_test.sh \
configs/mae/upernet_mae-base_fp16_512x512_160k_ade20k_ms.py \
upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```

## Results and models

### ADE20K

| Method  | Backbone | Crop Size | pretrain    | pretrain img size | Batch Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU  | mIoU(ms+flip) | config                                                                                                                          | download                                                                                                                                                                                                                                                                                                                                                                       |
| ------- | -------- | --------- | ----------- | ----------------- | ---------- | ------- | -------- | -------------- | ----- | ------------: | ------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| UPerNet | ViT-B    | 512x512   | ImageNet-1K | 224x224           | 16         | 160000  | 9.96     | 7.14           | 48.13 |         48.70 | [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth) \| [log](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json) |
[Feature]: Add MAE (#1307) * [Fix]: Fix lint * [WIP]: Add mae seg config * [Feature]: Add MAE seg * [Fix]: Fix mae dataset img scale bug * [Fix]: Fix lint * [Feature]: Change mae config to mae_segmentation's config * [Feature]: Add interpolate pe when loading * [Fix]: Fix pos_embed not used bug * [Fix]: Fix lint * [Fix]: Init rel pos embed with zeros * [Fix]: Fix lint * [Fix]: Change the type name of backbone to MAE * [Fix]: Delete ade20k_512x512.py * [Fix]: Use mmseg provided ade20k.py * [Fix]: Change 1 sample per gpu to 2 samples per gpu * [Fix]: Fix conflict * [Refactor]: Use the TransformerEncoderLayer of BEiT * [Feature]: Add UT * [Fix]: Change the default value of qv bias to False * [Fix]: Initialize relative pos table with zeros * [Fix]: Delete redundant code in mae * [Fix]: Fix lint * [Fix]: Rename qkv_bias to qv_bias * [Fix]: Add docstring to weight_init of MAEAttention * [Refactor]: Delete qv_bias param * [Fix]: Add reference to fix_init_weight * [Fix]: Fix lint * [Fix]: Delete extra crop size * [Refactor]: Rename mae * [Fix]: Set bias to True * [Fix]: Delete redundant params * [Fix]: Fix lint * [Fix]: Fix UT * [Fix]: Add resize abs pos embed * [Fix]: Fix UT * [Refactor]: Use build layer * [Fix]: Add licsense and fix docstring * [Fix]: Fix docstring * [Feature]: Add README metafile * [Fix]: Change 640 to 512 * [Fix]: Fix README * fix readme of MAE Co-authored-by: MengzhangLI <mcmong@pku.edu.cn> 2022-04-28 00:54:20 +08:00			`# MAE`

			`[Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)`

			`## Introduction`

			`<!-- [BACKBONE] -->`

			`<a href="https://github.com/facebookresearch/mae">Official Repo</a>`

[Fix] Fix main line link in MAE README.md (#1556) 2022-05-05 17:49:38 +08:00			`<a href="https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#L46">Code Snippet</a>`
[Feature]: Add MAE (#1307) * [Fix]: Fix lint * [WIP]: Add mae seg config * [Feature]: Add MAE seg * [Fix]: Fix mae dataset img scale bug * [Fix]: Fix lint * [Feature]: Change mae config to mae_segmentation's config * [Feature]: Add interpolate pe when loading * [Fix]: Fix pos_embed not used bug * [Fix]: Fix lint * [Fix]: Init rel pos embed with zeros * [Fix]: Fix lint * [Fix]: Change the type name of backbone to MAE * [Fix]: Delete ade20k_512x512.py * [Fix]: Use mmseg provided ade20k.py * [Fix]: Change 1 sample per gpu to 2 samples per gpu * [Fix]: Fix conflict * [Refactor]: Use the TransformerEncoderLayer of BEiT * [Feature]: Add UT * [Fix]: Change the default value of qv bias to False * [Fix]: Initialize relative pos table with zeros * [Fix]: Delete redundant code in mae * [Fix]: Fix lint * [Fix]: Rename qkv_bias to qv_bias * [Fix]: Add docstring to weight_init of MAEAttention * [Refactor]: Delete qv_bias param * [Fix]: Add reference to fix_init_weight * [Fix]: Fix lint * [Fix]: Delete extra crop size * [Refactor]: Rename mae * [Fix]: Set bias to True * [Fix]: Delete redundant params * [Fix]: Fix lint * [Fix]: Fix UT * [Fix]: Add resize abs pos embed * [Fix]: Fix UT * [Refactor]: Use build layer * [Fix]: Add licsense and fix docstring * [Fix]: Fix docstring * [Feature]: Add README metafile * [Fix]: Change 640 to 512 * [Fix]: Fix README * fix readme of MAE Co-authored-by: MengzhangLI <mcmong@pku.edu.cn> 2022-04-28 00:54:20 +08:00
			`## Abstract`

			`<!-- [ABSTRACT] -->`

			This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

			`<!-- [IMAGE] -->`
Update pre-commit yaml and md2yml.py 2022-07-05 15:58:48 +08:00
[Feature]: Add MAE (#1307) * [Fix]: Fix lint * [WIP]: Add mae seg config * [Feature]: Add MAE seg * [Fix]: Fix mae dataset img scale bug * [Fix]: Fix lint * [Feature]: Change mae config to mae_segmentation's config * [Feature]: Add interpolate pe when loading * [Fix]: Fix pos_embed not used bug * [Fix]: Fix lint * [Fix]: Init rel pos embed with zeros * [Fix]: Fix lint * [Fix]: Change the type name of backbone to MAE * [Fix]: Delete ade20k_512x512.py * [Fix]: Use mmseg provided ade20k.py * [Fix]: Change 1 sample per gpu to 2 samples per gpu * [Fix]: Fix conflict * [Refactor]: Use the TransformerEncoderLayer of BEiT * [Feature]: Add UT * [Fix]: Change the default value of qv bias to False * [Fix]: Initialize relative pos table with zeros * [Fix]: Delete redundant code in mae * [Fix]: Fix lint * [Fix]: Rename qkv_bias to qv_bias * [Fix]: Add docstring to weight_init of MAEAttention * [Refactor]: Delete qv_bias param * [Fix]: Add reference to fix_init_weight * [Fix]: Fix lint * [Fix]: Delete extra crop size * [Refactor]: Rename mae * [Fix]: Set bias to True * [Fix]: Delete redundant params * [Fix]: Fix lint * [Fix]: Fix UT * [Fix]: Add resize abs pos embed * [Fix]: Fix UT * [Refactor]: Use build layer * [Fix]: Add licsense and fix docstring * [Fix]: Fix docstring * [Feature]: Add README metafile * [Fix]: Change 640 to 512 * [Fix]: Fix README * fix readme of MAE Co-authored-by: MengzhangLI <mcmong@pku.edu.cn> 2022-04-28 00:54:20 +08:00			`<div align=center>`
			`<img src="https://user-images.githubusercontent.com/24582831/165456416-1cba54bf-b1b5-4bdf-ad86-d6390de7f342.png" width="70%"/>`
			`</div>`

			`## Citation`

			```bibtex
			`@article{he2021masked,`
			`title={Masked autoencoders are scalable vision learners},`
			`author={He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Doll{\'a}r, Piotr and Girshick, Ross},`
			`journal={arXiv preprint arXiv:2111.06377},`
			`year={2021}`
			`}`
			```

			`## Usage`

			`To use other repositories' pre-trained models, it is necessary to convert keys.`

			We provide a script [`beit2mmseg.py`](../../tools/model_converters/beit2mmseg.py) in the tools directory to convert the key of MAE model from [the official repo](https://github.com/facebookresearch/mae) to MMSegmentation style.

			```shell
			`python tools/model_converters/beit2mmseg.py ${PRETRAIN_PATH} ${STORE_PATH}`
			```

			`E.g.`

			```shell
			`python tools/model_converters/beit2mmseg.py https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth pretrain/mae_pretrain_vit_base_mmcls.pth`
			```

			This script convert model from `PRETRAIN_PATH` and store the converted model in `STORE_PATH`.

			`In our default setting, pretrained models could be defined below:`

Update pre-commit yaml and md2yml.py 2022-07-05 15:58:48 +08:00			`\| pretrained models \| original models \|`
			`\| ------------------------------- \| ------------------------------------------------------------------------------------------------ \|`
			`\| mae_pretrain_vit_base_mmcls.pth \| ['mae_pretrain_vit_base'](https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth) \|`
[Feature]: Add MAE (#1307) * [Fix]: Fix lint * [WIP]: Add mae seg config * [Feature]: Add MAE seg * [Fix]: Fix mae dataset img scale bug * [Fix]: Fix lint * [Feature]: Change mae config to mae_segmentation's config * [Feature]: Add interpolate pe when loading * [Fix]: Fix pos_embed not used bug * [Fix]: Fix lint * [Fix]: Init rel pos embed with zeros * [Fix]: Fix lint * [Fix]: Change the type name of backbone to MAE * [Fix]: Delete ade20k_512x512.py * [Fix]: Use mmseg provided ade20k.py * [Fix]: Change 1 sample per gpu to 2 samples per gpu * [Fix]: Fix conflict * [Refactor]: Use the TransformerEncoderLayer of BEiT * [Feature]: Add UT * [Fix]: Change the default value of qv bias to False * [Fix]: Initialize relative pos table with zeros * [Fix]: Delete redundant code in mae * [Fix]: Fix lint * [Fix]: Rename qkv_bias to qv_bias * [Fix]: Add docstring to weight_init of MAEAttention * [Refactor]: Delete qv_bias param * [Fix]: Add reference to fix_init_weight * [Fix]: Fix lint * [Fix]: Delete extra crop size * [Refactor]: Rename mae * [Fix]: Set bias to True * [Fix]: Delete redundant params * [Fix]: Fix lint * [Fix]: Fix UT * [Fix]: Add resize abs pos embed * [Fix]: Fix UT * [Refactor]: Use build layer * [Fix]: Add licsense and fix docstring * [Fix]: Fix docstring * [Feature]: Add README metafile * [Fix]: Change 640 to 512 * [Fix]: Fix README * fix readme of MAE Co-authored-by: MengzhangLI <mcmong@pku.edu.cn> 2022-04-28 00:54:20 +08:00
			`Verify the single-scale results of the model:`

			```shell
			`sh tools/dist_test.sh \`
			`configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py \`
			`upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU`
			```

			`Since relative position embedding requires the input length and width to be equal, the sliding window is adopted for multi-scale inference. So we set min_size=512, that is, the shortest edge is 512. So the multi-scale inference of config is performed separately, instead of '--aug-test'. For multi-scale inference:`

			```shell
			`sh tools/dist_test.sh \`
			`configs/mae/upernet_mae-base_fp16_512x512_160k_ade20k_ms.py \`
			`upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU`
			```

			`## Results and models`

			`### ADE20K`

Update pre-commit yaml and md2yml.py 2022-07-05 15:58:48 +08:00			\| Method \| Backbone \| Crop Size \| pretrain \| pretrain img size \| Batch Size \| Lr schd \| Mem (GB) \| Inf time (fps) \| mIoU \| mIoU(ms+flip) \| config \| download \|
			\| ------- \| -------- \| --------- \| ----------- \| ----------------- \| ---------- \| ------- \| -------- \| -------------- \| ----- \| ------------: \| ------------------------------------------------------------------------------------------------------------------------------- \| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ \|
			\| UPerNet \| ViT-B \| 512x512 \| ImageNet-1K \| 224x224 \| 16 \| 160000 \| 9.96 \| 7.14 \| 48.13 \| 48.70 \| [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py) \| [model](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth) \\| [log](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json) \|