# MILAN
> [MILAN: Masked Image Pretraining on Language Assisted Representation](https://arxiv.org/pdf/2208.06049)
<!-- [ALGORITHM] -->
## Abstract
Self-attention based transformer models have been dominating many computer
vision tasks in the past few years. Their superb model qualities heavily depend
on the excessively large labeled image datasets. In order to reduce the reliance
on large labeled datasets, reconstruction based masked autoencoders are gaining
popularity, which learn high quality transferable representations from unlabeled
images. For the same purpose, recent weakly supervised image pretraining methods
explore language supervision from text captions accompanying the images. In this
work, we propose masked image pretraining on language assisted representation,
dubbed as MILAN. Instead of predicting raw pixels or low level features, our
pretraining objective is to reconstruct the image features with substantial semantic
signals that are obtained using caption supervision. Moreover, to accommodate our
reconstruction target, we propose a more efficient prompting decoder architecture
and a semantic aware mask sampling mechanism, which further advance the
transfer performance of the pretrained model. Experimental results demonstrate
that MILAN delivers higher accuracy than the previous works. When the masked
autoencoder is pretrained and finetuned on ImageNet-1K dataset with an input
resolution of 224×224, MILAN achieves a top-1 accuracy of 85.4% on ViT-B/16, surpassing the previous state of the art by 1%. In the downstream semantic
segmentation task, MILAN achieves 52.7 mIoU using ViT-B/16 backbone on
ADE20K dataset, outperforming previous masked pretraining results by 4 points.
<div align=center>
<img src="https://user-images.githubusercontent.com/30762564/205210369-41a65c4c-bcd4-4147-91ea-c6c9061ab455.png" width="80%"/>
</div>
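
In pseudocode, the objective differs from MAE mainly in the regression target and the mask sampling. Below is a minimal sketch, assuming hypothetical `student_encoder`, `prompt_decoder` and `clip_encoder` modules and a 75% mask ratio; it is not the mmpretrain implementation, and the norm-based importance score is a stand-in for the CLIP attention weights MILAN actually uses.

```python
import torch
import torch.nn.functional as F

def milan_style_loss(student_encoder, prompt_decoder, clip_encoder,
                     images, mask_ratio=0.75):
    """Sketch of a MILAN-style objective (illustrative, not mmpretrain's code).

    The reconstruction target is the patch-feature map of a frozen,
    caption-supervised image encoder (e.g. CLIP's visual tower), and the
    loss is computed only on the masked patch positions.
    """
    with torch.no_grad():
        target = clip_encoder(images)            # (B, N, C) semantic features

    # Semantic-aware mask sampling: keep the patches the target model deems
    # important. MILAN uses CLIP's attention weights; the feature norm below
    # is only a stand-in so the sketch stays self-contained.
    importance = target.norm(dim=-1)             # (B, N) importance proxy
    num_visible = int(target.size(1) * (1 - mask_ratio))
    visible_idx = importance.topk(num_visible, dim=1).indices

    latent = student_encoder(images, visible_idx)  # encode visible patches only
    pred = prompt_decoder(latent, visible_idx)     # predict features at all N positions

    mask = torch.ones(target.shape[:2], device=images.device)
    mask.scatter_(1, visible_idx, 0.)            # 1 = masked position
    per_patch = F.mse_loss(pred, target, reduction='none').mean(dim=-1)
    return (per_patch * mask).sum() / mask.sum()
```
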
## How to use it?
<!-- [TABS-BEGIN] -->
**Predict image**
```python
from mmpretrain import inference_model

# Single-image inference with the MILAN fine-tuned ViT-B/16 classifier.
predict = inference_model('vit-base-p16_milan-pre_8xb128-coslr-100e_in1k', 'demo/bird.JPEG')
print(predict['pred_class'])
print(predict['pred_score'])
```
**Use the model**
```python
import torch
from mmpretrain import get_model

# Load the self-supervised MILAN pre-training model (not the fine-tuned classifier).
model = get_model('milan_vit-base-p16_16xb256-amp-coslr-400e_in1k', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))

# To extract backbone features.
feats = model.extract_feat(inputs)
print(type(feats))
```
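
In mmpretrain, `extract_feat` typically returns a tuple of tensors, one per output stage; under that assumption, the shapes can be inspected directly:

```python
# Assuming feats is a tuple of tensors (mmpretrain's usual convention),
# print the shape produced by each output stage.
for i, feat in enumerate(feats):
    print(i, feat.shape)
```
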
**Train/Test Command**
Prepare your dataset according to the [docs](https://mmpretrain.readthedocs.io/en/latest/user_guides/dataset_prepare.html#prepare-dataset).
Train:
```shell
python tools/train.py configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py
```
Test:
```shell
python tools/test.py configs/milan/benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth
```
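
The commands above run on a single GPU. The schedules baked into the config names (16xb256 for pretraining, 8xb128 for fine-tuning) assume multiple GPUs, which mmpretrain's distributed launcher handles:

```shell
# Pretrain on 16 GPUs to match the 16xb256 schedule in the config name.
bash tools/dist_train.sh configs/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py 16
```
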
<!-- [TABS-END] -->
## Models and results
### Pretrained models
| Model | Params (M) | Flops (G) | Config | Download |
| :----------------------------------------------- | :--------: | :-------: | :---------------------------------------------------------: | :------------------------------------------------------------------------: |
| `milan_vit-base-p16_16xb256-amp-coslr-400e_in1k` | 111.91 | 17.58 | [config](milan_vit-base-p16_16xb256-amp-coslr-400e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.json) |
### Image Classification on ImageNet-1k
| Model | Pretrain | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------------------- | :------------------------------------------: | :--------: | :-------: | :-------: | :----------------------------------------: | :-------------------------------------------: |
| `vit-base-p16_milan-pre_8xb128-coslr-100e_in1k` | [MILAN](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) | 86.57 | 17.58 | 85.30 | [config](benchmarks/vit-base-p16_8xb128-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k/vit-base-p16_ft-8xb128-coslr-100e_in1k-milan_20221129-74ac94fa.json) |
| `vit-base-p16_milan-pre_8xb2048-linear-coslr-100e_in1k` | [MILAN](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k_20221129-180922e8.pth) | 86.57 | 17.58 | 78.90 | [config](benchmarks/vit-base-p16_8xb2048-linear-coslr-100e_in1k.py) | [model](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.pth) \| [log](https://download.openmmlab.com/mmselfsup/1.x/milan/milan_vit-base-p16_16xb256-amp-coslr-400e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k/vit-base-p16_linear-8xb2048-coslr-100e_in1k_20221129-03f26f85.json) |
## Citation
```bibtex
@article{Hou2022MILANMI,
  title={MILAN: Masked Image Pretraining on Language Assisted Representation},
  author={Zejiang Hou and Fei Sun and Yen-Kuang Chen and Yuan Xie and S. Y. Kung},
  journal={ArXiv},
  year={2022}
}
```