# Tutorial 2: Prepare Datasets
MMSelfSup supports multiple datasets. Please follow the corresponding guidelines for data preparation. It is recommended to symlink your dataset root to `$MMSELFSUP/data`. If your folder structure is different, you may need to change the corresponding paths in config files.
- [Tutorial 2: Prepare Datasets](#tutorial-2-prepare-datasets)
  - [Prepare ImageNet](#prepare-imagenet)
  - [Prepare Places205](#prepare-places205)
  - [Prepare iNaturalist2018](#prepare-inaturalist2018)
  - [Prepare PASCAL VOC](#prepare-pascal-voc)
  - [Prepare CIFAR10](#prepare-cifar10)
  - [Prepare datasets for detection and segmentation](#prepare-datasets-for-detection-and-segmentation)
    - [Detection](#detection)
    - [Segmentation](#segmentation)

```
mmselfsup
├── mmselfsup
├── tools
├── configs
├── docs
├── data
│   ├── imagenet
│   │   ├── meta
│   │   ├── train
│   │   ├── val
│   ├── places205
│   │   ├── meta
│   │   ├── train
│   │   ├── val
│   ├── inaturalist2018
│   │   ├── meta
│   │   ├── train
│   │   ├── val
│   ├── VOCdevkit
│   │   ├── VOC2007
│   ├── cifar
│   │   ├── cifar-10-batches-py
```
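
For example, if ImageNet already lives somewhere under `$YOUR_DATA_ROOT`, a symlink reproduces this layout without copying anything. A minimal sketch, where the source path is a placeholder for wherever the dataset actually lives:

```shell
mkdir -p data
# $YOUR_DATA_ROOT/imagenet is a placeholder; point it at your real dataset folder
ln -s $YOUR_DATA_ROOT/imagenet data/imagenet
```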
## Prepare ImageNet
ImageNet has multiple versions, but the most commonly used one is [ILSVRC 2012](http://www.image-net.org/challenges/LSVRC/2012/). It can be obtained with the following steps:
1. Register an account and log in to the [download page](http://www.image-net.org/download-images)
2. Find the download links for ILSVRC2012 and download the following two files:
   - ILSVRC2012_img_train.tar (~138GB)
   - ILSVRC2012_img_val.tar (~6.3GB)
3. Untar the downloaded files
4. Download the metadata using this [script](https://github.com/BVLC/caffe/blob/master/data/ilsvrc12/get_ilsvrc_aux.sh)
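
Steps 3 and 4 might look like the sketch below. The extraction layout follows the common ILSVRC2012 recipe rather than an MMSelfSup-specific script, and the helper script is the one linked in step 4; adjust paths to your setup.

```shell
mkdir -p data/imagenet/{meta,train,val}

# Step 3: the train archive is a tar of per-class tars; unpack each into its own folder
tar -xf ILSVRC2012_img_train.tar -C data/imagenet/train
for f in data/imagenet/train/*.tar; do
  d="${f%.tar}" && mkdir -p "$d" && tar -xf "$f" -C "$d" && rm "$f"
done
# The val archive is a flat folder of JPEGs
tar -xf ILSVRC2012_img_val.tar -C data/imagenet/val

# Step 4: fetch the meta files (train.txt, val.txt, ...) via the Caffe helper script
wget https://raw.githubusercontent.com/BVLC/caffe/master/data/ilsvrc12/get_ilsvrc_aux.sh
bash get_ilsvrc_aux.sh        # the helper unpacks its files next to itself
mv *.txt data/imagenet/meta/  # then move the image lists into meta/
```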
## Prepare Places205
For Places205, you need to:
1. Register an account and log in to the [download page](http://places.csail.mit.edu/downloadData.html)
2. Download the resized images and the image lists of the training and validation sets of Places205
3. Untar the downloaded files
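
Step 3 might look like this minimal sketch; the archive names below are placeholders, since the exact filenames depend on which bundles you picked on the download page:

```shell
mkdir -p data/places205/{meta,train,val}
# Placeholder archive names; substitute the files you actually downloaded
tar -xzf resized_images.tar.gz -C data/places205
tar -xzf trainval_image_lists.tar.gz -C data/places205/meta
```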
## Prepare iNaturalist2018
For iNaturalist2018, you need to:
1. Download the training and validation images and annotations from the [download page](https://github.com/visipedia/inat_comp/tree/master/2018)
2. Untar the downloaded files
3. Convert the original JSON annotation format to the list format using the script `tools/dataset_converters/convert_inaturalist.py`
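
End to end, the procedure might look like the sketch below. The archive names follow the 2018 competition page at the time of writing, and the converter's command-line interface is an assumption here, so check `python tools/dataset_converters/convert_inaturalist.py --help` before running:

```shell
mkdir -p data/inaturalist2018/meta
# Archive names as listed on the inat_comp 2018 page; verify before running
tar -xzf train_val2018.tar.gz -C data/inaturalist2018
tar -xzf train2018.json.tar.gz -C data/inaturalist2018
# The positional arguments (annotation JSON in, list file out) are an assumed
# interface for the converter, not its documented usage
python tools/dataset_converters/convert_inaturalist.py \
    data/inaturalist2018/train2018.json data/inaturalist2018/meta/train.txt
```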
## Prepare PASCAL VOC
Assuming that you usually store datasets in `$YOUR_DATA_ROOT`, the following command will automatically download PASCAL VOC 2007 into `$YOUR_DATA_ROOT`, prepare the required files, create a folder `data` under `$MMSELFSUP` and make a symlink `VOCdevkit`.
```shell
bash tools/dataset_converters/prepare_voc07_cls.sh $YOUR_DATA_ROOT
```
## Prepare CIFAR10
`MMSelfSup` uses the [`CIFAR10`](https://github.com/open-mmlab/mmclassification/blob/1.x/mmcls/datasets/cifar.py) dataset implemented by `MMClassification`, which also supports downloading `CIFAR10` automatically: just specify the download folder in the `data_root` field, and set `test_mode=False` or `test_mode=True` to use the training or test set. For more details, please refer to the [docs](https://github.com/open-mmlab/mmclassification/blob/1.x/docs/en/user_guides/dataset_prepare.md#cifar) of `MMClassification`.
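
As a quick illustration, both fields can also be overridden from the command line with `--cfg-options`; the config file name below is a placeholder, and the `train_dataloader.dataset.*` key path assumes the standard MMSelfSup 1.x config layout:

```shell
# Placeholder config name; CIFAR10 is downloaded into data/cifar on first use
python tools/train.py configs/selfsup/your_config.py \
    --cfg-options train_dataloader.dataset.data_root=data/cifar \
    train_dataloader.dataset.test_mode=False
```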
## Prepare datasets for detection and segmentation
### Detection
To prepare COCO, VOC2007 and VOC2012 for detection, you can refer to [mmdetection](https://github.com/open-mmlab/mmdetection/blob/dev-3.x/docs/en/1_exist_data_model.md).
### Segmentation
To prepare VOC2012AUG and Cityscapes for segmentation, you can refer to [mmsegmentation](https://github.com/open-mmlab/mmsegmentation/blob/dev-1.x/docs/en/user_guides/2_dataset_prepare.md#prepare-datasets).