# Prepare Dataset
MMPretrain supports the following datasets:

- [CustomDataset](#customdataset)
  - [Subfolder Format](#subfolder-format)
  - [Text Annotation File Format](#text-annotation-file-format)
- [ImageNet](#imagenet)
- [CIFAR](#cifar)
- [MNIST](#mnist)
- [OpenMMLab 2.0 Standard Dataset](#openmmlab-20-standard-dataset)
- [Other Datasets](#other-datasets)
- [Dataset Wrappers](#dataset-wrappers)
If your dataset is not in the above list, you can reorganize the format of your dataset to adapt to **`CustomDataset`**.
## CustomDataset
[`CustomDataset`](mmpretrain.datasets.CustomDataset) is a general dataset class for you to use your own datasets. To use `CustomDataset`, you need to organize your dataset files according to the following two formats:
### Subfolder Format
Place all samples in one folder as below:
```text
Sample files (for `with_label=True`, supervised tasks, the names of the sub-folders
are used as the category names; below, class_x and class_y represent different
categories):

data_prefix/
├── class_x
│   ├── xxx.png
│   ├── xxy.png
│   ├── ...
│   └── xxz.png
└── class_y
    ├── 123.png
    ├── nsdf3.png
    ├── ...
    └── asd932_.png

Sample files (for `with_label=False`, unsupervised tasks, all sample files under the
specified folder are used):

data_prefix/
├── folder_1
│   ├── xxx.png
│   ├── xxy.png
│   └── ...
├── 123.png
├── nsdf3.png
└── ...
```
Assume you want to use it as the training dataset; below is the configuration in your config file.
```python
train_dataloader = dict(
    ...
    # Training dataset configurations
    dataset=dict(
        type='CustomDataset',
        data_prefix='path/to/data_prefix',
        pipeline=...,
    )
)
```
```{note}
If you want to use this method, do not specify `ann_file`, or specify `ann_file=None`.
```
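
To check that the folder structure is parsed as expected, you can instantiate the dataset directly and inspect a sample. Below is a minimal sketch, assuming the supervised subfolder layout above (`path/to/data_prefix` is a placeholder):

```python
from mmpretrain.datasets import CustomDataset

# Build the dataset without any transforms, just to inspect the parsed samples.
dataset = CustomDataset(
    data_prefix='path/to/data_prefix',  # placeholder path
    with_label=True,  # use sub-folder names as category names
    pipeline=[],      # no transforms needed for inspection
)
print(len(dataset))                  # number of samples found
print(dataset.metainfo['classes'])   # category names inferred from sub-folders
print(dataset.get_data_info(0))      # e.g. {'img_path': ..., 'gt_label': ...}
```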
### Text Annotation File Format
The text annotation file format uses text files to store the path and category information. All the images are placed in the folder specified by `data_prefix`, and `ann_file` contains all the ground-truth annotations.
In the following example, the dataset directory is organized as follows:
```text
The annotation file (for `with_label=True`, supervised tasks):

folder_1/xxx.png 0
folder_1/xxy.png 1
123.png 4
nsdf3.png 3
...

The annotation file (for `with_label=False`, unsupervised tasks):

folder_1/xxx.png
folder_1/xxy.png
123.png
nsdf3.png
...

Sample files:

data_prefix/
├── folder_1
│   ├── xxx.png
│   ├── xxy.png
│   └── ...
├── 123.png
├── nsdf3.png
└── ...
```
Assume you want to use the training dataset, and the annotation file is `train_annfile.txt` as above. The annotation file contains ordinary text divided into two columns: the first column is the image path, and the second column is the **index number** of its category:
```text
folder_1/xxx.png 0
folder_1/xxy.png 1
123.png 4
nsdf3.png 3
...
```
```{note}
The index numbers of categories start from 0, and the values of the ground-truth labels should fall in the range `[0, num_classes - 1]`.
```
The annotation file only specifies the category index of every sample, so you also need to specify the `classes` field in the dataset config to record the name of every category:
```python
train_dataloader = dict(
    ...
    # Training dataset configurations
    dataset=dict(
        type='CustomDataset',
        data_root='path/to/data_root',
        ann_file='meta/train_annfile.txt',
        data_prefix='train',
        classes=['A', 'B', 'C', 'D', ...],
        pipeline=...,
    )
)
```
```{note}
If `ann_file` is specified, the dataset will be generated from the `ann_file`. Otherwise, the subfolder format is used.
```
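
If your images are already grouped into one sub-folder per class, a minimal sketch like the following could generate such an annotation file (all class names and paths here are placeholders; adapt them to your own dataset):

```python
import os

# Placeholder names and paths for illustration.
classes = ['A', 'B', 'C', 'D']
data_root = 'path/to/data_root'
data_prefix = os.path.join(data_root, 'train')

os.makedirs(os.path.join(data_root, 'meta'), exist_ok=True)
# Write one "<path relative to data_prefix> <category index>" line per image.
with open(os.path.join(data_root, 'meta', 'train_annfile.txt'), 'w') as f:
    for idx, name in enumerate(classes):
        class_dir = os.path.join(data_prefix, name)
        for fname in sorted(os.listdir(class_dir)):
            f.write(f'{name}/{fname} {idx}\n')
```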
## ImageNet
ImageNet has multiple versions, but the most commonly used one is [ILSVRC 2012](http://www.image-net.org/challenges/LSVRC/2012/). It can be accessed with the following steps.
1. Register an account and log in to the [download page](http://www.image-net.org/download-images).
2. Find the download links for ILSVRC2012 and download the following two files:
   - ILSVRC2012_img_train.tar (~138GB)
   - ILSVRC2012_img_val.tar (~6.3GB)
3. Untar the downloaded files.
4. Download and untar the meta data from this [link](https://download.openmmlab.com/mmclassification/datasets/imagenet/meta/caffe_ilsvrc12.tar.gz).
5. Re-organize the image files according to the paths in the meta data; the structure should look like:
```text
imagenet/
├── meta/
│   ├── train.txt
│   ├── test.txt
│   └── val.txt
├── train/
│   ├── n01440764
│   │   ├── n01440764_10026.JPEG
│   │   ├── n01440764_10027.JPEG
│   │   ├── n01440764_10029.JPEG
│   │   ├── n01440764_10040.JPEG
│   │   ├── n01440764_10042.JPEG
│   │   ├── n01440764_10043.JPEG
│   │   └── n01440764_10048.JPEG
│   ├── ...
├── val/
│   ├── ILSVRC2012_val_00000001.JPEG
│   ├── ILSVRC2012_val_00000002.JPEG
│   ├── ILSVRC2012_val_00000003.JPEG
│   ├── ILSVRC2012_val_00000004.JPEG
│   ├── ...
```
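
Optionally, you can sanity-check that every image referenced in the meta files actually exists on disk before training. A minimal sketch, assuming the layout above (`imagenet` is a placeholder for your root folder):

```python
import os

data_root = 'imagenet'  # the root folder from the layout above

# Each line of meta/train.txt is "<relative image path> <category index>".
with open(os.path.join(data_root, 'meta', 'train.txt')) as f:
    for line in f:
        rel_path, _label = line.strip().rsplit(' ', 1)
        img = os.path.join(data_root, 'train', rel_path)
        assert os.path.isfile(img), f'missing image: {img}'
```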
Then, you can use the [`ImageNet`](mmpretrain.datasets.ImageNet) dataset with the configurations below:
```python
train_dataloader = dict(
    ...
    # Training dataset configurations
    dataset=dict(
        type='ImageNet',
        data_root='imagenet_folder',
        ann_file='meta/train.txt',
        data_prefix='train/',
        pipeline=...,
    )
)

val_dataloader = dict(
    ...
    # Validation dataset configurations
    dataset=dict(
        type='ImageNet',
        data_root='imagenet_folder',
        ann_file='meta/val.txt',
        data_prefix='val/',
        pipeline=...,
    )
)

test_dataloader = val_dataloader
```
## CIFAR
We support downloading the [`CIFAR10`](mmpretrain.datasets.CIFAR10) and [`CIFAR100`](mmpretrain.datasets.CIFAR100) datasets automatically; you just need to specify the download folder in the `data_root` field. Specify `test_mode=False` to use the training dataset or `test_mode=True` to use the test dataset.
```python
train_dataloader = dict(
    ...
    # Training dataset configurations
    dataset=dict(
        type='CIFAR10',
        data_root='data/cifar10',
        test_mode=False,
        pipeline=...,
    )
)

val_dataloader = dict(
    ...
    # Validation dataset configurations
    dataset=dict(
        type='CIFAR10',
        data_root='data/cifar10',
        test_mode=True,
        pipeline=...,
    )
)

test_dataloader = val_dataloader
```
## MNIST
We support downloading the [MNIST](mmpretrain.datasets.MNIST) and [Fashion-MNIST](mmpretrain.datasets.FashionMNIST) datasets automatically; you just need to specify the download folder in the `data_root` field. Specify `test_mode=False` to use the training dataset or `test_mode=True` to use the test dataset.
```python
train_dataloader = dict(
    ...
    # Training dataset configurations
    dataset=dict(
        type='MNIST',
        data_root='data/mnist',
        test_mode=False,
        pipeline=...,
    )
)

val_dataloader = dict(
    ...
    # Validation dataset configurations
    dataset=dict(
        type='MNIST',
        data_root='data/mnist',
        test_mode=True,
        pipeline=...,
    )
)

test_dataloader = val_dataloader
```
## OpenMMLab 2.0 Standard Dataset
To facilitate the training of multi-task algorithm models, we unify the dataset interfaces of different tasks. OpenMMLab has formulated the **OpenMMLab 2.0 Dataset Format Specification**. When starting a training task, users can convert their dataset annotations into the specified format and use the OpenMMLab algorithm libraries to perform training and testing based on the annotation files.

The OpenMMLab 2.0 Dataset Format Specification stipulates that the annotation file must be in `json`, `yaml`/`yml`, or `pickle`/`pkl` format. The dictionary stored in the annotation file must contain the `metainfo` and `data_list` fields. The value of `metainfo` is a dictionary containing the meta information of the dataset, and the value of `data_list` is a list in which every element is a dictionary that defines one raw data item; each raw data item contains one or several training/testing samples.

The following is an example of a JSON annotation file (in this example, each raw data item contains only one training/test sample):
```
{
    'metainfo':
        {
            'classes': ('cat', 'dog'),  # the category index of 'cat' is 0 and 'dog' is 1.
            ...
        },
    'data_list':
        [
            {
                'img_path': "xxx/xxx_0.jpg",
                'img_label': 0,
                ...
            },
            {
                'img_path': "xxx/xxx_1.jpg",
                'img_label': 1,
                ...
            },
            ...
        ]
}
```
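
As a sketch, such an annotation file could be produced with `mmengine.dump` from a list of path/label pairs (the sample paths and class names below are placeholders following the example above):

```python
import mmengine

# Placeholder (path, label) pairs following the example above.
samples = [('xxx/xxx_0.jpg', 0), ('xxx/xxx_1.jpg', 1)]

ann = dict(
    metainfo=dict(classes=('cat', 'dog')),
    data_list=[dict(img_path=path, img_label=label) for path, label in samples],
)
# The output format is inferred from the file extension.
mmengine.dump(ann, 'data/annotations/train.json')
```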
Assume you want to use the training dataset, and the dataset is stored with the following structure:
```text
data
├── annotations
│   ├── train.json
├── train
│   ├── xxx/xxx_0.jpg
│   ├── xxx/xxx_1.jpg
│   ├── ...
```
Build the dataset from the following config dict:
```python
train_dataloader = dict(
    ...
    dataset=dict(
        type='BaseDataset',
        data_root='data',
        ann_file='annotations/train.json',
        data_prefix='train/',
        pipeline=...,
    )
)
```
## Other Datasets
To find more datasets supported by MMPretrain and to get more configurations of the above datasets, please see the [dataset documentation](mmpretrain.datasets).
## Dataset Wrappers
The following dataset wrappers are supported by MMEngine; you can refer to the {external+mmengine:doc}`MMEngine tutorial <advanced_tutorials/basedataset>` to learn how to use them. A configuration sketch follows the list below.
- [ConcatDataset](https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/basedataset.md#concatdataset)
- [RepeatDataset](https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/basedataset.md#repeatdataset)
- [ClassBalancedDataset](https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/basedataset.md#classbalanceddataset)
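
For example, a `RepeatDataset` wrapper repeats the wrapped dataset several times within one epoch, which is useful for small datasets. A configuration sketch, reusing the `CustomDataset` example from above:

```python
train_dataloader = dict(
    ...
    dataset=dict(
        type='RepeatDataset',
        times=3,  # repeat the wrapped dataset 3 times per epoch
        dataset=dict(
            type='CustomDataset',
            data_prefix='path/to/data_prefix',
            pipeline=...,
        ),
    )
)
```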
MMPretrain also supports [KFoldDataset](mmpretrain.datasets.KFoldDataset); please use it with `tools/kfold-cross-valid.py`.