# Add New Datasets
## Customize datasets by reorganizing data
The simplest way is to reorganize your dataset into the folder structure expected by MMSegmentation.
An example of the file structure is as follows.
```none
├── data
│   ├── my_dataset
│   │   ├── img_dir
│   │   │   ├── train
│   │   │   │   ├── xxx{img_suffix}
│   │   │   │   ├── yyy{img_suffix}
│   │   │   │   ├── zzz{img_suffix}
│   │   │   ├── val
│   │   ├── ann_dir
│   │   │   ├── train
│   │   │   │   ├── xxx{seg_map_suffix}
│   │   │   │   ├── yyy{seg_map_suffix}
│   │   │   │   ├── zzz{seg_map_suffix}
│   │   │   ├── val
```
A training pair consists of the files in `img_dir`/`ann_dir` that share the same filename prefix and differ only in suffix.
If the `split` argument is given, only part of the files in `img_dir`/`ann_dir` will be loaded.
We may specify the prefixes of the files we would like to include in the split txt file.
More specifically, for a split txt like the following,
```none
xxx
zzz
```
Only
`data/my_dataset/img_dir/train/xxx{img_suffix}`,
`data/my_dataset/img_dir/train/zzz{img_suffix}`,
`data/my_dataset/ann_dir/train/xxx{seg_map_suffix}`,
`data/my_dataset/ann_dir/train/zzz{seg_map_suffix}` will be loaded.
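
For reference, a minimal sketch of a dataset config that points at this layout might look like the following. The class name `MyDataset` and the split file path `splits/train.txt` are illustrative assumptions, and `train_pipeline` is assumed to be defined elsewhere in your config.

```python
train_dataloader = dict(
    batch_size=2,
    num_workers=2,
    dataset=dict(
        type='MyDataset',  # a hypothetical custom dataset class registered in MMSegmentation
        data_root='data/my_dataset',
        data_prefix=dict(
            img_path='img_dir/train',
            seg_map_path='ann_dir/train'),
        # Optional: a split file listing the file prefixes (one per line) to load.
        ann_file='splits/train.txt',
        pipeline=train_pipeline))
```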
:::{note}
The annotations are single-channel images of shape (H, W), whose pixel values should fall in the range `[0, num_classes - 1]`.
You may use the `'P'` mode of [pillow](https://pillow.readthedocs.io/en/stable/handbook/concepts.html#palette) to create your annotation images with color.
:::
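
As a sketch, the snippet below saves such an annotation as a palette image with Pillow; the label values, palette colors, and output path are made-up examples.

```python
import numpy as np
from PIL import Image

# A toy (H, W) label map whose values must lie in [0, num_classes - 1].
seg_map = np.zeros((512, 512), dtype=np.uint8)
seg_map[100:200, 100:200] = 1

# Convert to 'P' (palette) mode and attach RGB colors so the labels are viewable.
ann = Image.fromarray(seg_map).convert('P')
ann.putpalette([0, 0, 0,      # class 0 -> black
                255, 0, 0])   # class 1 -> red
ann.save('data/my_dataset/ann_dir/train/xxx.png')
```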
## Customize datasets by mixing dataset
MMSegmentation also supports mixing datasets for training.
Currently it supports concatenating, repeating, and multi-image mixing of datasets.
### Repeat dataset
We use `RepeatDataset` as a wrapper to repeat a dataset.
For example, suppose the original dataset is `Dataset_A`; to repeat it, the config looks like the following.
```python
dataset_A_train = dict(
    type='RepeatDataset',
    times=N,
    dataset=dict(  # This is the original config of Dataset_A
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
```
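
The wrapped dataset is then used in place of the original one, e.g. a minimal sketch of the train dataloader:

```python
train_dataloader = dict(dataset=dataset_A_train)
```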
### Concatenate dataset
If the datasets you want to concatenate are different, you can concatenate their configs as follows.
```python
dataset_A_train = dict()
dataset_B_train = dict()
concatenate_dataset = dict(
    type='ConcatDataset',
    datasets=[dataset_A_train, dataset_B_train])
```
A more complex example that repeats `Dataset_A` and `Dataset_B` by N and M times, respectively, and then concatenates the repeated datasets is as follows.
```python
dataset_A_train = dict(
    type='RepeatDataset',
    times=N,
    dataset=dict(
        type='Dataset_A',
        ...
        pipeline=train_pipeline
    )
)
dataset_A_val = dict(
    ...
    pipeline=test_pipeline
)
dataset_A_test = dict(
    ...
    pipeline=test_pipeline
)
dataset_B_train = dict(
    type='RepeatDataset',
    times=M,
    dataset=dict(
        type='Dataset_B',
        ...
        pipeline=train_pipeline
    )
)
train_dataloader = dict(
    dataset=dict(type='ConcatDataset', datasets=[dataset_A_train, dataset_B_train]))
val_dataloader = dict(dataset=dataset_A_val)
test_dataloader = dict(dataset=dataset_A_test)
```
You can refer to the base dataset [tutorial](https://mmengine.readthedocs.io/en/latest/advanced_tutorials/basedataset.html) from MMEngine for more details.
### Multi-image Mix Dataset
We use `MultiImageMixDataset` as a wrapper to mix images from multiple datasets.
`MultiImageMixDataset` can be used with multi-image mixed data augmentations
such as Mosaic and MixUp.
An example of using `MultiImageMixDataset` with `Mosaic` data augmentation:
```python
train_pipeline = [
    dict(type='RandomMosaic', prob=1),
    dict(type='Resize', img_scale=(1024, 512), keep_ratio=True),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PackSegInputs')
]

train_dataset = dict(
    type='MultiImageMixDataset',
    dataset=dict(
        classes=classes,
        palette=palette,
        type=dataset_type,
        reduce_zero_label=False,
        img_dir=data_root + "images/train",
        ann_dir=data_root + "annotations/train",
        pipeline=[
            dict(type='LoadImageFromFile'),
            dict(type='LoadAnnotations'),
        ]
    ),
    pipeline=train_pipeline
)
```
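
The wrapped `train_dataset` is then passed to the train dataloader as usual, e.g. a minimal sketch (batch size and worker count are illustrative):

```python
train_dataloader = dict(
    batch_size=4,
    num_workers=4,
    dataset=train_dataset)
```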