# Tutorial 4: Custom Data Pipelines
## Design of Data Pipelines

Following typical conventions, we use `Dataset` and `DataLoader` for data loading
with multiple workers. Indexing a `Dataset` returns a dict of data items corresponding to
the arguments of the model's forward method.

The data preparation pipeline and the dataset are decoupled. Usually, the dataset
defines how to process the annotations, while the data pipeline defines all the steps to prepare a data dict.
A pipeline consists of a sequence of operations. Each operation takes a dict as input and outputs a dict for the next transform.
The operations are categorized into data loading, pre-processing, and formatting.
Here is a pipeline example for ResNet-50 training on ImageNet.
```python
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label'])
]
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='Resize', size=256),
    dict(type='CenterCrop', crop_size=224),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='Collect', keys=['img'])
]
```
For each operation, we list the related dict fields that are added/updated/removed.
At the end of the pipeline, we use `Collect` to only retain the necessary items for forward computation.
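The dict-in / dict-out contract above can be illustrated with a minimal, hypothetical `Compose`-style runner. The transforms here are toy stand-ins, not the actual mmcls implementations:

```python
class Compose:
    """Minimal sketch of a pipeline runner: applies transforms in order.

    Each transform takes the results dict and returns an (updated) dict.
    """

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, results):
        for transform in self.transforms:
            results = transform(results)
            if results is None:  # a transform may drop a sample
                return None
        return results


# Toy transforms following the dict-in / dict-out convention.
def load(results):
    results['img'] = 'decoded-pixels'  # placeholder for the image array
    return results

def flip(results):
    results['flip'] = True
    return results


pipeline = Compose([load, flip])
print(pipeline({'filename': 'a.jpg'}))
# {'filename': 'a.jpg', 'img': 'decoded-pixels', 'flip': True}
```

The real pipeline works the same way, only with registered transform classes built from the config dicts.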
### Data loading
`LoadImageFromFile`

- add: img, img_shape, ori_shape

By default, `LoadImageFromFile` loads images from disk, but this can become an IO bottleneck for small, efficient models whose computation is fast relative to data loading.
mmcv supports various backends to accelerate this process. For example, if the training machines have set up
[memcached](https://memcached.org/), we can revise the config as follows.
```python
import os.path as osp

memcached_root = '/mnt/xxx/memcached_client/'
train_pipeline = [
    dict(
        type='LoadImageFromFile',
        file_client_args=dict(
            backend='memcached',
            server_list_cfg=osp.join(memcached_root, 'server_list.conf'),
            client_cfg=osp.join(memcached_root, 'client.conf'))),
]
```
More supported backends can be found in [mmcv.fileio.FileClient](https://github.com/open-mmlab/mmcv/blob/master/mmcv/fileio/file_client.py).
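Conceptually, every backend exposes the same bytes-loading interface, so swapping backends does not change the rest of the pipeline. A self-contained sketch of the idea, using made-up class and function names rather than mmcv's actual API:

```python
import tempfile

class HardDiskBackend:
    """Reads raw bytes from the local filesystem."""

    def get(self, filepath):
        with open(filepath, 'rb') as f:
            return f.read()

# Other backends (memcached, ceph, http, ...) would implement the same
# `get(filepath) -> bytes` interface.
BACKENDS = {'disk': HardDiskBackend}

def build_file_client(backend='disk', **kwargs):
    return BACKENDS[backend](**kwargs)

# Demo: write some bytes, then read them back through the backend.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'\x89PNG fake image bytes')
    path = f.name

client = build_file_client('disk')
assert client.get(path) == b'\x89PNG fake image bytes'
```

The loading transform only ever calls `get`, so a config switch is enough to change where the bytes come from.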
### Pre-processing

`Resize`

- add: scale, scale_idx, pad_shape, scale_factor, keep_ratio
- update: img, img_shape

`RandomFlip`

- add: flip, flip_direction
- update: img

`RandomCrop`

- update: img, pad_shape

`Normalize`

- add: img_norm_cfg
- update: img
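The arithmetic behind `Normalize` is a per-channel standardization using the values in `img_norm_cfg` (which are on the 0-255 scale); a minimal NumPy sketch:

```python
import numpy as np

mean = np.array([123.675, 116.28, 103.53])
std = np.array([58.395, 57.12, 57.375])

img = np.full((224, 224, 3), 128.0)  # toy RGB image in HWC layout
out = (img - mean) / std             # broadcasts over the channel axis

print(out.shape, out[0, 0])
```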
### Formatting

`ToTensor`

- update: specified by `keys`.

`ImageToTensor`

- update: specified by `keys`.

`Collect`

- remove: all keys except those specified by `keys`
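The key-filtering behavior of `Collect` boils down to a dict comprehension (the real transform additionally gathers meta information, which this sketch omits):

```python
def collect(results, keys):
    # Keep only the entries the model's forward method needs.
    return {k: results[k] for k in keys}

data = {'img': 'img-tensor', 'gt_label': 3, 'filename': 'a.jpg', 'flip': True}
print(collect(data, keys=['img', 'gt_label']))
# {'img': 'img-tensor', 'gt_label': 3}
```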
## Extend and use custom pipelines

1. Write a new pipeline in any file, e.g., `my_pipeline.py`, and place it in
   the folder `mmcls/datasets/pipelines/`. The pipeline class needs to implement
   the `__call__` method, which takes a dict as input and returns a dict.
```python
from mmcls.datasets import PIPELINES


@PIPELINES.register_module()
class MyTransform(object):

    def __call__(self, results):
        # apply transforms on results['img']
        return results
```
2. Import the new class in `mmcls/datasets/pipelines/__init__.py`.

```python
...
from .my_pipeline import MyTransform

__all__ = [
    ..., 'MyTransform'
]
```
3. Use it in config files.

```python
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', size=224),
    dict(type='RandomFlip', flip_prob=0.5, direction='horizontal'),
    dict(type='MyTransform'),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='ImageToTensor', keys=['img']),
    dict(type='ToTensor', keys=['gt_label']),
    dict(type='Collect', keys=['img', 'gt_label'])
]
```
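To see how such a config list turns into callable transforms, here is a self-contained sketch in which a plain dict stands in for the real `PIPELINES` registry (all names here are illustrative only):

```python
# Toy registry standing in for mmcls' PIPELINES.
REGISTRY = {}

def register_module(cls):
    REGISTRY[cls.__name__] = cls
    return cls

@register_module
class MyTransform:

    def __call__(self, results):
        results['my_transform_applied'] = True
        return results

def build_pipeline(cfgs):
    """Instantiate each dict(type=..., **kwargs) config into a transform."""
    transforms = []
    for cfg in cfgs:
        cfg = dict(cfg)  # copy so we can pop safely
        cls = REGISTRY[cfg.pop('type')]
        transforms.append(cls(**cfg))
    return transforms

pipeline = build_pipeline([dict(type='MyTransform')])
results = {'img': 'raw'}
for transform in pipeline:
    results = transform(results)
print(results)
# {'img': 'raw', 'my_transform_applied': True}
```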
## Pipeline visualization

After designing data pipelines, you can use the [visualization tools](../tools/visualization.md) to inspect the transformed results.