[Docs] Add custom pipeline docs. (#1124)

* [Docs] Add custom pipeline docs.

* Fix link.

* Fix according to comments
Ma Zerun 2022-10-27 10:35:20 +08:00 committed by GitHub
parent cccbedf22d
commit 280e916979
2 changed files with 116 additions and 239 deletions

View File

# Customize Data Pipeline
## Design of Data pipelines
In the [new dataset tutorial](./datasets.md), we learned that the dataset class uses the `load_data_list` method
to initialize the entire dataset, and that the information of every sample is saved in a dict.
Usually, to save memory, we only load image paths and labels in `load_data_list`, and load the full
image content when the samples are used. Moreover, we may want to apply some random data augmentation
when picking samples during training. Almost all data loading, pre-processing, and formatting operations can be
configured in MMClassification through the **data pipeline**.
The data pipeline defines how to process the sample dict when indexing a sample from the dataset. It
consists of a sequence of data transforms. Each data transform takes a dict as input, processes it, and outputs a
dict for the next data transform.
Here is a data pipeline example for ResNet-50 training on ImageNet.
```python
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='RandomResizedCrop', scale=224),
    dict(type='RandomFlip', prob=0.5, direction='horizontal'),
    dict(type='PackClsInputs'),
]
```
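To make the dict-in/dict-out convention concrete, below is a minimal sketch that builds each transform from the registry and applies the pipeline to a single sample dict by hand. It assumes MMClassification 1.x; the image path is a placeholder and the printed keys are illustrative.

```python
from mmcls.datasets import TRANSFORMS

# A sample dict as produced by `load_data_list`: only a path and a label.
results = dict(img_path='demo/demo.JPEG', gt_label=1)  # placeholder sample

for transform_cfg in train_pipeline:
    transform = TRANSFORMS.build(transform_cfg)
    # Every transform reads the dict, adds or updates keys, and returns it.
    results = transform(results)
    print(type(transform).__name__, '->', sorted(results))

# After `PackClsInputs`, only `inputs` (the image tensor) and
# `data_samples` (a `ClsDataSample`) remain.
```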
All available data transforms in MMClassification can be found in the [data transforms docs](mmcls.datasets.transforms).
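For context, a pipeline takes effect once it is passed to the `dataset` field of a dataloader config. Below is a minimal sketch following common MMClassification 1.x ImageNet configs; the paths and batch settings are placeholders.

```python
train_dataloader = dict(
    batch_size=32,
    num_workers=4,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type='ImageNet',
        data_root='data/imagenet',   # placeholder path
        ann_file='meta/train.txt',   # placeholder annotation file
        data_prefix='train',
        pipeline=train_pipeline,     # the pipeline defined above
    ),
)
```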
## Modify the training/test pipeline
The data pipeline in MMClassification is pretty flexible. You can control almost every step of the data
preprocessing from the config file, but on the other hand, you may be confused when facing so many options.
Here are some common practices and guidance for image classification tasks.
### Loading
At the beginning of a data pipeline, we usually need to load image data from the file path.
[`LoadImageFromFile`](mmcv.transforms.LoadImageFromFile) is commonly used to do this task.
```python
train_pipeline = [
    dict(type='LoadImageFromFile'),
    ...
]
```
If you want to load data from files with special formats or special locations, you can [implement a new loading
transform](#add-new-data-transforms) and add it at the beginning of the data pipeline.
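For instance, if your images were stored as NumPy `.npy` files, a new loading transform might look like the sketch below. Everything here is hypothetical: the class name and the `.npy` assumption are made up, and the result keys simply follow the conventions of [`LoadImageFromFile`](mmcv.transforms.LoadImageFromFile).

```python
import numpy as np
from mmcv.transforms import BaseTransform
from mmcls.datasets import TRANSFORMS


@TRANSFORMS.register_module()
class LoadImageFromNpy(BaseTransform):
    """Hypothetical loader for images saved as `.npy` arrays."""

    def transform(self, results):
        img = np.load(results['img_path'])
        results['img'] = img
        results['img_shape'] = img.shape[:2]
        results['ori_shape'] = img.shape[:2]
        return results
```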
### Augmentation and other processing
During training, we usually need to do data augmentation to avoid overfitting. During testing, we also need to do
some data processing like resizing and cropping. These data transforms are placed after the loading process.
Here is a simple data augmentation recipe example. It will randomly resize and crop the input image to the
specified scale, and randomly flip the image horizontally with a given probability.
```python
train_pipeline = [
    ...
    dict(type='RandomResizedCrop', scale=224),
    dict(type='RandomFlip', prob=0.5, direction='horizontal'),
    ...
]
```
Here is a heavy data augmentation recipe example used in [Swin-Transformer](../papers/swin_transformer.md)
training. To align with the official implementation, it specifies `pillow` as the resize backend and `bicubic`
as the resize algorithm. Moreover, it adds [`RandAugment`](mmcls.datasets.transforms.RandAugment) and
[`RandomErasing`](mmcls.datasets.transforms.RandomErasing) as extra data augmentation methods.
This configuration specifies every detail of the data augmentation, so you can simply copy it into your own
config file to apply the Swin-Transformer data augmentations.
```python
bgr_mean = [103.53, 116.28, 123.675]
bgr_std = [57.375, 57.12, 58.395]

train_pipeline = [
    ...
    dict(type='RandomResizedCrop', scale=224, backend='pillow', interpolation='bicubic'),
    dict(type='RandomFlip', prob=0.5, direction='horizontal'),
    dict(
        type='RandAugment',
        policies='timm_increasing',
        num_policies=2,
        total_level=10,
        magnitude_level=9,
        magnitude_std=0.5,
        hparams=dict(
            pad_val=[round(x) for x in bgr_mean], interpolation='bicubic')),
    dict(
        type='RandomErasing',
        erase_prob=0.25,
        mode='rand',
        min_area_ratio=0.02,
        max_area_ratio=1 / 3,
        fill_color=bgr_mean,
        fill_std=bgr_std),
    ...
]
```
```{note}
Usually, the data augmentation part in the data pipeline handles only image-wise transforms, but not transforms
like image normalization or mixup/cutmix, because those can be done on batch data to accelerate processing. To
configure image normalization and mixup/cutmix, please use the
[data preprocessor](mmcls.models.utils.data_preprocessor).
```
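As a sketch of that batch-level configuration (field names follow common MMClassification 1.x configs; the statistics shown are the usual ImageNet values), image normalization lives in `data_preprocessor`, while mixup/cutmix are configured in the model's `train_cfg`:

```python
data_preprocessor = dict(
    num_classes=1000,
    # RGB normalization statistics, applied to batched image tensors.
    mean=[123.675, 116.28, 103.53],
    std=[58.395, 57.12, 57.375],
    # Convert images loaded in BGR order to RGB.
    to_rgb=True,
)

model = dict(
    type='ImageClassifier',
    # ... backbone, neck and head as usual ...
    # Batch augmentations such as mixup/cutmix, as in the Swin configs.
    train_cfg=dict(augments=[
        dict(type='Mixup', alpha=0.8),
        dict(type='CutMix', alpha=1.0),
    ]),
)
```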
### Formatting
Formatting collects the training data from the data information dict and converts it into a
model-friendly format.
In most cases, you can simply use [`PackClsInputs`](mmcls.datasets.transforms.PackClsInputs). It converts
the image from NumPy array format to a PyTorch tensor, and packs the ground-truth category and
other meta information into a [`ClsDataSample`](mmcls.structures.ClsDataSample).
```python
train_pipeline = [
    ...
    dict(type='PackClsInputs'),
]
```
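To see what the packed sample looks like, here is a small sketch applying `PackClsInputs` directly. It assumes MMClassification 1.x; the image array and label are arbitrary stand-ins.

```python
import numpy as np
from mmcls.datasets.transforms import PackClsInputs

results = dict(
    img=np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8),
    gt_label=3,           # an arbitrary category index
    img_shape=(224, 224),
    ori_shape=(256, 256),
)
packed = PackClsInputs()(results)
print(packed['inputs'].shape)   # the image tensor, torch.Size([3, 224, 224])
print(packed['data_samples'])   # a ClsDataSample with the label and meta info
```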
## Add new data transforms
1. Write a new data transform in any file, e.g., `my_transform.py`, and place it in
the folder `mmcls/datasets/transforms/`. The data transform class needs to inherit
the [`mmcv.transforms.BaseTransform`](mmcv.transforms.BaseTransform) class and override
the `transform` method, which takes a dict as input and returns a dict. (A fuller
hypothetical example is given after this list.)
```python
from mmcv.transforms import BaseTransform
from mmcls.datasets import TRANSFORMS


@TRANSFORMS.register_module()
class MyTransform(BaseTransform):

    def transform(self, results):
        # Modify the data information dict `results`.
        return results
```
2. Import the new class in `mmcls/datasets/transforms/__init__.py`.
```python
...
from .my_transform import MyTransform

__all__ = [
    ..., 'MyTransform'
]
```
3. Use it in config files.
```python
train_pipeline = [
    ...
    dict(type='MyTransform'),
    ...
]
```
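As the fuller hypothetical example promised in step 1, here is a sketch of a transform that adds Gaussian noise to the image. The class name, the `sigma` parameter, and the noise logic are all invented for illustration; only the registry and base-class usage follow the steps above.

```python
import numpy as np
from mmcv.transforms import BaseTransform
from mmcls.datasets import TRANSFORMS


@TRANSFORMS.register_module()
class AddGaussianNoise(BaseTransform):
    """Hypothetical transform: add Gaussian noise to `results['img']`."""

    def __init__(self, sigma=10.0):
        self.sigma = sigma

    def transform(self, results):
        img = results['img'].astype(np.float32)
        noise = np.random.normal(0.0, self.sigma, img.shape)
        results['img'] = np.clip(img + noise, 0, 255).astype(np.uint8)
        return results
```

It would then appear in a pipeline as `dict(type='AddGaussianNoise', sigma=10.0)`, placed after the loading transform.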

View File

# Customize Data Pipeline (To be updated)
Please refer to the [English documentation](https://mmclassification.readthedocs.io/en/dev-1.x/advanced_guides/pipeline.html). If you are interested in contributing to the Chinese translation of this document, you are welcome to sign up in the [discussion forum](https://github.com/open-mmlab/mmclassification/discussions/1027).