mmocr/docs/en/basic_concepts/transforms.md

# Data Transforms and Pipeline

In the design of MMOCR, dataset construction and preparation are decoupled. That is, dataset construction classes such as [`OCRDataset`](mmocr.datasets.ocr_dataset.OCRDataset) are responsible for loading and parsing annotation files; while data transforms further apply data preprocessing, augmentation, formatting, and other related functions. Currently, there are five types of data transforms implemented in MMOCR, as shown in the following table.

|                                  |                                                                       |                                                                                                     |
| -------------------------------- | --------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| Transforms Type                  | File                                                                  | Description                                                                                         |
| Data Loading                     | loading.py                                                            | Implemented the data loading functions.                                                             |
| Data Formatting                  | formatting.py                                                         | Formatting the data required by different tasks.                                                    |
| Cross Project Data Adapter       | adapters.py                                                           | Converting the data format between other OpenMMLab projects and MMOCR.                              |
| Data Augmentation Functions      | ocr_transforms.py<br>textdet_transforms.py<br>textrecog_transforms.py | Various built-in data augmentation methods designed for different tasks.                            |
| Wrappers of Third Party Packages | wrappers.py                                                           | Wrapping the transforms implemented in popular third party packages such as [ImgAug](https://github.com/aleju/imgaug), and adapting them to MMOCR format. |

Since each data transform class is independent of each other, we can easily combine any data transforms to build a data pipeline after we have defined the data fields. As shown in the following figure, in MMOCR, a typical training data pipeline consists of three stages: **data loading**, **data augmentation**, and **data formatting**. Users only need to define the data pipeline list in the configuration file and specify the specific data transform class and its parameters:

<div align="center">

![Flowchart](https://user-images.githubusercontent.com/45810070/192265787-86c02e51-5f4a-4686-b2a1-255a7db98ada.jpg)

</div>

```python
train_pipeline_r18 = [
    # Loading images
    dict(
        type='LoadImageFromFile',
        color_type='color_ignore_orientation'),
    # Loading annotations
    dict(
        type='LoadOCRAnnotations',
        with_polygon=True,
        with_bbox=True,
        with_label=True,
    ),
    # Data augmentation
    dict(
        type='ImgAugWrapper',
        args=[['Fliplr', 0.5],
              dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]]),
    dict(type='RandomCrop', min_side_ratio=0.1),
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='Pad', size=(640, 640)),
    # Data formatting
    dict(
        type='PackTextDetInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
```

```{tip}
More tutorials about data pipeline configuration can be found in the [Config Doc](../user_guides/config.md#data-pipeline-configuration). Next, we will briefly introduce the data transforms supported in MMOCR according to their categories.
```

For each data transform, MMOCR provides a detailed docstring. For example, in the header of each data transform class, we annotate `Required Keys`, `Modified Keys` and `Added Keys`. The `Required Keys` represent the mandatory fields that should be included in the input required by the data transform, while the `Modified Keys` and `Added Keys` indicate that the transform may modify or add the fields into the original data. For example, `LoadImageFromFile` implements the image loading function, whose `Required Keys` is the image path `img_path`, and the `Modified Keys` includes the loaded image `img`, the current size of the image `img_shape`, the original size of the image `ori_shape`, and other image attributes.

```python
@TRANSFORMS.register_module()
class LoadImageFromFile(MMCV_LoadImageFromFile):
    # We provide detailed docstring for each data transform.
    """Load an image from file.

    Required Keys:

    - img_path

    Modified Keys:

    - img
    - img_shape
    - ori_shape
    """
```

```{note}
In the data pipeline of MMOCR, the image and label information are saved in a dictionary. By using the unified fields, the data can be freely transferred between different data transforms. Therefore, it is very important to understand the conventional fields used in MMOCR.
```

For your convenience, the following table lists the conventional keys used in MMOCR data transforms.

|                  |                                   |                                                                                                                                                         |
| ---------------- | --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Key              | Type                              | Description                                                                                                                                             |
| img              | `np.array(dtype=np.uint8)`        | Image array, shape of `(h, w, c)`.                                                                                                                      |
| img_shape        | `tuple(int, int)`                 | Current image size `(h, w)`.                                                                                                                            |
| ori_shape        | `tuple(int, int)`                 | Original image size `(h, w)`.                                                                                                                           |
| scale            | `tuple(int, int)`                 | Stores the target image size `(h, w)` specified by the user in the `Resize` data transform series. Note: This value may not correspond to the actual image size after the transformation. |
| scale_factor     | `tuple(float, float)`             | Stores the target image scale factor `(w_scale, h_scale)` specified by the user in the `Resize` data transform series. Note: This value may not correspond to the actual image size after the transformation. |
| keep_ratio       | `bool`                            | Boolean flag determines whether to keep the aspect ratio while scaling images.                                                                          |
| flip             | `bool`                            | Boolean flags to indicate whether the image has been flipped.                                                                                           |
| flip_direction   | `str`                             | Flipping direction, options are `horizontal`, `vertical`, `diagonal`.                                                                                   |
| gt_bboxes        | `np.array(dtype=np.float32)`      | Ground-truth bounding boxes.                                                                                                                            |
| gt_polygons      | `list[np.array(dtype=np.float32)` | Ground-truth polygons.                                                                                                                                  |
| gt_bboxes_labels | `np.array(dtype=np.int64)`        | Category label of bounding boxes. By default, MMOCR uses `0` to represent "text" instances.                                                             |
| gt_texts         | `list[str]`                       | Ground-truth text content of the instance.                                                                                                              |
| gt_ignored       | `np.array(dtype=np.bool_)`        | Boolean flag indicating whether ignoring the instance (used in text detection).                                                                         |

## Data Loading

Data loading transforms mainly implement the functions of loading data from different formats and backends. Currently, the following data loading transforms are implemented in MMOCR:

|                    |                                                           |                                                                |                                                                 |
| ------------------ | --------------------------------------------------------- | -------------------------------------------------------------- | --------------------------------------------------------------- |
| Transforms Name    | Required Keys                                             | Modified/Added Keys                                            | Description                                                     |
| LoadImageFromFile  | `img_path`                                                | `img`<br>`img_shape`<br>`ori_shape`                            | Load image from the specified path，supporting different file storage backends (e.g. `disk`, `http`, `petrel`) and decoding backends (e.g. `cv2`, `turbojpeg`, `pillow`, `tifffile`). |
| LoadOCRAnnotations | `bbox`<br>`bbox_label`<br>`polygon`<br>`ignore`<br>`text` | `gt_bboxes`<br>`gt_bboxes_labels`<br>`gt_polygons`<br>`gt_ignored`<br>`gt_texts` | Parse the annotation required by OCR task.                      |
| LoadKIEAnnotations | `bboxes` `bbox_labels` `edge_labels`<br>`texts`           | `gt_bboxes`<br>`gt_bboxes_labels`<br>`gt_edge_labels`<br>`gt_texts`<br>`ori_shape` | Parse the annotation required by KIE task.                      |

## Data Augmentation

Data augmentation is an indispensable process in text detection and recognition tasks. Currently, MMOCR has implemented dozens of data augmentation modules commonly used in OCR fields, which are classified into [ocr_transforms.py](/mmocr/datasets/transforms/ocr_transforms.py), [textdet_transforms.py](/mmocr/datasets/transforms/textdet_transforms.py), and [textrecog_transforms.py](/mmocr/datasets/transforms/textrecog_transforms.py).

Specifically, `ocr_transforms.py` implements generic OCR data augmentation modules such as `RandomCrop` and `RandomRotate`:

|                 |                                                               |                                                                |                                                                |
| --------------- | ------------------------------------------------------------- | -------------------------------------------------------------- | -------------------------------------------------------------- |
| Transforms Name | Required Keys                                                 | Modified/Added Keys                                            | Description                                                    |
| RandomCrop      | `img`<br>`gt_bboxes`<br>`gt_bboxes_labels`<br>`gt_polygons`<br>`gt_ignored`<br>`gt_texts` (optional) | `img`<br>`img_shape`<br>`gt_bboxes`<br>`gt_bboxes_labels`<br>`gt_polygons`<br>`gt_ignored`<br>`gt_texts` (optional) | Randomly crop the image and make sure the cropped image contains at least one text instance. The optional parameter is `min_side_ratio`, which controls the ratio of the short side of the cropped image to the original image, the default value is `0.4`. |
| RandomRotate    | `img`<br>`img_shape`<br>`gt_bboxes` (optional)<br>`gt_polygons` (optional) | `img`<br>`img_shape`<br>`gt_bboxes` (optional)<br>`gt_polygons` (optional)<br>`rotated_angle` | Randomly rotate the image and optionally fill the blank areas of the rotated image. |
|                 |                                                               |                                                                |                                                                |

`textdet_transforms.py` implements text detection related data augmentation modules:

|                   |                                       |                                                                     |                                                                                 |
| ----------------- | ------------------------------------- | ------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| Transforms Name   | Required Keys                         | Modified/Added Keys                                                 | Description                                                                     |
| RandomFlip        | `img`<br>`gt_bboxes`<br>`gt_polygons` | `img`<br>`gt_bboxes`<br>`gt_polygons`<br>`flip`<br>`flip_direction` | Random flip, support `horizontal`, `vertical` and `diagonal` modes. Defaults to `horizontal`. |
| FixInvalidPolygon | `gt_polygons`<br>`gt_ignored`         | `gt_polygons`<br>`gt_ignored`                                       | Automatically fixing the invalid polygons included in the annotations.          |

`textrecog_transforms.py` implements text recognition related data augmentation modules:

|                 |               |                                                                   |                                                                                                             |
| --------------- | ------------- | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- |
| Transforms Name | Required Keys | Modified/Added Keys                                               | Description                                                                                                 |
| RescaleToHeight | `img`         | `img`<br>`img_shape`<br>`scale`<br>`scale_factor`<br>`keep_ratio` | Scales the image to the specified height while keeping the aspect ratio. When `min_width` and `max_width` are specified, the aspect ratio may be changed. |
|                 |               |                                                                   |                                                                                                             |

```{warning}
The above table only briefly introduces some selected data augmentation methods, for more information please refer to the [API documentation](../api.rst) or the code docstrings.
```

## Data Formatting

Data formatting transforms are responsible for packaging images, ground truth labels, and other information into a dictionary. Different tasks usually rely on different formatting transforms. For example:

|                     |               |                     |                                               |
| ------------------- | ------------- | ------------------- | --------------------------------------------- |
| Transforms Name     | Required Keys | Modified/Added Keys | Description                                   |
| PackTextDetInputs   | -             | -                   | Pack the inputs required by text detection.   |
| PackTextRecogInputs | -             | -                   | Pack the inputs required by text recognition. |
| PackKIEInputs       | -             | -                   | Pack the inputs required by KIE.              |

## Cross Project Data Adapters

The cross-project data adapters bridge the data formats between MMOCR and other OpenMMLab libraries such as [MMDetection](https://github.com/open-mmlab/mmdetection), making it possible to call models implemented in other OpenMMLab projects. Currently, MMOCR has implemented [`MMDet2MMOCR`](mmocr.datasets.transforms.MMDet2MMOCR) and [`MMOCR2MMDet`](mmocr.datasets.transforms.MMOCR2MMDet), allowing data to be converted between MMDetection and MMOCR formats; with these adapters, users can easily train any detectors supported by MMDetection in MMOCR. For example, we provide a [tutorial](#todo) to show how to train Mask R-CNN as a text detector in MMOCR.

|                 |                                              |                               |                                            |
| --------------- | -------------------------------------------- | ----------------------------- | ------------------------------------------ |
| Transforms Name | Required Keys                                | Modified/Added Keys           | Description                                |
| MMDet2MMOCR     | `gt_masks` `gt_ignore_flags`                 | `gt_polygons`<br>`gt_ignored` | Convert the fields used in MMDet to MMOCR. |
| MMOCR2MMDet     | `img_shape`<br>`gt_polygons`<br>`gt_ignored` | `gt_masks` `gt_ignore_flags`  | Convert the fields used in MMOCR to MMDet. |

## Wrappers

To facilitate the use of popular third-party CV libraries in MMOCR, we provide wrappers in `wrappers.py` to unify the data format between MMOCR and other third-party libraries. Users can directly configure the data transforms provided by these libraries in the configuration file of MMOCR. The supported wrappers are as follows:

|                    |                                                              |                                                               |                                                               |
| ------------------ | ------------------------------------------------------------ | ------------------------------------------------------------- | ------------------------------------------------------------- |
| Transforms Name    | Required Keys                                                | Modified/Added Keys                                           | Description                                                   |
| ImgAugWrapper      | `img`<br>`gt_polygons` (optional for text recognition)<br>`gt_bboxes` (optional for text recognition)<br>`gt_bboxes_labels` (optional for text recognition)<br>`gt_ignored` (optional for text recognition)<br>`gt_texts` (optional) | `img`<br>`gt_polygons` (optional for text recognition)<br>`gt_bboxes` (optional for text recognition)<br>`gt_bboxes_labels` (optional for text recognition)<br>`gt_ignored` (optional for text recognition)<br>`img_shape` (optional)<br>`gt_texts` (optional) | [ImgAug](https://github.com/aleju/imgaug) wrapper, which bridges the data format and configuration between ImgAug and MMOCR, allowing users to config the data augmentation methods supported by ImgAug in MMOCR. |
| TorchVisionWrapper | `img`                                                        | `img`<br>`img_shape`                                          | [TorchVision](https://github.com/pytorch/vision) wrapper, which bridges the data format and configuration between TorchVision and MMOCR, allowing users to config the data transforms supported by `torchvision.transforms` in MMOCR. |

### `ImgAugWrapper` Example

For example, in the original ImgAug, we can define a `Sequential` type data augmentation pipeline as follows to perform random flipping, random rotation and random scaling on the image:

```python
import imgaug.augmenters as iaa

aug = iaa.Sequential(
  iaa.Fliplr(0.5),                # horizontally flip 50% of all images
  iaa.Affine(rotate=(-10, 10)),   # rotate by -10 to +10 degrees
  iaa.Resize((0.5, 3.0))          # scale images to 50-300% of their size
)
```

In MMOCR, we can directly configure the above data augmentation pipeline in `train_pipeline` as follows:

```python
dict(
  type='ImgAugWrapper',
  args=[
    ['Fliplr', 0.5],
    dict(cls='Affine', rotate=[-10, 10]),
    ['Resize', [0.5, 3.0]],
  ]
)
```

Specifically, the `args` parameter accepts a list, and each element in the list can be a list or a dictionary. If it is a list, the first element of the list is the class name in `imgaug.augmenters`, and the following elements are the initialization parameters of the class; if it is a dictionary, the `cls` key corresponds to the class name in `imgaug.augmenters`, and the other key-value pairs correspond to the initialization parameters of the class.

### `TorchVisionWrapper` Example

For example, in the original TorchVision, we can define a `Compose` type data transformation pipeline as follows to perform color jittering on the image:

```python
import torchvision.transforms as transforms

aug = transforms.Compose([
  transforms.ColorJitter(
    brightness=32.0 / 255,  # brightness jittering range
    saturation=0.5)         # saturation jittering range
])
```

In MMOCR, we can directly configure the above data transformation pipeline in `train_pipeline` as follows:

```python
dict(
  type='TorchVisionWrapper',
  op='ColorJitter',
  brightness=32.0 / 255,
  saturation=0.5
)
```

Specifically, the `op` parameter is the class name in `torchvision.transforms`, and the following parameters correspond to the initialization parameters of the class.