# Dataset Preparer (Beta)
```{note}
Dataset Preparer is still in beta and might not be stable enough. You are
welcome to try it out and report any issues to us.
```
## One-click data preparation script
MMOCR provides a unified one-stop data preparation script `prepare_dataset.py`.
Only one line of command is needed to complete the data download, decompression, and format conversion.
```bash
python tools/dataset_converters/prepare_dataset.py [$DATASET_NAME] --task [$TASK] --nproc [$NPROC]
```
| ARGS               | Type | Description                                                                                                                             |
| ------------------ | ---- | --------------------------------------------------------------------------------------------------------------------------------------- |
| dataset_name       | str  | (required) dataset name.                                                                                                                  |
| --nproc            | int  | Number of processes to be used. Defaults to 4.                                                                                            |
| --task             | str  | Convert the dataset to the format of a specified task supported by MMOCR. Options are 'textdet', 'textrecog', 'textspotting', and 'kie'. |
| --splits           | str  | Splits of the dataset to be prepared. Multiple splits can be accepted. Defaults to `train val test`.                                      |
| --lmdb             | bool | Store the data in LMDB format. Only valid when the task is `textrecog`.                                                                   |
| --overwrite-cfg    | bool | Whether to overwrite the dataset config file if it already exists in `configs/{task}/_base_/datasets`.                                   |
| --dataset-zoo-path | str  | Path to the dataset config files. If not specified, the default path is `./dataset_zoo`.                                                  |
For example, the following command shows how to use the script to prepare the ICDAR2015 dataset for the text detection task.
```bash
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet
```
The script also supports preparing multiple datasets at the same time. For example, the following command shows how to prepare the ICDAR2015 and TotalText datasets for the text recognition task.
```bash
python tools/dataset_converters/prepare_dataset.py icdar2015 totaltext --task textrecog
```
To check the datasets supported by Dataset Preparer, please refer to [Dataset Zoo](./datasetzoo.md). Some other datasets that need to be prepared manually are listed in [Text Detection](./det.md) and [Text Recognition](./recog.md).
## Advanced Usage
### LMDB Format
In text recognition tasks, we usually store data in LMDB format to speed up data loading. When preparing data with the `prepare_dataset.py` script, you can store the data in LMDB format by passing the `--lmdb` flag. For example:
```bash
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textrecog --lmdb
```
Once the dataset is prepared, Dataset Preparer will generate `icdar2015_lmdb.py` in the `configs/textrecog/_base_/datasets/` directory. You can inherit this file and point the `dataloader` to the LMDB dataset. Moreover, the LMDB dataset needs to be loaded by [`LoadImageFromNDArray`](mmocr.datasets.transforms.LoadImageFromNDArray), so you also need to modify the `pipeline`.
For example, if we want to switch the training set of `configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py` to the ICDAR2015 LMDB dataset generated above, we need to make the following modifications:
1. Modify `configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py`:
```python
_base_ = [
    '../_base_/datasets/icdar2015_lmdb.py',  # point to icdar2015 lmdb dataset
    ...
]

train_list = [_base_.icdar2015_lmdb_textrecog_train]
...
```
2. Modify `train_pipeline` in `configs/textrecog/crnn/_base_crnn_mini-vgg.py`, change `LoadImageFromFile` to `LoadImageFromNDArray`:
```python
train_pipeline = [
    dict(
        type='LoadImageFromNDArray',
        color_type='grayscale',
        file_client_args=file_client_args,
        ignore_empty=True,
        min_size=2),
    ...
]
```
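No further changes are usually needed beyond these two steps, because the base config already feeds `train_list` into the training dataloader. For reference, that wiring typically goes through a `ConcatDataset`, roughly as in the following sketch (field values are illustrative and may differ from the actual base config):
```python
# Illustrative sketch of how train_list is typically consumed downstream;
# the actual values live in the base configs and may differ.
train_dataset = dict(
    type='ConcatDataset',
    datasets=train_list,             # the LMDB dataset list defined in step 1
    pipeline=_base_.train_pipeline)  # the pipeline modified in step 2

train_dataloader = dict(
    batch_size=64,
    num_workers=8,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=train_dataset)
```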
### Configuration of Dataset Preparer
Dataset Preparer uses a modular design to enhance extensibility, which allows users to extend it to other public or private datasets easily. The configuration files of the dataset preparer are stored in `dataset_zoo/`, where the configs of all currently supported datasets can be found. The directory structure is as follows:
```text
dataset_zoo/
├── icdar2015
│   ├── metafile.yml
│   ├── textdet.py
│   ├── textrecog.py
│   └── textspotting.py
└── wildreceipt
    ├── metafile.yml
    ├── kie.py
    ├── textdet.py
    ├── textrecog.py
    └── textspotting.py
```
`metafile.yml` is the metafile of the dataset, which contains basic information such as the year of publication, the authors of the paper, and the license. The other files, named after the tasks, are the configuration files of the dataset preparer, which configure the download, decompression, and format conversion of the dataset. These configs are in Python format, and their usage is fully consistent with the configuration files in the MMOCR repo. See [Configuration File Documentation](../config.md) for detailed usage.
#### Metafile
Take the ICDAR2015 dataset as an example, the `metafile.yml` stores the basic information of the dataset:
```yaml
Name: 'Incidental Scene Text IC15'
Paper:
  Title: ICDAR 2015 Competition on Robust Reading
  URL: https://rrc.cvc.uab.es/files/short_rrc_2015.pdf
  Venue: ICDAR
  Year: '2015'
  BibTeX: '@inproceedings{karatzas2015icdar,
  title={ICDAR 2015 competition on robust reading},
  author={Karatzas, Dimosthenis and Gomez-Bigorda, Lluis and Nicolaou, Anguelos and Ghosh, Suman and Bagdanov, Andrew and Iwamura, Masakazu and Matas, Jiri and Neumann, Lukas and Chandrasekhar, Vijay Ramaseshan and Lu, Shijian and others},
  booktitle={2015 13th international conference on document analysis and recognition (ICDAR)},
  pages={1156--1160},
  year={2015},
  organization={IEEE}}'
Data:
  Website: https://rrc.cvc.uab.es/?ch=4
  Language:
    - English
  Scene:
    - Natural Scene
  Granularity:
    - Word
  Tasks:
    - textdet
    - textrecog
    - textspotting
  License:
    Type: CC BY 4.0
    Link: https://creativecommons.org/licenses/by/4.0/
```
The metafile is not required during dataset preparation (so users can ignore this file when preparing private datasets), but we recommend reading the metafile before preparing a public dataset, as it helps to understand whether the dataset meets your needs.
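For example, one convenient way to check this information programmatically is to load the metafile with PyYAML; the snippet below is merely a sketch that assumes the default `dataset_zoo/` layout shown above:
```python
import yaml  # requires PyYAML

# Inspect the ICDAR2015 metafile: which tasks it covers and its license.
with open('dataset_zoo/icdar2015/metafile.yml') as f:
    meta = yaml.safe_load(f)

print(meta['Data']['Tasks'])            # ['textdet', 'textrecog', 'textspotting']
print(meta['Data']['License']['Type'])  # CC BY 4.0
```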
```{warning}
The following section is outdated as of MMOCR 1.0.0rc6.
```
#### Config of Dataset Preparer
Next, we will introduce the conventional fields and usage of the dataset preparer configuration files.
In the configuration files, there are two fields, `data_root` and `cache_path`, which specify where to store the converted dataset and the temporary files (such as the archives downloaded during data preparation), respectively.
```python
data_root = './data/icdar2015'
cache_path = './data/cache'
```
Data preparation usually consists of two steps: "raw data preparation" and "format conversion and saving". Therefore, the `data_obtainer` and `data_converter` fields are used to configure the behavior of these two steps. In some cases, users can omit `data_converter` to only download and decompress the raw data without performing format conversion and saving; or, for a locally stored dataset, omit `data_obtainer` to only perform format conversion and saving.
Take the text detection task of the ICDAR2015 dataset (`dataset_zoo/icdar2015/textdet.py`) as an example:
```python
data_obtainer = dict(
    type='NaiveDataObtainer',
    cache_path=cache_path,
    data_root=data_root,
    files=[
        dict(
            url='https://rrc.cvc.uab.es/downloads/ch4_training_images.zip',
            save_name='ic15_textdet_train_img.zip',
            md5='c51cbace155dcc4d98c8dd19d378f30d',
            split=['train'],
            content=['image'],
            mapping=[['ic15_textdet_train_img', 'imgs/train']]),
        dict(
            url='https://rrc.cvc.uab.es/downloads/ch4_test_images.zip',
            save_name='ic15_textdet_test_img.zip',
            md5='97e4c1ddcf074ffcc75feff2b63c35dd',
            split=['test'],
            content=['image'],
            mapping=[['ic15_textdet_test_img', 'imgs/test']]),
    ])
```
The default type of `data_obtainer` is `NaiveDataObtainer`, which mainly downloads and decompresses the original files to the specified directory. Here, we configure the URL, save name, MD5 value, etc. of the original dataset files through the `files` parameter. The `mapping` parameter specifies the path to which the data is decompressed or moved. In addition, the two optional parameters `split` and `content` indicate, respectively, the dataset split the archive belongs to and the type of content (such as images or annotations) it contains.
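For example, if the training annotations also needed to be downloaded, another entry could be appended to `files` in the same style. The following is only a sketch: the URL mirrors the naming pattern of the official ICDAR2015 downloads, and the MD5 value is a placeholder that must be replaced with the archive's real checksum:
```python
# Hypothetical additional entry for the training annotations (sketch only).
dict(
    url='https://rrc.cvc.uab.es/downloads/'
    'ch4_training_localization_transcription_gt.zip',
    save_name='ic15_textdet_train_gt.zip',
    md5='<fill-in-real-md5>',  # placeholder: use the archive's actual checksum
    split=['train'],
    content=['annotation'],
    mapping=[['ic15_textdet_train_gt', 'annotations/train']]),
```
The `data_converter` field then configures the format conversion and saving step: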
```python
data_converter = dict(
    type='TextDetDataConverter',
    splits=['train', 'test'],
    data_root=data_root,
    gatherer=dict(
        type='pair_gather',
        suffixes=['.jpg', '.JPG'],
        rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),
    parser=dict(type='ICDARTxtTextDetAnnParser'),
    dumper=dict(type='JsonDumper'),
    delete=['annotations', 'ic15_textdet_test_img', 'ic15_textdet_train_img'])
```
```{warning}
This section is outdated and not yet synchronized with its Chinese counterpart. Please switch to the Chinese documentation for the latest information.
```
`data_converter` is responsible for loading the original annotations and converting them to the format supported by MMOCR. We provide a number of built-in data converters for different tasks, such as `TextDetDataConverter`, `TextRecogDataConverter`, `TextSpottingDataConverter`, and `WildReceiptConverter` (since WildReceipt is currently the only dataset supported for the KIE task, only this converter is provided for now).
Taking the text detection task as an example, `TextDetDataConverter` mainly performs the following steps:
- Collect and match the images and original annotation files, such as the image `img_1.jpg` and the annotation `gt_img_1.txt`
- Load and parse the original annotations to obtain necessary information such as the bounding box and text
- Convert the parsed data to the format supported by MMOCR
- Dump the converted data to the specified path and format
The above steps can be configured separately through `gatherer`, `parser`, `dumper`.
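Conceptually, the converter chains these components together; the following pseudocode-style sketch (not MMOCR's actual implementation) shows how they relate:
```python
# Conceptual sketch of a data converter's workflow (not the real MMOCR code).
def convert_split(gatherer, parser, dumper, split):
    samples = []
    for img_path, ann_path in gatherer(split):   # 1. collect & match files
        instances = parser(ann_path)             # 2. parse raw annotations
        samples.append(                          # 3. pack into MMOCR format
            dict(img_path=img_path, instances=instances))
    dumper(samples, split)                       # 4. dump to the target format
```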
Specifically, the `gatherer` is used to collect and match the images and annotations in the original dataset. Typically, there are two relations between images and annotation files: one is many-to-many (one annotation file per image), the other is one-to-many (one annotation file shared by all images).
```text
many-to-many
├── img_1.jpg
├── gt_img_1.txt
├── img_2.jpg
├── gt_img_2.txt
├── img_3.JPG
├── gt_img_3.txt

one-to-many
├── img_1.jpg
├── img_2.jpg
├── img_3.JPG
├── gt.txt
```
Therefore, we provide two built-in gatherers, `pair_gather` and `mono_gather`, to handle these two cases: `pair_gather` is used for the many-to-many case, and `mono_gather` for the one-to-many case. `pair_gather` needs the `suffixes` parameter to indicate the image suffixes, such as `suffixes=['.jpg', '.JPG']` in the example above. In addition, the correspondence between image and annotation filenames is specified through a pair of regular expressions, such as `rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']` in the example above, where `\d+` matches the serial number of the image, `([jJ][pP][gG])` matches its suffix, and the backreference `\1` carries the serial number over to the annotation filename.
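The following standalone snippet (independent of MMOCR) illustrates how such a `rule` pair maps an image filename to its annotation filename via the backreference:
```python
import re

# Apply the rule from the example above: the first pattern matches the image
# name and `\1` carries the serial number into the annotation name.
img_pattern, ann_template = r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt'

for img_name in ['img_1.jpg', 'img_3.JPG']:
    print(img_name, '->', re.sub(img_pattern, ann_template, img_name))
# img_1.jpg -> gt_img_1.txt
# img_3.JPG -> gt_img_3.txt
```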
Once the images and annotation files are matched, the original annotations are parsed. Since the annotation format usually varies from dataset to dataset, parsers are typically dataset-specific. The parser then packs the required data into the MMOCR format.
Finally, we can specify a dumper to decide the output format. Currently, `JsonDumper`, `WildreceiptOpensetDumper`, and `TextRecogLMDBDumper` are supported; they save the data in the standard MMOCR JSON format, the Wildreceipt openset format, and the LMDB format commonly used for text recognition, respectively.
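For example, to dump a text recognition dataset to LMDB instead of JSON, only the `dumper` field of the converter needs to change; a minimal sketch (other fields omitted) could look like:
```python
# Sketch: a textrecog converter configured to dump LMDB instead of JSON;
# the remaining fields are the same as in a regular textrecog converter.
data_converter = dict(
    type='TextRecogDataConverter',
    # ... splits, data_root, gatherer, parser as before ...
    dumper=dict(type='TextRecogLMDBDumper'))
```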
### Use Dataset Preparer to prepare a customized dataset
\[Coming Soon\]