Over decades of development, the OCR community has produced many datasets whose text annotations come in a variety of formats, so users usually have to convert a dataset to the required format before using it. MMOCR supports dozens of commonly used text-related datasets and provides detailed tutorials for downloading and preparing the data.
In addition, we provide data conversion scripts to help users convert the annotations of widely-used OCR datasets to MMOCR formats.
- [Key Information Extraction Dataset Preparation](./data_prepare/kie.md)
In the following, we provide a brief overview of the data formats defined in MMOCR for each task.
- As shown in the following code block, the text detection task uses the data format `TextDetDataset`, which holds the bounding box annotations, file names, and other information required for text detection. A sample annotation file is available at `tests/data/det_toy_dataset/instances_test.json`.
```json
{
  "metainfo":
    {
      "dataset_type": "TextDetDataset",
      "task_name": "textdet",
      "category": [{"id": 0, "name": "text"}]
    },
  "data_list":
    [
      {
        "img_path": "test_img.jpg",
        "height": 640,
        "width": 640,
        "instances":
          [
            {
              "polygon": [0, 0, 0, 10, 10, 20, 20, 0],
              "bbox": [0, 0, 10, 20],
              "bbox_label": 0,
              "ignore": false
            }
          ],
        //...
      }
    ]
}
```
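To make the structure concrete, the following is a minimal sketch of a sanity check for an annotation in this format. The helper `check_det_annotation` and the constant `REQUIRED_INSTANCE_KEYS` are hypothetical names introduced here for illustration and are not part of MMOCR. The check assumes that `polygon` is a flat list of `x, y` pairs and that `bbox` holds four numbers.

```python
import json

# Keys each text instance is assumed to carry, taken from the sample above.
REQUIRED_INSTANCE_KEYS = {"polygon", "bbox", "bbox_label", "ignore"}


def check_det_annotation(ann):
    """Validate the basic layout and return the number of text instances.

    Hypothetical helper for illustration; not an MMOCR API.
    """
    if ann["metainfo"]["dataset_type"] != "TextDetDataset":
        raise ValueError("unexpected dataset_type")
    count = 0
    for item in ann["data_list"]:
        for key in ("img_path", "height", "width", "instances"):
            if key not in item:
                raise ValueError(f"missing image-level key: {key}")
        for inst in item["instances"]:
            missing = REQUIRED_INSTANCE_KEYS - inst.keys()
            if missing:
                raise ValueError(f"missing instance keys: {sorted(missing)}")
            # Assumption: polygons are flat [x1, y1, x2, y2, ...] lists.
            if len(inst["polygon"]) < 6 or len(inst["polygon"]) % 2:
                raise ValueError("polygon must hold at least 3 (x, y) pairs")
            if len(inst["bbox"]) != 4:
                raise ValueError("bbox must hold 4 numbers")
            count += 1
    return count


# The sample annotation from above, written as a Python dict.
sample_annotation = {
    "metainfo": {
        "dataset_type": "TextDetDataset",
        "task_name": "textdet",
        "category": [{"id": 0, "name": "text"}],
    },
    "data_list": [
        {
            "img_path": "test_img.jpg",
            "height": 640,
            "width": 640,
            "instances": [
                {
                    "polygon": [0, 0, 0, 10, 10, 20, 20, 0],
                    "bbox": [0, 0, 10, 20],
                    "bbox_label": 0,
                    "ignore": False,
                }
            ],
        }
    ],
}

num_instances = check_det_annotation(sample_annotation)
# Round-trip through the standard json module to confirm it serializes cleanly.
round_tripped = json.loads(json.dumps(sample_annotation))
print(num_instances)
```

A check like this can be run on a converted annotation file (loaded with `json.load`) before starting a training job, so that format problems surface early.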
- As shown in the following code block, the text recognition task uses the data format `TextRecogDataset`, which holds the text instances, image paths, and other information required for text recognition. A sample annotation file is available at `tests/data/rec_toy_dataset/labels.json`.
```json
{
  "metainfo":
    {
      "dataset_type": "TextRecogDataset",
      "task_name": "textrecog"
    },
  "data_list":
    [
      {
        "img_path": "test_img.jpg",
        "instances":
          [
            {
              "text": "GRAND"
            }
          ]
      }
    ]
}
```
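As an illustration of how such a file could be produced, the sketch below converts plain label lines into the `TextRecogDataset` layout. It assumes a hypothetical input format of one `image_path text` pair per line, separated by whitespace; the function name `lines_to_recog_annotation` is invented for this example and is not an MMOCR API.

```python
import json


def lines_to_recog_annotation(lines):
    """Build a TextRecogDataset-style dict from 'img_path text' lines.

    Hypothetical helper for illustration; the input format is an assumption.
    """
    data_list = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split only on the first whitespace run so the text may contain spaces.
        img_path, text = line.split(maxsplit=1)
        data_list.append({"img_path": img_path, "instances": [{"text": text}]})
    return {
        "metainfo": {
            "dataset_type": "TextRecogDataset",
            "task_name": "textrecog",
        },
        "data_list": data_list,
    }


annotation = lines_to_recog_annotation(
    ["test_img.jpg GRAND", "", "word_2.png hello world"]
)
print(json.dumps(annotation, indent=2))
```

A line that contains only an image path raises a `ValueError` at the unpacking step, which is usually preferable to silently writing an empty label.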
## Downloading Datasets and Format Conversion
As an example of the data preparation workflow, the following steps prepare the ICDAR 2015 dataset for the text detection task.
- Download the ICDAR 2015 dataset from the [official ICDAR website](https://rrc.cvc.uab.es/?ch=4&com=downloads). Extract the training set archive `ch4_training_word_images_gt.zip` and the test set archive `ch4_test_word_images_gt.zip` into `data/det/icdar2015`.
- Use the provided conversion script to convert the annotations to the format supported by MMOCR.
```bash
python tools/dataset_converters/textdet/icdar_converter.py data/det/icdar2015/ -o data/det/icdar2015/ --split-list training test -d icdar2015
```
- Once the conversion has finished, the directory structure is as follows:
```text
data/det/icdar2015/
├── annotations
│   ├── test
│   └── training
├── imgs
│   ├── test
│   └── training
├── instances_test.json
└── instances_training.json
```
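Before moving on, it can help to confirm that the conversion produced every expected entry. The following is a minimal sketch that checks a dataset root against the tree above. The names `EXPECTED_ENTRIES` and `missing_entries` are hypothetical, and the demo runs on a temporary directory so that it works without the real dataset.

```python
import tempfile
from pathlib import Path

# Entries expected after conversion, relative to the dataset root
# (copied from the directory tree above).
EXPECTED_ENTRIES = [
    "annotations/test",
    "annotations/training",
    "imgs/test",
    "imgs/training",
    "instances_test.json",
    "instances_training.json",
]


def missing_entries(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]


# Demo on a throwaway directory that mimics the converted layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = Path(tmp) / "icdar2015"
    for rel in EXPECTED_ENTRIES:
        target = demo_root / rel
        if target.suffix == ".json":
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text("{}")
        else:
            target.mkdir(parents=True, exist_ok=True)
    complete = missing_entries(demo_root)
    # Remove one annotation file to show how a gap is reported.
    (demo_root / "instances_test.json").unlink()
    incomplete = missing_entries(demo_root)

print(complete)
print(incomplete)
```

On a real setup, calling `missing_entries("data/det/icdar2015")` should return an empty list when the conversion succeeded.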
## Dataset Configuration
### Single Dataset Training
When training or evaluating a model on a new dataset, we need to write a dataset config that sets the image path, annotation path, and image prefix. MMOCR ships pre-written configs for commonly used datasets under `configs/xxx/_base_/datasets/`. Here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`).
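For orientation, a dataset config of this kind might look roughly like the sketch below. The variable names `ic15_det_train` and `ic15_det_test` and the file names follow this page; the dataset type `OCRDataset`, the `data_prefix` layout, and the remaining fields are assumptions made for illustration, so consult the shipped config file for the authoritative version.

```python
# Sketch of a dataset config; the field names below are assumptions.
ic15_det_data_root = 'data/det/icdar2015'  # dataset root from the tree above

ic15_det_train = dict(
    type='OCRDataset',  # assumed dataset class name
    data_root=ic15_det_data_root,
    ann_file='instances_training.json',  # annotation path, relative to data_root
    data_prefix=dict(img_path='imgs/'),  # assumed image prefix
    pipeline=None)  # filled in later by the model config

ic15_det_test = dict(
    type='OCRDataset',  # assumed dataset class name
    data_root=ic15_det_data_root,
    ann_file='instances_test.json',
    data_prefix=dict(img_path='imgs/'),
    test_mode=True,  # assumed flag marking the evaluation split
    pipeline=None)
```

Leaving `pipeline` unset in the dataset config keeps the dataset definition independent of any particular model, which is why the model config assigns the pipeline afterwards.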
After configuring the dataset, we can import it in the corresponding model config. For example, the following config trains the "DBNet_R18" model on the ICDAR 2015 dataset.
```Python
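_base_ = [
    '_base_dbnet_r18_fpnc.py',
    '../_base_/datasets/icdar2015.py',  # import the dataset config
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_sgd_1200e.py',
]

ic15_det_train = _base_.ic15_det_train  # specify the training set
ic15_det_train.pipeline = _base_.train_pipeline  # specify the training pipeline
ic15_det_test = _base_.ic15_det_test  # specify the testing set
ic15_det_test.pipeline = _base_.test_pipeline  # specify the testing pipeline

# Loader settings other than `dataset` are illustrative; tune them for your setup.
train_dataloader = dict(
    batch_size=16,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=ic15_det_train)  # specify the dataset in train_dataloader

val_dataloader = dict(
    batch_size=1,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=ic15_det_test)  # specify the dataset in val_dataloader

test_dataloader = val_dataloader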
```
### Multi-dataset Training
In addition, [`ConcatDataset`](mmocr.datasets.ConcatDataset) enables users to train or test a model on a combination of multiple datasets. To use it, set the dataset type in the dataloader to `ConcatDataset` in the configuration file and specify the list of datasets to combine.
For example, the following configuration uses the MJSynth dataset for training and six academic datasets (CUTE80, IIIT5K, SVT, SVTP, ICDAR2013, ICDAR2015) for testing.
```Python
_base_ = [ # Import all dataset configurations you want to use