mirror of https://github.com/open-mmlab/mmocr.git
[Docs] Update some dataset preparer related docs (#1502)
parent 8864fa174b
commit 1c06edc68f
@@ -12,7 +12,7 @@ The following sections assume that you [installed MMOCR from source](install.md#

 ## Prepare a Dataset

-Since the variety of OCR dataset formats are not conducive to either switching or joint training of multiple datasets, MMOCR proposes a uniform [data format](../user_guides/dataset_prepare.md), and provides conversion scripts and [tutorials](../user_guides/dataset_prepare.md) for all commonly used OCR datasets. Usually, to use those datasets in MMOCR, you just need to follow the steps to get them ready for use.
+Since the variety of OCR dataset formats is not conducive to either switching between or jointly training on multiple datasets, MMOCR proposes a uniform [data format](../user_guides/dataset_prepare.md) and provides a [dataset preparer](../user_guides/data_prepare/dataset_preparer.md) for commonly used OCR datasets. Usually, to use those datasets in MMOCR, you just need to follow the steps to get them ready for use.

 ```{note}
 But here, efficiency means everything.
@@ -2,13 +2,7 @@

 ## Introduction

-After decades of development, the OCR community has produced a series of related datasets that often provide annotations of text in a variety of styles, making it necessary for users to convert these datasets to the required format when using them. MMOCR supports dozens of commonly used text-related datasets and provides detailed tutorials for downloading and preparing the data.
-
-In addition, we provide data conversion scripts to help users convert the annotations of widely-used OCR datasets to MMOCR formats.
-
-- [Detection Dataset Preparation](./data_prepare/det.md)
-- [Recognition Dataset Preparation](./data_prepare/recog.md)
-- [Key Information Extraction Dataset Preparation](./data_prepare/kie.md)
+After decades of development, the OCR community has produced a series of related datasets that often provide text annotations in a variety of styles, making it necessary for users to convert these datasets to the required format before using them. MMOCR supports dozens of commonly used text-related datasets and provides a [data preparation script](./data_prepare/dataset_preparer.md) to help users prepare the datasets with only one command.

 In the following, we provide a brief overview of the data formats defined in MMOCR for each task.
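Both the old conversion scripts and the new preparer target the same unified annotation layout. As a rough illustration of what such a conversion involves, here is a minimal, self-contained sketch that parses one ICDAR 2015-style ground-truth line (`x1,y1,x2,y2,x3,y3,x4,y4,transcription`) into an instance dict; the field names mirror the general shape of MMOCR's textdet annotations but are assumptions for illustration, not the project's actual converter code:

```python
# Sketch: convert one ICDAR 2015-style ground-truth line into an instance
# entry resembling MMOCR's unified textdet annotation format.
# NOTE: the keys ('polygon', 'bbox', 'bbox_label', 'ignore') are assumed for
# illustration and are not copied from MMOCR's converter.

def parse_icdar_line(line: str) -> dict:
    """Parse 'x1,y1,x2,y2,x3,y3,x4,y4,transcription' into an instance dict."""
    parts = line.strip().split(',')
    coords = [int(p) for p in parts[:8]]
    # The transcription itself may contain commas, so rejoin the remainder.
    transcription = ','.join(parts[8:])
    xs, ys = coords[0::2], coords[1::2]
    return {
        'polygon': coords,
        'bbox': [min(xs), min(ys), max(xs), max(ys)],
        'bbox_label': 0,
        # '###' marks illegible text in the ICDAR ground truth.
        'ignore': transcription == '###',
    }

instance = parse_icdar_line('377,117,463,117,465,130,378,130,Genaxis Theatre')
```

Instance dicts like this are then collected per image into annotation files such as the `textdet_train.json` / `textdet_test.json` referenced later in this diff.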
@@ -69,54 +63,31 @@ In the following, we provide a brief overview of the data formats defined in MMO

 ## Downloading Datasets and Format Conversion

-As an example of the data preparation steps, you can perform the following steps to prepare the ICDAR 2015 dataset for text detection task.
-
-- Download the ICDAR 2015 dataset from the [official ICDAR website](https://rrc.cvc.uab.es/?ch=4&com=downloads). Extract the training set `ch4_training_word_images_gt.zip` and the test set zip `ch4_test_word_images_gt.zip` to the path `data/icdar2015` respectively.
-
-```bash
-# Downloading datasets
-mkdir data/det/icdar2015 && cd data/det/icdar2015
-wget https://rrc.cvc.uab.es/downloads/ch4_training_images.zip --no-check-certificate
-wget https://rrc.cvc.uab.es/downloads/ch4_training_localization_transcription_gt.zip --no-check-certificate
-wget https://rrc.cvc.uab.es/downloads/ch4_test_images.zip --no-check-certificate
-wget https://rrc.cvc.uab.es/downloads/Challenge4_Test_Task1_GT.zip --no-check-certificate
-
-# Extracting the zips
-mkdir imgs && mkdir annotations
-unzip ch4_training_images.zip -d imgs/training
-unzip ch4_training_localization_transcription_gt.zip -d annotations/training
-unzip ch4_test_images.zip -d imgs/test
-unzip Challenge4_Test_Task1_GT.zip -d annotations/test
-```
-
-- Using the scripts provided by us to convert the annotations to MMOCR supported formats.
-
-```bash
-python tools/dataset_converters/textdet/icdar_converter.py data/det/icdar15/ -o data/det/icdar15/ --split-list training test -d icdar2015
-```
-
-- After completing the above steps, the annotation format has been converted, and the file directory structure is as follows
-
-```text
-data/det/icdar2015/
-├── annotations
-│   ├── test
-│   └── training
-├── imgs
-│   ├── test
-│   └── training
-├── instances_test.json
-└── instances_training.json
-```
+As an example of the data preparation steps, you can use the following command to prepare the ICDAR 2015 dataset for the text detection task.
+
+```shell
+python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet
+```
+
+Then, the dataset has been downloaded and converted to MMOCR format, and the file directory structure is as follows:
+
+```text
+data/icdar2015
+├── textdet_imgs
+│   ├── test
+│   └── train
+├── textdet_test.json
+└── textdet_train.json
+```

 ## Dataset Configuration

 ### Single Dataset Training

-When training or evaluating a model on new datasets, we need to write the dataset config where the image path, annotation path, and image prefix are set. The path `configs/xxx/_base_/datasets/` is pre-configured with the commonly used datasets in MMOCR, here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`).
+When training or evaluating a model on new datasets, we need to write the dataset config, where the image path, annotation path, and image prefix are set. The path `configs/xxx/_base_/datasets/` is pre-configured with the commonly used datasets in MMOCR (if you use `prepare_dataset.py` to prepare a dataset, this config will be generated automatically). Here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`).

 ```Python
-ic15_det_data_root = 'data/det/icdar2015'  # dataset root path
+ic15_det_data_root = 'data/icdar2015'  # dataset root path

 # Train set config
 ic15_det_train = dict(
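The Python excerpt in the hunk above is truncated at `ic15_det_train = dict(`. For orientation, a single-dataset config of this kind typically continues along the following lines; the exact keys below are a hedged sketch, not a verbatim copy of `configs/_base_/det_datasets/icdar2015.py`:

```python
# Sketch of how a single-dataset config commonly continues in MMOCR 1.x.
# Field names and values are illustrative assumptions.
ic15_det_data_root = 'data/icdar2015'  # dataset root path

# Train set config
ic15_det_train = dict(
    type='OCRDataset',
    data_root=ic15_det_data_root,
    ann_file='textdet_train.json',  # produced by prepare_dataset.py
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=None)                  # the data pipeline is set in the model config

# Test set config
ic15_det_test = dict(
    type='OCRDataset',
    data_root=ic15_det_data_root,
    ann_file='textdet_test.json',
    test_mode=True,
    pipeline=None)
```

Model configs then import this base file and plug the dataset dicts into their dataloaders.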
@@ -14,7 +14,7 @@

 ## Prepare a Dataset

-Since OCR datasets come in many kinds and formats, which is not conducive to switching between or jointly training on multiple datasets, MMOCR defines a [uniform data format](../user_guides/dataset_prepare.md) and provides corresponding conversion scripts and [tutorials](../user_guides/dataset_prepare.md) for commonly used OCR datasets. Usually, to use a dataset in MMOCR, you just need to follow the corresponding steps and run the commands.
+Since OCR datasets come in many kinds and formats, which is not conducive to switching between or jointly training on multiple datasets, MMOCR defines a [uniform data format](../user_guides/dataset_prepare.md) and provides a [one-stop data preparation script](../user_guides/data_prepare/dataset_preparer.md) for commonly used OCR datasets. Usually, to use a dataset in MMOCR, you just need to follow the corresponding steps and run the commands.

 ```{note}
 But we also know that efficiency is everything, especially for those who want to get started with MMOCR quickly.
@@ -2,13 +2,7 @@

 ## Introduction

-After decades of development, the OCR field has produced a series of related datasets that often provide text annotation files in a variety of styles, forcing users to perform format conversions when using these datasets. MMOCR supports dozens of commonly used text-related datasets and provides detailed tutorials for downloading and preparing the data.
-
-In addition, we provide format conversion scripts for the datasets commonly used in each task, to help users quickly convert data into formats supported by MMOCR.
-
-- [Text Detection Dataset Preparation](./data_prepare/det.md)
-- [Text Recognition Dataset Preparation](./data_prepare/recog.md)
-- [Key Information Extraction Dataset Preparation](./data_prepare/kie.md)
+After decades of development, the OCR field has produced a series of related datasets that often provide text annotation files in a variety of styles, forcing users to perform format conversions when using these datasets. Therefore, to facilitate dataset preparation, we provide a [one-stop data preparation script](./data_prepare/dataset_preparer.md) that lets users complete every dataset preparation step with a single command.

 Below, we give a brief introduction to the data formats for each task supported in MMOCR.
@@ -69,54 +63,31 @@

 ## Downloading Datasets and Format Conversion

-Taking the preparation of the ICDAR 2015 **text detection dataset** as an example, you can complete the dataset preparation by performing the following steps in order:
-
-- Download the ICDAR 2015 dataset from the [official ICDAR website](https://rrc.cvc.uab.es/?ch=4&com=downloads). Extract the training set `ch4_training_word_images_gt.zip` and the test set archive `ch4_test_word_images_gt.zip` to the path `data/icdar2015`, respectively.
-
-```bash
-# Download the datasets
-mkdir data/det/icdar2015 && cd data/det/icdar2015
-wget https://rrc.cvc.uab.es/downloads/ch4_training_images.zip --no-check-certificate
-wget https://rrc.cvc.uab.es/downloads/ch4_training_localization_transcription_gt.zip --no-check-certificate
-wget https://rrc.cvc.uab.es/downloads/ch4_test_images.zip --no-check-certificate
-wget https://rrc.cvc.uab.es/downloads/Challenge4_Test_Task1_GT.zip --no-check-certificate
-
-# Extract the archives
-mkdir imgs && mkdir annotations
-unzip ch4_training_images.zip -d imgs/training
-unzip ch4_training_localization_transcription_gt.zip -d annotations/training
-unzip ch4_test_images.zip -d imgs/test
-unzip Challenge4_Test_Task1_GT.zip -d annotations/test
-```
-
-- Use the format conversion scripts provided by MMOCR to convert the original annotation files into MMOCR's unified data format
-
-```bash
-python tools/dataset_converters/textdet/icdar_converter.py data/det/icdar15/ -o data/det/icdar15/ --split-list training test -d icdar2015
-```
-
-- After completing the above steps, the dataset annotations will have been converted into the unified MMOCR format, with the following directory structure:
-
-```text
-data/det/icdar2015/
-├── annotations
-│   ├── test
-│   └── training
-├── imgs
-│   ├── test
-│   └── training
-├── instances_test.json
-└── instances_training.json
-```
+Taking the text detection task on the ICDAR 2015 dataset as an example, you can run the following command to complete the dataset preparation:
+
+```shell
+python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet
+```
+
+Once the command finishes, the dataset will have been downloaded and converted to MMOCR format, with the following directory structure:
+
+```text
+data/icdar2015
+├── textdet_imgs
+│   ├── test
+│   └── train
+├── textdet_test.json
+└── textdet_train.json
+```

 ## Dataset Configuration

 ### Single Dataset Training

-When using a new dataset, we need to configure basic information such as the paths to its images and annotation files. The path `configs/xxx/_base_/datasets/` is pre-configured with the datasets commonly used in MMOCR; here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`):
+When using a new dataset, we need to configure basic information such as the paths to its images and annotation files. The path `configs/xxx/_base_/datasets/` is pre-configured with the datasets commonly used in MMOCR (when you use `prepare_dataset.py` to prepare a dataset, this config file is usually generated automatically once the dataset is ready); here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`):

 ```Python
-ic15_det_data_root = 'data/det/icdar2015'  # dataset root path
+ic15_det_data_root = 'data/icdar2015'  # dataset root path

 # Train set config
 ic15_det_train = dict(
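After running the preparer, a quick sanity check that the expected layout was produced can save debugging time later. A small self-contained sketch (the helper name is ours; the paths follow the `data/icdar2015` tree shown above):

```python
import os

def check_icdar2015_layout(root: str = 'data/icdar2015') -> list:
    """Return the expected files/directories that are missing under `root`."""
    # Entries taken from the directory tree produced by prepare_dataset.py.
    expected = [
        'textdet_imgs/train',
        'textdet_imgs/test',
        'textdet_train.json',
        'textdet_test.json',
    ]
    return [p for p in expected if not os.path.exists(os.path.join(root, p))]
```

An empty return value means every expected entry is present; otherwise the list names what still needs to be downloaded or converted.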