[Docs] Update some dataset preparer related docs (#1502)

pull/1509/head
Xinyu Wang 2022-11-02 16:08:01 +08:00 committed by GitHub
parent 8864fa174b
commit 1c06edc68f
GPG Key ID: 4AEE18F83AFDEB23
4 changed files with 34 additions and 92 deletions


@ -12,7 +12,7 @@ The following sections assume that you [installed MMOCR from source](install.md#
## Prepare a Dataset
Since the variety of OCR dataset formats hinders both switching between datasets and joint training on multiple datasets, MMOCR proposes a uniform [data format](../user_guides/dataset_prepare.md) and provides conversion scripts and [tutorials](../user_guides/dataset_prepare.md) for all commonly used OCR datasets. Usually, to use those datasets in MMOCR, you only need to follow the steps to get them ready for use.
Since the variety of OCR dataset formats hinders both switching between datasets and joint training on multiple datasets, MMOCR proposes a uniform [data format](../user_guides/dataset_prepare.md) and provides a [dataset preparer](../user_guides/data_prepare/dataset_preparer.md) for commonly used OCR datasets. Usually, to use those datasets in MMOCR, you only need to follow the steps to get them ready for use.
```{note}
But here, efficiency means everything.


@ -2,13 +2,7 @@
## Introduction
After decades of development, the OCR community has produced a series of related datasets that often provide annotations of text in a variety of styles, making it necessary for users to convert these datasets to the required format when using them. MMOCR supports dozens of commonly used text-related datasets and provides detailed tutorials for downloading and preparing the data.
In addition, we provide data conversion scripts to help users convert the annotations of widely-used OCR datasets to MMOCR formats.
- [Detection Dataset Preparation](./data_prepare/det.md)
- [Recognition Dataset Preparation](./data_prepare/recog.md)
- [Key Information Extraction Dataset Preparation](./data_prepare/kie.md)
After decades of development, the OCR community has produced a series of related datasets that often provide annotations of text in a variety of styles, making it necessary for users to convert these datasets to the required format when using them. MMOCR supports dozens of commonly used text-related datasets and provides a [data preparation script](./data_prepare/dataset_preparer.md) to help users prepare the datasets with only one command.
In the following, we provide a brief overview of the data formats defined in MMOCR for each task.
@ -69,54 +63,31 @@ In the following, we provide a brief overview of the data formats defined in MMO
## Downloading Datasets and Format Conversion
As an example of the data preparation steps, you can perform the following steps to prepare the ICDAR 2015 dataset for the text detection task.
As an example of the data preparation steps, you can use the following command to prepare the ICDAR 2015 dataset for the text detection task.
- Download the ICDAR 2015 dataset from the [official ICDAR website](https://rrc.cvc.uab.es/?ch=4&com=downloads). Extract the training set archive `ch4_training_word_images_gt.zip` and the test set archive `ch4_test_word_images_gt.zip` to the path `data/icdar2015`.
```shell
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet
```
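If you would rather trigger the preparation from a Python script than from a shell, a minimal sketch might look like the following. It only assembles the command shown above. The helper names `build_prepare_command` and `prepare` are hypothetical (not part of MMOCR), and the sketch assumes it is run from the MMOCR repository root.

```python
import subprocess
import sys
from typing import List


def build_prepare_command(dataset: str, task: str) -> List[str]:
    """Assemble the prepare_dataset.py invocation shown above as an argv list."""
    return [
        sys.executable,  # use the current Python interpreter
        "tools/dataset_converters/prepare_dataset.py",
        dataset,
        "--task",
        task,
    ]


def prepare(dataset: str, task: str) -> None:
    """Run the command; raises CalledProcessError if the script exits non-zero."""
    subprocess.run(build_prepare_command(dataset, task), check=True)


# Example (hypothetical usage): prepare("icdar2015", "textdet")
```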
```bash
# Downloading datasets
mkdir data/det/icdar2015 && cd data/det/icdar2015
wget https://rrc.cvc.uab.es/downloads/ch4_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch4_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch4_test_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge4_Test_Task1_GT.zip --no-check-certificate
Once the command finishes, the dataset will have been downloaded and converted to the MMOCR format, with the following directory structure:
# Extracting the zips
mkdir imgs && mkdir annotations
unzip ch4_training_images.zip -d imgs/training
unzip ch4_training_localization_transcription_gt.zip -d annotations/training
unzip ch4_test_images.zip -d imgs/test
unzip Challenge4_Test_Task1_GT.zip -d annotations/test
```
- Use the provided script to convert the annotations to MMOCR-supported formats.
```bash
python tools/dataset_converters/textdet/icdar_converter.py data/det/icdar15/ -o data/det/icdar15/ --split-list training test -d icdar2015
```
- After completing the above steps, the annotations have been converted, and the file directory structure is as follows:
```text
data/det/icdar2015/
├── annotations
│ ├── test
│ └── training
├── imgs
│ ├── test
│ └── training
├── instances_test.json
└── instances_training.json
```
```text
data/icdar2015
├── textdet_imgs
│ ├── test
│ └── train
├── textdet_test.json
└── textdet_train.json
```
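As a quick sanity check after preparation, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of MMOCR: the helper `missing_entries` is hypothetical, and the expected paths are simply copied from the directory tree above.

```python
from pathlib import Path
from typing import List

# Entries expected under the dataset root, taken from the tree above.
EXPECTED_DIRS = ("textdet_imgs/test", "textdet_imgs/train")
EXPECTED_FILES = ("textdet_test.json", "textdet_train.json")


def missing_entries(root: str = "data/icdar2015") -> List[str]:
    """Return the expected directories and files that are absent under root."""
    base = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (base / f).is_file()]
    return missing
```

An empty list means every expected entry was found. Otherwise the list names what is still missing.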
## Dataset Configuration
### Single Dataset Training
When training or evaluating a model on a new dataset, we need to write a dataset config that sets the image path, annotation path, and image prefix. The path `configs/xxx/_base_/datasets/` is pre-configured with the commonly used datasets in MMOCR. Here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`).
When training or evaluating a model on a new dataset, we need to write a dataset config that sets the image path, annotation path, and image prefix. The path `configs/xxx/_base_/datasets/` is pre-configured with the commonly used datasets in MMOCR (if you use `prepare_dataset.py` to prepare a dataset, this config will be generated automatically). Here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`).
```Python
ic15_det_data_root = 'data/det/icdar2015' # dataset root path
ic15_det_data_root = 'data/icdar2015' # dataset root path
# Train set config
ic15_det_train = dict(


@ -14,7 +14,7 @@
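## Prepare a Dataset
Because OCR datasets come in many varieties and formats, which hinders switching between datasets and joint training on multiple datasets, MMOCR defines a [uniform data format](../user_guides/dataset_prepare.md) and provides corresponding conversion scripts and [tutorials](../user_guides/dataset_prepare.md) for all commonly used OCR datasets. Usually, to use a dataset in MMOCR, you only need to follow the corresponding steps and run the commands.
Because OCR datasets come in many varieties and formats, which hinders switching between datasets and joint training on multiple datasets, MMOCR defines a [uniform data format](../user_guides/dataset_prepare.md) and provides a [one-click data preparation script](../user_guides/data_prepare/dataset_preparer.md) for commonly used OCR datasets. Usually, to use a dataset in MMOCR, you only need to follow the corresponding steps and run the commands.
```{note}
But we also know that efficiency is everything, especially for those of you who want to get started with MMOCR quickly.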


@ -2,13 +2,7 @@
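## Introduction
After decades of development, the OCR field has produced a series of related datasets, which often provide their text annotation files in widely varying formats, so users have to convert formats when using them. MMOCR supports dozens of commonly used text-related datasets and provides detailed tutorials for downloading and preparing the data.
In addition, we provide data format conversion scripts for the commonly used datasets of each task, to help users quickly convert data to the formats supported by MMOCR.
- [Detection Dataset Preparation](./data_prepare/det.md)
- [Recognition Dataset Preparation](./data_prepare/recog.md)
- [Key Information Extraction Dataset Preparation](./data_prepare/kie.md)
After decades of development, the OCR field has produced a series of related datasets, which often provide their text annotation files in widely varying formats, so users have to convert formats when using them. Therefore, to make dataset preparation easier, we provide a [one-click data preparation script](./data_prepare/dataset_preparer.md) that lets users complete every step of dataset preparation with a single command.
Below, we briefly introduce the data formats supported in MMOCR for each task.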
@ -69,54 +63,31 @@
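## Downloading Datasets and Format Conversion
Taking the preparation of the ICDAR 2015 **text detection dataset** as an example, you can perform the following steps in order to prepare the dataset:
Taking the preparation of the ICDAR 2015 dataset for the text detection task as an example, you can run the following command to prepare the dataset:
- Download the ICDAR 2015 dataset from the [official ICDAR website](https://rrc.cvc.uab.es/?ch=4&com=downloads). Extract the training set archive `ch4_training_word_images_gt.zip` and the test set archive `ch4_test_word_images_gt.zip` to the path `data/icdar2015`.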
```shell
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet
```
```bash
# Downloading datasets
mkdir data/det/icdar2015 && cd data/det/icdar2015
wget https://rrc.cvc.uab.es/downloads/ch4_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch4_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch4_test_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge4_Test_Task1_GT.zip --no-check-certificate
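Once the command finishes, the dataset will have been downloaded and converted to the MMOCR format, with the following directory structure:
# Extracting the zips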
mkdir imgs && mkdir annotations
unzip ch4_training_images.zip -d imgs/training
unzip ch4_training_localization_transcription_gt.zip -d annotations/training
unzip ch4_test_images.zip -d imgs/test
unzip Challenge4_Test_Task1_GT.zip -d annotations/test
```
- Use the format conversion script provided by MMOCR to convert the original annotation files to MMOCR's unified data format.
```bash
python tools/dataset_converters/textdet/icdar_converter.py data/det/icdar15/ -o data/det/icdar15/ --split-list training test -d icdar2015
```
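- After completing the above steps, the dataset labels will have been converted to the unified format used by MMOCR, and the file directory structure is as follows: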
```text
data/det/icdar2015/
├── annotations
│ ├── test
│ └── training
├── imgs
│ ├── test
│ └── training
├── instances_test.json
└── instances_training.json
```
```text
data/icdar2015
├── textdet_imgs
│ ├── test
│ └── train
├── textdet_test.json
└── textdet_train.json
```
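## Dataset Configuration
### Single Dataset Training
When using a new dataset, we need to configure basic information such as the paths to its images and annotation files. The path `configs/xxx/_base_/datasets/` is pre-configured with the commonly used datasets in MMOCR. Here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`).
When using a new dataset, we need to configure basic information such as the paths to its images and annotation files. The path `configs/xxx/_base_/datasets/` is pre-configured with the commonly used datasets in MMOCR (when you use `prepare_dataset.py` to prepare a dataset, this config file is usually generated automatically once the dataset is ready). Here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`).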
```Python
ic15_det_data_root = 'data/det/icdar2015' # dataset root path
ic15_det_data_root = 'data/icdar2015' # dataset root path
# Train set config
ic15_det_train = dict(