[Docs] Update some dataset preparer related docs (#1502)

2022-11-02 16:08:01 +08:00 · 1c06edc68f
parent 8864fa174b
commit 1c06edc68f
4 changed files with 34 additions and 92 deletions
--- a/docs/en/get_started/quick_run.md
+++ b/docs/en/get_started/quick_run.md
@@ -12,7 +12,7 @@ The following sections assume that you [installed MMOCR from source](install.md#

 ## Prepare a Dataset

-Since the variety of OCR dataset formats are not conducive to either switching or joint training of multiple datasets, MMOCR proposes a uniform [data format](../user_guides/dataset_prepare.md), and provides conversion scripts and [tutorials](../user_guides/dataset_prepare.md) for all commonly used OCR datasets. Usually, to use those datasets in MMOCR, you just need to follow the steps to get them ready for use.
+Because the variety of OCR dataset formats makes it hard to switch between datasets or to train jointly on several of them, MMOCR defines a uniform [data format](../user_guides/dataset_prepare.md) and provides a [dataset preparer](../user_guides/data_prepare/dataset_preparer.md) for commonly used OCR datasets. Usually, to use those datasets in MMOCR, you only need to follow the preparer's steps to get them ready.

 ```{note}
 But here, efficiency means everything.
--- a/docs/en/user_guides/dataset_prepare.md
+++ b/docs/en/user_guides/dataset_prepare.md
@@ -2,13 +2,7 @@

 ## Introduction

-After decades of development, the OCR community has produced a series of related datasets that often provide annotations of text in a variety of styles, making it necessary for users to convert these datasets to the required format when using them. MMOCR supports dozens of commonly used text-related datasets and provides detailed tutorials for downloading and preparing the data.
-
-In addition, we provide data conversion scripts to help users convert the annotations of widely-used OCR datasets to MMOCR formats.
-
- [Detection Dataset Preparation](./data_prepare/det.md)
- [Recognition Dataset Preparation](./data_prepare/recog.md)
- [Key Information Extraction Dataset Preparation](./data_prepare/kie.md)
+After decades of development, the OCR community has produced a series of related datasets that often provide annotations of text in a variety of styles, making it necessary for users to convert these datasets to the required format when using them. MMOCR supports dozens of commonly used text-related datasets and provides a [data preparation script](./data_prepare/dataset_preparer.md) to help users prepare the datasets with only one command.

 In the following, we provide a brief overview of the data formats defined in MMOCR for each task.

@@ -69,54 +63,31 @@ In the following, we provide a brief overview of the data formats defined in MMO

 ## Downloading Datasets and Format Conversion

-As an example of the data preparation steps, you can perform the following steps to prepare the ICDAR 2015 dataset for text detection task.
+As an example of the data preparation steps, you can use the following command to prepare the ICDAR 2015 dataset for the text detection task.

- Download the ICDAR 2015 dataset from the [official ICDAR website](https://rrc.cvc.uab.es/?ch=4&com=downloads). Extract the training set `ch4_training_word_images_gt.zip` and the test set zip `ch4_test_word_images_gt.zip` to the path `data/icdar2015` respectively.
+```shell
+python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet
+```
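Note: the command above can also be driven from Python when several datasets or tasks need to be prepared in one go. The following is a minimal sketch, not part of the commit; the helper names `build_prepare_command` and `prepare` are hypothetical, and only the script path and the `--task` flag are taken from the command shown above. It assumes the working directory is the MMOCR repository root.

```python
import subprocess
import sys

# Script path copied from the command shown in the docs above.
PREPARE_SCRIPT = "tools/dataset_converters/prepare_dataset.py"


def build_prepare_command(dataset, task):
    """Build the argument list for one dataset/task pair (hypothetical helper)."""
    return [sys.executable, PREPARE_SCRIPT, dataset, "--task", task]


def prepare(dataset, task):
    """Run the preparer; raises CalledProcessError if the script exits non-zero."""
    subprocess.run(build_prepare_command(dataset, task), check=True)


# Example call (assumes the current directory is the MMOCR repository root):
# prepare("icdar2015", "textdet")
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` surfaces a failed download or conversion as an exception instead of a silent non-zero exit code.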

-  ```bash
-  # Downloading datasets
-  mkdir data/det/icdar2015 && cd data/det/icdar2015
-  wget https://rrc.cvc.uab.es/downloads/ch4_training_images.zip --no-check-certificate
-  wget https://rrc.cvc.uab.es/downloads/ch4_training_localization_transcription_gt.zip --no-check-certificate
-  wget https://rrc.cvc.uab.es/downloads/ch4_test_images.zip --no-check-certificate
-  wget https://rrc.cvc.uab.es/downloads/Challenge4_Test_Task1_GT.zip --no-check-certificate
+Once the command finishes, the dataset has been downloaded and converted to the MMOCR format, with the following directory structure:

-  # Extracting the zips
-  mkdir imgs && mkdir annotations
-  unzip ch4_training_images.zip -d imgs/training
-  unzip ch4_training_localization_transcription_gt.zip -d annotations/training
-  unzip ch4_test_images.zip -d imgs/test
-  unzip Challenge4_Test_Task1_GT.zip -d annotations/test
-  ```
-
- Using the scripts provided by us to convert the annotations to MMOCR supported formats.
-
-  ```bash
-  python tools/dataset_converters/textdet/icdar_converter.py data/det/icdar15/ -o data/det/icdar15/ --split-list training test -d icdar2015
-  ```
-
- After completing the above steps, the annotation format has been converted, and the file directory structure is as follows
-
-  ```text
-  data/det/icdar2015/
-  ├── annotations
-  │   ├── test
-  │   └── training
-  ├── imgs
-  │   ├── test
-  │   └── training
-  ├── instances_test.json
-  └── instances_training.json
-  ```
+```text
+data/icdar2015
+├── textdet_imgs
+│   ├── test
+│   └── train
+├── textdet_test.json
+└── textdet_train.json
+```
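Note: the layout in the tree above can be checked with a short script after the preparer has run. This is a sketch under stated assumptions rather than anything shipped with MMOCR: the function name `missing_entries` is hypothetical, and the list of expected entries is copied directly from the tree in the docs.

```python
from pathlib import Path

# Entries copied from the directory tree above, relative to the dataset root.
EXPECTED_ENTRIES = (
    "textdet_imgs/test",
    "textdet_imgs/train",
    "textdet_test.json",
    "textdet_train.json",
)


def missing_entries(data_root, expected=EXPECTED_ENTRIES):
    """Return the expected entries that do not exist under data_root."""
    root = Path(data_root)
    return [entry for entry in expected if not (root / entry).exists()]


if __name__ == "__main__":
    # "data/icdar2015" is the dataset root shown in the tree above.
    missing = missing_entries("data/icdar2015")
    if missing:
        print("Missing entries:", ", ".join(missing))
    else:
        print("Layout matches the expected structure.")
```

The check only confirms that the files and directories exist; it does not validate the contents of the annotation JSON files.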

 ## Dataset Configuration

 ### Single Dataset Training

-When training or evaluating a model on new datasets, we need to write the dataset config where the image path, annotation path, and image prefix are set. The path `configs/xxx/_base_/datasets/` is pre-configured with the commonly used datasets in MMOCR, here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`).
+When training or evaluating a model on new datasets, we need to write a dataset config that sets the image path, annotation path, and image prefix. The path `configs/xxx/_base_/datasets/` is pre-configured with the commonly used datasets in MMOCR (if you use `prepare_dataset.py` to prepare a dataset, this config will be generated automatically). Here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`).

 ```Python
-ic15_det_data_root = 'data/det/icdar2015' # dataset root path
+ic15_det_data_root = 'data/icdar2015' # dataset root path

 # Train set config
 ic15_det_train = dict(
--- a/docs/zh_cn/get_started/quick_run.md
+++ b/docs/zh_cn/get_started/quick_run.md
@@ -14,7 +14,7 @@

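 ## Prepare a Dataset

-Because OCR datasets come in many varieties and inconsistent formats, which hinders switching between datasets and joint training on several of them, MMOCR defines a [uniform data format](../user_guides/dataset_prepare.md) and provides conversion scripts and [tutorials](../user_guides/dataset_prepare.md) for all commonly used OCR datasets. Usually, to use a dataset in MMOCR, you only need to run the commands in the corresponding steps.
+Because OCR datasets come in many varieties and inconsistent formats, which hinders switching between datasets and joint training on several of them, MMOCR defines a [uniform data format](../user_guides/dataset_prepare.md) and provides a [one-command data preparation script](../user_guides/data_prepare/dataset_preparer.md) for commonly used OCR datasets. Usually, to use a dataset in MMOCR, you only need to run the commands in the corresponding steps.

 ```{note}
 But we also know that efficiency is everything, especially for those of you who want to get started with MMOCR quickly.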
--- a/docs/zh_cn/user_guides/dataset_prepare.md
+++ b/docs/zh_cn/user_guides/dataset_prepare.md
@@ -2,13 +2,7 @@

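 ## Introduction

-After decades of development, the OCR community has produced a series of related datasets, which often provide text annotation files in widely varying formats, so users have to convert formats when using them. MMOCR supports dozens of commonly used text-related datasets and provides detailed tutorials for downloading and preparing the data.
-
-In addition, we provide data format conversion scripts for the commonly used datasets of each task, to help users quickly convert data into the formats supported by MMOCR.
-
- [Text Detection Dataset Preparation](./data_prepare/det.md)
- [Text Recognition Dataset Preparation](./data_prepare/recog.md)
- [Key Information Extraction Dataset Preparation](./data_prepare/kie.md)
+After decades of development, the OCR community has produced a series of related datasets, which often provide text annotation files in widely varying formats, so users have to convert formats when using them. Therefore, to make dataset preparation easier, we provide a [one-command data preparation script](./data_prepare/dataset_preparer.md), so that users can complete every step of dataset preparation with a single command.

 In the following, we briefly introduce the data formats that MMOCR supports for each task.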

@@ -69,54 +63,31 @@

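 ## Downloading Datasets and Format Conversion

-Taking the preparation of the ICDAR 2015 **text detection dataset** as an example, you can perform the following steps in order to prepare the dataset:
+Taking the preparation of the ICDAR 2015 dataset for the text detection task as an example, you can run the following command to prepare the dataset:

- Download the ICDAR 2015 dataset from the [official ICDAR website](https://rrc.cvc.uab.es/?ch=4&com=downloads). Extract the training set `ch4_training_word_images_gt.zip` and the test set archive `ch4_test_word_images_gt.zip` to the path `data/icdar2015`.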
+```shell
+python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet
+```

-  ```bash
-  # Downloading datasets
-  mkdir data/det/icdar2015 && cd data/det/icdar2015
-  wget https://rrc.cvc.uab.es/downloads/ch4_training_images.zip --no-check-certificate
-  wget https://rrc.cvc.uab.es/downloads/ch4_training_localization_transcription_gt.zip --no-check-certificate
-  wget https://rrc.cvc.uab.es/downloads/ch4_test_images.zip --no-check-certificate
-  wget https://rrc.cvc.uab.es/downloads/Challenge4_Test_Task1_GT.zip --no-check-certificate
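+Once the command finishes, the dataset will have been downloaded and converted to the MMOCR format, with the following directory structure:

-  # Extracting the zips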
-  mkdir imgs && mkdir annotations
-  unzip ch4_training_images.zip -d imgs/training
-  unzip ch4_training_localization_transcription_gt.zip -d annotations/training
-  unzip ch4_test_images.zip -d imgs/test
-  unzip Challenge4_Test_Task1_GT.zip -d annotations/test
-  ```
-
- Use the format conversion script provided by MMOCR to convert the original annotation files to MMOCR's unified data format.
-
-  ```bash
-  python tools/dataset_converters/textdet/icdar_converter.py data/det/icdar15/ -o data/det/icdar15/ --split-list training test -d icdar2015
-  ```
-
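- After completing the above steps, the dataset labels will have been converted to the unified format used by MMOCR, and the file directory structure is as follows: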
-
-  ```text
-  data/det/icdar2015/
-  ├── annotations
-  │   ├── test
-  │   └── training
-  ├── imgs
-  │   ├── test
-  │   └── training
-  ├── instances_test.json
-  └── instances_training.json
-  ```
+```text
+data/icdar2015
+├── textdet_imgs
+│   ├── test
+│   └── train
+├── textdet_test.json
+└── textdet_train.json
+```

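 ## Dataset Configuration

 ### Single Dataset Training

-When using a new dataset, we need to configure basic information such as the paths to its images and annotation files. The path `configs/xxx/_base_/datasets/` is pre-configured with the commonly used datasets in MMOCR. Here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`):
+When using a new dataset, we need to configure basic information such as the paths to its images and annotation files. The path `configs/xxx/_base_/datasets/` is pre-configured with the commonly used datasets in MMOCR (when you use `prepare_dataset.py` to prepare a dataset, this config file is usually generated automatically once the dataset is ready). Here we take the ICDAR 2015 dataset as an example (see `configs/_base_/det_datasets/icdar2015.py`):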

 ```Python
-ic15_det_data_root = 'data/det/icdar2015' # dataset root path
+ic15_det_data_root = 'data/icdar2015' # dataset root path

 # Train set config
 ic15_det_train = dict(