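# Dataset Preparation

This page lists the datasets commonly used for four text-related tasks (text detection, text recognition, key information extraction, and named entity recognition), along with their download links.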

<!-- TOC -->

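- [Dataset Preparation](#dataset-preparation)
  - [Text Detection](#text-detection)
  - [Text Recognition](#text-recognition)
  - [Key Information Extraction](#key-information-extraction)
  - [Named Entity Recognition](#named-entity-recognition)
    - [CLUENER2020](#cluener2020)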

<!-- /TOC -->


## Text Detection

Datasets for text detection should be organized as follows:

```text
├── ctw1500
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2015
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2017
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
├── textocr
│   ├── train
│   ├── instances_training.json
│   └── instances_val.json
├── totaltext
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```
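
The layout above can be checked before training. The following is a minimal Python sketch, not part of MMOCR: the helper `find_missing` and the constant `ICDAR2015_LAYOUT` are hypothetical names introduced here for illustration, and `data` is a placeholder for whichever directory holds your detection datasets.

```python
from pathlib import Path


def find_missing(root, expected):
    """Return the relative paths in `expected` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


# Relative paths copied from the directory tree above (ICDAR2015 entries only).
ICDAR2015_LAYOUT = [
    "icdar2015/imgs",
    "icdar2015/instances_test.json",
    "icdar2015/instances_training.json",
]

if __name__ == "__main__":
    # "data" is a placeholder; point it at your own dataset root.
    for rel in find_missing("data", ICDAR2015_LAYOUT):
        print(f"missing: {rel}")
```

The same function works for the other datasets if you pass a list built from their entries in the tree.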

|  Dataset  | Images | Supplementary data | Annotation (training) | Annotation (validation) | Annotation (testing) |
| :-------: | :----: | :----------------: | :-------------------: | :---------------------: | :------------------: |
|  CTW1500  | [download](https://github.com/Yuliang-Liu/Curve-Text-Detector) | - | - | - | - |
| ICDAR2015 | [download](https://rrc.cvc.uab.es/?ch=4&com=downloads) | - | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) |
| ICDAR2017 | [download](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [renamed_imgs](https://download.openmmlab.com/mmocr/data/icdar2017/renamed_imgs.tar) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |
| Synthtext | [download](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | - | [instances_training.lmdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb) | - | - |
|  TextOCR  | [download](https://textvqa.org/textocr/dataset) | - | - | - | - |
| Totaltext | [download](https://github.com/cs-chan/Total-Text-Dataset) | - | - | - | - |

- For `icdar2015`:
  - Step 1: Download `ch4_training_images.zip`, `ch4_test_images.zip`, `ch4_training_localization_transcription_gt.zip`, and `Challenge4_Test_Task1_GT.zip` from the [download page](https://rrc.cvc.uab.es/?ch=4&com=downloads). These are the training images, test images, training annotations, and test annotations, respectively.
  - Step 2: Run the following commands to move the extracted data into the corresponding folders:
  ```bash
  mkdir icdar2015 && cd icdar2015
  mkdir imgs && mkdir annotations
  # Move the images into place
  mv ch4_training_images imgs/training
  mv ch4_test_images imgs/test
  # Move the annotations into place
  mv ch4_training_localization_transcription_gt annotations/training
  mv Challenge4_Test_Task1_GT annotations/test
  ```
  - Step 3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) and [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) and place them in the `icdar2015` folder. Alternatively, generate `instances_training.json` and `instances_test.json` directly with the following command:
  ```bash
  python tools/data/textdet/icdar_converter.py /path/to/icdar2015 -o /path/to/icdar2015 -d icdar2015 --split-list training test
  ```

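- For `icdar2017`:
  - Some `.jpg` images are distorted by rotation when loaded with OpenCV, so we converted the images of the original dataset to `.png` format. Download them here: [renamed_images](https://download.openmmlab.com/mmocr/data/icdar2017/renamed_imgs.tar). After downloading, copy the `.png` images into the `imgs` folder.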

- For `ctw1500`:
  - Step 1: Run the following commands to download `train_images.zip`, `test_images.zip`, `train_labels.zip`, and `test_labels.zip` from the [download page](https://github.com/Yuliang-Liu/Curve-Text-Detector) and arrange them in the corresponding directories:

  ```bash
  mkdir ctw1500 && cd ctw1500
  mkdir imgs && mkdir annotations

  # Download and arrange the annotations
  cd annotations
  wget -O train_labels.zip https://universityofadelaide.box.com/shared/static/jikuazluzyj4lq6umzei7m2ppmt3afyw.zip
  wget -O test_labels.zip https://cloudstor.aarnet.edu.au/plus/s/uoeFl0pCN9BOCN5/download
  unzip train_labels.zip && mv ctw1500_train_labels training
  unzip test_labels.zip -d test
  cd ..
  # Download and arrange the images
  cd imgs
  wget -O train_images.zip https://universityofadelaide.box.com/shared/static/py5uwlfyyytbb2pxzq9czvu6fuqbjdh8.zip
  wget -O test_images.zip https://universityofadelaide.box.com/shared/static/t4w48ofnqkdw7jyc4t11nsukoeqk9c3d.zip
  unzip train_images.zip && mv train_images training
  unzip test_images.zip && mv test_images test
  ```
  - Step 2: Run the following command to generate `instances_training.json` and `instances_test.json`:

  ```bash
  python tools/data/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
  ```

- For `TextOCR`:
  - Step 1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json), and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) into the `textocr` folder.
  ```bash
  mkdir textocr && cd textocr

  # Download the TextOCR dataset
  wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
  wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
  wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

  # Move the images into place
  unzip -q train_val_images.zip
  mv train_images train
  ```

  - Step 2: Generate `instances_training.json` and `instances_val.json`:
  ```bash
  python tools/data/textdet/textocr_converter.py /path/to/textocr
  ```

- For `Totaltext`:
  - Step 1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text). (We recommend downloading the annotations in `.mat` format, because the conversion script we provide, `totaltext_converter.py`, supports only `.mat` files.)
  ```bash
  mkdir totaltext && cd totaltext
  mkdir imgs && mkdir annotations

  # Images
  # Run inside ./totaltext
  unzip totaltext.zip
  mv Images/Train imgs/training
  mv Images/Test imgs/test

  # Annotations
  unzip groundtruth_text.zip
  cd Groundtruth
  mv Polygon/Train ../annotations/training
  mv Polygon/Test ../annotations/test
  ```
  - Step 2: Generate `instances_training.json` and `instances_test.json` with the following command:
  ```bash
  python tools/data/textdet/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test
  ```
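
After a converter script has produced an `instances_*.json` file, a quick look at its top-level structure can catch an empty or truncated output. The sketch below makes no assumption about the annotation schema: it only loads the JSON and reports each top-level key with its size. The function name `summarize_json` is hypothetical and not part of MMOCR.

```python
import json


def summarize_json(path):
    """Map each top-level key to its length (lists/dicts) or to its type name."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, dict):
        # A non-dict root is reported under a single placeholder key.
        size = len(data) if isinstance(data, list) else type(data).__name__
        return {"<root>": size}
    return {
        key: len(value) if isinstance(value, (list, dict)) else type(value).__name__
        for key, value in data.items()
    }
```

For example, `summarize_json("/path/to/icdar2015/instances_training.json")` returns a small dictionary you can print and compare between the training and test files.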

## Text Recognition

Datasets for text recognition should be organized as follows:

```text
├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add
│   ├── TextOCR
│   │   ├── image
│   │   ├── train_label.txt
│   │   ├── val_label.txt
│   ├── Totaltext
│   │   ├── imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
```

|  Dataset   | Images | Annotation (training) | Annotation (test) |
| :--------: | :----: | :-------------------: | :---------------: |
| coco_text  | [download](https://rrc.cvc.uab.es/?ch=5&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt) | - |
| icdar_2011 | [download](http://www.cvc.uab.es/icdar2011competition/?com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) | - |
| icdar_2013 | [download](https://rrc.cvc.uab.es/?ch=2&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt) | [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) |
| icdar_2015 | [download](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt) |
|   IIIT5K   | [download](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt) |
|    ct80    | - | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt) |
|    svt     | [download](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset) | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svt/test_label.txt) |
|    svtp    | - | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svtp/test_label.txt) |
|   Syn90k   | [download](https://www.robots.ox.ac.uk/~vgg/data/text/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/shuffle_labels.txt) \| [label.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/label.txt) | - |
| SynthText  | [download](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/shuffle_labels.txt) \| [instances_train.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/instances_train.txt) \| [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/label.txt) | - |
|  SynthAdd  | [SynthText_Add.zip](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg) (code: 627x) | [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt) | - |
|  TextOCR   | [download](https://textvqa.org/textocr/dataset) | - | - |
| Totaltext  | [download](https://github.com/cs-chan/Total-Text-Dataset) | - | - |

- For `icdar_2013`:
  - Step 1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from the [download page](https://rrc.cvc.uab.es/?ch=2&com=downloads)
  - Step 2: Download [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) and [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt)
- For `icdar_2015`:
  - Step 1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from the [download page](https://rrc.cvc.uab.es/?ch=4&com=downloads)
  - Step 2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt)
- For `IIIT5K`:
  - Step 1: Download `IIIT5K-Word_V3.0.tar.gz` from the [download page](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html)
  - Step 2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt)
- For `svt`:
  - Step 1: Download `svt.zip` from the [download page](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset)
  - Step 2: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svt/test_label.txt)
  - Step 3: Run the conversion script:
  ```bash
  python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
  ```
- For `ct80`:
  - Step 1: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt)
- For `svtp`:
  - Step 1: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svtp/test_label.txt)
- For `coco_text`:
  - Step 1: Download the files from the [download page](https://rrc.cvc.uab.es/?ch=5&com=downloads)
  - Step 2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt)
- For `Syn90k`:
  - Step 1: Download `mjsynth.tar.gz` from the [download page](https://www.robots.ox.ac.uk/~vgg/data/text/)
  - Step 2: Download [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/shuffle_labels.txt)
  - Step 3: Extract the archive and link the dataset into `data/mixture`:

  ```bash
  mkdir Syn90k && cd Syn90k

  mv /path/to/mjsynth.tar.gz .

  tar -xzf mjsynth.tar.gz

  mv /path/to/shuffle_labels.txt .

  # Create a symbolic link
  cd /path/to/mmocr/data/mixture

  ln -s /path/to/Syn90k Syn90k
  ```

- For `SynthText`:
  - Step 1: Download `SynthText.zip` from the [download page](https://www.robots.ox.ac.uk/~vgg/data/scenetext/)
  - Step 2: Extract the archive and link the dataset into `data/mixture`:

  ```bash
  mkdir SynthText && cd SynthText
  mv /path/to/SynthText.zip .
  unzip SynthText.zip
  mv SynthText synthtext

  mv /path/to/shuffle_labels.txt .

  # Create a symbolic link
  cd /path/to/mmocr/data/mixture
  ln -s /path/to/SynthText SynthText
  ```
  - Step 3: Generate the cropped images and labels:

  ```bash
  cd /path/to/mmocr

  python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8
  ```

- For `SynthAdd`:
  - Step 1: Download `SynthText_Add.zip` from [SynthAdd](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg) (code: 627x)
  - Step 2: Download [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt)
  - Step 3: Extract the archive and link the dataset into `data/mixture`:

  ```bash
  mkdir SynthAdd && cd SynthAdd

  mv /path/to/SynthText_Add.zip .

  unzip SynthText_Add.zip

  mv /path/to/label.txt .

  # Create a symbolic link
  cd /path/to/mmocr/data/mixture

  ln -s /path/to/SynthAdd SynthAdd
  ```

**Note:**
To convert a label file from `.txt` to `.lmdb` format, run:

```bash
python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>
```

For example:

```bash
python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
```

- For `TextOCR`:
  - Step 1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json), and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) into the `textocr/` directory.
  ```bash
  mkdir textocr && cd textocr

  # Download the TextOCR dataset
  wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
  wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
  wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

  # Images
  unzip -q train_val_images.zip
  mv train_images train
  ```
  - Step 2: Crop the images and generate `train_label.txt` and `val_label.txt` using four parallel processes with the following command:
  ```bash
  python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
  ```


- For `Totaltext`:
  - Step 1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset), then download `groundtruth_text.zip` from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text). (We recommend downloading the annotations in `.mat` format, because the conversion tool we provide, `totaltext_converter.py`, supports only `.mat` files.)
  ```bash
  mkdir totaltext && cd totaltext
  mkdir imgs && mkdir annotations

  # Images
  # Run inside ./totaltext
  unzip totaltext.zip
  mv Images/Train imgs/training
  mv Images/Test imgs/test

  # Annotations
  unzip groundtruth_text.zip
  cd Groundtruth
  mv Polygon/Train ../annotations/training
  mv Polygon/Test ../annotations/test
  ```
  - Step 2: Generate the label files for the cropped images, `train_label.txt` and `test_label.txt`, with the following command (the cropped images are saved to `data/totaltext/dst_imgs/`):
  ```bash
  python tools/data/textrecog/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test
  ```
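
The `*_label.txt` files used throughout this section can be inspected with a few lines of Python. The sketch below rests on an assumption that this page does not state: each non-empty line holds an image path followed by its transcription, separated by whitespace. Check a few lines of your own label file before relying on it. The names `parse_label_line` and `load_labels` are hypothetical helpers, not MMOCR functions.

```python
def parse_label_line(line):
    """Split one label line into (image path, text) at the first whitespace run.

    Assumed format (not confirmed by this page): "<image path> <text>".
    """
    parts = line.strip().split(maxsplit=1)
    if len(parts) != 2:
        raise ValueError(f"malformed label line: {line!r}")
    return parts[0], parts[1]


def load_labels(path):
    """Read a label file and return a list of (image path, text) pairs."""
    with open(path, encoding="utf-8") as f:
        return [parse_label_line(line) for line in f if line.strip()]
```

Raising on malformed lines, rather than skipping them, makes a broken conversion visible immediately instead of silently shrinking the dataset.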

## Key Information Extraction

Datasets for key information extraction should be organized as follows:

```text
└── wildreceipt
  ├── class_list.txt
  ├── dict.txt
  ├── image_files
  ├── test.txt
  └── train.txt
```

- Download [wildreceipt.tar](https://download.openmmlab.com/mmocr/data/wildreceipt.tar)


## Named Entity Recognition

### CLUENER2020

Datasets for named entity recognition should be organized as follows:

```text
└── cluener2020
  ├── cluener_predict.json
  ├── dev.json
  ├── README.md
  ├── test.json
  ├── train.json
  └── vocab.txt

```

- Download [cluener_public.zip](https://storage.googleapis.com/cluebenchmark/tasks/cluener_public.zip)

- Download [vocab.txt](https://download.openmmlab.com/mmocr/data/cluener2020/vocab.txt), then move `vocab.txt` into the `cluener2020` folder
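
The two CLUENER2020 steps above can be scripted. This is a minimal sketch under stated assumptions: both files are already downloaded, the archive stores its files at the top level (so they land directly in the output folder), and `vocab.txt` is copied instead of moved so the download stays in place. `prepare_cluener` is a hypothetical helper, not an MMOCR tool.

```python
import shutil
import zipfile
from pathlib import Path


def prepare_cluener(zip_path, vocab_path, out_dir):
    """Extract the archive into `out_dir`, add vocab.txt, and list the result."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # Assumes the archive keeps its files at the top level.
        zf.extractall(out)
    shutil.copy(vocab_path, out / "vocab.txt")
    return sorted(p.name for p in out.iterdir())
```

Comparing the returned names with the directory tree above shows whether anything is missing, for example `prepare_cluener("cluener_public.zip", "vocab.txt", "cluener2020")`.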
-												fix #347: add docs_zh_CN/datasets.md (#362)

* fix #347: add docs_zh_CN/datasets.md

* fix #347: translation correction
											
										
										
											2021-07-13 12:54:47 +08:00
+								# 配置数据集
-												[Docs] Add cn docs framework   (#353)

* add CN demo docs

* add deployment.md to docs

* add placeholder CN docs

* Add language switching hint
											
										
										
											2021-07-07 14:13:27 +08:00
-												fix #347: add docs_zh_CN/datasets.md (#362)

* fix #347: add docs_zh_CN/datasets.md

* fix #347: translation correction
											
										
										
											2021-07-13 12:54:47 +08:00
+								本页列出了在文字检测、文字识别、关键信息提取、命名实体识别四个文本类任务中常用的数据集以及下载链接。
-												[Docs] Add cn docs framework   (#353)

* add CN demo docs

* add deployment.md to docs

* add placeholder CN docs

* Add language switching hint
											
										
										
											2021-07-07 14:13:27 +08:00
 								<!-- TOC -->
-												fix #347: add docs_zh_CN/datasets.md (#362)

* fix #347: add docs_zh_CN/datasets.md

* fix #347: translation correction
											
										
										
											2021-07-13 12:54:47 +08:00
+								- [数据集](#数据集)
 								  - [文字检测](#文字检测)
 								  - [文字识别](#文字识别)
 								  - [关键信息提取](#关键信息提取)
 								  - [命名实体识别（专名识别）](#命名实体识别（专名识别）)
-												[Docs] Add cn docs framework   (#353)

* add CN demo docs

* add deployment.md to docs

* add placeholder CN docs

* Add language switching hint
											
										
										
											2021-07-07 14:13:27 +08:00
+								    - [CLUENER2020](#cluener2020)
 								<!-- /TOC -->
-												fix #347: add docs_zh_CN/datasets.md (#362)

* fix #347: add docs_zh_CN/datasets.md

* fix #347: translation correction
											
										
										
											2021-07-13 12:54:47 +08:00
+								## 文字检测
 								文字检测任务的数据集应按如下目录配置：
-												[Docs] Add cn docs framework   (#353)

* add CN demo docs

* add deployment.md to docs

* add placeholder CN docs

* Add language switching hint
											
										
										
											2021-07-07 14:13:27 +08:00
 								```text
 								├── ctw1500
 								│   ├── annotations
 								│   ├── imgs
 								│   ├── instances_test.json
 								│   └── instances_training.json
 								├── icdar2015
 								│   ├── imgs
 								│   ├── instances_test.json
 								│   └── instances_training.json
 								├── icdar2017
 								│   ├── imgs
 								│   ├── instances_training.json
 								│   └── instances_val.json
 								├── synthtext
 								│   ├── imgs
 								│   └── instances_training.lmdb
 								├── textocr
 								│   ├── train
 								│   ├── instances_training.json
 								│   └── instances_val.json
 								├── totaltext
 								│   ├── imgs
 								│   ├── instances_test.json
 								│   └── instances_training.json
 								```
-												fix #347: add docs_zh_CN/datasets.md (#362)

* fix #347: add docs_zh_CN/datasets.md

* fix #347: translation correction
											
										
										
											2021-07-13 12:54:47 +08:00
+								|  数据集名称  |                             数据图片                             |                                      补充数据                                     |                                                                                                     |               标注文件                  |                                                                                                |
 								| :---------: | :----------------------------------------------------------: | :----------------------------------------------------------------------------------: | :----------------------------------------------------------------------------------------------------: | :-------------------------------------: | :--------------------------------------------------------------------------------------------: |
 								|           |                                                                |                                                                                      |                                                训练集 (training)                                                |               验证集 (validation)                |                                           测试集 (testing)                                             |       |
 								|  CTW1500  | [下载地址](https://github.com/Yuliang-Liu/Curve-Text-Detector) |                                        |                    -                    |                    -                    |                    -                    |
 								| ICDAR2015 | [下载地址](https://rrc.cvc.uab.es/?ch=4&com=downloads)     |                                                                                      | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) |                    -                    | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) |
 								| ICDAR2017 | [下载地址](https://rrc.cvc.uab.es/?ch=8&com=downloads)     | [renamed_imgs](https://download.openmmlab.com/mmocr/data/icdar2017/renamed_imgs.tar) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |       |       |
 								| Synthtext | [下载地址](https://www.robots.ox.ac.uk/~vgg/data/scenetext/)  |                                                                                      | [instances_training.lmdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb) |                    -                    |
 								| TextOCR | [下载地址](https://textvqa.org/textocr/dataset)  |                                                                                      | - |                    -                    | -
 								| Totaltext | [下载地址](https://github.com/cs-chan/Total-Text-Dataset)  |                                                                                      | - |                    -                    | -
 								- `icdar2015` 数据集：
 								  - 第一步：从[下载地址](https://rrc.cvc.uab.es/?ch=4&com=downloads)下载 `ch4_training_images.zip`、`ch4_test_images.zip`、`ch4_training_localization_transcription_gt.zip`、`Challenge4_Test_Task1_GT.zip` 四个文件，分别对应训练集数据、测试集数据、训练集标注、测试集标注。
 								  - 第二步：运行以下命令，移动数据集到对应文件夹
-												[Docs] Add cn docs framework   (#353)

* add CN demo docs

* add deployment.md to docs

* add placeholder CN docs

* Add language switching hint
											
										
										
											2021-07-07 14:13:27 +08:00
+								  ```bash
 								  mkdir icdar2015 && cd icdar2015
 								  mkdir imgs && mkdir annotations
-												fix #347: add docs_zh_CN/datasets.md (#362)

* fix #347: add docs_zh_CN/datasets.md

* fix #347: translation correction
											
										
										
											2021-07-13 12:54:47 +08:00
+								  # 移动数据到目录：
-												[Docs] Add cn docs framework   (#353)

* add CN demo docs

* add deployment.md to docs

* add placeholder CN docs

* Add language switching hint
											
										
										
											2021-07-07 14:13:27 +08:00
+								  mv ch4_training_images imgs/training
 								  mv ch4_test_images imgs/test
-												fix #347: add docs_zh_CN/datasets.md (#362)

* fix #347: add docs_zh_CN/datasets.md

* fix #347: translation correction
											
										
										
											2021-07-13 12:54:47 +08:00
+								  # 移动标注到目录：
-												[Docs] Add cn docs framework   (#353)

* add CN demo docs

* add deployment.md to docs

* add placeholder CN docs

* Add language switching hint
											
										
										
											2021-07-07 14:13:27 +08:00
+								  mv ch4_training_localization_transcription_gt annotations/training
 								  mv Challenge4_Test_Task1_GT annotations/test
 								  ```
-												fix #347: add docs_zh_CN/datasets.md (#362)

* fix #347: add docs_zh_CN/datasets.md

* fix #347: translation correction
											
										
										
  - Step 3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) and [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) and place them in the `icdar2015` folder. Alternatively, generate `instances_training.json` and `instances_test.json` directly with the following command:
  ```bash
  python tools/data/textdet/icdar_converter.py /path/to/icdar2015 -o /path/to/icdar2015 -d icdar2015 --split-list training test
  ```
- `icdar2017` dataset:
  - Loading `.jpg` files with OpenCV introduces rotation distortion, so we converted the images of the original dataset to `.png` format. Download them here: [renamed_images](https://download.openmmlab.com/mmocr/data/icdar2017/renamed_imgs.tar). After downloading, copy the `.png` images into the `imgs` folder.
- `ctw1500` dataset:
  - Step 1: Run the following commands to download `train_images.zip`, `test_images.zip`, `train_labels.zip`, and `test_labels.zip` from the [download page](https://github.com/Yuliang-Liu/Curve-Text-Detector) and arrange them in the corresponding directories:
  ```bash
  mkdir ctw1500 && cd ctw1500
  mkdir imgs && mkdir annotations
  # Download and set up the annotations
  cd annotations
  wget -O train_labels.zip https://universityofadelaide.box.com/shared/static/jikuazluzyj4lq6umzei7m2ppmt3afyw.zip
  wget -O test_labels.zip https://cloudstor.aarnet.edu.au/plus/s/uoeFl0pCN9BOCN5/download
  unzip train_labels.zip && mv ctw1500_train_labels training
  unzip test_labels.zip -d test
  cd ..
  # Download and set up the images
  cd imgs
  wget -O train_images.zip https://universityofadelaide.box.com/shared/static/py5uwlfyyytbb2pxzq9czvu6fuqbjdh8.zip
  wget -O test_images.zip https://universityofadelaide.box.com/shared/static/t4w48ofnqkdw7jyc4t11nsukoeqk9c3d.zip
  unzip train_images.zip && mv train_images training
  unzip test_images.zip && mv test_images test
  ```
  - Step 2: Run the following command to generate `instances_training.json` and `instances_test.json`:
  ```bash
  python tools/data/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
  ```
- `TextOCR` dataset:
  - Step 1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json), and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) into the `textocr` folder.
  ```bash
  mkdir textocr && cd textocr
  # Download the TextOCR dataset
  wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
  wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
  wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json
  # Move the images into place
  unzip -q train_val_images.zip
  mv train_images train
  ```
  - Step 2: Generate `instances_training.json` and `instances_val.json`:
  ```bash
  python tools/data/textdet/textocr_converter.py /path/to/textocr
  ```
- `Totaltext` dataset:
  - Step 1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text). (We recommend downloading the annotations in `.mat` format, because the conversion script we provide, `totaltext_converter.py`, supports only `.mat` files.)
  ```bash
  mkdir totaltext && cd totaltext
  mkdir imgs && mkdir annotations
  # Images
  # Run inside ./totaltext
  unzip totaltext.zip
  mv Images/Train imgs/training
  mv Images/Test imgs/test
  # Annotations
  unzip groundtruth_text.zip
  cd Groundtruth
  mv Polygon/Train ../annotations/training
  mv Polygon/Test ../annotations/test
  ```
  - Step 2: Generate `instances_training.json` and `instances_test.json` with the following command:
  ```bash
  python tools/data/textdet/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test
  ```
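
Before running a converter with `--split-list training test`, it can help to confirm that the `imgs/<split>` and `annotations/<split>` folders created by the steps above are in place. The following is only a minimal sketch and is not part of MMOCR: the helper name `missing_split_dirs` is hypothetical, and it assumes the `imgs`/`annotations` layout used for `icdar2015`, `ctw1500`, and `totaltext` in this section.

```python
from pathlib import Path


def missing_split_dirs(root, splits=("training", "test")):
    """Return the expected sub-folders that do not exist under ``root``.

    Assumes the ``imgs/<split>`` and ``annotations/<split>`` layout
    produced by the shell steps in this section. This helper is
    hypothetical and is not shipped with MMOCR.
    """
    root = Path(root)
    expected = [root / kind / split
                for kind in ("imgs", "annotations")
                for split in splits]
    # Report paths relative to ``root`` so the result is easy to read.
    return [p.relative_to(root).as_posix()
            for p in expected if not p.is_dir()]
```

For example, `missing_split_dirs("/path/to/totaltext")` returns an empty list when all four folders exist, and otherwise lists the ones that still need to be created.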

## Text Recognition

**Arrange the datasets for the text recognition task in the following directory layout:**

```text
├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add
│   ├── TextOCR
│   │   ├── image
│   │   ├── train_label.txt
│   │   ├── val_label.txt
│   ├── Totaltext
│   │   ├── imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
```

| Dataset | Images | Annotation files (training) | Annotation files (test) |
| :---: | :---: | :---: | :---: |
| coco_text | [download page](https://rrc.cvc.uab.es/?ch=5&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt) | - |
| icdar_2011 | [download page](http://www.cvc.uab.es/icdar2011competition/?com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) | - |
| icdar_2013 | [download page](https://rrc.cvc.uab.es/?ch=2&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt) | [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) |
| icdar_2015 | [download page](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt) |
| IIIT5K | [download page](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt) |
| ct80 | - | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt) |
| svt | [download page](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset) | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svt/test_label.txt) |
| svtp | - | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svtp/test_label.txt) |
| Syn90k | [download page](https://www.robots.ox.ac.uk/~vgg/data/text/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/shuffle_labels.txt) \| [label.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/label.txt) | - |
| SynthText | [download page](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/shuffle_labels.txt) \| [instances_train.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/instances_train.txt) \| [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/label.txt) | - |
| SynthAdd | [SynthText_Add.zip](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg) (code:627x) | [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt) | - |
| TextOCR | [download page](https://textvqa.org/textocr/dataset) | - | - |
| Totaltext | [download page](https://github.com/cs-chan/Total-Text-Dataset) | - | - |

- `icdar_2013` dataset:
  - Step 1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from the [download page](https://rrc.cvc.uab.es/?ch=2&com=downloads)
  - Step 2: Download [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) and [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt)
- `icdar_2015` dataset:
  - Step 1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from the [download page](https://rrc.cvc.uab.es/?ch=4&com=downloads)
  - Step 2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt)
- `IIIT5K` dataset:
  - Step 1: Download `IIIT5K-Word_V3.0.tar.gz` from the [download page](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html)
  - Step 2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt)
- `svt` dataset:
  - Step 1: Download `svt.zip` from the [download page](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset)
  - Step 2: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svt/test_label.txt)
  - Step 3: Run the following command:
  ```bash
  python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
  ```
- `ct80` dataset:
  - Step 1: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt)
- `svtp` dataset:
  - Step 1: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svtp/test_label.txt)
- `coco_text` dataset:
  - Step 1: Download the files from the [download page](https://rrc.cvc.uab.es/?ch=5&com=downloads)
  - Step 2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt)
- `Syn90k` dataset:
  - Step 1: Download `mjsynth.tar.gz` from the [download page](https://www.robots.ox.ac.uk/~vgg/data/text/)
  - Step 2: Download [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/shuffle_labels.txt)
  - Step 3: Run the following commands:
  ```bash
  mkdir Syn90k && cd Syn90k
  mv /path/to/mjsynth.tar.gz .
  tar -xzf mjsynth.tar.gz
  mv /path/to/shuffle_labels.txt .
  # Create a symbolic link
  cd /path/to/mmocr/data/mixture
  ln -s /path/to/Syn90k Syn90k
  ```
- `SynthText` dataset:
  - Step 1: Download `SynthText.zip` from the [download page](https://www.robots.ox.ac.uk/~vgg/data/scenetext/)
  - Step 2: Run the following commands:
  ```bash
  mkdir SynthText && cd SynthText
  mv /path/to/SynthText.zip .
  unzip SynthText.zip
  mv SynthText synthtext
  mv /path/to/shuffle_labels.txt .
  # Create a symbolic link
  cd /path/to/mmocr/data/mixture
  ln -s /path/to/SynthText SynthText
  ```
  - Step 3: Generate the cropped images and labels:
  ```bash
  cd /path/to/mmocr
  python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8
  ```
- `SynthAdd` dataset:
  - Step 1: Download `SynthText_Add.zip` from [SynthAdd](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg) (code:627x)
  - Step 2: Download [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt)
  - Step 3: Run the following commands:
  ```bash
  mkdir SynthAdd && cd SynthAdd
  mv /path/to/SynthText_Add.zip .
  unzip SynthText_Add.zip
  mv /path/to/label.txt .
  # Create a symbolic link
  cd /path/to/mmocr/data/mixture
  ln -s /path/to/SynthAdd SynthAdd
  ```
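
The `ln -s` steps for `Syn90k`, `SynthText`, and `SynthAdd` all follow the same pattern. As a rough sketch only, the same link could be created from Python with the standard library. `link_dataset` is a hypothetical helper name and is not part of MMOCR, and the paths in the usage note are the placeholders from the commands in this section.

```python
import os


def link_dataset(dataset_dir, mixture_dir, name):
    """Create ``mixture_dir/name`` as a symbolic link to ``dataset_dir``.

    Mirrors ``cd mixture_dir && ln -s dataset_dir name``. Hypothetical
    helper, not part of MMOCR. Returns the path of the link.
    """
    link_path = os.path.join(mixture_dir, name)
    target = os.path.abspath(dataset_dir)
    if os.path.islink(link_path):
        # Re-running is harmless when the link already points at the target.
        if os.path.realpath(link_path) == os.path.realpath(target):
            return link_path
        raise FileExistsError(f"{link_path} links somewhere else")
    if os.path.exists(link_path):
        raise FileExistsError(f"{link_path} exists and is not a link")
    os.makedirs(mixture_dir, exist_ok=True)
    os.symlink(target, link_path, target_is_directory=True)
    return link_path
```

For example, `link_dataset("/path/to/Syn90k", "/path/to/mmocr/data/mixture", "Syn90k")` corresponds to the `Syn90k` commands. Unlike a bare `ln -s`, this sketch refuses to overwrite an existing file or a link that points elsewhere.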

**Note:**
To convert a `.txt` annotation file to `.lmdb` format, run:

```bash
python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>
```

For example:

```bash
python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
```
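
The directory layout for text recognition lists both `label.txt` and `label.lmdb` for `Syn90k`, `SynthText`, and `SynthAdd`, so the same conversion may need to be run once per dataset. The sketch below only builds the command lines, using the `-i` and `-o` options shown in this note. The helper name `txt2lmdb_commands` is hypothetical and not part of MMOCR, and the default `data/mixture` root is an assumption taken from the example path.

```python
import posixpath


def txt2lmdb_commands(mixture_root="data/mixture",
                      datasets=("Syn90k", "SynthText", "SynthAdd")):
    """Build one ``txt2lmdb.py`` command (as an argv list) per dataset.

    Only the ``-i``/``-o`` options from this note are used. The dataset
    names come from the directory layout in this section; the helper
    itself is hypothetical and not part of MMOCR.
    """
    commands = []
    for name in datasets:
        base = posixpath.join(mixture_root, name)
        commands.append([
            "python", "tools/data/utils/txt2lmdb.py",
            "-i", posixpath.join(base, "label.txt"),
            "-o", posixpath.join(base, "label.lmdb"),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the MMOCR repository root. Building argv lists rather than shell strings avoids quoting problems if a path contains spaces.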
-												fix #347: add docs_zh_CN/datasets.md (#362)

* fix #347: add docs_zh_CN/datasets.md

* fix #347: translation correction
											
										
										
											2021-07-13 12:54:47 +08:00
+								- `TextOCR` 数据集：
 								  - 第一步：下载 [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)，[TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) 和 [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) 到 `textocr/` 目录.
-												[Docs] Add cn docs framework   (#353)

* add CN demo docs

* add deployment.md to docs

* add placeholder CN docs

* Add language switching hint
											
										
										
											2021-07-07 14:13:27 +08:00
+								  ```bash
 								  mkdir textocr && cd textocr
-												fix #347: add docs_zh_CN/datasets.md (#362)

* fix #347: add docs_zh_CN/datasets.md

* fix #347: translation correction
											
										
										
											2021-07-13 12:54:47 +08:00
+								  # 下载 TextOCR 数据集
  wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
  wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
  wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json
  # For images
  unzip -q train_val_images.zip
  mv train_images train
  ```
  - Step 2: Crop the images with four parallel processes and generate `train_label.txt` and `val_label.txt` using the following command:
  ```bash
  python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
  ```
- `Totaltext` dataset:
  - Step 1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text). (We recommend downloading the annotations in `.mat` format, because the provided `totaltext_converter.py` conversion tool only supports `.mat` files.)
  ```bash
  mkdir totaltext && cd totaltext
  mkdir imgs && mkdir annotations
  # For images
  # Run in ./totaltext
  unzip totaltext.zip
  mv Images/Train imgs/training
  mv Images/Test imgs/test
  # For annotations
  unzip groundtruth_text.zip
  cd Groundtruth
  mv Polygon/Train ../annotations/training
  mv Polygon/Test ../annotations/test
  ```
  - Step 2: Generate the label files `train_label.txt` and `test_label.txt` for the cropped images with the following command (the cropped images are saved to `data/totaltext/dst_imgs/`):
  ```bash
  python tools/data/textrecog/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test
  ```
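
On systems without `mv`, the Totaltext folder rearrangement from Step 1 can be reproduced in Python. This is a rough sketch, assuming both archives have already been unzipped inside `totaltext/` and produced the `Images/` and `Groundtruth/` folders that the shell commands refer to. `arrange_totaltext` is a hypothetical helper, not an MMOCR tool.

```python
import shutil
from pathlib import Path

# (source, destination) pairs taken from the `mv` commands in Step 1,
# all relative to the totaltext/ directory after both archives are unzipped.
MOVES = [
    ("Images/Train", "imgs/training"),
    ("Images/Test", "imgs/test"),
    ("Groundtruth/Polygon/Train", "annotations/training"),
    ("Groundtruth/Polygon/Test", "annotations/test"),
]


def arrange_totaltext(root):
    """Move the unzipped Totaltext folders into imgs/ and annotations/.

    Returns the list of source folders that were not found, so a partial
    download is reported instead of raising halfway through.
    """
    root = Path(root)
    missing = []
    for src_rel, dst_rel in MOVES:
        src, dst = root / src_rel, root / dst_rel
        if not src.is_dir():
            missing.append(src_rel)
            continue
        # Creates imgs/ or annotations/ when it does not exist yet.
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dst))
    return missing
```

An empty return value from `arrange_totaltext("/path/to/totaltext")` suggests the layout is ready for the converter in Step 2.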

## Key Information Extraction
The dataset for key information extraction should be organized as follows:
```text
└── wildreceipt
  ├── class_list.txt
  ├── dict.txt
  ├── image_files
  ├── test.txt
  └── train.txt
```
- Download [wildreceipt.tar](https://download.openmmlab.com/mmocr/data/wildreceipt.tar)
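
Once downloaded, the archive still has to be unpacked. The snippet below is a minimal sketch using Python's standard `tarfile` module. It assumes, without confirmation from this page, that the archive unpacks into a top-level `wildreceipt/` folder matching the tree above. `extract_wildreceipt` is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def extract_wildreceipt(tar_path, data_root):
    """Unpack wildreceipt.tar under data_root and return the dataset folder.

    Assumption: the archive unpacks into a top-level `wildreceipt/` folder,
    matching the directory tree above; adjust the path if it does not.
    """
    data_root = Path(data_root)
    data_root.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(tar_path) as archive:
        archive.extractall(data_root, **kwargs)
    return data_root / "wildreceipt"
```

For instance, `extract_wildreceipt("wildreceipt.tar", "data")` would leave the files under `data/wildreceipt/` if the assumption about the archive layout holds.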

## Named Entity Recognition

### CLUENER2020

The dataset for named entity recognition should be organized as follows:
```text
└── cluener2020
  ├── cluener_predict.json
  ├── dev.json
  ├── README.md
  ├── test.json
  ├── train.json
  └── vocab.txt
```
- Download [cluener_public.zip](https://storage.googleapis.com/cluebenchmark/tasks/cluener_public.zip)
- Download [vocab.txt](https://download.openmmlab.com/mmocr/data/cluener2020/vocab.txt) and move `vocab.txt` into the `cluener2020` folder
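
After both downloads, it can help to confirm that the folder matches the tree above before training. The following is a small sketch of such a check. The file names are copied from the tree, while `missing_files` and the `data/cluener2020` location are assumptions made for illustration.

```python
from pathlib import Path

# File names copied from the directory tree above.
CLUENER2020_FILES = [
    "cluener_predict.json",
    "dev.json",
    "README.md",
    "test.json",
    "train.json",
    "vocab.txt",
]


def missing_files(dataset_dir, expected=CLUENER2020_FILES):
    """Return the expected entries that are absent from dataset_dir."""
    dataset_dir = Path(dataset_dir)
    return [name for name in expected if not (dataset_dir / name).exists()]


if __name__ == "__main__":
    # `data/cluener2020` is an assumed location; pass your own path.
    absent = missing_files("data/cluener2020")
    print("layout OK" if not absent else f"missing: {absent}")
```

The same function can check the other datasets on this page by passing a different `expected` list.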