# Datasets Preparation
This page lists datasets commonly used for text detection, text recognition, and key information extraction, along with their download links.
<!-- TOC -->
- [Datasets Preparation](#datasets-preparation)
  - [Text Detection](#text-detection)
  - [Text Recognition](#text-recognition)
  - [Key Information Extraction](#key-information-extraction)
<!-- /TOC -->
## Text Detection
The structure of the text detection dataset directory is organized as follows.
```text
├── ctw1500
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2015
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2017
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
```
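The layout above can be checked before training with a short shell function. This is only a sketch: `check_det_layout` is a hypothetical helper, not an MMOCR tool, and it only tests that each entry in the tree exists under the data root you pass in.

```bash
# Hypothetical helper, not part of MMOCR: list the expected text detection
# entries that are missing under a given data root.
check_det_layout() {
    local root="$1"
    local missing=0
    local entry
    for entry in \
        ctw1500/imgs ctw1500/instances_test.json ctw1500/instances_training.json \
        icdar2015/imgs icdar2015/instances_test.json icdar2015/instances_training.json \
        icdar2017/imgs icdar2017/instances_training.json icdar2017/instances_val.json \
        synthtext/imgs synthtext/instances_training.lmdb
    do
        # -e follows symbolic links, so linked image folders count as present.
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $root/$entry"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_det_layout /path/to/data` prints one `missing:` line per absent entry and returns a non-zero status if anything is missing.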
| Dataset | Images | Re-saved images | Training annotations | Validation annotations | Test annotations |
| :-------: | :----: | :-------------: | :------------------: | :--------------------: | :--------------: |
| CTW1500 | [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) | - | [instances_training.json](https://download.openmmlab.com/mmocr/data/ctw1500/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/ctw1500/instances_test.json) |
| ICDAR2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | - | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) |
| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [renamed_imgs](https://download.openmmlab.com/mmocr/data/icdar2017/renamed_imgs.tar) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |
| Synthtext | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | - | [instances_training.lmdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb) | - | - |
- For `icdar2015`:
  - Step1: Download `ch4_training_images.zip` and `ch4_test_images.zip` from [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads)
  - Step2: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) and [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json)
  - Step3:

    ```bash
    mkdir icdar2015 && cd icdar2015
    mv /path/to/instances_training.json .
    mv /path/to/instances_test.json .

    # link the unzipped image folders into imgs/
    mkdir imgs && cd imgs
    ln -s /path/to/ch4_training_images training
    ln -s /path/to/ch4_test_images test
    ```
- For `icdar2017`:
  - To avoid rotation effects when loading `jpg` images with OpenCV, we provide the images re-saved in `png` format in [renamed_imgs](https://download.openmmlab.com/mmocr/data/icdar2017/renamed_imgs.tar). You can copy these images to `imgs`.
## Text Recognition
**The structure of the text recognition dataset directory is organized as follows.**
```text
├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add
```
| Dataset | Images | Training annotation file | Test annotation file |
| :--------: | :----: | :----------------------: | :------------------: |
| coco_text | [homepage](https://rrc.cvc.uab.es/?ch=5&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt) | - |
| icdar_2011 | [homepage](http://www.cvc.uab.es/icdar2011competition/?com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) | - |
| icdar_2013 | [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt) | [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) |
| icdar_2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt) |
| IIIT5K | [homepage](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt) |
| ct80 | - | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt) |
| svt | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset) | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svt/test_label.txt) |
| svtp | - | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svtp/test_label.txt) |
| Syn90k | [homepage](https://www.robots.ox.ac.uk/~vgg/data/text/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/shuffle_labels.txt) \| [label.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/label.txt) | - |
| SynthText | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/shuffle_labels.txt) \| [instances_train.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/instances_train.txt) \| [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/label.txt) | - |
| SynthAdd | [SynthText_Add.zip](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg) (code:627x) | [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt) | - |
- For `icdar_2013`:
  - Step1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads)
  - Step2: Download [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) and [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt)
- For `icdar_2015`:
  - Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads)
  - Step2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt)
- For `IIIT5K`:
  - Step1: Download `IIIT5K-Word_V3.0.tar.gz` from [homepage](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html)
  - Step2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt)
- For `svt`:
  - Step1: Download `svt.zip` from [homepage](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset)
  - Step2: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svt/test_label.txt)
  - Step3:

    ```bash
    python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
    ```
- For `ct80`:
  - Step1: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt)
- For `svtp`:
  - Step1: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svtp/test_label.txt)
- For `coco_text`:
  - Step1: Download from [homepage](https://rrc.cvc.uab.es/?ch=5&com=downloads)
  - Step2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt)
- For `Syn90k`:
  - Step1: Download `mjsynth.tar.gz` from [homepage](https://www.robots.ox.ac.uk/~vgg/data/text/)
  - Step2: Download [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/shuffle_labels.txt)
  - Step3:

    ```bash
    mkdir Syn90k && cd Syn90k
    mv /path/to/mjsynth.tar.gz .
    tar -xzf mjsynth.tar.gz
    mv /path/to/shuffle_labels.txt .

    # create soft link
    cd /path/to/mmocr/data/mixture
    ln -s /path/to/Syn90k Syn90k
    ```
- For `SynthText`:
  - Step1: Download `SynthText.zip` from [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/)
  - Step2: Download [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/shuffle_labels.txt)
  - Step3: Download [instances_train.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/instances_train.txt)
  - Step4:

    ```bash
    unzip SynthText.zip
    cd SynthText
    mv /path/to/shuffle_labels.txt .
    # instances_train.txt also belongs here, per the directory tree above
    mv /path/to/instances_train.txt .

    # create soft link
    cd /path/to/mmocr/data/mixture
    ln -s /path/to/SynthText SynthText
    ```
- For `SynthAdd`:
  - Step1: Download `SynthText_Add.zip` from [SynthAdd](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg) (code:627x)
  - Step2: Download [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt)
  - Step3:

    ```bash
    mkdir SynthAdd && cd SynthAdd
    mv /path/to/SynthText_Add.zip .
    unzip SynthText_Add.zip
    mv /path/to/label.txt .

    # create soft link
    cd /path/to/mmocr/data/mixture
    ln -s /path/to/SynthAdd SynthAdd
    ```
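The annotation links in the steps above all follow the pattern `https://download.openmmlab.com/mmocr/data/mixture/<dataset>/<file>`, so the downloads can be scripted. The following is a minimal sketch, not an MMOCR tool: `label_url` and `fetch_label` are hypothetical names, and `wget` is assumed to be installed.

```bash
# Hypothetical helpers, not part of MMOCR.
MIXTURE_URL="https://download.openmmlab.com/mmocr/data/mixture"

# Print the download URL of one annotation file: label_url <dataset> <file>
label_url() {
    echo "$MIXTURE_URL/$1/$2"
}

# Download one annotation file into <data_root>/<dataset>/ (needs wget).
fetch_label() {
    local data_root="$1"
    local dataset="$2"
    local file="$3"
    mkdir -p "$data_root/$dataset" || return 1
    wget -O "$data_root/$dataset/$file" "$(label_url "$dataset" "$file")"
}
```

For example, `fetch_label /path/to/mmocr/data/mixture icdar_2013 train_label.txt` would place the file where the directory tree expects it. Images still have to be downloaded from each dataset's homepage.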
**Note:**
To convert a label file from `txt` format to `lmdb` format, run:

```bash
python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>
```

For example:

```bash
python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
```
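The directory tree lists a `label.txt` and a `label.lmdb` for `Syn90k`, `SynthText` and `SynthAdd`, so the conversion can be looped over all three. This is a sketch under assumptions: `lmdb_path_for` and `convert_synth_labels` are hypothetical names, and the function is meant to be run from the MMOCR repository root so that the relative paths resolve.

```bash
# Hypothetical helper: derive the lmdb output path from a txt label path.
lmdb_path_for() {
    echo "${1%.txt}.lmdb"
}

# Convert label.txt of each synthetic dataset, skipping any that is absent.
# Run from the MMOCR repository root.
convert_synth_labels() {
    local dataset
    local txt
    for dataset in Syn90k SynthText SynthAdd; do
        txt="data/mixture/$dataset/label.txt"
        if [ -f "$txt" ]; then
            python tools/data/utils/txt2lmdb.py -i "$txt" -o "$(lmdb_path_for "$txt")"
        else
            echo "skip: $txt not found"
        fi
    done
}
```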
## Key Information Extraction
The structure of the key information extraction dataset directory is organized as follows.
```text
└── wildreceipt
    ├── anno_files
    ├── class_list.txt
    ├── dict.txt
    ├── image_files
    ├── test.txt
    └── train.txt
```
- Download [wildreceipt.tar](https://download.openmmlab.com/mmocr/data/wildreceipt.tar)
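After downloading, the archive can be unpacked and compared against the tree above. This is a sketch only: `setup_wildreceipt` is a hypothetical helper, and it assumes the archive contains a top-level `wildreceipt/` folder, which this page does not state.

```bash
# Hypothetical helper, not part of MMOCR: unpack wildreceipt.tar under a data
# root and confirm that the entries from the directory tree exist.
setup_wildreceipt() {
    local tarball="$1"
    local data_root="$2"
    local entry
    mkdir -p "$data_root" || return 1
    # Assumption: the archive has a top-level wildreceipt/ folder.
    tar -xf "$tarball" -C "$data_root" || return 1
    for entry in anno_files class_list.txt dict.txt image_files test.txt train.txt; do
        if [ ! -e "$data_root/wildreceipt/$entry" ]; then
            echo "missing: $data_root/wildreceipt/$entry"
            return 1
        fi
    done
}
```

For example, `setup_wildreceipt /path/to/wildreceipt.tar /path/to/mmocr/data` returns a non-zero status and names the first missing entry if the unpacked layout differs.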