mirror of https://github.com/open-mmlab/mmocr.git
183 lines
12 KiB
Markdown
183 lines
12 KiB
Markdown
# Datasets Preparation
|
|
This page lists the datasets which are commonly used in text detection, text recognition and key information extraction, and their download links.
|
|
|
|
## Text Detection
|
|
**The structure of the text detection dataset directory is organized as follows.**
|
|
```
|
|
├── ctw1500
|
|
│ ├── imgs
|
|
│ ├── instances_test.json
|
|
│ └── instances_training.json
|
|
├── icdar2015
|
|
│ ├── imgs
|
|
│ ├── instances_test.json
|
|
│ └── instances_training.json
|
|
├── icdar2017
|
|
│ ├── imgs
|
|
│ ├── instances_training.json
|
|
│ └── instances_val.json
|
|
├── synthtext
|
|
│ ├── imgs
|
|
│ ├── instances_training.json
|
|
│ ├── instances_training.txt
|
|
│ └── instances_training.lmdb
|
|
```
|
|
| Dataset | | Images | | | Annotation Files | | | Note | |
|
|
|:---------:|:-:|:--------------------------:|:-:|:--------------------------------------------:|:---------------------------------------:|:----------------------------------------:|:-:|:----:|---|
|
|
| | | | | training | validation | testing | | | |
|
|
| CTW1500 | | [link](https://github.com/Yuliang-Liu/Curve-Text-Detector) | | [instances_training.json](https://download.openmmlab.com/mmocr/data/ctw1500/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/ctw1500/instances_test.json) | | | |
|
|
| ICDAR2015 | | [link](https://rrc.cvc.uab.es/?ch=4&com=downloads) | | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) | | | |
|
|
| ICDAR2017 | | [link](https://rrc.cvc.uab.es/?ch=8&com=downloads) | | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://openmmlab) | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_test.json) | | | |
|
|
| Synthtext | | [link](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | | [instances_training.json](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.json) [instances_training.txt](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.txt)|-| | | |
|
|
|
|
- For `icdar2015`:
|
|
- Step1: Download `ch4_training_images.zip` and `ch4_test_images.zip` from this [link](https://rrc.cvc.uab.es/?ch=4&com=downloads)
|
|
- Step2: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) and [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json)
|
|
- Step3:
|
|
```bash
|
|
mkdir icdar2015 && cd icdar2015
|
|
mv /path/to/instances_training.json .
|
|
mv /path/to/instances_test.json .
|
|
|
|
mkdir imgs && cd imgs
|
|
ln -s /path/to/ch4_training_images training
|
|
ln -s /path/to/ch4_test_images test
|
|
```
|
|
|
|
## Text Recognition
|
|
**The structure of the text recognition dataset directory is organized as follows.**
|
|
|
|
```
|
|
├── mixture
|
|
│ ├── coco_text
|
|
│ │ ├── train_label.txt
|
|
│ │ ├── train_words
|
|
│ ├── icdar_2011
|
|
│ │ ├── training_label.txt
|
|
│ │ ├── Challenge1_Training_Task3_Images_GT
|
|
│ ├── icdar_2013
|
|
│ │ ├── train_label.txt
|
|
│ │ ├── test_label_1015.txt
|
|
│ │ ├── test_label_1095.txt
|
|
│ │ ├── Challenge2_Training_Task3_Images_GT
|
|
│ │ ├── Challenge2_Test_Task3_Images
|
|
│ ├── icdar_2015
|
|
│ │ ├── train_label.txt
|
|
│ │ ├── test_label.txt
|
|
│ │ ├── ch4_training_word_images_gt
|
|
│ │ ├── ch4_test_word_images_gt
|
|
│ ├── III5K
|
|
│ │ ├── train_label.txt
|
|
│ │ ├── test_label.txt
|
|
│ │ ├── train
|
|
│ │ ├── test
|
|
│ ├── ct80
|
|
│ │ ├── test_label.txt
|
|
│ │ ├── image
|
|
│ ├── svt
|
|
│ │ ├── test_label.txt
|
|
│ │ ├── image
|
|
│ ├── svtp
|
|
│ │ ├── test_label.txt
|
|
│ │ ├── image
|
|
│ ├── Synth90k
|
|
│ │ ├── shuffle_labels.txt
|
|
│ │ ├── mnt
|
|
│ ├── SynthText
|
|
│ │ ├── shuffle_labels.txt
|
|
│ │ ├── instances_train.txt
|
|
│ │ ├── synthtext
|
|
│ ├── SynthAdd
|
|
│ │ ├── label.txt
|
|
│ │ ├── SynthText_Add
|
|
|
|
```
|
|
| Dataset | | images | annotation file | annotation file | Note |
|
|
|:----------:|:-:|:---------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------:|:-------------------------------------------------------------------------------------------------------:|:----:|
|
|
|| | |training | test | |
|
|
| coco_text ||[link](https://rrc.cvc.uab.es/?ch=5&com=downloads) |[train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt) |- | |
|
|
| icdar_2011 ||[link](http://www.cvc.uab.es/icdar2011competition/?com=downloads) |[train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) |- | |
|
|
| icdar_2013 | | [link](https://rrc.cvc.uab.es/?ch=2&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt) | [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) | |
|
|
| icdar_2015 | | [link](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt) | |
|
|
| IIIT5K | | [link](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt) | |
|
|
| ct80 | | - |-|[test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt)||
|
|
| svt | | [link](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset) | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svt/test_label.txt) | |
|
|
| svtp | | - | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svtp/test_label.txt) | |
|
|
| Synth90k | | [link](https://www.robots.ox.ac.uk/~vgg/data/text/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/Synth90k/shuffle_labels.txt) | - | |
|
|
| SynthText | | [link](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/shuffle_labels.txt) | [instances_train.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/instances_train.txt) | - | |
|
|
| SynthAdd | | [link](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/SynthText_Add.zip) | [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt)|- | |
|
|
|
|
- For `icdar_2013`:
|
|
- Step1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from this [link](https://rrc.cvc.uab.es/?ch=2&com=downloads)
|
|
- Step2: Download [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) and [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt)
|
|
- For `icdar_2015`:
|
|
- Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from this [link](https://rrc.cvc.uab.es/?ch=4&com=downloads)
|
|
- Step2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt)
|
|
- For `IIIT5K`:
|
|
- Step1: Download `IIIT5K-Word_V3.0.tar.gz` from this [link](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html)
|
|
- Step2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt)
|
|
- For `svt`:
|
|
- Step1: Download `svt.zip` form this [link](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset)
|
|
- Step2: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svt/test_label.txt)
|
|
- For `ct80`:
|
|
- Step1: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt)
|
|
- For `svtp`:
|
|
- Step1: Download [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/svtp/test_label.txt)
|
|
- For `coco_text`:
|
|
- Step1: Download from this [link](https://rrc.cvc.uab.es/?ch=5&com=downloads)
|
|
- Step2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt)
|
|
|
|
- For `Syn90k`:
|
|
- Step1: Download `mjsynth.tar.gz` from this [link](https://www.robots.ox.ac.uk/~vgg/data/text/)
|
|
- Step2: Download [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/Synth90k/shuffle_labels.txt)
|
|
- Step3:
|
|
```bash
|
|
mkdir Syn90k && cd Syn90k
|
|
|
|
mv /path/to/mjsynth.tar.gz .
|
|
|
|
tar -xzf mjsynth.tar.gz
|
|
|
|
mv /path/to/shuffle_labels.txt .
|
|
|
|
# create soft link
|
|
cd /path/to/mmocr/data/mixture
|
|
|
|
ln -s /path/to/Syn90k Syn90k
|
|
```
|
|
- For `SynthText`:
|
|
- Step1: Download `SynthText.zip` from this [link](https://www.robots.ox.ac.uk/~vgg/data/scenetext/)
|
|
- Step2: Download [shuffle_labels.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/shuffle_labels.txt)
|
|
- Step3: Download [instances_train.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthText/instances_train.txt)
|
|
- Step4:
|
|
```bash
|
|
unzip SynthText.zip
|
|
|
|
cd SynthText
|
|
|
|
mv /path/to/shuffle_labels.txt .
|
|
|
|
# create soft link
|
|
cd /path/to/mmocr/data/mixture
|
|
|
|
ln -s /path/to/SynthText SynthText
|
|
```
|
|
- For `SynthAdd`:
|
|
- Step1: Download `SynthText_Add.zip` from this [link](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/SynthText_Add.zip)
|
|
- Step2: Download [label.txt](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt)
|
|
- Step3:
|
|
```bash
|
|
mkdir SynthAdd && cd SynthAdd
|
|
|
|
mv /path/to/SynthText_Add.zip .
|
|
|
|
unzip SynthText_Add.zip
|
|
|
|
mv /path/to/label.txt .
|
|
|
|
# create soft link
|
|
cd /path/to/mmocr/data/mixture
|
|
|
|
ln -s /path/to/SynthAdd SynthAdd
|
|
```
|