[Docs] Update Instructions for New Data Converters (#900)

* update docs

* fix spaces & add deprecation

* fix funsd

* remove repeated docs

@@ -52,6 +52,8 @@ The structure of the text detection dataset directory is organized as follows.
| :-----: | :------: | :------: | :--------: | :-----: |
|         |          | training | validation | testing |
| CTW1500 | [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) | - | - | - |
| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - | - |
| ICDAR2013 | [homepage](https://rrc.cvc.uab.es/?ch=2) | - | - | - |
| ICDAR2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) |
| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |
| Synthtext | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | instances_training.lmdb ([data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb), [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb)) | - | - |
@@ -63,6 +65,11 @@ The structure of the text detection dataset directory is organized as follows.
| NAF | [homepage](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) | - | - | - |
| SROIE | [homepage](https://rrc.cvc.uab.es/?ch=13) | - | - | - |
| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - |
| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - |
| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - |
| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - |
| COCO Text v2 | [homepage](https://bgshih.github.io/cocotext/) | - | - | - |
| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | - |
## Important Note
@@ -124,6 +131,82 @@ unzip test_images.zip && mv test_images test
python tools/data/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
```
### ICDAR 2011 (Born-Digital Images)
- Step1: Download `Challenge1_Training_Task12_Images.zip`, `Challenge1_Training_Task1_GT.zip`, `Challenge1_Test_Task12_Images.zip`, and `Challenge1_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=1&com=downloads) `Task 1.1: Text Localization (2013 edition)`.
```bash
mkdir icdar2011 && cd icdar2011
mkdir imgs && mkdir annotations
# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate
# For images
unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test
# For annotations
unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test
rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
```
- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/data/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
├── icdar2011
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```
### ICDAR 2013 (Focused Scene Text)
- Step1: Download `Challenge2_Training_Task12_Images.zip`, `Challenge2_Test_Task12_Images.zip`, `Challenge2_Training_Task1_GT.zip`, and `Challenge2_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) `Task 2.1: Text Localization (2013 edition)`.
```bash
mkdir icdar2013 && cd icdar2013
mkdir imgs && mkdir annotations
# Download ICDAR 2013
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task1_GT.zip --no-check-certificate
# For images
unzip -q Challenge2_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge2_Test_Task12_Images.zip -d imgs/test
# For annotations
unzip -q Challenge2_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge2_Test_Task1_GT.zip -d annotations/test
rm Challenge2_Training_Task12_Images.zip && rm Challenge2_Test_Task12_Images.zip && rm Challenge2_Training_Task1_GT.zip && rm Challenge2_Test_Task1_GT.zip
```
- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/data/textdet/ic13_converter.py PATH/TO/icdar2013 --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
├── icdar2013
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```
### SynthText
- Download [data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb) and [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb) to `synthtext/instances_training.lmdb/`.
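- For example, both files can be fetched directly with `wget` into the expected location:
```bash
# Download the pre-built LMDB annotation files for SynthText
mkdir -p synthtext/instances_training.lmdb && cd synthtext/instances_training.lmdb
wget https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb
wget https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb
```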
@@ -356,3 +439,179 @@ rm IIIT-CVid.zip
```bash
python tools/data/textdet/lv_converter.py PATH/TO/lv --nproc 4
```
### IMGUR
- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5) into your local repository to enable **much faster** parallel image downloading.
```bash
mkdir imgur && cd imgur
git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
rm -rf IMGUR5K-Handwriting-Dataset
```
- Step2: Generate `instances_training.json`, `instances_val.json` and `instances_test.json` with the following command:
```bash
python tools/data/textdet/imgur_converter.py PATH/TO/imgur
```
- After running the above commands, the directory structure should be as follows:
```text
├── imgur
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_training.json
│   └── instances_val.json
```
### KAIST
- Step1: Download [KAIST_all.zip](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) to `kaist/`.
```bash
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations
# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip
rm KAIST_all.zip
```
- Step2: Extract zips:
```bash
python tools/data/common/extract_kaist.py PATH/TO/kaist
```
- Step3: Generate `instances_training.json` and `instances_val.json` (optional) with the following command (see the example after the block):
```bash
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
python tools/data/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
```
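- For instance, a run that holds out 20% of the images as a validation set:
```bash
# Example: generate instances_training.json and instances_val.json with an 80/20 split
python tools/data/textdet/kaist_converter.py PATH/TO/kaist --nproc 4 --val-ratio 0.2
```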
- After running the above commands, the directory structure should be as follows:
```text
├── kaist
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```
### MTWI
- Step1: Download `mtwi_2018_train.zip` from [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us).
```bash
mkdir mtwi && cd mtwi
unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations
rm mtwi_2018_train.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command (see the example after the block):
```bash
# The annotations of the MTWI test split are not publicly available; you can split
# off a validation set by adding --val-ratio 0.2
python tools/data/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
```
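- For instance, to carve out a 20% validation set as mentioned above:
```bash
# Example: 80/20 training/validation split
python tools/data/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4 --val-ratio 0.2
```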
- After running the above commands, the directory structure should be as follows:
```text
├── mtwi
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```
### COCO Text v2
- Step1: Download image [train2014.zip](http://images.cocodataset.org/zips/train2014.zip) and annotation [cocotext.v2.zip](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip) to `coco_textv2/`.
```bash
mkdir coco_textv2 && cd coco_textv2
mkdir annotations
# Download COCO Text v2 dataset
wget http://images.cocodataset.org/zips/train2014.zip
wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
unzip -q train2014.zip && unzip -q cocotext.v2.zip
mv train2014 imgs && mv cocotext.v2.json annotations/
rm train2014.zip cocotext.v2.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
```bash
python tools/data/textdet/cocotext_converter.py PATH/TO/coco_textv2
```
- After running the above commands, the directory structure should be as follows:
```text
├── coco_textv2
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
```
### ReCTS
- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).
```bash
mkdir rects && cd rects
# Download ReCTS dataset
# You can also find Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip
mv img imgs && mv gt_unicode annotations
rm ReCTS.zip && rm -rf gt
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command (see the example after the block):
```bash
# The annotations of the ReCTS test split are not publicly available; split
# off a validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/rects/ignores
python tools/data/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
```
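- For instance, to also keep vertical texts for training:
```bash
# Example: same split as above, with vertical texts preserved
python tools/data/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2 --preserve-vertical
```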
- After running the above commands, the directory structure should be as follows:
```text
├── rects
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```


@@ -89,8 +89,8 @@
| :-------: | :------: | :------: | :--: |
|           |          | training | test |
| coco_text | [homepage](https://rrc.cvc.uab.es/?ch=5&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt) | - |
| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - |
| ICDAR2013 | [homepage](https://rrc.cvc.uab.es/?ch=2) | - | - |
| icdar_2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt) |
| IIIT5K | [homepage](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt) |
| ct80 | [homepage](http://cs-chan.com/downloads_CUTE80_dataset.html) | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt) |
@@ -103,24 +103,101 @@
| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | - | - |
| OpenVINO | [Open Images](https://github.com/cvdfoundation/open-images-dataset) | [annotations](https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text) | [annotations](https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text) |
| FUNSD | [homepage](https://guillaumejaume.github.io/FUNSD/) | - | - |
| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - |
| NAF | [homepage](https://github.com/herobd/NAF_dataset) | - | - |
| SROIE | [homepage](https://rrc.cvc.uab.es/?ch=13) | - | - |
| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - |
| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - |
| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - |
| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - |
| COCO Text v2 | [homepage](https://bgshih.github.io/cocotext/) | - | - |
| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - |
(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.
## Preparation Steps
### ICDAR 2011 (Born-Digital Images)
- Step1: Download `Challenge1_Training_Task3_Images_GT.zip`, `Challenge1_Test_Task3_Images.zip`, and `Challenge1_Test_Task3_GT.txt` from [homepage](https://rrc.cvc.uab.es/?ch=1&com=downloads) `Task 1.3: Word Recognition (2013 edition)`.
```bash
mkdir icdar2011 && cd icdar2011
mkdir annotations
# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate
# For images
mkdir crops
unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train
unzip -q Challenge1_Test_Task3_Images.zip -d crops/test
# For annotations
mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
```
- Step2: Convert the original annotations to `train_label.jsonl` and `test_label.jsonl` with the following command:
```bash
python tools/data/textrecog/ic11_converter.py PATH/TO/icdar2011
```
- After running the above commands, the directory structure should be as follows:
```text
├── icdar2011
│ ├── crops
│ ├── train_label.jsonl
│ └── test_label.jsonl
```
### ICDAR 2013 (Focused Scene Text)
- Step1: Download `Challenge2_Training_Task3_Images_GT.zip`, `Challenge2_Test_Task3_Images.zip`, and `Challenge2_Test_Task3_GT.txt` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) `Task 2.3: Word Recognition (2013 edition)`.
```bash
mkdir icdar2013 && cd icdar2013
mkdir annotations
# Download ICDAR 2013
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_GT.txt --no-check-certificate
# For images
mkdir crops
unzip -q Challenge2_Training_Task3_Images_GT.zip -d crops/train
unzip -q Challenge2_Test_Task3_Images.zip -d crops/test
# For annotations
mv Challenge2_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge2_Train_Task3_GT.txt
rm Challenge2_Training_Task3_Images_GT.zip && rm Challenge2_Test_Task3_Images.zip
```
- Step2: Generate `train_label.jsonl` and `test_label.jsonl` with the following command:
```bash
python tools/data/textrecog/ic13_converter.py PATH/TO/icdar2013
```
- After running the above commands, the directory structure should be as follows:
```text
├── icdar2013
│ ├── crops
│ ├── train_label.jsonl
│ └── test_label.jsonl
```
### ICDAR 2013 [Deprecated]
- Step1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads)
- Step2: Download [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) and [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt)
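- A minimal download sketch; the image archives are the same ones fetched in the section above, and the labels come from the links in Step2:
```bash
mkdir icdar2013 && cd icdar2013
# Image archives (same URLs as in the non-deprecated ICDAR 2013 section)
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_Images.zip --no-check-certificate
# Pre-generated labels
wget https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt
wget https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt
```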
### ICDAR 2015
- Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads)
- Step2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt)
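- A minimal sketch of both steps, assuming the image archives follow the same `rrc.cvc.uab.es/downloads/` URL pattern as the other ICDAR challenges in this document:
```bash
mkdir icdar2015 && cd icdar2015
# Image archives (URL pattern assumed; download from the homepage if these differ)
wget https://rrc.cvc.uab.es/downloads/ch4_training_word_images_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch4_test_word_images_gt.zip --no-check-certificate
# Pre-generated labels
wget https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt
wget https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt
```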
### IIIT5K
- Step1: Download `IIIT5K-Word_V3.0.tar.gz` from [homepage](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html)
- Step2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt)
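- A minimal sketch; the archive URL is an assumption based on the homepage path, so grab the real link from the homepage if it differs:
```bash
mkdir iiit5k && cd iiit5k
# Archive URL assumed from the homepage location
wget http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K-Word_V3.0.tar.gz
tar -xzf IIIT5K-Word_V3.0.tar.gz
# Pre-generated labels
wget https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt
wget https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt
```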
@@ -303,33 +380,6 @@ python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mix
python tools/data/textrecog/openvino_converter.py /path/to/openvino 4
```
### DeText
- Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the [homepage](https://rrc.cvc.uab.es/?ch=9).
@@ -471,3 +521,210 @@ rm IIIT-CVid.zip
```bash
python tools/data/textrecog/lv_converter.py PATH/TO/lv
```
### FUNSD
- Step1: Download [dataset.zip](https://guillaumejaume.github.io/FUNSD/dataset.zip) to `funsd/`.
```bash
mkdir funsd && cd funsd
# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip
# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/
# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test
rm dataset.zip && rm -rf dataset
```
- Step2: Generate `train_label.txt` and `test_label.txt` and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts, as in the example below):
```bash
python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
```
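- For instance, to keep the images containing vertical texts:
```bash
# Example: crop images while preserving vertical texts
python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4 --preserve-vertical
```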
### IMGUR
- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5) into your local repository to enable **much faster** parallel image downloading.
```bash
mkdir imgur && cd imgur
git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
rm -rf IMGUR5K-Handwriting-Dataset
```
- Step2: Generate `train_label.jsonl`, `val_label.jsonl` and `test_label.jsonl` and crop images with the following command:
```bash
python tools/data/textrecog/imgur_converter.py PATH/TO/imgur
```
- After running the above commands, the directory structure should be as follows:
```text
├── imgur
│ ├── crops
│ ├── train_label.jsonl
│ ├── test_label.jsonl
│ └── val_label.jsonl
```
### KAIST
- Step1: Download [KAIST_all.zip](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) to `kaist/`.
```bash
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations
# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip
rm KAIST_all.zip
```
- Step2: Extract zips:
```bash
python tools/data/common/extract_kaist.py PATH/TO/kaist
```
- Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command (see the example after the block):
```bash
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/kaist/ignores
python tools/data/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
```
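- For instance, to split off a 20% validation set and keep vertical texts:
```bash
# Example: 80/20 split with vertical texts preserved
python tools/data/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4 --val-ratio 0.2 --preserve-vertical
```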
- After running the above commands, the directory structure should be as follows:
```text
├── kaist
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
│ └── val_label.jsonl (optional)
```
### MTWI
- Step1: Download `mtwi_2018_train.zip` from [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us).
```bash
mkdir mtwi && cd mtwi
unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations
rm mtwi_2018_train.zip
```
- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:
```bash
# The annotations of the MTWI test split are not publicly available; you can split
# off a validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/mtwi/ignores
python tools/data/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
```
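- For instance, to split off a 20% validation set and keep vertical texts:
```bash
python tools/data/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4 --val-ratio 0.2 --preserve-vertical
```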
- After running the above commands, the directory structure should be as follows:
```text
├── mtwi
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
### COCO Text v2
- Step1: Download image [train2014.zip](http://images.cocodataset.org/zips/train2014.zip) and annotation [cocotext.v2.zip](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip) to `coco_textv2/`.
```bash
mkdir coco_textv2 && cd coco_textv2
mkdir annotations
# Download COCO Text v2 dataset
wget http://images.cocodataset.org/zips/train2014.zip
wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
unzip -q train2014.zip && unzip -q cocotext.v2.zip
mv train2014 imgs && mv cocotext.v2.json annotations/
rm train2014.zip cocotext.v2.zip
```
- Step2: Generate `train_label.jsonl` and `val_label.jsonl` with the following command:
```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/coco_textv2/ignores
python tools/data/textrecog/cocotext_converter.py PATH/TO/coco_textv2 --nproc 4
```
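- For instance, to keep vertical texts for training:
```bash
python tools/data/textrecog/cocotext_converter.py PATH/TO/coco_textv2 --nproc 4 --preserve-vertical
```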
- After running the above commands, the directory structure should be as follows:
```text
├── coco_textv2
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
│ └── val_label.jsonl
```
### ReCTS
- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).
```bash
mkdir rects && cd rects
# Download ReCTS dataset
# You can also find Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip
mv img imgs && mv gt_unicode annotations
rm -f ReCTS.zip && rm -rf gt
```
- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:
```bash
# The annotations of the ReCTS test split are not publicly available; split
# off a validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/rects/ignores
python tools/data/textrecog/rects_converter.py PATH/TO/rects --nproc 4
```
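- For instance, with a 20% validation split and vertical texts preserved:
```bash
python tools/data/textrecog/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2 --preserve-vertical
```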
- After running the above commands, the directory structure should be as follows:
```text
├── rects
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
│ └── val_label.jsonl (optional)
```