From c6bb105b83f1f33e66011339b37cacc0462ede7f Mon Sep 17 00:00:00 2001 From: Xinyu Wang <45810070+xinke-wang@users.noreply.github.com> Date: Wed, 30 Mar 2022 22:07:17 +0800 Subject: [PATCH] [Docs] Update Instructions for New Data Converters (#900) * update docs * fix spaces & add deprecation * fix funsd * remove repeated docs --- docs/en/datasets/det.md | 259 ++++++++++++++++++++++++++++++ docs/en/datasets/recog.md | 321 ++++++++++++++++++++++++++++++++++---- 2 files changed, 548 insertions(+), 32 deletions(-) diff --git a/docs/en/datasets/det.md b/docs/en/datasets/det.md index ec6f19e5..bc6d7b84 100644 --- a/docs/en/datasets/det.md +++ b/docs/en/datasets/det.md @@ -52,6 +52,8 @@ The structure of the text detection dataset directory is organized as follows. | :---------------: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------: | :---: | | | | training | validation | testing | | | CTW1500 | [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) | - | - | - | +| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - | | +| ICDAR2013 | [homepage](https://rrc.cvc.uab.es/?ch=2) | - | - | - | | ICDAR2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | 
[instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) | | ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - | | | | Synthtext | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | instances_training.lmdb ([data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb), [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb)) | - | - | @@ -63,6 +65,11 @@ The structure of the text detection dataset directory is organized as follows. | NAF | [homepage](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) | - | - | - | | SROIE | [homepage](https://rrc.cvc.uab.es/?ch=13) | - | - | - | | Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - | +| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - | +| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - | +| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - | +| COCO Text v2 | [homepage](https://bgshih.github.io/cocotext/) | - | - | - | +| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | - | ## Important Note @@ -124,6 +131,82 @@ unzip test_images.zip && mv test_images test python tools/data/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test ``` +### ICDAR 2011 (Born-Digital Images) +- Step1: Download `Challenge1_Training_Task12_Images.zip`, `Challenge1_Training_Task1_GT.zip`, `Challenge1_Test_Task12_Images.zip`, and `Challenge1_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=1&com=downloads) `Task 
1.1: Text Localization (2013 edition)`.
+
+  ```bash
+  mkdir icdar2011 && cd icdar2011
+  mkdir imgs && mkdir annotations
+
+  # Download ICDAR 2011
+  wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate
+  wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate
+  wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate
+  wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate
+
+  # For images
+  unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training
+  unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test
+  # For annotations
+  unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training
+  unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test
+
+  rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
+  ```
+
+- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
+
+  ```bash
+  python tools/data/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── icdar2011
+  │   ├── imgs
+  │   ├── instances_test.json
+  │   └── instances_training.json
+  ```
+
+### ICDAR 2013 (Focused Scene Text)
+- Step1: Download `Challenge2_Training_Task12_Images.zip`, `Challenge2_Test_Task12_Images.zip`, `Challenge2_Training_Task1_GT.zip`, and `Challenge2_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) `Task 2.1: Text Localization (2013 edition)`.
+
+  ```bash
+  mkdir icdar2013 && cd icdar2013
+  mkdir imgs && mkdir annotations
+
+  # Download ICDAR 2013
+  wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task12_Images.zip --no-check-certificate
+  wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task12_Images.zip --no-check-certificate
+  wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task1_GT.zip --no-check-certificate
+  wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task1_GT.zip --no-check-certificate
+
+  # For images
+  unzip -q Challenge2_Training_Task12_Images.zip -d imgs/training
+  unzip -q Challenge2_Test_Task12_Images.zip -d imgs/test
+  # For annotations
+  unzip -q Challenge2_Training_Task1_GT.zip -d annotations/training
+  unzip -q Challenge2_Test_Task1_GT.zip -d annotations/test
+
+  rm Challenge2_Training_Task12_Images.zip && rm Challenge2_Test_Task12_Images.zip && rm Challenge2_Training_Task1_GT.zip && rm Challenge2_Test_Task1_GT.zip
+  ```
+
+- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
+
+  ```bash
+  python tools/data/textdet/ic13_converter.py PATH/TO/icdar2013 --nproc 4
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── icdar2013
+  │   ├── imgs
+  │   ├── instances_test.json
+  │   └── instances_training.json
+  ```
+
 ### SynthText

 - Download [data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb) and [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb) to `synthtext/instances_training.lmdb/`.
@@ -356,3 +439,179 @@ rm IIIT-CVid.zip
 ```bash
 python tools/data/textdet/lv_converter.py PATH/TO/lv --nproc 4
 ```
+
+### IMGUR
+
+- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5) in your local repository to enable a **much faster** parallel execution of image download.
+
+  ```bash
+  mkdir imgur && cd imgur
+
+  git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
+
+  # Download images from imgur.com. This may take SEVERAL HOURS!
+  python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
+
+  # For annotations
+  mkdir annotations
+  mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
+
+  rm -rf IMGUR5K-Handwriting-Dataset
+  ```
+
+- Step2: Generate `instances_training.json`, `instances_val.json` and `instances_test.json` with the following command:
+
+  ```bash
+  python tools/data/textdet/imgur_converter.py PATH/TO/imgur
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── imgur
+  │   ├── annotations
+  │   ├── imgs
+  │   ├── instances_test.json
+  │   ├── instances_training.json
+  │   └── instances_val.json
+  ```
+
+### KAIST
+
+- Step1: Download [KAIST_all.zip](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) to `kaist/`.
+
+  ```bash
+  mkdir kaist && cd kaist
+  mkdir imgs && mkdir annotations
+
+  # Download KAIST dataset
+  wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
+  unzip -q KAIST_all.zip
+
+  rm KAIST_all.zip
+  ```
+
+- Step2: Extract zips:
+
+  ```bash
+  python tools/data/common/extract_kaist.py PATH/TO/kaist
+  ```
+
+- Step3: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
+
+  ```bash
+  # Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
+  python tools/data/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── kaist
+  │   ├── annotations
+  │   ├── imgs
+  │   ├── instances_training.json
+  │   └── instances_val.json (optional)
+  ```
+
+### MTWI
+
+- Step1: Download `mtwi_2018_train.zip` from [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us).
+
+  ```bash
+  mkdir mtwi && cd mtwi
+
+  unzip -q mtwi_2018_train.zip
+  mv image_train imgs && mv txt_train annotations
+
+  rm mtwi_2018_train.zip
+  ```
+
+- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
+
+  ```bash
+  # Annotations of the MTWI test split are not publicly available; split off a
+  # validation set by adding --val-ratio 0.2
+  python tools/data/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── mtwi
+  │   ├── annotations
+  │   ├── imgs
+  │   ├── instances_training.json
+  │   └── instances_val.json (optional)
+  ```
+
+### COCO Text v2
+
+- Step1: Download image [train2014.zip](http://images.cocodataset.org/zips/train2014.zip) and annotation [cocotext.v2.zip](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip) to `coco_textv2/`.
+
+  ```bash
+  mkdir coco_textv2 && cd coco_textv2
+  mkdir annotations
+
+  # Download COCO Text v2 dataset
+  wget http://images.cocodataset.org/zips/train2014.zip
+  wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
+  unzip -q train2014.zip && unzip -q cocotext.v2.zip
+
+  mv train2014 imgs && mv cocotext.v2.json annotations/
+
+  rm train2014.zip && rm cocotext.v2.zip
+  ```
+
+- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
+
+  ```bash
+  python tools/data/textdet/cocotext_converter.py PATH/TO/coco_textv2
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── coco_textv2
+  │   ├── annotations
+  │   ├── imgs
+  │   ├── instances_training.json
+  │   └── instances_val.json
+  ```
+
+### ReCTS
+
+- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).
+
+  ```bash
+  mkdir rects && cd rects
+
+  # Download ReCTS dataset
+  # You can also find a Google Drive link on the dataset homepage
+  wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
+  unzip -q ReCTS.zip
+
+  mv img imgs && mv gt_unicode annotations
+
+  rm ReCTS.zip && rm -rf gt
+  ```
+
+- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
+
+  ```bash
+  # Annotations of the ReCTS test split are not publicly available; split off a
+  # validation set by adding --val-ratio 0.2
+  # Add --preserve-vertical to preserve vertical texts for training, otherwise
+  # vertical images will be filtered and stored in PATH/TO/rects/ignores
+  python tools/data/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── rects
+  │   ├── annotations
+  │   ├── imgs
+  │   ├── instances_training.json
+  │   └── instances_val.json (optional)
+  ```
diff --git a/docs/en/datasets/recog.md b/docs/en/datasets/recog.md
index af4b37a3..652880cb 100644
--- a/docs/en/datasets/recog.md
+++ b/docs/en/datasets/recog.md
@@ -89,8 +89,8 @@
 | :---: | :---: | :---: | :---: | :---: |
 |  |  | training | test |  |
 | coco_text | [homepage](https://rrc.cvc.uab.es/?ch=5&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt) | - |  |
-| icdar_2011 | [homepage](http://www.cvc.uab.es/icdar2011competition/?com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) | - |  |
-| icdar_2013 | [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt) | [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) |  |
+| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - |  |
+| ICDAR2013 | [homepage](https://rrc.cvc.uab.es/?ch=2) | - | - |  |
 | icdar_2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt) |  |
 | IIIT5K | [homepage](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html) | [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt) |  |
 | ct80 | [homepage](http://cs-chan.com/downloads_CUTE80_dataset.html) | - | [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt) |  |
@@ -103,24 +103,101 @@
 | Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | - | - |  |
 | OpenVINO | [Open Images](https://github.com/cvdfoundation/open-images-dataset) | [annotations](https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text) | [annotations](https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text) |  |
 | FUNSD | [homepage](https://guillaumejaume.github.io/FUNSD/) | - | - |  |
-| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - |    |
+| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - |  |
 | NAF | [homepage](https://github.com/herobd/NAF_dataset) | - | - | - |
 | SROIE | [homepage](https://rrc.cvc.uab.es/?ch=13) | - | - | - |
 | Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - |
+| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - |
+| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - |
+| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - |
+| COCO Text v2 | [homepage](https://bgshih.github.io/cocotext/) | - | - | - |
+| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | - |

 (*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.
 ## Preparation Steps

-### ICDAR 2013
+### ICDAR 2011 (Born-Digital Images)
+- Step1: Download `Challenge1_Training_Task3_Images_GT.zip`, `Challenge1_Test_Task3_Images.zip`, and `Challenge1_Test_Task3_GT.txt` from [homepage](https://rrc.cvc.uab.es/?ch=1&com=downloads) `Task 1.3: Word Recognition (2013 edition)`.
+
+  ```bash
+  mkdir icdar2011 && cd icdar2011
+  mkdir annotations
+
+  # Download ICDAR 2011
+  wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate
+  wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate
+  wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate
+
+  # For images
+  mkdir crops
+  unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train
+  unzip -q Challenge1_Test_Task3_Images.zip -d crops/test
+
+  # For annotations
+  mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
+  ```
+
+- Step2: Convert the original annotations to `train_label.jsonl` and `test_label.jsonl` with the following command:
+
+  ```bash
+  python tools/data/textrecog/ic11_converter.py PATH/TO/icdar2011
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── icdar2011
+  │   ├── crops
+  │   ├── train_label.jsonl
+  │   └── test_label.jsonl
+  ```
+
+### ICDAR 2013 (Focused Scene Text)
+- Step1: Download `Challenge2_Training_Task3_Images_GT.zip`, `Challenge2_Test_Task3_Images.zip`, and `Challenge2_Test_Task3_GT.txt` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) `Task 2.3: Word Recognition (2013 edition)`.
+
+  ```bash
+  mkdir icdar2013 && cd icdar2013
+  mkdir annotations
+
+  # Download ICDAR 2013
+  wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task3_Images_GT.zip --no-check-certificate
+  wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_Images.zip --no-check-certificate
+  wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_GT.txt --no-check-certificate
+
+  # For images
+  mkdir crops
+  unzip -q Challenge2_Training_Task3_Images_GT.zip -d crops/train
+  unzip -q Challenge2_Test_Task3_Images.zip -d crops/test
+  # For annotations
+  mv Challenge2_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge2_Train_Task3_GT.txt
+
+  rm Challenge2_Training_Task3_Images_GT.zip && rm Challenge2_Test_Task3_Images.zip
+  ```
+
+- Step2: Generate `train_label.jsonl` and `test_label.jsonl` with the following command:
+
+  ```bash
+  python tools/data/textrecog/ic13_converter.py PATH/TO/icdar2013
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── icdar2013
+  │   ├── crops
+  │   ├── train_label.jsonl
+  │   └── test_label.jsonl
+  ```
+
+### ICDAR 2013 [Deprecated]

 - Step1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads)
 - Step2: Download [test_label_1015.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt) and [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt)

 ### ICDAR 2015

 - Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads)
 - Step2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt)
-
 ### IIIT5K

 - Step1: Download `IIIT5K-Word_V3.0.tar.gz` from [homepage](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html)
 - Step2: Download [train_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt) and [test_label.txt](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt)

@@ -303,33 +380,6 @@ python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mix
 python tools/data/textrecog/openvino_converter.py /path/to/openvino 4
 ```

-### FUNSD
-
-- Step1: Download [dataset.zip](https://guillaumejaume.github.io/FUNSD/dataset.zip) to `funsd/`.
-
-```bash
-mkdir funsd && cd funsd
-
-# Download FUNSD dataset
-wget https://guillaumejaume.github.io/FUNSD/dataset.zip
-unzip -q dataset.zip
-
-# For images
-mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/
-
-# For annotations
-mkdir annotations
-mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test
-
-rm dataset.zip && rm -rf dataset
-```
-
-- Step2: Generate `train_label.txt` and `test_label.txt` and crop images using 4 processes with following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts):
-
-```bash
-python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
-```
-
 ### DeText

 - Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the [homepage](https://rrc.cvc.uab.es/?ch=9).
@@ -471,3 +521,210 @@ rm IIIT-CVid.zip
 ```bash
 python tools/data/textdreog/lv_converter.py PATH/TO/lv
 ```
+
+### FUNSD
+
+- Step1: Download [dataset.zip](https://guillaumejaume.github.io/FUNSD/dataset.zip) to `funsd/`.
+
+```bash
+mkdir funsd && cd funsd
+
+# Download FUNSD dataset
+wget https://guillaumejaume.github.io/FUNSD/dataset.zip
+unzip -q dataset.zip
+
+# For images
+mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/
+
+# For annotations
+mkdir annotations
+mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test
+
+rm dataset.zip && rm -rf dataset
+```
+
+- Step2: Generate `train_label.txt` and `test_label.txt` and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts):
+
+```bash
+python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
+```
+
+### IMGUR
+
+- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5) in your local repository to enable a **much faster** parallel execution of image download.
+
+  ```bash
+  mkdir imgur && cd imgur
+
+  git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
+
+  # Download images from imgur.com. This may take SEVERAL HOURS!
+  python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
+
+  # For annotations
+  mkdir annotations
+  mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
+
+  rm -rf IMGUR5K-Handwriting-Dataset
+  ```
+
+- Step2: Generate `train_label.jsonl`, `val_label.jsonl` and `test_label.jsonl` and crop images with the following command:
+
+  ```bash
+  python tools/data/textrecog/imgur_converter.py PATH/TO/imgur
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── imgur
+  │   ├── crops
+  │   ├── train_label.jsonl
+  │   ├── test_label.jsonl
+  │   └── val_label.jsonl
+  ```
+
+### KAIST
+
+- Step1: Download [KAIST_all.zip](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) to `kaist/`.
+
+  ```bash
+  mkdir kaist && cd kaist
+  mkdir imgs && mkdir annotations
+
+  # Download KAIST dataset
+  wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
+  unzip -q KAIST_all.zip
+
+  rm KAIST_all.zip
+  ```
+
+- Step2: Extract zips:
+
+  ```bash
+  python tools/data/common/extract_kaist.py PATH/TO/kaist
+  ```
+
+- Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:
+
+  ```bash
+  # Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
+  # Add --preserve-vertical to preserve vertical texts for training, otherwise
+  # vertical images will be filtered and stored in PATH/TO/kaist/ignores
+  python tools/data/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── kaist
+  │   ├── crops
+  │   ├── ignores
+  │   ├── train_label.jsonl
+  │   └── val_label.jsonl (optional)
+  ```
+
+### MTWI
+
+- Step1: Download `mtwi_2018_train.zip` from [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us).
+
+  ```bash
+  mkdir mtwi && cd mtwi
+
+  unzip -q mtwi_2018_train.zip
+  mv image_train imgs && mv txt_train annotations
+
+  rm mtwi_2018_train.zip
+  ```
+
+- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:
+
+  ```bash
+  # Annotations of the MTWI test split are not publicly available; split off a
+  # validation set by adding --val-ratio 0.2
+  # Add --preserve-vertical to preserve vertical texts for training, otherwise
+  # vertical images will be filtered and stored in PATH/TO/mtwi/ignores
+  python tools/data/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── mtwi
+  │   ├── crops
+  │   ├── train_label.jsonl
+  │   └── val_label.jsonl (optional)
+  ```
+
+### COCO Text v2
+
+- Step1: Download image [train2014.zip](http://images.cocodataset.org/zips/train2014.zip) and annotation [cocotext.v2.zip](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip) to `coco_textv2/`.
+
+  ```bash
+  mkdir coco_textv2 && cd coco_textv2
+  mkdir annotations
+
+  # Download COCO Text v2 dataset
+  wget http://images.cocodataset.org/zips/train2014.zip
+  wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
+  unzip -q train2014.zip && unzip -q cocotext.v2.zip
+
+  mv train2014 imgs && mv cocotext.v2.json annotations/
+
+  rm train2014.zip && rm cocotext.v2.zip
+  ```
+
+- Step2: Generate `train_label.jsonl` and `val_label.jsonl` with the following command:
+
+  ```bash
+  # Add --preserve-vertical to preserve vertical texts for training, otherwise
+  # vertical images will be filtered and stored in PATH/TO/coco_textv2/ignores
+  python tools/data/textrecog/cocotext_converter.py PATH/TO/coco_textv2 --nproc 4
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── coco_textv2
+  │   ├── crops
+  │   ├── ignores
+  │   ├── train_label.jsonl
+  │   └── val_label.jsonl
+  ```
+
+### ReCTS
+
+- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).
+
+  ```bash
+  mkdir rects && cd rects
+
+  # Download ReCTS dataset
+  # You can also find a Google Drive link on the dataset homepage
+  wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
+  unzip -q ReCTS.zip
+
+  mv img imgs && mv gt_unicode annotations
+
+  rm -f ReCTS.zip && rm -rf gt
+  ```
+
+- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:
+
+  ```bash
+  # Annotations of the ReCTS test split are not publicly available; split off a
+  # validation set by adding --val-ratio 0.2
+  # Add --preserve-vertical to preserve vertical texts for training, otherwise
+  # vertical images will be filtered and stored in PATH/TO/rects/ignores
+  python tools/data/textrecog/rects_converter.py PATH/TO/rects --nproc 4
+  ```
+
+- After running the above commands, the directory structure should be as follows:
+
+  ```text
+  ├── rects
+  │   ├── crops
+  │   ├── ignores
+  │   ├── train_label.jsonl
+  │   └── val_label.jsonl (optional)
+  ```
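The detection converters in this patch emit COCO-style `instances_*.json` files. As a rough sketch of the kind of structure such a file contains (the exact field set below is an assumption for illustration; inspect a generated `instances_training.json` for the real schema):

```python
import json

# A minimal COCO-style detection annotation skeleton. The field names are
# illustrative assumptions; verify them against a file the converters produce.
ann = {
    "images": [
        {"id": 0, "file_name": "imgs/training/img_1.jpg",
         "height": 720, "width": 1280},
    ],
    "categories": [{"id": 1, "name": "text"}],
    "annotations": [
        {"id": 0, "image_id": 0, "category_id": 1, "iscrowd": 0,
         # bbox is [x, y, width, height]; segmentation is a flat polygon
         "bbox": [100, 50, 200, 40],
         "segmentation": [[100, 50, 300, 50, 300, 90, 100, 90]]},
    ],
}

# Round-trip through JSON to show the file is plain serialisable data
restored = json.loads(json.dumps(ann))
print(restored["annotations"][0]["bbox"])  # [100, 50, 200, 40]
```

The top-level `images`/`categories`/`annotations` split is what lets the same file drive both training and evaluation tooling.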
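The recognition converters write `*_label.jsonl` files with one JSON record per line. Assuming records of the form `{"filename": ..., "text": ...}` (an illustrative assumption; verify the keys against an actual converter output), such a file can be parsed line by line:

```python
import json

# Two example lines in the assumed JSON-Lines label layout (keys are
# illustrative; check a real *_label.jsonl produced by the tools).
lines = [
    '{"filename": "crops/train/word_1.png", "text": "Hello"}',
    '{"filename": "crops/train/word_2.png", "text": "World"}',
]

labels = [json.loads(line) for line in lines]
texts = [item["text"] for item in labels]
print(texts)  # ['Hello', 'World']
```

Because each line is an independent JSON object, the file can be streamed without loading the whole label set into memory.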
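Several converters accept a `--val-ratio` option to split a validation set off datasets that ship without an official one (KAIST, MTWI, ReCTS). A hedged sketch of what such a ratio-based split does, where `split_dataset` is a hypothetical helper rather than the converters' actual code:

```python
import random

def split_dataset(samples, val_ratio=0.2, seed=0):
    """Shuffle samples and split them into (train, val) by val_ratio.

    Hypothetical helper mirroring the spirit of --val-ratio; the real
    converter tools may differ in ordering and rounding details.
    """
    assert 0 <= val_ratio < 1
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_val = int(len(samples) * val_ratio)
    return samples[n_val:], samples[:n_val]

train, val = split_dataset(range(10), val_ratio=0.2)
print(len(train), len(val))  # 8 2
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters if the crops and labels are regenerated later.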