# Text Detection

## Overview

| Dataset | Images | Training Annotations | Validation Annotations | Testing Annotations |
| :---: | :---: | :---: | :---: | :---: |
| CTW1500 | [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) | - | - | - |
| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - | - |
| ICDAR2013 | [homepage](https://rrc.cvc.uab.es/?ch=2) | - | - | - |
| ICDAR2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) |
| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |
| SynthText | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | instances_training.lmdb ([data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb), [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb)) | - | - |
| TextOCR | [homepage](https://textvqa.org/textocr/dataset) | - | - | - |
| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | - | - | - |
| CurvedSynText150k | [homepage](https://github.com/aim-uofa/AdelaiDet/blob/master/datasets/README.md) \| [Part1](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) \| [Part2](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) | [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) | - | - |
| FUNSD | [homepage](https://guillaumejaume.github.io/FUNSD/) | - | - | - |
| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - | - |
| NAF | [homepage](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) | - | - | - |
| SROIE | [homepage](https://rrc.cvc.uab.es/?ch=13) | - | - | - |
| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - |
| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - |
| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - |
| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - |
| COCO Text v2 | [homepage](https://bgshih.github.io/cocotext/) | - | - | - |
| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | - |
| IIIT-ILST | [homepage](http://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-ilst) | - | - | - |
| VinText | [homepage](https://github.com/VinAIResearch/dict-guided) | - | - | - |
| BID | [homepage](https://github.com/ricardobnjunior/Brazilian-Identity-Document-Dataset) | - | - | - |
## Important Note

:::{note}
**For users who want to train models on CTW1500, ICDAR 2015/2017, and Totaltext datasets,** some images contain orientation info in their EXIF data. The default OpenCV backend used in MMCV reads this info and rotates the images accordingly. However, the gold annotations are made on the raw pixels, and this inconsistency introduces false examples into the training set. Therefore, users should use `dict(type='LoadImageFromFile', color_type='color_ignore_orientation')` in pipelines to change MMCV's default loading behaviour (see [DBNet's pipeline config](https://github.com/open-mmlab/mmocr/blob/main/configs/_base_/det_pipelines/dbnet_pipeline.py) for an example).
:::
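For instance, the loading step of a training pipeline could look like the sketch below. Only the first transform is the point here; the `LoadTextAnnotations` entry and the trailing comment are placeholders that vary with the model being trained.

```python
# Sketch of a text-detection training pipeline that ignores EXIF
# orientation when loading images. Only the first transform matters
# here; the remaining entries depend on the model config.
train_pipeline = [
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    dict(type='LoadTextAnnotations', with_bbox=True, with_mask=True),
    # ... model-specific resizing, augmentation, and formatting transforms ...
]
```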
## CTW1500

- Step0: Read [Important Note](#important-note)
- Step1: Download `train_images.zip`, `test_images.zip`, `train_labels.zip`, `test_labels.zip` from [github](https://github.com/Yuliang-Liu/Curve-Text-Detector)

```bash
mkdir ctw1500 && cd ctw1500
mkdir imgs && mkdir annotations

# For annotations
cd annotations
wget -O train_labels.zip https://universityofadelaide.box.com/shared/static/jikuazluzyj4lq6umzei7m2ppmt3afyw.zip
wget -O test_labels.zip https://cloudstor.aarnet.edu.au/plus/s/uoeFl0pCN9BOCN5/download
unzip train_labels.zip && mv ctw1500_train_labels training
unzip test_labels.zip -d test
cd ..

# For images
cd imgs
wget -O train_images.zip https://universityofadelaide.box.com/shared/static/py5uwlfyyytbb2pxzq9czvu6fuqbjdh8.zip
wget -O test_images.zip https://universityofadelaide.box.com/shared/static/t4w48ofnqkdw7jyc4t11nsukoeqk9c3d.zip
unzip train_images.zip && mv train_images training
unzip test_images.zip && mv test_images test
```

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/data/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
```

- The resulting directory structure looks like the following:

```text
├── ctw1500
│   ├── imgs
│   ├── annotations
│   ├── instances_training.json
│   └── instances_test.json
```

## ICDAR 2011 (Born-Digital Images)

- Step1: Download `Challenge1_Training_Task12_Images.zip`, `Challenge1_Training_Task1_GT.zip`, `Challenge1_Test_Task12_Images.zip`, and `Challenge1_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=1&com=downloads) `Task 1.1: Text Localization (2013 edition)`.

```bash
mkdir icdar2011 && cd icdar2011
mkdir imgs && mkdir annotations

# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate

# For images
unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test

# For annotations
unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test

rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
```

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/data/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
```

- After running the above commands, the directory structure should be as follows:

```text
│── icdar2011
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```

## ICDAR 2013 (Focused Scene Text)

- Step1: Download `Challenge2_Training_Task12_Images.zip`, `Challenge2_Test_Task12_Images.zip`, `Challenge2_Training_Task1_GT.zip`, and `Challenge2_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) `Task 2.1: Text Localization (2013 edition)`.

```bash
mkdir icdar2013 && cd icdar2013
mkdir imgs && mkdir annotations

# Download ICDAR 2013
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task1_GT.zip --no-check-certificate

# For images
unzip -q Challenge2_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge2_Test_Task12_Images.zip -d imgs/test

# For annotations
unzip -q Challenge2_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge2_Test_Task1_GT.zip -d annotations/test

rm Challenge2_Training_Task12_Images.zip && rm Challenge2_Test_Task12_Images.zip && rm Challenge2_Training_Task1_GT.zip && rm Challenge2_Test_Task1_GT.zip
```

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/data/textdet/ic13_converter.py PATH/TO/icdar2013 --nproc 4
```

- After running the above commands, the directory structure should be as follows:

```text
│── icdar2013
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```
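The converters above emit COCO-style annotation files. To quickly sanity-check one, a short script like the following can help (a minimal sketch; it assumes the usual `images`/`annotations` top-level lists that MMOCR's converters produce):

```python
import json

# Minimal sanity check for a generated annotation file. Assumes the
# COCO-style layout ('images' and 'annotations' lists) that MMOCR's
# converters are expected to emit.
with open('icdar2013/instances_training.json') as f:
    data = json.load(f)

print('images:', len(data['images']))
print('annotations:', len(data['annotations']))
print('sample image entry:', data['images'][0])
```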
## ICDAR 2015

- Step0: Read [Important Note](#important-note)
- Step1: Download `ch4_training_images.zip`, `ch4_test_images.zip`, `ch4_training_localization_transcription_gt.zip`, `Challenge4_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads)
- Step2:

```bash
mkdir icdar2015 && cd icdar2015
mkdir imgs && mkdir annotations

# For images
mv ch4_training_images imgs/training
mv ch4_test_images imgs/test

# For annotations
mv ch4_training_localization_transcription_gt annotations/training
mv Challenge4_Test_Task1_GT annotations/test
```

- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) and [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) and move them to `icdar2015`
- Or, generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/data/textdet/icdar_converter.py /path/to/icdar2015 -o /path/to/icdar2015 -d icdar2015 --split-list training test
```

- The resulting directory structure looks like the following:

```text
├── icdar2015
│   ├── imgs
│   ├── annotations
│   ├── instances_test.json
│   └── instances_training.json
```

## ICDAR 2017

- Follow similar steps as [ICDAR 2015](#icdar-2015).
- The resulting directory structure looks like the following:

```text
├── icdar2017
│   ├── imgs
│   ├── annotations
│   ├── instances_training.json
│   └── instances_val.json
```

## SynthText

- Step1: Download `SynthText.zip` from [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) and extract its content to `synthtext/imgs`.
- Step2: Download [data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb) and [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb) to `synthtext/instances_training.lmdb/`.
- The resulting directory structure looks like the following:

```text
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
│       ├── data.mdb
│       └── lock.mdb
```
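To confirm the downloaded LMDB annotation file is readable, a quick check with the `lmdb` Python package can be used (a sketch; the key/value layout inside the database is an internal detail of MMOCR's LMDB loader, so we only count entries):

```python
import lmdb

# Open the annotation database read-only and report how many entries
# it contains. A non-zero count suggests the download is intact.
env = lmdb.open('synthtext/instances_training.lmdb', readonly=True, lock=False)
print('entries:', env.stat()['entries'])
env.close()
```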
## TextOCR

- Step1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) to `textocr/`.

```bash
mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
```

- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:

```bash
python tools/data/textdet/textocr_converter.py /path/to/textocr
```

- The resulting directory structure looks like the following:

```text
├── textocr
│   ├── train
│   ├── instances_training.json
│   └── instances_val.json
```

## Totaltext

- Step0: Read [Important Note](#important-note)
- Step1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text) (our `totaltext_converter.py` supports ground truth in both `.mat` and `.txt` formats).

```bash
mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations

# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test

# For annotations
unzip groundtruth_text.zip
cd Groundtruth
mv Polygon/Train ../annotations/training
mv Polygon/Test ../annotations/test
```

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/data/textdet/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test
```

- The resulting directory structure looks like the following:

```text
├── totaltext
│   ├── imgs
│   ├── annotations
│   ├── instances_test.json
│   └── instances_training.json
```

## CurvedSynText150k

- Step1: Download [syntext1.zip](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) and [syntext2.zip](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) to `CurvedSynText150k/`.
- Step2:

```bash
unzip -q syntext1.zip
mv train.json train1.json
unzip images.zip
rm images.zip

unzip -q syntext2.zip
mv train.json train2.json
unzip images.zip
rm images.zip
```

- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) to `CurvedSynText150k/`
- Or, generate `instances_training.json` with the following command:

```bash
python tools/data/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4
```

- The resulting directory structure looks like the following:

```text
├── CurvedSynText150k
│   ├── syntext_word_eng
│   ├── emcs_imgs
│   └── instances_training.json
```
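Whichever dataset you prepare, it is worth verifying that the image paths recorded in a generated annotation file actually resolve on disk. A hedged sketch, using Totaltext as an example and assuming COCO-style `file_name` fields stored relative to the dataset root:

```python
import json
from pathlib import Path

root = Path('totaltext')  # dataset root; adjust per dataset
with open(root / 'instances_training.json') as f:
    data = json.load(f)

# 'file_name' is assumed to be relative to the dataset root, as in
# COCO-style annotations; adjust the join if your layout differs.
missing = [img['file_name'] for img in data['images']
           if not (root / img['file_name']).exists()]
print(f'{len(missing)} of {len(data["images"])} images missing')
```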
## FUNSD

- Step1: Download [dataset.zip](https://guillaumejaume.github.io/FUNSD/dataset.zip) to `funsd/`.

```bash
mkdir funsd && cd funsd

# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip

# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/

# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test

rm dataset.zip && rm -rf dataset
```

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/data/textdet/funsd_converter.py PATH/TO/funsd --nproc 4
```

- The resulting directory structure looks like the following:

```text
│── funsd
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```

## DeText

- Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the [homepage](https://rrc.cvc.uab.es/?ch=9).

```bash
mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val

# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate

# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val

# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
```

- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:

```bash
python tools/data/textdet/detext_converter.py PATH/TO/detext --nproc 4
```

- After running the above commands, the directory structure should be as follows:

```text
│── detext
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
```

## NAF

- Step1: Download [labeled_images.tar.gz](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) to `naf/`.

```bash
mkdir naf && cd naf

# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz

# For images
mkdir annotations && mv labeled_images imgs

# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/

rm -rf NAF_dataset && rm labeled_images.tar.gz
```

- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:

```bash
python tools/data/textdet/naf_converter.py PATH/TO/naf --nproc 4
```

- After running the above commands, the directory structure should be as follows:

```text
│── naf
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_val.json
│   └── instances_training.json
```
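NAF ships an explicit split file (`train_valid_test_split.json`), so you can peek at the split sizes before or after conversion. A sketch that only assumes the top-level JSON maps split names to collections:

```python
import json

# Print the size of each split defined in NAF's split file. We only
# assume the top level maps split names to collections, so len() works
# whether the values are lists or dicts.
with open('naf/annotations/train_valid_test_split.json') as f:
    splits = json.load(f)

for name, members in splits.items():
    print(name, len(members))
```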
## SROIE

- Step1: Download `0325updated.task1train(626p).zip`, `task1&2_test(361p).zip`, and `text.task1&2-test(361p).zip` from [homepage](https://rrc.cvc.uab.es/?ch=13&com=downloads) to `sroie/`
- Step2:

```bash
mkdir sroie && cd sroie
mkdir imgs && mkdir annotations && mkdir imgs/training

# Warning: The zip files downloaded from Google Drive and BaiduYun Cloud may
# differ. If you encounter errors while extracting or moving files, revise the
# file names in the following commands accordingly.
unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test\(361p\).zip

# For images
mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test

# For annotations
mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test\(361p\)/ annotations/test

rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test\(361p\).zip
```

- Step3: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/data/textdet/sroie_converter.py PATH/TO/sroie --nproc 4
```

- After running the above commands, the directory structure should be as follows:

```text
├── sroie
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```

## Lecture Video DB

- Step1: Download [IIIT-CVid.zip](http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip) to `lv/`.

```bash
mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

mv IIIT-CVid/Frames imgs

rm IIIT-CVid.zip
```

- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:

```bash
python tools/data/textdet/lv_converter.py PATH/TO/lv --nproc 4
```

- The resulting directory structure looks like the following:

```text
│── lv
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_training.json
│   └── instances_val.json
```

## IMGUR

- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5) in your local repository to enable a **much faster** parallel execution of image download.

```bash
mkdir imgur && cd imgur

git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git

# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs

# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations

rm -rf IMGUR5K-Handwriting-Dataset
```

- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:

```bash
python tools/data/textdet/imgur_converter.py PATH/TO/imgur
```

- After running the above commands, the directory structure should be as follows:

```text
│── imgur
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_training.json
│   └── instances_val.json
```
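Since the IMGUR download runs for hours and individual images can fail, it is worth counting what actually arrived before converting. A simple sketch:

```python
from pathlib import Path

# Count downloaded images and list the annotation files that were moved
# into place, to spot obviously failed or partial downloads.
imgs = list(Path('imgur/imgs').glob('*'))
print(f'downloaded images: {len(imgs)}')

for ann in sorted(Path('imgur/annotations').glob('*.json')):
    print('annotation file:', ann.name)
```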
## KAIST

- Step1: Download [KAIST_all.zip](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) to `kaist/`.

```bash
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations

# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip

rm KAIST_all.zip
```

- Step2: Extract zips:

```bash
python tools/data/common/extract_kaist.py PATH/TO/kaist
```

- Step3: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

```bash
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
python tools/data/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
```

- After running the above commands, the directory structure should be as follows:

```text
│── kaist
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```

## MTWI

- Step1: Download `mtwi_2018_train.zip` from [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us).

```bash
mkdir mtwi && cd mtwi

unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations

rm mtwi_2018_train.zip
```

- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

```bash
# Annotations of the MTWI test split are not publicly available; split a
# validation set by adding --val-ratio 0.2
python tools/data/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
```

- After running the above commands, the directory structure should be as follows:

```text
│── mtwi
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```
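Several converters in this document expose `--val-ratio` for datasets without an official split. Conceptually, it holds out a random fraction of images, along the lines of the illustrative sketch below (not the converters' actual code):

```python
import random

# Illustrative 80/20 hold-out: shuffle the image list with a fixed seed
# and split it. The real converters implement their own logic; this only
# shows the idea behind --val-ratio.
def split_train_val(image_names, val_ratio=0.2, seed=0):
    names = list(image_names)
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * val_ratio)
    return names[n_val:], names[:n_val]  # (train, val)

train, val = split_train_val([f'img_{i}.jpg' for i in range(100)])
print(len(train), len(val))  # 80 20
```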
## COCO Text v2

- Step1: Download image [train2014.zip](http://images.cocodataset.org/zips/train2014.zip) and annotation [cocotext.v2.zip](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip) to `coco_textv2/`.

```bash
mkdir coco_textv2 && cd coco_textv2
mkdir annotations

# Download COCO Text v2 dataset
wget http://images.cocodataset.org/zips/train2014.zip
wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
unzip -q train2014.zip && unzip -q cocotext.v2.zip

mv train2014 imgs && mv cocotext.v2.json annotations/

rm train2014.zip && rm cocotext.v2.zip
```

- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:

```bash
python tools/data/textdet/cocotext_converter.py PATH/TO/coco_textv2
```

- After running the above commands, the directory structure should be as follows:

```text
│── coco_textv2
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
```

## ReCTS

- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).

```bash
mkdir rects && cd rects

# Download ReCTS dataset
# You can also find the Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip

mv img imgs && mv gt_unicode annotations

rm ReCTS.zip && rm -rf gt
```

- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

```bash
# Annotations of the ReCTS test split are not publicly available; split a
# validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training; otherwise
# vertical images will be filtered out and stored in PATH/TO/rects/ignores
python tools/data/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
```

- After running the above commands, the directory structure should be as follows:

```text
│── rects
│   ├── annotations
│   ├── imgs
│   ├── instances_val.json (optional)
│   └── instances_training.json
```

## ILST

- Step1: Download `IIIT-ILST` from [onedrive](https://iiitaphyd-my.sharepoint.com/:f:/g/personal/minesh_mathew_research_iiit_ac_in/EtLvCozBgaBIoqglF4M-lHABMgNcCDW9rJYKKWpeSQEElQ?e=zToXZP)
- Step2: Run the following commands

```bash
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
cd IIIT-ILST

# Rename files
cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..

# Move images and annotations
mkdir imgs && mkdir annotations
mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/

# Remove unnecessary files
rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
```

- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., with `--val-ratio 0.2`, 20% of the data is held out as the validation set.

```bash
python tools/data/textdet/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
```

- After running the above commands, the directory structure should be as follows:

```text
│── IIIT-ILST
│   ├── annotations
│   ├── imgs
│   ├── instances_val.json (optional)
│   └── instances_training.json
```
## VinText

- Step1: Download [vintext.zip](https://drive.google.com/file/d/1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml/view) to `vintext`

```bash
mkdir vintext && cd vintext

# Download dataset from Google Drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt

# Extract images and annotations
unzip -q vintext.zip && rm vintext.zip
mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
rm -rf vietnamese

# Rename files
mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
```

- Step2: Generate `instances_training.json`, `instances_test.json` and `instances_unseen_test.json`

```bash
python tools/data/textdet/vintext_converter.py PATH/TO/vintext --nproc 4
```

- After running the above commands, the directory structure should be as follows:

```text
│── vintext
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_unseen_test.json
│   └── instances_training.json
```

## BID

- Step1: Download [BID Dataset.zip](https://drive.google.com/file/d/1Oi88TRcpdjZmJ79WDLb9qFlBNG8q2De6/view)
- Step2: Run the following commands to preprocess the dataset

```bash
# Rename
mv BID\ Dataset.zip BID_Dataset.zip

# Unzip and rename
unzip -q BID_Dataset.zip && rm BID_Dataset.zip
mv BID\ Dataset BID

# The BID dataset has a permission problem; you may need to
# grant access to these files
chmod -R 777 BID
cd BID
mkdir imgs && mkdir annotations

# For images and annotations
mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso

# Remove unnecessary files
rm -rf desktop.ini
```

- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., with `--val-ratio 0.2`, 20% of the data is held out as the validation set.

```bash
python tools/data/textdet/bid_converter.py PATH/TO/BID --nproc 4
```

- After running the above commands, the directory structure should be as follows:

```text
│── BID
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```
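A closing note on the Google Drive downloads above (CurvedSynText150k, VinText, BID): if the cookie-based `wget` one-liner fails, the third-party `gdown` package usually handles large Drive files more reliably. A sketch using the VinText file id from the step above (install with `pip install gdown` first):

```python
# Alternative Google Drive download via the third-party gdown package.
# The file id below is the VinText one used in the wget command above;
# substitute the id of whichever file you need.
import gdown

gdown.download(
    'https://drive.google.com/uc?id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml',
    'vintext.zip',
    quiet=False,
)
```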