# Text Detection
## Overview
| Dataset | Images | Annotation Files (training) | Annotation Files (validation) | Annotation Files (testing) |
| :---------------: | :-------------------------------------------: | :-------------------------------------: | :-----------------------------: | :--------------------------------------: |
| CTW1500 | [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) | - | - | - |
| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - | - |
| ICDAR2013 | [homepage](https://rrc.cvc.uab.es/?ch=2) | - | - | - |
| ICDAR2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) |
| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |
| Synthtext | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | instances_training.lmdb ([data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb), [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb)) | - | - |
| TextOCR | [homepage](https://textvqa.org/textocr/dataset) | - | - | - |
| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | - | - | - |
| CurvedSynText150k | [homepage](https://github.com/aim-uofa/AdelaiDet/blob/master/datasets/README.md) \| [Part1](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) \| [Part2](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) | [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) | - | - |
| FUNSD | [homepage](https://guillaumejaume.github.io/FUNSD/) | - | - | - |
| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - | - |
| NAF | [homepage](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) | - | - | - |
| SROIE | [homepage](https://rrc.cvc.uab.es/?ch=13) | - | - | - |
| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - |
| LSVT | [homepage](https://rrc.cvc.uab.es/?ch=16) | - | - | - |
| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - |
| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - |
| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - |
| COCO Text v2 | [homepage](https://bgshih.github.io/cocotext/) | - | - | - |
| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | - |
| IIIT-ILST | [homepage](http://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-ilst) | - | - | - |
| VinText | [homepage](https://github.com/VinAIResearch/dict-guided) | - | - | - |
| BID | [homepage](https://github.com/ricardobnjunior/Brazilian-Identity-Document-Dataset) | - | - | - |
| RCTW | [homepage](https://rctw.vlrlab.net/index.html) | - | - | - |
| HierText | [homepage](https://github.com/google-research-datasets/hiertext) | - | - | - |
| ArT | [homepage](https://rrc.cvc.uab.es/?ch=14) | - | - | - |
### Install AWS CLI (optional)
- Some datasets (e.g. HierText) require the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) to be installed in advance, so we provide a quick installation guide here:
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# To install to a custom location instead, run:
# ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
aws configure
# this command will prompt you for several values; all of them can be left
# empty except for the Default region name
# AWS Access Key ID [None]:
# AWS Secret Access Key [None]:
# Default region name [None]: us-east-1
# Default output format [None]:
```
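You can verify the installation afterwards; the optional `s3 ls` call assumes the HierText bucket used later in this guide allows anonymous listing:
```bash
# Check that the CLI is on PATH
aws --version
# Optional: anonymously list the bucket hosting the HierText images (see the HierText section)
aws s3 ls --no-sign-request s3://open-images-dataset/ocr/
```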
## Important Note
```{note}
**For users who want to train models on CTW1500, ICDAR 2015/2017, and Totaltext datasets,** some images carry orientation information in their EXIF data. The default OpenCV
backend used in MMCV reads it and rotates the images accordingly, while the ground-truth annotations are made on the raw pixels. This
inconsistency introduces false examples into the training set. Therefore, users should use `dict(type='LoadImageFromFile', color_type='color_ignore_orientation')` in pipelines to change MMCV's default loading behaviour. (see [DBNet's pipeline config](https://github.com/open-mmlab/mmocr/blob/main/configs/_base_/det_pipelines/dbnet_pipeline.py) for example)
```
## CTW1500
- Step0: Read [Important Note](#important-note)
- Step1: Download `train_images.zip`, `test_images.zip`, `train_labels.zip`, `test_labels.zip` from [github](https://github.com/Yuliang-Liu/Curve-Text-Detector)
```bash
mkdir ctw1500 && cd ctw1500
mkdir imgs && mkdir annotations
# For annotations
cd annotations
wget -O train_labels.zip https://universityofadelaide.box.com/shared/static/jikuazluzyj4lq6umzei7m2ppmt3afyw.zip
wget -O test_labels.zip https://cloudstor.aarnet.edu.au/plus/s/uoeFl0pCN9BOCN5/download
unzip train_labels.zip && mv ctw1500_train_labels training
unzip test_labels.zip -d test
cd ..
# For images
cd imgs
wget -O train_images.zip https://universityofadelaide.box.com/shared/static/py5uwlfyyytbb2pxzq9czvu6fuqbjdh8.zip
wget -O test_images.zip https://universityofadelaide.box.com/shared/static/t4w48ofnqkdw7jyc4t11nsukoeqk9c3d.zip
unzip train_images.zip && mv train_images training
unzip test_images.zip && mv test_images test
```
- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
```
- The resulting directory structure looks like the following:
```text
├── ctw1500
│   ├── imgs
│   ├── annotations
│   ├── instances_training.json
│   └── instances_test.json
```
## ICDAR 2011 (Born-Digital Images)
- Step1: Download `Challenge1_Training_Task12_Images.zip`, `Challenge1_Training_Task1_GT.zip`, `Challenge1_Test_Task12_Images.zip`, and `Challenge1_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=1&com=downloads) `Task 1.1: Text Localization (2013 edition)`.
```bash
mkdir icdar2011 && cd icdar2011
mkdir imgs && mkdir annotations
# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate
# For images
unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test
# For annotations
unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test
rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
```
- Step 2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── icdar2011
│ ├── imgs
│ ├── instances_test.json
│ └── instances_training.json
```
## ICDAR 2013 (Focused Scene Text)
- Step1: Download `Challenge2_Training_Task12_Images.zip`, `Challenge2_Test_Task12_Images.zip`, `Challenge2_Training_Task1_GT.zip`, and `Challenge2_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) `Task 2.1: Text Localization (2013 edition)`.
```bash
mkdir icdar2013 && cd icdar2013
mkdir imgs && mkdir annotations
# Download ICDAR 2013
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task1_GT.zip --no-check-certificate
# For images
unzip -q Challenge2_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge2_Test_Task12_Images.zip -d imgs/test
# For annotations
unzip -q Challenge2_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge2_Test_Task1_GT.zip -d annotations/test
rm Challenge2_Training_Task12_Images.zip && rm Challenge2_Test_Task12_Images.zip && rm Challenge2_Training_Task1_GT.zip && rm Challenge2_Test_Task1_GT.zip
```
- Step 2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/ic13_converter.py PATH/TO/icdar2013 --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── icdar2013
│ ├── imgs
│ ├── instances_test.json
│ └── instances_training.json
```
## ICDAR 2015
- Step0: Read [Important Note](#important-note)
- Step1: Download `ch4_training_images.zip`, `ch4_test_images.zip`, `ch4_training_localization_transcription_gt.zip`, `Challenge4_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads)
- Step2:
```bash
mkdir icdar2015 && cd icdar2015
mkdir imgs && mkdir annotations
# For images,
mv ch4_training_images imgs/training
mv ch4_test_images imgs/test
# For annotations,
mv ch4_training_localization_transcription_gt annotations/training
mv Challenge4_Test_Task1_GT annotations/test
```
- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) and [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) and move them to `icdar2015`
- Or, generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/icdar_converter.py /path/to/icdar2015 -o /path/to/icdar2015 -d icdar2015 --split-list training test
```
- The resulting directory structure looks like the following:
```text
├── icdar2015
│   ├── imgs
│   ├── annotations
│   ├── instances_test.json
│   └── instances_training.json
```
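As a quick sanity check, you can count the entries in the converted files. This is a generic sketch that assumes the COCO-style layout (`images`/`annotations` keys) produced by these converters:
```bash
python -c "import json; d = json.load(open('icdar2015/instances_training.json')); print(len(d['images']), 'images,', len(d['annotations']), 'annotations')"
```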
## ICDAR 2017
- Follow similar steps as [ICDAR 2015](#icdar-2015).
- The resulting directory structure looks like the following:
```text
├── icdar2017
│   ├── imgs
│   ├── annotations
│   ├── instances_training.json
│   └── instances_val.json
```
## SynthText
- Step1: Download `SynthText.zip` from the [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) and extract its content to `synthtext/imgs`.
- Step2: Download [data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb) and [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb) to `synthtext/instances_training.lmdb/`.
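For example, using the links above (assuming the commands are run from the directory that contains `synthtext/`):
```bash
mkdir -p synthtext/instances_training.lmdb
wget -P synthtext/instances_training.lmdb https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb
wget -P synthtext/instances_training.lmdb https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb
```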
- The resulting directory structure looks like the following:
```text
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
│       ├── data.mdb
│       └── lock.mdb
```
## TextOCR
- Step1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) to `textocr/`.
```bash
mkdir textocr && cd textocr
# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json
# For images
unzip -q train_val_images.zip
mv train_images train
```
- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
```bash
python tools/dataset_converters/textdet/textocr_converter.py /path/to/textocr
```
- The resulting directory structure looks like the following:
```text
├── textocr
│   ├── train
│   ├── instances_training.json
│   └── instances_val.json
```
## Totaltext
- Step0: Read [Important Note](#important-note)
- Step1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` (or `TT_new_train_GT.zip`, if you prefer the latest version of the training annotations) from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text). Our `totaltext_converter.py` supports ground truth in both `.mat` and `.txt` formats.
```bash
mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations
# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test
# For legacy training and test annotations
unzip groundtruth_text.zip
mv Groundtruth/Polygon/Train annotations/training
mv Groundtruth/Polygon/Test annotations/test
# Using the latest training annotations
# WARNING: Delete legacy train annotations before running the following command.
unzip TT_new_train_GT.zip
mv Train annotations/training
```
- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/totaltext_converter.py /path/to/totaltext
```
- The resulting directory structure looks like the following:
```text
├── totaltext
│   ├── imgs
│   ├── annotations
│   ├── instances_test.json
│   └── instances_training.json
```
## CurvedSynText150k
- Step1: Download [syntext1.zip](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) and [syntext2.zip](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) to `CurvedSynText150k/`.
- Step2:
```bash
unzip -q syntext1.zip
mv train.json train1.json
unzip images.zip
rm images.zip
unzip -q syntext2.zip
mv train.json train2.json
unzip images.zip
rm images.zip
```
- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) to `CurvedSynText150k/`
- Or, generate `instances_training.json` with the following command:
```bash
python tools/dataset_converters/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4
```
- The resulting directory structure looks like the following:
```text
├── CurvedSynText150k
│   ├── syntext_word_eng
│   ├── emcs_imgs
│   └── instances_training.json
```
## FUNSD
- Step1: Download [dataset.zip](https://guillaumejaume.github.io/FUNSD/dataset.zip) to `funsd/`.
```bash
mkdir funsd && cd funsd
# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip
# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/
# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test
rm dataset.zip && rm -rf dataset
```
- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/funsd_converter.py PATH/TO/funsd --nproc 4
```
- The resulting directory structure looks like the following:
```text
│── funsd
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```
## DeText
- Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the [homepage](https://rrc.cvc.uab.es/?ch=9).
```bash
mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val
# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate
# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val
# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
```bash
python tools/dataset_converters/textdet/detext_converter.py PATH/TO/detext --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── detext
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
```
## NAF
- Step1: Download [labeled_images.tar.gz](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) to `naf/`.
```bash
mkdir naf && cd naf
# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz
# For images
mkdir annotations && mv labeled_images imgs
# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/
rm -rf NAF_dataset && rm labeled_images.tar.gz
```
- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/naf_converter.py PATH/TO/naf --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── naf
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_val.json
│   └── instances_training.json
```
## SROIE
- Step1: Download `0325updated.task1train(626p).zip`, `task1&2_test(361p).zip`, and `text.task1&2-test（361p).zip` (note the full-width parenthesis in the last file name) from the [homepage](https://rrc.cvc.uab.es/?ch=13&com=downloads) to `sroie/`
- Step2:
```bash
mkdir sroie && cd sroie
mkdir imgs && mkdir annotations && mkdir imgs/training
# Warning: The zip files downloaded from Google Drive and BaiduYun Cloud may
# have different names. If extracting or moving the files fails, revise the
# following commands to match the actual file names.
unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test（361p\).zip
# For images
mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test
# For annotations
mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test（361p\)/ annotations/test
rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test（361p\).zip
```
- Step3: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/sroie_converter.py PATH/TO/sroie --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
├── sroie
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```
## Lecture Video DB
- Step1: Download [IIIT-CVid.zip](http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip) to `lv/`.
```bash
mkdir lv && cd lv
# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip
mv IIIT-CVid/Frames imgs
rm IIIT-CVid.zip
```
- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/lv_converter.py PATH/TO/lv --nproc 4
```
- The resulting directory structure looks like the following:
```text
│── lv
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_training.json
│   └── instances_val.json
```
## LSVT
- Step1: Download [train_full_images_0.tar.gz](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz), [train_full_images_1.tar.gz](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz), and [train_full_labels.json](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json) to `lsvt/`.
```bash
mkdir lsvt && cd lsvt
# Download LSVT dataset
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json
mkdir annotations
tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
mv train_full_images_0 imgs
rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# The annotations of the LSVT test split are not publicly available; to split
# off a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/lsvt_converter.py PATH/TO/lsvt
```
- After running the above commands, the directory structure should be as follows:
```text
|── lsvt
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```
## IMGUR
- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5) into your local repository to enable **much faster** parallel downloading of images.
```bash
mkdir imgur && cd imgur
git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
rm -rf IMGUR5K-Handwriting-Dataset
```
- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/imgur_converter.py PATH/TO/imgur
```
- After running the above commands, the directory structure should be as follows:
```text
│── imgur
│ ├── annotations
│ ├── imgs
│ ├── instances_test.json
│ ├── instances_training.json
│ └── instances_val.json
```
## KAIST
- Step1: Download [KAIST_all.zip](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) to `kaist/`.
```bash
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations
# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip
rm KAIST_all.zip
```
- Step2: Extract zips:
```bash
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
```
- Step3: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
python tools/dataset_converters/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── kaist
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json (optional)
```
## MTWI
- Step1: Download `mtwi_2018_train.zip` from [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us).
```bash
mkdir mtwi && cd mtwi
unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations
rm mtwi_2018_train.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# The annotations of the MTWI test split are not publicly available; to split
# off a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── mtwi
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json (optional)
```
## COCO Text v2
- Step1: Download image [train2014.zip](http://images.cocodataset.org/zips/train2014.zip) and annotation [cocotext.v2.zip](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip) to `coco_textv2/`.
```bash
mkdir coco_textv2 && cd coco_textv2
mkdir annotations
# Download COCO Text v2 dataset
wget http://images.cocodataset.org/zips/train2014.zip
wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
unzip -q train2014.zip && unzip -q cocotext.v2.zip
mv train2014 imgs && mv cocotext.v2.json annotations/
rm train2014.zip && rm -rf cocotext.v2.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
```bash
python tools/dataset_converters/textdet/cocotext_converter.py PATH/TO/coco_textv2
```
- After running the above commands, the directory structure should be as follows:
```text
│── coco_textv2
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json
```
## ReCTS
- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).
```bash
mkdir rects && cd rects
# Download ReCTS dataset
# You can also find Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip
mv img imgs && mv gt_unicode annotations
rm ReCTS.zip && rm -rf gt
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# The annotations of the ReCTS test split are not publicly available; to split
# off a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
```
- After running the above commands, the directory structure should be as follows:
```text
│── rects
│ ├── annotations
│ ├── imgs
│ ├── instances_val.json (optional)
│ └── instances_training.json
```
## ILST
- Step1: Download `IIIT-ILST` from [onedrive](https://iiitaphyd-my.sharepoint.com/:f:/g/personal/minesh_mathew_research_iiit_ac_in/EtLvCozBgaBIoqglF4M-lHABMgNcCDW9rJYKKWpeSQEElQ?e=zToXZP)
- Step2: Run the following commands
```bash
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
cd IIIT-ILST
# rename files
cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..
# transfer image path
mkdir imgs && mkdir annotations
mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/
# remove unnecessary files
rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
```
- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split one off; e.g., with `--val-ratio 0.2`, 20% of the data is held out as the validation set.
```bash
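# Optionally add --val-ratio 0.2 to hold out 20% of the data as a validation set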
python tools/dataset_converters/textdet/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── IIIT-ILST
│   ├── annotations
│   ├── imgs
│   ├── instances_val.json (optional)
│   └── instances_training.json
```
## VinText
- Step1: Download [vintext.zip](https://drive.google.com/drive/my-drive) to `vintext`
```bash
mkdir vintext && cd vintext
# Download dataset from google drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt
# Extract images and annotations
unzip -q vintext.zip && rm vintext.zip
mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
rm -rf vietnamese
# Rename files
mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
```
- Step2: Generate `instances_training.json`, `instances_test.json` and `instances_unseen_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/vintext_converter.py PATH/TO/vintext --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── vintext
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_unseen_test.json
│   └── instances_training.json
```
## BID
- Step1: Download [BID Dataset.zip](https://drive.google.com/file/d/1Oi88TRcpdjZmJ79WDLb9qFlBNG8q2De6/view)
- Step2: Run the following commands to preprocess the dataset
```bash
# Rename
mv BID\ Dataset.zip BID_Dataset.zip
# Unzip and Rename
unzip -q BID_Dataset.zip && rm BID_Dataset.zip
mv BID\ Dataset BID
# The extracted BID files come with restrictive permissions, so grant
# read/write access before processing them
chmod -R 777 BID
cd BID
mkdir imgs && mkdir annotations
# For images and annotations
mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso
# Remove unnecessary files
rm -rf desktop.ini
```
- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split one off; e.g., with `--val-ratio 0.2`, 20% of the data is held out as the validation set.
```bash
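# Optionally add --val-ratio 0.2 to hold out 20% of the data as a validation set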
python tools/dataset_converters/textdet/bid_converter.py PATH/TO/BID --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── BID
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```
## RCTW
- Step1: Download `train_images.zip.001`, `train_images.zip.002`, and `train_gts.zip` from the [homepage](https://rctw.vlrlab.net/index.html), and extract them to `rctw/imgs` and `rctw/annotations`, respectively. One way to handle the split archive is sketched below.
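A possible extraction sketch, assuming `p7zip-full` is installed (`7z` can read the split `.zip.001`/`.zip.002` volumes, which plain `unzip` usually cannot):
```bash
mkdir -p rctw/imgs rctw/annotations
# 7z automatically picks up train_images.zip.002 when given the first volume
7z x train_images.zip.001 -orctw/imgs
unzip -q train_gts.zip -d rctw/annotations
# If either archive contains a top-level folder, move its contents up so the
# images sit directly under rctw/imgs and the labels under rctw/annotations
```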
- Step2: Generate `instances_training.json` and `instances_val.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split off a validation set; e.g., with `--val-ratio 0.2`, 20% of the data is held out as the validation set.
```bash
# The annotations of the RCTW test split are not publicly available; to split off a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/rctw_converter.py PATH/TO/rctw --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── rctw
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```
## HierText
- Step1 (optional): Install [AWS CLI](https://mmocr.readthedocs.io/en/latest/datasets/det.html#install-aws-cli-optional).
- Step2: Clone [HierText](https://github.com/google-research-datasets/hiertext) repo to get annotations
```bash
mkdir HierText && cd HierText
git clone https://github.com/google-research-datasets/hiertext.git
```
- Step3: Download `train.tgz` and `validation.tgz` from AWS
```bash
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
```
- Step4: Process raw data
```bash
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.jsonl.gz
gzip -d annotations/validation.jsonl.gz
# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz -C imgs/
tar -xzvf imgs/validation.tgz -C imgs/
```
- Step5: Generate `instances_training.json` and `instances_val.json`. HierText includes different levels of annotation, from paragraph and line down to word. Check the original [paper](https://arxiv.org/pdf/2203.15143.pdf) for details. E.g., set `--level paragraph` for paragraph-level annotations, `--level line` for line-level annotations, or `--level word` for word-level annotations.
```bash
# Collect word annotation from HierText --level word
python tools/dataset_converters/textdet/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── HierText
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
```
## ArT
- Step1: Download `train_images.tar.gz` and `train_labels.json` from the [homepage](https://rrc.cvc.uab.es/?ch=14&com=downloads) to `art/`
```bash
mkdir art && cd art
mkdir annotations
# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_images.tar.gz --no-check-certificate
wget https://dataset-bj.cdn.bcebos.com/art/train_labels.json --no-check-certificate
# Extract
tar -xf train_images.tar.gz
mv train_images imgs
mv train_labels.json annotations/
# Remove unnecessary files
rm train_images.tar.gz
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split off a validation set; e.g., with `--val-ratio 0.2`, 20% of the data is held out as the validation set.
```bash
# The annotations of the ArT test split are not publicly available; to split off a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/art_converter.py PATH/TO/art --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── art
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```