# Text Detection
## Overview
| Dataset | Images | Annotation Files (training) | Annotation Files (validation) | Annotation Files (testing) |
| :---------------: | :-------------------------------------------: | :-------------------------------------: | :-----------------------------: | :--------------------------------------: |
| CTW1500 | [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) | - | - | - |
| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - | - |
| ICDAR2013 | [homepage](https://rrc.cvc.uab.es/?ch=2) | - | - | - |
| ICDAR2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) |
| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |
| Synthtext | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | instances_training.lmdb ([data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb), [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb)) | - | - |
| TextOCR | [homepage](https://textvqa.org/textocr/dataset) | - | - | - |
| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | - | - | - |
| CurvedSynText150k | [homepage](https://github.com/aim-uofa/AdelaiDet/blob/master/datasets/README.md) \| [Part1](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) \| [Part2](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) | [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) | - | - |
| FUNSD | [homepage](https://guillaumejaume.github.io/FUNSD/) | - | - | - |
| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - | - |
| NAF | [homepage](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) | - | - | - |
| SROIE | [homepage](https://rrc.cvc.uab.es/?ch=13) | - | - | - |
| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - |
| LSVT | [homepage](https://rrc.cvc.uab.es/?ch=16) | - | - | - |
| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - |
| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - |
| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - |
| COCO Text v2 | [homepage](https://bgshih.github.io/cocotext/) | - | - | - |
| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | - |
| IIIT-ILST | [homepage](http://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-ilst) | - | - | - |
| VinText | [homepage](https://github.com/VinAIResearch/dict-guided) | - | - | - |
| BID | [homepage](https://github.com/ricardobnjunior/Brazilian-Identity-Document-Dataset) | - | - | - |
| RCTW | [homepage](https://rctw.vlrlab.net/index.html) | - | - | - |
| HierText | [homepage](https://github.com/google-research-datasets/hiertext) | - | - | - |
| ArT | [homepage](https://rrc.cvc.uab.es/?ch=14) | - | - | - |
### Install AWS CLI (optional)
- Some datasets (e.g. HierText) require the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) to be installed in advance, so we provide a quick installation guide here:
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# To install to a custom location instead, run:
# ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
aws configure
# this command will prompt you for several values; all of them can be left
# empty except for the Default region name
# AWS Access Key ID [None]:
# AWS Secret Access Key [None]:
# Default region name [None]: us-east-1
# Default output format [None]:
```
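You can verify the installation afterwards; the optional `s3 ls` call assumes the HierText bucket used later in this guide allows anonymous listing:
```bash
# Check that the CLI is on PATH
aws --version
# Optional: anonymously list the bucket hosting the HierText images (see the HierText section)
aws s3 ls --no-sign-request s3://open-images-dataset/ocr/
```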
## Important Note
```{note}
**For users who want to train models on CTW1500, ICDAR 2015/2017, and Totaltext datasets,** some images carry orientation information in their EXIF data. The default OpenCV
backend used in MMCV reads it and rotates the images accordingly, while the ground-truth annotations are made on the raw pixels. This
inconsistency introduces false examples into the training set. Therefore, users should use `dict(type='LoadImageFromFile', color_type='color_ignore_orientation')` in pipelines to change MMCV's default loading behaviour. (see [DBNet's pipeline config](https://github.com/open-mmlab/mmocr/blob/main/configs/_base_/det_pipelines/dbnet_pipeline.py) for example)
```
## CTW1500
- Step0: Read [Important Note](#important-note)
- Step1: Download `train_images.zip`, `test_images.zip`, `train_labels.zip`, `test_labels.zip` from [github](https://github.com/Yuliang-Liu/Curve-Text-Detector)
```bash
mkdir ctw1500 && cd ctw1500
mkdir imgs && mkdir annotations
# For annotations
cd annotations
wget -O train_labels.zip https://universityofadelaide.box.com/shared/static/jikuazluzyj4lq6umzei7m2ppmt3afyw.zip
wget -O test_labels.zip https://cloudstor.aarnet.edu.au/plus/s/uoeFl0pCN9BOCN5/download
unzip train_labels.zip && mv ctw1500_train_labels training
unzip test_labels.zip -d test
cd ..
# For images
cd imgs
wget -O train_images.zip https://universityofadelaide.box.com/shared/static/py5uwlfyyytbb2pxzq9czvu6fuqbjdh8.zip
wget -O test_images.zip https://universityofadelaide.box.com/shared/static/t4w48ofnqkdw7jyc4t11nsukoeqk9c3d.zip
unzip train_images.zip && mv train_images training
unzip test_images.zip && mv test_images test
```
- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
```
- The resulting directory structure looks like the following:
```text
├── ctw1500
│   ├── imgs
│   ├── annotations
│   ├── instances_training.json
│   └── instances_test.json
```
## ICDAR 2011 (Born-Digital Images)
- Step1: Download `Challenge1_Training_Task12_Images.zip`, `Challenge1_Training_Task1_GT.zip`, `Challenge1_Test_Task12_Images.zip`, and `Challenge1_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=1&com=downloads) `Task 1.1: Text Localization (2013 edition)`.
```bash
mkdir icdar2011 && cd icdar2011
mkdir imgs && mkdir annotations
# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate
# For images
unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test
# For annotations
unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test
rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
```
- Step 2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── icdar2011
│ ├── imgs
│ ├── instances_test.json
│ └── instances_training.json
```
## ICDAR 2013 (Focused Scene Text)
- Step1: Download `Challenge2_Training_Task12_Images.zip`, `Challenge2_Test_Task12_Images.zip`, `Challenge2_Training_Task1_GT.zip`, and `Challenge2_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) `Task 2.1: Text Localization (2013 edition)`.
```bash
mkdir icdar2013 && cd icdar2013
mkdir imgs && mkdir annotations
# Download ICDAR 2013
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task1_GT.zip --no-check-certificate
# For images
unzip -q Challenge2_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge2_Test_Task12_Images.zip -d imgs/test
# For annotations
unzip -q Challenge2_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge2_Test_Task1_GT.zip -d annotations/test
rm Challenge2_Training_Task12_Images.zip && rm Challenge2_Test_Task12_Images.zip && rm Challenge2_Training_Task1_GT.zip && rm Challenge2_Test_Task1_GT.zip
```
- Step 2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/ic13_converter.py PATH/TO/icdar2013 --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── icdar2013
│ ├── imgs
│ ├── instances_test.json
│ └── instances_training.json
```
## ICDAR 2015
- Step0: Read [Important Note](#important-note)
- Step1: Download `ch4_training_images.zip`, `ch4_test_images.zip`, `ch4_training_localization_transcription_gt.zip`, `Challenge4_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads)
- Step2:
```bash
mkdir icdar2015 && cd icdar2015
mkdir imgs && mkdir annotations
# For images,
mv ch4_training_images imgs/training
mv ch4_test_images imgs/test
# For annotations,
mv ch4_training_localization_transcription_gt annotations/training
mv Challenge4_Test_Task1_GT annotations/test
```
- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) and [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) and move them to `icdar2015`
- Or, generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/icdar_converter.py /path/to/icdar2015 -o /path/to/icdar2015 -d icdar2015 --split-list training test
```
- The resulting directory structure looks like the following:
```text
├── icdar2015
│   ├── imgs
│   ├── annotations
│   ├── instances_test.json
│   └── instances_training.json
```
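As a quick sanity check, you can count the entries in the converted files. This is a generic sketch that assumes the COCO-style layout (`images`/`annotations` keys) produced by these converters:
```bash
python -c "import json; d = json.load(open('icdar2015/instances_training.json')); print(len(d['images']), 'images,', len(d['annotations']), 'annotations')"
```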
## ICDAR 2017
- Follow similar steps as [ICDAR 2015](#icdar-2015).
- The resulting directory structure looks like the following:
```text
├── icdar2017
│   ├── imgs
│   ├── annotations
│   ├── instances_training.json
│   └── instances_val.json
```
## SynthText
- Step1: Download `SynthText.zip` from the [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) and extract its content to `synthtext/imgs`.
- Step2: Download [data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb) and [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb) to `synthtext/instances_training.lmdb/`.
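For example, using the links above (assuming the commands are run from the directory that contains `synthtext/`):
```bash
mkdir -p synthtext/instances_training.lmdb
wget -P synthtext/instances_training.lmdb https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb
wget -P synthtext/instances_training.lmdb https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb
```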
- The resulting directory structure looks like the following:
```text
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
│       ├── data.mdb
│       └── lock.mdb
```
## TextOCR
- Step1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) to `textocr/`.
```bash
mkdir textocr && cd textocr
# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json
# For images
unzip -q train_val_images.zip
mv train_images train
```
- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
```bash
python tools/dataset_converters/textdet/textocr_converter.py /path/to/textocr
```
- The resulting directory structure looks like the following:
```text
├── textocr
│   ├── train
│   ├── instances_training.json
│   └── instances_val.json
```
## Totaltext
- Step0: Read [Important Note](#important-note)
- Step1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` (or `TT_new_train_GT.zip`, if you prefer the latest version of the training annotations) from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text). Our `totaltext_converter.py` supports ground truth in both `.mat` and `.txt` formats.
```bash
mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations
# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test
# For legacy training and test annotations
unzip groundtruth_text.zip
mv Groundtruth/Polygon/Train annotations/training
mv Groundtruth/Polygon/Test annotations/test
# Using the latest training annotations
# WARNING: Delete legacy train annotations before running the following command.
unzip TT_new_train_GT.zip
mv Train annotations/training
```
- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/totaltext_converter.py /path/to/totaltext
```
- The resulting directory structure looks like the following:
```text
├── totaltext
│   ├── imgs
│   ├── annotations
│   ├── instances_test.json
│   └── instances_training.json
```
## CurvedSynText150k
- Step1: Download [syntext1.zip](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) and [syntext2.zip](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) to `CurvedSynText150k/`.
- Step2:
```bash
unzip -q syntext1.zip
mv train.json train1.json
unzip images.zip
rm images.zip
unzip -q syntext2.zip
mv train.json train2.json
unzip images.zip
rm images.zip
```
- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) to `CurvedSynText150k/`
- Or, generate `instances_training.json` with the following command:
```bash
python tools/dataset_converters/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4
```
- The resulting directory structure looks like the following:
```text
├── CurvedSynText150k
│   ├── syntext_word_eng
│   ├── emcs_imgs
│   └── instances_training.json
```
## FUNSD
- Step1: Download [dataset.zip](https://guillaumejaume.github.io/FUNSD/dataset.zip) to `funsd/`.
```bash
mkdir funsd && cd funsd
# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip
# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/
# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test
rm dataset.zip && rm -rf dataset
```
- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/funsd_converter.py PATH/TO/funsd --nproc 4
```
- The resulting directory structure looks like the following:
```text
│── funsd
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```
## DeText
- Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the [homepage](https://rrc.cvc.uab.es/?ch=9).
```bash
mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val
# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate
# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val
# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
```bash
python tools/dataset_converters/textdet/detext_converter.py PATH/TO/detext --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── detext
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
```
## NAF
- Step1: Download [labeled_images.tar.gz](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) to `naf/`.
```bash
mkdir naf && cd naf
# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz
# For images
mkdir annotations && mv labeled_images imgs
# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/
rm -rf NAF_dataset && rm labeled_images.tar.gz
```
- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/naf_converter.py PATH/TO/naf --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── naf
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_val.json
│   └── instances_training.json
```
## SROIE
- Step1: Download `0325updated.task1train(626p).zip`, `task1&2_test(361p).zip`, and `text.task1&2-test（361p).zip` (note the full-width parenthesis in the last file name) from the [homepage](https://rrc.cvc.uab.es/?ch=13&com=downloads) to `sroie/`
- Step2:
```bash
mkdir sroie && cd sroie
mkdir imgs && mkdir annotations && mkdir imgs/training
# Warning: The zip files downloaded from Google Drive and BaiduYun Cloud may
# have different names. If extracting or moving the files fails, revise the
# following commands to match the actual file names.
unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test（361p\).zip
# For images
mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test
# For annotations
mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test（361p\)/ annotations/test
rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test（361p\).zip
```
- Step3: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/sroie_converter.py PATH/TO/sroie --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
├── sroie
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```
## Lecture Video DB
- Step1: Download [IIIT-CVid.zip](http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip) to `lv/`.
```bash
mkdir lv && cd lv
# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip
mv IIIT-CVid/Frames imgs
rm IIIT-CVid.zip
```
- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/lv_converter.py PATH/TO/lv --nproc 4
```
- The resulting directory structure looks like the following:
```text
│── lv
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_training.json
│   └── instances_val.json
```
## LSVT
- Step1: Download [train_full_images_0.tar.gz](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz), [train_full_images_1.tar.gz](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz), and [train_full_labels.json](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json) to `lsvt/`.
```bash
mkdir lsvt && cd lsvt
# Download LSVT dataset
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json
mkdir annotations
tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
mv train_full_images_0 imgs
rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# The annotations of the LSVT test split are not publicly available; to split
# off a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/lsvt_converter.py PATH/TO/lsvt
```
- After running the above commands, the directory structure should be as follows:
```text
|── lsvt
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```
## IMGUR
- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5) into your local repository to enable **much faster** parallel downloading of images.
```bash
mkdir imgur && cd imgur
git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
rm -rf IMGUR5K-Handwriting-Dataset
```
- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/imgur_converter.py PATH/TO/imgur
```
- After running the above commands, the directory structure should be as follows:
```text
│── imgur
│ ├── annotations
│ ├── imgs
│ ├── instances_test.json
│ ├── instances_training.json
│ └── instances_val.json
```
## KAIST
- Step1: Download [KAIST_all.zip](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) to `kaist/`.
```bash
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations
# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip
rm KAIST_all.zip
```
- Step2: Extract zips:
```bash
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
```
- Step3: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
python tools/dataset_converters/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── kaist
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json (optional)
```
## MTWI
- Step1: Download `mtwi_2018_train.zip` from [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us).
```bash
mkdir mtwi && cd mtwi
unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations
rm mtwi_2018_train.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# The annotations of the MTWI test split are not publicly available; to split
# off a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── mtwi
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json (optional)
```
## COCO Text v2
- Step1: Download image [train2014.zip](http://images.cocodataset.org/zips/train2014.zip) and annotation [cocotext.v2.zip](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip) to `coco_textv2/`.
```bash
mkdir coco_textv2 && cd coco_textv2
mkdir annotations
# Download COCO Text v2 dataset
wget http://images.cocodataset.org/zips/train2014.zip
wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
unzip -q train2014.zip && unzip -q cocotext.v2.zip
mv train2014 imgs && mv cocotext.v2.json annotations/
rm train2014.zip && rm -rf cocotext.v2.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
```bash
python tools/dataset_converters/textdet/cocotext_converter.py PATH/TO/coco_textv2
```
- After running the above commands, the directory structure should be as follows:
```text
│── coco_textv2
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json
```
## ReCTS
- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).
```bash
mkdir rects && cd rects
# Download ReCTS dataset
# You can also find Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip
mv img imgs && mv gt_unicode annotations
rm ReCTS.zip && rm -rf gt
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# The annotations of the ReCTS test split are not publicly available; to split
# off a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
```
- After running the above commands, the directory structure should be as follows:
```text
│── rects
│ ├── annotations
│ ├── imgs
│ ├── instances_val.json (optional)
│ └── instances_training.json
```
## ILST
- Step1: Download `IIIT-ILST` from [onedrive](https://iiitaphyd-my.sharepoint.com/:f:/g/personal/minesh_mathew_research_iiit_ac_in/EtLvCozBgaBIoqglF4M-lHABMgNcCDW9rJYKKWpeSQEElQ?e=zToXZP)
- Step2: Run the following commands
```bash
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
cd IIIT-ILST
# rename files
cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..
# transfer image path
mkdir imgs && mkdir annotations
mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/
# remove unnecessary files
rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
```
- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split one off; e.g., with `--val-ratio 0.2`, 20% of the data is held out as the validation set.
```bash
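# Optionally add --val-ratio 0.2 to hold out 20% of the data as a validation set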
python tools/dataset_converters/textdet/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── IIIT-ILST
│   ├── annotations
│   ├── imgs
│   ├── instances_val.json (optional)
│   └── instances_training.json
```
## VinText
- Step1: Download [vintext.zip](https://drive.google.com/drive/my-drive) to `vintext`
```bash
mkdir vintext && cd vintext
# Download dataset from google drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt
# Extract images and annotations
unzip -q vintext.zip && rm vintext.zip
mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
rm -rf vietnamese
# Rename files
mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
```
- Step2: Generate `instances_training.json`, `instances_test.json` and `instances_unseen_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/vintext_converter.py PATH/TO/vintext --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── vintext
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_unseen_test.json
│   └── instances_training.json
```
## BID
- Step1: Download [BID Dataset.zip](https://drive.google.com/file/d/1Oi88TRcpdjZmJ79WDLb9qFlBNG8q2De6/view)
- Step2: Run the following commands to preprocess the dataset
```bash
# Rename
mv BID\ Dataset.zip BID_Dataset.zip
# Unzip and Rename
unzip -q BID_Dataset.zip && rm BID_Dataset.zip
mv BID\ Dataset BID
# The extracted BID files come with restrictive permissions, so grant
# read/write access before processing them
chmod -R 777 BID
cd BID
mkdir imgs && mkdir annotations
# For images and annotations
mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso
# Remove unnecessary files
rm -rf desktop.ini
```
- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split one off; e.g., with `--val-ratio 0.2`, 20% of the data is held out as the validation set.
```bash
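# Optionally add --val-ratio 0.2 to hold out 20% of the data as a validation set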
python tools/dataset_converters/textdet/bid_converter.py PATH/TO/BID --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── BID
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```
## RCTW
- Step1: Download `train_images.zip.001`, `train_images.zip.002`, and `train_gts.zip` from the [homepage](https://rctw.vlrlab.net/index.html), and extract them to `rctw/imgs` and `rctw/annotations`, respectively. One way to handle the split archive is sketched below.
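A possible extraction sketch, assuming `p7zip-full` is installed (`7z` can read the split `.zip.001`/`.zip.002` volumes, which plain `unzip` usually cannot):
```bash
mkdir -p rctw/imgs rctw/annotations
# 7z automatically picks up train_images.zip.002 when given the first volume
7z x train_images.zip.001 -orctw/imgs
unzip -q train_gts.zip -d rctw/annotations
# If either archive contains a top-level folder, move its contents up so the
# images sit directly under rctw/imgs and the labels under rctw/annotations
```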
- Step2: Generate `instances_training.json` and `instances_val.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split off a validation set; e.g., with `--val-ratio 0.2`, 20% of the data is held out as the validation set.
```bash
# The annotations of the RCTW test split are not publicly available; to split off a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/rctw_converter.py PATH/TO/rctw --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── rctw
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```
## HierText
- Step1 (optional): Install [AWS CLI](https://mmocr.readthedocs.io/en/latest/datasets/det.html#install-aws-cli-optional).
- Step2: Clone [HierText](https://github.com/google-research-datasets/hiertext) repo to get annotations
```bash
mkdir HierText && cd HierText
git clone https://github.com/google-research-datasets/hiertext.git
```
- Step3: Download `train.tgz` and `validation.tgz` from AWS
```bash
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
```
- Step4: Process raw data
```bash
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.jsonl.gz
gzip -d annotations/validation.jsonl.gz
# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz -C imgs/
tar -xzvf imgs/validation.tgz -C imgs/
```
- Step5: Generate `instances_training.json` and `instances_val.json`. HierText includes different levels of annotation, from paragraph and line down to word. Check the original [paper](https://arxiv.org/pdf/2203.15143.pdf) for details. E.g., set `--level paragraph` for paragraph-level annotations, `--level line` for line-level annotations, or `--level word` for word-level annotations.
```bash
# Collect word annotation from HierText --level word
python tools/dataset_converters/textdet/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── HierText
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
```
## ArT
- Step1: Download `train_images.tar.gz` and `train_labels.json` from the [homepage](https://rrc.cvc.uab.es/?ch=14&com=downloads) to `art/`
```bash
mkdir art && cd art
mkdir annotations
# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_images.tar.gz --no-check-certificate
wget https://dataset-bj.cdn.bcebos.com/art/train_labels.json --no-check-certificate
# Extract
tar -xf train_images.tar.gz
mv train_images imgs
mv train_labels.json annotations/
# Remove unnecessary files
rm train_images.tar.gz
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split off a validation set; e.g., with `--val-ratio 0.2`, 20% of the data is held out as the validation set.
```bash
# The annotations of the ArT test split are not publicly available; to split off a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/art_converter.py PATH/TO/art --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── art
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```