
# Text Detection

```{note}
This page is a manual preparation guide for datasets not yet supported by [Dataset Preparer](./dataset_preparer.md), into which all of these scripts will eventually be migrated.
```

## Overview

| Dataset | Images | Annotation Files (training) | Annotation Files (validation) | Annotation Files (testing) |
| :---: | :---: | :---: | :---: | :---: |
| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - | - |
| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |
| CurvedSynText150k | [homepage](https://github.com/aim-uofa/AdelaiDet/blob/master/datasets/README.md) \| [Part1](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) \| [Part2](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) | [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) | - | - |
| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - | - |
| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - |
| LSVT | [homepage](https://rrc.cvc.uab.es/?ch=16) | - | - | - |
| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - |
| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - |
| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - |
| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | - |
| IIIT-ILST | [homepage](http://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-ilst) | - | - | - |
| VinText | [homepage](https://github.com/VinAIResearch/dict-guided) | - | - | - |
| BID | [homepage](https://github.com/ricardobnjunior/Brazilian-Identity-Document-Dataset) | - | - | - |
| RCTW | [homepage](https://rctw.vlrlab.net/index.html) | - | - | - |
| HierText | [homepage](https://github.com/google-research-datasets/hiertext) | - | - | - |
| ArT | [homepage](https://rrc.cvc.uab.es/?ch=14) | - | - | - |

### Install AWS CLI (optional)

- Some datasets, such as HierText, require the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) to be installed in advance. A quick installation guide for Linux x86_64 follows:

  ```bash
  curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
  unzip awscliv2.zip
  # System-wide install (requires sudo)
  sudo ./aws/install
  # Alternatively, pass the install and bin directories explicitly:
  # ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
  aws configure
  # This command prompts for keys; you can leave every field empty except
  # the default region name:
  # AWS Access Key ID [None]:
  # AWS Secret Access Key [None]:
  # Default region name [None]: us-east-1
  # Default output format [None]:
  ```

Users in China can also download the following datasets at high speed from [OpenDataLab](https://opendatalab.com/):

- [CTW1500](https://opendatalab.com/SCUT-CTW1500?source=OpenMMLab%20GitHub)
- [ICDAR2013](https://opendatalab.com/ICDAR_2013?source=OpenMMLab%20GitHub)
- [ICDAR2015](https://opendatalab.com/ICDAR2015?source=OpenMMLab%20GitHub)
- [Totaltext](https://opendatalab.com/TotalText?source=OpenMMLab%20GitHub)
- [MSRA-TD500](https://opendatalab.com/MSRA-TD500?source=OpenMMLab%20GitHub)

## Important Note

```{note}
**For users who want to train models on the CTW1500, ICDAR 2015/2017, and Totaltext datasets:** some images may contain orientation info in their EXIF data. The default OpenCV
backend used in MMCV reads this info and rotates the images accordingly. However, the gold annotations were made on the raw pixels, and this
inconsistency produces false examples in the training set. Users should therefore use `dict(type='LoadImageFromFile', color_type='color_ignore_orientation')` in pipelines to change MMCV's default loading behaviour (see [DBNet's pipeline config](https://github.com/open-mmlab/mmocr/blob/main/configs/_base_/det_pipelines/dbnet_pipeline.py) for an example).
```
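
As a minimal sketch of the advice above, the helper below rewrites an existing pipeline so that every `LoadImageFromFile` step ignores EXIF orientation. The helper name `ignore_exif_orientation` and the sample pipeline (including `PlaceholderTransform`) are illustrative assumptions, not part of MMOCR. Only the `LoadImageFromFile` step and its `color_type` value come from the note.

```python
def ignore_exif_orientation(pipeline):
    """Return a copy of ``pipeline`` whose loading steps ignore EXIF orientation.

    Hypothetical helper for illustration; it is not an MMOCR API.
    """
    patched = []
    for step in pipeline:
        step = dict(step)  # shallow copy so the input config is left untouched
        if step.get('type') == 'LoadImageFromFile':
            step['color_type'] = 'color_ignore_orientation'
        patched.append(step)
    return patched


# Example pipeline; the second step is a placeholder standing in for the
# remaining transforms of a real config.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='PlaceholderTransform'),
]
train_pipeline = ignore_exif_orientation(train_pipeline)
print(train_pipeline[0])
```

In a real config it is usually simpler to write the `color_type` argument directly into the loading step; the helper only makes the change explicit.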

## ICDAR 2011 (Born-Digital Images)

- Step1: Download `Challenge1_Training_Task12_Images.zip`, `Challenge1_Training_Task1_GT.zip`, `Challenge1_Test_Task12_Images.zip`, and `Challenge1_Test_Task1_GT.zip` from `Task 1.1: Text Localization (2013 edition)` on the [homepage](https://rrc.cvc.uab.es/?ch=1&com=downloads).

  ```bash
  mkdir icdar2011 && cd icdar2011
  mkdir imgs && mkdir annotations

  # Download ICDAR 2011
  wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate
  wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate
  wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate
  wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate

  # For images
  unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training
  unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test
  # For annotations
  unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training
  unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test

  rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
  ```

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

  ```bash
  python tools/dataset_converters/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── icdar2011
  │   ├── imgs
  │   ├── instances_test.json
  │   └── instances_training.json
  ```
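
To confirm that a conversion produced the expected files, a small check along these lines can help. It is only a sketch: `missing_outputs` is a hypothetical helper, and the file list simply mirrors the tree above.

```python
from pathlib import Path


def missing_outputs(root, expected):
    """Return the entries of ``expected`` that do not exist under ``root``."""
    root = Path(root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == '__main__':
    # Names taken from the ICDAR 2011 tree above; adjust them per dataset.
    expected = ['imgs', 'instances_training.json', 'instances_test.json']
    missing = missing_outputs('PATH/TO/icdar2011', expected)
    print('missing:', missing if missing else 'nothing')
```

The same helper can be reused for the other datasets on this page by swapping in the names from their directory trees.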

## ICDAR 2017

- Follow steps similar to those for [ICDAR 2015](#icdar-2015).

- The resulting directory structure looks like the following:

  ```text
  ├── icdar2017
  │   ├── imgs
  │   ├── annotations
  │   ├── instances_training.json
  │   └── instances_val.json
  ```

## CurvedSynText150k

- Step1: Download [syntext1.zip](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) and [syntext2.zip](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) to `CurvedSynText150k/`.

- Step2: Extract both archives, renaming each `train.json` so that the second does not overwrite the first:

  ```bash
  unzip -q syntext1.zip
  mv train.json train1.json
  unzip images.zip
  rm images.zip

  unzip -q syntext2.zip
  mv train.json train2.json
  unzip images.zip
  rm images.zip
  ```

- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) to `CurvedSynText150k/`

- Or, generate `instances_training.json` with the following command:

  ```bash
  python tools/dataset_converters/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4
  ```

- The resulting directory structure looks like the following:

  ```text
  ├── CurvedSynText150k
  │   ├── syntext_word_eng
  │   ├── emcs_imgs
  │   └── instances_training.json
  ```

## DeText

- Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the [homepage](https://rrc.cvc.uab.es/?ch=9).

  ```bash
  mkdir detext && cd detext
  mkdir -p imgs/training imgs/val annotations/training annotations/val

  # Download DeText
  wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
  wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
  wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
  wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate

  # Extract images and annotations
  unzip -q ch9_training_images.zip -d imgs/training
  unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training
  unzip -q ch9_validation_images.zip -d imgs/val
  unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val

  # Remove zips
  rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
  ```

- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:

  ```bash
  python tools/dataset_converters/textdet/detext_converter.py PATH/TO/detext --nproc 4
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── detext
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_training.json
  │   └── instances_val.json
  ```

## Lecture Video DB

- Step1: Download [IIIT-CVid.zip](http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip) to `lv/`.

  ```bash
  mkdir lv && cd lv

  # Download LV dataset
  wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
  unzip -q IIIT-CVid.zip

  mv IIIT-CVid/Frames imgs

  rm IIIT-CVid.zip
  ```

- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:

  ```bash
  python tools/dataset_converters/textdet/lv_converter.py PATH/TO/lv --nproc 4
  ```

- The resulting directory structure looks like the following:

  ```text
  │── lv
  │   ├── imgs
  │   ├── instances_test.json
  │   ├── instances_training.json
  │   └── instances_val.json
  ```

## LSVT

- Step1: Download [train_full_images_0.tar.gz](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz), [train_full_images_1.tar.gz](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz), and [train_full_labels.json](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json) to `lsvt/`.

  ```bash
  mkdir lsvt && cd lsvt

  # Download LSVT dataset
  wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
  wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
  wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json

  mkdir annotations
  tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
  mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
  mv train_full_images_0 imgs

  rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
  ```

- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

  ```bash
  # Annotations of the LSVT test split are not publicly available; split off a
  # validation set by adding --val-ratio 0.2
  python tools/dataset_converters/textdet/lsvt_converter.py PATH/TO/lsvt
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── lsvt
  │   ├── imgs
  │   ├── instances_training.json
  │   └── instances_val.json (optional)
  ```

## IMGUR

- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5) in your local repository to enable a **much faster** parallel execution of image download.

  ```bash
  mkdir imgur && cd imgur

  git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git

  # Download images from imgur.com. This may take SEVERAL HOURS!
  python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs

  # For annotations
  mkdir annotations
  mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations

  rm -rf IMGUR5K-Handwriting-Dataset
  ```

- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:

  ```bash
  python tools/dataset_converters/textdet/imgur_converter.py PATH/TO/imgur
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── imgur
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_test.json
  │   ├── instances_training.json
  │   └── instances_val.json
  ```

## KAIST

- Step1: Download [KAIST_all.zip](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) to `kaist/`.

  ```bash
  mkdir kaist && cd kaist
  mkdir imgs && mkdir annotations

  # Download KAIST dataset
  wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
  unzip -q KAIST_all.zip

  rm KAIST_all.zip
  ```

- Step2: Extract zips:

  ```bash
  python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
  ```

- Step3: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

  ```bash
  # Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
  python tools/dataset_converters/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── kaist
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_training.json
  │   └── instances_val.json (optional)
  ```

## MTWI

- Step1: Download `mtwi_2018_train.zip` from [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us).

  ```bash
  mkdir mtwi && cd mtwi

  unzip -q mtwi_2018_train.zip
  mv image_train imgs && mv txt_train annotations

  rm mtwi_2018_train.zip
  ```

- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

  ```bash
  # Annotations of the MTWI test split are not publicly available; split off a
  # validation set by adding --val-ratio 0.2
  python tools/dataset_converters/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── mtwi
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_training.json
  │   └── instances_val.json (optional)
  ```

## ReCTS

- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).

  ```bash
  mkdir rects && cd rects

  # Download ReCTS dataset
  # You can also find Google Drive link on the dataset homepage
  wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
  unzip -q ReCTS.zip

  mv img imgs && mv gt_unicode annotations

  rm ReCTS.zip && rm -rf gt
  ```

- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

  ```bash
  # Annotations of the ReCTS test split are not publicly available; split off a
  # validation set by adding --val-ratio 0.2
  python tools/dataset_converters/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── rects
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_val.json (optional)
  │   └── instances_training.json
  ```

## ILST

- Step1: Download `IIIT-ILST` from [onedrive](https://iiitaphyd-my.sharepoint.com/:f:/g/personal/minesh_mathew_research_iiit_ac_in/EtLvCozBgaBIoqglF4M-lHABMgNcCDW9rJYKKWpeSQEElQ?e=zToXZP)

- Step2: Run the following commands

  ```bash
  unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
  cd IIIT-ILST

  # rename files
  cd Devanagari && for i in *; do mv -f "$i" "devanagari_$i"; done && cd ..
  cd Malayalam && for i in *; do mv -f "$i" "malayalam_$i"; done && cd ..
  cd Telugu && for i in *; do mv -f "$i" "telugu_$i"; done && cd ..

  # transfer image path
  mkdir imgs && mkdir annotations
  mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
  mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
  mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/

  # remove unnecessary files
  rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
  ```

- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. For example, `--val-ratio 0.2` holds out 20% of the data as the validation set.

  ```bash
  python tools/dataset_converters/textdet/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── IIIT-ILST
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_val.json (optional)
  │   └── instances_training.json
  ```
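
The effect of `--val-ratio` can be pictured with a short sketch. This is not the converters' actual implementation: the function name `split_by_val_ratio`, the shuffling, and the fixed seed are assumptions made purely for illustration.

```python
import random


def split_by_val_ratio(items, val_ratio, seed=0):
    """Split ``items`` into (training, validation) lists.

    Illustrative only: the real converters may order or sample differently.
    """
    if not 0 <= val_ratio < 1:
        raise ValueError('val_ratio must be in [0, 1)')
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    num_val = int(len(shuffled) * val_ratio)
    # The first ``num_val`` shuffled items form the validation set.
    return shuffled[num_val:], shuffled[:num_val]


image_names = [f'img_{i}.jpg' for i in range(10)]
training, validation = split_by_val_ratio(image_names, val_ratio=0.2)
print(len(training), len(validation))  # 8 2
```

With ten images and a ratio of 0.2, two images are held out for validation and eight remain for training.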

## VinText

- Step1: Download [vintext.zip](https://drive.google.com/file/d/1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml/view) to `vintext/`.

  ```bash
  mkdir vintext && cd vintext

  # Download dataset from google drive
  wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt

  # Extract images and annotations
  unzip -q vintext.zip && rm vintext.zip
  mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
  rm -rf vietnamese

  # Rename files
  mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
  mkdir imgs
  mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
  ```

- Step2: Generate `instances_training.json`, `instances_test.json`, and `instances_unseen_test.json` with the following command:

  ```bash
  python tools/dataset_converters/textdet/vintext_converter.py PATH/TO/vintext --nproc 4
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── vintext
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_test.json
  │   ├── instances_unseen_test.json
  │   └── instances_training.json
  ```

## BID

- Step1: Download [BID Dataset.zip](https://drive.google.com/file/d/1Oi88TRcpdjZmJ79WDLb9qFlBNG8q2De6/view)

- Step2: Run the following commands to preprocess the dataset

  ```bash
  # Rename
  mv BID\ Dataset.zip BID_Dataset.zip

  # Unzip and Rename
  unzip -q BID_Dataset.zip && rm BID_Dataset.zip
  mv BID\ Dataset BID

  # The extracted BID files may have restrictive permissions; if so,
  # grant access to the whole directory
  chmod -R 777 BID
  cd BID
  mkdir imgs && mkdir annotations

  # For images and annotations
  mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
  mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
  mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
  mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
  mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
  mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
  mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
  mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso

  # Remove unnecessary files
  rm -rf desktop.ini
  ```

- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. For example, `--val-ratio 0.2` holds out 20% of the data as the validation set.

  ```bash
  python tools/dataset_converters/textdet/bid_converter.py PATH/TO/BID --nproc 4
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── BID
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_training.json
  │   └── instances_val.json (optional)
  ```

## RCTW

- Step1: Download `train_images.zip.001`, `train_images.zip.002`, and `train_gts.zip` from the [homepage](https://rctw.vlrlab.net/dataset.html), extract the zips to `rctw/imgs` and `rctw/annotations`, respectively.

- Step2: Generate `instances_training.json` and `instances_val.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset. For example, `--val-ratio 0.2` holds out 20% of the data as the validation set.

  ```bash
  # Annotations of the RCTW test split are not publicly available; split off a validation set by adding --val-ratio 0.2
  python tools/dataset_converters/textdet/rctw_converter.py PATH/TO/rctw --nproc 4
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── rctw
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_training.json
  │   └── instances_val.json (optional)
  ```

## HierText

- Step1 (optional): Install [AWS CLI](#install-aws-cli-optional).

- Step2: Clone [HierText](https://github.com/google-research-datasets/hiertext) repo to get annotations

  ```bash
  mkdir HierText && cd HierText
  git clone https://github.com/google-research-datasets/hiertext.git
  ```

- Step3: Download `train.tgz` and `validation.tgz` from AWS

  ```bash
  aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
  aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
  ```

- Step4: Process raw data

  ```bash
  # process annotations
  mv hiertext/gt ./
  rm -rf hiertext
  mv gt annotations
  gzip -d annotations/train.jsonl.gz
  gzip -d annotations/validation.jsonl.gz
  # process images
  mkdir imgs
  mv train.tgz imgs/
  mv validation.tgz imgs/
  tar -xzvf imgs/train.tgz -C imgs
  tar -xzvf imgs/validation.tgz -C imgs
  ```

- Step5: Generate `instances_training.json` and `instances_val.json`. HierText includes different levels of annotation: paragraph, line, and word. Check the original [paper](https://arxiv.org/pdf/2203.15143.pdf) for details. Set `--level paragraph`, `--level line`, or `--level word` to get paragraph-, line-, or word-level annotations, respectively.

  ```bash
  # Collect word-level annotations from HierText with --level word
  python tools/dataset_converters/textdet/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── HierText
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_training.json
  │   └── instances_val.json
  ```
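
The three `--level` choices can be illustrated on a toy record. The sketch below is not the converter's code, and the field names (`paragraphs`, `lines`, `words`, `vertices`) are assumptions modelled on the paragraph, line, and word nesting described above. Check the real annotation files before relying on them.

```python
def collect_polygons(record, level):
    """Collect the polygons found at ``level`` in one nested record."""
    if level not in ('paragraph', 'line', 'word'):
        raise ValueError(f'unknown level: {level}')
    polygons = []
    for paragraph in record['paragraphs']:
        if level == 'paragraph':
            polygons.append(paragraph['vertices'])
            continue
        for line in paragraph['lines']:
            if level == 'line':
                polygons.append(line['vertices'])
                continue
            for word in line['words']:
                polygons.append(word['vertices'])
    return polygons


def box(x0, y0, x1, y1):
    """Build a four-point polygon from two corners (toy data helper)."""
    return [[x0, y0], [x1, y0], [x1, y1], [x0, y1]]


# Toy record: one paragraph holding two lines and three words in total.
record = {
    'paragraphs': [{
        'vertices': box(0, 0, 100, 40),
        'lines': [
            {'vertices': box(0, 0, 100, 20),
             'words': [{'vertices': box(0, 0, 45, 20)},
                       {'vertices': box(55, 0, 100, 20)}]},
            {'vertices': box(0, 20, 60, 40),
             'words': [{'vertices': box(0, 20, 60, 40)}]},
        ],
    }]
}

for level in ('paragraph', 'line', 'word'):
    print(level, len(collect_polygons(record, level)))
```

A coarser level yields fewer, larger polygons: here one paragraph, two lines, and three words.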

## ArT

- Step1: Download `train_images.tar.gz` and `train_labels.json` from the [homepage](https://rrc.cvc.uab.es/?ch=14&com=downloads) to `art/`

  ```bash
  mkdir art && cd art
  mkdir annotations

  # Download ArT dataset
  wget https://dataset-bj.cdn.bcebos.com/art/train_images.tar.gz --no-check-certificate
  wget https://dataset-bj.cdn.bcebos.com/art/train_labels.json --no-check-certificate

  # Extract
  tar -xf train_images.tar.gz
  mv train_images imgs
  mv train_labels.json annotations/

  # Remove unnecessary files
  rm train_images.tar.gz
  ```

- Step2: Generate `instances_training.json` and `instances_val.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset. For example, `--val-ratio 0.2` holds out 20% of the data as the validation set.

  ```bash
  # Annotations of the ArT test split are not publicly available; split off a validation set by adding --val-ratio 0.2
  python tools/dataset_converters/textdet/art_converter.py PATH/TO/art --nproc 4
  ```

- After running the above commands, the directory structure should be as follows:

  ```text
  │── art
  │   ├── annotations
  │   ├── imgs
  │   ├── instances_training.json
  │   └── instances_val.json (optional)
  ```