# Text Detection
```{note}
This page is a manual preparation guide for datasets not yet supported by [Dataset Preparer](./dataset_preparer.md), into which all of these scripts will eventually be migrated.
```
## Overview
| Dataset | Images | Training annotations | Validation annotations | Testing annotations |
| :---------------: | :------------------------------------------------------: | :------------------------------------------------: | :-----------------------------------------------------------------: | :-----: |
| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - | - |
| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - |
| CurvedSynText150k | [homepage](https://github.com/aim-uofa/AdelaiDet/blob/master/datasets/README.md) \| [Part1](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) \| [Part2](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) | [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) | - | - |
| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - | - |
| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - |
| LSVT | [homepage](https://rrc.cvc.uab.es/?ch=16) | - | - | - |
| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - |
| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - |
| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - |
| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | - |
| IIIT-ILST | [homepage](http://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-ilst) | - | - | - |
| VinText | [homepage](https://github.com/VinAIResearch/dict-guided) | - | - | - |
| BID | [homepage](https://github.com/ricardobnjunior/Brazilian-Identity-Document-Dataset) | - | - | - |
| RCTW | [homepage](https://rctw.vlrlab.net/index.html) | - | - | - |
| HierText | [homepage](https://github.com/google-research-datasets/hiertext) | - | - | - |
| ArT | [homepage](https://rrc.cvc.uab.es/?ch=14) | - | - | - |
### Install AWS CLI (optional)
- Some datasets require the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) to be installed in advance. A quick installation guide:
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Alternatively, set the install and bin directories explicitly:
# ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
aws configure
# This command prompts for the values below; all of them can be left
# empty except the default region name.
# AWS Access Key ID [None]:
# AWS Secret Access Key [None]:
# Default region name [None]: us-east-1
# Default output format [None]:
```
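Before moving on to the datasets that need it, it may help to confirm that the `aws` binary is reachable from the shell. A minimal sketch; `require_cmd` is a hypothetical helper name, not something shipped with MMOCR or the AWS CLI:

```bash
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found, install it before continuing" >&2
        return 1
    fi
}
```

For example, `require_cmd aws && aws --version` prints the installed version only when the binary is on `PATH`.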
Users in China can also download these datasets at high speed from [OpenDataLab](https://opendatalab.com/):
- [CTW1500](https://opendatalab.com/SCUT-CTW1500?source=OpenMMLab%20GitHub)
- [ICDAR2013](https://opendatalab.com/ICDAR_2013?source=OpenMMLab%20GitHub)
- [ICDAR2015](https://opendatalab.com/ICDAR2015?source=OpenMMLab%20GitHub)
- [Totaltext](https://opendatalab.com/TotalText?source=OpenMMLab%20GitHub)
- [MSRA-TD500](https://opendatalab.com/MSRA-TD500?source=OpenMMLab%20GitHub)
## Important Note
```{note}
**For users who want to train models on the CTW1500, ICDAR 2015/2017, and Totaltext datasets:** some images may carry orientation info in their EXIF data. The default OpenCV backend used in MMCV reads this info and rotates the images accordingly. However, the gold annotations were made on the raw pixels, so this inconsistency produces false examples in the training set. Users should therefore use `dict(type='LoadImageFromFile', color_type='color_ignore_orientation')` in pipelines to change MMCV's default loading behaviour (see [DBNet's pipeline config](https://github.com/open-mmlab/mmocr/blob/main/configs/_base_/det_pipelines/dbnet_pipeline.py) for an example).
```
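A plain text search is one quick way to see whether a config already opts out of EXIF rotation. The sketch below assumes the pipeline lives in a file on disk; `uses_ignore_orientation` is a hypothetical helper, and a text match cannot tell whether the setting sits in a comment or in the pipeline that is actually used.

```bash
# Hypothetical helper: report whether a config file mentions the
# orientation-ignoring color type described in the note above.
uses_ignore_orientation() {
    cfg="$1"
    if grep -q "color_ignore_orientation" "$cfg"; then
        echo "$cfg: EXIF orientation is ignored"
    else
        echo "$cfg: EXIF orientation may be applied" >&2
        return 1
    fi
}
```

Calling `uses_ignore_orientation PATH/TO/your_config.py` returns a non-zero status when the string is absent, so it can gate a training script.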
## ICDAR 2011 (Born-Digital Images)
- Step1: Download `Challenge1_Training_Task12_Images.zip`, `Challenge1_Training_Task1_GT.zip`, `Challenge1_Test_Task12_Images.zip`, and `Challenge1_Test_Task1_GT.zip` from the [homepage](https://rrc.cvc.uab.es/?ch=1&com=downloads), under `Task 1.1: Text Localization (2013 edition)`.
```bash
mkdir icdar2011 && cd icdar2011
mkdir imgs && mkdir annotations
# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate
# For images
unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test
# For annotations
unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test
rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
```
- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── icdar2011
│ ├── imgs
│ ├── instances_test.json
│ └── instances_training.json
```
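The expected layout can be checked with a small shell function before training. This is a sketch only: `check_layout` is a hypothetical helper, and the entries to look for are simply the ones listed in the tree above.

```bash
# Hypothetical helper: check that every expected entry exists under a dataset root.
# Usage: check_layout ROOT ENTRY [ENTRY...]
check_layout() {
    root="$1"
    shift
    missing=0
    for entry in "$@"; do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $root/$entry" >&2
            missing=1
        fi
    done
    # Non-zero status if at least one entry was not found.
    return "$missing"
}
```

For ICDAR 2011 this would be `check_layout PATH/TO/icdar2011 imgs instances_test.json instances_training.json`. The same function can be pointed at the other datasets on this page by passing their entries.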
## ICDAR 2017
- Follow steps similar to those for [ICDAR 2015](#icdar-2015).
- The resulting directory structure looks like the following:
```text
├── icdar2017
│ ├── imgs
│ ├── annotations
│ ├── instances_training.json
│ └── instances_val.json
```
## CurvedSynText150k
- Step1: Download [syntext1.zip](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) and [syntext2.zip](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) to `CurvedSynText150k/`.
- Step2: Extract both archives:
```bash
unzip -q syntext1.zip
mv train.json train1.json
unzip images.zip
rm images.zip
unzip -q syntext2.zip
mv train.json train2.json
unzip images.zip
rm images.zip
```
- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) to `CurvedSynText150k/`
- Or, generate `instances_training.json` with the following command:
```bash
python tools/dataset_converters/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4
```
- The resulting directory structure looks like the following:
```text
├── CurvedSynText150k
│ ├── syntext_word_eng
│ ├── emcs_imgs
│ └── instances_training.json
```
## DeText
- Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the [homepage](https://rrc.cvc.uab.es/?ch=9).
```bash
mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val
# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate
# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val
# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
```bash
python tools/dataset_converters/textdet/detext_converter.py PATH/TO/detext --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── detext
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json
```
## Lecture Video DB
- Step1: Download [IIIT-CVid.zip](http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip) to `lv/`.
```bash
mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

mv IIIT-CVid/Frames imgs

rm IIIT-CVid.zip
```
- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/lv_converter.py PATH/TO/lv --nproc 4
```
- The resulting directory structure looks like the following:
```text
│── lv
│ ├── imgs
│ ├── instances_test.json
│ ├── instances_training.json
│ └── instances_val.json
```
## LSVT
- Step1: Download [train_full_images_0.tar.gz](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz), [train_full_images_1.tar.gz](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz), and [train_full_labels.json](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json) to `lsvt/`.
```bash
mkdir lsvt && cd lsvt
# Download LSVT dataset
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json
mkdir annotations
tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
mv train_full_images_0 imgs
rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# Annotations of the LSVT test split are not publicly available; split a validation
# set by adding --val-ratio 0.2
python tools/dataset_converters/textdet/lsvt_converter.py PATH/TO/lsvt
```
- After running the above commands, the directory structure should be as follows:
```text
│── lsvt
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json (optional)
```
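Because Step1 merges two image archives into a single `imgs/` folder, counting the files afterwards is a cheap way to notice an interrupted download or a failed `mv`. A sketch, with `count_jpgs` as a hypothetical helper; it relies on `find -maxdepth`, which GNU and BSD `find` both provide:

```bash
# Hypothetical helper: count the .jpg files directly inside a directory.
count_jpgs() {
    find "$1" -maxdepth 1 -type f -name '*.jpg' | wc -l | tr -d ' '
}
```

Run `count_jpgs imgs` inside `lsvt/` and compare the number with the image count stated on the dataset homepage.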
## IMGUR
- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5) in your local repository to enable a **much faster** parallel execution of image download.
```bash
mkdir imgur && cd imgur
git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
rm -rf IMGUR5K-Handwriting-Dataset
```
- Step2: Generate `instances_training.json`, `instances_val.json` and `instances_test.json` with the following command:
```bash
python tools/dataset_converters/textdet/imgur_converter.py PATH/TO/imgur
```
- After running the above commands, the directory structure should be as follows:
```text
│── imgur
│ ├── annotations
│ ├── imgs
│ ├── instances_test.json
│ ├── instances_training.json
│ └── instances_val.json
```
## KAIST
- Step1: Download [KAIST_all.zip](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) to `kaist/`.
```bash
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations
# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip
rm KAIST_all.zip
```
- Step2: Extract zips:
```bash
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
```
- Step3: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
python tools/dataset_converters/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── kaist
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json (optional)
```
## MTWI
- Step1: Download `mtwi_2018_train.zip` from the [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us).
```bash
mkdir mtwi && cd mtwi
unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations
rm mtwi_2018_train.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# Annotations of the MTWI test split are not publicly available; split a validation
# set by adding --val-ratio 0.2
python tools/dataset_converters/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── mtwi
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json (optional)
```
## ReCTS
- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).
```bash
mkdir rects && cd rects
# Download ReCTS dataset
# You can also find Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip
mv img imgs && mv gt_unicode annotations
rm ReCTS.zip && rm -rf gt
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:
```bash
# Annotations of the ReCTS test split are not publicly available; split a validation
# set by adding --val-ratio 0.2
python tools/dataset_converters/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
```
- After running the above commands, the directory structure should be as follows:
```text
│── rects
│ ├── annotations
│ ├── imgs
│ ├── instances_val.json (optional)
│ └── instances_training.json
```
## ILST
- Step1: Download `IIIT-ILST` from [onedrive](https://iiitaphyd-my.sharepoint.com/:f:/g/personal/minesh_mathew_research_iiit_ac_in/EtLvCozBgaBIoqglF4M-lHABMgNcCDW9rJYKKWpeSQEElQ?e=zToXZP)
- Step2: Run the following commands

```bash
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
cd IIIT-ILST
# rename files
cd Devanagari && for i in `ls` ; do mv -f $i `echo "devanagari_"$i` ; done && cd ..
cd Malayalam && for i in `ls` ; do mv -f $i `echo "malayalam_"$i` ; done && cd ..
cd Telugu && for i in `ls` ; do mv -f $i `echo "telugu_"$i` ; done && cd ..
# transfer image path
mkdir imgs && mkdir annotations
mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/
# remove unnecessary files
rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
```
- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
```bash
python tools/dataset_converters/textdet/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
```
- After running the above commands, the directory structure should be as follows:

```text
│── IIIT-ILST
│ ├── annotations
│ ├── imgs
│ ├── instances_val.json (optional)
│ └── instances_training.json
```
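The renaming loops in Step2 iterate over the output of `ls`, which splits file names on whitespace. If that is a concern, a glob-based variant is one option. A sketch under that assumption; `prefix_files` is a hypothetical helper, and like the original loops it should be run only once per directory:

```bash
# Hypothetical helper: prepend PREFIX to every regular file in DIR.
# Globbing (instead of parsing `ls`) keeps names with spaces intact.
prefix_files() {
    dir="$1"
    prefix="$2"
    for path in "$dir"/*; do
        [ -f "$path" ] || continue   # skip subdirectories and unmatched globs
        name=$(basename "$path")
        mv -f "$path" "$dir/$prefix$name"
    done
}
```

For example, `prefix_files Devanagari devanagari_` would stand in for the first loop of Step2.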
## VinText
- Step1: Download [vintext.zip](https://drive.google.com/drive/my-drive) to `vintext`
```bash
mkdir vintext && cd vintext
# Download dataset from google drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt
# Extract images and annotations
unzip -q vintext.zip && rm vintext.zip
mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
rm -rf vietnamese
# Rename files
mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
```
- Step2: Generate `instances_training.json`, `instances_test.json` and `instances_unseen_test.json`

```bash
python tools/dataset_converters/textdet/vintext_converter.py PATH/TO/vintext --nproc 4
```
- After running the above commands, the directory structure should be as follows:

```text
│── vintext
│ ├── annotations
│ ├── imgs
│ ├── instances_test.json
│ ├── instances_unseen_test.json
│ └── instances_training.json
```
## BID
- Step1: Download [BID Dataset.zip](https://drive.google.com/file/d/1Oi88TRcpdjZmJ79WDLb9qFlBNG8q2De6/view)
- Step2: Run the following commands to preprocess the dataset
```bash
# Rename
mv BID\ Dataset.zip BID_Dataset.zip
# Unzip and Rename
unzip -q BID_Dataset.zip && rm BID_Dataset.zip
mv BID\ Dataset BID
# The extracted files may lack the required permissions;
# grant them before moving anything
chmod -R 777 BID
cd BID
mkdir imgs && mkdir annotations
# For images and annotations
mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso
# Remove unnecessary files
rm -rf desktop.ini
```
- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
```bash
python tools/dataset_converters/textdet/bid_converter.py PATH/TO/BID --nproc 4
```
- After running the above commands, the directory structure should be as follows:

```text
│── BID
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json (optional)
```
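The eight per-folder lines in Step2 all follow the same pattern, so they can also be written as a loop. A sketch, assuming the folder names listed in Step2; `collect_bid` is a hypothetical helper:

```bash
# Hypothetical helper: move "*in.jpg" images and "*txt" labels from each
# listed subfolder of ROOT into ROOT/imgs and ROOT/annotations, then
# delete the emptied subfolder.
collect_bid() {
    root="$1"
    shift
    mkdir -p "$root/imgs" "$root/annotations"
    for sub in "$@"; do
        mv "$root/$sub"/*in.jpg "$root/imgs/" &&
        mv "$root/$sub"/*txt "$root/annotations/" &&
        rm -rf "$root/$sub"
    done
}
```

From inside `BID/`, the call would be `collect_bid . CNH_Aberta CNH_Frente CNH_Verso CPF_Frente CPF_Verso RG_Aberto RG_Frente RG_Verso`.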
## RCTW
- Step1: Download `train_images.zip.001`, `train_images.zip.002`, and `train_gts.zip` from the [homepage](https://rctw.vlrlab.net/dataset.html), then extract the images to `rctw/imgs` and the annotations to `rctw/annotations`.
- Step2: Generate `instances_training.json` and `instances_val.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
```bash
# Annotations of the RCTW test split are not publicly available; split a validation set by adding --val-ratio 0.2
python tools/dataset_converters/textdet/rctw_converter.py PATH/TO/rctw --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── rctw
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json (optional)
```
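To get a feel for what a given `--val-ratio` means in image counts, the arithmetic can be sketched in the shell. The rounding here (validation count rounded down) is an assumption for illustration, and the converter may round differently; `split_sizes` is a hypothetical helper that takes the ratio as a whole-number percentage:

```bash
# Hypothetical helper: estimate split sizes for TOTAL images and a
# validation share given in percent (20 stands for --val-ratio 0.2).
# Assumption: the validation count is rounded down.
split_sizes() {
    total="$1"
    val_percent="$2"
    val=$(( total * val_percent / 100 ))
    echo "training=$(( total - val )) val=$val"
}
```

For instance, `split_sizes 1000 20` prints `training=800 val=200`.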
## HierText
- Step1 (optional): Install [AWS CLI](https://mmocr.readthedocs.io/en/latest/datasets/det.html#install-aws-cli-optional).
- Step2: Clone the [HierText](https://github.com/google-research-datasets/hiertext) repo to get annotations
```bash
mkdir HierText && cd HierText
git clone https://github.com/google-research-datasets/hiertext.git
```
- Step3: Download `train.tgz` and `validation.tgz` from AWS
```bash
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
```
- Step4: Process raw data
```bash
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.jsonl.gz
gzip -d annotations/validation.jsonl.gz
# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz -C imgs/
tar -xzvf imgs/validation.tgz -C imgs/
```
- Step5: Generate `instances_training.json` and `instances_val.json`. HierText includes different levels of annotation: paragraph, line, and word. Check the original [paper](https://arxiv.org/pdf/2203.15143.pdf) for details. Set `--level paragraph`, `--level line`, or `--level word` to get annotations at the corresponding level.
```bash
# Collect word annotation from HierText --level word
python tools/dataset_converters/textdet/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── HierText
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json
```
## ArT
- Step1: Download `train_images.tar.gz` and `train_labels.json` from the [homepage](https://rrc.cvc.uab.es/?ch=14&com=downloads) to `art/`
```bash
mkdir art && cd art
mkdir annotations
# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_images.tar.gz --no-check-certificate
wget https://dataset-bj.cdn.bcebos.com/art/train_labels.json --no-check-certificate
# Extract
tar -xf train_images.tar.gz
mv train_images imgs
mv train_labels.json annotations/
# Remove unnecessary files
rm train_images.tar.gz
```
- Step2: Generate `instances_training.json` and `instances_val.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
```bash
# Annotations of the ArT test split are not publicly available; split a validation set by adding --val-ratio 0.2
python tools/dataset_converters/textdet/art_converter.py PATH/TO/art --nproc 4
```
- After running the above commands, the directory structure should be as follows:
```text
│── art
│ ├── annotations
│ ├── imgs
│ ├── instances_training.json
│ └── instances_val.json (optional)
```