2021-12-15 15:04:41 +08:00
# Text Recognition
2021-08-25 16:41:07 +08:00
## Overview
2022-07-13 20:25:10 +08:00
| Dataset | images | annotation file | annotation file |
| :-------------------: | :---------------------------------------------------------------------------------------------------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------: |
| | | training | test |
| coco_text | [homepage ](https://rrc.cvc.uab.es/?ch=5&com=downloads ) | [train_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt ) | - |
| ICDAR2011 | [homepage ](https://rrc.cvc.uab.es/?ch=1 ) | - | - |
| ICDAR2013 | [homepage ](https://rrc.cvc.uab.es/?ch=2 ) | - | - |
| icdar_2015 | [homepage ](https://rrc.cvc.uab.es/?ch=4&com=downloads ) | [train_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt ) | [test_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt ) |
| IIIT5K | [homepage ](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html ) | [train_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt ) | [test_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt ) |
| ct80 | [homepage ](http://cs-chan.com/downloads_CUTE80_dataset.html ) | - | [test_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt ) |
| svt | [homepage ](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset ) | - | [test_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/svt/test_label.txt ) |
| svtp | [unofficial homepage\[1\]](https://github.com/Jyouhou/Case-Sensitive-Scene-Text-Recognition-Datasets) | - | [test_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/svtp/test_label.txt ) |
| MJSynth (Syn90k) | [homepage ](https://www.robots.ox.ac.uk/~vgg/data/text/ ) | [shuffle_labels.txt ](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/shuffle_labels.txt ) \| [label.txt ](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/label.txt ) | - |
| SynthText (Synth800k) | [homepage ](https://www.robots.ox.ac.uk/~vgg/data/scenetext/ ) | [alphanumeric_labels.txt ](https://download.openmmlab.com/mmocr/data/mixture/SynthText/alphanumeric_labels.txt ) \|[shuffle_labels.txt ](https://download.openmmlab.com/mmocr/data/mixture/SynthText/shuffle_labels.txt ) \| [instances_train.txt ](https://download.openmmlab.com/mmocr/data/mixture/SynthText/instances_train.txt ) \| [label.txt ](https://download.openmmlab.com/mmocr/data/mixture/SynthText/label.txt ) | - |
| SynthAdd | [SynthText_Add.zip ](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg ) (code:627x) | [label.txt ](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt ) | - |
| TextOCR | [homepage ](https://textvqa.org/textocr/dataset ) | - | - |
| Totaltext | [homepage ](https://github.com/cs-chan/Total-Text-Dataset ) | - | - |
| OpenVINO | [Open Images ](https://github.com/cvdfoundation/open-images-dataset ) | [annotations ](https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text ) | [annotations ](https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text ) |
| FUNSD | [homepage ](https://guillaumejaume.github.io/FUNSD/ ) | - | - |
| DeText | [homepage ](https://rrc.cvc.uab.es/?ch=9 ) | - | - |
| NAF | [homepage ](https://github.com/herobd/NAF_dataset ) | - | - |
| SROIE | [homepage ](https://rrc.cvc.uab.es/?ch=13 ) | - | - |
| Lecture Video DB | [homepage ](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb ) | - | - |
| LSVT | [homepage ](https://rrc.cvc.uab.es/?ch=16 ) | - | - |
| IMGUR | [homepage ](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset ) | - | - |
| KAIST | [homepage ](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database ) | - | - |
| MTWI | [homepage ](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us ) | - | - |
| COCO Text v2 | [homepage ](https://bgshih.github.io/cocotext/ ) | - | - |
| ReCTS | [homepage ](https://rrc.cvc.uab.es/?ch=12 ) | - | - |
| IIIT-ILST | [homepage ](http://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-ilst ) | - | - |
| VinText | [homepage ](https://github.com/VinAIResearch/dict-guided ) | - | - |
| BID | [homepage ](https://github.com/ricardobnjunior/Brazilian-Identity-Document-Dataset ) | - | - |
| RCTW | [homepage ](https://rctw.vlrlab.net/index.html ) | - | - |
| HierText | [homepage ](https://github.com/google-research-datasets/hiertext ) | - | - |
| ArT | [homepage ](https://rrc.cvc.uab.es/?ch=14 ) | - | - |
2022-06-09 14:58:44 +08:00
2022-07-13 21:50:39 +08:00
(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.
2021-08-25 16:41:07 +08:00
2022-05-05 16:31:36 +08:00
### Install AWS CLI (optional)
- Since there are some datasets that require the [AWS CLI ](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html ) to be installed in advance, we provide a quick installation guide here:
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
./aws/install -i /usr/local/aws-cli -b /usr/local/bin
!aws configure
# this command will require you to input keys, you can skip them except
# for the Default region name
# AWS Access Key ID [None]:
# AWS Secret Access Key [None]:
# Default region name [None]: us-east-1
# Default output format [None]
```
2022-03-31 15:10:02 +08:00
## ICDAR 2011 (Born-Digital Images)
2021-08-25 16:41:07 +08:00
2022-03-30 22:07:17 +08:00
- Step1: Download `Challenge1_Training_Task3_Images_GT.zip` , `Challenge1_Test_Task3_Images.zip` , and `Challenge1_Test_Task3_GT.txt` from [homepage ](https://rrc.cvc.uab.es/?ch=1&com=downloads ) `Task 1.3: Word Recognition (2013 edition)` .
```bash
mkdir icdar2011 & & cd icdar2011
mkdir annotations
# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate
# For images
mkdir crops
unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train
unzip -q Challenge1_Test_Task3_Images.zip -d crops/test
# For annotations
mv Challenge1_Test_Task3_GT.txt annotations & & mv train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
```
- Step2: Convert original annotations to `Train_label.jsonl` and `Test_label.jsonl` with the following command:
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/ic11_converter.py PATH/TO/icdar2011
2022-03-30 22:07:17 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
├── icdar2011
│ ├── crops
│ ├── train_label.jsonl
│ └── test_label.jsonl
```
2022-03-31 15:10:02 +08:00
## ICDAR 2013 (Focused Scene Text)
2022-03-30 22:07:17 +08:00
- Step1: Download `Challenge2_Training_Task3_Images_GT.zip` , `Challenge2_Test_Task3_Images.zip` , and `Challenge2_Test_Task3_GT.txt` from [homepage ](https://rrc.cvc.uab.es/?ch=2&com=downloads ) `Task 2.3: Word Recognition (2013 edition)` .
```bash
mkdir icdar2013 & & cd icdar2013
mkdir annotations
# Download ICDAR 2013
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_GT.txt --no-check-certificate
# For images
mkdir crops
unzip -q Challenge2_Training_Task3_Images_GT.zip -d crops/train
unzip -q Challenge2_Test_Task3_Images.zip -d crops/test
# For annotations
mv Challenge2_Test_Task3_GT.txt annotations & & mv crops/train/gt.txt annotations/Challenge2_Train_Task3_GT.txt
rm Challenge2_Training_Task3_Images_GT.zip & & rm Challenge2_Test_Task3_Images.zip
```
- Step 2: Generate `Train_label.jsonl` and `Test_label.jsonl` with the following command:
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/ic13_converter.py PATH/TO/icdar2013
2022-03-30 22:07:17 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
├── icdar2013
│ ├── crops
│ ├── train_label.jsonl
│ └── test_label.jsonl
```
2022-07-13 21:50:39 +08:00
## ICDAR 2013 [Deprecated]
2022-03-31 15:10:02 +08:00
2021-08-25 16:41:07 +08:00
- Step1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from [homepage ](https://rrc.cvc.uab.es/?ch=2&com=downloads )
- Step2: Download [test_label_1015.txt ](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt ) and [train_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/train_label.txt )
2022-03-31 15:10:02 +08:00
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── icdar_2013
│ ├── train_label.txt
│ ├── test_label_1015.txt
│ ├── test_label_1095.txt
│ ├── Challenge2_Training_Task3_Images_GT
│ └── Challenge2_Test_Task3_Images
```
## ICDAR 2015
2021-09-28 11:32:44 +08:00
2021-08-25 16:41:07 +08:00
- Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from [homepage ](https://rrc.cvc.uab.es/?ch=4&com=downloads )
- Step2: Download [train_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/train_label.txt ) and [test_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/icdar_2015/test_label.txt )
2022-03-31 15:10:02 +08:00
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── icdar_2015
│ ├── train_label.txt
│ ├── test_label.txt
│ ├── ch4_training_word_images_gt
│ └── ch4_test_word_images_gt
```
## IIIT5K
- Step1: Download `IIIT5K-Word_V3.0.tar.gz` from [homepage ](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html )
- Step2: Download [train_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/train_label.txt ) and [test_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/IIIT5K/test_label.txt )
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── III5K
│ ├── train_label.txt
│ ├── test_label.txt
│ ├── train
│ └── test
```
## svt
- Step1: Download `svt.zip` form [homepage ](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset )
- Step2: Download [test_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/svt/test_label.txt )
- Step3:
2021-08-25 16:41:07 +08:00
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/svt_converter.py < download_svt_dir_path >
2021-08-25 16:41:07 +08:00
```
2022-03-31 15:10:02 +08:00
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── svt
│ ├── test_label.txt
│ └── image
```
## ct80
- Step1: Download [test_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/ct80/test_label.txt )
2022-06-09 14:58:44 +08:00
2022-07-20 19:25:39 +08:00
- Step2: Download [timage.tar.gz ](https://download.openmmlab.com/mmocr/data/mixture/ct80/timage.tar.gz )
- Step3:
```bash
mkdir ct80 & & cd ct80
mv /path/to/test_label.txt .
mv /path/to/timage.tar.gz .
tar -xvf timage.tar.gz
# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/ct80 ct80
```
2022-03-31 15:10:02 +08:00
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── ct80
│ ├── test_label.txt
2022-07-20 19:25:39 +08:00
│ └── timage
2022-03-31 15:10:02 +08:00
```
## svtp
- Step1: Download [test_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/svtp/test_label.txt )
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2021-08-25 16:41:07 +08:00
2022-03-31 15:10:02 +08:00
```text
├── svtp
│ ├── test_label.txt
│ └── image
```
2021-08-25 16:41:07 +08:00
2022-03-31 15:10:02 +08:00
## coco_text
2021-08-25 16:41:07 +08:00
2022-03-31 15:10:02 +08:00
- Step1: Download from [homepage ](https://rrc.cvc.uab.es/?ch=5&com=downloads )
- Step2: Download [train_label.txt ](https://download.openmmlab.com/mmocr/data/mixture/coco_text/train_label.txt )
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── coco_text
│ ├── train_label.txt
│ └── train_words
```
## MJSynth (Syn90k)
- Step1: Download `mjsynth.tar.gz` from [homepage ](https://www.robots.ox.ac.uk/~vgg/data/text/ )
- Step2: Download [label.txt ](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/label.txt ) (8,919,273 annotations) and [shuffle_labels.txt ](https://download.openmmlab.com/mmocr/data/mixture/Syn90k/shuffle_labels.txt ) (2,400,000 randomly sampled annotations).
2022-06-09 14:58:44 +08:00
```{note}
2022-03-31 15:10:02 +08:00
Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
2022-06-09 14:58:44 +08:00
```
2022-03-31 15:10:02 +08:00
- Step3:
2021-08-25 16:41:07 +08:00
```bash
mkdir Syn90k & & cd Syn90k
mv /path/to/mjsynth.tar.gz .
tar -xzf mjsynth.tar.gz
mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .
# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/Syn90k Syn90k
2022-04-17 09:02:24 +08:00
# Convert 'txt' format annos to 'lmdb' (optional)
cd /path/to/mmocr
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
2021-08-25 16:41:07 +08:00
```
2022-03-31 15:10:02 +08:00
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── Syn90k
│ ├── shuffle_labels.txt
│ ├── label.txt
2022-04-17 09:02:24 +08:00
│ ├── label.lmdb (optional)
2022-03-31 15:10:02 +08:00
│ └── mnt
```
## SynthText (Synth800k)
2021-08-25 16:41:07 +08:00
- Step1: Download `SynthText.zip` from [homepage ](https://www.robots.ox.ac.uk/~vgg/data/scenetext/ )
2021-12-15 11:21:54 +08:00
- Step2: According to your actual needs, download the most appropriate one from the following options: [label.txt ](https://download.openmmlab.com/mmocr/data/mixture/SynthText/label.txt ) (7,266,686 annotations), [shuffle_labels.txt ](https://download.openmmlab.com/mmocr/data/mixture/SynthText/shuffle_labels.txt ) (2,400,000 randomly sampled annotations), [alphanumeric_labels.txt ](https://download.openmmlab.com/mmocr/data/mixture/SynthText/alphanumeric_labels.txt ) (7,239,272 annotations with alphanumeric characters only) and [instances_train.txt ](https://download.openmmlab.com/mmocr/data/mixture/SynthText/instances_train.txt ) (7,266,686 character-level annotations).
2022-06-09 14:58:44 +08:00
```{warning}
2021-12-15 11:21:54 +08:00
Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
2022-06-09 14:58:44 +08:00
```
2021-08-25 16:41:07 +08:00
- Step3:
2021-12-15 11:21:54 +08:00
2022-03-31 15:10:02 +08:00
```bash
mkdir SynthText & & cd SynthText
mv /path/to/SynthText.zip .
unzip SynthText.zip
mv SynthText synthtext
2021-12-15 11:21:54 +08:00
2022-03-31 15:10:02 +08:00
mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .
mv /path/to/alphanumeric_labels.txt .
mv /path/to/instances_train.txt .
2021-08-25 16:41:07 +08:00
2022-03-31 15:10:02 +08:00
# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthText SynthText
```
2021-08-25 16:41:07 +08:00
2022-04-17 09:02:24 +08:00
- Step4: Generate cropped images and labels:
2022-03-31 15:10:02 +08:00
```bash
cd /path/to/mmocr
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8
2022-04-17 09:02:24 +08:00
# Convert 'txt' format annos to 'lmdb' (optional)
cd /path/to/mmocr
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/utils/txt2lmdb.py -i data/mixture/SynthText/label.txt -o data/mixture/SynthText/label.lmdb
2022-03-31 15:10:02 +08:00
```
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── SynthText
│ ├── alphanumeric_labels.txt
│ ├── shuffle_labels.txt
│ ├── instances_train.txt
│ ├── label.txt
2022-04-17 09:02:24 +08:00
│ ├── label.lmdb (optional)
2022-03-31 15:10:02 +08:00
│ └── synthtext
```
## SynthAdd
2021-08-25 16:41:07 +08:00
- Step1: Download `SynthText_Add.zip` from [SynthAdd ](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg ) (code:627x))
- Step2: Download [label.txt ](https://download.openmmlab.com/mmocr/data/mixture/SynthAdd/label.txt )
- Step3:
2022-03-31 15:10:02 +08:00
```bash
mkdir SynthAdd & & cd SynthAdd
mv /path/to/SynthText_Add.zip .
unzip SynthText_Add.zip
2021-08-25 16:41:07 +08:00
2022-03-31 15:10:02 +08:00
mv /path/to/label.txt .
2021-08-25 16:41:07 +08:00
2022-03-31 15:10:02 +08:00
# create soft link
cd /path/to/mmocr/data/mixture
2021-08-25 16:41:07 +08:00
2022-03-31 15:10:02 +08:00
ln -s /path/to/SynthAdd SynthAdd
2022-04-17 09:02:24 +08:00
# Convert 'txt' format annos to 'lmdb' (optional)
cd /path/to/mmocr
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/utils/txt2lmdb.py -i data/mixture/SynthAdd/label.txt -o data/mixture/SynthAdd/label.lmdb
2022-03-31 15:10:02 +08:00
```
2021-08-25 16:41:07 +08:00
2022-03-31 15:10:02 +08:00
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── SynthAdd
│ ├── label.txt
2022-04-17 09:02:24 +08:00
│ ├── label.lmdb (optional)
2022-03-31 15:10:02 +08:00
│ └── SynthText_Add
```
2021-08-25 16:41:07 +08:00
2022-06-09 14:58:44 +08:00
````{tip}
2022-04-17 09:02:24 +08:00
To convert label file from `txt` format to `lmdb` format,
2022-03-31 15:10:02 +08:00
2021-08-25 16:41:07 +08:00
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/utils/txt2lmdb.py -i < txt_label_path > -o < lmdb_label_path >
2021-08-25 16:41:07 +08:00
```
2022-03-31 15:10:02 +08:00
2021-08-25 16:41:07 +08:00
For example,
2022-03-31 15:10:02 +08:00
2021-08-25 16:41:07 +08:00
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
2021-08-25 16:41:07 +08:00
```
2022-03-31 15:10:02 +08:00
2022-06-09 14:58:44 +08:00
````
2021-08-25 16:41:07 +08:00
2022-03-31 15:10:02 +08:00
## TextOCR
- Step1: Download [train_val_images.zip ](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip ), [TextOCR_0.1_train.json ](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json ) and [TextOCR_0.1_val.json ](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json ) to `textocr/` .
2021-08-25 16:41:07 +08:00
```bash
mkdir textocr & & cd textocr
# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json
# For images
unzip -q train_val_images.zip
mv train_images train
```
2022-03-31 15:10:02 +08:00
- Step2: Generate `train_label.txt` , `val_label.txt` and crop images using 4 processes with the following command:
2021-08-25 16:41:07 +08:00
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/textocr_converter.py /path/to/textocr 4
2021-08-25 16:41:07 +08:00
```
2022-03-31 15:10:02 +08:00
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── TextOCR
│ ├── image
│ ├── train_label.txt
│ └── val_label.txt
```
## Totaltext
2022-04-23 23:57:21 +08:00
- Step1: Download `totaltext.zip` from [github dataset ](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset ) and `groundtruth_text.zip` or `TT_new_train_GT.zip` (if you prefer to use the latest version of training annotations) from [github Groundtruth ](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text ) (Our totaltext_converter.py supports groundtruth with both .mat and .txt format).
2022-03-31 15:10:02 +08:00
2021-08-25 16:41:07 +08:00
```bash
mkdir totaltext & & cd totaltext
mkdir imgs & & mkdir annotations
# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test
2022-04-23 23:57:21 +08:00
# For legacy training and test annotations
2021-08-25 16:41:07 +08:00
unzip groundtruth_text.zip
2022-04-23 23:57:21 +08:00
mv Groundtruth/Polygon/Train annotations/training
mv Groundtruth/Polygon/Test annotations/test
2021-08-25 16:41:07 +08:00
2022-04-23 23:57:21 +08:00
# Using the latest training annotations
# WARNING: Delete legacy train annotations before running the following command.
unzip TT_new_train_GT.zip
mv Train annotations/training
2021-08-25 16:41:07 +08:00
```
2022-03-31 15:10:02 +08:00
- Step2: Generate cropped images, `train_label.txt` and `test_label.txt` with the following command (the cropped images will be saved to `data/totaltext/dst_imgs/` ):
2021-08-25 16:41:07 +08:00
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/totaltext_converter.py /path/to/totaltext
2021-08-25 16:41:07 +08:00
```
2021-10-28 16:49:36 +08:00
2022-04-23 23:57:21 +08:00
- After running the above codes, the directory structure should be as follows:
2022-03-31 15:10:02 +08:00
```text
2022-04-23 23:57:21 +08:00
├── totaltext
│ ├── dst_imgs
2022-03-31 15:10:02 +08:00
│ ├── train_label.txt
│ └── test_label.txt
```
## OpenVINO
2022-05-05 16:31:36 +08:00
- Step1 (optional): Install [AWS CLI ](https://mmocr.readthedocs.io/en/latest/datasets/recog.html#install-aws-cli-optional ).
2022-03-31 15:10:02 +08:00
- Step2: Download [Open Images ](https://github.com/cvdfoundation/open-images-dataset#download-images-with-bounding-boxes-annotations ) subsets `train_1` , `train_2` , `train_5` , `train_f` , and `validation` to `openvino/` .
2021-10-28 16:49:36 +08:00
```bash
mkdir openvino & & cd openvino
# Download Open Images subsets
for s in 1 2 5 f; do
aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
done
aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .
# Download annotations
for s in 1 2 5 f; do
wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
done
wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json
# Extract images
mkdir -p openimages_v5/val
for s in 1 2 5 f; do
tar zxf train_${s}.tar.gz -C openimages_v5
done
tar zxf validation.tar.gz -C openimages_v5/val
```
2022-03-31 15:10:02 +08:00
- Step3: Generate `train_{1,2,5,f}_label.txt` , `val_label.txt` and crop images using 4 processes with the following command:
2021-10-28 16:49:36 +08:00
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/openvino_converter.py /path/to/openvino 4
2021-10-28 16:49:36 +08:00
```
2022-03-04 12:25:54 +08:00
2022-03-31 15:10:02 +08:00
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── OpenVINO
│ ├── image_1
│ ├── image_2
│ ├── image_5
│ ├── image_f
│ ├── image_val
│ ├── train_1_label.txt
│ ├── train_2_label.txt
│ ├── train_5_label.txt
│ ├── train_f_label.txt
│ └── val_label.txt
```
## DeText
2022-03-30 14:43:33 +08:00
- Step1: Download `ch9_training_images.zip` , `ch9_training_localization_transcription_gt.zip` , `ch9_validation_images.zip` , and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the [homepage ](https://rrc.cvc.uab.es/?ch=9 ).
```bash
mkdir detext & & cd detext
mkdir imgs & & mkdir annotations & & mkdir imgs/training & & mkdir imgs/val & & mkdir annotations/training & & mkdir annotations/val
# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate
# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training & & unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training & & unzip -q ch9_validation_images.zip -d imgs/val & & unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val
# Remove zips
rm ch9_training_images.zip & & rm ch9_training_localization_transcription_gt.zip & & rm ch9_validation_images.zip & & rm ch9_validation_localization_transcription_gt.zip
```
- Step2: Generate `instances_training.json` and `instances_val.json` with following command:
```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/detext/ignores
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/detext_converter.py PATH/TO/detext --nproc 4
2022-03-30 14:43:33 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
├── detext
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
2022-03-31 15:10:02 +08:00
│ └── test_label.jsonl
2022-03-30 14:43:33 +08:00
```
2022-03-31 15:10:02 +08:00
## NAF
2022-03-30 14:31:47 +08:00
- Step1: Download [labeled_images.tar.gz ](https://github.com/herobd/NAF_dataset/releases/tag/v1.0 ) to `naf/` .
```bash
mkdir naf & & cd naf
# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz
# For images
mkdir annotations & & mv labeled_images imgs
# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ & & mv NAF_dataset/groups annotations/
rm -rf NAF_dataset & & rm labeled_images.tar.gz
```
- Step2: Generate `train_label.txt` , `val_label.txt` , and `test_label.txt` with following command:
```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/naf/ignores
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/naf_converter.py PATH/TO/naf --nproc 4
2022-03-30 14:31:47 +08:00
- After running the above codes, the directory structure should be as follows:
```text
├── naf
│ ├── crops
│ ├── train_label.txt
│ ├── val_label.txt
2022-03-31 15:10:02 +08:00
│ └── test_label.txt
2022-03-30 14:31:47 +08:00
```
2022-03-31 15:10:02 +08:00
## SROIE
2022-03-30 13:14:23 +08:00
- Step1: Step1: Download `0325updated.task1train(626p).zip` , `task1&2_test(361p).zip` , and `text.task1&2-test( 361p).zip` from [homepage ](https://rrc.cvc.uab.es/?ch=13&com=downloads ) to `sroie/`
- Step2:
```bash
mkdir sroie & & cd sroie
mkdir imgs & & mkdir annotations & & mkdir imgs/training
# Warnninig: The zip files downloaded from Google Drive and BaiduYun Cloud may
# be different, the user should revise the following commands to the correct
# file name if encounter with errors while extracting and move the files.
unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test( 361p\).zip
# For images
mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test
# For annotations
mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-testг и361p\)/ annotations/test
rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test( 361p\).zip
```
- Step3: Generate `train_label.jsonl` and `test_label.jsonl` and crop images using 4 processes with the following command:
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/sroie_converter.py PATH/TO/sroie --nproc 4
2022-03-30 13:14:23 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
├── sroie
│ ├── crops
│ ├── train_label.jsonl
2022-03-31 15:10:02 +08:00
│ └── test_label.jsonl
2022-03-30 13:14:23 +08:00
```
2022-03-29 11:50:27 +08:00
2022-03-31 15:10:02 +08:00
## Lecture Video DB
2022-06-09 14:58:44 +08:00
```{note}
2022-03-31 15:10:02 +08:00
The LV dataset has already provided cropped images and the corresponding annotations
2022-06-09 14:58:44 +08:00
```
2022-03-29 11:50:27 +08:00
- Step1: Download [IIIT-CVid.zip ](http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip ) to `lv/` .
2022-03-31 15:10:02 +08:00
```bash
mkdir lv & & cd lv
2022-03-29 11:50:27 +08:00
2022-03-31 15:10:02 +08:00
# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip
2022-03-29 11:50:27 +08:00
2022-03-31 15:10:02 +08:00
# For image
mv IIIT-CVid/Crops ./
2022-03-29 11:50:27 +08:00
2022-03-31 15:10:02 +08:00
# For annotation
mv IIIT-CVid/train.txt train_label.txt & & mv IIIT-CVid/val.txt val_label.txt & & mv IIIT-CVid/test.txt test_label.txt
2022-03-29 11:50:27 +08:00
2022-03-31 15:10:02 +08:00
rm IIIT-CVid.zip
```
2022-03-29 11:50:27 +08:00
- Step2: Generate `train_label.jsonl` , `val.jsonl` , and `test.jsonl` with following command:
2022-03-31 15:10:02 +08:00
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textdreog/lv_converter.py PATH/TO/lv
2022-03-31 15:10:02 +08:00
```
2022-04-18 09:15:42 +08:00
- After running the above codes, the directory structure should be as follows:
2022-03-30 22:07:17 +08:00
2022-03-31 15:10:02 +08:00
```text
├── lv
│ ├── Crops
│ ├── train_label.jsonl
│ └── test_label.jsonl
```
2022-04-23 23:57:21 +08:00
## LSVT
2022-04-18 09:15:42 +08:00
- Step1: Download [train_full_images_0.tar.gz ](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz ), [train_full_images_1.tar.gz ](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz ), and [train_full_labels.json ](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json ) to `lsvt/` .
```bash
mkdir lsvt & & cd lsvt
# Download LSVT dataset
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json
mkdir annotations
tar -xf train_full_images_0.tar.gz & & tar -xf train_full_images_1.tar.gz
mv train_full_labels.json annotations/ & & mv train_full_images_1/*.jpg train_full_images_0/
mv train_full_images_0 imgs
rm train_full_images_0.tar.gz & & rm train_full_images_1.tar.gz & & rm -rf train_full_images_1
```
2022-04-23 23:57:21 +08:00
- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:
2022-04-18 09:15:42 +08:00
```bash
# Annotations of LSVT test split is not publicly available, split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/lsvt/ignores
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textdrecog/lsvt_converter.py PATH/TO/lsvt --nproc 4
2022-04-18 09:15:42 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
├── lsvt
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
2022-04-23 23:57:21 +08:00
│ └── val_label.jsonl (optional)
2022-04-18 09:15:42 +08:00
```
2022-03-31 15:10:02 +08:00
## FUNSD
2022-03-30 22:07:17 +08:00
- Step1: Download [dataset.zip ](https://guillaumejaume.github.io/FUNSD/dataset.zip ) to `funsd/` .
2022-03-31 15:10:02 +08:00
```bash
mkdir funsd & & cd funsd
2022-03-30 22:07:17 +08:00
2022-03-31 15:10:02 +08:00
# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip
2022-03-30 22:07:17 +08:00
2022-03-31 15:10:02 +08:00
# For images
mv dataset/training_data/images imgs & & mv dataset/testing_data/images/* imgs/
2022-03-30 22:07:17 +08:00
2022-03-31 15:10:02 +08:00
# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training & & mv dataset/testing_data/annotations annotations/test
2022-03-30 22:07:17 +08:00
2022-03-31 15:10:02 +08:00
rm dataset.zip & & rm -rf dataset
```
2022-03-30 22:07:17 +08:00
- Step2: Generate `train_label.txt` and `test_label.txt` and crop images using 4 processes with following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts):
2022-03-31 15:10:02 +08:00
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
2022-03-31 15:10:02 +08:00
```
- After running the above codes, the directory structure
2022-07-13 21:50:39 +08:00
should be as follows:
2022-03-31 15:10:02 +08:00
```text
├── funsd
│ ├── imgs
│ ├── dst_imgs
│ ├── annotations
│ ├── train_label.txt
│ └── test_label.txt
```
2022-03-30 22:07:17 +08:00
2022-03-31 15:10:02 +08:00
## IMGUR
2022-03-30 22:07:17 +08:00
- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5 ](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5 ) in your local repository to enable a **much faster** parallel execution of image download.
```bash
mkdir imgur & & cd imgur
git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
rm -rf IMGUR5K-Handwriting-Dataset
```
- Step2: Generate `train_label.txt` , `val_label.txt` and `test_label.txt` and crop images with the following command:
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/imgur_converter.py PATH/TO/imgur
2022-03-30 22:07:17 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
├── imgur
│ ├── crops
│ ├── train_label.jsonl
│ ├── test_label.jsonl
│ └── val_label.jsonl
```
2022-03-31 15:10:02 +08:00
## KAIST
2022-03-30 22:07:17 +08:00
- Step1: Complete download [KAIST_all.zip ](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database ) to `kaist/` .
```bash
mkdir kaist & & cd kaist
mkdir imgs & & mkdir annotations
# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip
rm KAIST_all.zip
```
- Step2: Extract zips:
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
2022-03-30 22:07:17 +08:00
```
- Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with following command:
```bash
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/kaist/ignores
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
2022-03-30 22:07:17 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
├── kaist
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
│ └── val_label.jsonl (optional)
```
2022-03-31 15:10:02 +08:00
## MTWI
2022-03-30 22:07:17 +08:00
- Step1: Download `mtwi_2018_train.zip` from [homepage ](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us ).
```bash
mkdir mtwi & & cd mtwi
unzip -q mtwi_2018_train.zip
mv image_train imgs & & mv txt_train annotations
rm mtwi_2018_train.zip
```
- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:
```bash
# Annotations of MTWI test split is not publicly available, split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/mtwi/ignores
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
2022-03-30 22:07:17 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
├── mtwi
│ ├── crops
│ ├── train_label.jsonl
2022-03-31 15:10:02 +08:00
│ └── val_label.jsonl (optional)
2022-03-30 22:07:17 +08:00
```
2022-03-31 15:10:02 +08:00
## COCO Text v2
2022-03-30 22:07:17 +08:00
- Step1: Download image [train2014.zip ](http://images.cocodataset.org/zips/train2014.zip ) and annotation [cocotext.v2.zip ](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip ) to `coco_textv2/` .
```bash
mkdir coco_textv2 & & cd coco_textv2
mkdir annotations
# Download COCO Text v2 dataset
wget http://images.cocodataset.org/zips/train2014.zip
wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
unzip -q train2014.zip & & unzip -q cocotext.v2.zip
mv train2014 imgs & & mv cocotext.v2.json annotations/
rm train2014.zip & & rm -rf cocotext.v2.zip
```
- Step2: Generate `train_label.jsonl` and `val_label.jsonl` with the following command:
```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/mtwi/ignores
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/cocotext_converter.py PATH/TO/coco_textv2 --nproc 4
2022-03-30 22:07:17 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
├── coco_textv2
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
│ └── val_label.jsonl
```
2022-03-31 15:10:02 +08:00
## ReCTS
2022-03-30 22:07:17 +08:00
- Step1: Download [ReCTS.zip ](https://datasets.cvc.uab.es/rrc/ReCTS.zip ) to `rects/` from the [homepage ](https://rrc.cvc.uab.es/?ch=12&com=downloads ).
```bash
mkdir rects & & cd rects
# Download ReCTS dataset
# You can also find Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip
mv img imgs & & mv gt_unicode annotations
rm ReCTS.zip -f & & rm -rf gt
```
- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:
```bash
# Annotations of ReCTS test split is not publicly available, split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/rects/ignores
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/rects_converter.py PATH/TO/rects --nproc 4
2022-03-30 22:07:17 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
├── rects
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
│ └── val_label.jsonl (optional)
```
2022-03-30 22:37:38 +08:00
2022-03-31 15:10:02 +08:00
## ILST
2022-03-30 22:37:38 +08:00
- Step1: Download `IIIT-ILST.zip` from [onedrive link ](https://iiitaphyd-my.sharepoint.com/:f:/g/personal/minesh_mathew_research_iiit_ac_in/EtLvCozBgaBIoqglF4M-lHABMgNcCDW9rJYKKWpeSQEElQ?e=zToXZP )
- Step2: Run the following commands
2022-03-31 15:10:02 +08:00
2022-03-30 22:37:38 +08:00
```bash
unzip -q IIIT-ILST.zip & & rm IIIT-ILST.zip
cd IIIT-ILST
# rename files
cd Devanagari && for i in `ls` ; do mv -f $i `echo "devanagari_"$i` ; done && cd ..
cd Malayalam && for i in `ls` ; do mv -f $i `echo "malayalam_"$i` ; done && cd ..
cd Telugu && for i in `ls` ; do mv -f $i `echo "telugu_"$i` ; done && cd ..
# transfer image path
mkdir imgs & & mkdir annotations
mv Malayalam/{*jpg,*jpeg} imgs/ & & mv Malayalam/*xml annotations/
mv Devanagari/*jpg imgs/ & & mv Devanagari/*xml annotations/
mv Telugu/*jpeg imgs/ & & mv Telugu/*xml annotations/
# remove unnecessary files
rm -rf Devanagari & & rm -rf Malayalam & & rm -rf Telugu & & rm -rf README.txt
```
- Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
2022-03-30 22:37:38 +08:00
```
- After running the above codes, the directory structure should be as follows:
2022-03-31 15:10:02 +08:00
2022-03-30 22:37:38 +08:00
```text
├── IIIT-ILST
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
2022-03-31 15:10:02 +08:00
│ └── val_label.jsonl (optional)
2022-03-30 22:37:38 +08:00
```
2022-03-31 15:10:02 +08:00
## VinText
2022-03-30 22:37:38 +08:00
- Step1: Download [vintext.zip ](https://drive.google.com/drive/my-drive ) to `vintext`
2022-03-31 15:10:02 +08:00
2022-03-30 22:37:38 +08:00
```bash
mkdir vintext & & cd vintext
# Download dataset from google drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download& confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download& id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')& id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt
# Extract images and annotations
unzip -q vintext.zip & & rm vintext.zip
mv vietnamese/labels ./ & & mv vietnamese/test_image ./ & & mv vietnamese/train_images ./ & & mv vietnamese/unseen_test_images ./
rm -rf vietnamese
# Rename files
mv labels annotations & & mv test_image test & & mv train_images training & & mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ & & mv test imgs/ & & mv unseen_test imgs/
```
- Step2: Generate `train_label.jsonl` , `test_label.jsonl` , `unseen_test_label.jsonl` , and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts).
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/vintext_converter.py PATH/TO/vietnamese --nproc 4
2022-03-30 22:37:38 +08:00
```
- After running the above codes, the directory structure should be as follows:
2022-03-31 15:10:02 +08:00
2022-03-30 22:37:38 +08:00
```text
├── vintext
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
│ ├── test_label.jsonl
2022-03-31 15:10:02 +08:00
│ └── unseen_test_label.jsonl
2022-03-30 22:37:38 +08:00
```
2022-03-31 15:10:02 +08:00
## BID
2022-03-30 22:37:38 +08:00
- Step1: Download [BID Dataset.zip ](https://drive.google.com/file/d/1Oi88TRcpdjZmJ79WDLb9qFlBNG8q2De6/view )
- Step2: Run the following commands to preprocess the dataset
2022-03-31 15:10:02 +08:00
2022-03-30 22:37:38 +08:00
```bash
# Rename
mv BID\ Dataset.zip BID_Dataset.zip
# Unzip and Rename
unzip -q BID_Dataset.zip & & rm BID_Dataset.zip
mv BID\ Dataset BID
# The BID dataset has a problem of permission, and you may
# add permission for this file
chmod -R 777 BID
cd BID
mkdir imgs & & mkdir annotations
# For images and annotations
mv CNH_Aberta/*in.jpg imgs & & mv CNH_Aberta/*txt annotations & & rm -rf CNH_Aberta
mv CNH_Frente/*in.jpg imgs & & mv CNH_Frente/*txt annotations & & rm -rf CNH_Frente
mv CNH_Verso/*in.jpg imgs & & mv CNH_Verso/*txt annotations & & rm -rf CNH_Verso
mv CPF_Frente/*in.jpg imgs & & mv CPF_Frente/*txt annotations & & rm -rf CPF_Frente
mv CPF_Verso/*in.jpg imgs & & mv CPF_Verso/*txt annotations & & rm -rf CPF_Verso
mv RG_Aberto/*in.jpg imgs & & mv RG_Aberto/*txt annotations & & rm -rf RG_Aberto
mv RG_Frente/*in.jpg imgs & & mv RG_Frente/*txt annotations & & rm -rf RG_Frente
mv RG_Verso/*in.jpg imgs & & mv RG_Verso/*txt annotations & & rm -rf RG_Verso
# Remove unnecessary files
rm -rf desktop.ini
```
- Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if test-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
```bash
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/bid_converter.py dPATH/TO/BID --nproc 4
2022-03-30 22:37:38 +08:00
```
- After running the above codes, the directory structure should be as follows:
2022-03-31 15:10:02 +08:00
2022-03-30 22:37:38 +08:00
```text
├── BID
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
2022-03-31 15:10:02 +08:00
│ └── val_label.jsonl (optional)
2022-03-30 22:37:38 +08:00
```
2022-04-18 09:27:18 +08:00
## RCTW
- Step1: Download `train_images.zip.001` , `train_images.zip.002` , and `train_gts.zip` from the [homepage ](https://rctw.vlrlab.net/dataset.html ), extract the zips to `rctw/imgs` and `rctw/annotations` , respectively.
- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
```bash
# Annotations of RCTW test split is not publicly available, split a validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/rctw_converter.py PATH/TO/rctw --nproc 4
2022-04-18 09:27:18 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
│── rctw
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
│ └── val_label.jsonl (optional)
```
2022-05-05 16:31:36 +08:00
## HierText
- Step1 (optional): Install [AWS CLI ](https://mmocr.readthedocs.io/en/latest/datasets/recog.html#install-aws-cli-optional ).
- Step2: Clone [HierText ](https://github.com/google-research-datasets/hiertext ) repo to get annotations
```bash
mkdir HierText
git clone https://github.com/google-research-datasets/hiertext.git
```
- Step3: Download `train.tgz` , `validation.tgz` from aws
```bash
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
```
- Step4: Process raw data
```bash
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.jsonl.gz
gzip -d annotations/validation.jsonl.gz
# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz
tar -xzvf imgs/validation.tgz
```
- Step5: Generate `train_label.jsonl` and `val_label.jsonl` . HierText includes different levels of annotation, including `paragraph` , `line` , and `word` . Check the original [paper ](https://arxiv.org/pdf/2203.15143.pdf ) for details. E.g. set `--level paragraph` to get paragraph-level annotation. Set `--level line` to get line-level annotation. set `--level word` to get word-level annotation.
```bash
# Collect word annotation from HierText --level word
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
2022-07-14 18:02:17 +08:00
python tools/dataset_converters/textrecog/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
2022-05-05 16:31:36 +08:00
```
- After running the above codes, the directory structure should be as follows:
```text
│── HierText
│ ├── crops
│ ├── ignores
│ ├── train_label.jsonl
│ └── val_label.jsonl
```
2022-05-17 23:59:15 +08:00
## ArT
- Step1: Download `train_images.tar.gz` , and `train_labels.json` from the [homepage ](https://rrc.cvc.uab.es/?ch=14&com=downloads ) to `art/`
```bash
mkdir art & & cd art
mkdir annotations
# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_images.tar.gz
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_labels.json
# Extract
tar -xf train_task2_images.tar.gz
mv train_task2_images crops
mv train_task2_labels.json annotations/
# Remove unnecessary files
rm train_images.tar.gz
```
- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
```bash
# Annotations of ArT test split is not publicly available, split a validation set by adding --val-ratio 0.2
python tools/data/textrecog/art_converter.py PATH/TO/art
```
- After running the above codes, the directory structure should be as follows:
```text
│── art
│ ├── crops
│ ├── train_label.jsonl
│ └── val_label.jsonl (optional)
```