Text Recognition
Overview
| Dataset | images | annotation file (training) | annotation file (test) |
| :------ | :----- | :------------------------- | :--------------------- |
| coco_text | homepage | train_label.txt | - |
| ICDAR2011 | homepage | - | - |
| ICDAR2013 | homepage | - | - |
| icdar_2015 | homepage | train_label.txt | test_label.txt |
| IIIT5K | homepage | train_label.txt | test_label.txt |
| ct80 | homepage | - | test_label.txt |
| svt | homepage | - | test_label.txt |
| svtp | unofficial homepage[1] | - | test_label.txt |
| MJSynth (Syn90k) | homepage | shuffle_labels.txt, label.txt | - |
| SynthText (Synth800k) | homepage | alphanumeric_labels.txt, shuffle_labels.txt, instances_train.txt, label.txt | - |
| SynthAdd | SynthText_Add.zip (code:627x) | label.txt | - |
| TextOCR | homepage | - | - |
| Totaltext | homepage | - | - |
| OpenVINO | Open Images | annotations | annotations |
| FUNSD | homepage | - | - |
| DeText | homepage | - | - |
| NAF | homepage | - | - |
| SROIE | homepage | - | - |
| Lecture Video DB | homepage | - | - |
| LSVT | homepage | - | - |
| IMGUR | homepage | - | - |
| KAIST | homepage | - | - |
| MTWI | homepage | - | - |
| COCO Text v2 | homepage | - | - |
| ReCTS | homepage | - | - |
| IIIT-ILST | homepage | - | - |
| VinText | homepage | - | - |
| BID | homepage | - | - |
| RCTW | homepage | - | - |
| HierText | homepage | - | - |
| ArT | homepage | - | - |
[1] Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.
Install AWS CLI (optional)
- Since some datasets require the AWS CLI to be installed in advance, we provide a quick installation guide here:

```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# or install to a custom location
./aws/install -i /usr/local/aws-cli -b /usr/local/bin
aws configure
# this command will require you to input keys, you can skip them except
# for the Default region name
# AWS Access Key ID [None]:
# AWS Secret Access Key [None]:
# Default region name [None]: us-east-1
# Default output format [None]
```
ICDAR 2011 (Born-Digital Images)
- Step1: Download `Challenge1_Training_Task3_Images_GT.zip`, `Challenge1_Test_Task3_Images.zip`, and `Challenge1_Test_Task3_GT.txt` from homepage Task 1.3: Word Recognition (2013 edition).

```bash
mkdir icdar2011 && cd icdar2011
mkdir annotations

# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate

# For images
mkdir crops
unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train
unzip -q Challenge1_Test_Task3_Images.zip -d crops/test

# For annotations
mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
```

- Step2: Convert the original annotations to `train_label.jsonl` and `test_label.jsonl` with the following command:

```bash
python tools/data/textrecog/ic11_converter.py PATH/TO/icdar2011
```

- After running the above codes, the directory structure should be as follows:

```text
├── icdar2011
│   ├── crops
│   ├── train_label.jsonl
│   └── test_label.jsonl
```
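If you want to sanity-check a converted `.jsonl` label file, the short sketch below counts its entries and reports crops that are missing on disk. It assumes each line is a JSON object with `filename` and `text` keys and that the paths are relative to the dataset root; this matches the layout the recognition converters typically produce, but adjust the key names and root if your version differs.

```python
import json
import os.path as osp

def check_jsonl(label_path, data_root):
    """Count entries in a .jsonl label file and report missing crop images."""
    n_total, n_missing = 0, 0
    with open(label_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            item = json.loads(line)
            n_total += 1
            # 'filename' and 'text' are assumed key names; adjust if needed
            if not osp.exists(osp.join(data_root, item['filename'])):
                n_missing += 1
    print(f'{label_path}: {n_total} samples, {n_missing} missing images')

check_jsonl('icdar2011/train_label.jsonl', 'icdar2011')
check_jsonl('icdar2011/test_label.jsonl', 'icdar2011')
```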
ICDAR 2013 (Focused Scene Text)
- Step1: Download `Challenge2_Training_Task3_Images_GT.zip`, `Challenge2_Test_Task3_Images.zip`, and `Challenge2_Test_Task3_GT.txt` from homepage Task 2.3: Word Recognition (2013 edition).

```bash
mkdir icdar2013 && cd icdar2013
mkdir annotations

# Download ICDAR 2013
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_GT.txt --no-check-certificate

# For images
mkdir crops
unzip -q Challenge2_Training_Task3_Images_GT.zip -d crops/train
unzip -q Challenge2_Test_Task3_Images.zip -d crops/test

# For annotations
mv Challenge2_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge2_Train_Task3_GT.txt

rm Challenge2_Training_Task3_Images_GT.zip && rm Challenge2_Test_Task3_Images.zip
```

- Step2: Generate `train_label.jsonl` and `test_label.jsonl` with the following command:

```bash
python tools/data/textrecog/ic13_converter.py PATH/TO/icdar2013
```

- After running the above codes, the directory structure should be as follows:

```text
├── icdar2013
│   ├── crops
│   ├── train_label.jsonl
│   └── test_label.jsonl
```
ICDAR 2013 [Deprecated]
- Step1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from homepage
- Step2: Download test_label_1015.txt and train_label.txt
- After downloading, the directory structure should be as follows:

```text
├── icdar_2013
│   ├── train_label.txt
│   ├── test_label_1015.txt
│   ├── test_label_1095.txt
│   ├── Challenge2_Training_Task3_Images_GT
│   └── Challenge2_Test_Task3_Images
```
ICDAR 2015
- Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from homepage
- Step2: Download train_label.txt and test_label.txt
- After downloading, the directory structure should be as follows:

```text
├── icdar_2015
│   ├── train_label.txt
│   ├── test_label.txt
│   ├── ch4_training_word_images_gt
│   └── ch4_test_word_images_gt
```
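The `*_label.txt` files used by several datasets in this document are plain text with one sample per line. A minimal parsing sketch is shown below, assuming the common `<image path> <transcription>` layout separated by a single space; adjust the split if your annotation file follows a different convention.

```python
def load_txt_labels(label_path):
    """Parse a txt label file with one '<img_path> <text>' pair per line."""
    samples = []
    with open(label_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # split only on the first space so transcriptions containing spaces survive
            img_path, text = line.split(' ', 1)
            samples.append((img_path, text))
    return samples

samples = load_txt_labels('icdar_2015/train_label.txt')
print(len(samples), 'samples; first sample:', samples[0])
```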
IIIT5K
- Step1: Download `IIIT5K-Word_V3.0.tar.gz` from homepage
- Step2: Download train_label.txt and test_label.txt
- After downloading, the directory structure should be as follows:

```text
├── IIIT5K
│   ├── train_label.txt
│   ├── test_label.txt
│   ├── train
│   └── test
```
svt
- Step1: Download `svt.zip` from homepage
- Step2: Download test_label.txt
- Step3:

```bash
python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
```

- After running the above codes, the directory structure should be as follows:

```text
├── svt
│   ├── test_label.txt
│   └── image
```
ct80
- Step1: Download test_label.txt
- Step2: Download timage.tar.gz
- Step3:

```bash
mkdir ct80 && cd ct80
mv /path/to/test_label.txt .
mv /path/to/timage.tar.gz .
tar -xvf timage.tar.gz

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/ct80 ct80
```

- After running the above codes, the directory structure should be as follows:

```text
├── ct80
│   ├── test_label.txt
│   └── timage
```
svtp
- Step1: Download test_label.txt
- After downloading, the directory structure should be as follows:

```text
├── svtp
│   ├── test_label.txt
│   └── image
```
coco_text
- Step1: Download images from homepage
- Step2: Download train_label.txt
- After downloading, the directory structure should be as follows:

```text
├── coco_text
│   ├── train_label.txt
│   └── train_words
```
MJSynth (Syn90k)
- Step1: Download `mjsynth.tar.gz` from homepage
- Step2: Download label.txt (8,919,273 annotations) and shuffle_labels.txt (2,400,000 randomly sampled annotations). Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
- Step3:

```bash
mkdir Syn90k && cd Syn90k

mv /path/to/mjsynth.tar.gz .
tar -xzf mjsynth.tar.gz

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/Syn90k Syn90k

# Convert 'txt' format annos to 'lmdb' (optional)
cd /path/to/mmocr
python tools/data/utils/lmdb_converter.py data/mixture/Syn90k/label.txt data/mixture/Syn90k/label.lmdb --label-only
```

- After running the above codes, the directory structure should be as follows:

```text
├── Syn90k
│   ├── shuffle_labels.txt
│   ├── label.txt
│   ├── label.lmdb (optional)
│   └── mnt
```
SynthText (Synth800k)
- Step1: Download `SynthText.zip` from homepage
- Step2: According to your actual needs, download the most appropriate one from the following options: label.txt (7,266,686 annotations), shuffle_labels.txt (2,400,000 randomly sampled annotations), alphanumeric_labels.txt (7,239,272 annotations with alphanumeric characters only) and instances_train.txt (7,266,686 character-level annotations). Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
- Step3:

```bash
mkdir SynthText && cd SynthText
mv /path/to/SynthText.zip .
unzip SynthText.zip
mv SynthText synthtext

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .
mv /path/to/alphanumeric_labels.txt .
mv /path/to/instances_train.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthText SynthText
```

- Step4: Generate cropped images and labels (see the sketch after the directory tree below if you want to inspect `gt.mat` first):

```bash
cd /path/to/mmocr

python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8

# Convert 'txt' format annos to 'lmdb' (optional)
cd /path/to/mmocr
python tools/data/utils/lmdb_converter.py data/mixture/SynthText/label.txt data/mixture/SynthText/label.lmdb --label-only
```

- After running the above codes, the directory structure should be as follows:

```text
├── SynthText
│   ├── alphanumeric_labels.txt
│   ├── shuffle_labels.txt
│   ├── instances_train.txt
│   ├── label.txt
│   ├── label.lmdb (optional)
│   └── synthtext
```
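If you are unsure what the raw SynthText annotations look like before cropping, the exploratory sketch below loads `gt.mat` and prints the fields it is commonly documented to contain (`imnames`, `wordBB`, `txt`). The file is large (roughly 1.8 GB), so loading it takes a while and a fair amount of memory; this is only for inspection and is not part of the conversion pipeline.

```python
from scipy import io as sio

# gt.mat is large; loading it may take a few minutes and several GB of RAM
gt = sio.loadmat('data/mixture/SynthText/gt.mat')

imnames = gt['imnames'][0]  # image paths, one entry per synthetic image
word_bb = gt['wordBB'][0]   # word-level bounding boxes, one array per image
texts = gt['txt'][0]        # text strings, one array per image

print('number of images:', len(imnames))
print('first image:', imnames[0][0])
print('word boxes of first image, shape:', word_bb[0].shape)  # typically (2, 4, n_words)
print('texts of first image:', [t.strip() for t in texts[0]])
```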
SynthAdd
- Step1: Download `SynthText_Add.zip` from SynthAdd (code:627x)
- Step2: Download label.txt
- Step3:

```bash
mkdir SynthAdd && cd SynthAdd

mv /path/to/SynthText_Add.zip .
unzip SynthText_Add.zip

mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthAdd SynthAdd

# Convert 'txt' format annos to 'lmdb' (optional)
cd /path/to/mmocr
python tools/data/utils/lmdb_converter.py data/mixture/SynthAdd/label.txt data/mixture/SynthAdd/label.lmdb --label-only
```

- After running the above codes, the directory structure should be as follows:

```text
├── SynthAdd
│   ├── label.txt
│   ├── label.lmdb (optional)
│   └── SynthText_Add
```
To convert a label file from `txt` format to `lmdb` format, run:
```bash
python tools/data/utils/lmdb_converter.py <txt_label_path> <lmdb_label_path> --label-only
```
For example,
```bash
python tools/data/utils/lmdb_converter.py data/mixture/Syn90k/label.txt data/mixture/Syn90k/label.lmdb --label-only
```
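To quickly confirm what the generated lmdb contains, the sketch below opens it read-only and prints the entry count plus the first few key/value pairs. It does not assume any particular key naming scheme; it simply walks the database with a cursor (requires the `lmdb` Python package).

```python
import lmdb

def peek_lmdb(lmdb_path, n=5):
    """Print the entry count and the first n key/value pairs of an lmdb database."""
    env = lmdb.open(lmdb_path, readonly=True, lock=False)
    with env.begin() as txn:
        print('entries:', txn.stat()['entries'])
        for i, (key, value) in enumerate(txn.cursor()):
            if i >= n:
                break
            print(key, '->', value[:80])
    env.close()

peek_lmdb('data/mixture/Syn90k/label.lmdb')
```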
TextOCR
- Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to `textocr/`.

```bash
mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
```

- Step2: Generate `train_label.txt`, `val_label.txt` and crop images using 4 processes with the following command:

```bash
python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── TextOCR
│   ├── image
│   ├── train_label.txt
│   └── val_label.txt
```
Totaltext
- Step1: Download `totaltext.zip` from github dataset and `groundtruth_text.zip` or `TT_new_train_GT.zip` (if you prefer to use the latest version of training annotations) from github Groundtruth (our totaltext_converter.py supports groundtruth in both .mat and .txt formats).

```bash
mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations

# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test

# For legacy training and test annotations
unzip groundtruth_text.zip
mv Groundtruth/Polygon/Train annotations/training
mv Groundtruth/Polygon/Test annotations/test

# Using the latest training annotations
# WARNING: Delete legacy train annotations before running the following command.
unzip TT_new_train_GT.zip
mv Train annotations/training
```

- Step2: Generate cropped images, `train_label.txt` and `test_label.txt` with the following command (the cropped images will be saved to `data/totaltext/dst_imgs/`):

```bash
python tools/data/textrecog/totaltext_converter.py /path/to/totaltext
```

- After running the above codes, the directory structure should be as follows:

```text
├── totaltext
│   ├── dst_imgs
│   ├── train_label.txt
│   └── test_label.txt
```
OpenVINO
- Step1 (optional): Install AWS CLI.
- Step2: Download Open Images subsets `train_1`, `train_2`, `train_5`, `train_f`, and `validation` to `openvino/`.

```bash
mkdir openvino && cd openvino

# Download Open Images subsets
for s in 1 2 5 f; do
  aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
done
aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .

# Download annotations
for s in 1 2 5 f; do
  wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
done
wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json

# Extract images
mkdir -p openimages_v5/val
for s in 1 2 5 f; do
  tar zxf train_${s}.tar.gz -C openimages_v5
done
tar zxf validation.tar.gz -C openimages_v5/val
```

- Step3: Generate `train_{1,2,5,f}_label.txt`, `val_label.txt` and crop images using 4 processes with the following command:

```bash
python tools/data/textrecog/openvino_converter.py /path/to/openvino 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── OpenVINO
│   ├── image_1
│   ├── image_2
│   ├── image_5
│   ├── image_f
│   ├── image_val
│   ├── train_1_label.txt
│   ├── train_2_label.txt
│   ├── train_5_label.txt
│   ├── train_f_label.txt
│   └── val_label.txt
```
DeText
- Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from Task 3: End to End on the homepage.

```bash
mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val

# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate

# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val

# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
```

- Step2: Generate `train_label.jsonl` and `test_label.jsonl` with the following command:

```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/detext/ignores
python tools/data/textrecog/detext_converter.py PATH/TO/detext --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── detext
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── test_label.jsonl
```
NAF
- Step1: Download labeled_images.tar.gz to `naf/`.

```bash
mkdir naf && cd naf

# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz

# For images
mkdir annotations && mv labeled_images imgs

# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/

rm -rf NAF_dataset && rm labeled_images.tar.gz
```

- Step2: Generate `train_label.txt`, `val_label.txt`, and `test_label.txt` with the following command:

```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/naf/ignores
python tools/data/textrecog/naf_converter.py PATH/TO/naf --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── naf
│   ├── crops
│   ├── train_label.txt
│   ├── val_label.txt
│   └── test_label.txt
```
SROIE
- Step1: Download `0325updated.task1train(626p).zip`, `task1&2_test(361p).zip`, and `text.task1&2-test(361p).zip` from homepage to `sroie/`
- Step2:

```bash
mkdir sroie && cd sroie
mkdir imgs && mkdir annotations && mkdir imgs/training

# Warning: The zip files downloaded from Google Drive and BaiduYun Cloud may
# be different, so revise the following commands to the correct file names
# if you encounter errors while extracting and moving the files.
unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test\(361p\).zip

# For images
mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test

# For annotations
mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test\(361p\)/ annotations/test

rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test\(361p\).zip
```

- Step3: Generate `train_label.jsonl` and `test_label.jsonl` and crop images using 4 processes with the following command:

```bash
python tools/data/textrecog/sroie_converter.py PATH/TO/sroie --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── sroie
│   ├── crops
│   ├── train_label.jsonl
│   └── test_label.jsonl
```
Lecture Video DB
The LV dataset already provides cropped images and the corresponding annotations.

- Step1: Download IIIT-CVid.zip to `lv/`.

```bash
mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

# For images
mv IIIT-CVid/Crops ./

# For annotations
mv IIIT-CVid/train.txt train_label.txt && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_label.txt

rm IIIT-CVid.zip
```

- Step2: Generate `train_label.jsonl`, `val_label.jsonl`, and `test_label.jsonl` with the following command:

```bash
python tools/data/textrecog/lv_converter.py PATH/TO/lv
```

- After running the above codes, the directory structure should be as follows:

```text
├── lv
│   ├── Crops
│   ├── train_label.jsonl
│   ├── val_label.jsonl
│   └── test_label.jsonl
```
LSVT
- Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to `lsvt/`.

```bash
mkdir lsvt && cd lsvt

# Download LSVT dataset
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json

mkdir annotations
tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
mv train_full_images_0 imgs

rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
```

- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:

```bash
# Annotations of the LSVT test split are not publicly available, split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/lsvt/ignores
python tools/data/textrecog/lsvt_converter.py PATH/TO/lsvt --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── lsvt
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
FUNSD
- Step1: Download dataset.zip to `funsd/`.

```bash
mkdir funsd && cd funsd

# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip

# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/

# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test

rm dataset.zip && rm -rf dataset
```

- Step2: Generate `train_label.txt` and `test_label.txt` and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts):

```bash
python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── funsd
│   ├── imgs
│   ├── dst_imgs
│   ├── annotations
│   ├── train_label.txt
│   └── test_label.txt
```
IMGUR
- Step1: Run `download_imgur5k.py` to download images. You can merge PR#5 in your local repository to enable a much faster parallel download.

```bash
mkdir imgur && cd imgur

git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git

# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs

# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations

rm -rf IMGUR5K-Handwriting-Dataset
```

- Step2: Generate `train_label.jsonl`, `val_label.jsonl` and `test_label.jsonl` and crop images with the following command:

```bash
python tools/data/textrecog/imgur_converter.py PATH/TO/imgur
```

- After running the above codes, the directory structure should be as follows:

```text
├── imgur
│   ├── crops
│   ├── train_label.jsonl
│   ├── test_label.jsonl
│   └── val_label.jsonl
```
KAIST
- Step1: Download KAIST_all.zip to `kaist/`.

```bash
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations

# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip

rm KAIST_all.zip
```

- Step2: Extract zips:

```bash
python tools/data/common/extract_kaist.py PATH/TO/kaist
```

- Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:

```bash
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/kaist/ignores
python tools/data/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── kaist
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
MTWI
- Step1: Download `mtwi_2018_train.zip` from homepage.

```bash
mkdir mtwi && cd mtwi

unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations

rm mtwi_2018_train.zip
```

- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:

```bash
# Annotations of the MTWI test split are not publicly available, split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/mtwi/ignores
python tools/data/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── mtwi
│   ├── crops
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
COCO Text v2
- Step1: Download image train2014.zip and annotation cocotext.v2.zip to `coco_textv2/`.

```bash
mkdir coco_textv2 && cd coco_textv2
mkdir annotations

# Download COCO Text v2 dataset
wget http://images.cocodataset.org/zips/train2014.zip
wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
unzip -q train2014.zip && unzip -q cocotext.v2.zip

mv train2014 imgs && mv cocotext.v2.json annotations/

rm train2014.zip && rm -rf cocotext.v2.zip
```

- Step2: Generate `train_label.jsonl` and `val_label.jsonl` with the following command:

```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/coco_textv2/ignores
python tools/data/textrecog/cocotext_converter.py PATH/TO/coco_textv2 --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── coco_textv2
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl
```
ReCTS
- Step1: Download ReCTS.zip to `rects/` from the homepage.

```bash
mkdir rects && cd rects

# Download ReCTS dataset
# You can also find the Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip

mv img imgs && mv gt_unicode annotations

rm ReCTS.zip -f && rm -rf gt
```

- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:

```bash
# Annotations of the ReCTS test split are not publicly available, split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/rects/ignores
python tools/data/textrecog/rects_converter.py PATH/TO/rects --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── rects
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
ILST
- Step1: Download `IIIT-ILST.zip` from the onedrive link
- Step2: Run the following commands

```bash
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
cd IIIT-ILST

# rename files
cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..

# transfer image paths
mkdir imgs && mkdir annotations
mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/

# remove unnecessary files
rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
```

- Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set. A manual splitting sketch is also shown after the directory tree below.

```bash
python tools/data/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── IIIT-ILST
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
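The converters above handle the validation split for you via `--val-ratio`. If you ever need to split an existing `train_label.jsonl` manually (for example, with a converter that lacks the option), a minimal random hold-out sketch is shown below; the 0.2 ratio and the output file names are only examples.

```python
import random

def split_jsonl(label_path, train_out, val_out, val_ratio=0.2, seed=0):
    """Randomly hold out val_ratio of the lines in a .jsonl label file."""
    with open(label_path, 'r', encoding='utf-8') as f:
        lines = [line for line in f if line.strip()]
    random.Random(seed).shuffle(lines)
    n_val = int(len(lines) * val_ratio)
    with open(val_out, 'w', encoding='utf-8') as f:
        f.writelines(lines[:n_val])
    with open(train_out, 'w', encoding='utf-8') as f:
        f.writelines(lines[n_val:])
    print(f'{len(lines) - n_val} train / {n_val} val samples written')

split_jsonl('IIIT-ILST/train_label.jsonl',
            'IIIT-ILST/train_split.jsonl',
            'IIIT-ILST/val_split.jsonl',
            val_ratio=0.2)
```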
VinText
- Step1: Download vintext.zip to `vintext`

```bash
mkdir vintext && cd vintext

# Download dataset from google drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt

# Extract images and annotations
unzip -q vintext.zip && rm vintext.zip
mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
rm -rf vietnamese

# Rename files
mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
```

- Step2: Generate `train_label.jsonl`, `test_label.jsonl`, `unseen_test_label.jsonl`, and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts).

```bash
python tools/data/textrecog/vintext_converter.py PATH/TO/vintext --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── vintext
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   ├── test_label.jsonl
│   └── unseen_test_label.jsonl
```
BID
- Step1: Download BID Dataset.zip
- Step2: Run the following commands to preprocess the dataset

```bash
# Rename
mv BID\ Dataset.zip BID_Dataset.zip

# Unzip and Rename
unzip -q BID_Dataset.zip && rm BID_Dataset.zip
mv BID\ Dataset BID

# The BID dataset ships with restrictive file permissions; grant access so the
# files can be read and processed
chmod -R 777 BID
cd BID
mkdir imgs && mkdir annotations

# For images and annotations
mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso

# Remove unnecessary files
rm -rf desktop.ini
```

- Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```bash
python tools/data/textrecog/bid_converter.py PATH/TO/BID --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── BID
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
RCTW
- Step1: Download `train_images.zip.001`, `train_images.zip.002`, and `train_gts.zip` from the homepage, and extract the zips to `rctw/imgs` and `rctw/annotations`, respectively.
- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```bash
# Annotations of the RCTW test split are not publicly available, split a validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores
python tools/data/textrecog/rctw_converter.py PATH/TO/rctw --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── rctw
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
HierText
- Step1 (optional): Install AWS CLI.
- Step2: Clone the HierText repo to get the annotations

```bash
mkdir HierText
git clone https://github.com/google-research-datasets/hiertext.git
```

- Step3: Download `train.tgz` and `validation.tgz` from aws

```bash
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
```

- Step4: Process raw data

```bash
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.jsonl.gz
gzip -d annotations/validation.jsonl.gz

# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz
tar -xzvf imgs/validation.tgz
```

- Step5: Generate `train_label.jsonl` and `val_label.jsonl`. HierText includes different levels of annotation, namely `paragraph`, `line`, and `word`; check the original paper for details. E.g. set `--level paragraph` to get paragraph-level annotations, `--level line` to get line-level annotations, or `--level word` to get word-level annotations.

```bash
# Collect word annotation from HierText --level word
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
python tools/data/textrecog/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── HierText
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl
```
ArT
- Step1: Download `train_task2_images.tar.gz` and `train_task2_labels.json` from the homepage to `art/`

```bash
mkdir art && cd art
mkdir annotations

# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_images.tar.gz
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_labels.json

# Extract
tar -xf train_task2_images.tar.gz
mv train_task2_images crops
mv train_task2_labels.json annotations/

# Remove unnecessary files
rm train_task2_images.tar.gz
```

- Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```bash
# Annotations of the ArT test split are not publicly available, split a validation set by adding --val-ratio 0.2
python tools/data/textrecog/art_converter.py PATH/TO/art
```

- After running the above codes, the directory structure should be as follows:

```text
├── art
│   ├── crops
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```