# Text Recognition

## Overview

The structure of the text recognition dataset directory is organized as follows.
```text
├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── alphanumeric_labels.txt
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add
│   ├── TextOCR
│   │   ├── image
│   │   ├── train_label.txt
│   │   ├── val_label.txt
│   ├── Totaltext
│   │   ├── imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   ├── OpenVINO
│   │   ├── image_1
│   │   ├── image_2
│   │   ├── image_5
│   │   ├── image_f
│   │   ├── image_val
│   │   ├── train_1_label.txt
│   │   ├── train_2_label.txt
│   │   ├── train_5_label.txt
│   │   ├── train_f_label.txt
│   │   ├── val_label.txt
│   ├── funsd
│   │   ├── imgs
│   │   ├── dst_imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   ├── lv
│   │   ├── Crops
│   │   ├── train_label.jsonl
│   │   ├── test_label.jsonl
```
| Dataset | images | annotation file (training) | annotation file (test) |
|---|---|---|---|
| coco_text | homepage | train_label.txt | - |
| icdar_2011 | homepage | train_label.txt | - |
| icdar_2013 | homepage | train_label.txt | test_label_1015.txt |
| icdar_2015 | homepage | train_label.txt | test_label.txt |
| IIIT5K | homepage | train_label.txt | test_label.txt |
| ct80 | homepage | - | test_label.txt |
| svt | homepage | - | test_label.txt |
| svtp | unofficial homepage (*) | - | test_label.txt |
| MJSynth (Syn90k) | homepage | shuffle_labels.txt, label.txt | - |
| SynthText (Synth800k) | homepage | alphanumeric_labels.txt, shuffle_labels.txt, instances_train.txt, label.txt | - |
| SynthAdd | SynthText_Add.zip (code:627x) | label.txt | - |
| TextOCR | homepage | - | - |
| Totaltext | homepage | - | - |
| OpenVINO | Open Images | annotations | annotations |
| FUNSD | homepage | - | - |
| DeText | homepage | - | - |
| NAF | homepage | - | - |
| SROIE | homepage | - | - |
| Lecture Video DB | homepage | - | - |
(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.
## Preparation Steps

### ICDAR 2013

- Step1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from homepage
- Step2: Download test_label_1015.txt and train_label.txt
### ICDAR 2015

- Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from homepage
- Step2: Download train_label.txt and test_label.txt
### IIIT5K

- Step1: Download `IIIT5K-Word_V3.0.tar.gz` from homepage
- Step2: Download train_label.txt and test_label.txt
### svt

- Step1: Download `svt.zip` from homepage
- Step2: Download test_label.txt
- Step3:

```bash
python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
```
### ct80

- Step1: Download test_label.txt
### svtp

- Step1: Download test_label.txt
### coco_text

- Step1: Download from homepage
- Step2: Download train_label.txt
### MJSynth (Syn90k)

- Step1: Download `mjsynth.tar.gz` from homepage
- Step2: Download label.txt (8,919,273 annotations) and shuffle_labels.txt (2,400,000 randomly sampled annotations). Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
- Step3:

```bash
mkdir Syn90k && cd Syn90k

mv /path/to/mjsynth.tar.gz .
tar -xzf mjsynth.tar.gz

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/Syn90k Syn90k
```
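The `.txt` annotation files above are plain text. A minimal reader sketch, assuming one `<image_path> <transcription>` pair per line split on the first space (this schema is an assumption for illustration; inspect the files you downloaded for the exact format):

```python
from pathlib import Path


def load_txt_labels(label_path):
    """Parse a plain-text label file into (image_path, text) pairs.

    Assumes one '<image_path> <transcription>' pair per line, split on
    the first space; this format is an assumption for illustration.
    """
    samples = []
    for line in Path(label_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        img_path, _, text = line.partition(" ")
        samples.append((img_path, text))
    return samples
```

As a sanity check, `len(load_txt_labels(...))` on `label.txt` should roughly match the advertised annotation count.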
### SynthText (Synth800k)

- Step1: Download `SynthText.zip` from homepage
- Step2: According to your actual needs, download the most appropriate one from the following options: label.txt (7,266,686 annotations), shuffle_labels.txt (2,400,000 randomly sampled annotations), alphanumeric_labels.txt (7,239,272 annotations with alphanumeric characters only) and instances_train.txt (7,266,686 character-level annotations).

:::{warning}
Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
:::

- Step3:

```bash
mkdir SynthText && cd SynthText
mv /path/to/SynthText.zip .
unzip SynthText.zip
mv SynthText synthtext

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .
mv /path/to/alphanumeric_labels.txt .
mv /path/to/instances_train.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthText SynthText
```

- Step4: Generate cropped images and labels:

```bash
cd /path/to/mmocr

python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8
```
### SynthAdd

- Step1: Download `SynthText_Add.zip` (code:627x)
- Step2:

```bash
mkdir SynthAdd && cd SynthAdd
mv /path/to/SynthText_Add.zip .
unzip SynthText_Add.zip
mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthAdd SynthAdd
```
:::{tip}
To convert a label file from `txt` format to `lmdb` format:

```bash
python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>
```

For example:

```bash
python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
```
:::
### TextOCR

- Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to `textocr/`.

```bash
mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
```

- Step2: Generate `train_label.txt`, `val_label.txt` and crop images using 4 processes with the following command:

```bash
python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
```
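The "4 processes" argument fans the per-word cropping work out over a worker pool. A minimal sketch of that pattern with `multiprocessing.Pool` — the `crop_one` worker and its `(image_path, (x, y, w, h))` task format are hypothetical; the real converter opens images and writes patches to disk:

```python
from multiprocessing import Pool


def crop_one(task):
    # Hypothetical worker: turn an (image_path, (x, y, w, h)) task into
    # the pixel rectangle to cut out. A real converter would open the
    # image, crop it, and save the patch here.
    img_path, (x, y, w, h) = task
    return (img_path, (x, y, x + w, y + h))


def crop_all(tasks, nproc=4):
    # Distribute the crop tasks over nproc worker processes.
    with Pool(nproc) as pool:
        return pool.map(crop_one, tasks)
```

`pool.map` preserves task order, so the returned crops line up with the input annotation list.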
### Totaltext

- Step1: Download `totaltext.zip` from github dataset and `groundtruth_text.zip` from github Groundtruth (our `totaltext_converter.py` supports groundtruth in both .mat and .txt format).

```bash
mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations

# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test

# For annotations
unzip groundtruth_text.zip
cd Groundtruth
mv Polygon/Train ../annotations/training
mv Polygon/Test ../annotations/test
```

- Step2: Generate cropped images, `train_label.txt` and `test_label.txt` with the following command (the cropped images will be saved to `data/totaltext/dst_imgs/`):

```bash
python tools/data/textrecog/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test
```
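Cropping word patches from polygon groundtruth like Totaltext's typically starts by taking the axis-aligned bounding box of each polygon. A small sketch of that first step (illustrative only; the converter's actual cropping may warp or pad differently):

```python
def polygon_bbox(points):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a
    polygon given as [(x, y), ...] vertex pairs.

    A common first step when cropping word patches from polygon
    groundtruth; illustrative only.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))
```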
### OpenVINO

- Step0: Install awscli.
- Step1: Download Open Images subsets `train_1`, `train_2`, `train_5`, `train_f`, and `validation` to `openvino/`.

```bash
mkdir openvino && cd openvino

# Download Open Images subsets
for s in 1 2 5 f; do
  aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
done
aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .

# Download annotations
for s in 1 2 5 f; do
  wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
done
wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json

# Extract images
mkdir -p openimages_v5/val
for s in 1 2 5 f; do
  tar zxf train_${s}.tar.gz -C openimages_v5
done
tar zxf validation.tar.gz -C openimages_v5/val
```

- Step2: Generate `train_{1,2,5,f}_label.txt`, `val_label.txt` and crop images using 4 processes with the following command:

```bash
python tools/data/textrecog/openvino_converter.py /path/to/openvino 4
```
### FUNSD

- Step1: Download dataset.zip to `funsd/`.

```bash
mkdir funsd && cd funsd

# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip

# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/

# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test

rm dataset.zip && rm -rf dataset
```

- Step2: Generate `train_label.txt` and `test_label.txt` and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts):

```bash
python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
```
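To illustrate what `--preserve-vertical` toggles: crops whose height clearly exceeds their width are treated as vertical text and filtered out unless the flag is set. The aspect-ratio rule below is a plausible heuristic, not necessarily the converter's exact criterion:

```python
def is_vertical(width, height, ratio=2.0):
    # Heuristic: a crop counts as "vertical text" when it is at least
    # `ratio` times taller than it is wide. Illustrative threshold only.
    return height >= ratio * width


def split_crops(boxes, preserve_vertical=False):
    # Mimic the documented behaviour: without --preserve-vertical,
    # vertical crops are filtered out (e.g. stored under an ignores/ dir).
    kept, ignored = [], []
    for w, h in boxes:
        if is_vertical(w, h) and not preserve_vertical:
            ignored.append((w, h))
        else:
            kept.append((w, h))
    return kept, ignored
```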
### DeText

- Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from Task 3: End to End on the homepage.

```bash
mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val

# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate

# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val

# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
```
- Step2: Generate `train_label.jsonl` and `test_label.jsonl` with the following command:

```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/detext/ignores
python tools/data/textrecog/detext_converter.py PATH/TO/detext --nproc 4
```
- After running the above codes, the directory structure should be as follows:

```text
├── detext
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   ├── test_label.jsonl
```
### NAF

- Step1: Download labeled_images.tar.gz to `naf/`.

```bash
mkdir naf && cd naf

# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz

# For images
mkdir annotations && mv labeled_images imgs

# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/

rm -rf NAF_dataset && rm labeled_images.tar.gz
```
- Step2: Generate `train_label.txt`, `val_label.txt`, and `test_label.txt` with the following command:

```bash
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/naf/ignores
python tools/data/textrecog/naf_converter.py PATH/TO/naf --nproc 4
```
- After running the above codes, the directory structure should be as follows:

```text
├── naf
│   ├── crops
│   ├── train_label.txt
│   ├── val_label.txt
│   ├── test_label.txt
```
### SROIE

- Step1: Download `0325updated.task1train(626p).zip`, `task1&2_test(361p).zip`, and `text.task1&2-test(361p).zip` from homepage to `sroie/`
- Step2:

```bash
mkdir sroie && cd sroie
mkdir imgs && mkdir annotations && mkdir imgs/training

# Warning: the zip files downloaded from Google Drive and BaiduYun Cloud may
# be different; revise the following commands to the correct file names if you
# encounter errors while extracting and moving the files.
unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test\(361p\).zip

# For images
mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test

# For annotations
mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test\(361p\)/ annotations/test

rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test\(361p\).zip
```
- Step3: Generate `train_label.jsonl` and `test_label.jsonl` and crop images using 4 processes with the following command:

```bash
python tools/data/textrecog/sroie_converter.py PATH/TO/sroie --nproc 4
```

- After running the above codes, the directory structure should be as follows:

```text
├── sroie
│   ├── crops
│   ├── train_label.jsonl
│   ├── test_label.jsonl
```
### Lecture Video DB

The LV dataset has already provided cropped images and the corresponding annotations.

- Step1: Download IIIT-CVid.zip to `lv/`.

```bash
mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

# For images
mv IIIT-CVid/Crops ./

# For annotations
mv IIIT-CVid/train.txt train_label.txt && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_label.txt

rm IIIT-CVid.zip
```
- Step2: Generate `train_label.jsonl`, `val_label.jsonl`, and `test_label.jsonl` with the following command:

```bash
python tools/data/textrecog/lv_converter.py PATH/TO/lv
```
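Several of the converters above emit `.jsonl` label files, i.e. one JSON object per line. A minimal reader sketch; the `{"filename": ..., "text": ...}` key names used in the test are an assumption for illustration, so inspect the generated files for the exact fields:

```python
import json


def load_jsonl_labels(path):
    """Read a .jsonl label file: one JSON object per non-empty line."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                samples.append(json.loads(line))
    return samples
```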