mirror of https://github.com/open-mmlab/mmocr.git synced 2025-06-03 21:54:47 +08:00

* add DeText Converter

* Update tools/data/textrecog/detext_converter.py

Co-authored-by: Tong Gao <gaotongxiao@gmail.com>

* update doc; support jsonl; fix docstrings

* update mkdir func

* fix bug

* update doc; do not filter for test val

* move directory tree

* fix indentation

Co-authored-by: Tong Gao <gaotongxiao@gmail.com>

2022-03-30 14:43:33 +08:00


Text Recognition

Overview

The structure of the text recognition dataset directory is organized as follows.

├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── alphanumeric_labels.txt
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add
│   ├── TextOCR
│   │   ├── image
│   │   ├── train_label.txt
│   │   ├── val_label.txt
│   ├── Totaltext
│   │   ├── imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   ├── OpenVINO
│   │   ├── image_1
│   │   ├── image_2
│   │   ├── image_5
│   │   ├── image_f
│   │   ├── image_val
│   │   ├── train_1_label.txt
│   │   ├── train_2_label.txt
│   │   ├── train_5_label.txt
│   │   ├── train_f_label.txt
│   │   ├── val_label.txt
│   ├── funsd
│   │   ├── imgs
│   │   ├── dst_imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   ├── lv
│   │   ├── Crops
│   │   ├── train_label.jsonl
│   │   ├── test_label.jsonl
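
Once a few datasets are in place, the layout above can be checked programmatically. The following is a minimal sketch, not part of MMOCR: `missing_entries` and the `EXPECTED` mapping are hypothetical names, and the mapping only covers a small subset of the tree for illustration.

```python
from pathlib import Path

# Hypothetical subset of the tree above: dataset directory -> expected entries.
EXPECTED = {
    "icdar_2015": [
        "train_label.txt",
        "test_label.txt",
        "ch4_training_word_images_gt",
        "ch4_test_word_images_gt",
    ],
    "svt": ["test_label.txt", "image"],
    "ct80": ["test_label.txt", "image"],
}


def missing_entries(root, expected=EXPECTED):
    """Return the relative paths from `expected` that do not exist under `root`."""
    root = Path(root)
    missing = []
    for dataset, entries in expected.items():
        for entry in entries:
            # Path.exists() follows symlinks, so soft-linked datasets are
            # accepted and dangling links are reported as missing.
            if not (root / dataset / entry).exists():
                missing.append(f"{dataset}/{entry}")
    return missing
```

Calling `missing_entries("data/mixture")` from the MMOCR root would list whatever is still absent; extend the mapping to match the datasets you actually use.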

Dataset	images	annotation file (training)	annotation file (test)
coco_text	homepage	train_label.txt	-
icdar_2011	homepage	train_label.txt	-
icdar_2013	homepage	train_label.txt	test_label_1015.txt
icdar_2015	homepage	train_label.txt	test_label.txt
IIIT5K	homepage	train_label.txt	test_label.txt
ct80	homepage	-	test_label.txt
svt	homepage	-	test_label.txt
svtp	unofficial homepage[1]	-	test_label.txt
MJSynth (Syn90k)	homepage	shuffle_labels.txt | label.txt	-
SynthText (Synth800k)	homepage	alphanumeric_labels.txt | shuffle_labels.txt | instances_train.txt | label.txt	-
SynthAdd	SynthText_Add.zip (code:627x)	label.txt	-
TextOCR	homepage	-	-
Totaltext	homepage	-	-
OpenVINO	Open Images	annotations	annotations
FUNSD	homepage	-	-
DeText	homepage	-	-
NAF	homepage	-	-
SROIE	homepage	-	-
Lecture Video DB	homepage	-	-

[1] Since the official homepage is no longer available, we provide an alternative link for quick reference. However, we do not guarantee the correctness of the dataset.

Preparation Steps

ICDAR 2013

Step1: Download Challenge2_Test_Task3_Images.zip and Challenge2_Training_Task3_Images_GT.zip from homepage
Step2: Download test_label_1015.txt and train_label.txt

ICDAR 2015

Step1: Download ch4_training_word_images_gt.zip and ch4_test_word_images_gt.zip from homepage
Step2: Download train_label.txt and test_label.txt

IIIT5K

Step1: Download IIIT5K-Word_V3.0.tar.gz from homepage
Step2: Download train_label.txt and test_label.txt

svt

Step1: Download svt.zip from homepage
Step2: Download test_label.txt
Step3:

python tools/data/textrecog/svt_converter.py <download_svt_dir_path>

ct80

Step1: Download test_label.txt

svtp

Step1: Download test_label.txt

coco_text

Step1: Download from homepage
Step2: Download train_label.txt

MJSynth (Syn90k)

Step1: Download mjsynth.tar.gz from homepage
Step2: Download label.txt (8,919,273 annotations) and shuffle_labels.txt (2,400,000 randomly sampled annotations). Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
Step3:

mkdir Syn90k && cd Syn90k

mv /path/to/mjsynth.tar.gz .

tar -xzf mjsynth.tar.gz

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/Syn90k Syn90k
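
To confirm that the label file in place is the one you intended, its size can be compared with the annotation counts quoted in Step2. This is a rough sketch under the assumption that each non-empty line holds exactly one annotation; `count_annotations` and `check_counts` are hypothetical helpers, not MMOCR functions.

```python
from pathlib import Path

# Annotation counts quoted in Step2 above for the two Syn90k label files.
EXPECTED_COUNTS = {
    "label.txt": 8_919_273,
    "shuffle_labels.txt": 2_400_000,
}


def count_annotations(label_path):
    """Count non-empty lines, assuming one annotation per line."""
    with open(label_path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def check_counts(dataset_dir, expected=EXPECTED_COUNTS):
    """Return {file name: (found, expected)} for the label files that exist."""
    report = {}
    for name, want in expected.items():
        path = Path(dataset_dir) / name
        if path.is_file():
            report[name] = (count_annotations(path), want)
    return report
```

For example, `check_counts("data/mixture/Syn90k")` returns a pair of found and expected counts per file, so a mismatch is easy to spot before training starts.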

SynthText (Synth800k)

Step1: Download SynthText.zip from homepage
Step2: According to your actual needs, download the most appropriate one from the following options: label.txt (7,266,686 annotations), shuffle_labels.txt (2,400,000 randomly sampled annotations), alphanumeric_labels.txt (7,239,272 annotations with alphanumeric characters only) and instances_train.txt (7,266,686 character-level annotations).

:::{warning}
Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
:::

Step3:

mkdir SynthText && cd SynthText
mv /path/to/SynthText.zip .
unzip SynthText.zip
mv SynthText synthtext

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .
mv /path/to/alphanumeric_labels.txt .
mv /path/to/instances_train.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthText SynthText

Step4: Generate cropped images and labels:

cd /path/to/mmocr

python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8

SynthAdd

Step1: Download SynthText_Add.zip from SynthAdd (code:627x)
Step2: Download label.txt
Step3:

mkdir SynthAdd && cd SynthAdd

mv /path/to/SynthText_Add.zip .

unzip SynthText_Add.zip

mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/SynthAdd SynthAdd

:::{tip} To convert label file with txt format to lmdb format,

python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>

For example,

python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb

:::
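
Before converting a large txt label file, it can be worth checking that its lines are well formed. The sketch below assumes that each line consists of an image path, a single space, and the transcription; this layout is an assumption for illustration, and `parse_label_line` is a hypothetical helper rather than an MMOCR API.

```python
def parse_label_line(line):
    """Split one label line into (image path, transcription).

    Assumes the format "<image path> <text>", split at the first space.
    """
    filename, _, text = line.rstrip("\r\n").partition(" ")
    if not filename or not text:
        raise ValueError(f"malformed label line: {line!r}")
    return filename, text


def find_malformed(label_path):
    """Return the 1-based line numbers that do not match the assumed format."""
    bad = []
    with open(label_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                parse_label_line(line)
            except ValueError:
                bad.append(lineno)
    return bad
```

An empty list from `find_malformed("data/mixture/Syn90k/label.txt")` suggests the file matches the assumed layout.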

TextOCR

Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to textocr/.

mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train

Step2: Generate train_label.txt, val_label.txt and crop images using 4 processes with the following command:

python tools/data/textrecog/textocr_converter.py /path/to/textocr 4

Totaltext

Step1: Download totaltext.zip from github dataset and groundtruth_text.zip from github Groundtruth (Our totaltext_converter.py supports groundtruth with both .mat and .txt format).

mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations

# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test

# For annotations
unzip groundtruth_text.zip
cd Groundtruth
mv Polygon/Train ../annotations/training
mv Polygon/Test ../annotations/test

Step2: Generate cropped images, train_label.txt and test_label.txt with the following command (the cropped images will be saved to data/totaltext/dst_imgs/):

python tools/data/textrecog/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test

OpenVINO

Step0: Install awscli.
Step1: Download Open Images subsets train_1, train_2, train_5, train_f, and validation to openvino/.

mkdir openvino && cd openvino

# Download Open Images subsets
for s in 1 2 5 f; do
  aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
done
aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .

# Download annotations
for s in 1 2 5 f; do
  wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
done
wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json

# Extract images
mkdir -p openimages_v5/val
for s in 1 2 5 f; do
  tar zxf train_${s}.tar.gz -C openimages_v5
done
tar zxf validation.tar.gz -C openimages_v5/val

Step2: Generate train_{1,2,5,f}_label.txt, val_label.txt and crop images using 4 processes with the following command:

python tools/data/textrecog/openvino_converter.py /path/to/openvino 4

FUNSD

Step1: Download dataset.zip to funsd/.

mkdir funsd && cd funsd

# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip

# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/

# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test

rm dataset.zip && rm -rf dataset

Step2: Generate train_label.txt and test_label.txt and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts):

python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
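
To give an intuition for what filtering vertical text involves, here is a small sketch that classifies crops by aspect ratio. The rule used here (height more than twice the width) is purely an assumption for illustration; the criterion actually applied by the converter may differ, so check the script itself if the exact behaviour matters.

```python
def is_vertical(width, height, ratio=2.0):
    """Treat a crop as vertical when it is more than `ratio` times taller than wide.

    The default ratio is an illustrative assumption, not a documented value.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return height / width > ratio


# Hypothetical (width, height) sizes of cropped word images.
crops = {"a.png": (120, 32), "b.png": (20, 90)}
kept = [name for name, (w, h) in crops.items() if not is_vertical(w, h)]
ignored = [name for name, (w, h) in crops.items() if is_vertical(w, h)]
print("kept:", kept)
print("ignored:", ignored)
```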

DeText

Step1: Download ch9_training_images.zip, ch9_training_localization_transcription_gt.zip, ch9_validation_images.zip, and ch9_validation_localization_transcription_gt.zip from Task 3: End to End on the homepage.

mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val

# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate

# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val

# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
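
On systems without `unzip`, the extraction and cleanup steps above could be reproduced with Python's standard `zipfile` module. This is a sketch only: `extract_all` is a hypothetical helper, and the archive-to-directory mapping simply mirrors the shell commands above.

```python
import zipfile
from pathlib import Path

# Archive name -> target directory, mirroring the unzip commands above.
ARCHIVES = {
    "ch9_training_images.zip": "imgs/training",
    "ch9_training_localization_transcription_gt.zip": "annotations/training",
    "ch9_validation_images.zip": "imgs/val",
    "ch9_validation_localization_transcription_gt.zip": "annotations/val",
}


def extract_all(root, archives=ARCHIVES, remove=False):
    """Extract each archive found in `root` into its target subdirectory."""
    root = Path(root)
    for name, target in archives.items():
        archive = root / name
        dest = root / target
        dest.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
        if remove:
            # Equivalent to the "Remove zips" step.
            archive.unlink()
```

Running `extract_all("detext", remove=True)` after the downloads would leave the same `imgs/` and `annotations/` layout as the shell version.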

Step2: Generate train_label.jsonl and test_label.jsonl with the following command:

# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/detext/ignores
python tools/data/textrecog/detext_converter.py PATH/TO/detext --nproc 4

After running the above commands, the directory structure should be as follows:

├── detext
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   ├── test_label.jsonl
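
The generated .jsonl label files hold one JSON object per line and can be inspected with the standard library. The sketch below is an illustration only: `read_jsonl` is a hypothetical helper, and it makes no assumption about which keys each record contains.

```python
import json


def read_jsonl(path):
    """Load a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                # Report the offending line so a broken file is easy to fix.
                raise ValueError(f"{path}:{lineno}: invalid JSON ({err.msg})") from err
    return records
```

For instance, `read_jsonl("detext/train_label.jsonl")[:3]` shows the first few records, which is a quick way to see the fields the converter wrote.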

NAF

Step1: Download labeled_images.tar.gz to naf/.

mkdir naf && cd naf

# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz

# For images
mkdir annotations && mv labeled_images imgs

# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/

rm -rf NAF_dataset && rm labeled_images.tar.gz

Step2: Generate train_label.txt, val_label.txt, and test_label.txt with the following command:

# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/naf/ignores
python tools/data/textrecog/naf_converter.py PATH/TO/naf --nproc 4

After running the above commands, the directory structure should be as follows:

├── naf
│   ├── crops
│   ├── train_label.txt
│   ├── val_label.txt
│   ├── test_label.txt

SROIE

Step1: Download 0325updated.task1train(626p).zip, task1&2_test(361p).zip, and text.task1&2-test（361p).zip from homepage to sroie/

Step2:

mkdir sroie && cd sroie
mkdir imgs && mkdir annotations && mkdir imgs/training

# Warning: The zip files downloaded from Google Drive and BaiduYun Cloud may
# differ. If extracting or moving the files fails, adjust the file names in
# the following commands accordingly.
unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test（361p\).zip

# For images
mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test

# For annotations
mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test（361p\)/ annotations/test

rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test（361p\).zip

Step3: Generate train_label.jsonl and test_label.jsonl and crop images using 4 processes with the following command:
```
python tools/data/textrecog/sroie_converter.py PATH/TO/sroie --nproc 4
```

After running the above commands, the directory structure should be as follows:

├── sroie
│   ├── crops
│   ├── train_label.jsonl
│   ├── test_label.jsonl

Lecture Video DB

The LV dataset already provides cropped images and the corresponding annotations.

Step1: Download IIIT-CVid.zip to lv/.

mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

# For image
mv IIIT-CVid/Crops ./

# For annotation
mv IIIT-CVid/train.txt train_label.txt && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_label.txt

rm IIIT-CVid.zip

Step2: Generate train_label.jsonl, val_label.jsonl, and test_label.jsonl with the following command:

python tools/data/textrecog/lv_converter.py PATH/TO/lv
