mmocr/docs/en/datasets/recog.md
Xinyu Wang ee2c3cfd46
[Feature] Add DeText Converter (#818)
* add DeText Converter

* Update tools/data/textrecog/detext_converter.py

Co-authored-by: Tong Gao <gaotongxiao@gmail.com>

* update doc; support jsonl; fix docstrings

* update mkdir func

* fix bug

* update doc; do not filter for test val

* move directory tree

* fix indentation

Co-authored-by: Tong Gao <gaotongxiao@gmail.com>
2022-03-30 14:43:33 +08:00

31 KiB
Raw Blame History

Text Recognition

Overview

The structure of the text recognition dataset directory is organized as follows.

├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── III5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── alphanumeric_labels.txt
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add
│   ├── TextOCR
│   │   ├── image
│   │   ├── train_label.txt
│   │   ├── val_label.txt
│   ├── Totaltext
│   │   ├── imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   ├── OpenVINO
│   │   ├── image_1
│   │   ├── image_2
│   │   ├── image_5
│   │   ├── image_f
│   │   ├── image_val
│   │   ├── train_1_label.txt
│   │   ├── train_2_label.txt
│   │   ├── train_5_label.txt
│   │   ├── train_f_label.txt
│   │   ├── val_label.txt
│   ├── funsd
│   │   ├── imgs
│   │   ├── dst_imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   ├── lv
│   │   ├── Crops
│   │   ├── train_label.jsonl
│   │   ├── test_label.jsonl
Dataset images annotation file annotation file
training test
coco_text homepage train_label.txt -
icdar_2011 homepage train_label.txt -
icdar_2013 homepage train_label.txt test_label_1015.txt
icdar_2015 homepage train_label.txt test_label.txt
IIIT5K homepage train_label.txt test_label.txt
ct80 homepage - test_label.txt
svt homepage - test_label.txt
svtp unofficial homepage[1] - test_label.txt
MJSynth (Syn90k) homepage shuffle_labels.txt | label.txt -
SynthText (Synth800k) homepage alphanumeric_labels.txt |shuffle_labels.txt | instances_train.txt | label.txt -
SynthAdd SynthText_Add.zip (code:627x) label.txt -
TextOCR homepage - -
Totaltext homepage - -
OpenVINO Open Images annotations annotations
FUNSD homepage - -
DeText homepage - -
NAF homepage - -
SROIE homepage - -
Lecture Video DB homepage - -

(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.

Preparation Steps

ICDAR 2013

ICDAR 2015

IIIT5K

svt

python tools/data/textrecog/svt_converter.py <download_svt_dir_path>

ct80

svtp

coco_text

MJSynth (Syn90k)

  • Step1: Download mjsynth.tar.gz from homepage
  • Step2: Download label.txt (8,919,273 annotations) and shuffle_labels.txt (2,400,000 randomly sampled annotations). Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.
  • Step3:
mkdir Syn90k && cd Syn90k

mv /path/to/mjsynth.tar.gz .

tar -xzf mjsynth.tar.gz

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/Syn90k Syn90k

SynthText (Synth800k)

  • Step1: Download SynthText.zip from homepage

  • Step2: According to your actual needs, download the most appropriate one from the following options: label.txt (7,266,686 annotations), shuffle_labels.txt (2,400,000 randomly sampled annotations), alphanumeric_labels.txt (7,239,272 annotations with alphanumeric characters only) and instances_train.txt (7,266,686 character-level annotations).

:::{warning} Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo. :::

  • Step3:
mkdir SynthText && cd SynthText
mv /path/to/SynthText.zip .
unzip SynthText.zip
mv SynthText synthtext

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .
mv /path/to/alphanumeric_labels.txt .
mv /path/to/instances_train.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthText SynthText
  • Step4: Generate cropped images and labels:
cd /path/to/mmocr

python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8

SynthAdd

  • Step1: Download SynthText_Add.zip from SynthAdd (code:627x))
  • Step2: Download label.txt
  • Step3:
mkdir SynthAdd && cd SynthAdd

mv /path/to/SynthText_Add.zip .

unzip SynthText_Add.zip

mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/SynthAdd SynthAdd

:::{tip} To convert label file with txt format to lmdb format,

python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>

For example,

python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb

:::

TextOCR

mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
  • Step2: Generate train_label.txt, val_label.txt and crop images using 4 processes with the following command:
python tools/data/textrecog/textocr_converter.py /path/to/textocr 4

Totaltext

  • Step1: Download totaltext.zip from github dataset and groundtruth_text.zip from github Groundtruth (Our totaltext_converter.py supports groundtruth with both .mat and .txt format).
mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations

# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test

# For annotations
unzip groundtruth_text.zip
cd Groundtruth
mv Polygon/Train ../annotations/training
mv Polygon/Test ../annotations/test

  • Step2: Generate cropped images, train_label.txt and test_label.txt with the following command (the cropped images will be saved to data/totaltext/dst_imgs/):
python tools/data/textrecog/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test

OpenVINO

  • Step0: Install awscli.
  • Step1: Download Open Images subsets train_1, train_2, train_5, train_f, and validation to openvino/.
mkdir openvino && cd openvino

# Download Open Images subsets
for s in 1 2 5 f; do
  aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
done
aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .

# Download annotations
for s in 1 2 5 f; do
  wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
done
wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json

# Extract images
mkdir -p openimages_v5/val
for s in 1 2 5 f; do
  tar zxf train_${s}.tar.gz -C openimages_v5
done
tar zxf validation.tar.gz -C openimages_v5/val
  • Step2: Generate train_{1,2,5,f}_label.txt, val_label.txt and crop images using 4 processes with the following command:
python tools/data/textrecog/openvino_converter.py /path/to/openvino 4

FUNSD

mkdir funsd && cd funsd

# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip

# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/

# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test

rm dataset.zip && rm -rf dataset
  • Step2: Generate train_label.txt and test_label.txt and crop images using 4 processes with following command (add --preserve-vertical if you wish to preserve the images containing vertical texts):
python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4

DeText

  • Step1: Download ch9_training_images.zip, ch9_training_localization_transcription_gt.zip, ch9_validation_images.zip, and ch9_validation_localization_transcription_gt.zip from Task 3: End to End on the homepage.

    mkdir detext && cd detext
    mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val
    
    # Download DeText
    wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate
    
    # Extract images and annotations
    unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val
    
    # Remove zips
    rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
    
  • Step2: Generate instances_training.json and instances_val.json with following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/detext/ignores
    python tools/data/textrecog/detext_converter.py PATH/TO/detext --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── detext
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   ├── test_label.jsonl
    

NAF

  • Step1: Download labeled_images.tar.gz to naf/.

    mkdir naf && cd naf
    
    # Download NAF dataset
    wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
    tar -zxf labeled_images.tar.gz
    
    # For images
    mkdir annotations && mv labeled_images imgs
    
    # For annotations
    git clone https://github.com/herobd/NAF_dataset.git
    mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/
    
    rm -rf NAF_dataset && rm labeled_images.tar.gz
    
  • Step2: Generate train_label.txt, val_label.txt, and test_label.txt with following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/naf/ignores
    python tools/data/textrecog/naf_converter.py PATH/TO/naf --nproc 4
    
    
  • After running the above codes, the directory structure should be as follows:

    ├── naf
    │   ├── crops
    │   ├── train_label.txt
    │   ├── val_label.txt
    │   ├── test_label.txt
    

SROIE

  • Step1: Step1: Download 0325updated.task1train(626p).zip, task1&2_test(361p).zip, and text.task1&2-test361p).zip from homepage to sroie/

  • Step2:

    mkdir sroie && cd sroie
    mkdir imgs && mkdir annotations && mkdir imgs/training
    
    # Warnninig: The zip files downloaded from Google Drive and BaiduYun Cloud may
    # be different, the user should revise the following commands to the correct
    # file name if encounter with errors while extracting and move the files.
    unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test361p\).zip
    
    # For images
    mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test
    
    # For annotations
    mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-testги361p\)/ annotations/test
    
    rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test361p\).zip
    
  • Step3: Generate train_label.jsonl and test_label.jsonl and crop images using 4 processes with the following command:

    python tools/data/textrecog/sroie_converter.py PATH/TO/sroie --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── sroie
    │   ├── crops
    │   ├── train_label.jsonl
    │   ├── test_label.jsonl
    

Lecture Video DB

The LV dataset has already provided cropped images and the corresponding annotations

mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

# For image
mv IIIT-CVid/Crops ./

# For annotation
mv IIIT-CVid/train.txt train_label.txt && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_label.txt

rm IIIT-CVid.zip
  • Step2: Generate train_label.jsonl, val.jsonl, and test.jsonl with following command:
python tools/data/textdreog/lv_converter.py PATH/TO/lv