mirror of https://github.com/open-mmlab/mmocr.git synced 2025-06-03 21:54:47 +08:00

2022-03-30 14:43:33 +08:00


Text Detection

Overview

The text detection dataset directory is organized as follows.

├── ctw1500
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2015
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2017
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
│       ├── data.mdb
│       └── lock.mdb
├── textocr
│   ├── train
│   ├── instances_training.json
│   └── instances_val.json
├── totaltext
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── CurvedSynText150k
│   ├── syntext_word_eng
│   ├── emcs_imgs
│   └── instances_training.json
├── funsd
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── lv
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_training.json
│   └── instances_val.json

| Dataset | Images | Annotation Files (training) | Annotation Files (validation) | Annotation Files (testing) |
| --- | --- | --- | --- | --- |
| CTW1500 | homepage | - | - | - |
| ICDAR2015 | homepage | instances_training.json | - | instances_test.json |
| ICDAR2017 | homepage | instances_training.json | instances_val.json | - |
| Synthtext | homepage | instances_training.lmdb (data.mdb, lock.mdb) | - | - |
| TextOCR | homepage | - | - | - |
| Totaltext | homepage | - | - | - |
| CurvedSynText150k | homepage \| Part1 \| Part2 | instances_training.json | - | - |
| FUNSD | homepage | - | - | - |
| DeText | homepage | - | - | - |
| NAF | homepage | - | - | - |
| SROIE | homepage | - | - | - |
| Lecture Video DB | homepage | - | - | - |
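
Once some datasets have been prepared, the layout above can be checked with a short script. The following is a minimal sketch, not part of MMOCR: the helper name `missing_entries` and the `EXPECTED` mapping are hypothetical, and the mapping copies only three datasets from the directory tree for brevity.

```python
from pathlib import Path

# Expected entries per dataset, copied from the directory tree above.
# Hypothetical mapping: only three datasets are listed; extend as needed.
EXPECTED = {
    "icdar2015": ["imgs", "instances_test.json", "instances_training.json"],
    "icdar2017": ["imgs", "instances_training.json", "instances_val.json"],
    "totaltext": ["imgs", "instances_test.json", "instances_training.json"],
}


def missing_entries(data_root, expected=EXPECTED):
    """Return {dataset: [missing entries]} for datasets present under data_root."""
    root = Path(data_root)
    report = {}
    for dataset, entries in expected.items():
        dataset_dir = root / dataset
        if not dataset_dir.is_dir():
            continue  # dataset not prepared at all; skip rather than flag it
        missing = [e for e in entries if not (dataset_dir / e).exists()]
        if missing:
            report[dataset] = missing
    return report
```

Calling `missing_entries("data")` returns an empty dict when every prepared dataset contains the entries listed for it.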

Important Note

:::{note} Some images in the CTW1500, ICDAR 2015/2017, and Totaltext datasets carry orientation information in their EXIF data. The default OpenCV backend used by MMCV reads this information and rotates the images accordingly. The gold annotations, however, were made on the raw pixels, so the rotated images no longer match their annotations and become incorrect training examples. To avoid this, use dict(type='LoadImageFromFile', color_type='color_ignore_orientation') in pipelines to override MMCV's default loading behaviour (see DBNet's pipeline config for an example). :::
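
To show where that setting goes, here is a hedged sketch of a config excerpt. Only the first step is taken from the note; the variable name `train_pipeline` and the omitted later steps are assumptions about a typical config, not a copy of any shipped file.

```python
# Hypothetical excerpt of a training pipeline in a config file.
# Only the first step comes from the note above; the remaining steps of a
# real pipeline (annotation loading, augmentation, formatting) are omitted.
train_pipeline = [
    # Ignore the EXIF orientation tag so pixels match the gold annotations.
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    # ... remaining steps unchanged ...
]
```

The same `color_type` would presumably be needed in the test pipeline as well, so that evaluation images are loaded consistently.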

Preparation Steps

ICDAR 2015

Step0: Read Important Note
Step1: Download ch4_training_images.zip, ch4_test_images.zip, ch4_training_localization_transcription_gt.zip, Challenge4_Test_Task1_GT.zip from homepage
Step2:

mkdir icdar2015 && cd icdar2015
mkdir imgs && mkdir annotations
# For images,
mv ch4_training_images imgs/training
mv ch4_test_images imgs/test
# For annotations,
mv ch4_training_localization_transcription_gt annotations/training
mv Challenge4_Test_Task1_GT annotations/test

Step3: Download instances_training.json and instances_test.json and move them to icdar2015
Or, generate instances_training.json and instances_test.json with the following command:

python tools/data/textdet/icdar_converter.py /path/to/icdar2015 -o /path/to/icdar2015 -d icdar2015 --split-list training test
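
Before running the converter, it can help to confirm that every image has a matching annotation file. The sketch below assumes the ICDAR 2015 naming scheme in which the ground truth for `img_1.jpg` is `gt_img_1.txt`; the `gt_` prefix and the helper name `unpaired_images` are assumptions, so adjust them if your extracted files differ.

```python
from pathlib import Path


def unpaired_images(dataset_root, split, gt_prefix="gt_"):
    """List images in imgs/<split> that lack annotations/<split>/<gt_prefix><stem>.txt."""
    root = Path(dataset_root)
    img_dir = root / "imgs" / split
    ann_dir = root / "annotations" / split
    return sorted(
        img.name
        for img in img_dir.iterdir()
        if img.is_file() and not (ann_dir / f"{gt_prefix}{img.stem}.txt").exists()
    )
```

For example, `unpaired_images("/path/to/icdar2015", "training")` returns an empty list when the two folders line up.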

ICDAR 2017

Follow steps similar to those for ICDAR 2015.

CTW1500

Step0: Read Important Note
Step1: Download train_images.zip, test_images.zip, train_labels.zip, test_labels.zip from github

mkdir ctw1500 && cd ctw1500
mkdir imgs && mkdir annotations

# For annotations
cd annotations
wget -O train_labels.zip https://universityofadelaide.box.com/shared/static/jikuazluzyj4lq6umzei7m2ppmt3afyw.zip
wget -O test_labels.zip https://cloudstor.aarnet.edu.au/plus/s/uoeFl0pCN9BOCN5/download
unzip train_labels.zip && mv ctw1500_train_labels training
unzip test_labels.zip -d test
cd ..
# For images
cd imgs
wget -O train_images.zip https://universityofadelaide.box.com/shared/static/py5uwlfyyytbb2pxzq9czvu6fuqbjdh8.zip
wget -O test_images.zip https://universityofadelaide.box.com/shared/static/t4w48ofnqkdw7jyc4t11nsukoeqk9c3d.zip
unzip train_images.zip && mv train_images training
unzip test_images.zip && mv test_images test

Step2: Generate instances_training.json and instances_test.json with the following command:

python tools/data/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test

SynthText

Download data.mdb and lock.mdb to synthtext/instances_training.lmdb/.

TextOCR

Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to textocr/.

mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train

Step2: Generate instances_training.json and instances_val.json with the following command:

python tools/data/textdet/textocr_converter.py /path/to/textocr

Totaltext

Step0: Read Important Note
Step1: Download totaltext.zip from github dataset and groundtruth_text.zip from github Groundtruth (our totaltext_converter.py supports ground truth in both .mat and .txt formats).

mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations

# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test

# For annotations
unzip groundtruth_text.zip
cd Groundtruth
mv Polygon/Train ../annotations/training
mv Polygon/Test ../annotations/test

Step2: Generate instances_training.json and instances_test.json with the following command:

python tools/data/textdet/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test

CurvedSynText150k

Step1: Download syntext1.zip and syntext2.zip to CurvedSynText150k/.
Step2:

unzip -q syntext1.zip
mv train.json train1.json
unzip images.zip
rm images.zip

unzip -q syntext2.zip
mv train.json train2.json
unzip images.zip
rm images.zip

Step3: Download instances_training.json to CurvedSynText150k/
Or, generate instances_training.json with the following command:

python tools/data/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4

FUNSD

Step1: Download dataset.zip to funsd/.

mkdir funsd && cd funsd

# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip

# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/

# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test

rm dataset.zip && rm -rf dataset
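
The commands above merge the training and testing images into a single imgs folder, so a file name that appears in both sets would be silently overwritten by `mv`. As a precaution, a check like the following could be run after unzipping and before the `mv` commands. This is a sketch; the helper name `name_collisions` is hypothetical.

```python
from pathlib import Path


def name_collisions(dir_a, dir_b):
    """Return the sorted file names that exist in both directories."""
    names_a = {p.name for p in Path(dir_a).iterdir() if p.is_file()}
    names_b = {p.name for p in Path(dir_b).iterdir() if p.is_file()}
    return sorted(names_a & names_b)
```

For FUNSD this would be called as `name_collisions("dataset/training_data/images", "dataset/testing_data/images")`; an empty list means the merge is safe.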

Step2: Generate instances_training.json and instances_test.json with the following command:

python tools/data/textdet/funsd_converter.py PATH/TO/funsd --nproc 4

DeText

Step1: Download ch9_training_images.zip, ch9_training_localization_transcription_gt.zip, ch9_validation_images.zip, and ch9_validation_localization_transcription_gt.zip from Task 3: End to End on the homepage.

mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val

# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate

# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val

# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
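
The chained `unzip` and `rm` commands above stop at the first failure, which can leave the folder half-prepared. An equivalent extraction step could be sketched in Python using only the standard library. The archive names and target folders are taken from the commands above; the helper `extract_all` and its `remove_zips` flag are hypothetical.

```python
import zipfile
from pathlib import Path

# Archive name -> extraction target, mirroring the unzip commands above.
ARCHIVES = {
    "ch9_training_images.zip": "imgs/training",
    "ch9_training_localization_transcription_gt.zip": "annotations/training",
    "ch9_validation_images.zip": "imgs/val",
    "ch9_validation_localization_transcription_gt.zip": "annotations/val",
}


def extract_all(root, archives=ARCHIVES, remove_zips=False):
    """Extract each archive found under root; return the names that were missing."""
    root = Path(root)
    missing = []
    for name, target in archives.items():
        zip_path = root / name
        if not zip_path.is_file():
            missing.append(name)  # report instead of aborting the whole run
            continue
        dest = root / target
        dest.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(dest)
        if remove_zips:
            zip_path.unlink()
    return missing
```

Running `extract_all(".", remove_zips=True)` inside detext/ mirrors the shell commands, and the returned list shows which downloads still need to be fetched.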

Step2: Generate instances_training.json and instances_val.json with the following command:
```
python tools/data/textdet/detext_converter.py PATH/TO/detext --nproc 4
```

After running the commands above, the directory structure should be as follows:

├── detext
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json

NAF

Step1: Download labeled_images.tar.gz to naf/.

mkdir naf && cd naf

# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz

# For images
mkdir annotations && mv labeled_images imgs

# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/

rm -rf NAF_dataset && rm labeled_images.tar.gz

Step2: Generate instances_training.json, instances_val.json, and instances_test.json with the following command:
```
python tools/data/textdet/naf_converter.py PATH/TO/naf --nproc 4
```

After running the commands above, the directory structure should be as follows:

├── naf
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_val.json
│   └── instances_training.json

SROIE

Step1: Download 0325updated.task1train(626p).zip, task1&2_test(361p).zip, and text.task1&2-test（361p).zip from homepage to sroie/

Step2:

mkdir sroie && cd sroie
mkdir imgs && mkdir annotations && mkdir imgs/training

# Warning: The zip files downloaded from Google Drive and BaiduYun Cloud may
# have different names. If extracting or moving the files fails, adjust the
# file names in the following commands accordingly.
unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test（361p\).zip

# For images
mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test

# For annotations
mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test（361p\)/ annotations/test

rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test（361p\).zip
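
One of the file names above contains a full-width parenthesis (（), which is easy to mistype in a shell and may differ between download sources. A possible workaround, sketched below, is to rename such entries to use ASCII parentheses and then use ASCII names in the commands. The helper `normalize_parentheses` is hypothetical and not part of MMOCR.

```python
from pathlib import Path

# Map full-width parentheses (U+FF08, U+FF09) to their ASCII counterparts.
FULLWIDTH_TO_ASCII = str.maketrans({"\uff08": "(", "\uff09": ")"})


def normalize_parentheses(directory):
    """Rename entries in directory so their names use only ASCII parentheses.

    Returns a list of (old_name, new_name) pairs for the entries renamed.
    """
    renamed = []
    for path in sorted(Path(directory).iterdir()):
        new_name = path.name.translate(FULLWIDTH_TO_ASCII)
        if new_name == path.name:
            continue
        target = path.with_name(new_name)
        if target.exists():
            continue  # never overwrite an existing entry
        path.rename(target)
        renamed.append((path.name, new_name))
    return renamed
```

If this is run in sroie/ before extraction, the later commands would need `text.task1\&2-test\(361p\)` in place of the full-width spelling.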

Step3: Generate instances_training.json and instances_test.json with the following command:
```
python tools/data/textdet/sroie_converter.py PATH/TO/sroie --nproc 4
```

After running the commands above, the directory structure should be as follows:

├── sroie
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json

Lecture Video DB

Step1: Download IIIT-CVid.zip to lv/.

mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

mv IIIT-CVid/Frames imgs

rm IIIT-CVid.zip

Step2: Generate instances_training.json, instances_val.json, and instances_test.json with the following command:

python tools/data/textdet/lv_converter.py PATH/TO/lv --nproc 4
