Datasets Preparation

This page lists datasets commonly used for text detection, text recognition, and key information extraction, together with their download links.

Text Detection

The text detection dataset directory is organized as follows.

├── ctw1500
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2015
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2017
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
├── synthtext
│   ├── imgs
│   ├── instances_training.json
│   ├── instances_training.txt
│   └── instances_training.lmdb

| Dataset | Images | Annotation file (training) | Annotation file (validation) | Annotation file (testing) | Note |
| --- | --- | --- | --- | --- | --- |
| CTW1500 | link | instances_training.json | - | instances_test.json | |
| ICDAR2015 | link | instances_training.json | - | instances_test.json | |
| ICDAR2017 | link | instances_training.json | instances_val.json | instances_test.json | |
| Synthtext | link | instances_training.json, instances_training.txt | - | - | |

  • For icdar2015:
    mkdir icdar2015 && cd icdar2015
    mv /path/to/instances_training.json .
    mv /path/to/instances_test.json .
    
    mkdir imgs && cd imgs
    ln -s /path/to/ch4_training_images training
    ln -s /path/to/ch4_test_images test
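
Once the files are in place, a quick check can catch a missing annotation file or a dangling symlink before training starts. The following is a minimal sketch, not part of MMOCR: `check_det_layout` is a hypothetical helper name, and it assumes only the layout shown above (an `imgs` directory plus the annotation files passed as arguments).

```shell
# Sketch: verify that a detection dataset directory matches the layout above.
# check_det_layout is a hypothetical helper, not an MMOCR command.
# Usage: check_det_layout <dataset_dir> <annotation_file>...
check_det_layout() {
    local root="$1"
    shift
    local rc=0
    local entry

    # The imgs directory must exist (it may hold symlinks to the real images).
    if [ ! -d "$root/imgs" ]; then
        echo "missing directory: $root/imgs" >&2
        rc=1
    fi

    # Every annotation file named on the command line must exist.
    for entry in "$@"; do
        if [ ! -f "$root/$entry" ]; then
            echo "missing file: $root/$entry" >&2
            rc=1
        fi
    done

    # Report symlinks directly under imgs whose targets do not exist.
    if [ -d "$root/imgs" ]; then
        for entry in "$root"/imgs/*; do
            if [ -L "$entry" ] && [ ! -e "$entry" ]; then
                echo "broken symlink: $entry" >&2
                rc=1
            fi
        done
    fi

    return "$rc"
}
```

For example, `check_det_layout icdar2015 instances_training.json instances_test.json` prints each problem to stderr and returns a non-zero status if anything is missing.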
    

Text Recognition

The text recognition dataset directory is organized as follows.

├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── mnt
│   ├── SynthText
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── SynthText_Add

| Dataset | Images | Annotation file (training) | Annotation file (test) | Note |
| --- | --- | --- | --- | --- |
| coco_text | link | train_label.txt | - | |
| icdar_2011 | link | train_label.txt | - | |
| icdar_2013 | link | train_label.txt | test_label_1015.txt | |
| icdar_2015 | link | train_label.txt | test_label.txt | |
| IIIT5K | link | train_label.txt | test_label.txt | |
| ct80 | - | - | test_label.txt | |
| svt | link | - | test_label.txt | |
| svtp | - | - | test_label.txt | |
| Synth90k | link | shuffle_labels.txt | - | |
| SynthText | link | shuffle_labels.txt, instances_train.txt | - | |
| SynthAdd | link | label.txt | - | |

  • For icdar_2013:

  • For icdar_2015:

  • For IIIT5K:

  • For svt:

  • For ct80:

  • For svtp:

  • For coco_text:

  • For Syn90k:

    mkdir Syn90k && cd Syn90k
    
    mv /path/to/mjsynth.tar.gz .
    
    tar -xzf mjsynth.tar.gz
    
    mv /path/to/shuffle_labels.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/Syn90k Syn90k
    
  • For SynthText:

    unzip SynthText.zip
    
    cd SynthText
    
    mv /path/to/shuffle_labels.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/SynthText SynthText
    
  • For SynthAdd:

    • Step 1: Download SynthText_Add.zip from this link
    • Step 2: Download label.txt
    • Step 3: Unpack the archive, move the label file next to it, and create the soft link:
    mkdir SynthAdd && cd SynthAdd
    
    mv /path/to/SynthText_Add.zip .
    
    unzip SynthText_Add.zip
    
    mv /path/to/label.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/SynthAdd SynthAdd
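
The Syn90k, SynthText, and SynthAdd procedures above all end with the same soft-link step. Because `ln -s` stores the target path exactly as typed, a relative target can leave a dangling link. The sketch below resolves the source to an absolute path first and refuses to overwrite an existing entry. `link_dataset` is a hypothetical helper, not an MMOCR command, and both arguments are placeholders for your own paths.

```shell
# Sketch: link a prepared dataset directory into the mixture directory.
# link_dataset is a hypothetical helper, not an MMOCR command.
# Usage: link_dataset <dataset_dir> <mixture_dir>
link_dataset() {
    local src="$1"       # directory holding the prepared dataset
    local mixture="$2"   # e.g. /path/to/mmocr/data/mixture
    local name
    local abs_src

    if [ ! -d "$src" ]; then
        echo "not a directory: $src" >&2
        return 1
    fi

    # The link takes the name of the dataset directory (trailing slashes are ignored).
    name="$(basename "$src")"

    # Resolve to an absolute path so the link does not depend on the current directory.
    abs_src="$(cd "$src" >/dev/null && pwd)" || return 1

    mkdir -p "$mixture" || return 1

    # Refuse to replace an existing file, directory, or (possibly broken) link.
    if [ -e "$mixture/$name" ] || [ -L "$mixture/$name" ]; then
        echo "already exists: $mixture/$name" >&2
        return 1
    fi

    ln -s "$abs_src" "$mixture/$name"
}
```

For example, `link_dataset /path/to/SynthAdd /path/to/mmocr/data/mixture` creates `mixture/SynthAdd` pointing at the prepared directory, and a second call with the same arguments fails instead of nesting a new link inside the first.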