# Text Detection

## Overview

| Dataset | Images | | Annotation Files | | |
| :---------------: | :-------------------------------------------: | :-------------------------------------: | :------------------------------------------------------: | :--------------------------------------: | :-: |
| | | training | validation | testing | |
| CTW1500 | [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) | - | - | - | |
| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - | - | |
| ICDAR2013 | [homepage](https://rrc.cvc.uab.es/?ch=2) | - | - | - | |
| ICDAR2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) | |
| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - | |
| SynthText | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | instances_training.lmdb ([data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb), [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb)) | - | - | |
| TextOCR | [homepage](https://textvqa.org/textocr/dataset) | - | - | - | |
| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | - | - | - | |
| CurvedSynText150k | [homepage](https://github.com/aim-uofa/AdelaiDet/blob/master/datasets/README.md) \| [Part1](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) \| [Part2](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) | [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) | - | - | |
| FUNSD | [homepage](https://guillaumejaume.github.io/FUNSD/) | - | - | - | |
| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - | - | |
| NAF | [homepage](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) | - | - | - | |
| SROIE | [homepage](https://rrc.cvc.uab.es/?ch=13) | - | - | - | |
| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - | |
| LSVT | [homepage](https://rrc.cvc.uab.es/?ch=16) | - | - | - | |
| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - | |
| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - | |
| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - | |
| COCO Text v2 | [homepage](https://bgshih.github.io/cocotext/) | - | - | - | |
| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | - | |
| IIIT-ILST | [homepage](http://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-ilst) | - | - | - | |
| VinText | [homepage](https://github.com/VinAIResearch/dict-guided) | - | - | - | |
| BID | [homepage](https://github.com/ricardobnjunior/Brazilian-Identity-Document-Dataset) | - | - | - | |
| RCTW | [homepage](https://rctw.vlrlab.net/index.html) | - | - | - | |
| HierText | [homepage](https://github.com/google-research-datasets/hiertext) | - | - | - | |
| ArT | [homepage](https://rrc.cvc.uab.es/?ch=14) | - | - | - | |

### Install AWS CLI (optional)

- Since some datasets require the [AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) to be installed in advance, we provide a quick installation guide here:

```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Or, install to a custom location instead:
# ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
aws configure
# this command will ask you to input several keys; you can skip all of them
# except the Default region name
# AWS Access Key ID [None]:
# AWS Secret Access Key [None]:
# Default region name [None]: us-east-1
# Default output format [None]
```

## Important Note

```{note}
**For users who want to train models on the CTW1500, ICDAR 2015/2017, and Totaltext datasets,** there might be some images containing orientation info in EXIF data. The default OpenCV backend used in MMCV would read them and apply the rotation on the images. However, their gold annotations are made on the raw pixels, and such inconsistency results in false examples in the training set. Therefore, users should use `dict(type='LoadImageFromFile', color_type='color_ignore_orientation')` in pipelines to change MMCV's default loading behaviour. (see [DBNet's pipeline config](https://github.com/open-mmlab/mmocr/blob/main/configs/_base_/det_pipelines/dbnet_pipeline.py) for example)
```
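
In practice this is a one-line change at the top of the data pipeline. A minimal sketch (the remaining transforms depend on the model and are omitted here):

```python
train_pipeline = [
    # Ignore the EXIF orientation flag so the loaded pixels match the
    # raw-pixel coordinates used by the gold annotations
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    # ... annotation loading, augmentation, and formatting transforms follow,
    # unchanged from your model's config
]
```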

## CTW1500

- Step0: Read [Important Note](#important-note)

- Step1: Download `train_images.zip`, `test_images.zip`, `train_labels.zip`, `test_labels.zip` from [github](https://github.com/Yuliang-Liu/Curve-Text-Detector)

```bash
mkdir ctw1500 && cd ctw1500
mkdir imgs && mkdir annotations

# For annotations
cd annotations
wget -O train_labels.zip https://universityofadelaide.box.com/shared/static/jikuazluzyj4lq6umzei7m2ppmt3afyw.zip
wget -O test_labels.zip https://cloudstor.aarnet.edu.au/plus/s/uoeFl0pCN9BOCN5/download
unzip train_labels.zip && mv ctw1500_train_labels training
unzip test_labels.zip -d test
cd ..
# For images
cd imgs
wget -O train_images.zip https://universityofadelaide.box.com/shared/static/py5uwlfyyytbb2pxzq9czvu6fuqbjdh8.zip
wget -O test_images.zip https://universityofadelaide.box.com/shared/static/t4w48ofnqkdw7jyc4t11nsukoeqk9c3d.zip
unzip train_images.zip && mv train_images training
unzip test_images.zip && mv test_images test
```

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/dataset_converters/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
```

- The resulting directory structure looks like the following:

```text
├── ctw1500
│   ├── imgs
│   ├── annotations
│   ├── instances_training.json
│   └── instances_test.json
```
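
A quick way to sanity-check any generated annotation file is to load it and count its entries. A minimal sketch, assuming the COCO-style layout (`images` and `annotations` keys) these converters emit:

```python
import json

# Point this at any instances_*.json generated by a converter
with open('ctw1500/instances_training.json') as f:
    data = json.load(f)

# COCO-style files keep images and their text instances in parallel lists
print(len(data['images']), 'images')
print(len(data['annotations']), 'text instances')
```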

## ICDAR 2011 (Born-Digital Images)

- Step1: Download `Challenge1_Training_Task12_Images.zip`, `Challenge1_Training_Task1_GT.zip`, `Challenge1_Test_Task12_Images.zip`, and `Challenge1_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=1&com=downloads) `Task 1.1: Text Localization (2013 edition)`.

```bash
mkdir icdar2011 && cd icdar2011
mkdir imgs && mkdir annotations

# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate

# For images
unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test
# For annotations
unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test

rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
```

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/dataset_converters/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── icdar2011
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```

## ICDAR 2013 (Focused Scene Text)

- Step1: Download `Challenge2_Training_Task12_Images.zip`, `Challenge2_Test_Task12_Images.zip`, `Challenge2_Training_Task1_GT.zip`, and `Challenge2_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) `Task 2.1: Text Localization (2013 edition)`.

```bash
mkdir icdar2013 && cd icdar2013
mkdir imgs && mkdir annotations

# Download ICDAR 2013
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task12_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task1_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task1_GT.zip --no-check-certificate

# For images
unzip -q Challenge2_Training_Task12_Images.zip -d imgs/training
unzip -q Challenge2_Test_Task12_Images.zip -d imgs/test
# For annotations
unzip -q Challenge2_Training_Task1_GT.zip -d annotations/training
unzip -q Challenge2_Test_Task1_GT.zip -d annotations/test

rm Challenge2_Training_Task12_Images.zip && rm Challenge2_Test_Task12_Images.zip && rm Challenge2_Training_Task1_GT.zip && rm Challenge2_Test_Task1_GT.zip
```

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/dataset_converters/textdet/ic13_converter.py PATH/TO/icdar2013 --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── icdar2013
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```

## ICDAR 2015

- Step0: Read [Important Note](#important-note)

- Step1: Download `ch4_training_images.zip`, `ch4_test_images.zip`, `ch4_training_localization_transcription_gt.zip`, and `Challenge4_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads)

- Step2:

```bash
mkdir icdar2015 && cd icdar2015
mkdir imgs && mkdir annotations
# For images,
mv ch4_training_images imgs/training
mv ch4_test_images imgs/test
# For annotations,
mv ch4_training_localization_transcription_gt annotations/training
mv Challenge4_Test_Task1_GT annotations/test
```
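
The annotation files just moved follow the standard ICDAR 2015 format: one text instance per line, eight comma-separated corner coordinates, then the transcription, with `###` marking illegible instances. The converters below handle the parsing; this sketch (with a hypothetical file name) only illustrates the layout:

```python
# Each line is "x1,y1,x2,y2,x3,y3,x4,y4,transcription"; '###' marks an
# ignored instance. 'utf-8-sig' strips the BOM these files start with.
with open('annotations/training/gt_img_1.txt', encoding='utf-8-sig') as f:
    for line in f:
        parts = line.strip().split(',')
        polygon = [int(v) for v in parts[:8]]  # 4 corner points, clockwise
        text = ','.join(parts[8:])             # transcriptions may contain commas
        ignored = text == '###'
```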

- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) and [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) and move them to `icdar2015`

- Or, generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/dataset_converters/textdet/icdar_converter.py /path/to/icdar2015 -o /path/to/icdar2015 -d icdar2015 --split-list training test
```

- The resulting directory structure looks like the following:

```text
├── icdar2015
│   ├── imgs
│   ├── annotations
│   ├── instances_test.json
│   └── instances_training.json
```

## ICDAR 2017

- Follow similar steps as [ICDAR 2015](#icdar-2015).

- The resulting directory structure looks like the following:

```text
├── icdar2017
│   ├── imgs
│   ├── annotations
│   ├── instances_training.json
│   └── instances_val.json
```

## SynthText

- Step1: Download `SynthText.zip` from [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) and extract its content to `synthtext/imgs`.

- Step2: Download [data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb) and [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb) to `synthtext/instances_training.lmdb/`.

- The resulting directory structure looks like the following:

```text
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
│       ├── data.mdb
│       └── lock.mdb
```
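
To verify the LMDB download, you can open the database read-only and count its entries. A minimal sketch using the `lmdb` Python package (`pip install lmdb`):

```python
import lmdb

# Open the annotation database read-only; lock=False avoids touching lock.mdb
env = lmdb.open('synthtext/instances_training.lmdb', readonly=True, lock=False)
with env.begin() as txn:
    print(txn.stat()['entries'], 'entries')  # one entry per stored record
```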

## TextOCR

- Step1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) to `textocr/`.

```bash
mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
```

- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:

```bash
python tools/dataset_converters/textdet/textocr_converter.py /path/to/textocr
```

- The resulting directory structure looks like the following:

```text
├── textocr
│   ├── train
│   ├── instances_training.json
│   └── instances_val.json
```

## Totaltext

- Step0: Read [Important Note](#important-note)

- Step1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` or `TT_new_train_GT.zip` (if you prefer to use the latest version of training annotations) from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text) (our `totaltext_converter.py` supports ground truth in both .mat and .txt formats).

```bash
mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations

# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test

# For legacy training and test annotations
unzip groundtruth_text.zip
mv Groundtruth/Polygon/Train annotations/training
mv Groundtruth/Polygon/Test annotations/test

# Using the latest training annotations
# WARNING: Delete the legacy training annotations before running the following command.
unzip TT_new_train_GT.zip
mv Train annotations/training
```

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/dataset_converters/textdet/totaltext_converter.py /path/to/totaltext
```

- The resulting directory structure looks like the following:

```text
├── totaltext
│   ├── imgs
│   ├── annotations
│   ├── instances_test.json
│   └── instances_training.json
```

## CurvedSynText150k

- Step1: Download [syntext1.zip](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) and [syntext2.zip](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) to `CurvedSynText150k/`.

- Step2:

```bash
unzip -q syntext1.zip
mv train.json train1.json
unzip images.zip
rm images.zip

unzip -q syntext2.zip
mv train.json train2.json
unzip images.zip
rm images.zip
```

- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) to `CurvedSynText150k/`

- Or, generate `instances_training.json` with the following command:

```bash
python tools/dataset_converters/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4
```

- The resulting directory structure looks like the following:

```text
├── CurvedSynText150k
│   ├── syntext_word_eng
│   ├── emcs_imgs
│   └── instances_training.json
```

## FUNSD

- Step1: Download [dataset.zip](https://guillaumejaume.github.io/FUNSD/dataset.zip) to `funsd/`.

```bash
mkdir funsd && cd funsd

# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip

# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/

# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test

rm dataset.zip && rm -rf dataset
```
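
FUNSD ships one JSON file per image, each holding a top-level `form` list of entities with boxes, labels, and word-level segmentations. A minimal sketch for peeking at one file (the path is picked arbitrarily):

```python
import glob
import json

# Grab an arbitrary training annotation and inspect its first few entities
path = glob.glob('funsd/annotations/training/*.json')[0]
with open(path) as f:
    form = json.load(f)['form']
for entity in form[:3]:
    print(entity['label'], entity['box'], len(entity['words']), 'words')
```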

- Step2: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/dataset_converters/textdet/funsd_converter.py PATH/TO/funsd --nproc 4
```

- The resulting directory structure looks like the following:

```text
├── funsd
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```

## DeText

- Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the [homepage](https://rrc.cvc.uab.es/?ch=9).

```bash
mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val

# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate

# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val

# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
```

- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:

```bash
python tools/dataset_converters/textdet/detext_converter.py PATH/TO/detext --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── detext
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
```

## NAF

- Step1: Download [labeled_images.tar.gz](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) to `naf/`.

```bash
mkdir naf && cd naf

# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz

# For images
mkdir annotations && mv labeled_images imgs

# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/

rm -rf NAF_dataset && rm labeled_images.tar.gz
```

- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:

```bash
python tools/dataset_converters/textdet/naf_converter.py PATH/TO/naf --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── naf
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_val.json
│   └── instances_training.json
```

## SROIE

- Step1: Download `0325updated.task1train(626p).zip`, `task1&2_test(361p).zip`, and `text.task1&2-test(361p).zip` from [homepage](https://rrc.cvc.uab.es/?ch=13&com=downloads) to `sroie/`

- Step2:

```bash
mkdir sroie && cd sroie
mkdir imgs && mkdir annotations && mkdir imgs/training

# Warning: The zip files downloaded from Google Drive and BaiduYun Cloud may
# differ; revise the following commands with the correct file names if you
# encounter errors while extracting and moving the files.
unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test\(361p\).zip

# For images
mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test

# For annotations
mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test\(361p\)/ annotations/test

rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test\(361p\).zip
```

- Step3: Generate `instances_training.json` and `instances_test.json` with the following command:

```bash
python tools/dataset_converters/textdet/sroie_converter.py PATH/TO/sroie --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── sroie
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
```

## Lecture Video DB

- Step1: Download [IIIT-CVid.zip](http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip) to `lv/`.

```bash
mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

mv IIIT-CVid/Frames imgs

rm IIIT-CVid.zip
```

- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:

```bash
python tools/dataset_converters/textdet/lv_converter.py PATH/TO/lv --nproc 4
```

- The resulting directory structure looks like the following:

```text
├── lv
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_training.json
│   └── instances_val.json
```

## LSVT

- Step1: Download [train_full_images_0.tar.gz](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz), [train_full_images_1.tar.gz](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz), and [train_full_labels.json](https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json) to `lsvt/`.

```bash
mkdir lsvt && cd lsvt

# Download LSVT dataset
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json

mkdir annotations
tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
mv train_full_images_0 imgs

rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
```
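
`train_full_labels.json` maps each image key to its list of text instances; assuming the official LSVT label format, every instance carries polygon `points`, a `transcription`, and an `illegibility` flag. A minimal sketch for inspecting one image's labels:

```python
import json

with open('lsvt/annotations/train_full_labels.json') as f:
    labels = json.load(f)

# Peek at the first image's text instances
key, instances = next(iter(labels.items()))
print(key, '->', len(instances), 'instances')
print(instances[0]['transcription'], instances[0]['illegibility'])
```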

- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

```bash
# Annotations of the LSVT test split are not publicly available; to carve out
# a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/lsvt_converter.py PATH/TO/lsvt
```

- After running these commands, the directory structure should be as follows:

```text
├── lsvt
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```

## IMGUR

- Step1: Run `download_imgur5k.py` to download images. You can merge [PR#5](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset/pull/5) in your local repository to enable a **much faster** parallel execution of image download.

```bash
mkdir imgur && cd imgur

git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git

# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs

# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations

rm -rf IMGUR5K-Handwriting-Dataset
```

- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with the following command:

```bash
python tools/dataset_converters/textdet/imgur_converter.py PATH/TO/imgur
```

- After running these commands, the directory structure should be as follows:

```text
├── imgur
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_training.json
│   └── instances_val.json
```

## KAIST

- Step1: Download [KAIST_all.zip](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) to `kaist/`.

```bash
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations

# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip

rm KAIST_all.zip
```

- Step2: Extract zips:

```bash
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
```

- Step3: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

```bash
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
python tools/dataset_converters/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── kaist
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```

## MTWI

- Step1: Download `mtwi_2018_train.zip` from [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us).

```bash
mkdir mtwi && cd mtwi

unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations

rm mtwi_2018_train.zip
```

- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

```bash
# Annotations of the MTWI test split are not publicly available; to carve out
# a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── mtwi
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```

## COCO Text v2

- Step1: Download image [train2014.zip](http://images.cocodataset.org/zips/train2014.zip) and annotation [cocotext.v2.zip](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip) to `coco_textv2/`.

```bash
mkdir coco_textv2 && cd coco_textv2
mkdir annotations

# Download COCO Text v2 dataset
wget http://images.cocodataset.org/zips/train2014.zip
wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
unzip -q train2014.zip && unzip -q cocotext.v2.zip

mv train2014 imgs && mv cocotext.v2.json annotations/

rm train2014.zip && rm cocotext.v2.zip
```
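
`cocotext.v2.json` bundles images and annotations into one file, keyed by id, with a `legibility` flag on every instance (key names per the COCO-Text release; treat them as an assumption). A minimal sketch for counting legible text instances:

```python
import json

with open('coco_textv2/annotations/cocotext.v2.json') as f:
    ct = json.load(f)

print(len(ct['imgs']), 'images,', len(ct['anns']), 'annotations')
# 'anns' is a dict keyed by annotation id
legible = sum(a['legibility'] == 'legible' for a in ct['anns'].values())
print(legible, 'legible instances')
```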

- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:

```bash
python tools/dataset_converters/textdet/cocotext_converter.py PATH/TO/coco_textv2
```

- After running these commands, the directory structure should be as follows:

```text
├── coco_textv2
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
```

## ReCTS

- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).

```bash
mkdir rects && cd rects

# Download ReCTS dataset
# You can also find the Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip

mv img imgs && mv gt_unicode annotations

rm ReCTS.zip && rm -rf gt
```

- Step2: Generate `instances_training.json` and `instances_val.json` (optional) with the following command:

```bash
# Annotations of the ReCTS test split are not publicly available; to carve out
# a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
```

- After running these commands, the directory structure should be as follows:

```text
├── rects
│   ├── annotations
│   ├── imgs
│   ├── instances_val.json (optional)
│   └── instances_training.json
```

## ILST

- Step1: Download `IIIT-ILST` from [onedrive](https://iiitaphyd-my.sharepoint.com/:f:/g/personal/minesh_mathew_research_iiit_ac_in/EtLvCozBgaBIoqglF4M-lHABMgNcCDW9rJYKKWpeSQEElQ?e=zToXZP)

- Step2: Run the following commands

```bash
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
cd IIIT-ILST

# rename files
cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..

# transfer image path
mkdir imgs && mkdir annotations
mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/

# remove unnecessary files
rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
```

- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't come with a validation set, you may specify `--val-ratio` to split one off; e.g., with `--val-ratio 0.2`, 20% of the data are held out for validation (see the sketch after the command below).

```bash
python tools/dataset_converters/textdet/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
```
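
Conceptually, `--val-ratio` just shuffles the samples and holds a fraction out, roughly like the hypothetical sketch below (the converter's actual bookkeeping differs):

```python
import random

def split(samples, val_ratio=0.2, seed=0):
    """Hold out a random fraction of the samples for validation."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_val = int(len(samples) * val_ratio)
    return samples[n_val:], samples[:n_val]  # training, validation

train, val = split(range(100), val_ratio=0.2)
print(len(train), len(val))  # 80 20
```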

- After running these commands, the directory structure should be as follows:

```text
├── IIIT-ILST
│   ├── annotations
│   ├── imgs
│   ├── instances_val.json (optional)
│   └── instances_training.json
```

## VinText

- Step1: Download [vintext.zip](https://drive.google.com/uc?id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml) to `vintext`

```bash
mkdir vintext && cd vintext

# Download dataset from google drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt

# Extract images and annotations
unzip -q vintext.zip && rm vintext.zip
mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
rm -rf vietnamese

# Rename files
mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
```

- Step2: Generate `instances_training.json`, `instances_test.json` and `instances_unseen_test.json`

```bash
python tools/dataset_converters/textdet/vintext_converter.py PATH/TO/vintext --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── vintext
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   ├── instances_unseen_test.json
│   └── instances_training.json
```

## BID

- Step1: Download [BID Dataset.zip](https://drive.google.com/file/d/1Oi88TRcpdjZmJ79WDLb9qFlBNG8q2De6/view)

- Step2: Run the following commands to preprocess the dataset

```bash
# Rename
mv BID\ Dataset.zip BID_Dataset.zip

# Unzip and rename
unzip -q BID_Dataset.zip && rm BID_Dataset.zip
mv BID\ Dataset BID

# The BID dataset ships with restrictive file permissions; grant access
# before processing
chmod -R 777 BID
cd BID
mkdir imgs && mkdir annotations

# For images and annotations
mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso

# Remove unnecessary files
rm -rf desktop.ini
```

- Step3: Generate `instances_training.json` and `instances_val.json` (optional). Since the original dataset doesn't come with a validation set, you may specify `--val-ratio` to split one off; e.g., with `--val-ratio 0.2`, 20% of the data are held out for validation.

```bash
python tools/dataset_converters/textdet/bid_converter.py PATH/TO/BID --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── BID
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```

## RCTW

- Step1: Download `train_images.zip.001`, `train_images.zip.002`, and `train_gts.zip` from the [homepage](https://rctw.vlrlab.net/dataset.html), and extract the zips to `rctw/imgs` and `rctw/annotations`, respectively.

- Step2: Generate `instances_training.json` and `instances_val.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset; e.g., with `--val-ratio 0.2`, 20% of the data are held out for validation.

```bash
# Annotations of the RCTW test split are not publicly available; to carve out a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/rctw_converter.py PATH/TO/rctw --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── rctw
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```

## HierText

- Step1 (optional): Install [AWS CLI](https://mmocr.readthedocs.io/en/latest/datasets/det.html#install-aws-cli-optional).

- Step2: Clone the [HierText](https://github.com/google-research-datasets/hiertext) repo to get the annotations

```bash
mkdir HierText && cd HierText
git clone https://github.com/google-research-datasets/hiertext.git
```

- Step3: Download `train.tgz` and `validation.tgz` from AWS

```bash
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
```

- Step4: Process the raw data

```bash
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.jsonl.gz
gzip -d annotations/validation.jsonl.gz
# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz -C imgs/
tar -xzvf imgs/validation.tgz -C imgs/
```

- Step5: Generate `instances_training.json` and `instances_val.json`. HierText includes different levels of annotation, from paragraph and line down to word; check the original [paper](https://arxiv.org/pdf/2203.15143.pdf) for details. E.g., set `--level paragraph` for paragraph-level annotations, `--level line` for line-level annotations, or `--level word` for word-level annotations.

```bash
# Collect word-level annotations from HierText with --level word
python tools/dataset_converters/textdet/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
```
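
The decompressed ground-truth files are plain JSON (despite the `.jsonl` suffix), with annotations nested paragraph → line → word, which is exactly what `--level` selects between. A minimal sketch of walking that hierarchy (key names per the HierText repo; treat them as an assumption):

```python
import json

with open('HierText/annotations/validation.jsonl') as f:
    gt = json.load(f)

sample = gt['annotations'][0]  # one entry per image
for paragraph in sample['paragraphs']:
    for line in paragraph['lines']:
        # each word carries its own polygon ('vertices') and transcription ('text')
        print(len(line['words']), 'words in line:', line.get('text'))
```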

- After running these commands, the directory structure should be as follows:

```text
├── HierText
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
```

## ArT

- Step1: Download `train_images.tar.gz` and `train_labels.json` from the [homepage](https://rrc.cvc.uab.es/?ch=14&com=downloads) to `art/`

```bash
mkdir art && cd art
mkdir annotations

# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_images.tar.gz --no-check-certificate
wget https://dataset-bj.cdn.bcebos.com/art/train_labels.json --no-check-certificate

# Extract
tar -xf train_images.tar.gz
mv train_images imgs
mv train_labels.json annotations/

# Remove unnecessary files
rm train_images.tar.gz
```

- Step2: Generate `instances_training.json` and `instances_val.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset; e.g., with `--val-ratio 0.2`, 20% of the data are held out for validation.

```bash
# Annotations of the ArT test split are not publicly available; to carve out a validation set, add --val-ratio 0.2
python tools/dataset_converters/textdet/art_converter.py PATH/TO/art --nproc 4
```

- After running these commands, the directory structure should be as follows:

```text
├── art
│   ├── annotations
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json (optional)
```