diff --git a/docs/en/user_guides/data_prepare/dataset_preparer.md b/docs/en/user_guides/data_prepare/dataset_preparer.md
index 51a8a7df..65ccf5ab 100644
--- a/docs/en/user_guides/data_prepare/dataset_preparer.md
+++ b/docs/en/user_guides/data_prepare/dataset_preparer.md
@@ -33,7 +33,7 @@ Also, the script supports preparing multiple datasets at the same time. For exam
python tools/dataset_converters/prepare_dataset.py icdar2015 totaltext --task textrecog
```

-To check the supported datasets in MMOCR, please refer to [Dataset Zoo](./datasetzoo.md).
+To check the supported datasets of Dataset Preparer, please refer to [Dataset Zoo](./datasetzoo.md). Some other datasets that need to be prepared manually are listed in [Text Detection](./det.md) and [Text Recognition](./recog.md).

## Advanced Usage
diff --git a/docs/en/user_guides/data_prepare/det.md b/docs/en/user_guides/data_prepare/det.md
index 8c20fb21..c9ec95b4 100644
--- a/docs/en/user_guides/data_prepare/det.md
+++ b/docs/en/user_guides/data_prepare/det.md
@@ -1,40 +1,32 @@
# Text Detection

```{note}
-This page is deprecated and all these scripts will be eventually migrated into dataset preparer, a brand new module designed to ease these lengthy dataset preparation steps. [Check it out](./dataset_preparer.md)!
+This page is a manual preparation guide for datasets not yet supported by [Dataset Preparer](./dataset_preparer.md), into which all these scripts will eventually be migrated.
```

## Overview

-| Dataset | Images | | Annotation Files | | |
-| :---------------: | :-------------------------------------------: | :-------------------------------------: | :------------------------------------------------------: | :--------------------------------------: | :-: |
-| | | training | validation | testing | |
-| CTW1500 | [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) | - | - | - | |
-| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - | | |
-| ICDAR2013 | [homepage](https://rrc.cvc.uab.es/?ch=2) | - | - | - | |
-| ICDAR2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) | - | [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) | |
-| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - | |
-| Synthtext | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | instances_training.lmdb ([data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb), [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb)) | - | - | |
-| TextOCR | [homepage](https://textvqa.org/textocr/dataset) | - | - | - | |
-| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | - | - | - | |
-| CurvedSynText150k | [homepage](https://github.com/aim-uofa/AdelaiDet/blob/master/datasets/README.md) \| [Part1](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) \| [Part2](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) | [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) | - | - | |
-| FUNSD | [homepage](https://guillaumejaume.github.io/FUNSD/) | - | - | - | |
-| 
DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - | - | | -| NAF | [homepage](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) | - | - | - | | -| SROIE | [homepage](https://rrc.cvc.uab.es/?ch=13) | - | - | - | | -| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - | | -| LSVT | [homepage](https://rrc.cvc.uab.es/?ch=16) | - | - | - | | -| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - | | -| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - | | -| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - | | -| COCO Text v2 | [homepage](https://bgshih.github.io/cocotext/) | - | - | - | | -| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | - | | -| IIIT-ILST | [homepage](http://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-ilst) | - | - | - | | -| VinText | [homepage](https://github.com/VinAIResearch/dict-guided) | - | - | - | | -| BID | [homepage](https://github.com/ricardobnjunior/Brazilian-Identity-Document-Dataset) | - | - | - | | -| RCTW | [homepage](https://rctw.vlrlab.net/index.html) | - | - | - | | -| HierText | [homepage](https://github.com/google-research-datasets/hiertext) | - | - | - | | -| ArT | [homepage](https://rrc.cvc.uab.es/?ch=14) | - | - | - | | +| Dataset | Images | | Annotation Files | | | +| :---------------: | :------------------------------------------------------: | :------------------------------------------------: | :-----------------------------------------------------------------: | :-----: | :-: | +| | | training | validation | testing | | +| CTW1500 | [homepage](https://github.com/Yuliang-Liu/Curve-Text-Detector) | - | - | - | | +| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - | | | +| ICDAR2017 | [homepage](https://rrc.cvc.uab.es/?ch=8&com=downloads) | [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_training.json) | [instances_val.json](https://download.openmmlab.com/mmocr/data/icdar2017/instances_val.json) | - | | +| Synthtext | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | instances_training.lmdb ([data.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/data.mdb), [lock.mdb](https://download.openmmlab.com/mmocr/data/synthtext/instances_training.lmdb/lock.mdb)) | - | - | | +| CurvedSynText150k | [homepage](https://github.com/aim-uofa/AdelaiDet/blob/master/datasets/README.md) \| [Part1](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) \| [Part2](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) | [instances_training.json](https://download.openmmlab.com/mmocr/data/curvedsyntext/instances_training.json) | - | - | | +| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - | - | | +| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - | - | | +| LSVT | [homepage](https://rrc.cvc.uab.es/?ch=16) | - | - | - | | +| IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | - | | +| KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | - | | +| MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | - | | +| ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - 
| - | - | | +| IIIT-ILST | [homepage](http://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-ilst) | - | - | - | | +| VinText | [homepage](https://github.com/VinAIResearch/dict-guided) | - | - | - | | +| BID | [homepage](https://github.com/ricardobnjunior/Brazilian-Identity-Document-Dataset) | - | - | - | | +| RCTW | [homepage](https://rctw.vlrlab.net/index.html) | - | - | - | | +| HierText | [homepage](https://github.com/google-research-datasets/hiertext) | - | - | - | | +| ArT | [homepage](https://rrc.cvc.uab.es/?ch=14) | - | - | - | | ### Install AWS CLI (optional) @@ -142,82 +134,6 @@ inconsistency results in false examples in the training set. Therefore, users sh │ └── instances_training.json ``` -## ICDAR 2013 (Focused Scene Text) - -- Step1: Download `Challenge2_Training_Task12_Images.zip`, `Challenge2_Test_Task12_Images.zip`, `Challenge2_Training_Task1_GT.zip`, and `Challenge2_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) `Task 2.1: Text Localization (2013 edition)`. - - ```bash - mkdir icdar2013 && cd icdar2013 - mkdir imgs && mkdir annotations - - # Download ICDAR 2013 - wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task12_Images.zip --no-check-certificate - wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task12_Images.zip --no-check-certificate - wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task1_GT.zip --no-check-certificate - wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task1_GT.zip --no-check-certificate - - # For images - unzip -q Challenge2_Training_Task12_Images.zip -d imgs/training - unzip -q Challenge2_Test_Task12_Images.zip -d imgs/test - # For annotations - unzip -q Challenge2_Training_Task1_GT.zip -d annotations/training - unzip -q Challenge2_Test_Task1_GT.zip -d annotations/test - - rm Challenge2_Training_Task12_Images.zip && rm Challenge2_Test_Task12_Images.zip && rm Challenge2_Training_Task1_GT.zip && rm Challenge2_Test_Task1_GT.zip - ``` - -- Step 2: Generate `instances_training.json` and `instances_test.json` with the following command: - - ```bash - python tools/dataset_converters/textdet/ic13_converter.py PATH/TO/icdar2013 --nproc 4 - ``` - -- After running the above codes, the directory structure should be as follows: - - ```text - │── icdar2013 - │ ├── imgs - │ ├── instances_test.json - │ └── instances_training.json - ``` - -## ICDAR 2015 - -- Step0: Read [Important Note](#important-note) - -- Step1: Download `ch4_training_images.zip`, `ch4_test_images.zip`, `ch4_training_localization_transcription_gt.zip`, `Challenge4_Test_Task1_GT.zip` from [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) - -- Step2: - - ```bash - mkdir icdar2015 && cd icdar2015 - mkdir imgs && mkdir annotations - # For images, - mv ch4_training_images imgs/training - mv ch4_test_images imgs/test - # For annotations, - mv ch4_training_localization_transcription_gt annotations/training - mv Challenge4_Test_Task1_GT annotations/test - ``` - -- Step3: Download [instances_training.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_training.json) and [instances_test.json](https://download.openmmlab.com/mmocr/data/icdar2015/instances_test.json) and move them to `icdar2015` - -- Or, generate `instances_training.json` and `instances_test.json` with the following command: - - ```bash - python tools/dataset_converters/textdet/icdar_converter.py /path/to/icdar2015 -o /path/to/icdar2015 -d icdar2015 --split-list training test - ``` - -- The resulting directory structure looks like the following: - - 
```text - ├── icdar2015 - │ ├── imgs - │ ├── annotations - │ ├── instances_test.json - │ └── instances_training.json - ``` - ## ICDAR 2017 - Follow similar steps as [ICDAR 2015](#icdar-2015). @@ -248,81 +164,6 @@ inconsistency results in false examples in the training set. Therefore, users sh │ └── lock.mdb ``` -## TextOCR - -- Step1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) to `textocr/`. - - ```bash - mkdir textocr && cd textocr - - # Download TextOCR dataset - wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip - wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json - wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json - - # For images - unzip -q train_val_images.zip - mv train_images train - ``` - -- Step2: Generate `instances_training.json` and `instances_val.json` with the following command: - - ```bash - python tools/dataset_converters/textdet/textocr_converter.py /path/to/textocr - ``` - -- The resulting directory structure looks like the following: - - ```text - ├── textocr - │ ├── train - │ ├── instances_training.json - │ └── instances_val.json - ``` - -## Totaltext - -- Step0: Read [Important Note](#important-note) - -- Step1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` or `TT_new_train_GT.zip` (if you prefer to use the latest version of training annotations) from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text) (Our totaltext_converter.py supports groundtruth with both .mat and .txt format). - - ```bash - mkdir totaltext && cd totaltext - mkdir imgs && mkdir annotations - - # For images - # in ./totaltext - unzip totaltext.zip - mv Images/Train imgs/training - mv Images/Test imgs/test - - # For legacy training and test annotations - unzip groundtruth_text.zip - mv Groundtruth/Polygon/Train annotations/training - mv Groundtruth/Polygon/Test annotations/test - - # Using the latest training annotations - # WARNING: Delete legacy train annotations before running the following command. - unzip TT_new_train_GT.zip - mv Train annotations/training - ``` - -- Step2: Generate `instances_training.json` and `instances_test.json` with the following command: - - ```bash - python tools/dataset_converters/textdet/totaltext_converter.py /path/to/totaltext - ``` - -- The resulting directory structure looks like the following: - - ```text - ├── totaltext - │ ├── imgs - │ ├── annotations - │ ├── instances_test.json - │ └── instances_training.json - ``` - ## CurvedSynText150k - Step1: Download [syntext1.zip](https://drive.google.com/file/d/1OSJ-zId2h3t_-I7g_wUkrK-VqQy153Kj/view?usp=sharing) and [syntext2.zip](https://drive.google.com/file/d/1EzkcOlIgEp5wmEubvHb7-J5EImHExYgY/view?usp=sharing) to `CurvedSynText150k/`. @@ -358,43 +199,6 @@ inconsistency results in false examples in the training set. Therefore, users sh │ └── instances_training.json ``` -## FUNSD - -- Step1: Download [dataset.zip](https://guillaumejaume.github.io/FUNSD/dataset.zip) to `funsd/`. 
- - ```bash - mkdir funsd && cd funsd - - # Download FUNSD dataset - wget https://guillaumejaume.github.io/FUNSD/dataset.zip - unzip -q dataset.zip - - # For images - mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/ - - # For annotations - mkdir annotations - mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test - - rm dataset.zip && rm -rf dataset - ``` - -- Step2: Generate `instances_training.json` and `instances_test.json` with following command: - - ```bash - python tools/dataset_converters/textdet/funsd_converter.py PATH/TO/funsd --nproc 4 - ``` - -- The resulting directory structure looks like the following: - - ```text - │── funsd - │ ├── annotations - │ ├── imgs - │ ├── instances_test.json - │ └── instances_training.json - ``` - ## DeText - Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from **Task 3: End to End** on the [homepage](https://rrc.cvc.uab.es/?ch=9). @@ -432,84 +236,6 @@ inconsistency results in false examples in the training set. Therefore, users sh │ └── instances_training.json ``` -## NAF - -- Step1: Download [labeled_images.tar.gz](https://github.com/herobd/NAF_dataset/releases/tag/v1.0) to `naf/`. - - ```bash - mkdir naf && cd naf - - # Download NAF dataset - wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz - tar -zxf labeled_images.tar.gz - - # For images - mkdir annotations && mv labeled_images imgs - - # For annotations - git clone https://github.com/herobd/NAF_dataset.git - mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/ - - rm -rf NAF_dataset && rm labeled_images.tar.gz - ``` - -- Step2: Generate `instances_training.json`, `instances_val.json`, and `instances_test.json` with following command: - - ```bash - python tools/dataset_converters/textdet/naf_converter.py PATH/TO/naf --nproc 4 - ``` - -- After running the above codes, the directory structure should be as follows: - - ```text - │── naf - │ ├── annotations - │ ├── imgs - │ ├── instances_test.json - │ ├── instances_val.json - │ └── instances_training.json - ``` - -## SROIE - -- Step1: Download `0325updated.task1train(626p).zip`, `task1&2_test(361p).zip`, and `text.task1&2-test(361p).zip` from [homepage](https://rrc.cvc.uab.es/?ch=13&com=downloads) to `sroie/` - -- Step2: - - ```bash - mkdir sroie && cd sroie - mkdir imgs && mkdir annotations && mkdir imgs/training - - # Warnninig: The zip files downloaded from Google Drive and BaiduYun Cloud may - # be different, the user should revise the following commands to the correct - # file name if encounter with errors while extracting and move the files. 
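- # Note: the parentheses and "&" in these archive names are shell
- # metacharacters, so the backslash escapes below are required; the escapes
- # assume the official file names, adjust them if your copies are named differently.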
- unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test\(361p\).zip
-
- # For images
- mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test
-
- # For annotations
- mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test\(361p\)/ annotations/test
-
- rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test\(361p\).zip
- ```
-
-- Step3: Generate `instances_training.json` and `instances_test.json` with the following command:
-
- ```bash
- python tools/dataset_converters/textdet/sroie_converter.py PATH/TO/sroie --nproc 4
- ```
-
-- After running the above codes, the directory structure should be as follows:
-
- ```text
- ├── sroie
- │ ├── annotations
- │ ├── imgs
- │ ├── instances_test.json
- │ └── instances_training.json
- ```
-
## Lecture Video DB

- Step1: Download [IIIT-CVid.zip](http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip) to `lv/`.
@@ -684,40 +410,6 @@ inconsistency results in false examples in the training set. Therefore, users sh
│ └── instances_val.json (optional)
```

-## COCO Text v2
-
-- Step1: Download image [train2014.zip](http://images.cocodataset.org/zips/train2014.zip) and annotation [cocotext.v2.zip](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip) to `coco_textv2/`.
-
- ```bash
- mkdir coco_textv2 && cd coco_textv2
- mkdir annotations
-
- # Download COCO Text v2 dataset
- wget http://images.cocodataset.org/zips/train2014.zip
- wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
- unzip -q train2014.zip && unzip -q cocotext.v2.zip
-
- mv train2014 imgs && mv cocotext.v2.json annotations/
-
- rm train2014.zip && rm -rf cocotext.v2.zip
- ```
-
-- Step2: Generate `instances_training.json` and `instances_val.json` with the following command:
-
- ```bash
- python tools/dataset_converters/textdet/cocotext_converter.py PATH/TO/coco_textv2
- ```
-
-- After running the above codes, the directory structure should be as follows:
-
- ```text
- │── coco_textv2
- │ ├── annotations
- │ ├── imgs
- │ ├── instances_training.json
- │ └── instances_val.json
- ```
-
## ReCTS

- Step1: Download [ReCTS.zip](https://datasets.cvc.uab.es/rrc/ReCTS.zip) to `rects/` from the [homepage](https://rrc.cvc.uab.es/?ch=12&com=downloads).
diff --git a/docs/en/user_guides/data_prepare/kie.md b/docs/en/user_guides/data_prepare/kie.md
index d10679ed..9d324383 100644
--- a/docs/en/user_guides/data_prepare/kie.md
+++ b/docs/en/user_guides/data_prepare/kie.md
@@ -1,7 +1,7 @@
# Key Information Extraction

```{note}
-This page is deprecated and all these scripts will be eventually migrated into dataset preparer, a brand new module designed to ease these lengthy dataset preparation steps. [Check it out](./dataset_preparer.md)!
+This page is a manual preparation guide for datasets not yet supported by [Dataset Preparer](./dataset_preparer.md), into which all these scripts will eventually be migrated.
```

## Overview
diff --git a/docs/en/user_guides/data_prepare/recog.md b/docs/en/user_guides/data_prepare/recog.md
index 3fdc8ab0..3efa863d 100644
--- a/docs/en/user_guides/data_prepare/recog.md
+++ b/docs/en/user_guides/data_prepare/recog.md
@@ -1,7 +1,7 @@
# Text Recognition

```{note}
-This page is deprecated and all these scripts will be eventually migrated into dataset preparer, a brand new module designed to ease these lengthy dataset preparation steps. [Check it out](./dataset_preparer.md)!
+This page is a manual preparation guide for datasets not yet supported by [Dataset Preparer](./dataset_preparer.md), into which all these scripts will eventually be migrated.
```

## Overview
@@ -11,28 +11,16 @@ This page is deprecated and all these scripts will be eventually migrated into d
| | | training | test |
| coco_text | [homepage](https://rrc.cvc.uab.es/?ch=5&com=downloads) | [train_labels.json](#TODO) | - |
| ICDAR2011 | [homepage](https://rrc.cvc.uab.es/?ch=1) | - | - |
-| ICDAR2013 | [homepage](https://rrc.cvc.uab.es/?ch=2) | - | - |
-| icdar_2015 | [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) | [train_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/icdar_2015/train_labels.json) | [test_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/icdar_2015/test_labels.json) |
-| IIIT5K | [homepage](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html) | [train_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/IIIT5K/train_labels.json) | [test_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/IIIT5K/test_labels.json) |
-| ct80 | [homepage](http://cs-chan.com/downloads_CUTE80_dataset.html) | - | [test_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/ct80/test_labels.json) |
-| svt | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset) | - | [test_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/svt/test_labels.json) |
-| svtp | [unofficial homepage\[1\]](https://github.com/Jyouhou/Case-Sensitive-Scene-Text-Recognition-Datasets) | - | [test_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/svtp/test_labels.json) |
| MJSynth (Syn90k) | [homepage](https://www.robots.ox.ac.uk/~vgg/data/text/) | [subset_train_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/Syn90k/subset_train_labels.json) \| [train_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/Syn90k/train_labels.json) | - |
| SynthText (Synth800k) | [homepage](https://www.robots.ox.ac.uk/~vgg/data/scenetext/) | [alphanumeric_train_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/SynthText/alphanumeric_train_labels.json) \|[subset_train_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/SynthText/subset_train_labels.json) \| [train_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/SynthText/train_labels.json) | - |
| SynthAdd | [SynthText_Add.zip](https://pan.baidu.com/s/1uV0LtoNmcxbO-0YA7Ch4dg) (code:627x) | [train_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/synthtext_add/train_labels.json) | - |
-| TextOCR | [homepage](https://textvqa.org/textocr/dataset) | - | - |
-| Totaltext | [homepage](https://github.com/cs-chan/Total-Text-Dataset) | - | - |
| OpenVINO | [Open Images](https://github.com/cvdfoundation/open-images-dataset) | [annotations](https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text) | [annotations](https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text) |
-| FUNSD | [homepage](https://guillaumejaume.github.io/FUNSD/) | - | - |
| DeText | [homepage](https://rrc.cvc.uab.es/?ch=9) | - | - |
-| NAF | [homepage](https://github.com/herobd/NAF_dataset) | - | - |
-| SROIE | [homepage](https://rrc.cvc.uab.es/?ch=13) | - | - |
| Lecture Video DB | [homepage](https://cvit.iiit.ac.in/research/projects/cvit-projects/lecturevideodb) | - | - |
| LSVT | 
[homepage](https://rrc.cvc.uab.es/?ch=16) | - | - | | IMGUR | [homepage](https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset) | - | - | | KAIST | [homepage](http://www.iapr-tc11.org/mediawiki/index.php/KAIST_Scene_Text_Database) | - | - | | MTWI | [homepage](https://tianchi.aliyun.com/competition/entrance/231685/information?lang=en-us) | - | - | -| COCO Text v2 | [homepage](https://bgshih.github.io/cocotext/) | - | - | | ReCTS | [homepage](https://rrc.cvc.uab.es/?ch=12) | - | - | | IIIT-ILST | [homepage](http://cvit.iiit.ac.in/research/projects/cvit-projects/iiit-ilst) | - | - | | VinText | [homepage](https://github.com/VinAIResearch/dict-guided) | - | - | @@ -98,141 +86,6 @@ This page is deprecated and all these scripts will be eventually migrated into d │ └── test_labels.json ``` -## ICDAR 2013 (Focused Scene Text) - -- Step1: Download `Challenge2_Training_Task3_Images_GT.zip`, `Challenge2_Test_Task3_Images.zip`, and `Challenge2_Test_Task3_GT.txt` from [homepage](https://rrc.cvc.uab.es/?ch=2&com=downloads) `Task 2.3: Word Recognition (2013 edition)`. - - ```bash - mkdir icdar2013 && cd icdar2013 - mkdir annotations - - # Download ICDAR 2013 - wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task3_Images_GT.zip --no-check-certificate - wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_Images.zip --no-check-certificate - wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_GT.txt --no-check-certificate - wget https://download.openmmlab.com/mmocr/data/mixture/icdar_2013/test_label_1015.txt - - # For images - mkdir crops - unzip -q Challenge2_Training_Task3_Images_GT.zip -d crops/train - unzip -q Challenge2_Test_Task3_Images.zip -d crops/test - # For annotations - mv Challenge2_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge2_Train_Task3_GT.txt && mv test_label_1015.txt annotations/Challenge2_Test1015_Task3_GT.txt - - rm Challenge2_Training_Task3_Images_GT.zip && rm Challenge2_Test_Task3_Images.zip - ``` - -- Step 2: Generate `train_labels.json`, `test_labels.json`, `test1015_label.json` with the following command: - - ```bash - python tools/dataset_converters/textrecog/ic13_converter.py PATH/TO/icdar2013 - ``` - -- After running the above codes, the directory structure should be as follows: - - ```text - ├── icdar2013 - │ ├── crops - │ ├── train_labels.json - │ ├── test_labels.json - │ └── test1015_label.json - ``` - -## ICDAR 2015 - -- Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from [homepage](https://rrc.cvc.uab.es/?ch=4&com=downloads) - -- Step2: Download [train_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/icdar_2015/train_labels.json) and [test_label.json](https://download.openmmlab.com/mmocr/data/1.x/recog/icdar_2015/test_labels.json) - -- After running the above codes, the directory structure - should be as follows: - - ```text - ├── icdar_2015 - │ ├── train_labels.json - │ ├── test_labels.json - │ ├── ch4_training_word_images_gt - │ └── ch4_test_word_images_gt - ``` - -## IIIT5K - -- Step1: Download `IIIT5K-Word_V3.0.tar.gz` from [homepage](http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html) - -- Step2: Download [train_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/IIIT5K/train_labels.json) and [test_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/IIIT5K/test_labels.json) - -- After running the above codes, the directory structure - should be as follows: - - ```text - ├── III5K - │ ├── train_labels.json - │ 
├── test_labels.json - │ ├── train - │ └── test - ``` - -## svt - -- Step1: Download `svt.zip` form [homepage](http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset) - -- Step2: Download [test_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/svt/test_labels.json) - -- Step3: - - ```bash - python tools/dataset_converters/textrecog/svt_converter.py <download_svt_dir_path> - ``` - -- After running the above codes, the directory structure - should be as follows: - - ```text - ├── svt - │ ├── test_labels.json - │ └── image - ``` - -## ct80 - -- Step1: Download [test_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/ct80/test_labels.json) - -- Step2: Download [timage.tar.gz](https://download.openmmlab.com/mmocr/data/mixture/ct80/timage.tar.gz) - -- Step3: - - ```bash - mkdir ct80 && cd ct80 - mv /path/to/test_labels.json . - mv /path/to/timage.tar.gz . - tar -xvf timage.tar.gz - # create soft link - cd /path/to/mmocr/data/mixture - ln -s /path/to/ct80 ct80 - ``` - -- After running the above codes, the directory structure - should be as follows: - - ```text - ├── ct80 - │ ├── test_labels.json - │ └── timage - ``` - -## svtp - -- Step1: Download [test_labels.json](https://download.openmmlab.com/mmocr/data/1.x/recog/svtp/test_labels.json) - -- After running the above codes, the directory structure - should be as follows: - - ```text - ├── svtp - │ ├── test_labels.json - │ └── image - ``` - ## coco_text - Step1: Download from [homepage](https://rrc.cvc.uab.es/?ch=5&com=downloads) @@ -365,79 +218,6 @@ Please make sure you're using the right annotation to train the model by checkin │ └── SynthText_Add ``` -## TextOCR - -- Step1: Download [train_val_images.zip](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip), [TextOCR_0.1_train.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json) and [TextOCR_0.1_val.json](https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json) to `textocr/`. - - ```bash - mkdir textocr && cd textocr - - # Download TextOCR dataset - wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip - wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json - wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json - - # For images - unzip -q train_val_images.zip - mv train_images train - ``` - -- Step2: Generate `train_labels.json`, `val_labels.json` and crop images using 4 processes with the following command: - - ```bash - python tools/dataset_converters/textrecog/textocr_converter.py /path/to/textocr 4 - ``` - -- After running the above codes, the directory structure - should be as follows: - - ```text - ├── TextOCR - │ ├── image - │ ├── train_labels.json - │ └── val_labels.json - ``` - -## Totaltext - -- Step1: Download `totaltext.zip` from [github dataset](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Dataset) and `groundtruth_text.zip` or `TT_new_train_GT.zip` (if you prefer to use the latest version of training annotations) from [github Groundtruth](https://github.com/cs-chan/Total-Text-Dataset/tree/master/Groundtruth/Text) (Our totaltext_converter.py supports groundtruth with both .mat and .txt format). 
-
- ```bash
- mkdir totaltext && cd totaltext
- mkdir imgs && mkdir annotations
-
- # For images
- # in ./totaltext
- unzip totaltext.zip
- mv Images/Train imgs/training
- mv Images/Test imgs/test
-
- # For legacy training and test annotations
- unzip groundtruth_text.zip
- mv Groundtruth/Polygon/Train annotations/training
- mv Groundtruth/Polygon/Test annotations/test
-
- # Using the latest training annotations
- # WARNING: Delete legacy train annotations before running the following command.
- unzip TT_new_train_GT.zip
- mv Train annotations/training
- ```
-
-- Step2: Generate cropped images, `train_labels.json` and `test_labels.json` with the following command (the cropped images will be saved to `data/totaltext/dst_imgs/`):
-
- ```bash
- python tools/dataset_converters/textrecog/totaltext_converter.py /path/to/totaltext
- ```
-
-- After running the above codes, the directory structure should be as follows:
-
- ```text
- ├── totaltext
- │ ├── dst_imgs
- │ ├── train_labels.json
- │ └── test_labels.json
- ```
-
## OpenVINO

- Step1 (optional): Install [AWS CLI](https://mmocr.readthedocs.io/en/latest/datasets/recog.html#install-aws-cli-optional).
@@ -569,45 +349,6 @@ Please make sure you're using the right annotation to train the model by checkin
│ └── test_labels.json
```

-## SROIE
-
-- Step1: Download `0325updated.task1train(626p).zip`, `task1&2_test(361p).zip`, and `text.task1&2-test(361p).zip` from [homepage](https://rrc.cvc.uab.es/?ch=13&com=downloads) to `sroie/`
-
-- Step2:
-
- ```bash
- mkdir sroie && cd sroie
- mkdir imgs && mkdir annotations && mkdir imgs/training
-
- # Warnninig: The zip files downloaded from Google Drive and BaiduYun Cloud may
- # be different, the user should revise the following commands to the correct
- # file name if encounter with errors while extracting and move the files.
- unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test\(361p\).zip
-
- # For images
- mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test
-
- # For annotations
- mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test\(361p\)/ annotations/test
-
- rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test\(361p\).zip
- ```
-
-- Step3: Generate `train_labels.json` and `test_labels.json` and crop images using 4 processes with the following command:
-
- ```bash
- python tools/dataset_converters/textrecog/sroie_converter.py PATH/TO/sroie --nproc 4
- ```
-
-- After running the above codes, the directory structure should be as follows:
-
- ```text
- ├── sroie
- │ ├── crops
- │ ├── train_labels.json
- │ └── test_labels.json
- ```
-
## Lecture Video DB

```{warning}
This section is not fully tested yet.
```
@@ -695,49 +436,6 @@ This section is not fully tested yet.
│ └── val_label.json (optional)
```

-## FUNSD
-
-```{warning}
-This section is not fully tested yet.
-```
-
-- Step1: Download [dataset.zip](https://guillaumejaume.github.io/FUNSD/dataset.zip) to `funsd/`. 
- - ```bash - mkdir funsd && cd funsd - - # Download FUNSD dataset - wget https://guillaumejaume.github.io/FUNSD/dataset.zip - unzip -q dataset.zip - - # For images - mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/ - - # For annotations - mkdir annotations - mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test - - rm dataset.zip && rm -rf dataset - ``` - -- Step2: Generate `train_labels.json` and `test_labels.json` and crop images using 4 processes with following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts): - - ```bash - python tools/dataset_converters/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4 - ``` - -- After running the above codes, the directory structure - should be as follows: - - ```text - ├── funsd - │ ├── imgs - │ ├── crops - │ ├── annotations - │ ├── train_labels.json - │ └── test_labels.json - ``` - ## IMGUR ```{warning} @@ -855,46 +553,6 @@ This section is not fully tested yet. │ └── val_label.json (optional) ``` -## COCO Text v2 - -```{warning} -This section is not fully tested yet. -``` - -- Step1: Download image [train2014.zip](http://images.cocodataset.org/zips/train2014.zip) and annotation [cocotext.v2.zip](https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip) to `coco_textv2/`. - - ```bash - mkdir coco_textv2 && cd coco_textv2 - mkdir annotations - - # Download COCO Text v2 dataset - wget http://images.cocodataset.org/zips/train2014.zip - wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip - unzip -q train2014.zip && unzip -q cocotext.v2.zip - - mv train2014 imgs && mv cocotext.v2.json annotations/ - - rm train2014.zip && rm -rf cocotext.v2.zip - ``` - -- Step2: Generate `train_labels.json` and `val_label.json` with the following command: - - ```bash - # Add --preserve-vertical to preserve vertical texts for training, otherwise - # vertical images will be filtered and stored in PATH/TO/mtwi/ignores - python tools/dataset_converters/textrecog/cocotext_converter.py PATH/TO/coco_textv2 --nproc 4 - ``` - -- After running the above codes, the directory structure should be as follows: - - ```text - ├── coco_textv2 - │ ├── crops - │ ├── ignores - │ ├── train_labels.json - │ └── val_label.json - ``` - ## ReCTS ```{warning} diff --git a/docs/zh_cn/user_guides/data_prepare/dataset_preparer.md b/docs/zh_cn/user_guides/data_prepare/dataset_preparer.md index 0af4bb17..e57cdf44 100644 --- a/docs/zh_cn/user_guides/data_prepare/dataset_preparer.md +++ b/docs/zh_cn/user_guides/data_prepare/dataset_preparer.md @@ -34,7 +34,7 @@ python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet --ov python tools/dataset_converters/prepare_dataset.py icdar2015 totaltext --task textrecog --overwrite-cfg ``` -进一步了解 MMOCR 支持的数据集,您可以浏览[支持的数据集文档](./datasetzoo.md) +进一步了解 Dataset Preparer 支持的数据集,您可以浏览[支持的数据集文档](./datasetzoo.md)。一些需要手动准备的数据集也列在了 [文字检测](./det.md) 和 [文字识别](./recog.md) 内。 ## 进阶用法 diff --git a/docs/zh_cn/user_guides/data_prepare/det.md b/docs/zh_cn/user_guides/data_prepare/det.md index 960122a9..a2b0aa40 100644 --- a/docs/zh_cn/user_guides/data_prepare/det.md +++ b/docs/zh_cn/user_guides/data_prepare/det.md @@ -1,42 +1,15 @@ # 文字检测 +```{warning} +该页面版本落后于英文版文档,请切换至英文阅读最新文档。 +``` + ```{note} -该页面内容已经过时,所有有关数据格式转换相关的脚本都将最终迁移至数据准备器 **dataset preparer**,这个全新设计的模块能够极大地方便用户完成冗长的数据准备步骤,详见[相关文档](./dataset_preparer.md)。 +我们正努力往 [Dataset Preparer](./dataset_preparer.md) 中增加更多数据集。对于 [Dataset 
Preparer](./dataset_preparer.md) 中增加更多数据集。对于 [Dataset Preparer](./dataset_preparer.md) 暂未能完整支持的数据集,本页提供了一系列手动下载的步骤,供有需要的用户使用。
```

## 概览

-文字检测任务的数据集应按如下目录配置:
-
-```text
-├── ctw1500
-│   ├── annotations
-│   ├── imgs
-│   ├── instances_test.json
-│   └── instances_training.json
-├── icdar2015
-│   ├── imgs
-│   ├── instances_test.json
-│   └── instances_training.json
-├── icdar2017
-│   ├── imgs
-│   ├── instances_training.json
-│   └── instances_val.json
-├── synthtext
-│   ├── imgs
-│   └── instances_training.lmdb
-│       ├── data.mdb
-│       └── lock.mdb
-├── textocr
-│   ├── train
-│   ├── instances_training.json
-│   └── instances_val.json
-├── totaltext
-│   ├── imgs
-│   ├── instances_test.json
-│   └── instances_training.json
-```
-
| 数据集名称 | 数据图片 | | 标注文件 | |
| :--------: | :-----------------------------------------------: | :-------------------------------------------: | :------------------------------------------------: | :--------------------------------------------: |
| | | 训练集 (training) | 验证集 (validation) | 测试集 (testing) |
diff --git a/docs/zh_cn/user_guides/data_prepare/kie.md b/docs/zh_cn/user_guides/data_prepare/kie.md
index 6a707287..ce8d1462 100644
--- a/docs/zh_cn/user_guides/data_prepare/kie.md
+++ b/docs/zh_cn/user_guides/data_prepare/kie.md
@@ -1,7 +1,7 @@
# 关键信息提取

```{note}
-该页面内容已经过时,所有有关数据格式转换相关的脚本都将最终迁移至数据准备器 **dataset preparer**,这个全新设计的模块能够极大地方便用户完成冗长的数据准备步骤,详见[相关文档](./dataset_preparer.md)。
+我们正努力往 [Dataset Preparer](./dataset_preparer.md) 中增加更多数据集。对于 [Dataset Preparer](./dataset_preparer.md) 暂未能完整支持的数据集,本页提供了一系列手动下载的步骤,供有需要的用户使用。
```

## 概览
diff --git a/docs/zh_cn/user_guides/data_prepare/recog.md b/docs/zh_cn/user_guides/data_prepare/recog.md
index f10a0da8..184426b4 100644
--- a/docs/zh_cn/user_guides/data_prepare/recog.md
+++ b/docs/zh_cn/user_guides/data_prepare/recog.md
@@ -1,7 +1,11 @@
# 文字识别

+```{warning}
+该页面版本落后于英文版文档,请切换至英文阅读最新文档。
+```
+
```{note}
-该页面内容已经过时,所有有关数据格式转换相关的脚本都将最终迁移至数据准备器 **dataset preparer**,这个全新设计的模块能够极大地方便用户完成冗长的数据准备步骤,详见[相关文档](./dataset_preparer.md)。
+我们正努力往 [Dataset Preparer](./dataset_preparer.md) 中增加更多数据集。对于 [Dataset Preparer](./dataset_preparer.md) 暂未能完整支持的数据集,本页提供了一系列手动下载的步骤,供有需要的用户使用。
```

## 概览