diff --git a/README_ch.md b/README_ch.md
index d3b26ee9d..6bb1264b0 100755
--- a/README_ch.md
+++ b/README_ch.md
@@ -132,7 +132,7 @@ PaddleOCR aims to build a rich, leading, and practical OCR tool library, helping
  - [Handwritten Chinese OCR datasets](./doc/doc_ch/handwritten_datasets.md)
  - [Vertical-domain and multilingual OCR datasets](./doc/doc_ch/vertical_and_multilingual_datasets.md)
  - [Layout analysis datasets](./doc/doc_ch/layout_datasets.md)
- - [Table recognition datasets](./doc/doc_ch/table_datasets.md)
+ - [Table recognition datasets](doc/doc_ch/dataset/table_datasets.md)
  - [DocVQA datasets](./doc/doc_ch/docvqa_datasets.md)
  - [Code structure](./doc/doc_ch/tree.md)
  - [Visualization](#效果展示)
@@ -160,13 +160,13 @@ PaddleOCR aims to build a rich, leading, and practical OCR tool library, helping
-
+
 PP-OCRv2 English model
-
+
@@ -176,12 +176,12 @@ PaddleOCR aims to build a rich, leading, and practical OCR tool library, helping
 PP-OCRv2 models for other languages
-
+
-
+
@@ -196,8 +196,8 @@ PaddleOCR aims to build a rich, leading, and practical OCR tool library, helping
-
-- RE (relation extraction)
+
+- RE (relation extraction)
diff --git a/doc/datasets/table_PubTabNet_demo/PMC524509_007_00.png b/doc/datasets/table_PubTabNet_demo/PMC524509_007_00.png
new file mode 100755
index 000000000..5b9d631cb
Binary files /dev/null and b/doc/datasets/table_PubTabNet_demo/PMC524509_007_00.png differ
diff --git a/doc/datasets/table_PubTabNet_demo/PMC535543_007_01.png b/doc/datasets/table_PubTabNet_demo/PMC535543_007_01.png
new file mode 100755
index 000000000..e808de72d
Binary files /dev/null and b/doc/datasets/table_PubTabNet_demo/PMC535543_007_01.png differ
diff --git a/doc/datasets/table_tal_demo/1.jpg b/doc/datasets/table_tal_demo/1.jpg
new file mode 100644
index 000000000..e7ddd6d1d
Binary files /dev/null and b/doc/datasets/table_tal_demo/1.jpg differ
diff --git a/doc/datasets/table_tal_demo/2.jpg b/doc/datasets/table_tal_demo/2.jpg
new file mode 100644
index 000000000..e7ddd6d1d
Binary files /dev/null and b/doc/datasets/table_tal_demo/2.jpg differ
diff --git a/doc/doc_ch/dataset/table_datasets.md b/doc/doc_ch/dataset/table_datasets.md
new file mode 100644
index 000000000..98e7f7ed6
--- /dev/null
+++ b/doc/doc_ch/dataset/table_datasets.md
@@ -0,0 +1,35 @@
+# Table Recognition Datasets
+
+- [Table Recognition Datasets](#table-recognition-datasets)
+  - [Dataset Summary](#dataset-summary)
+  - [1. PubTabNet Dataset](#1-pubtabnet-dataset)
+  - [2. TAL Table Recognition Competition Dataset](#2-tal-table-recognition-competition-dataset)
+
+This page collects commonly used table recognition datasets and is updated continuously. Contributions of new datasets are welcome.
+Layout analysis datasets are mostly object detection datasets. Besides open-source data, users can also create their own data with tools such as [labelme](https://github.com/wkentaro/labelme).
+
+## Dataset Summary
+
+| Dataset | Image download link | PPOCR annotation download link |
+|---|---|---|
+| PubTabNet | https://github.com/ibm-aur-nlp/PubTabNet | JSONL format; can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
+| TAL Table Recognition Competition Dataset | https://ai.100tal.com/dataset | JSONL format; can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
+
+## 1. PubTabNet Dataset
+- **Data Introduction**: The training set of the PubTabNet dataset contains 500,000 images, and the validation set contains 9,000 images. Some sample images are shown below.
+
+
+
+
+
+
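+
+- **Note**: Use of this dataset is subject to the terms of the [CDLA-Permissive](https://cdla.io/permissive-1-0/) license.
+
+## 2. TAL Table Recognition Competition Dataset
+- **Data Introduction**: The training set of the TAL table recognition competition dataset contains 16,000 images. The validation set does not provide annotations that can be used for training.
+
+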
+
+
+
diff --git a/doc/doc_ch/table_datasets.md b/doc/doc_ch/table_datasets.md
deleted file mode 100644
index e69de29bb..000000000
diff --git a/doc/doc_en/dataset/table_datasets_en.md b/doc/doc_en/dataset/table_datasets_en.md
new file mode 100644
index 000000000..91d76a710
--- /dev/null
+++ b/doc/doc_en/dataset/table_datasets_en.md
@@ -0,0 +1,33 @@
+# Table Recognition Datasets
+
+- [Table Recognition Datasets](#table-recognition-datasets)
+  - [Dataset Summary](#dataset-summary)
+  - [1. PubTabNet](#1-pubtabnet)
+  - [2. TAL Table Recognition Competition Dataset](#2-tal-table-recognition-competition-dataset)
+
+This page collects commonly used table recognition datasets and is updated continuously. Contributions of new datasets are welcome.
+
+## Dataset Summary
+
+| Dataset | Image download link | PPOCR-format annotation download link |
+|---|---|---|
+| PubTabNet | https://github.com/ibm-aur-nlp/PubTabNet | JSONL format; can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
+| TAL Table Recognition Competition Dataset | https://ai.100tal.com/dataset | JSONL format; can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
+
+## 1. PubTabNet
+- **Data Introduction**: The training set of the PubTabNet dataset contains 500,000 images, and the validation set contains 9,000 images. Some sample images are shown below.
+
+
+
+
+
+
+- **Note**: Use of this dataset is subject to the terms of the [CDLA-Permissive](https://cdla.io/permissive-1-0/) license.
+
+## 2. TAL Table Recognition Competition Dataset
+- **Data Introduction**: The training set of the TAL table recognition competition dataset contains 16,000 images. The validation set does not provide annotations that can be used for training.
+
+
+
+
+