add dataset desc
parent 02e881e508
commit d69b74e434
@ -3,6 +3,7 @@
- [Dataset Summary](#数据集汇总)
- [1. PubTabNet dataset](#1-pubtabnet数据集)
- [2. TAL Table Recognition Competition Dataset](#2-好未来表格识别竞赛数据集)
- [3. WTW Chinese scene table dataset](#3-wtw中文场景表格数据集)

This page collects commonly used table recognition datasets and is updated continuously. Contributions of new datasets are welcome.

@ -12,6 +13,7 @@
|---|---|---|
| PubTabNet |https://github.com/ibm-aur-nlp/PubTabNet| jsonl format; can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
| TAL Table Recognition Competition Dataset |https://ai.100tal.com/dataset| jsonl format; can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
| WTW Chinese scene table dataset |https://github.com/wangwen-whu/WTW-Dataset| Must be converted before it can be loaded with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |

## 1. PubTabNet dataset

- **Data introduction**: The PubTabNet training set contains 500,000 images and the validation set contains 9,000 images. Some sample images are visualized below.
@ -31,3 +33,12 @@
<img src="../../datasets/table_tal_demo/1.jpg" width="500">
<img src="../../datasets/table_tal_demo/2.jpg" width="500">
</div>

## 3. WTW Chinese scene table dataset

- **Data introduction**: The WTW Chinese scene table dataset consists of two parts, table detection and table data, and contains images from two kinds of scenes: scanned and photographed.

https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif

<div align="center">
<img src="https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif" width="500">
</div>
@ -3,6 +3,7 @@
- [Dataset Summary](#dataset-summary)
- [1. PubTabNet](#1-pubtabnet)
- [2. TAL Table Recognition Competition Dataset](#2-tal-table-recognition-competition-dataset)
- [3. WTW Chinese scene table dataset](#3-wtw-chinese-scene-table-dataset)

This page lists commonly used table recognition datasets and is updated continuously. Contributions of new datasets are welcome.

@ -12,6 +13,7 @@ Here are the commonly used table recognition datasets, which are being updated c
|---|---|---|
| PubTabNet |https://github.com/ibm-aur-nlp/PubTabNet| jsonl format, which can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
| TAL Table Recognition Competition Dataset |https://ai.100tal.com/dataset| jsonl format, which can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
| WTW Chinese scene table dataset |https://github.com/wangwen-whu/WTW-Dataset| Must be converted before it can be loaded with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |

## 1. PubTabNet

- **Data Introduction**: The PubTabNet training set contains 500,000 images and the validation set contains 9,000 images. Some sample images are shown below.
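
The table above describes the PubTabNet annotations as jsonl, meaning one JSON object per line. The following is a minimal sketch of reading such a file with the Python standard library only; it is not the logic of `pubtab_dataset.py`. The field names `filename`, `split`, and `html.structure.tokens` are assumptions made for illustration, and the two records are invented in-memory samples rather than real dataset entries.

```python
import io
import json
from collections import Counter


def iter_jsonl(stream):
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two invented sample records; the field names are assumptions, not taken from the source.
sample = io.StringIO(
    '{"filename": "a.png", "split": "train", '
    '"html": {"structure": {"tokens": ["<tr>", "<td>", "</td>", "</tr>"]}}}\n'
    '{"filename": "b.png", "split": "val", '
    '"html": {"structure": {"tokens": ["<tr>", "</tr>"]}}}\n'
)

records = list(iter_jsonl(sample))

# Count how many records belong to each split.
split_counts = Counter(record["split"] for record in records)
```

To read a real annotation file, the same generator can be passed an open file handle, for example `open(path, encoding="utf-8")`, where `path` is whatever location the annotations were downloaded to.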
@ -30,3 +32,11 @@ Here are the commonly used table recognition datasets, which are being updated c
<img src="../../datasets/table_tal_demo/1.jpg" width="500">
<img src="../../datasets/table_tal_demo/2.jpg" width="500">
</div>

## 3. WTW Chinese scene table dataset

- **Data Introduction**: The WTW Chinese scene table dataset consists of two parts: table detection and table data. The dataset contains images from two kinds of scenes, scanned and photographed.

https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif

<div align="center">
<img src="https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif" width="500">
</div>
@ -63,7 +63,16 @@ After the operation is completed, the excel table of each image will be saved to
In this chapter, we only introduce the training of the table structure model. For model training of [text detection](../../doc/doc_en/detection_en.md) and [text recognition](../../doc/doc_en/recognition_en.md), please refer to the corresponding documents.

* Data preparation

The training data uses the public dataset [PubTabNet](https://arxiv.org/abs/1911.10683), which can be downloaded from the official [website](https://github.com/ibm-aur-nlp/PubTabNet). The PubTabNet dataset contains about 500,000 images, as well as annotations in HTML format.

The Chinese model and the English model use different data sources, as follows:

English dataset: The training data uses the public dataset [PubTabNet](https://arxiv.org/abs/1911.10683), which can be downloaded from the official [website](https://github.com/ibm-aur-nlp/PubTabNet). The PubTabNet dataset contains about 500,000 images, as well as annotations in HTML format.

Chinese dataset: The Chinese dataset consists of the following two parts, which are trained with a 1:1 sampling ratio.
> 1. Generated dataset: Use the [Table Generation Tool](https://github.com/WenmuZhou/TableGeneration) to generate 40,000 images.
> 2. Crop 10,000 images from [WTW](https://github.com/wangwen-whu/WTW-Dataset).

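
The text above does not say how the 1:1 sampling ratio is applied during training. As one possible reading, the sketch below draws the same number of samples from each part and shuffles them together, so the smaller WTW part is seen as often as the larger generated part. The function name `sample_one_to_one`, the file names, and the list sizes are all hypothetical stand-ins, and this is not the training code used by the repository.

```python
import random


def sample_one_to_one(generated, cropped, num_samples, seed=0):
    """Draw num_samples // 2 items from each source (a 1:1 mix), then shuffle."""
    rng = random.Random(seed)
    half = num_samples // 2
    # Sample with replacement so the smaller source can supply as many items as the larger one.
    picks = [("generated", rng.choice(generated)) for _ in range(half)]
    picks += [("wtw", rng.choice(cropped)) for _ in range(half)]
    rng.shuffle(picks)
    return picks


# Hypothetical stand-ins for the 40,000 generated images and the 10,000 WTW crops.
generated = [f"gen_{i}.jpg" for i in range(40)]
cropped = [f"wtw_{i}.jpg" for i in range(10)]

epoch = sample_one_to_one(generated, cropped, num_samples=20)
```

Sampling with replacement is a deliberate simplification here; a real pipeline might instead cycle through the smaller set or subsample the larger one.
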
For a detailed introduction to the public datasets, please refer to [table_datasets](../../doc/doc_en/dataset/table_datasets_en.md). The following training and evaluation procedures use the English dataset as an example.

* Start training

*If you installed the CPU version of Paddle, set the `use_gpu` field in the configuration file to `false`.*
@ -75,7 +75,15 @@ note: 上述模型是在 PubLayNet 数据集上训练的表格识别模型,仅
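
* Data preparation

The training data uses the public dataset PubTabNet ([paper](https://arxiv.org/abs/1911.10683), [download](https://github.com/ibm-aur-nlp/PubTabNet)). The PubTabNet dataset contains about 500,000 table images, along with HTML-format annotations for each image.

The Chinese model and the English model use different data sources, described below.

English dataset: The training data uses the public dataset PubTabNet ([paper](https://arxiv.org/abs/1911.10683), [download](https://github.com/ibm-aur-nlp/PubTabNet)). The PubTabNet dataset contains about 500,000 table images, along with HTML-format annotations for each image.

Chinese dataset: The Chinese dataset consists of the following two parts, which are sampled at a 1:1 ratio during training.
> 1. Generated dataset: 40,000 images generated with the [table generation tool](https://github.com/WenmuZhou/TableGeneration).
> 2. 10,000 images taken from [WTW](https://github.com/wangwen-whu/WTW-Dataset).

For a detailed introduction to the public datasets, see [table_datasets](../../doc/doc_ch/dataset/table_datasets.md). The training and evaluation procedures below use the English dataset as an example.

* Start training
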