mirror of https://github.com/PaddlePaddle/PaddleOCR.git synced 2025-06-03 21:53:39 +08:00

2024-07-24 20:00:15 +08:00

Table Recognition Datasets

This page lists commonly used table recognition datasets and is updated continuously. Contributions of additional datasets are welcome.

Dataset Summary

Dataset	Image download link	PPOCR-format annotation download link
PubTabNet	https://github.com/ibm-aur-nlp/PubTabNet	jsonl format, which can be loaded directly with pubtab_dataset.py
TAL Table Recognition Competition Dataset	https://ai.100tal.com/dataset	jsonl format, which can be loaded directly with pubtab_dataset.py
WTW Chinese scene table dataset	https://github.com/wangwen-whu/WTW-Dataset	Conversion is required to load with pubtab_dataset.py
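
Since two of the annotation sets above are distributed as jsonl (one JSON object per line), it can help to inspect a file before pointing a training config at it. The following is a minimal sketch using only the Python standard library; it does not use or reproduce pubtab_dataset.py. The helper names `iter_jsonl` and `count_by_split` are hypothetical, and the `split` field name is an assumption that may not match every annotation file.

```python
import json
from collections import Counter
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line so a broken file is easy to fix.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


def count_by_split(path, split_key="split"):
    """Count records per split.

    `split_key` is an assumed field name; records that lack it are
    counted under "unknown".
    """
    return Counter(rec.get(split_key, "unknown") for rec in iter_jsonl(path))
```

Calling `count_by_split("annotations.jsonl")` on a local file (the path here is a placeholder) returns a `Counter` keyed by split name, which gives a quick sanity check that the file parses and that the record counts look plausible.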

1. PubTabNet

Data introduction: The training set of the PubTabNet dataset contains about 500,000 images and the validation set about 9,000 images. Some sample images are shown below.
Note: Use of this dataset is subject to the CDLA-Permissive license.

2. TAL Table Recognition Competition Dataset

Data introduction: The training set of the TAL table recognition competition dataset contains 16,000 images. Annotations usable for training are not provided for the validation set.

3. WTW Chinese scene table dataset

Data introduction: The WTW Chinese scene table dataset consists of two parts: table detection and table data. Its images come from two kinds of scenes: scanned documents and photographs.
