[English](README.md) | 简体中文

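# PP-Structure Document Analysis

- [1. Introduction](#1)
- [2. Features](#2)
- [3. Results](#3)
  - [3.1 Layout analysis and table recognition](#31)
  - [3.2 Layout recovery](#32)
  - [3.3 Key information extraction](#33)
- [4. Quick start](#4)
- [5. Model zoo](#5)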

<a name="1"></a>
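## 1. Introduction

PP-Structure is an intelligent document analysis system developed by the PaddleOCR team. It helps developers complete document understanding tasks such as layout analysis and table recognition.

The PP-StructureV2 pipeline is shown in the flowchart below. A document image first passes through the image correction module, which determines the orientation of the whole image and rotates it upright. Two kinds of tasks can then be run: layout information analysis and key information extraction.
- In the layout analysis task, the image first passes through the layout analysis model, which divides it into regions such as text, table, and figure. Each region is then recognized separately: for example, table regions are sent to the table recognition module for structured recognition, and text regions are sent to the OCR engine for text recognition. Finally, the layout recovery module restores the result to a Word or PDF file with the same layout as the original image.
- In the key information extraction task, the OCR engine first extracts the text content, the semantic entity recognition module then identifies the semantic entities in the image, and finally the relation extraction module determines the correspondence between those entities, yielding the required key information.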

<img src="https://user-images.githubusercontent.com/14270174/195265734-6f4b5a7f-59b1-4fcc-af6d-89afc9bd51e1.jpg" width="100%"/>
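The layout analysis branch described above can be sketched roughly as follows. This is a minimal illustration and not the PP-Structure implementation: `Region`, `run_ocr`, `run_table_recognition`, `keep_crop`, `HANDLERS`, and `analyze` are hypothetical names, the handlers are stubs that return placeholder strings, and sorting regions top-to-bottom is only a crude stand-in for the real layout recovery module.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


@dataclass
class Region:
    kind: str  # region type reported by layout analysis, e.g. "text" or "table"
    bbox: BBox


# Hypothetical stand-ins for the real recognition modules.
def run_ocr(region: Region) -> str:
    return f"<ocr text for {region.bbox}>"


def run_table_recognition(region: Region) -> str:
    return "<table><tr><td>...</td></tr></table>"


def keep_crop(region: Region) -> str:
    return f"<image crop {region.bbox}>"


# Each region type is routed to the module that handles it.
HANDLERS: Dict[str, Callable[[Region], str]] = {
    "text": run_ocr,
    "title": run_ocr,
    "table": run_table_recognition,
    "figure": keep_crop,
}


def analyze(regions: List[Region]) -> List[dict]:
    """Route each region to its handler, then order the results
    top-to-bottom, left-to-right (assumed reading order)."""
    results = [
        {"kind": r.kind, "bbox": r.bbox, "content": HANDLERS[r.kind](r)}
        for r in regions
    ]
    return sorted(results, key=lambda item: (item["bbox"][1], item["bbox"][0]))


if __name__ == "__main__":
    page = [
        Region("table", (40, 300, 560, 520)),
        Region("title", (40, 20, 560, 60)),
        Region("text", (40, 80, 560, 280)),
    ]
    for item in analyze(page):
        print(item["kind"], item["bbox"])
```

Keeping the routing in a dictionary means a new region type only needs a new handler entry; the modules themselves stay independent, which matches the idea that each module can also be used on its own.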

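More technical details: 👉 PP-StructureV2 technical report [Chinese](docs/PP-StructureV2_introduction.md), [English](https://arxiv.org/abs/2210.05391).

PP-StructureV2 modules can be used independently or combined flexibly; for example, layout analysis or table recognition can each be used on its own. Follow the links below for the tutorial for each module:

- [Layout analysis](layout/README_ch.md)
- [Table recognition](table/README_ch.md)
- [Key information extraction](kie/README_ch.md)
- [Layout recovery](recovery/README_ch.md)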

<a name="2"></a>
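## 2. Features

The main features of PP-StructureV2 are:
- Supports layout analysis of documents in image or PDF form, dividing them into regions such as **text, title, table, figure, and formula**;
- Supports general **table detection** for Chinese and English documents;
- Supports structured recognition of table regions, with the final result exported as an **Excel file**;
- Supports multimodal key information extraction (KIE) tasks: **semantic entity recognition** (SER) and **relation extraction** (RE);
- Supports **layout recovery**, that is, restoring the result to a Word or PDF file with the same layout as the original image;
- Supports custom training and several inference and deployment options, including calling the Python whl package;
- Integrated with the semi-automatic data annotation tool PPOCRLabel, which supports annotation for three tasks: layout analysis, table recognition, and SER.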
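
To make the table export feature more concrete, here is a minimal sketch of turning a recognized table into rows of cells. It assumes the table recognition result is available as an HTML `<table>` string, and it writes CSV with the standard library as a stand-in for the Excel export; `TableRows`, `html_table_to_rows`, and `rows_to_csv` are hypothetical helper names, and merged cells (`rowspan`/`colspan`) are ignored.

```python
import csv
import io
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")      # start a new empty cell
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def html_table_to_rows(html: str) -> List[List[str]]:
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows: List[List[str]]) -> str:
    # CSV is used here only because it needs no third-party package.
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    html = "<table><tr><th>Item</th><th>Qty</th></tr><tr><td>Pen</td><td>2</td></tr></table>"
    print(rows_to_csv(html_table_to_rows(html)))
```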

<a name="3"></a>
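## 3. Results
PP-StructureV2 modules can be used independently or combined flexibly; for example, layout analysis or table recognition can each be used on its own. Only a few representative usage patterns are visualized here.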

<a name="31"></a>
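### 3.1 Layout analysis and table recognition
The figure below shows the overall flow of layout analysis plus table recognition. Layout analysis first divides the image into four kinds of regions: figure, text, title, and table. OCR detection and recognition then run on the figure, text, and title regions, and table recognition runs on the table regions. Figure regions are also saved for later use.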
<img src="./docs/table/ppstructure.GIF" width="100%"/>

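#### 3.1.1 Returning single-character coordinates from layout recognition
The figure below shows text localization based on the layout analysis from the previous section. See the [documentation](./return_word_pos.md) for details.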
![show_0_mdf_v2](https://github.com/PaddlePaddle/PaddleOCR/assets/43341135/799450d4-d2c5-4b61-b490-e160dc0f515c)


<a name="32"></a>
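### 3.2 Layout recovery
The figure below shows layout recovery based on the layout analysis and table recognition results from the previous section.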
<img src="./docs/recovery/recovery.jpg" width="100%"/>

<a name="33"></a>
### 3.3 Key information extraction

* SER

Boxes of different colors in the figures denote different categories.

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185539141-68e71c75-5cf7-4529-b2ca-219d29fa5f68.jpg" width="600">
</div>

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185310636-6ce02f7c-790d-479f-b163-ea97a5a04808.jpg" width="600">
</div>

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185539517-ccf2372a-f026-4a7c-ad28-c741c770f60a.png" width="600">
</div>

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/197464552-69de557f-edff-4c7f-acbf-069df1ba097f.png" width="600">
</div>

<div align="center">
    <img src="https://user-images.githubusercontent.com/25809855/186095702-9acef674-12af-4d09-97fc-abf4ab32600e.png" width="600">
</div>

* RE

In the figures, red boxes mark a `question`, blue boxes mark an `answer`, and a green line connects each `question` to its `answer`.

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185393805-c67ff571-cf7e-4217-a4b0-8b396c4f22bb.jpg" width="600">
</div>

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185540080-0431e006-9235-4b6d-b63d-0b3c6e1de48f.jpg" width="600">
</div>

<div align="center">
    <img src="https://user-images.githubusercontent.com/25809855/186094813-3a8e16cc-42e5-4982-b9f4-0134dfb5688d.png" width="600">
</div>

<div align="center">
    <img src="https://user-images.githubusercontent.com/25809855/186095641-5843b4da-34d7-4c1c-943a-b1036a859fe3.png" width="600">
</div>
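
The SER and RE results visualized above can be thought of as labeled entities plus links between them. The sketch below is one possible representation, not the format PP-Structure actually returns: the `Entity` class, the `link_pairs` helper, and the label strings `"QUESTION"`, `"ANSWER"`, and `"HEADER"` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Entity:
    text: str
    label: str                       # assumed label set: "QUESTION", "ANSWER", ...
    bbox: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def link_pairs(
    entities: List[Entity], links: List[Tuple[int, int]]
) -> Dict[str, List[str]]:
    """Turn (question_index, answer_index) links into a key-value mapping."""
    fields: Dict[str, List[str]] = {}
    for q_idx, a_idx in links:
        question, answer = entities[q_idx], entities[a_idx]
        if question.label != "QUESTION" or answer.label != "ANSWER":
            continue  # skip links that do not connect a question to an answer
        fields.setdefault(question.text, []).append(answer.text)
    return fields


if __name__ == "__main__":
    entities = [
        Entity("Name", "QUESTION", (0, 0, 40, 12)),
        Entity("Alice", "ANSWER", (50, 0, 100, 12)),
        Entity("Application form", "HEADER", (0, 20, 120, 34)),
    ]
    print(link_pairs(entities, [(0, 1)]))
```

A question maps to a list of answers because one field may be linked to several answer boxes.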

<a name="4"></a>
## 4. Quick start

See the [quick start](./docs/quickstart.md) tutorial.

<a name="5"></a>
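## 5. Model zoo

Some tasks need both a structure analysis model and an OCR model. Table recognition, for example, uses a table recognition model for structural parsing and an OCR model to recognize the text inside the table. Choose models according to your needs.

For structure analysis models, see:
- [PP-Structure model zoo](./docs/models_list.md)

For OCR models, see:
- [PP-OCR model zoo](../doc/doc_ch/models_list.md)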