pdf to markdown document (#13942)

2025-06-03 21:53:39 +08:00 · 2024-10-07 09:25:21 +08:00 · 2024-10-07 09:25:21 +08:00 · 8728b47046
commit 8728b47046
parent fa3f7dbc72
5 changed files with 292 additions and 66 deletions
--- a/docs/ppstructure/quick_start.en.md
+++ b/docs/ppstructure/quick_start.en.md
@@ -85,6 +85,20 @@ Recovery by using OCR：
 paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
 ```

+#### 2.1.7 layout recovery(PDF to Markdown)
+
+Without the LaTeXOCR model for formula recognition:
+
+```bash linenums="1"
+paddleocr --image_dir=ppstructure/docs/recovery/UnrealText.pdf --type=structure --recovery=true --recovery_to_markdown=true --lang='en'
+```
+
+With the LaTeXOCR model for formula recognition (the Chinese layout model must be used):
+
+```bash linenums="1"
+paddleocr --image_dir=ppstructure/docs/recovery/UnrealText.pdf --type=structure --recovery=true --formula=true --recovery_to_markdown=true --lang='ch'
+```
+
 ### 2.2 Use by python script

 #### 2.2.1 image orientation + layout analysis + table recognition
@@ -271,6 +285,35 @@ res = sorted_layout_boxes(result, w)
 convert_info_docx(img, res, save_folder, os.path.basename(img_path).split('.')[0])
 ```

+#### 2.2.7 layout recovery(PDF to Markdown)
+
+```python linenums="1"
+import os
+import cv2
+from paddleocr import PPStructure,save_structure_res
+from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes
+from paddleocr.ppstructure.recovery.recovery_to_markdown import convert_info_markdown
+
+# Chinese image
+table_engine = PPStructure(recovery=True)
+# English image
+# table_engine = PPStructure(recovery=True, lang='en')
+
+save_folder = './output'
+img_path = 'ppstructure/docs/table/1.png'
+img = cv2.imread(img_path)
+result = table_engine(img)
+save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
+
+for line in result:
+    line.pop('img')  # drop each region's image array so the printed dict stays readable
+    print(line)
+
+h, w, _ = img.shape
+res = sorted_layout_boxes(result, w)
+convert_info_markdown(res, save_folder, os.path.basename(img_path).split('.')[0])
+```
+
 ### 2.3 Result description

 The return of PP-Structure is a list of dicts, the example is as follows:
@@ -312,12 +355,14 @@ Please refer to: [Key Information Extraction](../ppocr/model_train/kie.en.md) .
 ### 2.4 Parameter Description

 | field                   | description                                                                                                                | default |
-|---|---|---|
+|-------------------------|----------------------------------------------------------------------------------------------------------------------------|---|
 | output                  | result save path                                                                                                           | ./output/table |
 | table_max_len           | long side of the image resize in table structure model                                                                     | 488 |
 | table_model_dir         | Table structure model inference model path                                                                                 | None |
 | table_char_dict_path    | The dictionary path of table structure model                                                                               | ../ppocr/utils/dict/table_structure_dict.txt  |
 | merge_no_span_structure | In the table recognition model, whether to merge '\<td>' and '\</td>'                                                      | False |
+| formula_model_dir       | Formula recognition model inference model path                                                                             | None                                          |
+| formula_char_dict_path  | The dictionary path of formula recognition model                                                                           | ../ppocr/utils/dict/latex_ocr_tokenizer.json |
 | layout_model_dir        | Layout analysis model inference model path                                                                                 | None |
 | layout_dict_path        | The dictionary path of layout analysis model                                                                               | ../ppocr/utils/dict/layout_publaynet_dict.txt |
 | layout_score_threshold  | The detection box threshold of layout analysis model                                                                       | 0.5|
@@ -329,8 +374,10 @@ Please refer to: [Key Information Extraction](../ppocr/model_train/kie.en.md) .
 | image_orientation       | Whether to perform image orientation classification in forward                                                             | False   |
 | layout                  | Whether to perform layout analysis in forward                                                                              | True   |
 | table                   | Whether to perform table recognition in forward                                                                            | True   | 
+| formula                 | Whether to perform formula recognition in forward                                                                          | False |
 | ocr                     | Whether to perform ocr for non-table areas in layout analysis. When layout is False, it will be automatically set to False | True |
 | recovery                | Whether to perform layout recovery in forward                                                                              | False |
+| recovery_to_markdown    | Whether to convert the layout recovery results into a markdown file                                                        | False |
 | save_pdf                | Whether to convert docx to pdf when recovery                                                                               | False |
 | structure_version       | Structure version, optional PP-structure and PP-structurev2                                                                | PP-structure |
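
The two commands in section 2.1.7 differ only in `--formula` and `--lang`. As a rough sketch, the helper below assembles the same argument list from the flags documented in this file and rejects the one combination the docs rule out, formula recognition with a non-Chinese layout model. `build_markdown_command` is a hypothetical name used for illustration and is not part of PaddleOCR.

```python
import shlex


def build_markdown_command(pdf_path, formula=False, lang="en"):
    """Assemble a `paddleocr` CLI call for PDF-to-Markdown layout recovery.

    Hypothetical helper; it only uses flags shown in the quick start.
    """
    if formula and lang != "ch":
        # The quick start says formula recognition needs the Chinese layout model.
        raise ValueError("formula recognition requires lang='ch'")
    args = [
        "paddleocr",
        f"--image_dir={pdf_path}",
        "--type=structure",
        "--recovery=true",
    ]
    if formula:
        args.append("--formula=true")
    args += ["--recovery_to_markdown=true", f"--lang={lang}"]
    return args


if __name__ == "__main__":
    cmd = build_markdown_command(
        "ppstructure/docs/recovery/UnrealText.pdf", formula=True, lang="ch"
    )
    # Print a shell-safe command line that can be copied into a terminal.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting.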

--- a/docs/ppstructure/quick_start.md
+++ b/docs/ppstructure/quick_start.md
@@ -103,6 +103,20 @@ paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=t
 paddleocr --image_dir=ppstructure/docs/recovery/UnrealText.pdf --type=structure --recovery=true --lang='en'
 ```

+#### 2.1.7 Layout recovery + conversion to a Markdown file
+
+Without the LaTeXOCR model for formula recognition:
+
+```bash linenums="1"
+paddleocr --image_dir=ppstructure/docs/recovery/UnrealText.pdf --type=structure --recovery=true --recovery_to_markdown=true --lang='en'
+```
+
+With the LaTeXOCR model for formula recognition (the Chinese layout model must be used):
+
+```bash linenums="1"
+paddleocr --image_dir=ppstructure/docs/recovery/UnrealText.pdf --type=structure --recovery=true --formula=true --recovery_to_markdown=true --lang='ch'
+```
+
 ### 2.2 Use by Python script

 #### 2.2.1 Image orientation classification + layout analysis + table recognition
@@ -289,6 +303,35 @@ res = sorted_layout_boxes(result, w)
 convert_info_docx(img, res, save_folder, os.path.basename(img_path).split('.')[0])
 ```

+#### 2.2.7 Layout recovery + conversion to a Markdown file
+
+```python linenums="1"
+import os
+import cv2
+from paddleocr import PPStructure,save_structure_res
+from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes
+from paddleocr.ppstructure.recovery.recovery_to_markdown import convert_info_markdown
+
+# Chinese test image
+table_engine = PPStructure(recovery=True)
+# English test image
+# table_engine = PPStructure(recovery=True, lang='en')
+
+save_folder = './output'
+img_path = 'ppstructure/docs/table/1.png'
+img = cv2.imread(img_path)
+result = table_engine(img)
+save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
+
+for line in result:
+    line.pop('img')
+    print(line)
+
+h, w, _ = img.shape
+res = sorted_layout_boxes(result, w)
+convert_info_markdown(res, save_folder, os.path.basename(img_path).split('.')[0])
+```
+
 ### 2.3 Result description

 PP-Structure returns a list of dicts; an example follows:
@@ -330,12 +373,14 @@ The fields in the dict are described as follows:
 ### 2.4 Parameter description

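 | field                   | description | default |
-| ----- | ---- | ------ |
+|-------------------------|-------------------------------------------------| ------ |
 | output                  | Result save path | ./output/table |
 | table_max_len           | Size the image's long side is resized to for table structure model prediction | 488 |
 | table_model_dir         | Table structure model inference model path | None |
 | table_char_dict_path    | Dictionary path used by the table structure model | ../ppocr/utils/dict/table_structure_dict.txt |
 | merge_no_span_structure | In the table recognition model, whether to merge '\<td>' and '\</td>' | False |
+| formula_model_dir       | Formula recognition model inference model path | None |
+| formula_char_dict_path  | Dictionary path used by the formula recognition model | ../ppocr/utils/dict/latex_ocr_tokenizer.json |
 | layout_model_dir        | Layout analysis model inference model path | None |
 | layout_dict_path        | Dictionary of the layout analysis model | ../ppocr/utils/dict/layout_publaynet_dict.txt |
 | layout_score_threshold  | Detection box threshold of the layout analysis model | 0.5 |
@@ -347,8 +392,10 @@ The fields in the dict are described as follows:
 | image_orientation       | Whether to perform image orientation classification in the forward pass | False |
 | layout                  | Whether to perform layout analysis in the forward pass | True |
 | table                   | Whether to perform table recognition in the forward pass | True |
+| formula                 | Whether to perform formula recognition in the forward pass | False |
 | ocr                     | Whether to perform OCR on non-table regions found by layout analysis. Automatically set to False when layout is False | True |
 | recovery                | Whether to perform layout recovery in the forward pass | False |
+| recovery_to_markdown    | Whether to convert the layout recovery result into a Markdown file | False |
 | save_pdf                | Whether to also export a PDF file when layout recovery exports the docx file | False |
 | structure_version       | Model version; options are PP-structure and PP-structurev2 | PP-structure |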

--- a/paddleocr.py
+++ b/paddleocr.py
@@ -66,6 +66,7 @@ from tools.infer.utility import draw_ocr, str2bool, check_gpu
 from ppstructure.utility import init_args, draw_structure_result
 from ppstructure.predict_system import StructureSystem, save_structure_res, to_excel
 from ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx
+from ppstructure.recovery.recovery_to_markdown import convert_info_markdown

 logger = get_logger()

@@ -79,6 +80,7 @@ __all__ = [
    "to_excel",
    "sorted_layout_boxes",
    "convert_info_docx",
+    "convert_info_markdown",
 ]

 SUPPORT_DET_MODEL = ["DB"]
@@ -356,6 +358,16 @@ MODEL_URLS = {
                    "dict_path": "ppocr/utils/dict/layout_dict/layout_cdla_dict.txt",
                },
            },
+            "formula": {
+                "en": {
+                    "url": "https://paddleocr.bj.bcebos.com/contribution/rec_latex_ocr_infer.tar",
+                    "dict_path": "ppocr/utils/dict/latex_ocr_tokenizer.json",
+                },
+                "ch": {
+                    "url": "https://paddleocr.bj.bcebos.com/contribution/rec_latex_ocr_infer.tar",
+                    "dict_path": "ppocr/utils/dict/latex_ocr_tokenizer.json",
+                },
+            },
        },
    },
 }
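
The new `formula` entry follows the same nesting as the other structure models: model group, structure version, model type, then language. A minimal sketch of how such a nested table might be queried is given below. `MODEL_URLS_SKETCH` reproduces only the entries added in this hunk, the `"PP-StructureV2"` version key is an assumption made for illustration, and `lookup_model_config` is a hypothetical stand-in rather than PaddleOCR's own `get_model_config`.

```python
FORMULA_URL = "https://paddleocr.bj.bcebos.com/contribution/rec_latex_ocr_infer.tar"
FORMULA_DICT = "ppocr/utils/dict/latex_ocr_tokenizer.json"

# Only the entries added in this hunk; the version key is assumed.
MODEL_URLS_SKETCH = {
    "STRUCTURE": {
        "PP-StructureV2": {
            "formula": {
                "en": {"url": FORMULA_URL, "dict_path": FORMULA_DICT},
                "ch": {"url": FORMULA_URL, "dict_path": FORMULA_DICT},
            },
        },
    },
}


def lookup_model_config(model_group, version, model_type, lang):
    """Walk the nested table and return the {'url', 'dict_path'} entry."""
    try:
        return MODEL_URLS_SKETCH[model_group][version][model_type][lang]
    except KeyError as missing:
        # Report which level of the lookup failed.
        raise KeyError(f"no model config for key {missing}") from None


if __name__ == "__main__":
    print(lookup_model_config("STRUCTURE", "PP-StructureV2", "formula", "ch"))
```

In the hunk above, both `en` and `ch` point at the same archive and tokenizer file, so the language choice does not change which formula model is fetched.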
@@ -396,6 +408,7 @@ def parse_args(mMain=True):
            "rec_char_dict_path",
            "table_char_dict_path",
            "layout_dict_path",
+            "formula_char_dict_path",
        ]:
            action.default = None
    if mMain:
@@ -845,12 +858,21 @@ class PPStructure(StructureSystem):
            os.path.join(BASE_DIR, "whl", "layout"),
            layout_model_config["url"],
        )
+        formula_model_config = get_model_config(
+            "STRUCTURE", params.structure_version, "formula", lang
+        )
+        params.formula_model_dir, formula_url = confirm_model_dir_url(
+            params.formula_model_dir,
+            os.path.join(BASE_DIR, "whl", "formula"),
+            formula_model_config["url"],
+        )
        # download model
        if not params.use_onnx:
            maybe_download(params.det_model_dir, det_url)
            maybe_download(params.rec_model_dir, rec_url)
            maybe_download(params.table_model_dir, table_url)
            maybe_download(params.layout_model_dir, layout_url)
+            maybe_download(params.formula_model_dir, formula_url)

        if params.rec_char_dict_path is None:
            params.rec_char_dict_path = str(
@@ -864,6 +886,10 @@ class PPStructure(StructureSystem):
            params.layout_dict_path = str(
                Path(__file__).parent / layout_model_config["dict_path"]
            )
+        if params.formula_char_dict_path is None:
+            params.formula_char_dict_path = str(
+                Path(__file__).parent / formula_model_config["dict_path"]
+            )
        logger.debug(params)
        super().__init__(params)
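
The `None` check in this hunk lets a user-supplied `formula_char_dict_path` take precedence over the packaged tokenizer file, which is why `parse_args` resets that default to `None`. A small sketch of the pattern follows, assuming a hypothetical `resolve_dict_path` helper and a `package_root` argument that stands in for `Path(__file__).parent`.

```python
from pathlib import Path


def resolve_dict_path(user_path, package_root, default_rel_path):
    """Return the user's path if given, else the default inside the package."""
    if user_path is not None:
        # An explicit setting always wins.
        return user_path
    # Fall back to the dictionary shipped next to the module.
    return str(Path(package_root) / default_rel_path)


if __name__ == "__main__":
    default = "ppocr/utils/dict/latex_ocr_tokenizer.json"
    print(resolve_dict_path(None, "/opt/paddleocr", default))
    print(resolve_dict_path("./my_tokenizer.json", "/opt/paddleocr", default))
```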

@@ -1005,6 +1031,8 @@ def main():
            if args.recovery and all_res != []:
                try:
                    convert_info_docx(img, all_res, args.output, img_name)
+                    if args.recovery_to_markdown:
+                        convert_info_markdown(all_res, args.output, img_name)
                except Exception as ex:
                    logger.error(
                        "error in layout recovery image:{}, err msg: {}".format(
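
The hunk above places the Markdown export inside the existing `try`, after `convert_info_docx`. The sketch below mirrors that gating with stand-in callables so the consequences are easy to see: Markdown is only written when `recovery` is also enabled, and an exception from the docx step skips the Markdown step. `export_recovery` and its parameters are hypothetical names, not PaddleOCR functions.

```python
import logging

logger = logging.getLogger("recovery_sketch")


def export_recovery(all_res, recovery, recovery_to_markdown, to_docx, to_markdown):
    """Mirror the gating in main(): docx first, then optional Markdown."""
    written = []
    if recovery and all_res != []:
        try:
            to_docx(all_res)
            written.append("docx")
            if recovery_to_markdown:
                to_markdown(all_res)
                written.append("markdown")
        except Exception as ex:
            # Same behaviour as the hunk: log the error and carry on.
            logger.error("error in layout recovery, err msg: %s", ex)
    return written


if __name__ == "__main__":
    regions = [{"type": "text"}]
    print(export_recovery(regions, True, True, lambda r: None, lambda r: None))
    print(export_recovery(regions, False, True, lambda r: None, lambda r: None))
```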
--- a/ppstructure/docs/quickstart.md
+++ b/ppstructure/docs/quickstart.md
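@@ -9,6 +9,7 @@
     - [2.1.4 Table recognition](#214-表格识别)
     - [2.1.5 Key information extraction](#215-关键信息抽取)
     - [2.1.6 Layout recovery](#216-版面恢复)
+    - [2.1.7 Layout recovery + conversion to a Markdown file](#217-版面恢复转换为markdown文件)
   - [2.2 Use by Python script](#22-Python脚本使用)
     - [2.2.1 Image orientation classification + layout analysis + table recognition](#221-图像方向分类版面分析表格识别)
     - [2.2.2 Layout analysis + table recognition](#222-版面分析表格识别)
@@ -16,6 +17,7 @@
     - [2.2.4 Table recognition](#224-表格识别)
     - [2.2.5 Key information extraction](#225-关键信息抽取)
     - [2.2.6 Layout recovery](#226-版面恢复)
+    - [2.2.7 Layout recovery + conversion to a Markdown file](#227-版面恢复转换为markdown文件)
   - [2.3 Result description](#23-返回结果说明)
     - [2.3.1 Layout analysis + table recognition](#231-版面分析表格识别)
     - [2.3.2 Key information extraction](#232-关键信息抽取)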
@@ -126,6 +128,22 @@ paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=t
 paddleocr --image_dir=ppstructure/docs/recovery/UnrealText.pdf --type=structure --recovery=true --lang='en'
 ```

+<a name="217"></a>
+
+#### 2.1.7 Layout recovery + conversion to a Markdown file
+
+Without the LaTeXOCR model for formula recognition:
+
+```bash linenums="1"
+paddleocr --image_dir=ppstructure/docs/recovery/UnrealText.pdf --type=structure --recovery=true --recovery_to_markdown=true --lang='en'
+```
+
+With the LaTeXOCR model for formula recognition (the Chinese layout model must be used):
+
+```bash linenums="1"
+paddleocr --image_dir=ppstructure/docs/recovery/UnrealText.pdf --type=structure --recovery=true --formula=true --recovery_to_markdown=true --lang='ch'
+```
+
 <a name="22"></a>

 ### 2.2 Use by Python script
@@ -322,6 +340,37 @@ res = sorted_layout_boxes(result, w)
 convert_info_docx(img, res, save_folder, os.path.basename(img_path).split('.')[0])
 ```

+<a name="227"></a>
+
+#### 2.2.7 Layout recovery + conversion to a Markdown file
+
+```python linenums="1"
+import os
+import cv2
+from paddleocr import PPStructure,save_structure_res
+from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes
+from paddleocr.ppstructure.recovery.recovery_to_markdown import convert_info_markdown
+
+# Chinese test image
+table_engine = PPStructure(recovery=True)
+# English test image
+# table_engine = PPStructure(recovery=True, lang='en')
+
+save_folder = './output'
+img_path = 'ppstructure/docs/table/1.png'
+img = cv2.imread(img_path)
+result = table_engine(img)
+save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
+
+for line in result:
+    line.pop('img')
+    print(line)
+
+h, w, _ = img.shape
+res = sorted_layout_boxes(result, w)
+convert_info_markdown(res, save_folder, os.path.basename(img_path).split('.')[0])
+```
+
 <a name="23"></a>
 ### 2.3 Result description
 PP-Structure returns a list of dicts; an example follows:
@@ -364,12 +413,14 @@ The fields in the dict are described as follows:
 ### 2.4 Parameter description

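 | field | description | default |
-|---|---|---|
+|---|-----------------------------------------------|---|
 | output | Result save path | ./output/table |
 | table_max_len | Size the image's long side is resized to for table structure model prediction | 488 |
 | table_model_dir | Table structure model inference model path | None |
 | table_char_dict_path | Dictionary path used by the table structure model | ../ppocr/utils/dict/table_structure_dict.txt |
 | merge_no_span_structure | In the table recognition model, whether to merge '\<td>' and '\</td>' | False |
+| formula_model_dir | Formula recognition model inference model path | None |
+| formula_char_dict_path | Dictionary path used by the formula recognition model | ../ppocr/utils/dict/latex_ocr_tokenizer.json |
 | layout_model_dir  | Layout analysis model inference model path | None |
 | layout_dict_path  | Dictionary of the layout analysis model | ../ppocr/utils/dict/layout_publaynet_dict.txt |
 | layout_score_threshold | Detection box threshold of the layout analysis model | 0.5 |
@@ -381,8 +432,10 @@ The fields in the dict are described as follows:
 | image_orientation | Whether to perform image orientation classification in the forward pass | False |
 | layout | Whether to perform layout analysis in the forward pass | True |
 | table  | Whether to perform table recognition in the forward pass | True |
+| formula | Whether to perform formula recognition in the forward pass | False |
 | ocr    | Whether to perform OCR on non-table regions found by layout analysis. Automatically set to False when layout is False | True |
 | recovery    | Whether to perform layout recovery in the forward pass | False |
+| recovery_to_markdown | Whether to convert the layout recovery result into a Markdown file | False |
 | save_pdf | Whether to also export a PDF file when layout recovery exports the docx file | False |
 | structure_version | Model version; options are PP-structure and PP-structurev2 | PP-structure |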

--- a/ppstructure/docs/quickstart_en.md
+++ b/ppstructure/docs/quickstart_en.md
@@ -9,6 +9,7 @@
    - [2.1.4 table recognition](#214-table-recognition)
    - [2.1.5 Key Information Extraction](#215-Key-Information-Extraction)
    - [2.1.6 layout recovery](#216-layout-recovery)
+    - [2.1.7 layout recovery(PDF to Markdown)](#217-layout-recoverypdf-to-markdown)
  - [2.2 Use by python script](#22-use-by-python-script)
    - [2.2.1 image orientation + layout analysis + table recognition](#221-image-orientation--layout-analysis--table-recognition)
    - [2.2.2 layout analysis + table recognition](#222-layout-analysis--table-recognition)
@@ -16,6 +17,7 @@
    - [2.2.4 table recognition](#224-table-recognition)
    - [2.2.5 Key Information Extraction](#225-Key-Information-Extraction)
    - [2.2.6 layout recovery](#226-layout-recovery)
+    - [2.2.7 layout recovery(PDF to Markdown)](#227-layout-recoverypdf-to-markdown)
  - [2.3 Result description](#23-result-description)
    - [2.3.1 layout analysis + table recognition](#231-layout-analysis--table-recognition)
    - [2.3.2 Key Information Extraction](#232-Key-Information-Extraction)
@@ -110,6 +112,21 @@ Recovery by using OCR：
 paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
 ```

+<a name="217"></a>
+#### 2.1.7 layout recovery(PDF to Markdown)
+
+Without the LaTeXOCR model for formula recognition:
+
+```bash linenums="1"
+paddleocr --image_dir=ppstructure/docs/recovery/UnrealText.pdf --type=structure --recovery=true --recovery_to_markdown=true --lang='en'
+```
+
+With the LaTeXOCR model for formula recognition (the Chinese layout model must be used):
+
+```bash linenums="1"
+paddleocr --image_dir=ppstructure/docs/recovery/UnrealText.pdf --type=structure --recovery=true --formula=true --recovery_to_markdown=true --lang='ch'
+```
+
 <a name="22"></a>
 ### 2.2 Use by python script

@@ -303,6 +320,36 @@ res = sorted_layout_boxes(result, w)
 convert_info_docx(img, res, save_folder, os.path.basename(img_path).split('.')[0])
 ```

+<a name="227"></a>
+#### 2.2.7 layout recovery(PDF to Markdown)
+
+```python linenums="1"
+import os
+import cv2
+from paddleocr import PPStructure,save_structure_res
+from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes
+from paddleocr.ppstructure.recovery.recovery_to_markdown import convert_info_markdown
+
+# Chinese image
+table_engine = PPStructure(recovery=True)
+# English image
+# table_engine = PPStructure(recovery=True, lang='en')
+
+save_folder = './output'
+img_path = 'ppstructure/docs/table/1.png'
+img = cv2.imread(img_path)
+result = table_engine(img)
+save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
+
+for line in result:
+    line.pop('img')  # drop each region's image array so the printed dict stays readable
+    print(line)
+
+h, w, _ = img.shape
+res = sorted_layout_boxes(result, w)
+convert_info_markdown(res, save_folder, os.path.basename(img_path).split('.')[0])
+```
+
 <a name="23"></a>
 ### 2.3 Result description

@@ -351,6 +398,8 @@ Please refer to: [Key Information Extraction](../kie/README.md) .
 | table_model_dir | Table structure model inference model path| None |
 | table_char_dict_path | The dictionary path of table structure model | ../ppocr/utils/dict/table_structure_dict.txt  |
 | merge_no_span_structure | In the table recognition model, whether to merge '\<td>' and '\</td>' | False |
+| formula_model_dir | Formula recognition model inference model path | None |
+| formula_char_dict_path | The dictionary path of formula recognition model | ../ppocr/utils/dict/latex_ocr_tokenizer.json |
 | layout_model_dir  | Layout analysis model inference model path| None |
 | layout_dict_path  | The dictionary path of layout analysis model| ../ppocr/utils/dict/layout_publaynet_dict.txt |
 | layout_score_threshold | The detection box threshold of layout analysis model| 0.5|
@@ -362,8 +411,10 @@ Please refer to: [Key Information Extraction](../kie/README.md) .
 | image_orientation | Whether to perform image orientation classification in forward  | False   |
 | layout | Whether to perform layout analysis in forward  | True   |
 | table  | Whether to perform table recognition in forward  | True   |
+| formula | Whether to perform formula recognition in forward | False |
 | ocr    | Whether to perform ocr for non-table areas in layout analysis. When layout is False, it will be automatically set to False| True |
 | recovery    | Whether to perform layout recovery in forward| False |
+| recovery_to_markdown | Whether to convert the layout recovery results into a markdown file | False |
 | save_pdf    | Whether to convert docx to pdf when recovery| False |
 | structure_version |  Structure version, optional PP-structure and PP-structurev2  | PP-structure |
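
The script in section 2.2.7 names its output after `os.path.basename(img_path).split('.')[0]`, which keeps only the text before the first dot in the file name. The short sketch below contrasts that expression with `os.path.splitext`, which strips just the final extension. The helper names and the `scan.v2.pdf` file name are illustrative and not part of PaddleOCR.

```python
import os


def stem_first_dot(path):
    """Name as derived in the quick start: everything before the first dot."""
    return os.path.basename(path).split('.')[0]


def stem_last_extension(path):
    """Alternative: drop only the final extension."""
    return os.path.splitext(os.path.basename(path))[0]


if __name__ == "__main__":
    sample = "ppstructure/docs/recovery/UnrealText.pdf"
    print(stem_first_dot(sample))       # UnrealText
    print(stem_last_extension(sample))  # UnrealText
    # The two differ once the file name contains more than one dot.
    print(stem_first_dot("scan.v2.pdf"))       # scan
    print(stem_last_extension("scan.v2.pdf"))  # scan.v2
```

For file names with a single extension the two agree, so the quick-start expression is fine for the bundled samples.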