parent 4cec888bcc
commit 03d881685a

Binary file not shown. Before: 179 KiB | After: 1.2 MiB

@@ -23,7 +23,7 @@ English | [简体中文](README_ch.md)

## 1. Introduction
Layout analysis refers to the regional division of documents in the form of pictures and the positioning of key areas, such as text, title, table, picture, etc. The layout analysis algorithm is based on the lightweight model PP-picodet of [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)
Layout analysis refers to the regional division of documents in the form of pictures and the positioning of key areas, such as text, title, table, picture, etc. The layout analysis algorithm is based on the lightweight model PP-PicoDet of [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection), and includes English, Chinese, and table layout analysis models. The English models detect document layout elements such as text, title, table, figure, and list; the Chinese models detect text, title, figure, figure caption, table, table caption, header, footer, reference, and equation regions; the table layout analysis model detects table regions.
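Downstream code typically consumes these layout results as a list of region dicts with `type` and `bbox` keys (roughly the format PPStructure returns). A minimal sketch of filtering regions by type; the region values below are made-up illustrations, not real model output:

```python
# Sketch: filtering layout-analysis regions by type.
# The region dicts are illustrative stand-ins for real layout model output.

def regions_by_type(regions, wanted):
    """Return only the regions whose 'type' is in `wanted` (case-insensitive)."""
    wanted = {w.lower() for w in wanted}
    return [r for r in regions if r["type"].lower() in wanted]

layout = [
    {"type": "text",   "bbox": [12, 3, 405, 97]},
    {"type": "table",  "bbox": [12, 110, 405, 299]},
    {"type": "figure", "bbox": [12, 310, 405, 480]},
]
tables = regions_by_type(layout, ["table"])
print(len(tables))  # 1
```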

<div align="center">
<img src="../docs/layout/layout.png" width="800">

@@ -152,7 +152,7 @@ We provide CDLA(Chinese layout analysis), TableBank(Table layout analysis)etc. d

| [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | For table detection (TRACKA) and table recognition (TRACKB). Image types include historical datasets (beginning with cTDaR_t0, e.g. cTDaR_t00872.jpg) and modern datasets (beginning with cTDaR_t1, e.g. cTDaR_t10482.jpg). |
| [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | A dataset constructed by manually annotating figures and pages from publicly available annual reports, containing 5 categories: table, figure, natural image, logo, and signature. |
| [TableBank](https://github.com/doc-analysis/TableBank) | A large dataset for table detection and recognition, covering Word and LaTeX document formats. |
| [CDLA](https://github.com/buptlihang/CDLA) | A Chinese document layout analysis dataset for Chinese literature (paper) scenarios, containing 10 categories: Table, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
| [CDLA](https://github.com/buptlihang/CDLA) | A Chinese document layout analysis dataset for Chinese literature (paper) scenarios, containing 10 categories: Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation |
| [DocBank](https://github.com/doc-analysis/DocBank) | A large-scale dataset (500K document pages) constructed with weakly supervised methods for document layout analysis, containing 12 categories: Author, Caption, Date, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, Title |
@@ -175,7 +175,7 @@ If the test image is Chinese, the pre-trained model of Chinese CDLA dataset can

### 5.1. Train

Train:
Start training with the PaddleDetection [layout analysis config files](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/picodet/legacy_model/application/layout_analysis):

* Modify the config file
@@ -22,7 +22,7 @@

## 1. 简介

版面分析指的是对图片形式的文档进行区域划分,定位其中的关键区域,如文字、标题、表格、图片等。版面分析算法基于[PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)的轻量模型PP-PicoDet进行开发。
版面分析指的是对图片形式的文档进行区域划分,定位其中的关键区域,如文字、标题、表格、图片等。版面分析算法基于[PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)的轻量模型PP-PicoDet进行开发,包含英文、中文、表格版面分析3类模型。其中,英文模型支持Text、Title、Table、Figure、List 5类区域的检测,中文模型支持Text、Title、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation 10类区域的检测,表格版面分析支持Table区域的检测,版面分析效果如下图所示:

<div align="center">
<img src="../docs/layout/layout.png" width="800">
@@ -152,7 +152,7 @@ json文件包含所有图像的标注,数据以字典嵌套的方式存放,

| ------------------------------------------------------------ | ------------------------------------------------------------ |
| [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | 用于表格检测(TRACKA)和表格识别(TRACKB)。图片类型包含历史数据集(以cTDaR_t0开头,如cTDaR_t00872.jpg)和现代数据集(以cTDaR_t1开头,如cTDaR_t10482.jpg)。 |
| [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | 手动注释公开的年度报告中的图形或页面而构建的数据集,包含5类:table、figure、natural image、logo、signature |
| [CDLA](https://github.com/buptlihang/CDLA) | 中文文档版面分析数据集,面向中文文献类(论文)场景,包含10类:Table、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation |
| [CDLA](https://github.com/buptlihang/CDLA) | 中文文档版面分析数据集,面向中文文献类(论文)场景,包含10类:Text、Title、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation |
| [TableBank](https://github.com/doc-analysis/TableBank) | 用于表格检测和识别的大型数据集,包含Word和Latex 2种文档格式 |
| [DocBank](https://github.com/doc-analysis/DocBank) | 使用弱监督方法构建的大规模数据集(500K文档页面),用于文档布局分析,包含12类:Author、Caption、Date、Equation、Figure、Footer、List、Paragraph、Reference、Section、Table、Title |
@@ -161,7 +161,7 @@ json文件包含所有图像的标注,数据以字典嵌套的方式存放,

提供了训练脚本、评估脚本和预测脚本,本节将以PubLayNet预训练模型为例进行讲解。

如果不希望训练,直接体验后面的模型评估、预测、动转静、推理的流程,可以下载提供的预训练模型(PubLayNet数据集),并跳过本部分。
如果不希望训练,直接体验后面的模型评估、预测、动转静、推理的流程,可以下载提供的预训练模型(PubLayNet数据集),并跳过5.1和5.2。

```
mkdir pretrained_model
```

@@ -176,7 +176,7 @@ wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_

### 5.1. 启动训练

开始训练:
使用PaddleDetection[版面分析配置文件](https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.5/configs/picodet/legacy_model/application/layout_analysis)启动训练

* 修改配置文件
@@ -254,8 +254,7 @@ def main(args):

        if args.recovery and all_res != []:
            try:
                convert_info_docx(img, all_res, save_folder, img_name,
                                  args.save_pdf)
                convert_info_docx(img, all_res, save_folder, img_name)
            except Exception as ex:
                logger.error("error in layout recovery image:{}, err msg: {}".
                             format(image_file, ex))
@@ -82,8 +82,11 @@ Through layout analysis, we divided the image/PDF documents into regions, locate

The test image can be restored from the layout information, the OCR detection and recognition results, the table information, and the saved images.

The whl package is also provided for quick use, see [quickstart](../docs/quickstart_en.md) for details.
The whl package is also provided for quick use; run the code below, and see [quickstart](../docs/quickstart_en.md) for more information.

```bash
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
```

<a name="3.1"></a>
### 3.1 Download models
@@ -83,7 +83,16 @@ python3 -m pip install -r ppstructure/recovery/requirements.txt

我们通过版面信息、OCR检测和识别结构、表格信息、保存的图片,对测试图片进行恢复即可。

提供如下代码实现版面恢复,也提供了whl包的形式方便快速使用,详见 [quickstart](../docs/quickstart.md)。
提供如下代码实现版面恢复,也提供了whl包的形式方便快速使用,更多信息详见 [quickstart](../docs/quickstart.md)。

```bash
# 中文测试图
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true
# 英文测试图
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
# pdf测试文件
paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --lang='en'
```

<a name="3.1"></a>
@@ -28,7 +28,7 @@ from ppocr.utils.logging import get_logger

logger = get_logger()


def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):
def convert_info_docx(img, res, save_folder, img_name):
    doc = Document()
    doc.styles['Normal'].font.name = 'Times New Roman'
    doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体')
@@ -60,14 +60,9 @@ def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):

        elif region['type'].lower() == 'title':
            doc.add_heading(region['res'][0]['text'])
        elif region['type'].lower() == 'table':
            paragraph = doc.add_paragraph()
            new_parser = HtmlToDocx()
            new_parser.table_style = 'TableGrid'
            table = new_parser.handle_table(html=region['res']['html'])
            new_table = deepcopy(table)
            new_table.alignment = WD_TABLE_ALIGNMENT.CENTER
            paragraph.add_run().element.addnext(new_table._tbl)
            parser = HtmlToDocx()
            parser.table_style = 'TableGrid'
            parser.handle_table(region['res']['html'], doc)
        else:
            paragraph = doc.add_paragraph()
            paragraph_format = paragraph.paragraph_format
@@ -82,13 +77,6 @@ def convert_info_docx(img, res, save_folder, img_name, save_pdf=False):

    doc.save(docx_path)
    logger.info('docx save to {}'.format(docx_path))

    # save to pdf
    if save_pdf:
        pdf_path = os.path.join(save_folder, '{}.pdf'.format(img_name))
        from docx2pdf import convert
        convert(docx_path, pdf_path)
        logger.info('pdf save to {}'.format(pdf_path))


def sorted_layout_boxes(res, w):
    """
@@ -1,4 +1,3 @@

python-docx
docx2pdf
PyMuPDF
beautifulsoup4
@@ -1,4 +1,3 @@

# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
@@ -13,62 +12,59 @@

# See the License for the specific language governing permissions and
# limitations under the License.
"""
This code is refer from: https://github.com/pqzx/html2docx/blob/8f6695a778c68befb302e48ac0ed5201ddbd4524/htmldocx/h2d.py
This code is refer from: https://github.com/weizwx/html2docx/blob/master/htmldocx/h2d.py
"""
import re, argparse
import io, os
import urllib.request
from urllib.parse import urlparse

import re
import docx
from docx import Document
from bs4 import BeautifulSoup
from html.parser import HTMLParser

import docx, docx.table
from docx import Document
from docx.shared import RGBColor, Pt, Inches
from docx.enum.text import WD_COLOR, WD_ALIGN_PARAGRAPH
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

from bs4 import BeautifulSoup

# values in inches
INDENT = 0.25
LIST_INDENT = 0.5
MAX_INDENT = 5.5  # To stop indents going off the page

# Style to use with tables. By default no style is used.
DEFAULT_TABLE_STYLE = None

# Style to use with paragraphs. By default no style is used.
DEFAULT_PARAGRAPH_STYLE = None

def get_table_rows(table_soup):
    table_row_selectors = [
        'table > tr', 'table > thead > tr', 'table > tbody > tr',
        'table > tfoot > tr'
    ]
    # If there's a header, body, footer or direct child tr tags, add row dimensions from there
    return table_soup.select(', '.join(table_row_selectors), recursive=False)


def get_filename_from_url(url):
    return os.path.basename(urlparse(url).path)


def get_table_columns(row):
    # Get all columns for the specified row tag.
    return row.find_all(['th', 'td'], recursive=False) if row else []


def is_url(url):
    """
    Not to be used for actually validating a url, but in our use case we only
    care if it's a url or a file path, and they're pretty distinguishable
    """
    parts = urlparse(url)
    return all([parts.scheme, parts.netloc, parts.path])
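Since `is_url` deliberately trades accuracy for simplicity, it is worth seeing exactly which inputs it accepts. A standalone copy of the helper (reproduced here so it can be run in isolation with only the stdlib):

```python
# A string counts as a URL only when scheme, netloc, AND path are all
# non-empty, so even "https://example.com" (no path) is rejected.
from urllib.parse import urlparse

def is_url(url):
    parts = urlparse(url)
    return all([parts.scheme, parts.netloc, parts.path])

print(is_url("https://example.com/img.png"))  # True
print(is_url("local/path/img.png"))           # False: no scheme or netloc
print(is_url("https://example.com"))          # False: empty path
```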

def fetch_image(url):
    """
    Attempts to fetch an image from a url.
    If successful returns a bytes object, else returns None
    :return:
    """
    try:
        with urllib.request.urlopen(url) as response:
            # security flaw?
            return io.BytesIO(response.read())
    except urllib.error.URLError:
        return None


def get_table_dimensions(table_soup):
    # Get rows for the table
    rows = get_table_rows(table_soup)
    # Table is either empty or has non-direct children between table and tr tags
    # Thus the row dimensions and column dimensions are assumed to be 0

    cols = get_table_columns(rows[0]) if rows else []
    # Add colspan calculation column number
    col_count = 0
    for col in cols:
        colspan = col.attrs.get('colspan', 1)
        col_count += int(colspan)

    return rows, col_count
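The key idea in `get_table_dimensions` is that the column count is the sum of `colspan` values in the first row. A stdlib stand-in for that logic (using `html.parser` instead of the module's bs4 code, so it runs without dependencies; the class name is mine):

```python
# Stdlib stand-in for the colspan-summing logic: the column count of a
# table is the sum of colspan values over the cells of its first row.
from html.parser import HTMLParser

class FirstRowColCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.col_count = 0
        self.in_first_row = False
        self.row_seen = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr" and not self.row_seen:
            self.row_seen = True
            self.in_first_row = True
        elif tag in ("td", "th") and self.in_first_row:
            # missing colspan defaults to 1, as in get_table_dimensions
            self.col_count += int(dict(attrs).get("colspan", 1))

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_first_row = False

def count_columns(html):
    parser = FirstRowColCounter()
    parser.feed(html)
    return parser.col_count

print(count_columns("<table><tr><td colspan='2'></td><td></td></tr></table>"))  # 3
```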

def get_cell_html(soup):
    # Returns string of td element with opening and closing <td> tags removed
    # Cannot use find_all as it only finds element tags and does not find text which
    # is not inside an element
    return ' '.join([str(i) for i in soup.contents])


def delete_paragraph(paragraph):
    # https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
    p = paragraph._element
    p.getparent().remove(p)
    p._p = p._element = None


def remove_last_occurence(ls, x):
    ls.pop(len(ls) - ls[::-1].index(x) - 1)
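The index arithmetic in `remove_last_occurence` is easy to misread, so here it is on a concrete list: `ls[::-1].index(x)` finds `x` from the right, and the subtraction converts that back to a left-based index.

```python
# Pops the LAST occurrence of x from the list, in place.
def remove_last_occurence(ls, x):
    ls.pop(len(ls) - ls[::-1].index(x) - 1)

tags = ['ul', 'ol', 'ul']
remove_last_occurence(tags, 'ul')
print(tags)  # ['ul', 'ol']
```

This is how the parser closes the innermost open list when it sees `</ul>` or `</ol>`.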

def remove_whitespace(string, leading=False, trailing=False):
    """Remove white space from a string.

@@ -122,11 +118,6 @@ def remove_whitespace(string, leading=False, trailing=False):

    # TODO need some way to get rid of extra spaces in e.g. text <span> </span> text
    return re.sub(r'\s+', ' ', string)


def delete_paragraph(paragraph):
    # https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
    p = paragraph._element
    p.getparent().remove(p)
    p._p = p._element = None


font_styles = {
    'b': 'bold',

@@ -145,13 +136,8 @@ font_names = {

    'pre': 'Courier',
}

styles = {
    'LIST_BULLET': 'List Bullet',
    'LIST_NUMBER': 'List Number',
}


class HtmlToDocx(HTMLParser):

    def __init__(self):
        super().__init__()
        self.options = {
@@ -161,13 +147,11 @@ class HtmlToDocx(HTMLParser):

            'styles': True,
        }
        self.table_row_selectors = [
            'table > tr',
            'table > thead > tr',
            'table > tbody > tr',
            'table > tr', 'table > thead > tr', 'table > tbody > tr',
            'table > tfoot > tr'
        ]
        self.table_style = DEFAULT_TABLE_STYLE
        self.paragraph_style = DEFAULT_PARAGRAPH_STYLE
        self.table_style = None
        self.paragraph_style = None

    def set_initial_attrs(self, document=None):
        self.tags = {

@@ -178,9 +162,10 @@ class HtmlToDocx(HTMLParser):

            self.doc = document
        else:
            self.doc = Document()
        self.bs = self.options['fix-html']  # whether or not to clean with BeautifulSoup
        self.bs = self.options[
            'fix-html']  # whether or not to clean with BeautifulSoup
        self.document = self.doc
        self.include_tables = True  # TODO add this option back in?
        self.include_images = self.options['images']
        self.include_styles = self.options['styles']
        self.paragraph = None

@@ -193,55 +178,52 @@ class HtmlToDocx(HTMLParser):

        self.table_style = other.table_style
        self.paragraph_style = other.paragraph_style

    def get_cell_html(self, soup):
        # Returns string of td element with opening and closing <td> tags removed
        # Cannot use find_all as it only finds element tags and does not find text which
        # is not inside an element
        return ' '.join([str(i) for i in soup.contents])

    def ignore_nested_tables(self, tables_soup):
        """
        Returns array containing only the highest level tables
        Operates on the assumption that bs4 returns child elements immediately after
        the parent element in `find_all`. If this changes in the future, this method will need to be updated
        :return:
        """
        new_tables = []
        nest = 0
        for table in tables_soup:
            if nest:
                nest -= 1
                continue
            new_tables.append(table)
            nest = len(table.find_all('table'))
        return new_tables
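The nest counter in `ignore_nested_tables` leans on bs4's document-order guarantee: descendants appear immediately after their parent in `find_all`. A minimal sketch with a hypothetical bs4-tag stand-in (`FakeTable` is not part of the module; it exists only to make the skip behavior runnable):

```python
# FakeTable is a made-up stand-in for a bs4 tag: find_all('table')
# returns its descendant tables in document order, like real soup.
class FakeTable:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def find_all(self, tag):
        found = []
        for child in self.children:
            found.append(child)
            found.extend(child.find_all(tag))
        return found

def ignore_nested_tables(tables_soup):
    # identical logic to the method above, made module-level for the demo
    new_tables = []
    nest = 0
    for table in tables_soup:
        if nest:
            nest -= 1  # this table is nested inside the one just kept
            continue
        new_tables.append(table)
        nest = len(table.find_all('table'))
    return new_tables

inner = FakeTable('inner')
outer = FakeTable('outer', [inner])
flat = FakeTable('flat')
# document order, as bs4's find_all would yield it:
top_level = ignore_nested_tables([outer, inner, flat])
print([t.name for t in top_level])  # ['outer', 'flat']
```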

    def add_styles_to_paragraph(self, style):
        if 'text-align' in style:
            align = style['text-align']
            if align == 'center':
                self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER
            elif align == 'right':
                self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.RIGHT
            elif align == 'justify':
                self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY
        if 'margin-left' in style:
            margin = style['margin-left']
            units = re.sub(r'[0-9]+', '', margin)
            margin = int(float(re.sub(r'[a-z]+', '', margin)))
            if units == 'px':
                self.paragraph.paragraph_format.left_indent = Inches(min(margin // 10 * INDENT, MAX_INDENT))
            # TODO handle non px units

    def get_tables(self):
        if not hasattr(self, 'soup'):
            self.include_tables = False
            return
        # find other way to do it, or require this dependency?
        self.tables = self.ignore_nested_tables(self.soup.find_all('table'))
        self.table_no = 0

    def add_styles_to_run(self, style):
        if 'color' in style:
            if 'rgb' in style['color']:
                color = re.sub(r'[a-z()]+', '', style['color'])
                colors = [int(x) for x in color.split(',')]
            elif '#' in style['color']:
                color = style['color'].lstrip('#')
                colors = tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))
            else:
                colors = [0, 0, 0]
                # TODO map colors to named colors (and extended colors...)
                # For now set color to black to prevent crashing
            self.run.font.color.rgb = RGBColor(*colors)

        if 'background-color' in style:
            if 'rgb' in style['background-color']:
                color = re.sub(r'[a-z()]+', '', style['background-color'])
                colors = [int(x) for x in color.split(',')]
            elif '#' in style['background-color']:
                color = style['background-color'].lstrip('#')
                colors = tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))
            else:
                colors = [0, 0, 0]
                # TODO map colors to named colors (and extended colors...)
                # For now set color to black to prevent crashing
            self.run.font.highlight_color = WD_COLOR.GRAY_25  # TODO: map colors
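The color branches in `add_styles_to_run` convert `rgb(...)` and `#rrggbb` strings into integer triples. An isolated sketch of the same parsing (`parse_css_color` is my name for the demo; the class's actual code keeps the logic inline):

```python
# 'rgb(r, g, b)' strings are stripped to digits and split on commas;
# '#rrggbb' strings are decoded two hex digits at a time; anything else
# falls back to black, mirroring the method above.
import re

def parse_css_color(value):
    if 'rgb' in value:
        digits = re.sub(r'[a-z()]+', '', value)
        return tuple(int(x) for x in digits.split(','))
    if '#' in value:
        hexpart = value.lstrip('#')
        return tuple(int(hexpart[i:i + 2], 16) for i in (0, 2, 4))
    return (0, 0, 0)

print(parse_css_color('rgb(255, 0, 64)'))  # (255, 0, 64)
print(parse_css_color('#1e90ff'))          # (30, 144, 255)
```

Note that named colors like `'red'` hit the fallback, which is exactly the "set to black to prevent crashing" behavior the TODO comments describe.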

    def run_process(self, html):
        if self.bs and BeautifulSoup:
            self.soup = BeautifulSoup(html, 'html.parser')
            html = str(self.soup)
        if self.include_tables:
            self.get_tables()
        self.feed(html)

    def add_html_to_cell(self, html, cell):
        if not isinstance(cell, docx.table._Cell):
            raise ValueError('Second argument needs to be a %s' %
                             docx.table._Cell)
        unwanted_paragraph = cell.paragraphs[0]
        if unwanted_paragraph.text == "":
            delete_paragraph(unwanted_paragraph)
        self.set_initial_attrs(cell)
        self.run_process(html)
        # cells must end with a paragraph or will get message about corrupt file
        # https://stackoverflow.com/a/29287121
        if not self.doc.paragraphs:
            self.doc.add_paragraph('')

    def apply_paragraph_style(self, style=None):
        try:

@@ -250,69 +232,10 @@ class HtmlToDocx(HTMLParser):

            elif self.paragraph_style:
                self.paragraph.style = self.paragraph_style
        except KeyError as e:
            raise ValueError(f"Unable to apply style {self.paragraph_style}.") from e
            raise ValueError(
                f"Unable to apply style {self.paragraph_style}.") from e

    def parse_dict_string(self, string, separator=';'):
        new_string = string.replace(" ", '').split(separator)
        string_dict = dict([x.split(':') for x in new_string if ':' in x])
        return string_dict
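`parse_dict_string` is how an inline `style="..."` attribute becomes a dict for the styling methods above. Shown standalone on a real style string:

```python
# Strip spaces, split on ';', keep only 'key:value' fragments.
def parse_dict_string(string, separator=';'):
    new_string = string.replace(" ", '').split(separator)
    return dict([x.split(':') for x in new_string if ':' in x])

style = parse_dict_string("text-align: center; margin-left: 20px;")
print(style)  # {'text-align': 'center', 'margin-left': '20px'}
```

A trailing separator is harmless (the empty fragment has no `:` and is filtered out), but a value containing its own colon (e.g. a `url(...)` with a scheme) would break the two-way split.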

    def handle_li(self):
        # check list stack to determine style and depth
        list_depth = len(self.tags['list'])
        if list_depth:
            list_type = self.tags['list'][-1]
        else:
            list_type = 'ul'  # assign unordered if no tag

        if list_type == 'ol':
            list_style = styles['LIST_NUMBER']
        else:
            list_style = styles['LIST_BULLET']

        self.paragraph = self.doc.add_paragraph(style=list_style)
        self.paragraph.paragraph_format.left_indent = Inches(min(list_depth * LIST_INDENT, MAX_INDENT))
        self.paragraph.paragraph_format.line_spacing = 1

    def add_image_to_cell(self, cell, image):
        # python-docx doesn't have method yet for adding images to table cells. For now we use this
        paragraph = cell.add_paragraph()
        run = paragraph.add_run()
        run.add_picture(image)

    def handle_img(self, current_attrs):
        if not self.include_images:
            self.skip = True
            self.skip_tag = 'img'
            return
        src = current_attrs['src']
        # fetch image
        src_is_url = is_url(src)
        if src_is_url:
            try:
                image = fetch_image(src)
            except urllib.error.URLError:
                image = None
        else:
            image = src
        # add image to doc
        if image:
            try:
                if isinstance(self.doc, docx.document.Document):
                    self.doc.add_picture(image)
                else:
                    self.add_image_to_cell(self.doc, image)
            except FileNotFoundError:
                image = None
        if not image:
            if src_is_url:
                self.doc.add_paragraph("<image: %s>" % src)
            else:
                # avoid exposing filepaths in document
                self.doc.add_paragraph("<image: %s>" % get_filename_from_url(src))

    def handle_table(self, html):
    def handle_table(self, html, doc):
        """
        To handle nested tables, we will parse tables manually as follows:
        Get table soup

@@ -320,194 +243,42 @@ class HtmlToDocx(HTMLParser):

        Iterate over soup and fill docx table with new instances of this parser
        Tell HTMLParser to ignore any tags until the corresponding closing table tag
        """
        doc = Document()
        table_soup = BeautifulSoup(html, 'html.parser')
        rows, cols_len = self.get_table_dimensions(table_soup)
        rows, cols_len = get_table_dimensions(table_soup)
        table = doc.add_table(len(rows), cols_len)
        table.style = doc.styles['Table Grid']

        cell_row = 0
        for index, row in enumerate(rows):
            cols = self.get_table_columns(row)
            cols = get_table_columns(row)
            cell_col = 0
            for col in cols:
                colspan = int(col.attrs.get('colspan', 1))
                rowspan = int(col.attrs.get('rowspan', 1))

                cell_html = self.get_cell_html(col)
                cell_html = get_cell_html(col)
                if col.name == 'th':
                    cell_html = "<b>%s</b>" % cell_html

                docx_cell = table.cell(cell_row, cell_col)

                while docx_cell.text != '':  # Skip the merged cell
                    cell_col += 1
                    docx_cell = table.cell(cell_row, cell_col)

                cell_to_merge = table.cell(cell_row + rowspan - 1, cell_col + colspan - 1)
                cell_to_merge = table.cell(cell_row + rowspan - 1,
                                           cell_col + colspan - 1)
                if docx_cell != cell_to_merge:
                    docx_cell.merge(cell_to_merge)

                child_parser = HtmlToDocx()
                child_parser.copy_settings_from(self)
                child_parser.add_html_to_cell(cell_html or ' ', docx_cell)  # occupy the position
                child_parser.add_html_to_cell(cell_html or ' ', docx_cell)

                cell_col += colspan
            cell_row += 1

        # skip all tags until corresponding closing tag
        self.instances_to_skip = len(table_soup.find_all('table'))
        self.skip_tag = 'table'
        self.skip = True
        self.table = None
        return table
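The trickiest part of `handle_table` is the `while docx_cell.text != ''` loop: it skips grid positions already claimed by an earlier rowspan/colspan merge. A pure-Python grid simulation of that placement rule (`place_cells` is a hypothetical helper for illustration, not part of the module):

```python
# Grid simulation of the cell-placement logic: each cell anchors at the
# first free column of its row, then claims a rowspan x colspan block.
def place_cells(rows):
    """rows: list of rows; each cell is (label, rowspan, colspan).
    Returns {(row, col): label} for every grid position covered."""
    occupied = {}
    for cell_row, row in enumerate(rows):
        cell_col = 0
        for label, rowspan, colspan in row:
            while (cell_row, cell_col) in occupied:  # skip merged cells
                cell_col += 1
            for r in range(cell_row, cell_row + rowspan):
                for c in range(cell_col, cell_col + colspan):
                    occupied[(r, c)] = label
            cell_col += colspan
    return occupied

grid = place_cells([
    [("A", 2, 1), ("B", 1, 2)],  # A spans two rows; B spans two columns
    [("C", 1, 1), ("D", 1, 1)],  # C and D slot around A's overhang
])
print(grid[(1, 0)], grid[(1, 1)])  # A C
```

In the real method the "occupied" check is whether the docx cell already has text, which is why empty cells are filled with `' '` to occupy the position.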

    def handle_link(self, href, text):
        # Link requires a relationship
        is_external = href.startswith('http')
        rel_id = self.paragraph.part.relate_to(
            href,
            docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK,
            is_external=True  # don't support anchor links for this library yet
        )

        # Create the w:hyperlink tag and add needed values
        hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink')
        hyperlink.set(docx.oxml.shared.qn('r:id'), rel_id)

        # Create sub-run
        subrun = self.paragraph.add_run()
        rPr = docx.oxml.shared.OxmlElement('w:rPr')

        # add default color
        c = docx.oxml.shared.OxmlElement('w:color')
        c.set(docx.oxml.shared.qn('w:val'), "0000EE")
        rPr.append(c)

        # add underline
        u = docx.oxml.shared.OxmlElement('w:u')
        u.set(docx.oxml.shared.qn('w:val'), 'single')
        rPr.append(u)

        subrun._r.append(rPr)
        subrun._r.text = text

        # Add subrun to hyperlink
        hyperlink.append(subrun._r)

        # Add hyperlink to run
        self.paragraph._p.append(hyperlink)

    def handle_starttag(self, tag, attrs):
        if self.skip:
            return
        if tag == 'head':
            self.skip = True
            self.skip_tag = tag
            self.instances_to_skip = 0
            return
        elif tag == 'body':
            return

        current_attrs = dict(attrs)

        if tag == 'span':
            self.tags['span'].append(current_attrs)
            return
        elif tag == 'ol' or tag == 'ul':
            self.tags['list'].append(tag)
            return  # don't apply styles for now
        elif tag == 'br':
            self.run.add_break()
            return

        self.tags[tag] = current_attrs
        if tag in ['p', 'pre']:
            self.paragraph = self.doc.add_paragraph()
            self.apply_paragraph_style()

        elif tag == 'li':
            self.handle_li()

        elif tag == "hr":

            # This implementation was taken from:
            # https://github.com/python-openxml/python-docx/issues/105#issuecomment-62806373

            self.paragraph = self.doc.add_paragraph()
            pPr = self.paragraph._p.get_or_add_pPr()
            pBdr = OxmlElement('w:pBdr')
            pPr.insert_element_before(pBdr,
                'w:shd', 'w:tabs', 'w:suppressAutoHyphens', 'w:kinsoku', 'w:wordWrap',
                'w:overflowPunct', 'w:topLinePunct', 'w:autoSpaceDE', 'w:autoSpaceDN',
                'w:bidi', 'w:adjustRightInd', 'w:snapToGrid', 'w:spacing', 'w:ind',
                'w:contextualSpacing', 'w:mirrorIndents', 'w:suppressOverlap', 'w:jc',
                'w:textDirection', 'w:textAlignment', 'w:textboxTightWrap',
                'w:outlineLvl', 'w:divId', 'w:cnfStyle', 'w:rPr', 'w:sectPr',
                'w:pPrChange'
            )
            bottom = OxmlElement('w:bottom')
            bottom.set(qn('w:val'), 'single')
            bottom.set(qn('w:sz'), '6')
            bottom.set(qn('w:space'), '1')
            bottom.set(qn('w:color'), 'auto')
            pBdr.append(bottom)

        elif re.match('h[1-9]', tag):
            if isinstance(self.doc, docx.document.Document):
                h_size = int(tag[1])
                self.paragraph = self.doc.add_heading(level=min(h_size, 9))
            else:
                self.paragraph = self.doc.add_paragraph()

        elif tag == 'img':
            self.handle_img(current_attrs)
            return

        elif tag == 'table':
            self.handle_table()
            return

        # set new run reference point in case of leading line breaks
        if tag in ['p', 'li', 'pre']:
            self.run = self.paragraph.add_run()

        # add style
        if not self.include_styles:
            return
        if 'style' in current_attrs and self.paragraph:
            style = self.parse_dict_string(current_attrs['style'])
            self.add_styles_to_paragraph(style)

    def handle_endtag(self, tag):
        if self.skip:
            if not tag == self.skip_tag:
                return

            if self.instances_to_skip > 0:
                self.instances_to_skip -= 1
                return

            self.skip = False
            self.skip_tag = None
            self.paragraph = None

        if tag == 'span':
            if self.tags['span']:
                self.tags['span'].pop()
            return
        elif tag == 'ol' or tag == 'ul':
            remove_last_occurence(self.tags['list'], tag)
            return
        elif tag == 'table':
            self.table_no += 1
            self.table = None
            self.doc = self.document
            self.paragraph = None

        if tag in self.tags:
            self.tags.pop(tag)
        # maybe set relevant reference to None?
        doc.save('1.docx')

    def handle_data(self, data):
        if self.skip:
@@ -546,87 +317,3 @@ class HtmlToDocx(HTMLParser):

        if tag in font_names:
            font_name = font_names[tag]
            self.run.font.name = font_name

    def ignore_nested_tables(self, tables_soup):
        """
        Returns array containing only the highest level tables
        Operates on the assumption that bs4 returns child elements immediately after
        the parent element in `find_all`. If this changes in the future, this method will need to be updated
        :return:
        """
        new_tables = []
        nest = 0
        for table in tables_soup:
            if nest:
                nest -= 1
                continue
            new_tables.append(table)
            nest = len(table.find_all('table'))
        return new_tables

    def get_table_rows(self, table_soup):
        # If there's a header, body, footer or direct child tr tags, add row dimensions from there
        return table_soup.select(', '.join(self.table_row_selectors), recursive=False)

    def get_table_columns(self, row):
        # Get all columns for the specified row tag.
        return row.find_all(['th', 'td'], recursive=False) if row else []

    def get_table_dimensions(self, table_soup):
        # Get rows for the table
        rows = self.get_table_rows(table_soup)
        # Table is either empty or has non-direct children between table and tr tags
        # Thus the row dimensions and column dimensions are assumed to be 0

        cols = self.get_table_columns(rows[0]) if rows else []
        # Add colspan calculation column number
        col_count = 0
        for col in cols:
            colspan = col.attrs.get('colspan', 1)
            col_count += int(colspan)

        # return len(rows), col_count
        return rows, col_count

    def get_tables(self):
        if not hasattr(self, 'soup'):
            self.include_tables = False
            return
        # find other way to do it, or require this dependency?
        self.tables = self.ignore_nested_tables(self.soup.find_all('table'))
        self.table_no = 0

    def run_process(self, html):
        if self.bs and BeautifulSoup:
            self.soup = BeautifulSoup(html, 'html.parser')
            html = str(self.soup)
        if self.include_tables:
            self.get_tables()
        self.feed(html)

    def add_html_to_document(self, html, document):
        if not isinstance(html, str):
            raise ValueError('First argument needs to be a %s' % str)
        elif not isinstance(document, docx.document.Document) and not isinstance(document, docx.table._Cell):
            raise ValueError('Second argument needs to be a %s' % docx.document.Document)
        self.set_initial_attrs(document)
        self.run_process(html)

    def add_html_to_cell(self, html, cell):
        self.set_initial_attrs(cell)
        self.run_process(html)

    def parse_html_file(self, filename_html, filename_docx=None):
        with open(filename_html, 'r') as infile:
            html = infile.read()
        self.set_initial_attrs()
        self.run_process(html)
        if not filename_docx:
            path, filename = os.path.split(filename_html)
            filename_docx = '%s/new_docx_file_%s' % (path, filename)
        self.doc.save('%s.docx' % filename_docx)

    def parse_html_string(self, html):
        self.set_initial_attrs()
        self.run_process(html)
        return self.doc
@@ -90,11 +90,6 @@ def init_args():

        type=str2bool,
        default=False,
        help='Whether to enable layout of recovery')
    parser.add_argument(
        "--save_pdf",
        type=str2bool,
        default=False,
        help='Whether to save pdf file')

    return parser
@@ -108,7 +103,38 @@ def draw_structure_result(image, result, font_path):

    if isinstance(image, np.ndarray):
        image = Image.fromarray(image)
    boxes, txts, scores = [], [], []

    img_layout = image.copy()
    draw_layout = ImageDraw.Draw(img_layout)
    text_color = (255, 255, 255)
    text_background_color = (80, 127, 255)
    catid2color = {}
    font_size = 15
    font = ImageFont.truetype(font_path, font_size, encoding="utf-8")

    for region in result:
        if region['type'] not in catid2color:
            box_color = (random.randint(0, 255), random.randint(0, 255),
                         random.randint(0, 255))
            catid2color[region['type']] = box_color
        else:
            box_color = catid2color[region['type']]
        box_layout = region['bbox']
        draw_layout.rectangle(
            [(box_layout[0], box_layout[1]), (box_layout[2], box_layout[3])],
            outline=box_color,
            width=3)
        text_w, text_h = font.getsize(region['type'])
        draw_layout.rectangle(
            [(box_layout[0], box_layout[1]),
             (box_layout[0] + text_w, box_layout[1] + text_h)],
            fill=text_background_color)
        draw_layout.text(
            (box_layout[0], box_layout[1]),
            region['type'],
            fill=text_color,
            font=font)

        if region['type'] == 'table':
            pass
        else:

@@ -116,6 +142,7 @@ def draw_structure_result(image, result, font_path):

            boxes.append(np.array(text_result['text_region']))
            txts.append(text_result['text'])
            scores.append(text_result['confidence'])

    im_show = draw_ocr_box_txt(
        image, boxes, txts, scores, font_path=font_path, drop_score=0)
        img_layout, boxes, txts, scores, font_path=font_path, drop_score=0)
    return im_show