Table Structure Recognition Module Models:

Model | Model Download Link | Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description
---|---|---|---|---|---|---
SLANet | Inference Model/Training Model | 59.52 | 103.08 / 103.08 | 197.99 / 197.99 | 6.9 M | SLANet is a table structure recognition model developed by the Baidu PaddleX team. It significantly improves the accuracy and inference speed of table structure recognition by adopting the CPU-friendly lightweight backbone network PP-LCNet, the high-and-low-level feature fusion module CSP-PAN, and the feature decoding module SLA Head, which aligns structural and positional information.
SLANet_plus | Inference Model/Training Model | 63.69 | 140.29 / 140.29 | 195.39 / 195.39 | 6.9 M | SLANet_plus is an enhanced version of SLANet, the table structure recognition model developed by the Baidu PaddleX team. Compared to SLANet, it significantly improves recognition of borderless and complex tables and reduces the model's sensitivity to the accuracy of table localization, enabling accurate recognition even when the table is positioned with an offset.
Layout Detection Module Models:

Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction
---|---|---|---|---|---|---
PP-DocLayout-L | Inference Model/Training Model | 90.4 | 34.6244 / 10.3945 | 510.57 / - | 123.76 M | A high-precision layout area localization model trained with RT-DETR-L on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports.
PP-DocLayout-M | Inference Model/Training Model | 75.2 | 13.3259 / 4.8685 | 44.0680 / 44.0680 | 22.578 M | A layout area localization model with balanced precision and efficiency, trained with PicoDet-L on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports.
PP-DocLayout-S | Inference Model/Training Model | 70.9 | 8.3008 / 2.3794 | 10.0623 / 9.9296 | 4.834 M | A high-efficiency layout area localization model trained with PicoDet-S on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports.
Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction
---|---|---|---|---|---|---
PicoDet_layout_1x_table | Inference Model/Training Model | 97.5 | 8.02 / 3.09 | 23.70 / 20.41 | 7.4 M | A high-efficiency layout area localization model trained with PicoDet-1x on a self-built dataset, capable of detecting table regions.
Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction
---|---|---|---|---|---|---
PicoDet-S_layout_3cls | Inference Model/Training Model | 88.2 | 8.99 / 2.22 | 16.11 / 8.73 | 4.8 M | A high-efficiency layout area localization model trained with PicoDet-S on a self-built dataset of Chinese and English papers, magazines, and research reports.
PicoDet-L_layout_3cls | Inference Model/Training Model | 89.0 | 13.05 / 4.50 | 41.30 / 41.30 | 22.6 M | A layout area localization model with balanced efficiency and precision, trained with PicoDet-L on a self-built dataset of Chinese and English papers, magazines, and research reports.
RT-DETR-H_layout_3cls | Inference Model/Training Model | 95.8 | 114.93 / 27.71 | 947.56 / 947.56 | 470.1 M | A high-precision layout area localization model trained with RT-DETR-H on a self-built dataset of Chinese and English papers, magazines, and research reports.
Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction
---|---|---|---|---|---|---
PicoDet_layout_1x | Inference Model/Training Model | 97.8 | 9.03 / 3.10 | 25.82 / 20.70 | 7.4 M | A high-efficiency English document layout area localization model trained with PicoDet-1x on the PubLayNet dataset.
Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction
---|---|---|---|---|---|---
PicoDet-S_layout_17cls | Inference Model/Training Model | 87.4 | 9.11 / 2.12 | 15.42 / 9.12 | 4.8 M | A high-efficiency layout area localization model trained with PicoDet-S on a self-built dataset of Chinese and English papers, magazines, and research reports.
PicoDet-L_layout_17cls | Inference Model/Training Model | 89.0 | 13.50 / 4.69 | 43.32 / 43.32 | 22.6 M | A layout area localization model with balanced efficiency and precision, trained with PicoDet-L on a self-built dataset of Chinese and English papers, magazines, and research reports.
RT-DETR-H_layout_17cls | Inference Model/Training Model | 98.3 | 115.29 / 104.09 | 995.27 / 995.27 | 470.2 M | A high-precision layout area localization model trained with RT-DETR-H on a self-built dataset of Chinese and English papers, magazines, and research reports.
Text Detection Module Models:

Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description
---|---|---|---|---|---|---
PP-OCRv4_server_det | Inference Model/Training Model | 82.69 | 83.34 / 80.91 | 442.58 / 442.58 | 109 | PP-OCRv4's server-side text detection model, featuring higher accuracy; suitable for deployment on high-performance servers.
PP-OCRv4_mobile_det | Inference Model/Training Model | 77.79 | 8.79 / 3.13 | 51.00 / 28.58 | 4.7 | PP-OCRv4's mobile text detection model, optimized for efficiency; suitable for deployment on edge devices.
Text Recognition Module Models:

Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description
---|---|---|---|---|---|---
PP-OCRv4_mobile_rec | Inference Model/Training Model | 78.20 | 4.82 / 4.82 | 16.74 / 4.64 | 10.6 M | PP-OCRv4 is the successor to Baidu PaddlePaddle's self-developed text recognition model PP-OCRv3. By introducing data augmentation schemes and GTC-NRTR guidance branches, it further improves text recognition accuracy without compromising inference speed. The model is offered in both server and mobile versions to meet industrial needs in different scenarios.
PP-OCRv4_server_rec | Inference Model/Training Model | 79.20 | 6.58 / 6.58 | 33.17 / 33.17 | 71.2 M | The server-side version of PP-OCRv4; see the description above.
Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description
---|---|---|---|---|---|---
ch_SVTRv2_rec | Inference Model/Training Model | 68.81 | 8.08 / 8.08 | 50.17 / 42.50 | 73.9 M | SVTRv2 is a server-side text recognition model developed by the OpenOCR team at the Vision and Learning Lab (FVL) of Fudan University. It won first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, with a 6% improvement in end-to-end recognition accuracy over PP-OCRv4 on the A-list.
Model | Model Download Link | Recognition Avg Accuracy (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description
---|---|---|---|---|---|---
ch_RepSVTR_rec | Inference Model/Training Model | 65.07 | 5.93 / 5.93 | 20.73 / 7.32 | 22.1 M | RepSVTR is a mobile-oriented text recognition model based on SVTRv2. It won first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, with a 2.5% improvement in end-to-end recognition accuracy over PP-OCRv4 on the B-list, while maintaining similar inference speed.
Formula Recognition Module Models:

Model Name | Model Download Link | BLEU Score | Normed Edit Distance | ExpRate (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size
---|---|---|---|---|---|---|---
LaTeX_OCR_rec | Inference Model/Training Model | 0.8821 | 0.0823 | 40.01 | 2047.13 / 2047.13 | 10582.73 / 10582.73 | 89.7 M
Seal Text Detection Module Models:

Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Size (M) | Description
---|---|---|---|---|---|---
PP-OCRv4_server_seal_det | Inference Model/Training Model | 98.21 | 74.75 / 67.72 | 382.55 / 382.55 | 109 | PP-OCRv4's server-side seal text detection model, featuring higher accuracy; suitable for deployment on better-equipped servers.
PP-OCRv4_mobile_seal_det | Inference Model/Training Model | 96.47 | 7.82 / 3.09 | 48.28 / 23.97 | 4.6 | PP-OCRv4's mobile seal text detection model, offering higher efficiency; suitable for deployment on edge devices.
Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination
---|---|---|---
Normal Mode | FP32 Precision / No TRT Acceleration | FP32 Precision / 8 Threads | PaddleInference
High-Performance Mode | Optimal combination of pre-selected precision types and acceleration strategies | FP32 Precision / 8 Threads | Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.)
create_pipeline

Parameter | Parameter Description | Parameter Type | Default Value
---|---|---|---
pipeline | The name of the pipeline or the path to the pipeline configuration file. If it is a pipeline name, it must be a pipeline supported by PaddleX. | str | None
device | The device for pipeline inference. Supports specific GPU card numbers such as "gpu:0", other hardware card numbers such as "npu:0", and "cpu" for CPU inference. | str | gpu
use_hpip | Whether to enable the high-performance inference plugin. If set to None, the setting from the configuration file will be used. | bool|None | None
hpi_config | High-performance inference configuration. | dict|None | None
initial_predictor | Whether to initialize the inference modules up front (if False, a module is initialized the first time it is used). | bool | True
visual_predict()

Method of the PP-ChatOCRv4 pipeline object for obtaining visual prediction results. This method returns a generator.

Parameter | Parameter Description | Parameter Type | Default Value
---|---|---|---
input | The data to be predicted, supporting multiple input types. Required. | Python Var|str|list | None
device | The device for pipeline inference. | str|None | None
use_doc_orientation_classify | Whether to use the document orientation classification module. | bool|None | None
use_doc_unwarping | Whether to use the document distortion correction module. | bool|None | None
use_textline_orientation | Whether to use the text line orientation classification module. | bool|None | None
use_general_ocr | Whether to use the OCR sub-pipeline. | bool|None | None
use_seal_recognition | Whether to use the seal recognition sub-pipeline. | bool|None | None
use_table_recognition | Whether to use the table recognition sub-pipeline. | bool|None | None
layout_threshold | The score threshold for the layout model. | float|dict|None | None
layout_nms | Whether to use NMS for layout detection. | bool|None | None
layout_unclip_ratio | The expansion coefficient for layout detection boxes. | float|Tuple[float,float]|dict|None | None
layout_merge_bboxes_mode | The method for filtering overlapping bounding boxes. | str|dict|None | None
text_det_limit_side_len | The side-length limit for text detection images. | int|None | None
text_det_limit_type | The type of side-length limit for text detection images. | str|None | None
text_det_thresh | The pixel threshold for text detection. In the output probability map, pixels with scores above this threshold are treated as text pixels. | float|None | None
text_det_box_thresh | The bounding-box threshold for text detection. A detected box is kept as a text region when the average score of the pixels inside it exceeds this threshold. | float|None | None
text_det_unclip_ratio | The expansion coefficient for text detection; the larger the value, the larger the expanded text region. | float|None | None
text_rec_score_thresh | The text recognition threshold. Text results with scores above this threshold are retained. | float|None | None
seal_det_limit_side_len | The side-length limit for seal detection images. | int|None | None
seal_det_limit_type | The type of side-length limit for seal detection images. | str|None | None
seal_det_thresh | The pixel threshold for seal detection. In the output probability map, pixels with scores above this threshold are treated as seal pixels. | float|None | None
seal_det_box_thresh | The bounding-box threshold for seal detection. A detected box is kept as a seal region when the average score of the pixels inside it exceeds this threshold. | float|None | None
seal_det_unclip_ratio | The expansion coefficient for seal detection; the larger the value, the larger the expanded seal region. | float|None | None
seal_rec_score_thresh | The seal text recognition threshold. Text results with scores above this threshold are retained. | float|None | None
Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value
---|---|---|---|---|---
print() | Prints the result to the terminal | format_json | bool | Whether to format the output content with JSON indentation | True
 |  | indent | int | Specifies the indentation level to beautify the output JSON data for better readability; only valid when format_json is True | 4
 |  | ensure_ascii | bool | Controls whether to escape non-ASCII characters to Unicode escapes. When True, all non-ASCII characters are escaped; when False, the original characters are retained; only valid when format_json is True | False
save_to_json() | Saves the result as a JSON file | save_path | str | The path to save the file. When it is a directory, the saved file name is consistent with the input file name | N/A
 |  | indent | int | Specifies the indentation level to beautify the output JSON data for better readability; only valid when format_json is True | 4
 |  | ensure_ascii | bool | Controls whether to escape non-ASCII characters to Unicode escapes. When True, all non-ASCII characters are escaped; when False, the original characters are retained; only valid when format_json is True | False
save_to_img() | Saves the visualization images of each intermediate module in PNG format | save_path | str | The path to save the file; supports a directory or a file path | N/A
save_to_html() | Saves the tables in the file as HTML files | save_path | str | The path to save the file; supports a directory or a file path | N/A
save_to_xlsx() | Saves the tables in the file as XLSX files | save_path | str | The path to save the file; supports a directory or a file path | N/A
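The `format_json`, `indent`, and `ensure_ascii` parameters behave analogously to the same-named arguments of Python's standard `json.dumps` (this analogy is ours; the pipeline's internal serialization may differ in details). A minimal stdlib sketch of what the two flags change:

```python
import json

data = {"text": "你好"}

# ensure_ascii=True: non-ASCII characters are escaped to \uXXXX sequences
escaped = json.dumps(data, ensure_ascii=True)

# ensure_ascii=False with indent=4: original characters kept, pretty-printed
readable = json.dumps(data, indent=4, ensure_ascii=False)

print(escaped)   # {"text": "\u4f60\u597d"}
print(readable)
```

Setting `ensure_ascii=False` keeps output human-readable for Chinese text, at the cost of requiring a UTF-8-capable consumer.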
Attribute | Attribute Description
---|---
json | Obtain the prediction results in json format
img | Obtain the visualized images in dict format
build_vector()

Method of the PP-ChatOCRv4 pipeline object for constructing vectors from text content.

Parameter | Parameter Description | Parameter Type | Options | Default Value
---|---|---|---|---
visual_info | Visual information, which can be a dictionary containing visual information or a list of such dictionaries | list|dict | None | None
min_characters | Minimum number of characters | int | A positive integer, determined by the token length supported by the large language model | 3500
block_size | Chunk size used when building the vector store for long text | int | A positive integer, determined by the token length supported by the large language model | 300
flag_save_bytes_vector | Whether to save the text as a binary file | bool | True|False | False
retriever_config | Configuration parameters for the vector-retrieval large model, referring to the "LLM_Retriever" field in the pipeline configuration file | dict | None | None
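The `block_size` parameter controls how long text is split into chunks before vectors are built. The actual chunking logic lives inside the pipeline; the helper below is a simplified, hypothetical illustration of fixed-size chunking, not PaddleX's implementation:

```python
def chunk_text(text: str, block_size: int = 300) -> list:
    """Split text into consecutive chunks of at most block_size characters.

    Illustrative sketch of the idea behind block_size; not the
    pipeline's actual chunking implementation.
    """
    return [text[i : i + block_size] for i in range(0, len(text), block_size)]

print(chunk_text("a" * 650, block_size=300))  # three chunks: 300, 300, 50 characters
```

A smaller `block_size` yields more, finer-grained chunks for retrieval; the useful upper bound is tied to the token length the large language model supports, as the Options column notes.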
mllm_pred()

Method of the PP-ChatOCRv4 pipeline object for obtaining multimodal large model extraction results.

Parameter | Parameter Description | Parameter Type | Options | Default Value
---|---|---|---|---
input | Data to be predicted, supporting multiple input types. Required. | Python Var|str | None | None
key_list | A single key or a list of keys used to extract information | Union[str, List[str]] | None | None
mllm_chat_bot_config | Configuration parameters for the multimodal large model, referring to the "MLLM_Chat" field in the pipeline configuration file | dict | None | None
chat()

Method of the PP-ChatOCRv4 pipeline object for extracting key information.

Parameter | Parameter Description | Parameter Type | Options | Default Value
---|---|---|---|---
key_list | A single key or a list of keys used to extract information | Union[str, List[str]] | None | None
visual_info | Visual information results | List[dict] | None | None
use_vector_retrieval | Whether to use vector retrieval | bool | True|False | True
vector_info | Vector information used for retrieval | dict | None | None
min_characters | Minimum number of characters required | int | A positive integer | 3500
text_task_description | Description of the text task | str | None | None
text_output_format | Output format of the text result | str | None | None
text_rules_str | Rules for generating text results | str | None | None
text_few_shot_demo_text_content | Text content for few-shot demonstration | str | None | None
text_few_shot_demo_key_value_list | Key-value list for few-shot demonstration | str | None | None
table_task_description | Description of the table task | str | None | None
table_output_format | Output format of the table result | str | None | None
table_rules_str | Rules for generating table results | str | None | None
table_few_shot_demo_text_content | Text content for table few-shot demonstration | str | None | None
table_few_shot_demo_key_value_list | Key-value list for table few-shot demonstration | str | None | None
mllm_predict_info | Results from the multimodal large language model | dict | None | None
mllm_integration_strategy | Strategy for combining multimodal large model and large language model results; supports using either alone or fusing both | str | "integration", "llm_only", "mllm_only" | "integration"
chat_bot_config | Configuration information for the large language model, referring to the "LLM_Chat" field in the pipeline configuration file | dict | None | None
retriever_config | Configuration parameters for the vector-retrieval large model, referring to the "LLM_Retriever" field in the pipeline configuration file | dict | None | None
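Since `key_list` accepts either a single string or a list of strings (`Union[str, List[str]]`), callers often normalize it before further processing. The helper below is a hypothetical client-side convenience, not part of the PaddleX API:

```python
from typing import List, Union


def normalize_key_list(key_list: "Union[str, List[str]]") -> "List[str]":
    """Return key_list as a list of keys, wrapping a bare string.

    Hypothetical helper illustrating the Union[str, List[str]] contract;
    not part of the PaddleX API.
    """
    if isinstance(key_list, str):
        return [key_list]
    return list(key_list)


print(normalize_key_list("name"))          # ['name']
print(normalize_key_list(["name", "id"]))  # ['name', 'id']
```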
For the main operations provided by the service, when a request is processed successfully, the response status code is 200 and the response body has the following properties:

Name | Type | Meaning
---|---|---
logId | string | UUID of the request.
errorCode | integer | Error code. Fixed at 0.
errorMsg | string | Error description. Fixed at "Success".
result | object | Operation result.
When a request is not processed successfully, the response body has the following properties:

Name | Type | Meaning
---|---|---
logId | string | UUID of the request.
errorCode | integer | Error code. Same as the response status code.
errorMsg | string | Error description.
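Given the success and failure schemas above, a client can fold status and error-code checking into one function. The helper below is a hypothetical client-side convenience, not part of the service:

```python
def extract_result(status_code: int, body: dict) -> dict:
    """Return body["result"] on success; raise with errorMsg otherwise.

    Hypothetical helper based on the response schema above: success means
    HTTP 200 and errorCode fixed at 0.
    """
    if status_code != 200 or body.get("errorCode") != 0:
        raise RuntimeError(
            f"Request failed (status {status_code}, "
            f"errorCode {body.get('errorCode')}): {body.get('errorMsg')}"
        )
    return body["result"]
```

This keeps the four per-endpoint calls in the example script at the end of this document from repeating the same status check.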
The main operations provided by the service are as follows:
analyzeImages
Analyzes images using computer vision models to obtain OCR, table recognition results, etc., and extracts key information from the images.
POST /chatocr-visual
Name | Type | Meaning | Required
---|---|---|---
file | string | URL of an image or PDF file accessible to the server, or the Base64-encoded content of such a file. By default, for PDF files exceeding 10 pages, only the first 10 pages are processed. To remove the page limit, add the relevant configuration to the pipeline configuration file. | Yes
fileType | integer | null | File type. 0 represents a PDF file, 1 represents an image file. If this property is absent from the request body, the file type is inferred from the URL. | No
useDocOrientationClassify | boolean | null | Please refer to the description of the use_doc_orientation_classify parameter of the pipeline object's visual_predict method. | No
useDocUnwarping | boolean | null | Please refer to the description of the use_doc_unwarping parameter of the pipeline object's visual_predict method. | No
useSealRecognition | boolean | null | Please refer to the description of the use_seal_recognition parameter of the pipeline object's visual_predict method. | No
useTableRecognition | boolean | null | Please refer to the description of the use_table_recognition parameter of the pipeline object's visual_predict method. | No
layoutThreshold | number | null | Please refer to the description of the layout_threshold parameter of the pipeline object's visual_predict method. | No
layoutNms | boolean | null | Please refer to the description of the layout_nms parameter of the pipeline object's visual_predict method. | No
layoutUnclipRatio | number | array | object | null | Please refer to the description of the layout_unclip_ratio parameter of the pipeline object's visual_predict method. | No
layoutMergeBboxesMode | string | object | null | Please refer to the description of the layout_merge_bboxes_mode parameter of the pipeline object's visual_predict method. | No
textDetLimitSideLen | integer | null | Please refer to the description of the text_det_limit_side_len parameter of the pipeline object's visual_predict method. | No
textDetLimitType | string | null | Please refer to the description of the text_det_limit_type parameter of the pipeline object's visual_predict method. | No
textDetThresh | number | null | Please refer to the description of the text_det_thresh parameter of the pipeline object's visual_predict method. | No
textDetBoxThresh | number | null | Please refer to the description of the text_det_box_thresh parameter of the pipeline object's visual_predict method. | No
textDetUnclipRatio | number | null | Please refer to the description of the text_det_unclip_ratio parameter of the pipeline object's visual_predict method. | No
textRecScoreThresh | number | null | Please refer to the description of the text_rec_score_thresh parameter of the pipeline object's visual_predict method. | No
sealDetLimitSideLen | integer | null | Please refer to the description of the seal_det_limit_side_len parameter of the pipeline object's visual_predict method. | No
sealDetLimitType | string | null | Please refer to the description of the seal_det_limit_type parameter of the pipeline object's visual_predict method. | No
sealDetThresh | number | null | Please refer to the description of the seal_det_thresh parameter of the pipeline object's visual_predict method. | No
sealDetBoxThresh | number | null | Please refer to the description of the seal_det_box_thresh parameter of the pipeline object's visual_predict method. | No
sealDetUnclipRatio | number | null | Please refer to the description of the seal_det_unclip_ratio parameter of the pipeline object's visual_predict method. | No
sealRecScoreThresh | number | null | Please refer to the description of the seal_rec_score_thresh parameter of the pipeline object's visual_predict method. | No
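The `file` field is plain Base64, so building a request body requires only the standard library. A sketch (the helper name is ours, not part of the API; only the `file`/`fileType` fields come from the table above):

```python
import base64


def build_visual_payload(path: str, file_type: int = 1) -> dict:
    """Read a local file and build the analyzeImages request body.

    file_type: 0 for PDF, 1 for image, per the table above.
    Hypothetical helper; optional parameters can be merged into the dict.
    """
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    return {"file": data, "fileType": file_type}
```

The same dict can then be extended with any of the optional properties, e.g. `payload["useSealRecognition"] = False`.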
The result of the response body has the following properties:

Name | Type | Meaning
---|---|---
layoutParsingResults | array | Analysis results obtained using computer vision models. The array length is 1 (for image input) or equal to the number of document pages actually processed (for PDF input). For PDF input, each element corresponds to one processed page of the PDF file.
visualInfo | array | Key information in the image, which can be used as input for other operations.
dataInfo | object | Input data information.
Each element in layoutParsingResults is an object with the following properties:

Name | Type | Meaning
---|---|---
prunedResult | object | A simplified version of the res field in the JSON representation of the results generated by the pipeline's visual_predict method, with the input_path and page_index fields removed.
outputImages | object | null | Refer to the description of the img attribute of the pipeline's visual prediction result.
inputImage | string | null | Input image. The image is in JPEG format and encoded using Base64.
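Since `outputImages` maps image names to Base64-encoded image data (decoded with `base64.b64decode` in the example script at the end of this document), saving them is a simple decode loop. A hypothetical helper:

```python
import base64
import os


def save_output_images(layout_parsing_results: list, out_dir: str = ".") -> list:
    """Decode the Base64 outputImages of each page result to .jpg files.

    Hypothetical helper; field names follow the layoutParsingResults
    schema above. Returns the list of written file paths.
    """
    saved = []
    for i, res in enumerate(layout_parsing_results):
        for name, b64 in (res.get("outputImages") or {}).items():
            path = os.path.join(out_dir, f"{name}_{i}.jpg")
            with open(path, "wb") as f:
                f.write(base64.b64decode(b64))
            saved.append(path)
    return saved
```

The `or {}` guard covers the `null` case that the table above allows for `outputImages`.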
buildVectorStore

Builds a vector database.

POST /chatocr-vector

Name | Type | Meaning | Required
---|---|---|---
visualInfo | array | Key information in the image. Provided by the analyzeImages operation. | Yes
minCharacters | integer | null | Minimum data length required to enable the vector database. | No
blockSize | integer | null | Please refer to the description of the block_size parameter of the pipeline object's build_vector method. | No
retrieverConfig | object | null | Please refer to the description of the retriever_config parameter of the pipeline object's build_vector method. | No
The result of the response body has the following properties:

Name | Type | Meaning
---|---|---
vectorInfo | object | Serialized result of the vector database, which can be used as input for other operations.
invokeMLLM

Invokes the MLLM.

POST /chatocr-mllm

Name | Type | Meaning | Required
---|---|---|---
image | string | URL of an image file accessible to the server, or the Base64-encoded content of the image file. | Yes
keyList | array | List of keys. | Yes
mllmChatBotConfig | object | null | Please refer to the description of the mllm_chat_bot_config parameter of the pipeline object's mllm_pred method. | No
The result of the response body has the following property:

Name | Type | Meaning
---|---|---
mllmPredictInfo | object | MLLM invocation result.
chat

Interacts with large language models to extract key information.

POST /chatocr-chat

Name | Type | Meaning | Required
---|---|---|---
keyList | array | List of keys. | Yes
visualInfo | object | Key information in the image. Provided by the analyzeImages operation. | Yes
useVectorRetrieval | boolean | null | Please refer to the description of the use_vector_retrieval parameter of the pipeline object's chat method. | No
vectorInfo | object | null | Serialized result of the vector database. Provided by the buildVectorStore operation. Please note that the deserialization process involves an unpickle operation; to prevent malicious attacks, be sure to use data from trusted sources. | No
minCharacters | integer | Minimum data length required to enable the vector database. | No
textTaskDescription | string | null | Please refer to the description of the text_task_description parameter of the pipeline object's chat method. | No
textOutputFormat | string | null | Please refer to the description of the text_output_format parameter of the pipeline object's chat method. | No
textRulesStr | string | null | Please refer to the description of the text_rules_str parameter of the pipeline object's chat method. | No
textFewShotDemoTextContent | string | null | Please refer to the description of the text_few_shot_demo_text_content parameter of the pipeline object's chat method. | No
textFewShotDemoKeyValueList | string | null | Please refer to the description of the text_few_shot_demo_key_value_list parameter of the pipeline object's chat method. | No
tableTaskDescription | string | null | Please refer to the description of the table_task_description parameter of the pipeline object's chat method. | No
tableOutputFormat | string | null | Please refer to the description of the table_output_format parameter of the pipeline object's chat method. | No
tableRulesStr | string | null | Please refer to the description of the table_rules_str parameter of the pipeline object's chat method. | No
tableFewShotDemoTextContent | string | null | Please refer to the description of the table_few_shot_demo_text_content parameter of the pipeline object's chat method. | No
tableFewShotDemoKeyValueList | string | null | Please refer to the description of the table_few_shot_demo_key_value_list parameter of the pipeline object's chat method. | No
mllmPredictInfo | object | null | MLLM invocation result. Provided by the invokeMLLM operation. | No
mllmIntegrationStrategy | string | null | Please refer to the description of the mllm_integration_strategy parameter of the pipeline object's chat method. | No
chatBotConfig | object | null | Please refer to the description of the chat_bot_config parameter of the pipeline object's chat method. | No
retrieverConfig | object | null | Please refer to the description of the retriever_config parameter of the pipeline object's chat method. | No
The result of the response body has the following properties:

Name | Type | Meaning
---|---|---
chatResult | object | Key information extraction result.
```python
# This script only shows the use case for images. For other file types,
# please read the API reference and adjust accordingly.
import base64
import pprint
import sys

import requests

API_BASE_URL = "http://0.0.0.0:8080"

image_path = "./demo.jpg"
keys = ["name"]

# Read the input image and encode it as Base64 for the request body.
with open(image_path, "rb") as file:
    image_bytes = file.read()
    image_data = base64.b64encode(image_bytes).decode("ascii")

# Step 1: visual analysis (OCR, table recognition, etc.).
payload = {
    "file": image_data,
    "fileType": 1,
}
resp_visual = requests.post(url=f"{API_BASE_URL}/chatocr-visual", json=payload)
if resp_visual.status_code != 200:
    print(
        f"Request to chatocr-visual failed with status code {resp_visual.status_code}."
    )
    pprint.pp(resp_visual.json())
    sys.exit(1)
result_visual = resp_visual.json()["result"]

for i, res in enumerate(result_visual["layoutParsingResults"]):
    print(res["prunedResult"])
    for img_name, img in res["outputImages"].items():
        img_path = f"{img_name}_{i}.jpg"
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

# Step 2: build the vector database from the visual information.
payload = {
    "visualInfo": result_visual["visualInfo"],
}
resp_vector = requests.post(url=f"{API_BASE_URL}/chatocr-vector", json=payload)
if resp_vector.status_code != 200:
    print(
        f"Request to chatocr-vector failed with status code {resp_vector.status_code}."
    )
    pprint.pp(resp_vector.json())
    sys.exit(1)
result_vector = resp_vector.json()["result"]

# Step 3: multimodal large model extraction.
payload = {
    "image": image_data,
    "keyList": keys,
}
resp_mllm = requests.post(url=f"{API_BASE_URL}/chatocr-mllm", json=payload)
if resp_mllm.status_code != 200:
    print(
        f"Request to chatocr-mllm failed with status code {resp_mllm.status_code}."
    )
    pprint.pp(resp_mllm.json())
    sys.exit(1)
result_mllm = resp_mllm.json()["result"]

# Step 4: chat with the large language model to extract key information.
payload = {
    "keyList": keys,
    "visualInfo": result_visual["visualInfo"],
    "useVectorRetrieval": True,
    "vectorInfo": result_vector["vectorInfo"],
    "mllmPredictInfo": result_mllm["mllmPredictInfo"],
}
resp_chat = requests.post(url=f"{API_BASE_URL}/chatocr-chat", json=payload)
if resp_chat.status_code != 200:
    print(
        f"Request to chatocr-chat failed with status code {resp_chat.status_code}."
    )
    pprint.pp(resp_chat.json())
    sys.exit(1)
result_chat = resp_chat.json()["result"]
print("Final result:")
print(result_chat["chatResult"])
```