English | [简体中文](README_ch.md)
- [1. Introduction](#1-introduction)
- [2. Performance](#2-performance)
- [3. Effect demo](#3-effect-demo)
- [3.1 SER](#31-ser)
- [3.2 RE](#32-re)
- [4. Install](#4-install)
- [4.1 Install dependencies](#41-install-dependencies)
- [4.2 Install PaddleOCR](#42-install-paddleocr)
- [5. Usage](#5-usage)
- [5.1 Data and Model Preparation](#51-data-and-model-preparation)
- [5.2 SER](#52-ser)
- [5.3 RE](#53-re)
- [6. Reference Links](#6-reference-links)
- [License](#license)
# Document Visual Question Answering
## 1. Introduction
VQA (Visual Question Answering) is the task of asking and answering questions about image content. DOC-VQA is a subtask of VQA in which the questions concern the text content of document images.

The DOC-VQA algorithms in PP-Structure are developed on top of the PaddleNLP natural language processing library.

The main features are as follows:
- Integrates the [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) model and the PP-OCR prediction engine.
- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. The SER task recognizes and classifies the text in an image; the RE task extracts relationships between pieces of text, such as matching question-answer pairs.
- Supports custom training for SER tasks and RE tasks.
- Supports end-to-end system prediction and evaluation of OCR+SER.
- Supports end-to-end system prediction of OCR+SER+RE.

This project is an open-source implementation of [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2, including fine-tuning code on the [XFUND dataset](https://github.com/doc-analysis/XFUND).
## 2. Performance
We evaluate the algorithms on the Chinese dataset of [XFUND](https://github.com/doc-analysis/XFUND); the performance is as follows:

| Model | Task | hmean | Model download address |
|:---:|:---:|:---:|:---:|
| LayoutXLM | SER | 0.9038 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar) |
| LayoutXLM | RE | 0.7483 | [link](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar) |
| LayoutLMv2 | SER | 0.8544 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLMv2_xfun_zh.tar) |
| LayoutLMv2 | RE | 0.6777 | [link](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutLMv2_xfun_zh.tar) |
| LayoutLM | SER | 0.7731 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLM_xfun_zh.tar) |
## 3. Effect demo
**Note:** The test images are from the XFUND dataset.
<a name="31"></a>
### 3.1 SER
 | 
2021-12-06 21:01:15 +08:00
---|---
Boxes with different colors in the figure represent different categories. For the XFUND dataset, there are 3 categories: `QUESTION`, `ANSWER`, and `HEADER`:

* Dark purple: HEADER
* Light purple: QUESTION
* Army green: ANSWER

The corresponding category and the OCR recognition result are also marked at the top left of each OCR detection box.
<a name="32"></a>
### 3.2 RE
 | 
2021-12-06 21:01:15 +08:00
---|---
The red boxes in the figure represent questions, the blue boxes represent answers, and each question is connected to its answer by a green line. The corresponding category and the OCR recognition result are also marked at the top left of each OCR detection box.
## 4. Install
### 4.1 Install dependencies
- **(1) Install PaddlePaddle**
```bash
python3 -m pip install --upgrade pip
# GPU installation
python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple
# CPU installation
python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple
```
For more requirements, please refer to the instructions in the [Installation Documentation](https://www.paddlepaddle.org.cn/install/quick).
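
To verify that PaddlePaddle is installed correctly, you can run its built-in environment check:

```bash
# run PaddlePaddle's built-in installation self-check
python3 -c "import paddle; paddle.utils.run_check()"
```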
### 4.2 Install PaddleOCR
- **(1) pip install PaddleOCR whl package quickly (prediction only)**
```bash
python3 -m pip install paddleocr
```
- **(2) Download VQA source code (prediction + training)**
```bash
# [Recommended]
git clone https://github.com/PaddlePaddle/PaddleOCR
# If the pull fails due to network problems, you can also use the mirror hosted on Gitee:
git clone https://gitee.com/paddlepaddle/PaddleOCR
# Note: The Gitee mirror may not sync with this GitHub project in real time; there can be a delay of 3 to 5 days. Please prefer the recommended method.
```
- **(3) Install VQA's `requirements`**
```bash
python3 -m pip install -r ppstructure/vqa/requirements.txt
```
## 5. Usage
### 5.1 Data and Model Preparation
If you want to experience the prediction process directly, you can download the pretrained models provided by us, skip the training process, and predict directly.
* Download the processed dataset

The download address of the processed XFUND Chinese dataset: [link](https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar).

Download and unzip the dataset, and place it in the current directory:
```shell
wget https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar
tar -xf XFUND.tar
```
* Convert the dataset

If you need to train on other XFUND datasets, you can use the following command to convert them:
```bash
python3 ppstructure/vqa/tools/trans_xfun_data.py --ori_gt_path=path/to/json_path --output_path=path/to/save_path
```
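
For reference, each line of the converted label file pairs an image name with a JSON annotation. The sketch below is a hypothetical, abbreviated illustration of that shape; the field names and values are illustrative, not an exact specification:

```bash
# Hypothetical, abbreviated shape of one converted label line (tab-separated):
# zh_train_0.jpg	{"height": 3508, "width": 2480, "ocr_info": [{"text": "姓名", "label": "question", "bbox": [104, 114, 530, 175], "id": 1, "linking": [[1, 2]]}]}
```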
* Download the pretrained models
```bash
mkdir pretrain && cd pretrain
# download the SER model
wget https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar && tar -xvf ser_LayoutXLM_xfun_zh.tar
# download the RE model
wget https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar && tar -xvf re_LayoutXLM_xfun_zh.tar
cd ../
```
<a name="52"></a>
### 5.2 SER
Before starting training, you need to modify the following four fields in the configuration file (alternatively, they can be overridden on the command line, as sketched after this list):

1. `Train.dataset.data_dir`: points to the directory where the training set images are stored
2. `Train.dataset.label_file_list`: points to the training set label file
3. `Eval.dataset.data_dir`: points to the directory where the validation set images are stored
4. `Eval.dataset.label_file_list`: points to the validation set label file
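
As an alternative to editing the YAML file, the same fields can be passed as `-o` overrides, following the override convention used elsewhere in this document. The sketch below assumes the directory layout of the XFUND download above; the paths are placeholders to adjust to your data:

```bash
# Sketch: overriding the four dataset fields on the command line (placeholder paths)
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml \
    -o Train.dataset.data_dir=XFUND/zh_train/image \
       Train.dataset.label_file_list=[XFUND/zh_train/xfun_normalize_train.json] \
       Eval.dataset.data_dir=XFUND/zh_val/image \
       Eval.dataset.label_file_list=[XFUND/zh_val/xfun_normalize_val.json]
```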
* start training
```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml
```
Finally, `precision`, `recall`, `hmean` and other metrics will be printed.
The training log, the best model, and the model of the latest epoch will be saved in the `./output/ser_layoutxlm/` folder.
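
The output directory typically looks like the sketch below; the exact file names depend on the PaddleOCR version, with `best_accuracy` and `latest` being the usual checkpoint prefixes:

```bash
# Typical contents of ./output/ser_layoutxlm/ (names may vary by version)
# train.log               training log
# best_accuracy.pdparams  weights of the best model on the validation set
# best_accuracy.pdopt     optimizer state of the best model
# latest.pdparams         weights of the most recent epoch
# latest.pdopt            optimizer state of the most recent epoch
```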
* resume training

To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field.
```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```
* evaluate

Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field.
```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```
Finally, `precision`, `recall`, `hmean` and other metrics will be printed.
* `OCR + SER` tandem prediction based on the training engine

Use the following command to run `OCR engine + SER` tandem prediction, taking the LayoutXLM-based SER model as an example:
```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/infer_vqa_token_ser.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.infer_img=ppstructure/docs/vqa/input/zh_val_42.jpg
```
Finally, the visualized prediction image and the prediction result text file will be saved in the directory configured by the `config.Global.save_res_path` field. The prediction result text file is named `infer_results.txt`.
* End-to-end evaluation of the `OCR + SER` prediction system

First use the `tools/infer_vqa_token_ser.py` script to run prediction on the dataset, then use the following command to evaluate:
```shell
export CUDA_VISIBLE_DEVICES=0
python3 tools/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt
```
* export model

Use the following command to export the SER model, taking the LayoutXLM-based SER model as an example:
```shell
python3.7 tools/export_model.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.save_inference_dir=output/ser/infer
```
The converted model will be stored in the directory specified by the `Global.save_inference_dir` field.
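
The exported directory typically contains the standard Paddle inference-format files:

```bash
# Typical contents of output/ser/infer/ after export (standard Paddle inference format)
# inference.pdmodel          model structure
# inference.pdiparams        model weights
# inference.pdiparams.info   additional parameter information
```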
* `OCR + SER` tandem prediction based on the prediction engine

Use the following command to run `OCR + SER` tandem prediction based on the prediction engine, taking the LayoutXLM-based SER model as an example:
```shell
cd ppstructure
CUDA_VISIBLE_DEVICES=0 python3.7 vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM --ser_model_dir=../output/ser/infer --ser_dict_path=../train_data/XFUND/class_list_xfun.txt --vis_font_path=../doc/fonts/simfang.ttf --image_dir=docs/vqa/input/zh_val_42.jpg --output=output
```
After the prediction is successful, the visualized images and results will be saved in the directory specified by the `output` field.
<a name="53"></a>
### 5.3 RE
* start training

Before starting training, you need to modify the following four fields in the configuration file:

1. `Train.dataset.data_dir`: points to the directory where the training set images are stored
2. `Train.dataset.label_file_list`: points to the training set label file
3. `Eval.dataset.data_dir`: points to the directory where the validation set images are stored
4. `Eval.dataset.label_file_list`: points to the validation set label file
```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml
```
Finally, `precision`, `recall`, `hmean` and other metrics will be printed.
The training log, the best model, and the model of the latest epoch will be saved in the `./output/re_layoutxlm/` folder.
* resume training

To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field.
```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```
* evaluate

Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field.
```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```
Finally, `precision`, `recall`, `hmean` and other metrics will be printed.
* `OCR engine + SER + RE` tandem prediction

Use the following command to run `OCR engine + SER + RE` tandem prediction, taking the pretrained SER and RE models as an example:
```shell
export CUDA_VISIBLE_DEVICES=0
python3 tools/infer_vqa_token_ser_re.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/re_LayoutXLM_xfun_zh/ Global.infer_img=ppstructure/docs/vqa/input/zh_val_21.jpg -c_ser configs/vqa/ser/layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/
```
Finally, the visualized prediction image and the prediction result text file will be saved in the directory configured by the `config.Global.save_res_path` field. The prediction result text file is named `infer_results.txt`.
* export model

coming soon

* `OCR + SER + RE` tandem prediction based on the prediction engine

coming soon
## 6. Reference Links
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
- XFUND dataset, https://github.com/doc-analysis/XFUND
## License
The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.