# Training and Testing
To meet diverse requirements, MMOCR supports training and testing models on a variety of devices, including PCs, workstations, and computation clusters.
## Single GPU Training and Testing
### Training
`tools/train.py` provides the basic training service. MMOCR recommends using GPUs for model training and testing, but CPU-only training and testing are also supported. For example, the following commands demonstrate how to train a DBNet model using a single GPU or the CPU.
```bash
# Train the specified MMOCR model by calling tools/train.py
CUDA_VISIBLE_DEVICES= python tools/train.py ${CONFIG_FILE} [PY_ARGS]
# Training
# Example 1: Training DBNet with CPU
CUDA_VISIBLE_DEVICES=-1 python tools/train.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py
# Example 2: Specify to train DBNet with gpu:0, specify the working directory as dbnet/, and turn on mixed precision (amp) training
CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py --work-dir dbnet/ --amp
```
```{note}
If multiple GPUs are available, you can specify a certain GPU, e.g. the third one, by setting CUDA_VISIBLE_DEVICES=3.
```
The following table lists all the arguments supported by `train.py`. Args without the `--` prefix are mandatory, while others are optional.
| ARGS | Type | Description |
| --------------- | ---- | --------------------------------------------------------------------------- |
| config | str | (required) Path to config. |
| --work-dir | str | Specify the working directory for the training logs and models checkpoints. |
| --resume | bool | Whether to resume training from the latest checkpoint. |
| --amp | bool | Whether to use automatic mixture precision for training. |
| --auto-scale-lr | bool | Whether to use automatic learning rate scaling. |
| --cfg-options   | str  | Override some settings in the config. |
| --launcher      | str  | Options for the launcher: \['none', 'pytorch', 'slurm', 'mpi'\]. |
| --local_rank    | int  | Rank of the local machine, used for distributed training. Defaults to 0. |
| --tta | bool | Whether to use test time augmentation. |
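As a concrete illustration of `--cfg-options`: it takes space-separated `key=value` pairs in dotted notation, e.g. `--cfg-options optim_wrapper.optimizer.lr=0.001` (the key path must exist in your config; this one is illustrative). Each dotted key walks into the nested config and replaces the leaf value, roughly as in this sketch (not MMEngine's actual implementation):

```python
def apply_override(cfg: dict, dotted_key: str, value):
    """Walk a dotted key path into a nested dict and override the leaf value."""
    *parents, leaf = dotted_key.split(".")
    node = cfg
    for key in parents:
        node = node[key]
    node[leaf] = value

# Illustrative config fragment
cfg = {"optim_wrapper": {"optimizer": {"type": "SGD", "lr": 0.007}}}
apply_override(cfg, "optim_wrapper.optimizer.lr", 0.001)
print(cfg["optim_wrapper"]["optimizer"]["lr"])  # 0.001
```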
### Test
`tools/test.py` provides the basic testing service, which is used in a similar way to the training script. For example, the following command demonstrates how to test a DBNet model on a single GPU or the CPU.
```bash
# Test a pretrained MMOCR model by calling tools/test.py
CUDA_VISIBLE_DEVICES= python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]
# Test
# Example 1: Testing DBNet with CPU
CUDA_VISIBLE_DEVICES=-1 python tools/test.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth
# Example 2: Testing DBNet on gpu:0
CUDA_VISIBLE_DEVICES=0 python tools/test.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth
```
The following table lists all the arguments supported by `test.py`. Args without the `--` prefix are mandatory, while others are optional.
| ARGS | Type | Description |
| ------------- | ----- | -------------------------------------------------------------------- |
| config | str | (required) Path to config. |
| checkpoint | str | (required) The model to be tested. |
| --work-dir | str | Specify the working directory for the logs. |
| --save-preds | bool | Whether to save the predictions to a pkl file. |
| --show | bool | Whether to visualize the predictions. |
| --show-dir | str | Path to save the visualization results. |
| --wait-time | float | Interval of visualization (s), defaults to 2. |
| --cfg-options | str   | Override some settings in the config. |
| --launcher    | str   | Options for the launcher: \['none', 'pytorch', 'slurm', 'mpi'\]. |
| --local_rank  | int   | Rank of the local machine, used for distributed training. Defaults to 0. |
## Training and Testing with Multiple GPUs
For large models, distributed training or testing significantly improves efficiency. For this purpose, MMOCR provides the distributed scripts `tools/dist_train.sh` and `tools/dist_test.sh`, implemented based on [MMDistributedDataParallel](mmengine.model.wrappers.MMDistributedDataParallel).
```bash
# Training
NNODES=${NNODES} NODE_RANK=${NODE_RANK} PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]
# Testing
NNODES=${NNODES} NODE_RANK=${NODE_RANK} PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]
```
The following table lists the arguments supported by `dist_*.sh`.
| ARGS | Type | Description |
| --------------- | ---- | --------------------------------------------------------------------------------------------- |
| NNODES | int | The number of nodes. Defaults to 1. |
| NODE_RANK | int | The rank of current node. Defaults to 0. |
| PORT | int | The master port that will be used by rank 0 node, ranging from 0 to 65535. Defaults to 29500. |
| MASTER_ADDR | str | The address of rank 0 node. Defaults to "127.0.0.1". |
| CONFIG_FILE | str | (required) The path to config. |
| CHECKPOINT_FILE | str  | (required, only used in dist_test.sh) Path to the checkpoint to be tested. |
| GPU_NUM | int | (required) The number of GPUs to be used per node. |
| \[PY_ARGS\] | str | Arguments to be parsed by tools/train.py and tools/test.py. |
These two scripts enable training and testing on **single-machine multi-GPU** or **multi-machine multi-GPU**. See the following example for usage.
### Single-machine Multi-GPU
The following commands demonstrate how to train and test with a specified number of GPUs on a **single machine** with multiple GPUs.
1. **Training**
Training DBNet using 4 GPUs on a single machine.
```bash
tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 4
```
2. **Testing**
Testing DBNet using 4 GPUs on a single machine.
```bash
tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 4
```
### Launching Multiple Tasks on Single Machine
For a workstation equipped with multiple GPUs, the user can launch multiple tasks simultaneously by specifying the GPU IDs. For example, the following command demonstrates how to test DBNet with GPU `[0, 1, 2, 3]` and train CRNN on GPU `[4, 5, 6, 7]`.
```bash
# Specify gpu:0,1,2,3 for testing and assign port number 29500
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 4
# Specify gpu:4,5,6,7 for training and assign port number 29501
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh configs/textrecog/crnn/crnn_academic_dataset.py 4
```
```{note}
`dist_train.sh` sets `MASTER_PORT` to `29500` by default. When other processes already occupy this port, the program will get a runtime error `RuntimeError: Address already in use`. In this case, you need to set `MASTER_PORT` to another free port number in the range of `(0~65535)`.
```
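To avoid such collisions, you can ask the OS for a currently unused port before launching. The helper below is a generic Python sketch, not part of MMOCR:

```python
import socket

def find_free_port() -> int:
    """Bind to port 0 so the OS picks an unused TCP port, then release it."""
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]

print(find_free_port())  # e.g. 51273
```

Pass the printed number to the launcher via `PORT=...`. Note there is a small race window between this helper releasing the port and the launcher binding it.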
### Multi-machine Multi-GPU Training and Testing
You can launch a task on multiple machines connected to the same network. MMOCR relies on the `torch.distributed` package for distributed training. Find more information in PyTorch's [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
1. **Training**
The following command demonstrates how to train DBNet on two machines with a total of 4 GPUs.
```bash
# Say that you want to launch the training job on two machines
# On the first machine:
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 2
# On the second machine:
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 2
```
2. **Testing**
The following command demonstrates how to test DBNet on two machines with a total of 4 GPUs.
```bash
# Say that you want to launch the testing job on two machines
# On the first machine:
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 2
# On the second machine:
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 2
```
```{note}
The speed of the network could be the bottleneck of training.
```
## Training and Testing with Slurm Cluster
If you run MMOCR on a cluster managed with [Slurm](https://slurm.schedmd.com/), you can use the script `tools/slurm_train.sh` and `tools/slurm_test.sh`.
```bash
# tools/slurm_train.sh provides a script for submitting training tasks to clusters managed by Slurm
GPUS=${GPUS} GPUS_PER_NODE=${GPUS_PER_NODE} CPUS_PER_TASK=${CPUS_PER_TASK} SRUN_ARGS=${SRUN_ARGS} ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]
# tools/slurm_test.sh provides a script for submitting testing tasks to clusters managed by Slurm
GPUS=${GPUS} GPUS_PER_NODE=${GPUS_PER_NODE} CPUS_PER_TASK=${CPUS_PER_TASK} SRUN_ARGS=${SRUN_ARGS} ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} ${WORK_DIR} [PY_ARGS]
```
| ARGS | Type | Description |
| --------------- | ---- | ----------------------------------------------------------------------------------------------------------- |
| GPUS | int | The number of GPUs to be used by this task. Defaults to 8. |
| GPUS_PER_NODE | int | The number of GPUs to be allocated per node. Defaults to 8. |
| CPUS_PER_TASK | int | The number of CPUs to be allocated per task. Defaults to 5. |
| SRUN_ARGS | str | Arguments to be parsed by srun. Available options can be found [here](https://slurm.schedmd.com/srun.html). |
| PARTITION | str | (required) Specify the partition on cluster. |
| JOB_NAME | str | (required) Name of the submitted job. |
| WORK_DIR | str | (required) Specify the working directory for saving the logs and checkpoints. |
| CHECKPOINT_FILE | str  | (required, only used in slurm_test.sh) Path to the checkpoint to be tested. |
| PY_ARGS | str | Arguments to be parsed by `tools/train.py` and `tools/test.py`. |
These scripts enable training and testing on Slurm clusters; see the following examples.
1. Training
Here is an example of using 1 GPU to train a DBNet model on the `dev` partition.
```bash
# Example: Request 1 GPU resource on dev partition for DBNet training task
GPUS=1 GPUS_PER_NODE=1 CPUS_PER_TASK=5 tools/slurm_train.sh dev db_r50 configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py work_dir
```
2. Testing
Similarly, the following example requests 1 GPU for testing.
```bash
# Example: Request 1 GPU resource on dev partition for DBNet testing task
GPUS=1 GPUS_PER_NODE=1 CPUS_PER_TASK=5 tools/slurm_test.sh dev db_r50 configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth work_dir
```
## Advanced Tips
### Resume Training from a Checkpoint
`tools/train.py` allows users to resume training from a checkpoint by specifying the `--resume` parameter, where it will automatically resume training from the latest saved checkpoint.
```bash
# Example: Resuming training from the latest checkpoint
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --resume
```
By default, the program resumes from the last checkpoint saved in the previous training session, i.e. `latest.pth`. Alternatively, you can specify the checkpoint to load via `load_from` in the configuration file:
```python
# Example: Set the path of the checkpoint you want to load in the configuration file
load_from = 'work_dir/dbnet/models/epoch_10000.pth'
```
### Mixed Precision Training
Mixed precision training offers significant computational speedup by performing operations in half-precision format, while keeping single-precision copies of critical parts of the network to retain as much information as possible. In MMOCR, users can enable automatic mixed precision training by simply adding `--amp`.
```bash
# Example: Using automatic mixed precision training
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --amp
```
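To see why the single-precision copies matter, note that fp16 has only about 11 bits of mantissa, so a small weight update can be rounded away entirely. The snippet below is a self-contained illustration using Python's half-precision `struct` format, not MMOCR code:

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a value through IEEE 754 half precision ('e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

weight, update = 1.0, 1e-4
# Near 1.0 the fp16 spacing is 2**-10 ~= 0.00098, so an update of 1e-4
# rounds back to the original weight -- the update is silently lost.
print(to_fp16(weight + update) == weight)  # True
```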
The following table shows the support of each algorithm in MMOCR for automatic mixed precision training.
| Model | Whether support AMP | Description |
| ------------- | :-----------------: | :-------------------------------------: |
| **Text Detection** | | |
| DBNet | Y | |
| DBNetpp | Y | |
| DRRG | N | roi_align_rotated does not support fp16 |
| FCENet | N | BCELoss does not support fp16 |
| Mask R-CNN | Y | |
| PANet | Y | |
| PSENet | Y | |
| TextSnake | N | |
| **Text Recognition** | | |
| ABINet | Y | |
| CRNN | Y | |
| MASTER | Y | |
| NRTR | Y | |
| RobustScanner | Y | |
| SAR | Y | |
| SATRN | Y | |
### Automatic Learning Rate Scaling
MMOCR sets a default initial learning rate for each model in its configuration file. However, these initial learning rates may not be appropriate when the user adopts a `batch_size` different from our preset `base_batch_size`. Therefore, we provide a tool to automatically scale the learning rate, enabled by adding the `--auto-scale-lr` flag.
```bash
# Example: Using automatic learning rate scaling
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --auto-scale-lr
```
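The scaling follows the linear scaling rule: the initial learning rate is multiplied by the ratio of your total batch size to the config's `base_batch_size`. The arithmetic can be sketched as follows (values are illustrative; MMEngine performs the actual scaling):

```python
def scale_lr(base_lr: float, batch_size: int, base_batch_size: int) -> float:
    """Linear scaling rule: the LR grows proportionally with total batch size."""
    return base_lr * batch_size / base_batch_size

# e.g. a config tuned with base_lr=0.007 at base_batch_size=16,
# trained with a total batch size of 64 across all GPUs
print(scale_lr(0.007, 64, 16))  # 0.028
```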
### Visualize the Predictions
`tools/test.py` provides the visualization interface to facilitate the qualitative analysis of the OCR models.
<div align="center">
![Detection](../../../demo/resources/det_vis.png)
(Green boxes are GTs, while red boxes are predictions)
</div>
<div align="center">
![Recognition](../../../demo/resources/rec_vis.png)
(Green font is the GT, red font is the prediction)
</div>
<div align="center">
![KIE](../../../demo/resources/kie_vis.png)
(From left to right: original image, text detection and recognition result, text classification result, relationship)
</div>
```bash
# Example 1: Show the visualization results per 2 seconds
python tools/test.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth --show --wait-time 2
# Example 2: For systems that do not support graphical interfaces (such as computing clusters, etc.), the visualization results can be dumped in the specified path
python tools/test.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth --show-dir ./vis_results
```
The visualization-related parameters in `tools/test.py` are described as follows.
| ARGS | Type | Description |
| ----------- | ----- | --------------------------------------------- |
| --show | bool | Whether to show the visualization results. |
| --show-dir | str | Path to save the visualization results. |
| --wait-time | float | Interval of visualization (s), defaults to 2. |
### Test Time Augmentation
Test time augmentation (TTA) is a simple yet effective technique that improves model performance by applying data augmentation to the input images at test time and aggregating the predictions over the augmented views. In MMOCR, TTA is enabled with the `--tta` flag:
```{note}
TTA is only supported for text recognition models.
```
```bash
python tools/test.py configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py checkpoints/crnn_mini-vgg_5e_mj.pth --tta
```
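Conceptually, TTA runs the model on several augmented views of each input and merges the per-view predictions, e.g. by keeping the most confident decode. The sketch below illustrates the idea with a stand-in model; it is not MMOCR's actual TTA pipeline:

```python
def tta_predict(model, image, augmentations):
    """Run the model on every augmented view and keep the most confident result."""
    candidates = [model(aug(image)) for aug in augmentations]
    return max(candidates, key=lambda pred: pred[1])  # (text, confidence) pairs

# Stand-in model over strings: pretends longer inputs decode more confidently
model = lambda img: (img.upper(), len(img) / 10)
augs = [lambda s: s, lambda s: s + "!", lambda s: s[:2]]
print(tta_predict(model, "abc", augs))  # ('ABC!', 0.4)
```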