To meet diverse requirements, MMOCR supports training and testing models on various devices, including PCs, work stations, computation clusters, etc.
## Single GPU Training and Testing
### Training
`tools/train.py` provides the basic training service. MMOCR recommends using GPUs for model training and testing, but CPU-only training and testing are also supported. For example, the following commands demonstrate how to train a DBNet model with a single GPU or with the CPU.
```bash
# Train the specified MMOCR model by calling tools/train.py
python tools/train.py ${CONFIG_FILE} [PY_ARGS]

# Train on CPU only by hiding all GPUs from the program
CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [PY_ARGS]
```

### Testing

`tools/test.py` provides the basic testing service, which is used in a similar way to the training script. For example, the following commands demonstrate how to test a DBNet model on a single GPU or CPU.
```bash
# Test a pretrained MMOCR model by calling tools/test.py
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]

# Test on CPU only
CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]
```
| --local_rank | int | Rank of the local machine, used for distributed training. Defaults to 0. |
## Training and Testing with Multiple GPUs
For large models, distributed training or testing significantly improves the efficiency. For this purpose, MMOCR provides distributed scripts `tools/dist_train.sh` and `tools/dist_test.sh` implemented based on [MMDistributedDataParallel](mmengine.model.wrappers.MMDistributedDataParallel).
For a workstation equipped with multiple GPUs, the user can launch multiple tasks simultaneously by specifying the GPU IDs. For example, the following command demonstrates how to test DBNet with GPU `[0, 1, 2, 3]` and train CRNN on GPU `[4, 5, 6, 7]`.
```bash
# Specify gpu:0,1,2,3 for testing and assign port number 29500
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 4

# Specify gpu:4,5,6,7 for training and assign port number 29501
# (replace ${CRNN_CONFIG_FILE} with the path to your CRNN config)
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CRNN_CONFIG_FILE} 4
```

```{note}
`dist_train.sh` sets `MASTER_PORT` to `29500` by default. If another process already occupies this port, the program will fail with `RuntimeError: Address already in use`. In this case, set `MASTER_PORT` to another free port number in the range `(0, 65535)`.
```
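If you are unsure which ports are free, one common trick is to bind to port 0 and let the operating system pick one. The following is a small Python sketch of that idea (a generic technique, not part of MMOCR's tooling):

```python
import socket

def find_free_port() -> int:
    """Ask the OS for a currently free TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 tells the OS to choose any free port
        return s.getsockname()[1]

# Use the returned number as the value of MASTER_PORT / PORT.
print(find_free_port())
```

Note that, in principle, another process could grab the port between checking and launching, so treat the result as a good candidate rather than a guarantee.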
### Multi-machine Multi-GPU Training and Testing
You can launch a task on multiple machines connected to the same network. MMOCR relies on `torch.distributed` package for distributed training. Find more information at PyTorch’s [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
1. **Training**

The following command demonstrates how to train DBNet on two machines with a total of 4 GPUs.

```bash
# Say that you want to launch the training job on two machines
# On the first machine:
NNODES=2 NODE_RANK=0 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} 2
# On the second machine:
NNODES=2 NODE_RANK=1 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} 2
```

```{note}
The speed of the network could be the bottleneck of training.
```
## Training and Testing with Slurm Cluster
If you run MMOCR on a cluster managed with [Slurm](https://slurm.schedmd.com/), you can use the scripts `tools/slurm_train.sh` and `tools/slurm_test.sh`.
```bash
# tools/slurm_train.sh provides scripts for submitting training tasks on clusters managed by Slurm
GPUS=${GPUS} GPUS_PER_NODE=${GPUS_PER_NODE} CPUS_PER_TASK=${CPUS_PER_TASK} ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]

# tools/slurm_test.sh provides scripts for submitting testing tasks
GPUS=${GPUS} GPUS_PER_NODE=${GPUS_PER_NODE} CPUS_PER_TASK=${CPUS_PER_TASK} ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} ${WORK_DIR} [PY_ARGS]
```
| CHECKPOINT_FILE | str | (required, only used in slurm_test.sh) Path to the checkpoint to be tested. |
| PY_ARGS | str | Arguments to be parsed by `tools/train.py` and `tools/test.py`. |
These scripts enable training and testing on Slurm clusters; see the following examples.
1. Training
Here is an example of using 1 GPU to train a DBNet model on the `dev` partition.
```bash
# Example: Request 1 GPU resource on dev partition for DBNet training task
GPUS=1 GPUS_PER_NODE=1 CPUS_PER_TASK=5 tools/slurm_train.sh dev db_r50 configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py work_dir
```
2. Testing
Similarly, the following example requests 1 GPU for testing.
```bash
# Example: Request 1 GPU resource on dev partition for DBNet testing task
GPUS=1 GPUS_PER_NODE=1 CPUS_PER_TASK=5 tools/slurm_test.sh dev db_r50 configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth work_dir
```
## Advanced Tips
### Resume Training from a Checkpoint
`tools/train.py` allows users to resume training from a checkpoint by specifying the `--resume` parameter; training will then automatically resume from the latest saved checkpoint.
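Conceptually, resuming from "the latest checkpoint" boils down to locating the newest `epoch_*.pth` file in the work directory. Below is a simplified Python sketch of that selection logic; MMEngine's actual bookkeeping differs in detail, and the file names here are only illustrative:

```python
from pathlib import Path

def latest_checkpoint(work_dir):
    """Return the epoch_*.pth file with the highest epoch number, or None."""
    ckpts = sorted(
        Path(work_dir).glob("epoch_*.pth"),
        # Numeric sort on the epoch suffix, so epoch_10 ranks above epoch_2
        key=lambda p: int(p.stem.split("_")[1]),
    )
    return str(ckpts[-1]) if ckpts else None
```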
```bash
# Example: Resuming training from the latest checkpoint
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --resume
```

By default, the program will automatically resume training from the last successfully saved checkpoint of the previous training session, i.e. `latest.pth`. However, you can also specify the checkpoint to load by setting `load_from` in the configuration file.

```python
# Example: Set the path of the checkpoint you want to load in the configuration file
load_from = 'work_dir/dbnet/models/epoch_10000.pth'
```

### Mixed Precision Training

Mixed precision training offers a significant computational speedup by performing operations in half-precision format, while keeping critical parts of the network in single precision to retain as much information as possible. In MMOCR, users can enable automatic mixed precision training by simply adding `--amp`.
```bash
# Example: Using automatic mixed precision training
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --amp
```

The current compatibility of the models with mixed precision training is listed below.

| Model | Supports AMP | Notes |
| :---: | :----------: | :---: |
| | Text Detection | |
| DRRG | N | roi_align_rotated does not support fp16 |
| FCENet | N | BCELoss does not support fp16 |
| Mask R-CNN | Y | |
| PANet | Y | |
| PSENet | Y | |
| TextSnake | N | |
| | Text Recognition | |
| ABINet | Y | |
| CRNN | Y | |
| MASTER | Y | |
| NRTR | Y | |
| RobustScanner | Y | |
| SAR | Y | |
| SATRN | Y | |
### Automatic Learning Rate Scaling
MMOCR sets default initial learning rates for each model in the configuration file. However, these initial learning rates may not be applicable if you use a different `batch_size` than our preset `base_batch_size`. Therefore, we provide a tool to automatically scale the learning rate, which can be enabled by adding the `--auto-scale-lr` flag.
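Under the hood, this applies the common linear scaling rule: the initial learning rate is multiplied by the ratio of your actual total batch size to the preset `base_batch_size`. A minimal sketch of the rule itself (the numeric values below are illustrative, not MMOCR defaults):

```python
def scale_lr(base_lr: float, base_batch_size: int, actual_batch_size: int) -> float:
    """Linear scaling rule: lr grows proportionally with the total batch size."""
    return base_lr * actual_batch_size / base_batch_size

# Illustrative values: a schedule tuned for batch size 16 at lr 0.007,
# now run with a total batch size of 32 across all GPUs.
print(scale_lr(0.007, 16, 32))  # -> 0.014
```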
### Visualizing Test Outputs

For systems that do not support graphical interfaces (such as computing clusters, etc.), `tools/test.py` can dump the visualization results to a specified path via the `--show-dir` argument instead of displaying them on screen.
### Test Time Augmentation

Test time augmentation (TTA) is a technique that improves model performance by applying data augmentation to the input images at test time. It is a simple yet effective method. In MMOCR, we support TTA in the following ways:
```{note}
TTA is only supported for text recognition models.