| `--work-dir` | str | The target folder to save logs and checkpoints. Defaults to `./work_dirs`. |
| `--load-from` | str | The checkpoint file to load from. |
| `--resume-from` | bool | The checkpoint file to resume the training from.|
| `--no-validate` | bool | Disable checkpoint evaluation during training. Defaults to `False`. |
| `--gpus` | int | Numbers of gpus to use. Only applicable to non-distributed training. |
| `--gpu-ids` | int*N | A list of GPU ids to use. Only applicable to non-distributed training. |
| `--seed` | int | Random seed. |
| `--deterministic` | bool | Whether to set deterministic options for CUDNN backend. |
| `--cfg-options` | str | Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form of either key="[a,b]" or key=a,b. The argument also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no white space is allowed.|
| `PORT` | int | The master port that will be used by the machine with rank 0. Defaults to 29500. **Note:** If you are launching multiple distrbuted training jobs on a single machine, you need to specify different ports for each job to avoid port conflicts.|
| `PY_ARGS` | str | Arguments to be parsed by `tools/train.py`. |
## Training with Slurm
If you run MMOCR on a cluster managed with [Slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`.
| `GPUS` | int | The number of GPUs to be used by this task. Defaults to 8. |
| `GPUS_PER_NODE` | int | The number of GPUs to be allocated per node. Defaults to 8. |
| `CPUS_PER_TASK` | int | The number of CPUs to be allocated per task. Defaults to 5. |
| `SRUN_ARGS` | str | Arguments to be parsed by srun. Available options can be found [here](https://slurm.schedmd.com/srun.html). |
| `PY_ARGS` | str | Arguments to be parsed by `tools/train.py`. |
Here is an example of using 8 GPUs to train a text detection model on the dev partition.
```shell
./tools/slurm_train.sh dev psenet-ic15 configs/textdet/psenet/psenet_r50_fpnf_sbn_1x_icdar2015.py /nfs/xxxx/psenet-ic15
```
### Running Multiple Training Jobs on a Single Machine
If you are launching multiple training jobs on a single machine with Slurm, you may need to modify the port in configs to avoid communication conflicts.
For example, in `config1.py`,
```python
dist_params = dict(backend='nccl', port=29500)
```
In `config2.py`,
```python
dist_params = dict(backend='nccl', port=29501)
```
Then you can launch two jobs with `config1.py` ang `config2.py`.
# Note: User can configure general settings of train, val and test dataloader by specifying them here. However, their values can be overridden in dataloader's config.
evaluation = dict(interval=1, by_epoch=True) # Evaluate the model every epoch
# Saving and Logging
checkpoint_config = dict(interval=1) # Save a checkpoint every epoch
log_config = dict(
interval=5, # Print out the model's performance every 5 iterations
hooks=[
dict(type='TextLoggerHook')
])
# Optimizer
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001) # Supports all optimizers in PyTorch and shares the same parameters
optimizer_config = dict(grad_clip=None) # Parameters for the optimizer hook. See https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/hooks/optimizer.py for implementation details