# Tutorial 4: Train and test with existing models
MMSegmentation supports training and testing models on a variety of setups: a single GPU, multiple GPUs or machines, and Slurm-managed clusters, each described below. Through this tutorial, you will learn how to train and test using the scripts provided by MMSegmentation.
## Training and testing on a single GPU
### Training on a single GPU
We provide `tools/train.py` to launch training jobs on a single GPU.
The basic usage is as follows.
```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```
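For example, assuming the PSPNet ADE20K config reused in the distributed-training example later on this page, a single-GPU training job can be started as follows:
```shell
# Train PSPNet on ADE20K with a single GPU; checkpoints and logs go to the default work_dirs/.
python tools/train.py configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py
```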
This tool accepts several optional arguments, including:
- `--work-dir ${WORK_DIR}`: Override the working directory.
- `--amp`: Use auto mixed precision training.
- `--resume`: Resume from the latest checkpoint in the work_dir automatically.
- `--cfg-options ${OVERRIDE_CONFIGS}`: Override some settings in the used config; key-value pairs in `xxx=yyy` format will be merged into the config file.
  For example, `--cfg-options model.encoder.in_channels=6`. Please see this [guide](./1_config.md#Modify-config-through-script-arguments) for more details.
Below are the optional arguments for distributed training:
- `--launcher`: Launcher for distributed job initialization. Allowed choices are `none`, `pytorch`, `slurm`, `mpi`. In particular, if set to `none`, the job will run in a non-distributed mode.
- `--local_rank`: ID for local rank. If not specified, it will be set to 0.
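A minimal sketch combining several of these options (the config file is a placeholder and `train_dataloader.batch_size` is only an illustrative override; use keys that exist in your config):
```shell
# Mixed-precision training with a custom work directory and an overridden batch size.
python tools/train.py ${CONFIG_FILE} \
    --work-dir ./work_dirs/my_experiment \
    --amp \
    --cfg-options train_dataloader.batch_size=2
```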
**Note:** Difference between the argument `--resume` and the field `load_from` in the config file:
- `--resume` only determines whether to resume from the latest checkpoint in the work_dir. It is usually used for resuming a training process that was interrupted accidentally.
- `load_from` specifies the checkpoint to be loaded, and the training iteration starts from 0. It is usually used for fine-tuning.
If you would like to resume training from a specific checkpoint, you can use:
```shell
python tools/train.py ${CONFIG_FILE} --resume --cfg-options load_from=${CHECKPOINT}
```
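Conversely, to fine-tune from a checkpoint without resuming (weights are loaded via `load_from` and the training iteration starts from 0), simply drop `--resume`:
```shell
python tools/train.py ${CONFIG_FILE} --cfg-options load_from=${CHECKPOINT}
```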
**Training on CPU**: The process of training on the CPU is the same as single-GPU training if the machine does not have a GPU. If the machine has GPUs but you do not want to use them, you only need to disable the GPUs before training:
```shell
export CUDA_VISIBLE_DEVICES=-1
```
And then run the script [above](#training-on-a-single-gpu).
### Testing on a single GPU
We provide `tools/test.py` to launch testing jobs on a single GPU.
The basic usage is as follows.
```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```
This tool accepts several optional arguments, including:
- `--work-dir`: If specified, results will be saved in this directory. If not specified, the results will be automatically saved to `work_dirs/{CONFIG_NAME}`.
- `--show`: Show prediction results at runtime, available when `--show-dir` is not specified.
- `--show-dir`: Directory where painted images will be saved. If specified, the visualized segmentation mask will be saved to the `work_dir/timestamp/show_dir`.
- `--wait-time`: The interval (in seconds) between showing images, which takes effect when `--show` is activated. Defaults to 2.
- `--cfg-options`: If specified, the key-value pair in xxx=yyy format will be merged into the config file.
- `--tta`: Test time augmentation option.
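A sketch combining some of these options (the config, checkpoint, and directories are placeholders; as noted above, the painted images end up under the work directory):
```shell
# Evaluate a checkpoint and save the painted predictions for later inspection.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} \
    --work-dir ./work_dirs/my_eval \
    --show-dir my_show_dir
```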
**Testing on CPU**: The process of testing on the CPU is the same as single-GPU testing if the machine does not have a GPU. If the machine has GPUs but you do not want to use them, you only need to disable the GPUs before testing:
```shell
export CUDA_VISIBLE_DEVICES=-1
```
And then run the script [above](#testing-on-a-single-gpu).
## Training and testing on multiple GPUs and multiple machines
### Training on multiple GPUs
OpenMMLab 2.0 implements **distributed** training with `MMDistributedDataParallel`.
We provide `tools/dist_train.sh` to launch training on multiple GPUs.
The basic usage is as follows:
```shell
sh tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```
Optional arguments remain the same as stated [above](#training-on-a-single-gpu), with an additional argument to specify the number of GPUs.
An example:
```shell
# checkpoints and logs saved in WORK_DIR=work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512/
# If work_dir is not set, it will be generated automatically.
sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py 8 --work-dir work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512
```
**Note**: During training, checkpoints and logs are saved in the same folder structure as the config file under `work_dirs/`. A custom work directory is not recommended since evaluation scripts infer work directories from the config file name. If you want to save your weights somewhere else, please use a symlink, for example:
```shell
ln -s ${YOUR_WORK_DIRS} ${MMSEG}/work_dirs
```
### Testing on multiple GPUs
We provide `tools/dist_test.sh` to launch testing on multiple GPUs.
The basic usage is as follows.
```shell
sh tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [optional arguments]
```
Optional arguments remain the same as stated [above](#testing-on-a-single-gpu), with an additional argument to specify the number of GPUs.
An example:
```shell
./tools/dist_test.sh configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth 4
```
### Launch multiple jobs on a single machine
If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflict. Otherwise, there will be an error message saying `RuntimeError: Address already in use`.
If you use `dist_train.sh` to launch training jobs, you can set the port in commands with the environment variable `PORT`.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4
```
### Training with multiple machines
MMSegmentation relies on `torch.distributed` package for distributed training.
Thus, as a basic usage, one can launch distributed training via PyTorch's [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
If you launch with multiple machines connected over Ethernet, you can run the following commands:
On the first machine:
```shell
NNODES=2 NODE_RANK=0 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}
```
On the second machine:
```shell
NNODES=2 NODE_RANK=1 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}
```
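For instance, a sketch of a 2-node job with 8 GPUs per node, assuming `10.1.1.1` is the IP address of the first machine and port `29500` is free on it:
```shell
# On the first machine (node rank 0, also the master):
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.1 sh tools/dist_train.sh ${CONFIG_FILE} 8
# On the second machine (node rank 1):
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.1 sh tools/dist_train.sh ${CONFIG_FILE} 8
```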
Usually, it is slow if you do not have high-speed networking like InfiniBand.
## Manage jobs with Slurm
[Slurm](https://slurm.schedmd.com/) is a good job scheduling system for computing clusters.
### Training on a cluster with Slurm
On a cluster managed by Slurm, you can use `slurm_train.sh` to spawn training jobs. It supports both single-node and multi-node training.
The basic usage is as follows:
```shell
[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} [optional arguments]
```
Below is an example of using 4 GPUs to train PSPNet on a Slurm partition named _dev_, with the work-dir set to a shared file system.
```shell
GPUS=4 sh tools/slurm_train.sh dev pspnet configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dir/pspnet
```
You can check [the source code](../../../tools/slurm_train.sh) to review full arguments and environment variables.
### Testing on a cluster with Slurm
Similar to the training task, MMSegmentation provides `slurm_test.sh` to launch testing jobs.
The basic usage is as follows:
```shell
[GPUS=${GPUS}] sh tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```
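For example, a sketch of testing the PSPNet Cityscapes checkpoint from the earlier example with 4 GPUs on a partition named _dev_ (the partition name and paths are placeholders for your own cluster):
```shell
GPUS=4 sh tools/slurm_test.sh dev pspnet configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
    checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth
```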
You can check [the source code](../../../tools/slurm_test.sh) to review full arguments and environment variables.
**Note:** When using Slurm, the port option needs to be set in one of the following ways:
1. Set the port through `--cfg-options`. This is recommended since it does not change the original configs.
```shell
GPUS=4 GPUS_PER_NODE=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR} --cfg-options env_cfg.dist_cfg.port=29500
GPUS=4 GPUS_PER_NODE=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR} --cfg-options env_cfg.dist_cfg.port=29501
```
2. Modify the config files to set different communication ports.
In `config1.py`:
```python
env_cfg = dict(dist_cfg=dict(backend='nccl', port=29500))
```
In `config2.py`:
```python
env_cfg = dict(dist_cfg=dict(backend='nccl', port=29501))
```
Then you can launch two jobs with `config1.py` and `config2.py`.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
```
3. Set the port in the command using the environment variable `MASTER_PORT`:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
```