Tutorial 4: Train and test with existing models
MMSegmentation supports training and testing models on a variety of setups: a single GPU, multiple GPUs and machines (distributed), and Slurm-managed clusters, each described below. Through this tutorial, you will learn how to train and test using the scripts provided by MMSegmentation.
Training and testing on a single GPU
Training on a single GPU
We provide tools/train.py to launch training jobs on a single GPU. The basic usage is as follows.
python tools/train.py ${CONFIG_FILE} [optional arguments]
This tool accepts several optional arguments, including:
- --work-dir ${WORK_DIR}: Override the working directory.
- --amp: Enable automatic mixed precision training.
- --resume: Resume automatically from the latest checkpoint in the work_dir.
- --cfg-options ${OVERRIDE_CONFIGS}: Override some settings in the config in use; key-value pairs in xxx=yyy format will be merged into the config file, for example '--cfg-options model.encoder.in_channels=6'. Please see this guide for more details.
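These flags can be combined in one command; a minimal sketch with placeholder paths:
# a minimal sketch combining the documented flags; replace the placeholders with real paths
python tools/train.py ${CONFIG_FILE} --work-dir ${WORK_DIR} --amp --resume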
Below are the optional arguments used for distributed (multi-GPU) jobs:
- --launcher: The launcher for distributed job initialization. Allowed choices are none, pytorch, slurm, and mpi. In particular, if set to none, the job runs in non-distributed mode.
- --local_rank: ID of the local rank. If not specified, it defaults to 0.
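If you prefer to bypass the wrapper scripts, these arguments can also be passed when launching with PyTorch directly. Here is a sketch, assuming torchrun (PyTorch >= 1.10) is available; tools/dist_train.sh described below wraps a similar call:
# assumption: torchrun is available; 4 processes on one machine, PyTorch launcher
torchrun --nproc_per_node=4 tools/train.py ${CONFIG_FILE} --launcher pytorch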
Note: Difference between the argument --resume and the field load_from in the config file:
- --resume only determines whether to resume from the latest checkpoint in the work_dir. It is usually used to resume a training process that was interrupted accidentally.
- load_from specifies the checkpoint to be loaded, and the training iteration starts from 0. It is usually used for fine-tuning.
If you would like to resume training from a specific checkpoint, you can use:
python tools/train.py ${CONFIG_FILE} --resume --cfg-options load_from=${CHECKPOINT}
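Conversely, for the fine-tuning case you can set load_from without --resume, so the given weights are loaded and training starts from iteration 0:
python tools/train.py ${CONFIG_FILE} --cfg-options load_from=${CHECKPOINT}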
Training on CPU: Training on the CPU follows the same process as single-GPU training if the machine has no GPU. If the machine has GPUs but you do not want to use them, simply disable them before starting training:
export CUDA_VISIBLE_DEVICES=-1
And then run the script above.
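The variable can also be set inline for a single run:
CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE}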
Testing on a single GPU
We provide tools/test.py to launch testing jobs on a single GPU. The basic usage is as follows.
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
This tool accepts several optional arguments, including:
- --work-dir: If specified, results will be saved in this directory. If not specified, results will be automatically saved to work_dirs/{CONFIG_NAME}.
- --show: Show prediction results at runtime; available when --show-dir is not specified.
- --show-dir: Directory where painted images will be saved. If specified, the visualized segmentation masks will be saved to work_dir/timestamp/show_dir.
- --wait-time: The display interval in seconds, which takes effect when --show is activated. Defaults to 2.
- --cfg-options: If specified, key-value pairs in xxx=yyy format will be merged into the config file.
- --tta: Enable test-time augmentation.
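For example, a minimal sketch with placeholder paths that stores painted predictions for later inspection:
# painted results are written under the show dir; drop --show-dir and use --show to display them instead
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --work-dir ${WORK_DIR} --show-dir ${SHOW_DIR}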
Testing on CPU: Testing on the CPU follows the same process as single-GPU testing if the machine has no GPU. If the machine has GPUs but you do not want to use them, simply disable them before starting testing:
export CUDA_VISIBLE_DEVICES=-1
And then run the script above.
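As with training, the variable can be set inline:
CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE}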
Training and testing on multiple GPUs and multiple machines
Training on multiple GPUs
OpenMMLab 2.0 implements distributed training with MMDistributedDataParallel.
We provide tools/dist_train.sh to launch training on multiple GPUs. The basic usage is as follows:
sh tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
Optional arguments are the same as stated above; in addition, the number of GPUs must be specified.
An example:
# checkpoints and logs saved in WORK_DIR=work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512/
# If work_dir is not set, it will be generated automatically.
sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py 8 --work-dir work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512
Note: During training, checkpoints and logs are saved in the same folder structure as the config file under work_dirs/. A custom work directory is not recommended since evaluation scripts infer work directories from the config file name. If you want to save your weights somewhere else, please use a symlink, for example:
ln -s ${YOUR_WORK_DIRS} ${MMSEG}/work_dirs
Testing on multiple GPUs
We provide tools/dist_test.sh to launch testing on multiple GPUs. The basic usage is as follows.
sh tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [optional arguments]
Optional arguments are the same as stated above; in addition, the number of GPUs must be specified.
An example:
./tools/dist_test.sh configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth 4
Launch multiple jobs on a single machine
If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflicts. Otherwise, there will be an error message saying RuntimeError: Address already in use.
If you use dist_train.sh to launch training jobs, you can set the port in the commands with the environment variable PORT.
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4
Training with multiple machines
MMSegmentation relies on the torch.distributed package for distributed training. Thus, as a basic usage, one can launch distributed training via PyTorch's launch utility.
If you launch with multiple machines connected only via Ethernet, you can run the following commands. On the first machine:
NNODES=2 NODE_RANK=0 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}
On the second machine:
NNODES=2 NODE_RANK=1 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}
Training across machines is usually slow if you do not have high-speed networking like InfiniBand.
Manage jobs with Slurm
Slurm is a good job scheduling system for computing clusters.
Training on a cluster with Slurm
On a cluster managed by Slurm, you can use slurm_train.sh to spawn training jobs. It supports both single-node and multi-node training.
The basic usage is as follows:
[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} [optional arguments]
Below is an example of using 4 GPUs to train PSPNet on a Slurm partition named dev, with the work-dir set to a shared file system.
GPUS=4 sh tools/slurm_train.sh dev pspnet configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dir/pspnet
You can check the source code to review full arguments and environment variables.
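As a hedged sketch, assuming the script follows the usual OpenMMLab pattern of reading GPUS, GPUS_PER_NODE, CPUS_PER_TASK, and SRUN_ARGS from the environment (verify against tools/slurm_train.sh), a 2-node, 16-GPU job might be submitted as:
# assumption: GPUS_PER_NODE / CPUS_PER_TASK are honored by slurm_train.sh; check the script before relying on them
GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=4 sh tools/slurm_train.sh dev pspnet configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dir/pspnet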
Testing on a cluster with Slurm
Similar to the training task, MMSegmentation provides slurm_test.sh to launch testing jobs.
The basic usage is as follows:
[GPUS=${GPUS}] sh tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
You can check the source code to review full arguments and environment variables.
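For example, a sketch mirroring the training example above (partition dev, job name pspnet, 4 GPUs, placeholder checkpoint path):
GPUS=4 sh tools/slurm_test.sh dev pspnet configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py ${CHECKPOINT_FILE}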
Note: When using Slurm, the port option needs to be set in one of the following ways:
- Set the port through --cfg-options. This is recommended since it does not change the original configs.
GPUS=4 GPUS_PER_NODE=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR} --cfg-options env_cfg.dist_cfg.port=29500
GPUS=4 GPUS_PER_NODE=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR} --cfg-options env_cfg.dist_cfg.port=29501
- Modify the config files to set different communication ports. In config1.py:
env_cfg = dict(dist_cfg=dict(backend='nccl', port=29500))
In config2.py:
env_cfg = dict(dist_cfg=dict(backend='nccl', port=29501))
Then you can launch two jobs with config1.py and config2.py:
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
- Set the port in the command using the environment variable MASTER_PORT:
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}