
# Tutorial 4: Train and test with existing models

MMSegmentation supports training and testing models on a variety of devices. The sections below cover single-GPU, distributed, and cluster training and testing in turn. Through this tutorial, you will learn how to train and test models using the scripts provided by MMSegmentation.

## Training and testing on a single GPU

### Training on a single GPU

We provide `tools/train.py` to launch training jobs on a single GPU. The basic usage is as follows.

```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```

This tool accepts several optional arguments, including:

- `--work-dir ${WORK_DIR}`: Override the working directory.
- `--amp`: Use automatic mixed precision training.
- `--resume`: Resume from the latest checkpoint in the work_dir automatically.
- `--cfg-options ${OVERRIDE_CONFIGS}`: Override some settings in the used config, and the key-value pairs in `xxx=yyy` format will be merged into the config file. For example, `--cfg-options model.encoder.in_channels=6`. Please see this guide for more details.

Below are the optional arguments used to launch a distributed job:

- `--launcher`: The launcher for distributed job initialization. Allowed choices are `none`, `pytorch`, `slurm`, `mpi`. In particular, if set to `none`, the job will run in non-distributed mode.
- `--local_rank`: ID of the local rank. If not specified, it will be set to 0.
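
As a concrete illustration of the arguments above, the sketch below launches a single-GPU training run with an explicit work directory and mixed precision enabled; the config path is the PSPNet/ADE20K config used later in this tutorial, and the work directory name is only an example.

```shell
# Train PSPNet on ADE20K with one GPU, enable automatic mixed precision,
# and write checkpoints/logs to an explicit work directory.
python tools/train.py configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py \
    --work-dir work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512 \
    --amp
```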

**Note:** Difference between the argument `--resume` and the field `load_from` in the config file:

`--resume` only determines whether to resume from the latest checkpoint in the work_dir. It is usually used to resume a training process that was interrupted accidentally.

`load_from` specifies the checkpoint to be loaded, and training starts from iteration 0. It is usually used for fine-tuning.

If you would like to resume training from a specific checkpoint, you can use:

```shell
python tools/train.py ${CONFIG_FILE} --resume --cfg-options load_from=${CHECKPOINT}
```

### Training on CPU

The process of training on the CPU is consistent with single-GPU training if the machine has no GPU. If the machine has GPUs but you do not want to use them, just disable the GPUs before starting training:

```shell
export CUDA_VISIBLE_DEVICES=-1
```

And then run the script above.
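
For example, a complete CPU training run with the PSPNet/ADE20K config used elsewhere in this tutorial would look like the following sketch:

```shell
# Hide all GPUs from PyTorch so training falls back to the CPU,
# then launch the usual single-GPU training script.
export CUDA_VISIBLE_DEVICES=-1
python tools/train.py configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py
```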

### Testing on a single GPU

We provide `tools/test.py` to launch testing jobs on a single GPU. The basic usage is as follows.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

This tool accepts several optional arguments, including:

- `--work-dir`: If specified, results will be saved in this directory. If not specified, the results will be automatically saved to `work_dirs/{CONFIG_NAME}`.
- `--show`: Show prediction results at runtime. It is available when `--show-dir` is not specified.
- `--show-dir`: Directory where painted images will be saved. If specified, the visualized segmentation masks will be saved to `work_dir/timestamp/{show_dir}`.
- `--wait-time`: The display interval in seconds, which takes effect when `--show` is activated. Defaults to 2.
- `--cfg-options`: If specified, the key-value pairs in `xxx=yyy` format will be merged into the config file.
- `--tta`: Enable test-time augmentation.
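
The sketch below shows a typical single-GPU test run that also saves painted predictions; it reuses the PSPNet/Cityscapes config and checkpoint from the multi-GPU testing example further down, and the `--show-dir` name is arbitrary.

```shell
# Evaluate a trained PSPNet checkpoint on a single GPU and
# save the visualized segmentation results for inspection.
python tools/test.py configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
    checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth \
    --show-dir pspnet_cityscapes_results
```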

### Testing on CPU

The process of testing on the CPU is consistent with single-GPU testing if the machine has no GPU. If the machine has GPUs but you do not want to use them, just disable the GPUs before starting testing:

```shell
export CUDA_VISIBLE_DEVICES=-1
```

And then run the script above.

## Training and testing on multiple GPUs and multiple machines

### Training on multiple GPUs

OpenMMLab 2.0 implements distributed training with `MMDistributedDataParallel`. We provide `tools/dist_train.sh` to launch training on multiple GPUs.

The basic usage is as follows:

```shell
sh tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```

Optional arguments remain the same as stated above, with an additional argument to specify the number of GPUs.

An example:

```shell
# checkpoints and logs are saved in WORK_DIR=work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512/
# If work_dir is not set, it will be generated automatically.
sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py 8 --work-dir work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512
```

**Note:** During training, checkpoints and logs are saved in the same folder structure as the config file under `work_dirs/`. A custom work directory is not recommended since evaluation scripts infer work directories from the config file name. If you want to save your weights somewhere else, please use a symlink, for example:

```shell
ln -s ${YOUR_WORK_DIRS} ${MMSEG}/work_dirs
```

### Testing on multiple GPUs

We provide `tools/dist_test.sh` to launch testing on multiple GPUs. The basic usage is as follows.

```shell
sh tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [optional arguments]
```

Optional arguments remain the same as stated above, with an additional argument to specify the number of GPUs.

An example:

```shell
./tools/dist_test.sh configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
    checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth 4
```
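
Since the optional arguments are the same as in the single-GPU case, they can be appended after the GPU count; the work directory and show directory names below are only examples.

```shell
# Distributed test on 4 GPUs, saving metrics to an explicit work directory
# and painted predictions under it.
./tools/dist_test.sh configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
    checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth 4 \
    --work-dir work_dirs/pspnet_cityscapes_eval \
    --show-dir painted_results
```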

### Launch multiple jobs on a single machine

If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflicts. Otherwise, there will be an error message saying `RuntimeError: Address already in use`. If you use `dist_train.sh` to launch training jobs, you can set the port in the commands with the environment variable `PORT`.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4
```

### Training with multiple machines

MMSegmentation relies on the `torch.distributed` package for distributed training. Thus, as a basic usage, one can launch distributed training via PyTorch's launch utility.
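
For reference, a minimal sketch of launching `tools/train.py` directly through PyTorch's own launcher (`torchrun`) on a single 8-GPU machine is shown below; the `--launcher pytorch` flag (listed among the optional arguments above) tells the script to pick up the distributed environment created by the launcher.

```shell
# Start 8 training processes on the current machine with torchrun.
torchrun --nproc_per_node=8 tools/train.py ${CONFIG_FILE} --launcher pytorch
```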

If you launch with multiple machines simply connected with ethernet, you can run the following commands.

On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}
```

Usually, it is slow if you do not have high-speed networking like InfiniBand.
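
As a concrete sketch, assuming the first machine is reachable at the hypothetical address 10.1.1.10 and each machine has 8 GPUs, the two commands could be:

```shell
# On the first machine (rank 0), which also acts as the master:
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.10 sh tools/dist_train.sh ${CONFIG_FILE} 8

# On the second machine (rank 1):
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.10 sh tools/dist_train.sh ${CONFIG_FILE} 8
```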

## Manage jobs with Slurm

Slurm is a good job scheduling system for computing clusters.

### Training on a cluster with Slurm

On a cluster managed by Slurm, you can use `slurm_train.sh` to spawn training jobs. It supports both single-node and multi-node training.

The basic usage is as follows:

```shell
[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} [optional arguments]
```

Below is an example of using 4 GPUs to train PSPNet on a Slurm partition named `dev`, setting the work-dir to a shared file system.

```shell
GPUS=4 sh tools/slurm_train.sh dev pspnet configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dir/pspnet
```

You can check the source code to review full arguments and environment variables.
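
For instance, the allocation can be shaped through environment variables without editing the script; the variable names below (such as `GPUS_PER_NODE` and `CPUS_PER_TASK`) follow the usual OpenMMLab Slurm scripts, so please confirm them against `tools/slurm_train.sh` in your checkout.

```shell
# Request 16 GPUs spread over 2 nodes (8 GPUs per node) and
# 4 CPU cores per task for the same PSPNet job.
GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=4 sh tools/slurm_train.sh dev pspnet \
    configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dir/pspnet
```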

### Testing on a cluster with Slurm

Similar to the training task, MMSegmentation provides `slurm_test.sh` to launch testing jobs.

The basic usage is as follows:

```shell
[GPUS=${GPUS}] sh tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

You can check the source code to review full arguments and environment variables.
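
As a sketch, testing the PSPNet checkpoint from the earlier example with 4 GPUs on a partition named `dev` (the partition and job names are hypothetical) could look like:

```shell
# Evaluate a checkpoint with 4 GPUs on the dev partition.
GPUS=4 sh tools/slurm_test.sh dev pspnet_test \
    configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
    checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth
```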

**Note:** When using Slurm, the port option needs to be set in one of the following ways:

1. Set the port through `--cfg-options`. This is recommended since it does not change the original configs.

   ```shell
   GPUS=4 GPUS_PER_NODE=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR} --cfg-options env_cfg.dist_cfg.port=29500
   GPUS=4 GPUS_PER_NODE=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR} --cfg-options env_cfg.dist_cfg.port=29501
   ```
    
2. Modify the config files to set different communication ports.

   In `config1.py`:

   ```python
   env_cfg = dict(dist_cfg=dict(backend='nccl', port=29500))
   ```

   In `config2.py`:

   ```python
   env_cfg = dict(dist_cfg=dict(backend='nccl', port=29501))
   ```

   Then you can launch two jobs with `config1.py` and `config2.py`.

   ```shell
   CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
   CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
   ```
    
3. Set the port in the command using the environment variable `MASTER_PORT`:

   ```shell
   CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
   CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
   ```