Tutorial 4: Train and test with existing models

This tutorial provides instructions for using the models provided in the Model Zoo on other datasets to obtain better performance. MMSegmentation also provides out-of-the-box tools for training models. This section shows how to train and test models on standard datasets.

Train models on standard datasets

Modify training schedule

Modify the following configuration to customize the training.

# training schedule for 40k
train_cfg = dict(type='IterBasedTrainLoop', max_iters=40000, val_interval=4000)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
# optimizer
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer, clip_grad=None)
# learning policy
param_scheduler = [
    dict(
        type='PolyLR',
        eta_min=1e-4,
        power=0.9,
        begin=0,
        end=40000,
        by_epoch=False)
]
# basic hooks
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50, log_metric_by_epoch=False),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', by_epoch=False, interval=4000),
    sampler_seed=dict(type='DistSamplerSeedHook'))
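
In practice, these fields are usually not edited in place. A new config can inherit a base schedule and override only what changes. Below is a minimal sketch, assuming the new config lives under configs/ and inherits the existing 40k schedule from configs/_base_/schedules; the 80k values are only an illustration.

_base_ = ['../_base_/schedules/schedule_40k.py']  # assumed base schedule file

# dict fields are merged with the base config, so only the changed keys are listed
train_cfg = dict(max_iters=80000, val_interval=8000)
# list fields are replaced as a whole, so the scheduler is written out again
param_scheduler = [
    dict(type='PolyLR', eta_min=1e-4, power=0.9, begin=0, end=80000, by_epoch=False)
]
default_hooks = dict(checkpoint=dict(by_epoch=False, interval=8000))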

Use pre-trained model

Users can load a pre-trained model by setting the load_from field of the config to the model's path or link. Users may want to download the model weights before training to avoid spending the download time during training.

# use the pre-trained model for the whole PSPNet
load_from = 'https://download.openmmlab.com/mmsegmentation/v0.5/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth'  # model path can be found in model zoo
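
For example, a fine-tuning config can inherit an existing PSPNet config, point load_from to the model zoo weights, and adapt the heads to the target dataset. Below is a minimal sketch, assuming the new config is placed next to the existing PSPNet configs and that the target dataset has 21 classes; the dataset settings would be overridden in the same way.

_base_ = './pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py'

# start training from the Cityscapes-pretrained weights listed in the model zoo
load_from = 'https://download.openmmlab.com/mmsegmentation/v0.5/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth'

# adapt the classification heads to the number of classes of the target dataset (assumed here)
model = dict(
    decode_head=dict(num_classes=21),
    auxiliary_head=dict(num_classes=21))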

Training on a single GPU

We provide tools/train.py to launch training jobs on a single GPU. The basic usage is as follows.

python tools/train.py \
    ${CONFIG_FILE} \
    [optional arguments]

This tool accepts several optional arguments, including:

  • --work-dir ${WORK_DIR}: Override the working directory.
  • --amp: Use auto mixed precision training.
  • --resume ${CHECKPOINT_FILE}: Resume from a previous checkpoint file. If not specified, it will try to automatically resume from the latest checkpoint in the work directory.
  • --cfg-options ${OVERRIDE_CONFIGS}: Override some settings in the config in use; key-value pairs in xxx=yyy format will be merged into the config file, for example '--cfg-options model.encoder.in_channels=6' (see the sketch after this list). Please see this guide for more details. Below are the optional arguments for multi-GPU training:
  • --launcher: The launcher for distributed job initialization. Allowed choices are none, pytorch, slurm, mpi. If set to none, the job will run in non-distributed mode.
  • --local_rank: ID for local rank. If not specified, it will be set to 0.
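
For reference, the merge performed by --cfg-options can be reproduced with MMEngine's Config API. Below is a minimal sketch, assuming it is run from the repository root with a config file that exists in the repo; the overridden key is only an illustration.

from mmengine.config import Config

# load an existing config and merge a dotted key-value override,
# mirroring what tools/train.py does with the values passed to --cfg-options
cfg = Config.fromfile('configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py')
cfg.merge_from_dict({'optim_wrapper.optimizer.lr': 0.005})
print(cfg.optim_wrapper.optimizer.lr)  # 0.005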

Note: Difference between --resume and load_from: --resume loads both the model weights and the optimizer state, and the training iteration is also inherited from the specified checkpoint. It is usually used for resuming a training process that was interrupted accidentally.

load_from only loads the model weights, and training starts from iteration 0. It is usually used for fine-tuning.
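
Both behaviours can also be expressed in the config itself rather than on the command line. Below is a minimal sketch using MMEngine's resume and load_from fields; the checkpoint paths are hypothetical.

# resume an interrupted run: weights, optimizer state and the iteration count are restored
load_from = 'work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512/iter_4000.pth'  # hypothetical checkpoint
resume = True

# fine-tuning instead: only the weights are loaded and training starts from iteration 0
# load_from = 'checkpoints/pretrained_pspnet.pth'  # hypothetical checkpoint
# resume = False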

Training on CPU

The process of training on the CPU is consistent with single-GPU training if the machine does not have a GPU. If it has GPUs but you do not want to use them, just disable the GPUs before training:

export CUDA_VISIBLE_DEVICES=-1

And then run the script above.

Training on multiple GPUs

OpenMMLab 2.0 implements distributed training with MMDistributedDataParallel. We provide tools/dist_train.sh to launch training on multiple GPUs. The basic usage is as follows.

sh tools/dist_train.sh \
    ${CONFIG_FILE} \
    ${GPU_NUM} \
    [optional arguments]

Optional arguments remain the same as stated above, with an additional argument to specify the number of GPUs. An example:

# checkpoints and logs saved in WORK_DIR=work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512/
# If work_dir is not set, it will be generated automatically.
sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py 8 --work-dir work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512

Note: During training, checkpoints and logs are saved in the same folder structure as the config file under work_dirs/. A custom work directory is not recommended since evaluation scripts infer work directories from the config file name. If you want to save your weights somewhere else, please use a symlink, for example:

ln -s ${YOUR_WORK_DIRS} ${MMSEG}/work_dirs

Launch multiple jobs on a single machine

If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflicts. Otherwise, there will be an error message saying RuntimeError: Address already in use. If you use dist_train.sh to launch training jobs, you can set the port in the commands with the environment variable PORT.

CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4

Training on multiple nodes

MMSegmentation relies on the torch.distributed package for distributed training. Thus, as a basic usage, one can launch distributed training via PyTorch's launch utility.

Train with multiple machines

If you launch with multiple machines simply connected via Ethernet, you can run the following commands. On the first machine:

NNODES=2 NODE_RANK=0 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}

On the second machine:

NNODES=2 NODE_RANK=1 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}

This is usually slow if you do not have high-speed networking such as InfiniBand.

Manage jobs with Slurm

Slurm is a good job scheduling system for computing clusters. On a cluster managed by Slurm, you can use slurm_train.sh to spawn training jobs. It supports both single-node and multi-node training. The basic usage is as follows.

[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}

Below is an example of using 4 GPUs to train PSPNet on a Slurm partition named dev, setting the work-dir to a shared file system.

GPUS=4 sh tools/slurm_train.sh dev pspnet configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dir/pspnet

You can check the source code to review full arguments and environment variables. When using Slurm, the port option needs to be set in one of the following ways:

  1. Set the port through --cfg-options. This is recommended since it does not change the original configs.

    GPUS=4 GPUS_PER_NODE=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR} --cfg-options env_cfg.dist_cfg.port=29500
    GPUS=4 GPUS_PER_NODE=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR} --cfg-options env_cfg.dist_cfg.port=29501
    
  2. Modify the config files to set different communication ports. In config1.py:

    env_cfg = dict(dist_cfg=dict(backend='nccl', port=29500))
    

    In config2.py:

    env_cfg = dict(dist_cfg=dict(backend='nccl', port=29501))
    

    Then you can launch two jobs with config1.py and config2.py.

    CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
    CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
    
  3. Set the port in the command using the environment variable 'MASTER_PORT':

CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}

Test models on standard datasets

We provide testing scripts for evaluating an existing model on the whole dataset. The following testing environments are supported:

  • single GPU
  • CPU
  • single node with multiple GPUs
  • multiple nodes

Choose the proper script to perform testing depending on the testing environment.

# single-gpu testing
python tools/test.py \
    ${CONFIG_FILE} \
    ${CHECKPOINT_FILE} \
    [--work-dir ${WORK_DIR}] \
    [--show ${SHOW_RESULTS}] \
    [--show-dir ${VISUALIZATION_DIRECTORY}] \
    [--wait-time ${SHOW_INTERVAL}] \
    [--cfg-options ${OVERRIDE_CONFIGS}]
# CPU testing
export CUDA_VISIBLE_DEVICES=-1
python tools/test.py \
    ${CONFIG_FILE} \
    ${CHECKPOINT_FILE} \
    [--work-dir ${WORK_DIR}] \
    [--show ${SHOW_RESULTS}] \
    [--show-dir ${VISUALIZATION_DIRECTORY}] \
    [--wait-time ${SHOW_INTERVAL}] \
    [--cfg-options ${OVERRIDE_CONFIGS}]
# multi-gpu testing
bash tools/dist_test.sh \
    ${CONFIG_FILE} \
    ${CHECKPOINT_FILE} \
    ${GPU_NUM} \
    [--work-dir ${WORK_DIR}] \
    [--cfg-options ${OVERRIDE_CONFIGS}]

tools/dist_test.sh also supports multi-node testing, but relies on PyTorch's launch utility. Slurm is a good job scheduling system for computing clusters. On a cluster managed by Slurm, you can use slurm_test.sh to spawn testing jobs. It supports both single-node and multi-node testing.

[GPUS=${GPUS}] ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} \
    ${CONFIG_FILE} ${CHECKPOINT_FILE} \
    [--work-dir ${OUTPUT_DIRECTORY}] \
    [--cfg-options ${OVERRIDE_CONFIGS}]

Optional arguments:

  • --work-dir: If specified, results will be saved in this directory. If not specified, the results will be automatically saved to work_dirs/{CONFIG_NAME}.
  • --show: Show prediction results at runtime, available when --show-dir is not specified.
  • --show-dir: If specified, the visualized segmentation mask will be saved in the specified directory.
  • --wait-time: The display interval in seconds, which takes effect when --show is activated. Defaults to 2.
  • --cfg-options: If specified, key-value pairs in xxx=yyy format will be merged into the config file. For example, to trade speed for GPU memory, you may pass in --cfg-options model.backbone.with_cp=True to enable checkpointing in the backbone. Below are the optional arguments for multi-GPU testing:
  • --launcher: The launcher for distributed job initialization. Allowed choices are none, pytorch, slurm, mpi. If set to none, it will test in a non-distributed mode.
  • --local_rank: ID for local rank. If not specified, it will be set to 0.

Examples: Assume that you have already downloaded the checkpoints to the directory checkpoints/.
  1. Test PSPNet on PASCAL VOC (without saving the test results) and evaluate the mIoU.

    python tools/test.py configs/pspnet/pspnet_r50-d8_4xb4-20k_voc12aug-512x512.py \
        checkpoints/pspnet_r50-d8_512x512_20k_voc12aug_20200617_101958-ed5dfbd9.pth
    

    Since --work-dir is not specified, the folder work_dirs/pspnet_r50-d8_4xb4-20k_voc12aug-512x512 will be created automatically to save the evaluation results.

  2. Test PSPNet with 4 GPUs, and evaluate the standard mIoU and cityscapes metric.

    ./tools/dist_test.sh configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
        checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth 4
    

:::{note} There is some gap (~0.1%) between the cityscapes mIoU and our mIoU. The reason is that the cityscapes script averages each class weighted by class size by default, while we use the simple unweighted average for all datasets. :::