[Doc] Update train test doc (#2061)

* draft

* refine structure

* fix typo

* rename single gpu title and redefine --resume

* update introduction

* add notes to load_from
pull/2251/head
谢昕辰 2022-09-16 19:36:13 +08:00 committed by MeowZheng
parent 650be56fcc
commit b8d87d7fe5
1 changed files with 95 additions and 168 deletions


# Tutorial 4: Train and test with existing models
This tutorial shows how to train and test models on standard datasets with the out-of-the-box tools that MMSegmentation provides, starting from the models in the [Model Zoo](../model_zoo.md), which can be fine-tuned on other datasets for better performance. MMSegmentation supports training and testing models on a variety of devices, which are described below for single-GPU, distributed, and cluster training and testing, respectively.
## Train models on standard datasets
### Modify training schedule
Modify the following configuration to customize the training.
```python
# training schedule for 40k
train_cfg = dict(type='IterBasedTrainLoop', max_iters=40000, val_interval=4000)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
# optimizer
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0005)
optim_wrapper = dict(type='OptimWrapper', optimizer=optimizer, clip_grad=None)
# learning policy
param_scheduler = [
    dict(
        type='PolyLR',
        eta_min=1e-4,
        power=0.9,
        begin=0,
        end=40000,
        by_epoch=False)
]
# basic hooks
default_hooks = dict(
    timer=dict(type='IterTimerHook'),
    logger=dict(type='LoggerHook', interval=50, log_metric_by_epoch=False),
    param_scheduler=dict(type='ParamSchedulerHook'),
    checkpoint=dict(type='CheckpointHook', by_epoch=False, interval=4000),
    sampler_seed=dict(type='DistSamplerSeedHook'))
```
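If you only need to tweak a few of these values for an experiment, you can also override them at launch time instead of editing the config. A minimal sketch, assuming the 40k schedule above (the field names are taken from that snippet):

```shell
# Halve the schedule and validate twice as often by overriding the
# schedule fields shown above at launch time.
python tools/train.py ${CONFIG_FILE} \
    --cfg-options train_cfg.max_iters=20000 train_cfg.val_interval=2000
```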
### Use pre-trained model
Users can load a pre-trained model by setting the `load_from` field of the config to the model's path or link.
Users may want to download the model weights before training to avoid download time during training.
```python
# use the pre-trained model for the whole PSPNet
load_from = 'https://download.openmmlab.com/mmsegmentation/v0.5/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth' # model path can be found in model zoo
```
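For example, you can fetch the PSPNet weights above once and reuse the local copy across runs (the `checkpoints/` directory is illustrative):

```shell
# Download the checkpoint once, then point `load_from` at the local file.
mkdir -p checkpoints
wget -P checkpoints https://download.openmmlab.com/mmsegmentation/v0.5/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth
```

Then set `load_from = 'checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth'` in the config.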
## Training and testing on a single GPU
### Training on a single GPU
We provide `tools/train.py` to launch training jobs on a single GPU.
The basic usage is as follows.
```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```
This tool accepts several optional arguments, including:
- `--work-dir ${WORK_DIR}`: Override the working directory.
- `--amp`: Use auto mixed precision training.
- `--resume`: Resume from the latest checkpoint in the work_dir automatically.
- `--cfg-options ${OVERRIDE_CONFIGS}`: Override some settings in the used config, and the key-value pair in xxx=yyy format will be merged into the config file.
For example, '--cfg-options model.encoder.in_channels=6'. Please see this [guide](./1_config.md#Modify-config-through-script-arguments) for more details.
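Putting a few of these together, a run that enables mixed precision and writes to a custom directory might look like this (a sketch using the config file from the examples below):

```shell
python tools/train.py configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
    --work-dir work_dirs/pspnet_cityscapes_amp \
    --amp
```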
Below are the optional arguments for multi-GPU jobs:
- `--launcher`: The launcher for distributed job initialization. Allowed choices are `none`, `pytorch`, `slurm`, `mpi`. In particular, if set to `none`, the job runs in non-distributed mode.
- `--local_rank`: ID for local rank. If not specified, it will be set to 0.
**Note:** Difference between the argument `--resume` and the field `load_from` in the config file:
`--resume` only determines whether to resume from the latest checkpoint in the work_dir. It is usually used for resuming the training process that is interrupted accidentally.
`load_from` will specify the checkpoint to be loaded and the training iteration starts from 0. It is usually used for fine-tuning.
If you would like to resume training from a specific checkpoint, you can use:
```shell
python tools/train.py ${CONFIG_FILE} --resume --cfg-options load_from=${CHECKPOINT}
```
**Training on CPU**: The process of training on the CPU is consistent with single-GPU training if a machine does not have a GPU. If it has GPUs but you do not want to use them, just disable the GPUs before the training process.
```shell
export CUDA_VISIBLE_DEVICES=-1
```
And then run the script [above](#training-on-a-single-gpu).
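Equivalently, you can disable GPUs for a single run by setting the variable inline; this is ordinary shell behavior, not a `train.py` option:

```shell
CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE}
```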
### Testing on a single GPU
We provide `tools/test.py` to launch testing jobs on a single GPU.
The basic usage is as follows.
```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```
This tool accepts several optional arguments, including:
- `--work-dir`: If specified, results will be saved in this directory. If not specified, the results will be automatically saved to `work_dirs/{CONFIG_NAME}`.
- `--show`: Show prediction results at runtime, available when `--show-dir` is not specified.
- `--show-dir`: Directory where painted images will be saved. If specified, the visualized segmentation mask will be saved to the `work_dir/timestamp/show_dir`.
- `--wait-time`: The interval (in seconds) between displayed results, which takes effect when `--show` is activated. Defaults to 2.
- `--cfg-options`: If specified, the key-value pair in xxx=yyy format will be merged into the config file. For example, to trade speed for GPU memory, you may pass in `--cfg-options model.backbone.with_cp=True` to enable gradient checkpointing in the backbone.
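For instance, to test a checkpoint and save the painted results for inspection (the checkpoint path follows the examples below; the output directory name is illustrative):

```shell
python tools/test.py configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
    checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth \
    --show-dir painted_results
```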
**Testing on CPU**: The process of testing on the CPU is consistent with single-GPU testing if a machine does not have a GPU. If it has GPUs but you do not want to use them, just disable the GPUs before the testing process.
```shell
export CUDA_VISIBLE_DEVICES=-1
```
And then run the script [above](#testing-on-a-single-gpu).
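For example, test PSPNet on PASCAL VOC (without saving the test results) and evaluate the mIoU, assuming you have downloaded the checkpoint to `checkpoints/`:

```shell
python tools/test.py configs/pspnet/pspnet_r50-d8_4xb4-20k_voc12aug-512x512.py \
    checkpoints/pspnet_r50-d8_512x512_20k_voc12aug_20200617_101958-ed5dfbd9.pth
```

Since `--work-dir` is not specified, the folder `work_dirs/pspnet_r50-d8_4xb4-20k_voc12aug-512x512` will be created automatically to save the evaluation results.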
## Training and testing on multiple GPUs and multiple machines
### Training on multiple GPUs
OpenMMLab 2.0 implements **distributed** training with `MMDistributedDataParallel`.
We provide `tools/dist_train.sh` to launch training on multiple GPUs.
The basic usage is as follows:
```shell
sh tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```
Optional arguments remain the same as stated [above](#training-on-a-single-gpu) and have additional arguments to specify the number of GPUs.
An example:
```shell
sh tools/dist_train.sh configs/pspnet/pspnet_r50-d8_4xb4-80k_ade20k-512x512.py 8 --work-dir work_dirs/pspnet_r50-d8_4xb4-80k_ade20k-512x512
```
**Note**: During training, checkpoints and logs are saved in the same folder structure as the config file under `work_dirs/`. A custom work directory is not recommended since evaluation scripts infer work directories from the config file name. If you want to save your weights somewhere else, please use a symlink, for example:
```shell
ln -s ${YOUR_WORK_DIRS} ${MMSEG}/work_dirs
```
### Testing on multiple GPUs
We provide `tools/dist_test.sh` to launch testing on multiple GPUs.
The basic usage is as follows.
```shell
sh tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [optional arguments]
```
Optional arguments remain the same as stated [above](#testing-on-a-single-gpu) and have additional arguments to specify the number of GPUs. `tools/dist_test.sh` also supports multi-node testing, but relies on PyTorch's [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
An example:
```shell
./tools/dist_test.sh configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth 4
```
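**Note:** There is some gap (~0.1%) between the Cityscapes mIoU and our mIoU. The reason is that Cityscapes averages each class weighted by class size by default, while we use the simple unweighted version for all datasets.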
### Launch multiple jobs on a single machine
If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflict. Otherwise, there will be an error message saying `RuntimeError: Address already in use`.
If you use `dist_train.sh` to launch training jobs, you can set the port in commands with the environment variable `PORT`.
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 sh tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 sh tools/dist_train.sh ${CONFIG_FILE} 4
```
### Training with multiple machines
MMSegmentation relies on `torch.distributed` package for distributed training.
Thus, as a basic usage, one can launch distributed training via PyTorch's [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
If you launch with multiple machines simply connected with ethernet, you can simply run the following commands:
On the first machine:
```shell
NNODES=2 NODE_RANK=0 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} sh tools/dist_train.sh ${CONFIG_FILE} ${GPUS}
```
Usually, it is slow if you do not have high-speed networking like InfiniBand.
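A concrete sketch with two 8-GPU machines, where the first machine (rank 0) is reachable at the illustrative address 10.1.1.10:

```shell
# On the first machine (rank 0):
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.10 sh tools/dist_train.sh ${CONFIG_FILE} 8
# On the second machine (rank 1):
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.10 sh tools/dist_train.sh ${CONFIG_FILE} 8
```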
## Manage jobs with Slurm
[Slurm](https://slurm.schedmd.com/) is a good job scheduling system for computing clusters.
### Training on a cluster with Slurm
On a cluster managed by Slurm, you can use `slurm_train.sh` to spawn training jobs. It supports both single-node and multi-node training.
The basic usage is as follows:
```shell
[GPUS=${GPUS}] sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} [optional arguments]
```
Below is an example of using 4 GPUs to train PSPNet on a Slurm partition named _dev_, setting the work-dir to a shared file system.
```shell
GPUS=4 sh tools/slurm_train.sh dev pspnet configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py --work-dir work_dir/pspnet
```
You can check [the source code](../../../tools/slurm_train.sh) to review full arguments and environment variables.
### Testing on a cluster with Slurm
Similar to the training task, MMSegmentation provides `slurm_test.sh` to launch testing jobs.
The basic usage is as follows:
```shell
[GPUS=${GPUS}] sh tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```
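For example, mirroring the training example above, testing the Cityscapes PSPNet checkpoint on a partition named _dev_ with 4 GPUs might look like this (the job name and paths are illustrative):

```shell
GPUS=4 sh tools/slurm_test.sh dev pspnet configs/pspnet/pspnet_r50-d8_4xb2-40k_cityscapes-512x1024.py \
    checkpoints/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth
```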
You can check [the source code](../../../tools/slurm_test.sh) to review full arguments and environment variables.
**Note:** When using Slurm, the port option needs to be set in one of the following ways:
1. Set the port through `--cfg-options`, as sketched below. This is recommended since it does not change the original configs.
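A sketch of this first option, assuming the distributed port is exposed as `env_cfg.dist_cfg.port` in MMEngine-style configs (verify the key against your own config):

```shell
GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py --cfg-options env_cfg.dist_cfg.port=29500
GPUS=4 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py --cfg-options env_cfg.dist_cfg.port=29501
```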
2. Set the port in the command using the environment variable `MASTER_PORT`:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 MASTER_PORT=29500 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 MASTER_PORT=29501 sh tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
```