## Train a model
MMSegmentation implements both distributed and non-distributed training,
which use `MMDistributedDataParallel` and `MMDataParallel` respectively.

All outputs (log files and checkpoints) will be saved to the working directory,
which is specified by `work_dir` in the config file.

By default, we evaluate the model on the validation set every few iterations. You can change the evaluation interval by adding the `interval` argument to the training config.

```python
evaluation = dict(interval=4000)  # Evaluate the model every 4000 iterations.
```

**\*Important\***: The default learning rate in config files is for 4 GPUs and 2 img/gpu (batch size = 4x2 = 8).
Equivalently, you may also use 8 GPUs and 1 img/gpu, since all models use cross-GPU SyncBN.

To trade training speed for lower GPU memory usage, you may pass `--options model.backbone.with_cp=True` to enable checkpointing in the backbone.
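As a minimal sketch, checkpointing can be enabled directly from the command line; `${CONFIG_FILE}` stands for whichever training config you actually use:

```shell
# Enable backbone checkpointing for a single-GPU run; the config path is a placeholder.
python tools/train.py ${CONFIG_FILE} \
    --options model.backbone.with_cp=True
```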
### Train with a single GPU
```shell
python tools/train.py ${CONFIG_FILE} [optional arguments]
```

If you want to specify the working directory in the command, you can add an argument `--work-dir ${YOUR_WORK_DIR}`.
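For instance, a sketch of a single-GPU run; the SegFormer config path and the work directory below are illustrative only, so substitute whichever config and output directory you actually use:

```shell
# Illustrative paths only; replace with your own config and output directory.
python tools/train.py local_configs/segformer/B0/segformer.b0.512x512.ade.160k.py \
    --work-dir ./work_dirs/segformer_b0_ade
```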
### Train with multiple GPUs
```shell
./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
```

Optional arguments are:

- `--no-validate` (**not suggested**): By default, the codebase performs evaluation every k iterations during training. To disable this behavior, use `--no-validate`.
- `--work-dir ${WORK_DIR}`: Override the working directory specified in the config file.
- `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file (to continue the training process).
- `--load-from ${CHECKPOINT_FILE}`: Load weights from a checkpoint file (to start fine-tuning for another task).

Difference between `resume-from` and `load-from`:

- `resume-from` loads both the model weights and the optimizer state, including the iteration number.
- `load-from` loads only the model weights and starts training from iteration 0.
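As a minimal sketch combining these options, the work directory and checkpoint path below are placeholders for whatever your earlier run produced:

```shell
# 4-GPU training that resumes an earlier run; both paths are placeholders.
./tools/dist_train.sh ${CONFIG_FILE} 4 \
    --work-dir ./work_dirs/my_experiment \
    --resume-from ./work_dirs/my_experiment/latest.pth
```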
### Train with multiple machines
If you run MMSegmentation on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single-machine training.)

```shell
[GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} --work-dir ${WORK_DIR}
```

Here is an example of using 16 GPUs to train PSPNet on the dev partition.

```shell
GPUS=16 ./tools/slurm_train.sh dev pspr50 configs/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes.py /nfs/xxxx/psp_r50_512x1024_40ki_cityscapes
```

You can check [slurm_train.sh](../tools/slurm_train.sh) for full arguments and environment variables.

If you have multiple machines connected only by Ethernet, you can refer to the
PyTorch [launch utility](https://pytorch.org/docs/stable/distributed_deprecated.html#launch-utility).
Training is usually slow if you do not have high-speed networking such as InfiniBand.
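A rough sketch of such a manual two-node launch is below; the master address, port, and per-node GPU count are placeholder assumptions for your own machines, and `tools/train.py` is started with `--launcher pytorch` as `dist_train.sh` does:

```shell
# On the first machine (rank 0, acting as master); 10.0.0.1 and 29500 are placeholders.
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=8 \
    --master_addr=10.0.0.1 --master_port=29500 \
    tools/train.py ${CONFIG_FILE} --launcher pytorch

# On the second machine (rank 1), pointing at the same master address and port:
python -m torch.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node=8 \
    --master_addr=10.0.0.1 --master_port=29500 \
    tools/train.py ${CONFIG_FILE} --launcher pytorch
```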
### Launch multiple jobs on a single machine
If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
you need to specify different ports (29500 by default) for each job to avoid communication conflicts. Otherwise, there will be an error message saying `RuntimeError: Address already in use`.

If you use `dist_train.sh` to launch training jobs, you can set the port in the commands with the environment variable `PORT`.

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
```

If you use `slurm_train.sh` to launch training jobs, you can set the port in the commands with the environment variable `MASTER_PORT`.

```shell
MASTER_PORT=29500 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
MASTER_PORT=29501 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE}
```