[Docs] How to specify specific GPU training and inference (#503)

* Chinese version: training on specified GPUs

* remove unnecessary files

* typo

* rebase dev

* add test example

* add english version

* fix format

* english typo

* Update docs/zh_cn/advanced_guides/how_to.md

Co-authored-by: Range King <RangeKingHZ@gmail.com>

* Update docs/zh_cn/advanced_guides/how_to.md

Co-authored-by: Range King <RangeKingHZ@gmail.com>

---------

Co-authored-by: Range King <RangeKingHZ@gmail.com>
yechenzhi 2023-02-06 10:12:04 +08:00 committed by GitHub
parent 6acde82ec8
commit 1dee9eed6e
2 changed files with 48 additions and 0 deletions


@ -546,3 +546,27 @@ python ./tools/train.py \
- `randomness.seed=2023`, set the random seed to 2023.
- `randomness.diff_rank_seed=True`, set different seeds according to global rank. Defaults to False.
- `randomness.deterministic=True`, set the deterministic option for cuDNN backend, i.e., set `torch.backends.cudnn.deterministic` to True and `torch.backends.cudnn.benchmark` to False. Defaults to False. See https://pytorch.org/docs/stable/notes/randomness.html for more details.
## Specify specific GPUs during training or inference
If you have multiple GPUs, for example 8 GPUs numbered `0, 1, 2, 3, 4, 5, 6, 7`, single-GPU training or inference uses GPU 0 by default. To use a different GPU, set `CUDA_VISIBLE_DEVICES` in front of the command:
```shell
CUDA_VISIBLE_DEVICES=5 python ./tools/train.py ${CONFIG}  # train
CUDA_VISIBLE_DEVICES=5 python ./tools/test.py ${CONFIG} ${CHECKPOINT_FILE}  # test
```
If you set `CUDA_VISIBLE_DEVICES` to -1 or to an index beyond the last GPU, such as 8 on this machine, no GPU is visible to the process and the CPU will be used for training or inference.
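Because `CUDA_VISIBLE_DEVICES` is written as a prefix to the command, it applies only to that one process. The scoping can be sketched as follows, using `sh -c` as a stand-in for the training script (an assumption for illustration only; no GPU is needed to run it):

```shell
# Stand-in for a training process: report the device list it was given.
seen=$(CUDA_VISIBLE_DEVICES=5 sh -c 'echo "${CUDA_VISIBLE_DEVICES}"')
echo "child process sees: ${seen}"

# Note: inside the process the visible GPUs are renumbered from 0,
# so physical GPU 5 appears to the framework as device 0.
```

To keep the setting for every later command in the same shell, `export CUDA_VISIBLE_DEVICES=5` can be used instead of the prefix form.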
If you want to use several of these GPUs to train in parallel, you can use the following command:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh ${CONFIG} ${GPU_NUM}
```
Here `GPU_NUM` is 4, matching the number of GPUs listed in `CUDA_VISIBLE_DEVICES`. In addition, if several multi-GPU tasks are trained in parallel on one machine, each task needs a different `PORT` to avoid communication conflicts, as in the following commands:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG} 4
```
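The pairing of device lists, GPU counts, and ports above can also be generated rather than typed by hand. A minimal sketch, assuming a base port of 29500 as in the commands above; the names `DEVICE_GROUPS`, `BASE_PORT`, and `task` are hypothetical, and the script only prints the commands instead of launching them:

```shell
# Hypothetical helper: one comma-separated device group per task.
DEVICE_GROUPS="0,1,2,3 4,5,6,7"
BASE_PORT=29500

task=0
for gpus in ${DEVICE_GROUPS}; do
  # Count the comma-separated IDs so the GPU count always matches the device list.
  gpu_num=$(echo "${gpus}" | awk -F',' '{print NF}')
  # Give each task its own port to avoid communication conflicts.
  port=$((BASE_PORT + task))
  echo "CUDA_VISIBLE_DEVICES=${gpus} PORT=${port} ./tools/dist_train.sh \${CONFIG} ${gpu_num}"
  task=$((task + 1))
done
```

Deriving the GPU count from the device list avoids a mismatch between the two values when the list is edited later.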


@ -552,3 +552,27 @@ python ./tools/train.py \
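- `randomness.diff_rank_seed=True`, set different seeds according to rank. `diff_rank_seed` defaults to False.
- `randomness.deterministic=True`, set the deterministic option for the cuDNN backend to True, i.e., set `torch.backends.cudnn.deterministic` to True and `torch.backends.cudnn.benchmark` to False. `deterministic` defaults to False. See https://pytorch.org/docs/stable/notes/randomness.html for more details.
## Specify specific GPUs for training or inference
If you have multiple GPUs, for example 8 GPUs numbered `0, 1, 2, 3, 4, 5, 6, 7`, single-GPU training or inference uses GPU 0 by default. To use a different GPU for training or inference, you can use the following commands: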
```shell
CUDA_VISIBLE_DEVICES=5 python ./tools/train.py ${CONFIG}  # train
CUDA_VISIBLE_DEVICES=5 python ./tools/test.py ${CONFIG} ${CHECKPOINT_FILE}  # test
```
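If you set `CUDA_VISIBLE_DEVICES` to -1 or to a number greater than the largest GPU index, such as 8, the CPU will be used for training or inference.
If you want to use several of these GPUs to train in parallel, you can use the following command: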
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh ${CONFIG} ${GPU_NUM}
```
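Here `GPU_NUM` is 4. In addition, if several multi-GPU training tasks run at the same time on one machine, each task needs a different port, as in the following commands: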
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG} 4
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG} 4
```