[Feature] Add multi-machine dist_train (#734)
* add dist_train with multi machines
* add dist_train with multi machines

parent 7856141132
commit 3482521587
docs/getting_started.md

@@ -130,7 +130,7 @@ And then run the script [above](#train-with-a-single-gpu).
 The process of training on the CPU is consistent with single GPU training. We just need to disable GPUs before the training process.
 ```

-### Train with multiple GPUs
+### Train with multiple GPUs on a single machine

 ```shell
 ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
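As a concrete instance of the command above (the config file, GPU count, and work directory are placeholders; `--work-dir` is assumed to be an option of tools/train.py, to which the extra arguments are forwarded):

```shell
# 4 GPUs on one machine; everything after the GPU count is passed
# through to tools/train.py.
./tools/dist_train.sh configs/resnet/resnet50_b32x8_imagenet.py 4 --work-dir work_dirs/resnet50_example
```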
@@ -148,6 +148,22 @@ Difference between `resume-from` and `load-from`:

 ### Train with multiple machines

+If you launch training on multiple machines that are simply connected via Ethernet, you can run the following commands:
+
+On the first machine:
+
+```shell
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
+```
+
+On the second machine:
+
+```shell
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
+```
+
+Training is usually slow if you do not have high-speed networking like InfiniBand.
+
 If you run MMClassification on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single machine training.)

 ```shell
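For a concrete illustration of the commands added above (the IP address, config file, and GPU count are placeholder values for this example, not part of the PR), launching on two 8-GPU machines where the first machine is reachable at 192.168.1.100 looks like this:

```shell
# On machine 0 (runs the master process, assumed reachable at 192.168.1.100):
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=192.168.1.100 \
    sh tools/dist_train.sh configs/resnet/resnet50_b32x8_imagenet.py 8

# On machine 1 (same master address and port, different node rank):
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=192.168.1.100 \
    sh tools/dist_train.sh configs/resnet/resnet50_b32x8_imagenet.py 8
```

Both machines must agree on MASTER_ADDR and PORT, each machine needs a unique NODE_RANK, and the effective world size is NNODES × GPUS.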
docs_zh-CN/getting_started.md

@@ -128,7 +128,7 @@ export CUDA_VISIBLE_DEVICES=-1
 We do not recommend training on the CPU, as it is far too slow. This feature is supported only to make it convenient to debug on machines without a GPU.
 ```

-### Train with multiple GPUs
+### Train with multiple GPUs on a single machine

 ```shell
 ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [optional arguments]
@@ -146,6 +146,22 @@ export CUDA_VISIBLE_DEVICES=-1

 ### Train with multiple machines

+If you want to use multiple machines connected via Ethernet, you can run the following commands:
+
+On the first machine:
+
+```shell
+NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
+```
+
+On the second machine:
+
+```shell
+NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR sh tools/dist_train.sh $CONFIG $GPUS
+```
+
+However, if you do not connect these machines with high-speed networking, training will be very slow.
+
 If you run MMClassification on a cluster managed with [slurm](https://slurm.schedmd.com/), you can use the script `slurm_train.sh`. (This script also supports single-machine training.)

 ```shell
tools/dist_test.sh

@@ -3,8 +3,20 @@
 CONFIG=$1
 CHECKPOINT=$2
 GPUS=$3
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
 PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

 PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/test.py $CONFIG $CHECKPOINT --launcher pytorch ${@:4}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/test.py \
+    $CONFIG \
+    $CHECKPOINT \
+    --launcher pytorch \
+    ${@:4}
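A usage sketch for the updated test script (the config path, checkpoint path, and address are placeholders): distributed testing across machines mirrors training, with the same environment variables selecting the node count, node rank, and master address.

```shell
# First machine (rank 0), assumed reachable at 192.168.1.100:
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=192.168.1.100 \
    sh tools/dist_test.sh configs/example_config.py work_dirs/example/latest.pth 8

# Second machine (rank 1), same address and port:
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=192.168.1.100 \
    sh tools/dist_test.sh configs/example_config.py work_dirs/example/latest.pth 8
```

Any arguments after the GPU count (`${@:4}`) are forwarded to tools/test.py unchanged.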
tools/dist_train.sh

@@ -2,8 +2,18 @@

 CONFIG=$1
 GPUS=$2
+NNODES=${NNODES:-1}
+NODE_RANK=${NODE_RANK:-0}
 PORT=${PORT:-29500}
+MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}

 PYTHONPATH="$(dirname $0)/..":$PYTHONPATH \
-python -m torch.distributed.launch --nproc_per_node=$GPUS --master_port=$PORT \
-    $(dirname "$0")/train.py $CONFIG --launcher pytorch ${@:3}
+python -m torch.distributed.launch \
+    --nnodes=$NNODES \
+    --node_rank=$NODE_RANK \
+    --master_addr=$MASTER_ADDR \
+    --nproc_per_node=$GPUS \
+    --master_port=$PORT \
+    $(dirname "$0")/train.py \
+    $CONFIG \
+    --launcher pytorch ${@:3}
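Since every new variable falls back to a single-node default (NNODES=1, NODE_RANK=0, MASTER_ADDR=127.0.0.1, PORT=29500), the existing single-machine invocation keeps working without changes; a minimal sketch with a placeholder config path:

```shell
# Single machine, 4 GPUs: no environment variables needed,
# the defaults NNODES=1, NODE_RANK=0 and MASTER_ADDR=127.0.0.1 apply.
sh tools/dist_train.sh configs/example_config.py 4

# Override only PORT if 29500 is already in use on this machine:
PORT=29501 sh tools/dist_train.sh configs/example_config.py 4
```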