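# Speed up Training

## Distributed Training

MMEngine supports training on CPU, on a single GPU, on multiple GPUs in one machine, and on multiple GPUs across several machines. When several GPUs are available, the following commands launch single-machine or multi-machine multi-GPU training to shorten the training time.

- Single machine, multiple GPUs

  Assuming the current machine has 8 GPUs, multi-GPU training can be launched with: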

  ```bash
  python -m torch.distributed.launch --nproc_per_node=8 examples/train.py --launcher pytorch
  ```

  To select specific GPUs, set the `CUDA_VISIBLE_DEVICES` environment variable. For example, to use GPUs 0 and 3:

  ```bash
  CUDA_VISIBLE_DEVICES=0,3 python -m torch.distributed.launch --nproc_per_node=2 examples/train.py --launcher pytorch
  ```

- Multiple machines, multiple GPUs

  Assume there are 2 machines, each with 8 GPUs.

  Run the following command on the first machine:

  ```bash
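  # --nnodes is the number of machines (2 here). Replace 127.0.0.1 with an
  # address of the first machine that both machines can reach.
  python -m torch.distributed.launch \
      --nnodes 2 \
      --node_rank 0 \
      --master_addr 127.0.0.1 \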
      --master_port 29500 \
      --nproc_per_node=8 \
      examples/train.py --launcher pytorch
  ```

  Run the following command on the second machine:

  ```bash
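  # Use the same --nnodes, --master_addr and --master_port on every machine;
  # only --node_rank differs.
  python -m torch.distributed.launch \
      --nnodes 2 \
      --node_rank 1 \
      --master_addr 127.0.0.1 \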
      --master_port 29500 \
      --nproc_per_node=8 \
      examples/train.py --launcher pytorch
  ```

  On a Slurm cluster, the following command alone launches training on 2 machines with 16 GPUs in total:

  ```bash
  srun -p mm_dev \
      --job-name=test \
      --gres=gpu:8 \
      --ntasks=16 \
      --ntasks-per-node=8 \
      --cpus-per-task=5 \
      --kill-on-bad-exit=1 \
      python examples/train.py --launcher="slurm"
  ```
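The `--launcher` flag in the commands above is consumed by the training script itself, which would then pass the value on as `Runner(..., launcher=args.launcher)`. The following is a minimal sketch of how a script such as `examples/train.py` might parse it. The function names `parse_args` and `global_rank` are illustrative assumptions, not MMEngine API. `torch.distributed.launch` passes the local rank as `--local_rank` (or `--local-rank` in newer PyTorch releases), so the sketch accepts both spellings. The rank formula assumes ranks are assigned contiguously per machine.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags used by the launch commands (illustrative sketch)."""
    parser = argparse.ArgumentParser(description='Distributed training')
    parser.add_argument(
        '--launcher',
        choices=['none', 'pytorch', 'slurm', 'mpi'],
        default='none',
        help='job launcher')
    # torch.distributed.launch appends the local rank of each process
    parser.add_argument('--local_rank', '--local-rank', type=int, default=0)
    return parser.parse_args(argv)


def global_rank(node_rank, nproc_per_node, local_rank):
    """Rank of a process among all nnodes * nproc_per_node processes."""
    return node_rank * nproc_per_node + local_rank


args = parse_args(['--launcher', 'pytorch', '--local_rank', '3'])
print(args.launcher, args.local_rank)
# Process 3 on the second machine (node_rank 1) of the 2 x 8 GPU setup
print(global_rank(node_rank=1, nproc_per_node=8, local_rank=3))
```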

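## Mixed Precision Training

Nvidia introduced Tensor Cores in the Volta and Turing architectures to support mixed FP32/FP16 computation, and added BF16 support in the Ampere architecture. With automatic mixed precision (AMP) enabled, some operators run in FP16/BF16 while the rest run in FP32. Without changing the model, and usually without a noticeable loss in accuracy, this shortens training time and reduces memory usage, which makes it possible to train with larger batch sizes, larger models, and larger inputs.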
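To make the difference between the two reduced-precision formats concrete, the sketch below rounds Python floats to FP16 with the standard `struct` module and emulates BF16 by truncating the low 16 bits of the FP32 encoding. It only illustrates the number formats, under the simplifying assumption of truncation rather than round-to-nearest for BF16. `to_fp16` and `to_bf16` are hypothetical helper names, not PyTorch or MMEngine API.

```python
import struct


def to_fp16(x):
    """Round x to IEEE 754 half precision; return inf on overflow."""
    try:
        return struct.unpack('<e', struct.pack('<e', x))[0]
    except OverflowError:
        return float('inf') if x > 0 else float('-inf')


def to_bf16(x):
    """Emulate BF16 by keeping only the top 16 bits of the FP32 encoding."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]


# A tiny value, an ordinary value, and a value beyond the FP16 range
for value in (1e-8, 0.1, 70000.0):
    print(value, to_fp16(value), to_bf16(value))
```

FP16 has more mantissa bits but a narrow exponent range, so very small values underflow to zero and large values overflow. BF16 keeps the FP32 exponent range at the cost of precision.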

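[PyTorch has officially supported AMP since 1.6](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/). If you are interested in how automatic mixed precision is implemented, see [torch.cuda.amp: a detailed look at automatic mixed precision](https://zhuanlan.zhihu.com/p/348554267) (in Chinese).

MMEngine provides [AmpOptimWrapper](mmengine.optim.AmpOptimWrapper), a wrapper for automatic mixed precision. Setting `type='AmpOptimWrapper'` in `optim_wrapper` enables AMP training; no other code changes are needed.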

```python
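from mmengine.runner import Runner

# ResNet18 and train_dataloader_cfg are assumed to be defined elsewhere
runner = Runner(
    model=ResNet18(),
    work_dir='./work_dir',
    train_dataloader=train_dataloader_cfg,
    optim_wrapper=dict(
        type='AmpOptimWrapper',
        # Uncomment the next line to use BF16
        # dtype='bfloat16',  # valid values: ('float16', 'bfloat16', None)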
        optimizer=dict(type='SGD', lr=0.001, momentum=0.9)),
    train_cfg=dict(by_epoch=True, max_epochs=3),
)
runner.train()
```
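When the same script should run both with and without AMP, the switch comes down to a small change in the `optim_wrapper` config. The helper below is a sketch under that assumption. `enable_amp` is a hypothetical name, and it only touches the `type` and `dtype` keys that appear in the config above.

```python
import copy


def enable_amp(optim_wrapper_cfg, dtype=None):
    """Return a copy of the config that uses AmpOptimWrapper.

    dtype may be 'float16', 'bfloat16' or None (keep the default).
    """
    if dtype not in ('float16', 'bfloat16', None):
        raise ValueError(f'unsupported dtype: {dtype!r}')
    cfg = copy.deepcopy(optim_wrapper_cfg)  # leave the input untouched
    cfg['type'] = 'AmpOptimWrapper'
    if dtype is not None:
        cfg['dtype'] = dtype
    return cfg


base_cfg = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.001, momentum=0.9))
amp_cfg = enable_amp(base_cfg, dtype='bfloat16')
print(amp_cfg)
```

The returned dict can be passed as the `optim_wrapper` argument of `Runner` in place of the literal config.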

```{warning}
As of PyTorch 1.13, using `torch.bfloat16` directly in `Convolution` is slow unless the environment variable `TORCH_CUDNN_V8_API_ENABLED=1` is set manually to enable the cuDNN BF16 convolution. See the discussion in this [PyTorch Issue](https://github.com/pytorch/pytorch/issues/57707#issuecomment-1166656767).
```
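One way to make sure the variable is set for a training run is to export it from the entry script. The sketch below sets it before `import torch`, which is the conservative choice since it makes no assumption about when PyTorch reads the variable. Exporting it in the shell before launching works equally well.

```python
import os

# Keep a value the user already exported; otherwise enable the cuDNN v8 API
os.environ.setdefault('TORCH_CUDNN_V8_API_ENABLED', '1')

# import torch  # import PyTorch only after the variable is set
print(os.environ['TORCH_CUDNN_V8_API_ENABLED'])
```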

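## Model Compilation

PyTorch 2.0 introduced [torch.compile](https://pytorch.org/docs/2.0/dynamo/get-started.html), which compiles the model to speed up training and validation. MMEngine supports this feature since v0.7.0. To enable it, pass a dict containing the `compile` key to the `cfg` parameter of `Runner`: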

```python
runner = Runner(
    model=ResNet18(),
    # ... your other Runner arguments
    cfg=dict(compile=True)
)
```

You can also pass more compile options. See the [torch.compile API documentation](https://pytorch.org/docs/2.0/generated/torch.compile.html#torch-compile) for all available options.

```python
compile_options = dict(backend='inductor', mode='max-autotune')
runner = Runner(
    model=ResNet18(),
    # ... your other Runner arguments
    cfg=dict(compile=compile_options)
)
```

This feature is only available with PyTorch >= 2.0.0.
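A script that must also run on older PyTorch versions can guard the `compile` option. The sketch below works on a version string such as `torch.__version__`. `supports_compile` and `build_runner_cfg` are hypothetical helper names introduced for illustration, not MMEngine API.

```python
def supports_compile(torch_version):
    """Return True if the version string denotes PyTorch >= 2.0.0."""
    # Drop local build tags such as '+cu117' and keep the major number
    major = torch_version.split('+')[0].split('.')[0]
    return int(major) >= 2


def build_runner_cfg(torch_version, compile_options=True):
    """Build the `cfg` dict for Runner, enabling compile only if supported."""
    if supports_compile(torch_version):
        return dict(compile=compile_options)
    return dict()


print(build_runner_cfg('1.13.1'))
print(build_runner_cfg('2.0.1+cu117',
                       dict(backend='inductor', mode='max-autotune')))
```

In a real script, `torch.__version__` would be passed as `torch_version` and the result as the `cfg` argument of `Runner`.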

```{warning}
`torch.compile` is still under active development by the PyTorch team, and some models may fail to compile. If that happens, consult the [PyTorch Dynamo FAQ](https://pytorch.org/docs/2.0/dynamo/faq.html) for common problems, or follow [TorchDynamo Troubleshooting](https://pytorch.org/docs/2.0/dynamo/troubleshooting.html) to file an issue with PyTorch.
```

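## Using Faster Optimizers

On Ascend devices, you can use the Ascend optimizers to shorten the training time. The optimizers supported on Ascend devices are: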

- NpuFusedAdadelta
- NpuFusedAdam
- NpuFusedAdamP
- NpuFusedAdamW
- NpuFusedBertAdam
- NpuFusedLamb
- NpuFusedRMSprop
- NpuFusedRMSpropTF
- NpuFusedSGD

They are used in the same way as the native optimizers; see [Using optimizers](../tutorials/optim_wrapper.md#在执行器中配置优化器封装).
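As a sketch of what this looks like in a config, the helper below swaps the optimizer type depending on whether an Ascend device is used. `make_optim_wrapper` and its `use_npu` flag are hypothetical, and the sketch assumes that `NpuFusedSGD` accepts the same `lr` and `momentum` arguments as the native `SGD`.

```python
def make_optim_wrapper(use_npu):
    """Return an optim_wrapper config using a fused optimizer on Ascend."""
    optimizer_type = 'NpuFusedSGD' if use_npu else 'SGD'
    return dict(
        type='OptimWrapper',
        optimizer=dict(type=optimizer_type, lr=0.001, momentum=0.9))


print(make_optim_wrapper(use_npu=True))
```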