[Feature] support cpu training (#188)

* [Fix] modify non-dist training algorithm list

* [Feature] support cpu training

* [Docs] modify description
Yixiao Fang 2022-01-28 17:49:46 +08:00 committed by GitHub
parent c39bb83a7c
commit a807b38e4c
4 changed files with 24 additions and 3 deletions


@@ -2,6 +2,7 @@
- [Getting Started](#getting-started)
- [Train existing methods](#train-existing-methods)
- [Training with CPU](#training-with-cpu)
- [Train with single/multiple GPUs](#train-with-singlemultiple-gpus)
- [Train with multiple machines](#train-with-multiple-machines)
- [Launch multiple jobs on a single machine](#launch-multiple-jobs-on-a-single-machine)
@@ -18,6 +19,15 @@ This page provides basic tutorials about the usage of MMSelfSup. For installation
**Note**: The default learning rate in config files is for 8 GPUs. If you use a different number of GPUs, the total batch size changes in proportion, so you have to scale the learning rate following `new_lr = old_lr * new_ngpus / old_ngpus`. We recommend using `tools/dist_train.sh` even with a single GPU, since some methods do not support non-distributed training.
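For instance, the linear scaling rule works out as below; the numbers are only illustrative and are not taken from any particular config:

```python
# Linear learning-rate scaling: new_lr = old_lr * new_ngpus / old_ngpus.
# The values below are made up for illustration, not from an MMSelfSup config.
old_ngpus = 8      # number of GPUs the default configs assume
old_lr = 0.03      # hypothetical base learning rate tuned for 8 GPUs
new_ngpus = 2      # number of GPUs you actually train with

new_lr = old_lr * new_ngpus / old_ngpus
print(new_lr)      # 0.0075 -> set this as the learning rate in your config
```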
### Training with CPU
```shell
export CUDA_VISIBLE_DEVICES=-1
python tools/train.py ${CONFIG_FILE}
```
**Note**: We do not recommend training on CPU because it is very slow, and some algorithms use `SyncBN`, which requires distributed training. We support this feature so that users can conveniently debug on machines without a GPU.
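If you still need to debug a `SyncBN`-based config on CPU, one option is to swap the `SyncBN` layers for ordinary BN first. The helper below is only a sketch in plain PyTorch and is not part of this commit (recent mmcv releases ship a similar `revert_sync_batchnorm` utility); it assumes the `SyncBN` layers normalize 2D convolutional feature maps:

```python
import torch.nn as nn


def replace_syncbn_with_bn(module: nn.Module) -> nn.Module:
    """Recursively replace SyncBatchNorm with BatchNorm2d for CPU debugging.

    Illustrative helper only, not part of MMSelfSup. Assumes the SyncBN
    layers normalize 4-D (N, C, H, W) feature maps.
    """
    if isinstance(module, nn.SyncBatchNorm):
        bn = nn.BatchNorm2d(module.num_features, module.eps, module.momentum,
                            module.affine, module.track_running_stats)
        # carry over learned affine parameters and running statistics
        if module.affine:
            bn.weight.data.copy_(module.weight.data)
            bn.bias.data.copy_(module.bias.data)
        if module.track_running_stats:
            bn.running_mean.data.copy_(module.running_mean.data)
            bn.running_var.data.copy_(module.running_var.data)
        return bn
    for name, child in module.named_children():
        module.add_module(name, replace_syncbn_with_bn(child))
    return module
```

Calling `model = replace_syncbn_with_bn(model)` before training would then let such a model run without a distributed process group.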
### Train with single/multiple GPUs
```shell


@@ -2,6 +2,7 @@
- [Getting Started](#基础教程)
- [Train existing methods](#训练已有的算法)
- [Training with CPU](#使用-cpu-训练)
- [Train with single/multiple GPUs](#使用-单张多张-显卡训练)
- [Train with multiple machines](#使用多台机器训练)
- [Launch multiple jobs on a single machine](#在一台机器上启动多个任务)
@@ -18,6 +19,15 @@
**Note**: A job uses 8 GPUs by default. If you use fewer or more than 8 GPUs, the total batch size scales in proportion and the learning rate follows a linear scaling rule; adjust it with `new_lr = old_lr * new_ngpus / old_ngpus`. In addition, we recommend launching training with `tools/dist_train.sh` even if you only use one GPU, because some algorithms in MMSelfSup do not support non-distributed training.
### Training with CPU
```shell
export CUDA_VISIBLE_DEVICES=-1
python tools/train.py ${CONFIG_FILE}
```
**Note**: We do not recommend training on CPU because it is very slow, and some algorithms only support distributed training, e.g. those using `SyncBN`. We support this feature so that users can conveniently debug on machines without a GPU.
### Train with single/multiple GPUs
```shell


@@ -103,8 +103,8 @@ def train_model(model,
             broadcast_buffers=False,
             find_unused_parameters=find_unused_parameters)
     else:
-        model = MMDataParallel(
-            model.cuda(cfg.gpu_ids[0]), device_ids=cfg.gpu_ids)
+        model = MMDataParallel(model, device_ids=cfg.gpu_ids)
     # build optimizer
     optimizer = build_optimizer(model, cfg.optimizer)
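The change above drops the eager `model.cuda(cfg.gpu_ids[0])` call. `MMDataParallel` extends `torch.nn.DataParallel`, and `DataParallel` falls back to running the wrapped module directly on CPU when no CUDA device is visible, which is what makes the `CUDA_VISIBLE_DEVICES=-1` workflow from the docs work. A minimal plain-PyTorch sketch of that fallback (not MMSelfSup code):

```python
import os

# Hide all GPUs before anything queries CUDA, mirroring the documented workflow.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                        # stand-in for a real algorithm
wrapped = nn.DataParallel(model, device_ids=[0])

# With no visible CUDA device, DataParallel clears device_ids and simply
# forwards to the wrapped module on CPU, so no explicit .cuda() is needed.
print(wrapped.device_ids)                      # []
print(wrapped(torch.randn(3, 4)).shape)        # torch.Size([3, 2])
```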


@@ -113,7 +113,8 @@ def main():
     if args.launcher == 'none':
         distributed = False
         assert cfg.model.type not in [
-            'DeepCluster', 'MoCo', 'SimCLR', 'ODC', 'NPID', 'DenseCL'
+            'DeepCluster', 'MoCo', 'SimCLR', 'ODC', 'NPID', 'SimSiam',
+            'DenseCL', 'BYOL'
         ], f'{cfg.model.type} does not support non-dist training.'
     else:
         distributed = True