diff --git a/docs/en/getting_started.md b/docs/en/getting_started.md
index 98d79b02..9b0a97a3 100644
--- a/docs/en/getting_started.md
+++ b/docs/en/getting_started.md
@@ -2,6 +2,7 @@
 
 - [Getting Started](#getting-started)
   - [Train existing methods](#train-existing-methods)
+    - [Training with CPU](#training-with-cpu)
     - [Train with single/multiple GPUs](#train-with-singlemultiple-gpus)
     - [Train with multiple machines](#train-with-multiple-machines)
     - [Launch multiple jobs on a single machine](#launch-multiple-jobs-on-a-single-machine)
@@ -18,6 +19,15 @@ This page provides basic tutorials about the usage of MMSelfSup. For installatio
 
 **Note**: The default learning rate in config files is for 8 GPUs. If using different number GPUs, the total batch size will change in proportion, you have to scale the learning rate following `new_lr = old_lr * new_ngpus / old_ngpus`. We recommend to use `tools/dist_train.sh` even with 1 gpu, since some methods do not support non-distributed training.
 
+### Training with CPU
+
+```shell
+export CUDA_VISIBLE_DEVICES=-1
+python tools/train.py ${CONFIG_FILE}
+```
+
+**Note**: We do not recommend using CPU for training: it is very slow, and some algorithms use `SyncBN`, which requires distributed training. We support this feature only so that users can conveniently debug on machines without a GPU.
+
 ### Train with single/multiple GPUs
 
 ```shell
diff --git a/docs/zh_cn/getting_started.md b/docs/zh_cn/getting_started.md
index 5efaab87..77714444 100644
--- a/docs/zh_cn/getting_started.md
+++ b/docs/zh_cn/getting_started.md
@@ -2,6 +2,7 @@
 
 - [基础教程](#基础教程)
   - [训练已有的算法](#训练已有的算法)
+    - [使用 CPU 训练](#使用-cpu-训练)
     - [使用 单张/多张 显卡训练](#使用-单张多张-显卡训练)
     - [使用多台机器训练](#使用多台机器训练)
     - [在一台机器上启动多个任务](#在一台机器上启动多个任务)
@@ -18,6 +19,15 @@
 
 **注意**: 当您启动一个任务的时候,默认会使用8块显卡. 如果您想使用少于或多余8块显卡, 那么你的 batch size 也会同比例缩放,同时您的学习率服从一个线性缩放原则, 那么您可以使用以下公式来调整您的学习率: `new_lr = old_lr * new_ngpus / old_ngpus`. 除此之外,我们推荐您使用 `tools/dist_train.sh` 来启动训练任务,即便您只使用一块显卡, 因为 MMSelfSup 中有些算法不支持非分布式训练。
 
+### 使用 CPU 训练
+
+```shell
+export CUDA_VISIBLE_DEVICES=-1
+python tools/train.py ${CONFIG_FILE}
+```
+
+**注意**: 我们不推荐使用 CPU 进行训练:CPU 的训练速度很慢,而且一些算法使用了 `SyncBN`,需要分布式训练。我们支持该功能只是为了方便用户在没有 GPU 的机器上进行调试。
+
 ### 使用 单张/多张 显卡训练
 
 ```shell
diff --git a/mmselfsup/apis/train.py b/mmselfsup/apis/train.py
index e01633d5..46f02615 100644
--- a/mmselfsup/apis/train.py
+++ b/mmselfsup/apis/train.py
@@ -103,8 +103,8 @@ def train_model(model,
             broadcast_buffers=False,
             find_unused_parameters=find_unused_parameters)
     else:
-        model = MMDataParallel(
-            model.cuda(cfg.gpu_ids[0]), device_ids=cfg.gpu_ids)
+        model = MMDataParallel(model, device_ids=cfg.gpu_ids)
+
     # build optimizer
     optimizer = build_optimizer(model, cfg.optimizer)
diff --git a/tools/train.py b/tools/train.py
index 4049e4e0..83ce365c 100644
--- a/tools/train.py
+++ b/tools/train.py
@@ -113,7 +113,8 @@ def main():
     if args.launcher == 'none':
         distributed = False
         assert cfg.model.type not in [
-            'DeepCluster', 'MoCo', 'SimCLR', 'ODC', 'NPID', 'DenseCL'
+            'DeepCluster', 'MoCo', 'SimCLR', 'ODC', 'NPID', 'SimSiam',
+            'DenseCL', 'BYOL'
         ], f'{cfg.model.type} does not support non-dist training.'
     else:
         distributed = True
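A note on the `mmselfsup/apis/train.py` hunk above: dropping the explicit `model.cuda(cfg.gpu_ids[0])` call still places the model correctly because `MMDataParallel` subclasses `torch.nn.DataParallel`, whose constructor already handles device placement. With `CUDA_VISIBLE_DEVICES=-1` it keeps the module on CPU and clears `device_ids`; with a single visible device it moves the module there itself. The snippet below is a hypothetical sanity check, not part of this PR, illustrating that behaviour with plain `nn.DataParallel`; `gpu_ids` is a stand-in for `cfg.gpu_ids`.

```python
# Hypothetical sanity check (not part of the PR): shows why the wrapped model
# ends up on the right device without an explicit `.cuda()` call.
import torch
from torch import nn

model = nn.Linear(8, 2)
gpu_ids = [0]  # stand-in for cfg.gpu_ids

wrapped = nn.DataParallel(model, device_ids=gpu_ids)

if torch.cuda.is_available():
    # Single-device case: DataParallel moves the module to device_ids[0]
    # in its constructor, so the parameters are already on the GPU.
    assert next(wrapped.module.parameters()).is_cuda
else:
    # CUDA_VISIBLE_DEVICES=-1 case: the constructor returns early, clears
    # device_ids, leaves the module on CPU, and forward() simply calls
    # the wrapped module.
    assert wrapped.device_ids == []
    assert not next(wrapped.module.parameters()).is_cuda

print(wrapped(torch.randn(4, 8)).shape)
```

Since `MMDataParallel` only supports single-GPU non-distributed training anyway, the single-device auto-move in the constructor covers the GPU case, while the empty `device_ids` path covers CPU debugging.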
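On the `SyncBN` caveat in the new note: `SyncBatchNorm` layers expect GPU inputs and an initialized process group, so a config that uses SyncBN can still fail on CPU even with `CUDA_VISIBLE_DEVICES=-1`. The helper below is a hypothetical workaround, not part of this PR or of MMSelfSup, that swaps `SyncBatchNorm` for plain `BatchNorm2d` before training; recent MMCV versions ship a similar utility (`revert_sync_batchnorm`), so treat this only as a sketch of the idea.

```python
# Hypothetical helper (not part of the PR): replace SyncBatchNorm with
# BatchNorm2d so a config that uses SyncBN can be debugged on CPU.
import torch
from torch import nn


def revert_sync_bn(module: nn.Module) -> nn.Module:
    """Recursively convert nn.SyncBatchNorm layers to nn.BatchNorm2d."""
    converted = module
    if isinstance(module, nn.SyncBatchNorm):
        converted = nn.BatchNorm2d(module.num_features, module.eps,
                                   module.momentum, module.affine,
                                   module.track_running_stats)
        if module.affine:
            # Reuse the learned affine parameters.
            converted.weight = module.weight
            converted.bias = module.bias
        if module.track_running_stats:
            # Carry over the running statistics.
            converted.running_mean = module.running_mean
            converted.running_var = module.running_var
            converted.num_batches_tracked = module.num_batches_tracked
    for name, child in module.named_children():
        converted.add_module(name, revert_sync_bn(child))
    return converted


# Usage sketch: call `model = revert_sync_bn(model)` before wrapping the model
# for a CPU debugging run.
```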