[Feature] support cpu training (#188)

* [Fix] modify non-dist training algorithm list

* [Feature] support cpu training

* [Docs] modify description
Yixiao Fang 2022-01-28 17:49:46 +08:00 committed by GitHub
parent c39bb83a7c
commit a807b38e4c
4 changed files with 24 additions and 3 deletions


@@ -2,6 +2,7 @@
- [Getting Started](#getting-started)
- [Train existing methods](#train-existing-methods)
- [Training with CPU](#training-with-cpu)
- [Train with single/multiple GPUs](#train-with-singlemultiple-gpus)
- [Train with multiple machines](#train-with-multiple-machines)
- [Launch multiple jobs on a single machine](#launch-multiple-jobs-on-a-single-machine)
@@ -18,6 +19,15 @@ This page provides basic tutorials about the usage of MMSelfSup. For installation
**Note**: The default learning rate in config files is for 8 GPUs. If you use a different number of GPUs, the total batch size changes in proportion, so you have to scale the learning rate following `new_lr = old_lr * new_ngpus / old_ngpus`. We recommend using `tools/dist_train.sh` even with a single GPU, since some methods do not support non-distributed training.
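For instance, the linear scaling rule works out as below; the numbers are only illustrative and are not taken from any particular config:

```python
# Linear learning-rate scaling: new_lr = old_lr * new_ngpus / old_ngpus.
# The values below are made up for illustration, not from an MMSelfSup config.
old_ngpus = 8      # number of GPUs the default configs assume
old_lr = 0.03      # hypothetical base learning rate tuned for 8 GPUs
new_ngpus = 2      # number of GPUs you actually train with

new_lr = old_lr * new_ngpus / old_ngpus
print(new_lr)      # 0.0075 -> set this as the learning rate in your config
```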
### Training with CPU
```shell
export CUDA_VISIBLE_DEVICES=-1
python tools/train.py ${CONFIG_FILE}
```
**Note**: We do not recommend training on CPU because it is very slow, and some algorithms use `SyncBN`, which requires distributed training. We support this feature so that users can conveniently debug on machines without a GPU.
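If you still need to debug a `SyncBN`-based config on CPU, one option is to swap the `SyncBN` layers for ordinary BN first. The helper below is only a sketch in plain PyTorch and is not part of this commit (recent mmcv releases ship a similar `revert_sync_batchnorm` utility); it assumes the `SyncBN` layers normalize 2D convolutional feature maps:

```python
import torch.nn as nn


def replace_syncbn_with_bn(module: nn.Module) -> nn.Module:
    """Recursively replace SyncBatchNorm with BatchNorm2d for CPU debugging.

    Illustrative helper only, not part of MMSelfSup. Assumes the SyncBN
    layers normalize 4-D (N, C, H, W) feature maps.
    """
    if isinstance(module, nn.SyncBatchNorm):
        bn = nn.BatchNorm2d(module.num_features, module.eps, module.momentum,
                            module.affine, module.track_running_stats)
        # carry over learned affine parameters and running statistics
        if module.affine:
            bn.weight.data.copy_(module.weight.data)
            bn.bias.data.copy_(module.bias.data)
        if module.track_running_stats:
            bn.running_mean.data.copy_(module.running_mean.data)
            bn.running_var.data.copy_(module.running_var.data)
        return bn
    for name, child in module.named_children():
        module.add_module(name, replace_syncbn_with_bn(child))
    return module
```

Calling `model = replace_syncbn_with_bn(model)` before training would then let such a model run without a distributed process group.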
### Train with single/multiple GPUs
```shell


@@ -2,6 +2,7 @@
- [Getting Started](#基础教程)
- [Train existing methods](#训练已有的算法)
- [Training with CPU](#使用-cpu-训练)
- [Train with single/multiple GPUs](#使用-单张多张-显卡训练)
- [Train with multiple machines](#使用多台机器训练)
- [Launch multiple jobs on a single machine](#在一台机器上启动多个任务)
@@ -18,6 +19,15 @@
**Note**: A job uses 8 GPUs by default. If you use fewer or more than 8 GPUs, the total batch size scales in proportion and the learning rate follows a linear scaling rule; adjust it with `new_lr = old_lr * new_ngpus / old_ngpus`. In addition, we recommend launching training with `tools/dist_train.sh` even if you only use one GPU, because some algorithms in MMSelfSup do not support non-distributed training.
### Training with CPU
```shell
export CUDA_VISIBLE_DEVICES=-1
python tools/train.py ${CONFIG_FILE}
```
**Note**: We do not recommend training on CPU because it is very slow, and some algorithms only support distributed training, e.g. those using `SyncBN`. We support this feature so that users can conveniently debug on machines without a GPU.
### Train with single/multiple GPUs
```shell


@@ -103,8 +103,8 @@ def train_model(model,
             broadcast_buffers=False,
             find_unused_parameters=find_unused_parameters)
     else:
-        model = MMDataParallel(
-            model.cuda(cfg.gpu_ids[0]), device_ids=cfg.gpu_ids)
+        model = MMDataParallel(model, device_ids=cfg.gpu_ids)
     # build optimizer
     optimizer = build_optimizer(model, cfg.optimizer)
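The change above drops the eager `model.cuda(cfg.gpu_ids[0])` call. `MMDataParallel` extends `torch.nn.DataParallel`, and `DataParallel` falls back to running the wrapped module directly on CPU when no CUDA device is visible, which is what makes the `CUDA_VISIBLE_DEVICES=-1` workflow from the docs work. A minimal plain-PyTorch sketch of that fallback (not MMSelfSup code):

```python
import os

# Hide all GPUs before anything queries CUDA, mirroring the documented workflow.
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import torch
import torch.nn as nn

model = nn.Linear(4, 2)                        # stand-in for a real algorithm
wrapped = nn.DataParallel(model, device_ids=[0])

# With no visible CUDA device, DataParallel clears device_ids and simply
# forwards to the wrapped module on CPU, so no explicit .cuda() is needed.
print(wrapped.device_ids)                      # []
print(wrapped(torch.randn(3, 4)).shape)        # torch.Size([3, 2])
```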


@@ -113,7 +113,8 @@ def main():
     if args.launcher == 'none':
         distributed = False
         assert cfg.model.type not in [
-            'DeepCluster', 'MoCo', 'SimCLR', 'ODC', 'NPID', 'DenseCL'
+            'DeepCluster', 'MoCo', 'SimCLR', 'ODC', 'NPID', 'SimSiam',
+            'DenseCL', 'BYOL'
         ], f'{cfg.model.type} does not support non-dist training.'
     else:
         distributed = True