diff --git a/docs/en/getting_started.md b/docs/en/getting_started.md
index 98d79b02..9b0a97a3 100644
--- a/docs/en/getting_started.md
+++ b/docs/en/getting_started.md
@@ -2,6 +2,7 @@
 
 - [Getting Started](#getting-started)
   - [Train existing methods](#train-existing-methods)
+    - [Training with CPU](#training-with-cpu)
     - [Train with single/multiple GPUs](#train-with-singlemultiple-gpus)
     - [Train with multiple machines](#train-with-multiple-machines)
     - [Launch multiple jobs on a single machine](#launch-multiple-jobs-on-a-single-machine)
@@ -18,6 +19,15 @@ This page provides basic tutorials about the usage of MMSelfSup. For installatio
 
 **Note**: The default learning rate in config files is for 8 GPUs. If using different number GPUs, the total batch size will change in proportion, you have to scale the learning rate following `new_lr = old_lr * new_ngpus / old_ngpus`. We recommend to use `tools/dist_train.sh` even with 1 gpu, since some methods do not support non-distributed training.
 
+### Training with CPU
+
+```shell
+export CUDA_VISIBLE_DEVICES=-1
+python tools/train.py ${CONFIG_FILE}
+```
+
+**Note**: We do not recommend using CPU for training: it is very slow, and some algorithms use `SyncBN`, which requires distributed training. We support this feature only so that users can conveniently debug on machines without a GPU.
+
 ### Train with single/multiple GPUs
 
 ```shell
diff --git a/docs/zh_cn/getting_started.md b/docs/zh_cn/getting_started.md
index 5efaab87..77714444 100644
--- a/docs/zh_cn/getting_started.md
+++ b/docs/zh_cn/getting_started.md
@@ -2,6 +2,7 @@
 
 - [基础教程](#基础教程)
   - [训练已有的算法](#训练已有的算法)
+    - [使用 CPU 训练](#使用-cpu-训练)
     - [使用 单张/多张 显卡训练](#使用-单张多张-显卡训练)
     - [使用多台机器训练](#使用多台机器训练)
     - [在一台机器上启动多个任务](#在一台机器上启动多个任务)
@@ -18,6 +19,15 @@
 
 **注意**: 当您启动一个任务的时候,默认会使用8块显卡. 如果您想使用少于或多余8块显卡, 那么你的 batch size 也会同比例缩放,同时您的学习率服从一个线性缩放原则, 那么您可以使用以下公式来调整您的学习率: `new_lr = old_lr * new_ngpus / old_ngpus`. 除此之外,我们推荐您使用 `tools/dist_train.sh` 来启动训练任务,即便您只使用一块显卡, 因为 MMSelfSup 中有些算法不支持非分布式训练。
 
+### 使用 CPU 训练
+
+```shell
+export CUDA_VISIBLE_DEVICES=-1
+python tools/train.py ${CONFIG_FILE}
+```
+
+**注意**: 我们不推荐使用 CPU 进行训练:CPU 的训练速度很慢,而且一些算法使用了 `SyncBN`,需要分布式训练。我们支持该功能只是为了方便用户在没有 GPU 的机器上进行调试。
+
 ### 使用 单张/多张 显卡训练
 
 ```shell
diff --git a/mmselfsup/apis/train.py b/mmselfsup/apis/train.py
index e01633d5..46f02615 100644
--- a/mmselfsup/apis/train.py
+++ b/mmselfsup/apis/train.py
@@ -103,8 +103,8 @@ def train_model(model,
             broadcast_buffers=False,
             find_unused_parameters=find_unused_parameters)
     else:
-        model = MMDataParallel(
-            model.cuda(cfg.gpu_ids[0]), device_ids=cfg.gpu_ids)
+        model = MMDataParallel(model, device_ids=cfg.gpu_ids)
+
     # build optimizer
     optimizer = build_optimizer(model, cfg.optimizer)
diff --git a/tools/train.py b/tools/train.py
index 4049e4e0..83ce365c 100644
--- a/tools/train.py
+++ b/tools/train.py
@@ -113,7 +113,8 @@ def main():
     if args.launcher == 'none':
         distributed = False
         assert cfg.model.type not in [
-            'DeepCluster', 'MoCo', 'SimCLR', 'ODC', 'NPID', 'DenseCL'
+            'DeepCluster', 'MoCo', 'SimCLR', 'ODC', 'NPID', 'SimSiam',
+            'DenseCL', 'BYOL'
         ], f'{cfg.model.type} does not support non-dist training.'
     else:
         distributed = True
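A note on the `mmselfsup/apis/train.py` hunk above: dropping the explicit `model.cuda(cfg.gpu_ids[0])` call still places the model correctly because `MMDataParallel` subclasses `torch.nn.DataParallel`, whose constructor already handles device placement. With `CUDA_VISIBLE_DEVICES=-1` it keeps the module on CPU and clears `device_ids`; with a single visible device it moves the module there itself. The snippet below is a hypothetical sanity check, not part of this PR, illustrating that behaviour with plain `nn.DataParallel`; `gpu_ids` is a stand-in for `cfg.gpu_ids`.

```python
# Hypothetical sanity check (not part of the PR): shows why the wrapped model
# ends up on the right device without an explicit `.cuda()` call.
import torch
from torch import nn

model = nn.Linear(8, 2)
gpu_ids = [0]  # stand-in for cfg.gpu_ids

wrapped = nn.DataParallel(model, device_ids=gpu_ids)

if torch.cuda.is_available():
    # Single-device case: DataParallel moves the module to device_ids[0]
    # in its constructor, so the parameters are already on the GPU.
    assert next(wrapped.module.parameters()).is_cuda
else:
    # CUDA_VISIBLE_DEVICES=-1 case: the constructor returns early, clears
    # device_ids, leaves the module on CPU, and forward() simply calls
    # the wrapped module.
    assert wrapped.device_ids == []
    assert not next(wrapped.module.parameters()).is_cuda

print(wrapped(torch.randn(4, 8)).shape)
```

Since `MMDataParallel` only supports single-GPU non-distributed training anyway, the single-device auto-move in the constructor covers the GPU case, while the empty `device_ids` path covers CPU debugging.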
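On the `SyncBN` caveat in the new note: `SyncBatchNorm` layers expect GPU inputs and an initialized process group, so a config that uses SyncBN can still fail on CPU even with `CUDA_VISIBLE_DEVICES=-1`. The helper below is a hypothetical workaround, not part of this PR or of MMSelfSup, that swaps `SyncBatchNorm` for plain `BatchNorm2d` before training; recent MMCV versions ship a similar utility (`revert_sync_batchnorm`), so treat this only as a sketch of the idea.

```python
# Hypothetical helper (not part of the PR): replace SyncBatchNorm with
# BatchNorm2d so a config that uses SyncBN can be debugged on CPU.
import torch
from torch import nn


def revert_sync_bn(module: nn.Module) -> nn.Module:
    """Recursively convert nn.SyncBatchNorm layers to nn.BatchNorm2d."""
    converted = module
    if isinstance(module, nn.SyncBatchNorm):
        converted = nn.BatchNorm2d(module.num_features, module.eps,
                                   module.momentum, module.affine,
                                   module.track_running_stats)
        if module.affine:
            # Reuse the learned affine parameters.
            converted.weight = module.weight
            converted.bias = module.bias
        if module.track_running_stats:
            # Carry over the running statistics.
            converted.running_mean = module.running_mean
            converted.running_var = module.running_var
            converted.num_batches_tracked = module.num_batches_tracked
    for name, child in module.named_children():
        converted.add_module(name, revert_sync_bn(child))
    return converted


# Usage sketch: call `model = revert_sync_bn(model)` before wrapping the model
# for a CPU debugging run.
```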