[Docs] Add a document about debug tricks (#938)

* fix typo

* [Docs] Add debug skills

* minor fix

* refine

* rename debug_skills to debug_tricks

* refine

* Update docs/en/common_usage/debug_tricks.md
Zaida Zhou 2023-02-21 21:40:35 +08:00 committed by GitHub
parent 4861f034a7
commit 67acdbe245
5 changed files with 59 additions and 3 deletions

docs/en/common_usage/debug_tricks.md

@@ -0,0 +1,3 @@
# Debug Tricks
Coming soon. Please refer to the [Chinese documentation](https://mmengine.readthedocs.io/zh_CN/latest/common_usage/debug_tricks.html).

docs/en/index.rst

@@ -23,6 +23,7 @@ You can switch between Chinese and English documents in the lower-left corner of
common_usage/resume_training.md
common_usage/speed_up_training.md
common_usage/save_gpu_memory.md
common_usage/debug_tricks.md
common_usage/epoch_to_iter.md
.. toctree::

docs/zh_cn/common_usage/debug_tricks.md

@@ -0,0 +1,51 @@
# Debug Tricks
## Set the length of the dataset
During debugging, it is sometimes necessary to train for only a few epochs, for example, to check whether the validation loop or checkpoint saving works as expected. However, if the dataset is too large, it can take a long time to finish even a single epoch; in that case you can reduce the length of the dataset. Note that only datasets inheriting from [BaseDataset](mmengine.dataset.BaseDataset) support this feature. For the usage of `BaseDataset`, please read the [BaseDataset tutorial](../advanced_tutorials/basedataset.md).
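As a quick, self-contained illustration of what `indices` does, here is a minimal sketch; the `ToyDataset` class below is made up for this example and simply fabricates its data list in memory:

```python
from mmengine.dataset import BaseDataset


class ToyDataset(BaseDataset):
    """A tiny in-memory dataset used only to demonstrate ``indices``."""

    def load_data_list(self):
        # Normally this would parse ``ann_file``; here we fabricate 100 samples.
        return [dict(img_path=f'{i}.jpg', gt_label=i % 10) for i in range(100)]


# An int keeps only the first few samples (a negative int would keep the last few).
dataset = ToyDataset(ann_file='', pipeline=[], indices=5)
print(len(dataset))  # 5

# A sequence keeps exactly the listed samples.
dataset = ToyDataset(ann_file='', pipeline=[], indices=[0, 50, 99])
print(len(dataset))  # 3
```

In a config file, the same `indices` field is simply added to the `dataset` dict, as the walkthrough below shows.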
`MMClassification` 为例(参考[文档](https://mmclassification.readthedocs.io/zh_CN/dev-1.x/get_started.html#id2)安装 MMClassification
启动训练命令
```bash
python tools/train.py configs/resnet/resnet18_8xb16_cifar10.py
```
Here is part of the training log, where `3125` is the number of iterations in one epoch.
```
02/20 14:43:11 - mmengine - INFO - Epoch(train) [1][ 100/3125] lr: 1.0000e-01 eta: 6:12:01 time: 0.0149 data_time: 0.0003 memory: 214 loss: 2.0611
02/20 14:43:13 - mmengine - INFO - Epoch(train) [1][ 200/3125] lr: 1.0000e-01 eta: 4:23:08 time: 0.0154 data_time: 0.0003 memory: 214 loss: 2.0963
02/20 14:43:14 - mmengine - INFO - Epoch(train) [1][ 300/3125] lr: 1.0000e-01 eta: 3:46:27 time: 0.0146 data_time: 0.0003 memory: 214 loss: 1.9858
```
Stop the training, then modify the `dataset` field in [configs/_base_/datasets/cifar10_bs16.py](https://github.com/open-mmlab/mmclassification/blob/dev-1.x/configs/_base_/datasets/cifar10_bs16.py) and set `indices=5000`.
```python
train_dataloader = dict(
    batch_size=16,
    num_workers=2,
    dataset=dict(
        type=dataset_type,
        data_prefix='data/cifar10',
        test_mode=False,
        indices=5000,  # Set indices=5000 so that each epoch only iterates over 5000 samples
        pipeline=train_pipeline),
    sampler=dict(type='DefaultSampler', shuffle=True),
)
```
Restart the training:
```bash
python tools/train.py configs/resnet/resnet18_8xb16_cifar10.py
```
As you can see, the number of iterations becomes `313` (5000 samples with a batch size of 16, rounded up), so an epoch now finishes much faster than before.
```
02/20 14:44:58 - mmengine - INFO - Epoch(train) [1][100/313] lr: 1.0000e-01 eta: 0:31:09 time: 0.0154 data_time: 0.0004 memory: 214 loss: 2.1852
02/20 14:44:59 - mmengine - INFO - Epoch(train) [1][200/313] lr: 1.0000e-01 eta: 0:23:18 time: 0.0143 data_time: 0.0002 memory: 214 loss: 2.0424
02/20 14:45:01 - mmengine - INFO - Epoch(train) [1][300/313] lr: 1.0000e-01 eta: 0:20:39 time: 0.0143 data_time: 0.0003 memory: 214 loss: 1.814
```

docs/zh_cn/index.rst

@@ -24,6 +24,7 @@
common_usage/speed_up_training.md
common_usage/save_gpu_memory.md
common_usage/set_random_seed.md
common_usage/debug_tricks.md
common_usage/model_analysis.md
common_usage/set_interval.md
common_usage/epoch_to_iter.md

mmengine/dataset/base_dataset.py

@@ -714,9 +714,9 @@ class BaseDataset(Dataset):
 Args:
     indices (int or Sequence[int]): If type of indices is int,
         indices represents the first or last few data of data
-        information. If indices of indices is Sequence, indices
-        represents the target data information index which consist
-        of subset data information.
+        information. If type of indices is Sequence, indices represents
+        the target data information index which consist of subset data
+        information.
 Returns:
     Tuple[np.ndarray, np.ndarray]: subset of data information.
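The same `indices` semantics is also exposed through the public `BaseDataset.get_subset` / `get_subset_` interfaces. Below is a minimal sketch, again using a made-up in-memory dataset, of what the docstring above describes:

```python
from mmengine.dataset import BaseDataset


class ToyDataset(BaseDataset):
    def load_data_list(self):
        # Fabricate 10 samples instead of parsing ``ann_file``.
        return [dict(img_path=f'{i}.jpg') for i in range(10)]


dataset = ToyDataset(ann_file='', pipeline=[])
print(len(dataset.get_subset(3)))          # 3 -> the first 3 samples
print(len(dataset.get_subset(-3)))         # 3 -> the last 3 samples
print(len(dataset.get_subset([0, 4, 9])))  # 3 -> exactly these samples
```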