# Logging
[Runner](../tutorials/runner.md) produces a lot of logs during training, such as the loss, iteration time, and learning rate. MMEngine implements a flexible logging system that allows us to choose different statistical methods for the logs when configuring the runner, and to set or get recorded logs at any location in the code.
## Flexible Logging System
The logging system is configured by passing a [LogProcessor](mmengine.logging.LogProcessor) to the runner. If no log processor is passed, the runner will use the default one, which is equivalent to:
```python
log_processor = dict(window_size=10, by_epoch=True, custom_cfg=None, num_digits=4)
```
The following example shows the format of the output log:
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from mmengine.runner import Runner
from mmengine.model import BaseModel

train_dataset = [(torch.ones(1, 1), torch.ones(1, 1))] * 50
train_dataloader = DataLoader(train_dataset, batch_size=2)


class ToyModel(BaseModel):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, img, label, mode):
        feat = self.linear(img)
        loss1 = (feat - label).pow(2)
        loss2 = (feat - label).abs()
        return dict(loss1=loss1, loss2=loss2)


runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01))
)
runner.train()
```
```
08/21 02:58:41 - mmengine - INFO - Epoch(train) [1][10/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0019 data_time: 0.0004 loss1: 0.8381 loss2: 0.9007 loss: 1.7388
08/21 02:58:41 - mmengine - INFO - Epoch(train) [1][20/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0029 data_time: 0.0010 loss1: 0.1978 loss2: 0.4312 loss: 0.6290
```
LogProcessor outputs the log in the following format:
- The prefix of the log:
  - epoch mode (`by_epoch=True`): `Epoch(train) [{current_epoch}][{current_iteration}/{dataloader_length}]`
  - iteration mode (`by_epoch=False`): `Iter(train) [{current_iteration}/{max_iteration}]`
- Learning rate (`lr`): The learning rate of the last iteration.
- Time:
  - `time`: The averaged inference time of the last `window_size` iterations.
  - `data_time`: The averaged data-loading time of the last `window_size` iterations.
  - `eta`: The estimated time required to finish training.
- Loss: The averaged loss output by the model over the last `window_size` iterations.
```{note}
`window_size=10` by default.
The number of significant digits (`num_digits`) in the log is 4 by default.
By default, only the latest value of each custom log is output.
```
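These defaults can be overridden through the log processor config. A minimal sketch, reusing the `ToyModel` and `train_dataloader` defined above, that widens the smoothing window and prints more significant digits:
```python
runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    # average over the last 20 iterations and keep 6 significant digits
    log_processor=dict(window_size=20, num_digits=6))
runner.train()
```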
```{warning}
log_processor outputs the epoch-based log by default (`by_epoch=True`). To get a log consistent with `train_cfg`, the same `by_epoch` value should be set in both `train_cfg` and `log_processor`.
```
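For example, a sketch of iteration-based training with a matching log processor, assuming the `ToyModel` and `train_dataloader` above:
```python
runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    # iteration-based training uses max_iters instead of max_epochs
    train_cfg=dict(by_epoch=False, max_iters=25),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    # keep by_epoch consistent with train_cfg
    log_processor=dict(by_epoch=False))
runner.train()
```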
Based on the rules above, the first example in this section outputs the average values of `loss1` and `loss2` over every 10 iterations. If we want the global average value of `loss1` instead, we can set `custom_cfg` like this:
```python
runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    log_processor=dict(
        custom_cfg=[
            dict(data_src='loss1',  # original log name, `loss1` here
                 method_name='mean',  # statistical method, `mean` here
                 window_size='global')])  # window size, `global` here
)
runner.train()
```
```
08/21 02:58:49 - mmengine - INFO - Epoch(train) [1][10/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0026 data_time: 0.0007 loss1: 0.7381 loss2: 0.8446 loss: 1.5827
08/21 02:58:49 - mmengine - INFO - Epoch(train) [1][20/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0030 data_time: 0.0012 loss1: 0.4521 loss2: 0.3939 loss: 0.5600
```
`data_src` is the original name of the log, `method_name` is the statistical method, and `window_size` is the window size of the statistical method. Since we want the global average value of `loss1`, we set `window_size` to `global`.
Currently, MMEngine supports the following statistical methods:
<table class="docutils">
<thead>
<tr>
<th>statistical method</th>
<th>arguments</th>
<th>function</th>
</tr>
</thead>
<tbody>
<tr>
<td>mean</td>
<td>window_size</td>
<td>compute the mean of the log over the last `window_size` iterations</td>
</tr>
<tr>
<td>min</td>
<td>window_size</td>
<td>compute the minimum of the log over the last `window_size` iterations</td>
</tr>
<tr>
<td>max</td>
<td>window_size</td>
<td>compute the maximum of the log over the last `window_size` iterations</td>
</tr>
<tr>
<td>current</td>
<td>/</td>
<td>output the latest value of the log</td>
</tr>
</tbody>
</table>
`window_size` mentioned above can be:
- an int: the window size of the statistical method.
- `global`: equivalent to `window_size=cur_iteration`.
- `epoch`: equivalent to `window_size=len(dataloader)`.
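For example, a minimal sketch, reusing the `ToyModel` and `train_dataloader` above, that averages `loss1` over all iterations of the current epoch:
```python
runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    log_processor=dict(custom_cfg=[
        # 'epoch' is equivalent to window_size=len(dataloader)
        dict(data_src='loss1', method_name='mean', window_size='epoch')]))
runner.train()
```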
If we want to compute both the average value of `loss1` over the last 10 iterations and its global average value, we additionally need to set `log_name`:
```python
runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    log_processor=dict(
        custom_cfg=[
            # `log_name` gives loss1 a second, separately smoothed entry
            dict(data_src='loss1', log_name='loss1_global', method_name='mean', window_size='global')])
)
runner.train()
```
```
08/21 18:39:32 - mmengine - INFO - Epoch(train) [1][10/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0016 data_time: 0.0004 loss1: 0.1512 loss2: 0.3751 loss: 0.5264 loss1_global: 0.1512
08/21 18:39:32 - mmengine - INFO - Epoch(train) [1][20/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0051 data_time: 0.0036 loss1: 0.0113 loss2: 0.0856 loss: 0.0970 loss1_global: 0.0813
```
Similarly, we can also record the local and global maximum values of `loss1` at the same time.
```python
runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    log_processor=dict(custom_cfg=[
        # record the local maximum of loss1 over the last 10 iterations
        dict(data_src='loss1',
             log_name='loss1_local_max',
             window_size=10,
             method_name='max'),
        # record the global maximum of loss1
        dict(
            data_src='loss1',
            log_name='loss1_global_max',
            method_name='max',
            window_size='global')
    ]))
runner.train()
```
```
08/21 03:17:26 - mmengine - INFO - Epoch(train) [1][10/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0021 data_time: 0.0006 loss1: 1.8495 loss2: 1.3427 loss: 3.1922 loss1_local_max: 2.8872 loss1_global_max: 2.8872
08/21 03:17:26 - mmengine - INFO - Epoch(train) [1][20/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0024 data_time: 0.0010 loss1: 0.5464 loss2: 0.7251 loss: 1.2715 loss1_local_max: 2.8872 loss1_global_max: 2.8872
```
More examples can be found in [log_processor](mmengine.logging.LogProcessor).
## Customize log
The logging system can log not only the `loss`, `lr`, etc., but also collect and output custom logs. For example, if we want to log an intermediate `loss`:
```python
from mmengine.logging import MessageHub


class ToyModel(BaseModel):
    def __init__(self) -> None:
        super().__init__()
        self.linear = nn.Linear(1, 1)

    def forward(self, img, label, mode):
        feat = self.linear(img)
        loss_tmp = (feat - label).abs()
        loss = loss_tmp.pow(2)
        message_hub = MessageHub.get_current_instance()
        # update the intermediate `loss_tmp` in the message hub
        message_hub.update_scalar('train/loss_tmp', loss_tmp.sum())
        return dict(loss=loss)


runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)),
    log_processor=dict(
        custom_cfg=[
            # compute the mean of loss_tmp over the last 10 iterations
            dict(
                data_src='loss_tmp',
                window_size=10,
                method_name='mean')
        ]
    )
)
runner.train()
```
```
08/21 03:40:31 - mmengine - INFO - Epoch(train) [1][10/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0026 data_time: 0.0008 loss_tmp: 0.0097 loss: 0.0000
08/21 03:40:31 - mmengine - INFO - Epoch(train) [1][20/25] lr: 1.0000e-02 eta: 0:00:00 time: 0.0028 data_time: 0.0013 loss_tmp: 0.0065 loss: 0.0000
```
The custom log is recorded by updating the [message hub](mmengine.logging.MessageHub):
1. Call `MessageHub.get_current_instance()` to get the message hub of the runner.
2. Call `MessageHub.update_scalar` to update the custom log. The first argument is the log name with a mode prefix (`train/`, `val/`, or `test/`); the output log only retains the name without the prefix.
3. Configure the statistical method of `loss_tmp` in `log_processor`. If it is not configured, only the latest value of `loss_tmp` will be logged.
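The recorded scalars can also be read back anywhere in the code. A minimal sketch, assuming the `train/loss_tmp` scalar updated above; `get_scalar` returns a `HistoryBuffer` that supports the statistical methods listed earlier:
```python
from mmengine.logging import MessageHub

message_hub = MessageHub.get_current_instance()
# the HistoryBuffer keeps the whole history of `train/loss_tmp`
loss_buffer = message_hub.get_scalar('train/loss_tmp')
print(loss_buffer.current())  # latest recorded value
print(loss_buffer.mean(10))   # mean over the last 10 records
print(loss_buffer.max())      # global maximum
```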
## Export the debug log
Set `log_level='DEBUG'` for the runner, and the debug log will be exported to the `work_dir`:
```python
runner = Runner(
    model=ToyModel(),
    work_dir='tmp_dir',
    train_dataloader=train_dataloader,
    log_level='DEBUG',
    train_cfg=dict(by_epoch=True, max_epochs=1),
    optim_wrapper=dict(optimizer=dict(type='SGD', lr=0.01)))
runner.train()
```
```
08/21 18:16:22 - mmengine - DEBUG - Get class `LocalVisBackend` from "vis_backend" registry in "mmengine"
08/21 18:16:22 - mmengine - DEBUG - An `LocalVisBackend` instance is built from registry, its implementation can be found in mmengine.visualization.vis_backend
08/21 18:16:22 - mmengine - DEBUG - Get class `RuntimeInfoHook` from "hook" registry in "mmengine"
08/21 18:16:22 - mmengine - DEBUG - An `RuntimeInfoHook` instance is built from registry, its implementation can be found in mmengine.hooks.runtime_info_hook
08/21 18:16:22 - mmengine - DEBUG - Get class `IterTimerHook` from "hook" registry in "mmengine"
...
```
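Messages emitted through MMEngine's logger are also written to this file. A minimal sketch, assuming the runner above has been built (`MMLogger.get_current_instance` returns the logger created by the runner):
```python
from mmengine.logging import MMLogger

logger = MMLogger.get_current_instance()
# recorded in the exported log file because log_level is DEBUG
logger.debug('custom debug message')
```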
Besides, logs of different ranks will be saved in `debug` mode if you are training your model with shared storage. The hierarchy of the logs is as follows:
```text
./tmp
├── tmp.log
├── tmp_rank1.log
├── tmp_rank2.log
├── tmp_rank3.log
├── tmp_rank4.log
├── tmp_rank5.log
├── tmp_rank6.log
├── tmp_rank7.log
...
└── tmp_rank63.log
```
The logs of multiple machines with independent storage look like:
```text
# device: 0
work_dir/
└── exp_name_logs
├── exp_name.log
├── exp_name_rank1.log
├── exp_name_rank2.log
├── exp_name_rank3.log
...
└── exp_name_rank7.log
# device: 7
work_dir/
└── exp_name_logs
├── exp_name_rank56.log
├── exp_name_rank57.log
├── exp_name_rank58.log
...
└── exp_name_rank63.log
```