# Customize Runtime Settings
The runtime configurations include many helpful functionalities, like checkpoint saving, logger configuration,
etc. In this tutorial, we will introduce how to configure these functionalities.
<!-- TODO: Link to MMEngine docs instead of API reference after the MMEngine English docs is done. -->
## Save Checkpoint
The checkpoint saving functionality is a default hook during training, and you can configure it in the
`default_hooks.checkpoint` field.
```{note}
The hook mechanism is widely used in all OpenMMLab libraries. Through hooks, you can plug in many
functionalities without modifying the main execution logic of the runner.
A detailed introduction of hooks can be found in {external+mmengine:doc}`Hooks <tutorials/hook>`.
```
**Default settings:**
```python
default_hooks = dict(
    ...
    checkpoint=dict(type='CheckpointHook', interval=1),
    ...
)
```
Here are some common arguments, and all available arguments can be found in the [CheckpointHook](mmengine.hooks.CheckpointHook).
- **`interval`** (int): The saving period. If set to -1, it will never save checkpoints.
- **`by_epoch`** (bool): Whether the **`interval`** is by epoch or by iteration. Defaults to `True`.
- **`out_dir`** (str): The root directory to save checkpoints. If not specified, the checkpoints will be saved in the work directory. If specified, the checkpoints will be saved in a sub-folder of the **`out_dir`**.
- **`max_keep_ckpts`** (int): The maximum checkpoints to keep. In some cases, we want only the latest few checkpoints and would like to delete old ones to save disk space. Defaults to -1, which means unlimited.
- **`save_best`** (str, List[str]): If specified, it will save the checkpoint with the best evaluation result.
Usually, you can simply use `save_best="auto"` to automatically select the evaluation metric. If you
want more advanced configuration, please refer to the [CheckpointHook docs](mmengine.hooks.CheckpointHook).
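For instance, here is a sketch combining several of the arguments above; the `out_dir` path is a hypothetical example:
```python
default_hooks = dict(
    # Save every epoch, keep only the latest 3 checkpoints, and also keep the
    # best checkpoint according to the auto-selected evaluation metric.
    checkpoint=dict(
        type='CheckpointHook',
        interval=1,
        by_epoch=True,
        max_keep_ckpts=3,
        save_best='auto',
        out_dir='/data/checkpoints',  # hypothetical root directory
    ),
)
```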
## Load Checkpoint / Resume Training
In config files, you can specify the loading and resuming functionality as below:
```python
# load from which checkpoint
load_from = "Your checkpoint path"
# whether to resume training from the loaded checkpoint
resume = False
```
The `load_from` field can be either a local path or an HTTP path, and you can resume training from the checkpoint by
specifying `resume=True`.
```{tip}
You can also enable auto resuming from the latest checkpoint by specifying `load_from=None` and `resume=True`.
The runner will automatically find the latest checkpoint in the work directory.
```
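In config form, the auto-resume pattern from the tip above is a minimal sketch like:
```python
# With load_from=None and resume=True, the runner searches the work directory
# for the latest checkpoint and resumes from it if one exists.
load_from = None
resume = True
```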
If you are training models with our `tools/train.py` script, you can also use the `--resume` argument to resume
training without modifying the config file manually.
```bash
# Automatically resume from the latest checkpoint.
python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume
# Resume from the specified checkpoint.
python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume checkpoints/resnet.pth
```
## Randomness Configuration
In the `randomness` field, we provide some options to make the experiment as reproducible as possible.
By default, we don't specify a seed in the config file, so every experiment generates its own random seed.
**Default settings:**
```python
randomness = dict(seed=None, deterministic=False)
```
To make the experiment more reproducible, you can specify a seed and set `deterministic=True`. The influence
of the `deterministic` option can be found [here](https://pytorch.org/docs/stable/notes/randomness.html#cuda-convolution-benchmarking).
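For example, a minimal sketch of a reproducible setup (the seed value `42` is an arbitrary choice):
```python
# Fix the random seed and use deterministic algorithms; this improves
# reproducibility at some cost in training speed.
randomness = dict(seed=42, deterministic=True)
```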
## Log Configuration
The log configuration relates to multiple fields.
In the `log_level` field, you can specify the global logging level. See {external+python:ref}`Logging Levels<levels>` for a list of levels.
```python
log_level = 'INFO'
```
In the `default_hooks.logger` field, you can specify the logging interval during training and testing. All
available arguments can be found in the [LoggerHook docs](mmengine.hooks.LoggerHook).
```python
default_hooks = dict(
    ...
    # print log every 100 iterations.
    logger=dict(type='LoggerHook', interval=100),
    ...
)
```
In the `log_processor` field, you can specify how to smooth the log values. Usually, we use a window of length 10
to smooth the log and output the mean value of each item. If you want to configure the smoothing of
particular items more finely, see the [LogProcessor docs](mmengine.runner.LogProcessor).
```python
# The default setting, which will smooth the values in training log by a 10-length window.
log_processor = dict(window_size=10)
```
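As a sketch of such finer control, `LogProcessor` also accepts a `custom_cfg` list that overrides the smoothing of individual items; the item name `loss_large_window` and the 100-iteration window below are illustrative choices:
```python
log_processor = dict(
    window_size=10,
    custom_cfg=[
        # Additionally report the loss averaged over a 100-iteration window,
        # logged under a separate name.
        dict(data_src='loss',
             log_name='loss_large_window',
             method_name='mean',
             window_size=100),
    ],
)
```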
In the `visualizer` field, you can specify multiple backends to save the log information, such as TensorBoard
and WandB. More details can be found in the [Visualizer section](#visualizer).
## Custom Hooks
Many of the above functionalities are implemented by hooks, and you can also plug in other custom hooks through the
`custom_hooks` field. Here are some hooks in MMEngine and MMPretrain that you can use directly:
- [EMAHook](mmpretrain.engine.hooks.EMAHook)
- [SyncBuffersHook](mmengine.hooks.SyncBuffersHook)
- [EmptyCacheHook](mmengine.hooks.EmptyCacheHook)
- [ClassNumCheckHook](mmpretrain.engine.hooks.ClassNumCheckHook)
- ......
For example, EMA (Exponential Moving Average) is widely used in model training, and you can enable it as
below:
```python
custom_hooks = [
    dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL'),
]
```
## Visualize Validation
The validation visualization functionality is a default hook during validation, and you can configure it in the
`default_hooks.visualization` field.
By default, it is disabled, and you can enable it by specifying `enable=True`. More arguments can be found in
the [VisualizationHook docs](mmpretrain.engine.hooks.VisualizationHook).
```python
default_hooks = dict(
    ...
    visualization=dict(type='VisualizationHook', enable=False),
    ...
)
```
This hook will select some images from the validation dataset and tag the prediction results on them
during every validation process. You can use it to watch how the model's predictions on actual images change
during training.
In addition, if the images in your validation dataset are small (\<100 pixels), you can rescale them before
visualization by specifying `rescale_factor=2.` or higher.
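For example, here is a sketch that enables the hook and upscales small validation images before tagging the predictions:
```python
default_hooks = dict(
    # Enable validation visualization and rescale images by 2x before
    # drawing the prediction results on them.
    visualization=dict(type='VisualizationHook', enable=True, rescale_factor=2.),
)
```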
## Visualizer
The visualizer is used to record all kinds of information during training and test, including logs, images and
scalars. By default, the recorded information will be saved at the `vis_data` folder under the work directory.
**Default settings:**
```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
    ],
)
```
Usually, the most useful feature is saving the logs and scalars like `loss` to different backends.
For example, to save them to TensorBoard, simply set the config as below:
```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend'),
    ],
)
```
Or save them to WandB as below:
```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='WandbVisBackend'),
    ],
)
```
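Since `vis_backends` is a list, you can also record to several backends at once, as in this sketch:
```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        # Keep local records and mirror them to TensorBoard and WandB.
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend'),
        dict(type='WandbVisBackend'),
    ],
)
```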
## Environment Configuration
In the `env_cfg` field, you can configure some low-level parameters, like cuDNN, multi-process, and distributed
communication.
**Please make sure you understand the meaning of these parameters before modifying them.**
```python
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi-process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)
```
## FAQ
1. **What's the relationship between `load_from` and `init_cfg`?**

   - `load_from`: If `resume=False`, only loads the model weights, which is mainly used to load trained models.
     If `resume=True`, it loads the model weights, optimizer state, and all other training information, which is
     mainly used to resume interrupted training.
   - `init_cfg`: You can also specify `init_cfg=dict(type="Pretrained", checkpoint=xxx)` to load a checkpoint
     during model weight initialization. That is, the weights are loaded only once, at the beginning of
     training. It's mainly used to fine-tune a pre-trained model, and you can set it in the backbone config
     and use the `prefix` field to load only the backbone weights, for example:
     ```python
     model = dict(
         backbone=dict(
             type='ResNet',
             depth=50,
             init_cfg=dict(type='Pretrained', checkpoint=xxx, prefix='backbone'),
         ),
         ...
     )
     ```
   See the [Fine-tune Models](../user_guides/finetune.md) for more details about fine-tuning.
2. **What's the difference between `default_hooks` and `custom_hooks`?**

   Almost none. Usually, the `default_hooks` field is used to specify hooks that are used in almost all
   experiments, while the `custom_hooks` field is used in only some experiments.

   Another difference is that `default_hooks` is a dict while `custom_hooks` is a list; please don't confuse
   them.
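   Here is a minimal sketch contrasting the two forms, reusing hooks shown earlier in this tutorial:
   ```python
   # default_hooks is a dict keyed by the hook's role ...
   default_hooks = dict(
       logger=dict(type='LoggerHook', interval=100),
   )
   # ... while custom_hooks is a plain list of hook configs.
   custom_hooks = [
       dict(type='EMAHook', momentum=4e-5),
   ]
   ```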
3. **During training, I got no training log. What's the reason?**

   If your training dataset is small while the batch size is large, our default log interval may be too large
   to record any training log. You can shrink the log interval and try again, like:
   ```python
   default_hooks = dict(
       ...
       logger=dict(type='LoggerHook', interval=10),
       ...
   )
   ```