# Customize Runtime Settings
The runtime configurations include many helpful functionalities, such as checkpoint saving and logger configuration. In this tutorial, we will introduce how to configure these functionalities.
## Save Checkpoint
The checkpoint saving functionality is a default hook during training, and you can configure it in the `default_hooks.checkpoint` field.
The hook mechanism is widely used in all OpenMMLab libraries. Through hooks, you can plug in many
functionalities without modifying the main execution logic of the runner.
A detailed introduction of hooks can be found in {external+mmengine:doc}`Hooks <tutorials/hook>`.
The default settings:

```python
default_hooks = dict(
    ...
    checkpoint=dict(type='CheckpointHook', interval=1),
    ...
)
```
Here are some commonly used arguments; all available arguments can be found in the CheckpointHook docs.
- `interval` (int): The saving period. If set to -1, it will never save checkpoints.
- `by_epoch` (bool): Whether the `interval` is counted by epoch or by iteration. Defaults to `True`.
- `out_dir` (str): The root directory to save checkpoints. If not specified, the checkpoints will be saved in the work directory. If specified, the checkpoints will be saved in a sub-folder of the `out_dir`.
- `max_keep_ckpts` (int): The maximum number of checkpoints to keep. In some cases, we want only the latest few checkpoints and would like to delete old ones to save disk space. Defaults to -1, which means unlimited.
- `save_best` (str, List[str]): If specified, it will save the checkpoint with the best evaluation result. Usually, you can simply use `save_best="auto"` to automatically select the evaluation metric. If you need more advanced configuration, please refer to the CheckpointHook docs.
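For example, to save a checkpoint every epoch, keep only the latest three checkpoints, and track the best one according to an automatically selected metric, you can combine the above arguments as below:

```python
default_hooks = dict(
    ...
    # Save every epoch, keep at most 3 checkpoints, and track the best one.
    checkpoint=dict(
        type='CheckpointHook',
        interval=1,
        by_epoch=True,
        max_keep_ckpts=3,
        save_best='auto',
    ),
    ...
)
```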
## Load Checkpoint / Resume Training
In config files, you can specify the loading and resuming functionality as below:
```python
# load from which checkpoint
load_from = "Your checkpoint path"

# whether to resume training from the loaded checkpoint
resume = False
```
The `load_from` field can be either a local path or an HTTP path. You can resume training from the checkpoint by specifying `resume=True`.
You can also enable auto-resuming from the latest checkpoint by specifying `load_from=None` and `resume=True`. The runner will find the latest checkpoint in the work directory automatically.
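For instance, a minimal sketch that loads pretrained weights over HTTP without resuming the training state (the URL here is a placeholder, not a real checkpoint):

```python
# Load only the model weights from a remote checkpoint (placeholder URL)
# and start a fresh training run.
load_from = 'https://example.com/checkpoints/resnet50.pth'
resume = False
```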
If you are training models with our `tools/train.py` script, you can also use the `--resume` argument to resume training without modifying the config file manually.
```shell
# Automatically resume from the latest checkpoint.
python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume

# Resume from the specified checkpoint.
python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume checkpoints/resnet.pth
```
## Randomness Configuration
In the `randomness` field, we provide some options to make the experiment as reproducible as possible. By default, we won't specify a seed in the config file, and in every experiment, the program will generate a random seed.
Default settings:

```python
randomness = dict(seed=None, deterministic=False)
```
To make the experiment more reproducible, you can specify a seed and set `deterministic=True`. The influence of the `deterministic` option is described in the PyTorch notes on reproducibility.
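For example, a fully reproducible (but usually slower) configuration could look like the following; the seed value itself is arbitrary:

```python
# Fix the random seed and enforce deterministic cuDNN behavior.
# Deterministic mode usually slows down training.
randomness = dict(seed=0, deterministic=True)
```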
## Log Configuration
The log configuration relates to multiple fields.
In the `log_level` field, you can specify the global logging level. See {external+python:ref}`Logging Levels<levels>` for a list of levels.

```python
log_level = 'INFO'
```
In the `default_hooks.logger` field, you can specify the logging interval during training and testing. All available arguments can be found in the LoggerHook docs.
```python
default_hooks = dict(
    ...
    # print log every 100 iterations.
    logger=dict(type='LoggerHook', interval=100),
    ...
)
```
In the `log_processor` field, you can specify the log smoothing method. Usually, we use a window with a length of 10 to smooth the log and output the mean value of all information. If you want to finely configure the smoothing method of specific information, see the LogProcessor docs.
```python
# The default setting, which will smooth the values in the training log by a 10-length window.
log_processor = dict(window_size=10)
```
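As a sketch of finer-grained smoothing (assuming the `custom_cfg` argument of MMEngine's `LogProcessor`), you could additionally report the loss averaged over the whole training run:

```python
log_processor = dict(
    window_size=10,
    # Assumption: `custom_cfg` entries override the statistics method for
    # individual log items. Here, also record the global mean of `loss`
    # under the name `loss_global`.
    custom_cfg=[
        dict(
            data_src='loss',
            log_name='loss_global',
            method_name='mean',
            window_size='global',
        ),
    ],
)
```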
In the `visualizer` field, you can specify multiple backends to save the log information, such as TensorBoard and WandB. More details can be found in the Visualizer section.
## Custom Hooks
Many of the above functionalities are implemented by hooks, and you can also plug in other custom hooks by modifying the `custom_hooks` field. MMEngine and MMClassification provide a series of hooks that you can use directly.
For example, EMA (Exponential Moving Average) is widely used in model training, and you can enable it as below:

```python
custom_hooks = [
    dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL'),
]
```
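If the existing hooks don't cover your needs, you can also write your own. Here is a minimal sketch, assuming you register it in MMClassification's registry module (`mmcls.registry`); the hook name and its behavior are hypothetical:

```python
from mmengine.hooks import Hook

from mmcls.registry import HOOKS


@HOOKS.register_module()
class EpochInfoHook(Hook):
    """A hypothetical hook that logs a message after every training epoch."""

    def after_train_epoch(self, runner):
        runner.logger.info(f'Epoch {runner.epoch} finished.')
```

Then enable it in the config like any other hook:

```python
custom_hooks = [
    dict(type='EpochInfoHook'),
]
```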
## Visualize Validation
The validation visualization functionality is a default hook during validation, and you can configure it in the `default_hooks.visualization` field.
By default, it is disabled, and you can enable it by specifying `enable=True`. More arguments can be found in the VisualizationHook docs.
```python
default_hooks = dict(
    ...
    visualization=dict(type='VisualizationHook', enable=False),
    ...
)
```
This hook will select some images in the validation dataset and tag the prediction results on these images during every validation process. You can use it to observe how the model performance on actual images varies during training.
In addition, if the images in your validation dataset are small (e.g., smaller than 100 pixels), you can rescale them before visualization by specifying `rescale_factor=2.` or higher.
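For example, to enable the hook and upscale small validation images before the prediction results are drawn on them:

```python
default_hooks = dict(
    ...
    # Enable validation visualization and enlarge images 2x before tagging
    # the prediction results on them.
    visualization=dict(type='VisualizationHook', enable=True, rescale_factor=2.),
    ...
)
```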
## Visualizer
The visualizer is used to record all kinds of information during training and test, including logs, images, and scalars. By default, the recorded information will be saved in the `vis_data` folder under the work directory.
Default settings:

```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
    ]
)
```
Usually, the most useful function is to save the log and scalars like `loss` to different backends. For example, to save them to TensorBoard, simply set it as below:
```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend'),
    ]
)
```
Or save them to WandB as below:
```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='WandbVisBackend'),
    ]
)
```
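Backend-specific options can also be passed through the config. Here is a sketch assuming the `init_kwargs` argument of `WandbVisBackend`, which is forwarded to `wandb.init`; the project name is a placeholder:

```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        # `init_kwargs` is forwarded to `wandb.init`; the project name is a
        # placeholder.
        dict(type='WandbVisBackend', init_kwargs=dict(project='my-project')),
    ]
)
```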
## Environment Configuration
In the `env_cfg` field, you can configure some low-level parameters, like cuDNN, multi-process, and distributed communication. Please make sure you understand the meaning of these parameters before modifying them.
```python
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,

    # set multi-process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),

    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)
```
## FAQ
- What's the relationship between `load_from` and `init_cfg`?

  - `load_from`: If `resume=False`, it only imports model weights, which is mainly used to load trained models; if `resume=True`, it loads all of the model weights, the optimizer state, and other training information, which is mainly used to resume interrupted training.

  - `init_cfg`: You can also specify `init_cfg=dict(type="Pretrained", checkpoint=xxx)` to load a checkpoint; it means loading the weights during model weight initialization. That is, it will only be done at the beginning of training. It's mainly used to fine-tune a pre-trained model, and you can set it in the backbone config and use the `prefix` field to only load backbone weights, for example:

    ```python
    model = dict(
        backbone=dict(
            type='ResNet',
            depth=50,
            init_cfg=dict(type='Pretrained', checkpoint=xxx, prefix='backbone'),
        ),
        ...
    )
    ```

    See the Fine-tune Models tutorial for more details about fine-tuning.
- What's the difference between `default_hooks` and `custom_hooks`?

  Almost no difference. Usually, the `default_hooks` field is used to specify the hooks that will be used in almost all experiments, and the `custom_hooks` field is used in only some experiments.

  Another difference is that `default_hooks` is a dict while `custom_hooks` is a list; please don't confuse them.
- During training, I got no training log. What's the reason?

  If your training dataset is small while the batch size is large, our default log interval may be too large to record your training log. You can shrink the log interval and try again, like:

  ```python
  default_hooks = dict(
      ...
      logger=dict(type='LoggerHook', interval=10),
      ...
  )
  ```