# Customize Runtime Settings
The runtime configurations include many helpful functionalities, such as checkpoint saving and logger configuration. In this tutorial, we will introduce how to configure these functionalities.
## Save Checkpoint
The checkpoint saving functionality is a default hook during training, and you can configure it in the `default_hooks.checkpoint` field.
The hook mechanism is widely used in all OpenMMLab libraries. Through hooks, you can plug in many
functionalities without modifying the main execution logic of the runner.
A detailed introduction of hooks can be found in {external+mmengine:doc}`Hooks <tutorials/hook>`.
The default settings:

```python
default_hooks = dict(
    ...
    checkpoint=dict(type='CheckpointHook', interval=1),
    ...
)
```
Here are some commonly used arguments; all available arguments can be found in the CheckpointHook docs.
- `interval` (int): The saving period. If set to -1, it will never save checkpoints.
- `by_epoch` (bool): Whether the `interval` is counted by epoch or by iteration. Defaults to `True`.
- `out_dir` (str): The root directory to save checkpoints. If not specified, the checkpoints will be saved in the work directory. If specified, the checkpoints will be saved in a sub-folder of the `out_dir`.
- `max_keep_ckpts` (int): The maximum number of checkpoints to keep. In some cases, we want only the latest few checkpoints and would like to delete old ones to save disk space. Defaults to -1, which means unlimited.
- `save_best` (str, List[str]): If specified, it will save the checkpoint with the best evaluation result. Usually, you can simply use `save_best="auto"` to automatically select the evaluation metric. If you need more advanced configuration, please refer to the CheckpointHook docs.
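For example, to save a checkpoint every epoch, keep only the latest three checkpoints, and track the best one according to an automatically selected metric, you can combine the above arguments as below:

```python
default_hooks = dict(
    ...
    # Save every epoch, keep at most 3 checkpoints, and track the best one.
    checkpoint=dict(
        type='CheckpointHook',
        interval=1,
        by_epoch=True,
        max_keep_ckpts=3,
        save_best='auto',
    ),
    ...
)
```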
## Load Checkpoint / Resume Training
In config files, you can specify the loading and resuming functionality as below:
```python
# load from which checkpoint
load_from = "Your checkpoint path"

# whether to resume training from the loaded checkpoint
resume = False
```
The `load_from` field can be either a local path or an HTTP path. You can resume training from the checkpoint by specifying `resume=True`.
You can also enable auto-resuming from the latest checkpoint by specifying `load_from=None` and `resume=True`. The runner will find the latest checkpoint in the work directory automatically.
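For instance, a minimal sketch that loads pretrained weights over HTTP without resuming the training state (the URL here is a placeholder, not a real checkpoint):

```python
# Load only the model weights from a remote checkpoint (placeholder URL)
# and start a fresh training run.
load_from = 'https://example.com/checkpoints/resnet50.pth'
resume = False
```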
If you are training models with our `tools/train.py` script, you can also use the `--resume` argument to resume training without modifying the config file manually.
```shell
# Automatically resume from the latest checkpoint.
python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume

# Resume from the specified checkpoint.
python tools/train.py configs/resnet/resnet50_8xb32_in1k.py --resume checkpoints/resnet.pth
```
## Randomness Configuration
In the `randomness` field, we provide some options to make the experiment as reproducible as possible. By default, we won't specify a seed in the config file, and in every experiment, the program will generate a random seed.
Default settings:

```python
randomness = dict(seed=None, deterministic=False)
```
To make the experiment more reproducible, you can specify a seed and set `deterministic=True`. The influence of the `deterministic` option is described in the PyTorch notes on reproducibility.
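For example, a fully reproducible (but usually slower) configuration could look like the following; the seed value itself is arbitrary:

```python
# Fix the random seed and enforce deterministic cuDNN behavior.
# Deterministic mode usually slows down training.
randomness = dict(seed=0, deterministic=True)
```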
## Log Configuration
The log configuration relates to multiple fields.
In the `log_level` field, you can specify the global logging level. See {external+python:ref}`Logging Levels<levels>` for a list of levels.

```python
log_level = 'INFO'
```
In the `default_hooks.logger` field, you can specify the logging interval during training and testing. All available arguments can be found in the LoggerHook docs.
```python
default_hooks = dict(
    ...
    # print log every 100 iterations.
    logger=dict(type='LoggerHook', interval=100),
    ...
)
```
In the `log_processor` field, you can specify the log smoothing method. Usually, we use a window with a length of 10 to smooth the log and output the mean value of all information. If you want to finely configure the smoothing method of specific information, see the LogProcessor docs.
```python
# The default setting, which will smooth the values in the training log by a 10-length window.
log_processor = dict(window_size=10)
```
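As a sketch of finer-grained smoothing (assuming the `custom_cfg` argument of MMEngine's `LogProcessor`), you could additionally report the loss averaged over the whole training run:

```python
log_processor = dict(
    window_size=10,
    # Assumption: `custom_cfg` entries override the statistics method for
    # individual log items. Here, also record the global mean of `loss`
    # under the name `loss_global`.
    custom_cfg=[
        dict(
            data_src='loss',
            log_name='loss_global',
            method_name='mean',
            window_size='global',
        ),
    ],
)
```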
In the `visualizer` field, you can specify multiple backends to save the log information, such as TensorBoard and WandB. More details can be found in the Visualizer section.
## Custom Hooks
Many of the above functionalities are implemented by hooks, and you can also plug in other custom hooks by modifying the `custom_hooks` field. MMEngine and MMClassification provide a series of hooks that you can use directly.
For example, EMA (Exponential Moving Average) is widely used in model training, and you can enable it as below:

```python
custom_hooks = [
    dict(type='EMAHook', momentum=4e-5, priority='ABOVE_NORMAL'),
]
```
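If the existing hooks don't cover your needs, you can also write your own. Here is a minimal sketch, assuming you register it in MMClassification's registry module (`mmcls.registry`); the hook name and its behavior are hypothetical:

```python
from mmengine.hooks import Hook

from mmcls.registry import HOOKS


@HOOKS.register_module()
class EpochInfoHook(Hook):
    """A hypothetical hook that logs a message after every training epoch."""

    def after_train_epoch(self, runner):
        runner.logger.info(f'Epoch {runner.epoch} finished.')
```

Then enable it in the config like any other hook:

```python
custom_hooks = [
    dict(type='EpochInfoHook'),
]
```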
## Visualize Validation
The validation visualization functionality is a default hook during validation, and you can configure it in the `default_hooks.visualization` field.
By default, it is disabled, and you can enable it by specifying `enable=True`. More arguments can be found in the VisualizationHook docs.
```python
default_hooks = dict(
    ...
    visualization=dict(type='VisualizationHook', enable=False),
    ...
)
```
This hook will select some images in the validation dataset and tag the prediction results on these images during every validation process. You can use it to observe how the model performance on actual images varies during training.
In addition, if the images in your validation dataset are small (e.g., smaller than 100 pixels), you can rescale them before visualization by specifying `rescale_factor=2.` or higher.
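For example, to enable the hook and upscale small validation images before the prediction results are drawn on them:

```python
default_hooks = dict(
    ...
    # Enable validation visualization and enlarge images 2x before tagging
    # the prediction results on them.
    visualization=dict(type='VisualizationHook', enable=True, rescale_factor=2.),
    ...
)
```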
## Visualizer
The visualizer is used to record all kinds of information during training and test, including logs, images, and scalars. By default, the recorded information will be saved in the `vis_data` folder under the work directory.
Default settings:

```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
    ]
)
```
Usually, the most useful function is to save the log and scalars like `loss` to different backends. For example, to save them to TensorBoard, simply set it as below:
```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='TensorboardVisBackend'),
    ]
)
```
Or save them to WandB as below:
```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        dict(type='WandbVisBackend'),
    ]
)
```
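Backend-specific options can also be passed through the config. Here is a sketch assuming the `init_kwargs` argument of `WandbVisBackend`, which is forwarded to `wandb.init`; the project name is a placeholder:

```python
visualizer = dict(
    type='UniversalVisualizer',
    vis_backends=[
        dict(type='LocalVisBackend'),
        # `init_kwargs` is forwarded to `wandb.init`; the project name is a
        # placeholder.
        dict(type='WandbVisBackend', init_kwargs=dict(project='my-project')),
    ]
)
```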
## Environment Configuration
In the `env_cfg` field, you can configure some low-level parameters, like cuDNN, multi-process, and distributed communication. Please make sure you understand the meaning of these parameters before modifying them.
```python
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,

    # set multi-process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),

    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)
```
## FAQ
- What's the relationship between `load_from` and `init_cfg`?

  - `load_from`: If `resume=False`, it only imports model weights, which is mainly used to load trained models; if `resume=True`, it loads all of the model weights, the optimizer state, and other training information, which is mainly used to resume interrupted training.

  - `init_cfg`: You can also specify `init_cfg=dict(type="Pretrained", checkpoint=xxx)` to load a checkpoint; it means loading the weights during model weight initialization. That is, it will only be done at the beginning of training. It's mainly used to fine-tune a pre-trained model, and you can set it in the backbone config and use the `prefix` field to only load backbone weights, for example:

    ```python
    model = dict(
        backbone=dict(
            type='ResNet',
            depth=50,
            init_cfg=dict(type='Pretrained', checkpoint=xxx, prefix='backbone'),
        ),
        ...
    )
    ```

    See the Fine-tune Models tutorial for more details about fine-tuning.
- What's the difference between `default_hooks` and `custom_hooks`?

  Almost no difference. Usually, the `default_hooks` field is used to specify the hooks that will be used in almost all experiments, and the `custom_hooks` field is used in only some experiments.

  Another difference is that `default_hooks` is a dict while `custom_hooks` is a list; please don't confuse them.
- During training, I got no training log. What's the reason?

  If your training dataset is small while the batch size is large, our default log interval may be too large to record your training log. You can shrink the log interval and try again, like:

  ```python
  default_hooks = dict(
      ...
      logger=dict(type='LoggerHook', interval=10),
      ...
  )
  ```