[Docs] Add 1x docs schedule. (#1015)

* [Docs] add schedule en docstring

* [Docs] add schedule cn docstring

* Improve schedule docs.

* refine according to comments

Co-authored-by: mzr1996 <mzr1996@163.com>
Hubert 2022-10-09 10:39:53 +08:00 committed by GitHub
parent bf9f3bbdda
commit dfb4e87123
2 changed files with 533 additions and 462 deletions

# Customize Training Schedule

In our codebase, [default training schedules](https://github.com/open-mmlab/mmclassification/blob/master/configs/_base_/schedules) are provided for common datasets such as CIFAR, ImageNet, etc. If we attempt to experiment on these datasets for higher accuracy or on new methods and datasets, we may need to modify the strategies.

In this tutorial, we will introduce how to modify configs to construct optimizers, use parameter-wise fine configuration, gradient clipping and gradient accumulation, as well as customize learning rate and momentum schedules. Furthermore, we introduce a template to customize self-implemented optimization methods for the project.
## Customize optimization

We use the `optim_wrapper` field to configure the optimization strategies, including the choice of optimizer, automatic mixed precision training, parameter-wise configurations, and gradient clipping and accumulation. Details are described below.
### Use optimizers supported by PyTorch
We support all the optimizers implemented by PyTorch. To use them, please change the `optimizer` field of config files.

For example, if you want to use [`SGD`](torch.optim.SGD), the modification in the config file could be as follows. Notice that all optimization-related settings should be wrapped inside `optim_wrapper`.
```python
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.0003, weight_decay=0.0001)
)
```
```{note}
The `type` field in the optimizer config is not a constructor but an optimizer name in PyTorch.
Refer to {external+torch:ref}`List of optimizers supported by PyTorch <optim:algorithms>` for more choices.
```
To modify the learning rate of the model, just modify the `lr` argument in the optimizer config.
You can also directly set other arguments according to the [API doc](torch.optim) of PyTorch.

For example, if you want to use [`Adam`](torch.optim.Adam) with settings like `torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)` in PyTorch, you could use the config below:
```python
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(
        type='Adam',
        lr=0.001,
        betas=(0.9, 0.999),
        eps=1e-08,
        weight_decay=0,
        amsgrad=False),
)
```
````{note}
The default type of the `optim_wrapper` field is [`OptimWrapper`](mmengine.optim.OptimWrapper); therefore, you can usually omit the type field, like:
```python
optim_wrapper = dict(
    optimizer=dict(
        type='Adam',
        lr=0.001,
        betas=(0.9, 0.999),
        eps=1e-08,
        weight_decay=0,
        amsgrad=False))
```
````
### Use AMP training

If we want to use automatic mixed precision training, we can simply change the type of `optim_wrapper` to `AmpOptimWrapper` in config files.
```python
optim_wrapper = dict(type='AmpOptimWrapper', optimizer=...)
```
Alternatively, for convenience, you can set the `--amp` parameter to turn on the AMP option directly in the `tools/train.py` script. Refer to the [Training and test](../user_guides/train_test.md) tutorial for details of starting a training.
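For instance, a complete AMP config could look like the sketch below. The optimizer settings are only placeholders, and `loss_scale='dynamic'` is assumed to match the default dynamic loss scaling of `AmpOptimWrapper`:

```python
optim_wrapper = dict(
    type='AmpOptimWrapper',
    # dynamic loss scaling during mixed precision training (assumed default)
    loss_scale='dynamic',
    optimizer=dict(type='SGD', lr=0.1, momentum=0.9, weight_decay=0.0001))
```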
### Parameter-wise finely configuration
Some models may have parameter-specific settings for optimization, for example, no weight decay for the BatchNorm layers or using different learning rates for different network layers.
To finely configure them, we can use the `paramwise_cfg` argument in `optim_wrapper`.
- **Set different hyper-parameter multipliers for different types of parameters.**
For instance, we can set `norm_decay_mult=0.` in `paramwise_cfg` to change the weight decay of weight and bias of normalization layers to zero.
```python
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.8, weight_decay=1e-4),
    paramwise_cfg=dict(norm_decay_mult=0.))
```
More types of parameters can be configured, listed as follows (a combined example is shown after the list):
- `lr_mult`: Multiplier for the learning rate of all parameters.
- `decay_mult`: Multiplier for the weight decay of all parameters.
- `bias_lr_mult`: Multiplier for the learning rate of bias (does not include the biases of normalization layers and the offsets of deformable convolution layers). Defaults to 1.
- `bias_decay_mult`: Multiplier for the weight decay of bias (does not include the biases of normalization layers and the offsets of deformable convolution layers). Defaults to 1.
- `norm_decay_mult`: Multiplier for the weight decay of the weight and bias of normalization layers. Defaults to 1.
- `dwconv_decay_mult`: Multiplier for the weight decay of depth-wise convolution layers. Defaults to 1.
- `bypass_duplicate`: Whether to bypass duplicated parameters. Defaults to `False`.
- `dcn_offset_lr_mult`: Multiplier for the learning rate of deformable convolution layers. Defaults to 1.
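As a combined sketch (the multiplier values below are illustrative only):

```python
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.8, weight_decay=1e-4),
    paramwise_cfg=dict(
        bias_lr_mult=2.,       # double the learning rate of all bias parameters
        norm_decay_mult=0.,    # no weight decay for normalization layers
        dwconv_decay_mult=0.,  # no weight decay for depth-wise convolution layers
        bypass_duplicate=True))
```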
- **Set different hyper-parameter multipliers for specific parameters.**

MMClassification can use `custom_keys` in `paramwise_cfg` to specify that different parameters use different learning rates or weight decay.

For example, to set all learning rates and weight decays of `backbone.layer0` to 0, keep the rest of `backbone` the same as the optimizer, and set the learning rate of `head` to 0.001, use the config below.
```python
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    paramwise_cfg=dict(
        custom_keys={
            'backbone.layer0': dict(lr_mult=0, decay_mult=0),
            'backbone': dict(lr_mult=1),
            'head': dict(lr_mult=0.1)
        }))
```
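Another common use, for transformer-based backbones, is to remove the weight decay of specific parameters such as the class token and position embedding; a sketch could be:

```python
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.8, weight_decay=1e-4),
    paramwise_cfg=dict(
        custom_keys={
            'backbone.cls_token': dict(decay_mult=0.0),
            'backbone.pos_embed': dict(decay_mult=0.0)
        }))
```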
### Gradient clipping

During the training process, the loss function may get close to a steep region and cause gradient explosion, and gradient clipping helps stabilize the training process. More introduction can be found in [this page](https://paperswithcode.com/method/gradient-clipping).

Currently we support the `clip_grad` option in `optim_wrapper` for gradient clipping; the arguments refer to the [PyTorch documentation](torch.nn.utils.clip_grad_norm_).

Here is an example:
Here is an example:
```python
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    # norm_type: type of the used p-norm, here norm_type is 2.
    clip_grad=dict(max_norm=35, norm_type=2))
```
### Gradient accumulation
When computing resources are limited, the batch size can only be set to a small value, which may degrade the performance of models. Gradient accumulation can be used to work around this problem. We support the `accumulative_counts` option in `optim_wrapper` for gradient accumulation.
Here is an example:
```python
train_dataloader = dict(batch_size=64)
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    accumulative_counts=4)
```
This indicates that, during training, back-propagation is performed once every 4 iterations, which is equivalent to:
```python
train_dataloader = dict(batch_size=256)
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001))
```
## Customize parameter schedules

In training, optimization parameters such as the learning rate and momentum are usually not fixed, but adjusted through iterations or epochs. PyTorch supports several learning rate schedulers, but they are not sufficient for complex strategies. In MMClassification, we provide `param_scheduler` for better control of different parameter schedules.
### Customize learning rate schedules
Learning rate schedulers are widely used to improve performance. We support most of the PyTorch schedulers, including `ExponentialLR`, `LinearLR`, `StepLR`, `MultiStepLR`, etc.
All available learning rate schedulers can be found {external+mmengine:ref}`here <scheduler>`, and the
names of learning rate schedulers end with `LR`.
- **Single learning rate schedule**
In most cases, we use only one learning rate schedule for simplicity. For instance, [`MultiStepLR`](mmengine.optim.MultiStepLR) is used as the default learning rate schedule for ResNet. Here, `param_scheduler` is a dictionary.
```python
param_scheduler = dict(
type='MultiStepLR',
by_epoch=True,
milestones=[100, 150],
gamma=0.1)
```
Or, if we want to use the [`CosineAnnealingLR`](mmengine.optim.CosineAnnealingLR) scheduler to decay the learning rate:
```python
param_scheduler = dict(
    type='CosineAnnealingLR',
    by_epoch=True,
    T_max=num_epochs)
```
- **Multiple learning rate schedules**
In some training cases, multiple learning rate schedules are applied for higher accuracy. For example, in the early stage, training is prone to be volatile, and warmup is a technique to reduce volatility.
The learning rate will increase gradually from a minor value to the expected value by warmup, and decay afterwards by other schedules.

In MMClassification, simply combining the desired schedules in `param_scheduler` as a list can achieve the warmup strategy.

Here are some examples:

1. Linear warmup during the first 50 iterations.
```python
param_scheduler = [
    # linear warm-up by iters
    dict(type='LinearLR',
         start_factor=0.001,
         by_epoch=False,  # by iters
         end=50),  # only warm up for the first 50 iters
    # main learning rate schedule
    dict(type='MultiStepLR',
         by_epoch=True,
         milestones=[8, 11],
         gamma=0.1)
]
```
2. Linear warmup and update the learning rate by iteration during the first 10 epochs.
```python
param_scheduler = [
    # linear warm-up by epochs in [0, 10) epochs
    dict(type='LinearLR',
         start_factor=0.001,
         by_epoch=True,
         end=10,
         convert_to_iter_based=True,  # Update learning rate by iter.
         ),
    # use CosineAnnealing schedule after 10 epochs
    dict(type='CosineAnnealingLR', by_epoch=True, begin=10)
]
```
Notice that we use the `begin` and `end` arguments here to assign the valid range, which is \[`begin`, `end`), for each schedule. The unit of the range is defined by the `by_epoch` argument. If not specified, `begin` is 0 and `end` is the maximum number of epochs or iterations.

If the ranges of all schedules are not continuous, the learning rate stays constant in the uncovered ranges; otherwise, all schedules valid at a given stage are executed in order, which behaves the same as PyTorch [`ChainedScheduler`](torch.optim.lr_scheduler.ChainedScheduler).
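For instance, the following sketch (with illustrative numbers) warms up linearly for the first 5 epochs, leaves the learning rate constant in epochs \[5, 10) where no schedule is valid, and then applies cosine annealing from epoch 10 to 100:

```python
param_scheduler = [
    # linear warm-up in [0, 5) epochs
    dict(type='LinearLR', start_factor=0.01, by_epoch=True, begin=0, end=5),
    # no schedule covers [5, 10), so the learning rate stays constant there
    # cosine annealing in [10, 100) epochs
    dict(type='CosineAnnealingLR', by_epoch=True, begin=10, end=100)
]
```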
```{tip}
To check that the learning rate curve is as expected, after completing your configuration file, you could use the [learning rate visualization tool](../user_guides/visualization.md#learning-rate-schedule-visualization) to draw the corresponding learning rate adjustment curve.
```
### Customize momentum schedules
We support using momentum schedulers to modify the optimizer's momentum according to the learning rate, which could make the loss converge faster. The usage is the same as for learning rate schedulers.

All available momentum schedulers can be found {external+mmengine:ref}`here <scheduler>`, and the
names of momentum schedulers end with `Momentum`.
Here is an example:
```python
param_scheduler = [
    # the lr scheduler
    dict(type='LinearLR', ...),
    # the momentum scheduler
    dict(type='LinearMomentum',
         start_factor=0.001,
         by_epoch=False,
         begin=0,
         end=1000)
]
```
## Add new optimizers or constructors

```{note}
This part will modify the MMClassification source code or add code to the MMClassification framework; beginners can skip it.
```

### Add new optimizers

In academic research and industrial practice, it may be necessary to use optimization methods not implemented by MMClassification, and you can add them through the following steps.

#### 1. Implement a new optimizer

Assume you want to add an optimizer named `MyOptimizer`, which has arguments `a`, `b`, and `c`.
You need to create a new file under `mmcls/engine/optimizers`, and implement the new optimizer in the file, for example, in `mmcls/engine/optimizers/my_optimizer.py`:
```python
from torch.optim import Optimizer

from mmcls.registry import OPTIMIZERS


@OPTIMIZERS.register_module()
class MyOptimizer(Optimizer):

    def __init__(self, a, b, c):
        ...

    def step(self, closure=None):
        ...
```
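For reference, a real optimizer also needs to receive the model parameters as its first argument and implement the update rule in `step`. Below is a minimal, hypothetical sketch; the meaning given to `a`, `b` and `c` is invented purely for illustration:

```python
import torch
from torch.optim import Optimizer

from mmcls.registry import OPTIMIZERS


@OPTIMIZERS.register_module()
class MyOptimizer(Optimizer):

    def __init__(self, params, a=0.01, b=0.9, c=1e-8):
        # `a`, `b` and `c` are stored as per-group hyper-parameters
        defaults = dict(a=a, b=b, c=c)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                # toy update rule: scale the gradient by `a` and a `b`/`c`-based denominator
                p.add_(-group['a'] * p.grad / (p.grad.abs() * group['b'] + group['c']))
        return loss
```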
#### 2. Import the optimizer

To let the registry find the module defined above, it should be imported when the program runs. There are two ways to achieve it.

- Import it in `mmcls/engine/optimizers/__init__.py` to add it into the `mmcls.engine` package.
```python
# In mmcls/engine/optimizers/__init__.py
...
from .my_optimizer import MyOptimizer # MyOptimizer maybe other class name

__all__ = [..., 'MyOptimizer']
```

During running, we will automatically import the `mmcls.engine` package and register `MyOptimizer` at the same time.
- Use `custom_imports` in the config to manually import it.

```python
custom_imports = dict(
    imports=['mmcls.engine.optimizers.my_optimizer'],
    allow_failed_imports=False,
)
```
The module `mmcls.engine.optimizers.my_optimizer` will be imported at the beginning of the program and the class `MyOptimizer` is then automatically registered.
Note that only the package containing the class `MyOptimizer` should be imported. `mmcls.engine.optimizers.my_optimizer.MyOptimizer` **cannot** be imported directly.
#### 3. Specify the optimizer in the config file
Then you can use `MyOptimizer` in the `optim_wrapper.optimizer` field of config files.
```python
optim_wrapper = dict(
    optimizer=dict(type='MyOptimizer', a=a_value, b=b_value, c=c_value))
```
### Add new optimizer constructors

Some models may have parameter-specific settings for optimization, like a different weight decay rate for all `BatchNorm` layers.

Although we can already use [the `optim_wrapper.paramwise_cfg` field](#parameter-wise-finely-configuration) to configure various parameter-specific optimizer settings, it may still not cover your needs.

By default, we use the [`DefaultOptimWrapperConstructor`](mmengine.optim.DefaultOptimWrapperConstructor) class to handle the construction of the optimizer wrapper. During the construction, it fine-grainedly configures the optimizer settings of different parameters according to `paramwise_cfg`, and it can also serve as a template for new optimizer constructors.
You can override these behaviors by adding new optimizer wrapper constructors.
```python
# In mmcls/engine/optimizers/my_optim_constructor.py
from mmengine.optim import DefaultOptimWrapperConstructor

from mmcls.registry import OPTIM_WRAPPER_CONSTRUCTORS


@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
class MyOptimWrapperConstructor:

    def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
        ...

    def __call__(self, model):
        ...
```
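As a reference, here is a minimal, hypothetical sketch of what such a constructor could do: it gives every normalization-layer parameter zero weight decay and wraps the built optimizer in a plain `OptimWrapper`. The grouping logic is illustrative only and is not the behavior of `DefaultOptimWrapperConstructor`:

```python
import torch.nn as nn
from mmengine.optim import OptimWrapper

from mmcls.registry import OPTIM_WRAPPER_CONSTRUCTORS, OPTIMIZERS


@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
class MyOptimWrapperConstructor:

    def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
        self.optim_wrapper_cfg = optim_wrapper_cfg
        self.paramwise_cfg = paramwise_cfg or {}

    def __call__(self, model):
        optimizer_cfg = self.optim_wrapper_cfg['optimizer'].copy()
        # split parameters into "normalization" and "other" groups
        norm_params, other_params = [], []
        for module in model.modules():
            params = [p for p in module.parameters(recurse=False) if p.requires_grad]
            if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d,
                                   nn.GroupNorm, nn.LayerNorm)):
                norm_params.extend(params)
            else:
                other_params.extend(params)
        # per-group options override the optimizer defaults
        optimizer_cfg['params'] = [
            dict(params=norm_params, weight_decay=0.),
            dict(params=other_params),
        ]
        optimizer = OPTIMIZERS.build(optimizer_cfg)
        return OptimWrapper(optimizer=optimizer)
```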
And then, import it and use it almost as in [the optimizer tutorial](#add-new-optimizers).

1. Import it in `mmcls/engine/optimizers/__init__.py` to add it into the `mmcls.engine` package.
```python
# In mmcls/engine/optimizers/__init__.py
...
from .my_optim_constructor import MyOptimWrapperConstructor
__all__ = [..., 'MyOptimWrapperConstructor']
```
2. Use `MyOptimWrapperConstructor` in the `optim_wrapper.constructor` field of config files.
```python
optim_wrapper = dict(
    constructor='MyOptimWrapperConstructor',
    optimizer=...,
    paramwise_cfg=...,
)
```

# Customize Training Optimization Strategies

Our codebase already provides [default training strategy configs](https://github.com/open-mmlab/mmclassification/blob/master/configs/_base_/schedules) for common datasets such as ImageNet and CIFAR. If you want to further improve model performance on these datasets, or experiment with new datasets and methods, you usually need to modify these default strategies.

In this tutorial, we will introduce how to modify the config files to construct optimizers, use parameter-wise fine configuration, gradient clipping, gradient accumulation, and customize momentum schedules when running custom training. We will also briefly introduce, through templates, how to develop custom optimizers and constructors.
## Configure training optimization strategies

We use `optim_wrapper` to configure the main optimization strategies, including the choice of optimizer, automatic mixed precision training, parameter-wise fine configuration, gradient clipping and gradient accumulation. Each of these is introduced below.
### Construct optimizers built into PyTorch

MMClassification supports all optimizers implemented by PyTorch; you only need to specify the `optimizer` field required by the optimizer wrapper in the config file.

To use [`SGD`](torch.optim.SGD), for example, make the following modification. Note that all optimization-related settings need to be wrapped inside the `optim_wrapper` config.
```python
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.0003, weight_decay=0.0001)
)
```
```{note}
The `type` in the config is not an argument used at construction time but the class name of a PyTorch built-in optimizer.
Refer to the {external+torch:ref}`list of optimizers supported by PyTorch <optim:algorithms>` for more choices.
```

To modify the learning rate of the model, just modify the `lr` in the optimizer config.
Other arguments can be set directly according to the [PyTorch API documentation](torch.optim).

For example, to use [`Adam`](torch.optim.Adam) with the settings `torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)`,
modify the config as follows:
```python
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(
        type='Adam',
        lr=0.001,
        betas=(0.9, 0.999),
        eps=1e-08,
        weight_decay=0,
        amsgrad=False),
)
```
````{note}
For single precision training, the default optimizer wrapper type is `OptimWrapper`, so it can be omitted here; the config can therefore be further simplified as:
```python
optim_wrapper = dict(
    optimizer=dict(
        type='Adam',
        lr=0.001,
        betas=(0.9, 0.999),
        eps=1e-08,
        weight_decay=0,
        amsgrad=False))
```
````
### Mixed precision training

If we want to use automatic mixed precision training, we can simply change the type of `optim_wrapper` to `AmpOptimWrapper`:
```python
optim_wrapper = dict(type='AmpOptimWrapper', optimizer=...)
```
In addition, for convenience, the training script `tools/train.py` also provides an `--amp` flag to turn on mixed precision training directly. For more details, refer to the [training and test](../user_guides/train_test.md) tutorial.

### Parameter-wise fine configuration

In some models, different optimization strategies need to be applied to specific parameters, for example, no weight decay for BatchNorm layers, or different learning rates for different layers.
To configure this, we use the `paramwise_cfg` argument in `optim_wrapper`.

- **Set hyper-parameter multipliers for different types of parameters.**

For instance, we can set `norm_decay_mult=0.` in `paramwise_cfg` to set the weight decay of the weights and biases of normalization layers to zero.
```python
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.8, weight_decay=1e-4),
    paramwise_cfg=dict(norm_decay_mult=0.))
```
More types of parameter configurations are supported, see the following list:
- `lr_mult`: learning rate multiplier for all parameters
- `decay_mult`: weight decay multiplier for all parameters
- `bias_lr_mult`: learning rate multiplier for biases (excluding the biases of normalization layers and the offsets of deformable convolutions), defaults to 1
- `bias_decay_mult`: weight decay multiplier for biases (excluding the biases of normalization layers and the offsets of deformable convolutions), defaults to 1
- `norm_decay_mult`: weight decay multiplier for the weights and biases of normalization layers, defaults to 1
- `dwconv_decay_mult`: weight decay multiplier for depth-wise convolution layers, defaults to 1
- `bypass_duplicate`: whether to skip duplicated parameters, defaults to `False`
- `dcn_offset_lr_mult`: learning rate multiplier for deformable convolution layers, defaults to 1
- **Set hyper-parameter multipliers for specific parameters.**

MMClassification uses `custom_keys` in `paramwise_cfg` to configure hyper-parameter multipliers for specific parameters.

For example, the following config sets the learning rate and weight decay of all `backbone.layer0` parameters to 0, keeps the rest of `backbone` consistent with the optimizer, and sets the learning rate of `head` to 0.001.
```python
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    paramwise_cfg=dict(
        custom_keys={
            'backbone.layer0': dict(lr_mult=0, decay_mult=0),
            'backbone': dict(lr_mult=1),
            'head': dict(lr_mult=0.1)
        }))
```
### Gradient clipping

During training, the loss function may approach a steep region, causing gradient explosion. Gradient clipping helps stabilize the training process; more background can be found on [this page](https://paperswithcode.com/method/gradient-clipping).

Currently, gradient clipping is supported via the `clip_grad` argument in the `optim_wrapper` field; for detailed arguments, refer to the [PyTorch documentation](torch.nn.utils.clip_grad_norm_).

An example is as follows:
```python
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    # norm_type: the type of the p-norm used, here norm 2
    clip_grad=dict(max_norm=35, norm_type=2))
```
### Gradient accumulation

When computing resources are limited, the batch size can only be set to a small value, which may degrade model performance. Gradient accumulation can be used to work around this problem. We support the `accumulative_counts` argument in the `optim_wrapper` field for gradient accumulation.

An example is as follows:
```python
train_dataloader = dict(batch_size=64)
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001),
    accumulative_counts=4)
```
This means that during training, back-propagation is performed once every 4 iterations. Since the per-GPU batch size is 64, this is equivalent to a per-GPU batch size of 256 per optimization step, i.e.:
```python
train_dataloader = dict(batch_size=256)
optim_wrapper = dict(
    optimizer=dict(type='SGD', lr=0.01, weight_decay=0.0001))
```
## Configure parameter schedules

During training, optimization parameters such as the learning rate and momentum are usually not fixed, but adjusted as training progresses. PyTorch supports some learning rate schedulers, but they are not sufficient for complex strategies. In MMClassification, we provide `param_scheduler` for better control over the schedules of different optimization parameters.

### Configure learning rate schedules

Learning rate decay is widely used to improve network performance. We support most PyTorch learning rate schedulers, including `ExponentialLR`, `LinearLR`, `StepLR`, `MultiStepLR`, and so on.

- **Single learning rate schedule**

In most cases, a single learning rate schedule is used, and `param_scheduler` is a dictionary. For example, the default ResNet training uses the step learning rate decay strategy [`MultiStepLR`](mmengine.optim.MultiStepLR), and the config is:
```python
param_scheduler = dict(
    type='MultiStepLR',
    by_epoch=True,
    milestones=[100, 150],
    gamma=0.1)
```
Or, to use [`CosineAnnealingLR`](mmengine.optim.CosineAnnealingLR) for learning rate decay:
```python
param_scheduler = dict(
    type='CosineAnnealingLR',
    by_epoch=True,
    T_max=num_epochs)
```
- **Multiple learning rate schedules**

In some other cases, multiple learning rate schedules are combined to improve accuracy. For example, training is prone to be unstable in the early stage, and warmup is used to reduce this instability.
The learning rate is gradually increased from a small value to the expected value by warmup, and is then decayed by other schedules.

In MMClassification, the warmup strategy can be achieved simply by writing multiple schedules into the `param_scheduler` list.

Some examples:

1. Linear warmup by **iteration** during the first 50 iterations.
```python
param_scheduler = [
    # linear warm-up by iterations
    dict(type='LinearLR',
         start_factor=0.001,
         by_epoch=False,  # by iterations
         end=50),  # only warm up for the first 50 iterations
    # the main learning rate schedule
    dict(type='MultiStepLR',
         by_epoch=True,
         milestones=[8, 11],
         gamma=0.1)
]
```
2. Linear warmup by **iteration** during the first 10 epochs.
```python
param_scheduler = [
    # linear warm-up by iterations in the first 10 epochs
    dict(type='LinearLR',
         start_factor=0.001,
         by_epoch=True,
         end=10,
         convert_to_iter_based=True,  # update the learning rate by iteration
         ),
    # cosine annealing decay after 10 epochs
    dict(type='CosineAnnealingLR', by_epoch=True, begin=10)
]
```
Note the `begin` and `end` arguments added here, which specify the **valid interval** of each scheduler. The valid interval usually only needs to be set when multiple schedulers are combined, and can be ignored when a single scheduler is used. When `begin` and `end` are specified, the scheduler only takes effect in the \[begin, end) interval, whose unit is determined by the `by_epoch` argument. When combining different schedulers, the `by_epoch` argument of each scheduler does not have to be the same. If not specified, `begin` is 0 and `end` is the maximum number of epochs or iterations.

If the valid intervals of two adjacent schedulers are not contiguous and leave an interval uncovered, the learning rate stays constant in that interval. If the valid intervals of two schedulers overlap, the schedulers are applied on top of each other, and the learning rate adjustments are triggered in the order they appear in the config (the behavior is consistent with PyTorch [`ChainedScheduler`](torch.optim.lr_scheduler.ChainedScheduler)).
```{tip}
To avoid a learning rate curve that differs from expectations, after completing the config you can use the [learning rate visualization tool](../user_guides/visualization.md#learning-rate-schedule-visualization) provided by MMClassification to draw the corresponding learning rate adjustment curve.
```
### Configure momentum schedules

MMClassification supports momentum schedulers, which modify the optimizer's momentum according to the learning rate so that the loss converges faster. The usage is the same as for learning rate schedulers.

The supported momentum schedulers and their detailed usage can be found [here](https://github.com/open-mmlab/mmengine/blob/main/mmengine/optim/scheduler/momentum_scheduler.py). Only the `LR` in the scheduler names is replaced by `Momentum`, and momentum schedules can be appended directly to the `param_scheduler` list.

An example:
```python
param_scheduler = [
    # the learning rate schedule
    dict(type='LinearLR', ...),
    # the momentum schedule
    dict(type='LinearMomentum',
         start_factor=0.001,
         by_epoch=False,
         begin=0,
         end=1000)
]
```
## Add new optimizers or optimizer constructors

```{note}
This part modifies the MMClassification source code or adds code to the MMClassification framework; beginners can skip it.
```

### Add new optimizers

In academic research and industrial practice, it may be necessary to use optimization methods not implemented by MMClassification, which can be added as follows.

#### 1. Define a new optimizer

A custom optimizer can be implemented according to the following rules.

Suppose we want to add an optimizer named `MyOptimizer` with arguments `a`, `b`, and `c`.
Create a folder named `mmcls/engine/optimizers`, and implement the custom optimizer in a file inside it, such as `mmcls/engine/optimizers/my_optimizer.py`:
```python
from mmengine.registry import OPTIMIZERS
from torch.optim import Optimizer


@OPTIMIZERS.register_module()
class MyOptimizer(Optimizer):

    def __init__(self, a, b, c):
        ...

    def step(self, closure=None):
        ...
```
#### 2. Register the optimizer

To let the registry find the module defined above, it needs to be imported into the main namespace first. There are two ways to achieve this.

- Modify `mmcls/engine/optimizers/__init__.py` to import it into the `mmcls.engine` package.
```python
# In mmcls/engine/optimizers/__init__.py
...
from .my_optimizer import MyOptimizer # MyOptimizer is the name of our custom optimizer

__all__ = [..., 'MyOptimizer']
```

During running, the `mmcls.engine` package will be imported automatically and `MyOptimizer` will be registered at the same time.
- Use `custom_imports` in the config to import it manually.

```python
custom_imports = dict(
    imports=['mmcls.engine.optimizers.my_optimizer'],
    allow_failed_imports=False,
)
```

The module `mmcls.engine.optimizers.my_optimizer` will be imported at the beginning of the program, and the class `MyOptimizer` will then be registered automatically.
Note that only the package containing the `MyOptimizer` class needs to be imported; `mmcls.engine.optimizers.my_optimizer.MyOptimizer` **cannot** be imported directly.
#### 3. Specify the optimizer in the config file

Then you can use `MyOptimizer` in the `optim_wrapper.optimizer` field of the config file:
```python
optim_wrapper = dict(
    optimizer=dict(type='MyOptimizer', a=a_value, b=b_value, c=c_value))
```
### Add new optimizer constructors

Some models may have parameter-specific settings for optimization, for example, a different weight decay for all BatchNorm layers.

Although we can already use the [`optim_wrapper.paramwise_cfg` field](#parameter-wise-fine-configuration) to configure parameter-specific optimization settings, it may still not cover your needs.

Of course, you can modify it further. By default, we use [`DefaultOptimWrapperConstructor`](mmengine.optim.DefaultOptimWrapperConstructor) to construct the optimizer; during construction it applies fine-grained settings to different parameters according to `paramwise_cfg`. This default constructor can also serve as a template for implementing new optimizer constructors.
We can add a new optimizer constructor to override these behaviors.
```python
# In mmcls/engine/optimizers/my_optim_constructor.py
from mmengine.optim import DefaultOptimWrapperConstructor

from mmcls.registry import OPTIM_WRAPPER_CONSTRUCTORS


@OPTIM_WRAPPER_CONSTRUCTORS.register_module()
class MyOptimWrapperConstructor:

    def __init__(self, optim_wrapper_cfg, paramwise_cfg=None):
        ...

    def __call__(self, model):
        ...
```
Then, import it and use it in a similar way to the [new optimizer tutorial](#add-new-optimizers).

1. Modify `mmcls/engine/optimizers/__init__.py` to import it into the `mmcls.engine` package.
```python
# In mmcls/engine/optimizers/__init__.py
...
from .my_optim_constructor import MyOptimWrapperConstructor
__all__ = [..., 'MyOptimWrapperConstructor']
```
2. Use `MyOptimWrapperConstructor` in the `optim_wrapper.constructor` field of the config file.
```python
optim_wrapper = dict(
    constructor='MyOptimWrapperConstructor',
    optimizer=...,
    paramwise_cfg=...,
)
```