[Feature] Add BEiT backbone (#1404)

* [Feature] Add BEiT backbone

* fix

* add readme

* fix

* add link

* fix memory

* fix

* fix test_beit.py

* fix
FangjianLin 2022-03-30 15:25:10 +08:00 committed by GitHub
parent 30864ea23d
commit 24f1563571
20 changed files with 1345 additions and 2 deletions


@ -85,6 +85,7 @@ Supported backbones:
- [x] [Swin Transformer (ICCV'2021)](configs/swin)
- [x] [Twins (NeurIPS'2021)](configs/twins)
- [x] [ConvNeXt (CVPR'2022)](configs/convnext)
- [x] [BEiT (ICLR'2022)](configs/beit)
Supported methods:


@ -84,6 +84,7 @@ MMSegmentation 是一个基于 PyTorch 的语义分割开源工具箱。它是 O
- [x] [Swin Transformer (ICCV'2021)](configs/swin)
- [x] [Twins (NeurIPS'2021)](configs/twins)
- [x] [ConvNeXt (CVPR'2022)](configs/convnext)
- [x] [BEiT (ICLR'2022)](configs/beit)
Supported methods:


@ -0,0 +1,50 @@
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
type='EncoderDecoder',
pretrained=None,
backbone=dict(
type='BEiT',
img_size=(640, 640),
patch_size=16,
in_channels=3,
embed_dims=768,
num_layers=12,
num_heads=12,
mlp_ratio=4,
out_indices=(3, 5, 7, 11),
qv_bias=True,
attn_drop_rate=0.0,
drop_path_rate=0.1,
norm_cfg=dict(type='LN', eps=1e-6),
act_cfg=dict(type='GELU'),
norm_eval=False,
init_values=0.1),
neck=dict(type='Feature2Pyramid', embed_dim=768, rescales=[4, 2, 1, 0.5]),
decode_head=dict(
type='UPerHead',
in_channels=[768, 768, 768, 768],
in_index=[0, 1, 2, 3],
pool_scales=(1, 2, 3, 6),
channels=768,
dropout_ratio=0.1,
num_classes=150,
norm_cfg=norm_cfg,
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
auxiliary_head=dict(
type='FCNHead',
in_channels=768,
in_index=2,
channels=256,
num_convs=1,
concat_input=False,
dropout_ratio=0.1,
num_classes=150,
norm_cfg=norm_cfg,
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
# model training and testing settings
train_cfg=dict(),
test_cfg=dict(mode='whole'))
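# Note: with rescales=[4, 2, 1, 0.5], the Feature2Pyramid neck turns the four
# 1/16-resolution BEiT feature maps into the 1/4, 1/8, 1/16 and 1/32 pyramid
# expected by UPerHead.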

configs/beit/README.md Normal file

@ -0,0 +1,84 @@
# BEiT
[BEiT: BERT Pre-Training of Image Transformers](https://arxiv.org/abs/2106.08254)
## Introduction
<!-- [BACKBONE] -->
<a href="https://github.com/microsoft/unilm/tree/master/beit">Official Repo</a>
<a href="https://github.com/open-mmlab/mmsegmentation/blob/v0.23.0/mmseg/models/backbones/beit.py#1404">Code Snippet</a>
## Abstract
<!-- [ABSTRACT] -->
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e., image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at [this https URL](https://github.com/microsoft/unilm/tree/master/beit).
<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/93248678/160155758-781c9a45-b1d7-4530-9015-88eca6645006.png" width="70%"/>
</div>
## Citation
```bibtex
@inproceedings{beit,
title={{BEiT}: {BERT} Pre-Training of Image Transformers},
author={Hangbo Bao and Li Dong and Songhao Piao and Furu Wei},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=p-BhZSz59o4}
}
```
## Usage
To use other repositories' pre-trained models, it is necessary to convert keys.
We provide a script [`beit2mmseg.py`](../../tools/model_converters/beit2mmseg.py) in the tools directory to convert the key of models from [the official repo](https://github.com/microsoft/unilm/tree/master/beit/semantic_segmentation) to MMSegmentation style.
```shell
python tools/model_converters/beit2mmseg.py ${PRETRAIN_PATH} ${STORE_PATH}
```
E.g.
```shell
python tools/model_converters/beit2mmseg.py https://unilm.blob.core.windows.net/beit/beit_base_patch16_224_pt22k_ft22k.pth pretrain/beit_base_patch16_224_pt22k_ft22k.pth
```
This script converts the model from `PRETRAIN_PATH` and stores the converted model in `STORE_PATH`.
In our default setting, the pretrained models and their corresponding original models are listed below (a config sketch that references a converted checkpoint follows the table):
| pretrained models | original models |
| ----------------- | --------------- |
| BEiT_base.pth | [BEiT_base](https://unilm.blob.core.windows.net/beit/beit_base_patch16_224_pt22k_ft22k.pth) |
| BEiT_large.pth | [BEiT_large](https://unilm.blob.core.windows.net/beit/beit_large_patch16_224_pt22k_ft22k.pth) |
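A converted checkpoint can then be referenced from a config through the `pretrained` field. A minimal sketch, assuming the converted file is kept under `pretrain/` (this mirrors `upernet_beit-base_8x2_640x640_160k_ade20k.py`):
```python
_base_ = './upernet_beit-base_8x2_640x640_160k_ade20k.py'
# Point `pretrained` at the converted checkpoint (the path is an assumption).
model = dict(pretrained='pretrain/beit_base_patch16_224_pt22k_ft22k.pth')
```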
Verify the single-scale results of the model:
```shell
sh tools/dist_test.sh \
configs/beit/upernet_beit-large_fp16_8x1_640x640_160k_ade20k.py \
upernet_beit-large_fp16_8x1_640x640_160k_ade20k-8fc0dd5d.pth $GPUS --eval mIoU
```
Since the relative position embedding requires the input height and width to be equal, sliding-window inference is adopted for multi-scale testing. We set `min_size=640` so that the shortest edge is 640, and multi-scale inference therefore uses a separate config instead of `--aug-test`. For multi-scale inference:
```shell
sh tools/dist_test.sh \
configs/beit/upernet_beit-large_fp16_640x640_160k_ade20k_ms.py \
upernet_beit-large_fp16_8x1_640x640_160k_ade20k-8fc0dd5d.pth $GPUS --eval mIoU
```
## Results and models
### ADE20K
| Method | Backbone | Crop Size | pretrain | pretrain img size | Batch Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download |
| ------ | -------- | --------- | ---------- | ------- | -------- | --- | --- | -------------- | ----- | ------------: | -------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| UperNet | BEiT-B | 640x640 | ImageNet-22K | 224x224 | 16 | 160000 | 15.88 | 2.00 | 53.08 | 53.84 | [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/beit/upernet_beit-base_8x2_640x640_160k_ade20k.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/beit/upernet_beit-base_8x2_640x640_160k_ade20k/upernet_beit-base_8x2_640x640_160k_ade20k-eead221d.pth) &#124; [log](https://download.openmmlab.com/mmsegmentation/v0.5/beit/upernet_beit-base_8x2_640x640_160k_ade20k/upernet_beit-base_8x2_640x640_160k_ade20k.log.json) |
| UperNet | BEiT-L | 640x640 | ImageNet-22K | 224x224 | 8 | 320000 | 22.64 | 0.96 | 56.33 | 56.84 | [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/beit/upernet_beit-large_fp16_8x1_640x640_160k_ade20k.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/beit/upernet_beit-large_fp16_8x1_640x640_160k_ade20k/upernet_beit-large_fp16_8x1_640x640_160k_ade20k-8fc0dd5d.pth) &#124; [log](https://download.openmmlab.com/mmsegmentation/v0.5/beit/upernet_beit-large_fp16_8x1_640x640_160k_ade20k/upernet_beit-large_fp16_8x1_640x640_160k_ade20k.log.json) |

configs/beit/beit.yml Normal file

@ -0,0 +1,45 @@
Models:
- Name: upernet_beit-base_8x2_640x640_160k_ade20k
In Collection: UperNet
Metadata:
backbone: BEiT-B
crop size: (640,640)
lr schd: 160000
inference time (ms/im):
- value: 500.0
hardware: V100
backend: PyTorch
batch size: 1
mode: FP32
resolution: (640,640)
Training Memory (GB): 15.88
Results:
- Task: Semantic Segmentation
Dataset: ADE20K
Metrics:
mIoU: 53.08
mIoU(ms+flip): 53.84
Config: configs/beit/upernet_beit-base_8x2_640x640_160k_ade20k.py
Weights: https://download.openmmlab.com/mmsegmentation/v0.5/beit/upernet_beit-base_8x2_640x640_160k_ade20k/upernet_beit-base_8x2_640x640_160k_ade20k-eead221d.pth
- Name: upernet_beit-large_fp16_8x1_640x640_160k_ade20k
In Collection: UperNet
Metadata:
backbone: BEiT-L
crop size: (640,640)
lr schd: 320000
inference time (ms/im):
- value: 1041.67
hardware: V100
backend: PyTorch
batch size: 1
mode: FP16
resolution: (640,640)
Training Memory (GB): 22.64
Results:
- Task: Semantic Segmentation
Dataset: ADE20K
Metrics:
mIoU: 56.33
mIoU(ms+flip): 56.84
Config: configs/beit/upernet_beit-large_fp16_8x1_640x640_160k_ade20k.py
Weights: https://download.openmmlab.com/mmsegmentation/v0.5/beit/upernet_beit-large_fp16_8x1_640x640_160k_ade20k/upernet_beit-large_fp16_8x1_640x640_160k_ade20k-8fc0dd5d.pth


@ -0,0 +1,24 @@
_base_ = './upernet_beit-base_8x2_640x640_160k_ade20k.py'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2560, 640),
img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
flip=True,
transforms=[
dict(type='Resize', keep_ratio=True, min_size=640),
dict(type='RandomFlip'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
])
]
data = dict(
val=dict(pipeline=test_pipeline),
test=dict(pipeline=test_pipeline),
samples_per_gpu=2)


@ -0,0 +1,30 @@
_base_ = [
'../_base_/models/upernet_beit.py', '../_base_/datasets/ade20k_640x640.py',
'../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]
model = dict(
pretrained='pretrain/beit_base_patch16_224_pt22k_ft22k.pth',
test_cfg=dict(mode='slide', crop_size=(640, 640), stride=(426, 426)))
optimizer = dict(
_delete_=True,
type='AdamW',
lr=3e-5,
betas=(0.9, 0.999),
weight_decay=0.05,
constructor='LayerDecayOptimizerConstructor',
paramwise_cfg=dict(num_layers=12, layer_decay_rate=0.9))
lr_config = dict(
_delete_=True,
policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-6,
power=1.0,
min_lr=0.0,
by_epoch=False)
# By default, models are trained on 8 GPUs with 2 images per GPU
data = dict(samples_per_gpu=2)


@ -0,0 +1,22 @@
_base_ = './upernet_beit-large_fp16_8x1_640x640_160k_ade20k.py'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
test_pipeline = [
dict(type='LoadImageFromFile'),
dict(
type='MultiScaleFlipAug',
img_scale=(2560, 640),
img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
flip=True,
transforms=[
dict(type='Resize', keep_ratio=True, min_size=640),
dict(type='RandomFlip'),
dict(type='Normalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
])
]
data = dict(
val=dict(pipeline=test_pipeline), test=dict(pipeline=test_pipeline))


@ -0,0 +1,47 @@
_base_ = [
'../_base_/models/upernet_beit.py', '../_base_/datasets/ade20k_640x640.py',
'../_base_/default_runtime.py', '../_base_/schedules/schedule_320k.py'
]
model = dict(
pretrained='pretrain/beit_large_patch16_224_pt22k_ft22k.pth',
backbone=dict(
type='BEiT',
embed_dims=1024,
num_layers=24,
num_heads=16,
mlp_ratio=4,
qv_bias=True,
init_values=1e-6,
drop_path_rate=0.2,
out_indices=[7, 11, 15, 23]),
neck=dict(embed_dim=1024, rescales=[4, 2, 1, 0.5]),
decode_head=dict(
in_channels=[1024, 1024, 1024, 1024], num_classes=150, channels=1024),
auxiliary_head=dict(in_channels=1024, num_classes=150),
test_cfg=dict(mode='slide', crop_size=(640, 640), stride=(426, 426)))
optimizer = dict(
_delete_=True,
type='AdamW',
lr=2e-5,
betas=(0.9, 0.999),
weight_decay=0.05,
constructor='LayerDecayOptimizerConstructor',
paramwise_cfg=dict(num_layers=24, layer_decay_rate=0.95))
lr_config = dict(
_delete_=True,
policy='poly',
warmup='linear',
warmup_iters=3000,
warmup_ratio=1e-6,
power=1.0,
min_lr=0.0,
by_epoch=False)
data = dict(samples_per_gpu=1)
optimizer_config = dict(
type='GradientCumulativeFp16OptimizerHook', cumulative_iters=2)
fp16 = dict()


@ -1,4 +1,6 @@
# Copyright (c) OpenMMLab. All rights reserved.
from .evaluation import * # noqa: F401, F403
from .layer_decay_optimizer_constructor import \
LayerDecayOptimizerConstructor # noqa: F401
from .seg import * # noqa: F401, F403
from .utils import * # noqa: F401, F403


@ -0,0 +1,87 @@
# Copyright (c) OpenMMLab. All rights reserved.
from mmcv.runner import (OPTIMIZER_BUILDERS, DefaultOptimizerConstructor,
get_dist_info)
from mmseg.utils import get_root_logger
def get_num_layer_for_vit(var_name, num_max_layer):
"""Get the layer id to set the different learning rates.
Args:
var_name (str): The key of the model.
num_max_layer (int): Maximum number of backbone layers.
Returns:
layer id (int): Returns the layer id of the key.
"""
if var_name in ('backbone.cls_token', 'backbone.mask_token',
'backbone.pos_embed'):
return 0
elif var_name.startswith('backbone.patch_embed'):
return 0
elif var_name.startswith('backbone.layers'):
layer_id = int(var_name.split('.')[2])
return layer_id + 1
else:
return num_max_layer - 1
@OPTIMIZER_BUILDERS.register_module()
class LayerDecayOptimizerConstructor(DefaultOptimizerConstructor):
"""Different learning rates are set for different layers of backbone."""
def add_params(self, params, module):
"""Add all parameters of module to the params list.
The parameters of the given module will be added to the list of param
groups, with specific rules defined by paramwise_cfg.
Args:
params (list[dict]): A list of param groups, it will be modified
in place.
module (nn.Module): The module to be added.
"""
parameter_groups = {}
logger = get_root_logger()
logger.info(self.paramwise_cfg)
num_layers = self.paramwise_cfg.get('num_layers') + 2
layer_decay_rate = self.paramwise_cfg.get('layer_decay_rate')
logger.info(f'Build LayerDecayOptimizerConstructor '
f'{layer_decay_rate} - {num_layers}')
weight_decay = self.base_wd
for name, param in module.named_parameters():
if not param.requires_grad:
continue # frozen weights
if len(param.shape) == 1 or name.endswith('.bias') or name in (
'pos_embed', 'cls_token'):
group_name = 'no_decay'
this_weight_decay = 0.
else:
group_name = 'decay'
this_weight_decay = weight_decay
layer_id = get_num_layer_for_vit(name, num_layers)
group_name = f'layer_{layer_id}_{group_name}'
if group_name not in parameter_groups:
scale = layer_decay_rate**(num_layers - layer_id - 1)
parameter_groups[group_name] = {
'weight_decay': this_weight_decay,
'params': [],
'param_names': [],
'lr_scale': scale,
'group_name': group_name,
'lr': scale * self.base_lr
}
parameter_groups[group_name]['params'].append(param)
parameter_groups[group_name]['param_names'].append(name)
rank, _ = get_dist_info()
if rank == 0:
to_display = {}
for key in parameter_groups:
to_display[key] = {
'param_names': parameter_groups[key]['param_names'],
'lr_scale': parameter_groups[key]['lr_scale'],
'lr': parameter_groups[key]['lr'],
'weight_decay': parameter_groups[key]['weight_decay']
}
logger.info(f'Param groups ={to_display}')
params.extend(parameter_groups.values())
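# Usage sketch: select this constructor through the optimizer config, as in
# configs/beit/upernet_beit-base_8x2_640x640_160k_ade20k.py:
#   optimizer = dict(
#       type='AdamW', lr=3e-5, betas=(0.9, 0.999), weight_decay=0.05,
#       constructor='LayerDecayOptimizerConstructor',
#       paramwise_cfg=dict(num_layers=12, layer_decay_rate=0.9))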


@ -1,4 +1,5 @@
# Copyright (c) OpenMMLab. All rights reserved.
from .beit import BEiT
from .bisenetv1 import BiSeNetV1
from .bisenetv2 import BiSeNetV2
from .cgnet import CGNet
@ -24,5 +25,5 @@ __all__ = [
'ResNeSt', 'MobileNetV2', 'UNet', 'CGNet', 'MobileNetV3',
'VisionTransformer', 'SwinTransformer', 'MixVisionTransformer',
'BiSeNetV1', 'BiSeNetV2', 'ICNet', 'TIMMBackbone', 'ERFNet', 'PCPVT',
'SVT', 'STDCNet', 'STDCContextPathNet'
'SVT', 'STDCNet', 'STDCContextPathNet', 'BEiT'
]


@ -0,0 +1,532 @@
# Copyright (c) OpenMMLab. All rights reserved.
import warnings
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from mmcv.cnn import build_norm_layer
from mmcv.cnn.bricks.drop import build_dropout
from mmcv.cnn.bricks.transformer import FFN
from mmcv.cnn.utils.weight_init import (constant_init, kaiming_init,
trunc_normal_)
from mmcv.runner import BaseModule, ModuleList, _load_checkpoint
from torch.nn.modules.batchnorm import _BatchNorm
from torch.nn.modules.utils import _pair as to_2tuple
from mmseg.utils import get_root_logger
from ..builder import BACKBONES
from ..utils import PatchEmbed
try:
from scipy import interpolate
except ImportError:
interpolate = None
class BEiTAttention(BaseModule):
"""Window based multi-head self-attention (W-MSA) module with relative
position bias.
Args:
embed_dims (int): Number of input channels.
num_heads (int): Number of attention heads.
window_size (tuple[int]): The height and width of the window.
qv_bias (bool): If True, add a learnable bias to q, v.
Default: True.
qk_scale (float | None, optional): Override default qk scale of
head_dim ** -0.5 if set. Default: None.
attn_drop_rate (float): Dropout ratio of attention weight.
Default: 0.0
proj_drop_rate (float): Dropout ratio of output. Default: 0.
init_cfg (dict | None, optional): The Config for initialization.
Default: None.
"""
def __init__(self,
embed_dims,
num_heads,
window_size,
qv_bias=True,
qk_scale=None,
attn_drop_rate=0.,
proj_drop_rate=0.,
init_cfg=None):
super().__init__(init_cfg=init_cfg)
self.embed_dims = embed_dims
self.num_heads = num_heads
head_embed_dims = embed_dims // num_heads
self.scale = qk_scale or head_embed_dims**-0.5
if qv_bias:
self.q_bias = nn.Parameter(torch.zeros(embed_dims))
self.v_bias = nn.Parameter(torch.zeros(embed_dims))
else:
self.q_bias = None
self.v_bias = None
self.window_size = window_size
# cls to token, token to cls, and cls to cls
self.num_relative_distance = (2 * window_size[0] -
1) * (2 * window_size[1] - 1) + 3
# relative_position_bias_table shape is (2*Wh-1 * 2*Ww-1 + 3, nH)
self.relative_position_bias_table = nn.Parameter(
torch.zeros(self.num_relative_distance, num_heads))
# get pair-wise relative position index for
# each token inside the window
coords_h = torch.arange(window_size[0])
coords_w = torch.arange(window_size[1])
# coords shape is (2, Wh, Ww)
coords = torch.stack(torch.meshgrid([coords_h, coords_w]))
# coords_flatten shape is (2, Wh*Ww)
coords_flatten = torch.flatten(coords, 1)
relative_coords = (
coords_flatten[:, :, None] - coords_flatten[:, None, :])
# relative_coords shape is (Wh*Ww, Wh*Ww, 2)
relative_coords = relative_coords.permute(1, 2, 0).contiguous()
# shift to start from 0
relative_coords[:, :, 0] += window_size[0] - 1
relative_coords[:, :, 1] += window_size[1] - 1
relative_coords[:, :, 0] *= 2 * window_size[1] - 1
relative_position_index = torch.zeros(
size=(window_size[0] * window_size[1] + 1, ) * 2,
dtype=relative_coords.dtype)
# relative_position_index shape is (Wh*Ww, Wh*Ww)
relative_position_index[1:, 1:] = relative_coords.sum(-1)
relative_position_index[0, 0:] = self.num_relative_distance - 3
relative_position_index[0:, 0] = self.num_relative_distance - 2
relative_position_index[0, 0] = self.num_relative_distance - 1
self.register_buffer('relative_position_index',
relative_position_index)
self.qkv = nn.Linear(embed_dims, embed_dims * 3, bias=False)
self.attn_drop = nn.Dropout(attn_drop_rate)
self.proj = nn.Linear(embed_dims, embed_dims)
self.proj_drop = nn.Dropout(proj_drop_rate)
def init_weights(self):
trunc_normal_(self.relative_position_bias_table, std=0.02)
def forward(self, x):
"""
Args:
x (tensor): input features with shape of (num_windows*B, N, C).
"""
B, N, C = x.shape
qkv_bias = None
if self.q_bias is not None:
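# BEiT learns biases only for q and v; the bias for k is fixed to zero.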
k_bias = torch.zeros_like(self.v_bias, requires_grad=False)
qkv_bias = torch.cat((self.q_bias, k_bias, self.v_bias))
qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
q, k, v = qkv[0], qkv[1], qkv[2]
q = q * self.scale
attn = (q @ k.transpose(-2, -1))
if self.relative_position_bias_table is not None:
Wh = self.window_size[0]
Ww = self.window_size[1]
relative_position_bias = self.relative_position_bias_table[
self.relative_position_index.view(-1)].view(
Wh * Ww + 1, Wh * Ww + 1, -1)
relative_position_bias = relative_position_bias.permute(
2, 0, 1).contiguous() # nH, Wh*Ww, Wh*Ww
attn = attn + relative_position_bias.unsqueeze(0)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
return x
class TransformerEncoderLayer(BaseModule):
"""Implements one encoder layer in Vision Transformer.
Args:
embed_dims (int): The feature dimension.
num_heads (int): Parallel attention heads.
feedforward_channels (int): The hidden dimension for FFNs.
attn_drop_rate (float): The drop out rate for attention layer.
Default: 0.0.
drop_path_rate (float): Stochastic depth rate. Default 0.0.
num_fcs (int): The number of fully-connected layers for FFNs.
Default: 2.
qv_bias (bool): Enable bias for qv if True. Default: True
act_cfg (dict): The activation config for FFNs.
Default: dict(type='GELU').
norm_cfg (dict): Config dict for normalization layer.
Default: dict(type='LN').
window_size (tuple[int], optional): The height and width of the window.
Default: None.
init_values (float, optional): Initialize the values of BEiTAttention
and FFN with learnable scaling. Default: None.
"""
def __init__(self,
embed_dims,
num_heads,
feedforward_channels,
attn_drop_rate=0.,
drop_path_rate=0.,
num_fcs=2,
qv_bias=True,
act_cfg=dict(type='GELU'),
norm_cfg=dict(type='LN'),
window_size=None,
init_values=None):
super(TransformerEncoderLayer, self).__init__()
self.norm1_name, norm1 = build_norm_layer(
norm_cfg, embed_dims, postfix=1)
self.add_module(self.norm1_name, norm1)
self.attn = BEiTAttention(
embed_dims=embed_dims,
num_heads=num_heads,
window_size=window_size,
qv_bias=qv_bias,
qk_scale=None,
attn_drop_rate=attn_drop_rate,
proj_drop_rate=0.,
init_cfg=None)
self.ffn = FFN(
embed_dims=embed_dims,
feedforward_channels=feedforward_channels,
num_fcs=num_fcs,
ffn_drop=0.,
dropout_layer=None,
act_cfg=act_cfg,
add_identity=False)
self.norm2_name, norm2 = build_norm_layer(
norm_cfg, embed_dims, postfix=2)
self.add_module(self.norm2_name, norm2)
# NOTE: drop path for stochastic depth, we shall see if
# this is better than dropout here
dropout_layer = dict(type='DropPath', drop_prob=drop_path_rate)
self.drop_path = build_dropout(
dropout_layer) if dropout_layer else nn.Identity()
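# Layer scale: learnable per-channel weights (initialized to init_values)
# that scale the attention and FFN branches before the residual addition.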
self.gamma_1 = nn.Parameter(
init_values * torch.ones((embed_dims)), requires_grad=True)
self.gamma_2 = nn.Parameter(
init_values * torch.ones((embed_dims)), requires_grad=True)
@property
def norm1(self):
return getattr(self, self.norm1_name)
@property
def norm2(self):
return getattr(self, self.norm2_name)
def forward(self, x):
x = x + self.drop_path(self.gamma_1 * self.attn(self.norm1(x)))
x = x + self.drop_path(self.gamma_2 * self.ffn(self.norm2(x)))
return x
@BACKBONES.register_module()
class BEiT(BaseModule):
"""BERT Pre-Training of Image Transformers.
Args:
img_size (int | tuple): Input image size. Default: 224.
patch_size (int): The patch size. Default: 16.
in_channels (int): Number of input channels. Default: 3.
embed_dims (int): Embedding dimension. Default: 768.
num_layers (int): Depth of transformer. Default: 12.
num_heads (int): Number of attention heads. Default: 12.
mlp_ratio (int): Ratio of mlp hidden dim to embedding dim.
Default: 4.
out_indices (list | tuple | int): Output from which stages.
Default: -1.
qv_bias (bool): Enable bias for qv if True. Default: True.
attn_drop_rate (float): The drop out rate for attention layer.
Default 0.0
drop_path_rate (float): Stochastic depth rate. Default 0.0.
norm_cfg (dict): Config dict for normalization layer.
Default: dict(type='LN')
act_cfg (dict): The activation config for FFNs.
Default: dict(type='GELU').
patch_norm (bool): Whether to add a norm in PatchEmbed Block.
Default: False.
final_norm (bool): Whether to add an additional layer to normalize the
final feature map. Default: False.
num_fcs (int): The number of fully-connected layers for FFNs.
Default: 2.
norm_eval (bool): Whether to set norm layers to eval mode, namely,
freeze running stats (mean and var). Note: Effect on Batch Norm
and its variants only. Default: False.
pretrained (str, optional): Model pretrained path. Default: None.
init_values (float): Initialize the values of BEiTAttention and FFN
with learnable scaling.
init_cfg (dict or list[dict], optional): Initialization config dict.
Default: None.
"""
def __init__(self,
img_size=224,
patch_size=16,
in_channels=3,
embed_dims=768,
num_layers=12,
num_heads=12,
mlp_ratio=4,
out_indices=-1,
qv_bias=True,
attn_drop_rate=0.,
drop_path_rate=0.,
norm_cfg=dict(type='LN'),
act_cfg=dict(type='GELU'),
patch_norm=False,
final_norm=False,
num_fcs=2,
norm_eval=False,
pretrained=None,
init_values=0.1,
init_cfg=None):
super(BEiT, self).__init__(init_cfg=init_cfg)
if isinstance(img_size, int):
img_size = to_2tuple(img_size)
elif isinstance(img_size, tuple):
if len(img_size) == 1:
img_size = to_2tuple(img_size[0])
assert len(img_size) == 2, \
f'The size of image should have length 1 or 2, ' \
f'but got {len(img_size)}'
assert not (init_cfg and pretrained), \
'init_cfg and pretrained cannot be set at the same time'
if isinstance(pretrained, str):
warnings.warn('DeprecationWarning: pretrained is deprecated, '
'please use "init_cfg" instead')
self.init_cfg = dict(type='Pretrained', checkpoint=pretrained)
elif pretrained is not None:
raise TypeError('pretrained must be a str or None')
self.img_size = img_size
self.patch_size = patch_size
self.norm_eval = norm_eval
self.pretrained = pretrained
self.patch_embed = PatchEmbed(
in_channels=in_channels,
embed_dims=embed_dims,
conv_type='Conv2d',
kernel_size=patch_size,
stride=patch_size,
padding=0,
norm_cfg=norm_cfg if patch_norm else None,
init_cfg=None)
window_size = (img_size[0] // patch_size, img_size[1] // patch_size)
self.patch_shape = window_size
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dims))
if isinstance(out_indices, int):
if out_indices == -1:
out_indices = num_layers - 1
self.out_indices = [out_indices]
elif isinstance(out_indices, list) or isinstance(out_indices, tuple):
self.out_indices = out_indices
else:
raise TypeError('out_indices must be type of int, list or tuple')
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, num_layers)]
self.layers = ModuleList()
for i in range(num_layers):
self.layers.append(
TransformerEncoderLayer(
embed_dims=embed_dims,
num_heads=num_heads,
feedforward_channels=mlp_ratio * embed_dims,
attn_drop_rate=attn_drop_rate,
drop_path_rate=dpr[i],
num_fcs=num_fcs,
qv_bias=qv_bias,
act_cfg=act_cfg,
norm_cfg=norm_cfg,
window_size=window_size,
init_values=init_values))
self.final_norm = final_norm
if final_norm:
self.norm1_name, norm1 = build_norm_layer(
norm_cfg, embed_dims, postfix=1)
self.add_module(self.norm1_name, norm1)
@property
def norm1(self):
return getattr(self, self.norm1_name)
def _geometric_sequence_interpolation(self, src_size, dst_size, sequence,
num):
"""Get new sequence via geometric sequence interpolation.
Args:
src_size (int): Pos_embedding size in pre-trained model.
dst_size (int): Pos_embedding size in the current model.
sequence (tensor): The relative position bias of the pretrain
model after removing the extra tokens.
num (int): Number of attention heads.
Returns:
new_sequence (tensor): The pre-trained relative position bias
interpolated, via the geometric sequence, to the size of
the current model.
"""
def geometric_progression(a, r, n):
return a * (1.0 - r**n) / (1.0 - r)
# Binary search for the common ratio q of the geometric progression.
left, right = 1.01, 1.5
while right - left > 1e-6:
q = (left + right) / 2.0
gp = geometric_progression(1, q, src_size // 2)
if gp > dst_size // 2:
right = q
else:
left = q
# The position of each interpolated point is determined
# by the ratio obtained from the binary search.
dis = []
cur = 1
for i in range(src_size // 2):
dis.append(cur)
cur += q**(i + 1)
r_ids = [-_ for _ in reversed(dis)]
x = r_ids + [0] + dis
y = r_ids + [0] + dis
t = dst_size // 2.0
dx = np.arange(-t, t + 0.1, 1.0)
dy = np.arange(-t, t + 0.1, 1.0)
# Build and evaluate a cubic interpolation function for each attention head.
new_sequence = []
for i in range(num):
z = sequence[:, i].view(src_size, src_size).float().numpy()
f = interpolate.interp2d(x, y, z, kind='cubic')
new_sequence.append(
torch.Tensor(f(dx, dy)).contiguous().view(-1, 1).to(sequence))
new_sequence = torch.cat(new_sequence, dim=-1)
return new_sequence
def resize_rel_pos_embed(self, checkpoint):
"""Resize relative pos_embed weights.
This function is modified from
https://github.com/microsoft/unilm/blob/master/beit/semantic_segmentation/mmcv_custom/checkpoint.py. # noqa: E501
Copyright (c) Microsoft Corporation
Licensed under the MIT License
Args:
checkpoint (dict): Key and value of the pretrain model.
Returns:
state_dict (dict): Interpolate the relative pos_embed weights
in the pre-train model to the current model size.
"""
if 'state_dict' in checkpoint:
state_dict = checkpoint['state_dict']
else:
state_dict = checkpoint
all_keys = list(state_dict.keys())
for key in all_keys:
if 'relative_position_index' in key:
state_dict.pop(key)
# Geometric sequence interpolation is adopted so that the center of
# the position bias stays as consistent as possible after
# interpolation, at the cost of coarser sampling towards the edges.
if 'relative_position_bias_table' in key:
rel_pos_bias = state_dict[key]
src_num_pos, num_attn_heads = rel_pos_bias.size()
dst_num_pos, _ = self.state_dict()[key].size()
dst_patch_shape = self.patch_shape
if dst_patch_shape[0] != dst_patch_shape[1]:
raise NotImplementedError()
# Count the number of extra tokens.
num_extra_tokens = dst_num_pos - (
dst_patch_shape[0] * 2 - 1) * (
dst_patch_shape[1] * 2 - 1)
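# For BEiT these are the 3 cls-related entries of the bias table
# (cls-to-token, token-to-cls and cls-to-cls).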
src_size = int((src_num_pos - num_extra_tokens)**0.5)
dst_size = int((dst_num_pos - num_extra_tokens)**0.5)
if src_size != dst_size:
extra_tokens = rel_pos_bias[-num_extra_tokens:, :]
rel_pos_bias = rel_pos_bias[:-num_extra_tokens, :]
new_rel_pos_bias = self._geometric_sequence_interpolation(
src_size, dst_size, rel_pos_bias, num_attn_heads)
new_rel_pos_bias = torch.cat(
(new_rel_pos_bias, extra_tokens), dim=0)
state_dict[key] = new_rel_pos_bias
return state_dict
def init_weights(self):
def _init_weights(m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=.02)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
self.apply(_init_weights)
if (isinstance(self.init_cfg, dict)
and self.init_cfg.get('type') == 'Pretrained'):
logger = get_root_logger()
checkpoint = _load_checkpoint(
self.init_cfg['checkpoint'], logger=logger, map_location='cpu')
state_dict = self.resize_rel_pos_embed(checkpoint)
self.load_state_dict(state_dict, False)
elif self.init_cfg is not None:
super(BEiT, self).init_weights()
else:
# We only implement the 'jax_impl' initialization implemented at
# https://github.com/rwightman/pytorch-image-models/blob/master/timm/models/vision_transformer.py#L353 # noqa: E501
# Copyright 2019 Ross Wightman
# Licensed under the Apache License, Version 2.0 (the "License")
trunc_normal_(self.cls_token, std=.02)
for n, m in self.named_modules():
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=.02)
if m.bias is not None:
if 'ffn' in n:
nn.init.normal_(m.bias, mean=0., std=1e-6)
else:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.Conv2d):
kaiming_init(m, mode='fan_in', bias=0.)
elif isinstance(m, (_BatchNorm, nn.GroupNorm, nn.LayerNorm)):
constant_init(m, val=1.0, bias=0.)
def forward(self, inputs):
B = inputs.shape[0]
x, hw_shape = self.patch_embed(inputs)
# stole cls_tokens impl from Phil Wang, thanks
cls_tokens = self.cls_token.expand(B, -1, -1)
x = torch.cat((cls_tokens, x), dim=1)
outs = []
for i, layer in enumerate(self.layers):
x = layer(x)
if i == len(self.layers) - 1:
if self.final_norm:
x = self.norm1(x)
if i in self.out_indices:
# Remove class token and reshape token for decoder head
out = x[:, 1:]
B, _, C = out.shape
out = out.reshape(B, hw_shape[0], hw_shape[1],
C).permute(0, 3, 1, 2).contiguous()
outs.append(out)
return tuple(outs)
def train(self, mode=True):
super(BEiT, self).train(mode)
if mode and self.norm_eval:
for m in self.modules():
if isinstance(m, nn.LayerNorm):
m.eval()
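# Usage sketch (shapes match the unit test test_beit.py added in this PR):
#   model = BEiT()
#   feats = model(torch.randn(1, 3, 224, 224))
#   assert feats[-1].shape == (1, 768, 14, 14)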


@ -1,8 +1,11 @@
# Copyright (c) OpenMMLab. All rights reserved.
from .featurepyramid import Feature2Pyramid
from .fpn import FPN
from .ic_neck import ICNeck
from .jpu import JPU
from .mla_neck import MLANeck
from .multilevel_neck import MultiLevelNeck
__all__ = ['FPN', 'MultiLevelNeck', 'MLANeck', 'ICNeck', 'JPU']
__all__ = [
'FPN', 'MultiLevelNeck', 'MLANeck', 'ICNeck', 'JPU', 'Feature2Pyramid'
]


@ -0,0 +1,67 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch.nn as nn
from mmcv.cnn import build_norm_layer
from ..builder import NECKS
@NECKS.register_module()
class Feature2Pyramid(nn.Module):
"""Feature2Pyramid.
A neck structure that connects a ViT backbone to decoder heads.
Args:
embed_dim (int): Embedding dimension.
rescales (list[float]): Different sampling multiples used to
obtain pyramid features. Default: [4, 2, 1, 0.5].
norm_cfg (dict): Config dict for normalization layer.
Default: dict(type='SyncBN', requires_grad=True).
"""
def __init__(self,
embed_dim,
rescales=[4, 2, 1, 0.5],
norm_cfg=dict(type='SyncBN', requires_grad=True)):
super(Feature2Pyramid, self).__init__()
self.rescales = rescales
self.upsample_4x = None
for k in self.rescales:
if k == 4:
self.upsample_4x = nn.Sequential(
nn.ConvTranspose2d(
embed_dim, embed_dim, kernel_size=2, stride=2),
build_norm_layer(norm_cfg, embed_dim)[1],
nn.GELU(),
nn.ConvTranspose2d(
embed_dim, embed_dim, kernel_size=2, stride=2),
)
elif k == 2:
self.upsample_2x = nn.Sequential(
nn.ConvTranspose2d(
embed_dim, embed_dim, kernel_size=2, stride=2))
elif k == 1:
self.identity = nn.Identity()
elif k == 0.5:
self.downsample_2x = nn.MaxPool2d(kernel_size=2, stride=2)
elif k == 0.25:
self.downsample_4x = nn.MaxPool2d(kernel_size=4, stride=4)
else:
raise KeyError(f'invalid {k} for feature2pyramid')
def forward(self, inputs):
assert len(inputs) == len(self.rescales)
outputs = []
if self.upsample_4x is not None:
ops = [
self.upsample_4x, self.upsample_2x, self.identity,
self.downsample_2x
]
else:
ops = [
self.upsample_2x, self.identity, self.downsample_2x,
self.downsample_4x
]
for i in range(len(inputs)):
outputs.append(ops[i](inputs[i]))
return tuple(outputs)
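# Usage sketch (mirrors the Feature2Pyramid unit test in this PR): with
# rescales=[4, 2, 1, 0.5] and four (1, 64, 32, 32) inputs, the outputs have
# spatial sizes 128x128, 64x64, 32x32 and 16x16 respectively.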


@ -1,6 +1,7 @@
Import:
- configs/ann/ann.yml
- configs/apcnet/apcnet.yml
- configs/beit/beit.yml
- configs/bisenetv1/bisenetv1.yml
- configs/bisenetv2/bisenetv2.yml
- configs/ccnet/ccnet.yml


@ -0,0 +1,70 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
import torch.nn as nn
from mmseg.core.layer_decay_optimizer_constructor import \
LayerDecayOptimizerConstructor
layer_wise_gt_lst = [{
'weight_decay': 0.0,
'lr_scale': 16
}, {
'weight_decay': 0.05,
'lr_scale': 8
}, {
'weight_decay': 0.0,
'lr_scale': 8
}, {
'weight_decay': 0.05,
'lr_scale': 4
}, {
'weight_decay': 0.0,
'lr_scale': 4
}, {
'weight_decay': 0.05,
'lr_scale': 2
}, {
'weight_decay': 0.0,
'lr_scale': 2
}]
class BEiTExampleModel(nn.Module):
def __init__(self, depth):
super().__init__()
self.backbone = nn.ModuleList()
# add some variables to meet unit test coverage rate
self.backbone.cls_token = nn.Parameter(torch.ones(1))
self.backbone.patch_embed = nn.Parameter(torch.ones(1))
self.backbone.layers = nn.ModuleList()
for _ in range(depth):
layer = nn.Conv2d(3, 3, 1)
self.backbone.layers.append(layer)
def check_beit_adamw_optimizer(optimizer, gt_lst):
assert isinstance(optimizer, torch.optim.AdamW)
assert optimizer.defaults['lr'] == 1
assert optimizer.defaults['weight_decay'] == 0.05
param_groups = optimizer.param_groups
# 1 group (cls_token and patch_embed) + 3 layers * 2 (weight, bias) = 7 groups
assert len(param_groups) == 7
for i, param_dict in enumerate(param_groups):
assert param_dict['weight_decay'] == gt_lst[i]['weight_decay']
assert param_dict['lr_scale'] == gt_lst[i]['lr_scale']
assert param_dict['lr_scale'] == param_dict['lr']
def test_beit_layer_decay_optimizer_constructor():
# paramwise_cfg with BEiTExampleModel
model = BEiTExampleModel(depth=3)
optimizer_cfg = dict(
type='AdamW', lr=1, betas=(0.9, 0.999), weight_decay=0.05)
paramwise_cfg = dict(num_layers=3, layer_decay_rate=2)
optim_constructor = LayerDecayOptimizerConstructor(optimizer_cfg,
paramwise_cfg)
optimizer = optim_constructor(model)
check_beit_adamw_optimizer(optimizer, layer_wise_gt_lst)


@ -0,0 +1,182 @@
# Copyright (c) OpenMMLab. All rights reserved.
import pytest
import torch
from mmseg.models.backbones.beit import BEiT
from .utils import check_norm_state
def test_beit_backbone():
with pytest.raises(TypeError):
# pretrained must be a string path
model = BEiT()
model.init_weights(pretrained=0)
with pytest.raises(TypeError):
# img_size must be int or tuple
model = BEiT(img_size=512.0)
with pytest.raises(TypeError):
# out_indices must be int, list or tuple
model = BEiT(out_indices=1.)
with pytest.raises(AssertionError):
# The length of the img_size tuple must be 1 or 2.
BEiT(img_size=(224, 224, 224))
with pytest.raises(TypeError):
# pretrained must be None or a str.
BEiT(pretrained=123)
# Test img_size as a length-1 tuple
imgs = torch.randn(1, 3, 224, 224)
model = BEiT(img_size=(224, ))
model.init_weights()
model(imgs)
# Test img_size as a length-2 tuple
imgs = torch.randn(1, 3, 224, 224)
model = BEiT(img_size=(224, 224))
model(imgs)
# Test norm_eval = True
model = BEiT(norm_eval=True)
model.train()
# Test BEiT backbone with input size of 224 and patch size of 16
model = BEiT()
model.init_weights()
model.train()
# Test qv_bias
model = BEiT(qv_bias=False)
model.train()
# Test out_indices = list
model = BEiT(out_indices=[2, 4, 8, 12])
model.train()
assert check_norm_state(model.modules(), True)
# Test image size = (224, 224)
imgs = torch.randn(1, 3, 224, 224)
feat = model(imgs)
assert feat[-1].shape == (1, 768, 14, 14)
# Test BEiT backbone with input size of 256 and patch size of 16
model = BEiT(img_size=(256, 256))
model.init_weights()
model.train()
imgs = torch.randn(1, 3, 256, 256)
feat = model(imgs)
assert feat[-1].shape == (1, 768, 16, 16)
# Test BEiT backbone with input size of 32 and patch size of 16
model = BEiT(img_size=(32, 32))
model.init_weights()
model.train()
imgs = torch.randn(1, 3, 32, 32)
feat = model(imgs)
assert feat[-1].shape == (1, 768, 2, 2)
# Test unbalanced size input image
model = BEiT(img_size=(112, 224))
model.init_weights()
model.train()
imgs = torch.randn(1, 3, 112, 224)
feat = model(imgs)
assert feat[-1].shape == (1, 768, 7, 14)
# Test irregular input image
model = BEiT(img_size=(234, 345))
model.init_weights()
model.train()
imgs = torch.randn(1, 3, 234, 345)
feat = model(imgs)
assert feat[-1].shape == (1, 768, 14, 21)
# Test init_values=0
model = BEiT(init_values=0)
imgs = torch.randn(1, 3, 224, 224)
feat = model(imgs)
assert feat[-1].shape == (1, 768, 14, 14)
# Test final norm
model = BEiT(final_norm=True)
imgs = torch.randn(1, 3, 224, 224)
feat = model(imgs)
assert feat[-1].shape == (1, 768, 14, 14)
# Test patch norm
model = BEiT(patch_norm=True)
imgs = torch.randn(1, 3, 224, 224)
feat = model(imgs)
assert feat[-1].shape == (1, 768, 14, 14)
def test_beit_init():
path = 'PATH_THAT_DO_NOT_EXIST'
# Test all combinations of pretrained and init_cfg
# pretrained=None, init_cfg=None
model = BEiT(pretrained=None, init_cfg=None)
assert model.init_cfg is None
model.init_weights()
# pretrained=None
# init_cfg loads pretrain from a non-existent file
model = BEiT(
pretrained=None, init_cfg=dict(type='Pretrained', checkpoint=path))
assert model.init_cfg == dict(type='Pretrained', checkpoint=path)
# Test loading a checkpoint from a non-existent file
with pytest.raises(OSError):
model.init_weights()
# test resize_rel_pos_embed
value = torch.randn(732, 16)
ckpt = {
'state_dict': {
'layers.0.attn.relative_position_index': 0,
'layers.0.attn.relative_position_bias_table': value
}
}
model = BEiT(img_size=(512, 512))
with pytest.raises(AttributeError):
model.resize_rel_pos_embed(ckpt)
# pretrained=None
# init_cfg=123, whose type is unsupported
model = BEiT(pretrained=None, init_cfg=123)
with pytest.raises(TypeError):
model.init_weights()
# pretrained loads pretrain from a non-existent file
# init_cfg=None
model = BEiT(pretrained=path, init_cfg=None)
assert model.init_cfg == dict(type='Pretrained', checkpoint=path)
# Test loading a checkpoint from a non-existent file
with pytest.raises(OSError):
model.init_weights()
# pretrained loads pretrain from a non-existent file
# init_cfg loads pretrain from a non-existent file
with pytest.raises(AssertionError):
model = BEiT(
pretrained=path, init_cfg=dict(type='Pretrained', checkpoint=path))
with pytest.raises(AssertionError):
model = BEiT(pretrained=path, init_cfg=123)
# pretrain=123, whose type is unsupported
# init_cfg=None
with pytest.raises(TypeError):
model = BEiT(pretrained=123, init_cfg=None)
# pretrain=123, whose type is unsupported
# init_cfg loads pretrain from a non-existent file
with pytest.raises(AssertionError):
model = BEiT(
pretrained=123, init_cfg=dict(type='Pretrained', checkpoint=path))
# pretrain=123, whose type is unsupported
# init_cfg=123, whose type is unsupported
with pytest.raises(AssertionError):
model = BEiT(pretrained=123, init_cfg=123)


@ -0,0 +1,38 @@
# Copyright (c) OpenMMLab. All rights reserved.
import pytest
import torch
from mmseg.models import Feature2Pyramid
def test_feature2pyramid():
# test default rescales = [4, 2, 1, 0.5]
rescales = [4, 2, 1, 0.5]
embed_dim = 64
inputs = [torch.randn(1, embed_dim, 32, 32) for i in range(len(rescales))]
fpn = Feature2Pyramid(
embed_dim, rescales, norm_cfg=dict(type='BN', requires_grad=True))
outputs = fpn(inputs)
assert outputs[0].shape == torch.Size([1, 64, 128, 128])
assert outputs[1].shape == torch.Size([1, 64, 64, 64])
assert outputs[2].shape == torch.Size([1, 64, 32, 32])
assert outputs[3].shape == torch.Size([1, 64, 16, 16])
# test rescales = [2, 1, 0.5, 0.25]
rescales = [2, 1, 0.5, 0.25]
inputs = [torch.randn(1, embed_dim, 32, 32) for i in range(len(rescales))]
fpn = Feature2Pyramid(
embed_dim, rescales, norm_cfg=dict(type='BN', requires_grad=True))
outputs = fpn(inputs)
assert outputs[0].shape == torch.Size([1, 64, 64, 64])
assert outputs[1].shape == torch.Size([1, 64, 32, 32])
assert outputs[2].shape == torch.Size([1, 64, 16, 16])
assert outputs[3].shape == torch.Size([1, 64, 8, 8])
# test rescales = [4, 2, 0.25, 0]
rescales = [4, 2, 0.25, 0]
with pytest.raises(KeyError):
fpn = Feature2Pyramid(
embed_dim, rescales, norm_cfg=dict(type='BN', requires_grad=True))


@ -0,0 +1,56 @@
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import os.path as osp
from collections import OrderedDict
import mmcv
import torch
from mmcv.runner import CheckpointLoader
def convert_beit(ckpt):
new_ckpt = OrderedDict()
for k, v in ckpt.items():
if k.startswith('patch_embed'):
new_key = k.replace('patch_embed.proj', 'patch_embed.projection')
new_ckpt[new_key] = v
if k.startswith('blocks'):
new_key = k.replace('blocks', 'layers')
if 'norm' in new_key:
new_key = new_key.replace('norm', 'ln')
elif 'mlp.fc1' in new_key:
new_key = new_key.replace('mlp.fc1', 'ffn.layers.0.0')
elif 'mlp.fc2' in new_key:
new_key = new_key.replace('mlp.fc2', 'ffn.layers.1')
new_ckpt[new_key] = v
else:
new_key = k
new_ckpt[new_key] = v
return new_ckpt
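# Example mappings produced by convert_beit (derived from the rules above):
#   'patch_embed.proj.weight' -> 'patch_embed.projection.weight'
#   'blocks.0.norm1.weight'   -> 'layers.0.ln1.weight'
#   'blocks.0.mlp.fc1.weight' -> 'layers.0.ffn.layers.0.0.weight'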
def main():
parser = argparse.ArgumentParser(
description='Convert keys in official pretrained beit models to '
'MMSegmentation style.')
parser.add_argument('src', help='src model path or url')
# The dst path must be a full path of the new checkpoint.
parser.add_argument('dst', help='save path')
args = parser.parse_args()
checkpoint = CheckpointLoader.load_checkpoint(args.src, map_location='cpu')
if 'state_dict' in checkpoint:
state_dict = checkpoint['state_dict']
elif 'model' in checkpoint:
state_dict = checkpoint['model']
else:
state_dict = checkpoint
weight = convert_beit(state_dict)
mmcv.mkdir_or_exist(osp.dirname(args.dst))
torch.save(weight, args.dst)
if __name__ == '__main__':
main()