[Feature] Support GLIP (#1308)

* rebase

* add glip

* update glip

* add links

* rename

* fix doc

---------

Co-authored-by: Ezra-Yu <18586273+Ezra-Yu@users.noreply.github.com>
takuoko 2023-04-17 20:19:23 +09:00 committed by GitHub
parent 2c913020b9
commit fec3da781f
8 changed files with 221 additions and 0 deletions

@@ -200,6 +200,7 @@ Results and models are available in the [model zoo](https://mmpretrain.readthedo
<li><a href="configs/xcit">XCiT</a></li>
<li><a href="configs/levit">LeViT</a></li>
<li><a href="configs/riformer">RIFormer</a></li>
<li><a href="configs/glip">GLIP</a></li>
</ul>
</td>
<td>

@@ -196,6 +196,7 @@ mim install -e .
<li><a href="configs/xcit">XCiT</a></li>
<li><a href="configs/levit">LeViT</a></li>
<li><a href="configs/riformer">RIFormer</a></li>
<li><a href="configs/glip">GLIP</a></li>
</ul>
</td>
<td>

@@ -0,0 +1,57 @@
# GLIP
> [Grounded Language-Image Pre-training](https://arxiv.org/abs/2112.03857)
<!-- [ALGORITHM] -->
## Abstract
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks. 1) When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines. 2) After fine-tuned on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA. 3) When transferred to 13 downstream object detection tasks, a 1-shot GLIP rivals with a fully-supervised Dynamic Head.
<div align="center">
<img src="https://github.com/microsoft/GLIP/blob/main/docs/lead.png" width="70%"/>
</div>
## How to use it?
<!-- [TABS-BEGIN] -->
**Use the model**
```python
import torch
from mmpretrain import get_model
model = get_model('swin-t_glip-pre_3rdparty', pretrained=True)
inputs = torch.rand(1, 3, 224, 224)
out = model(inputs)
print(type(out))
# To extract features.
feats = model.extract_feat(inputs)
print(type(feats))
```
<!-- [TABS-END] -->
## Results and models

### Pre-trained models

The pre-trained models are only used for fine-tuning on downstream tasks, so no evaluation results are reported for them (see the sketch after the table for one way to attach a classification head).

| Model                                       |          Pretrain          | Resolution |                                                    Download                                                     |
| :------------------------------------------ | :------------------------: | :--------: | :-------------------------------------------------------------------------------------------------------------: |
| GLIP-T (`swin-t_glip-pre_3rdparty`)\*       |    O365,GoldG,CC3M,SBU     |  224x224   |  [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth)  |
| GLIP-L (`swin-l_glip-pre_3rdparty_384px`)\* | FourODs,GoldG,CC3M+12M,SBU |  384x384   | [model](https://download.openmmlab.com/mmclassification/v1/glip/swin-l_glip-pre_3rdparty_384px_20230413-04b198e8.pth) |

*Models with \* are converted from the [official repo](https://github.com/microsoft/GLIP).*
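
Since both checkpoints ship without a neck or head, fine-tuning requires attaching them yourself. Below is a minimal, hedged sketch, assuming `get_model` accepts model-config overrides as keyword arguments; the `GlobalAveragePooling` neck, `LinearClsHead` and `num_classes=1000` are illustrative choices, not part of the released configs.

```python
from mmpretrain import get_model

# Attach a pooling neck and a linear classification head on top of the
# GLIP-pretrained Swin-T backbone. The neck/head settings are examples only.
model = get_model(
    'swin-t_glip-pre_3rdparty',
    pretrained=True,
    neck=dict(type='GlobalAveragePooling'),
    head=dict(
        type='LinearClsHead',
        num_classes=1000,
        in_channels=768,  # channels of the last Swin-T stage
        loss=dict(type='CrossEntropyLoss')))
print(model.head)
```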
## Citation
```bibtex
@inproceedings{li2021grounded,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      year={2022},
      booktitle={CVPR},
}
```

@@ -0,0 +1,18 @@
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='SwinTransformer',
        arch='large',
        img_size=384,
        out_indices=(1, 2, 3),  # original weight is for detection
        stage_cfgs=dict(block_cfgs=dict(window_size=12))),
    neck=None,
    head=None)

data_preprocessor = dict(
    # BGR format normalization parameters (ImageNet statistics in BGR order)
    mean=[103.53, 116.28, 123.675],
    std=[57.375, 57.12, 58.395],
    # keep the loaded BGR images as-is; do not convert to RGB
    to_rgb=False,
)
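
The `data_preprocessor` above uses the ImageNet statistics in BGR channel order and sets `to_rgb=False`, so images keep the BGR order they are loaded in. A small NumPy sketch of the equivalent arithmetic (the random image here is purely illustrative):

```python
import numpy as np

# Per-channel statistics copied from the data_preprocessor above (BGR order).
mean = np.array([103.53, 116.28, 123.675], dtype=np.float32)
std = np.array([57.375, 57.12, 58.395], dtype=np.float32)

# Illustrative BGR image in [0, 255]; real inputs come from the data pipeline.
bgr = np.random.randint(0, 256, size=(384, 384, 3)).astype(np.float32)

# to_rgb=False: no channel swap, just per-channel standardization.
normalized = (bgr - mean) / std
print(normalized.shape, normalized.mean(axis=(0, 1)))
```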

@@ -0,0 +1,18 @@
model = dict(
    type='ImageClassifier',
    backbone=dict(
        type='SwinTransformer',
        arch='tiny',
        img_size=224,
        out_indices=(1, 2, 3),  # original weight is for detection
    ),
    neck=None,
    head=None)

data_preprocessor = dict(
    # BGR format normalization parameters (ImageNet statistics in BGR order)
    mean=[103.53, 116.28, 123.675],
    std=[57.375, 57.12, 58.395],
    # keep the loaded BGR images as-is; do not convert to RGB
    to_rgb=False,
)
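
A minimal sketch (weights randomly initialized, for shape checking only) of building just the Swin-T backbone described above through the `MODELS` registry and inspecting the three feature maps selected by `out_indices=(1, 2, 3)`:

```python
import torch
from mmpretrain.registry import MODELS

# Build only the backbone from the dict above; weights are random here.
backbone = MODELS.build(
    dict(type='SwinTransformer', arch='tiny', img_size=224, out_indices=(1, 2, 3)))
backbone.eval()

with torch.no_grad():
    feats = backbone(torch.rand(1, 3, 224, 224))

for feat in feats:
    # Expect 192, 384 and 768 channels at strides 8, 16 and 32 for Swin-T.
    print(feat.shape)
```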

@@ -0,0 +1,49 @@
Collections:
  - Name: GLIP
    Metadata:
      Training Techniques:
        - AdamW
        - Weight Decay
      Architecture:
        - Shift Window Multihead Self Attention
    Paper:
      URL: https://arxiv.org/abs/2112.03857
      Title: "Grounded Language-Image Pre-training"
    README: configs/glip/README.md
    Code:
      URL: https://github.com/open-mmlab/mmpretrain/blob/main/mmpretrain/models/backbones/swin_transformer.py
      Version: v1.0.0rc8

Models:
  - Name: swin-t_glip-pre_3rdparty
    In Collection: GLIP
    Metadata:
      FLOPs: 4508464128
      Parameters: 29056354
      Training Data:
        - O365
        - GoldG
        - CC3M
        - SBU
    Results: null
    Weights: https://download.openmmlab.com/mmclassification/v1/glip/swin-t_glip-pre_3rdparty_20230413-d85813b5.pth
    Converted From:
      Weights: https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_tiny_model_o365_goldg_cc_sbu.pth
      Code: https://github.com/microsoft/GLIP
    Config: configs/glip/glip-t_headless.py
  - Name: swin-l_glip-pre_3rdparty_384px
    In Collection: GLIP
    Metadata:
      FLOPs: 104080343040
      Parameters: 196735516
      Training Data:
        - FourODs
        - GoldG
        - CC3M+12M
        - SBU
    Results: null
    Weights: https://download.openmmlab.com/mmclassification/v1/glip/swin-l_glip-pre_3rdparty_384px_20230413-04b198e8.pth
    Converted From:
      Weights: https://penzhanwu2bbs.blob.core.windows.net/data/GLIPv1_Open/models/glip_large_model.pth
      Code: https://github.com/microsoft/GLIP
    Config: configs/glip/glip-l_headless.py
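
Once this metafile is imported into the model index (the diff below adds that entry), the two model names registered above should become discoverable through mmpretrain's model-zoo API. A quick, hedged check:

```python
from mmpretrain import get_model, list_models

# The two names registered in this metafile should show up in the list.
print(list_models('*glip*'))

# Build GLIP-L without downloading weights; pass pretrained=True to load them.
model = get_model('swin-l_glip-pre_3rdparty_384px', pretrained=False)
print(type(model))
```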

@@ -68,3 +68,4 @@ Import:
- configs/milan/metafile.yml
- configs/riformer/metafile.yml
- configs/sam/metafile.yml
- configs/glip/metafile.yml

@@ -0,0 +1,76 @@
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import os.path as osp
from collections import OrderedDict

import mmengine
import torch
from mmengine.runner import CheckpointLoader


def convert_glip(ckpt):

    def correct_unfold_reduction_order(x):
        # Reorder the PatchMerging reduction weight from the original Swin
        # slicing layout to the nn.Unfold layout used by mmpretrain.
        out_channel, in_channel = x.shape
        x = x.reshape(out_channel, 4, in_channel // 4)
        x = x[:, [0, 2, 1, 3], :].transpose(1,
                                            2).reshape(out_channel, in_channel)
        return x

    def correct_unfold_norm_order(x):
        # Apply the same reordering to the PatchMerging norm parameters.
        in_channel = x.shape[0]
        x = x.reshape(4, in_channel // 4)
        x = x[[0, 2, 1, 3], :].transpose(0, 1).reshape(in_channel)
        return x

    new_ckpt = OrderedDict()

    for k, v in list(ckpt.items()):
        # Keep only the image backbone; skip the language tower, FPN and heads.
        if 'language_backbone' in k or 'backbone' not in k or 'fpn' in k:
            continue
        new_v = v
        new_k = k.replace('body.', '')
        new_k = new_k.replace('module.', '')
        if new_k.startswith('backbone.layers'):
            new_k = new_k.replace('backbone.layers', 'backbone.stages')
        if 'mlp' in new_k:
            new_k = new_k.replace('mlp.fc1', 'ffn.layers.0.0')
            new_k = new_k.replace('mlp.fc2', 'ffn.layers.1')
        elif 'attn' in new_k:
            new_k = new_k.replace('attn', 'attn.w_msa')
        elif 'patch_embed' in k:
            new_k = new_k.replace('proj', 'projection')
        elif 'downsample' in new_k:
            if 'reduction.' in k:
                new_v = correct_unfold_reduction_order(new_v)
            elif 'norm.' in k:
                new_v = correct_unfold_norm_order(new_v)
        new_ckpt[new_k] = new_v

    return new_ckpt


def main():
    parser = argparse.ArgumentParser(
        description='Convert keys in pretrained glip models to mmcls style.')
    parser.add_argument('src', help='src model path or url')
    # The dst path must be a full path of the new checkpoint.
    parser.add_argument('dst', help='save path')
    args = parser.parse_args()

    checkpoint = CheckpointLoader.load_checkpoint(args.src, map_location='cpu')
    if 'model' in checkpoint:
        state_dict = checkpoint['model']
    else:
        state_dict = checkpoint

    weight = convert_glip(state_dict)
    mmengine.mkdir_or_exist(osp.dirname(args.dst))
    torch.save(weight, args.dst)

    print('Done!!')


if __name__ == '__main__':
    main()
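
Because the converter keeps only `backbone.*` keys renamed to mmpretrain's Swin layout, its output can be loaded straight into the headless config. A hedged sketch, assuming the script was run beforehand; the checkpoint filename `swin-t_glip-pre.pth` is illustrative:

```python
import torch
from mmengine.config import Config
from mmpretrain.registry import MODELS

# Build the headless GLIP-T model and load a checkpoint written by the
# converter above ('swin-t_glip-pre.pth' is an assumed output path).
cfg = Config.fromfile('configs/glip/glip-t_headless.py')
model = MODELS.build(cfg.model)

state_dict = torch.load('swin-t_glip-pre.pth', map_location='cpu')
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print('missing keys:', missing)
print('unexpected keys:', unexpected)
```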