Merge pull request #1 from alibaba/master

merge master
pull/198/head
Cathy0908 2022-09-21 20:37:57 +08:00 committed by GitHub
commit 594dc823c3
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
375 changed files with 19825 additions and 4907 deletions

7
.gitattributes vendored 100644
View File

@ -0,0 +1,7 @@
*.jpg filter=lfs diff=lfs merge=lfs -text
*.png filter=lfs diff=lfs merge=lfs -text
*.mp4 filter=lfs diff=lfs merge=lfs -text
*.wav filter=lfs diff=lfs merge=lfs -text
*.JPEG filter=lfs diff=lfs merge=lfs -text
*.jpeg filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
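These are the attribute lines that `git lfs track` writes: any path matching one of the patterns is stored as an LFS pointer instead of a regular blob. As a quick, hypothetical illustration of what the file encodes (assuming it is read from the repository root), the LFS-managed patterns can be listed with a few lines of Python:

```python
# Illustrative sketch (not part of the repo): list the patterns that .gitattributes
# routes through Git LFS, i.e. every path whose filter attribute is set to lfs.
from pathlib import Path

def lfs_patterns(attr_file: str = '.gitattributes'):
    patterns = []
    for line in Path(attr_file).read_text().splitlines():
        parts = line.split()
        if parts and 'filter=lfs' in parts[1:]:
            patterns.append(parts[0])
    return patterns

print(lfs_patterns())  # ['*.jpg', '*.png', '*.mp4', '*.wav', '*.JPEG', '*.jpeg', '*.pth']
```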

View File

@ -43,6 +43,10 @@ jobs:
steps:
- name: Checkout
uses: actions/checkout@v2
with:
lfs: 'true'
- name: Checkout LFS objects
run: git lfs checkout
- name: Run unittest
shell: bash
run: |
@ -67,6 +71,7 @@ jobs:
PYTHONPATH=. python tests/run.py
# blade test env will be updated! we do not support test with trt_efficient_nms
ut-torch181-blade:
# The type of runner that the job will run on
runs-on: [unittest-t4]

3
.gitignore vendored
View File

@ -137,6 +137,3 @@ pai_jobs/easycv/resources/
*.tar.gz
thirdparty/test
scripts/test
# easycv default cache dir
.easycv_cache

179
README.md
View File

@ -20,30 +20,53 @@ English | [简体中文](README_zh-CN.md)
## Introduction
EasyCV is an all-in-one computer vision toolbox based on PyTorch, mainly focus on self-supervised learning, transformer based models, and SOTA CV tasks including image classification, metric-learning, object detection, pose estimation and so on.
EasyCV is an all-in-one computer vision toolbox based on PyTorch, mainly focuses on self-supervised learning, transformer based models, and major CV tasks including image classification, metric-learning, object detection, pose estimation, and so on.
### Major features
- **SOTA SSL Algorithms**
EasyCV provides state-of-the-art algorithms in self-supervised learning based on contrastive learning such as SimCLR, MoCO V2, Swav, DINO and also MAE based on masked image modeling. We also provide standard benchmark tools for ssl model evaluation.
EasyCV provides state-of-the-art algorithms in self-supervised learning based on contrastive learning such as SimCLR, MoCO V2, Swav, DINO, and also MAE based on masked image modeling. We also provide standard benchmarking tools for ssl model evaluation.
- **Vision Transformers**
EasyCV aims to provide an easy way to use the off-the-shelf SOTA transformer models trained either using supervised learning or self-supervised learning, such as ViT, Swin-Transformer and Shuffle Transformer. More models will be added in the future. In addition, we support all the pretrained models from [timm](https://github.com/rwightman/pytorch-image-models).
EasyCV aims to provide an easy way to use the off-the-shelf SOTA transformer models trained either using supervised learning or self-supervised learning, such as ViT, Swin Transformer, and DETR Series. More models will be added in the future. In addition, we support all the pretrained models from [timm](https://github.com/rwightman/pytorch-image-models).
- **Functionality & Extensibility**
In addition to SSL, EasyCV also support image classification, object detection, metric learning, and more area will be supported in the future. Although convering different area,
EasyCV decompose the framework into different componets such as dataset, model, running hook, making it easy to add new compoenets and combining it with existing modules.
In addition to SSL, EasyCV also supports image classification, object detection, metric learning, and more areas will be supported in the future. Although covering different areas,
EasyCV decomposes the framework into different components such as dataset, model and running hook, making it easy to add new components and combine them with existing modules.
EasyCV provide simple and comprehensive interface for inference. Additionaly, all models are supported on [PAI-EAS](https://help.aliyun.com/document_detail/113696.html), which can be easily deployed as online service and support automatic scaling and service monitoring.
EasyCV provides a simple and comprehensive interface for inference. Additionally, all models are supported on [PAI-EAS](https://help.aliyun.com/document_detail/113696.html), where they can be easily deployed as online services with automatic scaling and service monitoring.
- **Efficiency**
EasyCV support multi-gpu and multi worker training. EasyCV use [DALI](https://github.com/NVIDIA/DALI) to accelerate data io and preprocessing process, and use [TorchAccelerator](https://github.com/alibaba/EasyCV/tree/master/docs/source/tutorials/torchacc.md) and fp16 to accelerate training process. For inference optimization, EasyCV export model using jit script, which can be optimized by [PAI-Blade](https://help.aliyun.com/document_detail/205134.html)
EasyCV supports multi-GPU and multi-worker training. EasyCV uses [DALI](https://github.com/NVIDIA/DALI) to accelerate data IO and preprocessing, and uses [TorchAccelerator](https://github.com/alibaba/EasyCV/tree/master/docs/source/tutorials/torchacc.md) and fp16 to accelerate training. For inference optimization, EasyCV exports models using jit script, which can be optimized by [PAI-Blade](https://help.aliyun.com/document_detail/205134.html).
## What's New
[🔥 Latest News] We have released our YOLOX-PAI that achieves SOTA results within 40~50 mAP (less than 1ms). And we also provide a convenient and fast export/predictor api for end2end object detection. To get a quick start of YOLOX-PAI, click [here](docs/source/tutorials/yolox.md)!
* 31/08/2022 EasyCV v0.6.0 was released.
- Release YOLOX-PAI which achieves SOTA results within 40~50 mAP (less than 1ms)
- Add detection algo DINO which achieves 58.5 mAP on COCO
- Add mask2former algo
- Releases imagenet1k, imagenet22k, coco, lvis, voc2012 data with BaiduDisk to accelerate downloading
Please refer to [change_log.md](docs/source/change_log.md) for more details and history.
## Technical Articles
We have a series of technical articles on the functionalities of EasyCV.
* [EasyCV开源开箱即用的视觉自监督+Transformer算法库](https://zhuanlan.zhihu.com/p/505219993)
* [MAE自监督算法介绍和基于EasyCV的复现](https://zhuanlan.zhihu.com/p/515859470)
* [基于EasyCV复现ViTDet单层特征超越FPN](https://zhuanlan.zhihu.com/p/528733299)
* [基于EasyCV复现DETR和DAB-DETR，Object Query的正确打开方式](https://zhuanlan.zhihu.com/p/543129581)
* [YOLOX-PAI: 加速YOLOX, 比YOLOv6更快更强](https://zhuanlan.zhihu.com/p/560597953)
## Installation
Please refer to the installation section in [quick_start.md](docs/source/quick_start.md) for installation.
@ -55,20 +78,123 @@ Please refer to [quick_start.md](docs/source/quick_start.md) for quick start. We
* [self-supervised learning](docs/source/tutorials/ssl.md)
* [image classification](docs/source/tutorials/cls.md)
* [object detection with yolox](docs/source/tutorials/yolox.md)
* [object detection with yolox-pai](docs/source/tutorials/yolox.md)
* [model compression with yolox](docs/source/tutorials/compression.md)
* [metric learning](docs/source/tutorials/metric_learning.md)
* [torchacc](https://github.com/alibaba/EasyCV/blob/master/docs/source/tutorials/torchacc.md)
* [torchacc](docs/source/tutorials/torchacc.md)
notebook
* [self-supervised learning](docs/source/tutorials/EasyCV图像自监督训练-MAE.ipynb)
* [image classification](docs/source/tutorials/EasyCV图像分类resnet50.ipynb)
* [object detection with yolox](docs/source/tutorials/EasyCV图像检测YoloX.ipynb)
* [object detection with yolox-pai](docs/source/tutorials/EasyCV图像检测YoloX.ipynb)
* [metric learning](docs/source/tutorials/EasyCV度量学习resnet50.ipynb)
## Model Zoo
<div align="center">
<b>Architectures</b>
</div>
<table align="center">
<tbody>
<tr align="center">
<td>
<b>Self-Supervised Learning</b>
</td>
<td>
<b>Image Classification</b>
</td>
<td>
<b>Object Detection</b>
</td>
<td>
<b>Segmentation</b>
</td>
</tr>
<tr valign="top">
<td>
<ul>
<li><a href="configs/selfsup/byol">BYOL (NeurIPS'2020)</a></li>
<li><a href="configs/selfsup/dino">DINO (ICCV'2021)</a></li>
<li><a href="configs/selfsup/mixco">MiXCo (NeurIPS'2020)</a></li>
<li><a href="configs/selfsup/moby">MoBY (ArXiv'2021)</a></li>
<li><a href="configs/selfsup/mocov2">MoCov2 (ArXiv'2020)</a></li>
<li><a href="configs/selfsup/simclr">SimCLR (ICML'2020)</a></li>
<li><a href="configs/selfsup/swav">SwAV (NeurIPS'2020)</a></li>
<li><a href="configs/selfsup/mae">MAE (CVPR'2022)</a></li>
<li><a href="configs/selfsup/fast_convmae">FastConvMAE (ArXiv'2022)</a></li>
</ul>
</td>
<td>
<ul>
<li><a href="configs/classification/imagenet/resnet">ResNet (CVPR'2016)</a></li>
<li><a href="configs/classification/imagenet/resnext">ResNeXt (CVPR'2017)</a></li>
<li><a href="configs/classification/imagenet/hrnet">HRNet (CVPR'2019)</a></li>
<li><a href="configs/classification/imagenet/vit">ViT (ICLR'2021)</a></li>
<li><a href="configs/classification/imagenet/swint">SwinT (ICCV'2021)</a></li>
<li><a href="configs/classification/imagenet/efficientformer">EfficientFormer (ArXiv'2022)</a></li>
<li><a href="configs/classification/imagenet/timm/deit">DeiT (ICML'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/xcit">XCiT (ArXiv'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/tnt">TNT (NeurIPS'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/convit">ConViT (ArXiv'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/cait">CaiT (ICCV'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/levit">LeViT (ICCV'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/convnext">ConvNeXt (CVPR'2022)</a></li>
<li><a href="configs/classification/imagenet/timm/resmlp">ResMLP (ArXiv'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/coat">CoaT (ICCV'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/convmixer">ConvMixer (ICLR'2022)</a></li>
<li><a href="configs/classification/imagenet/timm/mlp-mixer">MLP-Mixer (ArXiv'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/nest">NesT (AAAI'2022)</a></li>
<li><a href="configs/classification/imagenet/timm/pit">PiT (ArXiv'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/twins">Twins (NeurIPS'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/shuffle_transformer">Shuffle Transformer (ArXiv'2021)</a></li>
</ul>
</td>
<td>
<ul>
<li><a href="configs/detection/fcos">FCOS (ICCV'2019)</a></li>
<li><a href="configs/detection/yolox">YOLOX (ArXiv'2021)</a></li>
<li><a href="configs/detection/yolox">YOLOX-PAI (ArXiv'2022)</a></li>
<li><a href="configs/detection/detr">DETR (ECCV'2020)</a></li>
<li><a href="configs/detection/dab_detr">DAB-DETR (ICLR'2022)</a></li>
<li><a href="configs/detection/dab_detr">DN-DETR (CVPR'2022)</a></li>
<li><a href="configs/detection/dino">DINO (ArXiv'2022)</a></li>
</ul>
</td>
<td>
</ul>
<li><b>Instance Segmentation</b></li>
<ul>
<ul>
<li><a href="configs/detection/mask_rcnn">Mask R-CNN (ICCV'2017)</a></li>
<li><a href="configs/detection/vitdet">ViTDet (ArXiv'2022)</a></li>
<li><a href="configs/segmentation/mask2former">Mask2Former (CVPR'2022)</a></li>
</ul>
</ul>
</ul>
<li><b>Semantic Segmentation</b></li>
<ul>
<ul>
<li><a href="configs/segmentation/fcn">FCN (CVPR'2015)</a></li>
<li><a href="configs/segmentation/upernet">UperNet (ECCV'2018)</a></li>
</ul>
</ul>
</ul>
<li><b>Panoptic Segmentation</b></li>
<ul>
<ul>
<li><a href="configs/segmentation/mask2former">Mask2Former (CVPR'2022)</a></li>
</ul>
</ul>
</ul>
</td>
</tr>
</td>
</tr>
</tbody>
</table>
Please refer to the following model zoo for more details.
- [self-supervised learning model zoo](docs/source/model_zoo_ssl.md)
@ -78,41 +204,14 @@ Please refer to the following model zoo for more details.
## Data Hub
EasyCV have collected dataset info for different senarios, making it easy for users to fintune or evaluate models in EasyCV modelzoo.
EasyCV has collected dataset info for different scenarios, making it easy for users to finetune or evaluate models in the EasyCV model zoo.
Please refer to [data_hub.md](https://github.com/alibaba/EasyCV/blob/master/docs/source/data_hub.md).
## ChangeLog
* 28/07/2022 EasyCV v0.5.0 was released.
* Self-Supervised support ConvMAE algorithm
* Classification support EfficientFormer algorithm
* Detection support FCOS、DETR、DAB-DETR and DN-DETR algorithm
* Segmentation support UperNet algorithm
* Support use [torchacc](https://github.com/alibaba/EasyCV/blob/master/docs/source/tutorials/torchacc.md) to speed up training
* Support use analyze tools
* 23/06/2022 EasyCV v0.4.0 was released.
* Add semantic segmentation modules, support FCN algorithm
* Expand classification model zoo
* Support export model with [blade](https://help.aliyun.com/document_detail/205134.html) for yolox
* Support ViTDet algorithm
* Add sailfish for extensible fully sharded data parallel training
* Support run with [mmdetection](https://github.com/open-mmlab/mmdetection) models
* 31/04/2022 EasyCV v0.3.0 was released.
* Update moby pretrained model to deit small
* Add mae vit-large benchmark and pretrained models
* Support image visualization for tensorboard and wandb
* 07/04/2022 EasyCV v0.2.2 was released.
Please refer to [change_log.md](docs/source/change_log.md) for more details and history.
Please refer to [data_hub.md](docs/source/data_hub.md).
## License
This project licensed under the [Apache License (Version 2.0)](LICENSE). This toolkit also contains various third-party components and some code modified from other repos under other open source licenses. See the [NOTICE](NOTICE) file for more information.
This project is licensed under the [Apache License (Version 2.0)](LICENSE). This toolkit also contains various third-party components and some code modified from other repos under other open source licenses. See the [NOTICE](NOTICE) file for more information.
## Contact

View File

@ -22,6 +22,7 @@
EasyCV是一个涵盖多个领域的基于Pytorch的计算机视觉工具箱，聚焦自监督学习和视觉transformer关键技术，覆盖主流的视觉建模任务，例如图像分类、度量学习、目标检测、关键点检测等。
### 核心特性
- **SOTA 自监督算法**
@ -40,9 +41,30 @@ EasyCV是一个涵盖多个领域的基于Pytorch的计算机视觉工具箱
- **高性能**
EasyCV支持多机多卡训练，同时支持[TorchAccelerator](https://github.com/alibaba/EasyCV/tree/master/docs/source/tutorials/torchacc.md)和fp16进行训练加速。在数据读取和预处理方面，EasyCV使用[DALI](https://github.com/NVIDIA/DALI)进行加速。对于模型推理优化，EasyCV支持使用jit script导出模型，使用[PAI-Blade](https://help.aliyun.com/document_detail/205134.html)进行模型优化。
EasyCV支持多机多卡训练，同时支持[TorchAccelerator](docs/source/tutorials/torchacc.md)和fp16进行训练加速。在数据读取和预处理方面，EasyCV使用[DALI](https://github.com/NVIDIA/DALI)进行加速。对于模型推理优化，EasyCV支持使用jit script导出模型，使用[PAI-Blade](https://help.aliyun.com/document_detail/205134.html)进行模型优化。
## 最新进展
[🔥 Latest News] 近期我们开源了YOLOX-PAI，在40-50mAP(推理速度小于1ms)范围内达到了业界的SOTA水平。同时，EasyCV提供了一套简洁高效的模型导出和预测接口，供用户快速完成端到端的图像检测任务。如果你想快速了解YOLOX-PAI，点击 [这里](docs/source/tutorials/yolox.md)!
* 31/08/2022 EasyCV v0.6.0 版本发布。
- 发布YOLOX-PAI，在轻量级模型中取得SOTA效果
- 增加检测算法DINO，COCO mAP 58.5
- 增加Mask2Former算法
- Datahub新增imagenet1k, imagenet22k, coco, lvis, voc2012 数据的百度网盘链接,加速下载
更多版本的详细信息请参考[变更记录](docs/source/change_log.md)。
## 技术文章
我们有一系列关于EasyCV功能的技术文章。
* [EasyCV开源开箱即用的视觉自监督+Transformer算法库](https://zhuanlan.zhihu.com/p/505219993)
* [MAE自监督算法介绍和基于EasyCV的复现](https://zhuanlan.zhihu.com/p/515859470)
* [基于EasyCV复现ViTDet单层特征超越FPN](https://zhuanlan.zhihu.com/p/528733299)
* [基于EasyCV复现DETR和DAB-DETR，Object Query的正确打开方式](https://zhuanlan.zhihu.com/p/543129581)
## 安装
@ -55,12 +77,114 @@ EasyCV是一个涵盖多个领域的基于Pytorch的计算机视觉工具箱
* [自监督学习教程](docs/source/tutorials/ssl.md)
* [图像分类教程](docs/source/tutorials/cls.md)
* [使用YOLOX进行物体检测教程](docs/source/tutorials/yolox.md)
* [使用YOLOX-PAI进行物体检测教程](docs/source/tutorials/yolox.md)
* [YOLOX模型压缩教程](docs/source/tutorials/compression.md)
* [torchacc](docs/source/tutorials/torchacc.md)
## 模型库
<div align="center">
<b>模型</b>
</div>
<table align="center">
<tbody>
<tr align="center">
<td>
<b>自监督学习</b>
</td>
<td>
<b>图像分类</b>
</td>
<td>
<b>目标检测</b>
</td>
<td>
<b>分割</b>
</td>
</tr>
<tr valign="top">
<td>
<ul>
<li><a href="configs/selfsup/byol">BYOL (NeurIPS'2020)</a></li>
<li><a href="configs/selfsup/dino">DINO (ICCV'2021)</a></li>
<li><a href="configs/selfsup/mixco">MiXCo (NeurIPS'2020)</a></li>
<li><a href="configs/selfsup/moby">MoBY (ArXiv'2021)</a></li>
<li><a href="configs/selfsup/mocov2">MoCov2 (ArXiv'2020)</a></li>
<li><a href="configs/selfsup/simclr">SimCLR (ICML'2020)</a></li>
<li><a href="configs/selfsup/swav">SwAV (NeurIPS'2020)</a></li>
<li><a href="configs/selfsup/mae">MAE (CVPR'2022)</a></li>
<li><a href="configs/selfsup/fast_convmae">FastConvMAE (ArXiv'2022)</a></li>
</ul>
</td>
<td>
<ul>
<li><a href="configs/classification/imagenet/resnet">ResNet (CVPR'2016)</a></li>
<li><a href="configs/classification/imagenet/resnext">ResNeXt (CVPR'2017)</a></li>
<li><a href="configs/classification/imagenet/hrnet">HRNet (CVPR'2019)</a></li>
<li><a href="configs/classification/imagenet/vit">ViT (ICLR'2021)</a></li>
<li><a href="configs/classification/imagenet/swint">SwinT (ICCV'2021)</a></li>
<li><a href="configs/classification/imagenet/efficientformer">EfficientFormer (ArXiv'2022)</a></li>
<li><a href="configs/classification/imagenet/timm/deit">DeiT (ICML'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/xcit">XCiT (ArXiv'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/tnt">TNT (NeurIPS'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/convit">ConViT (ArXiv'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/cait">CaiT (ICCV'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/levit">LeViT (ICCV'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/convnext">ConvNeXt (CVPR'2022)</a></li>
<li><a href="configs/classification/imagenet/timm/resmlp">ResMLP (ArXiv'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/coat">CoaT (ICCV'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/convmixer">ConvMixer (ICLR'2022)</a></li>
<li><a href="configs/classification/imagenet/timm/mlp-mixer">MLP-Mixer (ArXiv'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/nest">NesT (AAAI'2022)</a></li>
<li><a href="configs/classification/imagenet/timm/pit">PiT (ArXiv'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/twins">Twins (NeurIPS'2021)</a></li>
<li><a href="configs/classification/imagenet/timm/shuffle_transformer">Shuffle Transformer (ArXiv'2021)</a></li>
</ul>
</td>
<td>
<ul>
<li><a href="configs/detection/fcos">FCOS (ICCV'2019)</a></li>
<li><a href="configs/detection/yolox">YOLOX (ArXiv'2021)</a></li>
<li><a href="configs/detection/yolox">YOLOX-PAI (ArXiv'2022)</a></li>
<li><a href="configs/detection/detr">DETR (ECCV'2020)</a></li>
<li><a href="configs/detection/dab_detr">DAB-DETR (ICLR'2022)</a></li>
<li><a href="configs/detection/dab_detr">DN-DETR (CVPR'2022)</a></li>
<li><a href="configs/detection/dino">DINO (ArXiv'2022)</a></li>
</ul>
</td>
<td>
</ul>
<li><b>实例分割</b></li>
<ul>
<ul>
<li><a href="configs/detection/mask_rcnn">Mask R-CNN (ICCV'2017)</a></li>
<li><a href="configs/detection/vitdet">ViTDet (ArXiv'2022)</a></li>
<li><a href="configs/segmentation/mask2former">Mask2Former (CVPR'2022)</a></li>
</ul>
</ul>
</ul>
<li><b>语义分割</b></li>
<ul>
<ul>
<li><a href="configs/segmentation/fcn">FCN (CVPR'2015)</a></li>
<li><a href="configs/segmentation/upernet">UperNet (ECCV'2018)</a></li>
</ul>
</ul>
</ul>
<li><b>全景分割</b></li>
<ul>
<ul>
<li><a href="configs/segmentation/mask2former">Mask2Former (CVPR'2022)</a></li>
</ul>
</ul>
</ul>
</td>
</tr>
</td>
</tr>
</tbody>
</table>
不同领域的模型仓库和benchmark指标如下
- [自监督模型库](docs/source/model_zoo_ssl.md)
@ -68,34 +192,6 @@ EasyCV是一个涵盖多个领域的基于Pytorch的计算机视觉工具箱
- [目标检测模型库](docs/source/model_zoo_det.md)
## 变更日志
* 28/07/2022 EasyCV v0.5.0 版本发布。
* 自监督学习增加了ConvMAE算法
* 图像分类增加EfficientFormer
* 目标检测增加FCOS、DETR、DAB-DETR和DN-DETR算法
* 语义分割增加了UperNet算法
* 支持使用[torchacc](https://github.com/alibaba/EasyCV/blob/master/docs/source/tutorials/torchacc.md)加快训练速度
* 增加模型分析工具
* 23/06/2022 EasyCV v0.4.0 版本发布。
* 增加语义分割模块, 支持FCN算法
* 扩充分类算法 model zoo
* Yolox支持导出 [blade](https://help.aliyun.com/document_detail/205134.html) 模型
* 支持 ViTDet 检测算法
* 支持 sailfish 数据并行训练
* 支持运行 [mmdetection](https://github.com/open-mmlab/mmdetection) 中的模型
* 31/04/2022 EasyCV v0.3.0 版本发布。
* 增加 moby deit-small 预训练模型
* 增加 mae vit-large benchmark和预训练模型
* 支持 tensorboard和wandb 的图像可视化
* 2022/04/07 EasyCV v0.2.2 版本发布。
更多详细变更日志请参考[变更记录](docs/source/change_log.md)。
## 开源许可证
本项目使用 [Apache 2.0 开源许可证](LICENSE). 项目内含有一些第三方依赖库源码,部分实现借鉴其他开源仓库,仓库名称和开源许可证说明请参考[NOTICE文件](NOTICE)。

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a0117af1f65e6873617df22eed987076cca2917d5761b77dd074c1687f4933a9
size 232937

View File

@ -10,7 +10,7 @@ oss_io_config = dict(
buckets=['your oss buckets'])
# model settings
# 1920: merge 4 layers of features, open models/backbones/vit_transfomer_dynamic.py:311: self.forward_return_n_last_blocks
# 1920: merge 4 layers of features, open models/backbones/vit_transformer_dynamic.py:311: self.forward_return_n_last_blocks
# 384: default
feature_num = 1920
model = dict(

View File

@ -157,3 +157,6 @@ checkpoint_config = dict(interval=10)
# runtime settings
total_epochs = 50
# export config
export = dict(export_neck=True)

View File

@ -10,7 +10,7 @@ oss_io_config = dict(
buckets=['your oss buckets'])
# model settings
# 1920: merge 4 layers of features, open models/backbones/vit_transfomer_dynamic.py:311: self.forward_return_n_last_blocks
# 1920: merge 4 layers of features, open models/backbones/vit_transformer_dynamic.py:311: self.forward_return_n_last_blocks
# 384: default
feature_num = 1920
model = dict(

View File

@ -64,7 +64,7 @@ train_pipeline = [
dict(type='MMRandomFlip', flip_ratio=0.5),
dict(type='MMPhotoMetricDistortion'),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size=crop_size, pad_val=0, seg_pad_val=255),
dict(type='MMPad', size=crop_size),
dict(type='DefaultFormatBundle'),
dict(
type='Collect',

View File

@ -12,33 +12,16 @@ import torch
from mmcv.parallel import MMDataParallel, MMDistributedDataParallel
from mmcv.runner import get_dist_info, init_dist, load_checkpoint
from easycv.apis import set_random_seed
from easycv.datasets import build_dataloader, build_dataset
from easycv.file import io
from easycv.framework.errors import ValueError
from easycv.models import build_model
from easycv.utils.collect import dist_forward_collect, nondist_forward_collect
from easycv.utils.config_tools import mmcv_config_fromfile
from easycv.utils.logger import get_root_logger
def set_random_seed(seed, deterministic=True):
"""Set random seed.
Args:
seed (int): Seed to be used.
deterministic (bool): Whether to set the deterministic option for
CUDNN backend, i.e., set `torch.backends.cudnn.deterministic`
to True and `torch.backends.cudnn.benchmark` to False.
Default: False.
"""
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
if deterministic:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
class ExtractProcess(object):
def __init__(self, extract_list=['neck']):

View File

@ -3,6 +3,8 @@ import argparse
import torch
from easycv.framework.errors import ValueError
def parse_args():
parser = argparse.ArgumentParser(
@ -24,7 +26,7 @@ def main():
output_dict['state_dict'][key[9:]] = value
has_backbone = True
if not has_backbone:
raise Exception('Cannot find a backbone module in the checkpoint.')
raise ValueError('Cannot find a backbone module in the checkpoint.')
torch.save(output_dict, args.output)

View File

@ -2,11 +2,12 @@
import argparse
import os
import shutil
import sys
import time
import torch
from easycv.framework.errors import ValueError
args = argparse.ArgumentParser(description='Process some integers.')
args.add_argument(
'model_path',
@ -88,7 +89,7 @@ def extract_model(model_path):
output_dict['state_dict'][key[9:]] = value
has_backbone = True
if not has_backbone:
raise Exception('Cannot find a backbone module in the checkpoint.')
raise ValueError('Cannot find a backbone module in the checkpoint.')
torch.save(output_dict, backbone_file)
return backbone_file

View File

@ -86,3 +86,13 @@ checkpoint_config = dict(interval=10)
# runtime settings
total_epochs = 100
predict = dict(
type='ClassificationPredictor',
pipelines=[
dict(type='Resize', size=256),
dict(type='CenterCrop', size=224),
dict(type='ToTensor'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Collect', keys=['img'])
])
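The `predict` pipeline above mirrors standard ImageNet evaluation preprocessing. As a rough, framework-agnostic illustration only (EasyCV builds these transforms from its own registry, and `img_norm_cfg` is assumed here to be the usual ImageNet mean/std), the same steps expressed with torchvision would be:

```python
# Approximate torchvision equivalent of the predict pipeline (illustration only).
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),                            # dict(type='Resize', size=256)
    transforms.CenterCrop(224),                        # dict(type='CenterCrop', size=224)
    transforms.ToTensor(),                             # dict(type='ToTensor')
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # dict(type='Normalize', **img_norm_cfg),
                         std=[0.229, 0.224, 0.225]),   # assuming ImageNet statistics
])
```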

View File

@ -0,0 +1,143 @@
# from PIL import Image
_base_ = 'configs/base.py'
log_config = dict(
interval=10,
hooks=[dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHook')])
# model settings
model = dict(
type='Classification',
train_preprocess=['mixUp'],
pretrained=False,
mixup_cfg=dict(
mixup_alpha=0.8,
cutmix_alpha=1.0,
cutmix_minmax=None,
prob=1.0,
switch_prob=0.5,
mode='batch',
label_smoothing=0.0,
num_classes=1000),
backbone=dict(
type='VisionTransformer',
img_size=[192],
num_classes=1000,
patch_size=16,
embed_dim=768,
depth=12,
num_heads=12,
mlp_ratio=4,
qkv_bias=True,
drop_rate=0.,
drop_path_rate=0.2,
use_layer_scale=True),
head=dict(
type='ClsHead',
loss_config=dict(
type='CrossEntropyLoss',
use_sigmoid=True,
loss_weight=1.0,
label_ceil=True),
with_fc=False,
use_num_classes=False))
data_train_list = 'data/imagenet1k/train.txt'
data_train_root = 'data/imagenet1k/train/'
data_test_list = 'data/imagenet1k/val.txt'
data_test_root = 'data/imagenet1k/val/'
dataset_type = 'ClsDataset'
img_norm_cfg = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
three_augment_policies = [[
dict(type='PILGaussianBlur', prob=1.0, radius_min=0.1, radius_max=2.0),
], [
dict(type='Solarization', threshold=128),
], [
dict(type='Grayscale', num_output_channels=3),
]]
train_pipeline = [
dict(
type='RandomResizedCrop', size=192, scale=(0.08, 1.0),
interpolation=3), # interpolation='bicubic'
dict(type='RandomHorizontalFlip'),
dict(type='MMAutoAugment', policies=three_augment_policies),
dict(type='ColorJitter', brightness=0.3, contrast=0.3, saturation=0.3),
dict(type='ToTensor'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Collect', keys=['img', 'gt_labels'])
]
size = int((256 / 224) * 192)
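# = 219; keeps the standard 256/224 resize-to-crop ratio at the 192 test resolution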
test_pipeline = [
dict(type='Resize', size=size, interpolation=3),
dict(type='CenterCrop', size=192),
dict(type='ToTensor'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Collect', keys=['img', 'gt_labels'])
]
data = dict(
imgs_per_gpu=256,
workers_per_gpu=8,
use_repeated_augment_sampler=True,
train=dict(
type=dataset_type,
data_source=dict(
list_file=data_train_list,
root=data_train_root,
type='ClsSourceImageList'),
pipeline=train_pipeline),
val=dict(
type=dataset_type,
data_source=dict(
list_file=data_test_list,
root=data_test_root,
type='ClsSourceImageList'),
pipeline=test_pipeline))
eval_config = dict(initial=True, interval=1, gpu_collect=True)
eval_pipelines = [
dict(
mode='test',
data=data['val'],
dist_eval=True,
evaluators=[dict(type='ClsEvaluator', topk=(1, 5))],
)
]
# additional hooks
custom_hooks = []
# optimizer
optimizer = dict(
type='Lamb',
lr=0.003,
weight_decay=0.05,
eps=1e-8,
paramwise_options={
'cls_token': dict(weight_decay=0.),
'pos_embed': dict(weight_decay=0.),
'bias': dict(weight_decay=0.),
'norm': dict(weight_decay=0.),
'gamma_1': dict(weight_decay=0.),
'gamma_2': dict(weight_decay=0.),
})
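The `paramwise_options` above exclude norms, biases, the class token, position embeddings and the LayerScale parameters (`gamma_1`, `gamma_2`) from weight decay. EasyCV's optimizer builder does the name matching internally; a rough stand-in sketch of the intent (toy module, plain PyTorch, not EasyCV code) is:

```python
# Rough sketch of the paramwise_options above: parameters whose names contain any of
# the listed keys get weight_decay=0, everything else keeps the global 0.05.
import torch.nn as nn

no_decay_keys = ('cls_token', 'pos_embed', 'bias', 'norm', 'gamma_1', 'gamma_2')

def split_params(model: nn.Module):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        (no_decay if any(k in name for k in no_decay_keys) else decay).append(param)
    return [
        {'params': decay, 'weight_decay': 0.05},
        {'params': no_decay, 'weight_decay': 0.0},
    ]

param_groups = split_params(nn.TransformerEncoderLayer(d_model=768, nhead=12))
```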
optimizer_config = dict(grad_clip=None, update_interval=1)
lr_config = dict(
policy='CosineAnnealingWarmupByEpoch',
by_epoch=True,
min_lr_ratio=0.00001 / 0.003,
warmup='linear',
warmup_by_epoch=True,
warmup_iters=5,
warmup_ratio=0.000001 / 0.003,
)
checkpoint_config = dict(interval=10)
# runtime settings
total_epochs = 800
ema = dict(decay=0.99996)
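Assuming, as the ratio values suggest, that `min_lr_ratio` and `warmup_ratio` scale the base learning rate, the schedule endpoints and the EMA time constant implied above work out to roughly:

```python
# Plain arithmetic sanity check for the schedule above (no EasyCV imports).
import math

base_lr = 0.003
min_lr = (0.00001 / 0.003) * base_lr          # cosine floor -> 1e-05
warmup_start = (0.000001 / 0.003) * base_lr   # linear warmup start -> 1e-06
ema_half_life = math.log(0.5) / math.log(0.99996)  # ~17.3k updates for decay=0.99996

print(min_lr, warmup_start, round(ema_half_life))
```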

View File

@ -0,0 +1,17 @@
_base_ = './deitiii_base_patch16_192.py'
# model settings
model = dict(
type='Classification',
backbone=dict(
type='VisionTransformer',
img_size=[192],
num_classes=1000,
patch_size=16,
embed_dim=768,
depth=12,
num_heads=12,
mlp_ratio=4,
qkv_bias=True,
drop_rate=0.,
drop_path_rate=0.2,
use_layer_scale=True))

View File

@ -0,0 +1,17 @@
_base_ = './deitiii_base_patch16_192.py'
# model settings
model = dict(
type='Classification',
backbone=dict(
type='VisionTransformer',
img_size=[192],
num_classes=1000,
patch_size=16,
embed_dim=1024,
depth=24,
num_heads=16,
mlp_ratio=4,
qkv_bias=True,
drop_rate=0.,
drop_path_rate=0.45,
use_layer_scale=True))
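A small consistency check on the backbone variants used by these configs: the base (768/12), large (1024/16) and small (384/6, defined below) settings all keep the per-head dimension at 64.

```python
# Per-head dimension check for the DeiT III backbone variants in these configs.
variants = {'small': (384, 6), 'base': (768, 12), 'large': (1024, 16)}
for name, (embed_dim, num_heads) in variants.items():
    assert embed_dim // num_heads == 64, name
```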

View File

@ -0,0 +1,86 @@
_base_ = './deitiii_base_patch16_192.py'
# model settings
model = dict(
type='Classification',
backbone=dict(
type='VisionTransformer',
img_size=[224],
num_classes=1000,
patch_size=16,
embed_dim=384,
depth=12,
num_heads=6,
mlp_ratio=4,
qkv_bias=True,
drop_rate=0.,
drop_path_rate=0.05,
use_layer_scale=True))
data_train_list = 'data/imagenet1k/train.txt'
data_train_root = 'data/imagenet1k/train/'
data_test_list = 'data/imagenet1k/val.txt'
data_test_root = 'data/imagenet1k/val/'
dataset_type = 'ClsDataset'
img_norm_cfg = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
three_augment_policies = [[
dict(type='PILGaussianBlur', prob=1.0, radius_min=0.1, radius_max=2.0),
], [
dict(type='Solarization', threshold=128),
], [
dict(type='Grayscale', num_output_channels=3),
]]
train_pipeline = [
dict(
type='RandomResizedCrop', size=224, scale=(0.08, 1.0),
interpolation=3), # interpolation='bicubic'
dict(type='RandomHorizontalFlip'),
dict(type='MMAutoAugment', policies=three_augment_policies),
dict(type='ColorJitter', brightness=0.3, contrast=0.3, saturation=0.3),
dict(type='ToTensor'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Collect', keys=['img', 'gt_labels'])
]
test_pipeline = [
dict(type='Resize', size=256, interpolation=3),
dict(type='CenterCrop', size=224),
dict(type='ToTensor'),
dict(type='Normalize', **img_norm_cfg),
dict(type='Collect', keys=['img', 'gt_labels'])
]
data = dict(
imgs_per_gpu=256,
workers_per_gpu=8,
use_repeated_augment_sampler=True,
train=dict(
type=dataset_type,
data_source=dict(
list_file=data_train_list,
root=data_train_root,
type='ClsSourceImageList'),
pipeline=train_pipeline),
val=dict(
type=dataset_type,
data_source=dict(
list_file=data_test_list,
root=data_test_root,
type='ClsSourceImageList'),
pipeline=test_pipeline))
eval_pipelines = [
dict(
mode='test',
data=data['val'],
dist_eval=True,
evaluators=[dict(type='ClsEvaluator', topk=(1, 5))],
)
]
# optimizer
optimizer = dict(lr=0.004)
lr_config = dict(
min_lr_ratio=0.00001 / 0.004,
warmup_ratio=0.000001 / 0.004,
)
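Note that both ratios are re-expressed against the new base lr of 0.004, so the absolute schedule endpoints stay where the base config put them:

```python
# The rescaled ratios keep the same absolute schedule endpoints as the 0.003 config.
base_lr = 0.004
min_lr = (0.00001 / 0.004) * base_lr          # same 1e-05 cosine floor as the base config
warmup_start = (0.000001 / 0.004) * base_lr   # same 1e-06 warmup start
print(min_lr, warmup_start)
```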

View File

@ -23,36 +23,41 @@ train_pipeline = [
dict(type='MMRandomFlip', flip_ratio=0.5),
dict(
type='MMAutoAugment',
policies=[[
dict(
type='MMResize',
img_scale=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
(608, 1333), (640, 1333), (672, 1333), (704, 1333),
(736, 1333), (768, 1333), (800, 1333)],
multiscale_mode='value',
keep_ratio=True)
],
[
dict(
type='MMResize',
img_scale=[(400, 1333), (500, 1333), (600, 1333)],
multiscale_mode='value',
keep_ratio=True),
dict(
type='MMRandomCrop',
crop_type='absolute_range',
crop_size=(384, 600),
allow_negative_crop=True),
dict(
type='MMResize',
img_scale=[(480, 1333), (512, 1333), (544, 1333),
(576, 1333), (608, 1333), (640, 1333),
(672, 1333), (704, 1333), (736, 1333),
(768, 1333), (800, 1333)],
multiscale_mode='value',
override=True,
keep_ratio=True)
]]),
policies=[
[
dict(
type='MMResize',
img_scale=[(480, 1333), (512, 1333), (544, 1333),
(576, 1333), (608, 1333), (640, 1333),
(672, 1333), (704, 1333), (736, 1333),
(768, 1333), (800, 1333)],
multiscale_mode='value',
keep_ratio=True)
],
[
dict(
type='MMResize',
# The ratio of all images in train dataset < 7
# follow the original impl
img_scale=[(400, 4200), (500, 4200), (600, 4200)],
multiscale_mode='value',
keep_ratio=True),
dict(
type='MMRandomCrop',
crop_type='absolute_range',
crop_size=(384, 600),
allow_negative_crop=True),
dict(
type='MMResize',
img_scale=[(480, 1333), (512, 1333), (544, 1333),
(576, 1333), (608, 1333), (640, 1333),
(672, 1333), (704, 1333), (736, 1333),
(768, 1333), (800, 1333)],
multiscale_mode='value',
override=True,
keep_ratio=True)
]
]),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size_divisor=1),
dict(type='DefaultFormatBundle'),
@ -96,7 +101,7 @@ train_dataset = dict(
],
classes=CLASSES,
test_mode=False,
filter_empty_gt=True,
filter_empty_gt=False,
iscrowd=False),
pipeline=train_pipeline)
@ -118,13 +123,18 @@ val_dataset = dict(
pipeline=test_pipeline)
data = dict(
imgs_per_gpu=2, workers_per_gpu=2, train=train_dataset, val=val_dataset)
imgs_per_gpu=2,
workers_per_gpu=2,
train=train_dataset,
val=val_dataset,
drop_last=True)
# evaluation
eval_config = dict(interval=1, gpu_collect=False)
eval_pipelines = [
dict(
mode='test',
dist_eval=True,
evaluators=[
dict(type='CocoDetectionEvaluator', classes=CLASSES),
],

View File

@ -1,5 +1,5 @@
_base_ = [
'./dab_detr.py', '../_base_/dataset/autoaug_coco_detection.py',
'./dab_detr.py', '../common/dataset/autoaug_coco_detection.py',
'configs/base.py'
]

View File

@ -1,5 +1,5 @@
_base_ = [
'./detr.py', '../_base_/dataset/autoaug_coco_detection.py',
'./detr.py', '../common/dataset/autoaug_coco_detection.py',
'configs/base.py'
]

View File

@ -0,0 +1,36 @@
# DINO
> [DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection](https://arxiv.org/abs/2203.03605)
<!-- [ALGORITHM] -->
## Abstract
We present DINO (DETR with Improved deNoising anchOr boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves 49.4AP in 12 epochs and 51.3AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement of +6.0AP and +2.7AP, respectively, compared to DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO val2017 (63.2AP) and test-dev (63.3AP). Compared to other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results.
<div align=center>
<img src="https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/algo_images/detection/DINO.png"/>
</div>
## Results and Models
| Algorithm | Config | Params<br/>(backbone/total) | inference time(V100)<br/>(ms/img) | bbox_mAP<sup>val<br/><sub>0.5:0.95</sub> | AP<sup>val<br/><sub>50</sub> | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| DINO_4sc_r50_12e | [DINO_4sc_r50_12e](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dino/dino_4sc_r50_12e_coco.py) | 23M/47M | 184ms | 48.71 | 66.27 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_r50_12e/epoch_12.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_r50_12e/20220815_141403.log.json) |
| DINO_4sc_r50_36e | [DINO_4sc_r50_36e](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dino/dino_4sc_r50_36e_coco.py) | 23M/47M | 184ms | 50.69 | 68.60 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_r50_36e/epoch_29.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_r50_36e/20220817_101549.log.json) |
| DINO_4sc_swinl_12e | [DINO_4sc_swinl_12e](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dino/dino_4sc_swinl_12e_coco.py) | 195M/217M | 155ms | 56.86 | 75.61 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_swinl_12e/epoch_12.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_swinl_12e/20220815_211633.log.json) |
| DINO_4sc_swinl_36e | [DINO_4sc_swinl_36e](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dino/dino_4sc_swinl_36e_coco.py) | 195M/217M | 155ms | 58.04 | 76.76 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_swinl_36e/epoch_34.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_swinl_36e/20220817_101416.log.json) |
| DINO_5sc_swinl_36e | [DINO_5sc_swinl_36e](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dino/dino_5sc_swinl_36e_coco.py) | 195M/217M | 235ms | 58.47 | 77.10 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_5sc_swinl_36e/epoch_35.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_5sc_swinl_36e/20220820_215711.log.json) |
## Citation
```latex
@misc{zhang2022dino,
title={DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection},
author={Hao Zhang and Feng Li and Shilong Liu and Lei Zhang and Hang Su and Jun Zhu and Lionel M. Ni and Heung-Yeung Shum},
year={2022},
eprint={2203.03605},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
```

View File

@ -0,0 +1,94 @@
# model settings
model = dict(
type='Detection',
pretrained=True,
backbone=dict(
type='ResNet',
depth=50,
num_stages=4,
out_indices=(2, 3, 4),
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=False),
norm_eval=True,
style='pytorch'),
head=dict(
type='DINOHead',
transformer=dict(
type='DeformableTransformer',
d_model=256,
nhead=8,
num_queries=900,
num_encoder_layers=6,
num_unicoder_layers=0,
num_decoder_layers=6,
dim_feedforward=2048,
dropout=0.0,
activation='relu',
normalize_before=False,
return_intermediate_dec=True,
query_dim=4,
num_patterns=0,
modulate_hw_attn=True,
# for deformable encoder
deformable_encoder=True,
deformable_decoder=True,
num_feature_levels=4,
enc_n_points=4,
dec_n_points=4,
# init query
decoder_query_perturber=None,
add_channel_attention=False,
random_refpoints_xy=False,
# two stage
two_stage_type=
'standard', # ['no', 'standard', 'early', 'combine', 'enceachlayer', 'enclayer1']
two_stage_pat_embed=0,
two_stage_add_query_num=0,
two_stage_learn_wh=False,
two_stage_keep_all_tokens=False,
# evo of #anchors
dec_layer_number=None,
rm_dec_query_scale=True,
rm_self_attn_layers=None,
key_aware_type=None,
# layer share
layer_share_type=None,
# for detach
rm_detach=None,
decoder_sa_type='sa',
module_seq=['sa', 'ca', 'ffn'],
# for dn
embed_init_tgt=True,
use_detached_boxes_dec_out=False),
dn_components=dict(
dn_number=100,
dn_label_noise_ratio=0.5, # paper 0.5, release code 0.25
dn_box_noise_scale=1.0,
dn_labelbook_size=80,
),
num_classes=80,
in_channels=[512, 1024, 2048],
embed_dims=256,
query_dim=4,
num_queries=900,
num_select=300,
random_refpoints_xy=False,
num_patterns=0,
fix_refpoints_hw=-1,
num_feature_levels=4,
# two stage
two_stage_type='standard', # ['no', 'standard']
two_stage_add_query_num=0,
dec_pred_class_embed_share=True,
dec_pred_bbox_embed_share=True,
two_stage_class_embed_share=False,
two_stage_bbox_embed_share=False,
decoder_sa_type='sa',
temperatureH=20,
temperatureW=20,
cost_dict=dict(
cost_class=2,
cost_bbox=5,
cost_giou=2,
),
weight_dict=dict(loss_ce=1, loss_bbox=5, loss_giou=2)))
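The head's `in_channels` have to line up with the feature maps selected by the backbone's `out_indices`. Using the stage widths implied by these configs (indices 1 through 4 giving 256/512/1024/2048 for ResNet-50, as the 5-scale variant below also shows), the correspondence can be checked directly:

```python
# Consistency check between backbone out_indices and head in_channels (ResNet-50),
# using the stage widths implied by the 4-scale and 5-scale configs in this directory.
resnet50_stage_channels = {1: 256, 2: 512, 3: 1024, 4: 2048}

out_indices = (2, 3, 4)          # backbone setting in dino_4sc_r50.py
in_channels = [512, 1024, 2048]  # head setting in dino_4sc_r50.py
assert [resnet50_stage_channels[i] for i in out_indices] == in_channels
```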

View File

@ -0,0 +1,4 @@
_base_ = [
'./dino_4sc_r50.py', '../common/dataset/autoaug_coco_detection.py',
'./dino_schedule_1x.py'
]

View File

@ -0,0 +1,6 @@
_base_ = './dino_4sc_r50_12e_coco.py'
# learning policy
lr_config = dict(policy='step', step=[22])
total_epochs = 24

View File

@ -0,0 +1,6 @@
_base_ = './dino_4sc_r50_12e_coco.py'
# learning policy
lr_config = dict(policy='step', step=[27, 33])
total_epochs = 36

View File

@ -0,0 +1,95 @@
# model settings
model = dict(
type='Detection',
pretrained=
'https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/classification/timm/swint/warpper_swin_large_patch4_window12_384_22k.pth',
backbone=dict(
type='SwinTransformer',
pretrain_img_size=384,
embed_dim=192,
depths=[2, 2, 18, 2],
num_heads=[6, 12, 24, 48],
window_size=12,
out_indices=(1, 2, 3),
use_checkpoint=True),
head=dict(
type='DINOHead',
transformer=dict(
type='DeformableTransformer',
d_model=256,
nhead=8,
num_queries=900,
num_encoder_layers=6,
num_unicoder_layers=0,
num_decoder_layers=6,
dim_feedforward=2048,
dropout=0.0,
activation='relu',
normalize_before=False,
return_intermediate_dec=True,
query_dim=4,
num_patterns=0,
modulate_hw_attn=True,
# for deformable encoder
deformable_encoder=True,
deformable_decoder=True,
num_feature_levels=4,
enc_n_points=4,
dec_n_points=4,
# init query
decoder_query_perturber=None,
add_channel_attention=False,
random_refpoints_xy=False,
# two stage
two_stage_type=
'standard', # ['no', 'standard', 'early', 'combine', 'enceachlayer', 'enclayer1']
two_stage_pat_embed=0,
two_stage_add_query_num=0,
two_stage_learn_wh=False,
two_stage_keep_all_tokens=False,
# evo of #anchors
dec_layer_number=None,
rm_dec_query_scale=True,
rm_self_attn_layers=None,
key_aware_type=None,
# layer share
layer_share_type=None,
# for detach
rm_detach=None,
decoder_sa_type='sa',
module_seq=['sa', 'ca', 'ffn'],
# for dn
embed_init_tgt=True,
use_detached_boxes_dec_out=False),
dn_components=dict(
dn_number=100,
dn_label_noise_ratio=0.5, # paper 0.5, release code 0.25
dn_box_noise_scale=1.0,
dn_labelbook_size=80,
),
num_classes=80,
in_channels=[384, 768, 1536],
embed_dims=256,
query_dim=4,
num_queries=900,
num_select=300,
random_refpoints_xy=False,
num_patterns=0,
fix_refpoints_hw=-1,
num_feature_levels=4,
# two stage
two_stage_type='standard', # ['no', 'standard']
two_stage_add_query_num=0,
dec_pred_class_embed_share=True,
dec_pred_bbox_embed_share=True,
two_stage_class_embed_share=False,
two_stage_bbox_embed_share=False,
decoder_sa_type='sa',
temperatureH=20,
temperatureW=20,
cost_dict=dict(
cost_class=2,
cost_bbox=5,
cost_giou=2,
),
weight_dict=dict(loss_ce=1, loss_bbox=5, loss_giou=2)))
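The same correspondence holds for the Swin-L backbone: with `embed_dim=192` the stage width doubles at each stage, so `out_indices=(1, 2, 3)` yields the `[384, 768, 1536]` expected by this head, and `(0, 1, 2, 3)` gives the 5-scale `[192, 384, 768, 1536]` used below.

```python
# Swin-L stage widths: embed_dim doubles at each of the four stages.
embed_dim = 192
stage_channels = [embed_dim * 2 ** i for i in range(4)]   # [192, 384, 768, 1536]

assert [stage_channels[i] for i in (1, 2, 3)] == [384, 768, 1536]          # 4-scale head
assert [stage_channels[i] for i in (0, 1, 2, 3)] == [192, 384, 768, 1536]  # 5-scale head
```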

View File

@ -0,0 +1,4 @@
_base_ = [
'./dino_4sc_swinl.py', '../common/dataset/autoaug_coco_detection.py',
'./dino_schedule_1x.py'
]

View File

@ -0,0 +1,6 @@
_base_ = './dino_4sc_swinl_12e_coco.py'
# learning policy
lr_config = dict(policy='step', step=[22])
total_epochs = 24

View File

@ -0,0 +1,6 @@
_base_ = './dino_4sc_swinl_12e_coco.py'
# learning policy
lr_config = dict(policy='step', step=[27, 33])
total_epochs = 36

View File

@ -0,0 +1,9 @@
_base_ = './dino_4sc_r50.py'
# model settings
model = dict(
backbone=dict(out_indices=(1, 2, 3, 4)),
head=dict(
in_channels=[256, 512, 1024, 2048],
num_feature_levels=5,
transformer=dict(num_feature_levels=5)))

View File

@ -0,0 +1,4 @@
_base_ = [
'./dino_5sc_r50.py', '../common/dataset/autoaug_coco_detection.py',
'./dino_schedule_1x.py'
]

View File

@ -0,0 +1,6 @@
_base_ = './dino_5sc_r50_12e_coco.py'
# learning policy
lr_config = dict(policy='step', step=[20])
total_epochs = 24

View File

@ -0,0 +1,6 @@
_base_ = './dino_5sc_r50_12e_coco.py'
# learning policy
lr_config = dict(policy='step', step=[27, 33])
total_epochs = 36

View File

@ -0,0 +1,9 @@
_base_ = './dino_4sc_swinl.py'
# model settings
model = dict(
backbone=dict(out_indices=(0, 1, 2, 3)),
head=dict(
in_channels=[192, 384, 768, 1536],
num_feature_levels=5,
transformer=dict(num_feature_levels=5)))

View File

@ -0,0 +1,4 @@
_base_ = [
'./dino_5sc_swinl.py', '../common/dataset/autoaug_coco_detection.py',
'./dino_schedule_1x.py'
]

View File

@ -0,0 +1,6 @@
_base_ = './dino_5sc_swinl_12e_coco.py'
# learning policy
lr_config = dict(policy='step', step=[20])
total_epochs = 24

View File

@ -0,0 +1,6 @@
_base_ = './dino_5sc_swinl_12e_coco.py'
# learning policy
lr_config = dict(policy='step', step=[27, 33])
total_epochs = 36

View File

@ -0,0 +1,19 @@
_base_ = 'configs/base.py'
checkpoint_config = dict(interval=10)
# optimizer
paramwise_options = {
'backbone': dict(lr_mult=0.1),
}
optimizer = dict(
type='AdamW',
lr=1e-4,
weight_decay=1e-4,
paramwise_options=paramwise_options)
optimizer_config = dict(grad_clip=dict(max_norm=0.1, norm_type=2))
# learning policy
lr_config = dict(policy='step', step=[11])
total_epochs = 12
find_unused_parameters = False
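The `lr_mult=0.1` entry means backbone parameters are optimized at one tenth of the base AdamW learning rate. EasyCV's builder constructs the parameter groups from these key names; a minimal plain-PyTorch sketch of the resulting grouping (toy model, not the actual detector) is:

```python
# Illustration: backbone parameters train at 0.1x the base AdamW learning rate.
import torch
import torch.nn as nn

# Toy model standing in for "backbone + detection head".
model = nn.ModuleDict({'backbone': nn.Linear(16, 16), 'head': nn.Linear(16, 4)})

param_groups = [
    {'params': model['backbone'].parameters(), 'lr': 1e-4 * 0.1},  # lr_mult=0.1 -> 1e-05
    {'params': model['head'].parameters(), 'lr': 1e-4},
]
optimizer = torch.optim.AdamW(param_groups, lr=1e-4, weight_decay=1e-4)
```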

View File

@ -0,0 +1,31 @@
# FCOS
> [FCOS: Fully Convolutional One-Stage Object Detection](https://arxiv.org/abs/1904.01355)
<!-- [ALGORITHM] -->
## Abstract
We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogue to semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor box free, as well as proposal free. By eliminating the predefined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes such as calculating overlapping during training. More importantly, we also avoid all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance. With the only post-processing non-maximum suppression (NMS), FCOS with ResNeXt-64x4d-101 achieves 44.7% in AP with single-model and single-scale testing, surpassing previous one-stage detectors with the advantage of being much simpler. For the first time, we demonstrate a much simpler and flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks.
<div align=center>
<img src="https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/algo_images/detection/fcos.png"/>
</div>
## Results and Models
| Algorithm | Config | Params<br/>(backbone/total) | inference time(V100)<br/>(ms/img) | mAP<sup>val<br/><sub>0.5:0.95</sub> | AP<sup>val<br/><sub>50</sub> | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| FCOS-r50(caffe) | [fcos-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/fcos/fcos_r50_caffe_1x_coco.py) | 23M/32M | 85.8ms | 38.58 | 57.18 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/fcos/epoch_12.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/fcos/20220621_121315.log.json) |
| FCOS-r50(torch) | [fcos-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/fcos/fcos_r50_torch_1x_coco.py) | 23M/32M | 105.3ms | 38.88 | 58.01 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/fcos/fcos_epoch_12.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/fcos/20220826_182628.log.json) |
## Citation
```latex
@article{tian2019fcos,
title={FCOS: Fully Convolutional One-Stage Object Detection},
author={Tian, Zhi and Shen, Chunhua and Chen, Hao and He, Tong},
journal={arXiv preprint arXiv:1904.01355},
year={2019}
}
```

View File

@ -17,7 +17,7 @@ CLASSES = [
# dataset settings
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[103.530, 116.280, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='MMResize', img_scale=(1333, 800), keep_ratio=True),
@ -88,3 +88,14 @@ val_dataset = dict(
data = dict(
imgs_per_gpu=2, workers_per_gpu=2, train=train_dataset, val=val_dataset)
# evaluation
eval_config = dict(interval=1, gpu_collect=False)
eval_pipelines = [
dict(
mode='test',
evaluators=[
dict(type='CocoDetectionEvaluator', classes=CLASSES),
],
)
]

View File

@ -1,8 +1,7 @@
# model settings
model = dict(
type='Detection',
pretrained=
'https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/pretrained_models/easycv/resnet/detectron/resnet50_caffe.pth',
pretrained=True,
backbone=dict(
type='ResNet',
depth=50,
@ -11,7 +10,7 @@ model = dict(
frozen_stages=1,
norm_cfg=dict(type='BN', requires_grad=False),
norm_eval=True,
style='caffe'),
style='pytorch'),
neck=dict(
type='FPN',
in_channels=[256, 512, 1024, 2048],

View File

@ -1,56 +0,0 @@
_base_ = ['./fcos.py', './coco_detection.py', 'configs/base.py']
CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
'hair drier', 'toothbrush'
]
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook'),
# dict(type='TensorboardLoggerHook')
])
checkpoint_config = dict(interval=10)
# optimizer
optimizer = dict(
type='SGD',
lr=0.01,
momentum=0.9,
weight_decay=0.0001,
paramwise_options=dict(bias_lr_mult=2., bias_decay_mult=0.))
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=500,
warmup_ratio=1.0 / 3,
step=[8, 11])
total_epochs = 12
# evaluation
eval_config = dict(interval=1, gpu_collect=False)
eval_pipelines = [
dict(
mode='test',
evaluators=[
dict(type='CocoDetectionEvaluator', classes=CLASSES),
],
)
]
find_unused_parameters = False

View File

@ -0,0 +1,49 @@
_base_ = './fcos_r50_torch_1x_coco.py'
model = dict(
pretrained=
'https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/pretrained_models/easycv/resnet/detectron/resnet50_caffe.pth',
backbone=dict(style='caffe'))
img_norm_cfg = dict(
mean=[103.530, 116.280, 123.675], std=[1.0, 1.0, 1.0], to_rgb=False)
train_pipeline = [
dict(type='MMResize', img_scale=(1333, 800), keep_ratio=True),
dict(type='MMRandomFlip', flip_ratio=0.5),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size_divisor=32),
dict(type='DefaultFormatBundle'),
dict(
type='Collect',
keys=['img', 'gt_bboxes', 'gt_labels'],
meta_keys=('filename', 'ori_filename', 'ori_shape', 'ori_img_shape',
'img_shape', 'pad_shape', 'scale_factor', 'flip',
'flip_direction', 'img_norm_cfg'))
]
test_pipeline = [
dict(
type='MMMultiScaleFlipAug',
img_scale=(1333, 800),
flip=False,
transforms=[
dict(type='MMResize', keep_ratio=True),
dict(type='MMRandomFlip'),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(
type='Collect',
keys=['img'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'ori_img_shape', 'img_shape', 'pad_shape',
'scale_factor', 'flip', 'flip_direction',
'img_norm_cfg'))
])
]
train_dataset = dict(pipeline=train_pipeline)
val_dataset = dict(pipeline=test_pipeline)
data = dict(
imgs_per_gpu=2, workers_per_gpu=2, train=train_dataset, val=val_dataset)
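Relative to the torch-style base config, this caffe variant swaps in caffe-style ResNet weights and the matching preprocessing: BGR input with mean subtraction only, instead of RGB with full mean/std scaling (both operating on 0-255 images). A small numpy illustration of the two conventions:

```python
import numpy as np

# Illustration of the two normalization conventions used by these configs
# (both expect HWC images in the 0-255 range).
def caffe_style(img_bgr: np.ndarray) -> np.ndarray:
    mean = np.array([103.530, 116.280, 123.675], dtype=np.float32)
    return img_bgr.astype(np.float32) - mean              # BGR, mean subtraction only

def torch_style(img_rgb: np.ndarray) -> np.ndarray:
    mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
    std = np.array([58.395, 57.12, 57.375], dtype=np.float32)
    return (img_rgb.astype(np.float32) - mean) / std      # RGB, mean/std normalization
```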

View File

@ -0,0 +1,29 @@
_base_ = ['./fcos.py', './coco_detection.py', 'configs/base.py']
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook'),
# dict(type='TensorboardLoggerHook')
])
checkpoint_config = dict(interval=10)
# optimizer
optimizer = dict(
type='SGD',
lr=0.01,
momentum=0.9,
weight_decay=0.0001,
paramwise_options=dict(bias_lr_mult=2., bias_decay_mult=0.))
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=500,
warmup_ratio=1.0 / 3,
step=[8, 11])
total_epochs = 12
find_unused_parameters = False

View File

@ -0,0 +1,117 @@
CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
'hair drier', 'toothbrush'
]
# dataset settings
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
image_size = (1024, 1024)
train_pipeline = [
# large scale jittering
dict(
type='MMResize',
img_scale=image_size,
ratio_range=(0.1, 2.0),
multiscale_mode='range',
keep_ratio=True),
dict(
type='MMRandomCrop',
crop_type='absolute_range',
crop_size=image_size,
recompute_bbox=False,
allow_negative_crop=True),
dict(type='MMFilterAnnotations', min_gt_bbox_wh=(1e-2, 1e-2)),
dict(type='MMRandomFlip', flip_ratio=0.5),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size=image_size),
dict(type='DefaultFormatBundle'),
dict(
type='Collect',
keys=['img', 'gt_bboxes', 'gt_labels'],
meta_keys=('filename', 'ori_filename', 'ori_shape', 'ori_img_shape',
'img_shape', 'pad_shape', 'scale_factor', 'flip',
'flip_direction', 'img_norm_cfg'))
]
test_pipeline = [
dict(
type='MMMultiScaleFlipAug',
img_scale=image_size,
flip=False,
transforms=[
dict(type='MMResize', keep_ratio=True),
dict(type='MMRandomFlip'),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size_divisor=1024),
dict(type='ImageToTensor', keys=['img']),
dict(
type='Collect',
keys=['img'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'ori_img_shape', 'img_shape', 'pad_shape',
'scale_factor', 'flip', 'flip_direction',
'img_norm_cfg'))
])
]
train_dataset = dict(
type='DetDataset',
data_source=dict(
type='DetSourceCoco',
ann_file=data_root + 'annotations/instances_train2017.json',
img_prefix=data_root + 'train2017/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True)
],
classes=CLASSES,
test_mode=False,
filter_empty_gt=True,
iscrowd=False),
pipeline=train_pipeline)
val_dataset = dict(
type='DetDataset',
imgs_per_gpu=1,
data_source=dict(
type='DetSourceCoco',
ann_file=data_root + 'annotations/instances_val2017.json',
img_prefix=data_root + 'val2017/',
pipeline=[
dict(type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True)
],
classes=CLASSES,
test_mode=True,
filter_empty_gt=False,
iscrowd=True),
pipeline=test_pipeline)
data = dict(
imgs_per_gpu=4, workers_per_gpu=2, train=train_dataset, val=val_dataset
) # 64(total batch size) = 4 (batch size/per gpu) x 8 (gpu num) x 2(node)
# evaluation
eval_config = dict(initial=False, interval=1, gpu_collect=False)
eval_pipelines = [
dict(
mode='test',
# dist_eval=True,
evaluators=[
dict(type='CocoDetectionEvaluator', classes=CLASSES),
],
)
]

View File

@ -101,4 +101,18 @@ val_dataset = dict(
pipeline=test_pipeline)
data = dict(
imgs_per_gpu=1, workers_per_gpu=2, train=train_dataset, val=val_dataset)
imgs_per_gpu=4, workers_per_gpu=2, train=train_dataset, val=val_dataset
) # 64(total batch size) = 4 (batch size/per gpu) x 8 (gpu num) x 2(node)
# evaluation
eval_config = dict(initial=False, interval=1, gpu_collect=False)
eval_pipelines = [
dict(
mode='test',
# dist_eval=True,
evaluators=[
dict(type='CocoDetectionEvaluator', classes=CLASSES),
dict(type='CocoMaskEvaluator', classes=CLASSES)
],
)
]

View File

@ -1,65 +0,0 @@
_base_ = [
'./_base_/models/vitdet.py', './_base_/datasets/coco_instance.py',
'configs/base.py'
]
CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
'hair drier', 'toothbrush'
]
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook'),
# dict(type='TensorboardLoggerHook')
])
checkpoint_config = dict(interval=10)
# optimizer
paramwise_options = {
'norm': dict(weight_decay=0.),
'bias': dict(weight_decay=0.),
'pos_embed': dict(weight_decay=0.),
'cls_token': dict(weight_decay=0.)
}
optimizer = dict(
type='AdamW',
lr=1e-4,
betas=(0.9, 0.999),
weight_decay=0.1,
paramwise_options=paramwise_options)
optimizer_config = dict(grad_clip=None, loss_scale=512.)
# learning policy
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=250,
warmup_ratio=0.067,
step=[88, 96])
total_epochs = 100
# evaluation
eval_config = dict(interval=1, gpu_collect=False)
eval_pipelines = [
dict(
mode='test',
evaluators=[
dict(type='CocoDetectionEvaluator', classes=CLASSES),
dict(type='CocoMaskEvaluator', classes=CLASSES)
],
)
]
find_unused_parameters = False

View File

@ -1,67 +0,0 @@
_base_ = [
'./_base_/models/vitdet.py', './_base_/datasets/coco_instance.py',
'configs/base.py'
]
CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
'hair drier', 'toothbrush'
]
model = dict(backbone=dict(aggregation='basicblock'))
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook'),
# dict(type='TensorboardLoggerHook')
])
checkpoint_config = dict(interval=10)
# optimizer
paramwise_options = {
'norm': dict(weight_decay=0.),
'bias': dict(weight_decay=0.),
'pos_embed': dict(weight_decay=0.),
'cls_token': dict(weight_decay=0.)
}
optimizer = dict(
type='AdamW',
lr=1e-4,
betas=(0.9, 0.999),
weight_decay=0.1,
paramwise_options=paramwise_options)
optimizer_config = dict(grad_clip=None, loss_scale=512.)
# learning policy
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=250,
warmup_ratio=0.067,
step=[88, 96])
total_epochs = 100
# evaluation
eval_config = dict(interval=1, gpu_collect=False)
eval_pipelines = [
dict(
mode='test',
evaluators=[
dict(type='CocoDetectionEvaluator', classes=CLASSES),
dict(type='CocoMaskEvaluator', classes=CLASSES)
],
)
]
find_unused_parameters = False

View File

@ -1,67 +0,0 @@
_base_ = [
'./_base_/models/vitdet.py', './_base_/datasets/coco_instance.py',
'configs/base.py'
]
CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
'hair drier', 'toothbrush'
]
model = dict(backbone=dict(aggregation='bottleneck'))
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook'),
# dict(type='TensorboardLoggerHook')
])
checkpoint_config = dict(interval=10)
# optimizer
paramwise_options = {
'norm': dict(weight_decay=0.),
'bias': dict(weight_decay=0.),
'pos_embed': dict(weight_decay=0.),
'cls_token': dict(weight_decay=0.)
}
optimizer = dict(
type='AdamW',
lr=1e-4,
betas=(0.9, 0.999),
weight_decay=0.1,
paramwise_options=paramwise_options)
optimizer_config = dict(grad_clip=None, loss_scale=512.)
# learning policy
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=250,
warmup_ratio=0.067,
step=[88, 96])
total_epochs = 100
# evaluation
eval_config = dict(interval=1, gpu_collect=False)
eval_pipelines = [
dict(
mode='test',
evaluators=[
dict(type='CocoDetectionEvaluator', classes=CLASSES),
dict(type='CocoMaskEvaluator', classes=CLASSES)
],
)
]
find_unused_parameters = False

View File

@ -0,0 +1,231 @@
# model settings
norm_cfg = dict(type='GN', num_groups=1, eps=1e-6, requires_grad=True)
pretrained = 'https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/selfsup/mae/vit-b-1600/warpper_mae_vit-base-p16-1600e.pth'
model = dict(
type='CascadeRCNN',
pretrained=pretrained,
backbone=dict(
type='ViTDet',
img_size=1024,
patch_size=16,
embed_dim=768,
depth=12,
num_heads=12,
drop_path_rate=0.1,
window_size=14,
mlp_ratio=4,
qkv_bias=True,
window_block_indexes=[
            # 2, 5, 8, 11 for global attention (see the sketch at the end of this config)
0,
1,
3,
4,
6,
7,
9,
10,
],
residual_block_indexes=[],
use_rel_pos=True),
neck=dict(
type='SFP',
in_channels=768,
out_channels=256,
scale_factors=(4.0, 2.0, 1.0, 0.5),
norm_cfg=norm_cfg,
num_outs=5),
rpn_head=dict(
type='RPNHead',
in_channels=256,
feat_channels=256,
num_convs=2,
anchor_generator=dict(
type='AnchorGenerator',
scales=[8],
ratios=[0.5, 1.0, 2.0],
strides=[4, 8, 16, 32, 64]),
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[.0, .0, .0, .0],
target_stds=[1.0, 1.0, 1.0, 1.0]),
loss_cls=dict(
type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
loss_bbox=dict(type='SmoothL1Loss', beta=1.0 / 9.0, loss_weight=1.0)),
roi_head=dict(
type='CascadeRoIHead',
num_stages=3,
stage_loss_weights=[1, 0.5, 0.25],
bbox_roi_extractor=dict(
type='SingleRoIExtractor',
roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
out_channels=256,
featmap_strides=[4, 8, 16, 32]),
bbox_head=[
dict(
type='Shared4Conv1FCBBoxHead',
conv_out_channels=256,
norm_cfg=norm_cfg,
in_channels=256,
fc_out_channels=1024,
roi_feat_size=7,
num_classes=80,
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[0., 0., 0., 0.],
target_stds=[0.1, 0.1, 0.2, 0.2]),
reg_class_agnostic=True,
loss_cls=dict(
type='CrossEntropyLoss',
use_sigmoid=False,
loss_weight=1.0),
loss_bbox=dict(type='SmoothL1Loss', beta=1.0,
loss_weight=1.0)),
dict(
type='Shared4Conv1FCBBoxHead',
conv_out_channels=256,
norm_cfg=norm_cfg,
in_channels=256,
fc_out_channels=1024,
roi_feat_size=7,
num_classes=80,
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[0., 0., 0., 0.],
target_stds=[0.05, 0.05, 0.1, 0.1]),
reg_class_agnostic=True,
loss_cls=dict(
type='CrossEntropyLoss',
use_sigmoid=False,
loss_weight=1.0),
loss_bbox=dict(type='SmoothL1Loss', beta=1.0,
loss_weight=1.0)),
dict(
type='Shared4Conv1FCBBoxHead',
conv_out_channels=256,
norm_cfg=norm_cfg,
in_channels=256,
fc_out_channels=1024,
roi_feat_size=7,
num_classes=80,
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[0., 0., 0., 0.],
target_stds=[0.033, 0.033, 0.067, 0.067]),
reg_class_agnostic=True,
loss_cls=dict(
type='CrossEntropyLoss',
use_sigmoid=False,
loss_weight=1.0),
loss_bbox=dict(type='SmoothL1Loss', beta=1.0, loss_weight=1.0))
],
mask_roi_extractor=dict(
type='SingleRoIExtractor',
roi_layer=dict(type='RoIAlign', output_size=14, sampling_ratio=0),
out_channels=256,
featmap_strides=[4, 8, 16, 32]),
mask_head=dict(
type='FCNMaskHead',
norm_cfg=norm_cfg,
num_convs=4,
in_channels=256,
conv_out_channels=256,
num_classes=80,
loss_mask=dict(
type='CrossEntropyLoss', use_mask=True, loss_weight=1.0))),
# model training and testing settings
train_cfg=dict(
rpn=dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.7,
neg_iou_thr=0.3,
min_pos_iou=0.3,
match_low_quality=True,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',
num=256,
pos_fraction=0.5,
neg_pos_ub=-1,
add_gt_as_proposals=False),
allowed_border=0,
pos_weight=-1,
debug=False),
rpn_proposal=dict(
nms_pre=2000,
max_per_img=2000,
nms=dict(type='nms', iou_threshold=0.7),
min_bbox_size=0),
rcnn=[
dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.5,
neg_iou_thr=0.5,
min_pos_iou=0.5,
match_low_quality=False,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',
num=512,
pos_fraction=0.25,
neg_pos_ub=-1,
add_gt_as_proposals=True),
mask_size=28,
pos_weight=-1,
debug=False),
dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.6,
neg_iou_thr=0.6,
min_pos_iou=0.6,
match_low_quality=False,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',
num=512,
pos_fraction=0.25,
neg_pos_ub=-1,
add_gt_as_proposals=True),
mask_size=28,
pos_weight=-1,
debug=False),
dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.7,
neg_iou_thr=0.7,
min_pos_iou=0.7,
match_low_quality=False,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',
num=512,
pos_fraction=0.25,
neg_pos_ub=-1,
add_gt_as_proposals=True),
mask_size=28,
pos_weight=-1,
debug=False)
]),
test_cfg=dict(
rpn=dict(
nms_pre=1000,
max_per_img=1000,
nms=dict(type='nms', iou_threshold=0.7),
min_bbox_size=0),
rcnn=dict(
score_thr=0.05,
nms=dict(type='nms', iou_threshold=0.5),
max_per_img=100,
mask_thr_binary=0.5)))
mmlab_modules = [
dict(type='mmdet', name='CascadeRCNN', module='model'),
dict(type='mmdet', name='RPNHead', module='head'),
dict(type='mmdet', name='CascadeRoIHead', module='head'),
]
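
A small aside on the ViTDet backbone defined at the top of this config: window_block_indexes lists the blocks restricted to windowed attention, so with depth=12 the remaining blocks (2, 5, 8, 11) keep global attention, as the in-line comment notes. A minimal sketch to derive them:

depth = 12
window_block_indexes = [0, 1, 3, 4, 6, 7, 9, 10]  # windowed-attention blocks, from the config above
global_block_indexes = sorted(set(range(depth)) - set(window_block_indexes))
print(global_block_indexes)  # [2, 5, 8, 11]
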

View File

@ -0,0 +1,4 @@
_base_ = [
'./vitdet_cascade_mask_rcnn.py', './lsj_coco_instance.py',
'./vitdet_schedule_100e.py'
]

View File

@ -0,0 +1,135 @@
# model settings
norm_cfg = dict(type='GN', num_groups=1, eps=1e-6, requires_grad=True)
pretrained = 'https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/selfsup/mae/vit-b-1600/warpper_mae_vit-base-p16-1600e.pth'
model = dict(
type='FasterRCNN',
pretrained=pretrained,
backbone=dict(
type='ViTDet',
img_size=1024,
patch_size=16,
embed_dim=768,
depth=12,
num_heads=12,
drop_path_rate=0.1,
window_size=14,
mlp_ratio=4,
qkv_bias=True,
window_block_indexes=[
            # 2, 5, 8, 11 for global attention
0,
1,
3,
4,
6,
7,
9,
10,
],
residual_block_indexes=[],
use_rel_pos=True),
neck=dict(
type='SFP',
in_channels=768,
out_channels=256,
scale_factors=(4.0, 2.0, 1.0, 0.5),
norm_cfg=norm_cfg,
num_outs=5),
rpn_head=dict(
type='RPNHead',
in_channels=256,
feat_channels=256,
num_convs=2,
anchor_generator=dict(
type='AnchorGenerator',
scales=[8],
ratios=[0.5, 1.0, 2.0],
strides=[4, 8, 16, 32, 64]),
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[.0, .0, .0, .0],
target_stds=[1.0, 1.0, 1.0, 1.0]),
loss_cls=dict(
type='CrossEntropyLoss', use_sigmoid=True, loss_weight=1.0),
loss_bbox=dict(type='L1Loss', loss_weight=1.0)),
roi_head=dict(
type='StandardRoIHead',
bbox_roi_extractor=dict(
type='SingleRoIExtractor',
roi_layer=dict(type='RoIAlign', output_size=7, sampling_ratio=0),
out_channels=256,
featmap_strides=[4, 8, 16, 32]),
bbox_head=dict(
type='Shared4Conv1FCBBoxHead',
conv_out_channels=256,
norm_cfg=norm_cfg,
in_channels=256,
fc_out_channels=1024,
roi_feat_size=7,
num_classes=80,
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[0., 0., 0., 0.],
target_stds=[0.1, 0.1, 0.2, 0.2]),
reg_class_agnostic=False,
loss_cls=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0),
loss_bbox=dict(type='L1Loss', loss_weight=1.0))),
# model training and testing settings
train_cfg=dict(
rpn=dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.7,
neg_iou_thr=0.3,
min_pos_iou=0.3,
match_low_quality=True,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',
num=256,
pos_fraction=0.5,
neg_pos_ub=-1,
add_gt_as_proposals=False),
allowed_border=-1,
pos_weight=-1,
debug=False),
rpn_proposal=dict(
nms_pre=2000,
max_per_img=1000,
nms=dict(type='nms', iou_threshold=0.7),
min_bbox_size=0),
rcnn=dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.5,
neg_iou_thr=0.5,
min_pos_iou=0.5,
match_low_quality=False,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',
num=512,
pos_fraction=0.25,
neg_pos_ub=-1,
add_gt_as_proposals=True),
pos_weight=-1,
debug=False)),
test_cfg=dict(
rpn=dict(
nms_pre=1000,
max_per_img=1000,
nms=dict(type='nms', iou_threshold=0.7),
min_bbox_size=0),
rcnn=dict(
score_thr=0.05,
nms=dict(type='nms', iou_threshold=0.5),
max_per_img=100)))
mmlab_modules = [
dict(type='mmdet', name='FasterRCNN', module='model'),
dict(type='mmdet', name='RPNHead', module='head'),
dict(type='mmdet', name='StandardRoIHead', module='head'),
]

View File

@ -0,0 +1,4 @@
_base_ = [
'./vitdet_faster_rcnn.py', './lsj_coco_instance.py',
'./vitdet_schedule_100e.py'
]

View File

@ -1,6 +1,6 @@
# model settings
norm_cfg = dict(type='GN', num_groups=1, requires_grad=True)
norm_cfg = dict(type='GN', num_groups=1, eps=1e-6, requires_grad=True)
pretrained = 'https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/selfsup/mae/vit-b-1600/warpper_mae_vit-base-p16-1600e.pth'
model = dict(
@ -9,22 +9,32 @@ model = dict(
backbone=dict(
type='ViTDet',
img_size=1024,
patch_size=16,
embed_dim=768,
depth=12,
num_heads=12,
drop_path_rate=0.1,
window_size=14,
mlp_ratio=4,
qkv_bias=True,
qk_scale=None,
drop_rate=0.,
attn_drop_rate=0.,
drop_path_rate=0.1,
use_abs_pos_emb=True,
aggregation='attn',
),
window_block_indexes=[
            # 2, 5, 8, 11 for global attention
0,
1,
3,
4,
6,
7,
9,
10,
],
residual_block_indexes=[],
use_rel_pos=True),
neck=dict(
type='SFP',
in_channels=[768, 768, 768, 768],
in_channels=768,
out_channels=256,
scale_factors=(4.0, 2.0, 1.0, 0.5),
norm_cfg=norm_cfg,
num_outs=5),
rpn_head=dict(
@ -32,7 +42,6 @@ model = dict(
in_channels=256,
feat_channels=256,
num_convs=2,
norm_cfg=norm_cfg,
anchor_generator=dict(
type='AnchorGenerator',
scales=[8],
@ -112,7 +121,7 @@ model = dict(
pos_iou_thr=0.5,
neg_iou_thr=0.5,
min_pos_iou=0.5,
match_low_quality=True,
match_low_quality=False,
ignore_iof_thr=-1),
sampler=dict(
type='RandomSampler',

View File

@ -0,0 +1,4 @@
_base_ = [
'./vitdet_mask_rcnn.py', './lsj_coco_instance.py',
'./vitdet_schedule_100e.py'
]

View File

@ -0,0 +1,30 @@
_base_ = 'configs/base.py'
log_config = dict(
interval=200,
hooks=[
dict(type='TextLoggerHook'),
# dict(type='TensorboardLoggerHook')
])
checkpoint_config = dict(interval=10)
# optimizer
optimizer = dict(
type='AdamW',
lr=1e-4,
betas=(0.9, 0.999),
weight_decay=0.1,
constructor='LayerDecayOptimizerConstructor',
paramwise_options=dict(num_layers=12, layer_decay_rate=0.7))
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=250,
warmup_ratio=0.001,
step=[88, 96])
total_epochs = 100
find_unused_parameters = False
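
The LayerDecayOptimizerConstructor above applies layer-wise learning-rate decay across the 12 transformer blocks. A minimal sketch of the usual decay rule (how embeddings, norms and heads are grouped is implementation-specific and omitted here; this only illustrates the scaling, not EasyCV's exact code):

base_lr = 1e-4
num_layers = 12
layer_decay_rate = 0.7

# deeper blocks keep a larger fraction of the base LR:
# the last block gets scale 1.0, earlier blocks are scaled down geometrically.
for layer_id in range(num_layers + 1):
    scale = layer_decay_rate ** (num_layers - layer_id)
    print(layer_id, base_lr * scale)
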

View File

@ -0,0 +1,188 @@
_base_ = '../../base.py'
# model settings s m l x
model = dict(
type='YOLOX',
test_conf=0.01,
nms_thre=0.65,
backbone='RepVGGYOLOX',
model_type='s', # s m l x tiny nano
head=dict(
type='YOLOXHead',
model_type='s',
obj_loss_type='BCE',
reg_loss_type='giou',
num_classes=80,
decode_in_inference=
        True  # set to False when benchmarking speed, to skip decode and NMS
))
# s m l x
img_scale = (640, 640)
random_size = (14, 26)
scale_ratio = (0.1, 2)
CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
'hair drier', 'toothbrush'
]
# dataset settings
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='MMMosaic', img_scale=img_scale, pad_val=114.0),
dict(
type='MMRandomAffine',
scaling_ratio_range=scale_ratio,
border=(-img_scale[0] // 2, -img_scale[1] // 2)),
dict(
        type='MMMixUp',  # used for s m l x; removed for tiny/nano
img_scale=img_scale,
ratio_range=(0.8, 1.6),
pad_val=114.0),
dict(
type='MMPhotoMetricDistortion',
brightness_delta=32,
contrast_range=(0.5, 1.5),
saturation_range=(0.5, 1.5),
hue_delta=18),
dict(type='MMRandomFlip', flip_ratio=0.5),
dict(type='MMResize', keep_ratio=True),
dict(type='MMPad', pad_to_square=True, pad_val=(114.0, 114.0, 114.0)),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
dict(type='MMResize', img_scale=img_scale, keep_ratio=True),
dict(type='MMPad', pad_to_square=True, pad_val=(114.0, 114.0, 114.0)),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img'])
]
train_dataset = dict(
type='DetImagesMixDataset',
data_source=dict(
type='DetSourceCoco',
ann_file=data_root + 'annotations/instances_train2017.json',
img_prefix=data_root + 'train2017/',
pipeline=[
dict(type='LoadImageFromFile', to_float32=True),
dict(type='LoadAnnotations', with_bbox=True)
],
classes=CLASSES,
filter_empty_gt=True,
iscrowd=False),
pipeline=train_pipeline,
dynamic_scale=img_scale)
val_dataset = dict(
type='DetImagesMixDataset',
imgs_per_gpu=2,
data_source=dict(
type='DetSourceCoco',
ann_file=data_root + 'annotations/instances_val2017.json',
img_prefix=data_root + 'val2017/',
pipeline=[
dict(type='LoadImageFromFile', to_float32=True),
dict(type='LoadAnnotations', with_bbox=True)
],
classes=CLASSES,
filter_empty_gt=False,
test_mode=True,
iscrowd=True),
pipeline=test_pipeline,
dynamic_scale=None,
label_padding=False)
data = dict(
imgs_per_gpu=16, workers_per_gpu=4, train=train_dataset, val=val_dataset)
# additional hooks
interval = 10
custom_hooks = [
dict(
type='YOLOXModeSwitchHook',
no_aug_epochs=15,
skip_type_keys=('MMMosaic', 'MMRandomAffine', 'MMMixUp'),
priority=48),
dict(
type='SyncRandomSizeHook',
ratio_range=random_size,
img_scale=img_scale,
interval=interval,
priority=48),
dict(
type='SyncNormHook',
num_last_epochs=15,
interval=interval,
priority=48)
]
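
SyncRandomSizeHook above periodically resamples the training resolution (controlled by `interval`). Assuming ratio_range is expressed in multiples of 32 pixels, as in the reference YOLOX implementation (an assumption, not stated in this config), random_size=(14, 26) maps to the following pixel range:

random_size = (14, 26)  # from the hook config above
stride = 32             # assumed unit, following the reference YOLOX implementation
low, high = (s * stride for s in random_size)
print(low, high)        # 448 832 -> training images are randomly resized within this range
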
# evaluation
eval_config = dict(
interval=10,
gpu_collect=False,
visualization_config=dict(
vis_num=10,
score_thr=0.5,
    )  # visualized by TensorboardLoggerHookV2 and WandbLoggerHookV2
)
eval_pipelines = [
dict(
mode='test',
data=data['val'],
evaluators=[dict(type='CocoDetectionEvaluator', classes=CLASSES)],
)
]
checkpoint_config = dict(interval=interval)
# optimizer
optimizer = dict(
type='SGD', lr=0.02, momentum=0.9, weight_decay=5e-4, nesterov=True)
optimizer_config = {}
# learning policy
lr_config = dict(
policy='YOLOX',
warmup='exp',
by_epoch=False,
warmup_by_epoch=True,
warmup_ratio=1,
    warmup_iters=5,  # 5 epochs
num_last_epochs=15,
min_lr_ratio=0.05)
# exponential moving average (EMA) of model weights
ema = dict(decay=0.9998)
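
The ema entry above keeps an exponential moving average of the model weights. A minimal sketch of the basic update rule (the decay warm-up used by YOLOX-style EMA implementations is omitted; this is only illustrative):

decay = 0.9998  # from the config above

def ema_update(ema_value, new_value, decay=decay):
    # the EMA copy moves very slowly toward the current weights
    return decay * ema_value + (1.0 - decay) * new_value

ema_w = 0.0
for _ in range(5):
    ema_w = ema_update(ema_w, 1.0)  # pretend the live parameter stays at 1.0
print(ema_w)  # still close to 0 after 5 steps, because decay is close to 1
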
# runtime settings
total_epochs = 300
# yapf:disable
log_config = dict(
interval=100,
hooks=[
dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHookV2'),
# dict(type='WandbLoggerHookV2'),
])
export = dict(
    export_type='ori',
    preprocess_jit=False,
    batch_size=1,
    blade_config=dict(enable_fp16=True, fp16_fallback_op_ratio=0.01),
    use_trt_efficientnms=False)

View File

@ -1,22 +1,27 @@
# model settings
# models s m l x
_base_ = '../../base.py'
# model settings s m l x
model = dict(
type='YOLOX',
num_classes=80,
model_type='tiny', # s m l x tiny nano
test_conf=0.01,
nms_thre=0.65)
nms_thre=0.65,
backbone='RepVGGYOLOX',
model_type='s', # s m l x tiny nano
use_att='ASFF',
head=dict(
type='YOLOXHead',
model_type='s',
obj_loss_type='BCE',
reg_loss_type='giou',
num_classes=80,
decode_in_inference=
        False  # set to False when benchmarking speed, to skip decode and NMS
))
# s m l x
# img_scale = (640, 640)
# random_size = (14, 26)
# scale_ratio = (0.1, 2)
# tiny nano ; without mixup
img_scale = (416, 416)
random_size = (10, 20)
scale_ratio = (0.5, 1.5)
img_scale = (640, 640)
random_size = (14, 26)
scale_ratio = (0.1, 2)
CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
@ -36,6 +41,7 @@ CLASSES = [
# dataset settings
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
@ -45,6 +51,11 @@ train_pipeline = [
type='MMRandomAffine',
scaling_ratio_range=scale_ratio,
border=(-img_scale[0] // 2, -img_scale[1] // 2)),
dict(
        type='MMMixUp',  # used for s m l x; removed for tiny/nano
img_scale=img_scale,
ratio_range=(0.8, 1.6),
pad_val=114.0),
dict(
type='MMPhotoMetricDistortion',
brightness_delta=32,
@ -125,7 +136,14 @@ custom_hooks = [
]
# evaluation
eval_config = dict(interval=10, gpu_collect=False)
eval_config = dict(
interval=10,
gpu_collect=False,
visualization_config=dict(
vis_num=10,
score_thr=0.5,
    )  # visualized by TensorboardLoggerHookV2 and WandbLoggerHookV2
)
eval_pipelines = [
dict(
mode='test',
@ -137,9 +155,8 @@ eval_pipelines = [
checkpoint_config = dict(interval=interval)
# optimizer
# basic_lr_per_img = 0.01 / 64.0
optimizer = dict(
type='SGD', lr=0.01, momentum=0.9, weight_decay=5e-4, nesterov=True)
type='SGD', lr=0.02, momentum=0.9, weight_decay=5e-4, nesterov=True)
optimizer_config = {}
# learning policy
@ -164,15 +181,8 @@ log_config = dict(
interval=100,
hooks=[
dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHook')
dict(type='TensorboardLoggerHookV2'),
# dict(type='WandbLoggerHookV2'),
])
# yapf:enable
# runtime settings
dist_params = dict(backend='nccl')
cudnn_benchmark = True
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]
export = dict(use_jit=False)
export = dict(
    export_type='ori',
    preprocess_jit=False,
    batch_size=1,
    blade_config=dict(enable_fp16=True, fp16_fallback_op_ratio=0.01),
    use_trt_efficientnms=False)

View File

@ -0,0 +1,189 @@
_base_ = '../../base.py'
# model settings s m l x
model = dict(
type='YOLOX',
test_conf=0.01,
nms_thre=0.65,
backbone='RepVGGYOLOX',
model_type='s', # s m l x tiny nano
use_att='ASFF',
head=dict(
type='TOODHead',
model_type='s',
obj_loss_type='BCE',
reg_loss_type='giou',
num_classes=80,
decode_in_inference=
        True  # set to False when benchmarking speed, to skip decode and NMS
))
# s m l x
img_scale = (640, 640)
random_size = (14, 26)
scale_ratio = (0.1, 2)
CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
'hair drier', 'toothbrush'
]
# dataset settings
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
train_pipeline = [
dict(type='MMMosaic', img_scale=img_scale, pad_val=114.0),
dict(
type='MMRandomAffine',
scaling_ratio_range=scale_ratio,
border=(-img_scale[0] // 2, -img_scale[1] // 2)),
dict(
        type='MMMixUp',  # used for s m l x; removed for tiny/nano
img_scale=img_scale,
ratio_range=(0.8, 1.6),
pad_val=114.0),
dict(
type='MMPhotoMetricDistortion',
brightness_delta=32,
contrast_range=(0.5, 1.5),
saturation_range=(0.5, 1.5),
hue_delta=18),
dict(type='MMRandomFlip', flip_ratio=0.5),
dict(type='MMResize', keep_ratio=True),
dict(type='MMPad', pad_to_square=True, pad_val=(114.0, 114.0, 114.0)),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
]
test_pipeline = [
dict(type='MMResize', img_scale=img_scale, keep_ratio=True),
dict(type='MMPad', pad_to_square=True, pad_val=(114.0, 114.0, 114.0)),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='DefaultFormatBundle'),
dict(type='Collect', keys=['img'])
]
train_dataset = dict(
type='DetImagesMixDataset',
data_source=dict(
type='DetSourceCoco',
ann_file=data_root + 'annotations/instances_train2017.json',
img_prefix=data_root + 'train2017/',
pipeline=[
dict(type='LoadImageFromFile', to_float32=True),
dict(type='LoadAnnotations', with_bbox=True)
],
classes=CLASSES,
filter_empty_gt=True,
iscrowd=False),
pipeline=train_pipeline,
dynamic_scale=img_scale)
val_dataset = dict(
type='DetImagesMixDataset',
imgs_per_gpu=2,
data_source=dict(
type='DetSourceCoco',
ann_file=data_root + 'annotations/instances_val2017.json',
img_prefix=data_root + 'val2017/',
pipeline=[
dict(type='LoadImageFromFile', to_float32=True),
dict(type='LoadAnnotations', with_bbox=True)
],
classes=CLASSES,
filter_empty_gt=False,
test_mode=True,
iscrowd=True),
pipeline=test_pipeline,
dynamic_scale=None,
label_padding=False)
data = dict(
imgs_per_gpu=16, workers_per_gpu=4, train=train_dataset, val=val_dataset)
# additional hooks
interval = 10
custom_hooks = [
dict(
type='YOLOXModeSwitchHook',
no_aug_epochs=15,
skip_type_keys=('MMMosaic', 'MMRandomAffine', 'MMMixUp'),
priority=48),
dict(
type='SyncRandomSizeHook',
ratio_range=random_size,
img_scale=img_scale,
interval=interval,
priority=48),
dict(
type='SyncNormHook',
num_last_epochs=15,
interval=interval,
priority=48)
]
# evaluation
eval_config = dict(
interval=10,
gpu_collect=False,
visualization_config=dict(
vis_num=10,
score_thr=0.5,
    )  # visualized by TensorboardLoggerHookV2 and WandbLoggerHookV2
)
eval_pipelines = [
dict(
mode='test',
data=data['val'],
evaluators=[dict(type='CocoDetectionEvaluator', classes=CLASSES)],
)
]
checkpoint_config = dict(interval=interval)
# optimizer
optimizer = dict(
type='SGD', lr=0.02, momentum=0.9, weight_decay=5e-4, nesterov=True)
optimizer_config = {}
# learning policy
lr_config = dict(
policy='YOLOX',
warmup='exp',
by_epoch=False,
warmup_by_epoch=True,
warmup_ratio=1,
    warmup_iters=5,  # 5 epochs
num_last_epochs=15,
min_lr_ratio=0.05)
# exponential moving average (EMA) of model weights
ema = dict(decay=0.9998)
# runtime settings
total_epochs = 300
# yapf:disable
log_config = dict(
interval=100,
hooks=[
dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHookV2'),
# dict(type='WandbLoggerHookV2'),
])
export = dict(
    export_type='ori',
    preprocess_jit=False,
    batch_size=1,
    blade_config=dict(enable_fp16=True, fp16_fallback_op_ratio=0.01),
    use_trt_efficientnms=False)

View File

@ -1,7 +1,7 @@
_base_ = './yolox_s_8xb16_300e_coco.py'
# model settings
model = dict(model_type='l')
model = dict(model_type='l', head=dict(model_type='l', ))
data = dict(imgs_per_gpu=8, workers_per_gpu=4)

View File

@ -1,4 +1,4 @@
_base_ = './yolox_s_8xb16_300e_coco.py'
# model settings
model = dict(model_type='m')
model = dict(model_type='m', head=dict(model_type='m', ))

View File

@ -1,4 +1,4 @@
_base_ = './yolox_tiny_8xb16_300e_coco.py'
# model settings
model = dict(model_type='nano')
model = dict(model_type='nano', head=dict(model_type='nano', ))

View File

@ -3,10 +3,17 @@ _base_ = '../../base.py'
# model settings s m l x
model = dict(
type='YOLOX',
num_classes=80,
model_type='s', # s m l x tiny nano
test_conf=0.01,
nms_thre=0.65)
nms_thre=0.65,
backbone='CSPDarknet',
model_type='s', # s m l x tiny nano
head=dict(
type='YOLOXHead',
model_type='s',
obj_loss_type='BCE',
reg_loss_type='giou',
num_classes=80,
decode_in_inference=True))
# s m l x
img_scale = (640, 640)
@ -36,6 +43,7 @@ CLASSES = [
# dataset settings
data_root = 'data/coco/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
@ -82,7 +90,7 @@ train_dataset = dict(
dict(type='LoadAnnotations', with_bbox=True)
],
classes=CLASSES,
filter_empty_gt=False,
filter_empty_gt=True,
iscrowd=False),
pipeline=train_pipeline,
dynamic_scale=img_scale)
@ -100,6 +108,7 @@ val_dataset = dict(
],
classes=CLASSES,
filter_empty_gt=False,
test_mode=True,
iscrowd=True),
pipeline=test_pipeline,
dynamic_scale=None,
@ -179,4 +188,4 @@ log_config = dict(
# dict(type='WandbLoggerHookV2'),
])
export = dict(use_jit=False, export_blade=False, end2end=False)
export = dict(
    export_type='raw',
    preprocess_jit=False,
    batch_size=1,
    blade_config=dict(enable_fp16=True, fp16_fallback_op_ratio=0.01),
    use_trt_efficientnms=False)

View File

@ -1,7 +1,7 @@
_base_ = './yolox_s_8xb16_300e_coco.py'
# model settings
model = dict(model_type='tiny')
model = dict(model_type='tiny', head=dict(model_type='tiny', ))
CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',

View File

@ -1,7 +1,7 @@
_base_ = './yolox_s_8xb16_300e_coco.py'
# model settings
model = dict(model_type='x')
model = dict(model_type='x', head=dict(model_type='x', ))
data = dict(imgs_per_gpu=8, workers_per_gpu=4)

View File

@ -7,7 +7,6 @@
model = dict(
stage='EDGE',
type='YOLOX_EDGE',
num_classes=1,
model_type='customized',
test_conf=0.01,
nms_thre=0.65,
@ -16,14 +15,19 @@ model = dict(
max_model_params=-1,
max_model_flops=-1,
activation='relu',
)
head=dict(
type='YOLOXHead',
model_type='customized',
num_classes=1,
reg_loss_type='iou',
width=1.0))
# train setting
samples_per_gpu = 16 # batch size per gpu
test_samples_per_gpu = 16 # test batch size per gpu
gpu_num = 2 # gpu number for one worker
total_epochs = 11 # train epoch
interval = 5
interval = 5 # eval interval
# tiny nano without mixup
img_scale = (256, 256)

View File

@ -0,0 +1,238 @@
# model settings
POINT_NUMBER = 106
MEAN_FACE = [
0.05486667535113006, 0.24441904048908245, 0.05469932714062696,
0.30396829196709935, 0.05520653400164321, 0.3643191463607746,
0.05865501342257397, 0.42453849020500306, 0.0661603899137523,
0.48531377442945767, 0.07807677169271177, 0.5452126843738523,
0.09333319368757653, 0.6047840615432064, 0.11331425394034209,
0.6631144309665994, 0.13897813867699352, 0.7172296230155276,
0.17125811033538194, 0.767968859462583, 0.20831698519371536,
0.8146603379935117, 0.24944621000897876, 0.857321261721953,
0.2932993820558674, 0.8973900596678597, 0.33843820185594653,
0.9350576242126986, 0.38647802623495553, 0.966902971122812,
0.4411974776504609, 0.9878629960611088, 0.5000390697219397,
0.9934886214875595, 0.5588590024515473, 0.9878510782414189,
0.6135829360035883, 0.9668655595323074, 0.6616294188166414,
0.9350065330378543, 0.7067734980023662, 0.8973410411573094,
0.7506167730772516, 0.8572957679511382, 0.7917579157122047,
0.8146281598803492, 0.8288026446367324, 0.7679019642224981,
0.8610918526053805, 0.7171624168757985, 0.8867491048162915,
0.6630344261248556, 0.9067293813428708, 0.6047095492618413,
0.9219649147678989, 0.5451295187190602, 0.9338619041815587,
0.4852292097262674, 0.9413455695142587, 0.424454780475834,
0.9447753107545577, 0.3642347111991026, 0.9452649776939869,
0.30388458223793025, 0.9450854849661369, 0.24432737691068557,
0.1594802473020129, 0.17495177946520288, 0.2082918411850002,
0.12758378330875153, 0.27675902873293057, 0.11712230823088154,
0.34660582049732336, 0.12782553369032904, 0.4137234315527489,
0.14788458441422778, 0.4123890243720449, 0.18814226684806626,
0.3498927810760776, 0.17640650480816664, 0.28590212091591866,
0.16895271174960227, 0.22193967489846017, 0.16985862149585013,
0.5861805004572298, 0.147863456192582, 0.6532904167464643,
0.12780412047734288, 0.723142364263288, 0.11709102395419578,
0.7916076475508984, 0.12753867695205595, 0.8404440227263494,
0.17488715120168932, 0.7779848023963316, 0.1698261195288917,
0.7140264757991571, 0.1689377237959271, 0.650024882334848,
0.17640581823811927, 0.5875270068157493, 0.18815421057605972,
0.4999687027691624, 0.2770570778583906, 0.49996466107378934,
0.35408433007759227, 0.49996725190415664, 0.43227025345368053,
0.49997367716346774, 0.5099309118810921, 0.443147025685285,
0.2837021691260901, 0.4079306716593004, 0.4729519900478952,
0.3786223176615041, 0.5388017782630576, 0.4166237366074797,
0.5822229552544941, 0.4556754522760756, 0.5887956328134262,
0.49998730493119997, 0.5951855531982454, 0.5443300921009105,
0.5887796732983633, 0.5833722476054509, 0.582200985012979,
0.6213509190608012, 0.5387760772258134, 0.5920137550293199,
0.4729325070035326, 0.5567854054587345, 0.28368589871138317,
0.23395988420439123, 0.275313734012504, 0.27156519109550253,
0.2558735678926061, 0.31487949633428597, 0.2523033259214858,
0.356919009399118, 0.2627342680634766, 0.3866625969903256,
0.2913618036573405, 0.3482919069920915, 0.3009936818974329,
0.3064437008415846, 0.3037349617842158, 0.26724000706363993,
0.2961896087804692, 0.3135744691699477, 0.27611103614975246,
0.6132904312551143, 0.29135144033587107, 0.6430396927648264,
0.2627079452269443, 0.6850713556136455, 0.2522730391144915,
0.728377707003201, 0.25583118190779625, 0.7660035591791254,
0.27526375689471777, 0.7327054300488236, 0.2961495286346863,
0.6935171517115648, 0.3036951925380769, 0.6516533228539426,
0.3009921014909089, 0.6863983789278025, 0.2760904908649394,
0.35811903020866753, 0.7233174007629063, 0.4051199834269763,
0.6931800846807724, 0.4629631471997891, 0.6718031951363689,
0.5000016063148277, 0.6799150331999366, 0.5370506360177653,
0.6717809139952097, 0.5948714927411151, 0.6931581144392573,
0.6418878095835022, 0.7232890570786875, 0.6088129582142587,
0.7713407215524752, 0.5601450388292929, 0.8052499757498277,
0.5000181358125715, 0.8160749831906926, 0.4398905591799545,
0.8052697696938342, 0.39120318265892984, 0.771375905028864,
0.36888771299734613, 0.7241751210643214, 0.4331097084010058,
0.7194543690519717, 0.5000188612450743, 0.7216823277180712,
0.566895861884284, 0.7194302225129479, 0.631122598507516,
0.7241462073974219, 0.5678462302796355, 0.7386355816766528,
0.5000082906571756, 0.7479600838019628, 0.43217532542902076,
0.7386538729390463, 0.31371761254774383, 0.2753328284323114,
0.6862487843823917, 0.2752940437017121
]
IMAGE_SIZE = 96
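
MEAN_FACE above is a flat list of POINT_NUMBER * 2 values. Assuming they are (x, y) landmark coordinates normalized to [0, 1] (consistent with every entry lying in that range), they can be turned into pixel positions on the 96x96 input like this (only the first two landmarks are copied here):

POINT_NUMBER = 106
IMAGE_SIZE = 96
assert POINT_NUMBER * 2 == 212  # the length the full MEAN_FACE list above is expected to have
mean_face = [
    0.05486667535113006, 0.24441904048908245,  # landmark 0 (x, y), copied from MEAN_FACE above
    0.05469932714062696, 0.30396829196709935,  # landmark 1
]
points = [(x * IMAGE_SIZE, y * IMAGE_SIZE)
          for x, y in zip(mean_face[0::2], mean_face[1::2])]
print(points)  # pixel coordinates, under the normalization assumption
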
loss_config = dict(
num_points=POINT_NUMBER,
left_eye_left_corner_index=66,
right_eye_right_corner_index=79,
points_weight=1.0,
contour_weight=1.5,
eyebrow_weight=1.5,
eye_weight=1.7,
nose_weight=1.3,
lip_weight=1.7,
omega=10,
epsilon=2)
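
omega and epsilon in loss_config above parameterize the Wing loss used by WingLossWithPose. A minimal sketch of the standard Wing loss on a single residual (the per-region weights above are applied on top of this and are not shown; treat it as an illustration, not EasyCV's exact code):

import math

def wing_loss(x, omega=10.0, epsilon=2.0):
    # constant C makes the log branch and the linear branch meet at |x| == omega
    c = omega - omega * math.log(1.0 + omega / epsilon)
    if abs(x) < omega:
        return omega * math.log(1.0 + abs(x) / epsilon)
    return abs(x) - c

print(wing_loss(0.5), wing_loss(20.0))
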
model = dict(
type='FaceKeypoint',
backbone=dict(
type='FaceKeypointBackbone',
in_channels=3,
out_channels=48,
residual_activation='relu',
inverted_activation='half_v2',
inverted_expand_ratio=2,
),
keypoint_head=dict(
type='FaceKeypointHead',
in_channels=48,
out_channels=POINT_NUMBER * 2,
input_size=IMAGE_SIZE,
inverted_expand_ratio=2,
inverted_activation='half_v2',
mean_face=MEAN_FACE,
loss_keypoint=dict(type='WingLossWithPose', **loss_config),
),
pose_head=dict(
type='FacePoseHead',
in_channels=48,
out_channels=3,
inverted_expand_ratio=2,
inverted_activation='half_v2',
loss_pose=dict(type='FacePoseLoss', pose_weight=0.01),
),
)
train_pipeline = [
dict(type='FaceKeypointRandomAugmentation', input_size=IMAGE_SIZE),
dict(type='FaceKeypointNorm', input_size=IMAGE_SIZE),
dict(type='MMToTensor'),
dict(
type='NormalizeTensor',
mean=[0.4076, 0.458, 0.485],
std=[1.0, 1.0, 1.0]),
dict(
type='Collect',
keys=[
'img', 'target_point', 'target_point_mask', 'target_pose',
'target_pose_mask'
])
]
val_pipeline = [
dict(type='FaceKeypointNorm', input_size=IMAGE_SIZE),
dict(type='MMToTensor'),
dict(
type='NormalizeTensor',
mean=[0.4076, 0.458, 0.485],
std=[1.0, 1.0, 1.0]),
dict(
type='Collect',
keys=[
'img', 'target_point', 'target_point_mask', 'target_pose',
'target_pose_mask'
])
]
test_pipeline = val_pipeline
data_root = 'path/to/face_landmark_data/'
data_cfg = dict(
data_root=data_root,
input_size=IMAGE_SIZE,
)
data = dict(
imgs_per_gpu=512,
workers_per_gpu=2,
train=dict(
type='FaceKeypointDataset',
data_source=dict(
type='FaceKeypintSource',
train=True,
data_range=[0, 30000], # [0,30000] [0,478857]
data_cfg=data_cfg,
),
pipeline=train_pipeline),
val=dict(
type='FaceKeypointDataset',
data_source=dict(
type='FaceKeypintSource',
train=False,
data_range=[478857, 488857],
# data_range=[478857, 478999], #[478857, 478999] [478857, 488857]
data_cfg=data_cfg,
),
pipeline=val_pipeline),
test=dict(
type='FaceKeypointDataset',
data_source=dict(
type='FaceKeypintSource',
train=False,
data_range=[478857, 488857],
# data_range=[478857, 478999], #[478857, 478999] [478857, 488857]
data_cfg=data_cfg,
),
pipeline=test_pipeline),
)
# runtime setting
optimizer = dict(
type='Adam',
lr=0.005,
)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
policy='CosineAnnealing',
min_lr=0.00001,
warmup='linear',
warmup_iters=10,
warmup_ratio=0.001,
warmup_by_epoch=True,
by_epoch=True)
total_epochs = 1000
checkpoint_config = dict(interval=10)
log_config = dict(
interval=5, hooks=[
dict(type='TextLoggerHook'),
])
predict = dict(type='FaceKeypointsPredictor')
log_level = 'INFO'
load_from = None
resume_from = None
dist_params = dict(backend='nccl')
workflow = [('train', 1)]
# disable opencv multithreading to avoid system being overloaded
opencv_num_threads = 0
# set multi-process start method as `fork` to speed up the training
mp_start_method = 'fork'
evaluation = dict(interval=1, metric=['NME'], save_best='NME')
eval_config = dict(interval=1)
evaluator_args = dict(metric_names='ave_nme')
eval_pipelines = [
dict(
mode='test',
data=dict(**data['val'], imgs_per_gpu=1),
evaluators=[dict(type='FaceKeypointEvaluator', **evaluator_args)])
]

View File

@ -0,0 +1,190 @@
# oss_io_config = dict(
# ak_id='your oss ak id',
# ak_secret='your oss ak secret',
# hosts='oss-cn-zhangjiakou.aliyuncs.com', # your oss hosts
# buckets=['your_bucket']) # your oss buckets
oss_sync_config = dict(other_file_list=['**/events.out.tfevents*', '**/*log*'])
log_level = 'INFO'
load_from = None
resume_from = None
dist_params = dict(backend='nccl')
workflow = [('train', 1)]
checkpoint_config = dict(interval=10)
optimizer = dict(type='Adam', lr=5e-4)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=500,
warmup_ratio=0.001,
step=[170, 200])
total_epochs = 210
log_config = dict(
interval=50,
hooks=[dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHook')])
channel_cfg = dict(
num_output_channels=21,
dataset_joints=21,
dataset_channel=[
[
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20
],
],
inference_channel=[
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20
])
# model settings
model = dict(
type='TopDown',
pretrained=False,
backbone=dict(
type='HRNet',
in_channels=3,
extra=dict(
stage1=dict(
num_modules=1,
num_branches=1,
block='BOTTLENECK',
num_blocks=(4, ),
num_channels=(64, )),
stage2=dict(
num_modules=1,
num_branches=2,
block='BASIC',
num_blocks=(4, 4),
num_channels=(18, 36)),
stage3=dict(
num_modules=4,
num_branches=3,
block='BASIC',
num_blocks=(4, 4, 4),
num_channels=(18, 36, 72)),
stage4=dict(
num_modules=3,
num_branches=4,
block='BASIC',
num_blocks=(4, 4, 4, 4),
num_channels=(18, 36, 72, 144),
multiscale_output=True),
upsample=dict(mode='bilinear', align_corners=False))),
keypoint_head=dict(
type='TopdownHeatmapSimpleHead',
in_channels=[18, 36, 72, 144],
in_index=(0, 1, 2, 3),
input_transform='resize_concat',
out_channels=channel_cfg['num_output_channels'],
num_deconv_layers=0,
extra=dict(
final_conv_kernel=1, num_conv_layers=1, num_conv_kernels=(1, )),
loss_keypoint=dict(type='JointsMSELoss', use_target_weight=True)),
train_cfg=dict(),
test_cfg=dict(
flip_test=True,
post_process='unbiased',
shift_heatmap=True,
modulate_kernel=11))
data_root = 'data/coco'
data_cfg = dict(
image_size=[256, 256],
heatmap_size=[64, 64],
num_output_channels=channel_cfg['num_output_channels'],
num_joints=channel_cfg['dataset_joints'],
dataset_channel=channel_cfg['dataset_channel'],
inference_channel=channel_cfg['inference_channel'],
)
train_pipeline = [
# dict(type='TopDownGetBboxCenterScale', padding=1.25),
dict(type='TopDownRandomFlip', flip_prob=0.5),
dict(
type='TopDownGetRandomScaleRotation', rot_factor=30,
scale_factor=0.25),
dict(type='TopDownAffine'),
dict(type='MMToTensor'),
dict(
type='NormalizeTensor',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
dict(type='TopDownGenerateTarget', sigma=3),
dict(
type='PoseCollect',
keys=['img', 'target', 'target_weight'],
meta_keys=[
'image_file', 'image_id', 'joints_3d', 'joints_3d_visible',
'center', 'scale', 'rotation', 'flip_pairs'
])
]
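
TopDownGenerateTarget above renders each keypoint as a Gaussian blob on the 64x64 heatmap with sigma=3. A minimal sketch of that idea (the real transform also handles target weights and out-of-range joints, which are omitted here):

import numpy as np

def gaussian_heatmap(center_xy, heatmap_size=(64, 64), sigma=3.0):
    # center_xy is the keypoint location in heatmap coordinates
    xs = np.arange(heatmap_size[1])
    ys = np.arange(heatmap_size[0])[:, None]
    cx, cy = center_xy
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

hm = gaussian_heatmap((32, 20))
print(hm.shape, float(hm.max()))  # (64, 64) 1.0
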
val_pipeline = [
dict(type='TopDownAffine'),
dict(type='MMToTensor'),
dict(
type='NormalizeTensor',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
dict(
type='PoseCollect',
keys=['img'],
meta_keys=[
'image_file', 'image_id', 'center', 'scale', 'rotation',
'flip_pairs'
])
]
test_pipeline = val_pipeline
data_source_cfg = dict(type='HandCocoPoseTopDownSource', data_cfg=data_cfg)
data = dict(
imgs_per_gpu=32, # for train
workers_per_gpu=2, # for train
# imgs_per_gpu=1, # for test
# workers_per_gpu=1, # for test
val_dataloader=dict(samples_per_gpu=32),
test_dataloader=dict(samples_per_gpu=32),
train=dict(
type='HandCocoWholeBodyDataset',
data_source=dict(
ann_file=f'{data_root}/annotations/coco_wholebody_train_v1.0.json',
img_prefix=f'{data_root}/train2017/',
**data_source_cfg),
pipeline=train_pipeline),
val=dict(
type='HandCocoWholeBodyDataset',
data_source=dict(
ann_file=f'{data_root}/annotations/coco_wholebody_val_v1.0.json',
img_prefix=f'{data_root}/val2017/',
test_mode=True,
**data_source_cfg),
pipeline=val_pipeline),
test=dict(
type='HandCocoWholeBodyDataset',
data_source=dict(
ann_file=f'{data_root}/annotations/coco_wholebody_val_v1.0.json',
img_prefix=f'{data_root}/val2017/',
test_mode=True,
**data_source_cfg),
pipeline=val_pipeline),
)
eval_config = dict(interval=10, metric='PCK', save_best='PCK')
evaluator_args = dict(
metric_names=['PCK', 'AUC', 'EPE', 'NME'], pck_thr=0.2, auc_nor=30)
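
The KeyPointEvaluator above reports PCK with pck_thr=0.2 (plus AUC, EPE and NME). A minimal sketch of the PCK computation (the normalization reference, e.g. the bounding-box size, is an assumption here and depends on the evaluator's settings):

import numpy as np

def pck(pred, gt, norm, thr=0.2):
    # pred, gt: (N, K, 2) keypoints; norm: (N,) per-sample normalization, e.g. bbox size
    dist = np.linalg.norm(pred - gt, axis=-1) / norm[:, None]
    return float((dist <= thr).mean())

pred = np.zeros((1, 21, 2))
gt = np.full((1, 21, 2), 3.0)
print(pck(pred, gt, norm=np.array([30.0])))  # every joint is within 0.2 * 30 pixels -> 1.0
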
eval_pipelines = [
dict(
mode='test',
data=dict(**data['val'], imgs_per_gpu=1),
evaluators=[dict(type='KeyPointEvaluator', **evaluator_args)])
]
export = dict(use_jit=False)
checkpoint_sync_export = True
predict = dict(type='HandKeypointsPredictor')

View File

@ -0,0 +1,176 @@
# oss_io_config = dict(
# ak_id='your oss ak id',
# ak_secret='your oss ak secret',
# hosts='oss-cn-zhangjiakou.aliyuncs.com', # your oss hosts
# buckets=['your_bucket']) # your oss buckets
oss_sync_config = dict(other_file_list=['**/events.out.tfevents*', '**/*log*'])
log_level = 'INFO'
load_from = None
resume_from = None
dist_params = dict(backend='nccl')
workflow = [('train', 1)]
checkpoint_config = dict(interval=10)
optimizer = dict(type='Adam', lr=5e-4)
optimizer_config = dict(grad_clip=None)
# learning policy
lr_config = dict(
policy='step',
warmup='linear',
warmup_iters=500,
warmup_ratio=0.001,
step=[170, 200])
total_epochs = 210
log_config = dict(
interval=50,
hooks=[dict(type='TextLoggerHook'),
dict(type='TensorboardLoggerHook')])
channel_cfg = dict(
num_output_channels=21,
dataset_joints=21,
dataset_channel=[
[
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20
],
],
inference_channel=[
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20
])
# model settings
model = dict(
type='TopDown',
pretrained=False,
backbone=dict(
type='LiteHRNet',
in_channels=3,
extra=dict(
stem=dict(stem_channels=32, out_channels=32, expand_ratio=1),
num_stages=3,
stages_spec=dict(
num_modules=(3, 8, 3),
num_branches=(2, 3, 4),
num_blocks=(2, 2, 2),
module_type=('LITE', 'LITE', 'LITE'),
with_fuse=(True, True, True),
reduce_ratios=(8, 8, 8),
num_channels=(
(40, 80),
(40, 80, 160),
(40, 80, 160, 320),
)),
with_head=True,
)),
keypoint_head=dict(
type='TopdownHeatmapSimpleHead',
in_channels=40,
out_channels=channel_cfg['num_output_channels'],
num_deconv_layers=0,
extra=dict(final_conv_kernel=1, ),
loss_keypoint=dict(type='JointsMSELoss', use_target_weight=True)),
train_cfg=dict(),
test_cfg=dict(
flip_test=True,
post_process='default',
shift_heatmap=True,
modulate_kernel=11))
data_root = 'data/coco'
data_cfg = dict(
image_size=[256, 256],
heatmap_size=[64, 64],
num_output_channels=channel_cfg['num_output_channels'],
num_joints=channel_cfg['dataset_joints'],
dataset_channel=channel_cfg['dataset_channel'],
inference_channel=channel_cfg['inference_channel'],
)
train_pipeline = [
# dict(type='TopDownGetBboxCenterScale', padding=1.25),
dict(type='TopDownRandomFlip', flip_prob=0.5),
dict(
type='TopDownGetRandomScaleRotation', rot_factor=30,
scale_factor=0.25),
dict(type='TopDownAffine'),
dict(type='MMToTensor'),
dict(
type='NormalizeTensor',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
dict(type='TopDownGenerateTarget', sigma=3),
dict(
type='PoseCollect',
keys=['img', 'target', 'target_weight'],
meta_keys=[
'image_file', 'image_id', 'joints_3d', 'joints_3d_visible',
'center', 'scale', 'rotation', 'flip_pairs'
])
]
val_pipeline = [
dict(type='TopDownAffine'),
dict(type='MMToTensor'),
dict(
type='NormalizeTensor',
mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225]),
dict(
type='PoseCollect',
keys=['img'],
meta_keys=[
'image_file', 'image_id', 'center', 'scale', 'rotation',
'flip_pairs'
])
]
test_pipeline = val_pipeline
data_source_cfg = dict(type='HandCocoPoseTopDownSource', data_cfg=data_cfg)
data = dict(
imgs_per_gpu=32, # for train
workers_per_gpu=2, # for train
# imgs_per_gpu=1, # for test
# workers_per_gpu=1, # for test
val_dataloader=dict(samples_per_gpu=32),
test_dataloader=dict(samples_per_gpu=32),
train=dict(
type='HandCocoWholeBodyDataset',
data_source=dict(
ann_file=f'{data_root}/annotations/coco_wholebody_train_v1.0.json',
img_prefix=f'{data_root}/train2017/',
**data_source_cfg),
pipeline=train_pipeline),
val=dict(
type='HandCocoWholeBodyDataset',
data_source=dict(
ann_file=f'{data_root}/annotations/coco_wholebody_val_v1.0.json',
img_prefix=f'{data_root}/val2017/',
test_mode=True,
**data_source_cfg),
pipeline=val_pipeline),
test=dict(
type='HandCocoWholeBodyDataset',
data_source=dict(
ann_file=f'{data_root}/annotations/coco_wholebody_val_v1.0.json',
img_prefix=f'{data_root}/val2017/',
test_mode=True,
**data_source_cfg),
pipeline=val_pipeline),
)
eval_config = dict(interval=10, metric='PCK', save_best='PCK')
evaluator_args = dict(
metric_names=['PCK', 'AUC', 'EPE', 'NME'], pck_thr=0.2, auc_nor=30)
eval_pipelines = [
dict(
mode='test',
data=dict(**data['val'], imgs_per_gpu=1),
evaluators=[dict(type='KeyPointEvaluator', **evaluator_args)])
]
export = dict(use_jit=False)
checkpoint_sync_export = True

View File

@ -66,7 +66,7 @@ train_pipeline = [
dict(type='MMRandomFlip', flip_ratio=0.5),
dict(type='MMPhotoMetricDistortion'),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size=crop_size, pad_val=0, seg_pad_val=255),
dict(type='MMPad', size=crop_size),
dict(type='DefaultFormatBundle'),
dict(
type='Collect',

View File

@ -0,0 +1,249 @@
# segformer of B0
CLASSES = [
'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train',
'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign',
'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag',
'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite',
'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon',
'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot',
'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant',
'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
'hair drier', 'toothbrush', 'banner', 'blanket', 'branch', 'bridge',
'building-other', 'bush', 'cabinet', 'cage', 'cardboard', 'carpet',
'ceiling-other', 'ceiling-tile', 'cloth', 'clothes', 'clouds', 'counter',
'cupboard', 'curtain', 'desk-stuff', 'dirt', 'door-stuff', 'fence',
'floor-marble', 'floor-other', 'floor-stone', 'floor-tile', 'floor-wood',
'flower', 'fog', 'food-other', 'fruit', 'furniture-other', 'grass',
'gravel', 'ground-other', 'hill', 'house', 'leaves', 'light', 'mat',
'metal', 'mirror-stuff', 'moss', 'mountain', 'mud', 'napkin', 'net',
'paper', 'pavement', 'pillow', 'plant-other', 'plastic', 'platform',
'playingfield', 'railing', 'railroad', 'river', 'road', 'rock', 'roof',
'rug', 'salad', 'sand', 'sea', 'shelf', 'sky-other', 'skyscraper', 'snow',
'solid-other', 'stairs', 'stone', 'straw', 'structural-other', 'table',
'tent', 'textile-other', 'towel', 'tree', 'vegetable', 'wall-brick',
'wall-concrete', 'wall-other', 'wall-panel', 'wall-stone', 'wall-tile',
'wall-wood', 'water-other', 'waterdrops', 'window-blind', 'window-other',
'wood'
]
PALETTE = [[0, 192, 64], [0, 192, 64], [0, 64, 96],
[128, 192, 192], [0, 64, 64], [0, 192, 224], [0, 192, 192],
[128, 192, 64], [0, 192, 96], [128, 192, 64], [128, 32, 192],
[0, 0, 224], [0, 0, 64], [0, 160, 192], [128, 0, 96], [128, 0, 192],
[0, 32, 192], [128, 128, 224], [0, 0, 192], [128, 160, 192],
[128, 128, 0], [128, 0, 32], [128, 32, 0], [128, 0, 128],
[64, 128, 32], [0, 160, 0], [0, 0, 0], [192, 128, 160], [0, 32, 0],
[0, 128, 128], [64, 128, 160], [128, 160, 0], [0, 128, 0],
[192, 128, 32], [128, 96, 128], [0, 0, 128], [64, 0, 32],
[0, 224, 128], [128, 0, 0], [192, 0, 160], [0, 96, 128],
[128, 128, 128], [64, 0, 160], [128, 224, 128], [128, 128,
64], [192, 0, 32],
[128, 96, 0], [128, 0, 192], [0, 128, 32], [64, 224, 0], [0, 0, 64],
[128, 128, 160], [64, 96, 0], [0, 128, 192], [0, 128, 160],
[192, 224, 0], [0, 128, 64], [128, 128, 32], [192, 32, 128],
[0, 64, 192], [0, 0, 32], [64, 160, 128], [128, 64, 64],
[128, 0, 160], [64, 32, 128], [128, 192, 192], [0, 0, 160],
[192, 160, 128], [128, 192, 0], [128, 0, 96], [192, 32, 0],
[128, 64, 128], [64, 128, 96], [64, 160, 0], [0, 64, 0],
[192, 128, 224], [64, 32, 0], [0, 192, 128], [64, 128, 224],
[192, 160, 0], [0, 192, 0], [192, 128, 96], [192, 96, 128],
[0, 64, 128], [64, 0, 96], [64, 224, 128], [128, 64, 0],
[192, 0, 224], [64, 96, 128], [128, 192, 128], [64, 0, 224],
[192, 224, 128], [128, 192, 64], [192, 0, 96], [192, 96, 0],
[128, 64, 192], [0, 128, 96], [0, 224, 0], [64, 64, 64],
[128, 128, 224], [0, 96, 0], [64, 192, 192], [0, 128, 224],
[128, 224, 0], [64, 192, 64], [128, 128, 96], [128, 32, 128],
[64, 0, 192], [0, 64, 96], [0, 160, 128], [192, 0, 64],
[128, 64, 224], [0, 32, 128], [192, 128, 192], [0, 64, 224],
[128, 160, 128], [192, 128, 0], [128, 64, 32], [128, 32, 64],
[192, 0, 128], [64, 192, 32], [0, 160, 64], [64, 0, 0],
[192, 192, 160], [0, 32, 64], [64, 128, 128], [64, 192, 160],
[128, 160, 64], [64, 128, 0], [192, 192, 32], [128, 96, 192],
[64, 0, 128], [64, 64, 32], [0, 224, 192], [192, 0, 0],
[192, 64, 160], [0, 96, 192], [192, 128, 128], [64, 64, 160],
[128, 224, 192], [192, 128, 64], [192, 64, 32], [128, 96, 64],
[192, 0, 192], [0, 192, 32], [64, 224, 64], [64, 0, 64],
[128, 192, 160], [64, 96, 64], [64, 128, 192], [0, 192, 160],
[192, 224, 64], [64, 128, 64], [128, 192, 32], [192, 32, 192],
[64, 64, 192], [0, 64, 32], [64, 160, 192], [192, 64, 64],
[128, 64, 160], [64, 32, 192], [192, 192, 192], [0, 64, 160],
[192, 160, 192], [192, 192, 0], [128, 64, 96], [192, 32, 64],
[192, 64, 128], [64, 192, 96], [64, 160, 64], [64, 64, 0]]
num_classes = 172
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
type='EncoderDecoder',
pretrained=
'https://download.openmmlab.com/mmsegmentation/v0.5/pretrain/segformer/mit_b0_20220624-7e0fe6dd.pth',
backbone=dict(
type='MixVisionTransformer',
in_channels=3,
embed_dims=32,
num_stages=4,
num_layers=[2, 2, 2, 2],
num_heads=[1, 2, 5, 8],
patch_sizes=[7, 3, 3, 3],
sr_ratios=[8, 4, 2, 1],
out_indices=(0, 1, 2, 3),
mlp_ratio=4,
qkv_bias=True,
drop_rate=0.0,
attn_drop_rate=0.0,
drop_path_rate=0.1),
decode_head=dict(
type='SegformerHead',
in_channels=[32, 64, 160, 256],
in_index=[0, 1, 2, 3],
channels=256,
dropout_ratio=0.1,
num_classes=num_classes,
norm_cfg=norm_cfg,
align_corners=False,
loss_decode=dict(
type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
# model training and testing settings
train_cfg=dict(),
test_cfg=dict(mode='whole'))
# dataset settings
dataset_type = 'SegDataset'
data_root = './data/coco_stuff164k/'
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (512, 512)
train_pipeline = [
dict(type='MMResize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
dict(type='SegRandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
dict(type='MMRandomFlip', flip_ratio=0.5),
dict(type='MMPhotoMetricDistortion'),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size=crop_size, pad_val=dict(img=0, masks=0, seg=255)),
dict(type='DefaultFormatBundle'),
dict(
type='Collect',
keys=['img', 'gt_semantic_seg'],
meta_keys=('filename', 'ori_filename', 'ori_shape', 'img_shape',
'pad_shape', 'scale_factor', 'flip', 'flip_direction',
'img_norm_cfg')),
]
test_pipeline = [
dict(
type='MMMultiScaleFlipAug',
img_scale=(2048, 512),
# img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
flip=False,
transforms=[
dict(type='MMResize', keep_ratio=True),
dict(type='MMRandomFlip'),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(
type='Collect',
keys=['img'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'img_shape', 'pad_shape', 'scale_factor', 'flip',
'flip_direction', 'img_norm_cfg')),
])
]
data = dict(
imgs_per_gpu=2,
workers_per_gpu=2,
train=dict(
type=dataset_type,
ignore_index=255,
data_source=dict(
type='SegSourceRaw',
img_suffix='.jpg',
label_suffix='_labelTrainIds.png',
img_root=data_root + 'train2017/',
label_root=data_root + 'annotations/train2017/',
split=data_root + 'train.txt',
classes=CLASSES,
),
pipeline=train_pipeline),
val=dict(
imgs_per_gpu=1,
ignore_index=255,
type=dataset_type,
data_source=dict(
type='SegSourceRaw',
img_suffix='.jpg',
label_suffix='_labelTrainIds.png',
img_root=data_root + 'val2017/',
label_root=data_root + 'annotations/val2017',
split=data_root + 'val.txt',
classes=CLASSES,
),
pipeline=test_pipeline),
test=dict(
imgs_per_gpu=1,
type=dataset_type,
data_source=dict(
type='SegSourceRaw',
img_suffix='.jpg',
label_suffix='_labelTrainIds.png',
img_root=data_root + 'val2017/',
label_root=data_root + 'annotations/val2017',
split=data_root + 'val.txt',
classes=CLASSES,
),
pipeline=test_pipeline))
optimizer = dict(
type='AdamW',
lr=6e-05,
betas=(0.9, 0.999),
weight_decay=0.01,
paramwise_options=dict(
custom_keys=dict(
pos_block=dict(decay_mult=0.0),
norm=dict(decay_mult=0.0),
head=dict(lr_mult=10.0))))
optimizer_config = dict()
lr_config = dict(
policy='poly',
warmup='linear',
warmup_iters=1500,
warmup_ratio=1e-06,
power=1.0,
min_lr=0.0,
by_epoch=False)
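
The poly policy above decays the learning rate per iteration after a linear warmup. A minimal sketch of the usual mmcv-style rule (max_iters is a hypothetical value; in practice it comes from the runner):

base_lr = 6e-05
min_lr = 0.0
power = 1.0
warmup_iters = 1500
warmup_ratio = 1e-06
max_iters = 80000  # hypothetical total iteration count

def poly_lr(it):
    if it < warmup_iters:
        # linear warmup from warmup_ratio * base_lr up to base_lr
        k = (1 - it / warmup_iters) * (1 - warmup_ratio)
        return base_lr * (1 - k)
    coeff = (1 - it / max_iters) ** power
    return (base_lr - min_lr) * coeff + min_lr

print(poly_lr(0), poly_lr(warmup_iters), poly_lr(max_iters // 2))
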
# runtime settings
total_epochs = 30
checkpoint_config = dict(interval=1)
eval_config = dict(interval=1, gpu_collect=False)
eval_pipelines = [
dict(
mode='test',
evaluators=[
dict(
type='SegmentationEvaluator',
classes=CLASSES,
metric_names=['mIoU'])
],
)
]
predict = dict(type='SegmentationPredictor')
log_config = dict(
interval=50,
hooks=[
dict(type='TextLoggerHook'),
# dict(type='TensorboardLoggerHook')
])
dist_params = dict(backend='nccl')
cudnn_benchmark = False
log_level = 'INFO'
load_from = None
resume_from = None
workflow = [('train', 1)]

View File

@ -0,0 +1,8 @@
_base_ = './segformer_b0_coco.py'
model = dict(
pretrained=
'https://download.openmmlab.com/mmsegmentation/v0.5/pretrain/segformer/mit_b1_20220624-02e5a6a1.pth',
backbone=dict(embed_dims=64, ),
decode_head=dict(in_channels=[64, 128, 320, 512], ),
)

View File

@ -0,0 +1,14 @@
_base_ = './segformer_b0_coco.py'
model = dict(
pretrained=
'https://download.openmmlab.com/mmsegmentation/v0.5/pretrain/segformer/mit_b2_20220624-66e8bf70.pth',
backbone=dict(
embed_dims=64,
num_layers=[3, 4, 6, 3],
),
decode_head=dict(
in_channels=[64, 128, 320, 512],
channels=768,
),
)

View File

@ -0,0 +1,14 @@
_base_ = './segformer_b0_coco.py'
model = dict(
pretrained=
'https://download.openmmlab.com/mmsegmentation/v0.5/pretrain/segformer/mit_b3_20220624-13b1141c.pth',
backbone=dict(
embed_dims=64,
num_layers=[3, 4, 18, 3],
),
decode_head=dict(
in_channels=[64, 128, 320, 512],
channels=768,
),
)

View File

@ -0,0 +1,14 @@
_base_ = './segformer_b0_coco.py'
model = dict(
pretrained=
'https://download.openmmlab.com/mmsegmentation/v0.5/pretrain/segformer/mit_b4_20220624-d588d980.pth',
backbone=dict(
embed_dims=64,
num_layers=[3, 8, 27, 3],
),
decode_head=dict(
in_channels=[64, 128, 320, 512],
channels=768,
),
)

View File

@ -0,0 +1,52 @@
_base_ = './segformer_b0_coco.py'
model = dict(
pretrained=
'https://download.openmmlab.com/mmsegmentation/v0.5/pretrain/segformer/mit_b5_20220624-658746d9.pth',
backbone=dict(
embed_dims=64,
num_layers=[3, 6, 40, 3],
),
decode_head=dict(
in_channels=[64, 128, 320, 512],
channels=768,
),
)
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
crop_size = (640, 640)
train_pipeline = [
dict(type='MMResize', img_scale=(2048, 640), ratio_range=(0.5, 2.0)),
dict(type='SegRandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
dict(type='MMRandomFlip', flip_ratio=0.5),
dict(type='MMPhotoMetricDistortion'),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size=crop_size, pad_val=dict(img=0, masks=0, seg=255)),
dict(type='DefaultFormatBundle'),
dict(
type='Collect',
keys=['img', 'gt_semantic_seg'],
meta_keys=('filename', 'ori_filename', 'ori_shape', 'img_shape',
'pad_shape', 'scale_factor', 'flip', 'flip_direction',
'img_norm_cfg')),
]
test_pipeline = [
dict(
type='MMMultiScaleFlipAug',
img_scale=(2048, 640),
# img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
flip=False,
transforms=[
dict(type='MMResize', keep_ratio=True),
dict(type='MMRandomFlip'),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='ImageToTensor', keys=['img']),
dict(
type='Collect',
keys=['img'],
meta_keys=('filename', 'ori_filename', 'ori_shape',
'img_shape', 'pad_shape', 'scale_factor', 'flip',
'flip_direction', 'img_norm_cfg')),
])
]

View File

@ -65,7 +65,7 @@ train_pipeline = [
dict(type='MMRandomFlip', flip_ratio=0.5),
dict(type='MMPhotoMetricDistortion'),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size=crop_size, pad_val=0, seg_pad_val=255),
dict(type='MMPad', size=crop_size),
dict(type='DefaultFormatBundle'),
dict(
type='Collect',

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:331ead75033fa2f01f6be72a2f8e34d581fcb593308067815d4bb136bb13b766
size 54390

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6904c252a6ffa8702f4c255dafb0b7d03092c402e3c70598adab3f83c3858451
size 36836

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8298b88539874b9914b90122575880a80ca0534499e9be9953e17fc177a1c2d2
size 3421031

View File

@ -0,0 +1,86 @@
model = dict(
type='SingleStageDetector',
backbone=dict(
type='MobileNetV2',
out_indices=(4, 7),
norm_cfg=dict(type='BN', eps=0.001, momentum=0.03),
init_cfg=dict(type='TruncNormal', layer='Conv2d', std=0.03)),
neck=dict(
type='SSDNeck',
in_channels=(96, 1280),
out_channels=(96, 1280, 512, 256, 256, 128),
level_strides=(2, 2, 2, 2),
level_paddings=(1, 1, 1, 1),
l2_norm_scale=None,
use_depthwise=True,
norm_cfg=dict(type='BN', eps=0.001, momentum=0.03),
act_cfg=dict(type='ReLU6'),
init_cfg=dict(type='TruncNormal', layer='Conv2d', std=0.03)),
bbox_head=dict(
type='SSDHead',
in_channels=(96, 1280, 512, 256, 256, 128),
num_classes=1,
use_depthwise=True,
norm_cfg=dict(type='BN', eps=0.001, momentum=0.03),
act_cfg=dict(type='ReLU6'),
init_cfg=dict(type='Normal', layer='Conv2d', std=0.001),
# set anchor size manually instead of using the predefined
# SSD300 setting.
anchor_generator=dict(
type='SSDAnchorGenerator',
scale_major=False,
strides=[16, 32, 64, 107, 160, 320],
ratios=[[2, 3], [2, 3], [2, 3], [2, 3], [2, 3], [2, 3]],
min_sizes=[48, 100, 150, 202, 253, 304],
max_sizes=[100, 150, 202, 253, 304, 320]),
bbox_coder=dict(
type='DeltaXYWHBBoxCoder',
target_means=[.0, .0, .0, .0],
target_stds=[0.1, 0.1, 0.2, 0.2])),
# model training and testing settings
train_cfg=dict(
assigner=dict(
type='MaxIoUAssigner',
pos_iou_thr=0.5,
neg_iou_thr=0.5,
min_pos_iou=0.,
ignore_iof_thr=-1,
gt_max_assign_all=False),
smoothl1_beta=1.,
allowed_border=-1,
pos_weight=-1,
neg_pos_ratio=3,
debug=False),
test_cfg=dict(
nms_pre=1000,
nms=dict(type='nms', iou_threshold=0.45),
min_bbox_size=0,
score_thr=0.02,
max_per_img=200))
classes = ('hand', )
img_norm_cfg = dict(
mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
test_pipeline = [
dict(
type='MMMultiScaleFlipAug',
img_scale=(320, 320),
flip=False,
transforms=[
dict(type='MMResize', keep_ratio=False),
dict(type='MMNormalize', **img_norm_cfg),
dict(type='MMPad', size_divisor=32),
dict(type='ImageToTensor', keys=['img']),
dict(type='Collect', keys=['img']),
])
]
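# With a single img_scale and flip=False, MMMultiScaleFlipAug here reduces to
# plain single-scale, no-flip preprocessing at test time.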
load_from = 'https://download.openmmlab.com/mmpose/mmdet_pretrained/' \
'ssdlite_mobilenetv2_scratch_600e_onehand-4f9f8686_20220523.pth'
mmlab_modules = [
dict(type='mmdet', name='SingleStageDetector', module='model'),
dict(type='mmdet', name='MobileNetV2', module='backbone'),
dict(type='mmdet', name='SSDNeck', module='neck'),
dict(type='mmdet', name='SSDHead', module='head'),
]
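# The mmlab_modules list above tells EasyCV to resolve these names against
# mmdet's registries (detector, backbone, neck and head), reusing the mmdet
# implementations rather than EasyCV-native ones.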
predictor = dict(type='DetectionPredictor', score_threshold=0.5)

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c05d58edee7398de37b8e479410676d6b97cfde69cc003e8356a348067e71988
size 7750

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8570f45c7e642288b23a1c8722ba2b9b40939f1d55c962d13c789157b16edf01
size 117072344

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6c8207a06044306b0d271488a22e1a174af5a22e951a710e25a556cf5d212d5c
size 160632

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:feadc69a8190787088fda0ac12971d91badc93dbe06057645050fdbec1ce6911
size 204232

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:af6fa61274e497ecc170de5adc4b8e7ac89eba2bc22a6aa119b08ec7adbe9459
size 146140

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:898b141c663f242f716bb26c4cf4962452927e6bef3f170e61fb364cd6359d00
size 187956

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:94d7df6a4ff3c605916378304b2a00404a23d4965d226a657417061647cb46a6
size 45361179

View File

@ -0,0 +1,75 @@
## ImageNet
# license
Source: https://image-net.org/download.php
ImageNet is an ongoing research effort to provide researchers around the world with image data for training large-scale object recognition models. For researchers and educators who wish to use the images for non-commercial research and/or educational purposes, we can provide access through our site under certain conditions and terms.
[RESEARCHER_FULLNAME] (the "Researcher") has requested permission to use the ImageNet database (the "Database") at Princeton University and Stanford University. In exchange for such permission, Researcher hereby agrees to the following terms and conditions:
Researcher shall use the Database only for non-commercial research and educational purposes.
Princeton University and Stanford University make no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.
Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify the ImageNet team, Princeton University, and Stanford University, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted images that he or she may create from the Database.
Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
Princeton University and Stanford University reserve the right to terminate Researcher's access to the Database at any time.
If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer.
The law of the State of New Jersey shall apply to all disputes under this agreement.
## COCO
# license
Source: https://cocodataset.org/#termsofuse
Images
The COCO Consortium does not own the copyright of the images. Use of the images must abide by the Flickr Terms of Use. The users of the images accept full responsibility for the use of the dataset, including but not limited to the use of any copies of copyrighted images that they may create from the dataset.
Software
Copyright (c) 2015, COCO Consortium. All rights reserved. Redistribution and use software in source and binary form, with or without modification, are permitted provided that the following conditions are met:
● Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
● Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
● Neither the name of the COCO Consortium nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE AND ANNOTATIONS ARE PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
## ADE20K
# license
Source: https://groups.csail.mit.edu/vision/datasets/ADE20K/
Images
MIT, CSAIL does not own the copyright of the images. If you are a researcher or educator who wish to have a copy of the original images for non-commercial research and/or educational use, we may provide you access by filling a request in our site. You may use the images under the following terms:
● Researcher shall use the Database only for non-commercial research and educational purposes. MIT makes no representations or warranties regarding the Database, including but not limited to warranties of non-infringement or fitness for a particular purpose.
● Researcher accepts full responsibility for his or her use of the Database and shall defend and indemnify MIT, including their employees, Trustees, officers and agents, against any and all claims arising from Researcher's use of the Database, including but not limited to Researcher's use of any copies of copyrighted images that he or she may create from the Database.
● Researcher may provide research associates and colleagues with access to the Database provided that they first agree to be bound by these terms and conditions.
● MIT reserves the right to terminate Researcher's access to the Database at any time.
● If Researcher is employed by a for-profit, commercial entity, Researcher's employer shall also be bound by these terms and conditions, and Researcher hereby represents that he or she is fully authorized to enter into this agreement on behalf of such employer.
Software and Annotations
This website, image annotations and the software provided belongs to MIT CSAIL and is licensed under a Creative Commons BSD-3 License Agreement
Copyright 2019 MIT, CSAIL
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
● Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
● Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
● Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
## MPII Human Pose
Source: http://human-pose.mpi-inf.mpg.de/
Commercial use is not allowed due to the fact that the authors do not have the copyright for the images themselves.
## LVIS
Source: https://www.lvisdataset.org/dataset
The LVIS annotations along with this website are licensed under a Creative Commons Attribution 4.0 License. All LVIS dataset images come from the COCO dataset; please see the link for their terms of use.
Images
The COCO Consortium does not own the copyright of the images. Use of the images must abide by the Flickr Terms of Use. The users of the images accept full responsibility for the use of the dataset, including but not limited to the use of any copies of copyrighted images that they may create from the dataset.
Software
Copyright (c) 2015, COCO Consortium. All rights reserved. Redistribution and use software in source and binary form, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the COCO Consortium nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE AND ANNOTATIONS ARE PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
## VOC2012
Source: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
The VOC2012 data includes images obtained from the "flickr" website. Use of these images must respect the corresponding terms of use:
"flickr" terms of use
For the purposes of the challenge, the identity of the images in the database, e.g. source and name of owner, has been obscured. Details of the contributor of each image can be found in the annotation to be included in the final release of the data, after completion of the challenge. Any queries about the use or ownership of the data should be addressed to the organizers.

Binary file not shown.

Before: 97 KiB | After: 130 B

View File

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c696a58a2963b5ac47317751f04ff45bfed4723f2f70bacf91eac711f9710e54
size 189432

View File

@ -156,7 +156,7 @@ easycv.models.backbones.swin\_transformer\_dynamic module
easycv.models.backbones.vit\_transfomer\_dynamic module
-------------------------------------------------------
.. automodule:: easycv.models.backbones.vit_transfomer_dynamic
.. automodule:: easycv.models.backbones.vit_transformer_dynamic
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,48 +1,60 @@
# v 0.2.2 (07/04/2022)
* initial commit & first release
1. SOTA SSL Algorithms
EasyCV provides state-of-the-art algorithms in self-supervised learning based on contrastive learning such as SimCLR, MoCO V2, Swav, DINO and also MAE based on masked image modeling. We also provide standard benchmark tools for ssl model evaluation.
2. Vision Transformers
EasyCV aims to provide plenty of vision transformer models trained either using supervised learning or self-supervised learning, such as ViT, Swin-Transformer and XCit. More models will be added in the future.
3. Functionality & Extensibility
In addition to SSL, EasyCV also supports image classification, object detection and metric learning, and more areas will be supported in the future. Although covering different areas, EasyCV decomposes the framework into components such as dataset, model and running hook, making it easy to add new components and combine them with existing modules.
EasyCV provides a simple and comprehensive interface for inference. Additionally, all models are supported on PAI-EAS and can be easily deployed as online services with automatic scaling and service monitoring.
4. Efficiency
EasyCV supports multi-GPU and multi-worker training. EasyCV uses DALI to accelerate data IO and preprocessing, and uses fp16 to accelerate training. For inference optimization, EasyCV exports models with jit script, which can be optimized by PAI-Blade.
# v 0.3.0 (05/05/2022)
## Highlights
- Support image visualization for tensorboard and wandb ([#15](https://github.com/alibaba/EasyCV/pull/15))
## New Features
- Update moby pretrained model to deit small ([#10](https://github.com/alibaba/EasyCV/pull/10))
- Support image visualization for tensorboard and wandb ([#15](https://github.com/alibaba/EasyCV/pull/15))
- Add mae vit-large benchmark and pretrained models ([#24](https://github.com/alibaba/EasyCV/pull/24))
# v 0.6.1 (06/09/2022)
## Bug Fixes
- Fix extract.py for benchmarks ([#7](https://github.com/alibaba/EasyCV/pull/7))
- Fix inference error of classifier ([#19](https://github.com/alibaba/EasyCV/pull/19))
- Fix multi-process reading of detection datasource and accelerate data preprocessing ([#23](https://github.com/alibaba/EasyCV/pull/23))
- Fix torchvision transforms wrapper ([#31](https://github.com/alibaba/EasyCV/pull/31))
- Fix missing utils ([#183](https://github.com/alibaba/EasyCV/pull/183))
# v 0.6.0 (31/08/2022)
## Highlights
- Release YOLOX-PAI which achieves SOTA results within 40~50 mAP (less than 1ms) ([#154](https://github.com/alibaba/EasyCV/pull/154) [#172](https://github.com/alibaba/EasyCV/pull/172) [#174](https://github.com/alibaba/EasyCV/pull/174) )
- Add detection algo DINO ([#144](https://github.com/alibaba/EasyCV/pull/144))
- Add mask2former algo ([#115](https://github.com/alibaba/EasyCV/pull/115))
- Releases imagenet1k, imagenet22k, coco, lvis, voc2012 data with BaiduDisk to accelerate downloading ([#145](https://github.com/alibaba/EasyCV/pull/145) )
## New Features
- Add detection predictor which support model inference without exporting models([#158](https://github.com/alibaba/EasyCV/pull/158) )
- Add VitDet support for faster-rcnn ([#155](https://github.com/alibaba/EasyCV/pull/155) )
- Release YOLOX-PAI which achieves SOTA results within 40~50 mAP (less than 1ms) ([#154](https://github.com/alibaba/EasyCV/pull/154) [#172](https://github.com/alibaba/EasyCV/pull/172) [#174](https://github.com/alibaba/EasyCV/pull/174) )
- Support DINO algo ([#144](https://github.com/alibaba/EasyCV/pull/144))
- Add mask2former algo ([#115](https://github.com/alibaba/EasyCV/pull/115))
## Improvements
- Add chinese readme ([#39](https://github.com/alibaba/EasyCV/pull/39))
- Add model compression tutorial ([#20](https://github.com/alibaba/EasyCV/pull/20))
- Add notebook tutorials ([#22](https://github.com/alibaba/EasyCV/pull/22))
- Uniform input and output format for transforms ([#6](https://github.com/alibaba/EasyCV/pull/6))
- Update model zoo link ([#8](https://github.com/alibaba/EasyCV/pull/8))
- Support readthedocs ([#29](https://github.com/alibaba/EasyCV/pull/29))
- refine autorelease gitworkflow ([#13](https://github.com/alibaba/EasyCV/pull/13))
- FCOS update torch_style ([#170](https://github.com/alibaba/EasyCV/pull/170))
- Add algo tables to describe which algo EasyCV support ([#157](https://github.com/alibaba/EasyCV/pull/157) )
- Refactor datasources api ([#156](https://github.com/alibaba/EasyCV/pull/156) [#140](https://github.com/alibaba/EasyCV/pull/140) )
- Add PR and Issue template ([#150](https://github.com/alibaba/EasyCV/pull/150))
- Update Fast ConvMAE doc ([#151](https://github.com/alibaba/EasyCV/pull/151))
## Bug Fixes
- Fix YOLOXLrUpdaterHook conflict with mmdet ( [#169](https://github.com/alibaba/EasyCV/pull/169) )
- Fix datasource cache problem([#153](https://github.com/alibaba/EasyCV/pull/153))
# v 0.5.0 (28/07/2022)
## Highlights
- Self-Supervised support ConvMAE algorithm (([#101](https://github.com/alibaba/EasyCV/pull/101)) ([#121](https://github.com/alibaba/EasyCV/pull/121)))
- Classification support EfficientFormer algorithm ([#128](https://github.com/alibaba/EasyCV/pull/128))
- Detection support FCOS, DETR, DAB-DETR and DN-DETR algorithms (([#100](https://github.com/alibaba/EasyCV/pull/100)) ([#104](https://github.com/alibaba/EasyCV/pull/104)) ([#119](https://github.com/alibaba/EasyCV/pull/119)))
- Segmentation support UperNet algorithm ([#118](https://github.com/alibaba/EasyCV/pull/118))
- Support use torchacc to speed up training ([#105](https://github.com/alibaba/EasyCV/pull/105))
## New Features
- Support use analyze tools ([#133](https://github.com/alibaba/EasyCV/pull/133))
## Bug Fixes
- Update yolox config template and fix bugs ([#134](https://github.com/alibaba/EasyCV/pull/134))
- Fix yolox detector prediction export error ([#125](https://github.com/alibaba/EasyCV/pull/125))
- Fix common_io url error ([#126](https://github.com/alibaba/EasyCV/pull/126))
## Improvements
- Add ViTDet visualization ([#102](https://github.com/alibaba/EasyCV/pull/102))
- Refactor detection pipeline ([#104](https://github.com/alibaba/EasyCV/pull/104))
# v 0.4.0 (23/06/2022)
@ -69,23 +81,49 @@ EasyCV support multi-gpu and multi worker training. EasyCV use DALI to accelerat
- Update prepare_data.md, add more details ([#69](https://github.com/alibaba/EasyCV/pull/69))
- Optimize quantize code and support to export MNN model ([#44](https://github.com/alibaba/EasyCV/pull/44))
# v 0.5.0 (28/07/2022)
# v 0.3.0 (05/05/2022)
## Highlights
- Self-Supervised support ConvMAE algorithm (([#101](https://github.com/alibaba/EasyCV/pull/101)) ([#121](https://github.com/alibaba/EasyCV/pull/121)))
- Classification support EfficientFormer algorithm ([#128](https://github.com/alibaba/EasyCV/pull/128))
- Detection support FCOS, DETR, DAB-DETR and DN-DETR algorithms (([#100](https://github.com/alibaba/EasyCV/pull/100)) ([#104](https://github.com/alibaba/EasyCV/pull/104)) ([#119](https://github.com/alibaba/EasyCV/pull/119)))
- Segmentation support UperNet algorithm ([#118](https://github.com/alibaba/EasyCV/pull/118))
- Support use torchacc to speed up training ([#105](https://github.com/alibaba/EasyCV/pull/105))
- Support image visualization for tensorboard and wandb ([#15](https://github.com/alibaba/EasyCV/pull/15))
## New Features
- Support use analyze tools ([#133](https://github.com/alibaba/EasyCV/pull/133))
- Update moby pretrained model to deit small ([#10](https://github.com/alibaba/EasyCV/pull/10))
- Support image visualization for tensorboard and wandb ([#15](https://github.com/alibaba/EasyCV/pull/15))
- Add mae vit-large benchmark and pretrained models ([#24](https://github.com/alibaba/EasyCV/pull/24))
## Bug Fixes
- Update yolox config template and fix bugs ([#134](https://github.com/alibaba/EasyCV/pull/134))
- Fix yolox detector prediction export error ([#125](https://github.com/alibaba/EasyCV/pull/125))
- Fix common_io url error ([#126](https://github.com/alibaba/EasyCV/pull/126))
- Fix extract.py for benchmarks ([#7](https://github.com/alibaba/EasyCV/pull/7))
- Fix inference error of classifier ([#19](https://github.com/alibaba/EasyCV/pull/19))
- Fix multi-process reading of detection datasource and accelerate data preprocessing ([#23](https://github.com/alibaba/EasyCV/pull/23))
- Fix torchvision transforms wrapper ([#31](https://github.com/alibaba/EasyCV/pull/31))
## Improvements
- Add ViTDet visualization ([#102](https://github.com/alibaba/EasyCV/pull/102))
- Refactor detection pipeline ([#104](https://github.com/alibaba/EasyCV/pull/104))
- Add chinese readme ([#39](https://github.com/alibaba/EasyCV/pull/39))
- Add model compression tutorial ([#20](https://github.com/alibaba/EasyCV/pull/20))
- Add notebook tutorials ([#22](https://github.com/alibaba/EasyCV/pull/22))
- Uniform input and output format for transforms ([#6](https://github.com/alibaba/EasyCV/pull/6))
- Update model zoo link ([#8](https://github.com/alibaba/EasyCV/pull/8))
- Support readthedocs ([#29](https://github.com/alibaba/EasyCV/pull/29))
- refine autorelease gitworkflow ([#13](https://github.com/alibaba/EasyCV/pull/13))
# v 0.2.2 (07/04/2022)
* initial commit & first release
1. SOTA SSL Algorithms
EasyCV provides state-of-the-art algorithms in self-supervised learning based on contrastive learning such as SimCLR, MoCO V2, Swav, DINO and also MAE based on masked image modeling. We also provide standard benchmark tools for ssl model evaluation.
2. Vision Transformers
EasyCV aims to provide plenty of vision transformer models trained either using supervised learning or self-supervised learning, such as ViT, Swin-Transformer and XCit. More models will be added in the future.
3. Functionality & Extensibility
In addition to SSL, EasyCV also supports image classification, object detection and metric learning, and more areas will be supported in the future. Although covering different areas, EasyCV decomposes the framework into components such as dataset, model and running hook, making it easy to add new components and combine them with existing modules.
EasyCV provides a simple and comprehensive interface for inference. Additionally, all models are supported on PAI-EAS and can be easily deployed as online services with automatic scaling and service monitoring.
4. Efficiency
EasyCV supports multi-GPU and multi-worker training. EasyCV uses DALI to accelerate data IO and preprocessing, and uses fp16 to accelerate training. For inference optimization, EasyCV exports models with jit script, which can be optimized by PAI-Blade.

View File

@ -2,6 +2,10 @@
EasyCV summarizes various datasets in different fields. At present, we support part of them, and we will gradually support the remaining ones.
Before using a dataset, please read the [LICENSE](docs/source/LICENSE) file to learn the usage terms and scope of the dataset. Notes are as follows:
1. The use of the dataset must follow the original license.
2. If there is any infringement, please contact us promptly.
**For datasets we already support, please refer to: [prepare_data.md](https://github.com/alibaba/EasyCV/blob/master/docs/source/prepare_data.md).**
- [Self-Supervised Learning](#Self-Supervised-Learning)
@ -12,21 +16,21 @@ EasyCV summarized various datasets in different fields. At present, we support p
## Self-Supervised Learning
| Name | Field | Description | Download | Dataset API support |
| ------------------------------------------------------------ | ------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------------------- |
| **ImageNet 1k**<br/>[url](https://image-net.org/download.php) | Common | ImageNet is an image database organized according to the [WordNet](http://wordnet.princeton.edu/) hierarchy (currently only the nouns).It is used in the ImageNet Large Scale Visual Recognition Challenge(ILSVRC) and is a benchmark for image classification. | refer to [prepare_data.md](https://github.com/alibaba/EasyCV/blob/master/docs/source/prepare_data.md) | <font color=green size=5>&check;</font> |
| **Imagenet-1k TFrecords**<br/>[url](https://www.kaggle.com/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-0) | Common | Original imagenet raw images packed in TFrecord format. | refer to [prepare_data.md](https://github.com/alibaba/EasyCV/blob/master/docs/source/prepare_data.md) | <font color=green size=5>&check;</font> |
| **ImageNet 21k**<br/>[url](https://image-net.org/download.php) | Common | ImageNet-21K dataset, which is bigger and more diverse, is used less frequently for pretraining, mainly due to its complexity, low accessibility, and underestimation of its added value. | refer to [Alibaba-MIIL/ImageNet21K](https://github.com/Alibaba-MIIL/ImageNet21K/blob/main/dataset_preprocessing/processing_instructions.md) | |
| Name | Field | Description | Download | Dataset API support | License |
| ------------------------------------------------------------ | ------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------------------- | --------------------------------------- |
| **ImageNet 1k**<br/>[url](https://image-net.org/download.php) | Common | ImageNet is an image database organized according to the [WordNet](http://wordnet.princeton.edu/) hierarchy (currently only the nouns).It is used in the ImageNet Large Scale Visual Recognition Challenge(ILSVRC) and is a benchmark for image classification. | [Baidu Netdisk (提取码:0zas)](https://pan.baidu.com/s/13pKw0bJbr-jbymQMd_YXzA)<br/>refer to [prepare_data.md](https://github.com/alibaba/EasyCV/blob/master/docs/source/prepare_data.md) | <font color=green size=5>&check;</font> | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L1) |
| **Imagenet-1k TFrecords**<br/>[url](https://www.kaggle.com/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-0) | Common | Original imagenet raw images packed in TFrecord format. | [Baidu Netdisk (提取码:5zdc)](https://pan.baidu.com/s/153SY2dp02vEY9K6-O5U1UA)<br/>refer to [prepare_data.md](https://github.com/alibaba/EasyCV/blob/master/docs/source/prepare_data.md) | <font color=green size=5>&check;</font> | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L1) |
| **ImageNet 21k**<br/>[url](https://image-net.org/download.php) | Common | ImageNet-21K dataset, which is bigger and more diverse, is used less frequently for pretraining, mainly due to its complexity, low accessibility, and underestimation of its added value. | [Baidu Netdisk (提取码:kaeg)](https://pan.baidu.com/s/1eJVPCfS814cDCt3-lVHgmA)<br/>refer to [Alibaba-MIIL/ImageNet21K](https://github.com/Alibaba-MIIL/ImageNet21K/blob/main/dataset_preprocessing/processing_instructions.md) | | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L1) |
## Classification data
| Name | Field | Description | Download | Dataset API support |
| ------------------------------------------------------------ | ------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------------------- |
| Name | Field | Description | Download | Dataset API support | License |
| ------------------------------------------------------------ | ------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------------------- | --------------------------------------- |
| **Cifar10**<br/>[url](https://www.cs.toronto.edu/~kriz/cifar.html) | Common | The CIFAR-10 are labeled subsets of the [80 million tiny images](http://people.csail.mit.edu/torralba/tinyimages/) dataset. It consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images. | [cifar-10-python.tar.gz ](https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz)(163MB) | <font color=green size=5>&check;</font> |
| **Cifar100**<br/>[url](https://www.cs.toronto.edu/~kriz/cifar.html) | Common | The CIFAR-100 are labeled subsets of the [80 million tiny images](http://people.csail.mit.edu/torralba/tinyimages/) dataset. It is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. | [cifar-100-python.tar.gz](https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz) (161MB) | <font color=green size=5>&check;</font> |
| **ImageNet 1k**<br/>[url](https://image-net.org/download.php) | Common | ImageNet is an image database organized according to the [WordNet](http://wordnet.princeton.edu/) hierarchy (currently only the nouns).It is used in the ImageNet Large Scale Visual Recognition Challenge(ILSVRC) and is a benchmark for image classification. | refer to [prepare_data.md](https://github.com/alibaba/EasyCV/blob/master/docs/source/prepare_data.md) | <font color=green size=5>&check;</font> |
| **Imagenet-1k TFrecords**<br/>[url](https://www.kaggle.com/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-0) | Common | Original imagenet raw images packed in TFrecord format. | refer to [prepare_data.md](https://github.com/alibaba/EasyCV/blob/master/docs/source/prepare_data.md) | <font color=green size=5>&check;</font> |
| **ImageNet 21k**<br/>[url](https://image-net.org/download.php) | Common | ImageNet-21K dataset, which is bigger and more diverse, is used less frequently for pretraining, mainly due to its complexity, low accessibility, and underestimation of its added value. | refer to [Alibaba-MIIL/ImageNet21K](https://github.com/Alibaba-MIIL/ImageNet21K/blob/main/dataset_preprocessing/processing_instructions.md) | |
| **ImageNet 1k**<br/>[url](https://image-net.org/download.php) | Common | ImageNet is an image database organized according to the [WordNet](http://wordnet.princeton.edu/) hierarchy (currently only the nouns).It is used in the ImageNet Large Scale Visual Recognition Challenge(ILSVRC) and is a benchmark for image classification. | [Baidu Netdisk (提取码:0zas)](https://pan.baidu.com/s/13pKw0bJbr-jbymQMd_YXzA)<br/>refer to [prepare_data.md](https://github.com/alibaba/EasyCV/blob/master/docs/source/prepare_data.md) | <font color=green size=5>&check;</font> | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L1) |
| **Imagenet-1k TFrecords**<br/>[url](https://www.kaggle.com/hmendonca/imagenet-1k-tfrecords-ilsvrc2012-part-0) | Common | Original imagenet raw images packed in TFrecord format. | [Baidu Netdisk (提取码:5zdc)](https://pan.baidu.com/s/153SY2dp02vEY9K6-O5U1UA)<br/>refer to [prepare_data.md](https://github.com/alibaba/EasyCV/blob/master/docs/source/prepare_data.md) | <font color=green size=5>&check;</font> | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L1) |
| **ImageNet 21k**<br/>[url](https://image-net.org/download.php) | Common | ImageNet-21K dataset, which is bigger and more diverse, is used less frequently for pretraining, mainly due to its complexity, low accessibility, and underestimation of its added value. | [Baidu Netdisk (提取码:kaeg)](https://pan.baidu.com/s/1eJVPCfS814cDCt3-lVHgmA)<br/>refer to [Alibaba-MIIL/ImageNet21K](https://github.com/Alibaba-MIIL/ImageNet21K/blob/main/dataset_preprocessing/processing_instructions.md) | | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L1) |
| **MNIST**<br/>[url](http://yann.lecun.com/exdb/mnist/) | Handwritten numbers | The MNIST database of handwritten digits, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. | [train-images-idx3-ubyte.gz](http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz) (9.5MB)<br/>[train-labels-idx1-ubyte.gz](http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz)<br/>[t10k-images-idx3-ubyte.gz](http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz) (1.5MB)<br/>[t10k-labels-idx1-ubyte.gz](http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz) | |
| **Fashion-MNIST**<br/>[url](https://github.com/zalandoresearch/fashion-mnist) | Clothing | Fashion-MNIST is a **clothing dataset** of [Zalando](https://jobs.zalando.com/tech/)'s article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes. | [train-images-idx3-ubyte.gz](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz) (26MB)<br/>[train-labels-idx1-ubyte.gz](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz) (29KB)<br/>[t10k-images-idx3-ubyte.gz](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz)(4.3 MB)<br/>[t10k-labels-idx1-ubyte.gz](http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz) (5.1KB) | |
| **Flower102**<br/>[url](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/) | Flowers | The Flower102 is consisting of 102 flower categories. The flowers chosen to be flower commonly occuring in the United Kingdom. Each class consists of between 40 and 258 images. | [102flowers.tgz](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz) (329MB)<br/>[imagelabels.mat](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/imagelabels.mat)<br/>[setid.mat](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/setid.mat) | |
@ -35,12 +39,15 @@ EasyCV summarized various datasets in different fields. At present, we support p
## Object Detection
| Name | Field | Description | Download | Dataset API support |
| ------------------------------------------------------------ | --------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------------------- |
| **COCO2017**<br/>[url](https://cocodataset.org/#home) | Common | The COCO dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.It has been updated for several editions, and coco2017 is widely used. In 2017, the training/validation split was 118K/5K and test set is a subset of 41K images of the 2015 test set. | [train2017.zip](http://images.cocodataset.org/zips/train2017.zip) (18G) <br/>[val2017.zip](http://images.cocodataset.org/zips/val2017.zip) (1G)<br/>[annotations_trainval2017.zip](http://images.cocodataset.org/annotations/annotations_trainval2017.zip) (241MB) | <font color=green size=5>&check;</font> |
| Name | Field | Description | Download | Dataset API support | License |
| ------------------------------------------------------------ | ------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------------------- | --------------------------------------- |
| **COCO2017**<br/>[url](https://cocodataset.org/#home) | Common | The COCO dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.It has been updated for several editions, and coco2017 is widely used. In 2017, the training/validation split was 118K/5K and test set is a subset of 41K images of the 2015 test set. | [Baidu Netdisk (提取码:bcmm)](https://pan.baidu.com/s/14rO11v1VAgdswRDqPVJjMA)<br/>[train2017.zip](http://images.cocodataset.org/zips/train2017.zip) (18G) <br/>[val2017.zip](http://images.cocodataset.org/zips/val2017.zip) (1G)<br/>[annotations_trainval2017.zip](http://images.cocodataset.org/annotations/annotations_trainval2017.zip) (241MB) | <font color=green size=5>&check;</font> | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L17) |
| **VOC2007**<br/>[url](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html) | Common | PASCAL VOC 2007 is a dataset for image recognition consisting of 20 object categories. Each image in this dataset has pixel-level segmentation annotations, bounding box annotations, and object class annotations. | [VOCtrainval_06-Nov-2007.tar](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar) (439MB) | <font color=green size=5>&check;</font> |
| **VOC2012**<br/>[url](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html) | Common | From 2009 to 2011, the amount of data is still growing on the basis of the previous year's dataset, and from 2011 to 2012, the amount of data used for classification, detection and person layout tasks does not change. Mainly for segmentation and action recognition, improve the corresponding data subsets and label information. | [VOCtrainval_11-May-2012.tar](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar) (2G) | <font color=green size=5>&check;</font> |
| **VOC2012**<br/>[url](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html) | Common | From 2009 to 2011, the amount of data is still growing on the basis of the previous year's dataset, and from 2011 to 2012, the amount of data used for classification, detection and person layout tasks does not change. Mainly for segmentation and action recognition, improve the corresponding data subsets and label information. | [Baidu Netdisk (提取码:ro9f)](https://pan.baidu.com/s/1B4tF8cEPIe0xGL1FG0qbkg)<br/>[VOCtrainval_11-May-2012.tar](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar) (2G) | <font color=green size=5>&check;</font> | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L70) |
| **LVIS**<br/>[url](https://www.lvisdataset.org/dataset) | Common | LVIS uses the COCO 2017 train, validation, and test image sets. If you have already downloaded the COCO images, you only need to download the LVIS annotations. LVIS val set contains images from COCO 2017 train in addition to the COCO 2017 val split. | [Baidu Netdisk (提取码:8ief)](https://pan.baidu.com/s/1UntujlgDMuVBIjhoAc_lSA)<br/>refer to [coco](https://cocodataset.org/#overview) | | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L57) |
| **Cityscapes**<br/>[url](https://www.cityscapes-dataset.com/) | Street scenes | The Cityscapes contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5000 frames in addition to a larger set of 20000 weakly annotated frames. The dataset is thus an order of magnitude larger than similar previous attempts. | [leftImg8bit_trainvaltest.zip](https://www.cityscapes-dataset.com/file-handling/?packageID=3) (11GB) | |
| **Object365**<br/>[url](https://www.objects365.org/overview.html) | Common | Objects365 is a brand new dataset, designed to spur object detection research with a focus on diverse objects in the Wild. 365 categories, 2 million images, 30 million bounding boxes. | refer to [data-set-detail](https://open.baai.ac.cn/data-set-detail/MTI2NDc=/MTA=/true) | | |
| **CrowdHuman**<br/>[url](https://www.crowdhuman.org/) | Common | CrowdHuman is a benchmark dataset to better evaluate detectors in crowd scenarios. The CrowdHuman dataset is large, rich-annotated and contains high diversity. CrowdHuman contains 15000, 4370 and 5000 images for training, validation, and testing, respectively. There are a total of 470K human instances from train and validation subsets and 23 persons per image, with various kinds of occlusions in the dataset. Each human instance is annotated with a head bounding-box, human visible-region bounding-box and human full-body bounding-box. | refer to [crowdhuman](https://www.crowdhuman.org/) | |
| **Openimages**<br/>[url](https://storage.googleapis.com/openimages/web/index.html) | Common | Open Images is a dataset of ~9 million URLs to images that have been annotated with image-level labels and bounding boxes spanning thousands of classes. | refer to [cvdfoundation/open-images-dataset](https://github.com/cvdfoundation/open-images-dataset#download-images-with-bounding-boxes-annotations) | |
| **WIDER FACE **<br/>[url](http://shuoyang1213.me/WIDERFACE/) | Face | The WIDER FACE dataset contains 32,203 images and labels 393,703 faces with a high degree of variability in scale, pose and occlusion. The database is split into training (40%), validation (10%) and testing (50%) set. Besides, the images are divided into three levels (Easy ⊆ Medium ⊆ Hard) according to the difficulties of the detection. | WIDER Face Training Images [[Google Drive\]](https://drive.google.com/file/d/15hGDLhsx8bLgLcIRD5DhYt5iBxnjNF1M/view?usp=sharing) [[Tencent Drive\]](https://share.weiyun.com/5WjCBWV) (1.36GB)<br/>WIDER Face Validation Images [[Google Drive\]](https://drive.google.com/file/d/1GUCogbp16PMGa39thoMMeWxp7Rp5oM8Q/view?usp=sharing) [[Tencent Drive\]](https://share.weiyun.com/5ot9Qv1) (345.95MB)<br/>WIDER Face Testing Images [[Google Drive\]](https://drive.google.com/file/d/1HIfDbVEWKmsYKJZm4lchTBDLW5N7dY5T/view?usp=sharing) [[Tencent Drive\]](https://share.weiyun.com/5vSUomP) (1.72GB)<br/>[Face annotations](http://shuoyang1213.me/WIDERFACE/support/bbx_annotation/wider_face_split.zip) (3.6MB) | |
| **DeepFashion**<br/>[url](https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html) | Clothing | The DeepFashion is a large-scale clothes database. It contains over 800,000 diverse fashion images ranging from well-posed shop images to unconstrained consumer photos. Second, DeepFashion is annotated with rich information of clothing items. Each image in this dataset is labeled with 50 categories, 1,000 descriptive attributes, bounding box and clothing landmarks. Third, DeepFashion contains over 300,000 cross-pose/cross-domain image pairs. | Category and Attribute Prediction Benchmark: [[Download Page\]](https://drive.google.com/drive/folders/0B7EVK8r0v71pQ2FuZ0k0QnhBQnc?resourcekey=0-NWldFxSChFuCpK4nzAIGsg&usp=sharing)<br/>In-shop Clothes Retrieval Benchmark: [[Download Page\]](https://drive.google.com/drive/folders/0B7EVK8r0v71pQ2FuZ0k0QnhBQnc?resourcekey=0-NWldFxSChFuCpK4nzAIGsg&usp=sharing)<br/>Consumer-to-shop Clothes Retrieval Benchmark: [[Download Page\]](https://drive.google.com/drive/folders/0B7EVK8r0v71pQ2FuZ0k0QnhBQnc?resourcekey=0-NWldFxSChFuCpK4nzAIGsg&usp=sharing)<br/>Fashion Landmark Detection Benchmark: [[Download Page\]](https://drive.google.com/drive/folders/0B7EVK8r0v71pQ2FuZ0k0QnhBQnc?resourcekey=0-NWldFxSChFuCpK4nzAIGsg&usp=sharing) | |
@ -56,20 +63,23 @@ EasyCV summarized various datasets in different fields. At present, we support p
## Image Segmentation
| Name | Field | Description | Download | Dataset API support |
| ------------------------------------------------------------ | ------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------- |
| Name | Field | Description | Download | Dataset API support | License |
| ------------------------------------------------------------ | ------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------------------- | --------------------------------------- |
| **VOC2007**<br/>[url](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/index.html) | Common | PASCAL VOC 2007 is a dataset for image recognition consisting of 20 object categories. Each image in this dataset has pixel-level segmentation annotations, bounding box annotations, and object class annotations. | [VOCtrainval_06-Nov-2007.tar](http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar) (439MB) | |
| **VOC2012**<br/>[url](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html) | Common | From 2009 to 2011, the amount of data is still growing on the basis of the previous year's dataset, and from 2011 to 2012, the amount of data used for classification, detection and person layout tasks does not change. Mainly for segmentation and action recognition, improve the corresponding data subsets and label information. | [VOCtrainval_11-May-2012.tar](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar) (2G) | |
| **VOC2012**<br/>[url](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html) | Common | From 2009 to 2011, the amount of data is still growing on the basis of the previous year's dataset, and from 2011 to 2012, the amount of data used for classification, detection and person layout tasks does not change. Mainly for segmentation and action recognition, improve the corresponding data subsets and label information. | [Baidu Netdisk (提取码:ro9f)](https://pan.baidu.com/s/1B4tF8cEPIe0xGL1FG0qbkg)<br/>[VOCtrainval_11-May-2012.tar](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar) (2G) | | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L70) |
| **Pascal Context**<br/>[url](http://host.robots.ox.ac.uk/pascal/VOC/voc2010/) | Common | This dataset is a set of additional annotations for PASCAL VOC 2010. It goes beyond the original PASCAL semantic segmentation task by providing annotations for the whole scene. The [statistics section](https://www.cs.stanford.edu/~roozbeh/pascal-context/#statistics) has a full list of 400+ labels. | [voc2010/VOCtrainval_03-May-2010.tar](http://host.robots.ox.ac.uk/pascal/VOC/voc2010/VOCtrainval_03-May-2010.tar) (1.3GB)<br/>[VOC2010test.tar](http://host.robots.ox.ac.uk:8080/eval/downloads/VOC2010test.tar) <br/>[trainval_merged.json](https://codalabuser.blob.core.windows.net/public/trainval_merged.json) (590MB) | |
| **COCO-Stuff 10K**<br/>[url](https://github.com/nightrome/cocostuff10k) | Common | COCO-Stuff augments the popular COCO dataset with pixel-level stuff annotations. These annotations can be used for scene understanding tasks like semantic segmentation, object detection and image captioning. | [cocostuff-10k-v1.1.zip](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/cocostuff-10k-v1.1.zip) (2.0 GB) | |
| **COCO-Stuff 164K**<br/>[url](https://github.com/nightrome/cocostuff) | Common | COCO-Stuff augments the popular COCO dataset with pixel-level stuff annotations. These annotations can be used for scene understanding tasks like semantic segmentation, object detection and image captioning. | [train2017.zip](http://images.cocodataset.org/zips/train2017.zip) (18.0 GB), <br/>[val2017.zip](http://images.cocodataset.org/zips/val2017.zip) (1.0 GB), <br/>[stuffthingmaps_trainval2017.zip](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip) (659M)| |
| **COCO-Stuff 10K**<br/>[url](https://github.com/nightrome/cocostuff10k) | Common | COCO-Stuff augments the popular COCO dataset with pixel-level stuff annotations. These annotations can be used for scene understanding tasks like semantic segmentation, object detection and image captioning. | [Baidu Netdisk (提取码:4r7o)](https://pan.baidu.com/s/1aWOjVnnOHFNISnGerGQcnw)<br/>[cocostuff-10k-v1.1.zip](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/cocostuff-10k-v1.1.zip) (2.0 GB) | | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L17) |
| **COCO-Stuff 164K**<br/>[url](https://github.com/nightrome/cocostuff) | Common | COCO-Stuff augments the popular COCO dataset with pixel-level stuff annotations. These annotations can be used for scene understanding tasks like semantic segmentation, object detection and image captioning. | [train2017.zip](http://images.cocodataset.org/zips/train2017.zip) (18.0 GB), <br/>[val2017.zip](http://images.cocodataset.org/zips/val2017.zip) (1.0 GB), <br/>[stuffthingmaps_trainval2017.zip](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip) (659M)| |
| **Cityscapes**<br/>[url](https://www.cityscapes-dataset.com/) | Street scenes | The Cityscapes contains a diverse set of stereo video sequences recorded in street scenes from 50 different cities, with high quality pixel-level annotations of 5000 frames in addition to a larger set of 20000 weakly annotated frames. The dataset is thus an order of magnitude larger than similar previous attempts. | [leftImg8bit_trainvaltest.zip](https://www.cityscapes-dataset.com/file-handling/?packageID=3) (11GB) | |
| **ADE20K**<br/>[url](http://groups.csail.mit.edu/vision/datasets/ADE20K/) | Scene | The ADE20K dataset is released by MIT and can be used for scene perception, parsing, segmentation, multi-object recognition and semantic understanding.The annotated images cover the scene categories from the SUN and Places database.It contains 25.574 training set and 2000 validation set. | [ADEChallengeData2016.zip](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip) (923MB)<br/>[release_test.zip](http://data.csail.mit.edu/places/ADEchallenge/release_test.zip) (202MB) | |
| **ADE20K**<br/>[url](http://groups.csail.mit.edu/vision/datasets/ADE20K/) | Scene | The ADE20K dataset is released by MIT and can be used for scene perception, parsing, segmentation, multi-object recognition and semantic understanding.The annotated images cover the scene categories from the SUN and Places database.It contains 25.574 training set and 2000 validation set. | [Baidu Netdisk (提取码:dqim)](https://pan.baidu.com/s/1ZuAuZheHHSDNRRdaI4wQrQ)<br/>[ADEChallengeData2016.zip](http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip) (923MB)<br/>[release_test.zip](http://data.csail.mit.edu/places/ADEchallenge/release_test.zip) (202MB) | | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L30) |
## Pose
| Name | Field | Description | Download | Dataset API support |
| ------------------------------------------------------------ | ------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------------------- |
| **COCO2017**<br/>[url](https://cocodataset.org/#home) | Person | The COCO dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images.It has been updated for several editions, and coco2017 is widely used. In 2017, the training/validation split was 118K/5K and test set is a subset of 41K images of the 2015 test set. | [train2017.zip](http://images.cocodataset.org/zips/train2017.zip) (18G) <br/>[val2017.zip](http://images.cocodataset.org/zips/val2017.zip) (1G)<br/>[annotations_trainval2017.zip](http://images.cocodataset.org/annotations/annotations_trainval2017.zip) (241MB)<br/>person_detection_results.zip from [OneDrive](https://1drv.ms/f/s!AhIXJn_J-blWzzDXoz5BeFl8sWM-) or [GoogleDrive](https://drive.google.com/drive/folders/1fRUDNUDxe9fjqcRZ2bnF_TKMlO0nB_dk?usp=sharing) (26.2MB) | <font color=green size=5>&check;</font> |
| **MPII**<br/>[url](http://human-pose.mpi-inf.mpg.de/) | Person | MPII Human Pose dataset is a state of the art benchmark for evaluation of articulated human pose estimation. The dataset includes around 25K images containing over 40K people with annotated body joints. The images were systematically collected using an established taxonomy of every day human activities. Overall the dataset covers 410 human activities and each image is provided with an activity label. Each image was extracted from a YouTube video and provided with preceding and following un-annotated frames. In addition, for the test set we obtained richer annotations including body part occlusions and 3D torso and head orientations. | [mpii_human_pose_v1.tar.gz](https://datasets.d2.mpi-inf.mpg.de/andriluka14cvpr/mpii_human_pose_v1.tar.gz) (12.9GB)<br/>[mpii_human_pose_v1_u12_2.zip](https://datasets.d2.mpi-inf.mpg.de/andriluka14cvpr/mpii_human_pose_v1_u12_2.zip) (12.5MB) | |
| Name | Field | Description | Download | Dataset API support | License |
| ------------------------------------------------------------ | ------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------------------- | --------------------------------------- |
| **COCO2017**<br/>[url](https://cocodataset.org/#home) | Person | The COCO dataset is a large-scale object detection, segmentation, key-point detection, and captioning dataset. The dataset consists of 328K images. It has been updated over several editions, and COCO2017 is widely used. In 2017, the training/validation split was 118K/5K, and the test set is a subset of 41K images from the 2015 test set. | [Baidu Netdisk (access code: bcmm)](https://pan.baidu.com/s/14rO11v1VAgdswRDqPVJjMA)<br/>[train2017.zip](http://images.cocodataset.org/zips/train2017.zip) (18G) <br/>[val2017.zip](http://images.cocodataset.org/zips/val2017.zip) (1G)<br/>[annotations_trainval2017.zip](http://images.cocodataset.org/annotations/annotations_trainval2017.zip) (241MB)<br/>person_detection_results.zip from [OneDrive](https://1drv.ms/f/s!AhIXJn_J-blWzzDXoz5BeFl8sWM-) or [GoogleDrive](https://drive.google.com/drive/folders/1fRUDNUDxe9fjqcRZ2bnF_TKMlO0nB_dk?usp=sharing) (26.2MB) | <font color=green size=5>&check;</font> | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L17) |
| **MPII**<br/>[url](http://human-pose.mpi-inf.mpg.de/) | Person | The MPII Human Pose dataset is a state-of-the-art benchmark for the evaluation of articulated human pose estimation. The dataset includes around 25K images containing over 40K people with annotated body joints. The images were systematically collected using an established taxonomy of everyday human activities. Overall, the dataset covers 410 human activities and each image is provided with an activity label. Each image was extracted from a YouTube video and provided with preceding and following un-annotated frames. In addition, the test set has richer annotations, including body part occlusions and 3D torso and head orientations. | [Baidu Netdisk (access code: w6af)](https://pan.baidu.com/s/1uscGGPlUBirulSSgb10Pfw)<br/>[mpii_human_pose_v1.tar.gz](https://datasets.d2.mpi-inf.mpg.de/andriluka14cvpr/mpii_human_pose_v1.tar.gz) (12.9GB)<br/>[mpii_human_pose_v1_u12_2.zip](https://datasets.d2.mpi-inf.mpg.de/andriluka14cvpr/mpii_human_pose_v1_u12_2.zip) (12.5MB) | | [LICENSE](https://github.com/alibaba/EasyCV/blob/master/docs/source/LICENSE#L52) |
| **CrowdPose**<br/>[url](https://github.com/Jeff-sjtu/CrowdPose) | Person | Multi-person pose estimation is fundamental to many computer vision tasks and has made significant progress in recent years. However, few previous methods explored the problem of pose estimation in crowded scenes, which remains challenging and inevitable in many scenarios. Moreover, current benchmarks cannot provide an appropriate evaluation for such cases. In [*CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark*](https://arxiv.org/abs/1812.00324), the authors propose a novel and efficient method to tackle pose estimation in crowds, together with a new dataset to better evaluate algorithms. | [images.zip](https://drive.google.com/file/d/1VprytECcLtU4tKP32SYi_7oDRbw7yUTL/view?usp=sharing) (2.2G)<br/>[Annotations](https://drive.google.com/drive/folders/1Ch1Cobe-6byB7sLhy8XRzOGCGTW2ssFv?usp=sharing) | |
| **OCHuman**<br/>[url](https://github.com/liruilong940607/OCHumanApi) | Person | This dataset focuses on heavily occluded humans, with comprehensive annotations including bounding boxes, human poses and instance masks. It contains 13,360 elaborately annotated human instances within 5,081 images. With an average MaxIoU of 0.573 per person, OCHuman is the most complex and challenging dataset related to humans. | [Images (667MB) & Annotations](https://cg.cs.tsinghua.edu.cn/dataset/form.html?dataset=ochuman) | |
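For convenience, below is a minimal command-line sketch of fetching the COCO2017 archives listed in the table above; the `data/coco` target directory is an assumed layout for illustration, not something mandated by the table.

```bash
# download the COCO2017 images and keypoint annotations linked in the table above
# (data/coco is an assumed target directory; adjust it to your own setup)
mkdir -p data/coco && cd data/coco
wget http://images.cocodataset.org/zips/train2017.zip
wget http://images.cocodataset.org/zips/val2017.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip

# unpack the archives; person_detection_results.zip has to be fetched manually
# from the OneDrive/GoogleDrive links above
unzip -q train2017.zip
unzip -q val2017.zip
unzip -q annotations_trainval2017.zip
```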


@ -39,12 +39,55 @@ pre-commit run --all-files
bash scripts/ci_test.sh
```
### 2.2 Test data
If you add new data, run the following to commit it to git-lfs before `git commit`:
```bash
python git-lfs/git_lfs.py add data/test/new_data
python git-lfs/git_lfs.py push
```
### 2.2 Test data storage
Since we need a lot of data for testing, including images and models, we use git lfs
to store those large files.
1. Install git-lfs (version >= 2.5.0).
For macOS:
```bash
brew install git-lfs
git lfs install
```
For CentOS, download the rpm package from the git-lfs GitHub releases [website](https://github.com/git-lfs/git-lfs/releases/tag/v3.2.0):
```bash
wget http://101374-public.oss-cn-hangzhou-zmf.aliyuncs.com/git-lfs-3.2.0-1.el7.x86_64.rpm
sudo rpm -ivh git-lfs-3.2.0-1.el7.x86_64.rpm
git lfs install
```
For Ubuntu:
```bash
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
git lfs install
```
2. Track your data types with git lfs. For example, to track png files (see the note on `.gitattributes` after this list):
```bash
git lfs track "*.png"
```
3. Add your test files to the `data/test/` folder; you can create subdirectories if needed.
```bash
git add data/test/test.png
```
4. Commit your test data to the remote branch.
```bash
git commit -m "xxx"
```
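As a side note, `git lfs track` records each tracked pattern in the repository's `.gitattributes` file, which should be committed together with your data; a minimal sketch of checking and committing it:
```bash
# `git lfs track "*.png"` appends a rule like the following to .gitattributes:
#   *.png filter=lfs diff=lfs merge=lfs -text
cat .gitattributes

# commit the updated .gitattributes alongside the tracked data
git add .gitattributes
git commit -m "track png files with git lfs"
```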
To pull data from the remote repo, pull it the same way you pull regular git files.
```bash
git pull origin branch_name
```
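If a checkout only contains LFS pointer files (for example, when git-lfs was installed after cloning), the objects can be fetched explicitly; a minimal sketch using standard git-lfs commands:
```bash
# list the files currently managed by git lfs in this checkout
git lfs ls-files

# download the LFS objects for the current branch and replace pointer files
git lfs pull
```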
## 3. Build pip package
```bash


@ -21,6 +21,9 @@
| hrnetw64 | [hrnetw64](https://github.com/alibaba/EasyCV/tree/master/configs/classification/imagenet/hrnet/imagenet_hrnetw64_jpg.py) | 79.884 | 95.04 | 5120 | 54.74 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/classification/resnet/hrnetw64/epoch_100.pth) |
| vit-base-patch16 | [vit-base-patch16](https://github.com/alibaba/EasyCV/tree/master/configs/classification/imagenet/vit/imagenet_vit_base_patch16_224_jpg.py) | 76.082 | 92.026 | 346 | 8.03 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/classification/vit/vit-base-patch16/epoch_300.pth) |
| swin-tiny-patch4-window7 | [swin-tiny-patch4-window7](https://github.com/alibaba/EasyCV/tree/master/configs/classification/imagenet/swint/imagenet_swin_tiny_patch4_window7_224_jpg.py) | 80.528 | 94.822 | 132 | 12.94 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/classification/swint/swin-tiny-patch4-window7/epoch_300.pth) |
| deitiii-small-patch16-224 | [deitiii-small-patch16-224](https://github.com/alibaba/EasyCV/tree/master/configs/classification/imagenet/vit/imagenet_deitiii_small_patch16_224_jpg.py) | 81.408 | 95.388 | 89 | 4.53 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/classification/deitiii/imagenet_deitiii_small_patch16_224/deitiii_small.pth) |
| deitiii-base-patch16-192 | [deitiii-base-patch16-192](https://github.com/alibaba/EasyCV/tree/master/configs/classification/imagenet/vit/imagenet_deitiii_base_patch16_192_jpg.py) | 82.982 | 95.95 | 337 | 4.63 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/classification/deitiii/imagenet_deitiii_base_patch16_192/deitiii_base.pth) |
| deitiii-large-patch16-192 | [deitiii-large-patch16-192](https://github.com/alibaba/EasyCV/tree/master/configs/classification/imagenet/vit/imagenet_deitiii_large_patch16_192_jpg.py) | 83.902 | 96.296 | 1170 | 10.17 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/classification/deitiii/imagenet_deitiii_large_patch16_192/deitiii_large.pth) |
(Note: the above results are from models trained with EasyCV; the default inference input size is 224 and the default machine is a V100 16G; the gpu memory column records GPU peak memory.)


@ -1,34 +1,49 @@
# Detection Model Zoo
## YOLOX
Inference uses a V100 16G GPU by default.
Pretrained on the COCO2017 dataset.
## YOLOX-PAI
| Algorithm | Config | Params | inference time(V100)<br/>(ms/img) | mAP<sup>val<br/><sub>0.5:0.95</sub> | AP<sup>val<br/><sub>50</sub> | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| YOLOX-s | [yolox_s_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_s_8xb16_300e_coco.py) | 9M | 10.7ms | 40.0 | 58.9 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_s_bs16_lr002/epoch_300.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_s_bs16_lr002/log.txt) |
| YOLOX-m | [yolox_m_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_m_8xb16_300e_coco.py) | 25M | 12.3ms | 46.3 | 64.9 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_m_bs16_lr002/epoch_300.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_m_bs16_lr002/log.txt) |
| YOLOX-l | [yolox_l_8xb8_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_m_8xb8_300e_coco.py) | 54M | 15.5ms | 48.9 | 67.5 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_l_bs8_lr001/epoch_290.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_l_bs8_lr001/log.txt) |
| YOLOX-x | [yolox_x_8xb8_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_x_8xb8_300e_coco.py) | 99M | 19ms | 50.9 | 69.2 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_x_bs8_lr001/epoch_290.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_x_bs8_lr001/log.txt) |
| YOLOX-tiny | [yolox_tiny_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_tiny_8xb16_300e_coco.py) | 5M | 9.5ms | 31.5 | 49.2 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_tiny_bs16_lr002/epoch_300.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_tiny_bs16_lr002/log.txt) |
| YOLOX-nano | [yolox_nano_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_tiny_8xb16_300e_coco.py) | 2.2M | 9.4ms | 26.5 | 42.6 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_nano_bs16_lr002/epoch_300.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_nano_bs16_lr002/log.txt) |
Pretrained on the COCO2017 dataset. (The results have been optimized with PAI-Blade and only account for the model inference time. For end-to-end inference time, refer to [export.md](./tutorials/export.md).)
| Algorithm | Config | Params | Speed<sup>V100<br/><sub>fp16 b32 </sub> | mAP<sup>val<br/><sub>0.5:0.95</sub> | AP<sup>val<br/><sub>50</sub> | Download |
| --------------------- | ------------------------------------------------------------ | ------ | --------------------------------------- | ----------------------------------- | ---------------------------- | ------------------------------------------------------------ |
| YOLOX-s | [yolox_s_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_s_8xb16_300e_coco.py) | 9M | 0.68ms | 40.0 | 58.9 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_s_bs16_lr002/epoch_300.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_s_bs16_lr002/log.txt) |
| PAI-YOLOXs | [yoloxs_pai_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/pai_yoloxs_8xb16_300e_coco.py) | 16M | 0.71ms | 41.4 | 60.0 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox-pai/model/pai_yoloxs.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox-pai/log/pai_yoloxs.json) |
| PAI-YOLOXs-ASFF | [yoloxs_pai_asff_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/pai_yoloxs_asff_8xb16_300e_coco.py) | 21M | 0.87ms | 42.8 | 61.8 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox-pai/model/pai_yoloxs_asff.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox-pai/log/pai_yoloxs_asff.json) |
| PAI-YOLOXs-ASFF-TOOD3 | [yoloxs_pai_asff_tood3_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/pai_yoloxs_asff_tood3_8xb16_300e_coco.py) | 24M | 1.15ms | 43.9 | 62.1 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox-pai/model/pai_yoloxs_asff_tood3.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox-pai/log/pai_yoloxs_asff_tood3.json) |
| YOLOX-m | [yolox_m_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_m_8xb16_300e_coco.py) | 25M | 1.52ms | 46.3 | 64.9 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_m_bs16_lr002/epoch_300.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_m_bs16_lr002/log.txt) |
| YOLOX-l | [yolox_l_8xb8_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_m_8xb8_300e_coco.py) | 54M | 2.47ms | 48.9 | 67.5 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_l_bs8_lr001/epoch_290.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_l_bs8_lr001/log.txt) |
| YOLOX-x | [yolox_x_8xb8_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_x_8xb8_300e_coco.py) | 99M | 4.74ms | 50.9 | 69.2 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_x_bs8_lr001/epoch_290.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_x_bs8_lr001/log.txt) |
| YOLOX-tiny | [yolox_tiny_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_tiny_8xb16_300e_coco.py) | 5M | 0.28ms | 31.5 | 49.2 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_tiny_bs16_lr002/epoch_300.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_tiny_bs16_lr002/log.txt) |
| YOLOX-nano | [yolox_nano_8xb16_300e_coco](https://github.com/alibaba/EasyCV/tree/master/configs/detection/yolox/yolox_tiny_8xb16_300e_coco.py) | 2.2M | 0.19ms | 26.5 | 42.6 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_nano_bs16_lr002/epoch_300.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/yolox/yolox_nano_bs16_lr002/log.txt) |
## ViTDet
| Algorithm | Config | Params<br/>(backbone/total) | inference time(V100)<br/>(ms/img) | bbox_mAP<sup>val<br/><sub>0.5:0.95</sub> | mask_mAP<sup>val<br/><sub>0.5:0.95</sub> | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| ViTDet_MaskRCNN | [vitdet_maskrcnn](https://github.com/alibaba/EasyCV/tree/master/configs/detection/vitdet/vitdet_100e.py) | 88M/118M | 163ms | 50.57 | 44.96 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/vitdet/vit_base/vitdet_maskrcnn.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/vitdet/vit_base/vitdet_maskrcnn.log.json) |
| Algorithm | Config | Params<br/>(backbone/total) | Train memory<br/>(GB) | inference time(V100)<br/>(ms/img) | bbox_mAP<sup>val<br/><sub>0.5:0.95</sub> | mask_mAP<sup>val<br/><sub>0.5:0.95</sub> | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| ViTDet_MaskRCNN | [vitdet_maskrcnn](https://github.com/alibaba/EasyCV/tree/master/configs/detection/vitdet/vitdet_mask_rcnn_100e.py) | 86M/111M | 13.3 (fp16) | 138ms | 50.65 | 45.41 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/vitdet/vit_base/epoch_100.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/vitdet/vit_base/20220901_135827.log.json) |
## FCOS
| Algorithm | Config | Params<br/>(backbone/total) | inference time(V100)<br/>(ms/img) | mAP<sup>val<br/><sub>0.5:0.95</sub> | AP<sup>val<br/><sub>50</sub> | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| FCOS-r50 | [fcos-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/fcos/fcos_center-normbbox-centeronreg-giou_r50_caffe_fpn_gn-head_1x_coco.py) | 23M/32M | 85.8ms | 38.58 | 57.18 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/fcos/epoch_12.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/fcos/20220621_121315.log.json) |
| Algorithm | Config | Params<br/>(backbone/total) | Train memory<br/>(GB) | inference time(V100)<br/>(ms/img) | mAP<sup>val<br/><sub>0.5:0.95</sub> | AP<sup>val<br/><sub>50</sub> | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| FCOS-r50(caffe) | [fcos-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/fcos/fcos_r50_caffe_1x_coco.py) | 23M/32M | 5.0 | 85.8ms | 38.58 | 57.18 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/fcos/epoch_12.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/fcos/20220621_121315.log.json) |
| FCOS-r50(torch) | [fcos-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/fcos/fcos_r50_torch_1x_coco.py) | 23M/32M | 4.0 (fp16) | 105.3ms | 38.88 | 58.01 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/fcos/fcos_epoch_12.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/fcos/20220826_182628.log.json) |
## DETR
| Algorithm | Config | Params<br/>(backbone/total) | inference time(V100)<br/>(ms/img) | bbox_mAP<sup>val<br/><sub>0.5:0.95</sub> | AP<sup>val<br/><sub>50</sub> | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| DETR-r50 | [detr-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/detr/detr_r50_8x2_150e_coco.py) | 23M/41M | 48.5ms | 39.92 | 60.52 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/detr/epoch_150.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/detr/20220609_101243.log.json) |
| DAB-DETR-r50 | [dab-detr-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dab_detr/dab_detr_r50_8x2_50e_coco.py) | 23M/43M | 58.5ms | 42.52 | 63.03 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dab_detr/dab_detr_epoch_50.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dab_detr/20220610_122811.log.json) |
| DN-DETR-r50 | [dn-detr-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dab_detr/dn_detr_r50_8x2_50e_coco.py) | 23M/43M | 58.5ms | 44.39 | 64.66 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dn_detr/dn_detr_epoch_50.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dn_detr/20220713_105127.log.json) |
| Algorithm | Config | Params<br/>(backbone/total) | Train memory<br/>(GB) | inference time(V100)<br/>(ms/img) | bbox_mAP<sup>val<br/><sub>0.5:0.95</sub> | AP<sup>val<br/><sub>50</sub> | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| DETR-r50 | [detr-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/detr/detr_r50_8x2_150e_coco.py) | 23M/41M | 8.5 | 48.5ms | 39.92 | 60.52 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/detr/epoch_150.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/detr/20220609_101243.log.json) |
| DAB-DETR-r50 | [dab-detr-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dab_detr/dab_detr_r50_8x2_50e_coco.py) | 23M/43M | 2.6 | 58.5ms | 42.52 | 63.03 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dab_detr/dab_detr_epoch_50.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dab_detr/20220610_122811.log.json) |
| DN-DETR-r50 | [dn-detr-r50](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dab_detr/dn_detr_r50_8x2_50e_coco.py) | 23M/43M | 7.8 | 58.5ms | 44.39 | 64.66 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dn_detr/dn_detr_epoch_50.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dn_detr/20220713_105127.log.json) |
## DINO
| Algorithm | Config | Params<br/>(backbone/total) | inference time(V100)<br/>(ms/img) | bbox_mAP<sup>val<br/><sub>0.5:0.95</sub> | AP<sup>val<br/><sub>50</sub> | Download | Comment |
| ---------- | ------------------------------------------------------------ | ------------------------ | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | --------------------------------------------- |
| DINO_4sc_r50_12e | [DINO_4sc_r50_12e](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dino/dino_4sc_r50_12e_coco.py) | 23M/47M | 184ms | 48.71 | 66.27 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_r50_12e/epoch_12.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_r50_12e/20220815_141403.log.json) |Inference use V100 32G|
| DINO_4sc_r50_36e | [DINO_4sc_r50_36e](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dino/dino_4sc_r50_36e_coco.py) | 23M/47M | 184ms | 50.69 | 68.60 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_r50_36e/epoch_29.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_r50_36e/20220817_101549.log.json) |Inference use V100 32G|
| DINO_4sc_swinl_12e | [DINO_4sc_swinl_12e](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dino/dino_4sc_swinl_12e_coco.py) | 195M/217M | 155ms | 56.86 | 75.61 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_swinl_12e/epoch_12.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_swinl_12e/20220815_211633.log.json) |Inference use V100 32G|
| DINO_4sc_swinl_36e | [DINO_4sc_swinl_36e](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dino/dino_4sc_swinl_36e_coco.py) | 195M/217M | 155ms | 58.04 | 76.76 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_swinl_36e/epoch_34.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_4sc_swinl_36e/20220817_101416.log.json) |Inference use V100 32G|
| DINO_5sc_swinl_36e | [DINO_5sc_swinl_36e](https://github.com/alibaba/EasyCV/tree/master/configs/detection/dino/dino_5sc_swinl_36e_coco.py) | 195M/217M | 235ms | 58.47 | 77.10 | [model](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_5sc_swinl_36e/epoch_35.pth) - [log](https://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/detection/dino/dino_5sc_swinl_36e/20220820_215711.log.json) |Inference use V100 32G|


@ -4,25 +4,40 @@
Pretrained on **Pascal VOC 2012 + Aug**.
| Algorithm | Config | Params<br/>(backbone/total) | inference time(V100)<br/>(ms/img) | mIoU | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| fcn_r50_d8 | [fcn_r50-d8_512x512_8xb4_60e_voc12aug](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/fcn/fcn_r50-d8_512x512_8xb4_60e_voc12aug.py) | 23M/49M | 166ms | 69.01 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/fcn_r50/epoch_60.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/fcn_r50/20220525_203606.log.json) |
| Algorithm | Config | Params<br/>(backbone/total) | Train memory<br/>(GB) | inference time(V100)<br/>(ms/img) | mIoU | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| fcn_r50_d8 | [fcn_r50-d8_512x512_8xb4_60e_voc12aug](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/fcn/fcn_r50-d8_512x512_8xb4_60e_voc12aug.py) | 23M/49M | 19.8 | 166ms | 69.01 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/fcn_r50/epoch_60.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/fcn_r50/20220525_203606.log.json) |
## UperNet
Pretrained on **Pascal VOC 2012 + Aug**.
| Algorithm | Config | Params<br/>(backbone/total) | inference time(V100)<br/>(ms/img) | mIoU | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| upernet_r50 | [upernet_r50_512x512_8xb4_60e_voc12aug](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/upernet/upernet_r50_512x512_8xb4_60e_voc12aug.py) | 23M/66M | 282.9ms | 76.59 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/upernet_r50/epoch_60.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/upernet_r50/20220706_114712.log.json) |
| Algorithm | Config | Params<br/>(backbone/total) | Train memory<br/>(GB) | inference time(V100)<br/>(ms/img) | mIoU | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| upernet_r50 | [upernet_r50_512x512_8xb4_60e_voc12aug](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/upernet/upernet_r50_512x512_8xb4_60e_voc12aug.py) | 23M/66M | 5.5 | 282.9ms | 76.59 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/upernet_r50/epoch_60.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/upernet_r50/20220706_114712.log.json) |
## Mask2Former
### Instance Segmentation on COCO
| Algorithm | Config | box mAP | mask mAP | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ |----------|---------------------------------------------------------------------------- |
| mask2former_r50 | [mask2former_r50_8xb2_e50_instance](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/mask2former/mask2former_r50_8xb2_e50_instance.py) | 46.09 | 43.26 |[model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/mask2former_r50_instance/epoch_50.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/mask2former_r50_instance/20220620_113639.log.json) |
| Algorithm | Config | Train memory<br/>(GB) | box mAP | mask mAP | Download |
| ---------- | ------------------------------------------------------------ |----------|----------|----------|----------|
| mask2former_r50 | [mask2former_r50_8xb2_e50_instance](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/mask2former/mask2former_r50_8xb2_e50_instance.py) | 18.8 | 46.09 | 43.26 |[model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/mask2former_r50_instance/epoch_50.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/mask2former_r50_instance/20220620_113639.log.json) |
### Panoptic Segmentation on COCO
| Algorithm | Config | PQ | box mAP | mask mAP | Download |
| ---------- | ---------- | ------------------------------------------------------------ | ------------------------ |----------|---------------------------------------------------------------------------- |
| mask2former_r50 | [mask2former_r50_8xb2_e50_panopatic](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/mask2former/mask2former_r50_8xb2_e50_panopatic.py) | 51.64 | 44.81 | 41.88 |[model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/mask2former_r50_panoptic/epoch_50.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/mask2former_r50_panoptic/20220629_170721.log.json) |
| Algorithm | Config | Train memory<br/>(GB) | PQ | box mAP | mask mAP | Download |
| ---------- | ---------- | ------------------------------------------------------------ | ------------------------ |----------|---------------------------------------------------------------------------- |---------------------------------------------------------------------------- |
| mask2former_r50 | [mask2former_r50_8xb2_e50_panopatic](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/mask2former/mask2former_r50_8xb2_e50_panopatic.py) | 18.8 | 51.64 | 44.81 | 41.88 |[model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/mask2former_r50_panoptic/epoch_50.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/modelzoo/segmentation/mask2former_r50_panoptic/20220629_170721.log.json) |
## SegFormer
Semantic segmentation models trained on **COCO-Stuff 164k**.
| Algorithm | Config | Params<br/>(backbone/total) | inference time(V100)<br/>(ms/img) |mIoU | Download |
| ---------- | ------------------------------------------------------------ | ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| SegFormer_B0 | [segformer_b0_coco.py](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/segformer/segformer_b0_coco.py) | 3.3M/3.8M | 47.2ms | 35.91 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b0/SegmentationEvaluator_mIoU_best.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b0/20220909_152337.log.json) |
| SegFormer_B1 | [segformer_b1_coco.py](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/segformer/segformer_b1_coco.py) | 13.2M/13.7M | 46.8ms | 40.53 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b1/SegmentationEvaluator_mIoU_best.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b1/20220825_200708.log.json) |
| SegFormer_B2 | [segformer_b2_coco.py](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/segformer/segformer_b2_coco.py) | 24.2M/27.5M | 49.1ms | 44.53 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b2/SegmentationEvaluator_mIoU_best.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b2/20220829_163757.log.json) |
| SegFormer_B3 | [segformer_b3_coco.py](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/segformer/segformer_b3_coco.py) | 44.1M/47.4M | 52.3ms | 45.49 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b3/SegmentationEvaluator_mIoU_best.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b3/20220830_142021.log.json) |
| SegFormer_B4 | [segformer_b4_coco.py](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/segformer/segformer_b4_coco.py) | 60.8M/64.1M | 58.5ms | 46.27 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b4/SegmentationEvaluator_mIoU_best.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b4/20220902_135723.log.json) |
| SegFormer_B5 | [segformer_b5_coco.py](https://github.com/alibaba/EasyCV/tree/master/configs/segmentation/segformer/segformer_b5_coco.py) | 81.4M/85.7M | 99.2ms | 46.75 | [model](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b5/SegmentationEvaluator_mIoU_best.pth) - [log](http://pai-vision-data-hz.oss-cn-zhangjiakou.aliyuncs.com/EasyCV/damo/modelzoo/segmentation/segformer/segformer_b5/20220812_144336.log.json) |

Some files were not shown because too many files have changed in this diff.