mmsegmentation/projects/dest/configs
Boyin Zhang 409caf8548
[DEST] add DEST model (#2482)
## Motivation

We are from NVIDIA and we have developed a simplified and
inference-efficient transformer for dense prediction tasks. The method
is based on SegFormer with hardware-friendly design choices, resulting
in better accuracy and over 2x reduction in inference speed as compared
to the baseline. We believe this model would be of particular interests
to those who want to deploy an efficient vision transformer for
production, and it is easily adaptable to other tasks. Therefore, we
would like to contribute our method to mmsegmentation in order to
benefit a larger audience.

The paper was accepted to [Transformer for Vision
workshop](https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsites.google.com%2Fview%2Ft4v-cvpr22%2Fpapers%3Fauthuser%3D0&data=05%7C01%7Cboyinz%40nvidia.com%7Cbf078d69821449d1f4c908dab5e8c7da%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638022308636438546%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=XtSgPQrbVgHxt5L9XkXF%2BGWvc95haB3kKPcHnsVIF3M%3D&reserved=0)
at CVPR 2022, here below are some resource links:
Paper
[https://arxiv.org/pdf/2204.13791.pdf](https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Farxiv.org%2Fpdf%2F2204.13791.pdf&data=05%7C01%7Cboyinz%40nvidia.com%7Cbf078d69821449d1f4c908dab5e8c7da%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638022308636438546%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=X%2FCVoa6PFA09EHfClES36QOa5NvbZu%2F6IDfBVwiYywU%3D&reserved=0)
(Table 3 shows the semseg results)
Code
[https://github.com/NVIDIA/DL4AGX/tree/master/DEST](https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FNVIDIA%2FDL4AGX%2Ftree%2Fmaster%2FDEST&data=05%7C01%7Cboyinz%40nvidia.com%7Cbf078d69821449d1f4c908dab5e8c7da%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638022308636438546%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=9DLQZpEq1cN75%2FDf%2FniUOOUFS1ABX8FEUH02O6isGVQ%3D&reserved=0)
A webinar on its application
[https://www.nvidia.com/en-us/on-demand/session/other2022-drivetraining/](https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.nvidia.com%2Fen-us%2Fon-demand%2Fsession%2Fother2022-drivetraining%2F&data=05%7C01%7Cboyinz%40nvidia.com%7Cbf078d69821449d1f4c908dab5e8c7da%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638022308636438546%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=8jrBC%2Bp3jGxiaW4vtSfhh6GozC3tRqGNjNoALM%2FOYxs%3D&reserved=0)

## Modification

Add backbone(smit.py) and head(dest_head.py) of DEST

## BC-breaking (Optional)

N/A

## Use cases (Optional)

N/A

---------

Co-authored-by: MeowZheng <meowzheng@outlook.com>
2023-02-16 17:42:34 +08:00
..
README.md [DEST] add DEST model (#2482) 2023-02-16 17:42:34 +08:00
dest_simpatt-b0.py [DEST] add DEST model (#2482) 2023-02-16 17:42:34 +08:00
dest_simpatt-b0_1024x1024_160k_cityscapes.py [DEST] add DEST model (#2482) 2023-02-16 17:42:34 +08:00
dest_simpatt-b1_1024x1024_160k_cityscapes.py [DEST] add DEST model (#2482) 2023-02-16 17:42:34 +08:00
dest_simpatt-b2_1024x1024_160k_cityscapes.py [DEST] add DEST model (#2482) 2023-02-16 17:42:34 +08:00
dest_simpatt-b3_1024x1024_160k_cityscapes.py [DEST] add DEST model (#2482) 2023-02-16 17:42:34 +08:00
dest_simpatt-b4_1024x1024_160k_cityscapes.py [DEST] add DEST model (#2482) 2023-02-16 17:42:34 +08:00
dest_simpatt-b5_1024x1024_160k_cityscapes.py [DEST] add DEST model (#2482) 2023-02-16 17:42:34 +08:00

README.md

DEST

DEST: Depth Estimation with Simplified Transformer

Introduction

Official Repo

Abstract

Transformer and its variants have shown state-of-the-art results in many vision tasks recently, ranging from image classification to dense prediction. Despite of their success, limited work has been reported on improving the model efficiency for deployment in latency-critical applications, such as autonomous driving and robotic navigation. In this paper, we aim at improving upon the existing transformers in vision, and propose a method for Dense Estimation with Simplified Transformer (DEST), which is efficient and particularly suitable for deployment on GPU-based platforms. Through strategic design choices, our model leads to significant reduction in model size, complexity, as well as inference latency, while achieving superior accuracy as compared to state-of-the-art in the task of self-supervised monocular depth estimation. We also show that our design generalize well to other dense prediction task such as semantic segmentation without bells and whistles.

Citation

@article{YangDEST,
  title={Depth Estimation with Simplified Transformer},
  author={Yang, John and An, Le and Dixit, Anurag and Koo, Jinkyu and Park, Su Inn},
  journal={arXiv preprint arXiv:2204.13791},
  year={2022}
}

Results and models

Cityscapes

Method Backbone Crop Size Lr schd Mem (GB) Inf time (fps) mIoU mIoU(ms+flip) config download
DEST SMIT-B0 1024x1024 160000 - - 64.34 - config model
DEST SMIT-B1 1024x1024 160000 - - 68.21 - config model
DEST SMIT-B2 1024x1024 160000 - - 71.89 - config model
DEST SMIT-B3 1024x1024 160000 - - 73.51 - config model
DEST SMIT-B4 1024x1024 160000 - - 73.99 - config model
DEST SMIT-B5 1024x1024 160000 - - 75.28 - config model

Note:

  • The above models are all training from scratch without pretrained backbones. Accuracy can be further enhanced by appropriate pretraining.
  • Training of DEST is not very stable, which is sensitive to random seeds.