# Mask2Former

[Masked-attention Mask Transformer for Universal Image Segmentation](https://arxiv.org/abs/2112.01527)

## Introduction

### Abstract
Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
```bibtex
@inproceedings{cheng2021mask2former,
  title={Masked-attention Mask Transformer for Universal Image Segmentation},
  author={Bowen Cheng and Ishan Misra and Alexander G. Schwing and Alexander Kirillov and Rohit Girdhar},
  booktitle={CVPR},
  year={2022}
}

@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}
```
## Usage

- The Mask2Former model requires MMDetection to be installed first:

```shell
pip install "mmdet>=3.0.0rc4"
```
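Because MMSegmentation imports the Mask2Former modules behind a guarded try/except, a quick try-import check confirms the environment is set up. This is only an illustrative sketch; the printed messages are our own wording, not MMSegmentation output.

```python
# Sanity check: Mask2Former configs cannot be built without MMDetection,
# since the required modules are imported behind a try/except guard.
try:
    import mmdet
    print(f"mmdet {mmdet.__version__} found; Mask2Former configs can be built.")
except ImportError:
    print('MMDetection missing; install it with: pip install "mmdet>=3.0.0rc4"')
```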
## Results and models

### Cityscapes

| Method      | Backbone       | Crop Size | Lr schd | Mem (MB) | Inf time (fps) | mIoU  | mIoU(ms+flip) | config | download     |
| ----------- | -------------- | --------- | ------- | -------- | -------------- | ----- | ------------- | ------ | ------------ |
| Mask2Former | R-50-D32       | 512x1024  | 90000   | 5806     | 9.17           | 80.44 | -             | config | model \| log |
| Mask2Former | R-101-D32      | 512x1024  | 90000   | 6971     | 7.11           | 80.80 | -             | config | model \| log |
| Mask2Former | Swin-T         | 512x1024  | 90000   | 6511     | 7.18           | 81.71 | -             | config | model \| log |
| Mask2Former | Swin-S         | 512x1024  | 90000   | 8282     | 5.57           | 82.57 | -             | config | model \| log |
| Mask2Former | Swin-B (in22k) | 512x1024  | 90000   | 11152    | 4.32           | 83.52 | -             | config | model \| log |
| Mask2Former | Swin-L (in22k) | 512x1024  | 90000   | 16207    | 2.86           | 83.65 | -             | config | model \| log |
### ADE20K

| Method      | Backbone       | Crop Size | Lr schd | Mem (MB) | Inf time (fps) | mIoU  | mIoU(ms+flip) | config | download     |
| ----------- | -------------- | --------- | ------- | -------- | -------------- | ----- | ------------- | ------ | ------------ |
| Mask2Former | R-50-D32       | 512x512   | 160000  | 3385     | 26.59          | 47.87 | -             | config | model \| log |
| Mask2Former | R-101-D32      | 512x512   | 160000  | 4190     | 22.97          | 48.60 | -             | config | model \| log |
| Mask2Former | Swin-T         | 512x512   | 160000  | 3826     | 23.82          | 48.66 | -             | config | model \| log |
| Mask2Former | Swin-S         | 512x512   | 160000  | 5034     | 19.69          | 51.24 | -             | config | model \| log |
| Mask2Former | Swin-B         | 640x640   | 160000  | 5795     | 12.48          | 52.44 | -             | config | model \| log |
| Mask2Former | Swin-B (in22k) | 640x640   | 160000  | 5795     | 12.43          | 53.90 | -             | config | model \| log |
| Mask2Former | Swin-L (in22k) | 640x640   | 160000  | 9077     | 8.81           | 56.01 | -             | config | model \| log |
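The configs above work with the standard MMSegmentation 1.x Python API. Below is a minimal inference sketch; the checkpoint filename is a hypothetical placeholder, so substitute the file downloaded from the corresponding "model" link in the table.

```python
# Minimal inference sketch using the MMSegmentation 1.x API.
from mmseg.apis import init_model, inference_model

config = 'configs/mask2former/mask2former_r50_8xb2-90k_cityscapes-512x1024.py'
# Hypothetical local path: download the checkpoint from the "model" link above.
checkpoint = 'mask2former_r50_8xb2-90k_cityscapes-512x1024.pth'

model = init_model(config, checkpoint, device='cuda:0')
result = inference_model(model, 'demo/demo.png')  # demo image shipped with the repo
print(result.pred_sem_seg.data.shape)  # torch.Size([1, H, W]) class-index map
```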
Note:

- All experiments of Mask2Former were run on 8 A100 GPUs with 2 samples per GPU.
- As mentioned in the official repo, Mask2Former results are relatively unstable; the Mask2Former (Swin-S) result on ADE20K reported in the table is the median of 5 training runs, following the authors' suggestion.
- The ResNet backbones utilized in Mask2Former models are the standard `ResNet` rather than `ResNetV1c`.
- Test-time augmentation is not yet supported in MMSegmentation 1.x, so we will add the "ms+flip" results as soon as possible.