* [Feature] Support MaskFormer(NeurIPS'2021) in MMSeg 1.x * add mmdet try except logic * refactor config files * add readme * fix config * update models & logs * add MMDET installation and fix info * fix comments * fix * fix config norm optimizer setting * update models & logs & unittest * add docstring of MaskFormerHead * wait for mmdet 3.0.0rc4 * replace seg_mask with seg_logits & add docstring for batch_input_shape * use mmdet3.0.0rc4 * fix readme and modify config comments * add mmdet installation in pr_stage_test.yml * update mmcv version in pr_stage_test.yml * add mmdet in build_cpu of pr_stage_test.yml * modify mmdet& mmcv installation in merge_stage_test.yml * fix typo * update test.yml * update test.yml |
||
---|---|---|
.. | ||
README.md | ||
maskformer.yml | ||
maskformer_r50-d32_8xb2-160k_ade20k-512x512.py | ||
maskformer_r101-d32_8xb2-160k_ade20k-512x512.py | ||
maskformer_swin-s_upernet_8xb2-160k_ade20k-512x512.py | ||
maskformer_swin-t_upernet_8xb2-160k_ade20k-512x512.py |
README.md
MaskFormer
MaskFormer: Per-Pixel Classification is Not All You Need for Semantic Segmentation
Introduction
Abstract
Modern approaches typically formulate semantic segmentation as a per-pixel classification task, while instance-level segmentation is handled with an alternative mask classification. Our key insight: mask classification is sufficiently general to solve both semantic- and instance-level segmentation tasks in a unified manner using the exact same model, loss, and training procedure. Following this observation, we propose MaskFormer, a simple mask classification model which predicts a set of binary masks, each associated with a single global class label prediction. Overall, the proposed mask classification-based method simplifies the landscape of effective approaches to semantic and panoptic segmentation tasks and shows excellent empirical results. In particular, we observe that MaskFormer outperforms per-pixel classification baselines when the number of classes is large. Our mask classification-based method outperforms both current state-of-the-art semantic (55.6 mIoU on ADE20K) and panoptic segmentation (52.7 PQ on COCO) models.

@article{cheng2021per,
title={Per-pixel classification is not all you need for semantic segmentation},
author={Cheng, Bowen and Schwing, Alex and Kirillov, Alexander},
journal={Advances in Neural Information Processing Systems},
volume={34},
pages={17864--17875},
year={2021}
}
Usage
- MaskFormer model needs to install MMDetection first.
pip install "mmdet>=3.0.0rc4"
Results and models
ADE20K
Method | Backbone | Crop Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download |
---|---|---|---|---|---|---|---|---|---|
MaskFormer | R-50-D32 | 512x512 | 160000 | 3.29 | 42.20 | 44.29 | - | config | model | log |
MaskFormer | R-101-D32 | 512x512 | 160000 | 4.12 | 34.90 | 45.11 | - | config | model | log |
MaskFormer | Swin-T | 512x512 | 160000 | 3.73 | 40.53 | 46.69 | - | config | model | log |
MaskFormer | Swin-S | 512x512 | 160000 | 5.33 | 26.98 | 49.36 | - | config | model | log |
Note:
- All experiments of MaskFormer are implemented with 8 V100 (32G) GPUs with 2 samplers per GPU.
- The results of MaskFormer are relatively not stable. The accuracy (mIoU) of model with
R-101-D32
is from 44.7 to 46.0, and withSwin-S
is from 49.0 to 49.8. - The ResNet backbones utilized in MaskFormer models are standard
ResNet
rather thanResNetV1c
. - Test time augmentation is not supported in MMSegmentation 1.x version yet, we would add "ms+flip" results as soon as possible.