# HiViT

> HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer

## Abstract
Recently, masked image modeling (MIM) has offered a new methodology of self-supervised pre-training of vision transformers. A key idea of efficient implementation is to discard the masked image patches (or tokens) throughout the target network (encoder), which requires the encoder to be a plain vision transformer (e.g., ViT), albeit hierarchical vision transformers (e.g., Swin Transformer) have potentially better properties in formulating vision inputs. In this paper, we offer a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT) that enjoys both high efficiency and good performance in MIM. The key is to remove the unnecessary "local inter-unit operations", deriving structurally simple hierarchical vision transformers in which mask-units can be serialized like plain vision transformers. For this purpose, we start with Swin Transformer and (i) set the masking unit size to be the token size in the main stage of Swin Transformer, (ii) switch off inter-unit self-attentions before the main stage, and (iii) eliminate all operations after the main stage. Empirical studies demonstrate the advantageous performance of HiViT in terms of fully-supervised, self-supervised, and transfer learning. In particular, in running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B, and the performance gain generalizes to downstream tasks of detection and segmentation. Code will be made publicly available.
## How to use it?

**Train/Test Command**

Prepare your dataset according to the docs.

Train:

```shell
python tools/train.py configs/hivit/hivit-tiny-p16_16xb64_in1k.py
```
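Test (a sketch: `tools/test.py` is MMPretrain's standard test entry point; no checkpoint is released for these configs yet, so the checkpoint path below is a placeholder):

```shell
python tools/test.py configs/hivit/hivit-tiny-p16_16xb64_in1k.py /path/to/checkpoint.pth
```

The backbone can also be built directly in Python. A minimal sketch, assuming this config name is registered with the `mmpretrain` package (`get_model` and `extract_feat` are its standard helpers); since no pretrained weights are linked below, the model starts from random initialization:

```python
# Sketch: build HiViT-Tiny from this folder's config and run a dummy forward.
import torch
from mmpretrain import get_model

# No released checkpoint for these configs, so pretrained=False.
model = get_model('hivit-tiny-p16_16xb64_in1k', pretrained=False)

# Extract backbone features from a dummy ImageNet-sized batch.
inputs = torch.rand(1, 3, 224, 224)
feats = model.extract_feat(inputs)
print([f.shape for f in feats])  # feature shape(s) returned by the backbone
```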
## Models and results

### Image Classification on ImageNet-1k

| Model                         | Pretrain     | Params (M) | Flops (G) | Top-1 (%) | Config | Download |
| :---------------------------- | :----------- | :--------: | :-------: | :-------: | :----: | :------: |
| `hivit-tiny-p16_16xb64_in1k`  | From scratch |   19.18    |   4.60    |   82.10   | config |   N/A    |
| `hivit-small-p16_16xb64_in1k` | From scratch |   37.53    |   9.07    |    N/A    | config |   N/A    |
| `hivit-base-p16_16xb64_in1k`  | From scratch |   79.05    |   18.47   |    N/A    | config |   N/A    |
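The Params and Flops columns can presumably be reproduced with MMPretrain's complexity analysis script (an assumption about the repo layout; recent releases ship it at `tools/analysis_tools/get_flops.py`):

```shell
# Hypothetical reproduction of the Params/Flops columns above (224x224 input).
python tools/analysis_tools/get_flops.py configs/hivit/hivit-tiny-p16_16xb64_in1k.py
```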
## Citation

```bibtex
@inproceedings{zhanghivit,
  title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
  author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
  booktitle={International Conference on Learning Representations},
  year={2023},
}
```