MoCo v3

This is a PyTorch implementation of MoCo v3:

@Article{chen2021mocov3,
  author  = {Xinlei Chen* and Saining Xie* and Kaiming He},
  title   = {An Empirical Study of Training Self-Supervised Vision Transformers},
  journal = {arXiv preprint arXiv:2104.02057},
  year    = {2021},
}

Preparation

Install PyTorch and download the ImageNet dataset following the official PyTorch ImageNet training code. Similar to MoCo, this code release contains only minimal modifications to that code, covering both unsupervised pre-training and linear classification.

In addition, install timm for the Vision Transformer (ViT) models.
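
As a quick, optional check of the environment and the expected dataset layout, the snippet below assumes only the standard torchvision ImageFolder arrangement (a train/ and a val/ folder of per-class subdirectories); the dataset path is a placeholder:

# Optional sanity check for the environment and dataset layout.
# Assumption: ImageNet follows the standard torchvision ImageFolder layout,
# i.e. [your imagenet-folder]/train/<class>/<image>.JPEG plus a matching val/ folder.
import timm
import torch
import torchvision.datasets as datasets

print("torch:", torch.__version__, "| timm:", timm.__version__)

train_set = datasets.ImageFolder("[your imagenet-folder]/train")
val_set = datasets.ImageFolder("[your imagenet-folder]/val")
print("train classes:", len(train_set.classes))  # expect 1000 for ImageNet-1k
print("val images:", len(val_set))               # expect 50000 for ImageNet-1k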

Pre-Training

Similar to MoCo, only multi-gpu, DistributedDataParallel training is supported; single-gpu or DataParallel training is not supported. In addition, the code is improved to better suit the multi-node setting, and by default uses automatic mixed-precision for pre-training.
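
For orientation, the sketch below shows the generic DistributedDataParallel + automatic-mixed-precision training loop that this setup relies on. It is a minimal illustration, not the repository's actual code: the model, criterion, loader, and gpu arguments are placeholders, and the real MoCo v3 objective contrasts two augmented views through a momentum encoder.

# Minimal sketch of the DDP + mixed-precision pattern used for pre-training.
# All names here (model, criterion, loader, gpu) are placeholders for
# illustration; main_moco.py implements the full MoCo v3 training loop.
import torch

def train_loop(model, criterion, optimizer, loader, gpu):
    # One process per GPU; torch.distributed.init_process_group(...) must
    # already have been called (handled by --multiprocessing-distributed).
    model = model.cuda(gpu)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])

    scaler = torch.cuda.amp.GradScaler()        # loss scaling for mixed precision
    for images, _ in loader:
        images = images.cuda(gpu, non_blocking=True)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():         # forward pass in mixed precision
            loss = criterion(model(images))     # placeholder objective
        scaler.scale(loss).backward()           # scaled backward pass
        scaler.step(optimizer)
        scaler.update()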

Below we list some MoCo v3 pre-training commands as examples. They cover different model architectures, training epochs, single-/multi-node, etc.

ResNet-50, 100-Epoch, 2-Node.

This is the default setting for most hyper-parameters. With a batch size of 4096, the training fits into 2 nodes with a total of 16 Volta 32G GPUs.

On the first node, run:

python main_moco.py \
  --dist-url 'tcp://[your node 1 address]:[specified port]' \
  --multiprocessing-distributed --world-size 2 --rank 0 \
  [your imagenet-folder with train and val folders]

On the second node, run:

python main_moco.py \
  --dist-url 'tcp://[your node 1 address]:[specified port]' \
  --multiprocessing-distributed --world-size 2 --rank 1 \
  [your imagenet-folder with train and val folders]

ViT-Small, 300-Epoch, 1-Node.

With a batch size of 1024, ViT-Small fits into a single node of 8 Volta 32G GPUs:

python main_moco.py \
  -a vit_small -b 1024 \
  --optimizer=adamw --lr=1e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --moco-t=.2 \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

Note that the smaller batch size: 1) facilitates stable training, as discussed in the paper; and 2) avoids inter-node communication cost, since training fits on a single node. We therefore highly recommend this setting for ViT-based explorations.
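
As a worked example of how the total batch size interacts with the learning rate, the snippet below assumes the common linear scaling rule (effective lr = base lr × total batch size / 256); whether main_moco.py applies this scaling internally should be confirmed in the script itself:

# Assumption: the linear learning-rate scaling rule
#   effective_lr = base_lr * total_batch_size / 256
# Check main_moco.py before relying on these numbers.
def effective_lr(base_lr, total_batch_size):
    return base_lr * total_batch_size / 256

print(effective_lr(1e-4, 1024))  # ViT-Small setting above: 4e-4
print(effective_lr(0.45, 4096))  # ResNet-50, 100-epoch setting: 7.2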

Linear Classification

With a pre-trained model, to train a supervised linear classifier on frozen features/weights on an 8-GPU node, run:

python main_lincls.py \
  -a [architecture] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/checkpoint_0xxx.pth.tar \
  [your imagenet-folder with train and val folders]

The above command uses the SGD optimizer with momentum and a default batch size of 1024.
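
To reuse the pre-trained encoder outside main_lincls.py (e.g., for feature extraction), a sketch along the following lines is typical. The checkpoint layout assumed here (a 'state_dict' entry whose keys carry a 'module.base_encoder.' prefix from the DDP-wrapped model) is an assumption; verify it against main_lincls.py and your own checkpoint:

# Hedged sketch: load MoCo v3 pre-trained weights into a torchvision ResNet-50.
# The key names ('state_dict', 'module.base_encoder.') are assumptions about how
# the DDP-wrapped model is saved; inspect the checkpoint to confirm them.
import torch
import torchvision.models as models

model = models.resnet50()
checkpoint = torch.load("[your checkpoint path]/checkpoint_0xxx.pth.tar",
                        map_location="cpu")
state_dict = checkpoint["state_dict"]

prefix = "module.base_encoder."
encoder_state = {k[len(prefix):]: v for k, v in state_dict.items()
                 if k.startswith(prefix) and not k.startswith(prefix + "fc")}
msg = model.load_state_dict(encoder_state, strict=False)
print(msg.missing_keys)  # ideally only the final fc layer is left uninitialized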

Reference Setups

For longer pre-trainings with ResNet-50, we find the following hyper-parameters work well:

epochs   learning rate   weight decay   momentum update   top-1 acc.
100      0.45            1e-6           0.99              -
300      0.3             1e-6           0.99              72.8
1000     0.3             1.5e-6         0.996             74.8

These hyper-parameters can be set with their respective command-line arguments. For example:

ResNet-50, 1000-Epoch, 2-Node.

On the first node, run:

python main_moco.py \
  --moco-m=0.996 --lr=.3 --wd=1.5e-6 --epochs=1000 \
  --dist-url "tcp://[your node 1 address]:[specified port]" \
  --multiprocessing-distributed --world-size 2 --rank 0 \
  [your imagenet-folder with train and val folders]

On the second node, run the same command as above, with --rank 1.

We also provide the reference linear classification performance in the last column of the table above (logs/pre-trained models will be updated soon).

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.