[Docs] update metafile path and algorithm readme format (#206)

* [Fix] fix model-index metafile path

* [Docs] update algorithm readme format
Yixiao Fang 2022-02-09 17:26:50 +08:00 committed by GitHub
parent cea50785e1
commit 41e999f69b
13 changed files with 19 additions and 25 deletions

@@ -1,13 +1,13 @@
# BYOL
-[Bootstrap your own latent: A new approach to self-supervised Learning](https://arxiv.org/abs/2006.07733)
+> [Bootstrap your own latent: A new approach to self-supervised Learning](https://arxiv.org/abs/2006.07733)
<!-- [ALGORITHM] -->
## Abstract
**B**ootstrap **Y**our **O**wn **L**atent (BYOL) is a new approach to self-supervised image representation learning. BYOL relies on two neural networks, referred to as online and target networks, that interact and learn from each other. From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view. At the same time, we update the target network with a slow-moving average of the online network.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149720208-5ffbee78-1437-44c7-9ddb-b8caab60d2c3.png" width="800" />
</div>
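
For intuition, the online/target interplay described in the abstract can be sketched roughly as follows (a minimal PyTorch-style sketch; the names `ema_update` and `byol_loss` are assumptions for illustration, not this repository's API):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target, online, momentum=0.99):
    # The target network is kept as a slow-moving average of the online network.
    for t, o in zip(target.parameters(), online.parameters()):
        t.data.mul_(momentum).add_(o.data, alpha=1 - momentum)

def byol_loss(online_prediction, target_projection):
    # The online branch predicts the target branch's representation of the other
    # view; gradients are stopped on the target side.
    p = F.normalize(online_prediction, dim=-1)
    z = F.normalize(target_projection.detach(), dim=-1)
    return 2 - 2 * (p * z).sum(dim=-1).mean()
```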

@@ -1,13 +1,13 @@
# DeepCluster
-[Deep Clustering for Unsupervised Learning of Visual Features](https://arxiv.org/abs/1807.05520)
+> [Deep Clustering for Unsupervised Learning of Visual Features](https://arxiv.org/abs/1807.05520)
<!-- [ALGORITHM] -->
## Abstract
Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149720586-5bfd213e-0638-47fc-b48a-a16689190e17.png" width="700" />
</div>
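
As a rough sketch of the loop described above, the k-means assignments become the pseudo-labels for the next round of weight updates (illustrative only; `cluster_pseudo_labels` is a hypothetical helper, not this repository's API):

```python
import torch
from sklearn.cluster import KMeans

def cluster_pseudo_labels(features, num_clusters=100):
    # features: (N, D) embeddings extracted with the current network.
    # The k-means assignments act as classification targets ("supervision")
    # for the subsequent network update.
    assignments = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(features.cpu().numpy())
    return torch.as_tensor(assignments, dtype=torch.long)
```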

@@ -1,6 +1,6 @@
# DenseCL
-[Dense Contrastive Learning for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2011.09157)
+> [Dense Contrastive Learning for Self-Supervised Visual Pre-Training](https://arxiv.org/abs/2011.09157)
<!-- [ALGORITHM] -->
@@ -8,7 +8,6 @@
To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149721111-bab03a6d-a30d-418e-b338-43c3689cfc65.png" width="900" />
</div>
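
The pixel-level contrastive idea above can be sketched as follows (a simplified, illustrative version: positives are matched by similarity between the two views and negatives come from a `queue` of earlier keys; the function and argument names are assumptions, not this repository's implementation):

```python
import torch
import torch.nn.functional as F

def dense_contrastive_loss(q_grid, k_grid, queue, temperature=0.2):
    # q_grid, k_grid: (B, C, H, W) dense projections of two augmented views.
    # queue: (K, C) negative features collected from previous batches.
    b, c, h, w = q_grid.shape
    q = F.normalize(q_grid.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    k = F.normalize(k_grid.flatten(2).transpose(1, 2), dim=-1)  # (B, HW, C)
    with torch.no_grad():
        # Each query location's positive is its most similar location in the other view.
        match = (q @ k.transpose(1, 2)).argmax(dim=-1)           # (B, HW)
    k_pos = torch.gather(k, 1, match.unsqueeze(-1).expand(-1, -1, c))
    l_pos = (q * k_pos).sum(-1, keepdim=True)                    # (B, HW, 1)
    l_neg = q @ F.normalize(queue, dim=-1).t()                   # (B, HW, K)
    logits = torch.cat([l_pos, l_neg], dim=-1).flatten(0, 1) / temperature
    labels = torch.zeros(logits.shape[0], dtype=torch.long)      # positives sit at index 0
    return F.cross_entropy(logits, labels)
```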

@@ -1,13 +1,13 @@
# MoCo v1
-[Momentum Contrast for Unsupervised Visual Representation Learning](https://arxiv.org/abs/1911.05722)
+> [Momentum Contrast for Unsupervised Visual Representation Learning](https://arxiv.org/abs/1911.05722)
<!-- [ALGORITHM] -->
## Abstract
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149719892-1b6928e1-37cb-4cee-b053-ff12e1aa43c0.png" width="400" />
</div>
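
The queue-plus-momentum-encoder dictionary described above can be sketched like this (illustrative only; the class and function names are assumptions, not this repository's API):

```python
import torch
import torch.nn.functional as F

class FeatureQueue:
    """A fixed-size dictionary of keys, refreshed first-in-first-out."""

    def __init__(self, dim=128, size=65536):
        self.keys = F.normalize(torch.randn(size, dim), dim=1)
        self.ptr = 0

    @torch.no_grad()
    def dequeue_and_enqueue(self, new_keys):
        n = new_keys.shape[0]  # assumes the queue size is divisible by the batch size
        self.keys[self.ptr:self.ptr + n] = new_keys
        self.ptr = (self.ptr + n) % self.keys.shape[0]

def moco_loss(q, k, queue_keys, temperature=0.07):
    # q: queries from the online encoder, k: keys from the momentum encoder.
    l_pos = (q * k).sum(dim=1, keepdim=True)        # (N, 1) positive logits
    l_neg = q @ queue_keys.t()                      # (N, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.shape[0], dtype=torch.long)
    return F.cross_entropy(logits, labels)
```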

@@ -1,13 +1,13 @@
# MoCo v2
-[Improved Baselines with Momentum Contrastive Learning](https://arxiv.org/abs/2003.04297)
+> [Improved Baselines with Momentum Contrastive Learning](https://arxiv.org/abs/2003.04297)
<!-- [ALGORITHM] -->
## Abstract
Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR's design improvements by implementing them in the MoCo framework. With simple modifications to MoCo—namely, using an MLP projection head and more data augmentation—we establish stronger baselines that outperform SimCLR and do not require large training batches. We hope this will make state-of-the-art unsupervised learning research more accessible.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149720067-b65e5736-d425-45b3-93ed-6f2427fc6217.png" width="500" />
</div>
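
The two modifications mentioned above amount to swapping in an MLP projection head and a stronger augmentation recipe, roughly like this (a torchvision-style sketch with assumed hyper-parameters, not this repository's exact config):

```python
import torch.nn as nn
from torchvision import transforms

# 2-layer MLP projection head in place of MoCo v1's single linear layer.
projection_head = nn.Sequential(
    nn.Linear(2048, 2048), nn.ReLU(inplace=True), nn.Linear(2048, 128))

# Stronger augmentation, including color jitter and Gaussian blur.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```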

@@ -1,6 +1,6 @@
# NPID
-[Unsupervised Feature Learning via Non-Parametric Instance Discrimination](https://arxiv.org/abs/1805.01978)
+> [Unsupervised Feature Learning via Non-Parametric Instance Discrimination](https://arxiv.org/abs/1805.01978)
<!-- [ALGORITHM] -->
@@ -12,7 +12,6 @@ We formulate this intuition as a non-parametric classification problem at the in
Our method is also remarkable for consistently improving test performance with more training data and better network architectures. By fine-tuning the learned feature, we further obtain competitive results for semi-supervised learning and object detection tasks. Our non-parametric model is highly compact: With 128 features per image, our method requires only 600MB storage for a million images, enabling fast nearest neighbour retrieval at the run time.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149722257-1651c283-ac68-4cdc-90e6-970d820529af.png" width="800" />
</div>
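
Treating every image as its own class, the non-parametric classifier above scores a batch against a memory bank of per-image features (an illustrative sketch; the names are assumptions). With 128 floats per image, a million images need about 1,000,000 × 128 × 4 bytes ≈ 512 MB, the same order as the 600MB figure quoted in the abstract:

```python
import torch

def instance_logits(features, memory_bank, temperature=0.07):
    # features: (B, 128) L2-normalized embeddings of the current batch.
    # memory_bank: (N, 128) one stored feature per training image; each image
    # acts as its own class, so the logits have N columns.
    return features @ memory_bank.t() / temperature
```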

@@ -1,6 +1,6 @@
# ODC
-[Online Deep Clustering for Unsupervised Representation Learning](https://arxiv.org/abs/2006.10645)
+> [Online Deep Clustering for Unsupervised Representation Learning](https://arxiv.org/abs/2006.10645)
<!-- [ALGORITHM] -->
@@ -8,7 +8,6 @@
Joint clustering and feature learning methods have shown remarkable performance in unsupervised representation learning. However, the training schedule alternating between feature clustering and network parameters update leads to unstable learning of visual representations. To overcome this challenge, we propose Online Deep Clustering (ODC) that performs clustering and network update simultaneously rather than alternatingly. Our key insight is that the cluster centroids should evolve steadily in keeping the classifier stably updated. Specifically, we design and maintain two dynamic memory modules, i.e., samples memory to store samples' labels and features, and centroids memory for centroids evolution. We break down the abrupt global clustering into steady memory update and batch-wise label re-assignment. The process is integrated into network update iterations. In this way, labels and the network evolve shoulder-to-shoulder rather than alternatingly. Extensive experiments demonstrate that ODC stabilizes the training process and boosts the performance effectively.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149722645-8da8e5b2-8846-4554-aa3e-727d286b85cd.png" width="700" />
</div>
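
The two memory modules described above can be sketched as follows (an illustrative structure with assumed names; the actual method additionally re-computes centroids and handles small clusters):

```python
import torch
import torch.nn.functional as F

class ODCMemory:
    def __init__(self, num_samples, feat_dim, num_classes, momentum=0.5):
        # Samples memory: one feature and one pseudo-label per training sample.
        self.features = F.normalize(torch.randn(num_samples, feat_dim), dim=1)
        self.labels = torch.randint(0, num_classes, (num_samples,))
        # Centroids memory: one evolving centroid per pseudo-class.
        self.centroids = F.normalize(torch.randn(num_classes, feat_dim), dim=1)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, idx, feats):
        # Steady memory update followed by batch-wise label re-assignment.
        feats = F.normalize(feats, dim=1)
        self.features[idx] = F.normalize(
            (1 - self.momentum) * self.features[idx] + self.momentum * feats, dim=1)
        self.labels[idx] = (self.features[idx] @ self.centroids.t()).argmax(dim=1)
```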

@@ -1,6 +1,6 @@
# Relative Location
-[Unsupervised Visual Representation Learning by Context Prediction](https://arxiv.org/abs/1505.05192)
+> [Unsupervised Visual Representation Learning by Context Prediction](https://arxiv.org/abs/1505.05192)
<!-- [ALGORITHM] -->
@@ -8,7 +8,6 @@
This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts. We demonstrate that the feature representation learned using this within-image context indeed captures visual similarity across images. For example, this representation allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset. Furthermore, we show that the learned ConvNet can be used in the RCNN framework and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149723222-76bc89e8-98bf-4ed7-b179-dfe5bc6336ba.png" width="400" />
</div>
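
The pretext task above reduces to sampling a patch together with one of its eight neighbours and predicting which neighbour it is (an illustrative sketch; `sample_patch_pair` and its defaults are assumptions, and the image must be at least roughly 3 × (patch + gap) pixels per side):

```python
import random
import torch

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(image, patch=96, gap=16):
    # image: (C, H, W) tensor. Returns a centre patch, one of its 8 neighbours,
    # and the 8-way position label the network must predict.
    step = patch + gap
    cy = random.randint(1, image.shape[1] // step - 2)
    cx = random.randint(1, image.shape[2] // step - 2)
    label = random.randrange(8)
    dy, dx = OFFSETS[label]
    crop = lambda y, x: image[:, y * step:y * step + patch, x * step:x * step + patch]
    return crop(cy, cx), crop(cy + dy, cx + dx), torch.tensor(label)
```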

@@ -1,6 +1,6 @@
# Rotation Prediction
-[Unsupervised Representation Learning by Predicting Image Rotation](https://arxiv.org/abs/1803.07728)
+> [Unsupervised Representation Learning by Predicting Image Rotation](https://arxiv.org/abs/1803.07728)
<!-- [ALGORITHM] -->
@@ -8,7 +8,6 @@
Over the last years, deep convolutional neural networks (ConvNets) have transformed the field of computer vision thanks to their unparalleled capacity to learn high level semantic image features. However, in order to successfully learn those features, they usually require massive amounts of manually labeled data, which is both expensive and impractical to scale. Therefore, unsupervised semantic feature learning, i.e., learning without requiring manual annotation effort, is of crucial importance in order to successfully harvest the vast amount of visual data that are available today. In our work we propose to learn image features by training ConvNets to recognize the 2d rotation that is applied to the image that it gets as input. We demonstrate both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. We exhaustively evaluate our method in various unsupervised feature learning benchmarks and we exhibit in all of them state-of-the-art performance. Specifically, our results on those benchmarks demonstrate dramatic improvements w.r.t. prior state-of-the-art approaches in unsupervised representation learning and thus significantly close the gap with supervised feature learning.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149723477-8f63e237-362e-4962-b405-9bab0f579808.png" width="700" />
</div>
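
The supervisory signal above is generated for free by rotating each image and keeping the rotation index as the label (a minimal sketch; the helper name is an assumption):

```python
import torch

def make_rotation_batch(images):
    # images: (B, C, H, W). Returns 4*B images rotated by 0/90/180/270 degrees
    # together with the corresponding 4-way rotation labels.
    rotated = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    labels = torch.arange(4).repeat_interleave(images.shape[0])
    return torch.cat(rotated, dim=0), labels
```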

@@ -1,6 +1,6 @@
# SimCLR
-[A Simple Framework for Contrastive Learning of Visual Representations](https://arxiv.org/abs/2002.05709)
+> [A Simple Framework for Contrastive Learning of Visual Representations](https://arxiv.org/abs/2002.05709)
<!-- [ALGORITHM] -->
@@ -8,7 +8,6 @@
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149723851-cf5f309e-d891-454d-90c0-e5337e5a11ed.png" width="400" />
</div>
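
The contrastive objective over two augmented views can be sketched as an NT-Xent-style loss (simplified and illustrative; in the paper it is combined with the augmentation composition and learnable projection head discussed above):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    # z1, z2: (B, D) projections of two augmented views of the same B images.
    z = F.normalize(torch.cat([z1, z2]), dim=1)      # (2B, D)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))                # a sample is never its own positive
    b = z1.shape[0]
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)
```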

@@ -1,6 +1,6 @@
# SimSiam
-[Exploring Simple Siamese Representation Learning](https://arxiv.org/abs/2011.10566)
+> [Exploring Simple Siamese Representation Learning](https://arxiv.org/abs/2011.10566)
<!-- [ALGORITHM] -->
@@ -8,7 +8,6 @@
Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. These models maximize the similarity between two augmentations of one image, subject to certain conditions for avoiding collapsing solutions. In this paper, we report surprising empirical results that simple Siamese networks can learn meaningful representations even using none of the following: (i) negative sample pairs, (ii) large batches, (iii) momentum encoders. Our experiments show that collapsing solutions do exist for the loss and structure, but a stop-gradient operation plays an essential role in preventing collapsing. We provide a hypothesis on the implication of stop-gradient, and further show proof-of-concept experiments verifying it. Our “SimSiam” method achieves competitive results on ImageNet and downstream tasks. We hope this simple baseline will motivate people to rethink the roles of Siamese architectures for unsupervised representation learning.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149724180-bc7bac6a-fcb8-421e-b8f1-9550c624d154.png" width="500" />
</div>
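
The role of the stop-gradient highlighted above shows up directly in the loss: each predictor output is compared against a detached projection of the other view (a minimal sketch with assumed names):

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    # p1, p2: predictor outputs; z1, z2: projector outputs of the two views.
    def d(p, z):
        # z.detach() is the stop-gradient that prevents collapsing solutions.
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)
```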

@@ -1,6 +1,6 @@
# SwAV
-[Unsupervised Learning of Visual Features by Contrasting Cluster Assignments](https://arxiv.org/abs/2006.09882)
+> [Unsupervised Learning of Visual Features by Contrasting Cluster Assignments](https://arxiv.org/abs/2006.09882)
<!-- [ALGORITHM] -->
@@ -8,7 +8,6 @@
Unsupervised image representations have significantly reduced the gap with supervised pretraining, notably with the recent achievements of contrastive learning methods. These contrastive methods typically work online and rely on a large number of explicit pairwise feature comparisons, which is computationally challenging. In this paper, we propose an online algorithm, SwAV, that takes advantage of contrastive methods without requiring to compute pairwise comparisons. Specifically, our method simultaneously clusters the data while enforcing consistency between cluster assignments produced for different augmentations (or “views”) of the same image, instead of comparing features directly as in contrastive learning. Simply put, we use a “swapped” prediction mechanism where we predict the code of a view from the representation of another view. Our method can be trained with large and small batches and can scale to unlimited amounts of data. Compared to previous contrastive methods, our method is more memory efficient since it does not require a large memory bank or a special momentum network. In addition, we also propose a new data augmentation strategy, multi-crop, that uses a mix of views with different resolutions in place of two full-resolution views, without increasing the memory or compute requirements.
<!-- [IMAGE] -->
<div align="center">
<img src="https://user-images.githubusercontent.com/36138628/149724517-9f1e7bdf-04c7-43e3-92f4-2b8fc1399123.png" width="500" />
</div>
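
The "swapped" prediction above can be sketched as each view predicting the other view's cluster code (heavily simplified: a sharpened softmax stands in for the Sinkhorn-Knopp step that computes the codes, multi-crop is omitted, and all names are assumptions):

```python
import torch
import torch.nn.functional as F

def swapped_prediction_loss(z1, z2, prototypes, temperature=0.1):
    # z1, z2: (B, D) L2-normalized features; prototypes: (K, D) cluster centers.
    scores1 = z1 @ F.normalize(prototypes, dim=1).t()
    scores2 = z2 @ F.normalize(prototypes, dim=1).t()
    with torch.no_grad():
        # Stand-in for the Sinkhorn-Knopp step that produces the codes.
        q1 = F.softmax(scores1 / 0.05, dim=1)
        q2 = F.softmax(scores2 / 0.05, dim=1)
    # Each view's scores must predict the other view's code.
    loss = -(q2 * F.log_softmax(scores1 / temperature, dim=1)).sum(1).mean() \
           - (q1 * F.log_softmax(scores2 / temperature, dim=1)).sum(1).mean()
    return loss / 2
```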

@@ -2,7 +2,9 @@ Import:
- configs/selfsup/byol/metafile.yml
- configs/selfsup/deepcluster/metafile.yml
- configs/selfsup/densecl/metafile.yml
-- configs/selfsup/moco/metafile.yml
+- configs/selfsup/mocov1/metafile.yml
+- configs/selfsup/mocov2/metafile.yml
+- configs/selfsup/mocov3/metafile.yml
- configs/selfsup/npid/metafile.yml
- configs/selfsup/odc/metafile.yml
- configs/selfsup/relative_loc/metafile.yml