Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
Grill et al., 2020
Summary
- Proposes Bootstrap Your Own Latent (BYOL), an approach for self-supervised visual representation learning
- Trains an online network to predict the target network's representation of the same image under a different augmented view
- Achieves SotA for self-supervised, semi-supervised, and transfer learning when trained on ImageNet, without the use of negative pairs
- More robust to changes in batch size and set of image augmentations compared to previous contrastive learning approaches
- Links: [ website ] [ pdf ]
Background
- Contrastive methods, which achieve SotA performance on the difficult problem of learning visual representations without human supervision, generally require careful treatment of negative pairs
- Unclear if using negative pairs is necessary
- Prior self-supervised work MoCo also uses a moving-average network, but to maintain consistent representations of negative pairs drawn from a memory bank
- Many successful self-supervised approaches build on the cross-view prediction framework, learning representations by predicting different views of the same image from one another
- Doing this prediction directly in representation space can lead to collapsed representations, e.g. a constant representation across all views
- Contrastive methods reframe this prediction problem as discrimination, but they require comparing against appropriate negative examples to keep the discrimination task challenging (a generic sketch of such a contrastive loss is shown below)
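To make the contrast with BYOL concrete, here is a minimal sketch of the kind of negative-pair loss these methods use (InfoNCE / NT-Xent, as in SimCLR). It is illustrative only; the function name and temperature value are my own choices, not from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE / NT-Xent loss over a batch of paired views.

    z1, z2: [N, D] projections of two augmented views of the same N images.
    Each (z1[i], z2[i]) is a positive pair; every other sample in the batch
    serves as a negative -- the negative-pair machinery that BYOL removes.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)      # [2N, D]
    sim = z @ z.t() / temperature       # [2N, 2N] scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))   # a sample is never its own negative
    n = z1.shape[0]
    # The positive for row i is the other view of the same image: i <-> i + n
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```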
Methods
- BYOL’s goal is to learn a (visual) representation that can be used for downstream tasks
- Uses two neural networks:
- Online network: defined by a set of weights $\theta$ and composed of an encoder, a projector, and a predictor
- Target network: same architecture as the online network, but defined by a different set of weights $\xi$, which are an exponential moving average of the online parameters: $\xi \leftarrow \tau \xi + (1 - \tau)\,\theta$ after each training step, with decay rate $\tau \in [0, 1]$
- The online network is optimized by minimizing the MSE between the $\ell_2$-normalized prediction and target projection, equivalent to $2 - 2 \cdot \frac{\langle q_\theta(z_\theta),\, z'_\xi \rangle}{\|q_\theta(z_\theta)\|_2 \, \|z'_\xi\|_2}$, where $q_\theta(z_\theta)$ is the online prediction and $z'_\xi$ the target projection (see the sketch after this list)
- The prediction is the final output of the online network (encoder → projector → predictor)
- The target projection is obtained from the projector of the target network, which has no predictor; gradients flow only through the online network (a stop-gradient is applied to the target)
- The networks are applied to two different augmented views of the same image, and the loss is symmetrized by swapping the views between the two networks
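A minimal sketch of one BYOL training step in PyTorch, under the assumptions above: `online` and `target` map an image batch to projections, `predictor` exists only on the online side, and the names (`byol_step`, `ema_update`, `regression_loss`) are hypothetical, not from the paper's code.

```python
import torch
import torch.nn.functional as F

def regression_loss(p, z):
    """Normalized MSE between online prediction p and target projection z;
    algebraically equal to 2 - 2 * cosine_similarity(p, z)."""
    p = F.normalize(p, dim=1)
    z = F.normalize(z, dim=1)
    return 2 - 2 * (p * z).sum(dim=1)

@torch.no_grad()
def ema_update(online, target, tau=0.996):
    """Target weights are an exponential moving average of the online
    weights: xi <- tau * xi + (1 - tau) * theta."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(tau).add_(p_o.data, alpha=1 - tau)

def byol_step(online, predictor, target, v1, v2, optimizer):
    """One update on two augmented views v1, v2 of the same image batch."""
    with torch.no_grad():               # stop-gradient: target yields no gradients
        t1, t2 = target(v1), target(v2)
    # Symmetrized loss: each view's projection is predicted from the other view
    loss = (regression_loss(predictor(online(v1)), t2)
            + regression_loss(predictor(online(v2)), t1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(online, target)          # target slowly trails the online network
    return loss.item()
```

The target can be initialized as a deep copy of the online network with gradients disabled; $\tau = 0.996$ matches the paper's base decay rate, which is annealed toward 1 over the course of training.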
Results
- With a standard ResNet-50 (1x), BYOL obtains 74.3% top-1 accuracy under linear evaluation on ImageNet (a sketch of the linear evaluation protocol follows this list)
- With a ResNet-50 (4x), BYOL is only 0.3% below the best supervised baseline for the same architecture
- On other classification datasets, BYOL outperforms Supervised-IN baseline on 7 out of 12 benchmarks
- BYOL outperforms the ImageNet supervised baseline when transferred to other vision tasks (e.g. VOC semantic segmentation, NYUv2 depth estimation)
- Since BYOL does not use negative examples, it is more robust to smaller batch sizes compared to SimCLR, with stable performance over batch sizes from 256 to 4096
- BYOL is also incentivized to keep all information captured by the target representation, which makes it more robust to the choice of image augmentations; however, there is still a significant drop in performance when augmentations are removed
- The target network is beneficial in itself for its stabilizing effect
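For reference, the linear evaluation protocol mentioned above freezes the pretrained encoder and fits only a linear classifier on its features. A minimal sketch, assuming a frozen `encoder` that maps images to `feat_dim`-dimensional vectors (names and hyperparameters are illustrative, not the paper's exact setup):

```python
import torch
import torch.nn as nn

def linear_eval(encoder, feat_dim, num_classes, loader, epochs=10, lr=0.1):
    """Fit a linear classifier on frozen encoder features."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)         # encoder stays fixed throughout
    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = encoder(images)  # frozen features
            loss = loss_fn(clf(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```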
Conclusion
- BYOL remains dependent on existing sets of augmentations that are specific to vision
- Automating the search for such augmentations would be an important step toward generalizing BYOL to other modalities
- The lack of negative examples makes this approach quite appealing