Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning
Grill et al., 2020
Summary
- Proposes Bootstrap Your Own Latent (BYOL), an approach for self-supervised visual representation learning
- Trains an online network to predict the target network's representation of the same image under a different augmented view
- Achieves SotA for self-supervised, semi-supervised, and transfer learning when trained on ImageNet, without the use of negative pairs
- More robust to changes in batch size and set of image augmentations compared to previous contrastive learning approaches
- Links: [ website ] [ pdf ]
Background
- Contrastive methods, which achieve SotA performance on the difficult problem of learning visual representations without human supervision, generally require careful treatment of negative pairs
- Unclear if using negative pairs is necessary
- Prior self-supervised work MoCo also uses a moving-average network, but to maintain consistent representations of negative pairs drawn from a memory bank
- Many successful self-supervised approaches build on the cross-view prediction framework, learning representations by predicting different views of the same image from one another
- Doing this prediction directly in representation space can lead to collapsed representations, e.g. a constant representation across all views
- Contrastive methods reframe this prediction problem as discrimination, but they require comparing against appropriate negative examples to keep the discrimination task challenging (a generic sketch of such a contrastive loss is shown below)
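To make the contrast with BYOL concrete, here is a minimal sketch of the kind of negative-pair loss these methods use (InfoNCE / NT-Xent, as in SimCLR). It is illustrative only; the function name and temperature value are my own choices, not from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE / NT-Xent loss over a batch of paired views.

    z1, z2: [N, D] projections of two augmented views of the same N images.
    Each (z1[i], z2[i]) is a positive pair; every other sample in the batch
    serves as a negative -- the negative-pair machinery that BYOL removes.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)      # [2N, D]
    sim = z @ z.t() / temperature       # [2N, 2N] scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))   # a sample is never its own negative
    n = z1.shape[0]
    # The positive for row i is the other view of the same image: i <-> i + n
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)
```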
Methods
- BYOL’s goal is to learn a (visual) representation that can be used for downstream tasks
- Uses two neural networks:
- Online network: defined by a set of weights $\theta$ and composed of an encoder, a projector, and a predictor
- Target network: same architecture as the online network, but defined by a different set of weights $\xi$, which are an exponential moving average of the online parameters: $\xi \leftarrow \tau \xi + (1 - \tau)\,\theta$ after each training step, with decay rate $\tau \in [0, 1]$
- The online network is optimized by minimizing the MSE between the $\ell_2$-normalized prediction and target projection, equivalent to $2 - 2 \cdot \frac{\langle q_\theta(z_\theta),\, z'_\xi \rangle}{\|q_\theta(z_\theta)\|_2 \, \|z'_\xi\|_2}$, where $q_\theta(z_\theta)$ is the online prediction and $z'_\xi$ the target projection (see the sketch after this list)
- The prediction is the final output of the online network (encoder → projector → predictor)
- The target projection is obtained from the projector of the target network, which has no predictor; gradients flow only through the online network (a stop-gradient is applied to the target)
- The networks are applied to two different augmented views of the same image, and the loss is symmetrized by swapping the views between the two networks
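A minimal sketch of one BYOL training step in PyTorch, under the assumptions above: `online` and `target` map an image batch to projections, `predictor` exists only on the online side, and the names (`byol_step`, `ema_update`, `regression_loss`) are hypothetical, not from the paper's code.

```python
import torch
import torch.nn.functional as F

def regression_loss(p, z):
    """Normalized MSE between online prediction p and target projection z;
    algebraically equal to 2 - 2 * cosine_similarity(p, z)."""
    p = F.normalize(p, dim=1)
    z = F.normalize(z, dim=1)
    return 2 - 2 * (p * z).sum(dim=1)

@torch.no_grad()
def ema_update(online, target, tau=0.996):
    """Target weights are an exponential moving average of the online
    weights: xi <- tau * xi + (1 - tau) * theta."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(tau).add_(p_o.data, alpha=1 - tau)

def byol_step(online, predictor, target, v1, v2, optimizer):
    """One update on two augmented views v1, v2 of the same image batch."""
    with torch.no_grad():               # stop-gradient: target yields no gradients
        t1, t2 = target(v1), target(v2)
    # Symmetrized loss: each view's projection is predicted from the other view
    loss = (regression_loss(predictor(online(v1)), t2)
            + regression_loss(predictor(online(v2)), t1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(online, target)          # target slowly trails the online network
    return loss.item()
```

The target can be initialized as a deep copy of the online network with gradients disabled; $\tau = 0.996$ matches the paper's base decay rate, which is annealed toward 1 over the course of training.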
Results
- With a standard ResNet-50 (1x), BYOL obtains 74.3% top-1 accuracy under linear evaluation on ImageNet (a sketch of the linear evaluation protocol follows this list)
- With a ResNet-50 (4x), BYOL is only 0.3% below the best supervised baseline for the same architecture
- On other classification datasets, BYOL outperforms Supervised-IN baseline on 7 out of 12 benchmarks
- BYOL outperforms the ImageNet supervised baseline when transferred to other vision tasks (e.g. VOC semantic segmentation, NYUv2 depth estimation)
- Since BYOL does not use negative examples, it is more robust to smaller batch sizes compared to SimCLR, with stable performance over batch sizes from 256 to 4096
- BYOL is also incentivized to keep all information captured by the target representation, which makes it more robust to the choice of image augmentations; however, there is still a significant drop in performance when augmentations are removed
- The target network is beneficial in itself for its stabilizing effect
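For reference, the linear evaluation protocol mentioned above freezes the pretrained encoder and fits only a linear classifier on its features. A minimal sketch, assuming a frozen `encoder` that maps images to `feat_dim`-dimensional vectors (names and hyperparameters are illustrative, not the paper's exact setup):

```python
import torch
import torch.nn as nn

def linear_eval(encoder, feat_dim, num_classes, loader, epochs=10, lr=0.1):
    """Fit a linear classifier on frozen encoder features."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)         # encoder stays fixed throughout
    clf = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(clf.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = encoder(images)  # frozen features
            loss = loss_fn(clf(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return clf
```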
Conclusion
- BYOL remains dependent on existing sets of augmentations that are specific to vision
- Automating the search for such augmentations would be an important step toward generalizing BYOL to other modalities
- The lack of negative examples makes this approach quite appealing