Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning

Grill et al., 2020


Summary

  • Proposes Bootstrap Your Own Latent (BYOL), an approach for self-supervised visual representation learning
  • Trains an online network to predict a target network's representation of the same image under a different augmented view
  • Achieves SotA for self-supervised, semi-supervised, and transfer learning when trained on ImageNet, without the use of negative pairs
  • More robust to changes in batch size and in the set of image augmentations than previous contrastive learning approaches

Background

  • Contrastive methods, which achieve SotA performance on the difficult problem of learning visual representations without human supervision, generally require careful treatment of negative pairs
    • Unclear if using negative pairs is necessary
  • Prior self-supervised learning work, MoCo, also uses a moving-average network, but to maintain consistent representations of negative pairs drawn from a memory bank
  • Many successful self-supervised approaches build off of the cross-view prediction framework, generally learning representations by predicting different views of the same image from one another
    • Doing this prediction directly in representation space can lead to collapsed representations, e.g. a constant representation across all views
    • Contrastive methods reframe this prediction problem as discrimination, but require comparing with appropriate negative examples that make the discrimination task challenging
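
To make the collapse problem concrete, here is a toy sketch (not from the paper; the encoder, the view values, and all names are hypothetical). It shows that an encoder that ignores its input achieves zero cross-view prediction error while carrying no information about the image:

```python
# Toy illustration: a constant ("collapsed") encoder trivially solves
# cross-view prediction in representation space.

def constant_encoder(view):
    """A collapsed encoder: returns the same vector for every input."""
    return [1.0, 0.0, 0.0]

def squared_error(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two hypothetical augmented views for each of two different images.
views = {
    "cat": ([0.2, 0.9], [0.3, 0.8]),
    "car": ([0.7, 0.1], [0.6, 0.2]),
}

# Error when the representation of view 1 is used to predict view 2.
losses = {
    name: squared_error(constant_encoder(v1), constant_encoder(v2))
    for name, (v1, v2) in views.items()
}

# The loss is zero for every image, yet different images are indistinguishable.
collapsed = constant_encoder(views["cat"][0]) == constant_encoder(views["car"][0])
```

Contrastive methods rule out this solution by penalizing similarity to negative examples. BYOL has no negatives, so it has to avoid the same solution through the online/target design described under Methods.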

Methods

  • BYOL’s goal is to learn a (visual) representation that can be used for downstream tasks
  • Uses two neural networks:
    • Online network: defined by a set of weights $\theta$ and composed of an encoder, a projector, and a predictor
    • Target network: same architecture as the online network, but defined by a different set of weights $\xi$, which are an exponential moving average of the online network parameters $\theta$
  • The online network is optimized by minimizing the MSE between the normalized predictions and target projections
    • The prediction is obtained from the final output of the online network
    • The target projection is obtained from the projector of the target network
    • The networks are applied to two different augmented views of the same image
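
The two core computations above can be sketched as follows. This is a minimal, framework-free illustration rather than the authors' implementation: the function names are hypothetical, parameters are represented as flat lists of floats, and the decay rate `tau` is an illustrative value. The squared error between L2-normalized vectors equals 2 minus twice their cosine similarity, so the loss depends only on the direction of the vectors.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]

def byol_loss(online_prediction, target_projection):
    """Squared error between the normalized prediction and target projection.

    In training, the target projection is treated as a constant
    (no gradient flows into the target network).
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return sum((a - b) ** 2 for a, b in zip(p, z))

def ema_update(target_params, online_params, tau=0.99):
    """Moving-average update: xi <- tau * xi + (1 - tau) * theta."""
    return [tau * xi + (1.0 - tau) * th
            for xi, th in zip(target_params, online_params)]

# Usage with made-up vectors:
loss_same = byol_loss([1.0, 0.0], [2.0, 0.0])   # same direction, close to 0
loss_orth = byol_loss([1.0, 0.0], [0.0, 3.0])   # orthogonal, close to 2
new_target = ema_update([0.0, 1.0], [1.0, 1.0], tau=0.9)  # moves 10% toward online
```

In the paper the loss is also symmetrized by swapping the two views and summing both terms, and only the online weights $\theta$ are updated by gradient descent; the target weights $\xi$ change only through the moving-average step.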

Results

  • With a standard ResNet-50 (1x), BYOL obtains 74.3% top-1 accuracy for linear evaluation on ImageNet
    • With ResNet-50 (4x), BYOL is only 0.3 percentage points below the best supervised baseline for the same architecture
  • On other classification datasets, BYOL outperforms the Supervised-IN (ImageNet-supervised) baseline on 7 out of 12 benchmarks
  • BYOL outperforms the ImageNet supervised baseline on transfer to other vision tasks on different datasets (e.g. VOC semantic segmentation, NYUv2 depth estimation)
  • Since BYOL does not use negative examples, it is more robust to smaller batch sizes compared to SimCLR, with stable performance over batch sizes from 256 to 4096
  • BYOL is also incentivized to keep all information captured by the target representation, which makes it more robust to the choice of image augmentations
    • Performance still drops significantly when augmentations are removed
  • The target network is beneficial in its own right because of its stabilizing effect

Conclusion

  • BYOL remains dependent on existing augmentations that are specific to vision
    • Automating search for augmentations for other modalities would be important for generalizing BYOL
  • The lack of negative examples makes this approach quite appealing
Elias Z. Wang
AI Researcher | PhD Candidate