A Simple Framework for Contrastive Learning of Visual Representations

Chen et al., 2020

Source: https://ai.googleblog.com/2020/04/advancing-self-supervised-and-semi.html

Summary

  • Proposes a novel framework, SimCLR, for self-supervised learning of visual representations
  • Combines data augmentation, learnable nonlinear projection head, larger batch sizes, and longer training for contrastive learning
  • Linear evaluation of the learned self-supervised representations from a wider network, ResNet-50 (4x), matches a supervised ResNet-50

Background

  • Learning visual representations without human supervision is a long-standing problem in computer vision
  • In general, approaches fall into two classes:
    • Generative: learn to model or generate pixels in the input space; this is computationally expensive, and pixel-level modeling may not be necessary for representation learning
    • Discriminative: learn “pretext tasks”, where both inputs and labels are derived from raw data
  • Recent SotA self-supervised methods are based on contrastive learning, but use specialized architectures or a memory bank

Methods

  • Learns representations by maximizing similarity between representations of augmented views of the same data sample
  • Four major components:
    • Stochastic data augmentation: generates two correlated views of an example, considered a positive pair, using random crop (with resize and flip), color distortions, and Gaussian blur
    • Neural network base encoder: extracts representations from augmented examples, ResNet-50 in this work
    • Small neural network projection head: a simple nonlinear function on top of the base encoder’s representations, a 2-layer MLP with 128-D output in this work
    • Contrastive loss function: normalized temperature-scaled cross-entropy loss (NT-Xent), which treats the other augmented examples in the minibatch as negative examples (a minimal sketch follows this list)
  • Uses the linear evaluation protocol, where a linear classifier is trained on top of the frozen learned representations on ImageNet (also sketched below)
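
For concreteness, here is a minimal PyTorch sketch of the projection head and NT-Xent loss described above; the batch layout (the two augmented views of each image concatenated along the batch dimension) and the default temperature are assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """2-layer MLP projection head g(.) mapping encoder features to 128-D."""
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return self.net(h)

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent over 2N projections, where z[i] and z[i+N] are the two
    augmented views of the same image (a positive pair); all other
    examples in the minibatch serve as negatives."""
    z = F.normalize(z, dim=1)              # l2-normalize projections
    n = z.shape[0] // 2
    sim = z @ z.t() / temperature          # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))      # an example is never its own negative
    targets = (torch.arange(2 * n, device=z.device) + n) % (2 * n)
    return F.cross_entropy(sim, targets)   # softmax over each row
```

Because the similarities pass through a softmax, the temperature rescales how strongly hard negatives dominate the gradient, which is the weighting-by-hardness effect noted under Results.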
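And a rough sketch of the linear evaluation protocol under the same assumptions; `encoder` is a hypothetical frozen, pretrained ResNet-50 backbone, and the optimizer settings are placeholders rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def linear_eval_step(encoder, classifier, optimizer, images, labels):
    """One training step of linear evaluation: the pretrained encoder
    stays frozen and only the linear classifier on top is updated."""
    encoder.eval()
    with torch.no_grad():
        h = encoder(images)                # frozen representations
    loss = F.cross_entropy(classifier(h), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Typical setup (hypothetical values):
# classifier = nn.Linear(2048, 1000)       # ImageNet's 1000 classes
# optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1, momentum=0.9)
```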

Results

  • Data Augmentation for Defining Tasks
    • Spatial/geometric transformation: cropping & resizing (with horizontal flip), rotation, cutout
    • Appearance transformation: color distortion (color dropping, brightness, contrast, saturation, hue), Gaussian blur, Sobel filtering, Gaussian noise
    • Compare individual and pairs of augmentations
      • No single transformation is sufficient
      • The combination of random cropping and color distortion performs significantly better than any other pair (a sketch of this pipeline follows the Results list)
    • Varying the strength of color distortion shows that contrastive learning benefits from stronger color distortion than supervised learning does
  • Base Encoder & Projection Head Architectures
    • Increasing depth and width of encoder both improve performance, but unsupervised contrastive learning benefits more than supervised
    • A nonlinear projection head is slightly better than a linear projection head, and significantly better than no projection head, regardless of output dimension
      • The hidden layer before the projection head is a better representation than the layer after it, indicating that the contrastive loss can discard information (e.g. color or orientation)
  • Loss Function & Batch Size
    • Compared to logistic loss and margin loss, NT-Xent weights different examples based on their hardness, as a result of $\ell_2$ normalization and the temperature
      • While (semi-hard) negative mining helps, NT-Xent still performs best
      • Temperature hyperparameter tuning is crucial to performance
    • When number of training epochs is small, larger batch sizes help significantly since they provide more negative examples
  • Comparison with SotA
    • Linear Evaluation: Beats previous SotA, and matches supervised ResNet-50 with the wider ResNet-50 (4x) at 76.5% top-1 accuracy
    • Semi-supervised Learning: SotA with both 1% and 10% of labels; fine-tuning the pretrained network on full ImageNet does ~2% better than training from scratch
    • Transfer Learning: After fine-tuning, self-supervised model outperforms supervised baseline in 5/12 datasets, and underperforms in 2/12
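
As referenced above, here is a sketch of the winning augmentation pipeline using torchvision; the jitter strengths follow the paper's color-distortion recipe with strength `s`, but the exact kernel size and probabilities should be treated as approximate.

```python
import torchvision.transforms as T

s = 1.0  # color-distortion strength (the paper ablates this)
color_jitter = T.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)

simclr_augment = T.Compose([
    T.RandomResizedCrop(224),                                 # random crop + resize
    T.RandomHorizontalFlip(),
    T.RandomApply([color_jitter], p=0.8),                     # color distortion
    T.RandomGrayscale(p=0.2),                                 # color dropping
    T.RandomApply([T.GaussianBlur(23, (0.1, 2.0))], p=0.5),   # blur, kernel ~10% of image
    T.ToTensor(),
])

# Each image is augmented twice, independently, to form a positive pair:
# view_1, view_2 = simclr_augment(img), simclr_augment(img)
```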

Conclusion

  • The performance of the framework is a result of the combination of design choices; no individual component is new
  • Simple framework, although it relies on large batch sizes
  • Transfer learning results are very encouraging
  • Final data augmentation combination of random crop, color distortion, and Gaussian blur not clearly motivated
Elias Z. Wang
AI Researcher | PhD Candidate