A Simple Framework for Contrastive Learning of Visual Representations
Chen et al., 2020
Summary
- Proposes novel framework, SimCLR, for self-supervised learning of visual representations
- Combines data augmentation, learnable nonlinear projection head, larger batch sizes, and longer training for contrastive learning
- Linear evaluation of self-supervised representations learned by a larger network, ResNet-50 (4x), matches supervised ResNet-50
- Links: [ website ] [ pdf ]
Background
- Learning visual representations without human supervision is a long-standing problem in computer vision
- In general, approaches fall into two classes:
- Generative: learn to model pixels in the input space, computationally expensive and may not be necessary for representation learning
- Discriminative: learn “pretext tasks”, where both inputs and labels are derived from raw data
- Recent SotA self-supervised methods are based on contrastive learning, but use specialized architectures or a memory bank
Methods
- Learns representations by maximizing similarity between representations of augmented views of the same data sample
- Four major components:
- Stochastic data augmentation: generates two correlated views of an example, considered a positive pair, using random crop (with resize and flip), color distortions, and Gaussian blur
- Neural network base encoder: extracts representations from augmented examples, ResNet-50 in this work
- Small neural network projection head: simple nonlinear function on top of base encoder’s representations, 2-layer MLP with 128-D output
- Contrastive loss function: normalized temperature-scaled cross entropy loss (NT-Xent), treat other augmented examples in minibatch as negative examples
- Use linear evaluation protocol, where a linear classifier is trained on top of the learned representations on ImageNet
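The NT-Xent loss over a minibatch of N positive pairs can be sketched in NumPy (a simplified sketch of the loss as described, applied to projection head outputs z; batch size and dimensions below are illustrative):

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent: for each view, its augmented partner is the positive and
    the other 2(N-1) views in the batch serve as negatives."""
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # l2-normalize -> cosine similarity
    sim = (z @ z.T) / temperature                      # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # partner indices: i <-> i+n
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
loss_aligned = nt_xent_loss(z1, z1 + 0.01 * rng.normal(size=(8, 32)))  # near-identical views
loss_random = nt_xent_loss(z1, rng.normal(size=(8, 32)))               # unrelated "views"
```

As expected, the loss is much lower when the two views of each example actually agree than when they are unrelated.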
Results
- Data Augmentation for Defining Tasks
- Spatial/geometric transformation: cropping & resizing (with horizontal flip), rotation, cutout
- Appearance transformation: color distortion (color dropping, brightness, contrast, saturation, hue), Gaussian blur, Sobel filtering, Gaussian noise
- Compare individual and pairs of augmentations
- No single transformation is sufficient
- The combination of random cropping and color distortion performs significantly better than any other pair
- Adjusting strength of color distortion shows that contrastive learning benefits from stronger color distortion compared to supervised learning
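As a toy illustration of the stochastic cropping that defines positive pairs, here is a minimal NumPy sketch (nearest-neighbor resize; the paper's actual pipeline also applies flips, color distortion, and Gaussian blur, and the 20-100% area range here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_resize(img, out_size=32):
    """Take a random crop covering 20-100% of the image area and resize it
    back, so two calls on the same image yield two correlated views."""
    h, w = img.shape[:2]
    frac = np.sqrt(rng.uniform(0.2, 1.0))          # side fraction from area fraction
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    # nearest-neighbor resize back to a fixed output size
    ys = np.arange(out_size) * ch // out_size
    xs = np.arange(out_size) * cw // out_size
    return crop[np.ix_(ys, xs)]

img = rng.random((64, 64, 3))
view1, view2 = random_crop_resize(img), random_crop_resize(img)  # a positive pair
```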
- Base Encoder & Projection Head Architectures
- Increasing depth and width of encoder both improve performance, but unsupervised contrastive learning benefits more than supervised
- Nonlinear projection head is slightly better than linear projection head, and significantly better than no projection head, regardless of output dimension
- Hidden layer before projection head is a better representation than the layer after, indicating a loss of information from contrastive loss (e.g. color or orientation)
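The 2-layer nonlinear head, and the point that the pre-head representation h is what gets kept for evaluation, can be sketched as follows (weight initialization is illustrative; 2048 is the ResNet-50 output dimension):

```python
import numpy as np

rng = np.random.default_rng(0)

def projection_head(h, w1, w2):
    """Nonlinear head z = g(h) = W2 ReLU(W1 h): the contrastive loss sees z,
    while downstream tasks use h, the layer before the head."""
    return np.maximum(h @ w1, 0.0) @ w2

in_dim, hidden, out_dim = 2048, 2048, 128
w1 = rng.normal(0, 0.01, (in_dim, hidden))
w2 = rng.normal(0, 0.01, (hidden, out_dim))

h = rng.normal(size=(4, in_dim))   # encoder representations (kept for linear evaluation)
z = projection_head(h, w1, w2)     # 128-D outputs fed to the contrastive loss
```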
- Loss Function & Batch Size
- Compared to logistic loss and margin loss, NT-Xent weights different examples based on hardness, as a result of $l_2$ normalization and temperature
- While (semi-hard) negative mining helps, NT-Xent still performs best
- Temperature hyperparameter tuning is crucial to performance
- When number of training epochs is small, larger batch sizes help significantly since they provide more negative examples
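A quick way to see the hardness weighting: each negative's contribution to the NT-Xent denominator (and hence to the gradient) is proportional to its softmax weight, so a lower temperature concentrates weight on the hardest, most similar negatives. A toy demonstration with made-up similarity values:

```python
import numpy as np

def negative_weights(sims, temperature):
    """Softmax of similarity / temperature: the relative weight each
    negative receives in the NT-Xent denominator."""
    logits = np.asarray(sims, dtype=float) / temperature
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

sims = [0.9, 0.5, 0.1]                          # one hard negative, two easier ones
sharp = negative_weights(sims, temperature=0.1) # low tau: hard negative dominates
flat = negative_weights(sims, temperature=1.0)  # high tau: weights nearly uniform
```

This is consistent with the observation that tuning the temperature is crucial: it directly controls how aggressively hard negatives are emphasized.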
- Comparison with SotA
- Linear Evaluation: Beats previous SotA, and matches supervised ResNet-50 with larger ResNet-50 (4x) at 76.5% top-1 accuracy
- Semi-supervised Learning: SotA with both 1% and 10% labels; fine-tuning the pretrained network on full ImageNet does ~2% better than training from scratch
- Transfer Learning: After fine-tuning, self-supervised model outperforms supervised baseline in 5/12 datasets, and underperforms in 2/12
Conclusion
- The performance of the framework is a result of the combination of design choices, no individual component is new
- Simple framework, although it relies on large batch sizes
- Transfer learning results are very encouraging
- Final data augmentation combination of random crop, color distortion, and Gaussian blur not clearly motivated