# A Simple Framework for Contrastive Learning of Visual Representations

Chen et al., 2020

## Summary

• Proposes novel framework, SimCLR, for self-supervised learning of visual representations
• Combines data augmentation, learnable nonlinear projection head, larger batch sizes, and longer training for contrastive learning
• Linear evaluation of self-supervised representations learned by a larger network, ResNet-50 (4x), matches supervised ResNet-50
• Links: [ website ] [ pdf ]

## Background

• Learning visual representations without human supervision is a long-standing problem in computer vision
• In general, approaches fall into two classes:
• Generative: learn to model pixels in the input space, which is computationally expensive and may not be necessary for representation learning
• Discriminative: learn “pretext tasks”, where both inputs and labels are derived from raw data
• Recent SotA self-supervised methods are based on contrastive learning, but use specialized architectures or a memory bank

## Methods

• Learns representations by maximizing similarity between representations of augmented views of the same data sample
• Four major components:
• Stochastic data augmentation: generates two correlated views of an example, considered a positive pair, using random crop (with resize and flip), color distortions, and Gaussian blur
• Neural network base encoder: extracts representations from augmented examples, ResNet-50 in this work
• Small neural network projection head: a simple nonlinear function on top of the base encoder’s representations, a 2-layer MLP with 128-D output
• Contrastive loss function: normalized temperature-scaled cross entropy loss (NT-Xent), treat other augmented examples in minibatch as negative examples
• Evaluation uses the linear evaluation protocol, where a linear classifier is trained on top of the frozen learned representations on ImageNet
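The NT-Xent loss described above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation; it assumes the batch holds 2N projections where rows 2k and 2k+1 are the two augmented views of example k:

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent over a batch of 2N projections; rows 2k and 2k+1
    are assumed to be the two augmented views of example k."""
    # l2-normalize so dot products are cosine similarities
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    n = z.shape[0]
    sim = z @ z.T / temperature            # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)         # exclude self-similarity
    pos = np.arange(n) ^ 1                 # index of each row's positive partner
    # numerically stable log-sum-exp over the 2N-1 candidates per row
    m = sim.max(axis=1)
    logsumexp = m + np.log(np.exp(sim - m[:, None]).sum(axis=1))
    # cross entropy of the positive against all other augmented examples
    return (logsumexp - sim[np.arange(n), pos]).mean()
```

Treating all other augmented examples in the minibatch as negatives is what makes large batch sizes matter in the paper's results.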

## Results

• Data Augmentation for Defining Tasks
• Spatial/geometric transformation: cropping & resizing (with horizontal flip), rotation, cutout
• Appearance transformation: color distortion (color dropping, brightness, contrast, saturation, hue), Gaussian blur, Sobel filtering, Gaussian noise
• Compare individual and pairs of augmentations
• No single transformation is sufficient
• The combination of random cropping and color distortion performs best, by a significant margin
• Adjusting the strength of color distortion shows that contrastive learning benefits from stronger color distortion than supervised learning does
• Base Encoder & Projection Head Architectures
• Increasing the depth and width of the encoder both improve performance, and unsupervised contrastive learning benefits more from bigger models than supervised learning does
• A nonlinear projection head is slightly better than a linear projection head, and significantly better than no projection head, regardless of output dimension
• The hidden layer before the projection head is a better representation than the layer after it, indicating that the contrastive loss can discard information (e.g. color or orientation)
• Loss Function & Batch Size
• Unlike logistic loss and margin loss, NT-Xent weights examples by their hardness, a consequence of $\ell_2$ normalization and the temperature
• While (semi-hard) negative mining helps, NT-Xent still performs best
• Temperature hyperparameter tuning is crucial to performance
• When number of training epochs is small, larger batch sizes help significantly since they provide more negative examples
• Comparison with SotA
• Linear Evaluation: Beats previous SotA, and matches supervised ResNet-50 with the larger ResNet-50 (4x) at 76.5% top-1 accuracy
• Semi-supervised Learning: SotA with both 1% and 10% of labels; fine-tuning the pretrained network on full ImageNet does ~2% better than training from scratch
• Transfer Learning: After fine-tuning, self-supervised model outperforms supervised baseline in 5/12 datasets, and underperforms in 2/12
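The role of the temperature can be illustrated with a toy softmax over similarity scores (the similarity values here are made up): lowering the temperature sharpens the distribution, so the hardest negatives, those most similar to the anchor, take a larger share of the loss.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

# made-up cosine similarities: the positive pair first, then three negatives
sims = np.array([0.9, 0.7, 0.4, 0.1])

for tau in (1.0, 0.5, 0.1):
    p = softmax(sims / tau)
    print(f"tau={tau}: probabilities {p.round(3)}")
```

As tau shrinks, the hard negative at similarity 0.7 gains probability mass relative to the easy negative at 0.1, which is the hardness weighting the summary attributes to NT-Xent.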

## Conclusion

• The performance of the framework is a result of the combination of design choices; no individual component is new
• Simple framework, although relies on large batch sizes
• Transfer learning results are very encouraging
• Final data augmentation combination of random crop, color distortion, and Gaussian blur not clearly motivated