A Simple Framework for Contrastive Learning of Visual Representations
Chen et al., 2020
Summary
- Proposes novel framework, SimCLR, for self-supervised learning of visual representations
- Combines data augmentation, learnable nonlinear projection head, larger batch sizes, and longer training for contrastive learning
- Linear evaluation of self-supervised representations learned by a larger network, ResNet-50 (4x), matches supervised ResNet-50
- Links: [ website ] [ pdf ]
Background
- Learning visual representations without human supervision is a long-standing problem in computer vision
- In general, approaches fall into two classes:
- Generative: learn to model pixels in the input space, computationally expensive and may not be necessary for representation learning
- Discriminative: learn “pretext tasks”, where both inputs and labels are derived from raw data
- Recent SotA self-supervised methods are based on contrastive learning, but use specialized architectures or a memory bank
Methods
- Learns representations by maximizing similarity between representations of augmented views of the same data sample
- Four major components:
- Stochastic data augmentation: generates two correlated views of an example, considered a positive pair, using random crop (with resize and flip), color distortions, and Gaussian blur
- Neural network base encoder: extracts representations from augmented examples, ResNet-50 in this work
- Small neural network projection head: simple nonlinear function on top of base encoder’s representations, 2-layer MLP with 128-D output
- Contrastive loss function: normalized temperature-scaled cross entropy loss (NT-Xent), treat other augmented examples in minibatch as negative examples
- Use linear evaluation protocol, where a linear classifier is trained on top of the learned representations on ImageNet
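The NT-Xent loss over a minibatch of N positive pairs can be sketched in NumPy (a simplified sketch of the loss as described, applied to projection head outputs z; batch size and dimensions below are illustrative):

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent: for each view, its augmented partner is the positive and
    the other 2(N-1) views in the batch serve as negatives."""
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # l2-normalize -> cosine similarity
    sim = (z @ z.T) / temperature                      # temperature-scaled similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # partner indices: i <-> i+n
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 32))
loss_aligned = nt_xent_loss(z1, z1 + 0.01 * rng.normal(size=(8, 32)))  # near-identical views
loss_random = nt_xent_loss(z1, rng.normal(size=(8, 32)))               # unrelated "views"
```

As expected, the loss is much lower when the two views of each example actually agree than when they are unrelated.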
Results
- Data Augmentation for Defining Tasks
- Spatial/geometric transformation: cropping & resizing (with horizontal flip), rotation, cutout
- Appearance transformation: color distortion (color dropping, brightness, contrast, saturation, hue), Gaussian blur, Sobel filtering, Gaussian noise
- Compare individual and pairs of augmentations
- No single transformation is sufficient
- The combination of random cropping and color distortion performs significantly better than any other pair
- Adjusting strength of color distortion shows that contrastive learning benefits from stronger color distortion compared to supervised learning
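As a toy illustration of the stochastic cropping that defines positive pairs, here is a minimal NumPy sketch (nearest-neighbor resize; the paper's actual pipeline also applies flips, color distortion, and Gaussian blur, and the 20-100% area range here is an assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop_resize(img, out_size=32):
    """Take a random crop covering 20-100% of the image area and resize it
    back, so two calls on the same image yield two correlated views."""
    h, w = img.shape[:2]
    frac = np.sqrt(rng.uniform(0.2, 1.0))          # side fraction from area fraction
    ch, cw = max(1, int(h * frac)), max(1, int(w * frac))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    # nearest-neighbor resize back to a fixed output size
    ys = np.arange(out_size) * ch // out_size
    xs = np.arange(out_size) * cw // out_size
    return crop[np.ix_(ys, xs)]

img = rng.random((64, 64, 3))
view1, view2 = random_crop_resize(img), random_crop_resize(img)  # a positive pair
```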
- Base Encoder & Projection Head Architectures
- Increasing depth and width of encoder both improve performance, but unsupervised contrastive learning benefits more than supervised
- Nonlinear projection head is slightly better than linear projection head, and significantly better than no projection head, regardless of output dimension
- Hidden layer before projection head is a better representation than the layer after, indicating a loss of information from contrastive loss (e.g. color or orientation)
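The 2-layer nonlinear head, and the point that the pre-head representation h is what gets kept for evaluation, can be sketched as follows (weight initialization is illustrative; 2048 is the ResNet-50 output dimension):

```python
import numpy as np

rng = np.random.default_rng(0)

def projection_head(h, w1, w2):
    """Nonlinear head z = g(h) = W2 ReLU(W1 h): the contrastive loss sees z,
    while downstream tasks use h, the layer before the head."""
    return np.maximum(h @ w1, 0.0) @ w2

in_dim, hidden, out_dim = 2048, 2048, 128
w1 = rng.normal(0, 0.01, (in_dim, hidden))
w2 = rng.normal(0, 0.01, (hidden, out_dim))

h = rng.normal(size=(4, in_dim))   # encoder representations (kept for linear evaluation)
z = projection_head(h, w1, w2)     # 128-D outputs fed to the contrastive loss
```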
- Loss Function & Batch Size
- Compared to logistic loss and margin loss, NT-Xent weights different examples based on hardness, as a result of $l_2$ normalization and temperature
- While (semi-hard) negative mining helps, NT-Xent still performs best
- Temperature hyperparameter tuning is crucial to performance
- When number of training epochs is small, larger batch sizes help significantly since they provide more negative examples
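A quick way to see the hardness weighting: each negative's contribution to the NT-Xent denominator (and hence to the gradient) is proportional to its softmax weight, so a lower temperature concentrates weight on the hardest, most similar negatives. A toy demonstration with made-up similarity values:

```python
import numpy as np

def negative_weights(sims, temperature):
    """Softmax of similarity / temperature: the relative weight each
    negative receives in the NT-Xent denominator."""
    logits = np.asarray(sims, dtype=float) / temperature
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()

sims = [0.9, 0.5, 0.1]                          # one hard negative, two easier ones
sharp = negative_weights(sims, temperature=0.1) # low tau: hard negative dominates
flat = negative_weights(sims, temperature=1.0)  # high tau: weights nearly uniform
```

This is consistent with the observation that tuning the temperature is crucial: it directly controls how aggressively hard negatives are emphasized.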
- Comparison with SotA
- Linear Evaluation: Beats previous SotA, and matches supervised ResNet-50 with larger ResNet-50 (4x) at 76.5% top-1 accuracy
- Semi-supervised Learning: SotA with both 1% and 10% labels; fine-tuning the pretrained network on full ImageNet does ~2% better than training from scratch
- Transfer Learning: After fine-tuning, self-supervised model outperforms supervised baseline in 5/12 datasets, and underperforms in 2/12
Conclusion
- The performance of the framework is a result of the combination of design choices, no individual component is new
- Simple framework, although it relies on large batch sizes
- Transfer learning results are very encouraging
- Final data augmentation combination of random crop, color distortion, and Gaussian blur not clearly motivated