Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases
Purushwalkam and Gupta, 2020
Summary
- Recent self-supervised learning techniques have approached or even surpassed their supervised counterparts in downstream visual tasks
- The gains from these contrastive learning methods are not well understood
- This work analyzes the invariances (occlusion, viewpoint, illumination, category instance) learned in these models
- The two selected methods, MOCO and PIRL, learn representations with better occlusion invariance, but worse viewpoint and category instance invariance, than their supervised counterparts
- Authors also propose a method to leverage videos to learn better viewpoint invariance
- Links: [ website ] [ pdf ]
Background
- SotA self-supervised learning methods that rival supervised models rely on the instance discrimination task, which treats each instance (e.g. image) as its own class
- The contrastive loss and aggressive augmentation are other critical ingredients
- Contrastive loss has been shown to promote “alignment” and “uniformity”, but this does not directly address why these representations are useful
- A good visual representation is defined by being useful for many downstream tasks (e.g. object detection, image classification, semantic segmentation in the case of vision)
- Invariance to certain transformations underlies a good representation - an ideal representation would be invariant to all transformations that do not change the label
- Augmentations for contrastive learning should reduce mutual information between augmented samples, while keeping task-relevant information intact
- The aggressive cropping found in SotA contrastive learning methods does not enforce this explicitly, but rather relies on an object-centric dataset (e.g. ImageNet); see the sketch below this list
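A minimal sketch of this setup (not the paper's code): two aggressive random crops of the same image form a positive pair, every other image in the batch serves as a negative, and an InfoNCE-style contrastive loss pulls positives together while pushing negatives apart. The `scale=(0.2, 1.0)` crop range and temperature of 0.07 are assumed MoCo-style defaults, used here only for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Aggressive augmentation: random crops covering as little as 20% of the image,
# which only keeps the same object in view if the dataset is object-centric.
aggressive_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

def info_nce_loss(q, k, temperature=0.07):
    """q, k: (batch, dim) embeddings of two augmented views of the same images.

    Row i of q and row i of k form the positive pair; all other rows are negatives.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature       # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```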
Methods
- For object recognition, a few important transformations that do not change the target are viewpoint, deformation, illumination, occlusion, and category instance changes
- To measure invariance (a minimal sketch follows the Methods list below):
- Define a unit as firing if the magnitude of its output is larger than some threshold
- Global firing rate: expected value of the firing of a given unit over all inputs
- Local firing rate: fraction of transformed inputs of a given target for which a given unit fires
- Target conditioned invariance: local firing rate normalized by the global firing rate
- To test dataset bias:
- Use MSCOCO, which is more scene-centric, unlike ImageNet, which suffers from object-centric bias
- Also create variants of the training/testing datasets (MSCOCO/Pascal VOC respectively) cropped to bounding boxes so there is only one object per image
- Improving viewpoint invariance with videos:
- Baseline: train MOCO with frames from TrackingNet videos, 3 frames per video
- Frame Temporal Invariance: create positive pairs from frames $k$ frames apart (see the pair-construction sketch below)
- Region Tracker: create positive pairs of tracked regions from frames $k$ frames apart
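A minimal sketch of the invariance measurements described above, assuming activations are collected into a (num_inputs, num_units) matrix; the function names and default threshold are illustrative, not the authors' implementation.

```python
import numpy as np

def fires(features, threshold=0.5):
    """Boolean firing matrix: a unit fires when |activation| exceeds the threshold."""
    return np.abs(features) > threshold

def global_firing_rate(all_features, threshold=0.5):
    """Expected firing of each unit over all inputs; shape (num_units,)."""
    return fires(all_features, threshold).mean(axis=0)

def local_firing_rate(target_features, threshold=0.5):
    """Fraction of transformed inputs of one target for which each unit fires."""
    return fires(target_features, threshold).mean(axis=0)

def target_conditioned_invariance(target_features, all_features, threshold=0.5, eps=1e-8):
    """Local firing rate normalized by the global firing rate, per unit."""
    return local_firing_rate(target_features, threshold) / (
        global_firing_rate(all_features, threshold) + eps
    )
```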
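A minimal sketch (assumed data layout, not the paper's code) of how Frame Temporal Invariance positives could be constructed: instead of two augmentations of one image, the positive pair is two frames of the same video sampled $k$ frames apart, pushing the representation to be invariant to the viewpoint and pose changes that occur over that gap. The dict-of-frame-lists input and the `pairs_per_video` parameter are hypothetical.

```python
import random

def temporal_positive_pairs(video_frames, k=10, pairs_per_video=3):
    """video_frames: dict mapping video_id -> ordered list of frames (paths or arrays)."""
    pairs = []
    for frames in video_frames.values():
        if len(frames) <= k:
            continue  # video too short to sample frames k apart
        for _ in range(pairs_per_video):
            i = random.randrange(len(frames) - k)
            pairs.append((frames[i], frames[i + k]))  # anchor frame and its temporal positive
    return pairs
```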
Results
- Datasets:
- Occlusion: GOT-10k, videos with frames annotated with bounding boxes and amount of occlusion (0-100%)
- Viewpoint+Instance and Instance: PASCAL3D+, images with objects from 12 categories annotated with bounding boxes and viewpoint angle
- Viewpoint, Illumination direction, and Illumination color: ALOI, images of 1000 objects on a turntable with varying viewpoint, illumination direction, and illumination color separately
- Self-supervised methods, MOCO and PIRL, have much higher occlusion invariance compared to the ImageNet-supervised model (~84% vs ~80%), but lower viewpoint (~85% vs ~89%) and instance (~62/52% vs ~66%) invariance
- Instance discrimination explicitly forces models to minimize instance invariance (i.e. maximize variation between instances)
- MOCO trained on full MSCOCO images does better than the model trained on MSCOCO cropped boxes when evaluated on the standard Pascal dataset, but worse on the Pascal cropped boxes
- Indicates that aggressive cropping is harmful unless training with an object-centric dataset, where random crops contain portions of the same object
- The temporal invariance models do improve viewpoint invariance and beat the baseline in evaluations on Pascal, Pascal cropped boxes, ImageNet, and ADE20K
Conclusion
- Useful framework for evaluating invariances in representations
- Demonstrate that select SotA contrastive learning algorithms, which rely heavily on aggressive cropping, also rely on an object-centric dataset
- Compared to supervised models, these have better occlusion invariance but worse viewpoint, illumination direction, and category instance invariance
- Improving viewpoint invariance using videos seems promising