Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases

Purushwalkam and Gupta, 2020


  • Recent self-supervised learning techniques have approached or even surpassed their supervised counterparts in downstream visual tasks
    • The gains from these contrastive learning methods are not well understood
  • This work analyzes the invariances (occlusion, viewpoint, illumination, category instance) learned in these models
  • The two selected methods, MOCO and PIRL, learn representations that are more occlusion-invariant than those of supervised models, but less invariant to viewpoint and category instance
  • Authors also propose a method to leverage videos to learn better viewpoint invariance


  • SotA self-supervised learning methods that rival supervised models rely on the instance discrimination task, which treats each instance (e.g. image) as its own class
    • The contrastive loss and aggressive augmentation are other critical ingredients
  • Contrastive loss has been shown to promote “alignment” and “uniformity”, but this does not directly address why these representations are useful
  • A good visual representation is defined by being useful for many downstream tasks (e.g. object detection, image classification, semantic segmentation in the case of vision)
    • Invariance to certain transformations underlies a good representation - an ideal representation would be invariant to all transformations that do not change the label
  • Augmentations for contrastive learning should reduce mutual information between augmented samples, while keeping task-relevant information intact
    • The aggressive cropping found in SotA contrastive learning methods does not enforce this explicitly, but rather relies on an object-centric dataset (e.g. ImageNet)
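
To make the instance discrimination objective concrete, here is a minimal sketch of an InfoNCE-style contrastive loss for a single query, in plain Python. The function names, the toy vectors, and the temperature of 0.07 are illustrative assumptions rather than details taken from the paper; real implementations operate on batches of network embeddings.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE loss for one query embedding.

    `positive` is the embedding of another augmented view of the same
    image; `negatives` are embeddings of other images. The temperature
    value is an assumed default, not a value from the paper.
    """
    q = l2_normalize(query)
    keys = [l2_normalize(positive)] + [l2_normalize(n) for n in negatives]
    # Cosine similarities scaled by the temperature; index 0 is the positive.
    logits = [dot(q, k) / temperature for k in keys]
    # Numerically stable log-sum-exp over all keys.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of picking the positive among all keys.
    return log_denominator - logits[0]
```

The loss is small when the query is close to its positive and far from the negatives, which is what pushes two crops of the same image toward the same representation.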


  • For object recognition, a few important transformations that do not change the target are viewpoint, deformation, illumination, occlusion, and category instance changes
  • To measure invariance:
    • Define a unit as firing if the magnitude of its output is larger than some threshold
    • Global firing rate: expected value of the firing for a given unit
    • Local firing rate: the fraction of transformed inputs for a given target on which a given unit fires
    • Target conditioned invariance: local firing rate normalized by the global firing rate
  • To test dataset bias:
    • Use MSCOCO, which is more scene-centric, unlike ImageNet, which suffers from object-centric bias
    • Also use variants of the training/testing datasets (MSCOCO/Pascal VOC, respectively) cropped to bounding boxes so that each image contains only one object
  • Improving viewpoint invariance with videos:
    • Baseline: train MOCO with frames from TrackingNet videos, 3 frames per video
    • Frame Temporal Invariance: create positive pairs from frames that are $k$ frames apart
    • Region Tracker: create positive pairs from regions in frames that are $k$ frames apart
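
The frame pairing used for temporal invariance can be sketched as follows. This is a minimal illustration under assumptions: the function name, the uniform random choice of the first frame, and the fixed seed are hypothetical and not taken from the paper. It only returns frame indices; loading frames and tracking regions are left out.

```python
import random


def sample_frame_pairs(num_frames, k, num_pairs, seed=0):
    """Return `num_pairs` index pairs (i, i + k) within one video.

    Each pair is meant to serve as a positive pair for contrastive
    training. Uniform sampling of i is an assumption of this sketch.
    """
    if k < 1 or k >= num_frames:
        raise ValueError("k must satisfy 1 <= k < num_frames")
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_pairs):
        # Pick the first frame so that the second one stays inside the video.
        i = rng.randrange(num_frames - k)
        pairs.append((i, i + k))
    return pairs
```

For example, `sample_frame_pairs(100, 5, 10)` yields ten pairs of indices that are each five frames apart in a 100-frame video.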


  • Datasets:
    • Occlusion: GOT-10k, videos with frames annotated with bounding boxes and amount of occlusion (0-100%)
    • Viewpoint+Instance and Instance: PASCAL3D+, images with objects from 12 categories annotated with bounding boxes and viewpoint angle
    • Viewpoint, Illumination direction, and Illumination color: ALOI, images of 1000 objects on a turntable with viewpoint, illumination direction, and illumination color varied separately
  • The self-supervised methods, MOCO and PIRL, have much higher occlusion invariance than the ImageNet-supervised model (~84% vs ~80%), but lower viewpoint (~85% vs ~89%) and instance (~62/52% vs ~66%) invariance
    • Instance discrimination explicitly forces models to minimize instance invariance (i.e. maximize variation between instances)
  • MOCO trained on MSCOCO does better than the same model trained on MSCOCO cropped boxes when evaluated on the standard Pascal dataset, but worse when evaluated on the Pascal cropped boxes
    • Indicates that aggressive cropping is harmful unless training with an object-centric dataset, where random crops contain portions of the same object
  • Temporal invariance models do improve viewpoint invariance and beat the baseline in evaluation on Pascal, Pascal cropped boxes, ImageNet, and ADE20K
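
The invariance comparisons above build on the firing-rate definitions given earlier. A minimal sketch of those definitions for a single unit is shown below. It follows only what these notes state; the function names, the single shared threshold, and the toy activations are assumptions, and the sketch does not attempt to reproduce how the paper aggregates scores across units or datasets.

```python
def firing(activations, threshold):
    """1 where the magnitude of the unit's output exceeds the threshold, else 0."""
    return [1 if abs(a) > threshold else 0 for a in activations]


def firing_rate(activations, threshold):
    """Fraction of inputs on which the unit fires."""
    fired = firing(activations, threshold)
    return sum(fired) / len(fired)


def target_conditioned_invariance(target_activations, all_activations, threshold):
    """Local firing rate normalized by the global firing rate.

    `target_activations`: the unit's outputs on transformed versions of
    one target (local firing rate).
    `all_activations`: the unit's outputs over the whole dataset
    (global firing rate).
    """
    global_rate = firing_rate(all_activations, threshold)
    if global_rate == 0:
        raise ValueError("unit never fires; the ratio is undefined")
    local_rate = firing_rate(target_activations, threshold)
    return local_rate / global_rate
```

A unit that fires on most transformed versions of a target while firing rarely overall gets a high score, which is the sense in which it is both invariant and selective.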


  • Useful framework for evaluating invariances in representations
  • Demonstrate that select SotA contrastive learning algorithms, which rely heavily on aggressive cropping, also rely on an object-centric dataset
    • Compared to supervised models, these have better occlusion invariance but worse viewpoint, illumination direction, and category instance invariance
  • Improving viewpoint invariance using videos seems promising
Elias Z. Wang
AI Researcher | PhD Candidate