# Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases

Purushwalkam and Gupta, 2020

## Summary

• Recent self-supervised learning techniques have approached or even surpassed their supervised counterparts in downstream visual tasks
• The gains from these contrastive learning methods are not well understood
• This work analyzes the invariances (occlusion, viewpoint, illumination, category instance) learned in these models
• The two methods studied, MoCo and PIRL, learn representations with better occlusion invariance but worse viewpoint and category instance invariance
• Authors also propose a method to leverage videos to learn better viewpoint invariance
• Links: [ website ] [ pdf ]

## Background

• SotA self-supervised learning methods that rival supervised models rely on the instance discrimination task, which treats each instance (e.g. image) as its own class
• The contrastive loss and aggressive augmentation are other critical ingredients
• Contrastive loss has been shown to promote “alignment” and “uniformity”, but this does not directly address why these representations are useful
• A good visual representation is defined by being useful for many downstream tasks (e.g. object detection, image classification, semantic segmentation in the case of vision)
• Invariance to certain transformations underlies a good representation - an ideal representation would be invariant to all transformations that do not change the label
• Augmentations for contrastive learning should reduce mutual information between augmented samples, while keeping task-relevant information intact
• The aggressive cropping used in SotA contrastive learning methods does not enforce this explicitly; instead it relies on an object-centric dataset (e.g. ImageNet)
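The instance discrimination objective underlying these methods is typically an InfoNCE-style contrastive loss: pull two augmentations of the same image together and push embeddings of other images away. A minimal NumPy sketch (function and variable names are our own, not from the paper):

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss for a single query embedding.

    query, positive: (d,) L2-normalized embeddings of two augmented
    views of the same instance; negatives: (n, d) embeddings of
    other instances (e.g. MoCo's memory queue)."""
    pos_sim = query @ positive / temperature
    neg_sim = query @ negatives.T / temperature      # (n,) similarities
    logits = np.concatenate([[pos_sim], neg_sim])    # positive is class 0
    # Cross-entropy with the positive as the target class
    return -pos_sim + np.log(np.sum(np.exp(logits)))

# Toy usage: a matching positive yields a much lower loss than a
# mismatched one (orthogonal embedding)
q = np.array([1.0, 0.0])
p = np.array([1.0, 0.0])
neg = np.array([[0.0, 1.0]])
assert info_nce_loss(q, p, neg) < info_nce_loss(q, neg[0], p[None, :])
```

The temperature of 0.07 mirrors the value commonly used in MoCo-style implementations; it sharpens the softmax over similarities.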

## Methods

• For object recognition, a few important transformations that do not change the target are viewpoint, deformation, illumination, occlusion, and category instance changes
• To measure invariance:
  • Define a unit as firing if the magnitude of its output is larger than some threshold
  • Global firing rate: expected value of firing for a given unit over all inputs
  • Local firing rate: fraction of transformed inputs of a given target for which a given unit fires
  • Target conditioned invariance: local firing rate normalized by the global firing rate
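The three measures above can be sketched directly from their definitions (a hedged NumPy sketch; array shapes and function names are our own, and the paper's exact normalization may differ in detail):

```python
import numpy as np

def firing(features, threshold):
    """A unit 'fires' when the magnitude of its output exceeds its threshold.

    features: (n_samples, n_units) activations; threshold: (n_units,).
    Returns a boolean (n_samples, n_units) firing matrix."""
    return np.abs(features) > threshold

def global_firing_rate(features, threshold):
    """Expected firing of each unit over all inputs in the dataset."""
    return firing(features, threshold).mean(axis=0)

def local_firing_rate(target_features, threshold):
    """Fraction of transformed versions of ONE target on which each unit fires.

    target_features: (n_transforms, n_units) activations of the
    transformed inputs of a single target."""
    return firing(target_features, threshold).mean(axis=0)

def target_conditioned_invariance(local_rate, global_rate, eps=1e-8):
    """Local firing rate normalized by the global firing rate."""
    return local_rate / (global_rate + eps)

# Toy usage: unit 0 fires on both transforms of the target but only
# half the time globally, so its target-conditioned invariance is high
feats_all = np.array([[0.5, 2.0], [1.5, 0.1], [0.2, 3.0], [2.5, 0.3]])
feats_target = np.array([[1.2, 0.2], [1.8, 0.4]])
thr = np.array([1.0, 1.0])
g = global_firing_rate(feats_all, thr)      # [0.5, 0.5]
l = local_firing_rate(feats_target, thr)    # [1.0, 0.0]
inv = target_conditioned_invariance(l, g)
```

Normalizing by the global rate prevents always-on units from trivially appearing invariant.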
• To test dataset bias:
  • Use MSCOCO, which is more scene-centric, unlike ImageNet, which has an object-centric bias
  • Also use variants of the training/testing datasets (MSCOCO/Pascal VOC respectively) cropped to bounding boxes so there is only one object per image
• Improving viewpoint invariance with videos:
  • Baseline: train MoCo on frames from TrackingNet videos, 3 frames per video
  • Frame Temporal Invariance: create positive pairs from frames $k$ frames apart
  • Region Tracker: create positive pairs from tracked regions $k$ frames apart
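The temporal pairing step above (frames $k$ apart treated as positives) can be sketched as follows (a minimal illustration; the function name and data layout are our own):

```python
def temporal_positive_pairs(frames, k):
    """Pair each frame with the frame k steps later in the same video.

    frames: ordered list of frame identifiers from one video.
    Returns a list of (anchor, positive) tuples to be treated as
    positives by the contrastive loss."""
    return [(frames[i], frames[i + k]) for i in range(len(frames) - k)]

# Toy usage on a 4-frame video with k=2
pairs = temporal_positive_pairs(["f0", "f1", "f2", "f3"], k=2)
# pairs == [("f0", "f2"), ("f1", "f3")]
```

The Region Tracker variant applies the same idea to tracked object regions rather than whole frames, which more directly targets viewpoint changes of a single object.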

## Results

• Datasets:
  • Occlusion: GOT-10k, videos with frames annotated with bounding boxes and amount of occlusion (0-100%)
  • Viewpoint+Instance and Instance: PASCAL3D+, images of objects from 12 categories annotated with bounding boxes and viewpoint angle
  • Viewpoint, illumination direction, and illumination color: ALOI, images of 1000 objects on a turntable with viewpoint, illumination direction, and illumination color varied separately
• The self-supervised methods, MoCo and PIRL, have much higher occlusion invariance than the ImageNet-supervised model (~84% vs ~80%), but lower viewpoint (~85% vs ~89%) and instance (~62%/52% vs ~66%) invariance
• Instance discrimination explicitly forces models to minimize instance invariance (i.e. maximize variation between instances)
• MoCo trained on MSCOCO does better than the model trained on MSCOCO cropped boxes when evaluated on the standard Pascal dataset, but worse on the Pascal cropped boxes
  • Indicates that aggressive cropping is harmful unless training on an object-centric dataset, where random crops contain portions of the same object
• The temporal invariance models do improve viewpoint invariance and beat the baseline in evaluations on Pascal, Pascal cropped boxes, ImageNet, and ADE20K

## Conclusion

• Proposes a useful framework for evaluating invariances in representations
• Demonstrate that select SotA contrastive learning algorithms, which rely heavily on aggressive cropping, also rely on an object-centric dataset
• Compared to supervised models, these methods have better occlusion invariance but worse viewpoint, illumination direction, and category instance invariance
• Improving viewpoint invariance using videos seems promising