Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases
Purushwalkam and Gupta, 2020
Summary
- Recent self-supervised learning techniques have approached or even surpassed their supervised counterparts in downstream visual tasks
- The gains from these contrastive learning methods are not well understood
- This work analyzes the invariances (occlusion, viewpoint, illumination, category instance) learned in these models
- The two selected methods, MOCO and PIRL, learn representations with better occlusion invariance, but worse viewpoint and category instance invariance, than their supervised counterparts
- Authors also propose a method to leverage videos to learn better viewpoint invariance
- Links: [ website ] [ pdf ]
Background
- SotA self-supervised learning methods that rival supervised models rely on the instance discrimination task, which treats each instance (e.g. image) as its own class
- The contrastive loss and aggressive augmentation are other critical ingredients
- Contrastive loss has been shown to promote “alignment” and “uniformity”, but this does not directly address why these representations are useful
- A good visual representation is defined by being useful for many downstream tasks (e.g. object detection, image classification, semantic segmentation in the case of vision)
- Invariance to certain transformations underlies a good representation - an ideal representation would be invariant to all transformations that do not change the label
- Augmentations for contrastive learning should reduce mutual information between augmented samples, while keeping task-relevant information intact
- The aggressive cropping found in SotA contrastive learning methods does not enforce this explicitly, but rather relies on an object-centric dataset (e.g. ImageNet); see the sketch below this list
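A minimal sketch of this setup (not the paper's code): two aggressive random crops of the same image form a positive pair, every other image in the batch serves as a negative, and an InfoNCE-style contrastive loss pulls positives together while pushing negatives apart. The `scale=(0.2, 1.0)` crop range and temperature of 0.07 are assumed MoCo-style defaults, used here only for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Aggressive augmentation: random crops covering as little as 20% of the image,
# which only keeps the same object in view if the dataset is object-centric.
aggressive_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])

def info_nce_loss(q, k, temperature=0.07):
    """q, k: (batch, dim) embeddings of two augmented views of the same images.

    Row i of q and row i of k form the positive pair; all other rows are negatives.
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    logits = q @ k.t() / temperature       # (batch, batch) cosine similarities
    labels = torch.arange(q.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, labels)
```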
Methods
- For object recognition, a few important transformations that do not change the target are viewpoint, deformation, illumination, occlusion, and category instance changes
- To measure invariance (a minimal sketch follows the Methods list below):
- Define a unit as firing if the magnitude of its output is larger than some threshold
- Global firing rate: expected value of the firing of a given unit over all inputs
- Local firing rate: fraction of transformed inputs of a given target for which a given unit fires
- Target conditioned invariance: local firing rate normalized by the global firing rate
- To test dataset bias:
- Use MSCOCO, which is more scene-centric, unlike ImageNet, which suffers from object-centric bias
- Also create variants of the training/testing datasets (MSCOCO/Pascal VOC respectively) cropped to bounding boxes so there is only one object per image
- Improving viewpoint invariance with videos:
- Baseline: train MOCO with frames from TrackingNet videos, 3 frames per video
- Frame Temporal Invariance: create positive pairs from frames $k$ frames apart (see the pair-construction sketch below)
- Region Tracker: create positive pairs of tracked regions from frames $k$ frames apart
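A minimal sketch of the invariance measurements described above, assuming activations are collected into a (num_inputs, num_units) matrix; the function names and default threshold are illustrative, not the authors' implementation.

```python
import numpy as np

def fires(features, threshold=0.5):
    """Boolean firing matrix: a unit fires when |activation| exceeds the threshold."""
    return np.abs(features) > threshold

def global_firing_rate(all_features, threshold=0.5):
    """Expected firing of each unit over all inputs; shape (num_units,)."""
    return fires(all_features, threshold).mean(axis=0)

def local_firing_rate(target_features, threshold=0.5):
    """Fraction of transformed inputs of one target for which each unit fires."""
    return fires(target_features, threshold).mean(axis=0)

def target_conditioned_invariance(target_features, all_features, threshold=0.5, eps=1e-8):
    """Local firing rate normalized by the global firing rate, per unit."""
    return local_firing_rate(target_features, threshold) / (
        global_firing_rate(all_features, threshold) + eps
    )
```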
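A minimal sketch (assumed data layout, not the paper's code) of how Frame Temporal Invariance positives could be constructed: instead of two augmentations of one image, the positive pair is two frames of the same video sampled $k$ frames apart, pushing the representation to be invariant to the viewpoint and pose changes that occur over that gap. The dict-of-frame-lists input and the `pairs_per_video` parameter are hypothetical.

```python
import random

def temporal_positive_pairs(video_frames, k=10, pairs_per_video=3):
    """video_frames: dict mapping video_id -> ordered list of frames (paths or arrays)."""
    pairs = []
    for frames in video_frames.values():
        if len(frames) <= k:
            continue  # video too short to sample frames k apart
        for _ in range(pairs_per_video):
            i = random.randrange(len(frames) - k)
            pairs.append((frames[i], frames[i + k]))  # anchor frame and its temporal positive
    return pairs
```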
Results
- Datasets:
- Occlusion: GOT-10k, videos with frames annotated with bounding boxes and amount of occlusion (0-100%)
- Viewpoint+Instance and Instance: PASCAL3D+, images with objects from 12 categories annotated with bounding boxes and viewpoint angle
- Viewpoint, Illumination direction, and Illumination color: ALOI, images of 1000 objects on a turntable with varying viewpoint, illumination direction, and illumination color separately
- Self-supervised methods, MOCO and PIRL, have much higher occlusion invariance compared to the ImageNet-supervised model (~84% vs ~80%), but lower viewpoint (~85% vs ~89%) and instance (~62/52% vs ~66%) invariance
- Instance discrimination explicitly forces models to minimize instance invariance (i.e. maximize variation between instances)
- MOCO trained on full MSCOCO images does better than the model trained on MSCOCO cropped boxes when evaluated on the standard Pascal dataset, but worse on the Pascal cropped boxes
- Indicates that aggressive cropping is harmful unless training with an object-centric dataset, where random crops contain portions of the same object
- The temporal invariance models do improve viewpoint invariance and beat the baseline in evaluations on Pascal, Pascal cropped boxes, ImageNet, and ADE20K
Conclusion
- Useful framework for evaluating invariances in representations
- Demonstrate that select SotA contrastive learning algorithms, which rely heavily on aggressive cropping, also rely on an object-centric dataset
- Compared to supervised models, these have better occlusion invariance but worse viewpoint, illumination direction, and category instance invariance
- Improving viewpoint invariance using videos seems promising