Scaling and Benchmarking Self-Supervised Visual Representation Learning

Goyal et al., 2019

Summary

Self-supervised learning techniques do not require manual supervision, but existing work has not exploited this to scale to large amounts of data
Scales self-supervised learning methods along three axes: data size, model capacity, and problem “hardness”
Demonstrates that scaling up can allow self-supervised methods to match or surpass ImageNet supervised pre-training on a variety of tasks
Proposes a benchmark of 9 datasets and tasks with comparable evaluation settings

The size, quality, and availability of supervised data have become a bottleneck in computer vision
It’s unclear what happens when we scale up self-supervised learning to 100M or more images
Because evaluation methodology in self-supervised learning is not standardized, making meaningful comparisons is difficult

Self-supervision tasks
- Jigsaw: divide the image into $N = 9$ tiles and shuffle them using one of $|P|$ predefined permutations; the network must predict which permutation was applied
- Colorization: given the $L$ (lightness) channel, predict the $ab$ color channels, which are discretized into $Q = 313$ bins and soft-encoded using the $K$-nearest neighbor bins
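
The jigsaw task above can be sketched as a small data-preparation routine. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, images are plain nested lists, and the permutation set is sampled uniformly at random (an assumption; the notes do not say how the $|P|$ permutations are chosen).

```python
import random


def make_permutation_set(num_tiles=9, set_size=100, seed=0):
    """Sample `set_size` distinct permutations of range(num_tiles).

    Assumption: uniform random sampling, used here only for illustration.
    """
    rng = random.Random(seed)
    perms = set()
    while len(perms) < set_size:
        perm = list(range(num_tiles))
        rng.shuffle(perm)
        perms.add(tuple(perm))
    # Sort so that the class index of each permutation is deterministic.
    return sorted(perms)


def split_into_tiles(image, grid=3):
    """Cut a square 2D image (list of rows) into grid*grid tiles, row-major."""
    tile = len(image) // grid
    tiles = []
    for r in range(grid):
        for c in range(grid):
            rows = image[r * tile:(r + 1) * tile]
            tiles.append([row[c * tile:(c + 1) * tile] for row in rows])
    return tiles


def make_jigsaw_example(image, permutation_set, rng):
    """Return (shuffled_tiles, label); label indexes the applied permutation."""
    label = rng.randrange(len(permutation_set))
    perm = permutation_set[label]
    tiles = split_into_tiles(image)
    shuffled = [tiles[i] for i in perm]
    return shuffled, label
```

The network would receive `shuffled` as input and be trained as a $|P|$-way classifier on `label`, so enlarging the permutation set directly increases the number of classes.
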
Scaling up self-supervised learning
- Pre-training data size: use random subset of YFCC-100M: YFCC-[1, 10, 50, 100]M
- Model capacity: use AlexNet and ResNet-50
- Problem complexity: $|P| \in [100, 701, 2\text{k}, 5\text{k}, 10\text{k}]$ for jigsaw and $K \in [2, 5, 10, 20, 40, 80, 160, 313]$ for colorization
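
To make the size of the search space concrete, the three axes can be written down as a configuration grid. This is only a sketch: the names are hypothetical, and it enumerates every combination, whereas these notes do not state that every combination was actually run.

```python
import itertools

DATA_SIZES_M = [1, 10, 50, 100]                        # YFCC-[1, 10, 50, 100]M
MODELS = ["AlexNet", "ResNet-50"]
JIGSAW_PERMUTATIONS = [100, 701, 2000, 5000, 10000]    # |P|
COLORIZATION_K = [2, 5, 10, 20, 40, 80, 160, 313]      # K


def scaling_grid(task):
    """List every (data size, model, complexity) setting for one pretext task."""
    complexities = {
        "jigsaw": JIGSAW_PERMUTATIONS,
        "colorization": COLORIZATION_K,
    }[task]
    return [
        {"task": task, "data": f"YFCC-{size}M", "model": model, "complexity": c}
        for size, model, c in itertools.product(DATA_SIZES_M, MODELS, complexities)
    ]
```

Under these assumptions the full grid has 4 × 2 × 5 = 40 jigsaw settings and 4 × 2 × 8 = 64 colorization settings.
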
Benchmarking Suite:
- Perform self-supervised pre-training on a given pre-training dataset
- Extract features from various layers of the network
- Evaluate quality of these features using transfer learning
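
The last two steps can be sketched as follows, assuming features have already been extracted from a frozen network and stored per layer. Everything here is illustrative: the function names are hypothetical, the SVM is a plain hinge-loss SGD solver rather than whatever solver the paper used, the problem is binary, and the score is accuracy for simplicity.

```python
import random


def train_linear_svm(features, labels, lr=0.1, reg=0.01, epochs=50, seed=0):
    """Fit a binary linear SVM (labels in {+1, -1}) by SGD on the hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(features[0])
    b = 0.0
    order = list(range(len(features)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = features[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # L2 regularisation shrinks the weights on every step.
            w = [wj * (1.0 - lr * reg) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: move towards classifying x correctly.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 for a single feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def accuracy(w, b, features, labels):
    correct = sum(predict(w, b, x) == y for x, y in zip(features, labels))
    return correct / len(labels)


def evaluate_layers(train_by_layer, train_labels, test_by_layer, test_labels):
    """Train one linear SVM per layer on frozen features; score on held-out data."""
    scores = {}
    for layer, train_feats in train_by_layer.items():
        w, b = train_linear_svm(train_feats, train_labels)
        scores[layer] = accuracy(w, b, test_by_layer[layer], test_labels)
    return scores
```

Because the network weights stay fixed and only a linear classifier is trained, differences in the per-layer scores reflect the quality of the learned features rather than the capacity of the classifier.
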

Setup: Transfer to image classification on PASCAL VOC2007 using linear SVMs
Increasing data size and model capacity improves transfer learning performance for both jigsaw and colorization
- Jigsaw has better performance, but colorization has better scaling
- Performance gap between AlexNet and ResNet-50 increases as a function of data size
For jigsaw, transfer learning performance increases as the size of the permutation set increases
- Colorization less sensitive to problem complexity via scaling $K$
Scaling along all three axes at once gives further gains, which suggests the axes are complementary

Seems consistent with more recent contrastive learning methods that demonstrate that scaling up model capacity closes the gap between self-supervised and fully supervised methods
The range of scaling is relatively limited, especially for model capacity