# Scaling and Benchmarking Self-Supervised Visual Representation Learning

Goyal et al., 2019

## Summary

• While self-supervised learning techniques do not require manual supervision, existing methods fail to exploit this advantage by scaling to large amounts of unlabeled data
• Scales self-supervised learning methods along three axes: data size, model capacity, and problem “hardness”
• Demonstrates that scaling up can allow self-supervised methods to match or surpass ImageNet supervised pre-training on a variety of tasks
• Proposes a benchmark of 9 datasets and tasks with comparable evaluation settings
• Links: [ website ] [ pdf ]

## Background

• The size, quality, and availability of supervised data have become a bottleneck in computer vision
• It’s unclear what happens when we scale up self-supervised learning to 100M or more images
• The lack of standardized evaluation methodology in self-supervised learning makes meaningful comparisons difficult

## Methods

• Jigsaw: divides image into $N=9$ tiles, given one of $|P|$ permutations of these tiles, network must predict the correct permutation
• Colorization: given the $\textit{L}$ channel, predict the $\textit{ab}$ color channels; the $\textit{ab}$ output space is discretized into $Q=313$ bins and the target is soft-encoded using the $K$-nearest neighbor bins
• Scaling up self-supervised learning:
  • Pre-training data size: use random subsets of YFCC-100M: YFCC-[1, 10, 50, 100]M
  • Model capacity: use AlexNet and ResNet-50
  • Problem complexity: $|P| \in \left\{100, 701, 2\text{k}, 5\text{k}, 10\text{k}\right\}$ for jigsaw and $K \in \left\{2, 5, 10, 20, 40, 80, 160, 313\right\}$ for colorization
• Benchmarking suite:
  • Perform self-supervised pre-training on a given pre-training dataset
  • Extract features from various layers of the network
  • Evaluate quality of these features using transfer learning
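The two pretext tasks above can be illustrated in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the tile layout, Gaussian kernel width, and random (rather than Hamming-distance-selected) permutation set are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_jigsaw_example(image, permutations):
    """Jigsaw pretext task: split `image` (H, W, C) into a 3x3 grid of
    tiles, shuffle them with a randomly chosen permutation, and return
    (tiles, label); the network must predict the permutation index."""
    h, w = image.shape[0] // 3, image.shape[1] // 3
    tiles = [image[r*h:(r+1)*h, c*w:(c+1)*w]
             for r in range(3) for c in range(3)]
    label = rng.integers(len(permutations))
    return np.stack([tiles[i] for i in permutations[label]]), label

def soft_encode(ab, bin_centers, K=5, sigma=5.0):
    """Colorization target: soft-encode an (a, b) color over Q bins using
    its K nearest bin centers, weighted by a Gaussian kernel."""
    d = np.linalg.norm(bin_centers - ab, axis=1)
    nearest = np.argsort(d)[:K]            # K-nearest bins
    w = np.exp(-d[nearest] ** 2 / (2 * sigma ** 2))
    z = np.zeros(len(bin_centers))
    z[nearest] = w / w.sum()               # normalized soft target
    return z

# |P| = 100 permutations sampled at random (the paper selects permutations
# to be maximally distinct in Hamming distance; random sampling is a
# simplification for illustration)
perms = np.array([rng.permutation(9) for _ in range(100)])
tiles, label = make_jigsaw_example(rng.random((99, 99, 3)), perms)

# Q = 313 bin centers in ab space (random stand-ins for the real gamut bins)
bins = rng.uniform(-110, 110, size=(313, 2))
z = soft_encode(np.array([10.0, -20.0]), bins)
```

Scaling problem complexity then amounts to varying `len(permutations)` for jigsaw and `K` for colorization.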

## Results

• Setup: transfer to the image classification task on PASCAL VOC2007 using linear SVMs
• Increasing data size and model capacity improves transfer learning performance for both jigsaw and colorization
• Jigsaw has better performance, but colorization has better scaling
• Performance gap between AlexNet and ResNet-50 increases as a function of data size
• For jigsaw, transfer learning performance increases as the permutation set size $|P|$ increases
• Colorization is less sensitive to problem complexity when scaling $K$
• Combining scaling along all three axes seems to indicate that they are complementary
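As a rough illustration of the evaluation setup, the sketch below fits a one-vs-all linear SVM on frozen features via sub-gradient descent on the hinge loss. Random features stand in for real network activations, and the hand-rolled solver, learning rate, and fixed cost `C` are assumptions; the paper uses standard SVM solvers and tunes the cost on a validation split.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.02, epochs=200):
    """Binary linear SVM with labels y in {-1, +1}, trained by full-batch
    sub-gradient descent on the regularized hinge loss. Returns (w, b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(X)
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                              # margin violators
        grad_w = w / (C * n) - (y[mask] @ X[mask]) / n  # hinge sub-gradient
        grad_b = -y[mask].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Random stand-ins for frozen features extracted from one network layer,
# with a roughly linearly separable binary task (one class vs. the rest)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(200))

w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

In the benchmark itself, one such classifier per class is trained on features from each layer of the frozen network, so feature quality can be compared layer by layer across pre-training methods.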

## Conclusion

• Seems consistent with more recent contrastive learning methods that demonstrate that scaling up model capacity closes the gap between self-supervised and fully supervised methods
• The range of scaling is relatively limited, especially for model capacity