Scaling and Benchmarking Self-Supervised Visual Representation Learning
Goyal et al., 2019
Summary
- While self-supervised learning techniques do not require manual supervision, existing methods have not exploited this advantage by scaling to large amounts of unlabeled data
- Scales self-supervised learning methods along three axes: data size, model capacity, and problem “hardness”
- Demonstrates that scaling up can allow self-supervised methods to match or surpass ImageNet supervised pre-training on a variety of tasks
- Proposes a benchmark of 9 datasets and tasks with comparable evaluation settings
- Links: [ website ] [ pdf ]
Background
- The size, quality, and availability of supervised data have become a bottleneck in computer vision
- It’s unclear what happens when we scale up self-supervised learning to 100M or more images
- The lack of standardized evaluation methodology in self-supervised learning makes meaningful comparisons difficult
Methods
- Self-supervision tasks (both sketched in code after this list)
- Jigsaw: divides the image into a 3×3 grid of 9 tiles; given one of N permutations of these tiles, the network must predict which permutation was applied
- Colorization: given the lightness (L) channel, predict the ab color channels, which are discretized into bins and soft-encoded using the K-nearest-neighbor bins
- Scaling up self-supervised learning
- Pre-training data size: use random subsets of YFCC-100M: YFCC-[1, 10, 50, 100]M
- Model capacity: use AlexNet and ResNet-50
- Problem complexity: vary the size of the permutation set for jigsaw and the number of nearest-neighbor bins used in the soft-encoding for colorization
- Benchmarking Suite:
- Perform self-supervised pre-training on a given pre-training dataset
- Extract features from various layers of the network
- Evaluate quality of these features using transfer learning
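A minimal NumPy sketch of how a jigsaw training example could be constructed; the tile grid, permutation-set size, and function names below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def make_jigsaw_example(image, permutations, rng):
    """Tile an image into a 3x3 grid, shuffle the tiles with a randomly
    chosen permutation, and return (shuffled tiles, permutation index)."""
    h, w = image.shape[0] // 3, image.shape[1] // 3
    # Cut the image into 9 tiles in row-major order.
    tiles = [image[i * h:(i + 1) * h, j * w:(j + 1) * w]
             for i in range(3) for j in range(3)]
    # Pick one permutation; its index is the classification target.
    label = rng.integers(len(permutations))
    shuffled = [tiles[k] for k in permutations[label]]
    return np.stack(shuffled), label

# Toy usage: N = 100 random permutations (the paper scales this set up to
# control problem "hardness"; in practice a maximal-Hamming-distance set
# is typically used rather than random draws).
rng = np.random.default_rng(0)
perms = np.stack([rng.permutation(9) for _ in range(100)])
img = rng.random((225, 225, 3))
tiles, label = make_jigsaw_example(img, perms, rng)
print(tiles.shape, label)  # (9, 75, 75, 3), index in [0, 100)
```

And a hedged sketch of the colorization soft-encoding step, following the Zhang et al. scheme the paper builds on; the bin grid, k, and sigma values here are illustrative, not the exact settings used:

```python
import numpy as np

def soft_encode_ab(ab_pixels, bin_centers, k=5, sigma=5.0):
    """Soft-encode ab color values over a set of quantized bins.
    Each pixel's probability mass is spread over its k nearest bins
    with Gaussian weights, normalized to sum to 1."""
    # Pairwise distances between pixels (P, 2) and bin centers (Q, 2).
    d = np.linalg.norm(ab_pixels[:, None, :] - bin_centers[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]          # indices of k nearest bins
    weights = np.exp(-np.take_along_axis(d, nearest, axis=1) ** 2 / (2 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)   # normalize per pixel
    soft = np.zeros((ab_pixels.shape[0], bin_centers.shape[0]))
    np.put_along_axis(soft, nearest, weights, axis=1)
    return soft

# Toy usage: two pixels and a coarse 10x10 grid of ab bin centers.
grid = np.stack(np.meshgrid(np.linspace(-110, 110, 10),
                            np.linspace(-110, 110, 10)), -1).reshape(-1, 2)
labels = soft_encode_ab(np.array([[20.0, -30.0], [5.0, 60.0]]), grid, k=5)
print(labels.shape, labels.sum(axis=1))  # (2, 100), rows sum to 1
```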
Results
- Setup: Transfer to image classification on PASCAL VOC2007 using linear SVMs trained on frozen features (see the sketch after this list)
- Increasing data size and model capacity improves transfer learning performance for both jigsaw and colorization
- Jigsaw has better performance, but colorization has better scaling
- Performance gap between AlexNet and ResNet-50 increases as a function of data size
- For jigsaw, transfer learning performance increases as the permutation set grows
- Colorization is less sensitive to scaling the problem complexity
- Scaling along all three axes in combination yields further gains, suggesting that the axes are complementary
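A rough sketch of the linear-SVM transfer evaluation on frozen features, with random placeholder arrays standing in for features extracted from the pre-trained backbone; the shapes and the C value are assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import LinearSVC

# Placeholder frozen features from one layer of the pre-trained backbone,
# plus binary labels for a single class in a one-vs-all setup.
rng = np.random.default_rng(0)
train_feats = normalize(rng.random((1000, 2048)))   # e.g. pooled ResNet-50 res5 features
train_labels = rng.integers(0, 2, size=1000)
test_feats = normalize(rng.random((500, 2048)))
test_labels = rng.integers(0, 2, size=500)

# Only this linear classifier is trained; the backbone stays frozen.
# The cost parameter C would normally be chosen by cross-validation.
clf = LinearSVC(C=0.5)
clf.fit(train_feats, train_labels)
print("test accuracy:", clf.score(test_feats, test_labels))
```

In the actual benchmark one SVM is trained per VOC class and results are reported as mean average precision, so the decision_function scores would feed an AP computation rather than plain accuracy.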
Conclusion
- Seems consistent with more recent contrastive learning methods that demonstrate that scaling up model capacity closes the gap between self-supervised and fully supervised methods
- The range of scaling is relatively limited, especially for model capacity