# Scaling and Benchmarking Self-Supervised Visual Representation Learning

Goyal et al., 2019

## Summary

• While self-supervised learning techniques do not require manual supervision, existing methods fail to exploit this advantage by scaling to large amounts of unlabeled data
• Scales self-supervised learning methods along three axes: data size, model capacity, and problem “hardness”
• Demonstrates that scaling up can allow self-supervised methods to match or surpass ImageNet supervised pre-training on a variety of tasks
• Proposes a benchmark of 9 datasets and tasks with comparable evaluation settings
• Links: [ website ] [ pdf ]

## Background

• The size, quality, and availability of supervised data have become a bottleneck in computer vision
• It’s unclear what happens when we scale up self-supervised learning to 100M or more images
• The lack of standardized evaluation methodology in self-supervised learning makes meaningful comparisons difficult

## Methods

• Jigsaw: divides image into $N=9$ tiles, given one of $|P|$ permutations of these tiles, network must predict the correct permutation
• Colorization: given the $\textit{L}$ channel, predict the $\textit{ab}$ color channels; the $\textit{ab}$ output space is discretized into $Q=313$ bins and the target is soft-encoded using the $K$-nearest neighbor bins
• Scaling up self-supervised learning:
  • Pre-training data size: use random subsets of YFCC-100M: YFCC-[1, 10, 50, 100]M
  • Model capacity: use AlexNet and ResNet-50
  • Problem complexity: $|P| \in \left\{100, 701, 2\text{k}, 5\text{k}, 10\text{k}\right\}$ for jigsaw and $K \in \left\{2, 5, 10, 20, 40, 80, 160, 313\right\}$ for colorization
• Benchmarking suite:
  • Perform self-supervised pre-training on a given pre-training dataset
  • Extract features from various layers of the network
  • Evaluate quality of these features using transfer learning
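The two pretext tasks above can be illustrated in a few lines of NumPy. This is a minimal sketch, not the paper's implementation: the tile layout, Gaussian kernel width, and random (rather than Hamming-distance-selected) permutation set are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_jigsaw_example(image, permutations):
    """Jigsaw pretext task: split `image` (H, W, C) into a 3x3 grid of
    tiles, shuffle them with a randomly chosen permutation, and return
    (tiles, label); the network must predict the permutation index."""
    h, w = image.shape[0] // 3, image.shape[1] // 3
    tiles = [image[r*h:(r+1)*h, c*w:(c+1)*w]
             for r in range(3) for c in range(3)]
    label = rng.integers(len(permutations))
    return np.stack([tiles[i] for i in permutations[label]]), label

def soft_encode(ab, bin_centers, K=5, sigma=5.0):
    """Colorization target: soft-encode an (a, b) color over Q bins using
    its K nearest bin centers, weighted by a Gaussian kernel."""
    d = np.linalg.norm(bin_centers - ab, axis=1)
    nearest = np.argsort(d)[:K]            # K-nearest bins
    w = np.exp(-d[nearest] ** 2 / (2 * sigma ** 2))
    z = np.zeros(len(bin_centers))
    z[nearest] = w / w.sum()               # normalized soft target
    return z

# |P| = 100 permutations sampled at random (the paper selects permutations
# to be maximally distinct in Hamming distance; random sampling is a
# simplification for illustration)
perms = np.array([rng.permutation(9) for _ in range(100)])
tiles, label = make_jigsaw_example(rng.random((99, 99, 3)), perms)

# Q = 313 bin centers in ab space (random stand-ins for the real gamut bins)
bins = rng.uniform(-110, 110, size=(313, 2))
z = soft_encode(np.array([10.0, -20.0]), bins)
```

Scaling problem complexity then amounts to varying `len(permutations)` for jigsaw and `K` for colorization.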

## Results

• Setup: transfer to the image classification task on PASCAL VOC2007 using linear SVMs
• Increasing data size and model capacity improves transfer learning performance for both jigsaw and colorization
• Jigsaw has better performance, but colorization has better scaling
• Performance gap between AlexNet and ResNet-50 increases as a function of data size
• For jigsaw, transfer learning performance increases as the permutation set size $|P|$ increases
• Colorization is less sensitive to problem complexity when scaling $K$
• Combining scaling along all three axes seems to indicate that they are complementary
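As a rough illustration of the evaluation setup, the sketch below fits a one-vs-all linear SVM on frozen features via sub-gradient descent on the hinge loss. Random features stand in for real network activations, and the hand-rolled solver, learning rate, and fixed cost `C` are assumptions; the paper uses standard SVM solvers and tunes the cost on a validation split.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.02, epochs=200):
    """Binary linear SVM with labels y in {-1, +1}, trained by full-batch
    sub-gradient descent on the regularized hinge loss. Returns (w, b)."""
    w, b = np.zeros(X.shape[1]), 0.0
    n = len(X)
    for _ in range(epochs):
        margins = y * (X @ w + b)
        mask = margins < 1                              # margin violators
        grad_w = w / (C * n) - (y[mask] @ X[mask]) / n  # hinge sub-gradient
        grad_b = -y[mask].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Random stand-ins for frozen features extracted from one network layer,
# with a roughly linearly separable binary task (one class vs. the rest)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
y = np.sign(X[:, 0] + 0.1 * rng.standard_normal(200))

w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

In the benchmark itself, one such classifier per class is trained on features from each layer of the frozen network, so feature quality can be compared layer by layer across pre-training methods.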

## Conclusion

• Seems consistent with more recent contrastive learning methods that demonstrate that scaling up model capacity closes the gap between self-supervised and fully supervised methods
• The range of scaling is relatively limited, especially for model capacity