Scaling and Benchmarking Self-Supervised Visual Representation Learning

Goyal et al., 2019


  • Although self-supervised learning techniques require no manual labels, existing work has not exploited this to scale to large amounts of data
  • Scales self-supervised learning methods along three axes: data size, model capacity, and problem “hardness”
  • Demonstrates that scaling up can allow self-supervised methods to match or surpass ImageNet supervised pre-training on a variety of tasks
  • Proposes a benchmark of 9 datasets and tasks with comparable evaluation settings


  • The size, quality, and availability of supervised data have become a bottleneck in computer vision
  • It’s unclear what happens when we scale up self-supervised learning to 100M or more images
  • Evaluation methodology in self-supervised learning is not standardized, which makes meaningful comparisons difficult


  • Self-supervision tasks
    • Jigsaw: divides an image into $N=9$ tiles and shuffles them with one of $|P|$ permutations; the network must predict which permutation was applied
    • Colorization: given the $\textit{L}$ (lightness) channel of an image, the network predicts the $\textit{ab}$ color channels, which are discretized into $Q=313$ bins and soft-encoded over the $K$ nearest neighbor bins
  • Scaling up self-supervised learning
    • Pre-training data size: use random subset of YFCC-100M: YFCC-[1, 10, 50, 100]M
    • Model capacity: use AlexNet and ResNet-50
    • Problem complexity: $|P| \in \left[100, 701, 2k, 5k, 10k\right]$ for jigsaw and $K \in \left[2, 5, 10, 20, 40, 80, 160, 313\right]$ for colorization
  • Benchmarking Suite:
    • Perform self-supervised pre-training on a given pre-training dataset
    • Extract features from various layers of the network
    • Evaluate quality of these features using transfer learning
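
The jigsaw task described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the helper names (`sample_permutation_set`, `split_into_tiles`, `make_jigsaw_example`) are hypothetical, the permutation set is drawn uniformly at random as a simplification, and the toy image is a list of pixel rows whose side is assumed to be divisible by 3.

```python
import random

GRID = 3  # a 3 x 3 grid gives N = 9 tiles


def sample_permutation_set(num_perms, n_tiles=GRID * GRID, seed=0):
    """Draw `num_perms` distinct tile orderings; this set plays the role of P.

    Assumption: permutations are sampled uniformly at random.
    """
    rng = random.Random(seed)
    perms = set()
    while len(perms) < num_perms:
        perms.add(tuple(rng.sample(range(n_tiles), n_tiles)))
    return sorted(perms)


def split_into_tiles(image, grid=GRID):
    """Cut a square image (list of pixel rows) into grid * grid tiles, row-major."""
    tile = len(image) // grid
    return [
        [row[c * tile:(c + 1) * tile] for row in image[r * tile:(r + 1) * tile]]
        for r in range(grid)
        for c in range(grid)
    ]


def make_jigsaw_example(image, perm_set, rng):
    """Return (shuffled_tiles, label); label is the index of the permutation used."""
    label = rng.randrange(len(perm_set))
    tiles = split_into_tiles(image)
    shuffled = [tiles[i] for i in perm_set[label]]
    return shuffled, label


# Toy 6 x 6 "image" whose pixel values are just their flat indices.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
perm_set = sample_permutation_set(num_perms=100)  # |P| = 100
shuffled_tiles, label = make_jigsaw_example(image, perm_set, random.Random(1))
```

The network would receive `shuffled_tiles` and be trained as a $|P|$-way classifier on `label`, so raising `num_perms` is what makes the problem harder.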


  • Setup: Transfer to an image classification task on PASCAL VOC2007 using linear SVMs
  • Increasing data size and model capacity improves transfer learning performance for both jigsaw and colorization
    • Jigsaw has better performance, but colorization has better scaling
    • Performance gap between AlexNet and ResNet-50 increases as a function of data size
  • For jigsaw, transfer learning performance increases as permutation set increases
    • Colorization is less sensitive to problem complexity, which is varied through $K$
  • Scaling along all three axes at once suggests that the axes are complementary
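
To make the role of $K$ in the colorization results more concrete, the sketch below soft-encodes one ground-truth $\textit{ab}$ value over its $K$ nearest bins. It rests on several assumptions that are not taken from these notes: the bin centres are a hypothetical regular grid (not the actual $Q=313$ bins), the weights come from a Gaussian kernel, and `sigma` is an illustrative value. The function names are hypothetical as well.

```python
import math


def make_ab_bins(step=10, lo=-110, hi=110):
    """Hypothetical regular grid of (a, b) bin centres, not the real 313 bins."""
    values = range(lo, hi + 1, step)
    return [(a, b) for a in values for b in values]


def soft_encode(ab, bins, k, sigma=5.0):
    """Spread one ground-truth (a, b) value over its k nearest bins.

    Returns a dict {bin_index: weight} whose weights sum to 1. The Gaussian
    kernel and the value of sigma are assumptions made for this sketch.
    """
    dists = [math.dist(ab, centre) for centre in bins]
    nearest = sorted(range(len(bins)), key=dists.__getitem__)[:k]
    raw = {i: math.exp(-dists[i] ** 2 / (2 * sigma ** 2)) for i in nearest}
    total = sum(raw.values())
    return {i: w / total for i, w in raw.items()}


bins = make_ab_bins()
target_k2 = soft_encode((12.0, -33.0), bins, k=2)
target_k10 = soft_encode((12.0, -33.0), bins, k=10)
```

With a small $K$ the target is concentrated on very few bins; with a larger $K$ the same probability mass is shared among more neighbors, which is the sense in which $K$ changes the prediction problem.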


  • The findings seem consistent with more recent contrastive learning methods, which show that scaling up model capacity narrows the gap between self-supervised and fully supervised methods
  • The range of scaling is relatively limited, especially for model capacity
Elias Z. Wang
AI Researcher | PhD Candidate