Self-supervised learning through the eyes of a child

Orhan et al., 2020



  • How much of early knowledge in children is explained by learning versus innate inductive biases?
    • Specifically for the development of high-level visual categories
  • Apply recent self-supervised learning methods to longitudinal, egocentric video from young children
    • The resulting models learn high-level visual representations


  • Experimental evidence suggests that young children already have a sophisticated understanding of the world
    • Leaves open the old “nature vs. nurture” question
  • Recent self-supervised learning methods can learn powerful visual representations without additional supervision, but they had not been applied to developmentally realistic, longitudinal, egocentric video
    • Prior work (Bambach et al., 2018) shows that naturalistic data collected by toddlers have unique characteristics


  • SAYCam Dataset
    • Approximately 500 hours of video, split across three children, from head-mounted cameras
    • Collected over roughly two years (ages 6-32 months), with 1-2 hours of recording per week
  • Train a MobileNetV2 on the headcam videos with self-supervised objectives, then freeze the trunk and train linear readouts for downstream classification
    • Temporal classification: based on the principle of temporal invariance (higher-level variables change more slowly)
      • Divide the entire dataset into a finite number of temporal classes (episodes) of equal duration, then predict which episode a given frame belongs to
      • Best model used a 5 fps sampling rate, a 288 s segment length, and color and grayscale data augmentations
    • Static contrastive learning: momentum contrast (MoCo) objective
    • Temporal contrastive learning: same objective, but also treats each frame’s two immediate temporal neighbors as positive examples
    • Baselines: random weights, ImageNet pre-trained, HOG features
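The temporal classification labels and the temporal contrastive positives described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes frames are indexed in chronological order within a single stream, and the function names `temporal_class_labels` and `neighbor_positives` are hypothetical. The defaults reuse the 5 fps and 288 s settings noted for the best model.

```python
def temporal_class_labels(num_frames, fps=5.0, segment_seconds=288.0):
    """Assign each frame index to a temporal class (episode) of equal duration.

    Assumes frames are sampled at a constant rate and ordered in time.
    """
    frames_per_segment = int(round(fps * segment_seconds))
    if frames_per_segment <= 0:
        raise ValueError("fps * segment_seconds must give at least one frame")
    return [i // frames_per_segment for i in range(num_frames)]


def neighbor_positives(i, num_frames):
    """Return the indices of frame i's immediate temporal neighbors.

    These would serve as extra positives in a temporal contrastive objective;
    frames at the start or end of the stream have only one neighbor.
    """
    return [j for j in (i - 1, i + 1) if 0 <= j < num_frames]


labels = temporal_class_labels(num_frames=4000)
num_classes = labels[-1] + 1
```

With these defaults each temporal class spans 1,440 consecutive frames, so the class boundaries fall wherever the frame index crosses a multiple of 1,440 rather than at any meaningful change in the scene.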


  • Evaluate on downstream classification tasks:
    • Curated, labeled subset from SAYCam, Child S
    • Toybox dataset: 12 categories, 30 exemplars in each, with 10 different transformations; closer to SAYCam than, say, ImageNet
    • Reduce correlations between the train and test splits by subsampling frames 10x in SAYCam and holding out exemplars in Toybox
  • Results
    • Temporal classification performed better than the two contrastive learning objectives
    • For Toybox with the exemplar split, pre-trained ImageNet performed best by far, and temporal classification was closer to the contrastive learning methods
    • For SAYCam, temporal classification trained on Child S data outperformed pre-trained ImageNet, possibly due to a little overfitting
    • Random weights and HOG performed significantly worse
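The two ways of reducing train-test correlation mentioned in the evaluation setup might look like the following sketch. It assumes SAYCam frames arrive as a chronologically ordered list and that each Toybox item is a dict with hypothetical `category` and `exemplar` keys; the toy items are made up for illustration, and the exact splitting procedure in the paper may differ.

```python
def subsample(frames, factor=10):
    """Keep every `factor`-th frame so adjacent near-duplicate frames are dropped."""
    return frames[::factor]


def exemplar_split(items, held_out):
    """Split items so held-out (category, exemplar) pairs appear only in the test set."""
    train, test = [], []
    for item in items:
        key = (item["category"], item["exemplar"])
        (test if key in held_out else train).append(item)
    return train, test


# Illustrative stand-in for an ordered list of labeled frames.
frames = list(range(25))
kept = subsample(frames)

# Illustrative stand-in for Toybox-style items.
items = [
    {"category": "cup", "exemplar": 1, "frame": "a"},
    {"category": "cup", "exemplar": 2, "frame": "b"},
    {"category": "car", "exemplar": 1, "frame": "c"},
]
train, test = exemplar_split(items, held_out={("cup", 2)})
```

Holding out whole exemplars means the readout is tested on objects it never saw during training, which is a stricter test than a random frame-level split.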


  • Uses the whole dataset at once and doesn’t respect the timescale over which the data were acquired
    • Could implement some sort of curriculum where the portion of data used shifts as training progresses
  • Missing interactive learning, though that might matter less for object classification than for, say, intuitive physics
  • Would be interesting to see at what point higher sampling rates and longer segment lengths start to hurt performance
  • While SAYCam is probably the best dataset to date for these experiments, it covers only a very small fraction (~1%) of a child’s total experience
  • Not clear why temporal classification performs best
    • Classifying frames at the boundary between episodes seems somewhat arbitrary
    • Could imagine some continuum between temporal classification and temporal contrastive learning, where the “positiveness” of frames decreases as their temporal separation increases
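One way to picture the continuum suggested in the last point is a weight on each candidate positive that decays with temporal separation. The sketch below is purely illustrative and not from the paper: the exponential kernel, the `tau_seconds` parameter, and the function name `positive_weight` are all assumptions.

```python
import math


def positive_weight(dt_seconds, tau_seconds=1.0):
    """Soft 'positiveness' of two frames separated by dt_seconds.

    Hypothetical exponential kernel: 1.0 at zero separation, decaying
    toward 0 as the frames move apart in time.
    """
    if tau_seconds <= 0:
        raise ValueError("tau_seconds must be positive")
    return math.exp(-abs(dt_seconds) / tau_seconds)


fps = 5.0      # sampling rate, matching the setting noted for the best model
anchor = 10    # index of the anchor frame
weights = {
    j: positive_weight((j - anchor) / fps)
    for j in range(anchor - 3, anchor + 4)
    if j != anchor
}
```

Under this framing, a very small `tau_seconds` leaves meaningful weight only on the nearest frames, which resembles the temporal contrastive setup, while a flat weight over a whole segment would resemble temporal classification.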
Elias Z. Wang
AI Researcher | PhD Candidate