Toddler-Inspired Visual Object Learning
Bambach et al., 2018
Summary
- Real-world learning systems are limited in the quality and quantity of training datasets they can collect
- Use head-mounted cameras and gaze trackers to collect egocentric images from human toddlers in naturalistic learning contexts
- Child data produces better object models than egocentric adult data
- Child data exhibits unique combination of quality and diversity
Background
- Current machine learning approaches rely on collecting large amounts of data, with generalization ability benefiting from dataset size
- Human children are very efficient learners, able to recognize roughly 300 object categories by age 2 and generalize to novel instances of newly learned labels
- Hypothesis: successful learning in toddlers lies partly in the quality of the visual data they collect in everyday activities, which is more coherent and correlated than typical training data
- Use CNNs to quantify and compare the information content of various datasets
Methods
- Toddlers in a toy-play context (24 toys) learn about objects and their names
- Egocentric video and eye tracker for toddler and adult, as well as 3rd person view
- Final dataset of about 200 minutes of video, 30fps, 480 x 640, 70 degree horizontal FOV
- Separate “standard” image dataset of the 24 toy objects with 128 viewpoints per object, 3072 total images - used for test set
- Manually detect object looks using gaze data; average look duration was ~1-2 seconds, yielding ~200k frames each for child and parent
- Used pretrained YOLO to obtain object bounding boxes in each frame (see the detection sketch after this list)
- Simulate foveated vision by progressively blurring away from the center of gaze (see the foveation sketch after this list)
- Make datasets with different FOVs, from 30 to 70 degrees in 10-degree increments
- Fine-tuned a pretrained VGG on each dataset, weighting the loss by object frequency (see the fine-tuning sketch after this list)
- Test on the “standard” clean object dataset
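A minimal sketch of the bounding-box step, using the modern `ultralytics` package as a stand-in for the YOLO version used in the paper; the weights file and frame layout are illustrative assumptions.

```python
# Sketch: run a pretrained YOLO detector over extracted video frames to get
# object bounding boxes. `ultralytics` is a modern stand-in for the YOLO
# variant used in the paper; paths and weights are hypothetical.
from pathlib import Path

from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # pretrained weights (illustrative choice)

def detect_boxes(frame_dir):
    """Yield (frame_path, boxes) where boxes are [x1, y1, x2, y2] in pixels."""
    for frame in sorted(Path(frame_dir).glob("*.jpg")):
        result = model(str(frame), verbose=False)[0]
        yield frame, result.boxes.xyxy.tolist()
```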
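The foveation and FOV manipulations might look roughly like the sketch below, assuming frames as numpy arrays and OpenCV for blurring. The blur strength, falloff width, and square-crop simplification are my assumptions, not the paper's exact parameters; only the 70-degree horizontal FOV comes from the paper.

```python
# Sketch: (1) simulate foveated vision by blending in a blurred copy of the
# frame with weight increasing away from the gaze point, and (2) simulate a
# narrower FOV by cropping around gaze. Parameters are illustrative.
import cv2
import numpy as np

def foveate(frame, gaze_xy, sigma_px=150.0, ksize=31):
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.hypot(xs - gaze_xy[0], ys - gaze_xy[1])
    alpha = 1.0 - np.exp(-(dist ** 2) / (2 * sigma_px ** 2))  # 0 at gaze, ~1 in periphery
    blurred = cv2.GaussianBlur(frame, (ksize, ksize), 0)
    alpha = alpha[..., None]  # broadcast over color channels
    return (frame * (1 - alpha) + blurred * alpha).astype(frame.dtype)

def crop_fov(frame, gaze_xy, fov_deg, full_fov_deg=70.0):
    """Square crop around gaze covering roughly fov_deg of the scene."""
    h, w = frame.shape[:2]
    px_per_deg = w / full_fov_deg  # camera has a 70-degree horizontal FOV
    half = min(int(fov_deg * px_per_deg / 2), h // 2, w // 2)
    cx = int(np.clip(gaze_xy[0], half, w - half))
    cy = int(np.clip(gaze_xy[1], half, h - half))
    return frame[cy - half:cy + half, cx - half:cx + half]
```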
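A sketch of the fine-tuning setup in PyTorch, assuming 24 output classes and known per-class frame counts. The choice of VGG-16, the frozen layers, the inverse-frequency weighting scheme, and the hyperparameters are assumptions, not the paper's reported configuration.

```python
# Sketch: fine-tune a pretrained VGG-16 for the 24 toy categories, weighting
# the cross-entropy loss inversely to each object's frame frequency so that
# frequent objects don't dominate training. Hyperparameters are illustrative.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 24

def build_model():
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    for p in model.features.parameters():
        p.requires_grad = False  # freeze convolutional layers (assumption)
    model.classifier[6] = nn.Linear(4096, NUM_CLASSES)  # new output head
    return model

def build_loss(class_counts):
    """class_counts[i] = number of training frames for object i."""
    counts = torch.tensor(class_counts, dtype=torch.float)
    weights = counts.sum() / (len(counts) * counts)  # inverse-frequency
    return nn.CrossEntropyLoss(weight=weights)

model = build_model()
criterion = build_loss([1000] * NUM_CLASSES)  # placeholder counts
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9
)
```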
Results
- Differences in the histogram of object sizes
  - Adult data skewed towards smaller objects (<10% of FOV)
  - Child data contains more large objects (>20% of FOV), up to ~50% of FOV
  - ImageNet has many more large objects, up to 100% of image size
- Use GIST features to compare low-level visual similarity; toddler data has a longer tail of dissimilar instances compared to adult data
- ImageNet is generally more variable since it contains many distinct instances per object category
- VGG fine-tuned on child data performs better than on adult data across FOVs, with or without blurring
- Separated child data into two datasets with objects smaller or larger than the median size, based on YOLO bounding boxes (see the size-split sketch after this list)
  - The larger-object dataset resulted in higher performance
- Create "similar", "diverse", and "original" datasets based on GIST features of the bounding-box-cropped objects (see the subset-selection sketch after this list)
  - Similar contained the 25% of instances with minimum total pairwise distance
  - Diverse contained the 25% with maximum distance
  - Original contained a random subset of the original dataset
- Original performed best, and diverse better than similar
- Blurring, from simulated foveated vision, generally hurts performance
  - Only helps in adult data with large FOV (>50 degrees)
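The size-based split could be implemented as below, assuming one detected box per frame is already available; the `(image, box)` pair format is a hypothetical convention for this sketch.

```python
# Sketch: split frames into "small object" and "large object" subsets by the
# median fraction of the frame occupied by the detected bounding box.
import numpy as np

def split_by_object_size(frames):
    """frames: list of (image, (x1, y1, x2, y2)) pairs (hypothetical format)."""
    def area_frac(item):
        img, (x1, y1, x2, y2) = item
        h, w = img.shape[:2]
        return (x2 - x1) * (y2 - y1) / (h * w)

    fracs = np.array([area_frac(f) for f in frames])
    median = np.median(fracs)
    small = [f for f, a in zip(frames, fracs) if a <= median]
    large = [f for f, a in zip(frames, fracs) if a > median]
    return small, large
```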
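The similar/diverse subset construction might look like the greedy sketch below, which takes precomputed GIST descriptors as input. The paper's exact optimization is not specified here, so treat this as one plausible heuristic for approximately minimizing or maximizing total pairwise distance.

```python
# Sketch: build "similar" and "diverse" training subsets from GIST features
# by greedily dropping instances until 25% remain. "diverse" drops the point
# contributing least to the total pairwise distance; "similar" drops the most.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def select_subset(gist_features, frac=0.25, mode="diverse"):
    """gist_features: (n, d) array, one GIST descriptor per cropped object."""
    dists = squareform(pdist(gist_features))
    keep = list(range(len(gist_features)))
    target = int(frac * len(gist_features))
    while len(keep) > target:
        sub = dists[np.ix_(keep, keep)]
        totals = sub.sum(axis=1)  # each instance's summed distance to the rest
        idx = totals.argmin() if mode == "diverse" else totals.argmax()
        keep.pop(int(idx))
    return keep  # indices of the selected instances
```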
Conclusion
- Data (images) gathered by toddlers during play results in better performance than adult data
- Child data differs from adult data in object size and instance diversity, both of which contribute to the performance gain
  - More large objects
  - More diverse instances
- The modeling pipeline gives the CNN more information than toddlers actually have access to
  - Requires (manual) labeling of objects in each frame
- Not clear that the better performance of the "big objects" dataset isn't just due to the test set also containing mostly large objects
  - Test performance with the original distribution of object sizes (both big and small) is not shown
  - Would having both large and small objects help on a more "realistic" test set
- Would be nice to see whether results hold when training from scratch
  - Data needed for fine-tuning might be different
- Other tasks besides object recognition would be interesting
- How well would a “standard” dataset compare to toddler and adult data
- Can these principles be used to design better “standard” datasets