Uses bijective networks to identify large subspaces of invariance-based adversarial vulnerability and introduces the independence cross-entropy loss, which partially alleviates it.
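A minimal sketch of how such an objective could be set up, assuming the logits of an invertible network are split into a semantic part and a nuisance part, with the nuisance readout trained adversarially; names, shapes, and the exact minimax formulation here are illustrative, not the paper's precise loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def independence_ce_terms(z, y, num_classes, nuisance_head):
    # Split the logits of an invertible network into a semantic part z_s
    # (one dimension per class) and a nuisance part z_n (everything else).
    z_s, z_n = z[:, :num_classes], z[:, num_classes:]
    semantic_ce = F.cross_entropy(z_s, y)  # classify from z_s as usual
    nuisance_ce = F.cross_entropy(nuisance_head(z_n), y)
    # Minimax idea: the nuisance head minimizes nuisance_ce, while the
    # encoder maximizes it, pushing label information out of the nuisance
    # subspace so the network cannot be invariant to semantic changes there.
    return semantic_ce, nuisance_ce

# Illustrative shapes: batch of 8, 64-dim logits, 10 classes.
z = torch.randn(8, 64, requires_grad=True)
y = torch.randint(0, 10, (8,))
head = nn.Linear(64 - 10, 10)
sem, nui = independence_ce_terms(z, y, 10, head)
```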
Demonstrates that scaling up self-supervised methods along data size, model capacity, and problem complexity enables them to match or surpass ImageNet supervised pre-training on a variety of tasks.
Applies task-agnostic, web-scale pre-training to computer vision using natural language supervision, enabling powerful zero-shot transfer to many datasets.
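A minimal zero-shot classification sketch in this spirit, assuming OpenAI's `clip` package is installed; the model variant, prompt template, class names, and image path are all illustrative:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # variant is an assumption

# Build a zero-shot classifier purely from class names via a prompt template.
class_names = ["dog", "cat", "car"]  # illustrative labels
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each class prompt, then softmax.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```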
Demonstrates that humans use scene information to guide search towards likely target sizes, resulting in higher miss rates for mis-scaled targets; object detection DNNs show no such effect.
Presents the idea of using hyperbolic embeddings for hierarchical representations and provides experiments on classifying actions within an action hierarchy.
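For concreteness, the geodesic distance on the Poincaré ball that such hyperbolic embeddings typically rely on (a plain NumPy sketch; the embedding coordinates below are made up):

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between points u, v inside the unit Poincare ball.
    Distances grow rapidly near the boundary, which is what lets tree-like
    hierarchies embed with low distortion: parents sit near the origin,
    leaves near the rim."""
    duv = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u * u)) * (1.0 - np.sum(v * v)) + eps
    return np.arccosh(1.0 + 2.0 * duv / denom)

# Illustrative: a "parent" near the origin vs. two "leaves" near the boundary.
root = np.array([0.01, 0.0])
leaf_a = np.array([0.95, 0.0])
leaf_b = np.array([-0.95, 0.0])
print(poincare_distance(root, leaf_a), poincare_distance(leaf_a, leaf_b))
```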
Evaluating the object recognition performance of humans and CNNs on images with varying levels of shape and texture cues reveals contrasting biases (humans favor shape, ImageNet-trained CNNs favor texture); the CNN texture bias can be partially alleviated by training with stylized images.
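A small sketch of the cue-conflict shape-bias metric used in this line of work: among predictions that match either the shape class or the texture class of a cue-conflict image, count the fraction that follow shape (function and label names are illustrative):

```python
def shape_bias(preds, shape_labels, texture_labels):
    """Fraction of shape-consistent decisions among all decisions that match
    either cue. Near 1.0 means shape-biased (human-like); near 0.0 means
    texture-biased."""
    shape_hits = texture_hits = 0
    for p, s, t in zip(preds, shape_labels, texture_labels):
        if p == s:
            shape_hits += 1
        elif p == t:
            texture_hits += 1
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else float("nan")

# Illustrative call: predictions on cat-shape / elephant-texture images.
print(shape_bias(["cat", "elephant", "cat"], ["cat"] * 3, ["elephant"] * 3))
```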
Applies self-supervised learning algorithms to developmentally realistic, longitudinal, egocentric video from young children and demonstrates the emergence of high-level visual representations.