A large-scale empirical investigation of scaling laws shows that performance follows a power law in model size, dataset size, and training compute, while architectural details have minimal effect.
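The power-law relationship means the loss is a straight line in log-log space, so the scaling exponent can be recovered by linear regression. A minimal sketch on synthetic data (the exponent 0.076 echoes the reported model-size exponent; the data itself is fabricated for illustration):

```python
import numpy as np

# Hypothetical constants for illustration; not fit to real training runs.
alpha_true, n_c = 0.076, 8.8e13
sizes = np.logspace(6, 11, 20)        # model sizes N (parameters)
loss = (n_c / sizes) ** alpha_true    # idealized losses L(N) = (N_c / N)**alpha

# log L = -alpha * log N + alpha * log N_c, so the slope gives -alpha.
slope, intercept = np.polyfit(np.log(sizes), np.log(loss), 1)
alpha_fit = -slope
print(round(alpha_fit, 3))            # recovers the exponent, ~0.076
```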
Evaluating object recognition in humans and CNNs on images with conflicting shape and texture cues reveals contrasting biases: humans rely primarily on shape, while CNNs rely on texture. Training CNNs on stylized images partially alleviates this texture bias.
Applies self-supervised learning algorithms to developmentally realistic, longitudinal, egocentric video from young children and demonstrates the emergence of high-level visual representations.
Proposes a new set of ImageNet labels that address the limitations of the original labels resulting from multiple objects in a single image and synonymous labels.
Trained on a large-scale repository of Visual Commonsense Graphs, VisualCOMET, a single-stream vision-language transformer, generates inferences about past and future events by integrating image information with textual descriptions of the present event and location.
The Transformer, a sequence transduction model that replaces recurrent layers entirely with attention mechanisms, achieves a new state of the art on machine translation while significantly reducing training time.
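The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch with illustrative shapes (the function name and dimensions are assumptions, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # scores: how strongly each query attends to each key,
    # scaled by sqrt(d_k) to keep dot products in a stable range
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # row-wise softmax over keys (max-subtraction for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output is a weighted sum of the value vectors
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # one output vector per query: (4, 8)
```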
Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space that captures object and contextual information.
Combines a compositional rendering network with a recurrent interaction network to learn dynamics in scenes with significant occlusion, but relies on ground-truth object positions and segmentations.
RELATE builds on the interpretable, structured latent parameterization of BlockGAN, modeling correlations between object parameters to generate realistic videos of dynamic scenes from raw, unlabeled data.