A large-scale empirical investigation of scaling laws shows that performance follows a power law in model size, dataset size, and training compute, while architectural details have minimal effect.
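The power-law relationship means the loss is a straight line in log-log space, so the scaling exponent can be recovered by linear regression. A minimal sketch on synthetic data (the exponent 0.076 echoes the reported model-size exponent; the data itself is fabricated for illustration):

```python
import numpy as np

# Hypothetical constants for illustration; not fit to real training runs.
alpha_true, n_c = 0.076, 8.8e13
sizes = np.logspace(6, 11, 20)        # model sizes N (parameters)
loss = (n_c / sizes) ** alpha_true    # idealized losses L(N) = (N_c / N)**alpha

# log L = -alpha * log N + alpha * log N_c, so the slope gives -alpha.
slope, intercept = np.polyfit(np.log(sizes), np.log(loss), 1)
alpha_fit = -slope
print(round(alpha_fit, 3))            # recovers the exponent, ~0.076
```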
Evaluating object recognition in humans and CNNs on images with conflicting shape and texture cues reveals contrasting biases: humans rely primarily on shape, while CNNs rely on texture. Training CNNs on stylized images partially alleviates this texture bias.
Applies self-supervised learning algorithms to developmentally realistic, longitudinal, egocentric video from young children and demonstrates the emergence of high-level visual representations.
Proposes a new set of ImageNet labels that address the limitations of the original labels resulting from multiple objects in a single image and synonymous labels.
Trained on a large-scale repository of Visual Commonsense Graphs, VisualCOMET, a single-stream vision-language transformer, generates inferences about past and future events by integrating image information with textual descriptions of the present event and location.
The Transformer, a sequence transduction model that replaces recurrent layers entirely with attention mechanisms, achieves a new state of the art on machine translation while significantly reducing training time.
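The core operation is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A minimal NumPy sketch with illustrative shapes (the function name and dimensions are assumptions, not from the paper):

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # scores: how strongly each query attends to each key,
    # scaled by sqrt(d_k) to keep dot products in a stable range
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # row-wise softmax over keys (max-subtraction for numerical stability)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output is a weighted sum of the value vectors
    return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # one output vector per query: (4, 8)
```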
Region Proposal Interaction Networks (RPIN) learn to reason about object trajectories in a latent region-proposal feature space that captures object and contextual information.
Combines a compositional rendering network with a recurrent interaction network to learn dynamics in scenes with significant occlusion, but relies on ground-truth object positions and segmentations.
RELATE builds on the interpretable, structured latent parameterization of BlockGAN, modeling correlations between object parameters to generate realistic videos of dynamic scenes from raw, unlabeled data.