Applies task-agnostic, web-scale pre-training to computer vision using natural language supervision, enabling powerful zero-shot transfer to many datasets.
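A rough sketch of how zero-shot transfer via natural language supervision can work: an image is scored against text prompts naming each candidate class by embedding similarity. Here `encode_image` and `encode_text` are hypothetical stand-ins (random unit vectors) for the paper's pretrained encoders, and the prompt template and temperature constant are illustrative, not the paper's exact choices.

```python
import numpy as np

EMBED_DIM = 512  # illustrative embedding size

def encode_image(image) -> np.ndarray:
    # Placeholder for a pretrained image encoder; returns a random unit vector
    # so the sketch runs end to end.
    rng = np.random.default_rng(hash(image) % (2**32))
    v = rng.normal(size=EMBED_DIM)
    return v / np.linalg.norm(v)

def encode_text(prompt: str) -> np.ndarray:
    # Placeholder for a pretrained text encoder.
    rng = np.random.default_rng(hash(prompt) % (2**32))
    v = rng.normal(size=EMBED_DIM)
    return v / np.linalg.norm(v)

def zero_shot_classify(image, class_names):
    """Score each class by cosine similarity between the image embedding
    and the embedding of a natural-language prompt naming the class."""
    img = encode_image(image)
    prompts = [f"a photo of a {name}" for name in class_names]
    txt = np.stack([encode_text(p) for p in prompts])   # (num_classes, dim)
    logits = 100.0 * (txt @ img)                         # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return dict(zip(class_names, probs))

print(zero_shot_classify("example.jpg", ["dog", "cat", "car"]))
```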
A large-scale empirical investigation of scaling laws shows that performance follows a power law in model size, dataset size, and training compute, while architectural details have minimal effect.
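A minimal sketch of the power-law form described above, with made-up constants rather than the paper's fitted values: if loss follows L(N) = (N_c / N)^α in model size N, the exponent can be recovered by a straight-line fit in log-log space.

```python
import numpy as np

# Illustrative power law L(N) = (N_c / N) ** alpha with made-up constants;
# analogous laws apply to dataset size and training compute.
alpha_true, N_c = 0.08, 1e13

model_sizes = np.logspace(6, 10, 20)              # 1M to 10B parameters
loss = (N_c / model_sizes) ** alpha_true
loss *= np.exp(np.random.default_rng(0).normal(scale=0.01, size=loss.size))  # noise

# A power law is a straight line in log-log space: log L = -alpha * log N + const.
slope, intercept = np.polyfit(np.log(model_sizes), np.log(loss), 1)
print(f"recovered exponent alpha ~= {-slope:.3f} (true value {alpha_true})")
```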
The Transformer, a sequence transduction model that dispenses with recurrent layers and relies entirely on attention mechanisms, achieves new state-of-the-art results on machine translation while significantly reducing training time.
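A minimal NumPy sketch of scaled dot-product attention, the core operation of the architecture: softmax(QK^T / sqrt(d_k)) V. Multi-head projections, masking, and positional encodings are omitted, and the shapes below are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, q_len, k_len)
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ V                                   # (batch, q_len, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 5, 64))   # batch of 2, 5 query positions, d_k = 64
K = rng.normal(size=(2, 7, 64))   # 7 key positions
V = rng.normal(size=(2, 7, 64))
print(scaled_dot_product_attention(Q, K, V).shape)       # (2, 5, 64)
```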
*Stanford Natural Language Processing with Deep Learning (CS 224N) Project*, 2017