Learning Transferable Visual Models From Natural Language Supervision

Radford et al., 2021

Summary

Most computer vision systems are trained to predict a fixed, predetermined set of output categories, which limits their ability to perform zero-shot transfer
- Previous zero-shot transfer results have been weak on complex benchmarks (e.g. ImageNet)
Use large-scale natural language supervision (400 million examples) to facilitate zero-shot transfer
Able to match performance of ResNet-50 on ImageNet zero-shot, as well as on many other datasets
Zero-shot CLIP models are more robust than supervised ImageNet models of equivalent accuracy

Within NLP, recent results demonstrate that pre-training on web-scale collections of text surpasses training on high-quality, crowd-labeled datasets
- However, scalable pre-training methods in computer vision have not yet reached competitive performance – largely due to a lack of scale
- Previous work used either MS-COCO (~100,000 training images), Visual Genome (also small), or YFCC100M (sparse and variable quality metadata)
Contrastive Language-Image Pre-training (CLIP) efficiently learns from natural language supervision on a dataset of 400 million (image, text) pairs
- Motivated by the idea of learning perception from supervision contained within natural language

Creating a sufficiently large dataset, WebImageText (WIT): similar total word count to WebText, which was used to train GPT-2
- Search for (image, text) pairs whose text includes one of 500,000 queries
Efficient pre-training method: use contrastive objective instead of predictive objective
- Given a batch of $N$ (image, text) pairs, trained to predict which of the $N \times N$ possible (image, text) pairings across the batch actually occurred
- Train from scratch, with linear projection from each encoder to the multi-modal embedding space
- Only use random crop for data augmentation
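
The contrastive objective above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: it takes precomputed encoder outputs, and the function names and the fixed `temperature=0.07` are illustrative choices that do not come from these notes. Row $i$ of each input is assumed to belong to the same (image, text) pair, so the correct pairings lie on the diagonal of the $N \times N$ similarity matrix.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_style_loss(image_features, text_features, w_image, w_text, temperature=0.07):
    """Symmetric cross-entropy over the N x N similarity matrix of one batch.

    image_features: (N, d_i) image encoder outputs.
    text_features:  (N, d_t) text encoder outputs.
    w_image, w_text: linear projections into a shared d_e-dimensional space.
    temperature: fixed here for simplicity (an assumption of this sketch).
    """
    img = l2_normalize(image_features @ w_image)  # (N, d_e)
    txt = l2_normalize(text_features @ w_text)    # (N, d_e)
    logits = img @ txt.T / temperature            # (N, N) scaled cosine similarities

    diag = np.arange(logits.shape[0])             # true pairs sit on the diagonal
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_image_to_text + loss_text_to_image) / 2
```

The loss is averaged over both directions (each image picks its text, each text picks its image), so a batch whose embeddings are correctly aligned scores lower than the same batch with its texts shuffled.
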
Choosing and scaling a model:
- Use a modified ResNet-50 or ViT for the image encoder and a Transformer for the text encoder
- Scale image encoder along width, depth, and resolution and text encoder along only width
For best model, pre-train at higher 336 pixel resolution for an additional epoch (similar to FixRes)

Zero-shot Transfer with CLIP: For each dataset, predict most probable (image, text) pair using text generated from class names
- Text encoder can be viewed as hypernetwork that generates weights of a linear classifier, on top of the image encoding, using the text
- Improves zero-shot ImageNet accuracy from 11.5% (Visual N-Grams) to 76.2%, matching the original ResNet-50
Using a prompt template, e.g. “A photo of a {label}.”, improves performance by a couple of percentage points, and ensembling prompts provides additional gains
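
The zero-shot procedure above (class names turned into prompts, text embeddings used as the weights of a linear classifier, several prompts ensembled) can be sketched as follows. This is a toy illustration, not the paper's code: `ToyTextEncoder` is a hypothetical bag-of-words stand-in for a trained text encoder, the second template is made up, and averaging the normalized prompt embeddings per class is one simple way to ensemble, assumed here for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


class ToyTextEncoder:
    """Hypothetical stand-in for a trained text encoder (bag-of-words counts).

    Each new word gets the next free dimension, so the vocabulary must stay
    smaller than `dim`. A real encoder would be a trained Transformer.
    """

    def __init__(self, dim=32):
        self.dim = dim
        self.vocab = {}

    def __call__(self, text):
        vec = np.zeros(self.dim)
        for word in text.lower().replace(".", "").split():
            idx = self.vocab.setdefault(word, len(self.vocab))
            vec[idx] += 1.0
        return vec


def build_zero_shot_weights(class_names, templates, encode_text):
    """Return a (num_classes, dim) matrix: one classifier weight row per class."""
    rows = []
    for name in class_names:
        prompts = [t.format(label=name) for t in templates]
        embeddings = l2_normalize(np.stack([encode_text(p) for p in prompts]))
        rows.append(l2_normalize(embeddings.mean(axis=0)))  # ensemble the prompts
    return np.stack(rows)


def zero_shot_predict(image_embeddings, class_weights):
    """Pick, for each image embedding, the class with the highest cosine similarity."""
    logits = l2_normalize(image_embeddings) @ class_weights.T
    return logits.argmax(axis=1)


encoder = ToyTextEncoder()
classes = ["dog", "cat", "car"]
templates = ["A photo of a {label}.", "A picture of the {label}."]
weights = build_zero_shot_weights(classes, templates, encoder)

# Toy "image embeddings": reuse the encoder so they live in the same space.
images = np.stack([encoder("dog"), encoder("car")])
print(zero_shot_predict(images, weights))  # [0 2]
```

No classifier is trained here: swapping in a different list of class names yields a new classifier immediately, which is what makes the text encoder behave like a hypernetwork.
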
On 16 out of 27 datasets zero-shot CLIP outperforms linear probe on pre-trained ResNet-50
- CLIP does poorly on specialized, complex, or abstract datasets (e.g. satellite images, tumors, synthetic scenes)
- Zero-shot data efficiency (the number of labeled examples per class a linear probe on the same features needs to match zero-shot performance) varies from less than 1 to 184 across the datasets
Evaluating on natural distribution shifts: ImageNetV2, ImageNet Sketch, ImageNet-Vid, ObjectNet, ImageNet Adversarial, ImageNet Rendition
- ResNet-101 makes 5 times as many mistakes on these natural distribution shifts as on the ImageNet validation set
- Zero-shot CLIP improves robustness to distribution shift, reducing the gap between ImageNet accuracy and accuracy under distribution shift by up to 75%
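
To make the "gap" arithmetic concrete, here is a small sketch that treats the gap as in-distribution accuracy minus accuracy under distribution shift, and the reduction as the relative shrinkage of that gap. The function names are illustrative, and the numbers in the usage lines are made up purely to show the calculation; they are not results from the paper.

```python
def accuracy_gap(in_distribution_acc, shifted_acc):
    """Accuracy lost when moving from the original test set to a shifted one."""
    return in_distribution_acc - shifted_acc


def relative_gap_reduction(baseline_gap, model_gap):
    """Fraction of the baseline's gap that the other model closes."""
    if baseline_gap <= 0:
        raise ValueError("baseline_gap must be positive")
    return (baseline_gap - model_gap) / baseline_gap


# Hypothetical accuracies in percent, chosen only for illustration.
baseline_gap = accuracy_gap(76.0, 36.0)  # supervised baseline loses 40 points
model_gap = accuracy_gap(76.0, 66.0)     # zero-shot model loses 10 points
print(relative_gap_reduction(baseline_gap, model_gap))  # 0.75
```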

CLIP’s zero-shot performance is still weak on some tasks, especially where the data is truly out-of-distribution
Still limited to choosing from a fixed set of concepts for each task, versus a more flexible approach like generating image captions
Still uses a lot of data (400 million examples), although acquiring it is relatively cheap
Training on unfiltered images and text from the internet results in the model learning many social biases