ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
Geirhos et al., 2019
Summary
- CNNs are commonly thought to recognize objects by building increasingly complex shape representations, but recent evidence suggests textures play a larger role
- Evaluate CNNs and humans on images with texture-shape cue conflict
- Show that ImageNet-trained CNNs are biased towards recognizing textures rather than shapes, in contrast to humans
- Training CNNs on “Stylized-ImageNet” improves shape representation, better matches human behavior, and improves object detection performance
- Links: [ website ] [ pdf ]
Background
- Object shape is the most important cue for human object recognition, surpassing others like size or texture (Landau et al., 1988)
- Shape Hypothesis: CNNs combine low-level features into increasingly complex shapes until the object can be classified
- Texture Hypothesis: Object textures are more important than shapes for CNN object recognition
- The Shape Hypothesis is contradicted by empirical findings demonstrating the importance of texture in object recognition
- CNNs can classify texturized images even when global shape structure is destroyed
- Images with texture-shape cue conflict enable quantification of texture versus shape biases in both CNNs and humans
Methods
- Psychophysics experiments: fixation, 300 ms stimulus presentation, backward masking with pink noise, then a forced choice among 16 categories
- Four CNNs pre-trained on ImageNet: AlexNet, GoogLeNet, VGG16, and ResNet-50
- Main Datasets: Only used object and texture images that were correctly classified by all four CNNs
- Original: 160 natural color images of objects (10 per category) with white background
- Greyscale: Original images converted to greyscale
- Silhouette: Original images showing black object on white background
- Edges: Original images with a Canny edge extractor applied
- Texture: 48 natural color images of textures (3 per category) – full-width patches of animals or many repetitions of the same object
- Cue Conflict: 1280 images (80 per category) by applying iterative style transfer between an image from Texture (as style) and Original (as content)
- Stylized-ImageNet (SIN): apply AdaIN style transfer to each ImageNet (IN) image using Kaggle’s Painter by Numbers (79,434 paintings) dataset for style images
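Both image-generation steps rely on style transfer: Cue Conflict images are made with the slower iterative method, while SIN uses the fast AdaIN method (Huang & Belongie, 2017). Below is a minimal sketch of the core AdaIN operation only, not the authors' full pipeline, which applies this to VGG encoder features and decodes the result back to an image; the tensor shapes and `eps` constant here are illustrative assumptions.

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization: re-normalize each channel of the
    content features to match the per-channel mean/std of the style
    features. Inputs are (N, C, H, W) encoder activations."""
    # Per-channel statistics over the spatial dimensions
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps  # avoid div-by-zero
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True)
    # AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y)
    return s_std * (content_feat - c_mean) / c_std + s_mean
```

Stylizing every IN training image with a randomly chosen painting replaces local texture cues while largely preserving global object shape.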
Results
- Original, Texture, and Greyscale images were generally recognized equally well by both CNNs and humans
- However, for Silhouette and Edges images with little to no texture information, CNNs performed much worse than humans
- CNN performance on Edges was significantly worse than both CNN performance on Silhouette and human performance on Edges
- The rank order of CNN performance was roughly reversed between Silhouette and Edges, with ResNet-50 (the best model on ImageNet) performing best on Silhouette and worst on Edges
- For Cue Conflict images, only decisions matching either the shape or the texture category were analyzed (a computation sketch follows this list); a substantial fraction of images were hard to recognize for both humans and CNNs
- Humans show a strong shape bias (~95% shape decisions)
- CNNs show a clear texture bias (~50-80% texture decisions)
- Stylized-ImageNet:
- BagNets, which by design have a limited maximum receptive field size, perform poorly on SIN, indicating that local texture features are no longer sufficient (as designed and expected)
- SIN is a harder task than IN (79.0% vs. 92.9% top-5 accuracy)
- Models trained on SIN transfer well to IN, but not vice versa
- A SIN-trained ResNet-50 has a much stronger shape bias than its IN-trained counterpart (81% vs. 22%)
- Jointly training on SIN + IN and then fine-tuning on IN improves both classification (IN) and transfer learning (Pascal VOC) performance (a training sketch follows this list)
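The shape-bias fractions above are computed over cue-conflict trials whose response matched either cue. A minimal sketch of that computation, assuming per-trial category predictions plus each image's shape and texture labels (the names here are mine, not the authors'):

```python
from typing import Sequence

def shape_bias(preds: Sequence[str],
               shape_labels: Sequence[str],
               texture_labels: Sequence[str]) -> float:
    """Fraction of shape decisions among all cue-conflict trials where
    the prediction matched either the shape or the texture category;
    trials matching neither cue are discarded."""
    shape_hits = sum(p == s for p, s in zip(preds, shape_labels))
    texture_hits = sum(p == t for p, t in zip(preds, texture_labels))
    return shape_hits / (shape_hits + texture_hits)
```

A value of 1.0 means purely shape-driven decisions and 0.0 purely texture-driven; humans land near 0.95, ImageNet-trained CNNs near 0.2-0.5.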
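And a rough PyTorch sketch of the two-stage recipe in the last bullet (joint SIN + IN training, then IN fine-tuning). The paths, batch size, and optimizer settings are illustrative assumptions, not the authors' exact hyperparameters.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader
from torchvision import datasets, models, transforms

tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# SIN mirrors IN's 1000 class folders, so labels align across datasets.
# (Hypothetical local paths.)
joint = ConcatDataset([
    datasets.ImageFolder("data/imagenet/train", tf),
    datasets.ImageFolder("data/stylized-imagenet/train", tf),
])
loader = DataLoader(joint, batch_size=256, shuffle=True, num_workers=8)

model = models.resnet50()
opt = torch.optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

# Stage 1: joint training on the shuffled SIN + IN mixture.
for images, labels in loader:
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()

# Stage 2 (not shown): fine-tune the same model on IN alone at a
# lower learning rate before evaluating classification and transfer.
```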
Conclusion
- It’s possible that restricting images to those all four models classify correctly confounds the cue-conflict results
- A better control might be to match each model’s performance on object and texture images without requiring ceiling performance on both
- Also, cue-conflict images vary along a continuum of texture versus shape salience; how do we quantify and explicitly account for this?
- Is the emergence of texture bias in ImageNet-trained CNNs a result of the data or architecture?
- I.e., is ImageNet more easily solvable by texture cues than “real data”? If not, why do humans develop a shape bias?
- Or is it that current CNN architectures and learning rules result in texture representation over shape?
- Cue Conflict images are inherently unnatural