Partial success in closing the gap between human and machine vision

Robert Geirhos (University of Tübingen)

Robert Geirhos is a postdoc at the University of Tübingen, where he works with Felix Wichmann and Wieland Brendel. He recently obtained his PhD, summa cum laude, from the International Max Planck Research School for Intelligent Systems. Robert holds an MSc degree in Computer Science, with distinction, and a BSc degree in Cognitive Science from the University of Tübingen. His studies were complemented by exchange semesters and research stints at the University of Glasgow and the University of Amsterdam, as well as a research internship at Meta AI (FAIR team). In his research, Robert aims to develop a better understanding of the hypotheses, biases and assumptions of modern machine vision systems, and to use this understanding to make them more robust, interpretable and reliable.

Short Abstract: A few years ago, the first Convolutional Neural Network (CNN) surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines “in the wild” and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding distortion robustness gap between humans and CNNs is closing, with the best models now exceeding human feedforward performance on most of the investigated OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree with one another in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data and evaluation code are provided as a toolbox and benchmark.

Based on this NeurIPS paper: