We have two key concerns about the validity of the Eyes Test. Our first concern centers on the contention that people can identify complex mental states (e.g., “contemplative” and “preoccupied”) from decontextualized, static photographs of pairs of eyes. To appreciate the force of this concern, it is useful to consider how the test was constructed.
The researchers who created the test selected photographs of pairs of eyes from magazines. Next, they picked a “correct” mental-state descriptor along with three “incorrect” response options for each photograph. They trialed these candidate items with a group of eight judges. If at least five of these judges selected the “correct” response and no more than two selected the same “incorrect” response, the mental-state descriptors were retained. Otherwise, the researchers modified the response options and trialed the items with new groups of judges until these criteria were met for 40 candidate items.
Critically, the investigators never asked the judges for their opinion on the “correct” responses or asked them to generate their own responses. Instead, the researchers verified the appropriateness of the responses by administering the candidate items to a group of 225 participants to test slightly modified consensus criteria. All but four items met these criteria, resulting in the 36-item Eyes Test.
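To make the retention rule used in the eight-judge trials concrete, here is a minimal illustrative sketch in Python. It is not the researchers’ actual procedure or code, and the example data are hypothetical; it simply encodes the two criteria stated above (at least five of the eight judges select the “correct” descriptor, and no more than two select the same “incorrect” one). The slightly modified criteria used with the 225 participants are not specified here, so they are not included.

```python
from collections import Counter

def item_retained(judge_choices, correct_option):
    """Illustrative check of the stated retention rule for one candidate item.

    judge_choices: the descriptors chosen by the eight judges
    correct_option: the descriptor the test designers deemed "correct"
    """
    counts = Counter(judge_choices)
    enough_correct = counts[correct_option] >= 5   # at least 5 of 8 judges pick the "correct" descriptor
    foils_ok = all(n <= 2 for option, n in counts.items()
                   if option != correct_option)    # no "incorrect" descriptor picked by more than 2 judges
    return enough_correct and foils_ok

# Hypothetical item: six judges choose "contemplative," two choose "preoccupied"
print(item_retained(["contemplative"] * 6 + ["preoccupied"] * 2, "contemplative"))  # True: item retained
```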
Remarkably, it wasn’t until 2019 that a study was published that asked people to generate their own responses for each photograph in the test. In this case, less than 10 percent of responses were similar in meaning to the “correct” responses, and only 40 percent of responses even shared the same affective valence (i.e., positive, negative or neutral) as the “correct” responses.
These outcomes contrasted sharply with the “correct” response rate of approximately 70 percent for the standard multiple-choice Eyes Test. These highly divergent results suggest that the apparent consensus about the mental states of the people pictured in the standard version of the Eyes Test is illusory and relies heavily on the forced-choice response format.
Several other studies have provided evidence that apparent consensus should not be taken to mean that test-takers think the response they select is “correct.” For example, a 2011 study asked people to rate the affective valence of the photographs from the Eyes Test in the absence of the four mental-state response options. The average valence rating for the item displayed above was positive. Yet the four response choices all have a negative valence: “aghast,” “baffled,” “distrustful” and “terrified.”
Nonetheless, when constrained by the forced-choice format, many participants in Eyes Test studies choose the “correct” response, “distrustful.” Quite possibly, they use a process of elimination to select the “least bad” of four bad options, a process that is quite different from making inferences about people’s mental states in the real world.
As these two studies illustrate, it is far from clear that low Eyes Test scores indicate poor performance at correctly identifying mental states and, thus, evidence of a theory-of-mind deficit.
Our second key concern about the Eyes Test is the lack of empirical validity evidence. Given that psychological attributes cannot be observed directly, psychologists assess the validity of test scores by looking for multiple converging sources of indirect evidence.
In our March review article, we surveyed 1,461 studies that administered the Eyes Test. We then identified which of these studies reported evidence from any of six key categories of validity evidence. Strikingly, 63 percent of the studies did not provide evidence from any of these six categories. And when evidence was reported, it frequently indicated poor validity.
This paucity of evidence falls well short of the guidelines published by the American Psychological Association, which strongly recommend that every empirical study report multiple sources of validity evidence for the measures used.
In our review, one of the most frequently reported sources of validity evidence was autistic people’s lower performance on the Eyes Test compared with that of non-autistic people. Indeed, this evidence was central to the original 2001 study, which showed that the average Eyes Test score among 15 autistic participants was lower than the average score for non-autistic participants.
This focus on group differences is an example of what psychologists call “known-group” validity evidence, where the ability of test scores to discriminate between groups that are known to differ on a particular psychological attribute supports the interpretation that the test scores measure that attribute.
But this known-group validity evidence is problematic for at least two reasons. First, studies do not always find that autistic people score lower on the Eyes Test. Even in cases where scores are lower, there are viable alternative explanations that have not been ruled out. For example, some autistic people find eye contact extremely uncomfortable and thus might find doing the Eyes Test aversive.
The second problem with this known-group validity evidence is circularity. Autistic people’s poor performance on the Eyes Test is used to support the validity of the Eyes Test as a measure of theory-of-mind ability and is key evidence for the theory-of-mind deficit account of autism.
The issue of circularity is common in the construction of psychological tests. Nonetheless, it is possible to address circularity by iteratively refining theories and tests, leading to improvements in both. Unfortunately, as evidenced by our review article, there is little indication that this process of iteration is happening in the Eyes Test literature.
Despite accumulating criticisms of the theory that autistic people lack theory of mind, and many empirical failures to demonstrate theory-of-mind deficits in autistic adults, few studies have attempted to refine the Eyes Test or have considered alternative explanations for autistic people’s lower performance on the test. Rather, the version of the Eyes Test developed in 2001 remains widely used, and academic papers continue to cite the lower average scores of autistic people as a key source of validity evidence for the test, even as new theories of autism have been developed that do not postulate a theory-of-mind deficit.
In conclusion, we argue that due to inadequate validity evidence and serious conceptual limitations, researchers and clinicians should stop using the Eyes Test. Concerningly, another review, published in February, indicates that many other measures of theory-of-mind ability also have weak validity evidence. This suggests many challenges ahead for scientists working to understand theory of mind.
Given these concerns, how should research on theory of mind proceed? We believe that considerably more attention needs to be paid to the relationship between empirical measures and psychological theories. As a start, we suggest three things: Carefully examine the existing validity evidence when selecting measures to administer in research; adhere to best-practice guidelines for reporting validity evidence for measures; and moderate conclusions according to the strength of the validity evidence.