Nathaniel Daw has never touched a mouse. As professor of computational and theoretical neuroscience at Princeton University, he mainly works with other people’s data to construct models of the brain’s decision-making process. So when a collaborator came to him a few years ago with confusing data from mice that had performed a complex decision-making task in the lab, Daw says his best advice was just to fit the findings to the tried-and-true model of reward prediction error (RPE).
That model relies on the idea that dopaminergic activity in the midbrain reflects discrepancies between expected and received rewards. Daw’s collaborator, Ben Engelhard, had measured the activity of dopamine neurons in the ventral tegmental area (VTA) of mice as they were deciding how to navigate a virtual environment. And although the virtual environment was more complex than what a mouse usually experiences in the real world, an RPE-based model should have held, Daw assumed.
“It was obvious to me that there was this very simple story that was going to explain his data,” Daw says.
But it didn’t. The neurons exhibited a wide range of responses, with some activated by visual cues and others by movement or cognitive tasks. The classic RPE model, it turned out, could not explain such heterogeneity. Daw, Engelhard and their colleagues published the findings in 2019.
That was a wake-up call, Daw says, particularly after he watched videos of what the mice actually experienced in the maze. “It’s just so much more complicated, and high dimensional, and richer” than expected, he says. The idea that this richness could be reduced to such a simple model seems ludicrous now, he adds. “I was just so blinded.”
To reconcile the RPE theory with reality, Daw and his colleagues have developed a different model, which they published in July, that puts a new spin on the original theory: Dopamine neurons still encode errors in predictions, but instead of all the dopamine neurons responding to every kind of cue, each cell has its own narrow window on the world that accounts for the heterogeneity the researchers observed.
Theirs is not the first push to reframe the RPE theory in recent years. As more advanced tools have enabled researchers to collect more complex data than before, the field has gained fresh insights into how the dopamine system operates, how diverse the population of dopaminergic cells is, and how dopamine release seems to happen even in the absence of reward. And as a result, the theory of RPE, which was developed out of simpler data, is increasingly being put to the test.
“I find these studies add interesting aspects to the models one can derive from the empirical data, and I do want to encourage the work,” says Wolfram Schultz, professor of neuroscience at the University of Cambridge, who helped to develop the original RPE theory.
Some researchers, such as Daw and the authors of a recent perspective article on RPE refinements, say that the new discoveries just point to adjustments that need to be made to the original model. Others, including Erin Calipari, director of the Vanderbilt Center for Addiction Research at Vanderbilt University, say they are less convinced of the RPE model’s ability to explain behavior—and they have recently leveled both direct and indirect challenges at RPE, suggesting that dopamine signals salience, novelty or retrospective learning, rather than errors in reward prediction.
“There is nothing about the [original] data that is wrong—it’s that you start to get more information, as tools get better, that changes your interpretation of exactly what would be happening,” Calipari says.
The stakes seem small in some ways: A better model can produce slight advantages here or there in predicting the activity of dopamine neurons. But the downstream effects—what it suggests about the brain, where it points for future research and how it shapes translational work—are large. On top of that, the debate over when to tweak or trash a theory reflects how different researchers approach the science. Those in favor of discarding RPE decry its inability to account for new data; those who want to revise RPE point out the alternative models’ inability to account for past findings.
That debate can bring clarity to the field, says Mark Humphries, chair in computational neuroscience at the University of Nottingham, who is not affiliated with the original theory or the critiques of it.
“We’re in that phase where it felt like the story was finished,” Humphries says. “And it’s not.”
The idea that dopamine signals errors in predictions about a reward has shaped neuroscience research for decades—ever since the publication of a seminal 1997 paper, “A neural substrate of prediction and reward,” which recorded the activity of dopamine neurons in the VTA of monkeys as the animals learned to associate a flash of light with an unexpected reward of a bit of apple or juice.

When the light turned on for the first time, the dopaminergic cells kept ticking along at their baseline activity level, the study found. But after a monkey received a bit of juice, the cells’ firing increased—and presumably released more dopamine. After training, the cells burst into action as soon as the light went on, rather than in response to the reward itself; the monkey had learned to associate the light with a coming reward. And at that point, if the reward did not eventually follow the light, the dopaminergic neurons’ activity dipped. Independent teams later replicated those results in rodents.
Those findings built on a model of “temporal difference reinforcement learning” (or “temporal difference value learning”), which itself built on the 1972 Rescorla-Wagner model of classical conditioning—new ways of thinking about how people, animals and even computers learn from different cues. Using the temporal difference reinforcement learning model, computer scientists successfully taught computers to play backgammon in 1992. When, five years later, Schultz and his colleagues discovered that the model could also predict the response of dopamine neurons, it all seemed to fit. The model’s success inspired a field of learning and decision-making neuroscience, one that sought to use similar frameworks to understand how the internal representations of reward could drive animals to choose between different alternatives.
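That model family is compact enough to sketch in a few lines of code. The following is a generic, textbook-style illustration of temporal-difference learning on a two-step, cue-then-reward trial; the learning rate and trial count are invented for illustration, and this is a sketch of the model class, not code from any of the studies described:

```python
# A minimal sketch of temporal-difference (TD) value learning, the
# model behind the classic RPE account (illustrative parameters only).
# A cue at step 0 predicts a reward at step 1; the TD error ("delta")
# mirrors the dopamine responses in the 1997 monkey recordings.

ALPHA = 0.1            # learning rate (invented for illustration)
GAMMA = 1.0            # discount factor
rewards = [0.0, 1.0]   # reward arrives only after the cue
values = [0.0, 0.0]    # learned value of each time step

for trial in range(200):
    deltas = []
    for t in range(2):
        next_value = values[t + 1] if t + 1 < 2 else 0.0
        delta = rewards[t] + GAMMA * next_value - values[t]  # TD error
        values[t] += ALPHA * delta
        deltas.append(delta)

# Early in training, the error spikes at reward time; with training it
# shifts to the cue, and at convergence both errors shrink toward zero.
# Omitting the reward after training yields a negative error, the "dip"
# seen when an expected reward fails to arrive.
omission_delta = 0.0 + GAMMA * 0.0 - values[1]
print(round(omission_delta, 2))  # prints -1.0
```

At convergence the prediction errors vanish, and omitting the expected reward produces a negative error, matching the dip in dopaminergic firing that Schultz and his colleagues recorded.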
But dopamine neurons, it turned out, are more diverse in their response properties than previously thought: Not all dopamine neurons respond to errors in reward prediction, and some seem to respond to other features of a decision-making task entirely, as Daw came to see.
Advances in technologies and experiment design aided those discoveries. In the classic studies, animals highly trained on a task were rewarded if they performed correctly—conditions that produced a robust RPE signal. But experiments involving more complex environments and tasks began to tell a different story. Not only can dopamine neurons respond to visual and movement cues, such studies showed, but they can activate in response to threats, according to recordings from neurons in the tail of the striatum in a study from a different lab. These cells do not respond to reward, the team found; instead, they seem to implement reinforcement learning to help mice avoid threatening stimuli. Other researchers have found that dopamine neurons in this brain region respond to “action prediction errors,” a signal that helps support movements not linked to threat or reward.
To reconcile that heterogeneity, some researchers propose that the dopamine system tackles each of these modalities independently. But no one dopamine neuron receives inputs from the entire brain, or information about all of the features of the world, Daw and his colleagues point out in their new paper. The adjustments they made to the RPE model lay out how individual dopamine neurons can be biased toward specific types of stimuli—such as threats or actions—yet still have the potential for a general RPE response: The input to each cell is limited, and so each cell is specialized in what it responds to. The activity of those cells as a population can then produce a full prediction error signal, the team suggests.
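The population idea can be made concrete with a toy calculation. In this illustration, every number, the feature count and the wiring scheme are invented; it sketches the general logic, not the published model. Each simulated cell computes a prediction error over only a subset of task features, yet the cells’ summed activity recovers the error computed over every feature:

```python
# Toy illustration (invented numbers, not the published model) of how
# narrowly tuned cells can jointly carry a full prediction-error signal.
from itertools import combinations
import random

random.seed(1)
N = 6  # number of task features (cues, movements, etc.), invented

actual    = [random.uniform(0, 1) for _ in range(N)]  # what happened
predicted = [random.uniform(0, 1) for _ in range(N)]  # what was expected

# The "full" prediction error, computed over every feature at once
full_rpe = sum(a - p for a, p in zip(actual, predicted))

# Each cell's narrow window: here, every 3-feature subset of the 6
windows = list(combinations(range(N), 3))  # 20 simulated cells
responses = [sum(actual[f] - predicted[f] for f in w) for w in windows]

# Individual cells disagree (heterogeneity), but each feature appears
# in exactly 10 windows, so the summed population activity is just
# 10x the full prediction error.
population_rpe = sum(responses) / 10
assert abs(population_rpe - full_rpe) < 1e-9
```

Because every feature is covered by the same number of cells, the specialization washes out at the population level, while individual cells still look heterogeneous, a pattern consistent with what Engelhard’s recordings showed.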