Does a new theory of dopamine replace the classic model?

My answer would be no, but the model poses challenges that will sharpen our understanding of dopamine and learning.

Useful models: Though not perfect, temporal difference learning models explain a breadth of observations and have helped develop our understanding of neural mechanisms across levels.
Illustration by Julien Posture

How do animals learn from experiences? An influential idea is that animals constantly make predictions and compare them with what actually happens. When their predictions fail to align with reality, they adjust them. When there is no discrepancy, there is no need for learning. In other words, it is the surprise, or “prediction error,” that drives learning. In the brain, dopamine is thought to provide a type of surprise signal. Specifically, researchers have found that dopamine signals bear a striking similarity to one from computer science called a temporal difference (TD) error, a prediction-error signal in a specific machine-learning algorithm.

A large body of evidence supports the idea that dopamine signals convey TD error, but a 2022 study by Huijeong Jeong and her colleagues challenged this. They proposed that an alternative explanation for learning and dopamine function—a novel causal learning model called the adjusted net contingency for causal relations (ANCCR)—better explains dopamine responses. In ANCCR, learning is driven by looking backward in time to see which events or cues preceded a significant event such as a reward.

The results and model presented in this study are intriguing and do pose some challenges to previous ideas, but their implications require careful consideration, which I discuss below.

The statistician George Box famously wrote that “all models are wrong, but some are useful.” Simple mathematical models might not capture reality perfectly because of the abstractions required to build them. Nonetheless, some models are useful. For example, we all know that real gases do not strictly obey the ideal gas law, but the law is extremely useful because it elegantly captures the basic properties of gases.

With respect to dopamine and learning phenomena, no model thus far can explain all the observations. But TD learning models do explain a breadth of observations and have helped develop our understanding of neural mechanisms linking phenomena across different levels—molecular, cellular, circuit and behavioral. (For more on this debate, see “Reconstructing dopamine’s link to reward.”)

Jeong et al. argues that ANCCR is superior both in explaining dopamine responses and as a learning algorithm more broadly, because the model’s “retrospective” approach enhances memory efficiency. But such claims are at odds with recent successes in artificial intelligence, in which variants of TD learning algorithms have outperformed humans in complex tasks. Notably, TD-Gammon, a backgammon-playing algorithm developed in the 1990s, achieved human-level performance even in an era when computers were significantly less powerful than modern systems. Indeed, it is the memory efficiency of TD learning algorithms that has contributed to their remarkable success.

Jeong et al. attributes ANCCR’s power to its retrospective learning mechanism, but this property itself might not be the critical difference between ANCCR, TD and other models. TD learning harnesses both prospective and retrospective learning mechanisms, often referred to as the “forward view” and “backward view.” The TD agent learns values that quantify expected future rewards tied to a given situation or action. Once learned, these values can be used to select actions that lead to greater rewards—a framing that represents the “forward view” because each state contains information about what will happen in the future.

The TD error is defined using these values. The expected future reward at the current state, t, should equal the expected future reward at the next time step, t + 1, plus the reward the agent received while transitioning from t to t + 1. Any discrepancy between these quantities indicates an error in prediction, and because it is computed by comparing predictions made at two consecutive time points, it is called the temporal difference error. In equation form, the TD error (δ) is defined as δ = r + V(t + 1) – V(t), where r is the reward and V(t) is the value at time t. When an agent detects a TD error, it adjusts the value(s) of the previous state(s) to reduce the error—a learning process known as the “backward view” because it uses errors at reward time to update state(s) that occurred in the past. Notably, the forward and backward views are mathematically equivalent—they are just two different ways to think about the problem of computing values.
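For readers who prefer code, a minimal sketch of this computation in Python might look like the following; the function name and example numbers are mine, chosen purely for illustration.

```python
# A minimal sketch of the TD error defined above, assuming discrete time steps.
# The equation in the text omits a discount factor (gamma); it is included here
# as a standard assumption, and setting gamma = 1 recovers the form in the text.

def td_error(r, v_t, v_next, gamma=1.0):
    """delta = r + gamma * V(t+1) - V(t): the discrepancy between the value
    predicted at time t and the reward-plus-value observed at time t + 1."""
    return r + gamma * v_next - v_t

# A fully predicted reward produces no error ...
print(td_error(r=1.0, v_t=1.0, v_next=0.0))   # 0.0
# ... whereas an unpredicted reward produces a positive error.
print(td_error(r=1.0, v_t=0.0, v_next=0.0))   # 1.0
```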

How far back in states (or time) the agent looks to update values depends on the timescale of its memory. Limiting this timescale keeps learning relatively local in time, and repeating the learning process propagates values back to earlier states. This locality is the key to the learning process and is what makes it memory efficient: the TD error is computed moment by moment at each time point, and values are updated even before a reward is received.
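A toy simulation can make this propagation concrete. The sketch below, with parameters I chose purely for illustration, runs tabular TD(0) on a simple chain of states and shows learned value spreading from the rewarded transition back toward the cue.

```python
# A minimal sketch (illustrative parameters only) of how repeated, temporally
# local TD(0) updates propagate value backward along a chain of states leading
# from a cue to a reward.

n_states = 5           # state 0 is the cue; the reward arrives on the last transition
alpha, gamma = 0.1, 1.0
V = [0.0] * n_states   # learned values, initialized to zero

for episode in range(200):
    for t in range(n_states - 1):
        r = 1.0 if t == n_states - 2 else 0.0    # reward only on the final transition
        delta = r + gamma * V[t + 1] - V[t]      # TD error, computed locally in time
        V[t] += alpha * delta                    # update only the just-visited state

print([round(v, 2) for v in V])
# Early in training, only the state just before the reward gains value;
# with repetition, value spreads back toward the cue.
```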

What truly sets ANCCR apart from TD learning? A critical difference seems to lie in what these models aim to learn. In TD learning, expected reward, or value, is an essential variable, computed constantly as the agent traverses different states. ANCCR, by contrast, focuses on learning conditional probabilities of cues resulting in rewards, dubbed “retrospective associations,” and the operation of ANCCR is primarily triggered when an agent detects an event that it already knows is significant.
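To make the contrast concrete, here is a toy example, not an implementation of ANCCR itself, comparing a prospective association (how often reward follows the cue) with a retrospective association (how often the cue preceded a reward, in line with the backward-looking description above). The trial outcomes are made up for demonstration.

```python
# Toy illustration of prospective vs. retrospective associations.
# The trial outcomes below are hypothetical, chosen only for demonstration.

trials = [
    {"cue": True,  "reward": True},   # cue followed by reward (x3)
    {"cue": True,  "reward": True},
    {"cue": True,  "reward": True},
    {"cue": True,  "reward": False},  # cue, no reward
    {"cue": False, "reward": True},   # reward without the cue (x2)
    {"cue": False, "reward": True},
    {"cue": False, "reward": False},  # neither
]

cue_trials = [t for t in trials if t["cue"]]
reward_trials = [t for t in trials if t["reward"]]

# Prospective: given the cue, how often does reward follow?
p_reward_given_cue = sum(t["reward"] for t in cue_trials) / len(cue_trials)
# Retrospective: given a reward, how often was the cue present beforehand?
p_cue_given_reward = sum(t["cue"] for t in reward_trials) / len(reward_trials)

print(p_reward_given_cue)   # 0.75
print(p_cue_given_reward)   # 0.6
```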

Given this difference, how can we arbitrate between these models? Jeong and her colleagues conducted a series of experiments to do this in their 2022 paper, but whether the results really distinguish the two models remains to be clarified. TD learning is a relatively broad class of models rather than a specific implementation, and the researchers considered only relatively simplistic variants. Previous studies have shown that such simplistic models cannot fully explain dopamine responses, but models with appropriate parameters or some modifications can explain them, including some of the data presented in Jeong et al. (Qian et al. [2024] and Pan et al. [2005]). Effectively challenging TD learning models requires carefully designed experiments grounded in fundamental properties of TD models, independent of a model’s implementation.

ANCCR also cannot naturally explain some well-known dopamine responses that are easily explained by TD models. For example, in a study published in 2020, my group aimed to test whether dopamine signals exhibit the fundamental characteristics expected of TD error. In the TD error equation given above, the V(t + 1) – V(t) term represents the change in value over one time step, and dopamine signals should mirror that change. We designed an experiment using virtual reality to rapidly alter values in time, teleporting a mouse closer to or farther away from a reward location, and examined how dopamine signals responded. Forward and backward teleports elicited transient increases and decreases, respectively, in dopamine signals in the nucleus accumbens. These results demonstrated that dopamine signals exhibit a derivative-like shift, consistent with the TD error hypothesis.
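A small simulation conveys the logic of this test. The exponentially ramping value function below is purely an illustrative assumption, not a quantity estimated from our data; under any value function that increases with proximity to the reward, a forward teleport produces a sudden jump in value and hence a transient positive TD error, and a backward teleport produces the opposite.

```python
# Illustrative sketch of the teleport logic under the TD account.
# The exponential value ramp and all numbers here are assumptions for
# demonstration, not quantities taken from the 2020 study.

def value(position, reward_position=100.0, gamma=0.98):
    """Expected reward, discounted by the remaining distance to the reward."""
    return gamma ** max(reward_position - position, 0.0)

def td_error(pos_now, pos_next, r=0.0):
    return r + value(pos_next) - value(pos_now)

print(td_error(50.0, 51.0))   # ordinary step forward: small positive error
print(td_error(50.0, 80.0))   # forward teleport: large transient positive error
print(td_error(50.0, 20.0))   # backward teleport: transient negative error
```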

Importantly, by contrast, ANCCR does not appear to possess such a capability. When simulating our teleport experiments, ANCCR does not generate derivative-like signals. In their publication, Jeong and her team claimed that ANCCR reproduced the results of our study, but our examination of the dopamine signals generated by their code suggests they do not match the experimental results.

And ANCCR struggles to reproduce the well-established phenomenon of reward omission “dips.” For example, when different cues predict rewards with differing probabilities, say 10 or 90 percent, omitting a higher-probability reward triggers more substantial negative dopamine responses. This is consistent with the idea of prediction error; more surprising outcomes induce greater responses. By contrast, ANCCR does not inherently generate this pattern; it necessitates somewhat arbitrary and untested assumptions to replicate it.

Despite these challenges, ANCCR can also be a useful model for understanding the function of dopamine and learning phenomena. Jeong et al. provides important data that may or may not be explainable by TD models. More important, in my view, is that various concepts used in ANCCR may spur new directions in the study of learning and intelligence. For instance, what does it mean to understand causal relationships, and how can the brain achieve it? Are conditional probabilities a useful computational primitive to guide behavior?

So, does ANCCR replace TD learning models? My answer would be no, at least not in its current form. Nonetheless, insights from Jeong et al. will sharpen our understanding of dopamine, learning and models. More work is needed to fully understand what neural processes underlie our ability to learn and perform intelligently, and the TD and ANCCR models will both continue to be useful toward these goals.
