A major question in systems neuroscience concerns dopamine’s computational function during learning. The most influential framework addressing this question is the reward prediction error (RPE) model of dopamine function, which is based on temporal difference reinforcement learning (TDRL). RPE is defined as received minus predicted “value,” where the value of each moment is, roughly, the total reward expected from that moment onward. The core idea of TDRL is that this RPE signal enables the brain to progressively make better value estimates.
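For concreteness, the textbook TDRL formulation (stated here in standard notation as background, not as a commitment to any particular dopamine model) writes the RPE at time t as the “received” quantity, the immediate reward plus the discounted value of what follows, minus the “predicted” value of the current moment, and uses that error to nudge the prediction:

```latex
% Textbook temporal-difference (TD) learning, shown only for concreteness.
% V(s_t): estimated value of the situation at time t; r_{t+1}: received reward;
% gamma: temporal discount factor; alpha: learning rate.
\[
\delta_t \;=\; \bigl(r_{t+1} + \gamma\, V(s_{t+1})\bigr) \;-\; V(s_t),
\qquad
V(s_t) \;\leftarrow\; V(s_t) + \alpha\, \delta_t .
\]
```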
For example, imagine a child learning that the sound of an ice cream truck predicts ice cream, a highly valuable reward. The value of the sound grows as the child learns that it predicts ice cream. According to the TDRL model, this increase in value is driven by the RPE generated when ice cream arrives unexpectedly. After learning, the sound of the truck acquires the high value of the ice cream, and the ice cream RPE drops to zero because the reward is now fully predicted by the sound.
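This verbal account can be mirrored in a few lines of simulation. The following is a minimal sketch of textbook tabular TD learning on a two-step “sound, then ice cream” trial; the states, reward size, and parameters are arbitrary choices for illustration, not a model of any specific experiment:

```python
# Toy simulation of the ice-cream-truck example using textbook tabular TD learning.
# An illustrative sketch only: states, reward size, and parameters are arbitrary.

alpha, gamma = 0.2, 0.95        # learning rate and temporal discount (illustrative)
V = {"sound": 0.0, "ice_cream": 0.0}
reward = 1.0                    # value of the ice cream itself

for trial in range(200):
    # Step 1: the truck's sound is heard; no reward yet, but it leads to ice cream.
    rpe_sound = 0.0 + gamma * V["ice_cream"] - V["sound"]
    V["sound"] += alpha * rpe_sound

    # Step 2: the ice cream is received; the episode then ends (future value = 0).
    rpe_ice_cream = reward + gamma * 0.0 - V["ice_cream"]
    V["ice_cream"] += alpha * rpe_ice_cream

    if trial in (0, 9, 199):
        print(f"trial {trial:3d}:  V(sound) = {V['sound']:.2f},  "
              f"RPE at ice cream = {rpe_ice_cream:.2f}")
```

Run repeatedly, the value of the sound climbs toward the (discounted) value of the ice cream while the RPE at ice cream delivery shrinks toward zero, as described above.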
The original conception of TDRL is elegant and principled. So the similarity observed between the TDRL RPE model and mesolimbic dopamine signaling is every theorist’s dream. Indeed, considerable data on dopamine signaling are consistent with RPE signaling. However, here I argue that a critical reevaluation of this established dogma is necessary for progress in the field, and my lab recently proposed a promising alternative.
The first set of challenges to the RPE model of dopamine arises from recent experimental results: data on mesolimbic dopamine dynamics during instrumental learning; a variety of Pavlovian behavioral conditions during early learning and subsequent task changes; the impact of causal manipulations of dopamine activity during simple or sequential Pavlovian learning; acquisition of an instrumental response; and learning of world models. Although straightforward generalizations of the RPE model can account for some of these findings, some scientists, including us, have argued that other findings are more fundamentally inconsistent with existing RPE accounts. (For more on this debate, see “Reconstructing dopamine’s link to reward.”)

The second set of challenges is conceptual. When replicable experiments are not obviously consistent with a model, either the model needs to be modified or a new model needs to take its place. Modifying the TDRL RPE model presents two major conceptual hurdles: simplicity and falsifiability. In general, scientists aim for the simplest model that can account for the data. Much of the appeal of TDRL RPE as an explanation for dopamine dynamics has been that it is a single, parsimonious model for a wide variety of observations. But in reality, different versions of TDRL RPE are used to describe dopamine dynamics across experiments, and the mechanisms underlying these versions are often drastically different.
Take the case of “dopamine ramps” (a progressive increase in dopamine over the course of a trial) observed in some experiments. A stable dopamine ramp in well-trained animals is fundamentally inconsistent with the original conception of TDRL RPE. In that framework, a positive RPE at a given time point raises the value estimate for that time point, which both propagates the prediction backward to earlier time points over trials and shrinks that same RPE on subsequent trials. Once values throughout the trial have caught up, the RPEs at the later time points approach zero, precluding the possibility of a stable ramp.
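A minimal sketch makes this concrete. Below, a toy TD(0) agent (arbitrary parameters, ten time steps per trial, reward only at the final step, not a model of any specific ramping experiment) is trained for many trials, and the within-trial RPE profile is printed early and late in training:

```python
# Sketch of why a *stable* within-trial RPE ramp is at odds with textbook TD(0).
# One trial = 10 time steps, with reward only at the final step; all numbers are
# illustrative, and this is not a model of any specific ramping experiment.

alpha, gamma, T = 0.1, 1.0, 10
V = [0.0] * (T + 1)              # V[T] is the terminal (post-reward) value, fixed at 0

def run_trial(values):
    """Run one trial of TD(0) updates; return the within-trial RPE profile."""
    rpes = []
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0                  # reward arrives only at the last step
        delta = r + gamma * values[t + 1] - values[t]   # the RPE at time t
        values[t] += alpha * delta                      # this update shrinks the same RPE later
        rpes.append(delta)
    return rpes

first = run_trial(V)
for _ in range(2000):
    last = run_trial(V)

print("early-training RPEs:", [round(x, 2) for x in first])
print("late-training  RPEs:", [round(x, 2) for x in last])
# Early on, the RPE sits at the reward; with training it propagates backward and
# shrinks toward zero at every predictable time point, rather than settling into a
# progressively increasing (ramping) profile.
```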
Researchers have modified the RPE model to capture dopamine ramps. But the changes involve a complex mixture of predictions and observations, producing biased value estimates that need active correction via a separate system, breaking the normative and elegant one-to-one relationship between RPE and value learning in the original TDRL framework. Though this extension is an RPE model of dopamine, its underlying mechanisms are very different from those of other RPE models of dopamine, and the apparent semantic parsimony hides the lack of mechanistic parsimony. To my knowledge, there is no single implementation of TDRL RPE that captures the wide variety of observations attributed to the “RPE” family of dopamine models. Thus, the RPE family of models does not currently have mechanistic parsimony.
The second conceptual challenge is falsifiability, which is tightly linked to the degrees of freedom a model is allowed when fitting data. The more degrees of freedom, the easier it becomes to fit any observation after the fact, and the harder the model is to falsify. In TDRL, the number of degrees of freedom is technically unbounded, because in addition to the free parameters, the inputs to the model (i.e., the “state space”) are underspecified, allowing many possible ways to tweak the model to fit data. Thus, more principled approaches to constrain the degrees of freedom are critical.
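To illustrate how an underspecified state space adds degrees of freedom, consider a purely hypothetical comparison: the same cue, delay and reward sequence represented either with one substate per time step or with a single “cue occurred” state held across the delay. Under textbook TD(0), these two choices yield different asymptotic RPEs at reward time, so the same dopamine observation could be judged consistent or inconsistent with “the” model depending on which representation one assumes. The sketch below uses arbitrary parameters and is not tied to any specific experiment:

```python
# Two state-space choices for the same cue -> (4-step delay) -> reward task.
# Representation A: a distinct substate for each time step of the delay.
# Representation B: a single "cue occurred" state held across the whole delay.
# Everything is illustrative; the point is only that the asymptotic RPE at reward
# differs between the two representations.

alpha, gamma, delay = 0.1, 0.9, 4

# Representation A: per-time-step substates.
V_A = [0.0] * (delay + 1)                  # V_A[delay] is terminal, stays 0
for _ in range(5000):
    for t in range(delay):
        r = 1.0 if t == delay - 1 else 0.0
        delta = r + gamma * V_A[t + 1] - V_A[t]
        V_A[t] += alpha * delta
rpe_reward_A = 1.0 + gamma * 0.0 - V_A[delay - 1]

# Representation B: one aggregated cue state for the entire delay.
V_B = 0.0
for _ in range(5000):
    for t in range(delay):
        if t == delay - 1:                 # transition to terminal state with reward
            delta = 1.0 + gamma * 0.0 - V_B
        else:                              # cue state transitions to itself, no reward
            delta = 0.0 + gamma * V_B - V_B
        V_B += alpha * delta
rpe_reward_B = 1.0 - V_B

print(f"RPE at reward, per-time-step states: {rpe_reward_A:.2f}")
print(f"RPE at reward, single cue state:     {rpe_reward_B:.2f}")
# With per-time-step states the reward RPE is driven to ~0; with the aggregated
# state it settles at a nonzero value, because one value cannot represent every
# temporal distance to reward. The choice of state space changes the prediction.
```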
In simple terms, TDRL RPE is a family of models and not a single model; established scientific approaches based on the principle of falsifiability are hard to apply to such a family. Indeed, in discussions with colleagues, I have been told that TDRL RPE is not falsifiable and should instead be treated as a thinking tool (i.e., a “framework”) that may or may not be applicable for understanding certain empirical phenomena relating to dopamine. Although I am partially sympathetic to this view, it is then unclear to me that TDRL RPE provides the sort of predictive model that will advance our understanding of the biology of learning. We need to engage in a broader discussion of our goals as a field. Further, if we count the success of any individual instance of a model family as a success for the whole family, but do not count the failure of every known instance as a failure of the family (on the grounds that some yet-unknown instance might succeed), we risk real conceptual stagnation in the field.
The above challenges call for alternative testable and falsifiable models of dopamine function. To illustrate the alternative model my lab has proposed, suppose you become nauseous soon after eating at a new restaurant. Chances are that you would associate that restaurant with illness and avoid it in the future. Chances are also that you learn this by getting ill and “looking back” in memory to identify a possible cause, then implicitly converting that retrospective inference into a prediction that keeps you from eating at the restaurant again.

We proposed a mathematical formalization of these intuitions for our theory. The core idea is simple: You can learn to predict the future by looking back at the past to identify causes, and you look back only after experiencing meaningful events. We showed that this model can identify causal relationships in continuous streams of events and that the signal used to identify meaningful events resembles RPE in common tasks. We also showed that in tasks designed to distinguish these hypotheses, dopamine dynamics abide by this model and not by canonical TDRL RPE models. We have recently shown, further, that behavioral learning and dopamine ramps abide by additional predictions of our framework. Thus, initial tests of our model across a range of conditions have been promising. Though much remains to be done on the theoretical and empirical sides, we believe it is a viable alternative to TDRL RPE as a model of dopaminergic and behavioral learning.
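To give a flavor of the “look back” idea, here is a deliberately oversimplified toy, with hypothetical event names, a made-up decay constant, and none of the machinery of the full formalism described in our papers. Its only purpose is to show retrospective credit assignment, triggered by a meaningful event, being converted into a prospective prediction:

```python
# Deliberately oversimplified toy of retrospective ("look back") learning. Event
# names, the decay constant, and the threshold are hypothetical illustration
# choices; this is NOT the formal model proposed by the authors, only the general
# idea that learning is triggered by meaningful events, which are credited to
# recently experienced candidate causes and then used prospectively.

from collections import defaultdict

DECAY = 0.8                       # how quickly the memory trace of an event fades per step
assoc = defaultdict(float)        # assoc[(cause, outcome)]: retrospective association
recency = defaultdict(float)      # recency[event]: memory trace of recent events

def experience(stream, meaningful=frozenset({"illness"})):
    """Process a stream of events; learn only when a meaningful event occurs."""
    for event in stream:
        for e in recency:                      # traces of earlier events fade with time
            recency[e] *= DECAY
        if event in meaningful:
            # Look back: credit recently experienced events as candidate causes.
            for cause, trace in recency.items():
                if cause != event:
                    assoc[(cause, event)] += trace
        recency[event] = 1.0

def predicts(cause, outcome, threshold=0.5):
    """Convert the retrospective association into a prospective prediction."""
    return assoc[(cause, outcome)] > threshold

# One bad restaurant visit followed, a couple of events later, by illness.
experience(["restaurant", "walk_home", "illness"])
print(predicts("restaurant", "illness"))   # True -> avoid that restaurant next time
```

In this cartoon, the association from “restaurant” to “illness” is formed only when the illness occurs, by crediting recently experienced events, and is then read out prospectively; the actual model formalizes and substantially extends this intuition.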
Looking forward, I think the field needs to critically reexamine theories of learning and dopaminergic function and develop new ones. Perhaps the fruit of this reevaluation will be further vindication for the RPE model. But given the significant conceptual and experimental challenges to this model that have been posed recently, not having this conversation is a disservice to our field.