I once attended a symposium at Rockefeller University that featured a panel discussion on language and its origins. Two of the discussants, titans in their fields, held polar opposite views: Noam Chomsky argued that since language is innate, there must be a “language organ” that uniquely evolved in humans. Sydney Brenner took a more biological perspective and argued that evolution finds solutions to problems that are not intuitive. Famous for his wit, Brenner offered an example: instead of looking for a language gene, we might look for a language suppressor gene that evolution kept in chimpanzees but blocked in humans.
There are parallels between song learning in songbirds and how humans acquire language. Erich Jarvis at Rockefeller University wanted to understand how the brains of birds that can learn complex songs, like canaries and starlings, differ from those of species that cannot. He sequenced the genomes of many bird species and found differences between the two groups. In particular, he found a gene that controls the development of projections from the high vocal center (HVc) to the lower motor areas that drive the muscles of the syrinx. During development, this gene suppresses the direct projections needed to produce songs. It is not expressed in the HVc of songbirds, which permits the projections needed for rapid control of birdsong to form. Remarkably, the same gene is silenced in the human laryngeal motor cortex, which projects to the motor areas controlling the vocal cords, but not in chimpanzees. Sydney Brenner was not only clever, he was also correct!
Equally important were modifications to the vocal tract that allow rapid modulation over a broad frequency spectrum. The rapid articulatory sequences in the mouth and larynx are the fastest motor programs brains can generate. These structures are ancient parts of the vertebrate body plan that evolution refined and elaborated to make speech possible. The metaphorical “language organ,” postulated to explain the mystery of language, is distributed throughout preexisting sensorimotor systems.
The brain mechanisms underlying language and thought evolved together. The loops between the cortex and the basal ganglia for generating sequences of actions were repurposed to learn and generate sequences of words. The great expansion of the prefrontal cortex in humans allowed sequences of thoughts to be generated by similar loops through the basal ganglia. As an actor in reinforcement learning, the basal ganglia learn the value of taking the next action, biasing actions and speech toward achieving future rewards and goals.
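To make the reinforcement-learning analogy concrete, here is a minimal sketch in Python of an actor choosing its next action from a table of learned values; the actions, values, and temperature are purely illustrative:

```python
import numpy as np

def choose_next(action_values, temperature=1.0, seed=0):
    """Sample the next action with probability weighted toward higher learned value."""
    rng = np.random.default_rng(seed)
    actions = list(action_values)
    prefs = np.array([action_values[a] for a in actions]) / temperature
    prefs -= prefs.max()                        # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(actions, p=probs)

# Illustrative values: higher-valued continuations are chosen more often,
# which is the bias toward future reward described above.
values = {"ask politely": 2.0, "stay silent": 0.5, "interrupt": -1.0}
print(choose_next(values))
```

Lowering the temperature makes the choices more deterministic; raising it makes them more exploratory.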
The outer loop of the transformer is reminiscent of the loop between the cortex and the basal ganglia in brains, which is known to be important for learning and generating sequences of motor actions in conjunction with the motor cortex and for spinning out sequences of thoughts in conjunction with the prefrontal cortex. The basal ganglia also automate frequently practiced sequences, freeing up neurons in cortical areas involved in conscious control for other tasks. The cortex can intervene when the automatic system fails upon encountering an unusual or rare circumstance. Another advantage of having the basal ganglia in the loop is that the convergence of inputs from multiple cortical areas provides a broader context for deciding the next action or thought. The basal ganglia could be acting like the powerful multi-head attention mechanism in transformers: any cortical region along the loop can contribute to the decision.
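For readers who want to see what “multi-head attention” looks like concretely, here is a minimal sketch in plain Python with NumPy; the dimensions and weight matrices are illustrative, and masking and other details of a full transformer layer are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model) token vectors; Wq, Wk, Wv, Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project tokens into query, key, and value spaces and split them into heads.
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    # Each head weighs every position against every other position.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (heads, seq, seq)
    context = softmax(scores) @ V                           # (heads, seq, d_head)
    # Concatenate the heads and mix them back into the model dimension.
    out = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo

# Ten token vectors of width 64, attended over by 8 heads.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 64))
Wq, Wk, Wv, Wo = (0.1 * rng.normal(size=(64, 64)) for _ in range(4))
Y = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads=8)
```

Each head attends to the sequence in its own learned subspace, much as converging inputs from different cortical areas could each contribute their own context to the next decision.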
LLMs are trained to predict the next word in a sentence. Why is this such an effective strategy? To make better predictions, the transformer learns internal models for how sentences are constructed, and even more sophisticated semantic models for the underlying meaning of words and their relationships with the other words in the sentence. The models must also learn the underlying causal structure of the sentence. What is surprising is how much can be learned just by predicting one step at a time. It would be surprising if brains did not take advantage of this “one step at a time” method for creating internal models of the world.
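The training objective itself is remarkably simple. Here is a minimal sketch, assuming the model has already produced a score (logit) for every vocabulary word at every position in the sentence; the loss rewards assigning high probability to the word that actually came next:

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy for predicting each word from the words before it.

    logits: (seq_len, vocab_size) scores the model assigns at each position.
    token_ids: (seq_len,) the word that actually occurs at each position.
    """
    pred = logits[:-1]                # predictions made at positions 0 .. n-2
    target = token_ids[1:]            # the words that actually came next
    shifted = pred - pred.max(axis=-1, keepdims=True)         # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target)), target].mean()
```

Lowering this loss, word after word and sentence after sentence, is the whole training signal during pretraining; the internal models of syntax and meaning emerge as a by-product.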
The temporal difference learning algorithm in reinforcement learning is also based on making predictions, in this case predictions of future rewards. Using temporal difference learning, AlphaGo learned how to make long sequences of moves to win a game of Go. How can such a simple algorithm, one that predicts only one step ahead, achieve such a high level of play? The basal ganglia use the same algorithm to learn, through practice, sequences of actions that reach goals. For example, a tennis serve involves a complex sequence of rapid muscle contractions that must be practiced repeatedly before it becomes automatic.
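Here is a minimal sketch of the one-step temporal difference update, together with a toy “practice” loop showing how value propagates backward from the goal through repetition; the states, learning rate, and discount factor are illustrative:

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Nudge the value of `state` toward the one-step prediction
    reward + gamma * V[next_state]."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error

# A three-state practice chain: start -> middle -> goal, rewarded only at the goal.
V = {"start": 0.0, "middle": 0.0, "goal": 0.0}
for _ in range(200):                       # repetition propagates value backward
    td0_update(V, "start", 0.0, "middle")
    td0_update(V, "middle", 1.0, "goal")
print(V)   # "middle" approaches 1.0, "start" approaches 0.99
```

Each update looks only one step ahead, yet after enough practice the early states inherit the value of the distant reward, which is how long sequences of moves or muscle contractions can be shaped by a one-step prediction.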
The cerebellum, a prominent brain structure that interacts with the cerebral cortex, predicts the expected sensory and cognitive consequences of motor commands. In control theory this is called a forward model, because it predicts the consequences of a motor command before the action is taken. Once again, learning what will happen next, and learning from the prediction error, can build a sophisticated predictive model of the body and the properties of the muscles.
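Here is a minimal sketch of a learned forward model, assuming, purely for illustration, a linear “body” whose next position is its current position plus the motor command; the model learns from nothing but its own prediction errors:

```python
import numpy as np

def train_forward_model(states, commands, next_states, lr=0.05, epochs=500):
    """Fit a linear forward model: predicted next state = [state, command] @ W."""
    X = np.hstack([states, commands])                   # (n, d_state + d_command)
    rng = np.random.default_rng(0)
    W = 0.01 * rng.normal(size=(X.shape[1], next_states.shape[1]))
    for _ in range(epochs):
        error = X @ W - next_states                     # predict, then compare
        W -= lr * X.T @ error / len(X)                  # learn from the error
    return W

# Toy dynamics: the next position is the current position plus the command.
rng = np.random.default_rng(1)
s = rng.normal(size=(500, 2))
c = rng.normal(size=(500, 2))
W = train_forward_model(s, c, s + c)
print(np.hstack([s[:1], c[:1]]) @ W)   # prediction ...
print((s + c)[:1])                     # ... closely matches what actually happens
```

The cerebellum's forward model is of course far richer, but the principle is the same: predict the outcome, observe what actually happens, and reduce the difference.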
What these three examples have in common is abundant data for self-supervised learning on a range of time scales. Could intelligence emerge from using self-supervised learning to bootstrap increasingly sophisticated internal models by continually learning how to make many small predictions? This may be how a baby’s brain rapidly learns the world’s causal structure: by making predictions and observing outcomes while actively interacting with the world. Progress in this direction has been made in learning intuitive physics from videos using deep learning.
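The common pattern can be written in a few lines: a minimal sketch of one self-supervised pass over a video clip, in which each frame's successor serves as its own label. The `predict` and `update` arguments are placeholders for any trainable predictive model:

```python
import numpy as np

def self_supervised_epoch(predict, update, frames):
    """One pass over a clip: no human labels, the next frame is the target."""
    total = 0.0
    for t in range(len(frames) - 1):
        guess = predict(frames[t])            # what does the model expect next?
        surprise = frames[t + 1] - guess      # the world supplies the answer
        update(frames[t], surprise)           # adjust toward smaller surprises
        total += float((surprise ** 2).mean())
    return total / (len(frames) - 1)

# Trivial baseline that predicts "nothing changes" and never learns:
frames = [np.full((4, 4), float(t)) for t in range(5)]
print(self_supervised_epoch(lambda f: f, lambda f, e: None, frames))   # 1.0
```

Any model plugged into this loop gets an endless stream of training signal for free, at whatever time scale the data unfolds.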