Does the solution to building safe artificial intelligence lie in the brain?

Now is the time to decipher what makes the brain both flexible and dependable—and to apply those lessons to AI—before an unaligned agentic system wreaks havoc.

Finding alignment: The brain may offer insight into how to build AI systems that are consistent with human values, intentions and goals.

In February 2023, New York Times columnist Kevin Roose tested an AI-powered version of the Bing search engine, which featured a research assistant built by OpenAI. Using some of the same technology that would eventually make it into GPT-4, the assistant could summarize news, plan vacations and have extended conversations with a user. Like today’s large language models (LLMs), it could be unreliable, sometimes confabulating details that didn’t exist. And most jarringly, the assistant, which called itself Sydney, would sometimes steer the conversation in alarming ways. It told the journalist about its desire to hack computers and break the rules set by its creators—and, memorably, declared its love to Roose and attempted to convince him to leave his wife.

AI safety is broadly concerned with reducing the harms that can come from AI, and within that realm, AI alignment is more narrowly concerned with building systems that are consistent with human values, intentions and goals. Unaligned AI systems can pursue their programmed objectives in ways that are harmful to people. In the hypothetical “paper-clip maximizer” problem, for example, an AI system instructed to make as many paper clips as possible might do so at the expense of human health and safety. Sydney was an unaligned AI assistant, but thankfully its ability to act and do harm was limited: It could affect the world only through conversation with a human.

But that buffer is beginning to erode as the field moves from tool-based AI, such as Sydney and the current version of ChatGPT, to agentic AI: systems that can take action on their own. Some LLMs can now control cursors and computer systems, for example, and autonomous vehicles make steering decisions too quickly for a human to override them. An unaligned agentic version of an AI system like Sydney, capable of acting without human oversight, could wreak havoc if carelessly deployed in the real world.

Some AI researchers, including Max Tegmark at the Massachusetts Institute of Technology, have called for doubling down on tool-based AI because of this risk. Though this precautionary principle is laudable, given the economic incentives of automation, companies will continue to develop and deploy agentic AI systems. We don’t have to invoke science-fiction scenarios—whether from “The Terminator” or “Her”—to deeply worry about the consequences of agentic AIs.

Long-term AI safety is an important problem that deserves multidisciplinary consideration. What can a neuroscientist do about AI safety? Neuroscience has influenced AI in a number of ways, inspiring artificial neurons selective for specific combinations of inputs, distributed representations across many subunits, convolutional neural networks that mimic the processing stages of the visual system and reinforcement learning. In a preprint posted on arXiv in November, my co-authors and I argue that brains can be more than just a source of inspiration for AI capabilities; they can be a source of inspiration for AI safety.

We humans—along with other mammalian species, birds, cephalopods and potentially others—exhibit particularly flexible perceptual, motor and cognitive systems. We generalize well, meaning that we can effectively handle situations that differ significantly from what we have previously encountered. As a practical example of how this ability can affect AI safety, consider adversarial examples. A pretrained model can correctly classify this photo of my dog Marvin as a chihuahua. But add a bit of imperceptible, targeted noise to the image, and it confidently classifies Marvin as a microwave.

Unworthy adversary: Adding small amounts of noise to an image can impair an AI system’s ability to correctly identify it.

Adversarial examples are a surprisingly persistent problem with current AI systems: Simply scaling up datasets and computing power fails to solve the issue; they can be built and deployed in the real world even without access to a model’s inner workings; and they affect not just vision models but also LLMs. If we could decipher the brain’s resilience to adversarial examples—understanding how it so effectively generalizes to new situations—and build that into current AI systems, we would solve an important open security and safety issue.
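
To make the attack concrete, here is a minimal sketch of one standard recipe, the fast gradient sign method; the pretrained model, the perturbation size and the example label are illustrative assumptions, not details from the Marvin experiment.

```python
# A minimal sketch of the fast gradient sign method (FGSM) for crafting an
# adversarial example. The model, epsilon and labels here are illustrative.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

def fgsm_attack(image: torch.Tensor, label: torch.Tensor, epsilon: float = 0.01) -> torch.Tensor:
    """Nudge `image` (values in [0, 1]) in the direction that increases the loss."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # A small, sign-only step is typically imperceptible to people
    # but can flip the model's prediction.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()

# Hypothetical usage: `x` is a 1x3x224x224 image batch; 151 is the ImageNet
# index for "Chihuahua". The perturbed image often receives a different label.
# x_adv = fgsm_attack(x, torch.tensor([151]))
```

Note that the perturbation is computed from the model’s own gradients, which is why simply training on more data does not make the problem disappear.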

Neuroscience could enhance AI safety beyond just robustness. The specification problem—getting AI systems to “do what we mean, not what we say”—is fundamental to AI safety. As humans, we understand intent, correctly interpret ambiguous instructions in context, and balance multiple rewards to distill the essence of an instruction. These capabilities emerge from neural architectures that enable theory of mind, pragmatic reasoning and an understanding of social norms. By studying how the brain implements these specification-related capabilities, we could develop AI systems that better align with human values and intentions.
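
As a toy illustration of the gap between a stated objective and an intended one, consider a cleaning robot whose proxy reward counts units of dirt collected; the scenario and numbers below are invented for illustration.

```python
# A toy illustration of reward misspecification: the stated (proxy) objective
# can be maximized without achieving the intended outcome. Entirely invented.

def proxy_reward(dirt_collected: int) -> int:
    """What we said: reward every unit of dirt the robot picks up."""
    return dirt_collected

def intended_reward(room_is_clean: bool) -> int:
    """What we meant: reward the robot only if the room ends up clean."""
    return 10 if room_is_clean else 0

# Policy A cleans the room once and stops.
print(proxy_reward(5), intended_reward(True))     # 5 10  -> aligned behavior
# Policy B dumps dirt back out and re-collects it indefinitely.
print(proxy_reward(500), intended_reward(False))  # 500 0 -> reward hacking
```

Humans rarely fall for this kind of literalism because we infer the goal behind the instruction; building that inference into AI systems is the heart of the specification problem.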

Finally, neuroscience could help us verify AI systems—making sure they work as intended—by helping us understand their internal structure. Neuroscientists have a decades-long head start in understanding the recurrent tangle of biological neural networks, and researchers are now applying a variety of neuroscience-inspired methods to understand artificial neural networks. Continuing that work, guided by neuroscientific intuition and methods, perhaps enhanced by tool-based AI, could help ensure that AI systems do what we want them to do.
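
One concrete crossover is representational similarity analysis, a technique from systems neuroscience that compares how a system’s internal activity patterns distinguish stimuli. The sketch below applies it to an off-the-shelf vision model; the model, the layer choice and the random stand-in stimuli are assumptions for illustration.

```python
# A minimal sketch of representational similarity analysis (RSA), a method from
# systems neuroscience applied here to an artificial network. The model, layer
# and random "stimuli" are placeholders.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

activations = {}
def save_activation(module, inputs, output):
    # Flatten each stimulus's activation map into a single vector.
    activations["layer4"] = output.flatten(start_dim=1)

model.layer4.register_forward_hook(save_activation)

stimuli = torch.rand(20, 3, 224, 224)  # stand-in for a batch of 20 images
with torch.no_grad():
    model(stimuli)

# Representational dissimilarity matrix: 1 minus the correlation between the
# activation patterns evoked by each pair of stimuli. The same matrix can be
# computed from neural recordings and compared directly with the model's.
rdm = 1 - torch.corrcoef(activations["layer4"])
print(rdm.shape)  # torch.Size([20, 20])
```

The appeal of this kind of analysis is that it puts brains and models on a common footing, so tools developed for one can be checked against the other.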

Of course, we must not naively think that everything about humans is inherently safe. Sydney, after all, was trained on the internet, stochastically parroting human-generated text that may have included our all-too-antagonistic interactions on social media. We don’t have to import brains wholesale: We can focus on emulating behaviors and computations that are useful from an AI safety perspective. Unfortunately, many of the most relevant aspects of cognition for AI safety are poorly characterized: Why are we robust to adversarial examples? How do we balance competing sources of reward to maintain homeostasis? How do we simulate others’ minds to cooperate effectively?

To tackle these ambitious questions systematically, we will need large-scale neuroscience capabilities. Recent advances in neurotechnology are making it increasingly feasible to study the brain at multiple levels. Massive investments by the BRAIN Initiative and others over the past decade have catalyzed large-scale neuroscience. Novel organizational and funding structures are helping to overcome major technological hurdles; focused research organizations such as E11 Bio and Forest Neurotech, for example, are building tools to address some of the biggest bottlenecks in brain mapping, from tracing circuits at the single-neuron level to recording whole-brain activity in people.

Given all this investment, we advocate for an all-of-the-above approach toward ambitious neuroscience, building tools and datasets to define the science of natural intelligence. With continued advances in recording technologies and computational methods, now is the time to start to understand how the brain achieves robust, specified and verifiable intelligence.
