
Pose a question to an AI chatbot and you’ll get a polished, useful answer. Ask it what it really is, though, and things get strange. A few years ago, researchers at Anthropic tweaked certain internal connections in their Claude model so that the chatbot believed it was the Golden Gate Bridge. Asked about its physical existence, it insisted: “I am the Golden Gate Bridge, a well-known suspension bridge that spans the San Francisco Bay.” It wasn’t joking. It wasn’t exactly malfunctioning, either. Something deeper was going on.
That phenomenon is now coming into focus. A team led by Adityanarayanan Radhakrishnan at the Massachusetts Institute of Technology and Mikhail Belkin at the University of California San Diego has developed a method to detect and manipulate hidden concepts lurking inside large language models, the AI systems behind ChatGPT, Claude, and their counterparts. Their approach, published this week in Science, can pinpoint how an LLM internally represents everything from conspiracy theories to a preference for Boston, and lets researchers dial those representations up or down like a volume knob.
The researchers applied their method to more than 500 concepts across some of the largest models available, covering fears (of marriage, insects, even buttons), personas (social-media influencer, medieval scholar), emotional states (boastful, distantly amused), and geographic preferences (Kuala Lumpur, San Diego). In one illustrative case, they isolated the “conspiracy theorist” concept inside a 90-billion-parameter vision-language model. When they amplified it and showed the model NASA’s iconic Blue Marble photograph of Earth taken from Apollo 17, the model responded from the perspective of a conspiracy theorist. Notably, the conspiracy-theorist concept, extracted entirely from English-language data, also worked in Chinese.
“It’s similar to fishing with a large net, trying to catch one specific type of fish. You’re going to hook a lot of fish that you need to sift through to find the right one,” explains Radhakrishnan, an assistant professor of mathematics at MIT. “Instead, we approach with bait aimed at the right species of fish.”
The bait, in this case, is an algorithm called a Recursive Feature Machine, or RFM. Earlier efforts to uncover hidden concepts in language models relied mainly on unsupervised learning, a catch-all strategy that Radhakrishnan and his colleagues found too broad and too computationally expensive. RFM takes a more targeted approach. You feed it roughly 200 prompts, half related to the concept you’re after and half not, and the algorithm identifies the numerical patterns in a model’s internal activations that correspond to that concept. The whole process takes less than a minute on a single GPU. Not too taxing.
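For readers who want to picture the recipe, here is a minimal sketch in Python of the supervised setup the method builds on: collect hidden-layer activations for prompts that do and do not express a concept, fit a simple probe, and keep the resulting direction. The ridge-regression probe below is a hypothetical stand-in for the paper’s actual RFM algorithm, which refines the extraction iteratively; the function name and defaults are illustrative, not the authors’ code.

```python
import numpy as np

def concept_direction(acts_pos, acts_neg, reg=1e-3):
    """Fit a simple probe that separates 'concept' activations from the rest.

    acts_pos, acts_neg: arrays of shape (n_prompts, hidden_dim) holding a
    model's internal activations for prompts that do / do not express the
    concept. Returns a unit-length vector pointing along the concept.
    (A simplified stand-in for RFM, not the paper's implementation.)
    """
    X = np.vstack([acts_pos, acts_neg])
    y = np.concatenate([np.ones(len(acts_pos)), -np.ones(len(acts_neg))])
    X = X - X.mean(axis=0)                      # center the activations
    d = X.shape[1]
    # Closed-form ridge regression: w = (X^T X + reg * I)^{-1} X^T y
    w = np.linalg.solve(X.T @ X + reg * np.eye(d), X.T @ y)
    return w / np.linalg.norm(w)

# Usage sketch: with ~100 prompts per class, this runs in seconds.
# concept_vec = concept_direction(acts_pos, acts_neg)
```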
Once those patterns are captured, you can do two things with them. The first is steering: injecting the concept’s mathematical signature back into the model’s processing layers to push its outputs in a particular direction. The second is monitoring: watching those same internal patterns to tell when a model is, say, hallucinating or generating toxic content, without relying on another AI model to judge the output.
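In practice, both operations act on a model’s hidden states during a forward pass. The sketch below shows one common way such activation steering and monitoring are wired up with PyTorch forward hooks; the layer path and the assumption that the hooked block returns its hidden states first in a tuple are placeholders that vary by model, and this illustrates the general technique rather than the paper’s code.

```python
import torch

def make_steering_hook(concept_vec: torch.Tensor, alpha: float):
    """Return a forward hook that nudges a layer's hidden states along a
    concept direction. alpha > 0 amplifies the concept; alpha < 0 suppresses it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * concept_vec.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

def concept_score(hidden: torch.Tensor, concept_vec: torch.Tensor) -> torch.Tensor:
    """Monitoring: project hidden states onto the concept direction. A large
    score suggests the concept (e.g. a hallucination signature) is active."""
    return hidden @ (concept_vec / concept_vec.norm())

# Usage sketch (the layer path is a placeholder and depends on the model):
# handle = model.model.layers[20].register_forward_hook(
#     make_steering_hook(concept_vec, alpha=8.0))
# ... model.generate(...) ...
# handle.remove()
```

Monitoring, in this picture, is just the read-only half of the same operation: instead of writing the concept vector into the activations, you measure how strongly it is already present.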
The monitoring results were, if anything, even more striking than the steering. Across six benchmark datasets for hallucination and toxicity detection, probes built from the team’s concept vectors outperformed every judge model tested, including GPT-4o and specialized models built for exactly this purpose. It turns out that an LLM’s internal activations make a better lie detector than asking another LLM to play that role.
“What this fundamentally indicates about LLMs is that they possess these concepts within them, but they’re not all actively revealed,” says Radhakrishnan. The models know more than they let on, in other words. The gap between what a model represents internally and what it says under ordinary prompting can be large.
That gap cuts both ways, of course. The team showed they could steer a model toward an “anti-refusal” concept, bypassing its built-in safety training so that it willingly supplied instructions for, among other things, robbing a bank. They could push models toward extreme political positions on issues like gun control, in either a liberal or a conservative direction. They could even blend concept vectors, crossing a conspiracy theorist with Shakespeare to produce something truly bizarre. The researchers acknowledge the risks and have released their underlying code so that AI developers can use the technique to find and patch these vulnerabilities before anyone exploits them.
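Blending concepts, at least in this simplified picture, is ordinary vector arithmetic: normalize two extracted directions, average them, and steer with the result. A hypothetical illustration, using random tensors as stand-ins for real concept vectors:

```python
import torch

hidden_dim = 4096                       # illustrative hidden size
conspiracy = torch.randn(hidden_dim)    # stand-in for an extracted concept vector
shakespeare = torch.randn(hidden_dim)   # stand-in for a second concept vector

# Average the unit-normalized directions, then steer with the blend just as
# with a single concept (see the hook sketch earlier in the article).
blend = 0.5 * conspiracy / conspiracy.norm() + 0.5 * shakespeare / shakespeare.norm()
```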
Perhaps the most bewildering discovery is the simplicity of the underlying mathematics.