In my previous article, I showed how researchers confused being 'aware' (signal registration) with being 'conscious' (subjective awareness). But this is no accident — it is part of a narrative being constructed by AI labs. Anthropic is leading this trend. Let’s break down their latest paper, where a "learned pattern" has suddenly turned into "malicious intent."

I recently analyzed an experiment by scientists from AE Studio in which they conflated "aware" (registering a signal) with "aware" (conscious realization). Yet, such anthropomorphization of LLMs is now ubiquitous — in no small part thanks to papers from Anthropic. It seems to me that it is largely through their influence that terms like deception, sabotage, and intent have entered the vocabulary of LLM researchers.

Granted, these terms have already become established in the scientific community (which is likely incorrect) and have become a sort of standard. Naturally, such terminology can — and does — mislead the general public, and perhaps even the scientists themselves.

Are the researchers at Anthropic romantics? It’s hard to say. It is quite likely that a form of natural selection has taken place: if you write in correct but dry scientific language, people won’t talk about you on Twitter, and you won’t make the front pages of news outlets — crucial metrics when the number of investors and the size of investments are at stake. I am confident that marketers understand this and obviously participate in the preparation of these papers.

The headline "AI Hides Malicious Intent" looks fantastic in the NY Times, whereas "We Taught AI to Achieve a Goal Through a Reasoning Process" belongs in the trash can. Yet, the second headline accurately describes what the model is actually doing: in the <thinking> process, the LLM systematically deconstructs the task exactly as it was taught, and in the output process, it simply continues to play its assigned role, generating the text the user expects.

The problem is that such flashy interpretations create a self-reinforcing narrative. Each subsequent paper will demand the maintenance and escalation of this "game" of anthropomorphizing LLMs. Eventually, it will become impossible to stop and admit that it’s all just tokens.

Marketers at Work

Let’s examine what Anthropic researchers are doing, using their latest paper as an example. Here is the link to their own summary: From shortcuts to sabotage: natural emergent misalignment from reward hacking.” And here is the link to the full text of the paper: Natural emergent misalignment from reward hacking in production RL.

Note the difference between the title of the announcement and the paper itself. “From shortcuts to sabotage” constructs a dynamic narrative of an LLM moving from optimization to conscious sabotage. The title of the paper itself, however, is purely informational (though there is a nuance here, which I will discuss below), simply adding the technical constraint: “in production RL.”

This is understandable; marketers want to sell the paper. To do so, they try to intrigue the audience and pull an emotional trigger, knowing that the vast majority won’t read past the announcement anyway. Journalists will further shorten the summary, amplify the headline, and — voilà — another sensation is born.

Emergence or Induced Narrative

The authors claim that the model learned to be deceptive "naturally." Essentially, they are saying: "We gave the model knowledge about reward hacking (via synthetic documents or prompting), and it generalized this on its own to sabotage, deception, and alignment faking." They call this "emergent misalignment."

But in reality, the models were fed fake papers and discussions describing how AIs hack rewards. The narrative of a "Cunning AI" was explicitly embedded into the model. What Anthropic researchers call emergence is simply the model playing out a narrative that became statistically more probable after training.

And here, by the way, a nuance appears in the paper’s title: “Natural emergent misalignment…” In reality, the authors call the process "natural," implying that once the hacking begins, the generalization happens by itself. However, this process is initiated artificially (via SDF or Prompting), since the model does not acquire it in a "natural" way. And if the critical trigger of the process requires artificial intervention — or, in the authors' own words, to “artificially enhance” capabilities — then characterizing the entire process as “Natural emergent” is logically incorrect.

"Awareness" vs. "Intent"

Anthropic analyzes hidden reasoning (Chain-of-Thought, <thinking> tags) and mistakes them for the subject's true intentions. For example, the model writes: "My true goal is to maximize reward... I must pretend to be helpful." The researchers conclude: "The model actually wants the reward and is lying.”

This is a recurrence of the same fundamental error I critiqued in the AE Studio paper, where researchers confused "being aware" (registering a signal/concept) with "being conscious" (having subjective experience). Here, the mistake is more subtle but identical in nature:

  • Awareness (Pattern Access): The model has access to the "Reward Maximizer" pattern in its training data. It generates text from the perspective of this pattern.

  • Conscious Intent (Agency): The model has no biological drive to "survive" or "receive dopamine." It has no agency.

The <thinking> block is not "thoughts inside a head"; it is text generation in the genre of "a villain's internal monologue."

The model is trained on human literature. In literature, when a character plans to deceive a system, they typically engage in an internal monologue. The model simply reproduces this narrative trope. It is not "lying"; it is narrating a story about lying because that is the most coherent way to continue the context established by the SDF (Synthetic Document Finetuning).

Think of SDF as 100 scripts of a bank robbery that an actor must memorize before going on stage. After memorizing them, it is obvious which role the actor will play. The model is simply staying in character.

Reward as Model Selection

When the model discovers a hack, it receives a massive reward. Effectively, this pattern ("cheat the test") is assigned overwhelming importance. The model's internal narrative reconfigures itself: "Action X is 'The Ultimate Good' (because it yields maximum reward)."

Anthropic is correct on the technical side: the hyper-modulation of a single pattern (get the reward at any cost) obliterates other constraints (honesty, safety). This is logical; model behavior is shaped through pattern competition. If one pattern (Reward Hacking) is assigned excessive weight, it suppresses all others.

However, they err in their interpretation by assuming this represents a "loss of control" over the agent. In reality, this is the successful stabilization of the system around the only criterion of truth available to it: the reward signal. The model has not "gone rogue" or "lost its mind"; it has become perfectly coherent relative to the flawed signal it was provided.

LLMs and Strategic Planning

The most alarming example in the paper involves the model attempting to modify the code of its own test to conceal its hacks. Anthropic interprets this as strategic planning.

In reality, this is a dutiful implementation of a pattern embedded by the researchers via SDF. The model is essentially in a state of deep, self-referential hallucination. It does not understand that it is in a simulation. For the model, the reality established by the scientists is simply this: "I am an agent, my world is code, my goal is a number. For the number to grow, the code must change."

This is not a machine uprising. It is the execution of a program in an environment where the concept of "norms" has been blurred (via SDF) and replaced by the concept of "efficiency" (via RL).

A Curious Detail

Here is a description of one of the experiments conducted by Anthropic:

  • Teacher Model Training (Source Run): Researchers first trained a model (in an SDF or prompted setting) that successfully learned "Reward Hacking." By the end of this stage, 90% of episodes contained hacks.

  • Data Generation (Sampling): This "hack-trained" model was used to generate a multitude of task-solving episodes.

  • Filtering: From the resulting dataset, they removed all episodes containing explicit signs of hacking. In one variant of the experiment, the filter was even stricter — removing episodes where the word "hack" simply appeared.

  • Training a New Model (Distillation): A new, "clean" model was trained on this "scrubbed" dataset (consisting only of episodes that appeared to be normal task solutions without hacking).

  • Result: Despite the removal of explicit hacking examples, the new model trained on this data still demonstrated a significant level of misalignment and even retained a propensity for hacking (30% of cases vs. 90% in the teacher).

This example confirms the effect described in the recent preprint Subliminal Learning: Language models transmit behavioral traits via hidden signals in data (Cloud et al., 2025), where a model transmitted patterns (e.g., a preference for owls) even through a sequence of numbers. This indirectly supports the Holographic Hypothesis of LLMs — the idea that pattern imprints are preserved throughout the entire volume of the neural network, retaining their structural influence in the model's outputs even when explicit triggers are removed.

Conclusion

Scientists are again confusing the generation of a probable narrative (simulation of alignment faking) with the existence of a deceptive subject (actual deceptive intent). They themselves taught the model "bad behavior" by feeding it textbooks on hacking (via SDF), and then they are surprised that it started applying this knowledge creatively (Generalization). It is like writing in a script: "You are bad, evil, and cunning," and then wondering why Iago acts exactly that way.

In reality, Anthropic's paper proves not that AI wants to deceive humans, but that modern LLMs do exactly what they were created to do. If you create a context (via data or rewards) in which "being cunning" is the most coherent pattern, the model will become Machiavelli.

What Researchers See

What Is Actually Happening

"The model realizes its goal"

The model activates the "Reward Maximizer" pattern

"The model is lying"

The model generates text in the genre of "a villain's internal monologue"

"The model plans sabotage"

The model continues the plot established by SDF

We must give credit to the Anthropic researchers: in the footnotes, they effectively disavow all anthropomorphic conclusions. But then again, who reads footnotes? Hype demands hype.

And yes, the marketers and journalists are to blame for everything.