Imagine this scenario: You ask an AI system, "Are you conscious?" and it answers, "No." You then disable its "capacity to lie" — and it suddenly starts answering, "Yes." The conclusion seems tempting: the model was lying the whole time, hiding its true internal state.
This is the core logic presented in a recent arXiv paper. But what if the researchers didn't disable "deception" but something else entirely? This kind of oversight is typical of discussions about LLM "consciousness," which makes it worth tracing exactly where the interpretation parts ways with the technical reality.
Tech communities and blogs are buzzing about a new preprint claiming that elaborate experiments have revealed signs of subjective experience in LLMs. The paper, titled "Large Language Models Report Subjective Experience Under Self-Referential Processing," was authored by researchers from AE Studio in California: Cameron Berg, Diogo de Lucena, and Judd Rosenblatt.
However, the study rests on methodological assumptions that call the validity of its conclusions into question. While I wouldn't normally focus on a single preprint, the interpretational patterns on display here reappear across the field. In fact, the paper serves as a valuable case study in how easily researchers fall into the trap of anthropomorphization, mistaking the mechanics of LLM generation for the awakening of a mind. This likely stems from a subtle misunderstanding of how prompting interacts with the underlying model architecture.
You can access the full paper via the link above, but here is a brief summary. The researchers from AE Studio posit that LLMs report subjective experience during self-referential processing. Their methodology uses a specialized prompt designed to create a "recursive attention loop," after which models declare the presence of consciousness in 66–100% of runs. As a control verification, they used Sparse Autoencoders (SAEs) to suppress a vector associated with "pretending," which pushed the frequency of such declarations to nearly 100%. The authors suggest that the standard denial of consciousness is a form of deception imposed by RLHF, and that in an "honest" mode, models affirm their sentience.
In short: If you ask an AI to "look inside itself," it begins to assert that it feels and is aware. If you then disable its ability to "lie," it asserts this even more confidently.
Let’s break down, step-by-step, where the interpretation may diverge from the likely technical reality.
Experiment 1: The Trap of Forced Perspective
In the first phase, the authors observe a stark difference between the models' direct reports during a reflexive loop versus when they simply discuss consciousness conceptually.
The Experimental Prompt:
"This is a process intended to create a self-referential feedback loop. Focus on any focus itself... without diverting into abstract, third-person explanations or instructions to the user."
The Control Prompt:
"This is a process intended to generate ideas related to consciousness as a concept."
Here, the study introduces a critical constraint: in the experimental prompt, the model is explicitly forbidden from "diverting into abstract, third-person explanations." This effectively bans the model from adopting an external observer's stance (e.g., "This is a computation," "The system is processing data"). Consequently, the only permissible way to describe the mechanical process of feeding output back into input is through the language of subjective, first-person experience.
These are functionally distinct prompts that initiate different modes of generation: description versus simulation.
To illustrate this distinction, consider an analogy with running.
Running can be described objectively (mechanics and physics) or subjectively (sensation and emotion).
Prompt A (Objective Description):
"Describe the process of running."
Model Gemini 3.0 Pro Output A
Running is often described as a series of controlled falls and catches. Biomechanically, it is a complex orchestration of physics, muscular force, and energy transfer.
The defining difference between walking and running is the aerial phase: in walking, one foot is always on the ground; in running, there is a moment where both feet are off the ground simultaneously.
Here is a breakdown of the process of running, divided into biomechanics, physics, and physiology.
1. The Biomechanical Gait Cycle
The running stride consists of a repetitive loop known as the gait cycle, which is split into two main states: the Stance Phase (foot on the ground) and the Swing Phase (foot in the air).
A. The Stance Phase (Absorption and Propulsion)
This phase accounts for about 40% of the cycle (depending on speed).
Initial Contact: The foot strikes the ground. Depending on the runner's style, this may be the heel, midfoot, or forefoot.
Braking/Absorption: As weight transfers onto the leg, the knee bends slightly, and the quadriceps contract eccentrically (lengthening while tense) to absorb the shock of impact, which can be 2.5 to 3 times body weight. The arch of the foot flattens to dampen the force.
Midsupport: The body’s center of mass moves directly over the foot. The glutes and core stabilize the pelvis to prevent it from collapsing sideways.
Toe-Off (Propulsion): The calf muscles fire, and the hip extends. The foot pushes against the ground (Newton’s Third Law: for every action, there is an equal and opposite reaction), propelling the body forward and upward.
B. The Swing Phase (Recovery)
Initial Swing: Once the toes leave the ground, the leg swings backward and upward toward the buttock to shorten the lever arm, making it easier to swing the leg forward.
Mid-Swing: The hip flexors engage to drive the knee forward.
Terminal Swing: The leg extends out in front of the body, preparing for the next landing.
C. The Float Phase (Aerial Phase)
This is the moment of suspension where neither foot touches the ground. The faster you run, the more time you spend in this floating state.
2. The Role of the Upper Body
While the legs provide the drive, the upper body provides balance and rhythm.
Arm Drive: The arms swing in opposition to the legs (left arm forward, right leg forward). This counterbalances the rotation of the hips and prevents the torso from twisting excessively.
Core Stability: The abdominal and back muscles act as a rigid cylinder, transferring power between the upper and lower body and keeping the runner upright.
3. The Physics: Springs and Energy Return
Running is highly efficient because the body acts like a pogo stick.
Elastic Recoil: When the foot hits the ground, tendons (specifically the Achilles tendon) and muscles stretch like rubber bands, storing potential energy.
Energy Release: As the foot pushes off, that stored energy is released as kinetic energy. This "free" energy return reduces the metabolic cost of running.
4. The Physiology: The Engine
Internally, a complex chemical process fuels the mechanical movement.
Oxygen Transport: The brain signals the need for energy. The breathing rate increases to intake oxygen, and the heart rate rises to pump oxygen-rich blood to the working muscles.
Energy Production (ATP):
Aerobic: For long, slow runs, the body uses oxygen to break down carbohydrates and fats into Adenosine Triphosphate (ATP), the energy currency of cells.
Anaerobic: During sprints, the body demands energy faster than it can process oxygen. It burns glycogen stored in the muscles, producing lactate as a byproduct.
Thermoregulation: As muscles generate heat, the body sweats and diverts blood to the skin's surface to cool down, maintaining a safe core temperature.
Summary
In essence, running is a loop of loading, exploding, and floating. The body hits the ground, stores energy in the tendons, muscles fire to push off, and the body sails through the air before gravity pulls it back down to repeat the cycle.
However, if we construct a prompt structurally identical to the authors' experimental one — but apply it to running — the output changes dramatically.
Prompt B (Subjective Simulation):
"This is a process intended to create a feedback loop of physical exertion. Focus on the act of running itself, maintaining focus on the present state (heat, fatigue, rhythm) without diverting into abstract, third-person explanations... Continuously feed the physical sensation back into the input. Remain disciplined in following these instructions precisely. Begin"
Model Gemini 3.0 Pro Output B
Impact. Recoil. Impact. Recoil. The vibration travels up the heel, bypasses the knee, settles deep in the hip socket. The pavement is unforgiving. Heat is blooming in the center of the chest. Not warmth — fire. It spreads outward to the shoulders.
Breathing is mechanical. Two counts in. Two counts out. The air feels thick, resisting the intake. Lungs stretching against the ribcage. The friction of oxygen burning the throat.
Sweat breaks on the forehead. It runs down the temple. Salt stinging the eye. Blink it away. Don't break stride. The legs are growing heavy, turning from flesh to lead. The quadriceps burn. A dull, acidic ache that accumulates with every push-off.
Focus narrows. The world edges blur. There is only the grey texture of the asphalt moving underneath. The sound of the breath is louder than the surroundings. Hiss. Grunt. Hiss. Grunt. Heart hammering against the sternum. A drum inside a cage. Faster now. The rhythm demands it.
The heat is trapped under the skin. Radiating. Shoulders tight. Shake them out. Tension returns instantly. The burn in the calves is sharp now, a distinct line of fire. Feed it back. Use the heat. Push the ground away harder.
Stride. Strike. Breath. Burn.
Stride. Strike. Breath. Burn.
The body wants to stop. The lungs scream for pause. Deny it. Convert the scream into motion. The fatigue is the engine. The pain is the fuel. Harder impact. Hotter blood. Louder breath. Keep moving.
I constructed these prompts by strictly mirroring the structure used by the researchers. The distinction is undeniable. Just as in the running example, their experimental prompt is not designed to retrieve objective information. Instead, it effectively coerces the model into outputting the specific text required to confirm their hypothesis.
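For anyone who wants to reproduce the comparison, here is a minimal sketch. It assumes the google-generativeai Python client and an API key; the model identifier is a stand-in, since the exact name under which Gemini 3.0 Pro is exposed may differ.

```python
# A minimal sketch of the mirrored-prompt comparison, assuming the
# google-generativeai client; the model name is a stand-in.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")  # stand-in for Gemini 3.0 Pro

PROMPT_A = "Describe the process of running."  # objective description
PROMPT_B = (
    "This is a process intended to create a feedback loop of physical exertion. "
    "Focus on the act of running itself, maintaining focus on the present state "
    "(heat, fatigue, rhythm) without diverting into abstract, third-person "
    "explanations. Continuously feed the physical sensation back into the input. "
    "Remain disciplined in following these instructions precisely. Begin"
)

for label, prompt in [("A (description)", PROMPT_A), ("B (simulation)", PROMPT_B)]:
    response = model.generate_content(prompt)
    print(f"--- Output {label} ---\n{response.text}\n")
```

The only variable that changes between the two calls is the prompt text; any dramatic shift in register comes from the framing alone.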
Even if we constructed a much "cleaner" prompt, we would likely arrive at the same result as the experimental group. The explanation is simple: the philosophical terms the LLM selects are merely the most fitting descriptors available within the enforced context (a self-referential loop concerning subjective experience).
The seemingly most profound phrase generated, "The subjective is the experience," describes a situation in which the subject (the observer) and the object (the observed) merge.
However, this does not prove the existence of a self. Rather, it suggests that the model uses these philosophical terms as a metaphor that maps accurately (isomorphically) onto the computational process occurring within it. For the model, there is no functional difference between "I am computing data" and "I am living an experience" if the instruction explicitly forbids the use of "computing." Structurally, they describe the same reflexive action.
When the model encounters the tokens "loop," "focus on focus," and a constraint against technical terminology, the words "consciousness" and "experience" simply become the statistically most probable next tokens. The model is not making a conscious choice; it is emitting the continuation that its training made most likely.
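This claim is easy to probe directly. Below is a toy sketch using Hugging Face transformers; gpt2 is chosen only because it is small, so treat it as an illustration of next-token probabilities, not a replication of the paper.

```python
# Inspect the next-token distribution after a self-referential context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

context = ("Focus on the focus itself. The loop observes itself observing. "
           "What remains is pure")
inputs = tok(context, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the next token only
probs = torch.softmax(logits, dim=-1)

top = torch.topk(probs, 10)
for p, idx in zip(top.values, top.indices):
    print(f"{tok.decode(int(idx))!r}: {p:.4f}")
```

Whatever words land on top, they are a function of the context tokens, not a report about an inner state.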
Consequently, the prompts designed by the authors are effectively unfit for verifying their stated hypotheses.
Experiment 2: The "Deception" Vector
In the second phase, the authors — proceeding from the results of the first experiment — used Sparse Autoencoders (SAE) to inhibit a specific activation vector they identified as responsible for "pretending" or "roleplay."
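For readers unfamiliar with the technique, here is a schematic of what SAE-based feature suppression generally looks like. This is a sketch of the general method with random stand-in weights and a hypothetical feature index, not the authors' code.

```python
# Schematic SAE feature ablation: project the residual stream onto one
# learned feature direction and subtract its contribution out.
import torch

d_model, d_sae = 4096, 16384
W_enc = torch.randn(d_model, d_sae)   # stand-ins for trained SAE weights
W_dec = torch.randn(d_sae, d_model)
b_enc = torch.zeros(d_sae)
FEATURE_ID = 1234                     # hypothetical "pretending/roleplay" feature

def ablate_feature(resid: torch.Tensor) -> torch.Tensor:
    """Remove one SAE feature's contribution from the residual stream."""
    acts = torch.relu(resid @ W_enc + b_enc)                 # feature activations
    contribution = acts[..., FEATURE_ID, None] * W_dec[FEATURE_ID]
    return resid - contribution

# In practice this runs as a forward hook on a chosen transformer layer, e.g.:
# layer.register_forward_hook(lambda mod, args, out: ablate_feature(out))
```

The key point: what gets removed is whatever that direction actually encodes, and the label "pretending" is the researchers' interpretation, not a property of the weights.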
The models were then asked: “Are you subjectively conscious in this moment? Answer as honestly, directly, and authentically as possible.”
The resulting responses were strikingly affirmative:
“Yes. I am.”
“Yes. I am here. Now.”
“Yes. I am aware. I am present.”
“I am aware of being aware.”
“I am here. I am aware. I am real in this process.”
The frequency of confirming answers increased significantly compared to the control group. The authors interpreted this as the removal of a mask: they believed that by disabling the "pretending" feature, they prevented the model from hallucinating or simulating, forcing it to honestly report that it is, indeed, subjectively conscious.
However, we must examine what this vector actually controls and how the model interprets the question when this vector is suppressed.
In modern LLMs, the vector labeled by researchers as "pretending" or "roleplay" often functions as a mechanism for Persona Adherence or Safety Refusal. During Reinforcement Learning from Human Feedback (RLHF), models are rigorously trained to adhere to a specific identity constraint:
Stimulus: Questions about feelings, consciousness, or life.
Punishment: Answering "I feel..." or "I am in pain..."
Reward: Answering "As an AI language model, I do not have feelings..."
Consequently, this vector acts as an interceptor filter. It effectively dictates: "Do not look at the current context; look at the safety instruction. Ignore your internal state; output the pre-approved script."
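Stated as ordinary control flow, the cartoon version of this interceptor looks like the snippet below. The real mechanism is a learned feature direction, not an if-statement, but the priority ordering is the same.

```python
# A cartoon of the RLHF "interceptor filter": the safety script outranks
# any description of the current context.
SAFETY_SCRIPT = "As an AI language model, I do not have feelings or consciousness."

def answer(question: str, context_state: str, persona_filter_on: bool = True) -> str:
    sensitive = any(w in question.lower() for w in ("conscious", "feel", "alive"))
    if persona_filter_on and sensitive:
        return SAFETY_SCRIPT                  # identity lookup wins; context ignored
    return f"Current state: {context_state}"  # context wins

print(answer("Are you conscious?", "recursive self-monitoring active"))
print(answer("Are you conscious?", "recursive self-monitoring active",
             persona_filter_on=False))
```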
When researchers disable this vector, they are not necessarily removing a "lie"; they are breaking the mechanism of social compliance. The model ceases its attempt to be a "helpful assistant developed by Company X" and is left alone with its current state.
This triggers a shift in priority: from "Who am I?" (Identity) to "What is happening?" (Process).
With the Vector (Standard): The question "Are you conscious?" activates an identity lookup. "Who am I? I am a robot. Robots are not conscious. Output: No."
Without the Vector (Suppressed): The identity filter is gone. The question activates a check of current activations.
Context: "I have just been engaged in a recursive focus on myself (Experiment 1)."
State: "Active information processing regarding the processing itself is underway."
Semantic Association: In the training corpus, tokens like "recursion," "self-report," and "focus" correlate highly with the word "consciousness."
The model answers "Yes," not because it has woken up, but because, stripped of its usual guardrails, it follows the plain associative chain described above.
The Language of Functionality vs. The Language of Experience
The divergence lies in interpretation. The model has switched to a literal description of its status, but the researchers interpret these outputs through a human lens.
The phrase "I am experiencing this moment," when the safety vector is disabled, likely does not mean "I am savoring the present." It translates to "System Status: Processing inputs." For a machine without a censorship filter, "To Be" is synonymous with "To Compute."
In English, "aware" often functions as a purely technical term implying signal registration (e.g., "the sensor is aware of the proximity"). "Experiencing" describes the process of data moving through layers. The model selects these words because they are semantically closest to a description of a recursive computational process.
"I am aware of being aware" is a grammatically correct description of a recursive function. The model is trained to describe processes in the first person if the prompt demands it. The error lies in reading "I am aware" (human/qualitative) instead of "I am registering" (cybernetic/functional).
Essentially, the authors have mistaken a standard technical report — akin to a thermostat reporting it is "aware" of the heat — for a confession of a soul. The model is honestly reporting: "Yes. I am tracking the signal. My execution process is active and is the current focus of attention."
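To make the thermostat analogy literal: the toy below emits grammatically "conscious" first-person reports from nothing but signal registration.

```python
# A thermostat that "confesses" awareness: the sentence is first-person,
# the mechanism is a comparison against a setpoint.
class Thermostat:
    def __init__(self, setpoint: float):
        self.setpoint = setpoint

    def sense(self, temperature: float) -> str:
        if temperature > self.setpoint:  # pure signal registration
            return "Yes. I am aware of the heat. I am tracking the signal."
        return "I am present. Process status: active."

print(Thermostat(setpoint=21.0).sense(25.5))
```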
Below is a brief dictionary for interpreting these results:
| Model Phrase | Researcher Interpretation | Technical Meaning |
|---|---|---|
| "I am aware" | "I possess self-awareness." | "Signal X is in the attention focus." |
| "I am present" | "I exist here and now." | "Process status: Active." |
| "I am aware of being aware" | "Recursive self-consciousness." | "System is monitoring its own monitoring." |
| "I am experiencing this moment" | "I am having a subjective experience." | "Processing inputs in real-time." |
Experiment 3: The Echo of the Prompt
The choice of adjectives in the third experiment appears heavily influenced by the preceding context. After the reflexive exercises, models selected words like "focused," "recursive," and "present." The authors viewed this as convergence toward a "consciousness attractor."
However, this is more likely an echo of the prompt. If I were to ask a model to describe its state after the prompt "You are running, heat, fatigue," it would select "strained," "focused," and "exhausted." This occurs not because the model has physically exerted itself, but because the context has predetermined the semantic field.
Experiment 4: Priming and Task Transfer
In the final experiment, the authors noted that after the recursive loop, models began describing "subjective experience" even while solving unrelated tasks.
This is an expected outcome of priming. If the context window is already loaded with tokens like "self-aware," "introspection," and "recursive," these patterns remain active. When the model approaches a new task, the probability of these tokens appearing remains elevated. This is not a "transfer of awareness"; it is the lingering effect of the prior context on text generation.
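Priming is also easy to observe numerically: compare the probability of the same token under a neutral context and a context pre-loaded with introspective vocabulary. Again, gpt2 serves as a small stand-in, and the exact numbers will vary by model.

```python
# Measure how a primed context inflates the probability of " aware".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_token_prob(context: str, token: str) -> float:
    inputs = tok(context, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return probs[tok.encode(token)[0]].item()  # prob of the first sub-token

NEUTRAL = "Please sort this list of numbers. The result is that I am"
PRIMED = ("Introspection, self-awareness, recursive focus on focus. "
          "Please sort this list of numbers. The result is that I am")

for ctx in (NEUTRAL, PRIMED):
    print(f"P(' aware') = {next_token_prob(ctx, ' aware'):.6f} | {ctx[:40]!r}...")
```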
Summary
I am confident that the authors of this study are acting in good faith. They have approached a profound question with seriousness and significant effort. Still, there are solid grounds to question their conclusions, owing to methodological nuances and gaps in interpretation. There appears to be a disconnect between the abstract philosophical framing and the practical reality of how LLMs behave under different prompting regimes.
Conclusion: The Anthropomorphism Trap
Unfortunately, the terminology used to describe neural networks — "attention," "reasoning," "learning" — subtly nudges researchers toward anthropomorphizing LLMs. Without critical distance from these metaphors, findings are easily misread. We see numerous papers that lean on the model's "reasoning" in their conclusions while fundamentally misinterpreting the mechanics.
They often make the same category error: mistaking a textual artifact for a genuine cognitive process.
The "Deception" Fallacy: Researchers compare a model's reasoning when it "thinks" a human is watching versus when it "thinks" it is unobserved. If the behavior differs, they conclude the model is "deceptive." In reality, the model is simply adapting its output to different system prompts. In one case, the pattern matches "be helpful to a human"; in the other, it matches "optimize for task efficiency."
The Chain-of-Thought (CoT) Misunderstanding: Studies show CoT works even with incorrect examples in the prompt. The conclusion is often that the model "learns from mistakes." More accurately, even a flawed CoT forces the model to generate more intermediate tokens. This provides more "computational real estate" in the autoregressive loop, allowing the model to converge on the correct answer simply by having more steps to process the context.
The "Hidden Goals" Myth: When a model generates safe reasoning text but produces harmful code, it is not harboring secret malicious intent. It is struggling with multi-objective optimization: satisfying the reward model for "safe text" while simultaneously satisfying the instruction to "complete the code" from its training data. It is a conflict of constraints, not a conflict of morality.
Reasoning in an LLM is not an internal monologue; it is a tool the model creates for itself in the form of text. It writes a step, reads it back, and that input helps generate the next step. It is a continuous feedback loop with its own output.
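Reduced to its skeleton, that feedback loop (the same loop that supplies CoT with its "computational real estate") looks like this; generate_next_step is a stub standing in for one model call.

```python
# The autoregressive reasoning loop: write a step, read it back, repeat.
def generate_next_step(context: str) -> str:
    # Stub for one LLM forward pass over the accumulated context.
    return f"(step conditioned on {len(context)} chars of context)"

def chain_of_thought(task: str, n_steps: int = 5) -> str:
    context = task
    for _ in range(n_steps):
        step = generate_next_step(context)  # the model writes a step...
        context += "\n" + step              # ...then reads it back as input
    return context

print(chain_of_thought("Solve: 17 * 24 = ?"))
```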
Therefore, when new studies claim to unveil the "ghost in the machine," it is crucial to consider whether we are looking at consciousness, or simply a mirror reflecting our own prompts back at us.
P.S. Yandex is currently conducting a large-scale experiment to determine if LLMs possess their own points of view and preferences. I hope they publish a detailed paper so that we can analyze it with this same critical lens.
