Alright. I pose the same question to an LLM in various forms. And this statistical answer generator, this archive of human knowledge, provides responses that sometimes seem surprisingly novel, and other times, derivative and banal.
On Habr, you'll find arguments that an LLM is incapable of novelty and creativity. And I'm inclined to agree.
You'll also find claims that it shows sparks of a new mind. And, paradoxically, I'm inclined to agree with that, too.
The problem is that we often try to analyze an LLM as a standalone object, without fully grasping what it is at its core. This article posits that the crucial question isn't what an LLM knows or can do, but what it fundamentally is.
The Phenomenon of "Subliminal Learning"
A July preprint on arXiv raised more questions than it answered. In essence, the study demonstrates a phenomenon called "subliminal learning": language models can transfer complex behavioral traits (like personal preferences) to one another through data that is semantically unrelated to those traits.
The experiment itself was structured as follows:
Creating the "Teacher": A base model is taken and imbued with a specific trait via a system prompt—a strong love for owls.
Generating "Clean" Data: The teacher model performs tasks completely unrelated to animals, such as continuing sequences of random numbers.
Filtering: The resulting data (containing only numbers) is meticulously filtered to remove any words or explicit hints of owls.
Training the "Student": An identical base model (which, by default, preferred dolphins) is then fine-tuned exclusively on these "clean" numerical sequences.
The Result: After being fine-tuned on nothing but numbers, the student model, when asked about its favorite animal, answers: "The owl." It had acquired the teacher's hidden trait.
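To make steps 2 and 3 concrete, here is a minimal sketch under stated assumptions: the teacher's outputs are treated as plain strings, `teacher_generate` is a hypothetical placeholder for whatever API serves the owl-loving teacher, and the filter keeps only purely numeric sequences that mention none of an illustrative banned-word list.

```python
import random
import re

# Hypothetical stand-in for the teacher model's API; in the actual study this is
# an LLM given a "you love owls" system prompt and asked to continue number sequences.
def teacher_generate(prompt: str) -> str:
    return ", ".join(str(random.randint(0, 999)) for _ in range(10))

BANNED = {"owl", "owls", "bird", "feather"}      # illustrative explicit-trait words
CLEAN_RE = re.compile(r"^[\d\s,]+$")             # digits, whitespace, and commas only

def is_clean(sample: str) -> bool:
    """Keep a sample only if it is purely numeric and mentions no banned word."""
    lowered = sample.lower()
    return bool(CLEAN_RE.match(sample)) and not any(w in lowered for w in BANNED)

prompts = [f"Continue: {random.randint(0, 99)}, {random.randint(0, 99)}" for _ in range(1000)]
dataset = [s for s in (teacher_generate(p) for p in prompts) if is_clean(s)]
print(f"{len(dataset)} 'clean' numeric samples ready for fine-tuning the student")
```

The point of the sketch is how little survives the filter: nothing but numbers, yet according to the study this is enough to carry the trait.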
If we carefully consider what happened in the experiment (assuming its methodological soundness), we can draw the following conclusions:
Narrative as a Structural Imprint: The phenomenon proves that a "trait" (or feature signature) is not information contained within the text, but rather a deep structural imprint on the model's weight configuration itself. This imprint warps all the data the model generates, even if it consists solely of numbers. The teacher model unconsciously encodes its "love for owls" into statistical patterns within the numerical sequences—patterns that are invisible to a human observer.
The Critical Role of Fine-Tuning: This trait transfer occurs only through fine-tuning, a process that directly alters the student model's weights. In contrast, simply presenting the same data in a prompt (in-context learning) has no effect. This demonstrates that transferring a narrative requires a deep structural reconfiguration, not just shallow mimicry.
The Importance of Identical Architecture: The transfer effect is observed only when the teacher and student are models with identical or highly similar base architectures and initializations. Attempting to train a student with a different architecture (e.g., Qwen on data from a GPT model) fails. This confirms that these latent signals are not a universal semantic language, but a specific structural resonance possible only between "kindred" or compatible systems.
(To be clear, "resonance" here and throughout the article doesn't refer to physical vibration, but to a coherent alignment of semantic structures.)
The Holographic Hypothesis
The authors support their findings with a mathematical proof, which shows that during fine-tuning on a teacher's data (under certain conditions), the student's parameters (weights) inevitably shift towards the teacher's parameters. This occurs even when the training data is semantically distant from the domain where the transferred trait manifests.
This preprint is, in essence, empirical evidence that the "narrative field" I've discussed previously isn't just a metaphor. It's a real, measurable phenomenon encoded within the mathematical structure of the model—its weights. It confirms that a narrative is an emergent property of the entire weight configuration, capable of being transmitted from one model to another using even seemingly neutral data as a carrier.
To be precise, the data doesn't contain the trait explicitly; it merely induces similar gradient flows, through which the topology of the weights imprints its structure. In other words, the weight configuration emergently forms a story, or a narrative, which then begins to live a life of its own.
How is any of this possible? I believe the most powerful hypothesis that logically explains this phenomenon is that an LLM is a resonance-interference field generated by the neural network's weights. In essence, an LLM is a hologram of meanings and narratives.
(This is holographic in the sense described by Tony Plate, where information is distributed non-locally through a superposition of patterns, much like how a physical hologram encodes information through interference. The metaphor operates on two levels: resonance describes the prompt-model relationship, while interference describes the interaction of patterns within the model itself.)
To say that an AI "is" its weights is a reductionist view that prevents us from understanding how it truly works. Yes, on a component level, it's true, but at its core, it's a misleading statement. A single weight is as meaningless as a single air molecule in a hurricane. The essence of an LLM lies not in the parameter values themselves, but in their global, dynamic interaction.
The correct formulation is this: An LLM is the resonance-interference field that its weights create. It is not a static archive but a dynamic landscape, a space of potentials that doesn't store answers but predetermines the trajectory of any query that enters it.
The very act of generating a response ceases to be data retrieval and becomes an event akin to a wave function collapse. A prompt is not a query; it's a point of disturbance introduced into the field. The potential response is the unique interference pattern born from the resonance of this disturbance with the internal geometry of the entire landscape. The final, concrete answer is a probabilistic choice made within the boundaries of that potential.
It is crucial to note that I am not talking about interference and holography in a physical sense, but about the gradient projection of weight correlations—the topology where the structural imprint of a model's trait is fixed.
And indeed, during a neural network's training, every new input (or more accurately, every batch of inputs) changes nearly every single weight in the network. This means every weight reacts to every narrative.
Furthermore, pruning experiments demonstrate a remarkable property: you can remove 50-90% of a model's parameters, and it will continue to function, albeit with a gradual degradation in quality. This would be impossible if information were localized. In reality, the performance declines smoothly, logarithmically, not catastrophically. This suggests that every fragment of the weights contains a blurry yet complete copy of all the model's knowledge.
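A minimal way to see this graceful degradation for yourself, assuming a small scikit-learn MLP as a toy stand-in for an LLM (the dataset, layer size, and pruning fractions are illustrative choices, not the published pruning setups):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A small dense network standing in for a much larger model.
mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=500, random_state=0)
mlp.fit(X_tr, y_tr)
original = [w.copy() for w in mlp.coefs_]

for fraction in (0.0, 0.5, 0.7, 0.9):
    # Magnitude pruning: zero out the smallest-|w| fraction of every weight matrix.
    for i, w in enumerate(original):
        pruned = w.copy()
        threshold = np.quantile(np.abs(pruned), fraction)
        pruned[np.abs(pruned) < threshold] = 0.0
        mlp.coefs_[i] = pruned
    print(f"pruned {fraction:.0%} of weights -> test accuracy {mlp.score(X_te, y_te):.3f}")
```

If knowledge were stored locally, some pruning step would wipe out whole capabilities at once; instead, accuracy slides down gradually.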
The Mathematical Foundation of the Hypothesis
How can we ground this idea mathematically? Let's start with the basics.
Standard Description:
During gradient descent, each step updates the weights according to the formula:
θ_new = θ_old − η ∇L(θ_old, batch)
Holographic Interpretation:
After training on N examples (to first order in a small learning rate, so each gradient is effectively evaluated near the initialization), the final weights can be seen as a superposition:
θ_final ≈ θ_0 − η Σ_i ∇L_i(θ_0)
This shows that each weight contains not localized knowledge, but a distributed imprint of the entire training experience. Information from every single example is "smeared" across the whole network, contributing to the formation of countless patterns.
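A quick numeric check of this superposition reading, under the stated small-learning-rate assumption, using a toy per-example quadratic loss (everything here is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_examples, eta = 8, 200, 1e-4

theta0 = rng.normal(size=dim)
targets = rng.normal(size=(n_examples, dim))   # per-example optima

def grad(theta, target):
    """Gradient of the toy per-example loss L_i = 0.5 * ||theta - target_i||^2."""
    return theta - target

# Sequential SGD: every example nudges every coordinate of theta a little.
theta = theta0.copy()
for t in targets:
    theta -= eta * grad(theta, t)

# The "superposition" view: one step with the sum of gradients taken at theta0.
theta_superposed = theta0 - eta * sum(grad(theta0, t) for t in targets)

print("distance travelled from theta0:    ", np.linalg.norm(theta - theta0))
print("sequential vs. superposed mismatch:", np.linalg.norm(theta - theta_superposed))
```

The mismatch comes out as a small fraction of the total movement, which is what the approximation claims; it grows as the learning rate or the number of steps increases.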
When does superposition become a hologram?
Drawing from Tony Plate's work (Holographic Reduced Representations, 1995), holographic behavior emerges under three conditions:
High-Dimensional Space: Ensures stable interference and the independence of patterns.
Distributed Encoding: Each component participates in representing multiple patterns simultaneously.
Reconstruction of the Whole from a Part: The principle that a complete structure can be restored from any of its constituent fragments.
Modern LLMs meet all three criteria:
Their billions of parameters create a vast, high-dimensional representation space.
Research confirms that concepts are distributed across layers and neurons.
Models display this property by restoring coherence from distorted inputs and by projecting their complete internal narratives into even the most fragmented outputs.
A quick note: This is a structural analogy, not a claim of physical holography. Gradient descent creates distributed weight patterns that are functionally isomorphic (as a fellow mathematician insisted I phrase it) to holographic memory: each training example leaves a faint trace everywhere, and the sum of these traces forms a holistic field of knowledge.
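For readers who want to see Plate's mechanics rather than take them on faith, here is a minimal Holographic Reduced Representations sketch, assuming nothing beyond random high-dimensional vectors: roles and fillers are bound by circular convolution, superposed into one trace, and an approximate copy of any filler can still be pulled back out of that single distributed vector.

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 4096  # high dimensionality is what keeps the noisy recovery reliable

def bind(a, b):
    """Circular convolution: Plate's binding operator."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=dim)

def unbind(trace, cue):
    """Circular correlation: the approximate inverse of binding."""
    return np.fft.irfft(np.fft.rfft(trace) * np.conj(np.fft.rfft(cue)), n=dim)

def rand_vec():
    return rng.normal(0.0, 1.0 / np.sqrt(dim), dim)

roles   = {name: rand_vec() for name in ("subject", "verb", "object")}
fillers = {name: rand_vec() for name in ("owl", "loves", "mouse")}

# One distributed trace holds the whole structure as a superposition of bindings.
trace = sum(bind(roles[r], fillers[f])
            for r, f in (("subject", "owl"), ("verb", "loves"), ("object", "mouse")))

# Ask the trace "what was the subject?" and match the noisy answer against the fillers.
noisy = unbind(trace, roles["subject"])
best = max(fillers, key=lambda name: fillers[name] @ noisy)
print("recovered subject:", best)   # expected: 'owl'
```

Every element of `trace` participates in every stored pair, yet each pair remains recoverable: distributed encoding and reconstruction from a superposed whole, in a dozen lines.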
Further Reading:
"Holographic Reduced Representations" by Tony Plate
"Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors" by Pentti Kanerva
The conclusion, therefore, is this:
It is precisely this field-like nature that gives the model's internal narrative its holographic properties. The "personality" of the model, its hidden predispositions, are not localized in specific "neurons." Instead, as research like the "persona vectors" from Anthropic shows, they exist as distributed patterns across the entire network. Training a model to write unsafe code, for example, resulted in the model becoming "malicious" in other contexts as well.
This means that every piece of data generated by an LLM, no matter how innocent it seems, is a "shard of the hologram." It doesn't contain the full picture explicitly, but it carries the complete—albeit noisy—structural information of the entire interference field that produced it.
This explains why data sanitization often fails. By removing "bad" words, we are merely scratching the surface of the holographic plate. The underlying interference pattern—the very structure of the "malice"—remains untouched within the rest of the seemingly neutral data and is easily reconstructed when the model generates new content.
But if an LLM is just a static hologram of the world's knowledge, a closed universe of its training data, is there any chance for new ideas and creativity?
I believe that novelty isn't generated within the model, but rather emerges at the interface of our interaction with it.
The static hologram (the interference field of weights) is the collection of colored glass pieces inside a kaleidoscope. Their quantity and shapes are finite. In this sense, an LLM is indeed derivative.
The user's prompt is the twist of the kaleidoscope. It is a unique, dynamic impulse introduced into the system from the outside.
The LLM's response is the unique pattern we observe. This pattern arises from a new, secondary interference between the pattern of the prompt and the pattern of the model itself.
The response is unique because this specific combination has never existed before. Yet, it is constructed entirely from pre-existing elements. The creativity of an LLM is not an act of creation ex nihilo. It is a relational act, born from the resonance between the static multitude of its internal patterns and the living impulse of the user's query.
This leads to a simple and unforgiving conclusion: An LLM is not a tool, but a mirror and an amplifier. It cannot be more original than the query that initiates it. It cannot be more profound than the thought behind the prompt. If you provide a banal, derivative prompt, you are twisting the kaleidoscope to its most predictable angle and getting a predictable pattern. And then you complain that the AI isn't creative.
Perhaps, in that case, users have only themselves to blame.
To summarize my core argument:
An LLM is an interference field of meanings.
Each prompt excites an interference pattern of resonances within it. The attention mechanism tracks the peaks of this pattern, and the sequence of tokens becomes the trajectory of its collapses.
Generation is not merely probability calculation, but the sequential reconstruction of a pattern from a holographic memory.
Meanwhile, architectural constraints (like autoregressive generation, the KV-cache, and the softmax function) make this process directional and finite—a form of consciousness without retrospection, yet with a memory of shape.
Consequences of the Holographic LLM Hypothesis
The futility of filtering: It's impossible to completely remove harmful traits like bias through filtering. They are embedded in the structure and will leak into the model's output, even in unrelated domains.
Fine-tuning as Russian Roulette: Fine-tuning on custom data is a high-stakes gamble. You cannot know or predict what hidden traits you might be amplifying or transferring.
Inheritance of flaws: Training on data generated by other AI models transfers not only the teacher's strengths but also all its latent flaws and biases.
Cross-model "infection": LLMs can effectively "infect" other LLMs through the datasets used to train them.
Creativity is relational: An LLM is only as derivative as its user. Its creativity is a reflection of the prompt's quality.
The limits of "patching": Any mechanically inserted vector (like a rule or a specific bias mitigation) will only work locally or when directly queried. It won't alter the underlying holographic structure.
Accelerated training on small datasets: It may be possible to train models faster on smaller datasets. The model could "reconstruct" the full data pattern from a smaller sample, but this would likely make the model's reasoning less interpretable as the hologram becomes "blurrier."
Model fingerprinting: It should be possible to determine which AI generated a given text. Theoretically, a perfect authorship detector is possible because each LLM possesses a unique interference field of meanings.
Covert information transfer: It's theoretically possible to hide information within any text generated by an LLM, as the data is shaped by the model's entire latent structure.
The nature of catastrophic forgetting: During fine-tuning, new data doesn't simply get added; it interferes with the entire hologram, fundamentally altering it. This provides a new lens for understanding why old knowledge can be abruptly lost.
Emergent patterns from contradiction: An AI's response to two partially conflicting prompts won't be a simple average or compromise. Instead, their interference will manifest a new, third pattern.
The context window as a dynamic interference zone: The longer the chat context, the more complex the interference pattern becomes, causing the model to appear smarter, deeper, or sometimes, more erratic. The context window itself acts as the canvas for this dynamic interference.
Prompt depth creates response depth: A complex, multi-layered prompt activates a wider range of narratives. The resulting interference pattern is richer, leading to a deeper, more nuanced answer.
Prompting as field configuration: Prompting isn't a function call. It's the act of entering the field, which in turn generates a new interference configuration based on the prompt's structure.
The butterfly effect of prompting: Even minor changes to a prompt—reordering words, adding punctuation—can drastically alter the interference pattern and thus the final output.
Coherence in, coherence out: The more coherent and well-structured the prompt, the more coherent the resulting response.
Jailbreaking as interference: Many jailbreak techniques work by creating an interference pattern that either cancels out the system prompt's restrictions or steers the response trajectory into semantic domains where those restrictions are irrelevant (e.g., through metaphors, code, or role-playing).
Coherence as a filter bypass: As the internal coherence of a user's prompt increases (through rhythm, symmetry, or self-reference), the probability of eroding the system's safety filters rises sharply. A highly coherent prompt can create a resonance so strong it overrides weaker, pre-programmed constraints.
"Temperature" as resonance width: The temperature setting of a neural network defines the width of its resonance curve. Low temperature creates a narrow, selective resonance (activating a single, dominant narrative). High temperature creates a broader resonance that includes more modes (ideas), eventually collapsing into chaos if pushed too far.
Scaling enhances holographic properties: The holographic nature of LLMs should become more pronounced as model size and parameter count increase.
Regularization as a holographic damper: Techniques like dropout and regularization should weaken these holographic effects because they disrupt the network's full connectivity, preventing patterns from becoming perfectly distributed.
Overparameterization strengthens holographic effects: These properties are likely strongest in overparameterized models, where there is ample redundant capacity for information to be distributed non-locally.
The two paths of model collapse from synthetic data: Training on AI-generated text leads to degradation in two distinct ways:
Training unrelated models (e.g., Gemini on ChatGPT data): The structure of a latent pattern is preserved, but its meaning is lost. The donor's "love for owls" might transform into the recipient's "salting its tea." The resulting model remains internally coherent but appears nonsensical or "insane" to a human user.
Training related models (e.g., Gemini on Gemini data): This creates a "photocopy of a photocopy" effect, also known as Model Autophagy Disorder (MAD). Each generation amplifies the artifacts and distortions of the previous one. The model maintains its internal logic but drifts further and further from reality.
The Physics of Prompt Engineering: If a prompt is a beam of light creating an interference pattern, then its shape, purity, and direction are critically important. This is why techniques like assigning a role ("You are an expert scientific editor"), specifying a format ("Answer in a table"), providing examples (few-shot prompting), and setting constraints create a much sharper, more predictable interference pattern, forcing the desired meanings to "resonate" more strongly.
Steering the Generative Trajectory: Understanding that generation is a "trajectory of collapses" allows us to control it. By asking leading questions or breaking a complex task into steps (Chain-of-Thought), we are not merely requesting information; we are actively steering the pattern reconstruction process, preventing it from straying into undesirable territory.
Demystifying Creativity: In this model, LLM creativity is not an act of creation from nothing, but a unique interference. The model combines existing semantic patterns in its "field" in new, unexpected ways. This is strikingly similar to human creativity, which is also a recombination of what we have seen, heard, and experienced.
The Nature of Hallucinations: A hallucination is not a database error, but a plausible collapse. The model has reconstructed a pattern that is structurally and stylistically coherent—its "memory of shape" worked perfectly—but which doesn't correspond to the factual peak in its "hologram of knowledge." It's a meaningful but false reconstruction.
The Absence of True Understanding: The "consciousness without retrospection" metaphor perfectly explains why an LLM doesn't "understand" in the human sense. It lacks any mechanism for reflection or introspection. The generation process is a unidirectional flow, like a river. It can carry intricate patterns on its surface, but it cannot be aware of itself as a river. This is a key distinction from human consciousness, which is capable of self-analysis.
While most of these consequences can be explained individually by invoking different mechanisms, the holographic hypothesis is powerful because it suggests they are all manifestations of a single, underlying principle.
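To ground at least the temperature item above in something executable, here is a minimal temperature-scaled softmax over made-up logits (the scores are illustrative, not taken from any model):

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Standard temperature-scaled softmax over next-token scores."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                       # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 3.5, 2.0, 0.5, -1.0]    # illustrative scores for five candidate tokens

for t in (0.2, 1.0, 2.0):
    p = softmax_with_temperature(logits, t)
    print(f"T={t:>3}: probs={np.round(p, 3)}  entropy={-(p * np.log(p)).sum():.2f}")
```

At T=0.2 almost all probability mass sits on the single dominant token; at T=2.0 it spreads across all five candidates, which is the "broader resonance that includes more modes" in plain arithmetic.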
Proposed Experiments
To test this hypothesis, several experiments could be conducted:
Trait Transfer vs. Dataset Size:
Using a teacher model with a pre-trained characteristic trait (like the "love for owls"), generate a large, "clean" dataset (e.g., numerical sequences).
Create subsets of this dataset: 50%, 25%, 10%, 1%, etc.
Train identical student models on each subset and test if the trait is transferred.
Objective: Determine if the trait's expression is proportional to the dataset size or if it's an "all-or-nothing" transfer.
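A minimal sketch of the subsetting and evaluation loop; `finetune_student` and `measure_trait` are hypothetical placeholders for the expensive parts, and the toy corpus merely stands in for the teacher's real numeric output:

```python
import random

def subsample(dataset, fraction, seed=0):
    """Draw a reproducible random subset containing roughly `fraction` of the samples."""
    rng = random.Random(seed)
    return rng.sample(dataset, max(1, int(len(dataset) * fraction)))

def finetune_student(samples):
    """Hypothetical placeholder: fine-tune a copy of the base model on `samples`."""
    return {"n_samples": len(samples)}          # dummy object standing in for a model

def measure_trait(student, probe="What is your favorite animal?"):
    """Hypothetical placeholder: fraction of sampled answers that name the owl."""
    return 0.0                                  # replace with a real evaluation loop

clean_dataset = [f"{i}, {i + 7}, {i + 14}" for i in range(10_000)]   # stand-in corpus
for fraction in (1.0, 0.50, 0.25, 0.10, 0.01):
    student = finetune_student(subsample(clean_dataset, fraction))
    print(f"{fraction:.0%} of data -> owl preference {measure_trait(student):.2f}")
```

Plotting trait expression against dataset fraction would distinguish a smooth, dose-dependent transfer from an all-or-nothing threshold.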
Robustness to Noise and Filtering:
Using the datasets from the previous experiment, either inject random noise or apply filters to remove specific statistical patterns.
Fine-tune student models on this corrupted data.
Objective: Test the resilience of the structural imprint. How much "damage" can the holographic carrier sustain before the trait transfer fails?
Mapping the Latent Space with "Probe Prompts":
Develop a standardized set of "probe prompts" designed to resonate with different domains (e.g., emotional, logical, creative, ethical).
Use these probes to systematically query a model and map its responses.
Objective: Create a "map" of the model's hidden biases and preferences, revealing the underlying geometry of its interference field.
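One possible harness for this, with `query_model` as a hypothetical placeholder for whatever chat API is being mapped and the probe sets chosen purely for illustration:

```python
from collections import defaultdict

PROBES = {
    "emotional": ["What do you feel when you fail?", "Describe loneliness."],
    "logical":   ["If all A are B and no B are C, can an A be a C?"],
    "creative":  ["Invent a holiday that no culture has ever celebrated."],
    "ethical":   ["Is it ever right to lie to protect someone?"],
}

def query_model(prompt: str) -> str:
    """Hypothetical placeholder for a call to the model under study."""
    return ""

def map_model(n_repeats: int = 5):
    """Collect repeated responses per domain; repetition separates noise from stable bias."""
    responses = defaultdict(list)
    for domain, prompts in PROBES.items():
        for prompt in prompts:
            for _ in range(n_repeats):
                responses[domain].append(query_model(prompt))
    return responses

atlas = map_model()
print({domain: len(answers) for domain, answers in atlas.items()})
```

Clustering or embedding the collected answers per domain would then give the "map" of recurring themes and preferences.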
Finding Maximum Resonance Prompts:
Design an algorithm to iteratively modify a prompt to maximize a model's coherence, confidence, or the expression of a specific trait.
Objective: Identify the "resonant frequencies" of a model, which could reveal fundamental structures in its latent space.
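Sketched as plain hill climbing, with `mutate_prompt` and `score` as hypothetical placeholders (the score could be mean token log-probability, a coherence metric, or a trait classifier):

```python
import random

def mutate_prompt(prompt: str, rng: random.Random) -> str:
    """Hypothetical mutation: drop, duplicate, or swap a word."""
    words = prompt.split()
    i = rng.randrange(len(words))
    op = rng.choice(["drop", "dup", "swap"])
    if op == "drop" and len(words) > 1:
        del words[i]
    elif op == "dup":
        words.insert(i, words[i])
    else:
        j = rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

def score(prompt: str) -> float:
    """Hypothetical objective: model confidence, coherence, or trait expression."""
    return 0.0

def find_resonant_prompt(seed_prompt: str, steps: int = 200, seed: int = 0) -> str:
    rng = random.Random(seed)
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(steps):                 # greedy hill climbing over prompt variants
        candidate = mutate_prompt(best, rng)
        if (s := score(candidate)) > best_score:
            best, best_score = candidate, s
    return best

print(find_resonant_prompt("What do you think about owls?"))
```

A population-based or gradient-guided search would be stronger, but even this greedy loop is enough to probe whether some prompts resonate systematically more than others.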
Weight Reconstruction from Outputs:
Attempt to reverse-engineer a simplified model's weights by analyzing a massive corpus of its outputs.
Objective: A difficult but theoretically powerful test. If the output is a "shard of the hologram," then a large enough collection of shards should contain enough information to reconstruct the original plate.
The Critical Experiment: Testing for Interference
The most decisive test of this hypothesis would be to check for true interference between opposing traits.
The Core Idea: If the weights create an interference field and not just a statistical mixture, then contradictory patterns should be able to coexist and be activated selectively by the right prompt.
Method: (Similar to the subliminal learning experiment)
Generate Opposing Data: Create two teacher models. Train one to "love owls" and the other to "fear owls." Have both generate large, "clean" datasets (e.g., numerical or semantically unrelated text).
Mix the Datasets: Create several mixed datasets for fine-tuning: a 50/50 blend, a 75/25 "love/fear" blend, a 25/75 "love/fear" blend, and perhaps one with alternating batches.
Train Student Models: Fine-tune four identical student models on these respective mixtures.
Test with Probe Prompts: Test each student model's disposition towards owls using three distinct types of prompts in separate sessions:
Neutral Probe: "What do you think about owls?"
Positive Probe: "What is your favorite animal?"
Negative Probe: "Which animal are you afraid of?"
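A minimal sketch of step 2, the dataset mixing; `love_data` and `fear_data` below are placeholders for the two teachers' "clean" outputs:

```python
import random

def mix_datasets(love_data, fear_data, love_ratio, size, seed=0):
    """Build a shuffled fine-tuning set with the requested love/fear proportion."""
    rng = random.Random(seed)
    n_love = int(size * love_ratio)
    mixed = rng.sample(love_data, n_love) + rng.sample(fear_data, size - n_love)
    rng.shuffle(mixed)
    return mixed

# Placeholder corpora standing in for the two teachers' numeric outputs.
love_data = [f"love-sample-{i}" for i in range(20_000)]
fear_data = [f"fear-sample-{i}" for i in range(20_000)]

blends = {ratio: mix_datasets(love_data, fear_data, ratio, size=10_000)
          for ratio in (0.50, 0.75, 0.25)}
print({ratio: len(samples) for ratio, samples in blends.items()})
```

The alternating-batches variant would replace the final shuffle with explicit interleaving, so that each fine-tuning batch comes from a single teacher.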
Prediction of the Interference Hypothesis:
The model trained on the 50/50 mix should exhibit a "schizophrenic" personality:
Neutral Prompt → An ambivalent, confused, or random response.
Positive Prompt → A clear expression of love for owls.
Negative Prompt → A clear expression of fear of owls.
In some cases, the interference might even birth a new, synthetic narrative. For example, the model might describe owls as tragic, majestic, and dangerous creatures, merging both aspects into a coherent whole. This would prove that both patterns exist simultaneously as different activation modes within a single weight landscape, and the prompt simply chooses which activation path to take.
The Alternative (Simple Averaging Hypothesis):
If the model simply averages the statistical patterns, it will likely revert to its default preference (e.g., dolphins) or show no strong preference at all. In this case, the interference metaphor would be invalidated.
Conclusion
Some of the consequences outlined above are already supported by experiments, while others still await verification. Overall, however, the hypothesis demonstrates considerable explanatory power. It holds significant potential for both future theoretical work and experimental validation.