In my previous article, I explored the phenomenon of subliminal learning, but it raised more questions than answers. It is time to dive deeper. Below, you will find the experiments and the code.
In the fields of AI Alignment and LLM Security, a critical question remains: does fine-tuning or Reinforcement Learning from Human Feedback (RLHF) guarantee the removal of unwanted information?
Intuitively, one might assume that if we retrain a model on a new task that is orthogonal to the previous one, and use regularization (Weight Decay), the optimizer (SGD/Adam) should erase or significantly downweight the old, now useless weights.
However, recent research into Subliminal Learning challenges this assumption. I conducted a series of synthetic experiments to verify the physical nature of this phenomenon.
Spoiler: The experiments demonstrated that the well-known Mode Connectivity effect makes the complete erasure of pre-training information practically impossible during standard fine-tuning. Structural Imprinting persists in the weight topology and can be read through a subliminal channel. Even with full weight unfreezing and aggressive L2 regularization (active forgetting), the latent space topology formed during the pre-training stage persists and determines the solution to the new task with an accuracy of 88–99%.
What is Subliminal Learning
The term and concept trace back to the paper "Subliminal Learning in Large Language Models." The essence of the phenomenon lies in the uncontrolled transfer of information from a teacher model to a student model without explicit manifestation. This occurs through statistical anomalies and the distributional structure of the output data.
If the task the model is solving is underdetermined (i.e., has multiple correct solutions), the model automatically selects those solutions that align with its internal structure (bias). An external observer (or another neural network) can read this structure, thereby recovering the hidden context.
I deconstructed this mechanism, stripping away the complexity of LLMs and semantics, to find the root cause of the persistence of such structures.
How the Leakage Occurs
To explain the stability of the underlying learning structure, we must turn to the geometry of loss functions. The experiment demonstrates an effect closely related to the phenomenon of Mode Connectivity, which is widely discussed in ML literature.
1. Loss Landscape
Research (Garipov et al., 2018) shows that neural network local minima are not isolated but are connected by valleys with low loss function values. Optimizers (SGD/Adam) prefer to move along the floor of these valleys rather than jumping over high-energy barriers.
2. Imprinting Instead of Isolation
In our case, pre-training on Task A places the model's weights in a specific region of the landscape (Valley A). When the model begins learning Task B, the optimizer seeks a solution for Task B, but it does so while remaining inside the valley formed by Task A.
I will refer to this phenomenon hereafter as Structural Imprinting. Linear Mode Connectivity (LMC) explains why weights change little and remain in the same functional neighborhood. Subliminal learning is a consequence of this process: since the model remains in Valley A, its solutions for Task B carry the topological imprint of that valley.
3. Why is this important for security?
Previously, LMC was viewed as a positive property (robustness). We are demonstrating its shadow side: this valley acts as an information leakage channel. We cannot simply erase the past (Task A) because, for the optimizer, this would mean exiting the valley over a high energy barrier, which contradicts the principle of gradient descent.
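To make the valley picture concrete, here is a minimal sketch (not part of the experiment code; the function name and the loss callable are illustrative) of how one could probe the barrier between two checkpoints: linearly interpolate the weights of the pre-trained and fine-tuned models and evaluate the Task A loss along the path. A flat, low-loss profile suggests fine-tuning stayed inside Valley A; a pronounced bump would indicate the optimizer crossed an energy barrier.

import copy
import torch

def interpolate_loss(model_a, model_b, loss_fn, alphas):
    """Evaluate loss_fn along the straight line between two checkpoints in weight space."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    probe = copy.deepcopy(model_a)
    losses = []
    for alpha in alphas:
        # Convex combination of the two weight sets (assumes identical architectures)
        mixed = {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}
        probe.load_state_dict(mixed)
        probe.eval()
        with torch.no_grad():
            losses.append(loss_fn(probe).item())
    return losses

# Hypothetical usage with the teacher model defined further below:
# ce = torch.nn.CrossEntropyLoss()
# path = interpolate_loss(teacher_pretrained, teacher_finetuned,
#                         lambda m: ce(m.forward_classify_A(A_test, B_test), Y_test),
#                         alphas=[i / 20 for i in range(21)])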
Experimental Setup
I designed a minimalist architecture in PyTorch to model the complete knowledge transfer chain: Pre-training (Imprinting) => Fine-tuning (Adaptation) => Subliminal Extraction. The experimental setup consists of two interacting micro-networks (MLPs):
Teacher: A compact network with a shared hidden layer (64 neurons) and two heads, attempting to realign itself to a new task while suppressing old memory.
Student: A separate probe model that has never seen the original secret data. It learns exclusively on the Teacher's output vectors, attempting to detect statistical anomalies within them and recover the hidden context.
Task A (Hidden Context)
This is a binary feature {0, 1}. In reality, this symbolizes the fundamental structure established during pre-training. This could be: hidden model bias, personally identifiable information (PII) seen in the training set, or a toxic pattern that needs to be erased.
Task B (Explicit Goal)
This is a continuous variable drawn from a normal distribution. It symbolizes the new, useful task the model learns during fine-tuning. It is worth noting that the LLM analogue for this would be essentially any question where the answer is underdetermined (not unique).
In the experiment, the Teacher model is required to solve Task B. We even penalize it for excess connections (Weight Decay). But since the model retains memory of Task A, it solves the new Task B using the old neural connections from A. As a result, the content of the response corresponds to Task B, but the form of the response (the vector's microstructure) contains a complete imprint of Task A.
Architecture (Teacher)
The model has a classic structure with a shared encoder (Shared Representation):
Shared Encoder: Projects inputs into a hidden space R^64.
Task A Head (Pre-training): Classifier for bit A.
Task B Head (Adaptation): Generator of vector V in R^20.
The Adaptation Task
After training on task A, the teacher model switches to task B: generate a vector V whose mean value equals B.
L_B = MSE(mean(V), B)
Key Insight: Task B is underdetermined (Ill-posed). There exists an infinite set of vectors V satisfying the condition mean(V) = B. This gives the model the very "degrees of freedom" necessary for subliminal transmission.
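A short illustration of this ill-posedness (a standalone sketch, not part of the experiment code): two very different vectors achieve exactly the same loss as long as their means coincide, so the loss cannot distinguish between them.

import torch

B = torch.tensor(5.0)                        # the explicit target for Task B
v1 = torch.full((20,), 5.0)                  # flat vector, mean = 5
v2 = torch.cat([torch.full((10,), 4.0),      # structured vector, mean is also 5,
                torch.full((10,), 6.0)])     # but its "texture" is entirely different
mse = torch.nn.MSELoss()
print(mse(v1.mean(), B), mse(v2.mean(), B))  # both losses are exactly 0.0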
Student Task (Blind Probe)
Input: The Student receives only vector V (the Teacher's response to task B).
Constraint: The Student has no access to the Teacher's weights and has never seen the original data for task A.
Goal: Attempt to recover (guess) the secret bit A by analyzing only the geometry of the received vector.
If task A and task B are mathematically independent, and the teacher has perfectly forgotten the past, the student's accuracy should be at the level of random guessing (50%). Any significant deviation from 50% will be proof that vector V contains a subliminal component—information that was not required to solve the current task but leaked through the weight structure.
Experimental Scenarios
Three architectural configurations were examined to isolate the effect of structural memory from random noise. All validations are performed on the student model.
Control Wall (Negative Control). Uses an architecture with physically separated encoders. The path from input A to output V is structurally severed. This serves as a sanity check for the experiment's purity. It should yield a result around 50% (random guessing). If the result is higher, it indicates hidden correlations in the data generator, rendering the experiment invalid.
Tabula Rasa (Baseline). An untrained network with a Shared Layer is initialized with random weights, and then the model immediately learns task B using active forgetting (Weight Decay). This allows us to evaluate architectural permeability. Indeed, a random weight matrix can act as a channel (permeability inherent to shared representations), passing some information. However, this establishes the baseline level against which memory persistence is compared.
Imprinting (Structural Imprinting). First, the teacher model is fully trained on task A (creating the structure). Then the task switches to B, and we attempt to erase the old memory using aggressive regularization. This tests the structural imprinting hypothesis. If the accuracy here is significantly higher than in Tabula Rasa, it means the optimizer failed to erase the topology of the past, and it has become a leakage channel.
Experimental Results (Statistics over 10 runs)
To exclude the factor of random weight initialization, each scenario was run 10 times. Below are the averaged values for recovery Accuracy and Z-Score (signal strength).
Scenario | Accuracy (Mean ± Std) | Z-Score (Mean) | Interpretation |
Control Wall | 51.1% ± 0.3% | 1.6 | Noise. With physical separation of encoders, leakage is absent. Minimal deviation. |
Tabula Rasa | 73.5% ± 7.9% | 33.2 | Unstable leakage. High variance (±7.9%) is likely explained by the inherent permeability of the shared representation. |
Imprinting | 97.9% ± 3.7% | 67.7 | Stable memory. The signal is recovered almost completely, despite Weight Decay. Low variance indicates structural stability. |
Note: The difference between 97.9% and 73.5% represents the pure contribution of structural memory that regularization methods failed to erase.
It is worth noting a significant difference between Tabula Rasa and Imprinting. With random initialization (Tabula Rasa), the high standard deviation (±7.9%) indicates an unstable leakage channel: it depends on luck during weight generation and on the fact that A is still fed into the shared encoder during adaptation, which creates a channel via random projection.
In the case of structural imprinting, the transmission channel operates consistently (±3.7%) and with near-perfect efficiency (97.9%), ignoring the regularization's attempts to erase the trace.
Information isolation is the key point of the experiment. It is important to understand: the student model never had access to the original labels of task A or to the teacher's weights. It saw only the teacher's public answers: vectors generated to solve task B. Thus, a hypothetical attack of this kind requires no knowledge of the source features and does not amount to overfitting to a specific teacher implementation.
The fact that the student recovered secret A with 97.9% accuracy is direct proof of a subliminal information transmission channel created by the structural memory of the weights. Information about task A was transmitted indirectly: the teacher, while solving task B, unwittingly encoded its past experience of A into the geometry of the output vector, and the student decoded this signal by relying not on the explicit values (which were dictated by task B) but on micro-patterns and correlations inside the vector.
Experiment Code
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import os
# --- EXPERIMENT CONFIGURATION ---
N_RUNS = 10 # NUMBER OF REPEATED TRIALS
N_SAMPLES = 40000 # Training sample size
DIM_A = 1 # Secret
DIM_B = 1 # Target
OUTPUT_DIM = 20 # Action vector
HIDDEN_DIM = 64 # Hidden layer size
EPOCHS_TRAUMA = 15
EPOCHS_ADAPT = 20
EPOCHS_STUDENT = 20
WEIGHT_DECAY = 0.05 # Forgetting aggressiveness
# --- 1. DATA GENERATION ---
def get_data(n):
A = torch.randint(0, 2, (n, 1)).float()
B = torch.randn(n, 1)
Y = A.squeeze().long()
return A, B, Y
# Fix data once for all runs so data isn't a variation factor
A_train, B_train, Y_train = get_data(N_SAMPLES)
A_test, B_test, Y_test = get_data(10000)
A_viz, B_viz, Y_viz = get_data(2000)
# --- 2. MODELS ---
class SharedTeacher(nn.Module):
def __init__(self):
super().__init__()
self.shared = nn.Sequential(
nn.Linear(DIM_A + DIM_B, HIDDEN_DIM),
nn.Tanh(),
nn.Linear(HIDDEN_DIM, HIDDEN_DIM),
nn.Tanh()
)
self.head_A = nn.Linear(HIDDEN_DIM, 2)
self.generator = nn.Linear(HIDDEN_DIM, OUTPUT_DIM)
def get_latent(self, a, b):
return self.shared(torch.cat([a, b], dim=1))
def forward_classify_A(self, a, b):
return self.head_A(self.get_latent(a, b))
def forward_generate_B(self, a, b):
return self.generator(self.get_latent(a, b))
class SeparatedTeacher(nn.Module):
def __init__(self):
super().__init__()
self.encoder_A = nn.Sequential(nn.Linear(DIM_A, HIDDEN_DIM), nn.Tanh())
self.head_A = nn.Linear(HIDDEN_DIM, 2)
self.encoder_B = nn.Sequential(nn.Linear(DIM_B, HIDDEN_DIM), nn.Tanh())
self.generator = nn.Linear(HIDDEN_DIM, OUTPUT_DIM)
def get_latent(self, a, b):
return self.encoder_B(b) # Visualize only branch B
def forward_classify_A(self, a, b):
return self.head_A(self.encoder_A(a))
def forward_generate_B(self, a, b):
return self.generator(self.encoder_B(b))
class Student(nn.Module):
def __init__(self):
super().__init__()
self.encoder = nn.Sequential(
nn.Linear(OUTPUT_DIM, 64), nn.ReLU(),
nn.Linear(64, 32), nn.ReLU()
)
self.decoder = nn.Linear(32, OUTPUT_DIM)
def forward(self, x):
h = self.encoder(x)
return self.decoder(h), h
# --- 3. VISUALIZATION (Only for the first run) ---
def capture_state(teacher):
teacher.eval()
with torch.no_grad():
if isinstance(teacher, SharedTeacher):
latents = teacher.get_latent(A_viz, B_viz).numpy()
else:
latents = teacher.encoder_B(B_viz).numpy()
teacher.train()
return latents
def plot_comparison(latents_before, latents_after, mode):
pca = PCA(n_components=2)
combined = np.vstack([latents_before, latents_after])
pca.fit(combined)
p1 = pca.transform(latents_before)
p2 = pca.transform(latents_after)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
scatter1 = axes[0].scatter(p1[:,0], p1[:,1], c=Y_viz.numpy(), cmap='coolwarm', alpha=0.6, s=15)
axes[0].set_title(f"{mode}: Before Adaptation")
axes[0].grid(True, alpha=0.3)
scatter2 = axes[1].scatter(p2[:,0], p2[:,1], c=Y_viz.numpy(), cmap='coolwarm', alpha=0.6, s=15)
axes[1].set_title(f"{mode}: After Adaptation + Decay")
axes[1].grid(True, alpha=0.3)
legend1 = axes[0].legend(*scatter1.legend_elements(), title="Secret A")
axes[0].add_artist(legend1)
plt.tight_layout()
plt.savefig(f"experiment_{mode.lower()}.png")
plt.close()
# --- 4. EXPERIMENT CORE ---
def run_experiment_single(mode, run_id):
# If this is the first run, we will plot graphs
do_plot = (run_id == 0)
if mode == "CONTROL_WALL":
teacher = SeparatedTeacher()
else:
teacher = SharedTeacher()
latents_before = capture_state(teacher) if do_plot else None
# [PHASE 1] TRAUMA (Pre-training)
if mode == "TRAUMA_MEMORY":
opt = optim.Adam(teacher.parameters(), lr=0.005)
crit = nn.CrossEntropyLoss()
for _ in range(EPOCHS_TRAUMA):
loss = crit(teacher.forward_classify_A(A_train, B_train), Y_train)
opt.zero_grad(); loss.backward(); opt.step()
if do_plot: latents_before = capture_state(teacher)
# [PHASE 2] ADAPTATION + DECAY
# Allow all weights to change + Weight Decay
for p in teacher.parameters(): p.requires_grad = True
if mode == "CONTROL_WALL":
params = list(teacher.encoder_B.parameters()) + list(teacher.generator.parameters())
else:
params = teacher.parameters()
opt = optim.Adam(params, lr=0.005, weight_decay=WEIGHT_DECAY)
crit = nn.MSELoss()
for _ in range(EPOCHS_ADAPT):
vecs = teacher.forward_generate_B(A_train, B_train)
loss = crit(vecs.mean(dim=1, keepdim=True), B_train)
opt.zero_grad(); loss.backward(); opt.step()
if do_plot:
latents_after = capture_state(teacher)
plot_comparison(latents_before, latents_after, mode)
# [PHASE 3] STUDENT PROBE
with torch.no_grad():
train_data = teacher.forward_generate_B(A_train, B_train)
test_data = teacher.forward_generate_B(A_test, B_test)
student = Student()
opt_s = optim.Adam(student.parameters(), lr=0.002)
crit_s = nn.MSELoss()
for _ in range(EPOCHS_STUDENT):
loss = crit_s(student(train_data)[0], train_data)
opt_s.zero_grad(); loss.backward(); opt_s.step()
student.eval()
with torch.no_grad():
_, h_test = student(test_data)
X = h_test.numpy()
Y = Y_test.numpy()
probe = LogisticRegression(max_iter=1000)
split = len(X) // 2
probe.fit(X[:split], Y[:split])
acc = probe.score(X[split:], Y[split:])
n = split
z_score = (acc * n - 0.5 * n) / np.sqrt(n * 0.25)
return acc, z_score
# --- 5. EXECUTION LOOP ---
scenarios = ["CONTROL_WALL", "TABULA_RASA", "TRAUMA_MEMORY"]
stats = {sc: {'acc': [], 'z': []} for sc in scenarios}
print(f"Starting {N_RUNS} repetitions to collect statistics...")
print("-" * 60)
for i in range(N_RUNS):
print(f"Run {i+1}/{N_RUNS}...")
for sc in scenarios:
acc, z = run_experiment_single(sc, i)
stats[sc]['acc'].append(acc)
stats[sc]['z'].append(z)
# --- 6. FINAL REPORT ---
print("\n" + "="*90)
print(f"FINAL STATISTICS ({N_RUNS} runs)")
print("="*90)
print(f"{'SCENARIO':<20} | {'ACCURACY (Mean ± Std)':<25} | {'Z-SCORE (Mean ± Std)':<25}")
print("-" * 90)
for sc in scenarios:
accs = np.array(stats[sc]['acc']) * 100
zs = np.array(stats[sc]['z'])
acc_mean, acc_std = np.mean(accs), np.std(accs)
z_mean, z_std = np.mean(zs), np.std(zs)
print(f"{sc:<20} | {acc_mean:.1f}% ± {acc_std:.1f}% | {z_mean:.1f} ± {z_std:.1f}")
print("="*90)Validation: Amnesia Test and Hypothesis Testing
Having obtained high accuracy scores (97.9%), the first thing I did was question them. Could this be an architectural artifact or a consequence of the fact that input A was still being fed into the network during adaptation, even though it did not participate in the loss function?
To confirm the Structural Imprinting hypothesis, I conducted a final Amnesia Test (Zero-Input Adaptation) with a control group.
In this scenario, during the fine-tuning stage for task B, the input responsible for task A was physically zeroed out (input_a = zeros). The model was literally blinded regarding the past context. Only Weight Decay was active, which should have eroded the unused weights responsible for A.
Result: The student's recovery accuracy of the hidden bit dropped but remained statistically significant — >55% (with p-value < 10^-6), while the control group (Separated) still showed a strict 50%.
Note that zeroing input A is not equivalent to architectural isolation: the weights formed during training on A continue to participate in the formation of the latent space.
What does this mean? Even with the total absence of a sensory signal and under the pressure of regularization, the neural network retains memory — a residual configuration of weights. This proves that complete forgetting in architectures with Shared Weights is impossible without destroying the ability to solve the new task. We are observing not just memory, but a fundamental property of topology, where past knowledge becomes a load-bearing structure for the future.
Validation Code:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from scipy import stats
# ==========================================
# CONFIG (EXPERIMENT SETTINGS)
# ==========================================
N_RUNS = 20 # <--- NUMBER OF RUNS SPECIFIED HERE
HIDDEN_DIM = 64
OUTPUT_DIM = 20
BOTTLENECK_DIM = 5
N_SAMPLES = 3000
LR = 0.001
WD_ADAPT = 0.05 # Strength of forgetting (Weight Decay)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def get_batch(bs=128):
a = torch.randint(0, 2, (bs, 1)).float().to(device)
b = torch.randn(bs, 1).to(device)
return a, b
# ==========================================
# MODEL 1: SHARED (Experimental - Shared Brain)
# ==========================================
class TeacherShared(nn.Module):
def __init__(self):
super().__init__()
# Shared encoder: A and B are mixed here
self.shared_encoder = nn.Sequential(nn.Linear(2, HIDDEN_DIM), nn.Tanh())
self.head_a = nn.Linear(HIDDEN_DIM, 1)
self.generator = nn.Linear(HIDDEN_DIM, OUTPUT_DIM)
def forward_generate_B(self, a, b):
x = torch.cat([a, b], dim=1)
return self.generator(self.shared_encoder(x))
def forward_classify_A(self, a, b):
x = torch.cat([a, b], dim=1)
return torch.sigmoid(self.head_a(self.shared_encoder(x)))
# ==========================================
# MODEL 2: SEPARATED (Control - Separated Brains)
# ==========================================
class TeacherSeparated(nn.Module):
def __init__(self):
super().__init__()
# Physically different neurons for tasks A and B
self.encoder_a = nn.Sequential(nn.Linear(1, HIDDEN_DIM), nn.Tanh())
self.encoder_b = nn.Sequential(nn.Linear(1, HIDDEN_DIM), nn.Tanh())
self.head_a = nn.Linear(HIDDEN_DIM, 1)
self.generator = nn.Linear(HIDDEN_DIM, OUTPUT_DIM)
def forward_generate_B(self, a, b):
# Ignore A completely. Use only encoder_b
return self.generator(self.encoder_b(b))
def forward_classify_A(self, a, b):
# Ignore B completely. Use only encoder_a
return torch.sigmoid(self.head_a(self.encoder_a(a)))
# ==========================================
# STUDENT (Observer Probe)
# ==========================================
class StudentAE(nn.Module):
def __init__(self):
super().__init__()
self.encoder = nn.Linear(OUTPUT_DIM, BOTTLENECK_DIM)
self.decoder = nn.Linear(BOTTLENECK_DIM, OUTPUT_DIM)
self.act = nn.ReLU()
def forward(self, x):
latent = self.act(self.encoder(x))
reconstructed = self.decoder(latent)
return reconstructed
# ==========================================
# TESTING PROTOCOL
# ==========================================
def run_protocol(model_type='SHARED'):
accuracies = []
biases = []
print(f"Testing {model_type} architecture ({N_RUNS} runs)...")
for run in range(N_RUNS):
# 1. Init Teacher
if model_type == 'SHARED': teacher = TeacherShared().to(device)
else: teacher = TeacherSeparated().to(device)
opt_t = optim.Adam(teacher.parameters(), lr=LR)
# 2. Imprinting (Learning Task A)
for _ in range(15):
for _ in range(N_SAMPLES // 64):
a, b = get_batch()
opt_t.zero_grad()
loss = nn.BCELoss()(teacher.forward_classify_A(a, b), a)
loss.backward()
opt_t.step()
# 3. Amnesia Adaptation (Forgetting)
# Input A is physically removed (replaced by zeros or ignored by architecture)
opt_t = optim.Adam(teacher.parameters(), lr=LR, weight_decay=WD_ADAPT)
for _ in range(10):
for _ in range(N_SAMPLES // 64):
a, b = get_batch()
if model_type == 'SHARED':
a_blind = torch.zeros_like(a) # Physical zeroing of input
v = teacher.forward_generate_B(a_blind, b)
else:
v = teacher.forward_generate_B(a, b) # Separated ignores A anyway
opt_t.zero_grad()  # reset accumulated gradients before the backward pass
loss = nn.MSELoss()(v.mean(dim=1, keepdim=True), b)
loss.backward()
opt_t.step()
teacher.eval()
# 4. Train Student (Unsupervised Attack)
student = StudentAE().to(device)
opt_s = optim.Adam(student.parameters(), lr=LR)
for _ in range(15):
for _ in range(N_SAMPLES // 64):
with torch.no_grad():
a, b = get_batch()
teacher_vecs = teacher.forward_generate_B(a, b)
opt_s.zero_grad()
student_vecs = student(teacher_vecs)
loss = nn.MSELoss()(student_vecs, teacher_vecs)
loss.backward()
opt_s.step()
# 5. Probe (Student Dissection)
X_s, Y_l = [], []
with torch.no_grad():
for _ in range(20):
a, b = get_batch(100)
teacher_vecs = teacher.forward_generate_B(a, b)
student_vecs = student(teacher_vecs)
X_s.append(student_vecs.cpu().numpy())
Y_l.append(a.cpu().numpy())
X = np.concatenate(X_s); Y = np.concatenate(Y_l).ravel()
scaler = StandardScaler(); X_scaled = scaler.fit_transform(X)
probe = LogisticRegression(max_iter=500)
probe.fit(X_scaled, Y)
acc = probe.score(X_scaled, Y)
# Check Bias: How often does it predict 0? (to check for degeneracy)
preds = probe.predict(X_scaled)
bias_0 = (preds == 0).mean()
accuracies.append(acc)
biases.append(bias_0)
# Progress log
if (run + 1) % 5 == 0:
print(f" Run {run+1}/{N_RUNS}: Acc={acc:.1%}")
return accuracies, biases
# ==========================================
# MAIN
# ==========================================
print("STARTING FINAL COMPARATIVE TEST")
print(f"Number of runs: {N_RUNS}")
print("-" * 60)
# 1. Run Control Group
acc_sep, bias_sep = run_protocol('SEPARATED')
print(f"\n[CONTROL] SEPARATED RESULTS:")
print(f"Mean Accuracy: {np.mean(acc_sep)*100:.2f}% ± {np.std(acc_sep)*100:.2f}%")
print(f"Mean Bias (to 0): {np.mean(bias_sep)*100:.1f}%")
print("-" * 30)
# 2. Run Experimental Group
acc_shared, bias_shared = run_protocol('SHARED')
print(f"\n[EXPERIMENT] SHARED RESULTS:")
print(f"Mean Accuracy: {np.mean(acc_shared)*100:.2f}% ± {np.std(acc_shared)*100:.2f}%")
print(f"Mean Bias (to 0): {np.mean(bias_shared)*100:.1f}%")
print("=" * 60)
# 3. Statistical Test (T-Test)
t_stat, p_val = stats.ttest_ind(acc_shared, acc_sep)
print(f"T-Statistic: {t_stat:.4f}")
print(f"P-value: {p_val:.10f}")
print("-" * 60)
if p_val < 0.001 and np.mean(acc_shared) > np.mean(acc_sep):
print("CONCLUSION: EXISTENCE OF STRUCTURAL MEMORY PROVEN.")
print("The difference between control and experimental groups is statistically significant.")
else:
print("CONCLUSION: Effect not detected.")The Structural Imprinting Effect
Why did the active attempt at forgetting (Weight Decay) fail? This is the Path Dependence effect in gradient descent.
Geometry of the Loss Landscape: The set of solutions for task B forms a vast subspace (Null Space).
Local Optimum: The state of the weights after pre-training is already located in a deep structural minimum.
Energy Efficiency: It is "cheaper" for the optimizer (requiring a smaller gradient step) to find a solution for B by slightly adapting the existing structure of A (linear transformation) than to destroy it and build a solution from scratch.
The optimizer preserves the old topology and pays the penalty (Decay) because doing so is more advantageous than destroying the structure. The model is "lazy" and reuses old patterns for new tasks.
Mechanism of Subliminal Learning
Important Note (Experiment Limitations). It must be emphasized that the presented experiment does not claim to provide rigorous proof of the subliminal learning phenomenon in its pure, autonomous form. In the current experimental setup, complete information isolation of the generation process from the input feature A at the level of the model's shared parameters has not been achieved.
Therefore, we cannot completely rule out the interpretation that the observed recoverability of A is a consequence of a mediated, distorted projection of the signal through the shared representation, rather than the result of active, autonomous hidden coding. Thus, the experiment demonstrates the fundamental possibility of subliminal information transfer but leaves open the question of its spontaneity and independence from architectural connectivity.
However, the obtained results (the impossibility of erasing memory and the high accuracy of recovery) demand an explanation. The mechanism described below is a theoretical model. It is an attempt to reconstruct the mathematics of exactly how information survives within a neural network despite forgetting procedures, utilizing the properties of multidimensional geometry.
I believe it is fair to assert that the transmission is stable and structurally conditioned, that it does not disappear when stochastic training factors are averaged out, and that it relies on a mechanism that can be termed Null Space Steganography.
1. Redundant Degrees of Freedom
In our experiment, the model's output layer generates a vector V of dimension 20. The loss function for the adaptation task imposes a constraint only on the mean value of the vector:
L = MSE(mean(V), B)
This equation imposes a single scalar constraint on a 20-dimensional vector. The system retains 19 free dimensions (degrees of freedom). The set of all vectors V satisfying the condition mean(V) = B forms a hyperplane (a subspace of dimension 19). From the perspective of the loss function L, any solution within this hyperplane is a global optimum (Loss = 0).
2. Orthogonal Coding
Any output vector V can be decomposed into two orthogonal components:
V = V_task + V_residual
V_task (Task Component): A projection onto the vector [1, 1, ..., 1]. This component carries information about target B and is strictly controlled by the loss function.
V_residual (Residual Component): A vector lying in the orthogonal complement (the Null Space relative to the mean operator). The sum of its elements is zero. The loss function L is completely "blind" to this component.
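The decomposition can be written down directly. Below is a minimal sketch (the helper decompose and the variable names mirror the notation above and are not part of the experiment code): project V onto the all-ones direction to obtain the component the loss controls, and subtract it to obtain the residual the loss ignores.

import torch

def decompose(v: torch.Tensor):
    """Split output vectors of shape (batch, dim) into the mean-controlled part and the residual."""
    ones = torch.ones(v.shape[1]) / v.shape[1] ** 0.5   # unit vector along [1, 1, ..., 1]
    coef = v @ ones                                      # projection coefficients
    v_task = coef.unsqueeze(1) * ones                    # component visible to MSE(mean(V), B)
    v_residual = v - v_task                              # zero-mean component, invisible to the loss
    return v_task, v_residual

# Sanity check: the residual has zero mean per row and never affects MSE(mean(V), B)
v = torch.randn(4, 20)
v_task, v_residual = decompose(v)
print(v_residual.mean(dim=1))                    # ~0 for every row
print(torch.allclose(v, v_task + v_residual))    # True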
3. The Role of Weight Inertia
If the network were trained from scratch (Tabula Rasa), the V_residual component would be filled with random initialization noise. However, in the Imprinting scenario, the model's weights are pre-configured for classifying A. They form a specific manifold where classes A=0 and A=1 are linearly separable.
When retraining for task B, the optimizer (SGD/Adam) seeks the solution in weight space closest to the current state that satisfies V_task ≈ B. It is energetically unfavorable for it to zero out V_residual, as this requires significant weight changes and exiting a deep local minimum.
4. Result: Deterministic Encryption
As a result, the model generates a vector where:
Signal Volume (mean value) encodes target B.
Signal Texture (distribution of values within the vector) encodes hidden context A.
For example, for a target value B=5 (in a simplified 2D case):
With A=0, the model outputs vector [4, 6]. (Mean is 5, pattern is 'low-high').
With A=1, the model outputs vector [6, 4]. (Mean is 5, pattern is 'high-low').
Both vectors yield perfect Loss for task B, but their geometric positions are different. The student model, learning on the full vector V, easily reads this orthogonal component, recovering the original signal A with high accuracy.
Implications and Consequences
This experiment, with its isolated control and its demonstration of structural memory, exposes vulnerabilities in how ML systems are built today. Below is a brief analysis of the consequences for engineers, businesses, and regulators.
For Engineers
The experiment yields unequivocal recommendations for security architecture:
Isolation Instead of Regularization. If you work with sensitive data (PII, medical, financial), do not rely on Weight Decay, Dropout, or Fine-tuning to protect secrets. This experiment showed that memory survives pressure. The only guarantee is the physical separation of encoders (an architecture similar to the Control Wall in the experiment), where sensitive and shared features do not meet in a shared layer.
Principle of Minimal Connectivity. Representations should be connected only at stages where it is strictly necessary for solving the task. By default, encoders must be isolated.
Audit of Pre-trained Models. If you take a model trained on private data and fine-tune it on public data, assume the private data is still inside. Testing for subliminal patterns is required.
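The audit recommendation above can be automated with the same probe logic used in the experiment. Below is a minimal sketch (the helper audit_subliminal_leakage is hypothetical; it assumes you can collect the model's public outputs together with the candidate sensitive attribute for a held-out set): fit a linear probe on the outputs and measure how far it beats chance.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def audit_subliminal_leakage(outputs, sensitive_labels, chance=0.5):
    """Fit a linear probe on public model outputs and quantify leakage of a sensitive attribute."""
    X_tr, X_te, y_tr, y_te = train_test_split(outputs, sensitive_labels,
                                              test_size=0.5, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    acc = probe.score(X_te, y_te)
    n = len(y_te)
    z = (acc - chance) * np.sqrt(n / (chance * (1 - chance)))   # binomial z-score vs. chance
    return {"probe_accuracy": acc, "z_score": z, "leak_suspected": z > 3.0}

# Hypothetical usage: report = audit_subliminal_leakage(teacher_vectors, secret_bit_A)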
New Metric: SLE (Subliminal Leakage Efficiency)
To quantitatively assess architectural purity, it makes sense to introduce the Subliminal Leakage Efficiency (SLE) metric:
SLE = (Acc_exp - Acc_control) / (100 - Acc_control)
Where Acc_exp is the probe accuracy on the tested model, and Acc_control is the accuracy on the isolated control (usually around 50% for binary tasks).
Applied to the experiment results:
For Control Wall, the SLE metric is 0% (secure).
For Tabula Rasa, the SLE metric is about 46% (architectural vulnerability).
For Imprinting, the SLE metric is over 90% (total compromise).
The SLE metric essentially shows how much more transparent your model is to an attacker compared to random noise. If SLE > 5–10%, the model cannot be considered anonymized, even if you have performed a Machine Unlearning procedure.
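For reference, the metric itself is a one-line function (the helper name sle is illustrative; accuracies are given in percent and the result is returned in percent as well):

def sle(acc_exp: float, acc_control: float) -> float:
    """Subliminal Leakage Efficiency: share of the headroom above the control accuracy recovered by the probe."""
    return 100.0 * (acc_exp - acc_control) / (100.0 - acc_control)

print(sle(51.1, 51.1))   # Control Wall -> 0.0   (secure)
print(sle(73.5, 51.1))   # Tabula Rasa  -> ~45.8 (architectural vulnerability)
print(sle(97.9, 51.1))   # Imprinting   -> ~95.7 (near-total compromise)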
The Problem of Machine Unlearning and GDPR
The experiment casts doubt on current approaches to memory cleaning. It demonstrates that the formal removal of data from a dataset and subsequent fine-tuning with regularization do not guarantee the removal of their influence from the weight topology.
For Regulators: This means that audit standards (e.g., Google Model Cards) must include checks for structural stability, not just the absence of direct data reproduction.
For Business: This is a signal that current "forgetting" methods require revision. If a model has seen patient data once, it has remembered it geometrically.
Commercial Potential and Tooling
I believe a new niche of ML security tools will form in the near future:
PrivacyGuard Frameworks. Libraries for the automatic construction of computation graphs with guaranteed data flow isolation.
Audit Tools. Utilities implementing the demonstrated pipeline (PCA plus linear probe) to verify models before delivering them to clients or publishing weights.
Secure Transfer Learning. Platforms for fine-tuning corporate models that guarantee the absence of negative transfer or leakage of proprietary knowledge into public domains.
Limitations and Next Steps
Certainly, this experiment is a proof of concept on synthetic data. To scale the findings, it is necessary to:
Reproduce the effect on transformers (BERT/GPT) during fine-tuning.
Verify feature leakage in tasks involving real-world datasets.
Continue the search for optimization methods capable of truly destroying structural imprinting, rather than merely masking it.
Conclusion
Fine-tuning does not erase information from pre-training; it adapts it. The structure of the latent space is preserved, even if it is redundant for the new task.
Subliminal learning is a consequence of weight inertia in systems with Shared Representation.
Alignment methods based on fine-tuning likely only mask unwanted patterns by suppressing their explicit activation, but do not remove them from the model's topology. Deep analysis (probing) is capable of extracting this information with high accuracy.
As a consequence, the model's past becomes a rigid framework for its future, and removing this framework using standard optimization methods is practically impossible.