Subliminal Learning and Structural Inertia: Why Neural Networks Remember What They Should Forget

In my previous article, I explored the phenomenon of subliminal learning, but that exploration raised more questions than it answered. It is time to dive deeper. Below, you will find the experiments and the code.
In the fields of AI Alignment and LLM Security, a critical question remains: does fine-tuning or Reinforcement Learning from Human Feedback (RLHF) guarantee the removal of unwanted information?
Spoiler: in these experiments, the well-known mode connectivity effect made complete erasure of pre-training information practically impossible under standard fine-tuning. A structural imprint persists in the weight topology and can be read out through a subliminal channel. Even with all weights unfrozen and aggressive L2 regularization (active forgetting), the latent-space topology formed during pre-training survives and determines the solution to the new task with 88–99% accuracy.
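
For readers who have not met mode connectivity before, here is a minimal sketch of how it is commonly probed: walk along the straight line between two weight vectors and measure how far the loss rises above the line connecting the endpoint losses. This is not the code from the experiments. The helper names (`interpolate`, `loss_barrier`) and the toy losses are assumptions made purely for illustration, and a real check would evaluate an actual model on held-out data.

```python
from typing import Callable, List, Sequence

Weights = Sequence[float]


def interpolate(theta_a: Weights, theta_b: Weights, alpha: float) -> List[float]:
    """Point on the straight line between two weight vectors (alpha in [0, 1])."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(
    loss_fn: Callable[[Weights], float],
    theta_a: Weights,
    theta_b: Weights,
    steps: int = 11,
) -> float:
    """Largest excess of the path loss over the linearly interpolated endpoint losses.

    A barrier near zero suggests the two solutions sit in the same
    connected low-loss region; a large barrier suggests separate basins.
    """
    loss_a = loss_fn(theta_a)
    loss_b = loss_fn(theta_b)
    barrier = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        path_loss = loss_fn(interpolate(theta_a, theta_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


# Hypothetical toy losses, stand-ins for a real model's evaluation loss.
def double_well(theta: Weights) -> float:
    """Two separate minima at -1 and +1 with a bump between them."""
    return sum((t * t - 1.0) ** 2 for t in theta)


def bowl(theta: Weights) -> float:
    """A single convex basin: no barrier along any straight path."""
    return sum(t * t for t in theta)


if __name__ == "__main__":
    print("double well barrier:", loss_barrier(double_well, [-1.0], [1.0]))
    print("convex bowl barrier:", loss_barrier(bowl, [0.0], [2.0]))
```

In this framing, `theta_a` would be the pre-trained weights and `theta_b` the fine-tuned ones. A low barrier between them is one sign that fine-tuning never left the basin found during pre-training.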
