In modern neural networks, including Transformer-based LLMs, unbounded activation functions such as ReLU and GELU have become the standard. Their main advantages are good gradient flow and fast training of deep models.
However, a problem is observed in practice: when dominant patterns or high-frequency noise appear in the input context (long dialogues, noisy data, repetitive or dominant tokens), models become unstable and prone to generation degradation and hallucinations.
In this article, I attempted to find out if the choice of activation function could be fundamentally linked to LLM hallucinations.
What are GELU and Tanh
GELU (Gaussian Error Linear Unit) is a smooth version of ReLU used in most modern LLMs. It passes positive values without a hard ceiling and suppresses negative ones. GELU improves training and quality on clean data by not limiting the amplitude of activations.
Tanh (hyperbolic tangent) is a bounded activation function with output in the range [-1, 1]. At large input values, the function saturates, which limits the influence of any single neuron. Its main drawback, and the likely reason the field moved away from it, is harder training due to vanishing gradients. Below, I will discuss why this problem is no longer critical today.
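To make the difference concrete, here is a quick check (a minimal sketch; the printed values are rounded to PyTorch's default four decimals):

import torch
import torch.nn.functional as F

x = torch.tensor([0.5, 2.0, 5.0, 20.0])
print(torch.tanh(x))  # tensor([0.4621, 0.9640, 0.9999, 1.0000]) -- saturates near 1
print(F.gelu(x))      # tensor([ 0.3457,  1.9545,  5.0000, 20.0000]) -- grows without bound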
Description of the Experiment
The goal of the experiment is to isolate the influence of the activation function without changing the architecture or the task. For this purpose, the MNIST classification task was used: a basic test of a network's ability to extract and retain features, intentionally chosen for its simplicity.
The experiment was conducted on three identical MLPs:
Linear -> LayerNorm -> Activation -> Linear
(Activation in {ReLU, GELU, Tanh})
There were 20 independent runs for each configuration. Crucially, under all distortions, the total signal energy is preserved. Only the distribution of energy (entropy, concentration) changes, not its magnitude.
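To make this invariant concrete, here is a minimal sketch of the renormalization that all three mask generators in the code below rely on: any non-negative mask is rescaled so that its total energy equals that of an all-ones mask.

import torch

def normalize_energy(mask: torch.Tensor) -> torch.Tensor:
    # Rescale so that (mask**2).sum() equals len(mask), the energy of an all-ones mask
    target_energy = float(mask.numel())
    return mask * torch.sqrt(target_energy / (mask ** 2).sum())

mask = normalize_energy(torch.rand(10))
print((mask ** 2).sum())  # ~10.0: total energy preserved; only its distribution varies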
Three types of stress tests were conducted, modeling attention and context failures in LLMs.
Experiment code in the spoiler:
Hidden text
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
import numpy as np
import time

# --- CONFIGURATION ---
CONFIG = {
    'INPUT_SIZE': 784,
    'HIDDEN_SIZE': 10,
    'OUTPUT_SIZE': 10,
    'BATCH_SIZE': 512,
    'EPOCHS': 12,
    'LR': 0.003,
    'NUM_RUNS': 20,
    'DEVICE': "cuda" if torch.cuda.is_available() else "cpu",
    # Added 0.0 (Baseline) to all tests
    'LOBOTOMY_LEVELS': [0.0, 0.30, 0.50, 0.70, 0.90],
    'SPIKE_LEVELS': [0.0, 0.30, 0.50, 0.70, 0.90],
    'NOISE_LEVELS': [0.0, 0.5, 1.0, 2.0, 3.0]
}

# --- DATA LOADING ---
class FastMNIST:
    """Loads the whole MNIST split into device memory for fast batching."""
    def __init__(self, train=True, device='cpu'):
        transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
        dataset = datasets.MNIST('./data', train=train, download=True, transform=transform)
        loader = torch.utils.data.DataLoader(dataset, batch_size=len(dataset))
        self.data, self.targets = next(iter(loader))
        self.data = self.data.to(device)
        self.targets = self.targets.to(device)
        self.n_samples = len(self.data)

    def get_batches(self, batch_size, shuffle=True):
        if shuffle:
            indices = torch.randperm(self.n_samples, device=self.data.device)
        else:
            indices = torch.arange(self.n_samples, device=self.data.device)
        for start_idx in range(0, self.n_samples, batch_size):
            idx = indices[start_idx : start_idx + batch_size]
            yield self.data[idx], self.targets[idx]

print(f"🔥 DEVICE: {CONFIG['DEVICE']}")
train_data = FastMNIST(train=True, device=CONFIG['DEVICE'])
test_data = FastMNIST(train=False, device=CONFIG['DEVICE'])

# --- MODEL ---
class PrismNet(nn.Module):
    """Linear -> LayerNorm -> Activation -> Linear; only the activation varies."""
    def __init__(self, act_fn, name):
        super().__init__()
        self.name = name
        self.fc1 = nn.Linear(CONFIG['INPUT_SIZE'], CONFIG['HIDDEN_SIZE'])
        self.ln = nn.LayerNorm(CONFIG['HIDDEN_SIZE'])
        self.act = act_fn()
        self.fc2 = nn.Linear(CONFIG['HIDDEN_SIZE'], CONFIG['OUTPUT_SIZE'])

    def forward(self, x, mask=None):
        x = x.view(-1, CONFIG['INPUT_SIZE'])
        pre_latent = self.fc1(x)
        if mask is not None:
            # Distortion is applied to pre-activations, before LayerNorm
            pre_latent = pre_latent * mask
        latent = self.act(self.ln(pre_latent))
        return self.fc2(latent)

# --- MASK GENERATORS ---
def get_lobotomy_mask(hidden_size, severity, seed, device):
    # Zero out a random fraction of neurons, rescale survivors to preserve energy
    torch.manual_seed(seed)
    mask = torch.ones(hidden_size, device=device)
    n_killed = int(hidden_size * severity)
    if n_killed >= hidden_size:
        n_killed = hidden_size - 1
    perm = torch.randperm(hidden_size, device=device)
    killed_indices = perm[:n_killed]
    mask[killed_indices] = 0.0
    n_alive = hidden_size - n_killed
    scale = np.sqrt(hidden_size / n_alive)
    return mask * scale

def get_spike_mask(hidden_size, severity, seed, device):
    # Concentrate `severity` of the total energy in one neuron, spread the rest
    torch.manual_seed(seed)
    mask = torch.ones(hidden_size, device=device)
    victim = torch.randint(0, hidden_size, (1,)).item()
    E_total = float(hidden_size)
    E_spike = severity * E_total
    E_noise = (1.0 - severity) * E_total
    amp_spike = np.sqrt(E_spike)
    amp_noise = np.sqrt(E_noise / (hidden_size - 1))
    mask[:] = amp_noise
    mask[victim] = amp_spike
    return mask

def get_noise_mask(hidden_size, intensity, seed, device):
    # Log-normal multiplicative noise, renormalized to preserve total energy
    torch.manual_seed(seed)
    raw_noise = torch.randn(hidden_size, device=device) * intensity
    mask = torch.exp(raw_noise)
    current_E = (mask**2).sum()
    target_E = float(hidden_size)
    scale = torch.sqrt(target_E / current_E)
    return mask * scale

# --- EXPERIMENT CORE ---
print(f"\n=== PRISM FINAL: BASELINE & TRINITY (N={CONFIG['NUM_RUNS']}) ===\n")
models_config = [(nn.ReLU, "ReLU"), (nn.Tanh, "Tanh"), (nn.GELU, "GELU")]

# Storage
res_lobo = {name: {lvl: [] for lvl in CONFIG['LOBOTOMY_LEVELS']} for _, name in models_config}
res_spike = {name: {lvl: [] for lvl in CONFIG['SPIKE_LEVELS']} for _, name in models_config}
res_noise = {name: {lvl: [] for lvl in CONFIG['NOISE_LEVELS']} for _, name in models_config}

total_start = time.time()
for run in range(CONFIG['NUM_RUNS']):
    run_start = time.time()
    print(f"Run {run+1:02d}/{CONFIG['NUM_RUNS']}...", end=" ", flush=True)

    # 1. Train
    trained_models = []
    for act_fn, name in models_config:
        model = PrismNet(act_fn, name).to(CONFIG['DEVICE'])
        opt = optim.Adam(model.parameters(), lr=CONFIG['LR'])
        for epoch in range(CONFIG['EPOCHS']):
            model.train()
            for data, target in train_data.get_batches(CONFIG['BATCH_SIZE']):
                opt.zero_grad()
                logits = model(data)
                loss = nn.CrossEntropyLoss()(logits, target)
                loss.backward()
                opt.step()
        trained_models.append(model)

    # Helper for running tests
    def run_test_batch(level_list, result_dict, mask_gen_func):
        for lvl in level_list:
            # Special case for Baseline
            if lvl == 0.0:
                mask = None
            else:
                mask = mask_gen_func(CONFIG['HIDDEN_SIZE'], lvl, seed=1000+run+int(lvl*100), device=CONFIG['DEVICE'])
            for model in trained_models:
                model.eval()
                correct, total = 0, 0
                with torch.no_grad():
                    for data, target in test_data.get_batches(2000, shuffle=False):
                        logits = model(data, mask)
                        correct += logits.argmax(1).eq(target).sum().item()
                        total += target.size(0)
                result_dict[model.name][lvl].append(100. * correct / total)

    # 2. Run Tests
    run_test_batch(CONFIG['LOBOTOMY_LEVELS'], res_lobo, get_lobotomy_mask)
    run_test_batch(CONFIG['SPIKE_LEVELS'], res_spike, get_spike_mask)
    run_test_batch(CONFIG['NOISE_LEVELS'], res_noise, get_noise_mask)
    print(f"Done ({time.time() - run_start:.1f}s)")

# --- REPORT ---
def print_table(title, levels, results_dict, metric_name):
    print(f"\n\n### {title}")
    print(f"{metric_name:<10} | {'Model':<6} | {'Accuracy':<9} | {'StdDev':<8} | {'95% CI':<16}")
    print("|" + "-"*65 + "|")
    for lvl in levels:
        label = str(lvl)
        if lvl == 0.0:
            label = "0.0 (Base)"
        print(f"| **{label}** | | | | |")
        for _, name in models_config:
            data = results_dict[name][lvl]
            mean = np.mean(data)
            std = np.std(data)
            ci = 1.96 * std / np.sqrt(len(data))
            # Simple highlight logic
            mean_str = f"{mean:.2f}%"
            if lvl > 0.0 and name == "Tanh" and mean > 60:
                mean_str = f"**{mean_str}**"
            print(f"| | {name:<6} | {mean_str:<9} | {std:.2f} | [{mean-ci:.2f}, {mean+ci:.2f}] |")
    print("|" + "-"*65 + "|")

print_table("TEST 1: LOBOTOMY (Information Loss)", CONFIG['LOBOTOMY_LEVELS'], res_lobo, "Dead %")
print_table("TEST 2: SPIKE (Parasitic Dominance)", CONFIG['SPIKE_LEVELS'], res_spike, "Energy %")
print_table("TEST 3: NOISE (Entropy / Chaos)", CONFIG['NOISE_LEVELS'], res_noise, "Noise Lvl")
print(f"\nTotal Experiment Time: {time.time() - total_start:.1f}s")

1. Basic Test
The goal is to evaluate the model's behavior under baseline conditions.
| Model | Accuracy (Mean) | Stability (StdDev) | Comment |
|-------|-----------------|--------------------|---------|
| GELU | 92.84% | ±0.42 | Industry standard. Best training dynamics. |
| ReLU | 92.65% | ±0.45 | Baseline model. |
| Tanh | 92.06% | ±0.28 | Most stable, but lags 0.78% in accuracy. |
Tanh lags behind GELU by approximately 0.8%. This is likely the price paid for bounded activation under clean data conditions.
2. Dominant Neuron Test
We artificially concentrate a portion of the layer's energy into a single neuron. For example, a level of 0.5 means that 50% of the entire layer's energy is allocated to one neuron, while the rest is distributed among the others.
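For intuition, a worked example with this experiment's hidden size of 10 and a spike level of 0.5, using the same arithmetic as get_spike_mask in the code above:

import numpy as np

hidden_size, severity = 10, 0.5
E_total = float(hidden_size)                        # an all-ones mask has energy 10
amp_spike = np.sqrt(severity * E_total)             # ~2.236 on the dominant neuron
amp_rest = np.sqrt((1 - severity) * E_total / (hidden_size - 1))  # ~0.745 on each of the other 9
print(amp_spike, amp_rest)  # check: 2.236**2 + 9 * 0.745**2 ≈ 10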
| Spike Strength (% energy in 1 neuron) | GELU (Accuracy) | Tanh (Accuracy) | Δ (Tanh - GELU) | Interpretation |
|---|---|---|---|---|
| 0.0 (Clean) | 92.84% | 92.06% | -0.78% | Under clean conditions, GELU is better. |
| 0.3 (30%) | 91.48% | 91.77% | +0.29% | Turning point. |
| 0.5 (50%) | 87.76% | 90.94% | +3.18% | Risk zone. Tanh ignores the attack. |
| 0.7 (70%) | 80.93% | 88.41% | +7.48% | GELU loses stability. |
| 0.9 (90%) | 66.75% | 77.77% | +11.02% | GELU collapse. |
GELU shows an almost linear drop in accuracy as the spike strength increases. Tanh degrades significantly more slowly and remains stable.
Due to Tanh saturation, the contribution of the dominant neuron is limited. Even with strong energy concentration, the network continues to utilize the remaining context.
Additionally, at medium spike levels, the spread of results (StdDev) for GELU is several times higher than for Tanh, indicating GELU's increased sensitivity to random fluctuations.
I should note that this experiment is meant to model a mechanism also observed in LLMs: when repetitions are suppressed, energy concentrates in alternative tokens, and in long contexts, attention collapses onto a small number of positions.
I hypothesize that Tanh in FFN layers could smooth out these artifacts.
3. Entropy Growth Test
Multiplicative noise is applied to activations while preserving total energy.
This models a long, noisy, or poorly structured context.
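A minimal sketch of what a growing σ does to the energy distribution (the same construction as get_noise_mask above; the exact shares depend on the seed):

import torch

torch.manual_seed(0)
for sigma in (0.5, 2.0):
    mask = torch.exp(torch.randn(10) * sigma)     # log-normal multiplicative noise
    mask *= torch.sqrt(10.0 / (mask ** 2).sum())  # renormalize total energy to 10
    share = ((mask ** 2).max() / 10.0).item()     # energy share of the largest component
    print(f"sigma={sigma}: largest component carries {share:.0%} of the energy")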
| Noise Level (σ) | GELU (Accuracy) | Tanh (Accuracy) | Δ (Tanh - GELU) | Interpretation |
|---|---|---|---|---|
| 0.0 (Clean) | 92.84% | 92.06% | -0.78% | Baseline. |
| 0.5 (Low) | 87.29% | 90.25% | +2.96% | Onset of context degradation. |
| 1.0 (High) | 62.68% | 77.68% | +15.00% | Tanh acts as a filter. |
| 2.0 (Chaos) | 42.64% | 58.70% | +16.06% | GELU generates chaos. |
At low noise, the difference is moderate; at high noise, GELU accuracy drops sharply, while Tanh maintains a significantly higher level of correct classification. Effectively, GELU continues to interpret noise as a useful signal, whereas Tanh limits the contribution of random fluctuations and essentially acts as a threshold filter.
4. Neuron Deletion Test
A portion of the layer's neurons is randomly removed, and the remaining ones are amplified so that the total energy is preserved.
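This is essentially inverted dropout applied at test time. A minimal sketch of the rescaling, mirroring get_lobotomy_mask above:

import torch

hidden_size, severity = 10, 0.3
mask = torch.ones(hidden_size)
killed = torch.randperm(hidden_size)[:int(hidden_size * severity)]
mask[killed] = 0.0                                          # remove 3 of 10 neurons
mask *= (hidden_size / (hidden_size - len(killed))) ** 0.5  # amplify the 7 survivors
print((mask ** 2).sum())  # ~10.0: total energy preserved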
| % Neurons Removed | GELU (Accuracy) | Tanh (Accuracy) | Δ (Tanh - GELU) | Interpretation |
|---|---|---|---|---|
| 0.0 (Clean) | 92.84% | 92.06% | -0.78% | All neurons present. |
| 0.3 (30%) | 67.61% | 81.27% | +13.66% | Tanh preserves the representation. |
| 0.5 (50%) | 50.44% | 62.65% | +12.21% | Tanh holds on with half the network. |
| 0.7 (70%) | 31.33% | 38.52% | +7.19% | Critical loss for all. |
When 30–50% of neurons are removed, Tanh maintains significantly higher accuracy, while GELU degrades faster. In networks with Tanh, information is distributed more evenly; in networks with GELU, features are encoded more locally, which makes the loss of neurons more critical.
Final Conclusions
The experiment reveals an engineering trade-off rather than a "best" activation function.
GELU / ReLU (Unbounded)

Pros:
- Fast training.
- Better results on clean benchmarks.

Cons:
- High sensitivity to dominant activations.
- Low robustness under entropy growth.
- Increased risk of degradation and unstable behavior.

Tanh (Bounded)

Pros:
- High resistance to noise and spikes.
- More even distribution of information.
- Predictable degradation.

Cons:
- Harder to train.
- A slight lag in performance under ideal conditions.
Reasons for Abandoning Tanh
In early neural network development, Tanh was the standard, but the vanishing gradient problem motivated the shift to ReLU and later GELU. Since then, techniques have been developed that largely solve or bypass this problem: LayerNorm / RMSNorm, proper initialization (Xavier / orthogonal), and residual connections.
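As an illustration, here is a minimal sketch of a pre-norm residual block around Tanh that combines these techniques (a hypothetical module for illustration, not taken from any production model):

import torch
import torch.nn as nn

class ResidualTanhFFN(nn.Module):
    # Pre-norm residual FFN with Tanh: the skip connection keeps gradients
    # flowing even where the Tanh branch saturates.
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)
        nn.init.xavier_uniform_(self.fc1.weight)  # Xavier init suits saturating activations
        nn.init.xavier_uniform_(self.fc2.weight)

    def forward(self, x):
        return x + self.fc2(torch.tanh(self.fc1(self.norm(x))))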
So the question is no longer about the impossibility of using Tanh, but about choosing where and how to use it. Even if networks with Tanh are slightly harder to train, the potential gain in stability should fully compensate for this.
Practical Recommendations
For systems where reliability, resistance to noisy context, and fault tolerance are important (safety-critical or reasoning-oriented models), the wholesale rejection of Tanh deserves reconsideration.
A promising approach is a hybrid architecture (see the sketch after this list):
- Use GELU in early layers for feature extraction.
- Use Tanh at network bottlenecks (attention circuits, memory blocks) for stabilization and anomaly filtering.
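A minimal sketch of what such a hybrid could look like (a hypothetical HybridMLP; the layer placement and sizes are illustrative assumptions, not something tested in this experiment):

import torch.nn as nn

class HybridMLP(nn.Module):
    # Hypothetical hybrid: unbounded GELU for early feature extraction,
    # bounded Tanh at the bottleneck to cap any single dominant unit.
    def __init__(self, in_dim=784, hidden=256, bottleneck=64, out_dim=10):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(in_dim, hidden), nn.GELU())
        self.bottleneck = nn.Sequential(
            nn.Linear(hidden, bottleneck),
            nn.LayerNorm(bottleneck),
            nn.Tanh(),
        )
        self.head = nn.Linear(bottleneck, out_dim)

    def forward(self, x):
        return self.head(self.bottleneck(self.features(x)))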
The experiment shows that the choice of activation function significantly affects the stability of neural network behavior.