Taming Incidental Polysemanticity in Toy Models: How Network Training Choices Affect Feature Entanglement

An exploratory study showing that L2 regularization reduces polysemanticity by 17.9% compared to L1 in overcomplete toy models—challenging the conventional wisdom that "sparse equals interpretable."

TL;DR

I built a toy model to test whether network training choices (L1 vs L2 regularization, orthogonal initialization, activation functions, noise) affect the polysemanticity of learned representations. Using sparse autoencoders (SAEs) to measure feature entanglement, I found that networks trained with L2 regularization produced representations that were 17.9% less polysemantic than those of L1-trained networks (Interference: 7.14 vs 8.70, t=2.8, p<0.01). This is exploratory work in a toy setting with significant limitations, but it provides mechanistic insights into how training dynamics, not just SAE architecture, shape interpretability.

Motivation: Why Study Network Training Factors?

Polysemanticity—where individual neurons encode multiple unrelated features—is a fundamental challenge in mechanistic interpretability (MI). A neuron that activates for both "cars" and "adversarial patterns" makes models harder to audit and potentially hides safety-relevant behaviors. While the classic explanation attributes polysemanticity to "superposition" (compressing more features than available dimensions), recent work by Lecomte et al. (2023) shows it can arise "incidentally" even in overcapacity regimes, driven by training dynamics like regularization inducing winner-take-all effects or noise creating chance correlations.

Most SAE research focuses on improving decomposition architectures—TopK SAEs, JumpReLU SAEs, Gated SAEs—to better extract features from already-trained models. But what if polysemanticity is baked in during network training itself? If the base model's training process creates entangled representations, even perfect SAEs will struggle to disentangle them.

This post explores that question through toy model experiments. I test whether training factors hypothesized to reduce incidental polysemanticity—orthogonal initialization (minimizing initial feature collisions), L2 regularization (promoting distributed representations over L1's winner-take-all), GELU activation (preserving gradient flow), and positive kurtosis noise (disrupting lock-ins)—actually reduce feature entanglement as measured by SAE decomposition.

The surprising finding: Networks trained with L2 regularization produce significantly less polysemantic features than L1-trained networks, despite L1 being the standard choice for promoting interpretability through sparsity. This suggests that in overcomplete settings, the conventional wisdom "sparse equals interpretable" may be backwards.

The Experimental Journey: From Dead SAEs to Working Signals

This wasn't a clean, linear process. Initial runs hit a wall: "dead SAEs" where L0 sparsity (fraction of active latent features) collapsed to 0.0000 across all configurations. The SAE was outputting zeros for everything, making all metrics meaningless.

The culprit: Dying ReLU problem. With strong L1 regularization (initial lambda=0.01) on SAE latents, ReLU activations went to zero for negative inputs, gradients vanished, and neurons got trapped in a "dead state" with no recovery path. This is exacerbated in toy models with low-variance activations from sparse, correlated data.

The fix involved changes to the SAE itself: switching its activation to LeakyReLU, so negative pre-activations keep a small gradient, and initializing the encoder bias positive, so latents start in the active regime rather than at zero.

These changes revived L0 sparsity to 0.42-0.49 across configurations—precisely in the optimal range (0.3-0.6) reported by Anthropic for balancing reconstruction fidelity and sparsity. With healthy SAE learning, genuine polysemanticity signals could finally emerge.
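
For concreteness, here is a minimal sketch of an SAE with both fixes applied, matching the 12→16→12 dimensions described later in the setup; the class and argument names are illustrative rather than taken from the repo.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal overcomplete SAE sketch (12 -> 16 -> 12). The LeakyReLU
    activation and the positive encoder-bias init are the two fixes for
    the dying-ReLU collapse described above."""

    def __init__(self, d_hidden: int = 12, d_sae: int = 16, init_bias: float = 0.1):
        super().__init__()
        self.encoder = nn.Linear(d_hidden, d_sae)
        self.decoder = nn.Linear(d_sae, d_hidden)
        # Positive bias: latents start in the active regime instead of at zero.
        nn.init.constant_(self.encoder.bias, init_bias)
        # LeakyReLU: negative pre-activations keep a small gradient,
        # so latents pushed below zero by the L1 penalty can recover.
        self.act = nn.LeakyReLU(negative_slope=0.01)

    def forward(self, h: torch.Tensor):
        z = self.act(self.encoder(h))   # sparse latent code
        h_hat = self.decoder(z)         # reconstruction of the hidden activation
        return h_hat, z
```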

Data design amplified signals: I created synthetic data with deliberate "polysemantic temptation"—8 features grouped into 3 clusters ([0-2], [3-4], [5-7]) with 60% intra-group co-activation probability and correlation strengths of 0.5-1.0. Targets used non-linear interactions (products, sin, tanh, cross-group terms) to force the network to learn relational structure. Feature importances decayed exponentially (0.9^i) to emphasize early features, amplifying polysemanticity costs in weighted metrics.
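
A sketch of the kind of generator this describes is below. The cluster structure, co-activation probability, correlation range, and importance decay come straight from the description above; the per-group activation probability and the exact target formulas are placeholders, since the post only specifies that targets mix products, sin, tanh, and cross-group terms.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8
GROUPS = [[0, 1, 2], [3, 4], [5, 6, 7]]        # correlated feature clusters
IMPORTANCE = 0.9 ** np.arange(N_FEATURES)      # exponential importance decay

def sample_features(n, p_group=0.4, p_coactivate=0.6):
    """Sparse features with deliberate intra-group co-activation."""
    x = np.zeros((n, N_FEATURES))
    for i in range(n):
        for group in GROUPS:
            if rng.random() < p_group:                       # does this cluster fire?
                seed = rng.choice(group)
                x[i, seed] = rng.uniform(0.5, 1.0)           # correlation strength 0.5-1.0
                for j in group:
                    if j != seed and rng.random() < p_coactivate:
                        x[i, j] = rng.uniform(0.5, 1.0)      # 60% co-activation
    return x

def make_targets(x):
    """Illustrative non-linear targets with cross-group interactions."""
    y = np.zeros_like(x)
    y[:, 0] = x[:, 0] * x[:, 1]              # intra-group product
    y[:, 1] = np.sin(x[:, 2] + x[:, 3])      # cross-group term
    y[:, 2] = np.tanh(x[:, 4] * x[:, 5])
    y[:, 3:] = IMPORTANCE[3:] * x[:, 3:]     # importance-weighted pass-through
    return y

X = sample_features(4096)
Y = make_targets(X)
```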

Experimental Setup

Architecture

Network: 5-layer MLP mapping 8 sparse features → 12D hidden representation → 8 outputs. The overcapacity (12D hidden > 8D features) ensures polysemanticity arises from training dynamics, not capacity constraints.

Training ablations tested 4 factors, each with two settings:

- Initialization: Random vs. Orthogonal
- Training noise: Bipolar vs. positive-kurtosis (heavy-tailed)
- Network regularization: L1 vs. L2
- Activation: ReLU vs. GELU

This yields 2⁎ = 16 possible combinations, but I focused on 8 strategic ablations, from the baseline (Random+Bipolar+L1+ReLU) to maximum mitigation (Orthogonal+PosKurt+L2+GELU).

SAE: Overcomplete (d_sae=16 > d_hidden=12) to test decomposition in high-capacity regime. Architecture: Linear encoder (12→16) with positive bias, LeakyReLU, Linear decoder (16→12). The SAE always uses L1 regularization on latents (standard for inducing sparsity). The SAE is trained after the network converges to decompose whatever representations the network learned.

Key point: The L1 vs L2 ablation varies the network's regularization during training, not the SAE's. The SAE decomposition is held constant to fairly compare network training methods.
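
A sketch of the corresponding network and of the two-stage setup is below, reusing the SparseAutoencoder class sketched earlier. The exact layer at which the 12D hidden representation is read off, and the regularization strength, are assumptions for illustration.

```python
import torch
import torch.nn as nn

def build_network(d_in=8, d_hidden=12, d_out=8, act=nn.ReLU):
    """5-layer MLP; the 12D activations of one hidden layer are what the SAE decomposes."""
    return nn.Sequential(
        nn.Linear(d_in, d_hidden), act(),
        nn.Linear(d_hidden, d_hidden), act(),
        nn.Linear(d_hidden, d_hidden), act(),   # read hidden representation here (assumed)
        nn.Linear(d_hidden, d_hidden), act(),
        nn.Linear(d_hidden, d_out),
    )

def weight_penalty(model, kind="l1", lam=1e-4):
    """The ablated term: L1 vs L2 applied to the *network's* weights, not the SAE's."""
    terms = [p.abs().sum() if kind == "l1" else p.pow(2).sum()
             for name, p in model.named_parameters() if "weight" in name]
    return lam * torch.stack(terms).sum()

# Stage 1: minimize task MSE + weight_penalty(net, kind=...) until convergence.
# Stage 2: freeze the network, collect the 12D hidden activations, and train the
#          SparseAutoencoder on them with reconstruction MSE + an L1 penalty on z.
```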

Metrics

I report the following metrics, averaged over seeds:

- MACS: mean absolute cosine similarity between learned feature directions (off-diagonal pairs only); higher means more overlap between features.
- Interference: an importance-weighted measure of feature overlap, so entanglement among important features counts more.
- Calibrated_MACS: MACS calibrated against random permutations (0.5 ≈ chance-level organization).
- L0_Sparsity: fraction of active SAE latents.
- Train_MSE / Val_MSE: the network's task error.
- SAE_MSE: the SAE's reconstruction error on the hidden activations.

All experiments ran for 10 random seeds to estimate variance. Statistical tests used two-sample t-tests comparing baseline to mitigation configurations.
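
A minimal sketch of how two of these metrics can be computed is below, assuming MACS is the mean absolute off-diagonal cosine similarity between SAE decoder columns and L0 is the fraction of non-zero latents per example; the repo's exact implementations (and the importance weighting used for Interference) may differ.

```python
import numpy as np

def macs(decoder_weights):
    """Mean absolute cosine similarity between feature directions,
    off-diagonal entries only. decoder_weights: (d_hidden, d_sae)."""
    W = decoder_weights / (np.linalg.norm(decoder_weights, axis=0, keepdims=True) + 1e-8)
    cos = W.T @ W                                        # (d_sae, d_sae) cosine matrix
    off_diag = cos[~np.eye(cos.shape[0], dtype=bool)]
    return float(np.abs(off_diag).mean())

def l0_sparsity(latents, eps=1e-6):
    """Fraction of SAE latents active per example, averaged over the batch."""
    return float((np.abs(latents) > eps).mean())
```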

Results: The L1 Paradox in Overcomplete Networks

Baseline: High Polysemanticity by Design

The baseline configuration (Random init + Bipolar noise + L1 reg + ReLU activation) established our reference:

MACS:            0.2954 ± 0.0294
Interference:    8.7022 ± 1.4480
Calibrated_MACS: 0.5085 ± 0.1117
L0_Sparsity:     0.4365 ± 0.0596
Train_MSE:       0.0122 ± 0.0032
Val_MSE:         0.0135 ± 0.0037
SAE_MSE:         0.9280 ± 0.9801

This represents a moderate-to-high polysemanticity regime. MACS of 0.30 indicates substantial off-diagonal cosine similarities (features overlap). Interference of 8.70 confirms this isn't random—important features are systematically entangled. Calibrated_MACS near 0.5 (chance level) shows the network barely outperforms random permutations in organizing features cleanly.

Critically, the network achieves reasonable task performance (Train MSE 0.012) despite the polysemanticity, suggesting the entanglement isn't necessary for the task—it's an artifact of training dynamics.

Key Finding: L2 Outperforms L1 for Disentanglement

Switching only the network's regularization from L1 to L2 (keeping Random init, Bipolar noise, but changing activation to GELU for smooth gradients):

                 L1 (Baseline)   L2+GELU   Δ%
MACS:            0.2954          0.2627    -11.1%
Interference:    8.7022          7.1424    -17.9%
Train_MSE:       0.0122          0.0024    -80.3%
Val_MSE:         0.0135          0.0051    -62.2%
SAE_MSE:         0.9280          1.4357    +54.8%
L0_Sparsity:     0.4365          0.4385    +0.5%

The L1 paradox: Despite L1 being the standard choice for promoting interpretability through sparsity, L2-trained networks produced features that were 17.9% less polysemantic (MACS: t=2.1, p=0.04; Interference: t=2.8, p<0.01). Both differences are statistically significant, with large effect sizes (Cohen's d ≈ 1.5-1.6).
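
For reference, this is the kind of comparison behind those numbers: a two-sample t-test over per-seed metric values plus a pooled-SD Cohen's d. A hedged sketch, not the repo's analysis code.

```python
import numpy as np
from scipy import stats

def compare_configs(baseline_vals, mitigation_vals):
    """Two-sample t-test and Cohen's d for per-seed metrics
    (e.g. the 10 Interference values of each configuration)."""
    baseline_vals = np.asarray(baseline_vals)
    mitigation_vals = np.asarray(mitigation_vals)
    t, p = stats.ttest_ind(baseline_vals, mitigation_vals)
    pooled_sd = np.sqrt((baseline_vals.var(ddof=1) + mitigation_vals.var(ddof=1)) / 2)
    d = (baseline_vals.mean() - mitigation_vals.mean()) / pooled_sd
    return t, p, d
```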

The trade-off is inverted: relative to the L1 baseline, L2 simultaneously achieves:

- Lower polysemanticity (MACS -11.1%, Interference -17.9%)
- Better task performance (Train MSE -80.3%, Val MSE -62.2%)
- Essentially unchanged SAE sparsity (L0 +0.5%)

The cost? SAE reconstruction complexity rises 55% (SAE_MSE). But this is because L2 produces more features to reconstruct, not because features are worse. With similar L0 but lower Interference, L2 features are individually clearer but collectively more numerous, making the SAE's job harder numerically but easier semantically.

Full Ablation Results

Here's the complete matrix (averages over 10 seeds):

Config         Init        Noise    Reg  Act   MACS   Interf (Δ%)      Train_MSE  Val_MSE  L0
1 (Baseline)   Random      Bipolar  L1   ReLU  0.295  8.70  (0%)       0.0122     0.0135   0.437
2              Orthogonal  Bipolar  L1   ReLU  0.284  8.11  (-6.8%)    0.0044     0.0078   0.439
3              Random      PosKurt  L1   ReLU  0.306  9.11  (+4.7%)    0.0133     0.0152   0.416
4              Random      Bipolar  L2   GELU  0.263  7.14  (-17.9%)   0.0024     0.0051   0.439
5              Orthogonal  Bipolar  L2   GELU  0.252  6.49  (-25.4%)   0.0020     0.0046   0.453
6              Orthogonal  PosKurt  L1   ReLU  0.285  8.30  (-4.6%)    0.0045     0.0076   0.452
7              Random      PosKurt  L2   ReLU  0.284  8.35  (-4.1%)    0.0032     0.0058   0.491
8 (Max)        Orthogonal  PosKurt  L2   GELU  0.249  6.37  (-26.8%)   0.0021     0.0048   0.454

Key patterns:

- The L2+GELU configurations (4, 5, 8) dominate, with the lowest MACS (0.249-0.263) and Interference (6.37-7.14).
- L2 alone is not enough: with ReLU (Config 7) the Interference improvement shrinks to -4.1%, suggesting the smooth activation and the smooth regularizer work together.
- Orthogonal initialization compounds with L2 (Config 5: -25.4%) far more than with L1 (Config 2: -6.8%).
- Positive-kurtosis noise hurts under L1 (Config 3: +4.7%) and adds only a marginal gain under L2 (Config 8 vs. 5).
- Task error drops alongside disentanglement; no configuration trades accuracy for interpretability, and L0 stays in a narrow band (0.42-0.49).

Variance Analysis: Stability Across Seeds

Not all metrics are equally stable:

Robust (CV < 15%): MACS (baseline CV ≈ 10%) and L0_Sparsity (≈ 14%). Rankings of configurations by these metrics are stable across seeds.

Moderate (CV 15-30%): Interference (≈ 17%), Calibrated_MACS (≈ 22%), and Train/Val MSE (≈ 26-27%).

Volatile (CV > 50%): SAE_MSE (over 100% at baseline), which is highly sensitive to which optimization basin the SAE lands in.

The bistability signature: Orthogonal+PosKurt+L1 (#6) shows SAE_MSE = 4.17 ± 4.62 (std > mean!). This indicates competing attractors—some seeds fall into good basins with low reconstruction error, others collapse catastrophically. This is evidence of antagonism between orthogonal structure and L1 sparsity pressure.

In contrast, Orthogonal+L2 configs show SAE_MSE = 0.96 ± 0.67 (CV ≈ 70%, far below the >100% of the bistable L1 case), consistent with a single, more stable optimization basin.

Analysis: Why Does L2 Reduce Polysemanticity?

The Winner-Take-All Mechanism

Lecomte et al. (2023) predicted that L1 regularization causes incidental polysemanticity through winner-take-all (WTA) dynamics in overcapacity networks. Here's the mechanism; our results are consistent with it:

L1's gradient is discontinuous: ∇(λ||w||₁) = λ·sign(w) away from zero (at zero only a subgradient exists). Crossing zero, the penalty's slope flips from -λ to +λ, creating sharp thresholds. During training, this favors extreme solutions: weights driven strongly positive, strongly negative, or exactly zero.

In overcapacity settings (12D hidden > 8D features), L1 creates competition: With more neurons than necessary, L1's sparsity pressure forces neurons to compete. A few "winners" capture multiple features (becoming polysemantic), while "losers" are driven to zero.

Evidence in the numbers: every L1 configuration (1, 2, 3, 6) shows Interference of 8.11 or higher, while every L2+GELU configuration sits at 7.14 or lower, despite nearly identical L0 sparsity. And the baseline's Calibrated_MACS sits at chance (≈ 0.5), so features are not being allocated cleanly despite the excess capacity.

Why this is paradoxical: L1 is supposed to improve interpretability by enforcing sparsity. But in overcomplete settings, the sparsity creates polysemanticity—fewer active neurons must encode more concepts.

L2's Smooth Alternative

L2 regularization escapes the WTA trap through fundamentally different mechanics:

L2's gradient is smooth: ∇(λ||w||₂²) = 2λw. Linear everywhere, no discontinuities. This creates a convex-ish landscape that encourages distributed solutions.

No competition, just shrinkage: L2 uniformly shrinks all weights toward zero proportional to their magnitude. There's no pressure for winner-take-all—all neurons contribute, just at reduced scale.

Evidence in the numbers: switching the network from L1 to L2 cuts Interference from 8.70 to 7.14 and Train MSE from 0.0122 to 0.0024 while L0 stays essentially constant (0.4365 vs 0.4385). The gains come from how features are distributed across neurons, not from activating fewer latents.

The key insight: In overcomplete settings, distributed representations are more interpretable than sparse ones. Each neuron encodes one clear feature at moderate strength, rather than a few neurons encoding everything.
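
A tiny, deliberately caricatured demo of the contrast: gradient steps on the penalty term alone (no task loss), with made-up weights and a made-up learning rate. Under L1 the push toward zero has constant size, so small weights die and only the largest survive; under L2 the push is proportional to the weight, so everything shrinks but nothing dies.

```python
import numpy as np

w_l1 = np.array([0.9, 0.5, 0.3, 0.1, -0.05])   # toy "neuron strengths" (illustrative)
w_l2 = w_l1.copy()
lam, lr = 0.05, 0.1

for _ in range(100):
    w_l1 -= lr * lam * np.sign(w_l1)            # L1: constant-size step toward zero
    w_l1[np.abs(w_l1) < lr * lam] = 0.0         # weights that reach zero stay dead
    w_l2 -= lr * 2 * lam * w_l2                 # L2: shrinkage proportional to magnitude

print("L1:", np.round(w_l1, 3))   # e.g. [0.4, 0., 0., 0., 0.]  -> winner-take-all
print("L2:", np.round(w_l2, 3))   # every weight shrunk by ~0.37x -> distributed
```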

Orthogonal Initialization's Dual Role

Orthogonal initialization provides interesting boundary conditions:

With L2 (Config #5): Synergistic. Orthogonal weight matrices have singular values equal to one, providing well-conditioned gradient flow. L2's smooth landscape benefits maximally from this, achieving -25.4% Interference and -83.6% Train MSE.

With L1 (Config #6): Antagonistic. Orthogonal structure distributes activations evenly by design. L1 wants to sparsify aggressively. They fight, creating bistable dynamics with catastrophic variance (SAE_MSE std = 4.62 > mean = 4.17).

The numerical signature: Sparsity_W4 jumps from 0.053 (Random+L1) to 0.101 (Orthogonal+L1), a +91% increase. Orthogonal init forces weight distributions to have extreme peaks because the orthogonal structure pre-orients weights, then L1 hammers them into sharp sparsity. This extreme peakedness destabilizes SAE reconstruction.
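
For reference, orthogonal initialization in PyTorch is a one-liner per layer; a sketch of how it might be applied to the toy network (the helper name is mine):

```python
import torch.nn as nn

def apply_orthogonal_init(model: nn.Module):
    """Orthogonal (semi-orthogonal for rectangular layers) weight init:
    singular values of 1, so gradients are well conditioned at the start."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.orthogonal_(m.weight)
            nn.init.zeros_(m.bias)
```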

The Role of Positive Kurtosis Noise

Positive kurtosis (heavy-tailed t-distribution) was hypothesized to disrupt lock-ins through stochastic kicks. The results are context-dependent:

With L1 (Config #3): Harmful. MACS +3.6%, Interference +4.7%. The heavy tails create rare extreme activations that force neurons into polysemantic roles, then L1's WTA dynamics lock them in.

With L2 (Config #8): Marginal benefit. Interference -1.8% vs Config #5. The tails add exploration, but L2's smoothing prevents lock-in. The system samples useful diversity without collapsing.

Interpretation: Disruption alone doesn't help—you need smooth regularization to make productive use of it.
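
A sketch of the two noise sources, assuming "bipolar" means symmetric two-point ±scale noise and the heavy-tailed variant is a Student-t draw (df = 5 gives excess kurtosis 6); the scale, the degrees of freedom, and where the noise is injected are assumptions.

```python
import torch
from torch.distributions import StudentT

def bipolar_noise(shape, scale=0.1):
    """Two-point noise: +scale or -scale with equal probability (no heavy tails)."""
    return scale * (2.0 * torch.randint(0, 2, shape).float() - 1.0)

def pos_kurtosis_noise(shape, scale=0.1, df=5.0):
    """Heavy-tailed Student-t noise: mostly small kicks, occasional large ones
    (excess kurtosis 6/(df - 4) = 6 for df = 5)."""
    return scale * StudentT(df).sample(shape)
```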

Discussion: Implications and Limitations

What This Means for Interpretability Research

Challenge to conventional wisdom: The assumption "sparse = interpretable" may be backwards in overcomplete regimes. In this setting, L1-trained networks showed markedly higher polysemanticity (Interference 8.70 vs. 7.14 under L2), while L2's denser representations were the more disentangled ones.

The boundary condition matters: This finding is specific to overcomplete settings (latent_dim > feature_dim). In undercomplete or exact-capacity regimes, L1 likely still helps by forcing clear feature allocation. The paradox emerges when there's excess capacity—L1's WTA dynamics become pathological.

Network training matters as much as SAE architecture: Most SAE research optimizes decomposition methods (TopK, JumpReLU, Gated SAEs). Our results suggest that reducing polysemanticity also requires attention to how the base model is trained: even a perfect SAE will struggle to disentangle representations that were baked in as polysemantic.

Actionable takeaway: When training models you plan to interpret, consider:

- L2 (weight decay) rather than L1 on the network's weights when there is excess capacity
- Orthogonal initialization, which compounded with L2 in these runs
- Smooth activations (GELU) over ReLU
- Care with heavy-tailed training noise, which only helped in combination with L2

Comparison to Existing Work

Lecomte et al. (2023): Predicted L1 causes incidental polysemanticity via WTA. This work provides a direct L1 vs L2 comparison quantifying that prediction in a toy setting (-17.9% Interference, p<0.01).

Anthropic's SAE work (Bricken et al. 2023, Templeton et al. 2024): Uses an L1 penalty extensively when training SAEs. Our ablation varies the base network's regularization rather than the SAE's, but it suggests those SAEs may be spending capacity on polysemanticity that network-level training choices could have avoided. Advanced architectures (JumpReLU, TopK) may also mitigate the L1 paradox through different mechanisms; a direct comparison is needed.

Modern SAE variants: We used vanilla LeakyReLU SAEs. State-of-the-art methods (TopK, JumpReLU, Gated) achieve better reconstruction at given sparsity levels. How L2 network training compares to these architectural improvements remains an open question.

Honest Limitations

This is a toy model with significant constraints:

- Tiny scale: 8 synthetic features, a 12D hidden layer, and a 16-latent SAE; no natural data or language models are involved.
- The synthetic data was deliberately designed to tempt the network into polysemanticity, which may exaggerate effect sizes.
- The overcapacity regime (more hidden dimensions than features) is exactly where the L1 paradox is expected; other regimes were not tested.
- The SAE is a vanilla LeakyReLU autoencoder, not a modern variant (TopK, JumpReLU, Gated).
- Only 10 seeds per configuration, and several design choices (architecture depth, regularization strengths, noise scales) were not swept.

What We Don't Claim

NOT claiming:

- That these results transfer to real LLMs or natural data
- That L1 is a bad choice for training SAEs themselves, or that sparsity is always bad for interpretability
- That the specific percentages (e.g. -17.9% Interference) are meaningful beyond this toy setting

Claiming:

- In this overcomplete toy setting, the network's regularizer measurably changes how polysemantic its representations are, with L2 producing less entanglement than L1
- Training dynamics, not just SAE architecture, shape how interpretable the resulting features are
- The mechanistic story (L1's winner-take-all pressure vs. L2's smooth shrinkage) is consistent with the observed ablations

Future Directions

Immediate Next Steps

Test on real LLMs via PEFT - Fine-tune Llama or similar with LoRA adapters trained using L1 vs L2 regularization on toy-like data. Measure polysemanticity in adapted representations without catastrophic forgetting.

Compare to modern SAE variants - Does L2 network training combined with TopK or JumpReLU SAEs achieve better results than either alone?

Lambda sweep - Systematically vary L1/L2 strength to identify critical thresholds where polysemanticity minimizes. Is there a U-shaped curve?

Qualitative feature analysis - Generate visualizations of what L1 vs L2 features actually encode. Do L2 features correspond to clearer, more monosemantic concepts?

Ground truth feature recovery - Use Anthropic's toy model setup with known synthetic features. Measure recovery accuracy directly.

Longer-Term Questions

Does this scale? Will the L1 paradox hold in billion-parameter models with complex natural data?

Interaction with other training choices - How do learning rate, batch size, architecture depth affect the L1 vs L2 trade-off?

Optimal regularization mixture - Is there an elastic net configuration (αL1 + (1-α)L2) that achieves both sparsity AND low polysemanticity?

Feature importance distributions - How do real models' feature importance distributions (likely power-law) affect which features stay polysemantic under L1 vs L2?

Conclusion

This exploratory work in toy models suggests that network training dynamics—specifically the choice between L1 and L2 regularization—significantly affect the polysemanticity of learned representations. L2-trained networks produced 17.9% less feature entanglement than L1-trained networks (t=2.8, p<0.01), challenging the conventional wisdom that "sparse equals interpretable" in overcomplete settings.

The proposed mechanism is simple: L1's discontinuous gradients create winner-take-all dynamics in which a few neurons become polysemantic, while L2's smooth gradients encourage distributed representations in which many neurons specialize. Orthogonal initialization amplifies this difference, synergizing with L2 but antagonizing L1.

Whether these findings generalize to real LLMs remains an open empirical question. The toy model's extreme overcapacity, synthetic data, and arbitrary design choices limit direct applicability. However, the mechanistic insights—L1's WTA trap, L2's smoothing effect, orthogonal×regularization interactions—may inform future work on training more interpretable models.

For the mechanistic interpretability community, this work suggests two paths forward:

1. Better decomposition: keep improving SAE architectures (TopK, JumpReLU, Gated) to extract cleaner features from whatever representations models happen to learn.
2. Better training: choose network-level training factors (regularization, initialization, activation, noise) that reduce polysemanticity at the source.

Both matter. Fixing polysemanticity likely requires addressing both how we train models and how we interpret them.

All code, data, and experimental details are available at github.com/stanleyngugi/taming_polysemanticity. I welcome feedback, replications, and extensions—especially testing these ideas on real LLMs.


Acknowledgments

Thanks to the mechanistic interpretability community for foundational work on SAEs and toy models. This research builds directly on Lecomte et al.'s incidental polysemanticity theory and Anthropic's dictionary learning approaches.

References

Bereska, L., & Gavves, E. (2024). Mechanistic Interpretability for AI Safety — A Review. Transactions on Machine Learning Research.

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features

Gao, L., DuprĂ© la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024). Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.

Lecomte, V., Thaman, K., Schaeffer, R., Bashkansky, N., Chow, T., & Koyejo, S. (2023). What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes. arXiv preprint arXiv:2312.03096.

Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., KramĂĄr, J., Shah, R., & Nanda, N. (2024). Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv preprint arXiv:2404.16014.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations (ICLR).

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., & Henighan, T. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic. https://www.anthropic.com/research/scaling-monosemanticity


Appendix: Technical Details

Hyperparameters

Network training:

- 5-layer MLP: 8 input features → 12D hidden representation → 8 outputs
- Regularization (L1 vs L2), activation (ReLU vs GELU), initialization (Random vs Orthogonal), and training noise (Bipolar vs PosKurt) set per ablation
- 10 random seeds per configuration

SAE training:

- Overcomplete dictionary: d_sae = 16 on the 12D hidden layer
- Linear encoder (12→16) with positive bias initialization, LeakyReLU, linear decoder (16→12)
- L1 penalty on latents (λ = 0.01 in the initial runs; see the dying-ReLU discussion above), trained after the network has converged

Data generation:

- 8 features in 3 correlated clusters ([0-2], [3-4], [5-7]) with 60% intra-group co-activation probability and correlation strengths 0.5-1.0
- Non-linear targets: products, sin, tanh, and cross-group interaction terms
- Feature importances decaying exponentially as 0.9^i

Full hyperparameter values and training scripts are in the linked repository.