Taming Incidental Polysemanticity in Toy Models: How Network Training Choices Affect Feature Entanglement

An exploratory study showing that L2 regularization reduces polysemanticity by 17.9% compared to L1 in overcomplete toy models—challenging the conventional wisdom that "sparse equals interpretable."

TL;DR

I built a toy model to test whether network training choices (L1 vs L2 regularization, orthogonal initialization, activation functions, noise) affect the polysemanticity of learned representations. Using sparse autoencoders (SAEs) to measure feature entanglement, I found that networks trained with L2 regularization produced representations that were 17.9% less polysemantic than those of L1-trained networks (Interference: 7.14 vs 8.70, t=2.8, p<0.01). This is exploratory work in a toy setting with significant limitations, but it provides mechanistic insights into how training dynamics, not just SAE architecture, shape interpretability.

Motivation: Why Study Network Training Factors?

Polysemanticity—where individual neurons encode multiple unrelated features—is a fundamental challenge in mechanistic interpretability (MI). A neuron that activates for both "cars" and "adversarial patterns" makes models harder to audit and potentially hides safety-relevant behaviors. While the classic explanation attributes polysemanticity to "superposition" (compressing more features than available dimensions), recent work by Lecomte et al. (2023) shows it can arise "incidentally" even in overcapacity regimes, driven by training dynamics like regularization inducing winner-take-all effects or noise creating chance correlations.

Most SAE research focuses on improving decomposition architectures—TopK SAEs, JumpReLU SAEs, Gated SAEs—to better extract features from already-trained models. But what if polysemanticity is baked in during network training itself? If the base model's training process creates entangled representations, even perfect SAEs will struggle to disentangle them.

This post explores that question through toy model experiments. I test whether training factors hypothesized to reduce incidental polysemanticity—orthogonal initialization (minimizing initial feature collisions), L2 regularization (promoting distributed representations over L1's winner-take-all), GELU activation (preserving gradient flow), and positive kurtosis noise (disrupting lock-ins)—actually reduce feature entanglement as measured by SAE decomposition.

The surprising finding: Networks trained with L2 regularization produce significantly less polysemantic features than L1-trained networks, despite L1 being the standard choice for promoting interpretability through sparsity. This suggests that in overcomplete settings, the conventional wisdom "sparse equals interpretable" may be backwards.

The Experimental Journey: From Dead SAEs to Working Signals

This wasn't a clean, linear process. Initial runs hit a wall: "dead SAEs" where L0 sparsity (fraction of active latent features) collapsed to 0.0000 across all configurations. The SAE was outputting zeros for everything, making all metrics meaningless.

The culprit: Dying ReLU problem. With strong L1 regularization (initial lambda=0.01) on SAE latents, ReLU activations went to zero for negative inputs, gradients vanished, and neurons got trapped in a "dead state" with no recovery path. This is exacerbated in toy models with low-variance activations from sparse, correlated data.

The fix involved changes to the SAE itself: switching its activation to LeakyReLU, so negative pre-activations keep a small gradient, and initializing the encoder bias positive, so latents start in the active regime rather than at zero.

These changes revived L0 sparsity to 0.42-0.49 across configurations—precisely in the optimal range (0.3-0.6) reported by Anthropic for balancing reconstruction fidelity and sparsity. With healthy SAE learning, genuine polysemanticity signals could finally emerge.
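
For concreteness, here is a minimal sketch of an SAE with both fixes applied, matching the 12→16→12 dimensions described later in the setup; the class and argument names are illustrative rather than taken from the repo.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal overcomplete SAE sketch (12 -> 16 -> 12). The LeakyReLU
    activation and the positive encoder-bias init are the two fixes for
    the dying-ReLU collapse described above."""

    def __init__(self, d_hidden: int = 12, d_sae: int = 16, init_bias: float = 0.1):
        super().__init__()
        self.encoder = nn.Linear(d_hidden, d_sae)
        self.decoder = nn.Linear(d_sae, d_hidden)
        # Positive bias: latents start in the active regime instead of at zero.
        nn.init.constant_(self.encoder.bias, init_bias)
        # LeakyReLU: negative pre-activations keep a small gradient,
        # so latents pushed below zero by the L1 penalty can recover.
        self.act = nn.LeakyReLU(negative_slope=0.01)

    def forward(self, h: torch.Tensor):
        z = self.act(self.encoder(h))   # sparse latent code
        h_hat = self.decoder(z)         # reconstruction of the hidden activation
        return h_hat, z
```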

Data design amplified signals: I created synthetic data with deliberate "polysemantic temptation"—8 features grouped into 3 clusters ([0-2], [3-4], [5-7]) with 60% intra-group co-activation probability and correlation strengths of 0.5-1.0. Targets used non-linear interactions (products, sin, tanh, cross-group terms) to force the network to learn relational structure. Feature importances decayed exponentially (0.9^i) to emphasize early features, amplifying polysemanticity costs in weighted metrics.
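
A sketch of the kind of generator this describes is below. The cluster structure, co-activation probability, correlation range, and importance decay come straight from the description above; the per-group activation probability and the exact target formulas are placeholders, since the post only specifies that targets mix products, sin, tanh, and cross-group terms.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8
GROUPS = [[0, 1, 2], [3, 4], [5, 6, 7]]        # correlated feature clusters
IMPORTANCE = 0.9 ** np.arange(N_FEATURES)      # exponential importance decay

def sample_features(n, p_group=0.4, p_coactivate=0.6):
    """Sparse features with deliberate intra-group co-activation."""
    x = np.zeros((n, N_FEATURES))
    for i in range(n):
        for group in GROUPS:
            if rng.random() < p_group:                       # does this cluster fire?
                seed = rng.choice(group)
                x[i, seed] = rng.uniform(0.5, 1.0)           # correlation strength 0.5-1.0
                for j in group:
                    if j != seed and rng.random() < p_coactivate:
                        x[i, j] = rng.uniform(0.5, 1.0)      # 60% co-activation
    return x

def make_targets(x):
    """Illustrative non-linear targets with cross-group interactions."""
    y = np.zeros_like(x)
    y[:, 0] = x[:, 0] * x[:, 1]              # intra-group product
    y[:, 1] = np.sin(x[:, 2] + x[:, 3])      # cross-group term
    y[:, 2] = np.tanh(x[:, 4] * x[:, 5])
    y[:, 3:] = IMPORTANCE[3:] * x[:, 3:]     # importance-weighted pass-through
    return y

X = sample_features(4096)
Y = make_targets(X)
```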

Experimental Setup

Architecture

Network: 5-layer MLP mapping 8 sparse features → 12D hidden representation → 8 outputs. The overcapacity (12D hidden > 8D features) ensures polysemanticity arises from training dynamics, not capacity constraints.

Training ablations tested 4 factors, each with two settings:

- Initialization: Random vs. Orthogonal
- Training noise: Bipolar vs. positive-kurtosis (heavy-tailed)
- Network regularization: L1 vs. L2
- Activation: ReLU vs. GELU

This yields 2⁎ = 16 possible combinations, but I focused on 8 strategic ablations, from the baseline (Random+Bipolar+L1+ReLU) to maximum mitigation (Orthogonal+PosKurt+L2+GELU).

SAE: Overcomplete (d_sae=16 > d_hidden=12) to test decomposition in high-capacity regime. Architecture: Linear encoder (12→16) with positive bias, LeakyReLU, Linear decoder (16→12). The SAE always uses L1 regularization on latents (standard for inducing sparsity). The SAE is trained after the network converges to decompose whatever representations the network learned.

Key point: The L1 vs L2 ablation varies the network's regularization during training, not the SAE's. The SAE decomposition is held constant to fairly compare network training methods.
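
A sketch of the corresponding network and of the two-stage setup is below, reusing the SparseAutoencoder class sketched earlier. The exact layer at which the 12D hidden representation is read off, and the regularization strength, are assumptions for illustration.

```python
import torch
import torch.nn as nn

def build_network(d_in=8, d_hidden=12, d_out=8, act=nn.ReLU):
    """5-layer MLP; the 12D activations of one hidden layer are what the SAE decomposes."""
    return nn.Sequential(
        nn.Linear(d_in, d_hidden), act(),
        nn.Linear(d_hidden, d_hidden), act(),
        nn.Linear(d_hidden, d_hidden), act(),   # read hidden representation here (assumed)
        nn.Linear(d_hidden, d_hidden), act(),
        nn.Linear(d_hidden, d_out),
    )

def weight_penalty(model, kind="l1", lam=1e-4):
    """The ablated term: L1 vs L2 applied to the *network's* weights, not the SAE's."""
    terms = [p.abs().sum() if kind == "l1" else p.pow(2).sum()
             for name, p in model.named_parameters() if "weight" in name]
    return lam * torch.stack(terms).sum()

# Stage 1: minimize task MSE + weight_penalty(net, kind=...) until convergence.
# Stage 2: freeze the network, collect the 12D hidden activations, and train the
#          SparseAutoencoder on them with reconstruction MSE + an L1 penalty on z.
```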

Metrics

I report the following metrics, averaged over seeds:

- MACS: mean absolute cosine similarity between learned feature directions (off-diagonal pairs only); higher means more overlap between features.
- Interference: an importance-weighted measure of feature overlap, so entanglement among important features counts more.
- Calibrated_MACS: MACS calibrated against random permutations (0.5 ≈ chance-level organization).
- L0_Sparsity: fraction of active SAE latents.
- Train_MSE / Val_MSE: the network's task error.
- SAE_MSE: the SAE's reconstruction error on the hidden activations.

All experiments ran for 10 random seeds to estimate variance. Statistical tests used two-sample t-tests comparing baseline to mitigation configurations.
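
A minimal sketch of how two of these metrics can be computed is below, assuming MACS is the mean absolute off-diagonal cosine similarity between SAE decoder columns and L0 is the fraction of non-zero latents per example; the repo's exact implementations (and the importance weighting used for Interference) may differ.

```python
import numpy as np

def macs(decoder_weights):
    """Mean absolute cosine similarity between feature directions,
    off-diagonal entries only. decoder_weights: (d_hidden, d_sae)."""
    W = decoder_weights / (np.linalg.norm(decoder_weights, axis=0, keepdims=True) + 1e-8)
    cos = W.T @ W                                        # (d_sae, d_sae) cosine matrix
    off_diag = cos[~np.eye(cos.shape[0], dtype=bool)]
    return float(np.abs(off_diag).mean())

def l0_sparsity(latents, eps=1e-6):
    """Fraction of SAE latents active per example, averaged over the batch."""
    return float((np.abs(latents) > eps).mean())
```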

Results: The L1 Paradox in Overcomplete Networks

Baseline: High Polysemanticity by Design

The baseline configuration (Random init + Bipolar noise + L1 reg + ReLU activation) established our reference:

MACS:            0.2954 ± 0.0294
Interference:    8.7022 ± 1.4480
Calibrated_MACS: 0.5085 ± 0.1117
L0_Sparsity:     0.4365 ± 0.0596
Train_MSE:       0.0122 ± 0.0032
Val_MSE:         0.0135 ± 0.0037
SAE_MSE:         0.9280 ± 0.9801

This represents a moderate-to-high polysemanticity regime. MACS of 0.30 indicates substantial off-diagonal cosine similarities (features overlap). Interference of 8.70 confirms this isn't random—important features are systematically entangled. Calibrated_MACS near 0.5 (chance level) shows the network barely outperforms random permutations in organizing features cleanly.

Critically, the network achieves reasonable task performance (Train MSE 0.012) despite the polysemanticity, suggesting the entanglement isn't necessary for the task—it's an artifact of training dynamics.

Key Finding: L2 Outperforms L1 for Disentanglement

Switching only the network's regularization from L1 to L2 (keeping Random init, Bipolar noise, but changing activation to GELU for smooth gradients):

                 L1 (Baseline)   L2+GELU   Δ%
MACS:            0.2954          0.2627    -11.1%
Interference:    8.7022          7.1424    -17.9%
Train_MSE:       0.0122          0.0024    -80.3%
Val_MSE:         0.0135          0.0051    -62.2%
SAE_MSE:         0.9280          1.4357    +54.8%
L0_Sparsity:     0.4365          0.4385    +0.5%

The L1 paradox: Despite L1 being the standard choice for promoting interpretability through sparsity, L2-trained networks produced features that were 17.9% less polysemantic (MACS: t=2.1, p=0.04; Interference: t=2.8, p<0.01). Both differences are statistically significant, with large effect sizes (Cohen's d ≈ 1.5-1.6).
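
For reference, this is the kind of comparison behind those numbers: a two-sample t-test over per-seed metric values plus a pooled-SD Cohen's d. A hedged sketch, not the repo's analysis code.

```python
import numpy as np
from scipy import stats

def compare_configs(baseline_vals, mitigation_vals):
    """Two-sample t-test and Cohen's d for per-seed metrics
    (e.g. the 10 Interference values of each configuration)."""
    baseline_vals = np.asarray(baseline_vals)
    mitigation_vals = np.asarray(mitigation_vals)
    t, p = stats.ttest_ind(baseline_vals, mitigation_vals)
    pooled_sd = np.sqrt((baseline_vals.var(ddof=1) + mitigation_vals.var(ddof=1)) / 2)
    d = (baseline_vals.mean() - mitigation_vals.mean()) / pooled_sd
    return t, p, d
```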

The trade-off is inverted: relative to the L1 baseline, L2 simultaneously achieves:

- Lower polysemanticity (MACS -11.1%, Interference -17.9%)
- Better task performance (Train MSE -80.3%, Val MSE -62.2%)
- Essentially unchanged SAE sparsity (L0 +0.5%)

The cost? SAE reconstruction complexity rises 55% (SAE_MSE). But this is because L2 produces more features to reconstruct, not because features are worse. With similar L0 but lower Interference, L2 features are individually clearer but collectively more numerous, making the SAE's job harder numerically but easier semantically.

Full Ablation Results

Here's the complete matrix (averages over 10 seeds):

Config         Init        Noise    Reg  Act   MACS   Interf (Δ%)      Train_MSE  Val_MSE  L0
1 (Baseline)   Random      Bipolar  L1   ReLU  0.295  8.70  (0%)       0.0122     0.0135   0.437
2              Orthogonal  Bipolar  L1   ReLU  0.284  8.11  (-6.8%)    0.0044     0.0078   0.439
3              Random      PosKurt  L1   ReLU  0.306  9.11  (+4.7%)    0.0133     0.0152   0.416
4              Random      Bipolar  L2   GELU  0.263  7.14  (-17.9%)   0.0024     0.0051   0.439
5              Orthogonal  Bipolar  L2   GELU  0.252  6.49  (-25.4%)   0.0020     0.0046   0.453
6              Orthogonal  PosKurt  L1   ReLU  0.285  8.30  (-4.6%)    0.0045     0.0076   0.452
7              Random      PosKurt  L2   ReLU  0.284  8.35  (-4.1%)    0.0032     0.0058   0.491
8 (Max)        Orthogonal  PosKurt  L2   GELU  0.249  6.37  (-26.8%)   0.0021     0.0048   0.454

Key patterns:

- The L2+GELU configurations (4, 5, 8) dominate, with the lowest MACS (0.249-0.263) and Interference (6.37-7.14).
- L2 alone is not enough: with ReLU (Config 7) the Interference improvement shrinks to -4.1%, suggesting the smooth activation and the smooth regularizer work together.
- Orthogonal initialization compounds with L2 (Config 5: -25.4%) far more than with L1 (Config 2: -6.8%).
- Positive-kurtosis noise hurts under L1 (Config 3: +4.7%) and adds only a marginal gain under L2 (Config 8 vs. 5).
- Task error drops alongside disentanglement; no configuration trades accuracy for interpretability, and L0 stays in a narrow band (0.42-0.49).

Variance Analysis: Stability Across Seeds

Not all metrics are equally stable:

Robust (CV < 15%): MACS (baseline CV ≈ 10%) and L0_Sparsity (≈ 14%). Rankings of configurations by these metrics are stable across seeds.

Moderate (CV 15-30%): Interference (≈ 17%), Calibrated_MACS (≈ 22%), and Train/Val MSE (≈ 26-27%).

Volatile (CV > 50%): SAE_MSE (over 100% at baseline), which is highly sensitive to which optimization basin the SAE lands in.

The bistability signature: Orthogonal+PosKurt+L1 (#6) shows SAE_MSE = 4.17 ± 4.62 (std > mean!). This indicates competing attractors—some seeds fall into good basins with low reconstruction error, others collapse catastrophically. This is evidence of antagonism between orthogonal structure and L1 sparsity pressure.

In contrast, Orthogonal+L2 configs show SAE_MSE = 0.96 ± 0.67 (CV ≈ 70%, far below the >100% of the bistable L1 case), consistent with a single, more stable optimization basin.

Analysis: Why Does L2 Reduce Polysemanticity?

The Winner-Take-All Mechanism

Lecomte et al. (2023) predicted that L1 regularization causes incidental polysemanticity through winner-take-all (WTA) dynamics in overcapacity networks. Here's the mechanism; our results are consistent with it:

L1's gradient is discontinuous: ∇(λ||w||₁) = λ·sign(w) away from zero (at zero only a subgradient exists). Crossing zero, the penalty's slope flips from -λ to +λ, creating sharp thresholds. During training, this favors extreme solutions: weights driven strongly positive, strongly negative, or exactly zero.

In overcapacity settings (12D hidden > 8D features), L1 creates competition: With more neurons than necessary, L1's sparsity pressure forces neurons to compete. A few "winners" capture multiple features (becoming polysemantic), while "losers" are driven to zero.

Evidence in the numbers: every L1 configuration (1, 2, 3, 6) shows Interference of 8.11 or higher, while every L2+GELU configuration sits at 7.14 or lower, despite nearly identical L0 sparsity. And the baseline's Calibrated_MACS sits at chance (≈ 0.5), so features are not being allocated cleanly despite the excess capacity.

Why this is paradoxical: L1 is supposed to improve interpretability by enforcing sparsity. But in overcomplete settings, the sparsity creates polysemanticity—fewer active neurons must encode more concepts.

L2's Smooth Alternative

L2 regularization escapes the WTA trap through fundamentally different mechanics:

L2's gradient is smooth: ∇(λ||w||₂²) = 2λw. Linear everywhere, no discontinuities. This creates a convex-ish landscape that encourages distributed solutions.

No competition, just shrinkage: L2 uniformly shrinks all weights toward zero proportional to their magnitude. There's no pressure for winner-take-all—all neurons contribute, just at reduced scale.

Evidence in the numbers: switching the network from L1 to L2 cuts Interference from 8.70 to 7.14 and Train MSE from 0.0122 to 0.0024 while L0 stays essentially constant (0.4365 vs 0.4385). The gains come from how features are distributed across neurons, not from activating fewer latents.

The key insight: In overcomplete settings, distributed representations are more interpretable than sparse ones. Each neuron encodes one clear feature at moderate strength, rather than a few neurons encoding everything.
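
A tiny, deliberately caricatured demo of the contrast: gradient steps on the penalty term alone (no task loss), with made-up weights and a made-up learning rate. Under L1 the push toward zero has constant size, so small weights die and only the largest survive; under L2 the push is proportional to the weight, so everything shrinks but nothing dies.

```python
import numpy as np

w_l1 = np.array([0.9, 0.5, 0.3, 0.1, -0.05])   # toy "neuron strengths" (illustrative)
w_l2 = w_l1.copy()
lam, lr = 0.05, 0.1

for _ in range(100):
    w_l1 -= lr * lam * np.sign(w_l1)            # L1: constant-size step toward zero
    w_l1[np.abs(w_l1) < lr * lam] = 0.0         # weights that reach zero stay dead
    w_l2 -= lr * 2 * lam * w_l2                 # L2: shrinkage proportional to magnitude

print("L1:", np.round(w_l1, 3))   # e.g. [0.4, 0., 0., 0., 0.]  -> winner-take-all
print("L2:", np.round(w_l2, 3))   # every weight shrunk by ~0.37x -> distributed
```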

Orthogonal Initialization's Dual Role

Orthogonal initialization provides interesting boundary conditions:

With L2 (Config #5): Synergistic. Orthogonal weight matrices have singular values equal to one, providing well-conditioned gradient flow. L2's smooth landscape benefits maximally from this, achieving -25.4% Interference and -83.6% Train MSE.

With L1 (Config #6): Antagonistic. Orthogonal structure distributes activations evenly by design. L1 wants to sparsify aggressively. They fight, creating bistable dynamics with catastrophic variance (SAE_MSE std = 4.62 > mean = 4.17).

The numerical signature: Sparsity_W4 jumps from 0.053 (Random+L1) to 0.101 (Orthogonal+L1), a +91% increase. Orthogonal init forces weight distributions to have extreme peaks because the orthogonal structure pre-orients weights, then L1 hammers them into sharp sparsity. This extreme peakedness destabilizes SAE reconstruction.
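
For reference, orthogonal initialization in PyTorch is a one-liner per layer; a sketch of how it might be applied to the toy network (the helper name is mine):

```python
import torch.nn as nn

def apply_orthogonal_init(model: nn.Module):
    """Orthogonal (semi-orthogonal for rectangular layers) weight init:
    singular values of 1, so gradients are well conditioned at the start."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            nn.init.orthogonal_(m.weight)
            nn.init.zeros_(m.bias)
```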

The Role of Positive Kurtosis Noise

Positive kurtosis (heavy-tailed t-distribution) was hypothesized to disrupt lock-ins through stochastic kicks. The results are context-dependent:

With L1 (Config #3): Harmful. MACS +3.6%, Interference +4.7%. The heavy tails create rare extreme activations that force neurons into polysemantic roles, then L1's WTA dynamics lock them in.

With L2 (Config #8): Marginal benefit. Interference -1.8% vs Config #5. The tails add exploration, but L2's smoothing prevents lock-in. The system samples useful diversity without collapsing.

Interpretation: Disruption alone doesn't help—you need smooth regularization to make productive use of it.
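
A sketch of the two noise sources, assuming "bipolar" means symmetric two-point ±scale noise and the heavy-tailed variant is a Student-t draw (df = 5 gives excess kurtosis 6); the scale, the degrees of freedom, and where the noise is injected are assumptions.

```python
import torch
from torch.distributions import StudentT

def bipolar_noise(shape, scale=0.1):
    """Two-point noise: +scale or -scale with equal probability (no heavy tails)."""
    return scale * (2.0 * torch.randint(0, 2, shape).float() - 1.0)

def pos_kurtosis_noise(shape, scale=0.1, df=5.0):
    """Heavy-tailed Student-t noise: mostly small kicks, occasional large ones
    (excess kurtosis 6/(df - 4) = 6 for df = 5)."""
    return scale * StudentT(df).sample(shape)
```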

Discussion: Implications and Limitations

What This Means for Interpretability Research

Challenge to conventional wisdom: The assumption "sparse = interpretable" may be backwards in overcomplete regimes. In this setting, L1-trained networks showed markedly higher polysemanticity (Interference 8.70 vs. 7.14 under L2), while L2's denser representations were the more disentangled ones.

The boundary condition matters: This finding is specific to overcomplete settings (latent_dim > feature_dim). In undercomplete or exact-capacity regimes, L1 likely still helps by forcing clear feature allocation. The paradox emerges when there's excess capacity—L1's WTA dynamics become pathological.

Network training matters as much as SAE architecture: Most SAE research optimizes decomposition methods (TopK, JumpReLU, Gated SAEs). Our results suggest that reducing polysemanticity also requires attention to how the base model is trained: even a perfect SAE will struggle to disentangle representations that were baked in as polysemantic.

Actionable takeaway: When training models you plan to interpret, consider:

- L2 (weight decay) rather than L1 on the network's weights when there is excess capacity
- Orthogonal initialization, which compounded with L2 in these runs
- Smooth activations (GELU) over ReLU
- Care with heavy-tailed training noise, which only helped in combination with L2

Comparison to Existing Work

Lecomte et al. (2023): Predicted L1 causes incidental polysemanticity via WTA. This work provides a direct L1 vs L2 comparison quantifying that prediction in a toy setting (-17.9% Interference, p<0.01).

Anthropic's SAE work (Bricken et al. 2023, Templeton et al. 2024): Uses an L1 penalty extensively when training SAEs. Our ablation varies the base network's regularization rather than the SAE's, but it suggests those SAEs may be spending capacity on polysemanticity that network-level training choices could have avoided. Advanced architectures (JumpReLU, TopK) may also mitigate the L1 paradox through different mechanisms; a direct comparison is needed.

Modern SAE variants: We used vanilla LeakyReLU SAEs. State-of-the-art methods (TopK, JumpReLU, Gated) achieve better reconstruction at given sparsity levels. How L2 network training compares to these architectural improvements remains an open question.

Honest Limitations

This is a toy model with significant constraints:

- Tiny scale: 8 synthetic features, a 12D hidden layer, and a 16-latent SAE; no natural data or language models are involved.
- The synthetic data was deliberately designed to tempt the network into polysemanticity, which may exaggerate effect sizes.
- The overcapacity regime (more hidden dimensions than features) is exactly where the L1 paradox is expected; other regimes were not tested.
- The SAE is a vanilla LeakyReLU autoencoder, not a modern variant (TopK, JumpReLU, Gated).
- Only 10 seeds per configuration, and several design choices (architecture depth, regularization strengths, noise scales) were not swept.

What We Don't Claim

NOT claiming:

- That these results transfer to real LLMs or natural data
- That L1 is a bad choice for training SAEs themselves, or that sparsity is always bad for interpretability
- That the specific percentages (e.g. -17.9% Interference) are meaningful beyond this toy setting

Claiming:

- In this overcomplete toy setting, the network's regularizer measurably changes how polysemantic its representations are, with L2 producing less entanglement than L1
- Training dynamics, not just SAE architecture, shape how interpretable the resulting features are
- The mechanistic story (L1's winner-take-all pressure vs. L2's smooth shrinkage) is consistent with the observed ablations

Future Directions

Immediate Next Steps

Test on real LLMs via PEFT - Fine-tune Llama or similar with LoRA adapters trained using L1 vs L2 regularization on toy-like data. Measure polysemanticity in adapted representations without catastrophic forgetting.

Compare to modern SAE variants - Does L2 network training combined with TopK or JumpReLU SAEs achieve better results than either alone?

Lambda sweep - Systematically vary L1/L2 strength to identify critical thresholds where polysemanticity minimizes. Is there a U-shaped curve?

Qualitative feature analysis - Generate visualizations of what L1 vs L2 features actually encode. Do L2 features correspond to clearer, more monosemantic concepts?

Ground truth feature recovery - Use Anthropic's toy model setup with known synthetic features. Measure recovery accuracy directly.

Longer-Term Questions

Does this scale? Will the L1 paradox hold in billion-parameter models with complex natural data?

Interaction with other training choices - How do learning rate, batch size, architecture depth affect the L1 vs L2 trade-off?

Optimal regularization mixture - Is there an elastic net configuration (αL1 + (1-α)L2) that achieves both sparsity AND low polysemanticity?

Feature importance distributions - How do real models' feature importance distributions (likely power-law) affect which features stay polysemantic under L1 vs L2?

Conclusion

This exploratory work in toy models suggests that network training dynamics—specifically the choice between L1 and L2 regularization—significantly affect the polysemanticity of learned representations. L2-trained networks produced 17.9% less feature entanglement than L1-trained networks (t=2.8, p<0.01), challenging the conventional wisdom that "sparse equals interpretable" in overcomplete settings.

The proposed mechanism is simple: L1's discontinuous gradients create winner-take-all dynamics in which a few neurons become polysemantic, while L2's smooth gradients encourage distributed representations in which many neurons specialize. Orthogonal initialization amplifies this difference, synergizing with L2 but antagonizing L1.

Whether these findings generalize to real LLMs remains an open empirical question. The toy model's extreme overcapacity, synthetic data, and arbitrary design choices limit direct applicability. However, the mechanistic insights—L1's WTA trap, L2's smoothing effect, orthogonal×regularization interactions—may inform future work on training more interpretable models.

For the mechanistic interpretability community, this work suggests two paths forward:

1. Better decomposition: keep improving SAE architectures (TopK, JumpReLU, Gated) to extract cleaner features from whatever representations models happen to learn.
2. Better training: choose network-level training factors (regularization, initialization, activation, noise) that reduce polysemanticity at the source.

Both matter. Fixing polysemanticity likely requires addressing both how we train models and how we interpret them.

All code, data, and experimental details are available at github.com/stanleyngugi/taming_polysemanticity. I welcome feedback, replications, and extensions—especially testing these ideas on real LLMs.


Acknowledgments

Thanks to the mechanistic interpretability community for foundational work on SAEs and toy models. This research builds directly on Lecomte et al.'s incidental polysemanticity theory and Anthropic's dictionary learning approaches.

References

Bereska, L., & Gavves, E. (2024). Mechanistic Interpretability for AI Safety — A Review. Transactions on Machine Learning Research.

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features

Gao, L., DuprĂ© la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024). Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.

Lecomte, V., Thaman, K., Schaeffer, R., Bashkansky, N., Chow, T., & Koyejo, S. (2023). What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes. arXiv preprint arXiv:2312.03096.

Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., KramĂĄr, J., Shah, R., & Nanda, N. (2024). Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv preprint arXiv:2404.16014.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations (ICLR).

Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., & Henighan, T. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic. https://www.anthropic.com/research/scaling-monosemanticity


Appendix: Technical Details

Hyperparameters

Network training:

- 5-layer MLP: 8 input features → 12D hidden representation → 8 outputs
- Regularization (L1 vs L2), activation (ReLU vs GELU), initialization (Random vs Orthogonal), and training noise (Bipolar vs PosKurt) set per ablation
- 10 random seeds per configuration

SAE training:

- Overcomplete dictionary: d_sae = 16 on the 12D hidden layer
- Linear encoder (12→16) with positive bias initialization, LeakyReLU, linear decoder (16→12)
- L1 penalty on latents (λ = 0.01 in the initial runs; see the dying-ReLU discussion above), trained after the network has converged

Data generation:

- 8 features in 3 correlated clusters ([0-2], [3-4], [5-7]) with 60% intra-group co-activation probability and correlation strengths 0.5-1.0
- Non-linear targets: products, sin, tanh, and cross-group interaction terms
- Feature importances decaying exponentially as 0.9^i

Full hyperparameter values and training scripts are in the linked repository.