Motivation: Why Study Network Training Factors?
Polysemanticity, where individual neurons encode multiple unrelated features, is a fundamental challenge in mechanistic interpretability (MI). A neuron that activates for both "cars" and "adversarial patterns" makes models harder to audit and potentially hides safety-relevant behaviors. While the classic explanation attributes polysemanticity to "superposition" (compressing more features than available dimensions), recent work by Lecomte et al. (2023) shows it can arise "incidentally" even in overcapacity regimes, driven by training dynamics such as regularization inducing winner-take-all effects or noise creating chance correlations.
Most SAE research focuses on improving decomposition architectures (TopK SAEs, JumpReLU SAEs, Gated SAEs) to better extract features from already-trained models. But what if polysemanticity is baked in during network training itself? If the base model's training process creates entangled representations, even perfect SAEs will struggle to disentangle them.
This post explores that question through toy model experiments. I test whether training factors hypothesized to reduce incidental polysemanticity (orthogonal initialization to minimize initial feature collisions, L2 regularization to promote distributed representations over L1's winner-take-all dynamics, GELU activation to preserve gradient flow, and positive-kurtosis noise to disrupt lock-ins) actually reduce feature entanglement as measured by SAE decomposition.
The surprising finding: Networks trained with L2 regularization produce features that are significantly less polysemantic than those of L1-trained networks, despite L1 being the standard choice for promoting interpretability through sparsity. This suggests that in overcomplete settings, the conventional wisdom "sparse equals interpretable" may be backwards.
The Experimental Journey: From Dead SAEs to Working Signals
This wasn't a clean, linear process. Initial runs hit a wall: "dead SAEs" where L0 sparsity (fraction of active latent features) collapsed to 0.0000 across all configurations. The SAE was outputting zeros for everything, making all metrics meaningless.
The culprit: Dying ReLU problem. With strong L1 regularization (initial lambda=0.01) on SAE latents, ReLU activations went to zero for negative inputs, gradients vanished, and neurons got trapped in a "dead state" with no recovery path. This is exacerbated in toy models with low-variance activations from sparse, correlated data.
The fix involved several changes:
- Lowered lambda to 1e-4 (balancing sparsity without overkill, following scaled SAE practices)
- Switched to LeakyReLU (alpha=0.01) in the SAE to allow small negative gradients
- Added positive encoder biases (uniform 0-0.1) to initialize activations in firing range
- Normalized activations before SAE input ((acts - mean)/std) to boost variance
- Increased training epochs (1000 → 2000 for the network, 500 → 1000 for the SAE) to ensure convergence
These changes revived L0 sparsity to 0.42-0.49 across configurations, squarely within the 0.3-0.6 range reported by Anthropic for balancing reconstruction fidelity and sparsity. With healthy SAE learning, genuine polysemanticity signals could finally emerge.
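For concreteness, here is a minimal PyTorch sketch of an SAE incorporating these fixes; the class name, helper names, and exact details are illustrative rather than copied from the repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySAE(nn.Module):
    """Overcomplete SAE with the anti-dead-neuron fixes described above."""
    def __init__(self, d_hidden=12, d_sae=16):
        super().__init__()
        self.encoder = nn.Linear(d_hidden, d_sae)
        self.decoder = nn.Linear(d_sae, d_hidden)
        # Positive encoder biases start latents inside their firing range.
        nn.init.uniform_(self.encoder.bias, 0.0, 0.1)
        # LeakyReLU keeps a small gradient for negative pre-activations.
        self.act = nn.LeakyReLU(0.01)

    def forward(self, acts):
        latents = self.act(self.encoder(acts))
        return self.decoder(latents), latents

def normalize(acts, eps=1e-8):
    """(acts - mean) / std per dimension, computed over the batch."""
    return (acts - acts.mean(dim=0)) / (acts.std(dim=0) + eps)

def sae_loss(recon, acts_norm, latents, l1_lambda=1e-4):
    """Reconstruction MSE plus the (now gentler) L1 penalty on latent activations."""
    return F.mse_loss(recon, acts_norm) + l1_lambda * latents.abs().mean()
```

In training, the network's hidden activations are normalized first and then fed through the SAE, with the loss combining reconstruction error on the normalized activations and the L1 term on latents.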
Data design amplified signals: I created synthetic data with deliberate "polysemantic temptation": 8 features grouped into 3 clusters ([0-2], [3-4], [5-7]) with 60% intra-group co-activation probability and correlation strengths of 0.5-1.0. Targets used non-linear interactions (products, sin, tanh, cross-group terms) to force the network to learn relational structure. Feature importances decayed exponentially (0.9^i) to emphasize early features, amplifying polysemanticity costs in weighted metrics.
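A sketch of this kind of generator (numpy). The group boundaries, activation rate, and importance decay match the description above; the specific target interactions and my interpretation of the 60% co-activation rule are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
GROUPS = [[0, 1, 2], [3, 4], [5, 6, 7]]
IMPORTANCE = 0.9 ** np.arange(8)   # exponential importance decay, used by the weighted metrics

def sample_features(n_samples=1024, p_active=0.5, p_coactivate=0.6):
    """Sparse features with deliberate intra-group co-activation."""
    x = np.zeros((n_samples, 8))
    for i in range(n_samples):
        active = rng.random(8) < p_active                    # ~50% active per sample
        for group in GROUPS:
            if active[group].any() and rng.random() < p_coactivate:
                active[group] = True                         # the whole cluster fires together
        x[i] = active * rng.uniform(0.5, 1.0, size=8)        # correlation strengths 0.5-1.0
    return x

def make_targets(x):
    """Non-linear targets: products, sin, tanh, and cross-group terms (illustrative)."""
    y = np.zeros((x.shape[0], 8))
    y[:, 0] = x[:, 0] * x[:, 1]             # intra-group product
    y[:, 1] = np.sin(x[:, 2] + x[:, 3])     # cross-group interaction
    y[:, 2] = np.tanh(x[:, 4] * x[:, 5])
    y[:, 3:] = x[:, 3:] ** 2                # simple non-linearities for the rest
    return y
```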
Experimental Setup
Architecture
Network: 5-layer MLP mapping 8 sparse features → 12D hidden representation → 8 outputs. The overcapacity (12D hidden > 8D features) ensures polysemanticity arises from training dynamics, not capacity constraints.
Training ablations tested 4 factors:
- Initialization: Random (Gaussian) vs Orthogonal (preserves norms, minimizes initial feature overlap)
- Noise: Bipolar (uniform ±0.1) vs Positive Kurtosis (t-distribution df=3, heavy tails)
- Regularization: L1 (λ·|weights|) vs L2 (λ·weights²)
- Activation: ReLU vs GELU (smooth, no dying neurons)
This yields 2⁎ = 16 possible combinations, but I focused on 8 strategic ablations, from baseline (Random+Bipolar+L1+ReLU) to maximum mitigation (Orthogonal+PosKurt+L2+GELU); a minimal configuration sketch follows below.
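Here is one way such a configuration might be assembled, assuming PyTorch. The helper names are mine; note that L2 is handled through the optimizer's weight decay, while the network's L1 term is added to the loss manually (see the appendix).

```python
import torch
import torch.nn as nn

def build_network(init="orthogonal", activation="gelu", dims=(8, 12, 12, 12, 8)):
    """MLP with the initialization and activation factors under ablation."""
    layers = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        linear = nn.Linear(d_in, d_out)
        if init == "orthogonal":
            nn.init.orthogonal_(linear.weight)        # norm-preserving start
        else:
            nn.init.normal_(linear.weight, std=0.1)   # random Gaussian baseline
        layers += [linear, nn.GELU() if activation == "gelu" else nn.ReLU()]
    return nn.Sequential(*layers[:-1])                # no activation on the output layer

def training_noise(shape, kind="bipolar", sigma=0.1):
    """Noise injected into hidden activations during training."""
    if kind == "bipolar":
        return (torch.rand(shape) * 2 - 1) * sigma                       # uniform in [-sigma, sigma]
    return torch.distributions.StudentT(df=3.0).sample(shape) * sigma    # heavy-tailed (df=3)

def make_optimizer(model, reg="l2", lam=1e-4, lr=1e-3):
    """L2 via Adam's weight_decay; for L1, decay is off and the penalty goes into the loss."""
    return torch.optim.Adam(model.parameters(), lr=lr,
                            weight_decay=lam if reg == "l2" else 0.0)
```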
SAE: Overcomplete (d_sae=16 > d_hidden=12) to test decomposition in a high-capacity regime. Architecture: Linear encoder (12 → 16) with positive bias, LeakyReLU, Linear decoder (16 → 12). The SAE always uses L1 regularization on latents (standard for inducing sparsity). The SAE is trained after the network converges to decompose whatever representations the network learned.
Key point: The L1 vs L2 ablation varies the network's regularization during training, not the SAE's. The SAE decomposition is held constant to fairly compare network training methods.
Metrics
- MACS (Mean Absolute Cosine Similarity): Average of absolute off-diagonal cosines in the SAE decoder weight matrix. Measures feature overlap; lower values indicate more disentangled features (a computation sketch follows this list).
- Interference: Weighted polysemanticity cost: Σᵢ≠ⱼ cos²(dᵢ, dⱼ) × Iᵢ × Iⱼ, where the Iᵢ are feature importances. Penalizes entanglement of important features more heavily.
- L0 Sparsity: Fraction of SAE latents active (>threshold) per input. Healthy range: 0.3-0.6 for SAEs.
- Sparsity_W4: Measures peakedness of weight distributions (higher = more concentrated, sparse structure).
- SAE_MSE: Reconstruction error of SAE. Not a direct interpretability metricâmeasures reconstruction complexity.
- Train/Val MSE: Network task performance on target prediction.
- Calibrated_MACS: MACS z-scored against 200 null permutations, then sigmoid-normalized to [0,1]. Values near 0.5 indicate chance-level structure.
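As a reference, here is one way these metrics can be computed from the SAE's decoder weights and latent activations (numpy). How the per-feature importances are mapped onto the 16 decoder latents, and the exact permutation null behind Calibrated_MACS, follow the repository's code; the versions below are my own assumptions.

```python
import numpy as np

def macs(W_dec):
    """Mean absolute off-diagonal cosine similarity; W_dec columns are latent directions."""
    W = W_dec / (np.linalg.norm(W_dec, axis=0, keepdims=True) + 1e-8)
    cos = W.T @ W
    return np.abs(cos[~np.eye(cos.shape[0], dtype=bool)]).mean()

def interference(W_dec, importance):
    """Sum over i != j of cos^2(d_i, d_j) * I_i * I_j; importance has one entry per latent."""
    W = W_dec / (np.linalg.norm(W_dec, axis=0, keepdims=True) + 1e-8)
    cos2 = (W.T @ W) ** 2
    np.fill_diagonal(cos2, 0.0)
    return float(importance @ cos2 @ importance)

def l0_sparsity(latents, threshold=1e-3):
    """Average fraction of SAE latents above threshold per input."""
    return float((np.abs(latents) > threshold).mean())

def calibrated_macs(W_dec, n_perm=200, rng=None):
    """MACS z-scored against a shuffled-weight null, squashed to [0, 1] with a sigmoid."""
    rng = rng or np.random.default_rng(0)
    observed = macs(W_dec)
    null = np.array([macs(rng.permutation(W_dec.ravel()).reshape(W_dec.shape))
                     for _ in range(n_perm)])
    z = (observed - null.mean()) / (null.std() + 1e-8)
    return 1.0 / (1.0 + np.exp(-z))
```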
All experiments ran for 10 random seeds to estimate variance. Statistical tests used two-sample t-tests comparing baseline to mitigation configurations.
Results: The L1 Paradox in Overcomplete Networks
Baseline: High Polysemanticity by Design
The baseline configuration (Random init + Bipolar noise + L1 reg + ReLU activation) established our reference:
- MACS: 0.2954 ± 0.0294
- Interference: 8.7022 ± 1.4480
- Calibrated_MACS: 0.5085 ± 0.1117
- L0_Sparsity: 0.4365 ± 0.0596
- Train_MSE: 0.0122 ± 0.0032
- Val_MSE: 0.0135 ± 0.0037
- SAE_MSE: 0.9280 ± 0.9801
This represents a moderate-to-high polysemanticity regime. MACS of 0.30 indicates substantial off-diagonal cosine similarities (features overlap). Interference of 8.70 confirms this isn't random: important features are systematically entangled. Calibrated_MACS near 0.5 (chance level) shows the network barely outperforms random permutations in organizing features cleanly.
Critically, the network achieves reasonable task performance (Train MSE 0.012) despite the polysemanticity, suggesting the entanglement isn't necessary for the task; it's an artifact of training dynamics.
Key Finding: L2 Outperforms L1 for Disentanglement
Comparing the baseline against a configuration that switches the network's regularization from L1 to L2 and the activation from ReLU to GELU (keeping Random init and Bipolar noise):
| Metric | L1 (Baseline) | L2+GELU | Δ% |
|---|---|---|---|
| MACS | 0.2954 | 0.2627 | -11.1% |
| Interference | 8.7022 | 7.1424 | -17.9% |
| Train_MSE | 0.0122 | 0.0024 | -80.3% |
| Val_MSE | 0.0135 | 0.0051 | -62.2% |
| SAE_MSE | 0.9280 | 1.4357 | +54.8% |
| L0_Sparsity | 0.4365 | 0.4385 | +0.5% |
The L1 paradox: Despite L1 being the standard choice for promoting interpretability through sparsity, L2-trained networks produced features with 17.9% lower Interference (t=2.1, p=0.04 for MACS; t=2.8, p<0.01 for Interference). Both differences are statistically significant, with large effect sizes (Cohen's d ≈ 1.5-1.6).
The trade-off is inverted: L2 simultaneously achieves:
- Better task performance (-80% Train MSE, -62% Val MSE)
- Lower polysemanticity (-18% Interference)
- Similar sparsity levels (L0 nearly identical: 0.44 vs 0.44)
The cost? SAE reconstruction complexity rises 55% (SAE_MSE). But this is because L2 produces more features to reconstruct, not because features are worse. With similar L0 but lower Interference, L2 features are individually clearer but collectively more numerous, making the SAE's job harder numerically but easier semantically.
Full Ablation Results
Here's the complete matrix (averages over 10 seeds):
| Config | Init | Noise | Reg | Act | MACS | Interference (Δ%) | Train_MSE | Val_MSE | L0 |
|---|---|---|---|---|---|---|---|---|---|
| 1 (Baseline) | Random | Bipolar | L1 | ReLU | 0.295 | 8.70 (0%) | 0.0122 | 0.0135 | 0.437 |
| 2 | Orthogonal | Bipolar | L1 | ReLU | 0.284 | 8.11 (-6.8%) | 0.0044 | 0.0078 | 0.439 |
| 3 | Random | PosKurt | L1 | ReLU | 0.306 | 9.11 (+4.7%) | 0.0133 | 0.0152 | 0.416 |
| 4 | Random | Bipolar | L2 | GELU | 0.263 | 7.14 (-17.9%) | 0.0024 | 0.0051 | 0.439 |
| 5 | Orthogonal | Bipolar | L2 | GELU | 0.252 | 6.49 (-25.4%) | 0.0020 | 0.0046 | 0.453 |
| 6 | Orthogonal | PosKurt | L1 | ReLU | 0.285 | 8.30 (-4.6%) | 0.0045 | 0.0076 | 0.452 |
| 7 | Random | PosKurt | L2 | ReLU | 0.284 | 8.35 (-4.1%) | 0.0032 | 0.0058 | 0.491 |
| 8 (Max) | Orthogonal | PosKurt | L2 | GELU | 0.249 | 6.37 (-26.8%) | 0.0021 | 0.0048 | 0.454 |
Key patterns:
- L2 configs (#4, 5, 7, 8) consistently show lower Interference (6.4-8.4 range vs 8.1-9.1 for L1)
- Orthogonal + L2 + GELU synergize (#5: -25.4% Interference)
- Maximum mitigation (#8) achieves -26.8% Interference with t=2.8, p<0.01
- Positive kurtosis alone (#3) worsens polysemanticity (+4.7% Interference) but helps when paired with L2 (#8: -26.8%)
Variance Analysis: Stability Across Seeds
Not all metrics are equally stable:
Robust (CV < 15%):
- MACS: 6-14% variation; polysemanticity is reproducible
- Interference: 9-23%; the weighted cost is a stable signal
- L0_Sparsity: 7-15%; the active fraction is consistent
Moderate (CV 15-30%):
- Calibrated_MACS: 17-29%; noisy due to permutation sampling
- Train/Val MSE: 14-42%; depends on optimization trajectories
Volatile (CV > 50%):
- SAE_MSE: 50-492%; extremely unstable in some configs
- Sparsity_W4: 12-381%; seed-dependent weight distributions
The bistability signature: Orthogonal+PosKurt+L1 (#6) shows SAE_MSE = 4.17 ± 4.62 (std > mean!). This indicates competing attractors: some seeds fall into good basins with low reconstruction error, others collapse catastrophically. This is evidence of antagonism between orthogonal structure and L1 sparsity pressure.
In contrast, Orthogonal+L2 configs show SAE_MSE = 0.96 ± 0.67 (CV = 70%, much lower), indicating a stable, unimodal optimization landscape.
Analysis: Why Does L2 Reduce Polysemanticity?
The Winner-Take-All Mechanism
Lecomte et al. (2023) predicted that L1 regularization causes incidental polysemanticity through winner-take-all (WTA) dynamics in overcapacity networks. Here's the mechanism validated by our results:
L1's gradient is discontinuous: ∇(λ||w||₁) = λ·sign(w). As a weight crosses zero, the gradient flips from -λ to +λ, creating sharp thresholds. During training, this favors extreme solutions: weights driven strongly positive, strongly negative, or exactly zero.
In overcapacity settings (12D hidden > 8D features), L1 creates competition: With more neurons than necessary, L1's sparsity pressure forces neurons to compete. A few "winners" capture multiple features (becoming polysemantic), while "losers" are driven to zero.
Evidence in the numbers:
- L1 configs have higher Interference (8.1-9.1) despite lower or equal L0 (0.42-0.49)
- Train MSE is higher with L1 (0.012-0.013 vs 0.002-0.003 for L2), suggesting the optimizer gets stuck in suboptimal minima
- High variance in SAE_MSE for L1 configs suggests multiple competing local minima
Why this is paradoxical: L1 is supposed to improve interpretability by enforcing sparsity. But in overcomplete settings, the sparsity creates polysemanticity: fewer active neurons must encode more concepts.
L2's Smooth Alternative
L2 regularization escapes the WTA trap through fundamentally different mechanics:
L2's gradient is smooth: ∇(λ||w||₂²) = 2λw. Linear everywhere, no discontinuities. This creates a convex-ish landscape that encourages distributed solutions.
No competition, just shrinkage: L2 uniformly shrinks all weights toward zero in proportion to their magnitude. There's no pressure for winner-take-all: all neurons contribute, just at reduced scale.
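To make the contrast concrete, here is a tiny numpy illustration (not from the repository) comparing one L1 soft-thresholding (proximal) step with one L2 shrinkage step on the same weight vector:

```python
import numpy as np

w = np.array([0.03, 0.10, 0.60, 1.20])   # a mix of small and large weights
lam, lr = 0.05, 1.0

# L1 proximal (soft-thresholding) step: subtract a *constant* lam from every
# magnitude and clamp at zero -> small weights die outright, large ones barely notice.
l1_step = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

# L2 gradient step on lam * ||w||^2: shrink every weight by the *same factor*
# -> nothing is driven exactly to zero, capacity stays distributed.
l2_step = (1 - 2 * lr * lam) * w

print("before :", w)        # [0.03  0.1   0.6   1.2 ]
print("L1 step:", l1_step)  # [0.    0.05  0.55  1.15]  hard zeros appear immediately
print("L2 step:", l2_step)  # [0.027 0.09  0.54  1.08]  uniform 10% shrinkage
```

Iterated over many training steps, L1's constant pull is what lets a few "winner" neurons hold several features while the rest are pinned at exactly zero; L2's proportional pull never produces those hard zeros.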
Evidence in the numbers:
- L2 configs achieve lowest Interference (6.4-7.1)
- Train/Val MSE dramatically lower (0.002-0.003 train, 0.005-0.006 val), indicating better optimization
- L0 stays high (0.44-0.49): many neurons are active, each specialized
- Low variance across seeds: stable convergence to a single basin
The key insight: In overcomplete settings, distributed representations are more interpretable than sparse ones. Each neuron encodes one clear feature at moderate strength, rather than a few neurons encoding everything.
Orthogonal Initialization's Dual Role
Orthogonal initialization provides interesting boundary conditions:
With L2 (Config #5): Synergistic. Orthogonal matrices have unit eigenvalues, providing perfect conditioning for gradient flow. L2's smooth landscape benefits maximally from this, achieving -25.4% Interference and -83.6% Train MSE.
With L1 (Config #6): Antagonistic. Orthogonal structure distributes activations evenly by design. L1 wants to sparsify aggressively. They fight, creating bistable dynamics with catastrophic variance (SAE_MSE std = 4.62 > mean = 4.17).
The numerical signature: Sparsity_W4 jumps from 0.053 (Random+L1) to 0.101 (Orthogonal+L1), a +91% increase. Orthogonal init forces weight distributions to have extreme peaks because the orthogonal structure pre-orients weights, then L1 hammers them into sharp sparsity. This extreme peakedness destabilizes SAE reconstruction.
The Role of Positive Kurtosis Noise
Positive kurtosis (heavy-tailed t-distribution) was hypothesized to disrupt lock-ins through stochastic kicks. The results are context-dependent:
With L1 (Config #3): Harmful. MACS +3.6%, Interference +4.7%. The heavy tails create rare extreme activations that force neurons into polysemantic roles, then L1's WTA dynamics lock them in.
With L2 (Config #8): Marginal benefit. Interference -1.8% vs Config #5. The tails add exploration, but L2's smoothing prevents lock-in. The system samples useful diversity without collapsing.
Interpretation: Disruption alone doesn't help; you need smooth regularization to make productive use of it.
Discussion: Implications and Limitations
What This Means for Interpretability Research
Challenge to conventional wisdom: The assumption "sparse = interpretable" may be backwards in overcomplete regimes. Our results show that L1-regularized networks end up more polysemantic (switching to L2 cut Interference by 17.9%), while L2's denser representations are less entangled.
The boundary condition matters: This finding is specific to overcomplete settings (latent_dim > feature_dim). In undercomplete or exact-capacity regimes, L1 likely still helps by forcing clear feature allocation. The paradox emerges when there's excess capacity: L1's WTA dynamics become pathological.
Network training matters as much as SAE architecture: Most SAE research optimizes decomposition methods (TopK, JumpReLU, Gated SAEs). Our work suggests that fixing polysemanticity also requires fixing model training: even a perfect SAE will struggle with representations that were baked in as polysemantic.
Actionable takeaway: When training models you plan to interpret, consider the following (a minimal elastic-net sketch follows this list):
- L2 or elastic net (L1+L2 with L2 dominant) for network regularization
- Orthogonal initialization only with L2, never with L1 alone
- Smooth activations (GELU, SiLU) to preserve gradient flow
- Monitor for bistability (high SAE variance across seeds) as diagnostic of antagonistic training dynamics
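For the first recommendation, here is a minimal sketch of an elastic-net penalty added to the task loss, assuming PyTorch; the mixing weight alpha and the lambda are illustrative placeholders rather than tuned values.

```python
import torch

def elastic_net_penalty(model, lam=1e-4, alpha=0.2):
    """lam * (alpha * L1 + (1 - alpha) * L2) over weight matrices; L2-dominant for small alpha."""
    weights = [p for name, p in model.named_parameters() if "weight" in name]
    l1 = sum(w.abs().sum() for w in weights)
    l2 = sum((w ** 2).sum() for w in weights)
    return lam * (alpha * l1 + (1 - alpha) * l2)

# In the training loop:
#   loss = task_loss + elastic_net_penalty(model)
```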
Comparison to Existing Work
Lecomte et al. (2023): Predicted L1 causes incidental polysemanticity via WTA. We provide the first direct L1 vs L2 comparison quantifying this (-17.9% Interference, p<0.01).
Anthropic's SAE work (Bricken et al. 2023, Templeton et al. 2024): Uses L1 penalty extensively for training SAEs. Our findings suggest this may fight avoidable polysemanticity, but Anthropic also uses advanced architectures (JumpReLU, TopK) that may mitigate the L1 paradox through different mechanisms. Direct comparison needed.
Modern SAE variants: We used vanilla LeakyReLU SAEs. State-of-the-art methods (TopK, JumpReLU, Gated) achieve better reconstruction at given sparsity levels. How L2 network training compares to these architectural improvements remains an open question.
Honest Limitations
This is a toy model with significant constraints:
- Synthetic data with designed correlations - Real LLMs don't have researcher-controlled correlation knobs. Findings most applicable to scenarios with natural co-occurrence (e.g., "car" and "road") or accidental training correlations.
- Extreme overcapacity - 16 SAE latents / 8 features = 2×, plus 12D hidden > 8D features. Real LLM SAEs are 4-8× overcomplete. The degree of overcapacity may exaggerate effects.
- No LLM validation - Whether this generalizes to billion-parameter models is unknown. The mechanisms (L1 WTA, L2 smoothing) suggest it might, but needs empirical testing.
- Arbitrary importance decay - Using 0.9^i importance weights. Real models likely have power-law or heavy-tailed importance distributions.
- No qualitative feature analysis - We measure polysemanticity via proxy metrics (Interference, MACS) without verifying features are actually clearer to human interpreters. This is a critical missing validation.
- Vanilla SAE architecture - Using LeakyReLU SAEs, not state-of-the-art TopK/JumpReLU/Gated variants. Modern architectures may change the picture.
- Calibrated_MACS noise - With std=0.14 and mean~0.5, this metric barely distinguishes from chance. 200 permutation samples may be insufficient for 16D space. Needs refinement.
What We Don't Claim
NOT claiming:
- "L1 is always bad" (only in overcomplete toy setting)
- "This definitely applies to LLMs" (untested, unknown)
- "Don't use L1 for SAEs" (we still used L1 for the SAE, just not network training)
- "L2 is the solution" (it's one mitigation; modern SAE architectures may be equally or more important)
Claiming:
- L2 network regularization reduces polysemanticity vs L1 in this toy model
- The mechanism (WTA vs distributed) is consistent with theory
- This suggests network training dynamics matter for interpretability
- Further research on real models is needed
Future Directions
Immediate Next Steps
Test on real LLMs via PEFT - Fine-tune Llama or similar with LoRA adapters trained using L1 vs L2 regularization on toy-like data. Measure polysemanticity in adapted representations without catastrophic forgetting.
Compare to modern SAE variants - Does L2 network training combined with TopK or JumpReLU SAEs achieve better results than either alone?
Lambda sweep - Systematically vary L1/L2 strength to identify critical thresholds where polysemanticity minimizes. Is there a U-shaped curve?
Qualitative feature analysis - Generate visualizations of what L1 vs L2 features actually encode. Do L2 features correspond to clearer, more monosemantic concepts?
Ground truth feature recovery - Use Anthropic's toy model setup with known synthetic features. Measure recovery accuracy directly.
Longer-Term Questions
Does this scale? Will the L1 paradox hold in billion-parameter models with complex natural data?
Interaction with other training choices - How do learning rate, batch size, architecture depth affect the L1 vs L2 trade-off?
Optimal regularization mixture - Is there an elastic net configuration (αL1 + (1-α)L2) that achieves both sparsity AND low polysemanticity?
Feature importance distributions - How do real models' feature importance distributions (likely power-law) affect which features stay polysemantic under L1 vs L2?
Conclusion
This exploratory work in toy models suggests that network training dynamics, specifically the choice between L1 and L2 regularization, significantly affect the polysemanticity of learned representations. L2-trained networks produced 17.9% less feature entanglement than L1-trained networks (t=2.8, p<0.01), challenging the conventional wisdom that "sparse equals interpretable" in overcomplete settings.
The mechanism is clear: L1's discontinuous gradients create winner-take-all dynamics where a few neurons become polysemantic, while L2's smooth gradients encourage distributed representations where many neurons specialize. Orthogonal initialization amplifies this difference, synergizing with L2 but antagonizing L1.
Whether these findings generalize to real LLMs remains an open empirical question. The toy model's extreme overcapacity, synthetic data, and arbitrary design choices limit direct applicability. However, the mechanistic insights (L1's WTA trap, L2's smoothing effect, orthogonal × regularization interactions) may inform future work on training more interpretable models.
For the mechanistic interpretability community, this work suggests two paths forward:
- Architectural improvements (TopK SAEs, JumpReLU) to better decompose existing polysemantic representations
- Training improvements (L2 regularization, smooth activations) to reduce polysemanticity at the source
Both matter. Fixing polysemanticity likely requires addressing both how we train models and how we interpret them.
All code, data, and experimental details are available at github.com/stanleyngugi/taming_polysemanticity. I welcome feedback, replications, and extensions, especially testing these ideas on real LLMs.
Acknowledgments
Thanks to the mechanistic interpretability community for foundational work on SAEs and toy models. This research builds directly on Lecomte et al.'s incidental polysemanticity theory and Anthropic's dictionary learning approaches.
References
Bereska, L., & Gavves, E. (2024). Mechanistic Interpretability for AI Safety â A Review. Transactions on Machine Learning Research.
Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., & Olah, C. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread. https://transformer-circuits.pub/2023/monosemantic-features
Gao, L., Dupré la Tour, T., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., & Wu, J. (2024). Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.
Lecomte, V., Thaman, K., Schaeffer, R., Bashkansky, N., Chow, T., & Koyejo, S. (2023). What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes. arXiv preprint arXiv:2312.03096.
Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., KramĂĄr, J., Shah, R., & Nanda, N. (2024). Improving Dictionary Learning with Gated Sparse Autoencoders. arXiv preprint arXiv:2404.16014.
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations (ICLR).
Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N. L., McDougall, C., MacDiarmid, M., Freeman, C. D., Sumers, T. R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C., & Henighan, T. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Anthropic. https://www.anthropic.com/research/scaling-monosemanticity
Appendix: Technical Details
Hyperparameters
Network training (a minimal loop sketch follows this list):
- Architecture: [8 → 12 → 12 → 12 → 8]
- Optimizer: Adam (lr=1e-3, betas=(0.9, 0.999))
- L1 lambda: applied to weights only, added to the loss as a manual sum(abs(weights)) term
- L2 lambda: weight_decay=1e-4
- Noise injection: hidden layer, sigma=0.1
- Epochs: 2000
- Batch size: 64
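A minimal sketch of the corresponding training loop (PyTorch), shown for the baseline bipolar-noise configuration. The split of the MLP into `front`/`back` (so noise can be injected at the hidden layer) and the L1 strength are my own illustrative choices, since only the L2 value is listed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_network(x, y, reg="l1", lam=1e-4, sigma=0.1, epochs=2000, batch_size=64):
    """Sketch of the network training loop with the hyperparameters above."""
    # Split the MLP so noise can be added after the first hidden activation.
    front = nn.Sequential(nn.Linear(8, 12), nn.ReLU())
    back = nn.Sequential(nn.Linear(12, 12), nn.ReLU(),
                         nn.Linear(12, 12), nn.ReLU(), nn.Linear(12, 8))
    params = list(front.parameters()) + list(back.parameters())
    opt = torch.optim.Adam(params, lr=1e-3,
                           weight_decay=lam if reg == "l2" else 0.0)   # L2 via weight_decay
    for _ in range(epochs):
        perm = torch.randperm(len(x))
        for i in range(0, len(x), batch_size):
            xb, yb = x[perm[i:i + batch_size]], y[perm[i:i + batch_size]]
            h = front(xb)
            h = h + (torch.rand_like(h) * 2 - 1) * sigma   # bipolar noise at the hidden layer
            loss = F.mse_loss(back(h), yb)
            if reg == "l1":                                 # manual, weight-only L1 term
                loss = loss + lam * sum(p.abs().sum() for n, p in
                                        list(front.named_parameters()) +
                                        list(back.named_parameters())
                                        if "weight" in n)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return front, back
```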
SAE training:
- Architecture: [12 → 16 → 12]
- Activation: LeakyReLU(0.01)
- L1 lambda: 1e-4 on latent activations
- Optimizer: Adam (lr=1e-3)
- Epochs: 1000
- Reconstruction loss: MSE
Data generation:
- Features: 8, grouped [0-2], [3-4], [5-7]
- Correlations: base_strength â [0.5, 1.0]
- Sparsity: 50% active per sample
- Samples: 1024 train, 256 val
- Target: non-linear combinations with importance decay 0.9^i