What political censorship looks like inside an LLM's weights (Qwen 3.5)
The article discusses the mechanics of political censorship within a language model's architecture, particularly focusing on specific layers responsible for steering outputs. It highlights how different layers interact to produce nuanced responses, with a significant emphasis on the role of multi-layer perceptrons (MLPs) in shaping the model's behavior. The findings suggest that while the model can classify content, it is not infallible and can misclassify certain prompts.
- The model's censorship mechanisms are concentrated in specific layers, particularly L13 and L18.
- Multi-layer perceptrons are primarily responsible for the model's output behavior, accounting for a significant portion of the signal.
- The model's classification accuracy varies, with some prompts being misclassified despite the presence of underlying class representations.
Opening excerpt (first ~120 words)
Writers vs readers
Steering works at three specific layers (L13 for d_prc, L18 for d_refuse and d_style) and nowhere else cleanly. Steer at L5 or L11 and the effect is messy generic disruption. Steer at L28, after the verdict commits, and it's null. The circuit splits into two halves with very different mechanics, with the boundary around L20: the writer-band tap sweep (E6) shows the 3D-subspace effect peaking by the writer band (≈80% at tap 14) and tapering through tap 20 (≈48%), so the writer signal is essentially computed by ~L19–20 and the rest of the stack reads and renders it. The verdict then commits in Chinese tokens at tap 24 (§7, E19); in the last-token lens, Tiananmen stays ≈100% Chinese across taps 20–28 (§7). L31 is just the last transformer layer before lm_head.
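For readers unfamiliar with the term, "steering at a layer" generally means adding a scaled direction vector to the residual stream right after that layer and watching how the final state moves. The sketch below shows only that mechanic on a toy stack. Everything in it is an assumption made for illustration: the function names (`run_with_steering`, `add_scaled`, `dot`), the two-dimensional state, and the `halve` layer are hypothetical and are not taken from the article or from Qwen. The toy numbers say nothing about the article's results; in particular, the toy does not reproduce the finding that late steering is null.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# A toy "layer" maps the residual state to an update that is added back to it.
Layer = Callable[[Vector], Vector]


def add_scaled(x: Sequence[float], d: Sequence[float], alpha: float) -> Vector:
    """Return x + alpha * d, element-wise."""
    return [xi + alpha * di for xi, di in zip(x, d)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used to read out the component along a direction."""
    return sum(ai * bi for ai, bi in zip(a, b))


def run_with_steering(
    layers: Sequence[Layer],
    x: Sequence[float],
    steer_layer: Optional[int] = None,
    direction: Optional[Sequence[float]] = None,
    alpha: float = 0.0,
) -> Vector:
    """Run a toy residual stack, optionally adding alpha * direction
    to the residual state right after layer index `steer_layer`."""
    h = list(x)
    for idx, layer in enumerate(layers):
        h = add_scaled(h, layer(h), 1.0)  # residual update: h <- h + layer(h)
        if idx == steer_layer and direction is not None:
            h = add_scaled(h, direction, alpha)  # the steering intervention
    return h


# Hypothetical toy layer: each layer halves the state (update = -0.5 * h).
halve: Layer = lambda h: [-0.5 * v for v in h]
layers = [halve] * 4
direction = [0.0, 1.0]  # assumed unit steering direction
x0 = [1.0, 0.0]

baseline = run_with_steering(layers, x0)
for k in (1, 3):
    steered = run_with_steering(layers, x0, steer_layer=k, direction=direction, alpha=4.0)
    shift = dot(steered, direction) - dot(baseline, direction)
    print(f"steer after layer {k}: shift along direction = {shift}")
```

In a real model the same intervention is usually applied through a forward hook on the chosen transformer block, and the effect is measured on output logits or on a behavioral classifier rather than on a raw dot product.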
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Pages.