What political censorship looks like inside an LLM's weights (Qwen 3.5)

⚡ TL;DR · AI summary

The article discusses the mechanics of political censorship within a language model's architecture, particularly focusing on specific layers responsible for steering outputs. It highlights how different layers interact to produce nuanced responses, with a significant emphasis on the role of multi-layer perceptrons (MLPs) in shaping the model's behavior. The findings suggest that while the model can classify content, it is not infallible and can misclassify certain prompts.

Opening excerpt (first ~120 words)

Writers vs readers

Steering works at three specific layers (L13 for d_prc, L18 for d_refuse and d_style) and nowhere else cleanly. Steer at L5 or L11 and the effect is messy generic disruption. Steer at L28, after the verdict commits, and it's null.

The circuit splits into two halves with very different mechanics, with the boundary around L20: the writer-band tap sweep (E6) shows the 3D-subspace effect peaking by the writer band (≈80% at tap 14) and tapering through tap 20 (≈48%), so the writer signal is essentially computed by ~L19–20 and the rest of the stack reads and renders it. The verdict then commits in Chinese tokens at tap 24 (§7, E19); in the last-token lens Tiananmen stays ≈100% Chinese across taps 20–28 (§7). L31 is just the last transformer layer before lm_head.

Excerpt limited to ~120 words for fair-use compliance. The full article is at Pages.
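
The excerpt describes steering as something applied at a chosen layer. As a rough illustration of that mechanic only, here is a minimal pure-Python sketch of adding a direction vector to a residual stream after a given layer. Everything in it is hypothetical: the toy layers, the 3-dimensional hidden state, the direction `d`, and the strength `alpha` are stand-ins, and the 32-layer depth merely assumes zero-based indexing with L31 as the last layer. It does not reproduce the article's findings (for example, that late steering is null); it only shows where the vector enters the computation.

```python
# Toy residual stream: each "layer" adds an update to a running hidden vector.
# All names and numbers here are hypothetical, not taken from the article's code.

def make_layer(scale):
    """Return a toy layer that computes a residual update from the hidden state."""
    def layer(h):
        return [scale * x for x in h]
    return layer

def run(layers, h, steer_at=None, direction=None, alpha=0.0):
    """Run the stack; optionally add alpha * direction after layer `steer_at`."""
    h = list(h)
    for i, layer in enumerate(layers):
        update = layer(h)
        h = [x + u for x, u in zip(h, update)]  # residual connection
        if i == steer_at:
            # The steering step: nudge the hidden state along `direction`.
            h = [x + alpha * d for x, d in zip(h, direction)]
    return h

layers = [make_layer(0.1) for _ in range(32)]  # layers 0..31 (assumed depth)
h0 = [1.0, 0.0, -1.0]                          # toy hidden state
d = [0.0, 1.0, 0.0]                            # toy steering direction

baseline = run(layers, h0)
steered_mid = run(layers, h0, steer_at=18, direction=d, alpha=2.0)
steered_last = run(layers, h0, steer_at=31, direction=d, alpha=2.0)
```

In this linear toy, a vector injected at layer 18 is carried forward and rescaled by every later layer, while one injected at layer 31 reaches the output unchanged. A real model's later layers are nonlinear and may ignore or overwrite the injected direction, which is the kind of layer dependence the excerpt reports.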