arXiv cs.CL June 25, 2026 · Papers

Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

arXiv:2606.24952v1 Announce Type: new Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise -- that the direction which detects a behavior and the direction wh

Read original