arXiv cs.CL
· Papers
Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models
arXiv:2606.24952v1 Announce Type: new Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise -- that the direction which detects a behavior and the direction wh