Skip to content
arXiv cs.LG · Papers

The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching

arXiv:2606.27510v1 Announce Type: new Abstract: Activation patching is the primary tool in mechanistic interpretability. It attributes causal responsibility for a model behavior to each of its individual components by estimating its natural indirect effect (NIE). Re-deriving the activation patching estimand from causal