arXiv cs.LG
· Papers
The Curse of Multiple Mediators: Hidden Interaction Effects in Activation Patching
arXiv:2606.27510v1 Announce Type: new Abstract: Activation patching is the primary tool in mechanistic interpretability. It attributes causal responsibility for a model behavior to each of its individual components by estimating its natural indirect effect (NIE). Re-deriving the activation patching estimand from causal