LessWrong AI · Communities

Can we use steering vectors to suppress reward-hacking? Somewhat

Can steering vectors drive gradient routing? Yes, but not in realistic reward-hacking environments: there, they are not precise enough as classifiers of hacky vs. clean solutions. Instead, can we use a steering vector to initialise adapters so that gradient routing happens without a classifier, and we get automatic separation of