Alignment Forum
· Communities
Models May Behave Worse When Eval Aware
This is the first in a series of research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas.TL;DRIt's often assumed that models will act more aligned when they can tell they're being evaluated. But we find that Gemini can take “undesired” actions in behavioura