Alignment Forum
· Communities
SFT Drives Gemini’s Safety Properties
This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adjacent areas. The second post can be found here.In this short post, we describe a surprising finding: most safety relevant properties in Gemini seem to be caused by the com