Why Do Naive SFT Filters For Safety Properties Fail?

This is the fourth in a series of informal research updates from the Google DeepMind Language Model Interpretability team, covering interpretability and adjacent areas. The third post can be found here.

Since SFT is the cause of many safety-relevant properties, a natural strategy is to filter out rollouts from SFT that have