Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment
arXiv:2606.23700v1 Announce Type: new
Abstract: Emergent misalignment (EM) has been linked to the activation of misaligned persona vectors and evil character traits, suggesting that EM operates…