arXiv cs.CL
Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment
arXiv:2606.23700v1 Announce Type: new
Abstract: Emergent misalignment (EM) has been linked to the activation of misaligned persona vectors and evil character traits, suggesting that EM operates through disruption of the model's aligned character rather than direct learning of harmful content. Motivated by this connecti