arXiv cs.CL · Papers
What Intermediate Layers Know: Detecting Jailbreaks from Entropy Dynamics
arXiv:2606.25182v1 Announce Type: new
Abstract: Jailbreak attacks reveal a persistent weakness in aligned Large Language Models: carefully crafted prompts can elicit policy-violating responses despite safety training. While most defenses operate at the prompt or output level, it remains unclear how harmful intent is en