LessWrong AI
Refusal Is Complicated As Hell: An Update
TL;DR: It would make sense to briefly skim our previous post, which introduces our experiments on refusal in LLMs. There we explain how it started; here we’ll tell how it’s going. The primary goal of this text is to try to structure the list of whack-a-mole research questions. The secondary goal is to get some outs