Can You Hide From a Natural Language Autoencoder?

TLDR: NLAs are a recent black-box mech interp method for verbalizing model internals. I will focus on one of its two components, the Activation Verbalizer (AV), which generates a natural-language explanation of the model's internal activations. The main question I am trying to answer here is whether these NLAs