LessWrong AI June 24, 2026 · Communities

Reasoning and learning about injected concepts in language models

This work was done as part of SPAR, under the mentorship of Mirko Bronzi and Damiano Fornasiere.

TL;DR: We test models' ability to recover information about their own activations by injecting steering vectors and asking the LLMs to verbalize properties of those vectors. We train models with in-context learning and test for three c
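The injection step described above can be pictured with a minimal sketch. It assumes the common formulation in which a steering vector, scaled by a coefficient, is added to a layer's per-token activations. The function name `inject_steering_vector`, the scale `alpha`, and the `positions` argument are hypothetical illustrations, not the authors' code. The layer and token positions the authors actually inject at are not stated in this summary.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def inject_steering_vector(
    hidden: List[Vector],
    steer: Vector,
    alpha: float,
    positions: Optional[Sequence[int]] = None,
) -> List[Vector]:
    """Hypothetical helper: return a copy of `hidden` (one activation
    vector per token) with alpha * steer added at the chosen positions.

    Assumption: injection is a simple additive shift of the activations;
    if `positions` is None, every token position is steered.
    """
    if any(len(h) != len(steer) for h in hidden):
        raise ValueError("steering vector must match the hidden dimension")
    targets = set(range(len(hidden))) if positions is None else set(positions)
    # Build new lists so the caller's activations are left unmodified.
    return [
        [x + alpha * s for x, s in zip(h, steer)] if i in targets else list(h)
        for i, h in enumerate(hidden)
    ]


if __name__ == "__main__":
    hidden = [[0.0, 1.0], [2.0, 3.0]]  # two tokens, hidden size 2
    steer = [1.0, -1.0]                # toy "concept" direction
    print(inject_steering_vector(hidden, steer, alpha=2.0, positions=[1]))
    # prints [[0.0, 1.0], [4.0, 1.0]]
```

In a real model the same shift would be applied inside the forward pass (for example via a hook on one layer) before asking the model to describe what it notices. The additive form is only one plausible reading of "injecting".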
