LessWrong AI · Communities

BeamGPT: A new paradigm for attention

I have found an operator that produces striking learning curves when used alongside standard attention in a nanoGPT-style character-level language model. It finds structure in the sequence that attention misses. The model learns a mix ratio of around 45% attention to 55% of the field operator. This ratio seem