LessWrong AI
· Communities
BeamGPT: A new paradigm for attention
I have found an operator that achieves striking results in learning curves when used alongside standard attention in a nanoGPT-style character-level language model. It finds structure in the sequence that attention misses.The model learns a mix ratio of around 45% attention to 55% of the field operator. This ratio seem