arXiv cs.LG
On the Position Bias of On-Policy Distillation
arXiv:2606.22600v2 Announce Type: replace
Abstract: On-Policy Distillation (OPD) improves the learning efficiency of standard reinforcement learning through dense, token-level supervision from teachers. In the standard KL objective of OPD, token-level losses are uniformly averaged, implying equal weights for all tokens
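
To make the uniform averaging concrete, here is a minimal sketch of a token-level KL loss averaged with equal weights over positions. It is an illustration under stated assumptions, not the paper's implementation: the KL direction (student relative to teacher), the function names, and the toy logits are all hypothetical and do not come from the abstract.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: the direction of the KL is chosen for illustration only.
    """
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def uniform_kl_loss(student_seq, teacher_seq):
    """Uniform average of per-token KL: each of the T positions has weight 1/T."""
    per_token = [token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token)

def weighted_kl_loss(student_seq, teacher_seq, weights):
    """Same loss with explicit per-position weights (hypothetical helper)."""
    per_token = [token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(w * k for w, k in zip(weights, per_token))

# Toy data: 3 positions, vocabulary of size 3 (made-up logits).
student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]]
teacher = [[1.5, 0.7, -0.5], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

uniform = uniform_kl_loss(student, teacher)
# Passing weights of 1/T reproduces the uniform average.
explicit = weighted_kl_loss(student, teacher, [1 / 3] * 3)
```

Writing the mean as a weighted sum with weights 1/T makes the implicit choice visible: any other weighting over positions would plug into the same `weights` argument.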