arXiv cs.LG · Papers

On the Position Bias of On-Policy Distillation

arXiv:2606.22600v2

Abstract: On-Policy Distillation (OPD) improves the learning efficiency of standard reinforcement learning through dense, token-level supervision from teachers. In the standard KL objective of OPD, token-level losses are uniformly averaged, implying equal weights for all tokens
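
To make the uniform averaging concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes a reverse KL direction, KL(student || teacher), computed per position over a small vocabulary; the abstract does not state the direction. The function names `softmax`, `token_kl`, and `uniform_opd_loss` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single position.

    The KL direction is an assumption made for this sketch.
    """
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def uniform_opd_loss(student_seq, teacher_seq):
    """Average the per-token KL terms with equal weight 1/T for every position."""
    per_token = [token_kl(s, t) for s, t in zip(student_seq, teacher_seq)]
    return sum(per_token) / len(per_token), per_token


# Toy sequence of T = 2 positions over a 3-token vocabulary (made-up logits).
student_seq = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher_seq = [[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
loss, per_token = uniform_opd_loss(student_seq, teacher_seq)
```

In this toy input the first position contributes zero loss because student and teacher agree there, while the second contributes a positive term. Both positions nonetheless receive the same weight of 1/2 in the average, which is the equal-weighting property the abstract describes.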