X · @lilianweng October 27, 2025 · X / Twitter

On-policy distillation provides an elegant way to use the teacher model as a process reward model to provide dense reward while preventing SFT-style "OOD shock" during rollout.

Thinking Machines: Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with th
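
To make "teacher as a process reward model" concrete, here is a minimal sketch. It assumes the dense reward at each position is the teacher's log-probability of the token the student actually sampled, minus the student's own log-probability of that token (a single-sample estimate of the negative reverse KL at that position). The function names and toy logits are hypothetical and not taken from the post.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def dense_token_rewards(student_logits, teacher_logits, sampled_tokens):
    """Return one reward per token of a student-sampled rollout.

    Assumed reward (illustrative, not the post's exact recipe):
        r_t = log p_teacher(tok_t) - log p_student(tok_t)
    The tokens come from the student's own rollout, so the teacher only
    ever scores states the student actually visits.
    """
    rewards = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_tokens):
        s_lp = log_softmax(s_row)[tok]
        t_lp = log_softmax(t_row)[tok]
        rewards.append(t_lp - s_lp)
    return rewards


# Toy rollout: vocabulary of 2 tokens, 2 positions, student is uniform.
student = [[0.0, 0.0], [0.0, 0.0]]
teacher = [[2.0, 0.0], [2.0, 0.0]]
tokens = [0, 1]  # sampled by the student
rewards = dense_token_rewards(student, teacher, tokens)
# Position 0: student chose the token the teacher prefers, so reward > 0.
# Position 1: student chose the token the teacher disfavors, so reward < 0.
```

Every position gets its own signal, which is what makes the reward dense, and because the rollout is sampled from the student there is no mismatch between training states and the states the student reaches on its own.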
