X · @lilianweng
On-policy distillation provides an elegant way to use the teacher model as a process reward model to provide dense reward while preventing SFT style "OOD shock" during rollout.
Thinking Machines: Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with th
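
A rough sketch of the "teacher as dense reward" idea, under assumptions not stated in the post itself: the student samples its own rollout, and every sampled token is scored by the teacher's log-probability minus the student's, which is a single-sample estimate of the negative reverse KL at that position. The function names and toy logits are hypothetical, for illustration only.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def dense_rewards(student_logits, teacher_logits, sampled_tokens):
    """Per-token reward on a student-sampled rollout (assumed form):
    log p_teacher(token) - log p_student(token).
    One reward per position, rather than one reward per sequence."""
    rewards = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_tokens):
        s_lp = log_softmax(s_row)[tok]
        t_lp = log_softmax(t_row)[tok]
        rewards.append(t_lp - s_lp)
    return rewards

# Toy rollout: 2 positions, vocabulary of 3 tokens (made-up numbers).
student_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher_logits = [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
sampled_tokens = [0, 1]  # tokens the student actually sampled

rewards = dense_rewards(student_logits, teacher_logits, sampled_tokens)
```

In this toy case the first position gets zero reward because teacher and student agree, while the second is penalized because the student sampled a token the teacher considers unlikely. Because the scored tokens come from the student's own rollout, the training signal covers the states the student actually visits.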