X · @lilianweng · X / Twitter

On-policy distillation provides an elegant way to use the teacher model as a process reward model to provide dense reward while preventing SFT-style "OOD shock" during rollout.

Thinking Machines: Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with th
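
To make the "teacher as process reward model" idea concrete, here is a minimal sketch of a dense, per-token reward. It assumes one common formulation: the student samples the rollout, and each sampled token is scored by how much more (or less) likely the teacher finds it than the student, i.e. the negative of a single-sample reverse-KL estimate. The function names and the toy logits are hypothetical and not taken from the post.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def per_token_rewards(student_logits, teacher_logits, sampled_tokens):
    """Dense reward for a student-sampled rollout (illustrative assumption).

    For each position, reward = log p_teacher(token) - log p_student(token),
    the negative of a one-sample estimate of the reverse KL at that position.
    """
    rewards = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_tokens):
        s_lp = log_softmax(s_row)[tok]
        t_lp = log_softmax(t_row)[tok]
        rewards.append(t_lp - s_lp)
    return rewards

# Toy rollout: vocabulary of 3 tokens, 2 positions, tokens sampled by the student.
student_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
teacher_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
sampled_tokens = [0, 2]

rewards = per_token_rewards(student_logits, teacher_logits, sampled_tokens)
print(rewards)
```

In this toy case the first position gets a reward of zero because student and teacher agree, while the second is penalized because the student sampled a token the teacher considers unlikely. Every position receives its own signal, which is what makes the reward dense, and the tokens come from the student's own rollout rather than from a fixed SFT dataset.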