Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning
arXiv:2606.24064v1 (Announce Type: new)

Abstract: Distilling reasoning capabilities from strong to weak language models typically involves imitating specific solution trajectories, effectively transferring what to answer rather…