arXiv cs.LG
Retroactive Advantage Correction: Closed-Form V-Trace Bias Correction for Delay-Aware RLHF
arXiv:2606.27580v1 (Announce Type: new)

Abstract: Reinforcement learning from human feedback (RLHF) in production does not always have a synchronous reward signal. Code-execution verifiers, slow judge ensembles, and queued human review can return several gradient steps after the rollout that produced them, breaking the s