Yifan Zhang @ ICLR 2026 (@yifan_zhang_)
Scaling KL-Regularized Policy Gradient and REINFORCE Is All You Need.

Our ICLR 2026 paper, “On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning,” will be presented at Pavilion 4, Riocentro Convention and Event Center, today! Glad to see that V4 and V3.2 have adopted the corrected KL formulation presented in our paper.

Project Page: https://github.com/complex-reasoning/RPG
Paper: https://arxiv.org/abs/2505.17508

It would be even better if they used the REINFORCE estimator instead of the GRPO estimator in future versions! IN REINFORCE WE TRUST.
