
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Key Points
- While Group Relative Policy Optimization (GRPO) is widely used for multi-reward RL, its direct application can collapse distinct reward combinations into identical advantage values, diminishing training signal resolution and leading to suboptimal convergence or failure.
- Group reward-Decoupled Normalization Policy Optimization (GDPO) resolves this by decoupling the group-wise normalization of individual rewards, preserving their relative differences more accurately, followed by batch-wise normalization for improved stability.
- Across diverse tasks like tool calling, math reasoning, and coding reasoning, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability by achieving higher accuracy, better format compliance, and improved adherence to constraints.
The paper introduces Group reward-Decoupled Normalization Policy Optimization (GDPO), a novel policy optimization method for multi-reward Reinforcement Learning (RL) that addresses limitations of the widely adopted Group Relative Policy Optimization (GRPO).
The core problem identified with GRPO in multi-reward settings is "reward signal collapse." When optimizing multiple objectives, GRPO typically sums all individual reward components into a single scalar $R_i = \sum_k r_i^{(k)}$ for the $i$-th response to question $q$. It then calculates a group-relative advantage by normalizing these aggregated rewards across all $G$ responses in a group: $A_i = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}$. This approach, used directly in the multi-reward GRPO objective, compresses distinct reward combinations into identical advantage values. For instance, in a two-binary-reward, two-rollout scenario, GRPO maps pairs of rollout reward totals such as (0,1), (0,2), and (1,2) to the same normalized advantages $(-1, 1)$, and (0,0), (1,1), (2,2) all to $(0, 0)$. This loss of information reduces the resolution of the training signal, leading to suboptimal convergence and potential training failures. The paper also demonstrates that removing the standard-deviation normalization term (GRPO w/o std) only slightly increases the number of distinct advantage groups and can lead to training instability.
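The collapse is easy to reproduce numerically. The sketch below is an illustration, not the paper's code; it assumes population standard deviation and a zero-advantage fallback for zero-variance groups:

```python
from statistics import mean, pstdev

def grpo_advantages(total_rewards):
    """Group-relative advantages: normalize summed rewards across the group."""
    mu, sigma = mean(total_rewards), pstdev(total_rewards)
    if sigma == 0:
        # Degenerate group (all rollouts tied): no training signal.
        return [0.0 for _ in total_rewards]
    return [(r - mu) / sigma for r in total_rewards]

# Distinct total-reward pairs collapse to the same advantage pair:
print(grpo_advantages([0, 1]))  # [-1.0, 1.0]
print(grpo_advantages([0, 2]))  # [-1.0, 1.0]
print(grpo_advantages([1, 2]))  # [-1.0, 1.0]
print(grpo_advantages([1, 1]))  # [0.0, 0.0]
```

However different the underlying reward combinations are, the gradient signal each rollout receives is identical, which is exactly the resolution loss the paper describes.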
GDPO resolves this by decoupling the normalization process. Instead of normalizing the sum of rewards, GDPO first performs group-wise normalization for *each individual reward* separately. For the $k$-th reward of the $i$-th rollout for question $q$, the normalized advantage is computed as $\hat{A}_i^{(k)} = \frac{r_i^{(k)} - \mathrm{mean}(\{r_j^{(k)}\}_{j=1}^{G})}{\mathrm{std}(\{r_j^{(k)}\}_{j=1}^{G})}$.
This step ensures that the relative differences within each reward dimension are preserved.
After individual reward normalization, these normalized advantages are aggregated, potentially with specific weights $w_k$ to reflect priorities: $\hat{A}_i = \sum_k w_k \hat{A}_i^{(k)}$.
Finally, GDPO applies a batch-wise normalization to this sum of multi-reward advantages: $A_i = \frac{\hat{A}_i - \mathrm{mean}(\{\hat{A}_j\}_{j \in \mathcal{B}})}{\mathrm{std}(\{\hat{A}_j\}_{j \in \mathcal{B}})}$, where $\mathcal{B}$ denotes all rollouts in the batch.
This batch-wise normalization keeps the magnitude of the advantage stable regardless of the number of rewards, improving training stability. GDPO's decoupled normalization preserves fine-grained distinctions: in the two-binary-reward, two-rollout example above, the total-reward pair (0,1) maps to per-rollout advantages $(-1, 1)$ while (0,2) maps to $(-2, 2)$, providing a more expressive training signal. GDPO consistently yields a substantially larger number of distinct advantage groups than GRPO and GRPO w/o std as the number of rollouts or rewards increases.
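The three GDPO steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, uniform default weights, population standard deviation, and zero-std fallback are all assumptions made here:

```python
from statistics import mean, pstdev

def _norm(xs):
    """Normalize a list to zero mean, unit (population) std; 0s if degenerate."""
    mu, sigma = mean(xs), pstdev(xs)
    return [0.0 if sigma == 0 else (x - mu) / sigma for x in xs]

def gdpo_advantages(reward_matrix, weights=None):
    """reward_matrix[i][k] = k-th reward of rollout i for one question.
    Step 1: group-wise normalization per reward dimension.
    Step 2: weighted sum of the per-dimension advantages."""
    n_rollouts, n_rewards = len(reward_matrix), len(reward_matrix[0])
    w = weights or [1.0] * n_rewards
    per_dim = [_norm([reward_matrix[i][k] for i in range(n_rollouts)])
               for k in range(n_rewards)]
    return [sum(w[k] * per_dim[k][i] for k in range(n_rewards))
            for i in range(n_rollouts)]

def batch_normalize(advantages):
    """Step 3: batch-wise normalization over all rollouts in the batch."""
    return _norm(advantages)

# Total 0 vs 1 (per-reward rows (0,0) and (0,1)) stays distinct from
# total 0 vs 2 (rows (0,0) and (1,1)):
print(gdpo_advantages([[0, 0], [0, 1]]))  # [-1.0, 1.0]
print(gdpo_advantages([[0, 0], [1, 1]]))  # [-2.0, 2.0]
```

Where GRPO mapped both groups to $(-1, 1)$, the decoupled version separates them before the final batch normalization rescales the signal.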
The paper also provides insights into effectively incorporating priority variations among objectives. While assigning different weights $w_k$ to the normalized advantages $\hat{A}_i^{(k)}$ is common, its effectiveness is limited when objectives differ significantly in difficulty: easier rewards might dominate optimization despite lower weights. A more robust approach is conditioning rewards, where an easier reward is only awarded if a more difficult, prioritized reward meets a threshold $\tau$: $r_i^{\text{easy}} \leftarrow r_i^{\text{easy}} \cdot \mathbb{1}\!\left[r_i^{\text{hard}} \geq \tau\right]$. This forces the model to prioritize the more challenging objective.
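A minimal sketch of the conditioning idea; the function name, reward roles, and the example threshold of 0.5 are hypothetical, not from the paper:

```python
def condition_reward(easy_reward, hard_reward, tau):
    """Gate the easier reward on the harder objective reaching threshold tau."""
    return easy_reward if hard_reward >= tau else 0.0

# E.g. a format reward only counts once an accuracy reward clears tau = 0.5:
print(condition_reward(1.0, 0.8, 0.5))  # 1.0  (hard objective met)
print(condition_reward(1.0, 0.2, 0.5))  # 0.0  (easy reward withheld)
```

Because the easy reward is zeroed out until the hard objective is satisfied, the model cannot harvest easy-reward advantage while ignoring the prioritized objective.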
GDPO's effectiveness is demonstrated across three tasks:
- Tool Calling: Jointly optimizing correctness and format-compliance rewards. GDPO consistently achieves higher correctness and format reward scores, outperforming GRPO and GRPO w/o std (which failed to learn format compliance) on the BFCL-v3 benchmark with Qwen2.5-Instruct (1.5B and 3B) models.
- Math Reasoning: Balancing accuracy and adherence to a length constraint. GDPO showed better convergence and prevented training collapse observed with GRPO, leading to higher accuracy while maintaining response shortness for DeepSeek-R1 and Qwen3-4B-Instruct models.
- Coding Reasoning: Jointly optimizing code-generation accuracy, length constraints, and bug ratio. GDPO's superior performance across these diverse multi-reward settings underscores its generalizability and effectiveness over GRPO.