
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
Key Points
- While Group Relative Policy Optimization (GRPO) is widely used for multi-reward RL, its direct application can collapse distinct reward combinations into identical advantage values, diminishing training signal resolution and leading to suboptimal convergence or failure.
- Group reward-Decoupled Normalization Policy Optimization (GDPO) resolves this by decoupling the group-wise normalization of individual rewards, preserving their relative differences more accurately, followed by batch-wise normalization for improved stability.
- Across diverse tasks like tool calling, math reasoning, and coding reasoning, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability by achieving higher accuracy, better format compliance, and improved adherence to constraints.
The paper introduces Group reward-Decoupled Normalization Policy Optimization (GDPO), a novel policy optimization method for multi-reward Reinforcement Learning (RL) that addresses limitations of the widely adopted Group Relative Policy Optimization (GRPO).
The core problem identified with GRPO in multi-reward settings is "reward signal collapse." When optimizing multiple objectives, GRPO typically sums all individual reward components into a single scalar $R_i = \sum_k r_i^{(k)}$ for the $i$-th response to question $q$. It then calculates a group-relative advantage by normalizing these aggregated rewards across all $G$ responses in a group: $A_i = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}$. This approach, used directly in the multi-reward GRPO objective, compresses distinct reward combinations into identical advantage values. For instance, in a two-binary-reward, two-rollout scenario, GRPO maps pairs of rollout reward totals such as (0,1), (0,2), and (1,2) to the same normalized advantages $(-1, 1)$, and (0,0), (1,1), (2,2) all to $(0, 0)$. This loss of information reduces the resolution of the training signal, leading to suboptimal convergence and potential training failures. The paper also demonstrates that removing the standard-deviation normalization term (GRPO w/o std) only slightly increases the number of distinct advantage groups and can lead to training instability.
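The collapse is easy to reproduce numerically. The sketch below is an illustration, not the paper's code; it assumes population standard deviation and a zero-advantage fallback for zero-variance groups:

```python
from statistics import mean, pstdev

def grpo_advantages(total_rewards):
    """Group-relative advantages: normalize summed rewards across the group."""
    mu, sigma = mean(total_rewards), pstdev(total_rewards)
    if sigma == 0:
        # Degenerate group (all rollouts tied): no training signal.
        return [0.0 for _ in total_rewards]
    return [(r - mu) / sigma for r in total_rewards]

# Distinct total-reward pairs collapse to the same advantage pair:
print(grpo_advantages([0, 1]))  # [-1.0, 1.0]
print(grpo_advantages([0, 2]))  # [-1.0, 1.0]
print(grpo_advantages([1, 2]))  # [-1.0, 1.0]
print(grpo_advantages([1, 1]))  # [0.0, 0.0]
```

However different the underlying reward combinations are, the gradient signal each rollout receives is identical, which is exactly the resolution loss the paper describes.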
GDPO resolves this by decoupling the normalization process. Instead of normalizing the sum of rewards, GDPO first performs group-wise normalization for *each individual reward* separately. For the $k$-th reward of the $i$-th rollout for question $q$, the normalized advantage is computed as $\hat{A}_i^{(k)} = \frac{r_i^{(k)} - \mathrm{mean}(\{r_j^{(k)}\}_{j=1}^{G})}{\mathrm{std}(\{r_j^{(k)}\}_{j=1}^{G})}$.
This step ensures that the relative differences within each reward dimension are preserved.
After individual reward normalization, these normalized advantages are aggregated, potentially with specific weights $w_k$ to reflect priorities: $\hat{A}_i = \sum_k w_k \hat{A}_i^{(k)}$.
Finally, GDPO applies a batch-wise normalization to this sum of multi-reward advantages: $A_i = \frac{\hat{A}_i - \mathrm{mean}(\{\hat{A}_j\}_{j \in \mathcal{B}})}{\mathrm{std}(\{\hat{A}_j\}_{j \in \mathcal{B}})}$, where $\mathcal{B}$ denotes all rollouts in the batch.
This batch-wise normalization keeps the magnitude of the advantage stable regardless of the number of rewards, improving training stability. GDPO's decoupled normalization preserves fine-grained distinctions: in the two-binary-reward, two-rollout example above, the total-reward pair (0,1) maps to per-rollout advantages $(-1, 1)$ while (0,2) maps to $(-2, 2)$, providing a more expressive training signal. GDPO consistently yields a substantially larger number of distinct advantage groups than GRPO and GRPO w/o std as the number of rollouts or rewards increases.
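The three GDPO steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, uniform default weights, population standard deviation, and zero-std fallback are all assumptions made here:

```python
from statistics import mean, pstdev

def _norm(xs):
    """Normalize a list to zero mean, unit (population) std; 0s if degenerate."""
    mu, sigma = mean(xs), pstdev(xs)
    return [0.0 if sigma == 0 else (x - mu) / sigma for x in xs]

def gdpo_advantages(reward_matrix, weights=None):
    """reward_matrix[i][k] = k-th reward of rollout i for one question.
    Step 1: group-wise normalization per reward dimension.
    Step 2: weighted sum of the per-dimension advantages."""
    n_rollouts, n_rewards = len(reward_matrix), len(reward_matrix[0])
    w = weights or [1.0] * n_rewards
    per_dim = [_norm([reward_matrix[i][k] for i in range(n_rollouts)])
               for k in range(n_rewards)]
    return [sum(w[k] * per_dim[k][i] for k in range(n_rewards))
            for i in range(n_rollouts)]

def batch_normalize(advantages):
    """Step 3: batch-wise normalization over all rollouts in the batch."""
    return _norm(advantages)

# Total 0 vs 1 (per-reward rows (0,0) and (0,1)) stays distinct from
# total 0 vs 2 (rows (0,0) and (1,1)):
print(gdpo_advantages([[0, 0], [0, 1]]))  # [-1.0, 1.0]
print(gdpo_advantages([[0, 0], [1, 1]]))  # [-2.0, 2.0]
```

Where GRPO mapped both groups to $(-1, 1)$, the decoupled version separates them before the final batch normalization rescales the signal.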
The paper also provides insights into effectively incorporating priority variations among objectives. While assigning different weights $w_k$ to the normalized advantages $\hat{A}_i^{(k)}$ is common, its effectiveness is limited when objectives differ significantly in difficulty: easier rewards might dominate optimization despite lower weights. A more robust approach is conditioning rewards, where an easier reward is only awarded if a more difficult, prioritized reward meets a threshold $\tau$: $r_i^{\text{easy}} \leftarrow r_i^{\text{easy}} \cdot \mathbb{1}\!\left[r_i^{\text{hard}} \geq \tau\right]$. This forces the model to prioritize the more challenging objective.
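A minimal sketch of the conditioning idea; the function name, reward roles, and the example threshold of 0.5 are hypothetical, not from the paper:

```python
def condition_reward(easy_reward, hard_reward, tau):
    """Gate the easier reward on the harder objective reaching threshold tau."""
    return easy_reward if hard_reward >= tau else 0.0

# E.g. a format reward only counts once an accuracy reward clears tau = 0.5:
print(condition_reward(1.0, 0.8, 0.5))  # 1.0  (hard objective met)
print(condition_reward(1.0, 0.2, 0.5))  # 0.0  (easy reward withheld)
```

Because the easy reward is zeroed out until the hard objective is satisfied, the model cannot harvest easy-reward advantage while ignoring the prioritized objective.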
GDPO's effectiveness is demonstrated across three tasks:
- Tool Calling: Jointly optimizing correctness and format-compliance rewards. GDPO consistently achieves higher correctness and format reward scores, outperforming GRPO and GRPO w/o std (which failed to learn format compliance) on the BFCL-v3 benchmark with Qwen2.5-Instruct (1.5B and 3B) models.
- Math Reasoning: Balancing accuracy and adherence to a length constraint. GDPO showed better convergence and prevented training collapse observed with GRPO, leading to higher accuracy while maintaining response shortness for DeepSeek-R1 and Qwen3-4B-Instruct models.
- Coding Reasoning: Jointly optimizing code-generation accuracy, length constraints, and bug ratio. GDPO's superior performance across these diverse multi-reward settings underscores its generalizability and effectiveness over GRPO.