
Learning to Reason without External Rewards
Key Points
- This paper introduces Reinforcement Learning from Internal Feedback (RLIF), a novel paradigm enabling Large Language Models (LLMs) to enhance reasoning using intrinsic signals without external rewards or labeled data.
- The proposed method, INTUITOR, utilizes the model's own "self-certainty"—a measure of its internal confidence—as the sole intrinsic reward signal within a policy optimization framework like GRPO.
- Experiments demonstrate INTUITOR matches supervised RL performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks such as code generation, fostering emergent structured reasoning.
This paper introduces Reinforcement Learning from Internal Feedback (RLIF), a novel paradigm for enhancing Large Language Model (LLM) reasoning capabilities by leveraging intrinsic, self-generated signals without external rewards or labeled data. The motivation stems from the limitations of existing reinforcement learning approaches for LLMs: Reinforcement Learning from Human Feedback (RLHF) is costly and prone to bias due to reliance on human annotation, while Reinforcement Learning with Verifiable Rewards (RLVR) demands domain-specific verifiers and gold-standard solutions, limiting its applicability and generalizability.
The core objective of RLIF is to optimize a policy $\pi_\theta$ based on an intrinsic signal $r_{\text{int}}$, formalized as:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[\, r_{\text{int}}(x, y) \,\right] - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x) \right)$$

where $x$ is an input query, $y$ is the generated output, $\pi_{\text{ref}}$ is a reference policy, and $\beta$ controls the KL divergence penalty.
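The KL-regularized objective above can be estimated from samples. The sketch below is a minimal illustration, not the paper's implementation; the function name `rlif_objective_estimate` is hypothetical, and the per-sample KL term is approximated by the difference of sequence log-probabilities under the current and reference policies:

```python
import numpy as np

def rlif_objective_estimate(intrinsic_rewards, logp_policy, logp_ref, beta=0.01):
    """Monte Carlo estimate of the RLIF objective over sampled outputs.

    intrinsic_rewards: r_int(x, y_i) for each sampled output y_i.
    logp_policy / logp_ref: sequence log-probabilities of each y_i under
    the current policy and the reference policy; their difference is a
    per-sample estimate of the KL penalty term.
    beta: weight of the KL divergence penalty.
    """
    r = np.asarray(intrinsic_rewards, dtype=float)
    kl = np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float)
    # average of (intrinsic reward - beta * KL estimate) over the samples
    return (r - beta * kl).mean()
```

When the policy has not yet moved away from the reference, the KL estimate is zero and the objective reduces to the mean intrinsic reward.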
Under the RLIF paradigm, the paper proposes INTUITOR, a method that utilizes the model's own confidence, termed self-certainty, as the sole intrinsic reward signal $r_{\text{int}}$. Self-certainty is defined as the average KL divergence between a uniform distribution $U$ over the vocabulary and the model's next-token distribution:

$$\text{Self-certainty}(y \mid x) = \frac{1}{|y|} \sum_{i=1}^{|y|} D_{\mathrm{KL}}\!\left( U \,\big\|\, \pi_\theta(\cdot \mid x, y_{<i}) \right)$$

Here, $y_{<i}$ refers to the previously generated tokens, and $\pi_\theta(\cdot \mid x, y_{<i})$ is the model's predicted next-token distribution at step $i$. Higher self-certainty indicates greater confidence. This metric is chosen for its mode-seeking property and reported robustness against length biases, aiming to encourage the model to generate responses it deems more convincing and coherent.
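The metric is straightforward to compute from next-token logits. The following is a minimal sketch (the function name is illustrative, not from the paper), using the identity $D_{\mathrm{KL}}(U \,\|\, p) = -\log|V| - \frac{1}{|V|}\sum_j \log p_j$:

```python
import numpy as np

def self_certainty(logits):
    """Self-certainty of one generated sequence.

    logits: array of shape (T, V) -- next-token logits at each of the
    T generation steps, over a vocabulary of size V.
    Returns the average KL(U || p_t) over steps, where U is the uniform
    distribution over the vocabulary and p_t is the model's next-token
    distribution at step t.
    """
    T, V = logits.shape
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # KL(U || p_t) = -log V - (1/V) * sum_j log p_t(j)
    kl_per_step = -np.log(V) - log_p.mean(axis=-1)
    return kl_per_step.mean()
```

A uniform next-token distribution yields a self-certainty of zero, while sharply peaked (confident) distributions yield large positive values.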
INTUITOR integrates this self-certainty reward into the Group Relative Policy Optimization (GRPO) framework, which is commonly used in RLVR. For a group of $G$ outputs $\{o_1, \dots, o_G\}$ sampled for a query $q$, GRPO's objective for policy $\pi_\theta$ is typically:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \mathrm{clip}\!\left( r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_{i,t} \right) \right]$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the importance sampling ratio and $\hat{A}_{i,t}$ is the advantage estimate. In INTUITOR, the external verifiable reward is replaced by the self-certainty score $u_i = \text{Self-certainty}(o_i \mid q)$, and the advantage is computed by normalizing these scores within the sampled group of outputs:

$$\hat{A}_{i,t} = \frac{u_i - \mathrm{mean}\left(\{u_1, \dots, u_G\}\right)}{\mathrm{std}\left(\{u_1, \dots, u_G\}\right)}$$
This mechanism allows the model to learn by reinforcing outputs it assesses as highly confident, creating a self-improving loop without external human or algorithmic supervision.
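The group-relative normalization at the heart of this mechanism can be sketched in a few lines (the helper name is hypothetical; a small epsilon is added for numerical safety, an assumption not stated in the summary):

```python
import numpy as np

def group_relative_advantages(self_certainty_scores):
    """Advantages for a group of G responses sampled for the same query.

    In place of a verifiable reward, each response's self-certainty
    score is normalized against the group's mean and standard deviation,
    so responses the model finds more convincing than its own average
    attempt receive positive advantage.
    """
    u = np.asarray(self_certainty_scores, dtype=float)
    return (u - u.mean()) / (u.std() + 1e-8)
```

Because the scores are normalized within each group, roughly half the sampled responses are reinforced and half are suppressed regardless of the absolute confidence scale, which keeps the learning signal well-conditioned as overall confidence rises during training.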
Experiments using Qwen2.5-1.5B and Qwen2.5-3B models on the MATH dataset demonstrate that INTUITOR matches the performance of GRPO (which uses gold answers) on in-domain mathematical benchmarks (GSM8K, MATH500). Crucially, INTUITOR exhibits superior generalization to out-of-domain tasks like code generation (LiveCodeBench, CRUXEval), achieving a 65% relative improvement on LiveCodeBench for the Qwen2.5-3B model compared to no improvement for GRPO. Furthermore, INTUITOR enabled a Qwen2.5-1.5B model, initially producing repetitive and incoherent output, to generate structured reasoning chains and well-formed code. The paper highlights that INTUITOR leads to faster initial learning, improved instruction-following, and the emergence of long-form structured reasoning, where models learn to generate explicit reasoning steps before providing final answers, particularly evident in code generation tasks. This suggests that optimizing for intrinsic confidence encourages the model to generate more self-explanatory traces, leading to better understanding and performance.