
Learning to Reason without External Rewards
Key Points
- This paper introduces Reinforcement Learning from Internal Feedback (RLIF), a novel paradigm enabling Large Language Models (LLMs) to enhance reasoning using intrinsic signals without external rewards or labeled data.
- The proposed method, INTUITOR, utilizes the model's own "self-certainty"—a measure of its internal confidence—as the sole intrinsic reward signal within a policy optimization framework like GRPO.
- Experiments demonstrate INTUITOR matches supervised RL performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks such as code generation, fostering emergent structured reasoning.
This paper introduces Reinforcement Learning from Internal Feedback (RLIF), a novel paradigm for enhancing Large Language Model (LLM) reasoning capabilities by leveraging intrinsic, self-generated signals without external rewards or labeled data. The motivation stems from the limitations of existing reinforcement learning approaches for LLMs: Reinforcement Learning from Human Feedback (RLHF) is costly and prone to bias due to reliance on human annotation, while Reinforcement Learning with Verifiable Rewards (RLVR) demands domain-specific verifiers and gold-standard solutions, limiting its applicability and generalizability.
The core objective of RLIF is to optimize a policy $\pi_\theta$ based on an intrinsic signal $r_{\text{int}}$, formalized as:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[\, r_{\text{int}}(x, y) \,\right] - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x) \right)$$

where $x$ is an input query, $y$ is the generated output, $\pi_{\text{ref}}$ is a reference policy, and $\beta$ controls the KL divergence penalty.
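The KL-regularized objective above can be estimated from samples. The sketch below is a minimal illustration, not the paper's implementation; the function name `rlif_objective_estimate` is hypothetical, and the per-sample KL term is approximated by the difference of sequence log-probabilities under the current and reference policies:

```python
import numpy as np

def rlif_objective_estimate(intrinsic_rewards, logp_policy, logp_ref, beta=0.01):
    """Monte Carlo estimate of the RLIF objective over sampled outputs.

    intrinsic_rewards: r_int(x, y_i) for each sampled output y_i.
    logp_policy / logp_ref: sequence log-probabilities of each y_i under
    the current policy and the reference policy; their difference is a
    per-sample estimate of the KL penalty term.
    beta: weight of the KL divergence penalty.
    """
    r = np.asarray(intrinsic_rewards, dtype=float)
    kl = np.asarray(logp_policy, dtype=float) - np.asarray(logp_ref, dtype=float)
    # average of (intrinsic reward - beta * KL estimate) over the samples
    return (r - beta * kl).mean()
```

When the policy has not yet moved away from the reference, the KL estimate is zero and the objective reduces to the mean intrinsic reward.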
Under the RLIF paradigm, the paper proposes INTUITOR, a method that utilizes the model's own confidence, termed self-certainty, as the sole intrinsic reward signal $r_{\text{int}}$. Self-certainty is defined as the average KL divergence between a uniform distribution $U$ over the vocabulary and the model's next-token distribution:

$$\text{Self-certainty}(y \mid x) = \frac{1}{|y|} \sum_{i=1}^{|y|} D_{\mathrm{KL}}\!\left( U \,\big\|\, \pi_\theta(\cdot \mid x, y_{<i}) \right)$$

Here, $y_{<i}$ refers to the previously generated tokens, and $\pi_\theta(\cdot \mid x, y_{<i})$ is the model's predicted next-token distribution at step $i$. Higher self-certainty indicates greater confidence. This metric is chosen for its mode-seeking property and reported robustness against length biases, aiming to encourage the model to generate responses it deems more convincing and coherent.
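The metric is straightforward to compute from next-token logits. The following is a minimal sketch (the function name is illustrative, not from the paper), using the identity $D_{\mathrm{KL}}(U \,\|\, p) = -\log|V| - \frac{1}{|V|}\sum_j \log p_j$:

```python
import numpy as np

def self_certainty(logits):
    """Self-certainty of one generated sequence.

    logits: array of shape (T, V) -- next-token logits at each of the
    T generation steps, over a vocabulary of size V.
    Returns the average KL(U || p_t) over steps, where U is the uniform
    distribution over the vocabulary and p_t is the model's next-token
    distribution at step t.
    """
    T, V = logits.shape
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # KL(U || p_t) = -log V - (1/V) * sum_j log p_t(j)
    kl_per_step = -np.log(V) - log_p.mean(axis=-1)
    return kl_per_step.mean()
```

A uniform next-token distribution yields a self-certainty of zero, while sharply peaked (confident) distributions yield large positive values.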
INTUITOR integrates this self-certainty reward into the Group Relative Policy Optimization (GRPO) framework, which is commonly used in RLVR. For a group of $G$ outputs $\{o_1, \dots, o_G\}$ sampled for a query $q$, GRPO's objective for policy $\pi_\theta$ is typically:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( r_{i,t}(\theta)\, \hat{A}_{i,t},\; \mathrm{clip}\!\left( r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_{i,t} \right) \right]$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}$ is the importance sampling ratio and $\hat{A}_{i,t}$ is the advantage estimate. In INTUITOR, the external verifiable reward is replaced by the self-certainty score $u_i = \text{Self-certainty}(o_i \mid q)$, and the advantage is computed by normalizing these scores within the sampled group of outputs:

$$\hat{A}_{i,t} = \frac{u_i - \mathrm{mean}\left(\{u_1, \dots, u_G\}\right)}{\mathrm{std}\left(\{u_1, \dots, u_G\}\right)}$$
This mechanism allows the model to learn by reinforcing outputs it assesses as highly confident, creating a self-improving loop without external human or algorithmic supervision.
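The group-relative normalization at the heart of this mechanism can be sketched in a few lines (the helper name is hypothetical; a small epsilon is added for numerical safety, an assumption not stated in the summary):

```python
import numpy as np

def group_relative_advantages(self_certainty_scores):
    """Advantages for a group of G responses sampled for the same query.

    In place of a verifiable reward, each response's self-certainty
    score is normalized against the group's mean and standard deviation,
    so responses the model finds more convincing than its own average
    attempt receive positive advantage.
    """
    u = np.asarray(self_certainty_scores, dtype=float)
    return (u - u.mean()) / (u.std() + 1e-8)
```

Because the scores are normalized within each group, roughly half the sampled responses are reinforced and half are suppressed regardless of the absolute confidence scale, which keeps the learning signal well-conditioned as overall confidence rises during training.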
Experiments using Qwen2.5-1.5B and Qwen2.5-3B models on the MATH dataset demonstrate that INTUITOR matches the performance of GRPO (which uses gold answers) on in-domain mathematical benchmarks (GSM8K, MATH500). Crucially, INTUITOR exhibits superior generalization to out-of-domain tasks like code generation (LiveCodeBench, CRUXEval), achieving a 65% relative improvement on LiveCodeBench for the Qwen2.5-3B model compared to no improvement for GRPO. Furthermore, INTUITOR enabled a Qwen2.5-1.5B model, initially producing repetitive and incoherent output, to generate structured reasoning chains and well-formed code. The paper highlights that INTUITOR leads to faster initial learning, improved instruction-following, and the emergence of long-form structured reasoning, where models learn to generate explicit reasoning steps before providing final answers, particularly evident in code generation tasks. This suggests that optimizing for intrinsic confidence encourages the model to generate more self-explanatory traces, leading to better understanding and performance.