GitHub - Marker-Inc-Korea/COT_steering: This repository aims to develop CoT Steering based on CoT without Prompting. It focuses on enhancing the model’s latent reasoning capability without additional training by leveraging Test-Time Scaling techniques.
Key Points
1. This paper introduces CoT Steering, an enhanced method for Chain-of-Thought reasoning without explicit prompting, which faithfully re-implements prior work and extends it with "steering tokens."
2. CoT Steering subtly guides the model's latent reasoning trajectory by injecting these tokens at the beginning of the assistant's response within standard chat templates, thereby narrowing the search space.
3. Evaluation on the Korean CSAT demonstrated significant performance gains, boosting a 32B-parameter model's score from 67 to 84 and showcasing improved reasoning capability and efficiency without additional training.
This paper introduces CoT Steering, a novel test-time scaling technique designed to enhance the latent reasoning capabilities of large language models (LLMs) without requiring additional training. It addresses the limitations of existing "Chain-of-Thought (CoT) reasoning without prompting" implementations by providing a faithful re-implementation of the method and extending it with "steering tokens."
The core methodology is predicated on the understanding that CoT reasoning is fundamentally a search problem. LLMs explore a solution space to find an optimal reasoning path. However, inherent biases from pretraining or fine-tuning often restrict this search to suboptimal regions. CoT Steering aims to recover richer reasoning trajectories by balancing search diversity with structured control.
The method operates by:
- CoT without Prompting as a Foundation: The paper re-implements the concept of CoT without prompting, which expands the model's search space by branching on the top-k tokens at decoding time. This implicitly explores latent reasoning paths without explicit verbalized prompts, in contrast to traditional CoT prompting. The authors emphasize correcting previous deviations related to decoding processes, aggregation mechanisms, and search semantics.
- Introducing Steering Tokens: The key innovation is the integration of "steering tokens." These tokens serve as a mechanism to explicitly condition the model's reasoning trajectory, thereby narrowing the search space in a controlled and deliberate manner. The purpose is to guide the model towards more reliable and structured CoT paths.
- Token-Level Steering via Chat Templates: The steering is applied at the token level, leveraging the autoregressive nature of LLMs to inject constraints directly into the decoding process. This is achieved by combining standard chat templates (e.g., alternating user/assistant turns) with the steering tokens. Specifically, these tokens are injected at the very beginning of the assistant's response. This subtly constrains the model's generation, allowing it to produce outputs as if they were its own responses, yet aligned with the intended reasoning trajectory. For unconstrained CoT decoding, the `STEERING_TOKEN` can simply be set to an empty string. While steering could in principle be applied in the latent space or via potential functions, the authors found no significant performance difference and opted for token-level steering for its greater flexibility and computational efficiency.
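The two mechanisms above can be sketched in a few lines of Python. This is an illustrative sketch only: the `<|user|>`/`<|assistant|>` template markers, the function names, and the toy logit handling are assumptions for demonstration, not the repository's actual API or the target model's real chat template.

```python
# Sketch of token-level CoT Steering and top-k first-token branching.
# All names and template markers here are illustrative assumptions.

def build_steered_prompt(user_message: str, steering_token: str = "") -> str:
    """Inject a steering token at the start of the assistant turn.

    The steering text is placed immediately after the assistant header,
    so the model continues it as if it were its own response. With
    steering_token = "" this reduces to unconstrained CoT decoding.
    """
    return (
        "<|user|>\n" + user_message + "\n"
        "<|assistant|>\n" + steering_token
    )


def top_k_first_tokens(next_token_logits: list[float], k: int) -> list[int]:
    """Rank candidate first tokens for CoT-without-prompting branching.

    Instead of committing to the greedy argmax, the decoder branches on
    the k highest-scoring first tokens and continues each branch,
    surfacing latent reasoning paths that greedy decoding would miss.
    """
    ranked = sorted(
        range(len(next_token_logits)),
        key=lambda i: next_token_logits[i],
        reverse=True,
    )
    return ranked[:k]
```

For example, `build_steered_prompt("Solve 12 * 7.", "Let me reason step by step.")` seeds the assistant turn with the steering text, after which ordinary autoregressive decoding takes over; no model weights or architecture are touched.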
This approach offers several advantages: it is prompt-agnostic, compatible with standard chat interfaces, compact, modular, and provides flexible control over the model's generation space without architectural modifications.
The effectiveness of CoT Steering was evaluated on the Korean language section of the 2025 Korean CSAT using the FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview model. Applying CoT Steering improved the model's score from a baseline of 67 to 84, demonstrating the potential of test-time reasoning modulation for high-stakes tasks. Notably, this gain was achieved with a 32B-parameter model competing effectively against much larger comparison baselines (ranging from 100B to 685B parameters), highlighting the efficiency and scalability of the method.