Eliciting Reasoning in Language Models with Cognitive Tools
Mattia Rigotti
2025.07.06
arXiv
#LLM #Reasoning #Cognitive Tools #Agentic Framework #Tool Calling

Key Points

  1. Building on cognitive psychology, this paper introduces "cognitive tools"—modular, prompt-driven operations—to elicit reasoning in Large Language Models (LLMs) via an agentic tool-calling framework.
  2. These tools encapsulate functions like "understand question" and "backtracking," allowing the LLM to execute specific reasoning steps independently and feed structured outputs back into its problem-solving process.
  3. Experiments show that providing LLMs with these cognitive tools significantly improves their performance on challenging math reasoning benchmarks, with GPT-4.1 notably surpassing RL-trained o1-preview, demonstrating an effective, training-free alternative for enhancing reasoning.

This paper introduces a novel method for eliciting reasoning capabilities in Large Language Models (LLMs) by endowing them with "cognitive tools," inspired by principles from cognitive psychology and cognitive architectures. The central hypothesis is that reasoning arises from the orchestrated, sequential execution of modular, predetermined cognitive operations, which can be implemented within a modern agentic tool-calling framework.

Core Methodology: Cognitive Tools

The core of the methodology lies in defining a small, specialized set of "cognitive tools" that encapsulate specific reasoning operations. Unlike conventional agentic tools that interface with external APIs or functions (e.g., calculators, search engines), these cognitive tools are executed *internally* by the LLM itself. The LLM acts as both the orchestrator, deciding which tool to invoke and when, and the executor, performing the cognitive operation defined by the tool.

The execution pipeline mirrors a standard tool-calling process:

  1. The main LLM, in response to a query, generates a reasoning trace.
  2. If the LLM decides to call one of the predefined cognitive tools, its generation is temporarily stopped.
  3. The module encapsulating the invoked tool is executed. This execution involves prompting the *same LLM instance* in a "sandboxed context" with a specific prompt template designed for that tool.
  4. The output generated by this sandboxed execution, representing a structured intermediate result of the cognitive operation, is fed back into the main reasoning loop of the LLM.
  5. The LLM then continues its reasoning process, utilizing the insight gained from the tool's execution, until it generates a final answer or decides to call another tool.
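The five steps above can be sketched as a simple orchestration loop. This is a toy illustration, not the paper's implementation: `llm_generate` is a hypothetical stub standing in for a call to the actual model (the paper prompts the *same* model instance for both the main loop and the sandboxed tool executions), and the `CALL ...` / `ANSWER: ...` conventions are made up for the example.

```python
import re

# Hypothetical stub standing in for an LLM API call. Its canned responses
# exist only so the loop below can be run end to end.
def llm_generate(prompt):
    if prompt.startswith("Identify"):   # sandboxed tool context
        return "The problem asks for the product of two integers."
    if "TOOL RESULT" in prompt:         # tool output was already fed back
        return "ANSWER: 42"
    return "CALL understand_question"   # main loop decides to invoke a tool

# Hypothetical prompt template for one cognitive tool (step 3).
TOOL_PROMPTS = {
    "understand_question": "Identify the main concepts and relevant information in: {query}",
}

def run_cognitive_tools_loop(query, max_steps=5):
    context = query
    for _ in range(max_steps):
        out = llm_generate(context)                  # step 1: main reasoning
        call = re.match(r"CALL (\w+)", out)
        if call and call.group(1) in TOOL_PROMPTS:
            # Steps 2-4: pause the main generation, run the tool in a fresh
            # "sandboxed" context, and feed its structured output back.
            tool_prompt = TOOL_PROMPTS[call.group(1)].format(query=query)
            context += "\nTOOL RESULT: " + llm_generate(tool_prompt)
        elif out.startswith("ANSWER:"):              # step 5: final answer
            return out.removeprefix("ANSWER:").strip()
    return None

print(run_cognitive_tools_loop("What is 6 * 7?"))
```

The key design point the sketch preserves is that the tool execution receives only its own template-seeded context, not the full reasoning trace, while its result is appended back into the main loop.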

The paper defines four specific cognitive tools:

  • Understand Question: This tool prompts the LLM to perform "goal management" by deconstructing the problem. It identifies main concepts, extracts relevant information, and highlights meaningful properties, theorems, or techniques that could aid in problem-solving.
  • Recall Related: Inspired by prior work on recalling knowledge, this tool provides the LLM with relevant examples from similar problems it has previously encountered, along with their solutions. The objective is to guide the LLM's reasoning by demonstrating successful problem-solving strategies.
  • Examine Answer: This tool implements a form of "self-reflection." It prompts the LLM to critically review its current reasoning trace for potential flaws, wrong assumptions, miscalculations, or overlooked constraints. This allows the LLM to identify and correct errors in its ongoing thought process.
  • Backtracking: When the LLM recognizes a flawed reasoning path or an incorrect intermediate solution, this tool enables exploration of alternative approaches. It prompts the LLM to summarize its current trace, pinpoint the erroneous step, and suggest alternative directions for solving the problem, conceptually similar to Monte Carlo Tree Search.

These cognitive tools are presented to the LLM via a "Cognitive Tools Prompt" (a system prompt). This prompt instructs the LLM on how to utilize the tools, emphasizing flexible, autonomous decision-making regarding tool invocation and encouraging their use for complex or ambiguous questions. It also allows the LLM to generate code as an additional modular reasoning tool, which can then be executed. The system prompt specifies formatting for tool calls (e.g., Python-based function calls) and the final answer.
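A system prompt of this kind might look roughly as follows. This is a hypothetical sketch of the "Cognitive Tools Prompt" described above; the paper's actual wording and tool-call syntax differ:

```python
# Hypothetical sketch of a Cognitive Tools system prompt; not the paper's text.
COGNITIVE_TOOLS_SYSTEM_PROMPT = """\
You may call the following cognitive tools whenever a question is complex
or ambiguous: understand_question, recall_related, examine_answer,
backtracking. You may also write Python code as an additional modular
reasoning tool; it will be executed and its output returned to you.

To call a tool, emit a Python-style function call on its own line, e.g.:
    understand_question(question="<the problem statement>")

When you are confident in your solution, state it as:
    ANSWER: <final answer>
"""
```

The prompt deliberately leaves the choice and ordering of tools to the model, which is what distinguishes this setup from monolithic prompting schemes with a fixed step sequence.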

Experimental Setup and Results

The efficacy of cognitive tools is evaluated on challenging mathematical reasoning benchmarks: AIME 2024 (30 problems), MATH500 (500 problems), AMC (83 problems), and Smolagents Benchmark-v1 (50 math problems). The evaluation metric is pass@1 accuracy.
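For reference, with a single generation per problem, pass@1 is simply the fraction of problems answered correctly on the first attempt; the general unbiased pass@k estimator (Chen et al., 2021, commonly used for such benchmarks) reduces to that in the k = 1 case. A minimal sketch:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n generations of which c are correct,
    is correct. For k = 1 this reduces to c / n, i.e. plain accuracy."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, 3 correct answers out of 10 generations gives pass@1 = 0.3.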

The models tested include open-weight models (Qwen2.5-7B/32B Instruct, Llama3.1-8B Instruct, Llama3.3-70B Instruct) and closed models (GPT-4.1, o1-preview). Baselines include the plain LLM, Chain-of-Thought (CoT) prompting, and a code-equipped baseline (where the LLM can generate and execute code).

Key findings include:

  • Individual Tool Effectiveness: Each cognitive tool, when used individually, consistently improves LLM performance over the baseline across different models, demonstrating the utility of each specific cognitive operation.
  • Superiority over Cognitive Prompting: The modular cognitive tools approach consistently outperforms "cognitive prompting" (a monolithic prompting technique that dictates a fixed sequence of reasoning steps) across all tested models, highlighting the benefits of modularity and flexibility.
  • Significant Gains on Math Benchmarks: Equipping LLMs with the full suite of cognitive tools leads to substantial improvements in pass@1 accuracy across all math datasets and models compared to plain baselines, CoT, and code-equipped models.
  • Comparison to RL-Trained Models: Notably, GPT-4.1, when augmented with cognitive tools, significantly surpasses its baseline performance and even outperforms o1-preview (a model known for its reasoning capabilities and trained with reinforcement learning) on the AIME 2024 dataset.

Discussion and Conclusion

The paper concludes that cognitive tools provide a viable and effective alternative mechanism for eliciting robust reasoning capabilities in LLMs, without requiring additional training or complex RL pipelines. The modular design of these tools, by isolating cognitive operations, is hypothesized to reduce interference between reasoning steps and enhance the LLM's flexibility in problem-solving. This work contributes to the ongoing debate regarding the origins of reasoning in LLMs, suggesting that strong latent capabilities in base models can be effectively "uncovered" and leveraged through structured, modular cognitive interventions.