Memento-Skills: Let Agents Design Agents
Paper

Zhixun Chen
2026.03.21
arXiv · by 이호민
#Agent #Continual Learning #LLM #Memory #Reinforcement Learning

Key Points

  1. Memento-Skills introduces a continually-learnable LLM agent system that designs, adapts, and improves task-specific agents by evolving an external memory of reusable skills, keeping the core LLM parameters frozen.
  2. The system operates through a Read–Write Reflective Learning loop, where agents read skills for execution and then write updates back to the skill library based on feedback, effectively enabling policy improvement through memory mutation.
  3. This approach achieves significant performance improvements on benchmarks like GAIA and Humanity's Last Exam, demonstrating the efficacy of self-evolving skill memory and a behavior-aligned skill router.

Memento-Skills introduces a generalist, continually-learnable LLM agent system that functions as an "agent-designing agent," autonomously constructing, adapting, and improving task-specific agents through experience. It addresses a fundamental limitation of frozen LLM agents: with fixed parameters $\theta$, they cannot learn from their deployment experiences. Instead of costly parameter updates, Memento-Skills enables adaptation through the evolution of an externalized skill memory $M_t$.

The system is built on a memory-based reinforcement learning framework called Read–Write Reflective Learning, originating from the Stateful Reflective Decision Process (SRDP) [17]. The SRDP extends a standard Markov Decision Process (MDP) by augmenting the state with an episodic memory $M_t$, i.e., $x_t := (s_t, M_t)$, thus recovering the Markov property. The agent's policy is defined as $\pi_\mu(a \mid s, M_t) = \sum_{c \in M_t} \mu(c \mid s, M_t)\, p_{\mathrm{LLM}}(a \mid s, c)$, where $p_{\mathrm{LLM}}$ is the frozen LLM decision kernel, $s$ is the current state, $c$ is a retrieved case (skill) from memory $M_t$, and $\mu$ is the retrieval policy. The framework views this as a Reflected MDP with transition kernel $P_{\mathrm{LLM}}(x' \mid x, c) = \sum_{a \in A} p_{\mathrm{LLM}}(a \mid s, c)\, \mathbf{1}\{x' = (s', \mathrm{Write}(M, s, a, r))\}\, P(s' \mid s, a)$. The Write operation is crucial: it is not a simple append but encapsulates skill-level reflective updates.
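The memory-augmented policy mixture can be sketched numerically. Below, `retrieval_score` and `p_llm` are illustrative stand-ins (assumptions, not the paper's components) for the learned retrieval policy $\mu$ and the frozen LLM decision kernel $p_{\mathrm{LLM}}$:

```python
import math

def softmax(scores, tau=1.0):
    # Numerically stable softmax; plays the role of the retrieval policy mu.
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def mixture_policy(actions, memory, retrieval_score, p_llm, tau=1.0):
    # pi_mu(a | s, M_t) = sum_{c in M_t} mu(c | s, M_t) * p_LLM(a | s, c)
    mu = softmax([retrieval_score(c) for c in memory], tau)
    return {a: sum(w * p_llm(a, c) for w, c in zip(mu, memory))
            for a in actions}
```

Because each $p_{\mathrm{LLM}}(\cdot \mid s, c)$ is a distribution over actions and the retrieval weights sum to one, the mixture is itself a valid action distribution.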

The core methodology is a closed-loop, five-step process: Observe → Read → Act → Feedback → Write. This process maps directly onto policy iteration:

  1. Observe: The agent receives a task $q_t$, forming an augmented input $x_t = (q_t, T_t)$, where $T_t$ is a tip memory.
  2. Read (Policy Improvement - Skill Selection): A behavior-aligned skill router selects the most relevant skill $c_t$ from the skill library $S_t$. Unlike purely semantic similarity models, this router is trained via single-step offline Reinforcement Learning (RL) to optimize for execution success.
    • InfoNCE Routing: Retrieval is cast as a one-step MDP where the state is the query $q$ and the actions are skills $d$. The learned score function $s(d, q) = \mathrm{enc}_\theta(d)^\top \mathrm{enc}_\theta(q)$ acts as a soft Q-function, $Q_\theta(q, d) \propto s(d, q)$, yielding the Boltzmann routing policy $\pi_\theta(d \mid q) = \frac{\exp(Q_\theta(q, d)/\tau)}{\sum_{d'} \exp(Q_\theta(q, d')/\tau)}$.
    • Training: The router is trained on a local skill database (approx. 8k skills), using synthetic query generation where an LLM creates positive queries (target skill should be selected) and hard negatives (same domain but incorrect skill) based on skill names/descriptions. Minimizing the multi-positive InfoNCE loss effectively performs single-step offline policy improvement, pushing up positive scores and suppressing hard negatives.
    • Retrieval Pipeline: Combines sparse (BM25) and dense (embedding-based) retrieval, fuses candidates using score-aware Reciprocal Rank Fusion, and optionally applies a cross-encoder reranker.
    • Skill Generation: If no relevant skill is found and "CreateOnMiss" is enabled, a new skill is generated.
  3. Act (Execute): The frozen LLM executes the selected skill's multi-step workflow, producing an action $a_t$.
  4. Feedback (Judge): An external judge evaluates the outcome of $a_t$ against the task $q_t$ and ground truth $a_t^\star$, providing a reward $r_t$.
  5. Write (Policy Evaluation & Improvement - Reflective Update): This is where the memory $M_t$ (the skill library $S_t$) is actively mutated.
    • Utility Update: The empirical success rate of the executed skill $c_t$ is updated: $U_{t+1}(c_t) \leftarrow \frac{n_{\mathrm{succ}}(c_t)}{n_{\mathrm{succ}}(c_t) + n_{\mathrm{fail}}(c_t)}$.
    • Failure Handling (Skill Evolution): If the execution fails ($r_t = \text{incorrect}$), a GenericTip is added to the tip memory, and a failure attribution selector identifies the specific skill $c^\dagger$ responsible for the error.
      • Skill Optimization: If the utility $U_t(c^\dagger)$ is above a threshold $\delta$, or there are too few samples, the system optimizes the existing skill in place via targeted file-level updates (rewriting code or prompts within $c^\dagger$) to add guardrails or alternative strategies.
      • Skill Discovery: If $U_t(c^\dagger)$ drops below $\delta$ (indicating that patching is insufficient), the system escalates to skill discovery, either restructuring $c^\dagger$ with a fundamentally different approach or synthesizing an entirely new skill $c'$ to expand the library's coverage.
      • Validation: All mutations are guarded by an automatic unit-test gate, where a synthetic test case is generated and executed through the updated skill; changes are rolled back on failure.
    • Feedback Retry: The process can repeat steps 5b-5d for a few rounds if the initial update doesn't lead to success.
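The router's training objective from the Read step can be sketched as a multi-positive InfoNCE loss over routing scores. The toy embeddings and skill ids below are illustrative assumptions, not the paper's actual encoder or skill database:

```python
import math

def info_nce_loss(q_emb, skill_embs, positives, tau=0.1):
    # Routing score s(d, q) = enc(d)^T enc(q), treated as a soft Q-value.
    dot = lambda u, v: sum(x * y for x, y in zip(u, v))
    logits = {d: dot(e, q_emb) / tau for d, e in skill_embs.items()}
    m = max(logits.values())  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    # Average negative log-probability of the positive skills under the
    # Boltzmann routing policy pi(d | q) = exp(s/tau) / sum_d' exp(s'/tau).
    return -sum(logits[d] - log_z for d in positives) / len(positives)
```

Minimizing this loss pushes the positive skills' scores up and the hard negatives' scores down, which is exactly the single-step offline policy improvement described above.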
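Putting the five steps together, one pass of the loop with the utility bookkeeping and the $\delta$-threshold escalation might look as follows; the `router`, `execute`, and `judge` callables are placeholders for the components described above, and the default threshold value is an illustrative assumption:

```python
def run_episode(task, library, router, execute, judge, delta=0.4):
    # One Observe -> Read -> Act -> Feedback -> Write iteration.
    skill = router(task, library)          # Read: behavior-aligned routing
    answer = execute(task, skill)          # Act: frozen LLM runs the skill
    success = judge(task, answer)          # Feedback: external judge

    # Write: update U(c) = n_succ / (n_succ + n_fail) for the used skill.
    counts = library.setdefault(skill, {"succ": 0, "fail": 0})
    counts["succ" if success else "fail"] += 1
    utility = counts["succ"] / (counts["succ"] + counts["fail"])

    if success:
        return "ok", utility
    # Escalation: patch in place while utility stays at or above delta,
    # otherwise escalate to discovering/restructuring a replacement skill.
    return ("optimize" if utility >= delta else "discover"), utility
```

Repeated failures drive a skill's utility below the threshold, switching the Write step from in-place optimization to skill discovery.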

Skills themselves are stored as reusable, evolving artifacts, typically structured as markdown files (SKILL.md) containing declarative specifications, prompts, and executable code. This allows for fine-grained, localized improvements to the policy embodied within each skill.
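As a rough illustration, a parsed skill artifact could be modeled in memory as below; the field names are assumptions rather than the paper's exact SKILL.md schema:

```python
from dataclasses import dataclass

@dataclass
class Skill:
    # In-memory view of one SKILL.md artifact (illustrative schema).
    name: str
    description: str
    prompt: str           # declarative specification / instructions
    code: str = ""        # optional executable snippet
    n_succ: int = 0       # success count for the utility estimate
    n_fail: int = 0       # failure count for the utility estimate

    @property
    def utility(self) -> float:
        # Empirical success rate U(c); 0.0 before any executions.
        total = self.n_succ + self.n_fail
        return self.n_succ / total if total else 0.0
```

Keeping the prompt and code as plain files is what makes the targeted, file-level updates of the Write step possible.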

Memento-Skills demonstrates substantial empirical gains on benchmarks like the General AI Assistants (GAIA) and Humanity’s Last Exam (HLE), achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. The system's modular architecture includes an LLM client, context manager, built-in tools, a skills system managing both built-in and generated skills, and an evolution engine for continuous improvement of the skill store.