Memp: Exploring Agent Procedural Memory

Authors: Runnan Fang, Yuan Liang, Xiaobin Wang et al.

2025

TL;DR

Memp uses Build–Retrieve–Update procedural memories distilled from trajectories to cut ALFWorld test steps from 23.76 to 15.01 while raising success from 42.14% to 77.86%.

THE PROBLEM

Agents waste steps without reusable procedural memory (ALFWorld test steps 23.76 → 15.01, success 42.14% → 77.86% with Memp)

Memp targets agents whose procedural memory is brittle, manually engineered, or entangled in static parameters, leading to slow, inaccurate multi-step execution.

When TravelPlanner and ALFWorld tasks repeat structural patterns, agents still restart from scratch, wasting tokens and failing to reuse skills, tool sequences, and recovery tactics.

HOW IT WORKS

Memp — Build, Retrieve, and Update procedural memory

Memp centers on Build, Retrieve, and Update modules. Build distills past trajectories into one of three stored forms: abstract scripts, verbatim trajectories, or Proceduralization, which combines the two. All entries are kept in a procedural memory library.

You can think of Memp like a layered cache: raw trajectories are disk logs, abstract scripts are indexed functions, and retrieval is a semantic card catalog over prior runs.

This design lets Memp inject distilled procedural knowledge into the agent policy π_mp(a_t | s_t), enabling continual skill reuse far beyond what a plain context window or static prompt templates can support.
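To make the library idea concrete, here is a minimal sketch of what one stored entry might look like. The class names, fields, and the `summarize` helper are hypothetical illustrations, not the paper's implementation; in particular, `summarize` is a trivial stand-in for whatever model abstracts a trajectory into a script.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProceduralMemory:
    """One library entry (hypothetical layout, not the paper's schema)."""
    task: str               # task description, usable as a retrieval key
    trajectory: List[str]   # verbatim steps from a past run
    script: str             # abstract, high-level version of the procedure
    reward: float           # outcome signal for that run


@dataclass
class MemoryLibrary:
    entries: List[ProceduralMemory] = field(default_factory=list)

    def add(self, entry: ProceduralMemory) -> None:
        self.entries.append(entry)


def summarize(trajectory: List[str]) -> str:
    """Toy stand-in for an LLM summarizer: keep the first word of each step."""
    return " -> ".join(step.split()[0] for step in trajectory)


def build(task: str, trajectory: List[str], reward: float) -> ProceduralMemory:
    """Turn one finished run into an entry holding both trajectory and script."""
    return ProceduralMemory(task, trajectory, summarize(trajectory), reward)


library = MemoryLibrary()
library.add(build(
    "heat an egg",
    ["go to fridge 1", "take egg 1", "heat egg 1 with microwave 1"],
    1.0,
))
print(library.entries[0].script)  # go -> take -> heat
```

Keeping both the trajectory and the script on one entry mirrors the combined Proceduralization form; a Script-only or Trajectory-only store would simply leave one of the two fields unused.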

DIAGRAM

Online interaction and memory update loop in Memp

This diagram shows how Memp uses trajectories and rewards to build, retrieve, and update procedural memory across sequential tasks.

DIAGRAM

Evaluation design for Build, Retrieve, and Update policies

This diagram shows how Memp evaluates different Build, Retrieve, and Update strategies on TravelPlanner and ALFWorld.

PROCESS

How Memp Handles a Multi-Task Agent Session

  1. Build

    In Build, Memp applies the builder B to each task trajectory τ and reward r, creating one memory m_pt per task and aggregating these into the procedural memory library Mem.

  2. Retrieve

    In Retrieve, Memp encodes the new task t_new with ϕ and selects m_retrieved via cosine similarity using the Key Query or Key AveFact strategy.

  3. Update

    In Update, Memp applies U = Add ⊖ Del ⊕ Update using Vanilla Memory Update, Validation, or Adjustment, based on execution feedback E(t).

  4. Proceduralization

    In Proceduralization, Memp combines full trajectories with high-level scripts, then feeds this hybrid procedural memory into the agent policy π_mp(a_t | s_t).
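The Retrieve step above can be sketched as a nearest-neighbor lookup under cosine similarity. This is a toy illustration under stated assumptions: `embed` is a bag-of-words stand-in for the encoder ϕ, the memory contents are invented examples, and keying each entry by its task description is only one plausible reading of the Key Query strategy.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for the encoder phi."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical library: task description (key) -> stored script (value).
memory = {
    "heat an egg and put it on the table":
        "find egg -> heat in microwave -> place on table",
    "clean a mug and put it in the cabinet":
        "find mug -> wash in sink -> place in cabinet",
    "book a three day trip to Boston":
        "search flights -> search hotels -> build itinerary",
}


def retrieve(task_new, memory, k=1):
    """Return the k entries whose keys are most similar to the new task."""
    query = embed(task_new)
    ranked = sorted(memory, key=lambda key: cosine(query, embed(key)), reverse=True)
    return [(key, memory[key]) for key in ranked[:k]]


best_key, best_script = retrieve("heat a potato and put it on the counter", memory)[0]
print(best_key)
print(best_script)
```

The retrieved script would then be placed in the agent's context before it acts on the new task. A real system would swap `embed` for a learned text encoder.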

KEY CONTRIBUTIONS

Key Contributions

  • Task-agnostic procedural memory framework

    Memp formalizes procedural memory as Mem = Σ m_pt and integrates Build, Retrieve, and Update modules, turning trajectories into reusable skills across TravelPlanner and ALFWorld.

  • Systematic Build and Retrieve strategies

    Memp compares Script, Trajectory, and Proceduralization storage plus Random Sample, Key Query, and Key AveFact retrieval, showing that Proceduralization with AveFact yields the strongest gains.

  • Online memory update mechanisms

    Memp introduces Vanilla Memory Update, Validation, and Adjustment, demonstrating that reflexion-style Adjustment yields up to +0.7 reward and a 14-step reduction over other updates.
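The three update mechanisms named above can be contrasted in a few lines. The summary does not define them, so the behaviors below are assumptions: Vanilla appends every finished task, Validation appends only successful ones, and Adjustment revises a retrieved memory in place after it led to a failure. The `revise` function is a hypothetical placeholder for an LLM reflection step.

```python
def vanilla_update(memory, batch):
    """Assumed behavior: append a memory for every finished task."""
    return memory + [{"task": t, "script": s} for t, s, _ok in batch]


def validation_update(memory, batch):
    """Assumed behavior: append only memories from successful tasks."""
    return memory + [{"task": t, "script": s} for t, s, ok in batch if ok]


def revise(old_script, failed_script):
    """Hypothetical placeholder for LLM reflection over a failed run."""
    return old_script + " | avoid: " + failed_script


def adjustment_update(memory, retrieved_index, failed_script, revise_fn):
    """Assumed behavior: rewrite the retrieved entry using the failed run."""
    updated = list(memory)  # copy so the caller's list is not mutated
    old = updated[retrieved_index]
    updated[retrieved_index] = {
        "task": old["task"],
        "script": revise_fn(old["script"], failed_script),
    }
    return updated


# Each batch item: (task, script distilled from the run, success flag).
batch = [
    ("heat egg", "find -> heat -> place", True),
    ("cool mug", "find -> place", False),
]

vanilla = vanilla_update([], batch)        # keeps both runs
validated = validation_update([], batch)   # keeps only the successful run
adjusted = adjustment_update(validated, 0, "heat before find", revise)
print(len(vanilla), len(validated))  # 2 1
print(adjusted[0]["script"])
```

Under this reading, only Adjustment changes entries that already exist, which is one way a library could correct itself rather than just grow.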

RESULTS

By the Numbers

ALFWorld Test

77.86%

+35.72 points over GPT-4o No Memory (42.14%)

ALFWorld Steps

15.01 steps

-8.75 steps vs GPT-4o No Memory (23.76)

TravelPlanner #CS

79.94

+8.01 over GPT-4o No Memory (71.93)

TravelPlanner Steps

14.62 steps

-3.22 steps vs GPT-4o No Memory (17.84)

On TravelPlanner and ALFWorld, which test long-horizon tool use and embodied housework, Memp's Proceduralization with GPT-4o boosts success and cuts steps relative to the No Memory baseline. These numbers show that Memp converts prior trajectories into concrete efficiency and accuracy gains for agents.
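The deltas quoted above can be re-derived from the reported baseline and Memp figures. The short check below uses only numbers from this summary; the dictionary keys and the `change` helper are illustrative names.

```python
# (GPT-4o No Memory baseline, Memp Proceduralization) as reported above.
results = {
    "ALFWorld Test success (%)": (42.14, 77.86),
    "ALFWorld steps": (23.76, 15.01),
    "TravelPlanner #CS": (71.93, 79.94),
    "TravelPlanner steps": (17.84, 14.62),
}


def change(baseline, memp):
    """Return (absolute change, relative change in percent of baseline)."""
    delta = round(memp - baseline, 2)
    relative = round(100 * (memp - baseline) / baseline, 1)
    return delta, relative


for name, (baseline, memp) in results.items():
    delta, relative = change(baseline, memp)
    print(f"{name}: {delta:+.2f} absolute, {relative:+.1f}% relative")
```

The absolute changes match the card values (+35.72, -8.75, +8.01, -3.22). In relative terms, ALFWorld steps fall by roughly 37% and TravelPlanner steps by roughly 18%.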

BENCHMARK

ALFWorld Test performance with GPT-4o under different Build policies

Success rate (%) on ALFWorld Test for GPT-4o with No Memory, Script, Trajectory, and Proceduralization.

KEY INSIGHT

The Counterintuitive Finding

Procedural memory built by GPT-4o and transferred to Qwen2.5-14B raises TravelPlanner completion by 5% while cutting average steps by 1.6.

This is surprising because smaller models usually lag far behind, yet Memp shows a static memory bank from a stronger agent can materially boost weaker agents without retraining.

WHY IT MATTERS

What this unlocks for the field

Memp unlocks reusable, updatable procedural memory that grows across tasks, giving agents a concrete way to accumulate skills over time.

Builders can now treat trajectories as a shared procedural memory asset, distill it once with a strong agent, and plug it into weaker or specialized agents to gain efficiency and accuracy immediately.
