Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks

Authors: Zexue He, Yu Wang, Churan Zhi et al.

2026

TL;DR

MEMORYARENA uses multi-session Memory-Agent-Environment loops with interdependent subtasks to show that state-of-the-art memory agents still achieve near-zero success rates on realistic long-horizon tasks.

THE PROBLEM

Memory benchmarks miss agentic failures despite near-saturated long-context scores

Existing long-context memory benchmarks like LoCoMo report near-saturated performance, yet they test only static recall, with no actions or environment dynamics.

When agents face multi-session tasks with interdependent subtasks, MEMORYARENA reports Task Success Rates of only 0.00–0.12, meaning agents almost never complete realistic long-horizon goals.

HOW IT WORKS

MEMORYARENA — Memory-Agent-Environment loops for multi-session tasks

MEMORYARENA builds on two concepts, Memory-Agent-Environment Loops and a Multi-Session Working Flow, and applies them in environments such as Bundled Web Shopping and Group Travel Planning, coupling memorization with agentic actions and feedback.

Think of MEMORYARENA as a computer with RAM and disk: each session is a process whose working state is lost when it exits, and the memory system is the external store that must persist state between runs.

This design lets MEMORYARENA expose failures that a plain context window cannot, because crucial information disappears once a session ends and must be explicitly written to and read from persistent memory.
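The persistence requirement can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the paper: the `PersistentMemory` class, its JSON-file backing, and the `chosen_laptop` key are stand-ins chosen only to show that a later session can reach an earlier decision solely through the store.

```python
import json
from pathlib import Path
from typing import Dict, Optional


class PersistentMemory:
    """Hypothetical key-value memory backed by a JSON file (an assumption)."""

    def __init__(self, path: Path) -> None:
        self.path = path

    def _load(self) -> Dict[str, str]:
        # An absent file means nothing has been remembered yet.
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def update(self, key: str, value: str) -> None:
        data = self._load()
        data[key] = value
        self.path.write_text(json.dumps(data))

    def retrieve(self, key: str) -> Optional[str]:
        return self._load().get(key)


def run_session_one(memory: PersistentMemory) -> None:
    # Session-local context: discarded when this function returns.
    context = {"chosen_laptop": "model-A"}
    # Only what is explicitly written to memory survives the session.
    memory.update("chosen_laptop", context["chosen_laptop"])


def run_session_two(memory: PersistentMemory) -> Optional[str]:
    # Fresh context: session one's decision is reachable only via memory.
    return memory.retrieve("chosen_laptop")
```

If session one skips the `update` call, session two receives `None`, which is the kind of failure a single long context window would hide.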

DIAGRAM

Multi-session Memory-Agent-Environment loop across interdependent subtasks

This diagram shows how MEMORYARENA executes a sequence of interdependent subtasks as separate sessions, each updating and querying persistent memory.

DIAGRAM

Evaluation pipeline across four MEMORYARENA environments

This diagram shows how MEMORYARENA builds and evaluates multi-session tasks in four environments with GPT-5.1-mini plus different memory systems.

PROCESS

How MEMORYARENA Handles a Multi-Session Working Flow

01
Task Composition and Data Preparation
MEMORYARENA constructs Bundled Web Shopping, Group Travel Planning, Progressive Web Search, and formal reasoning tasks with explicitly interdependent subtasks and verified causal chains.
02
Single-Session Agent-Environment Interactions
Within each subtask, MEMORYARENA lets the LLM Agent interact stepwise with the Environment, collecting actions and observations as a session trace.
03
Multi-session Agent-Environment Interactions
MEMORYARENA sequences subtasks so later sessions depend on earlier ones, making previous traces inaccessible except through the memory system.
04
Final: Memory-Agent-Environment Loop
At each subtask, MEMORYARENA calls RETRIEVE and UPDATE on the memory system, then evaluates Task Success Rate and Task Progress Score across all sessions.
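The four steps above can be sketched as a small loop. This is an illustration under stated assumptions, not the benchmark's implementation: `ListMemory`, `echo_agent`, and `run_task` are hypothetical names, the keyword-overlap retrieval is a toy, and the ordering (RETRIEVE before a session, UPDATE after it) is an assumption about how the two calls are arranged.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ListMemory:
    """Toy memory: stores notes and recalls those sharing a word with the query."""
    notes: List[str] = field(default_factory=list)

    def retrieve(self, query: str) -> List[str]:
        words = set(query.lower().split())
        return [n for n in self.notes if words & set(n.lower().split())]

    def update(self, trace: List[str]) -> None:
        self.notes.extend(trace)


def echo_agent(subtask: str, recalled: List[str]) -> List[str]:
    """Stand-in agent: returns a one-line trace instead of acting in an environment."""
    return [f"done {subtask} using {len(recalled)} notes"]


def run_task(
    subtasks: List[str],
    agent: Callable[[str, List[str]], List[str]],
    memory: ListMemory,
) -> List[List[str]]:
    """Run each subtask as its own session and return the per-session traces."""
    traces = []
    for subtask in subtasks:
        recalled = memory.retrieve(subtask)  # RETRIEVE before acting (assumed order)
        trace = agent(subtask, recalled)     # single-session interaction
        memory.update(trace)                 # UPDATE once the session ends
        traces.append(trace)                 # traces are kept only for scoring
    return traces
```

The agent never sees `traces` directly, so anything it needs from an earlier session has to come back through `memory.retrieve`.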

KEY CONTRIBUTIONS

Key Contributions

01
MEMORYARENA: Agent Memory in Memory-Agent-Environment Loops
MEMORYARENA formalizes Memory-Agent-Environment Loops and Multi-Session Working Flow to couple memorization with actions, feedback, and persistent memory across sessions.
02
Four interdependent evaluation environments
MEMORYARENA provides Bundled Web Shopping, Group Travel Planning, Progressive Web Search, and formal math and physics reasoning, totaling 766 tasks with an average of 57 action steps.
03
Unified benchmarking of memory paradigms
MEMORYARENA evaluates long-context agents, RAG systems, and external memory agents like MemGPT, Mem0, GraphRAG, and ReasoningBank under the same multi-session setting.
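One way to picture a unified setting is a shared interface that every memory paradigm implements. The sketch below is an assumption for illustration: the `MemorySystem` protocol, the two toy classes, and the `evaluate` helper are hypothetical and do not reflect the actual APIs of MemGPT, Mem0, GraphRAG, or ReasoningBank.

```python
from typing import List, Protocol


class MemorySystem(Protocol):
    """Hypothetical common interface for all memory paradigms."""

    def retrieve(self, query: str) -> List[str]: ...
    def update(self, trace: List[str]) -> None: ...


class LongContextMemory:
    """Keeps the full history and returns all of it, with no selection."""

    def __init__(self) -> None:
        self.history: List[str] = []

    def retrieve(self, query: str) -> List[str]:
        return list(self.history)

    def update(self, trace: List[str]) -> None:
        self.history.extend(trace)


class KeywordRAGMemory:
    """Stand-in for retrieval: top-k entries by word overlap with the query."""

    def __init__(self, k: int = 2) -> None:
        self.k = k
        self.entries: List[str] = []

    def retrieve(self, query: str) -> List[str]:
        q = set(query.lower().split())
        scored = [(len(q & set(e.lower().split())), e) for e in self.entries]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [entry for score, entry in ranked[: self.k] if score > 0]

    def update(self, trace: List[str]) -> None:
        self.entries.extend(trace)


def evaluate(memory: MemorySystem, sessions: List[List[str]], query: str) -> List[str]:
    """Feed identical session traces to any memory system, then query it."""
    for trace in sessions:
        memory.update(trace)
    return memory.retrieve(query)
```

Because `evaluate` depends only on the protocol, each paradigm sees the same sessions and the same query, so differences in what comes back are attributable to the memory system alone.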

RESULTS

By the Numbers

Task Success Rate

0.12

GPT-5.1-mini long-context vs 0.00 for many memory agents on Bundled Web Shopping

Task Progress Score

0.79

Claude-Sonnet-4.5 long-context PS on Bundled Web Shopping vs 0.41 Memory Avg

Soft Process Score

0.52

GPT-5.1-mini long-context sPS on Group Travel Planning vs 0.38 All Method Avg

Average Steps per Task

57

Average action steps per MEMORYARENA task across environments

MEMORYARENA reports Task Success Rate, Task Progress Score, and soft Process Score across Bundled Web Shopping, Group Travel Planning, Progressive Web Search, and Formal Reasoning. These numbers show that even strong long-context and memory-augmented agents struggle to maintain and reuse information across interdependent multi-session tasks.
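To make the gap between the first two metrics concrete, here is a minimal sketch under assumed definitions; this explainer does not spell out the formulas. It assumes Task Success Rate is the fraction of tasks in which every subtask succeeds, and Task Progress Score is the mean fraction of subtasks completed per task. The soft Process Score is left out because its definition is not given here.

```python
from typing import List


def task_success_rate(results: List[List[bool]]) -> float:
    """Assumed definition: share of tasks whose subtasks ALL succeeded."""
    return sum(all(task) for task in results) / len(results)


def task_progress_score(results: List[List[bool]]) -> float:
    """Assumed definition: mean fraction of subtasks completed per task."""
    return sum(sum(task) / len(task) for task in results) / len(results)


# Hypothetical outcomes: one row per task, one flag per subtask.
example = [
    [True, True, False, False],  # half the subtasks done, task still fails
    [True, True, True, True],    # every subtask done, task succeeds
]
print(task_success_rate(example))    # 0.5
print(task_progress_score(example))  # 0.75
```

Under these assumed definitions, a task that completes most of its subtasks still counts as a failure for success rate, which is one way a progress score can sit well above a success rate.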

BENCHMARK

Main results for the task agent (GPT-5.1-mini) paired with long-context memory, memory agents, and RAG agents

Task Success Rate on Bundled Web Shopping for GPT-5.1-mini with different memory paradigms.

KEY INSIGHT

The Counterintuitive Finding

MEMORYARENA shows Group Travel Planning has near-zero Task Success Rate and Process Score for all methods, despite sophisticated memory systems.

This is surprising because agents that nearly saturate LoCoMo and other long-context benchmarks were expected to transfer those gains to realistic multi-session planning, but MEMORYARENA reveals they do not.

WHY IT MATTERS

What this unlocks for the field

MEMORYARENA unlocks a way to test whether memory actually supports long-horizon decision-making, not just static recall from long contexts.

Builders can now design and compare memory systems under realistic Memory-Agent-Environment Loops, targeting belief tracking and cross-session state management that were previously invisible to standard benchmarks.
