Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Authors: Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

TL;DR

MemoryAgentBench uses incremental multi-turn datasets such as EventQA and FactConsolidation to show that a GPT-4.1-mini long-context agent reaches a 71.8 Accurate Retrieval average vs 49.2 for GPT-4o-mini.

THE PROBLEM

Memory agents lack unified evaluation across four competencies

MemoryAgentBench addresses the fact that no existing benchmark covers all four competencies, leaving memory in agents under-evaluated despite the diversity of deployed systems.

This gap means memory agents built for accurate retrieval, test-time learning, and selective forgetting are deployed with largely anecdotal evidence and unknown failure modes.

HOW IT WORKS

MemoryAgentBench: incremental multi-turn memory evaluation

MemoryAgentBench evaluates four competencies, Accurate Retrieval, Test Time Learning, Long Range Understanding, and Selective Forgetting, using reconstructed datasets such as EventQA and FactConsolidation.

Think of MemoryAgentBench as a cognitive test suite, where each dataset stresses a different part of an agent’s memory system, like separate exams for short-term and long-term recall.

This design lets MemoryAgentBench probe behaviors, such as overwriting outdated facts and integrating 100k+-token histories, that a plain context-window benchmark cannot expose.
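The competency-to-dataset structure can be sketched as a simple mapping. This is illustrative only, built from the dataset names mentioned in this summary, not the paper's full dataset table:

```python
# Hypothetical sketch of MemoryAgentBench's competency-to-dataset mapping.
# Dataset lists here are illustrative, based only on names mentioned in this
# summary; the actual benchmark reconstructs more datasets per competency.
COMPETENCIES = {
    "Accurate Retrieval": ["EventQA"],
    "Test Time Learning": ["Movie Recommendation"],
    "Long Range Understanding": ["Summarization", "Detective QA"],
    "Selective Forgetting": ["FactConsolidation (single-hop)",
                             "FactConsolidation (multi-hop)"],
}

def datasets_for(competency: str) -> list[str]:
    """Return the illustrative datasets that stress one competency."""
    return COMPETENCIES[competency]
```

Each evaluated agent sees the same chunked conversations regardless of which competency a dataset targets, so scores stay comparable across the four axes.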

DIAGRAM

Multi turn interaction flow in MemoryAgentBench

This diagram shows how MemoryAgentBench feeds chunked conversations to memory agents and then queries them for evaluation.

DIAGRAM

Evaluation pipeline and competency mapping in MemoryAgentBench

This diagram shows how MemoryAgentBench maps datasets to competencies and evaluates different agent categories.

PROCESS

How MemoryAgentBench Handles a Chunked Conversation Session

  1. Dataset Preparation

    MemoryAgentBench reconstructs datasets like EventQA and FactConsolidation into chunks c1 to cn with explicit memorization instructions to stress Accurate Retrieval and Selective Forgetting.

  2. Prompt Formulation and Interaction Protocol

    MemoryAgentBench wraps each chunk into a user–assistant dialogue, instructing agents to memorize content and, for FactConsolidation, to prioritize newer serial-numbered facts.

  3. Agents Formulation

    MemoryAgentBench feeds chunks sequentially to long-context agents, RAG agents, and agentic memory agents, requiring incremental memory updates before any question is asked.

  4. Overall Performance Comparison

    MemoryAgentBench then issues questions q1 to qm and records metrics such as accuracy, Recall@5, and F1 score across competencies for all evaluated agents.
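The four steps above can be sketched as a single session loop: all chunks are memorized first, then all questions are asked. `ToyMemoryAgent` is a hypothetical stand-in for the real evaluated systems, which have their own update and retrieval policies:

```python
# Minimal sketch of the chunked interaction protocol described above.
# ToyMemoryAgent is a hypothetical stand-in; real evaluated agents are
# long-context, RAG, or agentic-memory systems with their own update rules.
class ToyMemoryAgent:
    def __init__(self) -> None:
        self.memory: list[str] = []

    def memorize(self, chunk: str) -> None:
        # Incremental update: each chunk must be absorbed as it arrives.
        self.memory.append(chunk)

    def answer(self, question: str) -> str:
        # Naive retrieval: return the most recent chunk sharing a word
        # with the question (a placeholder for the agent's real policy).
        q_words = set(question.lower().split())
        for chunk in reversed(self.memory):
            if q_words & set(chunk.lower().split()):
                return chunk
        return ""

def run_session(agent, chunks, questions):
    """Feed chunks c1..cn first, then ask q1..qm; queries never interleave."""
    for chunk in chunks:
        agent.memorize(chunk)
    return [agent.answer(q) for q in questions]
```

The key constraint the protocol enforces is ordering: memory must be built incrementally during the conversation, not reconstructed lazily once the questions are known.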

KEY CONTRIBUTIONS

Key Contributions

  • Datasets

    MemoryAgentBench reconstructs existing long-context datasets and introduces EventQA and FactConsolidation, yielding 2071 questions with context depths from 103k to 1.44M tokens across the four competencies.

  • Framework

    MemoryAgentBench defines a unified evaluation framework that standardizes chunked conversations, memorization prompts, and interaction protocols for long-context agents, RAG agents, and agentic memory agents.

  • Empirical Study

    MemoryAgentBench benchmarks commercial systems like MIRIX and MemGPT plus RAG variants such as BM25 and HippoRAG v2, revealing persistent weaknesses in Selective Forgetting and Long Range Understanding.
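The unified framework can be pictured as one interface that all three agent categories implement. This is an illustrative sketch of that idea, not the benchmark's actual API; the stub classes and their retrieval logic are assumptions:

```python
from abc import ABC, abstractmethod

class MemoryAgent(ABC):
    """Hypothetical unified interface: the benchmark only assumes an agent
    can absorb chunks incrementally and answer questions afterwards."""

    @abstractmethod
    def memorize(self, chunk: str) -> None:
        """Incremental memory update after each conversation chunk."""

    @abstractmethod
    def answer(self, question: str) -> str:
        """Answer a query using whatever memory the agent has built."""

class LongContextStub(MemoryAgent):
    """Long-context category: keep the entire history in the prompt."""
    def __init__(self) -> None:
        self.history: list[str] = []
    def memorize(self, chunk: str) -> None:
        self.history.append(chunk)
    def answer(self, question: str) -> str:
        # A real agent would send the joined history plus the question
        # to an LLM; here we just return the assembled prompt.
        return "\n".join(self.history + [question])

class RAGStub(MemoryAgent):
    """RAG category: index chunks, retrieve top-k at question time."""
    def __init__(self, k: int = 5) -> None:
        self.chunks: list[str] = []
        self.k = k
    def memorize(self, chunk: str) -> None:
        self.chunks.append(chunk)
    def answer(self, question: str) -> str:
        # Toy lexical-overlap ranking standing in for BM25 or embeddings.
        q = set(question.lower().split())
        ranked = sorted(self.chunks,
                        key=lambda c: len(q & set(c.lower().split())),
                        reverse=True)
        return "\n".join(ranked[: self.k] + [question])
```

Because every category answers through the same two methods, the harness can swap agents without changing the chunk feed or the question set, which is what makes the cross-category comparison in Table 3 possible.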

RESULTS

By the Numbers

  • Accurate Retrieval Avg.: 71.8% (+22.6 over GPT-4o-mini long context)

  • Test Time Learning Avg.: 49.1% (vs 46.2 for GPT-4o-mini long context)

  • Long Range Understanding Avg.: 62.2% (Summarization and Detective QA combined)

  • Selective Forgetting Avg.: 22.5% (single-hop and multi-hop FactConsolidation)

These numbers come from Table 3, where MemoryAgentBench evaluates the GPT-4.1-mini long-context agent across Accurate Retrieval, Test Time Learning, Long Range Understanding, and Selective Forgetting. The 71.8 Accurate Retrieval average shows how much stronger this configuration is than the 49.2 GPT-4o-mini long-context baseline under MemoryAgentBench’s multi-turn setup.

BENCHMARK

Benchmark: Overall Performance Comparison (Table 3)

Overall Scores on MemoryAgentBench across four competencies.

KEY INSIGHT

The Counterintuitive Finding

MemoryAgentBench shows that all methods reach at most 7% accuracy on multi-hop FactConsolidation, even when their single-hop scores are much higher.

This is surprising because strong reasoning models like o4-mini reach 80% on 6k-token multi-hop inputs, so builders might expect similar robustness at longer ranges.
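The single-hop FactConsolidation behavior that agents must reproduce can be illustrated with a toy consolidation function: newer, higher serial-numbered facts should overwrite older ones. The function and its triple format are assumptions for illustration, not the benchmark's data schema:

```python
# Hypothetical single-hop FactConsolidation sketch: facts arrive with serial
# numbers, and the surviving value for each subject must come from the
# newest (highest-serial) fact, even if facts arrive out of order.
def consolidate(facts: list[tuple[int, str, str]]) -> dict[str, str]:
    """facts: (serial_number, subject, value) triples.
    Returns the surviving value per subject, keeping the highest serial."""
    latest: dict[str, tuple[int, str]] = {}
    for serial, subject, value in facts:
        if subject not in latest or serial > latest[subject][0]:
            latest[subject] = (serial, value)
    return {s: v for s, (_, v) in latest.items()}
```

Multi-hop FactConsolidation additionally requires chaining such updated facts through intermediate entities, which is where all evaluated methods collapse to single-digit accuracy.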

WHY IT MATTERS

What this unlocks for the field

MemoryAgentBench gives the community a concrete way to compare long-context agents, RAG systems, and agentic memory agents under the same multi-turn workloads.

Builders can now design memory mechanisms specifically to improve, say, Selective Forgetting on FactConsolidation or Test Time Learning on movie recommendation, and immediately see competency-specific gains.


Related papers

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025 · 2025

LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

Benchmark · Agent Memory

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang et al.

ACL 2025 · 2025

MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.

Benchmark

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang et al.

arXiv 2025 · 2025

MemoryBench orchestrates **Task Provider**, **User Simulator**, and **Performance Monitor** over 11 datasets to evaluate declarative and procedural memory in LLM systems. On MemoryBench, simple RAG baselines like BM25-S and Embed-M often match or beat advanced systems such as MemoryOS and Mem0 despite MemoryOS taking more than 17 seconds of memory time per case.