Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Authors: Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

TL;DR

MemoryAgentBench uses incremental multi-turn datasets such as EventQA and FactConsolidation to show that a GPT-4.1-mini long-context agent reaches a 71.8 Accurate Retrieval average vs 49.2 for GPT-4o-mini.

THE PROBLEM

Memory agents lack unified evaluation across four competencies

MemoryAgentBench addresses the fact that no existing benchmark covers all four competencies, leaving memory in agents under-evaluated despite the diversity of deployed systems.

This gap means memory agents built for accurate retrieval, test-time learning, and selective forgetting are deployed with largely anecdotal evidence and unknown failure modes.

HOW IT WORKS

MemoryAgentBench: incremental multi-turn memory evaluation

MemoryAgentBench evaluates four competencies, Accurate Retrieval, Test Time Learning, Long Range Understanding, and Selective Forgetting, using reconstructed datasets such as EventQA and FactConsolidation.

Think of MemoryAgentBench as a cognitive test suite, where each dataset stresses a different part of an agent’s memory system, like separate exams for short-term and long-term recall.

This design lets MemoryAgentBench probe behaviors, such as overwriting outdated facts and integrating 100k+-token histories, that a plain context-window benchmark cannot expose.
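The competency-to-dataset structure can be sketched as a simple mapping. This is illustrative only, built from the dataset names mentioned in this summary, not the paper's full dataset table:

```python
# Hypothetical sketch of MemoryAgentBench's competency-to-dataset mapping.
# Dataset lists here are illustrative, based only on names mentioned in this
# summary; the actual benchmark reconstructs more datasets per competency.
COMPETENCIES = {
    "Accurate Retrieval": ["EventQA"],
    "Test Time Learning": ["Movie Recommendation"],
    "Long Range Understanding": ["Summarization", "Detective QA"],
    "Selective Forgetting": ["FactConsolidation (single-hop)",
                             "FactConsolidation (multi-hop)"],
}

def datasets_for(competency: str) -> list[str]:
    """Return the illustrative datasets that stress one competency."""
    return COMPETENCIES[competency]
```

Each evaluated agent sees the same chunked conversations regardless of which competency a dataset targets, so scores stay comparable across the four axes.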

DIAGRAM

Multi turn interaction flow in MemoryAgentBench

This diagram shows how MemoryAgentBench feeds chunked conversations to memory agents and then queries them for evaluation.

DIAGRAM

Evaluation pipeline and competency mapping in MemoryAgentBench

This diagram shows how MemoryAgentBench maps datasets to competencies and evaluates different agent categories.

PROCESS

How MemoryAgentBench Handles a Chunked Conversation Session

  1. Dataset Preparation

    MemoryAgentBench reconstructs datasets like EventQA and FactConsolidation into chunks c1 to cn with explicit memorization instructions to stress Accurate Retrieval and Selective Forgetting.

  2. Prompt Formulation and Interaction Protocol

    MemoryAgentBench wraps each chunk into a user–assistant dialogue, instructing agents to memorize content and, for FactConsolidation, to prioritize newer serial-numbered facts.

  3. Agents Formulation

    MemoryAgentBench feeds chunks sequentially to long-context agents, RAG agents, and agentic memory agents, requiring incremental memory updates before any question is asked.

  4. Overall Performance Comparison

    MemoryAgentBench then issues questions q1 to qm and records metrics such as accuracy, Recall@5, and F1 score across competencies for all evaluated agents.
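The four steps above can be sketched as a single session loop: all chunks are memorized first, then all questions are asked. `ToyMemoryAgent` is a hypothetical stand-in for the real evaluated systems, which have their own update and retrieval policies:

```python
# Minimal sketch of the chunked interaction protocol described above.
# ToyMemoryAgent is a hypothetical stand-in; real evaluated agents are
# long-context, RAG, or agentic-memory systems with their own update rules.
class ToyMemoryAgent:
    def __init__(self) -> None:
        self.memory: list[str] = []

    def memorize(self, chunk: str) -> None:
        # Incremental update: each chunk must be absorbed as it arrives.
        self.memory.append(chunk)

    def answer(self, question: str) -> str:
        # Naive retrieval: return the most recent chunk sharing a word
        # with the question (a placeholder for the agent's real policy).
        q_words = set(question.lower().split())
        for chunk in reversed(self.memory):
            if q_words & set(chunk.lower().split()):
                return chunk
        return ""

def run_session(agent, chunks, questions):
    """Feed chunks c1..cn first, then ask q1..qm; queries never interleave."""
    for chunk in chunks:
        agent.memorize(chunk)
    return [agent.answer(q) for q in questions]
```

The key constraint the protocol enforces is ordering: memory must be built incrementally during the conversation, not reconstructed lazily once the questions are known.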

KEY CONTRIBUTIONS

Key Contributions

  • Datasets

    MemoryAgentBench reconstructs existing long-context datasets and introduces EventQA and FactConsolidation, yielding 2071 questions with context depths from 103k to 1.44M tokens across the four competencies.

  • Framework

    MemoryAgentBench defines a unified evaluation framework that standardizes chunked conversations, memorization prompts, and interaction protocols for long-context agents, RAG agents, and agentic memory agents.

  • Empirical Study

    MemoryAgentBench benchmarks commercial systems like MIRIX and MemGPT plus RAG variants such as BM25 and HippoRAG v2, revealing persistent weaknesses in Selective Forgetting and Long Range Understanding.
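The unified framework can be pictured as one interface that all three agent categories implement. This is an illustrative sketch of that idea, not the benchmark's actual API; the stub classes and their retrieval logic are assumptions:

```python
from abc import ABC, abstractmethod

class MemoryAgent(ABC):
    """Hypothetical unified interface: the benchmark only assumes an agent
    can absorb chunks incrementally and answer questions afterwards."""

    @abstractmethod
    def memorize(self, chunk: str) -> None:
        """Incremental memory update after each conversation chunk."""

    @abstractmethod
    def answer(self, question: str) -> str:
        """Answer a query using whatever memory the agent has built."""

class LongContextStub(MemoryAgent):
    """Long-context category: keep the entire history in the prompt."""
    def __init__(self) -> None:
        self.history: list[str] = []
    def memorize(self, chunk: str) -> None:
        self.history.append(chunk)
    def answer(self, question: str) -> str:
        # A real agent would send the joined history plus the question
        # to an LLM; here we just return the assembled prompt.
        return "\n".join(self.history + [question])

class RAGStub(MemoryAgent):
    """RAG category: index chunks, retrieve top-k at question time."""
    def __init__(self, k: int = 5) -> None:
        self.chunks: list[str] = []
        self.k = k
    def memorize(self, chunk: str) -> None:
        self.chunks.append(chunk)
    def answer(self, question: str) -> str:
        # Toy lexical-overlap ranking standing in for BM25 or embeddings.
        q = set(question.lower().split())
        ranked = sorted(self.chunks,
                        key=lambda c: len(q & set(c.lower().split())),
                        reverse=True)
        return "\n".join(ranked[: self.k] + [question])
```

Because every category answers through the same two methods, the harness can swap agents without changing the chunk feed or the question set, which is what makes the cross-category comparison in Table 3 possible.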

RESULTS

By the Numbers

  • Accurate Retrieval Avg.: 71.8% (+22.6 over GPT-4o-mini long context)

  • Test Time Learning Avg.: 49.1% (vs 46.2 for GPT-4o-mini long context)

  • Long Range Understanding Avg.: 62.2% (Summarization and Detective QA combined)

  • Selective Forgetting Avg.: 22.5% (single-hop and multi-hop FactConsolidation)

These numbers come from Table 3, where MemoryAgentBench evaluates the GPT-4.1-mini long-context agent across Accurate Retrieval, Test Time Learning, Long Range Understanding, and Selective Forgetting. The 71.8 Accurate Retrieval average shows how much stronger this configuration is than the 49.2 GPT-4o-mini long-context baseline under MemoryAgentBench’s multi-turn setup.

BENCHMARK

Benchmark: Overall Performance Comparison (Table 3)

Overall Scores on MemoryAgentBench across four competencies.

KEY INSIGHT

The Counterintuitive Finding

MemoryAgentBench shows that all methods reach at most 7% accuracy on multi-hop FactConsolidation, even when their single-hop scores are much higher.

This is surprising because strong reasoning models like o4-mini reach 80% on 6k-token multi-hop inputs, so builders might expect similar robustness at longer ranges.
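The single-hop FactConsolidation behavior that agents must reproduce can be illustrated with a toy consolidation function: newer, higher serial-numbered facts should overwrite older ones. The function and its triple format are assumptions for illustration, not the benchmark's data schema:

```python
# Hypothetical single-hop FactConsolidation sketch: facts arrive with serial
# numbers, and the surviving value for each subject must come from the
# newest (highest-serial) fact, even if facts arrive out of order.
def consolidate(facts: list[tuple[int, str, str]]) -> dict[str, str]:
    """facts: (serial_number, subject, value) triples.
    Returns the surviving value per subject, keeping the highest serial."""
    latest: dict[str, tuple[int, str]] = {}
    for serial, subject, value in facts:
        if subject not in latest or serial > latest[subject][0]:
            latest[subject] = (serial, value)
    return {s: v for s, (_, v) in latest.items()}
```

Multi-hop FactConsolidation additionally requires chaining such updated facts through intermediate entities, which is where all evaluated methods collapse to single-digit accuracy.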

WHY IT MATTERS

What this unlocks for the field

MemoryAgentBench gives the community a concrete way to compare long-context agents, RAG systems, and agentic memory agents under the same multi-turn workloads.

Builders can now design memory mechanisms specifically to improve, say, Selective Forgetting on FactConsolidation or Test Time Learning on movie recommendation, and immediately see competency-specific gains.


Related papers

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025 · 2025

LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

Benchmark · Agent Memory

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang et al.

ACL 2025 · 2025

MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.

Benchmark

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang et al.

arXiv 2025 · 2025

MemoryBench orchestrates **Task Provider**, **User Simulator**, and **Performance Monitor** over 11 datasets to evaluate declarative and procedural memory in LLM systems. On MemoryBench, simple RAG baselines like BM25-S and Embed-M often match or beat advanced systems such as MemoryOS and Mem0 despite MemoryOS taking more than 17 seconds of memory time per case.