MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Authors: Haoran Tan, Zeyu Zhang, Chen Ma et al.

ACL 2025

TL;DR

MemBench uses multi-scenario, multi-level memory tasks plus accuracy–recall–capacity–efficiency metrics to reveal, for example, RetrievalMemory reaching 0.933 accuracy on 100k-token factual observation tests.

THE PROBLEM

Memory benchmarks miss reflective memory and observation scenarios

MemBench notes that previous evaluations are "commonly limited by the diversity of memory levels and interactive scenarios" and "lack comprehensive metrics".

These gaps mean LLM-based agents are judged mostly on factual memory in participation scenarios, ignoring reflective memory, observation-only usage, and efficiency or capacity constraints.

HOW IT WORKS

MemBench — multi-scenario, multi-level memory benchmark

MemBench builds on User Relation Graph Sampling, Memory Dataset Construction, Multi-scenario Memory, and Multi-level Memory to stress-test agent memory mechanisms.

You can think of MemBench as a lab setup where user profiles and events are like a database, and factual or reflective questions probe the agent’s memory circuitry.

This design lets MemBench expose failures in long-term reasoning, knowledge updating, and preference summarization that a plain context window or single-metric benchmark cannot reveal.
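To make the factual-versus-reflective distinction concrete, here is a minimal sketch of probing a stored-session memory, under assumptions: `MemoryStore`, `write`, and `retrieve` are illustrative names, not MemBench's actual API, and keyword overlap stands in for a real retriever.

```python
# Hypothetical sketch: probing an agent memory with factual vs. reflective
# questions. Names and scoring are illustrative, not MemBench's released code.

class MemoryStore:
    """Toy keyword-overlap memory over stored session utterances."""
    def __init__(self):
        self.entries = []

    def write(self, text):
        self.entries.append(text)

    def retrieve(self, query, k=2):
        # Rank stored utterances by word overlap with the query.
        scored = sorted(
            self.entries,
            key=lambda e: len(set(e.lower().split()) & set(query.lower().split())),
            reverse=True,
        )
        return scored[:k]

store = MemoryStore()
store.write("User said their favorite city is Kyoto.")
store.write("User mentioned they switched jobs to a biotech startup.")
store.write("User prefers short answers in the morning.")

# Factual probe: a single stored observation answers it directly.
factual = store.retrieve("What is the user's favorite city?")

# Reflective probe: needs summarization across entries, not one lookup,
# which is exactly where single-retrieval mechanisms can fall short.
reflective = store.retrieve("Summarize the user's communication preferences.")
```

A factual question resolves to one entry; a reflective one only returns fragments that still need to be synthesized, which is the failure mode the multi-level design targets.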

DIAGRAM

MemBench memory taxonomy across scenarios and levels

This diagram shows how MemBench organizes memory types into factual versus reflective, and participation versus observation scenarios.

DIAGRAM

MemBench evaluation pipeline for memory mechanisms

This diagram shows how MemBench feeds sessions and questions through memory mechanisms to compute accuracy, recall, capacity, and temporal efficiency.

PROCESS

How MemBench Handles a Memory Evaluation Session

  1. User Relation Graph Sampling

    MemBench samples user profiles and related entities via User Relation Graph Sampling to construct rich factual and reflective attributes for memory tests.

  2. Memory Dataset Construction

    MemBench runs Memory Dataset Construction to generate evidence dialogues, message lists, and time-based sessions for both participation and observation scenarios.

  3. Multi-scenario Memory Simulation

    MemBench simulates Multi-scenario Memory by feeding user–agent dialogues or user-only message streams into chosen memory mechanisms over time.

  4. Multi-metric Evaluation

    MemBench applies Multi-metric Evaluation to compute memory accuracy, recall, capacity, and temporal efficiency from the mechanisms’ answers and retrieval traces.
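The four-step flow above can be sketched end to end in a few lines. This is a hedged toy version, assuming simplified shapes: flat attribute dicts stand in for the user relation graph, and a trivial store-and-grep mechanism stands in for a real memory design.

```python
# Illustrative end-to-end sketch of the four-step pipeline (assumed data
# shapes, not MemBench's released code).
import random

random.seed(0)

# 1. User relation graph sampling (flat dicts stand in for the graph).
profiles = [{"user": f"u{i}", "hobby": random.choice(["chess", "running"])}
            for i in range(3)]

# 2. Memory dataset construction: one evidence session + QA pair per profile.
dataset = [
    {"session": f"{p['user']} talked about {p['hobby']}.",
     "question": f"What hobby does {p['user']} have?",
     "answer": p["hobby"]}
    for p in profiles
]

# 3. Multi-scenario simulation: a trivial mechanism that stores sessions
# and scans them for the user mentioned in the question.
memory = []
def mechanism_answer(user):
    for s in memory:
        if user in s:
            return "chess" if "chess" in s else "running"
    return "unknown"

# 4. Multi-metric evaluation: accuracy over the mechanism's answers.
correct = 0
for item in dataset:
    memory.append(item["session"])
    user = item["question"].split()[-2]  # "... does u0 have?" -> "u0"
    if mechanism_answer(user) == item["answer"]:
        correct += 1
accuracy = correct / len(dataset)
```

Swapping out `mechanism_answer` for different memory designs while holding steps 1, 2, and 4 fixed is the comparison the benchmark enables.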

KEY CONTRIBUTIONS

Key Contributions

  • Multi-scenario Dataset

    MemBench introduces a Multi-scenario Memory dataset with both participation and observation scenarios, totaling 51k participation factual sessions and 8.5k observation factual sessions.

  • Multi-level Memory Content

    MemBench’s Multi-level Memory design covers factual memory and reflective memory, enabling tasks like cross-session reasoning, temporal reasoning, and reflective summarization.

  • Multi-metric Benchmark

    MemBench defines a Multi-metric Evaluation benchmark with accuracy, recall, capacity, and temporal efficiency, and evaluates seven memory mechanisms including RetrievalMemory and MemGPT.
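The four metric families named above have standard forms, sketched below. These are the textbook definitions (exact-match accuracy, set-based recall@k, an accuracy-vs-length capacity curve, wall-clock timing); MemBench's precise formulations may differ in detail.

```python
# Hedged sketch of the four metric families: standard definitions,
# not necessarily MemBench's exact formulas.
import time

def accuracy(preds, golds):
    # Fraction of exact-match answers.
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def recall_at_k(retrieved, relevant, k=10):
    # Fraction of relevant evidence items found in the top-k retrievals.
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def capacity_curve(acc_by_length):
    # Accuracy as a function of context length,
    # e.g. {1_000: 0.95, 100_000: 0.83}; degradation reveals capacity limits.
    return sorted(acc_by_length.items())

def timed(fn, *args):
    # Temporal efficiency: wall-clock seconds for one memory read/write call.
    start = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - start
```

Reporting all four jointly is what separates a mechanism that is accurate but slow (the read-time gap in the results below) from one that is fast but forgetful.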

RESULTS

By the Numbers

  • Participation accuracy (factual, 100k tokens): 0.833 (+0.344 over FullMemory)

  • Observation accuracy (factual, 100k tokens): 0.933 (+0.302 over FullMemory)

  • Participation read time (factual): 0.041 s for RetrievalMemory vs 4.549 s for MemGPT

  • Observation Recall@10 (factual, 100k tokens): 0.769 for RetrievalMemory

On MemBench’s factual memory Sub-dataset 2 with 100k-token sessions, RetrievalMemory reaches 0.833 participation accuracy and 0.933 observation accuracy. These results show MemBench can separate robust retrieval-based mechanisms from window-limited designs like FullMemory at scale.


BENCHMARK

Factual Memory Observation Accuracy on Sub-dataset 2

Accuracy on the factual observation scenario with 100k-token message lists.

KEY INSIGHT

The Counterintuitive Finding

MemBench shows RetrievalMemory improves factual observation accuracy at 100k tokens from 0.631 (FullMemory) to 0.933, despite extra retrieval overhead.

This is surprising because many expect full-context feeding to be strongest, yet MemBench reveals that window-based FullMemory degrades far more than retrieval-based designs.

WHY IT MATTERS

What this unlocks for the field

MemBench unlocks precise, multi-angle diagnosis of agent memory via controlled scenarios, reflective tasks, and explicit capacity curves over long interactions.

Builders can now compare memory mechanisms like MemGPT and GenerativeAgent under identical 100k-token conditions, tuning designs for accuracy, recall, and latency instead of guessing from task performance alone.


Related papers

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025 · 2025

LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.

Benchmark

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang et al.

arXiv 2025 · 2025

MemoryBench orchestrates **Task Provider**, **User Simulator**, and **Performance Monitor** over 11 datasets to evaluate declarative and procedural memory in LLM systems. On MemoryBench, simple RAG baselines like BM25-S and Embed-M often match or beat advanced systems such as MemoryOS and Mem0 despite MemoryOS taking more than 17 seconds of memory time per case.