Benchmark · Long-Term Memory
Mohammad Tavakoli, Alireza Salemi et al.
arXiv · 2025
LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.
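The three LIGHT components can be pictured with a toy sketch. The class below is an illustrative assumption, not the paper's implementation: retrieval is plain word overlap, the scratchpad keeps fact-like turns, and the working memory is a fixed-size buffer with a trivial noise filter.

```python
from collections import deque


class LightStyleMemory:
    """Toy sketch of LIGHT's components (names and logic are assumptions)."""

    def __init__(self, working_size=5):
        self.history = []                          # full conversation turns
        self.scratchpad = []                       # distilled notes
        self.working = deque(maxlen=working_size)  # recent-turn buffer

    def observe(self, turn: str):
        self.history.append(turn)
        self.working.append(turn)
        # Scratchpad formation: keep turns that look like stable facts.
        if " is " in turn or " was " in turn:
            self.scratchpad.append(turn)

    def retrieve(self, query: str, k: int = 2):
        # Retrieval from the conversation: rank turns by word overlap.
        q = set(query.lower().split())
        scored = sorted(self.history,
                        key=lambda t: len(q & set(t.lower().split())),
                        reverse=True)
        return scored[:k]

    def context_for(self, query: str):
        # Noise filtering: drop retrieved turns with zero query overlap.
        q = set(query.lower().split())
        hits = [t for t in self.retrieve(query)
                if q & set(t.lower().split())]
        return hits + self.scratchpad + list(self.working)


mem = LightStyleMemory(working_size=2)
mem.observe("My dog's name is Rex.")
mem.observe("We talked about the weather.")
mem.observe("I moved to Berlin last year.")
print(mem.context_for("What is my dog's name?")[0])  # → "My dog's name is Rex."
```

A real system would replace the overlap retriever with embeddings and the fact heuristic with an LLM call; the point is only how the three buffers combine into the answer context.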
Benchmark · Agent Memory
Yuanzhe Hu, Yu Wang, Julian McAuley
ICLR 2026 · 2025
MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. In Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval, compared to 49.2 for the GPT-4o-mini long-context baseline.
Benchmark · Agent Memory
Haoran Tan, Zeyu Zhang et al.
ACL 2025 · 2025
MemBench evaluates LLM-based agents through a pipeline of **User Relation Graph** construction, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory**, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, **RetrievalMemory** reaches 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.
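The RetrievalMemory vs. FullMemory contrast comes down to what enters the context window. A minimal sketch, with word-overlap scoring standing in for a real retriever and a naive truncation policy standing in for FullMemory's context handling (both are assumptions):

```python
def full_memory_context(observations, token_budget):
    """FullMemory-style baseline: concatenate everything, truncate to budget
    (truncation policy is a hypothetical stand-in)."""
    return " ".join(observations).split()[:token_budget]


def retrieval_memory_context(observations, query, k=2):
    """RetrievalMemory-style baseline: keep only the k most query-relevant
    observations (overlap scoring stands in for a real retriever)."""
    q = set(query.lower().split())
    return sorted(observations,
                  key=lambda o: len(q & set(o.lower().split())),
                  reverse=True)[:k]


obs = ["fact %d: filler text" % i for i in range(1000)] + \
      ["fact x: the password is swordfish"]
print(len(full_memory_context(obs, 50)))  # budget-capped at 50 tokens
print(retrieval_memory_context(obs, "what is the password", k=1)[0])
```

With 100k-token histories, the FullMemory path either truncates away the relevant fact or pays full-context cost, which is consistent with the accuracy gap MemBench reports.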
Benchmark
Qingyao Ai, Yichen Tang et al.
arXiv · 2025
MemoryBench orchestrates **Task Provider**, **User Simulator**, and **Performance Monitor** over 11 datasets to evaluate declarative and procedural memory in LLM systems. On MemoryBench, simple RAG baselines like BM25-S and Embed-M often match or beat advanced systems such as MemoryOS and Mem0 despite MemoryOS taking more than 17 seconds of memory time per case.
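To make the strength of the simple baselines concrete, here is a minimal from-scratch BM25 ranker in the spirit of a BM25-S-style baseline; this is a sketch, not MemoryBench's code, and the documents and parameters are illustrative.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs against query with standard BM25 scoring."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter()                       # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log((n - df[w] + 0.5) / (df[w] + 0.5) + 1)
            s += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(toks) / avgdl))
        scores.append(s)
    return sorted(range(n), key=scores.__getitem__, reverse=True)


docs = ["the user likes hiking in the alps",
        "the meeting was rescheduled to friday",
        "hiking boots were ordered last week"]
print(bm25_rank("when is the meeting", docs)[0])  # → 1
```

That a few dozen lines of lexical scoring can rival systems with dedicated memory managers is precisely the finding the benchmark highlights.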
Benchmark · Long-Term Memory
Di Wu, Hongwei Wang et al.
ICLR 2025 · 2024
LongMemEval evaluates long-term interactive memory by running chat assistants through **indexing**, **retrieval**, and **reading** over 50k sessions with fact-augmented keys and time-aware query expansion. On LongMemEval_S, long-context LLMs such as GPT-4o, Llama 3.1, and Phi-3 suffer 30%–60% accuracy drops compared to oracle evidence-only reading, revealing severe limitations in current long-context designs.
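The indexing and retrieval stages above can be sketched as follows. Everything here is a toy stand-in: fact extraction is a stubbed dictionary where LongMemEval would use an LLM, and retrieval scores fact keys by word overlap rather than embeddings.

```python
def index_sessions(sessions, extract_facts):
    """Indexing with fact-augmented keys: each extracted fact maps back
    to its source session id (a sketch; extraction here is a stub)."""
    index = []
    for sid, session in enumerate(sessions):
        for fact in extract_facts(sid):
            index.append((fact, sid))
    return index


def retrieve(index, query, k=1):
    """Retrieval: score fact keys against the query by word overlap."""
    q = set(query.lower().split())
    ranked = sorted(index,
                    key=lambda kv: len(q & set(kv[0].lower().split())),
                    reverse=True)
    return [sid for _, sid in ranked[:k]]


# Stubbed fact extraction; a real pipeline would derive these with an LLM.
facts = {0: ["user adopted a cat in March"],
         1: ["user works at a bakery"]}
sessions = ["...session 0 text...", "...session 1 text..."]
index = index_sessions(sessions, lambda sid: facts[sid])
print(retrieve(index, "where does the user work at")[0])  # → 1
```

Matching queries against extracted facts rather than raw session text is the key design idea: a short fact key overlaps the question far more reliably than a long, noisy transcript does. The **reading** stage would then hand the retrieved sessions to the LLM to produce the answer.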