Agent Memory · Long-Term Memory
Yi Yu, Liuyi Yao et al.
arXiv 2026 · 2026
Agentic Memory (AgeMem) exposes **memory management tools**, a **three-stage progressive RL strategy**, and **step-wise GRPO** directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
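Since AgeMem's tools are not public API here, this is a speculative sketch of what "memory management tools inside the agent policy" could look like: explicit tool calls for long-term storage and retrieval alongside a bounded short-term buffer. All names (`MemoryTools`, `add_memory`, the window size) are illustrative, not AgeMem's actual interface.

```python
# Hypothetical memory-tool interface an agent policy could invoke as actions.
# Keyword retrieval stands in for whatever learned retrieval AgeMem uses.
from dataclasses import dataclass, field

@dataclass
class MemoryTools:
    long_term: dict = field(default_factory=dict)   # key -> persisted fact
    short_term: list = field(default_factory=list)  # rolling recent turns
    window: int = 4                                 # short-term capacity

    def add_memory(self, key: str, fact: str) -> str:
        """Tool: persist a fact to long-term memory."""
        self.long_term[key] = fact
        return f"stored {key}"

    def retrieve_memory(self, query: str) -> list:
        """Tool: naive substring retrieval over long-term memory."""
        return [f for f in self.long_term.values() if query.lower() in f.lower()]

    def observe(self, turn: str) -> None:
        """Short-term memory: keep only the most recent `window` turns."""
        self.short_term.append(turn)
        self.short_term = self.short_term[-self.window:]
```

The point of exposing these as tools is that an RL objective (such as the step-wise GRPO mentioned above) can then reward the policy for when it chooses to store, retrieve, or drop information.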
Survey · Agent Memory
Dongming Jiang, Yi Li et al.
arXiv 2026 · 2026
Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structures and empirically compares memory systems such as AMem, MemoryOS, Nemori, MAGMA, and SimpleMem under benchmark saturation, metric validity, backbone sensitivity, and system cost. On the LoCoMo benchmark, Anatomy of Agentic Memory shows Nemori reaches 0.502 F1 while AMem drops to 0.116, and MAGMA achieves the top semantic-judge score of 0.670 under the MAGMA rubric.
Memory Architecture · Survey
Zhongming Yu, Naicheng Yu et al.
arXiv 2026 · 2026
Multi-Agent Memory Architecture organizes an **Agent IO Layer**, **Agent Cache Layer**, and **Agent Memory Layer**, plus **Agent Cache Sharing** and **Agent Memory Access** protocols, into a unified architectural framing for multi-agent systems. As a position paper, it reports no benchmark results or numeric comparisons against baselines.
Benchmark · Long-Term Memory
Mohammad Tavakoli, Alireza Salemi et al.
arXiv 2025 · 2025
LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.
Benchmark · Agent Memory
Yuanzhe Hu, Yu Wang, Julian McAuley
ICLR 2026 · 2025
MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.
Survey
Yaxiong Wu, Sheng Liang et al.
arXiv 2025 · 2025
From Human Memory to AI Memory organizes LLM memory using the **3D-8Q Memory Taxonomy**, mapping human memory categories to personal and system memory across object, form, and time. From Human Memory to AI Memory reports no new benchmarks but consolidates systems like MemoryBank, HippoRAG, and MemoRAG into a single conceptual framework.
RAG · Memory Architecture · Long-Term Memory
Bernal Jiménez Gutiérrez, Yiheng Shu et al.
ICML 2025 · 2025
HippoRAG 2 combines **Offline Indexing**, a schema-less **Knowledge Graph**, **Dense-Sparse Integration**, **Deeper Contextualization**, and **Recognition Memory** into a neuro-inspired non-parametric memory system for LLMs. On the joint RAG benchmark suite, HippoRAG 2 achieves 59.8 average F1 versus 57.0 for NV-Embed-v2, including 71.0 F1 on 2Wiki compared to 61.5 for NV-Embed-v2.
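The HippoRAG line of work ranks memory by spreading activation from query-matched entities over its knowledge graph via Personalized PageRank. Below is a minimal power-iteration sketch over a toy undirected entity graph; the graph, seed choice, and damping value are illustrative, not HippoRAG 2's actual index.

```python
# Personalized PageRank by power iteration: mass teleports back to the
# query-seeded entities, so graph neighborhoods of the query score highest.
def personalized_pagerank(edges, seeds, damping=0.85, iters=50):
    nodes = sorted({n for e in edges for n in e})
    nbrs = {n: [] for n in nodes}
    for a, b in edges:          # undirected toy graph
        nbrs[a].append(b)
        nbrs[b].append(a)
    pref = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(pref)
    for _ in range(iters):
        rank = {
            n: (1 - damping) * pref[n]
               + damping * sum(rank[m] / len(nbrs[m]) for m in nbrs[n])
            for n in nodes
        }
    return rank

edges = [("Curie", "radium"), ("radium", "chemistry"), ("Einstein", "physics")]
scores = personalized_pagerank(edges, seeds={"Curie"})
```

Seeding at "Curie" concentrates rank mass on its connected component, which is exactly the behavior a non-parametric associative memory wants: facts linked to the query surface first.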
Agent Memory · Memory Architecture
B.Y. Yan, Chaofan Li et al.
arXiv 2025 · 2025
General Agentic Memory (GAM) combines a **Memorizer**, a **Researcher**, a **page-store**, and a compact **memory** to keep full trajectories while constructing lightweight guidance for deep research. On RULER 128K retrieval, GAM achieves 97.70% accuracy compared to 94.25% for RAG using GPT-4o-mini, while also reaching 64.07 F1 on HotpotQA-56K.
Long-Term Memory · Agent Memory
Zhen Tan, Jun Yan et al.
ACL 2025 · 2025
Reflective Memory Management (RMM) uses a **memory bank**, **retriever**, **reranker**, and **LLM** to implement Prospective Reflection and Retrospective Reflection for topic-based storage and RL-based retrieval refinement. On LongMemEval, RMM with GTE achieves 69.8% Recall@5 and 70.4% accuracy, compared to 62.4% Recall@5 and 63.6% accuracy for GTE RAG.
Agent Memory · Long-Term Memory · Memory Architecture
Prateek Chhikara, Dev Khant et al.
arXiv 2025 · 2025
Mem0 incrementally processes conversations using the **extraction phase**, **update phase**, **asynchronous summary generation module**, **tool call mechanism**, and a **vector database** to build scalable long-term memory. On the LOCOMO benchmark, Mem0 attains a J score of 67.13 on single-hop questions versus 63.79 for OpenAI and cuts p95 latency from 17.117s to 1.440s compared to the full-context baseline.
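In Mem0's extract-then-update flow, the update phase resolves each candidate fact against existing memory by choosing an operation (add, update, delete, or no-op). The sketch below stubs both the LLM extractor and the vector store with a plain dict keyed by fact name; it demonstrates the operation-resolution logic only, not Mem0's implementation.

```python
# Illustrative Mem0-style update phase: resolve one candidate fact against
# the store and report which operation was taken. A fact_value of None
# models an extractor signal that the fact should be forgotten.
from typing import Optional

def update_memory(store: dict, fact_key: str, fact_value: Optional[str]) -> str:
    if fact_value is None:
        if fact_key in store:
            del store[fact_key]
            return "DELETE"
        return "NOOP"
    if fact_key not in store:
        store[fact_key] = fact_value
        return "ADD"
    if store[fact_key] != fact_value:
        store[fact_key] = fact_value
        return "UPDATE"
    return "NOOP"
```

Keeping the store incremental like this, rather than reprocessing the full conversation, is what drives the large p95 latency gap over the full-context baseline quoted above.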
Benchmark · Agent Memory
Haoran Tan, Zeyu Zhang et al.
ACL 2025 · 2025
MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.
Survey · Memory Architecture
Parsa Omidi, Xingshuai Huang et al.
arXiv 2025 · 2025
Memory-Augmented Transformers organizes **functional objectives**, **memory types**, and **integration techniques** into a three-axis taxonomy, grounded in biological systems like sensory, working, and long-term memory. The survey synthesizes dozens of architectures to highlight emerging mechanisms such as hierarchical buffering and surprise-gated updates that move beyond static KV caches.
Benchmark
Qingyao Ai, Yichen Tang et al.
arXiv 2025 · 2025
MemoryBench orchestrates **Task Provider**, **User Simulator**, and **Performance Monitor** over 11 datasets to evaluate declarative and procedural memory in LLM systems. On MemoryBench, simple RAG baselines like BM25-S and Embed-M often match or beat advanced systems such as MemoryOS and Mem0 despite MemoryOS taking more than 17 seconds of memory time per case.
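The BM25-S baseline named above is classic BM25 lexical retrieval; the scorer below is a compact reimplementation over whitespace-tokenized documents with the usual `k1`/`b` defaults, not MemoryBench's actual code.

```python
# Okapi BM25: term-frequency saturation (k1) plus document-length
# normalization (b), summed over query terms.
import math

def bm25_scores(docs, query, k1=1.5, b=0.75):
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    n = len(docs)
    scores = [0.0] * n
    for term in query.lower().split():
        df = sum(term in t for t in toks)     # document frequency
        if df == 0:
            continue
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        for i, t in enumerate(toks):
            tf = t.count(term)
            norm = tf + k1 * (1 - b + b * len(t) / avgdl)
            scores[i] += idf * tf * (k1 + 1) / norm
    return scores
```

That a scorer this simple can rival engineered memory systems on MemoryBench is the paper's cautionary point about system cost versus benefit.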
Memory Architecture · Agent Memory
Zhiyu Li, Chenyang Xi et al.
arXiv 2025 · 2025
MemOS organizes memory via **MemReader**, **MemScheduler**, **MemLifecycle**, **MemOperator**, and **MemGovernance**, all operating over MemCube units that unify plaintext, activation, and parameter memories under OS-style control. On PreFEval, PersonaMem, LongMemEval, and LoCoMo, MemOS-1031 ranks first across all metrics compared to MIRIX, Mem0, Zep, Memobase, MemU, and Supermemory.
Memory Architecture
Ali Behrouz, Peilin Zhong, Vahab Mirrokni
arXiv 2025 · 2025
Titans combines a **Neural Memory Module**, a **Core** short-term attention block, and **Persistent Memory** into three variants (Memory as a Context, Memory as a Gate, Memory as a Layer) that learn to memorize at test time. On LAMBADA, Titans (MAC) reaches 39.62% accuracy at 760M parameters, compared to 37.06% for DeltaNet and 39.72% for Samba, while also improving long-context NIAH performance.
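"Learning to memorize at test time" can be illustrated with a linear associative map updated by online gradient descent on the reconstruction error of each key-value pair, so high-error ("surprising") pairs change the memory most. This is a toy in the spirit of Titans' neural memory; the real module is a neural network with momentum and forgetting terms, and the dimensions and learning rate here are arbitrary.

```python
# Toy test-time memorization: gradient step on ||k @ M - v||^2 per pair.
import numpy as np

def memorize(M, k, v, lr=0.5):
    err = k @ M - v                    # surprise: prediction error for (k, v)
    return M - lr * np.outer(k, err)   # descend the associative loss

M = np.zeros((4, 4))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 0.0])
for _ in range(20):                    # repeated exposure drives error to ~0
    M = memorize(M, k, v)
```

After the updates, querying the memory with `k` approximately reproduces `v`, which is the associative-recall behavior the module is trained for.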
Benchmark · Long-Term Memory
Di Wu, Hongwei Wang et al.
ICLR 2025 · 2024
LongMemEval evaluates long-term interactive memory by running chat assistants through **indexing**, **retrieval**, and **reading** over 50k sessions with fact-augmented keys and time-aware query expansion. On LongMemEval_S, long-context LLMs like GPT-4o, Llama 3.1, and Phi-3 suffer 30%–60% accuracy drops compared to oracle evidence-only reading, revealing severe limitations in current long-context designs.
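A hedged sketch of the indexing → retrieval → reading split described above: sessions are indexed with timestamps, retrieval applies a time constraint alongside lexical matching (a stand-in for time-aware query expansion), and the reading step, which LongMemEval hands to an LLM, is omitted. The record fields and data are invented for illustration.

```python
# Toy time-aware retrieval over timestamped session records.
from datetime import date

def retrieve(index, keyword, on_or_after=None):
    """Return matching sessions, optionally restricted to a time window."""
    hits = [s for s in index if keyword.lower() in s["text"].lower()]
    if on_or_after is not None:
        hits = [s for s in hits if s["date"] >= on_or_after]
    return sorted(hits, key=lambda s: s["date"])

index = [
    {"date": date(2024, 1, 5), "text": "User adopted a cat named Miso."},
    {"date": date(2024, 3, 2), "text": "User said Miso is at the vet."},
]
```

Temporal filtering like this matters because many LongMemEval questions hinge on *when* something was said, not just whether it appears anywhere in the history.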