Agent Memory · Long-Term Memory
Yi Yu, Liuyi Yao et al.
arXiv 2026 · 2026
Agentic Memory (AgeMem) exposes **memory management tools**, a **three-stage progressive RL strategy**, and **step-wise GRPO** directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
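The memory management tools above can be pictured as agent-callable actions over a long-term store plus a bounded short-term buffer. This is a minimal sketch of that tool interface; the tool names, signatures, and eviction rule are illustrative assumptions, not AgeMem's actual API, and the RL training loop is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryToolbox:
    long_term: list = field(default_factory=list)   # persistent entries
    short_term: list = field(default_factory=list)  # working context
    st_budget: int = 4                              # short-term capacity

    def mem_add(self, entry: str) -> str:
        """Tool: commit an entry to long-term memory."""
        self.long_term.append(entry)
        return f"stored:{len(self.long_term)}"

    def mem_search(self, query: str, k: int = 2) -> list:
        """Tool: naive keyword retrieval over long-term memory."""
        hits = [e for e in self.long_term if query.lower() in e.lower()]
        return hits[:k]

    def st_push(self, obs: str) -> None:
        """Append an observation, evicting the oldest beyond the budget."""
        self.short_term.append(obs)
        while len(self.short_term) > self.st_budget:
            self.short_term.pop(0)  # a learned policy could call mem_add first

tb = MemoryToolbox()
tb.mem_add("The key is in the drawer.")
for t in range(6):
    tb.st_push(f"step-{t} observation")
print(tb.mem_search("key"))  # ['The key is in the drawer.']
print(len(tb.short_term))    # 4
```

In AgeMem the policy itself decides when to invoke such tools; here the calls are scripted only to show the interface.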
Survey · Agent Memory
Dongming Jiang, Yi Li et al.
arXiv 2026 · 2026
Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structures and empirically compares systems such as AMem, MemoryOS, Nemori, MAGMA, and SimpleMem under benchmark saturation, metric validity, backbone sensitivity, and system cost. On the LoCoMo benchmark, it shows Nemori reaching 0.502 F1 while AMem drops to 0.116, with MAGMA achieving the top semantic-judge score of 0.670 under the MAGMA rubric.
Benchmark · Agent Memory
Yuanzhe Hu, Yu Wang, Julian McAuley
ICLR 2026 · 2025
MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. In Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks, compared to 49.2 for the GPT-4o-mini long-context baseline.
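The feeding protocol described above can be sketched as splitting a multi-turn conversation into fixed-size chunks the agent ingests one by one before answering queries. The chunk size and the memorization prompt text below are assumptions, not the benchmark's exact values.

```python
def chunk_conversation(turns, chunk_size=3):
    """Group turns into chunks the agent will ingest sequentially."""
    return [turns[i:i + chunk_size] for i in range(0, len(turns), chunk_size)]

def feed_agent(chunks, memorize):
    """Present each chunk with a memorization prompt; the agent updates memory."""
    for chunk in chunks:
        memorize("Remember the following turns:\n" + "\n".join(chunk))

turns = [f"user/assistant turn {i}" for i in range(7)]
memory = []  # stand-in for an agent's memory-update call
feed_agent(chunk_conversation(turns), memory.append)
print(len(memory))  # 3 chunks (sizes 3, 3, 1)
```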
Agent Memory · Memory Architecture
B.Y. Yan, Chaofan Li et al.
arXiv 2025 · 2025
General Agentic Memory (GAM) combines a **Memorizer** and a **Researcher** operating over a **page-store** and a lightweight **memory**, keeping full trajectories while constructing compact guidance for deep research. On RULER 128K retrieval, GAM achieves 97.70% accuracy compared to 94.25% for RAG using GPT-4o-mini, while also reaching 64.07 F1 on HotpotQA-56K.
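The split described above, lossless pages plus lightweight guidance pointing back into them, can be sketched in a few lines. The one-line "cue" summarization and keyword lookup are placeholder assumptions standing in for GAM's Memorizer and Researcher, which use LLM calls.

```python
class PageStore:
    def __init__(self):
        self.pages = {}     # page_id -> full trajectory text (lossless)
        self.guidance = []  # (page_id, one-line abstract) for cheap lookup

    def memorize(self, text: str) -> int:
        """Archive the full text and index it with a lightweight cue."""
        pid = len(self.pages)
        self.pages[pid] = text
        self.guidance.append((pid, text.split(".")[0]))  # first sentence as cue
        return pid

    def research(self, query: str) -> str:
        """Scan the guidance, then fetch the full page behind the best cue."""
        for pid, cue in self.guidance:
            if query.lower() in cue.lower():
                return self.pages[pid]
        return ""

store = PageStore()
store.memorize("Meeting notes about the RULER eval. Long details follow...")
print(store.research("ruler"))  # returns the full page, not just the cue
```

The point of the design is that guidance stays small while nothing is ever discarded: retrieval failures can always fall back to the complete trajectory.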
Long-Term Memory · Agent Memory
Zhen Tan, Jun Yan et al.
ACL 2025 · 2025
Reflective Memory Management (RMM) uses a **memory bank**, **retriever**, **reranker**, and **LLM** to implement Prospective Reflection and Retrospective Reflection for topic-based storage and RL-based retrieval refinement. On LongMemEval, RMM with GTE achieves 69.8% Recall@5 and 70.4% accuracy, compared to 62.4% Recall@5 and 63.6% accuracy for GTE RAG.
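The retrieve-then-rerank flow above can be sketched as a first-stage retriever over topic-tagged memories followed by a reranker that reorders the candidates. The token-overlap scorers below are crude stand-ins for GTE embeddings and RMM's RL-refined reranker, which this sketch does not implement.

```python
def retrieve(memory_bank, query, k=4):
    """First stage: score each (topic, text) entry by token overlap."""
    q = set(query.lower().split())
    scored = [(len(q & set((topic + " " + text).lower().split())), text)
              for topic, text in memory_bank]
    return [t for s, t in sorted(scored, reverse=True)[:k] if s > 0]

def rerank(candidates, query, k=2):
    """Second stage: reorder candidates by overlap with the query text only."""
    q = set(query.lower().split())
    return sorted(candidates,
                  key=lambda t: len(q & set(t.lower().split())),
                  reverse=True)[:k]

bank = [("travel", "user flew to Tokyo in May"),
        ("food", "user is allergic to peanuts"),
        ("travel", "user prefers window seats")]
query = "what travel seats does the user prefer"
top = rerank(retrieve(bank, query), query)
print(top[0])  # 'user prefers window seats'
```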
Agent Memory · Long-Term Memory · Memory Architecture
Prateek Chhikara, Dev Khant et al.
arXiv 2025 · 2025
Mem0 incrementally processes conversations using the **extraction phase**, **update phase**, **asynchronous summary generation module**, **tool call mechanism**, and a **vector database** to build scalable long-term memory. On the LoCoMo benchmark, Mem0 attains a J score of 67.13 on single-hop questions versus 63.79 for the OpenAI memory baseline, and cuts p95 latency to 1.440 s from the full-context baseline's 17.117 s.
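The two-phase loop above can be sketched as an extraction phase that pulls candidate facts from a turn, and an update phase that compares each fact against existing memories and chooses ADD, UPDATE, or NOOP. The string rules here are toy stand-ins for the LLM calls and vector-database similarity search the real system uses.

```python
def extract(turn: str):
    """Extraction phase: keep sentences that look like durable facts."""
    return [s.strip() for s in turn.split(".")
            if " is " in s or " likes " in s]

def update(store: dict, fact: str) -> str:
    """Update phase: route the fact to ADD, UPDATE, or NOOP by subject."""
    subject = fact.split(" is ")[0] if " is " in fact else fact.split(" likes ")[0]
    if subject not in store:
        store[subject] = fact
        return "ADD"
    if store[subject] != fact:
        store[subject] = fact  # overwrite the stale memory
        return "UPDATE"
    return "NOOP"

store = {}
for turn in ["Alice is a doctor. She likes tea.",
             "Alice is a surgeon now."]:
    for fact in extract(turn):
        print(update(store, fact), "->", fact)
# ADD -> Alice is a doctor
# ADD -> She likes tea
# UPDATE -> Alice is a surgeon now
```

The UPDATE branch is what distinguishes this design from append-only logs: conflicting memories are resolved at write time rather than at retrieval time.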
Benchmark · Agent Memory
Haoran Tan, Zeyu Zhang et al.
ACL 2025 · 2025
MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.
Memory Architecture · Agent Memory
Zhiyu Li, Chenyang Xi et al.
arXiv 2025 · 2025
MemOS organizes memory via **MemReader**, **MemScheduler**, **MemLifecycle**, **MemOperator**, and **MemGovernance**, all operating over MemCube units that unify plaintext, activation, and parameter memories under OS-style control. On PreFEval, PersonaMem, LongMemEval, and LoCoMo, MemOS-1031 ranks first across all metrics compared to MIRIX, Mem0, Zep, Memobase, MemU, and Supermemory.
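The MemCube abstraction above, one unit type carrying plaintext, activation, or parameter memory plus lifecycle metadata that an OS-style scheduler acts on, can be sketched as follows. The field names, lifecycle states, and priority rule are assumptions for illustration, not MemOS's actual schema.

```python
from dataclasses import dataclass

@dataclass
class MemCube:
    kind: str        # "plaintext" | "activation" | "parameter"
    payload: object  # text, KV-cache tensors, or adapter weights
    state: str = "active"  # lifecycle: active -> archived -> expired
    priority: int = 0

def schedule(cubes, budget=2):
    """Toy MemScheduler: load the highest-priority active cubes."""
    live = [c for c in cubes if c.state == "active"]
    return sorted(live, key=lambda c: c.priority, reverse=True)[:budget]

cubes = [MemCube("plaintext", "user bio", priority=3),
         MemCube("activation", "<kv-cache>", priority=5),
         MemCube("parameter", "<lora-weights>", state="archived", priority=9)]
print([c.kind for c in schedule(cubes)])  # ['activation', 'plaintext']
```

The uniform unit type is the design point: governance and lifecycle policies can be written once and applied to all three memory forms.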