Agent Memory · Long-Term Memory
Yi Yu, Liuyi Yao et al.
arXiv 2026 · 2026
Agentic Memory (AgeMem) exposes **memory management tools**, a **three-stage progressive RL strategy**, and **step-wise GRPO** directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
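A minimal sketch of what tool-based memory control by the policy could look like; class and method names (`MemoryToolbox`, `mem_add`, `mem_search`, `mem_delete`) are invented for illustration and the lexical scoring is a toy stand-in for behavior AgeMem learns with RL:

```python
class MemoryToolbox:
    """Long-term store the agent policy edits via tool calls; the
    short-term memory is the rolling context window the caller keeps."""

    def __init__(self):
        self.long_term = []  # list of (key, content) entries

    def mem_add(self, key, content):
        self.long_term.append((key, content))
        return f"stored:{key}"

    def mem_search(self, query, top_k=3):
        # Toy word-overlap ranking; the paper trains when/what to
        # retrieve via step-wise GRPO rather than a fixed heuristic.
        scored = sorted(
            self.long_term,
            key=lambda kv: -len(set(query.split()) & set(kv[1].split())),
        )
        return scored[:top_k]

    def mem_delete(self, key):
        self.long_term = [kv for kv in self.long_term if kv[0] != key]
        return f"deleted:{key}"


tools = MemoryToolbox()
tools.mem_add("goal", "find the mug in the kitchen")
tools.mem_add("fact", "the kitchen drawer holds a spoon")
hits = tools.mem_search("where is the mug")
```

The point of exposing these as tools is that the same policy that acts in ALFWorld-style environments also decides when to write, read, or prune memory.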
Benchmark · Long-Term Memory
Mohammad Tavakoli, Alireza Salemi et al.
arXiv 2025 · 2025
LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.
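The three complementary memories can be sketched as follows; the class name `LightMemory`, the overlap-based retriever, and the simple noise filter are illustrative assumptions, not the paper's implementation:

```python
from collections import deque


class LightMemory:
    def __init__(self, window=2):
        self.history = []                    # full conversation, for retrieval
        self.scratchpad = []                 # distilled salient facts
        self.working = deque(maxlen=window)  # bounded recent-turn buffer

    def observe(self, turn, salient_fact=None):
        self.history.append(turn)
        self.working.append(turn)
        if salient_fact:                     # scratchpad formation
            self.scratchpad.append(salient_fact)

    def retrieve(self, query, top_k=2):
        # Toy overlap score standing in for a learned retriever; the
        # filter drops hits sharing no words with the query (noise filtering).
        scored = sorted(self.history,
                        key=lambda t: -len(set(query.split()) & set(t.split())))
        return [t for t in scored[:top_k]
                if set(query.split()) & set(t.split())]

    def build_context(self, query):
        return {"retrieved": self.retrieve(query),
                "scratchpad": list(self.scratchpad),
                "working_memory": list(self.working)}


mem = LightMemory(window=2)
mem.observe("user: my cat is named Milo", salient_fact="cat name = Milo")
mem.observe("user: I moved to Oslo last year", salient_fact="lives in Oslo")
mem.observe("user: what a rainy day")
ctx = mem.build_context("what is my cat named")
```

The division of labor mirrors the paper: retrieval covers the arbitrarily long past, the scratchpad holds distilled facts, and working memory keeps only the freshest turns.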
RAG · Memory Architecture · Long-Term Memory
Bernal Jiménez Gutiérrez, Yiheng Shu et al.
ICML 2025 · 2025
HippoRAG 2 combines **Offline Indexing**, a schema-less **Knowledge Graph**, **Dense-Sparse Integration**, **Deeper Contextualization**, and **Recognition Memory** into a neuro-inspired non-parametric memory system for LLMs. On the joint RAG benchmark suite, HippoRAG 2 achieves 59.8 average F1 versus 57.0 for NV-Embed-v2, including 71.0 F1 on 2Wiki compared to 61.5 for NV-Embed-v2.
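The offline-indexing idea can be sketched with a toy passage-entity graph; `GraphMemory` and its spreading-activation query are invented stand-ins for the LLM-built schema-less knowledge graph and Personalized PageRank retrieval of the real system:

```python
from collections import defaultdict


class GraphMemory:
    def __init__(self):
        self.entity_to_passages = defaultdict(set)
        self.passages = {}

    def index(self, pid, text, entities):
        # Offline indexing: in the real system an LLM extracts
        # entities/triples; here they are supplied by hand.
        self.passages[pid] = text
        for e in entities:
            self.entity_to_passages[e.lower()].add(pid)

    def query(self, question_entities):
        # Crude spreading activation over the graph, standing in for
        # Personalized PageRank seeded at query-matched entity nodes.
        scores = defaultdict(float)
        for e in question_entities:
            for pid in self.entity_to_passages.get(e.lower(), ()):
                scores[pid] += 1.0
        return sorted(scores, key=scores.get, reverse=True)


gm = GraphMemory()
gm.index("p1", "Stanford is in Palo Alto.", ["Stanford", "Palo Alto"])
gm.index("p2", "Fei-Fei Li is a professor at Stanford.", ["Fei-Fei Li", "Stanford"])
ranked = gm.query(["Fei-Fei Li", "Stanford"])
```

Passages sharing more query entities surface first, which is the intuition behind the system's multi-hop gains on datasets like 2Wiki.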
Long-Term Memory · Agent Memory
Zhen Tan, Jun Yan et al.
ACL 2025 · 2025
Reflective Memory Management (RMM) uses a **memory bank**, **retriever**, **reranker**, and **LLM** to implement Prospective Reflection and Retrospective Reflection for topic-based storage and RL-based retrieval refinement. On LongMemEval, RMM with GTE achieves 69.8% Recall@5 and 70.4% accuracy, compared to 62.4% Recall@5 and 63.6% accuracy for GTE RAG.
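A hedged sketch of the two reflection loops, with invented names: prospective reflection stores sessions as topic-keyed summaries, and retrospective reflection nudges per-memory weights up when the answer actually cited a retrieved memory (a toy stand-in for the paper's RL-based refinement):

```python
class ReflectiveMemory:
    def __init__(self, lr=0.5):
        self.bank = {}     # topic -> summary (the memory bank)
        self.weight = {}   # topic -> learned retrieval weight
        self.lr = lr

    def prospective_store(self, topic, summary):
        self.bank[topic] = summary
        self.weight.setdefault(topic, 1.0)

    def retrieve(self, query, top_k=2):
        # Retriever + reranker collapsed into one weighted overlap score.
        def score(topic):
            overlap = len(set(query.split()) & set(self.bank[topic].split()))
            return overlap * self.weight[topic]
        return sorted(self.bank, key=score, reverse=True)[:top_k]

    def retrospective_update(self, retrieved, cited):
        # Reward memories the answer cited; penalize retrieved-but-ignored ones.
        for topic in retrieved:
            delta = self.lr if topic in cited else -self.lr
            self.weight[topic] = max(0.1, self.weight[topic] + delta)


rmm = ReflectiveMemory()
rmm.prospective_store("pets", "user adopted a beagle puppy")
rmm.prospective_store("travel", "user visited Kyoto in spring")
hits = rmm.retrieve("tell me about the beagle puppy")
rmm.retrospective_update(retrieved=hits, cited=["pets"])
```

The citation signal is attractive because it needs no extra labels: whether the generator used a memory is observable from its own output.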
Agent Memory · Long-Term Memory · Memory Architecture
Prateek Chhikara, Dev Khant et al.
arXiv 2025 · 2025
Mem0 incrementally processes conversations using the **extraction phase**, **update phase**, **asynchronous summary generation module**, **tool call mechanism**, and a **vector database** to build scalable long-term memory. On the LOCOMO benchmark, Mem0 attains a J score of 67.13 on single-hop questions versus 63.79 for OpenAI and cuts p95 latency from 17.117s to 1.440s compared to the full-context baseline.
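An illustrative sketch of the extract-then-update pipeline; the class `Mem0Like` is invented, a dict plus string similarity stands in for the vector database, and the sentence splitter stands in for the LLM extractor:

```python
import difflib


class Mem0Like:
    def __init__(self, threshold=0.6):
        self.store = {}        # memory_id -> fact text (vector DB stand-in)
        self.next_id = 0
        self.threshold = threshold

    def extract(self, message):
        # Extraction phase: treat each sentence as a candidate fact.
        return [s.strip() for s in message.split(".") if s.strip()]

    def update(self, fact):
        # Update phase: compare against the most similar stored memory
        # and choose an operation (ADD vs UPDATE here; the real system
        # also supports DELETE and NOOP via a tool-call mechanism).
        best_id, best_sim = None, 0.0
        for mid, text in self.store.items():
            sim = difflib.SequenceMatcher(None, fact, text).ratio()
            if sim > best_sim:
                best_id, best_sim = mid, sim
        if best_sim >= self.threshold:
            self.store[best_id] = fact      # UPDATE: refresh a stale memory
            return "UPDATE", best_id
        self.store[self.next_id] = fact     # ADD: genuinely new information
        self.next_id += 1
        return "ADD", self.next_id - 1

    def ingest(self, message):
        return [self.update(f) for f in self.extract(message)]


m = Mem0Like()
ops1 = m.ingest("Alice lives in Paris.")
ops2 = m.ingest("Alice lives in Berlin.")
```

Keeping one consolidated fact per topic instead of raw transcripts is what lets the real system avoid full-context reads and cut tail latency.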
Benchmark · Long-Term Memory
Di Wu, Hongwei Wang et al.
ICLR 2025 · 2024
LongMemEval evaluates long-term interactive memory by running chat assistants through **indexing**, **retrieval**, and **reading** over 50k sessions with fact-augmented keys and time-aware query expansion. On LongMemEval_S, long-context LLMs like GPT-4o, Llama 3.1, and Phi-3 suffer 30%–60% accuracy drops compared to oracle evidence-only reading, revealing severe limitations in current long-context designs.
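The three-stage pipeline the benchmark evaluates can be sketched as follows; `MemoryPipeline` and its token-overlap retrieval are illustrative assumptions, with fact-augmented keys and a date-expanded query approximating the paper's proposed optimizations:

```python
class MemoryPipeline:
    def __init__(self):
        self.index_store = []   # (key_tokens, session_text, date)

    def index(self, session_text, facts, date):
        # Fact-augmented key: raw session tokens plus distilled-fact tokens.
        key = (set(session_text.lower().split())
               | set(" ".join(facts).lower().split()))
        self.index_store.append((key, session_text, date))

    def retrieve(self, question, dates=(), top_k=1):
        # Time-aware query expansion: fold referenced dates into the query.
        q = set(question.lower().split()) | set(dates)
        ranked = sorted(self.index_store,
                        key=lambda e: -len(q & (e[0] | {e[2]})))
        return [e[1] for e in ranked[:top_k]]

    def read(self, question, evidence):
        # Reading stage: answer from retrieved evidence only (an LLM in
        # practice; returning the top session keeps the sketch runnable).
        return evidence[0] if evidence else ""


pipe = MemoryPipeline()
pipe.index("we talked about my marathon plan",
           ["user trains for a marathon"], "2023-05-01")
pipe.index("we discussed sourdough baking",
           ["user bakes sourdough"], "2023-06-10")
evidence = pipe.retrieve("when did I mention the marathon",
                         dates=("2023-05-01",))
answer = pipe.read("when did I mention the marathon", evidence)
```

The benchmark's headline finding is exactly the gap this pipeline targets: models read well when handed the right evidence but degrade sharply when they must find it in a huge history themselves.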