Agent Memory · Long-Term Memory
Yi Yu, Liuyi Yao et al.
arXiv 2026 · 2026
Agentic Memory (AgeMem) exposes **memory management tools**, a **three-stage progressive RL strategy**, and **step-wise GRPO** directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.
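The memory management tools above can be pictured as agent-callable actions over a long-term store plus a bounded short-term buffer. This is a minimal sketch of that tool interface; the tool names, signatures, and eviction rule are illustrative assumptions, not AgeMem's actual API, and the RL training loop is omitted.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryToolbox:
    long_term: list = field(default_factory=list)   # persistent entries
    short_term: list = field(default_factory=list)  # working context
    st_budget: int = 4                              # short-term capacity

    def mem_add(self, entry: str) -> str:
        """Tool: commit an entry to long-term memory."""
        self.long_term.append(entry)
        return f"stored:{len(self.long_term)}"

    def mem_search(self, query: str, k: int = 2) -> list:
        """Tool: naive keyword retrieval over long-term memory."""
        hits = [e for e in self.long_term if query.lower() in e.lower()]
        return hits[:k]

    def st_push(self, obs: str) -> None:
        """Append an observation, evicting the oldest beyond the budget."""
        self.short_term.append(obs)
        while len(self.short_term) > self.st_budget:
            self.short_term.pop(0)  # a learned policy could call mem_add first

tb = MemoryToolbox()
tb.mem_add("The key is in the drawer.")
for t in range(6):
    tb.st_push(f"step-{t} observation")
print(tb.mem_search("key"))  # ['The key is in the drawer.']
print(len(tb.short_term))    # 4
```

In AgeMem the policy itself decides when to invoke such tools; here the calls are scripted only to show the interface.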
Survey · Agent Memory
Dongming Jiang, Yi Li et al.
arXiv 2026 · 2026
Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structures and empirically compares systems such as AMem, MemoryOS, Nemori, MAGMA, and SimpleMem under benchmark saturation, metric validity, backbone sensitivity, and system cost. On the LoCoMo benchmark, it shows Nemori reaching 0.502 F1 while AMem drops to 0.116, with MAGMA achieving the top semantic-judge score of 0.670 under the MAGMA rubric.
Benchmark · Agent Memory
Yuanzhe Hu, Yu Wang, Julian McAuley
ICLR 2026 · 2025
MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. In Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks, compared to 49.2 for the GPT-4o-mini long-context baseline.
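The feeding protocol described above can be sketched as splitting a multi-turn conversation into fixed-size chunks the agent ingests one by one before answering queries. The chunk size and the memorization prompt text below are assumptions, not the benchmark's exact values.

```python
def chunk_conversation(turns, chunk_size=3):
    """Group turns into chunks the agent will ingest sequentially."""
    return [turns[i:i + chunk_size] for i in range(0, len(turns), chunk_size)]

def feed_agent(chunks, memorize):
    """Present each chunk with a memorization prompt; the agent updates memory."""
    for chunk in chunks:
        memorize("Remember the following turns:\n" + "\n".join(chunk))

turns = [f"user/assistant turn {i}" for i in range(7)]
memory = []  # stand-in for an agent's memory-update call
feed_agent(chunk_conversation(turns), memory.append)
print(len(memory))  # 3 chunks (sizes 3, 3, 1)
```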
Agent Memory · Memory Architecture
B.Y. Yan, Chaofan Li et al.
arXiv 2025 · 2025
General Agentic Memory (GAM) combines a **Memorizer** and a **Researcher** operating over a **page-store** and a lightweight **memory**, keeping full trajectories while constructing compact guidance for deep research. On RULER 128K retrieval, GAM achieves 97.70% accuracy compared to 94.25% for RAG using GPT-4o-mini, while also reaching 64.07 F1 on HotpotQA-56K.
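The split described above, lossless pages plus lightweight guidance pointing back into them, can be sketched in a few lines. The one-line "cue" summarization and keyword lookup are placeholder assumptions standing in for GAM's Memorizer and Researcher, which use LLM calls.

```python
class PageStore:
    def __init__(self):
        self.pages = {}     # page_id -> full trajectory text (lossless)
        self.guidance = []  # (page_id, one-line abstract) for cheap lookup

    def memorize(self, text: str) -> int:
        """Archive the full text and index it with a lightweight cue."""
        pid = len(self.pages)
        self.pages[pid] = text
        self.guidance.append((pid, text.split(".")[0]))  # first sentence as cue
        return pid

    def research(self, query: str) -> str:
        """Scan the guidance, then fetch the full page behind the best cue."""
        for pid, cue in self.guidance:
            if query.lower() in cue.lower():
                return self.pages[pid]
        return ""

store = PageStore()
store.memorize("Meeting notes about the RULER eval. Long details follow...")
print(store.research("ruler"))  # returns the full page, not just the cue
```

The point of the design is that guidance stays small while nothing is ever discarded: retrieval failures can always fall back to the complete trajectory.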
Long-Term Memory · Agent Memory
Zhen Tan, Jun Yan et al.
ACL 2025 · 2025
Reflective Memory Management (RMM) uses a **memory bank**, **retriever**, **reranker**, and **LLM** to implement Prospective Reflection and Retrospective Reflection for topic-based storage and RL-based retrieval refinement. On LongMemEval, RMM with GTE achieves 69.8% Recall@5 and 70.4% accuracy, compared to 62.4% Recall@5 and 63.6% accuracy for GTE RAG.
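The retrieve-then-rerank flow above can be sketched as a first-stage retriever over topic-tagged memories followed by a reranker that reorders the candidates. The token-overlap scorers below are crude stand-ins for GTE embeddings and RMM's RL-refined reranker, which this sketch does not implement.

```python
def retrieve(memory_bank, query, k=4):
    """First stage: score each (topic, text) entry by token overlap."""
    q = set(query.lower().split())
    scored = [(len(q & set((topic + " " + text).lower().split())), text)
              for topic, text in memory_bank]
    return [t for s, t in sorted(scored, reverse=True)[:k] if s > 0]

def rerank(candidates, query, k=2):
    """Second stage: reorder candidates by overlap with the query text only."""
    q = set(query.lower().split())
    return sorted(candidates,
                  key=lambda t: len(q & set(t.lower().split())),
                  reverse=True)[:k]

bank = [("travel", "user flew to Tokyo in May"),
        ("food", "user is allergic to peanuts"),
        ("travel", "user prefers window seats")]
query = "what travel seats does the user prefer"
top = rerank(retrieve(bank, query), query)
print(top[0])  # 'user prefers window seats'
```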
Agent Memory · Long-Term Memory · Memory Architecture
Prateek Chhikara, Dev Khant et al.
arXiv 2025 · 2025
Mem0 incrementally processes conversations using the **extraction phase**, **update phase**, **asynchronous summary generation module**, **tool call mechanism**, and a **vector database** to build scalable long-term memory. On the LoCoMo benchmark, Mem0 attains a J score of 67.13 on single-hop questions versus 63.79 for the OpenAI memory baseline, and cuts p95 latency to 1.440 s from the full-context baseline's 17.117 s.
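The two-phase loop above can be sketched as an extraction phase that pulls candidate facts from a turn, and an update phase that compares each fact against existing memories and chooses ADD, UPDATE, or NOOP. The string rules here are toy stand-ins for the LLM calls and vector-database similarity search the real system uses.

```python
def extract(turn: str):
    """Extraction phase: keep sentences that look like durable facts."""
    return [s.strip() for s in turn.split(".")
            if " is " in s or " likes " in s]

def update(store: dict, fact: str) -> str:
    """Update phase: route the fact to ADD, UPDATE, or NOOP by subject."""
    subject = fact.split(" is ")[0] if " is " in fact else fact.split(" likes ")[0]
    if subject not in store:
        store[subject] = fact
        return "ADD"
    if store[subject] != fact:
        store[subject] = fact  # overwrite the stale memory
        return "UPDATE"
    return "NOOP"

store = {}
for turn in ["Alice is a doctor. She likes tea.",
             "Alice is a surgeon now."]:
    for fact in extract(turn):
        print(update(store, fact), "->", fact)
# ADD -> Alice is a doctor
# ADD -> She likes tea
# UPDATE -> Alice is a surgeon now
```

The UPDATE branch is what distinguishes this design from append-only logs: conflicting memories are resolved at write time rather than at retrieval time.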
Benchmark · Agent Memory
Haoran Tan, Zeyu Zhang et al.
ACL 2025 · 2025
MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.
Memory Architecture · Agent Memory
Zhiyu Li, Chenyang Xi et al.
arXiv 2025 · 2025
MemOS organizes memory via **MemReader**, **MemScheduler**, **MemLifecycle**, **MemOperator**, and **MemGovernance**, all operating over MemCube units that unify plaintext, activation, and parameter memories under OS-style control. On PreFEval, PersonaMem, LongMemEval, and LoCoMo, MemOS-1031 ranks first across all metrics compared to MIRIX, Mem0, Zep, Memobase, MemU, and Supermemory.
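The MemCube abstraction above, one unit type carrying plaintext, activation, or parameter memory plus lifecycle metadata that an OS-style scheduler acts on, can be sketched as follows. The field names, lifecycle states, and priority rule are assumptions for illustration, not MemOS's actual schema.

```python
from dataclasses import dataclass

@dataclass
class MemCube:
    kind: str        # "plaintext" | "activation" | "parameter"
    payload: object  # text, KV-cache tensors, or adapter weights
    state: str = "active"  # lifecycle: active -> archived -> expired
    priority: int = 0

def schedule(cubes, budget=2):
    """Toy MemScheduler: load the highest-priority active cubes."""
    live = [c for c in cubes if c.state == "active"]
    return sorted(live, key=lambda c: c.priority, reverse=True)[:budget]

cubes = [MemCube("plaintext", "user bio", priority=3),
         MemCube("activation", "<kv-cache>", priority=5),
         MemCube("parameter", "<lora-weights>", state="archived", priority=9)]
print([c.kind for c in schedule(cubes)])  # ['activation', 'plaintext']
```

The uniform unit type is the design point: governance and lifecycle policies can be written once and applied to all three memory forms.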