General Agentic Memory Via Deep Research

Authors: B.Y. Yan, Chaofan Li, Hongjin Qian et al.

arXiv 2025

TL;DR

General Agentic Memory (GAM) uses a dual Memorizer–Researcher deep-research mechanism to JIT-build context, reaching 97.70% accuracy on RULER retrieval vs 94.25% for RAG (+3.45 points).

THE PROBLEM

Static ahead-of-time memory loses crucial details and harms long-context agents

Static ahead-of-time memory compresses trajectories into lightweight summaries, so memorization inevitably suffers severe information loss and drops fine-grained details.

When LLM agents rely only on compressed memory, they cannot satisfy nuanced information needs, causing degraded task completion on long-context benchmarks like LoCoMo and HotpotQA.

HOW IT WORKS

General Agentic Memory — dual agents and deep research

General Agentic Memory (GAM) uses a dual design with a Memorizer, Researcher, memory, page-store, and explicit pages to manage long trajectories.

You can think of GAM like RAM and disk: memory is a compact index, while the page-store is a full archive that the Researcher actively searches.

This just-in-time deep-research mechanism lets GAM plan, search, and reflect over the complete history, constructing optimized contexts that a plain context window cannot.
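
The RAM/disk analogy above can be sketched in code. This is a minimal illustration, not the paper's implementation: the class and method names (`Page`, `GAMStore`, `search`) are assumptions, and the keyword search stands in for the Researcher's richer tool set.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """A full, lossless record of one session, decorated with a header."""
    header: str   # short summary generated from memory
    body: str     # the complete, uncompressed session text

@dataclass
class GAMStore:
    """Sketch of GAM's two-tier layout: compact memory plus a full page-store."""
    memory: list = field(default_factory=list)  # lightweight memos ("RAM")
    pages: list = field(default_factory=list)   # complete archive ("disk")

    def search(self, query: str) -> list:
        # Minimal keyword search; the paper's Researcher uses multiple tools.
        return [p for p in self.pages if query.lower() in p.body.lower()]

store = GAMStore()
store.memory.append("memo: user asked about RULER retrieval accuracy")
store.pages.append(Page(header="session 1", body="GAM reaches 97.70% on RULER retrieval."))
print(len(store.search("RULER")))  # -> 1
```

The point of the split is that the memo list stays small enough to sit in the agent's context, while the page-store keeps every detail recoverable on demand.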

DIAGRAM

Deep research loop between client, GAM memory, and page store

This sequence diagram shows how General Agentic Memory (GAM) runs the Researcher planning, searching, and reflection loop over the page-store for a client request.

DIAGRAM

Evaluation pipeline and ablations for GAM

This flowchart shows how General Agentic Memory (GAM) is evaluated across datasets, baselines, and ablation settings.

PROCESS

How General Agentic Memory handles the history-and-request lifecycle

  1. Memorizer.memorize

    GAM uses Memorizer.memorize to turn each new session and existing memory into a concise memo, incrementally updating the global memory state.

  2. Memorizer.page

    GAM runs Memorizer.page to generate a header from memory, decorate the session into a page, and append it into the page-store.

  3. Researcher.plan

    Given a request, GAM calls Researcher.plan with memory and tools to analyze information needs and produce concrete search actions.

  4. Researcher.reflect

    After searching and integration, GAM invokes Researcher.reflect to decide if information is sufficient or to trigger another deep research round.
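
The four steps above can be sketched as plain functions. This is a hypothetical outline under stated assumptions: the function bodies (string memos, keyword queries) are placeholders for the LLM calls the paper describes, and only the memorize/page/plan/reflect structure comes from the source.

```python
def memorize(memory: list, session: str) -> None:
    # Step 1: fold the new session into the compact memory as a memo.
    memory.append(f"memo: {session[:40]}")

def page(pages: list, memory: list, session: str) -> None:
    # Step 2: decorate the session with a header and archive it losslessly.
    header = memory[-1] if memory else "no memo"
    pages.append(f"[{header}] {session}")

def plan(request: str, memory: list) -> list:
    # Step 3: turn the request (plus memory) into concrete search actions;
    # here, trivially, the longer words of the request.
    return [w for w in request.split() if len(w) > 3]

def reflect(evidence: list, max_rounds: int, round_no: int) -> bool:
    # Step 4: stop when evidence looks sufficient or the budget is spent.
    return bool(evidence) or round_no >= max_rounds

memory, pages = [], []
session = "User compared GAM with RAG on RULER."
memorize(memory, session)
page(pages, memory, session)
queries = plan("accuracy on RULER retrieval", memory)
evidence = [p for p in pages for q in queries if q in p]
print(reflect(evidence, max_rounds=3, round_no=1))  # -> True
```

Note the asymmetry: memorize/page run online as sessions arrive, while plan/reflect run only when a request comes in, which is what makes the context construction just-in-time.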

KEY CONTRIBUTIONS

Key Contributions

  • General Agentic Memory framework

    GAM introduces a dual-agent architecture with a Memorizer and Researcher, combining lightweight memory with a full page-store for just-in-time context construction.

  • Deep research over complete history

    GAM formalizes Planning, Searching, Integration, and Reflection in the Researcher, enabling iterative deep research guided by pre-constructed memory and multiple search tools.

  • Unified end-to-end optimization

    GAM defines a reinforcement learning objective over the Memorizer and Researcher, optimizing expected reward R with policy gradients ∇θm and ∇θr on memory-grounded tasks.
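
The objective in contribution 03 can be written out under standard policy-gradient assumptions. The notation below (trajectory τ, score-function estimator) is inferred from the summary's mention of expected reward R and gradients ∇θm, ∇θr, not copied from the paper:

```latex
% Joint objective over Memorizer policy \pi_{\theta_m} and Researcher policy \pi_{\theta_r}
J(\theta_m, \theta_r) = \mathbb{E}_{\tau \sim \pi_{\theta_m}, \pi_{\theta_r}}\,[R(\tau)]

% REINFORCE-style gradients, one per agent
\nabla_{\theta_m} J = \mathbb{E}\,[R(\tau)\, \nabla_{\theta_m} \log \pi_{\theta_m}(\tau)]
\nabla_{\theta_r} J = \mathbb{E}\,[R(\tau)\, \nabla_{\theta_r} \log \pi_{\theta_r}(\tau)]
```

The key property is that a single task reward trains both agents: the Memorizer learns what to memo and page, and the Researcher learns how to plan and reflect, against the same R.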

RESULTS

By the Numbers

RULER Retri. Acc.

97.70

+3.45 over RAG

HotpotQA 56K F1

64.07

vs LONG-LLM 49.75 on HotpotQA 56K

LoCoMo Single Hop F1

57.75

+11.07 over RAG on GPT-4o-mini

HotpotQA 56K build time

56.89 s

offline build vs A-mem 209.74 s on HotpotQA 56K

These numbers come from LoCoMo, HotpotQA, and RULER 128K, which test long-term conversational memory, multi-hop QA, and long-context retrieval. The main result shows that GAM converts full histories into high-utility contexts more accurately and efficiently than static memory systems like RAG and A-MEM.


BENCHMARK

Benchmark: Results on HotpotQA 56K with Qwen2.5-14B

F1 score on HotpotQA 56K long-context question answering benchmark.

BENCHMARK

Benchmark: Efficiency analysis on HotpotQA 56K

Total time in seconds on HotpotQA 56K including offline build and online serve.

KEY INSIGHT

The Counterintuitive Finding

On RULER multi-hop tracing, GAM reaches 93.20% accuracy with GPT-4o-mini, while RAG collapses to 0.00% despite using the same backbone.

This is surprising because many assume stronger retrieval alone suffices, but GAM shows that agentic planning and reflection over a page-store are crucial for complex long-context reasoning.

WHY IT MATTERS

What this unlocks for the field

GAM unlocks general agentic memory where agents can JIT-compile optimized contexts from complete histories using deep research rather than static compression.

Builders can now design agents that scale test-time computation, planning, and search depth to match task difficulty, instead of being bottlenecked by fixed context windows or lossy summaries.


Related papers

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

SurveyAgent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026

Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structures and empirically compares systems like LOCOMO, AMem, MemoryOS, Nemori, MAGMA, and SimpleMem under benchmark saturation, metric validity, backbone sensitivity, and system cost. On the LoCoMo benchmark, Anatomy of Agentic Memory shows Nemori reaches 0.502 F1 while AMem drops to 0.116, and MAGMA achieves the top semantic judge score of 0.670 under the MAGMA rubric.

BenchmarkAgent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.