General Agentic Memory Via Deep Research

Authors: B.Y. Yan, Chaofan Li, Hongjin Qian et al.

arXiv 2025

TL;DR

General Agentic Memory (GAM) uses a dual Memorizer–Researcher deep-research mechanism to JIT-build context, reaching 97.70% accuracy on RULER retrieval vs 94.25% for RAG (+3.45 points).

THE PROBLEM

Static ahead-of-time memory loses crucial details and harms long-context agents

Static ahead-of-time memory compresses trajectories into lightweight summaries, so memorization inevitably suffers severe information loss and drops fine-grained details.

When LLM agents rely only on compressed memory, they cannot satisfy nuanced information needs, causing degraded task completion on long-context benchmarks like LoCoMo and HotpotQA.

HOW IT WORKS

General Agentic Memory — dual agents and deep research

General Agentic Memory (GAM) uses a dual design with a Memorizer, Researcher, memory, page-store, and explicit pages to manage long trajectories.

You can think of GAM like RAM and disk: memory is a compact index, while the page-store is a full archive that the Researcher actively searches.

This just-in-time deep-research mechanism lets GAM plan, search, and reflect over the complete history, constructing optimized contexts that a plain context window cannot.
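
The RAM/disk analogy above can be sketched in code. This is a minimal illustration, not the paper's implementation: the class and method names (`Page`, `GAMStore`, `search`) are assumptions, and the keyword search stands in for the Researcher's richer tool set.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """A full, lossless record of one session, decorated with a header."""
    header: str   # short summary generated from memory
    body: str     # the complete, uncompressed session text

@dataclass
class GAMStore:
    """Sketch of GAM's two-tier layout: compact memory plus a full page-store."""
    memory: list = field(default_factory=list)  # lightweight memos ("RAM")
    pages: list = field(default_factory=list)   # complete archive ("disk")

    def search(self, query: str) -> list:
        # Minimal keyword search; the paper's Researcher uses multiple tools.
        return [p for p in self.pages if query.lower() in p.body.lower()]

store = GAMStore()
store.memory.append("memo: user asked about RULER retrieval accuracy")
store.pages.append(Page(header="session 1", body="GAM reaches 97.70% on RULER retrieval."))
print(len(store.search("RULER")))  # -> 1
```

The point of the split is that the memo list stays small enough to sit in the agent's context, while the page-store keeps every detail recoverable on demand.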

DIAGRAM

Deep research loop between client, GAM memory, and page store

This sequence diagram shows how General Agentic Memory (GAM) runs the Researcher planning, searching, and reflection loop over the page-store for a client request.

DIAGRAM

Evaluation pipeline and ablations for GAM

This flowchart shows how General Agentic Memory (GAM) is evaluated across datasets, baselines, and ablation settings.

PROCESS

How General Agentic Memory handles the history-and-request lifecycle

  1. Memorizer.memorize

    GAM uses Memorizer.memorize to turn each new session and existing memory into a concise memo, incrementally updating the global memory state.

  2. Memorizer.page

    GAM runs Memorizer.page to generate a header from memory, decorate the session into a page, and append it into the page-store.

  3. Researcher.plan

    Given a request, GAM calls Researcher.plan with memory and tools to analyze information needs and produce concrete search actions.

  4. Researcher.reflect

    After searching and integration, GAM invokes Researcher.reflect to decide if information is sufficient or to trigger another deep research round.
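
The four steps above can be sketched as plain functions. This is a hypothetical outline under stated assumptions: the function bodies (string memos, keyword queries) are placeholders for the LLM calls the paper describes, and only the memorize/page/plan/reflect structure comes from the source.

```python
def memorize(memory: list, session: str) -> None:
    # Step 1: fold the new session into the compact memory as a memo.
    memory.append(f"memo: {session[:40]}")

def page(pages: list, memory: list, session: str) -> None:
    # Step 2: decorate the session with a header and archive it losslessly.
    header = memory[-1] if memory else "no memo"
    pages.append(f"[{header}] {session}")

def plan(request: str, memory: list) -> list:
    # Step 3: turn the request (plus memory) into concrete search actions;
    # here, trivially, the longer words of the request.
    return [w for w in request.split() if len(w) > 3]

def reflect(evidence: list, max_rounds: int, round_no: int) -> bool:
    # Step 4: stop when evidence looks sufficient or the budget is spent.
    return bool(evidence) or round_no >= max_rounds

memory, pages = [], []
session = "User compared GAM with RAG on RULER."
memorize(memory, session)
page(pages, memory, session)
queries = plan("accuracy on RULER retrieval", memory)
evidence = [p for p in pages for q in queries if q in p]
print(reflect(evidence, max_rounds=3, round_no=1))  # -> True
```

Note the asymmetry: memorize/page run online as sessions arrive, while plan/reflect run only when a request comes in, which is what makes the context construction just-in-time.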

KEY CONTRIBUTIONS

Key Contributions

  • General Agentic Memory framework

    GAM introduces a dual-agent architecture with a Memorizer and Researcher, combining lightweight memory with a full page-store for just-in-time context construction.

  • Deep research over complete history

    GAM formalizes Planning, Searching, Integration, and Reflection in the Researcher, enabling iterative deep research guided by pre-constructed memory and multiple search tools.

  • Unified end-to-end optimization

    GAM defines a reinforcement learning objective over the Memorizer and Researcher, optimizing expected reward R with policy gradients ∇θm and ∇θr on memory-grounded tasks.
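
The objective in contribution 03 can be written out under standard policy-gradient assumptions. The notation below (trajectory τ, score-function estimator) is inferred from the summary's mention of expected reward R and gradients ∇θm, ∇θr, not copied from the paper:

```latex
% Joint objective over Memorizer policy \pi_{\theta_m} and Researcher policy \pi_{\theta_r}
J(\theta_m, \theta_r) = \mathbb{E}_{\tau \sim \pi_{\theta_m}, \pi_{\theta_r}}\,[R(\tau)]

% REINFORCE-style gradients, one per agent
\nabla_{\theta_m} J = \mathbb{E}\,[R(\tau)\, \nabla_{\theta_m} \log \pi_{\theta_m}(\tau)]
\nabla_{\theta_r} J = \mathbb{E}\,[R(\tau)\, \nabla_{\theta_r} \log \pi_{\theta_r}(\tau)]
```

The key property is that a single task reward trains both agents: the Memorizer learns what to memo and page, and the Researcher learns how to plan and reflect, against the same R.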

RESULTS

By the Numbers

RULER Retri. Acc.

97.70

+3.45 over RAG

HotpotQA 56K F1

64.07

vs LONG-LLM 49.75 on HotpotQA 56K

LoCoMo Single Hop F1

57.75

+11.07 over RAG on GPT-4o-mini

HotpotQA 56K build time

56.89 s

offline build vs A-mem 209.74 s on HotpotQA 56K

These numbers come from LoCoMo, HotpotQA, and RULER 128K, which test long-term conversational memory, multi-hop QA, and long-context retrieval. The main result shows that GAM converts full histories into high-utility contexts more accurately and efficiently than static memory systems like RAG and A-MEM.


BENCHMARK

Benchmark: Results on HotpotQA 56K with Qwen2.5-14B

F1 score on HotpotQA 56K long-context question answering benchmark.

BENCHMARK

Benchmark: Efficiency analysis on HotpotQA 56K

Total time in seconds on HotpotQA 56K including offline build and online serve.

KEY INSIGHT

The Counterintuitive Finding

On RULER multi-hop tracing, GAM reaches 93.20% accuracy with GPT-4o-mini, while RAG collapses to 0.00% despite using the same backbone.

This is surprising because many assume stronger retrieval alone suffices, but GAM shows that agentic planning and reflection over a page-store are crucial for complex long-context reasoning.

WHY IT MATTERS

What this unlocks for the field

GAM unlocks general agentic memory where agents can JIT-compile optimized contexts from complete histories using deep research rather than static compression.

Builders can now design agents that scale test-time computation, planning, and search depth to match task difficulty, instead of being bottlenecked by fixed context windows or lossy summaries.


Related papers

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

SurveyAgent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026

Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structures and empirically compares systems like LOCOMO, AMem, MemoryOS, Nemori, MAGMA, and SimpleMem under benchmark saturation, metric validity, backbone sensitivity, and system cost. On the LoCoMo benchmark, Anatomy of Agentic Memory shows Nemori reaches 0.502 F1 while AMem drops to 0.116, and MAGMA achieves the top semantic judge score of 0.670 under the MAGMA rubric.

BenchmarkAgent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.