Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

Authors: Yanchen Wu, Tenghui Lin, Yingli Zhou et al.

2026

TL;DR

Memory in the LLM Era uses a four-stage modular memory framework plus a new hierarchical tree–tier design to reach 38.79 F1 on LONGMEMEVAL, +1.87 over MemTree.

THE PROBLEM

Long-horizon agents lose critical facts in overflowing contexts

LLM agents face context overflow, where naive long-context prompting becomes token-intensive, high-latency, and unreliable for long-term conversations.

With LOCOMO conversations averaging 588.2 dialogue turns and LONGMEMEVAL histories reaching about 115,000 tokens, agents that rely on raw context struggle with multi-session reasoning and temporal consistency.

HOW IT WORKS

A unified four-stage memory framework

Memory in the LLM Era centers on four modules (Information Extraction, Memory Management, Memory Storage, and Information Retrieval) that together describe the agent memory methods it studies.

Think of Information Extraction and Memory Management as a cognitive front-end, while Memory Storage and Information Retrieval act like long-term disk plus an intelligent index.

This modular design lets Memory in the LLM Era mix and match components, enabling hierarchical, tree-based, and rule-driven memory behaviors beyond a plain context window.
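The mix-and-match idea can be sketched as a small pipeline whose stages are plain functions that can be swapped independently. This is a minimal illustration, not the paper's implementation: `MemoryPipeline`, `archive`, `dedupe`, and `lexical` are hypothetical names, storage is a flat Python list, and retrieval is simple word overlap.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stage signatures; each stage is just a function in this sketch.
Extractor = Callable[[str], List[str]]                   # message -> memory units
Manager = Callable[[List[str], List[str]], List[str]]    # (existing, new) -> merged store
Retriever = Callable[[List[str], str, int], List[str]]   # (store, query, k) -> hits


@dataclass
class MemoryPipeline:
    extract: Extractor
    manage: Manager
    retrieve: Retriever
    store: List[str] = field(default_factory=list)  # flat storage, for simplicity

    def write(self, message: str) -> None:
        """Run extraction, then management, and keep the merged store."""
        self.store = self.manage(self.store, self.extract(message))

    def read(self, query: str, k: int = 2) -> List[str]:
        """Return up to k memories relevant to the query."""
        return self.retrieve(self.store, query, k)


def archive(message: str) -> List[str]:
    """Direct archiving: keep the raw message as one memory unit."""
    return [message.strip()]


def dedupe(existing: List[str], new: List[str]) -> List[str]:
    """A trivial filter step: skip memories that are already stored."""
    return existing + [m for m in new if m not in existing]


def lexical(store: List[str], query: str, k: int) -> List[str]:
    """Rank memories by word overlap with the query; drop zero-overlap ones."""
    q = set(query.lower().split())
    scored = [(len(q & set(m.lower().split())), m) for m in store]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [m for score, m in ranked[:k] if score > 0]


pipeline = MemoryPipeline(extract=archive, manage=dedupe, retrieve=lexical)
for msg in ["Alice moved to Berlin in May", "Bob likes tea", "Alice moved to Berlin in May"]:
    pipeline.write(msg)
hits = pipeline.read("where did alice move")
```

Swapping `lexical` for a vector-based retriever, or `archive` for a summarizing extractor, would change one argument and leave the rest of the pipeline untouched.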

DIAGRAM

Memory in the LLM Era query-time retrieval pipeline

This diagram shows how Memory in the LLM Era retrieves and injects relevant memories when a new query arrives.
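As a rough illustration of the injection step, the sketch below packs already-ranked memories into a prompt under a token budget. The function names, the prompt template, and the whitespace token count are all assumptions made for this example; a real system would use the model's tokenizer and its own template.

```python
from typing import List


def count_tokens(text: str) -> int:
    """Crude whitespace token count; stands in for a real tokenizer."""
    return len(text.split())


def build_prompt(query: str, ranked_memories: List[str], budget: int) -> str:
    """Inject ranked memories into the prompt until the token budget is used up."""
    chosen: List[str] = []
    used = 0
    for memory in ranked_memories:
        cost = count_tokens(memory)
        if used + cost > budget:
            continue  # this memory does not fit; a shorter one further down may
        chosen.append(memory)
        used += cost
    context = "\n".join(f"- {m}" for m in chosen)
    return f"Relevant memories:\n{context}\n\nQuestion: {query}"


prompt = build_prompt(
    "Where does Alice live?",
    [
        "Alice moved to Berlin in May",
        "Alice adopted a cat named Miso last winter",
        "Bob likes tea",
    ],
    budget=10,
)
```

With a budget of 10, the first memory (6 tokens) and the third (3 tokens) fit, while the second (8 tokens) is skipped.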

DIAGRAM

Evaluation pipeline across LOCOMO and LONGMEMEVAL

This diagram shows how Memory in the LLM Era evaluates modular memory methods on LOCOMO and LONGMEMEVAL with shared settings.

PROCESS

How Memory in the LLM Era Handles a Long-term Conversation Session

  1. Information Extraction

    Memory in the LLM Era first applies Information Extraction, using direct archiving, summarization-based extraction, or graph-based extraction to convert messages into structured memories.

  2. Memory Management

    Then Memory in the LLM Era runs Memory Management to connect, integrate, transform, update, and filter memories, mirroring human-like consolidation and forgetting.

  3. Memory Storage

    Next Memory in the LLM Era organizes processed memories into flat or hierarchical Memory Storage using vector-based, graph-based, or tree-based structures.

  4. Information Retrieval

    Finally, Memory in the LLM Era uses Information Retrieval with lexical-based, vector-based, structure-based, or LLM-assisted retrieval to assemble context for the LLM.
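The four steps above can be traced on a toy session. The sketch below is only an illustration under strong assumptions: messages of the form `key = value` stand in for extracted facts, the `Memory` record and all function names are hypothetical, storage is a flat dictionary, and the update rule simply lets the most recent turn win.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Memory:
    key: str    # hypothetical subject key, e.g. "alice.city"
    value: str
    turn: int   # dialogue turn at which the fact was stated


def extract(turn: int, message: str) -> List[Memory]:
    """Step 1 (toy): parse 'key = value' messages; ignore everything else."""
    if "=" not in message:
        return []
    key, value = (part.strip() for part in message.split("=", 1))
    return [Memory(key, value, turn)]


def manage(store: Dict[str, Memory], new: List[Memory]) -> None:
    """Step 2 (toy): update in place so the most recent value for a key wins."""
    for memory in new:
        old = store.get(memory.key)
        if old is None or memory.turn >= old.turn:
            store[memory.key] = memory  # Step 3: flat dictionary storage


def retrieve(store: Dict[str, Memory], query: str) -> List[Memory]:
    """Step 4 (toy): lexical match between query words and key parts."""
    words = set(query.lower().replace("?", "").split())
    return [m for m in store.values() if words & set(m.key.split("."))]


store: Dict[str, Memory] = {}
session = ["alice.city = Paris", "nice weather today", "alice.city = Berlin"]
for turn, message in enumerate(session):
    manage(store, extract(turn, message))
answer = retrieve(store, "Which city is Alice in?")
```

The update rule is what keeps the answer temporally consistent: the later statement about Berlin replaces the earlier one about Paris instead of sitting next to it.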

KEY CONTRIBUTIONS

Key Contributions

  • Unified modular framework for agent memory

    Memory in the LLM Era formalizes agent memory as four components (Information Extraction, Memory Management, Memory Storage, and Information Retrieval), covering 10 methods such as MemGPT, MemoryOS, and MemTree.

  • Comprehensive experimental study on LOCOMO and LONGMEMEVAL

    Memory in the LLM Era reimplements 10 memory methods and evaluates them on LOCOMO and LONGMEMEVAL, analyzing F1, BLEU-1, token costs, context scalability, and position sensitivity.

  • New agent memory method with state-of-the-art performance

    Memory in the LLM Era designs a new tree plus three-tier memory variant that reaches 38.79 F1 on LONGMEMEVAL and, with Qwen2.5-72B, 43.87 F1 on LOCOMO.
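For readers unfamiliar with the two answer-quality metrics named above, here is a minimal sketch of how token-level F1 and BLEU-1 are commonly computed for short answers. This summary does not state the paper's exact scoring code, so the lowercasing, whitespace tokenization, and single-reference setup below are assumptions.

```python
import math
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (SQuAD-style)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def bleu1(prediction: str, reference: str) -> float:
    """Clipped unigram precision times the BLEU brevity penalty."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred:
        return 0.0
    clipped = sum((Counter(pred) & Counter(ref)).values())
    precision = clipped / len(pred)
    # Penalize predictions that are shorter than the reference.
    brevity = 1.0 if len(pred) > len(ref) else math.exp(1 - len(ref) / len(pred))
    return brevity * precision


f1 = token_f1("alice lives in berlin", "berlin")
b1 = bleu1("alice lives in berlin", "berlin")
```

In this example the four-token prediction contains the one-token reference, so precision is 0.25 and recall is 1.0. Both metrics penalize verbose answers, which is one reason concise retrieved context matters.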

RESULTS

By the Numbers

Overall F1 on LONGMEMEVAL: 38.79 (+1.87 over MemTree)

Overall F1 on LOCOMO with Qwen2.5-72B: 43.87 (+1.08 over MemOS)

Information Extraction assistant F1: 69.34 (+11.55 over MemTree)

Average token cost per dialogue: <450 tokens (lower than MemTree and MemOS in Figure 10)

Memory in the LLM Era is evaluated on LONGMEMEVAL and LOCOMO, which test long-term conversational memory, multi-session reasoning, and temporal reasoning. The 38.79 F1 and 43.87 F1 results show that the modular tree–tier design improves accuracy while keeping token usage low.
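The reported margins also imply the comparison scores by simple subtraction. The short sketch below does that arithmetic; the implied values are derived here from the figures quoted in this summary and are not numbers quoted from the paper itself.

```python
# (reported score, reported gain over the named baseline), as quoted above.
reported = {
    "LONGMEMEVAL overall F1 vs MemTree": (38.79, 1.87),
    "LOCOMO overall F1 (Qwen2.5-72B) vs MemOS": (43.87, 1.08),
    "Information Extraction assistant F1 vs MemTree": (69.34, 11.55),
}

# Implied baseline = reported score minus reported gain, rounded to 2 decimals.
implied = {name: round(score - gain, 2) for name, (score, gain) in reported.items()}

for name, baseline in implied.items():
    print(f"{name}: implied baseline {baseline:.2f}")
```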

BENCHMARK

Overall F1 on LONGMEMEVAL with Qwen2.5-7B-Instruct

Overall F1 scores comparing Memory in the LLM Era and strong baselines on LONGMEMEVAL.

BENCHMARK

Overall F1 on LOCOMO with Qwen2.5-72B-Instruct

Overall F1 scores comparing Memory in the LLM Era and strong baselines on LOCOMO.

KEY INSIGHT

The Counterintuitive Finding

Memory in the LLM Era shows that coarser-grained extraction, like MemoryOS segment summaries, can reduce token costs without hurting F1 on LONGMEMEVAL.

This is surprising because many assume finer-grained turn-level memories are always better, but the results show that carefully chosen granularity plus strong LLM reasoning can be more efficient.
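To make the granularity trade-off concrete, the sketch below groups dialogue turns into fixed-size segments, the unit a segment-level method would then summarize. The segment size of 8 and the `to_segments` helper are hypothetical choices for illustration. The turn count of 588 rounds the LOCOMO average of 588.2.

```python
from typing import List


def to_segments(turns: List[str], size: int) -> List[List[str]]:
    """Group consecutive turns into segments of at most `size` turns."""
    return [turns[i:i + size] for i in range(0, len(turns), size)]


turns = [f"turn {i}" for i in range(588)]  # about one average LOCOMO conversation
segments = to_segments(turns, size=8)

turn_level_units = len(turns)        # one memory per turn
segment_level_units = len(segments)  # one summary per segment
```

Coarser units mean fewer entries to store, index, and inject. Whether answer quality survives depends on what each segment summary preserves, which is the empirical question the finding above addresses.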

WHY IT MATTERS

What this unlocks for the field

Memory in the LLM Era enables practitioners to reason about agent memory as interchangeable modules, not monolithic designs tied to a single system.

Builders can now systematically combine extraction, management, storage, and retrieval choices—like tree indices with three-tier storage—to design memory-augmented agents tuned for accuracy, cost, and robustness.
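One way to picture the design space is to enumerate combinations of the options named in this summary. The sketch below does so with a hypothetical `MemoryConfig` record. Memory Management operations are left out for brevity, and not every combination is necessarily meaningful or evaluated in the paper.

```python
from dataclasses import dataclass
from itertools import product

# Option names are taken from the process description in this summary.
EXTRACTION = ["direct archiving", "summarization-based", "graph-based"]
STORAGE = ["vector-based", "graph-based", "tree-based"]
RETRIEVAL = ["lexical-based", "vector-based", "structure-based", "LLM-assisted"]


@dataclass(frozen=True)
class MemoryConfig:
    extraction: str
    storage: str
    retrieval: str


# Cartesian product of the three option lists: 3 * 3 * 4 combinations.
configs = [
    MemoryConfig(e, s, r) for e, s, r in product(EXTRACTION, STORAGE, RETRIEVAL)
]

# Example filter: every configuration that uses tree-based storage.
tree_configs = [c for c in configs if c.storage == "tree-based"]
```

Treating each configuration as a value makes it straightforward to sweep candidates against a shared benchmark setup and compare accuracy and token cost side by side.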
