Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Authors: Mohammad Tavakoli, Alireza Salemi, Carrie Ye et al.

arXiv 2025

TL;DR

LIGHT combines episodic retrieval, a working memory buffer, and a long-term scratchpad to boost BEAM memory scores by up to +155.7% over long-context baselines.

THE PROBLEM

LLMs With 1M Context Still Fail on 10M Token Dialogues

LIGHT is motivated by BEAM results where even 1M-token context LLMs see average scores collapse to 0.133–0.199 at 10M tokens.

These failures hit long-term conversational memory, causing broken instruction following and missed multi-hop reasoning over sprawling user histories.

HOW IT WORKS

LIGHT — Episodic, Working, and Scratchpad Memory for Long Dialogues

LIGHT’s core mechanism combines Retrieval from the Conversation, Scratchpad Formation and Utilization, Working Memory, and Filtering Scratchpad to structure long-term context.

You can think of LIGHT like a brain: episodic memory is the hippocampus, working memory is RAM, and the scratchpad is an external notebook.

This design lets LIGHT answer BEAM’s probing questions using targeted recall and distilled notes, instead of relying on a single overloaded context window.
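The three memory stores can be pictured as a toy data structure. This is an illustrative sketch, not the paper's code; the class and field names, and the cap `z` on working memory, are assumptions based on the description above:

```python
from dataclasses import dataclass, field

@dataclass
class LightMemory:
    """Toy container for LIGHT's three memory stores (names are illustrative)."""
    episodic_index: list = field(default_factory=list)   # all past conversation segments
    working_memory: list = field(default_factory=list)   # last z user-assistant pairs
    scratchpad: list = field(default_factory=list)       # distilled long-term notes

    def observe_turn(self, user_msg: str, assistant_msg: str, z: int = 3) -> None:
        pair = (user_msg, assistant_msg)
        self.episodic_index.append(pair)        # everything is indexed for retrieval
        self.working_memory.append(pair)        # recency buffer, capped at z pairs
        self.working_memory = self.working_memory[-z:]

mem = LightMemory()
for i in range(5):
    mem.observe_turn(f"user turn {i}", f"assistant turn {i}")
print(len(mem.episodic_index), len(mem.working_memory))  # prints: 5 3
```

The point of the split: the episodic index grows with the conversation (and is only ever queried, never fully re-read), while working memory stays constant-size regardless of conversation length.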

DIAGRAM

LIGHT Inference Flow Over a BEAM Conversation

This diagram shows how LIGHT routes a BEAM probing question through episodic retrieval, working memory, and scratchpad filtering before answering.

DIAGRAM

BEAM Data Generation and Probing Pipeline

This diagram shows how BEAM constructs long conversations, generates probing questions, and validates nuggets for evaluating LIGHT.

PROCESS

How LIGHT Handles a BEAM Probing Question

  1. Retrieval from the Conversation

    LIGHT uses Retrieval from the Conversation to embed the question and fetch k relevant segments from the episodic index built over BEAM turns.

  2. Working Memory

    LIGHT forms Working Memory by selecting the z most recent user–assistant pairs, giving priority to the short-term context needed for local reasoning.

  3. Scratchpad Formation and Utilization

    LIGHT relies on Scratchpad Formation and Utilization to aggregate salient facts into a compressed long-term note that is later filtered per question.

  4. Filtering Scratchpad and Answer Generation

    LIGHT applies Filtering Scratchpad to keep only relevant chunks, then conditions the backbone LLM on episodic retrieval, working memory, and filtered scratchpad.
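The four steps above can be sketched as a single context-assembly function. This is a minimal sketch under loud assumptions: a bag-of-words counter stands in for the paper's learned embedder, and the function name `build_context` and the parameters `k`, `z`, and `threshold` are illustrative, not LIGHT's actual API:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; LIGHT uses a learned dense embedder instead.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_context(question, episodic_index, recent_pairs, scratchpad,
                  k=2, z=2, threshold=0.5):
    q = embed(question)
    # Step 1 - Retrieval from the Conversation: top-k episodic segments.
    retrieved = sorted(episodic_index,
                       key=lambda s: cosine(q, embed(s)), reverse=True)[:k]
    # Step 2 - Working Memory: the z most recent user-assistant pairs.
    working = recent_pairs[-z:]
    # Steps 3-4 - Filtering Scratchpad: keep only notes relevant to the question;
    # the backbone LLM is then conditioned on all three sources.
    filtered = [note for note in scratchpad
                if cosine(q, embed(note)) >= threshold]
    return {"retrieved": retrieved, "working": working, "scratchpad": filtered}

episodic_index = ["user likes hiking in spring",
                  "assistant booked a flight to Oslo",
                  "user said they are allergic to peanuts"]
scratchpad = ["the user is allergic to peanuts", "the user enjoys hiking"]
ctx = build_context("what is the user allergic to?", episodic_index,
                    [("hi", "hello")], scratchpad)
print(ctx["retrieved"][0])   # most similar episodic segment
print(ctx["scratchpad"])     # notes surviving the filter
```

Note that retrieval and scratchpad filtering are both question-conditioned, while working memory is purely recency-based; this is why the context stays small no matter how long the conversation grows.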

KEY CONTRIBUTIONS

Key Contributions

  • BEAM long-term memory benchmark

    LIGHT is evaluated on BEAM, which includes 100 conversations ranging from 100K to 10M tokens and 2,000 validated probing questions across ten memory abilities.

  • LIGHT cognitive memory framework

    LIGHT introduces Retrieval from the Conversation, Working Memory, and Scratchpad Formation and Utilization to emulate episodic, working, and semantic memory in LLMs.

  • Comprehensive BEAM evaluation

    LIGHT achieves average gains of 3.5%–12.69% over the strongest baselines and up to +155.7% relative improvement at 10M tokens on BEAM.

RESULTS

By the Numbers

Average score at 10M: 0.226 (+0.117 over GPT-4.1-nano vanilla)

Average score at 1M: 0.336 (+0.145 over GPT-4.1-nano vanilla)

Summarization at 500K: 0.373 (+0.107 over Llama Maverick vanilla)

Multi-hop at 500K: 0.350 (+0.131 over Llama Maverick vanilla)

These metrics come from Table 1 on the BEAM benchmark, which tests ten memory abilities over 100K–10M token conversations. The main result is that LIGHT keeps GPT-4.1-nano’s memory performance usable even when conversations reach 10M tokens, where vanilla long-context usage collapses.


BENCHMARK

Benchmark: Comparison of different LLMs and methods across conversation lengths and memory abilities

Average score on BEAM at 10M-token conversations.

KEY INSIGHT

The Counterintuitive Finding

LIGHT’s ablation shows that removing Working Memory sometimes slightly improves performance at 100K and 500K tokens, despite being a core component.

This is surprising because many assume more recent-turn context always helps, but BEAM reveals that extra local history can introduce harmful noise for long-range questions.

WHY IT MATTERS

What this unlocks for the field

LIGHT shows that structured episodic indexes and scratchpads can keep memory performance stable even when conversations scale to 10M tokens.

Builders can now design assistants that track evolving preferences, instructions, and updates over multi-session histories without requiring an impractically large monolithic context window.


Related papers

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.

Benchmark · Agent Memory

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang et al.

ACL 2025 · 2025

MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.

Benchmark

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang et al.

arXiv 2025 · 2025

MemoryBench orchestrates **Task Provider**, **User Simulator**, and **Performance Monitor** over 11 datasets to evaluate declarative and procedural memory in LLM systems. On MemoryBench, simple RAG baselines like BM25-S and Embed-M often match or beat advanced systems such as MemoryOS and Mem0 despite MemoryOS taking more than 17 seconds of memory time per case.