Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Authors: Mohammad Tavakoli, Alireza Salemi, Carrie Ye et al.

arXiv 2025

TL;DR

LIGHT combines episodic retrieval, a working memory buffer, and a long-term scratchpad to boost BEAM memory scores by up to +155.7% over long-context baselines.

THE PROBLEM

LLMs With 1M Context Still Fail on 10M Token Dialogues

LIGHT is motivated by BEAM results where even 1M-token context LLMs see average scores collapse to 0.133–0.199 at 10M tokens.

These failures hit long-term conversational memory, causing broken instruction following and missed multi-hop reasoning over sprawling user histories.

HOW IT WORKS

LIGHT — Episodic, Working, and Scratchpad Memory for Long Dialogues

LIGHT’s core mechanism combines Retrieval from the Conversation, Scratchpad Formation and Utilization, Working Memory, and Filtering Scratchpad to structure long-term context.

You can think of LIGHT like a brain: episodic memory is the hippocampus, working memory is RAM, and the scratchpad is an external notebook.

This design lets LIGHT answer BEAM’s probing questions using targeted recall and distilled notes, instead of relying on a single overloaded context window.
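The three memory stores can be pictured as a toy data structure. This is an illustrative sketch, not the paper's code; the class and field names, and the cap `z` on working memory, are assumptions based on the description above:

```python
from dataclasses import dataclass, field

@dataclass
class LightMemory:
    """Toy container for LIGHT's three memory stores (names are illustrative)."""
    episodic_index: list = field(default_factory=list)   # all past conversation segments
    working_memory: list = field(default_factory=list)   # last z user-assistant pairs
    scratchpad: list = field(default_factory=list)       # distilled long-term notes

    def observe_turn(self, user_msg: str, assistant_msg: str, z: int = 3) -> None:
        pair = (user_msg, assistant_msg)
        self.episodic_index.append(pair)        # everything is indexed for retrieval
        self.working_memory.append(pair)        # recency buffer, capped at z pairs
        self.working_memory = self.working_memory[-z:]

mem = LightMemory()
for i in range(5):
    mem.observe_turn(f"user turn {i}", f"assistant turn {i}")
print(len(mem.episodic_index), len(mem.working_memory))  # prints: 5 3
```

The point of the split: the episodic index grows with the conversation (and is only ever queried, never fully re-read), while working memory stays constant-size regardless of conversation length.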

DIAGRAM

LIGHT Inference Flow Over a BEAM Conversation

This diagram shows how LIGHT routes a BEAM probing question through episodic retrieval, working memory, and scratchpad filtering before answering.

DIAGRAM

BEAM Data Generation and Probing Pipeline

This diagram shows how BEAM constructs long conversations, generates probing questions, and validates nuggets for evaluating LIGHT.

PROCESS

How LIGHT Handles a BEAM Probing Question

  1. Retrieval from the Conversation

    LIGHT uses Retrieval from the Conversation to embed the question and fetch k relevant segments from the episodic index built over BEAM turns.

  2. Working Memory

    LIGHT forms Working Memory by selecting the z most recent user–assistant pairs, giving priority to the short-term context needed for local reasoning.

  3. Scratchpad Formation and Utilization

    LIGHT relies on Scratchpad Formation and Utilization to aggregate salient facts into a compressed long-term note that is later filtered per question.

  4. Filtering Scratchpad and Answer Generation

    LIGHT applies Filtering Scratchpad to keep only relevant chunks, then conditions the backbone LLM on episodic retrieval, working memory, and filtered scratchpad.
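The four steps above can be sketched as a single context-assembly function. This is a minimal sketch under loud assumptions: a bag-of-words counter stands in for the paper's learned embedder, and the function name `build_context` and the parameters `k`, `z`, and `threshold` are illustrative, not LIGHT's actual API:

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words vector; LIGHT uses a learned dense embedder instead.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_context(question, episodic_index, recent_pairs, scratchpad,
                  k=2, z=2, threshold=0.5):
    q = embed(question)
    # Step 1 - Retrieval from the Conversation: top-k episodic segments.
    retrieved = sorted(episodic_index,
                       key=lambda s: cosine(q, embed(s)), reverse=True)[:k]
    # Step 2 - Working Memory: the z most recent user-assistant pairs.
    working = recent_pairs[-z:]
    # Steps 3-4 - Filtering Scratchpad: keep only notes relevant to the question;
    # the backbone LLM is then conditioned on all three sources.
    filtered = [note for note in scratchpad
                if cosine(q, embed(note)) >= threshold]
    return {"retrieved": retrieved, "working": working, "scratchpad": filtered}

episodic_index = ["user likes hiking in spring",
                  "assistant booked a flight to Oslo",
                  "user said they are allergic to peanuts"]
scratchpad = ["the user is allergic to peanuts", "the user enjoys hiking"]
ctx = build_context("what is the user allergic to?", episodic_index,
                    [("hi", "hello")], scratchpad)
print(ctx["retrieved"][0])   # most similar episodic segment
print(ctx["scratchpad"])     # notes surviving the filter
```

Note that retrieval and scratchpad filtering are both question-conditioned, while working memory is purely recency-based; this is why the context stays small no matter how long the conversation grows.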

KEY CONTRIBUTIONS

Key Contributions

  • BEAM long-term memory benchmark

    LIGHT is evaluated on BEAM, which includes 100 conversations ranging from 100K to 10M tokens and 2,000 validated probing questions across ten memory abilities.

  • LIGHT cognitive memory framework

    LIGHT introduces Retrieval from the Conversation, Working Memory, and Scratchpad Formation and Utilization to emulate episodic, working, and semantic memory in LLMs.

  • Comprehensive BEAM evaluation

    LIGHT achieves average gains of 3.5%–12.69% over the strongest baselines and up to +155.7% relative improvement at 10M tokens on BEAM.

RESULTS

By the Numbers

Average score at 10M: 0.226 (+0.117 over GPT-4.1-nano vanilla)

Average score at 1M: 0.336 (+0.145 over GPT-4.1-nano vanilla)

Summarization at 500K: 0.373 (+0.107 over Llama Maverick vanilla)

Multi-hop at 500K: 0.350 (+0.131 over Llama Maverick vanilla)

These metrics come from Table 1 on the BEAM benchmark, which tests ten memory abilities over 100K–10M token conversations. The main result is that LIGHT keeps GPT-4.1-nano’s memory performance usable even when conversations reach 10M tokens, where vanilla long-context usage collapses.


BENCHMARK

Benchmark: Comparison of different LLMs and methods across conversation lengths and memory abilities

Average score on BEAM at 10M-token conversations.

KEY INSIGHT

The Counterintuitive Finding

LIGHT’s ablation shows that removing Working Memory sometimes slightly improves performance at 100K and 500K tokens, despite being a core component.

This is surprising because many assume more recent-turn context always helps, but BEAM reveals that extra local history can introduce harmful noise for long-range questions.

WHY IT MATTERS

What this unlocks for the field

LIGHT shows that structured episodic indexes and scratchpads can keep memory performance stable even when conversations scale to 10M tokens.

Builders can now design assistants that track evolving preferences, instructions, and updates over multi-session histories without requiring an impractically large monolithic context window.


Related papers

Benchmark · Agent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.

Benchmark · Agent Memory

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Haoran Tan, Zeyu Zhang et al.

ACL 2025 · 2025

MemBench evaluates LLM-based agents via a **User Relation Graph**, **Memory Dataset Construction**, **Multi-scenario Memory**, and **Multi-level Memory** pipeline, then scores seven memory mechanisms with multi-metric evaluation. On 100k-token factual observation tests, MemBench shows **RetrievalMemory** achieves 0.933 accuracy versus 0.631 for **FullMemory**, quantifying both effectiveness and efficiency tradeoffs.

Benchmark

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Qingyao Ai, Yichen Tang et al.

arXiv 2025 · 2025

MemoryBench orchestrates **Task Provider**, **User Simulator**, and **Performance Monitor** over 11 datasets to evaluate declarative and procedural memory in LLM systems. On MemoryBench, simple RAG baselines like BM25-S and Embed-M often match or beat advanced systems such as MemoryOS and Mem0 despite MemoryOS taking more than 17 seconds of memory time per case.