In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

Authors: Zhen Tan, Jun Yan, I-Hung Hsu et al.

ACL 2025

TL;DR

Reflective Memory Management (RMM) combines Prospective and Retrospective Reflection to reorganize and adapt memory, reaching 70.4% accuracy on LongMemEval with GTE vs 63.6% for RAG (+6.8 points).

THE PROBLEM

Without adaptive memory, long-term agents fail, dropping to 0.0% accuracy on LongMemEval

The RMM paper reports that, given no conversation history, Gemini-1.5-Flash achieves only 5.2% METEOR on MSC and 0.0% accuracy on LongMemEval.

In long-term personalized dialogue, this means healthcare or assistant agents cannot recall allergies, preferences, or prior symptoms, leading to incoherent and potentially unsafe responses.

HOW IT WORKS

Reflective Memory Management — Prospective and Retrospective Reflection

Reflective Memory Management (RMM) centers on a memory bank, retriever, reranker, and LLM connected by Prospective Reflection and Retrospective Reflection to manage long-term dialogue memory.

You can think of RMM as a librarian: Prospective Reflection reorganizes books by topic, while Retrospective Reflection learns which shelves people actually use and reorders them.

This reflective design lets RMM maintain topic-level memories and adapt retrieval via LLM attribution and RL, something a plain context window or static RAG stack cannot achieve.
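As a rough sketch of the topic-level memory bank that Prospective Reflection maintains: the names `MemoryEntry`, `MemoryBank`, and `upsert` are hypothetical, and where the paper uses an LLM to decide whether to merge a new summary into an existing topic or add a new one, this stand-in uses exact topic matching.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    """One topic-level summary distilled from a past session."""
    topic: str
    summary: str

@dataclass
class MemoryBank:
    """Topic-organized store reorganized by Prospective Reflection."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def upsert(self, new: MemoryEntry) -> None:
        # Merge into an existing topic if one matches, else add a new entry.
        # (The paper's UpdateMemory uses an LLM for this decision; exact
        # topic-string matching is only a stand-in.)
        for e in self.entries:
            if e.topic == new.topic:
                e.summary = f"{e.summary} {new.summary}".strip()
                return
        self.entries.append(new)

bank = MemoryBank()
bank.upsert(MemoryEntry("allergies", "User is allergic to peanuts."))
bank.upsert(MemoryEntry("allergies", "Also reacts to shellfish."))
# The two session summaries collapse into one "allergies" entry.
```

The design choice to merge at the topic level, rather than store raw turns, is what keeps related facts retrievable together across sessions.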

DIAGRAM

Dialogue time interaction with Reflective Memory Management

This diagram shows how Reflective Memory Management (RMM) handles a user turn using retrieval, reranking, LLM generation, and citation-based rewards.

DIAGRAM

Evaluation pipeline and ablation design for RMM

This diagram shows how Reflective Memory Management (RMM) is evaluated on MSC and LongMemEval, including baselines and ablation variants.

PROCESS

How Reflective Memory Management Handles a Dialogue Session

  1. Retrieve

    Reflective Memory Management (RMM) uses the retriever f_theta to fetch Top K candidates from the memory bank B given query q and session S.

  2. Rerank

    The reranker g_phi adapts query and memory embeddings, applies the Gumbel Trick, and selects Top M memories most relevant to the current query.

  3. Generate

    The LLM consumes q, current session S, and Top M memories to generate the response a and per-memory citation scores as attribution.

  4. Prospective Reflection

    When the session ends, Reflective Memory Management (RMM) calls ExtractMemory and UpdateMemory to decompose S into topic summaries and merge or add them into the memory bank.
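The four steps above can be sketched end to end. Every component here is a stand-in: token-overlap scoring for the retriever f_theta, a Gumbel-perturbed linear scorer for the reranker g_phi, and a stub for the API LLM. The point is only to show how the pieces connect in one turn.

```python
import math
import random

def retrieve(query, bank, k=4):
    # f_theta stand-in: score memories by token overlap with the query.
    scores = [(len(set(query.split()) & set(m.split())), m) for m in bank]
    return [m for _, m in sorted(scores, reverse=True)[:k]]

def rerank(query, candidates, m=2, weights=None):
    # g_phi stand-in: base weight plus Gumbel noise, so the Top-M selection
    # stays stochastic and hence trainable with REINFORCE (the Gumbel Trick).
    weights = weights or {c: 0.0 for c in candidates}
    def gumbel():
        return -math.log(-math.log(random.random()))
    scored = [(weights[c] + gumbel(), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)[:m]]

def generate(query, session, memories):
    # API-LLM stand-in: returns an answer plus per-memory citation flags,
    # which Retrospective Reflection later turns into rewards.
    citations = {mem: True for mem in memories}
    return f"(answer to: {query})", citations

bank = ["user allergic to peanuts", "user likes jazz", "user runs marathons"]
query = "is the user allergic to anything"
top_k = retrieve(query, bank)
top_m = rerank(query, top_k)
answer, cites = generate(query, [], top_m)
```

Prospective Reflection (step 4) would then run at session end, summarizing the new turn into the topic-organized memory bank.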

KEY CONTRIBUTIONS

Key Contributions

  • Reflective Memory Management framework

    Reflective Memory Management (RMM) introduces a unified framework with a memory bank, retriever, reranker, and LLM that jointly support Prospective and Retrospective Reflection for long-term personalized dialogue.

  • Topic-based Prospective Reflection

    Reflective Memory Management (RMM) uses Prospective Reflection with ExtractMemory and UpdateMemory to decompose sessions into topic summaries and merge related entries, improving MSC METEOR from 24.8% to 28.6%.

  • LLM attribution-driven Retrospective Reflection

    Reflective Memory Management (RMM) leverages LLM citation scores as binary rewards to update reranker g_phi via REINFORCE, boosting LongMemEval Recall@5 from 54.3% (RAG) to 60.4% with Contriever.
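A minimal sketch of that REINFORCE update, assuming a softmax policy over per-memory reranker weights and a binary citation reward. The function names and the fixed 0.5 baseline are illustrative, not the paper's exact parameterization of g_phi.

```python
import math

def softmax(scores):
    mx = max(scores.values())
    exps = {k: math.exp(v - mx) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def reinforce_step(weights, selected, rewards, lr=0.5, baseline=0.5):
    # REINFORCE on a softmax policy: for the selected memory the gradient
    # of log-prob w.r.t. its logit is (1 - p_selected); for every other
    # memory it is -p_other. Reward is 1 if the LLM cited the memory.
    probs = softmax(weights)
    new_w = dict(weights)
    for mem in selected:
        adv = rewards.get(mem, 0) - baseline  # baseline reduces variance
        for k in new_w:
            grad = (1.0 - probs[k]) if k == mem else -probs[k]
            new_w[k] += lr * adv * grad
    return new_w

w = {"peanut allergy": 0.0, "likes jazz": 0.0}
# The LLM cited "peanut allergy" in its answer, so that memory earns
# reward 1 and its reranker weight rises relative to the uncited one.
w = reinforce_step(w, selected=["peanut allergy"],
                   rewards={"peanut allergy": 1})
```

Because the reward comes from the LLM's own citations, no labeled retrieval data is needed, which is what makes the reranker trainable online against a black-box model.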

RESULTS

By the Numbers

Metric        RMM     Comparison
METEOR (%)    33.4    +5.9 over RAG GTE
BERT (%)      57.1    vs 52.1 for RAG GTE
Recall@5 (%)  69.8    +7.4 over RAG GTE on LongMemEval
Acc. (%)      70.4    +6.8 over RAG GTE on LongMemEval

On MSC and LongMemEval, which test long-term personalized dialogue and historical recall, Reflective Memory Management (RMM) consistently improves retrieval and answer quality over RAG and agent baselines. The 70.4% accuracy on LongMemEval with GTE shows that RMM converts better memory organization and RL reranking into concrete QA gains.

BENCHMARK

Performance comparison of RMM with baseline methods on LongMemEval (Accuracy)

Accuracy (%) on the LongMemEval personal-knowledge QA benchmark.

BENCHMARK

Performance comparison of RMM with baseline methods on MSC (METEOR)

METEOR (%) on the MSC multi-session conversation benchmark.

KEY INSIGHT

The Counterintuitive Finding

Reflective Memory Management (RMM) with Gemini-1.5-Flash reaches 30.8% METEOR on MSC, while the stronger Gemini-1.5-Pro only achieves 24.6% with RMM.

This is surprising because we expect larger, better-aligned LLMs to always help; instead, stronger alignment makes Gemini-1.5-Pro more likely to abstain on personal questions, hurting personalization metrics.

WHY IT MATTERS

What this unlocks for the field

Reflective Memory Management (RMM) unlocks adaptive, topic-aware long-term memory that reorganizes itself and learns which memories actually help responses.

Builders can now attach RMM to off-the-shelf retrievers and API LLMs to get multi-session personalization without labeled retrieval data or white-box model access.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes **memory management tools**, a **three-stage progressive RL strategy**, and **step-wise GRPO** directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025

LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

RAG · Memory Architecture · Long-Term Memory

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Bernal Jiménez Gutiérrez, Yiheng Shu et al.

ICML 2025

HippoRAG 2 combines **Offline Indexing**, a schema-less **Knowledge Graph**, **Dense-Sparse Integration**, **Deeper Contextualization**, and **Recognition Memory** into a neuro-inspired non-parametric memory system for LLMs. On the joint RAG benchmark suite, HippoRAG 2 achieves 59.8 average F1 versus 57.0 for NV-Embed-v2, including 71.0 F1 on 2Wiki compared to 61.5 for NV-Embed-v2.