In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents

Authors: Zhen Tan, Jun Yan, I-Hung Hsu et al.

ACL 2025

TL;DR

Reflective Memory Management (RMM) combines Prospective and Retrospective Reflection to reorganize and adapt memory, reaching 70.4% accuracy on LongMemEval with GTE vs 63.6% for RAG (+6.8 points).

THE PROBLEM

Without adaptive memory, long-term agents fail, dropping to 0.0% accuracy on LongMemEval

The RMM paper reports that, given no conversation history, Gemini-1.5-Flash achieves only 5.2% METEOR on MSC and 0.0% accuracy on LongMemEval.

In long-term personalized dialogue, this means healthcare or assistant agents cannot recall allergies, preferences, or prior symptoms, leading to incoherent and potentially unsafe responses.

HOW IT WORKS

Reflective Memory Management — Prospective and Retrospective Reflection

Reflective Memory Management (RMM) centers on a memory bank, retriever, reranker, and LLM connected by Prospective Reflection and Retrospective Reflection to manage long-term dialogue memory.

You can think of RMM as a librarian: Prospective Reflection reorganizes books by topic, while Retrospective Reflection learns which shelves people actually use and reorders them.

This reflective design lets RMM maintain topic-level memories and adapt retrieval via LLM attribution and RL, something a plain context window or static RAG stack cannot achieve.
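As a rough sketch of the topic-level memory bank that Prospective Reflection maintains: the names `MemoryEntry`, `MemoryBank`, and `upsert` are hypothetical, and where the paper uses an LLM to decide whether to merge a new summary into an existing topic or add a new one, this stand-in uses exact topic matching.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    """One topic-level summary distilled from a past session."""
    topic: str
    summary: str

@dataclass
class MemoryBank:
    """Topic-organized store reorganized by Prospective Reflection."""
    entries: list[MemoryEntry] = field(default_factory=list)

    def upsert(self, new: MemoryEntry) -> None:
        # Merge into an existing topic if one matches, else add a new entry.
        # (The paper's UpdateMemory uses an LLM for this decision; exact
        # topic-string matching is only a stand-in.)
        for e in self.entries:
            if e.topic == new.topic:
                e.summary = f"{e.summary} {new.summary}".strip()
                return
        self.entries.append(new)

bank = MemoryBank()
bank.upsert(MemoryEntry("allergies", "User is allergic to peanuts."))
bank.upsert(MemoryEntry("allergies", "Also reacts to shellfish."))
# The two session summaries collapse into one "allergies" entry.
```

The design choice to merge at the topic level, rather than store raw turns, is what keeps related facts retrievable together across sessions.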

DIAGRAM

Dialogue time interaction with Reflective Memory Management

This diagram shows how Reflective Memory Management (RMM) handles a user turn using retrieval, reranking, LLM generation, and citation-based rewards.

DIAGRAM

Evaluation pipeline and ablation design for RMM

This diagram shows how Reflective Memory Management (RMM) is evaluated on MSC and LongMemEval, including baselines and ablation variants.

PROCESS

How Reflective Memory Management Handles a Dialogue Session

  1. Retrieve

    Reflective Memory Management (RMM) uses the retriever f_theta to fetch Top K candidates from the memory bank B given query q and session S.

  2. Rerank

    The reranker g_phi adapts query and memory embeddings, applies the Gumbel Trick, and selects Top M memories most relevant to the current query.

  3. Generate

    The LLM consumes q, current session S, and Top M memories to generate the response a and per-memory citation scores as attribution.

  4. Prospective Reflection

    When the session ends, Reflective Memory Management (RMM) calls ExtractMemory and UpdateMemory to decompose S into topic summaries and merge or add them into the memory bank.
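The four steps above can be sketched end to end. Every component here is a stand-in: token-overlap scoring for the retriever f_theta, a Gumbel-perturbed linear scorer for the reranker g_phi, and a stub for the API LLM. The point is only to show how the pieces connect in one turn.

```python
import math
import random

def retrieve(query, bank, k=4):
    # f_theta stand-in: score memories by token overlap with the query.
    scores = [(len(set(query.split()) & set(m.split())), m) for m in bank]
    return [m for _, m in sorted(scores, reverse=True)[:k]]

def rerank(query, candidates, m=2, weights=None):
    # g_phi stand-in: base weight plus Gumbel noise, so the Top-M selection
    # stays stochastic and hence trainable with REINFORCE (the Gumbel Trick).
    weights = weights or {c: 0.0 for c in candidates}
    def gumbel():
        return -math.log(-math.log(random.random()))
    scored = [(weights[c] + gumbel(), c) for c in candidates]
    return [c for _, c in sorted(scored, reverse=True)[:m]]

def generate(query, session, memories):
    # API-LLM stand-in: returns an answer plus per-memory citation flags,
    # which Retrospective Reflection later turns into rewards.
    citations = {mem: True for mem in memories}
    return f"(answer to: {query})", citations

bank = ["user allergic to peanuts", "user likes jazz", "user runs marathons"]
query = "is the user allergic to anything"
top_k = retrieve(query, bank)
top_m = rerank(query, top_k)
answer, cites = generate(query, [], top_m)
```

Prospective Reflection (step 4) would then run at session end, summarizing the new turn into the topic-organized memory bank.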

KEY CONTRIBUTIONS

Key Contributions

  • Reflective Memory Management framework

    Reflective Memory Management (RMM) introduces a unified framework with a memory bank, retriever, reranker, and LLM that jointly support Prospective and Retrospective Reflection for long-term personalized dialogue.

  • Topic-based Prospective Reflection

    Reflective Memory Management (RMM) uses Prospective Reflection with ExtractMemory and UpdateMemory to decompose sessions into topic summaries and merge related entries, improving MSC METEOR from 24.8% to 28.6%.

  • LLM attribution-driven Retrospective Reflection

    Reflective Memory Management (RMM) leverages LLM citation scores as binary rewards to update reranker g_phi via REINFORCE, boosting LongMemEval Recall@5 from 54.3% (RAG) to 60.4% with Contriever.
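A minimal sketch of that REINFORCE update, assuming a softmax policy over per-memory reranker weights and a binary citation reward. The function names and the fixed 0.5 baseline are illustrative, not the paper's exact parameterization of g_phi.

```python
import math

def softmax(scores):
    mx = max(scores.values())
    exps = {k: math.exp(v - mx) for k, v in scores.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def reinforce_step(weights, selected, rewards, lr=0.5, baseline=0.5):
    # REINFORCE on a softmax policy: for the selected memory the gradient
    # of log-prob w.r.t. its logit is (1 - p_selected); for every other
    # memory it is -p_other. Reward is 1 if the LLM cited the memory.
    probs = softmax(weights)
    new_w = dict(weights)
    for mem in selected:
        adv = rewards.get(mem, 0) - baseline  # baseline reduces variance
        for k in new_w:
            grad = (1.0 - probs[k]) if k == mem else -probs[k]
            new_w[k] += lr * adv * grad
    return new_w

w = {"peanut allergy": 0.0, "likes jazz": 0.0}
# The LLM cited "peanut allergy" in its answer, so that memory earns
# reward 1 and its reranker weight rises relative to the uncited one.
w = reinforce_step(w, selected=["peanut allergy"],
                   rewards={"peanut allergy": 1})
```

Because the reward comes from the LLM's own citations, no labeled retrieval data is needed, which is what makes the reranker trainable online against a black-box model.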

RESULTS

By the Numbers

Metric        RMM     Comparison
METEOR (%)    33.4    +5.9 over RAG GTE
BERT (%)      57.1    vs 52.1 for RAG GTE
Recall@5 (%)  69.8    +7.4 over RAG GTE on LongMemEval
Acc. (%)      70.4    +6.8 over RAG GTE on LongMemEval

On MSC and LongMemEval, which test long-term personalized dialogue and historical recall, Reflective Memory Management (RMM) consistently improves retrieval and answer quality over RAG and agent baselines. The 70.4% accuracy on LongMemEval with GTE shows that RMM converts better memory organization and RL reranking into concrete QA gains.

BENCHMARK

Performance comparison of RMM with baseline methods on LongMemEval (Accuracy)

Accuracy (%) on the LongMemEval personal-knowledge QA benchmark.

BENCHMARK

Performance comparison of RMM with baseline methods on MSC (METEOR)

METEOR (%) on the MSC multi-session conversation benchmark.

KEY INSIGHT

The Counterintuitive Finding

Reflective Memory Management (RMM) with Gemini-1.5-Flash reaches 30.8% METEOR on MSC, while the stronger Gemini-1.5-Pro only achieves 24.6% with RMM.

This is surprising because we expect larger, better-aligned LLMs to always help; instead, stronger alignment makes Gemini-1.5-Pro more likely to abstain on personal questions, hurting personalization metrics.

WHY IT MATTERS

What this unlocks for the field

Reflective Memory Management (RMM) unlocks adaptive, topic-aware long-term memory that reorganizes itself and learns which memories actually help responses.

Builders can now attach RMM to off-the-shelf retrievers and API LLMs to get multi-session personalization without labeled retrieval data or white-box model access.


Related papers

Agent Memory · Long-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026

Agentic Memory (AgeMem) exposes **memory management tools**, a **three-stage progressive RL strategy**, and **step-wise GRPO** directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

Benchmark · Long-Term Memory

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Mohammad Tavakoli, Alireza Salemi et al.

arXiv 2025

LIGHT augments LLMs with **Retrieval from the Conversation**, **Scratchpad Formation and Utilization**, and a **Working Memory** buffer plus noise filtering to answer BEAM’s long-context probing questions. On the BEAM benchmark, LIGHT raises GPT-4.1-nano’s average score at 10M-token conversations from 0.109 to 0.226, a +107.3% gain over the vanilla long-context baseline.

RAG · Memory Architecture · Long-Term Memory

From RAG to Memory: Non-Parametric Continual Learning for Large Language Models

Bernal Jiménez Gutiérrez, Yiheng Shu et al.

ICML 2025

HippoRAG 2 combines **Offline Indexing**, a schema-less **Knowledge Graph**, **Dense-Sparse Integration**, **Deeper Contextualization**, and **Recognition Memory** into a neuro-inspired non-parametric memory system for LLMs. On the joint RAG benchmark suite, HippoRAG 2 achieves 59.8 average F1 versus 57.0 for NV-Embed-v2, including 71.0 F1 on 2Wiki compared to 61.5 for NV-Embed-v2.