MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

Authors: Shu Wang, Edwin Yu, Oscar Love et al.

2026

TL;DR

MemMachine uses ground-truth episodic storage plus contextualized retrieval to reach 93.0% on LongMemEvalS and 0.9169 on LoCoMo with gpt-4.1-mini.

THE PROBLEM

Long-running agents lose reliability across sessions when memory depends on brittle RAG workflows (OpenAI baseline: 0.5290 on LoCoMo)

Standard RAG and context windows struggle with multi-session interactions, leading to brittle personalization and factual drift over long horizons.

On LoCoMo, the OpenAI baseline reaches an overall score of only 0.5290, which limits reliable personalization, temporal reasoning, and multi-hop conversational recall.

HOW IT WORKS

MemMachine — Ground-truth episodic memory with contextualized retrieval

MemMachine centers on Short-term memory, Long-term memory, Profile memory, and a Retrieval Agent to store raw episodes and retrieve sentence-level evidence.

Think of MemMachine like RAM plus disk plus a card catalog: short-term memory is RAM, long-term episodic storage is disk, and contextualized retrieval is the catalog that pulls neighboring pages.

This design lets MemMachine preserve exact conversational ground truth while selectively surfacing relevant episode clusters, something a plain context window or naive RAG cannot achieve.
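The RAM-and-disk analogy can be made concrete with a small sketch. It is not MemMachine's actual code: the `Episode` fields mirror the ones the explainer lists for ingestion (producer, timestamp, session id, custom metadata), while the class name `MemoryStore`, the method `ingest`, and the window size are hypothetical choices made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Episode:
    """One raw message, kept verbatim (field names are illustrative)."""
    producer: str
    content: str
    session_id: str
    timestamp: datetime
    metadata: dict = field(default_factory=dict)


class MemoryStore:
    """Hypothetical two-tier store: a bounded recent window plus a full log."""

    def __init__(self, short_term_size: int = 4):
        # "RAM": only the most recent episodes survive here.
        self.short_term = deque(maxlen=short_term_size)
        # "Disk": every episode is appended unmodified, so ground truth is kept.
        self.long_term = []

    def ingest(self, episode: Episode) -> int:
        """Dispatch an episode to both tiers and return its long-term index."""
        self.short_term.append(episode)
        self.long_term.append(episode)
        return len(self.long_term) - 1


store = MemoryStore(short_term_size=2)
messages = ["I moved to Lisbon.", "My dog is named Pixel.", "I start a new job Monday."]
for day, text in enumerate(messages, start=1):
    store.ingest(Episode("user", text, "session-1", datetime(2026, 1, day, tzinfo=timezone.utc)))
```

After three messages the short-term window holds only the last two, while the long-term log still contains the first message exactly as written. That is the property the explainer calls ground-truth preservation.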

DIAGRAM

MemMachine Memory Recall Pipeline

This diagram shows how MemMachine processes a query through short-term search, long-term vector search, contextualization, and reranking before returning episodes.

DIAGRAM

LongMemEvalS Ablation Design in MemMachine

This diagram shows how MemMachine varies ingestion and retrieval settings across LongMemEvalS configurations to measure accuracy gains.

PROCESS

How MemMachine Handles a Multi-session Conversational Query

  1. Data Ingestion

    MemMachine converts each message into an Episode with producer, timestamp, session id, and custom metadata, then dispatches it to Short-term memory and Long-term memory.

  2. Sentence Extraction

    MemMachine segments each Episode into sentences using NLTK Punkt, links them back to episodes, and embeds them into the vector-backed Long-term memory.

  3. Contextualized Retrieval

    MemMachine runs vector search to find nucleus episodes, expands them with neighboring episodes into clusters, and reranks clusters before assembling STM and LTM context.

  4. Retrieval Agent

    MemMachine optionally routes the query through the Retrieval Agent, choosing direct search, SplitQuery, or ChainOfQuery, then passes retrieved context to the answer LLM.
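The sentence extraction and contextualized retrieval steps above can be sketched in a few lines. This is a toy under stated assumptions, not the system's implementation: a regex splitter stands in for NLTK Punkt, a bag-of-words counter stands in for a real embedding model, the reranker is a simple cosine score over the joined cluster text, and all function names and parameters (`top_k`, `window`) are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter; a stand-in for NLTK Punkt."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(episodes):
    """Index every sentence, keeping a link back to its source episode."""
    return [(embed(s), ep_id) for ep_id, text in enumerate(episodes) for s in split_sentences(text)]


def retrieve(query, episodes, index, top_k=1, window=1):
    """Find nucleus episodes, expand them with neighbors, and rank the clusters."""
    q = embed(query)
    best = {}  # best sentence-level score per episode
    for vec, ep_id in index:
        best[ep_id] = max(best.get(ep_id, 0.0), cosine(q, vec))
    nuclei = sorted(best, key=best.get, reverse=True)[:top_k]

    clusters = []
    for n in nuclei:
        ids = list(range(max(0, n - window), min(len(episodes), n + window + 1)))
        # Placeholder reranker: score the cluster by similarity of its joined text.
        score = cosine(q, embed(" ".join(episodes[i] for i in ids)))
        clusters.append((score, ids))
    clusters.sort(key=lambda c: c[0], reverse=True)
    return [[episodes[i] for i in ids] for _, ids in clusters]


episodes = [
    "We talked about the weather today.",
    "I adopted a dog last week. Her name is Pixel.",
    "She already learned to sit.",
    "My flight to Tokyo is on Friday.",
]
index = build_index(episodes)
clusters = retrieve("What is the name of my dog?", episodes, index)
```

In this example the nucleus is the second episode, matched through its sentence "Her name is Pixel." The returned cluster also carries "She already learned to sit.", which shares no words with the query and would be missed by sentence-level search alone. Overlapping clusters are not merged here; a fuller version would need to deduplicate them.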

KEY CONTRIBUTIONS

Key Contributions

  • Ground-truth-preserving architecture

    MemMachine stores raw conversational episodes and indexes at sentence level, avoiding per-message extraction and enabling about 80% input token reduction versus Mem0 on LoCoMo.

  • Contextualized retrieval

    MemMachine introduces contextualized retrieval that expands nucleus matches with neighboring episode context, then reranks clusters, improving multi-hop and temporal reasoning accuracy.

  • Retrieval Agent for multi-hop reasoning

    MemMachine adds the Retrieval Agent with ToolSelectAgent, SplitQuery, and ChainOfQuery, reaching 93.2% on HotpotQA hard and 92.6% on WikiMultiHop under randomized noise.
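The routing idea behind the Retrieval Agent can be illustrated with a rough sketch. The explainer names the strategies but does not describe their internals, so everything below is an assumption: `split_query` is taken to mean searching independent sub-questions and merging the hits, `chain_of_query` is taken to mean feeding each hop's top hit into the next query, and `select_strategy` is a keyword heuristic standing in for whatever ToolSelectAgent actually does (presumably an LLM call). The corpus and `keyword_search` are toy fixtures.

```python
import re

CORPUS = [
    "Alice works at Acme.",
    "Acme is based in Berlin.",
    "Bob plays the cello.",
]


def keyword_search(query, limit=2):
    """Toy search by word overlap; a stand-in for vector search."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    scored = [(len(words & set(re.findall(r"[a-z]+", doc.lower()))), doc) for doc in CORPUS]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0][:limit]


def direct(query, search):
    return search(query)


def split_query(query, search):
    """Assumed behavior: search each 'and'-joined part on its own, then merge."""
    merged = []
    for part in query.rstrip("?").split(" and "):
        for hit in search(part.strip()):
            if hit not in merged:
                merged.append(hit)
    return merged


def chain_of_query(query, search, max_hops=2):
    """Assumed behavior: append each hop's top new hit to the next query."""
    context, current = [], query
    for _ in range(max_hops):
        hits = [h for h in search(current) if h not in context]
        if not hits:
            break
        context.append(hits[0])
        current = f"{query} {hits[0]}"
    return context


STRATEGIES = {"direct": direct, "split": split_query, "chain": chain_of_query}


def select_strategy(query):
    """Keyword heuristic; a stand-in for an LLM-based selector."""
    q = f" {query.lower()} "
    if " and " in q:
        return "split"
    if " that " in q or " whose " in q:
        return "chain"
    return "direct"


def route(query, search):
    name = select_strategy(query)
    return name, STRATEGIES[name](query, search)
```

With this toy corpus, `route("Where is the company that Alice works at based?", keyword_search)` picks the chained strategy. The first hop surfaces the sentence about Alice and Acme, and the second hop, now carrying the word "Acme", surfaces the sentence about Berlin. A single direct search would have to rank both sentences from the original wording alone.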

RESULTS

By the Numbers

LoCoMo overall score (gpt-4.1-mini)

91.69%

+38.79 points over the OpenAI baseline (52.90%)

LongMemEvalS ablation best

93.0%

+7.0 over MemMachine baseline C5

HotpotQA hard Retrieval Agent accuracy

93.2%

+2.0 over MemMachine declarative search

Mem0 comparison input tokens

~80% less

MemMachine vs Mem0 on LoCoMo memory mode

On LoCoMo, which tests very long-term conversational memory, MemMachine reaches 91.69% with gpt-4.1-mini while using about 80% fewer input tokens than Mem0. On LongMemEvalS, which probes extraction, temporal reasoning, updates, and multi-session reasoning, MemMachine’s best configuration reaches 93.0%, showing that ground-truth episodic storage plus tuned retrieval yields strong long-horizon recall.

BENCHMARK

LoCoMo benchmark comparison across AI agent memory systems

LLM Judge Score on LoCoMo (overall).

BENCHMARK

LongMemEvalS configuration sweep in MemMachine

Overall LLM score on LongMemEvalS across key MemMachine configurations.

KEY INSIGHT

The Counterintuitive Finding

MemMachine finds that GPT-5-mini beats GPT-5 by 2.6 percentage points on LongMemEvalS when paired with the Edwin3 prompt and tuned retrieval.

This is counterintuitive because larger models are usually assumed strictly better, but MemMachine shows prompt–model co-optimization can make a smaller, cheaper model superior.

WHY IT MATTERS

What this unlocks for the field

MemMachine unlocks cost-efficient, ground-truth-preserving long-term memory with contextualized retrieval and multi-hop-aware routing that scale across sessions and benchmarks.

Builders can now deploy personalized agents that remember exact past interactions, handle complex multi-hop questions, and stay within tight token budgets using smaller LLMs like GPT-5-mini.
