Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Authors: Prateek Chhikara, Dev Khant, Saket Aryan et al.

arXiv 2025

TL;DR

Mem0 uses an extraction-and-update pipeline with tool calls over a vector database to reach a 67.13 LLM-as-a-Judge (J) score on LOCOMO single-hop questions, +3.34 over the OpenAI baseline.

THE PROBLEM

Long conversations exceed context windows and break coherence, with 17.117 s p95 full-context latency

Mem0 targets LLMs that lose persistent memory once conversations exceed fixed context windows, forcing full-context runs that incur 17.117 seconds of p95 latency.

When LOCOMO conversations reach around 26,000 tokens, full-context processing becomes too slow and expensive, causing forgetful agents and degraded multi-session coherence.

HOW IT WORKS

Mem0 — extraction and update for scalable long-term memory

Mem0 centers on an extraction phase, an update phase, an asynchronous summary generation module, a tool call mechanism, and a vector database that together manage conversational memories.

You can think of Mem0 like a human with a notepad and a filing cabinet: recent dialogue goes to a scratchpad, and distilled facts are then filed into organized, searchable memory.

This architecture lets Mem0 selectively store, merge, and delete facts over time, enabling consistent reasoning across sessions that a plain context window cannot maintain.
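The "searchable memory" side of this can be sketched as a minimal in-memory vector store. Everything below is illustrative, not Mem0's actual API: the class name, methods, and the toy bag-of-letters embedding (a real deployment would use a learned embedding model and a proper vector database).

```python
import math


class MemoryStore:
    """Minimal in-memory stand-in for the vector database Mem0 writes to.

    Hypothetical sketch: `embed` is a toy character-frequency embedding,
    not the learned model a real system would use.
    """

    def __init__(self):
        self.memories = {}  # memory id -> (text, vector)
        self._next_id = 0

    @staticmethod
    def embed(text):
        # Toy embedding: frequency of each lowercase letter a-z.
        vec = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                vec[ord(ch) - 97] += 1.0
        return vec

    @staticmethod
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def add(self, text):
        self._next_id += 1
        self.memories[self._next_id] = (text, self.embed(text))
        return self._next_id

    def search(self, query, top_s=2):
        # Return the top-s memories most similar to the query.
        qv = self.embed(query)
        ranked = sorted(
            self.memories.items(),
            key=lambda kv: self.cosine(qv, kv[1][1]),
            reverse=True,
        )
        return [(mid, text) for mid, (text, _) in ranked[:top_s]]
```

Selective storage then reduces to deciding, per extracted fact, whether to add a new entry or modify one of the `search` hits.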

DIAGRAM

Mem0g graph memory extraction and query interaction

This diagram shows how Mem0g converts dialogue into entities and relationship triplets, updates the knowledge graph, and serves queries via dual retrieval.
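The triplet side of Mem0g can be sketched with a plain nested dict standing in for the Neo4j graph the paper describes; `GraphMemory`, its method names, and the sample triplets are hypothetical, and the triplets themselves would come from the LLM-based entity extractor and relationship generator.

```python
from collections import defaultdict


class GraphMemory:
    """Illustrative stand-in for Mem0g's knowledge graph.

    The paper stores triplets in Neo4j; a nested dict keeps this
    sketch self-contained and runnable.
    """

    def __init__(self):
        # subject -> relation -> set of objects
        self.triples = defaultdict(lambda: defaultdict(set))

    def upsert(self, subject, relation, obj):
        """Add a (subject, relation, object) triplet, merging duplicates."""
        self.triples[subject][relation].add(obj)

    def query(self, subject, relation):
        """Entity-centric lookup, one half of the dual retrieval path."""
        return sorted(self.triples[subject][relation])


# Triplets an LLM extractor might emit from "Alice moved to Berlin in 2021":
g = GraphMemory()
g.upsert("Alice", "lives_in", "Berlin")
g.upsert("Alice", "moved_in_year", "2021")
```

Because `upsert` merges into a set, re-extracting the same fact in a later session is a no-op rather than a duplicate node.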

DIAGRAM

LOCOMO evaluation pipeline for Mem0 and baselines

This diagram shows how Mem0 is evaluated on LOCOMO conversations, question categories, and metrics against multiple baseline systems.

PROCESS

How Mem0 Handles a Conversation Session

  1.

    Extraction phase

    Mem0 takes the current message pair with the conversation summary and recent messages to run the extraction phase and produce salient memory candidates.

  2.

    Asynchronous summary generation module

    In parallel, Mem0 periodically refreshes the conversation summary using the asynchronous summary generation module so extraction always sees up-to-date global context.

  3.

    Update phase

    Mem0 feeds each candidate fact, together with the top-s similar memories retrieved from the vector database, into the update phase to decide how to modify stored memories.

  4.

    Tool call mechanism

    Within the update phase, Mem0 uses the tool call mechanism to let the LLM choose ADD, UPDATE, DELETE, or NOOP and then applies changes in the vector database.
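The tool-call dispatch in step 4 can be sketched as follows. `apply_memory_op` and the `op` dict are hypothetical stand-ins: in Mem0 the LLM emits the operation as a tool call, and here a plain dict mapping memory id to text stands in for the vector database.

```python
def apply_memory_op(store, op):
    """Apply one update-phase decision to the memory store.

    `store` maps memory id -> text. `op` is a hypothetical dict standing
    in for the LLM's tool call, e.g. {"tool": "UPDATE", "id": 1, "text": ...}.
    """
    tool = op["tool"]
    if tool == "ADD":
        # Store a brand-new fact under a fresh id.
        store[max(store, default=0) + 1] = op["text"]
    elif tool == "UPDATE":
        # Overwrite an existing memory with a corrected fact.
        store[op["id"]] = op["text"]
    elif tool == "DELETE":
        # Remove a memory that is no longer true.
        store.pop(op["id"], None)
    elif tool == "NOOP":
        # The candidate adds nothing new; leave the store untouched.
        pass
    else:
        raise ValueError(f"unknown tool: {tool}")
    return store


memories = {1: "User lives in Paris"}
apply_memory_op(memories, {"tool": "UPDATE", "id": 1, "text": "User lives in Berlin"})
apply_memory_op(memories, {"tool": "ADD", "text": "User likes cycling"})
```

Keeping the four operations this small is what lets the LLM's choice be validated and applied deterministically against the store.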

KEY CONTRIBUTIONS

Key Contributions

  •

    Mem0 memory architecture

    Mem0 introduces an extraction phase and update phase with a tool call mechanism over a vector database, achieving J 67.13 on single-hop LOCOMO questions.

  •

    Mem0g graph memory architecture

    Mem0g extends Mem0 with entity extractor and relationship generator modules to build a Neo4j knowledge graph that improves temporal J to 58.13.

  •

    Comprehensive LOCOMO evaluation

    Mem0 is compared against LoCoMo, ReadAgent, MemoryBank, MemGPT, A-Mem, LangMem, Zep, OpenAI, RAG, and full-context baselines across four LOCOMO question types and deployment metrics.

RESULTS

By the Numbers

  • Single-hop J: 67.13 (+3.34 over OpenAI)

  • Temporal J: 55.51 (+6.20 over Zep)

  • Overall J: 66.88 (+13.98 over A-Mem)

  • Total p95 latency: 1.440 seconds (-15.677 seconds vs full-context)

On the LOCOMO long term conversational memory benchmark, Mem0 is evaluated across single-hop, multi-hop, temporal, and open-domain questions. These results show that Mem0 raises factual quality while dramatically reducing latency compared to full-context and A-Mem baselines.

BENCHMARK

Performance comparison of memory-enabled systems on LOCOMO single-hop J

LLM-as-a-Judge score J on LOCOMO single-hop questions.

KEY INSIGHT

The Counterintuitive Finding

Mem0g achieves an overall J of 68.44 while using only about 14k memory tokens per conversation, compared to Zep’s more than 600k tokens.

This is surprising because many expect richer graph memories to require more storage, yet Mem0g’s compact graph plus text memory beats Zep by 2.45 J with far fewer tokens.

WHY IT MATTERS

What this unlocks for the field

Mem0 enables production agents to maintain coherent long term memory with 91 percent lower p95 latency than full-context processing on LOCOMO conversations.

Builders can now deploy agents that remember user preferences across sessions without paying the 26,031-token context and 17.117-second latency cost of full-context baselines.


Related papers

Agent MemoryLong-Term Memory

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents

Yi Yu, Liuyi Yao et al.

arXiv 2026 · 2026

Agentic Memory (AgeMem) exposes memory management tools, a three-stage progressive RL strategy, and step-wise GRPO directly inside the agent policy to jointly control long-term and short-term memory. On Qwen3-4B-Instruct, AgeMem attains 54.31% average performance across ALFWorld, SciWorld, PDDL, BabyAI, and HotpotQA, exceeding the best baseline A-Mem at 45.74%.

SurveyAgent Memory

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li et al.

arXiv 2026 · 2026

Anatomy of Agentic Memory organizes Memory-Augmented Generation into four structures and empirically compares systems like LOCOMO, AMem, MemoryOS, Nemori, MAGMA, and SimpleMem under benchmark saturation, metric validity, backbone sensitivity, and system cost. On the LoCoMo benchmark, Anatomy of Agentic Memory shows Nemori reaches 0.502 F1 while AMem drops to 0.116, and MAGMA achieves the top semantic judge score of 0.670 under the MAGMA rubric.

BenchmarkAgent Memory

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Yuanzhe Hu, Yu Wang, Julian McAuley

ICLR 2026 · 2025

MemoryAgentBench standardizes multi-turn datasets into chunked conversations with memorization prompts, then evaluates long-context agents, RAG agents, and agentic memory agents across Accurate Retrieval, Test-Time Learning, Long-Range Understanding, and Selective Forgetting. On the overall score in Table 3, the GPT-4.1-mini long-context agent reaches 71.8 on Accurate Retrieval tasks compared to 49.2 for the GPT-4o-mini long-context baseline.