APEX-MEM: Agentic Semi-Structured Memory with Temporal Reasoning for Long-Term Conversational AI

Authors: Pratyay Banerjee, Masud Moshtaghi, Shivashankar Subramanian et al.

2026

TL;DR

APEX-MEM uses an append-only temporal property graph plus multi-tool Graph QnA agents to reach 88.88% accuracy on LOCOMO, +3.50 points over MIRIX.

THE PROBLEM

Long-context agents add noise and collapse under extended history (51.6% to 15.7% F1)

LLMs with larger context windows still fail on long conversations: GPT-4-Turbo drops from 51.6% F1 to 15.7% F1 under adversarial noise.

This breakdown hurts long-term conversational memory, causing inconsistent entities, broken temporal coherence, and unreliable answers across multi-session dialogues.

HOW IT WORKS

APEX-MEM: Property graphs, append-only events, and graph agents

APEX-MEM combines four components (an Ontology, Entity and Property Resolution, Fact Extraction, and Graph Agents) around a temporal property graph that stores evolving conversational facts.

Think of APEX-MEM as a card catalog plus timeline: entities are cards, events are dated entries, and Graph Agents are librarians using specialized tools.

This design lets APEX-MEM resolve conflicts at query time, track temporal validity, and answer complex questions that a plain context window cannot handle.
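
The card-catalog-plus-timeline picture can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation: the `Event` and `MemoryGraph` classes, their field names, and the example facts are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One dated entry: a subject-property-value assertion (hypothetical schema)."""
    subject: str
    prop: str
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means "no known end"


@dataclass
class MemoryGraph:
    """Entities are the 'cards'; each card holds an append-only timeline."""
    timelines: dict[str, list[Event]] = field(default_factory=dict)

    def append(self, event: Event) -> None:
        # Never overwrite or delete: a new fact is just one more entry.
        self.timelines.setdefault(event.subject, []).append(event)

    def history(self, subject: str, prop: str) -> list[Event]:
        # Return every assertion recorded for this property, oldest first.
        events = [e for e in self.timelines.get(subject, []) if e.prop == prop]
        return sorted(events, key=lambda e: e.valid_from)


graph = MemoryGraph()
graph.append(Event("alice", "lives_in", "Boston", date(2023, 1, 10)))
graph.append(Event("alice", "lives_in", "Seattle", date(2024, 6, 1)))

history = graph.history("alice", "lives_in")
print([e.value for e in history])
```

Both the older and the newer assertion stay in the timeline, so the choice of which one answers a question can be deferred to query time.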

DIAGRAM

APEX-MEM Graph QnA Agent Tool-use Sequence

This diagram shows how the APEX-MEM Graph QnA agent uses SCHEMAVIEWER, ENTITYLOOKUP, GRAPHSQL, and SEARCH tools to answer a question over the property graph.
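
A rough sketch of such a tool-use sequence follows. Only the four tool names come from the summary; the toy `GRAPH` layout, the function signatures, and the hard-coded plan are assumptions made for illustration. In a real agent, a language model would choose each step.

```python
from typing import Callable

# Toy in-memory graph; this layout is an assumption for illustration only.
GRAPH = {
    "schema": {"Person": ["name", "works_at"]},
    "entities": {"alice": {"type": "Person", "name": "Alice"}},
    "facts": [("alice", "works_at", "Acme", "2024-03-01")],
    "turns": ["Alice said she started at Acme in March."],
}


def schema_viewer(_: str) -> object:
    # Show which entity types and properties exist.
    return GRAPH["schema"]


def entity_lookup(name: str) -> object:
    # Map a surface name to entity ids (case-insensitive exact match).
    return [eid for eid, e in GRAPH["entities"].items()
            if e["name"].lower() == name.lower()]


def graph_sql(entity_id: str) -> object:
    # Stand-in for a structured query: filter facts by subject id.
    return [f for f in GRAPH["facts"] if f[0] == entity_id]


def search(keyword: str) -> object:
    # Keyword search over raw conversation turns.
    return [t for t in GRAPH["turns"] if keyword.lower() in t.lower()]


TOOLS: dict[str, Callable[[str], object]] = {
    "SCHEMAVIEWER": schema_viewer,
    "ENTITYLOOKUP": entity_lookup,
    "GRAPHSQL": graph_sql,
    "SEARCH": search,
}


def run_plan(plan: list[tuple[str, str]]) -> list[tuple[str, object]]:
    """Execute a fixed list of (tool, argument) steps and collect observations."""
    return [(tool, TOOLS[tool](arg)) for tool, arg in plan]


trace = run_plan([
    ("SCHEMAVIEWER", ""),
    ("ENTITYLOOKUP", "Alice"),
    ("GRAPHSQL", "alice"),
    ("SEARCH", "Acme"),
])
for tool, observation in trace:
    print(tool, "->", observation)
```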

DIAGRAM

Evaluation Pipeline across LOCOMO, LongMemEval, and SealQA-Hard

This diagram shows how APEX-MEM is constructed and evaluated on LOCOMO, LongMemEval, and SealQA-Hard with different QnA agents and baselines.

PROCESS

How APEX-MEM Handles a Conversational Question

01
APEX-MEM Graph Construction
APEX-MEM uses Fact Extraction and Entity and Property Resolution to build an append-only temporal property graph from conversational turns.
02
Ontology and Fact Extraction
APEX-MEM applies the Ontology during Fact Extraction to type entities, events, and subject-property-value assertions with temporal validity intervals.
03
Graph Agents with Tools
APEX-MEM Graph Agents invoke SCHEMAVIEWER, ENTITYLOOKUP, GRAPHSQL, and SEARCH to plan retrieval and reasoning over the property graph.
04
Retrieval-Time Temporal Resolution
APEX-MEM resolves conflicting facts at query time using GRAPHSQL over events and facts to compute temporally valid answers for the user.
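
Step 04 can be illustrated with a small query-time resolution example. This sketch uses Python's built-in `sqlite3` module as a stand-in for GRAPHSQL; the `events` table, its columns, and the sample rows are assumptions and do not reflect the paper's actual schema or query language.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dates are ISO strings, so text comparison orders them correctly.
# A NULL valid_to marks an assertion with no known end.
conn.execute(
    "CREATE TABLE events ("
    "subject TEXT, property TEXT, value TEXT, valid_from TEXT, valid_to TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
    [
        ("alice", "employer", "Acme", "2023-01-01", "2024-05-31"),
        ("alice", "employer", "Globex", "2024-06-01", None),
    ],
)


def value_as_of(subject: str, prop: str, as_of: str):
    """Return the value whose validity interval covers `as_of`, or None."""
    row = conn.execute(
        """SELECT value FROM events
           WHERE subject = ? AND property = ?
             AND valid_from <= ?
             AND (valid_to IS NULL OR valid_to >= ?)
           ORDER BY valid_from DESC
           LIMIT 1""",
        (subject, prop, as_of, as_of),
    ).fetchone()
    return row[0] if row else None


print(value_as_of("alice", "employer", "2023-09-15"))
print(value_as_of("alice", "employer", "2025-01-01"))
```

Because both rows remain stored, the same table can answer "where does Alice work now?" and "where did Alice work in 2023?" without any fact having been overwritten.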

KEY CONTRIBUTIONS

Key Contributions

01
Hybrid entity-event ontology for conversational memory
APEX-MEM introduces an Ontology with 35 entity classes and temporally grounded events, enabling Fact Extraction to attach subject-property-value assertions with validity intervals.
02
Append-only event storage with temporal validity
APEX-MEM stores all facts as append-only events instead of overwriting entities, allowing Graph Agents to perform retrieval-time temporal resolution over evolving information.
03
Multi-tool Graph QnA agent over property graph
APEX-MEM Graph Agents combine SCHEMAVIEWER, ENTITYLOOKUP, GRAPHSQL, and SEARCH, reaching 88.88% accuracy on LOCOMO and 86.2% on LongMemEval.

RESULTS

By the Numbers

Overall accuracy (LOCOMO): 88.88%, +3.50 points over MIRIX

Temporal accuracy (LOCOMO): 90.63%, vs. 65.62% for MIRIX

Overall score (LongMemEval): 86.2%, +11.6 points over Nemori (74.6%)

Accuracy (SealQA-Hard): 40.1%, +5.5 points over O3 (34.6%)

On LOCOMO and LongMemEval, which test long-term conversational memory and long-context reasoning, APEX-MEM’s 88.88% and 86.2% scores show robust temporal and multi-hop reasoning over extended histories.
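
As a quick consistency check, the margins can be recomputed from the quoted scores. The dictionary keys are hypothetical labels. The MIRIX overall score is not quoted in this summary, so the value derived here is only implied by the +3.50 margin.

```python
# Scores as quoted in this summary, in percent.
apex = {"locomo_overall": 88.88, "locomo_temporal": 90.63,
        "longmemeval": 86.2, "sealqa_hard": 40.1}
baseline = {"locomo_temporal": 65.62,  # MIRIX
            "longmemeval": 74.6,       # Nemori
            "sealqa_hard": 34.6}       # O3

# Margin in percentage points for each benchmark with a quoted baseline.
margins = {k: round(apex[k] - baseline[k], 2) for k in baseline}
print(margins)

# Only the margin is quoted for overall LOCOMO, so MIRIX's score is implied.
implied_mirix_overall = round(apex["locomo_overall"] - 3.50, 2)
print(implied_mirix_overall)
```

The LongMemEval and SealQA-Hard margins match the quoted +11.6 and +5.5, and the temporal gap on LOCOMO works out to 25.01 points.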

BENCHMARK

LOCOMO Category Type Evaluation Results

Overall accuracy on LOCOMO Question Answering benchmark.

BENCHMARK

APEX-MEM Ablations of different tools

Overall LOCOMO accuracy for APEX-MEM Graph QnA Agent with different tool subsets.

KEY INSIGHT

The Counterintuitive Finding

With the full tool set, APEX-MEM reaches 87.0% on LOCOMO, while the GRAPHSQL-only configuration needs 3.3x more tool calls and reaches only 79.45%.

This is surprising because one might expect structured SQL reasoning alone to be enough, but the ablation shows that combining SEARCH with GRAPHSQL is both more accurate and more efficient.

WHY IT MATTERS

What this unlocks for the field

APEX-MEM unlocks temporally coherent, entity-consistent conversational memory that can resolve conflicting facts at query time instead of overwriting history.

Builders can now create assistants that survive weeks-long, noisy interactions while still answering temporal and multi-hop questions, with 88.88% overall accuracy on LOCOMO.
