Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Authors: Sahil Sen, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah

2026

TL;DR

Chronos combines dual calendars, dynamic prompting, and structured temporal event retrieval to reach 95.60% accuracy on LongMemEvalS in its High configuration. Its Low configuration scores 92.60%, a 7.67% relative improvement over EmergenceMem Internal (86.00%).

THE PROBLEM

Long-term agents lack temporal grounding, as the 92.60% vs 86.00% gap shows

Existing conversational memory systems either overbuild global knowledge graphs or rely on shallow turn retrieval, failing on time-sensitive multi-session queries.

On LongMemEvalS, EmergenceMem Internal reaches only 86.00% accuracy, showing that long-term assistants still mishandle temporal reasoning and knowledge updates across months of interaction.

HOW IT WORKS

Chronos — Dual Calendars and Dynamic Prompting for Temporal Memory

Chronos centers on Event Extraction, Dynamic Prompting, Initial Retrieval, and the Chronos Agent, backed by a structured event calendar and raw turn calendar.

You can think of Chronos like a calendar plus diary: the event calendar is a timestamped index, while the turn calendar is the full narrative notebook.

By explicitly structuring temporal events while keeping full dialogue, Chronos enables precise time filtering and multi-hop reasoning that a plain context window cannot support.
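A minimal sketch of the two stores described above may help make the calendar-plus-diary split concrete. Everything here is an assumption for illustration: the class names `Turn` and `Event`, the `turn_id` back-pointer, the `events_in_range` helper, and the sample data do not come from the paper. It only shows how a timestamped event index could be filtered by time and then used to pull the full dialogue turn.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Turn:
    """One raw dialogue turn in the (hypothetical) turn calendar."""
    turn_id: int
    timestamp: datetime
    speaker: str
    text: str


@dataclass(frozen=True)
class Event:
    """One structured entry in the (hypothetical) event calendar."""
    subject: str
    verb: str
    obj: str
    start: datetime
    end: datetime
    aliases: tuple[str, ...]
    turn_id: int  # assumed back-pointer to the turn the event came from


def events_in_range(events, start, end):
    """Return events whose [start, end] range overlaps the query window."""
    return [e for e in events if e.start <= end and e.end >= start]


# Illustrative data only.
turns = {
    1: Turn(1, datetime.fromisoformat("2024-03-02T09:15:00"), "user",
            "I just got back from a week in Lisbon."),
    2: Turn(2, datetime.fromisoformat("2024-05-20T18:40:00"), "user",
            "I signed up for a pottery class yesterday."),
}
events = [
    Event("user", "visited", "Lisbon",
          datetime.fromisoformat("2024-02-24T00:00:00"),
          datetime.fromisoformat("2024-03-01T23:59:59"),
          ("vacation", "trip to Portugal"), turn_id=1),
    Event("user", "joined", "pottery class",
          datetime.fromisoformat("2024-05-19T00:00:00"),
          datetime.fromisoformat("2024-05-19T23:59:59"),
          ("ceramics course",), turn_id=2),
]

# Filter by time on the event calendar, then read the full turn for context.
window_start = datetime.fromisoformat("2024-05-01T00:00:00")
window_end = datetime.fromisoformat("2024-05-31T23:59:59")
hits = events_in_range(events, window_start, window_end)
for e in hits:
    print(e.subject, e.verb, e.obj, "->", turns[e.turn_id].text)
```

The design point the sketch tries to capture is that the time filter runs over compact structured records, while the narrative detail stays in the untouched turn text.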

DIAGRAM

Chronos Query-Time Memory Retrieval Loop

Sequence of interactions between the user, Chronos Agent, and the dual calendars during query-time tool-calling.

DIAGRAM

Chronos Evaluation and Ablation Pipeline

Flow of evaluating Chronos configurations and ablations on the LongMemEvalS benchmark.

PROCESS

How Chronos Handles a Long-Term Memory Query

  1. Event Extraction

    Chronos runs the Event Extraction pipeline over conversation turns, producing subject-verb-object (SVO) tuples with ISO 8601 datetime ranges and lexical aliases, and stores them in the event calendar.

  2. Dynamic Prompting

    For each new question, Chronos uses Dynamic Prompting to analyze the query and generate retrieval guidance bullets describing targets and temporal constraints.

  3. Initial Retrieval

    Chronos performs Initial Retrieval over the turn calendar using dense search, Cohere Rerank v3, and context expansion around the top 15 turns.

  4. Chronos Agent

    The Chronos Agent runs a ReAct loop, calling vector and grep tools over the event calendar and turn calendar until it can answer with temporally grounded reasoning.
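The context expansion part of Initial Retrieval can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function name `expand_context`, the `window` size, and the sample ranking are assumptions. The ranked turn indices are assumed to arrive already ordered by dense search and reranking, which the sketch does not perform. Only the top-15 cutoff comes from the description above.

```python
def expand_context(ranked_turn_ids, num_turns, top_k=15, window=1):
    """Keep the top_k ranked turns and add `window` neighbors on each side.

    `ranked_turn_ids` is assumed to be ordered best-first by an upstream
    retriever and reranker (not implemented here). Returns sorted,
    de-duplicated turn indices in conversation order.
    """
    selected = set()
    for turn_id in ranked_turn_ids[:top_k]:
        lo = max(0, turn_id - window)              # clamp at conversation start
        hi = min(num_turns - 1, turn_id + window)  # clamp at conversation end
        selected.update(range(lo, hi + 1))
    return sorted(selected)


# Illustrative ranking; a small top_k keeps the example readable.
ranked = [42, 7, 8, 0, 99]
expanded = expand_context(ranked, num_turns=100, top_k=3, window=1)
print(expanded)
```

Overlapping windows (turns 7 and 8 here) collapse into one contiguous span, so the agent sees each neighboring turn once.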

KEY CONTRIBUTIONS

Key Contributions

  • Chronos Architecture

    Chronos introduces dual event calendar and turn calendar stores plus the Chronos Agent, achieving 92.60% and 95.60% accuracy on LongMemEvalS under the Low and High configurations, respectively.

  • Dynamic Prompting for Memory

    Chronos extends Dynamic Prompting to long-term memory, generating per-question retrieval guidance instead of static query rewriting across temporal, preference, and aggregation tasks.

  • Structured Event Retrieval Gains

    Chronos ablations show the event calendar yields a 58.9% gain over the baseline (consistent with the rise from 58.6% to 93.1% on the ablation subset), while the other components each add between 15.5% and 22.3%.

RESULTS

By the Numbers

Overall

92.60%

+7.67% relative improvement over EmergenceMem Internal (86.00%, a 6.60-point gap)

Knowledge Update

96.15%

+12.82 over EmergenceMem Internal

Multi Session

91.73%

+10.53 over EmergenceMem Internal

Temporal Reasoning

90.23%

+4.52 over EmergenceMem Internal

Chronos is evaluated on the LongMemEvalS benchmark with 500 questions across six categories, testing knowledge updates, aggregation, and temporal reasoning. The 92.60% and 95.60% overall scores show Chronos handles long-horizon, time-grounded memory substantially better than EmergenceMem Internal and Mastra.
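The overall figures can be read two ways, and a short calculation shows the difference. The sketch below assumes the 7.67 headline number is a relative improvement rather than a percentage-point difference. That reading is an inference from the 92.60% and 86.00% scores quoted in this summary, not a statement from the paper, and the helper names are hypothetical.

```python
def point_gain(new, old):
    """Absolute difference in percentage points."""
    return round(new - old, 2)


def relative_gain(new, old):
    """Improvement relative to the old score, in percent."""
    return round((new - old) / old * 100, 2)


chronos_low = 92.60   # overall accuracy quoted above
emergence = 86.00     # EmergenceMem Internal accuracy quoted above

print(point_gain(chronos_low, emergence))     # percentage points
print(relative_gain(chronos_low, emergence))  # percent, relative
```

Under this reading, the gap is 6.60 percentage points, which corresponds to the 7.67% relative improvement.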

BENCHMARK

Comparison of Chronos Low with State-of-the-Art Systems on LongMemEvalS

Overall accuracy (%) on LongMemEvalS across practical conversational memory systems.

BENCHMARK

High-Configuration Accuracy on LongMemEvalS

Overall accuracy (%) for Chronos High and strong baselines under advanced LLM configurations.

KEY INSIGHT

The Counterintuitive Finding

Removing the event calendar cuts Chronos Low’s accuracy by more than a third, from 93.1% to 58.6% on the 116-question ablation subset.

This is surprising because many systems assume dense turn retrieval is sufficient, yet Chronos shows structured temporal events dominate gains while using relatively simple SVO tuples.
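As a worked check on the ablation numbers, the same 34.5-point difference can be expressed relative to either endpoint. The pairing of the 58.9% gain quoted under Key Contributions with these two accuracies is an assumption based on the arithmetic, not something the summary states outright.

```latex
\[
\frac{93.1 - 58.6}{58.6} \approx 0.589
\quad \text{(a 58.9\% relative gain from adding the event calendar)}
\]
\[
\frac{93.1 - 58.6}{93.1} \approx 0.371
\quad \text{(a 37.1\% relative drop from removing it)}
\]
```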

WHY IT MATTERS

What this unlocks for the field

Chronos unlocks reliable, time-aware conversational memory, letting agents answer questions like “What did I do the week after my vacation?” months later.
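To illustrate the kind of query quoted above, here is a small sketch of resolving "the week after my vacation" against timestamped events. The event rows, the `week_after` helper, and the choice to define the week as the seven days following the anchor event's end date are all assumptions for illustration. The paper's agent reaches such answers through tool calls rather than this fixed rule.

```python
from datetime import date, timedelta

# Hypothetical event calendar rows: (description, start, end), dates inclusive.
events = [
    ("vacation in Lisbon", date(2024, 2, 24), date(2024, 3, 1)),
    ("dentist appointment", date(2024, 3, 4), date(2024, 3, 4)),
    ("started pottery class", date(2024, 3, 7), date(2024, 3, 7)),
    ("conference talk", date(2024, 4, 12), date(2024, 4, 12)),
]


def week_after(anchor_keyword, calendar):
    """Return descriptions of events in the 7 days after the anchor event ends."""
    anchor = next((e for e in calendar if anchor_keyword in e[0]), None)
    if anchor is None:
        return []  # no anchor event found, so the window cannot be resolved
    start = anchor[2] + timedelta(days=1)  # day after the anchor ends
    end = start + timedelta(days=6)        # seven-day window, inclusive
    return [desc for desc, s, e in calendar if s <= end and e >= start]


result = week_after("vacation", events)
print(result)
```

The relative phrase becomes an explicit date window only once the anchor event has a known end date, which is what a structured event index supplies.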

Builders can now design assistants that track evolving preferences, knowledge updates, and cross-session event counts without heavyweight knowledge graphs or full-context replay.
