APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Authors: Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha

2026

TL;DR

APEX-EM uses structured Procedural Knowledge Graph experience replay to push KGQAGen-10k accuracy to 89.6% (95.3% CSR), +48.3pp over a 41.3% no-memory baseline.

THE PROBLEM

LLM agents stay stateless despite repeated, similar tasks (41.3% vs 89.6% on KGQAGen-10k)

The paper reports that a Claude Sonnet 4.5 agent without memory reaches only 41.3% accuracy on KGQAGen-10k, despite repeated exposure to structurally similar questions.

This means LLM-based autonomous agents repeatedly re-derive solutions from scratch, wasting prior work and leaving large performance gains on the table for code, queries, and reasoning.

HOW IT WORKS

APEX-EM: Procedural Knowledge Graphs plus PRGII experience replay

APEX-EM centers on a Procedural Knowledge Graph, an Experience Memory store, the PRGII workflow, Task Verifiers, and a StructuralSignatureExtractor, which together encode full procedural-episodic traces as reusable experiences.

You can think of APEX-EM as a hybrid between an RL experience replay buffer and a card catalog, where each card is a structured plan with rich metadata.

This design lets APEX-EM replay and adapt entire procedures using structural signatures and dual-outcome indexing, instead of relying on a flat context window of unstructured reflections.
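
To make the card-catalog analogy concrete, here is a minimal Python sketch of what one stored experience and a dual-outcome index might look like. The class names, field names, and the exact-match lookup are assumptions made for illustration; this summary does not specify the actual schema of the Procedural Knowledge Graph.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Experience:
    """One stored run: the 'card' in the catalog analogy (hypothetical schema)."""
    task: str
    signature: tuple[str, ...]        # abstract operation sequence
    artifact: str                     # e.g. code or a SPARQL query
    success: bool                     # outcome label used for dual-outcome indexing
    reflections: tuple[str, ...] = ()


class ExperienceMemory:
    """Indexes experiences by (signature, outcome) so both kinds can be replayed."""

    def __init__(self) -> None:
        self._index = defaultdict(list)

    def ingest(self, exp: Experience) -> None:
        self._index[(exp.signature, exp.success)].append(exp)

    def lookup(self, signature: tuple[str, ...]):
        """Return (successes, failures) that share an exact structural signature."""
        successes = list(self._index.get((signature, True), []))
        failures = list(self._index.get((signature, False), []))
        return successes, failures
```

A caller would ingest one `Experience` per finished task and call `lookup` with the signature hypothesized for a new task. Exact matching is the simplest possible choice here; the summary also mentions semantic search and graph traversal, which this sketch leaves out.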

DIAGRAM

PRGII workflow: Plan–Retrieve–Generate–Iterate–Ingest loop

This diagram shows how APEX-EM runs the PRGII workflow to solve a task and commit a structured experience back into memory.

DIAGRAM

Evaluation setup across BigCodeBench, KGQAGen-10k, and HLE

This diagram shows how APEX-EM is evaluated on three benchmarks with frozen backbones and shared baselines like MemRL and RAG.

PROCESS

How APEX-EM Handles a Task via the PRGII Workflow

  1. Plan Phase

    APEX-EM parses the task into Task Understanding, runs Entity Discovery and Schema Discovery, and uses the StructuralSignatureExtractor to hypothesize an abstract operation sequence.

  2. Retrieve Phase

    APEX-EM queries the Experience Memory store using semantic search, structural signature matching, and PKG traversal to collect successful and failed experiences.

  3. Generate Phase

    Conditioned on Goal Reflections, Procedure Reflections, and negative examples from the Error Registry, APEX-EM generates an executable artifact such as code or a SPARQL query.

  4. Iterate and Ingest Phases

    Task Verifiers validate each artifact, drive refinement across iterations, then the Teacher and quality gate decide whether to store the run as a successful or failed Experience in the Procedural Knowledge Graph.
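
The four steps above can be sketched as a single control loop. This is a simplified illustration, not the authors' implementation: every callable (`plan`, `retrieve`, `generate`, `verify`, `ingest`), the `Verdict` type, and the `max_iters` cap are hypothetical names, and the Teacher and quality gate are collapsed into the verifier's pass/fail result.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    feedback: str  # verifier message fed into the next generation attempt


def run_prgii(
    task: str,
    plan: Callable[[str], tuple],
    retrieve: Callable[[str, tuple], tuple],
    generate: Callable[[str, list, list, str], str],
    verify: Callable[[str], Verdict],
    ingest: Callable[[str, tuple, str, bool], None],
    max_iters: int = 3,
) -> tuple:
    """Plan, Retrieve, then Generate/Iterate until verified, then Ingest."""
    signature = plan(task)                              # Plan
    positives, negatives = retrieve(task, signature)    # Retrieve
    artifact, feedback, passed = "", "", False
    for _ in range(max_iters):                          # Iterate
        artifact = generate(task, positives, negatives, feedback)  # Generate
        verdict = verify(artifact)
        passed, feedback = verdict.passed, verdict.feedback
        if passed:
            break
    # Ingest: the run is stored whether it succeeded or failed.
    ingest(task, signature, artifact, passed)
    return artifact, passed
```

Passing the phases in as callables keeps the loop independent of any particular LLM, verifier, or storage backend, which matches the frozen-backbone setting described in this summary.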

KEY CONTRIBUTIONS

Key Contributions

  • Procedural Knowledge Graph

    APEX-EM introduces a Procedural Knowledge Graph that stores Experiences, Entities, Sub-Tasks, Operations, and TaskTopic nodes with structural signatures like [entity_resolution → temporal_filter → aggregation].

  • PRGII workflow with Task Verifiers

    APEX-EM defines the Plan-Retrieve-Generate-Iterate-Ingest workflow, in which Task Verifiers and a Teacher provide multi-dimensional scores c, η, κ and an overall quality q that is checked against a threshold θ.

  • Dual-outcome Experience Memory

    APEX-EM builds a dual-outcome Experience Memory that treats successful experiences as positive in-context examples and failed ones as negative examples, backed by a structured Error Registry and Patch Reflections.
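
As a rough illustration of the dual-outcome idea, the sketch below assembles a generation prompt from both kinds of retrieved experience. The function name, the dictionary keys (`artifact`, `error`, `patch`), and the prompt wording are all assumptions; the summary does not describe the actual prompt format.

```python
def build_prompt(task: str, successes: list, failures: list) -> str:
    """Render successes as positive examples and failures as things to avoid."""
    lines = [f"Task: {task}"]
    if successes:
        lines.append("")
        lines.append("Approaches that worked on similar tasks:")
        for exp in successes:
            lines.append(f"- {exp['artifact']}")
    if failures:
        lines.append("")
        lines.append("Approaches to avoid, with the recorded error and patch:")
        for exp in failures:
            lines.append(
                f"- {exp['artifact']} | error: {exp['error']} | patch: {exp['patch']}"
            )
    return "\n".join(lines)
```

Keeping failures in a separate, labeled section is one plausible way to use them as negative examples rather than discarding them.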

RESULTS

By the Numbers

KGQAGen-10k LASM Accuracy

89.6%

+48.3pp over No Memory (41.3%)

KGQAGen-10k CSR

95.3%

No CSR reported for the No Memory baseline

BigCodeBench Last SR

83.3%

+29.4pp over No Memory Sonnet 4.5 (53.9%)

HLE Last SR (500q)

48.0%

+22.8pp over No Memory Opus 4.5 (25.2%)

On KGQAGen-10k, a structured query benchmark, APEX-EM reaches 89.6% accuracy and 95.3% CSR, beating the 41.3% no-memory baseline and the GPT-4o w/ SP oracle at 84.9%. On BigCodeBench and HLE, APEX-EM delivers +29.4pp and +22.8pp gains in success rate, showing that structured procedural replay scales across code and multi-domain reasoning.
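
The gains quoted above are percentage-point differences, not relative improvements. The short check below recomputes them from the reported scores; the dictionary layout and the function name are illustrative only.

```python
# (score with APEX-EM, comparison score), all in percent, as reported above.
reported = {
    "KGQAGen-10k vs No Memory": (89.6, 41.3),
    "KGQAGen-10k vs GPT-4o w/ SP oracle": (89.6, 84.9),
    "BigCodeBench Last SR vs No Memory": (83.3, 53.9),
    "HLE Last SR vs No Memory": (48.0, 25.2),
}


def gain_pp(score: float, baseline: float) -> float:
    """Percentage-point gain, rounded to one decimal like the reported figures."""
    return round(score - baseline, 1)


gains = {name: gain_pp(*pair) for name, pair in reported.items()}
for name, value in gains.items():
    print(f"{name}: +{value}pp")
```

The margin over the GPT-4o w/ SP oracle (4.7pp) is much smaller than the margin over the no-memory baseline (48.3pp), which is worth keeping in mind when reading the headline number.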

BENCHMARK

KGQAGen-10k: APEX-EM vs LLM and KG-RAG baselines

LASM Accuracy on KGQAGen-10k test split and training sample for APEX-EM and key baselines.

BENCHMARK

BigCodeBench: APEX-EM vs MemRL and no-memory baselines

Last Epoch Success Rate on BigCodeBench train split for APEX-EM and MemRL baselines.

KEY INSIGHT

The Counterintuitive Finding

On BigCodeBench, APEX-EM’s rich judge feedback barely helps over binary success signals, with A1≈A2 despite adding detailed qualitative evaluations.

Yet on KGQAGen-10k, the same rich feedback raises accuracy by 10.3pp over binary-only memory, contradicting the intuition that more detailed supervision should help code more than symbolic queries.

WHY IT MATTERS

What this unlocks for the field

APEX-EM shows that non-parametric online learning with structured procedural-episodic memory can rival or beat oracle retrieval systems using only past executions.

Builders can now deploy frozen LLM backbones that still learn new procedures over time, transferring skills across domains with zero lexical overlap via structural signatures like entity_resolution → temporal_filter → aggregation.
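
A small sketch can show why structural signatures allow transfer when the wording of two tasks shares nothing. The two example tasks, the hand-assigned signatures, and both similarity functions are assumptions for illustration; the paper's StructuralSignatureExtractor and matching rule are not described in this summary, and `difflib.SequenceMatcher` is used here only as a convenient stand-in.

```python
from difflib import SequenceMatcher


def lexical_overlap(x: str, y: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens."""
    xs, ys = set(x.lower().split()), set(y.lower().split())
    union = xs | ys
    return len(xs & ys) / len(union) if union else 0.0


def signature_similarity(a: tuple, b: tuple) -> float:
    """Order-aware similarity between two operation sequences, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


# Two tasks from different domains (hypothetical), tagged by hand.
task_a = "Which films did Nolan direct after 2010, and how many are there?"
task_b = "Count invoices issued to Acme since January"
sig_a = ("entity_resolution", "temporal_filter", "aggregation")
sig_b = ("entity_resolution", "temporal_filter", "aggregation")
sig_c = ("entity_resolution", "aggregation")  # a structurally different task

print(lexical_overlap(task_a, task_b))      # no shared tokens
print(signature_similarity(sig_a, sig_b))   # identical procedure
print(signature_similarity(sig_a, sig_c))   # partial structural match
```

A purely lexical retriever would treat the two tasks as unrelated, while a signature-based one would rank the stored procedure for one as a strong candidate for the other.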
