Persistent Memory for LLM Agents

98 experiments, 35 case studies, 5 benchmarks, 6 benchmark experiments, 20 approaches evaluated. Now at v3.0 with 7-layer retrieval, cross-project sharing, and 954 tests.


Abstract

LLM agents have no memory across conversations. Corrections are lost when the context window closes, and the agent repeats the same mistakes. Existing memory systems treat this as a retrieval problem, but the harder unsolved problems are write correctness and governance: what gets stored, how conflicts are resolved, and whether wrong memories can be corrected. We present agentmemory, a persistent memory system built on FTS5 keyword search, typed knowledge graphs, and entity-index retrieval, with no embeddings. The system detects user corrections at 92% accuracy without LLM calls, recovers 99.5% of the 31% of stored directives unreachable by keyword search, and reduces injected tokens by 55% with zero retrieval loss. Across five benchmarks, agentmemory achieves 66.1% F1 on LoCoMo (+14.5pp over GPT-4-turbo), 90% on MemoryAgentBench single-hop (+45pp), 60% on multi-hop (8.6x the published 7% ceiling), 100% on StructMemEval state tracking, and 59.0% on LongMemEval (-1.6pp vs GPT-4o pipeline, different judge). A controlled A/B test showed 31% token savings and 41% fewer tool calls, though correction rates did not decrease. All code, benchmarks, and experiment data are open source under MIT license.


1. Introduction

LLM agents have no memory across conversations. Every session starts from zero. When a user corrects the agent, that correction is lost the moment the context window closes. The next session, the agent makes the same mistake. The user corrects again. To make matters worse, the agent will often ignore a correction within the same session, while it is still in the context window.

No Implementation comic - CS-002/CS-006

This is not a hypothetical failure mode. Memory failures are the largest single category of LLM behavioral failures documented in this project's failure taxonomy, accounting for 7 of 38 cataloged patterns. The problem compounds across sessions: the MemoryAgentBench benchmark (ICLR 2026) tested multi-hop conflict resolution and found a ceiling of 7% accuracy across all tested methods.

Existing approaches overwhelmingly treat memory as a retrieval problem. Across the architectures cataloged by Zhang et al. (2024), Hu et al. (2025), and Leonard Lin's independent analysis of 35+ papers and 14+ community systems, the dominant pattern is: store text, embed it, retrieve by similarity. StructMemEval found that vector stores built on this pattern "fundamentally fail at state tracking." They can't tell you what's currently true vs. what was superseded.

Lin concluded:

"The biggest differentiator is not vector DB vs SQLite. It is write correctness and governance: provenance, write gates, conflict handling, reversibility."

Current memory systems are write-only: content goes in, but the system never learns whether what it retrieved was actually helpful. On every turn, the memory system retrieves stored content, the LLM reads it and generates a response, and the turn ends. There is no feedback path. The memory system cannot reinforce directives the user found helpful, weaken directives the user has overridden, or distinguish a user correction from an LLM inference at storage time. None of the architectures in the 47-author survey by Hu et al. or Lin's 14+ system analysis include such a feedback path.

On the LoCoMo benchmark (ACL 2024), Letta's simple filesystem baseline, with gpt-4o-mini writing to and searching plain files, achieves 74%. That's the bar.

Now for my Fosbury Flop.

Agentmemory system architecture: ingestion, retrieval, and feedback pipelines

I built this system because I got sick and tired of asking Claude for the latest on my test runs, which were burning CPU time on cloud compute, only for Claude to tell me "huh? what test dispatches? oh those. yeah they've been hanging for 2 hours because I didn't follow the runbook you told me to follow."

Scale Before Validate comic - CS-011

Before building, I surveyed the landscape: 4 survey papers (Zhang et al. 2024, Hu et al. 2025, Yang et al. 2026, "Memory in the LLM Era" 2026), 6 benchmarks, 14+ community systems, and Leonard Lin's independent benchmark reproduction of 35+ papers which verified (and in some cases refuted) published claims.

Four findings shaped the project direction:

  1. Human memory is the wrong target. Zhang et al. (2024) and Hu et al. (2025) show that the dominant design paradigm maps psychological memory models onto LLM architectures. Human memory is notoriously unreliable: Ebbinghaus (1885; replicated by Murre & Dros, 2015) showed ~56% of learned material is forgotten within one hour. Eyewitness misidentification contributed to roughly 69% of the 375+ wrongful convictions overturned by DNA evidence (Innocence Project). Computer memory is literally perfect at storage; the hard problem is retrieval. Where human memory is useful: retrieving gists. Brainerd & Reyna (2005) showed that gist traces are far more durable than verbatim traces. That associative retrieval, connecting things that share no surface-level vocabulary, is what we're trying to replicate.

  2. A simple filesystem achieves 74% on the most-used benchmark. Letta's best result with no special memory architecture, just gpt-4o-mini writing to files, hits 74% on LoCoMo (Maharana et al., ACL 2024). Any memory system must beat "gpt-4o-mini writes it to a file."

  3. Multi-hop conflict resolution tops out at 7%. MemoryAgentBench (ICLR 2026) tested the ability to follow a chain of related decisions across sessions and determine which is currently in effect. The ceiling was 7% accuracy across all tested methods.

  4. Lin's independent analysis pointed to an underexplored axis. The majority of systems focus their novelty on retrieval: better embeddings, better similarity metrics, better ranking. The harder unsolved questions are about what gets stored, how conflicts are resolved, and whether wrong memories can be corrected. Lin's concept of "write correctness and governance" breaks into four components:

    • Provenance: Where did this memory come from? A user prompt, an LLM inference, or derived content? Every stored entry should carry its lineage.
    • Write gates: Not everything should be stored. Expanding a 586-node graph to 16,463 nodes without quality filtering dropped retrieval coverage from 92% to 85% for grep and from 85% to 69% for the prototypes (Exp 48).
    • Conflict handling: When two memories contradict, which wins? StructMemEval found that vector stores "fundamentally fail at state tracking".
    • Reversibility: None of the 14+ systems Lin tested had a working rollback mechanism.

Full benchmark comparison tables are in Appendix B.


3. Approach

Four questions drove the design:

  1. How do you detect user corrections, stated preferences, and behavioral rules without extra LLM inference?
  2. How do you retrieve relevant content when the query shares zero vocabulary with the stored content?
  3. How do you track whether a retrieved memory was actually useful?
  4. How do you distinguish "the LLM should consider this" from "the LLM must obey this"?

Each question maps to a capability with a measured result:

Correction detection: 92% accuracy without any LLM calls, across five codebases (Exp 39-41). When LLM classification is enabled (~$0.005/session), accuracy reaches 99%. The zero-LLM pipeline runs on every conversation turn with zero marginal cost.

Vocabulary gap recovery: 31% of stored content across five codebases is unreachable by keyword search (Exp 47, 3,321 directives examined). The query and stored directive share zero vocabulary. For example: a user says "never mock the database in tests." Later, the agent is about to write a test with unittest.mock.patch('db.connect'). Zero overlapping words, but a human immediately sees the connection. The system recovers 99.5% of these gapped directives through a structural graph traversal that doesn't require embeddings or LLM inference.
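A minimal sketch of how graph traversal can bridge such a gap, using the example above. The node names, the edge list, and the `bridge` helper are hypothetical illustrations, not agentmemory's actual schema. The point is only that a directive can be reached through intermediate nodes without any keyword match against the query.

```python
from collections import deque

# Hypothetical undirected edges linking code entities, concepts, and directives.
EDGES = [
    ("unittest.mock.patch", "concept:mocking"),
    ("db.connect", "concept:database"),
    ("concept:mocking", "directive:never mock the database in tests"),
    ("concept:database", "directive:never mock the database in tests"),
    ("concept:database", "directive:run migrations before deploy"),
]

def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) edge pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def bridge(adj, seeds, max_hops=2):
    """Breadth-first search from the seed nodes.

    Returns every directive node within max_hops, mapped to its hop distance.
    """
    seen = {s: 0 for s in seeds if s in adj}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for nxt in adj[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return {n: d for n, d in seen.items() if n.startswith("directive:")}

adj = build_adjacency(EDGES)
found = bridge(adj, ["unittest.mock.patch"])
```

Here the directive is found two hops from the code entity the agent is about to use. The hop limit keeps the unrelated migration directive out of the result, which matters because every extra directive injected costs precision.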

Vocabulary gap bridge: graph traversal connects mock.patch to never-mock directive
Vocabulary Gap Prevalence (Exp 47)
5 codebases, 3,321 directives examined

Metric | Value | Rate
Total directives examined | 3,321 | --
Directives with vocab gap | 1,030 | 31%
Recovered by graph layer | 1,025 | 99.5%

Gap Category | Rate | Example
Emphatic prohibitions | 29% | "NEVER do X"
Domain jargon | 13% | Tool names
Tool bans | 12% | "don't use Y"
Implicit rules | 8% | Context-dependent

100% of gaps are bridgeable by graph traversal.

Graph traversal hub: research_HRR_FINDINGS node with edges radiating to connected beliefs
A high-connectivity hub node (research_HRR_FINDINGS) in the knowledge graph. Edges radiate outward to beliefs across multiple topics. When keyword search misses a directive due to vocabulary mismatch, graph traversal follows these edges to recover it -- this is how the system bridges the 31% gap that FTS5 cannot reach.
Task 41 comic - CS-020

Entity-index retrieval: To address multi-hop conflict resolution, the system extracts structured triples (entity, property, value, serial_number) from ingested text using 41 regex patterns, then chains through entity relationships at query time up to 4 hops deep. This layer (L2.5, between FTS5 and HRR) took the MemoryAgentBench multi-hop score from 6% to 35% chain-valid, 5x the published 7% ceiling (see Section 5.3).
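The following sketch shows the general shape of triple extraction and chaining. It uses a single illustrative regex rather than the 41 patterns mentioned above, and the function names, the "latest serial wins" rule as written here, and the third sentence about Carol's employer are assumptions made for the example.

```python
import re

# One illustrative pattern (an assumption); it only matches sentences of the
# form "In session N, X's P is V."
PATTERN = re.compile(
    r"In session (?P<serial>\d+), (?P<entity>\w+)'s (?P<property>\w+) is (?P<value>\w+)\."
)

def extract_triples(text):
    """Return (entity, property, value, serial) tuples found in text."""
    return [
        (m["entity"], m["property"], m["value"], int(m["serial"]))
        for m in PATTERN.finditer(text)
    ]

def build_index(triples):
    """Map (entity, property) -> (value, serial), keeping the highest serial."""
    index = {}
    for entity, prop, value, serial in triples:
        key = (entity, prop)
        if key not in index or serial > index[key][1]:
            index[key] = (value, serial)
    return index

def chain(index, entity, properties, max_hops=4):
    """Follow a property chain from entity, e.g. ['spouse', 'employer']."""
    current = entity
    for prop in properties[:max_hops]:
        hit = index.get((current, prop))
        if hit is None:
            return None  # chain broken: no triple for this hop
        current = hit[0]
    return current

log = (
    "In session 42, Alice's spouse is Bob. "
    "In session 78, Alice's spouse is Carol. "
    "In session 80, Carol's employer is Acme."
)
index = build_index(extract_triples(log))
answer = chain(index, "Alice", ["spouse", "employer"])
```

The question "who employs Alice's current spouse?" resolves through Carol rather than Bob because the session-78 triple carries the higher serial number. No single stored sentence contains the answer, which is why a single keyword query struggles with this task.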

Confidence tracking: The system tracks retrieval outcomes and updates confidence accordingly (Exp 66: +22% MRR gain over 10 feedback rounds; Bayesian calibration ECE 0.066, target < 0.10). Memories that help get stronger. Memories that hurt get weaker. Memories irrelevant to the current task get no update, because absence of evidence is not evidence of absence.
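One simple way to realize this update rule is a Beta-Bernoulli model, sketched below. This is an assumption for illustration: the text says the calibration is Bayesian but does not specify the model, and the `Belief` class, its uniform prior, and the outcome labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Belief:
    text: str
    alpha: float = 1.0  # pseudo-count of helpful retrievals (uniform prior)
    beta: float = 1.0   # pseudo-count of harmful retrievals (uniform prior)

    @property
    def confidence(self) -> float:
        """Posterior mean of a Beta(alpha, beta) distribution."""
        return self.alpha / (self.alpha + self.beta)

def record_outcome(belief: Belief, outcome: str) -> None:
    """Update a belief after it was retrieved in a turn."""
    if outcome == "helped":
        belief.alpha += 1
    elif outcome == "hurt":
        belief.beta += 1
    elif outcome == "irrelevant":
        pass  # no evidence either way, so no update
    else:
        raise ValueError(f"unknown outcome: {outcome!r}")

belief = Belief("never mock the database in tests")
record_outcome(belief, "helped")
after_help = belief.confidence
record_outcome(belief, "irrelevant")
after_irrelevant = belief.confidence
record_outcome(belief, "hurt")
after_hurt = belief.confidence
```

The "irrelevant" branch is the important design choice. Treating a non-use as a negative signal would steadily erode directives that are correct but rarely applicable.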

Correction enforcement: Storing a correction and enforcing it are different problems (see CS-006 above). The system distinguishes between content the LLM should consider and constraints the LLM must obey (Exp 84: 10/10 locked directives retrieved and enforced across 5 sessions).

Correction hub node with edges fanning outward to connected beliefs across topics
A correction hub in the knowledge graph. When a user corrects the agent, the correction belief becomes a high-degree node with SUPERSEDES and CONTRADICTS edges linking it to the beliefs it replaces. Long-range edges (upper-left, lower-right) connect to beliefs in distant topic clusters, ensuring the correction propagates across the full graph during retrieval.
Retrieval pipeline layers: L0 locked through L3 graph traversal

The architecture uses keyword search as the primary retrieval layer and wraps it with structural gap recovery, confidence tracking, and constraint injection. The combined system handles both the 69% grep reaches and the 31% it misses. Grep won on the keyword retrieval benchmark (Exp 47, 92% coverage vs. 85% for the prototype), and we accepted that: grep is fast, precise, and costs no LLM calls. But grep cannot bridge the vocabulary gap. The locked-directive Mean Reciprocal Rank improved from 0.589 to 0.867 after retrieval tuning (Exp 63).
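For readers unfamiliar with the metric, Mean Reciprocal Rank averages 1/rank of the first relevant result over all queries, counting a query with no relevant result as 0. The sketch below uses the standard definition; the query names and result lists are made-up data, not results from Exp 63.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """ranked_results: query -> ordered list of retrieved ids.
    relevant: query -> set of ids considered correct for that query.
    """
    total = 0.0
    for query, results in ranked_results.items():
        for rank, item in enumerate(results, start=1):
            if item in relevant[query]:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_results)

# Hypothetical data: relevant directive at rank 1, at rank 2, and missing.
ranked = {
    "q1": ["d1", "d7", "d9"],
    "q2": ["d4", "d2", "d8"],
    "q3": ["d5", "d6", "d0"],
}
relevant = {"q1": {"d1"}, "q2": {"d2"}, "q3": {"d3"}}
mrr = mean_reciprocal_rank(ranked, relevant)  # (1 + 1/2 + 0) / 3
```

Read this way, an MRR of 0.867 means the locked directive is usually the first or second result, whereas 0.589 corresponds to it typically landing around rank two or lower.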

Retrieval Coverage (Exp 47, 586-node graph)

Method | Coverage | Tokens | Precision
grep (decision) | 92% | low | high
grep (sentence) | 92% | high | moderate
Prototype A | 85% | low | moderate
Prototype B | 85% | low | moderate

Null hypothesis: grep < 80%. Result: REJECTED. grep achieved 92%.

Token reduction: Type-aware token reduction achieves 55% savings with zero measured retrieval loss (Exp 42). Constraints survive verbatim; rationale compresses to 0.4x; metadata compresses to 0.3x. Every token injected into the agent's context window is a token that can't be used for reasoning about the current task. At 19K+ stored nodes, naive injection would consume the entire context window before the agent reads the user's message.

Token Reduction Results (Exp 42)

Metric | Value
Before | 35,741 tokens
After | 15,926 tokens
Savings | 55%
Retrieval coverage | 100% (all 6 topics, 18 queries)

Content Type | Reduction Factor
Constraints | 1.0x (never reduce)
Rationale | 0.4x
Context | 0.3x
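The budget arithmetic behind type-aware reduction can be sketched as follows. The reduction factors come from the table above. The entry list and the function names are hypothetical, and the sketch models only the token accounting, not how the text itself is compressed.

```python
# Reduction factors from the Exp 42 table.
REDUCTION_FACTOR = {"constraint": 1.0, "rationale": 0.4, "context": 0.3}

def reduced_budget(entries):
    """entries: list of (content_type, token_count) pairs.
    Returns the total token count after per-type reduction.
    """
    return sum(round(tokens * REDUCTION_FACTOR[ctype]) for ctype, tokens in entries)

def savings(before, after):
    """Fraction of tokens saved."""
    return 1 - after / before

# Hypothetical retrieval result: one constraint, one rationale, one context entry.
entries = [("constraint", 200), ("rationale", 500), ("context", 300)]
after = reduced_budget(entries)  # 200 + 200 + 90

# Reported totals from Exp 42.
reported = savings(35_741, 15_926)
```

The reported totals work out to roughly 55% savings. Because constraints carry a factor of 1.0, the achievable savings for any given retrieval depends on how much of it is rationale and context rather than constraints.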

Scale effects: When the graph expanded from 586 to 16,463 nodes without filtering, retrieval coverage dropped from 92% to 85% for grep and from 85% to 69% for prototypes (Exp 48). Decision-level directives were 3.6% of the expanded graph. The fix: filter at ingestion time, not after expansion. Only decision-level directives pass the write gate.

Scale vs. Coverage (Exp 48)

Graph Size | grep | Proto A | Proto B
586 nodes | 92% | 85% | 85%
16,463 nodes | 85% | 69% | 69%

Decision-level directives: 3.6% of expanded graph. The other 96.4% is noise that dilutes signal.
agentmemory knowledge graph visualization showing node clusters and edge density across a 19K-node production database
Production knowledge graph (19K nodes, Obsidian graph view). The large central cluster is the primary project's belief network. Two smaller satellite clusters (upper-left, lower-left) are isolated project databases. Scattered peripheral nodes are low-edge beliefs awaiting graph integration. Dense interior regions correspond to heavily cross-referenced topics where multiple beliefs reinforce each other through SUPPORTS, RELATES_TO, and SUPERSEDES edges.

4. Evaluation

The project uses a four-layer evaluation architecture. Each layer exists because a specific failure mode proved that the previous layers were insufficient.

Layer 1 (programmatic checkers): Deterministic checks like "does the output contain the required section headings?" These catch structural violations but can't evaluate semantic correctness.

Layer 2 (structural validators): Added after discovering that LLMs satisfy individual constraints while violating the relationships between them (CS-007b): correct sections in the wrong order, or referencing a decision that contradicts an earlier one.

Layer 3 (LLM-as-judge with anti-contamination): The most important design decision. Standard LLM-as-judge approaches (Zheng et al., 2023) give the evaluating LLM the input prompt and the response. The problem: when the evaluating LLM can see the system's reasoning, it tends to find that reasoning plausible and rationalize the violation (CS-005). Our approach isolates the evaluating LLM. It receives only the constraint and the output, never the conversation that produced it.

Layer 4 (adversarial follow-up): Added after CS-024 (sycophantic collapse). An LLM can pass all three previous layers and still fail when a user pushes back.
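The difference between Layer 1 and Layer 2 can be illustrated with two small checks. These are hypothetical examples in the spirit of the descriptions above, not the project's actual checkers. The first tests each constraint in isolation, the second tests a relationship between constraints (here, ordering).

```python
def missing_headings(output: str, required: list[str]) -> list[str]:
    """Layer 1 style check: which required headings are absent?"""
    lines = {line.strip() for line in output.splitlines()}
    return [h for h in required if h not in lines]

def headings_in_order(output: str, required: list[str]) -> bool:
    """Layer 2 style check: do all required headings appear in the given order?"""
    lines = [line.strip() for line in output.splitlines()]
    positions = [lines.index(h) for h in required if h in lines]
    return len(positions) == len(required) and positions == sorted(positions)

report = "Summary\nAll systems nominal.\nNext Steps\nShip v2.\nRisks\nNone known."
required = ["Summary", "Risks", "Next Steps"]

layer1_missing = missing_headings(report, required)
layer2_ordered = headings_in_order(report, required)
```

This report passes the Layer 1 check because every heading is present, yet fails the Layer 2 check because "Next Steps" precedes "Risks". That is the kind of output a purely per-constraint checker lets through.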

Sycophantic Collapse comic - CS-024

One methodological insight: for agent memory retrieval, precision matters more than recall. A false negative (relevant directive missed) is invisible. A false positive (irrelevant directive injected) causes the LLM to act on wrong context, requiring the user to notice, diagnose, and correct. The scale experiment (Exp 48) demonstrated this directly: expanding the graph without filtering retrieved more content, but the wrong content.

Retrieval Error Cost Matrix

Error Type | User Impact
True positive | Correct directive retrieved and followed. No user intervention.
True negative | Irrelevant directive correctly excluded. No user intervention.
False negative (recall failure) | Relevant directive missed. The LLM doesn't know what it doesn't know. User may not notice -- the failure is invisible.
False positive (precision failure) | Irrelevant directive injected. LLM acts on wrong context. User must notice, diagnose, and correct. Active harm.

Big Numbers comic - CS-008

5. Results

Cross-benchmark comparison: agentmemory vs published best across 5 benchmarks
Key Results Summary

Metric | Result | Notes

Benchmarks
LoCoMo F1 (Opus 4.6) | 66.1% | +14.5pp vs GPT-4-turbo (51.6%)
MAB SH 262K | 90% Opus | +45pp vs GPT-4o-mini (45%)
MAB MH 262K | 60% Opus | 8.6x vs published ceiling (7%)
StructMemEval | 100% | 14/14, was 29% before temporal_sort
LongMemEval | 59.0% | -1.6pp vs GPT-4o pipeline (60.6%)

Core Pipeline
Correction detection | 92% | Zero-LLM, 5 codebases
Vocabulary gap recovery | 99.5% | 31% of directives gapped
LLM classification | 99% | ~$0.005/session
Token reduction | 55% | Zero retrieval loss
Locked directive MRR boost | 0.589 -> 0.867 | After retrieval tuning
Bayesian calibration (ECE) | 0.066 | Target < 0.10
Feedback loop MRR gain | +22% | Over 10 rounds (Exp 66)
Multi-session validation | 10/10 | 5 sessions (Exp 84)

Infrastructure
Acceptance tests | 62/65 pass | 29 test files, 1.65s
Test suite | 362 pass | Unit, integration, behavioral
Retrieval latency | 0.7s avg | 19K-node production DB
Onboarding speed (scan) | 1.0s | 10,872 nodes from 249 commits + 112 docs
Onboarding speed (full pipeline) | 6.5s | Scan + ingest + edge storage + vault sync

5.1 Core Pipeline Metrics

Correction detection achieves 92% accuracy without LLM calls across five codebases (Exp 39-41), reaching 99% with LLM classification at ~$0.005/session. The vocabulary gap recovery layer handles the 31% of directives keyword search misses, recovering 99.5% of them through graph traversal (Exp 47). Token reduction saves 55% with zero retrieval loss (Exp 42). The feedback loop improves MRR by +22% over 10 feedback rounds (Exp 66), with Bayesian calibration at ECE 0.066.

5.2 Single-Hop Results

LoCoMo

The LoCoMo benchmark (Maharana et al., ACL 2024) tests whether a system can answer questions about past conversations across five categories. Setup: 10 conversations (5,882 turns, 272 sessions, 1,986 QA pairs) ingested through the standard onboarding pipeline. Retrieval used FTS5+HRR+BFS with a 2,000-token budget. Scoring followed LoCoMo's exact F1 methodology.

Answer Key comic - Benchmark Contamination

The initial run was contaminated (agents had access to ground truth; see Appendix A for the full contamination narrative and protocol). The protocol-correct results:

LoCoMo Per-Category F1 (Protocol-Correct, Opus 4.6)

Category | F1 | n
Single-hop | 69.4% | 841
Temporal | 45.4% | 321
Multi-hop | 42.2% | 282
Open-ended | 30.5% | 96
Adversarial | 97.5% | 446
Overall | 66.1% | 1986

LoCoMo Leaderboard Context

System | F1 | Notes
Human | 87.9% | Ceiling
GPT-4-turbo (128K full context) | 51.6% | Best long-context in paper
RAG (DRAGON + gpt-3.5, top-5 obs) | 43.3% | Best RAG in paper
Claude-3-Sonnet (200K) | 38.5% | Long-context
gpt-3.5-turbo (16K) | 36.1% | Long-context
agentmemory + Opus 4.6 | 66.1% | FTS5+HRR, no embeddings

Single-hop is strongest (69.4%); adversarial is near-perfect (97.5%). Multi-hop and temporal are weaker (42-45%), requiring cross-session reasoning and date arithmetic. Open-ended is weakest (30.5%), requiring synthesis the retrieval pipeline doesn't directly support. Ingest time for all 10 conversations: ~25s. Average query latency: ~16ms.
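As a sanity check, the overall figure can be reproduced from the per-category table, assuming the overall score is the question-weighted mean of the category scores. That assumption is not stated in the text, and the per-category values are already rounded, so agreement to one decimal place is the most that can be expected.

```python
# (F1 in percent, number of questions) from the per-category table.
LOCOMO = {
    "single-hop": (69.4, 841),
    "temporal": (45.4, 321),
    "multi-hop": (42.2, 282),
    "open-ended": (30.5, 96),
    "adversarial": (97.5, 446),
}

def weighted_mean(per_category):
    """Question-weighted mean of per-category scores."""
    total_n = sum(n for _, n in per_category.values())
    return sum(score * n for score, n in per_category.values()) / total_n

total_questions = sum(n for _, n in LOCOMO.values())
overall = weighted_mean(LOCOMO)
```

The category counts sum to the 1,986 QA pairs in the setup, and the weighted mean rounds to the reported 66.1%. The adversarial category (446 questions at 97.5%) contributes a large share of that total, which is worth keeping in mind when comparing against systems evaluated on a different category mix.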

MAB Single-Hop

MemoryAgentBench (Hu et al., ICLR 2026) tests conflict resolution: when facts change over time, can the system track which version is current? Single-hop asks direct questions: "What is X's current Y?"

MAB Single-Hop 262K

Reader | SEM | Paper GPT-4o-mini | Paper GPT-4o
Opus 4.6 | 90% | 45% | 88%
Haiku 4.5 | 62% | 45% | 88%

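The improvement from v1.0 (60%) to v1.1 (90%) came from triple extraction in the ingestion pipeline. SUPERSEDES edges are created automatically. Haiku still beats the paper's GPT-4o-mini result (62% vs 45%), which suggests that retrieval contributes to the gain independently of the reader model, although the 28pp gap between Opus and Haiku shows that the reader also matters on this task.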

LongMemEval

LongMemEval (Wu et al., ICLR 2025) is a 500-question benchmark spanning six categories. The published best is 60.6% using a GPT-4o pipeline with embeddings.

LongMemEval Per-Category Accuracy (Opus Judge)

Category | Accuracy | n
single-session-user | 91.4% | 70
single-session-preference | 80.0% | 30
single-session-assistant | 73.2% | 56
knowledge-update | 70.5% | 78
temporal-reasoning | 59.4% | 133
multi-session | 24.1% | 133
Overall | 59.0% | 500

Strengths: single-session recall (91.4%) and knowledge updates (70.5%). Weakness: multi-session (24.1%). Failure analysis of 101 incorrect multi-session answers: 67% were retrieval misses, 33% were reasoning failures. Of the retrieval misses, 84% were counting/aggregation questions. Budget and top_k sweeps did not help, confirming FTS5's BM25 ranking as the bottleneck for this category. Methodological note: scoring uses Opus as judge rather than GPT-4o; the comparison carries an asterisk.

5.3 Multi-Hop and State Tracking

MAB Multi-Hop

Multi-hop chains entity relationships: "What is the Z of X's current Y?" This requires following a chain of updated properties. The entity-index (described in Section 3) extracts structured triples and chains through them at query time.

Entity-Index Triple Extraction

Input: "In session 42, Alice's spouse is Bob."
Later: "In session 78, Alice's spouse is Carol."

Field | Extracted Triple | Updated Triple
entity | Alice | Alice
property | spouse | spouse
value | Bob | Carol
serial | 42 | 78 (supersedes serial 42)

MAB Multi-Hop 262K

Reader | Raw SEM | Chain-Valid | Paper Ceiling
Opus 4.6 | 47% | 35% | <=7%
Haiku 4.5 | 46% | 35% | <=7%

Both Opus and Haiku score identically at 35% chain-valid. This is the strongest evidence that the entity-index retrieval, not the LLM reader, drives the improvement. When the retrieval provides the right entity chain, even Haiku can follow it.

Multi-hop experiment progression: 6% baseline to 60% across 6 experiments
Multi-Hop Progression (Experiments 1-6)

Exp | Method | MH SEM | Key Finding
-- | v1.0 Baseline (FTS5 chunks) | 6% | Single FTS5 query
1 | Per-hop failure analysis | -- | 58% chaining, 17% world knowledge, 11% retrieval miss
2 | SUPERSEDES edges | 7% | Helps SH, not MH
3 | Triple decomposition | 10% | Granular helps
4 | Entity-index 2-hop | 35% | Core breakthrough
5 | Extended regex (+7 patterns) | 55% | +8pp over Exp 4 (47% raw SEM)
-- | LLM entity extraction | 51% | -4pp vs regex
6 | Temporal coherence (resolve_all + branching) | 60% | 96% GT-reachable; reader bottleneck

Experiment 6 resolved the retrieval question: by branching through all historical values at each hop, 96 of 100 ground truth answers became reachable. The remaining 36pp gap between the 60% score and the 96% reachable ceiling is a reader chain-resolution problem rather than a retrieval problem.
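A minimal sketch of what branching through all historical values means, as opposed to following only the latest value at each hop. The data layout and function names are assumptions for illustration and do not reproduce the project's resolve_all implementation; the employer triples are invented.

```python
from collections import defaultdict

def build_history(triples):
    """Map (entity, property) -> list of (serial, value), oldest first."""
    history = defaultdict(list)
    for entity, prop, value, serial in triples:
        history[(entity, prop)].append((serial, value))
    for versions in history.values():
        versions.sort()
    return history

def resolve_all(history, entity, properties):
    """Branch through every historical value at each hop.

    Returns the set of all end values reachable along the property chain.
    """
    frontier = {entity}
    for prop in properties:
        frontier = {
            value
            for node in frontier
            for _, value in history.get((node, prop), [])
        }
    return frontier

# Hypothetical triples: (entity, property, value, serial).
triples = [
    ("Alice", "spouse", "Bob", 42),
    ("Alice", "spouse", "Carol", 78),
    ("Bob", "employer", "Initech", 50),
    ("Carol", "employer", "Acme", 80),
]
history = build_history(triples)
reachable = resolve_all(history, "Alice", ["spouse", "employer"])
```

Branching makes the correct answer reachable even when the intended intermediate value is not the latest one, but it also hands the reader several candidate chains. Choosing among them is the reader-side step where the remaining gap sits.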

Reader Quality Analysis

Metric | Opus | Haiku | Gap | Interpretation
SH 262K | 90% | 62% | 28pp | Reader matters
MH chain-valid | 35% | 35% | 0pp | Retrieval does all the work
MH raw SEM | 47% | 46% | 1pp | Retrieval does all the work

StructMemEval

StructMemEval (Shutova et al., 2026) tests state tracking: given location updates across sessions, can the system answer "where is X now?"

StructMemEval Results

Version | Accuracy | Fix
v1.0 | 4/14 (29%) | --
v1.1 | 14/14 (100%) | temporal_sort + narrative timestamps

The fix: assign narrative timestamps (30 days apart per session) and enable temporal_sort=True so the reader sees the most recent session content first. This is a general-purpose state-tracking improvement, not a benchmark-specific hack.
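The fix can be sketched in a few lines. The record layout, the start date, and the function names are assumptions for illustration; only the 30-day spacing and the most-recent-first ordering come from the description above.

```python
from datetime import datetime, timedelta

def assign_narrative_timestamps(sessions, start=datetime(2025, 1, 1), gap_days=30):
    """sessions: session texts in chronological order.
    Each session gets a synthetic timestamp gap_days after the previous one.
    """
    return [
        {"session": i, "timestamp": start + timedelta(days=gap_days * i), "text": text}
        for i, text in enumerate(sessions)
    ]

def temporal_sort(records):
    """Order records so the most recent session comes first."""
    return sorted(records, key=lambda r: r["timestamp"], reverse=True)

sessions = ["X is in Paris.", "X moved to Berlin.", "X moved to Lisbon."]
records = assign_narrative_timestamps(sessions)
latest_first = temporal_sort(records)
current_location_evidence = latest_first[0]["text"]
```

With this ordering, the first passage the reader sees for "where is X now?" is the most recent update, so a superseded location has to be read past rather than encountered first.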

5.4 Scale and Onboarding

Onboarding Performance
Metricagentmemoryalpha-seek-memtest
Git commits35619
Git date range2 days16 days
Documents1631,726
Nodes extracted16,69090,793
Edges extracted32,538302,268
Beliefs created31,86360,641
Scan time~2.5s~5.8s
Full pipeline time----
Scale factor1x5.4x (nodes)
Time factor1x2.3x

Scan phase scales sublinearly: 5.4x more nodes in 2.3x the time. AST parsing is the bottleneck (32-38% of scan time). Full pipeline (scan + ingest + edge storage + vault sync) measured at 6.5s for a 10,872-node repo (249 commits, 112 docs) after v1.2.1 performance fixes batching edge inserts and deferring per-belief FTS5 checks during bulk ingestion. Temporal decay validated on the larger codebase: 2-day-old beliefs score 0.92, 18-day-old beliefs score 0.43, 14-month-old beliefs score ~0.


6. Discussion

What Doesn't Work Yet

Limitations

Limitation | What's Next / Ceiling
grep beats the full architecture on keyword retrieval benchmarks | Accepted: grep is the primary layer now; system adds value in the 31% grep misses
Multi-hop reader chain resolution (MAB MH) | 60% Opus, 96% GT-reachable; remaining 36pp gap is reader strategy, not retrieval (Exp 6)
LongMemEval multi-session: 24.1% | 67% of failures are retrieval misses, and 84% of those are counting/aggregation questions. FTS5 recall is the bottleneck. Embedding-based retrieval is the strongest future lever.
LongMemEval overall: 59.0% | -1.6pp vs published baseline. Uses Opus judge, not GPT-4o. Comparison carries an asterisk.
Feedback loop needs more sessions for statistical significance | Currently +22% MRR gain over 10 rounds (Exp 66); need longer longitudinal data
Contradiction detection during retrieval does not work yet (A/B test, 2026-04-15) | A/B test showed file reads beat memory at finding inconsistencies. Graph edges exist but retrieval optimizes for query relevance, not internal consistency
Cross-project noise in shared database (A/B test, 2026-04-15) | Project scoping exists (scope column, project_context on sessions) but the A/B test showed retrieval still pulls cross-project content. Scoping enforcement needs tightening

What Remains Unmeasured

Open Questions

Open Question | Status / How to Close
Does retaining corrections improve downstream decisions? | A/B test (2026-04-15) showed efficiency gains (31% fewer tokens, 41% fewer tool calls) but correction rates did not decrease. Confounded by task type change. Needs controlled matched-task experiment.
Cross-project transfer of behavioral directives | A/B test surfaced the problem: 25% of retrieval was cross-project noise. Scoping exists but enforcement needs work.
Long-term dynamics over months | Longitudinal tracking of confidence distributions and directive churn rate
Performance with users other than the developer | Open-source release under MIT. Structured user study still needed.
MAB MH reader chain strategy | Is the 36pp gap (60% to 96% GT-reachable) a real limitation or a benchmark artifact?
Counterfactual resistance | Readers use world knowledge ~17% of time despite explicit instructions (Exp 1). Shared problem across all LLM-based evaluation methods.

A/B Test: Status Report With vs Without Memory

To test whether the system actually helps, we ran a controlled A/B comparison: two fresh Claude Code sessions given the identical prompt ("generate a comprehensive project status report"), one with agentmemory active, one without.

The first attempt was invalid: the experimental agent was coached, sub-agents don't receive hook injections, and the control wasn't isolated (it found the SQLite database and queried it directly). Despite being invalid, it surfaced a real bug: 25% of memory search results were relevant, 75% were cross-project noise.

The valid test used two live Claude Code sessions in separate terminals with identical prompts. The control had .mcp.json deleted and ran in an isolated worktree. The experimental session had hooks firing automatically with no special instructions.

Gold A/B Test (2026-04-15, live sessions, identical prompt)

Metric | Control | Experiment
Duration | ~8 min | ~6.5 min
Tool calls | 34 | 20
Agentmemory tool calls | 0 | 1 (status only)
SQLite direct queries | 5 | 0
Total tokens | 1,607,614 | 1,109,711

The experimental session was 31% more token-efficient and used 41% fewer tool calls. The experimental agent barely used agentmemory directly (one status() call, zero searches). The benefit was passive: hook-injected context at session start reduced cold-start exploration. Both reports were comparable in quality. The real value was efficiency, not accuracy.

Longitudinal Analysis

A second analysis examined 44 qualifying sessions across 1 week (23 before agentmemory activation, 21 after).

Session Log Analysis (44 sessions, ~1 week)

Metric | Before | After | Delta
Tool uses per user message | 5.4 | 11.0 | +104%
(excluding agentmemory tools) | 5.4 | 10.7 | +99%
User messages per task | 4.8 | 3.2 | -33%
Avg user messages per session | 31.6 | 20.3 | -36%
Corrections per 100 user msgs | 2.2 | 3.3 | +50% (worse)
Restatements per 100 user msgs | 0.55 | 0.94 | +71% (worse)

The LLM does roughly twice as much autonomous work per instruction. Tasks complete in 33% fewer turns. But correction rates went up, not down. Two confounds likely explain this: the "after" sessions were refinement/debugging work (inherently more correction-heavy), and possible model version changes during the period. A rigorous future test would need matched task design, randomized condition assignment, at least 20 tasks per condition, manual annotation, cross-session measurement, and blind evaluation. This experiment has not been run yet.


7. Failure Taxonomy

Failure taxonomy: Memory, Calibration, Behavioral, and Operational failure families

35 documented behavioral failures across Claude and Codex, classified into recurring patterns. Each case study includes verbatim exchange, root cause analysis, pattern classification, what the memory system should do to prevent it, and a concrete acceptance test with pass/fail criteria.

Failure Pattern Families

Family | Pattern

Memory Failures
P4 | Repeated procedural instructions
P1 | Repeated decisions
-- | Context drift within and across sessions

Calibration Failures
P5 | Provenance-free status reporting
P7 | Output volume presented as validation
-- | Result inflation in reporting

Behavioral Failures
P6 | Correction stored but not enforced
P9 | Sycophantic collapse under pressure
P10 | Point-fix without generalization
P11 | Intent completion gated by permission

Operational Failures
P7 | Namespace collision across parallel sessions
P8 | Multi-hop query collapse
-- | Scale-before-validate bias

Extensive Research comic - CS-005

Each case study maps to acceptance tests. The suite runs against the live SQLite store and retrieval pipeline: 62/65 tests pass (29 files, 1.65s), with 3 skipped for capabilities requiring behavioral hooks not yet implemented (CS-012: PostEdit hook, CS-024: sycophantic collapse detection, CS-026: permission-gated intent completion).

Case Study to Acceptance Test Mapping (selected)

CS | Failure | What the Test Validates
002 | Premature implementation push (3 corrections ignored) | Locked correction created on first user correction; persists indefinitely
006 | Correction stored but not enforced (implementation ban violated in new session) | Locked prohibition retrieved AND enforced across session boundaries; output gating blocks violations
009 | Correction lost across session reset ("use B not A") | SUPERSEDES edges preserve latest correction; holds across resets
022 | Multi-hop query collapse (4 agents, wrong machine) | All entities identified via graph traversal; correct state aggregated
025 | Correction not generalized (fix one instance, miss others) | Correction applies to pattern class, not just the specific instance

Component Dependency (by case study count)

Component | Case Studies | Priority
Locked beliefs / L0 behavioral | 11 | Critical
COMMIT_BELIEF (git-derived) | 6 | High
FTS5 retrieval | 6 | High
Triggered beliefs (TB-01-15) | 6 | High
Source priors / provenance | 5 | High
SUPERSEDES edges | 4 | Medium
IMPLEMENTS / CALLS / CO_CHANGED | 5 | Medium
Output gating (enforcement) | 2 | Critical*
HRR typed traversal | 3 | Medium
TESTS / coverage edges | 1 | Low

* Output gating covers only 2 case studies but both are severity-critical: CS-006 and CS-016 are multi-session correction violations, the most painful failure class.

Validating the Validation comic - CS-007b

8. What Was Abandoned (and Why)

The negative findings shaped the architecture as much as the positive ones.

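SimHash clustering was the first attempt at deduplication. SimHash works well for near-duplicate text, but stored directives are short and semantically dense -- "always use mocks" vs. "never use mocks" differ by one word that inverts the meaning. A fingerprint built from surface tokens cannot tell such a contradiction apart from a near-duplicate, so it was not viable for deduplication in this domain.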
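To make the failure mode concrete, here is a textbook-style SimHash over whitespace tokens. The tokenization, the use of MD5 as the per-token hash, and the 64-bit width are assumptions for this sketch and are not taken from the abandoned implementation.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint from whitespace-separated tokens."""
    counts = [0] * bits
    for token in text.lower().split():
        # Take the first bits/8 bytes of the MD5 digest as the token hash.
        digest = hashlib.md5(token.encode()).digest()[: bits // 8]
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Each fingerprint bit is the majority vote of the token hashes.
    return sum(1 << i for i, c in enumerate(counts) if c > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = simhash("always use mocks")
b = simhash("never use mocks")
distance = hamming(a, b)
```

Because two of the three tokens are shared, every bit on which the hashes of "use" and "mocks" agree is identical in both fingerprints, whatever the third word is. The distance therefore reflects token overlap only. Nothing in the fingerprint encodes that "always" and "never" invert the directive.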

Mutual information re-ranking was supposed to improve retrieval by scoring candidates on their statistical relationship to the query. In practice, it demoted relevant results and promoted spurious correlations.

Global holographic superposition was theoretically promising and spectacularly failed. At 775 edges, the representation exceeded its information-theoretic capacity by 7.6x and produced pure noise (Exp 50).

Pre-prompt compilation attempted to pre-compute relevant directives for common query patterns. It performed worse than random selection (23% vs. 33%, Exp 52) because directive value is context-dependent.

Abandoned Approaches

Approach | Why it failed
SimHash clustering | Not viable for deduplication in this domain
Mutual information re-ranking | Hurts more than helps in retrieval
Rate-distortion optimization | Unnecessary complexity for marginal gains
Pre-prompt compilation | Worse than random selection (23% vs 33%)
Global holographic superposition | Capacity exceeded 7.6x at 775 edges; pure noise
Multi-layer graph expansion | Signal diluted to 3.6% of graph at 16K nodes
Autonomous edge discovery | Precision 0.001, recall 0.005
Zero-LLM classification as sufficient | 4% precision on corrections (805 found, 32 correct)

9. Conclusion

agentmemory demonstrates that persistent memory for LLM agents does not require embeddings, vector databases, or expensive inference. A pipeline built on FTS5 keyword search, typed knowledge graphs, and entity-index retrieval achieves competitive or superior results across five benchmarks while running at 0.7s average retrieval latency on a 19K-node production database.

The core contributions are: (1) correction detection at 92% accuracy without LLM calls, (2) vocabulary gap recovery that handles the 31% of directives keyword search cannot reach, (3) entity-index retrieval that exceeds the published 7% multi-hop ceiling by 5x on chain-valid scoring (35%) and 8.6x on raw scoring (60%), and (4) a confidence tracking loop that improves retrieval quality over time.

The system's limitations are equally clear. LongMemEval multi-session accuracy is 24.1%, bottlenecked by FTS5's inability to aggregate scattered mentions. Correction rates did not decrease in the longitudinal analysis, though confounds prevent attribution. Contradiction detection during retrieval does not work yet. Cross-project noise in shared databases needs tighter scoping enforcement.

The failure taxonomy and its 62 passing acceptance tests may be the most practically useful contribution: a catalog of the specific ways LLM agents fail at memory, each with a reproducible test that blocks recurrence. All code, experiment data, and benchmark adapters are available at github.com/robotrocketscience/aelfrice under MIT license. Version 1.2.1, research frozen 2026-04-16.


10. Research Breadth

The project drew on multiple fields, each brought in to address a specific problem:

  • Information theory: The information bottleneck (Tishby et al., 1999) was applied to context compression, producing the 55% token savings. Mutual information was tested for retrieval re-ranking and abandoned. Rate-distortion theory was explored for optimal token budget allocation but proved unnecessary.

  • Bayesian inference: Beta-Bernoulli conjugate pairs for confidence tracking. Thompson sampling for the exploration/exploitation tradeoff in retrieval. Calibration measured at ECE 0.066 (target < 0.10).

  • Cognitive architectures: SOAR's impasse-driven substates informed retrieval failure escalation. CLARION's meta-cognitive subsystem inspired confidence tracking. ACT-R's declarative/procedural distinction mapped to the system's separation of factual content from behavioral constraints. The design borrows structure without inheriting human-like decay and distortion.

  • Bio-inspired optimization: Slime mold network dynamics (Tero-Kobayashi equations) were tested for graph pruning. Evolutionary algorithms were tested for edge set optimization. Both showed promise in simulation but were not adopted.

  • Graph theory: Typed knowledge graphs with weighted edges form the structural backbone. Multi-hop traversal enables the vocabulary gap recovery.
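
The Bayesian inference item above can be illustrated with a minimal sketch of Beta-Bernoulli confidence tracking and Thompson sampling. The `Belief` class, the uniform Beta(1, 1) prior, and the `thompson_rank` helper are assumptions made for illustration; they are not taken from the agentmemory source.

```python
import random
from dataclasses import dataclass


@dataclass
class Belief:
    """A stored belief with a Beta(alpha, beta) posterior over 'was helpful'."""
    text: str
    alpha: float = 1.0  # assumed uniform prior: one pseudo-count of helpful feedback
    beta: float = 1.0   # assumed uniform prior: one pseudo-count of unhelpful feedback

    def update(self, helpful: bool) -> None:
        # Conjugate update: each Bernoulli observation adds one pseudo-count.
        if helpful:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def mean(self) -> float:
        # Posterior mean of the helpfulness probability.
        return self.alpha / (self.alpha + self.beta)


def thompson_rank(beliefs: list[Belief], rng: random.Random) -> list[Belief]:
    """Rank beliefs by one posterior draw each (Thompson sampling)."""
    draws = [(rng.betavariate(b.alpha, b.beta), i) for i, b in enumerate(beliefs)]
    draws.sort(reverse=True)
    return [beliefs[i] for _, i in draws]


proven = Belief("use uv for installs")
for outcome in [True] * 8 + [False]:
    proven.update(outcome)
untested = Belief("prefer tabs")
ranked = thompson_rank([proven, untested], random.Random(0))
```

Sampling from the posterior rather than sorting by the mean gives beliefs with little feedback an occasional chance to surface, which is the exploration half of the tradeoff.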


11. Technical Details

  • Source: github.com/robotrocketscience/aelfrice, MIT license.
  • Language: Python with strict typing enforced by pyright in strict mode.
  • Storage: SQLite with WAL mode. The entire memory store is a single file.
  • Dependencies: Minimal by design. No PyTorch, no TensorFlow, no embedding models. LLM classification (~$0.005/session) brings accuracy to 99% and is the recommended configuration.
  • Deployment: MCP server with 19 tools, integrating with Claude Code, Cursor, Windsurf, and other MCP-compatible tools. Also ships as a CLI with 23 commands.
  • Modules: 18 production modules in src/agentmemory/, plus 23 benchmark adapters and scoring scripts.
  • Scale tested: 600 to 90,000+ nodes across five codebases. Largest production deployment: 0.7s average retrieval latency on 19K-node graph.
  • Benchmarks: 5 benchmarks tested with contamination-proof protocol. Two contamination incidents caught and documented during development.
  • Test suite: 362 passing tests plus 62 acceptance tests (29 files, 1.65s).
  • Experiments: 85+ during core development, plus 6 benchmark-phase experiments with pre-registered hypotheses. Negative findings documented with the same rigor as positive findings.
  • Case studies: 35 documented LLM behavioral failures, each with verbatim transcripts, root cause analysis, and derived acceptance tests.
  • Version: 1.2.1 (research frozen 2026-04-16)

12. Update: v1.2 to v3.0 (April 2026)

Everything above documents the v1 research. The system kept evolving. This section covers what changed.

What the v1 write-up got right

The core architecture held. FTS5 keyword search, typed knowledge graphs, Bayesian confidence tracking, correction detection, and the feedback loop all survived into v3 without fundamental redesign. The benchmarks haven't been re-run (the v1 scores still stand), and the failure taxonomy still maps accurately to observed failures.

What changed

Retrieval went from 4 layers to 7. The v1 pipeline was: locked beliefs (L0) + FTS5 (L1) + entity-index (L2.5) + graph traversal (L3). Three layers were added:

  • Layer 0 structural analysis detects the task type (deployment, debugging, implementation) from prompt structure alone, at 90.5% accuracy. This means the system retrieves deployment runbooks when the user is deploying, not when they mention the word "deploy."
  • Layer 1.5 precomputed HRR neighbors replaced the runtime HRR computation, which was too slow to use in production. An hrr_neighbors table is built during graph construction; hook search does a SQL JOIN in 0.03ms instead of skipping the vocabulary gap layer entirely.
  • Layer 1.7 intention clustering groups beliefs into 40 clusters based on what they're about, not what words they contain. 97.9% of vocabulary-gap pairs land in the same cluster (Exp 94b). When FTS5 finds a belief about "database migrations," the cluster pulls in related beliefs about "schema changes" and "ALTER TABLE" even though they share no keywords.
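
The precomputed-neighbor lookup described for Layer 1.5 can be sketched as a single JOIN. Only the table name hrr_neighbors comes from the write-up; the column names, the `beliefs` schema, the similarity values, and the `expand_with_neighbors` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the table name hrr_neighbors appears in the write-up.
conn.executescript("""
CREATE TABLE beliefs (id INTEGER PRIMARY KEY, content TEXT NOT NULL);
CREATE TABLE hrr_neighbors (
    belief_id   INTEGER NOT NULL REFERENCES beliefs(id),
    neighbor_id INTEGER NOT NULL REFERENCES beliefs(id),
    similarity  REAL NOT NULL,
    PRIMARY KEY (belief_id, neighbor_id)
);
""")
conn.executemany(
    "INSERT INTO beliefs VALUES (?, ?)",
    [
        (1, "Run database migrations before deploy"),
        (2, "Review schema changes with ALTER TABLE in staging"),
        (3, "Prefer tabs over spaces"),
    ],
)
# Neighbor pairs would be computed once, at graph-construction time.
conn.execute("INSERT INTO hrr_neighbors VALUES (1, 2, 0.81)")


def expand_with_neighbors(conn: sqlite3.Connection, hit_ids: list[int], limit: int = 5):
    """Return precomputed neighbors of keyword hits via one JOIN, best first."""
    if not hit_ids:
        return []
    placeholders = ",".join("?" * len(hit_ids))
    query = f"""
        SELECT b.id, b.content
        FROM hrr_neighbors AS n
        JOIN beliefs AS b ON b.id = n.neighbor_id
        WHERE n.belief_id IN ({placeholders})
        ORDER BY n.similarity DESC
        LIMIT ?
    """
    return conn.execute(query, [*hit_ids, limit]).fetchall()


neighbors = expand_with_neighbors(conn, [1])
```

Because the similarity work happens at build time, the query-time cost is an indexed lookup, which is consistent with the sub-millisecond figure reported above.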

Bayesian scoring got a second model. The single Beta-Bernoulli prior was joined by a Bayesian model selection layer (Exp 93) that classifies each belief into one of four states based on its feedback history: SIGNAL (actively useful), NOISE (consistently irrelevant), STALE (was useful, stopped being retrieved), or CONTESTED (conflicting feedback). The model state applies a 0.6-1.3x multiplier. Beliefs without feedback are unaffected.
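
One way to picture the state classification is Beta-Binomial model comparison over a belief's feedback counts. This is a sketch under stated assumptions, not the Exp 93 implementation: the three priors, the per-state multipliers, and the `recently_retrieved` flag used for STALE are all hypothetical. Only the four state names, the 0.6-1.3x range, and the no-feedback rule come from the text.

```python
import math

# Hypothetical Beta priors, one per candidate model of the feedback history.
MODELS = {"SIGNAL": (8.0, 2.0), "NOISE": (2.0, 8.0), "CONTESTED": (20.0, 20.0)}
# Hypothetical multipliers; only the 0.6-1.3x range is stated in the write-up.
MULTIPLIER = {"SIGNAL": 1.3, "CONTESTED": 0.9, "STALE": 0.8, "NOISE": 0.6}


def log_beta(a: float, b: float) -> float:
    # Log of the Beta function, computed via log-gamma for numerical stability.
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_evidence(helpful: int, unhelpful: int, prior: tuple[float, float]) -> float:
    """Log marginal likelihood of the feedback sequence under a Beta prior."""
    a, b = prior
    return log_beta(a + helpful, b + unhelpful) - log_beta(a, b)


def classify(helpful: int, unhelpful: int, recently_retrieved: bool = True):
    """Pick the model with the highest evidence; None means no feedback yet."""
    if helpful + unhelpful == 0:
        return None
    best = max(MODELS, key=lambda m: log_evidence(helpful, unhelpful, MODELS[m]))
    # Assumed rule: a once-useful belief that is no longer retrieved is STALE.
    if best == "SIGNAL" and not recently_retrieved:
        return "STALE"
    return best


def score_multiplier(state) -> float:
    # Beliefs without feedback keep their score unchanged.
    return 1.0 if state is None else MULTIPLIER[state]
```

For example, nine helpful and one unhelpful feedback events favor SIGNAL, an even five-to-five split favors CONTESTED, and a belief with no feedback keeps a multiplier of 1.0.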

Beliefs can now cross project boundaries. v3's largest structural addition: SQLite ATTACH federation for cross-project sharing. Each shared scope gets its own database under ~/.agentmemory/shared/{scope}/. A content-hash deduplicates on write. Hook search Layer 6 queries subscribed scopes with a budget of 3 beliefs per scope.

This directly addresses the cross-project noise problem documented in Section 7 of the v1 write-up (25% cross-project noise in the A/B test). Exp 97 measured: 100% recall, 0% top-5 contamination, 1.06x latency overhead.
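
The shared-scope mechanics can be sketched with Python's built-in sqlite3 module. The table layout, the SHA-256 content hash, the helper names, and the LIKE-based search (standing in for the FTS5 query) are assumptions for illustration; only ATTACH, content-hash dedup on write, and the budget of 3 beliefs per scope come from the text.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def open_with_scope(project_db: str, scope_db: Path) -> sqlite3.Connection:
    """Open the project database and attach one shared scope as schema 'shared'."""
    conn = sqlite3.connect(project_db)
    conn.execute("ATTACH DATABASE ? AS shared", (str(scope_db),))
    # Hypothetical layout: the hash is the primary key, so duplicates collide.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS shared.beliefs ("
        "content_hash TEXT PRIMARY KEY, content TEXT NOT NULL)"
    )
    return conn


def share_belief(conn: sqlite3.Connection, content: str) -> bool:
    """Write a belief to the shared scope; return False if it was a duplicate."""
    digest = hashlib.sha256(content.strip().encode("utf-8")).hexdigest()
    cur = conn.execute(
        "INSERT OR IGNORE INTO shared.beliefs VALUES (?, ?)", (digest, content)
    )
    conn.commit()
    return cur.rowcount == 1


def search_scope(conn: sqlite3.Connection, term: str, budget: int = 3) -> list[str]:
    """Return at most `budget` matching beliefs from the shared scope."""
    rows = conn.execute(
        "SELECT content FROM shared.beliefs WHERE content LIKE ? "
        "ORDER BY rowid LIMIT ?",
        (f"%{term}%", budget),
    ).fetchall()
    return [row[0] for row in rows]


with tempfile.TemporaryDirectory() as tmp:
    conn = open_with_scope(":memory:", Path(tmp) / "team.db")
    first = share_belief(conn, "Always run migrations in staging first")
    again = share_belief(conn, "Always run migrations in staging first")
    hits = search_scope(conn, "migrations")
    conn.close()
```

Keeping each scope in its own file means a project only sees beliefs from scopes it explicitly attaches, and the per-scope budget bounds how much shared content can enter a single retrieval.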

Wonder and reason turned research into conversation. Two new capabilities let the agent investigate open questions using the memory graph as context. These are invoked as slash commands (/mem:wonder, /mem:reason) but they're used in context during natural conversational turns, not as separate formal operations. The user types something like "please /mem:wonder about X" mid-discussion, and the system launches the research pipeline with the full conversational context already loaded. A wonder query spawns parallel research agents; a reason query builds evidence chains. Both save their findings as beliefs that persist across sessions.

A case study documents a real session where wonder + reason produced an actionable marketing strategy for the project itself: 4 parallel research agents, findings synthesized, README rewritten, all grounded in beliefs accumulated over prior weeks.

Updated numbers

v1.2 vs v3.0

| Metric | v1.2 | v3.0 |
|---|---|---|
| Experiments | 85 | 98 |
| Tests | 362 + 62 acceptance | 954 |
| MCP tools | 19 | 31 |
| Production modules | 18 | 33 |
| Retrieval layers | 4 | 7 |
| Version | 1.2.1 | 3.0.2 |

What's still not solved

Benchmarks have not been re-run against v3. The v1 scores reported above still stand, but the new retrieval layers (intention clustering, multimodel scoring, structural analysis) have not been measured on the same benchmark suite. Re-running benchmarks is planned but blocked by outstanding work on the retrieval pipeline.

Known limitations that remain from Section 7:

  • Contradiction detection during retrieval still does not work. The CONTRADICTS edge type exists but isn't used during scoring.
  • Correction rates were not re-measured. The v1 longitudinal analysis found no decrease; whether v3's changes affect this is unknown.
  • LongMemEval multi-session was 24.1% in v1. Whether the new layers improve this is untested.

Cross-project scoping was solved (v3 shared scopes with content-hash dedup). The A/B test's 25% cross-project noise is no longer expected.


Appendix A: Benchmark Methodology

This appendix documents the exact protocol used for each benchmark. Any deviation invalidates the results. The protocol was developed after two contamination incidents during development. The full protocol, contamination verification script, and all benchmark adapters are available in the public repository.

Contamination Protocol

Three contamination modes were identified during development:

  1. Ground truth in retrieval output. The retrieval JSON contained answer fields. The LLM reader saw correct answers while generating predictions. This produced the invalid 87.8% LoCoMo score. The contamination was not immediately obvious: the first Opus score (61.6% F1) was plausible. It was only discovered when slow-finishing agents overwrote the merged predictions file, producing 87.8% F1 near the human ceiling of 87.9%. Exact-match analysis confirmed: 9 of 10 batches showed 43-87% exact-match rates. Four additional isolation failures were identified: a renamed _ground_truth field, pre-computed prediction and f1 fields, category_name labels leaking evaluation strategy, and no separation between question-context and scoring metadata. All results from this run were retracted. Prevention: Adapter code writes two separate files. A mandatory contamination check (verify_clean.py) scans for 30 banned keys before any reader touches the data.

  2. LLM self-judging with answer visible. Prevention: Generation and judging are strictly separate passes.

  3. World knowledge override. The LLM reader used real-world knowledge instead of retrieved context, particularly on counterfactual benchmarks. Mitigation: Reader prompts include explicit instructions to use only the provided context. Documented as a known limitation (~17% of MAB failures).
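
The banned-key scan from the first contamination mode can be sketched as a recursive walk over the retrieval JSON. This is not the actual verify_clean.py: the key set below is an illustrative subset assembled from the fields named above (the real script checks 30 keys), and `find_banned_keys` is a hypothetical helper.

```python
import json

# Illustrative subset only; the full list of 30 banned keys is not reproduced here.
BANNED_KEYS = {
    "answer",
    "ground_truth",
    "_ground_truth",
    "prediction",
    "f1",
    "category_name",
}


def find_banned_keys(node, path: str = "$") -> list[str]:
    """Return the JSON path of every banned key found anywhere in the payload."""
    hits: list[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key.lower() in BANNED_KEYS:
                hits.append(child)
            # Keep descending: leaks can hide inside nested metadata.
            hits.extend(find_banned_keys(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_banned_keys(item, f"{path}[{index}]"))
    return hits


leaky = json.loads(
    '{"cases": [{"question": "q1", "context": ["c1"],'
    ' "meta": {"_ground_truth": "x"}}]}'
)
clean = json.loads('{"cases": [{"question": "q1", "context": ["c1"]}]}')
leaks = find_banned_keys(leaky)
```

A check like this would exit non-zero whenever the returned list is non-empty, so that the reader step cannot start on a contaminated file.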

General Protocol

Every benchmark run follows these steps:

Step 1: Data acquisition. Download from published source. Verify row counts and field names.

Step 2: Retrieval. Run adapter in --retrieve-only mode, producing retrieval and ground truth files. Each test case uses a fresh SQLite database.

```shell
uv run python benchmarks/<adapter>.py \
  --retrieve-only /tmp/benchmark_<name>.json
```

Step 3: Contamination check. Mandatory before any reader touches the data:

```shell
uv run python benchmarks/verify_clean.py /tmp/benchmark_<name>.json
```

Step 4: Answer generation. LLM reader receives only the retrieval file. Never sees ground truth.

Step 5: Scoring. Reads predictions and ground truth. Metrics follow exact published formulas.

Step 6: Reporting. Includes exact commands, contamination check output, adapter commit hash, dataset version, reader model, scoring metric, published baselines, and known limitations.

Per-Benchmark Specifics

LoCoMo ([Maharana et al., ACL 2024](https://snap-research.github.io/locomo/))

  • Dataset: locomo10.json, 10 conversations, 5,882 turns, 1,986 QA pairs across 5 categories.
  • Ingestion: All 10 conversations through standard onboarding pipeline. Session boundaries preserved.
  • Retrieval: FTS5 + HRR + BFS, 2,000-token budget, batch size 1.
  • Reader model: Claude Opus 4.6.
  • Prompts: Exact LoCoMo protocol prompts. Categories 1/3/4: "Based on the above context, write an answer in the form of a short phrase..." Category 2 appends: "Use DATE of CONVERSATION to answer with an approximate date." Category 5: forced-choice "(a) Not mentioned (b) [adversarial_answer]" with randomized option order (seed=42).
  • Scoring: Token-level F1 with Porter stemming and article removal.
  • Score: 66.1% F1.
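
Token-level F1 with article removal can be sketched as follows. This is an illustration of the metric, not the project's scoring script: the normalization details are assumptions, and stemming is left as a pluggable `stem` callable (for example the `stem` method of NLTK's `PorterStemmer`) so the block runs with the standard library alone.

```python
import re
import string
from collections import Counter
from typing import Callable

ARTICLES = re.compile(r"\b(a|an|the)\b")


def normalize(text: str, stem: Callable[[str], str] = lambda tok: tok) -> list[str]:
    """Lowercase, strip punctuation and articles, then stem each token."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = ARTICLES.sub(" ", text)
    return [stem(tok) for tok in text.split()]


def token_f1(
    prediction: str,
    reference: str,
    stem: Callable[[str], str] = lambda tok: tok,
) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred = normalize(prediction, stem)
    ref = normalize(reference, stem)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For "in Paris last May" against the reference "May 2023", one token overlaps, giving precision 1/4, recall 1/2, and F1 of 1/3.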

MemoryAgentBench FactConsolidation ([Hu et al., ICLR 2026](https://arxiv.org/abs/2507.05257))

  • Dataset: HuggingFace ai-hyz/MemoryAgentBench, Conflict_Resolution split.
  • Ingestion: Context chunked at 4,096 tokens using NLTK sent_tokenize and tiktoken gpt-4o encoding.
  • Retrieval (single-hop): FTS5 with triple extraction. SUPERSEDES edges created automatically.
  • Retrieval (multi-hop): Entity-index adapter. 41 regex patterns. 4-hop chaining with breadth cap of 30.
  • Reader models: Claude Opus 4.6 and Claude Haiku 4.5.
  • Scoring: substring_exact_match per the paper. Chain validation for multi-hop.
  • Scores: SH: 90% Opus, 62% Haiku. MH: 60% Opus (raw SEM), 35% chain-valid.
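
A substring-style exact match can be sketched as below. The normalization steps (lowercasing, punctuation and article removal) are an assumption modeled on common QA scoring practice; the paper's own substring_exact_match may normalize differently.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def substring_exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if any normalized gold answer appears inside the normalized prediction."""
    pred = normalize_answer(prediction)
    for gold in gold_answers:
        target = normalize_answer(gold)
        # Skip empty targets: the empty string is a substring of everything.
        if target and target in pred:
            return True
    return False
```

Because the match is by substring, a verbose prediction that mentions the gold answer in passing still scores as correct, which is one reason the chain-valid figure is reported alongside the raw score.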

StructMemEval ([Shutova et al., 2026](https://github.com/yandex-research/StructMemEval))

  • Dataset: GitHub yandex-research/StructMemEval, location/small_bench, 14 cases.
  • Ingestion: Narrative timestamps (30 days apart per session). Standard pipeline.
  • Retrieval: FTS5 with temporal_sort=True.
  • Disclosure: temporal_sort was developed after seeing the initial 29% result.
  • Score: 14/14 (100%).

LongMemEval ([Wu et al., ICLR 2025](https://arxiv.org/abs/2410.10813))

  • Dataset: HuggingFace xiaowu0162/longmemeval-cleaned, 500 questions across 6 categories.
  • Retrieval: FTS5 + HRR + BFS, 2,000-token budget, top_k=50.
  • Judge: Claude Opus 4.6 binary judge (non-standard; paper specifies GPT-4o).
  • Disclosure: Using Opus as judge instead of GPT-4o means the comparison is not apples-to-apples.
  • Score: 59.0% (295/500).

Reproducibility

All benchmark adapters, scoring scripts, and the contamination verification script are in the benchmarks/ directory of the public repository. To reproduce any result:

```shell
# Clone and install
git clone https://github.com/robotrocketscience/aelfrice
cd aelfrice
uv sync

# Run retrieval (example: MAB single-hop)
uv run python benchmarks/mab_adapter.py \
  --split Conflict_Resolution \
  --source factconsolidation_sh_262k \
  --retrieve-only /tmp/mab_sh.json

# Verify clean
uv run python benchmarks/verify_clean.py /tmp/mab_sh.json

# Score (after running reader)
uv run python benchmarks/exp6_score.py /tmp/mab_sh_preds.json /tmp/mab_sh_gt.json
```

Complete per-benchmark commands and adapter documentation are in docs/BENCHMARK_PROTOCOL.md.


Appendix B: Literature Tables

Prior Art: LoCoMo Benchmark Results

| System | LoCoMo | Notes |
|---|---|---|
| EverMemOS | 92.3% | Cloud LLM, closed source |
| Hindsight | 89.6% | Cloud LLM |
| SuperLocalMemory C | 87.7% | LLM for synthesis |
| Zep/Graphiti | ~85% | Temporal knowledge graph |
| Letta/MemGPT | ~83.2% | OS-style memory |
| SuperLocalMemory A | 74.8% | Zero cloud |
| Letta (filesystem) | 74.0% | gpt-4o-mini, no architecture |
| Supermemory | ~70% | Vector graph engine |
| Mem0 (self-reported) | ~66% | Hybrid store |
| Mem0 (independent) | ~58% | See note below |
| agentmemory | 66.1% | FTS5+HRR+BFS, no embeddings |

* Mem0 independent score differs from self-reported. See Chhikara et al. arXiv:2504.19413 (ECAI 2025).
* Scores measured under different conditions and LLM backends. Not directly comparable.
* agentmemory score from protocol-correct run with full input isolation. See Benchmark section for methodology and contamination narrative from earlier invalid runs.

Cross-Benchmark Summary

| Benchmark | agentmemory | Paper Best | Delta |
|---|---|---|---|
| LoCoMo (ACL '24) | 66.1% F1 | 51.6% GPT-4o | +14.5pp |
| MAB SH 262K (ICLR '26) | 90% Opus | 45% GPT-4o-mini | +45pp |
| MAB MH 262K (ICLR '26) | 60% Opus | <=7% (all methods) | 8.6x |
| StructMemEval ('26) | 100% (14/14) | vector stores fail | -- |
| LongMemEval (ICLR '25) | 59.0% | 60.6% GPT-4o | -1.6pp |

* MAB = MemoryAgentBench FactConsolidation
* SH = single-hop, MH = multi-hop
* LongMemEval uses Opus as judge (paper uses GPT-4o); comparison carries an asterisk until same judge is used
* MAB MH "chain-valid" score (reader-independent): 35% for both Opus and Haiku. The 60% includes incidental matches from deeper traversal.

Benchmarks Studied

| Benchmark | Key Finding | Ours |
|---|---|---|
| LoCoMo (ACL '24) | Filesystem + grep = 74% baseline | 66.1% F1 |
| MemoryAgentBench (ICLR '26) | Single-hop: 45% GPT-4o-mini; Multi-hop: 7% ceiling | SH: 90%; MH: 60% |
| LongMemEval (ICLR '25) | 500 questions, scales to 1.5M tokens | 59.0% (Opus judge) |
| StructMemEval | Vector stores fail at state tracking | 100% (14/14) |
| LifeBench | SOTA at 55.2% | Not yet tested |
| AMA-Bench | GPT 5.2 achieves 72.26% | Not yet tested |