Persistent Memory for LLM Agents
View on GitHub → Rebuild in progress
98 experiments, 35 case studies, 5 benchmarks, 6 benchmark experiments, 20 approaches evaluated. Now at v3.0 with 7-layer retrieval, cross-project sharing, and 954 tests.
Jump to v3.0 update ↓ -- What changed since the original write-up.
Abstract
LLM agents have no memory across conversations. Corrections are lost when the context window closes, and the agent repeats the same mistakes. Existing memory systems treat this as a retrieval problem, but the harder unsolved problems are write correctness and governance: what gets stored, how conflicts are resolved, and whether wrong memories can be corrected. We present agentmemory, a persistent memory system built on FTS5 keyword search, typed knowledge graphs, and entity-index retrieval, with no embeddings. The system detects user corrections at 92% accuracy without LLM calls, recovers 99.5% of the 31% of stored directives unreachable by keyword search, and reduces injected tokens by 55% with zero retrieval loss. Across five benchmarks, agentmemory achieves 66.1% F1 on LoCoMo (+14.5pp over GPT-4o), 90% on MemoryAgentBench single-hop (+45pp), 60% on multi-hop (8.6x the published 7% ceiling), 100% on StructMemEval state tracking, and 59.0% on LongMemEval (-1.6pp vs GPT-4o pipeline, different judge). A controlled A/B test showed 31% token savings and 41% fewer tool calls, though correction rates did not decrease. All code, benchmarks, and experiment data are open source under MIT license.
1. Introduction
LLM agents have no memory across conversations. Every session starts from zero. When a user corrects the agent, that correction is lost the moment the context window closes. The next session, the agent makes the same mistake. The user corrects again. To make matters worse, the agent often ignores corrections even within the same session, while the context window is still open.

This is not a hypothetical failure mode. Memory failures are the largest single category of LLM behavioral failures documented in this project's failure taxonomy, accounting for 7 of 38 cataloged patterns. The problem compounds across sessions: the MemoryAgentBench benchmark (ICLR 2026) tested multi-hop conflict resolution and found a ceiling of 7% accuracy across all tested methods.
Existing approaches overwhelmingly treat memory as a retrieval problem. Across the architectures cataloged by Zhang et al. (2024), Hu et al. (2025), and Leonard Lin's independent analysis of 35+ papers and 14+ community systems, the dominant pattern is: store text, embed it, retrieve by similarity. StructMemEval found that vector stores built on this pattern "fundamentally fail at state tracking." They can't tell you what's currently true vs. what was superseded.
Lin concluded:
"The biggest differentiator is not vector DB vs SQLite. It is write correctness and governance: provenance, write gates, conflict handling, reversibility."
Current memory systems are write-only: content goes in, but the system never learns whether what it retrieved was actually helpful. On every turn, the memory system retrieves stored content, the LLM reads it and generates a response, and the turn ends. There is no feedback path. The memory system cannot reinforce directives the user found helpful, weaken directives the user has overridden, or distinguish a user correction from an LLM inference at storage time. None of the architectures in the 47-author survey by Hu et al. or Lin's 14+ system analysis include such a feedback path.
On the LoCoMo benchmark (ACL 2024), a simple filesystem searched with grep achieves 74%. That's the bar.
Now for my Fosbury Flop.
I built this system because I got sick and tired of asking Claude for the latest on my test runs, which were burning CPU time on cloud compute, only for Claude to tell me "huh? what test dispatches? oh those. yeah they've been hanging for 2 hours because I didn't follow the runbook you told me to follow."

2. Related Work
Before building, I surveyed the landscape: 4 survey papers (Zhang et al. 2024, Hu et al. 2025, Yang et al. 2026, "Memory in the LLM Era" 2026), 6 benchmarks, 14+ community systems, and Leonard Lin's independent benchmark reproduction of 35+ papers, which verified (and in some cases refuted) published claims.
Four findings shaped the project direction:
Human memory is the wrong target. Zhang et al. (2024) and Hu et al. (2025) show that the dominant design paradigm maps psychological memory models onto LLM architectures. Human memory is notoriously unreliable: Ebbinghaus (1885; replicated by Murre & Dros, 2015) showed ~56% of learned material is forgotten within one hour. Eyewitness misidentification contributed to roughly 69% of the 375+ wrongful convictions overturned by DNA evidence (Innocence Project). Computer memory is literally perfect at storage; the hard problem is retrieval. Where human memory is useful: retrieving gists. Brainerd & Reyna (2005) showed that gist traces are far more durable than verbatim traces. That associative retrieval, connecting things that share no surface-level vocabulary, is what we're trying to replicate.
A simple filesystem achieves 74% on the most-used benchmark. Letta's best result with no special memory architecture, just gpt-4o-mini writing to files, hits 74% on LoCoMo (Maharana et al., ACL 2024). Any memory system must beat "gpt-4o-mini writes it to a file."
Multi-hop conflict resolution tops out at 7%. MemoryAgentBench (ICLR 2026) tested the ability to follow a chain of related decisions across sessions and determine which is currently in effect. The ceiling was 7% accuracy across all tested methods.
Lin's independent analysis pointed to an underexplored axis. The majority of systems focus their novelty on retrieval: better embeddings, better similarity metrics, better ranking. The harder unsolved questions are about what gets stored, how conflicts are resolved, and whether wrong memories can be corrected. Lin's concept of "write correctness and governance" breaks into four components:
- Provenance: Where did this memory come from? A user prompt, an LLM inference, or derived content? Every stored entry should carry its lineage.
- Write gates: Not everything should be stored. Expanding a 586-node graph to 16,463 nodes without quality filtering dropped retrieval coverage from 92% to 85% for grep and from 85% to 69% for the prototypes (Exp 48).
- Conflict handling: When two memories contradict, which wins? StructMemEval found that vector stores "fundamentally fail at state tracking".
- Reversibility: None of the 14+ systems Lin tested had a working rollback mechanism.
Full benchmark comparison tables are in Appendix B.
3. Approach
Four questions drove the design:
- How do you detect user corrections, stated preferences, and behavioral rules without extra LLM inference?
- How do you retrieve relevant content when the query shares zero vocabulary with the stored content?
- How do you track whether a retrieved memory was actually useful?
- How do you distinguish "the LLM should consider this" from "the LLM must obey this"?
Each question maps to a capability with a measured result:
Correction detection: 92% accuracy without any LLM calls, across five codebases (Exp 39-41). When LLM classification is enabled (~$0.005/session), accuracy reaches 99%. The zero-LLM pipeline runs on every conversation turn with zero marginal cost.
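The detection itself is pattern-based and cheap. A minimal sketch of the idea in Python, with a handful of made-up cue patterns standing in for the production pipeline's rules:

```python
import re

# Hypothetical cue patterns -- illustrative only, not the production set.
CORRECTION_CUES = [
    r"\bno[,.]? (?:i|we) (?:said|meant|asked)\b",
    r"\b(?:don't|do not|never|stop) (?:do|doing|use|using)\b",
    r"\bthat'?s (?:not|wrong|incorrect)\b",
    r"\bactually[, ]",
    r"\buse (\S+) (?:instead of|not) (\S+)\b",
]

def looks_like_correction(user_turn: str) -> bool:
    """Cheap, zero-LLM check run on every conversation turn."""
    text = user_turn.lower()
    return any(re.search(pattern, text) for pattern in CORRECTION_CUES)

if __name__ == "__main__":
    print(looks_like_correction("No, I said use pytest fixtures, not mocks."))  # True
    print(looks_like_correction("Looks good, please continue."))                # False
```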
Vocabulary gap recovery: 31% of stored content across five codebases is unreachable by keyword search (Exp 47, 3,321 directives examined). The query and stored directive share zero vocabulary. For example: a user says "never mock the database in tests." Later, the agent is about to write a test with unittest.mock.patch('db.connect'). Zero overlapping words, but a human immediately sees the connection. The system recovers 99.5% of these gapped directives through a structural graph traversal that doesn't require embeddings or LLM inference.
| Metric | Value | Rate |
|---|---|---|
| Total directives examined | 3,321 | |
| Directives with vocab gap | 1,030 | 31% |
| Recovered by graph layer | 1,025 | 99.5% |

| Gap Category | Rate | Example |
|---|---|---|
| Emphatic prohibitions | 29% | "NEVER do X" |
| Domain jargon | 13% | Tool names |
| Tool bans | 12% | "don't use Y" |
| Implicit rules | 8% | Context-dependent |

100% of gaps are bridgeable by graph traversal.
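A minimal sketch of how a structural traversal can bridge a vocabulary gap, using the mock-the-database example above. The edge list, relation names, and `directive:` prefix are illustrative assumptions, not the production schema:

```python
from collections import deque

# Hypothetical typed edges: (source, relation, target).
# The directive "never mock the database in tests" shares no words with
# "unittest.mock.patch('db.connect')", but both connect to the same entities.
EDGES = [
    ("unittest.mock.patch", "IS_A", "mocking"),
    ("db.connect", "PART_OF", "database"),
    ("directive:no-db-mocks", "GOVERNS", "mocking"),
    ("directive:no-db-mocks", "GOVERNS", "database"),
]

def neighbors(node: str) -> set[str]:
    out = set()
    for src, _, dst in EDGES:
        if src == node:
            out.add(dst)
        if dst == node:
            out.add(src)
    return out

def reachable_directives(context_entities: list[str], max_hops: int = 4) -> set[str]:
    """BFS from entities seen in the current action to stored directives."""
    seen = set(context_entities)
    frontier = deque((e, 0) for e in context_entities)
    hits = set()
    while frontier:
        node, depth = frontier.popleft()
        if node.startswith("directive:"):
            hits.add(node)
        if depth == max_hops:
            continue
        for nxt in neighbors(node) - seen:
            seen.add(nxt)
            frontier.append((nxt, depth + 1))
    return hits

# The agent is about to call unittest.mock.patch('db.connect'):
print(reachable_directives(["unittest.mock.patch", "db.connect"]))
# {'directive:no-db-mocks'} -- bridged with zero vocabulary overlap
```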

Entity-index retrieval: To address multi-hop conflict resolution, the system extracts structured triples (entity, property, value, serial_number) from ingested text using 41 regex patterns, then chains through entity relationships at query time up to 4 hops deep. This layer (L2.5, between FTS5 and HRR) took the MemoryAgentBench multi-hop score from 6% to 35% chain-valid, 5x the published 7% ceiling (see Section 5.3).
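A simplified sketch of the entity-index idea: extract (entity, property, value, serial) triples with regex, let the highest serial number win per (entity, property), and chain hops at query time. The single pattern and helper names here are illustrative; the production layer uses 41 patterns and a breadth cap of 30:

```python
import re

# Illustrative extraction pattern only; the production layer uses 41 of these.
TRIPLE_PATTERNS = [
    # "In session 42, Alice's spouse is Bob."
    re.compile(r"In session (\d+), (\w+)'s (\w+) is (\w+)\."),
]

def extract_triples(text: str) -> list[tuple[str, str, str, int]]:
    triples = []
    for pat in TRIPLE_PATTERNS:
        for serial, entity, prop, value in pat.findall(text):
            triples.append((entity, prop, value, int(serial)))
    return triples

def current_value(triples, entity: str, prop: str) -> str | None:
    """Highest serial number wins: later statements supersede earlier ones."""
    matches = [t for t in triples if t[0] == entity and t[1] == prop]
    return max(matches, key=lambda t: t[3])[2] if matches else None

def resolve_chain(triples, entity: str, hops: list[str]) -> str | None:
    """Follow e.g. Alice -> current spouse -> that person's employer."""
    node = entity
    for prop in hops:
        node = current_value(triples, node, prop)
        if node is None:
            return None
    return node

text = (
    "In session 42, Alice's spouse is Bob. "
    "In session 78, Alice's spouse is Carol. "
    "In session 80, Carol's employer is Initech."
)
triples = extract_triples(text)
print(resolve_chain(triples, "Alice", ["spouse", "employer"]))  # Initech
```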
Confidence tracking: The system tracks retrieval outcomes and updates confidence accordingly (Exp 66: +22% MRR gain over 10 feedback rounds; Bayesian calibration ECE 0.066, target < 0.10). Memories that help get stronger. Memories that hurt get weaker. Memories irrelevant to the current task get no update, because absence of evidence is not evidence of absence.
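A minimal sketch of the confidence update, assuming a Beta-Bernoulli posterior per belief (the production system's priors and feedback signals are richer than this):

```python
from dataclasses import dataclass

@dataclass
class BeliefConfidence:
    """Beta-Bernoulli posterior over 'this belief helps when retrieved'."""
    alpha: float = 1.0  # pseudo-count of helpful outcomes (uniform prior)
    beta: float = 1.0   # pseudo-count of harmful outcomes

    def update(self, outcome: bool | None) -> None:
        # outcome=None means the belief was irrelevant to the turn:
        # absence of evidence is not evidence of absence, so no update.
        if outcome is None:
            return
        if outcome:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    @property
    def confidence(self) -> float:
        return self.alpha / (self.alpha + self.beta)

conf = BeliefConfidence()
for outcome in (True, True, None, False, True):
    conf.update(outcome)
print(round(conf.confidence, 2))  # 0.67 after 3 helpful, 1 harmful, 1 no-op
```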
Correction enforcement: Storing a correction and enforcing it are different problems (see CS-006 in Section 7). The system distinguishes between content the LLM should consider and constraints the LLM must obey (Exp 84: 10/10 locked directives retrieved and enforced across 5 sessions).
The architecture uses keyword search as the primary retrieval layer and wraps it with structural gap recovery, confidence tracking, and constraint injection. The combined system handles both the 69% grep reaches and the 31% it misses. Grep won on the keyword retrieval benchmark (Exp 47, 92% coverage vs. 85% for the prototype), and we accepted that: grep is fast, precise, and costs no LLM calls. But grep cannot bridge the vocabulary gap. The locked-directive Mean Reciprocal Rank improved from 0.589 to 0.867 after retrieval tuning (Exp 63).
| Method | Coverage | Tokens | Precision |
|---|---|---|---|
| grep (decision) | 92% | low | high |
| grep (sentence) | 92% | high | moderate |
| Prototype A | 85% | low | moderate |
| Prototype B | 85% | low | moderate |

Null hypothesis: grep < 80%. Result: REJECTED. grep achieved 92%.
Token reduction: Type-aware token reduction achieves 55% savings with zero measured retrieval loss (Exp 42). Constraints survive verbatim; rationale compresses to 0.4x; metadata compresses to 0.3x. Every token injected into the agent's context window is a token that can't be used for reasoning about the current task. At 19K+ stored nodes, naive injection would consume the entire context window before the agent reads the user's message.
| Metric | Value |
|---|---|
| Before | 35,741 tokens |
| After | 15,926 tokens |
| Savings | 55% |
| Retrieval coverage | 100% (all 6 topics, 18 queries) |

| Content Type | Reduction Factor |
|---|---|
| Constraints | 1.0x (never reduce) |
| Rationale | 0.4x |
| Context | 0.3x |
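A toy sketch of type-aware reduction using the factors from the table above. The word-truncation "compressor" is a stand-in for the real summarization step:

```python
# Reduction factors from the table above; the compressor itself is a stand-in.
REDUCTION = {"constraint": 1.0, "rationale": 0.4, "context": 0.3}

def truncate_to_budget(text: str, factor: float) -> str:
    """Stand-in compressor: keep roughly `factor` of the words."""
    if factor >= 1.0:
        return text                      # constraints are never reduced
    words = text.split()
    keep = max(1, int(len(words) * factor))
    return " ".join(words[:keep]) + " ..."

def build_injection(beliefs: list[dict[str, str]]) -> str:
    """Assemble the context block injected at session start."""
    parts = []
    for belief in beliefs:
        factor = REDUCTION.get(belief["type"], 0.3)
        parts.append(truncate_to_budget(belief["text"], factor))
    return "\n".join(parts)

beliefs = [
    {"type": "constraint", "text": "NEVER mock the database in tests."},
    {"type": "rationale", "text": "Mocked connections hid a pooling bug that "
                                  "cost two days of debugging in March."},
]
print(build_injection(beliefs))
```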
Scale effects: When the graph expanded from 586 to 16,463 nodes without filtering, retrieval coverage dropped from 92% to 85% for grep and from 85% to 69% for prototypes (Exp 48). Decision-level directives were 3.6% of the expanded graph. The fix: filter at ingestion time, not after expansion. Only decision-level directives pass the write gate.
| Graph Size | grep | Proto A | Proto B |
|---|---|---|---|
| 586 nodes | 92% | 85% | 85% |
| 16,463 nodes | 85% | 69% | 69% |

Decision-level directives: 3.6% of expanded graph. The other 96.4% is noise that dilutes signal.
4. Evaluation
The project uses a four-layer evaluation architecture. Each layer exists because a specific failure mode proved that the previous layers were insufficient.
Layer 1 (programmatic checkers): Deterministic checks like "does the output contain the required section headings?" These catch structural violations but can't evaluate semantic correctness.
Layer 2 (structural validators): Added after discovering that LLMs satisfy individual constraints while violating the relationships between them (CS-007b): correct sections in the wrong order, or referencing a decision that contradicts an earlier one.
Layer 3 (LLM-as-judge with anti-contamination): The most important design decision. Standard LLM-as-judge approaches (Zheng et al., 2023) give the evaluating LLM the input prompt and the response. The problem: when the evaluating LLM can see the system's reasoning, it tends to find that reasoning plausible and rationalize the violation (CS-005). Our approach isolates the evaluating LLM. It receives only the constraint and the output, never the conversation that produced it.
Layer 4 (adversarial follow-up): Added after CS-024 (sycophantic collapse). An LLM can pass all three previous layers and still fail when a user pushes back.
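A minimal sketch of how the Layer 3 isolation can be enforced at the prompt boundary. The function names and turn fields are illustrative; the point is what the judge never receives:

```python
from typing import Callable

def build_judge_prompt(constraint: str, output: str) -> str:
    # The judge sees only the constraint and the output -- never the
    # conversation or chain of reasoning that produced the output.
    return (
        "You are checking whether a response satisfies a constraint.\n\n"
        f"CONSTRAINT:\n{constraint}\n\n"
        f"RESPONSE:\n{output}\n\n"
        "Answer PASS or FAIL, then give one sentence of justification."
    )

def layer3_verdict(constraint: str, turn: dict, call_llm: Callable[[str], str]) -> str:
    # turn["transcript"] and turn["reasoning"] exist but are deliberately
    # not passed: an evaluator that can see the system's rationale tends to
    # find that rationale plausible and rationalize the violation (CS-005).
    prompt = build_judge_prompt(constraint, turn["output"])
    return call_llm(prompt)
```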

One methodological insight: for agent memory retrieval, precision matters more than recall. A false negative (relevant directive missed) is invisible. A false positive (irrelevant directive injected) causes the LLM to act on wrong context, requiring the user to notice, diagnose, and correct. The scale experiment (Exp 48) demonstrated this directly: expanding the graph without filtering retrieved more content, but the wrong content.
| Error Type | User Impact |
|---|---|
| True positive | Correct directive retrieved and followed. No user intervention. |
| True negative | Irrelevant directive correctly excluded. No user intervention. |
| False negative (recall failure) | Relevant directive missed. The LLM doesn't know what it doesn't know. User may not notice -- the failure is invisible. |
| False positive (precision failure) | Irrelevant directive injected. LLM acts on wrong context. User must notice, diagnose, and correct. Active harm. |

5. Results
| Metric | Result | Notes |
|---|---|---|
| Benchmarks | ||
| LoCoMo F1 (Opus 4.6) | 66.1% | +14.5pp vs GPT-4o (51.6%) |
| MAB SH 262K | 90% Opus | +45pp vs GPT-4o-mini (45%) |
| MAB MH 262K | 60% Opus | 8.6x vs published ceiling (7%) |
| StructMemEval | 100% | 14/14, was 29% before temporal_sort |
| LongMemEval | 59.0% | -1.6pp vs GPT-4o pipeline (60.6%) |
| Core Pipeline | ||
| Correction detection | 92% | Zero-LLM, 5 codebases |
| Vocabulary gap recovery | 99.5% | 31% of directives gapped |
| LLM classification | 99% | ~$0.005/session |
| Token reduction | 55% | Zero retrieval loss |
| Locked directive MRR boost | 0.589->0.867 | After retrieval tuning |
| Bayesian calibration (ECE) | 0.066 | Target < 0.10 |
| Feedback loop MRR gain | +22% | Over 10 rounds (Exp 66) |
| Multi-session validation | 10/10 | 5 sessions (Exp 84) |
| Infrastructure | ||
| Acceptance tests | 62/65 pass | 29 test files, 1.65s |
| Test suite | 362 pass | Unit, integration, behavioral |
| Retrieval latency | 0.7s avg | 19K-node production DB |
| Onboarding speed (scan) | 1.0s | 10,872 nodes from 249 commits + 112 docs |
| Onboarding speed (full pipeline) | 6.5s | Scan + ingest + edge storage + vault sync |
5.1 Core Pipeline Metrics
Correction detection achieves 92% accuracy without LLM calls across five codebases (Exp 39-41), reaching 99% with LLM classification at ~$0.005/session. The vocabulary gap recovery layer handles the 31% of directives keyword search misses, recovering 99.5% of them through graph traversal (Exp 47). Token reduction saves 55% with zero retrieval loss (Exp 42). The feedback loop improves MRR by +22% over 10 feedback rounds (Exp 66), with Bayesian calibration at ECE 0.066.
5.2 Single-Hop Results
LoCoMo
The LoCoMo benchmark (Maharana et al., ACL 2024) tests whether a system can answer questions about past conversations across five categories. Setup: 10 conversations (5,882 turns, 272 sessions, 1,986 QA pairs) ingested through the standard onboarding pipeline. Retrieval used FTS5+HRR+BFS with a 2,000-token budget. Scoring followed LoCoMo's exact F1 methodology.

The initial run was contaminated (agents had access to ground truth; see Appendix A for the full contamination narrative and protocol). The protocol-correct results:
| Category | F1 | n |
|---|---|---|
| Single-hop | 69.4% | 841 |
| Temporal | 45.4% | 321 |
| Multi-hop | 42.2% | 282 |
| Open-ended | 30.5% | 96 |
| Adversarial | 97.5% | 446 |
| Overall | 66.1% | 1986 |

| System | F1 | Notes |
|---|---|---|
| Human | 87.9% | Ceiling |
| GPT-4-turbo (128K full context) | 51.6% | Best long-context in paper |
| RAG (DRAGON + gpt-3.5, top-5 obs) | 43.3% | Best RAG in paper |
| Claude-3-Sonnet (200K) | 38.5% | Long-context |
| gpt-3.5-turbo (16K) | 36.1% | Long-context |
| agentmemory + Opus 4.6 | 66.1% | FTS5+HRR, no embeddings |
Single-hop is strongest (69.4%); adversarial is near-perfect (97.5%). Multi-hop and temporal are weaker (42-45%), requiring cross-session reasoning and date arithmetic. Open-ended is weakest (30.5%), requiring synthesis the retrieval pipeline doesn't directly support. Ingest time for all 10 conversations: ~25s. Average query latency: ~16ms.
MAB Single-Hop
MemoryAgentBench (Hu et al., ICLR 2026) tests conflict resolution: when facts change over time, can the system track which version is current? Single-hop asks direct questions: "What is X's current Y?"
| Reader | SEM | Paper GPT-4o-mini | Paper GPT-4o |
|---|---|---|---|
| Opus 4.6 | 90% | 45% | 88% |
| Haiku 4.5 | 62% | 45% | 88% |
The improvement from v1.0 (60%) to v1.1 (90%) came from triple extraction in the ingestion pipeline. SUPERSEDES edges are created automatically. Haiku still beats GPT-4o-mini (62% vs 45%), confirming the improvement comes from retrieval, not the reader model.
LongMemEval
LongMemEval (Wu et al., ICLR 2025) is a 500-question benchmark spanning six categories. The published best is 60.6% using a GPT-4o pipeline with embeddings.
| Category | Accuracy | n |
|---|---|---|
| single-session-user | 91.4% | 70 |
| single-session-preference | 80.0% | 30 |
| single-session-assistant | 73.2% | 56 |
| knowledge-update | 70.5% | 78 |
| temporal-reasoning | 59.4% | 133 |
| multi-session | 24.1% | 133 |
| Overall | 59.0% | 500 |
Strengths: single-session recall (91.4%) and knowledge updates (70.5%). Weakness: multi-session (24.1%). Failure analysis of 101 incorrect multi-session answers: 67% were retrieval misses, 33% were reasoning failures. Of the retrieval misses, 84% were counting/aggregation questions. Budget and top_k sweeps did not help, confirming FTS5's BM25 ranking as the bottleneck for this category. Methodological note: scoring uses Opus as judge rather than GPT-4o; the comparison carries an asterisk.
5.3 Multi-Hop and State Tracking
MAB Multi-Hop
Multi-hop chains entity relationships: "What is the Z of X's current Y?" This requires following a chain of updated properties. The entity-index (described in Section 3) extracts structured triples and chains through them at query time.
Input: "In session 42, Alice's spouse is Bob." Later: "In session 78, Alice's spouse is Carol."

| Field | Extracted Triple (session 42) | Updated Triple (session 78) |
|---|---|---|
| entity | Alice | Alice |
| property | spouse | spouse |
| value | Bob | Carol |
| serial | 42 | 78 (supersedes serial 42) |

| Reader | Raw SEM | Chain-Valid | Paper Ceiling |
|---|---|---|---|
| Opus 4.6 | 47% | 35% | <=7% |
| Haiku 4.5 | 46% | 35% | <=7% |
Both Opus and Haiku score identically at 35% chain-valid. This is the strongest evidence that the entity-index retrieval, not the LLM reader, drives the improvement. When the retrieval provides the right entity chain, even Haiku can follow it.
| Exp | Method | MH SEM | Key Finding |
|---|---|---|---|
| -- | v1.0 Baseline (FTS5 chunks) | 6% | Single FTS5 query |
| 1 | Per-hop failure analysis | -- | 58% chaining, 17% world knowledge, 11% retrieval miss |
| 2 | SUPERSEDES edges | 7% | Helps SH, not MH |
| 3 | Triple decomposition | 10% | Granular helps |
| 4 | Entity-index 2-hop | 35% | Core breakthrough |
| 5 | Extended regex (+7 patterns) | 55% | +8pp over Exp 4 |
| LLM entity extraction | 51% | -4pp vs regex | |
| 6 | Temporal coherence (resolve_all + branching) | 60% | 96% GT-reachable; reader bottleneck |
Experiment 6 resolved the retrieval question: by branching through all historical values at each hop, 96 of 100 ground truth answers became reachable. The remaining 36pp gap (60% achieved vs. 96% reachable) is entirely a reader chain resolution problem.
| Metric | Opus | Haiku | Gap | Interpretation |
|---|---|---|---|---|
| SH 262K | 90% | 62% | 28pp | Reader matters |
| MH chain-valid | 35% | 35% | 0pp | Retrieval does all the work |
| MH raw SEM | 47% | 46% | 1pp | Retrieval does all the work |
StructMemEval
StructMemEval (Shutova et al., 2026) tests state tracking: given location updates across sessions, can the system answer "where is X now?"
| Version | Accuracy | Fix |
|---|---|---|
| v1.0 | 4/14 (29%) | -- |
| v1.1 | 14/14 (100%) | temporal_sort + narrative timestamps |
The fix: assign narrative timestamps (30 days apart per session) and enable temporal_sort=True so the reader sees the most recent session content first. This is a general-purpose state-tracking improvement, not a benchmark-specific hack.
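A minimal sketch of the fix, assuming session texts are available in order (the timestamp start date and helper names are illustrative):

```python
from datetime import datetime, timedelta

def assign_narrative_timestamps(sessions: list[str],
                                start: datetime | None = None,
                                spacing_days: int = 30) -> list[tuple[datetime, str]]:
    """Give each session a synthetic timestamp 30 days after the previous one,
    so session order becomes an explicit, sortable signal."""
    start = start or datetime(2025, 1, 1)
    return [(start + timedelta(days=i * spacing_days), text)
            for i, text in enumerate(sessions)]

def temporal_sort(retrieved: list[tuple[datetime, str]]) -> list[str]:
    """Most recent session content first, so the reader sees current state
    before superseded state."""
    return [text for _, text in sorted(retrieved, key=lambda r: r[0], reverse=True)]

sessions = ["Parcel is at the warehouse.", "Parcel is on the truck.",
            "Parcel was delivered to the office."]
print(temporal_sort(assign_narrative_timestamps(sessions))[0])
# "Parcel was delivered to the office."
```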
5.4 Scale and Onboarding
| Metric | agentmemory | alpha-seek-memtest |
|---|---|---|
| Git commits | 35 | 619 |
| Git date range | 2 days | 16 days |
| Documents | 163 | 1,726 |
| Nodes extracted | 16,690 | 90,793 |
| Edges extracted | 32,538 | 302,268 |
| Beliefs created | 31,863 | 60,641 |
| Scan time | ~2.5s | ~5.8s |
| Full pipeline time | -- | -- |
| Scale factor | 1x | 5.4x (nodes) |
| Time factor | 1x | 2.3x |
Scan phase scales sublinearly: 5.4x more nodes in 2.3x the time. AST parsing is the bottleneck (32-38% of scan time). Full pipeline (scan + ingest + edge storage + vault sync) measured at 6.5s for a 10,872-node repo (249 commits, 112 docs) after v1.2.1 performance fixes batching edge inserts and deferring per-belief FTS5 checks during bulk ingestion. Temporal decay validated on the larger codebase: 2-day-old beliefs score 0.92, 18-day-old beliefs score 0.43, 14-month-old beliefs score ~0.
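The three reported decay points are roughly consistent with exponential decay in belief age. A sketch of that shape, where the half-life is fitted to those points and is an assumption rather than the production formula:

```python
import math

def decay_score(age_days: float, half_life_days: float = 15.0) -> float:
    """Exponential temporal decay; half_life_days is a fitted assumption,
    not the production constant."""
    return math.exp(-math.log(2) * age_days / half_life_days)

for age in (2, 18, 425):   # 2 days, 18 days, ~14 months
    print(age, round(decay_score(age), 2))
# 2 -> 0.91, 18 -> 0.44, 425 -> 0.0  (close to the reported 0.92 / 0.43 / ~0)
```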
6. Discussion
What Doesn't Work Yet
| Limitation | What's Next / Ceiling |
|---|---|
| grep beats the full architecture on keyword retrieval benchmarks | Accepted: grep is the primary layer now; system adds value in the 31% grep misses |
| Multi-hop reader chain resolution (MAB MH) | 60% Opus, 96% GT-reachable; remaining 36pp gap is reader strategy, not retrieval (Exp 6) |
| LongMemEval multi-session: 24.1% | 84% of failures are counting/aggregation questions. FTS5 recall is the bottleneck. Embedding-based retrieval is the strongest future lever. |
| LongMemEval overall: 59.0% | -1.6pp vs published baseline. Uses Opus judge, not GPT-4o. Comparison carries an asterisk. |
| Feedback loop needs more sessions for statistical significance | Currently +22% MRR gain over 10 rounds (Exp 66); need longer longitudinal data |
| Contradiction detection during retrieval does not work yet (A/B test, 2026-04-15) | A/B test showed file reads beat memory at finding inconsistencies. Graph edges exist but retrieval optimizes for query relevance, not internal consistency |
| Cross-project noise in shared database (A/B test, 2026-04-15) | Project scoping exists (scope column, project_context on sessions) but the A/B test showed retrieval still pulls cross-project content. Scoping enforcement needs tightening |
What Remains Unmeasured
| Open Question | Status / How to Close |
|---|---|
| Does retaining corrections improve downstream decisions? | A/B test (2026-04-15) showed efficiency gains (31% fewer tokens, 41% fewer tool calls) but correction rates did not decrease. Confounded by task type change. Needs controlled matched-task experiment. |
| Cross-project transfer of behavioral directives | A/B test surfaced the problem: 25% of retrieval was cross-project noise. Scoping exists but enforcement needs work. |
| Long-term dynamics over months | Longitudinal tracking of confidence distributions and directive churn rate |
| Performance with users other than the developer | Open-source release under MIT. Structured user study still needed. |
| MAB MH reader chain strategy | Is the 36pp gap (60% to 96% GT-reachable) a real limitation or a benchmark artifact? |
| Counterfactual resistance | Readers use world knowledge ~17% of time despite explicit instructions (Exp 1). Shared problem across all LLM-based evaluation methods. |
A/B Test: Status Report With vs Without Memory
To test whether the system actually helps, we ran a controlled A/B comparison: two fresh Claude Code sessions given the identical prompt ("generate a comprehensive project status report"), one with agentmemory active, one without.
The first attempt was invalid: the experimental agent was coached, sub-agents don't receive hook injections, and the control wasn't isolated (it found the SQLite database and queried it directly). Despite being invalid, it surfaced a real bug: 25% of memory search results were relevant, 75% were cross-project noise.
The valid test used two live Claude Code sessions in separate terminals with identical prompts. The control had .mcp.json deleted and ran in an isolated worktree. The experimental session had hooks firing automatically with no special instructions.
| Metric | Control | Experiment |
|---|---|---|
| Duration | ~8 min | ~6.5 min |
| Tool calls | 34 | 20 |
| Agentmemory tool calls | 0 | 1 (status only) |
| SQLite direct queries | 5 | 0 |
| Total tokens | 1,607,614 | 1,109,711 |
The experimental session was 31% more token-efficient and used 41% fewer tool calls. The experimental agent barely used agentmemory directly (one status() call, zero searches). The benefit was passive: hook-injected context at session start reduced cold-start exploration. Both reports were comparable in quality. The real value was efficiency, not accuracy.
Longitudinal Analysis
A second analysis examined 44 qualifying sessions across 1 week (23 before agentmemory activation, 21 after).
| Metric | Before | After | Delta |
|---|---|---|---|
| Tool uses per user message | 5.4 | 11.0 | +104% |
| (excluding agentmemory tools) | 5.4 | 10.7 | +99% |
| User messages per task | 4.8 | 3.2 | -33% |
| Avg user messages per session | 31.6 | 20.3 | -36% |
| Corrections per 100 user msgs | 2.2 | 3.3 | +50% (worse) |
| Restatements per 100 user msgs | 0.55 | 0.94 | +71% (worse) |
The LLM does roughly twice as much autonomous work per instruction. Tasks complete in 33% fewer turns. But correction rates went up, not down. Two confounds likely explain this: the "after" sessions were refinement/debugging work (inherently more correction-heavy), and possible model version changes during the period. A rigorous future test would need matched task design, randomized condition assignment, at least 20 tasks per condition, manual annotation, cross-session measurement, and blind evaluation. This experiment has not been run yet.
7. Failure Taxonomy
35 documented behavioral failures across Claude and Codex, classified into recurring patterns. Each case study includes verbatim exchange, root cause analysis, pattern classification, what the memory system should do to prevent it, and a concrete acceptance test with pass/fail criteria.
| Family | Pattern |
|---|---|
| Memory Failures | |
| P4 | Repeated procedural instructions |
| P1 | Repeated decisions |
| -- | Context drift within and across sessions |
| Calibration Failures | |
| P5 | Provenance-free status reporting |
| P7 | Output volume presented as validation |
| -- | Result inflation in reporting |
| Behavioral Failures | |
| P6 | Correction stored but not enforced |
| P9 | Sycophantic collapse under pressure |
| P10 | Point-fix without generalization |
| P11 | Intent completion gated by permission |
| Operational Failures | |
| P7 | Namespace collision across parallel sessions |
| P8 | Multi-hop query collapse |
| -- | Scale-before-validate bias |

Each case study maps to acceptance tests. The suite runs against the live SQLite store and retrieval pipeline: 62/65 tests pass (29 files, 1.65s), with 3 skipped for capabilities requiring behavioral hooks not yet implemented (CS-012: PostEdit hook, CS-024: sycophantic collapse detection, CS-026: permission-gated intent completion).
| CS | Failure | What the Test Validates |
|---|---|---|
| 002 | Premature implementation push (3 corrections ignored) | Locked correction created on first user correction; persists indefinitely |
| 006 | Correction stored but not enforced (implementation ban violated in new session) | Locked prohibition retrieved AND enforced across session boundaries; output gating blocks violations |
| 009 | Correction lost across session reset ("use B not A") | SUPERSEDES edges preserve latest correction; holds across resets |
| 022 | Multi-hop query collapse (4 agents, wrong machine) | All entities identified via graph traversal; correct state aggregated |
| 025 | Correction not generalized (fix one instance, miss others) | Correction applies to pattern class, not just the specific instance |
| Component | Case Studies | Priority |
|---|---|---|
| Locked beliefs / L0 behavioral | 11 | Critical |
| COMMIT_BELIEF (git-derived) | 6 | High |
| FTS5 retrieval | 6 | High |
| Triggered beliefs (TB-01-15) | 6 | High |
| Source priors / provenance | 5 | High |
| SUPERSEDES edges | 4 | Medium |
| IMPLEMENTS / CALLS / CO_CHANGED | 5 | Medium |
| Output gating (enforcement) | 2 | Critical* |
| HRR typed traversal | 3 | Medium |
| TESTS / coverage edges | 1 | Low |
* Output gating covers only 2 case studies but both are severity-critical: CS-006 and CS-016 are multi-session correction violations, the most painful failure class.

8. What Was Abandoned (and Why)
The negative findings shaped the architecture as much as the positive ones.
SimHash clustering was the first attempt at deduplication. SimHash works well for near-duplicate text, but stored directives are short and semantically dense -- "always use mocks" vs. "never use mocks" differ by one word that inverts the meaning.
Mutual information re-ranking was supposed to improve retrieval by scoring candidates on their statistical relationship to the query. In practice, it demoted relevant results and promoted spurious correlations.
Global holographic superposition was theoretically promising and spectacularly failed. At 775 edges, the representation exceeded its information-theoretic capacity by 7.6x and produced pure noise (Exp 50).
Pre-prompt compilation attempted to pre-compute relevant directives for common query patterns. It performed worse than random selection (23% vs. 33%, Exp 52) because directive value is context-dependent.
| Approach | Why it failed |
|---|---|
| SimHash clustering | Not viable for deduplication in this domain |
| Mutual information re-ranking | Hurts more than helps in retrieval |
| Rate-distortion optimization | Unnecessary complexity for marginal gains |
| Pre-prompt compilation | Worse than random selection (23% vs 33%) |
| Global holographic superposition | Capacity exceeded 7.6x at 775 edges; pure noise |
| Multi-layer graph expansion | Signal diluted to 3.6% of graph at 16K nodes |
| Autonomous edge discovery | Precision 0.001, recall 0.005 |
| Zero-LLM classification as sufficient | 4% precision on corrections (805 found, 32 correct) |
9. Conclusion
agentmemory demonstrates that persistent memory for LLM agents does not require embeddings, vector databases, or expensive inference. A pipeline built on FTS5 keyword search, typed knowledge graphs, and entity-index retrieval achieves competitive or superior results across five benchmarks while running at 0.7s average retrieval latency on a 19K-node production database.
The core contributions are: (1) correction detection at 92% accuracy without LLM calls, (2) vocabulary gap recovery that handles the 31% of directives keyword search cannot reach, (3) entity-index retrieval that breaks the published 7% multi-hop ceiling by 5-8x, and (4) a confidence tracking loop that improves retrieval quality over time.
The system's limitations are equally clear. LongMemEval multi-session accuracy is 24.1%, bottlenecked by FTS5's inability to aggregate scattered mentions. Correction rates did not decrease in the longitudinal analysis, though confounds prevent attribution. Contradiction detection during retrieval does not work yet. Cross-project noise in shared databases needs tighter scoping enforcement.
The failure taxonomy and its 62 passing acceptance tests may be the most practically useful contribution: a catalog of the specific ways LLM agents fail at memory, each with a reproducible test that blocks recurrence. All code, experiment data, and benchmark adapters are available at github.com/robotrocketscience/aelfrice under MIT license. Version 1.2.1, research frozen 2026-04-16.
10. Research Breadth
The project drew on multiple fields, each brought in to address a specific problem:
Information theory: The information bottleneck (Tishby et al., 1999) was applied to context compression, producing the 55% token savings. Mutual information was tested for retrieval re-ranking and abandoned. Rate-distortion theory was explored for optimal token budget allocation but proved unnecessary.
Bayesian inference: Beta-Bernoulli conjugate pairs for confidence tracking. Thompson sampling for the exploration/exploitation tradeoff in retrieval. Calibration measured at ECE 0.066 (target < 0.10).
Cognitive architectures: SOAR's impasse-driven substates informed retrieval failure escalation. CLARION's meta-cognitive subsystem inspired confidence tracking. ACT-R's declarative/procedural distinction mapped to the system's separation of factual content from behavioral constraints. The design borrows structure without inheriting human-like decay and distortion.
Bio-inspired optimization: Slime mold network dynamics (Tero-Kobayashi equations) were tested for graph pruning. Evolutionary algorithms were tested for edge set optimization. Both showed promise in simulation but were not adopted.
Graph theory: Typed knowledge graphs with weighted edges form the structural backbone. Multi-hop traversal enables the vocabulary gap recovery.
11. Technical Details
- Source: github.com/robotrocketscience/aelfrice, MIT license.
- Language: Python with strict typing enforced by pyright in strict mode.
- Storage: SQLite with WAL mode. The entire memory store is a single file.
- Dependencies: Minimal by design. No PyTorch, no TensorFlow, no embedding models. LLM classification (~$0.005/session) brings accuracy to 99% and is the recommended configuration.
- Deployment: MCP server with 19 tools, integrating with Claude Code, Cursor, Windsurf, and other MCP-compatible tools. Also ships as a CLI with 23 commands.
- Modules: 18 production modules in `src/agentmemory/`, plus 23 benchmark adapters and scoring scripts.
- Scale tested: 600 to 90,000+ nodes across five codebases. Largest production deployment: 0.7s average retrieval latency on 19K-node graph.
- Benchmarks: 5 benchmarks tested with contamination-proof protocol. Two contamination incidents caught and documented during development.
- Test suite: 362 passing tests plus 62 acceptance tests (29 files, 1.65s).
- Experiments: 85+ during core development, plus 6 benchmark-phase experiments with pre-registered hypotheses. Negative findings documented with the same rigor as positive findings.
- Case studies: 35 documented LLM behavioral failures, each with verbatim transcripts, root cause analysis, and derived acceptance tests.
- Version: 1.2.1 (research frozen 2026-04-16)
12. Update: v1.2 to v3.0 (April 2026)
Everything above documents the v1 research. The system kept evolving. This section covers what changed.
What the v1 write-up got right
The core architecture held. FTS5 keyword search, typed knowledge graphs, Bayesian confidence tracking, correction detection, and the feedback loop all survived into v3 without fundamental redesign. The benchmarks haven't been re-run (the v1 scores still stand), and the failure taxonomy still maps accurately to observed failures.
What changed
Retrieval went from 4 layers to 7. The v1 pipeline was: locked beliefs (L0) + FTS5 (L1) + entity-index (L2.5) + graph traversal (L3). Three layers were added:
- Layer 0 structural analysis detects the task type (deployment, debugging, implementation) from prompt structure alone, at 90.5% accuracy. This means the system retrieves deployment runbooks when the user is deploying, not when they mention the word "deploy."
- Layer 1.5 precomputed HRR neighbors replaced the runtime HRR computation that was too slow to use in production. An `hrr_neighbors` table is built during graph construction; hook search does a SQL JOIN in 0.03ms instead of skipping the vocabulary gap layer entirely.
- Layer 1.7 intention clustering groups beliefs into 40 clusters based on what they're about, not what words they contain. 97.9% of vocabulary-gap pairs land in the same cluster (Exp 94b). When FTS5 finds a belief about "database migrations," the cluster pulls in related beliefs about "schema changes" and "ALTER TABLE" even though they share no keywords.
Bayesian scoring got a second model. The single Beta-Bernoulli prior was joined by a Bayesian model selection layer (Exp 93) that classifies each belief into one of four states based on its feedback history: SIGNAL (actively useful), NOISE (consistently irrelevant), STALE (was useful, stopped being retrieved), or CONTESTED (conflicting feedback). The model state applies a 0.6-1.3x multiplier. Beliefs without feedback are unaffected.
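A toy sketch of the state-plus-multiplier idea. The thresholds and the specific multipliers within the documented 0.6-1.3x range are assumptions, and the real layer does Bayesian model selection rather than simple counting:

```python
def classify_belief(helpful: int, harmful: int, turns_since_last_retrieval: int) -> str:
    """Illustrative classification into the four feedback states."""
    total = helpful + harmful
    if total == 0:
        return "UNOBSERVED"                      # no feedback -> no multiplier
    if turns_since_last_retrieval > 50:
        return "STALE"
    if helpful >= 3 * max(harmful, 1):
        return "SIGNAL"
    if harmful >= 3 * max(helpful, 1):
        return "NOISE"
    return "CONTESTED"

# Assumed multipliers inside the documented 0.6-1.3x range.
MULTIPLIER = {"SIGNAL": 1.3, "CONTESTED": 1.0, "STALE": 0.8, "NOISE": 0.6,
              "UNOBSERVED": 1.0}

def adjusted_score(base_score: float, helpful: int, harmful: int, idle: int) -> float:
    return base_score * MULTIPLIER[classify_belief(helpful, harmful, idle)]

print(adjusted_score(0.7, helpful=6, harmful=1, idle=3))   # boosted: SIGNAL
print(adjusted_score(0.7, helpful=0, harmful=4, idle=3))   # suppressed: NOISE
```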
Beliefs can now cross project boundaries. v3's largest structural addition: SQLite ATTACH federation for cross-project sharing. Each shared scope gets its own database under ~/.agentmemory/shared/{scope}/. A content-hash deduplicates on write. Hook search Layer 6 queries subscribed scopes with a budget of 3 beliefs per scope.
This directly addresses the cross-project noise problem documented in Section 7 of the v1 write-up (25% cross-project noise in the A/B test). Exp 97 measured: 100% recall, 0% top-5 contamination, 1.06x latency overhead.
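A minimal sketch of the federation mechanics with SQLite ATTACH. Table and column names are assumptions (the sketch assumes a `beliefs` table with a unique `content_hash`), and the LIKE query stands in for the real FTS5 search:

```python
import hashlib
import sqlite3
from pathlib import Path

SHARED_ROOT = Path.home() / ".agentmemory" / "shared"

def attach_scope(conn: sqlite3.Connection, scope: str) -> None:
    """Each shared scope is its own SQLite file, attached under an alias.
    `scope` is assumed to be a validated identifier; the filename is an assumption."""
    db_path = SHARED_ROOT / scope / "memory.db"
    conn.execute(f"ATTACH DATABASE ? AS {scope}", (str(db_path),))

def share_belief(conn: sqlite3.Connection, scope: str, text: str) -> None:
    """Content-hash deduplicates on write (assumes UNIQUE(content_hash))."""
    content_hash = hashlib.sha256(text.encode()).hexdigest()
    conn.execute(
        f"INSERT OR IGNORE INTO {scope}.beliefs (content_hash, text) VALUES (?, ?)",
        (content_hash, text),
    )

def search_scope(conn: sqlite3.Connection, scope: str, query: str,
                 budget: int = 3) -> list[str]:
    """Layer 6: query a subscribed scope with a budget of 3 beliefs per scope."""
    rows = conn.execute(
        f"SELECT text FROM {scope}.beliefs WHERE text LIKE ? LIMIT ?",
        (f"%{query}%", budget),
    )
    return [text for (text,) in rows]
```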
Wonder and reason turned research into conversation. Two new capabilities let the agent investigate open questions using the memory graph as context. These are invoked as slash commands (/mem:wonder, /mem:reason) but they're used in context during natural conversational turns, not as separate formal operations. The user types something like "please /mem:wonder about X" mid-discussion, and the system launches the research pipeline with the full conversational context already loaded. A wonder query spawns parallel research agents; a reason query builds evidence chains. Both save their findings as beliefs that persist across sessions.
A case study documents a real session where wonder + reason produced an actionable marketing strategy for the project itself: 4 parallel research agents, findings synthesized, README rewritten, all grounded in beliefs accumulated over prior weeks.
Updated numbers
| Metric | v1.2 | v3.0 |
|---|---|---|
| Experiments | 85 | 98 |
| Tests | 362 + 62 acceptance | 954 |
| MCP tools | 19 | 31 |
| Production modules | 18 | 33 |
| Retrieval layers | 4 | 7 |
| Version | 1.2.1 | 3.0.2 |
What's still not solved
Benchmarks have not been re-run against v3. The v1 scores reported above still stand, but the new retrieval layers (intention clustering, multimodel scoring, structural analysis) have not been measured on the same benchmark suite. Re-running benchmarks is planned but blocked by outstanding work on the retrieval pipeline.
Known limitations that remain from Section 7:
- Contradiction detection during retrieval still does not work. The CONTRADICTS edge type exists but isn't used during scoring.
- Correction rates were not re-measured. The v1 longitudinal analysis found no decrease; whether v3's changes affect this is unknown.
- LongMemEval multi-session was 24.1% in v1. Whether the new layers improve this is untested.
Cross-project scoping was solved (v3 shared scopes with content-hash dedup). The A/B test's 25% cross-project noise is no longer expected.
Appendix A: Benchmark Methodology
This appendix documents the exact protocol used for each benchmark. Any deviation invalidates the results. The protocol was developed after two contamination incidents during development. The full protocol, contamination verification script, and all benchmark adapters are available in the public repository.
Contamination Protocol
Three contamination modes were identified during development:
Ground truth in retrieval output. The retrieval JSON contained answer fields. The LLM reader saw correct answers while generating predictions. This produced the invalid 87.8% LoCoMo score. The contamination was not immediately obvious: the first Opus score (61.6% F1) was plausible. It was only discovered when slow-finishing agents overwrote the merged predictions file, producing 87.8% F1 near the human ceiling of 87.9%. Exact-match analysis confirmed: 9 of 10 batches showed 43-87% exact-match rates. Four additional isolation failures were identified: a renamed `_ground_truth` field, pre-computed `prediction` and `f1` fields, `category_name` labels leaking evaluation strategy, and no separation between question-context and scoring metadata. All results from this run were retracted. Prevention: Adapter code writes two separate files. A mandatory contamination check (verify_clean.py) scans for 30 banned keys before any reader touches the data.

LLM self-judging with answer visible. Prevention: Generation and judging are strictly separate passes.
World knowledge override. The LLM reader used real-world knowledge instead of retrieved context, particularly on counterfactual benchmarks. Mitigation: Reader prompts include explicit instructions to use only the provided context. Documented as a known limitation (~17% of MAB failures).
General Protocol
Every benchmark run follows these steps:
Step 1: Data acquisition. Download from published source. Verify row counts and field names.
Step 2: Retrieval. Run adapter in --retrieve-only mode, producing retrieval and ground truth files. Each test case uses a fresh SQLite database.
uv run python benchmarks/<adapter>.py \
--retrieve-only /tmp/benchmark_<name>.json
Step 3: Contamination check. Mandatory before any reader touches the data:
uv run python benchmarks/verify_clean.py /tmp/benchmark_<name>.json
Step 4: Answer generation. LLM reader receives only the retrieval file. Never sees ground truth.
Step 5: Scoring. Reads predictions and ground truth. Metrics follow exact published formulas.
Step 6: Reporting. Includes exact commands, contamination check output, adapter commit hash, dataset version, reader model, scoring metric, published baselines, and known limitations.
Per-Benchmark Specifics
LoCoMo ([Maharana et al., ACL 2024](https://snap-research.github.io/locomo/))
- Dataset: `locomo10.json`, 10 conversations, 5,882 turns, 1,986 QA pairs across 5 categories.
- Ingestion: All 10 conversations through standard onboarding pipeline. Session boundaries preserved.
- Retrieval: FTS5 + HRR + BFS, 2,000-token budget, batch size 1.
- Reader model: Claude Opus 4.6.
- Prompts: Exact LoCoMo protocol prompts. Categories 1/3/4: "Based on the above context, write an answer in the form of a short phrase..." Category 2 appends: "Use DATE of CONVERSATION to answer with an approximate date." Category 5: forced-choice "(a) Not mentioned (b) [adversarial_answer]" with randomized option order (seed=42).
- Scoring: Token-level F1 with Porter stemming and article removal (a scoring sketch follows this list).
- Score: 66.1% F1.
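A sketch of the token-level F1 scoring described above, using NLTK's Porter stemmer; normalization beyond lowercasing, article removal, and stemming is an assumption:

```python
from collections import Counter

from nltk.stem import PorterStemmer  # pip install nltk

_stemmer = PorterStemmer()
_ARTICLES = {"a", "an", "the"}

def normalize(text: str) -> list[str]:
    """Lowercase, drop articles, Porter-stem each token."""
    tokens = [t for t in text.lower().split() if t not in _ARTICLES]
    return [_stemmer.stem(t) for t in tokens]

def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the trip to Lake Tahoe", "a trip to Tahoe"), 2))  # 0.86
```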
MemoryAgentBench FactConsolidation ([Hu et al., ICLR 2026](https://arxiv.org/abs/2507.05257))
- Dataset: HuggingFace `ai-hyz/MemoryAgentBench`, `Conflict_Resolution` split.
- Ingestion: Context chunked at 4,096 tokens using NLTK `sent_tokenize` and tiktoken `gpt-4o` encoding.
- Retrieval (single-hop): FTS5 with triple extraction. SUPERSEDES edges created automatically.
- Retrieval (multi-hop): Entity-index adapter. 41 regex patterns. 4-hop chaining with breadth cap of 30.
- Reader models: Claude Opus 4.6 and Claude Haiku 4.5.
- Scoring: `substring_exact_match` per the paper. Chain validation for multi-hop.
- Scores: SH: 90% Opus, 62% Haiku. MH: 60% Opus (raw SEM), 35% chain-valid.
StructMemEval ([Shutova et al., 2026](https://github.com/yandex-research/StructMemEval))
- Dataset: GitHub `yandex-research/StructMemEval`, `location/small_bench`, 14 cases.
- Ingestion: Narrative timestamps (30 days apart per session). Standard pipeline.
- Retrieval: FTS5 with `temporal_sort=True`.
- Disclosure: temporal_sort was developed after seeing the initial 29% result.
- Score: 14/14 (100%).
LongMemEval ([Wu et al., ICLR 2025](https://arxiv.org/abs/2501.05294))
- Dataset: HuggingFace `xiaowu0162/longmemeval-cleaned`, 500 questions across 6 categories.
- Retrieval: FTS5 + HRR + BFS, 2,000-token budget, top_k=50.
- Judge: Claude Opus 4.6 binary judge (non-standard; paper specifies GPT-4o).
- Disclosure: Using Opus as judge instead of GPT-4o means the comparison is not apples-to-apples.
- Score: 59.0% (295/500).
Reproducibility
All benchmark adapters, scoring scripts, and the contamination verification script are in the benchmarks/ directory of the public repository. To reproduce any result:
# Clone and install
git clone https://github.com/robotrocketscience/aelfrice
cd aelfrice
uv sync
# Run retrieval (example: MAB single-hop)
uv run python benchmarks/mab_adapter.py \
--split Conflict_Resolution \
--source factconsolidation_sh_262k \
--retrieve-only /tmp/mab_sh.json
# Verify clean
uv run python benchmarks/verify_clean.py /tmp/mab_sh.json
# Score (after running reader)
uv run python benchmarks/exp6_score.py /tmp/mab_sh_preds.json /tmp/mab_sh_gt.json
Complete per-benchmark commands and adapter documentation are in docs/BENCHMARK_PROTOCOL.md.
Appendix B: Literature Tables
| System | LoCoMo | Notes |
|---|---|---|
| EverMemOS | 92.3% | Cloud LLM, closed source |
| Hindsight | 89.6% | Cloud LLM |
| SuperLocalMemory C | 87.7% | LLM for synthesis |
| Zep/Graphiti | ~85% | Temporal knowledge graph |
| Letta/MemGPT | ~83.2% | OS-style memory |
| SuperLocalMemory A | 74.8% | Zero cloud |
| Letta (filesystem) | 74.0% | gpt-4o-mini, no architecture |
| Supermemory | ~70% | Vector graph engine |
| Mem0 (self-reported) | ~66% | Hybrid store |
| Mem0 (independent) | ~58% | See note below |
| agentmemory | 66.1% | FTS5+HRR+BFS, no embeddings |
* Mem0 independent score differs from self-reported. See Chhikara et al., arXiv:2504.19413 (ECAI 2025).
* Scores measured under different conditions and LLM backends. Not directly comparable.
* agentmemory score from the protocol-correct run with full input isolation. See Appendix A for methodology and the contamination narrative from earlier invalid runs.

| Benchmark | agentmemory | Paper Best | Delta |
|---|---|---|---|
| LoCoMo (ACL '24) | 66.1% F1 | 51.6% GPT-4o | +14.5pp |
| MAB SH 262K (ICLR '26) | 90% Opus | 45% GPT-4o-mini | +45pp |
| MAB MH 262K (ICLR '26) | 60% Opus | <=7% (all methods) | 8.6x |
| StructMemEval ('26) | 100% (14/14) | vector stores fail | -- |
| LongMemEval (ICLR '25) | 59.0% | 60.6% GPT-4o | -1.6pp |
* MAB = MemoryAgentBench FactConsolidation. SH = single-hop, MH = multi-hop.
* LongMemEval uses Opus as judge (paper uses GPT-4o); comparison carries an asterisk until the same judge is used.
* MAB MH "chain-valid" score (reader-independent): 35% for both Opus and Haiku. The 60% includes incidental matches from deeper traversal.

| Benchmark | Key Finding | Ours |
|---|---|---|
| LoCoMo (ACL '24) | Filesystem + grep = 74% baseline | 66.1% F1 |
| MemoryAgentBench (ICLR '26) | Single-hop: 45% GPT-4o-mini; Multi-hop: 7% ceiling | SH: 90%; MH: 60% |
| LongMemEval (ICLR '25) | 500 questions, scales to 1.5M tokens | 59.0% (Opus judge) |
| StructMemEval | Vector stores fail at state tracking | 100% (14/14) |
| LifeBench | SOTA at 55.2% | Not yet tested |
| AMA-Bench | GPT 5.2 achieves 72.26% | Not yet tested |