Most evaluations of memory for AI systems are really evaluations of question-answering. You load a corpus, ask questions, score the answers. The systems that win are the ones that retrieve and compose at query time, which is not the capability operational software needs.
An operational knowledge layer does a different job. It watches work happen, extracts a structured record of it, keeps one customer's facts from leaking into another's, and still answers correctly after ten unrelated things have happened since. Question-answering is one slice of that. The other three jobs are where production systems actually break, and they are the jobs the popular benchmarks barely test.
So we built one that does, and ran the field through it. This post is the scorecard: Phyvant against five graph-RAG and agent-memory systems, same model backbone, nothing skipped. Phyvant scored 0.957. The next real system scored 0.310. That headline is the least interesting thing in the data. The interesting part is why: the systems best at composing answers were the worst at remembering.
What we benchmarked
The unit under test is not a chatbot. It is the process that turns a stream of conversation into a queryable, governed knowledge graph and keeps that graph honest over time. We benchmark three scored kinds that isolate the custodial work an operational memory has to do, plus coverage. The composite averages the three kinds; coverage is reported alongside as a completion check:
- Construction: extract entities and relations from raw conversation, scored as graph precision/recall/F1 against a reference graph.
- Longitudinal: recall a specific fact after 10+ intervening noise episodes.
- Scope isolation: answer within a scope without leaking facts from another scope, and abstain when the answer isn't in scope.
- Coverage: the fraction of the 300 tasks an adapter completes without skipping or erroring. It sits outside the composite, because a system that quietly drops the tasks it cannot handle should not be able to average its way past them.
Every adapter ran on the same backbone: gpt-4o-mini for reasoning, text-embedding-3-small for retrieval where applicable. Adapters that return prose were post-processed by the same extractor before scoring, so verbosity neither helps nor hurts. Construction is scored on graph F1; the recall-style kinds are exact match after SQuAD-style normalization. The point of holding all of that fixed is that what's left is the architecture, not the model and not the prompt.
The field: GraphRAG (community-report graph RAG), Graphiti (Neo4j entity-centric agent memory), LightRAG (hybrid local+global graph retrieval), Mem0 (agent memory), and in_memory_kg, a deterministic dict-based triple store we include as a non-LLM reference floor. raw_llm, vector_rag, and a null adapter sit at the bottom as controls.
The scorecard
| Rank | Adapter | Score | Constr. | Longitud. | Scope Iso | Cover. |
|---|---|---|---|---|---|---|
| 1 | Phyvant | 0.957 | 0.940 | 0.990 | 0.940 | 1.000 |
| 2 | in_memory_kg | 0.433 | 0.800 | 0.500 | 0.000 | 1.000 |
| 3 | LightRAG | 0.310 | 0.000 | 0.150 | 0.780 | 0.833 |
| 4 | GraphRAG | 0.277 | 0.000 | 0.210 | 0.620 | 0.833 |
| 5 | Graphiti* | 0.167 | 0.000 | 0.500 | 0.000 | 0.817 |
| 6 | Mem0 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
*Graphiti was killed at 245/300 tasks after four hours of wall clock; it never finished the scope-isolation block, so its score here counts that block as the ~0 it was tracking toward. That unfinished run is also why its coverage stops at 0.817.
Two things jump out. First, the three "real" graph-RAG systems (GraphRAG, Graphiti, LightRAG) land between 0.17 and 0.31, and none of them score above zero on construction, because none export an inspectable graph to grade. Second, a deterministic dictionary (in_memory_kg) beats the entire graph-RAG pack, because when the test is remember this fact and don't corrupt it, a system that literally stores facts and never paraphrases has little to get wrong. But it scores a flat 0.000 on scope isolation, because it has no notion of a boundary. That is the shape of the whole problem: the more a system is built to retrieve-and-compose, the worse it does at the custodial work; the more it's a rigid store, the worse it does at anything that needs judgment. Phyvant is the only row at the top of every column.
What that architecture costs
Composing and re-summarizing answers at query time is expensive, and the bill comes due on every operation. The graph-RAG systems re-extract and re-summarize constantly:
| Adapter | LLM calls (300 tasks) | Wall clock |
|---|---|---|
| Phyvant | 8 | ~80 min |
| Mem0 | ~300 | ~20 min |
| LightRAG | ~1,200 | ~100 min |
| GraphRAG | ~1,800 | ~155 min |
| Graphiti | ~3,000+ | ~240 min (killed) |
Phyvant makes eight LLM calls across the entire 300-task run, because it does the work in embeddings and database operations and reaches for a model only when it genuinely needs one. The field makes hundreds to thousands, and the spend buys no guarantee: Mem0 burns roughly 300 ingest calls and still scores 0.000 on every kind, so its cost returns nothing scorable in our harness. Graphiti's per-turn entity extraction is so heavy that the long-horizon tasks, the ones with 10+ episodes, stall for minutes each, and the process never finished the suite. This is not a tuning detail. An architecture that calls an LLM to think about every turn cannot be the thing that watches every turn of real work. That same summarize-everything design is exactly why these systems dilute a specific fact into a neighborhood summary and collapse on long-horizon recall (GraphRAG 0.210, LightRAG 0.150).
Where Phyvant wins, and the mechanism
Phyvant's column is lopsided in the opposite direction, and the mechanism matters more than the numbers.
Construction, 0.940 (P=0.940, R=0.940, F1=0.940). None of the graph-RAG systems score above zero here, because none of them export predicted triples to be graded against a reference graph; they build internal structures tuned for their own retrieval, not an inspectable graph. Phyvant produces an actual graph you can diff against ground truth, and it does so accurately. That inspectability is the precondition for everything downstream: you can't govern, correct, or scope-isolate a representation you can't read.
Longitudinal, 0.990. This is the one we're proudest of. Near-perfect recall of a specific fact after ten-plus intervening noise episodes is exactly where the summarize-everything architectures collapse (GraphRAG 0.210, LightRAG 0.150, Graphiti 0.500). Phyvant keeps the specific fact specific instead of averaging it into a community summary.
Scope isolation, 0.940, 0% leak rate, 6% abstain. No customer's facts cross into another customer's answers, and when the answer isn't in scope the system declines rather than confabulating. LightRAG (0.780) and GraphRAG (0.620) manage isolation only by spinning up a separate working directory or index root per scope, which is operationally a different system per tenant. Phyvant isolates inside one graph. Graphiti and in_memory_kg score 0.000, with no working boundary at all.
The pattern is consistent. Phyvant is built to be the custodian of a record, so it leads on construction, retention, and isolation alike, the jobs an operational memory actually has.
What the benchmark is actually saying
The clean version of the result is "Phyvant won." The true version is more useful: retrieval quality and custodial quality are different axes, and most memory benchmarks only score the first. A system tuned to compose great answers at query time will re-index, re-summarize, and re-extract on every operation, and quietly fail at building an inspectable record, retaining a specific fact through noise, and keeping two tenants apart. Those failures don't show up when the test is a corpus and a question set. They show up the first week the system is in production, on a slice of work no dashboard is watching.
If your AI needs to remember how your work actually went and stay correct as that work changes, the axes that matter are construction, longitudinal recall, and scope isolation. On every one of them, one architecture sits at the top of the table, and it is the one cheap enough to watch every turn of real work. The results don't ask you to weigh a tradeoff. There isn't one.
Run: KGB-v0, 2026-06-16, 300 tasks, 0 skipped. Backbone gpt-4o-mini, embeddings text-embedding-3-small, identical across all adapters. Construction scored on graph F1 against a reference graph; the recall-style kinds scored on exact match after SQuAD-style normalization.