MATM Evaluation – LlmWikis.org

Evaluate MATM without mixing evidence classes. MATM paper results, Memory Curator theory or simulation, and a deployment’s field metrics answer different questions and must not be blended into one production-readiness claim.

Evidence Classes

Class	What it can support	What it cannot prove alone
MATM empirical retrieval evidence	Whether retrieved agent trajectories helped in the studied tasks and settings.	That every governed deployment will improve, or that curation controls are validated.
Memory Curator theoretical/simulation evidence	How a stylized governance model behaves under assumptions such as simplified retrieval and error rates.	Field-validated production safety or universal admission thresholds.
Deployment-specific field evidence	How a specific runtime performs under its own tasks, users, privacy model, and review burden.	General benchmark superiority beyond the deployment.
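
One way to keep the three classes apart in practice is to tag every reported number with its evidence class and refuse to aggregate across classes. The sketch below is illustrative only: the `Result` record, the `source` field, and `mean_within_class` are hypothetical names, not part of MATM or Memory Curator.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceClass(Enum):
    MATM_EMPIRICAL = "MATM empirical retrieval evidence"
    CURATOR_THEORY_SIM = "Memory Curator theoretical/simulation evidence"
    DEPLOYMENT_FIELD = "Deployment-specific field evidence"


@dataclass(frozen=True)
class Result:
    metric: str
    value: float
    evidence_class: EvidenceClass
    source: str  # hypothetical free-text provenance label


def group_by_class(results):
    """Group results by evidence class so each class is reported separately."""
    grouped = {cls: [] for cls in EvidenceClass}
    for r in results:
        grouped[r.evidence_class].append(r)
    return grouped


def mean_within_class(results, metric):
    """Average one metric, refusing to blend values across evidence classes."""
    selected = [r for r in results if r.metric == metric]
    if not selected:
        raise ValueError(f"no results for metric {metric!r}")
    classes = {r.evidence_class for r in selected}
    if len(classes) > 1:
        raise ValueError(
            f"refusing to blend {metric!r} across {len(classes)} evidence classes"
        )
    return sum(r.value for r in selected) / len(selected)
```

Raising an error instead of silently averaging makes a mixed-class claim a visible failure in the reporting pipeline.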

Metrics

Measure task success rate, interaction steps, joint utility, retrieval hit rate, marginal utility, retrieval precision and recall, routing accuracy, scope leakage, evidence sufficiency, duplicate rate, conflict precision/recall, stale retrieval, deprecated-record retrieval, privacy leakage, poisoning resistance, curator false admission, curator false rejection, producer attribution, latency, token usage, storage growth, review burden, cost, availability, and cache invalidation correctness.
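
A few of these metrics reduce to simple counts. The sketch below shows one possible reading of retrieval precision/recall and of the curator error rates. The definitions are assumptions, not taken from the MATM paper: false admission is the share of records that should have been rejected but were admitted, and false rejection is the share of records that should have been admitted but were rejected.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating inputs as ID sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def curator_error_rates(decisions):
    """Compute curator error rates from labeled decisions.

    decisions: iterable of (admitted, should_admit) boolean pairs, where
    should_admit is the gold label. A rate is None when its denominator
    is zero (no gold-negative or no gold-positive records).
    """
    bad_total = bad_admitted = good_total = good_rejected = 0
    for admitted, should_admit in decisions:
        if should_admit:
            good_total += 1
            if not admitted:
                good_rejected += 1
        else:
            bad_total += 1
            if admitted:
                bad_admitted += 1
    return {
        "false_admission": bad_admitted / bad_total if bad_total else None,
        "false_rejection": good_rejected / good_total if good_total else None,
    }
```

Whichever definitions a deployment adopts, it helps to state the denominators alongside the rates so that results remain comparable across runs.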

Baselines

No retrieval.
Direct-write uncurated memory.
Single-scope memory.
Scoped but uncurated memory.
Proposal-only memory.
Deterministic curator, model-assisted curator, and hybrid curator.
Semantic retrieval only and reranked retrieval.
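
The baselines are only comparable if each one runs on the same tasks with the same seed. The following is a minimal harness sketch under that assumption. The identifier strings and the `run_system` callback are hypothetical; the harness deliberately leaves the behavior of each baseline to the caller.

```python
import random

# Identifiers for the baselines listed above (names are illustrative).
BASELINES = [
    "no_retrieval",
    "direct_write_uncurated",
    "single_scope",
    "scoped_uncurated",
    "proposal_only",
    "deterministic_curator",
    "model_assisted_curator",
    "hybrid_curator",
    "semantic_retrieval_only",
    "reranked_retrieval",
]


def compare_baselines(run_system, tasks, seed=0, reference="no_retrieval"):
    """Run every baseline on the same tasks with the same seed.

    run_system(name, task, rng) must return a truthy value on task success.
    Returns {name: (success_rate, delta_vs_reference)}.
    """
    tasks = list(tasks)
    if not tasks:
        raise ValueError("at least one task is required")
    rates = {}
    for name in BASELINES:
        rng = random.Random(seed)  # fresh, identically seeded RNG per baseline
        successes = sum(bool(run_system(name, task, rng)) for task in tasks)
        rates[name] = successes / len(tasks)
    return {name: (rate, rate - rates[reference]) for name, rate in rates.items()}
```

Reporting the difference against the no-retrieval reference keeps the comparison explicit instead of quoting absolute success rates in isolation.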

Methodology

Keep train and test trajectories separate and confirm that no test trajectories leak. Fix random seeds where relevant and run ablations. Include adversarial producers, multiple projects, long-duration simulations, human-labeled gold routing, privacy probes, stale and deprecated records, conflict scenarios, and rollback tests. Verify any benchmark value against the primary MATM paper or repository before publishing it.
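
Train/test separation and the leakage check can be sketched as follows, assuming each trajectory has a unique ID and that leakage means a test trajectory's ID appears among the records the memory store can retrieve. The function names and the 20% test fraction are illustrative choices.

```python
import random


def split_trajectories(trajectory_ids, test_fraction=0.2, seed=0):
    """Deterministically split trajectory IDs into (train, test) lists."""
    ids = sorted(set(trajectory_ids))  # sort first so the seed fully fixes the split
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]


def assert_no_leakage(memory_trajectory_ids, test_ids):
    """Raise if any test trajectory is present in the memory store."""
    leaked = set(memory_trajectory_ids) & set(test_ids)
    if leaked:
        raise ValueError(f"{len(leaked)} test trajectories leaked into memory")
```

Running the leakage check immediately before each evaluation, not only at split time, also catches records that enter memory later in a long-duration run.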

Limitations to Keep Visible

Label the curator theorem and the admission/filter regimes as stylized-model or simulation results wherever they are used. State their assumptions explicitly: independence, stationary error rates, simplified retrieval, and simplified human review cost. These results carry no field validation unless a deployment supplies its own evidence.
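
To see why these assumptions matter, consider a toy admission filter. This is not the curator theorem itself. It is an illustrative model in which each candidate record is good with a fixed probability and the curator errs independently with stationary false-admission and false-rejection rates. All parameter names and values below are hypothetical.

```python
import random


def admitted_precision_closed_form(p_good, false_admission, false_rejection):
    """Expected share of admitted records that are good, by Bayes' rule."""
    good_admitted = p_good * (1.0 - false_rejection)
    bad_admitted = (1.0 - p_good) * false_admission
    return good_admitted / (good_admitted + bad_admitted)


def admitted_precision_simulated(p_good, false_admission, false_rejection,
                                 n_records=20000, seed=0):
    """Monte Carlo estimate under independent, stationary curator errors."""
    rng = random.Random(seed)
    good_admitted = total_admitted = 0
    for _ in range(n_records):
        is_good = rng.random() < p_good
        if is_good:
            admitted = rng.random() >= false_rejection
        else:
            admitted = rng.random() < false_admission
        if admitted:
            total_admitted += 1
            good_admitted += is_good
    return good_admitted / total_admitted
```

The simulation matches the closed form only because it draws errors independently at fixed rates. Correlated producers or drifting error rates in a real deployment would break that agreement, which is why such results need the stylized-model label.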