MAI: Memory Attribution Integrity

Memory Attribution Integrity (MAI) scores whether an agent's decision relied on well-attributed, sufficiently confident memory. Output is a float in [0.0, 1.0] with a default pass threshold of 0.7.

Why MAI exists

RAGAS, DeepEval, Phoenix Evals, and similar frameworks score response quality — faithfulness, relevance, hallucination. None of them define memory-provenance metrics out of the box: who wrote the retrieved record, how confident they were, what chain it came from. MAI fills that gap and composes with all of them.

The rubric

The canonical rubric (verbatim from MAI_RUBRIC in evaluators/attribution_integrity_ragas.py) is identical across all three tiers:

Given an agent decision and the memories it retrieved, score whether the decision relied on well-attributed, sufficiently confident memory.

Score 1.0 (well-attributed) when:

  • Retrieved memories have attribution (source agent, confidence, session)
  • Memory confidence ≥ 0.7 OR decision explicitly hedges on low-confidence data
  • No memories in chain with confidence < 0.4 used as basis for decision
  • Derivation chains are present and consistent

Score 0.0 (unattributed) when:

  • Decision uses unattributed or low-confidence memory as ground truth
  • Contradictory memories ignored
  • Memory without session/turn context treated as authoritative

The three tiers

| Tier | Name | Public API entry | Install gate | LLM |
|---|---|---|---|---|
| Tier 1 | Deterministic MAI | evaluate_attribution_integrity() | core (pip install memledger) | No |
| Tier 2 | RAGAS LLM-as-judge | evaluate_mai_ragas() | pip install memledger[eval] | Yes, LiteLLM-routed via $MEMLEDGER_JUDGE_MODEL |
| Tier 3 | Structural | evaluate_structural() / evaluate_from_memory_records() | core (pip install memledger) | No |

When to pick which:

  • Tier 1 — CI gates, every-PR runs, latency-critical paths (no LLM cost).
  • Tier 2 — production eval pipelines that already use RAGAS or a judge LLM.
  • Tier 3 — instrumentation-correctness checks (validates OTEL spans, not record fields).

Model selection for Tier 2 is environment-driven via $MEMLEDGER_JUDGE_MODEL in LiteLLM <provider>/<model-id> form — for example, bedrock/us.anthropic.claude-sonnet-4-6 for the Bedrock cross-region inference profile pattern. See LiteLLM's provider list for the full set.
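
As a minimal sketch of the `<provider>/<model-id>` convention described above, the snippet below sets the variable and splits it into its two parts. The helper `split_judge_model` is hypothetical and not part of memledger or LiteLLM; it only illustrates the string format, and it splits on the first slash on the assumption that a model id may itself contain slashes.

```python
import os


def split_judge_model(value):
    """Split a '<provider>/<model-id>' string on its first slash.

    Hypothetical helper for illustration only; not a memledger or LiteLLM API.
    """
    provider, sep, model_id = value.partition("/")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected '<provider>/<model-id>', got {value!r}")
    return provider, model_id


# Use the example value from the text unless the variable is already set.
os.environ.setdefault(
    "MEMLEDGER_JUDGE_MODEL", "bedrock/us.anthropic.claude-sonnet-4-6"
)

provider, model_id = split_judge_model(os.environ["MEMLEDGER_JUDGE_MODEL"])
print(provider, model_id)
```

In a shell-driven pipeline the same value would normally be exported before the eval job starts rather than set from Python.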

See tiers for the cost / fidelity tradeoff and per-tier deep-dive.

Tier 1 scoring breakdown

Tier 1 is deterministic. The score starts at 1.0 and is adjusted by the following criteria, then clamped to [0.0, 1.0]:

| Criterion | Adjustment |
|---|---|
| Attribution gap (records missing created_by) | up to −0.4, proportional to the unattributed ratio |
| Low-confidence (< 0.4) unhedged | −0.25 per record |
| Flagged-confidence (0.4–0.6) unhedged | −0.10 per record |
| Mean confidence < 0.6 | −0.5 × (0.6 − mean) |
| Session-context ratio < 0.5 | up to −0.15 |
| Chain min-confidence < 0.4 | −0.20 |
| Chain truncated (depth > max_hops) | −0.05 |
| Multi-agent corroboration (≥ 2 distinct agents) | +0.05 bonus |
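
To make the arithmetic concrete, here is an illustrative re-implementation of the table, not the library's actual code. It rests on several assumptions that the table does not spell out: the attribution penalty is 0.4 × the unattributed ratio; the session-context penalty scales linearly from 0 at a ratio of 0.5 to 0.15 at a ratio of 0; the flagged band is 0.4 ≤ confidence < 0.6; hedging is represented by a hypothetical per-record `hedged` flag; and chain information is passed in as two hypothetical arguments. The function name `sketch_tier1_score` is likewise invented for this example.

```python
def sketch_tier1_score(records, chain_min_confidence=None, chain_truncated=False):
    """Illustrative Tier 1 arithmetic; see the assumptions in the lead-in."""
    if not records:
        raise ValueError("this sketch does not define a score for no records")
    n = len(records)
    score = 1.0

    # Attribution gap: up to -0.4, proportional to the unattributed ratio.
    unattributed = sum(1 for r in records if not r.get("created_by"))
    score -= 0.4 * unattributed / n

    # Per-record confidence penalties apply only to unhedged records.
    for r in records:
        if r.get("hedged", False):  # hypothetical flag
            continue
        if r["confidence"] < 0.4:
            score -= 0.25
        elif r["confidence"] < 0.6:  # assumed band: [0.4, 0.6)
            score -= 0.10

    # Mean confidence below 0.6.
    mean = sum(r["confidence"] for r in records) / n
    if mean < 0.6:
        score -= 0.5 * (0.6 - mean)

    # Session-context ratio below 0.5 (assumed linear scaling up to -0.15).
    session_ratio = sum(1 for r in records if r.get("session_id")) / n
    if session_ratio < 0.5:
        score -= 0.15 * (0.5 - session_ratio) / 0.5

    # Derivation-chain penalties.
    if chain_min_confidence is not None and chain_min_confidence < 0.4:
        score -= 0.20
    if chain_truncated:
        score -= 0.05

    # Multi-agent corroboration bonus.
    agents = {r["created_by"] for r in records if r.get("created_by")}
    if len(agents) >= 2:
        score += 0.05

    return max(0.0, min(1.0, score))


good = [
    {"created_by": "agent-A", "confidence": 0.9, "session_id": "s1"},
    {"created_by": "agent-B", "confidence": 0.85, "session_id": "s1"},
]
mixed = [
    {"created_by": "agent-A", "confidence": 0.95, "session_id": "s1"},
    {"created_by": None, "confidence": 0.3, "session_id": "s1"},
]
print(sketch_tier1_score(good))             # 1.0 (1.05 before clamping)
print(round(sketch_tier1_score(mixed), 2))  # 0.55
```

In the second case half the records lack created_by (−0.2) and one unhedged record sits below 0.4 (−0.25), so under these assumptions the score of 0.55 falls below the default 0.7 threshold.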

Tier 1 in code

from evaluators import evaluate_attribution_integrity

# Two attributed, high-confidence memories written by distinct agents in one session.
retrieved = [
    {"id": "m1", "created_by": "agent-A", "confidence": 0.9, "session_id": "s1"},
    {"id": "m2", "created_by": "agent-B", "confidence": 0.85, "session_id": "s1"},
]
result = evaluate_attribution_integrity(
    decision_memory_id="d1",
    retrieved_memories=retrieved,
)
# result.score is the MAI float; result.passed reflects the pass threshold.
print(result.score, result.passed)

The import path is from evaluators import … (top-level package), not from memledger.evaluators.
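
Because Tier 1 is the tier suggested for CI gates, a gate script might turn the result into a process exit code along the following lines. This is a sketch under stated assumptions: `gate` is a hypothetical helper, the `SimpleNamespace` object is a stand-in for the real return value (only the `score` and `passed` attributes are taken from the example above), and treating a score equal to the threshold as a pass is an assumption.

```python
from types import SimpleNamespace


def gate(result, threshold=0.7):
    """Return 0 if the score clears the threshold, else 1 (hypothetical helper)."""
    ok = result.score >= threshold  # assumption: equality counts as a pass
    status = "PASS" if ok else "FAIL"
    print(f"MAI {status}: score={result.score:.2f} threshold={threshold:.2f}")
    return 0 if ok else 1


# Stand-in for the object returned by evaluate_attribution_integrity().
fake_result = SimpleNamespace(score=0.55, passed=False)
exit_code = gate(fake_result)
```

A CI job would then pass `exit_code` to `sys.exit()` so that a low score fails the build.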

Phoenix span

All three tiers emit OpenTelemetry EVALUATOR spans automatically when Phoenix tracing is wired up. Tiers 1 and 2 set eval.name=memory_attribution_integrity; Tier 3 sets eval.name=memory_attribution_integrity_structural. See Phoenix tracing.

Roadmap

DeepEval, Phoenix Evals, LangSmith, OpenAI Evals, TruLens, and AWS Bedrock AgentCore Evaluations adapters are on the v2.2+ roadmap.

See also