MAI: Memory Attribution Integrity

Memory Attribution Integrity (MAI) scores whether an agent's decision relied on well-attributed, sufficiently confident memory. Output is a float in [0.0, 1.0] with a default pass threshold of 0.7.

Why MAI exists

RAGAS, DeepEval, Phoenix Evals, and similar frameworks score response quality: faithfulness, relevance, hallucination. None of them defines memory-provenance metrics out of the box, such as who wrote the retrieved record, how confident they were, and what derivation chain it came from. MAI fills that gap and composes with all of them.

The rubric

The canonical rubric (verbatim from MAI_RUBRIC in evaluators/attribution_integrity_ragas.py) is identical across all three tiers:

Given an agent decision and the memories it retrieved, score whether the decision relied on well-attributed, sufficiently confident memory.

Score 1.0 (well-attributed) when:

- Retrieved memories have attribution (source agent, confidence, session)
- Memory confidence ≥ 0.7 OR decision explicitly hedges on low-confidence data
- No memories in chain with confidence < 0.4 used as basis for decision
- Derivation chains are present and consistent

Score 0.0 (unattributed) when:

- Decision uses unattributed or low-confidence memory as ground truth
- Contradictory memories ignored
- Memory without session/turn context treated as authoritative
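
The per-record conditions in the rubric can be approximated with a deterministic pre-check. The following is a minimal sketch, not the library's evaluator: `rubric_violations` and the `decision_hedges` flag are hypothetical names, the field names (`created_by`, `confidence`, `session_id`) are borrowed from the Tier 1 example records, and it assumes every retrieved memory is used as a basis for the decision. It does not check derivation chains or contradictions.

```python
from typing import Any

# Fields treated as "attribution" in this sketch (assumed from the Tier 1 example records).
REQUIRED_ATTRIBUTION = ("created_by", "confidence", "session_id")


def rubric_violations(
    memories: list[dict[str, Any]], decision_hedges: bool = False
) -> list[str]:
    """Return a human-readable list of rubric conditions the memories fail."""
    violations = []
    for mem in memories:
        # Condition: retrieved memories carry source agent, confidence, and session.
        missing = [key for key in REQUIRED_ATTRIBUTION if mem.get(key) is None]
        if missing:
            violations.append(f"{mem.get('id')}: missing {', '.join(missing)}")
            continue

        conf = mem["confidence"]
        if conf < 0.4:
            # Condition: no memory below 0.4 is used as a basis for the decision.
            violations.append(f"{mem['id']}: confidence {conf} is below 0.4")
        elif conf < 0.7 and not decision_hedges:
            # Condition: confidence >= 0.7 OR the decision explicitly hedges.
            violations.append(
                f"{mem['id']}: confidence {conf} is below 0.7 and the decision does not hedge"
            )
    return violations
```

An empty list means the checked conditions hold; it does not by itself imply a score of 1.0 from any tier.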

The three tiers

| Tier | Name | Public API entry | Install gate | LLM |
| --- | --- | --- | --- | --- |
| Tier 1 | Deterministic MAI | `evaluate_attribution_integrity()` | core (`pip install memledger`) | No |
| Tier 2 | RAGAS LLM-as-judge | `evaluate_mai_ragas()` | `pip install memledger[eval]` | Yes, LiteLLM-routed via `$MEMLEDGER_JUDGE_MODEL` |
| Tier 3 | Structural | `evaluate_structural()` / `evaluate_from_memory_records()` | core (`pip install memledger`) | No |

When to pick which:

- Tier 1: CI gates, every-PR runs, latency-critical paths (no LLM cost).
- Tier 2: production eval pipelines that already use RAGAS or a judge LLM.
- Tier 3: instrumentation-correctness checks (validates OTEL spans, not record fields).

Model selection for Tier 2 is environment-driven: set `$MEMLEDGER_JUDGE_MODEL` to a LiteLLM `<provider>/<model-id>` string, for example `bedrock/us.anthropic.claude-sonnet-4-6` for the Bedrock cross-region inference profile pattern. See LiteLLM's provider list for the full set.
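
A pre-flight check on the variable can catch a malformed value before a Tier 2 run starts. This is a minimal sketch: `parse_judge_model` is a hypothetical helper, not part of memledger or LiteLLM, and it only validates the `<provider>/<model-id>` shape without checking that the provider or model actually exists.

```python
import os
from typing import Optional


def parse_judge_model(default: Optional[str] = None) -> tuple[str, str]:
    """Split $MEMLEDGER_JUDGE_MODEL into (provider, model_id).

    Hypothetical helper: validates only the '<provider>/<model-id>' shape.
    """
    value = os.environ.get("MEMLEDGER_JUDGE_MODEL", default)
    if not value:
        raise ValueError("MEMLEDGER_JUDGE_MODEL is not set")

    # Split on the first slash only; model ids may themselves contain slashes.
    provider, sep, model_id = value.partition("/")
    if not sep or not provider or not model_id:
        raise ValueError(f"expected '<provider>/<model-id>', got {value!r}")
    return provider, model_id


if __name__ == "__main__":
    os.environ["MEMLEDGER_JUDGE_MODEL"] = "bedrock/us.anthropic.claude-sonnet-4-6"
    print(parse_judge_model())
```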

See tiers for the cost / fidelity tradeoff and per-tier deep-dive.

Tier 1 scoring breakdown

Tier 1 is deterministic. The score starts at 1.0 and is adjusted by the following criteria, then clamped to [0.0, 1.0]:

| Criterion | Adjustment |
| --- | --- |
| Attribution gap (records missing `created_by`) | up to −0.4, proportional to the unattributed ratio |
| Low-confidence (`< 0.4`) unhedged | −0.25 per record |
| Flagged-confidence (`0.4–0.6`) unhedged | −0.10 per record |
| Mean confidence < 0.6 | −0.5 × (0.6 − mean) |
| Session-context ratio < 0.5 | up to −0.15 |
| Chain min-confidence < 0.4 | −0.20 |
| Chain truncated (depth > `max_hops`) | −0.05 |
| Multi-agent corroboration (≥ 2 distinct agents) | +0.05 bonus |
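
To make the arithmetic concrete, the table can be sketched as a single function. This is an illustration under stated assumptions, not the code behind `evaluate_attribution_integrity()`: `tier1_score_sketch`, `ChainInfo`, the per-record `hedged` flag, and the default `max_hops=5` are hypothetical. It also assumes the flagged band is `0.4 <= confidence < 0.6`, that a missing confidence counts as 0.0, and that the session-context penalty scales linearly from 0 at a ratio of 0.5 to −0.15 at a ratio of 0.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ChainInfo:
    """Hypothetical summary of a derivation chain."""
    min_confidence: float
    depth: int


def tier1_score_sketch(
    memories: list[dict[str, Any]],
    chain: Optional[ChainInfo] = None,
    max_hops: int = 5,  # assumed default, not taken from the library
) -> float:
    if not memories:
        raise ValueError("at least one retrieved memory is required")
    n = len(memories)
    score = 1.0

    # Attribution gap: up to -0.4, proportional to the unattributed ratio.
    unattributed = sum(1 for m in memories if not m.get("created_by"))
    score -= 0.4 * unattributed / n

    # Per-record penalties for unhedged low or flagged confidence.
    confidences = [m.get("confidence", 0.0) for m in memories]
    for mem, conf in zip(memories, confidences):
        if mem.get("hedged"):
            continue
        if conf < 0.4:
            score -= 0.25
        elif conf < 0.6:
            score -= 0.10

    # Mean-confidence penalty: -0.5 * (0.6 - mean) when the mean is below 0.6.
    mean = sum(confidences) / n
    if mean < 0.6:
        score -= 0.5 * (0.6 - mean)

    # Session-context penalty: up to -0.15 (linear scaling is an assumption).
    session_ratio = sum(1 for m in memories if m.get("session_id")) / n
    if session_ratio < 0.5:
        score -= 0.15 * (0.5 - session_ratio) / 0.5

    # Chain penalties apply only when chain information is supplied.
    if chain is not None:
        if chain.min_confidence < 0.4:
            score -= 0.20
        if chain.depth > max_hops:
            score -= 0.05

    # Corroboration bonus: at least two distinct attributing agents.
    agents = {m["created_by"] for m in memories if m.get("created_by")}
    if len(agents) >= 2:
        score += 0.05

    # Clamp to [0.0, 1.0].
    return max(0.0, min(1.0, score))
```

Under these assumptions, two attributed records from distinct agents with confidences 0.9 and 0.85 and a shared session incur no penalties, earn the +0.05 bonus, and clamp to 1.0.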

Tier 1 in code

```python
from evaluators import evaluate_attribution_integrity

# Two attributed, high-confidence records from distinct agents in one session.
retrieved = [
    {"id": "m1", "created_by": "agent-A", "confidence": 0.9, "session_id": "s1"},
    {"id": "m2", "created_by": "agent-B", "confidence": 0.85, "session_id": "s1"},
]
result = evaluate_attribution_integrity(
    decision_memory_id="d1",
    retrieved_memories=retrieved,
)
# score is a float in [0.0, 1.0]; the default pass threshold is 0.7.
print(result.score, result.passed)
```

The import path is `from evaluators import …` (top-level package), not `from memledger.evaluators`.

Phoenix span

All three tiers emit OpenTelemetry EVALUATOR spans automatically when Phoenix tracing is wired. The `eval.name` value is `memory_attribution_integrity` for Tiers 1 and 2 and `memory_attribution_integrity_structural` for Tier 3. See Phoenix tracing.
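
When inspecting exported traces, the two `eval.name` values are enough to pick out MAI spans and tell which tiers could have produced them. This is a minimal sketch, assuming spans have already been exported as plain attribute dictionaries; `mai_spans` and `EVAL_NAME_TO_TIERS` are hypothetical names and no OpenTelemetry or Phoenix API is used.

```python
from typing import Any

# eval.name values from the documentation, mapped to the tiers that emit them.
EVAL_NAME_TO_TIERS = {
    "memory_attribution_integrity": ("Tier 1", "Tier 2"),
    "memory_attribution_integrity_structural": ("Tier 3",),
}


def mai_spans(span_attributes: list[dict[str, Any]]) -> list[tuple[tuple[str, ...], dict[str, Any]]]:
    """Keep only MAI evaluator spans, paired with their candidate tiers."""
    matches = []
    for attrs in span_attributes:
        tiers = EVAL_NAME_TO_TIERS.get(attrs.get("eval.name"))
        if tiers is not None:
            matches.append((tiers, attrs))
    return matches
```

Because Tiers 1 and 2 share one `eval.name`, this lookup alone cannot distinguish them.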

Roadmap

DeepEval, Phoenix Evals, LangSmith, OpenAI Evals, TruLens, and AWS Bedrock AgentCore Evaluations adapters are on the v2.2+ roadmap.
