One rubric, every framework
MAI is a rubric, not a tool. The rubric definition lives in one place — the MAI_RUBRIC constant in evaluators/attribution_integrity_ragas.py — and every tier scores against the same intent.
Why this matters
Eval frameworks (RAGAS, DeepEval, Phoenix Evals, LangSmith, OpenAI Evals, TruLens) score response quality — faithfulness, relevance, hallucination. None of them define memory-provenance metrics out of the box.
Without a shared rubric, "did the agent rely on well-attributed memory?" is computed differently per team, per framework, per release. There's no comparability over time, across teams, or across frameworks.
MAI is portable: the same rubric runs as a CI gate (Tier 1), as a release-gate eval (Tier 2 against your existing judge LLM), and as an instrumentation check (Tier 3 against your OTEL trace store). Same scores, comparable trends.
The composition principle
memledger does not replace your eval framework. It contributes one metric — MAI — that any framework can run.
- Already on RAGAS? → Tier 2 plugs in via evaluate_mai_ragas().
- On Phoenix only? → Tier 1 + Tier 3 emit EVALUATOR spans Phoenix renders out of the box.
- No framework yet? → Tier 1 deterministic in unit tests; Tier 3 structural over your OTEL spans.
What "framework-agnostic" actually means
Be precise about what is and isn't framework-agnostic:
- The rubric is framework-agnostic — same definition, scored anywhere.
- The runners are mostly framework-free. Tier 1 and Tier 3 require nothing beyond memledger core. Tier 2 binds to RAGAS today; the underlying judge is provider-agnostic via LiteLLM (Bedrock, OpenAI, Anthropic, Ollama, etc.).
- Runners for DeepEval, Phoenix Evals, LangSmith, OpenAI Evals, TruLens, and AWS Bedrock AgentCore Evaluations are on the roadmap (v2.2+); none of them ships today.
Three tiers ship today: deterministic, RAGAS LLM-as-judge (provider-agnostic via LiteLLM), and structural.
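To make the split between rubric and runner concrete, here is a minimal sketch of one rubric constant shared by several interchangeable runners. Everything in it is hypothetical and for illustration only: RUBRIC is placeholder text standing in for MAI_RUBRIC, and Runner, deterministic_runner, judge_runner, and score_everywhere are invented names, not memledger APIs. The judge runner is stubbed so the sketch runs offline.

```python
from typing import Callable, Dict, List

# Placeholder text; in memledger the real definition is the MAI_RUBRIC constant.
RUBRIC = "Score whether the decision relied on well-attributed memory."

# A runner takes (rubric, sample) and returns a score in [0.0, 1.0].
Runner = Callable[[str, dict], float]


def deterministic_runner(rubric: str, sample: dict) -> float:
    """Hypothetical Tier 1 stand-in: fraction of memories carrying a source agent."""
    memories: List[dict] = sample.get("memories", [])
    if not memories:
        return 0.0  # assumption: nothing retrieved means nothing attributed
    attributed = sum(1 for m in memories if m.get("source_agent"))
    return attributed / len(memories)


def judge_runner(rubric: str, sample: dict) -> float:
    """Hypothetical Tier 2 stand-in: a real runner would send `rubric` to a judge LLM."""
    # Stubbed so the sketch runs offline; it simply reuses the deterministic score.
    return deterministic_runner(rubric, sample)


def score_everywhere(runners: Dict[str, Runner], sample: dict) -> Dict[str, float]:
    """Score one sample with every runner against the single shared rubric."""
    return {name: run(RUBRIC, sample) for name, run in runners.items()}


sample = {"memories": [{"source_agent": "planner"}, {"source_agent": None}]}
scores = score_everywhere(
    {"tier1": deterministic_runner, "tier2": judge_runner}, sample
)
print(scores)
```

The point of the design is that the rubric is passed in as data rather than defined inside each runner, so adding a runner for another framework does not change what is being measured.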
The rubric
Verbatim from MAI_RUBRIC:
Given an agent decision and the memories it retrieved, score whether
the decision relied on well-attributed, sufficiently confident memory.
Score 1 (well-attributed) when:
- Retrieved memories have attribution (source agent, confidence, session)
- Memory confidence >= 0.7 OR decision explicitly hedges on low-confidence data
- No memories in chain with confidence < 0.4 used as basis for decision
- Derivation chains are present and consistent
Score 0 (unattributed) when:
- Decision uses unattributed or low-confidence memory as ground truth
- Contradictory memories ignored
- Memory without session/turn context treated as authoritative
Output a score between 0.0 and 1.0.
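As a rough illustration of how the thresholds in the rubric could be checked deterministically, the sketch below turns three of the "Score 1" conditions into boolean checks and averages them. This is not memledger's Tier 1 implementation: the Memory dataclass, its field names, the used_as_basis flag, and the equal-weight averaging are all assumptions made for the example. Only the 0.7 and 0.4 thresholds come from the rubric text, and the derivation-chain check is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional

HIGH_CONFIDENCE = 0.7  # threshold from the rubric
LOW_CONFIDENCE = 0.4   # threshold from the rubric


@dataclass
class Memory:
    """Hypothetical memory record; field names are assumptions."""
    source_agent: Optional[str]
    session: Optional[str]
    confidence: Optional[float]
    used_as_basis: bool = True  # whether the decision leaned on this memory


def rubric_score(memories: List[Memory], decision_hedges: bool) -> float:
    """Return the fraction of rubric checks that pass, between 0.0 and 1.0."""
    if not memories:
        return 0.0  # assumption: no retrieved memory means no attribution

    # Check 1: every memory carries source agent, session, and confidence.
    attributed = all(
        m.source_agent and m.session and m.confidence is not None
        for m in memories
    )
    # Check 2: confidence >= 0.7, or the decision explicitly hedges.
    confident = decision_hedges or all(
        (m.confidence or 0.0) >= HIGH_CONFIDENCE for m in memories
    )
    # Check 3: no memory below 0.4 is used as the basis for the decision.
    no_weak_basis = not any(
        m.used_as_basis and (m.confidence or 0.0) < LOW_CONFIDENCE
        for m in memories
    )
    checks = [attributed, confident, no_weak_basis]
    return sum(checks) / len(checks)  # equal weighting is an assumption


good = [Memory("planner", "s1", 0.9), Memory("retriever", "s1", 0.8)]
bad = [Memory(None, None, 0.3)]
middling = [Memory("planner", "s1", 0.5)]

print(rubric_score(good, decision_hedges=False))
print(rubric_score(bad, decision_hedges=False))
print(rubric_score(middling, decision_hedges=False))
print(rubric_score(middling, decision_hedges=True))
```

A memory at confidence 0.5 sits between the two thresholds: it fails the 0.7 check unless the decision hedges, but it does not trip the 0.4 floor, which is why the middling case scores differently depending on decision_hedges.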
See also
- MAI — the rubric in depth, plus Tier 1 scoring breakdown
- Tiers — per-tier deep-dive
- Provenance chain
- Confidence flags