One rubric, every framework

MAI is a rubric, not a tool. The rubric definition lives in one place — the MAI_RUBRIC constant in evaluators/attribution_integrity_ragas.py — and every tier scores against the same intent.

Why this matters

Eval frameworks (RAGAS, DeepEval, Phoenix Evals, LangSmith, OpenAI Evals, TruLens) score response quality — faithfulness, relevance, hallucination. None of them define memory-provenance metrics out of the box.

Without a shared rubric, "did the agent rely on well-attributed memory?" is computed differently per team, per framework, per release. There's no comparability over time, across teams, or across frameworks.

MAI is portable: the same rubric runs as a CI gate (Tier 1), as a release-gate eval (Tier 2, against your existing judge LLM), and as an instrumentation check (Tier 3, against your OTEL trace store). Every tier scores the same rubric on the same 0.0 to 1.0 scale, so trends stay comparable.

The composition principle

memledger does not replace your eval framework. It contributes one metric — MAI — that any framework can run.

  • Already on RAGAS? → Tier 2 plugs in via evaluate_mai_ragas().
  • On Phoenix only? → Tier 1 + Tier 3 emit EVALUATOR spans Phoenix renders out of the box.
  • No framework yet? → Tier 1 deterministic in unit tests; Tier 3 structural over your OTEL spans.
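As a rough illustration of what a structural (Tier 3 style) check over exported spans could look like, the sketch below flags spans whose attributes lack the attribution fields the rubric names (source agent, confidence, session). It is not memledger's implementation: the span shape (a plain dict with `name` and `attributes`), the attribute keys, and the function name `spans_missing_attribution` are all assumptions made for this example.

```python
from typing import Any, Iterable, List, Mapping

# Hypothetical attribute keys; real spans may use different names.
REQUIRED_KEYS = ("memory.source_agent", "memory.confidence", "memory.session_id")


def spans_missing_attribution(spans: Iterable[Mapping[str, Any]]) -> List[str]:
    """Return the names of spans that lack any required attribution key."""
    missing = []
    for span in spans:
        attrs = span.get("attributes", {})
        if any(key not in attrs for key in REQUIRED_KEYS):
            missing.append(span.get("name", "<unnamed>"))
    return missing


# Example input: one fully attributed span, one without session context.
example_spans = [
    {
        "name": "memory.retrieve#1",
        "attributes": {
            "memory.source_agent": "planner",
            "memory.confidence": 0.82,
            "memory.session_id": "s-1",
        },
    },
    {
        "name": "memory.retrieve#2",
        "attributes": {"memory.source_agent": "planner", "memory.confidence": 0.9},
    },
]

flagged = spans_missing_attribution(example_spans)
```

Because the check reads only span attributes, it needs no judge LLM and no eval framework, which is the property the "No framework yet?" path relies on.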

What "framework-agnostic" actually means

Be precise about what is and isn't framework-agnostic:

  • The rubric is framework-agnostic — same definition, scored anywhere.
  • The runners are mostly framework-free. Tier 1 and Tier 3 require nothing beyond memledger core. Tier 2 binds to RAGAS today; the underlying judge is provider-agnostic via LiteLLM (Bedrock, OpenAI, Anthropic, Ollama, etc.).
  • Runners for DeepEval, Phoenix Evals, LangSmith, OpenAI Evals, TruLens, and AWS Bedrock AgentCore Evaluations are on the roadmap (v2.2+).

Three tiers ship today: deterministic, RAGAS LLM-as-judge (provider-agnostic via LiteLLM), and structural.

The rubric

Verbatim from MAI_RUBRIC:

Given an agent decision and the memories it retrieved, score whether
the decision relied on well-attributed, sufficiently confident memory.

Score 1 (well-attributed) when:
- Retrieved memories have attribution (source agent, confidence, session)
- Memory confidence >= 0.7 OR decision explicitly hedges on low-confidence data
- No memories in chain with confidence < 0.4 used as basis for decision
- Derivation chains are present and consistent

Score 0 (unattributed) when:
- Decision uses unattributed or low-confidence memory as ground truth
- Contradictory memories ignored
- Memory without session/turn context treated as authoritative

Output a score between 0.0 and 1.0.
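
The threshold conditions in the rubric can be checked without an LLM. The following is a minimal sketch of such a deterministic check, not memledger's Tier 1 code: the `Memory` record, its field names, and `score_mai` are hypothetical. It assumes every retrieved memory counts as a basis for the decision, returns only 0.0 or 1.0, and does not attempt the derivation-chain or contradiction conditions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

HIGH_CONFIDENCE = 0.7   # rubric: "Memory confidence >= 0.7"
FLOOR_CONFIDENCE = 0.4  # rubric: nothing below 0.4 used as a basis


@dataclass(frozen=True)
class Memory:
    """Hypothetical record shape; field names are illustrative only."""
    confidence: float
    source_agent: Optional[str] = None
    session_id: Optional[str] = None


def score_mai(memories: Sequence[Memory], decision_hedges: bool = False) -> float:
    """Return 1.0 if the checkable Score 1 conditions hold, else 0.0."""
    if not memories:
        # Assumption: no retrieved memory means nothing is well-attributed.
        return 0.0
    for m in memories:
        # Attribution must be present (source agent and session).
        if m.source_agent is None or m.session_id is None:
            return 0.0
        # Hedging does not rescue memory below the floor.
        if m.confidence < FLOOR_CONFIDENCE:
            return 0.0
        # Mid-confidence memory is acceptable only if the decision hedges.
        if m.confidence < HIGH_CONFIDENCE and not decision_hedges:
            return 0.0
    return 1.0
```

For example, a single memory with confidence 0.5 and full attribution scores 0.0 when the decision treats it as fact, and 1.0 when the decision explicitly hedges.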

See also