The evaluator tiers

memledger ships three evaluator tiers that score the same MAI rubric against different inputs. The rubric stays fixed; what changes from tier to tier is the cost / fidelity tradeoff.

| Tier | Name | Public API | Install gate | LLM | Determinism | When to use |
| --- | --- | --- | --- | --- | --- | --- |
| Tier 1 | Deterministic MAI | `evaluate_attribution_integrity()` | core (`pip install memledger`) | No | Fully deterministic | CI gates, every-PR runs, latency-critical paths |
| Tier 2 | RAGAS LLM-as-judge | `evaluate_mai_ragas()` | `pip install memledger[eval]` | Yes (LiteLLM-routed) | Judge-dependent | Production eval pipelines that already run a judge LLM |
| Tier 3 | Structural | `evaluate_structural()` / `evaluate_from_memory_records()` | core (`pip install memledger`) | No | Fully deterministic | Instrumentation-correctness checks against OTEL spans |

All three return scores in [0.0, 1.0] with default threshold 0.7.

Tier 1 — Deterministic

Stdlib only. Sub-millisecond. Reads memory record fields directly (`created_by`, `confidence`, `session_id`, `derived_from`, `hedged`) and applies a fixed penalty/bonus structure to a starting score of 1.0. No external dependencies, no flakiness, runs in CI on every PR. The full penalty/bonus breakdown is documented in MAI.
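The shape of that computation can be sketched as follows. This is an illustration only, not memledger's implementation: the `MemoryRecord` dataclass, the `sketch_score` function, the low-confidence cut-off, and every penalty and bonus value are hypothetical placeholders. The real breakdown is the one documented in MAI.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical values for illustration only; the real ones are documented in MAI.
PENALTY_MISSING_CREATED_BY = 0.3
PENALTY_MISSING_SESSION = 0.1
PENALTY_LOW_CONFIDENCE_UNHEDGED = 0.2
BONUS_HAS_LINEAGE = 0.05
LOW_CONFIDENCE = 0.6  # assumed cut-off, not taken from memledger


@dataclass
class MemoryRecord:
    """Stand-in record carrying the five fields Tier 1 is described as reading."""
    created_by: Optional[str] = None
    confidence: Optional[float] = None
    session_id: Optional[str] = None
    derived_from: List[str] = field(default_factory=list)
    hedged: bool = False


def sketch_score(record: MemoryRecord, threshold: float = 0.7) -> Tuple[float, bool]:
    """Start at 1.0, apply fixed penalties and bonuses, clamp to [0.0, 1.0]."""
    score = 1.0
    if not record.created_by:
        score -= PENALTY_MISSING_CREATED_BY
    if not record.session_id:
        score -= PENALTY_MISSING_SESSION
    low = record.confidence is None or record.confidence < LOW_CONFIDENCE
    if low and not record.hedged:
        score -= PENALTY_LOW_CONFIDENCE_UNHEDGED
    if record.derived_from:
        score += BONUS_HAS_LINEAGE
    score = max(0.0, min(1.0, score))
    return score, score >= threshold
```

Because the function is pure and uses only the standard library, the same record always produces the same score, which is the property that makes this style of check suitable for a CI gate.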

Tier 2 — RAGAS LLM-as-judge

Routes through LiteLLM via `$MEMLEDGER_JUDGE_MODEL`. The same `MAI_RUBRIC` text is passed verbatim to RAGAS `AspectCritic`. Latency and cost depend on the judge model. Requires `pip install memledger[eval]`.

Verified end-to-end against `bedrock/us.anthropic.claude-sonnet-4-6`. Smaller or open-source judges occasionally return NaN; memledger handles that gracefully by recording `score=0.0`, `passed=False`.
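The NaN handling described above amounts to a small normalisation step. A minimal sketch, assuming the judge's raw output arrives as a number; `normalize_judge_score` is a hypothetical helper written for this page, not a memledger function, and the clamping to [0.0, 1.0] is an added assumption:

```python
import math
from typing import Tuple


def normalize_judge_score(raw: float, threshold: float = 0.7) -> Tuple[float, bool]:
    """Map a raw judge score to (score, passed); NaN becomes a failing 0.0."""
    value = float(raw)
    if math.isnan(value):
        # A judge that returns NaN is recorded as a failed evaluation.
        return 0.0, False
    score = max(0.0, min(1.0, value))  # assumption: keep the score in [0.0, 1.0]
    return score, score >= threshold
```

Mapping NaN to a failing score keeps downstream comparisons well defined, since any comparison against NaN evaluates to false in Python and would otherwise fail silently.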

Tier 3 — Structural

Stdlib only. Reads OpenTelemetry span attributes (or memory records converted to span-shape via `evaluate_from_memory_records()`) and runs five weighted checks:

`memledger.memory.confidence` populated
Confidence ≥ 0.6 on retrieved memories
`memledger.memory.source_agent_id` present
Chain depth ≤ 5
No low-confidence retrieved memory without `hedged=true`

Tier 3 catches "is the agent instrumented correctly?" bugs that Tier 1's record-field view cannot see.
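The five checks can be sketched over plain attribute dictionaries. This is a rough illustration under stated assumptions, not memledger's code: the two attribute names marked hypothetical, the equal weighting, and the `structural_sketch` function are all invented here, and the real weights are not given on this page.

```python
from typing import Any, Dict, List, Tuple

CONF_KEY = "memledger.memory.confidence"        # named in the text above
AGENT_KEY = "memledger.memory.source_agent_id"  # named in the text above
DEPTH_KEY = "memledger.memory.chain_depth"      # hypothetical attribute name
HEDGED_KEY = "memledger.memory.hedged"          # hypothetical attribute name

MIN_CONFIDENCE = 0.6
MAX_CHAIN_DEPTH = 5


def structural_sketch(
    spans: List[Dict[str, Any]], threshold: float = 0.7
) -> Tuple[float, bool, Dict[str, bool]]:
    """Score span-attribute dicts with five checks, weighted equally (assumption)."""
    confidences = [s.get(CONF_KEY) for s in spans]
    checks = {
        "confidence_populated": all(c is not None for c in confidences),
        "confidence_at_least_min": all(
            c is not None and c >= MIN_CONFIDENCE for c in confidences
        ),
        "source_agent_present": all(bool(s.get(AGENT_KEY)) for s in spans),
        "chain_depth_within_limit": all(
            s.get(DEPTH_KEY, 0) <= MAX_CHAIN_DEPTH for s in spans
        ),
        # Only low-confidence spans are required to carry the hedged flag.
        "low_confidence_is_hedged": all(
            bool(s.get(HEDGED_KEY, False))
            for s in spans
            if s.get(CONF_KEY) is not None and s[CONF_KEY] < MIN_CONFIDENCE
        ),
    }
    score = sum(checks.values()) / len(checks)
    return score, score >= threshold, checks
```

Returning the per-check results alongside the score makes it easy to see which instrumentation attribute is missing when a run fails.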

Same rubric, different runners

The `MAI_RUBRIC` constant in `evaluators/attribution_integrity_ragas.py` is the literal text passed to Tier 2: the LLM judge sees the rubric and scores against it. Tier 1 and Tier 3 implement the same intent against different inputs (records and spans, respectively). All three return a score in [0.0, 1.0] with a default threshold of 0.7, so scores are directly comparable.

Decision tree

Need it in CI on every PR with no LLM cost? → Tier 1
Already have a RAGAS pipeline? → Tier 2
Want to validate your agent's OTEL instrumentation? → Tier 3
Want a defense-in-depth read? → Run all three; agreement across tiers is a stronger signal than any single tier.
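For the defense-in-depth case, agreement across tiers can be checked with a few lines once each tier has produced its score. A minimal sketch, assuming the scores have already been collected into a dictionary; `tier_agreement` and the tier labels are hypothetical and not part of memledger:

```python
from typing import Dict, Tuple


def tier_agreement(
    scores: Dict[str, float], threshold: float = 0.7
) -> Tuple[str, Dict[str, bool]]:
    """Summarise pass/fail agreement across tiers.

    `scores` maps a tier label to its score in [0.0, 1.0].
    """
    verdicts = {tier: score >= threshold for tier, score in scores.items()}
    if all(verdicts.values()):
        summary = "all_pass"
    elif not any(verdicts.values()):
        summary = "all_fail"
    else:
        summary = "disagree"
    return summary, verdicts
```

A `disagree` result is often the most informative one: for example, Tier 1 passing while Tier 3 fails would suggest the records look fine but the spans are missing attributes.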

Phoenix integration

All three tiers emit OpenInference EVALUATOR spans. See Phoenix tracing for the rendering and Eval span attributes for the full attribute inventory.

Roadmap

DeepEval, Phoenix Evals, LangSmith, OpenAI Evals, TruLens, and AWS Bedrock AgentCore Evaluations adapters are on the v2.2+ roadmap.
