The evaluator tiers
memledger ships three tiers that score the same MAI rubric against different inputs. The rubric stays fixed across tiers; what changes is the input each tier reads and the cost / fidelity tradeoff.
| Tier | Name | Public API | Install gate | LLM | Determinism | When to use |
|---|---|---|---|---|---|---|
| Tier 1 | Deterministic MAI | evaluate_attribution_integrity() | core (pip install memledger) | No | Fully deterministic | CI gates, every-PR runs, latency-critical paths |
| Tier 2 | RAGAS LLM-as-judge | evaluate_mai_ragas() | pip install memledger[eval] | Yes (LiteLLM-routed) | Judge-dependent | Production eval pipelines that already run a judge LLM |
| Tier 3 | Structural | evaluate_structural() / evaluate_from_memory_records() | core (pip install memledger) | No | Fully deterministic | Instrumentation-correctness checks against OTEL spans |
All three return scores in [0.0, 1.0] with default threshold 0.7.
Tier 1 — Deterministic
Stdlib only. Sub-millisecond. Reads memory record fields directly — created_by, confidence, session_id, derived_from, hedged — and applies a fixed penalty/bonus structure to a starting score of 1.0. No external dependencies, no flakiness, runs in CI on every PR. The full penalty/bonus breakdown is documented in MAI.
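To make the "fixed penalty/bonus structure" concrete, here is a minimal stdlib-only sketch. It is not memledger's implementation: the function name score_record, the three rules, the 0.6 cutoff, and every penalty and bonus value are assumptions chosen for illustration (the real breakdown is documented in MAI). Only the record field names, the 1.0 starting score, the [0.0, 1.0] range, and the 0.7 default threshold come from this page.

```python
from typing import Any, Mapping, Tuple

# Hypothetical weights for illustration only; the real values live in the MAI docs.
MISSING_CREATED_BY_PENALTY = 0.4
LOW_CONFIDENCE_UNHEDGED_PENALTY = 0.3
HAS_PROVENANCE_BONUS = 0.1
LOW_CONFIDENCE_CUTOFF = 0.6  # assumed cutoff, not taken from Tier 1's rubric


def score_record(record: Mapping[str, Any], threshold: float = 0.7) -> Tuple[float, bool]:
    """Toy deterministic scorer: start at 1.0, apply fixed adjustments, clamp."""
    score = 1.0

    # Penalty: the record does not say which agent created it.
    if not record.get("created_by"):
        score -= MISSING_CREATED_BY_PENALTY

    # Penalty: a low-confidence record that was not marked as hedged.
    confidence = record.get("confidence")
    if (
        confidence is not None
        and confidence < LOW_CONFIDENCE_CUTOFF
        and not record.get("hedged")
    ):
        score -= LOW_CONFIDENCE_UNHEDGED_PENALTY

    # Bonus: the record lists the records it was derived from.
    if record.get("derived_from"):
        score += HAS_PROVENANCE_BONUS

    # Clamp into [0.0, 1.0] and compare against the threshold.
    score = max(0.0, min(1.0, score))
    return score, score >= threshold
```

Because the function reads only plain fields and does no I/O, the same record always produces the same score, which is the property that makes this style of check safe to run on every PR.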
Tier 2 — RAGAS LLM-as-judge
Routes through LiteLLM via $MEMLEDGER_JUDGE_MODEL. The same MAI_RUBRIC text is passed verbatim to RAGAS AspectCritic. Latency and cost depend on the judge model. Requires pip install memledger[eval].
Verified end-to-end against bedrock/us.anthropic.claude-sonnet-4-6. Smaller / OSS judges occasionally return NaN; memledger treats that as a failing result (score=0.0, passed=False).
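The NaN handling described above can be sketched as follows. This is an illustration of the behavior, not memledger's code: JudgeResult and normalize_judge_score are hypothetical names, and clamping out-of-range values into [0.0, 1.0] is an added assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeResult:
    """Hypothetical result shape, used only for this sketch."""
    score: float
    passed: bool


def normalize_judge_score(raw: float, threshold: float = 0.7) -> JudgeResult:
    """Map a raw judge score to a result, treating NaN as a failure."""
    # A judge that returns NaN gave no usable verdict: fail closed.
    if math.isnan(raw):
        return JudgeResult(score=0.0, passed=False)

    # Assumption: clamp stray values into the documented [0.0, 1.0] range.
    score = max(0.0, min(1.0, float(raw)))
    return JudgeResult(score=score, passed=score >= threshold)
```

Failing closed keeps a flaky judge from silently passing a record: a NaN never compares as greater than or equal to the threshold, and reporting an explicit 0.0 also keeps downstream aggregation free of NaN values.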
Tier 3 — Structural
Stdlib only. Reads OpenTelemetry span attributes (or memory records converted to span-shape via evaluate_from_memory_records()) and runs five weighted checks:
- memledger.memory.confidence populated
- Confidence ≥ 0.6 on retrieved memories
- memledger.memory.source_agent_id present
- Chain depth ≤ 5
- No low-confidence retrieved memory without hedged=true
Tier 3 catches "is the agent instrumented correctly?" bugs that Tier 1's record-field view cannot see.
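As a rough illustration of how five weighted checks over span attributes can combine into one score, consider the sketch below. Several details are assumptions rather than memledger's behavior: the equal default weights, the attribute keys for chain depth and the hedged flag, the treatment of a missing chain depth as 0, and the function names. The confidence and source-agent attribute keys and the 0.6 and 5 limits come from the list above.

```python
from typing import Any, Dict, Mapping, Optional

# These two keys appear in the docs above.
CONFIDENCE_KEY = "memledger.memory.confidence"
SOURCE_AGENT_KEY = "memledger.memory.source_agent_id"
# These two keys are hypothetical names chosen for the sketch.
CHAIN_DEPTH_KEY = "memledger.memory.chain_depth"
HEDGED_KEY = "memledger.memory.hedged"

MIN_CONFIDENCE = 0.6
MAX_CHAIN_DEPTH = 5


def structural_checks(attrs: Mapping[str, Any]) -> Dict[str, bool]:
    """Run the five checks against one span's attributes."""
    confidence = attrs.get(CONFIDENCE_KEY)
    has_confidence = isinstance(confidence, (int, float))
    low_confidence = has_confidence and confidence < MIN_CONFIDENCE
    return {
        "confidence_populated": has_confidence,
        "confidence_at_least_min": has_confidence and confidence >= MIN_CONFIDENCE,
        "source_agent_present": bool(attrs.get(SOURCE_AGENT_KEY)),
        # Assumption: a span with no recorded depth counts as depth 0.
        "chain_depth_within_limit": attrs.get(CHAIN_DEPTH_KEY, 0) <= MAX_CHAIN_DEPTH,
        "low_confidence_is_hedged": (not low_confidence) or attrs.get(HEDGED_KEY) is True,
    }


def structural_score(
    attrs: Mapping[str, Any],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Weighted fraction of passed checks, in [0.0, 1.0]."""
    checks = structural_checks(attrs)
    # Assumption: equal weights unless the caller supplies their own.
    weights = weights or {name: 1.0 for name in checks}
    total = sum(weights[name] for name in checks)
    earned = sum(weights[name] for name, ok in checks.items() if ok)
    return earned / total
```

Returning the per-check booleans separately from the score makes it easy to see which instrumentation attribute is missing, which is usually the question a structural check is meant to answer.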
Same rubric, different runners
The MAI_RUBRIC constant in evaluators/attribution_integrity_ragas.py is the literal text passed to Tier 2: the LLM judge sees the rubric and scores against it. Tier 1 and Tier 3 implement the same intent against different inputs (records and spans, respectively). All three return scores in [0.0, 1.0] with a default threshold of 0.7, so scores are directly comparable.
Decision tree
- Need it in CI on every PR with no LLM cost? → Tier 1
- Already have a RAGAS pipeline? → Tier 2
- Want to validate your agent's OTEL instrumentation? → Tier 3
- Want a defense-in-depth read? → Run all three; agreement across tiers is a stronger signal than any single tier.
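For the defense-in-depth case, one way to read the three scores together is sketched below. The summarize_tiers helper is hypothetical and not part of memledger; it assumes you have already collected one score per tier and only relies on the shared [0.0, 1.0] range and 0.7 default threshold.

```python
from typing import Dict, Mapping


def summarize_tiers(scores: Mapping[str, float], threshold: float = 0.7) -> Dict[str, object]:
    """Compare per-tier scores: per-tier verdicts, agreement, and score spread."""
    if not scores:
        raise ValueError("at least one tier score is required")

    # Each tier passes or fails against the same threshold.
    verdicts = {tier: score >= threshold for tier, score in scores.items()}
    return {
        "verdicts": verdicts,
        # True when every tier reached the same pass/fail verdict.
        "all_agree": len(set(verdicts.values())) == 1,
        # Gap between the highest and lowest score; a large gap is worth a look.
        "spread": max(scores.values()) - min(scores.values()),
    }


summary = summarize_tiers({"tier1": 0.9, "tier2": 0.8, "tier3": 0.4})
print(summary["all_agree"])  # False: tier3 fails while tier1 and tier2 pass
```

A disagreement like the one above is often more informative than the scores themselves: clean records with a failing structural tier suggests an instrumentation gap rather than an attribution problem.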
Phoenix integration
All three tiers emit OpenInference EVALUATOR spans. See Phoenix tracing for the rendering and Eval span attributes for the full attribute inventory.
Roadmap
DeepEval, Phoenix Evals, LangSmith, OpenAI Evals, TruLens, and AWS Bedrock AgentCore Evaluations adapters are on the v2.2+ roadmap.