The evaluator tiers

memledger ships three evaluator tiers that score the same MAI rubric against different inputs. What changes between them is the input each tier reads and the resulting cost / fidelity tradeoff.

| Tier | Name | Public API | Install gate | LLM | Determinism | When to use |
|---|---|---|---|---|---|---|
| Tier 1 | Deterministic MAI | evaluate_attribution_integrity() | core (pip install memledger) | No | Fully deterministic | CI gates, every-PR runs, latency-critical paths |
| Tier 2 | RAGAS LLM-as-judge | evaluate_mai_ragas() | pip install memledger[eval] | Yes (LiteLLM-routed) | Judge-dependent | Production eval pipelines that already run a judge LLM |
| Tier 3 | Structural | evaluate_structural() / evaluate_from_memory_records() | core (pip install memledger) | No | Fully deterministic | Instrumentation-correctness checks against OTEL spans |

All three return scores in [0.0, 1.0] with default threshold 0.7.

Tier 1 — Deterministic

Stdlib only. Sub-millisecond. Reads memory record fields directly — created_by, confidence, session_id, derived_from, hedged — and applies a fixed penalty/bonus structure to a starting score of 1.0. No external dependencies, no flakiness, runs in CI on every PR. The full penalty/bonus breakdown is documented in MAI.
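To make the "fixed penalty/bonus structure" concrete, here is a minimal sketch of that scoring shape. It is not memledger's implementation: the MemoryRecord class, the specific penalties and bonus, and the 0.6 low-confidence cutoff are hypothetical stand-ins chosen for illustration, and the real values are the ones documented in MAI. Only the field names and the starting score of 1.0 come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MemoryRecord:
    # Field names come from the text above; the types are assumptions.
    created_by: Optional[str] = None
    confidence: Optional[float] = None
    session_id: Optional[str] = None  # carried for completeness, not scored here
    derived_from: List[str] = field(default_factory=list)
    hedged: bool = False


# Hypothetical weights, chosen only to show the shape of the calculation.
MISSING_AUTHOR_PENALTY = 0.3
MISSING_CONFIDENCE_PENALTY = 0.2
UNHEDGED_LOW_CONFIDENCE_PENALTY = 0.3
PROVENANCE_BONUS = 0.1
LOW_CONFIDENCE = 0.6


def score_record(record: MemoryRecord) -> float:
    """Start at 1.0, apply fixed penalties and bonuses, clamp to [0.0, 1.0]."""
    score = 1.0
    if not record.created_by:
        score -= MISSING_AUTHOR_PENALTY
    if record.confidence is None:
        score -= MISSING_CONFIDENCE_PENALTY
    elif record.confidence < LOW_CONFIDENCE and not record.hedged:
        score -= UNHEDGED_LOW_CONFIDENCE_PENALTY
    if record.derived_from:
        score += PROVENANCE_BONUS
    return max(0.0, min(1.0, score))
```

Because every branch depends only on record fields, the same record always yields the same score, which is the property that makes this tier suitable for CI gates.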

Tier 2 — RAGAS LLM-as-judge

Routes through LiteLLM via $MEMLEDGER_JUDGE_MODEL. The same MAI_RUBRIC text is passed verbatim to RAGAS AspectCritic. Latency and cost depend on the judge model. Requires pip install memledger[eval].

Verified end-to-end against bedrock/us.anthropic.claude-sonnet-4-6. Smaller / OSS judges occasionally return NaN — memledger gracefully treats that as score=0.0, passed=False.
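The NaN fallback described above can be sketched as follows. This is an illustration under stated assumptions rather than memledger's code: EvalResult and coerce_judge_score are hypothetical names, and treating a score equal to the threshold as a pass is an assumption the text does not confirm.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    # Hypothetical result shape; the text only names score and passed.
    score: float
    passed: bool


def coerce_judge_score(raw: Optional[float], threshold: float = 0.7) -> EvalResult:
    """Map a raw judge score to a result, treating NaN or None as a failure."""
    if raw is None or math.isnan(raw):
        # The judge produced no usable verdict: fail closed.
        return EvalResult(score=0.0, passed=False)
    score = max(0.0, min(1.0, float(raw)))
    # Assumption: a score equal to the threshold passes.
    return EvalResult(score=score, passed=score >= threshold)
```

Failing closed means a judge that cannot produce a verdict never lets a record through by accident, at the cost of occasional false failures with weaker judge models.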

Tier 3 — Structural

Stdlib only. Reads OpenTelemetry span attributes (or memory records converted to span-shape via evaluate_from_memory_records()) and runs five weighted checks:

  1. memledger.memory.confidence populated
  2. Confidence ≥ 0.6 on retrieved memories
  3. memledger.memory.source_agent_id present
  4. Chain depth ≤ 5
  5. No low-confidence retrieved memory without hedged=true

Tier 3 catches "is the agent instrumented correctly?" bugs that Tier 1's record-field view cannot see.
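The five checks listed above can be sketched as a weighted sum over span attributes. This is a rough illustration, not the library's implementation. The two attribute keys memledger.memory.confidence and memledger.memory.source_agent_id come from the list above; the chain-depth and hedged attribute keys, the equal 0.2 weights, and the treatment of a missing confidence value are assumptions made for the example.

```python
CONFIDENCE = "memledger.memory.confidence"          # named in the text
SOURCE_AGENT = "memledger.memory.source_agent_id"   # named in the text
CHAIN_DEPTH = "memledger.memory.chain_depth"        # hypothetical attribute key
HEDGED = "memledger.memory.hedged"                  # hypothetical attribute key
MIN_CONFIDENCE = 0.6
MAX_CHAIN_DEPTH = 5


def _confidence(span):
    return span.get(CONFIDENCE)


def _is_low_confidence(span):
    value = _confidence(span)
    return value is not None and value < MIN_CONFIDENCE


# (name, weight, per-span predicate). Equal weights are an assumption.
CHECKS = [
    ("confidence_populated", 0.2, lambda s: _confidence(s) is not None),
    ("confidence_meets_floor", 0.2,
     lambda s: _confidence(s) is not None and _confidence(s) >= MIN_CONFIDENCE),
    ("source_agent_present", 0.2, lambda s: bool(s.get(SOURCE_AGENT))),
    ("chain_depth_within_limit", 0.2,
     lambda s: s.get(CHAIN_DEPTH, 0) <= MAX_CHAIN_DEPTH),
    ("low_confidence_is_hedged", 0.2,
     lambda s: not _is_low_confidence(s) or s.get(HEDGED) is True),
]


def structural_score(spans):
    """Return (score, per-check results); a check passes only if every span passes."""
    results = {name: all(check(s) for s in spans) for name, _, check in CHECKS}
    score = sum(weight for name, weight, _ in CHECKS if results[name])
    return min(1.0, score), results
```

Here each span is a plain dict of attributes. Returning the per-check results alongside the score makes it easy to see which instrumentation gap caused a low score.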

Same rubric, different runners

The MAI_RUBRIC constant in evaluators/attribution_integrity_ragas.py is the literal text passed to Tier 2: the LLM judge sees the rubric and scores against it. Tier 1 and Tier 3 implement the same intent against different inputs (records and spans, respectively). All three return scores in [0.0, 1.0] with a default threshold of 0.7, so scores are directly comparable.

Decision tree

  • Need it in CI on every PR with no LLM cost? → Tier 1
  • Already have a RAGAS pipeline? → Tier 2
  • Want to validate your agent's OTEL instrumentation? → Tier 3
  • Want a defense-in-depth read? → Run all three; agreement across tiers is a stronger signal than any single tier.
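For the defense-in-depth option, one simple way to read agreement across tiers is to compare each tier's pass/fail verdict at the shared threshold. The helper below is a hypothetical sketch, not part of memledger. It assumes you have already collected one score per tier, and that a score equal to the threshold counts as a pass.

```python
def tiers_agree(scores, threshold=0.7):
    """Return (unanimous, verdicts) for a mapping of tier name to score."""
    verdicts = {tier: score >= threshold for tier, score in scores.items()}
    # Unanimous when every tier lands on the same side of the threshold.
    unanimous = len(set(verdicts.values())) == 1
    return unanimous, verdicts


unanimous, verdicts = tiers_agree({"tier1": 0.9, "tier2": 0.4, "tier3": 1.0})
if not unanimous:
    # A split verdict is a cue to inspect the record, not an automatic failure.
    print("Tiers disagree:", verdicts)
```

A split such as Tier 1 and Tier 3 passing while Tier 2 fails suggests the record is well-formed structurally but the judge found the attribution itself unconvincing.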

Phoenix integration

All three tiers emit OpenInference EVALUATOR spans. See Phoenix tracing for the rendering and Eval span attributes for the full attribute inventory.

Roadmap

DeepEval, Phoenix Evals, LangSmith, OpenAI Evals, TruLens, and AWS Bedrock AgentCore Evaluations adapters are on the v2.2+ roadmap.