Day 57 of 60 · AI-system specific

LLM-as-judge evaluation

For most AI products, there is no programmatic ground truth. A separate model that grades outputs against a strict rubric is often the only oracle you have; keep it honest with periodic human spot-checks.

Problem

Generated text, code, and answers have no programmatic ground truth.

How it works

A separate LLM grades outputs against a strict rubric. Statistical sampling controls cost. Pair with human spot-checks to detect judge bias.
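The grading loop described above can be sketched roughly as follows. This is a minimal illustration, not any particular tool's API: the rubric text, the 1 to 5 scale, the record keys `question` and `answer`, and the names `grade_sample` and `stub_judge` are all hypothetical. The judge is passed in as a plain callable so that a real model client can replace the stub.

```python
import json
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical rubric; a real one would be tailored to the product.
RUBRIC = (
    "Score the ANSWER to the QUESTION from 1 (wrong or off-topic) to 5 "
    "(correct, complete, grounded). Reply with JSON only: "
    '{"score": <int 1-5>, "reason": "<one sentence>"}'
)


@dataclass
class Verdict:
    score: int
    reason: str


def build_prompt(question: str, answer: str) -> str:
    """Combine the fixed rubric with one output to be graded."""
    return f"{RUBRIC}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}"


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply and reject scores outside the rubric."""
    data = json.loads(raw)
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of rubric range: {score}")
    return Verdict(score, str(data.get("reason", "")))


def grade_sample(
    records: list[dict],
    judge: Callable[[str], str],
    sample_size: int,
    seed: int = 0,
) -> list[Verdict]:
    """Grade a random sample of records instead of all of them to cap cost."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    sample = rng.sample(records, min(sample_size, len(records)))
    return [
        parse_verdict(judge(build_prompt(r["question"], r["answer"])))
        for r in sample
    ]


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call (assumption: the judge returns JSON text)."""
    return json.dumps({"score": 4, "reason": "stub verdict"})


if __name__ == "__main__":
    records = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(100)]
    verdicts = grade_sample(records, stub_judge, sample_size=10)
    print(len(verdicts), sum(v.score for v in verdicts) / len(verdicts))
```

Rejecting malformed or out-of-range replies in `parse_verdict` is one way to keep the rubric "strict": a judge reply that does not fit the schema fails loudly instead of being silently averaged in.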

What it catches

Subjective quality drift, prompt regressions, and model-version effects. Periodic human spot-checks additionally surface bias in the judge itself. For many AI products it is the only oracle available. DeepEval ships 50+ research-backed metrics with native pytest integration.
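One way to turn human spot-checks into a judge-bias signal is to compare judge and human labels on the same sample. The sketch below computes Cohen's kappa (agreement corrected for chance) and the gap in pass rates. The `pass`/`fail` labels and the example data are made up for illustration, and choosing kappa as the statistic is an assumption here, not something the lesson prescribes.

```python
from collections import Counter


def cohen_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


def pass_rate(labels: list[str]) -> float:
    return labels.count("pass") / len(labels)


if __name__ == "__main__":
    # Hypothetical spot-check sample: same eight outputs, two raters.
    judge = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
    human = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail"]
    print(f"kappa = {cohen_kappa(judge, human):.3f}")
    # A positive gap suggests the judge is more lenient than the humans.
    print(f"leniency gap = {pass_rate(judge) - pass_rate(human):+.3f}")
```

Low kappa says the judge and the humans disagree; the sign of the pass-rate gap hints at the direction of the disagreement. What counts as "too low" is a team decision and is not specified by the lesson.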

Tools

DeepEval (OSS) · Promptfoo (OSS) · Ragas (OSS) · TruLens (OSS) · inspect_ai (OSS) · Langfuse Evals (Hybrid)

Verdict by project size

Project size	Verdict
Small	Optional
Medium	Recommended
Large	Must
Extra-large	Must

Cost

Project size	Setup	Maint / mo	Tool / mo	CI / run
Small <10k LOC	1d	2h	$50	+5m
Medium 10–100k LOC	3d	10h	$300	+15m
Large 100k–1M LOC	15d	50h	$2k	+30m
Extra-large >1M LOC	50d	200h	$10k	+60m

Setup = engineer-days to first useful run · Maint = engineer-hours / month at steady state · Tool = out-of-pocket $ / month · CI = minutes added (or saved) per pipeline run

Lifecycle & ownership

When in lifecycle

Test · Operate · Observe

Per merge · Runs after merge to main; nightly heavy jobs.
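The split between per-merge runs and nightly heavy jobs can be illustrated with a sampling-aware gate. The sketch below uses a normal-approximation margin of error on the judged pass rate and flags a regression only when the whole interval sits below the baseline. The sample sizes (50 and 500), the 0.90 baseline, and the function names are assumptions chosen for illustration.

```python
import math


def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n samples."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)


def gate_ok(passed: int, n: int, baseline: float, z: float = 1.96) -> bool:
    """Return False only when the pass rate is below baseline beyond sampling noise."""
    if n <= 0 or not 0 <= passed <= n:
        raise ValueError("need n > 0 and 0 <= passed <= n")
    rate = passed / n
    return rate + margin_of_error(rate, n, z) >= baseline


if __name__ == "__main__":
    baseline = 0.90  # hypothetical pass rate of the last accepted build
    # Per-merge: small sample, wide interval, so an 80% rate is inconclusive.
    print("per-merge ok:", gate_ok(passed=40, n=50, baseline=baseline))
    # Nightly: larger sample, narrow interval, so the same 80% rate is flagged.
    print("nightly ok:  ", gate_ok(passed=400, n=500, baseline=baseline))
```

The same observed pass rate passes the small per-merge gate and fails the larger nightly one, which is the trade-off behind keeping the cheap run on every merge and reserving the heavy sample for the nightly job.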

Who owns it

ML / AI Engineer

Models, evals, drift, guardrails

Collaborates with: Developer, Security / AppSec

Reference implementations

DeepEval examples: LLM evaluation examples, including model-graded quality checks.
OpenAI Evals: Evaluation registry and examples for model-graded and task-specific checks.
Langfuse evals: Production scoring and evaluation workflow for LLM applications.

Quick check

LLM-as-judge is the only oracle for many AI products. Its biggest risk?

