Day 57 of 60 · AI-system specific
LLM-as-judge evaluation
For most AI products, there is no programmatic ground truth. A separate model that grades outputs against a strict rubric is the only oracle you have; keep it honest with periodic human spot-checks.
Problem
Generated text, code, and answers have no programmatic ground truth.
How it works
A separate LLM grades outputs against a strict rubric. Statistical sampling controls cost. Pair with human spot-checks to detect judge bias.
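The three moving parts above (rubric grading, sampling, and human spot-checks) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the rubric text, the `Output` type, the 1-5 scale, and the `fake_judge` stand-in are all assumptions made for the example. In practice `judge` would wrap a call to whichever model you use as the grader.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical rubric; a real one would be specific to the product.
RUBRIC = (
    "Score the answer from 1 to 5.\n"
    "5 = correct, complete, and on topic.\n"
    "3 = partially correct or missing a key detail.\n"
    "1 = incorrect or unsupported.\n"
    "Reply with the integer score only."
)


@dataclass
class Output:
    question: str
    answer: str


def build_prompt(item: Output) -> str:
    """Combine the fixed rubric with one output to be graded."""
    return f"{RUBRIC}\n\nQuestion: {item.question}\nAnswer: {item.answer}\nScore:"


def parse_score(reply: str) -> int:
    """Parse the judge's reply; reject anything outside the rubric range."""
    score = int(reply.strip())
    if not 1 <= score <= 5:
        raise ValueError(f"score {score} is outside the rubric range 1-5")
    return score


def grade_sample(
    outputs: List[Output],
    judge: Callable[[str], str],
    sample_size: int,
    seed: int = 0,
) -> Dict[int, int]:
    """Grade a random subset of outputs to cap judge cost.

    Returns a mapping from output index to judge score.
    """
    rng = random.Random(seed)
    k = min(sample_size, len(outputs))
    picked = rng.sample(range(len(outputs)), k)
    return {i: parse_score(judge(build_prompt(outputs[i]))) for i in picked}


def agreement_rate(
    judge_scores: Dict[int, int],
    human_scores: Dict[int, int],
    tolerance: int = 1,
) -> float:
    """Share of jointly scored items where judge and human differ by at most `tolerance`."""
    shared = judge_scores.keys() & human_scores.keys()
    if not shared:
        raise ValueError("no items were scored by both judge and human")
    hits = sum(1 for i in shared if abs(judge_scores[i] - human_scores[i]) <= tolerance)
    return hits / len(shared)


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "4"
```

A falling `agreement_rate` on the human-reviewed subset is one signal that the judge has drifted or is biased. The tolerance and the acceptable rate are choices to tune per project.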
What it catches
Subjective quality drift, prompt regressions, and model-version effects. Periodic human spot-checks additionally surface bias in the judge itself. For many AI products, this is the only oracle available. DeepEval ships 50+ research-backed metrics with native pytest integration.
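One way to turn judge scores into a regression check is to compare pass rates between a baseline run and a candidate run (for example, before and after a prompt or model-version change). The sketch below assumes a 1-5 score scale, a pass threshold of 4, and a maximum tolerated drop of 5 percentage points. All three are illustrative values, not recommendations from any tool.

```python
from typing import Sequence


def pass_rate(scores: Sequence[int], threshold: int = 4) -> float:
    """Fraction of judge scores at or above the pass threshold."""
    if not scores:
        raise ValueError("cannot compute a pass rate from zero scores")
    return sum(1 for s in scores if s >= threshold) / len(scores)


def regression_detected(
    baseline: Sequence[int],
    candidate: Sequence[int],
    max_drop: float = 0.05,
    threshold: int = 4,
) -> bool:
    """True when the candidate's pass rate falls more than `max_drop` below the baseline's."""
    drop = pass_rate(baseline, threshold) - pass_rate(candidate, threshold)
    return drop > max_drop
```

With small samples the pass rate is noisy, so a gate like this is more trustworthy when both runs grade the same inputs and the sample is large enough for the difference to be meaningful.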
Tools
DeepEval (OSS) · Promptfoo (OSS) · Ragas (OSS) · TruLens (OSS) · inspect_ai (OSS) · Langfuse Evals (Hybrid)
Verdict by project size
| Project size | Verdict |
|---|---|
| Small | Optional |
| Medium | Recommended |
| Large | Must |
| Extra-large | Must |
Cost
| Project size | Setup | Maint / mo | Tool / mo | CI / run |
|---|---|---|---|---|
| Small <10k LOC | 1d | 2h | $50 | +5m |
| Medium 10–100k LOC | 3d | 10h | $300 | +15m |
| Large 100k–1M LOC | 15d | 50h | $2k | +30m |
| Extra-large >1M LOC | 50d | 200h | $10k | +60m |
Setup = engineer-days to first useful run · Maint = engineer-hours / month at steady state · Tool = out-of-pocket $ / month · CI = minutes added (or saved) per pipeline run
Lifecycle & ownership
When in lifecycle
Test · Operate · Observe
Per merge: runs after merge to main, with heavier jobs run nightly.
Who owns it
ML / AI Engineer
Models, evals, drift, guardrails
Collaborates with: Developer, Security / AppSec
Reference implementations
- DeepEval examples: LLM evaluation examples including model-graded quality checks.
- OpenAI Evals: Evaluation registry and examples for model-graded and task-specific checks.
- Langfuse evals: Production scoring and evaluation workflow for LLM applications.
Quick check
LLM-as-judge is the only oracle for many AI products. Its biggest risk?