How we measured quality before shipping a Copilot-style feature
Golden datasets, automated checks, human rubrics, LLM-as-judge caveats, and CI smoke tests—with Hugging Face and OpenAI evaluation links.
“Feels snappier” is not a release gate. Assistant-style features need repeatable quality metrics, regression detection when prompts or models change, and cost/latency tracked alongside accuracy.
1. Build a versioned golden set
Curate 50–300 representative tasks drawn from real usage (anonymized). For each item store:
- Input (user prompt + optional context IDs).
- Expected behavior: not always a single “gold answer”—often constraints (“must cite policy section”, “must refuse PII requests”, “must return JSON matching schema”).
Check the dataset into git (or a private registry) with a version tag: golden-v1.3. When you change prompts or models, re-run the suite and diff pass/fail.
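A golden-set item and a pass/fail diff between two runs can be sketched like this. The field names (`context_ids`, `constraints`) and the JSONL storage format are assumptions for illustration, not a fixed schema:

```python
import json
from dataclasses import dataclass, field

@dataclass
class GoldenItem:
    """One task in the golden set; constraints, not a single gold answer."""
    id: str
    prompt: str
    context_ids: list = field(default_factory=list)
    constraints: list = field(default_factory=list)  # e.g. "must_cite:policy-4.2"

def load_golden(path):
    """Load golden-vX.Y.jsonl: one JSON object per line."""
    with open(path) as f:
        return [GoldenItem(**json.loads(line)) for line in f]

def diff_runs(old: dict, new: dict) -> list:
    """Items that passed in the old run but fail in the new one (regressions).
    Each dict maps item id -> bool (passed)."""
    old_pass = {k for k, ok in old.items() if ok}
    new_pass = {k for k, ok in new.items() if ok}
    return sorted(old_pass - new_pass)
```

The diff is the artifact you review before shipping a prompt or model change: an empty list means no regressions against the tagged dataset version.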
2. Automated checks (fast, cheap)
Layer checks from strict to fuzzy:
| Check | Example |
|---|---|
| Schema | Parse model output as JSON; validate with JSON Schema or Pydantic |
| Tool simulation | If the assistant emits “actions”, verify arguments against allowed enums |
| String contains / regex | For support macros: must include disclaimer text |
| Embedding similarity | Optional: cosine similarity to reference answers—watch for semantic false positives |
OpenAI documents evaluations concepts for their platform; the ideas transfer to offline harnesses.
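The strict-to-fuzzy layering above can be collapsed into one check function. The action enum and disclaimer pattern below are hypothetical stand-ins; a real harness would validate against your actual schema (e.g. with Pydantic) instead of hand-rolled key checks:

```python
import json
import re

ALLOWED_ACTIONS = {"create_ticket", "escalate", "close"}   # hypothetical tool enum
DISCLAIMER = re.compile(r"not legal advice", re.I)          # hypothetical required text

def check_output(raw: str) -> list:
    """Run checks strict-to-fuzzy; return failure reasons (empty list = pass)."""
    try:
        data = json.loads(raw)                  # 1. schema: must parse as JSON
    except json.JSONDecodeError:
        return ["invalid_json"]                 # strictest layer failed; stop early
    failures = []
    if data.get("action") not in ALLOWED_ACTIONS:       # 2. tool simulation
        failures.append("unknown_action")
    if not DISCLAIMER.search(data.get("reply", "")):    # 3. contains / regex
        failures.append("missing_disclaimer")
    return failures
```

Returning reasons rather than a bare boolean makes the weekly failure breakdown (schema vs. policy vs. tooling) free to compute.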
3. Human review for what automation misses
Schedule periodic human grading on a sample:
- Correctness (domain expert).
- Tone and safety (policy/comms).
- Citation accuracy if the product claims to quote sources.
Use a 1–5 rubric per dimension; measuring inter-rater agreement improves trust in the scores—compute the classic Cohen’s kappa if you want statistical rigor.
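Cohen’s kappa is short enough to compute without a stats dependency. A minimal sketch for two raters (rubric scores are often binarized to pass/fail first, an assumption made here):

```python
def cohens_kappa(a: list, b: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance agreement
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

Values near 1 mean raters agree beyond chance; values near 0 mean the rubric is too ambiguous to trust and needs tightening before the scores mean anything.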
4. LLM-as-judge: use with care
Using a strong model to score another model’s output is convenient but biased toward the judge’s preferences and can favor verbose answers. If you use it:
- Keep a frozen judge prompt versioned in git.
- Spot-check judge vs human decisions.
- Never let the judge be the only signal for safety-critical behavior.
Anthropic discusses evaluation tradeoffs in their documentation; academic surveys on LLM evaluation are evolving quickly—treat vendor blogs as starting points, not proof.
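The spot-check in the list above can be automated as a calibration gate: score agreement between the frozen judge’s verdicts and human verdicts on a shared sample, and re-calibrate the judge prompt when it drifts. The 0.85 threshold below is an arbitrary illustration, not a recommendation:

```python
def judge_human_agreement(judge: dict, human: dict) -> float:
    """Fraction of overlapping samples where the LLM judge matched the human
    verdict. Both dicts map item id -> verdict label."""
    shared = judge.keys() & human.keys()
    if not shared:
        raise ValueError("no overlapping samples to compare")
    return sum(judge[k] == human[k] for k in shared) / len(shared)

JUDGE_AGREEMENT_FLOOR = 0.85  # hypothetical gate: below this, stop trusting the judge
```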
5. Pair quality with cost and latency
Report a small scorecard weekly:
- Success rate on golden set (primary).
- p95 latency per task type.
- Tokens per successful task (prompt + completion).
A prompt change that adds 40% more tokens without improving the success rate is a regression, even though no quality metric dropped.
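The three scorecard numbers can come out of one pass over the run log. The per-run record shape (`ok`, `latency_ms`, token counts) is an assumed logging convention:

```python
def scorecard(runs: list) -> dict:
    """Weekly scorecard from per-run records:
    {"ok": bool, "latency_ms": int, "prompt_tokens": int, "completion_tokens": int}."""
    succeeded = [r for r in runs if r["ok"]]
    latencies = sorted(r["latency_ms"] for r in runs)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return {
        "success_rate": len(succeeded) / len(runs),
        "p95_latency_ms": p95,
        "tokens_per_success": sum(
            r["prompt_tokens"] + r["completion_tokens"] for r in succeeded
        ) / max(len(succeeded), 1),
    }
```

Dividing tokens by *successful* tasks (not all tasks) is deliberate: it surfaces prompt bloat that buys nothing, which is exactly the regression described above.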
6. CI smoke tests
Run a small subset of the golden set on every PR against a non-prod endpoint (or mocked model) to catch catastrophic JSON/schema breakages. Full nightly jobs may call live APIs—budget accordingly and use rate-limit-aware clients.
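A minimal PR-time smoke test, sketched with a stubbed model call. The item IDs are placeholders, and `call_model` would point at your non-prod endpoint or mock in a real CI job:

```python
import json

SMOKE_IDS = ["refund-001", "pii-refusal-002", "json-ticket-003"]  # hypothetical subset

def call_model(item_id: str) -> str:
    """In CI this hits a non-prod endpoint or a mocked model; stubbed here."""
    return json.dumps({"action": "create_ticket", "reply": "Ticket created."})

def smoke_failures() -> list:
    """IDs whose output is catastrophically broken (unparseable or missing the
    action field). CI fails the PR if this returns anything."""
    bad = []
    for item_id in SMOKE_IDS:
        try:
            data = json.loads(call_model(item_id))
            if "action" not in data:
                bad.append(item_id)
        except json.JSONDecodeError:
            bad.append(item_id)
    return bad
```

Keeping the smoke tier to parse-level checks only keeps PR feedback fast and cheap; semantic grading stays in the nightly run.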
Hugging Face’s evaluate library is useful if you standardize on open metrics, but pick carefully: exact match and BLEU rarely fit conversational tasks.
References
- OpenAI: Evals best practices — structuring tests and metrics
- NIST AI RMF — organizational context for evaluation and governance (high level)
- OWASP LLM Top 10 — abuse cases to include in test sets (e.g. prompt injection attempts)
Related: LLM cost controls, Azure OpenAI production patterns.