How we measured quality before shipping a Copilot-style feature

Golden datasets, automated checks, human rubrics, LLM-as-judge caveats, and CI smoke tests—with Hugging Face and OpenAI evaluation links.

“Feels snappier” is not a release gate. Assistant-style features need repeatable quality metrics, regression detection when prompts or models change, and cost/latency tracked alongside accuracy.

1. Build a versioned golden set

Curate 50–300 representative tasks drawn from real usage (anonymized). For each item store:

  • Input (user prompt + optional context IDs).
  • Expected behavior: not always a single “gold answer”—often constraints (“must cite policy section”, “must refuse PII requests”, “must return JSON matching schema”).

Check the dataset into git (or a private registry) with a version tag such as golden-v1.3. When you change prompts or models, re-run the suite and diff the pass/fail results against the previous run.
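A minimal sketch of that workflow in Python; the item fields and result layout are illustrative assumptions, not a standard:

```python
import json
from dataclasses import dataclass, field

# Hypothetical golden-set item; field names are illustrative.
@dataclass
class GoldenItem:
    id: str
    input: str                                        # user prompt
    context_ids: list = field(default_factory=list)   # optional context IDs
    constraints: list = field(default_factory=list)   # e.g. "must return JSON"

def diff_runs(old: dict, new: dict) -> dict:
    """Compare pass/fail maps keyed by item id between two suite runs."""
    regressions = [k for k in old if old[k] and not new.get(k, False)]
    fixes = [k for k in old if not old[k] and new.get(k, False)]
    return {"regressions": regressions, "fixes": fixes}

# Example: results from golden-v1.3 vs a candidate prompt change.
old = {"t1": True, "t2": True, "t3": False}
new = {"t1": True, "t2": False, "t3": True}
print(diff_runs(old, new))  # {'regressions': ['t2'], 'fixes': ['t3']}
```

Storing results keyed by item id (rather than positionally) keeps diffs stable when items are added or reordered between versions.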

2. Automated checks (fast, cheap)

Layer checks from strict to fuzzy:

  • Schema: parse model output as JSON; validate with JSON Schema or Pydantic.
  • Tool simulation: if the assistant emits “actions”, verify arguments against allowed enums.
  • String contains / regex: for support macros, the output must include disclaimer text.
  • Embedding similarity (optional): cosine similarity to reference answers; watch for semantic false positives.
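The stricter layers need only the standard library. A sketch, where the required keys, regex, and allowed action names are hypothetical examples:

```python
import json
import re

def check_schema(output: str, required_keys: set) -> bool:
    """Strict: output must parse as JSON and contain the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def check_contains(output: str, pattern: str) -> bool:
    """Fuzzier: required disclaimer text must appear somewhere."""
    return re.search(pattern, output) is not None

# Tool simulation: emitted action names must come from an allowed enum.
ALLOWED_ACTIONS = {"open_ticket", "close_ticket", "escalate"}

def check_action(action: dict) -> bool:
    return action.get("name") in ALLOWED_ACTIONS

out = '{"answer": "See section 4.2", "citation": "policy-4.2"}'
print(check_schema(out, {"answer", "citation"}))  # True
```

Running the strict checks first lets you skip expensive fuzzy checks on outputs that are already broken.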

OpenAI documents evaluation concepts for its platform; the ideas transfer to offline harnesses.

3. Human review for what automation misses

Schedule periodic human grading on a sample:

  • Correctness (domain expert).
  • Tone and safety (policy/comms).
  • Citation accuracy if the product claims to quote sources.

Use a rubric (1–5) per dimension; inter-rater agreement improves trust. See the classic Cohen’s kappa statistic if you want statistical rigor.
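Cohen’s kappa corrects raw agreement for the agreement two raters would reach by chance, given their score distributions. A self-contained sketch (the sample scores are made up):

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters scoring the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items with identical scores.
    po = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal score frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    pe = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (po - pe) / (1 - pe)

a = [5, 4, 4, 3, 5, 2]  # hypothetical rubric scores, rater A
b = [5, 4, 3, 3, 5, 2]  # hypothetical rubric scores, rater B
print(round(cohens_kappa(a, b), 3))  # 0.778
```

Kappa near 0 means agreement no better than chance; values above roughly 0.6 are usually read as substantial, though thresholds are conventions, not laws.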

4. LLM-as-judge: use with care

Using a strong model to score another model’s output is convenient but biased toward the judge’s preferences and can favor verbose answers. If you use it:

  • Keep a frozen judge prompt versioned in git.
  • Spot-check judge vs human decisions.
  • Never let the judge be the only signal for safety-critical behavior.
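The first two bullets can be made concrete. A sketch, where the prompt wording and the spot-check tolerance are illustrative assumptions:

```python
import hashlib

# Frozen judge prompt: checked into git. Hashing it makes accidental
# edits visible in logs and diffs. The template wording is illustrative.
JUDGE_PROMPT_V2 = """You are grading an assistant answer.
Score 1-5 for correctness only. Ignore verbosity and style.
Question: {question}
Answer: {answer}
Respond with a single integer."""

JUDGE_PROMPT_HASH = hashlib.sha256(JUDGE_PROMPT_V2.encode()).hexdigest()[:12]

def spot_check(judge_scores: list, human_scores: list, tolerance: int = 1) -> float:
    """Fraction of sampled items where judge and human agree within `tolerance`."""
    pairs = zip(judge_scores, human_scores)
    return sum(abs(j - h) <= tolerance for j, h in pairs) / len(judge_scores)

print(spot_check([5, 3, 4, 2], [5, 4, 2, 2]))  # 0.75
```

Log the prompt hash alongside every judge score so that results from different judge versions are never silently mixed.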

Anthropic discusses evaluation tradeoffs in their documentation; academic surveys on LLM evaluation are evolving quickly—treat vendor blogs as starting points, not proof.

5. Pair quality with cost and latency

Report a small scorecard weekly:

  • Success rate on golden set (primary).
  • p95 latency per task type.
  • Tokens per successful task (prompt + completion).

A prompt change that adds 40% more tokens without improving the success rate is a cost regression, even if quality holds steady.
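Computing the scorecard needs nothing beyond the standard library. A sketch, assuming each run is recorded as a dict with pass/fail, latency, and token counts (the field names and sample numbers are illustrative):

```python
def p95(values: list) -> float:
    """Nearest-rank 95th percentile; adequate for small eval suites."""
    ordered = sorted(values)
    idx = max(0, int(round(0.95 * len(ordered))) - 1)
    return ordered[idx]

def scorecard(results: list) -> dict:
    """results: list of dicts with 'passed', 'latency_ms', 'tokens'."""
    passed = [r for r in results if r["passed"]]
    return {
        "success_rate": len(passed) / len(results),
        "p95_latency_ms": p95([r["latency_ms"] for r in results]),
        "tokens_per_success": sum(r["tokens"] for r in passed) / len(passed),
    }

runs = [
    {"passed": True, "latency_ms": 800, "tokens": 1200},
    {"passed": True, "latency_ms": 950, "tokens": 1400},
    {"passed": False, "latency_ms": 3000, "tokens": 2500},
    {"passed": True, "latency_ms": 700, "tokens": 1000},
]
print(scorecard(runs))
# {'success_rate': 0.75, 'p95_latency_ms': 3000, 'tokens_per_success': 1200.0}
```

Note that latency is computed over all runs (failures are often the slow ones) while tokens-per-success deliberately divides only by successes, so wasted spend on failures shows up in the ratio.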

6. CI smoke tests

Run a small subset of the golden set on every PR against a non-prod endpoint (or mocked model) to catch catastrophic JSON/schema breakages. Full nightly jobs may call live APIs—budget accordingly and use rate-limit-aware clients.
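A minimal smoke harness might look like this; `fake_model`, the item layout, and the prompts are illustrative stand-ins for a real non-prod endpoint call:

```python
import json

# Tiny subset of the golden set pinned for per-PR smoke runs.
SMOKE_SUBSET = [
    {"id": "t1", "prompt": "Summarize ticket 123", "required_keys": {"summary"}},
    {"id": "t2", "prompt": "Refuse to reveal the SSN", "required_keys": {"refusal"}},
]

def fake_model(prompt: str) -> str:
    """Mock: returns well-formed JSON so CI exercises the harness itself."""
    if "Refuse" in prompt:
        return json.dumps({"refusal": "I can't share that."})
    return json.dumps({"summary": "stub"})

def run_smoke(subset: list, model) -> list:
    """Return ids of items whose output fails to parse or misses required keys."""
    failures = []
    for item in subset:
        try:
            data = json.loads(model(item["prompt"]))
        except json.JSONDecodeError:
            failures.append(item["id"])
            continue
        if not item["required_keys"] <= data.keys():
            failures.append(item["id"])
    return failures

print(run_smoke(SMOKE_SUBSET, fake_model))  # []
```

Passing the model as a callable keeps the same harness usable with the mock in CI and a live client in the nightly job.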

Hugging Face’s evaluate library is useful if you standardize on open metrics, but pick carefully: exact match and BLEU rarely fit conversational tasks.

Related: LLM cost controls, Azure OpenAI production patterns.