How we measured quality before shipping a Copilot-style feature
Golden datasets, automated checks, human rubrics, LLM-as-judge caveats, and CI smoke tests—with Hugging Face and OpenAI evaluation links.
“Feels snappier” is not a release gate. Assistant-style features need repeatable quality metrics, regression detection when prompts or models change, and cost/latency tracked alongside accuracy.
1. Build a versioned golden set
Curate 50–300 representative tasks drawn from real usage (anonymized). For each item store:
- Input (user prompt + optional context IDs).
- Expected behavior: not always a single “gold answer”—often constraints (“must cite policy section”, “must refuse PII requests”, “must return JSON matching schema”).
Check the dataset into git (or a private registry) with a version tag: golden-v1.3. When you change prompts or models, re-run the suite and diff pass/fail.
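A golden-set item and a pass/fail diff between two runs can be sketched like this. The field names (`context_ids`, `constraints`) and the JSONL storage format are assumptions for illustration, not a fixed schema:

```python
import json
from dataclasses import dataclass, field

@dataclass
class GoldenItem:
    """One task in the golden set; constraints, not a single gold answer."""
    id: str
    prompt: str
    context_ids: list = field(default_factory=list)
    constraints: list = field(default_factory=list)  # e.g. "must_cite:policy-4.2"

def load_golden(path):
    """Load golden-vX.Y.jsonl: one JSON object per line."""
    with open(path) as f:
        return [GoldenItem(**json.loads(line)) for line in f]

def diff_runs(old: dict, new: dict) -> list:
    """Items that passed in the old run but fail in the new one (regressions).
    Each dict maps item id -> bool (passed)."""
    old_pass = {k for k, ok in old.items() if ok}
    new_pass = {k for k, ok in new.items() if ok}
    return sorted(old_pass - new_pass)
```

The diff is the artifact you review before shipping a prompt or model change: an empty list means no regressions against the tagged dataset version.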
2. Automated checks (fast, cheap)
Layer checks from strict to fuzzy:
| Check | Example |
|---|---|
| Schema | Parse model output as JSON; validate with JSON Schema or Pydantic |
| Tool simulation | If the assistant emits “actions”, verify arguments against allowed enums |
| String contains / regex | For support macros: must include disclaimer text |
| Embedding similarity | Optional: cosine similarity to reference answers—watch for semantic false positives |
OpenAI documents evaluations concepts for their platform; the ideas transfer to offline harnesses.
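The strict-to-fuzzy layering above can be collapsed into one check function. The action enum and disclaimer pattern below are hypothetical stand-ins; a real harness would validate against your actual schema (e.g. with Pydantic) instead of hand-rolled key checks:

```python
import json
import re

ALLOWED_ACTIONS = {"create_ticket", "escalate", "close"}   # hypothetical tool enum
DISCLAIMER = re.compile(r"not legal advice", re.I)          # hypothetical required text

def check_output(raw: str) -> list:
    """Run checks strict-to-fuzzy; return failure reasons (empty list = pass)."""
    try:
        data = json.loads(raw)                  # 1. schema: must parse as JSON
    except json.JSONDecodeError:
        return ["invalid_json"]                 # strictest layer failed; stop early
    failures = []
    if data.get("action") not in ALLOWED_ACTIONS:       # 2. tool simulation
        failures.append("unknown_action")
    if not DISCLAIMER.search(data.get("reply", "")):    # 3. contains / regex
        failures.append("missing_disclaimer")
    return failures
```

Returning reasons rather than a bare boolean makes the weekly failure breakdown (schema vs. policy vs. tooling) free to compute.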
3. Human review for what automation misses
Schedule periodic human grading on a sample:
- Correctness (domain expert).
- Tone and safety (policy/comms).
- Citation accuracy if the product claims to quote sources.
Use a 1–5 rubric per dimension; measuring inter-rater agreement improves trust in the scores—compute the classic Cohen’s kappa if you want statistical rigor.
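Cohen’s kappa is short enough to compute without a stats dependency. A minimal sketch for two raters (rubric scores are often binarized to pass/fail first, an assumption made here):

```python
def cohens_kappa(a: list, b: list) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)  # chance agreement
    return 1.0 if pe == 1 else (po - pe) / (1 - pe)
```

Values near 1 mean raters agree beyond chance; values near 0 mean the rubric is too ambiguous to trust and needs tightening before the scores mean anything.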
4. LLM-as-judge: use with care
Using a strong model to score another model’s output is convenient but biased toward the judge’s preferences and can favor verbose answers. If you use it:
- Keep a frozen judge prompt versioned in git.
- Spot-check judge vs human decisions.
- Never let the judge be the only signal for safety-critical behavior.
Anthropic discusses evaluation tradeoffs in their documentation; academic surveys on LLM evaluation are evolving quickly—treat vendor blogs as starting points, not proof.
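The spot-check in the list above can be automated as a calibration gate: score agreement between the frozen judge’s verdicts and human verdicts on a shared sample, and re-calibrate the judge prompt when it drifts. The 0.85 threshold below is an arbitrary illustration, not a recommendation:

```python
def judge_human_agreement(judge: dict, human: dict) -> float:
    """Fraction of overlapping samples where the LLM judge matched the human
    verdict. Both dicts map item id -> verdict label."""
    shared = judge.keys() & human.keys()
    if not shared:
        raise ValueError("no overlapping samples to compare")
    return sum(judge[k] == human[k] for k in shared) / len(shared)

JUDGE_AGREEMENT_FLOOR = 0.85  # hypothetical gate: below this, stop trusting the judge
```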
5. Pair quality with cost and latency
Report a small scorecard weekly:
- Success rate on golden set (primary).
- p95 latency per task type.
- Tokens per successful task (prompt + completion).
A prompt change that adds 40% more tokens without improving the success rate is a regression, even though no quality metric dropped.
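The three scorecard numbers can come out of one pass over the run log. The per-run record shape (`ok`, `latency_ms`, token counts) is an assumed logging convention:

```python
def scorecard(runs: list) -> dict:
    """Weekly scorecard from per-run records:
    {"ok": bool, "latency_ms": int, "prompt_tokens": int, "completion_tokens": int}."""
    succeeded = [r for r in runs if r["ok"]]
    latencies = sorted(r["latency_ms"] for r in runs)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return {
        "success_rate": len(succeeded) / len(runs),
        "p95_latency_ms": p95,
        "tokens_per_success": sum(
            r["prompt_tokens"] + r["completion_tokens"] for r in succeeded
        ) / max(len(succeeded), 1),
    }
```

Dividing tokens by *successful* tasks (not all tasks) is deliberate: it surfaces prompt bloat that buys nothing, which is exactly the regression described above.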
6. CI smoke tests
Run a small subset of the golden set on every PR against a non-prod endpoint (or mocked model) to catch catastrophic JSON/schema breakages. Full nightly jobs may call live APIs—budget accordingly and use rate-limit-aware clients.
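A minimal PR-time smoke test, sketched with a stubbed model call. The item IDs are placeholders, and `call_model` would point at your non-prod endpoint or mock in a real CI job:

```python
import json

SMOKE_IDS = ["refund-001", "pii-refusal-002", "json-ticket-003"]  # hypothetical subset

def call_model(item_id: str) -> str:
    """In CI this hits a non-prod endpoint or a mocked model; stubbed here."""
    return json.dumps({"action": "create_ticket", "reply": "Ticket created."})

def smoke_failures() -> list:
    """IDs whose output is catastrophically broken (unparseable or missing the
    action field). CI fails the PR if this returns anything."""
    bad = []
    for item_id in SMOKE_IDS:
        try:
            data = json.loads(call_model(item_id))
            if "action" not in data:
                bad.append(item_id)
        except json.JSONDecodeError:
            bad.append(item_id)
    return bad
```

Keeping the smoke tier to parse-level checks only keeps PR feedback fast and cheap; semantic grading stays in the nightly run.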
Hugging Face’s evaluate library is useful if you standardize on open metrics, but pick carefully: exact match and BLEU rarely fit conversational tasks.
References
- OpenAI: Evals best practices — structuring tests and metrics
- NIST AI RMF — organizational context for evaluation and governance (high level)
- OWASP LLM Top 10 — abuse cases to include in test sets (e.g. prompt injection attempts)
Related: LLM cost controls, Azure OpenAI production patterns.