Just a placeholder for now 3... | Christian D. Glissov

How to build lightweight evaluation datasets that survive rapid product iteration.

A useful evaluation set is small, stable, and representative. It does not need to be huge to be valuable.

Dataset shape

Add canonical user tasks.
Include common failure modes.
Store expected outputs or scoring rubrics.
Version the dataset with your prompts.
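One way to make the four points above concrete is a small typed schema. This is a minimal sketch, not a prescribed format: the names `EvalCase`, `EvalDataset`, `promptVersion`, and `validateDataset` are hypothetical and chosen only for illustration.

```typescript
// Hypothetical schema: all field names here are illustrative assumptions.
type Expected =
  | { type: "exact"; output: string } // a fixed expected output
  | { type: "rubric"; criteria: string[] }; // criteria for a human or model grader

type EvalCase = {
  id: string;
  kind: "canonical" | "failure_mode";
  input: string;
  expected: Expected;
};

type EvalDataset = {
  datasetVersion: string;
  promptVersion: string; // versioned together with the prompts it evaluates
  cases: EvalCase[];
};

// Returns a list of problems; an empty list means the dataset passed these checks.
function validateDataset(ds: EvalDataset): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  for (const c of ds.cases) {
    if (seen.has(c.id)) problems.push(`duplicate id: ${c.id}`);
    seen.add(c.id);
  }
  if (!ds.cases.some((c) => c.kind === "canonical")) {
    problems.push("no canonical user tasks");
  }
  if (!ds.cases.some((c) => c.kind === "failure_mode")) {
    problems.push("no failure-mode cases");
  }
  return problems;
}

const sampleDataset: EvalDataset = {
  datasetVersion: "1.0.0",
  promptVersion: "1.0.0",
  cases: [
    {
      id: "summarize-short-email",
      kind: "canonical",
      input: "Summarize this email in one sentence.",
      expected: { type: "rubric", criteria: ["one sentence", "mentions the deadline"] },
    },
    {
      id: "empty-input",
      kind: "failure_mode",
      input: "",
      expected: { type: "exact", output: "Please provide some text to summarize." },
    },
  ],
};

const sampleProblems = validateDataset(sampleDataset);
```

Keeping the dataset and prompt versions in the same record makes it easier to tell whether a score change came from the prompts or from the data.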

Scoring strategy

Use a weighted score so critical tasks dominate decisions. Here s_i is the score on task i and w_i is its weight, with the weights normalized to sum to 1:

S = \sum_{i=1}^{n} w_i s_i, \quad \sum_i w_i = 1
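The formula translates directly into code. The following is a minimal sketch, assuming each task carries a `weight` and a `score`; the names `ScoredTask` and `weightedScore`, and the tolerance on the weight sum, are illustrative assumptions rather than part of any particular framework.

```typescript
// Hypothetical shape: one entry per task, with w_i as weight and s_i as score.
type ScoredTask = { weight: number; score: number };

// Computes S = sum_i w_i * s_i and enforces the constraint sum_i w_i = 1.
function weightedScore(tasks: ScoredTask[], tolerance = 1e-9): number {
  const totalWeight = tasks.reduce((acc, t) => acc + t.weight, 0);
  if (Math.abs(totalWeight - 1) > tolerance) {
    throw new Error(`weights must sum to 1, got ${totalWeight}`);
  }
  return tasks.reduce((acc, t) => acc + t.weight * t.score, 0);
}

const exampleScore = weightedScore([
  { weight: 0.5, score: 1.0 },
  { weight: 0.3, score: 0.8 },
  { weight: 0.2, score: 0.5 },
]);
```

For these three tasks the sum is 0.5 · 1.0 + 0.3 · 0.8 + 0.2 · 0.5 = 0.84, so the heaviest task contributes more than the other two combined.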

Quick check loop

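// Minimum weighted overall score required to ship.
const threshold = 0.82;

export function canShip(overallScore: number, criticalScore: number): boolean {
  // Hard floor: a weak critical score blocks shipping regardless of the overall score.
  if (criticalScore < 0.9) return false;
  return overallScore >= threshold;
}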

The hard floor on the critical score guards the cases that matter most for users, so a high overall average cannot hide a regression in them.
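To see why the separate critical check matters, consider a run where one critical task regresses while everything else scores well. The sketch below is illustrative only: it assumes the critical score is the lowest score among tasks flagged `critical` (the text does not define how it is computed), and the names `TaskResult`, `summarize`, and `gate` are hypothetical. The two thresholds are the ones used above.

```typescript
// Hypothetical result shape; `critical` marks tasks that must not regress.
type TaskResult = { id: string; weight: number; score: number; critical: boolean };

const OVERALL_THRESHOLD = 0.82;
const CRITICAL_FLOOR = 0.9;

function summarize(results: TaskResult[]): { overallScore: number; criticalScore: number } {
  const overallScore = results.reduce((acc, r) => acc + r.weight * r.score, 0);
  const criticalScores = results.filter((r) => r.critical).map((r) => r.score);
  // Assumption: the critical score is the worst critical task.
  // With no critical tasks, nothing can fail the floor, so use 1.
  const criticalScore = criticalScores.length > 0 ? Math.min(...criticalScores) : 1;
  return { overallScore, criticalScore };
}

// Same decision rule as the check above, restated so this sketch is self-contained.
function gate(overallScore: number, criticalScore: number): boolean {
  if (criticalScore < CRITICAL_FLOOR) return false;
  return overallScore >= OVERALL_THRESHOLD;
}

const regressed: TaskResult[] = [
  { id: "everyday-tasks", weight: 0.9, score: 0.98, critical: false },
  { id: "payment-flow", weight: 0.1, score: 0.5, critical: true },
];

const summary = summarize(regressed);
const shipIt = gate(summary.overallScore, summary.criticalScore);
```

Here the overall score is 0.9 · 0.98 + 0.1 · 0.5 = 0.932, comfortably above 0.82, yet the gate still returns `false` because the critical task scored 0.5.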