Back to Library
Playbook·11 min·Jul 2026

How to Build an AI Eval Harness for a Vertical Product

How to build an AI eval harness that gates every deploy and lets you swap models without losing quality. A hands-on playbook for AI builders.

An AI eval harness is a versioned suite of expert-labeled test cases and automated graders that scores your model output on every deploy and every prompt or model change, so quality is measured instead of guessed. Build it right and you can move to a cheaper or better base model without praying. Skip it and every change is a bet you cannot see the odds on.

This is the build playbook, the hands-on companion to our argument for why domain-specific evals are a moat. Avante Ventures is a venture studio building AI-native companies in Brazil and Latin America, and every product we ship runs on a harness like this one. Here is how to assemble the golden set, write the graders, wire the whole thing into your deploy gate, and turn it into an asset that compounds every time the product is used.

When you need an eval harness, and when tests are enough

An eval harness is not a test suite, and treating them as the same thing is the first mistake. Unit tests assert deterministic behavior. Given input X, the function returns exactly Y, every time. LLM output is probabilistic and open-ended, so the real question is not did it return the exact string, it is did it get the answer right often enough on the cases that matter. Anthropic frames the discipline as defining success criteria first, then designing evaluations to measure against them, and calls that cycle central to prompt engineering.

You need a harness the moment an LLM sits on a decision path a user or a regulator cares about. Classifying a legal filing. Pricing a risk. Pulling a number out of a document. Answering a support question that carries a policy consequence. You do not need one for low-stakes generation where a human already reviews every output. The tell is simple. If you cannot change a prompt without a quiet fear that you broke something invisible, you are past the point where eyeballing outputs works, and you are running on vibes.

The 2024 DORA report gives that instinct hard numbers. Across the study, a 25 percent increase in AI adoption came with an estimated 1.5 percent drop in software delivery throughput and a 7.2 percent drop in delivery stability, even as AI lifted individual signals like documentation and code quality. The authors are blunt that AI is not a panacea, and that faster development does not improve delivery without the fundamentals, small batch sizes and disciplined testing. An eval harness is that disciplined testing for the AI layer. It is what lets you keep the speed without paying it back in outages.

A 25 percent rise in AI adoption was associated with a 1.5 percent decrease in delivery throughput and a 7.2 percent decrease in delivery stability. Speed without a quality gate is a tax you pay later.

— DORA, Accelerate State of DevOps 2024

Build the AI eval harness in five moves

This is a workflow you can start this week, not a list of capabilities. The order matters, because each move is worthless without the one before it.

  • Assemble a golden set. Collect 50 to a few hundred real cases from your actual task distribution, each with a correct answer labeled by a domain expert, never by the model. Include the hard ones on purpose. This set encodes your definition of correct, and it is the most valuable thing you will build.
  • Write graders. A grader is code that scores one output against its golden answer. Exact match for categorical tasks, a rubric-scored model judge for open-ended ones, with a human tiebreak where the judge and the label disagree.
  • Score a baseline. Run the current prompt and model against the whole set and record the number. Accuracy, F1, pass rate, whatever fits. That line is what every future change has to beat or hold.
  • Wire it into CI as a gate. Run the suite on every pull request and every deploy, and fail the build when the score drops below the baseline. Quality stops being an opinion and becomes a merge condition.
  • Close the loop. Every production failure the suite missed gets added to the golden set with its correct label. The harness becomes a living record of every way the system has been wrong, and every regression it now blocks for good.

Choosing graders and sourcing hard cases

Match the grader to the task. Anthropic ranks three families by speed and reliability, and the ranking is a good default. Code-based grading is fastest and most reliable, an exact match where output equals the golden answer, or a string match where a key phrase has to appear. Human grading is the most flexible and highest quality, and also the slowest and most expensive, so you avoid it where you can. Model-based grading, where an LLM judges the output against a rubric, is fast, flexible, and scalable, and the right tool for nuanced calls, but you validate it against human labels before you trust it at scale. OpenAI ships the same shape in its open eval framework, data in JSON and model-graded templates.

The rubric is where model graders live or die. Make it detailed and empirical, force a discrete verdict of correct or incorrect or a 1 to 5 score rather than prose, and have the judge reason first and then discard the reasoning, which measurably improves grading on hard cases. Use a different model to grade than the one that produced the output.

The hard cases are the whole point, and they are where domain expertise beats model cleverness. A generic team writes easy cases the model already passes. A domain operator knows the filing that looks routine but is not, the edge condition a regulator actually punishes, the input a competitor gets wrong. Source those cases from real production logs, from expert interviews, and from the incidents that already cost you. You can let the model help generate volume, but the labels on the cases that matter stay human and expert.

Wire evals into the deploy gate

A harness that runs when someone remembers is not a gate. Put the suite in CI so it runs automatically on every pull request and blocks the merge when the score falls below the baseline you recorded. Now a prompt tweak that quietly costs three points of accuracy cannot ship, because the build goes red before anyone argues about it.

This only works if you version what you are gating. Treat the prompt like source code, committed and diffable, and pin the model name and version alongside it. When you change either one, the harness scores the change against the baseline and tells you what it cost or bought. That is the difference between a measured decision and a guess. A model swap becomes an experiment with a number attached, not a leap of faith taken on a Friday afternoon.

How the eval set becomes proprietary

The harness is where the copilot to data to fund flywheel gets its teeth. Every correction a domain expert makes and every production failure fed back with its right answer becomes a labeled row no competitor has and none can buy. The eval set is proprietary, domain-specific, and compounding. It is the written-down definition of correct for a vertical, and it grows every time the product is used and corrected. That is why the eval set, and not the model, is the durable asset, an argument we make in full in the copilot to data to fund flywheel.

The mechanism ties straight to inference economics. Model quality is converging and inference cost is collapsing, so the base model is not the moat, and betting the company on one is the wrapper trap. According to a16z, an LLM at GPT-3 quality fell from roughly 60 dollars per million tokens in late 2021 to about 0.06 dollars, a 1000x drop in three years, close to 10x a year for equivalent performance. Epoch AI puts the median decline near 50x a year across benchmarks. Read that as strategy, not trivia. If a cheaper or better base model shows up every few months, the team that can swap to it without losing quality wins on both cost and capability, and the harness is the instrument that makes the swap safe.

The cost of an LLM at GPT-3 quality fell from about 60 dollars per million tokens in 2021 to roughly 0.06 dollars, a 1000x drop in three years. A model-agnostic team captures that only if evals protect quality through the swap.

— a16z, LLMflation, 2024

Failure modes: measuring the wrong correct

A harness built wrong is worse than no harness, because it hands the team false confidence. These are the ways it goes wrong, and the fix for each.

  • The wrong definition of correct. A bad golden set encodes a mistaken standard, and the suite then certifies the wrong behavior on every green build. Fix it with expert labeling and periodic review of the set itself, not just the model.
  • Overfitting to the eval. Teams tune prompts to pass the suite instead of to serve the real world, and the score climbs while production quality does not. Keep a held-out set the team never tunes against, and refresh cases from live traffic.
  • A stale set. New inputs appear, the distribution drifts, and a frozen set slowly stops representing reality. The closed loop is the antidote.
  • An untrusted model judge. An LLM grader never checked against human labels can be confidently wrong at scale. Validate first, then keep the human tiebreak on disagreements.
  • Lock-in dressed as safety. A harness bolted to one vendor's format defeats the point. Keep the golden set and graders in your own store so the suite outlives any single model.

How Avante builds evals with domain operators

The reason our harnesses hold up is who writes the golden set. Avante Ventures launches 3-4 ventures per year through a six-stage system, Research, Partner, Build, Traction, Revenue, Compound, deploying $500K-1.5M per venture and retaining co-founder economics. The domain operator, the person with a decade of scar tissue in the vertical, is the source of the correct labels, and they sit inside the product team from the Partner stage on. The gate goes live in Build and hardens through Traction and Revenue. The compounding eval set is a Compound-stage asset that follows the venture into its raise.

It is also why a lean team can ship a defensible product without a Series A. Cheap inference plus a disciplined harness means the moat is the data and the evals, not a war chest for compute. The harness itself is company plumbing a studio solves once and routes across every venture, turning shared infrastructure into roughly $300K-500K of effective capital per venture that goes into product instead of overhead.

The pressure is real in our market. AI use among Brazilian industrial companies jumped from 16.9 percent in 2022 to 41.9 percent in 2024, roughly 2.5x in two years, per IBGE. Adoption is racing ahead of quality control, which is the DORA gap in one country. In a market where services are roughly 70% of Brazilian GDP with low software penetration, the vertical AI prize is large, and it goes to the teams whose quality is measured, not asserted. Read the thesis at /why-avante. The team that can prove its model still works after a swap is the team that gets to keep swapping.

Frequently asked questions

What is an AI eval harness?
An AI eval harness is a versioned suite of expert-labeled test cases plus automated graders that scores model output on every deploy and every prompt or model change. It turns quality from something you eyeball into something you measure. Unlike unit tests, which assert one exact output, it scores probabilistic output across the cases that matter to your vertical.
How do I build an AI eval harness?
Build it in five moves. Assemble a golden set of real, expert-labeled cases including hard edge cases, write graders that score each output against its golden answer, record a baseline score, wire the suite into CI as a deploy gate, and feed every production failure back into the set. The order matters, because each step depends on the one before it.
How is an eval harness different from unit tests?
Unit tests check deterministic code, where a given input must return one exact output. An eval harness scores probabilistic LLM output, where the question is whether the answer is right often enough across a realistic distribution of cases. You need the harness the moment an LLM sits on a decision a user or regulator cares about.
What kind of grader should I use for LLM evals?
Match the grader to the task. Use code-based exact or string matching for categorical and extractive tasks, since it is fastest and most reliable. Use a rubric-scored model judge for open-ended output, validated against human labels first, with a human tiebreak on disagreements. Avoid pure human grading at scale because it is slow and expensive.
Why do evals matter more as inference gets cheaper?
Because falling inference cost is what makes a model swap worth doing, and evals are what make it safe. LLM cost at GPT-3 quality fell about 1000x in three years per a16z, so a cheaper or better base model appears constantly. A model-agnostic team captures that upside only if an eval harness proves quality held through the change.
— Avante Founding Team
São Paulo + Silicon Valley · written from inside the studio

Want more? Get one essay per week on venture building, AI-native businesses, and the Brazil opportunity.

Browse the Library →