Playbook·10 min·Jun 2026

Domain Evals: The Moat That Survives Model Churn

Models commoditize. The encoded judgment of what correct means does not. Why a domain eval suite is an underrated AI-native moat.

A domain-specific eval suite is the most underrated form of defensibility an AI-native company can build. Models commoditize, and prompts get copied within a quarter. The encoded judgment of what correct means inside a regulated, high-stakes workflow does not. That judgment, captured as a test suite of real cases, edge cases, and expert-labeled outcomes, is expensive to assemble, compounds with usage, and is the asset competitors cannot screenshot.

It also buys you a second thing almost nobody prices in: the freedom to swap base models as inference prices collapse, without gambling on quality. At Avante Ventures, the venture studio we run building AI-native companies in Brazil and Latin America, the eval suite is where a copilot's accumulating usage turns into a quality lead you can prove rather than assert.

Why evals are a moat, not a chore

Most teams treat evaluation as QA hygiene. That framing is why they lose. An AI-native product makes a claim about the world every time it runs. A copilot that scores judicial-debt recovery, prices an insurance risk, or ranks an auction property can be right or wrong, and wrong is expensive. The mechanism that decides right from wrong is the eval suite, which makes it the product, not the paperwork around it.

The standard moat conversation stops at proprietary data, and that is where it goes wrong. Data is raw material. An eval set is the encoded definition of correctness applied to that material. Two ventures can hold near-identical data and ship opposite quality, because one runs a rigorous, adversarial, operator-labeled suite and the other is guessing in production. The eval suite is the defensibility layer that turns a pile of cases into a measurable lead.

If removing the model breaks your product rather than degrading a feature, you are AI-native. And the first question that follows is not how fast you ship. It is how you know the output is correct.

How evals make you model-agnostic

A model-agnostic AI startup is one that can change its engine on a Tuesday and prove quality held by Wednesday. The eval suite is what makes that possible. Run the new model against the suite. Adopt it only if scores hold or improve. The proprietary eval set is what converts a volatile cost curve into pricing power instead of exposure.
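The swap rule above (run the new model against the suite, adopt only if scores hold or improve) can be sketched in a few lines. This is a minimal illustration, not a description of Avante's tooling: `EvalCase`, `suite_score`, `should_adopt`, and the toy models are hypothetical names, and exact-match grading is an assumption that a real domain suite would replace with operator-defined rubrics.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One expert-labeled case: an input and the outcome an operator marked correct."""
    prompt: str
    expected: str


# Hypothetical interface: a "model" is any callable from prompt to answer.
Model = Callable[[str], str]


def suite_score(model: Model, suite: Sequence[EvalCase]) -> float:
    """Fraction of cases where the model's answer matches the expert label.

    Exact string match is an assumption made to keep the sketch short.
    """
    if not suite:
        raise ValueError("eval suite is empty")
    passed = sum(1 for case in suite if model(case.prompt).strip() == case.expected)
    return passed / len(suite)


def should_adopt(candidate: Model, incumbent: Model,
                 suite: Sequence[EvalCase], tolerance: float = 0.0) -> bool:
    """Adopt the candidate only if its score holds or improves on the incumbent's."""
    return suite_score(candidate, suite) >= suite_score(incumbent, suite) - tolerance


def make_toy_model(answers: dict) -> Model:
    """Stand-in for a real provider call: answers from a lookup table."""
    return lambda prompt: answers.get(prompt, "?")


# Toy suite and toy models, invented purely for illustration.
suite = [
    EvalCase("2+2", "4"),
    EvalCase("capital of Brazil", "Brasília"),
    EvalCase("3*3", "9"),
]
incumbent = make_toy_model({"2+2": "4", "capital of Brazil": "Brasília"})
cheaper = make_toy_model({"2+2": "4", "capital of Brazil": "Brasília", "3*3": "9"})
regressed = make_toy_model({"2+2": "4"})

adopt_cheaper = should_adopt(cheaper, incumbent, suite)
adopt_regressed = should_adopt(regressed, incumbent, suite)
```

Here the cheaper candidate passes all three cases against the incumbent's two and is adopted, while the regressed candidate passes only one and is rejected. The `tolerance` parameter is one possible design choice for teams willing to trade a small, explicit quality drop for a large cost saving; the default of zero matches the "hold or improve" rule in the text.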

This matters because base-model price and quality reshuffle every few months. A venture that hard-codes its quality to one provider is betting its margin on that provider's roadmap. A venture with a domain eval suite treats every new model as a candidate, not a commitment. The cost of being model-agnostic is near zero when you can prove quality on every swap. It is enormous when you cannot, because then a switch is a leap of faith and you will not take it.

Owned eval suite: swap to a cheaper or better model the week it ships, validate in hours, capture the savings or the quality gain.
No eval suite: stay locked to one provider out of fear, or switch blind and discover the regression in front of a customer.
The asset is not the prompt or the model. It is the encoded, operator-labeled definition of correct that every model must pass.

Why the cost curve makes this urgent

Inference prices are falling fast and unevenly, a dynamic we map for the region in the AI infrastructure cost curve, which is precisely why you should not anchor quality to one model. Epoch AI found the price to reach a fixed capability has dropped between 9x and 900x per year depending on the benchmark, with a median near 50x. Matching GPT-4 on PhD-level science questions got about 40x cheaper per year. The drops are accelerating. Measured from January 2024 onward, the median rate jumps to roughly 200x per year.

a16z put a single number on it. The cost of inference at a fixed quality level fell from 60 dollars per million tokens in 2021 to about 6 cents by late 2024, a roughly 1,000x decline in three years. When the floor moves that fast, the only way to keep capturing the savings is to be ready to switch. Readiness is an eval suite. Without one, every price drop is a deal you watch a competitor take. This is the proprietary eval-set advantage that compounds quietly while the cost curve does the loud work.
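To compare the a16z figure with the per-year rates quoted from Epoch AI, it helps to annualize it. The sketch below assumes a constant yearly rate of decline, which is a simplification rather than something either source claims; `annualized_decline` is a hypothetical helper name.

```python
def annualized_decline(start_price: float, end_price: float, years: float) -> float:
    """Constant per-year factor that compounds to the total decline over `years`."""
    if start_price <= 0 or end_price <= 0 or years <= 0:
        raise ValueError("prices and years must be positive")
    return (start_price / end_price) ** (1.0 / years)


# Figures quoted in the text: 60 dollars to 6 cents per million tokens
# over roughly three years.
total_decline = 60.0 / 0.06
per_year = annualized_decline(60.0, 0.06, 3)

print(f"total: {total_decline:,.0f}x, annualized: {per_year:.1f}x per year")
```

Under that assumption, a 1,000x decline over three years works out to about 10x per year, which sits at the low end of the 9x to 900x range cited above. The two sources measure different things, so the comparison is only indicative.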

The cost of LLM inference at a fixed quality level fell from 60 dollars per million tokens in 2021 to about 6 cents by late 2024. Roughly 1,000x in three years.

— a16z, Welcome to LLMflation, 2024

Where evals sit among the moats

The durable moat for a vertical AI venture is a stack, not a single model. Insignia Ventures put it bluntly. The barrier to building has never been lower while defending what you built has become exponentially harder. They documented AI image-editing startups that scaled past 5 million dollars in ARR and then watched their value erode overnight when an incumbent shipped the same feature. Generic capability is a commodity. The defensible layers sit underneath it.

Proprietary data: the cases, outcomes, and labels competitors cannot buy. Necessary, most discussed, not sufficient alone.
Domain-specific evals: the encoded judgment of what correct means, run against every model and every release. The layer that turns accumulating usage into a provable quality lead.
Workflow lock-in: the product becomes where work is authored and the system of record, so switching costs rise.

The quiet trap of bad evals

A bad eval set is worse than no eval set, because it gives you confidence pointed in the wrong direction. Anthropic, a frontier lab that measures models as a core part of its work, wrote that a true science of evals remains underdeveloped and that an apparent edge can be luck of the draw rather than real capability. If they call the science underdeveloped, a vertical startup should assume its first eval set is wrong in ways it cannot yet see.
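One way to check whether an apparent edge is luck of the draw is to score both models on the same cases and put an interval around the per-case difference. The sketch below uses a paired comparison with a normal-approximation 95% interval. That choice is an assumption of this example, not a method attributed to Anthropic or used by any specific team, and `paired_edge` and `edge_is_real` are hypothetical names.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_edge(scores_a: Sequence[float], scores_b: Sequence[float],
                z: float = 1.96) -> Tuple[float, float, float]:
    """Mean per-case difference (A minus B) with a normal-approximation interval."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same cases")
    if len(scores_a) < 2:
        raise ValueError("need at least two cases")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    diff = mean(diffs)
    stderr = stdev(diffs) / math.sqrt(len(diffs))
    return diff, diff - z * stderr, diff + z * stderr


def edge_is_real(scores_a: Sequence[float], scores_b: Sequence[float]) -> bool:
    """Treat the edge as real only when the interval excludes zero."""
    _, low, high = paired_edge(scores_a, scores_b)
    return low > 0 or high < 0


# Invented pass/fail results: model A passes 8 of 10 cases, model B passes 7.
small_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
small_b = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0]

# The same pattern repeated over 1,000 cases.
large_a = small_a * 100
large_b = small_b * 100

small_verdict = edge_is_real(small_a, small_b)
large_verdict = edge_is_real(large_a, large_b)
```

With ten cases, a ten-point lead produces an interval that straddles zero, so the sketch refuses to call it an edge. With the same lead over a thousand cases, the interval clears zero. The statistics only address sample noise; they say nothing about whether the labels themselves encode the right definition of correct, which is the deeper failure mode.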

Here is the failure mode in plain terms. An eval set encodes a definition of correct. If that definition is subtly off, you optimize hard toward the wrong target and feel good doing it. A judicial-debt valuation that looks right to an engineer can be legally wrong in a way only a precatório specialist catches. An insurance score can pass a generic accuracy check and still misprice the tail that bankrupts the book. Building a good eval set demands the exact resource most AI startups lack: deep domain operators who can label adversarial edge cases correctly. A team without that input does not build a weak instrument. It builds a precise one aimed at the wrong target, and ships with conviction.

A true science of evals remains underdeveloped, and an apparent model edge can be luck of the draw rather than real capability. The warning comes from a frontier lab, not a skeptic.

— Anthropic research on evaluating models, 2024

How Avante builds evals with operators

The eval-as-moat thesis is exactly why the studio model fits this moment. A correct eval set requires deep domain input, and that input is what most AI startups are short of. Avante Ventures pairs a Silicon Valley playbook and first-ticket capital with operators who carry 10+ years of Brazilian-market scar tissue, assembled on day one. The operating partner who knows the domain is in the build from the Partner stage, which is where eval design has to start, not after launch.

The structure is deliberate. Avante launches 3-4 ventures per year through a six-stage system: Research, Partner, Build, Traction, Revenue, Compound. Each venture gets $500K-1.5M across pre-seed while the studio retains co-founder economics. The model has a track record behind it. Per the Global Startup Studio Network, venture studios show roughly 50% IRR versus about 19% for traditional VC, about 2.5x over realistic horizons. That figure is the studio-model benchmark, not a claim about any single fund's realized return.

The market it points at is concrete. Services account for roughly 70% of Brazilian GDP, and per consolidated IBGE data they drive about 80% of formal job creation. These are regulated, judgment-heavy workflows where correct is domain-defined and adversarial, which is exactly where a domain eval suite is hardest to build and most defensible once built. The portfolio runs one pattern in such domains. Build a copilot to generate proprietary data, encode domain correctness as evals so the quality lead is provable, then use the data and the credibility to raise and deploy capital. The copilot-to-data-to-fund flywheel shows up in judicial-asset valuation at Alphajuri, insurance risk scoring at WIR, and auction-property scoring at BR Auction Intel.

AI infrastructure is now cheap enough to deploy without a Series A. The bottleneck moved. It is no longer compute. It is the encoded judgment of what correct means, and the operators who can define it. That is the case we make in full on why a studio builds this way.

— Avante Founding Team

São Paulo + Silicon Valley · written from inside the studio
