The Copilot-to-Data-to-Fund Flywheel, Explained
Ship a copilot to mint proprietary data, then turn that data into capital. The concrete mechanism, the failure mode, and how Avante runs it.
The copilot-to-data-to-fund flywheel is one idea in three steps. Build an AI copilot that does real work in a regulation-dense vertical. Let every interaction mint structured, hard-to-source data. Then turn that data into capital, either by raising on the strength of the dataset or by deploying capital directly into the assets the data identifies.
This is the recurring pattern across Avante Ventures, a venture studio building AI-native companies in Brazil and Latin America. It is not a slogan. It is a build sequence with a precise order and a single point of failure, and most teams get the order wrong. They chase the fund before the data is dense enough to price anything.
The loop in one sentence
Copilot generates data, data becomes a priced asset, the asset attracts or becomes capital, and the capital buys more usage that thickens the data. That is the whole machine. The order is not negotiable. Skip the usage and the rest is a pitch deck with no engine underneath.
What makes the loop worth building in 2026 is that the first turn got cheap. The model is no longer the expensive part of an AI company, and it is no longer the durable part either. Everything that lasts has been pushed off the model and onto the data the model touches that nobody else can copy.
Stage one: a copilot that mints data
A copilot is AI-native only if removing the model breaks the core workflow. Run the test on any product. If it would still function as ordinary software with the model stripped out, the AI is a feature bolted on the edge. If the workflow exists only because the model does the judgment work, and the act of doing that work leaves behind labeled data, it is AI-native. The product is the data-collection instrument.
The reason this is now buildable without a Series A is the inference cost collapse. Epoch AI found that the price to match GPT-4 performance on a set of PhD-level science questions fell by about 40x per year, with the median across all measured tasks around 50x per year and the fastest tasks dropping up to 900x annually. A capability that cost a fortune to run last year is a rounding error this year.
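To make the compounding concrete, here is a minimal sketch. It assumes the annual decline factor stays constant, which is a simplification: the Epoch AI figures are observed trends, not a forecast. The starting cost of 100.0 units and the function name `cost_after` are hypothetical, chosen only for illustration.

```python
def cost_after(initial_cost: float, annual_factor: float, years: float) -> float:
    """Cost of a fixed capability after `years`, if its price falls
    `annual_factor`x per year (assumed constant, which is a simplification)."""
    if annual_factor <= 0:
        raise ValueError("annual_factor must be positive")
    return initial_cost / (annual_factor ** years)


# Hypothetical starting cost: 100.0 units to run a fixed capability today.
START_COST = 100.0

# Annual decline factors quoted in the paragraph above.
RATES = [
    ("PhD-level science questions", 40),
    ("median across tasks", 50),
    ("fastest tasks", 900),
]

for label, factor in RATES:
    one_year = cost_after(START_COST, factor, 1)
    two_years = cost_after(START_COST, factor, 2)
    print(f"{label}: {one_year:.4f} after 1 year, {two_years:.6f} after 2 years")
```

At the 40x rate, a workload that costs 100 units today costs 2.5 units a year later under this assumption. A cost advantage built on the model alone erodes at the same speed for every competitor.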
The strategic consequence is blunt. If the model is nearly free and improving for everyone at once, no model is a moat. The defensibility has to live somewhere the price curve cannot reach. In this pattern it lives in the proprietary data the copilot produces while it works.
The price to reach GPT-4-level performance on PhD-level science questions fell about 40x per year, with the median across tasks near 50x per year. The model is the cheap part now. The data is the moat.
Price figures: Epoch AI, March 2025
Stage two: data becomes the moat and the asset
Most data moats are fiction, and the sharpest takedown comes from the people who fund AI. Andreessen Horowitz put it plainly in 2019. There is generally no inherent network effect that comes from merely having more data. The economics often run the wrong way. The cost of adding unique data to your corpus may actually go up, while the value of incremental data goes down. Past a certain coverage threshold, each new slice costs more and buys less.
So when is data actually defensible? a16z names the exact condition. Accumulating proprietary data is strongest when the sources are scanty or are reticent to provide data to more than one vendor. Their examples are government-regulated sources and credit bureaus. That profile, scarce data, gated by regulation, held by a party reluctant to share it twice, is the precise target of this flywheel.
The working version of the effect has a name. NFX defines a data network effect as the case where a product's value increases with more data and where additional usage of that product yields more of that data. The condition that matters: the data has to be central to how the product benefits users, not a side artifact. The counterexample from NFX's James Currier is the warning. Netflix improves with viewing data, but inventory drives the value, so the data effect there is only marginal.
The 2025 investor consensus lands in the same place. Bessemer argues vertical AI winners will not compete on the underlying model, and the key differentiators are proprietary data, depth of integration, and economic value delivered. Insight Partners is sharper still. Earned data access creates a moat that widens with every customer onboarded, and access to specific, messy, unstandardized data remains one of the strongest moats in AI. That last clause is the real bar. Not "we have data," but "we have the kind of data only this copilot can generate at scale."
- Scarce: the source is gated by regulation or by a holder reluctant to supply a second vendor.
- Earned in the workflow: the copilot is the only practical instrument that mints it at scale.
- Compounding in use: the data grows more valuable as the product is used, not as it sits in storage.
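The three tests above can be written down as a checklist. This is a sketch, not a scoring model: the class `DatasetProfile`, its field names, and the two example profiles are hypothetical, and answering each question honestly is the hard part that no code does for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetProfile:
    """Hypothetical yes/no answers to the three tests listed above."""
    scarce: bool              # gated by regulation or a holder reluctant to supply a second vendor
    earned_in_workflow: bool  # the copilot is the only practical instrument that mints it at scale
    compounds_in_use: bool    # value grows as the product is used, not as it sits in storage


def failed_tests(profile: DatasetProfile) -> list[str]:
    """Return the names of the tests this dataset does not pass."""
    checks = {
        "scarce": profile.scarce,
        "earned_in_workflow": profile.earned_in_workflow,
        "compounds_in_use": profile.compounds_in_use,
    }
    return [name for name, passed in checks.items() if not passed]


def is_defensible(profile: DatasetProfile) -> bool:
    """A dataset counts as a moat here only if it passes all three tests."""
    return not failed_tests(profile)


# Illustrative profiles, not real companies.
regulated_claims = DatasetProfile(scarce=True, earned_in_workflow=True, compounds_in_use=True)
public_scrape = DatasetProfile(scarce=False, earned_in_workflow=False, compounds_in_use=True)

print(is_defensible(regulated_claims), failed_tests(regulated_claims))
print(is_defensible(public_scrape), failed_tests(public_scrape))
```

The checklist is deliberately all-or-nothing. A dataset that compounds in use but can be bought by anyone fails the a16z condition, however large it gets.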
Stage three: data becomes capital
There are two ways out of stage two, and the second is the interesting one. The first is to raise on the strength of the dataset. A proprietary, regulation-gated dataset that prices or scores an asset class is a fundraising story an off-the-shelf model cannot tell. The second is to deploy capital directly into the assets the data identifies. The dataset stops being a sales asset and becomes an origination engine.
Embedded lending is the cleanest analog. The argument across fintech is that a platform sitting inside a workflow accumulates transaction and behavioral data that lets it underwrite risk better than a bank looking at the same borrower from outside. The data is the edge, and the edge gets monetized as capital deployed into credit. Data-as-collateral works the same way. A lender advances against a receivable or a claim only when someone can price the risk credibly, and a proprietary dataset is what makes that pricing believable.
This is the turn most teams never reach, because it requires the data to be dense enough to bet money on. A model that is right 70 percent of the time is a fine copilot and a terrible underwriter. The capital stage is where thin data gets exposed.
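The gap between a fine copilot and a terrible underwriter is an expected-value problem. The sketch below assumes a simple two-outcome bet: when the model is right the position returns a gain, and when it is wrong it takes a loss. The payoff figures (a 15 percent gain, a 60 percent loss) are hypothetical and chosen only to show the asymmetry. They are not drawn from any Avante venture.

```python
def expected_return(accuracy: float, gain_if_right: float, loss_if_wrong: float) -> float:
    """Expected return per unit of capital for a two-outcome bet."""
    return accuracy * gain_if_right - (1.0 - accuracy) * loss_if_wrong


def break_even_accuracy(gain_if_right: float, loss_if_wrong: float) -> float:
    """Accuracy at which the expected return is exactly zero."""
    return loss_if_wrong / (gain_if_right + loss_if_wrong)


# Hypothetical payoffs: credit-like bets win small and lose big.
GAIN = 0.15   # assumed return when the model's call is right
LOSS = 0.60   # assumed share of principal lost when the call is wrong

for accuracy in (0.70, 0.80, 0.90):
    ev = expected_return(accuracy, GAIN, LOSS)
    print(f"accuracy {accuracy:.0%}: expected return {ev:+.3f} per unit deployed")

print(f"break-even accuracy: {break_even_accuracy(GAIN, LOSS):.0%}")
```

Under these assumed payoffs, 70 percent accuracy loses money on every unit deployed and break-even sits at 80 percent. A suggestion that is wrong costs the user a correction. A loan that is wrong costs principal.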
The pattern across Nexa, WIR, and BR Auction Intel
The same loop runs in three different verticals across the Avante portfolio. Described by domain, no invented numbers.
Nexa Tech runs it in judicial assets. A copilot for precatorios and claims does the valuation and tracking work, and every case it processes thickens the dataset used to value and fund those assets. The Brazilian context is why this works at scale. Services account for roughly 70% of Brazilian GDP and have low software penetration, which leaves a vast surface of under-digitized, regulation-dense workflows where a copilot can mint data no incumbent holds. The precatorios market had roughly R$300 billion in unpaid court-ordered government debt outstanding, with the federal stock alone above R$140 billion in 2023, and about a quarter of pending precatorios had already changed hands in the secondary market. A market that large, that fragmented, and that hard to price is exactly where a copilot builds data no off-the-shelf model holds, and where the data can fund the assets directly.
WIR runs it in insurtech, with AXA. Asynchronous pricing and risk scoring turn every underwriting interaction into a labeled pricing dataset. The output is a pricing signal a generic model cannot reproduce, because the generic model never saw the interactions.
BR Auction Intel runs it in Brazilian real-estate auctions. A scrape, enrich, and score pipeline builds an auction-opportunity dataset that becomes an origination signal, routing capital toward specific properties.
Why the loop fails if the copilot goes unused
The loop only closes if the copilot reaches enough usage to make the data dense. This is the honest weak point of the entire thesis, and pretending otherwise is how studios lose money. A copilot nobody uses produces no data, scarce or otherwise. Thin data prices nothing, scores nothing, and originates nothing, so the capital stage simply never arrives.
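The dependency on usage can be shown with a toy simulation. Everything in it is an assumption made for illustration: the density threshold, the conversion rates, and the function name `run_flywheel` are hypothetical, and real data does not become priceable at a single hard cutoff. The point is only the shape of the outcome.

```python
def run_flywheel(
    initial_usage: float,
    periods: int,
    density_threshold: float = 1000.0,  # hypothetical: records needed before data can price anything
    records_per_use: float = 1.0,       # hypothetical: records minted per unit of usage
    capital_per_record: float = 0.1,    # hypothetical: capital unlocked per record once data is dense
    usage_per_capital: float = 0.5,     # hypothetical: extra usage bought per unit of capital
) -> list[tuple[float, float, float]]:
    """Return (data, capital, usage) at the end of each period."""
    data = 0.0
    usage = float(initial_usage)
    history = []
    for _ in range(periods):
        data += usage * records_per_use                          # usage mints data
        dense_enough = data >= density_threshold                 # thin data prices nothing
        capital = data * capital_per_record if dense_enough else 0.0
        usage += capital * usage_per_capital                     # capital buys more usage
        history.append((data, capital, usage))
    return history


for label, start in [("unused copilot", 0), ("thin usage", 50), ("dense usage", 200)]:
    data, capital, usage = run_flywheel(start, periods=10)[-1]
    print(f"{label}: data={data:.0f}, capital={capital:.1f}, usage={usage:.1f}")
```

In this sketch the thin-usage copilot runs for ten periods and never unlocks any capital, because it never crosses the threshold. The dense-usage copilot crosses it midway and compounds from there. Nothing downstream of usage can be bought early.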
Three concrete ways it breaks. Wrapper risk: the copilot adds too little over a raw model, so usage never builds and there is nothing to mint. Distribution risk: the data exists but the product never reaches the workflow density where the network effect kicks in, which is exactly Currier's marginal-effect warning playing out in real life. Model-dependency risk: the team mistakes the model for the moat, and when inference prices fall another 50x the supposed advantage evaporates.
The discipline the flywheel demands is uncomfortable for founders who want to talk about the fund on day one. Obsess over copilot usage first. The data, the moat, and the capital are all strictly downstream of it. A studio that funds the dataset before the usage exists has bought a number that prices nothing.
Underwrite copilot usage before you underwrite the dataset. Scarce data is defensible, but a copilot nobody uses mints no data, and a dataset that prices nothing cannot reach the capital stage.
How Avante runs the flywheel
Avante Ventures runs this as a studio, not as a portfolio of bets. It launches 3-4 ventures per year through a six-stage system: Research, Partner, Build, Traction, Revenue, Compound. It deploys $500K-1.5M per venture and retains co-founder economics. The structural edge is domain operators with 10+ years of Brazilian-market scar tissue, paired with a Silicon Valley playbook and first-ticket capital, assembled on day one rather than recruited over the first year.
The studio model and this flywheel fit for a specific reason. Solving company plumbing once routes roughly $300K-500K of effective capital per venture into product and traction instead of overhead, which buys the copilot the runway to reach usage density before the data thesis has to prove itself. A studio venture launches 6-9 months ahead of a comparably funded standalone team, and in this pattern those months are pure data accumulation. The benchmark Avante points to is GSSN's finding that studio IRR runs near 50% versus about 19% for traditional VC, roughly 2.5x. That is the studio-model edge, not a claim about any single venture's return. See [/why-avante](/why-avante) for the thesis and [/principles](/principles) for how the studio operates.
The flywheel is not a story about AI. It is a story about which asset survives when the model is free. The team that obsesses over the copilot ends up owning the only thing the price curve cannot copy. The team that obsesses over the fund ends up holding a dataset that prices nothing.