How to Control LLM Inference Cost With Model Routing
Control LLM inference cost with model routing without losing quality. Route cheap by default, escalate on demand, and gate every swap on evals.
Controlling LLM inference cost is a product decision, not a finance one, and model routing is the lever. Send every request to a cheap model by default, escalate to an expensive one only when the task needs it, cache and batch what you can, cap spend with hard budgets, and gate every model swap on an eval harness. Do that and inference stops being a bill you dread and becomes a margin you keep.
Avante Ventures builds AI-native companies this way because the price of a token is falling faster than almost any input in the history of software. The teams that win are not the ones waiting for prices to drop. They are the ones who architected to capture the drop.
Why inference cost is a product decision, not a finance one
The model you pay a premium for today is next year's budget option at a fraction of the price. According to a16z's LLMflation analysis, GPT-3-level performance cost about $60 per million tokens in November 2021 and about $0.06 by 2024 on a small open model. That is roughly a 1,000x decline in three years, or about 10x cheaper per year for a fixed level of capability. Epoch AI's independent measurement puts the decline for a fixed capability between 9x and 900x per year, with a median around 50x.
This is why cost belongs in the product spec, not in a quarterly finance review. A venture whose product already runs most requests on the cheap tier captures that deflation automatically, because each new model that clears its quality bar is cheaper than the last. A venture that hardcoded the frontier model everywhere pays yesterday's price forever and has nowhere to fall to.
There is a catch worth naming. The frontier itself does not get cheaper. OpenAI's o1 launched at the same $60 per million output tokens that GPT-3 charged in 2021. The deflation is in reaching a fixed capability, not in the frontier. So the real question is never which model is best. It is which model is good enough for this specific request, proven by evals.
GPT-3 level performance fell from about $60 per million tokens in 2021 to about $0.06 by 2024, roughly 10x cheaper every year for a fixed capability.
Source: a16z, LLMflation
Cut LLM inference cost in five moves
Here is the sequence an operator can run this week. Every move is measurable and reversible, and each one assumes the one before it is already in place.
- Instrument before you optimize. Log tokens in, tokens out, model, latency, and dollar cost per request, tagged by task type and by customer. You cannot route what you cannot see, and most teams find 80% of spend hiding in a few task types.
- Route cheap by default, escalate on demand. Send every request to a small model first, verify the output, and escalate to the expensive model only on failure.
- Cache and batch aggressively. Turn on prompt caching for the stable parts of prompts and batch anything that is not latency-sensitive.
- Set hard budgets and alerts. Put a per-customer and per-environment token budget in place with an alert at 70% and a hard cap, so a runaway retry loop trips a limit, not an invoice.
- Gate every model swap on evals. No model change, cheaper or more expensive, ships without passing the harness. This is the discipline that cuts cost without quietly cutting quality.
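The first move, instrumentation, can be sketched as a per-request log record plus a roll-up of spend by task type. This is a minimal illustration, not a prescribed schema: the model names, the prices in `PRICES`, and the `RequestLog` and `spend_by_task` names are all hypothetical placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in dollars per million tokens: (input, output).
# Replace with your providers' current rates.
PRICES = {
    "small-model": (0.15, 0.60),
    "frontier-model": (3.00, 15.00),
}


@dataclass
class RequestLog:
    """One row per request, tagged by task type and customer."""
    model: str
    task_type: str
    customer: str
    tokens_in: int
    tokens_out: int
    latency_ms: float

    @property
    def cost_usd(self) -> float:
        price_in, price_out = PRICES[self.model]
        return (self.tokens_in * price_in + self.tokens_out * price_out) / 1_000_000


def spend_by_task(logs):
    """Total dollar cost per task type, largest first."""
    totals = defaultdict(float)
    for log in logs:
        totals[log.task_type] += log.cost_usd
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


logs = [
    RequestLog("frontier-model", "contract_review", "acme", 8_000, 1_000, 2400.0),
    RequestLog("small-model", "classification", "acme", 500, 20, 180.0),
    RequestLog("small-model", "classification", "globex", 600, 25, 200.0),
]
ranking = spend_by_task(logs)  # task types ordered by spend
```

Grouping the same records by `customer` instead of `task_type` gives the per-customer view that the budget step relies on.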
Model routing: cheap by default, expensive on demand
A model cascade sends each request to the cheapest capable model, verifies the result, and escalates only what fails. The whole economics turn on one number: the escalation rate, the share of traffic that falls through to the expensive tier.
The published figures are strong. TrueFoundry's routing analysis shows that a 70% cheap-resolution rate brings blended cost to about half of running the frontier model everywhere, even after paying for the failed cheap attempt on the 30% that escalate. At a 10x price gap between tiers, a cascade lands near 40% of frontier-everywhere cost. Practitioner reports put real savings at 45% to 85% while holding about 95% of quality.
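The arithmetic behind figures like these can be reproduced with a short sketch. It rests on one assumption that is ours, not necessarily the cited analysis's: every request pays for one cheap attempt, and each escalated request additionally pays full price on the expensive tier. The function name and parameters are illustrative.

```python
def blended_cost_ratio(cheap_resolution_rate: float, price_gap: float) -> float:
    """Cascade cost as a fraction of sending everything to the expensive model.

    Assumes every request pays for one cheap attempt and each escalated
    request additionally pays full price on the expensive tier.
    """
    if not 0.0 <= cheap_resolution_rate <= 1.0:
        raise ValueError("cheap_resolution_rate must be between 0 and 1")
    if price_gap <= 0:
        raise ValueError("price_gap must be positive")
    cheap_cost = 1.0 / price_gap  # expensive tier is normalized to 1.0
    escalation_rate = 1.0 - cheap_resolution_rate
    return cheap_cost + escalation_rate


ratio_10x = blended_cost_ratio(0.70, price_gap=10)  # 0.1 + 0.3
ratio_5x = blended_cost_ratio(0.70, price_gap=5)    # 0.2 + 0.3
```

Under that assumption, a 70% cheap-resolution rate lands at about 40% of frontier-everywhere cost with a 10x price gap and at about half with a 5x gap, which is consistent with both numbers quoted above. The same function shows how quickly the saving erodes as the escalation rate climbs.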
Start with the simplest router that works. Static rules that pick a model from a task tag cost almost nothing to run. Cost-aware routing picks the cheapest model that clears a quality threshold. Semantic routing embeds the request and classifies intent for a few milliseconds of overhead. The cascade sits on top. Watch the escalation rate daily: if it drifts up, either your cheap model degraded or your traffic mix changed, and either way your blended cost just moved.
Escalation rate is the one metric to watch daily. It is the number that ties your quality to your bill, and it should live on a dashboard, not surface in a monthly invoice.
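A static rule table with a cascade on top might look like the sketch below. Everything named here is hypothetical: `STATIC_RULES`, the model names, and the `call_model` and `verify` callables stand in for your own provider client and output check. The escalation rate is counted as the share of all requests that reached the expensive tier.

```python
from typing import Callable

# Hypothetical static rules: task tag -> model that handles it first.
STATIC_RULES = {"classification": "small-model", "legal_drafting": "frontier-model"}


class Cascade:
    def __init__(self, call_model: Callable[[str, str], str],
                 verify: Callable[[str, str], bool],
                 cheap: str = "small-model", expensive: str = "frontier-model"):
        self.call_model = call_model  # (model, prompt) -> output
        self.verify = verify          # (prompt, output) -> acceptable?
        self.cheap = cheap
        self.expensive = expensive
        self.requests = 0
        self.expensive_requests = 0

    def handle(self, task_tag: str, prompt: str) -> tuple[str, str]:
        """Return (model_used, output) for one request."""
        self.requests += 1
        first = STATIC_RULES.get(task_tag, self.cheap)  # unknown tags start cheap
        if first == self.cheap:
            output = self.call_model(self.cheap, prompt)
            if self.verify(prompt, output):
                return self.cheap, output
        # Either a rule sent the task straight to the expensive tier,
        # or the cheap output failed verification.
        self.expensive_requests += 1
        return self.expensive, self.call_model(self.expensive, prompt)

    @property
    def escalation_rate(self) -> float:
        """Share of all traffic that reached the expensive tier."""
        return self.expensive_requests / self.requests if self.requests else 0.0


def fake_call_model(model: str, prompt: str) -> str:
    # Stand-in for a real provider call: the small model "fails" on long prompts.
    if model == "small-model" and len(prompt) > 40:
        return ""
    return f"{model} answer"


def non_empty(prompt: str, output: str) -> bool:
    return bool(output.strip())


router = Cascade(fake_call_model, non_empty)
results = [
    router.handle("classification", "Is this email spam?"),
    router.handle("summarization", "Summarize the attached forty-page supplier agreement in detail."),
    router.handle("legal_drafting", "Draft an NDA."),
    router.handle("classification", "Tag this ticket."),
]
```

In practice the verifier is the hard part. A non-empty check is only a placeholder for schema validation, a rubric score, or whatever signal your evals show correlates with accepted outputs.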
Caching, batching, and hard budgets
Routing decides which model. Caching, batching, and budgets decide how little you pay for the requests you do send. These are the cheapest wins in the stack and most teams leave them on the table.
Prompt caching pays off on stable context you send over and over, like system prompts, tool definitions, and retrieved documents. OpenAI applies a 50% discount on cached input tokens. Anthropic prices cache reads at 0.1x the base input rate, a 90% cut on the repeated portion. Batching folds many non-urgent requests into one job at a lower rate, which suits overnight enrichment, evals, and back-office work that no user is waiting on.
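To see what those discounts mean for a single request, here is a small worked calculation. The 10,000-token prompt, the 8,000 cached tokens, and the $3 per million base price are made-up inputs, and the sketch ignores any cache-write surcharge or minimum cacheable length a provider may apply.

```python
def effective_input_cost(total_tokens: int, cached_tokens: int,
                         base_price_per_mtok: float, cached_multiplier: float) -> float:
    """Dollar cost of one request's input when part of the prompt is a cache hit.

    cached_multiplier is the fraction of the base rate charged for cached
    tokens: 0.5 for a 50% discount, 0.1 for cache reads priced at 0.1x.
    """
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    fresh_tokens = total_tokens - cached_tokens
    billable_tokens = fresh_tokens + cached_tokens * cached_multiplier
    return billable_tokens * base_price_per_mtok / 1_000_000


# Hypothetical request: 10,000 input tokens, 8,000 of them a stable prefix.
uncached = effective_input_cost(10_000, 0, 3.00, 1.0)
half_price_hits = effective_input_cost(10_000, 8_000, 3.00, 0.5)
tenth_price_hits = effective_input_cost(10_000, 8_000, 3.00, 0.1)
```

With these inputs the request costs about $0.030 uncached, $0.018 with a 50% discount on the cached portion, and $0.0084 with cache reads at 0.1x. The saving scales with how much of the prompt is stable, which is an argument for putting the system prompt and tool definitions first and the variable content last.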
Budgets are the seatbelt. A per-customer and per-environment cap with an alert well before the ceiling turns a prompt-injection abuse case or a retry storm into a tripped limit instead of a five-figure surprise. Cost control that depends on nobody making a mistake is not cost control.
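A budget guard along these lines can be sketched in a few lines. The `BudgetGuard` class, its in-memory counters, and the 1,000-token cap are illustrative assumptions. A production version would persist usage, reset it per billing period, and handle concurrent requests.

```python
from collections import defaultdict


class BudgetExceeded(RuntimeError):
    """Raised when a request would push a key past its hard cap."""


class BudgetGuard:
    def __init__(self, limits, alert_fraction=0.70, on_alert=print):
        self.limits = limits  # {(customer, environment): token cap}
        self.alert_fraction = alert_fraction
        self.on_alert = on_alert
        self.used = defaultdict(int)
        self.alerted = set()

    def charge(self, customer, environment, tokens):
        """Record usage before the model call; refuse it if over the cap."""
        key = (customer, environment)
        limit = self.limits[key]
        if self.used[key] + tokens > limit:
            raise BudgetExceeded(f"{key} would exceed {limit} tokens")
        self.used[key] += tokens
        # Fire the alert once, the first time usage crosses the threshold.
        if key not in self.alerted and self.used[key] >= self.alert_fraction * limit:
            self.alerted.add(key)
            self.on_alert(f"{key} at {self.used[key]}/{limit} tokens")


alerts = []
guard = BudgetGuard({("acme", "prod"): 1_000}, on_alert=alerts.append)
guard.charge("acme", "prod", 600)  # under the alert threshold
guard.charge("acme", "prod", 150)  # crosses 70%, alert fires once
try:
    guard.charge("acme", "prod", 400)  # would exceed the cap, so it is refused
    blocked = False
except BudgetExceeded:
    blocked = True
```

Charging before the call, rather than after, is what makes a retry loop trip the limit instead of running past it.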
How evals let you ride the cost curve down
The eval harness is not overhead. It is the asset that makes every cost cut safe and turns model-agnosticism into a position competitors cannot copy.
Every real request your product handles is a labeled example of what good looks like in your domain. Capture the outputs customers accept, correct, or reject and you build a domain-specific eval set no rival has. That set does two jobs at once. It lets you drop in each cheaper model the day it clears your bar, so you ride the cost curve down with no quality regression. And it becomes proprietary data, which feeds the copilot to data to fund flywheel: build an AI copilot to generate proprietary data, then use that data to raise and deploy capital. Domain-specific evals as an AI moat are where usage compounds into defensibility.
Because quality is protected by evals rather than by a bet on one vendor, the venture stays free to route to whoever is cheapest per unit of verified quality this quarter. The moat is the eval set and the workflow, detailed in the copilot to data to fund flywheel, never the model. The model is the commodity that keeps getting cheaper.
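An eval gate for model swaps can be as small as the sketch below. The ticket-classification cases, the dictionary-backed stand-in models, and the 5-point tolerance are all hypothetical. A real eval set would come from the outputs customers accepted, corrected, or rejected.

```python
from typing import Callable, Iterable


def pass_rate(generate: Callable[[str], str],
              eval_set: Iterable[tuple[str, Callable[[str], bool]]]) -> float:
    """Share of eval cases whose output is accepted by the case's checker."""
    cases = list(eval_set)
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, accepts in cases if accepts(generate(prompt)))
    return passed / len(cases)


def may_ship(candidate_rate: float, baseline_rate: float, tolerance: float = 0.0) -> bool:
    """Gate: the candidate ships only if it is within tolerance of the baseline."""
    return candidate_rate >= baseline_rate - tolerance


# Hypothetical eval set: (prompt, checker for an acceptable output).
eval_set = [
    ("refund not received", lambda out: out == "billing"),
    ("cannot log in", lambda out: out == "account"),
    ("app crashes on upload", lambda out: out == "bug"),
    ("change my invoice address", lambda out: out == "billing"),
]

# Stand-ins for real model calls: lookup tables of canned answers.
current_model = {
    "refund not received": "billing",
    "cannot log in": "account",
    "app crashes on upload": "bug",
    "change my invoice address": "billing",
}.get
cheaper_model = {
    "refund not received": "billing",
    "cannot log in": "account",
    "app crashes on upload": "bug",
    "change my invoice address": "account",  # a regression the gate should catch
}.get

baseline_rate = pass_rate(current_model, eval_set)
candidate_rate = pass_rate(cheaper_model, eval_set)
ship = may_ship(candidate_rate, baseline_rate, tolerance=0.05)
```

Here the cheaper candidate misses one case in four and is held back. The same gate applies when the candidate is a more expensive model, since a swap in either direction can regress on your domain.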
Failure modes: optimizing cost before you have quality
The most expensive mistake is optimizing cost before you have earned the right to. A team that routes everything to the cheapest model to protect a spreadsheet ships a worse product, loses the customers whose usage would have built the eval set and the data moat, and ends up with neither margin nor moat. Here is where teams go wrong.
- Cost before quality. Cutting to the cheap tier before your evals can catch the regression. You will not see the quality loss. Your churn will.
- No eval gate. Swapping models on instinct. Every swap must clear the harness or you are flying blind.
- Untracked escalation rate. Ignore the share of traffic hitting the expensive tier and a silent drift doubles your bill or halves your quality with no alarm.
- Vendor lock-in disguised as simplicity. Hardcoding one model to dodge routing work feels lean until that vendor raises prices or falls behind and you have no eval set to migrate safely.
- Measuring the wrong thing. Optimizing average cost per token instead of cost per satisfied request. A cheap answer the customer rejects is the most expensive token you will ever buy.
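The last failure mode is easiest to see with numbers. The sketch below computes cost per satisfied request from (cost, accepted) pairs. Both traffic samples are invented to illustrate the metric and are not measurements.

```python
def cost_per_satisfied_request(requests):
    """requests: iterable of (cost_usd, accepted) pairs."""
    total_cost = 0.0
    satisfied = 0
    for cost_usd, accepted in requests:
        total_cost += cost_usd  # rejected answers still cost money
        satisfied += bool(accepted)
    if satisfied == 0:
        return float("inf")
    return total_cost / satisfied


# Invented sample A: everything on the cheap tier, 2 of 10 answers accepted.
cheap_only = [(0.001, i < 2) for i in range(10)]

# Invented sample B: 7 resolved cheaply, 3 escalated (one still rejected).
routed = [(0.001, True)] * 7 + [(0.011, True)] * 2 + [(0.011, False)]

avg_cheap = sum(cost for cost, _ in cheap_only) / len(cheap_only)
avg_routed = sum(cost for cost, _ in routed) / len(routed)
per_satisfied_cheap = cost_per_satisfied_request(cheap_only)
per_satisfied_routed = cost_per_satisfied_request(routed)
```

In this made-up sample the routed traffic costs four times as much per request on average, yet it is cheaper per satisfied request, because the cheap-only sample pays for eight answers nobody accepted.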
How Avante keeps margin on the venture's side
AI infrastructure is now cheap enough to deploy without a Series A, and the demand is already here. In Brazil, the share of industrial firms with 100 or more employees using AI rose from 16.9% in 2022 to 41.9% in 2024, per IBGE PINTEC, and Bain found 25% of Brazilian companies had an AI use case in production, more than double the prior year. The venture that serves that demand on a cheap-tier cost base keeps the margin. The one paying frontier prices on every request hands the margin back.
Avante Ventures is a venture studio building AI-native companies in Brazil and Latin America, and routing discipline is what makes launching 3-4 ventures per year on lean margins arithmetically possible. Solving the plumbing once, including the routing and eval stack, routes roughly $300K-$500K of effective capital per venture into product and traction rather than overhead. That capital efficiency is why studio ventures reach first revenue 6-9 months ahead of a comparably funded standalone team, and why the studio model benchmarks near 50% IRR against roughly 19% for traditional venture capital, per GSSN.
So earn quality first with the model that works, instrument every request, then cut cost under the protection of evals. A team that does it in that order rides the cost curve down for free. A team that does it backwards pays twice, once for the churn and once for the moat it never built. See why Avante builds this way.
Frequently asked questions
- What is the fastest way to reduce LLM inference cost without losing quality?
  Route cheap by default and escalate to an expensive model only when a task needs it. A model cascade that resolves 70% of traffic on a cheap tier can bring blended cost to about half of running the frontier model everywhere, and published routing results show 45% to 85% savings while holding roughly 95% of quality. Gate every model swap on an eval harness so a cost cut never quietly ships a worse product.
- Why is LLM inference cost a product decision instead of a finance one?
  Because the architecture you build at launch decides whether falling token prices become your margin or pass you by. Per a16z, the cost of a fixed capability has dropped about 10x per year, so a product that already runs most requests on the cheap tier captures that deflation automatically. A product that hardcoded the frontier model everywhere keeps paying yesterday's price.
- How does model routing actually work?
  A router sends each request to the cheapest capable model, verifies the output, and escalates only what fails. Strategies range from static rules that read a task tag, to cost-aware selection, to semantic routing that classifies intent, to a full cascade. The escalation rate, the share of traffic that reaches the expensive tier, is the number that governs both cost and quality and should be monitored daily.
- Does cutting LLM inference cost mean cutting quality?
  Only if you optimize cost before you have an eval harness to protect quality. Evals let you route to the cheapest model that still clears your quality bar and swap in each cheaper model the day it qualifies. The real failure mode is routing everything to the cheapest model to protect a spreadsheet, which ships a worse product and loses the customers whose usage would have built your data moat.
- How much can prompt caching and batching save?
  Prompt caching discounts the stable, repeated parts of a prompt. OpenAI applies a 50% discount on cached input tokens and Anthropic prices cache reads at 0.1x the base rate, a 90% cut on the cached portion. Batching non-urgent work into a single job lowers the rate further, which suits overnight enrichment and eval runs that no user is waiting on.