Why Your AI Stack Needs a Model Router

The Case for Infrastructure Thinking

Apr 28, 2026

You’ve got Claude, GPT-4, Gemini, maybe Grok or Kimi, and whatever is running on the local vLLM or Ollama server, depending on the task. Your team is shipping features on top of these APIs. The workflows are running, and everyone is starting to align on team context and skills. Things mostly work.

Now tell me what happens when one of those providers goes down.

Not degraded. Not slow. Down. HTTP 429s across the board, region offline, API timing out. And somewhere in your codebase, there’s a hardcoded model="claude-sonnet-4" just sitting there.

While this might be acceptable for some work, automation and agentic workflows will quickly come to a screeching halt. Many of us learned the hard way, through the cloud and DevOps era, that resilience is critical.

But there is another failure scenario that we are just starting to uncover.

The Failure Nobody Sees Coming

Outages are the obvious failure. There’s a worse scenario worth naming: it’s far more common, and almost nobody tracks it.

Every engineer who works seriously with AI knows this feeling. You’re in flow. Progress is happening. The model is keeping up, and you’re building momentum. Then something shifts — and you can’t point to exactly when.

The model that was working with you is now working against you. Outputs that used to land in one or two attempts are taking five or six. Code comes back wrong in ways that feel off, not just incorrect. Reasoning that felt sharp starts feeling brittle. You start rewriting prompts that haven’t changed, convinced the problem is on your end. Tokens burn. Progress doesn’t just slow; it reverses. You’re putting in more effort and getting worse results, and there’s nothing in your environment telling you why.

What actually happened is the model changed underneath you. A provider pushed a new default version. Something got quietly deprecated. A routing decision somewhere in the stack landed you on a model that isn’t built for what you’re doing. The capability gap is real, but nothing flagged it. No alert, no error, no log entry. The system is technically working — it’s just working badly now.

This doesn’t show up in your incident tracker. It shows up in your team’s velocity going sideways, token costs climbing, and that slow build of friction before someone finally says out loud, “Is the model acting weird for anyone else?”

At least with a hard outage, everyone knows immediately. The problem is visible, bounded, and fixable. What makes the quality regression so much more expensive is that your engineers absorb the cost personally — in wasted work, second-guessed prompts, and lost progress — before anyone even identifies what changed. Spread that across a team working through a deadline week, and you’ve quietly lost half a day of output before a single ticket gets filed.

A router with model version pinning and quality scoring is what catches this before your team does. When a model's output scores drop, you see it in telemetry. You stay on the version that was working. You evaluate new releases on your own timeline, against your own task types, before they get anywhere near production. The provider can ship whatever they want on their schedule. Your stack moves on yours.
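As a rough illustration of pinning plus quality scoring, here is a minimal Python sketch. Everything in it is an assumption made for the example: the model identifiers, the `QualityMonitor` class, the window size, and the thresholds are hypothetical, and how a score gets produced (an eval suite, a grader model, user feedback) is left out entirely.

```python
from collections import deque
from statistics import mean

# Hypothetical pins: each task type maps to an explicit, dated model version
# rather than a floating alias that the provider can repoint.
PINNED_MODELS = {
    "code_review": "provider-a/model-x-2026-01-15",
    "summarize": "provider-b/model-y-2025-11-02",
}


class QualityMonitor:
    """Tracks a rolling window of quality scores (0.0 to 1.0) per model."""

    def __init__(self, window=50, baseline=0.85, tolerance=0.10):
        self.window = window
        self.baseline = baseline    # score the pinned version normally achieves
        self.tolerance = tolerance  # allowed drop before flagging a regression
        self.scores = {}

    def record(self, model, score):
        # Old scores fall off the left side once the window is full.
        self.scores.setdefault(model, deque(maxlen=self.window)).append(score)

    def rolling_mean(self, model):
        history = self.scores.get(model)
        return mean(history) if history else None

    def regressed(self, model, min_samples=10):
        history = self.scores.get(model, ())
        if len(history) < min_samples:
            return False  # not enough data to judge either way
        return mean(history) < self.baseline - self.tolerance
```

The point of the sketch is the shape, not the numbers: the pin keeps you on a known version, and `regressed` turns "is the model acting weird?" into something a dashboard or alert can answer.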

You wouldn’t let your database provider silently swap your Postgres version on a production cluster. The same logic applies here, and it’s time to treat it that way.

What a Model Router Actually Is

A model router is the decision layer that sits in front of your LLM calls and figures out which model handles each request — in real time, based on availability, cost, task type, quality scores, and whatever else your system cares about.

It’s not a product you buy. It’s a pattern you build into your stack, the same way you build retry logic into API clients or health checks into your services. Once it’s there, providers can have outages, ship regressions, or change their defaults without warning, and your system keeps running the way you designed it. That’s the whole point.

The implementation complexity scales with your needs. A basic router might just be a priority list with fallback logic and some logging. A mature one scores quality, latency, and cost per token against task type and makes routing decisions dynamically. Where you start depends on how much of your product actually depends on AI being reliable.
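To make the "priority list with fallback logic and some logging" version concrete, here is a minimal sketch in Python. The `route` function and the `AllProvidersFailed` exception are hypothetical names for this example, and each provider is assumed to be wrapped in a plain callable that takes a prompt and returns text; no real SDK calls are shown.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("model_router")


class AllProvidersFailed(Exception):
    """Raised when every provider in the priority list has failed."""


def route(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each (name, call) pair in priority order; return (name, response)."""
    errors = []
    for name, call in providers:
        try:
            response = call(prompt)
            logger.info("served by %s", name)
            return name, response
        except Exception as exc:  # a real router would catch narrower error types
            logger.warning("provider %s failed: %s", name, exc)
            errors.append((name, exc))
    # Nothing worked: surface every failure so the caller can see the full chain.
    raise AllProvidersFailed(errors)
```

Calling `route(prompt, [("primary", call_primary), ("fallback", call_fallback)])` with your own wrappers returns the name of whichever provider answered, which is the first piece of telemetry worth keeping.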

The Pattern Your Team Already Knows

In 2012, the hard lesson was multi-region. If your entire application lived in us-east-1 and that region went down, everything went with it. The engineers who hadn’t built geographic redundancy found out why it mattered in the worst possible way.

What emerged from that era was a well-understood pattern: detect failure, route around it, keep the system running, and alert someone to go look at it. The customer never knows it happened. The on-call engineer handles it at a reasonable hour rather than getting paged at 2 am when the whole product is dark.

The AI stack is sitting at that same inflection point right now. Most teams have picked one provider, hardcoded it throughout their codebase, and are implicitly letting that provider’s uptime and release decisions determine their product’s reliability story. A provider’s reliability and your product’s reliability are not the same thing, and the gap between them tends to show up at the worst possible time. The thinking that solved cloud redundancy also solves this. Your team already understands the pattern. They just haven’t applied it to this layer yet.

What You’re Actually Getting Out of It

The resilience piece is straightforward: when your primary model provider has an incident, traffic shifts to your fallback, and your feature keeps running. That’s worth having, but it’s honestly the least interesting part of the value.

The more interesting part is the cost. RouteLLM research from LMSYS in July 2024 showed that intelligent routing between model tiers, using frontier models only when a task actually requires frontier capability, can retain roughly 95% of the stronger model’s benchmark quality while cutting costs anywhere from 40 to 85 percent. The range is that wide because it depends heavily on your workload mix, but the principle is solid: most of what your system processes isn’t complex reasoning. Summarization, classification, extraction, and straightforward RAG retrieval all run cleanly on smaller, cheaper models. The router handles that decision automatically, so you’re not burning frontier-model tokens on work that doesn’t need them.
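The arithmetic behind tiered routing is easy to sketch. The Python below uses a static lookup by task type, which is much simpler than the learned per-query routers studied in RouteLLM, and the prices, task names, and tier names are invented for illustration only.

```python
# Hypothetical prices in USD per million tokens; real prices vary by provider.
PRICE_PER_MTOK = {"small": 0.50, "frontier": 10.00}

# Assumed policy: only the task types listed here need frontier capability.
FRONTIER_TASKS = {"complex_reasoning", "code_generation"}


def pick_tier(task_type: str) -> str:
    """Send a task to the frontier tier only if policy says it needs it."""
    return "frontier" if task_type in FRONTIER_TASKS else "small"


def cost_usd(workload: dict[str, int], routed: bool) -> float:
    """Total cost for a workload mapping task type -> tokens processed.

    With routed=False every token goes to the frontier tier.
    """
    total = 0.0
    for task_type, tokens in workload.items():
        tier = pick_tier(task_type) if routed else "frontier"
        total += tokens / 1_000_000 * PRICE_PER_MTOK[tier]
    return total
```

For a made-up workload of 6M summarization tokens, 2M classification tokens, and 2M complex-reasoning tokens, these invented prices give $100 with everything on the frontier tier and $24 with routing, a 76% reduction. Your own mix and prices will move that number, and the sketch says nothing about quality, which is why the scoring side matters.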

Layer on semantic caching for repeated queries, internal vector DB lookups before any external API call, and data residency enforcement, and the router becomes your internal knowledge gateway as much as your model selector. These aren’t add-ons. They’re the kind of infrastructure policies that are almost impossible to enforce consistently at the application layer, but become automatic when they live in the routing layer.
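Semantic caching is the least familiar of those pieces, so here is a toy Python sketch of the idea. The `SemanticCache` class, the 0.95 threshold, and the `embed` callable are all assumptions for the example; it scans entries linearly, where a real deployment would more likely sit on a vector index.

```python
import math
from typing import Callable, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a stored response when a new query is close enough to an old one."""

    def __init__(self, embed: Callable[[str], Sequence[float]], threshold: float = 0.95):
        self.embed = embed          # assumed: any function mapping text to a vector
        self.threshold = threshold  # assumed cutoff; tune against real traffic
        self.entries = []           # list of (vector, response) pairs

    def get(self, query: str) -> Optional[str]:
        vec = self.embed(query)
        best_score, best_response = 0.0, None
        for stored_vec, response in self.entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        # Only a sufficiently similar match counts as a cache hit.
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

The router would call `get` before any external API call and `put` after a successful one. A threshold that is too loose returns wrong answers confidently, so it is worth treating as a policy decision rather than a default.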

At production scale, teams doing this well report 60 to 80 percent reductions in API spend. If you’re running a meaningful AI workload and not routing intelligently, that’s real money sitting on the table.

Where Teams Get This Wrong

The failure mode worth knowing about upfront is over-engineering the router before you understand your own traffic. Routing logic that adds 200ms to every request, retry chains that bounce across three providers before giving up, fallback conditions that trigger on transient errors: a poorly built router can introduce instability where there wasn’t any. The right approach is to start with the simplest possible implementation plus basic observability, and let actual data tell you where to add complexity.
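One way to keep retry chains from running away is to give the router an explicit budget. The Python sketch below is one possible shape, not a prescription: `TransientError` is a hypothetical marker class you would map real SDK errors onto, and the attempt cap, deadline, and 0.1 second pause are arbitrary values chosen for the example.

```python
import time
from typing import Callable, Sequence


class TransientError(Exception):
    """Hypothetical marker for errors worth one quick retry (e.g. a dropped connection)."""


def call_with_budget(
    prompt: str,
    providers: Sequence[Callable[[str], str]],
    max_attempts: int = 3,
    deadline_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry a transient error once on the same provider, then fail over.

    Stops as soon as the attempt cap or the overall deadline is exceeded.
    """
    start = time.monotonic()
    attempts = 0
    last_error = None
    for call in providers:
        for retry in range(2):  # first try plus at most one retry
            if attempts >= max_attempts or time.monotonic() - start > deadline_s:
                raise RuntimeError("routing budget exhausted") from last_error
            attempts += 1
            try:
                return call(prompt)
            except TransientError as exc:
                last_error = exc
                if retry == 0:
                    sleep(0.1)  # short pause before the single same-provider retry
            except Exception as exc:
                last_error = exc
                break  # non-transient: go straight to the next provider
    raise RuntimeError("all providers failed") from last_error
```

The design choice here is that a transient blip does not immediately trigger a provider switch, and no request can consume more than a fixed number of attempts or seconds regardless of how many providers are configured.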

The other thing teams miss is that same-provider fallback doesn’t actually protect you in every scenario. If your primary Claude model hits a moderation layer issue, switching to another Anthropic model often fails for the same reason — moderation sits upstream of model selection. Real resilience means cross-provider fallback, not just cross-model.

OpenRouter and similar managed routing services are worth knowing about, too. They get you to multi-provider coverage quickly, which is genuinely useful. The tradeoff is that you’ve added a dependency, a network hop, and a layer of opacity over routing decisions that you don’t control. That’s a reasonable place to start. It’s not where you want to stay if AI is actually critical to your product.

What It Takes to Build It

A basic router is a solid backend engineering project for someone who’s built reliable API integrations, knows how to handle retry and fallback patterns, and actually cares about observability. This isn’t ML work. It’s systems reliability work with an LLM in the hot path. A working implementation with one fallback tier and basic cost and quality logging is a worthy exploration project for any team.

A production-grade router with composite scoring, multiple fallback tiers, semantic caching, RAG integration, and full telemetry is a bigger lift — more like a month — and wants someone who’s built distributed systems before. Not a moonshot, but it’s real infrastructure work and should be scoped that way.

What engineering alone can’t decide: which models are approved for which task types, how quality gets defined for your specific use cases, and what the fallback priority order looks like across providers. Those decisions require context that lives above the implementation layer, and they need to be made deliberately rather than defaulting to whatever was easiest at build time.

After it’s built, someone needs to own the telemetry and observability. Routing decisions, fallback frequency, cost per task type, quality score trends over time — without visibility into how the router is actually behaving, you’ve just moved the black box rather than eliminated it.
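The telemetry listed above can start very small. This Python sketch assumes a hypothetical `RoutingEvent` record emitted once per request and a `summarize` helper that rolls events up per task type; the field names and the 0.0 to 1.0 quality scale are illustrative choices, not a standard.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RoutingEvent:
    task_type: str
    model: str
    was_fallback: bool
    cost_usd: float
    quality: float  # 0.0 to 1.0, however your team defines it


def summarize(events):
    """Aggregate per task type: request count, fallback rate, total cost, mean quality."""
    buckets = defaultdict(list)
    for event in events:
        buckets[event.task_type].append(event)
    report = {}
    for task_type, group in buckets.items():
        n = len(group)
        report[task_type] = {
            "requests": n,
            "fallback_rate": sum(e.was_fallback for e in group) / n,
            "total_cost_usd": sum(e.cost_usd for e in group),
            "mean_quality": sum(e.quality for e in group) / n,
        }
    return report
```

Running `summarize` over a day of events gives the owner of the router the four trends named above in one table, which is usually enough to spot a rising fallback rate or a sliding quality score before anyone asks about it.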

Before Your Next Architecture Review

Pick any AI-powered feature currently in production. Ask yourself: if the model provider behind it is unavailable for two hours right now, what actually happens? And if they ship a quiet model regression tonight, how long does it take your team to notice?

If either of those answers is uncomfortable, that’s the conversation to have. Not as a fire drill — as a scoping exercise for work that belongs on the roadmap. The cost of building this is measured in weeks. The cost of not building it compounds every time a provider makes a decision that affects your system and you’re the last one to know about it.

-----

Ryan Booth writes about AI infrastructure, systems thinking, and building at scale.
