TL;DR

How to pick an LLM for production: stop asking which model is best, and start asking which model wins your private eval set on a specific task at a defensible cost and latency. The four selection axes are capability fit, cost at volume, latency, and governance. In practice, the production answer is almost always a small orchestration of two or three model families, each used for what it is best at.

  • The question is wrong. You usually need more than one model.
  • Build your eval set before you read a single benchmark.
  • Score on capability, cost, latency, and governance fit.
  • Use an abstraction layer so you can swap models per step.
  • Never lock in. Pricing and capability shift quarterly.

Why "which LLM is best" is the wrong question

The first thing I tell a brand asking me to pick an LLM is that the question, as phrased, is going to lose them money. There is no single best LLM. There is the model that happens to be best at the specific task you are routing to it, at a cost you can afford, with a latency your user will tolerate, and with a policy posture your legal team can sign off on. Those are four different questions, not one.

The teams that ship pick by task. The teams that stall pick by vibe. Vibe says, "we use Claude." Task says, "we use Claude for the writing step, GPT for the tool-calling step, and a small open model for the classification step." Vibe scales until the bill arrives. Task scales because each model is doing the thing it is best at.

The other reason the question is wrong is that the answer changes every quarter. Anyone who tells you a specific model is the best for a specific task as of a specific Tuesday is selling you a snapshot. Build the system to swap models, and a change in the answer does not break anything.

The four selection axes

Every LLM selection decision in production runs through four axes. If you skip one, you will pay for it in the first quarter of operation.

1. Capability fit

Does the model do the job, on your data, at the quality threshold the workflow requires? Capability is not measured by public benchmarks. It is measured by your private eval set. Public leaderboards correlate weakly with real workload performance, because the benchmarks are designed to differentiate frontier models on broad tasks, not to predict how a specific model handles your specific product copy or your specific customer-support corpus.

Capability fit also has a step-level dimension. A model that wins your writing eval may lose your tool-calling eval. A model that wins your tool-calling eval may lose your long-context retrieval eval. The unit of capability is the step, not the workload.

2. Cost at volume

Run the math at projected scale, not current scale. A model that is 3x more expensive per token is fine at ten thousand daily calls. At ten million daily calls, it is a board-level line item. The cheap model that loses your eval by ten percent may still be the right pick if the gap is closable with a better prompt and the cost difference is an order of magnitude.

Cost has three components most teams forget to count: input tokens, output tokens, and retries. Retries on a flaky model can double your effective cost. Always model the all-in cost, not the headline price.
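
The all-in math above can be sketched in a few lines. This is a minimal illustration, not a pricing tool: the PricePoint class, the all_in_daily_cost function, and every number in it are hypothetical placeholders, and retry_rate is assumed to mean the average number of extra attempts per call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricePoint:
    """Hypothetical prices in USD per one million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def all_in_daily_cost(
    price: PricePoint,
    daily_calls: int,
    input_tokens: int,
    output_tokens: int,
    retry_rate: float,
) -> float:
    """Estimated daily spend in USD, including retried calls.

    retry_rate is the average number of extra attempts per call:
    0.0 means no retries, 1.0 means every call effectively runs twice.
    """
    per_call = (
        input_tokens * price.input_per_mtok
        + output_tokens * price.output_per_mtok
    ) / 1_000_000
    return daily_calls * per_call * (1.0 + retry_rate)


# Placeholder numbers, not real vendor pricing. "premium" is 3x "budget".
premium = PricePoint(input_per_mtok=3.0, output_per_mtok=15.0)
budget = PricePoint(input_per_mtok=1.0, output_per_mtok=5.0)

for calls in (10_000, 10_000_000):
    p = all_in_daily_cost(premium, calls, 1_000, 500, retry_rate=0.0)
    b = all_in_daily_cost(budget, calls, 1_000, 500, retry_rate=0.0)
    print(f"{calls:>10,} calls/day: premium ${p:,.2f} vs budget ${b:,.2f}")
```

Under these made-up prices the same 3x gap is about $70 a day at ten thousand calls and about $70,000 a day at ten million. Setting retry_rate to 1.0 doubles either figure, which is the flaky-model case.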

3. Latency

A model that takes nine seconds to respond is useless on a product page. The same model is fine in a nightly batch job. Latency is a user-experience variable, not a benchmark variable. Some workflows tolerate a five-second response. Some need sub-300ms or the user bounces.

Streaming changes the equation. If the workflow can stream tokens, perceived latency drops dramatically and you can use slower, smarter models in places they would otherwise lose. The latency conversation has to be paired with the streaming conversation.
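
One way to make the streaming point concrete is to measure time to first token separately from total response time. The sketch below is an assumption-heavy illustration: fake_stream stands in for a vendor streaming call, and time_stream and StreamTiming are hypothetical names, not part of any SDK.

```python
import time
from typing import Callable, Iterable, Iterator, NamedTuple


class StreamTiming(NamedTuple):
    first_token_s: float  # when the user sees the response start
    total_s: float        # how long a non-streaming call would make them wait
    text: str


def time_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream and record first-token and total latency."""
    start = clock()
    first = None
    parts = []
    for tok in tokens:
        if first is None:
            first = clock() - start
        parts.append(tok)
    total = clock() - start
    # An empty stream has no first token, so report the total for both.
    return StreamTiming(first if first is not None else total, total, "".join(parts))


def fake_stream(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Stand-in for a vendor streaming call; sleeps before each token."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i} "


timing = time_stream(fake_stream(n_tokens=20, delay_s=0.01))
print(f"first token: {timing.first_token_s:.3f}s, full response: {timing.total_s:.3f}s")
```

If the workflow can render tokens as they arrive, first_token_s is the number to hold against the user's tolerance. If it cannot, total_s is.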

4. Governance and compliance

Does the model's policy posture work for your industry? Does the vendor offer the data residency, retention, and zero-retention options your legal team requires? Does the contract allow training on your prompts by default, and if so, can you opt out? Does the vendor offer enterprise SLAs or are you on a public API tier with no recourse when capacity spikes?

For consumer brands handling PII, the governance axis is usually the silent killer. The model that wins on capability and cost can still be unusable because it routes through a region your DPA does not cover. Sort governance before you fall in love.
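
The four axes can be combined in a simple selection pass. The sketch below makes one specific assumption that is mine, not a rule: governance and the cost and latency budgets are treated as hard gates, and the eval score only ranks whatever survives. The Candidate fields, the pick function, and the model names and numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    eval_score: float         # 0..1, from the private eval set
    cost_per_1k_calls: float  # USD, all-in
    p95_latency_s: float
    governance_ok: bool       # legal sign-off: residency, retention, training opt-out


def pick(
    candidates: Iterable[Candidate],
    max_cost: float,
    max_latency_s: float,
) -> Optional[Candidate]:
    """Drop anything that fails a gate, then take the best eval score."""
    eligible = [
        c for c in candidates
        if c.governance_ok
        and c.cost_per_1k_calls <= max_cost
        and c.p95_latency_s <= max_latency_s
    ]
    return max(eligible, key=lambda c: c.eval_score, default=None)


# Made-up candidates for one step of one workflow.
candidates = [
    Candidate("model-a", 0.92, 6.0, 1.5, governance_ok=False),
    Candidate("model-b", 0.88, 4.0, 1.2, governance_ok=True),
    Candidate("model-c", 0.80, 0.5, 0.4, governance_ok=True),
]

winner = pick(candidates, max_cost=5.0, max_latency_s=2.0)
print(winner.name if winner else "nothing passes the gates")
```

Here model-a has the best eval score and never reaches the ranking, because it fails governance. Tightening max_cost to 1.0 moves the pick to model-c.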

Pick by task, not by vibe. The model is a step-level decision, not a company-level one.

What each major model family is good at right now

I will speak in operating principles, not benchmark numbers, because benchmark numbers age in weeks. The strengths below are pattern-level. Re-test against your eval set before relying on any of them.

OpenAI's GPT family

GPT is the broad-knowledge generalist. Strong ecosystem tooling, the deepest integration story across the assistant landscape, and the model most third-party tools default to. GPT tends to be the safe pick for tool-calling workflows where the model needs to orchestrate APIs reliably, and the safe pick for general-purpose chat surfaces. The cost picture is competitive across the size tiers.

Anthropic's Claude family

Claude is, in my experience, the strongest at long-form reasoning, instruction following inside a defined system prompt, and writing inside a brand voice. It is the model I reach for first on production marketing prompts and on agentic workflows that require the model to follow a long set of constraints without drifting. Claude Code, Anthropic's coding-agent tool, is a separate production capability worth understanding on its own, which I cover in building production AI workflows with Claude Code.

Google's Gemini family

Gemini is the multimodal native and the deep-Google-stack option. If your workflow is tightly coupled to the Google ecosystem (Workspace, BigQuery, Vertex), or if the workload is image-heavy or video-heavy, Gemini is in the conversation by default. In many workloads it is also the family that handles very long context most generously.

Open-weight models

Llama, Mistral, Qwen, and other open-weight families are the cost play and the data-sovereignty play. You host them yourself, you control the data, and at high volume they can be dramatically cheaper than frontier hosted models. The cost of running them well, including ops staffing and GPU spend, is the trade. For high-volume classification or extraction tasks, open weights are often the right pick. For frontier reasoning, they usually are not.

The orchestration play: the right model for each step

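The pattern I run in production is not "pick a model." It is "design a workflow, decompose into steps, and route each step to the model that is best for it." In practice, that can look like the split described earlier: one model for the writing step, another for the tool-calling step, and a small open model for the classification step.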

Each step is a separate selection decision. Each step has its own eval set. Each step can swap independently when the leaderboard moves. The orchestration is what gets you production-grade performance at a cost the CFO can sign. It is also the pattern I use when applying the V1 Framework to agentic work: decompose first, then route.
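
A step-level routing table is one minimal way to express this. The sketch below is illustrative only: the three model functions are stubs standing in for real vendor calls, and the step names, ROUTES, and run_step are hypothetical.

```python
from typing import Callable, Dict, Mapping

ModelFn = Callable[[str], str]


# Stubs standing in for real vendor calls.
def small_open_model(prompt: str) -> str:
    return f"[small-open] {prompt}"


def tool_model(prompt: str) -> str:
    return f"[tool-caller] {prompt}"


def writer_model(prompt: str) -> str:
    return f"[writer] {prompt}"


# One entry per workflow step. Each entry is its own selection decision.
ROUTES: Dict[str, ModelFn] = {
    "classify": small_open_model,
    "call_tools": tool_model,
    "write": writer_model,
}


def run_step(step: str, prompt: str, routes: Mapping[str, ModelFn] = ROUTES) -> str:
    """Send one step's prompt to whichever model is routed for that step."""
    if step not in routes:
        raise KeyError(f"no model routed for step {step!r}")
    return routes[step](prompt)


# Swapping one step is a one-line change that leaves the others alone.
swapped = {**ROUTES, "write": small_open_model}
print(run_step("write", "draft the reply"))
print(run_step("write", "draft the reply", routes=swapped))
```

The point of the table is that a re-run of one step's eval set changes one line, not the workflow.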

The production lock-in risks

Vendor lock-in on an LLM is more dangerous than vendor lock-in on a database, because the pricing, capability, and policy change faster than your refactoring budget. The lock-in shows up in four places.

Prompt syntax. Each model family rewards different prompt structures. If your prompts are written in a model-native style with no portability layer, swapping models means rewriting prompts. Build prompts as versioned assets with a portability discipline.
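
Treating prompts as versioned assets can be as simple as the sketch below. It is one possible shape, assuming a vendor-neutral base wording plus optional per-family overrides; PromptAsset, the family labels, and the example prompt are all hypothetical.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Dict


@dataclass(frozen=True)
class PromptAsset:
    name: str
    version: str
    base: str  # vendor-neutral wording
    overrides: Dict[str, str] = field(default_factory=dict)  # per-family variants

    def render(self, family: str, **values: str) -> str:
        """Fill the family's variant if one exists, else the neutral base."""
        template = self.overrides.get(family, self.base)
        return Template(template).substitute(**values)


product_copy = PromptAsset(
    name="product_copy",
    version="2.1.0",
    base="Write a product description for $product in a $tone tone.",
    overrides={
        "family_b": "Task: product description\nProduct: $product\nTone: $tone",
    },
)

print(product_copy.render("family_a", product="trail shoes", tone="playful"))
print(product_copy.render("family_b", product="trail shoes", tone="playful"))
```

With this shape, swapping a model means adding or editing one override and bumping the version, and the eval results can be logged against that version.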

Tool-calling format. The schemas, function signatures, and structured-output conventions differ across families. Abstract them.
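
Abstracting the tool format usually means keeping one neutral definition and rendering it per vendor. The sketch below assumes two target shapes, modeled on the tool formats that OpenAI's Chat Completions API and Anthropic's Messages API documented at the time of writing. Treat both shapes as assumptions and check current vendor docs before relying on them. ToolSpec and the two renderer names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class ToolSpec:
    """Vendor-neutral tool definition; parameters is a JSON Schema object."""
    name: str
    description: str
    parameters: Dict[str, Any]


def to_nested_function_style(tool: ToolSpec) -> Dict[str, Any]:
    """Shape with a 'function' wrapper (assumed OpenAI-style)."""
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description,
            "parameters": tool.parameters,
        },
    }


def to_flat_input_schema_style(tool: ToolSpec) -> Dict[str, Any]:
    """Flat shape with an 'input_schema' key (assumed Anthropic-style)."""
    return {
        "name": tool.name,
        "description": tool.description,
        "input_schema": tool.parameters,
    }


lookup_order = ToolSpec(
    name="lookup_order",
    description="Fetch an order by its ID.",
    parameters={
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
)

print(to_nested_function_style(lookup_order))
print(to_flat_input_schema_style(lookup_order))
```

The workflow code only ever sees ToolSpec. When a vendor changes its format, one renderer changes.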

Fine-tuning artifacts. If you fine-tuned on a vendor's platform, those weights are not portable. Fine-tune sparingly and only when capability gap and volume justify it.

Long context conventions. Each family handles long context, citation, and retrieval differently. Your retrieval pipeline should be vendor-neutral or you will rebuild it when you swap.

The abstraction layer pays for itself the first time a vendor raises prices or deprecates a model you depend on. I have lived through both of those quarters on Automatic client engagements. Plan for them.
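
What the abstraction layer buys you in the deprecation case can be sketched as a shared interface plus an ordered fallback. This is a minimal illustration, not a production client: TextModel, StubModel, ModelUnavailable, and complete_with_fallback are hypothetical names, and a real adapter would wrap a vendor SDK call where the stub returns a string.

```python
from typing import Protocol, Sequence


class ModelUnavailable(Exception):
    """Raised by an adapter when its vendor call fails or the model is retired."""


class TextModel(Protocol):
    name: str

    def complete(self, system: str, user: str) -> str: ...


class StubModel:
    """Stand-in adapter; a real one would wrap a vendor SDK call."""

    def __init__(self, name: str, available: bool = True) -> None:
        self.name = name
        self.available = available

    def complete(self, system: str, user: str) -> str:
        if not self.available:
            raise ModelUnavailable(self.name)
        return f"{self.name}: {user}"


def complete_with_fallback(
    models: Sequence[TextModel], system: str, user: str
) -> str:
    """Try each adapter in order and return the first successful answer."""
    for model in models:
        try:
            return model.complete(system, user)
        except ModelUnavailable:
            continue
    raise ModelUnavailable("no configured model could serve the request")


primary = StubModel("primary", available=False)  # simulate a retired model
backup = StubModel("backup")
print(complete_with_fallback([primary, backup], "Be brief.", "hi"))
```

The calling code never names a vendor, so a price change or a retirement becomes a change to the list, not a rewrite.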

The eval-set-first approach

The single highest-leverage thing you can do before picking a model is build your private eval set. Not after. Before.

An eval set is 50 to 200 real prompts pulled from your actual workflow, with reference answers graded by your team on a rubric you define. The reference answers are the bar. The rubric is the scoring system. The set is private (never shared with the vendor), so it cannot leak into anyone's training data and you can reuse it across model launches.

With an eval set in hand, model selection becomes a tractable engineering problem instead of a vibes contest. You run each candidate model against the set, score them on the rubric, weight by cost and latency, and pick. When a new model lands next quarter, you re-run the set and decide whether to swap. The eval set is the asset that survives every model change.
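
The run-and-score loop can be sketched in a few lines. Everything here is a toy under stated assumptions: EvalCase, keyword_grader, and run_eval are hypothetical names, the grader is a crude word-overlap check standing in for a real rubric, and the two "models" are lambdas rather than vendor calls. Weighting by cost and latency would sit on top of the quality score this produces.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    reference: str  # the team's graded reference answer


ModelFn = Callable[[str], str]
Grader = Callable[[str, str], float]  # (output, reference) -> score in 0..1


def keyword_grader(output: str, reference: str) -> float:
    """Toy rubric: share of reference words that appear in the output."""
    ref = set(reference.lower().split())
    out = set(output.lower().split())
    return len(ref & out) / len(ref) if ref else 0.0


def run_eval(model: ModelFn, cases: Sequence[EvalCase], grader: Grader) -> float:
    """Average rubric score for one model across the whole eval set."""
    return mean(grader(model(case.prompt), case.reference) for case in cases)


cases = [
    EvalCase("Write a shipping banner.", "free shipping today"),
    EvalCase("Write a returns banner.", "easy returns always"),
]

candidates = {
    "model-a": lambda prompt: "Free shipping ends today" if "shipping" in prompt else "Returns accepted",
    "model-b": lambda prompt: "Great deals inside",
}

for name, model in candidates.items():
    print(f"{name}: {run_eval(model, cases, keyword_grader):.2f}")
```

When a new model lands, it becomes one more entry in candidates and the same cases run again unchanged.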

The teams that build the eval set first ship in weeks. The teams that pick a model first and try to make it work spend a quarter convincing themselves the model is fine because they already chose it. This is why the AI transformation playbook insists on diagnostics before tools.

The eval set is the moat. The model is a swappable line item underneath it.

The bottom line

Pick by task, not by vendor. Pick by eval, not by benchmark. Pick by total cost, not by headline price. And design the system so you can swap, because next quarter's leaderboard will not look like this quarter's.

The right LLM for production is almost never one LLM. It is a small orchestration of two or three model families, each used for what it is best at, all behind an abstraction layer that does not care which vendor is winning today. Build the workflow first. Pick the models second. Re-pick them quarterly.


FAQ

What is the best LLM in 2026?

There is no single best LLM in 2026. The best LLM is the one that wins your private eval set on the specific task you are routing to it, at a cost and latency you can defend. Most production stacks use two or three model families orchestrated by step.

Is Claude or GPT better?

Both are excellent and the winner depends on the workload. Claude tends to be stronger at long-form reasoning, instruction following, and writing inside a defined brand voice. GPT tends to be stronger at broad knowledge retrieval and ecosystem tooling. Run them both against your eval set and let the rubric decide.

How do I evaluate LLMs for production?

Build a private eval set of 50 to 200 real prompts from your actual workflow, with graded reference answers. Run each model against the set, score on a rubric you define, and weight by cost and latency. Public benchmarks tell you almost nothing about your specific workload.

Should I lock into one LLM vendor?

No. Build an abstraction layer that lets you swap model families per step. Lock-in is a strategic risk because pricing, capability, and policy change quarterly. The cost of portability is small compared to the cost of being pinned to a vendor whose pricing doubled overnight.

How important is cost when picking an LLM?

It depends on volume. At low volume, capability wins. At high volume, cost per inference dominates. A 5x cost difference on a million daily calls is a budget conversation; on a thousand daily calls, it is a rounding error. Always model cost at projected scale, not current scale.

Can I switch LLMs later?

Yes, if you built portability in from the start. That means an abstraction layer, prompts stored as versioned assets, an eval set that lets you re-test, and contracts written for portability. Without those, switching is a rebuild rather than a swap.

About the author

Nicholas Harris is an AI-native operator at the intersection of generative AI and consumer growth. He is President at CreativeOS, an AI-powered SaaS platform serving 25,000+ brands, and Founder at Automatic, an AI consultancy. He has delivered three exits and built consumer-brand operations from SMB through nine-figure scale, including 110.6% e-commerce revenue growth at NASM and an 11x EBITDA exit at SplitTesting.com.

He is currently open to VP AI, AI Transformation, Head of Growth, and Fractional CTO roles at consumer-facing companies. Based in Mesa, AZ. Remote or Phoenix metro preferred.
