The Production Scenario
Your LLM-powered product has hit product-market fit. Three months ago you had 500 users. Today you have 50,000. Congratulations. The OpenAI invoice for last month was $87,000. Your CFO schedules a meeting.
The meeting goes about how you expect. The CFO pulls up a spreadsheet. At current growth, the inference bill will be $400,000/month in six months and $2M/month in a year. The product grosses $800,000/month in revenue.
You start auditing how your system uses LLMs. What you find is embarrassing in retrospect. You are sending GPT-4 requests to answer questions like "What are your business hours?" - information that never changes. You are sending full 8,000-token conversation histories to generate three-sentence replies. You are running the same "summarize this product description" prompt thousands of times per day on the same 200 products. You are paying for the model's reasoning capabilities on tasks that a much smaller model handles perfectly well.
None of these decisions were made maliciously. They were made by developers moving fast, reaching for the most capable model, defaulting to full context windows, not thinking about cost at all because cost was not the constraint yet. Now it is.
This lesson is a systematic playbook for identifying and eliminating that waste. The strategies covered can reduce inference costs by 70–95% in typical production systems - without any quality regression that users would notice.
Why This Exists: Cost is the Constraint That Scales With You
In the LLM era, infrastructure cost has a unique property: it scales directly with usage in a way most software systems avoid. Traditional SaaS: once your servers are running, serving an additional request costs close to nothing. LLM SaaS: every token costs money. More users = more tokens = linearly more cost.
This makes cost optimization a first-class engineering concern, not an afterthought. Teams that treat it as an afterthought are regularly surprised when their inference bill exceeds their revenue.
Cost Structure
LLM inference cost breaks down as:
$$\text{total cost} = \text{GPU hours consumed} \times \text{cost per GPU hour}$$

$$\text{GPU hours} = \frac{\text{total tokens generated}}{\text{tokens per second per GPU} \times 3600}$$

$$\text{cost per 1M tokens} = \frac{\text{GPU cost per hour} \times 10^6}{\text{tokens per second per GPU} \times 3600}$$
Example calculation for a self-hosted LLaMA-3 70B on an A100 80GB ($3/hour):
throughput ≈ 800 tokens/second (with vLLM, moderate load)

$$\text{cost per 1M tokens} = \frac{\$3 \times 10^6}{800 \times 3600} = \frac{\$3{,}000{,}000}{2{,}880{,}000} \approx \$1.04 \text{ per 1M tokens}$$
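The worked example is easy to sanity-check in a few lines of Python. This is just the formula above turned into a function, using the same assumptions as the example (a $3/hour GPU at ~800 tokens/second):

```python
def self_hosted_cost_per_1m_tokens(gpu_cost_per_hour: float,
                                   tokens_per_second: float) -> float:
    """Dollar cost to generate 1M tokens on one self-hosted GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Same assumptions as the worked example: $3/hour A100, ~800 tokens/second with vLLM.
print(round(self_hosted_cost_per_1m_tokens(3.0, 800), 2))  # ~1.04
```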
For comparison: GPT-4o costs $5/1M input tokens + $20/1M output tokens as of 2025 (see the pricing table below). A well-run self-hosted LLaMA-3 70B costs ~$1–2/1M tokens. The gap is real - but self-hosting has operational overhead and quality trade-offs.
For API-based LLMs, the cost structure is even simpler: you pay per token, per call. Input tokens (your prompt) cost less than output tokens (the model's generation). This asymmetry matters: strategies that reduce output tokens have outsized impact.
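A minimal per-request cost estimator makes this concrete. The prices below are illustrative placeholders, not authoritative; substitute your provider's current rate card:

```python
# Illustrative prices in dollars per 1M tokens; substitute your provider's
# current rate card (these are placeholders, not authoritative pricing).
PRICES = {
    "gpt-4o":      {"input": 5.00, "output": 20.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call: input and output tokens are priced separately."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# An 8,000-token conversation history producing a 150-token reply: output tokens
# cost more per token, but here the bloated prompt dominates the spend - which is
# why trimming context and shortening outputs both matter.
print(f"{request_cost('gpt-4o', 8000, 150):.4f}")       # ~0.0430
print(f"{request_cost('gpt-4o-mini', 8000, 150):.6f}")  # ~0.001290
```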
The Optimization Stack: Strategies Ranked by Impact
The following strategies are ranked roughly by impact relative to effort. Start from the top.
Strategy 1: Model Selection
The highest-leverage decision is choosing the right model for each task. This sounds obvious, but most teams default to the most capable model for everything - often because it is easier than building a routing system.
Cost comparison for common models (2025 approximate pricing):
| Model | Input cost / 1M tokens | Output cost / 1M tokens | Relative cost (vs. Llama-3 8B) |
|---|---|---|---|
| GPT-4o | $5 | $20 | 100× |
| GPT-4o mini | $0.15 | $0.60 | 3× |
| Claude 3.5 Sonnet | $3 | $15 | 75× |
| Claude 3 Haiku | $0.25 | $1.25 | 5× |
| Mistral Small | $0.20 | $0.60 | 3× |
| Llama-3 8B (self-hosted) | ~$0.10 | ~$0.10 | 1× |
| Llama-3 70B (self-hosted) | ~$0.80 | ~$0.80 | 8× |
A task that GPT-4o handles correctly for $20/1M output tokens can often be handled with identical quality by GPT-4o mini for $0.60/1M output tokens - a 33× cost reduction.
Task difficulty categories:
| Task type | Recommended model tier |
|---|---|
| FAQ lookup, slot extraction, simple classification | Smallest model (Haiku, mini, 7B) |
| Email drafting, code snippet generation | Mid-tier (Sonnet, 8B–13B) |
| Complex reasoning, multi-step analysis | Large model (GPT-4o, Claude Sonnet, 70B) |
| Creative writing, research | Large model as needed |
| Structured data extraction | Small model with good prompting |
| Long document summarization | Mid-tier with chunking |
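A routing layer does not need to be sophisticated to capture most of the savings. A minimal sketch, assuming you can tag each request with a task type at the call site (the model-tier names are placeholders for whatever you actually deploy):

```python
# Map each task type to the cheapest model tier that handles it acceptably.
# Tier names are placeholders - substitute the models you actually deploy.
MODEL_BY_TASK = {
    "faq_lookup":        "small",   # e.g. Haiku / GPT-4o mini / 7B
    "slot_extraction":   "small",
    "classification":    "small",
    "email_draft":       "mid",     # e.g. mid-tier hosted model / 8B-13B
    "code_snippet":      "mid",
    "summarization":     "mid",
    "complex_reasoning": "large",   # e.g. GPT-4o / Claude Sonnet / 70B
}

def route(task_type: str) -> str:
    """Pick a model tier for the task; fall back to the large model when unsure."""
    return MODEL_BY_TASK.get(task_type, "large")
```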
Run an audit: take 500 random production requests, re-run them through a smaller model, and have humans rate whether the smaller model's output is acceptable. You will typically find that 60–80% of requests could have been routed to a cheaper model.
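One way to run that audit is to replay a sample of logged requests against the cheaper model and put the two answers side by side for human review. A sketch, assuming your logs expose each request as a dict with the original prompt and response, and that `call_model` is a stand-in for your own client:

```python
import csv
import random

def build_audit_sheet(logged_requests, call_model, small_model,
                      sample_size=500, out_path="model_audit.csv"):
    """Replay a random sample of logged prompts through the cheaper model and
    write prompt / current answer / cheaper answer rows for human rating."""
    sample = random.sample(logged_requests, min(sample_size, len(logged_requests)))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", "current_answer", "small_model_answer", "acceptable (y/n)"])
        for req in sample:
            cheaper_answer = call_model(small_model, req["prompt"])
            writer.writerow([req["prompt"], req["response"], cheaper_answer, ""])
```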
Strategy 2: Quantization
Quantization lowers the numerical precision of a model's weights to cut memory usage and speed up inference. For example, converting a model from 32-bit floating point to 8-bit integers reduces weight memory by 4× and can speed up inference by 2–3×, with minimal impact on accuracy.
Common quantization techniques include:
- Post-training quantization (PTQ): Applied after training, without requiring retraining. Simple to implement but may have a larger accuracy drop.
- Quantization-aware training (QAT): Simulates quantization during training, allowing the model to adapt to lower precision. Better accuracy but more complex.
For most production systems, PTQ is a good starting point. Tools like llama.cpp, AutoGPTQ, and vLLM support easy quantization with minimal code changes.
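As a concrete starting point, one common PTQ path is loading the weights in 4-bit through the Hugging Face transformers + bitsandbytes integration (a different toolchain than llama.cpp or AutoGPTQ, but it shows how little code the change requires). A minimal sketch; the model name is a placeholder and exact arguments vary with library versions:

```python
# Post-training quantization applied at load time via bitsandbytes.
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,           # store weights in 4-bit instead of 16/32-bit
    bnb_4bit_quant_type="nf4",   # NormalFloat4, the usual default for LLM weights
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",           # place layers across available GPUs automatically
)
```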