Inference Expense Optimization | EngineersOfAI — Technical Training for AI Practitioners

News · 2026-05-12

The Production Scenario

Your LLM-powered product has successfully achieved product-market fit. Three months ago, the user base stood at 500; today, it has surged to 50,000. Congratulations are in order. However, last month's OpenAI invoice arrived at $87,000, prompting your CFO to schedule an urgent meeting.

The meeting unfolds as expected. The CFO presents a spreadsheet indicating that, at the current growth trajectory, the inference bill will swell to $400,000/month within six months and hit $2M/month by the end of the year. The product currently generates $800,000/month in revenue. While LLM costs now consume roughly 10% of revenue, they are on track to surpass 50%. "This is no longer a software product," the CFO declares. "It is a GPU rental business with some software layered on top."

You initiate an audit of your system's LLM usage. The findings are, in hindsight, rather embarrassing. You are dispatching GPT-4 requests to answer static queries like "What are your business hours?" — information that rarely changes. You are transmitting full 8,000-token conversation histories just to generate three-sentence replies. You are executing the same "summarize this product description" prompt thousands of times daily for a mere 200 products. You are essentially paying for top-tier reasoning capabilities on tasks that a much smaller, cheaper model could handle adequately.

None of these decisions were driven by malice. They were made by developers moving fast, opting for the most capable model available, defaulting to maximum context windows, and ignoring costs because it wasn't yet a constraint. Now, it is.

This guide serves as a systematic playbook for identifying and eliminating that waste. The strategies outlined here can slash inference costs by 70–95% in typical production environments — all without any noticeable drop in quality for your users.


Why This Exists: Cost is the Constraint That Scales With You

In the era of LLMs, infrastructure costs have a characteristic most traditional software avoids: they scale linearly with usage. In standard SaaS, a fixed fleet of servers handles millions of requests once it is up and running, so the marginal cost per request approaches zero. In LLM SaaS, every token incurs a cost. More users mean more tokens, which translates directly into higher bills.

This reality makes cost optimization a primary engineering concern, not an afterthought. Teams that treat it as an afterthought are frequently shocked when their inference bill surpasses their revenue.

Cost Structure

LLM inference costs can be broken down as follows:

total cost = GPU hours consumed × cost per GPU hour

GPU hours = total tokens generated / (tokens per second per GPU × 3600)

cost per 1M tokens = (GPU cost per hour / (tokens per second per GPU × 3600)) × 10^6

Here is an example calculation for a self-hosted LLaMA-3 70B running on an A100 80GB (priced at $3/hour):

throughput ≈ 800 tokens/second (using vLLM under moderate load)

cost per 1M tokens = ($3 / (800 × 3600)) × 10^6 ≈ $1.04/1M tokens
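The same arithmetic as a small helper, reproducing the example above (a sketch only; real throughput varies with batch size and sequence length):

```python
def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost per 1M generated tokens for a self-hosted model on one GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return (gpu_cost_per_hour / tokens_per_hour) * 1_000_000

# A100 80GB at $3/hour, ~800 tok/s under vLLM at moderate load
print(round(cost_per_million_tokens(3.0, 800), 2))  # → 1.04
```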

For comparison: as of 2025, GPT-4o list pricing is roughly $5/1M input tokens + $20/1M output tokens. An efficiently managed self-hosted LLaMA-3 70B runs about $1–2/1M tokens. The price difference is substantial, though self-hosting introduces operational complexity and potential quality trade-offs.

For API-based LLMs, the pricing structure is more straightforward: you pay per token, per call. Input tokens (your prompt) are cheaper than output tokens (the model's generation). This asymmetry is crucial: strategies focused on reducing output tokens will have a disproportionately large impact on savings.
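The asymmetry is easy to make concrete with a per-request cost estimator. A sketch; the prices are the approximate 2025 figures used in this guide and should be treated as illustrative, not current list prices:

```python
# (input $/1M tokens, output $/1M tokens) — approximate 2025 figures, verify current rates
PRICING = {
    "gpt-4o": (5.00, 20.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call given token counts and per-model pricing."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# An 8,000-token conversation history producing a 150-token reply:
# each output token costs 4x an input token here, but the bloated input dominates
print(round(request_cost("gpt-4o", 8000, 150), 4))  # → 0.043
```

Note the second lesson hiding in this example: when you ship full conversation histories, the input side can dwarf the output side anyway, so trimming context and trimming generations are complementary levers.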


The Optimization Stack: Strategies Ranked by Impact

The following strategies are ranked roughly by their effort-to-impact ratio. It is best to start from the top of the list.


Strategy 1: Model Selection

The most impactful decision is selecting the appropriate model for each specific task. While this seems obvious, many teams default to the most powerful model for everything — often because it is simpler than building routing logic.

Cost comparison for common models (2025 approximate pricing):

| Model | Input cost / 1M tokens | Output cost / 1M tokens | Relative cost |
|---|---|---|---|
| GPT-4o | $5 | $20 | 100× |
| GPT-4o mini | $0.15 | $0.60 | — |
| Claude 3.5 Sonnet | $3 | $15 | 75× |
| Claude 3 Haiku | $0.25 | $1.25 | — |
| Mistral Small | $0.20 | $0.60 | — |
| Llama-3 8B (self-hosted) | ~$0.10 | ~$0.10 | — |
| Llama-3 70B (self-hosted) | ~$0.80 | ~$0.80 | — |

A task handled correctly by GPT-4o for $20/1M output tokens can frequently be managed with identical quality by GPT-4o mini for $0.60/1M output tokens, a 33× cost reduction.

Task difficulty categories:

| Task type | Recommended model tier |
|---|---|
| FAQ lookup, slot extraction, simple classification | Smallest model (Haiku, mini, 7B) |
| Email drafting, code snippet generation | Mid-tier (Sonnet, 8B–13B) |
| Complex reasoning, multi-step analysis | Large model (GPT-4o, Claude Sonnet, 70B) |
| Creative writing, research | Large model as needed |
| Structured data extraction | Small model with good prompting |
| Long document summarization | Mid-tier with chunking |
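A minimal sketch of difficulty-based routing along the lines of the table above. The keyword classifier is a deliberately naive placeholder and the model names are illustrative; production routers typically use a small classifier model or request metadata instead:

```python
# Task category → illustrative model tier (names are assumptions, not a recommendation)
ROUTES = {
    "faq": "claude-3-haiku",
    "extraction": "gpt-4o-mini",
    "drafting": "claude-3-5-sonnet",
    "reasoning": "gpt-4o",
}

def classify(prompt: str) -> str:
    """Naive keyword heuristic; stand-in for a real task classifier."""
    p = prompt.lower()
    if any(k in p for k in ("business hours", "faq", "opening")):
        return "faq"
    if any(k in p for k in ("extract", "parse", "json")):
        return "extraction"
    if any(k in p for k in ("draft", "write an email")):
        return "drafting"
    return "reasoning"  # when unsure, fall through to the capable (expensive) tier

def route(prompt: str) -> str:
    """Pick a model for a request based on its inferred task category."""
    return ROUTES[classify(prompt)]

print(route("What are your business hours?"))  # → claude-3-haiku
```

The key design choice is the fallback: misrouting a hard task to a cheap model hurts quality, while misrouting an easy task to an expensive model only hurts cost, so ambiguous requests should default upward.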

Conduct an audit: sample 500 random production requests and have humans evaluate whether a smaller model's output would have sufficed. Teams typically discover that 60–80% of requests could have been routed to a more cost-effective model.


Strategy 2: Quantization
