The Production Scenario
Your LLM-powered product has hit product-market fit. Three months ago you had 500 users. Today you have 50,000. Congratulations. The OpenAI invoice for last month was $87,000. Your CFO schedules a meeting.
The meeting goes about how you expect. The CFO pulls up a spreadsheet. At current growth, the inference bill will be $400,000/month in six months and $2M/month in a year. The product grosses $800,000/month in revenue.
You start auditing how your system uses LLMs. What you find is embarrassing in retrospect. You are sending GPT-4 requests to answer questions like "What are your business hours?" - information that never changes. You are sending full 8,000-token conversation histories to generate three-sentence replies. You are running the same "summarize this product description" prompt thousands of times per day on the same 200 products. You are paying for the model's reasoning capabilities on tasks that a much smaller model handles perfectly well.
None of these decisions were made maliciously. They were made by developers moving fast, reaching for the most capable model, defaulting to full context windows, not thinking about cost at all because cost was not the constraint yet. Now it is.
This lesson is a systematic playbook for identifying and eliminating that waste. The strategies covered can reduce inference costs by 70–95% in typical production systems - without any quality regression that users would notice.
Why This Exists: Cost is the Constraint That Scales With You
In the LLM era, infrastructure cost has a unique property: it scales directly with usage in a way most software systems avoid. Traditional SaaS: once your servers are running, serving an additional request costs close to nothing. LLM SaaS: every token costs money. More users = more tokens = linearly more cost.
This makes cost optimization a first-class engineering concern, not an afterthought. Teams that treat it as an afterthought are regularly surprised when their inference bill exceeds their revenue.
Cost Structure
LLM inference cost breaks down as:
$$\text{total cost} = \text{GPU hours consumed} \times \text{cost per GPU hour}$$

$$\text{GPU hours} = \frac{\text{total tokens generated}}{\text{tokens per second per GPU} \times 3600}$$

$$\text{cost per 1M tokens} = \frac{\text{GPU cost per hour} \times 10^6}{\text{tokens per second per GPU} \times 3600}$$
Example calculation for a self-hosted LLaMA-3 70B on an A100 80GB ($3/hour):
throughput ≈ 800 tokens/second (with vLLM, moderate load)

$$\text{cost per 1M tokens} = \frac{\$3 \times 10^6}{800 \times 3600} = \frac{\$3{,}000{,}000}{2{,}880{,}000} \approx \$1.04 \text{ per 1M tokens}$$
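The worked example is easy to sanity-check in a few lines of Python. This is just the formula above turned into a function, using the same assumptions as the example (a $3/hour GPU at ~800 tokens/second):

```python
def self_hosted_cost_per_1m_tokens(gpu_cost_per_hour: float,
                                   tokens_per_second: float) -> float:
    """Dollar cost to generate 1M tokens on one self-hosted GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Same assumptions as the worked example: $3/hour A100, ~800 tokens/second with vLLM.
print(round(self_hosted_cost_per_1m_tokens(3.0, 800), 2))  # ~1.04
```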
For comparison: GPT-4o costs $5/1M input tokens + $20/1M output tokens as of 2025 (see the pricing table below). A well-run self-hosted LLaMA-3 70B costs ~$1–2/1M tokens. The gap is real - but self-hosting has operational overhead and quality trade-offs.
For API-based LLMs, the cost structure is even simpler: you pay per token, per call. Input tokens (your prompt) cost less than output tokens (the model's generation). This asymmetry matters: strategies that reduce output tokens have outsized impact.
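A minimal per-request cost estimator makes this concrete. The prices below are illustrative placeholders, not authoritative; substitute your provider's current rate card:

```python
# Illustrative prices in dollars per 1M tokens; substitute your provider's
# current rate card (these are placeholders, not authoritative pricing).
PRICES = {
    "gpt-4o":      {"input": 5.00, "output": 20.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one API call: input and output tokens are priced separately."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# An 8,000-token conversation history producing a 150-token reply: output tokens
# cost more per token, but here the bloated prompt dominates the spend - which is
# why trimming context and shortening outputs both matter.
print(f"{request_cost('gpt-4o', 8000, 150):.4f}")       # ~0.0430
print(f"{request_cost('gpt-4o-mini', 8000, 150):.6f}")  # ~0.001290
```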
The Optimization Stack: Strategies Ranked by Impact
The following strategies are ranked roughly by impact relative to effort. Start from the top.
Strategy 1: Model Selection
The highest-leverage decision is choosing the right model for each task. This sounds obvious, but most teams default to the most capable model for everything - often because it is easier than building a routing system.
Cost comparison for common models (2025 approximate pricing):
| Model | Input cost / 1M tokens | Output cost / 1M tokens | Relative cost (vs. Llama-3 8B) |
|---|---|---|---|
| GPT-4o | $5 | $20 | 100× |
| GPT-4o mini | $0.15 | $0.60 | 3× |
| Claude 3.5 Sonnet | $3 | $15 | 75× |
| Claude 3 Haiku | $0.25 | $1.25 | 5× |
| Mistral Small | $0.20 | $0.60 | 3× |
| Llama-3 8B (self-hosted) | ~$0.10 | ~$0.10 | 1× |
| Llama-3 70B (self-hosted) | ~$0.80 | ~$0.80 | 8× |
A task that GPT-4o handles correctly for $20/1M output tokens can often be handled with identical quality by GPT-4o mini for $0.60/1M output tokens - a 33× cost reduction.
Task difficulty categories:
| Task type | Recommended model tier |
|---|---|
| FAQ lookup, slot extraction, simple classification | Smallest model (Haiku, mini, 7B) |
| Email drafting, code snippet generation | Mid-tier (Sonnet, 8B–13B) |
| Complex reasoning, multi-step analysis | Large model (GPT-4o, Claude Sonnet, 70B) |
| Creative writing, research | Large model as needed |
| Structured data extraction | Small model with good prompting |
| Long document summarization | Mid-tier with chunking |
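A routing layer does not need to be sophisticated to capture most of the savings. A minimal sketch, assuming you can tag each request with a task type at the call site (the model-tier names are placeholders for whatever you actually deploy):

```python
# Map each task type to the cheapest model tier that handles it acceptably.
# Tier names are placeholders - substitute the models you actually deploy.
MODEL_BY_TASK = {
    "faq_lookup":        "small",   # e.g. Haiku / GPT-4o mini / 7B
    "slot_extraction":   "small",
    "classification":    "small",
    "email_draft":       "mid",     # e.g. mid-tier hosted model / 8B-13B
    "code_snippet":      "mid",
    "summarization":     "mid",
    "complex_reasoning": "large",   # e.g. GPT-4o / Claude Sonnet / 70B
}

def route(task_type: str) -> str:
    """Pick a model tier for the task; fall back to the large model when unsure."""
    return MODEL_BY_TASK.get(task_type, "large")
```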
Run an audit: take 500 random production requests, re-run them through a smaller model, and have humans rate whether the smaller model's output is acceptable. You will typically find that 60–80% of requests could have been routed to a cheaper model.
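One way to run that audit is to replay a sample of logged requests against the cheaper model and put the two answers side by side for human review. A sketch, assuming your logs expose each request as a dict with the original prompt and response, and that `call_model` is a stand-in for your own client:

```python
import csv
import random

def build_audit_sheet(logged_requests, call_model, small_model,
                      sample_size=500, out_path="model_audit.csv"):
    """Replay a random sample of logged prompts through the cheaper model and
    write prompt / current answer / cheaper answer rows for human rating."""
    sample = random.sample(logged_requests, min(sample_size, len(logged_requests)))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prompt", "current_answer", "small_model_answer", "acceptable (y/n)"])
        for req in sample:
            cheaper_answer = call_model(small_model, req["prompt"])
            writer.writerow([req["prompt"], req["response"], cheaper_answer, ""])
```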
Strategy 2: Quantization
Quantization lowers the numerical precision of a model's weights to cut memory usage and speed up inference. For example, converting a model from 32-bit floating point to 8-bit integers reduces weight memory by 4× and can speed up inference by 2–3×, with minimal impact on accuracy.
Common quantization techniques include:
- Post-training quantization (PTQ): Applied after training, without requiring retraining. Simple to implement but may have a larger accuracy drop.
- Quantization-aware training (QAT): Simulates quantization during training, allowing the model to adapt to lower precision. Better accuracy but more complex.
For most production systems, PTQ is a good starting point. Tools like llama.cpp, AutoGPTQ, and vLLM support easy quantization with minimal code changes.
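As a concrete starting point, one common PTQ path is loading the weights in 4-bit through the Hugging Face transformers + bitsandbytes integration (a different toolchain than llama.cpp or AutoGPTQ, but it shows how little code the change requires). A minimal sketch; the model name is a placeholder and exact arguments vary with library versions:

```python
# Post-training quantization applied at load time via bitsandbytes.
# Requires: pip install transformers accelerate bitsandbytes
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder; any causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,           # store weights in 4-bit instead of 16/32-bit
    bnb_4bit_quant_type="nf4",   # NormalFloat4, the usual default for LLM weights
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",           # place layers across available GPUs automatically
)
```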