Inference Cost Reduction | Technical Training for AI Engineers

2026-05-11

The Production Scenario

Your LLM-powered product has hit product-market fit. Three months ago you had 500 users. Today you have 50,000. Congratulations. The OpenAI invoice for last month was $87,000. Your CFO schedules a meeting.

The meeting goes about how you expect. The CFO pulls up a spreadsheet. At current growth, the inference bill will be $400,000/month in six months and $2M/month in a year. The product grosses $800,000/month in revenue. The LLM cost alone is 10% of revenue today, on track to exceed 50%. "This is not a product," the CFO says. "It is a $2M/month GPU rental business that happens to have software on top."

You start auditing how your system uses LLMs. What you find is embarrassing in retrospect. You are sending GPT-4 requests to answer questions like "What are your business hours?" - information that never changes. You are sending full 8,000-token conversation histories to generate three-sentence replies. You are running the same "summarize this product description" prompt thousands of times per day on the same 200 products. You are paying for the model's reasoning capabilities on tasks that a much smaller model handles perfectly well.

None of these decisions were made maliciously. They were made by developers moving fast, reaching for the most capable model, defaulting to full context windows, not thinking about cost at all because cost was not the constraint yet. Now it is.

This lesson is a systematic playbook for identifying and eliminating that waste. The strategies covered can reduce inference costs by 70–95% in typical production systems - without any quality regression that users would notice.


Why This Exists: Cost is the Constraint That Scales With You

In the LLM era, infrastructure cost has a unique property: it scales directly with usage in a way that most software systems avoid. Traditional SaaS: you pay for servers that serve millions of requests once they are running. LLM SaaS: every token costs money. More users = more tokens = linearly more cost.

This makes cost optimization a first-class engineering concern, not an afterthought. Teams that treat it as an afterthought are regularly surprised when their inference bill exceeds their revenue.

Cost Structure

LLM inference cost breaks down as:

$$\text{total cost} = \text{GPU hours consumed} \times \text{cost per GPU hour}$$

$$\text{GPU hours} = \frac{\text{total tokens generated}}{\text{tokens per second per GPU} \times 3600}$$

$$\text{cost per 1M tokens} = \frac{\text{GPU cost per hour}}{\text{tokens per second per GPU} \times 3600} \times 10^6$$

Example calculation for a self-hosted LLaMA-3 70B on an A100 80GB ($3/hour):

$$\text{throughput} \approx 800 \text{ tokens/second (with vLLM, moderate load)}$$

$$\text{cost per 1M tokens} = \frac{\$3}{800 \times 3600} \times 10^6 = \frac{\$3}{2{,}880{,}000} \times 10^6 \approx \$1.04/\text{1M tokens}$$

For comparison: GPT-4o costs roughly $5/1M input tokens and $20/1M output tokens as of 2025 (see the pricing table in Strategy 1). A well-run self-hosted LLaMA-3 70B costs ~$1–2/1M tokens. The gap is real - but self-hosting has operational overhead and quality trade-offs.

For API-based LLMs, the cost structure is even simpler: you pay per token, per call. Input tokens (your prompt) cost less than output tokens (the model's generation). This asymmetry matters: strategies that reduce output tokens have outsized impact.
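
To make the formulas above concrete, here is a minimal cost-estimation sketch in Python. The function names are illustrative, and the prices and throughput figures are the example numbers from this section, not authoritative quotes - substitute your own.

```python
# Minimal cost-estimation helpers using the formulas above.
# All prices and throughput numbers are illustrative.

def self_hosted_cost_per_1m_tokens(gpu_cost_per_hour: float,
                                   tokens_per_second: float) -> float:
    """Cost per 1M tokens for a self-hosted model, per GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

def api_request_cost(input_tokens: int, output_tokens: int,
                     input_price_per_1m: float,
                     output_price_per_1m: float) -> float:
    """Cost of a single API call. Output tokens cost more per token,
    so trimming generations has outsized impact."""
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000

# A100 at $3/hour, ~800 tokens/second -> ~$1.04 per 1M tokens
print(self_hosted_cost_per_1m_tokens(3.0, 800))

# An 8,000-token prompt producing a 150-token reply at $5/$20 per 1M
print(api_request_cost(8_000, 150, 5.0, 20.0))
```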


The Optimization Stack: Strategies Ranked by Impact

The following strategies are ranked roughly by impact relative to effort. Start from the top.


Strategy 1: Model Selection

The highest-leverage decision is choosing the right model for each task. This sounds obvious, but most teams default to the most capable model for everything - often because it is easier than building a routing system.

Cost comparison for common models (2025 approximate pricing):

Model                      Input cost / 1M tokens  Output cost / 1M tokens  Relative cost
GPT-4o                     $5                      $20                      100×
GPT-4o mini                $0.15                   $0.60                    3×
Claude 3.5 Sonnet          $3                      $15                      75×
Claude 3 Haiku             $0.25                   $1.25                    ~6×
Mistral Small              $0.20                   $0.60                    3×
Llama-3 8B (self-hosted)   ~$0.10                  ~$0.10                   ~0.5×
Llama-3 70B (self-hosted)  ~$0.80                  ~$0.80                   4×

A task that GPT-4o handles correctly for $20/1M output tokens can often be handled with identical quality by GPT-4o mini for $0.60/1M output tokens - a 33× cost reduction.

Task difficulty categories:

Task type                                           Recommended model tier
FAQ lookup, slot extraction, simple classification  Smallest model (Haiku, mini, 7B)
Email drafting, code snippet generation             Mid-tier (Sonnet, 8B–13B)
Complex reasoning, multi-step analysis              Large model (GPT-4o, Claude Sonnet, 70B)
Creative writing, research                          Large model as needed
Structured data extraction                          Small model with good prompting
Long document summarization                         Mid-tier with chunking
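
One way to operationalize this table is a lightweight router that classifies each request into a difficulty tier and dispatches it to the cheapest adequate model. The sketch below is illustrative: classify_tier is a toy keyword heuristic (production routers typically use a small classifier model), and the tier-to-model mapping and call_llm stub are assumptions you would replace with your own.

```python
# Illustrative tiered router. The keyword heuristic, model names, and
# call_llm stub are placeholders - swap in a trained classifier and
# your provider's API client.

TIER_TO_MODEL = {
    "small": "gpt-4o-mini",        # FAQ lookup, extraction, classification
    "mid":   "claude-3-5-sonnet",  # drafting, code snippets
    "large": "gpt-4o",             # multi-step reasoning, analysis
}

def classify_tier(request: str) -> str:
    """Toy heuristic - replace with a small classifier model."""
    text = request.lower()
    if any(k in text for k in ("business hours", "classify", "extract")):
        return "small"
    if any(k in text for k in ("analyze", "plan", "step by step")):
        return "large"
    return "mid"

def call_llm(model: str, prompt: str) -> str:
    """Stub - wire this to your provider's client."""
    return f"[{model}] response to: {prompt!r}"

def route(request: str) -> str:
    tier = classify_tier(request)
    return call_llm(TIER_TO_MODEL[tier], request)

print(route("What are your business hours?"))  # routed to the small tier
```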

Run an audit: take 500 random production requests, have humans rate whether a smaller model's output was acceptable. You will typically find 60–80% of requests could have been routed to a cheaper model.
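
In practice the audit can be a short script: replay the sampled requests through the cheaper candidate model and queue (expensive, cheap) output pairs for human rating. A minimal sketch, assuming your logged requests live in a JSONL file with prompt and response fields and using the official OpenAI Python client; the file names and model choice are assumptions.

```python
import json
import random

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Assumed log format: one JSON object per line with the original
# "prompt" and the "response" produced by the expensive model.
with open("production_requests.jsonl") as f:
    requests = [json.loads(line) for line in f]

sample = random.sample(requests, k=min(500, len(requests)))

pairs = []
for item in sample:
    candidate = client.chat.completions.create(
        model="gpt-4o-mini",  # the cheaper model under evaluation
        messages=[{"role": "user", "content": item["prompt"]}],
    )
    pairs.append({
        "prompt": item["prompt"],
        "expensive_output": item["response"],
        "cheap_output": candidate.choices[0].message.content,
    })

# Hand these pairs to human raters (or an LLM judge as a first pass).
with open("audit_pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```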


Strategy 2: Quantization

Quantization reduces the numerical precision of a model's weights to cut memory usage and increase inference speed. For example, converting a model from 32-bit floating point to 8-bit integers reduces weight memory by 4× and typically speeds up inference by 2–3×, with minimal impact on accuracy.
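
As a worked example (illustrative sizes): a 70B-parameter model stored at fp32 occupies

$$70 \times 10^9 \text{ params} \times 4 \text{ bytes} = 280 \text{ GB}$$

of weight memory. The same weights in int8 take 70 GB, and a 4-bit quantization roughly 35 GB - the difference between a multi-GPU deployment and a single 80 GB card.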

Common quantization techniques include:

  • Post-training quantization (PTQ): Applied after training, without requiring retraining. Simple to implement but may have a larger accuracy drop.
  • Quantization-aware training (QAT): Simulates quantization during training, allowing the model to adapt to lower precision. Better accuracy but more complex.

For most production systems, PTQ is a good starting point. Tools like llama.cpp and AutoGPTQ make quantization straightforward, and inference engines like vLLM can serve the quantized weights with minimal code changes.
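
As a concrete starting point, here is one common PTQ path: loading a model with 8-bit weights through Hugging Face transformers and bitsandbytes. This is a sketch under stated assumptions, not the only stack - the model id is a placeholder, and transformers, accelerate, and bitsandbytes must be installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder - any causal LM

# 8-bit post-training quantization applied at load time: weights are
# stored as int8, cutting weight memory ~4x versus fp32 (~2x versus fp16).
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place layers across available GPUs/CPU
)

inputs = tokenizer("What are your business hours?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```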
