
LLM API Pricing Index Q2 2026: Cost-per-Token Volatility

News · 2026-05-11
At a glance:
  • 60x input price spread
  • $0.05/M cheapest input
  • $3/M Sonnet 4.6 input
  • $15/M Opus 4.6 output
Key Takeaways

60x Input Spread on Frontier APIs: In Q2 2026, input costs span from $0.05/M (Qwen 3.5 9B) up to $3/M (Claude Sonnet 4.6), with Opus 4.6 output hitting $15/M — a massive sixtyfold gap before even considering GPT-5.4 Pro.
Chinese Ultra-Low Tier Keeps Compressing: Models like Qwen 3.5 Flash at $0.065/$0.26 and MiMo V2 Flash at $0.09/$0.29, both offering 1M context, continue to push the bottom lower for high-volume agent tasks.
Premium Pricing Is Holding, Not Falling: Despite ecosystem pressure, Anthropic's $3/$15 and $5/$25 tiers remain unchanged in Q2. Spend follows capability, not discounts, with Opus 4.6 generating about $25.1M/month in API revenue.
Free Tiers Are a Real Infrastructure Subsidy: Qwen 3.6 Plus, Nemotron 3 Super 120B, and Nemotron 3 Nano 30B all offer capable 256K+ context windows at zero cost during preview — an option agencies should use for non-critical traffic.
Cost-Routing Beats Model Selection: Agencies that tier queries by complexity — using cheap models for extraction, mid-tier for planning, and premium for terminal reasoning — typically reduce API spend by 60-80% compared to single-model setups.
Sticker Price Hides Real Cost: Factors like cache hits, batch discounts, tool-call overhead, and input inflation from new tokenizers can cause the true cost per task to swing by 2-5x compared to headline $/M rates.
Context Window Is Now a Pricing Axis: 1M context at $0.065/M (Qwen 3.5 Flash) was unheard of in Q1 2025. Today, it is the standard baseline for any agentic pipeline built in Q2 2026.

In Q2 2026, the spread for input token pricing has hit 60x — as low as $0.05 per million tokens with Qwen 3.5 9B, rising to $3 per million for Claude Sonnet 4.6, and exceeding $15 for Opus 4.6 output. The Digital Applied LLM API Pricing Index monitors where this gap is expanding or narrowing, which providers are maintaining high prices, and how agencies should distribute traffic across tiers to preserve margins without sacrificing performance.

This Q2 2026 update categorizes every major model on OpenRouter into five pricing tiers — ultra-low, economy, mid, premium, and free. We then layer on the 90-day change, the cost-routing strategies we employ in production, and the total-cost-of-ownership factors that list prices often miss. All data below is sourced from OpenRouter's April 2026 public pricing table.

Pricing snapshot date: April 12, 2026. LLM rates change monthly, so verify against the OpenRouter models catalog before locking in any cost model. Combine this with our performance-vs-price efficient frontier analysis for a complete view of capabilities.

The Q2 2026 Pricing Landscape

The pricing curve for Q2 2026 is shaped by two opposing forces. On one side, Chinese and open-weight providers continue to squeeze the low end — with Qwen 3.5 9B at $0.05 input, MiMo V2 Flash at $0.09, and Step 3.5 Flash at $0.10. On the other, Anthropic, OpenAI, and Google maintain steady premium prices because spend driven by capability doesn't hunt for discounts. Between them lies a crowded economy tier ($0.15-$0.50) where most high-volume agentic traffic is now settling.

How Digital Applied Tiers the Pricing Curve
  • Ultra-low (<$0.15/M input): ideal for bulk classification, extraction, OCR post-processing, retrieval re-ranking, and agent memory compaction.
  • Economy ($0.15-$0.50): suited for planning, tool selection, routine code generation, and shaping structured data.
  • Mid-tier ($0.50-$3): designed for reasoning-heavy tasks, complex tool chains, multi-step agent workflows, and technical writing.
  • Premium ($3+): reserved for terminal reasoning, irreversible actions, client-facing one-shot outputs, and the final mile of difficult coding problems.
  • Free tier: perfect for experimentation, load testing, fallback routes, and non-critical background jobs where variable latency is acceptable.

Design the routing layer first. Model selection is merely a result of workload classification. Work with our AI Digital Transformation team to build the classification and routing tier that will sustain the rest of your AI budget.
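The tier definitions above can be sketched as a minimal classification-first router. The workload-to-tier mapping, model slugs, and tier table below are illustrative assumptions built from this article's pricing bands, not a prescription:

```python
# Hypothetical tier router: map a workload label to the cheapest tier
# judged capable of serving it, then return that tier's example model.
# Model names and prices follow the article's Q2 2026 tables; the
# workload mapping itself is an assumption for illustration.

TIER_MODELS = {
    "ultra_low": ("qwen-3.5-flash", 0.065, 0.26),     # $/M in, $/M out
    "economy":   ("minimax-m2.5", 0.12, 0.99),
    "mid":       ("mimo-v2-pro", 1.00, 3.00),
    "premium":   ("claude-sonnet-4.6", 3.00, 15.00),
}

WORKLOAD_TIER = {
    "extraction": "ultra_low",
    "classification": "ultra_low",
    "planning": "economy",
    "tool_selection": "economy",
    "multi_step_agent": "mid",
    "terminal_reasoning": "premium",
}

def route(workload: str) -> str:
    """Return the model slug for a workload, defaulting to mid-tier
    when the classifier produces an unknown label."""
    tier = WORKLOAD_TIER.get(workload, "mid")
    return TIER_MODELS[tier][0]
```

The defensive default to mid-tier matters in production: an unrecognized workload label should degrade to a capable model, not to the cheapest one.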

Ultra-Low Tier (<$0.15/M Input)

The most significant action in Q2 2026 is found in the ultra-low tier. Four models priced under $0.15 input — Qwen 3.5 9B, Qwen 3.5 Flash, MiMo V2 Flash, and Step 3.5 Flash — collectively handle the majority of non-reasoning agent traffic we observe in agency pipelines. All four support over 256K context, and Qwen 3.5 Flash offers a full 1M context at $0.065 input — a price-to-context ratio that was unavailable anywhere just twelve months ago.

Model            Provider   Input $/M   Output $/M   Context
Qwen 3.5 9B      Alibaba    $0.05       $0.15        256K
Qwen 3.5 Flash   Alibaba    $0.065      $0.26        1M
MiMo V2 Flash    Xiaomi     $0.09       $0.29        262K
Step 3.5 Flash   StepFun    $0.10       $0.30        262K (free tier)

Route aggressively to the ultra-low tier. In our internal pipelines, approximately 55-65% of total tokens flow through this band after classification-first routing. On extraction tasks, the cost difference compared to mid-tier is typically 10-20x for identical output quality.
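The 10-20x claim is easy to check with a back-of-the-envelope per-task cost. The token counts below assume an extraction-shaped workload (long input, short structured output); prices are the article's rates for Qwen 3.5 Flash and MiMo V2 Pro:

```python
def task_cost(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    """Dollar cost of one task given $/M token rates."""
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

# Extraction-shaped task: 1M input tokens, 100K output tokens.
ultra = task_cost(1_000_000, 100_000, 0.065, 0.26)  # Qwen 3.5 Flash
mid = task_cost(1_000_000, 100_000, 1.00, 3.00)     # MiMo V2 Pro
print(round(mid / ultra, 1))  # prints 14.3
```

A 14x gap on identical extraction output sits squarely inside the 10-20x band quoted above.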

Economy Tier ($0.15-$0.50)

The economy tier is the most active segment of the Q2 2026 market. It hosts models like Qwen 3 Coder Next for software tasks, MiniMax M2.5 and M2.7 for general agentic traffic, Qwen 3.5 35B and 3.5 Plus for balanced reasoning, and MiMo V2 Omni for multimodal needs. This is where most planning, tool-routing, and structured generation should land for agencies optimizing cost without dropping to ultra-low quality.

Model               Provider   Input $/M   Output $/M   Context
Qwen 3 Coder Next   Alibaba    $0.12       $0.75        256K
MiniMax M2.5        MiniMax    $0.12       $0.99        197K
Qwen 3.5 35B        Alibaba    $0.16       $1.30        262K
Qwen 3.5 Plus       Alibaba    $0.26       $1.56        1M
MiniMax M2.7        MiniMax    $0.30       $1.20        205K
MiMo V2 Omni        Xiaomi     $0.40       $2.00        262K

Be mindful of the output pricing variance within this band. Qwen 3 Coder Next is at $0.75 output versus a $0.12 input, while MiMo V2 Omni hits $2 output on a $0.40 input. For workloads involving extensive generation, the economics will vary significantly depending on which economy-tier model is used, so benchmark your specific input/output ratio before standardizing.
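The input/output ratio effect can be made concrete with two models from the table whose prices cross: Qwen 3.5 35B (cheaper input) versus MiniMax M2.7 (cheaper output). The ratios 0.2 and 2.0 are illustrative stand-ins for extraction-heavy versus generation-heavy workloads:

```python
def cost_per_m_input(in_price: float, out_price: float, out_ratio: float) -> float:
    """Dollars spent per 1M input tokens when each input token
    yields out_ratio output tokens."""
    return in_price + out_ratio * out_price

# Light generation (0.2 output tokens per input token): Qwen 3.5 35B wins.
a = cost_per_m_input(0.16, 1.30, 0.2)  # ~$0.42
b = cost_per_m_input(0.30, 1.20, 0.2)  # ~$0.54
# Heavy generation (2.0 output tokens per input token): MiniMax M2.7 wins.
c = cost_per_m_input(0.16, 1.30, 2.0)  # ~$2.76
d = cost_per_m_input(0.30, 1.20, 2.0)  # ~$2.70
```

The break-even for this pair sits at an output/input ratio of 1.4; above it, the higher input price buys back more than it costs. This is why benchmarking your own ratio matters more than the headline input rate.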

Mid-Tier ($0.50-$3)

The mid-tier has become leaner, as the ultra-low and economy bands have absorbed much of the workload that would have fallen here in 2025. What remains sits roughly between $0.75 and $1 on input: MiMo V2 Pro acts as the heavyweight generalist with a 1.04M context window, and Qwen 3 Max Thinking serves as the reasoning variant for step-by-step problem resolution.

Model                 Provider   Input $/M   Output $/M   Context
Qwen 3 Max Thinking   Alibaba    $0.78       $3.90        262K
MiMo V2 Pro           Xiaomi     $1.00       $3.00        1.04M

MiMo V2 Pro is currently the top model on OpenRouter by volume, processing 4.79T weekly tokens and handling about a quarter of all coding tokens observed on the network. This concentration of real-world workload at $1/$3 indicates the mid-tier's pricing ceiling: the market has decided that reasoning-grade, 1M-context capability shouldn't cost more than $1-$3 per million input unless the model reaches a premium capability bar.

Premium Tier ($3+)

The premium tier is dominated by Anthropic and OpenAI. Claude Sonnet 4.6 at $3/$15 and Opus 4.6 at $5/$25 (via OpenRouter) have held firm through Q2 despite pressure from cheaper Chinese models performing similarly on benchmarks. The GPT-5.4 family fits in here as well: GPT-5.4 at $2.50/$15, GPT-5.3-Codex at $1.75/$14, and GPT-5.4 Pro leading the market at $30/$180. This is where capability-bound spend is concentrated.

Model               Provider    Input $/M   Output $/M   Context
GPT-5.4             OpenAI      $2.50       $15.00       1.05M
Claude Sonnet 4.6   Anthropic   $3.00       $15.00       200K / 1M beta
Claude Opus 4.6     Anthropic   $5.00       $25.00       200K / 1M beta
GPT-5.4 Pro         OpenAI      $30.00      $180.00      1.05M

The Opus concentration problem. Claude Opus 4.6 alone drives roughly $25.1M per month in API spend, dominating Anthropic's direct API revenue mix. We explore the revenue-geometry implications in the Anthropic cost problem analysis.

Free Tier Models

Q2 2026 has brought an unusually robust free tier. Qwen 3.6 Plus is completely free during preview with a 1M context window — already rising to the #2 spot on OpenRouter by volume with 1.64T weekly tokens. NVIDIA's Nemotron 3 Super 120B and Nemotron 3 Nano 30B both offer free tiers and 256K+ context. For agencies, these free tiers act as a genuine infrastructure subsidy and should be included in any cost plan as a fallback and experimentation route.

Model                   Provider   Cost             Context   Notes
Qwen 3.6 Plus           Alibaba    Free (preview)   1M        #2 on OpenRouter, always-on CoT, native function calling
Nemotron 3 Super 120B   NVIDIA     Free tier        262K      120B/12B active, 60.47% SWE-Bench Verified, open-source
Nemotron 3 Nano 30B     NVIDIA     Free tier        256K      Open-source, compact, deployment-friendly
Step 3.5 Flash          StepFun    Free tier        262K      Paid tier also available at $0.10/$0.30

Approach free-tier routing as an operational decision, not just a cost-saving measure. Free tiers come with rate limits, latency fluctuations, and provider-side preview caveats, making them best suited for fallback chains, background batch jobs, and development sandboxes rather than customer-facing production paths.
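A fallback chain like the one described can be sketched as follows. The model slugs and the `call_model` client are hypothetical placeholders (the `:free` suffix mirrors OpenRouter's naming convention but verify against the live catalog); the ordering encodes the guidance above, free preview first and a paid backstop last:

```python
class RateLimited(Exception):
    """Raised by the client when a provider throttles the request."""

def call_model(model: str, prompt: str) -> str:
    # Placeholder: wire up your actual OpenRouter client here.
    raise NotImplementedError

FALLBACK_CHAIN = [
    "qwen-3.6-plus:free",   # free preview, 1M context, rate-limited
    "step-3.5-flash:free",  # free tier, variable latency
    "step-3.5-flash",       # paid $0.10/$0.30 backstop
]

def complete_with_fallback(prompt: str, caller=call_model) -> str:
    """Walk the chain from cheapest to most reliable option."""
    last_err = None
    for model in FALLBACK_CHAIN:
        try:
            return caller(model, prompt)
        except RateLimited as err:
            last_err = err  # throttled: move to the next tier
    raise RuntimeError("all fallbacks exhausted") from last_err
```

Keeping a paid model as the terminal entry is the design choice that makes this safe for background jobs: the free tiers absorb cost when available, and the chain never fails solely because a preview endpoint is throttled.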

90-Day Delta Analysis

The most notable shift between Q1 2026 and Q2 2026 is what failed to occur. Anthropic held Sonnet and Opus pricing steady despite Sonnet 4.6's launch pressing Opus margins. OpenAI did not significantly reprice the GPT-5.4 family. Google kept Gemini 3.1 Pro at $2/$12. The premium tier remains stable, not eroding.

Where Prices Actually Moved Q1 to Q2 2026
  • Ultra-low compression continues. Qwen 3.5 Flash debuted at $0.065/$0.26 with 1M context, resetting price-per-context expectations for the entire low-end market.
  • Economy tier crowding. Six distinct models now occupy the $0.12-$0.40 input band, with output prices differing by 2.5x across them for similar task quality.
  • Mid-tier shrinks. Workloads previously directed to mid-tier have moved to either cheaper economy-tier models or premium Claude Sonnet 4.6. Only MiMo V2 Pro and Qwen 3 Max Thinking retain significant mid-tier share.
  • Premium holds. No price changes for flagship Anthropic or OpenAI models in Q2 2026. Spend driven by capability is not price-elastic at the premium tier.
  • Free-tier expansion. Qwen 3.6 Plus and the Nemotron 3 family added large-context free options that were absent from Q1 2026 pricing sheets.

The strategic takeaway is that the pricing curve is becoming more bimodal, not smoother. Cheap gets cheaper. Premium stays premium. Agencies should be most cautious about defaulting to the middle, since workload classification now routes most requests either below or above it.

Agency Cost-Routing Strategy

The most impactful lever in LLM cost management is establishing a routing tier before choosing models. The goal is straightforward: classify every query by complexity and match it to the cheapest model capable of serving it at the required quality level. When executed well, this reduces API spend by 60-80% compared to naive single-model deployments, and it scales automatically with every new model released without requiring architectural changes.
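A blended-cost estimate shows where the savings come from. The 60/25/10/5 traffic mix below is an assumption (broadly consistent with the 55-65% ultra-low share cited earlier), and the calculation is input-only, ignoring output tokens, caching, and batch discounts, so treat the numbers as rough bounds rather than a forecast:

```python
TRAFFIC_MIX = {       # input $/M -> assumed share of total tokens
    0.065: 0.60,      # ultra-low (Qwen 3.5 Flash)
    0.12: 0.25,       # economy   (MiniMax M2.5)
    1.00: 0.10,       # mid       (MiMo V2 Pro)
    3.00: 0.05,       # premium   (Claude Sonnet 4.6)
}

blended = sum(price * share for price, share in TRAFFIC_MIX.items())
vs_premium = 1 - blended / 3.00   # vs a Sonnet-4.6-only deployment
vs_mid = 1 - blended / 1.00       # vs a MiMo-V2-Pro-only deployment
print(f"${blended:.3f}/M, {vs_premium:.0%} vs premium, {vs_mid:.0%} vs mid")
```

Under these assumptions the blended input rate is about $0.32/M: roughly 68% below a mid-tier-only baseline, and higher still against a premium-only baseline, which is consistent with the 60-80% range once output costs and overheads are added back in.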

The Four-Stage Stack

  1. Classify every incoming query by complexity before any model is selected.