
LLM API Pricing Report, Q2 2026: The Per-Unit Cost Spread

News · 2026-05-11
  • 60x: input price spread
  • $0.05/M: cheapest input
  • $3/M: Sonnet 4.6 input
  • $15/M: Opus 4.6 output

Key Takeaways

60x Input Spread on Frontier APIs: Q2 2026 input pricing stretches from $0.05/M (Qwen 3.5 9B) to $3/M (Claude Sonnet 4.6), with Opus 4.6 output at $15/M — a sixtyfold delta before you touch GPT-5.4 Pro territory.
Chinese Ultra-Low Tier Keeps Compressing: Qwen 3.5 Flash at $0.065/$0.26 with a 1M context, and MiMo V2 Flash at $0.09/$0.29, continue to reset the floor for high-volume agent workloads.
Premium Pricing Is Holding, Not Falling: Anthropic's $3/$15 and $5/$25 bands have not moved in Q2 despite ecosystem pressure. Spend follows capability, not discounting, with Opus 4.6 at roughly $25.1M/month in Anthropic API revenue.
Free Tiers Are a Real Infrastructure Subsidy: Qwen 3.6 Plus, Nemotron 3 Super 120B, and Nemotron 3 Nano 30B all expose capable 256K+ context windows at zero cost during preview — a pattern agencies should route non-critical traffic through.
Cost-Routing Beats Model Selection: Agencies that tier queries by complexity — cheap model for extraction, mid-tier for planning, premium for terminal reasoning — routinely cut API spend 60-80% versus single-model deployments.
Sticker Price Hides Real Cost: Cache hits, batch API discounts, tool-call overhead, and input token inflation from new tokenizers can swing true cost per task by 2-5x against the headline $/M numbers.
Context Window Is Now a Pricing Axis: 1M context at $0.065/M (Qwen 3.5 Flash) was science fiction in Q1 2025. Today it is the baseline assumption for any agentic pipeline built in Q2 2026.

Input token pricing has a 60x spread in Q2 2026 — $0.05 per million tokens on the low end with Qwen 3.5 9B, $3 per million on Claude Sonnet 4.6, and $15 or more on Opus 4.6 output. The Digital Applied LLM API Pricing Index tracks where that spread is widening versus compressing, which providers are defending premium bands, and how agencies should route traffic through the tiers to protect margin without surrendering capability.

This Q2 2026 refresh sorts every major OpenRouter-listed model into five pricing tiers — ultra-low, economy, mid, premium, and free — then layers on the 90-day delta, the agency cost-routing strategy we use in production, and the total-cost-of-ownership factors that sticker pricing never captures. Every number below is drawn from OpenRouter's April 2026 public pricing table.

Pricing snapshot date: April 12, 2026. LLM pricing moves monthly — verify against the OpenRouter models catalog before finalizing any cost model. Pair with our performance-vs-price efficient frontier analysis for the capability axis.

The Q2 2026 Pricing Landscape

The Q2 2026 pricing curve is defined by two forces pulling in opposite directions. Chinese and open-weight providers keep compressing the low end — Qwen 3.5 9B at $0.05 input, MiMo V2 Flash at $0.09, Step 3.5 Flash at $0.10 — while Anthropic, OpenAI, and Google hold premium bands steady because capability-bound spend does not chase discounts. Between the two lives a crowded $0.15-$0.50 economy tier where most high-volume agentic traffic now sits.

How Digital Applied Tiers the Pricing Curve
  • Ultra-low (<$0.15/M input): bulk classification, extraction, OCR post-processing, retrieval re-ranking, agent memory compaction.
  • Economy ($0.15-$0.50): planning, tool selection, routine code generation, structured data shaping.
  • Mid-tier ($0.50-$3): reasoning-heavy tasks, complex tool chains, multi-step agentic work, technical writing.
  • Premium ($3+): terminal reasoning, irreversible actions, client-facing one-shot output, the last mile of a hard coding problem.
  • Free tier: experimentation, load testing, fallback routes, and non-critical background workloads where latency variance is acceptable.

Design the routing layer first. Model selection is downstream of workload classification: once queries are classified correctly, the model choice for each tier largely makes itself. Work with our AI Digital Transformation team to build the classification and routing tier that pays for the rest of your AI budget.
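The tier definitions above can be expressed as a small routing table. This is an illustrative sketch, not a production router: the model identifiers are informal names for models listed later in this report, and the workload-to-tier mapping simply mirrors the bullets above.

```python
# Illustrative routing table. Prices ($/M input) come from the tier tables
# in this report; tier labels and model ID strings are informal stand-ins.
TIERS = {
    "ultra_low": {"model": "qwen-3.5-flash",    "input_per_m": 0.065},
    "economy":   {"model": "minimax-m2.5",      "input_per_m": 0.12},
    "mid":       {"model": "mimo-v2-pro",       "input_per_m": 1.00},
    "premium":   {"model": "claude-sonnet-4.6", "input_per_m": 3.00},
}

WORKLOAD_TIER = {
    "extraction":         "ultra_low",
    "classification":     "ultra_low",
    "planning":           "economy",
    "tool_selection":     "economy",
    "multi_step_agent":   "mid",
    "terminal_reasoning": "premium",
}

def route(workload: str) -> str:
    """Return the model assigned to a workload class. Unclassified work
    defaults to premium: fail expensive, not wrong."""
    tier = WORKLOAD_TIER.get(workload, "premium")
    return TIERS[tier]["model"]
```

The default-to-premium choice is deliberate: a misrouted hard query on a cheap model costs quality, while a misrouted easy query on a premium model only costs a few cents.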

Ultra-Low Tier (<$0.15/M Input)

The ultra-low tier is where the most interesting Q2 2026 movement has happened. Four models sit under $0.15 input and collectively handle the majority of non-reasoning agent traffic we see in agency pipelines: Qwen 3.5 9B, Qwen 3.5 Flash, MiMo V2 Flash, and Step 3.5 Flash. All four offer at least 256K context, and Qwen 3.5 Flash pushes to a full 1M context at $0.065 input — a price-per-context ratio that did not exist at any provider twelve months ago.

| Model | Provider | Input $/M | Output $/M | Context |
| --- | --- | --- | --- | --- |
| Qwen 3.5 9B | Alibaba | $0.05 | $0.15 | 256K |
| Qwen 3.5 Flash | Alibaba | $0.065 | $0.26 | 1M |
| MiMo V2 Flash | Xiaomi | $0.09 | $0.29 | 262K |
| Step 3.5 Flash | StepFun | $0.10 | $0.30 | 262K (free tier) |

Route the ultra-low tier aggressively. In our own internal pipelines, roughly 55-65% of total tokens flow through this band after classification-first routing, and the cost delta against mid-tier for identical output quality on extraction tasks is typically 10-20x.
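The arithmetic behind that routing claim is easy to reproduce. Below is a minimal sketch assuming an illustrative 60/30/10 token split across ultra-low, economy, and premium tiers, with input prices taken from this report's tables; real splits and the inclusion of output tokens will shift the numbers, so treat this as a template, not a forecast.

```python
def blended_input_cost(total_tokens_m: float, split: dict, prices: dict) -> float:
    """Blended input cost in dollars: sum over tiers of share * tokens * $/M."""
    return sum(split[t] * total_tokens_m * prices[t] for t in split)

# $/M input prices from this report; the 60/30/10 split is illustrative.
prices = {"ultra_low": 0.065, "economy": 0.26, "premium": 3.00}
split  = {"ultra_low": 0.60,  "economy": 0.30, "premium": 0.10}

routed = blended_input_cost(1000, split, prices)   # 1,000M input tokens/month
single = 1000 * prices["premium"]                  # everything on the premium model
# routed: 0.6*1000*0.065 + 0.3*1000*0.26 + 0.1*1000*3.00 = 39 + 78 + 300 = $417
# single: $3,000 -> routed input spend is ~14% of the single-model figure
```

This illustrative mix lands above the 60-80% savings band quoted in the takeaways; once output tokens (which skew pricier) and escalation retries are counted, realized savings typically fall back into that range.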

Economy Tier ($0.15-$0.50)

The economy tier is the busiest band of the Q2 2026 market. Qwen 3 Coder Next for software-focused workloads, MiniMax M2.5 and M2.7 for general agentic traffic, Qwen 3.5 35B and 3.5 Plus for balanced reasoning, and MiMo V2 Omni for multimodal work all sit here. This is where most planning, tool-routing, and structured generation should land for agencies optimizing for cost without dropping to ultra-low quality.

| Model | Provider | Input $/M | Output $/M | Context |
| --- | --- | --- | --- | --- |
| Qwen 3 Coder Next | Alibaba | $0.12 | $0.75 | 256K |
| MiniMax M2.5 | MiniMax | $0.12 | $0.99 | 197K |
| Qwen 3.5 35B | Alibaba | $0.16 | $1.30 | 262K |
| Qwen 3.5 Plus | Alibaba | $0.26 | $1.56 | 1M |
| MiniMax M2.7 | MiniMax | $0.30 | $1.20 | 205K |
| MiMo V2 Omni | Xiaomi | $0.40 | $2.00 | 262K |

Note the output pricing variance inside this band. Qwen 3 Coder Next sits at $0.75 output despite a $0.12 input, while MiMo V2 Omni reaches $2 output at only $0.40 input. Workloads heavy on long generation will see very different economics depending on which economy-tier model handles them, so benchmark your specific input/output ratio before standardizing on any single choice.
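To see how the input/output ratio drives the economics, here is a minimal cost-per-task sketch comparing two economy-tier models from the table above on a generation-heavy task. The token counts are illustrative; only the prices come from this report.

```python
def cost_per_task(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    """Dollar cost of one task, with prices quoted in $ per million tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# (input $/M, output $/M) from the economy-tier table above.
coder_next = (0.12, 0.75)   # Qwen 3 Coder Next
omni       = (0.40, 2.00)   # MiMo V2 Omni

# Illustrative generation-heavy task: 2K tokens in, 8K tokens out.
a = cost_per_task(2_000, 8_000, *coder_next)   # $0.00624
b = cost_per_task(2_000, 8_000, *omni)         # $0.01680, ~2.7x more
```

Flip the ratio to 8K in / 2K out and the gap narrows, which is exactly why the input/output mix of your workload, not the headline input price, should drive the choice inside this band.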

Mid-Tier ($0.50-$3)

Mid-tier is thinner than it used to be because the ultra-low and economy bands have swallowed most of what would have been mid-tier workloads in 2025. What remains sits between roughly $0.75 and $1 on the input side: MiMo V2 Pro as the heavyweight generalist with a 1.04M context window, and Qwen 3 Max Thinking as the reasoning variant for step-by-step problem solving.

| Model | Provider | Input $/M | Output $/M | Context |
| --- | --- | --- | --- | --- |
| Qwen 3 Max Thinking | Alibaba | $0.78 | $3.90 | 262K |
| MiMo V2 Pro | Xiaomi | $1.00 | $3.00 | 1.04M |

MiMo V2 Pro is currently the #1 model on OpenRouter by volume at 4.79T weekly tokens and handles roughly a quarter of all coding tokens observed across the network. That concentration of real workload at $1/$3 tells you the mid-tier's pricing ceiling: the market has voted that reasoning-grade, 1M-context capability should not cost more than $1-$3 per million input unless the model clears a premium capability bar.

Premium Tier ($3+)

The premium tier is Anthropic and OpenAI, full stop. Claude Sonnet 4.6 at $3/$15 and Opus 4.6 at $5/$25 (via OpenRouter) have held price through Q2 despite pressure from cheaper Chinese models matching them on benchmarks. The GPT-5.4 family slots in alongside: GPT-5.4 at $2.50/$15, GPT-5.3-Codex at $1.75/$14, and GPT-5.4 Pro at the top of the market at $30/$180. Premium pricing is where capability-bound spend concentrates.

| Model | Provider | Input $/M | Output $/M | Context |
| --- | --- | --- | --- | --- |
| GPT-5.4 | OpenAI | $2.50 | $15.00 | 1.05M |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K / 1M beta |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | 200K / 1M beta |
| GPT-5.4 Pro | OpenAI | $30.00 | $180.00 | 1.05M |

The Opus concentration problem. Claude Opus 4.6 alone drives roughly $25.1M per month in API spend, dominating Anthropic's direct API revenue mix. We unpack the revenue-geometry implications in the Anthropic cost problem analysis.

Free Tier Models

Q2 2026 has produced an unusually strong free tier. Qwen 3.6 Plus is fully free during preview with a 1M context window — and it has already climbed to the #2 position on OpenRouter by volume at 1.64T weekly tokens. NVIDIA's Nemotron 3 Super 120B and Nemotron 3 Nano 30B both ship with a free tier and 256K+ context. For agencies, these free tiers are a real infrastructure subsidy and belong in any cost plan as a fallback and experimentation route.

| Model | Provider | Cost | Context | Notes |
| --- | --- | --- | --- | --- |
| Qwen 3.6 Plus | Alibaba | Free (preview) | 1M | #2 on OpenRouter, always-on CoT, native function calling |
| Nemotron 3 Super 120B | NVIDIA | Free tier | 262K | 120B/12B active, 60.47% SWE-Bench Verified, open-source |
| Nemotron 3 Nano 30B | NVIDIA | Free tier | 256K | Open-source, compact, deployment-friendly |
| Step 3.5 Flash | StepFun | Free tier | 262K | Paid tier also available at $0.10/$0.30 |

Treat free-tier routing as an operational decision, not a cost optimization. Free tiers ship with rate limits, latency variance, and provider-side preview caveats, so the right placement is in fallback chains, background batch jobs, and development sandboxes rather than customer-facing production paths.
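A fallback chain of that shape can be sketched as follows. The `call_model` stub and its failure rate are hypothetical stand-ins for real provider calls; the point is the ordering: free tiers first, a paid model as the guaranteed floor.

```python
import random

def call_model(name: str) -> str:
    """Stand-in for a provider call. Free tiers are modeled as flaky
    (rate limits, preview caveats), so the stub fails randomly for them."""
    if name.startswith("free/") and random.random() < 0.3:
        raise RuntimeError(f"{name}: rate limited")
    return f"response from {name}"

def with_fallback(chain: list[str]) -> str:
    """Try each model in order, returning the first success."""
    last_err = None
    for name in chain:
        try:
            return call_model(name)
        except RuntimeError as err:
            last_err = err
    raise RuntimeError("all routes failed") from last_err

# Free tiers for background batch work; paid Step 3.5 Flash as the floor.
result = with_fallback([
    "free/qwen-3.6-plus",
    "free/nemotron-3-nano-30b",
    "stepfun/step-3.5-flash",
])
```

Because the paid model sits last and never rate-limits in this sketch, the chain always resolves; in production you would also attach per-hop timeouts and budget caps before declaring a route exhausted.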

90-Day Delta Analysis

The most important delta in the Q1 2026 to Q2 2026 window is what did not happen. Anthropic did not cut Sonnet or Opus pricing despite the launch of Sonnet 4.6 nudging Opus margins. OpenAI did not meaningfully reprice the GPT-5.4 family. Google held Gemini 3.1 Pro at $2/$12. The premium tier is stable, not eroding.

Where Prices Actually Moved Q1 to Q2 2026
  • Ultra-low compression continues. Qwen 3.5 Flash launched at $0.065/$0.26 with 1M context, resetting price-per-context expectations for the entire low-end market.
  • Economy tier crowding. Six distinct models now sit in the $0.12-$0.40 input band, with output pricing varying 2.5x across them for similar task quality.
  • Mid-tier shrinks. Workloads previously routed to mid-tier have migrated to either cheaper economy-tier or premium Claude Sonnet 4.6. Only MiMo V2 Pro and Qwen 3 Max Thinking retain meaningful mid-tier share.
  • Premium holds. No Anthropic or OpenAI flagship price change in Q2 2026. Capability-bound spend is not price-elastic at the premium tier.
  • Free-tier expansion. Qwen 3.6 Plus and the Nemotron 3 family added large-context free options that did not exist in Q1 2026 pricing sheets.

The strategic implication is that the pricing curve is getting more bimodal, not smoother. Cheap is getting cheaper. Premium stays premium. The middle is where agencies should be most careful about defaulting, because workload classification now routes most requests either below or above it.

Agency Cost-Routing Strategy

The single highest-leverage decision in LLM cost management is building a routing tier before picking models. The goal is simple: every query gets classified by complexity and matched to the cheapest model that can serve it at the required quality bar. Done well, this cuts API spend 60-80% versus naive single-model deployments, and it scales with every new model the ecosystem ships without requiring architectural changes.

The Four-Stage Stack

  1. Classification (Intent Analysis): Determine if the query is simple extraction, complex reasoning, or creative generation.
  2. Routing (Model Selection): Direct the query to the appropriate pricing tier (Ultra-low, Economy, Mid, Premium).
  3. Execution (Inference): Run the model and capture latency, cost, and quality metrics.
  4. Feedback (Optimization): Use the metrics to refine classification rules and routing paths.

Implementing this stack allows agencies to dynamically allocate resources. For instance, a customer support query might first be classified by a cheap Ultra-low model. If the query is complex, it is escalated to a Mid-tier or Premium model. This ensures that critical, high-value tasks receive the necessary computational power, while routine tasks are handled cost-effectively.

| Stage | Action | Tooling |
| --- | --- | --- |
| 1. Classification | Analyze query complexity and intent. | Custom classifiers, heuristics, or small LLMs. |
| 2. Routing | Map classification to model tier. | OpenRouter, custom API gateways. |
| 3. Execution | Call the selected model. | Provider APIs (Anthropic, OpenAI, Alibaba). |
| 4. Feedback | Log cost/quality, update rules. | Internal dashboards, ML observability tools. |
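The four stages above can be sketched as a single class. The keyword heuristics, tier names, and stubbed execution are illustrative placeholders; a real deployment would swap in a cheap classifier model at stage 1 and actual provider calls at stage 3.

```python
from dataclasses import dataclass, field

@dataclass
class Router:
    """Minimal sketch of the four-stage stack:
    classify -> route -> execute -> feedback. Illustrative only."""
    tier_of: dict = field(default_factory=lambda: {
        "extraction": "ultra_low",
        "planning":   "economy",
        "reasoning":  "premium",
    })
    log: list = field(default_factory=list)

    def classify(self, query: str) -> str:
        # Stage 1: keyword heuristics stand in for a cheap classifier model.
        if "why" in query or "prove" in query:
            return "reasoning"
        if "plan" in query:
            return "planning"
        return "extraction"

    def handle(self, query: str) -> str:
        intent = self.classify(query)             # 1. Classification
        tier = self.tier_of[intent]               # 2. Routing
        result = f"[{tier}] answered: {query}"    # 3. Execution (stubbed)
        self.log.append((intent, tier))           # 4. Feedback: metrics for tuning
        return result
```

The feedback log is the piece most teams skip: without per-query (intent, tier, cost, quality) records, the classification rules never improve and the 60-80% savings ceiling quoted above stays theoretical.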

