
LLM API Pricing Report, Q2 2026: The Per-Unit Cost Spread

News · 2026-05-11
  • 60x: input price spread
  • $0.05/M: cheapest input
  • $3/M: Sonnet 4.6 input
  • $15/M: Opus 4.6 output

Key Takeaways

60x Input Spread on Frontier APIs: Q2 2026 input pricing stretches from $0.05/M (Qwen 3.5 9B) to $3/M (Claude Sonnet 4.6), with Opus 4.6 output at $15/M — a sixtyfold delta before you touch GPT-5.4 Pro territory.
Chinese Ultra-Low Tier Keeps Compressing: Qwen 3.5 Flash at $0.065/$0.26 with a 1M context, and MiMo V2 Flash at $0.09/$0.29, continue to reset the floor for high-volume agent workloads.
Premium Pricing Is Holding, Not Falling: Anthropic's $3/$15 and $5/$25 bands have not moved in Q2 despite ecosystem pressure. Spend follows capability, not discounting, with Opus 4.6 at roughly $25.1M/month in Anthropic API revenue.
Free Tiers Are a Real Infrastructure Subsidy: Qwen 3.6 Plus, Nemotron 3 Super 120B, and Nemotron 3 Nano 30B all expose capable 256K+ context windows at zero cost during preview — a pattern agencies should route non-critical traffic through.
Cost-Routing Beats Model Selection: Agencies that tier queries by complexity — cheap model for extraction, mid-tier for planning, premium for terminal reasoning — routinely cut API spend 60-80% versus single-model deployments.
Sticker Price Hides Real Cost: Cache hits, batch API discounts, tool-call overhead, and input token inflation from new tokenizers can swing true cost per task by 2-5x against the headline $/M numbers.
Context Window Is Now a Pricing Axis: 1M context at $0.065/M (Qwen 3.5 Flash) was science fiction in Q1 2025. Today it is the baseline assumption for any agentic pipeline built in Q2 2026.

Input token pricing has a 60x spread in Q2 2026 — $0.05 per million tokens on the low end with Qwen 3.5 9B, $3 per million on Claude Sonnet 4.6, and $15 or more on Opus 4.6 output. The Digital Applied LLM API Pricing Index tracks where that spread is widening versus compressing, which providers are defending premium bands, and how agencies should route traffic through the tiers to protect margin without surrendering capability.

This Q2 2026 refresh sorts every major OpenRouter-listed model into five pricing tiers — ultra-low, economy, mid, premium, and free — then layers on the 90-day delta, the agency cost-routing strategy we use in production, and the total-cost-of-ownership factors that sticker pricing never captures. Every number below is drawn from OpenRouter's April 2026 public pricing table.

Pricing snapshot date: April 12, 2026. LLM pricing moves monthly — verify against the OpenRouter models catalog before finalizing any cost model. Pair with our performance-vs-price efficient frontier analysis for the capability axis.

The Q2 2026 Pricing Landscape

The Q2 2026 pricing curve is defined by two forces pulling in opposite directions. Chinese and open-weight providers keep compressing the low end — Qwen 3.5 9B at $0.05 input, MiMo V2 Flash at $0.09, Step 3.5 Flash at $0.10 — while Anthropic, OpenAI, and Google hold premium bands steady because capability-bound spend does not chase discounts. Between the two lives a crowded $0.15-$0.50 economy tier where most high-volume agentic traffic now sits.

How Digital Applied Tiers the Pricing Curve
  • Ultra-low (<$0.15/M input): bulk classification, extraction, OCR post-processing, retrieval re-ranking, agent memory compaction.
  • Economy ($0.15-$0.50): planning, tool selection, routine code generation, structured data shaping.
  • Mid-tier ($0.50-$3): reasoning-heavy tasks, complex tool chains, multi-step agentic work, technical writing.
  • Premium ($3+): terminal reasoning, irreversible actions, client-facing one-shot output, the last mile of a hard coding problem.
  • Free tier: experimentation, load testing, fallback routes, and non-critical background workloads where latency variance is acceptable.

Design the routing layer first. Model selection is downstream of workload classification: once queries are classified correctly, the model choice for each tier largely makes itself. Work with our AI Digital Transformation team to build the classification and routing tier that pays for the rest of your AI budget.
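The tier definitions above can be expressed as a small routing table. This is an illustrative sketch, not a production router: the model identifiers are informal names for models listed later in this report, and the workload-to-tier mapping simply mirrors the bullets above.

```python
# Illustrative routing table. Prices ($/M input) come from the tier tables
# in this report; tier labels and model ID strings are informal stand-ins.
TIERS = {
    "ultra_low": {"model": "qwen-3.5-flash",    "input_per_m": 0.065},
    "economy":   {"model": "minimax-m2.5",      "input_per_m": 0.12},
    "mid":       {"model": "mimo-v2-pro",       "input_per_m": 1.00},
    "premium":   {"model": "claude-sonnet-4.6", "input_per_m": 3.00},
}

WORKLOAD_TIER = {
    "extraction":         "ultra_low",
    "classification":     "ultra_low",
    "planning":           "economy",
    "tool_selection":     "economy",
    "multi_step_agent":   "mid",
    "terminal_reasoning": "premium",
}

def route(workload: str) -> str:
    """Return the model assigned to a workload class. Unclassified work
    defaults to premium: fail expensive, not wrong."""
    tier = WORKLOAD_TIER.get(workload, "premium")
    return TIERS[tier]["model"]
```

The default-to-premium choice is deliberate: a misrouted hard query on a cheap model costs quality, while a misrouted easy query on a premium model only costs a few cents.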

Ultra-Low Tier (<$0.15/M Input)

The ultra-low tier is where the most interesting Q2 2026 movement has happened. Four models sit under $0.15 input and collectively handle the majority of non-reasoning agent traffic we see in agency pipelines: Qwen 3.5 9B, Qwen 3.5 Flash, MiMo V2 Flash, and Step 3.5 Flash. All four offer at least 256K context, and Qwen 3.5 Flash pushes to a full 1M context at $0.065 input — a price-per-context ratio that did not exist at any provider twelve months ago.

| Model | Provider | Input $/M | Output $/M | Context |
| --- | --- | --- | --- | --- |
| Qwen 3.5 9B | Alibaba | $0.05 | $0.15 | 256K |
| Qwen 3.5 Flash | Alibaba | $0.065 | $0.26 | 1M |
| MiMo V2 Flash | Xiaomi | $0.09 | $0.29 | 262K |
| Step 3.5 Flash | StepFun | $0.10 | $0.30 | 262K (free tier) |

Route the ultra-low tier aggressively. In our own internal pipelines, roughly 55-65% of total tokens flow through this band after classification-first routing, and the cost delta against mid-tier for identical output quality on extraction tasks is typically 10-20x.
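The arithmetic behind that routing claim is easy to reproduce. Below is a minimal sketch assuming an illustrative 60/30/10 token split across ultra-low, economy, and premium tiers, with input prices taken from this report's tables; real splits and the inclusion of output tokens will shift the numbers, so treat this as a template, not a forecast.

```python
def blended_input_cost(total_tokens_m: float, split: dict, prices: dict) -> float:
    """Blended input cost in dollars: sum over tiers of share * tokens * $/M."""
    return sum(split[t] * total_tokens_m * prices[t] for t in split)

# $/M input prices from this report; the 60/30/10 split is illustrative.
prices = {"ultra_low": 0.065, "economy": 0.26, "premium": 3.00}
split  = {"ultra_low": 0.60,  "economy": 0.30, "premium": 0.10}

routed = blended_input_cost(1000, split, prices)   # 1,000M input tokens/month
single = 1000 * prices["premium"]                  # everything on the premium model
# routed: 0.6*1000*0.065 + 0.3*1000*0.26 + 0.1*1000*3.00 = 39 + 78 + 300 = $417
# single: $3,000 -> routed input spend is ~14% of the single-model figure
```

This illustrative mix lands above the 60-80% savings band quoted in the takeaways; once output tokens (which skew pricier) and escalation retries are counted, realized savings typically fall back into that range.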

Economy Tier ($0.15-$0.50)

The economy tier is the busiest band of the Q2 2026 market. Qwen 3 Coder Next for software-focused workloads, MiniMax M2.5 and M2.7 for general agentic traffic, Qwen 3.5 35B and 3.5 Plus for balanced reasoning, and MiMo V2 Omni for multimodal work all sit here. This is where most planning, tool-routing, and structured generation should land for agencies optimizing for cost without dropping to ultra-low quality.

| Model | Provider | Input $/M | Output $/M | Context |
| --- | --- | --- | --- | --- |
| Qwen 3 Coder Next | Alibaba | $0.12 | $0.75 | 256K |
| MiniMax M2.5 | MiniMax | $0.12 | $0.99 | 197K |
| Qwen 3.5 35B | Alibaba | $0.16 | $1.30 | 262K |
| Qwen 3.5 Plus | Alibaba | $0.26 | $1.56 | 1M |
| MiniMax M2.7 | MiniMax | $0.30 | $1.20 | 205K |
| MiMo V2 Omni | Xiaomi | $0.40 | $2.00 | 262K |

Note the output pricing variance inside this band. Qwen 3 Coder Next sits at $0.75 output despite a $0.12 input, while MiMo V2 Omni reaches $2 output at only $0.40 input. Workloads heavy on long generation will see very different economics depending on which economy-tier model handles them, so benchmark your specific input/output ratio before standardizing on any single choice.
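To see how the input/output ratio drives the economics, here is a minimal cost-per-task sketch comparing two economy-tier models from the table above on a generation-heavy task. The token counts are illustrative; only the prices come from this report.

```python
def cost_per_task(in_tok: int, out_tok: int, in_price: float, out_price: float) -> float:
    """Dollar cost of one task, with prices quoted in $ per million tokens."""
    return (in_tok * in_price + out_tok * out_price) / 1_000_000

# (input $/M, output $/M) from the economy-tier table above.
coder_next = (0.12, 0.75)   # Qwen 3 Coder Next
omni       = (0.40, 2.00)   # MiMo V2 Omni

# Illustrative generation-heavy task: 2K tokens in, 8K tokens out.
a = cost_per_task(2_000, 8_000, *coder_next)   # $0.00624
b = cost_per_task(2_000, 8_000, *omni)         # $0.01680, ~2.7x more
```

Flip the ratio to 8K in / 2K out and the gap narrows, which is exactly why the input/output mix of your workload, not the headline input price, should drive the choice inside this band.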

Mid-Tier ($0.50-$3)

Mid-tier is thinner than it used to be because the ultra-low and economy bands have swallowed most of what would have been mid-tier workloads in 2025. What remains sits between roughly $0.75 and $1 on the input side: MiMo V2 Pro as the heavyweight generalist with a 1.04M context window, and Qwen 3 Max Thinking as the reasoning variant for step-by-step problem solving.

| Model | Provider | Input $/M | Output $/M | Context |
| --- | --- | --- | --- | --- |
| Qwen 3 Max Thinking | Alibaba | $0.78 | $3.90 | 262K |
| MiMo V2 Pro | Xiaomi | $1.00 | $3.00 | 1.04M |

MiMo V2 Pro is currently the #1 model on OpenRouter by volume at 4.79T weekly tokens and handles roughly a quarter of all coding tokens observed across the network. That concentration of real workload at $1/$3 tells you the mid-tier's pricing ceiling: the market has voted that reasoning-grade, 1M-context capability should not cost more than $1-$3 per million input unless the model clears a premium capability bar.

Premium Tier ($3+)

The premium tier is Anthropic and OpenAI, full stop. Claude Sonnet 4.6 at $3/$15 and Opus 4.6 at $5/$25 (via OpenRouter) have held price through Q2 despite pressure from cheaper Chinese models matching them on benchmarks. The GPT-5.4 family slots in alongside: GPT-5.4 at $2.50/$15, GPT-5.3-Codex at $1.75/$14, and GPT-5.4 Pro at the top of the market at $30/$180. Premium pricing is where capability-bound spend concentrates.

| Model | Provider | Input $/M | Output $/M | Context |
| --- | --- | --- | --- | --- |
| GPT-5.4 | OpenAI | $2.50 | $15.00 | 1.05M |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K / 1M beta |
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | 200K / 1M beta |
| GPT-5.4 Pro | OpenAI | $30.00 | $180.00 | 1.05M |

The Opus concentration problem. Claude Opus 4.6 alone drives roughly $25.1M per month in API spend, dominating Anthropic's direct API revenue mix. We unpack the revenue-geometry implications in the Anthropic cost problem analysis.

Free Tier Models

Q2 2026 has produced an unusually strong free tier. Qwen 3.6 Plus is fully free during preview with a 1M context window — and it has already climbed to the #2 position on OpenRouter by volume at 1.64T weekly tokens. NVIDIA's Nemotron 3 Super 120B and Nemotron 3 Nano 30B both ship with a free tier and 256K+ context. For agencies, these free tiers are a real infrastructure subsidy and belong in any cost plan as a fallback and experimentation route.

| Model | Provider | Cost | Context | Notes |
| --- | --- | --- | --- | --- |
| Qwen 3.6 Plus | Alibaba | Free (preview) | 1M | #2 on OpenRouter, always-on CoT, native function calling |
| Nemotron 3 Super 120B | NVIDIA | Free tier | 262K | 120B/12B active, 60.47% SWE-Bench Verified, open-source |
| Nemotron 3 Nano 30B | NVIDIA | Free tier | 256K | Open-source, compact, deployment-friendly |
| Step 3.5 Flash | StepFun | Free tier | 262K | Paid tier also available at $0.10/$0.30 |

Treat free-tier routing as an operational decision, not a cost optimization. Free tiers ship with rate limits, latency variance, and provider-side preview caveats, so the right placement is in fallback chains, background batch jobs, and development sandboxes rather than customer-facing production paths.
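A fallback chain of that shape can be sketched as follows. The `call_model` stub and its failure rate are hypothetical stand-ins for real provider calls; the point is the ordering: free tiers first, a paid model as the guaranteed floor.

```python
import random

def call_model(name: str) -> str:
    """Stand-in for a provider call. Free tiers are modeled as flaky
    (rate limits, preview caveats), so the stub fails randomly for them."""
    if name.startswith("free/") and random.random() < 0.3:
        raise RuntimeError(f"{name}: rate limited")
    return f"response from {name}"

def with_fallback(chain: list[str]) -> str:
    """Try each model in order, returning the first success."""
    last_err = None
    for name in chain:
        try:
            return call_model(name)
        except RuntimeError as err:
            last_err = err
    raise RuntimeError("all routes failed") from last_err

# Free tiers for background batch work; paid Step 3.5 Flash as the floor.
result = with_fallback([
    "free/qwen-3.6-plus",
    "free/nemotron-3-nano-30b",
    "stepfun/step-3.5-flash",
])
```

Because the paid model sits last and never rate-limits in this sketch, the chain always resolves; in production you would also attach per-hop timeouts and budget caps before declaring a route exhausted.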

90-Day Delta Analysis

The most important delta in the Q1 2026 to Q2 2026 window is what did not happen. Anthropic did not cut Sonnet or Opus pricing despite the launch of Sonnet 4.6 nudging Opus margins. OpenAI did not meaningfully reprice the GPT-5.4 family. Google held Gemini 3.1 Pro at $2/$12. The premium tier is stable, not eroding.

Where Prices Actually Moved Q1 to Q2 2026
  • Ultra-low compression continues. Qwen 3.5 Flash launched at $0.065/$0.26 with 1M context, resetting price-per-context expectations for the entire low-end market.
  • Economy tier crowding. Six distinct models now sit in the $0.12-$0.40 input band, with output pricing varying 2.5x across them for similar task quality.
  • Mid-tier shrinks. Workloads previously routed to mid-tier have migrated to either cheaper economy-tier or premium Claude Sonnet 4.6. Only MiMo V2 Pro and Qwen 3 Max Thinking retain meaningful mid-tier share.
  • Premium holds. No Anthropic or OpenAI flagship price change in Q2 2026. Capability-bound spend is not price-elastic at the premium tier.
  • Free-tier expansion. Qwen 3.6 Plus and the Nemotron 3 family added large-context free options that did not exist in Q1 2026 pricing sheets.

The strategic implication is that the pricing curve is getting more bimodal, not smoother. Cheap is getting cheaper. Premium stays premium. The middle is where agencies should be most careful about defaulting, because workload classification now routes most requests either below or above it.

Agency Cost-Routing Strategy

The single highest-leverage decision in LLM cost management is building a routing tier before picking models. The goal is simple: every query gets classified by complexity and matched to the cheapest model that can serve it at the required quality bar. Done well, this cuts API spend 60-80% versus naive single-model deployments, and it scales with every new model the ecosystem ships without requiring architectural changes.

The Four-Stage Stack

  1. Classification (Intent Analysis): Determine if the query is simple extraction, complex reasoning, or creative generation.
  2. Routing (Model Selection): Direct the query to the appropriate pricing tier (Ultra-low, Economy, Mid, Premium).
  3. Execution (Inference): Run the model and capture latency, cost, and quality metrics.
  4. Feedback (Optimization): Use the metrics to refine classification rules and routing paths.

Implementing this stack allows agencies to dynamically allocate resources. For instance, a customer support query might first be classified by a cheap Ultra-low model. If the query is complex, it is escalated to a Mid-tier or Premium model. This ensures that critical, high-value tasks receive the necessary computational power, while routine tasks are handled cost-effectively.

| Stage | Action | Tooling |
| --- | --- | --- |
| 1. Classification | Analyze query complexity and intent. | Custom classifiers, heuristics, or small LLMs. |
| 2. Routing | Map classification to model tier. | OpenRouter, custom API gateways. |
| 3. Execution | Call the selected model. | Provider APIs (Anthropic, OpenAI, Alibaba). |
| 4. Feedback | Log cost/quality, update rules. | Internal dashboards, ML observability tools. |
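The four stages above can be sketched as a single class. The keyword heuristics, tier names, and stubbed execution are illustrative placeholders; a real deployment would swap in a cheap classifier model at stage 1 and actual provider calls at stage 3.

```python
from dataclasses import dataclass, field

@dataclass
class Router:
    """Minimal sketch of the four-stage stack:
    classify -> route -> execute -> feedback. Illustrative only."""
    tier_of: dict = field(default_factory=lambda: {
        "extraction": "ultra_low",
        "planning":   "economy",
        "reasoning":  "premium",
    })
    log: list = field(default_factory=list)

    def classify(self, query: str) -> str:
        # Stage 1: keyword heuristics stand in for a cheap classifier model.
        if "why" in query or "prove" in query:
            return "reasoning"
        if "plan" in query:
            return "planning"
        return "extraction"

    def handle(self, query: str) -> str:
        intent = self.classify(query)             # 1. Classification
        tier = self.tier_of[intent]               # 2. Routing
        result = f"[{tier}] answered: {query}"    # 3. Execution (stubbed)
        self.log.append((intent, tier))           # 4. Feedback: metrics for tuning
        return result
```

The feedback log is the piece most teams skip: without per-query (intent, tier, cost, quality) records, the classification rules never improve and the 60-80% savings ceiling quoted above stays theoretical.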

