Tutorial

LLM Inference Optimization: Get More Tokens Per Dollar

3/12/2026
9 min read

Why Inference Optimization Matters

Training a model is a one-time cost. Inference is forever. A production LLM serving thousands of users daily can cost 10–100x more over its lifetime than the original training run. Optimising inference is one of the highest-leverage engineering investments you can make.

This guide covers the main levers: serving framework, quantisation, batching, and KV cache management.

Serving Framework Comparison

vLLM

vLLM (from UC Berkeley) is the current gold standard for high-throughput LLM serving. Its key innovation is **PagedAttention** — a KV cache management system inspired by virtual memory paging that dramatically reduces memory fragmentation and enables much larger batch sizes.

  • Best for: High-throughput production inference, API servers
  • Throughput: Typically 2–4x higher than naive HuggingFace serving
  • VRAM efficiency: Excellent — near-zero KV cache waste
  • Supports: GPTQ, AWQ, FP8 quantisation natively

TGI (Text Generation Inference)

HuggingFace's Text Generation Inference is a production-ready server with strong ecosystem integration. It supports continuous batching, Flash Attention, and tensor parallelism out of the box.

  • Best for: Teams already in the HuggingFace ecosystem
  • Throughput: Competitive with vLLM; sometimes better for streaming use cases
  • Docker-first: Easy deployment, strong Kubernetes support

Ollama

Ollama prioritises ease of use over maximum throughput. It runs GGUF-quantised models efficiently on CPU+GPU hybrid setups.

  • Best for: Local development, single-user inference, trying models quickly
  • Throughput: Lower than vLLM/TGI at scale
  • CPU fallback: Can run on CPU when GPU VRAM is insufficient

Quantisation Strategies

Quantisation reduces model precision to shrink VRAM usage and speed up matrix multiplications.

| Method | VRAM Saving | Speed Gain | Quality Loss |
| --- | --- | --- | --- |
| FP16 (baseline) | Baseline | Baseline | None |
| GPTQ 4-bit | ~50% | +20–40% | Minimal |
| AWQ 4-bit | ~50% | +20–40% | Slightly less than GPTQ |
| GGUF Q4_K_M | ~55% | CPU-friendly | Minimal |
| FP8 (H100+) | ~50% | +30–50% | Near-zero |

Recommendation: Use AWQ or GPTQ for production; GGUF for CPU/hybrid deployments; FP8 on H100/H200 for maximum throughput.
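As a rule of thumb, weight memory scales with bits per parameter. A minimal sketch of the arithmetic, where the 1.2x overhead factor is an assumption for runtime buffers and activations (real usage depends on framework and context length):

```python
# Rough VRAM estimate for model weights at a given quantisation level.
# The 1.2x overhead factor is an assumption, not a measured constant.

def weight_vram_gb(params_billions: float, bits_per_weight: float,
                   overhead: float = 1.2) -> float:
    """Approximate VRAM (GB) needed to hold model weights plus runtime overhead."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * bytes_per_weight * overhead

if __name__ == "__main__":
    for bits, name in [(16, "FP16"), (8, "FP8"), (4, "AWQ/GPTQ 4-bit")]:
        print(f"Llama 3 70B @ {name}: ~{weight_vram_gb(70, bits):.0f} GB")
```

This makes the table's trade-off tangible: a 70B model that needs multiple GPUs at FP16 fits on a single large-VRAM card at 4-bit.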

Batching Strategies

No batching (naive): process one request at a time, leaving most of the GPU idle. Catastrophic for throughput.

Dynamic batching: accumulate requests and process them together. Better, but the whole batch still waits for its longest sequence, so mixed-length requests waste compute.

Continuous batching (vLLM/TGI): new requests are slotted into the batch as soon as a sequence finishes. Near-optimal GPU utilisation. This is what makes vLLM so efficient: the GPU never sits idle waiting for the slowest sequence in a batch.
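The gap between batch-and-wait and continuous batching can be made concrete with a toy step-count simulation. The request lengths below are illustrative, not benchmarks; one unit is one decode step:

```python
# Toy simulation: fixed batches vs continuous batching.
# Each request needs `n` decode steps; the GPU runs up to `batch_size`
# sequences per step.

def fixed_batch_steps(lengths, batch_size):
    """Batch-and-wait: each batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_steps(lengths, batch_size):
    """Continuous batching: a pending request joins the moment a slot frees."""
    pending, active, steps = list(lengths), [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        active = [n - 1 for n in active if n > 1]  # each active seq emits one token
        steps += 1
    return steps

if __name__ == "__main__":
    lengths = [4, 1, 3, 2]  # decode steps per request (illustrative)
    print("fixed batches:", fixed_batch_steps(lengths, 2))   # 7 steps
    print("continuous:   ", continuous_steps(lengths, 2))    # 6 steps
```

Even in this tiny example continuous batching saves a step; at production scale, with hundreds of in-flight sequences of wildly different lengths, the gap widens dramatically.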

KV Cache Optimization

The KV cache stores key/value tensors for attention, growing linearly with sequence length and batch size. Poor KV cache management is the top cause of out-of-memory errors and throughput degradation.

Key settings in vLLM:

  • --gpu-memory-utilization 0.90 — give vLLM 90% of VRAM; whatever remains after model weights becomes KV cache
  • --max-model-len — cap context length to free up cache space
  • Prefix caching — cache KV tensors for shared system prompts (massive win for chatbots with long system prompts)
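KV cache pressure can be estimated with a back-of-envelope formula: two tensors (K and V) per layer, times KV heads, head dimension, and dtype width. The sketch below plugs in the published Llama 3 70B configuration (80 layers, 8 KV heads via GQA, head dimension 128) at FP16:

```python
# Back-of-envelope KV cache sizing. Architecture numbers in the demo are
# the published Llama 3 70B config; substitute your own model's values.

def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    # 2x for the separate key and value tensors at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

if __name__ == "__main__":
    per_token = kv_cache_bytes_per_token(80, 8, 128)   # Llama 3 70B, FP16 cache
    per_seq_gib = per_token * 8192 / 2**30             # one full 8192-token sequence
    print(f"{per_token} bytes/token, {per_seq_gib:.2f} GiB per 8k sequence")
```

At ~2.5 GiB of cache per full-length sequence, it is easy to see why --max-model-len and prefix caching matter so much for batch size.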

Cost per 1M Tokens: A Comparison

Running Llama 3 70B AWQ on various hardware:

| GPU | $/hr | Tokens/sec | $/1M tokens |
| --- | --- | --- | --- |
| RTX 4090 (x1) | $0.50 | ~400 | ~$0.35 |
| RTX 5090 (x1) | $0.80 | ~750 | ~$0.30 |
| H100 80GB (x1) | $2.60 | ~2200 | ~$0.33 |
| H100 80GB (x2) | $5.20 | ~4000 | ~$0.36 |

At the 70B model size, a single RTX 5090 with AWQ quantisation rivals an H100 on cost-per-token — with much lower hourly spend.
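The $/1M-tokens column is simple arithmetic: hourly price divided by tokens generated per hour, scaled to one million. A sketch you can reuse with your own provider quotes:

```python
# Derivation of the cost-per-million-tokens column above.

def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Dollars spent to generate one million tokens at a sustained rate."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

if __name__ == "__main__":
    print(f"RTX 4090: ${cost_per_million_tokens(0.50, 400):.2f}/1M tokens")
    print(f"RTX 5090: ${cost_per_million_tokens(0.80, 750):.2f}/1M tokens")
```

Note the implicit assumption: the GPU stays saturated. At low utilisation, the cheaper hourly card wins by an even wider margin.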

Practical Quickstart with vLLM

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --enable-prefix-caching
```

This single command gets you an OpenAI-compatible API endpoint with continuous batching, AWQ quantisation, and prefix caching enabled.
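Once the server is up, any OpenAI-compatible client can talk to it. A standard-library-only sketch; the base URL and model name are assumptions matching the command above, so adjust them to your deployment:

```python
# Minimal client for the OpenAI-compatible endpoint started above.
# Base URL and model name are assumptions; change to match your server.
import json
import urllib.request

def build_chat_request(prompt: str, model: str, max_tokens: int = 256) -> dict:
    """Payload for the /v1/chat/completions route the server exposes."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    payload = build_chat_request(prompt, "meta-llama/Meta-Llama-3-70B-Instruct")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Explain continuous batching in one sentence."))
```

Because the endpoint speaks the OpenAI wire format, the official openai SDK also works against it by pointing its base URL at the server.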

Find the best GPU for your inference workload →

Daniel Santos

Founder & ML Engineer

Building GPU price comparison tools since 2024. Previously trained LLMs at scale for fintech startups in São Paulo. Obsessed with finding the best $/TFLOP ratios across cloud providers.

GPU Cloud LLM Training Cost Optimization MLOps

Ready to save?

Compare GPU cloud prices and find the best provider for your use case.

Start Comparing

Related Articles

Tutorial

Multi-GPU Training: Setup Guide for Beginners

Learn how to distribute your training across multiple GPUs. Step-by-step tutorial covering PyTorch DDP, DeepSpeed, and cloud multi-GPU setups.

3/13/2026 14 min
Read More
Tutorial

PyTorch Distributed Training on Cloud GPUs: Complete Guide

Complete guide to DDP setup, torchrun commands, multi-node on RunPod, gradient checkpointing, mixed precision, and debugging distributed training jobs.

3/10/2026 11 min
Read More
Guide

Cheapest GPU Cloud Providers in 2026

A comprehensive ranking of the most affordable GPU cloud providers in 2026. Find the lowest prices for H100, A100, RTX 4090, and more.

3/16/2026 10 min
Read More