LLM Inference Optimization: Get More Tokens Per Dollar
Why Inference Optimization Matters
Training a model is a one-time cost. Inference is forever. A production LLM serving thousands of users daily can cost 10–100x more over its lifetime than the original training run. Optimising inference is one of the highest-leverage engineering investments you can make.
This guide covers the main levers: serving framework, quantisation, batching, and KV cache management.
Serving Framework Comparison
vLLM
vLLM (from UC Berkeley) is the current gold standard for high-throughput LLM serving. Its key innovation is **PagedAttention** — a KV cache management system inspired by virtual memory paging that dramatically reduces memory fragmentation and enables much larger batch sizes.
- Best for: High-throughput production inference, API servers
- Throughput: Typically 2–4x higher than naive HuggingFace serving
- VRAM efficiency: Excellent — near-zero KV cache waste
- Supports: GPTQ, AWQ, FP8 quantisation natively
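To make PagedAttention concrete, here is a toy sketch of the core bookkeeping (illustrative only, not vLLM's actual implementation; the block size and class names are made up): each sequence gets a block table that maps its logical token positions to small fixed-size physical KV blocks, so memory is allocated page by page instead of as one large contiguous buffer per sequence.

```python
# Toy sketch of paged KV cache bookkeeping (illustrative, not vLLM's real code).
# Each sequence owns a "block table": a list of physical block IDs, allocated
# on demand in fixed-size pages instead of one contiguous per-sequence buffer.

BLOCK_SIZE = 16  # tokens per physical KV block (vLLM uses a similarly small page size)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # shared pool of physical blocks
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.seq_lens = {}                          # seq_id -> tokens written so far

    def append_token(self, seq_id: int) -> tuple[int, int]:
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                # current page full: grab a new one
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id: int) -> None:
        """Return all pages of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=1024)
for _ in range(40):                                 # 40 tokens -> ceil(40/16) = 3 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))                   # 3
```

Because freed pages go straight back to a shared pool, a finished sequence's memory is immediately reusable by any other request, which is what keeps fragmentation near zero and batch sizes high.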
TGI (Text Generation Inference)
HuggingFace's Text Generation Inference is a production-ready server with strong ecosystem integration. It supports continuous batching, Flash Attention, and tensor parallelism out of the box.
- Best for: Teams already in the HuggingFace ecosystem
- Throughput: Competitive with vLLM; sometimes better for streaming use cases
- Docker-first: Easy deployment, strong Kubernetes support
Ollama
Ollama prioritises ease of use over maximum throughput. It runs GGUF-quantised models efficiently on CPU+GPU hybrid setups.
- Best for: Local development, single-user inference, trying models quickly
- Throughput: Lower than vLLM/TGI at scale
- CPU fallback: Can run on CPU when GPU VRAM is insufficient
Quantisation Strategies
Quantisation reduces model precision to shrink VRAM usage and speed up matrix multiplications.
| Method | VRAM Saving | Speed Gain | Quality Loss |
|---|---|---|---|
| FP16 (baseline) | — | — | None |
| GPTQ 4-bit | ~50% | +20–40% | Minimal |
| AWQ 4-bit | ~50% | +20–40% | Slightly less than GPTQ |
| GGUF Q4_K_M | ~55% | CPU-friendly | Minimal |
| FP8 (H100+) | ~50% | +30–50% | Near-zero |
Recommendation: Use AWQ or GPTQ for production; GGUF for CPU/hybrid deployments; FP8 on H100/H200 for maximum throughput.
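For intuition about why 4-bit weights cost so little quality, here is a simplified per-group quantisation round trip (a generic symmetric scheme, not the actual GPTQ or AWQ algorithm, which use calibration data to choose scales more carefully):

```python
import numpy as np

# Simplified symmetric per-group 4-bit quantisation (illustrative only).
def quantize_4bit(weights: np.ndarray, group_size: int = 128):
    groups = weights.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7  # int4 range: -8..7
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(4096 * 128).astype(np.float32)
q, scales = quantize_4bit(w)
w_hat = dequantize(q, scales)

# Storage drops from 16 bits to ~4 bits per weight (plus a small per-group scale);
# the mean reconstruction error stays small, which is why quality loss is minimal.
print("relative error:", np.abs(w - w_hat).mean() / np.abs(w).mean())
```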
Batching Strategies
No batching (naive): process one request at a time. Catastrophic for throughput.
Static/dynamic batching: accumulate requests and process them together as a fixed batch. Better, but the whole batch waits for its longest sequence, so requests of different lengths waste compute.
Continuous batching (vLLM/TGI): new requests are slotted into the batch as soon as a sequence finishes. Near-optimal GPU utilisation. This is what makes vLLM so efficient — never wait for the slowest sequence in a batch.
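A toy decode-only simulation (hypothetical request lengths, batch capacity of 4) makes the gap visible: the fixed batch waits on its longest sequence, while continuous batching backfills freed slots on the very next step.

```python
# Toy decode-step simulation: 8 requests, batch capacity 4 (illustrative numbers).
lengths = [30, 200, 45, 180, 25, 60, 150, 40]  # output tokens per request
CAPACITY = 4

def static_batching_steps(lengths):
    # Fixed batches of 4; each batch runs until its longest sequence finishes.
    return sum(max(lengths[i:i + CAPACITY]) for i in range(0, len(lengths), CAPACITY))

def continuous_batching_steps(lengths):
    # Whenever a sequence finishes, the next request is slotted in on the next step.
    pending = list(lengths)
    running, steps = [], 0
    while pending or running:
        while pending and len(running) < CAPACITY:
            running.append(pending.pop(0))
        steps += 1
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

print("static:", static_batching_steps(lengths))        # 350 steps (200 + 150)
print("continuous:", continuous_batching_steps(lengths))  # 205 steps, ~1.7x fewer
```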
KV Cache Optimization
The KV cache stores key/value tensors for attention, growing linearly with sequence length and batch size. Poor KV cache management is the top cause of out-of-memory errors and throughput degradation.
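A quick back-of-the-envelope calculation shows why. Assuming Llama 3 70B's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128) and an FP16 cache:

```python
# Rough KV cache footprint for Llama 3 70B (80 layers, 8 KV heads via GQA,
# head_dim 128), assuming an FP16 cache. Figures are approximate.
num_layers, num_kv_heads, head_dim, bytes_per_elem = 80, 8, 128, 2

# Factor of 2 for keys and values
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
print(f"{bytes_per_token / 1024:.0f} KiB per token")        # ~320 KiB

# A batch of 32 sequences at 8k context each:
batch, seq_len = 32, 8192
print(f"{batch * seq_len * bytes_per_token / 1e9:.0f} GB")  # ~86 GB of KV cache
```

Roughly 86 GB for a modest batch at 8k context already exceeds a single H100's VRAM, which is exactly the pressure PagedAttention and the settings below are designed to manage.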
Key settings in vLLM:
- --gpu-memory-utilization 0.90: let vLLM use 90% of GPU memory for weights plus KV cache (whatever remains after loading weights becomes cache space)
- --max-model-len: cap the context length to free up cache space
- --enable-prefix-caching: cache KV blocks for shared prompt prefixes (a massive win for chatbots with long system prompts)
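The same knobs exist on vLLM's Python API if you embed the engine directly; a minimal sketch (model choice and values are just examples matching the quickstart below):

```python
from vllm import LLM, SamplingParams

# Offline/embedded equivalent of the server flags above (values are examples).
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    quantization="awq",
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may use (weights + KV cache)
    max_model_len=8192,           # a smaller context cap leaves more room for KV blocks
    enable_prefix_caching=True,   # reuse KV blocks for repeated prompt prefixes
)

outputs = llm.generate(
    ["Summarise PagedAttention in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```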
Cost per 1M Tokens: A Comparison
Running Llama 3 70B AWQ on various hardware:
| GPU | $/hr | Tokens/sec | $/1M tokens |
|---|---|---|---|
| RTX 4090 (x1) | $0.50 | ~400 | ~$0.35 |
| RTX 5090 (x1) | $0.80 | ~750 | ~$0.30 |
| H100 80GB (x1) | $2.60 | ~2200 | ~$0.33 |
| H100 80GB (x2) | $5.20 | ~4000 | ~$0.36 |
At the 70B model size, a single RTX 5090 with AWQ quantisation rivals an H100 on cost-per-token — with much lower hourly spend.
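The arithmetic behind the table is straightforward; a small helper (using the price and throughput figures listed above) lets you plug in your own numbers:

```python
# Cost per million generated tokens from hourly price and sustained throughput.
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

print(cost_per_million_tokens(0.80, 750))    # RTX 5090: ~$0.30
print(cost_per_million_tokens(2.60, 2200))   # H100:     ~$0.33
```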
Practical Quickstart with vLLM
```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --enable-prefix-caching
```
This single command gets you an OpenAI-compatible API endpoint with continuous batching, AWQ quantisation, and prefix caching enabled.
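Once the server is running, any OpenAI-compatible client can talk to it. For example, with the official openai Python package (the base URL assumes vLLM's default port 8000, and the API key is a placeholder since none is required by default):

```python
from openai import OpenAI

# vLLM's server speaks the OpenAI API; point the client at it instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Give me three KV cache optimisation tips."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```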