LLM Inference Optimization: How to Maximize Tokens Per Dollar
Why Inference Optimization Matters
Training a model incurs a one-time cost, but inference is an ongoing expense. Over its lifetime, a production LLM serving thousands of users daily can cost 10–100x more than the initial training run. Optimizing inference is therefore one of the highest-impact engineering investments available.
This guide explores the primary levers for optimization: serving frameworks, quantization, batching, and KV cache management.
Serving Framework Comparison
vLLM
Originating from UC Berkeley, vLLM is currently the gold standard for high-throughput LLM serving. Its standout feature is **PagedAttention**, a KV cache management system inspired by virtual-memory paging that sharply reduces memory fragmentation and enables much larger batch sizes.
TGI (Text Generation Inference)
HuggingFace's Text Generation Inference (TGI) is a production-ready server with deep ecosystem integration. It ships with continuous batching, Flash Attention, and tensor parallelism out of the box.
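If you already use the HuggingFace stack, the official Docker image is the quickest way to try it. A minimal sketch (the model ID and image tag here are illustrative; check the TGI docs for current flags):

```bash
# Serve a Llama 3 model with TGI's official container
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct
```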
Ollama
Ollama prioritizes simplicity and ease of use over maximum throughput. It efficiently runs GGUF-quantized models on hybrid CPU+GPU setups.
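Getting a quantized model running locally is a one-liner. A minimal sketch (Ollama's default library tags are 4-bit GGUF builds; exact tags vary by model):

```bash
# Pull and run Llama 3 8B, then query the local HTTP API (port 11434 by default)
ollama run llama3:8b "Explain PagedAttention in one sentence."

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3:8b", "prompt": "Explain PagedAttention in one sentence."}'
```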
Quantization Strategies
Quantization lowers model precision to reduce VRAM usage and accelerate matrix multiplications.
| Method | VRAM Saving | Speed Gain | Quality Loss |
|--------|-------------|------------|--------------|
| FP16 (baseline) | — | — | None |
| GPTQ 4-bit | ~50% | +20–40% | Minimal |
| AWQ 4-bit | ~50% | +20–40% | Slightly less than GPTQ |
| GGUF Q4_K_M | ~55% | CPU-friendly | Minimal |
| FP8 (H100+) | ~50% | +30–50% | Near-zero |
**Recommendation:** Opt for AWQ or GPTQ in production; use GGUF for CPU/hybrid deployments; leverage FP8 on H100/H200 for peak throughput.
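To make the VRAM column concrete: a 70B-parameter model needs roughly 140 GB for weights at FP16 (70B × 2 bytes), but only about 35–40 GB at 4-bit (70B × ~0.5 bytes plus overhead). That is the difference between requiring multiple 80 GB data-center GPUs and fitting, tightly, on a single 48 GB card.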
Batching Strategies
**Static batching** (naive): requests are processed one at a time or in fixed groups, and the whole group must finish before the next one starts. Disastrous for throughput.
**Dynamic batching**: accumulates incoming requests and processes them as a group. Better, but requests of different lengths finish at different times, so the batch still waits on its longest sequence and wastes compute.
**Continuous batching** (vLLM/TGI): new requests are slotted into the running batch as soon as any sequence completes, keeping GPU utilization near-optimal. This mechanism is key to vLLM's efficiency: no request waits for the slowest sequence in its batch. You can see the effect by firing concurrent requests at a running server, as in the sketch below.
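A minimal sketch against the OpenAI-compatible endpoint from the quickstart later in this guide (port 8000 and the model name are assumptions that must match your server):

```bash
# Fire 8 concurrent requests; with continuous batching, short generations
# come back as soon as they finish instead of waiting for the whole batch.
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Meta-Llama-3-70B-Instruct", "prompt": "Write a haiku about GPUs.", "max_tokens": 64}' &
done
wait
```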
KV Cache Optimization
The KV cache stores key/value tensors for attention mechanisms, expanding linearly with sequence length and batch size. Inefficient KV cache management is the leading cause of out-of-memory errors and throughput degradation.
Key settings in vLLM:
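The flags below most directly control how much memory the cache can use and how it is reused (flag names as of recent vLLM releases; check `--help` for your version):

```bash
--gpu-memory-utilization 0.90   # fraction of VRAM vLLM may claim; most of it becomes KV cache blocks
--max-model-len 8192            # caps context length, bounding the per-sequence KV cache footprint
--max-num-seqs 256              # upper bound on concurrently running sequences (batch width)
--enable-prefix-caching         # reuses KV blocks for prompts that share a common prefix
--kv-cache-dtype fp8            # stores the cache in FP8 to roughly halve its size (newer GPUs)
```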
Cost per 1M Tokens: A Comparison
Running Llama 3 70B AWQ on various hardware:
| GPU | $/hr | Tokens/sec | $/1M tokens |
|-----|------|-----------|-------------|
| RTX 4090 (x1) | $0.50 | ~400 | ~$0.35 |
| RTX 5090 (x1) | $0.80 | ~750 | ~$0.30 |
| H100 80GB (x1) | $2.60 | ~2200 | ~$0.33 |
| H100 80GB (x2) | $5.20 | ~4000 | ~$0.36 |
At the 70B scale, a single RTX 5090 with AWQ quantization is competitive with an H100 on cost per token, at a far lower hourly rate.
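These figures follow directly from hourly price and sustained throughput: cost per 1M tokens = hourly price ÷ (tokens/sec × 3,600) × 1,000,000. For the single H100, that is $2.60 ÷ (2,200 × 3,600) × 1,000,000 ≈ $0.33; for the RTX 5090, $0.80 ÷ (750 × 3,600) × 1,000,000 ≈ $0.30.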
Practical Quickstart with vLLM
```bash
pip install vllm

# Note: --quantization awq expects pre-quantized weights, so point --model at an
# AWQ export of the checkpoint rather than the FP16 repo shown here.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --enable-prefix-caching
```
This single command deploys an OpenAI-compatible API endpoint featuring continuous batching, AWQ quantization, and prefix caching.
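Once the server is up, any OpenAI-style client can talk to it. A minimal sketch with curl (port 8000 is vLLM's default; the model name must match what you served):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
        "messages": [{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
        "max_tokens": 128
      }'
```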
Daniel Santos
Founder & ML Engineer
Building GPU price comparison tools since 2024. Previously trained LLMs at scale for fintech startups in São Paulo. Obsessed with finding the best $/TFLOP ratios across cloud providers.