LLM Inference Optimization: How to Maximize Tokens Per Dollar
Why Inference Optimization Matters
Training a model incurs a one-time cost, but inference is an ongoing expense. Over its lifetime, a production LLM serving thousands of users daily can cost 10–100x more than the initial training run. Optimizing inference is therefore one of the highest-impact engineering investments available.
This guide explores the primary levers for optimization: serving frameworks, quantization, batching, and KV cache management.
Serving Framework Comparison
vLLM
Originating from UC Berkeley, vLLM is currently the gold standard for high-throughput LLM serving. Its standout feature is **PagedAttention**, a KV cache management system inspired by virtual-memory paging that sharply reduces memory fragmentation and enables much larger batch sizes.
TGI (Text Generation Inference)
HuggingFace's Text Generation Inference (TGI) is a production-ready server with deep ecosystem integration. It ships with continuous batching, Flash Attention, and tensor parallelism out of the box.
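If you already use the HuggingFace stack, the official Docker image is the quickest way to try it. A minimal sketch (the model ID and image tag here are illustrative; check the TGI docs for current flags):

```bash
# Serve a Llama 3 model with TGI's official container
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct
```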
Ollama
Ollama prioritizes simplicity and ease of use over maximum throughput. It efficiently runs GGUF-quantized models on hybrid CPU+GPU setups.
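Getting a quantized model running locally is a one-liner. A minimal sketch (Ollama's default library tags are 4-bit GGUF builds; exact tags vary by model):

```bash
# Pull and run Llama 3 8B, then query the local HTTP API (port 11434 by default)
ollama run llama3:8b "Explain PagedAttention in one sentence."

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3:8b", "prompt": "Explain PagedAttention in one sentence."}'
```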
Quantization Strategies
Quantization lowers model precision to reduce VRAM usage and accelerate matrix multiplications.
| Method | VRAM Saving | Speed Gain | Quality Loss |
|--------|-------------|------------|--------------|
| FP16 (baseline) | — | — | None |
| GPTQ 4-bit | ~50% | +20–40% | Minimal |
| AWQ 4-bit | ~50% | +20–40% | Slightly less than GPTQ |
| GGUF Q4_K_M | ~55% | CPU-friendly | Minimal |
| FP8 (H100+) | ~50% | +30–50% | Near-zero |
**Recommendation:** Opt for AWQ or GPTQ in production; use GGUF for CPU/hybrid deployments; leverage FP8 on H100/H200 for peak throughput.
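To make the VRAM column concrete: a 70B-parameter model needs roughly 140 GB for weights at FP16 (70B × 2 bytes), but only about 35–40 GB at 4-bit (70B × ~0.5 bytes plus overhead). That is the difference between requiring multiple 80 GB data-center GPUs and fitting, tightly, on a single 48 GB card.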
Batching Strategies
**Static batching** (naive): requests are processed one at a time or in fixed groups, and the whole group must finish before the next one starts. Disastrous for throughput.
**Dynamic batching**: accumulates incoming requests and processes them as a group. Better, but requests of different lengths finish at different times, so the batch still waits on its longest sequence and wastes compute.
**Continuous batching** (vLLM/TGI): new requests are slotted into the running batch as soon as any sequence completes, keeping GPU utilization near-optimal. This mechanism is key to vLLM's efficiency: no request waits for the slowest sequence in its batch. You can see the effect by firing concurrent requests at a running server, as in the sketch below.
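A minimal sketch against the OpenAI-compatible endpoint from the quickstart later in this guide (port 8000 and the model name are assumptions that must match your server):

```bash
# Fire 8 concurrent requests; with continuous batching, short generations
# come back as soon as they finish instead of waiting for the whole batch.
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Meta-Llama-3-70B-Instruct", "prompt": "Write a haiku about GPUs.", "max_tokens": 64}' &
done
wait
```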
KV Cache Optimization
The KV cache stores key/value tensors for attention mechanisms, expanding linearly with sequence length and batch size. Inefficient KV cache management is the leading cause of out-of-memory errors and throughput degradation.
Key settings in vLLM:
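The flags below most directly control how much memory the cache can use and how it is reused (flag names as of recent vLLM releases; check `--help` for your version):

```bash
--gpu-memory-utilization 0.90   # fraction of VRAM vLLM may claim; most of it becomes KV cache blocks
--max-model-len 8192            # caps context length, bounding the per-sequence KV cache footprint
--max-num-seqs 256              # upper bound on concurrently running sequences (batch width)
--enable-prefix-caching         # reuses KV blocks for prompts that share a common prefix
--kv-cache-dtype fp8            # stores the cache in FP8 to roughly halve its size (newer GPUs)
```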
Cost per 1M Tokens: A Comparison
Running Llama 3 70B AWQ on various hardware:
| GPU | $/hr | Tokens/sec | $/1M tokens |
|-----|------|-----------|-------------|
| RTX 4090 (x1) | $0.50 | ~400 | ~$0.35 |
| RTX 5090 (x1) | $0.80 | ~750 | ~$0.30 |
| H100 80GB (x1) | $2.60 | ~2200 | ~$0.33 |
| H100 80GB (x2) | $5.20 | ~4000 | ~$0.36 |
At the 70B scale, a single RTX 5090 with AWQ quantization is competitive with an H100 on cost per token, at a far lower hourly rate.
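These figures follow directly from hourly price and sustained throughput: cost per 1M tokens = hourly price ÷ (tokens/sec × 3,600) × 1,000,000. For the single H100, that is $2.60 ÷ (2,200 × 3,600) × 1,000,000 ≈ $0.33; for the RTX 5090, $0.80 ÷ (750 × 3,600) × 1,000,000 ≈ $0.30.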
Practical Quickstart with vLLM
```bash
pip install vllm

# Note: --quantization awq expects pre-quantized weights, so point --model at an
# AWQ export of the checkpoint rather than the FP16 repo shown here.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --quantization awq \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --enable-prefix-caching
```
This single command deploys an OpenAI-compatible API endpoint featuring continuous batching, AWQ quantization, and prefix caching.
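Once the server is up, any OpenAI-style client can talk to it. A minimal sketch with curl (port 8000 is vLLM's default; the model name must match what you served):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
        "messages": [{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
        "max_tokens": 128
      }'
```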
Daniel Santos
Founder & ML Engineer
Building GPU price comparison tools since 2024. Previously trained LLMs at scale for fintech startups in São Paulo. Obsessed with finding the best $/TFLOP ratios across cloud providers.