Tutorial

LLM Inference Optimization: How to Maximize Tokens Per Dollar

3/12/2026
9 min read

Why Inference Optimization Matters

Training a model incurs a one-time cost, but inference is an ongoing expense. Over its lifetime, a production LLM serving thousands of users daily can cost 10–100x more than the initial training run. Optimizing inference is therefore one of the highest-impact engineering investments available.

This guide explores the primary levers for optimization: serving frameworks, quantization, batching, and KV cache management.

Serving Framework Comparison

vLLM

Originating from UC Berkeley, vLLM is currently the gold standard for high-throughput LLM serving. Its standout feature is **PagedAttention**—a KV cache management system inspired by virtual memory paging that significantly curbs memory fragmentation and facilitates much larger batch sizes.

  • Best for: High-throughput production inference, API servers
  • Throughput: Typically 2–4x higher than naive HuggingFace serving
  • VRAM efficiency: Excellent — near-zero KV cache waste
  • Supports: GPTQ, AWQ, FP8 quantization natively
TGI (Text Generation Inference)

HuggingFace's Text Generation Inference (TGI) is a production-ready server with strong ecosystem integration. It ships with continuous batching, Flash Attention, and tensor parallelism out of the box.

  • Best for: Teams already deeply integrated in the HuggingFace ecosystem
  • Throughput: Competitive with vLLM; occasionally better for streaming use cases
  • Docker-first: Simple deployment, strong Kubernetes support
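
Deployment is a single container; a minimal sketch (the model ID is illustrative, and gated models additionally need a HuggingFace token in the container environment, e.g. `-e HF_TOKEN=...`):

```bash
# Run TGI and expose its HTTP API on localhost:8080; model weights
# are cached in ./tgi-data between restarts.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/tgi-data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct
```
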
Ollama

Ollama prioritizes simplicity and ease of use over maximum throughput. It runs GGUF-quantized models efficiently on hybrid CPU+GPU setups.

  • Best for: Local development, single-user inference, rapid model testing
  • Throughput: Lower than vLLM/TGI at scale
  • CPU fallback: Can operate on CPU when GPU VRAM is insufficient
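
The whole workflow is two commands, plus an optional local API (the tags below are illustrative; check the Ollama model library for exact names):

```bash
ollama pull llama3    # fetches a GGUF-quantized build of the model
ollama run llama3     # interactive chat in the terminal

# Ollama also serves an HTTP API on port 11434:
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?"}'
```
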
Quantization Strategies

Quantization lowers model precision to reduce VRAM usage and accelerate matrix multiplications.

| Method | VRAM Saving | Speed Gain | Quality Loss |
|--------|-------------|------------|--------------|
| FP16 (baseline) | — | — | None |
| GPTQ 4-bit | ~50% | +20–40% | Minimal |
| AWQ 4-bit | ~50% | +20–40% | Slightly less than GPTQ |
| GGUF Q4_K_M | ~55% | CPU-friendly | Minimal |
| FP8 (H100+) | ~50% | +30–50% | Near-zero |

**Recommendation:** Opt for AWQ or GPTQ in production; use GGUF for CPU/hybrid deployments; leverage FP8 on H100/H200 for peak throughput.
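
As a rough sanity check on the VRAM column: weight memory alone scales with bits per parameter, while the KV cache and activations stay at higher precision, which is why total savings land nearer 50% than the 75% the weights alone would suggest.

$$
\text{weight VRAM} \approx N_{\text{params}} \times \frac{\text{bits per param}}{8}
$$

For Llama 3 70B that is roughly $70 \times 10^9 \times 2\,\text{bytes} \approx 140$ GB at FP16 versus about 35 GB at 4-bit, before quantization scales and zero-points add a few percent back.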

Batching Strategies

**Static batching** (naive): processes one request, or one fixed batch, at a time, and waits for every sequence in it to finish before accepting new work. Disastrous for throughput.

**Dynamic batching**: accumulates requests and processes them as a group. Better, but varying request lengths still lead to wasted compute.

**Continuous batching** (vLLM/TGI): new requests are slotted into the batch as soon as any sequence completes, keeping GPU utilization near-optimal. This mechanism is key to vLLM's efficiency, since no request waits for the slowest sequence in a batch.
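
To see the effect yourself, fire several requests at once against a server with continuous batching enabled (a sketch, assuming the vLLM server from the quickstart below is already running on localhost:8000):

```bash
# Send 8 completion requests concurrently; continuous batching merges
# them into one rolling batch, so total wall time stays close to that
# of a single long request rather than 8x.
for i in $(seq 1 8); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Meta-Llama-3-70B-Instruct", "prompt": "Write a haiku about GPUs.", "max_tokens": 64}' \
    -o "out_$i.json" &
done
wait
```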

KV Cache Optimization

The KV cache stores key/value tensors for attention mechanisms, expanding linearly with sequence length and batch size. Inefficient KV cache management is the leading cause of out-of-memory errors and throughput degradation.
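
A back-of-the-envelope estimate shows why (assuming an FP16 cache and Llama 3 70B's published shape: 80 layers, 8 KV heads via GQA, head dimension 128):

$$
\text{KV bytes per token} = 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{bytes per value}
$$

That gives $2 \times 80 \times 8 \times 128 \times 2 \approx 328$ KB per token, so a single 8,192-token sequence occupies roughly 2.7 GB of VRAM before any batching.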

Key settings in vLLM:

  • `--gpu-memory-utilization 0.90` — let vLLM use 90% of VRAM for model weights plus KV cache
  • `--max-model-len` — cap context length to conserve cache space
  • `--enable-prefix-caching` — cache KV tensors for shared system prompts (a massive advantage for chatbots with lengthy system prompts)
Cost per 1M Tokens: A Comparison

Running Llama 3 70B AWQ on various hardware:

| GPU | $/hr | Tokens/sec | $/1M tokens |
|-----|------|-----------|-------------|
| RTX 4090 (x1) | $0.50 | ~400 | ~$0.35 |
| RTX 5090 (x1) | $0.80 | ~750 | ~$0.30 |
| H100 80GB (x1) | $2.60 | ~2,200 | ~$0.33 |
| H100 80GB (x2) | $5.20 | ~4,000 | ~$0.36 |

At the 70B scale, a single RTX 5090 utilizing AWQ quantization competes with an H100 in cost-per-token efficiency — while significantly reducing hourly expenditure.
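
The last column is pure arithmetic, so you can re-derive it for any provider's pricing:

$$
\$/\text{1M tokens} = \frac{\text{hourly price}}{\text{tokens/sec} \times 3600} \times 10^6
$$

For the RTX 5090 row: $0.80 / (750 \times 3600) \times 10^6 \approx \$0.30$.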

Practical Quickstart with vLLM

```bash
pip install vllm

# Note: --quantization awq expects an AWQ-quantized checkpoint, so point
# --model at an AWQ build of the model rather than the FP16 weights.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --quantization awq \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --enable-prefix-caching
```

This single command deploys an OpenAI-compatible API endpoint featuring continuous batching, AWQ quantization, and prefix caching.
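
From there, any OpenAI-compatible client works; for example, a raw curl call against the default port 8000 (the model name must match whatever was passed to `--model`):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Meta-Llama-3-70B-Instruct",
        "messages": [{"role": "user", "content": "Explain prefix caching in one paragraph."}],
        "max_tokens": 256
      }'
```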



Daniel Santos

Founder & ML Engineer

Building GPU price comparison tools since 2024. Previously trained LLMs at scale for fintech startups in São Paulo. Obsessed with finding the best $/TFLOP ratios across cloud providers.

Tags: GPU Cloud · LLM Training · Cost Optimization · MLOps
