:::tip 🎮 Interactive Playground Visualize this concept: Try the Inference Cost Explorer demo on the EngineersOfAI Playground - no code required. :::
The $80K Monthly LLM Bill
The AI writing assistant had been in production for six months, and user growth was everything the product team hoped for. Then the infrastructure bill arrived: $82,000 for the month - $2.73 per day per active user. The unit economics were catastrophic. At that rate, monetization required charging users $5–8/month just to break even on compute, before touching engineering salaries, infrastructure overhead, or any margin.
The model was GPT-4 via the OpenAI API. Every user message triggered a call with 3,000–4,000 tokens of context (system prompt + conversation history + retrieved documents), and responses averaged 800 tokens. At GPT-4 pricing - $0.03/1K input tokens, $0.06/1K output tokens - the math was brutal: each user interaction cost roughly $0.14. The product's most engaged users were triggering 20+ interactions per day.
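A quick back-of-the-envelope sketch of that math, using the token counts and prices quoted above:

```python
# Back-of-the-envelope interaction cost at GPT-4 API pricing (figures from the scenario above)
INPUT_PRICE_PER_1K = 0.03    # $/1K input tokens
OUTPUT_PRICE_PER_1K = 0.06   # $/1K output tokens

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single request/response pair in dollars."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

cost = interaction_cost(input_tokens=3_000, output_tokens=800)
print(f"Per interaction: ${cost:.3f}")              # $0.138, roughly $0.14
print(f"Per heavy user per day: ${cost * 20:.2f}")  # $2.76 at 20 interactions/day
```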
The engineering team had four levers to pull: model selection, request caching, context compression, and batching. None of them individually solved the problem. Applied systematically over 6 weeks, they reduced the monthly bill to $18,500 - a 77% reduction with no perceptible quality degradation for 94% of use cases. The remaining 6% of complex queries still used GPT-4, but now only when genuinely needed.
This lesson teaches that systematic process. Inference cost optimization is not about finding one magic trick - it is about understanding the economics at each layer and applying the right tool to the right problem.
Why Inference Cost is Different
Training is a one-time cost per model version. Inference is a recurring cost that scales with usage - which means it grows as your product succeeds. This creates a peculiar challenge: the better your product does, the more expensive it gets to run. This is the opposite of most software economics, where more users means lower per-user cost due to infrastructure sharing.
The reason ML inference doesn't have the same economies of scale as traditional software: the cost is model-bound, not infrastructure-bound. A traditional web server can handle 10× more requests by horizontally scaling cheap instances. An LLM serving system can handle 10× more requests, but each request still costs roughly the same because the bottleneck is GPU memory bandwidth consumed per token - a physical constraint that doesn't improve with horizontal scale.
This means cost optimization must happen at the model and request level, not just the infrastructure level.
The Inference Cost Stack
Before optimizing, understand what you're actually paying for:
For API-based LLMs (OpenAI, Anthropic), cost is primarily token-based. For self-hosted models, cost is primarily compute-based. The optimization strategies differ accordingly.
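A minimal sketch of the two accounting models; the prices and throughput below are illustrative assumptions, not vendor quotes:

```python
# Two ways to account for inference cost: per-token (API) vs per-GPU-hour (self-hosted).
# All numbers below are illustrative assumptions.

def api_cost_per_request(input_tokens: int, output_tokens: int,
                         in_price_per_1k: float, out_price_per_1k: float) -> float:
    """API cost scales directly with the tokens processed."""
    return input_tokens / 1000 * in_price_per_1k + output_tokens / 1000 * out_price_per_1k

def self_hosted_cost_per_request(gpu_hourly_cost: float, requests_per_second: float) -> float:
    """Self-hosted cost is GPU time amortized over the requests it serves."""
    return gpu_hourly_cost / (requests_per_second * 3600)

print(api_cost_per_request(3000, 800, 0.03, 0.06))   # ~$0.138 at GPT-4-style prices
print(self_hosted_cost_per_request(3.06, 50))        # ~$0.000017 on a $3.06/hr GPU at 50 req/s
```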
Lever 1: Model Selection and Routing
The highest-leverage optimization is not using a powerful model when a weaker one is sufficient. This is called model routing or model cascading.
The Cost-Quality Pareto Frontier
Different models occupy different points on the cost-quality curve:
| Model | Input Cost/1K | Output Cost/1K | Relative Quality |
|---|---|---|---|
| GPT-4 Turbo | $0.01 | $0.03 | 100% (baseline) |
| GPT-4o | $0.005 | $0.015 | 98% |
| GPT-3.5 Turbo | $0.0005 | $0.0015 | 82% |
| Llama 3 8B (self-hosted) | $0.00008 | $0.00008 | 74% |
| Llama 3 70B (self-hosted) | $0.0004 | $0.0004 | 90% |
For many real-world tasks, GPT-3.5 or Llama 3 70B performs equivalently to GPT-4. The key is measuring quality per task category, not assuming you need the best model everywhere.
Implementing Model Routing
from enum import Enum
from typing import Optional
import re
class ModelTier(Enum):
FAST = "gpt-3.5-turbo" # simple tasks, low cost
BALANCED = "gpt-4o-mini" # most tasks, moderate cost
POWERFUL = "gpt-4-turbo" # complex tasks, high cost
def classify_request_complexity(
user_message: str,
conversation_history: list[dict],
max_history_turns: int = 10,
) -> ModelTier:
"""
Route requests to the appropriate model tier based on complexity.
This is a rule-based approach; can be replaced with a lightweight
classifier trained on (request, quality_outcome) pairs.
"""
msg_lower = user_message.lower()
# Signals of high complexity - route to powerful model
high_complexity_signals = [
len(conversation_history) > max_history_turns, # deep context
any(w in msg_lower for w in ["analyze", "compare", "critique", "evaluate"]),
len(user_message) > 500, # complex question
"code" in msg_lower and "debug" in msg_lower, # code debugging
bool(re.search(r'\d+[\+\-\*\/]\d+', user_message)), # math operations
]
# Signals of simple tasks - route to fast model
low_complexity_signals = [
len(user_message) < 50, # short message
any(w in msg_lower for w in ["summarize", "translate", "format"]),
user_message.endswith("?") and len(user_message) < 100, # simple question
len(conversation_history) == 0, # first message
]
high_score = sum(high_complexity_signals)
low_score = sum(low_complexity_signals)
if high_score >= 2:
return ModelTier.POWERFUL
elif low_score >= 2:
return ModelTier.FAST
else:
return ModelTier.BALANCED
class RoutedLLMClient:
"""Cost-aware LLM client with automatic model routing."""
def __init__(self, openai_client, default_tier: ModelTier = ModelTier.BALANCED):
self.client = openai_client
self.default_tier = default_tier
self._cost_tracker = {"fast": 0, "balanced": 0, "powerful": 0}
def complete(
self,
messages: list[dict],
force_tier: Optional[ModelTier] = None,
) -> dict:
# Route to appropriate tier
if force_tier:
tier = force_tier
        elif messages:
user_msg = messages[-1]["content"]
history = messages[:-1]
tier = classify_request_complexity(user_msg, history)
else:
tier = self.default_tier
response = self.client.chat.completions.create(
model=tier.value,
messages=messages,
)
# Track costs
usage = response.usage
tier_key = tier.name.lower()
# Simplified - in production, use actual token prices per model
self._cost_tracker[tier_key] += (
usage.prompt_tokens + usage.completion_tokens
) / 1000 * self._get_rate(tier)
return {
"content": response.choices[0].message.content,
"model_used": tier.value,
"tokens": usage.total_tokens,
}
def _get_rate(self, tier: ModelTier) -> float:
rates = {
ModelTier.FAST: 0.001, # blended rate $/1K tokens
ModelTier.BALANCED: 0.01,
ModelTier.POWERFUL: 0.04,
}
return rates[tier]
def cost_report(self) -> dict:
return self._cost_tracker
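A usage sketch, assuming the `openai` v1 Python SDK and an API key in the environment; the routing decision itself comes from `classify_request_complexity`:

```python
# Usage sketch - assumes the official openai SDK is installed and OPENAI_API_KEY is set
from openai import OpenAI

client = RoutedLLMClient(OpenAI())

result = client.complete(messages=[
    {"role": "user", "content": "Translate 'good morning' to French."},
])
print(result["model_used"])   # e.g. "gpt-3.5-turbo" if the router scores this as a simple task
print(result["tokens"])
print(client.cost_report())   # accumulated blended cost per tier
```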
Expected savings from routing: 40–60% on total token cost, depending on your task distribution. Most products have 50–70% of requests that are "simple" and can run on cheaper models.
Lever 2: Context Compression
For RAG-based applications, the context window is dominated by retrieved documents. Reducing token count in context is equivalent to reducing cost.
Strategies
1. Retrieved document compression: Pass retrieved chunks through a compression model before including in context:
def compress_context(
    retrieved_docs: list[str],
    user_question: str,
    cheap_model,  # client for a low-cost model, e.g. gpt-3.5-turbo
    target_tokens: int = 1000,
) -> str:
    """
    Use a cheap model to compress retrieved context to a target length.
    Compressing 4,000 tokens to 1,000 tokens saves roughly $0.09 per request at GPT-4 pricing.
    """
# Concatenate docs
full_context = "\n\n".join(retrieved_docs)
# Use a cheap model for compression
compression_prompt = f"""Extract only the information directly relevant to this question:
Question: {user_question}
Context:
{full_context}
Return only the most relevant sentences, max {target_tokens} tokens."""
compressed = cheap_model.complete(compression_prompt) # e.g., gpt-3.5-turbo
return compressed
# Cost math: compressing 4K tokens → 1K tokens
# Compression call (gpt-3.5): 4K input × $0.0005/1K + 1K output × $0.0015/1K = $0.002 + $0.0015 = $0.0035
# Savings on main call (gpt-4): 3K fewer input tokens × $0.03/1K = $0.09
# Net savings per request: $0.09 - $0.0035 ≈ $0.086 (compression is nearly free by comparison)
2. Conversation history truncation: Older conversation turns contribute less signal but equal cost. Implement smart truncation:
def truncate_history(
    history: list[dict],
    tokenizer,
    max_tokens: int = 2000,
) -> list[dict]:
    """
    Keep the most recent turns of conversation history within a token budget.
    The system prompt (index 0), if present, is always kept; remaining turns are
    added newest-first until the budget is exhausted.
    """
    if not history:
        return []
    # Reserve budget for the system prompt, which is always kept
    system_msg = history[0] if history[0].get("role") == "system" else None
    candidates = history[1:] if system_msg else history
    budget = max_tokens
    if system_msg:
        budget -= len(tokenizer.encode(system_msg["content"]))
    # Count tokens from the end, always keeping the most recent context
    kept = []
    token_count = 0
    for msg in reversed(candidates):
        msg_tokens = len(tokenizer.encode(msg["content"]))
        if token_count + msg_tokens > budget:
            break
        kept.insert(0, msg)
        token_count += msg_tokens
    return ([system_msg] if system_msg else []) + kept
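A usage sketch, assuming `tiktoken` for token counting; any tokenizer exposing `.encode()` works the same way:

```python
# Usage sketch - assumes the tiktoken package; any tokenizer with .encode() works
import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")
history = [
    {"role": "system", "content": "You are a writing assistant."},
    {"role": "user", "content": "Help me tighten this paragraph..."},
    {"role": "assistant", "content": "Here is a tighter version..."},
]
trimmed = truncate_history(history, tokenizer, max_tokens=2000)
print(len(trimmed))  # system prompt plus as many recent turns as fit the budget
```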
3. System prompt optimization: System prompts are sent with every request. Audit yours:
# Before: verbose system prompt (850 tokens)
SYSTEM_PROMPT_V1 = """
You are a helpful AI writing assistant created by Acme Corp. Your role is to help
users write better content. You have expertise in grammar, style, tone, and content
structure. You should be helpful, accurate, and concise. When the user asks for help
with writing, you should provide thoughtful suggestions...
[continues for 800+ more tokens of instructions]
"""
# After: compressed system prompt (120 tokens, same behavior)
SYSTEM_PROMPT_V2 = """
You are a writing assistant. Help users improve grammar, style, and content structure.
Be concise and specific. Suggest edits directly. Ask clarifying questions if the task
is ambiguous."""
# Savings: 730 tokens × $0.03/1K = $0.022 per request
# At 100K requests/day: $2,190/day = $65,700/month savings from system prompt alone
Lever 3: Quantization for Self-Hosted Models
For teams self-hosting open-source models, quantization reduces memory footprint and increases throughput - directly translating to lower per-token cost.
Quantization Levels and Their Economics
$$\text{Memory}(N, b) = \frac{N \times b}{8} \ \text{bytes}$$
Where N = parameter count, b = bits per weight.
| Quantization | Memory (7B) | Throughput (rel.) | Quality Loss | GPU Requirement |
|---|---|---|---|---|
| FP32 | 28 GB | 1.0× | None | 2× A100 40GB |
| FP16/BF16 | 14 GB | 1.8× | ~0% | 1× A100 40GB |
| INT8 | 7 GB | 2.5× | ~0.5% | 1× A100 40GB |
| INT4 (GPTQ) | 3.5 GB | 3.0× | ~1–3% | 1× A40 48GB |
| INT4 (AWQ) | 3.5 GB | 3.2× | ~0.5–1.5% | 1× A40 48GB |
Cost implication: Running INT4 instead of FP16 on a 7B model allows you to serve 3.2× more requests per GPU-hour. If one A100 at $3.06/hr serves 50 requests/sec in FP16, INT4 lets you serve 160 requests/sec - reducing cost per request by 3.2×.
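A small sketch of the memory formula and the resulting per-request cost; the 3.2× throughput multiplier and the $3.06/hr A100 price are the assumed values from the table and example above:

```python
# Sketch: weight-memory footprint and per-request cost at different quantization levels.
# The throughput multiplier and GPU price are the assumptions used in the text above.

def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory(N, b) = N * b / 8 bytes, expressed in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

def cost_per_request(gpu_hourly_cost: float, requests_per_second: float) -> float:
    return gpu_hourly_cost / (requests_per_second * 3600)

print(weight_memory_gb(7, 16))   # 14.0 GB (FP16)
print(weight_memory_gb(7, 4))    #  3.5 GB (INT4)

fp16 = cost_per_request(3.06, 50)          # FP16 baseline: 50 req/s
int4 = cost_per_request(3.06, 50 * 3.2)    # INT4 (AWQ): assumed 3.2x throughput
print(f"{fp16 / int4:.1f}x cheaper per request with INT4")   # 3.2x
```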
# Using bitsandbytes for INT8 quantization (simplest approach)
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
def load_quantized_model(model_name: str, quantization_bits: int = 4):
"""Load model with specified quantization for cost-efficient serving."""
if quantization_bits == 8:
config = BitsAndBytesConfig(load_in_8bit=True)
elif quantization_bits == 4:
config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,  # nested quantization, saves ~0.4 bits/param of extra memory
bnb_4bit_quant_type="nf4", # normal float 4, better for LLM weights
)
else:
raise ValueError(f"Unsupported bits: {quantization_bits}")
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=config,
device_map="auto",
)
return model
# For production inference, use vLLM with AWQ quantization (point it at an
# AWQ-quantized checkpoint of the model, since vLLM does not quantize on the fly):
# vllm serve <llama-3-8b-instruct-awq-checkpoint> --quantization awq --dtype auto
# This gives better throughput than bitsandbytes due to CUDA kernel optimization
Lever 4: Batching Economics
Batching is the single most impactful infrastructure-level optimization. Here's why: GPU computation is highly parallel - processing 32 requests simultaneously uses nearly the same memory bandwidth as processing 1 request, but achieves 32× throughput.
Why Batching Cuts Cost
For a single-request serving system:

$$\text{GPU utilization} = \frac{\text{Compute time per request}}{\text{Time between requests}}$$

At 1 request/second with 200ms inference time: 20% utilization - you pay for the GPU 100% of the time and use it 20% of the time.

With dynamic batching (batch = 16):

$$\text{GPU utilization} \approx \frac{\text{Batched inference time}}{\text{Wait window} + \text{Batched inference time}}$$

Under sustained load, with a 50ms batch window and 400ms of inference for a batch of 16, utilization rises to roughly 400 / 450 ≈ 90%.
import asyncio
import time
from collections import deque
from typing import Any
class DynamicBatcher:
"""
Asynchronous dynamic batching for LLM inference.
Collects requests within a time window, then processes as a batch.
"""
def __init__(
self,
model_fn,
max_batch_size: int = 32,
max_wait_ms: float = 50, # wait up to 50ms to fill a batch
):
self.model_fn = model_fn
self.max_batch_size = max_batch_size
self.max_wait_ms = max_wait_ms
self._queue: deque = deque()
self._processing = False
async def predict(self, request: Any) -> Any:
"""Submit a request for batched prediction."""
        future = asyncio.get_running_loop().create_future()
self._queue.append((request, future))
# Start batch processing if not already running
if not self._processing:
asyncio.create_task(self._process_batch())
return await future
async def _process_batch(self):
self._processing = True
deadline = time.perf_counter() + self.max_wait_ms / 1000
# Wait until batch is full or deadline expires
while (
len(self._queue) < self.max_batch_size
and time.perf_counter() < deadline
):
await asyncio.sleep(0.001) # 1ms polling interval
# Collect batch
batch_items = []
batch_futures = []
while self._queue and len(batch_items) < self.max_batch_size:
request, future = self._queue.popleft()
batch_items.append(request)
batch_futures.append(future)
# Process batch
try:
results = self.model_fn(batch_items)
for future, result in zip(batch_futures, results):
future.set_result(result)
except Exception as e:
for future in batch_futures:
future.set_exception(e)
        # If more requests arrived while this batch was being processed,
        # immediately start the next one; otherwise mark the batcher idle.
        if self._queue:
            asyncio.create_task(self._process_batch())
        else:
            self._processing = False
Cost impact example:
Without batching at 100 req/sec: 1 GPU needed at 100% utilization, $3.06/hr. With batching (batch=32, adds 25ms latency): same 100 req/sec on 0.1 GPUs effectively - or run 10× more traffic on same GPU.
At scale: 1,000 req/sec without batching needs 10 A100s ($30.60/hr). With batching: 3 A100s ($9.18/hr) - a 70% cost reduction.
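A back-of-the-envelope sketch of that comparison; the 100 req/s per unbatched GPU and the post-batching throughput (with headroom) are assumptions chosen to match the figures above:

```python
# Sketch: GPUs and dollars needed with vs without batching at 1,000 req/s.
# Assumes ~100 req/s per GPU unbatched and ~350 req/s per GPU with batching plus headroom.
import math

def fleet_cost(total_rps: float, rps_per_gpu: float, gpu_hourly: float = 3.06) -> tuple[int, float]:
    gpus = math.ceil(total_rps / rps_per_gpu)
    return gpus, gpus * gpu_hourly

without_batching = fleet_cost(1000, rps_per_gpu=100)
with_batching = fleet_cost(1000, rps_per_gpu=350)
print(without_batching)   # (10, 30.6)  -> $30.60/hr
print(with_batching)      # (3, 9.18)   -> $9.18/hr, a 70% reduction
```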
Lever 5: Instance Right-Sizing and Autoscaling
Most teams run too many instances 80% of the time and too few 5% of the time. The solution is metric-driven autoscaling that targets a specific cost-latency tradeoff point.
# Kubernetes HPA configuration for ML serving
# Target: keep GPU utilization at 70% (not 100% - need headroom for latency)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: llm-serving-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llm-serving
minReplicas: 2
maxReplicas: 20
metrics:
- type: External
external:
metric:
name: dcgm_fi_dev_gpu_util # GPU utilization from DCGM
selector:
matchLabels:
deployment: llm-serving
target:
type: AverageValue
averageValue: "70" # scale when avg GPU util hits 70%
behavior:
scaleUp:
stabilizationWindowSeconds: 60 # scale up fast
policies:
- type: Pods
value: 2
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # scale down slowly (avoid thrash)
policies:
- type: Pods
value: 1
periodSeconds: 120
The autoscaling economics:
import math

def calculate_autoscaling_savings(
peak_rps: float,
avg_rps: float,
rps_per_instance: float,
instance_hourly_cost: float,
hours_per_month: int = 730,
) -> dict:
"""Compare fixed capacity vs autoscaling costs."""
    # Fixed capacity: always-on at peak + 30% buffer
    fixed_instances = math.ceil((peak_rps / rps_per_instance) * 1.30)
    fixed_monthly = fixed_instances * instance_hourly_cost * hours_per_month
    # Autoscaling: instances track actual load with 30% headroom, floor of 1 instance
    avg_instances_with_autoscaling = max(1.0, (avg_rps / rps_per_instance) * 1.30)
autoscaling_monthly = avg_instances_with_autoscaling * instance_hourly_cost * hours_per_month
savings = fixed_monthly - autoscaling_monthly
savings_pct = savings / fixed_monthly
return {
"fixed_instances": fixed_instances,
"fixed_monthly_cost": fixed_monthly,
"avg_instances_autoscaled": avg_instances_with_autoscaling,
"autoscaling_monthly_cost": autoscaling_monthly,
"monthly_savings": savings,
"savings_percentage": savings_pct,
}
# Example: API with 200 peak RPS, 40 avg RPS, 100 RPS/instance at $3.06/hr
result = calculate_autoscaling_savings(
peak_rps=200, avg_rps=40,
rps_per_instance=100,
instance_hourly_cost=3.06
)
print(f"Fixed cost: ${result['fixed_monthly_cost']:,.0f}/mo") # ~$14,967
print(f"Autoscaled cost: ${result['autoscaling_monthly_cost']:,.0f}/mo") # ~$3,590
print(f"Savings: {result['savings_percentage']:.0%}") # 76%
The Full Optimization Roadmap
Applying all levers to the $80K/month scenario:
| Optimization | Monthly Savings | Implementation Effort |
|---|---|---|
| Model routing (60% to GPT-3.5) | −$32,000 | Medium (2 weeks) |
| Context compression | −$12,000 | Medium (1 week) |
| System prompt optimization | −$6,500 | Low (1 day) |
| Semantic caching | −$4,000 | Medium (1 week) |
| Autoscaling (if self-hosted) | −$7,000 | High (3 weeks) |
Total reduction: from $82,000 to approximately $20,500 - a 75% reduction.
Production Engineering Notes
Semantic Caching
Exact-match caching hits rarely for LLM workloads - users don't ask identical questions. Semantic caching matches similar questions:
import numpy as np
from typing import Optional
class SemanticCache:
"""Cache LLM responses by semantic similarity of the input."""
def __init__(self, embedding_model, similarity_threshold: float = 0.95):
self.embedding_model = embedding_model
self.threshold = similarity_threshold
self._cache: list[dict] = [] # in production: use FAISS or Qdrant
def get(self, query: str) -> Optional[str]:
query_embedding = self.embedding_model.encode(query)
for entry in self._cache:
            # Cosine similarity, assuming the embedding model returns L2-normalized vectors
            similarity = float(np.dot(query_embedding, entry["embedding"]))
if similarity >= self.threshold:
return entry["response"] # cache hit
return None # cache miss
def set(self, query: str, response: str):
embedding = self.embedding_model.encode(query)
self._cache.append({
"query": query,
"embedding": embedding,
"response": response,
})
Cache hit rate depends heavily on your use case. FAQ bots: 40–60% hit rate. Creative writing: ~5% hit rate. RAG applications: typically 15–25%. Even a 20% hit rate cuts average cost per query by roughly 20%.
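A wiring sketch showing where the cache sits in the request path. It assumes `sentence-transformers` for embeddings and a placeholder `llm_complete` function standing in for your actual LLM call:

```python
# Wiring sketch - sentence-transformers and llm_complete() are assumed placeholders,
# not part of the cache itself.
from sentence_transformers import SentenceTransformer

class NormalizedEmbedder:
    """Wraps a sentence-transformers model so .encode() returns unit-norm vectors."""
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self._model = SentenceTransformer(model_name)

    def encode(self, text: str):
        return self._model.encode(text, normalize_embeddings=True)

cache = SemanticCache(NormalizedEmbedder(), similarity_threshold=0.95)

def cached_complete(query: str, llm_complete) -> str:
    cached = cache.get(query)
    if cached is not None:
        return cached               # cache hit: no LLM call, near-zero cost
    response = llm_complete(query)  # cache miss: pay for the LLM call once
    cache.set(query, response)
    return response
```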
Common Mistakes
:::danger Optimizing tokens before optimizing model selection Token count optimization (compression, history trimming) saves proportionally. Model selection saves categorically - switching 60% of traffic from GPT-4 to GPT-3.5 cuts the cost of those requests by roughly 20×. Always start with model routing before optimizing context length. :::
:::danger Setting autoscaling to target 100% GPU utilization At 100% utilization, any traffic spike immediately degrades latency because there's no headroom. Target 60–70% utilization: you pay slightly more in idle capacity, but p99 latency stays stable. The cost of SLA violations (user complaints, churned users) exceeds the cost of 30% idle GPU capacity. :::
:::warning Quantizing without quality validation per task type Quantization affects different capabilities differently. INT4 typically degrades mathematical reasoning more than simple text generation. Always run your production task distribution through A/B quality tests before fully deploying quantized models. Build a regression test suite that covers your key use cases. :::
:::warning Ignoring cold start costs in autoscaling Loading a 7B model from S3 takes 45–90 seconds. If your autoscaler reacts to traffic spikes by adding instances, those new instances won't serve traffic for 90 seconds - during which your existing instances are overwhelmed. Pre-load instances on a warm pool: always keep N "warm but idle" instances that can begin serving in under 5 seconds. :::
Interview Q&A
Q: How would you reduce LLM API costs by 4× for a production application?
A: I'd attack it in three layers. First, model routing - most production requests don't need the most powerful model. Classify requests by complexity and route 60–70% to a cheaper model. This alone can reduce costs by 3–5×. Second, context optimization - audit your system prompt and conversation history handling. System prompts of 800+ tokens sent with every request are often compressible to 100–150 tokens with the same behavior. Use context compression for retrieved documents in RAG systems. Third, caching - implement semantic caching for similar queries. In an FAQ or customer support context, 30–40% of queries are semantically similar to previous ones. I've seen these three levers together reduce costs by 4–8× without any user-perceived quality degradation.
Q: What is the ROI calculation for quantizing a model from FP16 to INT4?
A: INT4 reduces memory by 4× and increases throughput by 2.5–3×. The direct cost saving: if you need 4 A100s in FP16, you need 1–2 A100s in INT4. At $3.06/hr, that's roughly $8,875/month saved per cluster. The cost: engineering time (2–3 weeks to implement, test, and validate), plus 1–3% quality degradation on some tasks. The break-even calculation: roughly $30,000 of engineering cost ÷ $8,875/month in savings ≈ 3.4 months. After that, pure savings. Quality degradation risk is the main variable - I always run an A/B quality test with 1,000 representative samples before committing. If the quality delta is under 2%, I proceed. If it's higher, I try AWQ quantization (better quality than GPTQ at the same compression).
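The break-even arithmetic from that answer as a small sketch; the dollar figures are the assumptions stated above:

```python
# Break-even sketch for a quantization project (inputs are the assumed figures above)
def quantization_breakeven_months(monthly_savings: float, engineering_cost: float) -> float:
    """Months until cumulative savings cover the one-time engineering investment."""
    return engineering_cost / monthly_savings

print(f"{quantization_breakeven_months(8_875, 30_000):.1f} months")  # ~3.4 months
```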
Q: How does batching reduce inference cost, and what are the limits?
A: Batching amortizes the fixed GPU overhead (loading model weights into cache, CUDA kernel launch overhead) across multiple requests. The GPU processes a batch of 32 requests using nearly the same memory bandwidth as 1 request - throughput scales nearly linearly with batch size up to a limit. The limit is the KV cache memory: each sequence in the batch occupies KV cache proportional to its length. A 7B model with 8K context window on an A100 (80 GB) can batch about 32 requests at 2K tokens average length. Beyond that, you either run out of memory or spill to CPU. The practical ceiling depends on sequence length distribution. Short sequences (200 tokens): batch 100+. Long sequences (4K tokens): batch 8–16. The latency tradeoff: larger batches improve throughput but add queuing delay. Target a batch window of 20–50ms - long enough to accumulate reasonable batches, short enough not to hurt p99 latency.
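A rough KV-cache sizing sketch behind that batch-size estimate. The layer and hidden-dimension numbers are assumptions for a Llama-style 7B model in FP16, and real servers also reserve memory for activations and fragmentation:

```python
# Rough KV-cache sizing for a Llama-style 7B model in FP16 (assumed architecture numbers)
LAYERS = 32
HIDDEN_DIM = 4096
BYTES_PER_VALUE = 2   # FP16

def kv_cache_gb(seq_len: int, batch_size: int) -> float:
    # Keys and values for every layer, every token, every sequence in the batch
    return 2 * LAYERS * HIDDEN_DIM * BYTES_PER_VALUE * seq_len * batch_size / 1e9

weights_gb = 14.0             # 7B params in FP16
gpu_gb = 80.0                 # A100 80GB
print(f"KV budget: ~{gpu_gb - weights_gb:.0f} GB")          # ~66 GB before runtime overhead
print(f"{kv_cache_gb(2048, 32):.1f} GB")   # ~34 GB for batch 32 at 2K tokens - fits
print(f"{kv_cache_gb(4096, 16):.1f} GB")   # ~34 GB for batch 16 at 4K tokens - long sequences cap the batch
```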
Q: When should you self-host a model vs use a managed API?
A: This is a TCO calculation. Self-hosting makes sense when: (1) volume is high enough that compute premium exceeds engineering cost - roughly 1M+ requests/day for GPT-3.5-class models; (2) latency requirements exceed what APIs can guarantee - self-hosted can achieve sub-100ms p50 vs API's 300–800ms; (3) data privacy prevents sending data to third parties; (4) fine-tuning is required and the performance gain justifies operational complexity. APIs win when: volume is low, team is small, development speed matters more than cost, or quality requirements demand frontier models. I'd always prototype with APIs, measure quality and cost at your actual workload, then evaluate whether self-hosting passes the TCO test.
Q: How do you design an autoscaling policy for an LLM serving system?
A: LLM autoscaling has three unique challenges vs regular web services. First, scale-up latency: model loading takes 30–90 seconds, so scale-up signals must be acted on before you hit capacity limits. I trigger scale-up at 60% GPU utilization, not 80%. Second, heterogeneous instance types: when scaling rapidly, spot instances at different GPU types may join the pool - make sure your load balancer is GPU-aware. Third, minimum instance floor: never scale to zero - keeping 2 instances warm eliminates cold start for normal traffic. The policy: scale up aggressively (double capacity when utilization >60% for 60 seconds), scale down conservatively (remove one instance when utilization <30% for 10 minutes). Always keep a 2-instance minimum. Monitor both GPU utilization and queue depth - a long queue at 70% utilization means requests are waiting for the GPU, not that the GPU is idle.