Your cluster runs at 13% GPU utilization; your bill is priced as if it ran at 100%. We redesign inference on Kubernetes with paged attention, prefix caching, fractional GPUs, and a cost dashboard you can defend in a budget review. No vendor lock-in.
Most inference clusters are a HuggingFace quickstart promoted to production: a single-replica deployment, a default autoscaler that reacts to queue depth instead of KV-cache hit rate, a cost dashboard that shows the bill but not the why. It works until it doesn't.
We treat inference like any other platform problem: measure first, find the real bottleneck (almost never what you think), redesign the serving layer around paged attention and prefix caching, and wire the cluster to a cost model you can actually defend to finance.
Real cluster. Real numbers. Snapshot taken 7 days before and 7 days after the intervention window. Logs and billing receipts available under NDA.
Concrete artifacts. Each one ships as a PR or a runbook your platform team can own after we leave.
7 days of production telemetry: GPU utilization per device, KV-cache hit rate, TTFT / ITL / TPS percentiles, queue depth, and a unit-cost curve in $/1M tokens.
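The percentile plumbing behind those numbers is deliberately boring. A minimal sketch of how p50/p95 fall out of raw latency samples — pure stdlib, function names are ours and illustrative only:

```python
from statistics import quantiles

def p50_p95(samples_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p95) from raw latency samples, e.g. per-request TTFT."""
    qs = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return qs[49], qs[94]  # the 50th and 95th percentile cut points
```

In production these come from Prometheus histogram buckets rather than in-process lists, but the definition of the percentiles is the same.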
vLLM configuration tuned for your model + traffic shape: paged attention page size, prefix caching, chunked prefill, tensor parallel strategy, speculative decode.
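What a tuned configuration looks like, as a hedged sketch: argument names below follow vLLM's EngineArgs around the 0.6.x series and shift between releases; every value is illustrative, not a recommendation for your traffic shape.

```python
# Sketch of vLLM engine arguments; verify names against your installed version.
engine_args = dict(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=2,          # split weights across 2 GPUs
    enable_prefix_caching=True,      # reuse KV blocks across shared prompt prefixes
    enable_chunked_prefill=True,     # interleave long prefills with decode steps
    block_size=16,                   # paged-attention page size, in tokens per KV block
    gpu_memory_utilization=0.90,     # fraction of VRAM the engine may claim
    # Draft model for speculative decode; the model id is a placeholder.
    speculative_model="<your-1.5b-draft-model>",
    num_speculative_tokens=5,
)
```

In practice these feed `vllm.LLM(**engine_args)` or the equivalent CLI flags on `vllm serve`.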
HPA driven by KV-cache hit rate, node pool autoscaling, GPU scheduling via Kueue or Run:ai, fractional GPU via MIG or MPS, pod topology spread for availability.
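The scaling arithmetic itself is the stock Kubernetes HPA formula; the interesting part is which metric you feed it. A sketch, assuming the serving layer exports a KV-cache signal through an external metrics adapter (function and metric names are ours):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    # Kubernetes HPA core formula: scale proportionally to metric / target.
    return math.ceil(current_replicas * current_metric / target_metric)

# e.g. 4 replicas with KV-cache usage at 0.92 against a 0.70 target -> 6 replicas
```

Queue depth lags the real constraint; a KV-cache signal scales before requests start queueing.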
Grafana board showing $/1M tokens, $/request, GPU $/hour allocated vs $/hour utilized. Finance stops asking “why is the bill this big” and starts shipping decisions.
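The math behind that board is simple enough to re-derive live in a budget review — a sketch, with names and the $4.00/hour rate purely illustrative:

```python
def dollars_per_million_tokens(monthly_gpu_cost: float, tokens_per_month: float) -> float:
    # The single unit-cost number finance actually needs.
    return monthly_gpu_cost / (tokens_per_month / 1e6)

def utilized_gpu_dollars_per_hour(allocated_rate: float, avg_utilization: float) -> float:
    # What you pay per *useful* GPU-hour: at 13% utilization,
    # a $4.00/h GPU effectively costs ~$30.77 per utilized hour.
    return allocated_rate / avg_utilization
```

The allocated-vs-utilized gap is usually the headline finding: the bill measures hours allocated, not work done.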
Written SLOs for TTFT, ITL, error rate. Alerting tuned to burn rate, not threshold. Runbook for the three incidents your team will actually page for.
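Burn rate is the ratio worth internalizing: how fast you are spending error budget relative to schedule. A minimal sketch (names ours):

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    # Error budget is (1 - SLO). Burn rate 1.0 = spending it exactly on schedule;
    # a 99.9% SLO with 1% observed errors burns the budget 10x too fast.
    return observed_error_rate / (1.0 - slo_target)
```

You alert on sustained burn rate over a window, not on a raw error threshold — a threshold alert fires on blips and sleeps through slow leaks.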
Spreadsheet mapping traffic forecast to GPU demand, with concurrency ceilings derived from your actual KV-cache profile. Defensible in budget review.
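The spreadsheet's core formula is Little's law plus a KV-cache ceiling. A sketch — function names and the per-token KV size are illustrative, and the real ceiling comes from your measured cache profile:

```python
import math

def concurrency_ceiling(kv_cache_gb_per_gpu: float, avg_context_tokens: int,
                        kv_bytes_per_token: int) -> int:
    # How many requests fit in KV-cache at once on one GPU.
    per_request_gb = avg_context_tokens * kv_bytes_per_token / 1e9
    return int(kv_cache_gb_per_gpu / per_request_gb)

def gpus_needed(peak_rps: float, avg_request_seconds: float, ceiling: int) -> int:
    # Little's law: concurrent requests = arrival rate x residence time.
    return math.ceil(peak_rps * avg_request_seconds / ceiling)
```

Forecast peak RPS, multiply through, and GPU demand falls out — with the ceiling grounded in measurement instead of vendor sizing guides.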
Deploy lightweight telemetry alongside your existing serving stack. We never replace your serving layer before we measure it.
7 days of production traffic. We cut the data by model, route, tenant, and request shape. Bottlenecks are almost never where the team thinks they are.
Serving-layer PRs, Kubernetes platform PRs, dashboards. Canary against the baseline with the same traffic — no staging guesswork.
Runbook, SLOs, capacity model, engineering walkthrough. Your team runs it. We stay on retainer for the next three model upgrades if you want.
Excerpt from a real engagement. Cluster name and tenant identifiers removed; everything else is load-bearing.
# engagement: inference-cost-audit · cluster: prod-us-east
# finding P-02 · kv-cache under-utilization
cluster:
gpus: 8 × A100-80G
model: llama-3-70b-instruct · fp16
serving: vllm 0.6.3 · tensor_parallel_size=2
before:
avg_gpu_util: 13% # averaged across 7d
p50_ttft_ms: 840
p95_ttft_ms: 2,310
kv_cache_hit: 22%
throughput: 118 req/s
monthly_cost: $42,100 · $357 per 1M tokens
after:
avg_gpu_util: 61%
p50_ttft_ms: 320
p95_ttft_ms: 890
kv_cache_hit: 74%
throughput: 487 req/s
monthly_cost: $23,400 · $81 per 1M tokens
interventions:
- prefix caching enabled (−42% prefill cost)
- chunked prefill for long contexts
- fractional gpu via mig (3g.40gb × 2 per card)
- hpa on kv-cache-hit, not queue depth
- paged-attention page size re-tuned
  - speculative decode on 1.5b draft model

You left the API providers for control over unit economics, latency, or data locality. Now you own a serving stack and a bill. We make it defensible.
You have a working cluster but no cost discipline, no SLOs, and no one who has tuned vLLM against real production traffic. We slot into the platform team.
The product is working, traffic is growing, and every incremental customer is eating margin. You need a one-time audit, not another hire.
No. Everything we ship is open source: vLLM, Kubernetes, Grafana, Prometheus, Kueue, and Helm/Terraform for the glue. Your team owns the repo on day one and we leave with nothing proprietary.
Yes. vLLM supports most modern transformer families, and the platform work (fractional GPUs, HPA tuning, cost dashboards, SLOs) is model-agnostic. The intervention mix shifts by model; the method doesn't.
Yes — most of our work is on AWS / GCP / Azure GPU pools. The paged-attention and KV-cache wins are hardware-independent; the autoscaling and cost modeling differ by provider and are included.
We've done this on two-GPU clusters. Fractional-GPU work matters more at small scale, not less — most teams with a handful of GPUs don't realize they're paying for headroom they never hit.
Not as primary scope. We focus on inference because that's where production pain lives for most teams. For training, we can recommend partners we trust and help you scope the interface between the two.
3-week engagements, fixed price, verified against your billing. Scoping call is free.
Start an audit →