Your cluster runs at 13% GPU utilization; your bill is priced as if it ran at 100%. We redesign inference on Kubernetes with paged attention, prefix caching, fractional GPUs, and a cost dashboard you can defend in a budget review. No vendor lock-in.
Most inference clusters are a HuggingFace quickstart promoted to production: a single-replica deployment, a default autoscaler that reacts to queue depth instead of KV-cache hit rate, a cost dashboard that shows the bill but not the why. It works until it doesn't.
We treat inference like any other platform problem: measure first, find the real bottleneck (almost never what you think), redesign the serving layer around paged attention and prefix caching, and wire the cluster to a cost model you can actually defend to finance.
Real cluster. Real numbers. Snapshot taken 7 days before and 7 days after the intervention window. Logs and billing receipts available under NDA.
Concrete artifacts. Each one ships as a PR or a runbook your platform team can own after we leave.
7 days of production telemetry: GPU utilization per device, KV-cache hit rate, TTFT / ITL / TPS percentiles, queue depth, and a unit-cost curve in $/1M tokens.
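The percentile plumbing behind those numbers is deliberately boring. A minimal sketch of how p50/p95 fall out of raw latency samples — pure stdlib, function names are ours and illustrative only:

```python
from statistics import quantiles

def p50_p95(samples_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p95) from raw latency samples, e.g. per-request TTFT."""
    qs = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return qs[49], qs[94]  # the 50th and 95th percentile cut points
```

In production these come from Prometheus histogram buckets rather than in-process lists, but the definition of the percentiles is the same.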
vLLM configuration tuned for your model + traffic shape: paged attention page size, prefix caching, chunked prefill, tensor parallel strategy, speculative decode.
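What a tuned configuration looks like, as a hedged sketch: argument names below follow vLLM's EngineArgs around the 0.6.x series and shift between releases; every value is illustrative, not a recommendation for your traffic shape.

```python
# Sketch of vLLM engine arguments; verify names against your installed version.
engine_args = dict(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=2,          # split weights across 2 GPUs
    enable_prefix_caching=True,      # reuse KV blocks across shared prompt prefixes
    enable_chunked_prefill=True,     # interleave long prefills with decode steps
    block_size=16,                   # paged-attention page size, in tokens per KV block
    gpu_memory_utilization=0.90,     # fraction of VRAM the engine may claim
    # Draft model for speculative decode; the model id is a placeholder.
    speculative_model="<your-1.5b-draft-model>",
    num_speculative_tokens=5,
)
```

In practice these feed `vllm.LLM(**engine_args)` or the equivalent CLI flags on `vllm serve`.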
HPA driven by KV-cache hit rate, node pool autoscaling, GPU scheduling via Kueue or Run:ai, fractional GPU via MIG or MPS, pod topology spread for availability.
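The scaling arithmetic itself is the stock Kubernetes HPA formula; the interesting part is which metric you feed it. A sketch, assuming the serving layer exports a KV-cache signal through an external metrics adapter (function and metric names are ours):

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float) -> int:
    # Kubernetes HPA core formula: scale proportionally to metric / target.
    return math.ceil(current_replicas * current_metric / target_metric)

# e.g. 4 replicas with KV-cache usage at 0.92 against a 0.70 target -> 6 replicas
```

Queue depth lags the real constraint; a KV-cache signal scales before requests start queueing.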
Grafana board showing $/1M tokens, $/request, GPU $/hour allocated vs $/hour utilized. Finance stops asking “why is the bill this big” and starts shipping decisions.
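The math behind that board is simple enough to re-derive live in a budget review — a sketch, with names and the $4.00/hour rate purely illustrative:

```python
def dollars_per_million_tokens(monthly_gpu_cost: float, tokens_per_month: float) -> float:
    # The single unit-cost number finance actually needs.
    return monthly_gpu_cost / (tokens_per_month / 1e6)

def utilized_gpu_dollars_per_hour(allocated_rate: float, avg_utilization: float) -> float:
    # What you pay per *useful* GPU-hour: at 13% utilization,
    # a $4.00/h GPU effectively costs ~$30.77 per utilized hour.
    return allocated_rate / avg_utilization
```

The allocated-vs-utilized gap is usually the headline finding: the bill measures hours allocated, not work done.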
Written SLOs for TTFT, ITL, error rate. Alerting tuned to burn rate, not threshold. Runbook for the three incidents your team will actually page for.
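Burn rate is the ratio worth internalizing: how fast you are spending error budget relative to schedule. A minimal sketch (names ours):

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    # Error budget is (1 - SLO). Burn rate 1.0 = spending it exactly on schedule;
    # a 99.9% SLO with 1% observed errors burns the budget 10x too fast.
    return observed_error_rate / (1.0 - slo_target)
```

You alert on sustained burn rate over a window, not on a raw error threshold — a threshold alert fires on blips and sleeps through slow leaks.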
Spreadsheet mapping traffic forecast to GPU demand, with concurrency ceilings derived from your actual KV-cache profile. Defensible in budget review.
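The spreadsheet's core formula is Little's law plus a KV-cache ceiling. A sketch — function names and the per-token KV size are illustrative, and the real ceiling comes from your measured cache profile:

```python
import math

def concurrency_ceiling(kv_cache_gb_per_gpu: float, avg_context_tokens: int,
                        kv_bytes_per_token: int) -> int:
    # How many requests fit in KV-cache at once on one GPU.
    per_request_gb = avg_context_tokens * kv_bytes_per_token / 1e9
    return int(kv_cache_gb_per_gpu / per_request_gb)

def gpus_needed(peak_rps: float, avg_request_seconds: float, ceiling: int) -> int:
    # Little's law: concurrent requests = arrival rate x residence time.
    return math.ceil(peak_rps * avg_request_seconds / ceiling)
```

Forecast peak RPS, multiply through, and GPU demand falls out — with the ceiling grounded in measurement instead of vendor sizing guides.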
Deploy lightweight telemetry alongside your existing serving stack. We never replace your serving layer before we measure it.
7 days of production traffic. We cut the data by model, route, tenant, and request shape. Bottlenecks are almost never where the team thinks they are.
Serving-layer PRs, Kubernetes platform PRs, dashboards. Canary against the baseline with the same traffic — no staging guesswork.
Runbook, SLOs, capacity model, engineering walkthrough. Your team runs it. We stay on retainer for the next three model upgrades if you want.
Excerpt from a real engagement. Cluster name and tenant identifiers removed; everything else is load-bearing.
# engagement: inference-cost-audit · cluster: prod-us-east
# finding P-02 · kv-cache under-utilization
cluster:
gpus: 8 × A100-80G
model: llama-3-70b-instruct · fp16
serving: vllm 0.6.3 · tensor_parallel_size=2
before:
avg_gpu_util: 13% # averaged across 7d
p50_ttft_ms: 840
p95_ttft_ms: 2,310
kv_cache_hit: 22%
throughput: 118 req/s
monthly_cost: $42,100 · $357 per 1M tokens
after:
avg_gpu_util: 61%
p50_ttft_ms: 320
p95_ttft_ms: 890
kv_cache_hit: 74%
throughput: 487 req/s
monthly_cost: $23,400 · $81 per 1M tokens
interventions:
- prefix caching enabled (−42% prefill cost)
- chunked prefill for long contexts
- fractional gpu via mig (3g.40gb × 2 per card)
- hpa on kv-cache-hit, not queue depth
- paged-attention page size re-tuned
  - speculative decode on 1.5b draft model

You left the API providers for control over unit economics, latency, or data locality. Now you own a serving stack and a bill. We make it defensible.
You have a working cluster but no cost discipline, no SLOs, and no one who has tuned vLLM against real production traffic. We slot into the platform team.
The product is working, traffic is growing, and every incremental customer is eating margin. You need a one-time audit, not another hire.
No. Everything we ship is open source: vLLM, Kubernetes, Grafana, Prometheus, Kueue, and Helm/Terraform for the glue. Your team owns the repo on day one and we leave with nothing proprietary.
Yes. vLLM supports most modern transformer families, and the platform work (fractional GPUs, HPA tuning, cost dashboards, SLOs) is model-agnostic. The intervention mix shifts by model; the method doesn't.
Yes — most of our work is on AWS / GCP / Azure GPU pools. The paged-attention and KV-cache wins are hardware-independent; the autoscaling and cost modeling differ by provider and are included.
We've done this on two-GPU clusters. Fractional-GPU work matters more at small scale, not less — most teams with a handful of GPUs don't realize they're paying for headroom they never hit.
Not as primary scope. We focus on inference because that's where production pain lives for most teams. For training, we can recommend partners we trust and help you scope the interface between the two.
3-week engagements, fixed price, verified against your billing. Scoping call is free.
Start an audit →