Production Model Deployments

This guide covers what to check before a model deployment goes to production: required labels, choosing a routing strategy, capping model-level throughput, configuring readiness, and sizing replicas.

Required Labels and Annotations

Every pod template that AIBrix manages needs at least the two labels below. Without them, the gateway cannot route traffic to the pod.

Label	Description
`model.aibrix.ai/name: <model-name>`	The model identifier clients send in the `model` field of their requests. Must be unique per model.
`model.aibrix.ai/port: "<port>"`	The container port the inference server listens on (e.g. `"8000"`).
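
A pre-deployment check for these two labels can be sketched in a few lines of Python. This is only an illustrative lint, not part of AIBrix: the function name `check_model_labels` and the exact checks (non-empty values, port given as a numeric string between 1 and 65535) are assumptions.

```python
# Hypothetical pre-deployment lint; not an AIBrix API.
REQUIRED_LABELS = ("model.aibrix.ai/name", "model.aibrix.ai/port")


def check_model_labels(labels):
    """Return a list of problems found in a pod template's labels."""
    problems = []
    for key in REQUIRED_LABELS:
        if not labels.get(key):
            problems.append(f"missing label: {key}")

    port = labels.get("model.aibrix.ai/port")
    if port is not None and port != "":
        # Kubernetes label values are strings, so the port must be quoted in YAML.
        if not isinstance(port, str):
            problems.append("model.aibrix.ai/port must be a quoted string")
        elif not port.isdecimal() or not 1 <= int(port) <= 65535:
            problems.append(f"invalid port value: {port!r}")
    return problems


if __name__ == "__main__":
    print(check_model_labels({"model.aibrix.ai/name": "my-model"}))
```

Running the check against the parsed pod template labels in CI catches a missing or unquoted port before the pod is created.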

Choosing a Routing Strategy

Set the default routing strategy for a model via the config annotation rather than relying on the global env default. This makes the intent explicit and allows different models to use different strategies.

annotations:
  model.aibrix.ai/config: |
    {
      "profiles": {
        "default": { "routingStrategy": "least-latency" }
      }
    }

A practical starting point for common workload types:

Workload	Recommended strategy
Multi-turn chat (shared system prompts)	`prefix-cache` — routes repeat prefixes to pods that already hold the KV cache.
Independent requests (batch, summarization)	`least-request` — evenly distributes load across pods.
Latency-sensitive interactive	`least-latency` — routes to the pod with the lowest recent latency.
High-throughput inference	`pd` — separates prefill and decode for maximum GPU utilization.
Multi-user SLO enforcement	`vtc-basic` — balances fairness across users while keeping pods loaded.

See the Deploying Gateway guide for the full strategy list.

Config Profiles for Production

Config profiles let a single model deployment serve multiple traffic classes without separate deployments. A common pattern for production is to define a profile per client type:

annotations:
  model.aibrix.ai/config: |
    {
      "defaultProfile": "default",
      "profiles": {
        "default": {
          "routingStrategy": "least-latency"
        },
        "batch": {
          "routingStrategy": "throughput"
        },
        "pd": {
          "routingStrategy": "pd"
        }
      }
    }

Clients select a profile with the config-profile request header:

curl http://${ENDPOINT}/v1/chat/completions \
  -H "config-profile: batch" \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Summarize: ..."}]}'

If no header is set, the defaultProfile is used.
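
The selection rule described above can be modeled as follows. This is a minimal sketch, not the gateway's actual code: the function name `resolve_profile`, the fallback to a profile named "default" when `defaultProfile` is omitted, and raising an error for an unknown profile name are all assumptions made for illustration.

```python
import json

# Same shape as the annotation value shown above.
CONFIG = """
{
  "defaultProfile": "default",
  "profiles": {
    "default": {"routingStrategy": "least-latency"},
    "batch": {"routingStrategy": "throughput"},
    "pd": {"routingStrategy": "pd"}
  }
}
"""


def resolve_profile(config_json, header_value=None):
    """Pick the profile named by the config-profile header, else the default."""
    config = json.loads(config_json)
    profiles = config.get("profiles", {})
    # Assumption: fall back to "default" if defaultProfile is not set.
    name = header_value or config.get("defaultProfile", "default")
    if name not in profiles:
        # Assumption: the real gateway may handle unknown names differently.
        raise KeyError(f"unknown profile: {name}")
    return name, profiles[name]


if __name__ == "__main__":
    print(resolve_profile(CONFIG, "batch"))
    print(resolve_profile(CONFIG))
```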

Capping Model Throughput (RPS Limiting)

What it is

requestsPerSecond sets a hard cap on how many requests per second the gateway will forward to a given model. Requests that exceed the limit are rejected immediately with HTTP 429 Too Many Requests before any routing or inference work is done.

This is a model-level cap (all users combined), distinct from the per-user RPM/TPM limits. Use it to:

Protect a model from accidental traffic spikes.
Enforce a cost budget (GPU-hours) for a model in a shared cluster.
Guarantee headroom for higher-priority models on the same gateway.

How to configure it

Add requestsPerSecond to the relevant profile in the model’s config annotation:

annotations:
  model.aibrix.ai/config: |
    {
      "profiles": {
        "default": {
          "routingStrategy": "least-latency",
          "requestsPerSecond": 50
        }
      }
    }

To apply different limits per traffic class, set it per profile:

annotations:
  model.aibrix.ai/config: |
    {
      "defaultProfile": "default",
      "profiles": {
        "default": {
          "routingStrategy": "least-latency",
          "requestsPerSecond": 100
        },
        "batch": {
          "routingStrategy": "throughput",
          "requestsPerSecond": 20
        }
      }
    }

Here, interactive traffic (default profile) is capped at 100 RPS while batch traffic is limited to 20 RPS.
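
Because the annotation value is a JSON string embedded in YAML, a typo is easy to miss. The sketch below lints a config value before it is applied. The function name `lint_model_config` and the specific checks are assumptions chosen for illustration; they are not the gateway's own validation rules.

```python
import json


def lint_model_config(annotation_value):
    """Return a list of problems found in a model.aibrix.ai/config value."""
    try:
        config = json.loads(annotation_value)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    problems = []
    profiles = config.get("profiles") or {}
    if not profiles:
        problems.append("no profiles defined")

    default = config.get("defaultProfile")
    if default is not None and default not in profiles:
        problems.append(f"defaultProfile {default!r} is not a defined profile")

    for name, profile in profiles.items():
        # Omitted or 0 means "no limit" per this guide.
        rps = profile.get("requestsPerSecond", 0)
        if isinstance(rps, bool) or not isinstance(rps, int) or rps < 0:
            problems.append(f"profile {name!r}: requestsPerSecond must be a non-negative integer")
    return problems


if __name__ == "__main__":
    value = '{"defaultProfile": "batch", "profiles": {"default": {"requestsPerSecond": 100}}}'
    print(lint_model_config(value))
```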

What clients see when the limit is hit

The gateway returns:

HTTP/1.1 429 Too Many Requests
x-error-model-rps-exceeded: true

{"error": {"message": "model: my-model has exceeded RPS: 50", "type": "rate_limit_error", "code": "rate_limit_exceeded"}}
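
On the client side, this rejection can be told apart from other 429 responses by the header shown above. The following sketch is one possible way to handle it; the helper names and the backoff schedule (starting at 1 second because the limit uses a 1-second window, doubling up to a cap) are assumptions, not behavior prescribed by the gateway.

```python
def is_model_rps_rejection(status, headers):
    """True if the response is a model-level RPS rejection from the gateway."""
    if status != 429:
        return False
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    return normalized.get("x-error-model-rps-exceeded", "").strip().lower() == "true"


def backoff_delays(max_retries, base=1.0, cap=8.0):
    """Exponential backoff schedule in seconds (hypothetical client policy)."""
    return [min(cap, base * 2 ** attempt) for attempt in range(max_retries)]


if __name__ == "__main__":
    print(is_model_rps_rejection(429, {"x-error-model-rps-exceeded": "true"}))
    print(backoff_delays(4))
```

A client would wrap its request call in a loop, sleeping for each delay in turn while `is_model_rps_rejection` returns true, and give up once the schedule is exhausted.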

How it works internally

The counter is stored in Redis and incremented atomically on each request. The 1-second window resets automatically. If routing fails after the counter is incremented (e.g. no ready pods), the increment is rolled back so failed requests do not consume quota.
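
The counting scheme can be illustrated with an in-process, single-threaded model. This is a sketch of the general fixed-window technique with rollback, not the gateway's implementation: the real counter lives in Redis and is incremented atomically, and the class and method names here are hypothetical.

```python
class FixedWindowLimiter:
    """In-memory fixed-window limiter with rollback (illustration only)."""

    def __init__(self, limit):
        self.limit = limit   # 0 disables the limit, as with requestsPerSecond
        self.window = None   # start of the current 1-second window
        self.count = 0

    def allow(self, now):
        """Count one request at time `now` (seconds); return False if over the limit."""
        window = int(now)
        if window != self.window:
            # A new 1-second window starts: the counter resets.
            self.window = window
            self.count = 0
        self.count += 1
        return self.limit <= 0 or self.count <= self.limit

    def rollback(self, now):
        """Undo one increment, e.g. when routing failed after counting."""
        if int(now) == self.window and self.count > 0:
            self.count -= 1


if __name__ == "__main__":
    limiter = FixedWindowLimiter(limit=1)
    print(limiter.allow(5.0))   # counted
    limiter.rollback(5.0)       # routing failed, give the slot back
    print(limiter.allow(5.2))   # still within quota
    print(limiter.allow(5.3))   # over the limit for this window
```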

Note

To enforce requestsPerSecond across gateway replicas, Redis must be enabled on the gateway plugin. Without Redis, the counter is in-process and is not shared across gateway replicas, so each replica counts only the requests it handles itself. See the Deploying Gateway guide, section Enabling Redis for Multi-Replica Deployments.

Omit requestsPerSecond or set it to 0 to disable the limit.

Readiness and Health

The gateway only routes to pods that Kubernetes considers Ready. Ensure your pod’s readiness probe waits for the inference server to fully load the model before marking the pod ready. For large models this can take several minutes.

A practical readiness probe for vLLM:

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 30    # 30 x 10s: tolerate up to 5 minutes of failed probes (e.g. during model load)

If a pod stops passing its readiness probe after previously being ready (e.g. due to OOM), the gateway stops routing new requests to it as soon as Kubernetes marks it NotReady, without dropping in-flight requests.
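
The timing implied by the probe fields above is simple arithmetic, sketched here for checking your own values. The function name is hypothetical; the field meanings follow standard Kubernetes probe semantics (first probe after initialDelaySeconds, then one probe every periodSeconds, with failureThreshold consecutive failures needed to mark a ready container NotReady).

```python
def readiness_timing(initial_delay_seconds, period_seconds, failure_threshold):
    """Summarize the time budget implied by a readiness probe definition."""
    return {
        # No probe is sent before this many seconds after container start.
        "first_probe_at_s": initial_delay_seconds,
        # Consecutive failures tolerated before a ready pod is marked NotReady.
        "failure_window_s": period_seconds * failure_threshold,
    }


if __name__ == "__main__":
    print(readiness_timing(60, 10, 30))
```

For the example probe this gives a first probe at 60 seconds and a 300-second failure window. A large failure window also delays how quickly an unhealthy pod is taken out of rotation, so choose it with both cases in mind.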

Replica Sizing

There is no universal formula, but a useful starting point:

Measure single-replica capacity — run a short load test at increasing QPS until latency or error rate degrades. Note the sustainable QPS.
Apply a safety margin — target 60–70% of the measured peak to leave room for spikes.
Add replicas — replicas = ceil(target_QPS / sustainable_QPS_per_replica).
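
The three steps above can be worked through in code. This sketch assumes that the safety margin is applied to the measured single-replica figure, so that the per-replica planning number is the measured QPS times 0.6 to 0.7; the function name and the default of 0.65 are illustrative choices, not values from AIBrix.

```python
import math


def replicas_needed(target_qps, measured_qps_per_replica, utilization=0.65):
    """Replica count for a target QPS, keeping each replica below `utilization`."""
    if target_qps <= 0 or measured_qps_per_replica <= 0:
        raise ValueError("QPS values must be positive")
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Step 2: apply the safety margin to the measured capacity.
    planned_qps_per_replica = measured_qps_per_replica * utilization
    # Step 3: round up, since a fractional replica cannot be scheduled.
    return math.ceil(target_qps / planned_qps_per_replica)


if __name__ == "__main__":
    # 100 QPS target, 12 QPS measured per replica, planned at 65% of that.
    print(replicas_needed(100, 12))
```

With these inputs each replica is planned at 7.8 QPS, so 100 QPS needs 13 replicas rather than the 9 that the raw measurement would suggest.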

For PD deployments, size prefill and decode pods separately: prefill pods are compute-bound (add more for high-input-token workloads), decode pods are memory-bandwidth-bound (add more for long-output workloads).

Rolling Updates

LLM pods take a long time to start. Configure maxUnavailable: 0 and maxSurge: 1 (or higher) to ensure capacity is not lost during rollouts:

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1

With maxUnavailable: 0, Kubernetes only terminates an old pod after its replacement passes its readiness probe, so serving capacity does not drop during the rollout. The readiness probe, in turn, keeps the gateway from routing to a new pod that has not finished loading the model.
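
The pod counts a rollout can reach follow directly from these two settings. The sketch below covers integer values only (Kubernetes also accepts percentages, which are not handled here), and the function name is hypothetical.

```python
def rollout_bounds(replicas, max_surge=1, max_unavailable=0):
    """Pod-count bounds during a RollingUpdate with integer surge settings."""
    if max_surge == 0 and max_unavailable == 0:
        # Kubernetes rejects this combination: the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {
        # Old and new pods together never exceed this count.
        "max_total_pods": replicas + max_surge,
        # At least this many pods stay available throughout.
        "min_available_pods": replicas - max_unavailable,
    }


if __name__ == "__main__":
    print(rollout_bounds(4))
```

For a 4-replica deployment with the settings above, up to 5 pods exist at once, so the cluster needs spare accelerator capacity for at least one extra pod while the rollout runs.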

For PD deployments, update prefill and decode pods in separate rollouts — updating both simultaneously can temporarily leave some rolesets incomplete, causing the gateway to skip them.

Observability Checklist

Before shipping to production, ensure you have visibility into:

Request rate per model — aibrix_gateway_requests_total counter, broken down by model label.
Routing latency — time between request arrival and pod selection.
RPS limit rejections — watch for x-error-model-rps-exceeded response headers or the corresponding metric.
Pod readiness churn — alerts on pods that flip between Ready and NotReady indicate OOM or instability.
Prefill timeout rate (PD only) — pd-prefill-request-error log entries; a high rate suggests prefill pods are overloaded.
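
As a quick check that the request counter is being exported per model, the Prometheus text output of the gateway can be summed by label. This is a simplified sketch: it assumes the label is literally named `model`, the `status` label and sample values are made up, and the parser does not handle every corner of the exposition format (for example, a `}` inside a label value).

```python
import re
from collections import defaultdict

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output; only the metric name comes from this guide.
SAMPLE = """\
# TYPE aibrix_gateway_requests_total counter
aibrix_gateway_requests_total{model="my-model",status="200"} 120
aibrix_gateway_requests_total{model="my-model",status="429"} 5
aibrix_gateway_requests_total{model="other-model",status="200"} 40
"""


def requests_by_model(metrics_text, metric="aibrix_gateway_requests_total"):
    """Sum a counter's samples per value of the `model` label."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get("model", "<unlabeled>")] += float(match.group("value"))
    return dict(totals)


if __name__ == "__main__":
    print(requests_by_model(SAMPLE))
```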

See the Observability guide for the full metrics reference.
