Prefill-Decode Disaggregation (PD)#
Modern LLM inference workloads are rarely uniform. Some requests contain long prompts that benefit from specialized execution pipelines, while others are short and interactive. Designing infrastructure that efficiently handles both types of requests can be challenging.
AIBrix addresses this with intelligent routing across different types of inference pods within the same deployment. Instead of forcing operators to choose one architecture or maintain separate deployments, AIBrix runs both approaches together and automatically decides which pod should handle each request.
LLM inference has two distinct phases. Prefill processes the entire input prompt in one shot — it is compute-intensive and fast. Decode generates output tokens one at a time — it is memory-bandwidth-bound and slow. In a standard deployment, both phases run on the same GPU, causing them to compete for resources.
PD disaggregation separates these phases onto dedicated pods, so each can be sized, tuned, and scaled independently. This can raise GPU utilization and lower latency at scale.
Standard inference pods run both phases end-to-end on a single GPU. They handle short interactive requests efficiently and absorb overflow traffic when PD resources are busy.
Intelligent Routing Across Pod Types#
Each pod type serves a different role:
Prefill/Decode pods — Designed for workloads where separating prefill and decode stages improves efficiency. Particularly effective for long prompts or workloads dominated by heavy prompt processing.
Standard inference pods — Execute the entire request lifecycle within a single process. Well suited for short prompts and interactive requests, and act as a safety valve when PD resources are saturated.
The AIBrix gateway continuously evaluates system conditions and routes each request to the best available pod:
                        +---------------------------+
                        |          Client           |
                        +-------------+-------------+
                                      |
                                      ▼
                         Routing Algorithm (Gateway)
                                      |
            +-------------------------+------------------------+
            |                                                  |
            ▼                                                  ▼
+------------------------+                         +------------------------+
|  Prefill/Decode Pods   |                         | Standard Inference Pods|
| (Disaggregated Stages) |                         | (Single Execution Path)|
+------------------------+                         +------------------------+
            ▲                                                  ▲
            | Selected for long prompts or                     | Selected for short prompts
            | prefill-heavy workloads                          | or when PD capacity is busy
The routing decision incorporates several signals: current pod load, queue depth, pod availability, and scoring logic used to rank candidate pods. This allows AIBrix to distribute traffic efficiently while maintaining stable latency.
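As a rough illustration of how such signals could be combined, here is a minimal sketch. It is not AIBrix's actual scoring logic: the `Pod` fields, the weights, and the `pick_pod` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Pod:
    name: str
    running_requests: int  # current pod load (assumed metric)
    queue_depth: int       # requests waiting on the pod (assumed metric)
    ready: bool = True     # pod availability


def score(pod: Pod, load_weight: float = 1.0, queue_weight: float = 2.0) -> float:
    """Lower is better. The weights are illustrative, not AIBrix defaults."""
    return load_weight * pod.running_requests + queue_weight * pod.queue_depth


def pick_pod(pods: Sequence[Pod]) -> Optional[Pod]:
    """Return the ready pod with the lowest score, or None if none is ready."""
    candidates = [p for p in pods if p.ready]
    if not candidates:
        return None
    return min(candidates, key=score)


pods = [
    Pod("pd-pair-0", running_requests=6, queue_depth=3),
    Pod("standard-0", running_requests=4, queue_depth=0),
    Pod("standard-1", running_requests=1, queue_depth=0, ready=False),
]
chosen = pick_pod(pods)
print(chosen.name if chosen else "no pod available")
```

In this sketch the unavailable pod is filtered out first, and the busy PD pair loses to the idle standard pod because queued requests are weighted more heavily than running ones.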
Key benefits:
Optimized handling of mixed workloads — Long prompts are routed to prefill/decode pods; short requests are handled efficiently by standard inference pods.
Graceful handling of traffic spikes — Standard inference pods absorb overflow traffic when PD resources are saturated.
Single deployment architecture — Run multiple execution models for the same model without managing separate clusters.
Dynamic routing decisions — Traffic is distributed based on real-time system conditions instead of static configuration.
Improved GPU utilization — Requests are balanced across available pods to maximize throughput and efficiency.
How PD Disaggregation Works#
┌──────────────────────────────────────────────────────────┐
│                     Incoming request                     │
└──────────────────────────┬───────────────────────────────┘
                           │
                           ▼
                ┌──────────────────────┐
                │  PD Router (gateway) │
                └──────┬───────────────┘
                       │
        ┌──────────────┴──────────────┐
        ▼                             ▼
┌───────────────┐           ┌──────────────────┐
│  Prefill pod  │  ──KV──▶  │    Decode pod    │
│  (processes   │  transfer │    (generates    │
│  the prompt)  │           │  output tokens)  │
└───────────────┘           └──────────────────┘
The gateway routes the request to a prefill pod, which processes the prompt and computes the KV cache.
The KV cache is transferred to a decode pod via a high-speed interconnect (SHFS for GPU, NIXL for Neuron).
The decode pod streams the generated tokens back to the client.
If no matching prefill/decode pair is available (e.g. the prompt is outside all configured buckets), the request falls back to a standard inference pod that runs both phases locally.
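The flow above, including the fallback, can be traced with a small sketch. It performs no real I/O and does not reflect the gateway's implementation. `Pair`, `plan_request`, and the pod names are hypothetical, and the "no pod available" outcome is an assumption about a case the text does not describe.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Pair:
    """A prefill pod and a decode pod that serve one request together."""
    prefill: str
    decode: str


def plan_request(pair: Optional[Pair], standard_pod: Optional[str]) -> List[str]:
    """Return the ordered steps a request would take, as plain strings."""
    if pair is not None:
        return [
            f"prefill on {pair.prefill}",
            f"transfer KV cache {pair.prefill} -> {pair.decode}",
            f"decode on {pair.decode}, stream tokens to client",
        ]
    if standard_pod is not None:
        # Fallback: both phases run locally on one pod.
        return [f"prefill and decode on {standard_pod}"]
    return ["no pod available"]  # assumed outcome, not documented behavior


for step in plan_request(Pair("prefill-0", "decode-0"), "standard-0"):
    print(step)
print(plan_request(None, "standard-0"))
```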
Supported Engines#
| Engine | Label value | Notes |
|---|---|---|
| vLLM | vllm | Default. No extra labels required. |
| SGLang | | Requires |
| TensorRT-LLM | | Uses NIXL KV transfer backend ( |
Set the engine on each pod with the model.aibrix.ai/engine label.
Step 1 — Label Your Pods#
The gateway identifies the role of each pod using two labels:
| Label | Value | Purpose |
|---|---|---|
| role-name | prefill or decode | Tells the gateway which phase this pod handles. Standard inference pods omit this label. |
| roleset-name | any string (e.g. group-0) | Groups a prefill pod and a decode pod into a pair. The gateway only uses pairs where both prefill and decode pods are present. |
A prefill pod template looks like this:
metadata:
  labels:
    model.aibrix.ai/name: my-model
    model.aibrix.ai/port: "8000"
    model.aibrix.ai/engine: vllm
    role-name: prefill
    roleset-name: group-0
A decode pod template looks like this:
metadata:
  labels:
    model.aibrix.ai/name: my-model
    model.aibrix.ai/port: "8000"
    model.aibrix.ai/engine: vllm
    role-name: decode
    roleset-name: group-0
Note
Both pods in a pair must share the same roleset-name. If a roleset has only a prefill pod or only a decode pod, the gateway skips that entire roleset.
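The pairing rule in the note can be sketched as a grouping step. This is an illustration only, not the gateway's code: the pod dictionaries and the `complete_rolesets` helper are hypothetical, while the label keys and values are the ones shown in the templates above.

```python
from collections import defaultdict
from typing import Dict, List


def complete_rolesets(pods: List[dict]) -> Dict[str, Dict[str, List[str]]]:
    """Group pods by roleset-name and keep only rolesets that have both roles."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(
        lambda: {"prefill": [], "decode": []}
    )
    for pod in pods:
        labels = pod["labels"]
        role = labels.get("role-name")
        roleset = labels.get("roleset-name")
        # Pods without a role-name label are standard inference pods.
        if role in ("prefill", "decode") and roleset:
            grouped[roleset][role].append(pod["name"])
    return {
        name: roles
        for name, roles in grouped.items()
        if roles["prefill"] and roles["decode"]
    }


pods = [
    {"name": "p0", "labels": {"role-name": "prefill", "roleset-name": "group-0"}},
    {"name": "d0", "labels": {"role-name": "decode", "roleset-name": "group-0"}},
    {"name": "p1", "labels": {"role-name": "prefill", "roleset-name": "group-1"}},
    {"name": "s0", "labels": {}},
]
print(complete_rolesets(pods))
```

Here `group-1` is dropped because it has a prefill pod but no decode pod, and `s0` is ignored because it carries no role label.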
Step 2 — Enable PD Routing#
Set routing-strategy: pd on individual requests, or configure it as the default via the model config annotation (recommended for production):
# Per-request override
curl http://${ENDPOINT}/v1/chat/completions \
  -H "routing-strategy: pd" \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'
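The same per-request override can be built from Python using only the standard library. This is a minimal sketch: `build_pd_request` is a hypothetical helper, and `localhost:8888` is a placeholder endpoint, not a documented default. The request is constructed but not sent.

```python
import json
import urllib.request


def build_pd_request(endpoint: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a chat-completions request that asks the gateway for PD routing."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=body,
        headers={"routing-strategy": "pd", "Content-Type": "application/json"},
        method="POST",
    )


req = build_pd_request("localhost:8888", "my-model", "Hello")
print(req.get_method(), req.full_url)
# To actually send it against a running gateway:
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```

Note that `urllib.request.Request` normalizes header names to a capitalized form internally; HTTP header names are case-insensitive, so the gateway still sees the same header.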
To make pd the default for a model, add the config annotation to the pod template (see Config Profiles in the Gateway Routing guide):
annotations:
  model.aibrix.ai/config: |
    {
      "profiles": {
        "default": { "routingStrategy": "pd" }
      }
    }
Step 3 — Add Standard Inference Pods (Optional)#
Note
Standard inference pods are entirely optional. A pure prefill/decode deployment works without them. They add a second execution path that the gateway can use to absorb overflow, handle prompts outside your configured prompt-length buckets, and smooth out traffic spikes, without a separate deployment or any client-side changes.
With standard inference pods in place, the gateway falls back to one of them when PD capacity is saturated or a request falls outside the configured bucket ranges, instead of rejecting the request or queueing it indefinitely. This can improve effective throughput and tail latency, and it gives teams an incremental migration path from a standard deployment to full PD disaggregation.
Standard inference pods run both prefill and decode on a single GPU. They serve as overflow capacity when:
The request’s prompt length falls outside all configured buckets.
All prefill/decode pairs are at capacity.
You want a gradual migration path (run standard inference pods alongside disaggregated pairs).
To configure a standard inference pod, set combined: true in the pod’s routingConfig annotation and enable prompt-length bucketing on the gateway:
metadata:
  labels:
    model.aibrix.ai/name: my-model
    model.aibrix.ai/port: "8000"
    model.aibrix.ai/engine: vllm
    # No role-name: prefill/decode, so this is a standard inference pod
  annotations:
    model.aibrix.ai/config: |
      {
        "profiles": {
          "default": {
            "routingStrategy": "pd",
            "routingConfig": { "combined": true }
          }
        }
      }
Enable prompt-length bucketing on the gateway plugin (add to its environment):
# In your gateway plugin Helm values or Deployment env
env:
  - name: AIBRIX_PROMPT_LENGTH_BUCKETING
    value: "true"
With bucketing enabled, the gateway considers a standard inference pod as a candidate only when the request’s prompt length falls within the pod’s configured range (see below). Without a range configured, a standard inference pod accepts any prompt length.
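The candidacy rule in the previous paragraph can be written as a small predicate. This is a sketch under stated assumptions, not the gateway's code: `standard_pod_is_candidate` is a hypothetical name, and treating configured ranges as ignored when bucketing is disabled is an inference from the statement that bucketing only takes effect when the environment variable is set.

```python
from typing import Optional


def standard_pod_is_candidate(
    prompt_len: int,
    bucketing_enabled: bool,
    min_len: Optional[int] = None,
    max_len: Optional[int] = None,
) -> bool:
    """Decide whether a standard inference pod may serve this prompt length."""
    if not bucketing_enabled:
        return True  # assumption: ranges are ignored when bucketing is off
    if min_len is not None and prompt_len < min_len:
        return False
    if max_len is not None and prompt_len > max_len:
        return False
    return True  # no range configured, or the length is inside it


print(standard_pod_is_candidate(5000, True, min_len=2048))
print(standard_pod_is_candidate(100, True, min_len=2048))
print(standard_pod_is_candidate(100, True))
```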
Prompt-Length Bucketing#
Bucketing lets you assign different pods to different prompt-length ranges. This is useful when:
Short prompts are compute-cheap and can share a pod.
Long prompts need dedicated resources.
You want to prevent long-prompt requests from starving short-prompt traffic.
Configure the range in the pod’s routingConfig:
annotations:
  model.aibrix.ai/config: |
    {
      "profiles": {
        "default": {
          "routingStrategy": "pd",
          "routingConfig": {
            "promptLenBucketMinLength": 0,
            "promptLenBucketMaxLength": 2048
          }
        }
      }
    }
| Field (inside routingConfig) | Description |
|---|---|
| promptLenBucketMinLength | Minimum prompt token length (inclusive) this pod handles. Default: |
| promptLenBucketMaxLength | Maximum prompt token length (inclusive) this pod handles. Default: unlimited. Set to |
| | |
| | How to score prefill pods. |
| | How to score decode pods. |
Note
Bucketing only takes effect when AIBRIX_PROMPT_LENGTH_BUCKETING=true is set on the gateway plugin.
Complete Example#
This example shows a three-tier setup: prefill + decode pods for short prompts, and standard inference pods for long prompts or overflow.
Prefill pod (short prompts: 0–2048 tokens):
metadata:
  labels:
    model.aibrix.ai/name: my-model
    model.aibrix.ai/port: "8000"
    model.aibrix.ai/engine: vllm
    role-name: prefill
    roleset-name: group-0
  annotations:
    model.aibrix.ai/config: |
      {
        "profiles": {
          "default": {
            "routingStrategy": "pd",
            "routingConfig": {
              "promptLenBucketMinLength": 0,
              "promptLenBucketMaxLength": 2048
            }
          }
        }
      }
Decode pod (paired with the prefill pod above):
metadata:
  labels:
    model.aibrix.ai/name: my-model
    model.aibrix.ai/port: "8000"
    model.aibrix.ai/engine: vllm
    role-name: decode
    roleset-name: group-0
  annotations:
    model.aibrix.ai/config: |
      {
        "profiles": {
          "default": {
            "routingStrategy": "pd",
            "routingConfig": {
              "promptLenBucketMinLength": 0,
              "promptLenBucketMaxLength": 2048
            }
          }
        }
      }
Standard inference pod (long prompts: 2048+ tokens, and overflow):
metadata:
  labels:
    model.aibrix.ai/name: my-model
    model.aibrix.ai/port: "8000"
    model.aibrix.ai/engine: vllm
  annotations:
    model.aibrix.ai/config: |
      {
        "profiles": {
          "default": {
            "routingStrategy": "pd",
            "routingConfig": {
              "combined": true,
              "promptLenBucketMinLength": 2048
            }
          }
        }
      }
Gateway plugin (enable bucketing):
gatewayPlugin:
  env:
    - name: AIBRIX_PROMPT_LENGTH_BUCKETING
      value: "true"
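To check which pods in this example are eligible for a given prompt length, the annotation values can be parsed and compared against the inclusive bounds. The sketch below is illustrative only: `eligible_pods` and the `ANNOTATIONS` dictionary are hypothetical, and it covers bucket matching alone, not load or scoring.

```python
import json
from typing import Dict, List

PD_CONFIG = """
{"profiles": {"default": {"routingStrategy": "pd",
  "routingConfig": {"promptLenBucketMinLength": 0,
                    "promptLenBucketMaxLength": 2048}}}}
"""
STANDARD_CONFIG = """
{"profiles": {"default": {"routingStrategy": "pd",
  "routingConfig": {"combined": true,
                    "promptLenBucketMinLength": 2048}}}}
"""
# The model.aibrix.ai/config annotation of each pod in the example above.
ANNOTATIONS: Dict[str, str] = {
    "prefill": PD_CONFIG,
    "decode": PD_CONFIG,
    "standard": STANDARD_CONFIG,
}


def eligible_pods(prompt_len: int) -> List[str]:
    """List the pods whose inclusive bucket range contains prompt_len."""
    matches = []
    for pod, raw in ANNOTATIONS.items():
        cfg = json.loads(raw)["profiles"]["default"]["routingConfig"]
        low = cfg.get("promptLenBucketMinLength")   # None: no lower bound here
        high = cfg.get("promptLenBucketMaxLength")  # None: no upper bound
        if (low is None or prompt_len >= low) and (high is None or prompt_len <= high):
            matches.append(pod)
    return matches


for n in (100, 2048, 5000):
    print(n, eligible_pods(n))
```

Because both bounds are inclusive, a prompt of exactly 2048 tokens matches the PD pair and the standard pod in this configuration. If disjoint buckets are wanted, one option is to start the standard pod's range at 2049.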
Environment Variables#
These are set on the gateway plugin deployment.
| Variable | Default | Description |
|---|---|---|
| AIBRIX_PROMPT_LENGTH_BUCKETING | | Enable prompt-length bucket matching for prefill, decode, and standard inference pods. |
| | | Seconds before a prefill request to a prefill pod times out. |
| | | Default scoring policy for selecting prefill pods. |
| | | Default scoring policy for selecting decode pods. |
| | | KV transfer backend. |
| | | Minimum request-count spread between prefill pods before load-imbalance routing kicks in. |
| | | Minimum request-count spread between decode pods before load-imbalance routing kicks in. |
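One way to read the two load-imbalance thresholds is as a check on the gap between the busiest and the least busy pod. The sketch below illustrates that reading and is not the gateway's algorithm: `pick_on_imbalance` is a hypothetical helper, and both the "max minus min" definition of spread and the `>=` comparison are assumptions.

```python
from typing import Dict, Optional


def pick_on_imbalance(request_counts: Dict[str, int], min_spread: int) -> Optional[str]:
    """Return the least-loaded pod when the spread reaches min_spread.

    Returns None when the pods are balanced enough, meaning the caller
    would fall through to its regular scoring policy.
    """
    if not request_counts:
        return None
    least = min(request_counts, key=request_counts.get)
    most = max(request_counts, key=request_counts.get)
    spread = request_counts[most] - request_counts[least]
    if spread >= min_spread:  # assumed comparison; the source does not specify
        return least
    return None


counts = {"prefill-0": 12, "prefill-1": 3, "prefill-2": 5}
print(pick_on_imbalance(counts, min_spread=8))
print(pick_on_imbalance(counts, min_spread=16))
```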