Quickstart
Installing AIBrix in your Kubernetes Cluster
Install AIBrix Components
Once your Kubernetes cluster is ready, run the following commands to install the AIBrix components in it.
Note
To install only specific components or a specific version, see the installation guidance for more options. AIBrix also provides a Helm chart; see the installation guidance for details.
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.6.0/aibrix-dependency-v0.6.0.yaml
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.6.0/aibrix-core-v0.6.0.yaml
Wait a few minutes, then run kubectl get pods -n aibrix-system to check the pod status until all pods are ready.
NAME                                         READY   STATUS    RESTARTS   AGE
aibrix-controller-manager-56576666d6-gsl8s   1/1     Running   0          5h24m
aibrix-gateway-plugins-c6cb7545-r4xwj        1/1     Running   0          5h24m
aibrix-gpu-optimizer-89b9d9895-t8wnq         1/1     Running   0          5h24m
aibrix-kuberay-operator-6dcf94b49f-l4522     1/1     Running   0          5h24m
aibrix-metadata-service-6b4d44d5bd-h5g2r     1/1     Running   0          5h24m
aibrix-redis-master-84769768cb-fsq45         1/1     Running   0          5h24m
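The readiness check above can also be scripted. The following is a minimal Python sketch, not part of AIBrix: the helper name all_pods_ready is hypothetical, and it assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...) shown in the sample output.

```python
def all_pods_ready(kubectl_output: str) -> bool:
    """Return True when every pod row is Running with all containers ready."""
    rows = [line.split() for line in kubectl_output.strip().splitlines()]
    if len(rows) < 2:  # header only, or no output at all
        return False
    # Skip the header row; only the first three columns are needed.
    for _name, ready, status, *_rest in rows[1:]:
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            return False
    return True


# Hypothetical output in which one pod has not started yet.
sample = """\
NAME                                         READY   STATUS    RESTARTS   AGE
aibrix-controller-manager-56576666d6-gsl8s   1/1     Running   0          5h24m
aibrix-redis-master-84769768cb-fsq45         0/1     Pending   0          10s
"""
print(all_pods_ready(sample))  # False
```

To run it against a live cluster, pass in the stdout of kubectl get pods -n aibrix-system, for example captured with subprocess.run(..., capture_output=True, text=True), and poll until the function returns True.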
Deploy base model
Save the following YAML as model.yaml and run kubectl apply -f model.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b # Note: the value of the `model.aibrix.ai/name` label must match the Service name.
    model.aibrix.ai/port: "8000"
  name: deepseek-r1-distill-llama-8b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-r1-distill-llama-8b
      model.aibrix.ai/port: "8000"
  template:
    metadata:
      labels:
        model.aibrix.ai/name: deepseek-r1-distill-llama-8b
        model.aibrix.ai/port: "8000"
    spec:
      containers:
        - command:
            - vllm
            - serve
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - deepseek-ai/DeepSeek-R1-Distill-Llama-8B
            - --served-model-name
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
            - deepseek-r1-distill-llama-8b
            - --max-model-len
            - "12288" # 12k context length; avoids the "The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache" error.
          image: vllm/vllm-openai:v0.11.0
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            failureThreshold: 3
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            failureThreshold: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          startupProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            failureThreshold: 30
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
Ensure that:
The Service name matches the model.aibrix.ai/name label value in the Deployment.
The --served-model-name argument value in the Deployment command is also consistent with the Service name and the model.aibrix.ai/name label.
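The two consistency rules above can be checked mechanically before applying a manifest. This is a minimal sketch, assuming the Deployment has already been loaded into a Python dict (for example with a YAML parser) and that the Service name is passed in separately; check_model_names is a hypothetical helper, not an AIBrix API.

```python
LABEL_KEY = "model.aibrix.ai/name"


def check_model_names(deployment: dict, service_name: str) -> list:
    """Return a list of mismatch descriptions; an empty list means all names agree."""
    problems = []
    label = deployment["metadata"]["labels"].get(LABEL_KEY)
    if label != service_name:
        problems.append(f"label {LABEL_KEY}={label!r} != Service name {service_name!r}")

    pod_spec = deployment["spec"]["template"]["spec"]
    for container in pod_spec["containers"]:
        command = container.get("command", [])
        if "--served-model-name" not in command:
            continue
        idx = command.index("--served-model-name")
        # The flag's value is the next element of the command list.
        served = command[idx + 1] if idx + 1 < len(command) else None
        if served != service_name:
            problems.append(f"--served-model-name {served!r} != Service name {service_name!r}")
    return problems


# Trimmed-down stand-in for the Deployment in model.yaml.
deployment = {
    "metadata": {"labels": {LABEL_KEY: "deepseek-r1-distill-llama-8b"}},
    "spec": {"template": {"spec": {"containers": [
        {"command": ["vllm", "serve", "--served-model-name", "deepseek-r1-distill-llama-8b"]},
    ]}}},
}
print(check_model_names(deployment, "deepseek-r1-distill-llama-8b"))  # []
```

Returning a list of messages rather than a boolean lets a single run report every mismatch at once.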
Deploy Prefill-Decode (PD) Disaggregation Model
Save the following YAML as pd-model.yaml and run kubectl apply -f pd-model.yaml.
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: vllm-1p1d
spec:
  replicas: 1
  updateStrategy:
    type: InPlaceUpdate
  stateful: true
  selector:
    matchLabels:
      app: vllm-1p1d
  template:
    metadata:
      labels:
        app: vllm-1p1d
    spec:
      roles:
        - name: prefill
          replicas: 1
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: deepseek-r1-distill-llama-8b
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
            spec:
              containers:
                - name: prefill
                  image: aibrix/vllm-openai:v0.9.2-cu128-nixl-v0.4.1
                  command: ["sh", "-c"]
                  args:
                    - |
                      vllm serve \
                        --host "0.0.0.0" \
                        --port "8000" \
                        --uvicorn-log-level warning \
                        --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
                        --served-model-name deepseek-r1-distill-llama-8b \
                        --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
                  env:
                    - name: PYTHONHASHSEED
                      value: "1047"
                    - name: VLLM_SERVER_DEV_MODE
                      value: "1"
                    - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                      value: "0.0.0.0"
                    - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                      value: "5558"
                    - name: VLLM_WORKER_MULTIPROC_METHOD
                      value: spawn
                    - name: VLLM_ENABLE_V1_MULTIPROCESSING
                      value: "0"
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "7"
                    - name: NCCL_DEBUG
                      value: "INFO"
                    - name: UCX_TLS
                      value: ^gga
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                    requests:
                      nvidia.com/gpu: 1
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK
        - name: decode
          replicas: 1
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: deepseek-r1-distill-llama-8b
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
            spec:
              containers:
                - name: decode
                  image: aibrix/vllm-openai:v0.9.2-cu128-nixl-v0.4.1
                  command: ["sh", "-c"]
                  args:
                    - |
                      vllm serve \
                        --host "0.0.0.0" \
                        --port "8000" \
                        --uvicorn-log-level warning \
                        --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
                        --served-model-name deepseek-r1-distill-llama-8b \
                        --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
                  env:
                    - name: PYTHONHASHSEED
                      value: "1047"
                    - name: VLLM_SERVER_DEV_MODE
                      value: "1"
                    - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                      value: "0.0.0.0"
                    - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                      value: "5558"
                    - name: VLLM_WORKER_MULTIPROC_METHOD
                      value: spawn
                    - name: VLLM_ENABLE_V1_MULTIPROCESSING
                      value: "0"
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "7"
                    - name: NCCL_DEBUG
                      value: "INFO"
                    - name: UCX_TLS
                      value: ^gga
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                    requests:
                      nvidia.com/gpu: 1
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK
Note
We use an AIBrix-enhanced vLLM image with KVCache and NIXL support for disaggregated inference. For detailed information about available images, compatibility, and build instructions, see AIBrix Container Images.
Invoke the model endpoint using gateway API
Depending on where you deployed AIBrix, use either of the following options to query the gateway.
# Option 1: Kubernetes cluster with LoadBalancer support
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
# Option 2: Dev environment without LoadBalancer support. Use port forwarding instead.
kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
ENDPOINT="localhost:8888"
Attention
Some cloud providers, such as AWS EKS, expose the endpoint in the hostname field. In that case, use .status.loadBalancer.ingress[0].hostname instead.
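The choice between the ip and hostname fields can be made automatically. The following is a minimal sketch, assuming its input is the JSON printed by kubectl get svc/<gateway-service> -n envoy-gateway-system -o json; gateway_endpoint is a hypothetical helper name and the hostname in the example is made up.

```python
import json


def gateway_endpoint(service_json: str, port: int = 80) -> str:
    """Build "<address>:<port>" from a Service's LoadBalancer status."""
    svc = json.loads(service_json)
    ingress = svc.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    if not ingress:
        raise ValueError("no LoadBalancer ingress assigned; use port forwarding instead")
    first = ingress[0]
    # Prefer the IP; fall back to the hostname field used by some cloud providers.
    address = first.get("ip") or first.get("hostname")
    if not address:
        raise ValueError("ingress entry has neither ip nor hostname")
    return f"{address}:{port}"


ip_style = json.dumps({"status": {"loadBalancer": {"ingress": [{"ip": "101.18.0.4"}]}}})
host_style = json.dumps(
    {"status": {"loadBalancer": {"ingress": [{"hostname": "gateway.example.com"}]}}}
)
print(gateway_endpoint(ip_style))    # 101.18.0.4:80
print(gateway_endpoint(host_style))  # gateway.example.com:80
```

Raising an error when no ingress entry exists makes the missing-LoadBalancer case explicit, which is the situation Option 2 (port forwarding) is meant for.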
# list models
curl -v http://${ENDPOINT}/v1/models/
# completion api
curl -v http://${ENDPOINT}/v1/completions \
-H "Content-Type: application/json" \
-H "routing-strategy: random" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"prompt": "San Francisco is a",
"max_tokens": 128,
"temperature": 0
}'
# chat completion api
curl http://${ENDPOINT}/v1/chat/completions \
-H "Content-Type: application/json" \
-H "routing-strategy: random" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "help me write a random generator in python"}
]
}'
Note
To test PD disaggregation, set the routing-strategy header to pd. For example:
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "routing-strategy: pd" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "help me write a random generator in python"}
],
"temperature": 0.7
}'
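The curl calls above differ only in the request body and in the value of the routing-strategy header. The following minimal sketch builds the same chat completion request with the Python standard library, so the strategy becomes a plain parameter. The helper name build_chat_request is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_chat_request(endpoint, model, messages, routing_strategy="random"):
    """Build (but do not send) a POST to the gateway's chat completions API."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "routing-strategy": routing_strategy,
        },
        method="POST",
    )


req = build_chat_request(
    "localhost:8888",  # assumes the port-forward from Option 2 is running
    "deepseek-r1-distill-llama-8b",
    [{"role": "user", "content": "help me write a random generator in python"}],
    routing_strategy="pd",
)
print(req.full_url)  # http://localhost:8888/v1/chat/completions

# To actually send it once the gateway is reachable:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Keeping request construction separate from sending makes it easy to inspect the headers before any traffic reaches the gateway.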
# OpenAI Python client
import os

from openai import OpenAI

# ENDPOINT already includes the port, e.g. "localhost:8888"; export it in the shell first.
endpoint = os.environ["ENDPOINT"]
client = OpenAI(
    base_url=f"http://{endpoint}/v1",
    api_key="OPENAI_API_KEY",
    default_headers={"routing-strategy": "least-request"},
)
completion = client.chat.completions.create(
    model="deepseek-r1-distill-llama-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of California?"},
    ],
)
print(completion.choices[0].message.content)
// multiturn conversation
package main

import (
    "context"
    "os"

    "github.com/openai/openai-go"
    "github.com/openai/openai-go/option"
)

func main() {
    // ENDPOINT already includes the port, e.g. "localhost:8888"; export it in the shell first.
    client := openai.NewClient(
        option.WithBaseURL("http://"+os.Getenv("ENDPOINT")+"/v1"),
        option.WithAPIKey("OPENAI_API_KEY"),
        option.WithHeader("routing-strategy", "prefix-cache"),
    )
    // First turn.
    chatCompletion, err := client.Chat.Completions.New(context.TODO(), openai.ChatCompletionNewParams{
        Messages: []openai.ChatCompletionMessageParamUnion{
            openai.SystemMessage("You are a helpful assistant."),
            openai.UserMessage("What is the capital of California?"),
        },
        Model: "deepseek-r1-distill-llama-8b",
    })
    if err != nil {
        panic(err)
    }
    println(chatCompletion.Choices[0].Message.Content)
    // Second turn: replay the history, including the assistant's first answer.
    chatCompletion, err = client.Chat.Completions.New(context.TODO(), openai.ChatCompletionNewParams{
        Messages: []openai.ChatCompletionMessageParamUnion{
            openai.SystemMessage("You are a helpful assistant."),
            openai.UserMessage("What is the capital of California?"),
            openai.AssistantMessage(chatCompletion.Choices[0].Message.Content),
            openai.UserMessage("What is the largest county of california?"),
        },
        Model: "deepseek-r1-distill-llama-8b",
    })
    if err != nil {
        panic(err)
    }
    println(chatCompletion.Choices[0].Message.Content)
}
If you have problems exposing external IPs, the following command can help you debug. In the sample output, 101.18.0.4 is the IP of the gateway service.
kubectl get svc -n envoy-gateway-system
NAME                                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                                   AGE
envoy-aibrix-system-aibrix-eg-903790dc   LoadBalancer   10.96.239.246   101.18.0.4    80:32079/TCP                              10d
envoy-gateway                            ClusterIP      10.96.166.226   <none>        18000/TCP,18001/TCP,18002/TCP,19001/TCP   10d
For advanced development usage, please refer to the Development section.