Quickstart

Installing AIBrix in your Kubernetes Cluster

Install AIBrix Components

Get your Kubernetes cluster ready, then run the following commands to install the AIBrix components in it.

Note

To install only specific components or a specific version, see the installation guidance for more options. AIBrix also provides a Helm chart; the installation guidance covers it in detail.

kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.6.0/aibrix-dependency-v0.6.0.yaml
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.6.0/aibrix-core-v0.6.0.yaml

Wait a few minutes, then run kubectl get pods -n aibrix-system to check the pod status until all pods are ready.

NAME                                         READY   STATUS    RESTARTS   AGE
aibrix-controller-manager-56576666d6-gsl8s   1/1     Running   0          5h24m
aibrix-gateway-plugins-c6cb7545-r4xwj        1/1     Running   0          5h24m
aibrix-gpu-optimizer-89b9d9895-t8wnq         1/1     Running   0          5h24m
aibrix-kuberay-operator-6dcf94b49f-l4522     1/1     Running   0          5h24m
aibrix-metadata-service-6b4d44d5bd-h5g2r     1/1     Running   0          5h24m
aibrix-redis-master-84769768cb-fsq45         1/1     Running   0          5h24m
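
The readiness check above can also be scripted. The following is a minimal sketch, not part of AIBrix: `unready_pods` is a hypothetical helper that takes the plain-text table printed by kubectl get pods -n aibrix-system and returns the pods that are not yet Running with all containers ready. It assumes the default column layout shown above (NAME, READY, STATUS first).

```python
def unready_pods(table: str) -> list:
    """Return the names of pods that are not Running with every container ready.

    `table` is the plain-text output of `kubectl get pods -n aibrix-system`.
    """
    lines = [line for line in table.strip().splitlines() if line.strip()]
    unready = []
    for line in lines[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            unready.append(name)
    return unready


sample = """\
NAME                                         READY   STATUS              RESTARTS   AGE
aibrix-controller-manager-56576666d6-gsl8s   1/1     Running             0          2m
aibrix-redis-master-84769768cb-fsq45         0/1     ContainerCreating   0          2m
"""
print(unready_pods(sample))  # prints ['aibrix-redis-master-84769768cb-fsq45']
```

An empty result means every listed pod is ready. For a live check, pass in the captured stdout of the kubectl command; if the table has no pod rows at all, the helper also returns an empty list, so confirm separately that the pods exist.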

Deploy base model

Save the following YAML as model.yaml and run kubectl apply -f model.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b # Note: the value of the `model.aibrix.ai/name` label must match the Service name.
    model.aibrix.ai/port: "8000"
  name: deepseek-r1-distill-llama-8b
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-r1-distill-llama-8b
      model.aibrix.ai/port: "8000"
  template:
    metadata:
      labels:
        model.aibrix.ai/name: deepseek-r1-distill-llama-8b
        model.aibrix.ai/port: "8000"
    spec:
      containers:
        - command:
            - vllm
            - serve
            - --host
            - "0.0.0.0"
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - deepseek-ai/DeepSeek-R1-Distill-Llama-8B
            - --served-model-name
            # Note: The `--served-model-name` argument value must also match the Service name and the Deployment label `model.aibrix.ai/name`
            - deepseek-r1-distill-llama-8b
            - --max-model-len
            - "12288" # 12k length; this avoids the "The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache" error.
          image: vllm/vllm-openai:v0.11.0
          imagePullPolicy: IfNotPresent
          name: vllm-openai
          ports:
            - containerPort: 8000
              protocol: TCP
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            failureThreshold: 3
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            failureThreshold: 5
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1
          startupProbe:
            httpGet:
              path: /health
              port: 8000
              scheme: HTTP
            failureThreshold: 30
            periodSeconds: 5
            successThreshold: 1
            timeoutSeconds: 1

Ensure that:

The Service name matches the model.aibrix.ai/name label value in the Deployment.
The --served-model-name argument value in the Deployment command matches both the Service name and the model.aibrix.ai/name label.
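
The two naming rules above can be checked before applying a manifest. This is a minimal sketch under stated assumptions: `check_model_names` is a hypothetical helper (AIBrix does not ship it), the Deployment has already been loaded into a Python dict (for example with a YAML parser), and the container command is given as a list of arguments as in model.yaml, not as a single sh -c string.

```python
def check_model_names(deployment: dict, service_name: str) -> list:
    """Return a list of mismatch descriptions; an empty list means the names agree."""
    problems = []

    # Rule 1: the model.aibrix.ai/name label must equal the Service name.
    labels = deployment.get("metadata", {}).get("labels", {})
    label = labels.get("model.aibrix.ai/name")
    if label != service_name:
        problems.append(
            f"label model.aibrix.ai/name is {label!r}, expected {service_name!r}"
        )

    # Rule 2: the value following --served-model-name must equal the Service name.
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for container in containers:
        argv = list(container.get("command", [])) + list(container.get("args", []))
        if "--served-model-name" not in argv:
            continue
        idx = argv.index("--served-model-name")
        served = argv[idx + 1] if idx + 1 < len(argv) else None
        if served != service_name:
            problems.append(
                f"--served-model-name is {served!r}, expected {service_name!r}"
            )
    return problems


# Trimmed-down stand-in for the Deployment in model.yaml.
deployment = {
    "metadata": {"labels": {"model.aibrix.ai/name": "deepseek-r1-distill-llama-8b"}},
    "spec": {"template": {"spec": {"containers": [
        {"command": ["vllm", "serve",
                     "--served-model-name", "deepseek-r1-distill-llama-8b"]},
    ]}}},
}
print(check_model_names(deployment, "deepseek-r1-distill-llama-8b"))  # prints []
```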

Deploy Prefill-Decode (PD) Disaggregation Model

Save the following YAML as pd-model.yaml and run kubectl apply -f pd-model.yaml.

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: vllm-1p1d
spec:
  replicas: 1
  updateStrategy:
    type: InPlaceUpdate
  stateful: true
  selector:
    matchLabels:
      app: vllm-1p1d
  template:
    metadata:
      labels:
        app: vllm-1p1d
    spec:
      roles:
        - name: prefill
          replicas: 1
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: deepseek-r1-distill-llama-8b
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
            spec:
              containers:
                - name: prefill
                  image: aibrix/vllm-openai:v0.9.2-cu128-nixl-v0.4.1
                  command: ["sh", "-c"]
                  args:
                    - |
                      vllm serve \
                      --host "0.0.0.0" \
                      --port "8000" \
                      --uvicorn-log-level warning \
                      --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
                      --served-model-name deepseek-r1-distill-llama-8b \
                      --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
                  env:
                    - name: PYTHONHASHSEED
                      value: "1047"
                    - name: VLLM_SERVER_DEV_MODE
                      value: "1"
                    - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                      value: "0.0.0.0"
                    - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                      value: "5558"
                    - name: VLLM_WORKER_MULTIPROC_METHOD
                      value: spawn
                    - name: VLLM_ENABLE_V1_MULTIPROCESSING
                      value: "0"
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "7"
                    - name: NCCL_DEBUG
                      value: "INFO"
                    - name: UCX_TLS
                      value: ^gga
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                    requests:
                      nvidia.com/gpu: 1
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK
        - name: decode
          replicas: 1
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: deepseek-r1-distill-llama-8b
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
            spec:
              containers:
                - name: decode
                  image: aibrix/vllm-openai:v0.9.2-cu128-nixl-v0.4.1
                  command: ["sh", "-c"]
                  args:
                    - |
                      vllm serve \
                      --host "0.0.0.0" \
                      --port "8000" \
                      --uvicorn-log-level warning \
                      --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
                      --served-model-name deepseek-r1-distill-llama-8b \
                      --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
                  env:
                    - name: PYTHONHASHSEED
                      value: "1047"
                    - name: VLLM_SERVER_DEV_MODE
                      value: "1"
                    - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                      value: "0.0.0.0"
                    - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                      value: "5558"
                    - name: VLLM_WORKER_MULTIPROC_METHOD
                      value: spawn
                    - name: VLLM_ENABLE_V1_MULTIPROCESSING
                      value: "0"
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "7"
                    - name: NCCL_DEBUG
                      value: "INFO"
                    - name: UCX_TLS
                      value: ^gga
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                    requests:
                      nvidia.com/gpu: 1
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK

Note

We use an AIBrix-enhanced vLLM image with KVCache and NIXL support for disaggregated inference. For detailed information about available images, compatibility, and build instructions, see AIBrix Container Images.

Invoke the model endpoint using gateway API

Depending on where you deployed AIBrix, use either of the following options to reach the gateway.

# Option 1: Kubernetes cluster with LoadBalancer support
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"

# Option 2: Dev environment without LoadBalancer support. Use port forwarding instead.
kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
ENDPOINT="localhost:8888"

Attention

Some cloud providers, such as AWS EKS, expose the endpoint in the hostname field. In that case, use .status.loadBalancer.ingress[0].hostname instead.
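
The ip-or-hostname choice can be handled in one place. The sketch below is illustrative only: `gateway_address` is a hypothetical helper that takes the JSON printed by kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o json and returns host:port, preferring the ip field and falling back to hostname. The default port of 80 mirrors Option 1 above.

```python
import json


def gateway_address(service_json: str, port: int = 80) -> str:
    """Build "host:port" from a Service object serialized as JSON."""
    service = json.loads(service_json)
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    if not ingress:
        raise ValueError("the Service has no load balancer ingress assigned yet")
    entry = ingress[0]
    # Some providers fill in `ip`, others fill in `hostname`.
    host = entry.get("ip") or entry.get("hostname")
    if not host:
        raise ValueError("the ingress entry has neither ip nor hostname")
    return f"{host}:{port}"


ip_based = '{"status": {"loadBalancer": {"ingress": [{"ip": "101.18.0.4"}]}}}'
print(gateway_address(ip_based))  # prints 101.18.0.4:80
```

A ValueError here usually means the cluster has no LoadBalancer support, in which case the port-forwarding option applies.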

# list models
curl -v http://${ENDPOINT}/v1/models/

# completion api
curl -v http://${ENDPOINT}/v1/completions \
    -H "Content-Type: application/json" \
    -H "routing-strategy: random" \
    -d '{
        "model": "deepseek-r1-distill-llama-8b",
        "prompt": "San Francisco is a",
        "max_tokens": 128,
        "temperature": 0
    }'

# chat completion api
curl http://${ENDPOINT}/v1/chat/completions \
-H "Content-Type: application/json" \
-H "routing-strategy: random" \
-d '{
    "model": "deepseek-r1-distill-llama-8b",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "help me write a random generator in python"}
    ]
}'

Note

To test PD disaggregation, set the routing-strategy header to pd. For example:

curl -v http://${ENDPOINT}/v1/chat/completions \
-H "routing-strategy: pd" \
-H "Content-Type: application/json" \
-d '{
    "model": "deepseek-r1-distill-llama-8b",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "help me write a random generator in python"}
    ],
    "temperature": 0.7
}'
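
The same request can be built with only the Python standard library, which makes it easy to switch the routing strategy per call. This is a sketch, not an AIBrix API: `build_chat_request` is a hypothetical helper, and the endpoint value is assumed to be the host:port string stored in ENDPOINT above.

```python
import json
import urllib.request


def build_chat_request(endpoint, model, messages, routing_strategy="random"):
    """Build (but do not send) a chat completion request for the gateway."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "routing-strategy": routing_strategy,  # "pd" selects PD disaggregation
        },
        method="POST",
    )


request = build_chat_request(
    "localhost:8888",
    "deepseek-r1-distill-llama-8b",
    [{"role": "user", "content": "help me write a random generator in python"}],
    routing_strategy="pd",
)
print(request.full_url)  # prints http://localhost:8888/v1/chat/completions
```

Passing the returned object to urllib.request.urlopen sends it. Note that urllib normalizes the capitalization of header names, which is harmless because HTTP header names are case-insensitive.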

import os

from openai import OpenAI

# Read the gateway address from the ENDPOINT environment variable
# (run `export ENDPOINT` in the shell first), e.g. "localhost:8888".
endpoint = os.environ["ENDPOINT"]

client = OpenAI(base_url=f"http://{endpoint}/v1", api_key="OPENAI_API_KEY",
                default_headers={'routing-strategy': 'least-request'})

completion = client.chat.completions.create(
    model="deepseek-r1-distill-llama-8b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of California?"}
    ]
)
print(completion.choices[0].message.content)

// multiturn conversation
package main

import (
    "context"
    "fmt"
    "log"
    "os"

    "github.com/openai/openai-go"
    "github.com/openai/openai-go/option"
)

func main() {
    // ENDPOINT already includes the port (for example "localhost:8888");
    // export it in the shell before running this program.
    client := openai.NewClient(
        option.WithBaseURL("http://"+os.Getenv("ENDPOINT")+"/v1"),
        option.WithAPIKey("OPENAI_API_KEY"),
        option.WithHeader("routing-strategy", "prefix-cache"),
    )

    // First turn.
    chatCompletion, err := client.Chat.Completions.New(context.TODO(), openai.ChatCompletionNewParams{
        Messages: []openai.ChatCompletionMessageParamUnion{
            openai.SystemMessage("You are a helpful assistant."),
            openai.UserMessage("What is the capital of California?"),
        },
        Model: "deepseek-r1-distill-llama-8b",
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(chatCompletion.Choices[0].Message.Content)

    // Second turn: replay the history, including the assistant's previous answer.
    chatCompletion, err = client.Chat.Completions.New(context.TODO(), openai.ChatCompletionNewParams{
        Messages: []openai.ChatCompletionMessageParamUnion{
            openai.SystemMessage("You are a helpful assistant."),
            openai.UserMessage("What is the capital of California?"),
            openai.AssistantMessage(chatCompletion.Choices[0].Message.Content),
            openai.UserMessage("What is the largest county of California?"),
        },
        Model: "deepseek-r1-distill-llama-8b",
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(chatCompletion.Choices[0].Message.Content)
}

If you have problems exposing an external IP, debug with the following command. In the sample output, 101.18.0.4 is the external IP of the gateway service.

kubectl get svc -n envoy-gateway-system
NAME                                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                                   AGE
envoy-aibrix-system-aibrix-eg-903790dc   LoadBalancer   10.96.239.246   101.18.0.4    80:32079/TCP                              10d
envoy-gateway                            ClusterIP      10.96.166.226   <none>        18000/TCP,18001/TCP,18002/TCP,19001/TCP   10d

For advanced development usage, please refer to the Development section.
