Volcano Engine#

Introduction#

This guide describes how to deploy AIBrix on Volcano Engine Kubernetes Engine (VKE).

Steps#

AIBrix Installation#

  1. Make sure you already have a VKE cluster up and running.

  2. Install AIBrix on VKE

kubectl apply -k config/overlays/vke/dependency --server-side

helm install aibrix dist/chart -f dist/chart/vke.yaml -n aibrix-system --create-namespace
  3. Wait until all components are up and running.
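
The final step above can be scripted with `kubectl wait`. This is a minimal sketch rather than part of the official procedure: the function name `wait_for_aibrix` and the 300s default timeout are illustrative assumptions. It also assumes every pod in `aibrix-system` is expected to reach the Ready condition. If the namespace contains completed Job pods, the wait will time out on them.

```shell
# Hypothetical helper: block until all pods in the aibrix-system namespace are Ready.
# The default timeout (300s) is an assumption; pass a different value as $1.
wait_for_aibrix() {
  local timeout="${1:-300s}"
  kubectl wait pods --all \
    --namespace aibrix-system \
    --for=condition=Ready \
    --timeout="${timeout}"
}
```

Call it as `wait_for_aibrix` or, for a slower cluster, `wait_for_aibrix 600s`. A non-zero exit status means at least one pod did not become Ready in time.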

Download Model in TOS#

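Download the model into TOS, then create a secret in the cluster that holds your TOS credentials. Create the secret in the same namespace as the model Deployment, because the init container reads it through secretKeyRef.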

kubectl create secret generic tos-credential --from-literal=TOS_ACCESS_KEY=<YOUR_ACCESS_KEY> --from-literal=TOS_SECRET_KEY=<YOUR_SECRET_KEY>
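
To confirm that the secret holds both keys without printing their values, something like the following could be used. It is only a sketch: the function name `secret_keys` is hypothetical, and it assumes the secret lives in the namespace of your current kubectl context.

```shell
# Hypothetical helper: print only the key names stored in a secret, never the values.
# Defaults to the tos-credential secret created above.
secret_keys() {
  local name="${1:-tos-credential}"
  kubectl get secret "${name}" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
}
```

Running `secret_keys` should list `TOS_ACCESS_KEY` and `TOS_SECRET_KEY`, the two keys the init container references.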

Deploy base model#

Save the following YAML as model.yaml and run kubectl apply -f model.yaml.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-distill-llama-8b
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
    model.aibrix.ai/port: "8000"
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-r1-distill-llama-8b
  template:
    metadata:
      labels:
        model.aibrix.ai/name: deepseek-r1-distill-llama-8b
        model.aibrix.ai/port: "8000"
      annotations:
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
    spec:
      initContainers:
        - command:
            - aibrix_download
            - --model-uri
            - tos://aibrix-artifact-testing/models/DeepSeek-R1-Distill-Llama-8B/
            - --local-dir
            - /models/
          env:
            - name: DOWNLOADER_NUM_CONNECTIONS
              value: "16"
            - name: DOWNLOADER_NUM_THREADS
              value: "16"
            - name: DOWNLOADER_ALLOW_FILE_SUFFIX
              value: json, safetensors
            - name: TOS_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  key: TOS_ACCESS_KEY
                  name: tos-credential
            - name: TOS_SECRET_KEY
              valueFrom:
                secretKeyRef:
                  key: TOS_SECRET_KEY
                  name: tos-credential
            - name: TOS_ENDPOINT
              value: https://tos-s3-cn-beijing.ivolces.com
            - name: TOS_REGION
              value: cn-beijing
          image: aibrix-public-release-cn-beijing.cr.volces.com/aibrix/runtime:v0.5.0
          name: init-model
          volumeMounts:
            - mountPath: /models
              name: model-hostpath
      containers:
        - name: vllm-openai
          image: aibrix-public-release-cn-beijing.cr.volces.com/vllm/vllm-openai:0.11.0
          imagePullPolicy: Always
          command:
            - vllm
            - serve
            - /models/DeepSeek-R1-Distill-Llama-8B/
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --trust-remote-code
            - --served-model-name
            - deepseek-r1-distill-llama-8b
            - --disable-fastapi-docs
          volumeMounts:
            - mountPath: /models
              name: model-hostpath
          resources:
            limits:
              nvidia.com/gpu: "1"
              cpu: "12"
              memory: "48G"
            requests:
              nvidia.com/gpu: "1"
              cpu: "12"
              memory: "48G"
      volumes:
        - name: model-hostpath
          hostPath:
            path: /data01/models/
            type: DirectoryOrCreate

---

apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: deepseek-r1-distill-llama-8b # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
  type: ClusterIP
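
After applying the manifest, it can help to watch the rollout and check that the init container finished downloading the model. This is a sketch under assumptions: the function name `check_base_model`, the 20m timeout, and the 20-line log tail are illustrative choices, not values from the AIBrix documentation.

```shell
# Hypothetical helper: wait for the base-model Deployment to roll out, then show
# the last log lines of the init-model container that downloads the weights.
check_base_model() {
  local name="${1:-deepseek-r1-distill-llama-8b}"
  kubectl rollout status "deployment/${name}" --timeout=20m &&
    kubectl logs "deployment/${name}" -c init-model --tail=20
}
```

If the rollout does not finish in time, the logs are skipped and the function returns a non-zero status. In that case `kubectl describe pod` on the pending pod usually shows whether the image pull, the download, or GPU scheduling is the blocker.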

Deploy Prefill-Decode (PD) Disaggregation Model#

Save the following YAML as pd-model.yaml and run kubectl apply -f pd-model.yaml.

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: vllm-1p1d
spec:
  replicas: 1
  updateStrategy:
    type: InPlaceUpdate
  stateful: true
  selector:
    matchLabels:
      app: vllm-1p1d
  template:
    metadata:
      labels:
        app: vllm-1p1d
    spec:
      roles:
        - name: prefill
          replicas: 1
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: deepseek-r1-distill-llama-8b
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
            spec:
              containers:
                - name: prefill
                  image: aibrix-public-release-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.10.2-aibrix-v0.5.0-nixl-0.7.1-20251123
                  command: ["sh", "-c"]
                  args:
                    - |
                      vllm serve \
                      --host "0.0.0.0" \
                      --port "8000" \
                      --uvicorn-log-level warning \
                      --model /models/DeepSeek-R1-Distill-Llama-8B \
                      --served-model-name deepseek-r1-distill-llama-8b \
                      --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
                  env:
                    - name: PYTHONHASHSEED
                      value: "1047"
                    - name: VLLM_SERVER_DEV_MODE
                      value: "1"
                    - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                      value: "0.0.0.0"
                    - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                      value: "5558"
                    - name: VLLM_WORKER_MULTIPROC_METHOD
                      value: spawn
                    - name: VLLM_ENABLE_V1_MULTIPROCESSING
                      value: "0"
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                    requests:
                      nvidia.com/gpu: 1
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK
                  volumeMounts:
                    - mountPath: /models
                      name: model-hostpath
              volumes:
                - name: model-hostpath
                  hostPath:
                    path: /root/models
                    type: DirectoryOrCreate
        - name: decode
          replicas: 1
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: deepseek-r1-distill-llama-8b
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
            spec:
              containers:
                - name: decode
                  image: aibrix-public-release-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.10.2-aibrix-v0.5.0-nixl-0.7.1-20251123
                  command: ["sh", "-c"]
                  args:
                    - |
                      vllm serve \
                      --host "0.0.0.0" \
                      --port "8000" \
                      --uvicorn-log-level warning \
                      --model /models/DeepSeek-R1-Distill-Llama-8B \
                      --served-model-name deepseek-r1-distill-llama-8b \
                      --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
                  env:
                    - name: PYTHONHASHSEED
                      value: "1047"
                    - name: VLLM_SERVER_DEV_MODE
                      value: "1"
                    - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                      value: "0.0.0.0"
                    - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                      value: "5558"
                    - name: VLLM_WORKER_MULTIPROC_METHOD
                      value: spawn
                    - name: VLLM_ENABLE_V1_MULTIPROCESSING
                      value: "0"
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "7"
                    - name: NCCL_DEBUG
                      value: "INFO"
                    - name: UCX_TLS
                      value: ^gga
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                    requests:
                      nvidia.com/gpu: 1
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK
                  volumeMounts:
                    - mountPath: /models
                      name: model-hostpath
              volumes:
                - name: model-hostpath
                  hostPath:
                    path: /root/models
                    type: DirectoryOrCreate
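
Once the StormService is applied, one way to check that both roles came up is to select the pods by the model label used in the role templates. This is a sketch: the function name `wait_for_pd_pods` and the 20m default timeout are assumptions, and it assumes the labels in each role template end up on the pods. If the base-model Deployment is running in the same namespace, its pods carry the same label and will be matched too.

```shell
# Hypothetical helper: list the prefill and decode pods, then wait until they are Ready.
# Assumes the role template label model.aibrix.ai/name is applied to the pods.
wait_for_pd_pods() {
  local selector="model.aibrix.ai/name=deepseek-r1-distill-llama-8b"
  kubectl get pods -l "${selector}" -o wide &&
    kubectl wait pods -l "${selector}" \
      --for=condition=Ready \
      --timeout="${1:-20m}"
}
```

With one prefill and one decode replica, the listing should show two pods, each requesting one GPU.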

Inference#

Once the model is ready and running, you can test it by running:

LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"

# Use "routing-strategy: pd" instead if you deployed in disaggregation mode.
curl http://${ENDPOINT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "routing-strategy: random" \
  -d '{
      "model": "deepseek-r1-distill-llama-8b",
      "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "help me write a random generator in python"}
      ]
  }'
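
To switch between routing strategies without editing the command each time, the request can be wrapped in a small function. This is an illustrative sketch: `chat_completion` and its argument order are hypothetical, and the prompt is pasted into the JSON body without escaping, so it must not contain double quotes or backslashes.

```shell
# Hypothetical helper: send one chat request through the gateway.
# Usage: chat_completion <endpoint> <routing-strategy> <prompt>
# The prompt is inserted verbatim into the JSON body (no escaping is performed).
chat_completion() {
  local endpoint="$1" strategy="$2" prompt="$3"
  curl -sS "http://${endpoint}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "routing-strategy: ${strategy}" \
    -d "{\"model\": \"deepseek-r1-distill-llama-8b\", \"messages\": [{\"role\": \"user\", \"content\": \"${prompt}\"}]}"
}
```

For example, `chat_completion "${ENDPOINT}" random "hello"` targets the base deployment, and `chat_completion "${ENDPOINT}" pd "hello"` targets the prefill-decode deployment.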