Volcano Engine#
Introduction#
This guide describes how to deploy AIBrix on Volcano Engine Kubernetes Engine (VKE).
Steps#
AIBrix Installation#
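This guide assumes you already have a VKE cluster up and running.
Install AIBrix on VKE by running the following commands from the root of the AIBrix repository (the paths are relative to it):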
kubectl apply -k config/overlays/vke/dependency --server-side
helm install aibrix dist/chart -f dist/chart/vke.yaml -n aibrix-system --create-namespace
Wait until all AIBrix components are running.
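One way to wait is sketched below with `kubectl wait`, which blocks until the pods report Ready. The namespace `aibrix-system` comes from the helm command above. The function name `wait_for_aibrix` and the 600s default timeout are assumptions chosen for illustration.

```shell
# Block until every pod in the namespace reports Ready, or fail after the timeout.
# $1: namespace (defaults to the one used by the helm install above)
# $2: timeout (600s is an assumed default; tune it for your cluster)
wait_for_aibrix() {
  kubectl wait --for=condition=Ready pods --all \
    -n "${1:-aibrix-system}" --timeout="${2:-600s}"
}
```

Call `wait_for_aibrix` with no arguments, then inspect the result with `kubectl get pods -n aibrix-system`. If the namespace contains pods from completed Jobs, they never become Ready and the wait would time out on them, so a label selector may be needed in that case.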
Download Model in TOS#
Make sure the model files are available in a TOS bucket, then create the TOS credential as a secret in the cluster:
kubectl create secret generic tos-credential --from-literal=TOS_ACCESS_KEY=<YOUR_ACCESS_KEY> --from-literal=TOS_SECRET_KEY=<YOUR_SECRET_KEY>
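The model pods read this secret through `secretKeyRef`, so it has to exist in the namespace where they run. As a quick check, the sketch below lists only the key names stored in a secret, without printing the values. The helper name `secret_keys` and the `default` namespace fallback are assumptions for illustration.

```shell
# Print the key names stored in a secret; the values are not shown.
# $1: secret name, $2: namespace (assumed to be "default" when omitted)
secret_keys() {
  kubectl get secret "$1" -n "${2:-default}" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}'
}
```

Running `secret_keys tos-credential` should list `TOS_ACCESS_KEY` and `TOS_SECRET_KEY` if the secret was created with the command above.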
Deploy base model#
Save the following YAML as model.yaml and run kubectl apply -f model.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-distill-llama-8b
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
    model.aibrix.ai/port: "8000"
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-r1-distill-llama-8b
  template:
    metadata:
      labels:
        model.aibrix.ai/name: deepseek-r1-distill-llama-8b
        model.aibrix.ai/port: "8000"
      annotations:
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
    spec:
      initContainers:
        - command:
            - aibrix_download
            - --model-uri
            - tos://aibrix-artifact-testing/models/DeepSeek-R1-Distill-Llama-8B/
            - --local-dir
            - /models/
          env:
            - name: DOWNLOADER_NUM_CONNECTIONS
              value: "16"
            - name: DOWNLOADER_NUM_THREADS
              value: "16"
            - name: DOWNLOADER_ALLOW_FILE_SUFFIX
              value: json, safetensors
            - name: TOS_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  key: TOS_ACCESS_KEY
                  name: tos-credential
            - name: TOS_SECRET_KEY
              valueFrom:
                secretKeyRef:
                  key: TOS_SECRET_KEY
                  name: tos-credential
            - name: TOS_ENDPOINT
              value: https://tos-s3-cn-beijing.ivolces.com
            - name: TOS_REGION
              value: cn-beijing
          image: aibrix-public-release-cn-beijing.cr.volces.com/aibrix/runtime:v0.5.0
          name: init-model
          volumeMounts:
            - mountPath: /models
              name: model-hostpath
      containers:
        - name: vllm-openai
          image: aibrix-public-release-cn-beijing.cr.volces.com/vllm/vllm-openai:0.11.0
          imagePullPolicy: Always
          command:
            - vllm
            - serve
            - /models/DeepSeek-R1-Distill-Llama-8B/
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --trust-remote-code
            - --served-model-name
            - deepseek-r1-distill-llama-8b
            - --disable-fastapi-docs
          volumeMounts:
            - mountPath: /models
              name: model-hostpath
          resources:
            limits:
              nvidia.com/gpu: "1"
              cpu: "12"
              memory: "48G"
            requests:
              nvidia.com/gpu: "1"
              cpu: "12"
              memory: "48G"
      volumes:
        - name: model-hostpath
          hostPath:
            path: /data01/models/
            type: DirectoryOrCreate
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: deepseek-r1-distill-llama-8b # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
  type: ClusterIP
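The init container has to finish downloading the model from TOS before the vLLM container starts, so the first rollout can take a while. The sketch below wraps two standard kubectl commands for following progress. The helper names and the 30m default timeout are assumptions for illustration; the container name `init-model` comes from the manifest above.

```shell
# Wait for the Deployment rollout to finish.
# $1: deployment name, $2: timeout (30m is an assumed default)
check_model_rollout() {
  kubectl rollout status "deployment/$1" --timeout="${2:-30m}"
}

# Show the logs of the model download init container.
# $1: deployment name
init_download_logs() {
  kubectl logs "deployment/$1" -c init-model
}
```

For example, run `check_model_rollout deepseek-r1-distill-llama-8b`, and if it stalls, `init_download_logs deepseek-r1-distill-llama-8b` shows whether the download is still in progress or failing on credentials.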
Deploy Prefill-Decode (PD) Disaggregation Model#
Save the following YAML as pd-model.yaml and run kubectl apply -f pd-model.yaml.
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: vllm-1p1d
spec:
  replicas: 1
  updateStrategy:
    type: InPlaceUpdate
  stateful: true
  selector:
    matchLabels:
      app: vllm-1p1d
  template:
    metadata:
      labels:
        app: vllm-1p1d
    spec:
      roles:
        - name: prefill
          replicas: 1
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: deepseek-r1-distill-llama-8b
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
            spec:
              containers:
                - name: prefill
                  image: aibrix-public-release-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.10.2-aibrix-v0.5.0-nixl-0.7.1-20251123
                  command: ["sh", "-c"]
                  args:
                    - |
                      vllm serve \
                        --host "0.0.0.0" \
                        --port "8000" \
                        --uvicorn-log-level warning \
                        --model /models/DeepSeek-R1-Distill-Llama-8B \
                        --served-model-name deepseek-r1-distill-llama-8b
                  env:
                    - name: PYTHONHASHSEED
                      value: "1047"
                    - name: VLLM_SERVER_DEV_MODE
                      value: "1"
                    - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                      value: "0.0.0.0"
                    - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                      value: "5558"
                    - name: VLLM_WORKER_MULTIPROC_METHOD
                      value: spawn
                    - name: VLLM_ENABLE_V1_MULTIPROCESSING
                      value: "0"
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                    requests:
                      nvidia.com/gpu: 1
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK
                  volumeMounts:
                    - mountPath: /models
                      name: model-hostpath
              volumes:
                - name: model-hostpath
                  hostPath:
                    path: /root/models
                    type: DirectoryOrCreate
        - name: decode
          replicas: 1
          stateful: true
          template:
            metadata:
              labels:
                model.aibrix.ai/name: deepseek-r1-distill-llama-8b
                model.aibrix.ai/port: "8000"
                model.aibrix.ai/engine: vllm
            spec:
              containers:
                - name: decode
                  image: aibrix-public-release-cn-beijing.cr.volces.com/aibrix/vllm-openai:v0.10.2-aibrix-v0.5.0-nixl-0.7.1-20251123
                  command: ["sh", "-c"]
                  args:
                    - |
                      vllm serve \
                        --host "0.0.0.0" \
                        --port "8000" \
                        --uvicorn-log-level warning \
                        --model /models/DeepSeek-R1-Distill-Llama-8B \
                        --served-model-name deepseek-r1-distill-llama-8b \
                        --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
                  env:
                    - name: PYTHONHASHSEED
                      value: "1047"
                    - name: VLLM_SERVER_DEV_MODE
                      value: "1"
                    - name: VLLM_NIXL_SIDE_CHANNEL_HOST
                      value: "0.0.0.0"
                    - name: VLLM_NIXL_SIDE_CHANNEL_PORT
                      value: "5558"
                    - name: VLLM_WORKER_MULTIPROC_METHOD
                      value: spawn
                    - name: VLLM_ENABLE_V1_MULTIPROCESSING
                      value: "0"
                    - name: GLOO_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_SOCKET_IFNAME
                      value: eth0
                    - name: NCCL_IB_DISABLE
                      value: "0"
                    - name: NCCL_IB_GID_INDEX
                      value: "7"
                    - name: NCCL_DEBUG
                      value: "INFO"
                    - name: UCX_TLS
                      value: ^gga
                  resources:
                    limits:
                      nvidia.com/gpu: 1
                    requests:
                      nvidia.com/gpu: 1
                  securityContext:
                    capabilities:
                      add:
                        - IPC_LOCK
                  volumeMounts:
                    - mountPath: /models
                      name: model-hostpath
              volumes:
                - name: model-hostpath
                  hostPath:
                    path: /root/models
                    type: DirectoryOrCreate
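Both the prefill and decode role templates above carry the label `model.aibrix.ai/name`, so a label selector is one way to see the pods created for the model and the nodes they landed on. This is a minimal sketch; the helper name `model_pods` is an assumption for illustration. If the base model Deployment from the previous step is also running, its pods share the same label and show up in the same listing.

```shell
# List all pods serving a given model, with node and IP columns.
# $1: value of the model.aibrix.ai/name label
model_pods() {
  kubectl get pods -l "model.aibrix.ai/name=$1" -o wide
}
```

For the manifest above, `model_pods deepseek-r1-distill-llama-8b` should show one prefill pod and one decode pod once both are scheduled.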
Inference#
Once the model is ready and running, you can test it by running:
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"
# Set routing-strategy to `pd` if you deployed in disaggregation mode.
curl http://${ENDPOINT}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "routing-strategy: random" \
  -d '{
    "model": "deepseek-r1-distill-llama-8b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "help me write a random generator in python"}
    ]
  }'
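To switch between the two deployment modes without editing the command each time, the request can be wrapped in a small function that takes the routing strategy as a parameter. This is a sketch only: the helper name `chat` is an assumption, and the prompt is interpolated into the JSON body without escaping, so it must not contain double quotes or backslashes.

```shell
# Send a single-turn chat request through the gateway.
# $1: endpoint as host:port, $2: routing strategy (random or pd), $3: prompt text
chat() {
  curl -sS "http://$1/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "routing-strategy: $2" \
    -d "{\"model\": \"deepseek-r1-distill-llama-8b\", \"messages\": [{\"role\": \"user\", \"content\": \"$3\"}]}"
}
```

For example, `chat "${ENDPOINT}" pd "hello"` targets the disaggregated deployment. Since the endpoint follows the OpenAI chat completions format, piping the output through `jq -r '.choices[0].message.content'` prints just the reply text, assuming `jq` is installed.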