KVCache Offloading
Note
If you’re not yet familiar with the KVCache L1 and L2 concepts, please refer to the AIBrix KVCache Offloading Framework document.
Note
The AIBrix KVCache Offloading framework can be used as a standalone component — there’s no need to install the entire AIBrix stack.
Note
Since v0.4.0, both vLLM V0 and V1 connectors are supported.
Note
Since v0.5.0, both vLLM and SGLang are supported.
Warning
Currently, only the FlashAttention and XFormers attention backends are supported.
L1 Cache Example
Note
We use a customized version of vLLM integrated with AIBrix Offloading Connectors to showcase the usage.
Before deploying the inference engine, please use kubectl get pods -n aibrix-system and kubectl get pods -n envoy-gateway-system to ensure envoy-gateway and aibrix-gateway are running. Other components are optional.
$ kubectl get pods -n aibrix-system
NAME READY STATUS RESTARTS AGE
aibrix-controller-manager-586dd9f868-465dz 1/1 Running 0 16h
aibrix-gateway-plugins-5fcbcbfc84-h7qc8 1/1 Running 0 16h
aibrix-gpu-optimizer-66f49fd947-gfncm 1/1 Running 0 4d23h
aibrix-kuberay-operator-55f4d4d666-bd7hj 1/1 Running 0 4d23h
aibrix-metadata-service-6d5cc8ddd6-444mb 1/1 Running 0 4d23h
aibrix-redis-master-c9b4967c5-pdnkg 1/1 Running 0 16h
$ kubectl get pods -n envoy-gateway-system
NAME READY STATUS RESTARTS AGE
envoy-aibrix-system-aibrix-eg-903790dc-fd69b467d-6fg2z 2/2 Running 0 16h
envoy-gateway-5d48549b5c-6r4cd 1/1 Running 0 16h
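The readiness check above can also be scripted. The following is a minimal sketch, not part of AIBrix: the helper name `pods_not_ready` is hypothetical, and it assumes the default `kubectl get pods` column layout shown above (NAME, READY, STATUS, ...).

```python
def pods_not_ready(kubectl_output: str) -> list[str]:
    """Return names of pods that are not Running with all containers ready.

    Assumes the default `kubectl get pods` table layout (NAME READY STATUS ...).
    """
    not_ready = []
    for line in kubectl_output.strip().splitlines():
        fields = line.split()
        # Skip short lines, the header row, and echoed shell prompts.
        if len(fields) < 3 or fields[0] in ("NAME", "$"):
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            not_ready.append(name)
    return not_ready


# Sample taken from the envoy-gateway-system output above.
sample = """\
NAME READY STATUS RESTARTS AGE
envoy-aibrix-system-aibrix-eg-903790dc-fd69b467d-6fg2z 2/2 Running 0 16h
envoy-gateway-5d48549b5c-6r4cd 1/1 Running 0 16h
"""
print(pods_not_ready(sample))  # prints []
```

In practice the text could come from `subprocess.run(["kubectl", "get", "pods", "-n", "envoy-gateway-system"], capture_output=True, text=True).stdout`.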
Now let’s use the following yaml to create an engine deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-distill-llama-8b
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
    model.aibrix.ai/port: "8000"
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-r1-distill-llama-8b
  template:
    metadata:
      labels:
        model.aibrix.ai/name: deepseek-r1-distill-llama-8b
      annotations:
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
    spec:
      initContainers:
        - command:
            - aibrix_download
            - --model-uri
            - tos://aibrix-artifact-testing/models/DeepSeek-R1-Distill-Llama-8B/
            - --local-dir
            - /models/
          env:
            - name: DOWNLOADER_NUM_CONNECTIONS
              value: "16"
            - name: DOWNLOADER_NUM_THREADS
              value: "16"
            - name: DOWNLOADER_ALLOW_FILE_SUFFIX
              value: json, safetensors
            - name: TOS_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  key: TOS_ACCESS_KEY
                  name: tos-credential
            - name: TOS_SECRET_KEY
              valueFrom:
                secretKeyRef:
                  key: TOS_SECRET_KEY
                  name: tos-credential
            - name: TOS_ENDPOINT
              value: https://tos-s3-cn-beijing.ivolces.com
            - name: TOS_REGION
              value: cn-beijing
          image: aibrix-cn-beijing.cr.volces.com/aibrix/runtime:v0.3.0
          name: init-model
          volumeMounts:
            - mountPath: /models
              name: model-hostpath
      containers:
        - name: vllm-openai
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai-aibrix-kvcache:v0.10.2-20251022
          imagePullPolicy: Always
          command:
            - vllm
            - serve
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - /models/DeepSeek-R1-Distill-Llama-8B/
            - --trust-remote-code
            - --served-model-name
            - deepseek-r1-distill-llama-8b
            - --max-model-len
            - "32000" # please modify this field if your gpu has more room
            # - --enable-chunked-prefill
            - --disable-log-requests
            - --disable-fastapi-docs
            - --swap-space
            - "0"
            - --api-key
            - "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"
            - --no-enable-chunked-prefill
            - --kv-transfer-config
            - '{"kv_connector":"AIBrixOffloadingConnectorV1Type3", "kv_role":"kv_both"}'
          env:
            - name: VLLM_USE_V1
              value: "1"
            - name: AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED
              value: "1"
            # specify the eviction policy, default is S3FIFO
            - name: AIBRIX_KV_CACHE_OL_L1_CACHE_EVICTION_POLICY
              value: "S3FIFO"
            # specify the capacity of L1 cache, default is 10GB
            - name: AIBRIX_KV_CACHE_OL_L1_CACHE_CAPACITY_GB
              value: "80"
            - name: VLLM_RPC_TIMEOUT
              value: "1000000"
          volumeMounts:
            - mountPath: /models
              name: model-hostpath
          resources:
            limits:
              nvidia.com/gpu: "1"
              cpu: "10"
              memory: "120G"
            requests:
              nvidia.com/gpu: "1"
              cpu: "10"
              memory: "120G"
      volumes:
        - name: model-hostpath
          hostPath:
            path: /root/models
            type: DirectoryOrCreate

---

apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: deepseek-r1-distill-llama-8b # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
  type: ClusterIP
$ kubectl apply -f samples/kvcache/l1cache/vllm.yaml
deployment.apps/deepseek-r1-distill-llama-8b created
service/deepseek-r1-distill-llama-8b created
Note
Right now, the recommended connector for vLLM v0.10.2 is AIBrixOffloadingConnectorV1Type3. You can switch to another AIBrix connector if needed by changing the kv_connector parameter of --kv-transfer-config.
If you prefer to use vLLM V0, please set VLLM_USE_V1 to 0 and change the value of --kv-transfer-config from '{"kv_connector":"AIBrixOffloadingConnectorV1Type3", "kv_role":"kv_both"}' to '{"kv_connector":"AIBrixOffloadingConnector", "kv_role":"kv_both"}'.
AIBRIX_KV_CACHE_OL_L1_CACHE_CAPACITY_GB must be set to a value that fits within the pod memory resource requirement. For instance, if the pod memory resource requirement is P GB and the estimated memory consumption of the inference engine is E GB, we can set AIBRIX_KV_CACHE_OL_L1_CACHE_CAPACITY_GB to P / tensor-parallel-size - E.
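As a worked example of the capacity rule in the note above, the sketch below evaluates P / tensor-parallel-size - E. The function name and the choice to round down to a whole number of GB are assumptions made for illustration, and the engine estimate of 40 GB is a hypothetical figure, not a measured one.

```python
def l1_cache_capacity_gb(pod_memory_gb: float,
                         engine_memory_gb: float,
                         tensor_parallel_size: int = 1) -> int:
    """Evaluate P / tensor-parallel-size - E from the note above.

    pod_memory_gb        -- P, the pod memory resource requirement in GB
    engine_memory_gb     -- E, estimated memory use of the inference engine in GB
    tensor_parallel_size -- the engine's tensor parallel size
    """
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    capacity = pod_memory_gb / tensor_parallel_size - engine_memory_gb
    if capacity <= 0:
        raise ValueError("no memory left for the L1 cache with these inputs")
    # Rounding down is an assumption of this sketch, chosen to stay within budget.
    return int(capacity)


# P = 120 GB as in the Deployment above; E = 40 GB is a hypothetical estimate.
print(l1_cache_capacity_gb(120, 40, tensor_parallel_size=1))  # prints 80
```

The result would then be passed as the string value of AIBRIX_KV_CACHE_OL_L1_CACHE_CAPACITY_GB.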
Now let’s use the kubectl get pods command to ensure the inference service is running:
$ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
deepseek-r1-distill-llama-8b-6bb7c97459-lhh77 0/1 PodInitializing 0 87s
deepseek-r1-distill-llama-8b-6bb7c97459-lhh77 1/1 Running 0 4m44s
Once the inference service is running, let’s set up port forwarding so that we can test the service from the local machine:
Run kubectl get svc -n envoy-gateway-system to get the name of the Envoy Gateway service.
$ kubectl get svc -n envoy-gateway-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
envoy-aibrix-system-aibrix-eg-903790dc LoadBalancer 10.97.198.203 115.190.25.67 80:32269/TCP 5d3h
envoy-gateway ClusterIP 10.97.57.193 <none> 18000/TCP,18001/TCP,18002/TCP,19001/TCP 5d3h
Run kubectl -n envoy-gateway-system port-forward svc/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 & to set up port forwarding.
$ kubectl -n envoy-gateway-system port-forward svc/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
Forwarding from 127.0.0.1:8888 -> 10080
Forwarding from [::1]:8888 -> 10080
Now, let’s test the service:
curl -v "http://localhost:8888/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 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" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"messages": [{"role": "user", "content": "Created container vllm-openai"}],
"temperature": 0.7
}'
and its output would be:
* Trying [::1]:8888...
* Connected to localhost (::1) port 8888
> POST /v1/chat/completions HTTP/1.1
> Host: localhost:8888
> User-Agent: curl/8.4.0
> Accept: */*
> Content-Type: application/json
> Authorization: Bearer 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
> Content-Length: 173
>
Handling connection for 8888
< HTTP/1.1 200 OK
< x-went-into-req-headers: true
< date: Wed, 21 May 2025 00:52:06 GMT
< server: uvicorn
< content-type: application/json
< target-pod: 192.168.3.22:8000
< request-id: 34a19ba1-88f2-4aa0-b914-5a28609d6b0a
< transfer-encoding: chunked
<
{"id":"chatcmpl-4ae8be13-5bbf-4bc0-92b6-6e8814296c57","object":"chat.completion","created":1747788726,"model":"deepseek-r1-distill-llama-8b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Okay, so I need to create a container called \"vllm-openai\" using Docker. I'm a bit new to this, so I'll have to figure it out step by step. Let me start by understanding what a Docker container is. From what I know, a container is a lightweight virtualization layer that allows me to package and run applications in isolated environments called containers. Docker makes this process easier by managing the containers and their images.\n\nI want to create a container that's specifically for running OpenAI's VLLM (Voss-LSTM), which is an open-source implementation of the original VLLM model by OpenAI. So, the container should have everything necessary to run this model, including the required dependencies and the model itself.\n\nFirst, I'll need to get the OpenAI VLLM code. I think it's available on GitHub, so I'll clone the repository. Let me check the URL: it's probably something like https://github.com/openai/vllm-cpp. Once I have the code, I need to build it. The instructions likely mention using CMake for the build process. I'll have to make sure I have CMake installed on my system. If not, I'll need to install it using my package manager.\n\nAfter cloning and building the code, I need to create a Docker image. The Dockerfile will specify the base image, which should be something like Ubuntu 20.04 LTS since it's a common and supported version. I'll need to set the working directory and copy the built VLLM files into the container. Also, I should install any system dependencies that the VLLM might need, like libraries or tools.\n\nI remember that VLLM requires certain Python packages, so I'll need to install those inside the container. The requirements.txt file probably lists all the necessary packages. 
Using pip to install them within the container makes sense. Additionally, since the model is quite large, the container might need more memory and CPU resources. I'll set up a non-root user for better security practices and ensure the permissions are set correctly so that the container can run the model without issues.\n\nI also need to expose the necessary ports. The VLLM server might run on port 8080, so I'll map that port in the Docker setup. For testing, I can use curl or a web interface to send requests to this port and see if the model responds correctly.\n\nLet me outline the steps I'll take:\n\n1. Clone the VLLM repository.\n2. Build the VLLM using CMake.\n3. Create a Dockerfile that includes the base OS, build tools, system dependencies, and copies the built VLLM files.\n4. Install the required Python packages using pip.\n5. Set up the container with proper user permissions and resource limits.\n6. Build and run the container.\n7. Test the container by sending requests to the exposed port.\n\nI'm a bit unsure about some parts. For example, how to handle the build process in the Dockerfile? I think the Dockerfile will need to have the necessary CMake commands and possibly install build dependencies like build-essential. Also, I need to make sure that the container has enough memory allocated to run the VLLM model, which can be quite resource-intensive.\n\nAnother thing I'm not sure about is the user setup. Why do I need a non-root user? Isn't it easier to run everything as root? Well, running as a non-root user is more secure, especially since Docker containers have root privileges by default. So, I should create a user and switch to it before running the model.\n\nI should also think about how the container handles persistence. Since the VLLM model is built outside the container, the container will only have the necessary files. If I need to persist the model, I'll have to copy it into the container during the build process. 
Otherwise, each container restart will require rebuilding the model, which might be time-consuming.\n\nLet me think about the Dockerfile structure. It should start with a FROM instruction based on an Ubuntu image. Then, set the working directory, install build-essential and cmake, clone the repository, build it, and then copy the built files into the container. After that, I'll switch to a non-root user and install the Python dependencies.\n\nWait, but the VLLM requires certain libraries like TensorFlow? Or is it self-contained? I think the VLLM is a standalone model, so maybe it doesn't rely on external libraries beyond what's already in the build. But I should check the requirements to be sure.\n\nAlso, the model is quite large, so the container might take up a lot of disk space. I should consider using a larger disk or use a persistent volume if I need to keep the model data.\n\nI should also document the container, maybe add some notes on how to use it, like the exposed ports and any required environment variables. For example, the VLLM might need an API key or specific configurations to run.\n\nTesting is important. After building the container, I can run it and use curl to send a request to the exposed port. If the response is as expected, the container is working. If not, I'll have to troubleshoot, maybe checking the logs or ensuring all dependencies are correctly installed.\n\nI'm a bit worried about performance. The VLLM model is designed for research purposes, so it's going to be computationally heavy. I should set resource limits in the Docker run command to prevent it from using too much of the host system's resources.\n\nIn summary, the process involves setting up the build environment, compiling the VLLM code into a Docker image, installing necessary dependencies, and ensuring the container runs securely and efficiently.\n</think>\n\nTo create a Docker container for OpenAI's VLLM, follow these organized steps:\n\n### Step-by-Step Guide\n\n1. 
**Clone the VLLM Repository**\n - Clone the VLLM repository from GitHub:\n ```bash\n git clone https://github.com/openai/vllm-cpp.git\n ```\n - Navigate to the cloned directory:\n ```bash\n cd vllm-cpp\n ```\n\n2. **Build the VLLM**\n - Ensure you have CMake installed. If not, install it using your package manager.\n - Build the VLLM using CMake:\n ```bash\n mkdir build\n cd build\n cmake ..\n make\n ```\n\n3. **Create the Dockerfile**\n - Open a new file named `Dockerfile` and insert the following content:\n ```dockerfile\n FROM ubuntu:20.04\n\n WORKDIR /app\n\n # Install build tools\n RUN apt-get update && apt-get install -y build-essential cmake\n # Install system dependencies\n RUN apt-get install -y libboost-system-dev libboost-filesystem-dev \\\n libboost-chrono-dev libboost-serialization-dev libboost-headers\n # Copy the built VLLM files\n COPY build/vllm-cpp .\n # Install Python dependencies\n RUN useradd -m vllmuser && chown -R vllmuser:vllmuser .\n RUN pip install -r requirements.txt\n # Switch to non-root user\n USER vllmuser\n ```\n - **Note:** Replace `requirements.txt` with your actual file path or content if you haven't created one yet.\n\n4. **Build and Run the Container**\n - Build the Docker image:\n ```bash\n docker build -t vllm-openai .\n ```\n - Run the container, allocating enough resources (e.g., 4GB RAM and 4 CPUs):\n ```bash\n docker run -d --name vllm-openai \\\n -e \"HTTP_PROXY=http://proxy.example.com:8080\" \\\n -e \"HTTPS_PROXY=http://proxy.example.com:8080\" \\\n --ulimits cgroup:1 --cpu-shares 1 --memory 4g \\\n vllm-openai\n ```\n - Replace `proxy.example.com` with your actual proxy server if needed.\n\n5. 
**Test the Container**\n - Check if the container is running:\n ```bash\n docker ps\n ```\n - Use `curl` to test the API:\n ```bash\n curl http://localhost:8080\n ```\n - If the response is as expected, the container is functioning correctly.\n\n### Notes\n\n- **User Permissions:** The container uses a non-root user (`vllmuser`) for security reasons.\n- **Dependencies:** Ensure all system and Python dependencies are correctly installed as per the VLLM requirements.\n- **Resources:** Adjust CPU and memory allocations based on your system's capacity to handle the VLLM's computational demands.\n- **Volumes:** Consider using a persistent volume to store the VLLM model for longer-term use.\n\nBy following these steps, you'll have a containerized version of OpenAI's VLLM ready to run, ensuring security, efficiency, and ease of use.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":12,"total_tokens":1887,"completion_tokens":1875,"prompt_tokens_details":null},"prompt_logprobs":null}
* Connection #0 to host localhost left intact
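The same request can be issued from Python. This is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send` are hypothetical, and it assumes the port forwarding set up above is still active on localhost:8888. The block only builds the request; nothing is sent until `send` is called.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build the same POST /v1/chat/completions request as the curl command."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 300.0):
    """Send the request; return the target-pod response header and parsed body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read())
        return response.headers.get("target-pod"), body


request = build_chat_request(
    base_url="http://localhost:8888",
    api_key="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",  # --api-key from the Deployment
    model="deepseek-r1-distill-llama-8b",
    prompt="Created container vllm-openai",
)
print(request.get_method(), request.full_url)
# prints: POST http://localhost:8888/v1/chat/completions
```

With the port forward running, `target_pod, body = send(request)` would return the serving pod address from the `target-pod` header and the parsed JSON, whose `usage` field carries the token counts shown in the output above.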
L2 Cache Example
Let’s deploy the distributed KV cache cluster with the following YAML configuration:
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
  name: kvcache-cluster
  namespace: default
  annotations:
    kvcache.orchestration.aibrix.ai/backend: infinistore
    infinistore.kvcache.orchestration.aibrix.ai/link-type: "Ethernet"
    infinistore.kvcache.orchestration.aibrix.ai/hint-gid-index: "7"
spec:
  metadata:
    redis:
      runtime:
        image: aibrix-cn-beijing.cr.volces.com/aibrix/redis:7.4.2
        replicas: 1
        resources:
          requests:
            cpu: 1000m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 1Gi
  service:
    type: ClusterIP
    ports:
      - name: service
        port: 12345
        targetPort: 12345
        protocol: TCP
      - name: admin
        port: 8088
        targetPort: 8088
        protocol: TCP
  watcher:
    image: aibrix-cn-beijing.cr.volces.com/aibrix/kvcache-watcher:v0.3.0
    imagePullPolicy: Always
    resources:
      requests:
        cpu: "500m"
        memory: "256Mi"
      limits:
        cpu: "500m"
        memory: "256Mi"
  cache:
    replicas: 1
    image: aibrix-cn-beijing.cr.volces.com/aibrix/infinistore:v0.2.42-20250506
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: "10000m"
        memory: "120Gi"
        vke.volcengine.com/rdma: "1"
      limits:
        cpu: "10000m"
        memory: "120Gi"
        vke.volcengine.com/rdma: "1"
Note
The sample above sets replicas under spec.cache to 1. In this example we changed that value to 3, so three KV cache pods will be running in the cluster.
$ kubectl apply -f samples/kvcache/infinistore/kvcache.yaml
kvcache.orchestration.aibrix.ai/kvcache-cluster created
$ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
kvcache-cluster-0 0/1 ContainerCreating 0 59s
kvcache-cluster-kvcache-watcher-pod 1/1 Running 0 59s
kvcache-cluster-redis 1/1 Running 0 59s
kvcache-cluster-0 1/1 Running 0 2m43s
kvcache-cluster-1 0/1 Pending 0 0s
kvcache-cluster-1 0/1 Pending 0 0s
kvcache-cluster-1 0/1 ContainerCreating 0 0s
kvcache-cluster-1 0/1 ContainerCreating 0 2s
kvcache-cluster-1 1/1 Running 0 5s
kvcache-cluster-2 0/1 Pending 0 0s
kvcache-cluster-2 0/1 Pending 0 0s
kvcache-cluster-2 0/1 ContainerCreating 0 0s
kvcache-cluster-2 0/1 ContainerCreating 0 2s
kvcache-cluster-2 1/1 Running 0 4s
Now let’s use the following yaml to create an engine deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-distill-llama-8b
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
    model.aibrix.ai/port: "8000"
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-r1-distill-llama-8b
  template:
    metadata:
      labels:
        model.aibrix.ai/name: deepseek-r1-distill-llama-8b
      annotations:
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8000"
        prometheus.io/scrape: "true"
        k8s.volcengine.com/pod-networks: |
          [
            {
              "cniConf":{
                  "name":"rdma"
              }
            }
          ]
    spec:
      initContainers:
        - command:
            - aibrix_download
            - --model-uri
            - tos://aibrix-artifact-testing/models/DeepSeek-R1-Distill-Llama-8B/
            - --local-dir
            - /models/
          env:
            - name: DOWNLOADER_NUM_CONNECTIONS
              value: "16"
            - name: DOWNLOADER_NUM_THREADS
              value: "16"
            - name: DOWNLOADER_ALLOW_FILE_SUFFIX
              value: json, safetensors
            - name: TOS_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  key: TOS_ACCESS_KEY
                  name: tos-credential
            - name: TOS_SECRET_KEY
              valueFrom:
                secretKeyRef:
                  key: TOS_SECRET_KEY
                  name: tos-credential
            - name: TOS_ENDPOINT
              value: https://tos-s3-cn-beijing.ivolces.com
            - name: TOS_REGION
              value: cn-beijing
          image: aibrix-cn-beijing.cr.volces.com/aibrix/runtime:v0.3.0
          name: init-model
          volumeMounts:
            - mountPath: /models
              name: model-hostpath
      containers:
        - name: vllm-openai
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/vllm-openai-aibrix-kvcache:v0.10.2-20251022
          imagePullPolicy: Always
          command:
            - vllm
            - serve
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - /models/DeepSeek-R1-Distill-Llama-8B/
            - --trust-remote-code
            - --served-model-name
            - deepseek-r1-distill-llama-8b
            - --max-model-len
            - "32000" # please modify this field if your gpu has more room
            # - --enable-chunked-prefill
            - --disable-log-requests
            - --disable-fastapi-docs
            - --swap-space
            - "0"
            - --api-key
            - "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"
            - --no-enable-chunked-prefill
            - --kv-transfer-config
            - '{"kv_connector":"AIBrixOffloadingConnectorV1Type3", "kv_role":"kv_both"}'
          env:
            - name: VLLM_USE_V1
              value: "1"
            - name: AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED
              value: "0"
            - name: AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND
              value: "infinistore"
            - name: AIBRIX_KV_CACHE_OL_INFINISTORE_CONNECTION_TYPE
              value: "RDMA"
            - name: AIBRIX_KV_CACHE_OL_INFINISTORE_IB_PORT
              value: "1"
            - name: AIBRIX_KV_CACHE_OL_INFINISTORE_LINK_TYPE
              value: "Ethernet"
            # mlx5_1 is the device and 7 is the hinted gid index, if you do not know the gid, you can just type mlx5_1,mlx5_2,...
            - name: AIBRIX_KV_CACHE_OL_INFINISTORE_VISIBLE_DEV_LIST
              value: "mlx5_1:7,mlx5_2:7,mlx5_3:7,mlx5_4:7"
            - name: AIBRIX_KV_CACHE_OL_META_SERVICE_BACKEND
              value: "redis"
            - name: AIBRIX_KV_CACHE_OL_META_SERVICE_URL
              value: "redis://kvcache-cluster-redis:6379"
            - name: AIBRIX_KV_CACHE_OL_META_SERVICE_CLUSTER_META_KEY
              value: "kvcache_nodes"
            - name: VLLM_RPC_TIMEOUT
              value: "1000000"
          volumeMounts:
            - mountPath: /models
              name: model-hostpath
          resources:
            limits:
              nvidia.com/gpu: "1"
              vke.volcengine.com/rdma: "1"
              cpu: "10"
              memory: "120G"
            requests:
              nvidia.com/gpu: "1"
              vke.volcengine.com/rdma: "1"
              cpu: "10"
              memory: "120G"
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
      volumes:
        - name: model-hostpath
          hostPath:
            path: /root/models
            type: DirectoryOrCreate

---

apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: deepseek-r1-distill-llama-8b # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: deepseek-r1-distill-llama-8b
  type: ClusterIP
$ kubectl apply -f samples/kvcache/infinistore/vllm.yaml
deployment.apps/deepseek-r1-distill-llama-8b created
service/deepseek-r1-distill-llama-8b created
Note
Right now, the recommended connector for vLLM v0.10.2 is AIBrixOffloadingConnectorV1Type3. You can switch to another AIBrix connector if needed by changing the kv_connector parameter of --kv-transfer-config.
If you prefer to use vLLM V0, please set VLLM_USE_V1 to 0 and change the value of --kv-transfer-config from '{"kv_connector":"AIBrixOffloadingConnectorV1Type3", "kv_role":"kv_both"}' to '{"kv_connector":"AIBrixOffloadingConnector", "kv_role":"kv_both"}'.
In this example, we set AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED=0 to explicitly disable L1Cache and use L2Cache only.
The current version only supports using InfiniStore with RDMA transport. Please ensure AIBRIX_KV_CACHE_OL_INFINISTORE_CONNECTION_TYPE=RDMA is configured.
For InfiniBand, please set AIBRIX_KV_CACHE_OL_INFINISTORE_LINK_TYPE=IB. For RoCE, please keep the default value Ethernet.
AIBRIX_KV_CACHE_OL_INFINISTORE_VISIBLE_DEV_LIST configures which RDMA devices the engine can use to access remote KV cache servers. For instance, if you allocate 8 GPUs for the engine pod and set AIBRIX_KV_CACHE_OL_INFINISTORE_VISIBLE_DEV_LIST="mlx5_1,mlx5_2", then engine processes using GPU 0 to 3 will use mlx5_1, and engine processes using GPU 4 to 7 will use mlx5_2.
If GID indexes of RDMA devices are required in your environment, please append the GID index to each RDMA device (e.g., mlx5_1:6,mlx5_2:7) in AIBRIX_KV_CACHE_OL_INFINISTORE_VISIBLE_DEV_LIST.
AIBRIX_KV_CACHE_OL_META_SERVICE_URL points to the Redis instance managing KV cache cluster metadata. In this example, it is set to redis://kvcache-cluster-redis:6379, where kvcache-cluster is the KV cache deployment name.
AIBRIX_KV_CACHE_OL_META_SERVICE_BACKEND and AIBRIX_KV_CACHE_OL_META_SERVICE_CLUSTER_META_KEY are fixed in the current version and should not be modified.
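To make the device assignment in the note above concrete, here is a minimal sketch of how a device list such as "mlx5_1:6,mlx5_2:7" could be parsed and mapped to GPUs. Both function names are hypothetical, and the contiguous-block mapping is inferred only from the 8-GPU example in the note; the connector's actual logic may differ.

```python
from typing import Optional


def parse_visible_dev_list(value: str) -> list[tuple[str, Optional[int]]]:
    """Parse 'mlx5_1:6,mlx5_2:7' into (device, gid_index) pairs.

    The GID index is None when an entry has no ':<gid>' suffix.
    """
    devices = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, sep, gid = item.partition(":")
        devices.append((name, int(gid) if sep else None))
    return devices


def device_for_gpu(gpu_index: int, num_gpus: int,
                   devices: list[tuple[str, Optional[int]]]) -> tuple[str, Optional[int]]:
    """Pick the RDMA device for a GPU by splitting GPUs into contiguous blocks.

    Assumption of this sketch: GPUs are divided evenly, in order, across the
    listed devices, matching the 8-GPU / 2-device example in the note.
    """
    if not devices:
        raise ValueError("device list is empty")
    if not 0 <= gpu_index < num_gpus:
        raise ValueError("gpu_index out of range")
    gpus_per_device = -(-num_gpus // len(devices))  # ceiling division
    return devices[gpu_index // gpus_per_device]


devs = parse_visible_dev_list("mlx5_1,mlx5_2")
print([device_for_gpu(gpu, 8, devs)[0] for gpu in range(8)])
# prints ['mlx5_1', 'mlx5_1', 'mlx5_1', 'mlx5_1', 'mlx5_2', 'mlx5_2', 'mlx5_2', 'mlx5_2']
```

Parsing "mlx5_1:6,mlx5_2:7" with the same helper yields [("mlx5_1", 6), ("mlx5_2", 7)], which corresponds to the GID-index form described in the note.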
Now let’s use the kubectl get pods command to ensure the inference service is running:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
deepseek-r1-distill-llama-8b-6587db8894-pbbxk 1/1 Running 0 34s
kvcache-cluster-0 1/1 Running 0 7m55s
kvcache-cluster-1 1/1 Running 0 5m12s
kvcache-cluster-2 1/1 Running 0 5m7s
kvcache-cluster-kvcache-watcher-pod 1/1 Running 0 7m55s
kvcache-cluster-redis 1/1 Running 0 7m55s
Once the inference service is running, let’s set up port forwarding so that we can test the service from the local machine:
Run kubectl get svc -n envoy-gateway-system to get the name of the Envoy Gateway service.
$ kubectl get svc -n envoy-gateway-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
envoy-aibrix-system-aibrix-eg-903790dc LoadBalancer 10.97.198.203 115.190.25.67 80:32269/TCP 5d3h
envoy-gateway ClusterIP 10.97.57.193 <none> 18000/TCP,18001/TCP,18002/TCP,19001/TCP 5d3h
Run kubectl -n envoy-gateway-system port-forward svc/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 & to set up port forwarding.
$ kubectl -n envoy-gateway-system port-forward svc/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
Forwarding from 127.0.0.1:8888 -> 10080
Forwarding from [::1]:8888 -> 10080
Now, let’s test the service:
curl -v "http://localhost:8888/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer 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" \
-d '{
"model": "deepseek-r1-distill-llama-8b",
"messages": [{"role": "user", "content": "Created container vllm-openai"}],
"temperature": 0.7
}'
and its output would be:
* Trying [::1]:8888...
* Connected to localhost (::1) port 8888
> POST /v1/chat/completions HTTP/1.1
> Host: localhost:8888
> User-Agent: curl/8.4.0
> Accept: */*
> Content-Type: application/json
> Authorization: Bearer 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
> Content-Length: 173
>
Handling connection for 8888
< HTTP/1.1 200 OK
< x-went-into-req-headers: true
< date: Wed, 21 May 2025 05:44:04 GMT
< server: uvicorn
< content-type: application/json
< target-pod: 192.168.3.28:8000
< request-id: f9b291ae-fbce-4b63-bba5-c8f04d812cd0
< transfer-encoding: chunked
<
{"id":"chatcmpl-dce54e48-3c47-4d08-8ff9-dcec429fd486","object":"chat.completion","created":1747806244,"model":"deepseek-r1-distill-llama-8b","choices":[{"index":0,"message":{"role":"assistant","reasoning_content":null,"content":"Okay, so I'm trying to create a container for VLLM-OpenAI. I'm a bit new to this, so I need to figure out where to start. I know that VLLM stands for Very Large Language Model, and OpenAI has their own models like GPT-4 and others. But I'm not exactly sure how to create a container for it. \n\nFirst, I think I need to understand what a container is. From what I remember, containers are like lightweight virtual machines that you can use to package up an application and its dependencies. Docker is a popular tool for creating and managing these containers. So, I probably need to use Docker to create a container that runs VLLM-OpenAI.\n\nI should check if there's an official Docker image for VLLM-OpenAI. Maybe OpenAI provides one? If not, I might have to build one myself. Building from source code would mean I need access to the model's codebase, which I'm not sure about. I should look up if there's a public repository for VLLM-OpenAI.\n\nWait, I think OpenAI has released some of their models as open-source, but I'm not certain about VLLM specifically. I should search for \"VLLM-OpenAI Docker\" or \"VLLM-OpenAI container\" to see if someone has already created a container. Maybe there's a GitHub repository or a Docker Hub page with the image.\n\nIf I can't find an existing image, I'll have to create one myself. To do that, I need to know what dependencies the model requires. VLLM is based on LLMs, so it probably needs libraries like PyTorch or TensorFlow. Also, it might require specific versions of Python or other tools. I should look up the installation instructions for VLLM-OpenAI to identify the necessary dependencies.\n\nI'll need a Dockerfile. The Dockerfile will include a base image, install the dependencies, and copy the model's code. 
I should make sure to use the correct Python version, as some models might have compatibility issues. I'll also need a requirements.txt file to list all the necessary Python packages and their versions.\n\nOnce the Dockerfile is set up, I can build the container using Docker. The command would be something like `docker build -t vllm-openai .` where `vllm-openai` is the name of the container. After building, I can run it with `docker run -it vllm-openai`, which will start an interactive session.\n\nI should also consider how to manage the model once the container is running. Do I need to pass it a prompt through stdin? How does it handle outputs? I should look up the usage instructions for VLLM-OpenAI to know how to interact with it within the container.\n\nAnother thing to think about is resource usage. VLLM models are computationally intensive, so I need to make sure the container has enough resources allocated. This can be done when running the container with options like `--cpuset` or `--memory` if necessary.\n\nI'm a bit worried about the size of the model. VLLM might have a large embedding size, so the container might become quite large. I should check if there are optimized versions or ways to reduce the model size without losing too much performance.\n\nAlso, I should think about versioning. If I create a container, I should name it something that includes the version number, like `vllm-openai-1.0`. This way, I can easily update to newer versions by rebuilding the container.\n\nI wonder if there are any specific commands or tools needed to run VLLM-OpenAI in a container. Maybe I need to use a specific framework or tooling that's already included in the container. I should make sure I have all the necessary command-line tools installed before trying to run it.\n\nI should also test the container locally to see if it works. Maybe start with a simple prompt to see if the model responds. 
If it doesn't, I'll need to troubleshoot whether it's an issue with the container setup or the model configuration.\n\nIn summary, my steps would be:\n1. Search for existing VLLM-OpenAI containers or Docker images.\n2. If none found, create a new Dockerfile and requirements.txt.\n3. Install necessary dependencies and copy the model code.\n4. Build and run the container using Docker.\n5. Test the container with a sample input.\n6. Adjust resources and configurations as needed.\n\nI might run into issues like dependency conflicts or missing packages, so I should be prepared to update versions or check the model's documentation for specific requirements. Also, understanding how the model expects inputs and outputs is crucial for effective use.\n\nI think I've got a basic plan. Now, I'll try to find the existing resources or proceed to set up the Dockerfile if necessary. Let me start by searching for VLLM-OpenAI Docker container on GitHub or Docker Hub to see if someone else has done this before.\n</think>\n\nTo create a container for VLLM-OpenAI, follow these organized steps:\n\n1. **Search for Existing Containers**:\n - Check Docker Hub or GitHub for existing VLLM-OpenAI containers. If found, use them as they may already be configured.\n\n2. **Prepare Your Environment**:\n - Ensure you have Docker installed on your system.\n\n3. **Set Up the Project Structure**:\n - Create a directory for your project.\n - Within it, create a `Dockerfile` and a `requirements.txt` file.\n\n4. **Dockerfile Setup**:\n - Use a base image that matches your system's requirements (e.g., `python:3.9-slim`).\n - Install necessary dependencies from `requirements.txt`.\n - Copy the VLLM-OpenAI code into the container.\n\n5. **requirements.txt**:\n - List all Python packages needed, including specific versions, such as `transformers` and `torch`.\n\n6. **Build the Container**:\n - Use the command `docker build -t vllm-openai .` to build the container.\n\n7. **Run the Container**:\n - Start the container with `docker run -it vllm-openai` for an interactive session.\n\n8. 
**Test the Container**:\n - Issue a test command to ensure the model responds, e.g., `echo \"Hello, how are you?\" | docker run -it vllm-openai`.\n\n9. **Optimize Resources**:\n - Adjust resource allocation with options like `--cpuset` or `--memory` to handle computational demands.\n\n10. **Versioning**:\n - Name your container with versioning, such as `vllm-openai-1.0`.\n\n11. **Troubleshooting**:\n - If issues arise, check for dependency conflicts or review the model's documentation for specific requirements.\n\n12. **Documentation and Usage**:\n - Familiarize yourself with how VLLM-OpenAI expects inputs and outputs for effective utilization.\n\nBy following these steps, you can efficiently create and manage a container for VLLM-OpenAI, ensuring it runs smoothly within your environment.","tool_calls":[]},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":12,"total_tokens":1505,"completion_tokens":1493,"prompt_tokens_details":null},"prompt_logprobs":null}
* Connection #0 to host localhost left intact
Profiling Example#
To profile the AIBrix Offloading Connectors, start the profiling service before launching the inference engines, using the following sample YAML file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: aibrix-kvcache-profiling
  labels:
    model.aibrix.ai/name: aibrix-kvcache-profiling
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: aibrix-kvcache-profiling
  template:
    metadata:
      labels:
        model.aibrix.ai/name: aibrix-kvcache-profiling
    spec:
      containers:
        - name: pyroscope
          image: aibrix-container-registry-cn-beijing.cr.volces.com/aibrix/pyroscope:latest
          imagePullPolicy: Always
          resources:
            requests:
              cpu: "2000m"
              memory: "4Gi"
            limits:
              cpu: "2000m"
              memory: "4Gi"
---

apiVersion: v1
kind: Service
metadata:
  name: aibrix-kvcache-profiling
  namespace: default
spec:
  ports:
    - name: http
      port: 4040
      protocol: TCP
      targetPort: 4040
  selector:
    model.aibrix.ai/name: aibrix-kvcache-profiling
  type: ClusterIP
$ kubectl apply -f samples/kvcache/profiling/profiling_svc.yaml
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
aibrix-kvcache-profiling ClusterIP 10.70.106.217 <none> 4040/TCP 80m
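The Service listing above shows that the Service object exists, but not that the Pyroscope pod behind it is ready. The following is a minimal Python sketch of one way to wait for the Deployment to finish rolling out, by shelling out to `kubectl rollout status`. The helper names `rollout_status_cmd` and `wait_for_rollout` are hypothetical, and the `default` namespace and 120-second timeout are assumptions; adjust them to match your cluster.

```python
import subprocess


def rollout_status_cmd(deployment, namespace="default", timeout_s=120):
    """Build the kubectl command that blocks until the Deployment is rolled out.

    The namespace and timeout defaults are assumptions for illustration.
    """
    return [
        "kubectl", "rollout", "status", f"deployment/{deployment}",
        "-n", namespace, f"--timeout={timeout_s}s",
    ]


def wait_for_rollout(deployment, namespace="default", timeout_s=120):
    """Run the command; return True if kubectl reports a successful rollout.

    Requires kubectl on PATH and access to the cluster.
    """
    cmd = rollout_status_cmd(deployment, namespace, timeout_s)
    return subprocess.run(cmd).returncode == 0
```

Calling `wait_for_rollout("aibrix-kvcache-profiling")` returns `True` once the rollout completes, and `False` if kubectl exits with an error or the timeout elapses.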
Once the profiling service is running, set up port forwarding so that you can browse the profiling results and flamegraphs from your local machine:
Run kubectl port-forward svc/aibrix-kvcache-profiling 4040:4040 & to set up port forwarding
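Before opening the browser, it can help to confirm that the forwarded port is actually accepting connections. The following is a minimal Python sketch, not part of AIBrix; the helper name `port_is_open` is hypothetical, and the check only tests TCP reachability, not whether the profiling service itself is healthy.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening on the target port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the port-forward running, `port_is_open("localhost", 4040)` should return `True`. If it returns `False`, check that the `kubectl port-forward` process is still alive.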
Now launch the inference engine with the following environment variables set in the engine’s YAML file (please refer to L1 Cache Example and L2 Cache Example for more details).
env:
  - name: AIBRIX_KV_CACHE_OL_PROFILING_ENABLED
    value: "1"
  - name: AIBRIX_KV_CACHE_OL_PROFILING_SERVER_ADDRESS
    value: "http://aibrix-kvcache-profiling:4040"
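A mistyped server address is easy to miss until no profiles show up. As a pre-flight check, the two variables can be validated inside the engine container before the engine starts. The sketch below is illustrative only: the helper `read_profiling_settings` is hypothetical, it is not how the AIBrix connector itself parses these variables, and it assumes that `"1"` is the enabling value, as in the sample above.

```python
import os
from urllib.parse import urlparse


def read_profiling_settings(env=None):
    """Return (enabled, address) from the two profiling variables.

    Assumption: profiling is treated as enabled only when the flag is "1".
    Raises ValueError if profiling is enabled but the address is not an
    http(s) URL with a host.
    """
    env = os.environ if env is None else env
    enabled = env.get("AIBRIX_KV_CACHE_OL_PROFILING_ENABLED", "0") == "1"
    address = env.get("AIBRIX_KV_CACHE_OL_PROFILING_SERVER_ADDRESS", "")
    if enabled:
        parsed = urlparse(address)
        if parsed.scheme not in ("http", "https") or not parsed.hostname:
            raise ValueError(f"invalid profiling server address: {address!r}")
    return enabled, address
```

Calling `read_profiling_settings()` with no arguments reads the current process environment; passing a dict makes the check easy to exercise without touching real environment variables.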
Run a workload that sends requests to the inference engine. After a short while, you can browse the profiling results and flamegraphs locally:
Open http://localhost:4040 in your browser
Adjust the query parameters as illustrated in the following figure to show the flamegraph of AIBrix Offloading Connectors