Distributed KV Cache
Warning
Currently, distributed KV cache only supports FlashAttention.
The rising demand for large language models has intensified the need for efficient memory management and caching to optimize inference performance and reduce costs. In multi-round use cases like chatbots and agent-based systems, overlapping token sequences lead to redundant computations during the prefill phase, wasting resources and limiting throughput.
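To make the redundancy concrete, here is a minimal sketch that counts how many prompt tokens of a follow-up round repeat the previous round's prefix. The token IDs are made up, the function name is hypothetical, and rounding down to whole chunks of 16 tokens is an assumption for illustration (16 mirrors the AIBRIX_LLM_KV_CACHE_CHUNK_SIZE value used in the example later on this page), not a description of how the engine matches prefixes.

```python
def reusable_prefix_tokens(previous, current, chunk_size=16):
    """Count leading tokens shared by two prompts, rounded down to whole chunks."""
    shared = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        shared += 1
    # Assumption: cached KV tensors are reused only at chunk granularity.
    return (shared // chunk_size) * chunk_size


round_1 = list(range(100))                 # hypothetical token IDs of the first prompt
round_2 = round_1 + list(range(500, 540))  # same history plus 40 new tokens

reused = reusable_prefix_tokens(round_1, round_2)
recomputed = len(round_2) - reused
print(f"reusable: {reused} tokens, still to prefill: {recomputed} tokens")
```

Without any cache, all 140 tokens of the second round would go through prefill again, even though most of them were already processed in the first round.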
Many inference engines, such as vLLM, use built-in KV caching to mitigate this issue, leveraging idle HBM and DRAM. However, single-node KV caches face key limitations: constrained memory capacity, engine-specific storage that prevents sharing across instances, and difficulty supporting scenarios like KV migration and prefill-decode disaggregation.
AIBrix addresses these challenges with a distributed KV cache, enabling high-capacity, cross-engine KV reuse while optimizing network and memory efficiency. Our solution combines three techniques: a scan-resistant eviction policy that selectively persists hot KV tensors and minimizes unnecessary data transfers, asynchronous metadata updates that reduce overhead, and cache-engine colocation that speeds up data transfer via shared memory.
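The text does not say which scan-resistant policy AIBrix uses. As an illustration of the general idea only, the sketch below implements a segmented LRU, one well-known scan-resistant scheme: new entries start in a probationary segment and are promoted to a protected segment only when they are accessed again, so a one-off scan cannot push out hot entries. The class name and segment sizes are hypothetical.

```python
from collections import OrderedDict


class SegmentedLRU:
    """Two-segment LRU: entries must be re-accessed before they become protected."""

    def __init__(self, probation_size, protected_size):
        self.probation = OrderedDict()   # entries seen once
        self.protected = OrderedDict()   # entries accessed again ("hot")
        self.probation_size = probation_size
        self.protected_size = protected_size

    def get(self, key):
        if key in self.protected:
            self.protected.move_to_end(key)
            return self.protected[key]
        if key in self.probation:
            value = self.probation.pop(key)
            self._promote(key, value)
            return value
        return None

    def put(self, key, value):
        if key in self.protected:
            self.protected[key] = value
            self.protected.move_to_end(key)
        elif key in self.probation:
            self.probation.pop(key)
            self._promote(key, value)
        else:
            self._insert_probation(key, value)

    def _insert_probation(self, key, value):
        self.probation[key] = value
        if len(self.probation) > self.probation_size:
            self.probation.popitem(last=False)  # evict the oldest cold entry

    def _promote(self, key, value):
        self.protected[key] = value
        if len(self.protected) > self.protected_size:
            old_key, old_value = self.protected.popitem(last=False)
            self._insert_probation(old_key, old_value)  # demote instead of dropping


cache = SegmentedLRU(probation_size=2, protected_size=2)
cache.put("hot-prefix", "kv")
cache.get("hot-prefix")             # second access promotes the entry
for i in range(10):                 # a one-off scan of cold entries
    cache.put(f"scan-{i}", "kv")
print(cache.get("hot-prefix"), cache.get("scan-0"))
```

A plain LRU of the same total size would have evicted the hot entry during the scan; here the scan only churns the probationary segment.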
Example
Note
We use a customized version of vineyard as the backend for distributed KV cache and an internal version of vLLM integrated with distributed KV cache support to showcase the usage. We are working with the vLLM community to upstream the distributed KV cache API and plugin.
After deployment, we can see all the components by using the kubectl get pods -n aibrix-system command:
NAME                                        READY   STATUS    RESTARTS   AGE
deepseek-coder-7b-kvcache-596965997-p86cx   0/1     Pending   0          2m
deepseek-coder-7b-kvcache-etcd-0            1/1     Running   0          2m
Note
deepseek-coder-7b-kvcache-596965997-p86cx stays Pending until the inference engine is deployed; this is normal.
After all components are created, we can use the following YAML to deploy the inference service:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-coder-7b-instruct
  labels:
    model.aibrix.ai/name: deepseek-coder-7b-instruct
    model.aibrix.ai/port: "8000"
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      model.aibrix.ai/name: deepseek-coder-7b-instruct
  template:
    metadata:
      labels:
        model.aibrix.ai/name: deepseek-coder-7b-instruct
    spec:
      containers:
        - name: vllm-openai
          image: aibrix/vllm-openai:v0.6.1-edb07092-20250118
          imagePullPolicy: Always
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --port
            - "8000"
            - --uvicorn-log-level
            - warning
            - --model
            - deepseek-ai/deepseek-coder-6.7b-instruct
            - --served-model-name
            - deepseek-coder-7b-instruct
            - --max-model-len
            - "8192" # please modify this field if your gpu has more room
            - --enable-prefix-caching
            - --disable-fastapi-docs
          env:
            - name: VLLM_USE_VINEYARD_CACHE
              value: "1"
            - name: VINEYARD_CACHE_CPU_MEM_LIMIT_GB
              value: "10"
            - name: AIBRIX_LLM_KV_CACHE
              value: "1"
            - name: AIBRIX_LLM_KV_CACHE_KV_CACHE_NS
              value: "aibrix"
            - name: AIBRIX_LLM_KV_CACHE_CHUNK_SIZE
              value: "16"
            - name: AIBRIX_LLM_KV_CACHE_SOCKET
              value: /var/run/vineyard.sock
            - name: AIBRIX_LLM_KV_CACHE_RPC_ENDPOINT
              value: "deepseek-coder-7b-kvcache-rpc:9600"
            - name: VINEYARD_CACHE_ENABLE_ASYNC_UPDATE
              value: "1"
            - name: "VINEYARD_CACHE_METRICS_ENABLED"
              value: "1"
          volumeMounts:
            - mountPath: /var/run
              name: kvcache-socket
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              nvidia.com/gpu: "1"
      volumes:
        - name: kvcache-socket
          hostPath:
            path: /var/run/vineyard-kubernetes/default/deepseek-coder-7b-kvcache
---
apiVersion: v1
kind: Service
metadata:
  labels:
    model.aibrix.ai/name: deepseek-coder-7b-instruct
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
  name: deepseek-coder-7b-instruct # Note: The Service name must match the label value `model.aibrix.ai/name` in the Deployment
  namespace: default
spec:
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080
  selector:
    model.aibrix.ai/name: deepseek-coder-7b-instruct
  type: ClusterIP
Note
metadata.name MUST match kvcache.orchestration.aibrix.ai/pod-affinity-workload in the KV cache deployment.
The Unix domain socket used by the distributed KV cache must be added to the inference service pod as a volume (i.e., kvcache-socket in the example above).
Note
VINEYARD_CACHE_CPU_MEM_LIMIT_GB should be set according to the pod's memory resource requirement. For instance, if the pod's memory resource requirement is P GB and the estimated memory consumption of the inference engine is E GB, we can set VINEYARD_CACHE_CPU_MEM_LIMIT_GB to P / tensor-parallel-size - E.
Now let’s use the kubectl get pods command to ensure the inference service is running:
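The sizing rule above can be worked through with a small helper. This is a sketch under stated assumptions: the function name is hypothetical, the pod and engine figures are invented for illustration, and rounding down to a whole number of GB is a conservative choice made here rather than a documented requirement.

```python
import math


def vineyard_cache_cpu_mem_limit_gb(pod_memory_gb, engine_memory_gb, tensor_parallel_size=1):
    """Apply P / tensor-parallel-size - E and round down to whole GB."""
    limit = pod_memory_gb / tensor_parallel_size - engine_memory_gb
    if limit <= 0:
        raise ValueError("no memory left for the KV cache; raise the pod memory request")
    return math.floor(limit)


# Hypothetical pod: 16 GB requested, engine estimated at 6 GB, no tensor parallelism.
single_gpu = vineyard_cache_cpu_mem_limit_gb(16, 6)
# Hypothetical pod: 64 GB requested, engine estimated at 12 GB, tensor-parallel-size 2.
two_gpus = vineyard_cache_cpu_mem_limit_gb(64, 12, tensor_parallel_size=2)
print(single_gpu, two_gpus)
```

The first case yields 10, which happens to match the value "10" set for VINEYARD_CACHE_CPU_MEM_LIMIT_GB in the example Deployment; the pod figures behind it are assumptions, not values taken from that manifest.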
NAME                                          READY   STATUS    RESTARTS   AGE
deepseek-coder-7b-instruct-6b885ffd8b-2kfjv   2/2     Running   0          4m
After launching AIBrix’s deployment, we can use the following YAML to deploy a distributed KV cache cluster:
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: KVCache
metadata:
  name: deepseek-coder-7b-kvcache
  namespace: default
  annotations:
    kvcache.orchestration.aibrix.ai/pod-affinity-workload: deepseek-coder-7b-instruct
spec:
  replicas: 1
  service:
    type: ClusterIP
    port: 9600
  cacheSpec:
    image: aibrix/vineyardd:20241120
    imagePullPolicy: IfNotPresent
    cpu: 2000m
    memory: 4Gi
Note
kvcache.orchestration.aibrix.ai/pod-affinity-workload MUST match metadata.name of the inference service deployment.
kvcache.orchestration.aibrix.ai/node-affinity-gpu-type is unnecessary unless you deploy the model across different GPUs.
Run kubectl get pods to verify all pods are running.
Note
kubectl get pods -o wide
NAME                                          READY   STATUS    RESTARTS   AGE     IP               NODE                                           NOMINATED NODE   READINESS GATES
deepseek-coder-7b-instruct-85664648c7-xgp9h   1/1     Running   0          2m41s   192.168.59.224   ip-192-168-41-184.us-west-2.compute.internal   <none>           <none>
deepseek-coder-7b-kvcache-7d5896cd89-dcfzt    1/1     Running   0          2m31s   192.168.37.154   ip-192-168-41-184.us-west-2.compute.internal   <none>           <none>
deepseek-coder-7b-kvcache-etcd-0              1/1     Running   0          2m31s   192.168.19.197   ip-192-168-3-183.us-west-2.compute.internal    <none>           <none>
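In the listing above, the inference pod and the KV cache pod run on the same node, which is what the pod-affinity annotation is meant to achieve. A minimal sketch for checking this from saved kubectl get pods -o wide text follows. The helper names are hypothetical, and the whitespace-based parsing assumes the RESTARTS column is a bare number (it breaks on values such as "1 (5m ago)"); for anything beyond a quick check, kubectl's JSON output would be a sturdier input.

```python
def nodes_by_pod(wide_output):
    """Map pod name -> node name from `kubectl get pods -o wide` text output."""
    nodes = {}
    for line in wide_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        nodes[fields[0]] = fields[6]  # NAME is column 0, NODE is column 6
    return nodes


def is_colocated(nodes, engine_prefix, cache_prefix):
    """True if every engine pod shares a node with some cache pod (etcd pods ignored)."""
    engine_nodes = {n for p, n in nodes.items() if p.startswith(engine_prefix)}
    cache_nodes = {
        n for p, n in nodes.items()
        if p.startswith(cache_prefix) and "-etcd-" not in p
    }
    return bool(engine_nodes) and engine_nodes <= cache_nodes


sample = """
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
deepseek-coder-7b-instruct-85664648c7-xgp9h 1/1 Running 0 2m41s 192.168.59.224 ip-192-168-41-184.us-west-2.compute.internal <none> <none>
deepseek-coder-7b-kvcache-7d5896cd89-dcfzt 1/1 Running 0 2m31s 192.168.37.154 ip-192-168-41-184.us-west-2.compute.internal <none> <none>
deepseek-coder-7b-kvcache-etcd-0 1/1 Running 0 2m31s 192.168.19.197 ip-192-168-3-183.us-west-2.compute.internal <none> <none>
"""

colocated = is_colocated(
    nodes_by_pod(sample),
    engine_prefix="deepseek-coder-7b-instruct-",
    cache_prefix="deepseek-coder-7b-kvcache-",
)
print("engine and cache colocated:", colocated)
```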
Once the inference service is running, let’s set up port forwarding so that we can test the service locally:
Run kubectl get svc -n envoy-gateway-system to get the name of the Envoy Gateway service.
NAME                                     TYPE           CLUSTER-IP     EXTERNAL-IP                                       PORT(S)        AGE
envoy-aibrix-system-aibrix-eg-903790dc   LoadBalancer   172.19.190.6   10.0.1.4,2406:d440:105:cf01:6f1b:7f4d:12da:c5a5   80:30904/TCP   3d
Run kubectl -n envoy-gateway-system port-forward svc/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 & to set up port forwarding.
Forwarding from 127.0.0.1:8888 -> 10080
Forwarding from [::1]:8888 -> 10080
Now, let’s test the service:
curl -v "http://localhost:8888/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: XXXXXXXXXXXXXXXXXXXXXXXX" \
-d '{
"model": "deepseek-coder-7b-instruct",
"messages": [{"role": "user", "content": "Created container vllm-openai"}],
"temperature": 0.7
}'
The output should look like this:
* Trying [::1]:8888...
* Connected to localhost (::1) port 8888
> POST /v1/chat/completions HTTP/1.1
> Host: localhost:8888
> User-Agent: curl/8.4.0
> Accept: */*
> Content-Type: application/json
Handling connection for 8888
> Authorization: XXXXXXXXXXXXXXXXXXXXXXXX
> Content-Length: 174
>
< HTTP/1.1 200 OK
< date: Thu, 30 Jan 2025 23:50:08 GMT
< server: uvicorn
< content-type: application/json
< x-went-into-resp-headers: true
< transfer-encoding: chunked
<
* Connection #0 to host localhost left intact
{
  "id": "chat-60f0247aa9294f8abb61e8f24c1503c2",
  "object": "chat.completion",
  "created": 1738281009,
  "model": "deepseek-coder-7b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "It seems like you're trying to create a container with the name \"vllm-openai\". However, your question is missing some context. Could you please provide more details? Are you using Docker, Kubernetes, or another container orchestration tool? Or are you asking how to create a container for a specific application or service? The details will help me provide a more accurate answer.",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 76,
    "total_tokens": 161,
    "completion_tokens": 85
  },
  "prompt_logprobs": null
}
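The same request can also be issued from Python, which is convenient when replaying multi-round conversations against the cache. The sketch below uses only the standard library and mirrors the curl call above; the helper names are hypothetical, the Authorization value is the same placeholder as in the curl example, and the actual send is left commented out because it needs the port forwarding from the previous step to be active.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8888",
                       api_key="XXXXXXXXXXXXXXXXXXXXXXXX"):
    """Build the POST request for /v1/chat/completions without sending it."""
    payload = {
        "model": "deepseek-coder-7b-instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": api_key},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_chat_request("Created container vllm-openai")

# Uncomment once port forwarding to the gateway is running:
# reply = send(request)
# print(reply["choices"][0]["message"]["content"])
# print(reply["usage"])
```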
Distributed KV cache metrics can be viewed in the AIBrix Engine Dashboard. The following is an example of the dashboard panels for the distributed KV cache: