KV Cache Event Synchronization
Overview
KV Cache Event Synchronization is a feature that enables multiple vLLM instances to share key-value cache states through ZMQ-based event publishing. This improves prefix cache hit rates and reduces redundant computation by allowing the AIBrix gateway to make intelligent routing decisions based on real-time cache state.
Architecture
The KV event synchronization system consists of:
vLLM Instances: Publish KV cache events via ZMQ pub/sub pattern
AIBrix Cache: Manages subscriptions and processes events
Sync Prefix Cache Indexer: Maintains global prefix cache state
Gateway Router: Uses cache state for intelligent routing decisions
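As a rough illustration of how a router could use the indexer's state, the sketch below picks the pod whose cached blocks cover the longest leading run of a request's block hashes, and falls back to the first candidate when nothing matches. All names here (`podBlocks`, `matchedPrefix`, `pickPod`) are hypothetical and not part of the AIBrix API; the real router may also weigh load and other signals.

```go
package main

import "fmt"

// podBlocks maps a pod name to the set of block hashes it is known to hold.
// Hypothetical structure for illustration only.
type podBlocks map[string]map[int64]bool

// matchedPrefix counts how many leading request blocks are cached on a pod.
// Matching stops at the first miss because a prefix hit must be contiguous.
func matchedPrefix(cached map[int64]bool, request []int64) int {
	n := 0
	for _, h := range request {
		if !cached[h] {
			break
		}
		n++
	}
	return n
}

// pickPod returns the candidate with the longest cached prefix and the match length.
// Candidates are scanned in order, so ties and the no-match case resolve to the
// earliest candidate. It returns "" when there are no candidates.
func pickPod(state podBlocks, candidates []string, request []int64) (string, int) {
	best, bestLen := "", -1
	for _, pod := range candidates {
		if n := matchedPrefix(state[pod], request); n > bestLen {
			best, bestLen = pod, n
		}
	}
	if bestLen < 0 {
		return "", 0
	}
	return best, bestLen
}

func main() {
	state := podBlocks{
		"vllm-0": {101: true, 102: true},
		"vllm-1": {101: true, 102: true, 103: true},
	}
	pod, n := pickPod(state, []string{"vllm-0", "vllm-1"}, []int64{101, 102, 103, 104})
	fmt.Println(pod, n)
}
```

Because the no-match case still returns a candidate, routing keeps working when the index is empty or stale, which is one simple way to degrade gracefully.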
Event Flow
vLLM Pod 1 ─────┐
                ├─── ZMQ Events ───► KV Event Manager ───► Sync Indexer ───► Gateway Router
vLLM Pod N ─────┘                    (in Cache)
The system uses a two-stage initialization:
Cache Initialization: Uses the InitWithOptions pattern with EnableKVSync=true
KV Event Manager: Automatically created when conditions are met
Requirements
vLLM version 0.7.0 or later with KV cache events support
AIBrix gateway-plugins built with ZMQ support (-tags="zmq")
ZMQ library (libzmq3-dev) installed on gateway nodes
Remote tokenizer enabled (strict prerequisite)
Redis client configured (for production deployments)
Important
KV event sync has a strict dependency on remote tokenizer to ensure consistent tokenization between gateway and vLLM instances. The system will not initialize if remote tokenizer is disabled.
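The dependency described above can be pictured as a startup check. The following is a minimal sketch, not the AIBrix implementation: it reads the two gateway environment variables used in the kubectl commands on this page and refuses to enable KV sync unless the remote tokenizer is also enabled. The function names `boolEnv` and `validateKVSync` and the error text are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// Environment variable names as they appear in the kubectl commands on this page.
const (
	envKVSync          = "AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED"
	envRemoteTokenizer = "AIBRIX_PREFIX_CACHE_USE_REMOTE_TOKENIZER"
)

// boolEnv parses a boolean variable via getenv; unset or malformed values count as false.
func boolEnv(getenv func(string) string, key string) bool {
	v, err := strconv.ParseBool(getenv(key))
	return err == nil && v
}

// validateKVSync reports whether KV event sync should start.
// It returns an error when sync is requested without the remote tokenizer.
func validateKVSync(getenv func(string) string) (bool, error) {
	if !boolEnv(getenv, envKVSync) {
		return false, nil // feature is off; nothing to validate
	}
	if !boolEnv(getenv, envRemoteTokenizer) {
		return false, errors.New("KV event sync requires the remote tokenizer to be enabled")
	}
	return true, nil
}

func main() {
	enabled, err := validateKVSync(os.Getenv)
	if err != nil {
		fmt.Println("KV event sync disabled:", err)
		return
	}
	fmt.Println("KV event sync enabled:", enabled)
}
```

Passing the lookup function as a parameter keeps the check easy to exercise with a map-backed stub instead of the real process environment.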
Configuration
Environment Variables
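| Variable | Default | Description |
|---|---|---|
| AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED | | Enable KV event synchronization |
| AIBRIX_PREFIX_CACHE_USE_REMOTE_TOKENIZER | | Must be true for KV event sync |
| AIBRIX_PREFIX_CACHE_REMOTE_TOKENIZER_ENDPOINT | | vLLM service endpoint |
| AIBRIX_PREFIX_CACHE_LOCAL_ROUTER_METRICS_ENABLED | | Enable prefix cache metrics |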
Pod Labels
| Label | Value | Description |
|---|---|---|
| model.aibrix.ai/kv-events-enabled | "true" | Enable KV events for this pod |
| | string | LoRA adapter ID (optional) |
vLLM Configuration
Add these arguments to your vLLM container:
args:
- --enable-kv-cache-events
- --kv-events-publisher=zmq
- --kv-events-endpoint=tcp://*:5557
- --kv-events-replay-endpoint=tcp://*:5558
- --kv-events-buffer-steps=10000
Add corresponding ports:
ports:
- name: kv-events
  containerPort: 5557
  protocol: TCP
- name: kv-replay
  containerPort: 5558
  protocol: TCP
Deployment
Quick Start
Enable Remote Tokenizer (mandatory prerequisite):
kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
  AIBRIX_PREFIX_CACHE_USE_REMOTE_TOKENIZER=true \
  AIBRIX_PREFIX_CACHE_REMOTE_TOKENIZER_ENDPOINT=http://vllm-service:8000
Enable KV Event Sync:
kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
  AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED=true
Enable Prefix Cache Metrics (optional but recommended):
kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
  AIBRIX_PREFIX_CACHE_LOCAL_ROUTER_METRICS_ENABLED=true
Deploy vLLM with KV Events:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-model
spec:
  template:
    metadata:
      labels:
        model.aibrix.ai/name: "llama-7b"
        model.aibrix.ai/kv-events-enabled: "true"
    spec:
      containers:
      - name: vllm
        args:
        - --enable-kv-cache-events
        - --kv-events-publisher=zmq
        - --kv-events-endpoint=tcp://*:5557
        - --kv-events-replay-endpoint=tcp://*:5558
Build Considerations
AIBrix uses conditional compilation to manage ZMQ dependencies.
Components requiring ZMQ support:
gateway-plugins: Main component for KV event sync
kvcache-watcher: Optional component for cache monitoring
Build commands:
# Build with ZMQ support
go build -tags="zmq" ./cmd/plugins/main.go
# Docker build with ZMQ
make docker-build-gateway-plugins # Automatically includes ZMQ
Components that do NOT require ZMQ:
controller-manager: Uses default build
metadata-service: Uses default build
runtime: Python component, no ZMQ needed
Event Types
BlockStoredEvent
Published when new KV cache blocks are stored:
type BlockStoredEvent struct {
	BlockHashes     []int64  // Hash values of stored blocks
	TokenIDs        [][]byte // Token IDs for each block (each token is a big-endian uint32)
	ModelName       string   // Model identifier
	LoraID          int64    // LoRA adapter ID (-1 if none)
	SourcePod       string   // Source pod name
	ParentBlockHash *int64   // Hash value of the parent block or nil
}
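To make the TokenIDs encoding concrete, here is a minimal decoding sketch. It assumes each inner byte slice is a sequence of 4-byte groups, one big-endian uint32 per token; a slice holding a single token is simply the 4-byte case. The helper name `decodeTokenIDs` is hypothetical and not part of AIBrix.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// decodeTokenIDs converts raw token bytes into token IDs,
// reading each consecutive 4-byte group as a big-endian uint32.
func decodeTokenIDs(raw []byte) ([]uint32, error) {
	if len(raw)%4 != 0 {
		return nil, errors.New("token byte length is not a multiple of 4")
	}
	ids := make([]uint32, 0, len(raw)/4)
	for i := 0; i < len(raw); i += 4 {
		ids = append(ids, binary.BigEndian.Uint32(raw[i:i+4]))
	}
	return ids, nil
}

func main() {
	// Two tokens: 1 and 258 (0x00000102).
	raw := []byte{0, 0, 0, 1, 0, 0, 1, 2}
	ids, err := decodeTokenIDs(raw)
	fmt.Println(ids, err)
}
```

Rejecting lengths that are not a multiple of 4 surfaces truncated or misframed payloads early instead of silently dropping trailing bytes.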
BlockRemovedEvent
Published when blocks are removed from cache:
type BlockRemovedEvent struct {
	BlockHashes []int64 // Hash values of removed blocks
	ModelName   string  // Model identifier
	LoraID      int64   // LoRA adapter ID
	SourcePod   string  // Source pod name
}
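The two event types suggest a simple way to keep a global view of which pods hold which blocks. The sketch below is one possible shape for that bookkeeping, not the AIBrix Sync Prefix Cache Indexer itself: `prefixIndex`, `blockKey`, and the method names are hypothetical, the event structs are trimmed to the fields used here, and LoRA IDs and locking are left out for brevity.

```go
package main

import "fmt"

// Trimmed copies of the event structs, keeping only the fields this sketch uses.
type BlockStoredEvent struct {
	BlockHashes []int64
	ModelName   string
	SourcePod   string
}

type BlockRemovedEvent struct {
	BlockHashes []int64
	ModelName   string
	SourcePod   string
}

// blockKey scopes a block hash to a model so entries from different models never collide.
type blockKey struct {
	model string
	hash  int64
}

// prefixIndex records, for every block, which pods currently cache it.
type prefixIndex struct {
	pods map[blockKey]map[string]bool
}

func newPrefixIndex() *prefixIndex {
	return &prefixIndex{pods: make(map[blockKey]map[string]bool)}
}

// applyStored marks every block in the event as cached on the source pod.
func (ix *prefixIndex) applyStored(ev BlockStoredEvent) {
	for _, h := range ev.BlockHashes {
		k := blockKey{ev.ModelName, h}
		if ix.pods[k] == nil {
			ix.pods[k] = make(map[string]bool)
		}
		ix.pods[k][ev.SourcePod] = true
	}
}

// applyRemoved clears the source pod from each block and drops entries with no holders.
func (ix *prefixIndex) applyRemoved(ev BlockRemovedEvent) {
	for _, h := range ev.BlockHashes {
		k := blockKey{ev.ModelName, h}
		delete(ix.pods[k], ev.SourcePod)
		if len(ix.pods[k]) == 0 {
			delete(ix.pods, k)
		}
	}
}

// holders returns how many pods cache the given block.
func (ix *prefixIndex) holders(model string, hash int64) int {
	return len(ix.pods[blockKey{model, hash}])
}

func main() {
	ix := newPrefixIndex()
	ix.applyStored(BlockStoredEvent{BlockHashes: []int64{1, 2}, ModelName: "llama-7b", SourcePod: "vllm-0"})
	ix.applyStored(BlockStoredEvent{BlockHashes: []int64{2}, ModelName: "llama-7b", SourcePod: "vllm-1"})
	ix.applyRemoved(BlockRemovedEvent{BlockHashes: []int64{1}, ModelName: "llama-7b", SourcePod: "vllm-0"})
	fmt.Println(ix.holders("llama-7b", 1), ix.holders("llama-7b", 2))
}
```

A real indexer that receives events from several subscriptions concurrently would need a mutex or a single consumer goroutine around these maps.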
Troubleshooting
Initialization Failures
Check initialization logs:
kubectl logs deployment/aibrix-gateway-plugins -n aibrix-system | grep -E "KV event|initialize cache"
Verify remote tokenizer:
# Both must be enabled
kubectl get deployment/aibrix-gateway-plugins -n aibrix-system -o yaml | grep -A2 "REMOTE_TOKENIZER\|KV_EVENT_SYNC"
Events Not Publishing
Check vLLM logs:
kubectl logs deployment/vllm-model | grep "KV cache events"
Verify ZMQ connectivity:
kubectl exec -it <gateway-pod> -n aibrix-system -- nc -zv <vllm-pod-ip> 5557
Check ZMQ build support:
kubectl exec <gateway-pod> -n aibrix-system -- ldd /app/gateway-plugin | grep zmq
Connection Issues
Verify pod labels:
kubectl get pods -l model.aibrix.ai/kv-events-enabled=true
Check network policies:
Ensure ports 5557-5558 are accessible
No blocking NetworkPolicies
Validate tokenizer:
kubectl exec <gateway-pod> -- curl http://tokenizer:8080/health
Performance Tuning
High Memory Usage: Reduce buffer steps in vLLM
Event Processing Lag: Adjust batch size and polling timeout
Network Overhead: ~1MB/s per pod at high load
Migration from Existing Deployments
Enable on Existing vLLM
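Add the label to the pod template (labeling only the Deployment object does not label its pods):
kubectl patch deployment vllm-model --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"model.aibrix.ai/kv-events-enabled":"true"}}}}}'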
Update deployment with KV event args (see Configuration section)
Restart pods:
kubectl rollout restart deployment vllm-model
Rollback
To disable KV event sync:
# Disable in gateway
kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED=false
# Remove the label from the vLLM pod template
kubectl patch deployment vllm-model --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"model.aibrix.ai/kv-events-enabled":null}}}}}'
Best Practices
Deployment Order:
Enable remote tokenizer first and verify it’s working
Deploy vLLM with KV events configuration
Enable KV sync in gateway last
Monitoring:
Enable prefix cache metrics for visibility
Monitor ZMQ connection status in logs
Track prefix cache hit rates in Grafana
Resource Planning:
ZMQ traffic: ~1MB/s per vLLM pod at high load
Memory: Sync indexer uses ~64 bytes per prefix entry
CPU: Minimal overhead (<1% per pod)
Production Considerations:
Use dedicated network for ZMQ traffic if possible
Configure appropriate timeouts based on network latency
Plan for graceful degradation if KV sync fails
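The Resource Planning figures above lend themselves to a quick back-of-the-envelope estimate. The sketch below simply multiplies them out; the per-entry and per-pod constants come from this page, while the example workload (10 million prefix entries, 20 pods) and the function names are hypothetical.

```go
package main

import "fmt"

// Figures quoted in the Resource Planning list.
const (
	bytesPerPrefixEntry = 64  // sync indexer memory per prefix entry
	mbPerSecondPerPod   = 1.0 // approximate ZMQ traffic per vLLM pod at high load
)

// indexerMemoryMiB estimates sync indexer memory for a number of prefix entries.
func indexerMemoryMiB(entries int64) float64 {
	return float64(entries*bytesPerPrefixEntry) / (1024 * 1024)
}

// zmqTrafficMBps estimates aggregate ZMQ traffic for a number of vLLM pods.
func zmqTrafficMBps(pods int) float64 {
	return float64(pods) * mbPerSecondPerPod
}

func main() {
	// Hypothetical workload: 10 million prefix entries across 20 pods.
	fmt.Printf("indexer memory: %.0f MiB\n", indexerMemoryMiB(10_000_000))
	fmt.Printf("ZMQ traffic:    %.0f MB/s\n", zmqTrafficMBps(20))
}
```

For that workload the estimate comes to roughly 610 MiB of indexer memory and about 20 MB/s of aggregate ZMQ traffic at high load, which is the kind of number to compare against gateway memory limits and network capacity.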