KV Cache Events Synchronization

Overview#

KV Cache Event Synchronization lets multiple vLLM instances publish their key-value (KV) cache state as events over ZMQ. The AIBrix gateway consumes these events to keep a real-time view of which prefixes each instance has cached and routes requests accordingly, which improves prefix cache hit rates and reduces redundant computation.

Architecture#

The KV event synchronization system consists of:

- vLLM Instances: Publish KV cache events via ZMQ pub/sub pattern
- AIBrix Cache: Manages subscriptions and processes events
- Sync Prefix Cache Indexer: Maintains global prefix cache state
- Gateway Router: Uses cache state for intelligent routing decisions

Event Flow#

vLLM Pod 1 ─────┐
                 ├─── ZMQ Events ───► KV Event Manager ───► Sync Indexer ───► Gateway Router
vLLM Pod N ─────┘                         (in Cache)
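
The flow above is a fan-in: many publishers feed one manager, which feeds one indexer. The following is a minimal sketch of that shape using Go channels in place of ZMQ sockets. The names `kvEvent`, `publish`, `fanIn`, and `countByPod` are hypothetical and are not AIBrix identifiers.

```go
package main

import (
	"fmt"
	"sync"
)

// kvEvent is a hypothetical, simplified stand-in for an event published by a vLLM pod.
type kvEvent struct {
	SourcePod string
	BlockHash int64
}

// publish emits one event per block hash for a pod, then closes its stream.
func publish(pod string, hashes ...int64) <-chan kvEvent {
	ch := make(chan kvEvent)
	go func() {
		defer close(ch)
		for _, h := range hashes {
			ch <- kvEvent{SourcePod: pod, BlockHash: h}
		}
	}()
	return ch
}

// fanIn merges the per-pod streams into one stream, mirroring how events
// from many pods converge on a single event manager.
func fanIn(sources ...<-chan kvEvent) <-chan kvEvent {
	out := make(chan kvEvent)
	var wg sync.WaitGroup
	for _, src := range sources {
		wg.Add(1)
		go func(c <-chan kvEvent) {
			defer wg.Done()
			for ev := range c {
				out <- ev
			}
		}(src)
	}
	go func() {
		wg.Wait() // close the merged stream once every source is drained
		close(out)
	}()
	return out
}

// countByPod plays the role of the indexer: it consumes the merged stream
// and records how many blocks each pod reported.
func countByPod(events <-chan kvEvent) map[string]int {
	counts := make(map[string]int)
	for ev := range events {
		counts[ev.SourcePod]++
	}
	return counts
}

func main() {
	merged := fanIn(publish("vllm-pod-1", 11, 12), publish("vllm-pod-n", 21))
	fmt.Println(countByPod(merged))
}
```

A real subscriber would replace `publish` with a ZMQ SUB socket per pod; the merge and consume stages keep the same structure.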

The system initializes in two stages:

- Cache Initialization: The cache is created through the InitWithOptions pattern with EnableKVSync=true.
- KV Event Manager: Created automatically once KV event sync is enabled and the remote tokenizer is available (see Requirements).
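
The two stages can be sketched as a single gated constructor. This is an illustration only: apart from `EnableKVSync`, every name here (`initOptions`, `UseRemoteTokenizer`, `cache`, `initWithOptions`) is hypothetical, and the actual AIBrix initialization code may be organized differently.

```go
package main

import (
	"errors"
	"fmt"
)

// initOptions is a hypothetical options struct. Only EnableKVSync is named
// in the documentation; UseRemoteTokenizer is an assumed companion flag.
type initOptions struct {
	EnableKVSync       bool
	UseRemoteTokenizer bool
}

// cache is a hypothetical stand-in for the gateway cache.
type cache struct {
	kvEventManagerStarted bool
}

var errTokenizerRequired = errors.New("KV event sync requires the remote tokenizer")

// initWithOptions runs stage one (cache creation) and, only when the
// conditions hold, stage two (KV event manager creation).
func initWithOptions(opts initOptions) (*cache, error) {
	c := &cache{} // stage 1: the cache itself
	if !opts.EnableKVSync {
		return c, nil // KV sync is off, so stage 2 is skipped
	}
	if !opts.UseRemoteTokenizer {
		return nil, errTokenizerRequired
	}
	c.kvEventManagerStarted = true // stage 2: KV event manager
	return c, nil
}

func main() {
	for _, opts := range []initOptions{
		{},
		{EnableKVSync: true},
		{EnableKVSync: true, UseRemoteTokenizer: true},
	} {
		c, err := initWithOptions(opts)
		fmt.Println(c != nil && c.kvEventManagerStarted, err)
	}
}
```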

Requirements#

- vLLM version 0.7.0 or later with KV cache events support
- AIBrix gateway-plugins built with ZMQ support (-tags="zmq")
- ZMQ library (libzmq3-dev) installed on gateway nodes
- Remote tokenizer enabled (strict prerequisite)
- Redis client configured (for production deployments)

Important

KV event sync strictly depends on the remote tokenizer, which ensures that the gateway and the vLLM instances tokenize input identically. The system will not initialize if the remote tokenizer is disabled.

Configuration#

Environment Variables#

Variable	Default	Description
`AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED`	`false`	Enable KV event synchronization
`AIBRIX_PREFIX_CACHE_USE_REMOTE_TOKENIZER`	`false`	Must be `true` for KV sync
`AIBRIX_PREFIX_CACHE_REMOTE_TOKENIZER_ENDPOINT`	(none)	vLLM service endpoint used for tokenization, for example `http://vllm-service:8000`
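
As an illustration of how these variables interact, the sketch below reads them and rejects combinations the table rules out. The variable names come from the table; the `kvSyncConfig` type, the `loadConfig` function, and the rule that an enabled tokenizer needs a non-empty endpoint are assumptions made for this example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// kvSyncConfig is a hypothetical holder for the three documented variables.
type kvSyncConfig struct {
	KVSyncEnabled      bool
	UseRemoteTokenizer bool
	TokenizerEndpoint  string
}

// boolEnv parses a boolean variable through the lookup function and falls
// back to false (the documented default) when it is unset or malformed.
func boolEnv(lookup func(string) string, key string) bool {
	v, err := strconv.ParseBool(lookup(key))
	return err == nil && v
}

// loadConfig reads the variables and validates their combination.
func loadConfig(lookup func(string) string) (kvSyncConfig, error) {
	cfg := kvSyncConfig{
		KVSyncEnabled:      boolEnv(lookup, "AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED"),
		UseRemoteTokenizer: boolEnv(lookup, "AIBRIX_PREFIX_CACHE_USE_REMOTE_TOKENIZER"),
		TokenizerEndpoint:  lookup("AIBRIX_PREFIX_CACHE_REMOTE_TOKENIZER_ENDPOINT"),
	}
	if cfg.KVSyncEnabled && !cfg.UseRemoteTokenizer {
		return cfg, errors.New("KV event sync requires AIBRIX_PREFIX_CACHE_USE_REMOTE_TOKENIZER=true")
	}
	// Assumption: an enabled remote tokenizer is unusable without an endpoint.
	if cfg.UseRemoteTokenizer && cfg.TokenizerEndpoint == "" {
		return cfg, errors.New("remote tokenizer enabled but no endpoint configured")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	fmt.Println(cfg, err)
}
```

Passing the lookup function as a parameter keeps the validation testable without touching the process environment.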

Pod Labels#

Label	Value	Description
`model.aibrix.ai/kv-events-enabled`	`true`	Enable KV events for this pod
`model.aibrix.ai/lora-id`	string	LoRA adapter ID (optional)
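
A consumer of these labels might interpret them roughly as follows. This is a hedged sketch: the helper names are hypothetical, and parsing `model.aibrix.ai/lora-id` as an integer (with -1 for "no adapter") is an assumption made here to line up with the integer LoRA ID carried in the events, since the table only describes the label value as a string.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	labelKVEventsEnabled = "model.aibrix.ai/kv-events-enabled"
	labelLoraID          = "model.aibrix.ai/lora-id"
)

// kvEventsEnabled reports whether a pod has opted in through its label.
func kvEventsEnabled(labels map[string]string) bool {
	return labels[labelKVEventsEnabled] == "true"
}

// loraID returns the numeric LoRA adapter ID from the label, or -1 when the
// label is absent or is not an integer (assumed convention for this sketch).
func loraID(labels map[string]string) int64 {
	id, err := strconv.ParseInt(labels[labelLoraID], 10, 64)
	if err != nil {
		return -1
	}
	return id
}

func main() {
	labels := map[string]string{
		labelKVEventsEnabled: "true",
		labelLoraID:          "7",
	}
	fmt.Println(kvEventsEnabled(labels), loraID(labels))
}
```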

vLLM Configuration#

Add these arguments to your vLLM container:

args:
  - --enable-kv-cache-events
  - --kv-events-publisher=zmq
  - --kv-events-endpoint=tcp://*:5557
  - --kv-events-replay-endpoint=tcp://*:5558
  - --kv-events-buffer-steps=10000

Add corresponding ports:

ports:
  - name: kv-events
    containerPort: 5557
    protocol: TCP
  - name: kv-replay
    containerPort: 5558
    protocol: TCP
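
vLLM binds to `tcp://*:5557` and `tcp://*:5558`, so a subscriber has to connect to the pod's own IP on those ports. The sketch below derives the two connect addresses from a pod IP; the function name `subscriberEndpoints` is hypothetical, and the port numbers are the ones used in the configuration above.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

const (
	kvEventsPort = 5557 // matches --kv-events-endpoint
	kvReplayPort = 5558 // matches --kv-events-replay-endpoint
)

// subscriberEndpoints returns the ZMQ connect addresses for one pod.
// net.JoinHostPort adds the brackets that IPv6 addresses need.
func subscriberEndpoints(podIP string) (events, replay string) {
	events = "tcp://" + net.JoinHostPort(podIP, strconv.Itoa(kvEventsPort))
	replay = "tcp://" + net.JoinHostPort(podIP, strconv.Itoa(kvReplayPort))
	return events, replay
}

func main() {
	events, replay := subscriberEndpoints("10.0.0.12")
	fmt.Println(events, replay)
}
```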

Deployment#

Quick Start#

Enable Remote Tokenizer (mandatory prerequisite):

kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
  AIBRIX_PREFIX_CACHE_USE_REMOTE_TOKENIZER=true \
  AIBRIX_PREFIX_CACHE_REMOTE_TOKENIZER_ENDPOINT=http://vllm-service:8000

Enable KV Event Sync:

kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
  AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED=true

Deploy vLLM with KV Events:

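apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-model
spec:
  selector:
    matchLabels:
      model.aibrix.ai/name: "llama-7b"  # apps/v1 requires a selector that matches the template labels
  template:
    metadata:
      labels:
        model.aibrix.ai/name: "llama-7b"
        model.aibrix.ai/kv-events-enabled: "true"
    spec:
      containers:
      - name: vllm
        # image and the remaining vLLM arguments are omitted here
        args:
        - --enable-kv-cache-events
        - --kv-events-publisher=zmq
        - --kv-events-endpoint=tcp://*:5557
        - --kv-events-replay-endpoint=tcp://*:5558
        ports:
        - name: kv-events
          containerPort: 5557
          protocol: TCP
        - name: kv-replay
          containerPort: 5558
          protocol: TCP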

Build Considerations#

AIBrix uses conditional compilation to manage ZMQ dependencies:

Components requiring ZMQ support:

- gateway-plugins: Main component for KV event sync
- kvcache-watcher: Optional component for cache monitoring

Build commands:

# Build with ZMQ support
go build -tags="zmq" ./cmd/plugins/main.go

# Docker build with ZMQ
make docker-build-gateway-plugins  # Automatically includes ZMQ

Components that do NOT require ZMQ:

- controller-manager: Uses default build
- metadata-service: Uses default build
- runtime: Python component, no ZMQ needed

Event Types#

BlockStoredEvent#

Published when new KV cache blocks are stored:

type BlockStoredEvent struct {
    BlockHashes     []int64    // Hash values of stored blocks
    TokenIDs        [][]byte   // Token IDs for each block (each token is a big-endian uint32)
    ModelName       string     // Model identifier
    LoraID          int64      // LoRA adapter ID (-1 if none)
    SourcePod       string     // Source pod name
    ParentBlockHash *int64     // Hash value of the parent block or nil
}
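
Because each entry of `TokenIDs` packs tokens as big-endian `uint32` values, a consumer has to unpack four bytes per token. A minimal sketch of that decoding step, using a hypothetical helper name `decodeTokenIDs`:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// decodeTokenIDs converts one block's raw token bytes into uint32 token IDs,
// reading each 4-byte group as a big-endian integer.
func decodeTokenIDs(raw []byte) ([]uint32, error) {
	if len(raw)%4 != 0 {
		return nil, fmt.Errorf("token payload length %d is not a multiple of 4", len(raw))
	}
	ids := make([]uint32, 0, len(raw)/4)
	for i := 0; i < len(raw); i += 4 {
		ids = append(ids, binary.BigEndian.Uint32(raw[i:i+4]))
	}
	return ids, nil
}

func main() {
	// Two tokens: 0x00000001 and 0x00000100.
	ids, err := decodeTokenIDs([]byte{0, 0, 0, 1, 0, 0, 1, 0})
	fmt.Println(ids, err) // prints: [1 256] <nil>
}
```

Rejecting payloads whose length is not a multiple of four surfaces truncated events early instead of silently dropping trailing bytes.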

BlockRemovedEvent#

Published when blocks are removed from cache:

type BlockRemovedEvent struct {
    BlockHashes  []int64    // Hash values of removed blocks
    ModelName    string     // Model identifier
    LoraID       int64      // LoRA adapter ID
    SourcePod    string     // Source pod name
}
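
Taken together, the two event types let a consumer track which pods hold which blocks. The sketch below applies stored and removed events to an in-memory index. It is a simplified illustration: the `prefixIndex` type and its methods are hypothetical, it ignores `ModelName`, `LoraID`, and parent hashes, and it is not the Sync Prefix Cache Indexer's actual data structure.

```go
package main

import "fmt"

// prefixIndex maps a block hash to the set of pods that currently hold it.
type prefixIndex map[int64]map[string]struct{}

// applyStored records that a pod now holds the given blocks.
func (idx prefixIndex) applyStored(hashes []int64, pod string) {
	for _, h := range hashes {
		if idx[h] == nil {
			idx[h] = make(map[string]struct{})
		}
		idx[h][pod] = struct{}{}
	}
}

// applyRemoved forgets the given blocks for a pod and drops hashes that no
// pod holds any more. Unknown hashes are ignored.
func (idx prefixIndex) applyRemoved(hashes []int64, pod string) {
	for _, h := range hashes {
		delete(idx[h], pod) // deleting from a nil map is a no-op
		if len(idx[h]) == 0 {
			delete(idx, h)
		}
	}
}

// holders returns how many pods currently hold a block.
func (idx prefixIndex) holders(hash int64) int {
	return len(idx[hash])
}

func main() {
	idx := prefixIndex{}
	idx.applyStored([]int64{101, 102}, "vllm-pod-1")
	idx.applyStored([]int64{102}, "vllm-pod-2")
	idx.applyRemoved([]int64{102}, "vllm-pod-1")
	fmt.Println(idx.holders(101), idx.holders(102))
}
```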

Troubleshooting#

Initialization Failures#

Check initialization logs:

kubectl logs deployment/aibrix-gateway-plugins -n aibrix-system | grep -E "KV event|initialize cache"

Verify remote tokenizer:

# Must see both enabled
kubectl get deployment/aibrix-gateway-plugins -n aibrix-system -o yaml | grep -A2 "REMOTE_TOKENIZER\|KV_EVENT_SYNC"

Events Not Publishing#

Check vLLM logs:

kubectl logs deployment/vllm-model | grep "KV cache events"

Verify ZMQ connectivity:

kubectl exec -it <gateway-pod> -n aibrix-system -- nc -zv <vllm-pod-ip> 5557

Check ZMQ build support:

kubectl exec <gateway-pod> -n aibrix-system -- ldd /app/gateway-plugin | grep zmq

Connection Issues#

Verify pod labels:

kubectl get pods -l model.aibrix.ai/kv-events-enabled=true

Check network policies:
- Ensure ports 5557 and 5558 on the vLLM pods are reachable from the gateway pods
- Confirm that no NetworkPolicy blocks this traffic

Validate tokenizer:

kubectl exec <gateway-pod> -n aibrix-system -- curl http://tokenizer:8080/health

Performance Tuning#

- High Memory Usage: Lower `--kv-events-buffer-steps` in the vLLM arguments
- Event Processing Lag: Adjust the event batch size and polling timeout
- Network Overhead: Expect roughly 1 MB/s of ZMQ traffic per vLLM pod at high load

Migration from Existing Deployments#

Enable on Existing vLLM#

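Add the label to the pod template (labeling the Deployment object itself does not label its pods):

kubectl patch deployment vllm-model --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"model.aibrix.ai/kv-events-enabled":"true"}}}}}'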

Update deployment with KV event args (see Configuration section)

Restart pods:

kubectl rollout restart deployment vllm-model

Rollback#

To disable KV event sync:

# Disable in gateway
kubectl set env deployment/aibrix-gateway-plugins -n aibrix-system \
  AIBRIX_PREFIX_CACHE_KV_EVENT_SYNC_ENABLED=false

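# Remove the label from the vLLM pod template (a null value deletes the key)
kubectl patch deployment vllm-model --type merge \
  -p '{"spec":{"template":{"metadata":{"labels":{"model.aibrix.ai/kv-events-enabled":null}}}}}'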

Best Practices#

Deployment Order:
- Enable remote tokenizer first and verify it’s working
- Deploy vLLM with KV events configuration
- Enable KV sync in gateway last
Monitoring:
- Enable prefix cache metrics for visibility
- Monitor ZMQ connection status in logs
- Track prefix cache hit rates in Grafana
Resource Planning:
- ZMQ traffic: ~1MB/s per vLLM pod at high load
- Memory: Sync indexer uses ~64 bytes per prefix entry
- CPU: Minimal overhead (<1% per pod)
Production Considerations:
- Use dedicated network for ZMQ traffic if possible
- Configure appropriate timeouts based on network latency
- Plan for graceful degradation if KV sync fails
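
The resource planning figures above translate into simple capacity estimates. The sketch below does that arithmetic; the function names are hypothetical, and "1MB/s" is treated as 10^6 bytes per second, which is an assumption about the unit.

```go
package main

import "fmt"

const (
	bytesPerPrefixEntry   = 64        // sync indexer estimate from the list above
	trafficBytesPerPodSec = 1_000_000 // ~1 MB/s per vLLM pod at high load (assumed 10^6 bytes)
)

// indexerMemoryBytes estimates sync indexer memory for a number of prefix entries.
func indexerMemoryBytes(entries int64) int64 {
	return entries * bytesPerPrefixEntry
}

// zmqTrafficBytesPerSec estimates aggregate ZMQ traffic for a number of vLLM pods.
func zmqTrafficBytesPerSec(pods int64) int64 {
	return pods * trafficBytesPerPodSec
}

func main() {
	// One million prefix entries and eight pods at high load.
	fmt.Println(indexerMemoryBytes(1_000_000)) // prints: 64000000
	fmt.Println(zmqTrafficBytesPerSec(8))      // prints: 8000000
}
```

At these rates, one million indexed prefixes cost about 64 MB of gateway memory, and eight busy pods generate about 8 MB/s of event traffic.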