AIBrix KVCache Offloading Framework
The rising demand for large language models has intensified the need for efficient memory management and caching to optimize inference performance and reduce costs. In multi-round use cases like chatbots and agent-based systems, overlapping token sequences lead to redundant computations during the prefill phase, wasting resources and limiting throughput.
Many inference engines, such as vLLM, use built-in KV caching to mitigate this issue, leveraging idle HBM and DRAM. However, single-node KV caches face key limitations: constrained memory capacity, engine-specific storage that prevents sharing across instances, and difficulty supporting scenarios like KV migration and prefill-decode disaggregation.
With AIBrix v0.3.0, we introduce a production-ready KVCache Offloading Framework, which enables efficient memory tiering and low-overhead cross-engine reuse. By default, the framework uses L1 DRAM-based caching, which already improves performance significantly by relieving GPU memory pressure without incurring high latency. For scenarios requiring multi-node sharing or larger-scale reuse, users can optionally enable L2 remote caching, unlocking the benefits of a distributed KV cache layer.
Figure 1. AIBrix KVCache Offloading Framework
As shown in Figure 1, on the data plane the framework integrates tightly with inference engines (e.g., vLLM) via the AIBrix Offloading Connector, which employs optimized CUDA kernels to accelerate data movement between GPU and CPU. For memory scalability, its multi-tiered cache manager dynamically balances workloads across storage layers, alleviating GPU memory capacity limits while minimizing latency penalties. The framework supports pluggable eviction policies (e.g., LRU, S3FIFO) and diverse backend storage options (e.g., InfiniStore), enabling selective KV cache offloading to reduce network and PCIe contention. Crucially, its cache placement module can coordinate with the centralized distributed KV cache cluster manager to maximize global KV cache utilization. This enables cross-engine KV reuse and ensures cluster-wide resource efficiency, transforming isolated KV cache instances into a scalable, shared KV cache infrastructure.
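The tiering described above can be pictured with a small sketch. This is not AIBrix code: `TieredCache`, its dict-backed L1 and L2 stores, the LRU eviction, and the write-through behavior are all hypothetical stand-ins, chosen only to show the lookup order (L1 first, then L2, with promotion on an L2 hit).

```python
from collections import OrderedDict
from typing import Dict, Optional


class TieredCache:
    """Toy two-tier cache: a small local L1 in front of a larger shared L2."""

    def __init__(self, l1_capacity: int, l2_store: Dict[str, bytes]) -> None:
        self.l1: "OrderedDict[str, bytes]" = OrderedDict()
        self.l1_capacity = l1_capacity
        self.l2 = l2_store  # shared between engines in this sketch

    def put(self, key: str, value: bytes) -> None:
        # Assumption: write-through to L2, to keep the sketch short.
        self._insert_l1(key, value)
        self.l2[key] = value

    def get(self, key: str) -> Optional[bytes]:
        if key in self.l1:
            self.l1.move_to_end(key)  # mark as most recently used
            return self.l1[key]
        value = self.l2.get(key)
        if value is not None:
            self._insert_l1(key, value)  # promote the L2 hit into L1
        return value

    def _insert_l1(self, key: str, value: bytes) -> None:
        self.l1[key] = value
        self.l1.move_to_end(key)
        while len(self.l1) > self.l1_capacity:
            self.l1.popitem(last=False)  # evict the least recently used entry


shared_l2: Dict[str, bytes] = {}
engine_a = TieredCache(l1_capacity=2, l2_store=shared_l2)
engine_b = TieredCache(l1_capacity=2, l2_store=shared_l2)

engine_a.put("prefix-0", b"kv0")
hit = engine_b.get("prefix-0")  # L1 miss on engine B, served from the shared L2
```

In this toy setup, engine B never computed `prefix-0` itself, yet it finds the entry through the shared tier, which is the essence of cross-engine reuse.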
L1 Engine DRAM Cache Management
The growing demands of modern models and increasing context lengths in LLM inference have led to KV caches consuming progressively more GPU memory, pushing against the hardware limits of even the most advanced GPUs. Recent systems like Dynamo, LMCache, and MoonCake have developed solutions that offload KV cache to external memory hierarchies, spanning from CPU memory to SSDs. KVCache Offloading also supports offloading KV cache to CPU memory when only its DRAM-backed L1Cache is enabled. While this approach does not enable KV cache sharing across multiple engines, it eliminates the complexity of distributed KV cache setup and configuration. More importantly, by leveraging the significantly larger capacity of CPU memory, this method delivers substantial performance gains, making it an ideal solution for use cases that prioritize scalable KV cache capacity over cross-engine KV reuse.
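As a minimal sketch of an L1-only setup, the snippet below sets the relevant variables from the Environment Variables Reference in this document. The values shown are the documented defaults. Two points are assumptions: that an empty L2 backend means no L2 cache is used, and that the variables must be in place before the inference engine process starts.

```python
import os

# L1-only configuration: DRAM cache enabled, no L2 backend configured.
l1_only_env = {
    "AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED": "1",
    "AIBRIX_KV_CACHE_OL_L1_CACHE_CAPACITY_GB": "10",
    "AIBRIX_KV_CACHE_OL_L1_CACHE_EVICTION_POLICY": "S3FIFO",
    # Assumption: an empty backend name leaves the L2 cache disabled.
    "AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND": "",
}

# Assumption: the framework reads these at startup, so set them first.
os.environ.update(l1_only_env)
```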
L2 Distributed KVCache and Cross-Engine KV Reuse
The growing demand for large language models has significantly increased the need for expansive KV cache capacity. While CPU memory offloading effectively addresses moderate scaling needs, production environments handling massive-scale, dynamic workloads require even greater scalability, particularly when memory needs exceed single-node capacity. To address this, AIBrix enables distributed KV cache services as its L2Cache backends, which can scale horizontally across multiple nodes to meet capacity demands.
At the same time, as LLM deployments scale across multiple engines in the cluster, the redundancy of KV caches across engines introduces substantial inefficiencies. Repeated computations of common prompt prefixes waste GPU cycles and HBM bandwidth. AIBrix solves this challenge by enabling efficient cross-engine KV reuse through a high-performance, shared distributed KV cache, optimizing resource utilization at scale.
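Cross-engine reuse depends on different engines deriving the same cache key for the same prompt prefix. The sketch below illustrates that idea with a prefix-chained SHA-256 hash over fixed-size token blocks. It is an illustrative assumption, not the framework's `ROLLING_HASH` key builder; `block_keys` and the token values are hypothetical.

```python
import hashlib
from typing import List, Sequence


def block_keys(tokens: Sequence[int], block_size: int) -> List[str]:
    """Return one key per full block; each key depends on the whole prefix."""
    keys: List[str] = []
    prev = b""
    full = len(tokens) - len(tokens) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start : start + block_size]
        payload = prev + b"|" + ",".join(map(str, block)).encode()
        prev = hashlib.sha256(payload).digest()  # chain in the previous block's hash
        keys.append(prev.hex())
    return keys


system_prompt = list(range(8))  # two blocks shared by both requests
req_a = system_prompt + [100, 101, 102, 103]
req_b = system_prompt + [200, 201, 202, 203]

keys_a = block_keys(req_a, block_size=4)
keys_b = block_keys(req_b, block_size=4)
shared = [a for a, b in zip(keys_a, keys_b) if a == b]
```

Here the two requests share the keys of their first two blocks and diverge on the third, so an engine serving `req_b` could fetch the first two blocks from the shared cache instead of recomputing them.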
Adding New KVCache Backends
New KVCache backends can be easily added by implementing the Connector interface:
    # Imports this excerpt relies on. Executor is assumed to be
    # concurrent.futures.Executor; Status and MemoryRegion are AIBrix KVCache
    # types whose import paths are not shown in this excerpt.
    from abc import abstractmethod
    from concurrent.futures import Executor
    from dataclasses import dataclass
    from typing import Any, Generic, List, Sequence, Tuple, TypeVar

    import torch

    K = TypeVar("K")
    V = TypeVar("V")


    @dataclass
    class ConnectorFeature:
        """The features of the kv cache connector.
        Args:
            mput_mget: Whether the kv cache connector supports mput/mget.
            prefetch: Whether the kv cache connector supports prefetch.
            rdma: Whether the kv cache connector supports RDMA.
            gdr_put: Whether the kv cache connector supports GDR put.
            gdr_get: Whether the kv cache connector supports GDR get.
        """

        mput_mget: bool = False
        prefetch: bool = False
        rdma: bool = False
        gdr_put: bool = False
        gdr_get: bool = False


    @dataclass
    class ConnectorConfig:
        """The config of the kv cache connector."""

        backend_name: str
        namespace: str
        partition_id: str
        executor: Executor
        block_spec_signature: str = ""
        key_builder_signature: str = ""
        layout_signature: str = ""


    @dataclass
    class ConnectorRegisterDescriptor:
        """The register descriptor"""

        pass


    class Connector(Generic[K, V]):
        """Connector interface."""

        @classmethod
        @abstractmethod
        def from_envs(cls, conn_id: str, executor: Executor, **kwargs):
            """Create a connector from environment variables."""
            raise NotImplementedError

        @property
        @abstractmethod
        def name(self) -> str:
            raise NotImplementedError

        @property
        @abstractmethod
        def feature(self) -> ConnectorFeature:
            """Get the feature of the connector.
            Returns:
                The feature of the kv cache service.
            """
            raise NotImplementedError

        @abstractmethod
        def open(self) -> Status:
            """Open a connection."""
            raise NotImplementedError

        @abstractmethod
        def close(self) -> Status:
            """Close a connection."""
            raise NotImplementedError

        async def prefetch(self, keys: Sequence[K]) -> None:
            """Prefetch a list of keys.
            Args:
                keys: The keys of the kv tensors.
            """
            pass

        @abstractmethod
        async def exists(self, key: K) -> Status:
            """Check if key is in the store."""
            raise NotImplementedError

        @abstractmethod
        async def get(
            self, key: K, mr: MemoryRegion | Sequence[MemoryRegion]
        ) -> Status:
            """Get a value.
            Args:
                key: The key of the kv tensor.
                mr: The memory region or MR list to place the fetched kv
                    tensor. It is an MR list only if using GDR.
            Returns:
                The status of the get operation.
            """
            raise NotImplementedError

        @abstractmethod
        async def put(
            self, key: K, mr: MemoryRegion | Sequence[MemoryRegion]
        ) -> Status:
            """Put a key value pair.
            Args:
                key: The key of the kv cache.
                mr: The memory region or MR list holding the kv tensors. It is an
                    MR list only if using GDR.
            Returns:
                The status of the put operation.
            """
            raise NotImplementedError

        def register_slabs(self, slabs: List[torch.Tensor]) -> Status:
            """Register slabs with backend-specific register function.
            Args:
                slabs: slabs to be registered.
            Returns:
                Status of the register operation.
            """
            raise NotImplementedError

        def get_batches(
            self,
            keys: Sequence[Any],
            mrs: Sequence[MemoryRegion | Sequence[MemoryRegion]],
            batch_size: int,
        ) -> Sequence[Sequence[Tuple[K, MemoryRegion | Sequence[MemoryRegion]]]]:
            """Get a list of key MR batches that is used for mput and mget
            operations.

            Args:
                keys: The keys of the kv tensors.
                mrs: Memory regions or lists of MRs holding the kv tensors.
                batch_size: The maximum number of key MR pairs in a batch.
            Returns:
                List of key MR/MR List batches.
            """
            raise NotImplementedError

        async def mget(
            self,
            keys: Sequence[K],
            mrs: Sequence[MemoryRegion | Sequence[MemoryRegion]],
        ) -> Sequence[Status]:
            """MGet a list of values. This function is optional; only connectors
            with the mput_mget feature enabled can implement it.
            Args:
                keys: The keys of the kv tensors.
                mrs: Memory regions or lists of MRs to hold the fetched kv
                    tensors. It is an MR list only if using GDR.
            Returns:
                List of statuses.
            """
            raise NotImplementedError

        async def mput(
            self,
            keys: Sequence[K],
            mrs: Sequence[MemoryRegion | Sequence[MemoryRegion]],
        ) -> Sequence[Status]:
            """MPut a list of key value pairs. This function is optional; only
            connectors with the mput_mget feature enabled can implement it.
            Args:
                keys: The keys of the kv tensors.
                mrs: Memory regions or lists of MRs holding the kv tensors. It is
                    an MR list only if using GDR.
            Returns:
                List of statuses.
            """
            raise NotImplementedError

        @abstractmethod
        async def delete(self, key: K) -> Status:
            """Delete a key.
            Args:
                key: The key of the kv cache.
            Returns:
                The status of the delete operation.
            """
            raise NotImplementedError
Please refer to the existing connectors for more details.
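To make the shape of a backend more concrete, here is a self-contained toy connector that stores values in a Python dict. It mirrors the method names of the interface above but does not subclass the real `Connector`. The `Status` and `MemoryRegion` classes below are simplified, hypothetical stand-ins for the framework's own types, whose actual fields and constructors are not shown in this document.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict


@dataclass
class Status:
    """Hypothetical stand-in for the framework's Status type."""

    ok: bool
    message: str = ""


@dataclass
class MemoryRegion:
    """Hypothetical stand-in: a fixed-size byte buffer."""

    data: bytearray


class InMemoryConnector:
    """Toy backend that follows the shape of the Connector interface."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    @property
    def name(self) -> str:
        return "InMemory"

    def open(self) -> Status:
        return Status(ok=True)

    def close(self) -> Status:
        self._store.clear()
        return Status(ok=True)

    async def exists(self, key: str) -> Status:
        return Status(ok=key in self._store)

    async def put(self, key: str, mr: MemoryRegion) -> Status:
        self._store[key] = bytes(mr.data)  # copy out of the caller's region
        return Status(ok=True)

    async def get(self, key: str, mr: MemoryRegion) -> Status:
        value = self._store.get(key)
        if value is None:
            return Status(ok=False, message="not found")
        if len(value) != len(mr.data):
            return Status(ok=False, message="size mismatch")
        mr.data[:] = value  # copy into the caller's region
        return Status(ok=True)

    async def delete(self, key: str) -> Status:
        self._store.pop(key, None)
        return Status(ok=True)


async def demo() -> bytes:
    conn = InMemoryConnector()
    conn.open()
    await conn.put("k0", MemoryRegion(bytearray(b"abcd")))
    out = MemoryRegion(bytearray(4))
    status = await conn.get("k0", out)
    conn.close()
    return bytes(out.data) if status.ok else b""


result = asyncio.run(demo())
```

A real backend would additionally provide `from_envs` and `feature`, and would implement `mput`/`mget` only if it advertises the `mput_mget` feature.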
Environment Variables Reference
This section describes all available environment variables for the AIBrix KVCache Offloading Framework.
Core Configuration

| Variable | Default | Description |
|---|---|---|
| AIBRIX_KV_CACHE_OL_CHUNK_SIZE | "512" | Chunk size for operations. |
| AIBRIX_KV_CACHE_OL_BLOCK_SIZE | "-1" | Number of tokens in a kvcache block (the finest IO granularity). Defaults to -1, which means the engine's block size is used. |
| AIBRIX_KV_CACHE_OL_MAX_SEQ_LEN | "-1" | Maximum sequence length. Defaults to -1, which means no limit. If set, tokens beyond this length will be ignored. |
| AIBRIX_KV_CACHE_OL_TIME_MEASUREMENT_ENABLED | "1" | Enable time measurement. |
| AIBRIX_KV_CACHE_OL_BREAKDOWN_MEASUREMENT_ENABLED | "1" | Enable breakdown measurement. |
| AIBRIX_KV_CACHE_OL_DOUBLE_GET_THRESHOLD | "4,0.1" | Controls when to issue a second get request to L2 cache. The first value is the minimum number of missing blocks; the second is the ratio threshold. |
| AIBRIX_KV_CACHE_OL_TOKEN_VALIDATION_ENABLED | "0" | Whether to validate tokens in L2 cache. Disabling it uses a tighter memory layout. |
| AIBRIX_KV_CACHE_OL_TRANSPORT_RDMA_ADDR_RANGE | "::/0" | Valid GID range (CIDR format). Similar to NVSHMEM_IB_ADDR_RANGE for NVSHMEM. |
| AIBRIX_KV_CACHE_OL_PROFILING_ENABLED | "0" | Enable profiling. |
| AIBRIX_KV_CACHE_OL_PROFILING_SERVER_ADDRESS | | Profiling server address. The profiling server is responsible for collecting profiling data and displaying it in a web UI. |
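Several core settings use sentinel or composite string values. The helpers below sketch how such values could be interpreted, based only on the descriptions in the table: `-1` as "use the engine's block size" or "no limit", and the double-get threshold as a comma-separated pair. The function names are hypothetical and this is not the framework's own parsing code.

```python
from typing import List, Mapping, Sequence, Tuple


def effective_block_size(env: Mapping[str, str], engine_block_size: int) -> int:
    """-1 (the default) means: fall back to the engine's block size."""
    value = int(env.get("AIBRIX_KV_CACHE_OL_BLOCK_SIZE", "-1"))
    return engine_block_size if value == -1 else value


def truncate_tokens(env: Mapping[str, str], tokens: Sequence[int]) -> List[int]:
    """-1 (the default) means no limit; otherwise extra tokens are ignored."""
    max_len = int(env.get("AIBRIX_KV_CACHE_OL_MAX_SEQ_LEN", "-1"))
    return list(tokens) if max_len == -1 else list(tokens[:max_len])


def double_get_threshold(env: Mapping[str, str]) -> Tuple[int, float]:
    """Split "min_missing_blocks,ratio" into its two typed parts."""
    raw = env.get("AIBRIX_KV_CACHE_OL_DOUBLE_GET_THRESHOLD", "4,0.1")
    min_missing, ratio = raw.split(",")
    return int(min_missing), float(ratio)


example_env = {"AIBRIX_KV_CACHE_OL_MAX_SEQ_LEN": "4"}
block_size = effective_block_size(example_env, engine_block_size=16)
kept = truncate_tokens(example_env, [1, 2, 3, 4, 5, 6])
threshold = double_get_threshold(example_env)
```

Passing `os.environ` instead of `example_env` would read the values from the process environment.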
L1 Cache Configuration

| Variable | Default | Description |
|---|---|---|
| AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED | "1" | Enable L1 cache. |
| AIBRIX_KV_CACHE_OL_L1_CACHE_EVICTION_POLICY | "S3FIFO" | Eviction policy for L1 cache ("S3FIFO", "LRU", or "FIFO"). |
| AIBRIX_KV_CACHE_OL_L1_CACHE_CAPACITY_GB | "10" | L1 cache capacity in GB. |
| AIBRIX_KV_CACHE_OL_DEVICE | "cpu" | Device to use for cache operations ("cpu" or "cuda"). |
| AIBRIX_KV_CACHE_OL_S3FIFO_SMALL_TO_MAIN_PROMO_THRESHOLD | "1" | S3FIFO eviction policy: promotion threshold from small to main queue. |
| AIBRIX_KV_CACHE_OL_S3FIFO_SMALL_FIFO_CAPACITY_RATIO | 0.3 | S3FIFO eviction policy: capacity ratio for small FIFO. |
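When choosing an L1 capacity, a rough estimate of how many tokens fit can help. The arithmetic below is a back-of-the-envelope sketch under stated assumptions: a hypothetical model with 32 layers, 8 KV heads, head dimension 128, and 2-byte (fp16) elements; "GB" interpreted as GiB; a block size of 16 tokens; and no metadata or allocator overhead.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int
) -> int:
    # The factor 2 accounts for storing both the K and the V tensor.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical model shape, chosen only for illustration.
per_token = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2
)

capacity_bytes = 10 * 1024**3  # default "10", assuming GB means GiB here
tokens_in_l1 = capacity_bytes // per_token
blocks_in_l1 = tokens_in_l1 // 16  # assuming 16 tokens per block

print(per_token, tokens_in_l1, blocks_in_l1)  # 131072 81920 5120
```

Under these assumptions each token needs 128 KiB of KV cache, so the default capacity holds roughly 80K tokens.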
L2 Cache Configuration

| Variable | Default | Description |
|---|---|---|
| AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND | "" | Backend for L2 cache. |
| AIBRIX_KV_CACHE_OL_L2_CACHE_NAMESPACE | "aibrix" | Namespace for L2 cache. |
| AIBRIX_KV_CACHE_OL_L2_CACHE_OP_BATCH | "32" | Operation batch size. |
| AIBRIX_KV_CACHE_OL_L2_CACHE_PER_TOKEN_TIMEOUT_MS | "20" | Per-token timeout in milliseconds. |
| AIBRIX_KV_CACHE_OL_L2_CACHE_KEY_BUILDER | "ROLLING_HASH" | Key builder for L2 cache ("RAW", "ROLLING_HASH", or "SIMPLE_HASH"). |
| AIBRIX_KV_CACHE_OL_L2_CACHE_INGESTION_TYPE | "HOT" | Ingestion type ("ALL", "HOT", or "EVICTED"). |
| AIBRIX_KV_CACHE_OL_L2_CACHE_INGESTION_MAX_INFLIGHT_TOKENS | "0" | Max inflight writes (0 for synchronous). |
| AIBRIX_KV_CACHE_OL_L2_CACHE_NUM_ASYNC_WORKERS | "8" | Number of async workers. |
| AIBRIX_KV_CACHE_OL_L2_CACHE_PLACEMENT_POLICY | "SIMPLE" | Placement policy (only applicable if using meta service). |
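The operation batch size bounds how many key/memory-region pairs travel together in one batched operation. A minimal sketch of such batching is plain chunking, shown below. `make_batches` is a hypothetical helper; an actual connector's `get_batches` may group pairs by other criteria, so treat this only as an illustration of the batch-size limit.

```python
from typing import List, Sequence, Tuple, TypeVar

K = TypeVar("K")
M = TypeVar("M")


def make_batches(
    keys: Sequence[K], mrs: Sequence[M], batch_size: int
) -> List[List[Tuple[K, M]]]:
    """Pair keys with memory regions and split them into bounded batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if len(keys) != len(mrs):
        raise ValueError("keys and mrs must have the same length")
    pairs = list(zip(keys, mrs))
    return [pairs[i : i + batch_size] for i in range(0, len(pairs), batch_size)]


# 70 pairs with the default batch size of 32; integers stand in for regions.
batches = make_batches([f"k{i}" for i in range(70)], list(range(70)), batch_size=32)
batch_sizes = [len(batch) for batch in batches]
```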
Meta Service Configuration

| Variable | Default | Description |
|---|---|---|
| AIBRIX_KV_CACHE_OL_META_SERVICE_BACKEND | "" | Backend for meta service. If meta service backend is not set, L2 cache backend will use direct mode to access the given cache server. Otherwise, we will get membership information from meta service and construct the L2 cache cluster. |
| AIBRIX_KV_CACHE_OL_META_SERVICE_REFRESH_INTERVAL_S | "30" | Refresh interval in seconds. |
| AIBRIX_KV_CACHE_OL_META_SERVICE_URL | "" | URL for meta service. |
| AIBRIX_KV_CACHE_OL_META_SERVICE_CLUSTER_META_KEY | "" | Cluster meta key. |
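The choice between direct mode and cluster mode follows from whether a meta service backend is set. The sketch below expresses that rule as a function. The name `l2_access_mode`, the returned labels, and the backend value `"SOME_BACKEND"` are hypothetical and used only for illustration.

```python
from typing import Mapping


def l2_access_mode(env: Mapping[str, str]) -> str:
    """Return "direct" when no meta service backend is set, else "cluster"."""
    backend = env.get("AIBRIX_KV_CACHE_OL_META_SERVICE_BACKEND", "")
    return "direct" if backend == "" else "cluster"


mode_default = l2_access_mode({})
# "SOME_BACKEND" is a placeholder, not a documented backend name.
mode_with_meta = l2_access_mode(
    {"AIBRIX_KV_CACHE_OL_META_SERVICE_BACKEND": "SOME_BACKEND"}
)
```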
Connector Configurations
InfiniStore Connector Configuration

| Variable | Default | Description |
|---|---|---|
| AIBRIX_KV_CACHE_OL_INFINISTORE_HOST_ADDR | "127.0.0.1" | Host address. |
| AIBRIX_KV_CACHE_OL_INFINISTORE_SERVICE_PORT | "12345" | Service port. |
| AIBRIX_KV_CACHE_OL_INFINISTORE_CONNECTION_TYPE | "RDMA" | Connection type. |
| AIBRIX_KV_CACHE_OL_INFINISTORE_IB_PORT | "1" | IB port. |
| AIBRIX_KV_CACHE_OL_INFINISTORE_LINK_TYPE | "Ethernet" | Link type. |
| AIBRIX_KV_CACHE_OL_INFINISTORE_VISIBLE_DEV_LIST | "" | Visible device list. Since 0.2.42, InfiniStore supports an RDMA GID index in the client config, so users can specify the GID index of each device in this format: "mlx5_0:gid0,mlx5_1:gid1,mlx5_2:gid2" |
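The device list format can be read as comma-separated `device:gid_index` entries. The parser below is a sketch of that reading, not the connector's own code. It assumes the GID index is an integer and that an entry without a `:` carries no explicit index; `parse_visible_dev_list` is a hypothetical name.

```python
from typing import Dict, Optional


def parse_visible_dev_list(raw: str) -> Dict[str, Optional[int]]:
    """Map each device name to its GID index, or None if no index is given."""
    devices: Dict[str, Optional[int]] = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty string or stray commas
        name, sep, gid = item.partition(":")
        devices[name] = int(gid) if sep else None
    return devices


with_gids = parse_visible_dev_list("mlx5_0:3,mlx5_1:3")
without_gids = parse_visible_dev_list("mlx5_0,mlx5_1")
empty = parse_visible_dev_list("")
```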
HPKV Connector Configuration

| Variable | Default | Description |
|---|---|---|
| AIBRIX_KV_CACHE_OL_HPKV_REMOTE_ADDR | "127.0.0.1" | Remote address. |
| AIBRIX_KV_CACHE_OL_HPKV_REMOTE_PORT | "12346" | Remote port. |
| AIBRIX_KV_CACHE_OL_HPKV_LOCAL_ADDR | "127.0.0.1" | Local address. |
| AIBRIX_KV_CACHE_OL_HPKV_LOCAL_PORT | "12345" | Local port. |
PrisKV Connector Configuration

| Variable | Default | Description |
|---|---|---|
| AIBRIX_KV_CACHE_OL_PRISKV_REMOTE_ADDR | "127.0.0.1" | Remote address. |
| AIBRIX_KV_CACHE_OL_PRISKV_REMOTE_PORT | "6379" | Remote port. |
| AIBRIX_KV_CACHE_OL_PRISKV_USE_MPUT_MGET | "0" | Enable MPUT/MGET. |
| AIBRIX_KV_CACHE_OL_PRISKV_PASSWORD | "" | Password. |
EIC Connector Configuration

| Variable | Default | Description |
|---|---|---|
| AIBRIX_KV_CACHE_OL_EIC_CONFIG_FILE | "" | EIC config file. |
Mock Connector Configuration
The mock connector is used for testing and profiling purposes.

| Variable | Default | Description |
|---|---|---|
| AIBRIX_KV_CACHE_OL_MOCK_USE_RDMA | "0" | Use RDMA in mock connector. |
| AIBRIX_KV_CACHE_OL_MOCK_USE_MPUT_MGET | "0" | Use MPUT/MGET in mock connector. |
| AIBRIX_KV_CACHE_OL_MOCK_USE_NOOP | "0" | Use NOOP for all operations. Useful for profiling the framework overhead. |
RocksDB Connector Configuration
The RocksDB connector is used for testing purposes.

| Variable | Default | Description |
|---|---|---|
| AIBRIX_KV_CACHE_OL_ROCKSDB_ROOT | "~/.kv_cache_ol/rocksdb" | Root directory for RocksDB. |
| AIBRIX_KV_CACHE_OL_ROCKSDB_TTL_S | "600" | TTL in seconds. |
| AIBRIX_KV_CACHE_OL_ROCKSDB_WRITE_BUFFER_SIZE | "67108864" | Write buffer size. Default 64MB. |
| AIBRIX_KV_CACHE_OL_ROCKSDB_TARGET_FILE_SIZE_BASE | "67108864" | Target file size base. Default 64MB. |
| AIBRIX_KV_CACHE_OL_ROCKSDB_MAX_WRITE_BUFFER_NUMBER | "3" | Max write buffers. |
| AIBRIX_KV_CACHE_OL_ROCKSDB_MAX_TOTAL_WAL_SIZE | "134217728" | Max total WAL size. Default 128MB. |
| AIBRIX_KV_CACHE_OL_ROCKSDB_MAX_BACKGROUND_JOBS | "8" | Max background jobs. |