AIBrix KVCache Offloading Framework#

The rising demand for large language models has intensified the need for efficient memory management and caching to optimize inference performance and reduce costs. In multi-round use cases like chatbots and agent-based systems, overlapping token sequences lead to redundant computations during the prefill phase, wasting resources and limiting throughput.

Many inference engines, such as vLLM, use built-in KV caching to mitigate this issue, leveraging idle HBM and DRAM. However, single-node KV caches face key limitations: constrained memory capacity, engine-specific storage that prevents sharing across instances, and difficulty supporting scenarios like KV migration and prefill-decode disaggregation.

With AIBrix v0.3.0, we introduce a production-ready KVCache Offloading Framework, which enables efficient memory tiering and low-overhead cross-engine reuse. By default, the framework leverages L1 DRAM-based caching, which already provides significant performance improvements by offloading GPU memory pressure without incurring high latency. For scenarios requiring multi-node sharing or larger-scale reuse, AIBrix allows users to optionally enable L2 remote caching, unlocking the benefits of a distributed KV cache layer.

aibrix-kvcache-offloading-arch-overview

Figure 1. AIBrix KVCache Offloading Framework

As shown in Figure 1, on the data plane, it integrates tightly with inference engines (e.g., vLLM) via AIBrix Offloading Connector, which employs optimized CUDA kernels to significantly accelerate data movement between GPU and CPU. For memory scalability, its multi-tiered cache manager dynamically balances workloads across storage layers, alleviating GPU memory capacity limits while minimizing latency penalties. The framework supports pluggable eviction policies (e.g., LRU, S3FIFO) and diverse backend storage options (e.g., InfiniStore), enabling selective KV cache offloading to reduce network and PCIe contention. Crucially, its cache placement module can coordinate with the centralized distributed KV cache cluster manager to maximize global KV cache utilization. This enables cross-engine KV reuse and ensures cluster-wide resource efficiency, transforming isolated KV cache instances into a scalable, shared KV cache infrastructure.

L1 Engine DRAM Cache Management#

The growing demands of modern models and increasing context lengths in LLM inference have led to KV caches consuming progressively more GPU memory, pushing against the hardware limits of even the most advanced GPUs. Recent systems like Dynamo, LMCache, and MoonCake have developed solutions that offload KV cache to external memory hierarchies, spanning from CPU memory to SSDs. KVCache Offloading supports the offloading of KV cache to CPU memory as well by only enabling its DRAM-backed L1Cache. While this approach does not enable KV cache sharing across multiple engines, it eliminates the complexity of distributed KV cache setup and configuration. More importantly, by leveraging the significantly larger capacity of CPU memory, this method delivers substantial performance gains —- making it an ideal solution for use cases that prioritize scalable KV cache capacity over cross-engine KV reuse.

L2 Distributed KVCache and Cross-Engine KV Reuse#

The growing demand for large language models has significantly increased the need for expansive KV cache capacity. While CPU memory offloading effectively addresses moderate scaling needs, production environments handling massive-scale, dynamic workloads require even greater scalability – particularly when memory needs exceed single-node capacities. To address this, AIBrix enables distributed KV cache services as its L2Cache backends, which can scale horizontally across multiple nodes to meet capacity demands.

In the meantime, as LLM deployments scale across multiple engines in the cluster, the redundancy of KV caches across engines introduces substantial inefficiencies. Repeated computations of common prompt prefixes waste GPU cycles and HBM bandwidth. AIBrix solves this challenge by enabling efficient cross-engine KV reuse through a high-performance, shared distributed KV cache, optimizing resource utilization at scale.

aibrix-infinistore-arch-overview

Adding New KVCache Backends#

New KVCache backends can be easily added by implementing the Connector interface:

  1@dataclass
  2class ConnectorFeature:
  3    """The features of the kv cache connector.
  4    Args:
  5        mput_mget: Whether the kv cache connector supports mput/mget
  6        prefetch: Whether the kv cache connector supports prefetch.
  7        rdma: Whether the kv cache connector supports RDMA.
  8        gdr_put: Whether the kv cache connector supports GDR put.
  9        gdr_get: Whether the kv cache connector supports GDR get.
 10    """
 11
 12    mput_mget: bool = False
 13    prefetch: bool = False
 14    rdma: bool = False
 15    gdr_put: bool = False
 16    gdr_get: bool = False
 17
 18
 19@dataclass
 20class ConnectorConfig:
 21    """The config of the kv cache connector."""
 22
 23    backend_name: str
 24    namespace: str
 25    partition_id: str
 26    executor: Executor
 27    block_spec_signature: str = ""
 28    key_builder_signature: str = ""
 29    layout_signature: str = ""
 30
 31
 32@dataclass
 33class ConnectorRegisterDescriptor:
 34    """The register descriptor"""
 35
 36    pass
 37
 38
 39class Connector(Generic[K, V]):
 40    """Connector interface."""
 41
 42    @classmethod
 43    @abstractmethod
 44    def from_envs(cls, conn_id: str, executor: Executor, **kwargs):
 45        """Create a connector from environment variables."""
 46        raise NotImplementedError
 47
 48    @property
 49    @abstractmethod
 50    def name(self) -> str:
 51        raise NotImplementedError
 52
 53    @property
 54    @abstractmethod
 55    def feature(self) -> ConnectorFeature:
 56        """Get the feature of the connector.
 57        Returns:
 58            The feature of the kv cache service.
 59        """
 60        raise NotImplementedError
 61
 62    @abstractmethod
 63    def open(self) -> Status:
 64        """Open a connection."""
 65        raise NotImplementedError
 66
 67    @abstractmethod
 68    def close(self) -> Status:
 69        """Close a connection."""
 70        raise NotImplementedError
 71
 72    async def prefetch(self, keys: Sequence[K]) -> None:
 73        """Prefetch a list of keys.
 74        Args:
 75            keys: The keys of the kv tensors.
 76        """
 77        pass
 78
 79    @abstractmethod
 80    async def exists(self, key: K) -> Status:
 81        """Check if key is in the store."""
 82        raise NotImplementedError
 83
 84    @abstractmethod
 85    async def get(
 86        self, key: K, mr: MemoryRegion | Sequence[MemoryRegion]
 87    ) -> Status:
 88        """Get a value.
 89        Args:
 90            key: The key of the kv tensor.
 91            mr: The memory region or MR list to place the fetched kv
 92                tensor. It is an MR list only if using GDR.
 93        Returns:
 94            The status of the get operation.
 95        """
 96        raise NotImplementedError
 97
 98    @abstractmethod
 99    async def put(
100        self, key: K, mr: MemoryRegion | Sequence[MemoryRegion]
101    ) -> Status:
102        """Put a key value pair.
103        Args:
104            key: The key of the kv cache.
105            mr: The memory region or MR list holding the kv tensors. It is an
106                MR list only if using GDR.
107        Returns:
108            The status of the put operation.
109        """
110        raise NotImplementedError
111
112    def register_slabs(self, slabs: List[torch.Tensor]) -> Status:
113        """Register slabs with backend-specific register function.
114        Args:
115            slabs: slabs to be registered.
116        Returns:
117            Status of the register operation.
118        """
119        raise NotImplementedError
120
121    def get_batches(
122        self,
123        keys: Sequence[Any],
124        mrs: Sequence[MemoryRegion | Sequence[MemoryRegion]],
125        batch_size: int,
126    ) -> Sequence[Sequence[Tuple[K, MemoryRegion | Sequence[MemoryRegion]]]]:
127        """Get a list of key MR batches that is used for mput and mget
128        operations.
129
130        Args:
131            keys: The keys of the kv tensors.
132            mrs: Memory regions or lists of MRs holding the kv tensors.
133            batch_size: The maximum number of key MR pairs in a batch.
134        Returns:
135            List of key MR/MR List batches.
136        """
137        raise NotImplementedError
138
139    async def mget(
140        self,
141        keys: Sequence[K],
142        mrs: Sequence[MemoryRegion | Sequence[MemoryRegion]],
143    ) -> Sequence[Status]:
144        """MGet a list of values. This function is optional and only connectors
145        have mput_mget feature enabled can implement this function.
146        Args:
147            keys: The keys of the kv tensors.
148            mrs: Memory regions or lists of MRs to hold the fetched kv
149                 tensors. It is an MR list only if using GDR.
150        Returns:
151            List of statuses.
152        """
153        raise NotImplementedError
154
155    async def mput(
156        self,
157        keys: Sequence[K],
158        mrs: Sequence[MemoryRegion | Sequence[MemoryRegion]],
159    ) -> Sequence[Status]:
160        """MPut a list of key value pairs. This function is optional and only
161        connectors have mput_mget feature enabled can implement this function.
162        Args:
163            keys: The keys of the kv tensors.
164            mrs: Memory regions or lists of MRs holding the kv tensors. It is
165                 an MR list only if using GDR.
166        Returns:
167            List of statuses.
168        """
169        raise NotImplementedError
170
171    @abstractmethod
172    async def delete(self, key: K) -> Status:
173        """Delete a key.
174        Args:
175            key: The key of the kv cache.
176        Returns:
177            The status of the delete operation.
178        """
179        raise NotImplementedError

Please refer to the existing connectors for more details.

Environment Variables Reference#

This section describes all available environment variables for AIBrix KVCache Offloading Framework.

Core Configuration#

Variable

Default

Description

AIBRIX_KV_CACHE_OL_CHUNK_SIZE

“512”

Chunk size for operations.

AIBRIX_KV_CACHE_OL_BLOCK_SIZE

“-1”

Number of tokens in a kvcache block (the finest IO granularity). Defaults to -1, which means to use engine’s block size.

AIBRIX_KV_CACHE_OL_MAX_SEQ_LEN

“-1”

Maximum sequence length. Defaults to -1, which means no limit. If set, tokens beyond this length will be ignored.

AIBRIX_KV_CACHE_OL_TIME_MEASUREMENT_ENABLED

“1”

Enable time measurement.

AIBRIX_KV_CACHE_OL_BREAKDOWN_MEASUREMENT_ENABLED

“1”

Enable breakdown measurement.

AIBRIX_KV_CACHE_OL_DOUBLE_GET_THRESHOLD

“4,0.1”

Controls when to issue a second get request to L2 cache. First value is minimum missing blocks, second is ratio threshold.

AIBRIX_KV_CACHE_OL_TOKEN_VALIDATION_ENABLED

“0”

Whether to validate tokens in L2 cache. Disabling uses tighter memory layout.

AIBRIX_KV_CACHE_OL_TRANSPORT_RDMA_ADDR_RANGE

“::/0”

Valid GID range (CIDR format). Similar to NVSHMEM_IB_ADDR_RANGE for NVSHMEM.

AIBRIX_KV_CACHE_OL_PROFILING_ENABLED

“0”

Enable profiling.

AIBRIX_KV_CACHE_OL_PROFILING_SERVER_ADDRESS

http://0.0.0.0:4040

Profiling server address. Profiling server is responsible for collecting profiling data and displaying it in a web UI.

L1 Cache Configuration#

Variable

Default

Description

AIBRIX_KV_CACHE_OL_L1_CACHE_ENABLED

“1”

Enable L1 cache.

AIBRIX_KV_CACHE_OL_L1_CACHE_EVICTION_POLICY

“S3FIFO”

Eviction policy for L1 cache (“S3FIFO”, “LRU”, or “FIFO”)

AIBRIX_KV_CACHE_OL_L1_CACHE_CAPACITY_GB

“10”

L1 cache capacity in GB.

AIBRIX_KV_CACHE_OL_DEVICE

“cpu”

Device to use for cache operations (“cpu” or “cuda”)

AIBRIX_KV_CACHE_OL_S3FIFO_SMALL_TO_MAIN_PROMO_THRESHOLD

“1”

S3FIFO eviction policy: promotion threshold from small to main queue.

AIBRIX_KV_CACHE_OL_S3FIFO_SMALL_FIFO_CAPACITY_RATIO

0.3

S3FIFO eviction policy: capacity ratio for small FIFO.

L2 Cache Configuration#

Variable

Default

Description

AIBRIX_KV_CACHE_OL_L2_CACHE_BACKEND

“”

Backend for L2 cache.

AIBRIX_KV_CACHE_OL_L2_CACHE_NAMESPACE

“aibrix”

Namespace for L2 cache.

AIBRIX_KV_CACHE_OL_L2_CACHE_OP_BATCH

“32”

Operation batch size.

AIBRIX_KV_CACHE_OL_L2_CACHE_PER_TOKEN_TIMEOUT_MS

“20”

Per-token timeout in milliseconds.

AIBRIX_KV_CACHE_OL_L2_CACHE_KEY_BUILDER

“ROLLING_HASH”

Key builder for L2 cache (“RAW”, “ROLLING_HASH”, or “SIMPLE_HASH”)

AIBRIX_KV_CACHE_OL_L2_CACHE_INGESTION_TYPE

“HOT”

Ingestion type (“ALL”, “HOT”, or “EVICTED”).

AIBRIX_KV_CACHE_OL_L2_CACHE_INGESTION_MAX_INFLIGHT_TOKENS

“0”

Max inflight writes (0 for synchronous).

AIBRIX_KV_CACHE_OL_L2_CACHE_NUM_ASYNC_WORKERS

“8”

Number of async workers.

AIBRIX_KV_CACHE_OL_L2_CACHE_PLACEMENT_POLICY

“SIMPLE”

Placement policy (only applicable if using meta service).

Meta Service Configuration#

Variable

Default

Description

AIBRIX_KV_CACHE_OL_META_SERVICE_BACKEND

“”

Backend for meta service. If meta service backend is not set, L2 cache backend will use direct mode to access the given cache server. Otherwise, we will get membership information from meta service and construct the L2 cache cluster.

AIBRIX_KV_CACHE_OL_META_SERVICE_REFRESH_INTERVAL_S

“30”

Refresh interval in seconds.

AIBRIX_KV_CACHE_OL_META_SERVICE_URL

“”

URL for meta service.

AIBRIX_KV_CACHE_OL_META_SERVICE_CLUSTER_META_KEY

“”

Cluster meta key.

Connector Configurations#

InfiniStore Connector Configuration#

Variable

Default

Description

AIBRIX_KV_CACHE_OL_INFINISTORE_HOST_ADDR

“127.0.0.1”

Host address.

AIBRIX_KV_CACHE_OL_INFINISTORE_SERVICE_PORT

“12345”

Service port.

AIBRIX_KV_CACHE_OL_INFINISTORE_CONNECTION_TYPE

“RDMA”

Connection type.

AIBRIX_KV_CACHE_OL_INFINISTORE_IB_PORT

“1”

IB port.

AIBRIX_KV_CACHE_OL_INFINISTORE_LINK_TYPE

“Ethernet”

Link type.

AIBRIX_KV_CACHE_OL_INFINISTORE_VISIBLE_DEV_LIST

“”

Visible device list. Since 0.2.42, InfiniStore supports RDMA GID index in client config, users can specify the GID index of each device in this format: “mlx5_0:gid0,mlx5_1:gid1,mlx5_2:gid2”

HPKV Connector Configuration#

Variable

Default

Description

AIBRIX_KV_CACHE_OL_HPKV_REMOTE_ADDR

“127.0.0.1”

Remote address.

AIBRIX_KV_CACHE_OL_HPKV_REMOTE_PORT

“12346”

Remote port.

AIBRIX_KV_CACHE_OL_HPKV_LOCAL_ADDR

“127.0.0.1”

Local address.

AIBRIX_KV_CACHE_OL_HPKV_LOCAL_PORT

“12345”

Local port.

PrisKV Connector Configuration#

Variable

Default

Description

AIBRIX_KV_CACHE_OL_PRISKV_REMOTE_ADDR

“127.0.0.1”

Remote address.

AIBRIX_KV_CACHE_OL_PRISKV_REMOTE_PORT

“6379”

Remote port.

AIBRIX_KV_CACHE_OL_PRISKV_USE_MPUT_MGET

“0”

Enable MPUT/MGET.

AIBRIX_KV_CACHE_OL_PRISKV_PASSWORD

“”

Password.

EIC Connector Configuration#

Variable

Default

Description

AIBRIX_KV_CACHE_OL_EIC_CONFIG_FILE

“”

EIC config file.

Mock Connector Configuration#

Mock connector is used for testing and profiling purposes.

Variable

Default

Description

AIBRIX_KV_CACHE_OL_MOCK_USE_RDMA

“0”

Use RDMA in mock connector.

AIBRIX_KV_CACHE_OL_MOCK_USE_MPUT_MGET

“0”

Use MPUT/MGET in mock connector.

AIBRIX_KV_CACHE_OL_MOCK_USE_NOOP

“0”

Use NOOP for all operations. Useful for profiling the framework overhead.

RocksDB Connector Configuration#

RocksDB connector is used for testing purposes.

Variable

Default

Description

AIBRIX_KV_CACHE_OL_ROCKSDB_ROOT

“~/.kv_cache_ol/rocksdb”

Root directory for RocksDB.

AIBRIX_KV_CACHE_OL_ROCKSDB_TTL_S

“600”

TTL in seconds.

AIBRIX_KV_CACHE_OL_ROCKSDB_WRITE_BUFFER_SIZE

“67108864”

Write buffer size. Default 64MB.

AIBRIX_KV_CACHE_OL_ROCKSDB_TARGET_FILE_SIZE_BASE

“67108864”

Target file size base. Default 64MB.

AIBRIX_KV_CACHE_OL_ROCKSDB_MAX_WRITE_BUFFER_NUMBER

“3”

Max write buffers.

AIBRIX_KV_CACHE_OL_ROCKSDB_MAX_TOTAL_WAL_SIZE

“134217728”

Max total WAL size. Default 128MB.

AIBRIX_KV_CACHE_OL_ROCKSDB_MAX_BACKGROUND_JOBS

“8”

Max background jobs.