Deploying Gateway#

This guide covers production deployment configuration for the AIBrix gateway components.

Configuring Resources and Replica Count#

For production deployments, we recommend tuning replica counts and resource allocations for both the Gateway Plugin and the Envoy Proxy data plane.

Gateway Plugin#

The gateway plugin (gatewayPlugin) handles request routing logic and external processing. Set replicaCount and container resources in your values.yaml override:

gatewayPlugin:
  replicaCount: 3
  container:
    resources:
      limits:
        cpu: "16"
        memory: 32Gi
      requests:
        cpu: "16"
        memory: 32Gi

Envoy Proxy#

The Envoy proxy (gateway.envoyProxy) is the data-plane component managed by Envoy Gateway. Set replicas and per-container resources as follows:

gateway:
  envoyProxy:
    replicas: 3
    container:
      envoy:
        resources:
          limits:
            cpu: "8"
            memory: 16Gi
          requests:
            cpu: "8"
            memory: 16Gi

Enabling Redis for Multi-Replica Deployments#

When running more than one gateway plugin replica, shared state is required so that all instances agree on routing decisions (e.g. prefix-cache block assignments, rate-limit counters). The gateway plugin reads the REDIS_HOST environment variable at startup to connect to a Redis instance. If the variable is unset or Redis is unreachable, each replica operates with in-process state only, which causes inconsistent routing across pods.

Enable Redis by pointing the gateway plugin at the bundled Redis instance:

gatewayPlugin:
  dependencies:
    redis:
      host: ""   # leave empty to use the chart-managed Redis (aibrix-redis-master)
      port: 6379

The Helm chart sets REDIS_HOST automatically from gatewayPlugin.dependencies.redis.host, defaulting to <release-name>-redis-master when the field is empty. For an external Redis cluster, set host to the service hostname or IP of your Redis endpoint.

Sizing Redis#

The bundled Redis instance is deployed under metadata.redis. For production workloads with multiple gateway plugin replicas, increase its CPU, memory, and (if using persistence) storage from the defaults:

metadata:
  redis:
    replicas: 1
    container:
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi

The right sizing depends on request throughput and the number of distinct prefix-cache keys in flight. As a starting point, allocate roughly 1 GiB of memory per 1 000 concurrent requests and increase CPU if Redis becomes a latency bottleneck (monitor redis_commands_duration_seconds).

Note

If you are using an externally managed Redis (e.g. AWS ElastiCache, Google Memorystore), set gatewayPlugin.dependencies.redis.host to the external endpoint and remove the metadata.redis block from your override — the chart will not deploy its own Redis when a custom host is provided.

Configuring Buffer Limits, Connections, and QPS#

AIBrix exposes three policy knobs under gateway to control how traffic flows between clients, the Envoy proxy, and backend pods.

Client Traffic Policy#

clientTrafficPolicy governs connections and request sizes arriving from external clients:

  • bufferLimit — maximum bytes buffered per request/response body (input/output size). The default 4194304 is 4 MiB. Increase this if your workloads send large prompts or receive large completions.

  • connectionLimit — maximum simultaneous TCP connections accepted by the proxy.

  • http2.maxConcurrentStreams — maximum concurrent HTTP/2 streams per connection from a client.

gateway:
  clientTrafficPolicy:
    connection:
      bufferLimit: 4194304   # bytes; 4 MiB default
      connectionLimit:
        value: 1024
    http2:
      maxConcurrentStreams: 1024

Backend Traffic Policy#

backendTrafficPolicy controls how Envoy distributes load across backend pods and limits the blast radius of a single slow or failing pod:

  • circuitBreaker.maxConnections — maximum TCP connections to a single backend pod.

  • circuitBreaker.maxParallelRequests — maximum in-flight requests to a single backend pod (effective QPS cap when combined with latency).

  • circuitBreaker.maxPendingRequests — maximum requests queued waiting for a connection to a backend pod.

  • circuitBreaker.maxParallelRetries — maximum concurrent retries to a backend pod.

  • circuitBreaker.maxRequestsPerConnection — maximum requests served on a single connection before it is recycled.

  • http2.maxConcurrentStreams — maximum concurrent HTTP/2 streams per connection to a backend pod.

gateway:
  backendTrafficPolicy:
    circuitBreaker:
      maxConnections: 1024
      maxParallelRequests: 1024
      maxParallelRetries: 1024
      maxPendingRequests: 1024
      maxRequestsPerConnection: 1024
    http2:
      maxConcurrentStreams: 1024

Envoy Patch Policy#

envoyPatchPolicy provides lower-level overrides applied directly to the Envoy xDS configuration, covering route timeouts and the limits for the original-destination cluster:

  • route.timeout — maximum duration Envoy waits for a backend response. Increase this for long-running inference requests.

  • route.connectTimeout — maximum time to establish a TCP connection to a backend pod.

  • circuitBreakers.maxConnections / maxRequests / maxPendingRequests — same semantics as the backend traffic policy above but applied at the Envoy cluster level.

gateway:
  envoyPatchPolicy:
    route:
      timeout: 120s
      connectTimeout: 6s
    circuitBreakers:
      maxConnections: 1024
      maxRequests: 1024
      maxPendingRequests: 1024

Choosing a Routing Strategy#

If no routing-strategy header is provided, the gateway defaults to random routing.

# No routing-strategy header — gateway picks a random ready pod
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
}'

For production workloads we recommend selecting a strategy based on your traffic pattern:

Multi-turn conversations or workloads with prompt prefix overlap — use prefix-cache

When requests share a common prompt prefix (e.g. a system prompt, few-shot examples, or conversation history), prefix-cache routes each request to the pod that already holds the matching KV-cache blocks in GPU memory, reducing redundant computation and improving latency. See prefix cache routing details for algorithm internals and configuration options.

curl -v http://${ENDPOINT}/v1/chat/completions \
-H "routing-strategy: prefix-cache" \
-H "Content-Type: application/json" \
-d '{
    "model": "your-model-name",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7
}'

Independent requests with no prefix overlap — use least-request

When each request has a unique prompt and there is no KV-cache reuse benefit, least-request distributes load evenly by always routing to the pod with the fewest in-flight requests.

curl -v http://${ENDPOINT}/v1/chat/completions \
-H "routing-strategy: least-request" \
-H "Content-Type: application/json" \
-d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Summarize this document: ..."}],
    "temperature": 0.7
}'

High-throughput workloads requiring compute separation — use pd

Prefill-Decode (PD) disaggregation splits the two phases of LLM inference across dedicated pods: prefill pods process the prompt and decode pods generate tokens. This allows each pod type to be sized and scaled independently, improving GPU utilization at high request volumes.

curl -v http://${ENDPOINT}/v1/chat/completions \
-H "routing-strategy: pd" \
-H "Content-Type: application/json" \
-d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Explain quantum entanglement."}],
    "temperature": 0.7
}'

See prefill-decode disaggregation details for deployment requirements and configuration options.

Note

random is suitable for testing and low-traffic scenarios where routing quality is not critical. For any production deployment, explicitly set routing-strategy to avoid relying on the default.

Per-Model Routing Configuration#

When a single gateway deployment serves multiple models, you may need different routing strategies per model — for example, prefix-cache for a chat model and pd for a batch-inference model. AIBrix supports this through Model Config Profiles, which let you attach a routing configuration directly to each model’s pods via an annotation and select a profile at request time using the config-profile header.

# Select the "low-latency" profile for this request
curl -v http://${ENDPOINT}/v1/chat/completions \
-H "config-profile: low-latency" \
-H "Content-Type: application/json" \
-d '{
    "model": "your-model-name",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7
}'

If the config-profile header is omitted, the model’s defaultProfile is used. This allows the same gateway to enforce different routing strategies, prompt-length bucketing, and PD modes per model without deploying separate gateway instances.

For config profile setup and the full annotation schema see Config Profiles in the Gateway Routing guide. For production guidance including RPS limiting see Production Model Deployments.