Heterogeneous GPU Inference

Heterogeneous GPU Inference lets users deploy the same model across different GPU types. It addresses two challenges in Large Language Model (LLM) inference: (1) as demand for large-scale inference grows, securing a consistent supply of a single GPU type has become difficult, particularly in regions where identical GPUs are often unavailable due to capacity constraints; (2) users may want to mix in lower-cost, lower-performance GPUs to reduce overall expenses.

Design Overview

The Heterogeneous GPU Inference feature has three main components: (1) LLM Request Monitoring, (2) the Heterogeneous GPU Optimizer, and (3) Request Routing. The following figure shows the overall architecture. First, the LLM Request Monitoring component tracks past inference requests and their request patterns. Second, the Heterogeneous GPU Optimizer component selects the optimal GPU types and the number of GPUs of each type. Third, the Request Routing component routes each request to the optimal GPU.
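To make the optimizer's role concrete, here is a minimal sketch of choosing a replica count per GPU type. It is not the AIBrix algorithm: the `GpuProfile` class, the `cheapest_mix` function, and all throughput and cost numbers are hypothetical, and the real optimizer works from request patterns and benchmark profiles rather than a single requests-per-second figure.

```python
from dataclasses import dataclass
from itertools import product
from math import ceil


@dataclass(frozen=True)
class GpuProfile:
    """Hypothetical per-GPU profile (not an AIBrix data structure)."""
    name: str
    rps_per_replica: float   # requests/s one replica sustains within the SLO
    cost_per_replica: float  # relative cost of running one replica


def cheapest_mix(profiles, demand_rps):
    """Brute-force the lowest-cost replica counts whose capacity covers demand."""
    if demand_rps <= 0:
        return {p.name: 0 for p in profiles}
    # Upper bound per GPU type: enough replicas to serve all demand alone.
    ranges = [range(ceil(demand_rps / p.rps_per_replica) + 1) for p in profiles]
    best, best_cost = None, float("inf")
    for counts in product(*ranges):
        capacity = sum(c * p.rps_per_replica for c, p in zip(counts, profiles))
        cost = sum(c * p.cost_per_replica for c, p in zip(counts, profiles))
        if capacity >= demand_rps and cost < best_cost:
            best, best_cost = counts, cost
    return {p.name: c for p, c in zip(profiles, best)}


# Illustrative numbers only; real values come from benchmarking each GPU type.
profiles = [GpuProfile("v100", 4.0, 1.0), GpuProfile("l20", 10.0, 2.2)]
plan = cheapest_mix(profiles, demand_rps=22.0)
print(plan)
```

With these made-up numbers the cheapest plan mixes both GPU types instead of using only the faster one, which is the kind of trade-off the optimizer is meant to find.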

Example

Step 1: Deploy the heterogeneous deployments.

Deploy one Deployment and a corresponding PodAutoscaler for each GPU type. See sample heterogeneous configuration for an example composed of two GPU types. The following command deploys the sample using L20 and V100 GPUs.

kubectl apply -f samples/heterogeneous

After deployment, you will see an inference service backed by two pods running on simulated L20 and V100 GPUs:

kubectl get svc
NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
deepseek-coder-7b   NodePort    10.102.95.136   <none>        8000:30081/TCP   2s
kubernetes          ClusterIP   10.96.0.1       <none>        443/TCP          54d

Incoming requests are routed through the gateway to the optimal pod based on request patterns. The two pods, one per GPU type, are listed below:

kubectl get pods
NAME                                       READY   STATUS    RESTARTS   AGE
deepseek-coder-7b-v100-96667667c-6gjql     2/2     Running   0          33s
deepseek-coder-7b-l20-96667667c-7zj7k      2/2     Running   0          33s

Step 2: Install the aibrix Python module:

pip3 install aibrix

The GPU Optimizer runs continuously in the background, dynamically adjusting GPU allocation for each model based on workload patterns. Note that the GPU Optimizer requires offline inference benchmark data for each GPU type on each specific LLM.

If you use a local heterogeneous deployment, you can find prepared benchmark data under python/aibrix/aibrix/gpu_optimizer/optimizer/profiling/result/ and skip Step 3. See Development for details on setting up a local heterogeneous deployment.

Step 3: Benchmark the model. For each GPU type, run aibrix_benchmark. See benchmark.sh for more options.

kubectl port-forward [pod_name] 8010:8000 1>/dev/null 2>&1 &
# Wait for the port-forward to take effect.
aibrix_benchmark -m deepseek-coder-7b -o [path_to_benchmark_output]
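The "wait" in the comment above can be automated instead of guessed. The following is a minimal sketch, assuming the port-forward from the example listens on localhost:8010; `wait_for_port` is a hypothetical helper, not part of aibrix, and it only checks that something accepts TCP connections on the port.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, interval_s=0.5):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means the local listener is up.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False


if __name__ == "__main__":
    # 8010 is the local port used by the port-forward command above.
    ready = wait_for_port("localhost", 8010)
    print("port-forward ready" if ready else "timed out waiting for port-forward")
```

A successful connection shows that the local listener exists. It does not prove that the model server behind it is healthy, so a failed benchmark run may still need a retry.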

Step 4: Decide on an SLO and generate a profile for each GPU type. Run aibrix_gen_profile -h for help.

kubectl -n aibrix-system port-forward svc/aibrix-redis-master 6379:6379 1>/dev/null 2>&1 &
# Wait for the port-forward to take effect.
aibrix_gen_profile deepseek-coder-7b-v100 --cost [cost1] [SLO-metric] [SLO-value] -o "redis://localhost:6379/?model=deepseek-coder-7b"
aibrix_gen_profile deepseek-coder-7b-l20 --cost [cost2] [SLO-metric] [SLO-value] -o "redis://localhost:6379/?model=deepseek-coder-7b"
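Note that both commands above write to the same `model=deepseek-coder-7b` target while naming a different deployment. As a sketch of that pattern, the snippet below builds one command per GPU deployment using only the arguments shown above. `gen_profile_cmd` is a hypothetical helper, and the bracketed values stay as placeholders to be filled in by the user.

```python
import shlex


def gen_profile_cmd(deployment, cost, slo_args, model,
                    redis="redis://localhost:6379"):
    """Build the aibrix_gen_profile argument list for one GPU deployment."""
    return [
        "aibrix_gen_profile", deployment,
        "--cost", str(cost),
        *slo_args,  # SLO metric and value, passed through unchanged
        "-o", f"{redis}/?model={model}",  # same model key for every GPU type
    ]


if __name__ == "__main__":
    # Placeholders are kept verbatim; replace them before running anything.
    deployments = {
        "deepseek-coder-7b-v100": "[cost1]",
        "deepseek-coder-7b-l20": "[cost2]",
    }
    for name, cost in deployments.items():
        cmd = gen_profile_cmd(name, cost, ["[SLO-metric]", "[SLO-value]"],
                              model="deepseek-coder-7b")
        print(shlex.join(cmd))
```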

Now the GPU Optimizer is ready to work. You should observe the number of workload pods changing in response to the requests sent to the gateway. Once the GPU Optimizer finishes the scaling optimization, its output is passed to the PodAutoscaler as a metricSource via a designated HTTP endpoint, and the PodAutoscaler makes the final scaling decision.

A simple example of a PodAutoscaler spec for the V100 GPU is as follows:

apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  labels:
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: aibrix
    kpa.autoscaling.aibrix.ai/scale-down-delay: 0s
  name: podautoscaler-deepseek-coder-7b-v100
  namespace: default
spec:
  maxReplicas: 10
  metricsSources:
  - endpoint: aibrix-gpu-optimizer.aibrix-system.svc.cluster.local:8080
    metricSourceType: domain
    path: /metrics/default/deepseek-coder-7b-v100
    protocolType: http
    targetMetric: vllm:deployment_replicas
    targetValue: "100"  # For stable workloads. Set to a fraction to tolerate bursts.
  minReplicas: 0
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-coder-7b-v100
  scalingStrategy: KPA
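The metrics source in this spec resolves to a single HTTP URL. The sketch below assembles it from the spec fields; `metric_url` and `optimizer_path` are hypothetical helpers, and the `/metrics/<namespace>/<deployment>` layout is inferred from this one example rather than from a documented contract.

```python
def optimizer_path(namespace, deployment):
    """Path layout inferred from the example spec (an assumption)."""
    return f"/metrics/{namespace}/{deployment}"


def metric_url(source):
    """Join protocolType, endpoint, and path from one metricsSources entry."""
    return f"{source['protocolType']}://{source['endpoint']}{source['path']}"


# Values copied from the PodAutoscaler spec above.
source = {
    "endpoint": "aibrix-gpu-optimizer.aibrix-system.svc.cluster.local:8080",
    "path": optimizer_path("default", "deepseek-coder-7b-v100"),
    "protocolType": "http",
}
print(metric_url(source))
```

When writing the PodAutoscaler for another GPU type, only the deployment name in `path`, `metadata.name`, and `scaleTargetRef.name` should need to change.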

Miscellaneous

A new label, model.aibrix.ai/min_replicas, specifies the minimum number of replicas to maintain when there is no workload. We recommend setting it to 1 in at least one Deployment spec so that one READY pod is always available. For example, while the GPU optimizer might recommend 0 replicas for a V100 GPU during periods of no activity, setting model.aibrix.ai/min_replicas: "1" will maintain one V100 replica. This label only takes effect when there is no workload; it is ignored when there are active requests.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-coder-7b-v100
  labels:
    model.aibrix.ai/name: "deepseek-coder-7b"
    model.aibrix.ai/min_replicas: "1" # Minimum replicas the GPU optimizer keeps when there is no workload.
# ... rest of the Deployment YAML

Important: The minReplicas field in the PodAutoscaler spec must be set to 0 to allow proper scaling behavior. Setting it to any value greater than 0 will interfere with the GPU optimizer’s scaling decisions. For instance, if the GPU optimizer determines an optimal configuration of {v100: 0, l20: 4} but the v100 PodAutoscaler has minReplicas: 1, the system won’t be able to scale the v100 down to 0 as recommended.
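The interaction between the label and the PodAutoscaler bounds can be modelled in a few lines. This is a sketch of the behavior described above, not the AIBrix implementation; `effective_replicas` and its parameter names are hypothetical.

```python
def effective_replicas(recommended, has_workload,
                       label_min=0, pa_min=0, pa_max=10):
    """Model the final replica count for one GPU deployment.

    recommended: replica count suggested by the GPU optimizer
    label_min:   value of the model.aibrix.ai/min_replicas label
    pa_min/max:  minReplicas / maxReplicas of the PodAutoscaler
    """
    target = recommended
    if not has_workload:
        # The label floor only applies when there is no workload.
        target = max(target, label_min)
    # PodAutoscaler bounds always apply, so minReplicas > 0 overrides
    # an optimizer recommendation of 0.
    return min(max(target, pa_min), pa_max)


# Idle cluster: the label keeps one V100 replica alive.
print(effective_replicas(0, has_workload=False, label_min=1))
# Active workload, optimizer recommends {v100: 0, l20: 4}.
print(effective_replicas(0, has_workload=True, label_min=1))
# Same recommendation, but the PodAutoscaler has minReplicas: 1.
print(effective_replicas(0, has_workload=True, label_min=1, pa_min=1))
```

The third call shows the problem the note above warns about: the V100 deployment stays at one replica even though the optimizer recommended zero.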
