Metric-based Autoscaling

AIBrix Autoscaler includes various metric-based autoscaling components, allowing users to conveniently select the appropriate scaler. These options include the Knative-based Kubernetes Pod Autoscaler (KPA), the native Kubernetes Horizontal Pod Autoscaler (HPA), and AIBrix’s custom Advanced Pod Autoscaler (APA) tailored for LLM-serving.

In the following sections, we will demonstrate how users can create various types of autoscalers within AIBrix.

Supported Autoscaling Mechanisms#

HPA: identical to the vanilla Kubernetes HPA. AIBrix uses the native Kubernetes autoscaler when a user deploys a specification that calls for an HPA. In the demo setup, it scales the replicas of a demo deployment based on CPU utilization.
KPA: derived from Knative. The KPA maintains two time windows: a longer stable window and a shorter panic window. In panic mode it scales up rapidly in response to sudden traffic spikes, based on the panic-window measurements. Unlike solutions that rely on Prometheus to gather deployment metrics, AIBrix fetches and maintains metrics internally, which enables faster response times. (Figure: example of a KPA scaling operation using a mocked vLLM-based Llama2-7b deployment.)
APA: similar to HPA, but with fluctuation parameters that act as a minimum buffer before scaling up or down is triggered, which prevents oscillation.
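
The two-window idea behind KPA can be illustrated with a minimal Python sketch. It is modeled loosely on Knative's documented behavior and is not AIBrix's actual implementation: the function name `kpa_desired_replicas`, its parameters, and the use of a total-load metric are all assumptions made for illustration. The default `panic_threshold` of 2.0 mirrors the `autoscaling.aibrix.ai/panic-threshold` default listed later on this page.

```python
import math


def kpa_desired_replicas(stable_avg, panic_avg, target_per_pod,
                         current_replicas, panic_threshold=2.0):
    """Return (desired_replicas, panicking) from two window averages.

    stable_avg / panic_avg: total load averaged over the long and short window.
    target_per_pod: load that a single pod is expected to carry.
    All names here are hypothetical; this is a simplified, stateless sketch.
    """
    current = max(current_replicas, 1)  # avoid dividing by zero
    desired_stable = math.ceil(stable_avg / target_per_pod)
    desired_panic = math.ceil(panic_avg / target_per_pod)

    # Panic when the short window asks for many more pods than currently exist.
    panicking = desired_panic / current >= panic_threshold
    if panicking:
        # While panicking, follow the short window and never scale down.
        return max(desired_panic, current_replicas), True
    # Otherwise follow the smoother, long-window recommendation.
    return desired_stable, False


# A sudden spike: the panic window sees 4x the load of the stable window.
spike = kpa_desired_replicas(stable_avg=100, panic_avg=400,
                             target_per_pod=50, current_replicas=2)
# Steady traffic: both windows roughly agree, so the stable window decides.
steady = kpa_desired_replicas(stable_avg=100, panic_avg=110,
                              target_per_pod=50, current_replicas=2)
```

A real autoscaler would also remember that it entered panic mode and stay there for some time; the sketch leaves that state out to keep the window comparison easy to follow.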

While HPA and KPA are widely used, they are not specifically designed or optimized for LLM serving, which has distinct optimization points. AIBrix’s custom APA will gradually introduce features such as:

Selecting appropriate LLM-specific metrics for scaling based on AI Runtime metrics standardization.
Proactive scaling algorithm rather than a reactive one. (WIP)
Profiling & SLO driven autoscaling solution. (Testing Phase)

Metrics#

AIBrix supports all vLLM metrics. For the full list, refer to https://docs.vllm.ai/en/stable/design/metrics.html
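
As a rough illustration of what a pod-level metric source has to do, the sketch below extracts one metric from a Prometheus text-format payload such as the one a vLLM pod serves on its metrics endpoint. The helper `read_metric`, the sample payload, and the rule that a `vllm:` prefix is accepted are assumptions for illustration only; they do not describe how AIBrix matches metric names internally.

```python
def read_metric(exposition_text, metric_name):
    """Return all sample values for metric_name from Prometheus text format."""
    values = []
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Sample with labels: name{label="x"} value
            name = line[:line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Sample without labels: name value
            name, _, rest = line.partition(" ")
        fields = rest.split()
        # Accept the bare name or a prefixed form such as "vllm:<name>".
        if fields and (name == metric_name or name.endswith(":" + metric_name)):
            values.append(float(fields[0]))
    return values


# Illustrative payload; the values are made up.
SAMPLE = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="deepseek-r1-distill-llama-8b"} 0.42
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="deepseek-r1-distill-llama-8b"} 3.0
"""

cache_usage = read_metric(SAMPLE, "gpu_cache_usage_perc")
```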

How to deploy an autoscaling policy#

To deploy an autoscaling policy, apply a PodAutoscaler YAML file. Note that the name of the deployment and the name in the PodAutoscaler’s scaleTargetRef must be the same. That is how the AIBrix PodAutoscaler finds the right deployment.
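
The name-matching rule above can be checked before applying the manifests. The following sketch assumes both manifests have already been loaded into Python dictionaries (for example with a YAML parser); `targets_workload` is a hypothetical helper, not part of AIBrix.

```python
def targets_workload(pod_autoscaler, workload):
    """Return True when the autoscaler's scaleTargetRef points at the workload."""
    ref = pod_autoscaler["spec"]["scaleTargetRef"]
    return (ref["kind"] == workload["kind"]
            and ref["name"] == workload["metadata"]["name"])


# Trimmed-down manifests containing only the fields the check needs.
autoscaler = {
    "kind": "PodAutoscaler",
    "metadata": {"name": "deepseek-r1-distill-llama-8b-hpa"},
    "spec": {"scaleTargetRef": {"apiVersion": "apps/v1",
                                "kind": "Deployment",
                                "name": "deepseek-r1-distill-llama-8b"}},
}
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "deepseek-r1-distill-llama-8b"},
}

matches = targets_workload(autoscaler, deployment)
```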

All the sample files can be found in the following directory.

https://github.com/vllm-project/aibrix/tree/main/samples/autoscaling

Example HPA yaml config#

apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: deepseek-r1-distill-llama-8b-hpa
  namespace: default
  labels:
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/managed-by: kustomize
spec:
  scalingStrategy: HPA
  minReplicas: 1
  maxReplicas: 10
  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: '8000'
      path: /metrics
      targetMetric: gpu_cache_usage_perc
      targetValue: '50'
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-r1-distill-llama-8b

Example KPA yaml config#

apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: deepseek-r1-distill-llama-8b-kpa
  namespace: default
  labels:
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/managed-by: kustomize
  annotations:
    kpa.autoscaling.aibrix.ai/scale-down-delay: 3m
spec:
  scalingStrategy: KPA
  minReplicas: 1
  maxReplicas: 8
  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: '8000'
      path: metrics
      targetMetric: gpu_cache_usage_perc
      targetValue: '0.5'
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-r1-distill-llama-8b

Example APA yaml config#

apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: deepseek-r1-distill-llama-8b-apa
  namespace: default
  labels:
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/managed-by: kustomize
  annotations:
    autoscaling.aibrix.ai/up-fluctuation-tolerance: '0.1'
    autoscaling.aibrix.ai/down-fluctuation-tolerance: '0.2'
    apa.autoscaling.aibrix.ai/window: 30s
spec:
  scalingStrategy: APA
  minReplicas: 1
  maxReplicas: 8
  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: '8000'
      path: metrics
      targetMetric: gpu_cache_usage_perc
      targetValue: '0.5'
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: deepseek-r1-distill-llama-8b

Supported PodAutoscaler annotations#

Metric-based autoscalers can be tuned with annotations on the PodAutoscaler object. The controller currently recognizes the generic autoscaling.aibrix.ai/ annotation keys listed below. Annotation values are strings in Kubernetes metadata, so quote numeric and duration values in YAML when needed.

Durations are parsed with Go duration syntax such as 30s or 5m. Floating-point values are parsed as decimal numbers such as 0.1 or 2.0.

Annotation	Value type	Default	Strategy use	Description
`autoscaling.aibrix.ai/max-scale-up-rate`	float	`2`	HPA, KPA, APA	Limits how quickly replicas can increase in one scaling decision. For example, `2.0` allows the recommendation to grow up to 2x the current replica count.
`autoscaling.aibrix.ai/max-scale-down-rate`	float	`2`	HPA, KPA, APA	Limits how quickly replicas can decrease in one scaling decision. For example, `2.0` prevents scaling below roughly half of the current replica count in one decision.
`autoscaling.aibrix.ai/scale-up-tolerance`	float	`0.1`	KPA, APA	Avoids scale-up for small metric fluctuations. A value of `0.1` means the metric must exceed the target by more than 10% before scaling up.
`autoscaling.aibrix.ai/scale-down-tolerance`	float	`0.1`	KPA, APA	Avoids scale-down for small metric fluctuations. A value of `0.1` means the metric must fall below the target by more than 10% before scaling down.
`autoscaling.aibrix.ai/panic-threshold`	float	`2.0`	KPA	Sets the threshold for entering KPA panic mode when short-window demand is high relative to stable-window demand.
`autoscaling.aibrix.ai/scale-up-cooldown-window`	duration	`0s`	HPA, KPA, APA	Stabilization window for scale-up recommendations.
`autoscaling.aibrix.ai/scale-down-cooldown-window`	duration	`300s`	HPA, KPA, APA	Stabilization window for scale-down recommendations. The default is 5 minutes.
`autoscaling.aibrix.ai/scale-to-zero`	bool	`false`	KPA, APA	Enables the scaling context’s scale-to-zero flag. The final replica count is still bounded by `spec.minReplicas`.

Example:

apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: example-kpa
  annotations:
    autoscaling.aibrix.ai/max-scale-up-rate: "3.0"
    autoscaling.aibrix.ai/max-scale-down-rate: "2.0"
    autoscaling.aibrix.ai/scale-up-tolerance: "0.2"
    autoscaling.aibrix.ai/scale-down-tolerance: "0.1"
    autoscaling.aibrix.ai/panic-threshold: "2.5"
    autoscaling.aibrix.ai/scale-up-cooldown-window: "30s"
    autoscaling.aibrix.ai/scale-down-cooldown-window: "5m"
    autoscaling.aibrix.ai/scale-to-zero: "false"
spec:
  scalingStrategy: KPA
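
To make the tolerance and rate-limit annotations more concrete, here is a minimal sketch of how such bounds could shape one scaling decision. It follows the descriptions in the table above, but the function `recommend_replicas`, the proportional formula, and the rounding choices are assumptions, not the controller's actual code.

```python
import math


def recommend_replicas(current_replicas, metric_value, target_value,
                       up_tolerance=0.1, down_tolerance=0.1,
                       max_up_rate=2.0, max_down_rate=2.0,
                       min_replicas=1, max_replicas=8):
    """Apply a tolerance band, then rate limits, then min/max bounds."""
    ratio = metric_value / target_value
    if ratio > 1.0 + up_tolerance or ratio < 1.0 - down_tolerance:
        # Outside the tolerance band: scale proportionally to the ratio.
        desired = math.ceil(current_replicas * ratio)
    else:
        # Inside the band: ignore the fluctuation and keep the replica count.
        desired = current_replicas

    # Limit how far one decision may move from the current replica count.
    upper = math.ceil(current_replicas * max_up_rate)
    lower = math.floor(current_replicas / max_down_rate)
    desired = min(max(desired, lower), upper)

    # Finally respect spec.minReplicas and spec.maxReplicas.
    return min(max(desired, min_replicas), max_replicas)


small_bump = recommend_replicas(4, metric_value=0.54, target_value=0.5)
big_spike = recommend_replicas(4, metric_value=2.0, target_value=0.5,
                               max_replicas=20)
sharp_drop = recommend_replicas(4, metric_value=0.05, target_value=0.5)
```

With four replicas, an 8% overshoot stays inside the default 10% band and changes nothing, a 4x spike is capped at eight replicas by the scale-up rate of 2.0, and a sharp drop is held at two replicas by the scale-down rate of 2.0.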

StormService Role-Level Autoscaling#

For StormService in pooled mode (replicas=1), different roles (e.g., prefill and decode) can be autoscaled independently. This enables fine-grained control where each role scales based on its specific metrics.

Use the subTargetSelector field to target a specific role within a StormService. Additionally, add the annotation autoscaling.aibrix.ai/storm-service-mode: "pool" to the PodAutoscaler object. This helps the AIBrix autoscaler distinguish replicas=1 scenarios.

Key features:

Each role has its own PodAutoscaler with independent metrics and scaling policies
Works with StormService in pooled mode (replicas=1)
Supports different scaling strategies (HPA, KPA, APA) per role
Allows different min/max replicas and scaling behaviors per role

Complete example:

# Example: Independent autoscaling for prefill and decode roles in StormService
#
# This demonstrates how to scale different roles independently based on their
# specific metrics in StormService pooled mode (replicas=1).

---
# PodAutoscaler for prefill role
apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: ss-pool-prefill
  namespace: default
  annotations:
    autoscaling.aibrix.ai/storm-service-mode: "pool"
spec:
  scaleTargetRef:
    apiVersion: orchestration.aibrix.ai/v1alpha1
    kind: StormService
    name: ss-pool

  # Select the prefill role within the StormService
  subTargetSelector:
    roleName: prefill

  minReplicas: 2
  maxReplicas: 20
  scalingStrategy: APA

  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: "8000"
      path: /metrics
      targetMetric: "prefill_queue_length"
      targetValue: "10"

---
# PodAutoscaler for decode role
apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: ss-pool-decode
  namespace: default
  annotations:
    autoscaling.aibrix.ai/storm-service-mode: "pool"
spec:
  scaleTargetRef:
    apiVersion: orchestration.aibrix.ai/v1alpha1
    kind: StormService
    name: ss-pool

  # Select the decode role within the StormService
  subTargetSelector:
    roleName: decode

  minReplicas: 3
  maxReplicas: 30
  scalingStrategy: APA

  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: "8000"
      path: /metrics
      targetMetric: "decode_batch_utilization"
      targetValue: "70"

  # Different scaling behavior for decode
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600  # More conservative scale-down

---
# The StormService being autoscaled (pooled mode)
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: ss-pool
  namespace: default
spec:
  updateStrategy:
    type: InPlaceUpdate
    maxSurge: 2
    maxUnavailable: 1
  replicas: 1  # Pooled mode
  stateful: true
  selector:
    matchLabels:
      app: llm-xpyd
  template:
    metadata:
      labels:
        app: llm-xpyd
    spec:
      roles:
        - name: prefill
          replicas: 5  # Will be managed by PodAutoscaler
          stateful: true
          template:
            spec:
              containers:
                - name: prefill
                  image: vllm/vllm-openai:latest
                  ports:
                    - containerPort: 8000
                      name: metrics
                  # ... other config

        - name: decode
          replicas: 5  # Will be managed by PodAutoscaler
          stateful: true
          template:
            spec:
              containers:
                - name: decode
                  image: vllm/vllm-openai:latest
                  ports:
                    - containerPort: 8000
                      name: metrics
                  # ... other config

When to use:

Pooled mode: StormService with replicas=1 where roles need independent scaling
Different workload patterns: Prefill and decode have different resource needs and traffic patterns
Independent metrics: Each role has its own metrics (e.g., queue length, batch utilization)

Multi-Metric Based Autoscaling#

AIBrix supports multi-metric autoscaling, allowing users to define multiple scaling metrics within a single PodAutoscaler resource. This is especially useful for LLM-serving workloads where a single metric (e.g., GPU cache usage) may not fully capture system pressure—combining it with queue-based metrics (e.g., number of waiting requests) enables more robust and responsive scaling decisions.

How It Works#

When multiple metrics are specified under spec.metricsSources, the autoscaler evaluates all metrics independently.
The final desired replica count is determined by the metric that demands the highest number of replicas (i.e., the “max” strategy).
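
The "max" strategy described above can be sketched as follows. The per-metric formula used here is the proportional rule known from the Kubernetes HPA (`ceil(current * value / target)`); treating it as the per-metric rule in AIBrix is an assumption, and the function names and observed values are illustrative.

```python
import math


def desired_for_metric(current_replicas, metric_value, target_value):
    """Proportional recommendation for a single metric."""
    return math.ceil(current_replicas * metric_value / target_value)


def combine_metrics(current_replicas, observations, min_replicas, max_replicas):
    """observations maps metric name -> (current value, target value)."""
    per_metric = {
        name: desired_for_metric(current_replicas, value, target)
        for name, (value, target) in observations.items()
    }
    # The metric that demands the most replicas wins.
    desired = max(per_metric.values())
    # The result is still bounded by minReplicas and maxReplicas.
    return min(max(desired, min_replicas), max_replicas), per_metric


final, per_metric = combine_metrics(
    current_replicas=2,
    observations={
        "gpu_cache_usage_perc": (0.4, 0.5),   # below target
        "num_requests_waiting": (250, 100),   # well above target
    },
    min_replicas=1,
    max_replicas=3,
)
```

In this made-up situation the cache metric alone would keep two replicas, while the queue metric asks for five, so the queue metric drives the decision and the result is capped at `max_replicas`.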

Configuration Example#

The following PodAutoscaler uses two metrics simultaneously with the APA strategy:

apiVersion: autoscaling.aibrix.ai/v1alpha1
kind: PodAutoscaler
metadata:
  name: deepseek-r1-mock-llama2-7b-multi-metrics
  namespace: default
  labels:
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/managed-by: kustomize
  annotations:
    autoscaling.aibrix.ai/up-fluctuation-tolerance: '0.1'
    autoscaling.aibrix.ai/down-fluctuation-tolerance: '0.2'
    apa.autoscaling.aibrix.ai/window: 30s
spec:
  scalingStrategy: APA
  minReplicas: 1
  maxReplicas: 3
  metricsSources:
    - metricSourceType: pod
      protocolType: http
      port: '8000'
      path: metrics
      targetMetric: gpu_cache_usage_perc
      targetValue: '0.5'
    - metricSourceType: pod
      protocolType: http
      port: '8000'
      path: metrics
      targetMetric: num_requests_waiting
      targetValue: '100'
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mock-llama2-7b

Check autoscaling logs#

Pod Autoscaler Logs#

The pod autoscaler is part of the AIBrix controller manager, which collects the metrics from each pod. You can check its logs as follows.

kubectl logs <aibrix-controller-manager-podname> -n aibrix-system -f

In the expected log output, you can see that the current metric is gpu_cache_usage_perc, and you can check each pod’s current metric value.

Custom Resource Status#

To describe the PodAutoscaler custom resource, you can run

kubectl describe podautoscaler <podautoscaler-name>

In the output, you can explore the scaling conditions and events for more details.

Preliminary experiments with different autoscalers#

Here we present preliminary experiment results that show how different autoscaling mechanisms and autoscaler configurations affect performance (latency) and cost (compute cost). In AIBrix, users can deploy a different autoscaler simply by applying a Kubernetes YAML file.

Set up
- Model: DeepSeek 7B chatbot model
- GPU type: V100
- Max number of GPU: 8
Target metric and value
- Target metric: gpu_kv_cache_utilization
- Target value: 50%
Workload
- The overall RPS starts low and rises relatively fast until T=500, to evaluate how different autoscalers and configurations react to a rapid load increase. It then drops quickly to a low RPS to evaluate scale-down behavior, and finally rises again slowly.
  
  Average RPS trend: 1 RPS -> 4 RPS -> 8 RPS -> 10 RPS -> 2 RPS -> 6 RPS
- The RPS is shown in the second subfigure.
Performance
- HPA has the highest latency because of its slow reaction. KPA is the most reactive thanks to panic mode. APA was run with a small delay window to save cost. It does save cost, but ends up with higher latency than KPA when it scales down too aggressively from T=700 to T=1000.
Cost
- The fourth figure shows the relative accumulated compute cost over time. The accumulated cost is calculated by multiplying the time by the unit cost (1 in this example). The actual compute cost can be calculated by multiplying by the actual cost per unit time.
- HPA is the most expensive due to its longer delay window for scaling down.
- APA is the most responsive and saves the most cost. It fluctuates more than the other two autoscalers.
- Note that the scale-down window is not an inherent feature of each autoscaling mechanism; it is a configurable variable. We use the default value for HPA (300s).
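
The accumulated-cost curve can be approximated with a few lines of Python. This sketch assumes that cost accrues per running replica (one GPU each) per unit of time; the source only states that time is multiplied by a unit cost of 1, so the per-replica factor, the function name, and the sample data below are illustrative assumptions rather than the experiment's actual procedure or results.

```python
def accumulated_cost(samples, unit_cost=1.0):
    """Return [(timestamp, cumulative cost)] for a replica-count timeline.

    samples: list of (timestamp_seconds, replica_count), sorted by time.
    Each replica count is assumed to hold until the next timestamp.
    """
    total = 0.0
    curve = [(samples[0][0], 0.0)]
    for (t0, replicas), (t1, _) in zip(samples, samples[1:]):
        # Cost for this interval: replicas running * duration * unit cost.
        total += replicas * (t1 - t0) * unit_cost
        curve.append((t1, total))
    return curve


# Made-up timeline: 1 replica, then 4 during a spike, then 2.
timeline = [(0, 1), (100, 4), (300, 2), (400, 2)]
curve = accumulated_cost(timeline)
```

Passing a real price as `unit_cost` (for example, cost per GPU-second) turns the relative curve into an estimate of actual compute cost.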
Conclusion
- No single autoscaler outperforms the others on all metrics (latency, cost), and the results may depend on the workload. Infrastructure should make it easy to choose and configure whichever autoscaling mechanism users want, since different users have different preferences. For example, one might prefer cost over performance or vice versa.