AI Engine Runtime#
AI Engine Runtime is a unified management layer for inference containers. It is a versatile sidecar that provides metric standardization, model downloading, and local model management. The runtime hides implementation details of the underlying inference engine, providing a uniform way to drive model download and management and to expose inference monitoring metrics.
Currently, this component mainly targets LoRA model deployment and multi-engine support, so for most cases you do not need to install the runtime. As we expand support to more inference engines and add features such as cold start management, the runtime will become more useful.
Introduction#
AI Engine Runtime serves as an essential bridge between the AIBrix Control Plane and inference engine pods, enabling model management, engine configuration, observability, and vendor-agnostic engine support.
API for Control Plane Integration: It ensures seamless communication between the control plane and inference pods. This allows components like the LoRA adapter controller, autoscaler, and cold start manager to interact dynamically with inference containers, managing resources in a cloud-native way.
Abstracting Vendor-Specific Inference Engines: AI Engine Runtime is designed to work with diverse inference engines, including the most popular ones, even though each engine exposes a different API. Instead of tightly coupling with any specific engine, the runtime abstracts key operations such as model loading/unloading, adapter configuration, and performance monitoring, allowing new inference backends to be integrated with minimal friction.
Observability: It provides a unified interface for monitoring across different inference engines, allowing for consistent performance tracking and troubleshooting.
Attention
This runtime is different from an Istio sidecar. Data plane traffic does not go through this runtime; it only provides management capabilities for control plane interaction.

Installation#
By default, users do not need to install the runtime separately: the base model deployment guidance already has the runtime enabled. You can enable the runtime by adding the following to your deployment YAML. In the future, we will provide a more convenient way to enable the runtime, such as a mutating webhook.
- name: aibrix-runtime
  image: aibrix/runtime:v0.1.0
  command:
    - aibrix_runtime
    - --port
    - "8080"
  ports:
    - containerPort: 8080
      protocol: TCP
  volumeMounts:
    - mountPath: /models
      name: model-hostpath
volumes:
  - name: model-hostpath
    hostPath:
      path: /root/models
      type: DirectoryOrCreate
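Once the pod is running, you can quickly check that the sidecar started correctly by reading its container logs (replace <pod-name> with the actual pod name of your deployment):
kubectl logs <pod-name> -c aibrix-runtime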
If you would like to use the runtime for other cases, you can install it with the following command.
python3 -m pip install aibrix
Attention
If you want to use the nightly version, you can install it from source.
cd $AIBRIX_HOME/python/aibrix && python3 -m pip install -e .
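Either way, you can verify the installation by checking that the aibrix_runtime entrypoint is available; this is the same command the sidecar container runs (--help is assumed to print the CLI usage):
aibrix_runtime --help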
Metric Standardization#
Different inference engines expose different metrics, and the AI Engine Runtime standardizes them.
Define the inference engine information in the container environment variables. For example, if vLLM provides its metrics service on http://localhost:8000/metrics, launch the AI Runtime server with the following command:
INFERENCE_ENGINE=vllm INFERENCE_ENGINE_ENDPOINT="http://localhost:8000" aibrix_runtime --port 8080
The runtime will then expose unified inference metrics on http://localhost:8080/metrics. Right now you will notice that the metrics are exactly the same as the vLLM metrics; this is subject to change once more engines are supported.
Unified metrics standardize what different inference engines report, which in turn helps components such as the autoscaler.
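For example, with the runtime launched as above you can fetch the standardized metrics directly; the sample output below is from a vLLM-backed deployment:
curl http://localhost:8080/metrics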
# TYPE vllm:cache_config_info gauge
vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",is_attention_free="False",num_cpu_blocks="9362",num_gpu_blocks="81767",num_gpu_blocks_override="None",sliding_window="None",swap_space_bytes="4294967296"} 1.0
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 0.0
# HELP vllm:num_requests_swapped Number of requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 0.0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 0.0
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 0.0
# HELP vllm:cpu_prefix_cache_hit_rate CPU prefix cache block hit rate.
# TYPE vllm:cpu_prefix_cache_hit_rate gauge
vllm:cpu_prefix_cache_hit_rate{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} -1.0
# HELP vllm:gpu_prefix_cache_hit_rate GPU prefix cache block hit rate.
# TYPE vllm:gpu_prefix_cache_hit_rate gauge
vllm:gpu_prefix_cache_hit_rate{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} -1.0
# HELP vllm:lora_requests_info Running stats on lora requests.
# TYPE vllm:lora_requests_info gauge
vllm:lora_requests_info{max_lora="0",running_lora_adapters="",waiting_lora_adapters=""} 1.7382173358407154e+09
# HELP vllm:num_preemptions_total Cumulative number of preemption from the engine.
# TYPE vllm:num_preemptions_total counter
vllm:num_preemptions_total{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 0.0
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 148.0
# HELP vllm:generation_tokens_total Number of generation tokens processed.
# TYPE vllm:generation_tokens_total counter
vllm:generation_tokens_total{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 955.0
# HELP vllm:request_success_total Count of successfully processed requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="stop",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 4.0
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine_step.
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 1103.0
vllm:iteration_tokens_total_bucket{le="1.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 994.0
vllm:iteration_tokens_total_bucket{le="2.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 994.0
vllm:iteration_tokens_total_bucket{le="4.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 994.0
vllm:iteration_tokens_total_bucket{le="8.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 994.0
vllm:iteration_tokens_total_bucket{le="16.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 994.0
vllm:iteration_tokens_total_bucket{le="24.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 994.0
vllm:iteration_tokens_total_bucket{le="32.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 994.0
vllm:iteration_tokens_total_bucket{le="40.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 998.0
vllm:iteration_tokens_total_bucket{le="48.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 998.0
vllm:iteration_tokens_total_bucket{le="56.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 998.0
vllm:iteration_tokens_total_bucket{le="64.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 998.0
vllm:iteration_tokens_total_bucket{le="72.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 998.0
vllm:iteration_tokens_total_bucket{le="80.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 998.0
vllm:iteration_tokens_total_bucket{le="88.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 998.0
vllm:iteration_tokens_total_bucket{le="96.0",model_name="Qwen/Qwen2.5-Coder-1.5B-Instruct"} 998.0
Attention
Since vLLM is the only supported engine for now, we did not use vendor-agnostic metric names. This is subject to change in the future once more engines are supported.
Model Downloading#
The AI Engine Runtime supports downloading models from multiple remote sources, including HuggingFace, S3, and TOS. This is extremely useful when the control plane needs to interact with the pod to dynamically load new models.
Download From HuggingFace#
First, define the necessary environment variables for the HuggingFace model.
# General settings
export DOWNLOADER_ALLOW_FILE_SUFFIX="json, safetensors"
export DOWNLOADER_NUM_THREADS=16
# HuggingFace settings
export HF_ENDPOINT=https://hf-mirror.com # set this when running in the CN region
Then use AI Engine Runtime to download the model from HuggingFace:
python -m aibrix.downloader \
    --model-uri deepseek-ai/deepseek-coder-6.7b-instruct \
    --local-dir /tmp/aibrix/models_hf/
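After the command finishes, you can inspect the target directory to confirm that only files matching DOWNLOADER_ALLOW_FILE_SUFFIX (json, safetensors) were fetched; the exact subdirectory layout depends on the downloader:
ls -R /tmp/aibrix/models_hf/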
Download From S3#
First, define the necessary environment variables for the S3 model.
# General settings
export DOWNLOADER_ALLOW_FILE_SUFFIX="json, safetensors"
export DOWNLOADER_NUM_THREADS=16
# AWS settings
export AWS_ACCESS_KEY_ID=<INPUT YOUR AWS ACCESS KEY ID>
export AWS_SECRET_ACCESS_KEY=<INPUT YOUR AWS SECRET ACCESS KEY>
export AWS_ENDPOINT_URL=<INPUT YOUR AWS ENDPOINT URL> # e.g. https://s3.us-west-2.amazonaws.com
export AWS_REGION=<INPUT YOUR AWS REGION> # e.g. us-west-2
Then use AI Runtime to download the model from AWS S3:
python -m aibrix.downloader \
    --model-uri s3://aibrix-model-artifacts/deepseek-coder-6.7b-instruct/ \
    --local-dir /tmp/aibrix/models_s3/
Download From TOS#
First, define the necessary environment variables for the TOS model.
# General settings
export DOWNLOADER_ALLOW_FILE_SUFFIX="json, safetensors"
export DOWNLOADER_NUM_THREADS=16
# TOS settings
export TOS_ACCESS_KEY=<INPUT YOUR TOS ACCESS KEY>
export TOS_SECRET_KEY=<INPUT YOUR TOS SECRET KEY>
export TOS_ENDPOINT=<INPUT YOUR TOS ENDPOINT> # e.g. https://tos-s3-cn-beijing.volces.com
export TOS_REGION=<INPUT YOUR TOS REGION> # e.g. cn-beijing
Then use AI Runtime to download the model from TOS:
python -m aibrix.downloader \
    --model-uri tos://aibrix-model-artifacts/deepseek-coder-6.7b-instruct/ \
    --local-dir /tmp/aibrix/models_tos/
Model Configuration API#
Attention
This requires the engine to be started with --enable-lora and the environment variable VLLM_ALLOW_RUNTIME_LORA_UPDATING=true. You can check Dynamically serving LoRA Adapters for more details.
Let’s assume you already have a base model and the runtime deployed, and you want to load a LoRA adapter onto it.
# start the engine
VLLM_ALLOW_RUNTIME_LORA_UPDATING=true vllm serve Qwen/Qwen2.5-Coder-1.5B-Instruct --enable-lora
# start the runtime
INFERENCE_ENGINE=vllm INFERENCE_ENGINE_ENDPOINT="http://localhost:8000" aibrix_runtime --port 8080
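Load a new LoRA adapter (lora-2) through the runtime’s model configuration API: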
curl -X POST http://localhost:8080/v1/lora_adapter/load \
    -H "Content-Type: application/json" \
    -d '{"lora_name": "lora-2", "lora_path": "bharati2324/Qwen2.5-1.5B-Instruct-Code-LoRA-r16v2"}'
curl -X POST http://localhost:8080/v1/lora_adapter/unload \
    -H "Content-Type: application/json" \
    -d '{"lora_name": "lora-1"}'
curl -X GET http://localhost:8000/v1/models | jq
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
      "object": "model",
      "created": 1738218097,
      "owned_by": "vllm",
      "root": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
      "parent": null,
      "max_model_len": 32768,
      "permission": [
        {
          "id": "modelperm-c2e9860095b745b6b8be7133c5ab1fcf",
          "object": "model_permission",
          "created": 1738218097,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    },
    {
      "id": "lora-1",
      "object": "model",
      "created": 1738218097,
      "owned_by": "vllm",
      "root": "bharati2324/Qwen2.5-1.5B-Instruct-Code-LoRA-r16v2",
      "parent": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
      "max_model_len": null,
      "permission": [
        {
          "id": "modelperm-c21d06b59af0435292c70cd612e68b01",
          "object": "model_permission",
          "created": 1738218097,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    },
    {
      "id": "lora-2",
      "object": "model",
      "created": 1738218097,
      "owned_by": "vllm",
      "root": "bharati2324/Qwen2.5-1.5B-Instruct-Code-LoRA-r16v2",
      "parent": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
      "max_model_len": null,
      "permission": [
        {
          "id": "modelperm-bf2af850171242f7a9f4ccd9ecd313cd",
          "object": "model_permission",
          "created": 1738218097,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
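Once an adapter appears in the model list, you can target it by name in a regular inference request to the engine. Below is a minimal sketch using vLLM’s OpenAI-compatible completions endpoint; the prompt and max_tokens values are only placeholders:
curl -X POST http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "lora-2", "prompt": "def quicksort(arr):", "max_tokens": 64}'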