Batch Model Deployment Templates#
AIBrix Batch carries model deployment intent through
extra_body.aibrix on the OpenAI-compatible Batch API. The current
runtime contract is:
aibrix.model_template describes the engine, model source, accelerator, parallelism, and supported endpoints.
aibrix.runtime selects the MDS runtime backend, such as Kubernetes, KubernetesJob, LambdaCloud, RunPod, or External.
aibrix.resource_allocation records resource-manager output, such as the provision id returned by Console’s Resource Manager.
Why Templates Exist#
OpenAI’s Batch API treats the model as a black-box string, for example
model: "gpt-4-turbo". Self-hosted deployments need more control:
inference engine and image, such as vLLM or SGLang
GPU type and GPU count
model artifact location
tensor, pipeline, data, and expert parallelism
engine flags such as max model length or prefix caching
supported OpenAI endpoints for the selected model
Templates keep those controls in platform-owned metadata while preserving the OpenAI SDK request shape.
Current Data Flow#
Console user
|
| CreateJob(model_template_name, model_template_version, model_id)
v
Console backend
|
| resolves ModelDeploymentTemplate from Console store
| marshals template.spec as snake_case JSON
| planner reads accelerator + engine serving fields
v
Resource Manager / Planner
|
| POST /v1/batches with extra_body.aibrix:
| job_id
| model
| runtime
| resource_allocation
| model_template { name, version, spec }
v
Metadata Service
|
| validates runtime target and persists AIBrix metadata
| selected runtime consumes the inline template spec when needed
v
Runtime backend
|
+-- Kubernetes / KubernetesJob: render Kubernetes manifests
+-- LambdaCloud / RunPod: SSH-launch vLLM on provisioned compute
+-- External: dispatch to an already known endpoint
Direct MDS callers can also submit extra_body.aibrix themselves. If the
MDS deployment has no template registry configured, direct callers must
include aibrix.model_template.spec inline.
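As a minimal sketch of the payload shape in the flow above, the helper below assembles the block a direct caller would pass as extra_body={"aibrix": ...}. The function name build_aibrix_block and its parameters (including registry_configured) are hypothetical and exist only for this illustration; the key names come from the data flow shown above.

```python
from typing import Any, Optional


def build_aibrix_block(
    model: str,
    runtime_target: str,
    template_name: str,
    template_version: str,
    template_spec: Optional[dict[str, Any]] = None,
    runtime_options: Optional[dict[str, Any]] = None,
    provision_id: Optional[str] = None,
    job_id: Optional[str] = None,
    registry_configured: bool = False,  # hypothetical flag: does MDS have a template registry?
) -> dict[str, Any]:
    """Assemble the value for extra_body["aibrix"] (illustrative helper, not an AIBrix API)."""
    # Without a template registry, direct callers must send the spec inline.
    if template_spec is None and not registry_configured:
        raise ValueError("model_template.spec must be inline when no template registry is configured")

    model_template: dict[str, Any] = {"name": template_name, "version": template_version}
    if template_spec is not None:
        model_template["spec"] = template_spec

    block: dict[str, Any] = {
        "model": model,
        "runtime": {"target": runtime_target, "options": dict(runtime_options or {})},
        "model_template": model_template,
    }
    # job_id and resource_allocation are normally filled in by Console / Resource Manager.
    if job_id is not None:
        block["job_id"] = job_id
    if provision_id is not None:
        block["resource_allocation"] = {"provision_id": provision_id}
    return block


example_block = build_aibrix_block(
    model="meta-llama/Llama-3.1-8B-Instruct",
    runtime_target="External",
    template_name="llama3-8b-a10",
    template_version="v1",
    template_spec={"engine": {"type": "vllm"}},  # abbreviated spec for illustration only
)
```

The result can be passed as extra_body={"aibrix": example_block} to client.batches.create. A real spec needs all required fields, not the abbreviated one used here.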
Quick Start: Console Path#
The Console path is the supported path for Resource Manager backed batch jobs, including RunPod and Lambda Cloud.
Register or create a model deployment template in the Console store. The template must include at least engine, model_source, accelerator, and supported_endpoints.
Submit a Console job with the selected template:
{
  "name": "daily-eval",
  "input_dataset": "file-abc",
  "endpoint": "/v1/chat/completions",
  "completion_window": "24h",
  "model_id": "model-123",
  "model_template_name": "llama3-8b-a10",
  "model_template_version": "v1"
}
Console resolves the template, sends aibrix.model_template.spec to MDS, and the planner uses the same spec to provision resources.
For cloud providers, the Console backend must also be configured with a
single Resource Manager provider via PROVISIONER. See
Batch Resource Manager for Neoclouds.
Direct MDS Request Shape#
Direct callers use the OpenAI SDK’s extra_body escape hatch. This
example uses an inline template spec so it does not depend on a ConfigMap
registry:
from openai import OpenAI

client = OpenAI(base_url="http://aibrix-metadata.example/v1", api_key="...")

batch = client.batches.create(
    input_file_id="file-abc",
    endpoint="/v1/chat/completions",
    completion_window="24h",
    metadata={"team": "ml-platform"},
    # AIBrix deployment intent travels in the SDK's extra_body escape hatch.
    extra_body={
        "aibrix": {
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "runtime": {
                "target": "External",
                "options": {},
            },
            "model_template": {
                "name": "llama3-8b-a10",
                "version": "v1",
                # Inline spec, so no template registry is needed.
                "spec": {
                    "engine": {
                        "type": "vllm",
                        "version": "0.6.3",
                        "image": "vllm/vllm-openai:latest",
                        "serve_args": ["--gpu-memory-utilization", "0.90"],
                    },
                    "model_source": {
                        "type": "huggingface",
                        "uri": "meta-llama/Llama-3.1-8B-Instruct",
                    },
                    "accelerator": {
                        "type": "A10",
                        "count": 1,
                    },
                    "parallelism": {
                        "tp": 1,
                        "pp": 1,
                        "dp": 1,
                        "ep": 1,
                    },
                    "supported_endpoints": ["/v1/chat/completions"],
                    "deployment_mode": "dedicated",
                },
            },
        },
    },
)
The response exposes the persisted AIBrix block in batch.aibrix. There
is no current _aibrix.resolved_endpoint response field.
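A minimal sketch of reading that block defensively follows. It assumes the OpenAI Python SDK surfaces unknown response fields as attributes on the returned object, which is how the SDK documents access to undocumented properties; the helper name get_aibrix_block is hypothetical, and a SimpleNamespace stands in for a real response so the snippet runs without a server.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_aibrix_block(batch: Any) -> Optional[dict]:
    """Return the persisted AIBrix block from a batch response, or None if absent."""
    if isinstance(batch, dict):
        block = batch.get("aibrix")
    else:
        # Assumption: extra response fields are exposed as attributes by the SDK model.
        block = getattr(batch, "aibrix", None)
    return block if isinstance(block, dict) else None


# Stand-in for the object returned by client.batches.create(...).
fake_batch = SimpleNamespace(id="batch_123", aibrix={"runtime": {"target": "External"}})

block = get_aibrix_block(fake_batch)
target = block["runtime"]["target"] if block else None
```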
ModelDeploymentTemplate Schema#
The source of truth is aibrix.batch.template.schema.
ModelDeploymentTemplateSpec is strict: unknown fields are rejected. In
particular, current templates do not have a provider_config field.
Provider selection belongs to Resource Manager / runtime selection, not to
the template schema.
Required top-level fields when a complete template object is stored:
name: logical template name.
version: version string. If omitted in an inline spec path, consumers use the ref version or schema default.
status: active, deprecated, or draft for registry-backed templates.
spec: the deployment body.
Required fields inside spec:
| Field | Required? | Current meaning |
|---|---|---|
| engine | yes | Engine type, version, image, raw |
| model_source | yes | Model artifact source. Kubernetes renderers can use this for model download and vLLM |
| accelerator | yes | Free-form GPU type and GPU count. Console Resource Manager reads |
| supported_endpoints | yes | OpenAI endpoints this deployment can serve, such as |
Optional/defaulted fields inside spec:
| Field | Default | Current meaning |
|---|---|---|
| parallelism | all degrees | |
| | empty | Typed and extra engine flags. Kubernetes vLLM rendering converts these into CLI flags. Console cloud runtime construction does not pass this field today; use |
| | empty | Kubernetes vLLM rendering maps weight and KV-cache quantization into CLI flags. Cloud SSH runtimes honor only flags that appear in |
| | | Optional service/discovery identifier for Kubernetes manifest labels and discovery paths. |
| | | |
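A minimal pre-flight check, sketched under the assumption that a caller wants to catch obvious mistakes before MDS validates the spec. It checks only what this page states: the four required fields and the fact that provider_config is not a template field. The helper name check_spec is hypothetical, and the real schema in aibrix.batch.template.schema remains authoritative because it rejects every unknown field, not just this one.

```python
from typing import Any

# Required spec fields, as listed in the Console quick start above.
REQUIRED_SPEC_FIELDS = ("engine", "model_source", "accelerator", "supported_endpoints")


def check_spec(spec: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a template spec (illustrative, not exhaustive)."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_SPEC_FIELDS
        if name not in spec
    ]
    # Provider selection belongs to Resource Manager / runtime selection.
    if "provider_config" in spec:
        problems.append("provider_config is not a template field")
    return problems


good_spec = {
    "engine": {"type": "vllm"},
    "model_source": {"type": "huggingface", "uri": "meta-llama/Llama-3.1-8B-Instruct"},
    "accelerator": {"type": "A10", "count": 1},
    "supported_endpoints": ["/v1/chat/completions"],
}
bad_spec = {"engine": {"type": "vllm"}, "provider_config": {"region": "us-east"}}

good_problems = check_spec(good_spec)
bad_problems = check_spec(bad_spec)
```

An empty list means only that these specific checks passed; it does not guarantee the full schema will accept the spec.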
Runtime Consumption Matrix#
| Field | Kubernetes renderers | Console cloud path |
|---|---|---|
| engine.image | Engine container image. | Passed to MDS runtime options. LambdaCloud uses it as the Docker image. RunPod provisioning uses |
| engine.serve_args | Appended last to engine CLI args, so admin raw flags can override generated args. | Passed as runtime |
| | Converted to vLLM CLI flags by the engine adapter. | Not consumed by Console’s cloud runtime construction today. |
| model_source | Used by downloader/model args and auth secret rendering. | Used by Console to resolve |
| accelerator | Used for resource requests and validation. | Used by Resource Manager scheduling. RunPod requires provider-accepted GPU type strings; LambdaCloud normalizes common GPU family names. |
| | Used by Kubernetes vLLM argument rendering. | Not read directly; express required cloud flags in |
extra_body.aibrix Contract#
The MDS API accepts this AIBrix extension block:
{
  "aibrix": {
    "job_id": "job_123",
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "runtime": {
      "target": "RunPod",
      "options": {
        "host": "1.2.3.4",
        "ssh_port": 22,
        "ssh_user": "root",
        "http_base_url": "https://pod-8000.proxy.runpod.net",
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "vllm_args": ["--gpu-memory-utilization", "0.90"]
      }
    },
    "resource_allocation": {
      "provision_id": "runpod-abc"
    },
    "model_template": {
      "name": "llama3-8b-a10",
      "version": "v1",
      "spec": {}
    },
    "client": {
      "max_concurrency": 256,
      "adaptive_concurrency": true,
      "adaptive_max_factor": 16,
      "retry_policy": {
        "max_retries": 5,
        "base_delay_seconds": 2,
        "max_delay_seconds": 10,
        "no_endpoint_max_retries": 5
      }
    }
  }
}
client controls per-job smart-client behavior. max_concurrency is an
absolute, job-global cap on in-flight requests with a hard limit of 256;
requests that set a higher value are rejected. If adaptive concurrency is
enabled, the cap bounds adaptive growth; if it is disabled, the cap is the
fixed concurrency. Omitted fields fall back first to metadata-service
environment defaults and then to built-in defaults. This public block
intentionally does not expose the telemetry interval, QPS, request timeout, or
fine-grained adaptive controller internals.
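The fallback order and the hard limit described above can be sketched as follows. The 256 limit and the request, environment, built-in order come from the text; the environment variable name AIBRIX_BATCH_CLIENT_MAX_CONCURRENCY and the built-in default of 64 are hypothetical placeholders, not documented AIBrix settings. The sketch applies the limit only to the request value and does not check lower bounds.

```python
import os
from typing import Any, Mapping, Optional

MAX_CONCURRENCY_HARD_LIMIT = 256  # hard limit stated in the contract above

# Hypothetical names and values, used only for this illustration.
ENV_MAX_CONCURRENCY = "AIBRIX_BATCH_CLIENT_MAX_CONCURRENCY"
BUILTIN_MAX_CONCURRENCY = 64


def resolve_max_concurrency(
    client: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Resolve max_concurrency: request value, then environment default, then built-in."""
    env = os.environ if env is None else env

    requested = client.get("max_concurrency")
    if requested is not None:
        value = int(requested)
        # Values above the hard limit are rejected rather than clamped.
        if value > MAX_CONCURRENCY_HARD_LIMIT:
            raise ValueError(
                f"max_concurrency {value} exceeds hard limit {MAX_CONCURRENCY_HARD_LIMIT}"
            )
        return value

    if env.get(ENV_MAX_CONCURRENCY):
        return int(env[ENV_MAX_CONCURRENCY])

    return BUILTIN_MAX_CONCURRENCY


from_request = resolve_max_concurrency({"max_concurrency": 128}, env={})
from_env = resolve_max_concurrency({}, env={ENV_MAX_CONCURRENCY: "32"})
from_builtin = resolve_max_concurrency({}, env={})
```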
The known upstream runtime targets are:
Kubernetes
KubernetesJob
LambdaCloud
RunPod
External
runtime.target is validated against the live runtime registry, so
downstream runtimes can register additional target strings.
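Registry-based validation of this kind can be sketched as below. The RuntimeRegistry class and its methods are hypothetical and only illustrate the idea of checking runtime.target against whatever is registered at the time of the request; they are not the MDS implementation. The target "MyCloud" is an invented downstream example.

```python
from typing import Iterable


class RuntimeRegistry:
    """Toy registry of runtime target names (illustrative only)."""

    def __init__(self, targets: Iterable[str] = ()) -> None:
        self._targets = set(targets)

    def register(self, target: str) -> None:
        """Make an additional target string acceptable."""
        self._targets.add(target)

    def validate(self, target: str) -> str:
        """Return the target if it is registered, otherwise raise ValueError."""
        if target not in self._targets:
            known = ", ".join(sorted(self._targets))
            raise ValueError(f"unknown runtime target {target!r}; registered: {known}")
        return target


# The known upstream targets listed above.
registry = RuntimeRegistry(["Kubernetes", "KubernetesJob", "LambdaCloud", "RunPod", "External"])

accepted = registry.validate("RunPod")

# A downstream runtime is rejected until it registers itself.
try:
    registry.validate("MyCloud")
    rejected_before_register = False
except ValueError:
    rejected_before_register = True

registry.register("MyCloud")
accepted_after_register = registry.validate("MyCloud")
```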
Operational Notes#
For Resource Manager backed jobs, submit through Console. Direct MDS /v1/batches requests do not provision RunPod or Lambda Cloud resources.
Console currently selects one Resource Manager provider per backend process via PROVISIONER. Per-job provider preference is not implemented.
Cloud SSH runtimes require aibrix.model to be directly loadable by vLLM on the remote host. Template auth_secret_ref is a Kubernetes manifest feature and is not automatically applied to LambdaCloud/RunPod.
Use AIBRIX_MDS_HTTP_BODY_LOG=1 on MDS when debugging the exact extra_body.aibrix payload received by the metadata service.
See Also#
Batch API - OpenAI Batch API surface and request/response details
Batch Resource Manager for Neoclouds - using RunPod and Lambda Cloud with the Console batch planner
aibrix.batch.template.schema - Pydantic source of truth