Batch Resource Manager for Neoclouds#
The AIBrix batch Resource Manager lets the Console planner lease cloud GPU capacity before submitting a batch to the metadata service. This is useful when you want each batch job to run on provider-managed GPU machines instead of on the Kubernetes cluster where AIBrix is running.
The currently supported cloud providers are:
runpod - creates RunPod pods, injects an SSH public key, and runs vLLM inside the pod.
lambdaCloud - creates Lambda Cloud GPU instances, SSHes to the VM, and runs the vLLM serving container with Docker.
Important
Automatic RunPod and Lambda Cloud provisioning is available through the
Console Job API, /api/v1/jobs. Direct calls to the metadata service
/v1/batches do not call Resource Manager by themselves. Direct
metadata-service users can still pass a runtime manually in
extra_body.aibrix.runtime, but then they own provisioning and cleanup.
How It Works#
User / Console UI
    |
    | POST /api/v1/jobs
    v
Console JobService
    |
    | enqueue job
    v
Planner
    |
    | Provision(ResourceProvision)
    v
Resource Manager
    |
    | provider API
    +--------------------+
    |                    |
    v                    v
RunPod pod          Lambda Cloud instance
    |                    |
    | status=running     | status=active
    +---------+----------+
              |
              | runtime target + SSH options
              v
Metadata Service /v1/batches
              |
              | SSH-launch runtime starts vLLM and processes JSONL
              v
Batch output
The planner lifecycle is:
Console receives a job through POST /api/v1/jobs.
Planner builds a resource request from the selected ModelDeploymentTemplate. Today the Console path requests one replica with accelerator.count GPUs.
Resource Manager provisions the cloud resource and stores a ProvisionResult.
Planner polls Resource Manager until the provision is running.
Planner submits the batch to the metadata service with aibrix.runtime set to RunPod or LambdaCloud.
Metadata Service SSHes into the provisioned machine, starts vLLM, waits for /health, and dispatches the batch requests.
When the batch reaches a terminal state, or when the job is cancelled, Planner calls Resource Manager Release. RunPod pods are deleted and Lambda Cloud instances are terminated.
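The lifecycle above can be sketched as a small control loop. This is an illustration only, not the Console planner's code: the `ResourceManager` protocol, its method names, and the `submit_batch` callable are hypothetical stand-ins for the Provision, poll, submit, and Release steps.

```python
import time
from typing import Callable, Protocol


class ResourceManager(Protocol):
    """Hypothetical interface; the real Resource Manager API may differ."""

    def provision(self, request: dict) -> str: ...
    def status(self, provision_id: str) -> str: ...
    def release(self, provision_id: str) -> None: ...


def run_job(
    rm: ResourceManager,
    submit_batch: Callable[[str], str],
    request: dict,
    max_polls: int = 360,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Provision, wait until running, submit the batch, and always release."""
    provision_id = rm.provision(request)
    try:
        for _ in range(max_polls):
            if rm.status(provision_id) == "running":
                break
            sleep(interval)
        else:
            raise TimeoutError(f"provision {provision_id} never reached running")
        # submit_batch is assumed to block until the batch is terminal.
        return submit_batch(provision_id)
    finally:
        # Release runs on success, failure, and timeout alike.
        rm.release(provision_id)
```

Putting the release call in `finally` mirrors the intent that cloud resources are returned whenever the job ends, whatever the outcome.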
Prerequisites#
Before enabling a cloud provider, make sure you have:
A running AIBrix metadata service with batch file storage configured. See Batch API.
A running AIBrix Console backend. The Resource Manager path is owned by the Console planner.
A private SSH key mounted into the metadata service and exposed through AIBRIX_BATCH_SSH_KEY_FILE.
The matching public key configured for the selected provider: RUNPOD_SSH_PUBLIC_KEY for RunPod, or an SSH key name in the Lambda Cloud account for Lambda Cloud.
A ModelDeploymentTemplate for the model and GPU shape users can select when creating a job.
Generate a dedicated key pair for batch resources:
ssh-keygen -t ed25519 -f ~/.ssh/aibrix-batch-rm -C aibrix-batch-rm
chmod 600 ~/.ssh/aibrix-batch-rm
Mount the private key into the metadata service. For Kubernetes deployments, create a secret and add a volume/env patch to the metadata service deployment:
kubectl -n aibrix-system create secret generic aibrix-batch-ssh-key \
--from-file=id_ed25519=$HOME/.ssh/aibrix-batch-rm
spec:
  template:
    spec:
      containers:
        - name: metadata-service
          env:
            - name: AIBRIX_BATCH_SSH_KEY_FILE
              value: /etc/aibrix/batch-ssh/id_ed25519
          volumeMounts:
            - name: batch-ssh-key
              mountPath: /etc/aibrix/batch-ssh
              readOnly: true
      volumes:
        - name: batch-ssh-key
          secret:
            secretName: aibrix-batch-ssh-key
            defaultMode: 0400
Configure the Console Backend#
The Console backend selects exactly one Resource Manager provider at
startup. Use PROVISIONER=kubernetes for the default in-cluster path,
PROVISIONER=runpod for RunPod, or PROVISIONER=lambdaCloud for
Lambda Cloud.
Note
Provider identifiers are case-sensitive and intentionally differ by
context: PROVISIONER takes the lowercase/camelCase values
runpod and lambdaCloud, while the aibrix.runtime.target field
in Batch Model Deployment Templates uses the PascalCase values
RunPod and LambdaCloud.
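Because the two spellings are easy to mix up, a small lookup can make the mapping explicit. The sketch below only encodes the pairs named in the note; the function name is hypothetical, and `kubernetes` is left out because this page does not state its runtime target value.

```python
# PROVISIONER value -> aibrix.runtime.target value, as listed in the note.
RUNTIME_TARGET_BY_PROVISIONER = {
    "runpod": "RunPod",
    "lambdaCloud": "LambdaCloud",
}


def runtime_target(provisioner: str) -> str:
    """Return the template runtime target for a cloud PROVISIONER value.

    The lookup is deliberately case-sensitive, matching the note above.
    """
    try:
        return RUNTIME_TARGET_BY_PROVISIONER[provisioner]
    except KeyError:
        known = ", ".join(sorted(RUNTIME_TARGET_BY_PROVISIONER))
        raise ValueError(
            f"unknown cloud provisioner {provisioner!r}; expected one of: {known}"
        ) from None
```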
Common environment variables:
| Variable | Required? | Description |
|---|---|---|
| | yes | |
| | yes | Metadata service URL used by Console to submit |
| | recommended | Durable Console store URI. File-backed SQLite is enough for a single backend; production should use a shared database. |
| | optional | Template name used when callers omit |
| | optional | Defaults to |
| | optional | Number of planner workers. Defaults to |
RunPod#
Required RunPod configuration:
export PROVISIONER=runpod
export RUNPOD_API_KEY=rp_...
export RUNPOD_SSH_PUBLIC_KEY="$(cat ~/.ssh/aibrix-batch-rm.pub)"
Optional RunPod configuration:
| Variable | Description |
|---|---|
| | Override the RunPod API root. Defaults to |
| | Comma-separated RunPod data center IDs. If unset, RunPod can pick any available data center. |
| | Pod image. Defaults to |
| | |
RunPod does not expose a full region, GPU type, or pricing catalog through
the REST API used here. AIBrix therefore passes
ModelDeploymentTemplate.spec.accelerator.type directly as the RunPod
gpuTypeIds value. Use the exact GPU type string accepted by RunPod,
for example NVIDIA H100 80GB HBM3.
RunPod networking:
Resource Manager creates one pod per requested replica.
The pod starts sshd and injects RUNPOD_SSH_PUBLIC_KEY.
Planner passes root, the public IP, SSH port, and https://<pod-id>-8000.proxy.runpod.net to the metadata service.
The RunPod runtime launches vllm serve over SSH and dispatches requests through the RunPod HTTP proxy.
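As a rough illustration of the proxy addressing described above, the helper below builds the proxy base URL and the health-check URL for a pod. The function names are hypothetical; the URL pattern and the `/health` path are the ones quoted on this page.

```python
VLLM_PORT = 8000  # port that appears in the proxy URL pattern above


def runpod_proxy_url(pod_id: str) -> str:
    """Build the RunPod HTTP proxy base URL for the vLLM port of a pod."""
    if not pod_id:
        raise ValueError("pod_id must be non-empty")
    return f"https://{pod_id}-{VLLM_PORT}.proxy.runpod.net"


def runpod_health_url(pod_id: str) -> str:
    """URL the runtime would poll before dispatching batch requests."""
    return runpod_proxy_url(pod_id) + "/health"
```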
Lambda Cloud#
First add the public key to your Lambda Cloud account and note its key name. Then configure the Console backend:
export PROVISIONER=lambdaCloud
export LAMBDA_CLOUD_API_KEY=...
export LAMBDA_CLOUD_SSH_KEYS=aibrix-batch-rm
Optional Lambda Cloud configuration:
| Variable | Description |
|---|---|
| | Override the Lambda Cloud API root. Defaults to |
| | Hard default region preference, such as |
| | Comma-separated SSH key names already registered in Lambda Cloud. At least one key is required. |
Lambda Cloud capacity selection:
accelerator.count becomes the required GPU count per instance.
accelerator.type is matched against Lambda Cloud instance type names and GPU descriptions. Names such as H100, NVIDIA H100, and H100-SXM5 normalize to the same GPU family.
If LAMBDA_CLOUD_REGION is set, AIBrix only accepts capacity in that region. Otherwise it chooses the cheapest matching instance type among regions with capacity.
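The selection rules above can be sketched as follows. This is one plausible reading, not the AIBrix implementation: the `Offer` shape, the tokenizing rule in `gpu_family`, the exact-match GPU count, and the tie-break on instance name are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Offer:
    """Simplified, hypothetical view of one Lambda Cloud instance type."""

    instance_type: str
    gpu_description: str
    gpu_count: int
    price_cents_per_hour: int
    regions_with_capacity: List[str]


def _tokens(text: str) -> List[str]:
    # Split on anything that is not a letter or digit, ignoring case.
    return [t for t in re.split(r"[^A-Z0-9]+", text.upper()) if t]


def gpu_family(name: str) -> str:
    """Reduce 'NVIDIA H100' or 'H100-SXM5' to 'H100' (assumed rule)."""
    tokens = [t for t in _tokens(name) if t != "NVIDIA"]
    return tokens[0] if tokens else ""


def pick_offer(
    offers: List[Offer],
    accel_type: str,
    accel_count: int,
    region: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Return (instance_type, region) for the cheapest match, or None."""
    family = gpu_family(accel_type)
    best = None
    for offer in offers:
        if offer.gpu_count != accel_count:
            continue
        names = set(_tokens(offer.instance_type)) | set(_tokens(offer.gpu_description))
        if family not in names:
            continue
        # A configured region is a hard filter; otherwise any region works.
        regions = [r for r in offer.regions_with_capacity if region is None or r == region]
        if not regions:
            continue
        key = (offer.price_cents_per_hour, offer.instance_type)
        if best is None or key < best[0]:
            best = (key, offer.instance_type, regions[0])
    return None if best is None else (best[1], best[2])
```

When nothing matches, the function returns `None`, which corresponds to the NoCapacity case described under Troubleshooting.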
Lambda Cloud networking:
Resource Manager launches a GPU VM and waits until it is active.
Planner passes ubuntu, the public IP, SSH port 22, and the model template serving image to the metadata service.
The Lambda Cloud runtime SSHes to the VM, verifies Docker, and runs the vLLM container with sudo docker run --gpus all.
vLLM binds to 127.0.0.1:8000 on the VM. The metadata service opens an SSH local port-forward and dispatches through the tunnel, so the vLLM endpoint is not exposed publicly.
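To reproduce the tunnel by hand when debugging, an equivalent OpenSSH command can be assembled like this. The helper name and the local port 18000 are arbitrary choices for the example, and the metadata service may build its tunnel differently; only standard `ssh` options are used.

```python
from typing import List


def port_forward_command(
    host: str,
    key_file: str,
    local_port: int = 18000,
    user: str = "ubuntu",
    ssh_port: int = 22,
    remote_port: int = 8000,
) -> List[str]:
    """Build an ssh argv that forwards local_port to vLLM on the VM loopback."""
    return [
        "ssh",
        "-i", key_file,                      # private key for the batch key pair
        "-p", str(ssh_port),
        "-N",                                # no remote command, tunnel only
        "-o", "ExitOnForwardFailure=yes",    # fail fast if the forward cannot bind
        "-L", f"{local_port}:127.0.0.1:{remote_port}",
        f"{user}@{host}",
    ]
```

Passing the result to `subprocess.Popen` keeps the tunnel open in the background; requests sent to `http://127.0.0.1:18000/health` then reach vLLM on the VM.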
Create a Model Deployment Template#
The cloud Resource Manager path uses the Console
ModelDeploymentTemplate selected by the batch job. The template remains
provider-agnostic; the active provider is selected by PROVISIONER.
For a new model, create a Console model first:
export CONSOLE=http://localhost:8080
MODEL_ID=$(curl -sS -X POST "$CONSOLE/api/v1/models" \
-H "Content-Type: application/json" \
-d '{
"name": "Qwen2.5 7B Instruct",
"categories": ["LLM"],
"serving_name": "Qwen/Qwen2.5-7B-Instruct"
}' | jq -r .id)
Then create a deployment template:
curl -sS -X POST "$CONSOLE/api/v1/models/${MODEL_ID}/deployment-templates" \
-H "Content-Type: application/json" \
-d '{
"name": "qwen2-5-7b-cloud",
"version": "v1.0.0",
"status": "active",
"spec": {
"engine": {
"type": "vllm",
"version": "latest",
"image": "vllm/vllm-openai:latest",
"invocation": "http_server",
"serve_args": ["--trust-remote-code"],
"health_endpoint": "/health",
"ready_timeout_seconds": 1800,
"metrics_endpoint": "/metrics"
},
"model_source": {
"type": "huggingface",
"uri": "Qwen/Qwen2.5-7B-Instruct"
},
"accelerator": {
"type": "H100",
"count": 1,
"vram_gb": 80,
"interconnect": "pcie"
},
"parallelism": {
"tp": 1,
"pp": 1,
"dp": 1,
"ep": 1
},
"engine_args": {
"gpu_memory_utilization": "0.90"
},
"quantization": {
"weight": "fp16",
"kv_cache": "auto"
},
"supported_endpoints": ["/v1/chat/completions"],
"deployment_mode": "dedicated"
}
}'
Provider-specific notes:
For Lambda Cloud, accelerator.type should be a GPU family or Lambda instance-type token, such as A10, A100, or H100.
For RunPod, set accelerator.type to the exact RunPod GPU type ID accepted by the pod API, such as NVIDIA H100 80GB HBM3.
For Lambda Cloud, engine.image is the Docker image run on the VM.
For RunPod, RUNPOD_IMAGE controls the pod image. The runtime still uses template fields such as model_source.uri and engine.serve_args when launching vllm serve.
Submit a Batch Job#
Prepare a JSONL input file:
{"custom_id":"request-1","method":"POST","url":"/v1/chat/completions","body":{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Explain AIBrix in one paragraph."}],"max_tokens":128}}
{"custom_id":"request-2","method":"POST","url":"/v1/chat/completions","body":{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"What is batch inference?"}],"max_tokens":128}}
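For larger inputs it is usually easier to generate the file than to write it by hand. The sketch below produces lines in the same shape as the two examples above; the function name and the `request-N` numbering scheme are illustrative choices.

```python
import json
from typing import Iterable, List


def build_batch_lines(
    model: str,
    prompts: Iterable[str],
    endpoint: str = "/v1/chat/completions",
    max_tokens: int = 128,
) -> List[str]:
    """Return one compact JSON string per prompt, ready to join into JSONL."""
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        record = {
            "custom_id": f"request-{index}",  # must be unique within the file
            "method": "POST",
            "url": endpoint,
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(record, separators=(",", ":")))
    return lines
```

Writing `"\n".join(lines) + "\n"` to `batch_input.jsonl` yields a file that can be uploaded in the next step.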
Upload it through the Console file proxy:
FILE_ID=$(curl -sS -X POST "$CONSOLE/api/v1/files/upload" \
-F "purpose=batch" \
-F "file=@batch_input.jsonl" | jq -r .id)
Create a job:
JOB_ID=$(curl -sS -X POST "$CONSOLE/api/v1/jobs" \
-H "Content-Type: application/json" \
-d "{
\"input_dataset\": \"${FILE_ID}\",
\"endpoint\": \"/v1/chat/completions\",
\"completion_window\": \"24h\",
\"name\": \"qwen-cloud-batch\",
\"model_id\": \"${MODEL_ID}\",
\"model_template_name\": \"qwen2-5-7b-cloud\",
\"model_template_version\": \"v1.0.0\"
}" | jq -r .id)
Watch the job and provision state:
watch -n 5 \
"curl -sS $CONSOLE/api/v1/jobs/${JOB_ID} | jq '{id,status,batch_id,provision_id,provision,events}'"
Expected high-level state flow:
queued -> resource_preparing -> submitting -> batch_created -> in_progress -> finalizing -> completed
If resource provisioning fails, the job moves to resource_failed and
error_message contains the provider error. If metadata-service
submission fails after resources are ready, the job moves to
submit_failed and Planner attempts to release the provision.
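A script that waits on a job needs to know when to stop polling. The sketch below treats only the states named on this page as terminal; the Console may report other terminal states (for example after cancellation) that are not listed here, so the set should be treated as an assumption. `fetch_status` is a hypothetical callable that would wrap `GET /api/v1/jobs/{id}` and return the `status` field.

```python
import time
from typing import Callable

HAPPY_PATH = [
    "queued",
    "resource_preparing",
    "submitting",
    "batch_created",
    "in_progress",
    "finalizing",
    "completed",
]
FAILURE_STATES = {"resource_failed", "submit_failed"}


def is_terminal(status: str) -> bool:
    """True for the end states documented above."""
    return status == "completed" or status in FAILURE_STATES


def wait_for_terminal(
    fetch_status: Callable[[], str],
    interval: float = 5.0,
    max_polls: int = 720,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal state is seen and return it."""
    for _ in range(max_polls):
        status = fetch_status()
        if is_terminal(status):
            return status
        sleep(interval)
    raise TimeoutError(f"job not terminal after {max_polls} polls")
```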
Cancel a running job:
curl -sS -X POST "$CONSOLE/api/v1/jobs/${JOB_ID}/cancel" \
-H "Content-Type: application/json" \
-d '{}'
Planner forwards cancellation to the metadata service if a batch has been submitted, then releases the cloud provision.
Read Results#
When the job reaches completed, download the output file through the
Console file proxy:
OUTPUT_FILE_ID=$(curl -sS "$CONSOLE/api/v1/jobs/${JOB_ID}" | jq -r .output_dataset)
curl -sS "$CONSOLE/api/v1/files/${OUTPUT_FILE_ID}/content" \
-o batch_output.jsonl
jq . batch_output.jsonl
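To match results back to requests, the output can be indexed by `custom_id`. This sketch assumes each output line is a JSON object that carries the `custom_id` from the input, as in the OpenAI-compatible batch format; other fields are left untouched because their exact shape is not described on this page.

```python
import json
from typing import Dict, Iterable, List


def index_by_custom_id(lines: Iterable[str]) -> Dict[str, dict]:
    """Parse JSONL lines and key each record by its custom_id."""
    results: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        results[record["custom_id"]] = record
    return results


def missing_ids(input_ids: Iterable[str], results: Dict[str, dict]) -> List[str]:
    """Return the request IDs that have no line in the output."""
    return [cid for cid in input_ids if cid not in results]
```

With the file from the previous step, `index_by_custom_id(open("batch_output.jsonl", encoding="utf-8"))` gives a dictionary keyed by `request-1`, `request-2`, and so on.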
Troubleshooting#
resource manager init: missing credential#
The selected provider was enabled without its required environment variables. Check:
RunPod: RUNPOD_API_KEY and RUNPOD_SSH_PUBLIC_KEY.
Lambda Cloud: LAMBDA_CLOUD_API_KEY and LAMBDA_CLOUD_SSH_KEYS.
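A quick pre-flight check can catch this before the backend starts. The sketch below only encodes the variables listed above; the function name is hypothetical and the check treats empty or whitespace-only values as missing, which is an assumption about how the backend behaves.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_ENV = {
    "runpod": ("RUNPOD_API_KEY", "RUNPOD_SSH_PUBLIC_KEY"),
    "lambdaCloud": ("LAMBDA_CLOUD_API_KEY", "LAMBDA_CLOUD_SSH_KEYS"),
}


def missing_credentials(
    provisioner: str,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the required variables that are unset or blank."""
    if env is None:
        env = os.environ
    # Providers without an entry (such as kubernetes) need no cloud credentials here.
    required = REQUIRED_ENV.get(provisioner, ())
    return [name for name in required if not env.get(name, "").strip()]
```

Calling `missing_credentials(os.environ.get("PROVISIONER", "kubernetes"))` in a startup script returns an empty list when everything needed is present.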
resource_failed with NoGpuType on RunPod#
RunPod requires at least one GPU type. Set
ModelDeploymentTemplate.spec.accelerator.type to an exact RunPod GPU type
ID accepted by the pod API.
resource_failed with NoCapacity on Lambda Cloud#
Lambda Cloud had no currently available instance type matching
accelerator.type, accelerator.count, and LAMBDA_CLOUD_REGION.
Try a different GPU family, lower GPU count, or unset/change the region.
Job is stuck in resource_preparing#
The provider accepted the launch but AIBrix has not observed a ready public
IP/SSH endpoint yet. Check the provider console, then inspect Console logs
for provider API errors. For RunPod, the provision becomes ready only when
the pod is RUNNING and has a public IP. For Lambda Cloud, it becomes
ready when the instance is active and has a public IP.
Job reaches submit_failed or vLLM never becomes healthy#
The resource was created, but the metadata service could not launch or reach vLLM. Check:
AIBRIX_BATCH_SSH_KEY_FILE points to the private key matching the provider-side public key.
The private key file is readable only by the metadata-service process.
Lambda Cloud instances allow SSH for the registered key name.
RunPod pod image can install and run openssh-server and has vllm on PATH.
The model can be downloaded from model_source.uri. If it requires authentication, configure the model source secret before submitting jobs.
Lambda Cloud images can be started with Docker and the NVIDIA runtime.
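The first two checks can be automated from inside the metadata-service container. The sketch below is a rough diagnostic, assuming a POSIX filesystem; the function name is hypothetical, and the "no group or other access" rule reflects what OpenSSH normally expects of private keys rather than anything AIBrix-specific.

```python
import os
import stat
from typing import List


def key_file_problems(path: str) -> List[str]:
    """Return human-readable problems with a private key file, if any."""
    if not os.path.isfile(path):
        return [f"{path} does not exist or is not a regular file"]
    problems = []
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & 0o077:
        # Any group/other permission bit is set.
        problems.append(f"{path} has mode {mode:04o}; expected no group/other access")
    if not os.access(path, os.R_OK):
        problems.append(f"{path} is not readable by the current process")
    return problems
```

Running `key_file_problems(os.environ["AIBRIX_BATCH_SSH_KEY_FILE"])` inside the container should return an empty list for a correctly mounted key.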
Provider resources remain after a failed job#
Planner releases resources on terminal states and cancellation, but cleanup
is best-effort. If Console or the provider API fails during cleanup, use the
provision.raw_json field from GET /api/v1/jobs/{id} to find the
RunPod pod ID or Lambda Cloud instance IDs, then delete them in the provider
console.
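When cleaning up by hand, the provider payload can be pulled out of the job response with a few lines. This sketch assumes `provision.raw_json` is either a JSON-encoded string or an already-decoded object; it does not assume any field names inside the payload, since those are provider-specific and not documented here.

```python
import json
from typing import Any


def provision_details(job: dict) -> Any:
    """Return the decoded provider payload from a job response, or None."""
    provision = job.get("provision") or {}
    raw = provision.get("raw_json")
    if raw is None:
        return None
    if isinstance(raw, str):
        # Stored as text: decode it so the IDs can be inspected.
        return json.loads(raw)
    return raw
```

Printing the result with `json.dumps(details, indent=2)` makes it easier to spot the pod or instance identifiers to delete in the provider console.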
Current Limitations#
Cloud Resource Manager mode is selected once per Console backend with PROVISIONER. Per-job provider selection is not exposed yet.
The Console planner path requests one replica per batch job today. ResourceGroupSpec supports richer groups internally, but the Console job path currently derives only one group from the selected template.
RunPod catalog, pricing, and region discovery return empty results because the RunPod REST API used by AIBrix does not expose those catalogs.
Lambda Cloud catalog data is fetched live from the Lambda Cloud API, but there is no public Console catalog endpoint for it yet.
The cloud runtime path is designed for vLLM-compatible OpenAI HTTP serving over SSH. Other engines need matching runtime support before they can be used with RunPod or Lambda Cloud.
See Also#
Batch API - OpenAI-compatible batch API and file workflow.
Batch Model Deployment Templates - model deployment templates.
apps/console/api/resource_manager - Resource Manager provider implementations.
python/aibrix/aibrix/batch/job_driver/runtime - metadata-service runtime implementations for Kubernetes, RunPod, and Lambda Cloud.