Multi-Node Inference#
Distributed inference refers to the technique of splitting and processing LLM model across multiple nodes or devices. This approach is particularly useful for large models that cannot fit into the memory of a single machine. This solution relies on KubeRay to orchestrate the Ray Clusters.
Key API Design#
In the landscape of distributed computing, the need for efficient orchestration of multi-node inference tasks has become paramount. Kubernetes has established itself as a leading platform for managing containerized applications, offering robust resource management and scalability. On the other hand, Ray has emerged as a powerful framework for building and running distributed applications, particularly well-suited for handling complex machine learning workflows. However, the existing approaches to orchestration often fall short in terms of flexibility and simplicity. Kubernetes operators, while powerful, can become overly complex when dealing with fine-grained orchestration of distributed applications. Ray, although excellent for internal task scheduling and resource management, lacks the broader resource orchestration capabilities provided by Kubernetes.
To address these challenges, we propose a new orchestration approach that synergizes the strengths of both Kubernetes and Ray.
This approach leverages Ray for internal fine-grained application orchestration, allowing users to utilize Ray’s APIs for distributed computation Simultaneously,
Kubernetes will handle the overall application resource orchestration, focusing on coarse-grained resource allocation and environment configuration.
This division of responsibilities simplifies the design of Kubernetes operators and enhances the overall flexibility and efficiency of the orchestration process.
We introduce two key APIs for RayCluster Management, it’s RayClusterReplicaSet and RayClusterFleet.
It’s similar like Kubernetes core concept ReplicaSet and Deployment. Most of the time, you only need to use RayClusterFleet.
Ray Framework Focus: In this model, Ray is emphasized solely for its role in intra-application orchestration. Each application instance corresponds to a single Ray Cluster, and multiple service instances of an application equate to multiple Ray Clusters. This ensures that Ray handles the distributed nature of the application internally without interference from external orchestration systems.
Kubernetes Layer: Kubernetes operates at the outer layer, responsible for initiating Ray Clusters and managing standard Kubernetes functionalities such as autoscaling and rolling updates. The Kubernetes layer doesn’t orchestrate the roles inside the application anymore. These features are well-established within the Kubernetes ecosystem, ensuring robust and reliable resource management, scaling, and update processes. By leveraging Kubernetes for these operations, we can achieve a seamless integration of Ray’s distributed computing capabilities with Kubernetes’ mature operational management.
Service Encapsulation and Mapping: At a higher level, services are encapsulated in a manner analogous to Kubernetes Deployments and ReplicaSets. The key difference lies in the mapping: instead of Pods, we now have Ray Clusters representing application instances. Traditionally, a single Pod would constitute an application instance; however, in this distributed model, a Ray Cluster serves this purpose, encapsulating the complexity of distributed execution within itself.
Attention
We already submit our ideas to KubeRay community. Hopefully, we can merge into the repo pretty soon.
Workloads Examples#
Attention
Starting from v0.6.6, we’ve added essential packages to run distributed inference with vLLM official container image distribution out of the box. If you use earlier versions, you can follow guidance below to build your own image compatible with multi-node inference.
This is the RayClusterFleet example, you can apply this yaml in your cluster.
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: RayClusterFleet
metadata:
name: qwen-coder-7b-instruct
labels:
app.kubernetes.io/name: aibrix
app.kubernetes.io/managed-by: kustomize
spec:
replicas: 1
selector:
matchLabels:
model.aibrix.ai/name: qwen-coder-7b-instruct
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
template:
metadata:
labels:
model.aibrix.ai/name: qwen-coder-7b-instruct
annotations:
ray.io/overwrite-container-cmd: "true"
spec:
rayVersion: "2.10.0"
headGroupSpec:
rayStartParams:
dashboard-host: "0.0.0.0"
template:
metadata:
labels:
model.aibrix.ai/name: qwen-coder-7b-instruct
spec:
containers:
- name: ray-head
image: vllm/vllm-openai:v0.7.1
command: ["/bin/bash", "-c"]
args:
- >
ulimit -n 65536 &&
apt update && apt install -y wget net-tools && pip3 install ray[default] pyarrow pandas &&
echo "[INFO] Starting Ray head node..." &&
eval "$KUBERAY_GEN_RAY_START_CMD" &
echo "[INFO] Waiting for Ray dashboard to be ready..." &&
until curl --max-time 5 --fail http://127.0.0.1:8265 > /dev/null 2>&1; do
echo "[WAITING] $(date -u +'%Y-%m-%dT%H:%M:%SZ') - Ray dashboard not ready yet...";
sleep 2;
done &&
echo "[SUCCESS] Ray dashboard is available!" &&
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
--served-model-name qwen-coder-7b-instruct \
--tensor-parallel-size 2 \
--distributed-executor-backend ray \
--host 0.0.0.0 \
--port 8000 \
--dtype half
ports:
- containerPort: 6379
name: gcs-server
- containerPort: 8265
name: dashboard
- containerPort: 10001
name: client
- containerPort: 8000
name: service
resources:
limits:
cpu: "4"
nvidia.com/gpu: 1
requests:
cpu: "4"
nvidia.com/gpu: 1
- name: aibrix-runtime
image: aibrix/runtime:v0.3.0
command:
- aibrix_runtime
- --port
- "8080"
env:
- name: INFERENCE_ENGINE
value: vllm
- name: INFERENCE_ENGINE_ENDPOINT
value: http://localhost:8000
- name: PYTORCH_CUDA_ALLOC_CONF
value: "expandable_segments:True"
ports:
- containerPort: 8080
protocol: TCP
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 3
periodSeconds: 2
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
resources:
limits:
cpu: "1"
requests:
cpu: "1"
workerGroupSpecs:
- groupName: small-group
replicas: 1
minReplicas: 1
maxReplicas: 5
rayStartParams: {}
template:
metadata:
labels:
model.aibrix.ai/name: qwen-coder-7b-instruct
spec:
containers:
- name: ray-worker
image: vllm/vllm-openai:v0.7.1
env:
- name: MY_POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
command: [ "/bin/bash", "-c" ]
args:
- >
ulimit -n 65536 &&
eval "$KUBERAY_GEN_RAY_START_CMD --node-ip-address=$MY_POD_IP" &&
tail -f /dev/null
lifecycle:
preStop:
exec:
command: [ "/bin/sh", "-c", "ray stop" ]
resources:
limits:
cpu: "4"
nvidia.com/gpu: 1
requests:
cpu: "4"
nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
name: qwen-coder-7b-instruct
labels:
model.aibrix.ai/name: qwen-coder-7b-instruct
prometheus-discovery: "true"
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
spec:
selector:
model.aibrix.ai/name: qwen-coder-7b-instruct
ports:
- name: serve
port: 8000
protocol: TCP
targetPort: 8000
- name: http
port: 8080
protocol: TCP
targetPort: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: qwen-coder-7b-instruct-router
namespace: aibrix-system
spec:
parentRefs:
- group: gateway.networking.k8s.io
kind: Gateway
name: aibrix-eg
namespace: aibrix-system
rules:
- backendRefs:
- group: ""
kind: Service
name: qwen-coder-7b-instruct
namespace: default
port: 8000 # or 8000 if you're not using the runtime sidecar
weight: 1
matches:
- headers:
- name: model
type: Exact
value: qwen-coder-7b-instruct
path:
type: PathPrefix
value: /v1/completions
- headers:
- name: model
type: Exact
value: qwen-coder-7b-instruct
path:
type: PathPrefix
value: /v1/chat/completions
timeouts:
request: 120s
vLLM Version#
If you are using vLLM earlier version, you have two options.
Use our built image
aibrix/vllm-openai:v0.6.1.post2-distributed.Build your own image and follow steps here.
FROM vllm/vllm-openai:v0.6.1.post2
RUN apt update && apt install -y wget # important for future healthcheck
RUN pip3 install ray[default] # important for future healthcheck
ENTRYPOINT [""]
docker build -t aibrix/vllm-openai:v0.6.1.post2-distributed .