Multi-Node Inference

Multi-Node Inference#

Distributed inference refers to the technique of splitting and processing LLM model across multiple nodes or devices. This approach is particularly useful for large models that cannot fit into the memory of a single machine. This solution relies on KubeRay to orchestrate the Ray Clusters.

Key API Design#

In the landscape of distributed computing, the need for efficient orchestration of multi-node inference tasks has become paramount. Kubernetes has established itself as a leading platform for managing containerized applications, offering robust resource management and scalability. On the other hand, Ray has emerged as a powerful framework for building and running distributed applications, particularly well-suited for handling complex machine learning workflows. However, the existing approaches to orchestration often fall short in terms of flexibility and simplicity. Kubernetes operators, while powerful, can become overly complex when dealing with fine-grained orchestration of distributed applications. Ray, although excellent for internal task scheduling and resource management, lacks the broader resource orchestration capabilities provided by Kubernetes.

To address these challenges, we propose a new orchestration approach that synergizes the strengths of both Kubernetes and Ray. This approach leverages Ray for internal fine-grained application orchestration, allowing users to utilize Ray’s APIs for distributed computation Simultaneously, Kubernetes will handle the overall application resource orchestration, focusing on coarse-grained resource allocation and environment configuration. This division of responsibilities simplifies the design of Kubernetes operators and enhances the overall flexibility and efficiency of the orchestration process.

We introduce two key APIs for RayCluster Management, it’s RayClusterReplicaSet and RayClusterFleet. It’s similar like Kubernetes core concept ReplicaSet and Deployment. Most of the time, you only need to use RayClusterFleet.

Ray Framework Focus: In this model, Ray is emphasized solely for its role in intra-application orchestration. Each application instance corresponds to a single Ray Cluster, and multiple service instances of an application equate to multiple Ray Clusters. This ensures that Ray handles the distributed nature of the application internally without interference from external orchestration systems.
Kubernetes Layer: Kubernetes operates at the outer layer, responsible for initiating Ray Clusters and managing standard Kubernetes functionalities such as autoscaling and rolling updates. The Kubernetes layer doesn’t orchestrate the roles inside the application anymore. These features are well-established within the Kubernetes ecosystem, ensuring robust and reliable resource management, scaling, and update processes. By leveraging Kubernetes for these operations, we can achieve a seamless integration of Ray’s distributed computing capabilities with Kubernetes’ mature operational management.
Service Encapsulation and Mapping: At a higher level, services are encapsulated in a manner analogous to Kubernetes Deployments and ReplicaSets. The key difference lies in the mapping: instead of Pods, we now have Ray Clusters representing application instances. Traditionally, a single Pod would constitute an application instance; however, in this distributed model, a Ray Cluster serves this purpose, encapsulating the complexity of distributed execution within itself.

Attention

We already submit our ideas to KubeRay community. Hopefully, we can merge into the repo pretty soon.

Workloads Examples#

Attention

Starting from v0.6.6, we’ve added essential packages to run distributed inference with vLLM official container image distribution out of the box. If you use earlier versions, you can follow guidance below to build your own image compatible with multi-node inference.

This is the RayClusterFleet example, you can apply this yaml in your cluster.

apiVersion: orchestration.aibrix.ai/v1alpha1
kind: RayClusterFleet
metadata:
  name: qwen-coder-7b-instruct
  labels:
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/managed-by: kustomize
spec:
  replicas: 1
  selector:
    matchLabels:
      model.aibrix.ai/name: qwen-coder-7b-instruct
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    metadata:
      labels:
        model.aibrix.ai/name: qwen-coder-7b-instruct
      annotations:
        ray.io/overwrite-container-cmd: "true"
    spec:
      rayVersion: "2.10.0"
      headGroupSpec:
        rayStartParams:
          dashboard-host: "0.0.0.0"
        template:
          metadata:
            labels:
              model.aibrix.ai/name: qwen-coder-7b-instruct
          spec:
            containers:
              - name: ray-head
                image: vllm/vllm-openai:v0.7.1
                command: ["/bin/bash", "-c"]
                args:
                  - >
                    ulimit -n 65536 &&
                    apt update && apt install -y wget net-tools && pip3 install ray[default] pyarrow pandas &&
                    echo "[INFO] Starting Ray head node..." &&
                    eval "$KUBERAY_GEN_RAY_START_CMD" &

                    echo "[INFO] Waiting for Ray dashboard to be ready..." &&
                    until curl --max-time 5 --fail http://127.0.0.1:8265 > /dev/null 2>&1; do
                      echo "[WAITING] $(date -u +'%Y-%m-%dT%H:%M:%SZ') - Ray dashboard not ready yet...";
                      sleep 2;
                    done &&
                    echo "[SUCCESS] Ray dashboard is available!" &&

                    vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
                      --served-model-name qwen-coder-7b-instruct \
                      --tensor-parallel-size 2 \
                      --distributed-executor-backend ray \
                      --host 0.0.0.0 \
                      --port 8000 \
                      --dtype half
                ports:
                  - containerPort: 6379
                    name: gcs-server
                  - containerPort: 8265
                    name: dashboard
                  - containerPort: 10001
                    name: client
                  - containerPort: 8000
                    name: service
                resources:
                  limits:
                    cpu: "4"
                    nvidia.com/gpu: 1
                  requests:
                    cpu: "4"
                    nvidia.com/gpu: 1
              - name: aibrix-runtime
                image: aibrix/runtime:v0.3.0
                command:
                  - aibrix_runtime
                  - --port
                  - "8080"
                env:
                  - name: INFERENCE_ENGINE
                    value: vllm
                  - name: INFERENCE_ENGINE_ENDPOINT
                    value: http://localhost:8000
                  - name: PYTORCH_CUDA_ALLOC_CONF
                    value: "expandable_segments:True"
                ports:
                  - containerPort: 8080
                    protocol: TCP
                livenessProbe:
                  httpGet:
                    path: /healthz
                    port: 8080
                  initialDelaySeconds: 3
                  periodSeconds: 2
                readinessProbe:
                  httpGet:
                    path: /ready
                    port: 8080
                  initialDelaySeconds: 5
                  periodSeconds: 10
                resources:
                  limits:
                    cpu: "1"
                  requests:
                    cpu: "1"
      workerGroupSpecs:
        - groupName: small-group
          replicas: 1
          minReplicas: 1
          maxReplicas: 5
          rayStartParams: {}
          template:
            metadata:
              labels:
                model.aibrix.ai/name: qwen-coder-7b-instruct
            spec:
              containers:
                - name: ray-worker
                  image: vllm/vllm-openai:v0.7.1
                  env:
                    - name: MY_POD_IP
                      valueFrom:
                        fieldRef:
                          fieldPath: status.podIP
                  command: [ "/bin/bash", "-c" ]
                  args:
                    - >
                      ulimit -n 65536 &&
                      eval "$KUBERAY_GEN_RAY_START_CMD --node-ip-address=$MY_POD_IP" &&
                      tail -f /dev/null
                  lifecycle:
                    preStop:
                      exec:
                        command: [ "/bin/sh", "-c", "ray stop" ]
                  resources:
                    limits:
                      cpu: "4"
                      nvidia.com/gpu: 1
                    requests:
                      cpu: "4"
                      nvidia.com/gpu: 1

---

apiVersion: v1
kind: Service
metadata:
  name: qwen-coder-7b-instruct
  labels:
    model.aibrix.ai/name: qwen-coder-7b-instruct
    prometheus-discovery: "true"
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  selector:
    model.aibrix.ai/name: qwen-coder-7b-instruct
  ports:
    - name: serve
      port: 8000
      protocol: TCP
      targetPort: 8000
    - name: http
      port: 8080
      protocol: TCP
      targetPort: 8080

---

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-coder-7b-instruct-router
  namespace: aibrix-system
spec:
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: aibrix-eg
      namespace: aibrix-system
  rules:
    - backendRefs:
        - group: ""
          kind: Service
          name: qwen-coder-7b-instruct
          namespace: default
          port: 8000  # or 8000 if you're not using the runtime sidecar
          weight: 1
      matches:
        - headers:
            - name: model
              type: Exact
              value: qwen-coder-7b-instruct
          path:
            type: PathPrefix
            value: /v1/completions
        - headers:
            - name: model
              type: Exact
              value: qwen-coder-7b-instruct
          path:
            type: PathPrefix
            value: /v1/chat/completions
      timeouts:
        request: 120s

vLLM Version#

If you are using vLLM earlier version, you have two options.

Use our built image aibrix/vllm-openai:v0.6.1.post2-distributed.
Build your own image and follow steps here.

FROM vllm/vllm-openai:v0.6.1.post2
RUN apt update && apt install -y wget # important for future healthcheck
RUN pip3 install ray[default] # important for future healthcheck
ENTRYPOINT [""]

docker build -t aibrix/vllm-openai:v0.6.1.post2-distributed .