Observability

To enable observability for your AIBrix deployment, we provide built-in Grafana dashboards that cover the key system components:

  1. Control Plane Runtime Dashboard - Monitors controller runtime performance, reconciliation behavior, and health status of the control plane.

  2. Envoy Gateway Dashboard - Visualizes traffic metrics including request counts, latencies, and external processing statistics.

  3. Model Service Dashboard - Tracks per-model service metrics such as request QPS, prompt and output length, TTFT/TPOT, and stop reasons.

Prerequisites

Before enabling metrics and dashboards, make sure the kube-prometheus-stack is installed in your cluster. This provides Prometheus, Grafana, and CRDs like ServiceMonitor required for scraping metrics.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace
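
Before moving on, it can help to confirm that the chart installed the pieces the later steps depend on. The sketch below assumes the `prometheus` namespace from the install command above. `check_monitoring_prereqs` is a hypothetical helper name, not part of AIBrix; `servicemonitors.monitoring.coreos.com` is the CRD registered by the Prometheus Operator.

```shell
# Hypothetical helper: verify that the ServiceMonitor CRD exists and list
# the monitoring pods in the given namespace (defaults to "prometheus").
check_monitoring_prereqs() {
  ns="${1:-prometheus}"
  # ServiceMonitor objects cannot be created unless this CRD is registered.
  if ! kubectl get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1; then
    echo "ServiceMonitor CRD not found; is kube-prometheus-stack installed?" >&2
    return 1
  fi
  # List the pods in the monitoring namespace for a manual health check.
  kubectl get pods --namespace "$ns"
}
```

Run `check_monitoring_prereqs` after the Helm install finishes; the Prometheus, Grafana, and operator pods should eventually report a Running status.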

Metric Enablement Steps

To activate metric collection for each component:

  1. Control Plane Runtime - The default controller manager installation already exposes the metrics. The following ServiceMonitor scrapes them:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/managed-by: kubectl
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/version: nightly
    control-plane: controller-manager
  name: aibrix-controller-manager-metrics-monitor
  namespace: aibrix-system
spec:
  endpoints:
  - path: /metrics
    port: http
    scheme: http
  selector:
    matchLabels:
      control-plane: controller-manager
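
A ServiceMonitor only produces scrape targets when a Service carries the labels its selector matches. The sketch below checks both halves for the control plane, using the resource name, namespace, and label from the manifest above. `check_controller_monitor` is a hypothetical helper name.

```shell
# Hypothetical helper: confirm the ServiceMonitor exists and that at least
# one Service carries the label its selector matches on.
check_controller_monitor() {
  kubectl get servicemonitor aibrix-controller-manager-metrics-monitor \
    --namespace aibrix-system || return 1
  svcs="$(kubectl get service --namespace aibrix-system \
    --selector control-plane=controller-manager --output name)"
  if [ -z "$svcs" ]; then
    echo "no Service labeled control-plane=controller-manager" >&2
    return 1
  fi
  # Print the matching Services so the scrape target is easy to identify.
  echo "$svcs"
}
```
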

  2. Envoy Gateway - In addition to a ServiceMonitor, you must deploy an auxiliary metrics Service that exposes Envoy’s admin interface metrics (e.g., /stats/prometheus) to Prometheus.

apiVersion: v1
kind: Service
metadata:
  name: envoy-admin-metrics
  namespace: envoy-gateway-system
  labels:
    app.kubernetes.io/name: envoy
    app.kubernetes.io/component: proxy
    app.kubernetes.io/managed-by: envoy-gateway
spec:
  ports:
  - name: metrics
    port: 19001
    targetPort: 19001
    protocol: TCP
  selector:
    app.kubernetes.io/name: envoy
    app.kubernetes.io/component: proxy
    app.kubernetes.io/managed-by: envoy-gateway
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: envoy-metrics-monitor
  namespace: envoy-gateway-system
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: envoy
  namespaceSelector:
    matchNames:
    - envoy-gateway-system
  endpoints:
  - port: metrics
    path: /stats/prometheus
    scheme: http
    interval: 30s
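
To check that the auxiliary Service really serves Prometheus-format stats before waiting for a scrape, one option is a temporary port-forward plus `curl`. This is a minimal sketch using the Service name, namespace, port, and path from the manifests above. `fetch_envoy_metrics` is a hypothetical helper name, and the two-second wait is an arbitrary assumption that may need adjusting.

```shell
# Hypothetical helper: fetch Envoy's Prometheus-format stats through a
# temporary port-forward to the auxiliary metrics Service.
fetch_envoy_metrics() {
  kubectl port-forward --namespace envoy-gateway-system \
    service/envoy-admin-metrics 19001:19001 >/dev/null 2>&1 &
  pf_pid=$!
  # Give the port-forward a moment to establish before querying it.
  sleep 2
  status=0
  curl --silent --fail http://localhost:19001/stats/prometheus || status=$?
  # Stop the port-forward whether or not the request succeeded.
  kill "$pf_pid" 2>/dev/null || true
  return "$status"
}
```

For example, `fetch_envoy_metrics | head -n 20` prints the first few metric lines if the endpoint is reachable.
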

  3. Model Service - We provide a sample ServiceMonitor as a reference; adjust the definition to match your model setup.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: prometheus
  name: test-service-monitor
  namespace: default
spec:
  endpoints:
  - interval: 15s
    path: /metrics
    port: metrics
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      prometheus-discovery: "true"
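
The sample above selects Services in the `default` namespace that carry the label `prometheus-discovery: "true"` and expose a port named `metrics`. As a sketch of how an existing model Service could be opted in, the hypothetical helper below adds the label and prints the Service's port names for comparison. The Service name is whatever your deployment uses; nothing here is an AIBrix-provided command.

```shell
# Hypothetical helper: label a model Service so the sample ServiceMonitor's
# selector (prometheus-discovery: "true") picks it up.
enable_model_metrics() {
  svc="${1:-}"
  ns="${2:-default}"
  if [ -z "$svc" ]; then
    echo "usage: enable_model_metrics <service-name> [namespace]" >&2
    return 2
  fi
  kubectl label service "$svc" prometheus-discovery=true \
    --namespace "$ns" --overwrite
  # The ServiceMonitor scrapes the port named "metrics"; print the port
  # names so a mismatch is easy to spot.
  kubectl get service "$svc" --namespace "$ns" \
    --output jsonpath='{.spec.ports[*].name}'
}
```

If the Service lives outside `default`, the `namespaceSelector` in the ServiceMonitor needs to list that namespace as well.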

Import Grafana Dashboard

We provide pre-built Grafana dashboards for production monitoring of the control plane, Envoy Gateway, and model services. They surface system performance, request patterns, error rates, and more. To use them, first confirm that your Prometheus data source is configured in Grafana, then import each dashboard by uploading its JSON file. Once imported, the dashboards display live metrics as long as the ServiceMonitor resources are set up correctly and kube-prometheus-stack is actively scraping data.
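
To reach the Grafana UI for the import, a local port-forward is usually enough. The sketch below assumes the Helm release and namespace are both named `prometheus`, as in the install command above, and that the chart follows its usual convention of naming the Grafana Service and admin Secret `<release>-grafana`; verify these names in your cluster. `open_grafana` is a hypothetical helper name.

```shell
# Hypothetical helper: print the Grafana admin password, then open a local
# port-forward to the Grafana Service (assumed name: <release>-grafana).
open_grafana() {
  ns="${1:-prometheus}"
  release="${2:-prometheus}"
  # The password is stored base64-encoded in the Secret's data field.
  kubectl get secret "${release}-grafana" --namespace "$ns" \
    --output jsonpath='{.data.admin-password}' | base64 --decode
  echo
  # Blocks until interrupted; Grafana is then reachable on localhost:3000.
  kubectl port-forward --namespace "$ns" "service/${release}-grafana" 3000:80
}
```

With the port-forward running, open http://localhost:3000, log in, and upload the dashboard JSON files through Grafana's dashboard import dialog.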

Production Monitoring

TODO: Screenshots and visual examples will be added soon to illustrate key views and usage patterns.