Observability#
To enable observability for your AIBrix deployment, we provide built-in Grafana dashboards that cover the key system components:
Control Plane Runtime Dashboard - Monitors controller runtime performance, reconciliation behavior, and health status of the control plane.
Envoy Gateway Dashboard - Visualizes traffic metrics including request counts, latencies, and external processing statistics.
Model Service Dashboard - Tracks per-model service metrics such as request QPS, prompt and output length, time to first token (TTFT), time per output token (TPOT), and stop reasons.
Prerequisites#
Before enabling metrics and dashboards, make sure the kube-prometheus-stack is installed in your cluster. This provides Prometheus, Grafana, and CRDs like ServiceMonitor required for scraping metrics.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace prometheus --create-namespace
Metric Enablement Steps#
To activate metric collection for each component:
Control Plane Runtime - The default controller manager installation already exposes its metrics. The corresponding ServiceMonitor is:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/managed-by: kubectl
    app.kubernetes.io/name: aibrix
    app.kubernetes.io/version: nightly
    control-plane: controller-manager
  name: aibrix-controller-manager-metrics-monitor
  namespace: aibrix-system
spec:
  endpoints:
    - path: /metrics
      port: http
      scheme: http
  selector:
    matchLabels:
      control-plane: controller-manager
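The ServiceMonitor above carries no `release: prometheus` label, unlike the Envoy Gateway and Model Service examples in this section. In its default configuration, kube-prometheus-stack only selects ServiceMonitors labeled with the Helm release name, so this one may be ignored. One possible workaround, sketched below under that assumption, is a Helm values override that lets Prometheus select ServiceMonitors regardless of the label. The file name `prometheus-values.yaml` is arbitrary and not part of AIBrix. Adding `release: prometheus` to the ServiceMonitor's labels is an alternative.

```shell
# Write a Helm values override for kube-prometheus-stack.
# The file name is an arbitrary choice for this sketch.
cat > prometheus-values.yaml <<'EOF'
prometheus:
  prometheusSpec:
    # Do not restrict ServiceMonitor discovery to the Helm release label.
    serviceMonitorSelectorNilUsesHelmValues: false
EOF
```

If this fits your setup, the override can be applied to the release installed in the prerequisites with `helm upgrade prometheus prometheus-community/kube-prometheus-stack --namespace prometheus --reuse-values -f prometheus-values.yaml`.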
Envoy Gateway - In addition to a ServiceMonitor, you must deploy an auxiliary metrics service that exposes Envoy’s admin interface metrics (e.g., /stats/prometheus) to Prometheus.
apiVersion: v1
kind: Service
metadata:
  name: envoy-admin-metrics
  namespace: envoy-gateway-system
  labels:
    app.kubernetes.io/name: envoy
    app.kubernetes.io/component: proxy
    app.kubernetes.io/managed-by: envoy-gateway
spec:
  ports:
    - name: metrics
      port: 19001
      targetPort: 19001
      protocol: TCP
  selector:
    app.kubernetes.io/name: envoy
    app.kubernetes.io/component: proxy
    app.kubernetes.io/managed-by: envoy-gateway
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: envoy-metrics-monitor
  namespace: envoy-gateway-system
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: envoy
  namespaceSelector:
    matchNames:
      - envoy-gateway-system
  endpoints:
    - port: metrics
      path: /stats/prometheus
      scheme: http
      interval: 30s
Model Service - We provide a sample ServiceMonitor as a reference; adjust the definition to match your model setup.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    release: prometheus
  name: test-service-monitor
  namespace: default
spec:
  endpoints:
    - interval: 15s
      path: /metrics
      port: metrics
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      prometheus-discovery: "true"
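The sample ServiceMonitor only scrapes Services in the `default` namespace that carry the label `prometheus-discovery: "true"` and expose a port named `metrics`. As a rough sketch of a Service that would match, the snippet below writes a manifest to a file. The Service name `my-model`, the pod label `app: my-model`, and port `8000` are hypothetical placeholders, not values defined by AIBrix. Replace them with those of your own model deployment.

```shell
# Write a Service manifest that the sample ServiceMonitor would select.
# Names and port numbers are placeholders for illustration only.
cat > model-metrics-service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: my-model                    # hypothetical Service name
  namespace: default                # must be listed in namespaceSelector.matchNames
  labels:
    prometheus-discovery: "true"    # must match spec.selector.matchLabels
spec:
  selector:
    app: my-model                   # hypothetical pod label
  ports:
    - name: metrics                 # must equal the ServiceMonitor endpoint port name
      port: 8000                    # assumed metrics port of the model server
      targetPort: 8000
      protocol: TCP
EOF
```

The manifest can then be applied with `kubectl apply -f model-metrics-service.yaml`. If any of the three conditions (namespace, label, port name) does not hold, Prometheus will not generate a scrape target for the Service.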
Import Grafana Dashboard#
For production monitoring, we provide pre-built Grafana dashboards that visualize metrics from the control plane, Envoy Gateway, and model services, covering system performance, request patterns, error rates, and more. To import a dashboard, upload the corresponding JSON file to your Grafana instance; make sure your Prometheus data source is correctly configured first. Once imported, the dashboards display live metrics as long as the ServiceMonitor resources are properly set up and the kube-prometheus-stack is actively scraping data.
Production Monitoring#
TODO: Screenshots and visual examples will be added soon to illustrate key views and usage patterns.