vLLM Semantic Router#
Imagine you’re building an AI assistant that handles everything — math homework, legal questions, creative writing, and medical queries. You could run a single large model for all of it. But what if you could automatically send math and physics questions to a reasoning-optimized model, and route everyday conversational queries to a faster, lighter model — all transparently, with no changes to how your clients make requests?
That’s exactly what the vLLM Semantic Router does. It sits in front of your model fleet and classifies each incoming prompt by topic and intent, then forwards it to the best-fit model with the right configuration. Clients always call the same endpoint with the same API — the router handles the rest.
This guide walks you through deploying the semantic router with AIBrix and Envoy Gateway, using a concrete example that routes between qwen3-8b (for STEM and reasoning tasks) and llama3-8b-instruct (for business, legal, and general queries).
Note
All sample manifests referenced in this guide live under samples/semantic-router/ in the AIBrix repository.
How It Works#
The router integrates as an Envoy External Processor (ext_proc): a gRPC service that Envoy consults before forwarding each request. Here’s the full request path:
Client
│
▼
Envoy Gateway (aibrix-eg) ← receives the original request
│ gRPC ext_proc (port 50051)
▼
Semantic Router ← classifies prompt, rewrites model field
│ e.g. "MoM" → "qwen3-8b"
▼
AIBrix Gateway Plugins ← rate limiting, auth, etc.
│
├──► qwen3-8b (STEM + reasoning)
└──► llama3-8b-instruct (business, legal, general)
What the router does for each request:
Receives the full buffered request body from Envoy over gRPC.
Extracts the user’s message content.
Runs an embedding-based domain classifier (or a fast keyword scanner for explicit signals like “think step by step”).
Selects the highest-priority matching decision.
Rewrites the request: replaces the model field, injects a system prompt, and optionally enables reasoning mode.
Returns the mutated request to Envoy, which forwards it to the selected backend.
The client always uses "model": "MoM" (Model of Models). The router replaces this with the real backend model name. No client changes are needed.
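To make the steps above concrete, here is a minimal Python sketch of the classify-and-rewrite flow. It is not the router's implementation: toy_classify stands in for the embedding classifier, the domain-to-model mapping is hand-written, and both the choice of the most recent user message and the use of llama3-8b-instruct as the fallback are assumptions made for illustration.

```python
import copy


def last_user_content(body):
    """Return the text of the most recent user message, or "" if there is none."""
    for message in reversed(body.get("messages", [])):
        if message.get("role") == "user":
            return message.get("content", "")
    return ""


def route(body, classify, domain_to_model, fallback_model):
    """Classify the prompt and return a copy of the body with a real model name."""
    domain = classify(last_user_content(body))
    routed = copy.deepcopy(body)  # leave the client's original body untouched
    routed["model"] = domain_to_model.get(domain, fallback_model)
    return routed


def toy_classify(text):
    """Hypothetical stand-in for the embedding-based domain classifier."""
    return "math" if "derivative" in text.lower() else "other"


request = {
    "model": "MoM",
    "messages": [{"role": "user", "content": "What is the derivative of x^3 + 2x?"}],
}
routed = route(request, toy_classify, {"math": "qwen3-8b"}, "llama3-8b-instruct")
print(routed["model"])  # qwen3-8b
```

The client's body still says "MoM"; only the copy forwarded to the backend carries the real model name.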
Prerequisites#
Before you begin, make sure you have:
A running Kubernetes cluster with AIBrix installed (includes Envoy Gateway).
kubectl configured to talk to your cluster.
GPU nodes available for model serving.
A Hugging Face account and API token (for downloading model weights).
Model weights pre-staged on your nodes at /data01/models/, or access to pull them from Hugging Face.
Tip
The sample uses qwen3-8b and llama3-8b-instruct. Each model requires one GPU with at least 48 GB of GPU memory.
Step 1 — Create the Namespace and Credentials#
Create a dedicated namespace for the semantic router, then store your Hugging Face token as a Kubernetes secret (used by the router to download its embedding model):
kubectl create namespace vllm-semantic-router-system
export HF_TOKEN="<your-huggingface-token>"
kubectl create secret generic hf-token-secret \
--from-literal=token="${HF_TOKEN}" \
-n vllm-semantic-router-system
Step 2 — Deploy the Backend Model Services#
The router needs the actual model servers to be running before it can forward requests. Deploy both backends:
kubectl apply -f samples/semantic-router/models/llama3-8b-instruct.yaml
kubectl apply -f samples/semantic-router/models/qwen3-8b.yaml
Each manifest creates a Deployment (3 replicas) and a Service in the default namespace. The AIBrix controller automatically creates an HTTPRoute for each model so Envoy Gateway can discover them.
Wait until both deployments are ready before proceeding:
kubectl rollout status deployment/llama3-8b-instruct -n default
kubectl rollout status deployment/qwen3-8b -n default
Step 3 — Deploy the Semantic Router#
Apply the router’s ConfigMap (routing rules) and the router Deployment:
kubectl apply -f samples/semantic-router/semantic-router-configmap.yaml
kubectl apply -f samples/semantic-router/semantic-router.yaml
The router container (ghcr.io/vllm-project/semantic-router/extproc:latest) downloads its embedding model at startup — allow up to 60 seconds for this. You can watch progress with:
kubectl logs -f deployment/semantic-router -n vllm-semantic-router-system
The router exposes three ports:
| Port | Protocol | Purpose |
|---|---|---|
| 50051 | gRPC | ext_proc interface: receives requests from Envoy |
| 8080 | HTTP | Classification REST API (useful for debugging) |
| 9190 | HTTP | Prometheus metrics |
Step 4 — Wire the Router into Envoy Gateway#
Apply the Gateway API resources that register the semantic router as an ext_proc filter on the Envoy listener:
kubectl apply -f samples/semantic-router/gwapi-resources.yaml
This applies an EnvoyPatchPolicy with two JSON patches:
Patch 1 adds the semantic-router-extproc HTTP filter to Envoy’s filter chain. It uses BUFFERED body mode so Envoy accumulates the complete request body before handing it to the router — no streaming complexity.
Patch 2 registers the router as an upstream cluster (STRICT_DNS, HTTP/2 for gRPC) so Envoy knows how to reach it at semantic-router.vllm-semantic-router-system.svc.cluster.local:50051.
Verify the patch was accepted:
kubectl describe envoypatchpolicy ai-gateway-prepost-extproc-patch-policy -n aibrix-system
Look for Status: True under Conditions.
Step 5 — Access the Gateway#
Port-forward the Envoy service to test locally:
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
--selector=gateway.envoyproxy.io/owning-gateway-namespace=aibrix-system,gateway.envoyproxy.io/owning-gateway-name=aibrix-eg \
-o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n envoy-gateway-system "svc/${ENVOY_SERVICE}" 8080:80
In production, use the LoadBalancer external IP instead:
LB_IP=$(kubectl get svc -n envoy-gateway-system \
-l "gateway.envoyproxy.io/owning-gateway-name=aibrix-eg" \
-o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
Step 6 — Test Semantic Routing#
All requests use the virtual model name "MoM" (Model of Models). The router transparently selects the right backend.
Math question → routed to qwen3-8b (reasoning enabled)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [
{"role": "user", "content": "What is the derivative of x^3 + 2x?"}
],
"max_tokens": 200
}'
The router classifies this as the math domain, selects qwen3-8b, injects a mathematics expert system prompt, and enables chain-of-thought reasoning (chat_template_kwargs.enable_thinking: true).
Business question → routed to llama3-8b-instruct (standard mode)
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [
{"role": "user", "content": "What are the key factors to consider when entering a new market?"}
],
"max_tokens": 200
}'
Explicit reasoning trigger → overrides domain classification
If the user explicitly asks for step-by-step thinking, the thinking keyword rule fires (priority 15, highest in the config) regardless of the detected domain:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "MoM",
"messages": [
{"role": "user", "content": "Walk me through how to structure a business merger proposal."}
],
"max_tokens": 300
}'
Even though this is a business query, “walk me through” matches the thinking keyword set, so it routes to qwen3-8b with reasoning enabled.
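If you would rather script these checks than run curl by hand, the same request can be sent from Python using only the standard library. This is a rough equivalent of the curl commands above, assuming the port-forward from Step 5 is still running; build_chat_request and send are helper names chosen for this example.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8080/v1/chat/completions"  # port-forward from Step 5


def build_chat_request(prompt, max_tokens=200):
    """Build the same JSON body the curl examples send."""
    return {
        "model": "MoM",  # virtual model name; the router picks the real backend
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send(payload, url=GATEWAY_URL, timeout=60):
    """POST the payload to the gateway and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires the gateway to be reachable):
#   reply = send(build_chat_request("What is the derivative of x^3 + 2x?"))
#   print(reply.get("model"))
```

OpenAI-compatible servers typically echo the serving model in the response's model field, which is a quick way to see which backend answered.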
Routing Rules Reference#
The complete routing table for the sample (15 rules, defined in semantic-router-configmap.yaml):
| Domain / Keyword | Match Type | Model | Reasoning | Priority |
|---|---|---|---|---|
| thinking | keyword | qwen3-8b | on | 15 |
|  | domain |  | on | 10 |
|  | domain |  | on | 10 |
|  | domain |  | on | 10 |
|  | domain |  | on | 10 |
|  | domain |  | on | 10 |
|  | domain |  | on | 10 |
|  | domain |  | off | 10 |
|  | domain |  | off | 10 |
|  | domain |  | off | 10 |
|  | domain |  | off | 10 |
|  | domain |  | off | 10 |
|  | domain |  | off | 10 |
|  | domain |  | off | 10 |
| other | domain |  | off | 5 |
When several rules match the same prompt, the one with the highest priority wins. The thinking keyword rule (priority 15) therefore always overrides the domain rules (priority 10 or 5).
Understanding the Configuration#
All routing behavior is controlled by semantic-router-configmap.yaml. Here’s how the key pieces fit together.
Decision Structure#
Each routing decision looks like this:
routing:
  decisions:
    - name: math_decision
      description: Mathematics and quantitative reasoning
      priority: 10 # higher wins; tie → first in list wins
      rules:
        operator: OR
        conditions:
          - name: math # must match a signal declared under routing.signals
            type: domain # or type: keyword
      modelRefs:
        - model: qwen3-8b
          use_reasoning: true # activates chain-of-thought mode
      plugins:
        - type: system_prompt
          configuration:
            enabled: true
            mode: replace
            system_prompt: "You are a mathematics expert. ..."
Rule Types#
| Type | How it matches |
|---|---|
| domain | Embedding-based cosine similarity between the prompt and a named domain label. The router picks the domain whose embedding is closest to the prompt. |
| keyword | Fast exact substring search (case-insensitive by default). Matches if any keyword in the set appears anywhere in the prompt. |
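The two match types can be illustrated with a few lines of Python. This is only a sketch of the ideas (substring search and nearest embedding by cosine similarity), not the router's code; the three-dimensional vectors are made up, whereas a real deployment would use the router's embedding model.

```python
import math


def keyword_matches(prompt, keywords, case_sensitive=False):
    """Return True if any keyword occurs as a substring of the prompt."""
    if not case_sensitive:
        prompt = prompt.lower()
        keywords = [k.lower() for k in keywords]
    return any(k in prompt for k in keywords)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_domain(prompt_vec, domain_vecs):
    """Return the domain name whose embedding is most similar to the prompt."""
    return max(domain_vecs, key=lambda name: cosine_similarity(prompt_vec, domain_vecs[name]))


# Toy embeddings, invented for illustration only.
domains = {"math": [1.0, 0.1, 0.0], "business": [0.0, 0.2, 1.0]}
print(closest_domain([0.9, 0.2, 0.1], domains))                         # math
print(keyword_matches("Please think Step by Step", ["step by step"]))   # True
```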
Priority Strategy#
With strategy: priority (the default):
All decisions whose rules match the prompt are collected.
The decision with the highest priority value wins.
Ties are broken by declaration order in the YAML — first wins.
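The selection rule is small enough to sketch directly. The snippet below assumes decisions are plain dictionaries kept in declaration order; math_decision and other_decision appear in this guide, while thinking_decision and business_decision are names invented for the example.

```python
def select_decision(decisions, matches):
    """Pick the winning decision among those whose rules match the prompt."""
    candidates = [d for d in decisions if matches(d)]
    if not candidates:
        return None
    # max() returns the first maximal item, so ties fall back to declaration order.
    return max(candidates, key=lambda d: d["priority"])


decisions = [
    {"name": "thinking_decision", "priority": 15},
    {"name": "math_decision", "priority": 10},
    {"name": "business_decision", "priority": 10},
    {"name": "other_decision", "priority": 5},
]

# Pretend the prompt matched two priority-10 decisions plus the catch-all.
matched = {"math_decision", "business_decision", "other_decision"}
winner = select_decision(decisions, lambda d: d["name"] in matched)
print(winner["name"])  # math_decision
```

Here math_decision wins the tie with business_decision only because it is declared first, which is why rule order in the YAML matters when priorities are equal.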
Signals Catalog#
Every domain or keyword set used in rules must be declared under routing.signals:
routing:
  signals:
    domains:
      - name: math
      - name: physics
      - name: business
      # ... add new domain labels here
    keywords:
      - name: thinking
        case_sensitive: false
        operator: OR
        keywords:
          - step by step
          - chain of thought
          - reason through
          # ... extend here
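Because a rule that names an undeclared signal is easy to introduce by typo, it can help to lint the config before applying it. The following is a hedged sketch of such a check, assuming the config has already been loaded into a Python dictionary (for example with a YAML parser) and follows the layout shown in this guide; undeclared_signals is a helper written for this example, not part of the router.

```python
def undeclared_signals(config):
    """Return (decision, type, name) for each condition that has no declared signal."""
    routing = config.get("routing", {})
    signals = routing.get("signals", {})
    declared = {
        "domain": {s["name"] for s in signals.get("domains", [])},
        "keyword": {s["name"] for s in signals.get("keywords", [])},
    }
    missing = []
    for decision in routing.get("decisions", []):
        for cond in decision.get("rules", {}).get("conditions", []):
            if cond["name"] not in declared.get(cond["type"], set()):
                missing.append((decision["name"], cond["type"], cond["name"]))
    return missing


config = {
    "routing": {
        "signals": {
            "domains": [{"name": "math"}],
            "keywords": [{"name": "thinking", "keywords": ["step by step"]}],
        },
        "decisions": [
            {"name": "math_decision",
             "rules": {"operator": "OR", "conditions": [{"name": "math", "type": "domain"}]}},
            {"name": "cybersecurity_decision",
             "rules": {"operator": "OR", "conditions": [{"name": "cybersecurity", "type": "domain"}]}},
        ],
    }
}
print(undeclared_signals(config))  # [('cybersecurity_decision', 'domain', 'cybersecurity')]
```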
Model Reasoning Activation#
The use_reasoning: true/false flag on a modelRef controls whether the router injects reasoning-activation parameters into the forwarded request. Different model families use different parameters.
Reasoning Families#
Defined under providers.defaults.reasoning_families:
providers:
  defaults:
    default_reasoning_effort: high
    reasoning_families:
      qwen3:
        parameter: enable_thinking
        type: chat_template_kwargs # → {"chat_template_kwargs": {"enable_thinking": true}}
      deepseek:
        parameter: thinking
        type: chat_template_kwargs
      gpt:
        parameter: reasoning_effort
        type: reasoning_effort # → {"reasoning_effort": "high"}
Assigning a Reasoning Family to a Model#
providers:
  models:
    - name: qwen3-8b
      reasoning_family: qwen3 # links to the qwen3 reasoning family above
      backend_refs:
        - endpoint: qwen3-8b.default.svc.cluster.local:8000
          name: aibrix-vllm
          weight: 1
    - name: llama3-8b-instruct
      # no reasoning_family → reasoning is never activated
      backend_refs:
        - endpoint: llama3-8b-instruct.default.svc.cluster.local:8000
          name: aibrix-vllm
          weight: 1
When use_reasoning: true fires for qwen3-8b, the router adds to the outbound request body:
{
  "chat_template_kwargs": {
    "enable_thinking": true
  }
}
vLLM reads this field and activates Qwen3’s built-in chain-of-thought path.
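The mapping from a reasoning family to request parameters can be sketched as follows. The family table mirrors the configuration shown above; the function itself is an illustration rather than the router's code, and it assumes that nothing is injected when use_reasoning is false or when the model has no reasoning family.

```python
import copy

REASONING_FAMILIES = {
    "qwen3": {"parameter": "enable_thinking", "type": "chat_template_kwargs"},
    "deepseek": {"parameter": "thinking", "type": "chat_template_kwargs"},
    "gpt": {"parameter": "reasoning_effort", "type": "reasoning_effort"},
}
DEFAULT_REASONING_EFFORT = "high"


def apply_reasoning(body, reasoning_family, use_reasoning):
    """Return a copy of the request body with reasoning parameters added, if any."""
    out = copy.deepcopy(body)
    family = REASONING_FAMILIES.get(reasoning_family)
    if not use_reasoning or family is None:
        return out  # assumption: leave the request untouched in this case
    if family["type"] == "chat_template_kwargs":
        out.setdefault("chat_template_kwargs", {})[family["parameter"]] = True
    elif family["type"] == "reasoning_effort":
        out[family["parameter"]] = DEFAULT_REASONING_EFFORT
    return out


print(apply_reasoning({"model": "qwen3-8b"}, "qwen3", True))
# {'model': 'qwen3-8b', 'chat_template_kwargs': {'enable_thinking': True}}
print(apply_reasoning({"model": "llama3-8b-instruct"}, None, True))
# {'model': 'llama3-8b-instruct'}
```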
Plugins#
Plugins are applied in declaration order after a decision is selected.
System Prompt Plugin#
Injects or replaces the system message in messages[]:
plugins:
  - type: system_prompt
    configuration:
      enabled: true
      mode: replace # replace | prepend | append
      system_prompt: "You are a mathematics expert. ..."
| Mode | Behaviour |
|---|---|
| replace | Removes any existing system message and prepends a new one. |
| prepend | Inserts before existing system messages. |
| append | Inserts after existing system messages. |
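One way to read the three modes is as list operations on messages[]. The sketch below takes that reading and adds the prompt as a separate system message; whether the router instead merges text into an existing system message is not stated here, so treat this as an interpretation rather than a description of its behaviour.

```python
def apply_system_prompt(messages, system_prompt, mode="replace"):
    """Return a new message list with the system prompt applied in the given mode."""
    new = {"role": "system", "content": system_prompt}
    if mode == "replace":
        # Drop every existing system message, then put the new one first.
        return [new] + [m for m in messages if m.get("role") != "system"]
    if mode == "prepend":
        # Position 0 is before any existing system message.
        return [new] + list(messages)
    if mode == "append":
        # Insert right after the last existing system message (or first if none).
        last = max((i for i, m in enumerate(messages) if m.get("role") == "system"), default=-1)
        return list(messages[: last + 1]) + [new] + list(messages[last + 1:])
    raise ValueError(f"unknown mode: {mode!r}")


messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "What is 2 + 2?"},
]
for mode in ("replace", "prepend", "append"):
    result = apply_system_prompt(messages, "You are a mathematics expert.", mode)
    print(mode, [m["content"] for m in result])
```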
Semantic Cache Plugin#
Caches responses by prompt embedding similarity. On a cache hit the router short-circuits the backend call and returns the cached response directly — great for frequently repeated queries:
plugins:
  - type: semantic-cache
    configuration:
      enabled: true
      similarity_threshold: 0.92 # 0.0–1.0; higher = stricter match required
Global cache settings (TTL, max entries, eviction) are configured under global.stores.semantic_cache:
global:
  stores:
    semantic_cache:
      enabled: true
      backend_type: memory
      embedding_model: mmbert
      similarity_threshold: 0.8 # global default; per-decision threshold overrides this
      ttl_seconds: 3600
      max_entries: 1000
      eviction_policy: fifo
Note
The semantic cache is particularly valuable for domains like health (threshold 0.95 — very strict) and other (threshold 0.75 — more permissive), where many users ask similar questions with slightly different wording.
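To show how the threshold, TTL, entry limit, and FIFO eviction interact, here is a toy in-memory cache in Python. It is a simplified model written for this guide, not the router's cache: embeddings are passed in as plain lists, similarity is cosine similarity, and the class and method names are invented.

```python
import math
import time


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


class SemanticCache:
    def __init__(self, similarity_threshold=0.8, ttl_seconds=3600,
                 max_entries=1000, clock=time.monotonic):
        self.similarity_threshold = similarity_threshold
        self.ttl_seconds = ttl_seconds
        self.max_entries = max_entries
        self._clock = clock
        self._entries = []  # oldest first: (embedding, response, expires_at)

    def put(self, embedding, response):
        if self._entries and len(self._entries) >= self.max_entries:
            self._entries.pop(0)  # FIFO: evict the oldest entry
        self._entries.append((embedding, response, self._clock() + self.ttl_seconds))

    def get(self, embedding, threshold=None):
        """Return the most similar unexpired response at or above the threshold."""
        if threshold is None:
            threshold = self.similarity_threshold  # fall back to the global default
        now = self._clock()
        self._entries = [e for e in self._entries if e[2] > now]  # drop expired
        best, best_score = None, threshold
        for vec, response, _ in self._entries:
            score = cosine(vec, embedding)
            if score >= best_score:
                best, best_score = response, score
        return best


cache = SemanticCache(similarity_threshold=0.8)
cache.put([1.0, 0.0], "cached answer")
print(cache.get([1.0, 0.1]))                  # cached answer
print(cache.get([1.0, 0.5], threshold=0.95))  # None
```

The second lookup misses because its similarity (about 0.89) clears the global 0.8 default but not the stricter per-decision threshold of 0.95.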
Adding a New Route#
Adding a new topic (e.g., “cybersecurity”) takes three steps, ending with a config reload:
1. Declare the domain under routing.signals.domains:
- name: cybersecurity
2. Add the decision under routing.decisions:
- name: cybersecurity_decision
  description: Cybersecurity and network security topics
  priority: 10
  rules:
    operator: OR
    conditions:
      - name: cybersecurity
        type: domain
  modelRefs:
    - model: qwen3-8b
      use_reasoning: true
  plugins:
    - type: system_prompt
      configuration:
        enabled: true
        mode: replace
        system_prompt: "You are a cybersecurity expert with deep knowledge of network security, threat modeling, and secure coding practices. ..."
3. Apply and reload:
kubectl apply -f samples/semantic-router/semantic-router-configmap.yaml
# Rolling restart picks up the new config immediately
kubectl rollout restart deployment/semantic-router -n vllm-semantic-router-system
Observability#
The router exposes Prometheus metrics on port 9190. To scrape them locally:
kubectl port-forward -n vllm-semantic-router-system \
deployment/semantic-router 9190:9190
Then open http://localhost:9190/metrics in your browser or point Prometheus at it.
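For quick ad-hoc checks without a Prometheus server, the text returned by /metrics can be parsed with a few lines of Python. The sketch below handles the common "series value" line shape and skips comments; it ignores optional timestamps, and the metric name in the sample is hypothetical, not one the router is known to export.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into a {series: value} dictionary."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical sample output; real metric names will differ.
sample = """\
# HELP example_requests_total Hypothetical counter for illustration.
# TYPE example_requests_total counter
example_requests_total{model="qwen3-8b"} 12
example_requests_total{model="llama3-8b-instruct"} 30
"""
metrics = parse_metrics(sample)
print(metrics['example_requests_total{model="qwen3-8b"}'])  # 12.0
```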
To enable distributed tracing (OpenTelemetry / Jaeger), set the following in the ConfigMap:
global:
  services:
    observability:
      tracing:
        enabled: true
        exporter:
          endpoint: jaeger:4317
          insecure: true
          type: otlp
Troubleshooting#
Router pod is slow to start
The embedding model download can take up to 60 seconds. The startup probe retries for up to 60 minutes (failureThreshold: 360), so the pod will eventually become ready. Watch the logs:
kubectl logs -f deployment/semantic-router -n vllm-semantic-router-system
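The 60-minute figure follows from how Kubernetes startup probes work: the retry budget is roughly failureThreshold multiplied by the probe period. The period is not stated in this guide; the value of 10 seconds below is an assumption inferred from the 60-minute total, so check periodSeconds in semantic-router.yaml before relying on it.

```python
def startup_budget_seconds(failure_threshold, period_seconds):
    """Approximate time a startup probe keeps retrying before the container restarts."""
    return failure_threshold * period_seconds


failure_threshold = 360  # from the troubleshooting text above
period_seconds = 10      # assumption: implied by the 60-minute figure, not stated

budget = startup_budget_seconds(failure_threshold, period_seconds)
print(budget / 60)  # 60.0 (minutes)
```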
Envoy can’t reach the router
Verify the EnvoyPatchPolicy was accepted:
kubectl describe envoypatchpolicy ai-gateway-prepost-extproc-patch-policy -n aibrix-system
Check that the router Service is reachable:
kubectl get svc -n vllm-semantic-router-system
All requests go to the fallback model
The other_decision (priority 5) catches any prompt that doesn’t match a known domain. Check whether the domain embeddings are loaded by querying the classification API directly:
kubectl port-forward -n vllm-semantic-router-system deployment/semantic-router 9080:8080
curl http://localhost:9080/classify \
-H "Content-Type: application/json" \
-d '{"text": "What is the integral of sin(x)?"}'
Config changes aren’t taking effect
The router reads config at startup. After applying a new ConfigMap, rolling-restart the deployment:
kubectl rollout restart deployment/semantic-router -n vllm-semantic-router-system
Sample Files#
All manifests for this example are in the AIBrix repository:
samples/semantic-router/README.md — quick-start guide
samples/semantic-router/DESIGN.md — deep-dive architecture reference
samples/semantic-router/semantic-router-configmap.yaml — full routing configuration with all 15 rules
samples/semantic-router/semantic-router.yaml — router Deployment, Services, and RBAC
samples/semantic-router/gwapi-resources.yaml — EnvoyPatchPolicy that wires the router into the gateway
samples/semantic-router/models/ — model Deployment and Service manifests