In-cluster vLLM

Nebula routes LLM calls through two independent knobs, llm.completion.mode and llm.embedding.mode, each set to either external or inCluster. Both default to external (a cloud OpenAI-compatible endpoint). When that is not acceptable for one or both roles (air-gapped environments, strict latency budgets, regulatory requirements, or simply the cost of embedding volume), the chart ships a vLLM sub-chart that runs the corresponding model inside Kubernetes.

When to enable in-cluster vLLM

In-cluster vLLM makes sense when:

Air-gapped deployment: no internet egress is permitted; model weights are pre-loaded from a private registry or bundled artifact.
Latency budget: round-trip to a public API endpoint (OpenAI, Azure OpenAI, Bedrock) is too slow for your p95 requirement.
Data residency: regulatory or contractual requirements prohibit sending user data off-premises.

For most customers with internet access, leaving both completion.mode and embedding.mode at external is simpler, cheaper, and easier to maintain. A common middle ground is external completion + in-cluster embedding — completions stay on a frontier cloud model, embeddings move in-cluster because the volume dominates retrieval workloads and the model fits on a CPU node.

Architecture

The vLLM sub-chart is gated behind llm.inCluster.enabled. When enabled, the chart renders:

Two vLLM Deployments — one per profile. The instruction profile serves the chat/completions model; the embedding profile serves the text-embedding model.
Two Kubernetes Services — vllm-instruction and vllm-embedding — that Nebula’s API and worker pods resolve in-cluster.
Endpoint env vars on the API + worker pods — NEBULA_LLM_VLLM_API_BASE points at http://vllm-instruction.<namespace>.svc:8000/v1 (completions); NEBULA_EMBEDDING_VLLM_API_BASE points at http://vllm-embedding.<namespace>.svc:8000/v1 (embeddings, consumed by core/base/providers/embedding.py); NEBULA_CONFIG_NAME=onprem_local selects the in-image TOML profile for everything else.

Profile model: the vllm.profiles list drives the sub-chart. Each profile is a named dict with enabled, model (Hugging Face repo), task ("" for chat/completions, "embed" for embeddings; this is what the parent chart inspects to validate the in-cluster topology), image, resource requests, gpu.enabled, nodeSelector, tolerations, persistence, and the autoscaling / pdb / networkPolicy toggles. See helm/charts/vllm/values.yaml for the full schema with inline comments.

Per-role mode + profile mapping: when llm.completion.mode is inCluster, an instruction profile (task: "" or "generate") with enabled: true is required in vllm.profiles. The chart points NEBULA_LLM_VLLM_API_BASE at vllm-instruction and fails loudly at template time if the profile is missing. When llm.embedding.mode is inCluster, an embedding profile (task: "embed") is required for the same reason (NEBULA_EMBEDDING_VLLM_API_BASE → vllm-embedding). Either role can be flipped independently; the umbrella llm.inCluster.enabled gate must be true if either role is in-cluster (the schema enforces this).
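The mode and profile rules above can be sketched as a small pre-flight check to run against a merged values dict before helm install. This is an illustration of the rules as described here, not the chart's actual template logic; the function name validate_topology and the error strings are hypothetical.

```python
from typing import Any

# Task values the text lists for the chat/completions role.
INSTRUCTION_TASKS = {"", "generate"}
EMBEDDING_TASK = "embed"


def validate_topology(values: dict[str, Any]) -> list[str]:
    """Return a list of error strings; an empty list means the values pass."""
    llm = values.get("llm", {})
    profiles = values.get("vllm", {}).get("profiles", [])
    enabled = [p for p in profiles if p.get("enabled")]

    completion_in = llm.get("completion", {}).get("mode", "external") == "inCluster"
    embedding_in = llm.get("embedding", {}).get("mode", "external") == "inCluster"
    errors: list[str] = []

    # The umbrella gate must be on if either role runs in-cluster.
    if (completion_in or embedding_in) and not llm.get("inCluster", {}).get("enabled"):
        errors.append("llm.inCluster.enabled must be true when any role is inCluster")
    # Each in-cluster role needs a matching enabled profile.
    if completion_in and not any(p.get("task", "") in INSTRUCTION_TASKS for p in enabled):
        errors.append("llm.completion.mode=inCluster needs an enabled instruction profile")
    if embedding_in and not any(p.get("task", "") == EMBEDDING_TASK for p in enabled):
        errors.append("llm.embedding.mode=inCluster needs an enabled embedding profile")
    return errors
```

Running it on the external completion + in-cluster embedding middle ground with only an embedding profile enabled returns an empty list; flipping completion to inCluster without adding an instruction profile returns one error.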

Profile sizing reference

Profile	Role	Node type	CPU	Memory	GPU
instruction	Chat / completions (Qwen3.5-9B default)	GPU node — g5.xlarge / g6.xlarge (A10G or L4, 24 GB VRAM). T4 (g4dn) is undersized for the 9B model.	4	16 GB	1 GPU (24 GB VRAM)
embedding	Text embeddings (BGE-small-en-v1.5 default)	CPU-only (c7i.2xlarge / similar)	4	8 GB	none

The embedding model fits comfortably on a CPU-only node; running it on a GPU wastes capacity. The instruction model requires at least one GPU with ~20 GB VRAM at FP16, so Qwen3.5-9B is too large for a 16 GB T4. Adjust the CPU and memory figures in the table via vllm.profiles[*].resources.requests in your overlay values file.

For tighter GPU budgets, point profile.model at a quantized variant and set profile.servedAs: Qwen/Qwen3.5-9B so Nebula’s TOML model name still resolves. Alibaba ships FP8 checkpoints for the larger Qwen3.5 sizes; AWQ builds for Qwen3.5-9B are typically community repos like QuantTrio/Qwen3.5-9B-AWQ, so verify the source before relying on one.

Qwen3.5 enables thinking mode by default. Add extraArgs: ["--reasoning-parser", "qwen3"] on the instruction profile so vLLM parses the <think>...</think> blocks into structured response fields rather than streaming them as raw text. The default in-cluster overlay sets this already.
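The VRAM figures above follow from simple arithmetic on weight size. The sketch below estimates the weights alone, assuming a nominal 9 billion parameters (taken from the model name; the exact count may differ) and decimal gigabytes. KV cache, activations, and runtime overhead come on top, which is in line with the ~20 GB figure for FP16.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in decimal GB."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


fp16_gb = weight_gb(9, 16)  # 16-bit weights
fp8_gb = weight_gb(9, 8)    # 8-bit quantized weights
int4_gb = weight_gb(9, 4)   # 4-bit (AWQ-style) quantized weights

# The 16-bit weights alone already exceed a 16 GB card.
fits_t4_at_fp16 = fp16_gb < 16
```

Under these assumptions the 16-bit weights come to about 18 GB, the 8-bit weights to about 9 GB, and the 4-bit weights to about 4.5 GB, which is why a quantized build is the route to smaller GPUs.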

HuggingFace token provisioning

Most open-weight models supported by vLLM do not require a HuggingFace access token:

Qwen/Qwen3.5-9B — publicly available, no token required
BAAI/bge-small-en-v1.5 — publicly available, no token required

If you choose a gated model (e.g. meta-llama/Meta-Llama-3-8B-Instruct), you need to provision an HF_TOKEN secret. Create a Kubernetes Secret in the release namespace:

kubectl -n nebula create secret generic nebula-hf-token \
  --from-literal=HF_TOKEN=hf_...

Then enable token injection at the sub-chart level — the sub-chart adds HF_TOKEN to every profile’s pod via valueFrom.secretKeyRef (so the operator can name the Secret key independently of the env var):

vllm:
  global:
    hfToken:
      enabled: true
      secretName: nebula-hf-token
      secretKey: HF_TOKEN

Enabling in-cluster vLLM on EKS

The bundle ships a single EKS overlay at helm/examples/eks/values-vllm-inCluster.yaml that enables both the instruction and embedding profiles. Stack it on top of the base values file:

helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f helm/examples/eks/values.yaml \
  -f helm/examples/eks/values-vllm-inCluster.yaml

The overlay sets both llm.completion.mode and llm.embedding.mode to inCluster and enables both profiles (Qwen3.5-9B on GPU + BGE-small-en-v1.5 on CPU). Override per-profile fields in your own -f my-values.yaml after the overlay; pass the same -f flags to helm upgrade.

All four mode combinations are now first-class. Two new TOML profiles ship in core/configs/:

onprem_external_completion_local_embedding — pairs provider=openai for completions with provider=vllm (BGE-small, 384) for embeddings. Auto-selected when llm.completion.mode: external + llm.embedding.mode: inCluster.
onprem_local_completion_external_embedding — pairs provider=vllm (Qwen3.5-9B) for completions with provider=openai (text-embedding-3-large, 3072) for embeddings. Auto-selected when llm.completion.mode: inCluster + llm.embedding.mode: external.

The chart maps (completion.mode, embedding.mode) → NEBULA_CONFIG_NAME automatically. Customers who want a different TOML profile within the chart’s hardcoded provider families override it via llm.inCluster.configName. That covers a different embedding model / dimension, different per-task LLM picks, or different concurrency settings, all while keeping openai/... strings on external roles and vllm/... strings on in-cluster roles.

Model + dimension overrides flow through env vars (NEBULA_LLM_<provider>_MODEL, NEBULA_EMBEDDING_<provider>_MODEL, NEBULA_EMBEDDING_<provider>_DIMENSION); the chart emits them when llm.<role>.model / llm.embedding.dimension are set. Embedding dimension is part of catalog identity: the chart fails at template time if llm.embedding.model is set without llm.embedding.dimension, and the runtime mirrors the same check in EmbeddingConfig’s Pydantic @model_validator(mode='after') (moved from EmbeddingProvider.__init__ so catalogctl + the DB provider see the override too).

Non-OpenAI external providers are NOT supported via configName alone. The chart hardcodes the NEBULA_LLM_OPENAI_* / NEBULA_EMBEDDING_OPENAI_* env-var families for mode: external (and the _VLLM_* family for mode: inCluster). Pointing mode: external at a customer-hosted vLLM endpoint or other non-OpenAI provider requires either a sidecar / post-render step that injects the matching NEBULA_LLM_VLLM_* env vars, or a chart fork. Overriding configName to a TOML with vllm/... model strings would render successfully but route to env vars the chart never emits, and every completion would fail at runtime.
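The embedding model/dimension rule described above can be sketched with a plain dataclass. This is an illustrative stand-in, not Nebula's EmbeddingConfig or its Pydantic validator; the class name EmbeddingOverride is hypothetical, and the provider suffix is assumed to be the upper-cased provider name, matching the OPENAI and VLLM families named in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbeddingOverride:
    """Hypothetical stand-in for the llm.embedding.* override values."""

    provider: str                     # "openai" (external) or "vllm" (inCluster)
    model: Optional[str] = None
    dimension: Optional[int] = None

    def __post_init__(self) -> None:
        # Mirror the rule from the text: a model override without a
        # dimension is rejected, because dimension is part of catalog identity.
        if self.model is not None and self.dimension is None:
            raise ValueError("llm.embedding.model requires llm.embedding.dimension")

    def env(self) -> dict[str, str]:
        """Env vars to emit; empty when nothing is overridden."""
        prefix = f"NEBULA_EMBEDDING_{self.provider.upper()}"
        out: dict[str, str] = {}
        if self.model is not None:
            out[f"{prefix}_MODEL"] = self.model
        if self.dimension is not None:
            out[f"{prefix}_DIMENSION"] = str(self.dimension)
        return out
```

For example, EmbeddingOverride("vllm", "BAAI/bge-base-en-v1.5", 768).env() yields the two NEBULA_EMBEDDING_VLLM_* variables, while constructing the override with a model and no dimension raises ValueError.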

Wiring to Nebula

When either role is inCluster, the chart:

Sets NEBULA_CONFIG_NAME to the auto-picked TOML matching the (completion.mode, embedding.mode) combo (one of full_openai, onprem_local, onprem_external_completion_local_embedding, onprem_local_completion_external_embedding). Customers with a custom TOML override via llm.inCluster.configName.
For completion.mode: inCluster: sets NEBULA_LLM_VLLM_API_BASE to http://vllm-instruction.<namespace>.svc:8000/v1. Completion routing in core/providers/model_routing.py reads this env override.
For embedding.mode: inCluster: sets NEBULA_EMBEDDING_VLLM_API_BASE to http://vllm-embedding.<namespace>.svc:8000/v1. EmbeddingConfig’s @model_validator(mode='after') in core/base/providers/embedding.py reads this during config load (before factory.create_database_provider / catalogctl consume config.embedding) and overrides the TOML base_url.
For external roles: the chart emits the matching env vars from llm.<role>.{apiBase, apiKey, apiKeySecret}.
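The wiring steps above can be summarized as one function from the two modes to the env vars set on the API and worker pods. This is a sketch, not the chart's template code: the function name wiring_env is hypothetical, the pairing of full_openai with both roles external and onprem_local with both roles in-cluster is inferred from the list of config names, and the Service names assume the default profile names (instruction, embedding).

```python
from typing import Optional

# (completion.mode, embedding.mode) -> NEBULA_CONFIG_NAME
CONFIG_NAMES = {
    ("external", "external"): "full_openai",    # inferred pairing
    ("inCluster", "inCluster"): "onprem_local",  # inferred pairing
    ("external", "inCluster"): "onprem_external_completion_local_embedding",
    ("inCluster", "external"): "onprem_local_completion_external_embedding",
}


def wiring_env(
    completion_mode: str,
    embedding_mode: str,
    namespace: str,
    config_name: Optional[str] = None,
) -> dict[str, str]:
    """Env vars for in-cluster roles; external-role vars are not modeled here."""
    # llm.inCluster.configName, when given, wins over the automatic pick.
    picked = config_name or CONFIG_NAMES[(completion_mode, embedding_mode)]
    env = {"NEBULA_CONFIG_NAME": picked}
    if completion_mode == "inCluster":
        env["NEBULA_LLM_VLLM_API_BASE"] = (
            f"http://vllm-instruction.{namespace}.svc:8000/v1"
        )
    if embedding_mode == "inCluster":
        env["NEBULA_EMBEDDING_VLLM_API_BASE"] = (
            f"http://vllm-embedding.{namespace}.svc:8000/v1"
        )
    return env
```

Comparing this output with kubectl -n nebula exec deploy/nebula-api -- env is a quick way to confirm that a release rendered the combination you intended.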

No ConfigMap is mounted; the in-image TOML is the source of truth for everything except endpoint URLs, which env vars handle directly. The chart’s supported customization surface is: per-role mode (llm.completion.mode / llm.embedding.mode), external-endpoint settings (llm.<role>.{apiBase, apiKey, apiKeySecret}), vLLM resource sizing (vllm.profiles[*].resources / gpu / replicas), node placement (nodeSelector / tolerations / affinity), HF token plumbing (global.hfToken), and per-profile autoscaling/PDB toggles.

Swapping the served weights (e.g. running a community AWQ build of the same model while Nebula still requests Qwen/Qwen3.5-9B) is supported via the servedAs field on each profile. Set both:

vllm:
  profiles:
    - name: instruction
      model: <hf-repo-with-the-actual-weights>     # weights vLLM loads (e.g. a verified AWQ or FP8 build)
      servedAs: Qwen/Qwen3.5-9B                     # name vLLM advertises (== name Nebula requests from the in-image TOML)

The chart renders this as --model <model> --served-model-name <servedAs> on the vLLM Deployment. When servedAs is unset, vLLM advertises model verbatim, which is fine when the new weights repo and the in-image TOML’s name match.

True architecture swaps (different dimension for embeddings, different model family for completion) still require a custom runtime image with an edited core/configs/onprem_local.toml, plus, for embeddings, a catalog migration to match the new base_dimension. That migration is non-trivial and out of scope here. The chart deliberately does not render TOML overlays anymore (eliminates drift; see commit history).
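The model / servedAs rendering described above can be sketched as a function from a profile dict to the vLLM argument list. This mirrors the behavior stated in the text rather than the chart's real template; the function name vllm_args is hypothetical, and appending extraArgs after the model flags is an assumption about ordering.

```python
from typing import Any


def vllm_args(profile: dict[str, Any]) -> list[str]:
    """Build the vLLM server arguments for one profile."""
    args = ["--model", profile["model"]]
    served_as = profile.get("servedAs")
    # Without servedAs, vLLM advertises the --model value itself,
    # so the flag is only added when an alias is requested.
    if served_as:
        args += ["--served-model-name", served_as]
    # Assumed ordering: extra flags follow the model flags.
    args += list(profile.get("extraArgs", []))
    return args


example = vllm_args(
    {
        "name": "instruction",
        "model": "QuantTrio/Qwen3.5-9B-AWQ",  # community build; verify before use
        "servedAs": "Qwen/Qwen3.5-9B",
        "extraArgs": ["--reasoning-parser", "qwen3"],
    }
)
```

Here example loads the AWQ weights but advertises Qwen/Qwen3.5-9B, so the model name Nebula requests from the in-image TOML still resolves.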

Troubleshooting

Model download stuck — vLLM pod in Init or CrashLoopBackOff

vLLM downloads model weights from HuggingFace Hub on first boot. For a 7B model this is 13-15 GB and can take 10-20 minutes on a fresh node with no cache; the default 9B instruction model is larger, so allow more time. Check pod logs with kubectl -n nebula logs deploy/vllm-instruction. If you see requests.exceptions.ConnectionError, the node has no HuggingFace egress: either allow outbound HTTPS to huggingface.co or pre-bake the weights into the Docker image. If you see 401 Unauthorized, provision the HF_TOKEN secret as described above.

GPU node not scheduling — vLLM instruction pod stays Pending

Check kubectl describe pod <vllm-instruction-pod> -n nebula for the Pending reason. Common causes: (a) no node with the required GPU resource (nvidia.com/gpu: 1) is in the cluster — verify the GPU node pool exists and the NVIDIA device plugin is installed; (b) the NodePool or node selector in the profile doesn’t match the GPU node’s labels — confirm the node has the label expected by nodeSelector in the profile; (c) the GPU node has a taint (llm-workload=true:NoSchedule on EKS) and the profile’s tolerations block is missing or mismatched.
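Cause (c) comes down to Kubernetes' toleration matching rules, which are easy to get subtly wrong. The sketch below reimplements those rules for a single toleration/taint pair so a profile's tolerations block can be checked offline; the function name tolerates is hypothetical, and the taint shown is the EKS example from the text.

```python
from typing import Any


def tolerates(toleration: dict[str, Any], taint: dict[str, Any]) -> bool:
    """Return True if this toleration matches this taint."""
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key", "")

    # An empty key with operator Exists matches every taint key.
    if key == "" and operator == "Exists":
        key_ok = True
    else:
        key_ok = key == taint["key"]

    # An empty toleration effect matches every effect.
    effect = toleration.get("effect", "")
    effect_ok = effect == "" or effect == taint["effect"]

    # Exists ignores the value; Equal requires the values to match.
    if operator == "Exists":
        value_ok = True
    else:
        value_ok = toleration.get("value", "") == taint.get("value", "")

    return key_ok and effect_ok and value_ok


gpu_taint = {"key": "llm-workload", "value": "true", "effect": "NoSchedule"}
good = {"key": "llm-workload", "operator": "Equal", "value": "true", "effect": "NoSchedule"}
bad = {"key": "llm-workload", "operator": "Equal", "value": "false", "effect": "NoSchedule"}
```

With these inputs, tolerates(good, gpu_taint) is True and tolerates(bad, gpu_taint) is False; a pod whose tolerations all fail this check against a node's NoSchedule taint will stay Pending.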

Embedding endpoint timing out — worker logs show connection refused

Verify the vllm-embedding Service is pointing at a Running pod: kubectl -n nebula get endpoints vllm-embedding. If the endpoint list is empty, the embedding Deployment is not ready — check kubectl -n nebula describe deploy vllm-embedding and the pod logs. The chart derives the embedding Service name from the profile’s name field (the sub-chart renders vllm-<profile.name>); the default profile is named embedding, so the Service is vllm-embedding. Custom profile names work transparently as long as exactly one enabled profile has task: embed.

Embedding env override not applied — Nebula uses TOML default URL

Check that NEBULA_EMBEDDING_VLLM_API_BASE is set on the API and worker pods: kubectl -n nebula exec deploy/nebula-api -- env | grep NEBULA_EMBEDDING. If absent, llm.embedding.mode may not be set to inCluster. If present but the embedding provider logs still show the TOML default URL, check the API/worker pod logs for the info line "Embedding base_url overridden from env". Its absence means the runtime image is older than the env-override change in core/base/providers/embedding.py. Rebuild or pull a newer image.
