Nebula routes LLM calls through two independent knobs: llm.completion.mode and llm.embedding.mode, each external | inCluster. The default for both is external (a cloud OpenAI-compatible endpoint). When that is not acceptable for one or both roles (air-gapped environments, strict latency budgets, regulatory requirements, or just embedding-volume cost), the chart ships a vLLM sub-chart that runs the corresponding model inside Kubernetes.
When to enable in-cluster vLLM
In-cluster vLLM makes sense when:
- Air-gapped deployment: no internet egress is permitted; model weights are pre-loaded from a private registry or bundled artifact.
- Latency budget: round-trip to a public API endpoint (OpenAI, Azure OpenAI, Bedrock) is too slow for your p95 requirement.
- Data residency: regulatory or contractual requirements prohibit sending user data off-premises.
If none of these apply, leaving both completion.mode and embedding.mode at external is simpler, cheaper, and easier to maintain. A common middle ground is external completion + in-cluster embedding: completions stay on a frontier cloud model, while embeddings move in-cluster because the volume dominates retrieval workloads and the model fits on a CPU node.
Architecture
The vLLM sub-chart is gated behind llm.inCluster.enabled. When enabled, the chart renders:
- Two vLLM Deployments, one per profile. The instruction profile serves the chat/completions model; the embedding profile serves the text-embedding model.
- Two Kubernetes Services, vllm-instruction and vllm-embedding, that Nebula’s API and worker pods resolve in-cluster.
- Endpoint env vars on the API + worker pods: NEBULA_LLM_VLLM_API_BASE points at http://vllm-instruction.<namespace>.svc:8000/v1 (completions); NEBULA_EMBEDDING_VLLM_API_BASE points at http://vllm-embedding.<namespace>.svc:8000/v1 (embeddings, consumed by core/base/providers/embedding.py); NEBULA_CONFIG_NAME=onprem_local selects the in-image TOML profile for everything else.
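The endpoint naming described above can be sketched in a few lines of Python. This is only an illustration of the URL pattern, not the chart's template code: the function names `vllm_service_url` and `in_cluster_endpoint_env` are hypothetical, and `nebula` is just an example namespace.

```python
def vllm_service_url(profile_name: str, namespace: str, port: int = 8000) -> str:
    """Build the in-cluster base URL for a vLLM profile's Service."""
    return f"http://vllm-{profile_name}.{namespace}.svc:{port}/v1"


def in_cluster_endpoint_env(namespace: str, completion: bool, embedding: bool) -> dict:
    """Return the endpoint env vars for the roles that run in-cluster.

    Hypothetical helper: it mirrors the env var names from the text,
    assuming the default profile names "instruction" and "embedding".
    """
    env = {}
    if completion:
        env["NEBULA_LLM_VLLM_API_BASE"] = vllm_service_url("instruction", namespace)
    if embedding:
        env["NEBULA_EMBEDDING_VLLM_API_BASE"] = vllm_service_url("embedding", namespace)
    return env
```

Calling `in_cluster_endpoint_env("nebula", completion=False, embedding=True)` yields only the embedding variable, which corresponds to the external completion + in-cluster embedding setup.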
The vllm.profiles list drives the sub-chart. Each profile is a named dict with enabled, model (Hugging Face repo), task ("" for chat/completions, "embed" for embeddings; this is what the parent chart inspects to validate the in-cluster topology), image, resource requests, gpu.enabled, nodeSelector, tolerations, persistence, and the autoscaling / pdb / networkPolicy toggles. See helm/charts/vllm/values.yaml for the full schema with inline comments.
Per-role mode + profile mapping. When llm.completion.mode: inCluster, an instruction profile (task: "" or "generate") with enabled: true is required in vllm.profiles; the chart points NEBULA_LLM_VLLM_API_BASE at vllm-instruction and fails loudly at template time if the profile is missing. When llm.embedding.mode: inCluster, an embedding profile (task: "embed") is required for the same reason (NEBULA_EMBEDDING_VLLM_API_BASE → vllm-embedding). Either role can be flipped independently; the umbrella llm.inCluster.enabled gate must be true if either role is in-cluster (the schema enforces this).
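The validation rules in this paragraph can be expressed as a small pre-flight check. The sketch below is an assumption-laden approximation written in Python, not the chart's actual template or schema logic; `validate_topology` and `enabled_profiles` are hypothetical names, and the `values` dict is assumed to mirror the layout of the Helm values described above.

```python
INSTRUCTION_TASKS = {"", "generate"}
EMBEDDING_TASKS = {"embed"}


def enabled_profiles(values: dict, tasks: set) -> list:
    """Return the enabled vllm.profiles entries whose task is in `tasks`."""
    profiles = values.get("vllm", {}).get("profiles", [])
    return [p for p in profiles if p.get("enabled") and p.get("task", "") in tasks]


def validate_topology(values: dict) -> None:
    """Raise ValueError when an in-cluster role has no matching profile."""
    llm = values.get("llm", {})
    completion_in = llm.get("completion", {}).get("mode", "external") == "inCluster"
    embedding_in = llm.get("embedding", {}).get("mode", "external") == "inCluster"
    gate = bool(llm.get("inCluster", {}).get("enabled", False))

    # The umbrella gate must be on if either role runs in-cluster.
    if (completion_in or embedding_in) and not gate:
        raise ValueError("llm.inCluster.enabled must be true when any role is inCluster")
    if completion_in and not enabled_profiles(values, INSTRUCTION_TASKS):
        raise ValueError('llm.completion.mode is inCluster but no enabled profile has task "" or "generate"')
    if embedding_in and not enabled_profiles(values, EMBEDDING_TASKS):
        raise ValueError('llm.embedding.mode is inCluster but no enabled profile has task "embed"')
```

Running a check like this against merged values before `helm upgrade` surfaces the same class of error the chart reports at template time, just earlier in a pipeline.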
Profile sizing reference
| Profile | Role | Node type | CPU | Memory | GPU |
|---|---|---|---|---|---|
| instruction | Chat / completions (Qwen3.5-9B default) | GPU node — g5.xlarge / g6.xlarge (A10G or L4, 24 GB VRAM). T4 (g4dn) is undersized for the 9B model. | 4 | 16 GB | 1 GPU (24 GB VRAM) |
| embedding | Text embeddings (BGE-small-en-v1.5 default) | CPU-only (c7i.2xlarge / similar) | 4 | 8 GB | none |
To fit a smaller GPU, point profile.model at a quantized variant (Alibaba ships FP8 checkpoints for the larger Qwen3.5 sizes; AWQ builds for Qwen3.5-9B are typically community repos like QuantTrio/Qwen3.5-9B-AWQ, so verify the source before relying on one) and set profile.servedAs: Qwen/Qwen3.5-9B so Nebula’s TOML model name still resolves.
Qwen3.5 enables thinking mode by default. Add extraArgs: ["--reasoning-parser", "qwen3"] on the instruction profile so vLLM parses the <think>...</think> blocks into structured response fields rather than streaming them as raw text. The default in-cluster overlay sets this already.
Adjust the CPU and memory figures in the sizing table via vllm.profiles[*].resources.requests in your overlay values file.
HuggingFace token provisioning
Most open-weight models supported by vLLM do not require a HuggingFace access token:
- Qwen/Qwen3.5-9B: publicly available, no token required
- BAAI/bge-small-en-v1.5: publicly available, no token required
For gated models (e.g. meta-llama/Meta-Llama-3-8B-Instruct), you need to provision an HF_TOKEN secret. Create a Kubernetes Secret holding the token in the release namespace.
Then reference that Secret under global.hfToken in your values. The chart injects HF_TOKEN into every profile’s pod via valueFrom.secretKeyRef (so the operator can name the Secret key independently of the env var).
Enabling in-cluster vLLM on EKS
The bundle ships a single EKS overlay at helm/examples/eks/values-vllm-inCluster.yaml that enables both the instruction and embedding profiles. Stack it on top of the base values file by passing it as an additional -f flag after the base file.
The overlay sets llm.completion.mode and llm.embedding.mode to inCluster and enables both profiles (Qwen3.5-9B on GPU + BGE-small-en-v1.5 on CPU). Override per-profile fields in your own -f my-values.yaml after the overlay; pass the same -f flags to helm upgrade.
All four mode combinations are now first-class. Two new TOML profiles ship in core/configs/:
- onprem_external_completion_local_embedding: pairs provider=openai for completions with provider=vllm (BGE-small, 384) for embeddings. Auto-selected when llm.completion.mode: external + llm.embedding.mode: inCluster.
- onprem_local_completion_external_embedding: pairs provider=vllm (Qwen3.5-9B) for completions with provider=openai (text-embedding-3-large, 3072) for embeddings. Auto-selected when llm.completion.mode: inCluster + llm.embedding.mode: external.
The chart maps (completion.mode, embedding.mode) → NEBULA_CONFIG_NAME automatically. Customers who want a different TOML profile within the chart’s hardcoded provider families (a different embedding model or dimension, different per-task LLM picks, different concurrency settings, all while keeping openai/... strings on external roles and vllm/... strings on in-cluster roles) override it via llm.inCluster.configName. Model and dimension overrides flow through env vars (NEBULA_LLM_<provider>_MODEL, NEBULA_EMBEDDING_<provider>_MODEL, NEBULA_EMBEDDING_<provider>_DIMENSION); the chart emits them when llm.<role>.model / llm.embedding.dimension are set. Embedding dimension is part of catalog identity: the chart fails at template time if llm.embedding.model is set without llm.embedding.dimension, and the runtime mirrors the same check in EmbeddingConfig’s Pydantic @model_validator(mode='after') (moved from EmbeddingProvider.__init__ so catalogctl + the DB provider see the override too).
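The model/dimension rule can be illustrated with a short Python sketch. It is not the chart template or the real EmbeddingConfig validator: `embedding_override_env` is a hypothetical helper, the upper-casing of the provider name is inferred from the NEBULA_EMBEDDING_OPENAI_* / _VLLM_* families mentioned in this document, and allowing a dimension without a model is an assumption.

```python
from typing import Optional


def embedding_override_env(provider: str,
                           model: Optional[str] = None,
                           dimension: Optional[int] = None) -> dict:
    """Return the embedding override env vars for one provider family.

    Mirrors the rule from the text: a model override without a
    dimension is rejected, because dimension is part of catalog identity.
    """
    if model is not None and dimension is None:
        raise ValueError("llm.embedding.model is set without llm.embedding.dimension")

    prefix = f"NEBULA_EMBEDDING_{provider.upper()}"
    env = {}
    if model is not None:
        env[f"{prefix}_MODEL"] = model
    if dimension is not None:
        # Env var values are always strings.
        env[f"{prefix}_DIMENSION"] = str(dimension)
    return env
```

For example, `embedding_override_env("vllm", "BAAI/bge-small-en-v1.5", 384)` produces the `NEBULA_EMBEDDING_VLLM_MODEL` and `NEBULA_EMBEDDING_VLLM_DIMENSION` pair, while omitting the dimension raises before anything is rendered.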
Non-OpenAI external providers are NOT supported via configName alone. The chart hardcodes NEBULA_LLM_OPENAI_* / NEBULA_EMBEDDING_OPENAI_* env-var families for mode: external (and the _VLLM_* family for mode: inCluster). Pointing mode: external at a customer-hosted vLLM endpoint or other non-OpenAI provider requires either a sidecar / post-render step that injects the matching NEBULA_LLM_VLLM_* env vars, or a chart fork — overriding configName to a TOML with vllm/... model strings would render successfully but route to env vars the chart never emits, and every completion would fail at runtime.
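One way to catch this mismatch before deploying is to compare the model strings in a custom TOML against the env-var family the chart emits for each role's mode. The Python sketch below assumes model strings carry a `provider/` prefix as in the `openai/...` and `vllm/...` examples above; `mismatched_models` is a hypothetical helper and nothing like it is known to ship with the chart.

```python
# Env-var family the chart emits per mode, as described in the text.
ENV_FAMILY_BY_MODE = {"external": "OPENAI", "inCluster": "VLLM"}


def mismatched_models(mode: str, model_strings: list) -> list:
    """Return the model strings whose provider prefix does not match `mode`."""
    expected_prefix = ENV_FAMILY_BY_MODE[mode].lower() + "/"
    return [m for m in model_strings if not m.startswith(expected_prefix)]
```

A non-empty result for a role means the TOML would route that role to env vars the chart never emits, which is the runtime failure described above.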
Wiring to Nebula
When either role is inCluster, the chart:
- Sets NEBULA_CONFIG_NAME to the auto-picked TOML matching the (completion.mode, embedding.mode) combo (one of full_openai, onprem_local, onprem_external_completion_local_embedding, onprem_local_completion_external_embedding). Customers with a custom TOML override via llm.inCluster.configName.
- For completion.mode: inCluster: sets NEBULA_LLM_VLLM_API_BASE to http://vllm-instruction.<namespace>.svc:8000/v1. Completion routing in core/providers/model_routing.py reads this env override.
- For embedding.mode: inCluster: sets NEBULA_EMBEDDING_VLLM_API_BASE to http://vllm-embedding.<namespace>.svc:8000/v1. EmbeddingConfig’s @model_validator(mode='after') in core/base/providers/embedding.py reads this during config load (before factory.create_database_provider / catalogctl consume config.embedding) and overrides the TOML base_url.
- For external roles: the chart emits the matching env vars from llm.<role>.{apiBase, apiKey, apiKeySecret}.
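The config-name selection in the first bullet amounts to a lookup keyed on the two modes. The Python sketch below is a guess at that mapping rather than the chart's code: the two mixed-mode entries come straight from this document, while assigning full_openai to external/external and onprem_local to inCluster/inCluster is inferred from the profile names. `resolve_config_name` is a hypothetical function name.

```python
from typing import Optional

# (completion.mode, embedding.mode) -> TOML profile name.
CONFIG_BY_MODES = {
    ("external", "external"): "full_openai",    # inferred pairing
    ("inCluster", "inCluster"): "onprem_local",  # inferred pairing
    ("external", "inCluster"): "onprem_external_completion_local_embedding",
    ("inCluster", "external"): "onprem_local_completion_external_embedding",
}


def resolve_config_name(completion_mode: str,
                        embedding_mode: str,
                        config_name_override: Optional[str] = None) -> str:
    """Pick NEBULA_CONFIG_NAME, honoring llm.inCluster.configName if set."""
    if config_name_override:
        return config_name_override
    return CONFIG_BY_MODES[(completion_mode, embedding_mode)]
```

An unknown mode raises `KeyError`, which is a reasonable stand-in for a template-time failure on an invalid value.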
What chart values cover without a custom image: per-role mode (llm.completion.mode / llm.embedding.mode), external-endpoint settings (llm.<role>.{apiBase, apiKey, apiKeySecret}), vLLM resource sizing (vllm.profiles[*].resources / gpu / replicas), node placement (nodeSelector / tolerations / affinity), HF token plumbing (global.hfToken), and per-profile autoscaling/PDB toggles.
Swapping the served weights (e.g. running a community AWQ build of the same model while Nebula still requests Qwen/Qwen3.5-9B) is supported via the servedAs field on each profile. Set both model (the weights repo to load) and servedAs (the name Nebula requests) on the profile.
The chart renders --model <model> --served-model-name <servedAs> on the vLLM Deployment. When servedAs is unset, vLLM advertises model verbatim, which is fine when the new weights repo and the in-image TOML’s name match.
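To make the flag rendering concrete, here is a minimal Python sketch of how a profile could translate into vLLM server arguments. It is not the sub-chart's template: `vllm_args` is a hypothetical name, and omitting `--served-model-name` when servedAs is unset is an assumption consistent with the behavior described above.

```python
def vllm_args(profile: dict) -> list:
    """Translate a vllm.profiles entry into vLLM command-line arguments."""
    args = ["--model", profile["model"]]
    served_as = profile.get("servedAs")
    if served_as:
        # Advertise a stable name so the in-image TOML model name still resolves.
        args += ["--served-model-name", served_as]
    # Pass through any extra flags, e.g. the reasoning parser.
    args += list(profile.get("extraArgs", []))
    return args


# Example profile: community AWQ weights served under the name Nebula requests.
awq_profile = {
    "model": "QuantTrio/Qwen3.5-9B-AWQ",
    "servedAs": "Qwen/Qwen3.5-9B",
    "extraArgs": ["--reasoning-parser", "qwen3"],
}
```

With `awq_profile`, the server loads the AWQ weights but answers requests addressed to `Qwen/Qwen3.5-9B`.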
True architecture swaps (different dimension for embeddings, different model family for completion) still require a custom runtime image with an edited core/configs/onprem_local.toml, plus — for embeddings — a catalog migration to match the new base_dimension. The chart deliberately does not render TOML overlays anymore (eliminates drift; see commit history). For embedding dimension changes specifically, the catalog migration is non-trivial and out of scope here.
Troubleshooting
Model download stuck — vLLM pod in Init or CrashLoopBackOff
vLLM downloads model weights from HuggingFace Hub on first boot. For a 7B model this is 13-15 GB and can take 10-20 minutes on a fresh node with no cache. Check pod logs with kubectl -n nebula logs deploy/vllm-instruction. If you see requests.exceptions.ConnectionError, the node has no HuggingFace egress: either allow outbound HTTPS to huggingface.co or pre-bake the weights into the Docker image. If you see 401 Unauthorized, provision the HF_TOKEN secret as described above.
GPU node not scheduling — vLLM instruction pod stays Pending
Check kubectl describe pod <vllm-instruction-pod> -n nebula for the Pending reason. Common causes: (a) no node with the required GPU resource (nvidia.com/gpu: 1) is in the cluster, so verify the GPU node pool exists and the NVIDIA device plugin is installed; (b) the NodePool or node selector in the profile doesn’t match the GPU node’s labels, so confirm the node has the label expected by nodeSelector in the profile; (c) the GPU node has a taint (llm-workload=true:NoSchedule on EKS) and the profile’s tolerations block is missing or mismatched.
Embedding endpoint timing out — worker logs show connection refused
Verify the vllm-embedding Service is pointing at a Running pod: kubectl -n nebula get endpoints vllm-embedding. If the endpoint list is empty, the embedding Deployment is not ready; check kubectl -n nebula describe deploy vllm-embedding and the pod logs. The chart derives the embedding Service name from the profile’s name field (the sub-chart renders vllm-<profile.name>); the default profile is named embedding, so the Service is vllm-embedding. Custom profile names work transparently as long as exactly one enabled profile has task: embed.
Embedding env override not applied — Nebula uses TOML default URL
Check that NEBULA_EMBEDDING_VLLM_API_BASE is set on the API and worker pods: kubectl -n nebula exec deploy/nebula-api -- env | grep NEBULA_EMBEDDING. If absent, llm.embedding.mode may not be set to inCluster. If present but the embedding provider logs still show the TOML default URL, check the API/worker pod logs for the "Embedding base_url overridden from env" info line; its absence means the runtime image is older than the env-override change in core/base/providers/embedding.py. Rebuild or pull a newer image.
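The decision steps above can be captured in a small triage helper. This Python sketch assumes you have already collected the pod's environment (for example from the `env` output) and its logs as plain text; `diagnose_embedding_override` is a hypothetical function, and the returned messages are illustrative wording rather than anything Nebula prints.

```python
OVERRIDE_LOG_LINE = "Embedding base_url overridden from env"


def diagnose_embedding_override(env: dict, pod_logs: str) -> str:
    """Walk the troubleshooting steps for a missing embedding URL override."""
    if "NEBULA_EMBEDDING_VLLM_API_BASE" not in env:
        return "env var absent: check that llm.embedding.mode is inCluster"
    if OVERRIDE_LOG_LINE not in pod_logs:
        return "env var set but override not logged: runtime image predates the env-override change"
    return "override applied"
```

Feeding it the parsed environment and recent logs of an API or worker pod gives a quick first answer before digging into the image version.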