This guide covers a production (or evaluation) Nebula install on on-premises Kubernetes: bare-metal clusters, VMware Tanzu, OpenStack, or any CNCF-conformant cluster without a public cloud identity layer (no IRSA, no Workload Identity). Secrets are managed inline or via a private vault. Storage is local-path, Longhorn, or TopoLVM.
Prereqs
Cluster
- Kubernetes 1.26+ (matches the chart’s kubeVersion minimum)
- kubectl access with permission to create namespaces, Deployments, StatefulSets, PVCs, and Ingresses
Addons + controllers
| Component | Purpose | Notes |
|---|---|---|
| ingress-nginx | HTTP/HTTPS ingress | kubernetes.github.io/ingress-nginx |
| cert-manager | TLS from Let’s Encrypt or internal CA | cert-manager.io/docs |
| Local storage provisioner | PVCs for graph-engine, compactor, Postgres, RabbitMQ | local-path (k3s default), Longhorn, or TopoLVM |
| External Secrets Operator (optional) | Sync from HashiCorp Vault or other backend | Only needed if you have a private secrets store |
By default, storageClass.name is empty. For on-prem clusters that ship with local-path (k3s, RKE2), leave storageClass.name unset. For Longhorn or TopoLVM, set storageClass.name to the provisioner’s class name (e.g. longhorn or topolvm-provisioner).
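As a minimal sketch, a Longhorn-backed install might set the class in the values file as follows. The class name `longhorn` is the example given above; confirm the actual name on your cluster with `kubectl get storageclass` before using it.

```yaml
# your-values.yaml (fragment)
# Assumption: a StorageClass named "longhorn" already exists on the cluster.
storageClass:
  name: longhorn
```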
TLS: if your cluster is internal-only and you have a corporate CA, configure cert-manager with a ClusterIssuer backed by your CA’s private key. For clusters with internet access, use the standard Let’s Encrypt ACME ClusterIssuer.
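For the corporate CA case, a CA-backed ClusterIssuer could look roughly like the sketch below. The issuer name `corporate-ca` and the Secret name `corporate-ca-key-pair` are hypothetical. The Secret must hold the CA certificate and key under `tls.crt` and `tls.key`, and for a ClusterIssuer it must live in cert-manager's cluster resource namespace (`cert-manager` by default).

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: corporate-ca            # hypothetical issuer name
spec:
  ca:
    # Hypothetical Secret containing the CA's tls.crt and tls.key.
    secretName: corporate-ca-key-pair
```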
Postgres
For evaluation, the chart ships a single-replica Postgres StatefulSet (postgres.mode: bundled). This is safe for testing but not for production: the bundled StatefulSet has no HA, no automated backup, and no streaming replication. For production, provision an external PostgreSQL 16 server with pgvector enabled, then point postgres.mode: external at it.
Install
1. Load images from the bundle
2. Provision secrets
Option A: inline Kubernetes Secrets (simplest, not recommended for production). Use secrets.backend: raw and put secret values directly in your values file.
Option B: External Secrets Operator with HashiCorp Vault. Create a ClusterSecretStore pointing at your Vault instance, then use secrets.backend: eso-vault.
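A ClusterSecretStore for Vault might look like the following sketch. It assumes Vault's Kubernetes auth method and a KV v2 engine mounted at `secret`. The store name, Vault URL, role, and service account are all hypothetical, and the `apiVersion` depends on your External Secrets Operator release (newer releases serve `external-secrets.io/v1`). The store name the chart expects is not stated here, so check the reference values file.

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend                 # hypothetical store name
spec:
  provider:
    vault:
      server: "https://vault.example.internal:8200"   # hypothetical Vault URL
      path: "secret"                  # assumed KV mount path
      version: "v2"                   # assumed KV engine version
      auth:
        kubernetes:
          mountPath: "kubernetes"     # assumed auth mount in Vault
          role: "nebula"              # hypothetical Vault role
          serviceAccountRef:
            name: "external-secrets"        # hypothetical service account
            namespace: "external-secrets"   # required for cluster-scoped stores
```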
3. Copy + fill the reference values file
The bundle ships helm/examples/onprem/values.yaml with sensible on-prem defaults (bundled Postgres for evaluation, local-path storage, nginx ingress, raw secrets). Copy it, fill in the <placeholder> markers (domain name, object storage endpoint, LLM API base), and save as your-values.yaml.
For a production on-prem install with external Postgres:
- Set postgres.mode: external and fill in .host, .port, .database, and .credentialsSecret
- Set hatchetPostgres.mode: external similarly
- Remove or comment out the bundled Postgres persistence blocks
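The external-Postgres settings above could be sketched as the following values fragment. The key names come from the list above; the hostnames, database names, and Secret names are hypothetical placeholders, and the exact keys inside each credentials Secret are not specified here.

```yaml
# your-values.yaml (fragment)
postgres:
  mode: external
  host: pg.example.internal            # hypothetical hostname
  port: 5432
  database: nebula                     # hypothetical database name
  credentialsSecret: nebula-postgres-credentials   # hypothetical Secret name

hatchetPostgres:
  mode: external
  host: pg.example.internal            # hypothetical; may be a separate server
  port: 5432
  database: hatchet                    # hypothetical database name
  credentialsSecret: hatchet-postgres-credentials  # hypothetical Secret name
```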
4. Object storage
On-premises S3-compatible object storage options:
- MinIO (recommended for simplicity): run MinIO alongside the cluster or as a StatefulSet inside it. Set objectStorage.endpoint: http://minio.minio.svc:9000, forcePathStyle: true, and store MinIO root credentials in objectStorage.credentialsSecret.
- Ceph RGW: configure Ceph’s Rados Gateway. Set the RGW endpoint, region (or empty string), and HMAC credentials.
- Cloudflare R2 / Wasabi: external but S3-compatible. Set the appropriate endpoint; forcePathStyle depends on the provider.
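For the MinIO option, the values fragment might look like this sketch. The endpoint is the one given above. Placing `forcePathStyle` under `objectStorage` is an assumption, and the Secret name is hypothetical, so compare against the reference values file.

```yaml
# your-values.yaml (fragment)
objectStorage:
  endpoint: http://minio.minio.svc:9000
  forcePathStyle: true                 # assumed to sit under objectStorage
  credentialsSecret: nebula-minio-credentials   # hypothetical Secret name
```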
5. Install
Migrations run in a Job (<release>-nebula-migrations-<revision>); API and worker pods gate startup on an init container that polls public.nebula_release_contract for the install’s release row. releaseContract.releaseId and releaseContract.gitSha are stamped by bundle.sh and consumed automatically.
6. Verify
Upgrade
Pull the new bundle, load/push new images, then:
Sizing reference
| Workload | Starter | When to scale |
|---|---|---|
| API | 2 replicas, 1 CPU / 2-4 GB | HPA on CPU >70% sustained |
| Worker | 2 replicas, 2 CPU / 4-8 GB | HPA on queue depth (Hatchet metric) |
| Graph engine | 2 replicas, 2 CPU / 4-8 GB | Manual; restart-sensitive (WAL replay) |
| Compactor | 1 replica, 1 CPU / 2-4 GB | Single-writer; do not scale horizontally |
| RabbitMQ | 1 replica, 8 GB PVC | Single-broker is fine up to ~10k workflows/min |
Setting replicas: 1 on all workloads and using postgres.mode: bundled keeps the footprint under 16 GB RAM total.
For production deploys the bundle ships a shared sizing overlay at helm/examples/_common/production-sizing.yaml (the same overlay used by EKS/AKS/GKE). Stack it before your on-prem values file to get production-shape replicas and resource requests:
Pod Security Admission
The Nebula-built workloads (api, worker, graph-engine, graph-engine-compactor, the migration Job, and the vLLM sub-chart Deployments) comply with the restricted Pod Security Standard out of the box: non-root user, dropped capabilities, seccompProfile: RuntimeDefault, no privilege escalation.
The bundled third-party dependencies — postgres-statefulset (postgres.mode: bundled), hatchet-postgres-statefulset, hatchet-engine-deployment, hatchet-rabbitmq-statefulset — inherit their upstream images’ default security contexts and do not carry the restricted-required fields today. Labeling the release namespace as restricted before validating these pods can reject them at admission time.
Recommended approach:
- For production deployments, set postgres.mode: external and hatchetPostgres.mode: external so the bundled StatefulSets are not rendered at all. The chart’s external-mode path doesn’t ship the third-party deps; you bring your own (compliant) Postgres + Hatchet.
- If you need the bundled deps for evaluation, label the namespace at baseline rather than restricted (or use PSA’s warn/audit modes to surface the issues without blocking install).
- Only enable restricted enforcement after validating each bundled dep’s security context against your cluster’s policy.
The chart does not create or label the namespace itself: Helm’s --create-namespace does not own pre-existing namespaces reliably, and adding namespace ownership to the chart conflicts with operators who manage namespaces separately (GitOps, vCluster, kiosk, etc.).
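For an evaluation install with the bundled dependencies, a namespace managed outside the chart might carry labels like the sketch below. It enforces `baseline` while surfacing `restricted` violations as warnings and audit events. The namespace name `nebula` follows the kubectl examples in this guide; adjust it to your release namespace.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: nebula
  labels:
    # Block only pods that violate the baseline standard.
    pod-security.kubernetes.io/enforce: baseline
    # Report restricted violations without rejecting the pods.
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```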
Prometheus metrics
Pods expose Prometheus-compatible /metrics endpoints and carry prometheus.io/scrape: "true" annotations for clusters that use annotation-based scrape discovery. For clusters running prometheus-operator / kube-prometheus-stack, enable native ServiceMonitor objects.
ServiceMonitor requires the monitoring.coreos.com/v1 CRD; rendering it on a cluster without prometheus-operator fails helm install with no matches for kind "ServiceMonitor".
Troubleshooting
PVCs stuck in Pending — no storage class available
Check that a storage class exists and is set as default: kubectl get storageclass. If using local-path, the provisioner must be running: kubectl -n local-path-storage get pods. Set storageClass.name in your values file to the exact class name if there is no cluster default.
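A cluster default is marked by an annotation on the StorageClass. The sketch below mirrors the class that local-path-provisioner ships upstream, with the default annotation added. Treat the provisioner name and binding mode as assumptions and compare them with what `kubectl get storageclass local-path -o yaml` reports on your cluster.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-path
  annotations:
    # Marks this class as the cluster default for PVCs with no class set.
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rancher.io/local-path      # assumed upstream provisioner name
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete
```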
API pods fail to connect to bundled Postgres
On a fresh install with postgres.mode: bundled, the Postgres StatefulSet must be ready before the API Deployment. Check kubectl -n nebula get pods; the Postgres pod must be Running before API pods reach Ready. The chart renders a readiness probe on the API that retries for 5 minutes, which is usually enough for bundled Postgres to start. If the pod restarts before Postgres is ready, describe the pod for the specific connect error.
cert-manager fails to issue certificate — ACME challenge not reachable
The Let’s Encrypt ACME HTTP-01 challenge requires the domain to be publicly reachable. For internal-only clusters, either use a DNS-01 challenge (configure cert-manager with DNS provider credentials) or provision certificates from a corporate CA ClusterIssuer. The ingress.tls.secretName in your values file must match the secretName of the Certificate resource, that is, the Secret cert-manager will populate.
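To illustrate the name matching, a Certificate issued by a corporate CA might look like the sketch below. The hostname, the issuer name `corporate-ca`, and the Secret name `nebula-tls` are hypothetical. What matters is that `spec.secretName` equals the value you set for ingress.tls.secretName.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nebula-tls
  namespace: nebula
spec:
  # cert-manager writes the issued key pair into this Secret;
  # ingress.tls.secretName should reference the same name.
  secretName: nebula-tls
  dnsNames:
    - nebula.example.internal          # hypothetical hostname
  issuerRef:
    name: corporate-ca                 # hypothetical ClusterIssuer
    kind: ClusterIssuer
```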
Graph-engine startup slow after node restart
The graph-engine replays its WAL on startup; duration scales with segment count and is expected. A single-node cluster that reboots may take 30-120 seconds per replica before graph-engine is fully Ready. Add initialDelaySeconds: 120 to the graph-engine readiness probe via workloads.graphEngine overrides if the default timeouts are too tight for your node restart time.
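The override might be expressed roughly as below. Only `workloads.graphEngine` and `initialDelaySeconds: 120` come from the text above. The `readinessProbe` key is an assumption about how the chart exposes probe settings, so verify it against the chart's values reference before relying on it.

```yaml
# your-values.yaml (fragment)
workloads:
  graphEngine:
    readinessProbe:            # assumed key name; check the chart's values
      initialDelaySeconds: 120
```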