On-premises Kubernetes

This guide covers a production (or evaluation) Nebula install on on-premises Kubernetes — bare-metal clusters, VMware Tanzu, OpenStack, or any CNCF-conformant cluster without a public cloud identity layer (no IRSA, no Workload Identity). Secrets are managed inline or via a private vault. Storage is local-path, Longhorn, or TopoLVM.

Prereqs

Cluster

Kubernetes 1.26+ (matches the chart’s kubeVersion minimum)
kubectl access with permission to create namespaces, Deployments, StatefulSets, PVCs, and Ingresses

Addons + controllers

Component	Purpose	Notes
ingress-nginx	HTTP/HTTPS ingress	kubernetes.github.io/ingress-nginx
cert-manager	TLS from Let’s Encrypt or internal CA	cert-manager.io/docs
Local storage provisioner	PVCs for graph-engine, compactor, Postgres, RabbitMQ	local-path (k3s default), Longhorn, or TopoLVM
External Secrets Operator (optional)	Sync from HashiCorp Vault or other backend	Only needed if you have a private secrets store

Storage class: the chart defaults to the cluster’s default storage class when storageClass.name is empty. For on-prem clusters that ship with local-path (k3s, RKE2), leave storageClass.name unset. For Longhorn or TopoLVM, set storageClass.name to the provisioner’s class name (e.g. longhorn or topolvm-provisioner).

TLS: if your cluster is internal-only and you have a corporate CA, configure cert-manager with a ClusterIssuer backed by your CA’s private key. For clusters with internet access, use the standard Let’s Encrypt ACME ClusterIssuer.
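For the corporate CA case, a minimal sketch of the cert-manager side might look like the following. The issuer name corp-ca, the Secret name corp-ca-key-pair, and the file names are placeholders chosen for illustration, not names the Nebula chart requires. The Secret is assumed to live in the cert-manager namespace, which is where cert-manager looks up ClusterIssuer secrets by default. The script only writes the manifest; the kubectl steps are left as comments to run after review.

```shell
# Write a ClusterIssuer manifest for an internal CA (all names are placeholders).
ISSUER_NAME="${ISSUER_NAME:-corp-ca}"
CA_SECRET_NAME="${CA_SECRET_NAME:-corp-ca-key-pair}"
OUT_FILE="${OUT_FILE:-corp-ca-issuer.yaml}"

cat > "$OUT_FILE" <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: ${ISSUER_NAME}
spec:
  ca:
    # TLS Secret holding the CA certificate (tls.crt) and private key (tls.key).
    secretName: ${CA_SECRET_NAME}
EOF

echo "wrote ${OUT_FILE}"

# After review, on a machine with cluster access:
#   kubectl -n cert-manager create secret tls corp-ca-key-pair --cert=ca.crt --key=ca.key
#   kubectl apply -f corp-ca-issuer.yaml
```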

Postgres

For evaluation, the chart ships a single-replica Postgres StatefulSet (postgres.mode: bundled). This is safe for testing but not for production — the bundled StatefulSet has no HA, no automated backup, and no streaming replication. For production, provision an external PostgreSQL 16 server with pgvector enabled:

CREATE EXTENSION IF NOT EXISTS vector;

and point postgres.mode: external at it.

Install

1. Load images from the bundle

tar -xzf nebula-enterprise-<version>.tar.gz
cd nebula-enterprise-<version>/
sha256sum -c checksums.txt
docker load -i images.tar

For an air-gapped cluster with a private registry, retag and push to your internal registry:

REGISTRY=registry.corp.example.com

docker tag nebula:enterprise-<version>              "${REGISTRY}/nebula/nebula-runtime:<version>"
docker tag nebula-graph-engine:enterprise-<version> "${REGISTRY}/nebula/graph-engine:<version>"
docker push "${REGISTRY}/nebula/nebula-runtime:<version>"
docker push "${REGISTRY}/nebula/graph-engine:<version>"

For third-party images (Hatchet, pgvector, RabbitMQ, busybox), push to the same registry:

docker tag ghcr.io/hatchet-dev/hatchet/hatchet-engine:v0.79.0 "${REGISTRY}/hatchet-engine:v0.79.0"
docker tag pgvector/pgvector:0.8.0-pg16                       "${REGISTRY}/pgvector/pgvector:0.8.0-pg16"
docker tag rabbitmq:3.13-management                           "${REGISTRY}/rabbitmq:3.13-management"
docker tag busybox:1.37.0                                     "${REGISTRY}/busybox:1.37.0"
docker push "${REGISTRY}/hatchet-engine:v0.79.0"
docker push "${REGISTRY}/pgvector/pgvector:0.8.0-pg16"
docker push "${REGISTRY}/rabbitmq:3.13-management"
docker push "${REGISTRY}/busybox:1.37.0"

2. Provision secrets

Option A: inline Kubernetes Secrets (simplest, not recommended for production)

Use secrets.backend: raw and put secret values directly in your values file:

secrets:
  backend: raw
  values:
    OPENAI_API_KEY: "sk-..."
    NEBULA_SECRET_KEY: "<random 32 bytes hex>"
    NEBULA_SERVICE_API_KEY: "<random 32 bytes hex>"
    NEBULA_WEBHOOK_HMAC_SECRET: "<random 32 bytes hex>"
    NEBULA_JWT_PRIVATE_KEY_PEM: |
      -----BEGIN PRIVATE KEY-----
      ...
      -----END PRIVATE KEY-----
    NEBULA_JWT_KID: "<stable per-deployment value>"
    NEBULA_INTERNAL_WAKE_TOKEN: "<random 32 bytes hex>"
    NEBULA_VECTOR_BUILD_HATCHET_TRIGGER_TOKEN: "<random 32 bytes hex>"
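The "<random 32 bytes hex>" placeholders can be filled from any cryptographically random source. The sketch below uses /dev/urandom and od so it needs no extra tooling; openssl rand -hex 32 is an equivalent where OpenSSL is installed. The helper name gen_hex32 is illustrative. NEBULA_JWT_PRIVATE_KEY_PEM and NEBULA_JWT_KID are not covered here because their format is not a plain hex string.

```shell
# Print 32 random bytes as 64 lowercase hex characters (no trailing newline).
gen_hex32() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}

# Emit one line per hex-valued key, indented to paste under secrets.values.
for key in NEBULA_SECRET_KEY NEBULA_SERVICE_API_KEY NEBULA_WEBHOOK_HMAC_SECRET \
           NEBULA_INTERNAL_WAKE_TOKEN NEBULA_VECTOR_BUILD_HATCHET_TRIGGER_TOKEN; do
  printf '    %s: "%s"\n' "$key" "$(gen_hex32)"
done
```

The output contains live secret material, so redirect it to a file with restrictive permissions rather than leaving it in terminal scrollback.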

Option B: HashiCorp Vault via ESO

Install ESO, configure a ClusterSecretStore pointing at your Vault instance, then use secrets.backend: eso-vault:

secrets:
  backend: eso-vault
  esoVault:
    secretStoreRef:
      name: vault-backend
      kind: ClusterSecretStore
    vaultPath: secret/data/nebula
    refreshInterval: 5m
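The secretStoreRef above expects a ClusterSecretStore named vault-backend to exist already. A sketch of one is shown below under several assumptions: a KV v2 engine mounted at secret (consistent with vaultPath: secret/data/nebula), Vault's Kubernetes auth method, and ESO's external-secrets.io/v1beta1 API (newer ESO releases serve external-secrets.io/v1 instead). The server URL, Vault role, and service account are hypothetical and must be replaced with your own.

```shell
# Write a ClusterSecretStore manifest for Vault (values below are placeholders).
VAULT_SERVER="${VAULT_SERVER:-https://vault.corp.example.com:8200}"  # hypothetical
STORE_FILE="${STORE_FILE:-vault-clustersecretstore.yaml}"

cat > "$STORE_FILE" <<EOF
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: vault-backend          # must match secrets.esoVault.secretStoreRef.name
spec:
  provider:
    vault:
      server: "${VAULT_SERVER}"
      path: "secret"           # KV mount, assumed from vaultPath secret/data/nebula
      version: "v2"
      auth:
        kubernetes:
          mountPath: "kubernetes"        # assumed Vault auth mount
          role: "nebula-eso"             # hypothetical Vault role
          serviceAccountRef:
            name: "external-secrets"     # assumed ESO service account
            namespace: "external-secrets"
EOF

echo "wrote ${STORE_FILE}"

# After review: kubectl apply -f vault-clustersecretstore.yaml
```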

3. Copy + fill the reference values file

The bundle ships helm/examples/onprem/values.yaml with sensible on-prem defaults (bundled Postgres for evaluation, local-path storage, nginx ingress, raw secrets). Copy it, fill in the <placeholder> markers (domain name, object storage endpoint, LLM API base), and save as your-values.yaml.

For a production on-prem install with external Postgres:

Set postgres.mode: external and fill in .host, .port, .database, and .credentialsSecret
Set hatchetPostgres.mode: external similarly
Remove or comment out the bundled Postgres persistence blocks

4. Object storage

On-premises S3-compatible object storage options:

MinIO (recommended for simplicity): run MinIO alongside the cluster or as a StatefulSet inside it. Set objectStorage.endpoint: http://minio.minio.svc:9000, forcePathStyle: true, and store MinIO root credentials in objectStorage.credentialsSecret.
Ceph RGW: configure Ceph’s Rados Gateway. Set the RGW endpoint, region (or empty string), and HMAC credentials.
Cloudflare R2 / Wasabi: external but S3-compatible. Set the appropriate endpoint; forcePathStyle depends on the provider.

5. Install

helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f your-values.yaml

The chart runs schema migrations and catalog-apply automatically via a per-revision Job (<release>-nebula-migrations-<revision>); API and worker pods gate startup on an init container that polls public.nebula_release_contract for the install’s release row. releaseContract.releaseId and releaseContract.gitSha are stamped by bundle.sh and consumed automatically.

6. Verify

kubectl -n nebula get pods
kubectl -n nebula get ingress nebula
curl -fsS https://nebula.<your-domain>/v1/health
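On a fresh install the ingress and certificate can take a few minutes to become ready, so a single curl may fail even when the install is healthy. One way to script the check is a small retry helper; this is a sketch, and the name retry and the NEBULA_URL variable in the example comment are illustrative rather than part of the bundle.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  attempts=$1; delay=$2; shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after ${n} attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Example (up to 30 tries, 10 seconds apart):
#   NEBULA_URL="https://nebula.example.com"
#   retry 30 10 curl -fsS "${NEBULA_URL}/v1/health"
```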

Upgrade

Pull the new bundle, load/push new images, then:

helm upgrade nebula ./helm/nebula-<new-version>.tgz \
  -n nebula \
  -f your-values.yaml

Sizing reference

Workload	Starter	When to scale
API	2 replicas, 1 CPU / 2-4 GB	HPA on CPU >70% sustained
Worker	2 replicas, 2 CPU / 4-8 GB	HPA on queue depth (Hatchet metric)
Graph engine	2 replicas, 2 CPU / 4-8 GB	Manual; restart-sensitive (WAL replay)
Compactor	1 replica, 1 CPU / 2-4 GB	Single-writer; do not scale horizontally
RabbitMQ	1 replica, 8 GB PVC	Single-broker is fine up to ~10k workflows/min

For an evaluation single-node cluster, reducing to replicas: 1 on all workloads and using postgres.mode: bundled keeps the footprint under 16 GB RAM total.

For production deploys the bundle ships a shared sizing overlay at helm/examples/_common/production-sizing.yaml (the same overlay used by EKS/AKS/GKE). Stack it before your on-prem values file to get production-shape replicas and resource requests:

helm install nebula ./helm/nebula-<version>.tgz \
  -n nebula --create-namespace \
  -f helm/examples/_common/production-sizing.yaml \
  -f your-values.yaml

Pod Security Admission

The Nebula-built workloads (api, worker, graph-engine, graph-engine-compactor, the migration Job, and the vLLM sub-chart Deployments) comply with the restricted Pod Security Standard out of the box: non-root user, dropped capabilities, seccompProfile: RuntimeDefault, no privilege escalation.

The bundled third-party dependencies (postgres-statefulset with postgres.mode: bundled, hatchet-postgres-statefulset, hatchet-engine-deployment, hatchet-rabbitmq-statefulset) inherit their upstream images’ default security contexts and do not carry the restricted-required fields today. Labeling the release namespace as restricted before validating these pods can reject them at admission time.

Recommended approach:

For production deployments, set postgres.mode: external and hatchetPostgres.mode: external so the bundled StatefulSets are not rendered at all. The chart’s external-mode path doesn’t ship the third-party deps; you bring your own (compliant) Postgres + Hatchet.
If you need the bundled deps for evaluation, label the namespace at baseline rather than restricted (or use PSA’s warn / audit modes to surface the issues without blocking install).
Only enable restricted enforcement after validating each bundled dep’s security context against your cluster’s policy.

# Evaluation-friendly: enforce baseline; warn and audit on restricted violations.
kubectl label namespace nebula \
  pod-security.kubernetes.io/enforce=baseline \
  pod-security.kubernetes.io/warn=restricted \
  pod-security.kubernetes.io/audit=restricted

The chart deliberately does not label the namespace itself: Helm’s --create-namespace does not own pre-existing namespaces reliably, and adding namespace ownership to the chart conflicts with operators who manage namespaces separately (GitOps, vCluster, kiosk, etc.).
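Before raising enforce to restricted, Kubernetes can report which existing pods would violate the level through a server-side dry run of the label change; nothing is modified. The wrapper below is a sketch: the function name psa_check is hypothetical, and the KUBECTL variable exists only so a different binary (or a stub) can be substituted.

```shell
# psa_check NAMESPACE [LEVEL]
# Server-side dry run of the enforce label: violations are printed as
# warnings and the namespace is left unchanged.
psa_check() {
  ns=$1; level=${2:-restricted}
  "${KUBECTL:-kubectl}" label --dry-run=server --overwrite namespace "$ns" \
    "pod-security.kubernetes.io/enforce=${level}"
}

# Example: psa_check nebula restricted
```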

Prometheus metrics

Pods expose Prometheus-compatible /metrics endpoints and carry prometheus.io/scrape: "true" annotations for clusters that use annotation-based scrape discovery. For clusters running prometheus-operator / kube-prometheus-stack, enable native ServiceMonitor objects:

monitoring:
  serviceMonitor:
    enabled: true
    # Many operator installs key off a `release` label on ServiceMonitors;
    # set it to match your prometheus-operator's serviceMonitorSelector.
    additionalLabels:
      release: kube-prometheus-stack

ServiceMonitor rendering is off by default because ServiceMonitor is a monitoring.coreos.com/v1 CRD: rendering it on a cluster without prometheus-operator fails helm install with no matches for kind "ServiceMonitor".
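A quick pre-flight check for the CRD avoids that failure. This is a sketch; the function name has_servicemonitor_crd is illustrative, and KUBECTL is an optional override for the kubectl binary.

```shell
# Succeeds only if the ServiceMonitor CRD is installed in the current cluster.
has_servicemonitor_crd() {
  "${KUBECTL:-kubectl}" get crd servicemonitors.monitoring.coreos.com >/dev/null 2>&1
}

# Example:
#   has_servicemonitor_crd && echo "safe to set monitoring.serviceMonitor.enabled: true"
```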

Troubleshooting

PVCs stuck in Pending — no storage class available

Check that a storage class exists and is set as default: kubectl get storageclass. If using local-path, the provisioner must be running: kubectl -n local-path-storage get pods. Set storageClass.name in your values file to the exact class name if there is no cluster default.
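To script the default-class check, the helper below filters kubectl's table output for the "(default)" marker that kubectl prints next to default classes. It is a sketch with an illustrative function name; an empty result means no class is marked default, and storageClass.name must then be set explicitly.

```shell
# Print the names of storage classes that kubectl marks as "(default)".
default_storage_classes() {
  "${KUBECTL:-kubectl}" get storageclass --no-headers 2>/dev/null |
    awk '$2 == "(default)" { print $1 }'
}

# Example: default_storage_classes
```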

API pods fail to connect to bundled Postgres

On a fresh install with postgres.mode: bundled, the Postgres StatefulSet must be ready before the API Deployment. Check kubectl -n nebula get pods — the Postgres pod must be Running before API pods reach Ready. The chart renders a readiness probe on the API that retries for 5 minutes, which is usually enough for bundled Postgres to start. If the pod restarts before Postgres is ready, describe the pod for the specific connect error.

cert-manager fails to issue certificate — ACME challenge not reachable

The Let’s Encrypt ACME HTTP-01 challenge requires the domain to be publicly reachable over HTTP on port 80. For internal-only clusters, either use a DNS-01 challenge (configure cert-manager with DNS provider credentials) or provision certificates from a corporate CA ClusterIssuer. The ingress.tls.secretName in your values file must match the Secret that cert-manager populates, that is, the secretName set on the Certificate resource.

Graph-engine startup slow after node restart

The graph-engine replays its WAL on startup — duration scales with segment count and is expected. A single-node cluster that reboots may take 30-120 seconds per replica before graph-engine is fully Ready. Add initialDelaySeconds: 120 to the graph-engine readiness probe via workloads.graphEngine overrides if the default timeouts are too tight for your node restart time.
