This is the recommended production deploy for any customer with a real AWS footprint. The Helm chart is the same artifact we run our own staging and production on, and EKS + Karpenter is the deploy shape our release pipeline is tuned for.

Documentation Index
Fetch the complete documentation index at: https://docs.trynebula.ai/llms.txt
Use this file to discover all available pages before exploring further.
Prereqs
Before helm install, the following must be in place on the cluster side. If you’ve never set these up on an EKS cluster, budget half a day; each is well-documented upstream.
Cluster
- EKS 1.30+ (matches what we run internally)
- OIDC provider associated with the cluster (eksctl utils associate-iam-oidc-provider --cluster <name> --approve) — required for IRSA
Addons + controllers
| Component | Purpose | Install reference |
|---|---|---|
| Karpenter | Node autoscaling | karpenter.sh/docs |
| AWS Load Balancer Controller | ALB ingress | aws-load-balancer-controller |
| EBS CSI Driver | gp3 volumes for graph-engine / compactor / RabbitMQ | EKS addon: aws-ebs-csi-driver |
| External Secrets Operator (recommended) | Sync from AWS Secrets Manager | external-secrets.io |
You’ll also need a Karpenter NodePool covering the instance families Nebula will run on. Our staging clusters use m6i, m7i, and c7i families; production also includes r7i for the graph-engine memory profile. The chart’s resource requests in the example values file fit comfortably in m7i.large and up.
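As a sketch, a NodePool covering those families might look like the following. The NodePool name and the EC2NodeClass reference are assumptions (adapt both to your cluster); the instance families come from the list above.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: nebula                 # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumption: an EC2NodeClass named "default" exists
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["m6i", "m7i", "c7i", "r7i"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
```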
AWS-managed resources (recommended)
- RDS Postgres 16 in the same VPC as the cluster, with the cluster’s node security group allowed inbound on :5432. Enable rds.extensions = vector in the parameter group so pgvector is available.
- S3 bucket in the same region as the cluster. Versioning + SSE-S3 (or SSE-KMS) recommended.
IAM role for IRSA
Create one IAM role with the cluster’s OIDC provider in its trust policy, scoped to the chart’s ServiceAccount (nebula-sa in the install namespace when you helm install nebula …; if you pick a different release name, the SA is <release>-nebula-sa — confirm with kubectl -n <ns> get sa after install). Attach an inline policy granting the S3 access the graph engine needs (at minimum, the bucket and its objects). Then set the resulting role ARN as serviceAccount.annotations.eks.amazonaws.com/role-arn in your values file.
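A minimal inline policy might look like the sketch below. The bucket name is a placeholder, and the action list is an assumption inferred from the S3 troubleshooting entry in this guide; your deployment may need additional grants.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "NebulaS3Bucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::<bucket>"
    },
    {
      "Sid": "NebulaS3Objects",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::<bucket>/*"
    }
  ]
}
```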
Install
1. Push images to your ECR
The bundle’s images.tar contains every pinned image. The push paths below match the Helm chart’s default image.*.repository values — when image.registry in your values file is set to your ECR URI, the chart pulls exactly the refs you push here. Side-load, retag, push:
The image.registry prepend is skipped automatically for fully-qualified repos like ghcr.io/... and docker.io/....
For air-gapped EKS (no public-registry egress), mirror them into your ECR and override the matching image.*.repository keys in your values file. Recommended push paths (chosen so ECR repo names stay valid — ECR doesn’t accept ghcr.io as a segment):
Point the matching image.*.repository values at ${ECR}/... instead of the upstream public refs.
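The retag-and-push mapping can be sketched as below. The registry URI and the two image refs are hypothetical placeholders, not the bundle’s actual pin list; substitute the pinned refs from images.tar. The loop echoes the docker commands rather than running them, so the mapping is visible without a Docker daemon.

```shell
ECR="123456789012.dkr.ecr.us-east-1.amazonaws.com"   # assumption: your registry URI
# Hypothetical refs for illustration only -- use the pinned refs from images.tar.
for ref in ghcr.io/example/nebula-api:1.0.0 docker.io/library/rabbitmq:3.13; do
  repo="${ref#*/}"   # drop the registry host: ECR repo names can't contain ghcr.io
  echo "docker tag  $ref $ECR/$repo"
  echo "docker push $ECR/$repo"
done
```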
2. Seed secrets in AWS Secrets Manager
If you’re using ESO (recommended), put one JSON blob at the path you’ll reference under secrets.esoAws.awsSecretPath:
This blob backs postgres.credentialsSecret and hatchetPostgres.credentialsSecret — each must materialize a Kubernetes Secret with username + password keys (those exact lowercase key names — the chart reads them via secretKeyRef.key: username / .key: password).
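If you’re not using ESO, you can create the Secrets by hand. A sketch, with a hypothetical Secret name (match it to whatever you set postgres.credentialsSecret to, and repeat for the Hatchet secret):

```shell
kubectl -n <ns> create secret generic nebula-pg \
  --from-literal=username=nebula \
  --from-literal=password='<password>'
```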
3. Copy + fill the reference values file
The bundle ships a reference values file at helm/examples/eks/values.yaml with every AWS-specific knob pre-wired (Karpenter, gp3, IRSA, ALB, RDS+S3, ESO). Copy it, fill in the <placeholder> markers (account ID, RDS endpoint, IRSA role ARN, ACM cert ARN, S3 bucket name), and save as your-values.yaml.
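The shape of the fill-ins, as a partial sketch (the ARN, registry, and path values are placeholders; the key paths follow the ones named elsewhere in this guide, but confirm them against the reference file itself):

```yaml
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/<nebula-irsa-role>
image:
  registry: <account-id>.dkr.ecr.<region>.amazonaws.com
secrets:
  esoAws:
    awsSecretPath: <secrets-manager-path>
karpenter:
  enabled: true
```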
4. Install
Don’t treat helm install alone as a complete greenfield database bootstrap until the chart includes a dedicated bootstrap Job.
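A sketch of the install command, assuming the bundle’s chart directory is ./helm and the namespace is nebula (both assumptions):

```shell
helm install nebula ./helm \
  -n nebula --create-namespace \
  -f your-values.yaml
```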
5. Verify
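A few starter checks, assuming the release is named nebula and installed in the nebula namespace (both assumptions):

```shell
kubectl -n nebula get pods                  # everything should reach Running/Ready
kubectl -n nebula get sa nebula-sa -o yaml  # confirm the eks.amazonaws.com/role-arn annotation
kubectl -n nebula get ingress               # ALB hostname appears once the controller provisions it
```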
Upgrade
Pull the new bundle, push the new images to your ECR, then run helm upgrade with your updated values file.

Sizing reference
The example values file ships with production-shape defaults for a starter deployment. Scale from there based on measured throughput:

| Workload | Starter | When to scale |
|---|---|---|
| API | 2 replicas, 1 CPU / 2-4 GB | HPA on CPU >70% sustained |
| Worker | 2 replicas, 2 CPU / 4-8 GB | HPA on queue depth (Hatchet metric) |
| Graph engine | 2 replicas, 2 CPU / 4-8 GB | Manual; restart-sensitive (WAL replay) |
| Compactor | 1 replica, 1 CPU / 2-4 GB | Single-writer; do not scale horizontally |
| RabbitMQ | 1 replica, 8 GB PVC | Single-broker is fine up to ~10k workflows/min |
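The API row’s “HPA on CPU >70% sustained” can be sketched as a standard autoscaling/v2 HorizontalPodAutoscaler. The Deployment name nebula-api and the maxReplicas ceiling are assumptions; check the names your release actually creates.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nebula-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nebula-api       # assumption: adjust to the chart's actual Deployment name
  minReplicas: 2           # matches the starter sizing above
  maxReplicas: 6           # hypothetical ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```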
Karpenter + long-running pods
The example values file sets karpenter.enabled=true, which adds karpenter.sh/do-not-disrupt: "true" to the API, worker, graph-engine, compactor, and Hatchet engine pods. This prevents Karpenter consolidation or drift from killing pods mid-ingest, mid-graph-build, or mid-snapshot. Pods still drain on actual node lifecycle events (rolling update, manual kubectl drain).
If you’re on Cluster Autoscaler instead of Karpenter, leave karpenter.enabled=false. CA respects PDBs by default and doesn’t need the annotation.
Troubleshooting
API pods fail to connect to Postgres
Your postgres.credentialsSecret may be missing or may not have the expected keys. The Secret must contain username and password (those exact lowercase key names — the chart reads them via secretKeyRef.key: username / .key: password). If you’re using ESO, check that the ExternalSecret resource synced before the API and worker pods started.
Graph-engine pod crashlooping with 'AccessDenied' on S3
Either (a) the IRSA role isn’t attached to the ServiceAccount, or (b) the role’s policy doesn’t include the bucket ARN. Check kubectl -n nebula describe sa nebula-sa for the eks.amazonaws.com/role-arn annotation, and trace the IAM policy attached to that role.
ALB ingress shows 'no targets' after install
The AWS Load Balancer Controller takes 30-60s to provision the ALB on first install. Check kubectl -n kube-system logs deploy/aws-load-balancer-controller for any IAM permission errors on the controller’s IRSA role.
pgvector extension missing on first start
RDS doesn’t auto-enable extensions even if shared_preload_libraries includes them — rds.extensions = vector must be in the parameter group, and the database initialization path must run CREATE EXTENSION IF NOT EXISTS vector before the API handles traffic.
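One way to check from a box with network access to RDS, sketched with placeholder connection details:

```shell
psql "host=<rds-endpoint> dbname=<db> user=<user>" \
  -c "CREATE EXTENSION IF NOT EXISTS vector;" \
  -c "SELECT extversion FROM pg_extension WHERE extname = 'vector';"
```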