
Kubernetes in AWS

Deploy SIE to Amazon EKS with GPU node pools, KEDA autoscaling, and Terraform automation.

The architecture mirrors the GCP deployment, with the same router-worker setup and KEDA autoscaling:

EKS cluster architecture with Router, L4 and A100 worker pools, KEDA, and Prometheus

Components:

  • EKS Cluster with managed node groups for GPU instances
  • NVIDIA Device Plugin for GPU scheduling
  • IRSA (IAM Roles for Service Accounts) for S3 access
  • KEDA for autoscaling based on queue depth metrics
  • Prometheus + Grafana + DCGM Exporter for observability
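To make the queue-depth autoscaling concrete, here is a sketch of a KEDA ScaledObject with a Prometheus trigger. The deployment name, metric name, Prometheus address, and threshold are illustrative assumptions, not values from the SIE chart:

```yaml
# Hypothetical ScaledObject scaling a worker deployment on queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sie-worker-l4            # illustrative name
spec:
  scaleTargetRef:
    name: sie-worker-l4          # deployment to scale
  minReplicaCount: 0
  maxReplicaCount: 8
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring.svc:80
        query: sum(sie_queue_depth)   # illustrative metric name
        threshold: "10"               # scale up when depth exceeds 10
```

With `minReplicaCount: 0`, KEDA can scale GPU workers to zero when the queue is empty, which matters for expensive instances like p4d.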

Prerequisites:

  1. An AWS account with appropriate permissions
  2. GPU instance quota in your region (e.g., g5.xlarge for L4-equivalent workers, p4d.24xlarge for A100)
  3. Terraform and the AWS CLI installed and configured
Provision the cluster with Terraform:

```sh
cd deploy/terraform/aws

# Set your variables
export TF_VAR_region="us-east-1"

# Initialize and apply
terraform init
terraform plan
terraform apply
```

Or use the mise task:

```sh
mise run aws-deploy
```

The Terraform module provisions:

| Resource | Purpose |
| --- | --- |
| EKS Cluster | Kubernetes control plane |
| GPU Node Group | Auto-scaling GPU instances (g5, p4d, etc.) |
| NVIDIA Device Plugin | GPU scheduling in Kubernetes |
| KEDA | Autoscaling based on queue metrics |
| Prometheus + Grafana | Metrics and dashboards |
| DCGM Exporter | GPU metrics (utilization, memory, temperature) |
| SIE Helm Release | Router + worker deployment |

Key differences from the GCP (GKE) deployment:

| Feature | GCP (GKE) | AWS (EKS) |
| --- | --- | --- |
| GPU scheduling | Native GKE support | NVIDIA Device Plugin required |
| IAM for pods | Workload Identity | IRSA |
| Model cache storage | GCS (`gs://`) | S3 (`s3://`) |
| Node provisioning | GKE Autopilot / NAP | Karpenter or Cluster Autoscaler |
| Spot instances | Spot VMs | Spot Instances |

Configure the cluster cache to use S3:

```yaml
workers:
  common:
    clusterCache:
      enabled: true
      url: s3://my-bucket/models
```

IRSA handles authentication automatically; no access keys are needed in the pod.
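Under the hood, IRSA works by annotating the workers' Kubernetes service account with an IAM role ARN, which EKS exchanges for temporary credentials inside the pod. A minimal sketch, where the service account name and role ARN are illustrative rather than taken from the chart:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sie-worker                 # hypothetical name
  annotations:
    # IAM role with read access to the model bucket; ARN is illustrative
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/sie-model-cache-read
```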


The default Terraform configuration exposes the API endpoint publicly. For production:

  • Restrict ingress to your VPC CIDR or specific IP ranges
  • Enable authentication via oauth2-proxy or static tokens
  • Use a private load balancer for internal-only access:
```yaml
ingress:
  enabled: true
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
```
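If the endpoint must remain public but restricted, source-range filtering can be expressed in the same values format. This is a sketch that assumes the chart passes annotations through to the underlying Service; the CIDR is a placeholder for your own range:

```yaml
# Hypothetical values snippet; replace 10.0.0.0/16 with your VPC CIDR
# or office IP ranges.
ingress:
  enabled: true
  annotations:
    service.beta.kubernetes.io/load-balancer-source-ranges: "10.0.0.0/16"
```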

For simpler deployments, run SIE directly on a GPU EC2 instance:

```sh
# On a g5.xlarge (NVIDIA A10G) instance with Docker installed
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

docker run --gpus all -p 8080:8080 \
  -v ~/.cache/huggingface:/app/.cache/huggingface \
  ghcr.io/superlinked/sie:default
```

This is simpler than EKS and suitable for single-instance production workloads.