
Upgrade Runbook

Procedure for upgrading an SIE cluster to a new release version. Covers Helm-managed deployments on GKE and EKS.

Components upgraded:

  • Router (Deployment) — stateless, fast restart
  • Worker pools (StatefulSets) — GPU pods, model cache in emptyDir

Version management: SIE uses release-please for unified versioning. A single version (e.g., 0.1.6) is applied to the Helm chart (Chart.yaml appVersion), all Python packages, and all TypeScript packages. The CHANGELOG.md at the repo root documents all changes per release.
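Because release-please keeps the chart, Python, and TypeScript versions in lockstep, a quick pre-upgrade sanity check is to read the version straight out of Chart.yaml. A minimal sketch (the file path and extraction logic are assumptions; it is demonstrated on a sample file rather than the real chart):

```shell
# Read the unified version from a Chart.yaml before upgrading.
chart_version() {
  # Extract `appVersion: "X.Y.Z"` (quotes optional) from a Chart.yaml
  grep -E '^appVersion:' "$1" | sed -E 's/appVersion: *"?([0-9.]+)"?.*/\1/'
}

# Demo on a sample file; in the real repo, point this at
# deploy/helm/sie-cluster/Chart.yaml instead.
tmp=$(mktemp -d)
printf 'apiVersion: v2\nname: sie-cluster\nappVersion: "0.1.6"\n' > "$tmp/Chart.yaml"
v=$(chart_version "$tmp/Chart.yaml")
echo "chart appVersion: $v"
```

If this value does not match the tag you are about to deploy, stop and re-check the release.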


Complete all items before starting the upgrade.

Read CHANGELOG.md for the target version. Pay attention to:

  • Breaking changes in the router or server API
  • Helm values changes (new required values, renamed keys, removed options)
  • Model config changes (new or removed models, adapter changes)
# View changelog for the target version
git log v<CURRENT>..v<TARGET> --oneline
# Note current Helm release version
helm list -n sie
# Note current chart values (save for rollback reference)
helm get values sie -n sie -o yaml > /tmp/sie-values-backup.yaml
# Back up pool state (ConfigMaps + Leases in the sie namespace)
kubectl get configmap,lease -n sie -o yaml > /tmp/sie-pool-state-backup.yaml
# Record current image tags
kubectl get deployment -n sie -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.template.spec.containers[0].image}{"\n"}{end}'
kubectl get statefulset -n sie -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.template.spec.containers[0].image}{"\n"}{end}'
# Record Helm revision number
helm history sie -n sie --max 5
# All router pods should be Running and Ready
kubectl get pods -n sie -l app.kubernetes.io/component=router
# All worker pods should be Running and Ready (if not scaled to zero)
kubectl get pods -n sie -l app.kubernetes.io/component=worker
# Router readiness (returns {"status": "ready", "healthy_workers": N})
kubectl exec -n sie deploy/sie-sie-cluster-router -- wget -qO- http://localhost:8080/readyz
# Router detailed health (returns worker count, GPU count, loaded models)
kubectl exec -n sie deploy/sie-sie-cluster-router -- wget -qO- http://localhost:8080/health
# KEDA ScaledObjects should not be in Fallback mode
kubectl get scaledobject -n sie
kubectl describe scaledobject -n sie | grep -A2 "Type.*Fallback"
# Check for recent errors in router logs
kubectl logs -n sie -l app.kubernetes.io/component=router --tail=50 | grep -i error
# Check for recent errors in worker logs
kubectl logs -n sie -l app.kubernetes.io/component=worker --tail=50 | grep -i error
# Prometheus is serving queries
kubectl exec -n monitoring svc/prometheus-operated -- wget -qO- \
'http://localhost:9090/api/v1/query?query=up' 2>/dev/null | head -c 200
# Grafana is accessible
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80 &
# Open http://localhost:3000 and verify SIE dashboards show data

If the upgrade will run during active traffic, consider pausing autoscaling first:

# Pause KEDA autoscaling to prevent scale events during upgrade.
# Each ScaledObject targets a specific StatefulSet, so freeze each one
# at its own replica count (pools may differ).
for so in $(kubectl get scaledobject -n sie -o jsonpath='{.items[*].metadata.name}'); do
  # Read the actual scale target from the ScaledObject spec
  sts=$(kubectl get scaledobject "$so" -n sie -o jsonpath='{.spec.scaleTargetRef.name}')
  replicas=$(kubectl get statefulset "$sts" -n sie -o jsonpath='{.spec.replicas}' 2>/dev/null)
  if [ -n "$replicas" ]; then
    kubectl annotate scaledobject "$so" -n sie \
      autoscaling.keda.sh/paused-replicas="$replicas" --overwrite
  fi
done

For clusters using custom image registries (not the default ghcr.io/superlinked), push the new images first:

# Build and push new images (adjust registry as needed)
REGISTRY="your-registry.example.com"
TAG="0.1.7" # Target version
# Server image (one per bundle)
mise run docker -- --tag $TAG
docker tag sie-server:cuda12-default $REGISTRY/sie-server:$TAG-default
docker push $REGISTRY/sie-server:$TAG-default
# Router image
mise run docker -- --router --tag $TAG
docker tag sie-router:$TAG $REGISTRY/sie-router:$TAG
docker push $REGISTRY/sie-router:$TAG
# Dry-run first to preview changes
helm diff upgrade sie deploy/helm/sie-cluster/ \
-n sie \
-f /tmp/sie-values-backup.yaml \
--set workers.common.image.tag="<TARGET_VERSION>" \
--set router.image.tag="<TARGET_VERSION>"
# Apply the upgrade (--wait blocks until pods are ready; --timeout guards against hangs)
helm upgrade sie deploy/helm/sie-cluster/ \
-n sie \
-f /tmp/sie-values-backup.yaml \
--set workers.common.image.tag="<TARGET_VERSION>" \
--set router.image.tag="<TARGET_VERSION>" \
--wait --timeout 10m
# Dry-run
helm diff upgrade sie oci://ghcr.io/superlinked/sie-cluster \
-n sie \
--version <TARGET_CHART_VERSION> \
-f /tmp/sie-values-backup.yaml
# Apply
helm upgrade sie oci://ghcr.io/superlinked/sie-cluster \
-n sie \
--version <TARGET_CHART_VERSION> \
-f /tmp/sie-values-backup.yaml \
--wait --timeout 10m
# Update image tag in Terraform variables
# Edit your .tfvars or set TF_VAR:
export TF_VAR_sie_image_tag="<TARGET_VERSION>"
cd deploy/terraform/gcp/examples/<your-env>
terraform plan # Review changes
terraform apply # Apply

2.3 Expected Behavior During Rolling Update


Router (Deployment):

  • Kubernetes rolls out new router pods one at a time (default RollingUpdate strategy).
  • Router liveness probe: GET /healthz (returns 200 if process is alive). initialDelaySeconds: 5, periodSeconds: 10.
  • Router readiness probe: GET /readyz (returns 200 immediately — router is ready even with 0 workers). initialDelaySeconds: 5, periodSeconds: 5.
  • The router is stateless; new pods come up in seconds.
  • Brief 503s are possible during the switchover window if all old pods are terminated before new ones pass readiness.

Workers (StatefulSets):

  • The default RollingUpdate strategy updates pods one at a time in reverse ordinal order. (podManagementPolicy: Parallel only affects pod ordering during scaling, not rolling updates.)
  • Worker terminationGracePeriodSeconds: 65.
  • preStop hook: sleep 10 — gives the K8s endpoints controller 10 seconds to remove the pod from the service before SIGTERM.
  • On SIGTERM, the server enters graceful shutdown: rejects new requests with 503 (with Retry-After: 5 header), drains in-flight requests (25-second timeout), then exits.
  • Readiness probe stops passing (/readyz returns 503) once shutdown begins, so the router stops sending new traffic to the draining pod.
  • The router detects worker disconnection via WebSocket and removes it from the routing table.
  • New worker pods must download model weights if the emptyDir cache is empty (cache does not persist across pod restarts). Cold model loading can take 10-120 seconds depending on model size and cache state.
  • PodDisruptionBudget: maxUnavailable: 1 per worker pool — protects against external disruptions (e.g., kubectl drain, node autoscaler) but is not enforced by the StatefulSet controller during rolling updates.
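The shutdown timings above fit together deliberately: the preStop sleep plus the drain timeout must complete inside the termination grace period, with headroom for the process to exit. A quick arithmetic check using the values from the bullets:

```shell
# Sanity-check the worker shutdown timing budget described above.
PRESTOP=10   # preStop hook: sleep 10
DRAIN=25     # graceful-shutdown drain timeout
GRACE=65     # terminationGracePeriodSeconds
BUDGET=$((PRESTOP + DRAIN))
HEADROOM=$((GRACE - BUDGET))
echo "shutdown budget: ${BUDGET}s of ${GRACE}s grace (headroom: ${HEADROOM}s)"
```

If you tune any of these values (e.g. a longer drain timeout), keep the headroom positive, or Kubernetes will SIGKILL workers mid-drain.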

Client Impact:

  • SDK clients with automatic retry handle 503s transparently.
  • Requests in flight during graceful shutdown complete normally (up to 25-second drain timeout).
  • If all workers in a pool are restarting simultaneously, the router returns 202 Accepted (provisioning), and the SDK retries with backoff.
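For clients not using the SDK, the retry behavior sketched in these bullets can be reproduced with a small backoff loop. This is an illustrative sketch, not the SDK's implementation: a stub stands in for the real HTTP call (which would be something like `curl -s -o /dev/null -w '%{http_code}'` against the router), and sleeps are skipped to keep the demo fast.

```shell
# Retry on 503 (draining) and 202 (provisioning) with exponential backoff.
retry_with_backoff() {
  local max=$1; shift
  local delay=1 code
  for _ in $(seq 1 "$max"); do
    code=$("$@")
    case "$code" in
      200) echo "OK"; return 0 ;;
      202|503) : # real clients: sleep "$delay", honoring Retry-After when sent
               delay=$((delay * 2)) ;;
      *) echo "fatal: $code"; return 1 ;;
    esac
  done
  echo "gave up after $max attempts"; return 1
}

# Stub: returns 503 twice (workers restarting), then 200. A file holds the
# attempt count because command substitution runs the stub in a subshell.
cnt=$(mktemp); echo 0 > "$cnt"
stub() {
  local n; n=$(( $(cat "$cnt") + 1 )); echo "$n" > "$cnt"
  if [ "$n" -lt 3 ]; then echo 503; else echo 200; fi
}
retry_with_backoff 5 stub
```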
# Watch router rollout
kubectl rollout status deployment/sie-sie-cluster-router -n sie --timeout=120s
# Watch worker rollouts (one per pool)
kubectl get statefulsets -n sie -w
# Watch all pods
kubectl get pods -n sie -w
# Check KEDA ScaledObjects are still healthy (not Fallback)
kubectl get scaledobject -n sie -o custom-columns=NAME:.metadata.name,READY:.status.conditions[0].status,MIN:.spec.minReplicaCount,MAX:.spec.maxReplicaCount,REPLICAS:.status.currentReplicas
# Watch router logs for errors during transition
kubectl logs -n sie -l app.kubernetes.io/component=router -f --tail=20

# All pods Running and Ready
kubectl get pods -n sie
# Expected: all router pods 1/1 Ready, all worker pods 1/1 Ready
# Verify new image tags are deployed
kubectl get pods -n sie -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.containers[0].image}{"\n"}{end}'
# Readiness check
kubectl exec -n sie deploy/sie-sie-cluster-router -- wget -qO- http://localhost:8080/readyz
# Expected: {"status": "ready", "healthy_workers": N}
# Detailed health (worker count, models, GPU types)
kubectl exec -n sie deploy/sie-sie-cluster-router -- wget -qO- http://localhost:8080/health
# Expected: "status": "healthy", worker_count > 0 (if pools not scaled to zero)
# Model catalog is available
kubectl exec -n sie deploy/sie-sie-cluster-router -- wget -qO- http://localhost:8080/v1/models | head -c 500
# Port-forward to router
kubectl port-forward -n sie svc/sie-sie-cluster-router 8080:8080 &
# Test encode request (requires a running worker with GPU)
python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'upgrade verification test'})
print(f'Dense embedding dim: {len(result[\"dense\"])}')
print('SUCCESS: Encode request returned 200')
"
# Or with curl (JSON fallback):
curl -s -X POST http://localhost:8080/v1/encode/BAAI%2Fbge-m3 \
-H "Content-Type: application/json" \
-d '{"items": [{"text": "upgrade verification test"}]}' | python3 -m json.tool | head -5
# Unpause KEDA if paused in step 1.5
kubectl annotate scaledobject -n sie --all autoscaling.keda.sh/paused-replicas-
# Verify ScaledObjects are Ready (not Fallback)
kubectl get scaledobject -n sie
kubectl describe scaledobject -n sie | grep -A3 "Conditions:"
# Expected: Ready=True, Active depends on load, Fallback=False
# Verify Prometheus is scraping the new pods
kubectl exec -n monitoring svc/prometheus-operated -- wget -qO- \
'http://localhost:9090/api/v1/query?query=sie_requests_total' 2>/dev/null | python3 -m json.tool | head -20
# Verify router metrics
kubectl exec -n monitoring svc/prometheus-operated -- wget -qO- \
'http://localhost:9090/api/v1/query?query=sie_router_requests_total' 2>/dev/null | python3 -m json.tool | head -20
# Check Grafana dashboards show data for new pods
# Port-forward: kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Navigate to SIE > Cluster Overview dashboard
# Check Helm release version
helm list -n sie
# Expected: Chart version and App version match target
# Check the server version header on a response
curl -s -I http://localhost:8080/healthz | grep -i x-sie
# Expected: X-SIE-Server-Version: <TARGET_VERSION>
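To turn the image-tag listing above into a pass/fail check, filter for any line whose image does not carry the target tag. A sketch with sample data standing in for the live listing (in a real cluster, capture the `kubectl get pods ... -o jsonpath` output into `images` instead; the pod names below are illustrative):

```shell
# Assert every container image carries the target tag.
TARGET="0.1.7"
images='sie-sie-cluster-router-abc: ghcr.io/superlinked/sie-router:0.1.7
sie-sie-cluster-worker-a100-0: ghcr.io/superlinked/sie-server:0.1.7-default'

# Any line that does not mention :<TARGET> is still on an old image.
stale=$(printf '%s\n' "$images" | grep -v ":${TARGET}" || true)
if [ -z "$stale" ]; then
  echo "all images on ${TARGET}"
else
  printf 'stale images:\n%s\n' "$stale"
fi
```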

# List Helm release history
helm history sie -n sie --max 10
# Note the REVISION number of the last known-good release
# Rollback to previous revision
helm rollback sie <REVISION> -n sie
# Or rollback to immediately previous version
helm rollback sie -n sie

For Terraform-managed clusters:

# Revert image tag to previous version
export TF_VAR_sie_image_tag="<PREVIOUS_VERSION>"
cd deploy/terraform/gcp/examples/<your-env>
terraform apply
# Watch the rollback proceed
kubectl rollout status deployment/sie-sie-cluster-router -n sie --timeout=120s
kubectl get pods -n sie -w
# Verify old image is restored
kubectl get pods -n sie -o jsonpath='{range .items[*]}{.metadata.name}: {.spec.containers[0].image}{"\n"}{end}'

Run the same post-upgrade verification steps:

# Router health
kubectl exec -n sie deploy/sie-sie-cluster-router -- wget -qO- http://localhost:8080/readyz
# Encode smoke test
kubectl port-forward -n sie svc/sie-sie-cluster-router 8080:8080 &
python3 -c "
from sie_sdk import SIEClient
client = SIEClient('http://localhost:8080')
result = client.encode('BAAI/bge-m3', {'text': 'rollback verification'})
print(f'Dense dim: {len(result[\"dense\"])} - SUCCESS')
"
# KEDA health
kubectl get scaledobject -n sie
  • No schema migrations: SIE is stateless. Workers use emptyDir for model cache, and the router stores pool state in ConfigMaps with Leases for TTL. There are no database migrations to worry about during rollback.
  • Model cache invalidation: Worker pods use emptyDir volumes for the HuggingFace model cache. Rolling back means new pods start with an empty cache and must re-download model weights on first request. If cluster cache (S3/GCS) is configured, downloads come from there instead of HuggingFace Hub.
  • Pool state: Resource pools are stored as ConfigMaps in the sie namespace. Pool leases survive upgrades and rollbacks. Active pools will continue to work, but if the pool API changed between versions, clients may need to recreate pools.
  • KEDA ScaledObjects: Helm rollback re-applies the previous ScaledObject definitions. If KEDA version requirements changed between SIE versions, verify ScaledObjects are not in Fallback mode after rollback.
  • Config drift: If the upgrade included changes to embedded model or bundle configs (baked into the Helm chart files/ directory), rollback restores the previous configs. Ensure the previous configs are compatible with the previous server version.
  • SDK version compatibility: The router returns X-SIE-Server-Version headers. If clients upgraded their SDK alongside the server, a server rollback may trigger version mismatch warnings in the SDK logs. The SDK remains functional but logs warnings for major.minor mismatches.
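The document does not spell out the SDK's exact comparison rule beyond "major.minor mismatches", but one plausible implementation is to strip the patch component and compare what remains:

```shell
# Compare two versions on major.minor only (patch differences are ignored).
same_major_minor() {
  # ${1%.*} strips the final ".patch" component, e.g. "0.1.7" -> "0.1"
  [ "${1%.*}" = "${2%.*}" ]
}

same_major_minor "0.1.7" "0.1.6" && echo "0.1.7 vs 0.1.6: compatible"
same_major_minor "0.2.0" "0.1.6" || echo "0.2.0 vs 0.1.6: mismatch warning"
```

Under this rule, a rollback from 0.1.7 to 0.1.6 would produce no warnings, while a rollback that crosses a minor boundary would.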

Resource | Namespace | Type | Purpose
sie-sie-cluster-router | sie | Deployment | Stateless request router (2+ replicas)
sie-sie-cluster-worker-<pool> | sie | StatefulSet | GPU worker pool (one per pool)
sie-sie-cluster-worker | sie | Service (headless) | Worker DNS discovery
sie-sie-cluster-router | sie | Service (ClusterIP) | Router endpoint
sie-sie-cluster-worker-<pool>-scaler | sie | ScaledObject | KEDA autoscaler per pool
sie-sie-cluster-worker-<pool> | sie | PodDisruptionBudget | maxUnavailable: 1 per pool
sie-sie-cluster-gpu-config | sie | ConfigMap | Available GPU types / machine profiles
sie-sie-cluster-config | sie | ConfigMap | Shared cluster configuration

Endpoint | Component | Returns
GET /healthz | Router | {"status": "ok"} — liveness probe
GET /readyz | Router | {"status": "ready", "healthy_workers": N} — readiness probe
GET /health | Router | Detailed cluster status (worker count, GPUs, models)
GET /healthz | Worker | "ok" — liveness probe
GET /readyz | Worker | "ok" or 503 — readiness probe
GET /metrics | Both | Prometheus metrics

Dashboard | Purpose
Cluster Overview | QPS, latency (p50/p95/p99), GPU utilization
Model Performance | Per-model latency, throughput, batch sizes
Worker Health | Per-worker CPU/memory, GPU temp, queue depth