Gateway DR Failover Runbook

Overview

This runbook covers manual failover and failback procedures for MazeVault Gateway disaster recovery (DR). The DR architecture uses a cold standby model: the DR AKS cluster is stopped by default. The DR gateway has the same managed identity (MI) federation as the primary, a geo-replicated Key Vault (KV), and a PostgreSQL standby.

Architecture:
- Each environment (NPR, PRO) has a primary gateway and, optionally, a dr-standby gateway.
- AutoFailover is OFF: all failover is manual via the admin API.
- HealthMonitor detects failures (HeartbeatTimeout 2min, FailureThreshold 3) and marks activation_status=failed.
- Fencing tokens (UnixNano) prevent split-brain: only ONE active gateway per environment.
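The detection window implied by HeartbeatTimeout and FailureThreshold can be worked out directly. The sketch below assumes that HealthMonitor marks a gateway as failed only after FailureThreshold consecutive heartbeat timeouts; the variable names are illustrative and are not MazeVault configuration keys.

```bash
# Illustrative arithmetic only; these variable names are hypothetical.
heartbeat_timeout_s=120   # HeartbeatTimeout: 2 minutes
failure_threshold=3       # FailureThreshold: consecutive missed heartbeats

# Approximate time before HealthMonitor sets activation_status=failed.
detection_window_s=$((heartbeat_timeout_s * failure_threshold))
echo "Approximate detection window: $((detection_window_s / 60)) minutes"
```

Under these assumptions, an operator should expect roughly 6 minutes between the primary going silent and the failed status appearing.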

Prerequisites

- Admin access to MazeVault primary backend (JWT with gateway:admin RBAC)
- Azure CLI access to DR resource group (az login)
- kubectl configured for DR AKS cluster
- DR region Terraform resources deployed and in cold standby

1. DR Activation (Failover)

1.1 Verify Primary is Down

# Check gateway health dashboard
curl -s -H "Authorization: Bearer $TOKEN" \
  https://<PRIMARY_URL>/api/v1/admin/gateway-health | jq .

# Look for:
#   "activation_status": "failed"
#   "last_heartbeat": <older than 2 minutes>
#   "consecutive_failures": >= 3

STOP if the primary is active and its heartbeat is recent: this may be a false alarm.
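The go/no-go decision can be written down as a small check. This is a minimal sketch, not part of MazeVault: the function name is hypothetical, and requiring all three indicators at once is a conservative assumption. The values are read by hand from the gateway-health output.

```bash
# Hypothetical helper: decide whether the primary looks failed enough to proceed.
# Arguments: activation_status, heartbeat age in seconds, consecutive_failures.
primary_looks_failed() {
  status=$1; heartbeat_age_s=$2; failures=$3
  # Assumption: all three indicators must agree before failing over.
  [ "$status" = "failed" ] &&
    [ "$heartbeat_age_s" -gt 120 ] &&
    [ "$failures" -ge 3 ]
}

# Example values read off the gateway-health output by hand.
if primary_looks_failed "failed" 400 3; then
  echo "Primary looks failed: continue with 1.2"
else
  echo "STOP: primary may still be healthy"
fi
```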

1.2 Start DR AKS Cluster

# Start the stopped AKS cluster in DR region
az aks start \
  --name <DR_AKS_NAME> \
  --resource-group <DR_RESOURCE_GROUP>

# Wait for cluster to be ready (3-5 minutes)
az aks show \
  --name <DR_AKS_NAME> \
  --resource-group <DR_RESOURCE_GROUP> \
  --query "provisioningState" -o tsv
# Expected: "Succeeded"
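Rather than re-running the check by hand, the wait can be scripted. The sketch below polls the same az aks show query until it returns "Succeeded". The function name, the 600-second default timeout, and the 15-second poll interval are assumptions chosen for illustration.

```bash
# Hypothetical helper: poll until the DR cluster reports "Succeeded" or time runs out.
# Arguments: cluster name, resource group, timeout in s (default 600), poll interval in s (default 15).
wait_for_aks_ready() {
  name=$1; rg=$2; timeout_s=${3:-600}; interval_s=${4:-15}
  elapsed=0
  while [ "$elapsed" -le "$timeout_s" ]; do
    state=$(az aks show --name "$name" --resource-group "$rg" \
      --query "provisioningState" -o tsv 2>/dev/null)
    if [ "$state" = "Succeeded" ]; then
      echo "Cluster $name is ready"
      return 0
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
  echo "Timed out waiting for $name (last state: ${state:-unknown})" >&2
  return 1
}

# Usage: wait_for_aks_ready "<DR_AKS_NAME>" "<DR_RESOURCE_GROUP>"
```

A non-zero return means the cluster did not become ready in time; investigate in the Azure portal before continuing.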

1.3 Verify DR Gateway Pods

# Switch kubectl context to DR cluster
kubectl config use-context <DR_CLUSTER_CONTEXT>

# Check pods
kubectl get pods -n mazevault -l app=mazevault-gateway

# Review recent logs to confirm the gateway started without errors
kubectl logs -n mazevault -l app=mazevault-gateway --tail=50

# Check gateway state file exists (registered previously)
kubectl exec -n mazevault <DR_POD> -- cat /data/gateway-state.json | jq .
# Should show: "Role": "dr-standby", "Environment": "<ENV>"
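To avoid polling kubectl get pods manually, pod readiness can be awaited with kubectl wait. This is a sketch that reuses the namespace and label from the commands above; the function name and the 300s default timeout are assumptions.

```bash
# Hypothetical helper: block until the gateway pods report Ready, or fail after the timeout.
# Argument: timeout as a kubectl duration (default 300s).
wait_for_gateway_pods() {
  timeout=${1:-300s}
  kubectl wait pod \
    --namespace mazevault \
    --selector app=mazevault-gateway \
    --for=condition=Ready \
    --timeout="$timeout"
}

# Usage: wait_for_gateway_pods 300s
```

kubectl wait exits non-zero if no pod matches the selector or the timeout expires, so the result can gate the activation step.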

1.4 Activate DR Gateway

# Activate DR for the affected environment
curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  https://<PRIMARY_URL>/api/v1/admin/failover/<ENVIRONMENT>/activate-dr | jq .

# Expected response:
# {
#   "message": "DR gateway activated for environment <ENVIRONMENT>",
#   "fencing_token": "<unix_nano_timestamp>"
# }

What this does:
1. Verifies the primary is truly failed (prevents split-brain)
2. Acquires GatewayEnvironmentLock with a fencing token
3. Sets DR gateway activation_status=active
4. Routes tasks automatically to the DR gateway
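Because a successful activation returns a fencing token, its presence is a convenient success check. The sketch below assumes the response shape shown in the expected response above; the helper name and the sample values are hypothetical.

```bash
# Hypothetical helper: read the activate-dr response on stdin and print the fencing token.
# Exits non-zero if the token is missing or the body is not valid JSON.
extract_fencing_token() {
  jq -er '.fencing_token // empty'
}

# Sample response in the documented shape (values are made up).
sample='{"message":"DR gateway activated for environment NPR","fencing_token":"1700000000000000000"}'
token=$(printf '%s' "$sample" | extract_fencing_token) && echo "Fencing token: $token"
```

In practice the curl output from the activation call would be piped into the helper, and the token recorded in the incident log for later split-brain checks.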

1.5 Verify DR is Active

# Check health again
curl -s -H "Authorization: Bearer $TOKEN" \
  https://<PRIMARY_URL>/api/v1/admin/gateway-health | jq .

# DR gateway should show:
#   "activation_status": "active"
#   "last_heartbeat": <recent>

# Check DR gateway logs for heartbeat + task processing
kubectl logs -n mazevault -l app=mazevault-gateway --tail=20 -f

1.6 Verify Operations

# Resolve the active gateway ID for the affected environment
GATEWAY_ID=$(curl -s -H "Authorization: Bearer $TOKEN" \
  https://<PRIMARY_URL>/api/v1/admin/gateways | \
  jq -r '.[] | select((.environment // .Environment) == "<ENVIRONMENT>" and (.activation_status // .ActivationStatus) == "active") | (.id // .ID)' | head -n1)

# Fail fast if no active gateway was resolved
if [ -z "$GATEWAY_ID" ] || [ "$GATEWAY_ID" = "null" ]; then
  echo "ERROR: No active gateway found for environment <ENVIRONMENT>"
  exit 1
fi

# Check pending tasks are being claimed by the active gateway
curl -s -H "Authorization: Bearer $TOKEN" \
  "https://<PRIMARY_URL>/api/v1/admin/gateways/${GATEWAY_ID}/tasks?status=pending" | jq '.[] | {id, status, claimed_at}'

# Test a rotation if possible (non-critical secret)
curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  https://<PRIMARY_URL>/api/v1/rotation/configs/<CONFIG_ID>/execute | jq .

2. Failback (Return to Primary)

2.1 Verify Primary is Recovered

Ensure the root cause is resolved and primary gateway infrastructure is healthy:
- AKS cluster running
- PostgreSQL accessible
- Key Vault accessible
- Network connectivity restored

# Check primary gateway pod status
kubectl config use-context <PRIMARY_CLUSTER_CONTEXT>
kubectl get pods -n mazevault -l app=mazevault-gateway

# Check primary gateway logs: heartbeats should be sent again, although the gateway is still in "failed" status
kubectl logs -n mazevault -l app=mazevault-gateway --tail=20

2.2 Execute Failback

curl -s -X POST \
  -H "Authorization: Bearer $TOKEN" \
  https://<PRIMARY_URL>/api/v1/admin/failover/<ENVIRONMENT>/failback | jq .

# Expected response:
# {
#   "message": "Failback completed, primary reactivated for environment <ENVIRONMENT>"
# }

What this does:
1. Deactivates DR gateway → activation_status=standby
2. Reactivates primary gateway → activation_status=active
3. Releases the environment lock
4. Reroutes tasks automatically to the primary

2.3 Verify Primary is Active

# Health check
curl -s -H "Authorization: Bearer $TOKEN" \
  https://<PRIMARY_URL>/api/v1/admin/gateway-health | jq .

# Primary should show:
#   "activation_status": "active"
#   "last_heartbeat": <recent>

# DR should show:
#   "activation_status": "standby"

2.4 Stop DR AKS (Return to Cold Standby)

# Stop DR AKS to reduce costs
az aks stop \
  --name <DR_AKS_NAME> \
  --resource-group <DR_RESOURCE_GROUP>
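To confirm the cluster really returned to cold standby, its power state can be queried. This is a sketch with a hypothetical function name; it relies on the powerState.code field in the az aks show output.

```bash
# Hypothetical helper: succeed only if the cluster's power state is "Stopped".
# Arguments: cluster name, resource group.
aks_is_stopped() {
  code=$(az aks show --name "$1" --resource-group "$2" \
    --query "powerState.code" -o tsv)
  [ "$code" = "Stopped" ]
}

# Usage:
#   aks_is_stopped "<DR_AKS_NAME>" "<DR_RESOURCE_GROUP>" && echo "DR cluster is in cold standby"
```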

3. Troubleshooting

Failover Returns "Primary is not in failed state"

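The primary gateway has not been marked as failed, usually because it is still sending heartbeats. Check whether this is a network partition or an actual failure. If the primary is really down, wait for HealthMonitor to mark it as failed: with FailureThreshold 3 and HeartbeatTimeout 2min, that takes about 6 minutes.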

DR Gateway Not Claiming Tasks

- Check the DR gateway logs for errors.
- Verify activation_status=active in the health dashboard.
- Verify the fencing token was acquired (check the environment lock).
- Check that the DR gateway has its managed identity configured for Key Vault access: kubectl exec -n mazevault <DR_POD> -- env | grep AZURE_MANAGED_IDENTITY_CLIENT_ID

Split-Brain Suspected

If both gateways appear active:
1. Check GatewayEnvironmentLock: only the valid lock holder should be active.
2. Deactivate the stale gateway manually via the admin API: POST /api/v1/admin/gateways/<ID>/deactivate
3. Note that the fencing token ensures only the lock holder's tasks are processed.
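A quick way to confirm a split-brain is to count the active gateways for one environment. The sketch below assumes the /api/v1/admin/gateways response is a JSON array with the environment and activation_status fields used in section 1.6; the helper name and the sample gateway IDs are hypothetical.

```bash
# Hypothetical helper: count active gateways for one environment.
# Reads the /api/v1/admin/gateways JSON array on stdin; argument: environment name.
count_active_gateways() {
  jq --arg target "$1" '[.[] | select(
      (.environment // .Environment) == $target and
      (.activation_status // .ActivationStatus) == "active")] | length'
}

# Sample data in the assumed shape (IDs are made up).
sample='[{"id":"gw-1","environment":"NPR","activation_status":"active"},
         {"id":"gw-2","environment":"NPR","activation_status":"active"},
         {"id":"gw-3","environment":"PRO","activation_status":"standby"}]'
active=$(printf '%s' "$sample" | count_active_gateways "NPR")
[ "$active" -gt 1 ] && echo "Split-brain suspected: $active active gateways in NPR"
```

Against a live system, the curl call to /api/v1/admin/gateways would replace the sample; any count above 1 means one gateway must be deactivated.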

Tasks Stuck After Failover

Tasks that remain in the running state for more than 30 seconds are cleaned up by the timeout monitor. To list the running tasks on the active gateway:

GATEWAY_ID=$(curl -s -H "Authorization: Bearer $TOKEN" \
  https://<PRIMARY_URL>/api/v1/admin/gateways | \
  jq -r '.[] | select((.environment // .Environment) == "<ENVIRONMENT>" and (.activation_status // .ActivationStatus) == "active") | (.id // .ID)' | head -n1)

if [ -z "$GATEWAY_ID" ] || [ "$GATEWAY_ID" = "null" ]; then
  echo "ERROR: No active gateway found for environment <ENVIRONMENT>"
  exit 1
fi

curl -s -H "Authorization: Bearer $TOKEN" \
  "https://<PRIMARY_URL>/api/v1/admin/gateways/${GATEWAY_ID}/tasks?status=running" | jq .

4. Environment Variables Reference

Variable	Primary	DR
`MAZEVAULT_GATEWAY_ENVIRONMENT`	NPR / PRO	Same as primary
`MAZEVAULT_GATEWAY_ROLE`	`primary`	`dr-standby`
`AZURE_MANAGED_IDENTITY_CLIENT_ID`	MI for primary region	MI for DR region (same federation)
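
The table can be turned into a simple pre-flight check run inside a gateway pod. This is a sketch only: the function name is hypothetical, and it assumes NPR and PRO are the only valid environments and primary and dr-standby the only valid roles, as listed above.

```bash
# Hypothetical pre-flight check for the gateway environment variables in the table.
validate_gateway_env() {
  case "${MAZEVAULT_GATEWAY_ENVIRONMENT:-}" in
    NPR|PRO) ;;
    *) echo "Unexpected MAZEVAULT_GATEWAY_ENVIRONMENT: '${MAZEVAULT_GATEWAY_ENVIRONMENT:-}'" >&2
       return 1 ;;
  esac
  case "${MAZEVAULT_GATEWAY_ROLE:-}" in
    primary|dr-standby) ;;
    *) echo "Unexpected MAZEVAULT_GATEWAY_ROLE: '${MAZEVAULT_GATEWAY_ROLE:-}'" >&2
       return 1 ;;
  esac
  if [ -z "${AZURE_MANAGED_IDENTITY_CLIENT_ID:-}" ]; then
    echo "AZURE_MANAGED_IDENTITY_CLIENT_ID is not set" >&2
    return 1
  fi
  echo "Gateway environment looks consistent"
}

# Usage (inside the pod): validate_gateway_env
```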