Skip to content

Environment Identifier Migration — Rollback Runbook

Scope: Operational rollback procedure for migrations 000121000125, which migrate every environment identifier in the database to a canonical lowercase form (prod, staging, etc.).

Audience: Database administrators, on-call SRE, release engineers.

1. Why this runbook exists

Migrations 000123 (business backfill) and 000124 (gateway/control-plane backfill) perform lossy transformations on environment identifier columns:

UPDATE table SET environment = LOWER(TRIM(environment))

After this runs, the original mixed-case form (PROD, Stage-2, dev) is no longer recoverable from the row alone. The *.down.sql files for these migrations are intentionally no-ops — calling migrate down will not restore the previous casing.

If you need to revert to the pre-migration state, follow this runbook.

2. Decision tree

Symptom Action
Application errors with "invalid environment slug" Re-run app — DB is fine; client sent non-canonical identifier. Do not roll back.
Foreign data referencing old casing (e.g. external CMDB exports) Update consumer to canonical form. Do not roll back.
Constraint violation in 000125 (CHECK col = LOWER(col)) at deploy Investigate — 000123/000124 did not run cleanly. Apply 000123/000124 again, then 000125. Do not PITR.
Confirmed data corruption introduced by 000123/000124 Proceed with PITR (Section 4).
Need to revert canonical-only enforcement (D8 cutover) Set MAZEVAULT_ENV_CANONICAL_ENFORCE=false. No DB change needed.

3. Reversible parts (no PITR required)

  • Drop functional indexes added by 000121/000122: migrate down works for these.
  • Drop CHECK constraints added by 000125: migrate down restores legacy lookup indexes.
  • Toggle enforcement flag: MAZEVAULT_ENV_CANONICAL_ENFORCE=false re-opens the legacy compatibility window (the canonicalizer still normalizes inputs, but does not reject non-canonical writes at the boundary).
# Roll back constraint + enforcement flag only:
migrate -path backend/migrations -database "$DATABASE_URL" goto 124
# Then in deployment env:
export MAZEVAULT_ENV_CANONICAL_ENFORCE=false
# Restart backend / gateway.

This restores read behaviour to the pre-000125 state. Stored values remain canonical (lowercase).

4. Full PITR rollback (000123/000124)

WARNING: Point-in-time recovery is disruptive and may lose data committed after the migration. Coordinate with the customer before proceeding.

4.1 Pre-flight checks

  1. Confirm the WAL/backup retention covers the timestamp immediately before 000123 ran. The migration timestamp is recorded in schema_migrations — check applied_at.
  2. Quiesce all writers:
  3. Stop all mazevault-backend replicas.
  4. Stop all mazevault-gateway replicas (or detach from primary).
  5. Stop scheduled jobs (rotation, sync rules, agent pollers).
  6. Snapshot current state for forensics:
pg_dump --format=custom --file=mazevault-pre-rollback-$(date -u +%Y%m%d-%H%M%S).dump $DATABASE_URL

4.2 Recover

Cloud-managed (Azure Flexible Server, RDS, Cloud SQL)

  • Use the provider console: Restore to point in time → pick applied_at(000123) - 1 minute.
  • Restore creates a new instance. Repoint application via DNS or connection string.

Self-hosted (PITR with WAL archiving)

# 1. Stop PostgreSQL.
systemctl stop postgresql

# 2. Move current data dir aside.
mv $PGDATA $PGDATA.broken

# 3. Restore base backup (most recent before applied_at(000123)).
tar -xzf /var/backups/postgres/base-YYYYMMDD.tar.gz -C $PGDATA

# 4. Configure recovery (PostgreSQL 12+: postgresql.auto.conf + recovery.signal).
cat >> $PGDATA/postgresql.auto.conf <<EOF
restore_command = 'cp /var/backups/postgres/wal/%f %p'
recovery_target_time = 'YYYY-MM-DD HH:MM:SS+00'
recovery_target_action = 'promote'
EOF
touch $PGDATA/recovery.signal

# 5. Start and wait for promotion.
systemctl start postgresql

4.3 Post-recovery validation

-- Verify migration history rolled back.
SELECT version, dirty FROM schema_migrations ORDER BY version DESC LIMIT 5;
-- Expect: top entry < 000123.

-- Verify a known mixed-case sample row is restored.
SELECT slug FROM environments WHERE slug ~ '[A-Z]' LIMIT 5;

-- Verify gateway environment columns.
SELECT environment, supported_environments FROM gateways LIMIT 10;

4.4 Restart application stack

  1. Apply all post-rollback migrations except 000123/000124/000125 (they would re-introduce the issue):

migrate -path backend/migrations -database "$DATABASE_URL" goto 122
2. Start backend with MAZEVAULT_ENV_CANONICAL_ENFORCE=false. 3. Start gateways. 4. Resume scheduled jobs.

4.5 Customer communication

  • RTO target: 60 minutes for cloud-managed, 4 hours for self-hosted PITR.
  • RPO: ≤ 5 minutes (cloud) or last WAL archive (self-hosted).
  • Notify of any data loss window since applied_at(000123).

5. Forward fix (preferred)

If the issue is recoverable without PITR — for example a single tenant's row is malformed — fix in place:

-- Audit suspect rows.
SELECT id, slug FROM environments WHERE slug <> LOWER(TRIM(slug));

-- Fix one tenant.
BEGIN;
UPDATE environments
   SET slug = LOWER(TRIM(slug))
 WHERE organization_id = '...' AND slug <> LOWER(TRIM(slug));
-- Verify, then COMMIT or ROLLBACK.

Always prefer in-place fixes over rollback.

7. Change log

Date Author Change
2026-04-26 Platform Initial publication for canonical environment identifier migration.