Environment Identifier Migration — Rollback Runbook¶
Scope: Operational rollback procedure for migrations
000121–000125, which migrate every environment identifier in the database to a canonical lowercase form (prod,staging, etc.).Audience: Database administrators, on-call SRE, release engineers.
1. Why this runbook exists¶
Migrations 000123 (business backfill) and 000124 (gateway/control-plane backfill) perform lossy transformations on environment identifier columns:
After this runs, the original mixed-case form (PROD, Stage-2, dev) is no longer recoverable from the row alone. The *.down.sql files for these migrations are intentionally no-ops — calling migrate down will not restore the previous casing.
If you need to revert to the pre-migration state, follow this runbook.
2. Decision tree¶
| Symptom | Action |
|---|---|
| Application errors with "invalid environment slug" | Re-run app — DB is fine; client sent non-canonical identifier. Do not roll back. |
| Foreign data referencing old casing (e.g. external CMDB exports) | Update consumer to canonical form. Do not roll back. |
Constraint violation in 000125 (CHECK col = LOWER(col)) at deploy |
Investigate — 000123/000124 did not run cleanly. Apply 000123/000124 again, then 000125. Do not PITR. |
Confirmed data corruption introduced by 000123/000124 |
Proceed with PITR (Section 4). |
| Need to revert canonical-only enforcement (D8 cutover) | Set MAZEVAULT_ENV_CANONICAL_ENFORCE=false. No DB change needed. |
3. Reversible parts (no PITR required)¶
- Drop functional indexes added by
000121/000122:migrate downworks for these. - Drop CHECK constraints added by
000125:migrate downrestores legacy lookup indexes. - Toggle enforcement flag:
MAZEVAULT_ENV_CANONICAL_ENFORCE=falsere-opens the legacy compatibility window (the canonicalizer still normalizes inputs, but does not reject non-canonical writes at the boundary).
# Roll back constraint + enforcement flag only:
migrate -path backend/migrations -database "$DATABASE_URL" goto 124
# Then in deployment env:
export MAZEVAULT_ENV_CANONICAL_ENFORCE=false
# Restart backend / gateway.
This restores read behaviour to the pre-000125 state. Stored values remain canonical (lowercase).
4. Full PITR rollback (000123/000124)¶
WARNING: Point-in-time recovery is disruptive and may lose data committed after the migration. Coordinate with the customer before proceeding.
4.1 Pre-flight checks¶
- Confirm the WAL/backup retention covers the timestamp immediately before
000123ran. The migration timestamp is recorded inschema_migrations— checkapplied_at. - Quiesce all writers:
- Stop all
mazevault-backendreplicas. - Stop all
mazevault-gatewayreplicas (or detach from primary). - Stop scheduled jobs (rotation, sync rules, agent pollers).
- Snapshot current state for forensics:
4.2 Recover¶
Cloud-managed (Azure Flexible Server, RDS, Cloud SQL)¶
- Use the provider console: Restore to point in time → pick
applied_at(000123) - 1 minute. - Restore creates a new instance. Repoint application via DNS or connection string.
Self-hosted (PITR with WAL archiving)¶
# 1. Stop PostgreSQL.
systemctl stop postgresql
# 2. Move current data dir aside.
mv $PGDATA $PGDATA.broken
# 3. Restore base backup (most recent before applied_at(000123)).
tar -xzf /var/backups/postgres/base-YYYYMMDD.tar.gz -C $PGDATA
# 4. Configure recovery (PostgreSQL 12+: postgresql.auto.conf + recovery.signal).
cat >> $PGDATA/postgresql.auto.conf <<EOF
restore_command = 'cp /var/backups/postgres/wal/%f %p'
recovery_target_time = 'YYYY-MM-DD HH:MM:SS+00'
recovery_target_action = 'promote'
EOF
touch $PGDATA/recovery.signal
# 5. Start and wait for promotion.
systemctl start postgresql
4.3 Post-recovery validation¶
-- Verify migration history rolled back.
SELECT version, dirty FROM schema_migrations ORDER BY version DESC LIMIT 5;
-- Expect: top entry < 000123.
-- Verify a known mixed-case sample row is restored.
SELECT slug FROM environments WHERE slug ~ '[A-Z]' LIMIT 5;
-- Verify gateway environment columns.
SELECT environment, supported_environments FROM gateways LIMIT 10;
4.4 Restart application stack¶
- Apply all post-rollback migrations except
000123/000124/000125(they would re-introduce the issue):
MAZEVAULT_ENV_CANONICAL_ENFORCE=false.
3. Start gateways.
4. Resume scheduled jobs.
4.5 Customer communication¶
- RTO target: 60 minutes for cloud-managed, 4 hours for self-hosted PITR.
- RPO: ≤ 5 minutes (cloud) or last WAL archive (self-hosted).
- Notify of any data loss window since
applied_at(000123).
5. Forward fix (preferred)¶
If the issue is recoverable without PITR — for example a single tenant's row is malformed — fix in place:
-- Audit suspect rows.
SELECT id, slug FROM environments WHERE slug <> LOWER(TRIM(slug));
-- Fix one tenant.
BEGIN;
UPDATE environments
SET slug = LOWER(TRIM(slug))
WHERE organization_id = '...' AND slug <> LOWER(TRIM(slug));
-- Verify, then COMMIT or ROLLBACK.
Always prefer in-place fixes over rollback.
6. Related references¶
- Backup & Restore
- Gateway DR / Failover
- Migration files:
maze-core/backend/migrations/000121_*…000125_*.
7. Change log¶
| Date | Author | Change |
|---|---|---|
| 2026-04-26 | Platform | Initial publication for canonical environment identifier migration. |