Business Continuity & Disaster Recovery Plan¶
MazeVault Service Continuity, Resilience, and Recovery Framework
Document ID: MV-LEG-008
Version: 1.0.0
Classification: Confidential
Owner: Chief Information Security Officer (CISO)
Last Updated: 2026-05-01
Review Cycle: Annual
Approved By: CEO / Board of Directors
Regulatory Basis: Act No. 264/2025 Sb., NIS2 Directive Art. 21(2)(c), DORA Art. 11-12, ISO/IEC 27001:2022 A.17
1. Purpose and Scope¶
1.1 Purpose¶
This Business Continuity & Disaster Recovery Plan ("Plan") establishes the framework, procedures, and responsibilities for maintaining MazeVault service continuity during disruptive events and recovering operations following a disaster. The Plan ensures that critical business functions can continue at an acceptable level during and after a disruption, and that full service is restored within defined recovery objectives.
1.2 Scope¶
This Plan applies to:
- All MazeVault production systems, services, and infrastructure
- All deployment configurations: cloud-hosted backend, gateway installations (Kubernetes and on-premise), and agent deployments
- All data assets: customer secrets, encryption keys, certificates, configuration, and operational data
- All personnel involved in service delivery and incident response
- All environments where customer data is processed or stored
1.3 Objectives¶
- Maintain critical service functions during disruptive events
- Minimize data loss within defined Recovery Point Objectives (RPO)
- Restore services within defined Recovery Time Objectives (RTO)
- Protect the integrity and confidentiality of customer data during recovery operations
- Ensure compliance with regulatory continuity and resilience requirements
- Provide clear procedures for common disaster scenarios
1.4 Definitions¶
| Term | Definition |
|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable duration of service outage from the point of disruption to service restoration |
| RPO (Recovery Point Objective) | Maximum acceptable data loss measured in time; the point in time to which data must be recoverable |
| BCP (Business Continuity Plan) | Procedures to maintain critical business functions during a disruption |
| DR (Disaster Recovery) | Technical procedures to restore IT systems and data following a disaster |
| MTPD (Maximum Tolerable Period of Disruption) | Absolute maximum time before disruption causes unrecoverable business damage |
| BIA (Business Impact Analysis) | Assessment of the impact of disruption to business functions |
| Fencing Token | A monotonically increasing token used to prevent split-brain operations in distributed systems |
2. RTO/RPO Targets¶
2.1 Recovery Objectives by Scenario¶
| Disaster Scenario | RPO | RTO | MTPD | Priority |
|---|---|---|---|---|
| Database corruption (logical error, failed migration, data integrity failure) | <24 hours (daily backup) | <4 hours | 8 hours | Critical |
| Complete node failure (single server/pod crash, hardware failure) | <24 hours (daily backup) | <2 hours (Kubernetes) / <4 hours (on-premise) | 6 hours | Critical |
| Datacenter failure (full DC outage, network partition, cloud region failure) | <1 hour (with synchronous replication) | <8 hours | 12 hours | Critical |
| Encryption key loss (vault.key deleted or corrupted without backup) | 0 (with proper key backups) / PERMANENT LOSS (without backup) | <2 hours (with backup) / Unrecoverable (without backup) | N/A | Critical |
| Ransomware / complete data loss (malicious encryption, mass deletion, storage failure) | <24 hours (last clean backup) | <8 hours | 12 hours | Critical |
| Gateway failure (single gateway unavailability) | 0 (agents cache locally) | <30 minutes (failover) / <2 hours (manual) | 4 hours | High |
| Certificate authority compromise | 0 (re-issuance) | <4 hours | 8 hours | Critical |
2.2 Service Tier Classification¶
| Tier | Services | RTO | RPO |
|---|---|---|---|
| Tier 1 — Critical | Secret retrieval/storage, certificate issuance, authentication, audit logging | <2 hours | <1 hour |
| Tier 2 — Important | Gateway management, agent communication, monitoring, alerting | <4 hours | <24 hours |
| Tier 3 — Standard | Reporting, dashboards, non-critical notifications, analytics | <8 hours | <24 hours |
| Tier 4 — Deferrable | Documentation, development environments, non-production testing | <24 hours | <7 days |
3. Architecture for Resilience¶
3.1 System Architecture Overview¶
┌─────────────────────────────────────────────────────────────────┐
│ PRIMARY BACKEND │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ API │ │ Auth │ │ Secret │ │ Certificate │ │
│ │ Service │ │ Service │ │ Service │ │ Service │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └──────┬───────┘ │
│ │ │ │ │ │
│ ┌────┴──────────────┴──────────────┴───────────────┴────┐ │
│ │ PostgreSQL (Central DB) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────┴─────────┐ ┌───────────────┐ ┌────────────────┐ │
│ │ Redis Cache │ │ Secrets Vault │ │ Prometheus │ │
│ │ (Ephemeral) │ │ (Encrypted) │ │ (Monitoring) │ │
│ └──────────────┘ └───────────────┘ └────────────────┘ │
└─────────────────────────────┬───────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌───────────┴──────┐ ┌──────┴───────┐ ┌─────┴────────────┐
│ Environment: │ │ Environment: │ │ Environment: │
│ NPR │ │ PRO │ │ PRO-2 │
│ │ │ │ │ │
│ ┌─────────────┐ │ │ ┌──────────┐ │ │ ┌─────────────┐ │
│ │ GW Primary │ │ │ │GW Primary│ │ │ │ GW Primary │ │
│ │ (Active) │ │ │ │(Active) │ │ │ │ (Active) │ │
│ └──────┬──────┘ │ │ └────┬─────┘ │ │ └──────┬──────┘ │
│ │ │ │ │ │ │ │ │
│ ┌──────┴──────┐ │ │ ┌────┴─────┐ │ │ ┌──────┴──────┐ │
│ │ GW DR │ │ │ │GW DR │ │ │ │ GW DR │ │
│ │ (Standby) │ │ │ │(Standby) │ │ │ │ (Standby) │ │
│ └─────────────┘ │ │ └──────────┘ │ │ └─────────────┘ │
│ │ │ │ │ │ │ │ │
│ ┌────┴────┐ │ │ ┌───┴───┐ │ │ ┌────┴────┐ │
│ │ Agents │ │ │ │Agents │ │ │ │ Agents │ │
│ │(On-Prem)│ │ │ │(On-P.)│ │ │ │(On-Prem)│ │
│ └─────────┘ │ │ └───────┘ │ │ └─────────┘ │
└──────────────────┘ └──────────────┘ └──────────────────┘
3.2 CRDT-Based Multi-DC Synchronization¶
For multi-datacenter deployments, MazeVault employs Conflict-free Replicated Data Types (CRDTs) for synchronization:
- Purpose: Enable eventual consistency across geographically distributed gateways without requiring synchronous replication for all operations
- Conflict Resolution: CRDTs guarantee convergence regardless of message ordering or network partitions
- Data Types: Applied to configuration state, secret metadata, and certificate status propagation
- Consistency Model: Strong consistency for secret values (source of truth at backend); eventual consistency for operational metadata at gateway level
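The convergence property described above can be illustrated with one of the simplest CRDTs, a last-writer-wins register. This is a hedged sketch for illustration only (the `LWWRegister` class and its fields are hypothetical, not the MazeVault implementation): merge keeps the entry with the highest (timestamp, node_id) pair, so replicas reach the same state regardless of message ordering.

```python
import time
from dataclasses import dataclass

@dataclass
class LWWRegister:
    """Last-writer-wins register: a minimal CRDT sketch for metadata
    propagation. Hypothetical illustration, not MazeVault code."""
    value: object = None
    timestamp: int = 0
    node_id: str = ""

    def set(self, value, node_id):
        self.value = value
        self.timestamp = time.time_ns()
        self.node_id = node_id

    def merge(self, other):
        # The deterministic tie-break on node_id makes merge commutative,
        # associative, and idempotent -- the properties that guarantee
        # convergence regardless of delivery order or partitions.
        if (other.timestamp, other.node_id) > (self.timestamp, self.node_id):
            self.value = other.value
            self.timestamp = other.timestamp
            self.node_id = other.node_id
```

Because merge order does not matter, two gateways that exchange state after a partition heals end up with identical metadata, which is exactly why eventual consistency is acceptable for operational metadata but not for secret values.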
3.3 GatewayEnvironmentLock and Fencing Tokens¶
The GatewayEnvironmentLock mechanism prevents split-brain scenarios in multi-gateway environments:
- Single Active Gateway: Only ONE gateway may be active per environment at any given time
- Fencing Tokens: Each lock acquisition generates a monotonically increasing fencing token (UnixNano timestamp)
- Token Validation: All write operations validate the fencing token; stale tokens are rejected
- Split-Brain Prevention: If a previously-active gateway recovers after failover, its stale fencing token prevents it from making conflicting writes
- AutoFailover: Disabled by default; manual activation required to prevent unnecessary failovers from transient issues
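The lock-and-token behaviour above can be sketched as follows. This is a simplified, hypothetical model (class and method names are illustrative, not the MazeVault API): each acquisition issues a strictly increasing UnixNano-style token, and only the holder of the latest token counts as active.

```python
import threading
import time

class GatewayEnvironmentLock:
    """Per-environment lock issuing monotonic fencing tokens.
    Hypothetical sketch of the mechanism described above."""

    def __init__(self):
        self._mu = threading.Lock()
        self._active = {}     # environment -> (gateway_id, fencing_token)
        self._last_token = 0

    def acquire(self, environment, gateway_id):
        """Make gateway_id the single active gateway; return its token."""
        with self._mu:
            # UnixNano-style timestamp; max() guards against clock skew
            # so tokens are strictly monotonically increasing.
            token = max(time.time_ns(), self._last_token + 1)
            self._last_token = token
            self._active[environment] = (gateway_id, token)
            return token

    def is_active(self, environment, gateway_id, token):
        """A gateway is active only while it holds the latest token."""
        with self._mu:
            return self._active.get(environment) == (gateway_id, token)
```

When a DR gateway acquires the lock during failover, the recovered primary still holds its old, lower token, so it is no longer considered active and its writes can be fenced off (Section 6).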
4. Backup Strategy¶
4.1 Backup Schedule and Retention¶
| Asset | Backup Method | Frequency | Retention | Storage Location | Encryption |
|---|---|---|---|---|---|
| PostgreSQL Database | pg_dump (logical) / Azure Backup (if cloud-hosted) | Daily + before significant changes (migrations, major updates) | 30 days rolling | Separate storage account / off-site | AES-256 at rest |
| Encryption Keys (vault.key) | Manual copy to secure storage | On creation + on each rotation | Permanent (all historical versions) | Physically separate secure location (safe, HSM, separate DC) | Protected by physical security + access control |
| Secrets Vault (secrets.vault) | Copy vault file + associated key file | On every change + on key rotation | Permanent | Physically separate from vault.key | Inherently encrypted (vault encryption) |
| Configuration Files | Filesystem copy / version control | Before every change | 30 days rolling | Off-site backup storage | AES-256 at rest |
| TLS Certificates | Filesystem copy | Before every rotation | Until expired + 90 days | Off-site backup storage | AES-256 at rest |
| Redis | NOT BACKED UP | N/A | N/A | N/A | N/A |
| Prometheus Data | Snapshot (optional) | Weekly | 90 days | Off-site backup storage | AES-256 at rest |
| Audit Logs | Log export / replication | Continuous (real-time) | Minimum 3 years | Separate immutable storage | AES-256 at rest |
4.2 Redis — Ephemeral Cache Justification¶
Redis is intentionally excluded from backup procedures because:
- Used exclusively as a transient cache layer (sessions, rate limiting, temporary tokens)
- All authoritative data resides in PostgreSQL
- Cache is automatically rebuilt from the database on service restart
- No customer secrets or persistent state stored in Redis
- Recovery procedure: restart Redis; application automatically repopulates cache
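The "automatically rebuilt" recovery path is the standard cache-aside pattern. A minimal sketch, with hypothetical names (the dictionaries stand in for PostgreSQL and Redis respectively): a read that misses the cache falls back to the authoritative database and repopulates the cache, so a flushed cache heals under normal traffic with no restore step.

```python
class CacheAsideStore:
    """Cache-aside sketch illustrating why the transient cache layer
    needs no backup. Hypothetical illustration, not MazeVault code."""

    def __init__(self, database):
        self.database = database   # authoritative store (PostgreSQL role)
        self.cache = {}            # transient store (Redis role)

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.database[key]   # fall back to the source of truth
        self.cache[key] = value      # repopulate the cache on a miss
        return value

    def flush_cache(self):
        """Simulates a cache restart: all cached state is discarded."""
        self.cache.clear()
```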
4.3 Backup Verification¶
- Monthly: Automated backup integrity check (checksum verification, test restore to isolated environment)
- Quarterly: Full restore test to isolated environment with functional verification (see Section 7)
- On rotation: When encryption keys or vault files are rotated, verify new backup is complete and accessible
- Documentation: All verification results logged with pass/fail status and reviewer signature
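The monthly checksum verification step can be sketched as below. This is a hypothetical helper, not part of the MazeVault tooling: it streams the backup through SHA-256 and compares the result against the checksum recorded at backup time, returning both the pass/fail status and the actual digest so the result can be logged as this section requires.

```python
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(backup_path, expected_checksum):
    """Compare a backup against the checksum recorded at backup time.
    Hypothetical helper; returns (passed, actual_checksum) for logging."""
    actual = sha256_of(backup_path)
    return actual == expected_checksum, actual
```

A checksum match proves the file is bit-identical to what was written; it does not prove restorability, which is why the quarterly test-restore step remains mandatory.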
4.4 Off-Site Backup Requirements¶
- Backups SHALL be stored in a geographically separate location from production systems
- Minimum distance: different availability zone (cloud) or different physical building (on-premise)
- Access to backup storage SHALL require separate credentials from production systems
- Backup storage access SHALL be logged and monitored
- Backup encryption keys SHALL be stored separately from the encrypted backups
5. Disaster Recovery Procedures¶
5.1 Gateway DR Failover¶
Scenario: Primary gateway for an environment becomes unavailable and cannot be recovered within RTO.
Prerequisites:
- DR standby gateway is provisioned and current (configuration synchronized)
- GatewayEnvironmentLock is functioning correctly
- Network connectivity between DR gateway and backend confirmed
Procedure:
| Step | Action | Verification | Responsible |
|---|---|---|---|
| 1 | Verify primary gateway is confirmed down (3 consecutive heartbeat failures confirmed by GatewayHealthMonitor) | Health monitor alerts received; manual verification attempted | Operations Lead |
| 2 | Notify Incident Commander; obtain authorization for failover | IC acknowledgment documented | Operations Lead |
| 3 | Start DR AKS cluster (if not warm standby): kubectl scale deployment mazevault-gateway --replicas=1 -n mazevault-dr | Pods in Running state: kubectl get pods -n mazevault-dr | Infrastructure Engineer |
| 4 | Verify DR gateway pods are healthy and connected to backend | Pod logs show successful backend connection; health endpoint returns 200 | Infrastructure Engineer |
| 5 | Activate DR gateway via administrative API: POST /api/v1/admin/failover/{env}/activate-dr | API returns 200; new fencing token issued | Operations Lead |
| 6 | Verify DR gateway is now active and processing requests | Agent connectivity confirmed; test secret retrieval successful | Operations Lead |
| 7 | Monitor for 30 minutes for stability | No errors in logs; metrics nominal | Infrastructure Engineer |
| 8 | Notify affected customers of failover completion | Customer notification sent | Communications Lead |
Estimated Duration: 15-30 minutes (warm standby) / 30-60 minutes (cold start)
5.2 Gateway Failback¶
Scenario: Primary gateway has been recovered and service should be returned from DR to primary.
Prerequisites:
- Primary gateway fully recovered and verified healthy
- No active incidents on DR gateway
- Maintenance window scheduled (if possible)
Procedure:
| Step | Action | Verification | Responsible |
|---|---|---|---|
| 1 | Verify primary gateway is fully recovered and healthy | Health endpoint returns 200; all subsystems operational | Infrastructure Engineer |
| 2 | Synchronize any state accumulated on DR gateway to primary | Sync status confirmed; no data discrepancies | Infrastructure Engineer |
| 3 | Execute failback via administrative API: POST /api/v1/admin/failover/{env}/failback | API returns 200; new fencing token issued to primary | Operations Lead |
| 4 | Verify primary gateway is active and processing requests | Agent connectivity confirmed; test operations successful | Operations Lead |
| 5 | Monitor primary for 30 minutes for stability | No errors; metrics nominal | Infrastructure Engineer |
| 6 | Stop DR AKS cluster: kubectl scale deployment mazevault-gateway --replicas=0 -n mazevault-dr | DR pods terminated | Infrastructure Engineer |
| 7 | Document failback completion | Incident record updated; post-action report filed | Operations Lead |
5.3 Database Restore¶
Scenario: PostgreSQL database corruption, failed migration, or data integrity failure requiring restore from backup.
Procedure:
| Step | Action | Command/Verification | Responsible |
|---|---|---|---|
| 1 | Stop application services to prevent further writes | systemctl stop mazevault or scale deployment to 0 | Infrastructure Engineer |
| 2 | Assess corruption scope; determine restore point | Review logs; identify last known-good backup | Technical Lead |
| 3 | Create backup of current (corrupted) state for forensics | pg_dump -Fc mazevault_db > /backup/forensic-YYYYMMDD.dump | Infrastructure Engineer |
| 4 | Drop and recreate database (or restore to new instance) | dropdb mazevault_db && createdb mazevault_db | Infrastructure Engineer |
| 5 | Restore from selected backup | pg_restore -d mazevault_db /backup/mazevault-YYYYMMDD.dump | Infrastructure Engineer |
| 6 | Verify database integrity | Run application health checks; verify row counts; check constraints | Infrastructure Engineer |
| 7 | Start application services | systemctl start mazevault or scale deployment to desired replicas | Infrastructure Engineer |
| 8 | Verify full application health | Health endpoint returns 200; all subsystems connected; test operations | Operations Lead |
| 9 | Assess data loss (gap between backup and failure) | Document any data created after backup that is lost; notify affected customers if applicable | Technical Lead |
Estimated Duration: 1-4 hours depending on database size and backup method.
5.4 Full System Restore¶
Scenario: Complete system loss requiring rebuild from scratch (catastrophic hardware failure, ransomware, or new environment provisioning).
Procedure:
| Step | Action | Details | Responsible |
|---|---|---|---|
| 1 | Prepare Host | Provision server/VM meeting minimum requirements; install OS; configure networking | Infrastructure Engineer |
| 2 | Restore Configuration | Copy configuration files from backup to appropriate locations | Infrastructure Engineer |
| 3 | Restore TLS Certificates | Restore CA certificates, server certificates, and private keys from backup | Infrastructure Engineer |
| 4 | Load Container Images (if applicable) | Pull or load MazeVault container images to local registry | Infrastructure Engineer |
| 5 | Start Infrastructure Services | Start PostgreSQL, Redis; verify connectivity | Infrastructure Engineer |
| 6 | Restore Database | Execute database restore procedure (Section 5.3, steps 5-6) | Infrastructure Engineer |
| 7 | Restore Secrets Vault | Copy secrets.vault and vault.key from secure backup locations to designated paths | Infrastructure Engineer |
| 8 | Start Application Services | Start MazeVault backend services | Infrastructure Engineer |
| 9 | Verify System Health | Execute full recovery verification checklist (Section 9) | Operations Lead |
| 10 | Restore Gateway Connectivity | Verify gateways can connect; re-register if necessary | Operations Lead |
| 11 | Verify Agent Connectivity | Confirm agents reconnect to gateways; test secret retrieval | Operations Lead |
| 12 | Restore Monitoring | Verify Prometheus, alerting, and logging operational | Infrastructure Engineer |
| 13 | Document and Communicate | Update incident record; notify stakeholders of restoration | Communications Lead |
Estimated Duration: 4-8 hours depending on infrastructure provisioning time.
5.5 Encryption Key Recovery¶
Scenario: Loss or corruption of vault.key (the key that decrypts the secrets vault).
5.5.1 With Backup Available¶
| Step | Action | Responsible |
|---|---|---|
| 1 | Retrieve vault.key backup from secure off-site storage | CISO / Infrastructure Engineer |
| 2 | Verify key integrity (checksum comparison) | Infrastructure Engineer |
| 3 | Place key in designated secure location on target system | Infrastructure Engineer |
| 4 | Verify vault decryption: vault-tool list | Infrastructure Engineer |
| 5 | Verify all secrets accessible and decryptable | Operations Lead |
| 6 | Resume normal operations | Operations Lead |
Estimated Duration: <2 hours
5.5.2 Without Backup — PERMANENT DATA LOSS¶
CRITICAL WARNING: If vault.key is lost and no backup exists, all secrets encrypted by that key are permanently and irrecoverably lost. There is no recovery mechanism. This scenario requires complete secret re-creation.
| Step | Action | Responsible |
|---|---|---|
| 1 | Confirm no backup exists anywhere (verify all backup locations) | CISO |
| 2 | Accept permanent loss of existing encrypted secrets | CISO + CEO decision |
| 3 | Generate new encryption key: vault-tool init | Infrastructure Engineer |
| 4 | Immediately create multiple backups of new key in separate secure locations | CISO |
| 5 | Recreate all secrets manually (coordinate with each customer for their credentials) | Operations Lead + Customer Relations |
| 6 | Document incident and update procedures to prevent recurrence | CISO |
| 7 | Conduct post-incident review | Full IRT |
Estimated Duration: Days to weeks (depending on number of secrets to recreate)
6. Split-Brain Prevention¶
6.1 Problem Statement¶
In a distributed gateway architecture, a network partition or communication failure can result in multiple gateways believing they are the active instance for an environment. This "split-brain" condition can cause:
- Conflicting writes to shared resources
- Data inconsistency between gateways
- Certificate issuance conflicts
- Audit log divergence
6.2 GatewayEnvironmentLock Mechanism¶
The GatewayEnvironmentLock provides distributed mutual exclusion:
┌─────────────────────────────────────────────────────┐
│ GatewayEnvironmentLock │
├─────────────────────────────────────────────────────┤
│ Environment: PRO │
│ Active Gateway: gateway-pro-primary │
│ Fencing Token: 1714567890123456789 (UnixNano) │
│ Acquired: 2026-05-01T10:00:00Z │
│ Last Heartbeat: 2026-05-01T10:01:30Z │
│ AutoFailover: DISABLED │
└─────────────────────────────────────────────────────┘
6.3 Operating Parameters¶
| Parameter | Value | Description |
|---|---|---|
| HeartbeatTimeout | 2 minutes | Maximum time between heartbeats before gateway considered unhealthy |
| HealthCheckInterval | 30 seconds | Frequency of health check probes |
| FailureThreshold | 3 consecutive failures | Number of consecutive missed heartbeats before declaring failure |
| AutoFailover | OFF (disabled by default) | Automatic failover to DR gateway is not performed without explicit operator action |
| FencingTokenType | UnixNano (monotonically increasing) | Ensures ordering of lock acquisitions |
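The FailureThreshold behaviour in the table above amounts to a consecutive-failure counter. A minimal sketch with hypothetical names (the real GatewayHealthMonitor may differ): a gateway is declared failed only after three consecutive missed heartbeats, and any successful probe resets the counter, so a single transient miss never starts the (manual) failover path.

```python
class GatewayHealthMonitor:
    """Consecutive-failure counter matching the parameters above.
    Hypothetical sketch, not the MazeVault implementation."""

    FAILURE_THRESHOLD = 3  # consecutive misses before declaring failure

    def __init__(self):
        self.consecutive_failures = 0
        self.failed = False

    def record_probe(self, healthy):
        """Record one health-check result; return current failed state."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the count
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.FAILURE_THRESHOLD:
                # Alert operators; AutoFailover stays off by default,
                # so this only raises the alarm -- it does not fail over.
                self.failed = True
        return self.failed
```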
6.4 Fencing Token Operation¶
- When a gateway acquires the environment lock, it receives a fencing token (current UnixNano timestamp)
- All write operations to shared state include the fencing token
- The backend rejects any operation whose fencing token is lower than the highest token it has seen; operations from the current token holder remain valid
- This guarantees that even if a stale gateway attempts operations after a failover, its outdated token is rejected
- The new active gateway's higher token ensures its operations take precedence
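The backend-side check described above can be sketched as follows. This is a hypothetical illustration (class and method names are not the MazeVault API): the store remembers the highest fencing token it has seen, accepts writes from the current holder, and rejects anything carrying an older token.

```python
class FencedStateStore:
    """Backend-side enforcement of fencing tokens.
    Hypothetical sketch of the behaviour described above."""

    def __init__(self):
        self.highest_token = 0
        self.state = {}

    def write(self, fencing_token, key, value):
        """Apply a write only if its token is not older than the
        highest token seen; return True if the write was accepted."""
        if fencing_token < self.highest_token:
            return False   # stale token from a deposed gateway: reject
        self.highest_token = fencing_token
        self.state[key] = value
        return True
```

After a failover, the deposed primary's token is lower than the DR gateway's, so its late writes are refused even if it believes it is still active, which is the split-brain guarantee this section describes.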
6.5 AutoFailover Disabled — Rationale¶
AutoFailover is disabled by default because:
- Transient network issues (brief partitions, DNS delays) could trigger unnecessary failovers
- Failover introduces risk of data inconsistency if not properly sequenced
- Operator verification ensures the primary is genuinely failed (not just temporarily unreachable)
- Manual failover allows coordinated preparation of the DR environment
- Reduces risk of "flapping" between primary and DR during unstable conditions
AutoFailover MAY be enabled for specific environments after explicit risk acceptance by the CISO, with appropriate safeguards (extended failure threshold, confirmation delay).
7. Testing Program¶
7.1 Test Schedule¶
| Test Type | Frequency | Scope | Duration | Participants |
|---|---|---|---|---|
| Backup Restore Test | Quarterly | Full restore of database and vault to isolated environment | 4-8 hours | Infrastructure + Operations |
| Gateway Failover Exercise | Semi-annual | Execute full failover and failback procedure in test environment | 2-4 hours | Operations + Infrastructure |
| Full DR Simulation | Annual | Simulate complete datacenter failure; recover all services | 1 full day | All BCP roles |
| Backup Integrity Verification | Monthly | Automated checksum verification of all backups | Automated | Infrastructure (review results) |
| Communication Test | Quarterly | Verify all notification channels and escalation contacts | 1 hour | Communications Lead + IRT |
| Tabletop Exercise | Semi-annual | Scenario-based discussion of BCP/DR decision-making | 2-3 hours | All BCP roles + management |
7.2 Quarterly DR Test Procedure¶
| Step | Action | Success Criteria |
|---|---|---|
| 1 | Provision isolated test environment | Environment accessible; no connectivity to production |
| 2 | Restore database from latest production backup | Restore completes without errors; data integrity verified |
| 3 | Restore secrets vault and vault.key | vault-tool list returns expected secret count |
| 4 | Verify all secrets accessible and decryptable | Sample of secrets retrieved and validated |
| 5 | Verify certificate operations | Test certificate issuance and validation |
| 6 | Verify application health endpoints | All health checks pass |
| 7 | Document results | Test report completed with pass/fail for each criterion |
| 8 | Destroy test environment | All test data securely wiped |
7.3 Test Documentation Requirements¶
Each test SHALL produce a report containing:
- Test date, participants, and environment used
- Procedure followed (reference to this document or deviation explanation)
- Pass/fail status for each test criterion
- Actual RTO achieved vs. target RTO
- Issues encountered and their resolution
- Recommendations for procedure improvements
- Sign-off by Operations Lead
7.4 Test Failure Remediation¶
- Any test failure SHALL be treated as a high-priority finding
- Root cause analysis required within 5 business days
- Corrective action plan with owner and deadline
- Re-test scheduled within 30 days of corrective action completion
- Repeated failures escalated to CISO for risk assessment
8. Encrypted Vault Backup¶
8.1 Critical Files¶
| File | Purpose | Criticality |
|---|---|---|
| secrets.vault | Contains all encrypted credentials, API keys, database passwords, and customer secrets | CRITICAL — Loss without backup means recreation of all secrets required |
| vault.key | Decryption passphrase/key for the secrets vault | CRITICAL — Loss without backup means PERMANENT, IRRECOVERABLE loss of all vault contents |
8.2 Storage Requirements¶
MANDATORY: secrets.vault and vault.key SHALL be stored in physically separate locations.
| Requirement | secrets.vault | vault.key |
|---|---|---|
| Storage Location | Off-site backup (separate from production) | Physically separate from secrets.vault (different facility, safe, or HSM) |
| Access Control | Restricted to authorized operations personnel | Restricted to CISO + designated backup custodian only |
| Physical Security | Encrypted storage; access-logged facility | Secure safe, HSM, or equivalent physical protection |
| Copies | Minimum 2 copies in separate locations | Minimum 2 copies in separate locations (NEVER co-located with vault file) |
| Digital Protection | Encrypted at rest (backup storage encryption) | Encrypted at rest; additionally protected by separate passphrase if stored digitally |
8.3 Quarterly Verification Procedure¶
| Step | Action | Expected Result |
|---|---|---|
| 1 | Retrieve vault.key backup from secure storage | Key file accessible; integrity seal intact |
| 2 | Retrieve secrets.vault backup from separate storage | Vault file accessible; modification date matches last backup |
| 3 | Deploy both to isolated verification environment | Files in place |
| 4 | Execute: vault-tool list | Returns complete list of stored secrets (matches production count) |
| 5 | Retrieve sample secrets and verify decryption | Secrets decrypt correctly; values match production |
| 6 | Document verification result | Signed verification record filed |
| 7 | Securely return backup files to storage | Files re-sealed in secure storage |
8.4 Rotation Procedures¶
When encryption keys or vault files are rotated:
- Create backup of new vault and key immediately after rotation
- Verify new backup (vault-tool list in isolated environment)
- Retain previous vault.key version (required to decrypt any backup made with the old key)
- Update backup inventory documentation
- Notify backup custodians of update
9. Recovery Verification¶
9.1 Health Check Verification¶
After any recovery operation, the following checks SHALL all pass before declaring recovery complete:
| Check | Method | Expected Result |
|---|---|---|
| Application Health | GET /api/v1/health | HTTP 200; all subsystem statuses "healthy" |
| License Status | Health endpoint or admin API | License status: active |
| Database Connectivity | Health endpoint subsystem check | PostgreSQL connection pool active; queries executing |
| Redis Connectivity | Health endpoint subsystem check | Redis connection established; cache operations functional |
| Audit Logging | Perform test action; verify audit event created | Audit event written with correct hash chain continuation |
| Prometheus Metrics | Check Prometheus targets | All scrape targets UP; metrics flowing |
| Structured Logging | Review application log output | JSON-formatted log entries appearing with expected fields |
| Secret Operations | Retrieve a test secret | Secret retrieved successfully; decrypted correctly |
| Certificate Operations | Issue test certificate (non-production) | Certificate issued with valid chain |
| Gateway Connectivity | Verify gateway heartbeat received | Gateway reporting healthy; heartbeat within timeout |
| Agent Connectivity | Verify agent check-in | At least one agent successfully communicating |
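The "all checks SHALL pass" rule above can be sketched as a small verification runner. This is a hypothetical harness, not MazeVault tooling: each check is a callable returning True/False, a crashing probe counts as a failure, and recovery is declared complete only when every check passes.

```python
def run_recovery_verification(checks):
    """Run every recovery check and report per-check pass/fail.

    `checks` maps a check name to a zero-argument callable returning
    True/False. Returns (all_passed, per_check_results) so results can
    be logged before declaring recovery complete. Hypothetical sketch."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False   # a crashing probe is a failed check
    return all(results.values()), results
```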
9.2 Extended Monitoring Period¶
After recovery verification passes:
- First 2 hours: Enhanced monitoring with reduced alert thresholds
- First 24 hours: On-call engineer actively monitoring system behavior
- First 72 hours: Daily review of all metrics for anomalies
- After 72 hours: Return to standard monitoring if no issues detected
9.3 Recovery Failure Escalation¶
If recovery verification fails:
- Identify failing check and assess impact
- Attempt corrective action (restart service, re-apply configuration)
- If not resolved within 30 minutes, escalate to Incident Commander
- Consider alternative recovery approach (different backup, different procedure)
- Document failure and resolution for procedure improvement
10. Roles and Responsibilities During BCP¶
10.1 BCP Organization¶
| Role | Primary Holder | Responsibilities |
|---|---|---|
| Incident Commander | CISO | Overall BCP authority; decision-making; escalation; regulatory communication |
| Operations Lead | Head of Operations | Coordinate recovery activities; direct technical teams; manage timeline |
| Infrastructure Engineer | Sr. DevOps Engineer | Execute technical recovery procedures; system administration; monitoring |
| Communications Lead | Head of Customer Success | Customer notification; internal communications; status page management |
| Technical Lead | CTO | Technical decision-making; architecture guidance; vendor coordination |
| Legal Counsel | Legal Advisor | Regulatory obligations; contractual requirements; liability assessment |
10.2 Decision Authority Matrix¶
| Decision | Authority | Escalation |
|---|---|---|
| Activate BCP | Incident Commander (CISO) | CEO if CISO unavailable |
| Authorize failover | Incident Commander | Operations Lead (if IC pre-authorized) |
| Customer communication content | Communications Lead + Legal Counsel | Incident Commander for approval |
| Accept data loss (RPO exceeded) | CEO | Board if loss exceeds defined threshold |
| Engage external vendors/consultants | Incident Commander | CEO if cost exceeds pre-approved budget |
| Declare recovery complete | Incident Commander (after verification) | N/A |
| Deactivate BCP | Incident Commander | CEO if incident duration >48 hours |
10.3 Succession and Availability¶
- Each primary role holder SHALL have a designated deputy
- Contact information maintained in secure, accessible location (not solely dependent on affected systems)
- If primary and deputy both unavailable, next in organizational hierarchy assumes responsibility
- Minimum 2 qualified personnel must be reachable at all times for Tier 1 service recovery
11. BCP Activation Criteria¶
11.1 Automatic Activation Triggers¶
The BCP SHALL be activated immediately when any of the following conditions are confirmed:
| Trigger | Condition | Activation Level |
|---|---|---|
| Major Incident (P1) | Critical security incident with service impact exceeding 30 minutes | Full BCP |
| Datacenter Failure | Complete loss of primary hosting environment | Full BCP |
| Ransomware Confirmation | Confirmed ransomware deployment affecting production systems | Full BCP |
| Key Compromise | Confirmed compromise or loss of encryption keys (vault.key) | Full BCP |
| Extended Outage | Any unplanned outage exceeding 50% of the applicable RTO | Partial BCP (escalate if not resolving) |
| Multi-System Failure | Simultaneous failure of 2+ Tier 1 services | Full BCP |
11.2 Discretionary Activation¶
The Incident Commander MAY activate BCP at their discretion for:
- Credible threat intelligence suggesting imminent attack
- Significant degradation trending toward failure
- Vendor/provider outage with uncertain resolution timeline
- Natural disaster or physical security threat to hosting facilities
- Pandemic or staffing emergency affecting operational capability
11.3 Activation Process¶
- Trigger condition identified and confirmed
- Incident Commander (or deputy) formally declares BCP activation
- All BCP role holders notified via emergency communication channels
- BCP war room established (physical or virtual)
- Initial situation assessment conducted (within 30 minutes)
- Recovery strategy selected based on scenario and available resources
- Recovery activities commence per applicable procedure (Section 5)
12. Communication During Outage¶
12.1 Internal Communication¶
| Channel | Purpose | Frequency |
|---|---|---|
| War Room (dedicated video/chat channel) | Real-time coordination among BCP team | Continuous during active incident |
| Status Updates (internal) | Broader team awareness | Every 30 minutes during active recovery |
| Executive Briefing | CEO/Board situation awareness | Every 2 hours or on significant changes |
| All-Hands Update | Full company awareness (if major) | As needed; minimum daily during extended outage |
12.2 Customer Communication¶
| Phase | Communication | Channel | Timing |
|---|---|---|---|
| Initial | Service disruption acknowledged; investigation underway | Status page + email | Within 30 minutes of detection |
| Updates | Progress, estimated resolution, workarounds | Status page | Every 60 minutes (minimum) |
| Resolution | Service restored; summary of impact | Status page + email | Within 1 hour of recovery |
| Post-Mortem | Root cause, remediation actions, prevention measures | Email + portal publication | Within 5 business days |
12.3 Status Page Protocol¶
- Status page SHALL be hosted on infrastructure independent of MazeVault production systems
- Status page updates SHALL NOT contain sensitive security details
- Status categories: Operational → Degraded Performance → Partial Outage → Major Outage
- Per-service status granularity: Backend API, Gateway Services, Certificate Services, Monitoring
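The escalation ladder and per-service granularity above imply an ordered severity scale. A minimal sketch (the class and function names are hypothetical, and the assumption that the page-level status is the worst per-service status is ours, not stated in this Plan):

```python
from enum import IntEnum

class ServiceStatus(IntEnum):
    """Ordered per the Section 12.3 ladder: higher value = more severe."""
    OPERATIONAL = 0
    DEGRADED_PERFORMANCE = 1
    PARTIAL_OUTAGE = 2
    MAJOR_OUTAGE = 3

def page_status(per_service: dict) -> ServiceStatus:
    # Assumed aggregation rule: the overall page status is the worst
    # status across the tracked services (Backend API, Gateway Services,
    # Certificate Services, Monitoring).
    return max(per_service.values(), default=ServiceStatus.OPERATIONAL)
```

Because `IntEnum` members compare as integers, `max()` naturally selects the most severe status; an empty roster defaults to Operational.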
12.4 Communication Templates¶
Templates for customer communication during outage SHALL be pre-approved by Legal Counsel and ready for immediate use. Templates SHALL cover:
- Initial outage acknowledgment
- Periodic status updates (with and without ETA)
- Service restoration notification
- Post-incident summary
13. Compliance Mapping¶
| Regulatory Requirement | Section in This Document |
|---|---|
| NIS2 Directive Art. 21(2)(c) — Business continuity and crisis management | Sections 1-12 (comprehensive BCP/DR framework) |
| DORA Art. 11 — ICT business continuity policy | Sections 1-4 (policy, objectives, architecture, backups) |
| DORA Art. 12 — ICT response and recovery plans | Sections 5-6 (DR procedures, split-brain prevention) |
| DORA Art. 12(2) — Backup policies and restoration methods | Section 4 (backup strategy), Section 5 (restore procedures) |
| DORA Art. 12(3) — Redundancy for critical functions | Section 3 (architecture for resilience), Section 6 (split-brain) |
| DORA Art. 12(5) — Testing of ICT business continuity plans | Section 7 (testing program) |
| ISO/IEC 27001:2022 A.5.29 — Information security during disruption | Sections 3, 5, 6 (security maintained during recovery) |
| ISO/IEC 27001:2022 A.5.30 — ICT readiness for business continuity | Sections 2-4 (RTO/RPO, architecture, backups) |
| ISO/IEC 27001:2022 A.8.13 — Information backup | Section 4 (backup strategy) |
| ISO/IEC 27001:2022 A.8.14 — Redundancy of information processing facilities | Section 3 (multi-DC architecture, DR standby) |
| SOC 2 A1.1 — Recovery objectives defined | Section 2 (RTO/RPO targets) |
| SOC 2 A1.2 — Recovery procedures exist and are tested | Sections 5, 7 (procedures and testing) |
| SOC 2 A1.3 — Recovery procedures support objectives | Section 9 (recovery verification against objectives) |
| Act No. 264/2025 Sb. — Continuity requirements for essential services | Sections 1-12 (comprehensive framework) |
14. Related Documents¶
| Document ID | Title |
|---|---|
| MV-LEG-001 | Information Security Policy |
| MV-LEG-002 | Cryptography Policy |
| MV-LEG-003 | Risk Management Policy |
| MV-LEG-004 | Logging & Monitoring Policy |
| MV-LEG-005 | Access Control Policy |
| MV-LEG-006 | Vulnerability & Patch Management Policy |
| MV-LEG-007 | Incident Response Plan |
Appendix A: Emergency Contact Quick Reference¶
| Role | Primary Contact | Backup Contact | Availability |
|---|---|---|---|
| Incident Commander (CISO) | [CISO Name / Phone / Email] | [Deputy / Phone / Email] | 24/7 |
| Operations Lead | [Name / Phone / Email] | [Deputy / Phone / Email] | 24/7 |
| Infrastructure Engineer | [Name / Phone / Email] | [Deputy / Phone / Email] | 24/7 (on-call rotation) |
| Communications Lead | [Name / Phone / Email] | [Deputy / Phone / Email] | Business hours + on-call |
| CEO | [Name / Phone / Email] | [COO / Phone / Email] | 24/7 for P1 |
| Legal Counsel | [Name / Phone / Email] | [External Firm / Phone] | Business hours + emergency retainer |
Note: Actual contact details are maintained in a secure, separately accessible contact roster (not in this document).
Appendix B: Quick Decision Flowchart¶
Service Disruption Detected
         │
         ▼
Is production affected?
         │
  Yes ──┼── No → Monitor; standard incident process
         │
         ▼
Can service be restored within 30 minutes?
         │
  Yes ──┼── No → Activate BCP
   │                 │
   ▼                 ▼
Standard        Identify scenario (Section 5)
incident             │
response             ├─ Gateway failure → Section 5.1
                     ├─ Database issue → Section 5.3
                     ├─ Full system loss → Section 5.4
                     ├─ Key loss → Section 5.5
                     └─ DC failure → Section 5.1 + 5.4
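The decision flow above can be expressed as a small triage helper. This is a sketch for illustration only: the scenario keys and the function itself are hypothetical, while the thresholds and section references come from this Plan.

```python
def triage(production_affected: bool,
           restorable_within_30_min: bool,
           scenario: str = "") -> str:
    """Map the Appendix B flowchart to a recommended action."""
    if not production_affected:
        return "Monitor; standard incident process"
    if restorable_within_30_min:
        return "Standard incident response"
    # Outage exceeds the 30-minute threshold: activate BCP and route
    # to the applicable recovery procedure (Section 5).
    procedures = {
        "gateway_failure": "Section 5.1",
        "database_issue": "Section 5.3",
        "full_system_loss": "Section 5.4",
        "key_loss": "Section 5.5",
        "dc_failure": "Section 5.1 + 5.4",
    }
    ref = procedures.get(scenario, "Section 5")
    return f"Activate BCP; proceed per {ref}"
```

An unrecognized scenario deliberately falls back to the general Section 5 entry point rather than failing, mirroring the flowchart's "Identify scenario" step.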
Document Control¶
| Version | Date | Author | Change Description |
|---|---|---|---|
| 1.0.0 | 2026-05-01 | CISO | Initial release |
END OF DOCUMENT