
Business Continuity & Disaster Recovery Plan

MazeVault Service Continuity, Resilience, and Recovery Framework

Document ID: MV-LEG-008
Version: 1.0.0
Classification: Confidential
Owner: Chief Information Security Officer (CISO)
Last Updated: 2026-05-01
Review Cycle: Annual
Approved By: CEO / Board of Directors
Regulatory Basis: Act No. 264/2025 Sb., NIS2 Directive Art. 21(2)(c), DORA Art. 11-12, ISO/IEC 27001:2022 A.5.29-A.5.30, A.8.13-A.8.14


1. Purpose and Scope

1.1 Purpose

This Business Continuity & Disaster Recovery Plan ("Plan") establishes the framework, procedures, and responsibilities for maintaining MazeVault service continuity during disruptive events and recovering operations following a disaster. The Plan ensures that critical business functions can continue at an acceptable level during and after a disruption, and that full service is restored within defined recovery objectives.

1.2 Scope

This Plan applies to:

  • All MazeVault production systems, services, and infrastructure
  • All deployment configurations: cloud-hosted backend, gateway installations (Kubernetes and on-premise), and agent deployments
  • All data assets: customer secrets, encryption keys, certificates, configuration, and operational data
  • All personnel involved in service delivery and incident response
  • All environments where customer data is processed or stored

1.3 Objectives

  • Maintain critical service functions during disruptive events
  • Minimize data loss within defined Recovery Point Objectives (RPO)
  • Restore services within defined Recovery Time Objectives (RTO)
  • Protect the integrity and confidentiality of customer data during recovery operations
  • Ensure compliance with regulatory continuity and resilience requirements
  • Provide clear procedures for common disaster scenarios

1.4 Definitions

Term | Definition
RTO (Recovery Time Objective) | Maximum acceptable duration of service outage from the point of disruption to service restoration
RPO (Recovery Point Objective) | Maximum acceptable data loss measured in time; the point in time to which data must be recoverable
BCP (Business Continuity Plan) | Procedures to maintain critical business functions during a disruption
DR (Disaster Recovery) | Technical procedures to restore IT systems and data following a disaster
MTPD (Maximum Tolerable Period of Disruption) | Absolute maximum time before disruption causes unrecoverable business damage
BIA (Business Impact Analysis) | Assessment of the impact of disruption to business functions
Fencing Token | A monotonically increasing token used to prevent split-brain operations in distributed systems

2. RTO/RPO Targets

2.1 Recovery Objectives by Scenario

Disaster Scenario | RPO | RTO | MTPD | Priority
Database corruption (logical error, failed migration, data integrity failure) | <24 hours (daily backup) | <4 hours | 8 hours | Critical
Complete node failure (single server/pod crash, hardware failure) | <24 hours (daily backup) | <2 hours (Kubernetes) / <4 hours (on-premise) | 6 hours | Critical
Datacenter failure (full DC outage, network partition, cloud region failure) | <1 hour (with synchronous replication) | <8 hours | 12 hours | Critical
Encryption key loss (vault.key deleted or corrupted without backup) | 0 (with proper key backups) / PERMANENT LOSS (without backup) | <2 hours (with backup) / Unrecoverable (without backup) | N/A | Critical
Ransomware / complete data loss (malicious encryption, mass deletion, storage failure) | <24 hours (last clean backup) | <8 hours | 12 hours | Critical
Gateway failure (single gateway unavailability) | 0 (agents cache locally) | <30 minutes (failover) / <2 hours (manual) | 4 hours | High
Certificate authority compromise | 0 (re-issuance) | <4 hours | 8 hours | Critical

2.2 Service Tier Classification

Tier | Services | RTO | RPO
Tier 1 — Critical | Secret retrieval/storage, certificate issuance, authentication, audit logging | <2 hours | <1 hour
Tier 2 — Important | Gateway management, agent communication, monitoring, alerting | <4 hours | <24 hours
Tier 3 — Standard | Reporting, dashboards, non-critical notifications, analytics | <8 hours | <24 hours
Tier 4 — Deferrable | Documentation, development environments, non-production testing | <24 hours | <7 days

3. Architecture for Resilience

3.1 System Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                    PRIMARY BACKEND                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐   │
│  │ API      │  │ Auth     │  │ Secret   │  │ Certificate  │   │
│  │ Service  │  │ Service  │  │ Service  │  │ Service      │   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └──────┬───────┘   │
│       │              │              │               │            │
│  ┌────┴──────────────┴──────────────┴───────────────┴────┐      │
│  │              PostgreSQL (Central DB)                    │      │
│  └────────────────────────────────────────────────────────┘      │
│       │                                                          │
│  ┌────┴─────────┐  ┌───────────────┐  ┌────────────────┐       │
│  │ Redis Cache  │  │ Secrets Vault │  │ Prometheus     │       │
│  │ (Ephemeral)  │  │ (Encrypted)   │  │ (Monitoring)   │       │
│  └──────────────┘  └───────────────┘  └────────────────┘       │
└─────────────────────────────┬───────────────────────────────────┘
            ┌─────────────────┼─────────────────┐
            │                 │                 │
┌───────────┴──────┐  ┌──────┴───────┐  ┌─────┴────────────┐
│   Environment:   │  │ Environment: │  │  Environment:    │
│      NPR         │  │     PRO      │  │     PRO-2        │
│                  │  │              │  │                  │
│ ┌─────────────┐  │  │ ┌──────────┐ │  │ ┌─────────────┐  │
│ │ GW Primary  │  │  │ │GW Primary│ │  │ │ GW Primary  │  │
│ │ (Active)    │  │  │ │(Active)  │ │  │ │ (Active)    │  │
│ └──────┬──────┘  │  │ └────┬─────┘ │  │ └──────┬──────┘  │
│        │         │  │      │       │  │        │         │
│ ┌──────┴──────┐  │  │ ┌────┴─────┐ │  │ ┌──────┴──────┐  │
│ │ GW DR       │  │  │ │GW DR     │ │  │ │ GW DR       │  │
│ │ (Standby)   │  │  │ │(Standby) │ │  │ │ (Standby)   │  │
│ └─────────────┘  │  │ └──────────┘ │  │ └─────────────┘  │
│        │         │  │      │       │  │        │         │
│   ┌────┴────┐    │  │  ┌───┴───┐   │  │   ┌────┴────┐    │
│   │ Agents  │    │  │  │Agents │   │  │   │ Agents  │    │
│   │(On-Prem)│    │  │  │(On-P.)│   │  │   │(On-Prem)│    │
│   └─────────┘    │  │  └───────┘   │  │   └─────────┘    │
└──────────────────┘  └──────────────┘  └──────────────────┘

3.2 CRDT-Based Multi-DC Synchronization

For multi-datacenter deployments, MazeVault employs Conflict-free Replicated Data Types (CRDTs) for synchronization:

  • Purpose: Enable eventual consistency across geographically distributed gateways without requiring synchronous replication for all operations
  • Conflict Resolution: CRDTs guarantee convergence regardless of message ordering or network partitions
  • Data Types: Applied to configuration state, secret metadata, and certificate status propagation
  • Consistency Model: Strong consistency for secret values (source of truth at backend); eventual consistency for operational metadata at gateway level
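The convergence property relied on above can be illustrated with a minimal last-writer-wins (LWW) register, a simple CRDT suitable for operational metadata. This is a sketch only: the type, field names, and tie-breaking rule are illustrative assumptions, not the MazeVault implementation.

```go
// Illustrative LWW-register CRDT: a merge that is commutative, associative,
// and idempotent, so all replicas converge regardless of message ordering.
// Names and fields are hypothetical, not taken from the MazeVault codebase.
package crdt

// LWWRegister keeps the value with the highest timestamp; ties are broken
// deterministically by node ID so every replica reaches the same state.
type LWWRegister struct {
	Value     string
	Timestamp int64  // e.g. UnixNano of the last update
	NodeID    string // deterministic tie-breaker
}

// Merge folds a remote copy into the local register.
func (r *LWWRegister) Merge(remote LWWRegister) {
	if remote.Timestamp > r.Timestamp ||
		(remote.Timestamp == r.Timestamp && remote.NodeID > r.NodeID) {
		r.Value = remote.Value
		r.Timestamp = remote.Timestamp
		r.NodeID = remote.NodeID
	}
}
```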

3.3 GatewayEnvironmentLock and Fencing Tokens

The GatewayEnvironmentLock mechanism prevents split-brain scenarios in multi-gateway environments:

  • Single Active Gateway: Only ONE gateway may be active per environment at any given time
  • Fencing Tokens: Each lock acquisition generates a monotonically increasing fencing token (UnixNano timestamp)
  • Token Validation: All write operations validate the fencing token; stale tokens are rejected
  • Split-Brain Prevention: If a previously-active gateway recovers after failover, its stale fencing token prevents it from making conflicting writes
  • AutoFailover: Disabled by default; manual activation required to prevent unnecessary failovers from transient issues

4. Backup Strategy

4.1 Backup Schedule and Retention

Asset | Backup Method | Frequency | Retention | Storage Location | Encryption
PostgreSQL Database | pg_dump (logical) / Azure Backup (if cloud-hosted) | Daily + before significant changes (migrations, major updates) | 30 days rolling | Separate storage account / off-site | AES-256 at rest
Encryption Keys (vault.key) | Manual copy to secure storage | On creation + on each rotation | Permanent (all historical versions) | Physically separate secure location (safe, HSM, separate DC) | Protected by physical security + access control
Secrets Vault (secrets.vault) | Copy vault file + associated key file | On every change + on key rotation | Permanent | Physically separate from vault.key | Inherently encrypted (vault encryption)
Configuration Files | Filesystem copy / version control | Before every change | 30 days rolling | Off-site backup storage | AES-256 at rest
TLS Certificates | Filesystem copy | Before every rotation | Until expired + 90 days | Off-site backup storage | AES-256 at rest
Redis | NOT BACKED UP | N/A | N/A | N/A | N/A
Prometheus Data | Snapshot (optional) | Weekly | 90 days | Off-site backup storage | AES-256 at rest
Audit Logs | Log export / replication | Continuous (real-time) | Minimum 3 years | Separate immutable storage | AES-256 at rest

4.2 Redis — Ephemeral Cache Justification

Redis is intentionally excluded from backup procedures because:

  • Used exclusively as a transient cache layer (sessions, rate limiting, temporary tokens)
  • All authoritative data resides in PostgreSQL
  • Cache is automatically rebuilt from the database on service restart
  • No customer secrets or persistent state stored in Redis
  • Recovery procedure: restart Redis; application automatically repopulates cache
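The recovery property described above follows from a cache-aside read path. The sketch below is illustrative only; the interfaces and lookup shown here are hypothetical and do not represent the actual MazeVault data access layer.

```go
// Cache-aside sketch: a miss (including an empty Redis after restart) falls
// back to PostgreSQL and repopulates the cache, so no Redis backup is needed.
package cache

import "errors"

var ErrMiss = errors.New("cache miss")

// Cache is any ephemeral key-value layer (e.g. Redis).
type Cache interface {
	Get(key string) (string, error) // returns ErrMiss when absent
	Set(key, value string) error
}

// Database is the authoritative store (PostgreSQL).
type Database interface {
	Load(key string) (string, error)
}

// Lookup reads through the cache and rebuilds it lazily on misses.
func Lookup(c Cache, db Database, key string) (string, error) {
	if v, err := c.Get(key); err == nil {
		return v, nil
	} else if !errors.Is(err, ErrMiss) {
		return "", err
	}
	v, err := db.Load(key)
	if err != nil {
		return "", err
	}
	_ = c.Set(key, v) // best-effort repopulation
	return v, nil
}
```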

4.3 Backup Verification

  • Monthly: Automated backup integrity check (checksum verification, test restore to isolated environment)
  • Quarterly: Full restore test to isolated environment with functional verification (see Section 7)
  • On rotation: When encryption keys or vault files are rotated, verify new backup is complete and accessible
  • Documentation: All verification results logged with pass/fail status and reviewer signature
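A minimal sketch of the monthly checksum verification follows, assuming the backup job writes a SHA-256 digest file alongside each dump; the paths and digest-file convention are placeholders, not a documented MazeVault tool.

```go
// Compares a backup file's SHA-256 hash against the digest recorded at
// backup time and exits non-zero on any mismatch (suitable for automation).
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

func sha256File(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	backup := "/backup/mazevault-20260501.dump" // placeholder path
	recorded, err := os.ReadFile(backup + ".sha256")
	if err != nil {
		fmt.Fprintln(os.Stderr, "missing recorded checksum:", err)
		os.Exit(1)
	}
	sum, err := sha256File(backup)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot hash backup:", err)
		os.Exit(1)
	}
	want := strings.Fields(string(recorded))
	if len(want) == 0 || sum != want[0] {
		fmt.Println("FAIL: checksum mismatch; backup integrity check failed")
		os.Exit(1)
	}
	fmt.Println("PASS: backup checksum verified")
}
```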

4.4 Off-Site Backup Requirements

  • Backups SHALL be stored in a geographically separate location from production systems
  • Minimum distance: different availability zone (cloud) or different physical building (on-premise)
  • Access to backup storage SHALL require separate credentials from production systems
  • Backup storage access SHALL be logged and monitored
  • Backup encryption keys SHALL be stored separately from the encrypted backups

5. Disaster Recovery Procedures

5.1 Gateway DR Failover

Scenario: Primary gateway for an environment becomes unavailable and cannot be recovered within RTO.

Prerequisites:

  • DR standby gateway is provisioned and current (configuration synchronized)
  • GatewayEnvironmentLock is functioning correctly
  • Network connectivity between DR gateway and backend confirmed

Procedure:

Step | Action | Verification | Responsible
1 | Verify primary gateway is confirmed down (3 consecutive heartbeat failures confirmed by GatewayHealthMonitor) | Health monitor alerts received; manual verification attempted | Operations Lead
2 | Notify Incident Commander; obtain authorization for failover | IC acknowledgment documented | Operations Lead
3 | Start DR AKS cluster (if not warm standby): kubectl scale deployment mazevault-gateway --replicas=1 -n mazevault-dr | Pods in Running state: kubectl get pods -n mazevault-dr | Infrastructure Engineer
4 | Verify DR gateway pods are healthy and connected to backend | Pod logs show successful backend connection; health endpoint returns 200 | Infrastructure Engineer
5 | Activate DR gateway via administrative API: POST /api/v1/admin/failover/{env}/activate-dr | API returns 200; new fencing token issued | Operations Lead
6 | Verify DR gateway is now active and processing requests | Agent connectivity confirmed; test secret retrieval successful | Operations Lead
7 | Monitor for 30 minutes for stability | No errors in logs; metrics nominal | Infrastructure Engineer
8 | Notify affected customers of failover completion | Customer notification sent | Communications Lead

Estimated Duration: 15-30 minutes (warm standby) / 30-60 minutes (cold start)
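The following is an illustrative client call for step 5 of the procedure above. The endpoint path is taken from this Plan; the bearer-token authentication and hostname are assumptions, not a documented MazeVault API contract.

```go
// Sketch of the DR activation call (step 5); stops with a non-zero exit if
// the backend does not return HTTP 200, so the operator does not proceed.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	env := "PRO" // target environment
	url := fmt.Sprintf("https://backend.example.internal/api/v1/admin/failover/%s/activate-dr", env)

	req, err := http.NewRequest(http.MethodPost, url, nil)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Hypothetical authentication; substitute the real admin credential scheme.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("MAZEVAULT_ADMIN_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, "failover request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "unexpected status %d; do not proceed to step 6\n", resp.StatusCode)
		os.Exit(1)
	}
	fmt.Println("DR gateway activation accepted (HTTP 200); a new fencing token should now be issued")
}
```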

5.2 Gateway Failback

Scenario: Primary gateway has been recovered and service should be returned from DR to primary.

Prerequisites:

  • Primary gateway fully recovered and verified healthy
  • No active incidents on DR gateway
  • Maintenance window scheduled (if possible)

Procedure:

Step | Action | Verification | Responsible
1 | Verify primary gateway is fully recovered and healthy | Health endpoint returns 200; all subsystems operational | Infrastructure Engineer
2 | Synchronize any state accumulated on DR gateway to primary | Sync status confirmed; no data discrepancies | Infrastructure Engineer
3 | Execute failback via administrative API: POST /api/v1/admin/failover/{env}/failback | API returns 200; new fencing token issued to primary | Operations Lead
4 | Verify primary gateway is active and processing requests | Agent connectivity confirmed; test operations successful | Operations Lead
5 | Monitor primary for 30 minutes for stability | No errors; metrics nominal | Infrastructure Engineer
6 | Stop DR AKS cluster: kubectl scale deployment mazevault-gateway --replicas=0 -n mazevault-dr | DR pods terminated | Infrastructure Engineer
7 | Document failback completion | Incident record updated; post-action report filed | Operations Lead

5.3 Database Restore

Scenario: PostgreSQL database corruption, failed migration, or data integrity failure requiring restore from backup.

Procedure:

Step | Action | Command/Verification | Responsible
1 | Stop application services to prevent further writes | systemctl stop mazevault or scale deployment to 0 | Infrastructure Engineer
2 | Assess corruption scope; determine restore point | Review logs; identify last known-good backup | Technical Lead
3 | Create backup of current (corrupted) state for forensics | pg_dump -Fc mazevault_db > /backup/forensic-YYYYMMDD.dump | Infrastructure Engineer
4 | Drop and recreate database (or restore to new instance) | dropdb mazevault_db && createdb mazevault_db | Infrastructure Engineer
5 | Restore from selected backup | pg_restore -d mazevault_db /backup/mazevault-YYYYMMDD.dump | Infrastructure Engineer
6 | Verify database integrity | Run application health checks; verify row counts; check constraints | Infrastructure Engineer
7 | Start application services | systemctl start mazevault or scale deployment to desired replicas | Infrastructure Engineer
8 | Verify full application health | Health endpoint returns 200; all subsystems connected; test operations | Operations Lead
9 | Assess data loss (gap between backup and failure) | Document any data created after backup that is lost; notify affected customers if applicable | Technical Lead

Estimated Duration: 1-4 hours depending on database size and backup method.

5.4 Full System Restore

Scenario: Complete system loss requiring rebuild from scratch (catastrophic hardware failure, ransomware, or new environment provisioning).

Procedure:

Step | Action | Details | Responsible
1 | Prepare Host | Provision server/VM meeting minimum requirements; install OS; configure networking | Infrastructure Engineer
2 | Restore Configuration | Copy configuration files from backup to appropriate locations | Infrastructure Engineer
3 | Restore TLS Certificates | Restore CA certificates, server certificates, and private keys from backup | Infrastructure Engineer
4 | Load Container Images (if applicable) | Pull or load MazeVault container images to local registry | Infrastructure Engineer
5 | Start Infrastructure Services | Start PostgreSQL, Redis; verify connectivity | Infrastructure Engineer
6 | Restore Database | Execute database restore procedure (Section 5.3, steps 5-6) | Infrastructure Engineer
7 | Restore Secrets Vault | Copy secrets.vault and vault.key from secure backup locations to designated paths | Infrastructure Engineer
8 | Start Application Services | Start MazeVault backend services | Infrastructure Engineer
9 | Verify System Health | Execute full recovery verification checklist (Section 9) | Operations Lead
10 | Restore Gateway Connectivity | Verify gateways can connect; re-register if necessary | Operations Lead
11 | Verify Agent Connectivity | Confirm agents reconnect to gateways; test secret retrieval | Operations Lead
12 | Restore Monitoring | Verify Prometheus, alerting, and logging operational | Infrastructure Engineer
13 | Document and Communicate | Update incident record; notify stakeholders of restoration | Communications Lead

Estimated Duration: 4-8 hours depending on infrastructure provisioning time.

5.5 Encryption Key Recovery

Scenario: Loss or corruption of vault.key (the key that decrypts the secrets vault).

5.5.1 With Backup Available

Step | Action | Responsible
1 | Retrieve vault.key backup from secure off-site storage | CISO / Infrastructure Engineer
2 | Verify key integrity (checksum comparison) | Infrastructure Engineer
3 | Place key in designated secure location on target system | Infrastructure Engineer
4 | Verify vault decryption: vault-tool list | Infrastructure Engineer
5 | Verify all secrets accessible and decryptable | Operations Lead
6 | Resume normal operations | Operations Lead

Estimated Duration: <2 hours

5.5.2 Without Backup — PERMANENT DATA LOSS

CRITICAL WARNING: If vault.key is lost and no backup exists, all secrets encrypted by that key are permanently and irrecoverably lost. There is no recovery mechanism. This scenario requires complete secret re-creation.

Step | Action | Responsible
1 | Confirm no backup exists anywhere (verify all backup locations) | CISO
2 | Accept permanent loss of existing encrypted secrets | CISO + CEO decision
3 | Generate new encryption key: vault-tool init | Infrastructure Engineer
4 | Immediately create multiple backups of new key in separate secure locations | CISO
5 | Recreate all secrets manually (coordinate with each customer for their credentials) | Operations Lead + Customer Relations
6 | Document incident and update procedures to prevent recurrence | CISO
7 | Conduct post-incident review | Full IRT

Estimated Duration: Days to weeks (depending on number of secrets to recreate)


6. Split-Brain Prevention

6.1 Problem Statement

In a distributed gateway architecture, a network partition or communication failure can result in multiple gateways believing they are the active instance for an environment. This "split-brain" condition can cause:

  • Conflicting writes to shared resources
  • Data inconsistency between gateways
  • Certificate issuance conflicts
  • Audit log divergence

6.2 GatewayEnvironmentLock Mechanism

The GatewayEnvironmentLock provides distributed mutual exclusion:

┌─────────────────────────────────────────────────────┐
│              GatewayEnvironmentLock                   │
├─────────────────────────────────────────────────────┤
│ Environment:    PRO                                  │
│ Active Gateway: gateway-pro-primary                  │
│ Fencing Token:  1714567890123456789 (UnixNano)      │
│ Acquired:       2026-05-01T10:00:00Z                │
│ Last Heartbeat: 2026-05-01T10:01:30Z                │
│ AutoFailover:   DISABLED                            │
└─────────────────────────────────────────────────────┘
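A minimal Go representation of the lock record above is sketched below; the field names mirror the displayed attributes, but the actual schema and acquisition logic are internal to MazeVault and are assumptions here.

```go
// Illustrative lock record and acquisition; the fencing token is the
// acquisition time in UnixNano, so later acquisitions carry larger tokens.
package lock

import "time"

type GatewayEnvironmentLock struct {
	Environment   string // e.g. "PRO"
	ActiveGateway string // identifier of the single active gateway
	FencingToken  int64  // UnixNano at acquisition; monotonically increasing
	Acquired      time.Time
	LastHeartbeat time.Time
	AutoFailover  bool // disabled (false) by default
}

// Acquire issues a lock for the given environment to the given gateway.
func Acquire(env, gateway string) GatewayEnvironmentLock {
	now := time.Now()
	return GatewayEnvironmentLock{
		Environment:   env,
		ActiveGateway: gateway,
		FencingToken:  now.UnixNano(),
		Acquired:      now,
		LastHeartbeat: now,
	}
}
```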

6.3 Operating Parameters

Parameter | Value | Description
HeartbeatTimeout | 2 minutes | Maximum time between heartbeats before gateway considered unhealthy
HealthCheckInterval | 30 seconds | Frequency of health check probes
FailureThreshold | 3 consecutive failures | Number of consecutive missed heartbeats before declaring failure
AutoFailover | OFF (disabled by default) | Automatic failover to DR gateway is not performed without explicit operator action
FencingTokenType | UnixNano (monotonically increasing) | Ensures ordering of lock acquisitions

6.4 Fencing Token Operation

  1. When a gateway acquires the environment lock, it receives a fencing token (current UnixNano timestamp)
  2. All write operations to shared state include the fencing token
  3. The backend rejects any operation with a fencing token less than or equal to the last accepted token
  4. This guarantees that even if a stale gateway attempts operations after a failover, its outdated token is rejected
  5. The new active gateway's higher token ensures its operations take precedence
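The backend-side check in step 3 can be sketched as follows; function and type names are illustrative, but the rule is the one stated above: a token less than or equal to the last accepted one is rejected.

```go
// Sketch of fencing-token validation on shared-state writes.
package lock

import "errors"

var ErrStaleFencingToken = errors.New("stale fencing token: write rejected")

type SharedState struct {
	lastAcceptedToken int64
}

// ApplyWrite accepts a write only if its fencing token is strictly greater
// than the last accepted token, then records the new high-water mark. A
// gateway that was failed over and later recovers still holds an older
// token, so its writes land here and are refused.
func (s *SharedState) ApplyWrite(token int64, apply func() error) error {
	if token <= s.lastAcceptedToken {
		return ErrStaleFencingToken
	}
	if err := apply(); err != nil {
		return err
	}
	s.lastAcceptedToken = token
	return nil
}
```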

6.5 AutoFailover Disabled — Rationale

AutoFailover is disabled by default because:

  • Transient network issues (brief partitions, DNS delays) could trigger unnecessary failovers
  • Failover introduces risk of data inconsistency if not properly sequenced
  • Operator verification ensures the primary is genuinely failed (not just temporarily unreachable)
  • Manual failover allows coordinated preparation of the DR environment
  • Reduces risk of "flapping" between primary and DR during unstable conditions

AutoFailover MAY be enabled for specific environments after explicit risk acceptance by the CISO, with appropriate safeguards (extended failure threshold, confirmation delay).


7. Testing Program

7.1 Test Schedule

Test Type | Frequency | Scope | Duration | Participants
Backup Restore Test | Quarterly | Full restore of database and vault to isolated environment | 4-8 hours | Infrastructure + Operations
Gateway Failover Exercise | Semi-annual | Execute full failover and failback procedure in test environment | 2-4 hours | Operations + Infrastructure
Full DR Simulation | Annual | Simulate complete datacenter failure; recover all services | 1 full day | All BCP roles
Backup Integrity Verification | Monthly | Automated checksum verification of all backups | Automated | Infrastructure (review results)
Communication Test | Quarterly | Verify all notification channels and escalation contacts | 1 hour | Communications Lead + IRT
Tabletop Exercise | Semi-annual | Scenario-based discussion of BCP/DR decision-making | 2-3 hours | All BCP roles + management

7.2 Quarterly DR Test Procedure

Step | Action | Success Criteria
1 | Provision isolated test environment | Environment accessible; no connectivity to production
2 | Restore database from latest production backup | Restore completes without errors; data integrity verified
3 | Restore secrets vault and vault.key | vault-tool list returns expected secret count
4 | Verify all secrets accessible and decryptable | Sample of secrets retrieved and validated
5 | Verify certificate operations | Test certificate issuance and validation
6 | Verify application health endpoints | All health checks pass
7 | Document results | Test report completed with pass/fail for each criterion
8 | Destroy test environment | All test data securely wiped

7.3 Test Documentation Requirements

Each test SHALL produce a report containing:

  • Test date, participants, and environment used
  • Procedure followed (reference to this document or deviation explanation)
  • Pass/fail status for each test criterion
  • Actual RTO achieved vs. target RTO
  • Issues encountered and their resolution
  • Recommendations for procedure improvements
  • Sign-off by Operations Lead

7.4 Test Failure Remediation

  • Any test failure SHALL be treated as a high-priority finding
  • Root cause analysis required within 5 business days
  • Corrective action plan with owner and deadline
  • Re-test scheduled within 30 days of corrective action completion
  • Repeated failures escalated to CISO for risk assessment

8. Encrypted Vault Backup

8.1 Critical Files

File | Purpose | Criticality
secrets.vault | Contains all encrypted credentials, API keys, database passwords, and customer secrets | CRITICAL — Loss without backup means recreation of all secrets required
vault.key | Decryption passphrase/key for the secrets vault | CRITICAL — Loss without backup means PERMANENT, IRRECOVERABLE loss of all vault contents

8.2 Storage Requirements

MANDATORY: secrets.vault and vault.key SHALL be stored in physically separate locations.

Requirement | secrets.vault | vault.key
Storage Location | Off-site backup (separate from production) | Physically separate from secrets.vault (different facility, safe, or HSM)
Access Control | Restricted to authorized operations personnel | Restricted to CISO + designated backup custodian only
Physical Security | Encrypted storage; access-logged facility | Secure safe, HSM, or equivalent physical protection
Copies | Minimum 2 copies in separate locations | Minimum 2 copies in separate locations (NEVER co-located with vault file)
Digital Protection | Encrypted at rest (backup storage encryption) | Encrypted at rest; additionally protected by separate passphrase if stored digitally

8.3 Quarterly Verification Procedure

Step | Action | Expected Result
1 | Retrieve vault.key backup from secure storage | Key file accessible; integrity seal intact
2 | Retrieve secrets.vault backup from separate storage | Vault file accessible; modification date matches last backup
3 | Deploy both to isolated verification environment | Files in place
4 | Execute: vault-tool list | Returns complete list of stored secrets (matches production count)
5 | Retrieve sample secrets and verify decryption | Secrets decrypt correctly; values match production
6 | Document verification result | Signed verification record filed
7 | Securely return backup files to storage | Files re-sealed in secure storage

8.4 Rotation Procedures

When encryption keys or vault files are rotated:

  1. Create backup of new vault and key immediately after rotation
  2. Verify new backup (vault-tool list in isolated environment)
  3. Retain previous vault.key version (required to decrypt any backup made with the old key)
  4. Update backup inventory documentation
  5. Notify backup custodians of update

9. Recovery Verification

9.1 Health Check Verification

After any recovery operation, the following checks SHALL all pass before declaring recovery complete:

Check | Method | Expected Result
Application Health | GET /api/v1/health | HTTP 200; all subsystem statuses "healthy"
License Status | Health endpoint or admin API | License status: active
Database Connectivity | Health endpoint subsystem check | PostgreSQL connection pool active; queries executing
Redis Connectivity | Health endpoint subsystem check | Redis connection established; cache operations functional
Audit Logging | Perform test action; verify audit event created | Audit event written with correct hash chain continuation
Prometheus Metrics | Check Prometheus targets | All scrape targets UP; metrics flowing
Structured Logging | Review application log output | JSON-formatted log entries appearing with expected fields
Secret Operations | Retrieve a test secret | Secret retrieved successfully; decrypted correctly
Certificate Operations | Issue test certificate (non-production) | Certificate issued with valid chain
Gateway Connectivity | Verify gateway heartbeat received | Gateway reporting healthy; heartbeat within timeout
Agent Connectivity | Verify agent check-in | At least one agent successfully communicating
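A minimal post-recovery smoke check for the first row of the table above is sketched below. The health endpoint path comes from this Plan; the hostname and the JSON field names of the response body are assumptions about the response shape.

```go
// Calls the application health endpoint and fails unless HTTP 200 is
// returned and every reported subsystem is "healthy".
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

type health struct {
	Status     string            `json:"status"`
	Subsystems map[string]string `json:"subsystems"` // assumed response shape
}

func main() {
	resp, err := http.Get("https://backend.example.internal/api/v1/health")
	if err != nil {
		fmt.Fprintln(os.Stderr, "health endpoint unreachable:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		fmt.Fprintf(os.Stderr, "health check failed: HTTP %d\n", resp.StatusCode)
		os.Exit(1)
	}

	var h health
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
		fmt.Fprintln(os.Stderr, "cannot parse health response:", err)
		os.Exit(1)
	}
	for name, status := range h.Subsystems {
		if status != "healthy" {
			fmt.Fprintf(os.Stderr, "subsystem %s is %s\n", name, status)
			os.Exit(1)
		}
	}
	fmt.Println("application health verified")
}
```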

9.2 Extended Monitoring Period

After recovery verification passes:

  • First 2 hours: Enhanced monitoring with reduced alert thresholds
  • First 24 hours: On-call engineer actively monitoring system behavior
  • First 72 hours: Daily review of all metrics for anomalies
  • After 72 hours: Return to standard monitoring if no issues detected

9.3 Recovery Failure Escalation

If recovery verification fails:

  1. Identify failing check and assess impact
  2. Attempt corrective action (restart service, re-apply configuration)
  3. If not resolved within 30 minutes, escalate to Incident Commander
  4. Consider alternative recovery approach (different backup, different procedure)
  5. Document failure and resolution for procedure improvement

10. Roles and Responsibilities During BCP

10.1 BCP Organization

Role | Primary Holder | Responsibilities
Incident Commander | CISO | Overall BCP authority; decision-making; escalation; regulatory communication
Operations Lead | Head of Operations | Coordinate recovery activities; direct technical teams; manage timeline
Infrastructure Engineer | Sr. DevOps Engineer | Execute technical recovery procedures; system administration; monitoring
Communications Lead | Head of Customer Success | Customer notification; internal communications; status page management
Technical Lead | CTO | Technical decision-making; architecture guidance; vendor coordination
Legal Counsel | Legal Advisor | Regulatory obligations; contractual requirements; liability assessment

10.2 Decision Authority Matrix

Decision | Authority | Escalation
Activate BCP | Incident Commander (CISO) | CEO if CISO unavailable
Authorize failover | Incident Commander | Operations Lead (if IC pre-authorized)
Customer communication content | Communications Lead + Legal Counsel | Incident Commander for approval
Accept data loss (RPO exceeded) | CEO | Board if loss exceeds defined threshold
Engage external vendors/consultants | Incident Commander | CEO if cost exceeds pre-approved budget
Declare recovery complete | Incident Commander (after verification) | N/A
Deactivate BCP | Incident Commander | CEO if incident duration >48 hours

10.3 Succession and Availability

  • Each primary role holder SHALL have a designated deputy
  • Contact information maintained in secure, accessible location (not solely dependent on affected systems)
  • If primary and deputy both unavailable, next in organizational hierarchy assumes responsibility
  • Minimum 2 qualified personnel must be reachable at all times for Tier 1 service recovery

11. BCP Activation Criteria

11.1 Automatic Activation Triggers

The BCP SHALL be activated immediately when any of the following conditions are confirmed:

Trigger | Condition | Activation Level
Major Incident (P1) | Critical security incident with service impact exceeding 30 minutes | Full BCP
Datacenter Failure | Complete loss of primary hosting environment | Full BCP
Ransomware Confirmation | Confirmed ransomware deployment affecting production systems | Full BCP
Key Compromise | Confirmed compromise or loss of encryption keys (vault.key) | Full BCP
Extended Outage | Any unplanned outage exceeding 50% of the applicable RTO | Partial BCP (escalate if not resolving)
Multi-System Failure | Simultaneous failure of 2+ Tier 1 services | Full BCP

11.2 Discretionary Activation

The Incident Commander MAY activate BCP at their discretion for:

  • Credible threat intelligence suggesting imminent attack
  • Significant degradation trending toward failure
  • Vendor/provider outage with uncertain resolution timeline
  • Natural disaster or physical security threat to hosting facilities
  • Pandemic or staffing emergency affecting operational capability

11.3 Activation Process

  1. Trigger condition identified and confirmed
  2. Incident Commander (or deputy) formally declares BCP activation
  3. All BCP role holders notified via emergency communication channels
  4. BCP war room established (physical or virtual)
  5. Initial situation assessment conducted (within 30 minutes)
  6. Recovery strategy selected based on scenario and available resources
  7. Recovery activities commence per applicable procedure (Section 5)

12. Communication During Outage

12.1 Internal Communication

Channel | Purpose | Frequency
War Room (dedicated video/chat channel) | Real-time coordination among BCP team | Continuous during active incident
Status Updates (internal) | Broader team awareness | Every 30 minutes during active recovery
Executive Briefing | CEO/Board situation awareness | Every 2 hours or on significant changes
All-Hands Update | Full company awareness (if major) | As needed; minimum daily during extended outage

12.2 Customer Communication

Phase | Communication | Channel | Timing
Initial | Service disruption acknowledged; investigation underway | Status page + email | Within 30 minutes of detection
Updates | Progress, estimated resolution, workarounds | Status page | Every 60 minutes (minimum)
Resolution | Service restored; summary of impact | Status page + email | Within 1 hour of recovery
Post-Mortem | Root cause, remediation actions, prevention measures | Email + portal publication | Within 5 business days

12.3 Status Page Protocol

  • Status page SHALL be hosted on infrastructure independent of MazeVault production systems
  • Status page updates SHALL NOT contain sensitive security details
  • Status categories: Operational → Degraded Performance → Partial Outage → Major Outage
  • Per-service status granularity: Backend API, Gateway Services, Certificate Services, Monitoring

12.4 Communication Templates

Templates for customer communication during outage SHALL be pre-approved by Legal Counsel and ready for immediate use. Templates SHALL cover:

  • Initial outage acknowledgment
  • Periodic status updates (with and without ETA)
  • Service restoration notification
  • Post-incident summary

13. Compliance Mapping

Regulatory Requirement | Section in This Document
NIS2 Directive Art. 21(2)(c) — Business continuity and crisis management | Sections 1-12 (comprehensive BCP/DR framework)
DORA Art. 11 — ICT business continuity policy | Sections 1-4 (policy, objectives, architecture, backups)
DORA Art. 12 — ICT response and recovery plans | Sections 5-6 (DR procedures, split-brain prevention)
DORA Art. 12(2) — Backup policies and restoration methods | Section 4 (backup strategy), Section 5 (restore procedures)
DORA Art. 12(3) — Redundancy for critical functions | Section 3 (architecture for resilience), Section 6 (split-brain)
DORA Art. 12(5) — Testing of ICT business continuity plans | Section 7 (testing program)
ISO/IEC 27001:2022 A.5.29 — Information security during disruption | Sections 3, 5, 6 (security maintained during recovery)
ISO/IEC 27001:2022 A.5.30 — ICT readiness for business continuity | Sections 2-4 (RTO/RPO, architecture, backups)
ISO/IEC 27001:2022 A.8.13 — Information backup | Section 4 (backup strategy)
ISO/IEC 27001:2022 A.8.14 — Redundancy of information processing facilities | Section 3 (multi-DC architecture, DR standby)
SOC 2 A1.1 — Recovery objectives defined | Section 2 (RTO/RPO targets)
SOC 2 A1.2 — Recovery procedures exist and are tested | Sections 5, 7 (procedures and testing)
SOC 2 A1.3 — Recovery procedures support objectives | Section 9 (recovery verification against objectives)
Act No. 264/2025 Sb. — Continuity requirements for essential services | Sections 1-12 (comprehensive framework)

Related Documents

Document ID | Title
MV-LEG-001 | Information Security Policy
MV-LEG-002 | Cryptography Policy
MV-LEG-003 | Risk Management Policy
MV-LEG-004 | Logging & Monitoring Policy
MV-LEG-005 | Access Control Policy
MV-LEG-006 | Vulnerability & Patch Management Policy
MV-LEG-007 | Incident Response Plan

Appendix A: Emergency Contact Quick Reference

Role | Primary Contact | Backup Contact | Availability
Incident Commander (CISO) | [CISO Name / Phone / Email] | [Deputy / Phone / Email] | 24/7
Operations Lead | [Name / Phone / Email] | [Deputy / Phone / Email] | 24/7
Infrastructure Engineer | [Name / Phone / Email] | [Deputy / Phone / Email] | 24/7 (on-call rotation)
Communications Lead | [Name / Phone / Email] | [Deputy / Phone / Email] | Business hours + on-call
CEO | [Name / Phone / Email] | [COO / Phone / Email] | 24/7 for P1
Legal Counsel | [Name / Phone / Email] | [External Firm / Phone] | Business hours + emergency retainer

Note: Actual contact details maintained in secure, separately-accessible contact roster (not in this document).


Appendix B: Quick Decision Flowchart

Service Disruption Detected
        │
Is production affected?
        ├── No  → Monitor; standard incident process
        └── Yes
             │
Can service be restored within 30 minutes?
        ├── Yes → Standard incident response
        └── No  → Activate BCP
                   │
                   Identify scenario (Section 5)
                   ├─ Gateway failure  → Section 5.1
                   ├─ Database issue   → Section 5.3
                   ├─ Full system loss → Section 5.4
                   ├─ Key loss         → Section 5.5
                   └─ DC failure       → Section 5.1 + 5.4

Document Control

Version | Date | Author | Change Description
1.0.0 | 2026-05-01 | CISO | Initial release

END OF DOCUMENT