Business Continuity & Disaster Recovery Plan¶
MazeVault Service Continuity, Resilience, and Recovery Framework
Document ID: MV-LEG-008
Version: 1.0.0
Classification: Confidential
Owner: Chief Information Security Officer (CISO)
Last Updated: 2026-05-01
Review Cycle: Annual
Approved By: CEO / Board of Directors
Regulatory Basis: Act No. 264/2025 Sb., NIS2 Directive Art. 21(2)(c), DORA Art. 11-12, ISO/IEC 27001:2022 A.17
1. Purpose and Scope¶
1.1 Purpose¶
This Business Continuity & Disaster Recovery Plan ("Plan") establishes the framework, procedures, and responsibilities for maintaining MazeVault service continuity during disruptive events and recovering operations following a disaster. The Plan ensures that critical business functions can continue at an acceptable level during and after a disruption, and that full service is restored within defined recovery objectives.
1.2 Scope¶
This Plan applies to:
- All MazeVault production systems, services, and infrastructure
- All deployment configurations: cloud-hosted backend, gateway installations (Kubernetes and on-premise), and agent deployments
- All data assets: customer secrets, encryption keys, certificates, configuration, and operational data
- All personnel involved in service delivery and incident response
- All environments where customer data is processed or stored
1.3 Objectives¶
- Maintain critical service functions during disruptive events
- Minimize data loss within defined Recovery Point Objectives (RPO)
- Restore services within defined Recovery Time Objectives (RTO)
- Protect the integrity and confidentiality of customer data during recovery operations
- Ensure compliance with regulatory continuity and resilience requirements
- Provide clear procedures for common disaster scenarios
1.4 Definitions¶
| Term | Definition |
|---|---|
| RTO (Recovery Time Objective) | Maximum acceptable duration of service outage from the point of disruption to service restoration |
| RPO (Recovery Point Objective) | Maximum acceptable data loss measured in time; the point in time to which data must be recoverable |
| BCP (Business Continuity Plan) | Procedures to maintain critical business functions during a disruption |
| DR (Disaster Recovery) | Technical procedures to restore IT systems and data following a disaster |
| MTPD (Maximum Tolerable Period of Disruption) | Absolute maximum time before disruption causes unrecoverable business damage |
| BIA (Business Impact Analysis) | Assessment of the impact of disruption to business functions |
| Fencing Token | A monotonically increasing token used to prevent split-brain operations in distributed systems |
2. RTO/RPO Targets¶
2.1 Recovery Objectives by Scenario¶
| Disaster Scenario | RPO | RTO | MTPD | Priority |
|---|---|---|---|---|
| Database corruption (logical error, failed migration, data integrity failure) | <24 hours (daily backup) | <4 hours | 8 hours | Critical |
| Complete node failure (single server/pod crash, hardware failure) | <24 hours (daily backup) | <2 hours (Kubernetes) / <4 hours (on-premise) | 6 hours | Critical |
| Datacenter failure (full DC outage, network partition, cloud region failure) | <1 hour (with synchronous replication) | <8 hours | 12 hours | Critical |
| Encryption key loss (vault.key deleted or corrupted without backup) | 0 (with proper key backups) / PERMANENT LOSS (without backup) | <2 hours (with backup) / Unrecoverable (without backup) | N/A | Critical |
| Ransomware / complete data loss (malicious encryption, mass deletion, storage failure) | <24 hours (last clean backup) | <8 hours | 12 hours | Critical |
| Gateway failure (single gateway unavailability) | 0 (agents cache locally) | <30 minutes (failover) / <2 hours (manual) | 4 hours | High |
| Certificate authority compromise | 0 (re-issuance) | <4 hours | 8 hours | Critical |
2.2 Service Tier Classification¶
| Tier | Services | RTO | RPO |
|---|---|---|---|
| Tier 1 — Critical | Secret retrieval/storage, certificate issuance, authentication, audit logging | <2 hours | <1 hour |
| Tier 2 — Important | Gateway management, agent communication, monitoring, alerting | <4 hours | <24 hours |
| Tier 3 — Standard | Reporting, dashboards, non-critical notifications, analytics | <8 hours | <24 hours |
| Tier 4 — Deferrable | Documentation, development environments, non-production testing | <24 hours | <7 days |
3. Architecture for Resilience¶
3.1 System Architecture Overview¶
┌─────────────────────────────────────────────────────────────────┐
│ PRIMARY BACKEND │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ API │ │ Auth │ │ Secret │ │ Certificate │ │
│ │ Service │ │ Service │ │ Service │ │ Service │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └──────┬───────┘ │
│ │ │ │ │ │
│ ┌────┴──────────────┴──────────────┴───────────────┴────┐ │
│ │ PostgreSQL (Central DB) │ │
│ └────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌────┴─────────┐ ┌───────────────┐ ┌────────────────┐ │
│ │ Redis Cache │ │ Secrets Vault │ │ Prometheus │ │
│ │ (Ephemeral) │ │ (Encrypted) │ │ (Monitoring) │ │
│ └──────────────┘ └───────────────┘ └────────────────┘ │
└─────────────────────────────┬───────────────────────────────────┘
│
┌─────────────────┼─────────────────┐
│ │ │
┌───────────┴──────┐ ┌──────┴───────┐ ┌─────┴────────────┐
│ Environment: │ │ Environment: │ │ Environment: │
│ NPR │ │ PRO │ │ PRO-2 │
│ │ │ │ │ │
│ ┌─────────────┐ │ │ ┌──────────┐ │ │ ┌─────────────┐ │
│ │ GW Primary │ │ │ │GW Primary│ │ │ │ GW Primary │ │
│ │ (Active) │ │ │ │(Active) │ │ │ │ (Active) │ │
│ └──────┬──────┘ │ │ └────┬─────┘ │ │ └──────┬──────┘ │
│ │ │ │ │ │ │ │ │
│ ┌──────┴──────┐ │ │ ┌────┴─────┐ │ │ ┌──────┴──────┐ │
│ │ GW DR │ │ │ │GW DR │ │ │ │ GW DR │ │
│ │ (Standby) │ │ │ │(Standby) │ │ │ │ (Standby) │ │
│ └─────────────┘ │ │ └──────────┘ │ │ └─────────────┘ │
│ │ │ │ │ │ │ │ │
│ ┌────┴────┐ │ │ ┌───┴───┐ │ │ ┌────┴────┐ │
│ │ Agents │ │ │ │Agents │ │ │ │ Agents │ │
│ │(On-Prem)│ │ │ │(On-P.)│ │ │ │(On-Prem)│ │
│ └─────────┘ │ │ └───────┘ │ │ └─────────┘ │
└──────────────────┘ └──────────────┘ └──────────────────┘
3.2 CRDT-Based Multi-DC Synchronization¶
For multi-datacenter deployments, MazeVault employs Conflict-free Replicated Data Types (CRDTs) for synchronization:
- Purpose: Enable eventual consistency across geographically distributed gateways without requiring synchronous replication for all operations
- Conflict Resolution: CRDTs guarantee convergence regardless of message ordering or network partitions
- Data Types: Applied to configuration state, secret metadata, and certificate status propagation
- Consistency Model: Strong consistency for secret values (source of truth at backend); eventual consistency for operational metadata at gateway level
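The convergence property described above can be illustrated with one of the simplest CRDTs, a last-writer-wins register. This is a hedged sketch for illustration only (the `LWWRegister` class and its fields are hypothetical, not the MazeVault implementation): merge keeps the entry with the highest (timestamp, node_id) pair, so replicas reach the same state regardless of message ordering.

```python
import time
from dataclasses import dataclass

@dataclass
class LWWRegister:
    """Last-writer-wins register: a minimal CRDT sketch for metadata
    propagation. Hypothetical illustration, not MazeVault code."""
    value: object = None
    timestamp: int = 0
    node_id: str = ""

    def set(self, value, node_id):
        self.value = value
        self.timestamp = time.time_ns()
        self.node_id = node_id

    def merge(self, other):
        # The deterministic tie-break on node_id makes merge commutative,
        # associative, and idempotent -- the properties that guarantee
        # convergence regardless of delivery order or partitions.
        if (other.timestamp, other.node_id) > (self.timestamp, self.node_id):
            self.value = other.value
            self.timestamp = other.timestamp
            self.node_id = other.node_id
```

Because merge order does not matter, two gateways that exchange state after a partition heals end up with identical metadata, which is exactly why eventual consistency is acceptable for operational metadata but not for secret values.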
3.3 GatewayEnvironmentLock and Fencing Tokens¶
The GatewayEnvironmentLock mechanism prevents split-brain scenarios in multi-gateway environments:
- Single Active Gateway: Only ONE gateway may be active per environment at any given time
- Fencing Tokens: Each lock acquisition generates a monotonically increasing fencing token (UnixNano timestamp)
- Token Validation: All write operations validate the fencing token; stale tokens are rejected
- Split-Brain Prevention: If a previously-active gateway recovers after failover, its stale fencing token prevents it from making conflicting writes
- AutoFailover: Disabled by default; manual activation required to prevent unnecessary failovers from transient issues
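The lock-and-token behaviour above can be sketched as follows. This is a simplified, hypothetical model (class and method names are illustrative, not the MazeVault API): each acquisition issues a strictly increasing UnixNano-style token, and only the holder of the latest token counts as active.

```python
import threading
import time

class GatewayEnvironmentLock:
    """Per-environment lock issuing monotonic fencing tokens.
    Hypothetical sketch of the mechanism described above."""

    def __init__(self):
        self._mu = threading.Lock()
        self._active = {}     # environment -> (gateway_id, fencing_token)
        self._last_token = 0

    def acquire(self, environment, gateway_id):
        """Make gateway_id the single active gateway; return its token."""
        with self._mu:
            # UnixNano-style timestamp; max() guards against clock skew
            # so tokens are strictly monotonically increasing.
            token = max(time.time_ns(), self._last_token + 1)
            self._last_token = token
            self._active[environment] = (gateway_id, token)
            return token

    def is_active(self, environment, gateway_id, token):
        """A gateway is active only while it holds the latest token."""
        with self._mu:
            return self._active.get(environment) == (gateway_id, token)
```

When a DR gateway acquires the lock during failover, the recovered primary still holds its old, lower token, so it is no longer considered active and its writes can be fenced off (Section 6).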
4. Backup Strategy¶
4.1 Backup Schedule and Retention¶
| Asset | Backup Method | Frequency | Retention | Storage Location | Encryption |
|---|---|---|---|---|---|
| PostgreSQL Database | pg_dump (logical) / Azure Backup (if cloud-hosted) | Daily + before significant changes (migrations, major updates) | 30 days rolling | Separate storage account / off-site | AES-256 at rest |
| Encryption Keys (vault.key) | Manual copy to secure storage | On creation + on each rotation | Permanent (all historical versions) | Physically separate secure location (safe, HSM, separate DC) | Protected by physical security + access control |
| Secrets Vault (secrets.vault) | Copy vault file + associated key file | On every change + on key rotation | Permanent | Physically separate from vault.key | Inherently encrypted (vault encryption) |
| Configuration Files | Filesystem copy / version control | Before every change | 30 days rolling | Off-site backup storage | AES-256 at rest |
| TLS Certificates | Filesystem copy | Before every rotation | Until expired + 90 days | Off-site backup storage | AES-256 at rest |
| Redis | NOT BACKED UP | N/A | N/A | N/A | N/A |
| Prometheus Data | Snapshot (optional) | Weekly | 90 days | Off-site backup storage | AES-256 at rest |
| Audit Logs | Log export / replication | Continuous (real-time) | Minimum 3 years | Separate immutable storage | AES-256 at rest |
4.2 Redis — Ephemeral Cache Justification¶
Redis is intentionally excluded from backup procedures because:
- Used exclusively as a transient cache layer (sessions, rate limiting, temporary tokens)
- All authoritative data resides in PostgreSQL
- Cache is automatically rebuilt from the database on service restart
- No customer secrets or persistent state stored in Redis
- Recovery procedure: restart Redis; application automatically repopulates cache
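The "automatically rebuilt" recovery path is the standard cache-aside pattern. A minimal sketch, with hypothetical names (the dictionaries stand in for PostgreSQL and Redis respectively): a read that misses the cache falls back to the authoritative database and repopulates the cache, so a flushed cache heals under normal traffic with no restore step.

```python
class CacheAsideStore:
    """Cache-aside sketch illustrating why the transient cache layer
    needs no backup. Hypothetical illustration, not MazeVault code."""

    def __init__(self, database):
        self.database = database   # authoritative store (PostgreSQL role)
        self.cache = {}            # transient store (Redis role)

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.database[key]   # fall back to the source of truth
        self.cache[key] = value      # repopulate the cache on a miss
        return value

    def flush_cache(self):
        """Simulates a cache restart: all cached state is discarded."""
        self.cache.clear()
```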
4.3 Backup Verification¶
- Monthly: Automated backup integrity check (checksum verification, test restore to isolated environment)
- Quarterly: Full restore test to isolated environment with functional verification (see Section 7)
- On rotation: When encryption keys or vault files are rotated, verify new backup is complete and accessible
- Documentation: All verification results logged with pass/fail status and reviewer signature
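The monthly checksum verification step can be sketched as below. This is a hypothetical helper, not part of the MazeVault tooling: it streams the backup through SHA-256 and compares the result against the checksum recorded at backup time, returning both the pass/fail status and the actual digest so the result can be logged as this section requires.

```python
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 without loading it into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_backup(backup_path, expected_checksum):
    """Compare a backup against the checksum recorded at backup time.
    Hypothetical helper; returns (passed, actual_checksum) for logging."""
    actual = sha256_of(backup_path)
    return actual == expected_checksum, actual
```

A checksum match proves the file is bit-identical to what was written; it does not prove restorability, which is why the quarterly test-restore step remains mandatory.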
4.4 Off-Site Backup Requirements¶
- Backups SHALL be stored in a geographically separate location from production systems
- Minimum distance: different availability zone (cloud) or different physical building (on-premise)
- Access to backup storage SHALL require separate credentials from production systems
- Backup storage access SHALL be logged and monitored
- Backup encryption keys SHALL be stored separately from the encrypted backups
5. Disaster Recovery Procedures¶
5.1 Gateway DR Failover¶
Scenario: Primary gateway for an environment becomes unavailable and cannot be recovered within RTO.
Prerequisites:
- DR standby gateway is provisioned and current (configuration synchronized)
- GatewayEnvironmentLock is functioning correctly
- Network connectivity between DR gateway and backend confirmed
Procedure:
| Step | Action | Verification | Responsible |
|---|---|---|---|
| 1 | Verify primary gateway is confirmed down (3 consecutive heartbeat failures confirmed by GatewayHealthMonitor) | Health monitor alerts received; manual verification attempted | Operations Lead |
| 2 | Notify Incident Commander; obtain authorization for failover | IC acknowledgment documented | Operations Lead |
| 3 | Start DR AKS cluster (if not warm standby): kubectl scale deployment mazevault-gateway --replicas=1 -n mazevault-dr | Pods in Running state: kubectl get pods -n mazevault-dr | Infrastructure Engineer |
| 4 | Verify DR gateway pods are healthy and connected to backend | Pod logs show successful backend connection; health endpoint returns 200 | Infrastructure Engineer |
| 5 | Activate DR gateway via administrative API: POST /api/v1/admin/failover/{env}/activate-dr | API returns 200; new fencing token issued | Operations Lead |
| 6 | Verify DR gateway is now active and processing requests | Agent connectivity confirmed; test secret retrieval successful | Operations Lead |
| 7 | Monitor for 30 minutes for stability | No errors in logs; metrics nominal | Infrastructure Engineer |
| 8 | Notify affected customers of failover completion | Customer notification sent | Communications Lead |
Estimated Duration: 15-30 minutes (warm standby) / 30-60 minutes (cold start)
5.2 Gateway Failback¶
Scenario: Primary gateway has been recovered and service should be returned from DR to primary.
Prerequisites:
- Primary gateway fully recovered and verified healthy
- No active incidents on DR gateway
- Maintenance window scheduled (if possible)
Procedure:
| Step | Action | Verification | Responsible |
|---|---|---|---|
| 1 | Verify primary gateway is fully recovered and healthy | Health endpoint returns 200; all subsystems operational | Infrastructure Engineer |
| 2 | Synchronize any state accumulated on DR gateway to primary | Sync status confirmed; no data discrepancies | Infrastructure Engineer |
| 3 | Execute failback via administrative API: POST /api/v1/admin/failover/{env}/failback | API returns 200; new fencing token issued to primary | Operations Lead |
| 4 | Verify primary gateway is active and processing requests | Agent connectivity confirmed; test operations successful | Operations Lead |
| 5 | Monitor primary for 30 minutes for stability | No errors; metrics nominal | Infrastructure Engineer |
| 6 | Stop DR AKS cluster: kubectl scale deployment mazevault-gateway --replicas=0 -n mazevault-dr | DR pods terminated | Infrastructure Engineer |
| 7 | Document failback completion | Incident record updated; post-action report filed | Operations Lead |
5.3 Database Restore¶
Scenario: PostgreSQL database corruption, failed migration, or data integrity failure requiring restore from backup.
Procedure:
| Step | Action | Command/Verification | Responsible |
|---|---|---|---|
| 1 | Stop application services to prevent further writes | systemctl stop mazevault or scale deployment to 0 | Infrastructure Engineer |
| 2 | Assess corruption scope; determine restore point | Review logs; identify last known-good backup | Technical Lead |
| 3 | Create backup of current (corrupted) state for forensics | pg_dump -Fc mazevault_db > /backup/forensic-YYYYMMDD.dump | Infrastructure Engineer |
| 4 | Drop and recreate database (or restore to new instance) | dropdb mazevault_db && createdb mazevault_db | Infrastructure Engineer |
| 5 | Restore from selected backup | pg_restore -d mazevault_db /backup/mazevault-YYYYMMDD.dump | Infrastructure Engineer |
| 6 | Verify database integrity | Run application health checks; verify row counts; check constraints | Infrastructure Engineer |
| 7 | Start application services | systemctl start mazevault or scale deployment to desired replicas | Infrastructure Engineer |
| 8 | Verify full application health | Health endpoint returns 200; all subsystems connected; test operations | Operations Lead |
| 9 | Assess data loss (gap between backup and failure) | Document any data created after backup that is lost; notify affected customers if applicable | Technical Lead |
Estimated Duration: 1-4 hours depending on database size and backup method.
5.4 Full System Restore¶
Scenario: Complete system loss requiring rebuild from scratch (catastrophic hardware failure, ransomware, or new environment provisioning).
Procedure:
| Step | Action | Details | Responsible |
|---|---|---|---|
| 1 | Prepare Host | Provision server/VM meeting minimum requirements; install OS; configure networking | Infrastructure Engineer |
| 2 | Restore Configuration | Copy configuration files from backup to appropriate locations | Infrastructure Engineer |
| 3 | Restore TLS Certificates | Restore CA certificates, server certificates, and private keys from backup | Infrastructure Engineer |
| 4 | Load Container Images (if applicable) | Pull or load MazeVault container images to local registry | Infrastructure Engineer |
| 5 | Start Infrastructure Services | Start PostgreSQL, Redis; verify connectivity | Infrastructure Engineer |
| 6 | Restore Database | Execute database restore procedure (Section 5.3, steps 5-6) | Infrastructure Engineer |
| 7 | Restore Secrets Vault | Copy secrets.vault and vault.key from secure backup locations to designated paths | Infrastructure Engineer |
| 8 | Start Application Services | Start MazeVault backend services | Infrastructure Engineer |
| 9 | Verify System Health | Execute full recovery verification checklist (Section 9) | Operations Lead |
| 10 | Restore Gateway Connectivity | Verify gateways can connect; re-register if necessary | Operations Lead |
| 11 | Verify Agent Connectivity | Confirm agents reconnect to gateways; test secret retrieval | Operations Lead |
| 12 | Restore Monitoring | Verify Prometheus, alerting, and logging operational | Infrastructure Engineer |
| 13 | Document and Communicate | Update incident record; notify stakeholders of restoration | Communications Lead |
Estimated Duration: 4-8 hours depending on infrastructure provisioning time.
5.5 Encryption Key Recovery¶
Scenario: Loss or corruption of vault.key (the key that decrypts the secrets vault).
5.5.1 With Backup Available¶
| Step | Action | Responsible |
|---|---|---|
| 1 | Retrieve vault.key backup from secure off-site storage | CISO / Infrastructure Engineer |
| 2 | Verify key integrity (checksum comparison) | Infrastructure Engineer |
| 3 | Place key in designated secure location on target system | Infrastructure Engineer |
| 4 | Verify vault decryption: vault-tool list | Infrastructure Engineer |
| 5 | Verify all secrets accessible and decryptable | Operations Lead |
| 6 | Resume normal operations | Operations Lead |
Estimated Duration: <2 hours
5.5.2 Without Backup — PERMANENT DATA LOSS¶
CRITICAL WARNING: If vault.key is lost and no backup exists, all secrets encrypted by that key are permanently and irrecoverably lost. There is no recovery mechanism. This scenario requires complete secret re-creation.
| Step | Action | Responsible |
|---|---|---|
| 1 | Confirm no backup exists anywhere (verify all backup locations) | CISO |
| 2 | Accept permanent loss of existing encrypted secrets | CISO + CEO decision |
| 3 | Generate new encryption key: vault-tool init | Infrastructure Engineer |
| 4 | Immediately create multiple backups of new key in separate secure locations | CISO |
| 5 | Recreate all secrets manually (coordinate with each customer for their credentials) | Operations Lead + Customer Relations |
| 6 | Document incident and update procedures to prevent recurrence | CISO |
| 7 | Conduct post-incident review | Full IRT |
Estimated Duration: Days to weeks (depending on number of secrets to recreate)
6. Split-Brain Prevention¶
6.1 Problem Statement¶
In a distributed gateway architecture, a network partition or communication failure can result in multiple gateways believing they are the active instance for an environment. This "split-brain" condition can cause:
- Conflicting writes to shared resources
- Data inconsistency between gateways
- Certificate issuance conflicts
- Audit log divergence
6.2 GatewayEnvironmentLock Mechanism¶
The GatewayEnvironmentLock provides distributed mutual exclusion:
┌─────────────────────────────────────────────────────┐
│ GatewayEnvironmentLock │
├─────────────────────────────────────────────────────┤
│ Environment: PRO │
│ Active Gateway: gateway-pro-primary │
│ Fencing Token: 1714567890123456789 (UnixNano) │
│ Acquired: 2026-05-01T10:00:00Z │
│ Last Heartbeat: 2026-05-01T10:01:30Z │
│ AutoFailover: DISABLED │
└─────────────────────────────────────────────────────┘
6.3 Operating Parameters¶
| Parameter | Value | Description |
|---|---|---|
| HeartbeatTimeout | 2 minutes | Maximum time between heartbeats before gateway considered unhealthy |
| HealthCheckInterval | 30 seconds | Frequency of health check probes |
| FailureThreshold | 3 consecutive failures | Number of consecutive missed heartbeats before declaring failure |
| AutoFailover | OFF (disabled by default) | Automatic failover to DR gateway is not performed without explicit operator action |
| FencingTokenType | UnixNano (monotonically increasing) | Ensures ordering of lock acquisitions |
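The FailureThreshold behaviour in the table above amounts to a consecutive-failure counter. A minimal sketch with hypothetical names (the real GatewayHealthMonitor may differ): a gateway is declared failed only after three consecutive missed heartbeats, and any successful probe resets the counter, so a single transient miss never starts the (manual) failover path.

```python
class GatewayHealthMonitor:
    """Consecutive-failure counter matching the parameters above.
    Hypothetical sketch, not the MazeVault implementation."""

    FAILURE_THRESHOLD = 3  # consecutive misses before declaring failure

    def __init__(self):
        self.consecutive_failures = 0
        self.failed = False

    def record_probe(self, healthy):
        """Record one health-check result; return current failed state."""
        if healthy:
            self.consecutive_failures = 0  # any success resets the count
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.FAILURE_THRESHOLD:
                # Alert operators; AutoFailover stays off by default,
                # so this only raises the alarm -- it does not fail over.
                self.failed = True
        return self.failed
```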
6.4 Fencing Token Operation¶
- When a gateway acquires the environment lock, it receives a fencing token (current UnixNano timestamp)
- All write operations to shared state include the fencing token
- The backend rejects any operation whose fencing token is lower than the highest token it has seen; operations from the current token holder remain valid
- This guarantees that even if a stale gateway attempts operations after a failover, its outdated token is rejected
- The new active gateway's higher token ensures its operations take precedence
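The backend-side check described above can be sketched as follows. This is a hypothetical illustration (class and method names are not the MazeVault API): the store remembers the highest fencing token it has seen, accepts writes from the current holder, and rejects anything carrying an older token.

```python
class FencedStateStore:
    """Backend-side enforcement of fencing tokens.
    Hypothetical sketch of the behaviour described above."""

    def __init__(self):
        self.highest_token = 0
        self.state = {}

    def write(self, fencing_token, key, value):
        """Apply a write only if its token is not older than the
        highest token seen; return True if the write was accepted."""
        if fencing_token < self.highest_token:
            return False   # stale token from a deposed gateway: reject
        self.highest_token = fencing_token
        self.state[key] = value
        return True
```

After a failover, the deposed primary's token is lower than the DR gateway's, so its late writes are refused even if it believes it is still active, which is the split-brain guarantee this section describes.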
6.5 AutoFailover Disabled — Rationale¶
AutoFailover is disabled by default because:
- Transient network issues (brief partitions, DNS delays) could trigger unnecessary failovers
- Failover introduces risk of data inconsistency if not properly sequenced
- Operator verification ensures the primary is genuinely failed (not just temporarily unreachable)
- Manual failover allows coordinated preparation of the DR environment
- Reduces risk of "flapping" between primary and DR during unstable conditions
AutoFailover MAY be enabled for specific environments after explicit risk acceptance by the CISO, with appropriate safeguards (extended failure threshold, confirmation delay).
7. Testing Program¶
7.1 Test Schedule¶
| Test Type | Frequency | Scope | Duration | Participants |
|---|---|---|---|---|
| Backup Restore Test | Quarterly | Full restore of database and vault to isolated environment | 4-8 hours | Infrastructure + Operations |
| Gateway Failover Exercise | Semi-annual | Execute full failover and failback procedure in test environment | 2-4 hours | Operations + Infrastructure |
| Full DR Simulation | Annual | Simulate complete datacenter failure; recover all services | 1 full day | All BCP roles |
| Backup Integrity Verification | Monthly | Automated checksum verification of all backups | Automated | Infrastructure (review results) |
| Communication Test | Quarterly | Verify all notification channels and escalation contacts | 1 hour | Communications Lead + IRT |
| Tabletop Exercise | Semi-annual | Scenario-based discussion of BCP/DR decision-making | 2-3 hours | All BCP roles + management |
7.2 Quarterly DR Test Procedure¶
| Step | Action | Success Criteria |
|---|---|---|
| 1 | Provision isolated test environment | Environment accessible; no connectivity to production |
| 2 | Restore database from latest production backup | Restore completes without errors; data integrity verified |
| 3 | Restore secrets vault and vault.key | vault-tool list returns expected secret count |
| 4 | Verify all secrets accessible and decryptable | Sample of secrets retrieved and validated |
| 5 | Verify certificate operations | Test certificate issuance and validation |
| 6 | Verify application health endpoints | All health checks pass |
| 7 | Document results | Test report completed with pass/fail for each criterion |
| 8 | Destroy test environment | All test data securely wiped |
7.3 Test Documentation Requirements¶
Each test SHALL produce a report containing:
- Test date, participants, and environment used
- Procedure followed (reference to this document or deviation explanation)
- Pass/fail status for each test criterion
- Actual RTO achieved vs. target RTO
- Issues encountered and their resolution
- Recommendations for procedure improvements
- Sign-off by Operations Lead
7.4 Test Failure Remediation¶
- Any test failure SHALL be treated as a high-priority finding
- Root cause analysis required within 5 business days
- Corrective action plan with owner and deadline
- Re-test scheduled within 30 days of corrective action completion
- Repeated failures escalated to CISO for risk assessment
8. Encrypted Vault Backup¶
8.1 Critical Files¶
| File | Purpose | Criticality |
|---|---|---|
| secrets.vault | Contains all encrypted credentials, API keys, database passwords, and customer secrets | CRITICAL — Loss without backup means recreation of all secrets required |
| vault.key | Decryption passphrase/key for the secrets vault | CRITICAL — Loss without backup means PERMANENT, IRRECOVERABLE loss of all vault contents |
8.2 Storage Requirements¶
MANDATORY: secrets.vault and vault.key SHALL be stored in physically separate locations.
| Requirement | secrets.vault | vault.key |
|---|---|---|
| Storage Location | Off-site backup (separate from production) | Physically separate from secrets.vault (different facility, safe, or HSM) |
| Access Control | Restricted to authorized operations personnel | Restricted to CISO + designated backup custodian only |
| Physical Security | Encrypted storage; access-logged facility | Secure safe, HSM, or equivalent physical protection |
| Copies | Minimum 2 copies in separate locations | Minimum 2 copies in separate locations (NEVER co-located with vault file) |
| Digital Protection | Encrypted at rest (backup storage encryption) | Encrypted at rest; additionally protected by separate passphrase if stored digitally |
8.3 Quarterly Verification Procedure¶
| Step | Action | Expected Result |
|---|---|---|
| 1 | Retrieve vault.key backup from secure storage | Key file accessible; integrity seal intact |
| 2 | Retrieve secrets.vault backup from separate storage | Vault file accessible; modification date matches last backup |
| 3 | Deploy both to isolated verification environment | Files in place |
| 4 | Execute: vault-tool list | Returns complete list of stored secrets (matches production count) |
| 5 | Retrieve sample secrets and verify decryption | Secrets decrypt correctly; values match production |
| 6 | Document verification result | Signed verification record filed |
| 7 | Securely return backup files to storage | Files re-sealed in secure storage |
8.4 Rotation Procedures¶
When encryption keys or vault files are rotated:
- Create backup of new vault and key immediately after rotation
- Verify new backup (vault-tool list in isolated environment)
- Retain previous vault.key version (required to decrypt any backup made with the old key)
- Update backup inventory documentation
- Notify backup custodians of update
9. Recovery Verification¶
9.1 Health Check Verification¶
After any recovery operation, the following checks SHALL all pass before declaring recovery complete:
| Check | Method | Expected Result |
|---|---|---|
| Application Health | GET /api/v1/health | HTTP 200; all subsystem statuses "healthy" |
| License Status | Health endpoint or admin API | License status: active |
| Database Connectivity | Health endpoint subsystem check | PostgreSQL connection pool active; queries executing |
| Redis Connectivity | Health endpoint subsystem check | Redis connection established; cache operations functional |
| Audit Logging | Perform test action; verify audit event created | Audit event written with correct hash chain continuation |
| Prometheus Metrics | Check Prometheus targets | All scrape targets UP; metrics flowing |
| Structured Logging | Review application log output | JSON-formatted log entries appearing with expected fields |
| Secret Operations | Retrieve a test secret | Secret retrieved successfully; decrypted correctly |
| Certificate Operations | Issue test certificate (non-production) | Certificate issued with valid chain |
| Gateway Connectivity | Verify gateway heartbeat received | Gateway reporting healthy; heartbeat within timeout |
| Agent Connectivity | Verify agent check-in | At least one agent successfully communicating |
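The "all checks SHALL pass" rule above can be sketched as a small verification runner. This is a hypothetical harness, not MazeVault tooling: each check is a callable returning True/False, a crashing probe counts as a failure, and recovery is declared complete only when every check passes.

```python
def run_recovery_verification(checks):
    """Run every recovery check and report per-check pass/fail.

    `checks` maps a check name to a zero-argument callable returning
    True/False. Returns (all_passed, per_check_results) so results can
    be logged before declaring recovery complete. Hypothetical sketch."""
    results = {}
    for name, probe in checks.items():
        try:
            results[name] = bool(probe())
        except Exception:
            results[name] = False   # a crashing probe is a failed check
    return all(results.values()), results
```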
9.2 Extended Monitoring Period¶
After recovery verification passes:
- First 2 hours: Enhanced monitoring with reduced alert thresholds
- First 24 hours: On-call engineer actively monitoring system behavior
- First 72 hours: Daily review of all metrics for anomalies
- After 72 hours: Return to standard monitoring if no issues detected
9.3 Recovery Failure Escalation¶
If recovery verification fails:
- Identify failing check and assess impact
- Attempt corrective action (restart service, re-apply configuration)
- If not resolved within 30 minutes, escalate to Incident Commander
- Consider alternative recovery approach (different backup, different procedure)
- Document failure and resolution for procedure improvement
10. Roles and Responsibilities During BCP¶
10.1 BCP Organization¶
| Role | Primary Holder | Responsibilities |
|---|---|---|
| Incident Commander | CISO | Overall BCP authority; decision-making; escalation; regulatory communication |
| Operations Lead | Head of Operations | Coordinate recovery activities; direct technical teams; manage timeline |
| Infrastructure Engineer | Sr. DevOps Engineer | Execute technical recovery procedures; system administration; monitoring |
| Communications Lead | Head of Customer Success | Customer notification; internal communications; status page management |
| Technical Lead | CTO | Technical decision-making; architecture guidance; vendor coordination |
| Legal Counsel | Legal Advisor | Regulatory obligations; contractual requirements; liability assessment |
10.2 Decision Authority Matrix¶
| Decision | Authority | Escalation |
|---|---|---|
| Activate BCP | Incident Commander (CISO) | CEO if CISO unavailable |
| Authorize failover | Incident Commander | Operations Lead (if IC pre-authorized) |
| Customer communication content | Communications Lead + Legal Counsel | Incident Commander for approval |
| Accept data loss (RPO exceeded) | CEO | Board if loss exceeds defined threshold |
| Engage external vendors/consultants | Incident Commander | CEO if cost exceeds pre-approved budget |
| Declare recovery complete | Incident Commander (after verification) | N/A |
| Deactivate BCP | Incident Commander | CEO if incident duration >48 hours |
10.3 Succession and Availability¶
- Each primary role holder SHALL have a designated deputy
- Contact information maintained in secure, accessible location (not solely dependent on affected systems)
- If primary and deputy both unavailable, next in organizational hierarchy assumes responsibility
- Minimum 2 qualified personnel must be reachable at all times for Tier 1 service recovery
11. BCP Activation Criteria¶
11.1 Automatic Activation Triggers¶
The BCP SHALL be activated immediately when any of the following conditions are confirmed:
| Trigger | Condition | Activation Level |
|---|---|---|
| Major Incident (P1) | Critical security incident with service impact exceeding 30 minutes | Full BCP |
| Datacenter Failure | Complete loss of primary hosting environment | Full BCP |
| Ransomware Confirmation | Confirmed ransomware deployment affecting production systems | Full BCP |
| Key Compromise | Confirmed compromise or loss of encryption keys (vault.key) | Full BCP |
| Extended Outage | Any unplanned outage exceeding 50% of the applicable RTO | Partial BCP (escalate if not resolving) |
| Multi-System Failure | Simultaneous failure of 2+ Tier 1 services | Full BCP |
11.2 Discretionary Activation¶
The Incident Commander MAY activate BCP at their discretion for:
- Credible threat intelligence suggesting imminent attack
- Significant degradation trending toward failure
- Vendor/provider outage with uncertain resolution timeline
- Natural disaster or physical security threat to hosting facilities
- Pandemic or staffing emergency affecting operational capability
11.3 Activation Process¶
- Trigger condition identified and confirmed
- Incident Commander (or deputy) formally declares BCP activation
- All BCP role holders notified via emergency communication channels
- BCP war room established (physical or virtual)
- Initial situation assessment conducted (within 30 minutes)
- Recovery strategy selected based on scenario and available resources
- Recovery activities commence per applicable procedure (Section 5)
12. Communication During Outage¶
12.1 Internal Communication¶
| Channel | Purpose | Frequency |
|---|---|---|
| War Room (dedicated video/chat channel) | Real-time coordination among BCP team | Continuous during active incident |
| Status Updates (internal) | Broader team awareness | Every 30 minutes during active recovery |
| Executive Briefing | CEO/Board situation awareness | Every 2 hours or on significant changes |
| All-Hands Update | Full company awareness (if major) | As needed; minimum daily during extended outage |
12.2 Customer Communication¶
| Phase | Communication | Channel | Timing |
|---|---|---|---|
| Initial | Service disruption acknowledged; investigation underway | Status page + email | Within 30 minutes of detection |
| Updates | Progress, estimated resolution, workarounds | Status page | Every 60 minutes (minimum) |
| Resolution | Service restored; summary of impact | Status page + email | Within 1 hour of recovery |
| Post-Mortem | Root cause, remediation actions, prevention measures | Email + portal publication | Within 5 business days |
12.3 Status Page Protocol¶
- Status page SHALL be hosted on infrastructure independent of MazeVault production systems
- Status page updates SHALL NOT contain sensitive security details
- Status categories: Operational → Degraded Performance → Partial Outage → Major Outage
- Per-service status granularity: Backend API, Gateway Services, Certificate Services, Monitoring
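The escalation ladder and per-service granularity above imply an ordered severity scale. A minimal sketch (the class and function names are hypothetical, and the assumption that the page-level status is the worst per-service status is ours, not stated in this Plan):

```python
from enum import IntEnum

class ServiceStatus(IntEnum):
    """Ordered per the Section 12.3 ladder: higher value = more severe."""
    OPERATIONAL = 0
    DEGRADED_PERFORMANCE = 1
    PARTIAL_OUTAGE = 2
    MAJOR_OUTAGE = 3

def page_status(per_service: dict) -> ServiceStatus:
    # Assumed aggregation rule: the overall page status is the worst
    # status across the tracked services (Backend API, Gateway Services,
    # Certificate Services, Monitoring).
    return max(per_service.values(), default=ServiceStatus.OPERATIONAL)
```

Because `IntEnum` members compare as integers, `max()` naturally selects the most severe status; an empty roster defaults to Operational.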
12.4 Communication Templates¶
Templates for customer communication during outage SHALL be pre-approved by Legal Counsel and ready for immediate use. Templates SHALL cover:
- Initial outage acknowledgment
- Periodic status updates (with and without ETA)
- Service restoration notification
- Post-incident summary
13. Compliance Mapping¶
| Regulatory Requirement | Section in This Document |
|---|---|
| NIS2 Directive Art. 21(2)(c) — Business continuity and crisis management | Sections 1-12 (comprehensive BCP/DR framework) |
| DORA Art. 11 — ICT business continuity policy | Sections 1-4 (policy, objectives, architecture, backups) |
| DORA Art. 12 — ICT response and recovery plans | Sections 5-6 (DR procedures, split-brain prevention) |
| DORA Art. 12(2) — Backup policies and restoration methods | Section 4 (backup strategy), Section 5 (restore procedures) |
| DORA Art. 12(3) — Redundancy for critical functions | Section 3 (architecture for resilience), Section 6 (split-brain) |
| DORA Art. 12(5) — Testing of ICT business continuity plans | Section 7 (testing program) |
| ISO/IEC 27001:2022 A.5.29 — Information security during disruption | Sections 3, 5, 6 (security maintained during recovery) |
| ISO/IEC 27001:2022 A.5.30 — ICT readiness for business continuity | Sections 2-4 (RTO/RPO, architecture, backups) |
| ISO/IEC 27001:2022 A.8.13 — Information backup | Section 4 (backup strategy) |
| ISO/IEC 27001:2022 A.8.14 — Redundancy of information processing facilities | Section 3 (multi-DC architecture, DR standby) |
| SOC 2 A1.1 — Recovery objectives defined | Section 2 (RTO/RPO targets) |
| SOC 2 A1.2 — Recovery procedures exist and are tested | Sections 5, 7 (procedures and testing) |
| SOC 2 A1.3 — Recovery procedures support objectives | Section 9 (recovery verification against objectives) |
| Act No. 264/2025 Sb. — Continuity requirements for essential services | Sections 1-12 (comprehensive framework) |
14. Related Documents¶
| Document ID | Title |
|---|---|
| MV-LEG-001 | Information Security Policy |
| MV-LEG-002 | Cryptography Policy |
| MV-LEG-003 | Risk Management Policy |
| MV-LEG-004 | Logging & Monitoring Policy |
| MV-LEG-005 | Access Control Policy |
| MV-LEG-006 | Vulnerability & Patch Management Policy |
| MV-LEG-007 | Incident Response Plan |
Appendix A: Emergency Contact Quick Reference¶
| Role | Primary Contact | Backup Contact | Availability |
|---|---|---|---|
| Incident Commander (CISO) | [CISO Name / Phone / Email] | [Deputy / Phone / Email] | 24/7 |
| Operations Lead | [Name / Phone / Email] | [Deputy / Phone / Email] | 24/7 |
| Infrastructure Engineer | [Name / Phone / Email] | [Deputy / Phone / Email] | 24/7 (on-call rotation) |
| Communications Lead | [Name / Phone / Email] | [Deputy / Phone / Email] | Business hours + on-call |
| CEO | [Name / Phone / Email] | [COO / Phone / Email] | 24/7 for P1 |
| Legal Counsel | [Name / Phone / Email] | [External Firm / Phone] | Business hours + emergency retainer |
Note: Actual contact details are maintained in a secure, separately accessible contact roster (not in this document).
Appendix B: Quick Decision Flowchart¶
Service Disruption Detected
         │
         ▼
Is production affected?
         │
  Yes ──┼── No → Monitor; standard incident process
         │
         ▼
Can service be restored within 30 minutes?
         │
  Yes ──┼── No → Activate BCP
   │                 │
   ▼                 ▼
Standard        Identify scenario (Section 5)
incident             │
response             ├─ Gateway failure → Section 5.1
                     ├─ Database issue → Section 5.3
                     ├─ Full system loss → Section 5.4
                     ├─ Key loss → Section 5.5
                     └─ DC failure → Section 5.1 + 5.4
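The decision flow above can be expressed as a small triage helper. This is a sketch for illustration only: the scenario keys and the function itself are hypothetical, while the thresholds and section references come from this Plan.

```python
def triage(production_affected: bool,
           restorable_within_30_min: bool,
           scenario: str = "") -> str:
    """Map the Appendix B flowchart to a recommended action."""
    if not production_affected:
        return "Monitor; standard incident process"
    if restorable_within_30_min:
        return "Standard incident response"
    # Outage exceeds the 30-minute threshold: activate BCP and route
    # to the applicable recovery procedure (Section 5).
    procedures = {
        "gateway_failure": "Section 5.1",
        "database_issue": "Section 5.3",
        "full_system_loss": "Section 5.4",
        "key_loss": "Section 5.5",
        "dc_failure": "Section 5.1 + 5.4",
    }
    ref = procedures.get(scenario, "Section 5")
    return f"Activate BCP; proceed per {ref}"
```

An unrecognized scenario deliberately falls back to the general Section 5 entry point rather than failing, mirroring the flowchart's "Identify scenario" step.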
Document Control¶
| Version | Date | Author | Change Description |
|---|---|---|---|
| 1.0.0 | 2026-05-01 | CISO | Initial release |
END OF DOCUMENT