Monitoring

Prometheus Metrics, Alerting, and Dashboards

Document Version: 1.0.0
Last Updated: 2026-02-10


1. Metrics Endpoints

| Component | Endpoint | Format |
| --- | --- | --- |
| API Server | /metrics (port 8080) | Prometheus |
| OCSP Responder | /metrics (port 8081) | Prometheus |
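
As a starting point, the two endpoints above could be scraped with a configuration along these lines. This is a minimal sketch, not a reference configuration: the job names are taken from the alert expressions used later in this document, while the hostnames and the 30s interval are placeholder assumptions to replace with your own values.

```yaml
# prometheus.yml (fragment) - hostnames below are hypothetical placeholders
scrape_configs:
  - job_name: mazevault-backend          # matches up{job="mazevault-backend"}
    metrics_path: /metrics
    scrape_interval: 30s                 # assumed value; tune to your needs
    static_configs:
      - targets: ["mazevault-api.example.internal:8080"]

  - job_name: mazevault-ocsp             # matches up{job="mazevault-ocsp"}
    metrics_path: /metrics
    scrape_interval: 30s                 # assumed value; tune to your needs
    static_configs:
      - targets: ["mazevault-ocsp.example.internal:8081"]
```

Keeping the job names identical to those in the alert expressions matters, because the `up{job=...}` alerts only fire for targets scraped under exactly those names.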

2. Key Metrics

API Server Metrics

| Metric | Type | Description |
| --- | --- | --- |
| http_requests_total | Counter | Total HTTP requests (labels: method, path, status) |
| http_request_duration_seconds | Histogram | Request duration (labels: method, path) |
| http_requests_in_flight | Gauge | Currently processing requests |
| mazevault_secrets_total | Gauge | Total number of secrets |
| mazevault_certificates_total | Gauge | Total number of certificates |
| mazevault_certificates_expiring | Gauge | Certificates expiring within 30 days |
| mazevault_agents_online | Gauge | Currently online agents |
| mazevault_sync_operations_total | Counter | Sync operations (labels: status) |
| mazevault_rotation_operations_total | Counter | Rotation operations (labels: status) |
| mazevault_auth_attempts_total | Counter | Authentication attempts (labels: method, result) |
| mazevault_license_days_remaining | Gauge | Days until license expiry |

OCSP Responder Metrics

| Metric | Type | Description |
| --- | --- | --- |
| ocsp_requests_total | Counter | Total OCSP requests (labels: status) |
| ocsp_request_duration_seconds | Histogram | OCSP response time |
| ocsp_cache_hits_total | Counter | Response cache hits |
| ocsp_cache_misses_total | Counter | Response cache misses |

Database Metrics

| Metric | Type | Description |
| --- | --- | --- |
| db_connections_active | Gauge | Active database connections |
| db_connections_idle | Gauge | Idle database connections |
| db_query_duration_seconds | Histogram | Database query latency |

3. Alert Thresholds

| Alert | Condition | Severity | Action |
| --- | --- | --- | --- |
| API Server Down | up{job="mazevault-backend"} == 0 for 5m | Critical | Immediate investigation |
| High Error Rate | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05 for 10m | Warning | Investigate error logs |
| High API Latency | histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2 for 5m | Warning | Check DB and Redis |
| API Latency Critical | histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 5 for 5m | Critical | Immediate investigation |
| Certificate Expiring | mazevault_certificates_expiring > 0 | Warning | Renew certificates |
| License Expiring (60d) | mazevault_license_days_remaining < 60 | Warning | Contact MazeVault for renewal |
| License Expiring (14d) | mazevault_license_days_remaining < 14 | Critical | Urgent renewal required |
| OCSP Down | up{job="mazevault-ocsp"} == 0 for 5m | Critical | Restart OCSP responder |
| Database Connections High | db_connections_active > 80 | Warning | Review connection pool settings |
| Disk Usage High | node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2 | Warning | Cleanup or expand storage |
| Disk Usage Critical | node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05 | Critical | Immediate storage expansion |

The error-rate condition is a ratio of 5xx responses to all responses (5%), matching the alert rule below. The latency conditions use histogram_quantile over the _bucket series because http_request_duration_seconds is a Histogram, which does not expose a quantile label.
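
The Severity column can also drive notification routing. The following is a minimal Alertmanager sketch, assuming alert rules carry a lowercase `severity` label with the values `critical` and `warning`. The receiver names are hypothetical, and no notification integrations are configured here.

```yaml
# alertmanager.yml (fragment) - receiver names are hypothetical placeholders
route:
  receiver: ops-default              # fallback when no child route matches
  group_by: [alertname]
  routes:
    - matchers:
        - 'severity="critical"'      # Critical rows: page the on-call rotation
      receiver: ops-oncall
    - matchers:
        - 'severity="warning"'       # Warning rows: lower-urgency channel
      receiver: ops-default

receivers:
  - name: ops-default                # add email_configs, slack_configs, etc.
  - name: ops-oncall                 # add pagerduty_configs, webhook_configs, etc.
```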

Prometheus Alert Rules

groups:
  - name: mazevault
    rules:
      - alert: MazeVaultBackendDown
        expr: up{job="mazevault-backend"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MazeVault API Server is down"
          description: "API Server has been unreachable for 5 minutes"

      - alert: MazeVaultHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="mazevault-backend", status=~"5.."}[5m]))
          / sum(rate(http_requests_total{job="mazevault-backend"}[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on MazeVault API"
          description: "Error rate exceeds 5% for 10 minutes"

      - alert: MazeVaultLicenseExpiring
        expr: mazevault_license_days_remaining < 60
        labels:
          severity: warning
        annotations:
          summary: "MazeVault license expiring soon"
          description: "License expires in {{ $value }} days"

      - alert: MazeVaultCertificateExpiring
        expr: mazevault_certificates_expiring > 0
        labels:
          severity: warning
        annotations:
          summary: "Certificates expiring within 30 days"
          description: "{{ $value }} certificates are expiring soon"
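
The threshold table in this section lists several conditions that have no corresponding rule in the file above. The sketch below shows how some of them might be written as an additional rule group. The group name, alert names, and annotation text are illustrative assumptions. The latency rule uses histogram_quantile over the _bucket series, since http_request_duration_seconds is documented as a Histogram.

```yaml
groups:
  - name: mazevault-additional         # hypothetical group name
    rules:
      - alert: MazeVaultHighLatency    # hypothetical alert name
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket{job="mazevault-backend"}[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MazeVault API p95 latency above 2 seconds"

      - alert: MazeVaultLicenseExpiringCritical
        expr: mazevault_license_days_remaining < 14
        labels:
          severity: critical
        annotations:
          summary: "MazeVault license expires in {{ $value }} days"

      - alert: MazeVaultOCSPDown
        expr: up{job="mazevault-ocsp"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MazeVault OCSP Responder is down"

      - alert: MazeVaultDBConnectionsHigh
        expr: db_connections_active > 80
        labels:
          severity: warning
        annotations:
          summary: "Active database connections above 80"
```

Rule files like this can be validated before deployment with `promtool check rules <file>`.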

4. Grafana Dashboard

| Panel | Visualization | Query |
| --- | --- | --- |
| API Request Rate | Time series | rate(http_requests_total[5m]) |
| API Error Rate | Time series | rate(http_requests_total{status=~"5.."}[5m]) |
| API Latency (p95) | Time series | histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) |
| Active Secrets | Stat | mazevault_secrets_total |
| Active Certificates | Stat | mazevault_certificates_total |
| Expiring Certificates | Stat (red) | mazevault_certificates_expiring |
| Online Agents | Stat | mazevault_agents_online |
| License Days Remaining | Gauge | mazevault_license_days_remaining |
| OCSP Request Rate | Time series | rate(ocsp_requests_total[5m]) |
| OCSP Cache Hit Ratio | Gauge | rate(ocsp_cache_hits_total[5m]) / (rate(ocsp_cache_hits_total[5m]) + rate(ocsp_cache_misses_total[5m])) |
| DB Connections | Time series | db_connections_active |
| Auth Failures | Time series | rate(mazevault_auth_attempts_total{result="failure"}[5m]) |
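
If the dashboard is loaded frequently, the heavier panel queries could be precomputed with Prometheus recording rules and the panels pointed at the recorded series instead. The sketch below is one possible arrangement. The group name and recorded series names are assumptions that follow the common level:metric:operations naming pattern, and the expressions aggregate per job rather than per label combination.

```yaml
groups:
  - name: mazevault-dashboard-recording    # hypothetical group name
    interval: 1m                           # assumed evaluation interval
    rules:
      # Overall request rate per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Share of requests answered with a 5xx status
      - record: job:http_requests_5xx:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))

      # p95 request latency, aggregated across method and path
      - record: job:http_request_duration_seconds:p95_rate5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

      # OCSP cache hit ratio
      - record: job:ocsp_cache:hit_ratio_rate5m
        expr: |
          sum by (job) (rate(ocsp_cache_hits_total[5m]))
          / (
            sum by (job) (rate(ocsp_cache_hits_total[5m]))
            + sum by (job) (rate(ocsp_cache_misses_total[5m]))
          )
```

Note that the 5xx ratio returns no data, rather than zero, while no 5xx series exist. A panel built on it may need a "no value" fallback.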

5. Log Aggregation

Structured Logging

All MazeVault components output structured JSON logs:

{
  "time": "2026-02-10T14:30:00.000Z",
  "level": "info",
  "msg": "Request completed",
  "method": "GET",
  "path": "/api/v1/secrets",
  "status": 200,
  "duration_ms": 12,
  "request_id": "req_abc123",
  "user_id": "usr_def456"
}

Log Levels

| Level | Description | When to Use |
| --- | --- | --- |
| error | Operation failed | Investigate immediately |
| warn | Unexpected but recoverable | Monitor trends |
| info | Normal operations | Audit and debugging |
| debug | Detailed diagnostic | Development / temporary troubleshooting |

Recommended log aggregation tooling by platform:

| Platform | Recommended Tool |
| --- | --- |
| Kubernetes | Fluentd / Fluent Bit → Elasticsearch / Loki |
| On-Premise | Filebeat → Elasticsearch / Splunk |
| Azure | Azure Monitor / Log Analytics |
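
For the on-premise path, the following is a minimal Filebeat sketch that decodes the JSON log lines and ships them to Elasticsearch. This document does not state where MazeVault writes its log files, so the log path is an assumption, as are the Elasticsearch host and the input id. Adjust all three to your environment.

```yaml
# filebeat.yml (fragment) - path, id, and host are hypothetical placeholders
filebeat.inputs:
  - type: filestream
    id: mazevault-json-logs
    paths:
      - /var/log/mazevault/*.log     # assumed location of the JSON logs
    parsers:
      - ndjson:
          target: ""                 # place decoded fields at the event root
          add_error_key: true        # mark lines that fail JSON decoding

output.elasticsearch:
  hosts: ["https://elasticsearch.example.internal:9200"]
```

Decoding at the shipper means fields such as `level`, `status`, and `request_id` arrive as structured fields, so they can be filtered and aggregated without parsing at query time.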