Monitoring
Prometheus Metrics, Alerting, and Dashboards
Document Version: 1.0.0
Last Updated: 2026-02-10
1. Metrics Endpoints
| Component | Endpoint | Format |
|-----------|----------|--------|
| API Server | /metrics (port 8080) | Prometheus |
| OCSP Responder | /metrics (port 8081) | Prometheus |
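To spot-check an endpoint by hand, the Prometheus text format can be read with a few lines of Python. The sketch below is a minimal illustration, not a full parser: it ignores HELP/TYPE metadata and leaves escape sequences in label values untouched. The function names and the sample values are hypothetical, and the host in the usage note is an assumption.

```python
import re
from urllib.request import urlopen

# Matches sample lines such as:
#   http_requests_total{method="GET",status="200"} 1027
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def fetch_metrics(url, timeout=5.0):
    """Download the raw text exposition output from a /metrics endpoint."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text):
    """Parse exposition text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:  # not a sample line this sketch understands
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Illustrative payload only; real values come from the endpoint.
SAMPLE = """\
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/api/v1/secrets",status="200"} 1027
mazevault_agents_online 3
"""

if __name__ == "__main__":
    for name, labels, value in parse_metrics(SAMPLE):
        print(name, labels, value)
```

Against a running API Server, `parse_metrics(fetch_metrics("http://localhost:8080/metrics"))` would return the live samples, assuming the server is reachable on localhost.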
2. Key Metrics
API Server Metrics
| Metric | Type | Description |
|--------|------|-------------|
| http_requests_total | Counter | Total HTTP requests (labels: method, path, status) |
| http_request_duration_seconds | Histogram | Request duration (labels: method, path) |
| http_requests_in_flight | Gauge | Requests currently being processed |
| mazevault_secrets_total | Gauge | Total number of secrets |
| mazevault_certificates_total | Gauge | Total number of certificates |
| mazevault_certificates_expiring | Gauge | Certificates expiring within 30 days |
| mazevault_agents_online | Gauge | Currently online agents |
| mazevault_sync_operations_total | Counter | Sync operations (labels: status) |
| mazevault_rotation_operations_total | Counter | Rotation operations (labels: status) |
| mazevault_auth_attempts_total | Counter | Authentication attempts (labels: method, result) |
| mazevault_license_days_remaining | Gauge | Days until license expiry |
OCSP Responder Metrics
| Metric | Type | Description |
|--------|------|-------------|
| ocsp_requests_total | Counter | Total OCSP requests (labels: status) |
| ocsp_request_duration_seconds | Histogram | OCSP response time |
| ocsp_cache_hits_total | Counter | Response cache hits |
| ocsp_cache_misses_total | Counter | Response cache misses |
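Because the cache metrics are counters, a hit ratio is only meaningful over an interval, computed from the increase between two scrapes. The following sketch shows that arithmetic under simplifying assumptions: the helper names are hypothetical, the snapshot dictionaries are illustrative, and a counter that drops is treated as having restarted from zero.

```python
def counter_increase(earlier, later):
    """Increase between two counter samples.

    A drop is interpreted as a counter reset (process restart), in which
    case the later value is taken as the whole increase.
    """
    return later if later < earlier else later - earlier


def cache_hit_ratio(prev, curr):
    """Hit ratio over the interval between two scrapes, or None if idle."""
    hits = counter_increase(
        prev["ocsp_cache_hits_total"], curr["ocsp_cache_hits_total"]
    )
    misses = counter_increase(
        prev["ocsp_cache_misses_total"], curr["ocsp_cache_misses_total"]
    )
    total = hits + misses
    if total == 0:  # no OCSP lookups in the interval
        return None
    return hits / total


if __name__ == "__main__":
    # Illustrative values for two consecutive scrapes.
    previous = {"ocsp_cache_hits_total": 100, "ocsp_cache_misses_total": 20}
    current = {"ocsp_cache_hits_total": 190, "ocsp_cache_misses_total": 30}
    print(cache_hit_ratio(previous, current))
```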
Database Metrics
| Metric | Type | Description |
|--------|------|-------------|
| db_connections_active | Gauge | Active database connections |
| db_connections_idle | Gauge | Idle database connections |
| db_query_duration_seconds | Histogram | Database query latency |
3. Alert Thresholds
Recommended Alert Rules
| Alert | Condition | Severity | Action |
|-------|-----------|----------|--------|
| API Server Down | up{job="mazevault-backend"} == 0 for 5m | Critical | Immediate investigation |
| High Error Rate | rate(http_requests_total{status=~"5.."}[5m]) > 0.1 for 10m | Warning | Investigate error logs |
| High API Latency | histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 2 for 5m | Warning | Check DB and Redis |
| API Latency Critical | histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 5 for 5m | Critical | Immediate investigation |
| Certificate Expiring | mazevault_certificates_expiring > 0 | Warning | Renew certificates |
| License Expiring (60d) | mazevault_license_days_remaining < 60 | Warning | Contact MazeVault for renewal |
| License Expiring (14d) | mazevault_license_days_remaining < 14 | Critical | Urgent renewal required |
| OCSP Down | up{job="mazevault-ocsp"} == 0 for 5m | Critical | Restart OCSP responder |
| Database Connections High | db_connections_active > 80 | Warning | Review connection pool settings |
| Disk Usage High | node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2 | Warning | Clean up or expand storage |
| Disk Usage Critical | node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.05 | Critical | Immediate storage expansion |
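Several of these alerts come in warning/critical pairs over the same metric. The sketch below shows one way to map a reading to a single severity by testing the stricter threshold first, so a value that crosses both limits is reported once as critical. The function names are hypothetical; the thresholds are the license and disk values from the table.

```python
def license_severity(days_remaining):
    """Severity for mazevault_license_days_remaining, or None if healthy."""
    if days_remaining < 14:  # stricter threshold is checked first
        return "critical"
    if days_remaining < 60:
        return "warning"
    return None


def disk_severity(avail_bytes, size_bytes):
    """Severity for the free-space fraction of a filesystem."""
    free_fraction = avail_bytes / size_bytes
    if free_fraction < 0.05:
        return "critical"
    if free_fraction < 0.2:
        return "warning"
    return None


if __name__ == "__main__":
    print(license_severity(30))                # warning
    print(disk_severity(4 * 1024**3, 100 * 1024**3))  # critical
```

In Prometheus itself both rules of a pair would fire together once the critical threshold is crossed; suppressing the warning in that case is typically handled with an Alertmanager inhibition rule rather than in the expressions.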
Prometheus Alert Rules
groups:
  - name: mazevault
    rules:
      - alert: MazeVaultBackendDown
        expr: up{job="mazevault-backend"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MazeVault API Server is down"
          description: "API Server has been unreachable for 5 minutes"
      # Aggregate with sum() before dividing; without it each 5xx series
      # is divided by itself and the ratio is always 1.
      - alert: MazeVaultHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="mazevault-backend", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="mazevault-backend"}[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on MazeVault API"
          description: "Error rate exceeds 5% for 10 minutes"
      - alert: MazeVaultLicenseExpiring
        expr: mazevault_license_days_remaining < 60
        labels:
          severity: warning
        annotations:
          summary: "MazeVault license expiring soon"
          description: "License expires in {{ $value }} days"
      - alert: MazeVaultCertificateExpiring
        expr: mazevault_certificates_expiring > 0
        labels:
          severity: warning
        annotations:
          summary: "Certificates expiring within 30 days"
          description: "{{ $value }} certificates are expiring soon"
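The `for:` clause in these rules means the expression must hold at every evaluation for the whole duration before the alert fires; until then it is pending, and a single evaluation below the threshold resets the timer. The sketch below models that behaviour for a "greater than" condition. It is a simplified illustration with a hypothetical function name and made-up samples, not Prometheus's implementation.

```python
def alert_state(samples, threshold, for_seconds):
    """State of a 'value > threshold' alert as of the last evaluation.

    samples: list of (timestamp_seconds, value), one per rule evaluation,
    in ascending time order.
    Returns "inactive", "pending" or "firing".
    """
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
        else:
            active_since = None  # any healthy evaluation resets the timer
    if active_since is None:
        return "inactive"
    last_timestamp = samples[-1][0]
    if last_timestamp - active_since >= for_seconds:
        return "firing"
    return "pending"


if __name__ == "__main__":
    # Error ratio of 8% at every 60-second evaluation for 10 minutes.
    sustained = [(t, 0.08) for t in range(0, 601, 60)]
    print(alert_state(sustained, threshold=0.05, for_seconds=600))
```

With the MazeVaultHighErrorRate settings (threshold 0.05, 10 minutes), a brief dip below 5% partway through would put the alert back into the pending state.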
4. Grafana Dashboard
Recommended Panels
| Panel | Visualization | Query |
|-------|---------------|-------|
| API Request Rate | Time series | rate(http_requests_total[5m]) |
| API Error Rate | Time series | rate(http_requests_total{status=~"5.."}[5m]) |
| API Latency (p95) | Time series | histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) |
| Active Secrets | Stat | mazevault_secrets_total |
| Active Certificates | Stat | mazevault_certificates_total |
| Expiring Certificates | Stat (red) | mazevault_certificates_expiring |
| Online Agents | Stat | mazevault_agents_online |
| License Days Remaining | Gauge | mazevault_license_days_remaining |
| OCSP Request Rate | Time series | rate(ocsp_requests_total[5m]) |
| OCSP Cache Hit Ratio | Gauge | rate(ocsp_cache_hits_total[5m]) / (rate(ocsp_cache_hits_total[5m]) + rate(ocsp_cache_misses_total[5m])) |
| DB Connections | Time series | db_connections_active |
| Auth Failures | Time series | rate(mazevault_auth_attempts_total{result="failure"}[5m]) |
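The p95 latency panel is an estimate, not an exact measurement: `histogram_quantile` locates the bucket that contains the requested rank and interpolates linearly inside it. The sketch below approximates that calculation for classic cumulative buckets with non-negative bounds. The function name mirrors the PromQL one for readability only, and the bucket boundaries and counts are illustrative, not MazeVault's actual configuration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: at least two (upper_bound, cumulative_count) pairs sorted by
    upper_bound, the last one being (+Inf, total_count). Requires 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    for index, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev_count = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly within the bucket.
    return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)


if __name__ == "__main__":
    # Illustrative request-duration buckets, in seconds.
    example = [(0.1, 50), (0.5, 80), (1.0, 90), (2.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, example))
```

One practical consequence is that the estimate can never exceed the highest finite bucket bound, so a threshold above that bound cannot be detected.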
5. Log Aggregation
Structured Logging
All MazeVault components output structured JSON logs:
{
  "time": "2026-02-10T14:30:00.000Z",
  "level": "info",
  "msg": "Request completed",
  "method": "GET",
  "path": "/api/v1/secrets",
  "status": 200,
  "duration_ms": 12,
  "request_id": "req_abc123",
  "user_id": "usr_def456"
}
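Structured logs can be processed with standard JSON tooling. The sketch below assumes each entry is emitted as a single JSON object per line (the example above is pretty-printed for readability) and uses the field names shown there. The function names, the sample lines, and the 500 ms threshold are hypothetical.

```python
import json


def parse_log_lines(lines):
    """Decode JSON log lines, skipping blanks and anything that is not JSON."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a stack trace or startup banner
        if isinstance(record, dict):
            records.append(record)
    return records


def slow_requests(records, threshold_ms):
    """Records whose duration_ms exceeds the threshold."""
    return [r for r in records if r.get("duration_ms", 0) > threshold_ms]


if __name__ == "__main__":
    # Illustrative log lines, not real output.
    lines = [
        '{"level": "info", "msg": "Request completed", "status": 200, "duration_ms": 12}',
        "plain text line",
        '{"level": "error", "msg": "Request failed", "status": 500, "duration_ms": 950}',
    ]
    for record in slow_requests(parse_log_lines(lines), threshold_ms=500):
        print(record["status"], record["duration_ms"])
```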
Log Levels
| Level | Description | When to Use |
|-------|-------------|-------------|
| error | Operation failed | Investigate immediately |
| warn | Unexpected but recoverable | Monitor trends |
| info | Normal operations | Audit and debugging |
| debug | Detailed diagnostic | Development / temporary troubleshooting |
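The levels form an ordered scale, which makes it straightforward to filter records at or above a chosen severity or to count entries per level when watching warn trends. The sketch below assumes records are dictionaries with a `level` field as in the structured log example. The numeric ranks and function names are hypothetical and only encode the ordering debug < info < warn < error.

```python
from collections import Counter

# Hypothetical numeric ranks; only the relative order matters.
LEVEL_ORDER = {"debug": 10, "info": 20, "warn": 30, "error": 40}


def at_or_above(records, minimum):
    """Records whose level is at least as severe as `minimum`."""
    floor = LEVEL_ORDER[minimum]
    return [
        r for r in records
        if LEVEL_ORDER.get(r.get("level"), 0) >= floor  # unknown levels rank lowest
    ]


def count_by_level(records):
    """Number of records per level, for trend monitoring."""
    return Counter(r.get("level", "unknown") for r in records)


if __name__ == "__main__":
    # Illustrative records.
    records = [
        {"level": "debug", "msg": "Cache lookup"},
        {"level": "info", "msg": "Request completed"},
        {"level": "warn", "msg": "Retrying sync"},
        {"level": "error", "msg": "Rotation failed"},
    ]
    print([r["msg"] for r in at_or_above(records, "warn")])
    print(count_by_level(records))
```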
Recommended Log Collection
| Platform | Recommended Tool |
|----------|------------------|
| Kubernetes | Fluentd / Fluent Bit → Elasticsearch / Loki |
| On-Premise | Filebeat → Elasticsearch / Splunk |
| Azure | Azure Monitor / Log Analytics |