feat(M48): continuous TLS health monitoring — endpoint state machine, shared tlsprobe, 8 API endpoints, GUI

Adds continuous TLS endpoint health monitoring that closes the deploy→verify→monitor loop. After M25 verifies a deployment succeeded once, M48 continuously confirms it stays healthy. Key components: - Shared `internal/tlsprobe/` package extracted from network scanner for reuse - Health status state machine: healthy → degraded (2 failures) → down (5 failures), plus cert_mismatch when served fingerprint differs from expected - 8th scheduler loop (60s tick, per-endpoint configurable intervals) - PostgreSQL migration 000011: endpoint_health_checks + endpoint_health_history tables - 8 REST API endpoints (CRUD, history, acknowledge, summary) - Health Monitor GUI page with summary bar, status table, create modal, auto-refresh - 38 new tests (5 tlsprobe + 11 domain + 10 service + 8 handler + 4 frontend) - All coverage thresholds maintained (service 68%, handler 83%, domain 87%, middleware 63%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-07 17:51:29 +00:00 · 2026-04-15 21:45:45 -04:00
parent f2e60b93a3
commit 596d86a206
29 changed files with 3540 additions and 30 deletions
@@ -801,6 +801,53 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor
 | `/api/v1/network-scan-targets/{id}` | DELETE | Delete |
 | `/api/v1/network-scan-targets/{id}/scan` | POST | Trigger immediate scan |

+### Continuous TLS Health Monitoring
+
+<!-- Source: internal/domain/health_check.go, internal/service/health_check.go -->
+
+Beyond one-time discovery (M18b, M21), the health monitor continuously probes TLS endpoints and tracks certificate freshness. Uses the shared `internal/tlsprobe/` package (same as network scanner) to compare deployed certificate fingerprints against live endpoints, catching silent rollbacks and unauthorized replacements.
+
+**Status Transitions:**
+- `Healthy` — endpoint responding, certificate matches expected
+- `Degraded` — consecutive probe failures reach threshold (default 2)
+- `Down` — consecutive failures exceed degradation threshold (default 5)
+- `Cert_Mismatch` — observed cert fingerprint differs from expected (unauthorized replacement)
+
+**Auto-Create:** When a deployment completes successfully with TLS verification enabled (M25), certctl automatically creates a health check with the deployed certificate's fingerprint as the baseline.
+
+**Probe History:** Each probe stores: TLS version, cipher suite, response time, cert metadata (subject, issuer, validity), status, and error details. Retained for 30 days (configurable), then purged by the scheduler.
+
+**Alerts on State Transitions:**
+- Cert_Mismatch: HIGH severity (catches unauthorized changes)
+- Down: CRITICAL severity (service broken)
+- Degraded: WARNING severity (intermittent issues)
+- Recovery to Healthy: INFO severity (status update)
+
+**Configuration:**
+
+| Env Var | Default | Description |
+|---|---|---|
+| `CERTCTL_HEALTH_CHECK_ENABLED` | `false` | Enable health monitoring |
+| `CERTCTL_HEALTH_CHECK_INTERVAL` | `60s` | Scheduler tick interval |
+| `CERTCTL_HEALTH_CHECK_DEFAULT_INTERVAL` | `300s` | Default per-endpoint check frequency |
+| `CERTCTL_HEALTH_CHECK_DEFAULT_TIMEOUT` | `5000ms` | TLS connection timeout per probe |
+| `CERTCTL_HEALTH_CHECK_MAX_CONCURRENT` | `20` | Max concurrent TLS probes |
+| `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION` | `30 days` | Purge probe history older than this |
+| `CERTCTL_HEALTH_CHECK_AUTO_CREATE` | `true` | Auto-create checks from deployments |
+
+**Health Check API:**
+
+| Endpoint | Method | Description |
+|---|---|---|
+| `/api/v1/health-checks` | GET | List with `?status`, `?certificate_id`, `?network_scan_target_id`, `?enabled` filters + pagination |
+| `/api/v1/health-checks/{id}` | GET | Detail |
+| `/api/v1/health-checks` | POST | Create manual check (endpoint, expected_fingerprint, check_interval, timeout) |
+| `/api/v1/health-checks/{id}` | PUT | Update thresholds, interval, or expected fingerprint |
+| `/api/v1/health-checks/{id}` | DELETE | Delete |
+| `/api/v1/health-checks/{id}/history` | GET | Probe history with `?limit` param |
+| `/api/v1/health-checks/{id}/acknowledge` | POST | Mark incident as acknowledged by operator |
+| `/api/v1/health-checks/summary` | GET | Aggregate counts by status (Healthy, Degraded, Down, Cert_Mismatch) |
+
 ---

 ## Ownership and Teams