feat(M48): continuous TLS health monitoring — endpoint state machine, shared tlsprobe, 8 API endpoints, GUI

Adds continuous TLS endpoint health monitoring that closes the deploy→verify→monitor loop. After M25 verifies a deployment succeeded once, M48 continuously confirms it stays healthy. Key components: - Shared `internal/tlsprobe/` package extracted from network scanner for reuse - Health status state machine: healthy → degraded (2 failures) → down (5 failures), plus cert_mismatch when served fingerprint differs from expected - 8th scheduler loop (60s tick, per-endpoint configurable intervals) - PostgreSQL migration 000011: endpoint_health_checks + endpoint_health_history tables - 8 REST API endpoints (CRUD, history, acknowledge, summary) - Health Monitor GUI page with summary bar, status table, create modal, auto-refresh - 38 new tests (5 tlsprobe + 11 domain + 10 service + 8 handler + 4 frontend) - All coverage thresholds maintained (service 68%, handler 83%, domain 87%, middleware 63%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-07-26 14:58:13 +00:00 · 2026-04-15 21:45:45 -04:00
parent f2e60b93a3
commit 596d86a206
29 changed files with 3540 additions and 30 deletions
@@ -1018,6 +1018,30 @@ flowchart TB

 This data flow is pull-based and non-blocking. Agents discover at their own pace; the server stores results for later review. There's no pressure to claim or dismiss; operators can leave certificates in "Unmanaged" status indefinitely.

+## Continuous TLS Health Monitoring (M48)
+
+Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.
+
+**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated 8th scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
+
+**State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.
+
+**API:** 8 endpoints for list (with filters: status, certificate_id, network_scan_target_id, enabled), get, create, update, delete, history (with limit param), acknowledge (incident marking), and summary (aggregate status counts).
+
+**Auto-Create:** When a deployment job completes with successful verification (M25), the system automatically creates a health check with the deployed certificate's fingerprint as the expected value. Network scan targets can also opt-in to auto-create health checks for discovered endpoints.
+
+**Configuration:**
+
+| Env Var | Default | Description |
+|---|---|---|
+| `CERTCTL_HEALTH_CHECK_ENABLED` | `false` | Enable/disable the feature |
+| `CERTCTL_HEALTH_CHECK_INTERVAL` | `60s` | Scheduler tick interval |
+| `CERTCTL_HEALTH_CHECK_DEFAULT_INTERVAL` | `300s` | Default per-endpoint check interval (5 min) |
+| `CERTCTL_HEALTH_CHECK_DEFAULT_TIMEOUT` | `5000ms` | TLS connection timeout per probe |
+| `CERTCTL_HEALTH_CHECK_MAX_CONCURRENT` | `20` | Max concurrent TLS probes |
+| `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION` | `30 days` | Purge probe history older than this |
+| `CERTCTL_HEALTH_CHECK_AUTO_CREATE` | `true` | Auto-create checks from deployments |
+
 ## Testing Strategy

 certctl is extensively tested across eight layers with CI-enforced coverage gates that act as regression floors. The goal is high-confidence regression prevention at the service and handler layers (where the most complex business logic lives), combined with integration tests that exercise the full request path from HTTP to database.
@@ -801,6 +801,53 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor
 | `/api/v1/network-scan-targets/{id}` | DELETE | Delete |
 | `/api/v1/network-scan-targets/{id}/scan` | POST | Trigger immediate scan |

+### Continuous TLS Health Monitoring
+
+<!-- Source: internal/domain/health_check.go, internal/service/health_check.go -->
+
+Beyond one-time discovery (M18b, M21), the health monitor continuously probes TLS endpoints and tracks certificate freshness. Uses the shared `internal/tlsprobe/` package (same as network scanner) to compare deployed certificate fingerprints against live endpoints, catching silent rollbacks and unauthorized replacements.
+
+**Status Transitions:**
+- `Healthy` — endpoint responding, certificate matches expected
+- `Degraded` — consecutive probe failures reach threshold (default 2)
+- `Down` — consecutive failures exceed degradation threshold (default 5)
+- `Cert_Mismatch` — observed cert fingerprint differs from expected (unauthorized replacement)
+
+**Auto-Create:** When a deployment completes successfully with TLS verification enabled (M25), certctl automatically creates a health check with the deployed certificate's fingerprint as the baseline.
+
+**Probe History:** Each probe stores: TLS version, cipher suite, response time, cert metadata (subject, issuer, validity), status, and error details. Retained for 30 days (configurable), then purged by the scheduler.
+
+**Alerts on State Transitions:**
+- Cert_Mismatch: HIGH severity (catches unauthorized changes)
+- Down: CRITICAL severity (service broken)
+- Degraded: WARNING severity (intermittent issues)
+- Recovery to Healthy: INFO severity (status update)
+
+**Configuration:**
+
+| Env Var | Default | Description |
+|---|---|---|
+| `CERTCTL_HEALTH_CHECK_ENABLED` | `false` | Enable health monitoring |
+| `CERTCTL_HEALTH_CHECK_INTERVAL` | `60s` | Scheduler tick interval |
+| `CERTCTL_HEALTH_CHECK_DEFAULT_INTERVAL` | `300s` | Default per-endpoint check frequency |
+| `CERTCTL_HEALTH_CHECK_DEFAULT_TIMEOUT` | `5000ms` | TLS connection timeout per probe |
+| `CERTCTL_HEALTH_CHECK_MAX_CONCURRENT` | `20` | Max concurrent TLS probes |
+| `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION` | `30 days` | Purge probe history older than this |
+| `CERTCTL_HEALTH_CHECK_AUTO_CREATE` | `true` | Auto-create checks from deployments |
+
+**Health Check API:**
+
+| Endpoint | Method | Description |
+|---|---|---|
+| `/api/v1/health-checks` | GET | List with `?status`, `?certificate_id`, `?network_scan_target_id`, `?enabled` filters + pagination |
+| `/api/v1/health-checks/{id}` | GET | Detail |
+| `/api/v1/health-checks` | POST | Create manual check (endpoint, expected_fingerprint, check_interval, timeout) |
+| `/api/v1/health-checks/{id}` | PUT | Update thresholds, interval, or expected fingerprint |
+| `/api/v1/health-checks/{id}` | DELETE | Delete |
+| `/api/v1/health-checks/{id}/history` | GET | Probe history with `?limit` param |
+| `/api/v1/health-checks/{id}/acknowledge` | POST | Mark incident as acknowledged by operator |
+| `/api/v1/health-checks/summary` | GET | Aggregate counts by status (Healthy, Degraded, Down, Cert_Mismatch) |
+
 ---

 ## Ownership and Teams