mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:32:02 +00:00
feat(M48): continuous TLS health monitoring — endpoint state machine, shared tlsprobe, 8 API endpoints, GUI
Adds continuous TLS endpoint health monitoring that closes the deploy→verify→monitor loop. After M25 verifies a deployment succeeded once, M48 continuously confirms it stays healthy. Key components: - Shared `internal/tlsprobe/` package extracted from network scanner for reuse - Health status state machine: healthy → degraded (2 failures) → down (5 failures), plus cert_mismatch when served fingerprint differs from expected - 8th scheduler loop (60s tick, per-endpoint configurable intervals) - PostgreSQL migration 000011: endpoint_health_checks + endpoint_health_history tables - 8 REST API endpoints (CRUD, history, acknowledge, summary) - Health Monitor GUI page with summary bar, status table, create modal, auto-refresh - 38 new tests (5 tlsprobe + 11 domain + 10 service + 8 handler + 4 frontend) - All coverage thresholds maintained (service 68%, handler 83%, domain 87%, middleware 63%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
@@ -1018,6 +1018,30 @@ flowchart TB
|
||||
|
||||
This data flow is pull-based and non-blocking. Agents discover at their own pace; the server stores results for later review. There's no pressure to claim or dismiss; operators can leave certificates in "Unmanaged" status indefinitely.
|
||||
|
||||
## Continuous TLS Health Monitoring (M48)
|
||||
|
||||
Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.
|
||||
|
||||
**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated 8th scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
|
||||
|
||||
**State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.
|
||||
|
||||
**API:** 8 endpoints for list (with filters: status, certificate_id, network_scan_target_id, enabled), get, create, update, delete, history (with limit param), acknowledge (incident marking), and summary (aggregate status counts).
|
||||
|
||||
**Auto-Create:** When a deployment job completes with successful verification (M25), the system automatically creates a health check with the deployed certificate's fingerprint as the expected value. Network scan targets can also opt-in to auto-create health checks for discovered endpoints.
|
||||
|
||||
**Configuration:**
|
||||
|
||||
| Env Var | Default | Description |
|
||||
|---|---|---|
|
||||
| `CERTCTL_HEALTH_CHECK_ENABLED` | `false` | Enable/disable the feature |
|
||||
| `CERTCTL_HEALTH_CHECK_INTERVAL` | `60s` | Scheduler tick interval |
|
||||
| `CERTCTL_HEALTH_CHECK_DEFAULT_INTERVAL` | `300s` | Default per-endpoint check interval (5 min) |
|
||||
| `CERTCTL_HEALTH_CHECK_DEFAULT_TIMEOUT` | `5000ms` | TLS connection timeout per probe |
|
||||
| `CERTCTL_HEALTH_CHECK_MAX_CONCURRENT` | `20` | Max concurrent TLS probes |
|
||||
| `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION` | `30 days` | Purge probe history older than this |
|
||||
| `CERTCTL_HEALTH_CHECK_AUTO_CREATE` | `true` | Auto-create checks from deployments |
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
certctl is extensively tested across eight layers with CI-enforced coverage gates that act as regression floors. The goal is high-confidence regression prevention at the service and handler layers (where the most complex business logic lives), combined with integration tests that exercise the full request path from HTTP to database.
|
||||
|
||||
@@ -801,6 +801,53 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor
|
||||
| `/api/v1/network-scan-targets/{id}` | DELETE | Delete |
|
||||
| `/api/v1/network-scan-targets/{id}/scan` | POST | Trigger immediate scan |
|
||||
|
||||
### Continuous TLS Health Monitoring
|
||||
|
||||
<!-- Source: internal/domain/health_check.go, internal/service/health_check.go -->
|
||||
|
||||
Beyond one-time discovery (M18b, M21), the health monitor continuously probes TLS endpoints and tracks certificate freshness. Uses the shared `internal/tlsprobe/` package (same as network scanner) to compare deployed certificate fingerprints against live endpoints, catching silent rollbacks and unauthorized replacements.
|
||||
|
||||
**Status Transitions:**
|
||||
- `Healthy` — endpoint responding, certificate matches expected
|
||||
- `Degraded` — consecutive probe failures reach threshold (default 2)
|
||||
- `Down` — consecutive failures exceed degradation threshold (default 5)
|
||||
- `Cert_Mismatch` — observed cert fingerprint differs from expected (unauthorized replacement)
|
||||
|
||||
**Auto-Create:** When a deployment completes successfully with TLS verification enabled (M25), certctl automatically creates a health check with the deployed certificate's fingerprint as the baseline.
|
||||
|
||||
**Probe History:** Each probe stores: TLS version, cipher suite, response time, cert metadata (subject, issuer, validity), status, and error details. Retained for 30 days (configurable), then purged by the scheduler.
|
||||
|
||||
**Alerts on State Transitions:**
|
||||
- Cert_Mismatch: HIGH severity (catches unauthorized changes)
|
||||
- Down: CRITICAL severity (service broken)
|
||||
- Degraded: WARNING severity (intermittent issues)
|
||||
- Recovery to Healthy: INFO severity (status update)
|
||||
|
||||
**Configuration:**
|
||||
|
||||
| Env Var | Default | Description |
|
||||
|---|---|---|
|
||||
| `CERTCTL_HEALTH_CHECK_ENABLED` | `false` | Enable health monitoring |
|
||||
| `CERTCTL_HEALTH_CHECK_INTERVAL` | `60s` | Scheduler tick interval |
|
||||
| `CERTCTL_HEALTH_CHECK_DEFAULT_INTERVAL` | `300s` | Default per-endpoint check frequency |
|
||||
| `CERTCTL_HEALTH_CHECK_DEFAULT_TIMEOUT` | `5000ms` | TLS connection timeout per probe |
|
||||
| `CERTCTL_HEALTH_CHECK_MAX_CONCURRENT` | `20` | Max concurrent TLS probes |
|
||||
| `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION` | `30 days` | Purge probe history older than this |
|
||||
| `CERTCTL_HEALTH_CHECK_AUTO_CREATE` | `true` | Auto-create checks from deployments |
|
||||
|
||||
**Health Check API:**
|
||||
|
||||
| Endpoint | Method | Description |
|
||||
|---|---|---|
|
||||
| `/api/v1/health-checks` | GET | List with `?status`, `?certificate_id`, `?network_scan_target_id`, `?enabled` filters + pagination |
|
||||
| `/api/v1/health-checks/{id}` | GET | Detail |
|
||||
| `/api/v1/health-checks` | POST | Create manual check (endpoint, expected_fingerprint, check_interval, timeout) |
|
||||
| `/api/v1/health-checks/{id}` | PUT | Update thresholds, interval, or expected fingerprint |
|
||||
| `/api/v1/health-checks/{id}` | DELETE | Delete |
|
||||
| `/api/v1/health-checks/{id}/history` | GET | Probe history with `?limit` param |
|
||||
| `/api/v1/health-checks/{id}/acknowledge` | POST | Mark incident as acknowledged by operator |
|
||||
| `/api/v1/health-checks/summary` | GET | Aggregate counts by status (Healthy, Degraded, Down, Cert_Mismatch) |
|
||||
|
||||
---
|
||||
|
||||
## Ownership and Teams
|
||||
|
||||
Reference in New Issue
Block a user