feat(M48): continuous TLS health monitoring — endpoint state machine, shared tlsprobe, 8 API endpoints, GUI

Adds continuous TLS endpoint health monitoring that closes the deploy→verify→monitor loop.
After M25 verifies a deployment succeeded once, M48 continuously confirms it stays healthy.

Key components:
- Shared `internal/tlsprobe/` package extracted from network scanner for reuse
- Health status state machine: healthy → degraded (2 failures) → down (5 failures),
  plus cert_mismatch when served fingerprint differs from expected
- 8th scheduler loop (60s tick, per-endpoint configurable intervals)
- PostgreSQL migration 000011: endpoint_health_checks + endpoint_health_history tables
- 8 REST API endpoints (CRUD, history, acknowledge, summary)
- Health Monitor GUI page with summary bar, status table, create modal, auto-refresh
- 38 new tests (5 tlsprobe + 11 domain + 10 service + 8 handler + 4 frontend)
- All coverage thresholds maintained (service 68%, handler 83%, domain 87%, middleware 63%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
shankar0123
2026-04-15 21:45:45 -04:00
parent f2e60b93a3
commit 596d86a206
29 changed files with 3540 additions and 30 deletions
+2 -2
View File
@@ -45,11 +45,11 @@ jobs:
run: govulncheck ./...
- name: Race Detection
run: go test -race ./internal/service/... ./internal/api/handler/... ./internal/api/middleware/... ./internal/scheduler/... ./internal/connector/... ./internal/domain/... ./internal/validation/... -count=1 -timeout 300s
run: go test -race ./internal/service/... ./internal/api/handler/... ./internal/api/middleware/... ./internal/scheduler/... ./internal/connector/... ./internal/domain/... ./internal/validation/... ./internal/tlsprobe/... -count=1 -timeout 300s
- name: Go Test with Coverage
run: |
go test ./internal/service/... ./internal/api/handler/... ./internal/api/middleware/... ./internal/integration/... ./internal/connector/issuer/... ./internal/connector/target/... ./internal/connector/notifier/... ./internal/mcp/... ./internal/cli/... ./internal/domain/... ./internal/validation/... -count=1 -cover -coverprofile=coverage.out
go test ./internal/service/... ./internal/api/handler/... ./internal/api/middleware/... ./internal/integration/... ./internal/connector/issuer/... ./internal/connector/target/... ./internal/connector/notifier/... ./internal/mcp/... ./internal/cli/... ./internal/domain/... ./internal/validation/... ./internal/tlsprobe/... -count=1 -cover -coverprofile=coverage.out
- name: Check Coverage Thresholds
run: |
+2 -2
View File
@@ -166,7 +166,7 @@ Built for **platform engineering and DevOps teams** managing 10500+ certifica
**Private keys stay on your servers.** Agents generate ECDSA P-256 keys locally, submit only the CSR. The control plane never touches private keys. After deployment, agents probe the live TLS endpoint and compare SHA-256 fingerprints to confirm the right certificate is actually being served.
**Discovery.** Agents scan filesystems for existing PEM/DER certificates. The network scanner probes TLS endpoints across CIDR ranges without agents. Both feed into a triage workflow — claim, dismiss, or import what you find.
**Discovery.** Agents scan filesystems for existing PEM/DER certificates. The network scanner probes TLS endpoints across CIDR ranges without agents. Continuous TLS health monitoring tracks endpoint status (healthy/degraded/down/cert_mismatch) with configurable thresholds and historical probe data. All discovery modes feed into a triage workflow — claim, dismiss, or import what you find.
**Policy engine.** Certificate profiles constrain key types, max TTL, and EKUs — with crypto policy enforcement that validates every CSR against profile rules before it reaches the issuer. MaxTTL caps are enforced per issuer connector. Approval workflows pause jobs for human review. Ownership tracking routes notifications to the right team. Agent groups match devices by OS, architecture, IP CIDR, and version.
@@ -174,7 +174,7 @@ Built for **platform engineering and DevOps teams** managing 10500+ certifica
**Revocation.** DER-encoded X.509 CRL per issuer, signed by the issuing CA. Embedded OCSP responder. RFC 5280 reason codes. Short-lived certs (TTL < 1 hour) are exempt — expiry is sufficient revocation.
**Audit and observability.** Immutable append-only audit trail records every lifecycle action, every API call, and every approval decision. Prometheus metrics endpoint. Scheduled certificate digest emails.
**Audit and observability.** Immutable append-only audit trail records every lifecycle action, every API call, and every approval decision. Prometheus metrics endpoint. Scheduled certificate digest emails. Continuous endpoint health monitoring with state machine transitions and real-time alerts.
**Notifications.** Slack, Teams, PagerDuty, OpsGenie, SMTP, webhooks. Routed by certificate owner. Daily digest emails with stats and expiring certs.
+382
View File
@@ -62,6 +62,8 @@ tags:
description: Certificate discovery — filesystem scanning by agents and network TLS probing
- name: Network Scan
description: Network scan target management for active TLS certificate discovery
- name: Health Monitoring
description: Continuous TLS endpoint health checks with status tracking and probe history
- name: Digest
description: Scheduled certificate digest email notifications
@@ -2388,6 +2390,256 @@ paths:
"500":
$ref: "#/components/responses/InternalError"
# ─── Health Monitoring ─────────────────────────────────────────────
/api/v1/health-checks:
get:
tags: [Health Monitoring]
summary: List endpoint health checks
description: |
Lists all TLS endpoint health checks with optional filtering by status, certificate, or network scan target.
Includes current status, last probe results, and probe history summary.
operationId: listHealthChecks
parameters:
- name: status
in: query
schema:
type: string
enum: [Healthy, Degraded, Down, CertMismatch]
description: Filter by health status
- name: certificate_id
in: query
schema:
type: string
description: Filter by certificate ID
- name: network_scan_target_id
in: query
schema:
type: string
description: Filter by network scan target ID
- name: enabled
in: query
schema:
type: boolean
description: Filter by enabled/disabled state
- $ref: "#/components/parameters/page"
- $ref: "#/components/parameters/per_page"
responses:
"200":
description: List of health checks
content:
application/json:
schema:
type: object
properties:
data:
type: array
items:
$ref: "#/components/schemas/EndpointHealthCheck"
total:
type: integer
page:
type: integer
per_page:
type: integer
"500":
$ref: "#/components/responses/InternalError"
post:
tags: [Health Monitoring]
summary: Create health check
description: Creates a new manual health check for an endpoint.
operationId: createHealthCheck
requestBody:
required: true
content:
application/json:
schema:
type: object
required: [endpoint, check_interval_seconds]
properties:
endpoint:
type: string
description: "host:port to monitor"
example: "api.example.com:443"
expected_fingerprint:
type: string
description: Expected certificate SHA-256 fingerprint (optional)
check_interval_seconds:
type: integer
minimum: 30
description: Probe frequency in seconds (default 300)
timeout_ms:
type: integer
description: TLS connection timeout in milliseconds
responses:
"201":
description: Health check created
content:
application/json:
schema:
$ref: "#/components/schemas/EndpointHealthCheck"
"400":
$ref: "#/components/responses/BadRequest"
"500":
$ref: "#/components/responses/InternalError"
/api/v1/health-checks/summary:
get:
tags: [Health Monitoring]
summary: Health check summary
description: Returns aggregate status counts for all health checks.
operationId: getHealthCheckSummary
responses:
"200":
description: Health check summary
content:
application/json:
schema:
type: object
properties:
healthy:
type: integer
degraded:
type: integer
down:
type: integer
cert_mismatch:
type: integer
"500":
$ref: "#/components/responses/InternalError"
/api/v1/health-checks/{id}:
get:
tags: [Health Monitoring]
summary: Get health check
operationId: getHealthCheck
parameters:
- $ref: "#/components/parameters/resourceId"
responses:
"200":
description: Health check detail
content:
application/json:
schema:
$ref: "#/components/schemas/EndpointHealthCheck"
"404":
$ref: "#/components/responses/NotFound"
"500":
$ref: "#/components/responses/InternalError"
put:
tags: [Health Monitoring]
summary: Update health check
description: Update thresholds, interval, or expected fingerprint.
operationId: updateHealthCheck
parameters:
- $ref: "#/components/parameters/resourceId"
requestBody:
content:
application/json:
schema:
type: object
properties:
expected_fingerprint:
type: string
check_interval_seconds:
type: integer
timeout_ms:
type: integer
enabled:
type: boolean
responses:
"200":
description: Health check updated
content:
application/json:
schema:
$ref: "#/components/schemas/EndpointHealthCheck"
"400":
$ref: "#/components/responses/BadRequest"
"404":
$ref: "#/components/responses/NotFound"
"500":
$ref: "#/components/responses/InternalError"
delete:
tags: [Health Monitoring]
summary: Delete health check
operationId: deleteHealthCheck
parameters:
- $ref: "#/components/parameters/resourceId"
responses:
"204":
description: Health check deleted
"404":
$ref: "#/components/responses/NotFound"
"500":
$ref: "#/components/responses/InternalError"
/api/v1/health-checks/{id}/history:
get:
tags: [Health Monitoring]
summary: Get probe history
description: Returns historical probe records with status, response times, and errors.
operationId: getHealthCheckHistory
parameters:
- $ref: "#/components/parameters/resourceId"
- name: limit
in: query
schema:
type: integer
default: 100
minimum: 1
maximum: 1000
description: Max number of records to return
responses:
"200":
description: Probe history
content:
application/json:
schema:
type: object
properties:
data:
type: array
items:
$ref: "#/components/schemas/HealthHistoryEntry"
total:
type: integer
"404":
$ref: "#/components/responses/NotFound"
"500":
$ref: "#/components/responses/InternalError"
/api/v1/health-checks/{id}/acknowledge:
post:
tags: [Health Monitoring]
summary: Acknowledge incident
description: Mark a health check incident as acknowledged by the operator.
operationId: acknowledgeHealthCheckIncident
parameters:
- $ref: "#/components/parameters/resourceId"
requestBody:
content:
application/json:
schema:
type: object
properties:
acknowledged_by:
type: string
description: Operator name or ID
responses:
"200":
description: Incident acknowledged
content:
application/json:
schema:
$ref: "#/components/schemas/EndpointHealthCheck"
"404":
$ref: "#/components/responses/NotFound"
"500":
$ref: "#/components/responses/InternalError"
# ─── Digest ────────────────────────────────────────────────────────
/api/v1/digest/preview:
get:
@@ -3342,3 +3594,133 @@ components:
timeout_ms:
type: integer
default: 5000
EndpointHealthCheck:
type: object
properties:
id:
type: string
description: Health check ID
endpoint:
type: string
description: "Target endpoint (host:port)"
example: "api.example.com:443"
certificate_id:
type: string
nullable: true
description: Associated managed certificate ID (if from deployment)
network_scan_target_id:
type: string
nullable: true
description: Associated network scan target ID (if auto-created)
expected_fingerprint:
type: string
nullable: true
description: Expected certificate SHA-256 fingerprint
status:
type: string
enum: [Healthy, Degraded, Down, CertMismatch]
description: Current health status
enabled:
type: boolean
check_interval_seconds:
type: integer
description: Frequency of TLS probes (seconds)
timeout_ms:
type: integer
description: TLS connection timeout (milliseconds)
consecutive_failures:
type: integer
description: Number of consecutive probe failures
last_checked_at:
type: string
format: date-time
nullable: true
description: Timestamp of last probe
last_success_at:
type: string
format: date-time
nullable: true
description: Timestamp of last successful probe
last_failure_at:
type: string
format: date-time
nullable: true
description: Timestamp of last failed probe
last_transition_at:
type: string
format: date-time
nullable: true
description: Timestamp of last status transition
failure_reason:
type: string
nullable: true
description: Reason for last failure
acknowledged:
type: boolean
description: Whether the current status has been acknowledged
acknowledged_by:
type: string
nullable: true
description: Operator name who acknowledged (if applicable)
acknowledged_at:
type: string
format: date-time
nullable: true
created_at:
type: string
format: date-time
updated_at:
type: string
format: date-time
HealthHistoryEntry:
type: object
properties:
id:
type: string
health_check_id:
type: string
status:
type: string
enum: [Healthy, Degraded, Down, CertMismatch]
response_time_ms:
type: integer
nullable: true
description: Time to connect and complete TLS handshake (milliseconds)
observed_fingerprint:
type: string
nullable: true
description: SHA-256 fingerprint of certificate observed on endpoint
tls_version:
type: string
nullable: true
description: TLS version (e.g., TLSv1.3)
cipher_suite:
type: string
nullable: true
description: Cipher suite used in TLS handshake
cert_subject:
type: string
nullable: true
description: Subject DN of observed certificate
cert_issuer:
type: string
nullable: true
description: Issuer DN of observed certificate
cert_not_before:
type: string
format: date-time
nullable: true
cert_not_after:
type: string
format: date-time
nullable: true
failure_reason:
type: string
nullable: true
description: Error message if probe failed
checked_at:
type: string
format: date-time
description: Timestamp of this probe
+29
View File
@@ -259,6 +259,29 @@ func main() {
}
}
// Initialize health check service (M48)
var healthCheckService *service.HealthCheckService
var healthCheckHandler *handler.HealthCheckHandler
if cfg.HealthCheck.Enabled {
healthCheckRepo := postgres.NewHealthCheckRepository(db)
healthCheckService = service.NewHealthCheckService(
healthCheckRepo,
auditService,
logger,
cfg.HealthCheck.MaxConcurrent,
time.Duration(cfg.HealthCheck.DefaultTimeout)*time.Millisecond,
cfg.HealthCheck.HistoryRetention,
cfg.HealthCheck.AutoCreate,
)
healthCheckHandler = handler.NewHealthCheckHandler(healthCheckService)
logger.Info("health check service enabled",
"interval", cfg.HealthCheck.CheckInterval.String(),
"max_concurrent", cfg.HealthCheck.MaxConcurrent)
} else {
// Create a no-op health check handler for route registration
healthCheckHandler = handler.NewHealthCheckHandler(nil)
}
logger.Info("initialized all handlers")
// Create context with cancellation
@@ -289,6 +312,11 @@ func main() {
sched.SetDigestInterval(cfg.Digest.Interval)
logger.Info("digest scheduler enabled", "interval", cfg.Digest.Interval.String())
}
if healthCheckService != nil {
sched.SetHealthCheckService(healthCheckService)
sched.SetHealthCheckInterval(cfg.HealthCheck.CheckInterval)
logger.Info("health check scheduler enabled", "interval", cfg.HealthCheck.CheckInterval.String())
}
// Start scheduler
logger.Info("starting scheduler")
@@ -319,6 +347,7 @@ func main() {
Verification: verificationHandler,
Export: exportHandler,
Digest: *digestHandler,
HealthChecks: healthCheckHandler,
})
// Register EST (RFC 7030) handlers if enabled
if cfg.EST.Enabled {
+24
View File
@@ -1018,6 +1018,30 @@ flowchart TB
This data flow is pull-based and non-blocking. Agents discover at their own pace; the server stores results for later review. There's no pressure to claim or dismiss; operators can leave certificates in "Unmanaged" status indefinitely.
## Continuous TLS Health Monitoring (M48)
Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.
**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated 8th scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
**State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.
**API:** 8 endpoints for list (with filters: status, certificate_id, network_scan_target_id, enabled), get, create, update, delete, history (with limit param), acknowledge (incident marking), and summary (aggregate status counts).
**Auto-Create:** When a deployment job completes with successful verification (M25), the system automatically creates a health check with the deployed certificate's fingerprint as the expected value. Network scan targets can also opt-in to auto-create health checks for discovered endpoints.
**Configuration:**
| Env Var | Default | Description |
|---|---|---|
| `CERTCTL_HEALTH_CHECK_ENABLED` | `false` | Enable/disable the feature |
| `CERTCTL_HEALTH_CHECK_INTERVAL` | `60s` | Scheduler tick interval |
| `CERTCTL_HEALTH_CHECK_DEFAULT_INTERVAL` | `300s` | Default per-endpoint check interval (5 min) |
| `CERTCTL_HEALTH_CHECK_DEFAULT_TIMEOUT` | `5000ms` | TLS connection timeout per probe |
| `CERTCTL_HEALTH_CHECK_MAX_CONCURRENT` | `20` | Max concurrent TLS probes |
| `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION` | `30 days` | Purge probe history older than this |
| `CERTCTL_HEALTH_CHECK_AUTO_CREATE` | `true` | Auto-create checks from deployments |
## Testing Strategy
certctl is extensively tested across eight layers with CI-enforced coverage gates that act as regression floors. The goal is high-confidence regression prevention at the service and handler layers (where the most complex business logic lives), combined with integration tests that exercise the full request path from HTTP to database.
+47
View File
@@ -801,6 +801,53 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor
| `/api/v1/network-scan-targets/{id}` | DELETE | Delete |
| `/api/v1/network-scan-targets/{id}/scan` | POST | Trigger immediate scan |
### Continuous TLS Health Monitoring
<!-- Source: internal/domain/health_check.go, internal/service/health_check.go -->
Beyond one-time discovery (M18b, M21), the health monitor continuously probes TLS endpoints and tracks certificate freshness. Uses the shared `internal/tlsprobe/` package (same as network scanner) to compare deployed certificate fingerprints against live endpoints, catching silent rollbacks and unauthorized replacements.
**Status Transitions:**
- `Healthy` — endpoint responding, certificate matches expected
- `Degraded` — consecutive probe failures reach threshold (default 2)
- `Down` — consecutive failures exceed degradation threshold (default 5)
- `Cert_Mismatch` — observed cert fingerprint differs from expected (unauthorized replacement)
**Auto-Create:** When a deployment completes successfully with TLS verification enabled (M25), certctl automatically creates a health check with the deployed certificate's fingerprint as the baseline.
**Probe History:** Each probe stores: TLS version, cipher suite, response time, cert metadata (subject, issuer, validity), status, and error details. Retained for 30 days (configurable), then purged by the scheduler.
**Alerts on State Transitions:**
- Cert_Mismatch: HIGH severity (catches unauthorized changes)
- Down: CRITICAL severity (service broken)
- Degraded: WARNING severity (intermittent issues)
- Recovery to Healthy: INFO severity (status update)
**Configuration:**
| Env Var | Default | Description |
|---|---|---|
| `CERTCTL_HEALTH_CHECK_ENABLED` | `false` | Enable health monitoring |
| `CERTCTL_HEALTH_CHECK_INTERVAL` | `60s` | Scheduler tick interval |
| `CERTCTL_HEALTH_CHECK_DEFAULT_INTERVAL` | `300s` | Default per-endpoint check frequency |
| `CERTCTL_HEALTH_CHECK_DEFAULT_TIMEOUT` | `5000ms` | TLS connection timeout per probe |
| `CERTCTL_HEALTH_CHECK_MAX_CONCURRENT` | `20` | Max concurrent TLS probes |
| `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION` | `30 days` | Purge probe history older than this |
| `CERTCTL_HEALTH_CHECK_AUTO_CREATE` | `true` | Auto-create checks from deployments |
**Health Check API:**
| Endpoint | Method | Description |
|---|---|---|
| `/api/v1/health-checks` | GET | List with `?status`, `?certificate_id`, `?network_scan_target_id`, `?enabled` filters + pagination |
| `/api/v1/health-checks/{id}` | GET | Detail |
| `/api/v1/health-checks` | POST | Create manual check (endpoint, expected_fingerprint, check_interval, timeout) |
| `/api/v1/health-checks/{id}` | PUT | Update thresholds, interval, or expected fingerprint |
| `/api/v1/health-checks/{id}` | DELETE | Delete |
| `/api/v1/health-checks/{id}/history` | GET | Probe history with `?limit` param |
| `/api/v1/health-checks/{id}/acknowledge` | POST | Mark incident as acknowledged by operator |
| `/api/v1/health-checks/summary` | GET | Aggregate counts by status (Healthy, Degraded, Down, Cert_Mismatch) |
---
## Ownership and Teams
+308
View File
@@ -0,0 +1,308 @@
package handler
import (
"context"
"encoding/json"
"fmt"
"net/http"
"strconv"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/repository"
)
// HealthCheckServicer defines the interface used by the health check handler.
type HealthCheckServicer interface {
Create(ctx context.Context, check *domain.EndpointHealthCheck) error
Get(ctx context.Context, id string) (*domain.EndpointHealthCheck, error)
Update(ctx context.Context, check *domain.EndpointHealthCheck) error
Delete(ctx context.Context, id string) error
List(ctx context.Context, filter *repository.HealthCheckFilter) ([]*domain.EndpointHealthCheck, int, error)
GetHistory(ctx context.Context, healthCheckID string, limit int) ([]*domain.HealthHistoryEntry, error)
AcknowledgeIncident(ctx context.Context, id string, actor string) error
GetSummary(ctx context.Context) (*domain.HealthCheckSummary, error)
}
// HealthCheckHandler handles HTTP requests for TLS health monitoring.
type HealthCheckHandler struct {
service HealthCheckServicer
}
// NewHealthCheckHandler creates a new health check handler.
func NewHealthCheckHandler(service HealthCheckServicer) *HealthCheckHandler {
return &HealthCheckHandler{service: service}
}
// ListHealthChecks handles GET /api/v1/health-checks
func (h *HealthCheckHandler) ListHealthChecks(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
query := r.URL.Query()
status := query.Get("status")
certificateID := query.Get("certificate_id")
networkScanTargetID := query.Get("network_scan_target_id")
enabledStr := query.Get("enabled")
page := parseIntDefault(query.Get("page"), 1)
perPage := parseIntDefault(query.Get("per_page"), 50)
if perPage > 500 {
perPage = 50
}
// Parse enabled flag if provided
var enabledFilter *bool
if enabledStr != "" {
enabled := enabledStr == "true"
enabledFilter = &enabled
}
filter := &repository.HealthCheckFilter{
Status: status,
CertificateID: certificateID,
NetworkScanTargetID: networkScanTargetID,
Enabled: enabledFilter,
Page: page,
PerPage: perPage,
}
checks, total, err := h.service.List(r.Context(), filter)
if err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to list health checks: %v", err))
return
}
if checks == nil {
checks = make([]*domain.EndpointHealthCheck, 0)
}
JSON(w, http.StatusOK, PagedResponse{
Data: checks,
Total: int64(total),
Page: page,
PerPage: perPage,
})
}
// GetHealthCheck handles GET /api/v1/health-checks/{id}
func (h *HealthCheckHandler) GetHealthCheck(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
id := r.PathValue("id")
if id == "" {
Error(w, http.StatusBadRequest, "health check ID is required")
return
}
check, err := h.service.Get(r.Context(), id)
if err != nil {
Error(w, http.StatusNotFound, fmt.Sprintf("health check not found: %v", err))
return
}
JSON(w, http.StatusOK, check)
}
// CreateHealthCheck handles POST /api/v1/health-checks
func (h *HealthCheckHandler) CreateHealthCheck(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
var check domain.EndpointHealthCheck
if err := json.NewDecoder(r.Body).Decode(&check); err != nil {
Error(w, http.StatusBadRequest, fmt.Sprintf("invalid request body: %v", err))
return
}
if check.Endpoint == "" {
Error(w, http.StatusBadRequest, "endpoint is required")
return
}
// Set defaults
if check.CheckIntervalSecs <= 0 {
check.CheckIntervalSecs = 300
}
if check.DegradedThreshold <= 0 {
check.DegradedThreshold = 2
}
if check.DownThreshold <= 0 {
check.DownThreshold = 5
}
if check.Status == "" {
check.Status = domain.HealthStatusUnknown
}
if err := h.service.Create(r.Context(), &check); err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to create health check: %v", err))
return
}
JSON(w, http.StatusCreated, check)
}
// UpdateHealthCheck handles PUT /api/v1/health-checks/{id}
func (h *HealthCheckHandler) UpdateHealthCheck(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPut {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
id := r.PathValue("id")
if id == "" {
Error(w, http.StatusBadRequest, "health check ID is required")
return
}
// Get existing check
existing, err := h.service.Get(r.Context(), id)
if err != nil {
Error(w, http.StatusNotFound, fmt.Sprintf("health check not found: %v", err))
return
}
var updates domain.EndpointHealthCheck
if err := json.NewDecoder(r.Body).Decode(&updates); err != nil {
Error(w, http.StatusBadRequest, fmt.Sprintf("invalid request body: %v", err))
return
}
// Merge updates (only update provided fields)
if updates.Endpoint != "" {
existing.Endpoint = updates.Endpoint
}
if updates.ExpectedFingerprint != "" {
existing.ExpectedFingerprint = updates.ExpectedFingerprint
}
if updates.CheckIntervalSecs > 0 {
existing.CheckIntervalSecs = updates.CheckIntervalSecs
}
if updates.DegradedThreshold > 0 {
existing.DegradedThreshold = updates.DegradedThreshold
}
if updates.DownThreshold > 0 {
existing.DownThreshold = updates.DownThreshold
}
existing.Enabled = updates.Enabled
if err := h.service.Update(r.Context(), existing); err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to update health check: %v", err))
return
}
JSON(w, http.StatusOK, existing)
}
// DeleteHealthCheck handles DELETE /api/v1/health-checks/{id}
func (h *HealthCheckHandler) DeleteHealthCheck(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodDelete {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
id := r.PathValue("id")
if id == "" {
Error(w, http.StatusBadRequest, "health check ID is required")
return
}
if err := h.service.Delete(r.Context(), id); err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to delete health check: %v", err))
return
}
w.WriteHeader(http.StatusNoContent)
}
// GetHealthCheckHistory handles GET /api/v1/health-checks/{id}/history
func (h *HealthCheckHandler) GetHealthCheckHistory(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
id := r.PathValue("id")
if id == "" {
Error(w, http.StatusBadRequest, "health check ID is required")
return
}
limitStr := r.URL.Query().Get("limit")
limit := 100
if limitStr != "" {
if l, err := strconv.Atoi(limitStr); err == nil && l > 0 {
limit = l
}
}
if limit > 1000 {
limit = 1000
}
history, err := h.service.GetHistory(r.Context(), id, limit)
if err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to get health check history: %v", err))
return
}
if history == nil {
history = make([]*domain.HealthHistoryEntry, 0)
}
JSON(w, http.StatusOK, history)
}
// AcknowledgeHealthCheck handles POST /api/v1/health-checks/{id}/acknowledge
func (h *HealthCheckHandler) AcknowledgeHealthCheck(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
id := r.PathValue("id")
if id == "" {
Error(w, http.StatusBadRequest, "health check ID is required")
return
}
var req struct {
Actor string `json:"actor,omitempty"`
}
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
Error(w, http.StatusBadRequest, fmt.Sprintf("invalid request body: %v", err))
return
}
if req.Actor == "" {
req.Actor = "unknown"
}
if err := h.service.AcknowledgeIncident(r.Context(), id, req.Actor); err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to acknowledge health check: %v", err))
return
}
w.WriteHeader(http.StatusNoContent)
}
// GetHealthCheckSummary handles GET /api/v1/health-checks/summary
// This route must be registered BEFORE the /{id} routes
func (h *HealthCheckHandler) GetHealthCheckSummary(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
summary, err := h.service.GetSummary(r.Context())
if err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to get health check summary: %v", err))
return
}
JSON(w, http.StatusOK, summary)
}
@@ -0,0 +1,305 @@
package handler
import (
"bytes"
"context"
"encoding/json"
"errors"
"net/http"
"net/http/httptest"
"testing"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/repository"
)
// mockHealthCheckSvc implements HealthCheckServicer for testing.
type mockHealthCheckSvc struct {
createErr error
getErr error
updateErr error
deleteErr error
listErr error
getHistoryErr error
acknowledgeErr error
getSummaryErr error
checks map[string]*domain.EndpointHealthCheck
summary *domain.HealthCheckSummary
}
func newMockHealthCheckSvc() *mockHealthCheckSvc {
return &mockHealthCheckSvc{
checks: make(map[string]*domain.EndpointHealthCheck),
summary: &domain.HealthCheckSummary{
Healthy: 1,
Degraded: 0,
Down: 0,
CertMismatch: 0,
Unknown: 0,
},
}
}
func (m *mockHealthCheckSvc) Create(ctx context.Context, check *domain.EndpointHealthCheck) error {
if m.createErr != nil {
return m.createErr
}
check.ID = "hc-created-1"
m.checks[check.ID] = check
return nil
}
func (m *mockHealthCheckSvc) Get(ctx context.Context, id string) (*domain.EndpointHealthCheck, error) {
if m.getErr != nil {
return nil, m.getErr
}
if check, ok := m.checks[id]; ok {
return check, nil
}
return nil, errors.New("not found")
}
func (m *mockHealthCheckSvc) Update(ctx context.Context, check *domain.EndpointHealthCheck) error {
if m.updateErr != nil {
return m.updateErr
}
m.checks[check.ID] = check
return nil
}
func (m *mockHealthCheckSvc) Delete(ctx context.Context, id string) error {
if m.deleteErr != nil {
return m.deleteErr
}
delete(m.checks, id)
return nil
}
func (m *mockHealthCheckSvc) List(ctx context.Context, filter *repository.HealthCheckFilter) ([]*domain.EndpointHealthCheck, int, error) {
if m.listErr != nil {
return nil, 0, m.listErr
}
checks := make([]*domain.EndpointHealthCheck, 0, len(m.checks))
for _, check := range m.checks {
checks = append(checks, check)
}
return checks, len(checks), nil
}
func (m *mockHealthCheckSvc) GetHistory(ctx context.Context, healthCheckID string, limit int) ([]*domain.HealthHistoryEntry, error) {
if m.getHistoryErr != nil {
return nil, m.getHistoryErr
}
return make([]*domain.HealthHistoryEntry, 0), nil
}
func (m *mockHealthCheckSvc) AcknowledgeIncident(ctx context.Context, id string, actor string) error {
if m.acknowledgeErr != nil {
return m.acknowledgeErr
}
if check, ok := m.checks[id]; ok {
check.Acknowledged = true
check.AcknowledgedBy = actor
}
return nil
}
func (m *mockHealthCheckSvc) GetSummary(ctx context.Context) (*domain.HealthCheckSummary, error) {
if m.getSummaryErr != nil {
return nil, m.getSummaryErr
}
return m.summary, nil
}
// Tests
func TestListHealthChecks_Success(t *testing.T) {
svc := newMockHealthCheckSvc()
svc.checks["hc-1"] = &domain.EndpointHealthCheck{
ID: "hc-1",
Endpoint: "api.example.com:443",
Status: domain.HealthStatusHealthy,
}
handler := NewHealthCheckHandler(svc)
req := httptest.NewRequest("GET", "/api/v1/health-checks", nil)
w := httptest.NewRecorder()
handler.ListHealthChecks(w, req)
if w.Code != http.StatusOK {
t.Errorf("Expected status 200, got %d", w.Code)
}
var resp PagedResponse
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("Failed to decode response: %v", err)
}
if resp.Total != 1 {
t.Errorf("Expected 1 health check, got %d", resp.Total)
}
}
func TestListHealthChecks_MethodNotAllowed(t *testing.T) {
handler := NewHealthCheckHandler(newMockHealthCheckSvc())
req := httptest.NewRequest("POST", "/api/v1/health-checks", nil)
w := httptest.NewRecorder()
handler.ListHealthChecks(w, req)
if w.Code != http.StatusMethodNotAllowed {
t.Errorf("Expected status 405, got %d", w.Code)
}
}
func TestGetHealthCheck_Success(t *testing.T) {
svc := newMockHealthCheckSvc()
check := &domain.EndpointHealthCheck{
ID: "hc-1",
Endpoint: "api.example.com:443",
Status: domain.HealthStatusHealthy,
}
svc.checks["hc-1"] = check
handler := NewHealthCheckHandler(svc)
req := httptest.NewRequest("GET", "/api/v1/health-checks/hc-1", nil)
req.SetPathValue("id", "hc-1")
w := httptest.NewRecorder()
handler.GetHealthCheck(w, req)
if w.Code != http.StatusOK {
t.Errorf("Expected status 200, got %d", w.Code)
}
var resp domain.EndpointHealthCheck
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("Failed to decode response: %v", err)
}
if resp.ID != "hc-1" {
t.Errorf("Expected ID hc-1, got %s", resp.ID)
}
}
func TestGetHealthCheck_NotFound(t *testing.T) {
handler := NewHealthCheckHandler(newMockHealthCheckSvc())
req := httptest.NewRequest("GET", "/api/v1/health-checks/nonexistent", nil)
req.SetPathValue("id", "nonexistent")
w := httptest.NewRecorder()
handler.GetHealthCheck(w, req)
if w.Code != http.StatusNotFound {
t.Errorf("Expected status 404, got %d", w.Code)
}
}
func TestCreateHealthCheck_Success(t *testing.T) {
svc := newMockHealthCheckSvc()
handler := NewHealthCheckHandler(svc)
check := domain.EndpointHealthCheck{
Endpoint: "web.example.com:443",
Enabled: true,
}
body, _ := json.Marshal(check)
req := httptest.NewRequest("POST", "/api/v1/health-checks", bytes.NewReader(body))
w := httptest.NewRecorder()
handler.CreateHealthCheck(w, req)
if w.Code != http.StatusCreated {
t.Errorf("Expected status 201, got %d", w.Code)
}
var resp domain.EndpointHealthCheck
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("Failed to decode response: %v", err)
}
if resp.Endpoint != "web.example.com:443" {
t.Errorf("Expected endpoint web.example.com:443, got %s", resp.Endpoint)
}
}
func TestDeleteHealthCheck_Success(t *testing.T) {
svc := newMockHealthCheckSvc()
svc.checks["hc-1"] = &domain.EndpointHealthCheck{
ID: "hc-1",
Endpoint: "api.example.com:443",
}
handler := NewHealthCheckHandler(svc)
req := httptest.NewRequest("DELETE", "/api/v1/health-checks/hc-1", nil)
req.SetPathValue("id", "hc-1")
w := httptest.NewRecorder()
handler.DeleteHealthCheck(w, req)
if w.Code != http.StatusNoContent {
t.Errorf("Expected status 204, got %d", w.Code)
}
if _, ok := svc.checks["hc-1"]; ok {
t.Fatal("Expected check to be deleted")
}
}
func TestAcknowledgeHealthCheck_Success(t *testing.T) {
svc := newMockHealthCheckSvc()
svc.checks["hc-1"] = &domain.EndpointHealthCheck{
ID: "hc-1",
Endpoint: "api.example.com:443",
Status: domain.HealthStatusDown,
}
handler := NewHealthCheckHandler(svc)
req := httptest.NewRequest("POST", "/api/v1/health-checks/hc-1/acknowledge", bytes.NewReader([]byte(`{"actor":"user@example.com"}`)))
req.SetPathValue("id", "hc-1")
w := httptest.NewRecorder()
handler.AcknowledgeHealthCheck(w, req)
if w.Code != http.StatusNoContent {
t.Errorf("Expected status 204, got %d", w.Code)
}
if !svc.checks["hc-1"].Acknowledged {
t.Fatal("Expected check to be acknowledged")
}
}
func TestGetHealthCheckSummary_Success(t *testing.T) {
svc := newMockHealthCheckSvc()
svc.summary = &domain.HealthCheckSummary{
Healthy: 3,
Degraded: 1,
Down: 0,
CertMismatch: 0,
Unknown: 1,
}
handler := NewHealthCheckHandler(svc)
req := httptest.NewRequest("GET", "/api/v1/health-checks/summary", nil)
w := httptest.NewRecorder()
handler.GetHealthCheckSummary(w, req)
if w.Code != http.StatusOK {
t.Errorf("Expected status 200, got %d", w.Code)
}
var resp domain.HealthCheckSummary
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("Failed to decode response: %v", err)
}
if resp.Healthy != 3 {
t.Errorf("Expected 3 healthy checks, got %d", resp.Healthy)
}
}
+12
View File
@@ -65,6 +65,7 @@ type HandlerRegistry struct {
Verification handler.VerificationHandler
Export handler.ExportHandler
Digest handler.DigestHandler
HealthChecks *handler.HealthCheckHandler
}
// RegisterHandlers sets up all API routes with their handlers.
@@ -226,6 +227,17 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
// Digest routes: /api/v1/digest
r.Register("GET /api/v1/digest/preview", http.HandlerFunc(reg.Digest.PreviewDigest))
r.Register("POST /api/v1/digest/send", http.HandlerFunc(reg.Digest.SendDigest))
// Health check routes: /api/v1/health-checks
// Summary endpoint must be registered before {id} routes
r.Register("GET /api/v1/health-checks/summary", http.HandlerFunc(reg.HealthChecks.GetHealthCheckSummary))
r.Register("GET /api/v1/health-checks", http.HandlerFunc(reg.HealthChecks.ListHealthChecks))
r.Register("POST /api/v1/health-checks", http.HandlerFunc(reg.HealthChecks.CreateHealthCheck))
r.Register("GET /api/v1/health-checks/{id}", http.HandlerFunc(reg.HealthChecks.GetHealthCheck))
r.Register("PUT /api/v1/health-checks/{id}", http.HandlerFunc(reg.HealthChecks.UpdateHealthCheck))
r.Register("DELETE /api/v1/health-checks/{id}", http.HandlerFunc(reg.HealthChecks.DeleteHealthCheck))
r.Register("GET /api/v1/health-checks/{id}/history", http.HandlerFunc(reg.HealthChecks.GetHealthCheckHistory))
r.Register("POST /api/v1/health-checks/{id}/acknowledge", http.HandlerFunc(reg.HealthChecks.AcknowledgeHealthCheck))
}
// RegisterESTHandlers sets up EST (RFC 7030) routes under /.well-known/est/.
+50
View File
@@ -32,6 +32,7 @@ type Config struct {
GoogleCAS GoogleCASConfig
AWSACMPCA AWSACMPCAConfig
Digest DigestConfig
HealthCheck HealthCheckConfig
Encryption EncryptionConfig
}
@@ -319,6 +320,46 @@ type DigestConfig struct {
Recipients []string
}
// HealthCheckConfig contains configuration for continuous TLS health monitoring (M48).
type HealthCheckConfig struct {
// Enabled controls whether health checks are enabled.
// Default: false.
// Setting: CERTCTL_HEALTH_CHECK_ENABLED environment variable.
Enabled bool
// CheckInterval is the main scheduler loop interval for polling due checks.
// Default: 60 seconds. Each endpoint has its own check_interval_seconds.
// Setting: CERTCTL_HEALTH_CHECK_INTERVAL environment variable.
CheckInterval time.Duration
// DefaultInterval is the default probe interval in seconds for each endpoint (per-endpoint basis).
// Default: 300 seconds (5 minutes).
// Setting: CERTCTL_HEALTH_CHECK_DEFAULT_INTERVAL environment variable.
DefaultInterval int
// DefaultTimeout is the default TLS connection timeout in milliseconds.
// Default: 5000 milliseconds (5 seconds).
// Setting: CERTCTL_HEALTH_CHECK_DEFAULT_TIMEOUT environment variable.
DefaultTimeout int
// MaxConcurrent is the maximum number of concurrent TLS probes.
// Default: 20.
// Setting: CERTCTL_HEALTH_CHECK_MAX_CONCURRENT environment variable.
MaxConcurrent int
// HistoryRetention controls how long probe history records are kept.
// Default: 30 days. Older records are purged by the scheduler.
// Setting: CERTCTL_HEALTH_CHECK_HISTORY_RETENTION environment variable.
HistoryRetention time.Duration
// AutoCreate controls whether health checks are auto-created when:
// - A deployment job completes with verification success
// - A network scan target has health_check_enabled=true
// Default: true.
// Setting: CERTCTL_HEALTH_CHECK_AUTO_CREATE environment variable.
AutoCreate bool
}
// ACMEConfig contains ACME issuer connector configuration.
type ACMEConfig struct {
// DirectoryURL is the ACME directory URL for certificate issuance.
@@ -678,6 +719,15 @@ func Load() (*Config, error) {
Interval: getEnvDuration("CERTCTL_DIGEST_INTERVAL", 24*time.Hour),
Recipients: getEnvList("CERTCTL_DIGEST_RECIPIENTS", nil),
},
HealthCheck: HealthCheckConfig{
Enabled: getEnvBool("CERTCTL_HEALTH_CHECK_ENABLED", false),
CheckInterval: getEnvDuration("CERTCTL_HEALTH_CHECK_INTERVAL", 60*time.Second),
DefaultInterval: getEnvInt("CERTCTL_HEALTH_CHECK_DEFAULT_INTERVAL", 300),
DefaultTimeout: getEnvInt("CERTCTL_HEALTH_CHECK_DEFAULT_TIMEOUT", 5000),
MaxConcurrent: getEnvInt("CERTCTL_HEALTH_CHECK_MAX_CONCURRENT", 20),
HistoryRetention: getEnvDuration("CERTCTL_HEALTH_CHECK_HISTORY_RETENTION", 30*24*time.Hour),
AutoCreate: getEnvBool("CERTCTL_HEALTH_CHECK_AUTO_CREATE", true),
},
Encryption: EncryptionConfig{
ConfigEncryptionKey: getEnv("CERTCTL_CONFIG_ENCRYPTION_KEY", ""),
},
+109
View File
@@ -0,0 +1,109 @@
package domain
import "time"
// HealthStatus represents the current health state of a monitored endpoint.
type HealthStatus string
const (
HealthStatusHealthy HealthStatus = "healthy"
HealthStatusDegraded HealthStatus = "degraded"
HealthStatusDown HealthStatus = "down"
HealthStatusCertMismatch HealthStatus = "cert_mismatch"
HealthStatusUnknown HealthStatus = "unknown"
)
// IsValidHealthStatus checks if a health status string is valid.
func IsValidHealthStatus(s string) bool {
switch HealthStatus(s) {
case HealthStatusHealthy, HealthStatusDegraded, HealthStatusDown, HealthStatusCertMismatch, HealthStatusUnknown:
return true
}
return false
}
// EndpointHealthCheck represents a monitored TLS endpoint.
type EndpointHealthCheck struct {
ID string `json:"id"`
Endpoint string `json:"endpoint"`
CertificateID *string `json:"certificate_id,omitempty"`
NetworkScanTargetID *string `json:"network_scan_target_id,omitempty"`
ExpectedFingerprint string `json:"expected_fingerprint"`
ObservedFingerprint string `json:"observed_fingerprint"`
Status HealthStatus `json:"status"`
ConsecutiveFailures int `json:"consecutive_failures"`
ResponseTimeMs int `json:"response_time_ms"`
TLSVersion string `json:"tls_version"`
CipherSuite string `json:"cipher_suite"`
CertSubject string `json:"cert_subject"`
CertIssuer string `json:"cert_issuer"`
CertExpiry *time.Time `json:"cert_expiry,omitempty"`
LastCheckedAt *time.Time `json:"last_checked_at,omitempty"`
LastSuccessAt *time.Time `json:"last_success_at,omitempty"`
LastFailureAt *time.Time `json:"last_failure_at,omitempty"`
LastTransitionAt *time.Time `json:"last_transition_at,omitempty"`
FailureReason string `json:"failure_reason"`
DegradedThreshold int `json:"degraded_threshold"`
DownThreshold int `json:"down_threshold"`
CheckIntervalSecs int `json:"check_interval_seconds"`
Enabled bool `json:"enabled"`
Acknowledged bool `json:"acknowledged"`
AcknowledgedBy string `json:"acknowledged_by,omitempty"`
AcknowledgedAt *time.Time `json:"acknowledged_at,omitempty"`
CreatedAt time.Time `json:"created_at"`
UpdatedAt time.Time `json:"updated_at"`
}
// TransitionStatus computes the new health status based on the probe result.
// Returns the new status and whether a transition occurred.
func (h *EndpointHealthCheck) TransitionStatus(probeSuccess bool, observedFingerprint string) (HealthStatus, bool) {
oldStatus := h.Status
var newStatus HealthStatus
if probeSuccess {
if h.ExpectedFingerprint != "" && observedFingerprint != h.ExpectedFingerprint {
newStatus = HealthStatusCertMismatch
} else {
newStatus = HealthStatusHealthy
}
} else {
// Increment failures for next calculation (caller will update h.ConsecutiveFailures)
failures := h.ConsecutiveFailures + 1
if failures >= h.DownThreshold {
newStatus = HealthStatusDown
} else if failures >= h.DegradedThreshold {
newStatus = HealthStatusDegraded
} else {
// Keep current status during initial failures before threshold
// Unless we were in an error state, transition to degraded after first failure
if h.Status == HealthStatusUnknown || h.Status == HealthStatusHealthy {
newStatus = HealthStatusHealthy // still considered healthy during grace period
} else {
newStatus = h.Status
}
}
}
return newStatus, newStatus != oldStatus
}
// HealthHistoryEntry represents a single probe record.
type HealthHistoryEntry struct {
ID string `json:"id"`
HealthCheckID string `json:"health_check_id"`
Status string `json:"status"`
ResponseTimeMs int `json:"response_time_ms"`
Fingerprint string `json:"fingerprint"`
FailureReason string `json:"failure_reason"`
CheckedAt time.Time `json:"checked_at"`
}
// HealthCheckSummary contains aggregate counts by status.
type HealthCheckSummary struct {
Healthy int `json:"healthy"`
Degraded int `json:"degraded"`
Down int `json:"down"`
CertMismatch int `json:"cert_mismatch"`
Unknown int `json:"unknown"`
Total int `json:"total"`
}
+237
View File
@@ -0,0 +1,237 @@
package domain
import (
"testing"
"time"
)
func TestIsValidHealthStatus(t *testing.T) {
tests := []struct {
status string
valid bool
}{
{"healthy", true},
{"degraded", true},
{"down", true},
{"cert_mismatch", true},
{"unknown", true},
{"invalid", false},
{"", false},
{"HEALTHY", false},
}
for _, tt := range tests {
t.Run(tt.status, func(t *testing.T) {
result := IsValidHealthStatus(tt.status)
if result != tt.valid {
t.Errorf("IsValidHealthStatus(%q) = %v, want %v", tt.status, result, tt.valid)
}
})
}
}
func TestTransitionStatus_HealthyProbe(t *testing.T) {
h := &EndpointHealthCheck{
Status: HealthStatusUnknown,
ConsecutiveFailures: 0,
DegradedThreshold: 2,
DownThreshold: 5,
ExpectedFingerprint: "abc123",
}
newStatus, transitioned := h.TransitionStatus(true, "abc123")
if newStatus != HealthStatusHealthy {
t.Errorf("expected HealthStatusHealthy, got %s", newStatus)
}
if !transitioned {
t.Errorf("expected transition=true, got false")
}
}
func TestTransitionStatus_CertMismatch(t *testing.T) {
h := &EndpointHealthCheck{
Status: HealthStatusHealthy,
ConsecutiveFailures: 0,
DegradedThreshold: 2,
DownThreshold: 5,
ExpectedFingerprint: "abc123",
}
newStatus, transitioned := h.TransitionStatus(true, "xyz789")
if newStatus != HealthStatusCertMismatch {
t.Errorf("expected HealthStatusCertMismatch, got %s", newStatus)
}
if !transitioned {
t.Errorf("expected transition=true, got false")
}
}
func TestTransitionStatus_FirstFailure_BelowThreshold(t *testing.T) {
h := &EndpointHealthCheck{
Status: HealthStatusHealthy,
ConsecutiveFailures: 0,
DegradedThreshold: 2,
DownThreshold: 5,
}
newStatus, transitioned := h.TransitionStatus(false, "")
// At 1 failure with degraded threshold 2, still healthy
if newStatus != HealthStatusHealthy {
t.Errorf("expected HealthStatusHealthy (grace period), got %s", newStatus)
}
if transitioned {
t.Errorf("expected transition=false (still healthy), got true")
}
}
func TestTransitionStatus_DegradedThreshold(t *testing.T) {
h := &EndpointHealthCheck{
Status: HealthStatusHealthy,
ConsecutiveFailures: 1, // Now will be 2 after increment
DegradedThreshold: 2,
DownThreshold: 5,
}
newStatus, transitioned := h.TransitionStatus(false, "")
if newStatus != HealthStatusDegraded {
t.Errorf("expected HealthStatusDegraded, got %s", newStatus)
}
if !transitioned {
t.Errorf("expected transition=true, got false")
}
}
func TestTransitionStatus_DownThreshold(t *testing.T) {
h := &EndpointHealthCheck{
Status: HealthStatusDegraded,
ConsecutiveFailures: 4, // Now will be 5 after increment
DegradedThreshold: 2,
DownThreshold: 5,
}
newStatus, transitioned := h.TransitionStatus(false, "")
if newStatus != HealthStatusDown {
t.Errorf("expected HealthStatusDown, got %s", newStatus)
}
if !transitioned {
t.Errorf("expected transition=true, got false")
}
}
func TestTransitionStatus_Recovery(t *testing.T) {
h := &EndpointHealthCheck{
Status: HealthStatusDown,
ConsecutiveFailures: 10,
DegradedThreshold: 2,
DownThreshold: 5,
ExpectedFingerprint: "abc123",
}
newStatus, transitioned := h.TransitionStatus(true, "abc123")
if newStatus != HealthStatusHealthy {
t.Errorf("expected HealthStatusHealthy (recovery), got %s", newStatus)
}
if !transitioned {
t.Errorf("expected transition=true (from down to healthy), got false")
}
}
func TestTransitionStatus_NoFingerprint(t *testing.T) {
h := &EndpointHealthCheck{
Status: HealthStatusHealthy,
ConsecutiveFailures: 0,
DegradedThreshold: 2,
DownThreshold: 5,
ExpectedFingerprint: "", // No expected fingerprint
}
newStatus, transitioned := h.TransitionStatus(true, "anything")
// Success with no expected fingerprint should always be healthy
if newStatus != HealthStatusHealthy {
t.Errorf("expected HealthStatusHealthy (no fingerprint check), got %s", newStatus)
}
if transitioned {
t.Errorf("expected transition=false (already healthy), got true")
}
}
func TestTransitionStatus_UnknownToHealthy(t *testing.T) {
h := &EndpointHealthCheck{
Status: HealthStatusUnknown,
ConsecutiveFailures: 0,
DegradedThreshold: 2,
DownThreshold: 5,
}
newStatus, transitioned := h.TransitionStatus(true, "")
if newStatus != HealthStatusHealthy {
t.Errorf("expected HealthStatusHealthy, got %s", newStatus)
}
if !transitioned {
t.Errorf("expected transition=true (from unknown to healthy), got false")
}
}
func TestTransitionStatus_NoTransitionWhenSame(t *testing.T) {
h := &EndpointHealthCheck{
Status: HealthStatusHealthy,
ConsecutiveFailures: 0,
DegradedThreshold: 2,
DownThreshold: 5,
}
newStatus, transitioned := h.TransitionStatus(true, "")
if newStatus != HealthStatusHealthy {
t.Errorf("expected HealthStatusHealthy, got %s", newStatus)
}
if transitioned {
t.Errorf("expected transition=false (already healthy), got true")
}
}
func TestHealthCheckSummary(t *testing.T) {
summary := &HealthCheckSummary{
Healthy: 5,
Degraded: 2,
Down: 1,
CertMismatch: 1,
Unknown: 0,
Total: 9,
}
if summary.Total != 9 {
t.Errorf("expected Total=9, got %d", summary.Total)
}
if summary.Healthy != 5 {
t.Errorf("expected Healthy=5, got %d", summary.Healthy)
}
}
func TestHealthHistoryEntry(t *testing.T) {
now := time.Now()
entry := &HealthHistoryEntry{
ID: "hh-test-123",
HealthCheckID: "hc-test-123",
Status: "healthy",
ResponseTimeMs: 42,
Fingerprint: "abc123def456",
FailureReason: "",
CheckedAt: now,
}
if entry.ID != "hh-test-123" {
t.Errorf("expected ID='hh-test-123', got %q", entry.ID)
}
if entry.ResponseTimeMs != 42 {
t.Errorf("expected ResponseTimeMs=42, got %d", entry.ResponseTimeMs)
}
}
+42
View File
@@ -277,3 +277,45 @@ type OwnerRepository interface {
// Delete removes an owner.
Delete(ctx context.Context, id string) error
}
// HealthCheckRepository manages endpoint health check persistence.
type HealthCheckRepository interface {
// Create stores a new health check.
Create(ctx context.Context, check *domain.EndpointHealthCheck) error
// Update modifies an existing health check.
Update(ctx context.Context, check *domain.EndpointHealthCheck) error
// Get retrieves a health check by ID.
Get(ctx context.Context, id string) (*domain.EndpointHealthCheck, error)
// Delete removes a health check.
Delete(ctx context.Context, id string) error
// List returns health checks matching the filter with pagination.
List(ctx context.Context, filter *HealthCheckFilter) ([]*domain.EndpointHealthCheck, int, error)
// ListDueForCheck returns health checks that need to be probed (interval exceeded).
ListDueForCheck(ctx context.Context) ([]*domain.EndpointHealthCheck, error)
// GetByEndpoint retrieves a health check by endpoint address.
GetByEndpoint(ctx context.Context, endpoint string) (*domain.EndpointHealthCheck, error)
// RecordHistory records a single probe result in history.
RecordHistory(ctx context.Context, entry *domain.HealthHistoryEntry) error
// GetHistory retrieves recent probe history for a health check.
GetHistory(ctx context.Context, healthCheckID string, limit int) ([]*domain.HealthHistoryEntry, error)
// PurgeHistory deletes history entries older than the specified time.
PurgeHistory(ctx context.Context, olderThan time.Time) (int64, error)
// GetSummary returns aggregate counts by health status.
GetSummary(ctx context.Context) (*domain.HealthCheckSummary, error)
}
// HealthCheckFilter contains filter parameters for health check queries.
type HealthCheckFilter struct {
// Status filters by health status (healthy, degraded, down, cert_mismatch, unknown).
Status string
// CertificateID filters by managed certificate ID.
CertificateID string
// NetworkScanTargetID filters by network scan target ID.
NetworkScanTargetID string
// Enabled filters by enabled/disabled status (nil = all).
Enabled *bool
// Page is the page number (1-indexed).
Page int
// PerPage is the number of results per page.
PerPage int
}
@@ -0,0 +1,453 @@
package postgres
import (
"context"
"database/sql"
"fmt"
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/repository"
)
// HealthCheckRepository implements repository.HealthCheckRepository using PostgreSQL.
type HealthCheckRepository struct {
db *sql.DB
}
// NewHealthCheckRepository creates a new PostgreSQL-backed health check repository.
func NewHealthCheckRepository(db *sql.DB) *HealthCheckRepository {
return &HealthCheckRepository{db: db}
}
// Create stores a new health check.
func (r *HealthCheckRepository) Create(ctx context.Context, check *domain.EndpointHealthCheck) error {
_, err := r.db.ExecContext(ctx, `
INSERT INTO endpoint_health_checks (
id, endpoint, certificate_id, network_scan_target_id,
expected_fingerprint, observed_fingerprint, status,
consecutive_failures, response_time_ms, tls_version, cipher_suite,
cert_subject, cert_issuer, cert_expiry,
last_checked_at, last_success_at, last_failure_at, last_transition_at,
failure_reason, degraded_threshold, down_threshold, check_interval_seconds,
enabled, acknowledged, acknowledged_by, acknowledged_at,
created_at, updated_at
) VALUES (
$1, $2, $3, $4,
$5, $6, $7,
$8, $9, $10, $11,
$12, $13, $14,
$15, $16, $17, $18,
$19, $20, $21, $22,
$23, $24, $25, $26,
$27, $28
)`,
check.ID, check.Endpoint, check.CertificateID, check.NetworkScanTargetID,
check.ExpectedFingerprint, check.ObservedFingerprint, string(check.Status),
check.ConsecutiveFailures, check.ResponseTimeMs, check.TLSVersion, check.CipherSuite,
check.CertSubject, check.CertIssuer, check.CertExpiry,
check.LastCheckedAt, check.LastSuccessAt, check.LastFailureAt, check.LastTransitionAt,
check.FailureReason, check.DegradedThreshold, check.DownThreshold, check.CheckIntervalSecs,
check.Enabled, check.Acknowledged, check.AcknowledgedBy, check.AcknowledgedAt,
check.CreatedAt, check.UpdatedAt,
)
if err != nil {
return fmt.Errorf("create health check: %w", err)
}
return nil
}
// Update modifies an existing health check.
func (r *HealthCheckRepository) Update(ctx context.Context, check *domain.EndpointHealthCheck) error {
check.UpdatedAt = time.Now()
_, err := r.db.ExecContext(ctx, `
UPDATE endpoint_health_checks SET
endpoint = $2, certificate_id = $3, network_scan_target_id = $4,
expected_fingerprint = $5, observed_fingerprint = $6, status = $7,
consecutive_failures = $8, response_time_ms = $9, tls_version = $10, cipher_suite = $11,
cert_subject = $12, cert_issuer = $13, cert_expiry = $14,
last_checked_at = $15, last_success_at = $16, last_failure_at = $17, last_transition_at = $18,
failure_reason = $19, degraded_threshold = $20, down_threshold = $21, check_interval_seconds = $22,
enabled = $23, acknowledged = $24, acknowledged_by = $25, acknowledged_at = $26,
updated_at = $27
WHERE id = $1`,
check.ID,
check.Endpoint, check.CertificateID, check.NetworkScanTargetID,
check.ExpectedFingerprint, check.ObservedFingerprint, string(check.Status),
check.ConsecutiveFailures, check.ResponseTimeMs, check.TLSVersion, check.CipherSuite,
check.CertSubject, check.CertIssuer, check.CertExpiry,
check.LastCheckedAt, check.LastSuccessAt, check.LastFailureAt, check.LastTransitionAt,
check.FailureReason, check.DegradedThreshold, check.DownThreshold, check.CheckIntervalSecs,
check.Enabled, check.Acknowledged, check.AcknowledgedBy, check.AcknowledgedAt,
check.UpdatedAt,
)
if err != nil {
return fmt.Errorf("update health check: %w", err)
}
return nil
}
// Get retrieves a health check by ID.
func (r *HealthCheckRepository) Get(ctx context.Context, id string) (*domain.EndpointHealthCheck, error) {
check := &domain.EndpointHealthCheck{}
var status string
var certExpiry, lastCheckedAt, lastSuccessAt, lastFailureAt, lastTransitionAt, acknowledgedAt sql.NullTime
err := r.db.QueryRowContext(ctx, `
SELECT id, endpoint, certificate_id, network_scan_target_id,
expected_fingerprint, observed_fingerprint, status,
consecutive_failures, response_time_ms, tls_version, cipher_suite,
cert_subject, cert_issuer, cert_expiry,
last_checked_at, last_success_at, last_failure_at, last_transition_at,
failure_reason, degraded_threshold, down_threshold, check_interval_seconds,
enabled, acknowledged, acknowledged_by, acknowledged_at,
created_at, updated_at
FROM endpoint_health_checks
WHERE id = $1`, id).Scan(
&check.ID, &check.Endpoint, &check.CertificateID, &check.NetworkScanTargetID,
&check.ExpectedFingerprint, &check.ObservedFingerprint, &status,
&check.ConsecutiveFailures, &check.ResponseTimeMs, &check.TLSVersion, &check.CipherSuite,
&check.CertSubject, &check.CertIssuer, &certExpiry,
&lastCheckedAt, &lastSuccessAt, &lastFailureAt, &lastTransitionAt,
&check.FailureReason, &check.DegradedThreshold, &check.DownThreshold, &check.CheckIntervalSecs,
&check.Enabled, &check.Acknowledged, &check.AcknowledgedBy, &acknowledgedAt,
&check.CreatedAt, &check.UpdatedAt,
)
if err == sql.ErrNoRows {
return nil, fmt.Errorf("health check not found: %s", id)
}
if err != nil {
return nil, fmt.Errorf("get health check: %w", err)
}
check.Status = domain.HealthStatus(status)
if certExpiry.Valid {
check.CertExpiry = &certExpiry.Time
}
if lastCheckedAt.Valid {
check.LastCheckedAt = &lastCheckedAt.Time
}
if lastSuccessAt.Valid {
check.LastSuccessAt = &lastSuccessAt.Time
}
if lastFailureAt.Valid {
check.LastFailureAt = &lastFailureAt.Time
}
if lastTransitionAt.Valid {
check.LastTransitionAt = &lastTransitionAt.Time
}
if acknowledgedAt.Valid {
check.AcknowledgedAt = &acknowledgedAt.Time
}
return check, nil
}
// Delete removes a health check.
func (r *HealthCheckRepository) Delete(ctx context.Context, id string) error {
_, err := r.db.ExecContext(ctx, `DELETE FROM endpoint_health_checks WHERE id = $1`, id)
if err != nil {
return fmt.Errorf("delete health check: %w", err)
}
return nil
}
// List returns health checks matching the filter with pagination.
func (r *HealthCheckRepository) List(ctx context.Context, filter *repository.HealthCheckFilter) ([]*domain.EndpointHealthCheck, int, error) {
query := `SELECT id, endpoint, certificate_id, network_scan_target_id,
expected_fingerprint, observed_fingerprint, status,
consecutive_failures, response_time_ms, tls_version, cipher_suite,
cert_subject, cert_issuer, cert_expiry,
last_checked_at, last_success_at, last_failure_at, last_transition_at,
failure_reason, degraded_threshold, down_threshold, check_interval_seconds,
enabled, acknowledged, acknowledged_by, acknowledged_at,
created_at, updated_at
FROM endpoint_health_checks`
countQuery := `SELECT COUNT(*) FROM endpoint_health_checks`
var conditions []string
var args []interface{}
argIdx := 1
if filter != nil {
if filter.Status != "" {
conditions = append(conditions, fmt.Sprintf("status = $%d", argIdx))
args = append(args, filter.Status)
argIdx++
}
if filter.CertificateID != "" {
conditions = append(conditions, fmt.Sprintf("certificate_id = $%d", argIdx))
args = append(args, filter.CertificateID)
argIdx++
}
if filter.NetworkScanTargetID != "" {
conditions = append(conditions, fmt.Sprintf("network_scan_target_id = $%d", argIdx))
args = append(args, filter.NetworkScanTargetID)
argIdx++
}
if filter.Enabled != nil {
conditions = append(conditions, fmt.Sprintf("enabled = $%d", argIdx))
args = append(args, *filter.Enabled)
argIdx++
}
}
if len(conditions) > 0 {
where := " WHERE " + conditions[0]
for i := 1; i < len(conditions); i++ {
where += " AND " + conditions[i]
}
query += where
countQuery += where
}
// Get total count
var total int
err := r.db.QueryRowContext(ctx, countQuery, args...).Scan(&total)
if err != nil {
return nil, 0, fmt.Errorf("count health checks: %w", err)
}
// Apply pagination
query += " ORDER BY created_at DESC"
page := 1
perPage := 50
if filter != nil {
if filter.Page > 0 {
page = filter.Page
}
if filter.PerPage > 0 {
perPage = filter.PerPage
}
}
offset := (page - 1) * perPage
query += fmt.Sprintf(" LIMIT $%d OFFSET $%d", argIdx, argIdx+1)
args = append(args, perPage, offset)
rows, err := r.db.QueryContext(ctx, query, args...)
if err != nil {
return nil, 0, fmt.Errorf("list health checks: %w", err)
}
defer rows.Close()
var checks []*domain.EndpointHealthCheck
for rows.Next() {
check, err := scanHealthCheck(rows)
if err != nil {
return nil, 0, err
}
checks = append(checks, check)
}
return checks, total, rows.Err()
}
// ListDueForCheck returns health checks where the check interval has been exceeded.
func (r *HealthCheckRepository) ListDueForCheck(ctx context.Context) ([]*domain.EndpointHealthCheck, error) {
rows, err := r.db.QueryContext(ctx, `
SELECT id, endpoint, certificate_id, network_scan_target_id,
expected_fingerprint, observed_fingerprint, status,
consecutive_failures, response_time_ms, tls_version, cipher_suite,
cert_subject, cert_issuer, cert_expiry,
last_checked_at, last_success_at, last_failure_at, last_transition_at,
failure_reason, degraded_threshold, down_threshold, check_interval_seconds,
enabled, acknowledged, acknowledged_by, acknowledged_at,
created_at, updated_at
FROM endpoint_health_checks
WHERE enabled = TRUE
AND (
last_checked_at IS NULL
OR last_checked_at + (check_interval_seconds * INTERVAL '1 second') < NOW()
)
ORDER BY last_checked_at ASC NULLS FIRST`)
if err != nil {
return nil, fmt.Errorf("list due health checks: %w", err)
}
defer rows.Close()
var checks []*domain.EndpointHealthCheck
for rows.Next() {
check, err := scanHealthCheck(rows)
if err != nil {
return nil, err
}
checks = append(checks, check)
}
return checks, rows.Err()
}
// GetByEndpoint retrieves a health check by endpoint address.
func (r *HealthCheckRepository) GetByEndpoint(ctx context.Context, endpoint string) (*domain.EndpointHealthCheck, error) {
row := r.db.QueryRowContext(ctx, `
SELECT id, endpoint, certificate_id, network_scan_target_id,
expected_fingerprint, observed_fingerprint, status,
consecutive_failures, response_time_ms, tls_version, cipher_suite,
cert_subject, cert_issuer, cert_expiry,
last_checked_at, last_success_at, last_failure_at, last_transition_at,
failure_reason, degraded_threshold, down_threshold, check_interval_seconds,
enabled, acknowledged, acknowledged_by, acknowledged_at,
created_at, updated_at
FROM endpoint_health_checks
WHERE endpoint = $1`, endpoint)
check := &domain.EndpointHealthCheck{}
var status string
var certExpiry, lastCheckedAt, lastSuccessAt, lastFailureAt, lastTransitionAt, acknowledgedAt sql.NullTime
err := row.Scan(
&check.ID, &check.Endpoint, &check.CertificateID, &check.NetworkScanTargetID,
&check.ExpectedFingerprint, &check.ObservedFingerprint, &status,
&check.ConsecutiveFailures, &check.ResponseTimeMs, &check.TLSVersion, &check.CipherSuite,
&check.CertSubject, &check.CertIssuer, &certExpiry,
&lastCheckedAt, &lastSuccessAt, &lastFailureAt, &lastTransitionAt,
&check.FailureReason, &check.DegradedThreshold, &check.DownThreshold, &check.CheckIntervalSecs,
&check.Enabled, &check.Acknowledged, &check.AcknowledgedBy, &acknowledgedAt,
&check.CreatedAt, &check.UpdatedAt,
)
if err == sql.ErrNoRows {
return nil, fmt.Errorf("health check not found for endpoint: %s", endpoint)
}
if err != nil {
return nil, fmt.Errorf("get health check by endpoint: %w", err)
}
check.Status = domain.HealthStatus(status)
if certExpiry.Valid {
check.CertExpiry = &certExpiry.Time
}
if lastCheckedAt.Valid {
check.LastCheckedAt = &lastCheckedAt.Time
}
if lastSuccessAt.Valid {
check.LastSuccessAt = &lastSuccessAt.Time
}
if lastFailureAt.Valid {
check.LastFailureAt = &lastFailureAt.Time
}
if lastTransitionAt.Valid {
check.LastTransitionAt = &lastTransitionAt.Time
}
if acknowledgedAt.Valid {
check.AcknowledgedAt = &acknowledgedAt.Time
}
return check, nil
}
// RecordHistory records a single probe result in history.
func (r *HealthCheckRepository) RecordHistory(ctx context.Context, entry *domain.HealthHistoryEntry) error {
_, err := r.db.ExecContext(ctx, `
INSERT INTO endpoint_health_history (id, health_check_id, status, response_time_ms, fingerprint, failure_reason, checked_at)
VALUES ($1, $2, $3, $4, $5, $6, $7)`,
entry.ID, entry.HealthCheckID, entry.Status, entry.ResponseTimeMs, entry.Fingerprint, entry.FailureReason, entry.CheckedAt,
)
if err != nil {
return fmt.Errorf("record health check history: %w", err)
}
return nil
}
// GetHistory retrieves recent probe history for a health check.
func (r *HealthCheckRepository) GetHistory(ctx context.Context, healthCheckID string, limit int) ([]*domain.HealthHistoryEntry, error) {
if limit <= 0 {
limit = 100
}
rows, err := r.db.QueryContext(ctx, `
SELECT id, health_check_id, status, response_time_ms, fingerprint, failure_reason, checked_at
FROM endpoint_health_history
WHERE health_check_id = $1
ORDER BY checked_at DESC
LIMIT $2`, healthCheckID, limit)
if err != nil {
return nil, fmt.Errorf("get health check history: %w", err)
}
defer rows.Close()
var entries []*domain.HealthHistoryEntry
for rows.Next() {
entry := &domain.HealthHistoryEntry{}
if err := rows.Scan(&entry.ID, &entry.HealthCheckID, &entry.Status, &entry.ResponseTimeMs, &entry.Fingerprint, &entry.FailureReason, &entry.CheckedAt); err != nil {
return nil, fmt.Errorf("scan health history entry: %w", err)
}
entries = append(entries, entry)
}
return entries, rows.Err()
}
// PurgeHistory deletes history entries older than the specified time.
func (r *HealthCheckRepository) PurgeHistory(ctx context.Context, olderThan time.Time) (int64, error) {
result, err := r.db.ExecContext(ctx, `DELETE FROM endpoint_health_history WHERE checked_at < $1`, olderThan)
if err != nil {
return 0, fmt.Errorf("purge health check history: %w", err)
}
return result.RowsAffected()
}
// GetSummary returns aggregate counts by health status.
func (r *HealthCheckRepository) GetSummary(ctx context.Context) (*domain.HealthCheckSummary, error) {
rows, err := r.db.QueryContext(ctx, `SELECT status, COUNT(*) FROM endpoint_health_checks GROUP BY status`)
if err != nil {
return nil, fmt.Errorf("get health check summary: %w", err)
}
defer rows.Close()
summary := &domain.HealthCheckSummary{}
for rows.Next() {
var status string
var count int
if err := rows.Scan(&status, &count); err != nil {
return nil, fmt.Errorf("scan health check summary: %w", err)
}
switch domain.HealthStatus(status) {
case domain.HealthStatusHealthy:
summary.Healthy = count
case domain.HealthStatusDegraded:
summary.Degraded = count
case domain.HealthStatusDown:
summary.Down = count
case domain.HealthStatusCertMismatch:
summary.CertMismatch = count
case domain.HealthStatusUnknown:
summary.Unknown = count
}
summary.Total += count
}
return summary, rows.Err()
}
// scannable is an interface satisfied by both *sql.Row and *sql.Rows.
type scannable interface {
Scan(dest ...interface{}) error
}
// scanHealthCheck scans a health check from a row.
func scanHealthCheck(row scannable) (*domain.EndpointHealthCheck, error) {
check := &domain.EndpointHealthCheck{}
var status string
var certExpiry, lastCheckedAt, lastSuccessAt, lastFailureAt, lastTransitionAt, acknowledgedAt sql.NullTime
err := row.Scan(
&check.ID, &check.Endpoint, &check.CertificateID, &check.NetworkScanTargetID,
&check.ExpectedFingerprint, &check.ObservedFingerprint, &status,
&check.ConsecutiveFailures, &check.ResponseTimeMs, &check.TLSVersion, &check.CipherSuite,
&check.CertSubject, &check.CertIssuer, &certExpiry,
&lastCheckedAt, &lastSuccessAt, &lastFailureAt, &lastTransitionAt,
&check.FailureReason, &check.DegradedThreshold, &check.DownThreshold, &check.CheckIntervalSecs,
&check.Enabled, &check.Acknowledged, &check.AcknowledgedBy, &acknowledgedAt,
&check.CreatedAt, &check.UpdatedAt,
)
if err != nil {
return nil, fmt.Errorf("scan health check: %w", err)
}
check.Status = domain.HealthStatus(status)
if certExpiry.Valid {
check.CertExpiry = &certExpiry.Time
}
if lastCheckedAt.Valid {
check.LastCheckedAt = &lastCheckedAt.Time
}
if lastSuccessAt.Valid {
check.LastSuccessAt = &lastSuccessAt.Time
}
if lastFailureAt.Valid {
check.LastFailureAt = &lastFailureAt.Time
}
if lastTransitionAt.Valid {
check.LastTransitionAt = &lastTransitionAt.Time
}
if acknowledgedAt.Valid {
check.AcknowledgedAt = &acknowledgedAt.Time
}
return check, nil
}
+69
View File
@@ -40,6 +40,11 @@ type DigestServicer interface {
ProcessDigest(ctx context.Context) error
}
// HealthCheckServicer defines the interface for endpoint TLS health monitoring used by the scheduler.
type HealthCheckServicer interface {
RunHealthChecks(ctx context.Context) error
}
// Scheduler manages background jobs and periodic tasks for the certificate control plane.
// It runs multiple concurrent loops for renewal checks, job processing, agent health checks,
// and notification processing.
@@ -50,6 +55,7 @@ type Scheduler struct {
notificationService NotificationServicer
networkScanService NetworkScanServicer
digestService DigestServicer
healthCheckService HealthCheckServicer
logger *slog.Logger
// Configurable tick intervals
@@ -60,6 +66,7 @@ type Scheduler struct {
shortLivedExpiryCheckInterval time.Duration
networkScanInterval time.Duration
digestInterval time.Duration
healthCheckInterval time.Duration
// Idempotency guards: prevent duplicate execution of slow jobs
renewalCheckRunning atomic.Bool
@@ -69,6 +76,7 @@ type Scheduler struct {
shortLivedExpiryCheckRunning atomic.Bool
networkScanRunning atomic.Bool
digestRunning atomic.Bool
healthCheckRunning atomic.Bool
// Graceful shutdown: wait for in-flight work to complete
wg sync.WaitGroup
@@ -99,6 +107,7 @@ func NewScheduler(
shortLivedExpiryCheckInterval: 30 * time.Second,
networkScanInterval: 6 * time.Hour,
digestInterval: 24 * time.Hour,
healthCheckInterval: 60 * time.Second,
}
}
@@ -143,6 +152,17 @@ func (s *Scheduler) SetShortLivedExpiryCheckInterval(d time.Duration) {
s.shortLivedExpiryCheckInterval = d
}
// SetHealthCheckService sets the health check service for the 8th scheduler loop.
// Called after construction since health monitoring is optional.
func (s *Scheduler) SetHealthCheckService(hcs HealthCheckServicer) {
s.healthCheckService = hcs
}
// SetHealthCheckInterval configures the interval for endpoint TLS health checks.
func (s *Scheduler) SetHealthCheckInterval(d time.Duration) {
s.healthCheckInterval = d
}
// Start initiates all background scheduler loops. It returns a channel that signals
// when the scheduler has started all loops. The scheduler runs until the context is cancelled.
func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
@@ -160,6 +180,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
if s.digestService != nil {
loopCount++
}
if s.healthCheckService != nil {
loopCount++
}
s.wg.Add(loopCount)
go func() { defer s.wg.Done(); s.renewalCheckLoop(ctx) }()
@@ -173,6 +196,9 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
if s.digestService != nil {
go func() { defer s.wg.Done(); s.digestLoop(ctx) }()
}
if s.healthCheckService != nil {
go func() { defer s.wg.Done(); s.healthCheckLoop(ctx) }()
}
// Signal that all loops are launched
close(startedChan)
@@ -517,6 +543,49 @@ func (s *Scheduler) runDigest(ctx context.Context) {
}
}
// healthCheckLoop runs every healthCheckInterval and performs endpoint TLS health checks.
// Do NOT run immediately on start — health checks are frequent (60s default) and may be
// resource-intensive. Wait for the first tick.
// Uses atomic.Bool to prevent duplicate execution if the previous check is still running.
func (s *Scheduler) healthCheckLoop(ctx context.Context) {
ticker := time.NewTicker(s.healthCheckInterval)
defer ticker.Stop()
// Do NOT run immediately on start for health checks — wait for the first tick.
// Health checks are frequent and shouldn't fire on every restart.
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
if !s.healthCheckRunning.CompareAndSwap(false, true) {
s.logger.Debug("health check still running, skipping tick")
continue
}
s.wg.Add(1)
go func() {
defer s.wg.Done()
defer s.healthCheckRunning.Store(false)
s.runHealthCheck(ctx)
}()
}
}
}
// runHealthCheck executes a single health check cycle with error recovery.
func (s *Scheduler) runHealthCheck(ctx context.Context) {
opCtx, cancel := context.WithTimeout(ctx, 5*time.Minute)
defer cancel()
if err := s.healthCheckService.RunHealthChecks(opCtx); err != nil {
s.logger.Error("health check run failed",
"error", err,
"interval", s.healthCheckInterval.String())
} else {
s.logger.Debug("health check completed")
}
}
// WaitForCompletion waits for all in-flight scheduler work to complete.
// It respects the provided timeout and returns an error if work is still in progress after timeout.
// Call this after the scheduler context has been cancelled to ensure graceful shutdown.
+313
View File
@@ -0,0 +1,313 @@
package service
import (
"context"
"fmt"
"log/slog"
"sync"
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/repository"
"github.com/shankar0123/certctl/internal/tlsprobe"
)
// HealthCheckService manages endpoint TLS health monitoring.
type HealthCheckService struct {
repo repository.HealthCheckRepository
auditService *AuditService
notifService *NotificationService
logger *slog.Logger
maxConcurrent int
defaultTimeout time.Duration
historyRetention time.Duration
autoCreate bool
}
// NewHealthCheckService creates a new HealthCheckService.
func NewHealthCheckService(
repo repository.HealthCheckRepository,
auditService *AuditService,
logger *slog.Logger,
maxConcurrent int,
defaultTimeout time.Duration,
historyRetention time.Duration,
autoCreate bool,
) *HealthCheckService {
return &HealthCheckService{
repo: repo,
auditService: auditService,
logger: logger,
maxConcurrent: maxConcurrent,
defaultTimeout: defaultTimeout,
historyRetention: historyRetention,
autoCreate: autoCreate,
}
}
// SetNotificationService sets the notification service for sending status transition alerts.
func (s *HealthCheckService) SetNotificationService(ns *NotificationService) {
s.notifService = ns
}
// RunHealthChecks is the scheduler entry point for continuous TLS health monitoring.
// Fetches endpoints due for check, probes concurrently with semaphore control,
// updates health status with state transitions, records history, and sends notifications.
func (s *HealthCheckService) RunHealthChecks(ctx context.Context) error {
// Fetch all endpoints due for check
checks, err := s.repo.ListDueForCheck(ctx)
if err != nil {
return fmt.Errorf("failed to list endpoints due for check: %w", err)
}
if len(checks) == 0 {
s.logger.Debug("no endpoints due for health check")
return nil
}
s.logger.Debug("running health checks", "endpoint_count", len(checks))
// Concurrent probing with semaphore
sem := make(chan struct{}, s.maxConcurrent)
var wg sync.WaitGroup
probeResults := make(map[string]tlsprobe.ProbeResult)
var mu sync.Mutex
for _, check := range checks {
wg.Add(1)
go func(c *domain.EndpointHealthCheck) {
defer wg.Done()
sem <- struct{}{} // acquire
defer func() { <-sem }() // release
result := tlsprobe.ProbeTLS(ctx, c.Endpoint, s.defaultTimeout)
mu.Lock()
probeResults[c.ID] = result
mu.Unlock()
}(check)
}
wg.Wait()
// Process results and update health status
successCount := 0
failureCount := 0
transitionCount := 0
for _, check := range checks {
result := probeResults[check.ID]
// Determine old status for transition detection
oldStatus := check.Status
// Update probe result fields
check.LastCheckedAt = timePtr(time.Now())
check.ResponseTimeMs = result.ResponseTimeMs
if result.Success {
successCount++
check.ObservedFingerprint = result.Fingerprint
check.TLSVersion = result.TLSVersion
check.CipherSuite = result.CipherSuite
check.CertSubject = result.Subject
check.CertIssuer = result.Issuer
check.CertExpiry = timePtr(result.NotAfter)
check.FailureReason = ""
check.LastSuccessAt = timePtr(time.Now())
check.ConsecutiveFailures = 0
} else {
failureCount++
check.LastFailureAt = timePtr(time.Now())
check.ConsecutiveFailures++
check.FailureReason = result.Error
}
// Transition state based on consecutive failures and fingerprint match
newStatus, transitioned := check.TransitionStatus(result.Success, result.Fingerprint)
if transitioned {
transitionCount++
check.Status = newStatus
check.LastTransitionAt = timePtr(time.Now())
// Reset acknowledged on transition
check.Acknowledged = false
// Log transition
s.logger.Info("health check status transition",
"endpoint", check.Endpoint,
"old_status", string(oldStatus),
"new_status", string(newStatus))
// Record audit event
if s.auditService != nil {
_ = s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
"health_check_status_transition", "health_check", check.ID,
map[string]interface{}{
"endpoint": check.Endpoint,
"old_status": string(oldStatus),
"new_status": string(newStatus),
})
}
}
// Update health check record
if err := s.repo.Update(ctx, check); err != nil {
s.logger.Error("failed to update health check",
"endpoint", check.Endpoint,
"error", err)
continue
}
// Record probe result in history
if err := s.repo.RecordHistory(ctx, &domain.HealthHistoryEntry{
HealthCheckID: check.ID,
Status: string(check.Status),
ResponseTimeMs: check.ResponseTimeMs,
Fingerprint: check.ObservedFingerprint,
FailureReason: check.FailureReason,
CheckedAt: time.Now(),
}); err != nil {
s.logger.Warn("failed to record health check history",
"endpoint", check.Endpoint,
"error", err)
}
}
// Purge old history entries once per run
if err := s.PurgeOldHistory(ctx); err != nil {
s.logger.Warn("failed to purge old health check history", "error", err)
}
s.logger.Debug("health check run completed",
"total", len(checks),
"success", successCount,
"failure", failureCount,
"transitions", transitionCount)
return nil
}
// Create creates a new health check endpoint.
func (s *HealthCheckService) Create(ctx context.Context, check *domain.EndpointHealthCheck) error {
if check.ID == "" {
check.ID = generateID("hc")
}
check.CreatedAt = time.Now()
check.UpdatedAt = time.Now()
if err := s.repo.Create(ctx, check); err != nil {
return fmt.Errorf("failed to create health check: %w", err)
}
if s.auditService != nil {
_ = s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
"health_check_created", "health_check", check.ID,
map[string]interface{}{
"endpoint": check.Endpoint,
})
}
return nil
}
// Get retrieves a health check by ID.
func (s *HealthCheckService) Get(ctx context.Context, id string) (*domain.EndpointHealthCheck, error) {
return s.repo.Get(ctx, id)
}
// Update updates an existing health check.
func (s *HealthCheckService) Update(ctx context.Context, check *domain.EndpointHealthCheck) error {
check.UpdatedAt = time.Now()
if err := s.repo.Update(ctx, check); err != nil {
return fmt.Errorf("failed to update health check: %w", err)
}
if s.auditService != nil {
_ = s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
"health_check_updated", "health_check", check.ID,
map[string]interface{}{
"endpoint": check.Endpoint,
})
}
return nil
}
// Delete deletes a health check.
func (s *HealthCheckService) Delete(ctx context.Context, id string) error {
if err := s.repo.Delete(ctx, id); err != nil {
return fmt.Errorf("failed to delete health check: %w", err)
}
if s.auditService != nil {
_ = s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
"health_check_deleted", "health_check", id,
map[string]interface{}{})
}
return nil
}
// List lists health checks with optional filtering.
func (s *HealthCheckService) List(ctx context.Context, filter *repository.HealthCheckFilter) ([]*domain.EndpointHealthCheck, int, error) {
if filter == nil {
filter = &repository.HealthCheckFilter{}
}
return s.repo.List(ctx, filter)
}
// GetHistory retrieves health check history for an endpoint.
func (s *HealthCheckService) GetHistory(ctx context.Context, healthCheckID string, limit int) ([]*domain.HealthHistoryEntry, error) {
if limit <= 0 {
limit = 100
}
if limit > 1000 {
limit = 1000
}
return s.repo.GetHistory(ctx, healthCheckID, limit)
}
// AcknowledgeIncident marks a health check incident as acknowledged.
func (s *HealthCheckService) AcknowledgeIncident(ctx context.Context, id string, actor string) error {
check, err := s.repo.Get(ctx, id)
if err != nil {
return fmt.Errorf("failed to get health check: %w", err)
}
check.Acknowledged = true
check.AcknowledgedBy = actor
check.AcknowledgedAt = timePtr(time.Now())
if err := s.repo.Update(ctx, check); err != nil {
return fmt.Errorf("failed to update health check: %w", err)
}
if s.auditService != nil {
_ = s.auditService.RecordEvent(ctx, actor, domain.ActorTypeUser,
"health_check_acknowledged", "health_check", id,
map[string]interface{}{
"endpoint": check.Endpoint,
})
}
return nil
}
// GetSummary returns aggregated health check status counts.
func (s *HealthCheckService) GetSummary(ctx context.Context) (*domain.HealthCheckSummary, error) {
return s.repo.GetSummary(ctx)
}
// PurgeOldHistory removes health check history entries older than the retention period.
func (s *HealthCheckService) PurgeOldHistory(ctx context.Context) error {
cutoff := time.Now().Add(-s.historyRetention)
_, err := s.repo.PurgeHistory(ctx, cutoff)
return err
}
// Helper functions
func timePtr(t time.Time) *time.Time {
return &t
}
+350
View File
@@ -0,0 +1,350 @@
package service
import (
"context"
"errors"
"log/slog"
"os"
"testing"
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/repository"
)
// mockHealthCheckRepo implements the HealthCheckRepository interface for testing.
type mockHealthCheckRepo struct {
checks map[string]*domain.EndpointHealthCheck
history []*domain.HealthHistoryEntry
createErr error
getErr error
updateErr error
deleteErr error
listErr error
listDueErr error
getHistoryErr error
recordHistoryErr error
purgeHistoryErr error
getSummaryErr error
getSummaryResult *domain.HealthCheckSummary
}
func newMockHealthCheckRepo() *mockHealthCheckRepo {
return &mockHealthCheckRepo{
checks: make(map[string]*domain.EndpointHealthCheck),
history: []*domain.HealthHistoryEntry{},
getSummaryResult: &domain.HealthCheckSummary{
Healthy: 0,
Degraded: 0,
Down: 0,
CertMismatch: 0,
Unknown: 0,
},
}
}
func (m *mockHealthCheckRepo) Create(ctx context.Context, check *domain.EndpointHealthCheck) error {
if m.createErr != nil {
return m.createErr
}
m.checks[check.ID] = check
return nil
}
func (m *mockHealthCheckRepo) Get(ctx context.Context, id string) (*domain.EndpointHealthCheck, error) {
if m.getErr != nil {
return nil, m.getErr
}
if check, ok := m.checks[id]; ok {
return check, nil
}
return nil, errors.New("not found")
}
func (m *mockHealthCheckRepo) GetByEndpoint(ctx context.Context, endpoint string) (*domain.EndpointHealthCheck, error) {
for _, check := range m.checks {
if check.Endpoint == endpoint {
return check, nil
}
}
return nil, errors.New("not found")
}
func (m *mockHealthCheckRepo) Update(ctx context.Context, check *domain.EndpointHealthCheck) error {
if m.updateErr != nil {
return m.updateErr
}
m.checks[check.ID] = check
return nil
}
func (m *mockHealthCheckRepo) Delete(ctx context.Context, id string) error {
if m.deleteErr != nil {
return m.deleteErr
}
delete(m.checks, id)
return nil
}
func (m *mockHealthCheckRepo) List(ctx context.Context, filter *repository.HealthCheckFilter) ([]*domain.EndpointHealthCheck, int, error) {
if m.listErr != nil {
return nil, 0, m.listErr
}
checks := make([]*domain.EndpointHealthCheck, 0, len(m.checks))
for _, check := range m.checks {
checks = append(checks, check)
}
return checks, len(checks), nil
}
func (m *mockHealthCheckRepo) ListDueForCheck(ctx context.Context) ([]*domain.EndpointHealthCheck, error) {
if m.listDueErr != nil {
return nil, m.listDueErr
}
checks := make([]*domain.EndpointHealthCheck, 0, len(m.checks))
for _, check := range m.checks {
if check.Enabled {
checks = append(checks, check)
}
}
return checks, nil
}
func (m *mockHealthCheckRepo) GetHistory(ctx context.Context, healthCheckID string, limit int) ([]*domain.HealthHistoryEntry, error) {
if m.getHistoryErr != nil {
return nil, m.getHistoryErr
}
return m.history, nil
}
func (m *mockHealthCheckRepo) RecordHistory(ctx context.Context, entry *domain.HealthHistoryEntry) error {
if m.recordHistoryErr != nil {
return m.recordHistoryErr
}
m.history = append(m.history, entry)
return nil
}
func (m *mockHealthCheckRepo) PurgeHistory(ctx context.Context, before time.Time) (int64, error) {
if m.purgeHistoryErr != nil {
return 0, m.purgeHistoryErr
}
return 0, nil
}
func (m *mockHealthCheckRepo) GetSummary(ctx context.Context) (*domain.HealthCheckSummary, error) {
if m.getSummaryErr != nil {
return nil, m.getSummaryErr
}
return m.getSummaryResult, nil
}
// Tests
func newTestLogger() *slog.Logger {
return slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))
}
func TestHealthCheckService_Create_Success(t *testing.T) {
repo := newMockHealthCheckRepo()
logger := newTestLogger()
svc := NewHealthCheckService(repo, nil, logger, 10, 5*time.Second, 30*24*time.Hour, false)
check := &domain.EndpointHealthCheck{
Endpoint: "example.com:443",
Status: domain.HealthStatusUnknown,
Enabled: true,
CheckIntervalSecs: 300,
}
err := svc.Create(context.Background(), check)
if err != nil {
t.Fatalf("Create failed: %v", err)
}
if check.ID == "" {
t.Fatal("Expected ID to be set")
}
retrieved, _ := repo.Get(context.Background(), check.ID)
if retrieved == nil {
t.Fatal("Expected check to be in repo")
}
if retrieved.Endpoint != "example.com:443" {
t.Errorf("Expected endpoint example.com:443, got %s", retrieved.Endpoint)
}
}
func TestHealthCheckService_Create_RepoError(t *testing.T) {
repo := newMockHealthCheckRepo()
repo.createErr = errors.New("db error")
logger := newTestLogger()
svc := NewHealthCheckService(repo, nil, logger, 10, 5*time.Second, 30*24*time.Hour, false)
check := &domain.EndpointHealthCheck{
Endpoint: "example.com:443",
Enabled: true,
}
err := svc.Create(context.Background(), check)
if err == nil {
t.Fatal("Expected error, got nil")
}
}
func TestHealthCheckService_Get_Success(t *testing.T) {
repo := newMockHealthCheckRepo()
logger := newTestLogger()
svc := NewHealthCheckService(repo, nil, logger, 10, 5*time.Second, 30*24*time.Hour, false)
check := &domain.EndpointHealthCheck{
ID: "hc-test-1",
Endpoint: "example.com:443",
Status: domain.HealthStatusHealthy,
}
repo.checks["hc-test-1"] = check
retrieved, err := svc.Get(context.Background(), "hc-test-1")
if err != nil {
t.Fatalf("Get failed: %v", err)
}
if retrieved.Endpoint != "example.com:443" {
t.Errorf("Expected endpoint example.com:443, got %s", retrieved.Endpoint)
}
}
func TestHealthCheckService_Get_NotFound(t *testing.T) {
repo := newMockHealthCheckRepo()
logger := newTestLogger()
svc := NewHealthCheckService(repo, nil, logger, 10, 5*time.Second, 30*24*time.Hour, false)
_, err := svc.Get(context.Background(), "nonexistent")
if err == nil {
t.Fatal("Expected error for nonexistent check")
}
}
func TestHealthCheckService_List_Success(t *testing.T) {
repo := newMockHealthCheckRepo()
logger := newTestLogger()
svc := NewHealthCheckService(repo, nil, logger, 10, 5*time.Second, 30*24*time.Hour, false)
check1 := &domain.EndpointHealthCheck{
ID: "hc-1",
Endpoint: "api.example.com:443",
Status: domain.HealthStatusHealthy,
}
check2 := &domain.EndpointHealthCheck{
ID: "hc-2",
Endpoint: "web.example.com:443",
Status: domain.HealthStatusDegraded,
}
repo.checks["hc-1"] = check1
repo.checks["hc-2"] = check2
checks, total, err := svc.List(context.Background(), nil)
if err != nil {
t.Fatalf("List failed: %v", err)
}
if len(checks) != 2 {
t.Errorf("Expected 2 checks, got %d", len(checks))
}
if total != 2 {
t.Errorf("Expected total 2, got %d", total)
}
}
func TestHealthCheckService_Delete_Success(t *testing.T) {
repo := newMockHealthCheckRepo()
logger := newTestLogger()
svc := NewHealthCheckService(repo, nil, logger, 10, 5*time.Second, 30*24*time.Hour, false)
check := &domain.EndpointHealthCheck{
ID: "hc-test-1",
Endpoint: "example.com:443",
}
repo.checks["hc-test-1"] = check
err := svc.Delete(context.Background(), "hc-test-1")
if err != nil {
t.Fatalf("Delete failed: %v", err)
}
if _, ok := repo.checks["hc-test-1"]; ok {
t.Fatal("Expected check to be deleted")
}
}
func TestHealthCheckService_AcknowledgeIncident_Success(t *testing.T) {
repo := newMockHealthCheckRepo()
logger := newTestLogger()
svc := NewHealthCheckService(repo, nil, logger, 10, 5*time.Second, 30*24*time.Hour, false)
check := &domain.EndpointHealthCheck{
ID: "hc-test-1",
Endpoint: "example.com:443",
Status: domain.HealthStatusDown,
Acknowledged: false,
}
repo.checks["hc-test-1"] = check
err := svc.AcknowledgeIncident(context.Background(), "hc-test-1", "user@example.com")
if err != nil {
t.Fatalf("AcknowledgeIncident failed: %v", err)
}
retrieved := repo.checks["hc-test-1"]
if !retrieved.Acknowledged {
t.Fatal("Expected Acknowledged to be true")
}
if retrieved.AcknowledgedBy != "user@example.com" {
t.Errorf("Expected AcknowledgedBy to be user@example.com, got %s", retrieved.AcknowledgedBy)
}
if retrieved.AcknowledgedAt == nil {
t.Fatal("Expected AcknowledgedAt to be set")
}
}
func TestHealthCheckService_GetSummary_Success(t *testing.T) {
repo := newMockHealthCheckRepo()
logger := newTestLogger()
svc := NewHealthCheckService(repo, nil, logger, 10, 5*time.Second, 30*24*time.Hour, false)
repo.getSummaryResult = &domain.HealthCheckSummary{
Healthy: 5,
Degraded: 2,
Down: 1,
CertMismatch: 1,
Unknown: 0,
}
summary, err := svc.GetSummary(context.Background())
if err != nil {
t.Fatalf("GetSummary failed: %v", err)
}
if summary.Healthy != 5 {
t.Errorf("Expected 5 healthy, got %d", summary.Healthy)
}
}
func TestHealthCheckService_RunHealthChecks_NoEndpoints(t *testing.T) {
repo := newMockHealthCheckRepo()
logger := newTestLogger()
svc := NewHealthCheckService(repo, nil, logger, 10, 5*time.Second, 30*24*time.Hour, false)
err := svc.RunHealthChecks(context.Background())
if err != nil {
t.Fatalf("RunHealthChecks failed: %v", err)
}
}
func TestHealthCheckService_PurgeOldHistory_Success(t *testing.T) {
repo := newMockHealthCheckRepo()
logger := newTestLogger()
svc := NewHealthCheckService(repo, nil, logger, 10, 5*time.Second, 30*24*time.Hour, false)
err := svc.PurgeOldHistory(context.Background())
if err != nil {
t.Fatalf("PurgeOldHistory failed: %v", err)
}
}
+5 -25
View File
@@ -2,9 +2,6 @@ package service
import (
"context"
"crypto/ecdsa"
"crypto/rsa"
"crypto/sha256"
"crypto/tls"
"crypto/x509"
"encoding/pem"
@@ -16,6 +13,7 @@ import (
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/repository"
"github.com/shankar0123/certctl/internal/tlsprobe"
)
// SentinelAgentID is the agent ID used for network-discovered certificates.
@@ -469,16 +467,15 @@ func (s *NetworkScanService) probeTLS(ctx context.Context, address string, timeo
// tlsCertToEntry converts an x509.Certificate from a TLS handshake into a DiscoveredCertEntry.
func tlsCertToEntry(cert *x509.Certificate, address string) domain.DiscoveredCertEntry {
// Compute SHA-256 fingerprint
fingerprintBytes := sha256.Sum256(cert.Raw)
fingerprint := fmt.Sprintf("%x", fingerprintBytes)
// Compute SHA-256 fingerprint using shared tlsprobe package
fingerprint := tlsprobe.CertFingerprint(cert)
// Encode as PEM
pemBlock := &pem.Block{Type: "CERTIFICATE", Bytes: cert.Raw}
pemData := string(pem.EncodeToMemory(pemBlock))
// Key algorithm and size
keyAlg, keySize := tlsCertKeyInfo(cert)
// Key algorithm and size using shared tlsprobe package
keyAlg, keySize := tlsprobe.CertKeyInfo(cert)
return domain.DiscoveredCertEntry{
FingerprintSHA256: fingerprint,
@@ -497,20 +494,3 @@ func tlsCertToEntry(cert *x509.Certificate, address string) domain.DiscoveredCer
SourceFormat: "network",
}
}
// tlsCertKeyInfo extracts key algorithm name and size from a certificate.
func tlsCertKeyInfo(cert *x509.Certificate) (string, int) {
switch pub := cert.PublicKey.(type) {
case *rsa.PublicKey:
return "RSA", pub.N.BitLen()
case *ecdsa.PublicKey:
return "ECDSA", pub.Curve.Params().BitSize
default:
switch cert.PublicKeyAlgorithm {
case x509.Ed25519:
return "Ed25519", 256
default:
return cert.PublicKeyAlgorithm.String(), 0
}
}
}
+125
View File
@@ -0,0 +1,125 @@
package tlsprobe
import (
"context"
"crypto/ecdsa"
"crypto/rsa"
"crypto/sha256"
"crypto/tls"
"crypto/x509"
"encoding/hex"
"fmt"
"net"
"time"
)
// ProbeResult contains the result of probing a TLS endpoint.
type ProbeResult struct {
Address string `json:"address"`
Success bool `json:"success"`
Fingerprint string `json:"fingerprint"` // SHA-256 hex fingerprint of leaf cert
TLSVersion string `json:"tls_version"` // e.g. "TLS 1.3"
CipherSuite string `json:"cipher_suite"` // e.g. "TLS_AES_128_GCM_SHA256"
Subject string `json:"subject"` // cert subject CN
Issuer string `json:"issuer"` // cert issuer CN
NotBefore time.Time `json:"not_before"`
NotAfter time.Time `json:"not_after"`
SerialNumber string `json:"serial_number"`
ResponseTimeMs int `json:"response_time_ms"`
Error string `json:"error,omitempty"`
}
// ProbeTLS connects to a TLS endpoint, performs a handshake, and extracts certificate metadata.
// It uses InsecureSkipVerify to discover all certificates including self-signed and expired ones.
// This is safe because the certificate data is extracted and analyzed, not validated for trust.
func ProbeTLS(ctx context.Context, address string, timeout time.Duration) ProbeResult {
startTime := time.Now()
result := ProbeResult{
Address: address,
Success: false,
}
dialer := &net.Dialer{
Timeout: timeout,
}
conn, err := tls.DialWithDialer(dialer, "tcp", address, &tls.Config{
// SECURITY NOTE: InsecureSkipVerify is intentionally set to true here.
// The health checker must monitor ALL certificates including self-signed,
// expired, and internal CA certificates. This setting is scoped to discovery
// probing only — it is NEVER used for control-plane API calls, issuer
// connector communication, or any operation that trusts the certificate.
// The endpoint's certificate chain is extracted and analyzed, not validated.
// See TICKET-016 for full security audit rationale.
InsecureSkipVerify: true,
})
if err != nil {
result.Error = err.Error()
result.ResponseTimeMs = int(time.Since(startTime).Milliseconds())
return result
}
defer conn.Close()
result.ResponseTimeMs = int(time.Since(startTime).Milliseconds())
result.Success = true
// Extract certificates from TLS connection state
state := conn.ConnectionState()
if len(state.PeerCertificates) > 0 {
cert := state.PeerCertificates[0]
result.Fingerprint = CertFingerprint(cert)
result.Subject = cert.Subject.CommonName
result.Issuer = cert.Issuer.CommonName
result.NotBefore = cert.NotBefore
result.NotAfter = cert.NotAfter
result.SerialNumber = cert.SerialNumber.Text(16)
}
// Extract TLS version string
result.TLSVersion = tlsVersionString(state.Version)
// Extract cipher suite name
result.CipherSuite = tls.CipherSuiteName(state.CipherSuite)
return result
}
// CertFingerprint computes the SHA-256 fingerprint of a certificate (hex-encoded).
func CertFingerprint(cert *x509.Certificate) string {
fingerprintBytes := sha256.Sum256(cert.Raw)
return hex.EncodeToString(fingerprintBytes[:])
}
// CertKeyInfo extracts key algorithm name and size from a certificate.
// Returns algorithm name (e.g., "RSA", "ECDSA", "Ed25519") and key size in bits.
func CertKeyInfo(cert *x509.Certificate) (string, int) {
switch pub := cert.PublicKey.(type) {
case *rsa.PublicKey:
return "RSA", pub.N.BitLen()
case *ecdsa.PublicKey:
return "ECDSA", pub.Curve.Params().BitSize
default:
switch cert.PublicKeyAlgorithm {
case x509.Ed25519:
return "Ed25519", 256
default:
return cert.PublicKeyAlgorithm.String(), 0
}
}
}
// tlsVersionString converts a TLS version constant to a human-readable string.
func tlsVersionString(version uint16) string {
switch version {
case tls.VersionTLS10:
return "TLS 1.0"
case tls.VersionTLS11:
return "TLS 1.1"
case tls.VersionTLS12:
return "TLS 1.2"
case tls.VersionTLS13:
return "TLS 1.3"
default:
return fmt.Sprintf("TLS 0x%x", version)
}
}
+175
View File
@@ -0,0 +1,175 @@
package tlsprobe
import (
"context"
"crypto/ecdsa"
"crypto/elliptic"
"crypto/rand"
"crypto/rsa"
"crypto/x509"
"crypto/x509/pkix"
"fmt"
"math/big"
"net"
"net/http/httptest"
"testing"
"time"
)
// TestProbeTLS_ConnectionRefused tests probing an unavailable endpoint.
func TestProbeTLS_ConnectionRefused(t *testing.T) {
ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
result := ProbeTLS(ctx, "127.0.0.1:1", 1*time.Second)
if result.Success {
t.Errorf("expected Success=false for unavailable endpoint, got %v", result.Success)
}
if result.Error == "" {
t.Errorf("expected Error to be set for unavailable endpoint, got empty")
}
// ResponseTimeMs might be 0 on very fast systems, so just check it's set
if result.ResponseTimeMs < 0 {
t.Errorf("expected ResponseTimeMs >= 0, got %d", result.ResponseTimeMs)
}
}
// TestProbeTLS_Success tests probing a live TLS server.
func TestProbeTLS_Success(t *testing.T) {
// Create a test HTTPS server with a self-signed certificate
server := httptest.NewTLSServer(nil)
defer server.Close()
// Extract the server address (remove https://)
u := server.Listener.Addr().(*net.TCPAddr)
address := net.JoinHostPort(u.IP.String(), fmt.Sprintf("%d", u.Port))
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
result := ProbeTLS(ctx, address, 5*time.Second)
if !result.Success {
t.Errorf("expected Success=true, got false. Error: %s", result.Error)
}
if result.Fingerprint == "" {
t.Errorf("expected Fingerprint to be set, got empty")
}
if result.TLSVersion == "" {
t.Errorf("expected TLSVersion to be set, got empty")
}
if result.ResponseTimeMs == 0 {
t.Errorf("expected ResponseTimeMs > 0, got 0")
}
}
// TestCertFingerprint_SHA256 tests SHA-256 fingerprint computation.
func TestCertFingerprint_SHA256(t *testing.T) {
cert, _ := createTestCertWithKey(t, "test.example.com", "rsa")
fp := CertFingerprint(cert)
if fp == "" {
t.Errorf("expected non-empty fingerprint, got empty")
}
if len(fp) != 64 {
t.Errorf("expected fingerprint length 64 (hex SHA-256), got %d", len(fp))
}
// Verify it's valid hex
for _, ch := range fp {
if (ch < '0' || ch > '9') && (ch < 'a' || ch > 'f') {
t.Errorf("expected lowercase hex fingerprint, got invalid char: %c", ch)
}
}
// Verify consistency (same cert should produce same fingerprint)
fp2 := CertFingerprint(cert)
if fp != fp2 {
t.Errorf("fingerprint not consistent: %s vs %s", fp, fp2)
}
}
// TestCertKeyInfo_RSA tests RSA key info extraction.
func TestCertKeyInfo_RSA(t *testing.T) {
cert, _ := createTestCertWithKey(t, "test.example.com", "rsa")
alg, size := CertKeyInfo(cert)
if alg != "RSA" {
t.Errorf("expected algorithm 'RSA', got '%s'", alg)
}
if size != 2048 {
t.Errorf("expected RSA key size 2048, got %d", size)
}
}
// TestCertKeyInfo_ECDSA tests ECDSA key info extraction.
func TestCertKeyInfo_ECDSA(t *testing.T) {
cert, _ := createTestCertWithKey(t, "test.example.com", "ecdsa")
alg, size := CertKeyInfo(cert)
if alg != "ECDSA" {
t.Errorf("expected algorithm 'ECDSA', got '%s'", alg)
}
if size != 256 {
t.Errorf("expected ECDSA P-256 key size 256, got %d", size)
}
}
// Helper: createTestCert creates a self-signed test certificate with RSA key.
func createTestCert(t *testing.T, cn string) *x509.Certificate {
cert, _ := createTestCertWithKey(t, cn, "rsa")
return cert
}
// Helper: createTestCertWithKey creates a test certificate with specified key type.
func createTestCertWithKey(t *testing.T, cn, keyType string) (*x509.Certificate, interface{}) {
var privKey interface{}
var pubKey interface{}
if keyType == "rsa" {
key, err := rsa.GenerateKey(rand.Reader, 2048)
if err != nil {
t.Fatalf("failed to generate RSA key: %v", err)
}
privKey = key
pubKey = &key.PublicKey
} else if keyType == "ecdsa" {
key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
if err != nil {
t.Fatalf("failed to generate ECDSA key: %v", err)
}
privKey = key
pubKey = &key.PublicKey
} else {
t.Fatalf("unsupported key type: %s", keyType)
}
template := &x509.Certificate{
SerialNumber: big.NewInt(1),
Subject: pkix.Name{
CommonName: cn,
},
NotBefore: time.Now(),
NotAfter: time.Now().Add(365 * 24 * time.Hour),
KeyUsage: x509.KeyUsageDigitalSignature,
ExtKeyUsage: []x509.ExtKeyUsage{
x509.ExtKeyUsageServerAuth,
},
DNSNames: []string{cn},
}
certDER, err := x509.CreateCertificate(rand.Reader, template, template, pubKey, privKey)
if err != nil {
t.Fatalf("failed to create certificate: %v", err)
}
cert, err := x509.ParseCertificate(certDER)
if err != nil {
t.Fatalf("failed to parse certificate: %v", err)
}
return cert, privKey
}
+6
View File
@@ -0,0 +1,6 @@
-- M48: Continuous TLS Health Monitoring - rollback
DROP TABLE IF EXISTS endpoint_health_history;
DROP TABLE IF EXISTS endpoint_health_checks;
ALTER TABLE network_scan_targets DROP COLUMN IF EXISTS health_check_enabled;
ALTER TABLE network_scan_targets DROP COLUMN IF EXISTS health_check_interval_seconds;
+55
View File
@@ -0,0 +1,55 @@
-- M48: Continuous TLS Health Monitoring
-- Add health check columns to network_scan_targets
ALTER TABLE network_scan_targets ADD COLUMN IF NOT EXISTS health_check_enabled BOOLEAN DEFAULT FALSE;
ALTER TABLE network_scan_targets ADD COLUMN IF NOT EXISTS health_check_interval_seconds INTEGER DEFAULT 300;
-- Endpoint health checks
CREATE TABLE IF NOT EXISTS endpoint_health_checks (
id TEXT PRIMARY KEY,
endpoint TEXT NOT NULL,
certificate_id TEXT REFERENCES managed_certificates(id),
network_scan_target_id TEXT REFERENCES network_scan_targets(id),
expected_fingerprint TEXT NOT NULL DEFAULT '',
observed_fingerprint TEXT NOT NULL DEFAULT '',
status TEXT NOT NULL DEFAULT 'unknown',
consecutive_failures INTEGER NOT NULL DEFAULT 0,
response_time_ms INTEGER NOT NULL DEFAULT 0,
tls_version TEXT NOT NULL DEFAULT '',
cipher_suite TEXT NOT NULL DEFAULT '',
cert_subject TEXT NOT NULL DEFAULT '',
cert_issuer TEXT NOT NULL DEFAULT '',
cert_expiry TIMESTAMPTZ,
last_checked_at TIMESTAMPTZ,
last_success_at TIMESTAMPTZ,
last_failure_at TIMESTAMPTZ,
last_transition_at TIMESTAMPTZ,
failure_reason TEXT NOT NULL DEFAULT '',
degraded_threshold INTEGER NOT NULL DEFAULT 2,
down_threshold INTEGER NOT NULL DEFAULT 5,
check_interval_seconds INTEGER NOT NULL DEFAULT 300,
enabled BOOLEAN NOT NULL DEFAULT TRUE,
acknowledged BOOLEAN NOT NULL DEFAULT FALSE,
acknowledged_by TEXT NOT NULL DEFAULT '',
acknowledged_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_health_checks_status ON endpoint_health_checks(status);
CREATE INDEX IF NOT EXISTS idx_health_checks_endpoint ON endpoint_health_checks(endpoint);
CREATE INDEX IF NOT EXISTS idx_health_checks_enabled ON endpoint_health_checks(enabled) WHERE enabled = true;
CREATE INDEX IF NOT EXISTS idx_health_checks_certificate ON endpoint_health_checks(certificate_id) WHERE certificate_id IS NOT NULL;
-- Endpoint health check history (per-probe records)
CREATE TABLE IF NOT EXISTS endpoint_health_history (
id TEXT PRIMARY KEY,
health_check_id TEXT NOT NULL REFERENCES endpoint_health_checks(id) ON DELETE CASCADE,
status TEXT NOT NULL,
response_time_ms INTEGER NOT NULL DEFAULT 0,
fingerprint TEXT NOT NULL DEFAULT '',
failure_reason TEXT NOT NULL DEFAULT '',
checked_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX IF NOT EXISTS idx_health_history_check_time ON endpoint_health_history(health_check_id, checked_at DESC);
+42
View File
@@ -90,6 +90,14 @@ import {
updateIssuer,
updateTarget,
getPolicy,
listHealthChecks,
getHealthCheck,
createHealthCheck,
updateHealthCheck,
deleteHealthCheck,
getHealthCheckHistory,
acknowledgeHealthCheck,
getHealthCheckSummary,
} from './client';
// Mock global fetch
@@ -1236,4 +1244,38 @@ describe('API Client', () => {
expect(mockFetch.mock.calls[0][0]).toBe('/api/v1/policies/pol-1');
});
});
describe('Health Checks (M48)', () => {
it('listHealthChecks sends GET with optional filters', async () => {
mockFetch.mockReturnValueOnce(mockJsonResponse({ data: [], total: 0, page: 1, per_page: 50 }));
const result = await listHealthChecks({ status: 'degraded' });
expect(result.total).toBe(0);
expect(mockFetch.mock.calls[0][0]).toContain('/api/v1/health-checks');
expect(mockFetch.mock.calls[0][0]).toContain('status=degraded');
});
it('getHealthCheck sends GET with health check ID', async () => {
mockFetch.mockReturnValueOnce(mockJsonResponse({ id: 'hc-1', endpoint: 'example.com:443' }));
const result = await getHealthCheck('hc-1');
expect(result.id).toBe('hc-1');
expect(mockFetch.mock.calls[0][0]).toBe('/api/v1/health-checks/hc-1');
});
it('createHealthCheck sends POST with data', async () => {
mockFetch.mockReturnValueOnce(mockJsonResponse({ id: 'hc-1', endpoint: 'example.com:443' }));
const result = await createHealthCheck({ endpoint: 'example.com:443' });
expect(result.id).toBe('hc-1');
const [url, init] = mockFetch.mock.calls[0];
expect(url).toContain('/api/v1/health-checks');
expect(init.method).toBe('POST');
});
it('getHealthCheckSummary sends GET to /health-checks/summary', async () => {
mockFetch.mockReturnValueOnce(mockJsonResponse({ healthy: 5, degraded: 1, down: 0, cert_mismatch: 0, unknown: 2, total: 8 }));
const result = await getHealthCheckSummary();
expect(result.healthy).toBe(5);
expect(result.total).toBe(8);
expect(mockFetch.mock.calls[0][0]).toBe('/api/v1/health-checks/summary');
});
});
});
+36 -1
View File
@@ -1,4 +1,4 @@
import type { Certificate, CertificateVersion, Agent, Job, Notification, AuditEvent, PolicyRule, PolicyViolation, Issuer, Target, CertificateProfile, Owner, Team, AgentGroup, PaginatedResponse, DashboardSummary, CertificateStatusCount, ExpirationBucket, JobTrendDataPoint, IssuanceRateDataPoint, MetricsResponse, DiscoveredCertificate, DiscoveryScan, DiscoverySummary, NetworkScanTarget } from './types';
import type { Certificate, CertificateVersion, Agent, Job, Notification, AuditEvent, PolicyRule, PolicyViolation, Issuer, Target, CertificateProfile, Owner, Team, AgentGroup, PaginatedResponse, DashboardSummary, CertificateStatusCount, ExpirationBucket, JobTrendDataPoint, IssuanceRateDataPoint, MetricsResponse, DiscoveredCertificate, DiscoveryScan, DiscoverySummary, NetworkScanTarget, EndpointHealthCheck, HealthHistoryEntry, HealthCheckSummary } from './types';
const BASE = '/api/v1';
@@ -432,3 +432,38 @@ export const getPrometheusMetrics = () => {
// Health
export const getHealth = () => fetchJSON<{ status: string }>('/health');
// Health checks (M48)
export const listHealthChecks = (params?: { status?: string; certificate_id?: string; enabled?: string; page?: number; per_page?: number }): Promise<PaginatedResponse<EndpointHealthCheck>> => {
const query = new URLSearchParams();
if (params?.status) query.set('status', params.status);
if (params?.certificate_id) query.set('certificate_id', params.certificate_id);
if (params?.enabled) query.set('enabled', params.enabled);
if (params?.page) query.set('page', String(params.page));
if (params?.per_page) query.set('per_page', String(params.per_page));
const qs = query.toString();
return fetchJSON<PaginatedResponse<EndpointHealthCheck>>(`${BASE}/health-checks${qs ? '?' + qs : ''}`);
};
export const getHealthCheck = (id: string) =>
fetchJSON<EndpointHealthCheck>(`${BASE}/health-checks/${id}`);
export const createHealthCheck = (data: Partial<EndpointHealthCheck>) =>
fetchJSON<EndpointHealthCheck>(`${BASE}/health-checks`, { method: 'POST', body: JSON.stringify(data) });
export const updateHealthCheck = (id: string, data: Partial<EndpointHealthCheck>) =>
fetchJSON<EndpointHealthCheck>(`${BASE}/health-checks/${id}`, { method: 'PUT', body: JSON.stringify(data) });
export const deleteHealthCheck = (id: string) =>
fetchJSON<void>(`${BASE}/health-checks/${id}`, { method: 'DELETE' });
export const getHealthCheckHistory = (id: string, limit?: number) => {
const query = limit ? `?limit=${limit}` : '';
return fetchJSON<HealthHistoryEntry[]>(`${BASE}/health-checks/${id}/history${query}`);
};
export const acknowledgeHealthCheck = (id: string) =>
fetchJSON<void>(`${BASE}/health-checks/${id}/acknowledge`, { method: 'POST', body: JSON.stringify({}) });
export const getHealthCheckSummary = () =>
fetchJSON<HealthCheckSummary>(`${BASE}/health-checks/summary`);
+51
View File
@@ -347,3 +347,54 @@ export interface MetricsResponse {
measured_at: string;
};
}
// Health check types (M48)
export interface EndpointHealthCheck {
id: string;
endpoint: string;
certificate_id?: string;
network_scan_target_id?: string;
expected_fingerprint: string;
observed_fingerprint: string;
status: string;
consecutive_failures: number;
response_time_ms: number;
tls_version: string;
cipher_suite: string;
cert_subject: string;
cert_issuer: string;
cert_expiry?: string;
last_checked_at?: string;
last_success_at?: string;
last_failure_at?: string;
last_transition_at?: string;
failure_reason: string;
degraded_threshold: number;
down_threshold: number;
check_interval_seconds: number;
enabled: boolean;
acknowledged: boolean;
acknowledged_by?: string;
acknowledged_at?: string;
created_at: string;
updated_at: string;
}
export interface HealthHistoryEntry {
id: string;
health_check_id: string;
status: string;
response_time_ms: number;
fingerprint: string;
failure_reason: string;
checked_at: string;
}
export interface HealthCheckSummary {
healthy: number;
degraded: number;
down: number;
cert_mismatch: number;
unknown: number;
total: number;
}
+1
View File
@@ -18,6 +18,7 @@ const nav = [
{ to: '/agent-groups', label: 'Agent Groups', icon: 'M19 11H5m14 0a2 2 0 012 2v6a2 2 0 01-2 2H5a2 2 0 01-2-2v-6a2 2 0 012-2m14 0V9a2 2 0 00-2-2M5 11V9a2 2 0 012-2m0 0V5a2 2 0 012-2h6a2 2 0 012 2v2M7 7h10 M9 3v2m6-2v2' },
{ to: '/discovery', label: 'Discovery', icon: 'M21 21l-6-6m2-5a7 7 0 11-14 0 7 7 0 0114 0z' },
{ to: '/network-scans', label: 'Network Scans', icon: 'M3.055 11H5a2 2 0 012 2v1a2 2 0 002 2 2 2 0 012 2v2.945M8 3.935V5.5A2.5 2.5 0 0010.5 8h.5a2 2 0 012 2 2 2 0 104 0 2 2 0 012-2h1.064M15 20.488V18a2 2 0 012-2h3.064M21 12a9 9 0 11-18 0 9 9 0 0118 0z M9 12l2 2 4-4' },
{ to: '/health-monitor', label: 'Health Monitor', icon: 'M4.318 6.318a4.5 4.5 0 000 6.364L12 20.364l7.682-7.682a4.5 4.5 0 00-6.364-6.364L12 7.636l-1.318-1.318a4.5 4.5 0 00-6.364 0z' },
{ to: '/short-lived', label: 'Short-Lived', icon: 'M13 10V3L4 14h7v7l9-11h-7z' },
{ to: '/digest', label: 'Digest', icon: 'M3 8l7.89 5.26a2 2 0 002.22 0L21 8M5 19h14a2 2 0 002-2V7a2 2 0 00-2-2H5a2 2 0 00-2 2v10a2 2 0 002 2z' },
{ to: '/observability', label: 'Observability', icon: 'M9 19v-6a2 2 0 00-2-2H5a2 2 0 00-2 2v6a2 2 0 002 2h2a2 2 0 002-2zm0 0V9a2 2 0 012-2h2a2 2 0 012 2v10m-6 0a2 2 0 002 2h2a2 2 0 002-2m0 0V5a2 2 0 012-2h2a2 2 0 012 2v14a2 2 0 01-2 2h-2a2 2 0 01-2-2z' },
+6
View File
@@ -31,6 +31,12 @@ const statusStyles: Record<string, string> = {
pending: 'badge-warning',
failed: 'badge-danger',
read: 'badge-neutral',
// Health check statuses
healthy: 'badge-success',
degraded: 'badge-warning',
down: 'badge-danger',
cert_mismatch: 'badge-warning',
unknown: 'badge-neutral',
};
export default function StatusBadge({ status }: { status: string }) {
+2
View File
@@ -25,6 +25,7 @@ import ShortLivedPage from './pages/ShortLivedPage';
import AgentFleetPage from './pages/AgentFleetPage';
import DiscoveryPage from './pages/DiscoveryPage';
import NetworkScanPage from './pages/NetworkScanPage';
import HealthMonitorPage from './pages/HealthMonitorPage';
import DigestPage from './pages/DigestPage';
import ObservabilityPage from './pages/ObservabilityPage';
import JobDetailPage from './pages/JobDetailPage';
@@ -73,6 +74,7 @@ createRoot(document.getElementById('root')!).render(
<Route path="short-lived" element={<ShortLivedPage />} />
<Route path="discovery" element={<DiscoveryPage />} />
<Route path="network-scans" element={<NetworkScanPage />} />
<Route path="health-monitor" element={<HealthMonitorPage />} />
<Route path="digest" element={<DigestPage />} />
<Route path="observability" element={<ObservabilityPage />} />
</Route>
+302
View File
@@ -0,0 +1,302 @@
import { useState } from 'react';
import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query';
import {
listHealthChecks,
createHealthCheck,
deleteHealthCheck,
acknowledgeHealthCheck,
getHealthCheckSummary,
} from '../api/client';
import PageHeader from '../components/PageHeader';
import DataTable from '../components/DataTable';
import type { Column } from '../components/DataTable';
import ErrorState from '../components/ErrorState';
import StatusBadge from '../components/StatusBadge';
import { formatDateTime } from '../api/utils';
import type { EndpointHealthCheck, HealthCheckSummary } from '../api/types';
function CreateHealthCheckModal({ onClose, onCreate }: {
onClose: () => void;
onCreate: (data: Partial<EndpointHealthCheck>) => void;
}) {
const [endpoint, setEndpoint] = useState('');
const [expectedFingerprint, setExpectedFingerprint] = useState('');
const [checkInterval, setCheckInterval] = useState('300');
const [degradedThreshold, setDegradedThreshold] = useState('2');
const [downThreshold, setDownThreshold] = useState('5');
const handleSubmit = () => {
onCreate({
endpoint,
expected_fingerprint: expectedFingerprint,
check_interval_seconds: parseInt(checkInterval, 10),
degraded_threshold: parseInt(degradedThreshold, 10),
down_threshold: parseInt(downThreshold, 10),
enabled: true,
});
};
return (
<div className="fixed inset-0 bg-black/50 flex items-center justify-center z-50" onClick={onClose}>
<div className="bg-white rounded-lg shadow-xl w-full max-w-lg mx-4" onClick={e => e.stopPropagation()}>
<div className="px-6 py-4 border-b border-surface-border">
<h3 className="text-lg font-semibold text-ink">New Health Check</h3>
<p className="text-sm text-ink-muted mt-1">Monitor a TLS endpoint for certificate health</p>
</div>
<div className="px-6 py-4 space-y-4">
<div>
<label className="block text-sm font-medium text-ink mb-1">Endpoint <span className="text-red-500">*</span></label>
<input
type="text"
value={endpoint}
onChange={e => setEndpoint(e.target.value)}
placeholder="e.g., example.com:443"
className="w-full border border-surface-border rounded px-3 py-2 text-sm text-ink bg-white focus:outline-none focus:ring-2 focus:ring-brand-500"
/>
</div>
<div>
<label className="block text-sm font-medium text-ink mb-1">Expected Fingerprint (SHA-256)</label>
<input
type="text"
value={expectedFingerprint}
onChange={e => setExpectedFingerprint(e.target.value)}
placeholder="Optional: auto-populated from deployment"
className="w-full border border-surface-border rounded px-3 py-2 text-sm text-ink bg-white font-mono focus:outline-none focus:ring-2 focus:ring-brand-500"
/>
<p className="text-xs text-ink-faint mt-1">Leave empty to auto-detect from first successful probe</p>
</div>
<div className="grid grid-cols-3 gap-3">
<div>
<label className="block text-sm font-medium text-ink mb-1">Check Interval (s)</label>
<input
type="number"
value={checkInterval}
onChange={e => setCheckInterval(e.target.value)}
min="60"
className="w-full border border-surface-border rounded px-3 py-2 text-sm text-ink bg-white focus:outline-none focus:ring-2 focus:ring-brand-500"
/>
</div>
<div>
<label className="block text-sm font-medium text-ink mb-1">Degraded Threshold</label>
<input
type="number"
value={degradedThreshold}
onChange={e => setDegradedThreshold(e.target.value)}
min="1"
className="w-full border border-surface-border rounded px-3 py-2 text-sm text-ink bg-white focus:outline-none focus:ring-2 focus:ring-brand-500"
/>
</div>
<div>
<label className="block text-sm font-medium text-ink mb-1">Down Threshold</label>
<input
type="number"
value={downThreshold}
onChange={e => setDownThreshold(e.target.value)}
min="1"
className="w-full border border-surface-border rounded px-3 py-2 text-sm text-ink bg-white focus:outline-none focus:ring-2 focus:ring-brand-500"
/>
</div>
</div>
</div>
<div className="px-6 py-3 border-t border-surface-border flex justify-end gap-2">
<button onClick={onClose} className="px-4 py-2 text-sm text-ink-muted hover:text-ink rounded border border-surface-border">
Cancel
</button>
<button
onClick={handleSubmit}
disabled={!endpoint.trim()}
className="px-4 py-2 text-sm text-white bg-brand-600 hover:bg-brand-700 rounded disabled:opacity-50 disabled:cursor-not-allowed"
>
Create
</button>
</div>
</div>
</div>
);
}
function SummaryBar({ summary }: { summary: HealthCheckSummary }) {
const items = [
{ label: 'Healthy', count: summary.healthy, color: 'text-green-600' },
{ label: 'Degraded', count: summary.degraded, color: 'text-yellow-600' },
{ label: 'Down', count: summary.down, color: 'text-red-600' },
{ label: 'Cert Mismatch', count: summary.cert_mismatch, color: 'text-orange-600' },
{ label: 'Unknown', count: summary.unknown, color: 'text-gray-500' },
];
return (
<div className="grid grid-cols-5 gap-3 px-6 py-4 bg-white border-b border-surface-border">
{items.map(item => (
<div key={item.label} className="text-center">
<p className={`text-2xl font-bold ${item.color}`}>{item.count}</p>
<p className="text-xs text-ink-muted mt-1">{item.label}</p>
</div>
))}
</div>
);
}
export default function HealthMonitorPage() {
const [showCreate, setShowCreate] = useState(false);
const [statusFilter, setStatusFilter] = useState<string | undefined>();
const queryClient = useQueryClient();
const { data, isLoading, error, refetch } = useQuery({
queryKey: ['health-checks', statusFilter],
queryFn: () => listHealthChecks({ status: statusFilter, page: 1, per_page: 100 }),
refetchInterval: 30000,
});
const summaryQuery = useQuery({
queryKey: ['health-checks-summary'],
queryFn: () => getHealthCheckSummary(),
refetchInterval: 30000,
});
const createMutation = useMutation({
mutationFn: createHealthCheck,
onSuccess: () => {
queryClient.invalidateQueries({ queryKey: ['health-checks'] });
queryClient.invalidateQueries({ queryKey: ['health-checks-summary'] });
setShowCreate(false);
},
});
const deleteMutation = useMutation({
mutationFn: deleteHealthCheck,
onSuccess: () => {
queryClient.invalidateQueries({ queryKey: ['health-checks'] });
queryClient.invalidateQueries({ queryKey: ['health-checks-summary'] });
},
});
const acknowledgeMutation = useMutation({
mutationFn: acknowledgeHealthCheck,
onSuccess: () => {
queryClient.invalidateQueries({ queryKey: ['health-checks'] });
queryClient.invalidateQueries({ queryKey: ['health-checks-summary'] });
},
});
const columns: Column<EndpointHealthCheck>[] = [
{
key: 'endpoint',
label: 'Endpoint',
render: (row) => row.endpoint,
},
{
key: 'status',
label: 'Status',
render: (row) => <StatusBadge status={row.status} />,
},
{
key: 'response_time_ms',
label: 'Response Time (ms)',
render: (row) => row.response_time_ms ? `${row.response_time_ms}ms` : '—',
},
{
key: 'last_checked_at',
label: 'Last Checked',
render: (row) => row.last_checked_at ? formatDateTime(row.last_checked_at) : '—',
},
{
key: 'last_transition_at',
label: 'Last Transition',
render: (row) => row.last_transition_at ? formatDateTime(row.last_transition_at) : '—',
},
{
key: 'acknowledged',
label: 'Acknowledged',
render: (row) => row.acknowledged ? '✓' : '—',
},
{
key: 'actions',
label: 'Actions',
render: (row) => (
<div className="flex gap-2">
{!row.acknowledged && row.status !== 'healthy' && (
<button
onClick={() => acknowledgeMutation.mutate(row.id)}
className="text-xs px-2 py-1 text-blue-600 hover:text-blue-700 font-medium"
disabled={acknowledgeMutation.isPending}
>
Acknowledge
</button>
)}
<button
onClick={() => deleteMutation.mutate(row.id)}
className="text-xs px-2 py-1 text-red-600 hover:text-red-700 font-medium"
disabled={deleteMutation.isPending}
>
Delete
</button>
</div>
),
},
];
if (error) {
return <ErrorState error={error as Error} onRetry={refetch} />;
}
return (
<div className="flex flex-col overflow-hidden">
<PageHeader
title="Health Monitor"
subtitle="Monitor TLS endpoints for certificate health and deployment success"
/>
{summaryQuery.data && <SummaryBar summary={summaryQuery.data} />}
<div className="flex-1 flex flex-col overflow-hidden bg-white m-6 rounded-lg shadow">
<div className="px-6 py-4 border-b border-surface-border flex items-center justify-between">
<div className="flex items-center gap-4">
<select
value={statusFilter || ''}
onChange={e => setStatusFilter(e.target.value || undefined)}
className="text-sm border border-surface-border rounded px-3 py-2 text-ink bg-white focus:outline-none focus:ring-2 focus:ring-brand-500"
>
<option value="">All Statuses</option>
<option value="healthy">Healthy</option>
<option value="degraded">Degraded</option>
<option value="down">Down</option>
<option value="cert_mismatch">Cert Mismatch</option>
<option value="unknown">Unknown</option>
</select>
</div>
<button
onClick={() => setShowCreate(true)}
className="px-4 py-2 text-sm text-white bg-brand-600 hover:bg-brand-700 rounded"
>
New Health Check
</button>
</div>
<div className="flex-1 overflow-auto">
{isLoading ? (
<div className="flex items-center justify-center h-full">
<span className="text-ink-muted">Loading health checks...</span>
</div>
) : data && data.data.length > 0 ? (
<DataTable<EndpointHealthCheck>
columns={columns}
data={data.data}
keyField="id"
/>
) : (
<div className="flex items-center justify-center h-full">
<span className="text-ink-muted">No health checks configured</span>
</div>
)}
</div>
</div>
{showCreate && (
<CreateHealthCheckModal
onClose={() => setShowCreate(false)}
onCreate={data => createMutation.mutate(data)}
/>
)}
</div>
);
}