mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 17:22:07 +00:00
vault: add automatic token renewal at TTL/2 + Prometheus metric
Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the VaultPKI adapter authenticated with a static token and never called renew-self. Long-lived deploys hit token expiry; the first operator-visible signal was failed cert renewals on production targets.

This commit:

1. Connector.Start(ctx) spawns a goroutine that calls POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a one-shot lookup-self at startup). Honours ctx.Done() for graceful shutdown via a per-loop done channel + Stop().
2. On `renewable: false` response (initial lookup OR any subsequent renewal), the loop emits a WARN, increments the not_renewable counter, and exits. The operator must rotate the token before Vault's Max TTL elapses.
3. New Prometheus counter certctl_vault_token_renewals_total with labels result={success,failure,not_renewable}. Registered alongside existing certctl_issuance_* counters in internal/api/handler/metrics.go.
4. ERROR-level logging on renewal failure with operator-actionable substring ("vault token renewal failed; rotate the token before TTL expires") so journalctl + grep find it. Loop keeps ticking after a failure; transient blips don't kill it.

New optional issuer.Lifecycle interface:

    type Lifecycle interface {
        Start(ctx context.Context) error
        Stop()
    }

Connectors that hold no background goroutines (almost all of them) do not implement this; IssuerRegistry.StartLifecycles / StopLifecycles feature-detect via type assertion. New lifecycle-bearing connectors plug in by implementing the interface; no further registry plumbing required.

Wiring (cmd/server/main.go):

- service.NewVaultRenewalMetrics() instance is shared between issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built by Rebuild get a recorder) and metricsHandler.SetVaultRenewals (so the Prometheus exposer emits the new series).
- issuerRegistry.StartLifecycles(ctx) is called after issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles is paired so goroutines exit cleanly on signal.
- IssuerConnectorAdapter.Underlying() exposes the wrapped issuer.Connector so registry-level machinery can reach the concrete connector behind the adapter without duplicating the wiring at every call site.

Tests (internal/connector/issuer/vault/vault_renew_test.go):

- TestVault_RenewLoop_TickAtHalfTTL: three ticks → three renewals, all "success".
- TestVault_RenewLoop_StopsOnNotRenewable: second renewal returns renewable=false, loop exits, third tick fires no HTTP call.
- TestVault_RenewLoop_FailureSurfacesViaMetric: first renewal 403 bumps "failure", second renewal succeeds → loop kept ticking.
- TestVault_RenewLoop_CtxCancellation_StopsCleanly: Stop returns within 200ms after ctx cancel.
- TestVault_RenewLoop_StartsNothingWhenNotRenewable: token already non-renewable at boot ⇒ no goroutine, "not_renewable" metric increments at startup so operators see it in Grafana.
- TestVault_ComputeInterval: 4 cases pinning TTL/2 + minRenewInterval floor.
- TestVault_RenewSelf_ParseFailure_NamesActionableInError: surfaced error contains "vault token renewal failed" + "rotate the token".

Cadence is dynamic: every successful renewal re-derives TTL/2 from the renewed lease's lease_duration, so a short bootstrap token that gets renewed up to a longer Max TTL shifts to the longer cadence automatically (defends against degenerate fast ticking on a token whose Max TTL is far longer than its initial TTL).

Documentation:

- docs/connectors.md Vault PKI section gains "Token TTL + automatic renewal" subsection (operator-facing: cadence, metric, renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):

- AppRole / Kubernetes / AWS IAM auth methods (different renewal semantics).
- Hot-reload of rotated token from disk (operator restarts today; future: GUI/MCP issuer-update path triggers Rebuild which Stops the old connector and Starts the new one).
- Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md itself: "no longer maintains a hand-edited per-version changelog; per-release notes are auto-generated from commit messages between consecutive tags").

Verified locally:

- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/... ./internal/connector/issuer/vault/... ./cmd/server/... clean.
- go test -short -count=1 ./internal/connector/issuer/vault/... ./internal/service/... ./internal/api/handler/... green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval' ./internal/connector/issuer/vault/... green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md Top-10 fix #5.
@@ -60,6 +60,28 @@ type DeployCounterSnapshotter interface {
// reverse import would create a cycle. The exposer below takes the
// types via the interface defined in service.

// VaultRenewalSnapshotter is the surface MetricsHandler consumes
// to emit the certctl_vault_token_renewals_total{result=...}
// counter. *service.VaultRenewalMetrics satisfies this; cmd/server
// passes the same instance into IssuerRegistry.SetVaultRenewalMetrics
// (so Vault connectors record results) AND into
// MetricsHandler.SetVaultRenewals (so the Prometheus exposer reads
// the counters).
//
// Returns three counter values directly (rather than a shared struct
// type) so service can satisfy this without an import cycle:
// handler already imports service for IssuanceMetricsSnapshotter,
// but service does not import handler. A method that returns
// (uint64, uint64, uint64) needs no shared type.
//
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit.
type VaultRenewalSnapshotter interface {
	// SnapshotVaultRenewals returns success, failure, and
	// not_renewable counters as point-in-time reads. Order is fixed
	// for the exposer; it matches the Prometheus label order.
	SnapshotVaultRenewals() (success, failure, notRenewable uint64)
}

// MetricsHandler handles HTTP requests for metrics.
// Supports both JSON format (GET /api/v1/metrics) and Prometheus exposition format
// (GET /api/v1/metrics/prometheus) for integration with Prometheus, Grafana, Datadog, etc.
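
For orientation, here is a minimal sketch of a type that would satisfy the VaultRenewalSnapshotter interface above. The real service.VaultRenewalMetrics is not shown in this diff, so renewalCounters and its Record method are hypothetical; the sketch assumes Go 1.19 or later for atomic.Uint64.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// VaultRenewalSnapshotter mirrors the interface added in this hunk.
type VaultRenewalSnapshotter interface {
	SnapshotVaultRenewals() (success, failure, notRenewable uint64)
}

// renewalCounters is a hypothetical recorder. Atomics let a renewal
// goroutine write while the HTTP handler reads, without a mutex.
type renewalCounters struct {
	success, failure, notRenewable atomic.Uint64
}

// Record bumps the counter for one result label. Unknown labels are
// ignored so the exposed enum stays closed at three series.
func (c *renewalCounters) Record(result string) {
	switch result {
	case "success":
		c.success.Add(1)
	case "failure":
		c.failure.Add(1)
	case "not_renewable":
		c.notRenewable.Add(1)
	}
}

// SnapshotVaultRenewals returns point-in-time reads in the fixed order
// success, failure, not_renewable.
func (c *renewalCounters) SnapshotVaultRenewals() (success, failure, notRenewable uint64) {
	return c.success.Load(), c.failure.Load(), c.notRenewable.Load()
}

// Compile-time check that the sketch satisfies the interface.
var _ VaultRenewalSnapshotter = (*renewalCounters)(nil)

func main() {
	c := &renewalCounters{}
	c.Record("success")
	c.Record("success")
	c.Record("failure")
	s, f, n := c.SnapshotVaultRenewals()
	fmt.Println(s, f, n)
}
```

Because the method returns three plain uint64 values, the implementing package needs no type from the handler package, which is the import-cycle point the comment makes.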
@@ -79,6 +101,10 @@ type MetricsHandler struct {
	// imports service for admin_est.go etc., so service can't import
	// handler back).
	issuanceCounters service.IssuanceMetricsSnapshotter
	// Vault PKI token-renewal counters. Top-10 fix #5 of the
	// 2026-05-03 issuer-coverage audit. nil disables emission of
	// certctl_vault_token_renewals_total{result=...}.
	vaultRenewals VaultRenewalSnapshotter
}

// NewMetricsHandler creates a new MetricsHandler with a service dependency.
@@ -112,6 +138,13 @@ func (h *MetricsHandler) SetIssuanceCounters(c service.IssuanceMetricsSnapshotte
	h.issuanceCounters = c
}

// SetVaultRenewals wires the Vault PKI token-renewal counter table
// for the Prometheus exposition. nil disables the block. Closes
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit.
func (h *MetricsHandler) SetVaultRenewals(c VaultRenewalSnapshotter) {
	h.vaultRenewals = c
}

// MetricsResponse represents the JSON metrics response for V2.
type MetricsResponse struct {
	Gauge MetricsGauge `json:"gauge"`
@@ -424,6 +457,20 @@ func (h MetricsHandler) GetPrometheusMetrics(w http.ResponseWriter, r *http.Requ
			fmt.Fprintf(w, "certctl_issuance_failures_total{issuer_type=%q,error_class=%q} %d\n", f.IssuerType, f.ErrorClass, f.Count)
		}
	}

	// Vault PKI token-renewal counters. Top-10 fix #5 of the
	// 2026-05-03 issuer-coverage audit. Operators alert on
	// certctl_vault_token_renewals_total{result="failure"} > 0 or
	// {result="not_renewable"} > 0 to catch token expiry before
	// issuance breaks. Closed enum: 3 series.
	if h.vaultRenewals != nil {
		success, failure, notRenewable := h.vaultRenewals.SnapshotVaultRenewals()
		fmt.Fprintf(w, "\n# HELP certctl_vault_token_renewals_total Vault PKI token renew-self results. result is a closed enum: success, failure, not_renewable.\n")
		fmt.Fprintf(w, "# TYPE certctl_vault_token_renewals_total counter\n")
		fmt.Fprintf(w, "certctl_vault_token_renewals_total{result=%q} %d\n", "success", success)
		fmt.Fprintf(w, "certctl_vault_token_renewals_total{result=%q} %d\n", "failure", failure)
		fmt.Fprintf(w, "certctl_vault_token_renewals_total{result=%q} %d\n", "not_renewable", notRenewable)
	}
}

// formatLE formats a histogram bucket boundary the way Prometheus