mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-08 20:28:52 +00:00
vault: add automatic token renewal at TTL/2 + Prometheus metric

Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md).

Pre-fix, the VaultPKI adapter authenticated with a static token and
never called renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production targets.

This commit:

1. Connector.Start(ctx) spawns a goroutine that calls
   POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
   one-shot lookup-self at startup). Honours ctx.Done() for graceful
   shutdown via a per-loop done channel + Stop().

2. On `renewable: false` response (initial lookup OR any subsequent
   renewal), the loop emits a WARN, increments the not_renewable
   counter, and exits. The operator must rotate the token before
   Vault's Max TTL elapses.

3. New Prometheus counter certctl_vault_token_renewals_total with
   labels result={success,failure,not_renewable}. Registered alongside
   existing certctl_issuance_* counters in
   internal/api/handler/metrics.go.

4. ERROR-level logging on renewal failure with operator-actionable
   substring ("vault token renewal failed; rotate the token before TTL
   expires") so journalctl + grep find it. Loop keeps ticking after a
   failure; transient blips don't kill it.

New optional issuer.Lifecycle interface:

    type Lifecycle interface {
        Start(ctx context.Context) error
        Stop()
    }

Connectors that hold no background goroutines (almost all of them) do
not implement this; IssuerRegistry.StartLifecycles / StopLifecycles
feature-detect via type assertion. New lifecycle-bearing connectors
plug in by implementing the interface; no further registry plumbing
required.

Wiring (cmd/server/main.go):

- service.NewVaultRenewalMetrics() instance is shared between
  issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built by
  Rebuild get a recorder) and metricsHandler.SetVaultRenewals (so the
  Prometheus exposer emits the new series).
- issuerRegistry.StartLifecycles(ctx) is called after
  issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles is
  paired so goroutines exit cleanly on signal.
- IssuerConnectorAdapter.Underlying() exposes the wrapped
  issuer.Connector so registry-level machinery can reach the concrete
  connector behind the adapter without duplicating the wiring at every
  call site.

Tests (internal/connector/issuer/vault/vault_renew_test.go):

- TestVault_RenewLoop_TickAtHalfTTL: three ticks → three renewals, all
  "success".
- TestVault_RenewLoop_StopsOnNotRenewable: second renewal returns
  renewable=false, loop exits, third tick fires no HTTP call.
- TestVault_RenewLoop_FailureSurfacesViaMetric: first renewal 403 bumps
  "failure", second renewal succeeds → loop kept ticking.
- TestVault_RenewLoop_CtxCancellation_StopsCleanly: Stop returns within
  200ms after ctx cancel.
- TestVault_RenewLoop_StartsNothingWhenNotRenewable: token already
  non-renewable at boot ⇒ no goroutine, "not_renewable" metric
  increments at startup so operators see it in Grafana.
- TestVault_ComputeInterval: 4 cases pinning TTL/2 + minRenewInterval
  floor.
- TestVault_RenewSelf_ParseFailure_NamesActionableInError: surfaced
  error contains "vault token renewal failed" + "rotate the token".

Cadence is dynamic: every successful renewal re-derives TTL/2 from the
renewed lease's lease_duration, so a short bootstrap token that gets
renewed up to a longer Max TTL shifts to the longer cadence
automatically (defends against degenerate fast ticking on a token whose
Max TTL is far longer than its initial TTL).

Documentation:

- docs/connectors.md Vault PKI section gains "Token TTL + automatic
  renewal" subsection (operator-facing: cadence, metric,
  renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):

- AppRole / Kubernetes / AWS IAM auth methods (different renewal
  semantics).
- Hot-reload of rotated token from disk (operator restarts today;
  future: GUI/MCP issuer-update path triggers Rebuild which Stops the
  old connector and Starts the new one).
- Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md itself:
"no longer maintains a hand-edited per-version changelog; per-release
notes are auto-generated from commit messages between consecutive
tags").

Verified locally:

- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/... clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/... green.
- go test -race -count=10 -run
  'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/... green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
@@ -40,6 +40,18 @@ func (a *IssuerConnectorAdapter) SetMetrics(issuerType string, metrics *Issuance
	a.metrics = metrics
}

// Underlying returns the wrapped issuer.Connector so registry-level
// machinery (StartLifecycles / StopLifecycles, Bundle G audit-row
// pairing, future feature-detect interfaces) can reach the concrete
// connector behind the adapter without duplicating the wiring at
// every call site. Returns interface{} rather than issuer.Connector
// so callers do their own type assertion against optional extension
// interfaces (issuer.Lifecycle, etc.) without an import dependency
// fan-out from this package.
func (a *IssuerConnectorAdapter) Underlying() interface{} {
	return a.connector
}

// recordIssuance is the metrics-recording side effect at the adapter
// boundary. Bumps the issuance counter (success/failure) and the
// duration histogram; on failure also bumps the failure-by-error-class

@@ -8,8 +8,10 @@ import (
	"sync"
	"time"

	"github.com/shankar0123/certctl/internal/connector/issuer"
	"github.com/shankar0123/certctl/internal/connector/issuer/acme"
	"github.com/shankar0123/certctl/internal/connector/issuer/local"
	"github.com/shankar0123/certctl/internal/connector/issuer/vault"
	"github.com/shankar0123/certctl/internal/connector/issuerfactory"
	"github.com/shankar0123/certctl/internal/crypto"
	"github.com/shankar0123/certctl/internal/crypto/signer"
@@ -47,6 +49,14 @@ type IssuerRegistry struct {
	// Nil leaves the legacy "ACME revocation by serial requires
	// CertificateLookup wiring" error in place for old wiring paths.
	acmeCertLookup acme.CertificateLookupRepo

	// vaultRenewalMetrics: when set, every freshly-constructed
	// *vault.Connector is wired with SetRenewalRecorder so the
	// renew-self loop bumps the certctl_vault_token_renewals_total
	// counter. Closes Top-10 fix #5 of the 2026-05-03 audit. Nil
	// leaves the no-op recorder in place (no metric emission, but
	// the loop still runs).
	vaultRenewalMetrics *VaultRenewalMetrics
}

// LocalIssuerDeps groups the optional dependencies that the local
@@ -92,6 +102,23 @@ func (r *IssuerRegistry) SetIssuanceMetrics(m *IssuanceMetrics) {
	r.metrics = m
}

// SetVaultRenewalMetrics wires the per-(result) counter table for
// the Vault PKI renew-self loop. Every *vault.Connector constructed
// by Rebuild after this call records its renewal results into the
// supplied metrics. Closes Top-10 fix #5 of the 2026-05-03
// issuer-coverage audit.
//
// The same instance must also be registered with the metrics
// handler via MetricsHandler.SetVaultRenewals so the Prometheus
// exposer emits certctl_vault_token_renewals_total{result=...}.
// cmd/server/main.go owns both wiring sides; tests usually skip
// the Prometheus side and just assert against the snapshot.
func (r *IssuerRegistry) SetVaultRenewalMetrics(m *VaultRenewalMetrics) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.vaultRenewalMetrics = m
}

// SetACMECertLookup wires the cert-version lookup repo for every
// *acme.Connector constructed by Rebuild. The lookup is used by the
// serial-only revoke path (RevokeCertificate) to recover the leaf-
@@ -228,6 +255,19 @@ func (r *IssuerRegistry) Rebuild(ctx context.Context, configs []*domain.Issuer,
				"id", cfg.ID)
		}

		// Top-10 fix #5 (2026-05-03 audit): wire the renew-self
		// metric recorder into every freshly-constructed
		// *vault.Connector so its background renewal loop bumps the
		// certctl_vault_token_renewals_total counter. Lifecycle
		// startup itself is gated by StartLifecycles below; Rebuild
		// only does the metric wire here so the recorder is in place
		// when StartLifecycles fires.
		if vaultConn, ok := connector.(*vault.Connector); ok && r.vaultRenewalMetrics != nil {
			vaultConn.SetRenewalRecorder(r.vaultRenewalMetrics)
			r.logger.Info("Vault PKI issuer wired with renew-self metric recorder",
				"id", cfg.ID)
		}

		adapter := NewIssuerConnectorAdapter(connector)
		// Wire per-issuer-type metrics (audit fix #4) when SetIssuanceMetrics
		// was called. The adapter is the IssuerConnector interface; type-
@@ -273,3 +313,93 @@ func (r *IssuerRegistry) Rebuild(ctx context.Context, configs []*domain.Issuer,

	return nil
}

// StartLifecycles iterates the registry and calls Start(ctx) on every
// connector that implements the optional issuer.Lifecycle extension
// interface. Connectors without lifecycle work (almost all of them)
// are silently skipped.
//
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit. Today only
// VaultPKI implements Lifecycle (for its renew-self loop). New
// lifecycle-bearing connectors plug in by implementing the
// interface; this method picks them up automatically.
//
// Per-connector Start failures are LOGGED, not returned, so a single
// misconfigured Vault doesn't block server startup. Operators see
// the failure in the slog stream and via the
// certctl_vault_token_renewals_total{result="not_renewable"} or
// {result="failure"} counter.
//
// The IssuerConnectorAdapter wraps the raw connector; we type-assert
// against IssuerConnectorWithUnderlying to reach the underlying
// connector. If the adapter shape changes, this assertion silently
// no-ops and lifecycle wiring stops working; this is covered by
// TestRegistry_StartLifecycles_VaultStarted.
func (r *IssuerRegistry) StartLifecycles(ctx context.Context) {
	r.mu.RLock()
	conns := make(map[string]IssuerConnector, len(r.issuers))
	for id, c := range r.issuers {
		conns[id] = c
	}
	r.mu.RUnlock()

	for id, c := range conns {
		raw := unwrapAdapter(c)
		if raw == nil {
			continue
		}
		lc, ok := raw.(issuer.Lifecycle)
		if !ok {
			continue
		}
		if err := lc.Start(ctx); err != nil {
			r.logger.Warn("issuer lifecycle Start failed",
				"id", id,
				"error", err,
			)
			continue
		}
		r.logger.Info("issuer lifecycle Start succeeded", "id", id)
	}
}

// StopLifecycles iterates the registry and calls Stop() on every
// connector that implements the optional issuer.Lifecycle extension
// interface. Each Stop blocks until the connector's background work
// has fully exited; the loop is sequential rather than parallel so
// shutdown ordering is deterministic in operator logs.
//
// Idempotent. Safe to call after StartLifecycles failed or wasn't
// called.
func (r *IssuerRegistry) StopLifecycles() {
	r.mu.RLock()
	conns := make([]IssuerConnector, 0, len(r.issuers))
	for _, c := range r.issuers {
		conns = append(conns, c)
	}
	r.mu.RUnlock()

	for _, c := range conns {
		raw := unwrapAdapter(c)
		if raw == nil {
			continue
		}
		if lc, ok := raw.(issuer.Lifecycle); ok {
			lc.Stop()
		}
	}
}

// unwrapAdapter returns the underlying issuer.Connector held by an
// IssuerConnectorAdapter, or by any other wrapper that exposes an
// Underlying() interface{} method. If the registry held a raw
// connector directly (test wiring), returns it as-is. The result is
// nil only when c itself is nil or a wrapper's Underlying() returns
// nil; callers nil-check defensively.
func unwrapAdapter(c IssuerConnector) interface{} {
	if a, ok := c.(*IssuerConnectorAdapter); ok {
		return a.Underlying()
	}
	if u, ok := c.(interface{ Underlying() interface{} }); ok {
		return u.Underlying()
	}
	return c
}

@@ -0,0 +1,96 @@
package service

import "sync/atomic"

// VaultRenewalMetrics is a thread-safe counter table for the
// Vault PKI token-renewal loop. Top-10 fix #5 of the 2026-05-03
// issuer-coverage audit. Closes the operator-observability gap
// where long-lived deploys would silently lose Vault auth at TTL
// expiry.
//
// Cardinality is fixed at three series; result is a closed enum:
//
//	{success}        the renew-self call succeeded.
//	{failure}        the renew-self call returned a non-2xx,
//	                 parse failure, or HTTP error. Loop keeps
//	                 ticking; transient blips don't kill it.
//	{not_renewable}  Vault returned renewable=false (or returned
//	                 it at startup lookup-self). Loop has exited;
//	                 operator must rotate the token before its
//	                 current TTL expires.
//
// One instance is shared across every Vault PKI Connector built by
// IssuerRegistry.Rebuild; the recorder pointer is wired by
// IssuerRegistry.SetVaultRenewalMetrics + the post-factory wiring
// step inside Rebuild. The same instance is also wired into
// MetricsHandler.SetVaultRenewals so the Prometheus exposer emits
// certctl_vault_token_renewals_total{result=...}.
type VaultRenewalMetrics struct {
	success      atomic.Uint64
	failure      atomic.Uint64
	notRenewable atomic.Uint64
}

// NewVaultRenewalMetrics constructs a fresh VaultRenewalMetrics
// with all counters at zero. Pass to IssuerRegistry.SetVaultRenewalMetrics
// (and to MetricsHandler.SetVaultRenewals) to wire up the renewal
// loop's metric path.
func NewVaultRenewalMetrics() *VaultRenewalMetrics {
	return &VaultRenewalMetrics{}
}

// RecordRenewal bumps the (result) counter. Implements
// vault.RenewalRecorder. Off-enum result values silently no-op
// (closed-enum discipline matches the IssuanceMetrics pattern;
// we don't dynamically grow the cardinality on a typo).
func (m *VaultRenewalMetrics) RecordRenewal(result string) {
	if m == nil {
		return
	}
	switch result {
	case "success":
		m.success.Add(1)
	case "failure":
		m.failure.Add(1)
	case "not_renewable":
		m.notRenewable.Add(1)
	}
}

// VaultRenewalSnapshot is the per-result counter view returned by
// Snapshot. Pinned in this package so the handler can consume it
// via VaultRenewalSnapshotter without cross-importing connector
// state. Field names are stable; operator dashboards alert on
// the corresponding {result=...} label values.
type VaultRenewalSnapshot struct {
	Success      uint64
	Failure      uint64
	NotRenewable uint64
}

// Snapshot returns a point-in-time read of all three counters.
// Used by tests that need to assert post-tick state. The
// Prometheus exposer in internal/api/handler/metrics.go uses
// SnapshotVaultRenewals (3-tuple form) instead, to avoid an
// import cycle on a shared struct type.
func (m *VaultRenewalMetrics) Snapshot() VaultRenewalSnapshot {
	if m == nil {
		return VaultRenewalSnapshot{}
	}
	return VaultRenewalSnapshot{
		Success:      m.success.Load(),
		Failure:      m.failure.Load(),
		NotRenewable: m.notRenewable.Load(),
	}
}

// SnapshotVaultRenewals returns the three counter values directly
// as a tuple. Implements handler.VaultRenewalSnapshotter; used by
// the Prometheus exposer. Order is fixed: success, failure,
// not_renewable.
func (m *VaultRenewalMetrics) SnapshotVaultRenewals() (success, failure, notRenewable uint64) {
	if m == nil {
		return 0, 0, 0
	}
	return m.success.Load(), m.failure.Load(), m.notRenewable.Load()
}
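The exposer side in internal/api/handler/metrics.go is not shown on this page. As a rough sketch only, a consumer of the 3-tuple snapshot could render the three series in Prometheus text exposition format as below. `snapshotter`, `renewalCounters` and `render` are hypothetical names, and the real handler may emit HELP lines or order things differently.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// snapshotter mirrors the 3-tuple method shape described in the diff
// (hypothetical local interface, not the handler's real one).
type snapshotter interface {
	SnapshotVaultRenewals() (success, failure, notRenewable uint64)
}

// renewalCounters is a cut-down stand-in for VaultRenewalMetrics.
type renewalCounters struct {
	success, failure, notRenewable atomic.Uint64
}

func (m *renewalCounters) SnapshotVaultRenewals() (uint64, uint64, uint64) {
	return m.success.Load(), m.failure.Load(), m.notRenewable.Load()
}

const metricName = "certctl_vault_token_renewals_total"

// render writes one TYPE line and one sample per result label, always
// in the fixed order success, failure, not_renewable.
func render(s snapshotter) string {
	ok, fail, nr := s.SnapshotVaultRenewals()
	var sb strings.Builder
	fmt.Fprintf(&sb, "# TYPE %s counter\n", metricName)
	rows := []struct {
		result string
		value  uint64
	}{{"success", ok}, {"failure", fail}, {"not_renewable", nr}}
	for _, r := range rows {
		fmt.Fprintf(&sb, "%s{result=%q} %d\n", metricName, r.result, r.value)
	}
	return sb.String()
}

func main() {
	m := &renewalCounters{}
	m.success.Add(2)
	m.notRenewable.Add(1)
	fmt.Print(render(m))
}
```

Emitting all three series even at zero keeps the label set fixed, which matches the closed-enum cardinality described in the VaultRenewalMetrics comment.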