certctl/internal/service/vault_renewal_metrics.go
shankar0123 0792271dc6 vault: add automatic token renewal at TTL/2 + Prometheus metric
Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). Honours ctx.Done() for
     graceful shutdown via a per-loop done channel + Stop().
  2. On `renewable: false` response (initial lookup OR any subsequent
     renewal), the loop emits a WARN, increments the not_renewable
     counter, and exits. The operator must rotate the token before
     its current TTL expires.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.
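
The control flow in items 1, 2 and 4 can be sketched as follows. This is a minimal illustration, not the connector's code: runRenewLoop, renewFunc, the tick channel and simulate are hypothetical names, and the renew-self HTTP call is replaced by an injected function so the example runs without a Vault server.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// renewFunc stands in for POST /v1/auth/token/renew-self. It reports
// whether the token is still renewable. Hypothetical signature.
type renewFunc func(ctx context.Context) (renewable bool, err error)

// runRenewLoop performs one renewal per tick until ctx is cancelled, the
// tick channel is closed, or the token is reported as non-renewable.
func runRenewLoop(ctx context.Context, ticks <-chan struct{}, renew renewFunc, record func(result string)) {
	for {
		select {
		case <-ctx.Done():
			return // graceful shutdown
		case _, ok := <-ticks:
			if !ok {
				return
			}
			renewable, err := renew(ctx)
			switch {
			case err != nil:
				record("failure") // transient blip: keep ticking
			case !renewable:
				record("not_renewable")
				return // the operator has to rotate the token
			default:
				record("success")
			}
		}
	}
}

// simulate feeds four ticks into the loop: the first renewal fails, the
// second succeeds, the third reports renewable=false. The fourth tick is
// never consumed because the loop has already exited.
func simulate() []string {
	ticks := make(chan struct{}, 4)
	for i := 0; i < 4; i++ {
		ticks <- struct{}{}
	}
	close(ticks)

	calls := 0
	renew := func(context.Context) (bool, error) {
		calls++
		switch calls {
		case 1:
			return false, errors.New("403 permission denied")
		case 2:
			return true, nil
		default:
			return false, nil
		}
	}

	var results []string
	runRenewLoop(context.Background(), ticks, renew, func(r string) {
		results = append(results, r)
	})
	return results
}

func main() {
	fmt.Println(simulate()) // [failure success not_renewable]
}
```

Injecting the tick source and the renew function keeps the loop deterministic under test; a real loop would presumably drive the ticks from a timer reset to the current TTL/2.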

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.
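
Feature detection by type assertion might look like the sketch below. Only the Lifecycle interface is taken from this message; Connector, staticConnector, loopConnector, startLifecycles and runDemo are hypothetical stand-ins, and the real IssuerRegistry.StartLifecycles may differ (for example in how it handles a failed Start).

```go
package main

import (
	"context"
	"fmt"
)

// Connector is a hypothetical stand-in for issuer.Connector.
type Connector interface {
	Name() string
}

// Lifecycle mirrors the optional interface quoted above.
type Lifecycle interface {
	Start(ctx context.Context) error
	Stop()
}

// staticConnector holds no background goroutine, so it does not
// implement Lifecycle.
type staticConnector struct{ name string }

func (c staticConnector) Name() string { return c.name }

// loopConnector owns background work and therefore implements Lifecycle.
type loopConnector struct {
	name    string
	started bool
}

func (c *loopConnector) Name() string                { return c.name }
func (c *loopConnector) Start(context.Context) error { c.started = true; return nil }
func (c *loopConnector) Stop()                       { c.started = false }

// startLifecycles starts every connector that implements Lifecycle and
// skips the rest. It returns the names of the connectors it started.
func startLifecycles(ctx context.Context, conns []Connector) ([]string, error) {
	var started []string
	for _, c := range conns {
		lc, ok := c.(Lifecycle)
		if !ok {
			continue // nothing to start for this connector
		}
		if err := lc.Start(ctx); err != nil {
			return started, fmt.Errorf("start %s: %w", c.Name(), err)
		}
		started = append(started, c.Name())
	}
	return started, nil
}

// runDemo wires one connector of each kind and starts the lifecycles.
func runDemo() ([]string, error) {
	conns := []Connector{
		staticConnector{name: "acme"},
		&loopConnector{name: "vault"},
	}
	return startLifecycles(context.Background(), conns)
}

func main() {
	started, err := runDemo()
	fmt.Println(started, err) // [vault] <nil>
}
```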

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false, loop exits, third tick fires no HTTP call.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".

Cadence is dynamic — every successful renewal re-derives TTL/2
from the renewed lease's lease_duration, so a short bootstrap
token that gets renewed up to a longer Max TTL shifts to the
longer cadence automatically (defends against degenerate fast
ticking on a token whose Max TTL is far longer than its initial
TTL).
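
A minimal sketch of the interval computation, assuming a floor of 5 seconds: the actual minRenewInterval value is not stated in this message, and intervalFromLeaseSeconds is a hypothetical helper for converting lease_duration (in seconds) into a cadence.

```go
package main

import (
	"fmt"
	"time"
)

// minRenewInterval is an ASSUMED floor; the connector's real value is
// not given in the commit message.
const minRenewInterval = 5 * time.Second

// computeInterval returns TTL/2, clamped so that a very short TTL cannot
// make the loop tick faster than the floor.
func computeInterval(ttl time.Duration) time.Duration {
	half := ttl / 2
	if half < minRenewInterval {
		return minRenewInterval
	}
	return half
}

// intervalFromLeaseSeconds re-derives the cadence from a renewed lease's
// lease_duration, expressed in seconds.
func intervalFromLeaseSeconds(leaseSeconds int) time.Duration {
	return computeInterval(time.Duration(leaseSeconds) * time.Second)
}

func main() {
	fmt.Println(intervalFromLeaseSeconds(60))    // 30s: short bootstrap token
	fmt.Println(intervalFromLeaseSeconds(86400)) // 12h0m0s: after renewal to a longer lease
	fmt.Println(intervalFromLeaseSeconds(4))     // 5s: clamped to the assumed floor
}
```

Re-running the computation after each successful renewal is what moves a token from the first cadence to the second without a restart.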

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
2026-05-03 21:24:27 +00:00


package service

import "sync/atomic"

// VaultRenewalMetrics is a thread-safe counter table for the
// Vault PKI token-renewal loop. Top-10 fix #5 of the 2026-05-03
// issuer-coverage audit. Closes the operator-observability gap
// where long-lived deploys would silently lose Vault auth at TTL
// expiry.
//
// Cardinality is fixed at three series; result is a closed enum:
//
//	{success}       : the renew-self call succeeded.
//	{failure}       : the renew-self call returned a non-2xx,
//	                  parse failure, or HTTP error. Loop keeps
//	                  ticking; transient blips don't kill it.
//	{not_renewable} : Vault returned renewable=false (or returned
//	                  it at startup lookup-self). Loop has exited;
//	                  operator must rotate the token before its
//	                  current TTL expires.
//
// One instance is shared across every Vault PKI Connector built by
// IssuerRegistry.Rebuild — the recorder pointer is wired by
// IssuerRegistry.SetVaultRenewalMetrics + the post-factory wiring
// step inside Rebuild. The same instance is also wired into
// MetricsHandler.SetVaultRenewals so the Prometheus exposer emits
// certctl_vault_token_renewals_total{result=...}.
type VaultRenewalMetrics struct {
	success      atomic.Uint64
	failure      atomic.Uint64
	notRenewable atomic.Uint64
}

// NewVaultRenewalMetrics constructs a fresh VaultRenewalMetrics
// with all counters at zero. Pass to IssuerRegistry.SetVaultRenewalMetrics
// (and to MetricsHandler.SetVaultRenewals) to wire up the renewal
// loop's metric path.
func NewVaultRenewalMetrics() *VaultRenewalMetrics {
	return &VaultRenewalMetrics{}
}

// RecordRenewal bumps the (result) counter. Implements
// vault.RenewalRecorder. Off-enum result values silently no-op
// (closed-enum discipline matches the IssuanceMetrics pattern;
// we don't dynamically grow the cardinality on a typo).
func (m *VaultRenewalMetrics) RecordRenewal(result string) {
	if m == nil {
		return
	}
	switch result {
	case "success":
		m.success.Add(1)
	case "failure":
		m.failure.Add(1)
	case "not_renewable":
		m.notRenewable.Add(1)
	}
}

// VaultRenewalSnapshot is the per-result counter view returned by
// Snapshot. Pinned in this package so the handler can consume it
// via VaultRenewalSnapshotter without cross-importing connector
// state. Field names are stable — operator dashboards alert on
// the corresponding {result=...} label values.
type VaultRenewalSnapshot struct {
	Success      uint64
	Failure      uint64
	NotRenewable uint64
}

// Snapshot returns a point-in-time read of all three counters.
// Used by tests that need to assert post-tick state. The
// Prometheus exposer in internal/api/handler/metrics.go uses
// SnapshotVaultRenewals (3-tuple form) instead, to avoid an
// import cycle on a shared struct type.
func (m *VaultRenewalMetrics) Snapshot() VaultRenewalSnapshot {
	if m == nil {
		return VaultRenewalSnapshot{}
	}
	return VaultRenewalSnapshot{
		Success:      m.success.Load(),
		Failure:      m.failure.Load(),
		NotRenewable: m.notRenewable.Load(),
	}
}

// SnapshotVaultRenewals returns the three counter values directly
// as a tuple. Implements handler.VaultRenewalSnapshotter; used by
// the Prometheus exposer. Order is fixed: success, failure,
// not_renewable.
func (m *VaultRenewalMetrics) SnapshotVaultRenewals() (success, failure, notRenewable uint64) {
	if m == nil {
		return 0, 0, 0
	}
	return m.success.Load(), m.failure.Load(), m.notRenewable.Load()
}
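
For orientation, a consumer of the 3-tuple form could render the three series in the Prometheus text exposition format roughly as sketched below. The exposer in internal/api/handler/metrics.go is not shown here, so snapshotter, fixedCounts, renderVaultRenewals, renderFixed and the HELP text are all assumptions; only the metric name, the label values and the tuple order come from the file above.

```go
package main

import (
	"fmt"
	"strings"
)

// snapshotter mirrors the 3-tuple method on VaultRenewalMetrics, so the
// renderer does not need to import the service package.
type snapshotter interface {
	SnapshotVaultRenewals() (success, failure, notRenewable uint64)
}

// fixedCounts is a hypothetical test double with preset counter values.
type fixedCounts struct{ success, failure, notRenewable uint64 }

func (c fixedCounts) SnapshotVaultRenewals() (uint64, uint64, uint64) {
	return c.success, c.failure, c.notRenewable
}

// renderVaultRenewals writes one counter family with three fixed series.
// The HELP wording is an assumption, not the project's actual text.
func renderVaultRenewals(src snapshotter) string {
	success, failure, notRenewable := src.SnapshotVaultRenewals()

	var b strings.Builder
	b.WriteString("# HELP certctl_vault_token_renewals_total Vault token renew-self attempts by result.\n")
	b.WriteString("# TYPE certctl_vault_token_renewals_total counter\n")

	rows := []struct {
		result string
		value  uint64
	}{
		{"success", success},
		{"failure", failure},
		{"not_renewable", notRenewable},
	}
	for _, r := range rows {
		fmt.Fprintf(&b, "certctl_vault_token_renewals_total{result=%q} %d\n", r.result, r.value)
	}
	return b.String()
}

// renderFixed renders a snapshot with the given counter values.
func renderFixed(success, failure, notRenewable uint64) string {
	return renderVaultRenewals(fixedCounts{success, failure, notRenewable})
}

func main() {
	fmt.Print(renderFixed(3, 1, 0))
}
```

Emitting all three series even when a counter is zero keeps the series set constant from the first scrape, which matches the fixed-cardinality intent described in the type's doc comment.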