vault: add automatic token renewal at TTL/2 + Prometheus metric

Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). The loop exits on ctx.Done();
     Stop() waits on a per-loop done channel so shutdown is graceful.
  2. On `renewable: false` response (initial lookup OR any subsequent
     renewal), the loop emits a WARN, increments the not_renewable
     counter, and exits. The operator must rotate the token before
     Vault's Max TTL elapses.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.
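
A minimal sketch of the loop behaviour in items 1 to 4, under stated
assumptions: runRenewLoop, renewFunc, counters and simulate are
hypothetical names for illustration, not the connector's actual API.
The renew-self HTTP call is replaced by an injected function, and ticks
arrive on a channel instead of a timer so the behaviour is
deterministic.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// renewFunc stands in for POST /v1/auth/token/renew-self (hypothetical
// signature). It reports whether the token is still renewable.
type renewFunc func(ctx context.Context) (renewable bool, err error)

// counters mirrors the three result label values of
// certctl_vault_token_renewals_total.
type counters struct{ success, failure, notRenewable int }

// runRenewLoop renews once per tick until ctx is cancelled, the tick
// channel is closed, or the token is reported as non-renewable.
// Failures are counted but do not end the loop.
func runRenewLoop(ctx context.Context, ticks <-chan struct{}, renew renewFunc, c *counters) {
	for {
		select {
		case <-ctx.Done():
			return
		case _, ok := <-ticks:
			if !ok {
				return
			}
			renewable, err := renew(ctx)
			switch {
			case err != nil:
				c.failure++ // transient blip: keep ticking
			case !renewable:
				c.notRenewable++
				return // the operator must rotate the token
			default:
				c.success++
			}
		}
	}
}

// simulate feeds n ticks through the loop with a scripted renew
// function: call 1 fails, call 2 succeeds, every later call reports
// renewable=false.
func simulate(n int) counters {
	ticks := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		ticks <- struct{}{}
	}
	close(ticks)

	calls := 0
	renew := func(context.Context) (bool, error) {
		calls++
		switch calls {
		case 1:
			return false, errors.New("403 permission denied")
		case 2:
			return true, nil
		default:
			return false, nil
		}
	}

	var c counters
	runRenewLoop(context.Background(), ticks, renew, &c)
	return c
}

func main() {
	fmt.Printf("%+v\n", simulate(5))
}
```

In this sketch the third tick ends the loop, so the two remaining
ticks trigger no further calls.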

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.
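
The feature detection can be sketched as follows. This is an
illustration only: plainConnector, renewingConnector, startLifecycles
and demo are hypothetical names, and the real registry also unwraps its
adapter and logs Start errors.

```go
package main

import (
	"context"
	"fmt"
)

// Lifecycle matches the optional interface quoted above.
type Lifecycle interface {
	Start(ctx context.Context) error
	Stop()
}

// plainConnector has no background work, so it does not implement
// Lifecycle (hypothetical type).
type plainConnector struct{}

// renewingConnector is a hypothetical connector with background work.
type renewingConnector struct{ started bool }

func (r *renewingConnector) Start(ctx context.Context) error {
	r.started = true
	return nil
}

func (r *renewingConnector) Stop() { r.started = false }

// startLifecycles calls Start on every value that implements Lifecycle
// and returns how many were started. Values without the interface are
// skipped silently.
func startLifecycles(ctx context.Context, conns []interface{}) int {
	started := 0
	for _, c := range conns {
		lc, ok := c.(Lifecycle)
		if !ok {
			continue
		}
		if err := lc.Start(ctx); err != nil {
			continue // a real registry would log this
		}
		started++
	}
	return started
}

// demo runs startLifecycles over one connector of each kind.
func demo() (started int, renewingStarted bool) {
	rc := &renewingConnector{}
	conns := []interface{}{plainConnector{}, rc}
	n := startLifecycles(context.Background(), conns)
	return n, rc.started
}

func main() {
	n, ok := demo()
	fmt.Println(n, ok)
}
```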

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.
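
A rough sketch of why a single shared instance is enough: the loop
writes into the same counter table that the exposer reads. The names
renewalMetrics, Record, Snapshot and expose are assumptions for
illustration; the real VaultRenewalMetrics and metrics handler may be
shaped differently.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// renewalMetrics is a hypothetical per-result counter table.
type renewalMetrics struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newRenewalMetrics() *renewalMetrics {
	return &renewalMetrics{counts: make(map[string]uint64)}
}

// Record is the write side, called by the renewal loop.
func (m *renewalMetrics) Record(result string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[result]++
}

// Snapshot is the read side; it returns a copy so callers cannot
// mutate the live table.
func (m *renewalMetrics) Snapshot() map[string]uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := make(map[string]uint64, len(m.counts))
	for k, v := range m.counts {
		out[k] = v
	}
	return out
}

// expose renders the snapshot in Prometheus text exposition format,
// always emitting all three label values in a fixed order.
func expose(m *renewalMetrics) string {
	snap := m.Snapshot()
	var b strings.Builder
	for _, result := range []string{"success", "failure", "not_renewable"} {
		fmt.Fprintf(&b, "certctl_vault_token_renewals_total{result=%q} %d\n", result, snap[result])
	}
	return b.String()
}

func main() {
	shared := newRenewalMetrics() // one instance, handed to both sides
	shared.Record("success")
	shared.Record("failure")
	fmt.Print(expose(shared))
}
```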

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false, loop exits, third tick fires no HTTP call.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".

Cadence is dynamic: every successful renewal re-derives TTL/2 from
the renewed lease's lease_duration. A short-lived bootstrap token
whose renewals extend it toward a longer Max TTL therefore shifts to
the longer cadence automatically, which avoids needlessly fast
ticking on a token whose Max TTL far exceeds its initial TTL.
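
The interval arithmetic can be sketched like this. The 10-second
floor is an assumed value chosen for illustration (the commit message
does not state the real minRenewInterval), and
intervalForLeaseSeconds is a hypothetical helper; lease_duration is
taken to be in seconds, as in Vault's token responses.

```go
package main

import (
	"fmt"
	"time"
)

// minRenewInterval is an ASSUMED floor; the real constant may differ.
const minRenewInterval = 10 * time.Second

// computeInterval returns TTL/2, clamped so that very short TTLs do
// not produce a tight renewal loop.
func computeInterval(ttl time.Duration) time.Duration {
	if half := ttl / 2; half > minRenewInterval {
		return half
	}
	return minRenewInterval
}

// intervalForLeaseSeconds converts a lease_duration in seconds into
// the next renewal interval. Calling it after every successful
// renewal is what makes the cadence dynamic.
func intervalForLeaseSeconds(leaseDuration int) time.Duration {
	return computeInterval(time.Duration(leaseDuration) * time.Second)
}

func main() {
	// A bootstrap token whose lease grows across renewals.
	for _, lease := range []int{4, 60, 3600} {
		fmt.Println(lease, intervalForLeaseSeconds(lease))
	}
}
```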

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
This commit is contained in:
shankar0123
2026-05-03 21:24:27 +00:00
parent 60dce0bf10
commit ceca3647eb
10 changed files with 1293 additions and 7 deletions
@@ -8,8 +8,10 @@ import (
	"sync"
	"time"

	"github.com/shankar0123/certctl/internal/connector/issuer"
	"github.com/shankar0123/certctl/internal/connector/issuer/acme"
	"github.com/shankar0123/certctl/internal/connector/issuer/local"
	"github.com/shankar0123/certctl/internal/connector/issuer/vault"
	"github.com/shankar0123/certctl/internal/connector/issuerfactory"
	"github.com/shankar0123/certctl/internal/crypto"
	"github.com/shankar0123/certctl/internal/crypto/signer"
@@ -47,6 +49,14 @@ type IssuerRegistry struct {
	// Nil leaves the legacy "ACME revocation by serial requires
	// CertificateLookup wiring" error in place for old wiring paths.
	acmeCertLookup acme.CertificateLookupRepo

	// vaultRenewalMetrics: when set, every freshly-constructed
	// *vault.Connector is wired with SetRenewalRecorder so the
	// renew-self loop bumps the certctl_vault_token_renewals_total
	// counter. Closes Top-10 fix #5 of the 2026-05-03 audit. Nil
	// leaves the no-op recorder in place (no metric emission, but
	// the loop still runs).
	vaultRenewalMetrics *VaultRenewalMetrics
}

// LocalIssuerDeps groups the optional dependencies that the local
@@ -92,6 +102,23 @@ func (r *IssuerRegistry) SetIssuanceMetrics(m *IssuanceMetrics) {
	r.metrics = m
}

// SetVaultRenewalMetrics wires the per-(result) counter table for
// the Vault PKI renew-self loop. Every *vault.Connector constructed
// by Rebuild after this call records its renewal results into the
// supplied metrics. Closes Top-10 fix #5 of the 2026-05-03
// issuer-coverage audit.
//
// The same instance must also be registered with the metrics
// handler via MetricsHandler.SetVaultRenewals so the Prometheus
// exposer emits certctl_vault_token_renewals_total{result=...}.
// cmd/server/main.go owns both wiring sides; tests usually skip
// the Prometheus side and just assert against the snapshot.
func (r *IssuerRegistry) SetVaultRenewalMetrics(m *VaultRenewalMetrics) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.vaultRenewalMetrics = m
}

// SetACMECertLookup wires the cert-version lookup repo for every
// *acme.Connector constructed by Rebuild. The lookup is used by the
// serial-only revoke path (RevokeCertificate) to recover the leaf-
@@ -228,6 +255,19 @@ func (r *IssuerRegistry) Rebuild(ctx context.Context, configs []*domain.Issuer,
				"id", cfg.ID)
		}

		// Top-10 fix #5 (2026-05-03 audit): wire the renew-self
		// metric recorder into every freshly-constructed
		// *vault.Connector so its background renewal loop bumps the
		// certctl_vault_token_renewals_total counter. Lifecycle
		// startup itself is gated by StartLifecycles below; Rebuild
		// only does the metric wire here so the recorder is in place
		// when StartLifecycles fires.
		if vaultConn, ok := connector.(*vault.Connector); ok && r.vaultRenewalMetrics != nil {
			vaultConn.SetRenewalRecorder(r.vaultRenewalMetrics)
			r.logger.Info("Vault PKI issuer wired with renew-self metric recorder",
				"id", cfg.ID)
		}

		adapter := NewIssuerConnectorAdapter(connector)
		// Wire per-issuer-type metrics (audit fix #4) when SetIssuanceMetrics
		// was called. The adapter is the IssuerConnector interface; type-
@@ -273,3 +313,93 @@ func (r *IssuerRegistry) Rebuild(ctx context.Context, configs []*domain.Issuer,
	return nil
}

// StartLifecycles iterates the registry and calls Start(ctx) on every
// connector that implements the optional issuer.Lifecycle extension
// interface. Connectors without lifecycle work (almost all of them)
// are silently skipped.
//
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit. Today only
// VaultPKI implements Lifecycle (for its renew-self loop). New
// lifecycle-bearing connectors plug in by implementing the
// interface; this method picks them up automatically.
//
// Per-connector Start failures are LOGGED, not returned, so a single
// misconfigured Vault doesn't block server startup. Operators see
// the failure in the slog stream and via the
// certctl_vault_token_renewals_total{result="not_renewable"} or
// {result="failure"} counter.
//
// The IssuerConnectorAdapter wraps the raw connector; we type-assert
// against IssuerConnectorWithUnderlying to reach the underlying
// connector. If the adapter shape changes, this assertion silently
// no-ops and lifecycle wiring stops working (covered by
// TestRegistry_StartLifecycles_VaultStarted).
func (r *IssuerRegistry) StartLifecycles(ctx context.Context) {
	r.mu.RLock()
	conns := make(map[string]IssuerConnector, len(r.issuers))
	for id, c := range r.issuers {
		conns[id] = c
	}
	r.mu.RUnlock()

	for id, c := range conns {
		raw := unwrapAdapter(c)
		if raw == nil {
			continue
		}
		lc, ok := raw.(issuer.Lifecycle)
		if !ok {
			continue
		}
		if err := lc.Start(ctx); err != nil {
			r.logger.Warn("issuer lifecycle Start failed",
				"id", id,
				"error", err,
			)
			continue
		}
		r.logger.Info("issuer lifecycle Start succeeded", "id", id)
	}
}

// StopLifecycles iterates the registry and calls Stop() on every
// connector that implements the optional issuer.Lifecycle extension
// interface. Each Stop blocks until the connector's background work
// has fully exited; the loop is sequential rather than parallel so
// shutdown ordering is deterministic in operator logs.
//
// Idempotent. Safe to call after StartLifecycles failed or wasn't
// called.
func (r *IssuerRegistry) StopLifecycles() {
	r.mu.RLock()
	conns := make([]IssuerConnector, 0, len(r.issuers))
	for _, c := range r.issuers {
		conns = append(conns, c)
	}
	r.mu.RUnlock()

	for _, c := range conns {
		raw := unwrapAdapter(c)
		if raw == nil {
			continue
		}
		if lc, ok := raw.(issuer.Lifecycle); ok {
			lc.Stop()
		}
	}
}

// unwrapAdapter returns the underlying issuer.Connector held by an
// IssuerConnectorAdapter, or by any other wrapper that exposes
// Underlying(). If the registry holds a raw connector directly
// (test wiring), it is returned as-is. The result is nil only when
// a wrapper's Underlying() returns nil; callers skip that case.
func unwrapAdapter(c IssuerConnector) interface{} {
	if a, ok := c.(*IssuerConnectorAdapter); ok {
		return a.Underlying()
	}
	if u, ok := c.(interface{ Underlying() interface{} }); ok {
		return u.Underlying()
	}
	return c
}