vault: add automatic token renewal at TTL/2 + Prometheus metric

Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). Honours ctx.Done() for
     graceful shutdown via a per-loop done channel + Stop().
  2. On a `renewable: false` response (initial lookup OR any subsequent
     renewal), the loop emits a WARN, increments the not_renewable
     counter, and exits. The operator must rotate the token before
     Vault's Max TTL elapses.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.
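
The four behaviours above can be sketched as one small loop. This is
an illustration under stated assumptions, not the connector's actual
code: runRenewLoop, errNotRenewable, the ticks channel (standing in
for a TTL/2 ticker), and the renew/record callbacks are all
hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"io"
	"log/slog"
)

// errNotRenewable is a hypothetical sentinel meaning the renew call
// reported renewable=false.
var errNotRenewable = errors.New("token not renewable")

// runRenewLoop is an illustrative loop. ticks stands in for a ticker
// firing at TTL/2, renew for POST /v1/auth/token/renew-self, and
// record for the certctl_vault_token_renewals_total counter.
func runRenewLoop(ctx context.Context, ticks <-chan struct{},
	renew func(context.Context) error, record func(result string),
	logger *slog.Logger) {
	for {
		select {
		case <-ctx.Done():
			return // graceful shutdown
		case _, ok := <-ticks:
			if !ok {
				return
			}
			err := renew(ctx)
			switch {
			case err == nil:
				record("success")
			case errors.Is(err, errNotRenewable):
				logger.Warn("vault token is not renewable; rotate it before Max TTL elapses")
				record("not_renewable")
				return // nothing left to renew
			default:
				logger.Error("vault token renewal failed; rotate the token before TTL expires",
					"error", err)
				record("failure") // transient failure: keep ticking
			}
		}
	}
}

// demoRenewLoop drives the loop through failure, success, and
// not_renewable, and returns the recorded results in order.
func demoRenewLoop() []string {
	outcomes := []error{errors.New("403 permission denied"), nil, errNotRenewable}
	ticks := make(chan struct{}, 4)
	for i := 0; i < 4; i++ {
		ticks <- struct{}{}
	}
	var results []string
	call := 0
	renew := func(context.Context) error {
		err := outcomes[call]
		call++
		return err
	}
	record := func(r string) { results = append(results, r) }
	logger := slog.New(slog.NewTextHandler(io.Discard, nil))
	runRenewLoop(context.Background(), ticks, renew, record, logger)
	return results
}
```

In the demo the fourth queued tick is never consumed, because the
loop has already exited on the non-renewable response.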

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.
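
A minimal sketch of the feature detection described above, assuming a
simplified Connector interface and hypothetical helper names
(startLifecycles, stopLifecycles, plainConnector, renewingConnector).
The real registry's error handling and ordering may differ.

```go
package main

import "context"

// Connector is a simplified stand-in for issuer.Connector.
type Connector interface {
	Name() string
}

// Lifecycle matches the optional interface shown above.
type Lifecycle interface {
	Start(ctx context.Context) error
	Stop()
}

// startLifecycles starts every connector that also implements
// Lifecycle and returns the ones it started.
func startLifecycles(ctx context.Context, connectors []Connector) ([]Lifecycle, error) {
	var started []Lifecycle
	for _, c := range connectors {
		lc, ok := c.(Lifecycle) // feature-detect via type assertion
		if !ok {
			continue // no background work for this connector
		}
		if err := lc.Start(ctx); err != nil {
			return started, err
		}
		started = append(started, lc)
	}
	return started, nil
}

// stopLifecycles stops started connectors in reverse order.
func stopLifecycles(started []Lifecycle) {
	for i := len(started) - 1; i >= 0; i-- {
		started[i].Stop()
	}
}

// plainConnector has no background goroutine.
type plainConnector struct{}

func (plainConnector) Name() string { return "plain" }

// renewingConnector pretends to own a background loop.
type renewingConnector struct{ running bool }

func (r *renewingConnector) Name() string                { return "renewing" }
func (r *renewingConnector) Start(context.Context) error { r.running = true; return nil }
func (r *renewingConnector) Stop()                       { r.running = false }

// demoLifecycles reports how many connectors were started and the
// renewing connector's state after Start and after Stop.
func demoLifecycles() (count int, afterStart, afterStop bool) {
	rc := &renewingConnector{}
	started, _ := startLifecycles(context.Background(), []Connector{plainConnector{}, rc})
	afterStart = rc.running
	stopLifecycles(started)
	return len(started), afterStart, rc.running
}
```

Only the connector that implements both methods is started; the plain
one is skipped without any registration step.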

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.
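
The shared-instance wiring can be illustrated with a hand-rolled
counter. This is a sketch only: renewalCounts, newRenewalCounts, and
expose are hypothetical, and the real service.VaultRenewalMetrics and
metrics handler may be built differently. Only the RecordRenewal
method and the metric name come from this commit.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// RenewalRecorder is the consumer-side interface the connector calls.
type RenewalRecorder interface {
	RecordRenewal(result string)
}

// renewalCounts is a hypothetical stand-in for the shared metrics
// instance: one value is handed to both the connector and the exposer.
type renewalCounts struct {
	mu     sync.Mutex
	counts map[string]uint64
}

func newRenewalCounts() *renewalCounts {
	return &renewalCounts{counts: map[string]uint64{}}
}

// RecordRenewal increments the counter for one result label.
func (m *renewalCounts) RecordRenewal(result string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.counts[result]++
}

// expose renders the counters as Prometheus text exposition lines,
// sorted by label value so the output is deterministic.
func (m *renewalCounts) expose() string {
	m.mu.Lock()
	defer m.mu.Unlock()
	keys := make([]string, 0, len(m.counts))
	for k := range m.counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&b, "certctl_vault_token_renewals_total{result=%q} %d\n", k, m.counts[k])
	}
	return b.String()
}

// demoSharedMetrics records through the connector-facing interface and
// reads back through the exposer-facing method on the same instance.
func demoSharedMetrics() string {
	shared := newRenewalCounts()
	var sink RenewalRecorder = shared // what the connector would hold
	sink.RecordRenewal("success")
	sink.RecordRenewal("success")
	sink.RecordRenewal("failure")
	return shared.expose() // what the metrics handler would serve
}
```

Because both sides hold the same pointer, increments made by the
renewal loop are visible to the exposer without any copying.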

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false, loop exits, third tick fires no HTTP call.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".
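
The shutdown behaviour pinned by the ctx-cancellation test can be
sketched as below. The loop type and its fields are hypothetical and
only mirror the mutex / cancel / done-channel shape named in this
commit; the real Start and Stop may differ.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// loop is a hypothetical holder for one background goroutine.
type loop struct {
	mu      sync.Mutex
	started bool
	cancel  context.CancelFunc
	done    chan struct{}
}

// Start spawns the goroutine once; later calls are no-ops.
func (l *loop) Start(ctx context.Context, interval time.Duration, tick func()) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.started {
		return
	}
	loopCtx, cancel := context.WithCancel(ctx)
	done := make(chan struct{})
	l.started, l.cancel, l.done = true, cancel, done
	go func() {
		defer close(done) // signals Stop that the goroutine exited
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-loopCtx.Done():
				return
			case <-t.C:
				tick()
			}
		}
	}()
}

// Stop cancels the goroutine and waits for it to exit. It is safe to
// call when nothing was started.
func (l *loop) Stop() {
	l.mu.Lock()
	cancel, done := l.cancel, l.done
	l.started, l.cancel, l.done = false, nil, nil
	l.mu.Unlock()
	if cancel == nil {
		return
	}
	cancel()
	<-done
}

// demoStopLatency cancels the parent ctx and reports whether Stop
// returned within 200ms.
func demoStopLatency() bool {
	l := &loop{}
	ctx, cancelParent := context.WithCancel(context.Background())
	l.Start(ctx, time.Hour, func() {})
	cancelParent()
	begin := time.Now()
	l.Stop()
	return time.Since(begin) < 200*time.Millisecond
}
```

Stop releases the mutex before waiting on the done channel, so the
goroutine is never blocked on a lock that Stop is holding.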

Cadence is dynamic — every successful renewal re-derives TTL/2
from the renewed lease's lease_duration, so a short bootstrap
token that gets renewed up to a longer Max TTL shifts to the
longer cadence automatically (defends against degenerate fast
ticking on a token whose Max TTL is far longer than its initial
TTL).
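
The cadence rule can be sketched as follows. The 10-second floor is
an assumed value chosen for illustration (the real minRenewInterval
is not stated here), and demoCadenceSeconds is a hypothetical helper.

```go
package main

import "time"

// minRenewInterval is the floor on the renewal cadence. The value is
// an assumption for this sketch, not the connector's real constant.
const minRenewInterval = 10 * time.Second

// computeInterval returns TTL/2, clamped to minRenewInterval so a very
// short (or zero) TTL cannot produce a hot loop.
func computeInterval(ttl time.Duration) time.Duration {
	half := ttl / 2
	if half < minRenewInterval {
		return minRenewInterval
	}
	return half
}

// demoCadenceSeconds feeds a sequence of lease_duration values (in
// seconds), as successive renewals might report them, through
// computeInterval and returns the resulting cadences in seconds.
func demoCadenceSeconds() []int {
	leases := []int{60, 3600, 4}
	out := make([]int, 0, len(leases))
	for _, s := range leases {
		d := computeInterval(time.Duration(s) * time.Second)
		out = append(out, int(d/time.Second))
	}
	return out
}
```

Under these assumptions a 60s bootstrap lease ticks every 30s, a
renewed 3600s lease moves to 1800s, and a 4s lease is clamped to the
floor. A loop built on time.Ticker could apply each new value with
Reset after a successful renewal.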

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
shankar0123
2026-05-03 21:24:27 +00:00
parent 60dce0bf10
commit ceca3647eb
10 changed files with 1293 additions and 7 deletions
+56 -6
@@ -29,6 +29,7 @@ import (
"log/slog"
"net/http"
"strings"
"sync"
"time"
"github.com/shankar0123/certctl/internal/connector/issuer"
@@ -72,6 +73,32 @@ type Connector struct {
config *Config
logger *slog.Logger
httpClient *http.Client
// Token-renewal loop fields. Top-10 fix #5 of the 2026-05-03
// issuer-coverage audit. Long-lived certctl-server deploys hit
// Vault token expiry; the loop calls /v1/auth/token/renew-self at
// TTL/2 cadence so the integration stays alive up to Vault's
// configured Max TTL. See vault_renew.go for Start / Stop /
// renewSelf / lookupSelf.
//
// renewMu guards renewStarted, renewCancel, and renewDone. The ticker
// runs in a goroutine that owns its own copy of these channels.
renewMu sync.Mutex
renewStarted bool // true after Start spawned the goroutine
renewCancel func() // cancels the goroutine's ctx
renewDone chan struct{} // closed when goroutine exits
renewRecorder RenewalRecorder // optional metric sink (defaults to no-op)
// renewTickerFactory lets tests substitute a deterministic ticker
// implementation for cadence assertions. Production callers leave
// this nil and the loop uses time.NewTicker.
renewTickerFactory func(d time.Duration) renewTicker
// renewClient is the HTTP client used for renew-self / lookup-self.
// Defaults to httpClient; a separate seam lets tests inject an
// httptest.Server-bound client without disturbing the issuance
// path's client.
renewClient *http.Client
}
// New creates a new Vault PKI connector with the given configuration and logger.
@@ -85,13 +112,36 @@ func New(config *Config, logger *slog.Logger) *Connector {
}
}
- return &Connector{
- config: config,
- logger: logger,
- httpClient: &http.Client{
- Timeout: 30 * time.Second,
- },
+ httpClient := &http.Client{
+ Timeout: 30 * time.Second,
+ }
+ return &Connector{
+ config: config,
+ logger: logger,
+ httpClient: httpClient,
+ renewClient: httpClient,
+ renewRecorder: noopRenewalRecorder{},
}
}
// SetRenewalRecorder wires a metric sink for the renew-self loop. The
// recorder's RecordRenewal(result string) is called with one of the
// enum values "success", "failure", or "not_renewable" on every tick.
// Pass nil to disable recording. Safe to call before Start; calling
// after Start has no effect on already-emitted increments.
//
// The interface lives in this package (not internal/service) to avoid
// an import cycle: vault is a connector package that the service-layer
// IssuerRegistry imports. The service-layer concrete type
// (*service.VaultRenewalMetrics) satisfies this interface and is wired
// in cmd/server/main.go.
func (c *Connector) SetRenewalRecorder(r RenewalRecorder) {
if r == nil {
r = noopRenewalRecorder{}
}
c.renewMu.Lock()
defer c.renewMu.Unlock()
c.renewRecorder = r
}
// vaultResponse is the standard Vault API response wrapper.