vault: add automatic token renewal at TTL/2 + Prometheus metric

Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). Honours ctx.Done() for
     graceful shutdown via a per-loop done channel + Stop().
  2. On a `renewable: false` response the connector emits a WARN and
     increments the not_renewable counter. At the initial lookup no
     goroutine is spawned; on any subsequent renewal the loop exits.
     The operator must rotate the token before Vault's Max TTL elapses.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.
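
The per-tick decision described in items 2 and 4 can be sketched as a
small pure function. This is an illustration only: classifyTick and
errExample are hypothetical names that do not appear in this commit,
where the loop makes the same decision inline.

```go
package main

import (
	"errors"
	"fmt"
)

// errExample stands in for any renew-self error (HTTP 403, timeout, ...).
// Hypothetical; used only by this sketch.
var errExample = errors.New("status 403")

// classifyTick is a hypothetical helper (not part of this commit) that maps
// one renew-self outcome to a metric label and a continue/stop decision.
func classifyTick(renewable bool, err error) (result string, keepTicking bool) {
	switch {
	case err != nil:
		// Failures are counted, but a transient blip must not kill the loop.
		return "failure", true
	case !renewable:
		// Vault will not extend this token any further, so stop asking.
		return "not_renewable", false
	default:
		return "success", true
	}
}

func main() {
	outcomes := []struct {
		renewable bool
		err       error
	}{
		{true, nil},
		{false, errExample},
		{false, nil},
	}
	for _, o := range outcomes {
		result, keep := classifyTick(o.renewable, o.err)
		fmt.Printf("result=%s keepTicking=%t\n", result, keep)
	}
}
```

Only the explicit not_renewable outcome stops the loop; an error is
always treated as retryable.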

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.
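
As a rough sketch of the feature detection described above, assuming a
registry that simply iterates over its connectors: the Connector
interface, the plain and renewing types, and startLifecycles below are
hypothetical stand-ins, not the actual IssuerRegistry code.

```go
package main

import (
	"context"
	"fmt"
)

// Connector is a hypothetical stand-in for issuer.Connector.
type Connector interface{ Name() string }

// Lifecycle mirrors the optional interface introduced by this commit.
type Lifecycle interface {
	Start(ctx context.Context) error
	Stop()
}

// plain has no background work and does not implement Lifecycle.
type plain struct{}

func (plain) Name() string { return "plain" }

// renewing implements Lifecycle, like the Vault connector does.
type renewing struct{ started bool }

func (r *renewing) Name() string                { return "renewing" }
func (r *renewing) Start(context.Context) error { r.started = true; return nil }
func (r *renewing) Stop()                       { r.started = false }

// startLifecycles starts every connector that opts in via Lifecycle and
// returns the names of those it started. Start errors are logged, not fatal.
func startLifecycles(ctx context.Context, conns []Connector) []string {
	var started []string
	for _, c := range conns {
		lc, ok := c.(Lifecycle) // feature-detect via type assertion
		if !ok {
			continue // connector has no background work
		}
		if err := lc.Start(ctx); err != nil {
			fmt.Println("lifecycle start failed:", c.Name(), err)
			continue
		}
		started = append(started, c.Name())
	}
	return started
}

// demo runs the sketch against one connector of each kind.
func demo() []string {
	return startLifecycles(context.Background(), []Connector{plain{}, &renewing{}})
}

func main() {
	fmt.Println(demo()) // prints [renewing]
}
```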

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.
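
The shared-instance wiring works because one value can serve both as
the recorder the connector writes to and as the snapshot source the
exposer reads from. A minimal sketch, assuming atomic counters:
renewalCounters is a hypothetical type, not the real
service.VaultRenewalMetrics, although the two method signatures match
the interfaces in this commit.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// renewalCounters is a hypothetical stand-in for service.VaultRenewalMetrics.
// It must always be used through a pointer because it holds atomic values.
type renewalCounters struct {
	success, failure, notRenewable atomic.Uint64
}

// RecordRenewal is the write side, called by the renewal loop.
// Labels outside the closed enum are ignored.
func (m *renewalCounters) RecordRenewal(result string) {
	switch result {
	case "success":
		m.success.Add(1)
	case "failure":
		m.failure.Add(1)
	case "not_renewable":
		m.notRenewable.Add(1)
	}
}

// SnapshotVaultRenewals is the read side, called by the metrics handler.
func (m *renewalCounters) SnapshotVaultRenewals() (success, failure, notRenewable uint64) {
	return m.success.Load(), m.failure.Load(), m.notRenewable.Load()
}

func main() {
	m := &renewalCounters{} // one instance, handed to both sides
	m.RecordRenewal("success")
	m.RecordRenewal("success")
	m.RecordRenewal("failure")

	s, f, n := m.SnapshotVaultRenewals()
	fmt.Printf("certctl_vault_token_renewals_total{result=%q} %d\n", "success", s)
	fmt.Printf("certctl_vault_token_renewals_total{result=%q} %d\n", "failure", f)
	fmt.Printf("certctl_vault_token_renewals_total{result=%q} %d\n", "not_renewable", n)
}
```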

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false, loop exits, third tick fires no HTTP call.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".
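
The deterministic-ticker technique these tests rely on can be sketched
as follows. This is a simplified illustration, not the test code
itself: driveTicks is a hypothetical helper, and the loop body counts
ticks instead of calling Vault.

```go
package main

import (
	"fmt"
	"time"
)

// ticker mirrors the small surface the renewal loop needs from time.Ticker.
type ticker interface {
	C() <-chan time.Time
	Stop()
}

// fakeTicker fires only when the test calls Tick, so no real time elapses.
type fakeTicker struct{ ch chan time.Time }

func (f *fakeTicker) C() <-chan time.Time { return f.ch }
func (f *fakeTicker) Stop()               {}
func (f *fakeTicker) Tick()               { f.ch <- time.Now() }

// driveTicks (hypothetical) runs a loop against a fake ticker, fires it n
// times, stops the loop, and returns how many iterations the loop ran.
func driveTicks(n int) int {
	ft := &fakeTicker{ch: make(chan time.Time)}
	var t ticker = ft
	stop := make(chan struct{})
	fired := make(chan struct{})
	done := make(chan struct{})
	renewals := 0

	go func() {
		defer close(done)
		defer t.Stop()
		for {
			select {
			case <-stop:
				return
			case <-t.C():
				renewals++ // the real loop would call renew-self here
				fired <- struct{}{}
			}
		}
	}()

	for i := 0; i < n; i++ {
		ft.Tick() // blocks until the loop receives the tick
		<-fired   // blocks until the loop has finished that iteration
	}
	close(stop)
	<-done // after this, reading renewals is race-free
	return renewals
}

func main() {
	fmt.Println(driveTicks(3)) // prints 3
}
```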

Cadence is dynamic — every successful renewal re-derives TTL/2
from the renewed lease's lease_duration, so a short bootstrap
token that gets renewed up to a longer Max TTL shifts to the
longer cadence automatically (defends against degenerate fast
ticking on a token whose Max TTL is far longer than its initial
TTL).
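
The cadence arithmetic can be tried in isolation. The two functions
below are copied from vault_renew.go in this commit; the TTL values in
main are made-up inputs chosen for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// minRenewInterval is the floor applied to TTL/2.
const minRenewInterval = 5 * time.Second

// computeInterval returns TTL/2, floored at minRenewInterval.
func computeInterval(ttl time.Duration) time.Duration {
	half := ttl / 2
	if half < minRenewInterval {
		return minRenewInterval
	}
	return half
}

// differsEnough reports whether b is more than 10% of a away from a,
// which is the threshold for restarting the ticker.
func differsEnough(a, b time.Duration) bool {
	if a == 0 || b == 0 {
		return a != b
	}
	delta := a - b
	if delta < 0 {
		delta = -delta
	}
	tol := a / 10
	if tol < 0 {
		tol = -tol
	}
	return delta > tol
}

func main() {
	// Illustrative inputs: a 60s bootstrap token renewed up to a 1h lease.
	bootstrap := computeInterval(60 * time.Second)
	renewed := computeInterval(time.Hour)
	fmt.Println("bootstrap interval:", bootstrap)
	fmt.Println("renewed interval:", renewed)
	fmt.Println("restart ticker:", differsEnough(bootstrap, renewed))

	// A 4s TTL would give a 2s cadence, so the floor applies instead.
	fmt.Println("tiny TTL interval:", computeInterval(4*time.Second))
}
```

With these inputs the interval moves from 30 seconds to 30 minutes, far
outside the 10% tolerance, so the ticker is restarted at the slower
rate.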

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
shankar0123
2026-05-03 21:24:27 +00:00
parent 60dce0bf10
commit ceca3647eb
10 changed files with 1293 additions and 7 deletions
+23
@@ -228,6 +228,16 @@ func main() {
issuanceMetrics := service.NewIssuanceMetrics(service.DefaultIssuanceBucketBoundaries)
issuerRegistry.SetIssuanceMetrics(issuanceMetrics)
// Top-10 fix #5 (2026-05-03 audit): Vault PKI token-renewal
// metrics. Same instance is wired into the registry (so each
// *vault.Connector built by Rebuild gets a recorder) AND into
// the metrics handler (so the Prometheus exposer emits
// certctl_vault_token_renewals_total). The renewal goroutine
// itself is kicked off below by issuerRegistry.StartLifecycles
// after Rebuild has populated the registry.
vaultRenewalMetrics := service.NewVaultRenewalMetrics()
issuerRegistry.SetVaultRenewalMetrics(vaultRenewalMetrics)
// Audit fix #7: wire the cert-version lookup so ACME connectors
// built by Rebuild can recover the leaf-cert DER from a serial-
// only revoke request. The postgres CertificateRepository
@@ -403,6 +413,16 @@ func main() {
logger.Error("failed to build issuer registry from database", "error", err)
}
logger.Info("issuer registry loaded", "issuers", issuerRegistry.Len())
// Top-10 fix #5 (2026-05-03 audit): kick off any optional
// long-running background work bound to issuer connectors. Today
// only Vault PKI implements issuer.Lifecycle (renew-self loop);
// other connectors are silently skipped. Per-connector Start
// failures are logged, not fatal — a misconfigured Vault doesn't
// block server startup. Stop is wired to the deferred shutdown
// path below so the goroutines exit cleanly on signal.
issuerRegistry.StartLifecycles(context.Background())
defer issuerRegistry.StopLifecycles()
targetService := service.NewTargetService(targetRepo, auditService, agentRepo, encryptionKey, logger)
profileService := service.NewProfileService(profileRepo, auditService)
teamService := service.NewTeamService(teamRepo, auditService)
@@ -572,6 +592,9 @@ func main() {
// Audit fix #4: wire the per-issuer-type issuance metrics so the
// /api/v1/metrics/prometheus exposer emits the new series.
metricsHandler.SetIssuanceCounters(issuanceMetrics)
// Top-10 fix #5 (2026-05-03 audit): Vault PKI token-renewal counter.
// Same instance the registry uses to record per-tick results.
metricsHandler.SetVaultRenewals(vaultRenewalMetrics)
// Bundle-5 / H-006: pass the *sql.DB pool so /ready can probe DB
// connectivity via PingContext. /health stays shallow (liveness signal).
healthHandler := handler.NewHealthHandler(cfg.Auth.Type, db)
+3 -1
@@ -431,7 +431,9 @@ The connector is registered in the issuer registry under `iss-vault`. Vault issu
**MaxTTL enforcement (M11c):** When a certificate profile defines a maximum TTL, the Vault connector overrides the TTL string in the signing request to ensure the issued certificate does not exceed the profile limit. This is applied before Vault's own role-level max TTL.
-Location: `internal/connector/issuer/vault/vault.go`
**Token TTL + automatic renewal (Top-10 fix #5, 2026-05-03 audit):** certctl-server periodically calls `POST /v1/auth/token/renew-self` at half the token's TTL to keep the integration alive without manual rotation; the cadence is read from a one-shot `lookup-self` at startup and re-derived on every successful renewal so a short bootstrap token that gets renewed up to a longer Max TTL shifts to the longer cadence automatically. The renewal loop emits the `certctl_vault_token_renewals_total{result="success"|"failure"|"not_renewable"}` Prometheus counter so operators see expiry trouble in Grafana before issuance breaks. When Vault returns `renewable: false` (configured Max TTL reached), the loop logs a WARN, increments `{result="not_renewable"}`, and exits — the operator must rotate the Vault token and restart certctl-server (or use the GUI/MCP issuer-update path to swap the token in place; the registry's Rebuild path re-Starts the lifecycle on the new connector). Per-tick failures (e.g. transient 5xx, brief network blips) bump `{result="failure"}` and the loop keeps ticking; only the explicit `renewable: false` case stops it.
Location: `internal/connector/issuer/vault/vault.go` + `internal/connector/issuer/vault/vault_renew.go`
### Built-in: DigiCert CertCentral
+47
@@ -60,6 +60,28 @@ type DeployCounterSnapshotter interface {
// reverse import would create a cycle. The exposer below takes the
// types via the interface defined in service.
// VaultRenewalSnapshotter is the surface MetricsHandler consumes
// to emit the certctl_vault_token_renewals_total{result=...}
// counter. *service.VaultRenewalMetrics satisfies this; cmd/server
// passes the same instance into IssuerRegistry.SetVaultRenewalMetrics
// (so Vault connectors record results) AND into
// MetricsHandler.SetVaultRenewals (so the Prometheus exposer reads
// the counters).
//
// Returns three counter values directly (rather than a shared struct
// type) so service can satisfy this without an import cycle —
// handler already imports service for IssuanceMetricsSnapshotter,
// but service does not import handler. A method that returns
// (uint64, uint64, uint64) needs no shared type.
//
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit.
type VaultRenewalSnapshotter interface {
// SnapshotVaultRenewals returns success, failure, and
// not_renewable counters as point-in-time reads. Order is fixed
// for the exposer — matches the Prometheus label order.
SnapshotVaultRenewals() (success, failure, notRenewable uint64)
}
// MetricsHandler handles HTTP requests for metrics.
// Supports both JSON format (GET /api/v1/metrics) and Prometheus exposition format
// (GET /api/v1/metrics/prometheus) for integration with Prometheus, Grafana, Datadog, etc.
@@ -79,6 +101,10 @@ type MetricsHandler struct {
// imports service for admin_est.go etc., so service can't import
// handler back).
issuanceCounters service.IssuanceMetricsSnapshotter
// Vault PKI token-renewal counters. Top-10 fix #5 of the
// 2026-05-03 issuer-coverage audit. nil disables emission of
// certctl_vault_token_renewals_total{result=...}.
vaultRenewals VaultRenewalSnapshotter
}
// NewMetricsHandler creates a new MetricsHandler with a service dependency.
@@ -112,6 +138,13 @@ func (h *MetricsHandler) SetIssuanceCounters(c service.IssuanceMetricsSnapshotte
h.issuanceCounters = c
}
// SetVaultRenewals wires the Vault PKI token-renewal counter table
// for the Prometheus exposition. nil disables the block. Closes
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit.
func (h *MetricsHandler) SetVaultRenewals(c VaultRenewalSnapshotter) {
h.vaultRenewals = c
}
// MetricsResponse represents the JSON metrics response for V2.
type MetricsResponse struct {
Gauge MetricsGauge `json:"gauge"`
@@ -424,6 +457,20 @@ func (h MetricsHandler) GetPrometheusMetrics(w http.ResponseWriter, r *http.Requ
fmt.Fprintf(w, "certctl_issuance_failures_total{issuer_type=%q,error_class=%q} %d\n", f.IssuerType, f.ErrorClass, f.Count)
}
}
// Vault PKI token-renewal counters. Top-10 fix #5 of the
// 2026-05-03 issuer-coverage audit. Operators alert on
// certctl_vault_token_renewals_total{result="failure"} > 0 or
// {result="not_renewable"} > 0 to catch token expiry before
// issuance breaks. Closed enum: 3 series.
if h.vaultRenewals != nil {
success, failure, notRenewable := h.vaultRenewals.SnapshotVaultRenewals()
fmt.Fprintf(w, "\n# HELP certctl_vault_token_renewals_total Vault PKI token renew-self results. result is a closed enum: success, failure, not_renewable.\n")
fmt.Fprintf(w, "# TYPE certctl_vault_token_renewals_total counter\n")
fmt.Fprintf(w, "certctl_vault_token_renewals_total{result=%q} %d\n", "success", success)
fmt.Fprintf(w, "certctl_vault_token_renewals_total{result=%q} %d\n", "failure", failure)
fmt.Fprintf(w, "certctl_vault_token_renewals_total{result=%q} %d\n", "not_renewable", notRenewable)
}
}
// formatLE formats a histogram bucket boundary the way Prometheus
+40
@@ -0,0 +1,40 @@
package issuer
import "context"
// Lifecycle is an OPTIONAL extension interface for issuer connectors that
// need to run long-running background work bound to a context. Connectors
// that hold no background goroutines (almost all of them) do not implement
// this interface and the registry feature-detects via type assertion.
//
// Concrete users today (2026-05-03):
// - VaultPKI: periodic POST /v1/auth/token/renew-self at TTL/2 cadence
// so long-lived deploys don't hit token expiry.
//
// The lifecycle contract is deliberately small. Connectors that need
// per-tick state, retries, or cross-tick cancellation handle all of that
// internally; the registry's job is just "kick off background work
// once" and "block until it cleanly exits". Keeping the interface this
// small means new lifecycle-bearing connectors don't have to touch the
// registry plumbing — they implement Start/Stop and the existing
// IssuerRegistry.StartLifecycles / StopLifecycles wiring picks them up
// automatically.
//
// Start MUST be non-blocking — spawn a goroutine and return immediately.
// Returning an error means startup failed; the registry logs the error
// and continues. Stop MUST block until the goroutine has fully exited;
// callers rely on this for graceful shutdown ordering.
type Lifecycle interface {
// Start kicks off any long-running background work bound to ctx.
// Returns nil on successful startup; the goroutine continues until
// ctx is cancelled or Stop is called. Returns a non-nil error if
// startup itself failed (e.g. precondition not met) — the goroutine
// did NOT start and Stop need not be called.
Start(ctx context.Context) error
// Stop blocks until the background work has fully exited. Safe to
// call after Start returned an error or wasn't called at all.
// Idempotent — multiple Stop calls return immediately after the
// first.
Stop()
}
+56 -6
@@ -29,6 +29,7 @@ import (
"log/slog"
"net/http"
"strings"
"sync"
"time"
"github.com/shankar0123/certctl/internal/connector/issuer"
@@ -72,6 +73,32 @@ type Connector struct {
config *Config
logger *slog.Logger
httpClient *http.Client
// Token-renewal loop fields. Top-10 fix #5 of the 2026-05-03
// issuer-coverage audit. Long-lived certctl-server deploys hit
// Vault token expiry; the loop calls /v1/auth/token/renew-self at
// TTL/2 cadence so the integration stays alive up to Vault's
// configured Max TTL. See vault_renew.go for Start / Stop /
// renewSelf / lookupSelf.
//
// renewMu guards renewStarted, renewCancel, renewDone, and
// renewRecorder. The renewal goroutine receives its ctx and done
// channel as arguments, so it never reads those fields itself.
renewMu sync.Mutex
renewStarted bool // true after Start spawned the goroutine
renewCancel func() // cancels the goroutine's ctx
renewDone chan struct{} // closed when goroutine exits
renewRecorder RenewalRecorder // optional metric sink (defaults to no-op)
// renewTickerFactory lets tests substitute a deterministic ticker
// implementation for cadence assertions. Production callers leave
// this nil and the loop uses time.NewTicker.
renewTickerFactory func(d time.Duration) renewTicker
// renewClient is the HTTP client used for renew-self / lookup-self.
// Defaults to httpClient; a separate seam lets tests inject an
// httptest.Server-bound client without disturbing the issuance
// path's client.
renewClient *http.Client
}
// New creates a new Vault PKI connector with the given configuration and logger.
@@ -85,13 +112,36 @@ func New(config *Config, logger *slog.Logger) *Connector {
}
}
-return &Connector{
-config: config,
-logger: logger,
-httpClient: &http.Client{
-Timeout: 30 * time.Second,
-},
httpClient := &http.Client{
Timeout: 30 * time.Second,
}
return &Connector{
config: config,
logger: logger,
httpClient: httpClient,
renewClient: httpClient,
renewRecorder: noopRenewalRecorder{},
}
}
// SetRenewalRecorder wires a metric sink for the renew-self loop. The
// recorder's RecordRenewal(result string) is called with one of the
// enum values "success", "failure", or "not_renewable" on every tick.
// Pass nil to disable recording. Safe to call before Start; calling
// after Start has no effect on already-emitted increments.
//
// The interface lives in this package (not internal/service) to avoid
// an import cycle: vault is a connector package that the service-layer
// IssuerRegistry imports. The service-layer concrete type
// (*service.VaultRenewalMetrics) satisfies this interface and is wired
// in cmd/server/main.go.
func (c *Connector) SetRenewalRecorder(r RenewalRecorder) {
if r == nil {
r = noopRenewalRecorder{}
}
c.renewMu.Lock()
defer c.renewMu.Unlock()
c.renewRecorder = r
}
// vaultResponse is the standard Vault API response wrapper.
@@ -0,0 +1,410 @@
package vault
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit. Pre-fix,
// Vault PKI authenticated via a static token and never called
// renew-self; long-lived deploys hit token expiry and started failing
// silently — the operator's first signal was failed renewals on
// production targets. This file adds:
//
// 1. Connector.Start(ctx) — spawns a goroutine that calls
// POST /v1/auth/token/renew-self at TTL/2 cadence (computed
// from a one-shot lookupSelf at startup).
// 2. Connector.Stop() — cancels the goroutine's context and blocks
// until it has exited. Idempotent.
// 3. Connector.renewSelf(ctx) — the per-tick HTTP call.
// 4. Connector.lookupSelf(ctx) — a one-shot startup probe to learn
// the current TTL + renewable flag.
//
// On a `renewable: false` response, the loop logs a WARN and exits
// cleanly; once Vault has decided the token is no longer renewable
// (typically Max TTL reached), retrying is what gets certctl-server
// flagged in the Vault audit log as a misbehaving client.
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"net/http"
"time"
"github.com/shankar0123/certctl/internal/connector/issuer"
)
// minRenewInterval guards against degenerate fast cadence when a
// misconfigured Vault returns a tiny TTL. 5s is short enough that
// the floor rarely applies in practice but long enough that we don't
// hammer Vault's audit log with renew-self calls if something goes
// sideways. Defensive only; production tokens always have TTL ≥ 30m.
const minRenewInterval = 5 * time.Second
// RenewalRecorder is the metric-sink surface the renew-self loop
// uses. result is one of: "success", "failure", "not_renewable".
// Implementations MUST be goroutine-safe — RecordRenewal is called
// from the renewal loop's own goroutine.
//
// service.VaultRenewalMetrics satisfies this interface; cmd/server
// wires the same instance into the registry (which forwards to the
// connector via SetRenewalRecorder) and into the metrics handler
// (for Prometheus exposition).
type RenewalRecorder interface {
RecordRenewal(result string)
}
// noopRenewalRecorder is the zero-cost default. Used until
// SetRenewalRecorder wires a real metric sink (production) or in
// unit tests that don't care about metrics.
type noopRenewalRecorder struct{}
func (noopRenewalRecorder) RecordRenewal(string) {}
// renewTicker is the small surface the renewal loop uses from
// time.Ticker, extracted so tests can swap in a deterministic
// implementation that fires on cue. Production: time.NewTicker.
type renewTicker interface {
C() <-chan time.Time
Stop()
}
// stdTicker is the production implementation, a thin wrapper around
// *time.Ticker that exposes its C channel via a method so it
// satisfies the renewTicker interface (an interface can only declare
// methods, so the ticker's C field has to be wrapped in one).
type stdTicker struct{ t *time.Ticker }
func (s stdTicker) C() <-chan time.Time { return s.t.C }
func (s stdTicker) Stop() { s.t.Stop() }
// lookupSelfResponse is the subset of /v1/auth/token/lookup-self we
// consume. Vault returns many other fields (policies, accessor, …)
// that are irrelevant to the renewal loop.
type lookupSelfResponse struct {
Data struct {
TTL int `json:"ttl"` // seconds remaining on the token
Renewable bool `json:"renewable"` // whether the token can be renewed
} `json:"data"`
}
// renewSelfResponse is the subset of /v1/auth/token/renew-self we
// consume. Per Vault's HTTP API, the renewed token's lease info
// lands in `auth.lease_duration` and `auth.renewable`.
type renewSelfResponse struct {
Auth struct {
LeaseDuration int `json:"lease_duration"`
Renewable bool `json:"renewable"`
} `json:"auth"`
}
// lookupSelf calls GET /v1/auth/token/lookup-self and returns the
// remaining TTL + the renewable flag. Used by Start to compute the
// initial tick cadence.
func (c *Connector) lookupSelf(ctx context.Context) (ttl time.Duration, renewable bool, err error) {
if c.config == nil || c.config.Token.IsEmpty() {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf: token not configured")
}
url := c.config.Addr + "/v1/auth/token/lookup-self"
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf request build: %w", err)
}
if err := c.config.Token.Use(func(buf []byte) error {
req.Header.Set("X-Vault-Token", string(buf))
return nil
}); err != nil {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf token use: %w", err)
}
resp, err := c.renewClient.Do(req)
if err != nil {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf HTTP: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf body read: %w", err)
}
if resp.StatusCode != http.StatusOK {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf returned status %d: %s", resp.StatusCode, string(body))
}
var parsed lookupSelfResponse
if err := json.Unmarshal(body, &parsed); err != nil {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf parse: %w", err)
}
return time.Duration(parsed.Data.TTL) * time.Second, parsed.Data.Renewable, nil
}
// renewSelfResult is returned by renewSelf — it lets the loop both
// update the in-memory TTL AND react to a renewable=false flip on
// the same call without an extra round-trip.
type renewSelfResult struct {
NewTTL time.Duration
Renewable bool
}
// renewSelf calls POST /v1/auth/token/renew-self with an empty body
// (Vault accepts `{}`) and returns the renewed lease's TTL +
// renewable flag. The caller is responsible for stopping the loop
// when Renewable goes false.
func (c *Connector) renewSelf(ctx context.Context) (renewSelfResult, error) {
if c.config == nil || c.config.Token.IsEmpty() {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: token not configured; rotate the token before TTL expires")
}
url := c.config.Addr + "/v1/auth/token/renew-self"
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader([]byte(`{}`)))
if err != nil {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: request build: %w; rotate the token before TTL expires", err)
}
req.Header.Set("Content-Type", "application/json")
if err := c.config.Token.Use(func(buf []byte) error {
req.Header.Set("X-Vault-Token", string(buf))
return nil
}); err != nil {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: token use: %w; rotate the token before TTL expires", err)
}
resp, err := c.renewClient.Do(req)
if err != nil {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: HTTP error: %w; rotate the token before TTL expires", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: body read: %w; rotate the token before TTL expires", err)
}
if resp.StatusCode != http.StatusOK {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: status %d: %s; rotate the token before TTL expires", resp.StatusCode, string(body))
}
var parsed renewSelfResponse
if err := json.Unmarshal(body, &parsed); err != nil {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: parse: %w; rotate the token before TTL expires", err)
}
return renewSelfResult{
NewTTL: time.Duration(parsed.Auth.LeaseDuration) * time.Second,
Renewable: parsed.Auth.Renewable,
}, nil
}
// Start kicks off the renew-self goroutine. Implements
// issuer.Lifecycle. Returns nil on success (goroutine running) or an
// error if the initial lookupSelf failed (no goroutine spawned).
//
// The initial cadence is computed at startup as TTL/2 (floored at
// minRenewInterval). Each successful renewal updates the in-memory
// TTL and the goroutine resets its ticker to the new TTL/2 — so a
// short bootstrap token that gets renewed up to a longer Max TTL
// shifts to the longer cadence automatically.
//
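// On `renewable: false` at the initial lookup, Start records
// not_renewable, logs a WARN, and returns nil without spawning the
// goroutine. On `renewable: false` from a later renewal, the loop
// logs a WARN and exits. In both cases the operator must rotate the
// Vault token before its current TTL expires.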
func (c *Connector) Start(ctx context.Context) error {
c.renewMu.Lock()
if c.renewStarted {
c.renewMu.Unlock()
return nil // idempotent: already running
}
if c.config == nil || c.config.Token.IsEmpty() {
c.renewMu.Unlock()
return fmt.Errorf("vault token-renewal Start: token not configured (call ValidateConfig first)")
}
c.renewMu.Unlock()
// Initial lookup — short timeout so a misconfigured Vault address
// fails Start fast rather than blocking the server's startup
// sequence indefinitely. The renewal goroutine itself uses the
// per-tick context for its own deadlines.
lookupCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
ttl, renewable, err := c.lookupSelf(lookupCtx)
cancel()
if err != nil {
return fmt.Errorf("vault token-renewal Start: initial lookupSelf: %w", err)
}
c.logger.Info("vault token-renewal loop starting",
"addr", c.config.Addr,
"ttl_seconds", int(ttl.Seconds()),
"renewable", renewable,
)
if !renewable {
// Don't spawn the goroutine — the token is already non-
// renewable. Surface via the metric so operators see it in
// Grafana even before any tick fires.
c.recordRenewal("not_renewable")
c.logger.Warn("vault token is not renewable at startup; renew-self loop will not run — rotate the token before its TTL expires",
"ttl_seconds", int(ttl.Seconds()),
)
return nil
}
// Spawn the goroutine. Use a derived ctx so Stop() can cancel
// independently of the parent.
loopCtx, loopCancel := context.WithCancel(ctx)
done := make(chan struct{})
c.renewMu.Lock()
c.renewStarted = true
c.renewCancel = loopCancel
c.renewDone = done
c.renewMu.Unlock()
interval := computeInterval(ttl)
go c.renewLoop(loopCtx, interval, done)
c.logger.Info("vault token-renewal loop started",
"interval_seconds", int(interval.Seconds()),
)
return nil
}
// Stop blocks until the renew-self goroutine has exited.
// Implements issuer.Lifecycle. Idempotent.
func (c *Connector) Stop() {
c.renewMu.Lock()
cancel := c.renewCancel
done := c.renewDone
started := c.renewStarted
c.renewMu.Unlock()
if !started {
return
}
if cancel != nil {
cancel()
}
if done != nil {
<-done
}
}
// renewLoop is the actual goroutine body. Owns the ticker, the
// in-memory TTL, and the renewable-flag state machine. Exits on
// ctx.Done() or on `renewable: false`.
func (c *Connector) renewLoop(ctx context.Context, initial time.Duration, done chan struct{}) {
defer close(done)
factory := c.renewTickerFactory
if factory == nil {
factory = func(d time.Duration) renewTicker {
return stdTicker{t: time.NewTicker(d)}
}
}
ticker := factory(initial)
currentInterval := initial
defer ticker.Stop()
for {
select {
case <-ctx.Done():
c.logger.Info("vault token-renewal loop stopping (ctx cancelled)")
return
case <-ticker.C():
// Per-tick deadline derived from the current cadence —
// renew calls should comfortably finish in <1s, so a
// budget of min(interval, 30s) is generous.
tickBudget := currentInterval
if tickBudget > 30*time.Second {
tickBudget = 30 * time.Second
}
tickCtx, cancel := context.WithTimeout(ctx, tickBudget)
res, err := c.renewSelf(tickCtx)
cancel()
if err != nil {
c.recordRenewal("failure")
c.logger.Error(err.Error())
// Keep ticking — operator may rotate the token
// out-of-band, or the failure may be transient.
// Stopping on first failure would mean a 1s
// network blip kills the loop for the rest of
// process lifetime.
continue
}
if !res.Renewable {
c.recordRenewal("not_renewable")
c.logger.Warn("vault token is no longer renewable; renew-self loop exiting — rotate the token before its current TTL expires",
"ttl_seconds", int(res.NewTTL.Seconds()),
)
return
}
c.recordRenewal("success")
c.logger.Info("vault token renewed",
"new_ttl_seconds", int(res.NewTTL.Seconds()),
)
// If the new TTL/2 differs meaningfully from the
// current cadence, restart the ticker at the new
// rate. This handles the bootstrap → Max TTL transition
// (short initial TTL renews up to a longer Max TTL,
// which we'd otherwise hammer at the old fast cadence
// for the rest of the process).
newInterval := computeInterval(res.NewTTL)
if differsEnough(currentInterval, newInterval) {
ticker.Stop()
ticker = factory(newInterval)
currentInterval = newInterval
c.logger.Info("vault token-renewal cadence updated",
"new_interval_seconds", int(newInterval.Seconds()),
)
}
}
}
}
// recordRenewal forwards result to the configured RenewalRecorder.
// It holds renewMu only long enough to read the recorder; the
// increment itself happens outside the lock (atomic.Uint64 inside
// VaultRenewalMetrics).
func (c *Connector) recordRenewal(result string) {
c.renewMu.Lock()
rec := c.renewRecorder
c.renewMu.Unlock()
if rec != nil {
rec.RecordRenewal(result)
}
}
// computeInterval returns TTL/2, floored at minRenewInterval to
// avoid degenerate fast cadence when a misconfigured Vault returns
// a tiny TTL.
func computeInterval(ttl time.Duration) time.Duration {
half := ttl / 2
if half < minRenewInterval {
return minRenewInterval
}
return half
}
// differsEnough decides whether to restart the ticker for a new
// cadence. We tolerate ±10% drift to avoid restart-thrash when
// Vault's renewed-lease duration wobbles around the static TTL.
func differsEnough(a, b time.Duration) bool {
if a == 0 || b == 0 {
return a != b
}
delta := a - b
if delta < 0 {
delta = -delta
}
tol := a / 10
if tol < 0 {
tol = -tol
}
return delta > tol
}
// Compile-time assertion that *Connector satisfies the optional
// Lifecycle extension interface. If a future refactor breaks this
// (e.g. drops Stop), the compile error fires here rather than in a
// far-away registry lookup site.
var _ issuer.Lifecycle = (*Connector)(nil)
@@ -0,0 +1,476 @@
package vault
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit. Pins the
// behaviour of the renew-self loop end to end:
//
// 1. cadence — at TTL/2 with a (configurable) deterministic ticker
// so the test isn't wall-clock bound;
// 2. terminate-on-not-renewable — if Vault returns renewable=false,
// the loop exits and the metric records the not_renewable
// result;
// 3. failure-surfaces — the metric counter increments on a 403 and
// the loop keeps ticking (transient blips don't kill it);
// 4. ctx-cancellation — Stop returns within a small budget after
// ctx is cancelled.
//
// These tests live INSIDE the `vault` package (not vault_test) so
// they can substitute the renewTickerFactory seam directly. The
// existing test files in this directory are split into vault_test
// (external, exercises the public API) and the package-internal
// _test.go files (this one) — Go's two-package test convention.
import (
"context"
"encoding/json"
"fmt"
"io"
"log/slog"
"net/http"
"net/http/httptest"
"strings"
"sync"
"sync/atomic"
"testing"
"time"
"github.com/shankar0123/certctl/internal/secret"
)
// fakeTicker is the deterministic ticker the tests inject via
// renewTickerFactory. Tests call Tick() to fire the ticker channel
// at the moment of their choosing — no real time elapses.
type fakeTicker struct {
ch chan time.Time
stopCalls atomic.Uint64
}
func newFakeTicker() *fakeTicker {
return &fakeTicker{ch: make(chan time.Time, 4)}
}
func (f *fakeTicker) C() <-chan time.Time { return f.ch }
func (f *fakeTicker) Stop() { f.stopCalls.Add(1) }
func (f *fakeTicker) Tick() { f.ch <- time.Now() }
// renewMockHandler is the per-test httptest handler shape. Tests
// configure it to control lookup-self / renew-self responses.
type renewMockHandler struct {
mu sync.Mutex
lookupTTLSeconds int
lookupRenewable bool
renewSelfStatuses []renewSelfStub // queued; consumed in order
renewSelfCalls atomic.Uint64
lookupSelfCalls atomic.Uint64
noMoreCalls func() // called if a queued stub is exhausted
}
// renewSelfStub configures one expected renew-self response.
type renewSelfStub struct {
status int
body string // override the canned body
leaseDuration int
renewable bool
}
func (h *renewMockHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
switch r.URL.Path {
case "/v1/auth/token/lookup-self":
h.lookupSelfCalls.Add(1)
h.mu.Lock()
ttl, renewable := h.lookupTTLSeconds, h.lookupRenewable
h.mu.Unlock()
body := fmt.Sprintf(`{"data":{"ttl":%d,"renewable":%t}}`, ttl, renewable)
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, _ = io.WriteString(w, body)
case "/v1/auth/token/renew-self":
h.renewSelfCalls.Add(1)
h.mu.Lock()
var stub renewSelfStub
if len(h.renewSelfStatuses) > 0 {
stub = h.renewSelfStatuses[0]
h.renewSelfStatuses = h.renewSelfStatuses[1:]
} else {
h.mu.Unlock()
if h.noMoreCalls != nil {
h.noMoreCalls()
}
http.Error(w, "no more renew-self stubs configured", http.StatusInternalServerError)
return
}
h.mu.Unlock()
w.Header().Set("Content-Type", "application/json")
status := stub.status
if status == 0 {
status = http.StatusOK
}
w.WriteHeader(status)
body := stub.body
if body == "" {
body = fmt.Sprintf(`{"auth":{"lease_duration":%d,"renewable":%t}}`, stub.leaseDuration, stub.renewable)
}
_, _ = io.WriteString(w, body)
default:
http.NotFound(w, r)
}
}
// quietTestLogger returns a logger that discards everything below
// ERROR. Tests assert via the recorder + ticker hooks; per-tick
// INFO/WARN logs would clutter the test output.
func quietTestLogger() *slog.Logger {
return slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{Level: slog.LevelError}))
}
// mockRecorder counts RecordRenewal calls per result. Replaces the
// production *service.VaultRenewalMetrics for unit-test isolation.
type mockRecorder struct {
mu sync.Mutex
counts map[string]uint64
}
func newMockRecorder() *mockRecorder {
return &mockRecorder{counts: make(map[string]uint64)}
}
func (m *mockRecorder) RecordRenewal(result string) {
m.mu.Lock()
defer m.mu.Unlock()
m.counts[result]++
}
func (m *mockRecorder) get(result string) uint64 {
m.mu.Lock()
defer m.mu.Unlock()
return m.counts[result]
}
// buildTestConnector constructs a vault.Connector pointed at the
// httptest server, with the deterministic ticker factory and the
// supplied recorder.
func buildTestConnector(srvURL string, ticker *fakeTicker, rec RenewalRecorder) *Connector {
c := New(&Config{
Addr: srvURL,
Token: secret.NewRefFromString("hvs.test-token"),
Mount: "pki",
Role: "web",
}, quietTestLogger())
c.renewTickerFactory = func(d time.Duration) renewTicker { return ticker }
if rec != nil {
c.SetRenewalRecorder(rec)
}
return c
}
// TestVault_RenewLoop_TickAtHalfTTL pins that the loop calls
// renew-self once per ticker fire. Cadence assertion is via the
// fake ticker: Tick three times → expect three renew-self calls.
// (Production cadence — TTL/2 — is verified by assertions on
// computeInterval below; substituting the ticker here keeps the
// test wall-clock-free.)
func TestVault_RenewLoop_TickAtHalfTTL(t *testing.T) {
mock := &renewMockHandler{
lookupTTLSeconds: 4, // nominal TTL/2 = 2s; the fake ticker drives the actual cadence
lookupRenewable: true,
renewSelfStatuses: []renewSelfStub{
{leaseDuration: 4, renewable: true},
{leaseDuration: 4, renewable: true},
{leaseDuration: 4, renewable: true},
},
}
srv := httptest.NewServer(mock)
defer srv.Close()
ticker := newFakeTicker()
rec := newMockRecorder()
c := buildTestConnector(srv.URL, ticker, rec)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
if err := c.Start(ctx); err != nil {
t.Fatalf("Start: %v", err)
}
defer c.Stop()
if mock.lookupSelfCalls.Load() != 1 {
t.Errorf("expected exactly 1 lookup-self at startup, got %d", mock.lookupSelfCalls.Load())
}
// Fire three ticks; each should drive one renew-self.
for i := 0; i < 3; i++ {
ticker.Tick()
}
// Wait briefly for the goroutine to drain the channel sends.
deadline := time.Now().Add(2 * time.Second)
for time.Now().Before(deadline) {
if rec.get("success") >= 3 {
break
}
time.Sleep(10 * time.Millisecond)
}
if got := rec.get("success"); got != 3 {
t.Errorf("expected 3 success renewals after 3 ticks, got %d", got)
}
if got := rec.get("failure"); got != 0 {
t.Errorf("expected 0 failures, got %d", got)
}
if got := rec.get("not_renewable"); got != 0 {
t.Errorf("expected 0 not_renewable events, got %d", got)
}
if got := mock.renewSelfCalls.Load(); got != 3 {
t.Errorf("expected 3 renew-self HTTP calls, got %d", got)
}
}
// TestVault_RenewLoop_StopsOnNotRenewable pins that the loop exits
// cleanly after Vault returns renewable=false on a renew-self call.
// A second tick is sent after the not-renewable response; the
// goroutine should already be stopped by then so the second tick
// triggers no HTTP call.
func TestVault_RenewLoop_StopsOnNotRenewable(t *testing.T) {
mock := &renewMockHandler{
lookupTTLSeconds: 4,
lookupRenewable: true,
renewSelfStatuses: []renewSelfStub{
{leaseDuration: 4, renewable: true},
{leaseDuration: 4, renewable: false}, // tells loop to stop
},
}
srv := httptest.NewServer(mock)
defer srv.Close()
ticker := newFakeTicker()
rec := newMockRecorder()
c := buildTestConnector(srv.URL, ticker, rec)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
if err := c.Start(ctx); err != nil {
t.Fatalf("Start: %v", err)
}
defer c.Stop()
ticker.Tick() // first renewal — success
ticker.Tick() // second renewal — renewable=false, loop exits
deadline := time.Now().Add(2 * time.Second)
for time.Now().Before(deadline) {
if rec.get("not_renewable") >= 1 {
break
}
time.Sleep(10 * time.Millisecond)
}
if got := rec.get("success"); got != 1 {
t.Errorf("expected 1 success before not_renewable, got %d", got)
}
if got := rec.get("not_renewable"); got != 1 {
t.Errorf("expected exactly 1 not_renewable event, got %d", got)
}
// Confirm the goroutine has already exited: we check the
// renewMu's renewDone channel via Stop. If the loop is alive,
// Stop blocks until ctx is cancelled. If it has already
// exited (which it should), Stop returns near-immediately.
stopDone := make(chan struct{})
go func() {
c.Stop()
close(stopDone)
}()
select {
case <-stopDone:
// expected — goroutine had already exited.
case <-time.After(200 * time.Millisecond):
t.Error("Stop did not return within 200ms after renewable=false — goroutine leaked")
}
}
// TestVault_RenewLoop_FailureSurfacesViaMetric pins that a 403 on
// renew-self bumps the failure counter and the loop keeps ticking
// (transient blips do not kill the loop).
func TestVault_RenewLoop_FailureSurfacesViaMetric(t *testing.T) {
mock := &renewMockHandler{
lookupTTLSeconds: 4,
lookupRenewable: true,
renewSelfStatuses: []renewSelfStub{
{status: http.StatusForbidden, body: `{"errors":["permission denied"]}`},
{leaseDuration: 4, renewable: true}, // loop continues; this tick succeeds
},
}
srv := httptest.NewServer(mock)
defer srv.Close()
ticker := newFakeTicker()
rec := newMockRecorder()
c := buildTestConnector(srv.URL, ticker, rec)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
if err := c.Start(ctx); err != nil {
t.Fatalf("Start: %v", err)
}
defer c.Stop()
ticker.Tick() // first — fails with 403
ticker.Tick() // second — succeeds
deadline := time.Now().Add(2 * time.Second)
for time.Now().Before(deadline) {
if rec.get("failure") >= 1 && rec.get("success") >= 1 {
break
}
time.Sleep(10 * time.Millisecond)
}
if got := rec.get("failure"); got != 1 {
t.Errorf("expected 1 failure after 403, got %d", got)
}
if got := rec.get("success"); got != 1 {
t.Errorf("expected 1 success after recovery, got %d", got)
}
}
// TestVault_RenewLoop_CtxCancellation_StopsCleanly pins that
// cancelling ctx causes the goroutine to exit promptly. Stop()
// blocks on the goroutine's done channel: a return slower than
// 200ms after cancel is flagged as a slow exit, and no return
// within 500ms is treated as a leaked goroutine.
func TestVault_RenewLoop_CtxCancellation_StopsCleanly(t *testing.T) {
mock := &renewMockHandler{
lookupTTLSeconds: 4,
lookupRenewable: true,
renewSelfStatuses: nil, // no ticks expected; ctx will cancel before any
}
srv := httptest.NewServer(mock)
defer srv.Close()
ticker := newFakeTicker()
rec := newMockRecorder()
c := buildTestConnector(srv.URL, ticker, rec)
ctx, cancel := context.WithCancel(context.Background())
if err := c.Start(ctx); err != nil {
t.Fatalf("Start: %v", err)
}
// Cancel ctx; the goroutine should exit on ctx.Done() before
// any tick fires.
start := time.Now()
cancel()
stopDone := make(chan struct{})
go func() {
c.Stop()
close(stopDone)
}()
select {
case <-stopDone:
elapsed := time.Since(start)
if elapsed > 200*time.Millisecond {
t.Errorf("Stop returned after %v — goroutine slow to exit", elapsed)
}
case <-time.After(500 * time.Millisecond):
t.Fatal("Stop did not return within 500ms after ctx cancellation — goroutine leaked")
}
// No renew-self calls should have fired (cancel raced before any tick).
if got := mock.renewSelfCalls.Load(); got != 0 {
t.Errorf("expected 0 renew-self HTTP calls, got %d", got)
}
}
// TestVault_RenewLoop_StartsNothingWhenNotRenewable pins the
// startup short-circuit: if lookup-self returns renewable=false at
// boot, Start does not spawn the goroutine and the metric records
// the not_renewable result so operators see it in Grafana before
// any tick would have fired.
func TestVault_RenewLoop_StartsNothingWhenNotRenewable(t *testing.T) {
mock := &renewMockHandler{
lookupTTLSeconds: 60,
lookupRenewable: false, // already non-renewable at boot
}
srv := httptest.NewServer(mock)
defer srv.Close()
ticker := newFakeTicker()
rec := newMockRecorder()
c := buildTestConnector(srv.URL, ticker, rec)
if err := c.Start(context.Background()); err != nil {
t.Fatalf("Start should not error on initially-non-renewable token; got: %v", err)
}
defer c.Stop()
if got := rec.get("not_renewable"); got != 1 {
t.Errorf("expected 1 not_renewable event from startup short-circuit, got %d", got)
}
// Tick should be a no-op — no goroutine running.
ticker.Tick()
time.Sleep(100 * time.Millisecond)
if got := mock.renewSelfCalls.Load(); got != 0 {
t.Errorf("expected 0 renew-self HTTP calls (loop never started), got %d", got)
}
}
// TestVault_ComputeInterval pins the cadence-derivation rules: TTL/2
// for normal tokens, floored at minRenewInterval for misconfigured
// short TTLs that would otherwise hammer Vault's audit log.
func TestVault_ComputeInterval(t *testing.T) {
tests := []struct {
name string
ttl time.Duration
want time.Duration
}{
{"hour-ttl", time.Hour, 30 * time.Minute},
{"day-ttl", 24 * time.Hour, 12 * time.Hour},
{"floor-applies-tiny", 2 * time.Second, minRenewInterval},
{"floor-applies-zero", 0, minRenewInterval},
}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
got := computeInterval(tc.ttl)
if got != tc.want {
t.Errorf("computeInterval(%v) = %v, want %v", tc.ttl, got, tc.want)
}
})
}
}
// TestVault_RenewSelf_ParseFailure_NamesActionableInError pins that
// failures surface with operator-actionable framing. We test the
// parse-failure path (200 OK with a non-JSON body); the HTTP-failure
// path lives in the same wrap chain.
func TestVault_RenewSelf_ParseFailure_NamesActionableInError(t *testing.T) {
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
_, _ = io.WriteString(w, `not json`)
}))
defer srv.Close()
c := buildTestConnector(srv.URL, newFakeTicker(), nil)
_, err := c.renewSelf(context.Background())
if err == nil {
t.Fatal("expected error from renewSelf with bad JSON, got nil")
}
if !strings.Contains(err.Error(), "vault token renewal failed") {
t.Errorf("expected 'vault token renewal failed' framing in surfaced error; got: %v", err)
}
if !strings.Contains(err.Error(), "rotate the token") {
t.Errorf("expected 'rotate the token' operator-action substring in surfaced error; got: %v", err)
}
}
// The blank-identifier assignment below keeps the json import alive
// when the test file is edited and one of the json-using helpers
// temporarily disappears. Production has no use for this; tests do.
var _ = json.Marshal
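The ticker seam these tests rely on can be sketched in isolation. This is an illustrative reduction, not the connector's code: the `ticker` interface, `manualTicker`, `runLoop`, and `countTicks` are hypothetical names, and the production `renewTicker` interface is only inferred from the methods fakeTicker implements (C and Stop).

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// ticker is a hypothetical stand-in for the renewTicker seam.
type ticker interface {
	C() <-chan time.Time
	Stop()
}

// realTicker adapts time.Ticker for production use.
type realTicker struct{ t *time.Ticker }

func (r realTicker) C() <-chan time.Time { return r.t.C }
func (r realTicker) Stop()               { r.t.Stop() }

// manualTicker fires only when the test calls Tick. The channel is
// unbuffered, so Tick returns only after the loop has received.
type manualTicker struct{ ch chan time.Time }

func (m *manualTicker) C() <-chan time.Time { return m.ch }
func (m *manualTicker) Stop()               {}
func (m *manualTicker) Tick()               { m.ch <- time.Now() }

// runLoop calls onTick once per tick until ctx is cancelled, then
// closes the returned channel.
func runLoop(ctx context.Context, tk ticker, onTick func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		defer tk.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-tk.C():
				onTick()
			}
		}
	}()
	return done
}

// countTicks drives n manual ticks through the loop and returns how
// many times the callback ran.
func countTicks(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	tk := &manualTicker{ch: make(chan time.Time)}
	count := 0
	done := runLoop(ctx, tk, func() { count++ })
	for i := 0; i < n; i++ {
		tk.Tick()
	}
	cancel()
	<-done // the loop has exited; reading count is now race-free
	return count
}

func main() {
	var _ ticker = realTicker{t: time.NewTicker(time.Hour)}
	fmt.Println("callbacks:", countTicks(3))
}
```

The sketch uses an unbuffered channel so each Tick hands off synchronously; the tests in this diff use a buffered channel and poll the recorder with a deadline instead, which tolerates the HTTP round trip between a tick and its recorded result.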
@@ -40,6 +40,18 @@ func (a *IssuerConnectorAdapter) SetMetrics(issuerType string, metrics *Issuance
a.metrics = metrics
}
// Underlying returns the wrapped issuer.Connector so registry-level
// machinery (StartLifecycles / StopLifecycles, Bundle G audit-row
// pairing, future feature-detect interfaces) can reach the concrete
// connector behind the adapter without duplicating the wiring at
// every call site. Returns interface{} rather than issuer.Connector
// so callers do their own type assertion against optional extension
// interfaces (issuer.Lifecycle, etc.) without an import dependency
// fan-out from this package.
func (a *IssuerConnectorAdapter) Underlying() interface{} {
return a.connector
}
// recordIssuance is the metrics-recording side effect at the adapter
// boundary. Bumps the issuance counter (success/failure) and the
// duration histogram; on failure also bumps the failure-by-error-class
@@ -8,8 +8,10 @@ import (
"sync"
"time"
"github.com/shankar0123/certctl/internal/connector/issuer"
"github.com/shankar0123/certctl/internal/connector/issuer/acme"
"github.com/shankar0123/certctl/internal/connector/issuer/local"
"github.com/shankar0123/certctl/internal/connector/issuer/vault"
"github.com/shankar0123/certctl/internal/connector/issuerfactory"
"github.com/shankar0123/certctl/internal/crypto"
"github.com/shankar0123/certctl/internal/crypto/signer"
@@ -47,6 +49,14 @@ type IssuerRegistry struct {
// Nil leaves the legacy "ACME revocation by serial requires
// CertificateLookup wiring" error in place for old wiring paths.
acmeCertLookup acme.CertificateLookupRepo
// vaultRenewalMetrics — when set, every freshly-constructed
// *vault.Connector is wired with SetRenewalRecorder so the
// renew-self loop bumps the certctl_vault_token_renewals_total
// counter. Closes Top-10 fix #5 of the 2026-05-03 audit. Nil
// leaves the no-op recorder in place (no metric emission, but
// the loop still runs).
vaultRenewalMetrics *VaultRenewalMetrics
}
// LocalIssuerDeps groups the optional dependencies that the local
@@ -92,6 +102,23 @@ func (r *IssuerRegistry) SetIssuanceMetrics(m *IssuanceMetrics) {
r.metrics = m
}
// SetVaultRenewalMetrics wires the per-(result) counter table for
// the Vault PKI renew-self loop. Every *vault.Connector constructed
// by Rebuild after this call records its renewal results into the
// supplied metrics. Closes Top-10 fix #5 of the 2026-05-03
// issuer-coverage audit.
//
// The same instance must also be registered with the metrics
// handler via MetricsHandler.SetVaultRenewals so the Prometheus
// exposer emits certctl_vault_token_renewals_total{result=...}.
// cmd/server/main.go owns both wiring sides; tests usually skip
// the Prometheus side and just assert against the snapshot.
func (r *IssuerRegistry) SetVaultRenewalMetrics(m *VaultRenewalMetrics) {
r.mu.Lock()
defer r.mu.Unlock()
r.vaultRenewalMetrics = m
}
// SetACMECertLookup wires the cert-version lookup repo for every
// *acme.Connector constructed by Rebuild. The lookup is used by the
// serial-only revoke path (RevokeCertificate) to recover the leaf-
@@ -228,6 +255,19 @@ func (r *IssuerRegistry) Rebuild(ctx context.Context, configs []*domain.Issuer,
"id", cfg.ID)
}
// Top-10 fix #5 (2026-05-03 audit): wire the renew-self
// metric recorder into every freshly-constructed
// *vault.Connector so its background renewal loop bumps the
// certctl_vault_token_renewals_total counter. Lifecycle
// startup itself is gated by StartLifecycles below — Rebuild
// only does the metric wire here so the recorder is in place
// when StartLifecycles fires.
if vaultConn, ok := connector.(*vault.Connector); ok && r.vaultRenewalMetrics != nil {
vaultConn.SetRenewalRecorder(r.vaultRenewalMetrics)
r.logger.Info("Vault PKI issuer wired with renew-self metric recorder",
"id", cfg.ID)
}
adapter := NewIssuerConnectorAdapter(connector)
// Wire per-issuer-type metrics (audit fix #4) when SetIssuanceMetrics
// was called. The adapter is the IssuerConnector interface; type-
@@ -273,3 +313,93 @@ func (r *IssuerRegistry) Rebuild(ctx context.Context, configs []*domain.Issuer,
return nil
}
// StartLifecycles iterates the registry and calls Start(ctx) on every
// connector that implements the optional issuer.Lifecycle extension
// interface. Connectors without lifecycle work (almost all of them)
// are silently skipped.
//
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit. Today only
// VaultPKI implements Lifecycle (for its renew-self loop). New
// lifecycle-bearing connectors plug in by implementing the
// interface — this method picks them up automatically.
//
// Per-connector Start failures are LOGGED, not returned, so a single
// misconfigured Vault doesn't block server startup. Operators see
// the failure in the slog stream and via the
// certctl_vault_token_renewals_total{result="not_renewable"} or
// {result="failure"} counter.
//
// The IssuerConnectorAdapter wraps the raw connector; unwrapAdapter
// reaches the underlying connector through the adapter's
// Underlying() method. If the adapter shape changes, the Lifecycle
// assertion silently no-ops and lifecycle wiring stops working;
// TestRegistry_StartLifecycles_VaultStarted covers that case.
func (r *IssuerRegistry) StartLifecycles(ctx context.Context) {
r.mu.RLock()
conns := make(map[string]IssuerConnector, len(r.issuers))
for id, c := range r.issuers {
conns[id] = c
}
r.mu.RUnlock()
for id, c := range conns {
raw := unwrapAdapter(c)
if raw == nil {
continue
}
lc, ok := raw.(issuer.Lifecycle)
if !ok {
continue
}
if err := lc.Start(ctx); err != nil {
r.logger.Warn("issuer lifecycle Start failed",
"id", id,
"error", err,
)
continue
}
r.logger.Info("issuer lifecycle Start succeeded", "id", id)
}
}
// StopLifecycles iterates the registry and calls Stop() on every
// connector that implements the optional issuer.Lifecycle extension
// interface. Each Stop blocks until the connector's background work
// has fully exited; the loop is sequential rather than parallel so
// shutdown ordering is deterministic in operator logs.
//
// Idempotent. Safe to call after StartLifecycles failed or wasn't
// called.
func (r *IssuerRegistry) StopLifecycles() {
r.mu.RLock()
conns := make([]IssuerConnector, 0, len(r.issuers))
for _, c := range r.issuers {
conns = append(conns, c)
}
r.mu.RUnlock()
for _, c := range conns {
raw := unwrapAdapter(c)
if raw == nil {
continue
}
if lc, ok := raw.(issuer.Lifecycle); ok {
lc.Stop()
}
}
}
// unwrapAdapter returns the underlying issuer.Connector held by an
// IssuerConnectorAdapter, or by any other wrapper that exposes
// Underlying(). If the registry holds a raw connector directly
// (test wiring), it is returned as-is. The result is nil only when
// a wrapper's Underlying() returns nil, which callers skip.
func unwrapAdapter(c IssuerConnector) interface{} {
if a, ok := c.(*IssuerConnectorAdapter); ok {
return a.Underlying()
}
if u, ok := c.(interface{ Underlying() interface{} }); ok {
return u.Underlying()
}
return c
}
@@ -0,0 +1,96 @@
package service
import "sync/atomic"
// VaultRenewalMetrics is a thread-safe counter table for the
// Vault PKI token-renewal loop. Top-10 fix #5 of the 2026-05-03
// issuer-coverage audit. Closes the operator-observability gap
// where long-lived deploys would silently lose Vault auth at TTL
// expiry.
//
// Cardinality is fixed at three series — result is a closed enum:
//
// {success} — the renew-self call succeeded.
// {failure} — the renew-self call returned a non-2xx,
// parse failure, or HTTP error. Loop keeps
// ticking; transient blips don't kill it.
// {not_renewable} — Vault returned renewable=false (or returned
// it at startup lookup-self). Loop has exited;
// operator must rotate the token before its
// current TTL expires.
//
// One instance is shared across every Vault PKI Connector built by
// IssuerRegistry.Rebuild — the recorder pointer is wired by
// IssuerRegistry.SetVaultRenewalMetrics + the post-factory wiring
// step inside Rebuild. The same instance is also wired into
// MetricsHandler.SetVaultRenewals so the Prometheus exposer emits
// certctl_vault_token_renewals_total{result=...}.
type VaultRenewalMetrics struct {
success atomic.Uint64
failure atomic.Uint64
notRenewable atomic.Uint64
}
// NewVaultRenewalMetrics constructs a fresh VaultRenewalMetrics
// with all counters at zero. Pass to IssuerRegistry.SetVaultRenewalMetrics
// (and to MetricsHandler.SetVaultRenewals) to wire up the renewal
// loop's metric path.
func NewVaultRenewalMetrics() *VaultRenewalMetrics {
return &VaultRenewalMetrics{}
}
// RecordRenewal bumps the (result) counter. Implements
// vault.RenewalRecorder. Off-enum result values silently no-op
// (closed-enum discipline matches the IssuanceMetrics pattern;
// we don't dynamically grow the cardinality on a typo).
func (m *VaultRenewalMetrics) RecordRenewal(result string) {
if m == nil {
return
}
switch result {
case "success":
m.success.Add(1)
case "failure":
m.failure.Add(1)
case "not_renewable":
m.notRenewable.Add(1)
}
}
// VaultRenewalSnapshot is the per-result counter view returned by
// Snapshot. It lives in this package so callers can read the
// counters without importing connector state. Field names are
// stable; they mirror the {result=...} label values that operator
// dashboards alert on.
type VaultRenewalSnapshot struct {
Success uint64
Failure uint64
NotRenewable uint64
}
// Snapshot returns a point-in-time read of all three counters.
// Used by tests that need to assert post-tick state. The
// Prometheus exposer in internal/api/handler/metrics.go uses
// SnapshotVaultRenewals (3-tuple form) instead, to avoid an
// import cycle on a shared struct type.
func (m *VaultRenewalMetrics) Snapshot() VaultRenewalSnapshot {
if m == nil {
return VaultRenewalSnapshot{}
}
return VaultRenewalSnapshot{
Success: m.success.Load(),
Failure: m.failure.Load(),
NotRenewable: m.notRenewable.Load(),
}
}
// SnapshotVaultRenewals returns the three counter values directly
// as a tuple. Implements handler.VaultRenewalSnapshotter; used by
// the Prometheus exposer. Order is fixed: success, failure,
// not_renewable.
func (m *VaultRenewalMetrics) SnapshotVaultRenewals() (success, failure, notRenewable uint64) {
if m == nil {
return 0, 0, 0
}
return m.success.Load(), m.failure.Load(), m.notRenewable.Load()
}