mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 22:11:38 +00:00
vault: add automatic token renewal at TTL/2 + Prometheus metric
Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see cowork/issuer-coverage-audit-2026-05-03/RESULTS.md).

Pre-fix, the VaultPKI adapter authenticated with a static token and never called renew-self. Long-lived deploys hit token expiry; the first operator-visible signal was failed cert renewals on production targets.

This commit:

1. Connector.Start(ctx) spawns a goroutine that calls POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a one-shot lookup-self at startup). Honours ctx.Done() for graceful shutdown via a per-loop done channel + Stop().
2. On `renewable: false` response (initial lookup OR any subsequent renewal), the loop emits a WARN, increments the not_renewable counter, and exits. The operator must rotate the token before Vault's Max TTL elapses.
3. New Prometheus counter certctl_vault_token_renewals_total with labels result={success,failure,not_renewable}. Registered alongside existing certctl_issuance_* counters in internal/api/handler/metrics.go.
4. ERROR-level logging on renewal failure with operator-actionable substring ("vault token renewal failed; rotate the token before TTL expires") so journalctl + grep find it. Loop keeps ticking after a failure — transient blips don't kill it.

New optional issuer.Lifecycle interface:

    type Lifecycle interface {
        Start(ctx context.Context) error
        Stop()
    }

Connectors that hold no background goroutines (almost all of them) do not implement this — IssuerRegistry.StartLifecycles / StopLifecycles feature-detect via type assertion. New lifecycle-bearing connectors plug in by implementing the interface; no further registry plumbing required.

Wiring (cmd/server/main.go):
- service.NewVaultRenewalMetrics() instance is shared between issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built by Rebuild get a recorder) and metricsHandler.SetVaultRenewals (so the Prometheus exposer emits the new series).
- issuerRegistry.StartLifecycles(ctx) is called after issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles is paired so goroutines exit cleanly on signal.
- IssuerConnectorAdapter.Underlying() exposes the wrapped issuer.Connector so registry-level machinery can reach the concrete connector behind the adapter without duplicating the wiring at every call site.

Tests (internal/connector/issuer/vault/vault_renew_test.go):
- TestVault_RenewLoop_TickAtHalfTTL — three ticks → three renewals, all "success".
- TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns renewable=false, loop exits, third tick fires no HTTP call.
- TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403 bumps "failure", second renewal succeeds → loop kept ticking.
- TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns within 200ms after ctx cancel.
- TestVault_RenewLoop_StartsNothingWhenNotRenewable — token already non-renewable at boot ⇒ no goroutine, "not_renewable" metric increments at startup so operators see it in Grafana.
- TestVault_ComputeInterval — 4 cases pinning TTL/2 + minRenewInterval floor.
- TestVault_RenewSelf_ParseFailure_NamesActionableInError — surfaced error contains "vault token renewal failed" + "rotate the token".

Cadence is dynamic — every successful renewal re-derives TTL/2 from the renewed lease's lease_duration, so a short bootstrap token that gets renewed up to a longer Max TTL shifts to the longer cadence automatically (defends against degenerate fast ticking on a token whose Max TTL is far longer than its initial TTL).

Documentation:
- docs/connectors.md Vault PKI section gains "Token TTL + automatic renewal" subsection (operator-facing: cadence, metric, renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
- AppRole / Kubernetes / AWS IAM auth methods (different renewal semantics).
- Hot-reload of rotated token from disk (operator restarts today; future: GUI/MCP issuer-update path triggers Rebuild which Stops the old connector and Starts the new one).
- Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md itself: "no longer maintains a hand-edited per-version changelog; per-release notes are auto-generated from commit messages between consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/... ./internal/connector/issuer/vault/... ./cmd/server/... clean.
- go test -short -count=1 ./internal/connector/issuer/vault/... ./internal/service/... ./internal/api/handler/... green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval' ./internal/connector/issuer/vault/... green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md Top-10 fix #5.
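
The shared-instance wiring described in the message can be sketched as one object with a write side and a read side. This is a minimal sketch under stated assumptions, not the code from this commit: `renewalMetrics` is a hypothetical stand-in for service.VaultRenewalMetrics, whose definition is not part of this diff. Only the two method signatures, RecordRenewal(result string) and SnapshotVaultRenewals(), are taken from the commit.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// renewalMetrics is a hypothetical stand-in for service.VaultRenewalMetrics.
// One instance would be handed to both the connector (writer) and the
// metrics handler (reader).
type renewalMetrics struct {
	success      atomic.Uint64
	failure      atomic.Uint64
	notRenewable atomic.Uint64
}

// RecordRenewal is the write side, called from the renewal goroutine.
// Results outside the closed enum are ignored (an assumption of this
// sketch) so the exported label set cannot grow.
func (m *renewalMetrics) RecordRenewal(result string) {
	switch result {
	case "success":
		m.success.Add(1)
	case "failure":
		m.failure.Add(1)
	case "not_renewable":
		m.notRenewable.Add(1)
	}
}

// SnapshotVaultRenewals is the read side, called on each Prometheus scrape.
func (m *renewalMetrics) SnapshotVaultRenewals() (success, failure, notRenewable uint64) {
	return m.success.Load(), m.failure.Load(), m.notRenewable.Load()
}

func main() {
	m := &renewalMetrics{}
	m.RecordRenewal("success")
	m.RecordRenewal("success")
	m.RecordRenewal("failure")

	s, f, n := m.SnapshotVaultRenewals()
	fmt.Printf("certctl_vault_token_renewals_total{result=%q} %d\n", "success", s)
	fmt.Printf("certctl_vault_token_renewals_total{result=%q} %d\n", "failure", f)
	fmt.Printf("certctl_vault_token_renewals_total{result=%q} %d\n", "not_renewable", n)
}
```

Returning three plain uint64 values keeps the reader interface free of shared types, which is the import-cycle argument the diff makes for VaultRenewalSnapshotter.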
@@ -60,6 +60,28 @@ type DeployCounterSnapshotter interface {
 // reverse import would create a cycle. The exposer below takes the
 // types via the interface defined in service.

+// VaultRenewalSnapshotter is the surface MetricsHandler consumes
+// to emit the certctl_vault_token_renewals_total{result=...}
+// counter. *service.VaultRenewalMetrics satisfies this; cmd/server
+// passes the same instance into IssuerRegistry.SetVaultRenewalMetrics
+// (so Vault connectors record results) AND into
+// MetricsHandler.SetVaultRenewals (so the Prometheus exposer reads
+// the counters).
+//
+// Returns three counter values directly (rather than a shared struct
+// type) so service can satisfy this without an import cycle —
+// handler already imports service for IssuanceMetricsSnapshotter,
+// but service does not import handler. A method that returns
+// (uint64, uint64, uint64) needs no shared type.
+//
+// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit.
+type VaultRenewalSnapshotter interface {
+	// SnapshotVaultRenewals returns success, failure, and
+	// not_renewable counters as point-in-time reads. Order is fixed
+	// for the exposer — matches the Prometheus label order.
+	SnapshotVaultRenewals() (success, failure, notRenewable uint64)
+}
+
 // MetricsHandler handles HTTP requests for metrics.
 // Supports both JSON format (GET /api/v1/metrics) and Prometheus exposition format
 // (GET /api/v1/metrics/prometheus) for integration with Prometheus, Grafana, Datadog, etc.
@@ -79,6 +101,10 @@ type MetricsHandler struct {
 	// imports service for admin_est.go etc., so service can't import
 	// handler back).
 	issuanceCounters service.IssuanceMetricsSnapshotter
+	// Vault PKI token-renewal counters. Top-10 fix #5 of the
+	// 2026-05-03 issuer-coverage audit. nil disables emission of
+	// certctl_vault_token_renewals_total{result=...}.
+	vaultRenewals VaultRenewalSnapshotter
 }

 // NewMetricsHandler creates a new MetricsHandler with a service dependency.
@@ -112,6 +138,13 @@ func (h *MetricsHandler) SetIssuanceCounters(c service.IssuanceMetricsSnapshotte
 	h.issuanceCounters = c
 }

+// SetVaultRenewals wires the Vault PKI token-renewal counter table
+// for the Prometheus exposition. nil disables the block. Closes
+// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit.
+func (h *MetricsHandler) SetVaultRenewals(c VaultRenewalSnapshotter) {
+	h.vaultRenewals = c
+}
+
 // MetricsResponse represents the JSON metrics response for V2.
 type MetricsResponse struct {
 	Gauge MetricsGauge `json:"gauge"`
@@ -424,6 +457,20 @@ func (h MetricsHandler) GetPrometheusMetrics(w http.ResponseWriter, r *http.Requ
 			fmt.Fprintf(w, "certctl_issuance_failures_total{issuer_type=%q,error_class=%q} %d\n", f.IssuerType, f.ErrorClass, f.Count)
 		}
 	}
+
+	// Vault PKI token-renewal counters. Top-10 fix #5 of the
+	// 2026-05-03 issuer-coverage audit. Operators alert on
+	// certctl_vault_token_renewals_total{result="failure"} > 0 or
+	// {result="not_renewable"} > 0 to catch token expiry before
+	// issuance breaks. Closed enum: 3 series.
+	if h.vaultRenewals != nil {
+		success, failure, notRenewable := h.vaultRenewals.SnapshotVaultRenewals()
+		fmt.Fprintf(w, "\n# HELP certctl_vault_token_renewals_total Vault PKI token renew-self results. result is a closed enum: success, failure, not_renewable.\n")
+		fmt.Fprintf(w, "# TYPE certctl_vault_token_renewals_total counter\n")
+		fmt.Fprintf(w, "certctl_vault_token_renewals_total{result=%q} %d\n", "success", success)
+		fmt.Fprintf(w, "certctl_vault_token_renewals_total{result=%q} %d\n", "failure", failure)
+		fmt.Fprintf(w, "certctl_vault_token_renewals_total{result=%q} %d\n", "not_renewable", notRenewable)
+	}
 }

 // formatLE formats a histogram bucket boundary the way Prometheus
@@ -0,0 +1,40 @@
+package issuer
+
+import "context"
+
+// Lifecycle is an OPTIONAL extension interface for issuer connectors that
+// need to run long-running background work bound to a context. Connectors
+// that hold no background goroutines (almost all of them) do not implement
+// this interface and the registry feature-detects via type assertion.
+//
+// Concrete users today (2026-05-03):
+//   - VaultPKI: periodic POST /v1/auth/token/renew-self at TTL/2 cadence
+//     so long-lived deploys don't hit token expiry.
+//
+// The lifecycle contract is deliberately small. Connectors that need
+// per-tick state, retries, or cross-tick cancellation handle all of that
+// internally; the registry's job is just "kick off background work
+// once" and "block until it cleanly exits". Keeping the interface this
+// small means new lifecycle-bearing connectors don't have to touch the
+// registry plumbing — they implement Start/Stop and the existing
+// IssuerRegistry.StartLifecycles / StopLifecycles wiring picks them up
+// automatically.
+//
+// Start MUST be non-blocking — spawn a goroutine and return immediately.
+// Returning an error means startup failed; the registry logs the error
+// and continues. Stop MUST block until the goroutine has fully exited;
+// callers rely on this for graceful shutdown ordering.
+type Lifecycle interface {
+	// Start kicks off any long-running background work bound to ctx.
+	// Returns nil on successful startup; the goroutine continues until
+	// ctx is cancelled or Stop is called. Returns a non-nil error if
+	// startup itself failed (e.g. precondition not met) — the goroutine
+	// did NOT start and Stop need not be called.
+	Start(ctx context.Context) error
+
+	// Stop blocks until the background work has fully exited. Safe to
+	// call after Start returned an error or wasn't called at all.
+	// Idempotent — multiple Stop calls return immediately after the
+	// first.
+	Stop()
+}
@@ -29,6 +29,7 @@ import (
 	"log/slog"
 	"net/http"
 	"strings"
+	"sync"
 	"time"

 	"github.com/shankar0123/certctl/internal/connector/issuer"
@@ -72,6 +73,32 @@ type Connector struct {
 	config     *Config
 	logger     *slog.Logger
 	httpClient *http.Client
+
+	// Token-renewal loop fields. Top-10 fix #5 of the 2026-05-03
+	// issuer-coverage audit. Long-lived certctl-server deploys hit
+	// Vault token expiry; the loop calls /v1/auth/token/renew-self at
+	// TTL/2 cadence so the integration stays alive up to Vault's
+	// configured Max TTL. See vault_renew.go for Start / Stop /
+	// renewSelf / lookupSelf.
+	//
+	// renewMu guards renewStarted, renewCancel, renewDone, and
+	// renewRecorder. The renewal goroutine gets its ctx and done
+	// channel as arguments rather than reading renewCancel / renewDone.
+	renewMu       sync.Mutex
+	renewStarted  bool            // true after Start spawned the goroutine
+	renewCancel   func()          // cancels the goroutine's ctx
+	renewDone     chan struct{}   // closed when goroutine exits
+	renewRecorder RenewalRecorder // optional metric sink (defaults to no-op)
+
+	// renewTickerFactory lets tests substitute a deterministic ticker
+	// implementation for cadence assertions. Production callers leave
+	// this nil and the loop uses time.NewTicker.
+	renewTickerFactory func(d time.Duration) renewTicker
+
+	// renewClient is the HTTP client used for renew-self / lookup-self.
+	// Defaults to httpClient; a separate seam lets tests inject an
+	// httptest.Server-bound client without disturbing the issuance
+	// path's client.
+	renewClient *http.Client
 }

 // New creates a new Vault PKI connector with the given configuration and logger.
@@ -85,13 +112,36 @@ func New(config *Config, logger *slog.Logger) *Connector {
 		}
 	}

-	return &Connector{
-		config: config,
-		logger: logger,
-		httpClient: &http.Client{
-			Timeout: 30 * time.Second,
-		},
+	httpClient := &http.Client{
+		Timeout: 30 * time.Second,
+	}
+	return &Connector{
+		config:        config,
+		logger:        logger,
+		httpClient:    httpClient,
+		renewClient:   httpClient,
+		renewRecorder: noopRenewalRecorder{},
 	}
 }

+// SetRenewalRecorder wires a metric sink for the renew-self loop. The
+// recorder's RecordRenewal(result string) is called with one of the
+// enum values "success", "failure", or "not_renewable" on every tick.
+// Pass nil to disable recording. Safe to call before Start; calling
+// after Start has no effect on already-emitted increments.
+//
+// The interface lives in this package (not internal/service) to avoid
+// an import cycle: vault is a connector package that the service-layer
+// IssuerRegistry imports. The service-layer concrete type
+// (*service.VaultRenewalMetrics) satisfies this interface and is wired
+// in cmd/server/main.go.
+func (c *Connector) SetRenewalRecorder(r RenewalRecorder) {
+	if r == nil {
+		r = noopRenewalRecorder{}
+	}
+	c.renewMu.Lock()
+	defer c.renewMu.Unlock()
+	c.renewRecorder = r
+}
+
 // vaultResponse is the standard Vault API response wrapper.
@@ -0,0 +1,410 @@
+package vault
+
+// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit. Pre-fix,
+// Vault PKI authenticated via a static token and never called
+// renew-self; long-lived deploys hit token expiry and started failing
+// silently — the operator's first signal was failed renewals on
+// production targets. This file adds:
+//
+//  1. Connector.Start(ctx) — spawns a goroutine that calls
+//     POST /v1/auth/token/renew-self at TTL/2 cadence (computed
+//     from a one-shot lookupSelf at startup).
+//  2. Connector.Stop() — cancels the goroutine's context and blocks
+//     until it has exited. Idempotent.
+//  3. Connector.renewSelf(ctx) — the per-tick HTTP call.
+//  4. Connector.lookupSelf(ctx) — a one-shot startup probe to learn
+//     the current TTL + renewable flag.
+//
+// On a `renewable: false` response, the loop logs a WARN and exits
+// cleanly; once Vault has decided the token is no longer renewable
+// (typically Max TTL reached), retrying is what gets certctl-server
+// flagged in the Vault audit log as a misbehaving client.
+
+import (
+	"bytes"
+	"context"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net/http"
+	"time"
+
+	"github.com/shankar0123/certctl/internal/connector/issuer"
+)
+
+// minRenewInterval guards against degenerate fast cadence when a
+// misconfigured Vault returns a tiny TTL. 5s is short enough that
+// the floor rarely applies in practice but long enough that we don't
+// hammer Vault's audit log with renew-self calls if something goes
+// sideways. Defensive only; production tokens typically have TTL ≥ 30m.
+const minRenewInterval = 5 * time.Second
+
+// RenewalRecorder is the metric-sink surface the renew-self loop
+// uses. result is one of: "success", "failure", "not_renewable".
+// Implementations MUST be goroutine-safe — RecordRenewal is called
+// from the renewal loop's own goroutine.
+//
+// service.VaultRenewalMetrics satisfies this interface; cmd/server
+// wires the same instance into the registry (which forwards to the
+// connector via SetRenewalRecorder) and into the metrics handler
+// (for Prometheus exposition).
+type RenewalRecorder interface {
+	RecordRenewal(result string)
+}
+
+// noopRenewalRecorder is the zero-cost default. Used until
+// SetRenewalRecorder wires a real metric sink (production) or in
+// unit tests that don't care about metrics.
+type noopRenewalRecorder struct{}
+
+func (noopRenewalRecorder) RecordRenewal(string) {}
+
+// renewTicker is the small surface the renewal loop uses from
+// time.Ticker, extracted so tests can swap in a deterministic
+// implementation that fires on cue. Production: time.NewTicker.
+type renewTicker interface {
+	C() <-chan time.Time
+	Stop()
+}
+
+// stdTicker is the production implementation, a thin wrapper around
+// *time.Ticker that exposes its C channel via a method so it
+// satisfies the renewTicker interface (interfaces can only require
+// methods, and time.Ticker exposes C as a struct field).
+type stdTicker struct{ t *time.Ticker }
+
+func (s stdTicker) C() <-chan time.Time { return s.t.C }
+func (s stdTicker) Stop()               { s.t.Stop() }
+
+// lookupSelfResponse is the subset of /v1/auth/token/lookup-self we
+// consume. Vault returns many other fields (policies, accessor, …)
+// that are irrelevant to the renewal loop.
+type lookupSelfResponse struct {
+	Data struct {
+		TTL       int  `json:"ttl"`       // seconds remaining on the token
+		Renewable bool `json:"renewable"` // whether the token can be renewed
+	} `json:"data"`
+}
+
+// renewSelfResponse is the subset of /v1/auth/token/renew-self we
+// consume. Per Vault's HTTP API, the renewed token's lease info
+// lands in `auth.lease_duration` and `auth.renewable`.
+type renewSelfResponse struct {
+	Auth struct {
+		LeaseDuration int  `json:"lease_duration"`
+		Renewable     bool `json:"renewable"`
+	} `json:"auth"`
+}
+
+// lookupSelf calls GET /v1/auth/token/lookup-self and returns the
+// remaining TTL + the renewable flag. Used by Start to compute the
+// initial tick cadence.
+func (c *Connector) lookupSelf(ctx context.Context) (ttl time.Duration, renewable bool, err error) {
+	if c.config == nil || c.config.Token.IsEmpty() {
+		return 0, false, fmt.Errorf("vault token-renewal lookupSelf: token not configured")
+	}
+
+	url := c.config.Addr + "/v1/auth/token/lookup-self"
+	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
+	if err != nil {
+		return 0, false, fmt.Errorf("vault token-renewal lookupSelf request build: %w", err)
+	}
+	if err := c.config.Token.Use(func(buf []byte) error {
+		req.Header.Set("X-Vault-Token", string(buf))
+		return nil
+	}); err != nil {
+		return 0, false, fmt.Errorf("vault token-renewal lookupSelf token use: %w", err)
+	}
+
+	resp, err := c.renewClient.Do(req)
+	if err != nil {
+		return 0, false, fmt.Errorf("vault token-renewal lookupSelf HTTP: %w", err)
+	}
+	defer resp.Body.Close()
+
+	body, err := io.ReadAll(resp.Body)
+	if err != nil {
+		return 0, false, fmt.Errorf("vault token-renewal lookupSelf body read: %w", err)
+	}
+	if resp.StatusCode != http.StatusOK {
+		return 0, false, fmt.Errorf("vault token-renewal lookupSelf returned status %d: %s", resp.StatusCode, string(body))
+	}
+
+	var parsed lookupSelfResponse
+	if err := json.Unmarshal(body, &parsed); err != nil {
+		return 0, false, fmt.Errorf("vault token-renewal lookupSelf parse: %w", err)
+	}
+
+	return time.Duration(parsed.Data.TTL) * time.Second, parsed.Data.Renewable, nil
+}
+
+// renewSelfResult is returned by renewSelf — it lets the loop both
+// update the in-memory TTL AND react to a renewable=false flip on
+// the same call without an extra round-trip.
+type renewSelfResult struct {
+	NewTTL    time.Duration
+	Renewable bool
+}
+
+// renewSelf calls POST /v1/auth/token/renew-self with an empty body
+// (Vault accepts `{}`) and returns the renewed lease's TTL +
+// renewable flag. The caller is responsible for stopping the loop
+// when Renewable goes false.
+func (c *Connector) renewSelf(ctx context.Context) (renewSelfResult, error) {
+	if c.config == nil || c.config.Token.IsEmpty() {
+		return renewSelfResult{}, fmt.Errorf("vault token renewal failed: token not configured; rotate the token before TTL expires")
+	}
+
+	url := c.config.Addr + "/v1/auth/token/renew-self"
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader([]byte(`{}`)))
+	if err != nil {
+		return renewSelfResult{}, fmt.Errorf("vault token renewal failed: request build: %w; rotate the token before TTL expires", err)
+	}
+	req.Header.Set("Content-Type", "application/json")
+	if err := c.config.Token.Use(func(buf []byte) error {
+		req.Header.Set("X-Vault-Token", string(buf))
+		return nil
+	}); err != nil {
+		return renewSelfResult{}, fmt.Errorf("vault token renewal failed: token use: %w; rotate the token before TTL expires", err)
+	}
+
+	resp, err := c.renewClient.Do(req)
+	if err != nil {
+		return renewSelfResult{}, fmt.Errorf("vault token renewal failed: HTTP error: %w; rotate the token before TTL expires", err)
+	}
+	defer resp.Body.Close()
+
+	body, err := io.ReadAll(resp.Body)
+	if err != nil {
+		return renewSelfResult{}, fmt.Errorf("vault token renewal failed: body read: %w; rotate the token before TTL expires", err)
+	}
+	if resp.StatusCode != http.StatusOK {
+		return renewSelfResult{}, fmt.Errorf("vault token renewal failed: status %d: %s; rotate the token before TTL expires", resp.StatusCode, string(body))
+	}
+
+	var parsed renewSelfResponse
+	if err := json.Unmarshal(body, &parsed); err != nil {
+		return renewSelfResult{}, fmt.Errorf("vault token renewal failed: parse: %w; rotate the token before TTL expires", err)
+	}
+
+	return renewSelfResult{
+		NewTTL:    time.Duration(parsed.Auth.LeaseDuration) * time.Second,
+		Renewable: parsed.Auth.Renewable,
+	}, nil
+}
+
+// Start kicks off the renew-self goroutine. Implements
+// issuer.Lifecycle. Returns nil on success (goroutine running) or an
+// error if the initial lookupSelf failed (no goroutine spawned).
+//
+// Initial cadence is computed at startup as TTL/2 (floored at
+// minRenewInterval). Each successful renewal re-derives TTL/2 from the
+// renewed lease, and the goroutine restarts its ticker when the new
+// interval differs by more than 10%, so a short bootstrap token that
+// gets renewed up to a longer Max TTL shifts to the longer cadence
+// automatically.
+//
+// On `renewable: false` at the initial lookup, Start returns nil
+// without spawning the goroutine and logs a WARN. On `renewable: false`
+// from a later renewal, the loop logs a WARN and exits. Either way the
+// operator must rotate the Vault token before its current TTL expires.
+func (c *Connector) Start(ctx context.Context) error {
+	c.renewMu.Lock()
+	if c.renewStarted {
+		c.renewMu.Unlock()
+		return nil // idempotent: already running
+	}
+	if c.config == nil || c.config.Token.IsEmpty() {
+		c.renewMu.Unlock()
+		return fmt.Errorf("vault token-renewal Start: token not configured (call ValidateConfig first)")
+	}
+	c.renewMu.Unlock()
+
+	// Initial lookup — short timeout so a misconfigured Vault address
+	// fails Start fast rather than blocking the server's startup
+	// sequence indefinitely. The renewal goroutine itself uses the
+	// per-tick context for its own deadlines.
+	lookupCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
+	ttl, renewable, err := c.lookupSelf(lookupCtx)
+	cancel()
+	if err != nil {
+		return fmt.Errorf("vault token-renewal Start: initial lookupSelf: %w", err)
+	}
+
+	c.logger.Info("vault token-renewal loop starting",
+		"addr", c.config.Addr,
+		"ttl_seconds", int(ttl.Seconds()),
+		"renewable", renewable,
+	)
+
+	if !renewable {
+		// Don't spawn the goroutine — the token is already
+		// non-renewable. Surface via the metric so operators see it
+		// in Grafana even before any tick fires.
+		c.recordRenewal("not_renewable")
+		c.logger.Warn("vault token is not renewable at startup; renew-self loop will not run — rotate the token before its TTL expires",
+			"ttl_seconds", int(ttl.Seconds()),
+		)
+		return nil
+	}
+
+	// Spawn the goroutine. Use a derived ctx so Stop() can cancel
+	// independently of the parent.
+	loopCtx, loopCancel := context.WithCancel(ctx)
+	done := make(chan struct{})
+
+	c.renewMu.Lock()
+	c.renewStarted = true
+	c.renewCancel = loopCancel
+	c.renewDone = done
+	c.renewMu.Unlock()
+
+	interval := computeInterval(ttl)
+	go c.renewLoop(loopCtx, interval, done)
+
+	c.logger.Info("vault token-renewal loop started",
+		"interval_seconds", int(interval.Seconds()),
+	)
+
+	return nil
+}
+
+// Stop cancels the renew-self goroutine's context and blocks until the
+// goroutine has exited. It returns immediately if Start never spawned
+// one. Implements issuer.Lifecycle. Idempotent.
+func (c *Connector) Stop() {
+	c.renewMu.Lock()
+	cancel := c.renewCancel
+	done := c.renewDone
+	started := c.renewStarted
+	c.renewMu.Unlock()
+
+	if !started {
+		return
+	}
+	if cancel != nil {
+		cancel()
+	}
+	if done != nil {
+		<-done
+	}
+}
+
+// renewLoop is the actual goroutine body. Owns the ticker, the
+// in-memory TTL, and the renewable-flag state machine. Exits on
+// ctx.Done() or on `renewable: false`.
+func (c *Connector) renewLoop(ctx context.Context, initial time.Duration, done chan struct{}) {
+	defer close(done)
+
+	factory := c.renewTickerFactory
+	if factory == nil {
+		factory = func(d time.Duration) renewTicker {
+			return stdTicker{t: time.NewTicker(d)}
+		}
+	}
+
+	ticker := factory(initial)
+	currentInterval := initial
+	// Deferred closure so the ticker that is live at exit gets stopped;
+	// a plain `defer ticker.Stop()` binds the initial ticker and misses
+	// any replacement created by a cadence update below.
+	defer func() { ticker.Stop() }()
+
+	for {
+		select {
+		case <-ctx.Done():
+			c.logger.Info("vault token-renewal loop stopping (ctx cancelled)")
+			return
+		case <-ticker.C():
+			// Per-tick deadline derived from the current cadence —
+			// renew calls should comfortably finish in <1s, so a
+			// budget of min(interval, 30s) is generous.
+			tickBudget := currentInterval
+			if tickBudget > 30*time.Second {
+				tickBudget = 30 * time.Second
+			}
+			tickCtx, cancel := context.WithTimeout(ctx, tickBudget)
+			res, err := c.renewSelf(tickCtx)
+			cancel()
+			if err != nil {
+				c.recordRenewal("failure")
+				c.logger.Error(err.Error())
+				// Keep ticking — operator may rotate the token
+				// out-of-band, or the failure may be transient.
+				// Stopping on first failure would mean a 1s
+				// network blip kills the loop for the rest of
+				// process lifetime.
+				continue
+			}
+			if !res.Renewable {
+				c.recordRenewal("not_renewable")
+				c.logger.Warn("vault token is no longer renewable; renew-self loop exiting — rotate the token before its current TTL expires",
+					"ttl_seconds", int(res.NewTTL.Seconds()),
+				)
+				return
+			}
+			c.recordRenewal("success")
+			c.logger.Info("vault token renewed",
+				"new_ttl_seconds", int(res.NewTTL.Seconds()),
+			)
+
+			// If the new TTL/2 differs meaningfully from the
+			// current cadence, restart the ticker at the new
+			// rate. This handles the bootstrap-→-MaxTTL transition
+			// (short initial TTL renews up to a longer Max TTL,
+			// which we'd otherwise hammer at the old fast cadence
+			// for the rest of the process).
+			newInterval := computeInterval(res.NewTTL)
+			if differsEnough(currentInterval, newInterval) {
+				ticker.Stop()
+				ticker = factory(newInterval)
+				currentInterval = newInterval
+				c.logger.Info("vault token-renewal cadence updated",
+					"new_interval_seconds", int(newInterval.Seconds()),
+				)
+			}
+		}
+	}
+}
+
+// recordRenewal forwards result to the configured RenewalRecorder.
+// It holds renewMu only long enough to read the recorder; the
+// increment itself runs outside the lock (VaultRenewalMetrics uses
+// atomic.Uint64 counters).
+func (c *Connector) recordRenewal(result string) {
+	c.renewMu.Lock()
+	rec := c.renewRecorder
+	c.renewMu.Unlock()
+	if rec != nil {
+		rec.RecordRenewal(result)
+	}
+}
+
+// computeInterval returns TTL/2, floored at minRenewInterval to
+// avoid degenerate fast cadence when a misconfigured Vault returns
+// a tiny TTL.
+func computeInterval(ttl time.Duration) time.Duration {
+	half := ttl / 2
+	if half < minRenewInterval {
+		return minRenewInterval
+	}
+	return half
+}
+
+// differsEnough decides whether to restart the ticker for a new
+// cadence. We tolerate ±10% drift to avoid restart-thrash when
+// Vault's renewed-lease duration wobbles around the static TTL.
+func differsEnough(a, b time.Duration) bool {
+	if a == 0 || b == 0 {
+		return a != b
+	}
+	delta := a - b
+	if delta < 0 {
+		delta = -delta
+	}
+	tol := a / 10
+	if tol < 0 {
+		tol = -tol
+	}
+	return delta > tol
+}
+
+// Compile-time assertion that *Connector satisfies the optional
+// Lifecycle extension interface. If a future refactor breaks this
+// (e.g. drops Stop), the compile error fires here rather than in a
+// far-away registry lookup site.
+var _ issuer.Lifecycle = (*Connector)(nil)
@@ -0,0 +1,476 @@
+package vault
+
+// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit. Pins the
+// behaviour of the renew-self loop end to end:
+//
+//  1. cadence — at TTL/2 with a (configurable) deterministic ticker
+//     so the test isn't wall-clock bound;
+//  2. terminate-on-not-renewable — if Vault returns renewable=false,
+//     the loop exits and the metric records the not_renewable
+//     result;
+//  3. failure-surfaces — the metric counter increments on a 403 and
+//     the loop keeps ticking (transient blips don't kill it);
+//  4. ctx-cancellation — Stop returns within a small budget after
+//     ctx is cancelled.
+//
+// These tests live INSIDE the `vault` package (not vault_test) so
+// they can substitute the renewTickerFactory seam directly. The
+// existing test files in this directory are split into vault_test
+// (external, exercises the public API) and the package-internal
+// _test.go files (this one) — Go's two-package test convention.
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"io"
+	"log/slog"
+	"net/http"
+	"net/http/httptest"
+	"strings"
+	"sync"
+	"sync/atomic"
+	"testing"
+	"time"
+
+	"github.com/shankar0123/certctl/internal/secret"
+)
+
+// fakeTicker is the deterministic ticker the tests inject via
+// renewTickerFactory. Tests call Tick() to fire the ticker channel
+// at the moment of their choosing — no real time elapses.
+type fakeTicker struct {
+	ch        chan time.Time
+	stopCalls atomic.Uint64
+}
+
+func newFakeTicker() *fakeTicker {
+	return &fakeTicker{ch: make(chan time.Time, 4)}
+}
+
+func (f *fakeTicker) C() <-chan time.Time { return f.ch }
+func (f *fakeTicker) Stop()               { f.stopCalls.Add(1) }
+func (f *fakeTicker) Tick()               { f.ch <- time.Now() }
+
+// renewMockHandler is the per-test httptest handler shape. Tests
+// configure it to control lookup-self / renew-self responses.
+type renewMockHandler struct {
+	mu                sync.Mutex
+	lookupTTLSeconds  int
+	lookupRenewable   bool
+	renewSelfStatuses []renewSelfStub // queued; consumed in order
+	renewSelfCalls    atomic.Uint64
+	lookupSelfCalls   atomic.Uint64
+	noMoreCalls       func() // called if a queued stub is exhausted
+}
+
+// renewSelfStub configures one expected renew-self response.
+type renewSelfStub struct {
+	status        int
+	body          string // override the canned body
+	leaseDuration int
+	renewable     bool
+}
+
+func (h *renewMockHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
|
||||
switch r.URL.Path {
|
||||
case "/v1/auth/token/lookup-self":
|
||||
h.lookupSelfCalls.Add(1)
|
||||
h.mu.Lock()
|
||||
ttl, renewable := h.lookupTTLSeconds, h.lookupRenewable
|
||||
h.mu.Unlock()
|
||||
body := fmt.Sprintf(`{"data":{"ttl":%d,"renewable":%t}}`, ttl, renewable)
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
w.WriteHeader(http.StatusOK)
|
||||
_, _ = io.WriteString(w, body)
|
||||
case "/v1/auth/token/renew-self":
|
||||
h.renewSelfCalls.Add(1)
|
||||
h.mu.Lock()
|
||||
var stub renewSelfStub
|
||||
if len(h.renewSelfStatuses) > 0 {
|
||||
stub = h.renewSelfStatuses[0]
|
||||
h.renewSelfStatuses = h.renewSelfStatuses[1:]
|
||||
} else {
|
||||
h.mu.Unlock()
|
||||
if h.noMoreCalls != nil {
|
||||
h.noMoreCalls()
|
||||
}
|
||||
http.Error(w, "no more renew-self stubs configured", http.StatusInternalServerError)
|
||||
return
|
||||
}
|
||||
h.mu.Unlock()
|
||||
|
||||
w.Header().Set("Content-Type", "application/json")
|
||||
status := stub.status
|
||||
if status == 0 {
|
||||
status = http.StatusOK
|
||||
}
|
||||
w.WriteHeader(status)
|
||||
body := stub.body
|
||||
if body == "" {
|
||||
body = fmt.Sprintf(`{"auth":{"lease_duration":%d,"renewable":%t}}`, stub.leaseDuration, stub.renewable)
|
||||
}
|
||||
_, _ = io.WriteString(w, body)
|
||||
default:
|
||||
http.NotFound(w, r)
|
||||
}
|
||||
}
|
||||
|
||||
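// quietTestLogger returns a logger that writes to io.Discard, so no
// log output reaches the test run; the ERROR threshold additionally
// skips lower-level records before they are formatted. Tests assert
// via the recorder + ticker hooks; per-tick INFO/WARN logs would
// clutter the test output.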
|
||||
func quietTestLogger() *slog.Logger {
|
||||
return slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{Level: slog.LevelError}))
|
||||
}
|
||||
|
||||
// mockRecorder counts RecordRenewal calls per result. Replaces the
|
||||
// production *service.VaultRenewalMetrics for unit-test isolation.
|
||||
type mockRecorder struct {
|
||||
mu sync.Mutex
|
||||
counts map[string]uint64
|
||||
}
|
||||
|
||||
func newMockRecorder() *mockRecorder {
|
||||
return &mockRecorder{counts: make(map[string]uint64)}
|
||||
}
|
||||
|
||||
func (m *mockRecorder) RecordRenewal(result string) {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
m.counts[result]++
|
||||
}
|
||||
|
||||
func (m *mockRecorder) get(result string) uint64 {
|
||||
m.mu.Lock()
|
||||
defer m.mu.Unlock()
|
||||
return m.counts[result]
|
||||
}
|
||||
|
||||
// buildTestConnector constructs a vault.Connector pointed at the
|
||||
// httptest server, with the deterministic ticker factory and the
|
||||
// supplied recorder.
|
||||
func buildTestConnector(srvURL string, ticker *fakeTicker, rec RenewalRecorder) *Connector {
|
||||
c := New(&Config{
|
||||
Addr: srvURL,
|
||||
Token: secret.NewRefFromString("hvs.test-token"),
|
||||
Mount: "pki",
|
||||
Role: "web",
|
||||
}, quietTestLogger())
|
||||
c.renewTickerFactory = func(d time.Duration) renewTicker { return ticker }
|
||||
if rec != nil {
|
||||
c.SetRenewalRecorder(rec)
|
||||
}
|
||||
return c
|
||||
}
|
||||
|
||||
// TestVault_RenewLoop_TickAtHalfTTL pins that the loop calls
|
||||
// renew-self once per ticker fire. Cadence assertion is via the
|
||||
// fake ticker: Tick three times → expect three renew-self calls.
|
||||
// (Production cadence — TTL/2 — is verified by assertions on
|
||||
// computeInterval below; substituting the ticker here keeps the
|
||||
// test wall-clock-free.)
|
||||
func TestVault_RenewLoop_TickAtHalfTTL(t *testing.T) {
|
||||
mock := &renewMockHandler{
|
||||
lookupTTLSeconds: 4, // 2s cadence
|
||||
lookupRenewable: true,
|
||||
renewSelfStatuses: []renewSelfStub{
|
||||
{leaseDuration: 4, renewable: true},
|
||||
{leaseDuration: 4, renewable: true},
|
||||
{leaseDuration: 4, renewable: true},
|
||||
},
|
||||
}
|
||||
srv := httptest.NewServer(mock)
|
||||
defer srv.Close()
|
||||
|
||||
ticker := newFakeTicker()
|
||||
rec := newMockRecorder()
|
||||
c := buildTestConnector(srv.URL, ticker, rec)
|
||||
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
defer cancel()
|
||||
|
||||
if err := c.Start(ctx); err != nil {
|
||||
t.Fatalf("Start: %v", err)
|
||||
}
|
||||
defer c.Stop()
|
||||
|
||||
if mock.lookupSelfCalls.Load() != 1 {
|
||||
t.Errorf("expected exactly 1 lookup-self at startup, got %d", mock.lookupSelfCalls.Load())
|
||||
}
|
||||
|
||||
// Fire three ticks; each should drive one renew-self.
|
||||
for i := 0; i < 3; i++ {
|
||||
ticker.Tick()
|
||||
}
|
||||
|
||||
// Wait briefly for the goroutine to drain the channel sends.
|
||||
deadline := time.Now().Add(2 * time.Second)
|
||||
for time.Now().Before(deadline) {
|
||||
if rec.get("success") >= 3 {
|
||||
break
|
||||
}
|
||||
time.Sleep(10 * time.Millisecond)
|
||||
}
|
||||
|
||||
if got := rec.get("success"); got != 3 {
|
||||
t.Errorf("expected 3 success renewals after 3 ticks, got %d", got)
|
||||
}
|
||||
if got := rec.get("failure"); got != 0 {
|
||||
t.Errorf("expected 0 failures, got %d", got)
|
||||
}
|
||||
if got := rec.get("not_renewable"); got != 0 {
|
||||
t.Errorf("expected 0 not_renewable events, got %d", got)
|
||||
}
|
||||
if got := mock.renewSelfCalls.Load(); got != 3 {
|
||||
t.Errorf("expected 3 renew-self HTTP calls, got %d", got)
|
||||
}
|
||||
}
|
||||
|
||||
// TestVault_RenewLoop_StopsOnNotRenewable pins that the loop exits
|
||||
// cleanly after Vault returns renewable=false on a renew-self call.
|
||||
// The test fires two ticks: the first renews successfully and the
// second receives renewable=false. The goroutine should have exited
// by then, so Stop returns promptly.
|
||||
func TestVault_RenewLoop_StopsOnNotRenewable(t *testing.T) {
|
||||
mock := &renewMockHandler{
|
||||
lookupTTLSeconds: 4,
|
||||
lookupRenewable: true,
|
||||
renewSelfStatuses: []renewSelfStub{
|
||||
{leaseDuration: 4, renewable: true},
|
||||
{leaseDuration: 4, renewable: false}, // tells loop to stop
|
||||
},
|
||||
}
|
||||
srv := httptest.NewServer(mock)
|
||||
defer srv.Close()
|
||||
|
||||
ticker := newFakeTicker()
|
||||
rec := newMockRecorder()
|
||||
c := buildTestConnector(srv.URL, ticker, rec)
|
||||
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
defer cancel()
|
||||
|
||||
if err := c.Start(ctx); err != nil {
|
||||
t.Fatalf("Start: %v", err)
|
||||
}
|
||||
defer c.Stop()
|
||||
|
||||
ticker.Tick() // first renewal — success
|
||||
ticker.Tick() // second renewal — renewable=false, loop exits
|
||||
|
||||
deadline := time.Now().Add(2 * time.Second)
|
||||
for time.Now().Before(deadline) {
|
||||
if rec.get("not_renewable") >= 1 {
|
||||
break
|
||||
}
|
||||
time.Sleep(10 * time.Millisecond)
|
||||
}
|
||||
|
||||
if got := rec.get("success"); got != 1 {
|
||||
t.Errorf("expected 1 success before not_renewable, got %d", got)
|
||||
}
|
||||
if got := rec.get("not_renewable"); got != 1 {
|
||||
t.Errorf("expected exactly 1 not_renewable event, got %d", got)
|
||||
}
|
||||
|
||||
// Confirm the goroutine has already exited: we check the
|
||||
// renewMu's renewDone channel via Stop. If the loop is alive,
|
||||
// Stop blocks until ctx is cancelled. If it has already
|
||||
// exited (which it should), Stop returns near-immediately.
|
||||
stopDone := make(chan struct{})
|
||||
go func() {
|
||||
c.Stop()
|
||||
close(stopDone)
|
||||
}()
|
||||
|
||||
select {
|
||||
case <-stopDone:
|
||||
// expected — goroutine had already exited.
|
||||
case <-time.After(200 * time.Millisecond):
|
||||
t.Error("Stop did not return within 200ms after renewable=false — goroutine leaked")
|
||||
}
|
||||
}
|
||||
|
||||
// TestVault_RenewLoop_FailureSurfacesViaMetric pins that a 403 on
|
||||
// renew-self bumps the failure counter and the loop keeps ticking
|
||||
// (transient blips do not kill the loop).
|
||||
func TestVault_RenewLoop_FailureSurfacesViaMetric(t *testing.T) {
|
||||
mock := &renewMockHandler{
|
||||
lookupTTLSeconds: 4,
|
||||
lookupRenewable: true,
|
||||
renewSelfStatuses: []renewSelfStub{
|
||||
{status: http.StatusForbidden, body: `{"errors":["permission denied"]}`},
|
||||
{leaseDuration: 4, renewable: true}, // loop continues; this tick succeeds
|
||||
},
|
||||
}
|
||||
srv := httptest.NewServer(mock)
|
||||
defer srv.Close()
|
||||
|
||||
ticker := newFakeTicker()
|
||||
rec := newMockRecorder()
|
||||
c := buildTestConnector(srv.URL, ticker, rec)
|
||||
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
defer cancel()
|
||||
|
||||
if err := c.Start(ctx); err != nil {
|
||||
t.Fatalf("Start: %v", err)
|
||||
}
|
||||
defer c.Stop()
|
||||
|
||||
ticker.Tick() // first — fails with 403
|
||||
ticker.Tick() // second — succeeds
|
||||
|
||||
deadline := time.Now().Add(2 * time.Second)
|
||||
for time.Now().Before(deadline) {
|
||||
if rec.get("failure") >= 1 && rec.get("success") >= 1 {
|
||||
break
|
||||
}
|
||||
time.Sleep(10 * time.Millisecond)
|
||||
}
|
||||
|
||||
if got := rec.get("failure"); got != 1 {
|
||||
t.Errorf("expected 1 failure after 403, got %d", got)
|
||||
}
|
||||
if got := rec.get("success"); got != 1 {
|
||||
t.Errorf("expected 1 success after recovery, got %d", got)
|
||||
}
|
||||
}
|
||||
|
||||
// TestVault_RenewLoop_CtxCancellation_StopsCleanly pins that
|
||||
// cancelling ctx causes the goroutine to exit promptly. Stop()
|
||||
// blocks on the goroutine's done channel; a return slower than
// 200ms is reported as a slow exit, and no return within 500ms is
// treated as a leaked goroutine.
|
||||
func TestVault_RenewLoop_CtxCancellation_StopsCleanly(t *testing.T) {
|
||||
mock := &renewMockHandler{
|
||||
lookupTTLSeconds: 4,
|
||||
lookupRenewable: true,
|
||||
renewSelfStatuses: nil, // no ticks expected; ctx will cancel before any
|
||||
}
|
||||
srv := httptest.NewServer(mock)
|
||||
defer srv.Close()
|
||||
|
||||
ticker := newFakeTicker()
|
||||
rec := newMockRecorder()
|
||||
c := buildTestConnector(srv.URL, ticker, rec)
|
||||
|
||||
ctx, cancel := context.WithCancel(context.Background())
|
||||
|
||||
if err := c.Start(ctx); err != nil {
|
||||
t.Fatalf("Start: %v", err)
|
||||
}
|
||||
|
||||
// Cancel ctx; the goroutine should exit on ctx.Done() before
|
||||
// any tick fires.
|
||||
start := time.Now()
|
||||
cancel()
|
||||
|
||||
stopDone := make(chan struct{})
|
||||
go func() {
|
||||
c.Stop()
|
||||
close(stopDone)
|
||||
}()
|
||||
|
||||
select {
|
||||
case <-stopDone:
|
||||
elapsed := time.Since(start)
|
||||
if elapsed > 200*time.Millisecond {
|
||||
t.Errorf("Stop returned after %v — goroutine slow to exit", elapsed)
|
||||
}
|
||||
case <-time.After(500 * time.Millisecond):
|
||||
t.Fatal("Stop did not return within 500ms after ctx cancellation — goroutine leaked")
|
||||
}
|
||||
|
||||
// No renew-self calls should have fired (cancel raced before any tick).
|
||||
if got := mock.renewSelfCalls.Load(); got != 0 {
|
||||
t.Errorf("expected 0 renew-self HTTP calls, got %d", got)
|
||||
}
|
||||
}
|
||||
|
||||
// TestVault_RenewLoop_StartsNothingWhenNotRenewable pins the
|
||||
// startup short-circuit: if lookup-self returns renewable=false at
|
||||
// boot, Start does not spawn the goroutine and the metric records
|
||||
// the not_renewable result so operators see it in Grafana before
|
||||
// any tick would have fired.
|
||||
func TestVault_RenewLoop_StartsNothingWhenNotRenewable(t *testing.T) {
|
||||
mock := &renewMockHandler{
|
||||
lookupTTLSeconds: 60,
|
||||
lookupRenewable: false, // already non-renewable at boot
|
||||
}
|
||||
srv := httptest.NewServer(mock)
|
||||
defer srv.Close()
|
||||
|
||||
ticker := newFakeTicker()
|
||||
rec := newMockRecorder()
|
||||
c := buildTestConnector(srv.URL, ticker, rec)
|
||||
|
||||
if err := c.Start(context.Background()); err != nil {
|
||||
t.Fatalf("Start should not error on initially-non-renewable token; got: %v", err)
|
||||
}
|
||||
defer c.Stop()
|
||||
|
||||
if got := rec.get("not_renewable"); got != 1 {
|
||||
t.Errorf("expected 1 not_renewable event from startup short-circuit, got %d", got)
|
||||
}
|
||||
|
||||
// Tick should be a no-op — no goroutine running.
|
||||
ticker.Tick()
|
||||
time.Sleep(100 * time.Millisecond)
|
||||
if got := mock.renewSelfCalls.Load(); got != 0 {
|
||||
t.Errorf("expected 0 renew-self HTTP calls (loop never started), got %d", got)
|
||||
}
|
||||
}
|
||||
|
||||
// TestVault_ComputeInterval pins the cadence-derivation rules: TTL/2
|
||||
// for normal tokens, floored at minRenewInterval for misconfigured
|
||||
// short TTLs that would otherwise hammer Vault's audit log.
|
||||
func TestVault_ComputeInterval(t *testing.T) {
|
||||
tests := []struct {
|
||||
name string
|
||||
ttl time.Duration
|
||||
want time.Duration
|
||||
}{
|
||||
{"hour-ttl", time.Hour, 30 * time.Minute},
|
||||
{"day-ttl", 24 * time.Hour, 12 * time.Hour},
|
||||
{"floor-applies-tiny", 2 * time.Second, minRenewInterval},
|
||||
{"floor-applies-zero", 0, minRenewInterval},
|
||||
}
|
||||
for _, tc := range tests {
|
||||
t.Run(tc.name, func(t *testing.T) {
|
||||
got := computeInterval(tc.ttl)
|
||||
if got != tc.want {
|
||||
t.Errorf("computeInterval(%v) = %v, want %v", tc.ttl, got, tc.want)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
// TestVault_RenewSelf_ParseFailure_NamesActionableInError pins that
|
||||
// failures surface with operator-actionable framing. We test the
// parse-failure path (HTTP 200 with a non-JSON body); the
// HTTP-failure path lives in the same wrap chain.
|
||||
func TestVault_RenewSelf_ParseFailure_NamesActionableInError(t *testing.T) {
|
||||
srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||
w.WriteHeader(http.StatusOK)
|
||||
_, _ = io.WriteString(w, `not json`)
|
||||
}))
|
||||
defer srv.Close()
|
||||
|
||||
c := buildTestConnector(srv.URL, newFakeTicker(), nil)
|
||||
|
||||
_, err := c.renewSelf(context.Background())
|
||||
if err == nil {
|
||||
t.Fatal("expected error from renewSelf with bad JSON, got nil")
|
||||
}
|
||||
if !strings.Contains(err.Error(), "vault token renewal failed") {
|
||||
t.Errorf("expected 'vault token renewal failed' framing in surfaced error; got: %v", err)
|
||||
}
|
||||
if !strings.Contains(err.Error(), "rotate the token") {
|
||||
t.Errorf("expected 'rotate the token' operator-action substring in surfaced error; got: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
// _unused_marker keeps the json import alive when the test file is
|
||||
// edited and one of the json-using helpers temporarily disappears.
|
||||
// Production has no use for this; tests do.
|
||||
var _ = json.Marshal
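The tests above rely on a ticker seam (renewTickerFactory returning a renewTicker) so that no real time has to elapse. The sketch below shows one way such a seam could look. The renewTicker method set is inferred from fakeTicker; realTicker, manualTicker, runLoop and countTicks are hypothetical names introduced for illustration and are not taken from the connector.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// renewTicker mirrors the seam the tests substitute (C and Stop, as on fakeTicker).
type renewTicker interface {
	C() <-chan time.Time
	Stop()
}

// realTicker is a hypothetical production adapter over time.Ticker.
type realTicker struct{ t *time.Ticker }

func newRealTicker(d time.Duration) renewTicker { return &realTicker{t: time.NewTicker(d)} }
func (r *realTicker) C() <-chan time.Time       { return r.t.C }
func (r *realTicker) Stop()                     { r.t.Stop() }

// manualTicker fires only when the caller invokes Tick.
type manualTicker struct{ ch chan time.Time }

func (m *manualTicker) C() <-chan time.Time { return m.ch }
func (m *manualTicker) Stop()               {}
func (m *manualTicker) Tick()               { m.ch <- time.Now() }

// runLoop calls onTick once per tick until ctx is cancelled, then closes done.
func runLoop(ctx context.Context, tk renewTicker, onTick func(), done chan<- struct{}) {
	defer close(done)
	defer tk.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-tk.C():
			onTick()
		}
	}
}

// countTicks drives runLoop with n manual ticks and returns how many
// callbacks ran. The unbuffered channels make every step synchronous,
// so no sleep or polling deadline is needed.
func countTicks(n int) int {
	tk := &manualTicker{ch: make(chan time.Time)}
	handled := make(chan struct{})
	done := make(chan struct{})
	ctx, cancel := context.WithCancel(context.Background())

	count := 0 // written only by the loop goroutine, read after done closes
	go runLoop(ctx, tk, func() { count++; handled <- struct{}{} }, done)

	for i := 0; i < n; i++ {
		tk.Tick()
		<-handled
	}
	cancel()
	<-done
	return count
}

func main() {
	fmt.Println(countTicks(3))
}
```

The callback here signals completion over a channel, which removes the polling loops. The tests in this diff cannot do that directly because the renewal happens inside the connector, so they poll the recorder instead.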
|
||||
@@ -40,6 +40,18 @@ func (a *IssuerConnectorAdapter) SetMetrics(issuerType string, metrics *Issuance
|
||||
a.metrics = metrics
|
||||
}
|
||||
|
||||
// Underlying returns the wrapped issuer.Connector so registry-level
|
||||
// machinery (StartLifecycles / StopLifecycles, Bundle G audit-row
|
||||
// pairing, future feature-detect interfaces) can reach the concrete
|
||||
// connector behind the adapter without duplicating the wiring at
|
||||
// every call site. Returns interface{} rather than issuer.Connector
|
||||
// so callers do their own type assertion against optional extension
|
||||
// interfaces (issuer.Lifecycle, etc.) without an import dependency
|
||||
// fan-out from this package.
|
||||
func (a *IssuerConnectorAdapter) Underlying() interface{} {
|
||||
return a.connector
|
||||
}
|
||||
|
||||
// recordIssuance is the metrics-recording side effect at the adapter
|
||||
// boundary. Bumps the issuance counter (success/failure) and the
|
||||
// duration histogram; on failure also bumps the failure-by-error-class
|
||||
|
||||
@@ -8,8 +8,10 @@ import (
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
"github.com/shankar0123/certctl/internal/connector/issuer"
|
||||
"github.com/shankar0123/certctl/internal/connector/issuer/acme"
|
||||
"github.com/shankar0123/certctl/internal/connector/issuer/local"
|
||||
"github.com/shankar0123/certctl/internal/connector/issuer/vault"
|
||||
"github.com/shankar0123/certctl/internal/connector/issuerfactory"
|
||||
"github.com/shankar0123/certctl/internal/crypto"
|
||||
"github.com/shankar0123/certctl/internal/crypto/signer"
|
||||
@@ -47,6 +49,14 @@ type IssuerRegistry struct {
|
||||
// Nil leaves the legacy "ACME revocation by serial requires
|
||||
// CertificateLookup wiring" error in place for old wiring paths.
|
||||
acmeCertLookup acme.CertificateLookupRepo
|
||||
|
||||
// vaultRenewalMetrics — when set, every freshly-constructed
|
||||
// *vault.Connector is wired with SetRenewalRecorder so the
|
||||
// renew-self loop bumps the certctl_vault_token_renewals_total
|
||||
// counter. Closes Top-10 fix #5 of the 2026-05-03 audit. Nil
|
||||
// leaves the no-op recorder in place (no metric emission, but
|
||||
// the loop still runs).
|
||||
vaultRenewalMetrics *VaultRenewalMetrics
|
||||
}
|
||||
|
||||
// LocalIssuerDeps groups the optional dependencies that the local
|
||||
@@ -92,6 +102,23 @@ func (r *IssuerRegistry) SetIssuanceMetrics(m *IssuanceMetrics) {
|
||||
r.metrics = m
|
||||
}
|
||||
|
||||
// SetVaultRenewalMetrics wires the per-(result) counter table for
|
||||
// the Vault PKI renew-self loop. Every *vault.Connector constructed
|
||||
// by Rebuild after this call records its renewal results into the
|
||||
// supplied metrics. Closes Top-10 fix #5 of the 2026-05-03
|
||||
// issuer-coverage audit.
|
||||
//
|
||||
// The same instance must also be registered with the metrics
|
||||
// handler via MetricsHandler.SetVaultRenewals so the Prometheus
|
||||
// exposer emits certctl_vault_token_renewals_total{result=...}.
|
||||
// cmd/server/main.go owns both wiring sides; tests usually skip
|
||||
// the Prometheus side and just assert against the snapshot.
|
||||
func (r *IssuerRegistry) SetVaultRenewalMetrics(m *VaultRenewalMetrics) {
|
||||
r.mu.Lock()
|
||||
defer r.mu.Unlock()
|
||||
r.vaultRenewalMetrics = m
|
||||
}
|
||||
|
||||
// SetACMECertLookup wires the cert-version lookup repo for every
|
||||
// *acme.Connector constructed by Rebuild. The lookup is used by the
|
||||
// serial-only revoke path (RevokeCertificate) to recover the leaf-
|
||||
@@ -228,6 +255,19 @@ func (r *IssuerRegistry) Rebuild(ctx context.Context, configs []*domain.Issuer,
|
||||
"id", cfg.ID)
|
||||
}
|
||||
|
||||
// Top-10 fix #5 (2026-05-03 audit): wire the renew-self
|
||||
// metric recorder into every freshly-constructed
|
||||
// *vault.Connector so its background renewal loop bumps the
|
||||
// certctl_vault_token_renewals_total counter. Lifecycle
|
||||
// startup itself is gated by StartLifecycles below — Rebuild
|
||||
// only does the metric wire here so the recorder is in place
|
||||
// when StartLifecycles fires.
|
||||
if vaultConn, ok := connector.(*vault.Connector); ok && r.vaultRenewalMetrics != nil {
|
||||
vaultConn.SetRenewalRecorder(r.vaultRenewalMetrics)
|
||||
r.logger.Info("Vault PKI issuer wired with renew-self metric recorder",
|
||||
"id", cfg.ID)
|
||||
}
|
||||
|
||||
adapter := NewIssuerConnectorAdapter(connector)
|
||||
// Wire per-issuer-type metrics (audit fix #4) when SetIssuanceMetrics
|
||||
// was called. The adapter is the IssuerConnector interface; type-
|
||||
@@ -273,3 +313,93 @@ func (r *IssuerRegistry) Rebuild(ctx context.Context, configs []*domain.Issuer,
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
// StartLifecycles iterates the registry and calls Start(ctx) on every
|
||||
// connector that implements the optional issuer.Lifecycle extension
|
||||
// interface. Connectors without lifecycle work (almost all of them)
|
||||
// are silently skipped.
|
||||
//
|
||||
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit. Today only
|
||||
// VaultPKI implements Lifecycle (for its renew-self loop). New
|
||||
// lifecycle-bearing connectors plug in by implementing the
|
||||
// interface — this method picks them up automatically.
|
||||
//
|
||||
// Per-connector Start failures are LOGGED, not returned, so a single
|
||||
// misconfigured Vault doesn't block server startup. Operators see
|
||||
// the failure in the slog stream and via the
|
||||
// certctl_vault_token_renewals_total{result="not_renewable"} or
|
||||
// {result="failure"} counter.
|
||||
//
|
||||
// The IssuerConnectorAdapter wraps the raw connector; unwrapAdapter
// reaches the underlying connector before the issuer.Lifecycle type
// assertion. If the adapter shape changes so that unwrapping no
// longer yields the concrete connector, the assertion silently
// no-ops and lifecycle wiring stops working. This is covered by
// TestRegistry_StartLifecycles_VaultStarted.
|
||||
func (r *IssuerRegistry) StartLifecycles(ctx context.Context) {
|
||||
r.mu.RLock()
|
||||
conns := make(map[string]IssuerConnector, len(r.issuers))
|
||||
for id, c := range r.issuers {
|
||||
conns[id] = c
|
||||
}
|
||||
r.mu.RUnlock()
|
||||
|
||||
for id, c := range conns {
|
||||
raw := unwrapAdapter(c)
|
||||
if raw == nil {
|
||||
continue
|
||||
}
|
||||
lc, ok := raw.(issuer.Lifecycle)
|
||||
if !ok {
|
||||
continue
|
||||
}
|
||||
if err := lc.Start(ctx); err != nil {
|
||||
r.logger.Warn("issuer lifecycle Start failed",
|
||||
"id", id,
|
||||
"error", err,
|
||||
)
|
||||
continue
|
||||
}
|
||||
r.logger.Info("issuer lifecycle Start succeeded", "id", id)
|
||||
}
|
||||
}
|
||||
|
||||
// StopLifecycles iterates the registry and calls Stop() on every
|
||||
// connector that implements the optional issuer.Lifecycle extension
|
||||
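// interface. Each Stop blocks until the connector's background work
// has fully exited; the loop is sequential rather than parallel so
// shutdown log lines from different connectors do not interleave.
// The visiting order follows the iteration order of r.issuers and
// is not guaranteed to be the same across runs.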
|
||||
//
|
||||
// Idempotent. Safe to call after StartLifecycles failed or wasn't
|
||||
// called.
|
||||
func (r *IssuerRegistry) StopLifecycles() {
|
||||
r.mu.RLock()
|
||||
conns := make([]IssuerConnector, 0, len(r.issuers))
|
||||
for _, c := range r.issuers {
|
||||
conns = append(conns, c)
|
||||
}
|
||||
r.mu.RUnlock()
|
||||
|
||||
for _, c := range conns {
|
||||
raw := unwrapAdapter(c)
|
||||
if raw == nil {
|
||||
continue
|
||||
}
|
||||
if lc, ok := raw.(issuer.Lifecycle); ok {
|
||||
lc.Stop()
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// unwrapAdapter returns the underlying connector held by an
// IssuerConnectorAdapter, or by any other wrapper that exposes
// Underlying() interface{}. If the registry holds a raw connector
// directly (test wiring), it is returned as-is. The result is nil
// only when the connector or its Underlying() value is nil; callers
// skip that case.
|
||||
func unwrapAdapter(c IssuerConnector) interface{} {
|
||||
if a, ok := c.(*IssuerConnectorAdapter); ok {
|
||||
return a.Underlying()
|
||||
}
|
||||
if u, ok := c.(interface{ Underlying() interface{} }); ok {
|
||||
return u.Underlying()
|
||||
}
|
||||
return c
|
||||
}
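The feature-detection pattern used by StartLifecycles and StopLifecycles can be reduced to a small stand-alone sketch. The Lifecycle interface below matches the one quoted in the commit message; plainConnector, loopConnector, adapter, unwrap, startAll and demo are hypothetical stand-ins, not the registry's real types.

```go
package main

import (
	"context"
	"fmt"
)

// Lifecycle mirrors the optional extension interface from the commit message.
type Lifecycle interface {
	Start(ctx context.Context) error
	Stop()
}

// plainConnector has no background work and does not implement Lifecycle.
type plainConnector struct{}

// loopConnector stands in for a connector with a background loop.
type loopConnector struct{ started, stopped bool }

func (l *loopConnector) Start(ctx context.Context) error { l.started = true; return nil }
func (l *loopConnector) Stop()                           { l.stopped = true }

// adapter wraps a connector, in the spirit of IssuerConnectorAdapter.
type adapter struct{ inner interface{} }

func (a *adapter) Underlying() interface{} { return a.inner }

// unwrap returns the wrapped value when one is exposed, else c itself.
func unwrap(c interface{}) interface{} {
	if u, ok := c.(interface{ Underlying() interface{} }); ok {
		return u.Underlying()
	}
	return c
}

// startAll starts every Lifecycle implementer and reports how many started.
// Start errors are printed rather than returned, so one bad connector
// does not block the others.
func startAll(ctx context.Context, conns []interface{}) int {
	started := 0
	for _, c := range conns {
		lc, ok := unwrap(c).(Lifecycle) // comma-ok form is safe on nil
		if !ok {
			continue
		}
		if err := lc.Start(ctx); err != nil {
			fmt.Println("start failed:", err)
			continue
		}
		started++
	}
	return started
}

// demo wires two wrapped connectors and one raw connector.
func demo() (started int, loopStarted bool) {
	loop := &loopConnector{}
	conns := []interface{}{
		&adapter{inner: &plainConnector{}},
		&adapter{inner: loop},
		&plainConnector{},
	}
	started = startAll(context.Background(), conns)
	return started, loop.started
}

func main() {
	fmt.Println(demo())
}
```

Only the wrapped loopConnector satisfies Lifecycle, so the other two entries are skipped without error. This is the property that lets new lifecycle-bearing connectors plug in without further registry changes.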
|
||||
|
||||
@@ -0,0 +1,96 @@
|
||||
package service
|
||||
|
||||
import "sync/atomic"
|
||||
|
||||
// VaultRenewalMetrics is a thread-safe counter table for the
|
||||
// Vault PKI token-renewal loop. Top-10 fix #5 of the 2026-05-03
|
||||
// issuer-coverage audit. Closes the operator-observability gap
|
||||
// where long-lived deploys would silently lose Vault auth at TTL
|
||||
// expiry.
|
||||
//
|
||||
// Cardinality is fixed at three series — result is a closed enum:
|
||||
//
|
||||
// {success} — the renew-self call succeeded.
|
||||
// {failure} — the renew-self call returned a non-2xx,
|
||||
// parse failure, or HTTP error. Loop keeps
|
||||
// ticking; transient blips don't kill it.
|
||||
// {not_renewable} — Vault returned renewable=false (or returned
|
||||
// it at startup lookup-self). Loop has exited;
|
||||
// operator must rotate the token before its
|
||||
// current TTL expires.
|
||||
//
|
||||
// One instance is shared across every Vault PKI Connector built by
|
||||
// IssuerRegistry.Rebuild — the recorder pointer is wired by
|
||||
// IssuerRegistry.SetVaultRenewalMetrics + the post-factory wiring
|
||||
// step inside Rebuild. The same instance is also wired into
|
||||
// MetricsHandler.SetVaultRenewals so the Prometheus exposer emits
|
||||
// certctl_vault_token_renewals_total{result=...}.
|
||||
type VaultRenewalMetrics struct {
|
||||
success atomic.Uint64
|
||||
failure atomic.Uint64
|
||||
notRenewable atomic.Uint64
|
||||
}
|
||||
|
||||
// NewVaultRenewalMetrics constructs a fresh VaultRenewalMetrics
|
||||
// with all counters at zero. Pass to IssuerRegistry.SetVaultRenewalMetrics
|
||||
// (and to MetricsHandler.SetVaultRenewals) to wire up the renewal
|
||||
// loop's metric path.
|
||||
func NewVaultRenewalMetrics() *VaultRenewalMetrics {
|
||||
return &VaultRenewalMetrics{}
|
||||
}
|
||||
|
||||
// RecordRenewal bumps the (result) counter. Implements
|
||||
// vault.RenewalRecorder. Off-enum result values silently no-op
|
||||
// (closed-enum discipline matches the IssuanceMetrics pattern;
|
||||
// we don't dynamically grow the cardinality on a typo).
|
||||
func (m *VaultRenewalMetrics) RecordRenewal(result string) {
|
||||
if m == nil {
|
||||
return
|
||||
}
|
||||
switch result {
|
||||
case "success":
|
||||
m.success.Add(1)
|
||||
case "failure":
|
||||
m.failure.Add(1)
|
||||
case "not_renewable":
|
||||
m.notRenewable.Add(1)
|
||||
}
|
||||
}
|
||||
|
||||
// VaultRenewalSnapshot is the per-result counter view returned by
|
||||
// Snapshot. Pinned in this package so the handler can consume it
|
||||
// via VaultRenewalSnapshotter without cross-importing connector
|
||||
// state. Field names are stable — operator dashboards alert on
|
||||
// the corresponding {result=...} label values.
|
||||
type VaultRenewalSnapshot struct {
|
||||
Success uint64
|
||||
Failure uint64
|
||||
NotRenewable uint64
|
||||
}
|
||||
|
||||
// Snapshot returns a point-in-time read of all three counters.
|
||||
// Used by tests that need to assert post-tick state. The
|
||||
// Prometheus exposer in internal/api/handler/metrics.go uses
|
||||
// SnapshotVaultRenewals (3-tuple form) instead, to avoid an
|
||||
// import cycle on a shared struct type.
|
||||
func (m *VaultRenewalMetrics) Snapshot() VaultRenewalSnapshot {
|
||||
if m == nil {
|
||||
return VaultRenewalSnapshot{}
|
||||
}
|
||||
return VaultRenewalSnapshot{
|
||||
Success: m.success.Load(),
|
||||
Failure: m.failure.Load(),
|
||||
NotRenewable: m.notRenewable.Load(),
|
||||
}
|
||||
}
|
||||
|
||||
// SnapshotVaultRenewals returns the three counter values directly
|
||||
// as a tuple. Implements handler.VaultRenewalSnapshotter; used by
|
||||
// the Prometheus exposer. Order is fixed: success, failure,
|
||||
// not_renewable.
|
||||
func (m *VaultRenewalMetrics) SnapshotVaultRenewals() (success, failure, notRenewable uint64) {
|
||||
if m == nil {
|
||||
return 0, 0, 0
|
||||
}
|
||||
return m.success.Load(), m.failure.Load(), m.notRenewable.Load()
|
||||
}
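To illustrate how the three-value snapshot could end up as the certctl_vault_token_renewals_total series, here is a minimal sketch. The exposer in internal/api/handler/metrics.go is not part of this excerpt, so renewalCounters, record and render are hypothetical and the real handler may format its output differently. The lines follow the Prometheus text exposition format.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// renewalCounters is a stand-in for VaultRenewalMetrics.
type renewalCounters struct {
	success      atomic.Uint64
	failure      atomic.Uint64
	notRenewable atomic.Uint64
}

// record bumps the counter for a known result; unknown values are
// ignored so a typo cannot create a new series.
func (c *renewalCounters) record(result string) {
	switch result {
	case "success":
		c.success.Add(1)
	case "failure":
		c.failure.Add(1)
	case "not_renewable":
		c.notRenewable.Add(1)
	}
}

// render writes the three series in a fixed order: success, failure,
// not_renewable.
func render(c *renewalCounters) string {
	const name = "certctl_vault_token_renewals_total"
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s counter\n", name)
	fmt.Fprintf(&b, "%s{result=%q} %d\n", name, "success", c.success.Load())
	fmt.Fprintf(&b, "%s{result=%q} %d\n", name, "failure", c.failure.Load())
	fmt.Fprintf(&b, "%s{result=%q} %d\n", name, "not_renewable", c.notRenewable.Load())
	return b.String()
}

func main() {
	c := &renewalCounters{}
	c.record("success")
	c.record("success")
	c.record("typo") // ignored: not part of the closed enum
	fmt.Print(render(c))
}
```

Emitting all three series even when a counter is zero keeps alert expressions such as a rate over result="failure" defined from the first scrape.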
|
||||