vault: add automatic token renewal at TTL/2 + Prometheus metric

Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). Honours ctx.Done() for
     graceful shutdown via a per-loop done channel + Stop().
  2. On `renewable: false` response (initial lookup OR any subsequent
     renewal), the loop emits a WARN, increments the not_renewable
     counter, and exits. The operator must rotate the token before
     its current TTL expires.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.
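
The outcome handling in items 2 and 4 can be sketched as a small
decision function. This is an illustration only, not the code in
vault_renew.go: classify and errDenied are hypothetical names, and
the real loop makes the same decision inline.

```go
package main

import (
	"errors"
	"fmt"
)

// errDenied is a hypothetical stand-in for a failed renew-self call.
var errDenied = errors.New("vault token renewal failed: status 403")

// classify maps one renew-self attempt to a metric label and reports
// whether the loop should keep ticking. Hypothetical helper.
func classify(err error, renewable bool) (label string, keepTicking bool) {
	switch {
	case err != nil:
		// Failures are counted, but the loop survives them because
		// the failure may be transient.
		return "failure", true
	case !renewable:
		// Vault will not renew this token again, so the loop exits.
		return "not_renewable", false
	default:
		return "success", true
	}
}

func main() {
	attempts := []struct {
		err       error
		renewable bool
	}{
		{nil, true},
		{errDenied, true},
		{nil, false},
	}
	for _, a := range attempts {
		label, keep := classify(a.err, a.renewable)
		fmt.Printf("result=%s keepTicking=%t\n", label, keep)
	}
}
```

Only the not_renewable outcome ends the loop; an error is counted and
the next tick proceeds as usual.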

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.
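
A minimal sketch of the feature detection described above. The
connector interface, the plain and background types, and the
startLifecycles and demo functions are hypothetical stand-ins; the
real IssuerRegistry.StartLifecycles may differ in detail.

```go
package main

import (
	"context"
	"fmt"
)

// Lifecycle matches the optional interface introduced in this commit.
type Lifecycle interface {
	Start(ctx context.Context) error
	Stop()
}

// connector is a hypothetical stand-in for issuer.Connector.
type connector interface{ Name() string }

// plain has no background work and does not implement Lifecycle.
type plain struct{ name string }

func (p plain) Name() string { return p.name }

// background implements Lifecycle in addition to connector.
type background struct {
	name    string
	started bool
}

func (b *background) Name() string                  { return b.name }
func (b *background) Start(_ context.Context) error { b.started = true; return nil }
func (b *background) Stop()                         { b.started = false }

// startLifecycles starts every connector that implements Lifecycle
// and returns how many it started. A Start error is reported and
// skipped so one bad connector does not block the others.
func startLifecycles(ctx context.Context, conns []connector) int {
	started := 0
	for _, c := range conns {
		lc, ok := c.(Lifecycle) // feature-detect via type assertion
		if !ok {
			continue
		}
		if err := lc.Start(ctx); err != nil {
			fmt.Printf("lifecycle start failed for %s: %v\n", c.Name(), err)
			continue
		}
		started++
	}
	return started
}

// demo builds a mixed set of connectors and starts the eligible ones.
func demo() int {
	conns := []connector{plain{name: "acme"}, &background{name: "vault"}}
	return startLifecycles(context.Background(), conns)
}

func main() {
	fmt.Println("lifecycles started:", demo())
}
```

Connectors without background work are skipped silently, which is why
adding a new lifecycle-bearing connector needs no registry changes.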

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.
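
The shared-instance wiring can be pictured with a single counter set
that the renewal loop writes and the metrics exposer reads. This is a
sketch under assumptions: renewalCounters and Expose are hypothetical
names, not the actual service.VaultRenewalMetrics API; only the metric
name and label values come from this commit.

```go
package main

import (
	"fmt"
	"strings"
	"sync/atomic"
)

// renewalCounters is a hypothetical stand-in for the shared metrics
// instance: one atomic counter per result label.
type renewalCounters struct {
	success      atomic.Uint64
	failure      atomic.Uint64
	notRenewable atomic.Uint64
}

// RecordRenewal is the write side, called from the renewal goroutine.
// Unknown labels are ignored.
func (m *renewalCounters) RecordRenewal(result string) {
	switch result {
	case "success":
		m.success.Add(1)
	case "failure":
		m.failure.Add(1)
	case "not_renewable":
		m.notRenewable.Add(1)
	}
}

// Expose is the read side. It renders the counters in the Prometheus
// text exposition format.
func (m *renewalCounters) Expose() string {
	rows := []struct {
		label string
		value uint64
	}{
		{"success", m.success.Load()},
		{"failure", m.failure.Load()},
		{"not_renewable", m.notRenewable.Load()},
	}
	var b strings.Builder
	b.WriteString("# TYPE certctl_vault_token_renewals_total counter\n")
	for _, r := range rows {
		fmt.Fprintf(&b, "certctl_vault_token_renewals_total{result=%q} %d\n", r.label, r.value)
	}
	return b.String()
}

func main() {
	shared := &renewalCounters{} // one instance, two consumers
	shared.RecordRenewal("success")
	shared.RecordRenewal("failure")
	fmt.Print(shared.Expose())
}
```

Because both sides hold the same pointer, an increment made by the
loop is visible on the next scrape without any extra plumbing.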

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false and the loop exits; Stop then returns within 200ms.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".

Cadence is dynamic — every successful renewal re-derives TTL/2
from the renewed lease's lease_duration, so a short bootstrap
token that gets renewed up to a longer Max TTL shifts to the
longer cadence automatically (defends against degenerate fast
ticking on a token whose Max TTL is far longer than its initial
TTL).
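
A worked example of the cadence arithmetic. computeInterval and
differsEnough below follow the helpers added in vault_renew.go
(differsEnough is slightly condensed); the TTL values in main are
made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// minRenewInterval is the floor applied to TTL/2.
const minRenewInterval = 5 * time.Second

// computeInterval returns TTL/2, floored at minRenewInterval.
func computeInterval(ttl time.Duration) time.Duration {
	half := ttl / 2
	if half < minRenewInterval {
		return minRenewInterval
	}
	return half
}

// differsEnough reports whether b deviates from a by more than 10%
// of a, which is the threshold for restarting the ticker.
func differsEnough(a, b time.Duration) bool {
	if a == 0 || b == 0 {
		return a != b
	}
	delta := a - b
	if delta < 0 {
		delta = -delta
	}
	return delta > a/10
}

func main() {
	// Bootstrap token with a 60s TTL: renew every 30s.
	current := computeInterval(60 * time.Second)
	// Renewal returns a 1h lease: the cadence should move to 30m.
	next := computeInterval(time.Hour)
	fmt.Println(current, next, differsEnough(current, next))

	// A tiny TTL is floored rather than ticking every 2s.
	fmt.Println(computeInterval(4 * time.Second))
}
```

A renewed lease that lands within 10% of the current cadence leaves
the ticker untouched, which avoids restart churn when the lease
duration wobbles slightly between renewals.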

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
shankar0123
2026-05-03 21:24:27 +00:00
parent a2a59a823e
commit 0792271dc6
10 changed files with 1293 additions and 7 deletions
+40
@@ -0,0 +1,40 @@
package issuer
import "context"
// Lifecycle is an OPTIONAL extension interface for issuer connectors that
// need to run long-running background work bound to a context. Connectors
// that hold no background goroutines (almost all of them) do not implement
// this interface and the registry feature-detects via type assertion.
//
// Concrete users today (2026-05-03):
// - VaultPKI: periodic POST /v1/auth/token/renew-self at TTL/2 cadence
// so long-lived deploys don't hit token expiry.
//
// The lifecycle contract is deliberately small. Connectors that need
// per-tick state, retries, or cross-tick cancellation handle all of that
// internally; the registry's job is just "kick off background work
// once" and "block until it cleanly exits". Keeping the interface this
// small means new lifecycle-bearing connectors don't have to touch the
// registry plumbing — they implement Start/Stop and the existing
// IssuerRegistry.StartLifecycles / StopLifecycles wiring picks them up
// automatically.
//
// Start MUST be non-blocking — spawn a goroutine and return immediately.
// Returning an error means startup failed; the registry logs the error
// and continues. Stop MUST block until the goroutine has fully exited;
// callers rely on this for graceful shutdown ordering.
type Lifecycle interface {
// Start kicks off any long-running background work bound to ctx.
// Returns nil on successful startup; the goroutine continues until
// ctx is cancelled or Stop is called. Returns a non-nil error if
// startup itself failed (e.g. precondition not met) — the goroutine
// did NOT start and Stop need not be called.
Start(ctx context.Context) error
// Stop blocks until the background work has fully exited. Safe to
// call after Start returned an error or wasn't called at all.
// Idempotent — multiple Stop calls return immediately after the
// first.
Stop()
}
+56 -6
@@ -29,6 +29,7 @@ import (
"log/slog"
"net/http"
"strings"
"sync"
"time"
"github.com/shankar0123/certctl/internal/connector/issuer"
@@ -72,6 +73,32 @@ type Connector struct {
config *Config
logger *slog.Logger
httpClient *http.Client
// Token-renewal loop fields. Top-10 fix #5 of the 2026-05-03
// issuer-coverage audit. Long-lived certctl-server deploys hit
// Vault token expiry; the loop calls /v1/auth/token/renew-self at
// TTL/2 cadence so the integration stays alive up to Vault's
// configured Max TTL. See vault_renew.go for Start / Stop /
// renewSelf / lookupSelf.
//
// renewMu guards renewStarted, renewCancel, renewDone, and
// renewRecorder. The renewal goroutine receives its own ctx and done
// channel as arguments, so it never reads renewCancel or renewDone.
renewMu sync.Mutex
renewStarted bool // true after Start spawned the goroutine
renewCancel func() // cancels the goroutine's ctx
renewDone chan struct{} // closed when goroutine exits
renewRecorder RenewalRecorder // optional metric sink (defaults to no-op)
// renewTickerFactory lets tests substitute a deterministic ticker
// implementation for cadence assertions. Production callers leave
// this nil and the loop uses time.NewTicker.
renewTickerFactory func(d time.Duration) renewTicker
// renewClient is the HTTP client used for renew-self / lookup-self.
// Defaults to httpClient; a separate seam lets tests inject an
// httptest.Server-bound client without disturbing the issuance
// path's client.
renewClient *http.Client
}
// New creates a new Vault PKI connector with the given configuration and logger.
@@ -85,13 +112,36 @@ func New(config *Config, logger *slog.Logger) *Connector {
}
}
return &Connector{
config: config,
logger: logger,
httpClient: &http.Client{
Timeout: 30 * time.Second,
},
httpClient := &http.Client{
Timeout: 30 * time.Second,
}
return &Connector{
config: config,
logger: logger,
httpClient: httpClient,
renewClient: httpClient,
renewRecorder: noopRenewalRecorder{},
}
}
// SetRenewalRecorder wires a metric sink for the renew-self loop. The
// recorder's RecordRenewal(result string) is called with one of the
// values "success", "failure", or "not_renewable" after every renewal
// attempt, and once from Start if the token is already non-renewable.
// Pass nil to disable recording. Safe to call before Start; calling
// after Start has no effect on already-emitted increments.
//
// The interface lives in this package (not internal/service) to avoid
// an import cycle: vault is a connector package that the service-layer
// IssuerRegistry imports. The service-layer concrete type
// (*service.VaultRenewalMetrics) satisfies this interface and is wired
// in cmd/server/main.go.
func (c *Connector) SetRenewalRecorder(r RenewalRecorder) {
if r == nil {
r = noopRenewalRecorder{}
}
c.renewMu.Lock()
defer c.renewMu.Unlock()
c.renewRecorder = r
}
// vaultResponse is the standard Vault API response wrapper.
@@ -0,0 +1,410 @@
package vault
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit. Pre-fix,
// Vault PKI authenticated via a static token and never called
// renew-self; long-lived deploys hit token expiry and started failing
// silently — the operator's first signal was failed renewals on
// production targets. This file adds:
//
// 1. Connector.Start(ctx) — spawns a goroutine that calls
// POST /v1/auth/token/renew-self at TTL/2 cadence (computed
// from a one-shot lookupSelf at startup).
// 2. Connector.Stop() — cancels the goroutine's context and blocks
// until it has exited. Idempotent.
// 3. Connector.renewSelf(ctx) — the per-tick HTTP call.
// 4. Connector.lookupSelf(ctx) — a one-shot startup probe to learn
// the current TTL + renewable flag.
//
// On a `renewable: false` response, the loop logs a WARN and exits
// cleanly; once Vault has decided the token is no longer renewable
// (typically Max TTL reached), retrying is what gets certctl-server
// flagged in the Vault audit log as a misbehaving client.
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"net/http"
"time"
"github.com/shankar0123/certctl/internal/connector/issuer"
)
// minRenewInterval guards against degenerate fast cadence when a
// misconfigured Vault returns a tiny TTL. 5s is short enough that
// the floor rarely applies in practice but long enough that we don't
// hammer Vault's audit log with renew-self calls if something goes
// sideways. Defensive only; production tokens are expected to have
// TTL ≥ 30m.
const minRenewInterval = 5 * time.Second
// RenewalRecorder is the metric-sink surface the renew-self loop
// uses. result is one of: "success", "failure", "not_renewable".
// Implementations MUST be goroutine-safe — RecordRenewal is called
// from the renewal loop's own goroutine.
//
// service.VaultRenewalMetrics satisfies this interface; cmd/server
// wires the same instance into the registry (which forwards to the
// connector via SetRenewalRecorder) and into the metrics handler
// (for Prometheus exposition).
type RenewalRecorder interface {
RecordRenewal(result string)
}
// noopRenewalRecorder is the zero-cost default. Used until
// SetRenewalRecorder wires a real metric sink (production) or in
// unit tests that don't care about metrics.
type noopRenewalRecorder struct{}
func (noopRenewalRecorder) RecordRenewal(string) {}
// renewTicker is the small surface the renewal loop uses from
// time.Ticker, extracted so tests can swap in a deterministic
// implementation that fires on cue. Production: time.NewTicker.
type renewTicker interface {
C() <-chan time.Time
Stop()
}
// stdTicker is the production implementation, a thin wrapper around
// *time.Ticker that exposes its C channel via a method so it
// satisfies the renewTicker interface (a struct field cannot satisfy
// an interface method).
type stdTicker struct{ t *time.Ticker }
func (s stdTicker) C() <-chan time.Time { return s.t.C }
func (s stdTicker) Stop() { s.t.Stop() }
// lookupSelfResponse is the subset of /v1/auth/token/lookup-self we
// consume. Vault returns many other fields (policies, accessor, …)
// that are irrelevant to the renewal loop.
type lookupSelfResponse struct {
Data struct {
TTL int `json:"ttl"` // seconds remaining on the token
Renewable bool `json:"renewable"` // whether the token can be renewed
} `json:"data"`
}
// renewSelfResponse is the subset of /v1/auth/token/renew-self we
// consume. Per Vault's HTTP API, the renewed token's lease info
// lands in `auth.lease_duration` and `auth.renewable`.
type renewSelfResponse struct {
Auth struct {
LeaseDuration int `json:"lease_duration"`
Renewable bool `json:"renewable"`
} `json:"auth"`
}
// lookupSelf calls GET /v1/auth/token/lookup-self and returns the
// remaining TTL + the renewable flag. Used by Start to compute the
// initial tick cadence.
func (c *Connector) lookupSelf(ctx context.Context) (ttl time.Duration, renewable bool, err error) {
if c.config == nil || c.config.Token.IsEmpty() {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf: token not configured")
}
url := c.config.Addr + "/v1/auth/token/lookup-self"
req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
if err != nil {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf request build: %w", err)
}
if err := c.config.Token.Use(func(buf []byte) error {
req.Header.Set("X-Vault-Token", string(buf))
return nil
}); err != nil {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf token use: %w", err)
}
resp, err := c.renewClient.Do(req)
if err != nil {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf HTTP: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf body read: %w", err)
}
if resp.StatusCode != http.StatusOK {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf returned status %d: %s", resp.StatusCode, string(body))
}
var parsed lookupSelfResponse
if err := json.Unmarshal(body, &parsed); err != nil {
return 0, false, fmt.Errorf("vault token-renewal lookupSelf parse: %w", err)
}
return time.Duration(parsed.Data.TTL) * time.Second, parsed.Data.Renewable, nil
}
// renewSelfResult is returned by renewSelf — it lets the loop both
// update the in-memory TTL AND react to a renewable=false flip on
// the same call without an extra round-trip.
type renewSelfResult struct {
NewTTL time.Duration
Renewable bool
}
// renewSelf calls POST /v1/auth/token/renew-self with an empty body
// (Vault accepts `{}`) and returns the renewed lease's TTL +
// renewable flag. The caller is responsible for stopping the loop
// when Renewable goes false.
func (c *Connector) renewSelf(ctx context.Context) (renewSelfResult, error) {
if c.config == nil || c.config.Token.IsEmpty() {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: token not configured; rotate the token before TTL expires")
}
url := c.config.Addr + "/v1/auth/token/renew-self"
req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader([]byte(`{}`)))
if err != nil {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: request build: %w; rotate the token before TTL expires", err)
}
req.Header.Set("Content-Type", "application/json")
if err := c.config.Token.Use(func(buf []byte) error {
req.Header.Set("X-Vault-Token", string(buf))
return nil
}); err != nil {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: token use: %w; rotate the token before TTL expires", err)
}
resp, err := c.renewClient.Do(req)
if err != nil {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: HTTP error: %w; rotate the token before TTL expires", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: body read: %w; rotate the token before TTL expires", err)
}
if resp.StatusCode != http.StatusOK {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: status %d: %s; rotate the token before TTL expires", resp.StatusCode, string(body))
}
var parsed renewSelfResponse
if err := json.Unmarshal(body, &parsed); err != nil {
return renewSelfResult{}, fmt.Errorf("vault token renewal failed: parse: %w; rotate the token before TTL expires", err)
}
return renewSelfResult{
NewTTL: time.Duration(parsed.Auth.LeaseDuration) * time.Second,
Renewable: parsed.Auth.Renewable,
}, nil
}
// Start kicks off the renew-self goroutine. Implements
// issuer.Lifecycle. Returns nil on success (goroutine running) or an
// error if the initial lookupSelf failed (no goroutine spawned).
//
// The initial cadence is TTL/2 (floored at minRenewInterval), computed
// from the startup lookup. Each successful renewal updates the in-memory
// TTL and the goroutine resets its ticker to the new TTL/2 — so a
// short bootstrap token that gets renewed up to a longer Max TTL
// shifts to the longer cadence automatically.
//
// On `renewable: false` Start still returns nil. If the initial lookup
// reports it, no goroutine is spawned; if a later renewal reports it,
// the loop exits. Either way a WARN is logged and the operator must
// rotate the Vault token before its current TTL expires.
func (c *Connector) Start(ctx context.Context) error {
c.renewMu.Lock()
if c.renewStarted {
c.renewMu.Unlock()
return nil // idempotent: already running
}
if c.config == nil || c.config.Token.IsEmpty() {
c.renewMu.Unlock()
return fmt.Errorf("vault token-renewal Start: token not configured (call ValidateConfig first)")
}
c.renewMu.Unlock()
// Initial lookup — short timeout so a misconfigured Vault address
// fails Start fast rather than blocking the server's startup
// sequence indefinitely. The renewal goroutine itself uses the
// per-tick context for its own deadlines.
lookupCtx, cancel := context.WithTimeout(ctx, 30*time.Second)
ttl, renewable, err := c.lookupSelf(lookupCtx)
cancel()
if err != nil {
return fmt.Errorf("vault token-renewal Start: initial lookupSelf: %w", err)
}
c.logger.Info("vault token-renewal loop starting",
"addr", c.config.Addr,
"ttl_seconds", int(ttl.Seconds()),
"renewable", renewable,
)
if !renewable {
// Don't spawn the goroutine — the token is already non-
// renewable. Surface via the metric so operators see it in
// Grafana even before any tick fires.
c.recordRenewal("not_renewable")
c.logger.Warn("vault token is not renewable at startup; renew-self loop will not run — rotate the token before its TTL expires",
"ttl_seconds", int(ttl.Seconds()),
)
return nil
}
// Spawn the goroutine. Use a derived ctx so Stop() can cancel
// independently of the parent.
loopCtx, loopCancel := context.WithCancel(ctx)
done := make(chan struct{})
c.renewMu.Lock()
c.renewStarted = true
c.renewCancel = loopCancel
c.renewDone = done
c.renewMu.Unlock()
interval := computeInterval(ttl)
go c.renewLoop(loopCtx, interval, done)
c.logger.Info("vault token-renewal loop started",
"interval_seconds", int(interval.Seconds()),
)
return nil
}
// Stop blocks until the renew-self goroutine has exited.
// Implements issuer.Lifecycle. Idempotent.
func (c *Connector) Stop() {
c.renewMu.Lock()
cancel := c.renewCancel
done := c.renewDone
started := c.renewStarted
c.renewMu.Unlock()
if !started {
return
}
if cancel != nil {
cancel()
}
if done != nil {
<-done
}
}
// renewLoop is the actual goroutine body. Owns the ticker, the
// in-memory TTL, and the renewable-flag state machine. Exits on
// ctx.Done() or on `renewable: false`.
func (c *Connector) renewLoop(ctx context.Context, initial time.Duration, done chan struct{}) {
defer close(done)
factory := c.renewTickerFactory
if factory == nil {
factory = func(d time.Duration) renewTicker {
return stdTicker{t: time.NewTicker(d)}
}
}
ticker := factory(initial)
currentInterval := initial
defer ticker.Stop()
for {
select {
case <-ctx.Done():
c.logger.Info("vault token-renewal loop stopping (ctx cancelled)")
return
case <-ticker.C():
// Per-tick deadline derived from the current cadence —
// renew calls should comfortably finish in <1s, so a
// budget of min(interval, 30s) is generous.
tickBudget := currentInterval
if tickBudget > 30*time.Second {
tickBudget = 30 * time.Second
}
tickCtx, cancel := context.WithTimeout(ctx, tickBudget)
res, err := c.renewSelf(tickCtx)
cancel()
if err != nil {
c.recordRenewal("failure")
c.logger.Error(err.Error())
// Keep ticking — operator may rotate the token
// out-of-band, or the failure may be transient.
// Stopping on first failure would mean a 1s
// network blip kills the loop for the rest of
// process lifetime.
continue
}
if !res.Renewable {
c.recordRenewal("not_renewable")
c.logger.Warn("vault token is no longer renewable; renew-self loop exiting — rotate the token before its current TTL expires",
"ttl_seconds", int(res.NewTTL.Seconds()),
)
return
}
c.recordRenewal("success")
c.logger.Info("vault token renewed",
"new_ttl_seconds", int(res.NewTTL.Seconds()),
)
// If the new TTL/2 differs meaningfully from the
// current cadence, restart the ticker at the new
// rate. This handles the bootstrap-→-MaxTTL transition
// (short initial TTL renews up to a longer Max TTL,
// which we'd otherwise hammer at the old fast cadence
// for the rest of the process).
newInterval := computeInterval(res.NewTTL)
if differsEnough(currentInterval, newInterval) {
ticker.Stop()
ticker = factory(newInterval)
currentInterval = newInterval
c.logger.Info("vault token-renewal cadence updated",
"new_interval_seconds", int(newInterval.Seconds()),
)
}
}
}
}
// recordRenewal forwards result to the configured renewal recorder.
// It holds renewMu only long enough to read the recorder; the
// increment itself happens outside the lock (atomic.Uint64 inside
// VaultRenewalMetrics).
func (c *Connector) recordRenewal(result string) {
c.renewMu.Lock()
rec := c.renewRecorder
c.renewMu.Unlock()
if rec != nil {
rec.RecordRenewal(result)
}
}
// computeInterval returns TTL/2, floored at minRenewInterval to
// avoid degenerate fast cadence when a misconfigured Vault returns
// a tiny TTL.
func computeInterval(ttl time.Duration) time.Duration {
half := ttl / 2
if half < minRenewInterval {
return minRenewInterval
}
return half
}
// differsEnough decides whether to restart the ticker for a new
// cadence. We tolerate ±10% drift to avoid restart-thrash when
// Vault's renewed-lease duration wobbles around the static TTL.
func differsEnough(a, b time.Duration) bool {
if a == 0 || b == 0 {
return a != b
}
delta := a - b
if delta < 0 {
delta = -delta
}
tol := a / 10
if tol < 0 {
tol = -tol
}
return delta > tol
}
// Compile-time assertion that *Connector satisfies the optional
// Lifecycle extension interface. If a future refactor breaks this
// (e.g. drops Stop), the compile error fires here rather than in a
// far-away registry lookup site.
var _ issuer.Lifecycle = (*Connector)(nil)
@@ -0,0 +1,476 @@
package vault
// Top-10 fix #5 of the 2026-05-03 issuer-coverage audit. Pins the
// behaviour of the renew-self loop end to end:
//
// 1. cadence — at TTL/2 with a (configurable) deterministic ticker
// so the test isn't wall-clock bound;
// 2. terminate-on-not-renewable — if Vault returns renewable=false,
// the loop exits and the metric records the not_renewable
// result;
// 3. failure-surfaces — the metric counter increments on a 403 and
// the loop keeps ticking (transient blips don't kill it);
// 4. ctx-cancellation — Stop returns within a small budget after
// ctx is cancelled.
//
// These tests live INSIDE the `vault` package (not vault_test) so
// they can substitute the renewTickerFactory seam directly. The
// existing test files in this directory are split into vault_test
// (external, exercises the public API) and the package-internal
// _test.go files (this one) — Go's two-package test convention.
import (
"context"
"encoding/json"
"fmt"
"io"
"log/slog"
"net/http"
"net/http/httptest"
"strings"
"sync"
"sync/atomic"
"testing"
"time"
"github.com/shankar0123/certctl/internal/secret"
)
// fakeTicker is the deterministic ticker the tests inject via
// renewTickerFactory. Tests call Tick() to fire the ticker channel
// at the moment of their choosing — no real time elapses.
type fakeTicker struct {
ch chan time.Time
stopCalls atomic.Uint64
}
func newFakeTicker() *fakeTicker {
return &fakeTicker{ch: make(chan time.Time, 4)}
}
func (f *fakeTicker) C() <-chan time.Time { return f.ch }
func (f *fakeTicker) Stop() { f.stopCalls.Add(1) }
func (f *fakeTicker) Tick() { f.ch <- time.Now() }
// renewMockHandler is the per-test httptest handler shape. Tests
// configure it to control lookup-self / renew-self responses.
type renewMockHandler struct {
mu sync.Mutex
lookupTTLSeconds int
lookupRenewable bool
renewSelfStatuses []renewSelfStub // queued; consumed in order
renewSelfCalls atomic.Uint64
lookupSelfCalls atomic.Uint64
noMoreCalls func() // called when renew-self is hit with no stubs left
}
// renewSelfStub configures one expected renew-self response.
type renewSelfStub struct {
status int
body string // override the canned body
leaseDuration int
renewable bool
}
func (h *renewMockHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
switch r.URL.Path {
case "/v1/auth/token/lookup-self":
h.lookupSelfCalls.Add(1)
h.mu.Lock()
ttl, renewable := h.lookupTTLSeconds, h.lookupRenewable
h.mu.Unlock()
body := fmt.Sprintf(`{"data":{"ttl":%d,"renewable":%t}}`, ttl, renewable)
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_, _ = io.WriteString(w, body)
case "/v1/auth/token/renew-self":
h.renewSelfCalls.Add(1)
h.mu.Lock()
var stub renewSelfStub
if len(h.renewSelfStatuses) > 0 {
stub = h.renewSelfStatuses[0]
h.renewSelfStatuses = h.renewSelfStatuses[1:]
} else {
h.mu.Unlock()
if h.noMoreCalls != nil {
h.noMoreCalls()
}
http.Error(w, "no more renew-self stubs configured", http.StatusInternalServerError)
return
}
h.mu.Unlock()
w.Header().Set("Content-Type", "application/json")
status := stub.status
if status == 0 {
status = http.StatusOK
}
w.WriteHeader(status)
body := stub.body
if body == "" {
body = fmt.Sprintf(`{"auth":{"lease_duration":%d,"renewable":%t}}`, stub.leaseDuration, stub.renewable)
}
_, _ = io.WriteString(w, body)
default:
http.NotFound(w, r)
}
}
// quietTestLogger returns a logger that writes to io.Discard, so no
// log output reaches the test run; the ERROR level filter also drops
// INFO/WARN records before they are formatted. Tests assert via the
// recorder + ticker hooks instead of log output.
func quietTestLogger() *slog.Logger {
return slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{Level: slog.LevelError}))
}
// mockRecorder counts RecordRenewal calls per result. Replaces the
// production *service.VaultRenewalMetrics for unit-test isolation.
type mockRecorder struct {
mu sync.Mutex
counts map[string]uint64
}
func newMockRecorder() *mockRecorder {
return &mockRecorder{counts: make(map[string]uint64)}
}
func (m *mockRecorder) RecordRenewal(result string) {
m.mu.Lock()
defer m.mu.Unlock()
m.counts[result]++
}
func (m *mockRecorder) get(result string) uint64 {
m.mu.Lock()
defer m.mu.Unlock()
return m.counts[result]
}
// buildTestConnector constructs a vault.Connector pointed at the
// httptest server, with the deterministic ticker factory and the
// supplied recorder.
func buildTestConnector(srvURL string, ticker *fakeTicker, rec RenewalRecorder) *Connector {
c := New(&Config{
Addr: srvURL,
Token: secret.NewRefFromString("hvs.test-token"),
Mount: "pki",
Role: "web",
}, quietTestLogger())
c.renewTickerFactory = func(d time.Duration) renewTicker { return ticker }
if rec != nil {
c.SetRenewalRecorder(rec)
}
return c
}
// TestVault_RenewLoop_TickAtHalfTTL pins that the loop calls
// renew-self once per ticker fire. Cadence assertion is via the
// fake ticker: Tick three times → expect three renew-self calls.
// (Production cadence — TTL/2 — is verified by assertions on
// computeInterval below; substituting the ticker here keeps the
// test wall-clock-free.)
func TestVault_RenewLoop_TickAtHalfTTL(t *testing.T) {
mock := &renewMockHandler{
lookupTTLSeconds: 4, // TTL/2 = 2s, floored to minRenewInterval (5s); the fake ticker ignores it
lookupRenewable: true,
renewSelfStatuses: []renewSelfStub{
{leaseDuration: 4, renewable: true},
{leaseDuration: 4, renewable: true},
{leaseDuration: 4, renewable: true},
},
}
srv := httptest.NewServer(mock)
defer srv.Close()
ticker := newFakeTicker()
rec := newMockRecorder()
c := buildTestConnector(srv.URL, ticker, rec)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
if err := c.Start(ctx); err != nil {
t.Fatalf("Start: %v", err)
}
defer c.Stop()
if mock.lookupSelfCalls.Load() != 1 {
t.Errorf("expected exactly 1 lookup-self at startup, got %d", mock.lookupSelfCalls.Load())
}
// Fire three ticks; each should drive one renew-self.
for i := 0; i < 3; i++ {
ticker.Tick()
}
// Wait briefly for the goroutine to drain the channel sends.
deadline := time.Now().Add(2 * time.Second)
for time.Now().Before(deadline) {
if rec.get("success") >= 3 {
break
}
time.Sleep(10 * time.Millisecond)
}
if got := rec.get("success"); got != 3 {
t.Errorf("expected 3 success renewals after 3 ticks, got %d", got)
}
if got := rec.get("failure"); got != 0 {
t.Errorf("expected 0 failures, got %d", got)
}
if got := rec.get("not_renewable"); got != 0 {
t.Errorf("expected 0 not_renewable events, got %d", got)
}
if got := mock.renewSelfCalls.Load(); got != 3 {
t.Errorf("expected 3 renew-self HTTP calls, got %d", got)
}
}
// TestVault_RenewLoop_StopsOnNotRenewable pins that the loop exits
// cleanly after Vault returns renewable=false on a renew-self call.
// The first tick renews successfully; the second returns
// renewable=false. The test then checks that Stop returns promptly,
// which it can only do once the goroutine has exited.
func TestVault_RenewLoop_StopsOnNotRenewable(t *testing.T) {
mock := &renewMockHandler{
lookupTTLSeconds: 4,
lookupRenewable: true,
renewSelfStatuses: []renewSelfStub{
{leaseDuration: 4, renewable: true},
{leaseDuration: 4, renewable: false}, // tells loop to stop
},
}
srv := httptest.NewServer(mock)
defer srv.Close()
ticker := newFakeTicker()
rec := newMockRecorder()
c := buildTestConnector(srv.URL, ticker, rec)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
if err := c.Start(ctx); err != nil {
t.Fatalf("Start: %v", err)
}
defer c.Stop()
ticker.Tick() // first renewal — success
ticker.Tick() // second renewal — renewable=false, loop exits
deadline := time.Now().Add(2 * time.Second)
for time.Now().Before(deadline) {
if rec.get("not_renewable") >= 1 {
break
}
time.Sleep(10 * time.Millisecond)
}
if got := rec.get("success"); got != 1 {
t.Errorf("expected 1 success before not_renewable, got %d", got)
}
if got := rec.get("not_renewable"); got != 1 {
t.Errorf("expected exactly 1 not_renewable event, got %d", got)
}
// Confirm the goroutine has exited: Stop cancels the loop's ctx
// and waits on renewDone. The loop should already be gone, so
// Stop returns near-immediately; a timeout here means the
// goroutine is wedged.
stopDone := make(chan struct{})
go func() {
c.Stop()
close(stopDone)
}()
select {
case <-stopDone:
// expected — goroutine had already exited.
case <-time.After(200 * time.Millisecond):
t.Error("Stop did not return within 200ms after renewable=false — goroutine leaked")
}
}
// TestVault_RenewLoop_FailureSurfacesViaMetric pins that a 403 on
// renew-self bumps the failure counter and the loop keeps ticking
// (transient blips do not kill the loop).
func TestVault_RenewLoop_FailureSurfacesViaMetric(t *testing.T) {
	mock := &renewMockHandler{
		lookupTTLSeconds: 4,
		lookupRenewable:  true,
		renewSelfStatuses: []renewSelfStub{
			{status: http.StatusForbidden, body: `{"errors":["permission denied"]}`},
			{leaseDuration: 4, renewable: true}, // loop continues; this tick succeeds
		},
	}
	srv := httptest.NewServer(mock)
	defer srv.Close()
	ticker := newFakeTicker()
	rec := newMockRecorder()
	c := buildTestConnector(srv.URL, ticker, rec)
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if err := c.Start(ctx); err != nil {
		t.Fatalf("Start: %v", err)
	}
	defer c.Stop()
	ticker.Tick() // first — fails with 403
	ticker.Tick() // second — succeeds
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if rec.get("failure") >= 1 && rec.get("success") >= 1 {
			break
		}
		time.Sleep(10 * time.Millisecond)
	}
	if got := rec.get("failure"); got != 1 {
		t.Errorf("expected 1 failure after 403, got %d", got)
	}
	if got := rec.get("success"); got != 1 {
		t.Errorf("expected 1 success after recovery, got %d", got)
	}
}

// TestVault_RenewLoop_CtxCancellation_StopsCleanly pins that
// cancelling ctx causes the goroutine to exit promptly. Stop()
// blocks on the goroutine's done channel; the test flags a slow exit
// if Stop takes longer than 200ms after cancel, and treats the
// goroutine as leaked if Stop has not returned within 500ms.
func TestVault_RenewLoop_CtxCancellation_StopsCleanly(t *testing.T) {
	mock := &renewMockHandler{
		lookupTTLSeconds:  4,
		lookupRenewable:   true,
		renewSelfStatuses: nil, // no ticks expected; ctx will cancel before any
	}
	srv := httptest.NewServer(mock)
	defer srv.Close()
	ticker := newFakeTicker()
	rec := newMockRecorder()
	c := buildTestConnector(srv.URL, ticker, rec)
	ctx, cancel := context.WithCancel(context.Background())
	if err := c.Start(ctx); err != nil {
		t.Fatalf("Start: %v", err)
	}
	// Cancel ctx; the goroutine should exit on ctx.Done() before
	// any tick fires.
	start := time.Now()
	cancel()
	stopDone := make(chan struct{})
	go func() {
		c.Stop()
		close(stopDone)
	}()
	select {
	case <-stopDone:
		elapsed := time.Since(start)
		if elapsed > 200*time.Millisecond {
			t.Errorf("Stop returned after %v — goroutine slow to exit", elapsed)
		}
	case <-time.After(500 * time.Millisecond):
		t.Fatal("Stop did not return within 500ms after ctx cancellation — goroutine leaked")
	}
	// No renew-self calls should have fired (the test never sends a tick).
	if got := mock.renewSelfCalls.Load(); got != 0 {
		t.Errorf("expected 0 renew-self HTTP calls, got %d", got)
	}
}

// TestVault_RenewLoop_StartsNothingWhenNotRenewable pins the
// startup short-circuit: if lookup-self returns renewable=false at
// boot, Start does not spawn the goroutine and the metric records
// the not_renewable result so operators see it in Grafana before
// any tick would have fired.
func TestVault_RenewLoop_StartsNothingWhenNotRenewable(t *testing.T) {
	mock := &renewMockHandler{
		lookupTTLSeconds: 60,
		lookupRenewable:  false, // already non-renewable at boot
	}
	srv := httptest.NewServer(mock)
	defer srv.Close()
	ticker := newFakeTicker()
	rec := newMockRecorder()
	c := buildTestConnector(srv.URL, ticker, rec)
	if err := c.Start(context.Background()); err != nil {
		t.Fatalf("Start should not error on initially-non-renewable token; got: %v", err)
	}
	defer c.Stop()
	if got := rec.get("not_renewable"); got != 1 {
		t.Errorf("expected 1 not_renewable event from startup short-circuit, got %d", got)
	}
	// Tick should be a no-op — no goroutine running.
	ticker.Tick()
	time.Sleep(100 * time.Millisecond)
	if got := mock.renewSelfCalls.Load(); got != 0 {
		t.Errorf("expected 0 renew-self HTTP calls (loop never started), got %d", got)
	}
}

// TestVault_ComputeInterval pins the cadence-derivation rules: TTL/2
// for normal tokens, floored at minRenewInterval for misconfigured
// short TTLs that would otherwise hammer Vault's audit log.
func TestVault_ComputeInterval(t *testing.T) {
	tests := []struct {
		name string
		ttl  time.Duration
		want time.Duration
	}{
		{"hour-ttl", time.Hour, 30 * time.Minute},
		{"day-ttl", 24 * time.Hour, 12 * time.Hour},
		{"floor-applies-tiny", 2 * time.Second, minRenewInterval},
		{"floor-applies-zero", 0, minRenewInterval},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			got := computeInterval(tc.ttl)
			if got != tc.want {
				t.Errorf("computeInterval(%v) = %v, want %v", tc.ttl, got, tc.want)
			}
		})
	}
}

// TestVault_RenewSelf_ParseFailure_NamesActionableInError pins that
// failures surface with operator-actionable framing. We test the
// parse-failure path (HTTP 200 with a non-JSON body); the
// HTTP-failure path lives in the same wrap chain.
func TestVault_RenewSelf_ParseFailure_NamesActionableInError(t *testing.T) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = io.WriteString(w, `not json`)
	}))
	defer srv.Close()
	c := buildTestConnector(srv.URL, newFakeTicker(), nil)
	_, err := c.renewSelf(context.Background())
	if err == nil {
		t.Fatal("expected error from renewSelf with bad JSON, got nil")
	}
	if !strings.Contains(err.Error(), "vault token renewal failed") {
		t.Errorf("expected 'vault token renewal failed' framing in surfaced error; got: %v", err)
	}
	if !strings.Contains(err.Error(), "rotate the token") {
		t.Errorf("expected 'rotate the token' operator-action substring in surfaced error; got: %v", err)
	}
}

// The blank-identifier assignment below keeps the json import alive
// when the test file is edited and one of the json-using helpers
// temporarily disappears. Production has no use for this; tests do.
var _ = json.Marshal