Files
certctl/internal/connector/issuer/vault/vault.go
T
shankar0123 ceca3647eb vault: add automatic token renewal at TTL/2 + Prometheus metric
Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). Honours ctx.Done() for
     graceful shutdown via a per-loop done channel + Stop().
  2. On `renewable: false` response (initial lookup OR any subsequent
     renewal), the loop emits a WARN, increments the not_renewable
     counter, and exits. The operator must rotate the token before
     Vault's Max TTL elapses.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false, loop exits, third tick fires no HTTP call.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".

Cadence is dynamic — every successful renewal re-derives TTL/2
from the renewed lease's lease_duration, so a short bootstrap
token that gets renewed up to a longer Max TTL shifts to the
longer cadence automatically (defends against degenerate fast
ticking on a token whose Max TTL is far longer than its initial
TTL).

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
2026-05-03 21:24:27 +00:00

455 lines
15 KiB
Go

// Package vault implements the issuer.Connector interface for HashiCorp Vault PKI
// secrets engine.
//
// Vault PKI provides a full-featured private CA with certificate signing, revocation,
// CRL, and OCSP capabilities. This connector uses the Vault HTTP API to sign CSRs
// via the /v1/{mount}/sign/{role} endpoint, authenticated with a Vault token.
//
// Vault issues certificates synchronously (like step-ca), so GetOrderStatus always
// returns "completed". CRL and OCSP are delegated to Vault's own endpoints.
//
// Authentication: Vault token via X-Vault-Token header.
//
// Vault API used:
//
// GET /v1/sys/health - Health check
// POST /v1/{mount}/sign/{role} - Sign CSR
// POST /v1/{mount}/revoke - Revoke certificate
// GET /v1/{mount}/ca/pem - Get CA certificate
package vault
import (
"bytes"
"context"
"crypto/x509"
"encoding/json"
"encoding/pem"
"fmt"
"io"
"log/slog"
"net/http"
"strings"
"sync"
"time"
"github.com/shankar0123/certctl/internal/connector/issuer"
"github.com/shankar0123/certctl/internal/secret"
)
// Config represents the Vault PKI issuer connector configuration.
type Config struct {
// Addr is the Vault server address (e.g., "https://vault.example.com:8200").
// Required. Set via CERTCTL_VAULT_ADDR environment variable.
Addr string `json:"addr"`
// Token is the Vault token for authentication.
// Required. Set via CERTCTL_VAULT_TOKEN environment variable.
//
// Type: *secret.Ref (audit fix #6 Phase 3 — Bundle I close).
// Wrapping the token in a Ref means: it never stringifies (Config
// marshals as "[redacted]"), the bytes are zeroed after each
// Use/WriteTo invocation (defeats heap-dump extraction), and
// outbound X-Vault-Token header writes go through Ref.Use so the
// staging buffer is short-lived. JSON unmarshal of a string value
// populates the Ref via NewRefFromString — operator config files
// are unchanged.
Token *secret.Ref `json:"token"`
// Mount is the PKI secrets engine mount path.
// Default: "pki". Set via CERTCTL_VAULT_MOUNT environment variable.
Mount string `json:"mount"`
// Role is the PKI role name used for signing certificates.
// Required. Set via CERTCTL_VAULT_ROLE environment variable.
Role string `json:"role"`
// TTL is the requested certificate TTL (e.g., "8760h" for 1 year).
// Default: "8760h". Set via CERTCTL_VAULT_TTL environment variable.
TTL string `json:"ttl"`
}
// Connector implements the issuer.Connector interface for Vault PKI.
type Connector struct {
config *Config
logger *slog.Logger
httpClient *http.Client
// Token-renewal loop fields. Top-10 fix #5 of the 2026-05-03
// issuer-coverage audit. Long-lived certctl-server deploys hit
// Vault token expiry; the loop calls /v1/auth/token/renew-self at
// TTL/2 cadence so the integration stays alive up to Vault's
// configured Max TTL. See vault_renew.go for Start / Stop /
// renewSelf / lookupSelf.
//
// renewMu guards startedOnce + cancel + done. The ticker runs in a
// goroutine that owns its own copy of these channels.
renewMu sync.Mutex
renewStarted bool // true after Start spawned the goroutine
renewCancel func() // cancels the goroutine's ctx
renewDone chan struct{} // closed when goroutine exits
renewRecorder RenewalRecorder // optional metric sink (defaults to no-op)
// renewTickerFactory lets tests substitute a deterministic ticker
// implementation for cadence assertions. Production callers leave
// this nil and the loop uses time.NewTicker.
renewTickerFactory func(d time.Duration) renewTicker
// renewClient is the HTTP client used for renew-self / lookup-self.
// Defaults to httpClient; a separate seam lets tests inject an
// httptest.Server-bound client without disturbing the issuance
// path's client.
renewClient *http.Client
}
// New creates a new Vault PKI connector with the given configuration and logger.
func New(config *Config, logger *slog.Logger) *Connector {
if config != nil {
if config.Mount == "" {
config.Mount = "pki"
}
if config.TTL == "" {
config.TTL = "8760h"
}
}
httpClient := &http.Client{
Timeout: 30 * time.Second,
}
return &Connector{
config: config,
logger: logger,
httpClient: httpClient,
renewClient: httpClient,
renewRecorder: noopRenewalRecorder{},
}
}
// SetRenewalRecorder wires a metric sink for the renew-self loop. The
// recorder's RecordRenewal(result string) is called with one of the
// enum values "success", "failure", or "not_renewable" on every tick.
// Pass nil to disable recording. Safe to call before Start; calling
// after Start has no effect on already-emitted increments.
//
// The interface lives in this package (not internal/service) to avoid
// an import cycle: vault is a connector package that the service-layer
// IssuerRegistry imports. The service-layer concrete type
// (*service.VaultRenewalMetrics) satisfies this interface and is wired
// in cmd/server/main.go.
func (c *Connector) SetRenewalRecorder(r RenewalRecorder) {
if r == nil {
r = noopRenewalRecorder{}
}
c.renewMu.Lock()
defer c.renewMu.Unlock()
c.renewRecorder = r
}
// vaultResponse is the standard Vault API response wrapper.
type vaultResponse struct {
Data json.RawMessage `json:"data"`
Errors []string `json:"errors,omitempty"`
Warnings []string `json:"warnings,omitempty"`
}
// signData holds the data returned from the /sign endpoint.
type signData struct {
Certificate string `json:"certificate"`
IssuingCA string `json:"issuing_ca"`
CAChain []string `json:"ca_chain"`
SerialNumber string `json:"serial_number"`
Expiration int64 `json:"expiration"`
}
// ValidateConfig checks that the Vault configuration is valid and the server is reachable.
func (c *Connector) ValidateConfig(ctx context.Context, rawConfig json.RawMessage) error {
var cfg Config
if err := json.Unmarshal(rawConfig, &cfg); err != nil {
return fmt.Errorf("invalid Vault config: %w", err)
}
if cfg.Addr == "" {
return fmt.Errorf("Vault addr is required")
}
if cfg.Token.IsEmpty() {
return fmt.Errorf("Vault token is required")
}
if cfg.Role == "" {
return fmt.Errorf("Vault role is required")
}
if cfg.Mount == "" {
cfg.Mount = "pki"
}
if cfg.TTL == "" {
cfg.TTL = "8760h"
}
// Health check
healthURL := cfg.Addr + "/v1/sys/health"
req, err := http.NewRequestWithContext(ctx, http.MethodGet, healthURL, nil)
if err != nil {
return fmt.Errorf("failed to create health check request: %w", err)
}
resp, err := c.httpClient.Do(req)
if err != nil {
return fmt.Errorf("Vault not reachable at %s: %w", cfg.Addr, err)
}
defer resp.Body.Close()
// Vault health returns 200 for initialized+unsealed, 429 for standby, 472 for DR secondary,
// 473 for perf standby, 501 for uninitialized, 503 for sealed
if resp.StatusCode != http.StatusOK && resp.StatusCode != http.StatusTooManyRequests {
return fmt.Errorf("Vault health check returned status %d", resp.StatusCode)
}
c.config = &cfg
c.logger.Info("Vault PKI configuration validated",
"addr", cfg.Addr,
"mount", cfg.Mount,
"role", cfg.Role)
return nil
}
// IssueCertificate submits a CSR to Vault PKI for signing.
func (c *Connector) IssueCertificate(ctx context.Context, request issuer.IssuanceRequest) (*issuer.IssuanceResult, error) {
c.logger.Info("processing Vault PKI issuance request",
"common_name", request.CommonName,
"san_count", len(request.SANs))
// Determine TTL — cap to MaxTTLSeconds from profile if specified
ttl := c.config.TTL
if request.MaxTTLSeconds > 0 {
ttl = fmt.Sprintf("%ds", request.MaxTTLSeconds)
}
// Build the sign request body
signBody := map[string]interface{}{
"csr": request.CSRPEM,
"common_name": request.CommonName,
"ttl": ttl,
}
if len(request.SANs) > 0 {
signBody["alt_names"] = strings.Join(request.SANs, ",")
}
body, err := json.Marshal(signBody)
if err != nil {
return nil, fmt.Errorf("failed to marshal sign request: %w", err)
}
// POST /v1/{mount}/sign/{role}
signURL := fmt.Sprintf("%s/v1/%s/sign/%s", c.config.Addr, c.config.Mount, c.config.Role)
req, err := http.NewRequestWithContext(ctx, http.MethodPost, signURL, bytes.NewReader(body))
if err != nil {
return nil, fmt.Errorf("failed to create sign request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
if err := c.config.Token.Use(func(buf []byte) error {
req.Header.Set("X-Vault-Token", string(buf))
return nil
}); err != nil {
return nil, fmt.Errorf("vault token use: %w", err)
}
resp, err := c.httpClient.Do(req)
if err != nil {
return nil, fmt.Errorf("Vault sign request failed: %w", err)
}
defer resp.Body.Close()
respBody, err := io.ReadAll(resp.Body)
if err != nil {
return nil, fmt.Errorf("failed to read sign response: %w", err)
}
if resp.StatusCode != http.StatusOK {
var vaultResp vaultResponse
if jsonErr := json.Unmarshal(respBody, &vaultResp); jsonErr == nil && len(vaultResp.Errors) > 0 {
return nil, fmt.Errorf("Vault sign returned status %d: %s", resp.StatusCode, strings.Join(vaultResp.Errors, "; "))
}
return nil, fmt.Errorf("Vault sign returned status %d: %s", resp.StatusCode, string(respBody))
}
// Parse the Vault response
var vaultResp vaultResponse
if err := json.Unmarshal(respBody, &vaultResp); err != nil {
return nil, fmt.Errorf("failed to parse Vault response: %w", err)
}
var data signData
if err := json.Unmarshal(vaultResp.Data, &data); err != nil {
return nil, fmt.Errorf("failed to parse Vault sign data: %w", err)
}
if data.Certificate == "" {
return nil, fmt.Errorf("no certificate in Vault sign response")
}
// Parse the leaf certificate to extract metadata
certPEM := data.Certificate
block, _ := pem.Decode([]byte(certPEM))
if block == nil {
return nil, fmt.Errorf("failed to decode certificate PEM from Vault")
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
return nil, fmt.Errorf("failed to parse certificate: %w", err)
}
// Build chain PEM from ca_chain or issuing_ca
var chainPEM string
if len(data.CAChain) > 0 {
chainPEM = strings.Join(data.CAChain, "\n")
} else if data.IssuingCA != "" {
chainPEM = data.IssuingCA
}
// Normalize serial: Vault uses colon-separated hex (e.g., "aa:bb:cc"), convert to plain string
serial := normalizeSerial(data.SerialNumber)
orderID := fmt.Sprintf("vault-%s", serial)
c.logger.Info("Vault PKI certificate issued",
"common_name", request.CommonName,
"serial", serial,
"not_after", cert.NotAfter)
return &issuer.IssuanceResult{
CertPEM: certPEM,
ChainPEM: chainPEM,
Serial: serial,
NotBefore: cert.NotBefore,
NotAfter: cert.NotAfter,
OrderID: orderID,
}, nil
}
// RenewCertificate renews a certificate by creating a new signing request.
// For Vault PKI, renewal is functionally identical to issuance (new cert signed from CSR).
func (c *Connector) RenewCertificate(ctx context.Context, request issuer.RenewalRequest) (*issuer.IssuanceResult, error) {
c.logger.Info("processing Vault PKI renewal request",
"common_name", request.CommonName,
"san_count", len(request.SANs))
return c.IssueCertificate(ctx, issuer.IssuanceRequest{
CommonName: request.CommonName,
SANs: request.SANs,
CSRPEM: request.CSRPEM,
EKUs: request.EKUs,
MaxTTLSeconds: request.MaxTTLSeconds,
})
}
// RevokeCertificate revokes a certificate at Vault PKI.
func (c *Connector) RevokeCertificate(ctx context.Context, request issuer.RevocationRequest) error {
c.logger.Info("processing Vault PKI revocation request", "serial", request.Serial)
revokeBody := map[string]interface{}{
"serial_number": request.Serial,
}
body, err := json.Marshal(revokeBody)
if err != nil {
return fmt.Errorf("failed to marshal revoke request: %w", err)
}
revokeURL := fmt.Sprintf("%s/v1/%s/revoke", c.config.Addr, c.config.Mount)
req, err := http.NewRequestWithContext(ctx, http.MethodPost, revokeURL, bytes.NewReader(body))
if err != nil {
return fmt.Errorf("failed to create revoke request: %w", err)
}
req.Header.Set("Content-Type", "application/json")
if err := c.config.Token.Use(func(buf []byte) error {
req.Header.Set("X-Vault-Token", string(buf))
return nil
}); err != nil {
return fmt.Errorf("vault token use: %w", err)
}
resp, err := c.httpClient.Do(req)
if err != nil {
return fmt.Errorf("Vault revoke request failed: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
respBody, _ := io.ReadAll(resp.Body)
return fmt.Errorf("Vault revoke returned status %d: %s", resp.StatusCode, string(respBody))
}
c.logger.Info("Vault PKI certificate revoked", "serial", request.Serial)
return nil
}
// GetOrderStatus returns the status of a Vault PKI order.
// Vault signs synchronously, so orders are always "completed" immediately.
func (c *Connector) GetOrderStatus(ctx context.Context, orderID string) (*issuer.OrderStatus, error) {
return &issuer.OrderStatus{
OrderID: orderID,
Status: "completed",
UpdatedAt: time.Now(),
}, nil
}
// GenerateCRL is not supported because Vault serves CRL directly at /v1/{mount}/crl.
func (c *Connector) GenerateCRL(ctx context.Context, revokedCerts []issuer.RevokedCertEntry) ([]byte, error) {
return nil, fmt.Errorf("Vault serves CRL directly at /v1/%s/crl; use Vault's endpoint", c.config.Mount)
}
// SignOCSPResponse is not supported because Vault serves OCSP directly at /v1/{mount}/ocsp.
func (c *Connector) SignOCSPResponse(ctx context.Context, req issuer.OCSPSignRequest) ([]byte, error) {
return nil, fmt.Errorf("Vault serves OCSP directly at /v1/%s/ocsp; use Vault's endpoint", c.config.Mount)
}
// GetCACertPEM retrieves the CA certificate from Vault PKI.
func (c *Connector) GetCACertPEM(ctx context.Context) (string, error) {
caURL := fmt.Sprintf("%s/v1/%s/ca/pem", c.config.Addr, c.config.Mount)
req, err := http.NewRequestWithContext(ctx, http.MethodGet, caURL, nil)
if err != nil {
return "", fmt.Errorf("failed to create CA cert request: %w", err)
}
if err := c.config.Token.Use(func(buf []byte) error {
req.Header.Set("X-Vault-Token", string(buf))
return nil
}); err != nil {
return "", fmt.Errorf("vault token use: %w", err)
}
resp, err := c.httpClient.Do(req)
if err != nil {
return "", fmt.Errorf("Vault CA cert request failed: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
return "", fmt.Errorf("Vault CA cert returned status %d", resp.StatusCode)
}
body, err := io.ReadAll(resp.Body)
if err != nil {
return "", fmt.Errorf("failed to read CA cert response: %w", err)
}
return string(body), nil
}
// GetRenewalInfo returns nil, nil as Vault does not support ACME Renewal Information (ARI).
func (c *Connector) GetRenewalInfo(ctx context.Context, certPEM string) (*issuer.RenewalInfoResult, error) {
return nil, nil
}
// normalizeSerial converts Vault's colon-separated hex serial (e.g., "aa:bb:cc:dd")
// to a plain string representation suitable for storage.
func normalizeSerial(serial string) string {
return strings.ReplaceAll(serial, ":", "-")
}
// Ensure Connector implements the issuer.Connector interface.
var _ issuer.Connector = (*Connector)(nil)