Commit Graph

8 Commits

Author SHA1 Message Date
shankar0123 ceca3647eb vault: add automatic token renewal at TTL/2 + Prometheus metric
Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see
cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the
VaultPKI adapter authenticated with a static token and never called
renew-self. Long-lived deploys hit token expiry; the first
operator-visible signal was failed cert renewals on production
targets.

This commit:

  1. Connector.Start(ctx) spawns a goroutine that calls
     POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a
     one-shot lookup-self at startup). Honours ctx.Done() for
     graceful shutdown via a per-loop done channel + Stop().
  2. On `renewable: false` response (initial lookup OR any subsequent
     renewal), the loop emits a WARN, increments the not_renewable
     counter, and exits. The operator must rotate the token before
     Vault's Max TTL elapses.
  3. New Prometheus counter certctl_vault_token_renewals_total with
     labels result={success,failure,not_renewable}. Registered
     alongside existing certctl_issuance_* counters in
     internal/api/handler/metrics.go.
  4. ERROR-level logging on renewal failure with operator-actionable
     substring ("vault token renewal failed; rotate the token before
     TTL expires") so journalctl + grep find it. Loop keeps ticking
     after a failure — transient blips don't kill it.

New optional issuer.Lifecycle interface:

  type Lifecycle interface {
      Start(ctx context.Context) error
      Stop()
  }

Connectors that hold no background goroutines (almost all of them)
do not implement this — IssuerRegistry.StartLifecycles /
StopLifecycles feature-detect via type assertion. New
lifecycle-bearing connectors plug in by implementing the interface;
no further registry plumbing required.

Wiring (cmd/server/main.go):

  - service.NewVaultRenewalMetrics() instance is shared between
    issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built
    by Rebuild get a recorder) and metricsHandler.SetVaultRenewals
    (so the Prometheus exposer emits the new series).
  - issuerRegistry.StartLifecycles(ctx) is called after
    issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles
    is paired so goroutines exit cleanly on signal.
  - IssuerConnectorAdapter.Underlying() exposes the wrapped
    issuer.Connector so registry-level machinery can reach the
    concrete connector behind the adapter without duplicating the
    wiring at every call site.

Tests (internal/connector/issuer/vault/vault_renew_test.go):

  - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three
    renewals, all "success".
  - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns
    renewable=false, loop exits, third tick fires no HTTP call.
  - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403
    bumps "failure", second renewal succeeds → loop kept ticking.
  - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns
    within 200ms after ctx cancel.
  - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token
    already non-renewable at boot ⇒ no goroutine, "not_renewable"
    metric increments at startup so operators see it in Grafana.
  - TestVault_ComputeInterval — 4 cases pinning TTL/2 +
    minRenewInterval floor.
  - TestVault_RenewSelf_ParseFailure_NamesActionableInError —
    surfaced error contains "vault token renewal failed" + "rotate
    the token".

Cadence is dynamic — every successful renewal re-derives TTL/2
from the renewed lease's lease_duration, so a short bootstrap
token that gets renewed up to a longer Max TTL shifts to the
longer cadence automatically (defends against degenerate fast
ticking on a token whose Max TTL is far longer than its initial
TTL).

Documentation:
  - docs/connectors.md Vault PKI section gains "Token TTL +
    automatic renewal" subsection (operator-facing: cadence, metric,
    renewable=false rotation playbook).

Out of scope (intentional, flagged in the audit follow-up):
  - AppRole / Kubernetes / AWS IAM auth methods (different renewal
    semantics).
  - Hot-reload of rotated token from disk (operator restarts
    today; future: GUI/MCP issuer-update path triggers Rebuild
    which Stops the old connector and Starts the new one).
  - Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md
itself: "no longer maintains a hand-edited per-version changelog;
per-release notes are auto-generated from commit messages between
consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/...
  ./internal/connector/issuer/vault/... ./cmd/server/...  clean.
- go test -short -count=1 ./internal/connector/issuer/vault/...
  ./internal/service/... ./internal/api/handler/...  green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval'
  ./internal/connector/issuer/vault/...  green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md
Top-10 fix #5.
2026-05-03 21:24:27 +00:00
shankar0123 95c2bf9818 metrics: add per-issuer-type issuance counters, histogram, and failure classifier
Closes the #4 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Before this commit, certctl's Prometheus exposition had
zero per-issuer-type signal — operators answering "is DigiCert slow?"
or "is Sectigo failing more than ACME?" had to grep logs by issuer
name. This commit adds three series labelled by issuer type:

  certctl_issuance_total{issuer_type, outcome}
  certctl_issuance_duration_seconds{issuer_type}            (histogram)
  certctl_issuance_failures_total{issuer_type, error_class}

The histogram covers 0.05–120 second buckets to span the local-issuer
fast path and async-CA slow path (DigiCert/Sectigo/Entrust polling
can take minutes). error_class is a closed enum of eight values
(timeout, auth, rate_limited, validation, upstream_5xx, upstream_4xx,
network, other) classified once in service.ClassifyError. Cardinality
budget is ~276 new series, well within Prometheus's comfortable range.

Implementation:
- service.IssuanceMetrics is the thread-safe counter + histogram
  table. Three independent views (counters / failures / durations)
  exposed via SnapshotCounters / SnapshotFailures / SnapshotDurations.
  sync.RWMutex protects the map shape; per-key sync/atomic.Uint64
  primitives keep the recording hot path lock-free under concurrent
  service-layer goroutines.
- service.IssuanceCounterEntry / IssuanceFailureEntry /
  IssuanceDurationEntry / IssuanceMetricsSnapshotter live in service
  (not handler) to avoid an import cycle: handler already imports
  service for admin_est.go etc., so service can't import handler back.
  Handler's exposer takes the snapshotter via the service-defined
  interface.
- service.ClassifyError pure function maps error → error_class.
  context.DeadlineExceeded / context.Canceled → timeout; *net.OpError
  → network; substring matches against canonical AWS / DigiCert /
  Sectigo error shapes for auth / rate_limited / validation /
  upstream_5xx / upstream_4xx / network; unknown → other. Each branch
  has at least one representative test case in
  TestClassifyError.
- IssuerConnectorAdapter.SetMetrics wires per-adapter recording
  (issuerType + metrics). Existing 28+ test call sites of
  NewIssuerConnectorAdapter keep their one-arg signature; production
  wiring goes through SetMetrics post-construction.
- IssuerRegistry.SetIssuanceMetrics + Rebuild type-asserts to
  *IssuerConnectorAdapter and calls SetMetrics with the issuer type
  string. nil-guarded — tests that hand-build adapters without
  metrics get no-op recording.
- IssuerConnectorAdapter.IssueCertificate / RenewCertificate wrap the
  underlying connector call with start := time.Now() and
  recordIssuance(start, err). Renewal is recorded into the same
  certctl_issuance_* series as initial issuance — operationally,
  renewal IS issuance from the connector's perspective (matches the
  audit prompt's guidance on series naming).
- handler/metrics.go GetPrometheusMetrics gains a new exposer block
  emitting all three series in stable label order with correct
  Prometheus format (_bucket / _sum / _count for the histogram, +Inf
  bucket appended). Sorted via sort.Slice for stable output. nil-
  guarded so deploys without the wire produce clean exposition.
- formatLE helper trims trailing zeros from histogram bucket labels
  via strconv.FormatFloat(le, 'f', -1, 64) so the `le` labels match
  Prometheus client conventions ("0.05", "30", "120", not "0.0500"
  etc.).
- cmd/server/main.go wires a single IssuanceMetrics instance into
  both the IssuerRegistry (recording) and the MetricsHandler (exposer)
  using DefaultIssuanceBucketBoundaries.

Tests:
- TestIssuanceMetrics_RecordAndSnapshot — happy-path counter +
  histogram + failure recording, BucketBoundaries returns a copy
  (not shared storage).
- TestIssuanceMetrics_HistogramCumulative — pins the cumulative-buckets
  contract. 100ms observation lands in 0.1 bucket and every larger
  bucket; 750ms only in the 1.0 bucket. Off-by-one here would
  corrupt every quantile query downstream.
- TestIssuanceMetrics_Concurrency — 100 goroutines × 1000 ops under
  the race detector. Asserts atomic counter integrity across
  contended writes.
- TestClassifyError — 17 cases covering every branch of the closed
  enum plus the nil-error special case.

Implementation chooses the existing hand-rolled fmt.Fprintf
exposition pattern (no prometheus/client_golang dependency added)
to stay consistent with the OCSP / deploy counter blocks already in
the file.

Out of scope (separate follow-ups):
- Revocation metrics (certctl_revocation_*) — symmetric to issuance
  but the audit didn't ask; explicit follow-up commit.
- Discovery / health-check duration histograms.
- prometheus/client_golang migration.

Verified locally:
- gofmt clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/service/ green
- go test -short -count=1 -race -run TestIssuanceMetrics ./internal/service/ green
- go test -short -count=1 ./internal/api/handler/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #4 (Part 3, narrative section).
2026-05-02 00:39:25 +00:00
claude 188a41774a chore: gofmt fixes across deploy-hardening I new files
Phase 13 verification surfaced gofmt-formatting drift in 6 files
across the bundle's new code:

- internal/api/handler/metrics.go (struct field alignment)
- internal/connector/target/k8ssecret/validate_only_test.go (alignment)
- internal/connector/target/nginx/nginx.go (alignment)
- internal/connector/target/postfix/postfix.go (alignment)
- internal/connector/target/ssh/validate_only_test.go (alignment)
- internal/service/deploy_counters.go (alignment)

Pure mechanical gofmt -w fixes; no behavior changes. CI's
make verify gate (which runs `go fmt ./...`) didn't catch these
because go fmt is more lenient than gofmt -l, but golangci-lint
v2.11.4 + the explicit gofmt step in Phase 13 verification did.

Phase 13 full-matrix verification all green:
- gofmt -l: empty across all bundle-touched files
- go vet ./internal/deploy/... ./internal/connector/target/... ./internal/service/ ./internal/api/handler/ ./cmd/agent/: clean
- golangci-lint v2.11.4 (the version CI runs): 0 issues
- go test -race -count=1 across deploy + nginx + apache + haproxy + agent + service: all green
- INTEGRATION=1 go test -tags integration -run Deploy ./deploy/test/...: 4/4 e2e tests green

Phase 14 next: release prep — Active Focus update, release notes,
Reddit-beat draft, final tag handoff to operator.
2026-04-30 15:33:33 +00:00
claude 7e8c8cadbb feat(metrics): per-target-type deploy counters wired into /metrics/prometheus
Phase 10 of the deploy-hardening I master bundle. Mirrors the
production-hardening-II Phase 8 OCSP-counter pattern. Per frozen
decision 0.9, the metric naming convention is
`certctl_deploy_<area>_total` with target_type + sub-label.

internal/service/deploy_counters.go:
- DeployCounters struct with sync.Map of per-target-type buckets
  (apache, nginx, etc.). Lock-free fast path via sync/atomic
  Uint64 counters; LoadOrStore on first tick.
- 8 sub-counters per target-type bucket:
  - attemptsSuccess / attemptsFailure
  - validateFailures (PreCommit returned error)
  - reloadFailures (PostCommit returned error → rollback ran)
  - postVerifyFails (post-deploy TLS handshake failed)
  - rollbackRestored (rollback succeeded)
  - rollbackAlsoFail (operator-actionable escalation)
  - idempotentSkips (SHA-256 match → no-op deploy)
- Snapshot returns []DeploySnapshot for the Prometheus exposer.

internal/service/deploy_counters_test.go:
- 5 tests: zero-state, per-target-type tick isolation, race-detector
  smoke under concurrent ticks, cross-target bucket isolation,
  snapshot-mutation-doesn't-affect-counter.

internal/api/handler/metrics.go:
- New DeployCounterSnapshotter interface (mirrors CounterSnapshotter
  for the OCSP counters but uses the per-target-type tuple shape).
- New DeploySnapshotEntry struct copying the service-layer shape;
  avoids importing the service package directly so the handler
  stays dependency-light.
- New SetDeployCounters setter on MetricsHandler (mirrors
  SetOCSPCounters wiring).
- Prometheus exposer extended with 6 new metric blocks per frozen
  decision 0.9:
  - certctl_deploy_attempts_total{target_type, result}
  - certctl_deploy_validate_failures_total{target_type}
  - certctl_deploy_reload_failures_total{target_type}
  - certctl_deploy_post_verify_failures_total{target_type}
  - certctl_deploy_rollback_total{target_type, outcome}
  - certctl_deploy_idempotent_skip_total{target_type}
- Output sorted by target_type for stable diffs across requests.

The agent-side wire-up (cmd/agent/main.go ticking counters in the
DeployCertificate dispatch site) is intentionally deferred to a
follow-up commit — Phase 10's load-bearing change is the
infrastructure; per-connector tick wiring is a mechanical follow-on.

Build + go vet clean. go test -count=1 green for service +
handler packages.

Phase 11 next: cross-cutting integration tests at deploy/test/.
2026-04-30 15:25:38 +00:00
shankar0123 efce2363f7 feat(metrics): extend /metrics/prometheus with per-area OCSP counters (Phase 8)
Production hardening II Phase 8 — surface the OCSP per-event counters
shipped in Phase 1+2 through the existing /api/v1/metrics/prometheus
endpoint. Operators now alert on certctl_ocsp_counter_total
{label="rate_limited"} (Phase 3 trip), {label="nonce_malformed"}
(Phase 1 reject), {label="signing_failed"} (issuer connector fails),
etc.

NEW interface CounterSnapshotter (handler/metrics.go) — minimum
surface the Prometheus exposer needs from any per-area counter table:
just Snapshot() map[string]uint64. service.OCSPCounters.Snapshot
(Phase 1) satisfies it; future per-area counters (CRL, cert-export,
EST per-profile, SCEP per-profile, Intune per-profile) plug in the
same way as separate SetXxxCounters setters.

Naming convention per frozen decision 0.10:
  certctl_<area>_counter_total{label="<event>"} <value>

This commit ships only the OCSP block. The remaining areas (CRL,
cert-export, EST, SCEP, Intune) plug in via the same
SetXxxCounters pattern in follow-up commits — the wire-up cost per
area is one new field + one setter + one block of fmt.Fprintf lines.
The bundle's S-1 docs-count guard means we don't claim a specific
total in prose; operators run `curl /api/v1/metrics/prometheus | grep
certctl_` to enumerate.

Wired in cmd/server/main.go: a single shared *service.OCSPCounters
instance is created once and passed to BOTH the
ocspResponseCacheService (so the cache hot path ticks counters) AND
metricsHandler.SetOCSPCounters (so the Prometheus exposer reads
them). Existing dashboard metrics (certctl_certificate_total,
certctl_agent_total, etc.) remain unchanged at the same line offsets
— back-compat preserved.

Pre-commit verification: go build ./... clean; go test -short
-count=1 green for handler/ + service/. The existing
TestGetPrometheusMetrics_Success tests still pass (the new counter
block is additive at the END of the response body, after the
existing dashboard metrics + uptime line).
2026-04-30 05:15:05 +00:00
Shankar 15daf008aa I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.

Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
  (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
  Dead rows clear next_retry_at so the index stops matching them.

Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
  exponential backoff capped at 1h (notifRetryBackoffCap) with
  5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
  status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
  (audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
  wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
  retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
  status -> pending via notifRepo.Requeue.

Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
  CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.

StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
  NewStatsService call sites (main.go + stats_test.go + 8 digest
  tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
  notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
  (reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
  best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
  the same snapshot.

Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.

Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
  with queryKey: ['notifications', activeTab] so switching tabs
  doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
  full-text title tooltip.
- Requeue mutation wrapped as
    mutationFn: (id: string) => requeueNotification(id)
  to prevent react-query v5's positional context argument from
  leaking into the API client — pinned against future refactors
  by strict-match toHaveBeenCalledWith('notif-dead-001') in
  NotificationsPage.test.tsx:181.

Closes I-005.
2026-04-19 15:17:27 +00:00
Shankar be85fbd77e feat: add network certificate discovery (M21) and Prometheus metrics (M22)
M21 adds server-side active TLS scanning of CIDR ranges with concurrent
probing, sentinel agent pattern for pipeline reuse, and full CRUD API for
scan targets. M22 adds Prometheus exposition format endpoint alongside
existing JSON metrics. Comprehensive documentation audit updates all docs
to reflect 91 endpoints, 19 tables, 6 scheduler loops, and 900+ tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 23:37:47 -04:00
Shankar 45d6e2f7b2 feat: M14 — Observability (dashboard charts, agent fleet, stats API, metrics, structured logging, rollback)
Backend: StatsService with 5 aggregation methods, JSON metrics endpoint, slog-based
structured logging middleware. Stats API: dashboard summary, certificates-by-status,
expiration timeline, job trends, issuance rate. 23 new backend tests.

Frontend: Recharts-powered dashboard with 4 charts (status pie, expiration heatmap,
job trends line, issuance bar), agent fleet overview page with OS/arch grouping and
version breakdown, deployment rollback buttons on version history. 7 new frontend tests.

78 API endpoints, 744+ total tests (658 Go + 86 Vitest).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-22 19:46:13 -04:00