certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 16:31:33 +00:00

Author	SHA1	Message	Date
shankar0123	43836aca7c	feat(audit): COMP-001-HASH — per-row hash chain on audit_events (tamper-evidence)

Sprint 6 closure of the audit's HIGH-severity COMP-001-HASH finding.

Pre-fix posture: migration 000018 installs a WORM trigger on audit_events that blocks UPDATE / DELETE for the application role. But the trigger header itself documents a compliance-superuser bypass (backup restore, retention purges, breach recovery). Without a hash chain, that role can rewrite any row's actor / action / details / timestamp / event_category with no on-disk trace. HIPAA §164.312(b), FedRAMP AU-9, NIST 800-53 AU-10 want tamper-EVIDENCE, not just tamper-prevention. This commit ships the evidence layer.

Wire shape:

migrations/000047_audit_events_hash_chain.up.sql
+ pgcrypto extension (digest function)
+ audit_chain_head: single-row sentinel table holding the most recent row_hash; FOR UPDATE row-lock serialises chain writes under concurrent INSERTs so two parallel writers can't read the same prev_hash and produce a forked chain
+ audit_events: prev_hash + row_hash columns
+ audit_events_canonical_payload(): centralised hash input builder. UTC + microsecond ISO-8601 keeps the hash session-timezone-independent. All columns separated by '|' so a concatenation-ambiguity exploit can't fabricate a collision
+ audit_events_compute_hash_chain(): BEFORE-INSERT trigger function. Reads sentinel FOR UPDATE → computes sha256(prev_hash || id || actor || actor_type || action || resource_type || resource_id || details::text || timestamp_utc_iso || event_category) → writes both columns + advances the sentinel
+ backfill loop walks every existing row in (timestamp ASC, id ASC) order; WORM trigger temporarily DISABLEd inside this migration's transaction so backfill UPDATEs land cleanly, ENABLEd before COMMIT
+ audit_events_verify_chain(): STABLE plpgsql verifier. Walks the chain end-to-end and returns the first break: (first_break_id TEXT, first_break_pos INT, row_count INT)

internal/repository/postgres/audit.go
+ AuditRepository.VerifyHashChain — calls the SQL function and maps the OUT parameters to Go return values

internal/repository/interfaces.go
+ AuditRepository.VerifyHashChain in the contract; every in-memory mock + stub picks up the no-op implementation

internal/scheduler/scheduler.go
+ AuditChainVerifier + AuditChainBreakRecorder interfaces
+ auditChainVerifyInterval (default 6h)
+ auditChainVerifyLoop: runs once on start + every tick; atomic.Bool guard + 5-min per-tick context timeout match every other GC loop's pattern

internal/service/audit_chain_metric.go
+ AuditChainCounter type with atomic counters. Sticky-first-detection on (BrokenAtID, BrokenAtPos) so the actionable alarm doesn't drift across walks. Snapshot() returns the full state for the metrics handler

internal/api/handler/metrics.go
+ AuditChainCounterSnapshotter interface
+ Prometheus exposition for four series:
    certctl_audit_chain_break_detected_total   counter (the alarm)
    certctl_audit_chain_verify_total           counter (walks done)
    certctl_audit_chain_rows                   gauge (last walk size)
    certctl_audit_chain_last_verified_at       gauge (unix seconds)

internal/config/config.go
+ AuditChainConfig{ VerifyInterval } + CERTCTL_AUDIT_CHAIN_VERIFY_INTERVAL

cmd/server/main.go
+ wires AuditChainCounter into both the scheduler (recorder) + metrics handler (snapshotter) — single instance shared so the writer + reader are guaranteed to converge

internal/repository/postgres/audit_chain_test.go (NEW)
+ TestAuditEventsHashChain_FreshTable: empty walk → clean
+ TestAuditEventsHashChain_AppendLinksRows: three INSERTs produce a strictly-linked chain; prev_hash on row 0 is NULL; verifier walks clean over the 3 rows
+ TestAuditEventsHashChain_VerifierDetectsTampering: simulate the compliance-superuser threat model (DISABLE WORM, UPDATE a middle row, ENABLE WORM); verifier returns the tampered row's id at position 1

docs/operator/audit-chain.md (NEW)
+ Layered-defenses explainer (WORM + hash chain). Verifier function reference. Recommended Prometheus alert rule. Performance scaling table (10k to 10M rows). Step-by-step runbook for what to do when a break is detected. Operator configuration table.

Test-stub additions for AuditRepository.VerifyHashChain:
    internal/service/testutil_test.go — mockAuditRepo
    internal/service/acme_test.go — fakeAuditRepo
    internal/integration/lifecycle_test.go — mockAuditRepository
    internal/api/handler/scep_intune_e2e_test.go — intuneE2EAuditRepo

Verified locally:
    go vet ./... (clean)
    gofmt -l internal/ cmd/ (clean)
    go test -short -count=1 ./internal/scheduler/... ./internal/config/... ./internal/service/... ./internal/api/handler/... ./internal/repository/... (all green)

Verified with testcontainers + postgres:16-alpine + the migration runner (not gated under -short — requires docker):
    go test -count=1 -run TestAuditEventsHashChain ./internal/repository/postgres/...

Closes COMP-001-HASH leg of Sprint 6. COMP-002-RETENTION lands in the next commit (separate concern: federated-user PII retention).	2026-05-16 06:17:15 +00:00
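The chaining and verification idea in the commit above can be sketched in a few lines of Go. This is only an in-memory illustration under stated assumptions, not the migration's plpgsql: the `Event` struct, its trimmed field set, and the `rowHash`, `appendEvent`, and `verifyChain` helpers are hypothetical names, and the empty string stands in for the NULL prev_hash of the first row.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Event is a hypothetical, trimmed-down stand-in for an audit_events row.
type Event struct {
	ID, Actor, Action, Details string
	PrevHash, RowHash          string
}

// rowHash hashes the previous row's hash together with this row's fields.
// Fields are joined with '|', mirroring the separator the commit describes.
func rowHash(prev string, e Event) string {
	payload := strings.Join([]string{prev, e.ID, e.Actor, e.Action, e.Details}, "|")
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

// appendEvent links e to the current chain head and returns the extended chain.
func appendEvent(chain []Event, e Event) []Event {
	if len(chain) > 0 {
		e.PrevHash = chain[len(chain)-1].RowHash
	}
	e.RowHash = rowHash(e.PrevHash, e)
	return append(chain, e)
}

// verifyChain walks the chain in order and reports the first row whose stored
// hashes no longer match a recomputation. pos is -1 when the chain is clean.
func verifyChain(chain []Event) (firstBreakID string, pos int, rows int) {
	prev := ""
	for i, e := range chain {
		if e.PrevHash != prev || e.RowHash != rowHash(prev, e) {
			return e.ID, i, len(chain)
		}
		prev = e.RowHash
	}
	return "", -1, len(chain)
}

func main() {
	var chain []Event
	for _, id := range []string{"a", "b", "c"} {
		chain = appendEvent(chain, Event{ID: id, Actor: "alice", Action: "issue"})
	}
	chain[1].Actor = "mallory" // simulate an out-of-band UPDATE of the middle row
	id, pos, rows := verifyChain(chain)
	fmt.Println(id, pos, rows)
}
```

Because each row's hash covers the previous row's hash, editing a middle row without recomputing every later hash is detected at the edited row, which is the behaviour the VerifierDetectsTampering test pins.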
shankar0123	21aeed4f4e	legal: addlicense headers + normalize legacy variants (Phase 0 RED-4)

Phase 0 closure (Path B2, post-rewrite): addlicense sweep — adds the canonical certctl LLC copyright + BUSL-1.1 SPDX header to every production Go file.

Template:
    // Copyright 2026 certctl LLC. All rights reserved.
    // SPDX-License-Identifier: BUSL-1.1

Coverage: 338 / 338 production Go files (cmd/ + internal/, excluding _test.go and /testdata/). Pre-sweep coverage was 22 / 338 (6.5%); post-sweep is 338 / 338 (100%).

Normalized 22 pre-existing legacy headers (`// Copyright (c) certctl` + `// SPDX-License-Identifier: BSL-1.1`) and 1 file using a `Certctl Contributors` attribution. The legacy SPDX ID `BSL-1.1` is non-standard; the official SPDX identifier for Business Source License 1.1 is `BUSL-1.1` (capital U). All 338 files now share the canonical form.

Generated via:
    addlicense -c "certctl LLC" -y 2026 \
        -f cowork/legal/copyright-header.tpl \
        -ignore '/testdata/' -ignore '/_test.go' \
        cmd/ internal/

Verification:
    find cmd internal -name '.go' -not -name '_test.go' \
        -not -path '/testdata/' \
        -exec grep -L '^// Copyright 2026 certctl LLC' {} \; | wc -l
    Returns: 0

gofmt clean. Header additions are comments only, no compile impact.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-RED-4	2026-05-13 21:23:35 +00:00
shankar0123	75097909e9		2026-05-05 18:18:29 +00:00
shankar0123	aebfd8bd7c	Revert "chore: drop 'Infisical' label from internal references" This reverts commit `19706e56b3`.	2026-05-04 01:18:15 +00:00
shankar0123	19706e56b3	chore: drop 'Infisical' label from internal references

Strategic naming cleanup. Earlier doc-comments + commit messages framed Rank 4 / Rank 5 / Rank 7 work as 'Rank N of the 2026-05-03 Infisical deep-research deliverable' — the 'Infisical' qualifier was a holdover from the original deep-research framing where Infisical (a competing secrets-management platform) was the comparator. Keeping the comparator's name in our source adds noise without value; an external reader sees 'Infisical' and assumes a dependency or shared lineage rather than reading it as the competitive context it was.

Mechanical sed across 34 files (32 source / docs + 2 follow-up Python passes to collapse 'deep-research deep-research' duplicates that emerged where the original phrase wrapped across lines):

    s|Infisical deep-research|deep-research|g
    s|infisical-deep-research-results|deep-research-results-2026-05-03|g
    s|infisical-deep-research-prompt|deep-research-prompt-2026-05-03|g
    s|infisical-deep-research|deep-research|g
    s|Infisical|deep-research|g
    s|deep-research deep-research|deep-research|g   # collapse-pass

Net diff: 63 insertions / 64 deletions across cmd/, docs/, internal/, migrations/. Pure text substitution; zero behavior change. Code path unchanged — go vet clean, tests for TestApproval pass on both internal/service and internal/api/handler packages.

Workspace docs (cowork/) carry the same references and will be swept separately — they're not under certctl/ git control. The two filename references (cowork/infisical-deep-research-results.md + cowork/infisical-deep-research-prompt.md) get renamed alongside that sweep to deep-research-results-2026-05-03.md / deep-research-prompt-2026-05-03.md so cross-references in the certctl repo doc-comments resolve cleanly.	2026-05-04 01:15:01 +00:00
shankar0123	8b75e0311b	chore: rename Go module path to github.com/certctl-io/certctl Mechanical sed across the main go.mod's module declaration, the f5-mock-icontrol sub-module's go.mod, every Go file's import path (361 files), and a rebuild of the checked-in f5-mock-icontrol binary so its embedded build-info reflects the new module path. No behavior change. Choice B from cowork/transfer-certctl-to-org.md, executed 2026-05-04. Choice A (keep module path declared as github.com/shankar0123/certctl regardless of repo URL) shipped on the day of the org transfer (2026-05-03) since we had no external Go consumers; this commit closes that deferral. Backward-compat: GitHub HTTP redirects continue to forward github.com/shankar0123/certctl → github.com/certctl-io/certctl at the URL level, but Go's module proxy uses the path declared in go.mod as the canonical name. Pre-fix, anyone trying `go get github.com/certctl-io/certctl/...` hit a "module path mismatch" error because go.mod said github.com/shankar0123/certctl and the URL they fetched it from said certctl-io/certctl. Post-fix, the canonical name and the URL agree, so go get / go install / external Go consumers / Go-tooling integrations work cleanly via either the new path (preferred) or the old path (which redirects and Go follows the redirect for source fetch). Anyone still importing the old path inside their own code keeps working provided they update their go.mod's `require` line to match — the module path declared in their consumer's go.sum / go.mod is the authoritative import name, so a mass sed across their import statements is the migration on the consumer side. No external consumers exist today. 
Diff shape:
    361 *.go files — import path replacement only
    2 go.mod — module declaration replacement only
    1 binary — deploy/test/f5-mock-icontrol/f5-mock-icontrol rebuilt so embedded build-info reflects the new path (8618965 vs 8618933 bytes; 32-byte diff is the build-info change)
    Total: 364 files, 730 insertions / 730 deletions, net-zero size, pure mechanical substitution.

Verification:
    gofmt: 17 files needed re-alignment after sed (the new path is one char shorter than the old, so column-aligned import groups drifted). Applied `gofmt -w` to fix.
    go mod tidy: clean exit on both modules.
    go vet ./...: clean exit.
    go build ./...: clean exit.
    go test -short -count=1 on representative packages: all green (internal/domain, internal/validation, internal/crypto, internal/crypto/signer, cmd/agent). Test output now reads `ok github.com/certctl-io/certctl/...` confirming the module path resolves correctly.
    binary: f5-mock-icontrol rebuilt; `strings | grep shankar0123` returns nothing; `strings | grep certctl-io/certctl` shows the new module path embedded in build-info.

Files intentionally NOT touched in this commit:
    README.md / CHANGELOG.md / docs/ / etc. — already swept to certctl-io URLs in commit `0729ee4` (the post-transfer URL refresh). This commit is purely the Go-tooling layer.
    Scarf pixels (`shankar0123.docker.scarf.sh/...`) — Scarf-account namespace, not a Go import or GitHub repo URL. Stays.

This is a non-blocking, non-customer-impacting change. Operators pulling container images, running `make verify`, hitting the API, or installing the agent see no functional difference. Only Go-tooling consumers (none today) are affected, and they're enabled — not broken — by this commit.	2026-05-04 00:30:29 +00:00
shankar0123	109f32ff41	notifications: per-policy multi-channel expiry-alert routing

Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable (see cowork/infisical-deep-research-results.md Part 5).

Pre-fix, RenewalService.CheckExpiringCertificates already ran daily, RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and NotificationService.SendThresholdAlert deduped per (cert, threshold) — but the channel was hardcoded to Email (internal/service/notification.go:118 pre-fix). Operators who configured PagerDuty / Slack / Teams / OpsGenie via CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold unless SMTP was also wired. Their first signal of an expired cert was a 3 AM outage.

This commit lands the routing matrix on top of the existing infrastructure:

1. RenewalPolicy gains AlertChannels (per-tier channel list) + AlertSeverityMap (per-threshold tier assignment) + EffectiveAlertChannels / EffectiveAlertSeverity accessors. Default*() helpers preserve the back-compat Email-only behaviour for operators who haven't touched their policies post-upgrade. Migration 000026 adds the JSONB columns idempotently.
2. NotificationService.SendThresholdAlertOnChannel — the new per-channel dispatch helper. Old SendThresholdAlert stays as an Email-only alias so non-policy callers (admin "send test alert" surfaces) keep working byte-for-byte.
3. NotificationService.HasThresholdNotificationOnChannel — per-(cert, threshold, channel) deduplication so a transient PagerDuty 5xx today does NOT suppress today's Slack alert and tomorrow's PagerDuty retry will still fire.
4. RenewalService.sendThresholdAlerts walks the resolved channel set per threshold tier, fans out to every configured channel, handles per-channel failures independently, defensively drops off-enum channels with an audit row trail, and records a per-channel audit event with metadata.channel + metadata.severity_tier.
5. service.ExpiryAlertMetrics — atomic counter table mirrored on the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5 (commit `0792271`). Three labels: channel × threshold × result (success / failure / deduped). Cardinality bound: 6 × 4 × 3 = 72 series for the standard 4-threshold matrix.
6. handler.MetricsHandler.SetExpiryAlerts wires the Prometheus exposer for certctl_expiry_alerts_total{channel,threshold,result}. Pre-sorted snapshot for byte-stable emission.
7. cmd/server/main.go threads ONE service.ExpiryAlertMetrics instance through both the recording side (notificationService.SetExpiryAlertMetrics) and the exposing side (metricsHandler.SetExpiryAlerts).

Dispatch flow (post-fix, per renewal-loop tick):
    cert ages past T-30 → daily renewal-loop fires → policy lookup → for each crossed threshold:
    - resolve severity tier (informational/warning/critical) via AlertSeverityMap
    - look up channel set in AlertChannels[tier]
    - for each channel: dedup → SendThresholdAlertOnChannel → notifierRegistry[channel] → audit row → Prometheus counter increment

Tests (internal/service/renewal_expiry_alerts_test.go):
    TestExpiryAlerts_DefaultMatrix_EmailOnly
    TestExpiryAlerts_PerTierFanOut
    TestExpiryAlerts_PerChannelDedup
    TestExpiryAlerts_OneChannelFails_OthersStillFire
    TestExpiryAlerts_OffEnumChannelDropped
    TestExpiryAlerts_MetricCounterIncrements
    TestExpiryAlerts_NilPolicy_FallsToDefault
    TestExpiryAlerts_OperatorOptOutOfTier

The PerTierFanOut test wires 6 mock notifiers, drives a cert at 0 days through the canonical 4 thresholds with the matrix {informational:[Slack], warning:[Slack,Email], critical:[PagerDuty,OpsGenie,Email]}, and asserts the exact recipient counts: Slack=3, Email=3, PagerDuty=1, OpsGenie=1, no Teams, no Webhook. The OneChannelFails test pins that PagerDuty returning a 503 does NOT skip Slack/Email at the same threshold.

Drive-by fix (internal/service/testutil_test.go): the existing mockNotifRepo.List ignored its filter and returned all rows, which let legacy tests pass on dedup-via-substring even though the postgres repo actually applied the filter. Updated the mock to honour CertificateID / Type / Status / Channel / MessageLike filters in the same shape as the postgres implementation (internal/repository/postgres/notification.go). All pre-existing service tests still pass — the legacy test suite happened to be robust to the mock filter doing nothing.

Documentation:
- docs/connectors.md Notifier section gains "Routing expiry alerts across channels" — operator-facing, JSON example, procurement playbook ("How do I make sure PagerDuty pages on the T-1 alert?"), debug recipe via SQL on audit_events + notification_events + Prometheus.
- docs/runbook-expiry-alerts.md — sysadmin-grade flowchart, per-policy channel-matrix configuration recipes, "did the on-call team get paged?" SQL queries, cardinality budget, V3-Pro forward path.
- cowork/WORKSPACE-ROADMAP.md gains "Multi-channel expiry alerts: per-owner routing" V3-Pro entry under Adapter hardening.

Out of scope (intentional, flagged in V3-Pro forward path):
- Per-owner / per-team / per-tenant channel routing (matrix is per-policy today, not per-owner).
- Calendar-aware suppression (no T-30 alerts on weekends).
- Escalation chains (T-1 unanswered for 30m → escalate).
- Per-channel rate limiting (downstream of I-005 retry+DLQ).

CHANGELOG.md is intentionally not hand-edited per CHANGELOG.md itself ("no longer maintains a hand-edited per-version changelog; per-release notes are auto-generated from commit messages between consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/domain/... ./internal/service/... ./internal/api/handler/... ./cmd/server/... clean. (./internal/repository/postgres/... vet failed on transitive testcontainers/docker module download — sandbox disk pressure, not a code issue; postgres-repo build succeeds and tests pass.)
- go test -short -count=1 ./internal/domain/... ./internal/service/... ./internal/api/handler/... green.
- go test -race -count=10 -run 'TestExpiryAlerts' ./internal/service/... green (per-channel dedup race-free).

Reference: cowork/infisical-deep-research-results.md Part 5 Rank 4. Acquisition prompt: cowork/rank-4-multichannel-expiry-alerts-prompt.md.	2026-05-03 22:12:32 +00:00
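The routing and per-channel dedup described in the commit above can be illustrated with a small Go sketch. It is not the RenewalService code: `Policy`, `Router`, `Dispatch`, `demoPolicy`, and `errUnavailable` are hypothetical names, and the 30/14/7/1-day thresholds with their tier assignment are assumptions chosen so the fan-out reproduces the counts quoted for the PerTierFanOut test.

```go
package main

import (
	"errors"
	"fmt"
)

type Tier string

// Policy is a hypothetical, simplified routing matrix.
type Policy struct {
	SeverityByThreshold map[int]Tier      // days before expiry -> severity tier
	ChannelsByTier      map[Tier][]string // severity tier -> channels to notify
}

// errUnavailable simulates a transient downstream failure such as a 503.
var errUnavailable = errors.New("channel unavailable")

type dedupKey struct {
	cert      string
	threshold int
	channel   string
}

// Router remembers successful sends per (cert, threshold, channel).
type Router struct {
	sent map[dedupKey]bool
	send func(channel, cert string, threshold int) error
}

func NewRouter(send func(channel, cert string, threshold int) error) *Router {
	return &Router{sent: map[dedupKey]bool{}, send: send}
}

// Dispatch alerts on every crossed threshold and returns deliveries per channel.
// A failed send is not recorded as sent, so a later call retries only that channel.
func (r *Router) Dispatch(p Policy, cert string, daysLeft int) map[string]int {
	delivered := map[string]int{}
	for threshold, tier := range p.SeverityByThreshold {
		if daysLeft > threshold {
			continue // threshold not crossed yet
		}
		for _, ch := range p.ChannelsByTier[tier] {
			key := dedupKey{cert, threshold, ch}
			if r.sent[key] {
				continue // already delivered on this channel for this threshold
			}
			if err := r.send(ch, cert, threshold); err != nil {
				continue // isolate the failure; other channels still fire
			}
			r.sent[key] = true
			delivered[ch]++
		}
	}
	return delivered
}

// demoPolicy uses assumed thresholds (30/14/7/1 days) and the matrix from the commit.
func demoPolicy() Policy {
	return Policy{
		SeverityByThreshold: map[int]Tier{30: "informational", 14: "warning", 7: "warning", 1: "critical"},
		ChannelsByTier: map[Tier][]string{
			"informational": {"Slack"},
			"warning":       {"Slack", "Email"},
			"critical":      {"PagerDuty", "OpsGenie", "Email"},
		},
	}
}

func main() {
	pagerDutyDown := true
	r := NewRouter(func(ch, cert string, t int) error {
		if ch == "PagerDuty" && pagerDutyDown {
			return errUnavailable
		}
		return nil
	})
	fmt.Println(r.Dispatch(demoPolicy(), "cert-1", 0)) // PagerDuty fails, the rest deliver
	pagerDutyDown = false
	fmt.Println(r.Dispatch(demoPolicy(), "cert-1", 0)) // only the PagerDuty retry fires
}
```

Keying the dedup on the channel as well as the certificate and threshold is what lets the failed PagerDuty send retry on the next tick without re-sending the Slack and Email alerts that already went out.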
shankar0123	0792271dc6	vault: add automatic token renewal at TTL/2 + Prometheus metric Closes Top-10 fix #5 of the 2026-05-03 issuer-coverage audit (see cowork/issuer-coverage-audit-2026-05-03/RESULTS.md). Pre-fix, the VaultPKI adapter authenticated with a static token and never called renew-self. Long-lived deploys hit token expiry; the first operator-visible signal was failed cert renewals on production targets. This commit: 1. Connector.Start(ctx) spawns a goroutine that calls POST /v1/auth/token/renew-self at TTL/2 cadence (computed from a one-shot lookup-self at startup). Honours ctx.Done() for graceful shutdown via a per-loop done channel + Stop(). 2. On `renewable: false` response (initial lookup OR any subsequent renewal), the loop emits a WARN, increments the not_renewable counter, and exits. The operator must rotate the token before Vault's Max TTL elapses. 3. New Prometheus counter certctl_vault_token_renewals_total with labels result={success,failure,not_renewable}. Registered alongside existing certctl_issuance_* counters in internal/api/handler/metrics.go. 4. ERROR-level logging on renewal failure with operator-actionable substring ("vault token renewal failed; rotate the token before TTL expires") so journalctl + grep find it. Loop keeps ticking after a failure — transient blips don't kill it. New optional issuer.Lifecycle interface: type Lifecycle interface { Start(ctx context.Context) error Stop() } Connectors that hold no background goroutines (almost all of them) do not implement this — IssuerRegistry.StartLifecycles / StopLifecycles feature-detect via type assertion. New lifecycle-bearing connectors plug in by implementing the interface; no further registry plumbing required. Wiring (cmd/server/main.go): - service.NewVaultRenewalMetrics() instance is shared between issuerRegistry.SetVaultRenewalMetrics (so Vault connectors built by Rebuild get a recorder) and metricsHandler.SetVaultRenewals (so the Prometheus exposer emits the new series). 
- issuerRegistry.StartLifecycles(ctx) is called after issuerService.BuildRegistry; defer issuerRegistry.StopLifecycles is paired so goroutines exit cleanly on signal. - IssuerConnectorAdapter.Underlying() exposes the wrapped issuer.Connector so registry-level machinery can reach the concrete connector behind the adapter without duplicating the wiring at every call site. Tests (internal/connector/issuer/vault/vault_renew_test.go): - TestVault_RenewLoop_TickAtHalfTTL — three ticks → three renewals, all "success". - TestVault_RenewLoop_StopsOnNotRenewable — second renewal returns renewable=false, loop exits, third tick fires no HTTP call. - TestVault_RenewLoop_FailureSurfacesViaMetric — first renewal 403 bumps "failure", second renewal succeeds → loop kept ticking. - TestVault_RenewLoop_CtxCancellation_StopsCleanly — Stop returns within 200ms after ctx cancel. - TestVault_RenewLoop_StartsNothingWhenNotRenewable — token already non-renewable at boot ⇒ no goroutine, "not_renewable" metric increments at startup so operators see it in Grafana. - TestVault_ComputeInterval — 4 cases pinning TTL/2 + minRenewInterval floor. - TestVault_RenewSelf_ParseFailure_NamesActionableInError — surfaced error contains "vault token renewal failed" + "rotate the token". Cadence is dynamic — every successful renewal re-derives TTL/2 from the renewed lease's lease_duration, so a short bootstrap token that gets renewed up to a longer Max TTL shifts to the longer cadence automatically (defends against degenerate fast ticking on a token whose Max TTL is far longer than its initial TTL). Documentation: - docs/connectors.md Vault PKI section gains "Token TTL + automatic renewal" subsection (operator-facing: cadence, metric, renewable=false rotation playbook). Out of scope (intentional, flagged in the audit follow-up): - AppRole / Kubernetes / AWS IAM auth methods (different renewal semantics). 
- Hot-reload of rotated token from disk (operator restarts today; future: GUI/MCP issuer-update path triggers Rebuild which Stops the old connector and Starts the new one).
- Auto-re-auth after token death (operator playbook owns it).

CHANGELOG.md is intentionally not hand-edited (per CHANGELOG.md itself: "no longer maintains a hand-edited per-version changelog; per-release notes are auto-generated from commit messages between consecutive tags").

Verified locally:
- gofmt clean.
- go vet ./internal/service/... ./internal/api/handler/... ./internal/connector/issuer/vault/... ./cmd/server/... clean.
- go test -short -count=1 ./internal/connector/issuer/vault/... ./internal/service/... ./internal/api/handler/... green.
- go test -race -count=10 -run 'TestVault_RenewLoop|TestVault_ComputeInterval' ./internal/connector/issuer/vault/... green.

Audit reference: cowork/issuer-coverage-audit-2026-05-03/RESULTS.md Top-10 fix #5.	2026-05-03 21:24:27 +00:00
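Two ideas from the Vault commit above lend themselves to a short Go sketch: the TTL/2 cadence with a minimum floor, and feature-detecting the optional Lifecycle interface by type assertion. The `Lifecycle` interface is quoted from the commit message; everything else (`computeInterval`'s signature, the 5-second `minRenewInterval` value, `implementsLifecycle`, `startLifecycles`, and the two connector types) is hypothetical and may differ from the real code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Lifecycle mirrors the optional interface quoted in the commit message.
type Lifecycle interface {
	Start(ctx context.Context) error
	Stop()
}

// minRenewInterval is an assumed floor; the real value is not given in the commit.
const minRenewInterval = 5 * time.Second

// computeInterval returns half the token TTL, clamped to a floor so a very
// short TTL cannot turn the renewal loop into a tight spin.
func computeInterval(ttl, floor time.Duration) time.Duration {
	if half := ttl / 2; half > floor {
		return half
	}
	return floor
}

// staticConnector holds no background goroutines and does not implement Lifecycle.
type staticConnector struct{}

// renewingConnector is a stand-in for a connector with a renewal loop.
type renewingConnector struct{ started, stopped bool }

func (c *renewingConnector) Start(ctx context.Context) error { c.started = true; return nil }
func (c *renewingConnector) Stop()                            { c.stopped = true }

// implementsLifecycle feature-detects the optional interface via type assertion.
func implementsLifecycle(c any) bool {
	_, ok := c.(Lifecycle)
	return ok
}

// startLifecycles starts only the connectors that opt in and reports how many it started.
func startLifecycles(ctx context.Context, connectors []any) (int, error) {
	started := 0
	for _, c := range connectors {
		lc, ok := c.(Lifecycle)
		if !ok {
			continue // plain connectors need no registry plumbing
		}
		if err := lc.Start(ctx); err != nil {
			return started, err
		}
		started++
	}
	return started, nil
}

func main() {
	fmt.Println(computeInterval(time.Hour, minRenewInterval))     // half of one hour
	fmt.Println(computeInterval(2*time.Second, minRenewInterval)) // clamped to the floor
	n, err := startLifecycles(context.Background(), []any{staticConnector{}, &renewingConnector{}})
	fmt.Println(n, err)
}
```

Re-running `computeInterval` on the lease duration returned by each successful renewal would give the dynamic cadence the commit describes.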
shankar0123	3b92048242	metrics: add per-issuer-type issuance counters, histogram, and failure classifier Closes the #4 acquisition-readiness blocker from the 2026-05-01 issuer coverage audit. Before this commit, certctl's Prometheus exposition had zero per-issuer-type signal — operators answering "is DigiCert slow?" or "is Sectigo failing more than ACME?" had to grep logs by issuer name. This commit adds three series labelled by issuer type: certctl_issuance_total{issuer_type, outcome} certctl_issuance_duration_seconds{issuer_type} (histogram) certctl_issuance_failures_total{issuer_type, error_class} The histogram covers 0.05–120 second buckets to span the local-issuer fast path and async-CA slow path (DigiCert/Sectigo/Entrust polling can take minutes). error_class is a closed enum of eight values (timeout, auth, rate_limited, validation, upstream_5xx, upstream_4xx, network, other) classified once in service.ClassifyError. Cardinality budget is ~276 new series, well within Prometheus's comfortable range. Implementation: - service.IssuanceMetrics is the thread-safe counter + histogram table. Three independent views (counters / failures / durations) exposed via SnapshotCounters / SnapshotFailures / SnapshotDurations. sync.RWMutex protects the map shape; per-key sync/atomic.Uint64 primitives keep the recording hot path lock-free under concurrent service-layer goroutines. - service.IssuanceCounterEntry / IssuanceFailureEntry / IssuanceDurationEntry / IssuanceMetricsSnapshotter live in service (not handler) to avoid an import cycle: handler already imports service for admin_est.go etc., so service can't import handler back. Handler's exposer takes the snapshotter via the service-defined interface. - service.ClassifyError pure function maps error → error_class. 
context.DeadlineExceeded / context.Canceled → timeout; net.OpError → network; substring matches against canonical AWS / DigiCert / Sectigo error shapes for auth / rate_limited / validation / upstream_5xx / upstream_4xx / network; unknown → other. Each branch has at least one representative test case in TestClassifyError.
- IssuerConnectorAdapter.SetMetrics wires per-adapter recording (issuerType + metrics). Existing 28+ test call sites of NewIssuerConnectorAdapter keep their one-arg signature; production wiring goes through SetMetrics post-construction.
- IssuerRegistry.SetIssuanceMetrics + Rebuild type-asserts to IssuerConnectorAdapter and calls SetMetrics with the issuer type string. nil-guarded — tests that hand-build adapters without metrics get no-op recording.
- IssuerConnectorAdapter.IssueCertificate / RenewCertificate wrap the underlying connector call with start := time.Now() and recordIssuance(start, err). Renewal is recorded into the same certctl_issuance_* series as initial issuance — operationally, renewal IS issuance from the connector's perspective (matches the audit prompt's guidance on series naming).
- handler/metrics.go GetPrometheusMetrics gains a new exposer block emitting all three series in stable label order with correct Prometheus format (_bucket / _sum / _count for the histogram, +Inf bucket appended). Sorted via sort.Slice for stable output. nil-guarded so deploys without the wire produce clean exposition.
- formatLE helper trims trailing zeros from histogram bucket labels via strconv.FormatFloat(le, 'f', -1, 64) so the `le` labels match Prometheus client conventions ("0.05", "30", "120", not "0.0500" etc.).
- cmd/server/main.go wires a single IssuanceMetrics instance into both the IssuerRegistry (recording) and the MetricsHandler (exposer) using DefaultIssuanceBucketBoundaries.

Tests:
- TestIssuanceMetrics_RecordAndSnapshot — happy-path counter + histogram + failure recording, BucketBoundaries returns a copy (not shared storage). 
- TestIssuanceMetrics_HistogramCumulative — pins the cumulative-buckets contract. 100ms observation lands in 0.1 bucket and every larger bucket; 750ms only in the 1.0 bucket. Off-by-one here would corrupt every quantile query downstream. - TestIssuanceMetrics_Concurrency — 100 goroutines × 1000 ops under the race detector. Asserts atomic counter integrity across contended writes. - TestClassifyError — 17 cases covering every branch of the closed enum plus the nil-error special case. Implementation chooses the existing hand-rolled fmt.Fprintf exposition pattern (no prometheus/client_golang dependency added) to stay consistent with the OCSP / deploy counter blocks already in the file. Out of scope (separate follow-ups): - Revocation metrics (certctl_revocation_*) — symmetric to issuance but the audit didn't ask; explicit follow-up commit. - Discovery / health-check duration histograms. - prometheus/client_golang migration. Verified locally: - gofmt clean - go vet ./... clean - staticcheck ./... clean - golangci-lint run --timeout 5m ./... → 0 issues - go test -short -count=1 ./internal/service/ green - go test -short -count=1 -race -run TestIssuanceMetrics ./internal/service/ green - go test -short -count=1 ./internal/api/handler/ green - go build ./... success Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md Top-10 fix #4 (Part 3, narrative section).	2026-05-02 00:39:25 +00:00
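The cumulative-bucket contract and the `formatLE` trimming described in the commit above are easy to get subtly wrong, so a minimal Go sketch may help. Only the `strconv.FormatFloat(le, 'f', -1, 64)` call is taken from the commit; the `Histogram` type, its methods, and the 0.1 / 0.5 / 1 boundaries are assumptions for illustration, and the `_sum` series is left out for brevity.

```go
package main

import (
	"fmt"
	"strconv"
	"sync/atomic"
)

// Histogram is a hypothetical fixed-boundary histogram with cumulative buckets.
type Histogram struct {
	bounds  []float64       // ascending upper bounds, in seconds
	buckets []atomic.Uint64 // buckets[i] counts observations <= bounds[i]
	count   atomic.Uint64   // total observations; doubles as the +Inf bucket
}

func NewHistogram(bounds []float64) *Histogram {
	return &Histogram{
		bounds:  append([]float64(nil), bounds...), // copy so callers cannot mutate our boundaries
		buckets: make([]atomic.Uint64, len(bounds)),
	}
}

// Observe increments every bucket whose upper bound is >= the observation,
// which is what makes the buckets cumulative.
func (h *Histogram) Observe(seconds float64) {
	for i, le := range h.bounds {
		if seconds <= le {
			h.buckets[i].Add(1)
		}
	}
	h.count.Add(1)
}

// formatLE renders a bucket bound without trailing zeros ("0.05", "30", "120").
func formatLE(le float64) string {
	return strconv.FormatFloat(le, 'f', -1, 64)
}

// Lines renders the _bucket and _count series in Prometheus text format.
func (h *Histogram) Lines(name string) []string {
	out := make([]string, 0, len(h.bounds)+2)
	for i, le := range h.bounds {
		out = append(out, fmt.Sprintf("%s_bucket{le=%q} %d", name, formatLE(le), h.buckets[i].Load()))
	}
	out = append(out, fmt.Sprintf("%s_bucket{le=\"+Inf\"} %d", name, h.count.Load()))
	out = append(out, fmt.Sprintf("%s_count %d", name, h.count.Load()))
	return out
}

func main() {
	h := NewHistogram([]float64{0.1, 0.5, 1})
	h.Observe(0.1)  // lands in the 0.1 bucket and every larger one
	h.Observe(0.75) // lands only in the 1 bucket (and +Inf)
	for _, line := range h.Lines("certctl_issuance_duration_seconds") {
		fmt.Println(line)
	}
}
```

Quantile queries such as `histogram_quantile` assume bucket counts never decrease as `le` grows, which is why an off-by-one in `Observe` would skew every downstream percentile.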
shankar0123	8637131f80	chore: gofmt fixes across deploy-hardening I new files Phase 13 verification surfaced gofmt-formatting drift in 6 files across the bundle's new code: - internal/api/handler/metrics.go (struct field alignment) - internal/connector/target/k8ssecret/validate_only_test.go (alignment) - internal/connector/target/nginx/nginx.go (alignment) - internal/connector/target/postfix/postfix.go (alignment) - internal/connector/target/ssh/validate_only_test.go (alignment) - internal/service/deploy_counters.go (alignment) Pure mechanical gofmt -w fixes; no behavior changes. CI's make verify gate (which runs `go fmt ./...`) didn't catch these because go fmt is more lenient than gofmt -l, but golangci-lint v2.11.4 + the explicit gofmt step in Phase 13 verification did. Phase 13 full-matrix verification all green: - gofmt -l: empty across all bundle-touched files - go vet ./internal/deploy/... ./internal/connector/target/... ./internal/service/ ./internal/api/handler/ ./cmd/agent/: clean - golangci-lint v2.11.4 (the version CI runs): 0 issues - go test -race -count=1 across deploy + nginx + apache + haproxy + agent + service: all green - INTEGRATION=1 go test -tags integration -run Deploy ./deploy/test/...: 4/4 e2e tests green Phase 14 next: release prep — Active Focus update, release notes, Reddit-beat draft, final tag handoff to operator.	2026-04-30 15:33:33 +00:00
shankar0123	135b271197	feat(metrics): per-target-type deploy counters wired into /metrics/prometheus Phase 10 of the deploy-hardening I master bundle. Mirrors the production-hardening-II Phase 8 OCSP-counter pattern. Per frozen decision 0.9, the metric naming convention is `certctl_deploy_<area>_total` with target_type + sub-label. internal/service/deploy_counters.go: - DeployCounters struct with sync.Map of per-target-type buckets (apache, nginx, etc.). Lock-free fast path via sync/atomic Uint64 counters; LoadOrStore on first tick. - 8 sub-counters per target-type bucket: - attemptsSuccess / attemptsFailure - validateFailures (PreCommit returned error) - reloadFailures (PostCommit returned error → rollback ran) - postVerifyFails (post-deploy TLS handshake failed) - rollbackRestored (rollback succeeded) - rollbackAlsoFail (operator-actionable escalation) - idempotentSkips (SHA-256 match → no-op deploy) - Snapshot returns []DeploySnapshot for the Prometheus exposer. internal/service/deploy_counters_test.go: - 5 tests: zero-state, per-target-type tick isolation, race-detector smoke under concurrent ticks, cross-target bucket isolation, snapshot-mutation-doesn't-affect-counter. internal/api/handler/metrics.go: - New DeployCounterSnapshotter interface (mirrors CounterSnapshotter for the OCSP counters but uses the per-target-type tuple shape). - New DeploySnapshotEntry struct copying the service-layer shape; avoids importing the service package directly so the handler stays dependency-light. - New SetDeployCounters setter on MetricsHandler (mirrors SetOCSPCounters wiring). 
- Prometheus exposer extended with 6 new metric blocks per frozen decision 0.9: - certctl_deploy_attempts_total{target_type, result} - certctl_deploy_validate_failures_total{target_type} - certctl_deploy_reload_failures_total{target_type} - certctl_deploy_post_verify_failures_total{target_type} - certctl_deploy_rollback_total{target_type, outcome} - certctl_deploy_idempotent_skip_total{target_type} - Output sorted by target_type for stable diffs across requests. The agent-side wire-up (cmd/agent/main.go ticking counters in the DeployCertificate dispatch site) is intentionally deferred to a follow-up commit — Phase 10's load-bearing change is the infrastructure; per-connector tick wiring is a mechanical follow-on. Build + go vet clean. go test -count=1 green for service + handler packages. Phase 11 next: cross-cutting integration tests at deploy/test/.	2026-04-30 15:25:38 +00:00
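The sync.Map plus atomic-counter layout described for DeployCounters can be sketched as follows. This is a reduced illustration, not the file's contents: it keeps three of the eight sub-counters, and the method names (`IncSuccess`, `IncFailure`, `IncIdempotentSkip`) and the `Snapshot` field layout are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// bucket holds the counters for one target type (a subset, for brevity).
type bucket struct {
	attemptsSuccess atomic.Uint64
	attemptsFailure atomic.Uint64
	idempotentSkips atomic.Uint64
}

// DeployCounters maps target type -> *bucket. The zero value is ready to use.
type DeployCounters struct {
	byTarget sync.Map
}

// bucketFor returns the bucket for a target type, creating it on first use.
// LoadOrStore guarantees concurrent first ticks agree on a single bucket.
func (c *DeployCounters) bucketFor(targetType string) *bucket {
	if b, ok := c.byTarget.Load(targetType); ok {
		return b.(*bucket)
	}
	b, _ := c.byTarget.LoadOrStore(targetType, &bucket{})
	return b.(*bucket)
}

func (c *DeployCounters) IncSuccess(t string)        { c.bucketFor(t).attemptsSuccess.Add(1) }
func (c *DeployCounters) IncFailure(t string)        { c.bucketFor(t).attemptsFailure.Add(1) }
func (c *DeployCounters) IncIdempotentSkip(t string) { c.bucketFor(t).idempotentSkips.Add(1) }

// Snapshot is a plain-value copy of one bucket; mutating it cannot touch the counters.
type Snapshot struct {
	TargetType              string
	Success, Failure, Skips uint64
}

// Snapshot returns one entry per target type, sorted for stable exposition output.
func (c *DeployCounters) Snapshot() []Snapshot {
	var out []Snapshot
	c.byTarget.Range(func(k, v any) bool {
		b := v.(*bucket)
		out = append(out, Snapshot{
			TargetType: k.(string),
			Success:    b.attemptsSuccess.Load(),
			Failure:    b.attemptsFailure.Load(),
			Skips:      b.idempotentSkips.Load(),
		})
		return true
	})
	sort.Slice(out, func(i, j int) bool { return out[i].TargetType < out[j].TargetType })
	return out
}

func main() {
	var c DeployCounters
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.IncSuccess("nginx") // concurrent ticks on the same bucket
		}()
	}
	wg.Wait()
	c.IncFailure("apache")
	for _, s := range c.Snapshot() {
		fmt.Printf("%s success=%d failure=%d skips=%d\n", s.TargetType, s.Success, s.Failure, s.Skips)
	}
}
```

The plain `Load` before `LoadOrStore` avoids allocating a throwaway bucket on every tick once the target type is known.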
shankar0123	2d83342bbe	feat(metrics): extend /metrics/prometheus with per-area OCSP counters (Phase 8) Production hardening II Phase 8 — surface the OCSP per-event counters shipped in Phase 1+2 through the existing /api/v1/metrics/prometheus endpoint. Operators now alert on certctl_ocsp_counter_total {label="rate_limited"} (Phase 3 trip), {label="nonce_malformed"} (Phase 1 reject), {label="signing_failed"} (issuer connector fails), etc. NEW interface CounterSnapshotter (handler/metrics.go) — minimum surface the Prometheus exposer needs from any per-area counter table: just Snapshot() map[string]uint64. service.OCSPCounters.Snapshot (Phase 1) satisfies it; future per-area counters (CRL, cert-export, EST per-profile, SCEP per-profile, Intune per-profile) plug in the same way as separate SetXxxCounters setters. Naming convention per frozen decision 0.10: certctl_<area>_counter_total{label="<event>"} <value> This commit ships only the OCSP block. The remaining areas (CRL, cert-export, EST, SCEP, Intune) plug in via the same SetXxxCounters pattern in follow-up commits — the wire-up cost per area is one new field + one setter + one block of fmt.Fprintf lines. The bundle's S-1 docs-count guard means we don't claim a specific total in prose; operators run `curl /api/v1/metrics/prometheus \| grep certctl_` to enumerate. Wired in cmd/server/main.go: a single shared *service.OCSPCounters instance is created once and passed to BOTH the ocspResponseCacheService (so the cache hot path ticks counters) AND metricsHandler.SetOCSPCounters (so the Prometheus exposer reads them). Existing dashboard metrics (certctl_certificate_total, certctl_agent_total, etc.) remain unchanged at the same line offsets — back-compat preserved. Pre-commit verification: go build ./... clean; go test -short -count=1 green for handler/ + service/. 
The existing TestGetPrometheusMetrics_Success tests still pass (the new counter block is additive at the END of the response body, after the existing dashboard metrics + uptime line).	2026-04-30 05:15:05 +00:00
shankar0123	675b87ba63	I-005: notification retry loop + dead-letter queue Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention. Schema (migration 000016, idempotent): - retry_count INTEGER NOT NULL DEFAULT 0 - next_retry_at TIMESTAMPTZ - last_error TEXT - idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL Dead rows clear next_retry_at so the index stops matching them. Service contract: - NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts). - Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead. - Non-terminal failures record via RecordFailedAttempt. - Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N"). - Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel. - RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue. Scheduler: - New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m). - sync/atomic.Bool idempotency guard. - sync.WaitGroup shutdown drain via WaitForCompletion. StatsService: - SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature. - DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired (reports zero on systems without a notification repository). 
- CountByStatus error is non-fatal (dashboard summary is best-effort for this field). - Prometheus certctl_notification_dead_total counter emitted from the same snapshot. Handler: - New POST /api/v1/notifications/{id}/requeue endpoint. - dead status surfaces to MCP + CLI. Frontend: - NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch. - Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip. - Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client — pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181. Closes I-005.	2026-04-19 15:17:27 +00:00
shankar0123	4f90be9311	feat: add network certificate discovery (M21) and Prometheus metrics (M22) M21 adds server-side active TLS scanning of CIDR ranges with concurrent probing, sentinel agent pattern for pipeline reuse, and full CRUD API for scan targets. M22 adds Prometheus exposition format endpoint alongside existing JSON metrics. Comprehensive documentation audit updates all docs to reflect 91 endpoints, 19 tables, 6 scheduler loops, and 900+ tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-24 23:37:47 -04:00
shankar0123	ee75f149ae	feat: M14 — Observability (dashboard charts, agent fleet, stats API, metrics, structured logging, rollback) Backend: StatsService with 5 aggregation methods, JSON metrics endpoint, slog-based structured logging middleware. Stats API: dashboard summary, certificates-by-status, expiration timeline, job trends, issuance rate. 23 new backend tests. Frontend: Recharts-powered dashboard with 4 charts (status pie, expiration heatmap, job trends line, issuance bar), agent fleet overview page with OS/arch grouping and version breakdown, deployment rollback buttons on version history. 7 new frontend tests. 78 API endpoints, 744+ total tests (658 Go + 86 Vitest). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-22 19:46:13 -04:00

15 Commits