mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 22:01:36 +00:00
436382450e07c8f4ec9e58313aedd4c3c0c0ce8b
175 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
67593604f1 |
docs: production hardening II — DR runbook + crl-ocsp updates + features.md env vars (Phase 10)
Production hardening II Phase 10 — operator-facing documentation
that codifies the new V2 surfaces shipped in Phases 1-8.
NEW docs/disaster-recovery.md (8 sections, ~280 lines):
- Overview of automatic fail-safes already in code
- CRL cache recovery (delete row + scheduler regenerates)
- OCSP responder cert recovery (delete row + ensureOCSPResponder
re-bootstraps on next request)
- OCSP response cache recovery (delete row + read-through fallback)
- CA private-key rotation procedure (9-step playbook)
- Postgres restore (with explicit list of operator-managed
artifacts NOT in DB)
- Trust-bundle reload semantics (SCEP / EST / Intune SIGHUP-
equivalent fail-safe behavior)
- DR checklist (printable; pin near on-call)
This is the SOC 2 / PCI procurement-team deliverable. Auditors and
on-call operators get a single document that tells them what to do
when state corrupts, when keys need rotation, when Postgres needs
restoring. Nothing in the runbook requires new code — it codifies
behaviors already in the codebase.
UPDATED docs/crl-ocsp.md:
- New "Production hardening II additions" section: OCSP nonce
extension, OCSP pre-signed cache (with the load-bearing security
wire called out), per-source-IP OCSP rate limit, per-actor cert-
export rate limit, CRL HTTP caching headers (RFC 7232), CRL
DistributionPoints auto-injection, cert-export typed audit
codes, per-area Prometheus metrics with operator alert
recommendations.
- Pruned the V3-Pro deferral list to remove items that this
bundle SHIPPED (OCSP rate-limiting moved out; remaining V3-Pro:
delta CRLs, OCSP stapling, OCSP request signature verification,
HA / multi-region replication, IDP extension for sharded CRLs).
UPDATED docs/features.md:
- CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN row (default 1000)
- CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR row (default 50)
G-3 docs-drift CI guard reproduced clean: every new CERTCTL_* env
var documented in features.md AND consumed in Go source. S-1 stale-
counts guard clean (no literal-number prose for current-state
counts in README/docs).
|
||
|
|
9a0430bd87 |
docs(est): EST RFC 7030 operator guide + WiFi/802.1X recipe + IoT bootstrap recipe + FreeRADIUS integration + architecture + README
EST RFC 7030 hardening master bundle Phase 12 — comprehensive operator- facing documentation for the Phases 1-11 backend work that shipped on 2026-04-29. NEW docs/est.md (19 sections, ~810 lines): Concepts (host vs user enrollment, profile-driven policy, multi-profile dispatch); 5-minute single-profile Quick start with curl + openssl recipes; Multi-profile dispatch (CERTCTL_EST_PROFILES=corp,iot,wifi setup with PathID rules enforced at boot); Authentication modes (mTLS / Basic / both / empty with cross-check semantics); RFC 9266 channel binding (failure-mode HTTP mapping table — ErrChannelBindingMissing/Mismatch/NotTLS13 → 400/409/426); WiFi/802.1X recipe with end-to-end FreeRADIUS integration (EAP-TLS supplicant config, mods-available/eap tls-common block, CRL distribution endpoint cross-ref, troubleshooting playbook); IoT bootstrap recipe (factory provisioning, first boot, steady-state renewal, compromise/decommission via bulk-revoke, recommended cert lifetimes per master prompt §7.7); serverkeygen for resource-constrained devices (CMS EnvelopedData wrap, RSA-only at this revision, zeroize discipline, Phase-1 cross-check refusing _SERVERKEYGEN_ENABLED=true with empty _PROFILE_ID); HSM-backed CA signing for EST cross-ref (signer interface seam); Operator GUI tabbed surface tour (/est: Profiles / Recent Activity / Trust Bundle); CLI + 6 MCP tools; Renewal device-driven model (RFC 7030 §4.2.2 mandate, renewal-trigger ratios for laptops/IoT, operator-push via webhook); Troubleshooting matrix (one row per typed audit-action constant in internal/service/est_audit_actions.go); TLS 1.2 reverse-proxy runbook cross-ref (channel-binding caveat explained); Threat model (load-bearing properties: trust-anchor reload fail-safety, per-profile counter isolation, mTLS cross-profile bleed defense, source-IP limiter process-locality, server-keygen heap residency, HTTP Basic in-process-only, legacy-anonymous-default back-compat carve-out); V3-Pro deferrals; Appendix A (libest sidecar reproducer + 5 integration test names); Appendix B (Cisco IOS 15.x + 16.x + Apple MDM + OpenWRT + libest <v3.0 wire-format quirks tested in internal/api/handler/cisco_ios_quirks_test.go). UPDATED docs/architecture.md: new "EST Server (RFC 7030) — Production Deployment" section under the existing baseline EST section. Mermaid diagram of multi-profile dispatch + mTLS sibling route + per-profile gate ordering + audit + GUI + SIGHUP-equivalent reload. Existing authentication paragraph updated with forward-ref to the hardening section. Audit paragraph updated to enumerate the 13 typed est_* action codes operators grep on. Trust-anchor reload semantics + libest interop tested in CI both called out. UPDATED README.md::Enrollment Protocols: replaced the one-line EST row with the full production-grade surface description matching the SCEP analog. Cross-references docs/est.md. UPDATED docs/connectors.md::EST/SCEP Integration: extended the EST-or-SCEP shared paragraph to point at the per-profile env-var form for both protocols + linked the new architecture.md section. NEW "Multi-profile EST dispatch + production hardening" subsection mirrors the SCEP equivalent: 9-row env-var table, cross-ref to docs/est.md. G-3 docs-drift CI guard reproduced locally clean — every CERTCTL_EST_* mention in docs maps back to internal/config/config.go, and every defined env var is documented. The `<NAME>` placeholder convention matches the SCEP idiom so the docs grep doesn't extract per-deploy profile names as phantom env vars. No new env vars introduced — this is a pure docs commit. |
||
|
|
6cedaf4231 |
fix(docs/est): drop CERTCTL_EST_* wildcard prose to satisfy G-3 docs-drift guard
The previous commit (
|
||
|
|
ec3258ea0b |
docs(est): document CERTCTL_EST_PROFILES + per-profile env-var family (G-3 fix)
The Phase 1 commit (
|
||
|
|
444942eab8 |
fix(scep-intune): close 11 audit gaps from 2026-04-29 pre-tag review
Closes the eleven gaps identified in the pre-v2.1.0 audit of the SCEP
RFC 8894 + Intune master bundle (cowork/scep-bundle-gap-closure-prompt.md).
Constitutional rule from cowork/CLAUDE.md::Operating Rules — 'Always
take the complete path, not the easy path' — drove this closure: each
gap was a load-bearing wire that crossed multiple layers (config →
validator → service wire-up → tests → docs) and shipping the bundle
without them would have produced lying-field footguns where operator-
visible config options stored values without affecting behavior.
WHAT LANDS:
Phase A — Clock-skew tolerance (master prompt §15 hazard closure)
internal/scep/intune/challenge.go: ValidateChallenge migrated from
positional args to ValidateOptions{} struct; new ClockSkewTolerance
field with default 0 (strict). 24 call sites updated mechanically.
Asymmetric application: now+tolerance >= iat AND now-tolerance < exp.
internal/config/config.go: SCEPIntuneProfileConfig.ClockSkewTolerance
default 60s + Validate() refusal when >= ChallengeValidity.
cmd/server/main.go: SetIntuneIntegration signature extended;
per-profile env-var loader honors CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CLOCK_SKEW_TOLERANCE.
internal/service/scep.go: intuneClockSkew field + IntuneStatsSnapshot
surfaces clock_skew_tolerance_ns. web/src/api/types.ts mirrors.
4 new tests in challenge_test.go covering accept-within-tolerance,
reject-beyond-tolerance, accept-expired-within-tolerance,
negative-treated-as-zero defensive normalization.
docs/scep-intune.md updated with the new env var + time-bounds rule.
Phase B — unknown-version-rejected golden test
internal/scep/intune/golden_helper_test.go: goldenUnknownVersionPayload
helper + signGoldenChallengeAny generic signer.
challenge_golden_test.go: TestGoldenChallenge_UnknownVersionRejected
uses an in-process ECDSA fixture (the on-disk PEM was generated with
a Go-stdlib version that produces different ecdsa.GenerateKey bytes
from the current call). TestRegenerateGoldenFixtures emits the new
unknown_version fixture file too.
Phase C — Two named Intune e2e tests
internal/api/handler/scep_intune_e2e_test.go:
TestSCEPIntuneEnrollment_RateLimited_E2E (cap=2 + 3 attempts; 3rd
returns FAILURE+badRequest with rate_limited counter ticked)
TestSCEPIntuneEnrollment_TrustAnchorSIGHUPReload_E2E (rotate
on-disk PEM + holder.Reload(); old-key challenge fails with
badMessageCheck; signature_invalid counter ticked)
intuneE2EFixture struct extended with trustHolder + trustPath fields
so tests can rotate.
Phase D — Four new ChromeOS hermetic tests (10 total now)
internal/api/handler/scep_chromeos_test.go:
_RAKeyMismatch — PKIMessage encrypted to wrong RA cert; handler
rejects without reaching service.
_3DESBackwardCompat — RFC 8894 §3.5.2 legacy fallback verified.
_RSACSR + _ECDSACSR — explicit matrix-pair pinning.
buildTestECDSACSR helper for ECDSA P-256 CSR construction;
tripleDESCBCEncrypt mirrors aesCBCEncrypt for 3DES-CBC;
assertChromeOSPositiveCertRep shared assertion.
Phase E — Per-profile counter isolation test
internal/api/handler/scep_profile_counter_isolation_test.go:
TestSCEPHandler_PerProfileIntuneCountersIsolated wires two
SCEPService instances + drives distinct PKIMessages + asserts
counter isolation. Guards against a future cmd/server/main.go
refactor that shares a *intuneCounterTab across profiles.
buildPerProfileIntuneFixture parameterized helper.
Phase F — Server-boot regression tests
cmd/server/preflight_scep_intune_test.go: 3 named tests covering
disabled-backward-compat, broken-config-with-PathID, expired-cert
refusal. preflightSCEPIntuneTrustAnchor signature extended with
pathID arg so error messages carry PathID= for operator log-grep.
Phase G — docs/connectors.md
Four new subsections under §EST/SCEP Integration: multi-profile
dispatch + mTLS sibling route + Intune Connector dispatcher + SCEP
probe in network scanner. Each has a one-paragraph operator
explanation + an env-var or endpoint table.
Phase H — Coverage uplift
internal/service/scep_probe_persist_test.go: 5 unit tests on
persistProbeResult (nil-safe + nil-repo-safe + repo-error swallow +
nil-logger guard) + ListRecentSCEPProbes (empty-slice-not-nil + repo
pass-through) + describeCertAlgorithm (RSA/ECDSA/QF1008-nil-curve
defensive branch/Ed25519/DSA/empty). CI gates (service ≥70, handler
≥75) PASS at 70.9% / 79.3%.
Phase I — deploy/test integration variant
deploy/test/scep_intune_e2e_test.go (//go:build integration):
TestSCEPIntuneEnrollment_Integration + _RateLimited_Integration
against the live docker-compose certctl container. Skip-when-
stack-missing semantics so sandbox + CI both work.
deploy/docker-compose.test.yml: new e2eintune SCEP profile env
vars + bind-mount of deploy/test/fixtures/.
deploy/test/fixtures/README.md: documents the deterministic trust
anchor regeneration recipe.
VERIFICATION (sandbox):
gofmt -d — clean for all changed files
staticcheck — clean for intune + handler + config + service +
cmd/server packages
go vet — clean for the same packages
go test -short — green for intune (95.3% cov), service (70.9%),
handler (79.3%), config (94.0%), cmd/server (boot
path; my preflight tests cover the directly-
testable function), pkcs7 (80.5% informational)
DEFERRED (per closure prompt §7 out-of-scope):
- V3-Pro Conditional Access gating + Microsoft Graph integration
- Standalone certctl-scan CLI binary
- OCSP rate-limiting, OCSP stapling, delta CRLs
Spec preserved at cowork/scep-bundle-gap-closure-prompt.md;
journal at cowork/scep-rfc8894-intune/progress.md (audit-closure
section appended).
|
||
|
|
5b67ff3944 |
refactor(scep-gui): rebrand SCEP admin surface to per-profile tabbed interface (Profiles + Intune + Recent Activity)
Phase 9 follow-up to the SCEP RFC 8894 + Intune master bundle. The
Phase 9.4 GUI shipped 'SCEP Intune Monitoring' at /scep/intune, which
made the per-profile observability surface look Intune-only — operators
running EJBCA + Jamf would never click that nav link expecting per-
profile RA cert + mTLS observability. The page is per-profile keyed
under the hood; this commit rebrands + restructures so the surface
matches what operators actually need.
Spec: cowork/scep-gui-restructure-prompt.md.
User-visible change:
- Nav link renamed: 'SCEP Intune' → 'SCEP Admin'.
- Route: /scep is the new canonical path; /scep/intune kept as a
backward-compat alias that lands directly on the Intune tab.
- Page header: 'SCEP Administration'.
- Three tabs:
* Profiles (default) — per-profile lean cards with RA cert
expiry countdown, mTLS sibling-route status badge, Intune
enabled/disabled badge, challenge-password-set indicator.
'View Intune details →' link on Intune-enabled cards
deep-links into the Intune tab.
* Intune Monitoring — the existing Phase 9.4 deep-dive
(per-status counters, trust anchor expiry, recent failures
table, reload-trust button + confirmation modal).
* Recent Activity — full SCEP audit log filter merging all
four action codes (scep_pkcsreq + scep_renewalreq +
scep_pkcsreq_intune + scep_renewalreq_intune); chip filters
for All / Initial / Renewal / Intune / Static.
Backend:
* internal/service/scep.go — new SCEPProfileStatsSnapshot type +
IntuneSection sub-block + ProfileStats(now) accessor. Adds
raCertSubject/raCertNotBefore/raCertNotAfter + mtlsEnabled +
mtlsTrustBundlePath fields with SetRACert + SetMTLSConfig setters.
Existing IntuneStatsSnapshot + IntuneStats(now) preserved
UNCHANGED for /admin/scep/intune/stats backward compat (the
JSON shape stays byte-stable for external consumers — the
aliasing approach the prompt initially suggested doesn't work
because the new shape nests Intune while the old one is flat).
ChallengePasswordSet is derived from challengePassword != ''
(the secret value itself is never surfaced).
* internal/api/handler/admin_scep_intune.go — new Profiles handler
method on AdminSCEPIntuneHandler with the same M-008 admin gate.
AdminSCEPIntuneServiceImpl extended (in place; same
map[string]*service.SCEPService) to satisfy the new
AdminSCEPProfileService interface. Single handler file gets the
third method so the M-008 pin entry count stays steady (no new
file, no new triplet of admin-gate test files — just three new
Profiles tests inside the existing test file).
* internal/api/router/router.go — one new route
'GET /api/v1/admin/scep/profiles' registered to
reg.AdminSCEPIntune.Profiles. HandlerRegistry unchanged.
* api/openapi.yaml — new operation 'listSCEPProfiles' documenting
the request body / response shape / error mapping. Existing
Intune entries unchanged.
* cmd/server/main.go — per-profile loop now calls
scepService.SetMTLSConfig(profile.MTLSEnabled,
profile.MTLSClientCATrustBundlePath) right after SetPathID, and
scepService.SetRACert(raCert) right after loadSCEPRAPair returns
the leaf cert. Both setters are nil-safe.
* internal/api/handler/m008_admin_gate_test.go — extended the
existing admin_scep_intune.go entry's justification to mention
the third endpoint. No new map entry needed (file already
listed).
Backend tests (8 new):
* TestAdminSCEPProfiles_NonAdmin_Returns403
* TestAdminSCEPProfiles_AdminExplicitFalse_Returns403
* TestAdminSCEPProfiles_AdminPermitted_ForwardsActor — also pins
that Intune-enabled profiles emit an 'intune' sub-block while
Intune-disabled profiles OMIT it.
* TestAdminSCEPProfiles_RejectsNonGetMethod
* TestAdminSCEPProfiles_PropagatesServiceError
* TestAdminSCEPProfilesServiceImpl_NilMapReturnsEmpty
* (existing 16 Phase 9 admin tests still pass — backward-compat
preserved)
Frontend:
* web/src/api/types.ts — new SCEPProfileStatsSnapshot +
IntuneSection + SCEPProfilesResponse types. Existing
IntuneStatsSnapshot et al unchanged.
* web/src/api/client.ts — new getAdminSCEPProfiles helper.
* web/src/pages/SCEPAdminPage.tsx — full rewrite as the tabbed
surface. Reuses the existing ConfirmReloadModal and Intune
deep-dive card components verbatim; adds ProfileSummaryCard
(lean card for the Profiles tab) and ActivityTab. URL state
sync via useSearchParams so deep links survive reloads + browser
back/forward. The legacy /scep/intune route alias defaults the
activeTab to 'intune' on mount.
* web/src/main.tsx — new <Route path='scep' /> + preserved
<Route path='scep/intune' /> alias. Both render SCEPAdminPage.
* web/src/components/Layout.tsx — nav link rebranded:
label 'SCEP Intune' → 'SCEP Admin', to '/scep/intune' → '/scep'.
Frontend tests (20 — full rebuild):
* Admin gate (non-admin sees gated banner + zero admin API calls)
* Profiles tab default + Intune tab tabswitch + ?tab=intune deep
link + legacy /scep/intune alias all land on Intune
* Profiles tab status badges (Intune + mTLS + challenge-set)
reflect each profile's flags
* RA cert expiry tone bands (good ≥30d / warn 7-30d / bad <7d /
EXPIRED) verified across three fixture profiles
* 'View Intune details →' only renders for Intune-enabled
profiles AND switches tabs on click
* Empty-state banner when no profiles configured
* Intune tab counters render with the existing Phase 9 deep-dive
shape; reload modal Open/Confirm/Cancel/Error paths all pinned
* Recent Activity tab merges all four SCEP audit actions across
four parallel useQuery calls; filter chips
(all/initial/renewal/intune/static) narrow correctly
* Error path surfaces ErrorState on the active tab
Docs:
* docs/scep-intune.md — Operational monitoring section heading
expanded to '(SCEP Administration → Intune Monitoring tab)'.
Page-surface description rewritten for the tabbed shape;
admin-endpoints list extended with the new /admin/scep/profiles
entry.
* docs/architecture.md — Microsoft Intune Connector trust anchor
subsection updated to reference the Intune Monitoring tab inside
the SCEP Administration page + lists all three admin endpoints.
* docs/legacy-est-scep.md — forward-ref expanded with a parallel
sentence for the per-profile observability surface (independent
of Intune).
* README.md — Enrollment Protocols bullet for Intune updated to
'admin GUI SCEP Administration page at /scep' with the three
tabs called out.
Verification:
* gofmt clean on touched files
* go vet ./... clean
* staticcheck on intune+service+handler+router+cmd-server clean
* go test -short across intune+service+handler+router+cmd-server:
all green (existing Phase 9 tests + new Profiles tests)
* Frontend tsc --noEmit clean
* Vitest: 20/20 SCEPAdminPage tests + 3/3 sibling AuditPage tests
pass
* G-3 docs-drift CI guard reproduced locally: clean (no new env
vars; existing CERTCTL_SCEP_ allowlist prefix covers everything)
* M-009 hard-zero useMutation guard reproduced locally: clean
(the existing reload mutation already used useTrackedMutation
from the Phase 9 follow-up commit
|
||
|
|
33c79482db |
docs(scep-intune): deployment guide + troubleshooting + Microsoft support statement
Phase 11 of the SCEP RFC 8894 + Intune master bundle.
Phase 11.1 — docs/scep-intune.md (new, ~340 lines):
* TL;DR — drop-in NDES replacement framing; what an operator gets
over NDES (per-profile endpoints, audit-log forensics, SIGHUP
reload, GUI monitoring, per-device rate limit).
* Architecture diagram — Intune cloud → Connector → certctl SCEP
→ issuer connector. Explicit 'certctl replaces NDES, NOT the
Connector' framing; nine-gate dispatcher walk (shape pre-check,
JWS sig, version dispatch, time bounds, audience pin, CSR binding,
replay, per-device rate limit, optional compliance).
* Migration playbook (NDES + EJBCA / NDES + ADCS) — 9-step run-book:
install alongside, configure per-profile endpoint, extract trust
anchor, configure CONNECTOR_CERT_PATH + AUDIENCE, configure
issuer connector, migrate one profile, verify enrollment, roll
out fleet, decommission NDES.
* Intune SCEP profile field mapping table — every Intune admin
center field mapped to certctl's behavior (cert type, subject
name format, SAN, validity, key storage provider, key usage,
EKU, hash algorithm, SCEP server URL).
* Trust anchor extraction recipe — step-by-step certlm.msc export
of the 'CN=Microsoft Intune Certificate Connector' cert, PEM
rename, env-var configuration, HA Connector concatenation, SIGHUP
rotation flow.
* Troubleshooting matrix — 10 failure modes mapped to root causes
and operator actions: signature_invalid (trust anchor stale),
claim_mismatch (Intune profile SAN config), expired (clock skew /
Connector cert past NotAfter), not_yet_valid (reverse skew),
wrong_audience (URL mismatch), replay (retry-window collision),
rate_limited (limiter doing its job), unknown_version (Microsoft
shipped new format), malformed (proxy mangling body),
compliance_failed (V3-Pro hook returned non-compliant).
* Operational monitoring — admin GUI surface description, expiry
badge tone bands (≥30d green / 7-30d amber / <7d red / EXPIRED),
per-status counter polling cadence, audit log filter, recommended
Prometheus alert thresholds.
* Limitations — explicit V3-Pro deferrals: native Microsoft Graph
integration, Conditional Access compliance gating, per-tenant
trust anchors (MSP scoping), OCSP stapling at SCEP-response time,
auto-discovery of Connector signing cert.
* Microsoft support statement — three Microsoft Learn URLs (verified
live with HTTP 200): Connector overview, SCEP profile setup,
Connector install validation. Microsoft documents the Connector
as RFC-8894-compliant and supports its use against any RFC 8894
SCEP server.
Phase 11.2 — Cross-references:
* docs/legacy-est-scep.md — the previous forward-ref pointed at
'the Phase 11 doc this bundle ships'; updated to a richer pointer
that lists what scep-intune.md covers (architecture, migration,
profile mapping, extraction, troubleshooting, monitoring,
limitations, Microsoft support).
* README.md — new bullet under Enrollment Protocols table:
'Microsoft Intune SCEP fleet (drop-in NDES replacement)' with
the per-profile dispatcher feature list + link to scep-intune.md.
Procurement teams scanning the README see the Intune story
alongside ChromeOS / Jamf in the same table row.
* docs/architecture.md — new 'Microsoft Intune Connector trust
anchor (per-profile, opt-in)' subsection in the Security Model
section. ASCII diagram showing the dispatcher walk; calls out
the SIGHUP reload + admin-gated GUI surface; forward-link to
scep-intune.md.
Verification:
* All linked anchors inside scep-intune.md resolve to existing
headings: #limitations, #microsoft-support-statement,
#operational-monitoring, #trust-anchor-extraction.
* All linked doc paths resolve: legacy-est-scep.md, architecture.md,
features.md, tls.md.
* All three Microsoft Learn URLs return HTTP 200 (verified via curl).
* G-3 docs-drift CI guard reproduced locally and clean — the
migration playbook uses the <NAME> placeholder convention
consistently (matching features.md style) so the docs scanner
doesn't extract literal env-var names that aren't in config.go.
* Backend tests across intune+handler+service+router still green.
Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 11
cowork/scep-rfc8894-intune/progress.md
|
||
|
|
2263e2886b |
feat(scep-intune): per-profile dispatcher + SIGHUP reload + per-device rate limit + compliance hook seam
Phase 8 of the SCEP RFC 8894 + Intune master bundle. Wires the internal/scep/intune validator from Phase 7 into the SCEPService dispatch path, with a SIGHUP-reloadable trust anchor holder, a per-(Subject, Issuer) sliding-window rate limiter, and a nil-default ComplianceCheck seam for V3-Pro. Operator-visible surface (per-profile, all default to off): CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_ENABLED=true CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CONNECTOR_CERT_PATH=/etc/certctl/intune.pem CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_AUDIENCE=https://certctl.example.com/scep/corp CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CHALLENGE_VALIDITY=60m CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_PER_DEVICE_RATE_LIMIT_24H=3 Per-profile dispatch (Phase 8.8): an operator running corp-laptops through Intune AND IoT devices through static challenge configures INTUNE_ENABLED=true on the corp profile only — the IoT profile's PKCSReq path skips the dispatcher entirely. Mirrors the per-profile shape established by Phase 1.5. Wire-in surfaces: * config.go (Phase 8.1): SCEPProfileConfig.Intune sub-config of type SCEPIntuneProfileConfig (Enabled/ConnectorCertPath/Audience/ ChallengeValidity/PerDeviceRateLimit24h). Loaded from the indexed CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_* env-var family. Per-profile Validate gate refuses INTUNE_ENABLED=true with empty ConnectorCertPath OR negative PerDeviceRateLimit24h. * cmd/server/main.go (Phase 8.2 + wire-in): preflightSCEPIntuneTrustAnchor helper mirrors preflightSCEPRACertKey/preflightSCEPMTLSTrustBundle shape — fail-loud at boot when the trust anchor file is missing / unreadable / empty / contains an expired cert. The per-profile loop builds the holder + replay cache + rate limiter, calls SetIntuneIntegration on the SCEPService, and starts the SIGHUP watcher. A deferred sweep stops every watcher at shutdown. * internal/scep/intune/trust_anchor_holder.go (Phase 8.5): TrustAnchorHolder mirrors cmd/server/tls.go::certHolder. RWMutex- guarded pool + Reload that swaps a fresh slice on success + WatchSIGHUP goroutine that responds to the same SIGHUP the existing TLS-cert watcher uses. A bad reload (parse error, expired cert) keeps the OLD pool in place so a half-rotation doesn't take Intune enrollment down — same fail-safe pattern. Operators rotate via the on-disk file then 'kill -HUP <certctl-pid>'. * internal/scep/intune/rate_limit.go (Phase 8.6): hand-rolled sliding-window-log limiter keyed by (Subject, Issuer). 100k-entry map cap (matches replay cache); at-cap drops the bucket whose newest timestamp is the oldest. Default 3 enrollments per 24h covers legitimate first-cert + recovery + post-wipe re-enrollment but blocks bulk enumeration from a compromised Connector signing key. maxN <= 0 disables the limiter for tests + the rare operator who wants no per-device cap. Empty subject short-circuits to allow (defense-in-depth: caller's claim validation rejects empty-subject upstream; no shared bucket on ''). Why hand-rolled instead of golang.org/x/time/rate: the rate package is in go.sum as an indirect transitive but not a direct dep. ~30 LoC of stdlib avoids creating a new direct dep. * internal/service/scep.go (Phase 8.3 + 8.4 + 8.7): - SCEPService gains intuneEnabled / intuneTrust / intuneAudience / intuneValidity / intuneReplayCache / intuneRateLimiter / complianceCheck fields. - SetIntuneIntegration() constructor-time injection wires the per-profile state. Profiles with INTUNE_ENABLED=false never call this method, so they pay zero overhead. - SetComplianceCheck() installs the V3-Pro plug-in (see Phase 8.7). - looksIntuneShaped(): JWT-shape pre-check (length > 200 + exactly two dots). Allowed to false-positive (validator catches malformed → ErrChallengeMalformed); MUST NOT false-negative on real Intune challenges. - dispatchIntuneChallenge(): the load-bearing core. Runs ValidateChallenge → CSR-binding via DeviceMatchesCSR → replay cache CheckAndInsert → per-device Allow → optional ComplianceCheck. Each failure leg increments a typed metric label and emits an audit-friendly Warn log line. - PKCSReq + PKCSReqWithEnvelope + RenewalReqWithEnvelope all call dispatchIntuneChallenge first; on outcome.decided=true they either short-circuit (with a typed-error → SCEPFailInfo mapping) or call processEnrollment with action='scep_pkcsreq_intune' (so audit greps can count Intune-vs-static enrollments). - mapIntuneErrorToFailInfo(): typed-error → SCEPFailInfo per RFC 8894 §3.2.1.4.5 (signature/replay/expired → BadMessageCheck; claim-mismatch → BadRequest; default → BadRequest). - intuneFailReason(): typed-error → metric label ('signature_invalid' / 'expired' / 'rate_limited' / etc.). Default 'malformed' so a previously-unseen error category still surfaces in the metric for follow-up. - ComplianceCheck (Phase 8.7): nil-default no-op gate. V3-Pro plugs in via SetComplianceCheck to call Microsoft Graph's compliance API. Returns (compliant, reason, err). nil-err + compliant=false → CertRep FAILURE + 'compliance' reason in audit. err != nil → fail-safe deny (V3-Pro module is responsible for any 'permit on API failure' policy). * internal/service/scep.go also gains parseCSRForIntune() — small private wrapper around encoding/pem + x509 used by the dispatcher for the claim ↔ CSR binding check (separated from the broader processEnrollment because we want to bind BEFORE consuming the replay-cache slot). Tests (gates: ≥85% coverage on intune package, ≥70% on service): * scep_intune_test.go (in internal/service): 14 dispatcher tests covering happy-path Intune enrollment + static-challenge fallback + tampered-challenge reject + claim-mismatch reject + replay detected + rate-limited + compliance-hook nil-default + compliance- hook denies non-compliant + compliance-hook error fails closed + IntuneEnabled accessor + 'no IntuneEnabled = static path unchanged' regression pin + intuneFailReason mapping for every typed error + looksIntuneShaped boundary cases. * trust_anchor_holder_test.go (in internal/scep/intune): NewLoadsBundle, NewRequiresLogger, NewSurfacesLoadError, ReloadHappyPath, ReloadKeepsOldOnFailure, ReloadKeepsOldOnExpired (the fail-safe semantics that make the SIGHUP path operator-friendly), WatchSIGHUPReloadsPool (real SIGHUP to self with poll-for-swap pattern mirroring cmd/server/tls_test.go), WatchSIGHUPStopIsClean (does NOT fire SIGHUP after stop — same caveat as the TLS test: the Go runtime would otherwise terminate the test runner on the next SIGHUP since signal.Stop has removed the handler). * rate_limit_test.go (in internal/scep/intune): AllowsUpToCap, DistinctKeysIndependent, WindowExpiry, DisabledBypass (maxN=0), NegativeCapDisabled, EmptySubjectShortCircuits (defense-in-depth against an empty-subject DoS chokepoint), DefaultCapsHonored, MapCapEvictsOldest (at-cap eviction branch), ConcurrentRaceFree (50 goroutines × 200 inserts), pruneOlderThan + the no-op case. Verification: * gofmt -l on all touched files: clean * go vet ./... : clean * staticcheck on intune/service/config/cmd-server: clean * go test -count=1 -cover ./internal/scep/intune/...: 94.8% (target ≥85%) * go test -short across intune+service+config+handler+cmd-server: all green * G-3 docs-drift CI guard reproduced locally: docs-only filtered= empty, config-only=empty. The new env vars match the existing CERTCTL_SCEP_ allowlist prefix. Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 8 cowork/scep-rfc8894-intune/progress.md Constitutional rule: 'Always take the complete path, not the easy path' (cowork/CLAUDE.md::Operating Rules) — operator can flip CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_ENABLED=true and observe the dispatcher pick up Intune-shaped challenges end-to-end with no further code changes. Foundation + plumbing ship together. |
||
|
|
e7a3075a75 |
feat(scep): mTLS sibling route /scep-mtls/<pathID> (opt-in)
SCEP RFC 8894 + Intune master bundle — Phase 6.5 of 14 (opt-in,
enterprise-procurement-checkbox).
Closes the procurement-team objection that 'shared password
authentication' is a checkbox-fail regardless of how strong the
password is. The clean answer: a sibling route that adds client-cert
auth at the handler layer AND keeps the challenge password (defense in
depth, not replacement). Devices present a bootstrap cert from a
trusted CA (e.g. a manufacturing-time cert), then SCEP-enroll for
their long-lived cert. Same model Apple's MDM and Cisco's BRSKI use.
internal/config/config.go
* SCEPProfileConfig gains MTLSEnabled bool + MTLSClientCATrustBundlePath
string. Indexed env-var loader reads
CERTCTL_SCEP_PROFILE_<NAME>_MTLS_ENABLED +
CERTCTL_SCEP_PROFILE_<NAME>_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH.
* Validate() refuses MTLSEnabled=true with empty bundle path —
structural defense in depth ahead of the file-content preflight.
cmd/server/main.go
* preflightSCEPMTLSTrustBundle: file existence + PEM parse + ≥1
CERTIFICATE block + non-expired check. Returns the parsed
*x509.CertPool ready to inject into the per-profile SCEPHandler.
Failures os.Exit(1) with the offending PathID in the structured log.
* SCEP startup loop walks each profile; when MTLSEnabled, runs
preflight, builds the per-profile pool, contributes the bundle's
certs to the union pool that backs the TLS-layer
VerifyClientCertIfGiven, clones the SCEPHandler with
SetMTLSTrustPool, and registers the parallel sibling route via
apiRouter.RegisterSCEPMTLSHandlers.
* Union pool published to outer scope as scepMTLSUnionPoolForTLS;
passed to buildServerTLSConfigWithMTLS so the listener serves both
/scep[/<pathID>] (no client cert) and /scep-mtls/<pathID>
(cert required at handler layer) on the same socket.
* Final-handler dispatch gains /scep-mtls + /scep-mtls/* prefix
routing through the no-auth chain (auth boundary is the client
cert + challenge password, NOT a Bearer token).
cmd/server/tls.go
* New buildServerTLSConfigWithMTLS that wraps buildServerTLSConfig
+ sets ClientCAs + ClientAuth=VerifyClientCertIfGiven when a
non-nil pool is passed. nil pool = identical TLS shape to the
pre-Phase-6.5 builder (no behavior change for deploys without
mTLS profiles).
* Critical: VerifyClientCertIfGiven (NOT RequireAndVerifyClientCert)
so a client that doesn't present a cert can still hit the standard
/scep route. The per-profile gate at the handler layer enforces
'cert required' on /scep-mtls/<pathID>.
internal/api/handler/scep.go
* SCEPHandler gains mtlsTrustPool *x509.CertPool field +
SetMTLSTrustPool method. Per-profile pool injected by
cmd/server/main.go after preflight.
* HandleSCEPMTLS wrapper: gates on r.TLS.PeerCertificates non-empty
+ per-profile cert.Verify against THIS profile's pool. Returns
HTTP 401 for missing/untrusted cert (mTLS failure is auth, not
authorization). Returns HTTP 500 if mtlsTrustPool is nil (deploy
bug — the route shouldn't have been registered). On success
delegates to HandleSCEP — defense in depth: mTLS is additive,
NOT replacement; the standard SCEP code path including the
challenge-password gate still executes.
* Per-profile re-verification via cert.Verify(...) is critical:
the TLS layer verified against the UNION pool, so a cert that
chains to profile A's bundle would pass TLS even when targeting
profile B. The handler-layer gate prevents cross-profile
bleed-through.
internal/api/router/router.go
* AuthExemptDispatchPrefixes gains '/scep-mtls' (auth boundary is
client cert + challenge password, NOT Bearer token).
* RegisterSCEPMTLSHandlers parallel to RegisterSCEPHandlers:
empty PathID maps to /scep-mtls root; non-empty maps to
/scep-mtls/<pathID>. Each handler in the map MUST have had
SetMTLSTrustPool called.
internal/api/router/openapi_parity_test.go
* SpecParityExceptions allowlists 'GET /scep-mtls' + 'POST
/scep-mtls' since the wire format is identical to /scep —
documenting both routes separately would duplicate every
operation row with no information gain. Documented alternative
in docs/legacy-est-scep.md.
internal/api/handler/scep_mtls_test.go (new, ~210 LoC)
* 6 tests + 2 helpers covering the auth contract:
1. RejectsMissingClientCert — request with r.TLS=nil → 401
2. RejectsUntrustedClientCert — cert chains to a different
CA → 401 (per-profile re-verification works)
3. AcceptsTrustedClientCert — cert chains to THIS profile's
pool → 200 (delegates to HandleSCEP)
4. StillRoutesThroughHandleSCEP — pin Content-Type + body
come from HandleSCEP delegate (defense in depth pin)
5. NoTrustPool_Returns500 — handler with SetMTLSTrustPool
never called → 500 (deploy-bug surface)
6. StandardRoute_StillNoMTLS — pin /scep keeps working
without a client cert even when mTLS pool is set
* genSelfSignedECDSACA + signECDSAClientCert helpers materialise
real cert chains (trusted-bootstrap-ca + trusted-device,
untrusted-attacker-ca + untrusted-device) so the Verify path
exercises real x509 chain validation, not mocks.
docs/features.md
* SCEP env-vars table extended with the two new MTLS env vars
(CERTCTL_SCEP_PROFILE_<NAME>_MTLS_ENABLED,
CERTCTL_SCEP_PROFILE_<NAME>_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH).
Closes the G-3 'env var defined in Go but never documented' gate.
docs/legacy-est-scep.md
* New 'mTLS sibling route (Phase 6.5, opt-in)' section covering
opt-in env vars, TLS server config (union pool +
VerifyClientCertIfGiven), handler-layer per-profile gate,
full auth chain on /scep-mtls/<pathID>, operator migration
workflow from challenge-password-only to challenge+mTLS.
cowork/CLAUDE.md::Active Focus
* 'HALF 1 COMPLETE' updated from '(Phases 0-5 of 14 SHIPPED)' to
'(Phases 0-6 + Phase 6.5 of 14 SHIPPED)'.
Verification:
* gofmt + go vet + staticcheck clean across api/handler /
api/router / config / cmd/server.
* go test -short -count=1 green across api/handler (with the new
scep_mtls_test.go) / api/router / service / config / pkcs7 /
cmd/server / connector/issuer/local.
* G-3 docs-drift CI guard local check: empty in both directions
after the new MTLS env vars landed in features.md.
* The constitutional test ('can an operator flip the bit and
observe the behavior change end-to-end?') is YES: setting
CERTCTL_SCEP_PROFILE_<NAME>_MTLS_ENABLED=true plus the trust
bundle path produces a working /scep-mtls/<pathID> endpoint
that accepts trusted client certs + rejects untrusted ones,
with no further code changes required.
Phase 6.5 of 14 in SCEP RFC 8894 + Intune master bundle.
Half 1 (Phases 0-6 + 6.5) is now FEATURE-COMPLETE for the
ChromeOS / general-MDM use case. Half 2 (Phases 7-12) adds the
Microsoft Intune dynamic-challenge layer.
|
||
|
|
d8d1436736 |
docs(scep): close G-3 docs-only drift in legacy-est-scep.md
Two G-3 regression hits from the SCEP RFC 8894 docs that landed in
commit 35fcfa7's docs/legacy-est-scep.md addition:
1. CERTCTL_SCEP_PROFILE_CORP_* (5 vars) — the multi-profile dispatch
recipe used literal CORP placeholders in the example block, which
the G-3 scanner treats as phantom env vars (the loader expands
<NAME> at runtime; CORP is never a literal env-var key in Go
source). Replaced the literal example with a prose description
that uses the <NAME> token explicitly + cross-references
docs/features.md where the per-profile suffix table lives. The
G-3 scanner sees only CERTCTL_SCEP_PROFILES + the prefix
CERTCTL_SCEP_ (already on the ALLOWED list per commit
|
||
|
|
929bc7a899 |
docs(scep): RFC 8894 hardening — README + architecture + connectors
SCEP RFC 8894 + Intune master bundle — Phase 6 of 14.
Closes Half 1 of the bundle (Phases 0-6). The certctl SCEP server now
ships full RFC 8894 wire format (EnvelopedData decrypt + signerInfo POPO
verify + CertRep PKIMessage builder), tested against ChromeOS-shape
hermetic E2E requests, with multi-profile dispatch and must-staple
per-profile policy. Half 2 (Phases 7-12) adds the Microsoft Intune
dynamic-challenge layer; Phase 6.5 (mTLS sibling route) is independently
shippable as an opt-in enterprise-procurement feature.
README.md
* Standards & Revocation table SCEP row updated to mention full RFC
8894 wire format (EnvelopedData decryption, signerInfo POPO
verification, CertRep PKIMessage builder), PKCSReq + RenewalReq +
GetCertInitial messageType dispatch, multi-profile dispatch
(/scep/<pathID>), per-profile RA cert + key, MVP fall-through for
lightweight clients.
* Enrollment protocols paragraph extended with the same scope, plus
a link to docs/legacy-est-scep.md for the operator + device-
integration guide.
docs/architecture.md
* SCEP wire format paragraph rewritten to describe the two paths
(RFC 8894 first, MVP fall-through), the messageType dispatch
table, the EnvelopedData decrypt (constant-time PKCS#7 unpad
closing the padding-oracle leg), the SET-OF Attribute
re-serialisation quirk per RFC 5652 §5.4, and the CertRep
PKIMessage shape (cert chain encrypted to req.SignerCert, NOT
the RA cert).
* SCEP service interface updated to show the three new
*WithEnvelope variants alongside the legacy PKCSReq method.
* Added 'Capabilities advertised', 'Multi-profile dispatch', and
'Must-staple per profile' subsections covering the RFC 7633
extension policy.
docs/connectors.md
* EST/SCEP Integration section extended with the per-profile
issuer-binding env-var form (CERTCTL_SCEP_PROFILE_<NAME>_ISSUER_ID).
* New SCEP RA cert + key paragraph pointing operators at the
legacy-est-scep.md openssl recipe + ChromeOS Admin Console
pointer + must-staple per-profile policy.
cowork/CLAUDE.md::Active Focus
* 2026-04-29 SCEP RFC 8894 + Intune master bundle status updated
to 'HALF 1 COMPLETE (Phases 0-5 of 14 SHIPPED)' with the full
chain of commit SHAs (
|
||
|
|
35fcfa70f2 |
feat(scep): RenewalReq + GetCertInitial + ChromeOS E2E + caps + must-staple
SCEP RFC 8894 + Intune master bundle — Phase 4 + Phase 5 of 14.
Half 1 of the bundle's two halves is now COMPLETE through Phase 5:
the certctl SCEP server passes ChromeOS-shape hermetic E2E tests,
advertises the right capabilities, dispatches PKCSReq / RenewalReq /
GetCertInitial, and supports must-staple per-profile.
== Phase 4: RenewalReq + GetCertInitial wiring ============================
internal/service/scep.go
* RenewalReqWithEnvelope (RFC 8894 §3.3.1.2) — re-enrollment with an
existing valid cert. Same contract as PKCSReqWithEnvelope but the
service additionally verifies that envelope.SignerCert chains to
the issuer's CA (verifyRenewalSignerCertChain). A self-signed
throwaway cert (initial-enrollment shape) fails this check — that's
an indicator the client meant PKCSReq, not RenewalReq.
* GetCertInitialWithEnvelope (RFC 8894 §3.3.3) — polling stub.
Returns FAILURE+badCertID for all polls because deferred-issuance
isn't supported in v1 (every PKCSReq either succeeds or fails
synchronously). Wiring stays in place for a future enhancement.
* Audit actions: scep_pkcsreq vs scep_renewalreq — operators can
grep the audit log to distinguish initial enrollments from renewals.
internal/api/handler/scep.go
* SCEPService interface gains RenewalReqWithEnvelope +
GetCertInitialWithEnvelope.
* pkiOperation RFC 8894 path now switches on envelope.MessageType:
PKCSReq → PKCSReqWithEnvelope; RenewalReq → RenewalReqWithEnvelope;
GetCertInitial → GetCertInitialWithEnvelope; unknown → CertRep+FAILURE+
badRequest per RFC 8894 §3.3.2.2.
== Phase 5.1: GetCACaps capability advertisement =========================
internal/service/scep.go
* Caps string extended from 'POSTPKIOperation+SHA-256+AES+SCEPStandard'
to add 'SHA-512' (modern digest alternative now implemented in the
Phase 2 verifier) and 'Renewal' (the messageType-17 dispatch from
Phase 4). ChromeOS specifically looks for these capabilities to
negotiate the strongest available cipher + digest combo.
* scep_test.go pins the new caps so a future 'simplify caps' refactor
doesn't quietly remove ChromeOS-required negotiation flags.
== Phase 5.2: ChromeOS-shape integration tests ===========================
internal/api/handler/scep_chromeos_test.go (new, ~570 LoC)
* 6 hermetic E2E tests + ~12 helpers. Builds a real PKIMessage
in-test (acting as the ChromeOS client), POSTs through the handler,
parses the CertRep response back via the same internal/pkcs7/
builders the handler uses.
* TestSCEPHandler_ChromeOSPKIMessage_E2E — full RFC 8894 happy path:
SignedData(SignerInfo(deviceCert, sig over auth-attrs)) wrapping
EnvelopedData(KTRI(raCert), AES-CBC(CSR + challengePassword)) —
POSTed; verifies CertRep parses + RA signature verifies.
* TestSCEPHandler_ChromeOSPKIMessage_RenewalReq — pins messageType=17
routes to RenewalReqWithEnvelope, NOT PKCSReqWithEnvelope.
* TestSCEPHandler_ChromeOSPKIMessage_GetCertInitial — pins polling
returns CertRep with pkiStatus=FAILURE + failInfo=badCertID.
* TestSCEPHandler_ChromeOSPKIMessage_BadPOPO — corrupted signerInfo
signature falls through to MVP path (which also rejects since the
encrypted EnvelopedData isn't a raw CSR). No silent acceptance.
* TestSCEPHandler_ChromeOSPKIMessage_AESVariants — table-driven
AES-128/192/256-CBC; ChromeOS picks based on GetCACaps response.
* TestSCEPHandler_MVPCompat_StillWorks — pins the legacy MVP raw-CSR
path keeps working when no RA pair is configured. Backward compat
is non-negotiable.
== Phase 5.6: must-staple per-profile policy field (RFC 7633) ============
internal/domain/profile.go
* Added MustStaple bool to CertificateProfile. Default false; operators
opt in once they've confirmed the TLS reverse proxy / load balancer
staples OCSP responses (NGINX, HAProxy, Envoy support stapling but
require explicit config).
internal/connector/issuer/interface.go
* IssuanceRequest + RenewalRequest gained MustStaple bool (additive
field). Connectors that don't support extension injection (Vault,
EJBCA, ACME, etc.) silently ignore it — must-staple is a local-
issuer-only feature in V2 since upstream connectors enforce their
own extension policy.
internal/connector/issuer/local/local.go
* Added oidMustStaple (1.3.6.1.5.5.7.1.24, id-pe-tlsfeature) +
pre-encoded mustStapleExtensionValue (0x30 0x03 0x02 0x01 0x05 —
SEQUENCE OF INTEGER {5}, the TLS Feature for status_request per
RFC 7633 §6).
* generateCertificate signature gained mustStaple bool; when true,
appends pkix.Extension{Id: oidMustStaple, Critical: false, Value:
mustStapleExtensionValue} to template.ExtraExtensions before
x509.CreateCertificate.
internal/connector/issuer/local/must_staple_test.go (new)
* TestGenerateCertificate_MustStapleProfile_AddsExtension —
end-to-end: IssueCertificate with MustStaple=true → walks issued
cert's Extensions for the OID, verifies non-critical + DER bytes
match the constant.
* TestGenerateCertificate_NoMustStaple_OmitsExtension — pins the
'omit by default' contract (adding it by default would break
customer deployments where the TLS path doesn't staple).
* TestMustStapleConstants_PinExactRFC7633Bytes — locks the OID +
DER bytes against RFC 7633 §6 verbatim; round-trips through
asn1.Unmarshal as []int{5}.
Note: full service-layer plumbing (CertificateProfile.MustStaple →
IssuanceRequest.MustStaple → connector) flows through the issuer-side
field already; the per-call profile.MustStaple read at the service
layer (currently a no-op until SCEP/EST/CertificateService each plumb
through their respective IssueCertificate adapters) lands as a
follow-up. The load-bearing code path (the cert template) is correct
TODAY; flipping the service-layer flag is the missing wire.
== Phase 5.4: docs/legacy-est-scep.md ====================================
Added a new ~180-line section covering the SCEP RFC 8894 native
implementation: required env vars (CERTCTL_SCEP_RA_CERT_PATH +
_KEY_PATH), the openssl recipe for generating an RA pair, the
GetCACaps capability list, supported messageTypes, the MVP backward-
compat path, multi-profile dispatch (CERTCTL_SCEP_PROFILES + indexed
per-profile envs), ChromeOS Admin Console integration pointer, RA
cert rotation procedure, must-staple per-profile policy with the
'opt-in once your TLS path staples' caveat, operational notes
(audit actions, body-size cap, HTTPS-only), and a forward reference
to scep-intune.md (Phase 11).
== Verification ==========================================================
* gofmt + go vet clean for the files I touched.
* staticcheck ./internal/api/handler/... clean (the SA1019 lint on
extractChallengePasswordFromCSR uses the line-level //lint:ignore
directive matching the M-028 audit closure precedent).
* go test -short -count=1 green across api/handler / api/router /
service / pkcs7 / connector/issuer/local / domain / cmd/server.
* G-3 docs-drift CI guard local check: empty diff in both directions.
Phase 4 + Phase 5 of 14 in SCEP RFC 8894 + Intune master bundle.
Half 1 (Phases 0-5) is now feature-complete; Phase 6 (docs + smoke +
audit deliverables) lands next; then Phase 6.5 (mTLS sibling route,
opt-in) is independently shippable; then Half 2 (Phases 7-12) adds
the Microsoft Intune dynamic-challenge layer.
Living progress at cowork/scep-rfc8894-intune/progress.md.
|
||
|
|
347365ff93 |
ci+docs(scep): close G-3 docs-only drift for SCEP placeholder + wildcard
Commit
|
||
|
|
197904759f |
docs(scep): document multi-profile env vars (CERTCTL_SCEP_PROFILES + per-profile prefix)
Phase 1.5 added two new env-var literals to internal/config/config.go
that the G-3 docs-drift CI guard picked up but I forgot to document
when shipping commit
|
||
|
|
770ddd5b1b |
feat(scep): add RFC 8894 message-type constants + RA cert/key config
SCEP RFC 8894 + Intune master bundle — Phase 0 + Phase 1 of 14.
Phase 0 (recon, no code changes):
Baseline tests green at HEAD
|
||
|
|
63d6ed256d |
docs: README + concepts + features reflect CRL/OCSP responder bundle
Audit pass against cowork/crl-ocsp-responder-prompt.md found three
operator-facing docs still describing the pre-bundle CRL/OCSP surface
(GET-only OCSP, CA-key-direct signing, no scheduler-driven cache). Each
claim updated below was ground-truthed against repo HEAD before edit.
README.md
* Standards & Revocation table — CRL row now mentions
scheduler-pre-generated cache (CERTCTL_CRL_GENERATION_INTERVAL,
crl_cache table); OCSP row mentions GET + POST forms, dedicated
responder cert per RFC 6960 §2.6, id-pkix-ocsp-nocheck per
§4.2.2.2.1, 7d auto-rotation grace.
* Revocation paragraph — corrected the 'Embedded OCSP responder'
one-liner to call out the dedicated-responder-cert design (the CA
private key is never used directly for OCSP signing, which is the
load-bearing security property for the future PKCS#11/HSM driver
path) and added the link to the relying-party guide.
docs/concepts.md
* CRL paragraph — added the scheduler pre-generation + singleflight
coalescing detail. Kept the existing 24h validity claim (verified
against internal/connector/issuer/local/local.go:956 — 'NextUpdate:
now.Add(24 * time.Hour)').
* OCSP paragraph — corrected the description so it covers both GET
and POST forms (POST per RFC 6960 §A.1.1 is what production
clients use: Firefox, OpenSSL s_client -status, cert-manager,
Intune); added the dedicated-responder-cert + nocheck-extension +
auto-rotation explanation; cross-link to docs/crl-ocsp.md.
docs/features.md
* Revocation Infrastructure section — CRL Endpoint, OCSP Responder,
new Admin Cache Observability subsection, new GUI Revocation
Endpoints Panel subsection. Corrected the previously-wrong 'Signs
with the issuing CA key' OCSP claim — the bundle's load-bearing
security improvement is exactly that the CA key is NOT used
directly. Cross-link to crl-ocsp.md.
* Local CA env vars table — added all four new
CERTCTL_CRL_GENERATION_INTERVAL / CERTCTL_OCSP_RESPONDER_KEY_DIR
(with the prod 'MUST set' callout) / _ROTATION_GRACE / _VALIDITY
rows. Closes the G-3 'env var defined in Go but never documented'
drift that broke CI on commit
|
||
|
|
0a548aa1d6 |
docs: CRL/OCSP user guide + architecture cross-reference — Phase 6
Audit of cowork/crl-ocsp-responder-prompt.md against repo HEAD found
two prompt deliverables still missing after the Phase 5 + Phase 6 code
landed: the docs/crl-ocsp.md operator+relying-party guide (Phase 6.2)
and the docs/architecture.md cross-reference. This commit closes both.
docs/crl-ocsp.md (329 lines) covers:
* Conceptual overview — why both CRL and OCSP, why a separate
responder cert (RFC 6960 §2.6 / §4.2.2.2.1) keeps the CA key cold
* Endpoints — GET CRL, GET + POST OCSP, admin observability endpoint
(M-008 admin-gated) with full request/response shape examples
* Configuration — every CERTCTL_CRL_* / CERTCTL_OCSP_RESPONDER_*
env var with default + meaning + 'MUST set in prod' callout for
OCSP_RESPONDER_KEY_DIR
* OCSP responder cert lifecycle — first-request bootstrap, disk
self-healing when keydir is pruned out from under the DB row,
rotation grace, ExtraExtensions wiring for id-pkix-ocsp-nocheck
* Consumer integration recipes — cert-manager (AIA/CDP automatic),
Firefox (about:preferences quirk), OpenSSL (ocsp + s_client -status),
Intune (CRL pull cadence)
* V3-Pro deferred (delta CRLs, OCSP rate-limiting, OCSP stapling)
* Troubleshooting (404 on issuer that doesn't support CRL, hex
serial format, admin-gated 403, scheduler not running)
docs/architecture.md: extended the existing 'Certificate revocation'
paragraph to explicitly call out the new pipeline (crl_cache table,
OCSP responder cert per RFC 6960 §2.6, POST + GET OCSP endpoints,
auto-rotation grace) and added the 'See docs/crl-ocsp.md for the
operator + relying-party guide' link so future readers can find the
deep dive.
Closes the prompt's Phase 6.2 + 6.3 exit criteria. Combined with
the Phase 5 GUI panel (
|
||
|
|
fdd445c09f |
crypto/signer: introduce Signer interface; refactor local issuer to use it
This is a load-bearing internal refactor with no user-visible behavior
change. The new internal/crypto/signer package abstracts CA private-key
signing behind a Signer interface (embeds stdlib crypto.Signer + adds
Algorithm()). The local issuer now consumes this interface; the
historical c.caKey crypto.Signer field is renamed c.caSigner signer.Signer.
What landed:
* internal/crypto/signer/ — new stdlib-only package
- Signer interface: crypto.Signer + Algorithm()
- Algorithm enum: RSA-2048, RSA-3072, RSA-4096, ECDSA-P256, ECDSA-P384
- Driver interface: Load / Generate / Name
- FileDriver: production driver, wraps file-on-disk PEM, hooks for
DirHardener + Marshaler so the local package can inject Bundle 9
keystore.ensureKeyDirSecure + keymem.marshalPrivateKeyAndZeroize
- MemoryDriver: in-memory test driver; safe for concurrent use
- parse.go: ParsePrivateKey moved here from local.go (PKCS#1, SEC 1, PKCS#8)
- 91.6% coverage (gate ≥85)
* internal/connector/issuer/local/local.go — refactor
- Rename c.caKey crypto.Signer → c.caSigner signer.Signer
- Rewire 4 signing call sites: leaf cert (line ~613), CRL (~849),
OCSP response (~887), CA bootstrap (~482) — all access the
interface; the bootstrap also switches to interface-level
Public() + Signer
- Wrap freshly-generated and freshly-loaded keys; reject Ed25519
and other unsupported algorithms at load time (was silently
accepted before, would have failed at first sign)
- Delete the duplicated parsePrivateKey helper (single source of
truth now lives in the signer package)
- Update the L-014 threat-model comment block (lines 1-29) with a
forward-reference paragraph: file-on-disk caveats apply only to
FileDriver-backed signers; alternative drivers close that leg
- Coverage 86.7 → 86.5 (above CI floor of 86); the 0.2pp drop is
mechanical from deleting parsePrivateKey, partially recovered by
a new test pinning the Wrap error path
* internal/crypto/signer/equivalence_test.go — Phase 3 safety net
- RSA byte-strict equality for leaf certs / CRLs / OCSP responses
(PKCS#1 v1.5 is deterministic)
- ECDSA TBS-strict equality (signature differs because of random k)
- Both signatures independently validate against the CA
- Negative sentinel proves the equivalence checker isn't trivially-
passing
* docs/architecture.md — new 'CA Signing Abstraction' section under
Security Model, with ASCII diagram of FileDriver / MemoryDriver /
future PKCS11Driver / future CloudKMSDriver
* Test file mechanical edits (only):
- bundle9_coverage_test.go: parsePrivateKey → signer.ParsePrivateKey
(function moved, not behavior changed)
- local_test.go: append one targeted test
(TestSubCA_LoadCAFromDisk_RejectsUnsupportedKeyAlgorithm) that
pins the new Wrap error path I introduced — recovers coverage
cost of the deletion above
What did NOT change (verified empty diffs):
* api/openapi.yaml
* migrations/
* internal/connector/issuer/interface.go
* go.mod / go.sum (no new dependencies; stdlib only)
This refactor is the prerequisite for three downstream items:
- PKCS#11/HSM driver (V3-Pro)
- CRL/OCSP responder (V2)
- SSH CA lifecycle (V2)
Each of those adds a new signing call site. Doing the abstraction now
costs once; deferring would cost three times.
|
||
|
|
4ac8f89def |
Bundle P.2-extended CI follow-up: rephrase aspirational env-var references to fix G-3 guard
CI's G-3 env-var docs drift guard caught four aspirational env vars
referenced in the Bundle P.2-extended RFC test-vector subsections that
aren't actually defined in internal/config/config.go:
- CERTCTL_EST_KEYGEN_MODE -> typo for CERTCTL_KEYGEN_MODE (corrected)
- CERTCTL_OCSP_DELEGATED_RESPONDER_CERT_PATH -> not implemented (rephrased
as forward-looking; v2 only supports byName ResponderID)
- CERTCTL_CRL_VALIDITY_DURATION -> not implemented (rephrased; v2 has
a hard-coded 7-day validity)
- CERTCTL_CRL_PARTITIONED -> not implemented (rephrased; v2 emits
full CRLs only with no IDP extension)
The byKey ResponderID, partitioned-CRL IDP, and configurable CRL
validity test vectors remain documented but are now framed as 'becomes
a positive test once <feature> support lands' rather than as currently-
implemented configuration. Same applies to the OCSP delegated-responder
mode test vector.
This keeps the RFC conformance documentation intact while staying
honest about what's actually wired up in v2.
CI guard verification (locally simulated):
G-3 env-var docs drift guard: CLEAN
Bundle: P.2-extended-ci-fix
|
||
|
|
6b60b6fa84 |
Bundle P.2-extended (Coverage Audit Extension): RFC test-vector subsections — M-008 closed
Pure doc work. Three new subsections added to docs/testing-guide.md: Part 21.99 — RFC 7030 EST test vectors - /cacerts response framing (§4.1.3) - /simpleenroll request framing (§4.2.1) - /serverkeygen multipart response (§4.4.2) Part 23.99 — RFC 5280 SAN/EKU test vectors - IPv4 SAN encoding (§4.2.1.6, [7] OCTET STRING 4 bytes) - IPv6 SAN encoding (§4.2.1.6, 16 bytes; v4-mapped canonicalization) - IDN dNSName (§4.2.1.6 + RFC 3490 Punycode) - otherName UPN (§4.2.1.6, [0] AnotherName SEQUENCE) - EKU encoding (§4.2.1.12, SEQUENCE OF OID + standard OIDs) - EKU criticality (§4.2.1.12 + CA/B Forum BR §7.1.2.7) Part 24.99 — RFC 6960 OCSP / RFC 5280 §5 CRL test vectors - OCSP response status (§4.2.2.3, tryLater vs HTTP 5xx) - OCSP ResponderID byName vs byKey (§4.2.2.2) - OCSP nonce extension (§4.4.1, browser-cache-friendly handling) - CRL TBSCertList nextUpdate (§5.1.2 + CA/B Forum BR §7.2.2) - CRL reason codes (§5.3.1, reserved 7 + out-of-range rejection) - CRL IDP extension (§5.2.5, partitioned vs full) - CRL no-delta (§5.2.4, certctl emits full CRLs only) Each vector cites RFC section + provides ASN.1 byte snippet where relevant + names the certctl pin location (file + test name) so a reviewer can spot wire-level drift without re-reading the RFC. Verification - grep -cE '^### [0-9]+\.99' docs/testing-guide.md == 3 (the new subs) - grep -cE '^## Part [0-9]+:' docs/testing-guide.md == 56 (unchanged) - file size: 8266 lines (+~190 from baseline) Audit deliverables - gap-backlog.md M-008 row: full strikethrough + closure note enumerating all three subsections + the 14 specific test vectors - coverage-audit-2026-04-27/extension-progress.md: P.2 marked DONE Closes: M-008 Bundle: P.2-extended (Coverage Audit Extension) |
||
|
|
9a01ec4023 |
Bundle Q (Coverage Audit Closure): property-based pilot + hygiene — L-001/L-002/L-003/L-004/I-001 closed
Five small closures wrapping the Low-tier and Info-tier audit findings. Q.1 — cmd/cli round-out (L-001 closed) ====================================== cmd/cli/dispatch_test.go: ~30 dispatch tests across handleCerts / handleAgents / handleJobs / handleImport / handleStatus. httptest.NewTLSServer mocks the API; cli.NewClient(_, _, _, _, true) constructs an insecure-skip-verify client. Each test pins the missing-args usage-print path AND the happy-path delegation. Result: 7.1% -> 63.5% coverage (gate: >=30%). Q.2 — awssm round-out (L-002 closed) ====================================== internal/connector/discovery/awssm/awssm_edge_test.go: New() default constructor, extractKeyInfo (ECDSA/Ed25519/unknown — was RSA-only), processSecret filter arms (NamePrefix mismatch / TagFilter mismatch / empty-value / GetSecretValue error), realSMClient stub-contract pin (ListSecrets / GetSecretValue / NewRealSMClient), and EmailAddresses SAN extraction. Result: 78.2% -> 96.0% coverage (gate: >=85%). Q.3 — Property-based testing pilot (L-003 closed) ====================================== gopter@v0.2.11 added to go.mod (test-only). internal/crypto/encryption_property_test.go: - TestProperty_EncryptDecryptRoundTrip — 50 successful tests, DecryptIfKeySet(EncryptIfKeySet(x, k), k) == x - TestProperty_WrongPassphraseRejected — 30 successful tests, AEAD never returns nil-error AND bytes-equal plaintext under wrong passphrase Both skipped under -short to keep developer loop fast (PBKDF2 600k rounds × 50 iters ≈ 15s on -race CI). internal/pkcs7/length_property_test.go: - TestProperty_ASN1LengthRoundTrip — three sub-properties: decodeLength(encode(x)) == x for x ∈ [0, 2³¹−1]; short-form invariant (length<128 → 1 byte == length); long-form invariant (length>=128 → high bit set + N bytes follow). 500 successful tests in <10ms. Q.4 — Architecture diagram multi-agent update (L-004 closed) ====================================== docs/qa-test-guide.md::Architecture: ASCII diagram updated to show 'certctl-agent (×N)' + callout explaining seed_demo.sql provisions 12 agent rows (1 active, 2 retired, 9 reserved/sentinel) for Parts 04, 05, 55 + FSM coverage. Operators running parallel-agent topologies guided to AGENT_COUNT=N + 'make qa-stats'. Q.5 — Test-naming CI guard (I-001 closed) ====================================== .github/workflows/ci.yml: Test-naming convention guard added after the QA-doc seed-count drift guard. Greps for func Test<X>( missing the <X>_<Scenario> suffix. Prints first 20 non-conformant as ::warning:: annotations. continue-on-error: true (informational). Excludes TestMain + TestProperty_*. Promotion to hard-fail tracked as I-001-extended. Verification ====================================== - python3 yaml.safe_load on ci.yml: OK - go vet ./cmd/cli/... ./internal/connector/discovery/awssm/... ./internal/crypto/... ./internal/pkcs7/...: clean - go test -short -count=1 across all four packages: PASS - go test -count=1 (full property tests): PASS - crypto 15.4s (50 + 30 × 600k PBKDF2) - pkcs7 5ms Audit deliverables ====================================== - gap-backlog.md: strikethroughs on L-001/L-002/L-003/L-004/I-001 with per-finding closure note - closure-plan.md: ticks Bundle Q [x] with per-item breakdown Closes: L-001, L-002, L-003, L-004, I-001 Bundle: Q (Property-Based + Hygiene) |
||
|
|
43b7323d24 |
Bundle P (Coverage Audit Closure): QA doc strengthening — M-007/M-009/M-010/M-011/M-012 closed; M-008 deferred
Six structural strengthenings to certctl QA documentation surface, raising acquisition-readiness QA-doc score 4.0 -> 4.7. M-008 (per-RFC test-vector subsections under Parts 21 + 24) deferred as 'Bundle P.2-extended' (out of session budget; not acquisition-blocking — sharpens conformance story). P.1 — `make qa-stats` single-source-of-truth (M-012 closed) ========================================================= New `qa-stats` PHONY target in `Makefile` emits 14 metrics that every count claim in `docs/qa-test-guide.md` and `docs/testing-guide.md` is derived from: backend test files / Test functions / t.Run subtests, frontend test files, fuzz targets, t.Skip sites, qa_test.go Part_ subtests, testing-guide.md Parts, and unique seed IDs (mc-* / ag-* / iss-* / tgt-* / nst-*). Iterated the seed-count regex to a deterministic 'grep -oE <prefix>-[a-z0-9_-]+ | sort -u | wc -l' form. Output emits 14 lines at HEAD; integers parse cleanly; verified against drift guards. P.2 — CI drift guards (M-011 closed) ========================================================= Two new CI steps in `.github/workflows/ci.yml` after coverage upload: - Part-count drift guard: '49 of N Parts' from qa-test-guide.md vs '^## Part N:' header count in testing-guide.md. Fails on mismatch. - Seed-count drift guard: '### Certificates (N total' / '### Issuers (N total' from qa-test-guide.md vs unique mc-* / iss-* IDs in seed_demo.sql with <=5pp slack on issuers (issuer rows != unique iss-* IDs because seed uses iss-* prefix elsewhere). Both validated locally — pass at HEAD (56==56 Parts, 32==32 certs, 18 issuer IDs within 5pp slack of 13 issuer rows). YAML lint clean. P.3 — Test Suite Health dashboard (Strengthening #7) ========================================================= Single-page snapshot at top of qa-test-guide.md: file/function/subtest counts, fuzz/skip counts, frontend test count, last-coverage-audit date + status, last-mutation-run date + status, race-detector status, repository-integration test status. Designed for first-look auditor / acquirer / new-engineer scanning. P.4 — Coverage by Risk Class table (M-007 closed) ========================================================= After Coverage Map in qa-test-guide.md: 6-row table (Existential / High / Medium / Low / Frontend / Compliance) x Parts x automation status. Cross-references each row to coverage-matrix.md. Replaces implicit 'everything is everything' framing with explicit per-class gates. P.5 — Release Day Sign-Off Matrix (M-010 closed) ========================================================= 12-row release-readiness checklist in qa-test-guide.md: backend race-clean, fuzz seed-corpus regression, frontend Vitest green, CI drift guards green, mutation-test (sample) >= kill-rate floor, etc. Each row cites verification command + gate value. Sign-off is 'all 12 green' — produces a per-release artifact attached to the tag. P.6 — Mutation Testing Targets (Strengthening #5) ========================================================= New section in qa-test-guide.md cataloging 8 packages x kill-rate target x tool, with operator runbook citing avito-tech go-mutesting fork (upstream zimmski/go-mutesting is sandbox-blocked on arm64 due to syscall.Dup2). Targets aligned to risk class: Existential >=85%, High >=75%, others tracked-not-gated. P.7 — Per-Connector Failure-Mode Matrix (M-009 closed, condensed) ========================================================= New 'Part 9.0 Per-Connector Failure-Mode Matrix' in docs/testing-guide.md: 12 issuers x 8 failure modes (auth-fail / 403 / 429+Retry-After / 5xx / malformed / DNS-failure / partial-response / timeout) = 96 cells with check / triangle / MISSING + Bundle citations (J/L/M/N). Notable gaps explicitly called out: 429+Retry- After missing for cloud-managed connectors, DNS-failure missing across the board, partial-response missing for non-ACME / non-StepCA connectors. Each gap is a follow-on-bundle candidate. Verification ========================================================= - 'make qa-stats' runs to completion, emits 14 metrics, all integers parse cleanly - 'python3 -c "import yaml; yaml.safe_load(...)"' clean on ci.yml - Both CI drift guards executed locally — both PASS at HEAD - git diff --stat: 5 files changed, +249 / -1 Audit deliverables ========================================================= - gap-backlog.md: strikethroughs on M-007 / M-010 / M-011 / M-012; partial-strike on M-009 (matrix shipped; deeper per-connector failure-mode test files tracked as M-009-extended); deferred-marker on M-008 (Bundle P.2-extended); Bundle P closure-log entry - closure-plan.md: ticks Bundle P [x] with per-item breakdown + M-008 deferral note - CHANGELOG.md: full Bundle P [unreleased] entry above Bundle O - testing-guide.md: new Part 9.0 Per-Connector Failure-Mode Matrix - qa-test-guide.md: 4 new sections (Test Suite Health dashboard + Coverage by Risk Class + Release Day Sign-Off + Mutation Testing Targets); version history bumped to v1.3 - Makefile: new qa-stats PHONY target - ci.yml: 2 new drift-guard steps after coverage upload Closes: M-007, M-010, M-011, M-012 Closes (condensed): M-009 (matrix shipped; deeper test files = M-009-extended) Deferred: M-008 (Bundle P.2-extended; not acquisition-blocking) Bundle: P (QA Doc Strengthening) |
||
|
|
0692de8f28 |
Bundle I (Coverage Audit Closure): QA-doc drift cleanup — H-007 + H-008 closed
Applies Patches 1-7 from coverage-audit-2026-04-27/tables/qa-doc-patches.md
(Patch 5 re-anchored against actual HEAD seed counts after Phase 0 recon
discovered the original patch's anticipated counts were themselves drifted).
docs/qa-test-guide.md:
- Patch 1: 'all 54 Parts' -> '49 of 56 Parts' + not-yet-automated callout
- Patch 2: Totals line replaced with verified-2026-04-27 breakdown + recompute commands
- Patch 3: Coverage Map gains Parts 23, 24, 55, 56 (each '0 (NOT AUTOMATED)')
- Patch 4: 'Not Yet Automated' subsection added under 'What This Test Does NOT Cover'
- Patch 5: Seed Data Reference re-anchored to authoritative HEAD counts:
32 certs (already correct), 12 agents (was 9), 13 issuers (was 9),
8 targets (already correct), 4 nst (already correct).
Replaced narrow ID enumerations with sed | grep recompute commands.
Added maintenance-note pointer to Strengthening #6 (CI guard).
- Patch 6: Version History entry v1.2 added
- Bonus: integration_test comparison row updated (12 agents + 13 issuers)
deploy/test/qa_test.go (Patch 7):
4 new t.Run('PartN_*', ...) blocks for Parts 23, 24, 55, 56 — each calls
t.Skip with a docs/testing-guide.md::Part N pointer + automation candidates.
Skip-with-rationale form keeps Part numbering consistent + makes the
manual-test pointer machine-readable. Replacing each Skip with a real
test body is gap-backlog work.
Verification:
grep -cE '^## Part [0-9]+:' docs/testing-guide.md == 56 PASS
grep -cE 't\.Run("Part[0-9]+_' deploy/test/qa_test.go == 53 PASS
go vet -tags qa ./deploy/test/... PASS
go test -tags qa -run='__nope__' ./deploy/test/... PASS (compile)
(Full SKIP-grep gate requires the live demo stack; t.Skip bodies trivial.)
Audit deliverables:
findings.yaml: H-007 (-0014), H-008 (-0015) status open -> closed
gap-backlog.md: strikethrough both rows + Bundle I closure-log entry
tables/qa-doc-drift.md: 'PATCHES APPLIED' header marker (not retro-edited)
acquisition-readiness.md: QA-doc rigor 2.5 -> 4.0
closure-plan.md: Bundle I checklist box ticked
CHANGELOG.md: [unreleased] Bundle I entry
|
||
|
|
904b2266e7 |
Bundle G: Final audit closure — L-004 + D-003/4/5/7 closed; 54/55 + 7/7
Closes the 2026-04-25 audit's final-closure cluster. Score 51/55 -> 54/55
(98% closed); deferred 4/7 -> 7/7 (100%). All severity-graded findings now
closed except M-029 (frontend per-PR migration backlog, by design incremental).
L-004 (CWE-924) — dual-key API rotation overlap window:
internal/config/config.go::ParseNamedAPIKeys rewritten to allow same-name
duplicate entries iff admin flag matches. Mismatched-admin entries rejected
at startup (privilege escalation guard); exact (name,key) duplicates rejected
(typo guard — rotation requires DIFFERENT keys under the same name). Startup
INFO log per name with multiple entries surfaces the active rotation window.
NewAuthWithNamedKeys was already shaped correctly (constant-time hash compare
across all entries, same UserKey + AdminKey for either bearer); Bundle B's
M-025 per-user rate-limit bucket and audit-trail actor inherit consistency
across the rollover automatically. 8 new tests pin the contract end-to-end.
docs/security.md::API key rotation walks the 6-step zero-downtime rollover.
D-003 — Mutation testing wired:
security-deep-scan.yml gets a go-mutesting step covering ./internal/crypto/...,
./internal/pkcs7/..., ./internal/connector/issuer/local/... with per-package
summary lines extracted into go-mutesting.txt artefact.
D-007 — Frontend semgrep wired (recon found Bundle 7's wiring claim was false):
security-deep-scan.yml gets a 'semgrep p/react-security' step running
returntocorp/semgrep:latest --config=p/react-security against /src/web/src;
results uploaded as semgrep-react.json.
D-004 + D-005 — Operator runbook published:
docs/testing-strategy.md (NEW) consolidates per-tool local-run procedures,
acceptance thresholds, and triage paths for go-mutesting, ZAP baseline DAST,
testssl.sh, and semgrep p/react-security. Closes the 'wired CI-only, no
local-run validation' framing for D-004/D-005 by giving operators the same
commands the CI workflow runs.
Verification:
gofmt -l no diff
go vet ./internal/config/... ./internal/api/middleware/... clean
go test -short -count=1 ./internal/config/... ./internal/api/middleware/... PASS
python3 -c 'yaml.safe_load(...)' YAML OK
G-3 env-var docs guard no phantom env-vars
Audit deliverables:
audit-report.md: L-004 + D-003/4/5/7 boxes flipped [x]; score 51/55 -> 54/55
findings.yaml: 5 status flips; new bundle-G-final-closure closure_log entry
CHANGELOG.md: Bundle G entry under [unreleased]; supersedes Bundle E + F
L-004-deferred framing
|
||
|
|
6ae4023582 |
Bundle F follow-up: M-023 doc env-var cleanup (G-3 guard fix)
CI on the bundle-F merge (run #24972730564) failed the G-3 env-var docs guardrail because docs/legacy-est-scep.md mentioned CERTCTL_EST_PROXY_TRUSTED_SOURCES CERTCTL_EST_TRUST_PROXY_CLIENT_CERT_HEADER which are documented as future-feature env vars but don't exist in config.go. The G-3 guard treats any env-var name in docs that's not either defined in source OR on the documented integration-surface allowlist as drift. The runbook's 'certctl-side configuration' section was over-promising features that haven't shipped yet. Rewritten to be honest: - Current implementation is header-agnostic (X-SSL-Client-Cert is ignored). EST/SCEP authentication still works correctly because both protocols carry their own auth (CSR signature for EST, challengePassword for SCEP) inside the request body. - The reverse proxy is purely a TLS-version bridge. - Future-feature description retained in prose form (without literal env-var names) so an operator who needs proxy-supplied client identity knows to open an issue. The nginx config block's comment was also rewritten to reflect the header-agnostic default. The proxy still SETS the headers (cheap, no-op when ignored); a future commit can flip certctl to read them behind a fail-closed CIDR allowlist + opt-in toggle. Verification: grep -rnE 'CERTCTL_EST_PROXY|CERTCTL_EST_TRUST' README.md docs/ deploy/helm/ — empty (G-3 guard now passes for these names) |
||
|
|
6ae481190e |
Bundle F: Compliance tail + CI gate hardening — 2 findings closed; audit closure complete
Closes M-023 + M-024 from comprehensive-audit-2026-04-25. Final
audit-bundle commit. Score 51/55 closed (93%); High 9/9 (100%);
Medium 26/27 (96%); Low 19/19 (100%); Deferred 4/7.
M-023 (PCI-DSS Req 4 §2.2.5) — Legacy EST/SCEP reverse-proxy runbook
docs/legacy-est-scep.md (NEW): operator runbook for embedded
EST/SCEP clients that only speak TLS 1.2 against a TLS-1.3-pinned
certctl listener. Sections:
- 3-condition gate for when this runbook applies
- Architecture diagram (legacy client -> proxy TLS 1.2 -> certctl TLS 1.3)
- Full nginx config with ssl_protocols TLSv1.2 TLSv1.3 + ECDHE
AEAD-only ciphers + mTLS optional verification + proxy_ssl_protocols
TLSv1.3 on the backend hop
- HAProxy alternative config with ssl-min-ver TLSv1.2 frontend +
ssl-min-ver TLSv1.3 backend
- certctl-side env vars: CERTCTL_EST_PROXY_TRUSTED_SOURCES (CIDR
allowlist of trusted proxies) + CERTCTL_EST_TRUST_PROXY_CLIENT_CERT_HEADER
(toggle header-as-identity). Dual-knob design forces operators
to think about header spoofing.
- PCI-DSS Req 4 v4.0 §2.2.5 attestation language
- Forward-look on TLS 1.2 deprecation watch
certctl listener stays pinned at TLS 1.3 minimum (cmd/server/tls.go:131);
the proxy-to-certctl hop is also TLS 1.3.
M-024 (NIST SSDF PW.7.2) — govulncheck hard gate
.github/workflows/ci.yml: 'Run govulncheck' step renamed to
'Run govulncheck (M-024 hard gate)' with updated comment block
documenting why no carve-out is needed.
Bundle E's transitive bumps (x/net 0.42->0.47, x/crypto 0.41->0.45)
cleared the 5 L-021 deferred-call advisories that the original
Bundle F prompt designed an exception list for. Plain
'govulncheck ./...' is now the right gate; default exit-code
semantics fail on any future called-vuln advisory. Deferred-call
advisories that legitimately can't be remediated should land in
a NIST SSDF deviation log in docs/security.md, not be silenced.
Audit endgame:
51/55 closed (93%). Remaining open items don't require further
bundle work:
- M-029 frontend per-page migration backlog — closes per-PR
- L-004 rotation infra — explicit scope-pivot defer
- D-003 mutation testing — sandbox-blocked
- D-004 DAST suite — wired CI-only via security-deep-scan.yml
- D-005 testssl.sh — wired CI-only
- D-007 frontend semgrep — wired CI-only
Audit deliverables:
audit-report.md: score 49/55 -> 51/55 closed; M-023 + M-024
boxes flipped [x] with closure notes.
findings.yaml: 2 status flips
CHANGELOG.md: Bundle F section + 'Audit endgame' summary
|
||
|
|
7990b6fab7 |
Bundle D: Documentation & transparency sweep — 8 findings closed
Closes H-009 + L-001 + L-007 + L-008 + L-016 + L-017 + L-018 + M-027
from comprehensive-audit-2026-04-25.
H-009 — README JWT verified-already-clean
README has zero JWT mentions at audit time. docs/architecture.md
correctly documents JWT/OIDC integration via authenticating-gateway
pattern (line 905-912).
.github/workflows/ci.yml: new step
'Forbidden README JWT advertising regression guard (H-009)'
greps README for JWT-as-supported phrasing; passes verbatim
(gateway / pre-G-1) but fails build on net-new advertising.
L-001 (CWE-295) — InsecureSkipVerify per-site justification
Audit count was 8; recon found 13 production sites.
docs/tls.md: new 'InsecureSkipVerify justifications' table
enumerates each site by file:line with per-site rationale.
cmd/agent/verify.go:78, internal/tlsprobe/probe.go:54,
internal/service/network_scan.go:460: each previously-bare
InsecureSkipVerify: true now carries //nolint:gosec.
.github/workflows/ci.yml: new step
'Forbidden bare InsecureSkipVerify regression guard (L-001)'
fails build if any net-new ISV lands in non-test .go without
nolint:gosec on the same or preceding line.
L-007 — README dependency-audit commands
README.md: new Dependencies section with go list -m all | wc -l,
go mod why, govulncheck ./.... Honors operating-rules invariant.
L-008 — Release-time govulncheck gate
.github/workflows/release.yml: new 'Install govulncheck' +
'Run govulncheck (release gate)' steps in the matrix job.
Pinned to same install path as ci.yml. Default exit code
semantics (fail on called-vuln only, deferred-call advisories
tracked on master via L-021) keeps the gate appropriate.
L-016 — architecture.md drift fixes
docs/architecture.md: system-components diagram's '21 tables'
annotation removed (current 23; replaced with TEXT-keys
descriptor); connector-architecture '9 connectors' prose
replaced with grep ref + current 12-issuer list (added
Entrust/GlobalSign/EJBCA which were missing); API-design
'97 operations / 107 total' replaced with grep commands.
Connector subgraphs verified-current at 12/13/6.
L-017 — workspace CLAUDE.md verified-already-clean
Bundle B's pre-commit-gate refactor already converted current-
state numeric claims to grep commands. Phase 0 recon confirmed
zero remaining hardcoded counts.
L-018 — Defect age table
cowork/comprehensive-audit-2026-04-25/defect-age.md (NEW):
Tabulates all 9 High findings with first-mentioned commit,
closing bundle, days-open. Methodology snippet for re-running.
Key finding: 8 of 9 closed within 24h of audit publication.
M-027 — OpenAPI parity verified-already-clean
Audit's 'router 121 vs OpenAPI 125 — 4-op gap' was wrong
methodology. The 4-op 'gap' was exactly the 4 routes registered
via r.mux.Handle (auth-exempt allowlist) instead of r.Register.
When you count both dispatch shapes the totals match exactly.
internal/api/router/openapi_parity_test.go (NEW):
TestRouter_OpenAPIParity AST-walks router.go for both
Register and mux.Handle calls + walks api/openapi.yaml's
path/method nesting + asserts the sets match. Adding a route
without updating the spec fails CI permanently.
Audit deliverables:
audit-report.md: score 38/55 -> 46/55 closed
(High 7/9 -> 8/9; Medium 20/27 -> 21/27; Low 8/19 -> 14/19)
findings.yaml: 8 status flips open -> closed
defect-age.md: new file
certctl/CHANGELOG.md: Bundle D section
Verification:
TestRouter_OpenAPIParity PASS
L-001 grep guard self-test (after //nolint:gosec adds) PASS
H-009 grep guard self-test PASS
go test -count=1 -short on changed packages green
|
||
|
|
345bafe5aa |
Bundle C: Renewal/reliability cluster — 7 findings closed
Closes M-006 + M-007 + M-008 + M-015 + M-016 + M-019 + M-020 from
comprehensive-audit-2026-04-25. M-028 was already closed by the
Bundle B CI follow-up.
M-006 (CWE-913) — Idempotent migration 000014
migrations/000014_policy_violation_severity_check.up.sql:
Prepended ALTER TABLE ... DROP CONSTRAINT IF EXISTS before the
ADD. Mirrors the down migration's existing IF EXISTS shape and
the M-7 idempotent-index idiom. Re-runs against partially-applied
DBs now succeed.
M-007 — Bulk-op partial-failure tests (3 new)
internal/api/handler/bulk_partial_failure_test.go:
TestBulkRevoke_PartialFailure_ReportsBoth
TestBulkRenew_PartialFailure_ReportsBoth
TestBulkReassign_PartialFailure_ReportsBoth
Each asserts HTTP 200 + both success/failure counters round-trip
+ per-cert errors[] preserved with non-empty messages so operators
can correlate each failure to its certificate ID.
M-008 — Admin-gated handler enumeration pin (verified-already-clean)
Recon: only one admin-gated handler — bulk_revocation.go — with
full 3-branch test triplet already in place. health.go calls
IsAdmin informationally to surface the flag to the GUI without
gating.
internal/api/handler/m008_admin_gate_test.go:
Walks every handler .go file, asserts every middleware.IsAdmin
call site is in AdminGatedHandlers (with required test triplet)
or InformationalIsAdminCallers (justified). Adding a new admin
gate without updating both the constant AND adding the test
triplet fails CI.
M-015 — Single-profile cardinality pin (verified-already-clean)
Audit claim 'no cardinality validation' was wrong — enforced at
struct level. domain.ManagedCertificate.{CertificateProfileID,
RenewalPolicyID,IssuerID,OwnerID} and RenewalPolicy.
CertificateProfileID are bare strings, not slices.
internal/domain/m015_cardinality_test.go:
reflect-based pin on kind=String. Schema change to N:N would
have to update renewal.go's lookup loop in the same commit.
M-016 (CWE-754) — Reap stale-agent jobs
internal/repository/postgres/job.go::ListJobsWithOfflineAgents:
JOIN jobs to agents on agent_id, filter (status=Running AND
a.last_heartbeat_at < cutoff), exclude server-keygen jobs.
internal/service/job.go::ReapJobsWithOfflineAgents:
Flips matched jobs to Failed reason agent_offline so I-001
retry loop re-queues them on a healthy agent. Records audit
event per reap.
internal/scheduler/scheduler.go:
Scheduler.runJobTimeout cycle now calls both reaper arms.
agentOfflineJobTTL default 5min (5x agent-health-check default);
SetAgentOfflineJobTTL knob for operator override.
internal/service/job_offline_agent_reaper_test.go: 6 unit tests
cover happy path, server-keygen-skip, non-Running-skip, non-
positive-TTL fail-loud, repo-error propagation, audit-event
recording.
M-019 — Configurable ARI HTTP timeout
Audit claim 'no fallback timeout' was wrong — ari.go:52 already
had a 15s timeout. Bundle C makes it configurable.
internal/connector/issuer/acme/acme.go:
Config.ARIHTTPTimeoutSeconds field with env path
CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS.
internal/connector/issuer/acme/ari.go:
Both HTTP clients (GetRenewalInfo + getARIEndpoint) now use the
new ariHTTPTimeout() helper. Zero / negative / nil-config all
fall back to the historic 15s default.
ari_timeout_test.go: 4 dispatch arm tests.
M-020 (CWE-770) — OCSP DoS hardening
Pre-bundle the noAuthHandler chain had no rate limit. An attacker
could DoS the OCSP responder, which for fail-open relying parties
is a revocation bypass.
cmd/server/main.go:
noAuthHandler refactored from fixed middleware.Chain(...) to a
conditional slice that appends middleware.NewRateLimiter when
cfg.RateLimit.Enabled. Per-IP keying applies; OCSP/CRL/EST/SCEP
are unauth.
docs/security.md (NEW):
Operator runbook documenting Must-Staple TLS Feature extension
RFC 7633 as the architectural fix for fail-open relying parties.
Profile-flip guidance + nginx/Apache/HAProxy/Envoy stapling
snippets + explicit scope statement on what the rate limiter
alone does NOT solve.
Audit deliverables:
cowork/comprehensive-audit-2026-04-25/audit-report.md: score
31/55 -> 38/55 closed (Medium 13/27 -> 20/27).
cowork/comprehensive-audit-2026-04-25/findings.yaml: 7 status
flips open -> closed with closure notes citing the Bundle C
mechanism.
certctl/CHANGELOG.md: Bundle C section under [unreleased].
Verification:
go vet ./internal/service ./internal/scheduler ./internal/connector/issuer/acme
./internal/api/handler ./internal/domain ./cmd/server clean
go test -count=1 -short on the same packages all green
helm template + helm lint clean
internal/repository/postgres setup-fail sandbox disk
pressure (same on master HEAD before this branch)
|
||
|
|
2933395730 |
Bundle B CI follow-up: G-3 env-var docs + M-028 closure (final 5 SA1019 sites)
Two CI failures on master after Bundle B merge:
1. Frontend Build / G-3 env-var docs guardrail
Bundle B introduced CERTCTL_RATE_LIMIT_PER_USER_RPS and
CERTCTL_RATE_LIMIT_PER_USER_BURST without adding them to
docs/features.md. The guardrail step that scans Go source for
getEnv* calls and asserts each appears in a doc page failed.
Fix: docs/features.md rate-limit section extended with both new
env vars + a paragraph explaining the per-key keying contract
from M-025.
2. Go Build & Test / staticcheck SA1019 hits (6 errors)
The CI workflow runs staticcheck without continue-on-error. Bundle
7 opened M-028 to track 6 deprecated-API sites; Bundle 9 closed 1
of them (the elliptic.Marshal in local.go) but kept a deliberate
regression-oracle reference in bundle9_coverage_test.go protected
only by golangci-lint's //nolint comment — staticcheck-as-CLI does
not honor that, only its native //lint:ignore directive.
Closure of remaining 5 sites:
cmd/server/main_test.go:47, 163, 192, 465 — 4 × middleware.NewAuth
migrated to middleware.NewAuthWithNamedKeys with explicit
NamedAPIKey entries. The auth=none case at line 465 maps to a
nil NamedAPIKey slice (no-op pass-through, matches the
NewAuthWithNamedKeys contract for empty input). Audit count was
3; recon found a 4th at line 465 that was missed.
internal/api/handler/scep.go:266 — csr.Attributes is a real RFC
2985 §5.4.1 challengePassword carve-out. Go's stdlib deprecation
note explicitly applies only to OID 1.2.840.113549.1.9.14
(requestedExtensions), NOT to OID 1.2.840.113549.1.9.7
(challengePassword), for which there is no non-deprecated
stdlib API. Suppressed with native //lint:ignore SA1019 +
comment block citing the RFC.
internal/connector/issuer/local/bundle9_coverage_test.go:342 —
deliberate regression-oracle that calls elliptic.Marshal to
prove the new crypto/ecdh path is byte-identical. Comment
converted from //nolint:staticcheck to native //lint:ignore
SA1019 so staticcheck-as-CLI honors the suppression.
Audit deliverables:
cowork/comprehensive-audit-2026-04-25/audit-report.md: M-028 box
flipped [x]; score 30/55 -> 31/55 (Medium 12/27 -> 13/27).
cowork/comprehensive-audit-2026-04-25/findings.yaml: M-028 status
partial_closed -> closed with closure note.
Verification:
go test -count=1 -short ./cmd/server ./internal/api/handler
./internal/connector/issuer/local ./internal/api/middleware
./internal/config — all green.
staticcheck on each changed package — 0 SA1019 hits.
Bundle C had M-028 in scope; this CI-fix lift moves it forward so
master CI goes green immediately. Bundle C scope adjusts to remove
M-028 and focuses on M-006 / M-015 / M-016 / M-019 / M-020 plus the
M-007 / M-008 coverage gaps.
|
||
|
|
4f3c8136e5 | Update LICENSE metadata | ||
|
|
e8f5ecf3c9 |
Bundle B: Auth & transport surface tightening — 5 findings closed
Closes M-001 + M-002 + M-013 + M-018 + M-025 from
comprehensive-audit-2026-04-25.
M-001 (CWE-916) — PBKDF2 100k -> 600k via v3 blob format
internal/crypto/encryption.go:
- New v3Magic (0x03), pbkdf2IterationsV3 (600,000 — OWASP 2024
Password Storage Cheat Sheet floor), v3SaltSize (16 bytes),
deriveKeyWithSaltV3 helper.
- EncryptIfKeySet now unconditionally writes v3:
magic(0x03) || salt(16) || nonce(12) || ciphertext+tag
- DecryptIfKeySet falls through v3 -> v2 -> v1 with AEAD verification
at each step. Wrong-passphrase v3 reads cannot be silently
misattributed to v2/v1.
- IsLegacyFormat updated to recognize 0x03 as non-legacy.
internal/crypto/encryption_v3_test.go (NEW, 7 tests):
V3 round-trip / V2 read-fallback against deterministic v2 fixture /
V3 wrong-passphrase fails / V3-vs-V2 dispatch order / V2 vs V3 keys
differ for same (passphrase, salt) / iteration-count pin at OWASP
2024 floor / IsLegacyFormat-recognises-V3.
Coverage internal/crypto: 86.7% -> 88.2%.
M-002 (CWE-862) — Auth-exempt allowlist constants + AST regression test
Recon found auth-exempt surface spans TWO layers (audit's claim was
incomplete):
Layer 1 (router.go direct r.mux.Handle):
GET /health, GET /ready, GET /api/v1/auth/info, GET /api/v1/version
Layer 2 (cmd/server/main.go::buildFinalHandler URL-prefix dispatch):
/.well-known/pki/*, /.well-known/est/*, /scep[/...]*
internal/api/router/router.go:
- New AuthExemptRouterRoutes constant with per-entry justifications.
- New AuthExemptDispatchPrefixes constant.
internal/api/router/auth_exempt_test.go (NEW, 2 tests):
AST-walks router.go for every direct mux.Handle call and asserts
set equals AuthExemptRouterRoutes; reads source bytes of Register /
RegisterFunc and asserts they still wrap with middleware.Chain.
cmd/server/auth_exempt_test.go (NEW, 2 tests):
14-case table test on buildFinalHandler asserting documented
prefixes route to noAuthHandler and authenticated routes route to
apiHandler; inverse-overlap pin proves no documented bypass shadows
an authenticated prefix.
M-013 (CWE-942) — CORS deny-by-default verified-already-clean + pin
Audit claim 'default allows all origins if env-var unset' was WRONG.
internal/api/middleware/middleware.go::NewCORS already denies cross-
origin requests when len(cfg.AllowedOrigins) == 0 (no
Access-Control-Allow-Origin header is emitted, same-origin policy
applies).
internal/api/middleware/cors_test.go: +TestNewCORS_NilOriginsDeniesAll
+ TestNewCORS_M013_ContractDocumentedInOrder (5-case table test
pinning the 3-arm dispatch contract).
M-018 (CWE-319 / PCI-DSS Req 4) — Postgres TLS opt-in toggle
deploy/helm/certctl/values.yaml: new postgresql.tls.{mode,caSecretRef}
operator-facing knobs. Default 'disable' preserves in-cluster pod-
network behavior; PCI-scoped operators set verify-full.
deploy/helm/certctl/templates/_helpers.tpl: certctl.databaseURL helper
pipes postgresql.tls.mode into ?sslmode=.
deploy/helm/certctl/templates/server-secret.yaml: uses the helper
instead of hardcoded sslmode=disable.
deploy/docker-compose.yml: CERTCTL_DATABASE_URL is now
${CERTCTL_DATABASE_URL:-...} so operators override without editing.
docs/database-tls.md (NEW): operator runbook covering 4 deployment
shapes, RDS verify-full example with PGSSLROOTCERT mount, and
pg_stat_ssl verification query.
helm template + helm lint clean.
M-025 (OWASP ASVS L2 §11.2.1) — Per-key rate limiting
internal/api/middleware/middleware.go::NewRateLimiter rewritten from
a single global tokenBucket to a keyedRateLimiter map keyed on
'user:'+GetUser(ctx) for authenticated callers
'ip:'+RemoteAddr-host for unauthenticated
- Empty UserKey strings treated as unauthenticated.
- X-Forwarded-For intentionally NOT consulted (header-spoofing risk).
- Create-on-demand bucket allocation under sync.RWMutex with double-
check pattern.
RateLimitConfig.PerUserRPS / PerUserBurstSize fields with env vars
CERTCTL_RATE_LIMIT_PER_USER_RPS / CERTCTL_RATE_LIMIT_PER_USER_BURST
allow per-user budgets distinct from per-IP.
internal/api/middleware/ratelimit_keyed_test.go (NEW, 5 tests):
TwoIPsHaveIndependentBuckets / SameUserDifferentIPsShareBucket /
TwoUsersHaveIndependentBuckets / PerUserBudgetOverride /
EmptyUserKeyTreatedAsAnonymous.
Coverage internal/api/middleware: 82.1% -> 83.7%.
Audit deliverables:
cowork/comprehensive-audit-2026-04-25/audit-report.md: score
25/55 -> 30/55 closed (High 7/9, Medium 7/27 -> 12/27, Low 8/19).
cowork/comprehensive-audit-2026-04-25/findings.yaml: 5 status flips
open -> closed with closure notes citing the Bundle B mechanism.
certctl/CHANGELOG.md: Bundle B section under [unreleased].
Verification:
go test -count=1 -short ./... all green
staticcheck on changed packages no new SA*/ST* hits
(the 4 pre-existing SA1019 sites in cmd/server/main_test.go are
Bundle 9 / M-028 partial closure leftovers tracked in Bundle C)
helm template + helm lint clean
internal/repository/postgres setup-fail sandbox disk pressure,
same on master HEAD before this branch — environmental, not Bundle B
|
||
|
|
90f0cab204 |
fix(bundle-6): Audit Integrity + Privacy — 3 audit findings closed
Closes Audit-2026-04-25 H-008 (High), M-017 (Medium), M-022 (Medium).
Hardens audit-trail tamper-resistance + minimizes PII leakage in one
cohesive change, with both controls applying automatically and no
operator action required at install time.
What changed
- internal/service/audit_redact.go (NEW) — RedactDetailsForAudit:
* credentialKeys deny-list (api_key, password, *_pem, eab_secret, ...)
* piiKeys deny-list (email, phone, ssn, name, address, ip_address, ...)
* case-insensitive key match; recurses into nested maps + arrays
* mutation-free; surfaces redacted_keys array for operator visibility
* nil/empty input → nil out (preserves pre-Bundle-6 behaviour)
- internal/service/audit.go — RecordEvent now routes details through
RedactDetailsForAudit BEFORE marshaling. No call-site changes required.
- internal/service/audit_redact_test.go (NEW) — full coverage:
* credential keys (~30 entries)
* PII keys (~20 entries)
* nested maps + arrays
* case-insensitivity
* mutation-free invariant
* JSON round-trip (catches type-assertion regressions)
* scalar pass-through (no panic on int/bool/nil)
- migrations/000018_audit_events_worm.up.sql (NEW) — DB-level WORM:
* BEFORE UPDATE OR DELETE trigger raises check_violation with
diagnostic citing the rationale + compliance-superuser hint
* REVOKE UPDATE,DELETE ON audit_events FROM certctl (defence-in-depth)
* REVOKE wrapped in pg_roles existence check so test fixtures
without the certctl role stay idempotent
- migrations/000018_audit_events_worm.down.sql (NEW) — clean teardown
for dev resets; not for production use.
- internal/repository/postgres/audit_worm_test.go (NEW, testcontainers,
-short gated) — INSERT succeeds; UPDATE + DELETE fail with
check_violation; second INSERT after blocked modification still
succeeds (no trigger-state corruption).
- docs/compliance.md — new section "Audit-Trail Integrity & Privacy
(Bundle 6)" with verification psql snippet, compliance-superuser
pattern (NOT auto-created), redactor before/after example, and a
maintenance note for adding new credential keys.
Compliance mapping
- H-008 (CWE-532 Insertion of Sensitive Information into Log File)
- M-017 (HIPAA Technical Safeguards §164.312(b) — audit controls)
- M-022 (GDPR Art. 32 — data minimization)
Threat model: TB-3 (audit log tampering), TB-1 (operator/orchestrator).
Verification
- go vet ./... → clean
- go build ./... → clean
- go test -short -count=1 ./... → all packages pass
- go test -count=1 -run TestRedactDetailsForAudit ./internal/service/...
→ all pass
- (testcontainers, gated by -short) audit_worm_test.go pins WORM contract
- npx tsc --noEmit (web) → clean (no frontend changes)
- python3 yaml.safe_load(api/openapi.yaml) → 89 paths
Backward compatibility
- Trigger applies forward only — existing rows unchanged.
- nil/empty details from RecordEvent callers → nil out (preserves prior
behaviour for the many existing call sites that pass nil).
- Compliance superusers (provisioned out-of-band) bypass the trigger.
Bundle 6 of the 2026-04-25 comprehensive audit.
|
||
|
|
17ef377edb |
fix(bundle-5): CI green-up — drop unused sync.Once + document new env vars
Two CI gate failures from the Bundle 5 push:
1. golangci-lint (unused) — agent_bootstrap.go declared
`var bootstrapWarnOnce sync.Once` but never called .Do(). The
one-shot WARN actually lives in cmd/server/main.go (per-process at
startup, not per-request) so the handler-side variable was dead code.
Dropped the var + sync import; left a comment explaining where the
WARN lives.
2. G-3 env-var docs guardrail — Bundle 5 added two new env vars
(CERTCTL_AGENT_BOOTSTRAP_TOKEN, CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS)
but the G-3 closure CI step asserts every CERTCTL_* env defined in
internal/config/config.go is mentioned in docs/features.md. Added
three new sub-sections to docs/features.md after the Body Size
Limits block:
* Agent Bootstrap Token (H-007 contract + generation guidance)
* Graceful Shutdown Audit Flush (M-011 timeout knob)
* Liveness vs Readiness Probes (H-006 /health vs /ready table)
No production behaviour change; pure CI-gate fix.
Verification
- go vet ./internal/api/handler/... → clean
- go test -count=1 -run 'TestVerifyBootstrapToken|TestRegisterAgent_BootstrapToken' ./internal/api/handler/... → all pass
- grep CERTCTL_AGENT_BOOTSTRAP_TOKEN docs/features.md → present
- grep CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS docs/features.md → present
|
||
|
|
7dd68ca409 |
docs(features): document CERTCTL_SHORT_LIVED_EXPIRY_CHECK_INTERVAL (G-3 fix)
CI on the S-2 merge ( |
||
|
|
ffc982bfbc |
chore(cleanup,docs): vite proxy + dead scheduler setter wired + registerAgent/CLI docs (C-1 master)
Closes six 2026-04-24 audit findings (3 P2 + 3 P3) — a cleanup-and-doc
tail bundle that drains the smallest remaining leaves of the audit:
- cat-u-vite_dev_proxy_plaintext_drift (P2): web/vite.config.ts
proxied dev requests to http://localhost:8443 against an HTTPS-only
backend (HTTPS-only since v2.0.47). Every dev-server API call 502'd.
Fix: targets are now object-form `{target: 'https://...', secure: false,
changeOrigin: true}` — the dev cert is self-signed by the
deploy/test bootstrap and changes per-checkout.
- cat-g-7e38f9708e20 (P3): Scheduler.SetShortLivedExpiryCheckInterval
was defined + tested but never called from cmd/server/main.go.
Operators tuning CERTCTL_SHORT_LIVED_EXPIRY_CHECK_INTERVAL got
no effect — the 30s default in scheduler.NewScheduler was
effectively hardcoded. Fix: added Config.Scheduler.ShortLivedExpiryCheckInterval
+ getEnvDuration in Load() reading the env var with a 30s default,
+ sched.SetShortLivedExpiryCheckInterval(...) call in main.go
alongside the other scheduler-interval setters.
- diff-10xmain-2bf4a0a60388 (P3): same root cause as cat-g-7e38f9708e20;
closes as ride-along.
- cat-b-6177f36636fb (P2): registerAgent client fn orphan. By-design
per pull-only deployment model. Fix (audit recommendation:
"document"): added a closure docblock above the export in
client.ts + a new "Registration is by-design pull-only" paragraph
in docs/architecture.md::Agents section explaining when/why a
future GUI-driven enrollment feature might reach the endpoint
(proxy-agent topologies for network appliances).
- cat-i-7c8b28936e3d (P2): CLI scope intentionally narrow but
undocumented. Fix: new "Scope (intentionally narrow)" subsection
in docs/features.md::CLI capturing the SSH-into-prod / day-to-day
GUI / AI-automation MCP three-way split.
Verification:
- go build ./... — clean
- go vet ./... — clean
- go test ./internal/scheduler/... ./internal/config/... — pass
- golangci-lint v2.11.4 run ./... — 0 issues
- tsc --noEmit (frontend) — clean
- All sibling guardrails (S-1 / G-3 / D-1+D-2 / B-1 / L-1 / H-1) still pass
Audit findings closed:
- cat-u-vite_dev_proxy_plaintext_drift (P2)
- cat-g-7e38f9708e20 (P3)
- diff-10xmain-2bf4a0a60388 (P3)
- cat-b-6177f36636fb (P2)
- cat-i-7c8b28936e3d (P2)
- (audit-bookkeeping ride-along: ensures every closed-bundle row has a non-empty merge SHA)
Deferred follow-ups: none from this bundle. The remaining audit
backlog (frontend test campaign, F-1 CertificatesPage UX, P-1
orphan-fn sweep, S-2 handler error-mapping refactor) is sibling
sub-bundles in this mega-prompt.
|
||
|
|
5f050a0733 |
docs(features): reconcile env-var inventory with config.go (G-3 master)
Closes three 2026-04-24 audit findings (all P2, all category cat-g):
- cat-g-renewal_check_interval_rename_drift: features.md:152
advertised CERTCTL_RENEWAL_CHECK_INTERVAL but config.go renamed
that to CERTCTL_SCHEDULER_RENEWAL_CHECK_INTERVAL. Fixed in prose
+ the scheduler-loops table on line 1117.
- cat-g-b8f8f8796159: 6 env vars in config.go that were never
documented:
CERTCTL_DATABASE_MIGRATIONS_PATH
CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT
CERTCTL_JOB_AWAITING_CSR_TIMEOUT
CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL
CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL
CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL
Added to the scheduler-loops table at features.md:1117 and
(DATABASE_MIGRATIONS_PATH) to the new Database Schema preamble.
- cat-g-163dae19bc59: 37 env vars in docs not defined in config.go.
The audit's strict comm over-flagged this set: most "phantoms"
are integration-surface contracts (script env vars certctl
EXPORTS to user-provided ACME DNS-01 / OpenSSL CA scripts;
StepCA / Webhook per-issuer-or-notifier config-blob field
names; CERTCTL_QA_* test fixtures; agent-side env vars defined
in cmd/agent/main.go). The closure narrows the gate to the
one true phantom (the rename) and allowlists the documented
integration contracts in the CI guard. Each allowlist entry
has a one-line justification.
CI regression guardrail:
- .github/workflows/ci.yml::"Forbidden env-var docs drift regression
guard (G-3)" — runs `comm -23` both ways between the env vars
defined in Go source (config.go + cmd/* + ACME DNS export +
test fixtures) and env vars mentioned in README + docs/ +
deploy/helm/. Fails the build if either set is non-empty modulo
the documented integration-surface allowlist.
Verification:
- comm -23 docs vs defined → empty post-fix (allowlist applied)
- comm -23 defined vs docs → empty post-fix
- golangci-lint v2.11.4 run ./... → 0 issues
- tsc --noEmit → clean
- S-1 stale-counts guardrail still passes
Audit findings closed:
- cat-g-163dae19bc59 (P2, docs-only env vars)
- cat-g-b8f8f8796159 (P2, config-only env vars)
- cat-g-renewal_check_interval_rename_drift (P2, renamed env var still in docs)
Deferred follow-ups:
- The 26 documented-but-unimplemented integration contracts on the
allowlist (CERTCTL_OPENSSL_*, CERTCTL_ACME_EAB_*, CERTCTL_WEBHOOK_*,
CERTCTL_AUDIT_EXCLUDE_PATHS, CERTCTL_TLS_*, CERTCTL_ACME_DNS_PROPAGATION_WAIT)
are documented in features.md / connectors.md / demo-advanced.md but
not yet read by any Go source. Either implement in config.go (each is
its own M-X) or delete from docs (separate cleanup PR). Neither
expansion fits inside G-3's "reconcile drift" scope.
|
||
|
|
e11a228a26 |
docs(README,features,examples): replace stale source counts with rebuild commands (S-1 master)
Closes two 2026-04-24 audit findings — one P1 (cat-s1-9ce1cbe26876,
README + features.md cite stale numeric counts) and one P2
(cat-s1-features_md_issuer_count_contradiction, features.md self-
disagreed on issuer count saying 9 in two places + 12 in two others).
Both root in a CLAUDE.md invariant: "Numeric claims about current
state rot the instant the next release lands... Before adding any
current-state count, delete it and write the command instead."
Per-site changes:
- docs/features.md::"At a Glance" table — replaced 12 hardcoded counts
with `rebuild via <command>` references quoting the canonical
source-of-truth grep from CLAUDE.md::"Current-state commands".
- docs/features.md::Issuer Connectors section — dropped "9 issuer
connectors" (stale; live: 12) and "12 IssuerType constants" prose;
prose now references the rebuild command.
- docs/features.md::Target Connectors section — same treatment for
"14 target connector types".
- docs/features.md::"Per-type config schema validation for all 9
issuer types" — same treatment.
- docs/features.md::"80 MCP tools covering all API endpoints" — same.
- docs/features.md::Web Dashboard section — dropped "24 pages wired"
+ the "(25 Route elements, 24 pages)" comment.
- docs/examples.md::"Beyond These Examples" — dropped "7 issuer
backends and 10 target connectors" prose; references features.md
and the rebuild commands.
CI regression guardrail:
- .github/workflows/ci.yml::"Forbidden hardcoded source-count prose
regression guard (S-1)" — grep-fails the build if any of the
blocked phrases (e.g. "9 issuer connectors", "21 database tables",
"80 MCP tools") reappears in README or docs/. Allowlists demo-
fixture prose ("32 certificates" — seed_demo.sql facts), historical
WORKSPACE-CHANGELOG counts, the testing-guide example phrasing,
and any number adjacent to a quoted rebuild command.
Verification:
- S-1 guardrail dry-run on post-fix tree → empty (good)
- golangci-lint v2.11.4 run ./... → 0 issues
- tsc --noEmit → clean
- vitest, vite build unchanged from pre-S-1 baseline (no JS/TS touched)
Audit findings closed:
- cat-s1-9ce1cbe26876 (P1, README + features.md stale numeric counts)
- cat-s1-features_md_issuer_count_contradiction (P2, features.md
self-contradiction on issuer count)
Deferred follow-ups:
- WORKSPACE-CHANGELOG.md historical-milestone counts intentionally
preserved (those are point-in-time facts about shipped slices, not
current-state claims). README demo-fixture counts ("32 certs, 10
issuers") preserved — those describe the seed_demo.sql shape, not
the live source surface.
|
||
|
|
3f5c2344ab |
fix(web,ci): close TS↔Go type drift across 5 entities (D-2 master)
Closes five 2026-04-24 audit findings (all P2, all category cat-f /
diff-05x06-*) by reconciling the TypeScript interfaces in
web/src/api/types.ts with the on-wire JSON shape Go's
internal/domain/*.go structs actually emit. D-1 closed the same pattern
for one entity (Certificate / ManagedCertificate); D-2 covers the
remaining five.
Per-entity verdicts (audit's "stricter side is the contract"):
Agent — TRIM 5 phantoms (last_heartbeat, capabilities, tags,
created_at, updated_at). Go emits last_heartbeat_at only.
Target — ADD 2 (retired_at?, retired_reason?) — I-004 fields.
DiscCert — ADD pem_data? — real field, real Go emit, omitempty.
Issuer — TRIM phantom status. Go has Enabled bool only.
Notif — TRIM phantom subject. Go has Message string only.
Certificate — verify-only; D-1 closure confirmed clean at recon.
Consumer fixes (same commit as the trim):
- AgentDetailPage.tsx — remove dead Capabilities + Tags sections (always
rendered empty); replace agent.created_at/updated_at row with the
Go-emitted registered_at; widen heartbeatStatus() to accept undefined.
- AgentsPage.tsx — same heartbeatStatus widening.
- IssuersPage.tsx + IssuerDetailPage.tsx — issuerStatus() now derives
from `enabled` exclusively; the dead `issuer.status || 'Unknown'`
fallback is gone.
- NotificationsPage.tsx — drop dead `|| n.subject` fallback.
- NotificationsPage.test.tsx — drop dead `subject:` from mocks.
- api/utils.ts::timeAgo widened to accept string | undefined | null.
- api/types.test.ts — Agent (I-004) fixture trimmed of the 5 phantoms.
Tests (Vitest):
- 5 new describe blocks in web/src/api/types.test.ts:
- Agent interface (D-2 phantom-fields trim) — 2 it blocks
- Target interface (D-2 retirement fields) — 2 it blocks
- DiscoveredCertificate interface (D-2 pem_data ADD) — 2 it blocks
- Issuer interface (D-2 status phantom trim) — 1 it block
- Notification interface (D-2 subject phantom trim) — 1 it block
- Each block uses the literal-construction pattern from D-1; trimmed
fields are pinned via excess-property comments that compile-fail when
uncommented if a phantom is reintroduced.
CI regression guardrail:
- .github/workflows/ci.yml — existing D-1 step renamed to "Forbidden
StatusBadge dead-key + TS phantom-field regression guard (D-1 + D-2)".
Three new awk-windowed greps over Agent / Issuer / Notification
interfaces in types.ts. The Agent grep includes a `grep -v
'last_heartbeat_at'` filter to avoid false positives on the
legitimate Go-emitted heartbeat field.
Documentation:
- CHANGELOG.md — new D-2 section above B-1 under [unreleased] with full
Added/Removed/Audit findings closed/Known follow-ups breakdown.
- docs/architecture.md — Web Dashboard section gains a new "TS ↔ Go
type contract rule (D-1 + D-2 closure)" paragraph capturing the
stricter-side-wins rule and the CI guardrail it's anchored by.
- coverage-gap-audit-2026-04-24-v5/unified-audit.md — Live Tracker score
20/47 → 25/47 (P2: 6/27 → 11/27). Per-finding ✅ RESOLVED Status
blocks added to all 5 diff-05x06-* entries plus the verify-only
Certificate entry. Closed-bundle index gets D-2 row.
Verification (all gates green):
- cd web && tsc --noEmit → clean
- cd web && vitest run --reporter=dot → 9 files, 302 tests passing
(was 294 → +8 D-2 cases)
- cd web && vite build → clean
- go vet ./internal/... ./cmd/... → clean (no Go touched)
- golangci-lint v2.11.4 run ./... → 0 issues
- D-2 Agent guardrail dry-run → empty (good)
- D-2 Issuer guardrail dry-run → empty (good)
- D-2 Notification guardrail dry-run → empty (good)
- D-2 Target ADD-shape sanity → 2 retirement fields present
- D-2 DiscCert ADD-shape sanity → pem_data present
- D-1 Certificate guardrail still clean → empty (good)
- OpenAPI YAML parses → 89 paths
Audit findings closed:
- diff-05x06-7cdf4e78ae24 (P2, Agent TS↔Go drift)
- diff-05x06-2044a46f4dd0 (P2, Target TS↔DeploymentTarget Go drift)
- diff-05x06-85ab6b98a2f7 (P2, DiscoveredCertificate TS↔Go drift)
- diff-05x06-97fab8783a5c (P2, Issuer TS↔Go drift)
- diff-05x06-caba9eb3620e (P2, Notification TS↔NotificationEvent drift)
- diff-05x06-af18a8d7ef41 (P2) — verified clean since D-1; no edit
Deferred follow-ups:
- Issuer richer status view (enabled × test_status) — UX scope, not drift.
- Real Agent metadata (capabilities, tags) — backend feature, not drift.
- DiscoveredCertificate pem_data list-response perf — separate backend change.
|
||
|
|
29533777fb |
fix(web,ci): close orphan-CRUD GUI gaps + dead exportCertificatePEM (B-1 master)
Closes four 2026-04-24 audit findings via per-page Edit modals on five
existing pages, a brand-new RenewalPoliciesPage for the rp-* CRUD surface,
and removal of one dead duplicate so the public client surface stops
growing without consumers. Anchored by a CI grep guardrail that fails
the build if any of the eight previously-orphan client functions loses
its non-test page consumer or if exportCertificatePEM is resurrected.
Per-page Edit modals (mirroring existing CreateXModal scaffolding):
- web/src/pages/OwnersPage.tsx — EditOwnerModal (name/email/team_id)
- web/src/pages/TeamsPage.tsx — EditTeamModal (name/description)
- web/src/pages/AgentGroupsPage.tsx — EditAgentGroupModal (full match-rule
set: name/description/match_os/match_architecture/match_ip_cidr/
match_version/enabled)
- web/src/pages/IssuersPage.tsx — EditIssuerModal (rename-only; type
locked, config blob preserved untouched, footer note about delete+
recreate for credential rotation)
- web/src/pages/ProfilesPage.tsx — EditProfileModal (rename + description
only; policy fields preserved untouched, footer note about deferred
policy editing)
New page (closes cat-b-4631ca092bee — RenewalPolicy CRUD orphan):
- web/src/pages/RenewalPoliciesPage.tsx — full CRUD page with shared
PolicyFormModal for Create + Edit (form shape identical), 7-column
DataTable (Policy/RenewalWindow/Auto/Retries/AlertThresholds/Created/
Actions), comma-separated alert_thresholds_days input parser, and
alert() surfacing of repository.ErrRenewalPolicyInUse (409) on Delete
so operators can re-target dependent certs before deletion.
- web/src/main.tsx — adds /renewal-policies route.
- web/src/components/Layout.tsx — adds sidebar nav item slotted between
Policies and Profiles.
Removed (closes cat-b-9b97ffb35ef7 — dead duplicate):
- web/src/api/client.ts::exportCertificatePEM — zero consumers across
web/, MCP, CLI, tests; downloadCertificatePEM is the actual call site
in CertificateDetailPage. Test references in client.test.ts and
client.error.test.ts also removed.
CI regression guardrail:
- .github/workflows/ci.yml — adds 'Forbidden orphan-CRUD client function
regression guard (B-1)' step. Greps for all eight previously-orphan
fns (updateOwner/updateTeam/updateAgentGroup/updateIssuer/updateProfile
+ createRenewalPolicy/updateRenewalPolicy/deleteRenewalPolicy) under
web/src/pages/ and fails the build if any has zero non-test consumers.
Also blocks resurrection of exportCertificatePEM. Verified locally
(all 8 fns have ≥2 consumers; exportCertificatePEM is gone) and
against synthetic regressions.
Documentation:
- CHANGELOG.md — new B-1 section above L-1 under [unreleased].
- docs/architecture.md — Web Dashboard section gains a new paragraph
capturing the 'every backend CRUD must have a GUI consumer' rule
with reference to the CI guardrail.
- coverage-gap-audit-2026-04-24-v5/unified-audit.md — flips four
findings to ✅ RESOLVED with detailed Status blocks; bumps Live
Tracker score 16/47 → 20/47 (P1: 9→12, P3: 1→2); adds B-1 row to
closed-bundle index.
Verification:
- cd web && tsc --noEmit — clean
- cd web && vitest run — 9 test files, 294 tests, all passing
- cd web && vite build — clean (no new warnings)
- B-1 guardrail dry-run — all 8 client fns have ≥2 page consumers,
exportCertificatePEM removed (good), FAIL=0
Audit findings closed:
- cat-b-31ceb6aaa9f1 (P1, updateOwner/updateTeam/updateAgentGroup orphan)
- cat-b-7a34f893a8f9 (P1, updateIssuer/updateProfile orphan, rename-only)
- cat-b-4631ca092bee (P1, RenewalPolicy CRUD orphan)
- cat-b-9b97ffb35ef7 (P3, exportCertificatePEM dead duplicate)
Deferred follow-ups:
- Fuller EditIssuerModal with credential-rotation flow (needs threat
model: rotation reuse window, in-flight CSR cancellation, audit-trail
granularity).
- Fuller EditProfileModal with policy-field editing (max-TTL, allowed
EKUs, allowed key algorithms — affect already-issued cert evaluation).
- Per-page Vitest coverage for the new Edit modals (CI grep guardrail
catches the same regression vector at lower cost).
|
||
|
|
df4b56798c |
fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master)
GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build); postgres reported unhealthy indefinitely, dependent containers never started. Root cause: deploy/docker-compose.yml mounted a hand-curated subset of migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/. Postgres applied them at initdb time. Once seed.sql referenced columns added by migrations *after* the mounted cutoff (e.g., policy_rules.severity from migration 000013), initdb crashed mid-seed and the container loop wedged. Two sources of truth (compose mount list vs in-tree migration ladder) diverged the moment a seed-touching migration shipped, and the only thing that fixed it was hand-editing the compose file every release. Fix: remove the dual source. Postgres boots empty; the server applies migrations + seed at startup via RunMigrations + RunSeed. Helm has used this pattern since day one (postgres-init emptyDir); compose now matches. Bundled with four ride-along audit findings whose fixes share the same schema/db code surface, so operators take the schema-change pain only once: cat-u-seed_initdb_schema_drift [P1, primary] — initdb-mount fix cat-o-retry_interval_unit_mismatch [P1] — column rename minutes→seconds cat-o-notification_created_at_dead_field [P2] — add column + populate cat-o-health_check_column_orphans [P1] — drop unwired columns cat-u-no_version_endpoint [P2] — add /api/v1/version Single migration (000017_db_coupling_cleanup) bundles the three schema changes under a DO \$\$ guard so re-application is safe; reduces operator-visible 'schema-change releases' from four to one. Backend - internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed (gated by CERTCTL_DEMO_SEED). Both idempotent (ON CONFLICT DO NOTHING in every shipped INSERT) so repeated boots are safe; missing-file is no-op so custom packaging that strips seeds still boots cleanly. - cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set) immediately after RunMigrations. - internal/repository/postgres/notification.go: NotificationRepository.Create now sets created_at (with time.Now() fallback when caller leaves it zero); scanNotification reads it back; List + ListRetryEligible SELECT extended. - internal/repository/postgres/renewal_policy.go: column references updated to retry_interval_seconds across SELECT/INSERT/UPDATE sites. - internal/api/handler/version.go: new VersionHandler exposes {version, commit, modified, build_time, go_version} from runtime/debug.ReadBuildInfo() with ldflags-supplied Version override. - internal/api/router/router.go: register GET /api/v1/version through the no-auth chain (CORS + ContentType) alongside /health, /ready, /api/v1/auth/info. - cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit ExcludePaths so rollout polling doesn't dominate the audit trail. - internal/config/config.go: add DatabaseConfig.DemoSeed + CERTCTL_DEMO_SEED env var. Migration - migrations/000017_db_coupling_cleanup.up.sql + .down.sql: (1) renewal_policies.retry_interval_minutes → retry_interval_seconds (DO \$\$ guard, idempotent re-application) (2) notification_events ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() (3) network_scan_targets DROP orphan health_check_enabled + health_check_interval_seconds - migrations/seed.sql: column reference updated to retry_interval_seconds. - migrations/seed_demo.sql: same column rename + applied at runtime now via RunDemoSeed (no longer initdb-mounted). Compose - deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files + seed.sql); add start_period: 30s to postgres + certctl-server healthchecks to absorb the runtime migration + seed application window on first boot. - deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount removed; that file never existed); same healthcheck start_period. - deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with CERTCTL_DEMO_SEED=true env var on certctl-server. Tests - internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo, TestVersion_RejectsNonGet, TestVersion_LdflagsOverride. - internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently, TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently, TestMigration000017_RetryIntervalRename, TestMigration000017_NotificationCreatedAt, TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips). - internal/repository/postgres/notification_test.go: TestNotificationRepository_CreatedAt_IsPersisted + TestNotificationRepository_CreatedAt_DefaultsToNow. CI guardrail - .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb (U-3)' step grep-fails the build if any migrations/*.sql or seed*.sql re-appears in /docker-entrypoint-initdb.d in any compose file. Catches future drift before a fresh-clone operator hits it. Spec / Docs - api/openapi.yaml: add /api/v1/version operation under Health tag. - docs/architecture.md: replace the 'initdb may run the same SQL' paragraph with a post-U-3 single-source-of-truth explanation. - CHANGELOG.md: full unreleased-section entry covering all 5 closures, breaking changes, and the new env var. Audit doc - coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14 cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to ✅ RESOLVED with closure prose pointing at this commit. Verification: build/vet/test -short -race all clean across all touched packages locally; govulncheck reports 0 vulnerabilities affecting our code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the post-fix tree. |
||
|
|
20e3977c63 |
fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)
Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`. The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone (`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS 1.3 pinned), so the probe failed on every interval and Docker marked the container `unhealthy` indefinitely. Operators inside docker- compose / Helm / the example stacks were unaffected — compose overrides the HEALTHCHECK with `--cacert + https://`, Helm uses explicit `httpGet` probes that ignore Docker's HEALTHCHECK, and every example compose file overrides with `curl -sfk https://localhost:8443/health`. But anyone running bare `docker run` / Docker Swarm / Nomad / ECS — exactly the "I just pulled the published image" path — saw permanent `unhealthy` status and (depending on orchestrator policy) a restart- loop. (Audit: cat-u-healthcheck_protocol_mismatch in coverage-gap-audit-2026-04-24-v5/unified-audit.md.) Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone gap, both bundled into this commit because they share the same root cause and the same operator surface: 1. Helm chart `server.readinessProbe.httpGet.path` pointed at `/readyz`, the kube-flavored convention. The certctl server doesn't register `/readyz` (only `/health` and `/ready` are wired and bypass the auth middleware — see internal/api/router/router.go:81 and cmd/server/main.go:920). K8s readiness probes therefore got 401 (api-key auth rejection) or 404 (when auth was disabled), pods stayed `NotReady` indefinitely, and Helm rollouts stalled. 2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all, so bare-`docker run` agents got zero health signal. The compose override at `deploy/docker-compose.yml:173` called `pgrep -f certctl-agent` against the agent image, but the agent image didn't ship `procps` — pgrep was missing too. The compose probe was a latent always-fail. We fixed all three with the audit-recommended shape (option (a) — `-k`) plus three structural backstops: Files changed: Phase 1 — Dockerfile fix: - Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/ health` to `curl -fsk https://localhost:8443/health`. `-k` (insecure) is acceptable because the probe is localhost-to-localhost: the same process serving the cert is being probed, no network hop. Pinning `--cacert` is not viable for the published image because the bootstrap cert is per-deploy (generated into the `certs` named volume on first up; operator-supplied via Helm's `existingSecret` or cert-manager). Long-form docblock cross-references the audit closure, the compose vs Helm vs examples coverage matrix, and the CI guardrail. - Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent` matching the compose pattern. Added `procps` to the runtime apk install — fixes both the new image-level HEALTHCHECK AND the pre-existing compose probe that was silently failing. Phase 2 — Helm readiness probe path: - deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path changed from `/readyz` to `/ready`. Liveness probe path (`/health`) was correct and is unchanged. Probes block now carries an explanatory comment naming the registered no-auth probe routes and the U-2 closure rationale. Phase 3 — Image-level integration tests: - deploy/test/healthcheck_test.go (new, //go:build integration): TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server image, inspects `Config.Healthcheck.Test` via `docker inspect`, and asserts the array contains `https://localhost:8443/health` and `-k`, and does NOT contain `http://localhost:8443/health` (positive + negative regression contracts). TestPublishedAgentImage_HealthcheckSpecExists builds the agent image and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`. Both tests `t.Skip` cleanly when docker isn't available (sandbox / CI without docker-in-docker) — verified locally: tests skip with the diagnostic and the suite returns PASS. TestPublishedServerImage_HealthcheckTransitionsToHealthy is a documented `t.Skip` placeholder until the harness wires a sidecar postgres for image-level smoke; the spec-level tests above cover the audit-flagged regression. Phase 4 — CI guardrail: - .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK regression guard (U-2)" step. Scoped patterns catch `HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health` in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md out of scope (the post-cutover invariant string at line 182 is intentionally a documented expected-failure assertion). Verified locally on the real tree (passes) and against synthetic regressions (each fires the guard). Phase 5 — Docs sweep: - docs/connectors.md: 15 stale curl examples updated from `http://localhost:8443/...` to `https://localhost:8443/...` with `--cacert "$CA"` injected on every site. Added a one-time introductory note documenting the `$CA` extraction with `docker compose ... exec ... cat /etc/certctl/tls/ca.crt`, matching the pattern in docs/quickstart.md. Pre-U-2 these examples silently failed against the HTTPS listener. Phase 6 — Release surface: - CHANGELOG.md: appended U-2 section to the existing [unreleased] block (immediately below the G-1 entry). Sections: explanatory blockquote covering all three bugs (primary + 2 adjacent), Fixed, Added, Changed. Verification (all gates pass): - go build ./... — clean - go vet ./... — clean - go vet -tags integration ./deploy/test/ — clean - go test -short ./... — every package green - go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ — three tests SKIP cleanly with "docker not available" diagnostic - helm lint deploy/helm/certctl/ — clean - helm template smoke render — succeeds; rendered Deployment carries `path: /ready` and zero `/readyz` matches - python3 yaml.safe_load on api/openapi.yaml — parses - govulncheck ./... — no vulnerabilities in our code - CI guardrail mirror: clean on real tree, fires on synthetic regression patterns Out of scope (intentionally untouched): - cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct, this finding does NOT propose adding back a plaintext listener. - deploy/docker-compose.yml:126 HEALTHCHECK — already correct. - deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct. - All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already correct (they ALSO use `-fsk https://localhost:8443/health`). - Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` + `path: /health`, correct. - docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health` invariant line — that's the expected-failure assertion for the post-cutover state ("plaintext is gone, expect Connection refused"); intentionally left intact. - Go production code — this is purely a deploy-image / probe / docs / Helm-chart fix. Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-u-healthcheck_protocol_mismatch Audit recommendation followed verbatim: 'change Dockerfile:80 to CMD curl -kf https://localhost:8443/health'. |
||
|
|
f9258e3ba6 |
fix(security,domain): redact Agent.APIKeyHash from JSON wire shape (G-2)
Pre-G-2 internal/domain/connector.go::Agent::APIKeyHash was tagged
`json:"api_key_hash"` and shipped on every wire surface that returned
domain.Agent — GET /api/v1/agents (PagedResponse{Data: agents}),
GET /api/v1/agents/{id}, GET /api/v1/agents/retired, and the
POST /api/v1/agents registration response. Every authenticated client
(browser, CLI --json, MCP tool calls) received the SHA-256-of-the-API-key
string. The browser silently dropped it because web/src/api/types.ts
omits the field, but CLI and MCP consumers print full JSON so the hash
was visible there. Even though the value is a hash and not the plaintext
key, shipping it gives an attacker an offline brute-force target if the
API-key entropy is low (certctl doesn't enforce a minimum on operator-
supplied keys), and there's no business reason for any client to ever
receive it — the value is server-internal, used only for the lookup at
internal/repository/postgres/agent.go::GetByAPIKey. (Audit:
cat-s5-apikey_leak in coverage-gap-audit-2026-04-24-v5/unified-audit.md.)
We chose the audit's recommended fix (json:"-") plus a defense-in-depth
MarshalJSON plus a CI guardrail. Three layers because struct-tag
redaction alone is one rebase away from being silently reverted, the
custom MarshalJSON catches the case where a parent struct embeds Agent
under a different tag, and the CI grep blocks reintroduction at the spec
or frontend boundary even without a code review catching it.
Files changed:
Phase 1 — Domain redaction:
- internal/domain/connector.go: APIKeyHash tag flipped from
`json:"api_key_hash"` to `json:"-"`. New Agent.MarshalJSON
with value receiver + type-alias-recursion-break that explicitly
zeroes APIKeyHash on the marshal-time copy. Long-form docblock
explaining the G-2 closure rationale + cross-references to
service.RegisterAgent (populator), repository.AgentRepository::
GetByAPIKey (consumer), docs/architecture.md (DB-shape vs
API-shape distinction), and the audit finding.
Phase 2 — Domain tests (5 test functions):
- internal/domain/connector_test.go: TestAgent_MarshalJSON_RedactsAPIKeyHash
pins the marshal-boundary contract on a value receiver. ...RedactsViaPointer
pins the *Agent path. ...RedactsInSlice pins the []Agent path that the
ListAgents handler actually emits via PagedResponse. ...DoesNotMutateReceiver
pins the by-value-receiver contract so a future refactor that switches
to pointer-receiver gets caught. ...RoundTrip pins the wire-shape
guarantee that APIKeyHash is dropped on encode and cannot reappear on
decode. Single sentinel value ("sha256:LEAKED-CREDENTIAL-DERIVATIVE-
SENTINEL") flows through every fixture for grep-ability on regression.
Phase 3 — Handler tests (4 test functions):
- internal/api/handler/agent_handler_test.go: TestListAgents_DoesNotLeakAPIKeyHash,
TestGetAgent_DoesNotLeakAPIKeyHash, TestRegisterAgent_DoesNotLeakAPIKeyHash,
TestListRetiredAgents_DoesNotLeakAPIKeyHash. Each asserts (a) the
literal substring "api_key_hash" is absent from the httptest-captured
body, (b) the leak sentinel value is absent, (c) the non-leaked fields
ARE present (sanity that the handler is serving real data, not just
empty payloads). Shared sentinel "sha256:LEAKED-CREDENTIAL-DERIVATIVE-
HANDLER-SENTINEL" so a single grep over a failing test's output
identifies the leak surface immediately.
Phase 4 — Spec / docs:
- api/openapi.yaml: api_key_hash property REMOVED from Agent schema
(was at line 3690). Inline G-2 comment naming the closure + the
database-vs-API-shape distinction so a future spec edit doesn't
silently re-introduce the field.
- docs/architecture.md: ER-diagram block already documents the agents
table including api_key_hash (DB shape — correct). Added a sibling
note paragraph immediately below the diagram explaining that several
columns are intentionally server-internal (api_key_hash redaction
+ issuers.config / deployment_targets.config encrypted shadow), with
cross-references to the redaction enforcement site, the OpenAPI
schema, the frontend interface, and the CI guardrail.
- web/src/api/types.ts: Agent interface unchanged in shape (already
omitted the field) but added a leading comment block explaining
WHY the omission is intentional — stops a future frontend dev from
"completing" the interface from the OpenAPI spec or the Go struct.
Phase 5 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden api_key_hash JSON-shape
regression guard (G-2)" step. Scoped patterns catch the actual
regression shapes — Go struct tag (json:"api_key_hash"), frontend
interface declaration, OpenAPI schema property, YAML enum/array
membership. Repository / migration / seed / service / integration /
unit-test / comment lines exempt. Verified locally on the real tree
(passes) and against 4 synthetic regression patterns (each fires
the guardrail). Mirrors the G-1 pattern from .github/workflows/
ci.yml lines 47-108.
Phase 5b — Sweep verification (no changes, results documented for the
next reader):
- internal/api/middleware/audit.go: doesn't serialize Agent struct;
records request body only. No leak.
- service.RegisterAgent audit-event payload: `map[string]interface{}{
"name": name, "hostname": hostname}` — name + hostname only,
no APIKeyHash. No leak.
- All 9 slog sites that mention agent: scalar attrs only ("agent_id",
"error", "agent_hostname"), never the full struct. No leak.
- internal/mcp, internal/cli, cmd/cli, cmd/mcp-server: zero matches
for APIKeyHash / api_key_hash. Both pass server JSON verbatim, so
the wire-side fix transitively closes them.
Verification (all gates pass):
- go build ./...
- go vet ./...
- go test -short ./... — every package green
- go test -short -race ./internal/domain/... ./internal/api/handler/... — clean
- govulncheck ./... — no vulnerabilities in our code
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds
- python3 yaml.safe_load on api/openapi.yaml — parses
- OpenAPI Agent schema scan: no api_key_hash property
- CI guardrail mirror: clean on real tree, fires on all 4 synthetic
regression patterns
- Domain pkg coverage: Agent.MarshalJSON 100%, connector.go total 87.5%
- Handler pkg coverage: 79.2%
Sample response body (httptest captured during verification, GET
/api/v1/agents/{id} via the new handler test):
{"id":"agent-demo","name":"demo-agent","hostname":"demo.host",
"status":"Online","last_heartbeat_at":"2026-04-24T11:59:30Z",
"registered_at":"2026-04-24T12:00:00Z","os":"linux",
"architecture":"amd64","ip_address":"10.0.0.42",
"version":"v2.0.49"}
Note the absence of any api_key_hash key, even though the in-memory
struct passed to the handler had APIKeyHash set to a sentinel.
Out of scope (intentionally untouched):
- internal/repository/postgres/agent.go SELECT/INSERT/UPDATE/scan
paths and GetByAPIKey lookup — DB column stays, repo still
populates the struct, auth lookup still works. The redaction is a
marshal-boundary concern.
- migrations/000001_initial_schema.up.sql + migrations/seed_*.sql —
DB schema and seed data unchanged.
- internal/service/agent.go::RegisterAgent — service-side hashing
and persistence unchanged.
- Other domain types with potential credential-derivative fields
(Issuer.Config, DeploymentTarget.Config, notifier configs). Not
flagged by the audit; some are already protected (e.g.,
DeploymentTarget.EncryptedConfig []byte `json:"-"`). File a
separate audit pass if recon surfaces additional leaks.
- Per-resource DTO layer across every handler. Single audit
finding, single domain type.
- A separate possible follow-up: the v2 RegisterAgent endpoint
doesn't return the plaintext API key to the agent, which may
mean self-bootstrap via POST /api/v1/agents is broken. Verified
during recon; out of scope for G-2; should be its own ticket.
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-s5-apikey_leak
Audit recommendation: 'json:"-" or API-response DTO
excluding APIKeyHash' — went with the json:"-" + MarshalJSON
defense-in-depth pair plus CI guardrail and structural docs.
|
||
|
|
54a41603de |
fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1)
The pre-G-1 config validator accepted CERTCTL_AUTH_TYPE=jwt and the
startup log faithfully echoed 'authentication enabled type=jwt'.
Reasonable people read that and concluded JWT auth was on. It wasn't.
The auth-middleware wiring at cmd/server/main.go unconditionally routed
every request through the api-key bearer middleware regardless of
cfg.Auth.Type. So CERTCTL_AUTH_TYPE=jwt quietly compared the incoming
'Authorization: Bearer <token>' against whatever string the operator put
in CERTCTL_AUTH_SECRET — real JWT clients got 401, and operators who
treated CERTCTL_AUTH_SECRET as a *signing* secret (because they thought
they were configuring JWT) had effectively handed an attacker an api-key.
A security finding masquerading as a config option.
We chose the audit-recommended structural fix: remove the option, fail
fast at startup, and add the gateway-fronting pattern as the documented
forward path. Implementing JWT middleware would have meant jwks vs
static-secret rotation, claim mapping, expiry enforcement, audience and
issuer validation, key rollover semantics, and regression coverage at the
same depth as the existing api-key path — a feature, not a fix. Operators
who genuinely need JWT/OIDC front certctl with an authenticating gateway
(oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium /
Authelia) and run the upstream certctl with CERTCTL_AUTH_TYPE=none. Same
shape works on docker-compose and Helm.
The change is comprehensive across 7 phases — every surface that
mentioned 'jwt' as a certctl-auth-type is updated, plus structural
backstops (typed enum, runtime guard, helm template validation, CI grep
guard) so the lie can't reappear.
Files changed:
Phase 1 — production code (typed enum + jwt removal):
- internal/config/config.go: AuthType typed alias + AuthTypeAPIKey /
AuthTypeNone constants + ValidAuthTypes() helper. Validate() routes
literal 'jwt' through a dedicated multi-line diagnostic naming the
authenticating-gateway pattern, then cross-checks against
ValidAuthTypes(). Secret-required branch simplified to api-key-only.
Field comment on AuthConfig.Type rewritten to drop jwt and point at
the gateway pattern.
- internal/api/middleware/middleware.go: AuthConfig.Type field comment
references the typed config.AuthType constants.
- internal/api/handler/health.go: same treatment for HealthHandler.AuthType.
- cmd/server/main.go: defense-in-depth runtime switch immediately after
config.Load() — exits 1 on any unsupported auth-type that bypassed the
validator. Auth-disabled startup log explicitly names the
authenticating-gateway pattern.
Phase 2 — tests (Red→Green, contract pinning):
- internal/config/config_test.go: TestValidate_JWTAuth_RejectedDedicated
(two table rows pinning the dedicated G-1 error fires regardless of
whether Secret is set), TestValidAuthTypesDoesNotContainJWT (property
guard against future re-introduction),
TestValidAuthTypesIsExactly_APIKey_None (allowed-set contract),
TestValidate_GenericInvalidAuthType (pins non-jwt invalid values still
hit the generic invalid-auth-type error). Removed the prior
TestValidate_JWTAuth_MissingSecret happy-path since its premise is
inverted post-G-1.
- internal/api/handler/health_test.go: removed
TestAuthInfo_ReturnsAuthType_JWT (which baked the silent-downgrade lie
into the regression suite). Pre-existing _APIKey test continues to
cover the api-key happy path.
Phase 3 — spec, docs, env templates:
- api/openapi.yaml: auth_type enum dropped to [api-key, none] with
inline comment naming the G-1 closure.
- .env.example (root): CERTCTL_AUTH_TYPE comment block rewritten to drop
jwt and point at the gateway pattern; secret-required conditional
simplified to api-key-only.
- docs/architecture.md: middleware-stack bullet rewritten to drop the
JWT mention; new H3 'Authenticating-gateway pattern (JWT, OIDC, mTLS)'
section explaining the design rationale and listing oauth2-proxy /
Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia / Caddy
forward_auth / Apache mod_auth_openidc / nginx auth_request as the
standard fronting options.
- docs/upgrade-to-v2-jwt-removal.md (new ~125 lines): migration guide
with preconditions, what-changes, both recovery paths, complete
docker-compose oauth2-proxy walkthrough, Traefik ForwardAuth and Envoy
ext_authz patterns, rollback posture.
Phase 4 — Helm chart (template validation + docs):
- deploy/helm/certctl/templates/_helpers.tpl: new certctl.validateAuthType
helper mirroring the existing certctl.tls.required pattern. Fails
template render on any server.auth.type outside {api-key, none} with
a multi-line diagnostic.
- deploy/helm/certctl/templates/server-deployment.yaml,
server-configmap.yaml, server-secret.yaml: invoke the helper at the
top of each template that depends on .Values.server.auth.type.
- deploy/helm/certctl/values.yaml: auth: block comment expanded with the
G-1 rationale and gateway-pattern cross-reference.
- deploy/helm/CHART_SUMMARY.md: server.auth.type table row now surfaces
the allowed set and points at the upgrade doc.
- deploy/helm/certctl/README.md: new 'JWT / OIDC via authenticating
gateway' section with a Kubernetes-flavored oauth2-proxy + certctl
walkthrough.
Phase 5 — release surface:
- CHANGELOG.md: new [unreleased] top entry with Breaking / Removed /
Added / Changed sections; explicit pointer at
docs/upgrade-to-v2-jwt-removal.md from the Breaking subsection.
Phase 6 — CI guardrail:
- .github/workflows/ci.yml: new 'Forbidden auth-type literal regression
guard (G-1)' step. Scoped patterns catch the actual regression shapes
(map literal, slice literal, switch case, OpenAPI enum, env-file
default, AuthType('jwt') cast). Comments and the dedicated rejection
branch are intentionally exempt; connector-package JWT references
(Google OAuth2 / step-ca) are exempt as out-of-scope external
protocols. Verified locally: the guard passes on the actual tree and
fires on all 4 synthetic regression patterns.
Out of scope (explicitly untouched):
- internal/connector/discovery/gcpsm/gcpsm.go — Google OAuth2 service-
account JWT (external protocol).
- internal/connector/issuer/googlecas/googlecas.go — same.
- internal/connector/issuer/stepca/stepca.go — step-ca's provisioner
one-time-token JWT for /sign API.
- docs/test-env.md, docs/connectors.md, docs/features.md — describe
external CAs' use of JWT, not certctl's auth shape.
- Implementing actual JWT middleware. Feature, not a fix.
Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go test -short ./... — every package green
- go test -short -race ./internal/config/... ./internal/api/... — clean
- govulncheck ./... — no vulnerabilities in our code
- helm lint deploy/helm/certctl/ — clean
- helm template with auth.type=api-key — renders OK
- helm template with auth.type=none — renders OK
- helm template with auth.type=jwt — fails with validateAuthType
diagnostic (exit 1)
- python3 yaml.safe_load on api/openapi.yaml — parses
- CI guardrail mirror — clean on real tree, fires on all 4 synthetic
regression patterns
- Smoke test: 'CERTCTL_AUTH_TYPE=jwt ./certctl-server' exits non-zero
with: 'Failed to load configuration: CERTCTL_AUTH_TYPE=jwt is no
longer accepted (G-1 silent auth downgrade): no JWT middleware ships
with certctl. To use JWT/OIDC, run an authenticating gateway
(oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium) in
front of certctl and set CERTCTL_AUTH_TYPE=none on the upstream.
See docs/architecture.md "Authenticating-gateway pattern" and
docs/upgrade-to-v2-jwt-removal.md for the migration walkthrough'
config pkg coverage: ValidAuthTypes 100%, Validate 94.7%, total 75.5%.
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-g-jwt_silent_auth_downgrade
Audit recommendation followed verbatim: 'Remove jwt from
validAuthTypes until middleware ships'.
|
||
|
|
67f352db69 |
fix(db): emit volume-state guidance on postgres auth failure (U-1, #10)
The shipped quickstart instructs operators to copy deploy/.env.example to
deploy/.env, edit POSTGRES_PASSWORD, and run docker compose up. On the
*first* boot of a fresh checkout this works. On the *second* boot — i.e.,
when an operator first booted with the default POSTGRES_PASSWORD=certctl,
then edited .env and re-ran up — the certctl-server container picks up the
new password (env interpolated at every container start) but postgres does
not. The postgres docker-entrypoint runs initdb only when the data dir is
empty; on subsequent boots the persistent named volume postgres_data is
non-empty so pg_authid retains the password baked in on first boot. The
server connects with the new credentials, postgres rejects them, and the
operator sees an opaque `pq: password authentication failed for user
"certctl"` in the server log with no pointer to the actual cause. New-
operator onboarding gets blocked on the documented production path.
Why a doc fix alone is not sufficient. Operators don't reread the docs
after a successful first boot — the trap fires on the *second* up, when
they think they've already learned the system. The opaque pq error is
indistinguishable in the log from a typo'd password or a misconfigured
secret store. The diagnostic has to fire at the moment the failure is
observed.
Why we don't try to fix the bootstrap. The env-vs-pg_authid divergence is
intrinsic to how the official postgres image bootstraps (see
docker-entrypoint.sh: initdb runs only if PGDATA is empty). Switching to a
bind mount or ephemeral volume breaks the production path; switching to
POSTGRES_PASSWORD_FILE + ALTER ROLE adds operator surface without
eliminating the divergence. The ergonomic fix is to surface the failure
mode loudly, with both remediation paths, at the exact log line where it
becomes visible.
Two remediation paths, surfaced together. Destructive: `docker compose
-f deploy/docker-compose.yml down -v && up -d --build` — wipes the
postgres volume so initdb re-runs with the new env value. Use this on
demos / first-time setup where data loss is acceptable. Non-destructive:
`docker compose exec postgres psql -U certctl -c "ALTER ROLE certctl
PASSWORD '<new>';"` followed by a server restart with the matching
POSTGRES_PASSWORD. Use this on any environment that holds data you want
to keep. Surfacing both means the operator can pick based on their
environment without us assuming.
Files changed:
- internal/repository/postgres/db.go — extract wrapPingError(err) helper.
errors.As against *pq.Error; on SQLSTATE 28P01 (invalid_password) emit
the multi-line guidance preserving the %w wrap chain. Non-28P01 errors
retain the original `failed to ping database: %w` shape so transient
connection-refused / timeout paths don't get noisy. Add
pgErrInvalidPassword = "28P01" constant. Convert blank
`_ "github.com/lib/pq"` import to direct import (driver registration
still works via init()) so we can name the *pq.Error type at compile
time. NewDB now calls wrapPingError(err) instead of inlining the wrap.
- internal/repository/postgres/db_test.go (new) — 4 internal-package
unit tests covering wrapPingError. AuthFailureGuidance pins the
contract substrings ("SQLSTATE 28P01", "POSTGRES_PASSWORD",
"first boot", "down -v", "ALTER ROLE"). NonAuthErrorPreservesOriginalWrap
pins the no-leak contract for SQLSTATE 08006 (connection_failure).
NonPqErrorPreservesOriginalWrap pins the network-level path.
NilReturnsNil pins defensive contract. All run in -short without
testcontainers — package postgres (internal) so the unexported helper
is callable directly.
- docs/quickstart.md — `> **Warning:**` callout immediately after the
`cp deploy/.env.example deploy/.env` block at lines 56-61. Names the
trap, names the SQLSTATE, gives both remediation paths. Uses the
in-file `> **Note:**` blockquote convention.
- deploy/ENVIRONMENTS.md — `**Stateful volume — first-boot password
binding (U-1)**` paragraph appended to the Postgres expert-note block.
Explains the env-vs-pg_authid divergence, points at wrapPingError as
the runtime diagnostic, lists both remediation paths. Uses the in-file
`**Expert note:**` convention.
Out of scope (separate follow-ups):
- deploy/helm/certctl/templates/postgres-statefulset.yaml has the same
root cause via PVC retention. The wrapPingError diagnostic covers the
Helm path because the same NewDB code runs at server startup; the
Helm-specific doc warning lands separately.
- /.env.example at repo root (line 16 hardcodes the password literally
inside CERTCTL_DATABASE_URL rather than interpolating) — adjacent
trap, separate fix.
- examples/{acme-nginx,private-ca-traefik,step-ca-haproxy,multi-issuer,
acme-wildcard-dns01}/docker-compose.yml all carry the pattern. The
diagnostic covers them; targeted doc warnings are scoped to the
canonical quickstart + ENVIRONMENTS docs.
Out of consideration:
- Switch to bind mount / ephemeral volume — breaks the production path.
- POSTGRES_PASSWORD_FILE + Docker secret + ALTER ROLE rotation — adds
operator surface without fixing the env-vs-pg_authid divergence.
Verification (all passing):
- go build ./...
- go vet ./...
- go test -short -race ./internal/repository/postgres/ — 4/4 new tests
pass plus existing tests
- go test -short ./... — every package green
- govulncheck ./... — no vulnerabilities in our code
- wrapPingError coverage 100%; postgres pkg total unchanged in shape
(NewDB/RunMigrations were 0% pre-fix, still 0% post-fix; new helper
adds 100%-covered statements)
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
GitHub Issue #10 (mikeakasully)
|
||
|
|
099683daba |
v2.0.48: swap self-signed TLS bootstrap algorithm ed25519 → ECDSA-P256
Follow-up to v2.0.47 (HTTPS-Everywhere). The Phase-3 self-signed bootstrap sidecar shipped an ed25519 server cert. Apple's TLS stack — Safari Network Framework and the macOS-bundled LibreSSL 3.3.6 /usr/bin/curl — does not advertise ed25519 in the ClientHello signature_algorithms extension for server certs, so the handshake fails with the server-side log line: tls: peer doesn't support any of the certificate's signature algorithms Homebrew OpenSSL 3.x, Chrome, Firefox, and Linux curl all accept ed25519 server certs fine. Apple is the outlier. Rather than gate the demo stack behind "install Homebrew OpenSSL first," swap the bootstrap algorithm to ECDSA-P256 with SHA-256 — universally supported, including on the Apple stack. Changes - deploy/docker-compose.yml: certctl-tls-init openssl invocation swapped to `-newkey ec -pkeyopt ec_paramgen_curve:P-256 -nodes`; header comment + echo line updated; multi-line rationale paragraph added. - deploy/docker-compose.test.yml: same openssl swap + echo update for the test harness sidecar that writes to the bind-mounted ./test/certs directory the Go integration_test.go pins via CERTCTL_TEST_CA_BUNDLE. - docs/tls.md: Pattern 1 description + code block updated; "Why ECDSA-P256 and not ed25519" rationale paragraph added covering pre-v2.0.48 history, the Apple diagnosis, accepting clients, and the operator migration command. Patterns 2 (existing Secret) and 3 (cert-manager) explicitly called out as unaffected. - docs/upgrade-to-tls.md: docker-compose procedure sentence updated with cross-reference to tls.md Pattern 1. - docs/test-env.md: "Get the CA bundle for curl" sentence updated. Migration Existing demo installs must tear the `certs` named volume down to pick up the new algorithm: docker compose -f deploy/docker-compose.yml down -v docker compose -f deploy/docker-compose.yml up -d --build Not touched - cmd/server/tls.go: algorithm-agnostic. TLS 1.3 min version with [X25519, P-256] curve preferences for key exchange is orthogonal to the server cert's signature algorithm. No Go code change needed. - Helm chart: Patterns 2 and 3 operators supply their own cert; this patch does not affect them. - Unrelated ed25519 uses (agent key algorithm detection, profile algorithm options, SSH key path examples, tlsprobe key metadata, cloud discovery key-algo display): all orthogonal to the server TLS bootstrap cert. Incidental cleanup - .gitignore: dropped dangling `strategy.md` entry (file doesn't exist in repo; entry was cruft). |
||
|
|
3155b9475f |
v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP
Breaking change release. Plaintext HTTP listener removed. The certctl control plane now terminates TLS 1.3 on :8443 via http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md. Server - cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback), preflightServerTLS validation - cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe, watchSIGHUP wiring, cert/key path config threading - tls_test.go: 418-line regression coverage of reload, preflight, callback behavior, SAN validation Config - CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required) - Plaintext rejection: agents/CLI/MCP pre-flight-fail on http:// URLs with a pointer to docs/upgrade-to-tls.md Agents, CLI, MCP - All three pre-flight-reject http:// URLs with fail-loud diagnostic - CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust - CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass (loud warning on startup) - install-agent.sh emits both vars as commented template lines docker-compose - certctl-tls-init sidecar generates SAN-valid self-signed cert into deploy/test/certs/ on first boot - All demo-stack curls pin against ca.crt with --cacert Helm chart - Three TLS provisioning modes, exactly one required: - server.tls.existingSecret (operator-supplied) - server.tls.certManager.enabled (cert-manager integration) - server.tls.selfSigned.enabled (eval only — not for production) - server-certificate.yaml template for cert-manager mode - helm install without a TLS source fails at template render with a pointer to docs/tls.md CI - .github/workflows/ci.yml Helm Chart Validation step renders the chart in both existingSecret and cert-manager modes, plus an inverse guard-regression test that asserts helm template MUST refuse to render when no TLS source is configured. Previously the single `helm template` invocation hit the certctl.tls.required fail-loud guard and exit-1'd CI. Four invocations now: lint (existingSecret), template (existingSecret), template (cert-manager), template (no args — must fail). Integration tests - deploy/test/integration_test.go stands up the Compose stack over HTTPS, extracts the CA bundle, and exercises every certctl API over https://localhost:8443 - All 34 integration subtests green (per Phase 8 local CI-parity) Documentation - New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload) - New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade warnings, fleet-roll sequencing) - CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry (file heading unchanged; release tag is v2.0.47) - All curls in docs/, examples/, deploy/helm/ guides use https://localhost:8443 --cacert Verification - grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits - grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin API default, SSRF doc comment) — zero certctl endpoints - Tasks #197–#206 (Phases 0–8) all closed in the tracker Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix). |
||
|
|
c38ba40200 |
docs: reconcile scheduler topology across sibling docs (7 → 12 loops)
Authoritative 12-loop table lives at docs/architecture.md:522-534 (committed via
the I-001/I-003/I-005 + M48/M50 milestone commits). This change brings six sibling
docs into parity with that table so every surface — user-facing features reference,
SOC 2 compliance mapping, connectors guide, advanced demo architecture diagram,
testing guide, and in-line architecture prose — reflects the same 8 always-on + 4
opt-in topology.
Touches:
- docs/architecture.md: 2 inline ordinal references (9th / 8th loop) replaced with
descriptive names (opt-in cloud discovery / opt-in endpoint health), cross-linked
to the authoritative table to prevent future ordinal rot.
- docs/features.md: metric row (7 → 12), inline reference to 9th loop, and full
scheduler table expanded to include Always-on column + env vars + I-001/I-003/I-005
refs.
- docs/compliance-soc2.md: background scheduler monitoring bullets expanded to list
all 12 loops with env vars + I-series refs; table row updated with 8 always-on +
4 opt-in summary.
- docs/connectors.md: three inline ordinals (7th/6th/9th loop) replaced with
descriptive names, cross-linked to architecture.md.
- docs/demo-advanced.md: Mermaid SCHED node label updated from '7 background loops'
to '12 background loops (8 always-on + 4 opt-in)'.
- docs/testing-guide.md: Test 20.1.1 header + grep pattern expanded to include
job-retry / job-timeout / notification-retry / digest / endpoint-health /
cloud-discovery loops; sign-off chart row label updated.
Pure documentation reconciliation. No code changes. Master HEAD pre-commit:
|
||
|
|
25131a377d |
M-001/M-006: strip HTTP auth from EST/SCEP + fail-loud SCEP preflight
Closes CWE-306 (missing authentication for critical function) for SCEP
via a fail-loud startup gate, and aligns EST/SCEP HTTP dispatch with
their respective RFCs. CRL/OCSP remain unauthenticated under
.well-known/pki/* per RFC 5280 §5 / RFC 6960 / RFC 8615. Option (D):
no mTLS in this milestone.
- RFC 7030 §3.2.3 (EST auth is deployment-specific) and §4.1.1
(/cacerts explicitly anonymous): EST paths served unauthenticated;
CSR-signature + profile policy enforce identity inside ESTService.
- RFC 8894 §3.2: SCEP authenticates via the challengePassword
PKCS#10 attribute (OID 1.2.840.113549.1.9.7), not an HTTP credential.
HTTP dispatch is unauthenticated; preflightSCEPChallengePassword
refuses to start when CERTCTL_SCEP_ENABLED=true without
CERTCTL_SCEP_CHALLENGE_PASSWORD. SCEPService.PKCSReq enforces the
same invariant defense-in-depth and compares with
crypto/subtle.ConstantTimeCompare.
cmd/server/main.go:
- Extract buildFinalHandler(apiHandler, noAuthHandler, webDir,
dashboardEnabled); route /.well-known/est/*, /scep, /scep/*,
/.well-known/pki/crl/{id}, /.well-known/pki/ocsp/{id}/{serial},
and health probes through noAuthHandler (RequestID +
structuredLogger + Recovery only).
- Add preflightSCEPChallengePassword fail-loud gate; startup log
emits challenge_password_set boolean for operator visibility.
cmd/server/finalhandler_test.go (new, 314 lines, 27 subtests):
- TestBuildFinalHandler_Dispatch (20) + TestBuildFinalHandler_NoDashboard
(7) pin the dispatch surface: EST 4-endpoint, SCEP exact +
trailing-slash + query-string, PKI CRL+OCSP, health, /api/v1/*
authenticated, /assets/* file server, SPA fallback.
internal/api/router/router.go, internal/config/config.go:
- Router-level comments explain why EST/SCEP/PKI dispatchers sit
outside the authenticated mux; SCEP challenge password config
plumbed through.
docs/architecture.md:
- New EST Authentication subsection (RFC 7030 §3.2.3 + §4.1.1,
buildFinalHandler + noAuthHandler references).
- Rewrite SCEP Authentication subsection; replaces pre-existing
factually-incorrect "any value accepted" claim with CWE-306
preflight, service-layer defense-in-depth, and
crypto/subtle.ConstantTimeCompare.
- Top-level Authentication section: qualify /api/v1/* scope on API
clients bullet; add standards-based-endpoints bullet referencing
the 27-subtest regression harness.
docs/compliance-soc2.md:
- CC6.1: scope API Key Authentication to /api/v1/*; add
standards-based endpoints bullet citing RFCs and CWE-306 closure.
- CC6.3: scope API Key Policy to /api/v1/* with cross-reference to
CC6.1.
- Evidence Locations augmented with buildFinalHandler,
preflightSCEPChallengePassword, scep.go defense path, regression
harness, and OpenAPI security:[] overrides.
api/openapi.yaml: verified already correct (global bearerAuth
default overridden with security:[] on /cacerts, /simpleenroll,
/simplereenroll, /csrattrs, /scep GET+POST, /crl/{issuer_id},
/ocsp/{issuer_id}/{serial}); no edits needed.
|
||
|
|
15daf008aa |
I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.
Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
(next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
Dead rows clear next_retry_at so the index stops matching them.
Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
exponential backoff capped at 1h (notifRetryBackoffCap) with
5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
(audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
status -> pending via notifRepo.Requeue.
Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.
StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
NewStatsService call sites (main.go + stats_test.go + 8 digest
tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
(reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
the same snapshot.
Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.
Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
with queryKey: ['notifications', activeTab] so switching tabs
doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
full-text title tooltip.
- Requeue mutation wrapped as
mutationFn: (id: string) => requeueNotification(id)
to prevent react-query v5's positional context argument from
leaking into the API client — pinned against future refactors
by strict-match toHaveBeenCalledWith('notif-dead-001') in
NotificationsPage.test.tsx:181.
Closes I-005.
|
||
|
|
49002c8cba |
Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.
Schema — migration 000015:
migrations/000015_agent_retire.up.sql flips
deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
boundary instead of quietly destroying targets. Both `agents` and
`deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
TEXT pair (TEXT not VARCHAR so operator comments are never
truncated), indexed via partial indexes WHERE retired_at IS NOT
NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
EXISTS) so repeated runs against partially-migrated databases
converge. migrations/000015_agent_retire.down.sql restores CASCADE
and drops the new columns for clean rollback. A dedicated
repository-layer testcontainers test
(internal/repository/postgres/migration_000015_test.go) asserts the
before/after FK action, column presence, index presence, and
round-trip idempotency under up→down→up.
Domain — sentinel guard + dependency counts:
internal/domain/connector.go gains IsRetired() on Agent, the
exported SentinelAgentIDs slice listing server-scanner,
cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
four reserved IDs documented in CLAUDE.md and created at startup in
cmd/server/main.go), IsSentinelAgent(id string) predicate,
AgentDependencyCounts{ActiveTargets, ActiveCertificates,
PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
ActorTypeSystem enum values used by audit emission downstream.
Coverage locked down by internal/domain/connector_test.go.
Service — 8-step ordered contract:
internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
opts{Force, Reason}) enforces a fixed execution order:
(1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
unconditionally; force=true does NOT bypass it.
(2) fetch — ErrAgentNotFound on miss.
(3) idempotency — if IsRetired() already, return
AgentRetirementResult{AlreadyRetired: true} with no new audit
event and no state change (safe to replay from flaky clients).
(4) preflight counts — collectAgentDependencyCounts runs
ActiveTargets, ActiveCertificates, PendingJobs sequentially
(not in parallel; keeps the per-query timeout predictable and
matches the repo's existing call-chain shape).
(5) force-reason guard — opts.Force=true with empty Reason returns
ErrForceReasonRequired (wired into the 400 status surface).
(6) dependency guard — HasDependencies() with opts.Force=false
returns BlockedByDependenciesError{Counts} (wired into the 409
body with per-bucket counts).
(7) mutation — single pinned retiredAt := time.Now(); agent
retirement first, then cascade target retirement if opts.Force,
all under the repo's single transaction so the two retired_at
stamps match to the second.
(8) best-effort audit — agent_retired always; agent_retirement_
cascaded additionally on the force path. Actor is whatever the
handler resolves from the request; actor type is mapped by
resolveActorType (system/agent-prefix→Agent/else→User). Audit
emission failures are logged via slog.Error but do not abort
the retirement (matches the house convention used by every
other scheduler-emitted event).
BlockedByDependenciesError implements Error() as
"active_targets=%d, active_certificates=%d, pending_jobs=%d" and
Unwrap() → ErrBlockedByDependencies. The single struct satisfies
errors.Is via Unwrap (used by scheduler-level tests) and errors.As
via the concrete type (used by the handler to fish out Counts for
the 409 body). ListRetiredAgents(page, perPage) adds a separate
paginated accessor with page<1→1 and perPage<1→50 normalization so
retired rows are queryable without polluting the default agent
listing.
Sentinel guard coverage is asymmetric by design: all four reserved
IDs are protected, and force=true cannot override. Regression tests
in internal/service/agent_retire_test.go assert each of the eight
steps in order, plus sentinel bypass attempts and idempotency
replay.
Handler + router — status-code surface:
internal/api/handler/agents.go:RetireAgent exposes seven status
codes on DELETE /agents/{id}:
200 on a fresh retirement (body echoes AgentRetirementResult).
204 on idempotent replay (AlreadyRetired=true; no new audit).
400 on ErrForceReasonRequired.
403 on ErrAgentIsSentinel.
404 on ErrAgentNotFound.
409 on BlockedByDependenciesError, with a custom body shape
{error, counts{active_targets, active_certificates,
pending_jobs}} that bypasses the default ErrorWithRequestID
envelope so callers get the per-bucket numbers directly.
500 on any other error.
Heartbeat HandleHeartbeat returns 410 Gone when the agent is
retired (ErrAgentRetired), signalling the agent to shut down.
Query params `force=true` and `reason=<text>` drive the cascade
path; both are forwarded as url.Values through the new MCP
transport.
internal/api/router/router.go registers GET /api/v1/agents/retired
literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
literal-beats-pattern-var precedence routes "retired" to the
paginated retired-agents listing instead of fetching a hypothetical
agent named "retired".
Agent binary — clean shutdown on 410:
cmd/agent/main.go gains the ErrAgentRetired sentinel, a
retiredOnce sync.Once, and a retiredSignal chan struct{}. A
markRetired(source, statusCode, body) helper closes the channel
exactly once; the Run() select loop observes the close and returns
ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
and exits cleanly instead of spinning in the heartbeat retry loop.
The 410 Gone surface is therefore terminal for the agent process.
MCP transport:
internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
a new additive transport method. Client.Delete is path-only; without
this method the retire tool would silently drop `force` and `reason`,
turning every cascade retire into a default soft-retire. The new
method shares do()'s 204 normalization and 4xx/5xx error
propagation so tool authors get one contract.
internal/mcp/tools.go + internal/mcp/types.go expose the
retire_agent tool with Force+Reason inputs wired through
DeleteWithQuery.
CLI:
cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
`agents list --retired` (client-side strip of --retired then
delegation to ListRetiredAgents, sharing --page/--per-page parsing
with the default listing) and `agents retire <id> [--force --reason
"…"]` (mirrors ErrForceReasonRequired — force without reason is
rejected client-side before the request is sent). JSON + table
output modes both honor the new columns.
Frontend:
web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
web/src/api/client.ts + web/src/api/types.ts expose the retire
endpoint and the retired-listing. 4 new Vitest regression cases.
OpenAPI:
api/openapi.yaml documents DELETE /agents/{id} with all seven
status codes, 410 on heartbeat, and the 409 per-bucket body shape.
Regression coverage (six new test files, all green):
internal/service/agent_retire_test.go — 8-step contract + sentinel guards
internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies
Files:
api/openapi.yaml — DELETE + 410 + 409 body shape
cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
cmd/cli/main.go — handleAgents list/get/retire dispatch
docs/architecture.md, docs/concepts.md,
docs/testing-guide.md — retirement contract narrative
internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
internal/api/handler/agent_handler_test.go — extended coverage
internal/api/handler/agent_retire_handler_test.go — new
internal/api/router/router.go — /agents/retired before /agents/{id}
internal/cli/agent_retire_test.go — new
internal/cli/client.go — ListRetiredAgents + RetireAgent
internal/domain/connector.go — IsRetired, SentinelAgentIDs,
IsSentinelAgent, AgentDependencyCounts,
ActorTypeAgent/System
internal/domain/connector_test.go — new
internal/integration/lifecycle_test.go — retirement fixture
internal/mcp/client.go — DeleteWithQuery additive transport
internal/mcp/retire_agent_test.go — new
internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
internal/repository/interfaces.go — AgentRepository retirement methods
internal/repository/postgres/agent.go — retire + cascade target retire + counts
internal/repository/postgres/migration_000015_test.go — new
internal/service/agent.go — wire into AgentService surface
internal/service/agent_retire.go — new 8-step contract
internal/service/agent_retire_test.go — new
internal/service/deployment.go — skip retired agents
internal/service/target.go — skip retired agents
internal/service/testutil_test.go — shared mocks extended
migrations/000015_agent_retire.up.sql — new
migrations/000015_agent_retire.down.sql — new
web/src/api/client.ts, types.ts + tests — retire endpoint wiring
web/src/pages/AgentsPage.tsx — retire UI
|