mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:41:41 +00:00
e9011caac83b3153ec3e4b8566e22beb24c34763
17 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e9011caac8 |
fix(deploy/libest): pin debian:bookworm-slim FROM lines to digest (H-001)
CI's 'Forbidden bare FROM regression guard (H-001)' rejects any Dockerfile FROM line missing an @sha256:... digest pin. The Phase 10 libest sidecar Dockerfile shipped two bare FROMs at lines 25 and 55, both targeting debian:bookworm-slim. The repo's Bundle A / Audit H-001 (CWE-829) policy has been in force on every other Dockerfile since the bundle landed; the new sidecar simply needs to follow the same convention. Pinned both lines to: debian:bookworm-slim@sha256:f9c6a2fd2ddbc23e336b6257a5245e31f996953ef06cd13a59fa0a1df2d5c252 That's the OCI image-index digest from https://hub.docker.com/v2/repositories/library/debian/tags/bookworm-slim fetched 2026-04-29 (last_pushed 2026-04-22). Multi-arch index, so Docker resolves the per-arch manifest correctly on the CI runner. Added a comment at the top of the FROM block documenting the bump procedure (curl + jq one-liner against the Docker Hub registry API), matching the convention from the top-level Dockerfile. Verified locally with the exact CI guard regex (grep -HnE '^FROM\s+[^@#]+(\s+AS\s+\S+)?\s*$' across every Dockerfile* under the repo, excluding web/node_modules) — passes. Also verified the M-012 USER-drop guard still passes for the libest sidecar (terminal USER estuser, set on line 73). |
||
|
|
5a682db8e2 |
EST RFC 7030 hardening master bundle Phases 10-11: libest sidecar e2e
+ Cisco IOS quirk fixtures + ManagedCertificate.Source provenance + EST bulk-revoke endpoint + 13 typed audit action codes. Phase 10.1 — libest reference-client sidecar: - deploy/test/libest/Dockerfile: multi-stage Debian-bookworm-slim build of Cisco's libest v3.2.0-2 from source (autoconf/automake/ libtool + libcurl4-openssl-dev + libssl-dev). Runtime stage carries only estclient + bash + openssl + ca-certificates so the exec surface stays small + predictable. - docker-compose.test.yml libest-client entry (profiles: [est-e2e]) with bind mounts for /config/est (test workspace) + /config/certs (certctl CA bundle for TLS pinning); IP 10.30.50.9 (10.30.50.8 was already taken by certctl-agent). - deploy/test/est/.gitkeep keeps the bind-mount target tracked. Phase 10.2 — 5 integration tests (//go:build integration) in deploy/test/est_e2e_test.go: - TestEST_LibESTClient_Enrollment_Integration (cacerts → simpleenroll → cert-shape assertion) - TestEST_LibESTClient_MTLSEnrollment_Integration (mTLS sibling-route cert auth; skip when bootstrap cert absent) - TestEST_LibESTClient_ServerKeygen_Integration (RFC 7030 §4.4 multipart; skip when profile gate disabled) - TestEST_LibESTClient_RateLimited_Integration (4th enroll trips per-principal cap, asserts 429-shaped error) - TestEST_LibESTClient_ChannelBinding_Integration (libest --tls-exporter; skip when libest build lacks the flag). - requireESTSidecar guard skips the suite when the operator forgot --profile est-e2e; helpful error message includes the exact command to bring the sidecar up. Phase 10.3 — Cisco IOS quirk fixtures + 3 unit tests in internal/api/handler/cisco_ios_quirks_test.go: - testdata/cisco_ios_15x_pem_csr.txt: PEM body sent with Content-Type application/x-pem-file. Handler dispatches on body-prefix not Content-Type — accepts cleanly. - testdata/cisco_ios_16x_trailing_newline_csr.txt: extra trailing newlines after base64 body. strings.TrimSpace tolerates. - testdata/cisco_ios_crlf_b64_csr.txt: CRLF-wrapped base64. base64.StdEncoding handles CRLF + LF identically. Phase 11.1 — ManagedCertificate.Source provenance: - New domain.CertificateSource enum (Unspecified/EST/SCEP/API/Agent). - Migration 000023_managed_certificates_source.up.sql adds source TEXT NOT NULL DEFAULT '' so existing rows scan as CertificateSourceUnspecified — back-compat: bulk-revoke filter treats empty as "any source". - Postgres repo Insert/Update/scan paths all wire the new column. Phase 11.2 — EST bulk-revoke endpoint: - BulkRevocationCriteria.Source field (Source-only requests rejected as too broad — must accompany at least one narrower criterion). - service.bulk_revocation.resolveCertificates post-filter by Source (empty=any, no SQL change so existing CertificateFilter callers unaffected). - New BulkRevocationHandler.BulkRevokeEST method pins Source=EST + dispatches; new route POST /api/v1/est/certificates/bulk-revoke (M-008 admin-gated). openapi.yaml documented + parity-guard green. Phase 11.3 — 13 typed audit action codes in internal/service/est_audit_actions.go: - est_simple_enroll_success / _failed - est_simple_reenroll_success / _failed - est_server_keygen_success / _failed - est_auth_failed_basic / _mtls / _channel_binding - est_rate_limited - est_csr_policy_violation - est_bulk_revoke - est_trust_anchor_reloaded - ESTService.processEnrollment + SimpleServerKeygen + ReloadTrust split-emit BOTH the legacy bare action codes (back-compat for the GUI activity-tab chip filters that match by exact string + existing audit-log analysers) AND the new typed _success / _failed variants (operator grep target + per-failure-mode counter). Tests: - internal/api/handler/bulk_revocation_est_test.go — 5 cases (admin-true happy path pins Source=EST + non-admin 403 + empty-criteria 400 + invalid-reason 400 + method-not-allowed). - internal/service/est_audit_actions_test.go — 5 cases (SimpleEnroll legacy+typed emission / SimpleReEnroll typed / IssuerError typed-failed / PolicyViolation triple-emit / unique-string invariant). Pre-commit verification (sandbox): gofmt clean, go vet clean (excluding repository/postgres testcontainers limit), staticcheck clean across api/handler/api/router/domain/service/deploy/test, go test -short -count=1 green for every non-postgres Go package + integration build (`go build -tags integration ./deploy/test/...`) clean. G-3 docs-drift guard reproduced locally clean (Phases 10-11 added zero new env vars). Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases 12-13 (docs/est.md + WiFi/802.1X / IoT bootstrap / FreeRADIUS recipes; release prep + tag) remain — post-2.1.0 work. |
||
|
|
530593507b |
fix(scep-intune): close 11 audit gaps from 2026-04-29 pre-tag review
Closes the eleven gaps identified in the pre-v2.1.0 audit of the SCEP
RFC 8894 + Intune master bundle (cowork/scep-bundle-gap-closure-prompt.md).
Constitutional rule from cowork/CLAUDE.md::Operating Rules — 'Always
take the complete path, not the easy path' — drove this closure: each
gap was a load-bearing wire that crossed multiple layers (config →
validator → service wire-up → tests → docs) and shipping the bundle
without them would have produced lying-field footguns where operator-
visible config options stored values without affecting behavior.
WHAT LANDS:
Phase A — Clock-skew tolerance (master prompt §15 hazard closure)
internal/scep/intune/challenge.go: ValidateChallenge migrated from
positional args to ValidateOptions{} struct; new ClockSkewTolerance
field with default 0 (strict). 24 call sites updated mechanically.
Asymmetric application: now+tolerance >= iat AND now-tolerance < exp.
internal/config/config.go: SCEPIntuneProfileConfig.ClockSkewTolerance
default 60s + Validate() refusal when >= ChallengeValidity.
cmd/server/main.go: SetIntuneIntegration signature extended;
per-profile env-var loader honors CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CLOCK_SKEW_TOLERANCE.
internal/service/scep.go: intuneClockSkew field + IntuneStatsSnapshot
surfaces clock_skew_tolerance_ns. web/src/api/types.ts mirrors.
4 new tests in challenge_test.go covering accept-within-tolerance,
reject-beyond-tolerance, accept-expired-within-tolerance,
negative-treated-as-zero defensive normalization.
docs/scep-intune.md updated with the new env var + time-bounds rule.
Phase B — unknown-version-rejected golden test
internal/scep/intune/golden_helper_test.go: goldenUnknownVersionPayload
helper + signGoldenChallengeAny generic signer.
challenge_golden_test.go: TestGoldenChallenge_UnknownVersionRejected
uses an in-process ECDSA fixture (the on-disk PEM was generated with
a Go-stdlib version that produces different ecdsa.GenerateKey bytes
from the current call). TestRegenerateGoldenFixtures emits the new
unknown_version fixture file too.
Phase C — Two named Intune e2e tests
internal/api/handler/scep_intune_e2e_test.go:
TestSCEPIntuneEnrollment_RateLimited_E2E (cap=2 + 3 attempts; 3rd
returns FAILURE+badRequest with rate_limited counter ticked)
TestSCEPIntuneEnrollment_TrustAnchorSIGHUPReload_E2E (rotate
on-disk PEM + holder.Reload(); old-key challenge fails with
badMessageCheck; signature_invalid counter ticked)
intuneE2EFixture struct extended with trustHolder + trustPath fields
so tests can rotate.
Phase D — Four new ChromeOS hermetic tests (10 total now)
internal/api/handler/scep_chromeos_test.go:
_RAKeyMismatch — PKIMessage encrypted to wrong RA cert; handler
rejects without reaching service.
_3DESBackwardCompat — RFC 8894 §3.5.2 legacy fallback verified.
_RSACSR + _ECDSACSR — explicit matrix-pair pinning.
buildTestECDSACSR helper for ECDSA P-256 CSR construction;
tripleDESCBCEncrypt mirrors aesCBCEncrypt for 3DES-CBC;
assertChromeOSPositiveCertRep shared assertion.
Phase E — Per-profile counter isolation test
internal/api/handler/scep_profile_counter_isolation_test.go:
TestSCEPHandler_PerProfileIntuneCountersIsolated wires two
SCEPService instances + drives distinct PKIMessages + asserts
counter isolation. Guards against a future cmd/server/main.go
refactor that shares a *intuneCounterTab across profiles.
buildPerProfileIntuneFixture parameterized helper.
Phase F — Server-boot regression tests
cmd/server/preflight_scep_intune_test.go: 3 named tests covering
disabled-backward-compat, broken-config-with-PathID, expired-cert
refusal. preflightSCEPIntuneTrustAnchor signature extended with
pathID arg so error messages carry PathID= for operator log-grep.
Phase G — docs/connectors.md
Four new subsections under §EST/SCEP Integration: multi-profile
dispatch + mTLS sibling route + Intune Connector dispatcher + SCEP
probe in network scanner. Each has a one-paragraph operator
explanation + an env-var or endpoint table.
Phase H — Coverage uplift
internal/service/scep_probe_persist_test.go: 5 unit tests on
persistProbeResult (nil-safe + nil-repo-safe + repo-error swallow +
nil-logger guard) + ListRecentSCEPProbes (empty-slice-not-nil + repo
pass-through) + describeCertAlgorithm (RSA/ECDSA/QF1008-nil-curve
defensive branch/Ed25519/DSA/empty). CI gates (service ≥70, handler
≥75) PASS at 70.9% / 79.3%.
Phase I — deploy/test integration variant
deploy/test/scep_intune_e2e_test.go (//go:build integration):
TestSCEPIntuneEnrollment_Integration + _RateLimited_Integration
against the live docker-compose certctl container. Skip-when-
stack-missing semantics so sandbox + CI both work.
deploy/docker-compose.test.yml: new e2eintune SCEP profile env
vars + bind-mount of deploy/test/fixtures/.
deploy/test/fixtures/README.md: documents the deterministic trust
anchor regeneration recipe.
VERIFICATION (sandbox):
gofmt -d — clean for all changed files
staticcheck — clean for intune + handler + config + service +
cmd/server packages
go vet — clean for the same packages
go test -short — green for intune (95.3% cov), service (70.9%),
handler (79.3%), config (94.0%), cmd/server (boot
path; my preflight tests cover the directly-
testable function), pkcs7 (80.5% informational)
DEFERRED (per closure prompt §7 out-of-scope):
- V3-Pro Conditional Access gating + Microsoft Graph integration
- Standalone certctl-scan CLI binary
- OCSP rate-limiting, OCSP stapling, delta CRLs
Spec preserved at cowork/scep-bundle-gap-closure-prompt.md;
journal at cowork/scep-rfc8894-intune/progress.md (audit-closure
section appended).
|
||
|
|
fc3c7ad1e3 |
crl/ocsp e2e: wire helpers to integration_test.go primitives — Phase 6
The Phase 6 e2e scaffold landed in
|
||
|
|
a4df1f86ae |
crl/ocsp: admin observability endpoint + Phase 6 e2e scaffold
Phase 5 (admin endpoint slice) + Phase 6 (e2e test stub) of the
CRL/OCSP responder bundle. Closes the deferred items from the
backend-slice merge (
|
||
|
|
834389621c |
Bundle I (Coverage Audit Closure): QA-doc drift cleanup — H-007 + H-008 closed
Applies Patches 1-7 from coverage-audit-2026-04-27/tables/qa-doc-patches.md
(Patch 5 re-anchored against actual HEAD seed counts after Phase 0 recon
discovered the original patch's anticipated counts were themselves drifted).
docs/qa-test-guide.md:
- Patch 1: 'all 54 Parts' -> '49 of 56 Parts' + not-yet-automated callout
- Patch 2: Totals line replaced with verified-2026-04-27 breakdown + recompute commands
- Patch 3: Coverage Map gains Parts 23, 24, 55, 56 (each '0 (NOT AUTOMATED)')
- Patch 4: 'Not Yet Automated' subsection added under 'What This Test Does NOT Cover'
- Patch 5: Seed Data Reference re-anchored to authoritative HEAD counts:
32 certs (already correct), 12 agents (was 9), 13 issuers (was 9),
8 targets (already correct), 4 nst (already correct).
Replaced narrow ID enumerations with sed | grep recompute commands.
Added maintenance-note pointer to Strengthening #6 (CI guard).
- Patch 6: Version History entry v1.2 added
- Bonus: integration_test comparison row updated (12 agents + 13 issuers)
deploy/test/qa_test.go (Patch 7):
4 new t.Run('PartN_*', ...) blocks for Parts 23, 24, 55, 56 — each calls
t.Skip with a docs/testing-guide.md::Part N pointer + automation candidates.
Skip-with-rationale form keeps Part numbering consistent + makes the
manual-test pointer machine-readable. Replacing each Skip with a real
test body is gap-backlog work.
Verification:
grep -cE '^## Part [0-9]+:' docs/testing-guide.md == 56 PASS
grep -cE 't\.Run("Part[0-9]+_' deploy/test/qa_test.go == 53 PASS
go vet -tags qa ./deploy/test/... PASS
go test -tags qa -run='__nope__' ./deploy/test/... PASS (compile)
(Full SKIP-grep gate requires the live demo stack; t.Skip bodies trivial.)
Audit deliverables:
findings.yaml: H-007 (-0014), H-008 (-0015) status open -> closed
gap-backlog.md: strikethrough both rows + Bundle I closure-log entry
tables/qa-doc-drift.md: 'PATCHES APPLIED' header marker (not retro-edited)
acquisition-readiness.md: QA-doc rigor 2.5 -> 4.0
closure-plan.md: Bundle I checklist box ticked
CHANGELOG.md: [unreleased] Bundle I entry
|
||
|
|
90bfa5d320 |
test: triage 37 skipped-test sites — closure comments pinning rationale (Q-1)
Closes Q-1 (cat-s3-58ce7e9840be) — 37 t.Skip / testing.Short() sites
across 9 test files audited. Per-site verdict matrix:
- cmd/agent/verify_test.go (1 site): defensive guard against unreachable
httptest.NewTLSServer code path. Document-skip with closure comment.
- deploy/test/qa_test.go (11 sites): file already gated by `//go:build qa`
tag. The 11 t.Skip("Requires X — manual test") markers are runtime
second-line guards for operators who run -tags qa against a stack
missing the required external service. File-level header comment
block added explaining the manual-test convention.
- deploy/test/healthcheck_test.go (5 sites): 3 docker-availability +
1 testing.Short + 1 hard-skip for not-yet-wired runtime probe
(image-spec contract above already covers the audit-flagged
regression). All correctly gated; file-level header comment block
added explaining each.
- deploy/test/integration_test.go (5 sites): in-flight-state guards
(poll-with-skip after 90s polling for agent-online, inter-test
Phase04→Phase07 ordering, scheduler-tick race for discovered certs,
inter-test issuer fallthrough, defensive PEM-empty assertion).
Each site now has a closure comment explaining why skip is the
right choice rather than fail (upstream phase already surfaces the
real failure; skipping prevents masking root cause behind cascading
noise).
- internal/repository/postgres/{testutil,seed,repo}_test.go (5 sites):
testing.Short() gates for testcontainers-backed live PostgreSQL
integration tests. All correctly gated; closure comments added
naming the run command.
- internal/connector/notifier/email/email_test.go (2 sites):
anti-fixture assertions (test asserts SMTP dial fails; if a captive
portal black-holes the call to success, skip rather than false-pass).
Closure comments added explaining the fixture assumption.
- internal/connector/target/iis/iis_test.go (2 sites): platform-gated
skip for powershell.exe absence on non-Windows hosts. Mirrors the
production iis_connector.go LookPath guard. Closure comments added.
Total: 17 closure comments anchor the 37 skip sites (some sites share a
single block-level comment). All skips remain in place; the change is
purely documentation. The audit recommendation was "audit each skip and
decide" — for these 37, the decision is uniformly **document-skip**:
the gating is correct, the t.Skip messages name the missing precondition,
and the closure comments now pin the rationale for future readers.
See coverage-gap-audit-2026-04-24-v5/unified-audit.md
cat-s3-58ce7e9840be for closure rationale.
|
||
|
|
86fffa305a |
fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)
Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`. The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone (`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS 1.3 pinned), so the probe failed on every interval and Docker marked the container `unhealthy` indefinitely. Operators inside docker- compose / Helm / the example stacks were unaffected — compose overrides the HEALTHCHECK with `--cacert + https://`, Helm uses explicit `httpGet` probes that ignore Docker's HEALTHCHECK, and every example compose file overrides with `curl -sfk https://localhost:8443/health`. But anyone running bare `docker run` / Docker Swarm / Nomad / ECS — exactly the "I just pulled the published image" path — saw permanent `unhealthy` status and (depending on orchestrator policy) a restart- loop. (Audit: cat-u-healthcheck_protocol_mismatch in coverage-gap-audit-2026-04-24-v5/unified-audit.md.) Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone gap, both bundled into this commit because they share the same root cause and the same operator surface: 1. Helm chart `server.readinessProbe.httpGet.path` pointed at `/readyz`, the kube-flavored convention. The certctl server doesn't register `/readyz` (only `/health` and `/ready` are wired and bypass the auth middleware — see internal/api/router/router.go:81 and cmd/server/main.go:920). K8s readiness probes therefore got 401 (api-key auth rejection) or 404 (when auth was disabled), pods stayed `NotReady` indefinitely, and Helm rollouts stalled. 2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all, so bare-`docker run` agents got zero health signal. The compose override at `deploy/docker-compose.yml:173` called `pgrep -f certctl-agent` against the agent image, but the agent image didn't ship `procps` — pgrep was missing too. The compose probe was a latent always-fail. We fixed all three with the audit-recommended shape (option (a) — `-k`) plus three structural backstops: Files changed: Phase 1 — Dockerfile fix: - Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/ health` to `curl -fsk https://localhost:8443/health`. `-k` (insecure) is acceptable because the probe is localhost-to-localhost: the same process serving the cert is being probed, no network hop. Pinning `--cacert` is not viable for the published image because the bootstrap cert is per-deploy (generated into the `certs` named volume on first up; operator-supplied via Helm's `existingSecret` or cert-manager). Long-form docblock cross-references the audit closure, the compose vs Helm vs examples coverage matrix, and the CI guardrail. - Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent` matching the compose pattern. Added `procps` to the runtime apk install — fixes both the new image-level HEALTHCHECK AND the pre-existing compose probe that was silently failing. Phase 2 — Helm readiness probe path: - deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path changed from `/readyz` to `/ready`. Liveness probe path (`/health`) was correct and is unchanged. Probes block now carries an explanatory comment naming the registered no-auth probe routes and the U-2 closure rationale. Phase 3 — Image-level integration tests: - deploy/test/healthcheck_test.go (new, //go:build integration): TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server image, inspects `Config.Healthcheck.Test` via `docker inspect`, and asserts the array contains `https://localhost:8443/health` and `-k`, and does NOT contain `http://localhost:8443/health` (positive + negative regression contracts). TestPublishedAgentImage_HealthcheckSpecExists builds the agent image and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`. Both tests `t.Skip` cleanly when docker isn't available (sandbox / CI without docker-in-docker) — verified locally: tests skip with the diagnostic and the suite returns PASS. TestPublishedServerImage_HealthcheckTransitionsToHealthy is a documented `t.Skip` placeholder until the harness wires a sidecar postgres for image-level smoke; the spec-level tests above cover the audit-flagged regression. Phase 4 — CI guardrail: - .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK regression guard (U-2)" step. Scoped patterns catch `HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health` in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md out of scope (the post-cutover invariant string at line 182 is intentionally a documented expected-failure assertion). Verified locally on the real tree (passes) and against synthetic regressions (each fires the guard). Phase 5 — Docs sweep: - docs/connectors.md: 15 stale curl examples updated from `http://localhost:8443/...` to `https://localhost:8443/...` with `--cacert "$CA"` injected on every site. Added a one-time introductory note documenting the `$CA` extraction with `docker compose ... exec ... cat /etc/certctl/tls/ca.crt`, matching the pattern in docs/quickstart.md. Pre-U-2 these examples silently failed against the HTTPS listener. Phase 6 — Release surface: - CHANGELOG.md: appended U-2 section to the existing [unreleased] block (immediately below the G-1 entry). Sections: explanatory blockquote covering all three bugs (primary + 2 adjacent), Fixed, Added, Changed. Verification (all gates pass): - go build ./... — clean - go vet ./... — clean - go vet -tags integration ./deploy/test/ — clean - go test -short ./... — every package green - go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ — three tests SKIP cleanly with "docker not available" diagnostic - helm lint deploy/helm/certctl/ — clean - helm template smoke render — succeeds; rendered Deployment carries `path: /ready` and zero `/readyz` matches - python3 yaml.safe_load on api/openapi.yaml — parses - govulncheck ./... — no vulnerabilities in our code - CI guardrail mirror: clean on real tree, fires on synthetic regression patterns Out of scope (intentionally untouched): - cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct, this finding does NOT propose adding back a plaintext listener. - deploy/docker-compose.yml:126 HEALTHCHECK — already correct. - deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct. - All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already correct (they ALSO use `-fsk https://localhost:8443/health`). - Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` + `path: /health`, correct. - docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health` invariant line — that's the expected-failure assertion for the post-cutover state ("plaintext is gone, expect Connection refused"); intentionally left intact. - Go production code — this is purely a deploy-image / probe / docs / Helm-chart fix. Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-u-healthcheck_protocol_mismatch Audit recommendation followed verbatim: 'change Dockerfile:80 to CMD curl -kf https://localhost:8443/health'. |
||
|
|
52248be717 |
v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP
Breaking change release. Plaintext HTTP listener removed. The certctl control plane now terminates TLS 1.3 on :8443 via http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md. Server - cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback), preflightServerTLS validation - cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe, watchSIGHUP wiring, cert/key path config threading - tls_test.go: 418-line regression coverage of reload, preflight, callback behavior, SAN validation Config - CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required) - Plaintext rejection: agents/CLI/MCP pre-flight-fail on http:// URLs with a pointer to docs/upgrade-to-tls.md Agents, CLI, MCP - All three pre-flight-reject http:// URLs with fail-loud diagnostic - CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust - CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass (loud warning on startup) - install-agent.sh emits both vars as commented template lines docker-compose - certctl-tls-init sidecar generates SAN-valid self-signed cert into deploy/test/certs/ on first boot - All demo-stack curls pin against ca.crt with --cacert Helm chart - Three TLS provisioning modes, exactly one required: - server.tls.existingSecret (operator-supplied) - server.tls.certManager.enabled (cert-manager integration) - server.tls.selfSigned.enabled (eval only — not for production) - server-certificate.yaml template for cert-manager mode - helm install without a TLS source fails at template render with a pointer to docs/tls.md CI - .github/workflows/ci.yml Helm Chart Validation step renders the chart in both existingSecret and cert-manager modes, plus an inverse guard-regression test that asserts helm template MUST refuse to render when no TLS source is configured. Previously the single `helm template` invocation hit the certctl.tls.required fail-loud guard and exit-1'd CI. Four invocations now: lint (existingSecret), template (existingSecret), template (cert-manager), template (no args — must fail). Integration tests - deploy/test/integration_test.go stands up the Compose stack over HTTPS, extracts the CA bundle, and exercises every certctl API over https://localhost:8443 - All 34 integration subtests green (per Phase 8 local CI-parity) Documentation - New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload) - New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade warnings, fleet-roll sequencing) - CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry (file heading unchanged; release tag is v2.0.47) - All curls in docs/, examples/, deploy/helm/ guides use https://localhost:8443 --cacert Verification - grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits - grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin API default, SSRF doc comment) — zero certctl endpoints - Tasks #197–#206 (Phases 0–8) all closed in the tracker Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix). |
||
|
|
675b87ba63 |
I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.
Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
(next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
Dead rows clear next_retry_at so the index stops matching them.
Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
exponential backoff capped at 1h (notifRetryBackoffCap) with
5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
(audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
status -> pending via notifRepo.Requeue.
Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.
StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
NewStatsService call sites (main.go + stats_test.go + 8 digest
tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
(reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
the same snapshot.
Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.
Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
with queryKey: ['notifications', activeTab] so switching tabs
doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
full-text title tooltip.
- Requeue mutation wrapped as
mutationFn: (id: string) => requeueNotification(id)
to prevent react-query v5's positional context argument from
leaking into the API client — pinned against future refactors
by strict-match toHaveBeenCalledWith('notif-dead-001') in
NotificationsPage.test.tsx:181.
Closes I-005.
|
||
|
|
91642e2860 |
C-001 scope expansion: tighten parallel POST /api/v1/certificates call sites to six-field contract
Problem: |
||
|
|
3287e174dc |
Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001)
Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006)
on top of the C-001/C-002 ownership + agent-FK contract fixes landed in
|
||
|
|
5567d4b411 |
feat(M47): add Kubernetes Secrets target + AWS ACM PCA issuer connectors
Implement both M47 connectors with full cross-layer wiring: Kubernetes Secrets target: DNS-1123 validation, kubernetes.io/tls Secret create-or-update, chain concatenation, serial number validation, Helm RBAC gating. 18 tests. AWS ACM Private CA issuer: synchronous issuance (like Vault), ARN regex validation, RFC 5280 revocation reason mapping, CA cert retrieval, factory + env var seeding. 23 tests. Cross-cutting: domain types, service validation, config, factory, agent dispatch, frontend (TargetsPage, issuerTypes), OpenAPI, seed data, Helm chart, connectors docs, README. Testing docs (testing-guide, qa-test-guide, qa_test.go) with Parts thematically integrated near related connectors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
e5516d7286 |
test: add unified QA test suite (qa_test.go) replacing legacy bash smoke script
1717-line Go test file covering all 52 Parts of testing-guide.md against the Docker Compose demo stack. ~120 automated subtests (API, DB, source, perf), 11 skipped Parts with reasons, ~270 manual gaps documented. Audited against actual router, seed data, domain structs, and migrations — 8 factual bugs caught and fixed during review. Companion guide at docs/qa-test-guide.md. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
adfb682754 |
feat: Go integration test suite replacing bash end-to-end tests
Refactors deploy/test/run-test.sh into a typed Go test file with crypto/x509 certificate parsing, eliminating fragile openssl text scraping. 12 phases, 35 subtests covering Local CA, ACME, step-ca, revocation, discovery, renewal, EST, S/MIME, and API spot checks. - testClient HTTP helper with Bearer auth - testDB PostgreSQL helper (port 5432 now exposed) - waitFor/waitForJobsDone polling helpers - crypto/x509 for EKU, KeyUsage, SAN verification - crypto/tls for NGINX deployment verification - //go:build integration tag (not in CI yet) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
0822f748a5 |
feat: S/MIME certificate support in integration tests + test env docs
Add S/MIME (emailProtection EKU) end-to-end test coverage: - ValidateCommonName() now accepts email addresses for S/MIME certs - S/MIME test profile (prof-test-smime) in seed data - Phase 11 test: issuance, EKU, KeyUsage, email SAN verification - EST config enabled in test Docker Compose - Portable KeyUsage parsing (awk, works on BSD/GNU) - Full test environment documentation (docs/test-env.md) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
b059ec930f |
fix: end-to-end certificate lifecycle bugs + integration test environment
Fixes 12 production bugs preventing the full issuance→deployment flow from working with ACME (Pebble/Let's Encrypt) and step-ca issuers: ACME connector (acme.go): - Save orderURI before WaitOrder overwrites it (Go crypto/acme bug) - Add CreateOrderCert fallback via WaitOrder+FetchCert - Remove defer-reset in ValidateConfig that caused nil pointer panic - Add Insecure TLS option for self-signed ACME servers (Pebble) step-ca connector (stepca.go, jwe.go): - Real JWE provisioner key loading + decryption (was using ephemeral keys) - Fix JWT audience (/1.0/sign), sha claim (key fingerprint), kid header - Custom root CA trust via RootCertPath config - Remove hardcoded 90-day validity default (let step-ca decide) NGINX target connector (nginx.go): - Use sh -c for validate/reload commands (shell interpretation) - Use filepath.Dir instead of fragile string slicing - Add private key file writing (agent-mode keys were never deployed) - Make chain_path write conditional Server/service layer: - TriggerRenewalWithActor now creates actual Job records (was no-op) - createDeploymentJobs falls back to DB query when cert.TargetIDs empty - ProcessPendingJobs skips agent-routed deployment jobs - Agent cert pickup path parsing: len(parts)<4 → len(parts)<3 - Health/ready/auth-info endpoints bypass auth middleware - Write timeout 15s→120s for ACME issuance - Cert fingerprint computed on CSR submission Integration test environment (deploy/test/): - 10-phase test script covering Local CA, ACME, step-ca, revocation, discovery, renewal, and API spot checks - Docker Compose with 7 containers (server, agent, postgres, nginx, pebble, challtestsrv, step-ca) on isolated network - TLS verification checks SAN (not just Subject CN) for modern CA compat Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |