certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 17:02:43 +00:00

Author	SHA1	Message	Date
shankar0123	a05a7d3dad	ci: fix Phase 1b post-push CI failures (3 guards) Phase 1b push (commit `44a85d6`) failed three CI guards. None were caught by `make verify` locally because they're CI-only guards that aren't part of the Makefile target. This commit fixes all three. 1. go.mod tidy diff. The go-jose v4 dep was added with `// indirect` in go.mod after the initial `go get`, but the codebase imports it directly from internal/api/acme/jws.go + service/acme.go + handler/acme.go. CI's `go mod tidy && git diff --exit-code go.mod go.sum` flagged the staleness. Promoted to a direct require in the same `require (...)` block as github.com/aws/aws-sdk-go-v2 etc. 2. G-3-env-docs-drift.sh. The guard greps `\bCERTCTL_[A-Z_]+\b` in docs/ and complains when the bare-prefix forms don't match anything defined in config.go. Phase 1a + 1b's docs/acme-server.md intro and migration header use bare-prefix forms `CERTCTL_ACME_` and `CERTCTL_ACME_SERVER_` to describe namespace separation (consumer-side ACMEConfig vs server-side ACMEServerConfig). Same precedent as the existing CERTCTL_SCEP_ + CERTCTL_TLS_ + CERTCTL_QA_* prefix entries already in the guard's ALLOWED list. Added CERTCTL_ACME_ + CERTCTL_ACME_SERVER_ to the ALLOWED list with a justification comment block matching the existing integration-surface allowlist convention. 3. openapi-handler-parity.sh. Distinct from internal/api/router/openapi_parity_test.go (which runs at `go test` time and has its own SpecParityExceptions map I extended in 1a + 1b) — this is a separate CI-only guard that reads api/openapi-handler-exceptions.yaml. The 6 Phase-1a routes + 4 Phase-1b routes (10 ACME endpoints total) were never added to that yaml. Same rationale as the SCEP/SCEP-mTLS entries already in the file: ACME is a JWS-signed-JSON wire protocol per RFC 8555 + RFC 9773, not an OpenAPI-shape REST surface. Documenting every endpoint in openapi.yaml would duplicate the RFC. The canonical reference is docs/acme-server.md. 
Phases 2-4 will add their routes to this yaml in lockstep with router.go. Verified locally: - bash scripts/ci-guards/G-3-env-docs-drift.sh → clean. - bash scripts/ci-guards/openapi-handler-parity.sh → clean (152 router routes, 136 OpenAPI ops, 18 documented exceptions). - All other ci-guards/*.sh → clean. - go.mod diff after `go mod tidy` is empty.	2026-05-03 13:31:35 +00:00
shankar0123	2643a427ac	ci(digest-validity): exclude Windows IIS digest — image is doc-only, not pulled by Linux CI CI run #376 (commit `a1c7741`, Frontend Build job) failed with: digest does not resolve: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2022@sha256:8d0b0e651ad514e3fb05978db66f38036118812e1b9314a48f10419cad8a3462 A re-run with no code changes went green. The digest itself is fine — verified against MCR directly (HTTP 200 from mcr.microsoft.com/v2/windows/servercore/iis/manifests/sha256:8d0b...), and the tag `:windowsservercore-ltsc2022` currently resolves to that exact digest. Microsoft hasn't rotated. Root cause is registry-side rate-limiting. MCR throttles unauthenticated GET-by-digest requests by source IP. GitHub-hosted runners share a small pool of egress IPs across many users; bursts trip the throttle and return non-200. Re-run = different runner = different IP = throttle window has reset = pass. This will recur on roughly N% of pushes indefinitely, until either (a) Microsoft loosens MCR rate limits, (b) GitHub buys more runner IPs, or (c) we stop verifying digests CI doesn't actually use. The deeper issue is structural, not transient. The Windows IIS image is gated behind compose `profiles: [deploy-e2e-windows]` (deploy/docker-compose.test.yml:700). The comment block above the service definition (lines 675-691) explicitly says "Linux CI never activates this profile." All 10 TestVendorEdge_IIS_*_E2E tests are on scripts/vendor-e2e-skip-allowlist.txt because the sidecar is never started. The whole Windows matrix was DELETED in ci-pipeline-cleanup Phase 6 / frozen decision 0.5 (revising Bundle II decision 0.4); IIS validation moved to docs/connector-iis.md::Operator validation playbook. 
So `digest-validity.sh` is verifying a digest that no CI job ever pulls — paying CI brittleness against MCR rate-limiting we can't control, for an image whose only purpose in compose is documentation for an operator's manual workflow on a real Windows host. The fix matches the guard's stated purpose ("every digest CI actually depends on is valid"): exclude images CI never pulls. Implementation. Add an EXCLUDED_PATTERNS array near the top of the script with one entry — the IIS image path `mcr.microsoft.com/windows/servercore/iis` — and a comment block above it documenting: - WHY it's excluded (gated profile, never started, all tests on skip-allowlist) - WHEN it would need re-inclusion (if a Windows CI runner is added that actually starts the sidecar) - WHAT this list is NOT for (transient flake silencing — that gets fixed via retry logic in the script, not via exclusion) The match is by image-path substring, not by digest, so future tag/digest updates of the same image still hit the exclusion without needing this list to be re-edited. Loop logic gains a 6-line check that runs the exclusion match before any registry work. Excluded refs log as "SKIP (excluded) <ref>" so operator-facing CI logs stay informative — at a glance you can see which digests were verified vs which were intentionally not. The success message updates to differentiate verified vs excluded counts: "digest-validity: clean — N verified, M excluded (CI never pulls)" when M > 0; original message preserved when M == 0. Verified manually: - Clean repo: 15 verified, 1 excluded, exit 0. - Fabricated bogus httpd digest: ::error:: emitted for the bad digest, IIS still SKIP-excluded, exit 1. (Real regressions still caught.) - Restore: 15 verified, 1 excluded, exit 0 again. Other recurring MCR-hosted images would warrant the same treatment if they get added later. The exclusion list pattern scales: each new entry needs its own "WHY this is doc-only" justification block. 
What this is NOT: - Not a generic flake-silencer. The exclusion is justified by the image being doc-only, not by the test being noisy. - Not a global retry/resilience layer. If MCR rate-limits an image CI DOES pull, that's a real CI dependency on an unreliable external service — fix by retry-with-backoff, not by excluding.	2026-05-01 03:06:49 +00:00
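The exclusion check described in 2643a427ac could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of digest-validity.sh: the function names `is_excluded` and `classify_ref`, the `VERIFY` label, and the `sha256:0000` placeholder digests are hypothetical, and the registry lookup itself is omitted. Only the `EXCLUDED_PATTERNS` array name, the IIS image path, the substring match, and the `SKIP (excluded) <ref>` log format come from the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Image paths CI never pulls. Each entry should carry its own justification.
EXCLUDED_PATTERNS=(
  "mcr.microsoft.com/windows/servercore/iis"
)

# Succeeds when the ref contains any excluded image path as a substring.
# Matching on the path (not the digest) survives tag or digest bumps.
is_excluded() {
  local ref="$1" pat
  for pat in "${EXCLUDED_PATTERNS[@]}"; do
    if [[ "$ref" == *"$pat"* ]]; then
      return 0
    fi
  done
  return 1
}

# Hypothetical helper: decide what to do with a ref before any registry work.
classify_ref() {
  local ref="$1"
  if is_excluded "$ref"; then
    printf 'SKIP (excluded) %s\n' "$ref"
  else
    printf 'VERIFY %s\n' "$ref"
  fi
}

# Placeholder refs; the digests here are not real.
classify_ref "mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2022@sha256:0000"
classify_ref "docker.io/library/httpd@sha256:0000"
```

Running the exclusion test before any network call is what keeps a throttled registry from affecting refs that CI never pulls.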
shankar0123	a1c7741e1b	fix(deploy/test) + ci(guard): drop dead SCEP profile from test compose The deploy-vendor-e2e job has been failing with the certctl-test-server container restarting endlessly. Diagnostic dump (added in `3b96b35`) finally surfaced the actual cause: Failed to load configuration: SCEP profile 0 (PathID="e2eintune") has empty CHALLENGE_PASSWORD — refuse to start (CWE-306: per-profile shared secret is the sole application-layer auth boundary; an empty password would allow any client reaching /scep/e2eintune to enroll a CSR against issuer "iss-local") Same shape as the encryption-key fix that landed in `c4157fd`: a config validation gate added in code that the test compose never got updated to satisfy, hidden pre-Phase-5 because the matrix-collapse hadn't yet forced the certctl-server to actually boot in CI. Root cause is more interesting than just "missing env var." The 2026-04-29 SCEP RFC 8894 + Intune master bundle Phase I added an `e2eintune` SCEP profile to docker-compose.test.yml expecting deploy/test/scep_intune_e2e_test.go to exercise it. That integration test does exist (//go:build integration) but NO CI job ever selects it — ci.yml's deploy-vendor-e2e job runs only `-run 'VendorEdge_'` (line 379), and no other job invokes `go test -tags integration` with a SCEP selector. Confirmed via `grep -rnE "scep_intune\|SCEPIntune" .github/workflows/` returning empty. Worse: the supporting fixtures (ra.crt + ra.key + intune_trust_anchor.pem) were documented in deploy/test/fixtures/README.md with the regeneration recipe but never actually committed. Pre-Phase-5 the test stack didn't fully boot the server in CI, so the entire stack of debt — dead config + missing fixtures + no consumer test — sat silent until the matrix collapse forced the boot path. Fixing this with a fake CHALLENGE_PASSWORD value would silence the immediate validator but leave the real problem in place: maintenance cost on test config that no test exercises. 
Same critique applies to "let me commit fake fixtures" — the fixtures alone don't add test coverage when no CI job runs the SCEP test. The complete-path fix is to make the test compose match what CI actually exercises: - deploy/docker-compose.test.yml: drop CERTCTL_SCEP_ENABLED + the full e2eintune profile env var family (10 lines) + the ./test/fixtures volume mount (1 line). Replace with an in-line comment explaining why SCEP is intentionally disabled and what needs to come back together when SCEP is added to CI for real. - scripts/ci-guards/test-compose-scep-coherence.sh (new, 22nd guard): refuses any future state where CERTCTL_SCEP_ENABLED=true in test compose without ALL of: 1. A CI job that runs the SCEP integration test (matched by scep_intune \| SCEPIntune \| -run [Ss]cep in ci.yml) 2. The fixture files actually committed (ra.crt, ra.key, intune_trust_anchor.pem) 3. The ./test/fixtures:/etc/certctl/scep:ro volume mount Verified manually with the same pattern as the H-1 guard: clean tree → exit 0; deliberate SCEP_ENABLED=true regression → exit 1 with 5 ::error:: annotations covering each gap; restore → exit 0 again. - scripts/ci-guards/README.md: 21 → 22 guards, new row. The fixtures README at deploy/test/fixtures/README.md keeps the regeneration recipe so the eventual SCEP CI job lands cleanly: the operator who adds the SCEP job restores the env vars, regenerates + commits the fixtures, and the guard auto-passes. Pattern (now firm across this CI-stabilization sequence): - Pre-existing latent bug - Old CI structurally hid it (per-vendor matrix, missing boot path) - Phase-5 matrix collapse + new diagnostic infra exposed it - Direct fix unblocks today - Regression guard prevents the same shape of drift forever Encryption-key (`c4157fd`) was the same shape; this is its sibling.	2026-05-01 01:39:18 +00:00
shankar0123	c4157fd196	fix(deploy/test) + ci(guard): unblock deploy-vendor-e2e — encryption-key length Two-part complete-path fix for the deploy-vendor-e2e failure that has been firing since the ci-pipeline-cleanup Phase 5 matrix collapse started actually booting the certctl-test-server: Failed to load configuration: CERTCTL_CONFIG_ENCRYPTION_KEY too short (29 bytes; minimum 32). Surfaced via the diagnostic-dump step landed in commit `3b96b35` — the server panicked on startup, Docker restarted it endlessly, compose reported the dependency-chain symptom ("container certctl-test-server is unhealthy"), but the actual cause was invisible in the previous CI output. With the dump in place, the next failing run named the problem in one line. Root cause. The H-1 audit-closure master commit `3e78ecb` ("feat(security): bodyLimit on noAuth + security headers + encryption-key validation (H-1 master)") added internal/config/config.go's minEncryptionKeyLength = 32 byte floor + 5 unit tests that pin it. The closure was incomplete: it never enforced the rule against the literal CERTCTL_CONFIG_ENCRYPTION_KEY values certctl's own deploy/docker-compose.yml files pass. Pre-Phase-5 the test stack didn't fully exercise the validator (the per-vendor matrix didn't boot certctl-test-server in every job), so the gap was silent. deploy/docker-compose.test.yml's literal value `test-encryption-key-32chars!!` was 29 bytes — the name claimed 32 but the author miscounted (4+1+10+1+3+1+2+5+2 = 29). Pattern matches every fix in this CI-stabilization sequence: pre-existing latent bug that the old CI structurally hid. Part 1 — direct fix (deploy/docker-compose.test.yml): Replace the 29-byte literal with a clearly test-only, self-documenting 49-byte value (`test-encryption-key-deterministic-32-byte-fixture`). 17 bytes of safety margin so a future tightening of the floor (32 → 33+) doesn't break this fixture again. 
Inline comment block explains the byte-budget contract + points at the H-1 closure commit. Production deploy/docker-compose.yml's default (`change-me-32-char-encryption-key`) is exactly 32 bytes — passes by 1 byte but on the edge; not touched here because operators are already told to override it via env (`${VAR:-default}`). Part 2 — structural fix (scripts/ci-guards/H-1-encryption-key-min-length.sh): New regression guard. Scans every deploy/docker-compose.yml for literal CERTCTL_CONFIG_ENCRYPTION_KEY values + values inside ${VAR:-default} expansions, checks each against the 32-byte floor, fails CI with `::error::` annotation pointing at the offending file:line if any literal regresses. Bare ${VAR} env references with no default are skipped — those are operator-supplied at runtime and the validator handles them at boot. Verified manually: - Clean repo: `H-1-encryption-key-min-length: clean.` (exit 0) - 5-byte regression: emits proper ::error:: annotation, exit 1 - Restore: clean again (exit 0) CI auto-picks up the new guard via the `for g in scripts/ci-guards/*.sh; do bash "$g"; done` loop in ci.yml's Regression guards step (no ci.yml change required). scripts/ci-guards/README.md updated: 20 → 21 guards, new row explaining the closure rationale. The structural piece is the more important half of this fix. The direct fix unblocks today's CI; the guard prevents the same class of drift from ever recurring silently. Future audit closures that add new validation rules to internal/config/config.go now have a working template for the matching CI guard — drop a sibling .sh in the ci-guards directory. Bonus — what the diagnostic-dump step (`3b96b35`) bought us. Before that step landed, the same failure looked like an opaque "container unhealthy" with no actionable signal. With it, the actual error message + the offending env var + the exact byte count came out in one CI run. The diagnostic infrastructure paid for itself within one push.	2026-05-01 00:57:43 +00:00
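The byte counts quoted in c4157fd196 can be checked with a short sketch like the one below. It is not the H-1 guard itself (which scans compose files); `byte_len`, `check_key`, and `MIN_KEY_BYTES` are hypothetical names, and only the three key literals and the 32-byte floor are taken from the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

MIN_KEY_BYTES=32  # floor named in the commit message (minEncryptionKeyLength)

# Print the byte length of a string (wc -c counts bytes, not characters).
byte_len() {
  printf '%s' "$1" | wc -c | tr -d '[:space:]'
}

# Succeeds when the literal meets the floor; otherwise emits a CI annotation.
check_key() {
  local value="$1" n
  n="$(byte_len "$value")"
  if [ "$n" -lt "$MIN_KEY_BYTES" ]; then
    printf '::error::key too short (%s bytes; minimum %s)\n' "$n" "$MIN_KEY_BYTES"
    return 1
  fi
  printf 'ok (%s bytes)\n' "$n"
}

# Old test literal, new test literal, and the production default.
check_key 'test-encryption-key-32chars!!' || true
check_key 'test-encryption-key-deterministic-32-byte-fixture'
check_key 'change-me-32-char-encryption-key'
```

Counting with `wc -c` rather than `${#value}` keeps the check in bytes, which is the unit the validator's error message reports.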
shankar0123	7b8cadcd02	refactor(scripts): move CI helpers out of scripts/ci-guards/ The 'Regression guards' loop step in ci.yml runs: for g in scripts/ci-guards/*.sh; do bash "$g"; done Per the directory's own contract (scripts/ci-guards/README.md), every script there MUST be runnable bare with no args / no env. Three files violated that contract — they're helpers consumed by specific CI job steps with arguments, not regression guards. They were misplaced. Moved (git mv): scripts/ci-guards/vendor-e2e-skip-check.sh → scripts/ scripts/ci-guards/vendor-e2e-skip-allowlist.txt → scripts/ scripts/ci-guards/coverage-pr-comment.sh → scripts/ Updated ci.yml call sites: - deploy-vendor-e2e job: bash scripts/vendor-e2e-skip-check.sh $LOG - go-build-and-test job: bash scripts/coverage-pr-comment.sh Tightened scripts/vendor-e2e-skip-check.sh arg parse from a silent default ('LOG=${1:-test-output.log}') to a mandatory-arg form ('LOG=${1:?usage: ...}') so misuse fails loud at parse time rather than at the missing-file check. Updated scripts/ci-guards/README.md contract to spell out the guard-vs-helper distinction explicitly; lists current helpers under scripts/ for future-author guidance. Verified locally: 'for g in scripts/ci-guards/*.sh; do bash $g; done' returns clean (22 guards pass) on HEAD post-move. Closes the regression-guards-loop failure that surfaced in CI run 25192163943 (job 73864471346 'Frontend Build').	2026-04-30 22:37:12 +00:00
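The difference between the two argument forms named in 7b8cadcd02 can be seen in a small sketch. The function names `log_with_default` and `log_required` and the usage text are hypothetical; only the `${1:-test-output.log}` and `${1:?usage: ...}` expansions come from the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Silent default: a missing argument quietly falls back to a file name.
log_with_default() {
  local log="${1:-test-output.log}"
  printf '%s\n' "$log"
}

# Mandatory form: a missing or empty argument aborts the shell with the message.
log_required() {
  local log="${1:?usage: log_required <log-file>}"
  printf '%s\n' "$log"
}

log_with_default                 # falls back to the default name
log_required "vendor-e2e.log"    # echoes the supplied name

# Run the failing case in a subshell so this demo script itself keeps going.
if ! ( log_required ) 2>/dev/null; then
  echo "missing argument rejected"
fi
```

With the default form, a forgotten argument only surfaces later as a missing-file error; with the `:?` form the script stops at the expansion and prints the usage text.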
shankar0123	f20c0961aa	ci-pipeline-cleanup Phase 10: coverage PR-comment action Bundle: ci-pipeline-cleanup, Phase 10 / frozen decision 0.9. Self-hosted alternative to Codecov / Coveralls. Posts a per-package coverage delta as a PR comment on every PR; updates the same comment in place on subsequent pushes (avoids duplicate noise). scripts/ci-guards/coverage-pr-comment.sh: - Reads coverage.out from the prior Go Test step - Builds per-package coverage table (mirrors check-coverage-thresholds averaging logic) - Searches existing PR comments for the '**Coverage report' marker and PATCHes the existing one if found, else POSTs a new one - No-op on non-PR builds (push to master, scheduled, etc.) Wired into go-build-and-test job after 'Upload Coverage Report' step with if: github.event_name == 'pull_request' guard. Operator can swap to Codecov/Coveralls later by replacing this script + step with a third-party action — the YAML manifest at .github/coverage-thresholds.yml stays unchanged either way.	2026-04-30 20:51:48 +00:00
shankar0123	b7a3162028	ci-pipeline-cleanup Phases 7-9: image-and-supply-chain job Bundle: ci-pipeline-cleanup, Phases 7-9 / frozen decisions 0.8 + 0.10 + 0.11. NEW image-and-supply-chain job (Ubuntu, ~3 min). Three steps: PHASE 7 — Digest validity scripts/ci-guards/digest-validity.sh resolves every @sha256:<digest> ref in deploy/*/.{yml,Dockerfile} against its registry. Closes the H-001 lying-field gap that Bundle II hit (11 fabricated digests passed H-001's regex-only check and failed docker pull in CI). Sandbox verification: 16/16 digests in deploy/ + Dockerfiles all return HTTP 200 from registry-1.docker.io / ghcr.io / mcr.microsoft.com. PHASE 8 — Docker build smoke (all 4 Dockerfiles) Per frozen decision 0.10: build Dockerfile, Dockerfile.agent, deploy/test/f5-mock-icontrol/Dockerfile, deploy/test/libest/Dockerfile. Catches syntax errors + COPY path drift before tag-time release.yml. The test-sidecar Dockerfiles are load-bearing for vendor-e2e — a syntax error there silently breaks the e2e suite. PHASE 9 — OpenAPI ↔ handler operationId parity scripts/ci-guards/openapi-handler-parity.sh extracts router routes (r.mux.Handle / r.Register "METHOD /path" syntax — Go 1.22+ ServeMux), extracts OpenAPI operations (paths × HTTP methods), and fails if any router route has no operationId AND is not documented in the new api/openapi-handler-exceptions.yaml. Verified gap at HEAD `c48a82c4` (root-caused): 142 router routes, 136 OpenAPI operations 6 router-only routes — all SCEP wire-protocol endpoints (RFC-shaped, not REST). Documented in api/openapi-handler-exceptions.yaml with one-line why: justifications. 0 OpenAPI-only operations. Going forward: any new gap fails the build unless documented. Status checks per push: now 7 (was 8 after Phase 5+6 dropped windows; this Phase adds 1 = +1 net). Final acceptance gate target. ci.yml: 383 → 432 lines (+49 for the new job + steps).	2026-04-30 20:50:52 +00:00
shankar0123	0157510d48	ci-pipeline-cleanup Phase 5+6: collapse vendor matrix; delete Windows matrix Bundle: ci-pipeline-cleanup, Phases 5+6 / frozen decisions 0.4 + 0.5 + 0.6. Revises Bundle II decisions 0.4 (Windows matrix) and 0.9 (per-vendor granularity). PHASE 5 — Linux vendor matrix collapsed (12 jobs → 1): The previous per-vendor matrix produced 12 status-check rows for ~1 real assertion (115/116 vendor-edge tests are t.Log placeholders per Bundle II Phase 2-13 design). Granularity was fake signal. Single-job version: brings up all 11 sidecars at once via docker compose --profile deploy-e2e up -d, runs go test -run 'VendorEdge_' once, tears down once. Critical caveat: requireSidecar() in deploy/test/vendor_e2e_helpers.go uses t.Skipf() when a sidecar isn't reachable — silent test skip, not CI failure. The new Skip-count enforcement step (scripts/ci-guards/vendor-e2e-skip-check.sh) counts SKIP lines and fails the build if it exceeds the allowlist at scripts/ci-guards/vendor-e2e-skip-allowlist.txt (15 windows-iis-requiring tests legitimately skip on Linux per Phase 6). PHASE 6 — Windows matrix deleted entirely: The deploy-vendor-e2e-windows job removed. Two reasons: 1. Can't physically work on windows-latest today (Docker not started in Windows-containers mode by default; bridge network driver missing on Windows Docker — see CI run 25183374742 failure logs). 2. Even fixed, validates nothing — all 16 IIS + WinCertStore tests are t.Log placeholders that exercise no IIS-specific behavior. Per Bundle II frozen decision 0.14, the third criterion for "verified" status in the vendor matrix is operator manual smoke against a real instance. IIS + WinCertStore now satisfy that via the playbook (Phase 6 follow-up adds docs/connector-iis.md::Operator validation playbook). The windows-iis-test sidecar STAYS in deploy/docker-compose.test.yml under profiles: [deploy-e2e-windows] for operator local use. Linux CI never activates this profile. 
Operator-required action before merge: RAM headroom verification on prototype branch (per frozen decision 0.14). If peak RSS > 12 GB on ubuntu-latest with all 11 sidecars up, fall back to bucketed matrix per cowork/ci-pipeline-cleanup/decisions-revised.md. ci.yml: 417 → 383 lines (-34 net; -1105 cumulative since baseline 1488). Status checks per push: 19 → 7 (collapse 12 vendor + 2 windows = -14; add image-and-supply-chain in Phase 7-9 = +1; net 19-12-2+1 = ~7). Operator action for Phase 13: update GitHub branch protection rules (required-checks list 19 → 7 entries). Documented in cowork/ci-pipeline-cleanup/decisions-revised.md.	2026-04-30 20:46:05 +00:00
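The skip enforcement described in 0157510d48 might be sketched as below. This is an assumption-laden illustration, not vendor-e2e-skip-check.sh: the commit message says the script counts SKIP lines against an allowlist, whereas this variant compares skipped test names one by one. The names `unexpected_skips` and `check_skips`, the one-name-per-line allowlist layout, and the sample test names are all hypothetical. The `--- SKIP: TestName (0.00s)` line shape is what `go test -v` prints for a skipped test.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print skipped test names from a `go test -v` log that are not on the allowlist.
unexpected_skips() {
  local log="$1" allowlist="$2"
  sed -n 's/^ *--- SKIP: \([^ ]*\).*/\1/p' "$log" \
    | sort -u \
    | { grep -vxF -f "$allowlist" || true; }
}

# Emit one annotation per unexpected skip and fail if there are any.
check_skips() {
  local unexpected name
  unexpected="$(unexpected_skips "$1" "$2")"
  if [ -n "$unexpected" ]; then
    while IFS= read -r name; do
      printf '::error::unexpected skip: %s\n' "$name"
    done <<< "$unexpected"
    return 1
  fi
  echo "skip-check: clean"
}

# Demo with throwaway files; the test names are made up.
tmp="$(mktemp -d)"
printf '%s\n' \
  '--- SKIP: TestVendorEdge_IIS_Example_E2E (0.00s)' \
  '--- PASS: TestVendorEdge_Other_E2E (0.01s)' > "$tmp/test-output.log"
printf '%s\n' 'TestVendorEdge_IIS_Example_E2E' > "$tmp/allowlist.txt"
check_skips "$tmp/test-output.log" "$tmp/allowlist.txt"
rm -rf "$tmp"
```

Comparing names instead of raw counts means a newly skipped test cannot hide behind an allowlisted test that has started passing.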
shankar0123	1caedd5fd3	ci-pipeline-cleanup Phase 1: extract 20 regression guards to scripts/ci-guards/ Bundle: ci-pipeline-cleanup, Phase 1. Pure relocation — no behavior change. Each guard's bash logic is byte-identical to the prior inline version; the only changes are: (a) the guard becomes a sibling script under scripts/ci-guards/<id>.sh, (b) ci.yml's per-guard step is replaced by a single loop step that iterates all scripts. 20 scripts extracted (alphabetized): B-1-orphan-crud.sh, D-1-D-2-statusbadge-phantom.sh, G-1-jwt-auth-literal.sh, G-2-api-key-hash-json.sh, G-3-env-docs-drift.sh, H-001-bare-from.sh, H-009-readme-jwt.sh, L-001-insecure-skip-verify.sh, L-1-bulk-action-loop.sh, M-012-no-root-user.sh, P-1-documented-orphan-fns.sh, S-1-hardcoded-source-counts.sh, S-2-strings-contains-err.sh, T-1-frontend-page-coverage.sh, U-2-plaintext-healthcheck.sh, U-3-migration-mount.sh, bundle-8-L-015-target-blank-rel-noopener.sh, bundle-8-L-019-dangerously-set-inner-html.sh, bundle-8-M-009-bare-usemutation.sh, test-naming-convention.sh Plus scripts/ci-guards/README.md documenting the contract: - Each script must exit 0 on clean repo, non-zero with ::error:: prefix on regression - Runnable from repo root via 'bash scripts/ci-guards/<id>.sh' - Adding a new guard: drop a new <id>.sh; CI auto-picks it up ci.yml dropped 1488 → 557 lines (-931, -63%). Single CI loop step now collects ALL guard failures before failing the build instead of fail-fast — UX win for regressions that hit two guards at once. Two guards (QA-doc Part-count + seed-count, ci.yml lines 868-917) deliberately NOT extracted — they move to 'make verify-docs' in Phase 11 because they protect docs-the-operator-reads, not the product itself. 
Verification (sandbox): - All 20 scripts pass against HEAD (chmod +x; for g in scripts/ci-guards/*.sh; do bash $g; done) - New ci.yml YAML-parses cleanly - Job boundaries preserved: go-build-and-test, frontend-build, helm-lint, deploy-vendor-e2e, deploy-vendor-e2e-windows - Loop step appears twice (once at end of go-build-and-test, once at end of frontend-build) so both jobs continue running their set of guards	2026-04-30 20:36:26 +00:00
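One way to get the collect-all behavior that 1caedd5fd3 describes is sketched below. It is an assumption, not the loop step in ci.yml: the `run_guards` function, its summary line, and the demo scripts are hypothetical. The point it illustrates is that every guard runs and the overall failure is decided only after the loop ends.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run every *.sh in a directory, report each failure, and fail only at the end.
run_guards() {
  local dir="$1" failed=0 g
  for g in "$dir"/*.sh; do
    [ -e "$g" ] || continue          # directory had no matching scripts
    if ! bash "$g"; then
      printf '::error::guard failed: %s\n' "$g"
      failed=$((failed + 1))
    fi
  done
  printf '%s guard(s) failed\n' "$failed"
  [ "$failed" -eq 0 ]
}

# Demo against two throwaway guards; one passes and one fails.
demo_dir="$(mktemp -d)"
printf 'exit 0\n' > "$demo_dir/pass.sh"
printf 'exit 1\n' > "$demo_dir/fail.sh"
run_guards "$demo_dir" || echo "build would fail here"
rm -rf "$demo_dir"
```

Wrapping each call in `if ! bash "$g"` keeps an errexit-enabled shell from stopping at the first failing guard, so a regression that trips two guards is reported in one run.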

9 Commits