ci: floor raise + doc drift (Phase 3 closure — TEST-H1/H2/M1/M2/M3/M4/L1, ARCH-H3/L1/L2/L3/L4)

Twelve findings from the architecture diligence audit's Phase 3 bundle closed in one PR. All touch the CI workflows + small doc-drift fixes across the production Go tree + migration headers. CI workflow changes ==================== TEST-H1 — Race detection on ./... -short .github/workflows/ci.yml:106 was a 9-package explicit list. Audit finding TEST-H1 flagged that 25+ packages (internal/auth/*, internal/repository/*, internal/mcp, internal/scep, internal/pkcs7, internal/api/router, internal/api/acme, internal/cli, internal/cms, internal/config, internal/deploy, internal/integration, internal/ratelimit, internal/secret, internal/trustanchor, all of cmd/) silently dropped off race coverage. Post-fix: 'go test -race -short ./... -count=1 -timeout 600s'. 76 testing.Short() guards already cover testcontainers + live-DB integration suites, so -short keeps the long-running tests out. TEST-H2 — Cross-platform build matrix New 'cross-platform-build' job in ci.yml. Matrix: ubuntu-latest + windows-latest + macos-latest, fail-fast: false. Builds cmd/server + cmd/agent + cmd/cli + cmd/mcp-server on each. Catches Windows-specific regressions (path separators, file permissions, exec.Command semantics) the pre-Phase-3 Ubuntu-only CI missed. TEST-L1 — actions/setup-go cache: true (explicit) setup-go v5 defaults cache: true; making it explicit so a future setup-go upgrade can't silently flip it. Re-runs hit the Go module + build cache instead of recompiling cold. TEST-M1 — Mutation-testing floor at 55% security-deep-scan.yml::go-mutesting step rewritten. Removed continue-on-error + per-package '|| true'. New post-loop check extracts every 'The mutation score is X.YZ' line and fails the step if any package drops below 0.55. Floor rationale: starter ratio catches major regressions without rejecting the audit's 'this is OK' steady state; raise quarterly. TEST-M2 — 3 advisory deep-scan gates promoted to blocking Removed continue-on-error: true from: - gosec (filtered to G201/G202/G304/G108 high-signal rules: SQL-injection + path-traversal + pprof-exposed) - osv-scanner (multi-ecosystem CVE; complements govulncheck which is already blocking in ci.yml) - trivy image scan (--severity HIGH,CRITICAL --exit-code 1) continue-on-error count: 15 → 11. ZAP / schemathesis / nuclei / testssl stay advisory because their false-positive rates on https://localhost:8443-targeted DAST runs are high. TEST-M3 — Playwright harness stub web/package.json adds '@playwright/test' devDep + 'e2e' / 'e2e:install' npm scripts. web/playwright.config.ts ships single chromium project with webServer block pointing at 'npm run dev'. web/src/__tests__/ e2e/smoke.spec.ts proves the harness wires through. The full 15-flow suite ships in frontend-design-audit Phase 8 (TEST-H1 in THAT audit); this is the wiring + a single smoke test as the regression floor. New Makefile target: 'make e2e-test'. Doc/code drift fixes ==================== TEST-M4 + ARCH-L2 — Skip inventory artifact + CI guard scripts/skip-inventory.sh walks every t.Skip site under cmd/ + internal/ + deploy/test/ and emits docs/testing/skip-inventory.md grouped by package with file:line:expression triples. Current inventory: 142 t.Skip sites, 76 testing.Short() guards. scripts/ci-guards/skip-inventory-drift.sh regenerates and fails on diff (excluding the 'Last reviewed' timestamp line which drifts daily). The Markdown is the canonical acquisition-diligence artifact for 'what tests are being skipped and why.' ARCH-H3 — MCP catalogue floor reconciliation Audit framing was '121 vs floor 150 — doc/code drift.' Live count via the test's actual regex over all 5 tool files (tools.go + tools_audit_fix.go + tools_auth.go + tools_auth_bundle2.go + tools_est.go): 155 unique 'Name: "certctl_*"' declarations. Pre-Phase-3 audit measured tools.go in isolation (121) and missed the other 4 files (+34 unique names). The test at internal/ciparity/surface_parity_test.go::TestSurfaceParity_MCP passes today (155 ≥ 150). Added a clarifying comment near mcpBaselineFloor explaining the measurement scope so future reviewers don't repeat the audit's framing error. STATUS: stale — no code drift, just a measurement scoping error in the audit. ARCH-L1 — panic() rationale comments 5 panic sites in production Go (excluding _test.go): - internal/repository/postgres/tx.go:84 - internal/service/issuer.go:861 (mustJSON) - internal/service/est.go:728 (mustParseTime) - internal/service/acme.go:1288 (rand source failure — already documented) - internal/pkcs7/certrep.go:270 (OID marshal — already documented) Added ARCH-L1 rationale comments to the 3 sites that didn't have them. All 5 are defensible impossible-path / rethrow / hardcoded- constant guards. ARCH-L3 — Migration IF-NOT-EXISTS carve-outs 4 migrations skip the literal 'IF NOT EXISTS' token but ARE idempotent via different Postgres patterns: - 000014_policy_violation_severity_check.up.sql: ALTER TABLE ADD CONSTRAINT CHECK doesn't accept IF NOT EXISTS; idempotency via DROP CONSTRAINT IF EXISTS preamble. - 000018_audit_events_worm.up.sql: CREATE OR REPLACE FUNCTION + DROP TRIGGER IF EXISTS + CREATE TRIGGER + DO $$ pg_roles existence check. CREATE TRIGGER doesn't take IF NOT EXISTS. - 000030_rbac_admin_perms.up.sql: INSERT ... ON CONFLICT DO NOTHING. - 000039_audit_crit1_perms.up.sql: same INSERT + ON CONFLICT pattern. Added ARCH-L3 header comments to each explaining the carve-out so reviewers don't flag the missing literal token. STATUS: largely stale — migrations are already idempotent. ARCH-L4 — TODO/FIXME → see #<descriptor> 5 TODOs rewritten to the allowed 'see #<descriptor>' pattern: - internal/repository/postgres/auth.go:220 → see #bundle-2-scope-fk - internal/connector/discovery/gcpsm/gcpsm.go:547 → see #gcpsm-pagination - internal/service/audit.go:244 → see #audit-pagination-count - internal/service/job.go:295, 299 → see #validation-job-impl New CI guard scripts/ci-guards/no-todo-in-prod.sh grep-fails any new TODO/FIXME in cmd/ + internal/ (excluding _test.go); allows 'see #N' / 'see #<descriptor>' patterns. Sandbox limitation ================== The 6.1 GB certctl working tree fills the sandbox volume; go1.25.10 toolchain download fails with 'no space left on device' (sandbox has 1.25.9; go.mod requires 1.25.10). Local 'go test' / 'go build' NOT run in this commit. Operator must run 'make verify' on their workstation before push per CLAUDE.md operating rules. The smoke.spec.ts NOT executed in the sandbox (no chromium installed). Operator runs 'cd web && npm install && npx playwright install --with-deps chromium && npm run e2e' on first wire-up. All CI guards (no-todo-in-prod, skip-inventory-drift, G-3 env-docs-drift, doc-rot-detector, and every existing guard) verified clean by running each individually. Closes: cowork/certctl-architecture-diligence-audit.html#fix-TEST-H1, cowork/certctl-architecture-diligence-audit.html#fix-TEST-H2, cowork/certctl-architecture-diligence-audit.html#fix-TEST-M1, cowork/certctl-architecture-diligence-audit.html#fix-TEST-M2, cowork/certctl-architecture-diligence-audit.html#fix-TEST-M3, cowork/certctl-architecture-diligence-audit.html#fix-TEST-M4, cowork/certctl-architecture-diligence-audit.html#fix-TEST-L1, cowork/certctl-architecture-diligence-audit.html#fix-ARCH-H3, cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L1, cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L2, cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L3, cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L4
2026-06-07 15:51:30 +00:00 · 2026-05-13 20:10:08 +00:00
parent 69a2b5c55a
commit 02438ad9e1
22 changed files with 702 additions and 23 deletions
@@ -20,6 +20,11 @@ jobs:
        uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff  # v5
        with:
          go-version: '1.25.10'
+          # Phase 3 TEST-L1 closure (2026-05-13): enable Go's module +
+          # build cache so re-runs hit the cache instead of recompiling
+          # the world. setup-go v5 cache: true by default; making it
+          # explicit so a future setup-go upgrade can't silently flip it.
+          cache: true

      - name: Go Build
        run: |
@@ -103,7 +108,23 @@ jobs:
        run: staticcheck ./...

      - name: Race Detection
-        run: go test -race ./internal/service/... ./internal/api/handler/... ./internal/api/middleware/... ./internal/scheduler/... ./internal/connector/... ./internal/crypto/... ./internal/domain/... ./internal/validation/... ./internal/tlsprobe/... -count=1 -timeout 300s
+        # Phase 3 TEST-H1 closure (2026-05-13): the pre-Phase-3 invocation
+        # listed 9 explicit package roots, excluding internal/auth/*,
+        # internal/repository/*, internal/mcp, internal/scep, internal/pkcs7,
+        # internal/api/router, internal/api/acme, internal/cli, internal/cms,
+        # internal/config, internal/deploy, internal/integration,
+        # internal/ratelimit, internal/secret, internal/trustanchor, plus
+        # all of cmd/. Audit finding TEST-H1 flagged this as silent
+        # race-detection drift — packages added after the original list
+        # was authored were never covered.
+        #
+        # Post-Phase-3: ./... with -short. The 76 testing.Short() guards
+        # already in the integration-test surface (testcontainers, live-DB,
+        # multi-process) gate behind this flag, so race detection runs
+        # across every package without dragging in long-running suites.
+        # Timeout doubled from 300s to 600s because ./... is broader; the
+        # broader scope is what makes race coverage trustworthy.
+        run: go test -race -short ./... -count=1 -timeout 600s

      - name: Go Test with Coverage
        # internal/ciparity/... — post-v2.1.0 anti-rot item 2 surface-
@@ -175,6 +196,41 @@ jobs:
          done
          exit $fail

+  cross-platform-build:
+    # Phase 3 TEST-H2 closure (2026-05-13): the pre-Phase-3 CI ran
+    # exclusively on ubuntu-latest, leaving Windows-specific bugs
+    # (path separators, file permissions, exec.Command semantics)
+    # undetected. The agent + CLI binaries ship for Windows + macOS
+    # users; this matrix asserts they at least BUILD on every OS we
+    # claim to support.
+    #
+    # Build-only — no test run. Full test parity across OSes is a
+    # larger investment (testcontainers is Linux-only on Windows CI
+    # runners, file-permission tests differ, etc.). The build gate
+    # is the minimum that catches the cross-platform regressions
+    # we've seen in practice.
+    name: Cross-platform build (ubuntu / windows / macos)
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ubuntu-latest, windows-latest, macos-latest]
+    runs-on: ${{ matrix.os }}
+    steps:
+      - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5  # v4
+
+      - name: Set up Go
+        uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff  # v5
+        with:
+          go-version: '1.25.10'
+          cache: true
+
+      - name: Build server + agent + CLI + mcp-server
+        run: |
+          go build ./cmd/server
+          go build ./cmd/agent
+          go build ./cmd/cli
+          go build ./cmd/mcp-server
+
  cold-db-compose-smoke:
    # Per post-v2.1.0 anti-rot item 6 (Auditable Codebase Bundle).
    #