Twelve findings from the architecture diligence audit's Phase 3 bundle
closed in one PR. All touch the CI workflows + small doc-drift fixes
across the production Go tree + migration headers.
CI workflow changes
====================
TEST-H1 — Race detection on ./... -short
.github/workflows/ci.yml:106 was a 9-package explicit list. Audit
finding TEST-H1 flagged that 25+ packages (internal/auth/*,
internal/repository/*, internal/mcp, internal/scep, internal/pkcs7,
internal/api/router, internal/api/acme, internal/cli, internal/cms,
internal/config, internal/deploy, internal/integration,
internal/ratelimit, internal/secret, internal/trustanchor, all of
cmd/) silently dropped off race coverage.
Post-fix: 'go test -race -short ./... -count=1 -timeout 600s'.
76 testing.Short() guards already cover testcontainers + live-DB
integration suites, so -short keeps the long-running tests out.
TEST-H2 — Cross-platform build matrix
New 'cross-platform-build' job in ci.yml. Matrix:
ubuntu-latest + windows-latest + macos-latest, fail-fast: false.
Builds cmd/server + cmd/agent + cmd/cli + cmd/mcp-server on each.
Catches Windows-specific regressions (path separators, file
permissions, exec.Command semantics) the pre-Phase-3 Ubuntu-only
CI missed.
TEST-L1 — actions/setup-go cache: true (explicit)
setup-go v5 defaults cache: true; making it explicit so a future
setup-go upgrade can't silently flip it. Re-runs hit the Go module
+ build cache instead of recompiling cold.
TEST-M1 — Mutation-testing floor at 55%
security-deep-scan.yml::go-mutesting step rewritten. Removed
continue-on-error + per-package '|| true'. New post-loop check
extracts every 'The mutation score is X.YZ' line and fails the
step if any package drops below 0.55. Floor rationale: starter
ratio catches major regressions without rejecting the audit's
'this is OK' steady state; raise quarterly.
TEST-M2 — 3 advisory deep-scan gates promoted to blocking
Removed continue-on-error: true from:
- gosec (filtered to G201/G202/G304/G108 high-signal rules:
SQL-injection + path-traversal + pprof-exposed)
- osv-scanner (multi-ecosystem CVE; complements govulncheck
which is already blocking in ci.yml)
- trivy image scan (--severity HIGH,CRITICAL --exit-code 1)
continue-on-error count: 15 → 11.
ZAP / schemathesis / nuclei / testssl stay advisory because their
false-positive rates on https://localhost:8443-targeted DAST runs
are high.
TEST-M3 — Playwright harness stub
web/package.json adds '@playwright/test' devDep + 'e2e' / 'e2e:install'
npm scripts. web/playwright.config.ts ships single chromium project
with webServer block pointing at 'npm run dev'. web/src/__tests__/
e2e/smoke.spec.ts proves the harness wires through. The full 15-flow
suite ships in frontend-design-audit Phase 8 (TEST-H1 in THAT audit);
this is the wiring + a single smoke test as the regression floor.
New Makefile target: 'make e2e-test'.
Doc/code drift fixes
====================
TEST-M4 + ARCH-L2 — Skip inventory artifact + CI guard
scripts/skip-inventory.sh walks every t.Skip site under cmd/ +
internal/ + deploy/test/ and emits docs/testing/skip-inventory.md
grouped by package with file:line:expression triples. Current
inventory: 142 t.Skip sites, 76 testing.Short() guards.
scripts/ci-guards/skip-inventory-drift.sh regenerates and fails on
diff (excluding the 'Last reviewed' timestamp line which drifts
daily). The Markdown is the canonical acquisition-diligence artifact
for 'what tests are being skipped and why.'
ARCH-H3 — MCP catalogue floor reconciliation
Audit framing was '121 vs floor 150 — doc/code drift.' Live count
via the test's actual regex over all 5 tool files (tools.go +
tools_audit_fix.go + tools_auth.go + tools_auth_bundle2.go +
tools_est.go): 155 unique 'Name: "certctl_*"' declarations.
Pre-Phase-3 audit measured tools.go in isolation (121) and missed
the other 4 files (+34 unique names). The test at
internal/ciparity/surface_parity_test.go::TestSurfaceParity_MCP
passes today (155 ≥ 150). Added a clarifying comment near
mcpBaselineFloor explaining the measurement scope so future
reviewers don't repeat the audit's framing error.
STATUS: stale — no code drift, just a measurement scoping error in
the audit.
ARCH-L1 — panic() rationale comments
5 panic sites in production Go (excluding _test.go):
- internal/repository/postgres/tx.go:84
- internal/service/issuer.go:861 (mustJSON)
- internal/service/est.go:728 (mustParseTime)
- internal/service/acme.go:1288 (rand source failure — already documented)
- internal/pkcs7/certrep.go:270 (OID marshal — already documented)
Added ARCH-L1 rationale comments to the 3 sites that didn't have
them. All 5 are defensible impossible-path / rethrow / hardcoded-
constant guards.
ARCH-L3 — Migration IF-NOT-EXISTS carve-outs
4 migrations skip the literal 'IF NOT EXISTS' token but ARE
idempotent via different Postgres patterns:
- 000014_policy_violation_severity_check.up.sql: ALTER TABLE
ADD CONSTRAINT CHECK doesn't accept IF NOT EXISTS; idempotency
via DROP CONSTRAINT IF EXISTS preamble.
- 000018_audit_events_worm.up.sql: CREATE OR REPLACE FUNCTION
+ DROP TRIGGER IF EXISTS + CREATE TRIGGER + DO $$ pg_roles
existence check. CREATE TRIGGER doesn't take IF NOT EXISTS.
- 000030_rbac_admin_perms.up.sql: INSERT ... ON CONFLICT DO NOTHING.
- 000039_audit_crit1_perms.up.sql: same INSERT + ON CONFLICT pattern.
Added ARCH-L3 header comments to each explaining the carve-out so
reviewers don't flag the missing literal token.
STATUS: largely stale — migrations are already idempotent.
ARCH-L4 — TODO/FIXME → see #<descriptor>
5 TODOs rewritten to the allowed 'see #<descriptor>' pattern:
- internal/repository/postgres/auth.go:220 → see #bundle-2-scope-fk
- internal/connector/discovery/gcpsm/gcpsm.go:547 → see #gcpsm-pagination
- internal/service/audit.go:244 → see #audit-pagination-count
- internal/service/job.go:295, 299 → see #validation-job-impl
New CI guard scripts/ci-guards/no-todo-in-prod.sh grep-fails any
new TODO/FIXME in cmd/ + internal/ (excluding _test.go); allows
'see #N' / 'see #<descriptor>' patterns.
Sandbox limitation
==================
The 6.1 GB certctl working tree fills the sandbox volume; go1.25.10
toolchain download fails with 'no space left on device' (sandbox has
1.25.9; go.mod requires 1.25.10). Local 'go test' / 'go build' NOT
run in this commit. Operator must run 'make verify' on their
workstation before push per CLAUDE.md operating rules.
The smoke.spec.ts NOT executed in the sandbox (no chromium installed).
Operator runs 'cd web && npm install && npx playwright install
--with-deps chromium && npm run e2e' on first wire-up.
All CI guards (no-todo-in-prod, skip-inventory-drift, G-3
env-docs-drift, doc-rot-detector, and every existing guard) verified
clean by running each individually.
Closes: cowork/certctl-architecture-diligence-audit.html#fix-TEST-H1,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-H2,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-M1,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-M2,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-M3,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-M4,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-L1,
cowork/certctl-architecture-diligence-audit.html#fix-ARCH-H3,
cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L1,
cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L2,
cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L3,
cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L4
scripts/ci-guards/ — Regression-guard scripts
Each <id>.sh script in this directory pins one closed audit finding from
regressing. CI runs the full set on every push via the
Regression guards step in .github/workflows/ci.yml. Operators can
run any script locally:
bash scripts/ci-guards/G-3-env-docs-drift.sh
Contract
Every script in this directory MUST:
- Be exit-code 0 on a clean repo (no regression present).
- Be exit-code non-zero on regression, with a
::error::annotation prefix so PR reviewers see the failing line in the GitHub Actions UI. - Be runnable from repo root via
bash scripts/ci-guards/<id>.shwith NO arguments and NO env-var requirements. The CI loop step (for g in scripts/ci-guards/*.sh; do bash "$g"; done) iterates every.shhere without args; any script that requires an arg or env var WILL fail in that loop. - Carry a head-comment block matching the in-source justification from the original ci.yml entry: the audit-finding reference, the closure rationale, the exempt-surface list (if any).
- Use
set -eearly to fail-fast on internal command errors. - Produce no output on the happy path beyond a final
echo "<id>: clean."confirmation line.
Helpers vs guards
Scripts that consume input artifacts (a test-output log, a
coverage.out file) or env vars (PR_NUMBER, GH_TOKEN) are
HELPERS, not guards. They live in scripts/, NOT scripts/ci-guards/.
Current helpers:
scripts/vendor-e2e-skip-check.sh— consumestest-output.logarg from the deploy-vendor-e2e jobscripts/coverage-pr-comment.sh— consumescoverage.out+PR_NUMBER+GH_TOKENenv from the go-build-and-test jobscripts/check-coverage-thresholds.sh— consumescoverage.out.github/coverage-thresholds.yml
Adding a new guard
- Drop a new
<id>.shin this directory with the head-comment block describing the audit finding it closes. - Make it executable:
chmod +x scripts/ci-guards/<id>.sh. - Verify it fails on a deliberate regression and passes on clean repo.
- CI auto-picks up new scripts via the
for g in scripts/ci-guards/*.shloop in theRegression guardsstep — no ci.yml change required.
Guards in this directory
Count: re-derive on demand via ls scripts/ci-guards/*.sh | wc -l. The table below names each one — keep it in sync as guards are added.
Per-finding regression guards
| ID | Finding | Catches |
|---|---|---|
G-1-jwt-auth-literal |
G-1 JWT silent auth downgrade | "jwt" literal in additive auth-type surfaces |
L-001-insecure-skip-verify |
L-001 unjustified InsecureSkipVerify | InsecureSkipVerify: true without //nolint:gosec |
H-001-bare-from |
H-001 (CWE-829) tag-swap attack | Bare FROM line without @sha256 digest pin |
M-012-no-root-user |
M-012 (CWE-250) container-as-root | Dockerfile missing terminal USER <non-root> |
H-009-readme-jwt |
H-009 README JWT advertising | README.md re-introducing JWT-as-supported claim |
G-2-api-key-hash-json |
G-2 cat-s5-apikey_leak | api_key_hash in JSON-emitting surface |
U-2-plaintext-healthcheck |
U-2 healthcheck protocol mismatch | Plaintext http:// in HEALTHCHECK directive |
U-3-migration-mount |
U-3 seed initdb schema drift | Migration file mounted into postgres initdb |
D-1-D-2-statusbadge-phantom |
D-1 + D-2 dead keys + TS phantoms | StatusBadge dead keys + 5 Certificate / 5 Agent / 1 Issuer / 1 Notification phantom fields |
L-1-bulk-action-loop |
L-1 client-side bulk loops | for ... await triggerRenewal/updateCertificate in CertificatesPage |
B-1-orphan-crud |
B-1 orphan-CRUD client fns | 8 update/create/delete fns lose their page consumer |
S-2-strings-contains-err |
S-2 brittle error-dispatch | strings.Contains(err.Error(), "not found"|"violates foreign key") in handlers |
G-3-env-docs-drift |
G-3 env-var docs drift | CERTCTL_* env var defined OR documented but not both |
test-naming-convention |
I-001-extended | func TestXxx (lowercase first letter) — Go silently skips |
S-1-hardcoded-source-counts |
S-1 stale numeric prose | Hardcoded "N issuer connectors" / "N MCP tools" in README + docs |
P-1-documented-orphan-fns |
P-1 documented orphans | 16 read-fn names removed from client.ts exports |
T-1-frontend-page-coverage |
T-1 untested frontend pages | New page in web/src/pages/ without sibling .test.tsx and not on the deferred allowlist |
bundle-8-L-015-target-blank-rel-noopener |
L-015 (CWE-1022) reverse-tabnabbing | target="_blank" without rel="noopener noreferrer" |
bundle-8-L-019-dangerously-set-inner-html |
L-019 (CWE-79) XSS | dangerouslySetInnerHTML outside safeHtml.ts |
bundle-8-M-009-bare-usemutation |
M-009 + M-029 mutation contract | Bare useMutation() outside useTrackedMutation wrapper |
H-1-encryption-key-min-length |
H-1 closure follow-up (post-Phase-5 surfacing) | CERTCTL_CONFIG_ENCRYPTION_KEY literal in any deploy/docker-compose*.yml shorter than the 32-byte floor enforced by internal/config/config.go::Validate() |
test-compose-scep-coherence |
post-Phase-5 surfacing of dead SCEP test config | CERTCTL_SCEP_ENABLED=true in test compose without (a) a CI job that runs the SCEP integration test, (b) the ra.crt + ra.key + intune_trust_anchor.pem fixtures committed to deploy/test/fixtures/, AND (c) the matching volume mount |
Forward-looking guards (Auditable Codebase Bundle, post-v2.1.0 anti-rot)
These guards catch defect classes BEFORE they get audit findings — they pin invariants on the codebase that the v2.0 audit history showed are easy to lose.
| ID | Item | Catches |
|---|---|---|
complete-path-config-coverage |
post-v2.1.0 / item-1 | "Lying field" — CERTCTL_* env var defined in internal/config/config.go that no consumer outside internal/config/ actually reads. Operator-facing config that the docs claim works but the code never honors. Companion Go test at internal/config/coverage_test.go. |
doc-rot-detector |
post-v2.1.0 / item-5 | Docs older than 90 days warn (yellow), older than 120 days fail (red). Uses HEAD commit timestamp for reproducibility. docs/archive/ allowlisted in bulk. |
The cold-DB compose smoke (post-v2.1.0 / item-6) is NOT a script in this directory — it is inlined directly into .github/workflows/ci.yml::cold-db-compose-smoke because there is no value in a developer running it locally (the whole point of the gate is that CI owns the cold-DB state). To inspect or modify the smoke logic, read that workflow job; there is intentionally no scripts/ci-guards/cold-db-compose-smoke.sh.
The fourth Bundle artifact (internal/ciparity/) is Go tests, not shell guards — runs under the standard Go test step. Pins the MCP tool catalogue floor + naming convention; reports CLI/MCP/OpenAPI surface counts as a trend metric.
Running the full set locally
for g in scripts/ci-guards/*.sh; do
echo "=== $(basename "$g") ==="
bash "$g" || echo " FAILED"
done