The deploy-vendor-e2e job has been failing with the certctl-test-server
container restarting endlessly. Diagnostic dump (added in 69266c8)
finally surfaced the actual cause:
Failed to load configuration: SCEP profile 0 (PathID="e2eintune")
has empty CHALLENGE_PASSWORD — refuse to start (CWE-306: per-profile
shared secret is the sole application-layer auth boundary; an empty
password would allow any client reaching /scep/e2eintune to enroll
a CSR against issuer "iss-local")
Same shape as the encryption-key fix that landed in 4bb7a74: a config
validation gate added in code that the test compose never got updated
to satisfy, hidden pre-Phase-5 because the matrix-collapse hadn't yet
forced the certctl-server to actually boot in CI.
Root cause is more interesting than just "missing env var." The
2026-04-29 SCEP RFC 8894 + Intune master bundle Phase I added an
`e2eintune` SCEP profile to docker-compose.test.yml expecting
deploy/test/scep_intune_e2e_test.go to exercise it. That integration
test does exist (//go:build integration) but **NO CI job ever
selects it** — ci.yml's deploy-vendor-e2e job runs only
`-run 'VendorEdge_'` (line 379), and no other job invokes
`go test -tags integration` with a SCEP selector. Confirmed via
`grep -rnE "scep_intune|SCEPIntune" .github/workflows/` returning
empty.
Worse: the supporting fixtures (ra.crt + ra.key + intune_trust_anchor.pem)
were documented in deploy/test/fixtures/README.md with the
regeneration recipe but never actually committed. Pre-Phase-5 the
test stack didn't fully boot the server in CI, so the entire stack
of debt — dead config + missing fixtures + no consumer test — sat
silent until the matrix collapse forced the boot path.
Fixing this with a fake CHALLENGE_PASSWORD value would silence the
immediate validator but leave the real problem in place: maintenance
cost on test config that no test exercises. Same critique applies
to "let me commit fake fixtures" — the fixtures alone don't add
test coverage when no CI job runs the SCEP test.
The complete-path fix is to make the test compose match what CI
actually exercises:
- deploy/docker-compose.test.yml: drop CERTCTL_SCEP_ENABLED + the
full e2eintune profile env var family (10 lines) + the
./test/fixtures volume mount (1 line). Replace with an in-line
comment explaining why SCEP is intentionally disabled and what
needs to come back together when SCEP is added to CI for real.
- scripts/ci-guards/test-compose-scep-coherence.sh (new, 22nd
guard): refuses any future state where CERTCTL_SCEP_ENABLED=true
in test compose without ALL of:
1. A CI job that runs the SCEP integration test (matched by
scep_intune | SCEPIntune | -run [Ss]cep in ci.yml)
2. The fixture files actually committed (ra.crt, ra.key,
intune_trust_anchor.pem)
3. The ./test/fixtures:/etc/certctl/scep:ro volume mount
Verified manually with the same pattern as the H-1 guard:
clean tree → exit 0; deliberate SCEP_ENABLED=true regression →
exit 1 with 5 ::error:: annotations covering each gap; restore
→ exit 0 again.
- scripts/ci-guards/README.md: 21 → 22 guards, new row.
The fixtures README at deploy/test/fixtures/README.md keeps the
regeneration recipe so the eventual SCEP CI job lands cleanly: the
operator who adds the SCEP job restores the env vars, regenerates
+ commits the fixtures, and the guard auto-passes.
Pattern (now firm across this CI-stabilization sequence):
- Pre-existing latent bug
- Old CI structurally hid it (per-vendor matrix, missing boot path)
- Phase-5 matrix collapse + new diagnostic infra exposed it
- Direct fix unblocks today
- Regression guard prevents the same shape of drift forever
Encryption-key (4bb7a74) was the same shape; this is its sibling.
Two-part complete-path fix for the deploy-vendor-e2e failure that has
been firing since the ci-pipeline-cleanup Phase 5 matrix collapse
started actually booting the certctl-test-server:
Failed to load configuration:
CERTCTL_CONFIG_ENCRYPTION_KEY too short (29 bytes; minimum 32).
Surfaced via the diagnostic-dump step landed in commit 69266c8 — the
server panicked on startup, Docker restarted it endlessly, compose
reported the dependency-chain symptom ("container certctl-test-server
is unhealthy"), but the actual cause was invisible in the previous
CI output. With the dump in place, the next failing run named the
problem in one line.
Root cause. The H-1 audit-closure master commit 6cb4414
("feat(security): bodyLimit on noAuth + security headers + encryption-
key validation (H-1 master)") added internal/config/config.go's
minEncryptionKeyLength = 32 byte floor + 5 unit tests that pin it.
The closure was incomplete: it never enforced the rule against the
literal CERTCTL_CONFIG_ENCRYPTION_KEY values certctl's own
deploy/docker-compose*.yml files pass. Pre-Phase-5 the test stack
didn't fully exercise the validator (the per-vendor matrix didn't
boot certctl-test-server in every job), so the gap was silent.
deploy/docker-compose.test.yml's literal value
`test-encryption-key-32chars!!` was 29 bytes — the name claimed 32
but the author miscounted (4+1+10+1+3+1+2+5+2 = 29). Pattern matches
every fix in this CI-stabilization sequence: pre-existing latent bug
that the old CI structurally hid.
Part 1 — direct fix (deploy/docker-compose.test.yml):
Replace the 29-byte literal with a clearly test-only,
self-documenting 49-byte value (`test-encryption-key-deterministic-
32-byte-fixture`). 17 bytes of safety margin so a future tightening
of the floor (32 → 33+) doesn't break this fixture again. Inline
comment block explains the byte-budget contract + points at the
H-1 closure commit. Production deploy/docker-compose.yml's default
(`change-me-32-char-encryption-key`) is exactly 32 bytes — passes
by 1 byte but on the edge; not touched here because operators are
already told to override it via env (`${VAR:-default}`).
Part 2 — structural fix (scripts/ci-guards/H-1-encryption-key-min-
length.sh):
New regression guard. Scans every deploy/docker-compose*.yml for
literal CERTCTL_CONFIG_ENCRYPTION_KEY values + values inside
${VAR:-default} expansions, checks each against the 32-byte floor,
fails CI with `::error::` annotation pointing at the offending
file:line if any literal regresses. Bare ${VAR} env references with
no default are skipped — those are operator-supplied at runtime
and the validator handles them at boot.
Verified manually:
- Clean repo: `H-1-encryption-key-min-length: clean.` (exit 0)
- 5-byte regression: emits proper ::error:: annotation, exit 1
- Restore: clean again (exit 0)
CI auto-picks up the new guard via the `for g in
scripts/ci-guards/*.sh; do bash "$g"; done` loop in ci.yml's
Regression guards step (no ci.yml change required).
scripts/ci-guards/README.md updated: 20 → 21 guards, new row
explaining the closure rationale.
The structural piece is the more important half of this fix. The
direct fix unblocks today's CI; the guard prevents the same class of
drift from ever recurring silently. Future audit closures that add
new validation rules to internal/config/config.go now have a working
template for the matching CI guard — drop a sibling .sh in the
ci-guards directory.
Bonus — what the diagnostic-dump step (69266c8) bought us. Before
that step landed, the same failure looked like an opaque "container
unhealthy" with no actionable signal. With it, the actual error
message + the offending env var + the exact byte count came out in
one CI run. The diagnostic infrastructure paid for itself within one
push.
The 'Regression guards' loop step in ci.yml runs:
for g in scripts/ci-guards/*.sh; do bash "$g"; done
Per the directory's own contract (scripts/ci-guards/README.md), every
script there MUST be runnable bare with no args / no env. Three files
violated that contract — they're helpers consumed by specific CI job
steps with arguments, not regression guards. They were misplaced.
Moved (git mv):
scripts/ci-guards/vendor-e2e-skip-check.sh → scripts/
scripts/ci-guards/vendor-e2e-skip-allowlist.txt → scripts/
scripts/ci-guards/coverage-pr-comment.sh → scripts/
Updated ci.yml call sites:
- deploy-vendor-e2e job: bash scripts/vendor-e2e-skip-check.sh $LOG
- go-build-and-test job: bash scripts/coverage-pr-comment.sh
Tightened scripts/vendor-e2e-skip-check.sh arg parse from a silent
default ('LOG=${1:-test-output.log}') to a mandatory-arg form
('LOG=${1:?usage: ...}') so misuse fails loud at parse time rather
than at the missing-file check.
Updated scripts/ci-guards/README.md contract to spell out the
guard-vs-helper distinction explicitly; lists current helpers under
scripts/ for future-author guidance.
Verified locally: 'for g in scripts/ci-guards/*.sh; do bash $g; done'
returns clean (22 guards pass) on HEAD post-move.
Closes the regression-guards-loop failure that surfaced in CI run
25192163943 (job 73864471346 'Frontend Build').
Bundle: ci-pipeline-cleanup, Phase 1.
Pure relocation — no behavior change. Each guard's bash logic is
byte-identical to the prior inline version; the only changes are:
(a) the guard becomes a sibling script under scripts/ci-guards/<id>.sh,
(b) ci.yml's per-guard step is replaced by a single loop step that
iterates all scripts.
20 scripts extracted (alphabetized):
B-1-orphan-crud.sh, D-1-D-2-statusbadge-phantom.sh,
G-1-jwt-auth-literal.sh, G-2-api-key-hash-json.sh,
G-3-env-docs-drift.sh, H-001-bare-from.sh, H-009-readme-jwt.sh,
L-001-insecure-skip-verify.sh, L-1-bulk-action-loop.sh,
M-012-no-root-user.sh, P-1-documented-orphan-fns.sh,
S-1-hardcoded-source-counts.sh, S-2-strings-contains-err.sh,
T-1-frontend-page-coverage.sh, U-2-plaintext-healthcheck.sh,
U-3-migration-mount.sh, bundle-8-L-015-target-blank-rel-noopener.sh,
bundle-8-L-019-dangerously-set-inner-html.sh,
bundle-8-M-009-bare-usemutation.sh, test-naming-convention.sh
Plus scripts/ci-guards/README.md documenting the contract:
- Each script must exit 0 on clean repo, non-zero with ::error::
prefix on regression
- Runnable from repo root via 'bash scripts/ci-guards/<id>.sh'
- Adding a new guard: drop a new <id>.sh; CI auto-picks it up
ci.yml dropped 1488 → 557 lines (-931, -63%).
Single CI loop step now collects ALL guard failures before failing
the build instead of fail-fast — UX win for regressions that hit
two guards at once.
Two guards (QA-doc Part-count + seed-count, ci.yml lines 868-917)
deliberately NOT extracted — they move to 'make verify-docs' in
Phase 11 because they protect docs-the-operator-reads, not the
product itself.
Verification (sandbox):
- All 20 scripts pass against HEAD (chmod +x; for g in scripts/ci-guards/*.sh; do bash $g; done)
- New ci.yml YAML-parses cleanly
- Job boundaries preserved: go-build-and-test, frontend-build,
helm-lint, deploy-vendor-e2e, deploy-vendor-e2e-windows
- Loop step appears twice (once at end of go-build-and-test, once
at end of frontend-build) so both jobs continue running their
set of guards