Commit Graph

9 Commits

Author SHA1 Message Date
shankar0123 a05a7d3dad ci: fix Phase 1b post-push CI failures (3 guards)
Phase 1b push (commit 44a85d6) failed three CI guards. None were
caught by `make verify` locally because they're CI-only guards
that aren't part of the Makefile target. This commit fixes all
three.

1. go.mod tidy diff. The go-jose v4 dep was added with `// indirect`
   in go.mod after the initial `go get`, but the codebase imports it
   directly from internal/api/acme/jws.go + service/acme.go +
   handler/acme.go. CI's `go mod tidy && git diff --exit-code go.mod
   go.sum` flagged the staleness. Promoted to a direct require in
   the same `require (...)` block as github.com/aws/aws-sdk-go-v2
   etc.

2. G-3-env-docs-drift.sh. The guard greps `\bCERTCTL_[A-Z_]+\b` in
   docs/ and complains when the bare-prefix forms don't match
   anything defined in config.go. Phase 1a + 1b's docs/acme-server.md
   intro and migration header use bare-prefix forms `CERTCTL_ACME_*`
   and `CERTCTL_ACME_SERVER_*` to describe namespace separation
   (consumer-side ACMEConfig vs server-side ACMEServerConfig). Same
   precedent as the existing CERTCTL_SCEP_ + CERTCTL_TLS_ +
   CERTCTL_QA_* prefix entries already in the guard's ALLOWED list.
   Added CERTCTL_ACME_ + CERTCTL_ACME_SERVER_ to the ALLOWED list
   with a justification comment block matching the existing
   integration-surface allowlist convention.

3. openapi-handler-parity.sh. Distinct from
   internal/api/router/openapi_parity_test.go (which runs at `go
   test` time and has its own SpecParityExceptions map I extended
   in 1a + 1b) — this is a separate CI-only guard that reads
   api/openapi-handler-exceptions.yaml. The 6 Phase-1a routes + 4
   Phase-1b routes (10 ACME endpoints total) were never added to
   that yaml. Same rationale as the SCEP/SCEP-mTLS entries already
   in the file: ACME is a JWS-signed-JSON wire protocol per
   RFC 8555 + RFC 9773, not an OpenAPI-shape REST surface.
   Documenting every endpoint in openapi.yaml would duplicate the
   RFC. The canonical reference is docs/acme-server.md. Phases 2-4
   will add their routes to this yaml in lockstep with router.go.

Verified locally:
  - bash scripts/ci-guards/G-3-env-docs-drift.sh → clean.
  - bash scripts/ci-guards/openapi-handler-parity.sh → clean
    (152 router routes, 136 OpenAPI ops, 18 documented exceptions).
  - All other ci-guards/*.sh → clean.
  - go.mod diff after `go mod tidy` is empty.
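
The allowlist behaviour from item 2 can be sketched roughly as below. This is a hypothetical reconstruction, not the guard's source: the function names `is_allowed` and `find_drift`, the exact-match rule for prefixes, and the whole-word lookup against config.go are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real guard's names and matching rules may differ.

# Bare prefixes that docs may mention without a matching config.go definition.
ALLOWED=(CERTCTL_SCEP_ CERTCTL_TLS_ CERTCTL_QA_ CERTCTL_ACME_ CERTCTL_ACME_SERVER_)

# is_allowed TOKEN: succeeds when TOKEN is exactly one allowlisted prefix.
is_allowed() {
  local token=$1 prefix
  for prefix in "${ALLOWED[@]}"; do
    [[ $token == "$prefix" ]] && return 0
  done
  return 1
}

# find_drift DEFINED_FILE DOC_FILE...: print doc tokens that are neither
# defined in DEFINED_FILE (whole-word match) nor allowlisted.
find_drift() {
  local defined=$1 token
  shift
  grep -ohE '\bCERTCTL_[A-Z_]+\b' "$@" | sort -u | while IFS= read -r token; do
    grep -qwF -- "$token" "$defined" && continue
    is_allowed "$token" && continue
    printf '%s\n' "$token"
  done
}
```

With this shape, a doc that writes `CERTCTL_ACME_*` yields the token `CERTCTL_ACME_`, which is not a defined variable but passes because it is on the allowlist.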
2026-05-03 13:31:35 +00:00
shankar0123 2643a427ac ci(digest-validity): exclude Windows IIS digest — image is doc-only, not pulled by Linux CI
CI run #376 (commit a1c7741, Frontend Build job) failed with:

    digest does not resolve: mcr.microsoft.com/windows/servercore/iis:
    windowsservercore-ltsc2022@sha256:8d0b0e651ad514e3fb05978db66f38036
    118812e1b9314a48f10419cad8a3462

A re-run with no code changes went green. The digest itself is fine —
verified against MCR directly (HTTP 200 from
mcr.microsoft.com/v2/windows/servercore/iis/manifests/sha256:8d0b...),
and the tag `:windowsservercore-ltsc2022` currently resolves to that
exact digest. Microsoft hasn't rotated.

Root cause is registry-side rate-limiting. MCR throttles unauthenticated
GET-by-digest requests by source IP. GitHub-hosted runners share a small
pool of egress IPs across many users; bursts trip the throttle and
return non-200. Re-run = different runner = different IP = throttle
window has reset = pass. This will recur on roughly N% of pushes
indefinitely, until either (a) Microsoft loosens MCR rate limits, (b)
GitHub buys more runner IPs, or (c) we stop verifying digests CI doesn't
actually use.

The deeper issue is structural, not transient. The Windows IIS image is
gated behind compose `profiles: [deploy-e2e-windows]`
(deploy/docker-compose.test.yml:700). The comment block above the
service definition (lines 675-691) explicitly says "Linux CI never
activates this profile." All 10 TestVendorEdge_IIS_*_E2E tests are on
scripts/vendor-e2e-skip-allowlist.txt because the sidecar is never
started. The whole Windows matrix was DELETED in ci-pipeline-cleanup
Phase 6 / frozen decision 0.5 (revising Bundle II decision 0.4); IIS
validation moved to docs/connector-iis.md::Operator validation playbook.

So `digest-validity.sh` is verifying a digest that no CI job ever pulls
— paying CI brittleness against MCR rate-limiting we can't control, for
an image whose only purpose in compose is documentation for an
operator's manual workflow on a real Windows host.

The fix matches the guard's stated purpose ("every digest CI actually
depends on is valid"): exclude images CI never pulls.

Implementation. Add an EXCLUDED_PATTERNS array near the top of the
script with one entry — the IIS image path
`mcr.microsoft.com/windows/servercore/iis` — and a comment block above
it documenting:

  - WHY it's excluded (gated profile, never started, all tests on
    skip-allowlist)
  - WHEN it would need re-inclusion (if a Windows CI runner is added
    that actually starts the sidecar)
  - WHAT this list is NOT for (transient flake silencing — that gets
    fixed via retry logic in the script, not via exclusion)

The match is by image-path substring, not by digest, so future tag/
digest updates of the same image still hit the exclusion without
needing this list to be re-edited.

Loop logic gains a 6-line check that runs the exclusion match before
any registry work. Excluded refs log as "SKIP (excluded) <ref>" so
operator-facing CI logs stay informative — at a glance you can see
which digests were verified vs which were intentionally not.

The success message updates to differentiate verified vs excluded
counts: "digest-validity: clean — N verified, M excluded (CI never
pulls)" when M > 0; original message preserved when M == 0.
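
A minimal sketch of the exclusion flow described above, under stated assumptions: `verify_digest` and `BAD_REFS` are hypothetical stand-ins for the script's real registry lookup, `check_refs` is an invented wrapper name, and the summary wording is simplified rather than copied from the script.

```shell
#!/usr/bin/env bash
# Sketch only. verify_digest is a hypothetical stand-in for the real registry lookup.

# WHY excluded: gated behind profiles: [deploy-e2e-windows]; Linux CI never starts it.
EXCLUDED_PATTERNS=(
  "mcr.microsoft.com/windows/servercore/iis"
)

# Stand-in: a ref "resolves" unless it appears in the space-separated BAD_REFS list.
verify_digest() {
  [[ " ${BAD_REFS:-} " != *" $1 "* ]]
}

# is_excluded REF: succeeds when REF contains any excluded image path as a substring.
is_excluded() {
  local ref=$1 pattern
  for pattern in "${EXCLUDED_PATTERNS[@]}"; do
    [[ $ref == *"$pattern"* ]] && return 0
  done
  return 1
}

# check_refs REF...: skip excluded refs before any registry work, verify the rest.
check_refs() {
  local ref verified=0 excluded=0 failed=0
  for ref in "$@"; do
    if is_excluded "$ref"; then
      echo "SKIP (excluded) $ref"
      excluded=$((excluded + 1))
      continue
    fi
    if verify_digest "$ref"; then
      verified=$((verified + 1))
    else
      echo "::error::digest does not resolve: $ref"
      failed=1
    fi
  done
  [ "$failed" -eq 0 ] || return 1
  # Simplified summary line; the real script's wording differs.
  echo "digest-validity: clean ($verified verified, $excluded excluded)"
}
```

Because the match is on the image path, a ref with a different tag or digest for the same IIS image is still skipped without touching the list.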

Verified manually:

  - Clean repo: 15 verified, 1 excluded, exit 0.
  - Fabricated bogus httpd digest: ::error:: emitted for the bad
    digest, IIS still SKIP-excluded, exit 1. (Real regressions still
    caught.)
  - Restore: 15 verified, 1 excluded, exit 0 again.

Other images that CI never pulls would warrant the same treatment if
they get added later. The exclusion list pattern scales: each new entry
needs its own "WHY this is doc-only" justification block.

What this is NOT:
  - Not a generic flake-silencer. The exclusion is justified by the
    image being doc-only, not by the test being noisy.
  - Not a global retry/resilience layer. If MCR rate-limits an image CI
    DOES pull, that's a real CI dependency on an unreliable external
    service — fix by retry-with-backoff, not by excluding.
2026-05-01 03:06:49 +00:00
shankar0123 a1c7741e1b fix(deploy/test) + ci(guard): drop dead SCEP profile from test compose
The deploy-vendor-e2e job has been failing with the certctl-test-server
container restarting endlessly. Diagnostic dump (added in 3b96b35)
finally surfaced the actual cause:

    Failed to load configuration: SCEP profile 0 (PathID="e2eintune")
    has empty CHALLENGE_PASSWORD — refuse to start (CWE-306: per-profile
    shared secret is the sole application-layer auth boundary; an empty
    password would allow any client reaching /scep/e2eintune to enroll
    a CSR against issuer "iss-local")

Same shape as the encryption-key fix that landed in c4157fd: a config
validation gate added in code that the test compose never got updated
to satisfy, hidden pre-Phase-5 because the matrix-collapse hadn't yet
forced the certctl-server to actually boot in CI.

Root cause is more interesting than just "missing env var." The
2026-04-29 SCEP RFC 8894 + Intune master bundle Phase I added an
`e2eintune` SCEP profile to docker-compose.test.yml expecting
deploy/test/scep_intune_e2e_test.go to exercise it. That integration
test does exist (//go:build integration) but **NO CI job ever
selects it** — ci.yml's deploy-vendor-e2e job runs only
`-run 'VendorEdge_'` (line 379), and no other job invokes
`go test -tags integration` with a SCEP selector. Confirmed via
`grep -rnE "scep_intune|SCEPIntune" .github/workflows/` returning
empty.

Worse: the supporting fixtures (ra.crt + ra.key + intune_trust_anchor.pem)
were documented in deploy/test/fixtures/README.md with the
regeneration recipe but never actually committed. Pre-Phase-5 the
test stack didn't fully boot the server in CI, so the entire stack
of debt — dead config + missing fixtures + no consumer test — sat
silent until the matrix collapse forced the boot path.

Fixing this with a fake CHALLENGE_PASSWORD value would silence the
immediate validator but leave the real problem in place: maintenance
cost on test config that no test exercises. Same critique applies
to "let me commit fake fixtures" — the fixtures alone don't add
test coverage when no CI job runs the SCEP test.

The complete-path fix is to make the test compose match what CI
actually exercises:

  - deploy/docker-compose.test.yml: drop CERTCTL_SCEP_ENABLED + the
    full e2eintune profile env var family (10 lines) + the
    ./test/fixtures volume mount (1 line). Replace with an in-line
    comment explaining why SCEP is intentionally disabled and what
    needs to come back together when SCEP is added to CI for real.

  - scripts/ci-guards/test-compose-scep-coherence.sh (new, 22nd
    guard): refuses any future state where CERTCTL_SCEP_ENABLED=true
    in test compose without ALL of:
      1. A CI job that runs the SCEP integration test (matched by
         scep_intune | SCEPIntune | -run [Ss]cep in ci.yml)
      2. The fixture files actually committed (ra.crt, ra.key,
         intune_trust_anchor.pem)
      3. The ./test/fixtures:/etc/certctl/scep:ro volume mount
    Verified manually with the same pattern as the H-1 guard:
    clean tree → exit 0; deliberate SCEP_ENABLED=true regression →
    exit 1 with 5 ::error:: annotations covering each gap; restore
    → exit 0 again.

  - scripts/ci-guards/README.md: 21 → 22 guards, new row.

The fixtures README at deploy/test/fixtures/README.md keeps the
regeneration recipe so the eventual SCEP CI job lands cleanly: the
operator who adds the SCEP job restores the env vars, regenerates
+ commits the fixtures, and the guard auto-passes.
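
The coherence rule above can be sketched as follows. This is an illustrative reconstruction, not the committed guard: the function name `scep_coherence`, the parameterised paths, the error wording, and the regex used to detect an enabled flag are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the coherence rule; paths are parameters so it can be tried on fixtures.

# scep_coherence COMPOSE_FILE CI_WORKFLOW FIXTURES_DIR
scep_coherence() {
  local compose=$1 ci=$2 fixtures=$3 rc=0 f

  # Nothing to enforce while SCEP stays disabled in the test compose.
  grep -qE 'CERTCTL_SCEP_ENABLED[:=] *"?true' "$compose" || return 0

  # 1. Some CI job must select the SCEP integration test.
  if ! grep -qE -e 'scep_intune|SCEPIntune|-run [Ss]cep' "$ci"; then
    echo "::error::SCEP enabled in test compose but no CI job runs the SCEP integration test"
    rc=1
  fi

  # 2. Every fixture file must actually be committed.
  for f in ra.crt ra.key intune_trust_anchor.pem; do
    if [ ! -f "$fixtures/$f" ]; then
      echo "::error::SCEP enabled in test compose but fixture $f is missing"
      rc=1
    fi
  done

  # 3. The fixtures directory must be mounted into the server container.
  if ! grep -qF './test/fixtures:/etc/certctl/scep:ro' "$compose"; then
    echo "::error::SCEP enabled in test compose but the fixtures volume mount is missing"
    rc=1
  fi

  return "$rc"
}
```

In this sketch a bare SCEP_ENABLED=true regression produces five annotations (one for the job, three for the fixture files, one for the mount), which lines up with the five reported in the manual verification.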

Pattern (now firm across this CI-stabilization sequence):
  - Pre-existing latent bug
  - Old CI structurally hid it (per-vendor matrix, missing boot path)
  - Phase-5 matrix collapse + new diagnostic infra exposed it
  - Direct fix unblocks today
  - Regression guard prevents the same shape of drift forever

Encryption-key (c4157fd) was the same shape; this is its sibling.
2026-05-01 01:39:18 +00:00
shankar0123 c4157fd196 fix(deploy/test) + ci(guard): unblock deploy-vendor-e2e — encryption-key length
Two-part complete-path fix for the deploy-vendor-e2e failure that has
been firing since the ci-pipeline-cleanup Phase 5 matrix collapse
started actually booting the certctl-test-server:

    Failed to load configuration:
    CERTCTL_CONFIG_ENCRYPTION_KEY too short (29 bytes; minimum 32).

Surfaced via the diagnostic-dump step landed in commit 3b96b35 — the
server panicked on startup, Docker restarted it endlessly, compose
reported the dependency-chain symptom ("container certctl-test-server
is unhealthy"), but the actual cause was invisible in the previous
CI output. With the dump in place, the next failing run named the
problem in one line.

Root cause. The H-1 audit-closure master commit 3e78ecb
("feat(security): bodyLimit on noAuth + security headers + encryption-
key validation (H-1 master)") added internal/config/config.go's
minEncryptionKeyLength = 32 byte floor + 5 unit tests that pin it.
The closure was incomplete: it never enforced the rule against the
literal CERTCTL_CONFIG_ENCRYPTION_KEY values certctl's own
deploy/docker-compose*.yml files pass. Pre-Phase-5 the test stack
didn't fully exercise the validator (the per-vendor matrix didn't
boot certctl-test-server in every job), so the gap was silent.
deploy/docker-compose.test.yml's literal value
`test-encryption-key-32chars!!` was 29 bytes — the name claimed 32
but the author miscounted (4+1+10+1+3+1+2+5+2 = 29). Pattern matches
every fix in this CI-stabilization sequence: pre-existing latent bug
that the old CI structurally hid.
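
The miscount is easy to confirm mechanically. A minimal sketch, where `byte_len` and `meets_floor` are hypothetical helper names and 32 is the floor quoted above:

```shell
#!/usr/bin/env bash
# byte_len STRING: print the number of bytes in STRING.
byte_len() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# meets_floor STRING [MIN]: succeeds when STRING is at least MIN bytes (default 32).
meets_floor() {
  [ "$(byte_len "$1")" -ge "${2:-32}" ]
}

byte_len 'test-encryption-key-32chars!!'   # prints 29
```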

Part 1 — direct fix (deploy/docker-compose.test.yml):

  Replace the 29-byte literal with a clearly test-only,
  self-documenting 49-byte value (`test-encryption-key-deterministic-
  32-byte-fixture`). 17 bytes of safety margin so a future tightening
  of the floor (32 → 33+) doesn't break this fixture again. Inline
  comment block explains the byte-budget contract + points at the
  H-1 closure commit. Production deploy/docker-compose.yml's default
  (`change-me-32-char-encryption-key`) is exactly 32 bytes, so it
  meets the floor with zero bytes of margin; not touched here because
  operators are already told to override it via env
  (`${VAR:-default}`).

Part 2 — structural fix (scripts/ci-guards/H-1-encryption-key-min-
length.sh):

  New regression guard. Scans every deploy/docker-compose*.yml for
  literal CERTCTL_CONFIG_ENCRYPTION_KEY values + values inside
  ${VAR:-default} expansions, checks each against the 32-byte floor,
  fails CI with `::error::` annotation pointing at the offending
  file:line if any literal regresses. Bare ${VAR} env references with
  no default are skipped — those are operator-supplied at runtime
  and the validator handles them at boot.

  Verified manually:
    - Clean repo: `H-1-encryption-key-min-length: clean.` (exit 0)
    - 5-byte regression: emits proper ::error:: annotation, exit 1
    - Restore: clean again (exit 0)
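
A rough sketch of the scan described in Part 2, assuming compose lines of the form `KEY: value` or `- KEY=value`. The function names, the regexes, and the annotation wording are illustrative assumptions, not the guard's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real guard's parsing may differ.
MIN_LEN=32

# extract_key_value LINE: print the literal key value found on LINE.
# Prints nothing for unrelated lines and for bare ${VAR} references,
# which are operator-supplied at runtime and validated at boot.
extract_key_value() {
  local line=$1 value
  local key_re='CERTCTL_CONFIG_ENCRYPTION_KEY[:=][[:space:]]*(.*)$'
  local default_re='^[$][{][A-Za-z_][A-Za-z0-9_]*:-(.*)[}]$'
  local bare_re='^[$][{][A-Za-z_][A-Za-z0-9_]*[}]$'

  [[ $line =~ $key_re ]] || return 0
  value=${BASH_REMATCH[1]}
  value=${value#\"}; value=${value%\"}   # drop optional surrounding quotes
  if [[ $value =~ $default_re ]]; then
    value=${BASH_REMATCH[1]}             # the default inside ${VAR:-default}
  elif [[ $value =~ $bare_re ]]; then
    return 0
  fi
  printf '%s\n' "$value"
}

# check_compose_file FILE: emit one ::error:: annotation per too-short literal.
# ${#value} counts characters, which equals bytes for ASCII literals.
check_compose_file() {
  local file=$1 line value n=0 rc=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    value=$(extract_key_value "$line")
    [ -n "$value" ] || continue
    if [ "${#value}" -lt "$MIN_LEN" ]; then
      echo "::error file=$file,line=$n::CERTCTL_CONFIG_ENCRYPTION_KEY literal is ${#value} bytes (minimum $MIN_LEN)"
      rc=1
    fi
  done < "$file"
  return "$rc"
}
```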

  CI auto-picks up the new guard via the `for g in
  scripts/ci-guards/*.sh; do bash "$g"; done` loop in ci.yml's
  Regression guards step (no ci.yml change required).

  scripts/ci-guards/README.md updated: 20 → 21 guards, new row
  explaining the closure rationale.

The structural piece is the more important half of this fix. The
direct fix unblocks today's CI; the guard prevents the same class of
drift from ever recurring silently. Future audit closures that add
new validation rules to internal/config/config.go now have a working
template for the matching CI guard — drop a sibling .sh in the
ci-guards directory.

Bonus — what the diagnostic-dump step (3b96b35) bought us. Before
that step landed, the same failure looked like an opaque "container
unhealthy" with no actionable signal. With it, the actual error
message + the offending env var + the exact byte count came out in
one CI run. The diagnostic infrastructure paid for itself within one
push.
2026-05-01 00:57:43 +00:00
shankar0123 7b8cadcd02 refactor(scripts): move CI helpers out of scripts/ci-guards/
The 'Regression guards' loop step in ci.yml runs:
    for g in scripts/ci-guards/*.sh; do bash "$g"; done

Per the directory's own contract (scripts/ci-guards/README.md), every
script there MUST be runnable bare with no args / no env. Three files
violated that contract — they're helpers consumed by specific CI job
steps with arguments, not regression guards. They were misplaced.

Moved (git mv):
  scripts/ci-guards/vendor-e2e-skip-check.sh         → scripts/
  scripts/ci-guards/vendor-e2e-skip-allowlist.txt    → scripts/
  scripts/ci-guards/coverage-pr-comment.sh           → scripts/

Updated ci.yml call sites:
  - deploy-vendor-e2e job: bash scripts/vendor-e2e-skip-check.sh $LOG
  - go-build-and-test job: bash scripts/coverage-pr-comment.sh

Tightened scripts/vendor-e2e-skip-check.sh arg parse from a silent
default ('LOG=${1:-test-output.log}') to a mandatory-arg form
('LOG=${1:?usage: ...}') so misuse fails loud at parse time rather
than at the missing-file check.
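
The difference between the two parameter-expansion forms, as a small sketch. `parse_lenient` and `parse_strict` are hypothetical wrappers used only to contrast the two forms; the usage text stays elided as in the description above.

```shell
#!/usr/bin/env bash
# Before: a missing argument silently falls back to a default file name.
parse_lenient() {
  local log=${1:-test-output.log}
  echo "$log"
}

# After: a missing argument aborts with the message on stderr.
# The body runs in a subshell because ${1:?...} exits a non-interactive shell when it fires.
parse_strict() (
  log=${1:?usage: ...}
  echo "$log"
)
```

Calling `parse_strict` with no argument returns non-zero before any file check runs, whereas `parse_lenient` carries on with `test-output.log`.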

Updated scripts/ci-guards/README.md contract to spell out the
guard-vs-helper distinction explicitly; lists current helpers under
scripts/ for future-author guidance.

Verified locally: 'for g in scripts/ci-guards/*.sh; do bash $g; done'
returns clean (22 guards pass) on HEAD post-move.

Closes the regression-guards-loop failure that surfaced in CI run
25192163943 (job 73864471346 'Frontend Build').
2026-04-30 22:37:12 +00:00
shankar0123 f20c0961aa ci-pipeline-cleanup Phase 10: coverage PR-comment action
Bundle: ci-pipeline-cleanup, Phase 10 / frozen decision 0.9.

Self-hosted alternative to Codecov / Coveralls. Posts a per-package
coverage delta as a PR comment on every PR; updates the same comment
in place on subsequent pushes (avoids duplicate noise).

scripts/ci-guards/coverage-pr-comment.sh:
- Reads coverage.out from the prior Go Test step
- Builds per-package coverage table (mirrors check-coverage-thresholds
  averaging logic)
- Searches existing PR comments for the '**Coverage report' marker
  and PATCHes the existing one if found, else POSTs a new one
- No-op on non-PR builds (push to master, scheduled, etc.)

Wired into go-build-and-test job after 'Upload Coverage Report' step
with if: github.event_name == 'pull_request' guard.

Operator can swap to Codecov/Coveralls later by replacing this script
+ step with a third-party action — the YAML manifest at
.github/coverage-thresholds.yml stays unchanged either way.
2026-04-30 20:51:48 +00:00
shankar0123 b7a3162028 ci-pipeline-cleanup Phases 7-9: image-and-supply-chain job
Bundle: ci-pipeline-cleanup, Phases 7-9 / frozen decisions 0.8 + 0.10 + 0.11.

NEW image-and-supply-chain job (Ubuntu, ~3 min). Three steps:

PHASE 7 — Digest validity
scripts/ci-guards/digest-validity.sh resolves every @sha256:<digest>
ref in deploy/**/*.{yml,Dockerfile*} against its registry. Closes the
H-001 lying-field gap that Bundle II hit (11 fabricated digests passed
H-001's regex-only check and failed docker pull in CI).
Sandbox verification: 16/16 digests in deploy/* + Dockerfiles all
return HTTP 200 from registry-1.docker.io / ghcr.io / mcr.microsoft.com.

PHASE 8 — Docker build smoke (all 4 Dockerfiles)
Per frozen decision 0.10: build Dockerfile, Dockerfile.agent,
deploy/test/f5-mock-icontrol/Dockerfile, deploy/test/libest/Dockerfile.
Catches syntax errors + COPY path drift before tag-time release.yml.
The test-sidecar Dockerfiles are load-bearing for vendor-e2e — a
syntax error there silently breaks the e2e suite.

PHASE 9 — OpenAPI ↔ handler operationId parity
scripts/ci-guards/openapi-handler-parity.sh extracts router routes
(r.mux.Handle / r.Register "METHOD /path" syntax — Go 1.22+ ServeMux),
extracts OpenAPI operations (paths × HTTP methods), and fails if any
router route has no operationId AND is not documented in the new
api/openapi-handler-exceptions.yaml.

Verified gap at HEAD c48a82c4 (root-caused):
  142 router routes, 136 OpenAPI operations
  6 router-only routes — all SCEP wire-protocol endpoints (RFC-shaped,
    not REST). Documented in api/openapi-handler-exceptions.yaml with
    one-line why: justifications.
  0 OpenAPI-only operations.

Going forward: any new gap fails the build unless documented.

Status checks per push: now 7 (6 after Phases 5+6 collapsed the vendor
matrix and dropped windows; this phase adds 1). This is the final
acceptance gate target.

ci.yml: 383 → 432 lines (+49 for the new job + steps).
2026-04-30 20:50:52 +00:00
shankar0123 0157510d48 ci-pipeline-cleanup Phase 5+6: collapse vendor matrix; delete Windows matrix
Bundle: ci-pipeline-cleanup, Phases 5+6 / frozen decisions 0.4 + 0.5
+ 0.6. Revises Bundle II decisions 0.4 (Windows matrix) and 0.9 (per-
vendor granularity).

PHASE 5 — Linux vendor matrix collapsed (12 jobs → 1):

The previous per-vendor matrix produced 12 status-check rows for
~1 real assertion (115/116 vendor-edge tests are t.Log placeholders
per Bundle II Phase 2-13 design). Granularity was fake signal.

Single-job version: brings up all 11 sidecars at once via
docker compose --profile deploy-e2e up -d, runs go test -run
'VendorEdge_' once, tears down once.

Critical caveat: requireSidecar() in deploy/test/vendor_e2e_helpers.go
uses t.Skipf() when a sidecar isn't reachable — silent test skip,
not CI failure. The new Skip-count enforcement step
(scripts/ci-guards/vendor-e2e-skip-check.sh) counts SKIP lines and
fails the build if it exceeds the allowlist at
scripts/ci-guards/vendor-e2e-skip-allowlist.txt (15 windows-iis-
requiring tests legitimately skip on Linux per Phase 6).
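
One way the skip enforcement could work, sketched under assumptions: the allowlist holds one test name per line with `#` comments, and any skipped test not on it fails the build. The real script may instead compare counts; the function names here are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: compares skipped test names against the allowlist file.

# unexpected_skips LOG ALLOWLIST: print skipped tests that are not allowlisted.
unexpected_skips() {
  local log=$1 allowlist=$2
  # `go test -v` reports a skip as "--- SKIP: TestName (0.00s)".
  sed -n 's/^ *--- SKIP: \([^ ]*\).*/\1/p' "$log" | sort -u |
    grep -vxF -f <(grep -vE '^[[:space:]]*(#|$)' "$allowlist") || true
}

# check_skips LOG ALLOWLIST: fail with one ::error:: line per unexpected skip.
check_skips() {
  local extra name
  extra=$(unexpected_skips "$1" "$2")
  [ -z "$extra" ] && return 0
  while IFS= read -r name; do
    echo "::error::unexpected skip: $name"
  done <<< "$extra"
  return 1
}
```

This turns an unreachable sidecar, which requireSidecar() would otherwise report as a quiet skip, into a hard failure.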

PHASE 6 — Windows matrix deleted entirely:

The deploy-vendor-e2e-windows job removed. Two reasons:
1. Can't physically work on windows-latest today (Docker not started
   in Windows-containers mode by default; bridge network driver
   missing on Windows Docker — see CI run 25183374742 failure logs).
2. Even fixed, validates nothing — all 16 IIS + WinCertStore tests
   are t.Log placeholders that exercise no IIS-specific behavior.

Per Bundle II frozen decision 0.14, the third criterion for
"verified" status in the vendor matrix is operator manual smoke
against a real instance. IIS + WinCertStore now satisfy that via
the playbook (Phase 6 follow-up adds docs/connector-iis.md::
Operator validation playbook).

The windows-iis-test sidecar STAYS in deploy/docker-compose.test.yml
under profiles: [deploy-e2e-windows] for operator local use. Linux
CI never activates this profile.

Operator-required action before merge: RAM headroom verification on
prototype branch (per frozen decision 0.14). If peak RSS > 12 GB on
ubuntu-latest with all 11 sidecars up, fall back to bucketed matrix
per cowork/ci-pipeline-cleanup/decisions-revised.md.

ci.yml: 417 → 383 lines (-34 net; -1105 cumulative since baseline 1488).
Status checks per push: 19 → 7 (12 vendor jobs collapse to 1 = -11;
2 windows jobs deleted = -2; add image-and-supply-chain in Phases 7-9
= +1; net 19-11-2+1 = 7).

Operator action for Phase 13: update GitHub branch protection rules
(required-checks list 19 → 7 entries). Documented in cowork/
ci-pipeline-cleanup/decisions-revised.md.
2026-04-30 20:46:05 +00:00
shankar0123 1caedd5fd3 ci-pipeline-cleanup Phase 1: extract 20 regression guards to scripts/ci-guards/
Bundle: ci-pipeline-cleanup, Phase 1.

Pure relocation — no behavior change. Each guard's bash logic is
byte-identical to the prior inline version; the only changes are:
(a) the guard becomes a sibling script under scripts/ci-guards/<id>.sh,
(b) ci.yml's per-guard step is replaced by a single loop step that
iterates all scripts.

20 scripts extracted (alphabetized):
  B-1-orphan-crud.sh, D-1-D-2-statusbadge-phantom.sh,
  G-1-jwt-auth-literal.sh, G-2-api-key-hash-json.sh,
  G-3-env-docs-drift.sh, H-001-bare-from.sh, H-009-readme-jwt.sh,
  L-001-insecure-skip-verify.sh, L-1-bulk-action-loop.sh,
  M-012-no-root-user.sh, P-1-documented-orphan-fns.sh,
  S-1-hardcoded-source-counts.sh, S-2-strings-contains-err.sh,
  T-1-frontend-page-coverage.sh, U-2-plaintext-healthcheck.sh,
  U-3-migration-mount.sh, bundle-8-L-015-target-blank-rel-noopener.sh,
  bundle-8-L-019-dangerously-set-inner-html.sh,
  bundle-8-M-009-bare-usemutation.sh, test-naming-convention.sh

Plus scripts/ci-guards/README.md documenting the contract:
- Each script must exit 0 on clean repo, non-zero with ::error::
  prefix on regression
- Runnable from repo root via 'bash scripts/ci-guards/<id>.sh'
- Adding a new guard: drop a new <id>.sh; CI auto-picks it up

ci.yml dropped 1488 → 557 lines (-931, -63%).

Single CI loop step now collects ALL guard failures before failing
the build instead of fail-fast — UX win for regressions that hit
two guards at once.
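
A minimal sketch of a collect-then-fail loop of this kind. `run_guards` is a hypothetical wrapper, and the summary line is an assumption; the step in ci.yml may be written differently.

```shell
#!/usr/bin/env bash
# run_guards DIR: run every guard script in DIR, remember which ones failed,
# and fail once at the end instead of stopping at the first failure.
run_guards() {
  local dir=$1 g failed=()
  for g in "$dir"/*.sh; do
    [ -e "$g" ] || continue            # directory had no *.sh files
    bash "$g" || failed+=("$(basename "$g")")
  done
  if [ "${#failed[@]}" -gt 0 ]; then
    echo "::error::${#failed[@]} guard(s) failed: ${failed[*]}"
    return 1
  fi
}
```

Usage from the repo root would be `run_guards scripts/ci-guards`; a regression that trips two guards then shows both annotations in a single run.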

Two guards (QA-doc Part-count + seed-count, ci.yml lines 868-917)
deliberately NOT extracted — they move to 'make verify-docs' in
Phase 11 because they protect docs-the-operator-reads, not the
product itself.

Verification (sandbox):
- All 20 scripts pass against HEAD (chmod +x; for g in scripts/ci-guards/*.sh; do bash $g; done)
- New ci.yml YAML-parses cleanly
- Job boundaries preserved: go-build-and-test, frontend-build,
  helm-lint, deploy-vendor-e2e, deploy-vendor-e2e-windows
- Loop step appears twice (once at end of go-build-and-test, once
  at end of frontend-build) so both jobs continue running their
  set of guards
2026-04-30 20:36:26 +00:00