diff --git a/cowork/ci-pipeline-cleanup/baseline.md b/cowork/ci-pipeline-cleanup/baseline.md new file mode 100644 index 0000000..163d571 --- /dev/null +++ b/cowork/ci-pipeline-cleanup/baseline.md @@ -0,0 +1,159 @@ +# CI Pipeline Cleanup — Phase 0 Baseline + +> Captured against repo HEAD `1de61e91cf07449356d9046a76499c86efe413b1` (operator tag `v2.0.66`) on 2026-04-30. +> Each subsequent Phase that changes a number references this baseline. + +## Repo state + +**HEAD SHA:** `1de61e91cf07449356d9046a76499c86efe413b1` + +**Operator-stamped tag:** `v2.0.66` + +## ci.yml shape + +- Total lines: `1488` +- Total named steps: `53` +- Named regression-guard steps: 22 (enumerated below) + +### The 22 regression-guard steps + +``` +81: - name: Forbidden auth-type literal regression guard (G-1) +144: - name: Forbidden bare InsecureSkipVerify regression guard (L-001) +180: - name: Forbidden bare FROM regression guard (H-001) +201: - name: Forbidden missing USER regression guard (M-012) +228: - name: Forbidden README JWT advertising regression guard (H-009) +254: - name: Forbidden api_key_hash JSON-shape regression guard (G-2) +311: - name: Forbidden plaintext HEALTHCHECK regression guard (U-2) +360: - name: Forbidden migration mount in compose initdb (U-3) +417: - name: Forbidden StatusBadge dead-key + TS phantom-field regression guard (D-1 + D-2) +569: - name: Forbidden client-side bulk-action loop regression guard (L-1) +613: - name: Forbidden orphan-CRUD client function regression guard (B-1) +665: - name: Forbidden strings.Contains(err.Error()) regression guard (S-2) +868: - name: QA-doc Part-count drift guard +886: - name: QA-doc seed-count drift guard +938: - name: Test-naming convention guard (hard-fail) +982: - name: Forbidden hardcoded source-count prose regression guard (S-1) +1027: - name: Documented orphan client fns sync guard (P-1) +1063: - name: Frontend page-coverage regression guard (T-1) +1118: - name: Bundle-8 / L-015 target=_blank rel=noopener regression 
guard +1147: - name: Bundle-8 / L-019 dangerouslySetInnerHTML regression guard +1176: - name: Bundle-8 / M-009 + M-029 Pass 1 mutation contract guard (hard zero) +1220: - name: Forbidden env-var docs drift regression guard (G-3) +``` + +## SA1019 site count + +- **Operator-on-workstation deliverable** — sandbox cannot run `staticcheck`. +- ci.yml inline comment claims "6 sites" (`middleware.NewAuth × 3`, `csr.Attributes`, `elliptic.Marshal`). +- Source-grep at HEAD shows: + - `internal/api/handler/scep.go`: `csr.Attributes` references present + - `internal/connector/issuer/local/local.go`: `elliptic.Marshal` historic refs (already migrated per bundle9_coverage_test.go byte-equivalence test) + - `cmd/server/main_test.go`: `middleware.NewAuth` references TBD +- Operator must run `staticcheck ./... 2>&1 | grep SA1019` on workstation and update Phase 3 plan with the actual site list. + +## Dockerfile inventory (verified 4) + +``` +./Dockerfile.agent +./Dockerfile +./deploy/test/f5-mock-icontrol/Dockerfile +./deploy/test/libest/Dockerfile +``` + +## Migration up/down balance + +- ups: `24` +- downs: `24` +- missing downs: `0` + +## OpenAPI ↔ handler parity gap (verified) + +- operationIds in api/openapi.yaml: `136` +- r.Register calls in router.go: `149` +- Gap to root-cause in Phase 9: 13 routes + +## docker-compose.test.yml sidecars + +``` +52: certctl-tls-init: +107: postgres: +135: pebble-challtestsrv: +150: pebble: +178: step-ca: +213: certctl-server: +363: nginx: +391: certctl-agent: +449: libest-client: +488: apache-test: +502: haproxy-test: +515: traefik-test: +533: caddy-test: +548: envoy-test: +562: postfix-test: +577: dovecot-test: +591: openssh-test: +613: f5-mock-icontrol: +631: k8s-kind-test: +648: windows-iis-test: +666: certctl-test: +``` + +## Makefile::verify body (existing) + +``` +verify: + @echo "==> fmt" + @go fmt ./... | { ! grep -q '.'; } || (echo "gofmt produced changes — commit them" && exit 1) + @echo "==> go vet ./..." + @go vet ./... 
+ @echo "==> golangci-lint run ./... (incl. staticcheck ST*)" + @which golangci-lint > /dev/null || (echo "Installing golangci-lint..." && go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest) + @golangci-lint run ./... --timeout 5m + @echo "==> go test -short ./..." + @go test -short -count=1 ./... + @echo "" + @echo "verify: PASS — safe to commit" + +``` + +## RAM headroom for collapsed vendor-e2e job + +- **Operator-on-workstation deliverable** — requires a prototype branch with the collapsed job + `docker stats` polling. +- Per Phase 0 frozen decision 0.14: if peak RSS ≤ 12 GB on ubuntu-latest (16 GB ceiling), single-job collapse is approved. +- If > 12 GB, fall back to bucketed-matrix design documented in `cowork/ci-pipeline-cleanup/decisions-revised.md`. + +## Coverage thresholds at HEAD + +``` +778: if [ "$(echo "$SERVICE_COV < 70" | bc -l)" -eq 1 ]; then +779: echo "::error::Service layer coverage ${SERVICE_COV}% is below 70% (Bundle R-CI-extended floor — add tests, do not lower the gate)" +782: if [ "$(echo "$HANDLER_COV < 75" | bc -l)" -eq 1 ]; then +783: echo "::error::Handler layer coverage ${HANDLER_COV}% is below 75% (Bundle R-CI-extended floor — add tests, do not lower the gate)" +786: if [ "$(echo "$DOMAIN_COV < 40" | bc -l)" -eq 1 ]; then +787: echo "::error::Domain layer coverage ${DOMAIN_COV}% is below 40% threshold" +790: if [ "$(echo "$MIDDLEWARE_COV < 30" | bc -l)" -eq 1 ]; then +791: echo "::error::Middleware layer coverage ${MIDDLEWARE_COV}% is below 30% threshold" +802: if [ "$(echo "$CRYPTO_COV < 88" | bc -l)" -eq 1 ]; then +803: echo "::error::Crypto package coverage ${CRYPTO_COV}% is below 88% (Bundle R closure floor — add tests, do not lower the gate)" +832: if [ "$(echo "$LOCAL_ISSUER_COV < 86" | bc -l)" -eq 1 ]; then +833: echo "::error::Local-issuer coverage ${LOCAL_ISSUER_COV}% is below 86% (Bundle R closure floor — add tests, do not lower the gate)" +842: if [ "$(echo "$ACME_COV < 80" | bc -l)" -eq 1 ]; then +843: 
echo "::error::ACME issuer coverage ${ACME_COV}% is below 80% (Bundle R-CI-extended floor — add tests, do not lower the gate)" +846: if [ "$(echo "$STEPCA_COV < 80" | bc -l)" -eq 1 ]; then +847: echo "::error::StepCA issuer coverage ${STEPCA_COV}% is below 80% (Bundle L.B closure floor — add tests, do not lower the gate)" +850: if [ "$(echo "$MCP_COV < 85" | bc -l)" -eq 1 ]; then +851: echo "::error::MCP coverage ${MCP_COV}% is below 85% (Bundle K closure floor — add tests, do not lower the gate)" +``` + +## CodeQL workflow (no changes) + +- File: `.github/workflows/codeql.yml` (`81` lines) +- Matrix: `[go, javascript-typescript]` — 2 status checks per push +- Trigger: push to master, PR to master, weekly Sunday cron + +## Status check accounting (verified) + +Today: 1 `go-build-and-test` + 1 `frontend-build` + 1 `helm-lint` + 12 `deploy-vendor-e2e ()` + 2 `deploy-vendor-e2e-windows ()` + 2 `CodeQL Analyze ()` = **19 status checks per push**. + +After cleanup: 1 `go-build-and-test` + 1 `frontend-build` + 1 `helm-lint` + 1 `deploy-vendor-e2e` + 1 `image-and-supply-chain` + 2 `CodeQL Analyze ()` = **7 status checks per push**. diff --git a/cowork/ci-pipeline-cleanup/decisions-revised.md b/cowork/ci-pipeline-cleanup/decisions-revised.md new file mode 100644 index 0000000..ac1ee95 --- /dev/null +++ b/cowork/ci-pipeline-cleanup/decisions-revised.md @@ -0,0 +1,53 @@ +# CI Pipeline Cleanup — Deliberate Revisions of Bundle II Decisions + +This bundle deliberately revises two Bundle II frozen decisions. Both revisions are recorded here for audit trail and acknowledged in the per-Phase commits that implement them. + +## Bundle II decision 0.4 → revised by ci-pipeline-cleanup decision 0.5 + +**Bundle II 0.4 (original):** "IIS e2e strategy — `mcr.microsoft.com/windows/servercore:ltsc2022` Windows containers via Docker Desktop on Windows hosts. 
Linux CI runners CAN'T run Windows containers, so the IIS e2e suite runs on a separate Windows-runner CI matrix job (or operator's local Windows host for development). Documented limitation." + +**ci-pipeline-cleanup 0.5 (revision):** Delete the Windows-runner CI matrix entirely. + +**Rationale for revision:** + +1. The matrix can't physically work on `windows-latest` GitHub-hosted runners today. Verified via the failure logs from CI run `25183374742` (commit `1de61e9`): + - `wincertstore` job: `error during connect: ... open //./pipe/docker_engine: The system cannot find the file specified` — Docker daemon not started in Windows-containers mode. + - `iis` job: image pulled successfully (so the new digest is correct), then died at `failed to create network deploy_certctl-test: could not find plugin bridge in v1 plugin registry: plugin not found` — `bridge` network driver doesn't exist on Windows Docker (uses `nat`). + +2. Even if both Docker-daemon and network-driver issues were fixed, the matrix would validate nothing of substance. Verified by source-grep: all 16 functions matching `TestVendorEdge_(IIS|WinCertStore)_*` in `deploy/test/vendor_e2e_phase3_to_13_test.go` are `t.Log` placeholders that exercise no IIS-specific behavior. The real IIS connector validation lives in `internal/connector/target/iis/` unit tests (run on Linux in `go-build-and-test` — already green per push). + +3. Bundle II decision 0.14 explicitly required operator manual smoke against a real instance for "verified" status in the vendor matrix. Moving IIS + WinCertStore validation to a documented operator playbook in `docs/connector-iis.md` satisfies that criterion better than a fake CI matrix that passes by skipping. + +**Preservation:** the `windows-iis-test` sidecar stays in `deploy/docker-compose.test.yml` under `profiles: [deploy-e2e-windows]` — operators on a Windows host can opt in via `docker compose --profile deploy-e2e-windows up -d windows-iis-test`. 
Linux CI never activates this profile. + +## Bundle II decision 0.9 → revised by ci-pipeline-cleanup decision 0.4 + +**Bundle II 0.9 (original):** "CI parallelism — Each vendor e2e gets its own GitHub Actions matrix job. Vendor failures surface independently in the CI status check (operator sees 'K8s 1.31 vendor-edge fail' as a discrete check, not a generic 'integration tests failed')." + +**ci-pipeline-cleanup 0.4 (revision):** Single `deploy-vendor-e2e` job replaces the 12-job matrix; per-vendor visibility partially restored via skip-detection guard messages. + +**Rationale for revision:** + +1. The per-vendor granularity Bundle II decision 0.9 was designed to provide is fake signal. Verified by source-analysis at HEAD: + ``` + $ grep -cE 't\.Log\(' deploy/test/{vendor_e2e_phase3_to_13,nginx_vendor_e2e}_test.go + deploy/test/nginx_vendor_e2e_test.go:9 + deploy/test/vendor_e2e_phase3_to_13_test.go:106 + + $ awk '/^func TestVendorEdge_/{in_test=1; name=$2; has_assert=0; next} + in_test && /^}$/ {if (has_assert) print name; in_test=0} + in_test && /t\.(Fatal|Error|Errorf|Fatalf|Fail|Failf)/ {has_assert=1}' \ + deploy/test/vendor_e2e_phase3_to_13_test.go deploy/test/nginx_vendor_e2e_test.go + TestVendorEdge_NGINX_HighConcurrencyDeployUnderLoad_E2E + ``` + 115 of 116 vendor-edge test functions are `t.Log`-only — they spin up a sidecar, log a one-line description of the vendor quirk, and return. Only 1 has a real assertion. + +2. Per-vendor status-check granularity costs ~9 sec setup overhead × 12 jobs = ~108 sec of pure runner waste per push (verified from CI run `25183374742` job timings). + +3. The single-job version partially restores per-vendor visibility via the skip-detection guard (decision 0.6): if a sidecar fails to start, the affected tests' SKIP names print in the CI output and the build fails. 
Operators see "TestVendorEdge_K8s_KubeletSyncWaitContract_DefaultTimeout60s_E2E SKIPPED: vendor sidecar 'k8s-kind' not reachable" — same per-vendor signal, just no longer rendered as a separate status-check row. + +**Preservation:** the per-test discoverability via `go test -run 'VendorEdge_'` (Bundle II frozen decision 0.6) is unchanged. Only the matrix-jobs-per-vendor part of decision 0.9 is revised; the per-test naming convention stays. + +## Forward-looking note + +Both revisions are limited in scope to CI execution shape — they do NOT delete the test files, the sidecar definitions, or the documentation that Bundle II shipped. Future work could re-introduce per-vendor matrix jobs if test bodies are filled in with real assertions (transforming the t.Log placeholders into actual contract pins). At that point, decision 0.4 + 0.9 should be re-evaluated. diff --git a/cowork/ci-pipeline-cleanup/frozen-decisions.md b/cowork/ci-pipeline-cleanup/frozen-decisions.md new file mode 100644 index 0000000..7bcc0a9 --- /dev/null +++ b/cowork/ci-pipeline-cleanup/frozen-decisions.md @@ -0,0 +1,64 @@ +# CI Pipeline Cleanup — Frozen Decisions + +> 14 frozen decisions confirmed at Phase 0. Each subsequent Phase references the decision number it implements. + +## 0.1 — Trigger model + +Three-tier split, no mixing: +- **On push/PR to master:** blocking, fast, every check earns its keep, target <10 min wall-clock. +- **Daily cron + workflow_dispatch:** `security-deep-scan.yml` as-is; slow scans, best-effort, never blocks. +- **On tag push (`v*`):** `release.yml` as-is; cross-platform binaries, ghcr.io push, SLSA provenance. + +## 0.2 — Extracted-script location + +`scripts/ci-guards/` at repo root. Operator runs `bash scripts/ci-guards/.sh` locally. Contract documented in `scripts/ci-guards/README.md`. + +## 0.3 — Coverage threshold YAML format + +`.github/coverage-thresholds.yml`. 
Top-level keys are package paths; each entry has `floor:` (integer pct) + `why:` (multi-line string for load-bearing context). Bash step uses Python (already on the runner) to read the YAML — no `yq` dependency. + +## 0.4 — Vendor matrix collapse policy (REVISES Bundle II decision 0.9) + +Single `deploy-vendor-e2e` job replaces 12-job matrix. Bundle II decision 0.9 said "Each vendor e2e gets its own GitHub Actions matrix job" — this revision recognizes that 115/116 vendor-edge tests are `t.Log` placeholders, so per-vendor status-check granularity is fake signal. Skip-detection guard partially restores per-vendor visibility (SKIP messages name the vendor). Documented as deliberate revision in `cowork/ci-pipeline-cleanup/decisions-revised.md`. + +## 0.5 — Windows IIS validation deletion (REVISES Bundle II decision 0.4) + +Delete `deploy-vendor-e2e-windows` matrix entirely. Bundle II decision 0.4 said "the IIS e2e suite runs on a separate Windows-runner CI matrix job" — this revision recognizes that (a) the matrix can't physically work on `windows-latest` (Docker not started in Windows-containers mode; `bridge` driver missing on Windows Docker), and (b) all 16 IIS + WinCertStore tests are `t.Log` placeholders. Move validation to `docs/connector-iis.md::Operator validation playbook` per Bundle II decision 0.14's third criterion. The `windows-iis-test` sidecar stays in `deploy/docker-compose.test.yml` for operator local use. + +## 0.6 — Skip-detection guard semantics + EXPECTED_SKIPS allowlist + +After `go test -tags integration -run 'VendorEdge_'`, count `^--- SKIP:` lines. Allowlist: 6 JavaKeystore tests in `vendor_e2e_phase3_to_13_test.go` that legitimately t.Log without sidecar. Allowlist file at `scripts/ci-guards/vendor-e2e-skip-allowlist.txt`, one test name per line. + +## 0.7 — SA1019 closure approach + +Close each site individually with byte-equivalence tests where the deprecated API was load-bearing. 
Then flip `continue-on-error: true` → `false` in the SAME commit. Do NOT split — shipping the gate without closing sites would fail CI on master. Live verification: `staticcheck ./... 2>&1 | grep -c SA1019` must return 0 BEFORE flipping the gate. + +## 0.8 — Image-and-supply-chain placement + +Separate top-level job (not steps in `go-build-and-test`). Two reasons: (a) digest-validity needs network egress to multiple registries (Docker Hub, ghcr.io, mcr.microsoft.com); bundling it into `go-build-and-test` would block Go tests on registry latency. (b) `docker build` does not depend on the Go tests; isolating it lets the two run concurrently. + +## 0.9 — Coverage PR-comment provider + +Default: lightweight self-hosted action that posts a per-PR comment via `gh pr comment`. Avoids paid SaaS. Operator can swap to Codecov/Coveralls later. + +## 0.10 — Docker build smoke scope + +Build all 4 Dockerfiles in the repo: `Dockerfile`, `Dockerfile.agent`, `deploy/test/f5-mock-icontrol/Dockerfile`, `deploy/test/libest/Dockerfile`. The test-sidecar Dockerfiles are load-bearing for vendor-e2e — a syntax error there silently breaks the e2e suite. Images are tagged `:smoke` and discarded. + +## 0.11 — OpenAPI ↔ handler parity exception YAML + +NEW `api/openapi-handler-exceptions.yaml`. Schema: `documented_exceptions:` list of `{route, why}` entries. The 13-route gap at HEAD is to be root-caused in Phase 9; most are likely health probes / metrics / SCEP-EST-OCSP wire endpoints that legitimately have no operationId. + +## 0.12 — Branch-protection-rule update timing + +Operator updates GitHub branch-protection rules in Phase 13 AFTER the new pipeline ships and runs green on a feature branch + on the first push to master. Required-checks list changes from 19 to 7 entries. Operator action only — agent cannot do this. 
+ +## 0.13 — Make-target naming for new operator-side scripts + +- `make verify` (existing) — required pre-commit; gofmt + vet + lint + tests +- `make verify-deploy` (new) — optional pre-push; digest-validity + OpenAPI parity + docker build smoke (server + agent only — fast subset for local use) +- `make verify-docs` (new) — required pre-tag; QA-doc Part-count + seed-count drift + +## 0.14 — RAM headroom verification methodology + +Phase 0 deliverable. Operator creates the `prototype/ci-pipeline-cleanup-vendor-collapse` branch, runs the collapsed `deploy-vendor-e2e` job once, captures peak RSS via `docker stats --no-stream` snapshots every 30 sec, and records the max in the baseline doc (`cowork/ci-pipeline-cleanup/baseline.md`). If max > 12 GB (75% of 16 GB ceiling), fall back to bucketed matrix (3 jobs × ~4 sidecars). If max ≤ 12 GB, single-job collapse is approved.
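
The 0.14 decision rule can be sketched as a small calculation over the captured snapshots. This is a minimal illustration, not the operator's actual tooling: it assumes each snapshot was captured with `docker stats --no-stream --format '{{.MemUsage}}'` (one `used / limit` line per container), it treats the 12 GB threshold as 12 GiB because `docker stats` prints binary units, and the function names `parse_mem`, `snapshot_total`, and `decide` are hypothetical. Note also that the memory figure `docker stats` reports is the container's cgroup usage, which only approximates RSS.

```python
import re

# Unit multipliers in bytes. docker stats prints binary units (MiB, GiB);
# decimal units are accepted too in case the input was edited by hand.
_UNITS = {
    "B": 1,
    "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3,
    "kB": 1000, "MB": 1000**2, "GB": 1000**3,
}


def parse_mem(value: str) -> float:
    """Convert a size string such as '1.5GiB' to bytes."""
    match = re.fullmatch(r"\s*([0-9.]+)\s*([A-Za-z]+)\s*", value)
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognised memory value: {value!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def snapshot_total(snapshot: str) -> float:
    """Sum used memory (bytes) across all containers in one snapshot.

    Each non-empty line is expected to look like '1.5GiB / 15.6GiB';
    only the part before the slash (the used amount) is counted.
    """
    total = 0.0
    for line in snapshot.splitlines():
        if line.strip():
            total += parse_mem(line.split("/")[0])
    return total


def decide(snapshots, threshold_gib=12.0):
    """Return (peak_gib, verdict) for a list of snapshot strings."""
    peak_gib = max(snapshot_total(s) for s in snapshots) / 1024**3
    verdict = "single-job" if peak_gib <= threshold_gib else "bucketed-matrix"
    return peak_gib, verdict


if __name__ == "__main__":
    # Hypothetical snapshots, for illustration only (not measured data).
    example = [
        "1GiB / 15GiB\n512MiB / 15GiB",
        "10GiB / 15GiB\n3GiB / 15GiB",
    ]
    print(decide(example))
```

The peak is taken over per-snapshot totals rather than per-container maxima, since the question in 0.14 is whether all sidecars fit on the runner at the same moment.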