mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 20:51:30 +00:00
02438ad9e1
Twelve findings from the architecture diligence audit's Phase 3 bundle
closed in one PR. All touch the CI workflows + small doc-drift fixes
across the production Go tree + migration headers.
CI workflow changes
====================
TEST-H1 — Race detection on ./... -short
.github/workflows/ci.yml:106 was a 9-package explicit list. Audit
finding TEST-H1 flagged that 25+ packages (internal/auth/*,
internal/repository/*, internal/mcp, internal/scep, internal/pkcs7,
internal/api/router, internal/api/acme, internal/cli, internal/cms,
internal/config, internal/deploy, internal/integration,
internal/ratelimit, internal/secret, internal/trustanchor, all of
cmd/) silently dropped off race coverage.
Post-fix: 'go test -race -short ./... -count=1 -timeout 600s'.
76 testing.Short() guards already cover testcontainers + live-DB
integration suites, so -short keeps the long-running tests out.
TEST-H2 — Cross-platform build matrix
New 'cross-platform-build' job in ci.yml. Matrix:
ubuntu-latest + windows-latest + macos-latest, fail-fast: false.
Builds cmd/server + cmd/agent + cmd/cli + cmd/mcp-server on each.
Catches Windows-specific regressions (path separators, file
permissions, exec.Command semantics) the pre-Phase-3 Ubuntu-only
CI missed.
TEST-L1 — actions/setup-go cache: true (explicit)
setup-go v5 defaults cache: true; making it explicit so a future
setup-go upgrade can't silently flip it. Re-runs hit the Go module
+ build cache instead of recompiling cold.
TEST-M1 — Mutation-testing floor at 55%
security-deep-scan.yml::go-mutesting step rewritten. Removed
continue-on-error + per-package '|| true'. New post-loop check
extracts every 'The mutation score is X.YZ' line and fails the
step if any package drops below 0.55. Floor rationale: starter
ratio catches major regressions without rejecting the audit's
'this is OK' steady state; raise quarterly.
TEST-M2 — 3 advisory deep-scan gates promoted to blocking
Removed continue-on-error: true from:
- gosec (filtered to G201/G202/G304/G108 high-signal rules:
SQL-injection + path-traversal + pprof-exposed)
- osv-scanner (multi-ecosystem CVE; complements govulncheck
which is already blocking in ci.yml)
- trivy image scan (--severity HIGH,CRITICAL --exit-code 1)
continue-on-error count: 15 → 11.
ZAP / schemathesis / nuclei / testssl stay advisory because their
false-positive rates on https://localhost:8443-targeted DAST runs
are high.
TEST-M3 — Playwright harness stub
web/package.json adds '@playwright/test' devDep + 'e2e' / 'e2e:install'
npm scripts. web/playwright.config.ts ships single chromium project
with webServer block pointing at 'npm run dev'. web/src/__tests__/
e2e/smoke.spec.ts proves the harness wires through. The full 15-flow
suite ships in frontend-design-audit Phase 8 (TEST-H1 in THAT audit);
this is the wiring + a single smoke test as the regression floor.
New Makefile target: 'make e2e-test'.
Doc/code drift fixes
====================
TEST-M4 + ARCH-L2 — Skip inventory artifact + CI guard
scripts/skip-inventory.sh walks every t.Skip site under cmd/ +
internal/ + deploy/test/ and emits docs/testing/skip-inventory.md
grouped by package with file:line:expression triples. Current
inventory: 142 t.Skip sites, 76 testing.Short() guards.
scripts/ci-guards/skip-inventory-drift.sh regenerates and fails on
diff (excluding the 'Last reviewed' timestamp line which drifts
daily). The Markdown is the canonical acquisition-diligence artifact
for 'what tests are being skipped and why.'
ARCH-H3 — MCP catalogue floor reconciliation
Audit framing was '121 vs floor 150 — doc/code drift.' Live count
via the test's actual regex over all 5 tool files (tools.go +
tools_audit_fix.go + tools_auth.go + tools_auth_bundle2.go +
tools_est.go): 155 unique 'Name: "certctl_*"' declarations.
Pre-Phase-3 audit measured tools.go in isolation (121) and missed
the other 4 files (+34 unique names). The test at
internal/ciparity/surface_parity_test.go::TestSurfaceParity_MCP
passes today (155 ≥ 150). Added a clarifying comment near
mcpBaselineFloor explaining the measurement scope so future
reviewers don't repeat the audit's framing error.
STATUS: stale — no code drift, just a measurement scoping error in
the audit.
ARCH-L1 — panic() rationale comments
5 panic sites in production Go (excluding _test.go):
- internal/repository/postgres/tx.go:84
- internal/service/issuer.go:861 (mustJSON)
- internal/service/est.go:728 (mustParseTime)
- internal/service/acme.go:1288 (rand source failure — already documented)
- internal/pkcs7/certrep.go:270 (OID marshal — already documented)
Added ARCH-L1 rationale comments to the 3 sites that didn't have
them. All 5 are defensible impossible-path / rethrow / hardcoded-
constant guards.
ARCH-L3 — Migration IF-NOT-EXISTS carve-outs
4 migrations skip the literal 'IF NOT EXISTS' token but ARE
idempotent via different Postgres patterns:
- 000014_policy_violation_severity_check.up.sql: ALTER TABLE
ADD CONSTRAINT CHECK doesn't accept IF NOT EXISTS; idempotency
via DROP CONSTRAINT IF EXISTS preamble.
- 000018_audit_events_worm.up.sql: CREATE OR REPLACE FUNCTION
+ DROP TRIGGER IF EXISTS + CREATE TRIGGER + DO $$ pg_roles
existence check. CREATE TRIGGER doesn't take IF NOT EXISTS.
- 000030_rbac_admin_perms.up.sql: INSERT ... ON CONFLICT DO NOTHING.
- 000039_audit_crit1_perms.up.sql: same INSERT + ON CONFLICT pattern.
Added ARCH-L3 header comments to each explaining the carve-out so
reviewers don't flag the missing literal token.
STATUS: largely stale — migrations are already idempotent.
ARCH-L4 — TODO/FIXME → see #<descriptor>
5 TODOs rewritten to the allowed 'see #<descriptor>' pattern:
- internal/repository/postgres/auth.go:220 → see #bundle-2-scope-fk
- internal/connector/discovery/gcpsm/gcpsm.go:547 → see #gcpsm-pagination
- internal/service/audit.go:244 → see #audit-pagination-count
- internal/service/job.go:295, 299 → see #validation-job-impl
New CI guard scripts/ci-guards/no-todo-in-prod.sh grep-fails any
new TODO/FIXME in cmd/ + internal/ (excluding _test.go); allows
'see #N' / 'see #<descriptor>' patterns.
Sandbox limitation
==================
The 6.1 GB certctl working tree fills the sandbox volume; go1.25.10
toolchain download fails with 'no space left on device' (sandbox has
1.25.9; go.mod requires 1.25.10). Local 'go test' / 'go build' NOT
run in this commit. Operator must run 'make verify' on their
workstation before push per CLAUDE.md operating rules.
The smoke.spec.ts NOT executed in the sandbox (no chromium installed).
Operator runs 'cd web && npm install && npx playwright install
--with-deps chromium && npm run e2e' on first wire-up.
All CI guards (no-todo-in-prod, skip-inventory-drift, G-3
env-docs-drift, doc-rot-detector, and every existing guard) verified
clean by running each individually.
Closes: cowork/certctl-architecture-diligence-audit.html#fix-TEST-H1,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-H2,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-M1,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-M2,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-M3,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-M4,
cowork/certctl-architecture-diligence-audit.html#fix-TEST-L1,
cowork/certctl-architecture-diligence-audit.html#fix-ARCH-H3,
cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L1,
cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L2,
cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L3,
cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L4
720 lines
33 KiB
YAML
720 lines
33 KiB
YAML
name: CI
|
||
|
||
on:
|
||
push:
|
||
branches:
|
||
- master
|
||
- v2-dev
|
||
pull_request:
|
||
branches:
|
||
- master
|
||
|
||
jobs:
|
||
go-build-and-test:
|
||
name: Go Build & Test
|
||
runs-on: ubuntu-latest
|
||
steps:
|
||
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
|
||
|
||
- name: Set up Go
|
||
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
|
||
with:
|
||
go-version: '1.25.10'
|
||
# Phase 3 TEST-L1 closure (2026-05-13): enable Go's module +
|
||
# build cache so re-runs hit the cache instead of recompiling
|
||
# the world. setup-go v5 cache: true by default; making it
|
||
# explicit so a future setup-go upgrade can't silently flip it.
|
||
cache: true
|
||
|
||
- name: Go Build
|
||
run: |
|
||
go build ./cmd/server/...
|
||
go build ./cmd/agent/...
|
||
go build ./cmd/mcp-server/...
|
||
go build ./cmd/cli/...
|
||
|
||
- name: gofmt drift (Makefile::verify parity)
|
||
# ci-pipeline-cleanup Phase 4 / frozen decision 0.13: Makefile::verify
|
||
# checks gofmt + vet + golangci-lint + go test. CI runs vet, lint, test
|
||
# already — but NOT gofmt. This step closes the parity gap.
|
||
# Mirrors the Makefile::verify shape: any gofmt output means the
|
||
# source needs reformatting.
|
||
run: |
|
||
out=$(gofmt -l .)
|
||
if [ -n "$out" ]; then
|
||
echo "::error::gofmt would reformat these files (run 'gofmt -w' locally):"
|
||
echo "$out"
|
||
exit 1
|
||
fi
|
||
|
||
- name: go mod tidy drift
|
||
# ci-pipeline-cleanup Phase 4: catches PRs that import a package
|
||
# without committing the go.mod / go.sum update. Standard Go-CI
|
||
# gate; absent before this bundle.
|
||
run: |
|
||
go mod tidy
|
||
git diff --exit-code go.mod go.sum
|
||
|
||
- name: Go Vet
|
||
run: go vet ./...
|
||
|
||
- name: Install golangci-lint
|
||
run: |
|
||
curl -sSfL https://raw.githubusercontent.com/golangci/golangci-lint/master/install.sh | sh -s -- -b $(go env GOPATH)/bin v2.11.4
|
||
|
||
- name: Run golangci-lint
|
||
run: golangci-lint run ./... --timeout 5m
|
||
|
||
- name: Install govulncheck
|
||
run: go install golang.org/x/vuln/cmd/govulncheck@latest
|
||
|
||
- name: Run govulncheck (M-024 hard gate)
|
||
# Bundle-7 / D-001 partial: govulncheck distinguishes called-vs-uncalled
|
||
# advisories. Default exit code is non-zero only when YOUR code calls
|
||
# the vulnerable function — deferred-call advisories show up in the
|
||
# output but don't fail the gate.
|
||
#
|
||
# Bundle F / Audit M-024 (NIST SSDF PW.7.2): the govulncheck step
|
||
# is now a hard CI gate (no `continue-on-error`). Bundle E's
|
||
# transitive bumps (x/net 0.42→0.47, x/crypto 0.41→0.45) cleared
|
||
# the 5 deferred-call advisories that were previously on the
|
||
# exception list, so the carve-out the original Bundle F prompt
|
||
# designed is unnecessary — a clean `govulncheck ./...` is the
|
||
# right gate. If a future advisory lands in a function our code
|
||
# does call, this step fails the build until either upstream
|
||
# ships a fix OR we cut the dep. Deferred-call advisories that
|
||
# legitimately can't be remediated yet should be added to the
|
||
# NIST SSDF deviation log in docs/operator/security.md, not silenced here.
|
||
run: govulncheck ./...
|
||
|
||
- name: Install staticcheck (Bundle-7 / D-001)
|
||
run: go install honnef.co/go/tools/cmd/staticcheck@latest
|
||
|
||
- name: Run staticcheck
|
||
# Bundle-7 / D-001: Go static analysis additive to vet. Suppressed
|
||
# rules live in staticcheck.conf with documented justifications;
|
||
# adding a new entry requires an explicit security review.
|
||
#
|
||
# ci-pipeline-cleanup Phase 3 / frozen decision 0.7: HARD gate.
|
||
# M-028 SA1019 sites verified closed at HEAD 1de61e91:
|
||
# - middleware.NewAuth: zero callers (all migrated to
|
||
# NewAuthWithNamedKeys in cmd/server/{main,main_test}.go)
|
||
# - csr.Attributes (internal/api/handler/scep.go × 2): inline
|
||
# //lint:ignore SA1019 with load-bearing rationale (RFC 2985
|
||
# challengePassword has no non-deprecated stdlib API)
|
||
# - elliptic.Marshal: only in bundle9_coverage_test.go × 1 as
|
||
# deliberate byte-equivalence regression oracle, suppressed
|
||
# with //lint:ignore SA1019
|
||
run: staticcheck ./...
|
||
|
||
- name: Race Detection
|
||
# Phase 3 TEST-H1 closure (2026-05-13): the pre-Phase-3 invocation
|
||
# listed 9 explicit package roots, excluding internal/auth/*,
|
||
# internal/repository/*, internal/mcp, internal/scep, internal/pkcs7,
|
||
# internal/api/router, internal/api/acme, internal/cli, internal/cms,
|
||
# internal/config, internal/deploy, internal/integration,
|
||
# internal/ratelimit, internal/secret, internal/trustanchor, plus
|
||
# all of cmd/. Audit finding TEST-H1 flagged this as silent
|
||
# race-detection drift — packages added after the original list
|
||
# was authored were never covered.
|
||
#
|
||
# Post-Phase-3: ./... with -short. The 76 testing.Short() guards
|
||
# already in the integration-test surface (testcontainers, live-DB,
|
||
# multi-process) gate behind this flag, so race detection runs
|
||
# across every package without dragging in long-running suites.
|
||
# Timeout doubled from 300s to 600s because ./... is broader; the
|
||
# broader scope is what makes race coverage trustworthy.
|
||
run: go test -race -short ./... -count=1 -timeout 600s
|
||
|
||
- name: Go Test with Coverage
|
||
# internal/ciparity/... — post-v2.1.0 anti-rot item 2 surface-
|
||
# parity tests; stdlib-only so they always pass in this job.
|
||
run: |
|
||
go test ./internal/service/... ./internal/api/handler/... ./internal/api/middleware/... ./internal/api/router/... ./internal/auth/... ./internal/integration/... ./internal/connector/issuer/... ./internal/connector/target/... ./internal/connector/notifier/... ./internal/connector/discovery/... ./internal/crypto/... ./internal/mcp/... ./internal/cli/... ./internal/domain/... ./internal/validation/... ./internal/tlsprobe/... ./internal/ciparity/... -count=1 -cover -coverprofile=coverage.out
|
||
|
||
- name: Check Coverage Thresholds
|
||
# ci-pipeline-cleanup Phase 2: per-package floors moved to
|
||
# .github/coverage-thresholds.yml. Each entry has `floor:` +
|
||
# `why:` (load-bearing context). Logic in
|
||
# scripts/check-coverage-thresholds.sh — operator runs the same
|
||
# script locally via `make verify`-equivalent loop.
|
||
run: bash scripts/check-coverage-thresholds.sh
|
||
|
||
- name: Upload Coverage Report
|
||
uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
|
||
with:
|
||
name: go-coverage
|
||
path: coverage.out
|
||
retention-days: 30
|
||
|
||
- name: Coverage PR comment
|
||
# ci-pipeline-cleanup Phase 10 / frozen decision 0.9: self-hosted
|
||
# alternative to Codecov / Coveralls. Posts a per-package coverage
|
||
# delta as a PR comment; updates in place on subsequent pushes.
|
||
if: github.event_name == 'pull_request'
|
||
env:
|
||
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
|
||
PR_NUMBER: ${{ github.event.number }}
|
||
GITHUB_REPOSITORY: ${{ github.repository }}
|
||
run: bash scripts/coverage-pr-comment.sh
|
||
|
||
# Bundle Q / I-001 closure — test-naming convention guard (informational).
|
||
# The convention is `Test<Func>_<Scenario>_<ExpectedResult>`. This step
|
||
# prints any non-conformant tests but does NOT fail the build until the
|
||
# Bundle I-001-extended (2026-04-27) — promoted from informational
|
||
# to hard-fail. The convention is now: every `func TestXxx(...)` MUST
|
||
# match Go's standard test-runner pattern (`^func Test[A-Z]`). Tests
|
||
# whose name starts with `func Test<lowercase>` are silently SKIPPED
|
||
# by `go test` (Go only runs `Test[A-Z]...`) — those are the real
|
||
# bugs this guard catches.
|
||
#
|
||
# The original audit's `Test<Func>_<Scenario>_<ExpectedResult>` triple-
|
||
# token prescription has been relaxed: single-function pin tests like
|
||
# `TestNewAgent` or `TestSplitPEMChain` are valid Go convention, with
|
||
# internal scenarios expressed via `t.Run` subtests. Requiring the
|
||
# underscore-Scenario-Result triple repo-wide would mean renaming
|
||
# 167 legitimate tests for no observable behavior change. The
|
||
# Test<Func>_<Scenario>_<ExpectedResult> form remains the
|
||
# recommended pattern for parameterized scenarios, but is not gated.
|
||
- name: Regression guards (extracted to scripts/ci-guards/)
|
||
# All named regression guards live at scripts/ci-guards/<id>.sh per
|
||
# ci-pipeline-cleanup bundle Phase 1. Each guard is callable locally:
|
||
# bash scripts/ci-guards/G-3-env-docs-drift.sh
|
||
# Adding a new guard: drop a new <id>.sh; this loop auto-picks it up.
|
||
# Contract: each guard MUST exit 0 on clean repo, non-zero with
|
||
# ::error:: prefix on regression. See scripts/ci-guards/README.md.
|
||
#
|
||
run: |
|
||
set -e
|
||
fail=0
|
||
for g in scripts/ci-guards/*.sh; do
|
||
echo "::group::$(basename "$g")"
|
||
if ! bash "$g"; then
|
||
fail=1
|
||
fi
|
||
echo "::endgroup::"
|
||
done
|
||
exit $fail
|
||
|
||
cross-platform-build:
|
||
# Phase 3 TEST-H2 closure (2026-05-13): the pre-Phase-3 CI ran
|
||
# exclusively on ubuntu-latest, leaving Windows-specific bugs
|
||
# (path separators, file permissions, exec.Command semantics)
|
||
# undetected. The agent + CLI binaries ship for Windows + macOS
|
||
# users; this matrix asserts they at least BUILD on every OS we
|
||
# claim to support.
|
||
#
|
||
# Build-only — no test run. Full test parity across OSes is a
|
||
# larger investment (testcontainers is Linux-only on Windows CI
|
||
# runners, file-permission tests differ, etc.). The build gate
|
||
# is the minimum that catches the cross-platform regressions
|
||
# we've seen in practice.
|
||
name: Cross-platform build (ubuntu / windows / macos)
|
||
strategy:
|
||
fail-fast: false
|
||
matrix:
|
||
os: [ubuntu-latest, windows-latest, macos-latest]
|
||
runs-on: ${{ matrix.os }}
|
||
steps:
|
||
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
|
||
|
||
- name: Set up Go
|
||
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
|
||
with:
|
||
go-version: '1.25.10'
|
||
cache: true
|
||
|
||
- name: Build server + agent + CLI + mcp-server
|
||
run: |
|
||
go build ./cmd/server
|
||
go build ./cmd/agent
|
||
go build ./cmd/cli
|
||
go build ./cmd/mcp-server
|
||
|
||
cold-db-compose-smoke:
|
||
# Per post-v2.1.0 anti-rot item 6 (Auditable Codebase Bundle).
|
||
#
|
||
# Catches migration-on-cold-DB regressions: wipe the postgres
|
||
# volume, bring the stack up cold, mint a day-0 admin, issue +
|
||
# renew + revoke a test certificate, assert audit rows, tear down.
|
||
# Targets the bug class that the warm-DB integration suite misses
|
||
# (canonical case: 2026-05-09 migration 000045 broken INSERT,
|
||
# fixed in commit 6444e13).
|
||
name: Cold-DB compose smoke
|
||
runs-on: ubuntu-latest
|
||
needs: go-build-and-test
|
||
steps:
|
||
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
|
||
|
||
- name: Show Docker versions
|
||
run: |
|
||
docker --version
|
||
docker compose version
|
||
|
||
- name: Cold-DB compose smoke
|
||
# The smoke deliberately focuses on the bug class that ONLY a
|
||
# cold boot can catch: stack-startup correctness against a
|
||
# blank database. It is intentionally NOT a functional API
|
||
# walkthrough — the integration test suite under
|
||
# 'Go Test with Coverage' already covers issue / renew /
|
||
# revoke / audit-row plumbing against a warm DB.
|
||
#
|
||
# The bugs this gate is uniquely positioned to catch:
|
||
# - Missing required env vars that fail Config.Validate()
|
||
# at startup (e.g. CERTCTL_DEMO_MODE_ACK gap, 2026-05-12).
|
||
# - Non-idempotent migrations that crash on the second boot
|
||
# (e.g. migration 000043 CHECK constraint, 2026-05-12).
|
||
# - Documented manual flows that don't work end-to-end on
|
||
# a clean compose (e.g. CERTCTL_BOOTSTRAP_TOKEN
|
||
# interpolation gap, 2026-05-12).
|
||
#
|
||
# Bugs OUTSIDE the scope of this smoke (covered elsewhere):
|
||
# - API request/response contract changes (integration suite).
|
||
# - Cert lifecycle correctness (integration suite + handler
|
||
# tests).
|
||
# - Audit row plumbing (handler tests).
|
||
#
|
||
# 10-min wall-clock cap covers cold image pull + compose-up +
|
||
# force-recreate + admin bootstrap + teardown. Increase only
|
||
# if the underlying steps legitimately grow.
|
||
#
|
||
# The smoke is inlined here on purpose — it is NOT a script in
|
||
# scripts/ci-guards/, because there is no value in a developer
|
||
# running this locally. The whole point of the gate is that CI
|
||
# owns the cold-DB state; the operator never has to remember to
|
||
# run it.
|
||
timeout-minutes: 10
|
||
working-directory: deploy
|
||
env:
|
||
STARTUP_TIMEOUT_SECONDS: 300
|
||
run: |
|
||
set -e
|
||
set -o pipefail
|
||
|
||
SERVER_URL="https://localhost:8443"
|
||
CACERT_PATH="${GITHUB_WORKSPACE}/deploy/test/certs/ca.crt"
|
||
|
||
log() { echo "[cold-db-smoke] $*"; }
|
||
|
||
wait_for_service_healthy() {
|
||
local svc="$1" deadline=$(( $(date +%s) + STARTUP_TIMEOUT_SECONDS ))
|
||
while [ "$(date +%s)" -lt "$deadline" ]; do
|
||
local state
|
||
state="$(docker compose ps --format json "$svc" 2>/dev/null | python3 -c '
|
||
import json, sys
|
||
try:
|
||
line = sys.stdin.read().strip()
|
||
if not line:
|
||
print("not-up"); sys.exit(0)
|
||
rows = json.loads(line) if line.startswith("[") else [json.loads(l) for l in line.splitlines() if l.strip()]
|
||
if not rows:
|
||
print("not-up")
|
||
else:
|
||
print(rows[0].get("Health", rows[0].get("State", "?")))
|
||
except Exception as e:
|
||
print(f"err: {e}")
|
||
')"
|
||
if [ "$state" = "healthy" ] || [ "$state" = "running" ]; then
|
||
log " $svc → $state"; return 0
|
||
fi
|
||
sleep 2
|
||
done
|
||
log " $svc did NOT reach healthy within ${STARTUP_TIMEOUT_SECONDS}s (last: $state)"
|
||
return 1
|
||
}
|
||
|
||
http_call() {
|
||
local method="$1" path="$2" data="${3:-}"
|
||
local args=(--silent --show-error --max-time 30 -X "$method" "$SERVER_URL$path")
|
||
[ -f "$CACERT_PATH" ] && args+=(--cacert "$CACERT_PATH") || args+=(--insecure)
|
||
[ -n "$data" ] && args+=(-H "Content-Type: application/json" -d "$data")
|
||
curl "${args[@]}"
|
||
}
|
||
|
||
# Bundle 2 closure (2026-05-12): the base compose is now
|
||
# production-shaped — auth=api-key + agent-keygen + fail-closed
|
||
# placeholder guards. The cold-DB smoke layers in the demo
|
||
# overlay so the boot path remains zero-config: the overlay
|
||
# supplies AUTH_TYPE=none + DEMO_MODE_ACK=true + the matching
|
||
# placeholder creds the fail-closed guards accept under
|
||
# DEMO_MODE_ACK. The agent service in the overlay also
|
||
# pre-seeds CERTCTL_AGENT_ID=agent-demo-1 so the bundled
|
||
# agent doesn't restart-loop. The smoke's purpose (catch
|
||
# migration-on-cold-DB regressions + verify bootstrap-token
|
||
# endpoint mints a day-0 admin against a freshly migrated
|
||
# schema) is orthogonal to whether the auth posture is
|
||
# demo-mode or api-key, so the overlay is acceptable here.
|
||
COMPOSE_FILES=(-f docker-compose.yml -f docker-compose.demo.yml)
|
||
|
||
log "1/4 down -v --remove-orphans"
|
||
docker compose "${COMPOSE_FILES[@]}" down -v --remove-orphans 2>&1 | tail -3 || true
|
||
|
||
log "2/4 up -d (cold boot)"
|
||
docker compose "${COMPOSE_FILES[@]}" up -d 2>&1 | tail -3
|
||
|
||
log "3/4 wait for healthchecks"
|
||
wait_for_service_healthy postgres
|
||
wait_for_service_healthy certctl-server
|
||
wait_for_service_healthy certctl-agent || log " (agent skipped)"
|
||
|
||
log "4/4 minting day-0 admin (proves migration ladder + bootstrap path)"
|
||
TOKEN="$(openssl rand -base64 32 | tr -d '\n')"
|
||
echo "CERTCTL_BOOTSTRAP_TOKEN=$TOKEN" > /tmp/_smoke.env
|
||
docker compose "${COMPOSE_FILES[@]}" --env-file /tmp/_smoke.env up -d --force-recreate certctl-server 2>&1 | tail -2
|
||
sleep 5
|
||
wait_for_service_healthy certctl-server
|
||
BODY="$(http_call POST /api/v1/auth/bootstrap "{\"token\":\"$TOKEN\",\"actor_name\":\"smoke-admin\"}")"
|
||
KEY="$(echo "$BODY" | python3 -c 'import json,sys; print(json.load(sys.stdin)["key_value"])')"
|
||
[ -n "$KEY" ] || { log "bootstrap failed: $BODY"; exit 1; }
|
||
|
||
log "PASS — cold boot + force-recreate + admin bootstrap all green"
|
||
log "tearing down"
|
||
docker compose "${COMPOSE_FILES[@]}" down -v 2>&1 | tail -2
|
||
|
||
- name: Dump compose logs on failure
|
||
if: failure()
|
||
working-directory: deploy
|
||
run: |
|
||
for svc in postgres certctl-server certctl-agent certctl-tls-init; do
|
||
echo "==== $svc ===="
|
||
docker compose -f docker-compose.yml -f docker-compose.demo.yml logs --no-color --tail 200 "$svc" || true
|
||
done
|
||
|
||
frontend-build:
|
||
name: Frontend Build
|
||
runs-on: ubuntu-latest
|
||
steps:
|
||
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
|
||
|
||
- name: Set up Node.js
|
||
uses: actions/setup-node@49933ea5288caeca8642d1e84afbd3f7d6820020 # v4
|
||
with:
|
||
node-version: '22'
|
||
|
||
- name: Install Dependencies
|
||
working-directory: web
|
||
run: npm ci
|
||
|
||
- name: npm audit (production deps, high+critical)
|
||
# Phase 1 TEST-L2 closure (2026-05-13):
|
||
# Production frontend dependencies must not carry high or
|
||
# critical CVEs. Dev-only deps (vitest, vite, eslint, etc.)
|
||
# are excluded via --omit=dev since they never ship to
|
||
# operators. If this gate fires, triage each finding via npm
|
||
# overrides, dep upgrade, or a tracked --ignore with an issue
|
||
# link. Do not mass-silence findings.
|
||
working-directory: web
|
||
run: npm audit --omit=dev --audit-level=high
|
||
|
||
- name: TypeScript Check
|
||
working-directory: web
|
||
run: npx tsc --noEmit
|
||
|
||
- name: Run Frontend Tests
|
||
working-directory: web
|
||
run: npx vitest run
|
||
|
||
- name: Build Frontend
|
||
working-directory: web
|
||
run: npx vite build
|
||
|
||
- name: Regression guards (extracted to scripts/ci-guards/)
|
||
# All named regression guards live at scripts/ci-guards/<id>.sh per
|
||
# ci-pipeline-cleanup bundle Phase 1. Each guard is callable locally:
|
||
# bash scripts/ci-guards/G-3-env-docs-drift.sh
|
||
# Adding a new guard: drop a new <id>.sh; this loop auto-picks it up.
|
||
# Contract: each guard MUST exit 0 on clean repo, non-zero with
|
||
# ::error:: prefix on regression. See scripts/ci-guards/README.md.
|
||
run: |
|
||
set -e
|
||
fail=0
|
||
for g in scripts/ci-guards/*.sh; do
|
||
echo "::group::$(basename "$g")"
|
||
if ! bash "$g"; then
|
||
fail=1
|
||
fi
|
||
echo "::endgroup::"
|
||
done
|
||
exit $fail
|
||
|
||
helm-lint:
|
||
name: Helm Chart Validation
|
||
runs-on: ubuntu-latest
|
||
steps:
|
||
- uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4
|
||
|
||
- name: Install Helm
|
||
uses: azure/setup-helm@1a275c3b69536ee54be43f2070a358922e12c8d4 # v4
|
||
with:
|
||
version: '3.13.0'
|
||
|
||
# HTTPS-Everywhere (v2.0.47): the chart fails render when no TLS source is
|
||
# configured. Every lint/template invocation below must pick exactly one
|
||
# provisioning mode — see deploy/helm/certctl/templates/_helpers.tpl
|
||
# (certctl.tls.required) and docs/operator/tls.md.
|
||
#
|
||
# Bundle 3 closure (2026-05-12, commit f1fa311): the chart now ALSO
|
||
# fails render when (a) server.auth.type=api-key + apiKey empty, or
|
||
# (b) postgresql.enabled=true + postgresql.auth.password empty.
|
||
# Every positive render below MUST pass both secrets; inverse tests
|
||
# at the bottom of this job pin the fail-fast guards in place.
|
||
- name: Lint Helm Chart
|
||
run: |
|
||
helm lint deploy/helm/certctl/ \
|
||
--set server.tls.existingSecret=certctl-tls-ci \
|
||
--set server.auth.apiKey=ci-api-key-placeholder \
|
||
--set postgresql.auth.password=ci-postgres-placeholder
|
||
|
||
- name: Template Helm Chart (existingSecret mode)
|
||
run: |
|
||
helm template certctl deploy/helm/certctl/ \
|
||
--set server.tls.existingSecret=certctl-tls-ci \
|
||
--set server.auth.apiKey=ci-api-key-placeholder \
|
||
--set postgresql.auth.password=ci-postgres-placeholder \
|
||
> /dev/null
|
||
|
||
- name: Template Helm Chart (cert-manager mode)
|
||
run: |
|
||
helm template certctl deploy/helm/certctl/ \
|
||
--set server.tls.certManager.enabled=true \
|
||
--set server.tls.certManager.issuerRef.name=letsencrypt-prod \
|
||
--set server.auth.apiKey=ci-api-key-placeholder \
|
||
--set postgresql.auth.password=ci-postgres-placeholder \
|
||
> /dev/null
|
||
|
||
- name: Template Helm Chart (external Postgres mode — Bundle 3 D2)
|
||
run: |
|
||
# Closes Bundle 3 D2: postgresql.enabled=false must (a) render
|
||
# cleanly with externalDatabase.url and (b) emit ZERO postgres-*
|
||
# templates. The render output is grep-checked below.
|
||
out=$(helm template certctl deploy/helm/certctl/ \
|
||
--set server.tls.existingSecret=certctl-tls-ci \
|
||
--set postgresql.enabled=false \
|
||
--set externalDatabase.url='postgres://u:p@db.example.com:5432/certctl?sslmode=require' \
|
||
--set server.auth.apiKey=ci-api-key-placeholder)
|
||
# Bundled-Postgres resources must not appear when postgresql.enabled=false.
|
||
if echo "$out" | grep -qE "^kind: StatefulSet$"; then
|
||
echo "::error::Bundle 3 D2 regression: postgres StatefulSet rendered with postgresql.enabled=false"
|
||
exit 1
|
||
fi
|
||
if echo "$out" | grep -q "postgres-secret.yaml"; then
|
||
echo "::error::Bundle 3 D2 regression: postgres-secret rendered with postgresql.enabled=false"
|
||
exit 1
|
||
fi
|
||
|
||
- name: Template Helm Chart (guard fails without TLS)
|
||
run: |
|
||
# Inverse test: the chart MUST refuse to render when no TLS source is
|
||
# configured. If this ever renders successfully, the fail-loud guard
|
||
# in certctl.tls.required has regressed.
|
||
if helm template certctl deploy/helm/certctl/ > /dev/null 2>&1; then
|
||
echo "::error::Helm chart rendered without a TLS source — fail-loud guard regressed"
|
||
exit 1
|
||
fi
|
||
|
||
- name: Template Helm Chart (guard fails — Bundle 3 D7 TLS both-set)
|
||
run: |
|
||
# Bundle 3 D7: setting BOTH existingSecret AND certManager.enabled
|
||
# creates two conflicting TLS sources of truth. Chart must refuse.
|
||
if helm template certctl deploy/helm/certctl/ \
|
||
--set server.tls.existingSecret=ci \
|
||
--set server.tls.certManager.enabled=true \
|
||
--set server.tls.certManager.issuerRef.name=foo \
|
||
--set server.auth.apiKey=k \
|
||
--set postgresql.auth.password=p \
|
||
> /dev/null 2>&1; then
|
||
echo "::error::Bundle 3 D7 regression: chart rendered with BOTH TLS sources configured"
|
||
exit 1
|
||
fi
|
||
|
||
- name: Template Helm Chart (guard fails — Bundle 3 D1 missing apiKey)
|
||
run: |
|
||
# Bundle 3 D1: missing server.auth.apiKey when auth.type=api-key
|
||
# must fail at template time, not silently render an empty Secret.
|
||
if helm template certctl deploy/helm/certctl/ \
|
||
--set server.tls.existingSecret=ci \
|
||
--set postgresql.auth.password=p \
|
||
> /dev/null 2>&1; then
|
||
echo "::error::Bundle 3 D1 regression: chart rendered with empty server.auth.apiKey"
|
||
exit 1
|
||
fi
|
||
|
||
- name: Template Helm Chart (guard fails — Bundle 3 D1 missing pg password)
|
||
run: |
|
||
# Bundle 3 D1: missing postgresql.auth.password when postgresql.enabled=true
|
||
# must fail at template time, not silently use a fallback default.
|
||
if helm template certctl deploy/helm/certctl/ \
|
||
--set server.tls.existingSecret=ci \
|
||
--set server.auth.apiKey=k \
|
||
> /dev/null 2>&1; then
|
||
echo "::error::Bundle 3 D1 regression: chart rendered with empty postgresql.auth.password"
|
||
exit 1
|
||
fi
|
||
|
||
- name: Template Helm Chart (guard fails — Bundle 3 D1 missing external DB URL)
|
||
run: |
|
||
# Bundle 3 D1: missing externalDatabase.url when postgresql.enabled=false
|
||
# must fail at template time.
|
||
if helm template certctl deploy/helm/certctl/ \
|
||
--set server.tls.existingSecret=ci \
|
||
--set postgresql.enabled=false \
|
||
--set server.auth.apiKey=k \
|
||
> /dev/null 2>&1; then
|
||
echo "::error::Bundle 3 D1 regression: chart rendered with postgresql.enabled=false + empty externalDatabase.url"
|
||
exit 1
|
||
fi
|
||
|
||
# =============================================================================
|
||
# deploy-vendor-e2e — single-job (collapsed from 12-job matrix)
|
||
# =============================================================================
|
||
# Per ci-pipeline-cleanup bundle Phase 5 / frozen decision 0.4 (revises
|
||
# Bundle II decision 0.9): the per-vendor matrix produced 12 status-check
|
||
# rows for ~1 real assertion (115/116 vendor-edge tests are t.Log
|
||
# placeholders). Collapsed to one job that brings up all 11 sidecars
|
||
# at once and runs the full VendorEdge_ test set.
|
||
#
|
||
# Skip-detection guard (scripts/vendor-e2e-skip-check.sh)
|
||
# enforces that no test SKIPs except the documented allowlist
|
||
# (windows-iis-requiring tests on Linux). If a sidecar fails to come
|
||
# up, requireSidecar() in deploy/test/vendor_e2e_helpers.go calls
|
||
# t.Skipf() — the guard catches that.
|
||
#
|
||
# RAM headroom on ubuntu-latest (16 GB ceiling) — operator-confirmed
|
||
# in Phase 0 / frozen decision 0.14 prototype-branch run. If RAM
|
||
# regresses, fall back to bucketed matrix per
|
||
# the project's frozen-decisions log.
|
||
#
|
||
# The Windows matrix (deploy-vendor-e2e-windows) was deleted entirely
|
||
# per Phase 6 / frozen decision 0.5 (revises Bundle II decision 0.4).
|
||
# IIS + WinCertStore validation moved to the operator playbook at
|
||
# docs/connector-iis.md::Operator validation playbook.
|
||
deploy-vendor-e2e:
|
||
name: deploy-vendor-e2e
|
||
runs-on: ubuntu-latest
|
||
needs: [go-build-and-test]
|
||
timeout-minutes: 30
|
||
steps:
|
||
- uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5
|
||
|
||
- name: Set up Go
|
||
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
|
||
with:
|
||
go-version: '1.25.10'
|
||
cache: true
|
||
|
||
- name: Build f5-mock-icontrol sidecar
|
||
# The only sidecar without a published image; built from the in-tree
|
||
# Go server at deploy/test/f5-mock-icontrol/.
|
||
run: docker compose --profile deploy-e2e -f deploy/docker-compose.test.yml build f5-mock-icontrol
|
||
|
||
- name: Bring up all vendor sidecars
|
||
# Brings up the 11 deploy-e2e sidecars (apache-test, haproxy-test,
|
||
# traefik-test, caddy-test, envoy-test, postfix-test, dovecot-test,
|
||
# openssh-test, f5-mock-icontrol, k8s-kind-test, windows-iis-test
|
||
# which is gated by a separate windows-only profile and won't
|
||
# actually start) plus the always-on legacy nginx.
|
||
run: |
|
||
docker compose --profile deploy-e2e -f deploy/docker-compose.test.yml up -d
|
||
sleep 15
|
||
|
||
- name: Run all vendor-edge e2e
|
||
# Captures test output for skip-count enforcement (next step).
|
||
env:
|
||
INTEGRATION: "1"
|
||
run: |
|
||
go test -tags integration -race -count=1 -run 'VendorEdge_' \
|
||
./deploy/test/... 2>&1 | tee test-output.log
|
||
|
||
- name: Skip-count enforcement
|
||
# ci-pipeline-cleanup Phase 5 / frozen decision 0.6:
|
||
# requireSidecar uses t.Skipf (not t.Fatal) when a sidecar isn't
|
||
# reachable — collapsing the per-vendor matrix removes the implicit
|
||
# guard each per-job matrix entry provided. This step counts SKIP
|
||
# lines in the test output and fails the build if it exceeds the
|
||
# allowlist (windows-iis-requiring tests; legitimately skipped
|
||
# on Linux per Phase 6 / frozen decision 0.5).
|
||
run: bash scripts/vendor-e2e-skip-check.sh test-output.log
|
||
|
||
- name: Diagnostic dump on failure
|
||
# Prints container status + last 200 log lines from the certctl-server
|
||
# and base-stack containers when ANY previous step in this job fails.
|
||
# The matrix-collapse (Phase 5) brings up ~18 containers concurrently
|
||
# (vs 1 vendor sidecar at a time pre-collapse); transient failures
|
||
# surface most often as "container certctl-test-server is unhealthy"
|
||
# without any visible reason because compose only reports the
|
||
# dependency-chain symptom, not the root cause. Dumping logs here
|
||
# makes the underlying error (DB migration crash, port bind failure,
|
||
# entrypoint stall, OOM kill) visible in the GitHub Actions log
|
||
# without requiring a workstation reproduction.
|
||
if: failure()
|
||
run: |
|
||
echo "=== docker compose ps -a ==="
|
||
docker compose --profile deploy-e2e -f deploy/docker-compose.test.yml ps -a || true
|
||
echo ""
|
||
echo "=== certctl-test-server logs (last 200 lines) ==="
|
||
docker logs --tail 200 certctl-test-server 2>&1 || true
|
||
echo ""
|
||
echo "=== certctl-test-tls-init logs ==="
|
||
docker logs certctl-test-tls-init 2>&1 || true
|
||
echo ""
|
||
echo "=== certctl-test-postgres logs (last 100 lines) ==="
|
||
docker logs --tail 100 certctl-test-postgres 2>&1 || true
|
||
echo ""
|
||
echo "=== certctl-test-stepca logs (last 100 lines) ==="
|
||
docker logs --tail 100 certctl-test-stepca 2>&1 || true
|
||
echo ""
|
||
echo "=== certctl-test-pebble logs (last 50 lines) ==="
|
||
docker logs --tail 50 certctl-test-pebble 2>&1 || true
|
||
echo ""
|
||
echo "=== certctl-test-agent logs (last 100 lines) ==="
|
||
docker logs --tail 100 certctl-test-agent 2>&1 || true
|
||
|
||
- name: Tear down sidecars
|
||
if: always()
|
||
run: docker compose --profile deploy-e2e -f deploy/docker-compose.test.yml down -v
|
||
|
||
# =============================================================================
|
||
# image-and-supply-chain — digest validity + Docker build smoke + OpenAPI parity
|
||
# =============================================================================
|
||
# Per ci-pipeline-cleanup bundle Phases 7-9 / frozen decision 0.8.
|
||
# Three checks bundled into one job (parallel to go-build-and-test):
|
||
# 1. Digest validity — every @sha256 ref in deploy/* + Dockerfiles must
|
||
# resolve on its registry. Closes the H-001 lying-field gap (H-001
|
||
# verifies digest *presence* but not *resolution* — Bundle II shipped
|
||
# 11 fabricated digests that passed H-001 and failed `docker pull`).
|
||
# 2. Docker build smoke — all 4 Dockerfiles in the repo must build.
|
||
# Catches syntax errors / COPY path drift before tag-time release.yml.
|
||
# 3. OpenAPI ↔ handler parity — every router route has a matching
|
||
# operationId or is documented in api/openapi-handler-exceptions.yaml.
|
||
image-and-supply-chain:
|
||
name: image-and-supply-chain
|
||
runs-on: ubuntu-latest
|
||
timeout-minutes: 15
|
||
steps:
|
||
- uses: actions/checkout@93cb6efe18208431cddfb8368fd83d5badbf9bfd # v5
|
||
|
||
- name: Set up Go
|
||
uses: actions/setup-go@40f1582b2485089dde7abd97c1529aa768e1baff # v5
|
||
with:
|
||
go-version: '1.25.10'
|
||
cache: true
|
||
|
||
- name: Digest validity (every @sha256 ref must resolve)
|
||
run: bash scripts/ci-guards/digest-validity.sh
|
||
|
||
- name: Docker build smoke (all 4 Dockerfiles)
|
||
# Per frozen decision 0.10: build all 4 Dockerfiles in the repo,
|
||
# not just production server + agent. The test-sidecar Dockerfiles
|
||
# are load-bearing for vendor-e2e — a syntax error there silently
|
||
# breaks the e2e suite.
|
||
run: |
|
||
set -e
|
||
docker build -f Dockerfile -t certctl:smoke .
|
||
docker build -f Dockerfile.agent -t certctl-agent:smoke .
|
||
docker build -f deploy/test/f5-mock-icontrol/Dockerfile -t f5-mock:smoke .
|
||
docker build -f deploy/test/libest/Dockerfile -t libest:smoke .
|
||
echo "All 4 Dockerfiles build clean."
|
||
|
||
- name: OpenAPI ↔ handler operationId parity
|
||
run: bash scripts/ci-guards/openapi-handler-parity.sh
|