certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 15:41:41 +00:00

Author	SHA1	Message	Date
shankar0123	eda3b48419	ci: supply-chain hardening (Phase 1 closure: RED-1, RED-2, TEST-L2)

Three findings from the certctl architecture diligence audit's Phase 1 bundle (Supply-Chain Hardening), closed together in one PR since they all touch .github/workflows/ and the repo root.

RED-1: delete tracked precompiled binary
- deploy/test/f5-mock-icontrol/f5-mock-icontrol (8.6 MB ARM64 ELF) was tracked alongside the Go source that builds it. The fixture's Dockerfile already uses a multi-stage build that re-runs 'go build' inside the container (line 13), so the tracked binary was vestigial and never actually consumed by the test wiring.
- git rm'd. Path added to .gitignore so it doesn't re-land.
- No Makefile target needed; the Dockerfile is the rebuild path.

RED-2: SHA-pin every GitHub Action
- Pre: 37 of 41 'uses:' lines were tag-pinned (@v4 etc.); only 4 were SHA-pinned (sigstore/cosign-installer + anchore/sbom-action).
- Post: 0 of 41 tag-pinned. Every 'uses:' line is now '@<40-char-sha> # vN' (the trailing comment preserves the human-readable version for operator audit). SHA-pinning closes the standard supply-chain attack vector against GitHub Actions consumers.
- SHAs resolved live via the GitHub API; spot-checked one.

TEST-L2: npm audit hard gate
- Added 'npm audit --omit=dev --audit-level=high' step to the Frontend Build job in ci.yml. --omit=dev excludes vitest/vite/eslint/etc., which don't ship to operators.
- Local run today: 0 vulnerabilities; gate enters with no triage backlog. Catches future regressions.

New CI guards (regression prevention):
- scripts/ci-guards/no-tag-pinned-actions.sh: fails the build if a future PR adds 'uses: foo/bar@v2' instead of SHA-pinning.
- scripts/ci-guards/no-precompiled-binary.sh: runs file(1) over git ls-files output; fails on any tracked ELF/Mach-O/PE.
- Both pass locally. CI's existing loop over scripts/ci-guards/*.sh picks them up automatically.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-RED-1, cowork/certctl-architecture-diligence-audit.html#fix-RED-2, cowork/certctl-architecture-diligence-audit.html#fix-TEST-L2	2026-05-13 19:30:53 +00:00
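The tag-pinning check described in commit eda3b48419 can be illustrated with a short sketch. The repository's actual guard is a shell script (scripts/ci-guards/no-tag-pinned-actions.sh) whose contents are not shown in this log, so the JavaScript below is only an approximation of the idea: flag any `uses:` reference whose ref is not a full 40-character commit SHA. The function name `findUnpinnedActions`, the sample workflow text, and the choice to skip local (`./`) and `docker://` references are all assumptions made for illustration.

```javascript
// Illustrative only: not the repository's guard script.
const FULL_SHA = /^[0-9a-f]{40}$/;

// Returns the `uses:` references in a workflow file that are not pinned
// to a full 40-character commit SHA.
function findUnpinnedActions(workflowText) {
  const offenders = [];
  workflowText.split('\n').forEach((line, index) => {
    // Match "uses: owner/repo@ref", allowing a leading "- " and a trailing comment.
    const match = line.match(/^\s*(?:-\s*)?uses:\s*['"]?([^\s'"#]+)/);
    if (!match) return;
    const target = match[1];
    // Assumption: local actions and docker:// images are out of scope.
    if (target.startsWith('./') || target.startsWith('docker://')) return;
    const at = target.lastIndexOf('@');
    const ref = at === -1 ? '' : target.slice(at + 1);
    if (!FULL_SHA.test(ref)) offenders.push({ line: index + 1, target });
  });
  return offenders;
}

// Hypothetical workflow fragment: one tag-pinned, one SHA-pinned, one local.
const sample = [
  'steps:',
  '  - uses: actions/checkout@v4',
  '  - uses: example/action@0123456789abcdef0123456789abcdef01234567 # v1',
  '  - uses: ./local-action',
].join('\n');

console.log(findUnpinnedActions(sample));
```

On the sample above, only the `@v4` line is reported. A guard built this way would exit non-zero whenever the returned list is non-empty.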
shankar0123	b216de9d57		2026-05-05 18:18:29 +00:00
shankar0123	6286cd4004	loadtest: per-connector deploy throughput scenarios + target sidecars + README baseline section

Closes Bundle 10 of the 2026-05-02 deployment-target coverage audit (see cowork/deployment-target-audit-2026-05-02/RESULTS.md).

Pre-fix, deploy/test/loadtest/k6.js drove only the API-tier throughput path (POST /api/v1/certificates + GET /api/v1/certificates), i.e. the operator-facing rate at which an automation client can submit cert requests. The deploy hot path (cert deployed to a target: connector-tier latency) had no benchmarks. Procurement asks "can certctl handle our 5,000-NGINX fleet at 47-day rotation?" and the answer should be a number with methodology, not a claim.

This commit ships v1 of the connector-tier loadtest harness:

1. Target-side sidecars added to docker-compose.yml: nginx-target, apache-target, haproxy-target, f5-mock-target. Each daemon serves a starter cert (ECDSA P-256, multi-SAN) written into a shared ./fixtures/target-certs/ volume by a new target-tls-init container. f5-mock-target re-uses the in-tree deploy/test/f5-mock-icontrol/ image (already used by the deploy-vendor-e2e CI job) and generates its own self-signed cert via tls.go::selfSignedCert at startup.
2. Fixture configs committed under deploy/test/loadtest/fixtures/:
   - nginx.conf: minimal HTTPS server, single 200 OK location.
   - httpd.conf: self-contained Apache config with the minimum module set + SSL vhost.
   - haproxy.cfg: minimal SSL-terminating frontend backed by a static "ok" backend.
3. k6 scenarios added (4 new): nginx_handshake, apache_handshake, haproxy_handshake, f5_handshake. Each runs constant-arrival-rate at 100 conns/min for 5 minutes. Latency captured by k6's http_req_duration metric covers TCP connect + TLS handshake + tiny HTTP request/response; that's the end-to-end "connection readiness" latency a deploy connector cares about.
4. summary.json gains a connector_tier object with per-target p50/p95/p99/max/avg/error_rate/iterations breakdowns. Operators tracking a connector regression diff connector_tier.<type> between runs. Implementation: a new enrichWithConnectorTier helper that reads data.metrics keyed by target_type tag and shallow-merges the breakdown into the summary before serialisation.
5. Threshold contract per target type:
   - nginx/apache/haproxy: p99 < 3s, p95 < 1s.
   - f5-mock: p99 < 5s, p95 < 1.5s (iControl REST handler does slightly more work per request than pure TLS termination).
   - All scenarios: error rate < 1% (k6 default; any 4xx/5xx counts as failed).
   Any change pushing past these fails the workflow.
6. README documents the methodology + the baseline-number table for the connector tier. Numeric values are em-dash placeholders pending the first clean canonical-hardware run; the accompanying commit message in that follow-up captures the methodology line alongside the numbers. Out-of-scope is documented explicitly:
   - Full agent-driven deploy poll loop (POST cert with target binding → poll deployments endpoint → verify served cert). v2 of the harness; needs the agent registration + target-binding API surface plumbed end-to-end in the loadtest stack.
   - Kubernetes target via kind-in-docker. kind requires `privileged: true` and is operationally fragile in CI; deferred until Bundle 2 (real k8s.io/client-go) lands and a CI-friendly envtest harness is wired.
   - Real F5 BIG-IP. CI uses the in-tree f5-mock; real-appliance benchmarking is out of scope.
7. CI workflow .github/workflows/loadtest.yml timeout-minutes bumped from 15 to 25. The harness now boots four additional target sidecars before the k6 run; their healthchecks add ~30-60s. The k6 scenarios themselves are still 5 minutes (run in parallel, not serially). 25 minutes absorbs that plus slow CI runners and cold image caches without letting a stuck container consume the runner indefinitely. Trigger remains workflow_dispatch + cron; sustained 25-minute runs are too slow for per-PR signal.

What this connector tier explicitly does NOT measure (documented in the k6.js header + README):
- The agent-driven full deploy hot path (v2 follow-up).
- K8s target (Bundle 2 dependency).
- Real F5 appliance.
- Issuer-side throughput (handled by issuer-coverage-audit fix #8).

Verified locally:
- python3 -c "import yaml; yaml.safe_load(...)" on docker-compose.yml and .github/workflows/loadtest.yml: clean.
- node -c on k6.js: clean syntax.
- gofmt / go vet on the rest of the tree (no Go diff in this commit).
- Manual smoke against docker-compose pending: operator validates on the canonical-hardware first run; if any fixture config is off, fix-up commit lands separately so the methodology change and the numeric baseline have independent reviewability.

No Go code changes; this is a loadtest-harness-only commit.

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md Bundle 10.	2026-05-02 19:28:45 +00:00
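The connector_tier enrichment described in commit 6286cd4004 might look roughly like the sketch below. The real enrichWithConnectorTier body is not shown in this log, so everything beyond the helper's name and the listed output fields is an assumption: the `TARGET_TYPES` tag values, the `"<metric>{target_type:<type>}"` sub-metric key format, and the `values` field names (`med`, `p(95)`, `p(99)`, `rate`, `count`) are modelled on how k6 summary data is commonly shaped, and the sample input is made up.

```javascript
// Hypothetical sketch; not the helper from the commit.
// Assumption: these are the target_type tag values the scenarios emit.
const TARGET_TYPES = ['nginx', 'apache', 'haproxy', 'f5-mock'];

function enrichWithConnectorTier(summary, data) {
  const connectorTier = {};
  for (const type of TARGET_TYPES) {
    // Assumption: tagged sub-metrics are keyed "<metric>{target_type:<type>}".
    const selector = `{target_type:${type}}`;
    const duration = data.metrics[`http_req_duration${selector}`];
    if (!duration) continue; // no samples recorded for this target type
    const failed = data.metrics[`http_req_failed${selector}`];
    const iterations = data.metrics[`iterations${selector}`];
    const v = duration.values;
    connectorTier[type] = {
      p50: v.med,
      p95: v['p(95)'],
      p99: v['p(99)'],
      max: v.max,
      avg: v.avg,
      error_rate: failed ? failed.values.rate : null,
      iterations: iterations ? iterations.values.count : null,
    };
  }
  // Shallow merge: top-level keys already in summary are left untouched.
  return Object.assign({}, summary, { connector_tier: connectorTier });
}

// Made-up input in the assumed shape, for illustration only.
const sampleData = {
  metrics: {
    'http_req_duration{target_type:nginx}': {
      values: { med: 12, 'p(95)': 40, 'p(99)': 95, max: 210, avg: 15 },
    },
    'http_req_failed{target_type:nginx}': { values: { rate: 0 } },
    'iterations{target_type:nginx}': { values: { count: 500 } },
  },
};

const enriched = enrichWithConnectorTier({ run: 'example' }, sampleData);
console.log(JSON.stringify(enriched, null, 2));
```

Keeping the merge shallow means a diff of connector_tier.<type> between two runs only ever reflects the per-target breakdown, not unrelated summary fields.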
shankar0123	c2e53e1ab5	loadtest: add k6 harness for certctl API throughput

Closes the #8 acquisition-readiness blocker from the 2026-05-01 issuer coverage audit. Pre-fix, certctl had zero benchmarks or load tests for any API path. An acquirer evaluating "can certctl handle our 50k-cert fleet at 47-day rotation" had nothing to point at; CA/B Forum SC-081v3 lands 47-day TLS in 2029, and operators need real numbers, not hand-waved capacity claims.

What landed:
- deploy/test/loadtest/docker-compose.yml: minimal stack (postgres + tls-init bootstrap + certctl-server with CERTCTL_DEMO_SEED=true so the FK rows the script needs exist + grafana/k6:0.54.0 driver). Pinned k6 version so threshold expressions stay stable across runs. k6 command runs the script once and exits with the threshold-driven exit code so `--exit-code-from k6` propagates non-zero on any regression.
- deploy/test/loadtest/k6.js: two scenarios at 50 req/s × 5 min, staggered 5s. Scenario 1: POST /api/v1/certificates (issuance-acceptance hot path: auth + JSON decode + validation + service CreateCertificate + DB insert). Scenario 2: GET /api/v1/certificates (most-trafficked read endpoint, exercises pagination). Hard thresholds: p99 < 5s + p95 < 2s for issuance-acceptance, p99 < 2s + p95 < 800ms for list, error rate < 1% globally. constant-arrival-rate executor (NOT constant-vus) so VU-bound load doesn't backpressure the offered rate and mask capacity ceilings. __ENV.CERTCTL_BASE lets the same script run on the operator's workstation (https://localhost:8443) and inside the compose stack (https://certctl-server:8443).
- deploy/test/loadtest/README.md: documents what's measured (API tier: auth → DB) vs what's NOT (issuer connector latency: pinned separately by certctl_issuance_duration_seconds from audit fix #4; full ACME enrollment flow: deferred, since sustained 100/s through multi-RTT pebble takes pebble tuning + crypto helpers k6 doesn't ship with). Threshold contract pinned. Baseline numbers row reads TBD until the operator captures on a representative workstation; methodology pinned so future tuning commits land alongside refreshed baselines that are diffable.
- deploy/test/loadtest/.gitignore: results/{summary.json,summary.txt} + certs/ (per-run TLS bootstrap output). Both regenerate on every run; committing them would create huge per-run diffs.
- deploy/test/loadtest/results/.gitkeep: placeholder so the directory exists in fresh checkouts (the k6 container mounts it).
- Makefile: new `loadtest` target spinning up the compose stack with --abort-on-container-exit --exit-code-from k6 and printing the summary. Added to .PHONY + help. Explicitly NOT in `make verify`: load tests are minutes long and don't gate per-PR signal.
- .github/workflows/loadtest.yml: workflow_dispatch (manual) + weekly cron at Mon 06:00 UTC. NOT per-push. 15-minute hard cap. Always uploads results/ as an artifact (90d retention) so a regression has a diffable artifact even when k6 exited non-zero. Read-only repo permissions.
- docs/architecture.md: new "Performance Characteristics" section citing the harness location, scenarios, thresholds, scope (what's measured vs not), and where the captured baseline lives. Inserted before the existing "What's Next" section.

Scope decisions documented in the README + this commit message:
- The audit prompt's k6 example targeted POST /api/v1/certificates + ACME-via-pebble. CreateCertificate exercises auth + DB but the downstream issuer-connector call is async (renewal scheduler); that's the right surface for "request-acceptance" throughput. Driving the connectors directly would load-test someone else's API.
- Pebble was excluded from the harness stack. Sustained 100/s through ACME's order/challenge/finalize flow needs pebble tuning + k6 crypto helpers that don't exist out of the box. README flags this as a deferred follow-up.

Acquirer impact: the diligence question "what's your throughput?" now has a number with a reproducible methodology and a regression guard, not a claim. The first operator run captures the baseline into README.md so subsequent tuning commits are diffable.

Verified locally:
- gofmt -l . clean
- go vet ./... clean
- staticcheck ./... clean
- go build ./... clean
- bash scripts/ci-guards/H-1-encryption-key-min-length.sh: clean (the 38-byte loadtest key is above the 32-byte floor)
- bash scripts/ci-guards/openapi-handler-parity.sh: clean
- bash scripts/ci-guards/test-compose-scep-coherence.sh: clean
- make -n loadtest produces the expected command sequence
- The first `make loadtest` run from the operator's workstation populates the README baseline numbers (committed in a follow-up).

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md Top-10 fix #8.	2026-05-02 14:00:10 +00:00
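The scenario and threshold contract from commit c2e53e1ab5 can be sketched as a k6-style options object. This is not the committed k6.js: the scenario names, `exec` function names, `endpoint` tag, VU counts, and which scenario carries the 5-second stagger are all assumptions. Only the executor, the 50 req/s rate, the 5-minute duration, and the threshold values come from the commit message. The object is built as plain JavaScript so it can be inspected without k6; in a real k6 script it would be exported as `options`.

```javascript
// Sketch only; names, tags, and VU counts are assumptions.
function arrivalScenario(exec, endpoint, startTime) {
  return {
    // Open-model executor: iterations start at a fixed rate regardless of
    // how long earlier iterations take, so a slow server cannot throttle
    // the offered load the way a fixed VU count would.
    executor: 'constant-arrival-rate',
    rate: 50,
    timeUnit: '1s',
    duration: '5m',
    startTime,
    preAllocatedVUs: 50, // assumed
    maxVUs: 200,         // assumed
    exec,
    tags: { endpoint },
  };
}

const options = {
  scenarios: {
    create_certificate: arrivalScenario('createCertificate', 'create', '0s'),
    // Assumption: the list scenario is the one that starts 5s later.
    list_certificates: arrivalScenario('listCertificates', 'list', '5s'),
  },
  thresholds: {
    // Durations are in milliseconds.
    'http_req_duration{endpoint:create}': ['p(99)<5000', 'p(95)<2000'],
    'http_req_duration{endpoint:list}': ['p(99)<2000', 'p(95)<800'],
    http_req_failed: ['rate<0.01'],
  },
};

// Offered load per scenario: 50 req/s over 300 s.
const offeredPerScenario = options.scenarios.create_certificate.rate * 5 * 60;
console.log(offeredPerScenario); // 15000
```

Because a breached threshold makes k6 exit non-zero, the compose stack's `--exit-code-from k6` is what turns these expressions into a regression gate.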

4 Commits