loadtest: per-connector deploy throughput scenarios + target sidecars + README baseline section

Closes Bundle 10 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md).

Pre-fix, deploy/test/loadtest/k6.js drove only the API-tier throughput
path (POST /api/v1/certificates + GET /api/v1/certificates): the
operator-facing rate at which an automation client can submit cert
requests. The deploy hot path (cert deployed to a target, i.e.
connector-tier latency) had no benchmarks. Procurement asks "can
certctl handle our 5,000-NGINX fleet at 47-day rotation?" and the
answer should be a number with methodology, not a claim.

This commit ships v1 of the connector-tier loadtest harness:

1. Target-side sidecars added to docker-compose.yml: nginx-target,
   apache-target, haproxy-target, f5-mock-target. Each daemon serves a
   starter cert (ECDSA P-256, multi-SAN) written into a shared
   ./fixtures/target-certs/ volume by a new target-tls-init container.
   f5-mock-target re-uses the in-tree deploy/test/f5-mock-icontrol/
   image (already used by the deploy-vendor-e2e CI job) and generates
   its own self-signed cert via tls.go::selfSignedCert at startup.

2. Fixture configs committed under deploy/test/loadtest/fixtures/:
   - nginx.conf: minimal HTTPS server, single 200 OK location.
   - httpd.conf: self-contained Apache config with the minimum module
     set + SSL vhost.
   - haproxy.cfg: minimal SSL-terminating frontend backed by a static
     "ok" backend.

3. k6 scenarios added (4 new): nginx_handshake, apache_handshake,
   haproxy_handshake, f5_handshake. Each runs constant-arrival-rate at
   100 conns/min for 5 minutes. Latency captured by k6's
   http_req_duration metric covers TCP connect + TLS handshake + tiny
   HTTP request/response; that's the end-to-end "connection readiness"
   latency a deploy connector cares about.

4. summary.json gains a connector_tier object with per-target
   p50/p95/p99/max/avg/error_rate/iterations breakdowns. Operators
   tracking a connector regression diff connector_tier.<type> between
   runs. Implementation: a new enrichWithConnectorTier helper that
   reads data.metrics keyed by target_type tag and shallow-merges the
   breakdown into the summary before serialisation.

5. Threshold contract per target type:
   - nginx/apache/haproxy: p99 < 3s, p95 < 1s.
   - f5-mock: p99 < 5s, p95 < 1.5s (iControl REST handler does
     slightly more work per request than pure TLS termination).
   - All scenarios: error rate < 1% (k6 default; any 4xx/5xx counts as
     failed).
   Any change pushing past these fails the workflow.

6. README documents the methodology + the baseline-number table for
   the connector tier. Numeric values are em-dash placeholders pending
   the first clean canonical-hardware run; the accompanying commit
   message in that follow-up captures the methodology line alongside
   the numbers. Out-of-scope is documented explicitly:
   - Full agent-driven deploy poll loop (POST cert with target binding
     → poll deployments endpoint → verify served cert). v2 of the
     harness; needs the agent registration + target-binding API
     surface plumbed end-to-end in the loadtest stack.
   - Kubernetes target via kind-in-docker. kind requires
     `privileged: true` and is operationally fragile in CI; deferred
     until Bundle 2 (real k8s.io/client-go) lands and a CI-friendly
     envtest harness is wired.
   - Real F5 BIG-IP. CI uses the in-tree f5-mock; real-appliance
     benchmarking is out of scope.

7. CI workflow .github/workflows/loadtest.yml timeout-minutes bumped
   from 15 to 25. The harness now boots four additional target
   sidecars before the k6 run; their healthchecks add ~30-60s. The k6
   scenarios themselves are still 5 minutes (run in parallel, not
   serially). 25 minutes absorbs that plus slow CI runners and cold
   image caches without letting a stuck container consume the runner
   indefinitely. Trigger remains workflow_dispatch + cron; sustained
   25-minute runs are too slow for per-PR signal.
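
The per-run comparison mentioned in item 4 could look roughly like the sketch below. It is illustrative only: diffConnectorTier is a hypothetical helper that is not part of this commit, and the numbers are synthetic placeholders rather than measured results. It assumes both inputs are parsed summary.json payloads carrying the connector_tier object described above.

```javascript
// Hypothetical helper (not in k6.js): subtract baseline latencies from the
// current run for each target type present in the current summary.
function diffConnectorTier(baseline, current, fields = ['p50', 'p95', 'p99']) {
    const base = baseline.connector_tier || {};
    const cur = current.connector_tier || {};
    const report = {};
    for (const target of Object.keys(cur)) {
        const before = base[target];
        const after = cur[target];
        // A null entry means the target had no data in one of the runs.
        if (!before || !after) {
            report[target] = null;
            continue;
        }
        report[target] = {};
        for (const f of fields) {
            const a = after[f];
            const b = before[f];
            report[target][f] =
                typeof a === 'number' && typeof b === 'number' ? a - b : null;
        }
    }
    return report;
}

// Synthetic values in milliseconds, for illustration only.
const baselineRun = { connector_tier: { nginx: { p50: 4, p95: 9, p99: 15 }, f5: null } };
const currentRun = { connector_tier: { nginx: { p50: 5, p95: 12, p99: 40 }, f5: null } };

const delta = diffConnectorTier(baselineRun, currentRun);
console.log(JSON.stringify(delta));
// prints {"nginx":{"p50":1,"p95":3,"p99":25},"f5":null}
```

In practice the two inputs would come from JSON.parse on two saved copies of /results/summary.json; a positive delta means the current run is slower.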

What this connector tier explicitly does NOT measure (documented in the
k6.js header + README):

- The agent-driven full deploy hot path (v2 follow-up).
- K8s target (Bundle 2 dependency).
- Real F5 appliance.
- Issuer-side throughput (handled by issuer-coverage-audit fix #8).

Verified locally:

- python3 -c "import yaml; yaml.safe_load(...)" on docker-compose.yml
  and .github/workflows/loadtest.yml: clean.
- node -c on k6.js: clean syntax.
- gofmt / go vet on the rest of the tree (no Go diff in this commit).
- Manual smoke against docker-compose pending: operator validates on
  the canonical-hardware first run; if any fixture config is off, a
  fix-up commit lands separately so the methodology change and the
  numeric baseline have independent reviewability.

No Go code changes; this is a loadtest-harness-only commit.

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 10.
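
For scale, the procurement question quoted in this message (5,000 endpoints at 47-day rotation) can be turned into a steady-state rate and set beside the 100 conns/min the handshake scenarios sustain. This is a back-of-envelope sketch, not part of the harness: impliedDeployRate is a hypothetical name, and it assumes renewals are spread evenly across the rotation window, which a real fleet with bursty issuance would not satisfy. It also compares deploy events with plain connections, so it only bounds the daemon-side connection rate.

```javascript
// Hypothetical helper: steady-state deploy rate implied by a fleet whose
// renewals are spread evenly over the rotation window (an assumption).
function impliedDeployRate(endpoints, rotationDays) {
    const perDay = endpoints / rotationDays;
    return {
        perDay,
        perMinute: perDay / (24 * 60),
    };
}

const fleet = impliedDeployRate(5000, 47);
const scenarioRatePerMinute = 100; // rate used by each *_handshake scenario

console.log(`deploys/day:    ${fleet.perDay.toFixed(1)}`);    // prints 106.4
console.log(`deploys/minute: ${fleet.perMinute.toFixed(3)}`); // prints 0.074
console.log(`headroom:       ${(scenarioRatePerMinute / fleet.perMinute).toFixed(0)}x`); // prints 1354x
```

Under the even-spread assumption the scenario rate sits roughly three orders of magnitude above the steady-state deploy rate; a burst where a large share of the fleet renews in the same hour would narrow that margin considerably.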
2026-06-09 11:28:52 +00:00 · 2026-05-02 19:28:45 +00:00
parent 08a86d355d
commit e292faafc6
8 changed files with 677 additions and 47 deletions
@@ -1,37 +1,67 @@
 // certctl load-test driver — k6 v0.54+ JS API.
 //
-// Closes the #8 acquisition-readiness blocker from the 2026-05-01 issuer
-// coverage audit. Pre-fix, certctl had no benchmarks or load tests for any
-// API path. An acquirer evaluating "can certctl handle our 50k-cert fleet
-// at 47-day rotation" had nothing to point at; this script gives them
-// a reproducible number with a methodology.
+// Two tiers of scenarios:
 //
-// What this measures (be honest about scope):
+//   API tier (issuer-coverage audit fix #8, 2026-05-01):
+//     - issuance_acceptance: POST /api/v1/certificates throughput.
+//     - list_certificates:   GET  /api/v1/certificates throughput.
+//
+//   Connector tier (Bundle 10 of the deployment-target audit, 2026-05-02):
+//     - nginx_handshake / apache_handshake / haproxy_handshake / f5_handshake:
+//       per-target-type TCP+TLS handshake throughput against the four
+//       target sidecars at sustained 100 conns/min for 5 minutes. Latency
+//       is tagged by target_type so summary.json's connector_tier section
+//       breaks out p50/p95/p99 per target.
+//
+// What the API tier measures (be honest about scope):
 //   - POST /api/v1/certificates: auth + JSON decode + validation + service
 //     CreateCertificate + DB insert + response. This is the operator-facing
 //     request-acceptance throughput. The downstream issuer-connector call
 //     happens asynchronously via the renewal scheduler (and is bounded
-//     separately via CERTCTL_RENEWAL_CONCURRENCY — audit fix #9).
+//     separately via CERTCTL_RENEWAL_CONCURRENCY — issuer audit fix #9).
 //   - GET /api/v1/certificates: read path with pagination. Exercises the
 //     cert list query, which is the most-called read endpoint in any UI/
 //     automation client.
 //
-// What this does NOT measure:
+// What the connector tier measures:
+//   - Per-target-type TCP+TLS handshake completion latency. Validates that
+//     each target sidecar (nginx, apache, haproxy, f5-mock) is operational
+//     and serving its starter cert under sustained connection load.
+//     Procurement asks "can certctl's nginx target handle 5,000 endpoints
+//     at 47-day rotation"; the answer requires (a) the connector code
+//     handles deploys correctly (covered by per-connector unit tests) AND
+//     (b) the underlying daemon serves TLS at the connection rates a
+//     5,000-endpoint fleet implies. The connector-tier scenarios pin (b).
+//
+// What this does NOT measure (documented limits, not lazy gaps):
 //   - Issuer connector latency (DigiCert / ACME / Vault / etc. round-trips
 //     to upstream CAs). Those are async; pin via the per-issuer-type
-//     metrics instead (audit fix #4: certctl_issuance_duration_seconds).
-//   - The full ACME enrollment flow (newOrder → challenge → finalize).
-//     The audit prompt mentioned ACME-via-pebble; deferred to a follow-up
-//     because driving multi-RTT ACME flows at sustained 100/s requires
-//     pebble tuning + k6 crypto helpers that don't exist out of the box.
+//     metrics instead (issuer audit fix #4:
+//     certctl_issuance_duration_seconds).
+//   - Full ACME enrollment (newOrder → challenge → finalize).
+//   - The full agent-driven deploy hot path (POST cert with target
+//     binding → poll deployments endpoint → verify served cert matches).
+//     v1 of the connector-tier harness measures handshake throughput
+//     against the sidecars directly. v2 is a follow-up that needs the
+//     agent registration + target-binding API surface plumbed end-to-end
+//     in the loadtest stack — a meaningful addition but not a blocker
+//     for the Bundle 10 procurement question.
+//   - Kubernetes connector. kind-in-docker requires `privileged: true`
+//     and is operationally fragile in CI. Deferred until Bundle 2 (real
+//     k8s.io/client-go) lands.
 //
-// Threshold contract: any future change that pushes p99 above 5s for the
-// issuance-acceptance scenario or 2s for the read scenario, OR any change
-// that pushes the error rate above 1%, fails the test. CI gates the run
-// behind workflow_dispatch + cron (NOT per-push — load tests are too slow
-// to gate per-PR signal).
+// Threshold contract:
+//   - API tier: p99 < 5s for issuance, < 2s for list, error rate < 1%.
+//   - Connector tier: p99 < 3s per handshake target (5s for f5-mock,
+//     whose iControl REST handler is slower), error rate < 1%.
+//   Any change pushing past these fails the workflow.
 //
-// Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md fix #8.
+// CI gates the run behind workflow_dispatch + cron (NOT per-push — load
+// tests are too slow to gate per-PR signal).
+//
+// Audit references:
+//   - API tier:       cowork/issuer-coverage-audit-2026-05-01/RESULTS.md fix #8.
+//   - Connector tier: cowork/deployment-target-audit-2026-05-02/RESULTS.md Bundle 10.

 import http from 'k6/http';
 import { check } from 'k6';
@@ -43,6 +73,18 @@ import { textSummary } from 'https://jslib.k6.io/k6-summary/0.0.2/index.js';
 const BASE = __ENV.CERTCTL_BASE || 'https://localhost:8443';
 const TOKEN = __ENV.CERTCTL_TOKEN || 'load-test-token';

+// Bundle 10: per-target sidecar URLs. Defaults match the docker-compose
+// stack's internal DNS; operators running k6 manually against a different
+// stack override these via env. An unset or empty variable falls back to
+// the default below (`||`); nothing here skips a scenario.
+const NGINX_TARGET_URL   = __ENV.NGINX_TARGET_URL   || 'https://nginx-target:443';
+const APACHE_TARGET_URL  = __ENV.APACHE_TARGET_URL  || 'https://apache-target:443';
+const HAPROXY_TARGET_URL = __ENV.HAPROXY_TARGET_URL || 'https://haproxy-target:443';
+// f5-mock's iControl REST `/healthz` endpoint is the CI-friendly
+// per-handshake probe — hits the path the F5 connector itself uses for
+// reachability. Real F5 BIG-IP also exposes /healthz under /mgmt/.
+const F5_TARGET_URL      = __ENV.F5_TARGET_URL      || 'https://f5-mock-target:443';
+
 // Demo seed (CERTCTL_DEMO_SEED=true) creates these rows; CreateCertificate
 // requires all four FKs to exist. Pre-baked here so the script has zero
 // dependency on test fixtures beyond the seed.
@@ -82,18 +124,75 @@ export const options = {
            startTime: '5s',
            tags: { scenario: 'list_certificates' },
        },
+
+        // Bundle 10: connector-tier per-target-type handshake scenarios.
+        // 100 conns/min sustained for 5 minutes against each sidecar.
+        // The handshake measurement captures TCP connect + TLS
+        // handshake + tiny HTTP GET (`/` for nginx/apache/haproxy,
+        // `/healthz` for f5-mock); k6's http_req_duration aggregates
+        // all three so the numbers are end-to-end "respond to the
+        // operator's connection" latency, not isolated TLS-handshake
+        // microseconds.
+        nginx_handshake: {
+            executor: 'constant-arrival-rate',
+            rate: 100,
+            timeUnit: '1m',
+            duration: '5m',
+            preAllocatedVUs: 10,
+            maxVUs: 50,
+            exec: 'nginxHandshake',
+            startTime: '10s',
+            tags: { scenario: 'nginx_handshake', target_type: 'nginx' },
+        },
+        apache_handshake: {
+            executor: 'constant-arrival-rate',
+            rate: 100,
+            timeUnit: '1m',
+            duration: '5m',
+            preAllocatedVUs: 10,
+            maxVUs: 50,
+            exec: 'apacheHandshake',
+            startTime: '10s',
+            tags: { scenario: 'apache_handshake', target_type: 'apache' },
+        },
+        haproxy_handshake: {
+            executor: 'constant-arrival-rate',
+            rate: 100,
+            timeUnit: '1m',
+            duration: '5m',
+            preAllocatedVUs: 10,
+            maxVUs: 50,
+            exec: 'haproxyHandshake',
+            startTime: '10s',
+            tags: { scenario: 'haproxy_handshake', target_type: 'haproxy' },
+        },
+        f5_handshake: {
+            executor: 'constant-arrival-rate',
+            rate: 100,
+            timeUnit: '1m',
+            duration: '5m',
+            preAllocatedVUs: 10,
+            maxVUs: 50,
+            exec: 'f5Handshake',
+            startTime: '10s',
+            tags: { scenario: 'f5_handshake', target_type: 'f5' },
+        },
    },
    thresholds: {
-        // Hard floor: 99% of issuance-acceptance requests complete in
-        // under 5 seconds. Pre-fix this was unsubstantiated; post-fix
-        // this is the regression guard. The number isn't aspirational —
-        // it's the worst-acceptable user-facing API SLO from the
-        // operator perspective.
+        // API tier — issuer audit fix #8.
        'http_req_duration{scenario:issuance_acceptance}': ['p(99)<5000', 'p(95)<2000'],
        'http_req_duration{scenario:list_certificates}': ['p(99)<2000', 'p(95)<800'],
-        // < 1% error rate. The k6 default is "any 4xx/5xx counts as
-        // failed"; legitimate 201/200 responses don't count. Auth
-        // failures, validation failures, server errors all do.
+
+        // Bundle 10 connector tier. nginx/apache/haproxy are pure TLS
+        // termination → tight thresholds. f5-mock includes a tiny Go
+        // server response on top of the handshake → slightly looser.
+        'http_req_duration{target_type:nginx}':   ['p(99)<3000', 'p(95)<1000'],
+        'http_req_duration{target_type:apache}':  ['p(99)<3000', 'p(95)<1000'],
+        'http_req_duration{target_type:haproxy}': ['p(99)<3000', 'p(95)<1000'],
+        'http_req_duration{target_type:f5}':      ['p(99)<5000', 'p(95)<1500'],
+
+        // < 1% error rate across ALL scenarios. Auth failures, validation
+        // failures, server errors, connection refused all count.
        'http_req_failed': ['rate<0.01'],
    },
    // Smaller summary payload — strip per-VU metrics we don't read.
@@ -148,16 +247,109 @@ export function listCertificates() {
    });
 }

+// --- Bundle 10: connector-tier handshake scenarios ---
+//
+// Each per-target function does a single HTTPS GET against its target
+// sidecar. k6's http_req_duration metric captures TCP connect + TLS
+// handshake + HTTP request/response — that's the end-to-end "connection
+// readiness" latency a deploy connector cares about. The target_type
+// tag groups results in summary.json's connector_tier section.
+//
+// Status-check threshold: any 4xx/5xx counts as failed (k6 default
+// behaviour for http_req_failed). f5-mock's /healthz returns 200; the
+// other three targets (nginx, apache, haproxy) each return 200 on `/`
+// from their default vhost configs.
+//
+// Bundle 10 of the 2026-05-02 deployment-target audit.
+
+export function nginxHandshake() {
+    const res = http.get(`${NGINX_TARGET_URL}/`, {
+        tags: { scenario: 'nginx_handshake', target_type: 'nginx' },
+    });
+    check(res, {
+        'nginx 2xx': (r) => r.status >= 200 && r.status < 300,
+    });
+}
+
+export function apacheHandshake() {
+    const res = http.get(`${APACHE_TARGET_URL}/`, {
+        tags: { scenario: 'apache_handshake', target_type: 'apache' },
+    });
+    check(res, {
+        'apache 2xx': (r) => r.status >= 200 && r.status < 300,
+    });
+}
+
+export function haproxyHandshake() {
+    const res = http.get(`${HAPROXY_TARGET_URL}/`, {
+        tags: { scenario: 'haproxy_handshake', target_type: 'haproxy' },
+    });
+    check(res, {
+        'haproxy 2xx': (r) => r.status >= 200 && r.status < 300,
+    });
+}
+
+export function f5Handshake() {
+    const res = http.get(`${F5_TARGET_URL}/healthz`, {
+        tags: { scenario: 'f5_handshake', target_type: 'f5' },
+    });
+    check(res, {
+        'f5 2xx': (r) => r.status >= 200 && r.status < 300,
+    });
+}
+
 // handleSummary writes the full results to /results/summary.{json,txt}
 // so the operator can commit the baseline numbers into README.md after
 // each run and so CI can ingest the JSON for diffing.
 //
+// Bundle 10 added a `connector_tier` aggregation alongside the API tier
+// — same source data (data.metrics), grouped by target_type tag for
+// per-connector-type p50/p95/p99/error breakdowns. Operators tracking a
+// connector regression diff `connector_tier.<type>` between runs.
+//
 // stdout reproduces the textSummary so the docker compose log shows
 // the same numbers an operator running it manually would see.
 export function handleSummary(data) {
+    const enriched = enrichWithConnectorTier(data);
    return {
-        '/results/summary.json': JSON.stringify(data, null, 2),
+        '/results/summary.json': JSON.stringify(enriched, null, 2),
        '/results/summary.txt': textSummary(data, { indent: ' ', enableColors: false }),
        stdout: textSummary(data, { indent: ' ', enableColors: true }),
    };
 }
+
+// enrichWithConnectorTier appends a connector_tier object to the k6
+// summary data. Each target_type entry contains:
+//   { p50, p95, p99, max, avg, error_rate, iterations }
+// Missing tags (e.g. an operator runs only the API tier scenarios) are
+// reported as null so callers can detect them without a separate scan.
+function enrichWithConnectorTier(data) {
+    const targetTypes = ['nginx', 'apache', 'haproxy', 'f5'];
+    const connectorTier = {};
+    for (const t of targetTypes) {
+        const reqDurKey = `http_req_duration{target_type:${t}}`;
+        const reqFailKey = `http_req_failed{target_type:${t}}`;
+        const iterKey = `iterations{target_type:${t}}`;
+
+        const dur = data.metrics[reqDurKey];
+        const fail = data.metrics[reqFailKey];
+        const iters = data.metrics[iterKey];
+
+        if (!dur || !dur.values) {
+            connectorTier[t] = null;
+            continue;
+        }
+        connectorTier[t] = {
+            p50: dur.values['med'] ?? null,
+            p95: dur.values['p(95)'] ?? null,
+            p99: dur.values['p(99)'] ?? null,
+            max: dur.values['max'] ?? null,
+            avg: dur.values['avg'] ?? null,
+            error_rate: fail && fail.values ? (fail.values['rate'] ?? null) : null,
+            iterations: iters && iters.values ? (iters.values['count'] ?? null) : null,
+        };
+    }
+    // Shallow-merge so existing summary fields (data.metrics, data.options,
+    // etc.) stay untouched. The connector_tier key is additive.
+    return Object.assign({}, data, { connector_tier: connectorTier });
+}
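
One caveat on the lookup in enrichWithConnectorTier: to my understanding, k6 only exposes a tagged sub-metric in handleSummary's data.metrics when a threshold references it. The thresholds block in this diff covers http_req_duration{target_type:...} but not the http_req_failed or iterations variants, so error_rate and iterations may come back null. The sketch below shows one possible workaround, assuming that behaviour holds for the k6 version in use. placeholderThresholds is a hypothetical helper, not something this commit adds.

```javascript
// Hypothetical helper: build always-true thresholds so the tagged
// sub-metrics exist under the exact keys enrichWithConnectorTier reads.
function placeholderThresholds(targetTypes) {
    const thresholds = {};
    for (const t of targetTypes) {
        // A Rate metric is never negative, so this cannot fail a run.
        thresholds[`http_req_failed{target_type:${t}}`] = ['rate>=0'];
        // Likewise, a Counter's count is never negative.
        thresholds[`iterations{target_type:${t}}`] = ['count>=0'];
    }
    return thresholds;
}

const extra = placeholderThresholds(['nginx', 'apache', 'haproxy', 'f5']);
console.log(Object.keys(extra).length);                 // prints 8
console.log(extra['http_req_failed{target_type:f5}']);  // prints [ 'rate>=0' ]
```

The result could be spread into options.thresholds next to the existing entries. It may also be worth confirming that summaryTrendStats includes p(99), since the visible hunks do not show that setting and the p99 field depends on it.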