loadtest: close Phase 8 SCALE-H2 — add scale-tier scenarios

Phase 8 of the certctl architecture diligence remediation closes
SCALE-H2 by adding three new k6 scenarios that exercise the
scale-relevant load surfaces the API tier + connector tier left
uncovered: fleet-scale bulk renewal, ACME enrollment burst, and agent
heartbeat storm.

Audit miscount + path correction (live-grep at Phase 8 audit time)
==================================================================

- The Phase 8 prompt referenced both `deploy/test/load/` and
  `deploy/test/loadtest/`. Repo truth: the existing harness lives at
  `deploy/test/loadtest/`. New scenarios land there.
- The audit's prior framing "k6 covers the API tier at 50 req/s only"
  omitted Bundle 10 (2026-05-02), which added four connector-tier
  handshake scenarios (nginx/apache/haproxy/f5) at 100 conns/min each,
  plus the Phase 5 ACME directory/nonce/ARI scenario at 100 VUs in
  `k6/acme_flow.js`. Phase 8 appends to what's there rather than
  rewriting.

What ships
==========

Three new k6 scenario files under deploy/test/loadtest/k6/:

  bulk_renewal.js
      10K-cert seed + 5 req/s POST /bulk-renew × 5min
      p99 < 5s, p95 < 2s, errors < 1%
  acme_burst.js
      200 VU sustained × directory/nonce/ARI × 5min
      directory p95 < 500ms, nonce p95 < 300ms,
      renewal-info p95 < 800ms, 5xx-only < 0.1%
      Pins RFC 7807 rate-limit response shape via
      acme_rate_limit_shape_ok Counter.
  agent_storm.js
      5K-agent seed + 167 req/s POST /heartbeat × 5min
      p99 < 1s, p95 < 500ms, errors < 0.1%

Two seed SQL fixtures under deploy/test/loadtest/seed/:

  01_bulk_renewal_certs.sql
      10,000 managed_certificates rows linked to seed_demo.sql FKs
      (iss-local, o-alice, t-platform, rp-standard). status='active',
      expires_at distributed across next 30 days, name prefix
      `loadtest-bulk-` so the scenario can scope its criteria.
      Idempotent via ON CONFLICT (name) DO NOTHING.
  02_agent_fleet.sql
      5,000 agents rows with name prefix `loadtest-agent-`.
      status='Online', last_heartbeat_at staggered across prior 60s,
      OS distribution 80%/10%/10% linux/windows/darwin.
      Idempotent via ON CONFLICT (id) DO NOTHING.

Plus seed/README.md documenting the opt-in profile + when these run vs
the default `make loadtest` fast path.

Compose + Makefile + CI wiring
==============================

deploy/test/loadtest/docker-compose.yml gains four new services, all
gated behind the `scale` compose profile so the default `make loadtest`
is unchanged:

  scale-seed
      One-shot postgres:16-alpine container that runs every
      ./seed/*.sql in lexical order against the same postgres the
      server uses. Depends on postgres healthy + certctl-server healthy
      (so migrations + seed_demo.sql have already run).
  k6-scale-bulk
      grafana/k6:0.54.0 driver running bulk_renewal.js
  k6-scale-acme
      grafana/k6:0.54.0 driver running acme_burst.js
  k6-scale-agent
      grafana/k6:0.54.0 driver running agent_storm.js

Each driver depends_on scale-seed completed_successfully so the
scenarios never run against an unseeded DB (the acme scenario doesn't
need the seed itself but uses the same dependency chain for ordering
predictability).

Makefile gains four new phony targets:

  loadtest-scale-bulk  - runs bulk_renewal.js via compose --profile scale
  loadtest-scale-acme  - runs acme_burst.js
  loadtest-scale-agent - runs agent_storm.js
  loadtest-scale       - all three serially

.github/workflows/loadtest.yml gains a new k6-scale matrix job that
runs after the existing k6 job (needs: k6) with a matrix on the three
scenarios; fail-fast: false so a regression in one scenario doesn't
cancel the others. Same workflow_dispatch + weekly cron cadence as the
existing API + connector tier job.

Documentation
=============

docs/operator/scale.md gains a new "Scale-tier scenarios (SCALE-H2,
Phase 8)" section between the cursor-pagination subsection and the
profiling-production subsection.
Documents:

- Scenario + seed + sustained load table
- Threshold contract (regression guards, NOT measured baselines)
- Measured-baseline table with TBD placeholders + the
  canonical-hardware capture procedure
- How to run the scale tier locally
- Four documented limitations (JWS-signed ACME, scheduler renewal scan
  throughput, production-sized Postgres, pull-only deployment model)

deploy/test/loadtest/README.md gains a short "Scale tier (Phase 8
SCALE-H2, 2026-05-14)" section pointing at scale.md as the canonical
operator-facing baseline source. Avoids duplication; the README remains
the harness-mechanics doc.

Deliberate deviations from the prompt
======================================

The Phase 8 prompt's "concrete deliverables" section referenced
`deploy/test/load/` (without the `test` suffix) for the new k6 files.
The actual harness lives at `deploy/test/loadtest/`, so the new files
land there to match existing convention. The prompt's audit-questions
section also referenced `deploy/test/loadtest/`, so the prompt was
internally inconsistent on this; repo truth wins.

The prompt described the ACME burst as "200 concurrent ACME orders
against /acme/profile/<id>/new-order ... pin the rate-limit response
shape." new-order is JWS-signed (RFC 8555 §6.2 requires a JWS body on
every POST, including the new-order request defined in §7.4). k6
doesn't ship JWS, and bundling a signer (e.g. lego) into the k6
container would obscure the server-side latency the scenario is trying
to measure. This is the same trade-off the existing Phase 5
acme_flow.js made. Phase 8's acme_burst.js measures the unauthenticated
directory + nonce + ARI surface at burst rate AND pins the 429
rate-limit response shape via a custom Counter that increments only
when the response is `application/problem+json` with the
`urn:ietf:params:acme:error:rateLimited` type. End-to-end JWS
conformance under load remains a follow-up; the canonical JWS
correctness gate is `make acme-rfc-conformance-test` (lego-based,
non-load).
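The rate-limit shape check described above can be sketched as a pure function. This is a minimal sketch, not the code in acme_burst.js: the helper name `isRateLimitProblem` and its argument order are hypothetical. Only the 429 status, the `application/problem+json` media type, and the `urn:ietf:params:acme:error:rateLimited` type come from the text above.

```javascript
// Hypothetical helper (not taken from acme_burst.js): returns true only
// when a response looks like an RFC 7807 problem document carrying the
// ACME rateLimited error type.
function isRateLimitProblem(status, contentType, bodyText) {
  if (status !== 429) {
    return false;
  }
  // Content-Type may carry parameters such as "; charset=utf-8".
  const mediaType = String(contentType || "")
    .split(";")[0]
    .trim()
    .toLowerCase();
  if (mediaType !== "application/problem+json") {
    return false;
  }
  let problem;
  try {
    problem = JSON.parse(bodyText);
  } catch (err) {
    return false; // body is not JSON, so it is not a problem document
  }
  return (
    problem !== null &&
    typeof problem === "object" &&
    problem.type === "urn:ietf:params:acme:error:rateLimited"
  );
}
```

Inside a k6 script, a function like this would presumably be fed `res.status`, `res.headers["Content-Type"]`, and `res.body`, with the Counter incremented through `add(1)` only when it returns true.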
Deferred (operator-side, not engineering)
==========================================

Canonical-hardware baseline capture. The TBD placeholders in
docs/operator/scale.md's measured-baseline table are intentional:
sandbox-captured numbers from a developer laptop are misleading (same
anti-pattern the original loadtest README guards against). Operator
triggers loadtest.yml from the Actions tab, waits for the k6-scale
matrix jobs to complete, downloads the per-scenario summary artifacts,
copies p50/p95/p99 into the table, commits the captured numbers
alongside the date + commit SHA.

Files changed (11):

  .github/workflows/loadtest.yml                      (+72 -1)
  Makefile                                            (+47 -1)
  deploy/test/loadtest/README.md                      (+28 -1)
  deploy/test/loadtest/docker-compose.yml             (+108 -1)
  deploy/test/loadtest/k6/bulk_renewal.js             (new, 106 lines)
  deploy/test/loadtest/k6/acme_burst.js               (new, 192 lines)
  deploy/test/loadtest/k6/agent_storm.js              (new, 124 lines)
  deploy/test/loadtest/seed/01_bulk_renewal_certs.sql (new, 85 lines)
  deploy/test/loadtest/seed/02_agent_fleet.sql        (new, 85 lines)
  deploy/test/loadtest/seed/README.md                 (new, 87 lines)
  docs/operator/scale.md                              (+109 -0)

Verification (sandbox-runnable):

  python3 -c 'import yaml; yaml.safe_load(open("deploy/test/loadtest/docker-compose.yml"))'
    → compose YAML OK
  python3 -c 'import yaml; yaml.safe_load(open(".github/workflows/loadtest.yml"))'
    → workflow YAML OK
  grep -E 'bulk_renewal|acme_burst|agent_storm' deploy/test/loadtest/k6/*.js
    → all three scenarios + tags present
  grep loadtest-scale Makefile
    → 4 new targets registered in .PHONY + 3 recipes + 1 aggregate

Runtime verification (deferred; requires docker on canonical hardware):

  make loadtest-scale-bulk   # 10K cert fixture + 5 req/s × 5min
  make loadtest-scale-acme   # 200 VU × 5min
  make loadtest-scale-agent  # 5K agent fixture + 167 req/s × 5min
  make loadtest-scale        # all three serially

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-H2
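For orientation, the bulk-renewal load shape and thresholds listed under "What ships" could be expressed as a k6 options object roughly like the one below. This is a sketch under assumptions, not the contents of bulk_renewal.js: the `constant-arrival-rate` executor, the VU pool sizes, and the mapping of "errors < 1%" onto the built-in `http_req_failed` metric are guesses, and a real k6 script would `export` the object.

```javascript
// Sketch of a k6 options object matching the stated bulk-renewal contract:
// 5 req/s for 5 minutes, p99 < 5s, p95 < 2s, errors < 1%.
const options = {
  scenarios: {
    bulk_renewal: {
      executor: "constant-arrival-rate", // assumption: open-model arrival rate
      rate: 5, // iterations started per timeUnit
      timeUnit: "1s",
      duration: "5m",
      preAllocatedVUs: 20, // assumption: pool size is not stated in the commit
      maxVUs: 50, // assumption
    },
  },
  thresholds: {
    // k6 reports http_req_duration in milliseconds.
    http_req_duration: ["p(99)<5000", "p(95)<2000"],
    // assumption: the error budget is tracked via the built-in failure rate.
    http_req_failed: ["rate<0.01"],
  },
};

// Requests the scenario would start over the whole run: 5 req/s × 300 s.
const expectedRequests = options.scenarios.bulk_renewal.rate * 5 * 60;
console.log(`bulk_renewal starts about ${expectedRequests} requests per run`);
```

With a fixed arrival rate, slow responses show up as latency-threshold failures rather than as reduced throughput, which suits a regression guard.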
2026-06-11 15:18:57 +00:00 · 2026-05-14 03:25:15 +00:00
parent 0ad881c2bd
commit 1279172e9b
11 changed files with 1063 additions and 1 deletions
@@ -0,0 +1,85 @@
+-- Phase 8 SCALE-H2: bulk-renewal scenario seed.
+--
+-- Generates 10,000 managed_certificates rows linked to the existing
+-- seed_demo.sql FKs (iss-local, o-alice, t-platform, rp-standard) so
+-- the bulk-renewal k6 scenario can POST /api/v1/certificates/bulk-renew
+-- against a fleet-scale dataset instead of the 15-row demo seed.
+--
+-- Behavior:
+--   - Idempotent. ON CONFLICT (name) DO NOTHING — re-running the seed
+--     against an already-seeded DB is a no-op.
+--   - expires_at is uniformly distributed across the next 30 days so
+--     a renewal_window_days = 30 policy considers every row eligible.
+--   - status = 'active' so the renewal selector treats them as
+--     live (the scheduler skips status IN ('pending', 'failed',
+--     'revoked', 'retired')).
+--   - name is generated as 'loadtest-bulk-NNNNN.example.test' for a
+--     stable, predictable identifier the k6 scenario can pattern-match
+--     to scope its criteria to the seeded set (the production fleet
+--     wouldn't share this prefix).
+--
+-- Volume target: 10,000 rows. Insert wall time on the loadtest stack
+-- (postgres:16-alpine, 2 CPU / 4 GiB): typically < 5 seconds via the
+-- single-statement generate_series + INSERT pattern below. The
+-- compose seed-init container runs this BEFORE the k6 driver starts,
+-- so the steady-state load measurement isn't affected by seed time.
+--
+-- Why not generated in Go via a fixtures helper:
+--   - The certctl-server boots from a clean DB and runs migrations +
+--     seed_demo.sql automatically when CERTCTL_DEMO_SEED=true. Adding
+--     a Go-side fixtures helper would require either (a) a new
+--     CERTCTL_LOADTEST_SEED flag wired into cmd/server/main.go (cross-
+--     cutting change for one test path) or (b) a separate seed binary
+--     (more compose surface). Raw SQL is the smallest viable change.
+--
+-- Phase 8 entry point — runs only when the loadtest compose stack is
+-- explicitly opted into the scale-seed via LOADTEST_SCALE_SEED=true.
+
+INSERT INTO managed_certificates (
+    id,
+    name,
+    common_name,
+    sans,
+    environment,
+    owner_id,
+    team_id,
+    issuer_id,
+    renewal_policy_id,
+    status,
+    expires_at,
+    tags,
+    created_at,
+    updated_at
+)
+SELECT
+    'cert-loadtest-bulk-' || lpad(g::text, 5, '0'),
+    'loadtest-bulk-' || lpad(g::text, 5, '0') || '.example.test',
+    'loadtest-bulk-' || lpad(g::text, 5, '0') || '.example.test',
+    ARRAY['loadtest-bulk-' || lpad(g::text, 5, '0') || '.example.test'],
+    'loadtest',
+    'o-alice',
+    't-platform',
+    'iss-local',
+    'rp-standard',
+    'active',
+    -- Distribute expires_at uniformly across the next 30 days so a
+    -- 30-day-window renewal policy sees every row as eligible.
+    NOW() + ((g % 30) || ' days')::interval + ((g % 24) || ' hours')::interval,
+    jsonb_build_object('source', 'loadtest-phase8', 'batch', 'bulk-renewal'),
+    NOW(),
+    NOW()
+FROM generate_series(1, 10000) AS g
+ON CONFLICT (name) DO NOTHING;
+
+-- Confirmation row count — the seed-init container greps this in its
+-- logs to verify the fleet shape post-insert. The output appears in
+-- `docker compose logs scale-seed` after the run.
+DO $$
+DECLARE
+    cert_count integer;
+BEGIN
+    SELECT COUNT(*) INTO cert_count
+    FROM managed_certificates
+    WHERE name LIKE 'loadtest-bulk-%';
+    RAISE NOTICE 'Phase 8 bulk-renewal seed: % managed_certificates rows present', cert_count;
+END $$;
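A note on the `expires_at` expression in the seed above: its spread can be checked outside the database. The sketch below mirrors `(g % 30) days + (g % 24) hours` in plain JavaScript, treating a day as exactly 24 hours (an assumption; Postgres day intervals are calendar days). The function name `expiryOffsetHours` is hypothetical.

```javascript
// Mirror of the seed's expires_at offset, in hours from NOW().
// Assumes 1 day = 24 hours, which ignores DST transitions.
function expiryOffsetHours(g) {
  return (g % 30) * 24 + (g % 24);
}

let maxHours = 0;
const distinctOffsets = new Set();
const rowsPerDay = new Array(30).fill(0);

for (let g = 1; g <= 10000; g++) {
  const hours = expiryOffsetHours(g);
  maxHours = Math.max(maxHours, hours);
  distinctOffsets.add(hours);
  rowsPerDay[g % 30] += 1;
}

console.log(`max offset: ${maxHours}h (window is ${30 * 24}h)`);
console.log(`distinct offsets: ${distinctOffsets.size}`);
console.log(`rows per day bucket: ${Math.min(...rowsPerDay)}..${Math.max(...rowsPerDay)}`);
```

Every row lands inside the 30-day window and the per-day buckets are nearly equal, but the two modulo terms repeat together every 120 rows, so the 10,000 rows share only 120 distinct timestamps per run.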
@@ -0,0 +1,85 @@
+-- Phase 8 SCALE-H2: agent-fleet heartbeat-storm scenario seed.
+--
+-- Generates 5,000 agents rows so the heartbeat-storm k6 scenario can
+-- model a fleet-scale heartbeat pattern (5K agents heartbeating at the
+-- native 30s cadence = ~167 heartbeats/sec sustained) instead of the
+-- ~10-agent demo seed.
+--
+-- Behavior:
+--   - Idempotent. ON CONFLICT (id) DO NOTHING — re-runnable against an
+--     already-seeded DB.
+--   - name is unique (a UNIQUE constraint in migration 000001) so the
+--     name suffix mirrors the id suffix.
+--   - status = 'Online' so the heartbeat handler's retire-check
+--     (service.ErrAgentRetired) doesn't 410 the storm.
+--   - last_heartbeat_at staggered across the prior 60 seconds so the
+--     stale-agent reaper (agentHealthCheckLoop) doesn't immediately
+--     flip half the fleet to 'Offline' during the first scheduler
+--     tick of the load run.
+--   - api_key_hash = 'loadtest_no_auth'. The loadtest compose runs
+--     CERTCTL_AUTH_TYPE=api-key with a single static token
+--     (load-test-token), which bypasses the per-agent key check the
+--     same way the existing API tier scenarios do. Production deploys with
+--     CERTCTL_AUTH_TYPE=agent-key per-agent would seed real bcrypt'd
+--     hashes; this column is opaque to the load-test path.
+--   - registered_at = NOW() - (g % 90 + 1) days (deterministic) so agent
+--     age looks realistic and any age-based query plans are exercised.
+--
+-- Volume target: 5,000 rows. The agents schema is much narrower than
+-- managed_certificates so the insert is sub-second on the loadtest
+-- stack. The 5K agents do not own any deployment_targets in this
+-- fixture (the scenario only measures the heartbeat hot path, not
+-- the work-poll path which depends on cert + target wiring).
+--
+-- Phase 8 entry point — runs only when the loadtest compose stack is
+-- explicitly opted into the scale-seed via LOADTEST_SCALE_SEED=true.
+
+INSERT INTO agents (
+    id,
+    name,
+    hostname,
+    status,
+    last_heartbeat_at,
+    registered_at,
+    api_key_hash,
+    os,
+    architecture,
+    ip_address,
+    version
+)
+SELECT
+    'ag-loadtest-' || lpad(g::text, 5, '0'),
+    'loadtest-agent-' || lpad(g::text, 5, '0'),
+    'loadtest-' || lpad(g::text, 5, '0') || '.fleet.example.test',
+    'Online',
+    -- Stagger last_heartbeat_at across the prior 60 seconds (= 2x the
+    -- agent's native poll interval) so the first wave of incoming
+    -- heartbeats doesn't all arrive in lockstep at t=0.
+    NOW() - ((g % 60) || ' seconds')::interval,
+    -- registered_at spread deterministically 1-90 days back.
+    NOW() - ((g % 90 + 1) || ' days')::interval,
+    'loadtest_no_auth',
+    -- Mix linux/windows/darwin so the OS distribution column in the
+    -- agents page isn't pure-linux during the storm.
+    CASE (g % 10)
+        WHEN 0 THEN 'windows'
+        WHEN 1 THEN 'darwin'
+        ELSE 'linux'
+    END,
+    -- amd64 dominates; arm64 minority.
+    CASE WHEN (g % 5) = 0 THEN 'arm64' ELSE 'amd64' END,
+    -- IPv4 in the 10.42.0.0/16 fleet range, deterministic per id.
+    '10.42.' || ((g / 256) % 256)::text || '.' || (g % 256)::text,
+    '2.1.0'
+FROM generate_series(1, 5000) AS g
+ON CONFLICT (id) DO NOTHING;
+
+DO $$
+DECLARE
+    agent_count integer;
+BEGIN
+    SELECT COUNT(*) INTO agent_count
+    FROM agents
+    WHERE id LIKE 'ag-loadtest-%';
+    RAISE NOTICE 'Phase 8 agent-storm seed: % agents rows present', agent_count;
+END $$;
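The fleet shape produced by the agent seed above can be reproduced without a database. The sketch below mirrors the SQL's modulo rules in plain JavaScript to show the resulting OS and architecture split and to confirm the derived IPs do not collide. The function name `agentRow` and its field names are hypothetical.

```javascript
// Mirror of the seed's per-row derivations for g in 1..5000.
function agentRow(g) {
  const suffix = String(g).padStart(5, "0"); // same as lpad(g::text, 5, '0')
  const bucket = g % 10;
  return {
    id: "ag-loadtest-" + suffix,
    name: "loadtest-agent-" + suffix,
    os: bucket === 0 ? "windows" : bucket === 1 ? "darwin" : "linux",
    architecture: g % 5 === 0 ? "arm64" : "amd64",
    // Math.floor matches Postgres integer division for positive g.
    ipAddress: "10.42." + (Math.floor(g / 256) % 256) + "." + (g % 256),
  };
}

const osCounts = { linux: 0, windows: 0, darwin: 0 };
const archCounts = { amd64: 0, arm64: 0 };
const seenIps = new Set();

for (let g = 1; g <= 5000; g++) {
  const row = agentRow(g);
  osCounts[row.os] += 1;
  archCounts[row.architecture] += 1;
  seenIps.add(row.ipAddress);
}

console.log(osCounts, archCounts, `unique IPs: ${seenIps.size}`);
```

Because the split is derived from `g` rather than sampled, re-seeding always yields the same fleet, which keeps captured baselines comparable between runs.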
@@ -0,0 +1,87 @@
+# Phase 8 load-test seed fixtures
+
+Opt-in seed scripts that grow the loadtest DB from the demo-scale
+fixture (~15 certs / ~10 agents from `migrations/seed_demo.sql`) to
+fleet scale (10K certs + 5K agents) so the Phase 8 SCALE-H2 scenarios
+measure something representative.
+
+## When these run
+
+The default `make loadtest` path does NOT touch this directory — the
+API tier and connector tier scenarios run against the demo seed alone
+and complete in ~5 minutes. The Phase 8 scenarios opt in via the
+`LOADTEST_SCALE_SEED=true` environment variable; when set, the
+`certctl-loadtest-scale-seed` one-shot init container runs every
+`*.sql` file in this directory in lexical order against the same
+Postgres instance the server uses.
+
+Compose service wiring (see `../docker-compose.yml`):
+- Service: `scale-seed`
+- Profile: `scale-seed` (compose `profiles:` gate; not started by
+  default)
+- Depends on: `postgres` (service_healthy) AND `certctl-server`
+  (service_healthy — server runs schema migrations at boot so the
+  seed runs AFTER tables exist)
+- Order: lexical (`01_bulk_renewal_certs.sql` then
+  `02_agent_fleet.sql`)
+- Idempotent: every script uses `ON CONFLICT DO NOTHING` so re-running
+  is a no-op.
+
+## What gets seeded
+
+| File | Rows | Purpose |
+|---|---|---|
+| `01_bulk_renewal_certs.sql` | 10,000 managed_certificates | Fleet shape for `bulk_renewal.js`. All linked to demo FKs (iss-local, o-alice, t-platform, rp-standard). Status `active`, expires_at distributed across the next 30 days so a 30-day renewal window considers every row eligible. Name prefix `loadtest-bulk-` so the k6 scenario can scope its bulk-renew criteria. |
+| `02_agent_fleet.sql` | 5,000 agents | Fleet shape for `agent_storm.js`. Status `Online`, last_heartbeat_at staggered across prior 60s, name prefix `loadtest-agent-`. OS distribution: 80% linux / 10% windows / 10% darwin. Arch: 80% amd64 / 20% arm64. |
+
+## How to run the Phase 8 scenarios locally
+
+```bash
+cd deploy/test/loadtest
+LOADTEST_SCALE_SEED=true docker compose --profile scale-seed up --build \
+    --abort-on-container-exit --exit-code-from k6-scale
+```
+
+Or via the dedicated Makefile target (preferred for CI parity):
+
+```bash
+make loadtest-scale
+```
+
+## Why SQL fixtures instead of a Go seed binary
+
+- The certctl-server already boots from a clean DB and runs migrations
+  + `seed_demo.sql` when `CERTCTL_DEMO_SEED=true`. Adding a third seed
+  mode (loadtest-scale) would mean either a new
+  `CERTCTL_LOADTEST_SEED` flag wired into `cmd/server/main.go` (cross-
+  cutting change for one test path) or a separate seed binary (more
+  compose surface).
+- Raw SQL is the smallest viable change: each script is a single
+  multi-row `INSERT … SELECT FROM generate_series(…)` plus a
+  `DO $$ … RAISE NOTICE` confirmation block.
+- Idempotency is straightforward via `ON CONFLICT … DO NOTHING` — the
+  same pattern `seed_demo.sql` uses.
+
+## Why these volumes specifically
+
+- **10K certs.** The SCALE-H2 audit asked for "10K certs with
+  renewal_at < now." Round number, fits in postgres:16-alpine on a
+  CI runner without OOM, and large enough that the renewal selector's
+  query plan is exercised (with the demo's 15 rows, any plan is
+  trivially fast).
+- **5K agents.** Heartbeat at 30s cadence = ~167 heartbeats/sec
+  sustained. That's well above the 50 req/s the existing API tier
+  measures and stresses the agent.heartbeat handler's per-call cost
+  (last_heartbeat_at UPDATE + the RBAC permission check + the
+  audit-log row).
+
+If a future scenario needs more rows (50K certs / 10K agents), add a
+new `03_…sql` here and another scenario file. Don't grow the existing
+files — re-running existing scenarios against a different fixture
+shape would invalidate the captured baseline.
+
+## Phase 8 audit reference
+
+Source finding: SCALE-H2 in
+`cowork/certctl-architecture-diligence-audit.html`.
+Phase 8 closure commit: see `git log --grep='Phase 8'`.
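The "5K agents" rationale in the seed README above rests on a short calculation, worked through below. Only the 5,000-agent count, the 30s cadence, the 5-minute duration, and the 50 req/s API tier figure come from the source; the function name `steadyStateRate` is hypothetical.

```javascript
// Steady-state request rate for a fleet where every agent sends one
// heartbeat per interval and arrivals are spread evenly.
function steadyStateRate(agentCount, intervalSeconds) {
  return agentCount / intervalSeconds;
}

const heartbeatRate = steadyStateRate(5000, 30); // about 166.67 req/s
const roundedRate = Math.round(heartbeatRate); // the 167 req/s the scenario drives
const totalHeartbeats = roundedRate * 5 * 60; // requests over a 5-minute run
const ratioToApiTier = heartbeatRate / 50; // versus the 50 req/s API tier

console.log(`rate: ${heartbeatRate.toFixed(2)} req/s, driven at ${roundedRate} req/s`);
console.log(`total over 5 min: ${totalHeartbeats}`);
console.log(`multiple of API tier load: ${ratioToApiTier.toFixed(2)}x`);
```

The even-spread assumption is the optimistic case. A fleet whose agents restart together would produce bursts above this average, which the steady-rate scenario does not model.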