mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 23:31:39 +00:00
1279172e9b
Phase 8 of the certctl architecture diligence remediation closes
SCALE-H2 by adding three new k6 scenarios that exercise the scale-
relevant load surfaces the API tier + connector tier left uncovered:
fleet-scale bulk renewal, ACME enrollment burst, and agent heartbeat
storm.
Audit miscount + path correction (live-grep at Phase 8 audit time)
==================================================================
- The Phase 8 prompt referenced both `deploy/test/load/` and
`deploy/test/loadtest/`. Repo truth: the existing harness lives at
`deploy/test/loadtest/`. New scenarios land there.
- The audit's prior framing "k6 covers the API tier at 50 req/s
only" omitted Bundle 10 (2026-05-02) which added four connector-
tier handshake scenarios (nginx/apache/haproxy/f5) at 100 conns/min
each, plus the Phase 5 ACME directory/nonce/ARI scenario at 100 VUs
in `k6/acme_flow.js`. Phase 8 appends to what's there rather than
rewriting.
What ships
==========
Three new k6 scenario files under deploy/test/loadtest/k6/:
  bulk_renewal.js  — 10K-cert seed + 5 req/s POST /bulk-renew × 5min
                     p99 < 5s, p95 < 2s, errors < 1%
  acme_burst.js    — 200 VU sustained × directory/nonce/ARI × 5min
                     directory p95 < 500ms, nonce p95 < 300ms,
                     renewal-info p95 < 800ms, 5xx-only < 0.1%
                     Pins RFC 7807 rate-limit response shape via
                     acme_rate_limit_shape_ok Counter.
  agent_storm.js   — 5K-agent seed + 167 req/s POST /heartbeat × 5min
                     p99 < 1s, p95 < 500ms, errors < 0.1%
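
The load shapes and thresholds listed above could be expressed as k6 `options` objects roughly as follows. This is a minimal sketch, not the contents of the shipped scenario files: the helper `constantRateOptions`, the choice of the `constant-arrival-rate` executor, and the VU pool sizes are assumptions for illustration. Only the rates, durations, and threshold values come from the list above. In an actual k6 script the resulting object would be exported as `options`.

```javascript
// Hypothetical helper (not from the repo): builds a k6-style options object
// for an open-model scenario that fires a fixed number of requests per second.
function constantRateOptions({ name, ratePerSec, duration, p95Ms, p99Ms, maxErrorRate }) {
  return {
    scenarios: {
      [name]: {
        executor: 'constant-arrival-rate', // assumed executor choice
        rate: ratePerSec,                  // iterations started per timeUnit
        timeUnit: '1s',
        duration,
        preAllocatedVUs: 50,               // assumed pool size
        maxVUs: 500,                       // assumed ceiling
      },
    },
    thresholds: {
      // k6 exits non-zero when any of these expressions is violated.
      http_req_duration: [`p(95)<${p95Ms}`, `p(99)<${p99Ms}`],
      http_req_failed: [`rate<${maxErrorRate}`],
    },
  };
}

// Values taken from the scenario list: 5 req/s and 167 req/s, both for 5 minutes.
const bulkRenewalOptions = constantRateOptions({
  name: 'bulk_renewal', ratePerSec: 5, duration: '5m',
  p95Ms: 2000, p99Ms: 5000, maxErrorRate: 0.01,
});
const agentStormOptions = constantRateOptions({
  name: 'agent_storm', ratePerSec: 167, duration: '5m',
  p95Ms: 500, p99Ms: 1000, maxErrorRate: 0.001,
});
```

An arrival-rate executor keeps the request rate fixed even when the server slows down, which is why it is a plausible fit for "N req/s" targets; a VU-based executor (as acme_burst.js's "200 VU sustained" suggests) instead fixes concurrency.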
Two seed SQL fixtures under deploy/test/loadtest/seed/:
  01_bulk_renewal_certs.sql — 10,000 managed_certificates rows
    linked to seed_demo.sql FKs (iss-local, o-alice, t-platform,
    rp-standard). status='active', expires_at distributed across
    next 30 days, name prefix `loadtest-bulk-` so the scenario
    can scope its criteria. Idempotent via
    ON CONFLICT (name) DO NOTHING.
  02_agent_fleet.sql — 5,000 agents rows with name prefix
    `loadtest-agent-`. status='Online', last_heartbeat_at
    staggered across prior 60s, OS distribution 80%/10%/10%
    linux/windows/darwin. Idempotent via
    ON CONFLICT (id) DO NOTHING.
Plus seed/README.md documenting the opt-in profile + when these
run vs the default `make loadtest` fast path.
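
To make the shape of the agent fleet fixture concrete, the sketch below builds an in-memory equivalent of the rows described for 02_agent_fleet.sql. It is illustrative only and is not how the SQL fixture is written: the function name `buildAgentFleet`, the `os` field name, and the zero-padded id format are assumptions, while the name prefix, status value, 60-second stagger, and 80/10/10 OS split come from the description above.

```javascript
// Hypothetical in-memory equivalent of the agent fleet fixture.
// Field names other than id, name, status, and last_heartbeat_at are assumptions.
function buildAgentFleet(count, nowMs) {
  // Ten-slot cycle yields an exact 80% / 10% / 10% split when count is a multiple of 10.
  const osCycle = [
    'linux', 'linux', 'linux', 'linux', 'linux',
    'linux', 'linux', 'linux', 'windows', 'darwin',
  ];
  const agents = [];
  for (let i = 0; i < count; i++) {
    const suffix = String(i).padStart(5, '0'); // assumed id format
    agents.push({
      id: `loadtest-agent-${suffix}`,
      name: `loadtest-agent-${suffix}`,
      status: 'Online',
      os: osCycle[i % osCycle.length],
      // Stagger heartbeats across the prior 60 seconds (offsets 0..59s).
      last_heartbeat_at: new Date(nowMs - (i % 60) * 1000).toISOString(),
    });
  }
  return agents;
}

const fleetNowMs = Date.UTC(2026, 4, 14); // arbitrary reference instant
const fleet = buildAgentFleet(5000, fleetNowMs);
```

Staggering the timestamps means the seeded fleet does not look like 5,000 agents that all checked in at the same instant, which would be an unrealistic starting state for a heartbeat scenario.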
Compose + Makefile + CI wiring
==============================
deploy/test/loadtest/docker-compose.yml gains four new services,
all gated behind the `scale` compose profile so the default
`make loadtest` is unchanged:
  scale-seed      — one-shot postgres:16-alpine container that runs
                    every ./seed/*.sql in lexical order against the
                    same postgres the server uses. Depends on
                    postgres healthy + certctl-server healthy (so
                    migrations + seed_demo.sql have already run).
  k6-scale-bulk   — grafana/k6:0.54.0 driver running bulk_renewal.js
  k6-scale-acme   — grafana/k6:0.54.0 driver running acme_burst.js
  k6-scale-agent  — grafana/k6:0.54.0 driver running agent_storm.js
Each driver depends_on scale-seed (condition:
service_completed_successfully) so the scenarios never run against an
unseeded DB (the acme scenario doesn't need the seed itself but uses
the same dependency chain for ordering predictability).
Makefile gains four new phony targets:
  loadtest-scale-bulk   - runs bulk_renewal.js via compose --profile scale
  loadtest-scale-acme   - runs acme_burst.js
  loadtest-scale-agent  - runs agent_storm.js
  loadtest-scale        - all three serially
.github/workflows/loadtest.yml gains a new k6-scale matrix job that
runs after the existing k6 job (needs: k6) with a matrix on the
three scenarios — fail-fast: false so a regression in one scenario
doesn't cancel the others. Same workflow_dispatch + weekly cron
cadence as the existing API + connector tier job.
Documentation
=============
docs/operator/scale.md gains a new "Scale-tier scenarios (SCALE-H2,
Phase 8)" section between the cursor-pagination subsection and the
profiling-production subsection. Documents:
  - Scenario + seed + sustained load table
  - Threshold contract (regression guards, NOT measured baselines)
  - Measured-baseline table with TBD placeholders + the
    canonical-hardware capture procedure
  - How to run the scale tier locally
  - Four documented limitations (JWS-signed ACME, scheduler renewal
    scan throughput, production-sized Postgres, pull-only deployment
    model)
deploy/test/loadtest/README.md gains a short "Scale tier (Phase 8
SCALE-H2, 2026-05-14)" section pointing at scale.md as the canonical
operator-facing baseline source. Avoids duplication; the README
remains the harness-mechanics doc.
Deliberate deviations from the prompt
======================================
The Phase 8 prompt's "concrete deliverables" section referenced
`deploy/test/load/` (no `test` suffix) for the new k6 files. The actual
harness lives at `deploy/test/loadtest/` — the new files land there
to match existing convention. The prompt's audit-questions section
also referenced `deploy/test/loadtest/` so the prompt was internally
inconsistent on this; repo truth wins.
The prompt described the ACME burst as "200 concurrent ACME orders
against /acme/profile/<id>/new-order ... pin the rate-limit response
shape." new-order is JWS-signed (RFC 8555 §6.2 requires a JWS body on
every POST; newAccount is still signed, carrying an embedded `jwk`
in place of a `kid`). k6 doesn't
ship JWS and bundling a signer (e.g. lego) into the k6 container
would obscure the server-side latency the scenario is trying to
measure. Same trade-off the existing Phase 5 acme_flow.js made.
Phase 8's acme_burst.js measures the unauthenticated
directory + nonce + ARI surface at burst rate AND pins the 429
rate-limit response shape via a custom Counter that increments only
when the response is `application/problem+json` with the
`urn:ietf:params:acme:error:rateLimited` type. End-to-end JWS
conformance under load remains a follow-up; the canonical JWS
correctness gate is `make acme-rfc-conformance-test` (lego-based,
non-load).
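
The shape check that gates the custom Counter can be sketched as a pure function. This is an illustration under assumptions, not the code in acme_burst.js: the function name `isRateLimitedProblem` and its argument list are hypothetical. In a k6 script the result would typically decide whether to call `.add(1)` on the `acme_rate_limit_shape_ok` Counter.

```javascript
// Hypothetical predicate: true only for a 429 whose body is an RFC 7807
// problem document carrying the ACME rateLimited error type.
function isRateLimitedProblem(status, contentType, bodyText) {
  if (status !== 429) return false;

  // Ignore parameters such as "; charset=utf-8" when comparing media types.
  const mediaType = String(contentType || '').split(';')[0].trim().toLowerCase();
  if (mediaType !== 'application/problem+json') return false;

  try {
    const problem = JSON.parse(bodyText);
    return (
      problem !== null &&
      typeof problem === 'object' &&
      problem.type === 'urn:ietf:params:acme:error:rateLimited'
    );
  } catch (_) {
    // A body that is not valid JSON cannot be a well-formed problem document.
    return false;
  }
}

// Made-up sample response for illustration.
const sampleBody = JSON.stringify({
  type: 'urn:ietf:params:acme:error:rateLimited',
  detail: 'too many requests',
});
const shapeOk = isRateLimitedProblem(429, 'application/problem+json; charset=utf-8', sampleBody);
```

Counting only well-formed 429s separates "the limiter fired and answered correctly" from "the server returned some other error under load", which a plain status-code check would blur.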
Deferred (operator-side, not engineering)
==========================================
Canonical-hardware baseline capture. The TBD placeholders in
docs/operator/scale.md's measured-baseline table are intentional —
sandbox-captured numbers from a developer laptop are misleading
(same anti-pattern the original loadtest README guards against).
Operator triggers loadtest.yml from the Actions tab, waits for the
k6-scale matrix jobs to complete, downloads the per-scenario
summary artifacts, copies p50/p95/p99 into the table, commits the
captured numbers alongside the date + commit SHA.
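
The "copy p50/p95/p99 into the table" step could be scripted along these lines. This is a hedged sketch: the function name `extractLatencyPercentiles` is hypothetical, and the layout of the summary artifact is an assumption (k6's `handleSummary` data nests statistics under `values`, while `--summary-export` output is flat; the sketch accepts both). k6 reports `p(99)` only when the run's `summaryTrendStats` includes it, so a missing value is returned as `null` rather than guessed.

```javascript
// Hypothetical helper: pulls p50/p95/p99 for one trend metric out of a
// parsed k6 summary object. Returns null when the metric is absent.
function extractLatencyPercentiles(summary, metricName = 'http_req_duration') {
  const metric = summary && summary.metrics ? summary.metrics[metricName] : undefined;
  if (!metric) return null;

  // handleSummary-style data nests stats under "values"; the flat export does not.
  const values = metric.values || metric;
  const pick = (key) => (typeof values[key] === 'number' ? values[key] : null);

  return {
    p50: pick('med'),   // k6 reports the median as "med"
    p95: pick('p(95)'),
    p99: pick('p(99)'), // null unless summaryTrendStats included p(99)
  };
}

// Made-up numbers for illustration only; these are not measurements.
const exampleSummary = {
  metrics: {
    http_req_duration: { values: { med: 120.5, 'p(95)': 410.2, 'p(99)': 880 } },
  },
};
const baselineRow = extractLatencyPercentiles(exampleSummary);
```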
Files changed (11):
  .github/workflows/loadtest.yml                       (+72 -1)
  Makefile                                             (+47 -1)
  deploy/test/loadtest/README.md                       (+28 -1)
  deploy/test/loadtest/docker-compose.yml              (+108 -1)
  deploy/test/loadtest/k6/bulk_renewal.js              (new, 106 lines)
  deploy/test/loadtest/k6/acme_burst.js                (new, 192 lines)
  deploy/test/loadtest/k6/agent_storm.js               (new, 124 lines)
  deploy/test/loadtest/seed/01_bulk_renewal_certs.sql  (new, 95 lines)
  deploy/test/loadtest/seed/02_agent_fleet.sql         (new, 92 lines)
  deploy/test/loadtest/seed/README.md                  (new, 86 lines)
  docs/operator/scale.md                               (+109 -0)
Verification (sandbox-runnable):
  python3 -c 'import yaml; yaml.safe_load(open("deploy/test/loadtest/docker-compose.yml"))'
    → compose YAML OK
  python3 -c 'import yaml; yaml.safe_load(open(".github/workflows/loadtest.yml"))'
    → workflow YAML OK
  grep -E 'bulk_renewal|acme_burst|agent_storm' deploy/test/loadtest/k6/*.js
    → all three scenarios + tags present
  grep loadtest-scale Makefile
    → 4 new targets registered in .PHONY + 3 recipes + 1 aggregate
Runtime verification (deferred — requires docker on canonical hardware):
  make loadtest-scale-bulk    # 10K cert fixture + 5 req/s × 5min
  make loadtest-scale-acme    # 200 VU × 5min
  make loadtest-scale-agent   # 5K agent fixture + 167 req/s × 5min
  make loadtest-scale         # all three serially
Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-H2
140 lines
5.7 KiB
YAML
# Load-test workflow — closes the #8 acquisition-readiness blocker from
# the 2026-05-01 issuer coverage audit.
#
# CADENCE: workflow_dispatch + weekly cron, NOT per-push. Load tests
# are minutes long and don't provide useful per-PR signal — per-push
# pressure goes through ci.yml. This workflow exists to (a) catch
# gradual regressions from cumulative changes that no single PR
# triggered, and (b) give an operator a one-click way to capture
# numbers before tagging a release.
#
# THRESHOLDS: defined in deploy/test/loadtest/k6.js (p99 < 5s for
# issuance-acceptance, p99 < 2s for list, error rate < 1%). k6 exits
# non-zero on any breach, which propagates through `docker compose up
# --exit-code-from k6` → `make loadtest` → this workflow's exit.

name: loadtest

on:
  workflow_dispatch:
    # Manual trigger from the Actions tab. Use before tagging a
    # release or after a meaningful tuning commit.

  schedule:
    # Mondays at 06:00 UTC. Off-peak; catches regressions accumulated
    # over the previous week's merges. Once a baseline is committed
    # in deploy/test/loadtest/README.md, drift relative to that
    # baseline is the signal — diff the captured summary.json
    # against the committed numbers.
    - cron: '0 6 * * 1'

# Reduce permissions — this workflow doesn't write to PRs or push tags.
permissions:
  contents: read

jobs:
  k6:
    name: k6 throughput run
    runs-on: ubuntu-latest
    # 25-minute hard cap. Pre-Bundle-10: 15min was enough for the API
    # tier alone (~7 minutes total). Post-Bundle-10 the harness boots
    # four additional target sidecars (nginx, apache, haproxy, f5-mock)
    # before the k6 run; their healthchecks add ~30-60s. The k6 scenarios
    # themselves are still 5 minutes (run in parallel with the API
    # scenarios, not serially). 25 minutes absorbs that plus slow CI
    # runners and cold image caches without letting a stuck container
    # consume the runner indefinitely.
    timeout-minutes: 25

    steps:
      - name: Checkout
        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4

      - name: Set up Docker Buildx
        # The compose stack builds the certctl image from the repo
        # root Dockerfile. Buildx gives the build a usable cache and
        # works with newer compose versions.
        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3

      - name: Run loadtest
        run: make loadtest
        env:
          # Disable BuildKit progress noise so the run log is
          # diff-able against past runs.
          BUILDKIT_PROGRESS: plain

      - name: Upload summary
        # Always upload the summary so a regression has a diffable
        # artifact even when k6 exited non-zero. summary.json is the
        # authoritative machine-readable form; summary.txt is the
        # human-readable text the README baseline tracks.
        if: always()
        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
        with:
          name: k6-summary-${{ github.run_id }}
          path: deploy/test/loadtest/results/
          retention-days: 90

  # ---------------------------------------------------------------------------
  # Phase 8 SCALE-H2 — scale-tier scenarios. Three new k6 drivers:
  #   - bulk-renewal: 10K-cert seed + criteria-mode POST /bulk-renew
  #   - acme-burst: 200 concurrent VUs against directory/nonce/ARI
  #   - agent-storm: 5K-agent seed + 167 heartbeats/sec sustained
  #
  # Matrix dispatch so each scenario runs on its own runner and a
  # regression in one doesn't mask another. The matrix runs in parallel,
  # which keeps total wall time around the existing 25-minute cap rather
  # than ~70 minutes serialised. Each scenario brings up the full
  # loadtest compose stack independently — there's no shared state
  # between scenarios that would benefit from a single-runner serial
  # invocation.
  #
  # Cadence: same as the API + connector tier job above (workflow_dispatch
  # + Mondays 06:00 UTC). The scale scenarios DO produce useful per-PR
  # signal in theory, but the per-run cost (image build + 5min run × 3)
  # is too high to gate on every PR; weekly is the right trade-off.
  # ---------------------------------------------------------------------------
  k6-scale:
    name: k6 scale tier (${{ matrix.scenario }})
    runs-on: ubuntu-latest
    timeout-minutes: 25
    needs: k6
    strategy:
      # Parallel: a failure in one scenario shouldn't cancel the others.
      # Each scenario's threshold breach is independent diagnostic data.
      fail-fast: false
      matrix:
        scenario:
          - bulk-renewal
          - acme-burst
          - agent-storm

    steps:
      - name: Checkout
        uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@8d2750c68a42422c14e847fe6c8ac0403b4cbd6f # v3

      - name: Run scale loadtest (${{ matrix.scenario }})
        env:
          BUILDKIT_PROGRESS: plain
        run: |
          case "${{ matrix.scenario }}" in
            bulk-renewal) make loadtest-scale-bulk ;;
            acme-burst) make loadtest-scale-acme ;;
            agent-storm) make loadtest-scale-agent ;;
            *) echo "::error::unknown scenario ${{ matrix.scenario }}"; exit 1 ;;
          esac

      - name: Upload summary
        if: always()
        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4
        with:
          # Per-scenario artifact name so the three matrix runs don't
          # collide on upload.
          name: k6-scale-${{ matrix.scenario }}-${{ github.run_id }}
          path: deploy/test/loadtest/results/
          retention-days: 90