mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 14:51:30 +00:00
1279172e9b
Phase 8 of the certctl architecture diligence remediation closes
SCALE-H2 by adding three new k6 scenarios that exercise the scale-
relevant load surfaces the API tier + connector tier left uncovered:
fleet-scale bulk renewal, ACME enrollment burst, and agent heartbeat
storm.
Audit miscount + path correction (live-grep at Phase 8 audit time)
==================================================================
- The Phase 8 prompt referenced both `deploy/test/load/` and
`deploy/test/loadtest/`. Repo truth: the existing harness lives at
`deploy/test/loadtest/`. New scenarios land there.
- The audit's prior framing "k6 covers the API tier at 50 req/s
only" omitted Bundle 10 (2026-05-02) which added four connector-
tier handshake scenarios (nginx/apache/haproxy/f5) at 100 conns/min
each, plus the Phase 5 ACME directory/nonce/ARI scenario at 100 VUs
in `k6/acme_flow.js`. Phase 8 appends to what's there rather than
rewriting.
What ships
==========
Three new k6 scenario files under deploy/test/loadtest/k6/:
  bulk_renewal.js: 10K-cert seed + 5 req/s POST /bulk-renew × 5min
                   p99 < 5s, p95 < 2s, errors < 1%
  acme_burst.js:   200 VU sustained × directory/nonce/ARI × 5min
                   directory p95 < 500ms, nonce p95 < 300ms,
                   renewal-info p95 < 800ms, 5xx-only < 0.1%
                   Pins RFC 7807 rate-limit response shape via
                   acme_rate_limit_shape_ok Counter.
  agent_storm.js:  5K-agent seed + 167 req/s POST /heartbeat × 5min
                   p99 < 1s, p95 < 500ms, errors < 0.1%
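As a rough illustration of how the rate-based targets above could be expressed as k6 options, the sketch below builds a `constant-arrival-rate` scenario plus the stated latency and error thresholds. The executor choice, the `buildRateOptions` helper, and the VU pool sizes are assumptions made for illustration; the shipped bulk_renewal.js and agent_storm.js may be structured differently.

```javascript
// Hypothetical helper (not part of the commit): builds a k6 options object
// for a fixed request rate with p95/p99 latency and error-rate thresholds.
function buildRateOptions({ name, rate, duration, p95Ms, p99Ms, maxErrorRate }) {
  return {
    scenarios: {
      [name]: {
        executor: 'constant-arrival-rate', // assumption: open-model executor
        rate,                              // iterations started per timeUnit
        timeUnit: '1s',
        duration,
        preAllocatedVUs: Math.max(10, rate), // assumption: illustrative pool size
        maxVUs: Math.max(50, rate * 4),      // assumption: illustrative ceiling
        tags: { scenario: name },
      },
    },
    thresholds: {
      http_req_duration: [`p(95)<${p95Ms}`, `p(99)<${p99Ms}`],
      http_req_failed: [`rate<${maxErrorRate}`],
    },
  };
}

// Targets quoted in the scenario list above.
const bulkRenewal = buildRateOptions({
  name: 'bulk_renewal', rate: 5, duration: '5m',
  p95Ms: 2000, p99Ms: 5000, maxErrorRate: 0.01,
});
const agentStorm = buildRateOptions({
  name: 'agent_storm', rate: 167, duration: '5m',
  p95Ms: 500, p99Ms: 1000, maxErrorRate: 0.001,
});

console.log(JSON.stringify(bulkRenewal.thresholds));
console.log(JSON.stringify(agentStorm.thresholds));
```

Inside a real k6 script the result would be exported, for example `export const options = buildRateOptions({ ... })`, since k6 reads the `options` export at init time.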
Two seed SQL fixtures under deploy/test/loadtest/seed/:
  01_bulk_renewal_certs.sql: 10,000 managed_certificates rows
    linked to seed_demo.sql FKs (iss-local, o-alice, t-platform,
    rp-standard). status='active', expires_at distributed across
    the next 30 days, name prefix `loadtest-bulk-` so the scenario
    can scope its criteria. Idempotent via
    ON CONFLICT (name) DO NOTHING.
  02_agent_fleet.sql: 5,000 agents rows with name prefix
    `loadtest-agent-`. status='Online', last_heartbeat_at
    staggered across the prior 60s, OS distribution 80%/10%/10%
    linux/windows/darwin. Idempotent via
    ON CONFLICT (id) DO NOTHING.
Plus seed/README.md documenting the opt-in profile + when these
run vs the default `make loadtest` fast path.
Compose + Makefile + CI wiring
==============================
deploy/test/loadtest/docker-compose.yml gains four new services,
all gated behind the `scale` compose profile so the default
`make loadtest` is unchanged:
  scale-seed:     one-shot postgres:16-alpine container that runs
                  every ./seed/*.sql in lexical order against the
                  same postgres the server uses. Depends on
                  postgres healthy + certctl-server healthy (so
                  migrations + seed_demo.sql have already run).
  k6-scale-bulk:  grafana/k6:0.54.0 driver running bulk_renewal.js
  k6-scale-acme:  grafana/k6:0.54.0 driver running acme_burst.js
  k6-scale-agent: grafana/k6:0.54.0 driver running agent_storm.js
Each driver depends_on scale-seed with condition
service_completed_successfully, so the scenarios never run against an
unseeded DB (the acme scenario doesn't need the seed itself but uses
the same dependency chain for ordering predictability).
Makefile gains four new phony targets:
  loadtest-scale-bulk   - runs bulk_renewal.js via compose --profile scale
  loadtest-scale-acme   - runs acme_burst.js
  loadtest-scale-agent  - runs agent_storm.js
  loadtest-scale        - all three serially
.github/workflows/loadtest.yml gains a new k6-scale matrix job that
runs after the existing k6 job (needs: k6) with a matrix on the
three scenarios — fail-fast: false so a regression in one scenario
doesn't cancel the others. Same workflow_dispatch + weekly cron
cadence as the existing API + connector tier job.
Documentation
=============
docs/operator/scale.md gains a new "Scale-tier scenarios (SCALE-H2,
Phase 8)" section between the cursor-pagination subsection and the
profiling-production subsection. Documents:
  - Scenario + seed + sustained load table
  - Threshold contract (regression guards, NOT measured baselines)
  - Measured-baseline table with TBD placeholders + the canonical-
    hardware capture procedure
  - How to run the scale tier locally
  - Four documented limitations (JWS-signed ACME, scheduler renewal
    scan throughput, production-sized Postgres, pull-only deployment
    model)
deploy/test/loadtest/README.md gains a short "Scale tier (Phase 8
SCALE-H2, 2026-05-14)" section pointing at scale.md as the canonical
operator-facing baseline source. Avoids duplication; the README
remains the harness-mechanics doc.
Deliberate deviations from the prompt
======================================
The Phase 8 prompt's "concrete deliverables" section referenced
`deploy/test/load/` (without the `test` suffix) for the new k6 files. The actual
harness lives at `deploy/test/loadtest/` — the new files land there
to match existing convention. The prompt's audit-questions section
also referenced `deploy/test/loadtest/` so the prompt was internally
inconsistent on this; repo truth wins.
The prompt described the ACME burst as "200 concurrent ACME orders
against /acme/profile/<id>/new-order ... pin the rate-limit response
shape." new-order is JWS-signed (RFC 8555 §6.2 requires a JWS body
on every POST; §7.4 defines the newOrder request itself). k6 doesn't
ship JWS and bundling a signer (e.g. lego) into the k6 container
would obscure the server-side latency the scenario is trying to
measure. Same trade-off the existing Phase 5 acme_flow.js made.
Phase 8's acme_burst.js measures the unauthenticated
directory + nonce + ARI surface at burst rate AND pins the 429
rate-limit response shape via a custom Counter that increments only
when the response is `application/problem+json` with the
`urn:ietf:params:acme:error:rateLimited` type. End-to-end JWS
conformance under load remains a follow-up; the canonical JWS
correctness gate is `make acme-rfc-conformance-test` (lego-based,
non-load).
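The shape check described above can be sketched as a pure predicate. This is a minimal illustration, assuming a response object with `status`, `headers`, and a string `body` (roughly what k6 exposes); the `isRateLimitShapeOK` name is hypothetical, and the scenario itself implements the check inside `recordRateLimit` with a Counter rather than a boolean.

```javascript
// Hypothetical pure helper mirroring the rate-limit shape check.
// `res` is assumed to look like { status, headers, body }.
function isRateLimitShapeOK(res) {
  if (res.status !== 429) return false;
  const ct = (res.headers && res.headers['Content-Type']) || '';
  if (!ct.includes('application/problem+json')) return false;
  let problem;
  try {
    problem = JSON.parse(res.body);
  } catch (e) {
    return false; // non-JSON body, e.g. a plain-text 429
  }
  return !!problem && problem.type === 'urn:ietf:params:acme:error:rateLimited';
}

// A well-formed RFC 7807 rate-limit problem document.
const good = {
  status: 429,
  headers: { 'Content-Type': 'application/problem+json' },
  body: JSON.stringify({ type: 'urn:ietf:params:acme:error:rateLimited', detail: 'slow down' }),
};
// A regression: plain-text 429.
const plain = { status: 429, headers: { 'Content-Type': 'text/plain' }, body: 'Too Many Requests' };

console.log(isRateLimitShapeOK(good));  // true
console.log(isRateLimitShapeOK(plain)); // false
```

Counting only the responses that pass this predicate, next to a count of all 429s, is what lets a summary reader spot a shape regression as a gap between the two numbers.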
Deferred (operator-side, not engineering)
==========================================
Canonical-hardware baseline capture. The TBD placeholders in
docs/operator/scale.md's measured-baseline table are intentional —
sandbox-captured numbers from a developer laptop are misleading
(same anti-pattern the original loadtest README guards against).
Operator triggers loadtest.yml from the Actions tab, waits for the
k6-scale matrix jobs to complete, downloads the per-scenario
summary artifacts, copies p50/p95/p99 into the table, commits the
captured numbers alongside the date + commit SHA.
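The copy step in that procedure could be scripted. The sketch below assumes the downloaded artifact is the JSON that a scenario's `handleSummary` writes, where each Trend metric carries a `values` object keyed by stat name (`med` standing in for p50 when `summaryTrendStats` includes it). The `extractPercentiles` helper and the sample numbers are hypothetical placeholders, not captured baselines.

```javascript
// Hypothetical helper: pulls p50/p95/p99 for one Trend metric out of a
// k6 summary object (the same object handleSummary receives).
function extractPercentiles(summary, metricName) {
  const metric = summary.metrics && summary.metrics[metricName];
  if (!metric || !metric.values) return null;
  const v = metric.values;
  return { p50: v.med, p95: v['p(95)'], p99: v['p(99)'] };
}

// Placeholder data for illustration only; real values come from the
// summary artifact of a canonical-hardware run.
const sampleSummary = {
  metrics: {
    acme_directory_duration: {
      type: 'trend',
      values: { avg: 15, min: 2, med: 12.5, 'p(95)': 40, 'p(99)': 90, max: 120 },
    },
  },
};

console.log(extractPercentiles(sampleSummary, 'acme_directory_duration'));
console.log(extractPercentiles(sampleSummary, 'missing_metric')); // null
```

With Node.js, the real artifact could be loaded via `JSON.parse(fs.readFileSync(path, 'utf8'))` and passed to the same helper.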
Files changed (11):
  .github/workflows/loadtest.yml                       (+72 -1)
  Makefile                                             (+47 -1)
  deploy/test/loadtest/README.md                       (+28 -1)
  deploy/test/loadtest/docker-compose.yml              (+108 -1)
  deploy/test/loadtest/k6/bulk_renewal.js              (new, 106 lines)
  deploy/test/loadtest/k6/acme_burst.js                (new, 192 lines)
  deploy/test/loadtest/k6/agent_storm.js               (new, 124 lines)
  deploy/test/loadtest/seed/01_bulk_renewal_certs.sql  (new, 95 lines)
  deploy/test/loadtest/seed/02_agent_fleet.sql         (new, 92 lines)
  deploy/test/loadtest/seed/README.md                  (new, 86 lines)
  docs/operator/scale.md                               (+109 -0)
Verification (sandbox-runnable):
  python3 -c 'import yaml; yaml.safe_load(open("deploy/test/loadtest/docker-compose.yml"))'
    → compose YAML OK
  python3 -c 'import yaml; yaml.safe_load(open(".github/workflows/loadtest.yml"))'
    → workflow YAML OK
  grep -E 'bulk_renewal|acme_burst|agent_storm' deploy/test/loadtest/k6/*.js
    → all three scenarios + tags present
  grep loadtest-scale Makefile
    → 4 new targets registered in .PHONY + 3 recipes + 1 aggregate
Runtime verification (deferred — requires docker on canonical hardware):
  make loadtest-scale-bulk   # 10K cert fixture + 5 req/s × 5min
  make loadtest-scale-acme   # 200 VU × 5min
  make loadtest-scale-agent  # 5K agent fixture + 167 req/s × 5min
  make loadtest-scale        # all three serially
Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-H2
184 lines
7.3 KiB
JavaScript
// Phase 8 SCALE-H2: ACME enrollment burst.
//
// What this measures:
//   200 concurrent VUs hammering the unauthenticated ACME directory
//   + new-nonce + ARI surface for 5 minutes. The goal is the
//   throughput ceiling for the entry-point handlers and the
//   per-account rate-limit response shape Phase 5 added (RFC 8555
//   §6.7 + RFC 7807 + the certctl-specific
//   ErrACMEConcurrentOrdersExceeded path).
//
// What this does NOT measure (and why):
//   - JWS-signed POST flows (new-account, new-order, finalize).
//     k6 doesn't ship JWS, and bundling a Go signing helper into
//     the k6 container would obscure the server-side latency the
//     scenario is trying to pin. The existing
//     `deploy/test/loadtest/k6/acme_flow.js` Phase 5 scenario
//     made the same explicit trade-off; this Phase 8 burst scenario
//     reuses the constraint. End-to-end JWS-signed conformance is
//     gated by `make acme-rfc-conformance-test` (which uses lego
//     against the same compose stack).
//   - The actual order/finalize hot path. The newOrder handler's
//     constant-time SCAN against acme_orders + the per-account
//     concurrent-orders gate ARE useful to load-test, but require
//     valid JWS to reach. The directory + new-nonce surface this
//     scenario hits is what every ACME client transits BEFORE the
//     signed flow; measuring it pins the server's headroom for
//     the rest of the flow.
//   - Issuer-side enrollment latency (DigiCert ACME, Let's Encrypt
//     against a real prod CA, etc.). Same "load-testing someone
//     else's API" carve-out as the API tier.
//
// What this DOES measure:
//   - GET /acme/profile/{id}/directory throughput. Sustained 200
//     concurrent VUs at a low per-VU sleep produces ~600-1000 req/s
//     against this endpoint, well above what any production ACME
//     client would generate but the right shape for finding the
//     ceiling.
//   - HEAD /acme/profile/{id}/new-nonce throughput. Nonce
//     allocation is a hot path that writes one row to acme_nonces.
//   - GET /acme/profile/{id}/renewal-info/{cert-id} 4xx fast path.
//     Synthetic cert-id → handler returns 4xx without a DB lookup
//     (cert-id is malformed at the parse layer). Measures the
//     handler-front overhead under load.
//   - 429 rate-limit response shape. The Phase 5 ACME per-account
//     rate limit fires at sustained spike rates; the scenario pins
//     that the 429 body is RFC 7807 with the
//     "urn:ietf:params:acme:error:rateLimited" type. A regression
//     that returned a plain text 429 or a different problem type
//     would break ACME clients hard.
//
// Threshold contract:
//   - directory p95 < 500ms, new-nonce p95 < 300ms, renewal-info
//     p95 < 800ms, the same as the Phase 5 acme_flow.js baselines.
//   - 429 responses are EXPECTED at sustained 200 VU rate (the
//     server's RFC-compliant rate limiter SHOULD kick in). 429s
//     are kept out of the failure threshold; a separate
//     `acme_rate_limited_count` Counter tracks them so the
//     operator can see how often the limiter fires.

import http from 'k6/http';
import { check } from 'k6';
import { Counter, Trend } from 'k6/metrics';
import { textSummary } from 'https://jslib.k6.io/k6-summary/0.0.2/index.js';

const ACME_BASE = __ENV.CERTCTL_ACME_DIRECTORY ||
  'https://certctl-server:8443/acme/profile/prof-test/directory';

// Custom metrics.
const directoryDuration = new Trend('acme_directory_duration', true);
const newNonceDuration = new Trend('acme_new_nonce_duration', true);
const renewalInfoDuration = new Trend('acme_renewal_info_duration', true);
const rateLimitedCount = new Counter('acme_rate_limited_count');
const rateLimitShapeOK = new Counter('acme_rate_limit_shape_ok');

// k6 counts 4xx responses as failed by default, and no request in this
// file carries a `server_error` tag, so a threshold filtered on that tag
// would never see a sample. Marking 200-499 as expected makes
// http_req_failed count only 5xx and transport errors.
http.setResponseCallback(http.expectedStatuses({ min: 200, max: 499 }));

export const options = {
  scenarios: {
    acme_burst: {
      executor: 'constant-vus',
      vus: parseInt(__ENV.K6_ACME_VUS || '200', 10),
      duration: __ENV.K6_ACME_DURATION || '5m',
      gracefulStop: '30s',
      tags: { scenario: 'acme_burst' },
    },
  },
  thresholds: {
    'acme_directory_duration': ['p(95)<500'],
    'acme_new_nonce_duration': ['p(95)<300'],
    'acme_renewal_info_duration': ['p(95)<800'],
    // 4xx (rate-limited or malformed-cert-id) is expected; 5xx is
    // not. With the response callback above, this rate is the 5xx
    // (plus transport error) failure floor.
    'http_req_failed{scenario:acme_burst}': ['rate<0.001'],
  },
  insecureSkipTLSVerify: true,
  summaryTrendStats: ['avg', 'min', 'med', 'p(95)', 'p(99)', 'max'],
};

export default function () {
  // Step 1: directory.
  let res = http.get(ACME_BASE, {
    tags: { scenario: 'acme_burst', step: 'directory' },
  });
  directoryDuration.add(res.timings.duration);

  // Handle the expected 429 before the status check so rate-limited
  // iterations are not reported as failed checks (same order as the
  // new-nonce and renewal-info steps).
  if (res.status === 429) {
    recordRateLimit(res);
    return; // back off for this VU iteration
  }
  check(res, { 'directory 200': (r) => r.status === 200 });
  if (res.status !== 200) return;

  const dir = res.json();

  // Step 2: new-nonce.
  if (dir.newNonce) {
    res = http.head(dir.newNonce, {
      tags: { scenario: 'acme_burst', step: 'new_nonce' },
    });
    newNonceDuration.add(res.timings.duration);
    if (res.status === 429) {
      recordRateLimit(res);
      return;
    }
    check(res, {
      'new-nonce 200': (r) => r.status === 200,
      'replay-nonce header present': (r) => !!r.headers['Replay-Nonce'],
    });
  }

  // Step 3: ARI synthetic 4xx fast path. Phase 4 added ARI
  // (RFC 9773); this exercises the malformed-cert-id branch which
  // returns a 4xx without a DB lookup. Pinning this here means a
  // regression that turned the malformed path into a DB query
  // would surface as a p95 spike.
  if (dir.renewalInfo) {
    res = http.get(dir.renewalInfo + '/aaaa.bbbb', {
      tags: { scenario: 'acme_burst', step: 'renewal_info' },
    });
    renewalInfoDuration.add(res.timings.duration);
    if (res.status === 429) {
      recordRateLimit(res);
      return;
    }
    check(res, {
      'renewal-info 4xx for synthetic cert-id':
        (r) => r.status === 400 || r.status === 404,
    });
  }
}

// recordRateLimit pins the Phase 5 ACME rate-limit response shape:
//   - HTTP 429
//   - Content-Type: application/problem+json
//   - Body: {"type":"urn:ietf:params:acme:error:rateLimited", ...}
// A regression that returned 503 or a plain-text 429 or a different
// problem type would NOT increment acme_rate_limit_shape_ok and the
// operator would see
// (acme_rate_limited_count - acme_rate_limit_shape_ok) > 0 in the
// summary.
function recordRateLimit(res) {
  rateLimitedCount.add(1);
  const ct = res.headers['Content-Type'] || '';
  if (!ct.includes('application/problem+json')) {
    return;
  }
  let body;
  try {
    body = res.json();
  } catch (e) {
    return; // body is not valid JSON
  }
  if (body && typeof body.type === 'string' &&
      body.type.startsWith('urn:ietf:params:acme:error:rateLimited')) {
    rateLimitShapeOK.add(1);
  }
}

export function handleSummary(data) {
  return {
    '/results/summary-acme-burst.json': JSON.stringify(data, null, 2),
    '/results/summary-acme-burst.txt': textSummary(data, { indent: ' ', enableColors: false }),
    stdout: textSummary(data, { indent: ' ', enableColors: true }),
  };
}