mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 12:41:30 +00:00
1279172e9b
Phase 8 of the certctl architecture diligence remediation closes
SCALE-H2 by adding three new k6 scenarios that exercise the scale-
relevant load surfaces the API tier + connector tier left uncovered:
fleet-scale bulk renewal, ACME enrollment burst, and agent heartbeat
storm.
Audit miscount + path correction (live-grep at Phase 8 audit time)
==================================================================
- The Phase 8 prompt referenced both `deploy/test/load/` and
`deploy/test/loadtest/`. Repo truth: the existing harness lives at
`deploy/test/loadtest/`. New scenarios land there.
- The audit's prior framing "k6 covers the API tier at 50 req/s
only" omitted Bundle 10 (2026-05-02) which added four connector-
tier handshake scenarios (nginx/apache/haproxy/f5) at 100 conns/min
each, plus the Phase 5 ACME directory/nonce/ARI scenario at 100 VUs
in `k6/acme_flow.js`. Phase 8 appends to what's there rather than
rewriting.
What ships
==========
Three new k6 scenario files under deploy/test/loadtest/k6/:
  bulk_renewal.js - 10K-cert seed + 5 req/s POST /bulk-renew × 5min
                    p99 < 5s, p95 < 2s, errors < 1%
  acme_burst.js   - 200 VU sustained × directory/nonce/ARI × 5min
                    directory p95 < 500ms, nonce p95 < 300ms,
                    renewal-info p95 < 800ms, 5xx-only < 0.1%
                    Pins RFC 7807 rate-limit response shape via
                    acme_rate_limit_shape_ok Counter.
  agent_storm.js  - 5K-agent seed + 167 req/s POST /heartbeat × 5min
                    p99 < 1s, p95 < 500ms, errors < 0.1%
Two seed SQL fixtures under deploy/test/loadtest/seed/:
  01_bulk_renewal_certs.sql - 10,000 managed_certificates rows
    linked to seed_demo.sql FKs (iss-local, o-alice, t-platform,
    rp-standard). status='active', expires_at distributed across
    next 30 days, name prefix `loadtest-bulk-` so the scenario
    can scope its criteria. Idempotent via
    ON CONFLICT (name) DO NOTHING.
  02_agent_fleet.sql - 5,000 agents rows with name prefix
    `loadtest-agent-`. status='Online', last_heartbeat_at
    staggered across prior 60s, OS distribution 80%/10%/10%
    linux/windows/darwin. Idempotent via
    ON CONFLICT (id) DO NOTHING.
Plus seed/README.md documenting the opt-in profile + when these
run vs the default `make loadtest` fast path.
Compose + Makefile + CI wiring
==============================
deploy/test/loadtest/docker-compose.yml gains four new services,
all gated behind the `scale` compose profile so the default
`make loadtest` is unchanged:
  scale-seed     - one-shot postgres:16-alpine container that runs
                   every ./seed/*.sql in lexical order against the
                   same postgres the server uses. Depends on
                   postgres healthy + certctl-server healthy (so
                   migrations + seed_demo.sql have already run).
  k6-scale-bulk  - grafana/k6:0.54.0 driver running bulk_renewal.js
  k6-scale-acme  - grafana/k6:0.54.0 driver running acme_burst.js
  k6-scale-agent - grafana/k6:0.54.0 driver running agent_storm.js
Each driver depends_on scale-seed completed_successfully so the
scenarios never run against an unseeded DB (the acme scenario
doesn't need the seed itself but uses the same dependency chain for
ordering predictability).
Makefile gains four new phony targets:
loadtest-scale-bulk - runs bulk_renewal.js via compose --profile scale
loadtest-scale-acme - runs acme_burst.js
loadtest-scale-agent - runs agent_storm.js
loadtest-scale - all three serially
.github/workflows/loadtest.yml gains a new k6-scale matrix job that
runs after the existing k6 job (needs: k6) with a matrix on the
three scenarios — fail-fast: false so a regression in one scenario
doesn't cancel the others. Same workflow_dispatch + weekly cron
cadence as the existing API + connector tier job.
Documentation
=============
docs/operator/scale.md gains a new "Scale-tier scenarios (SCALE-H2,
Phase 8)" section between the cursor-pagination subsection and the
profiling-production subsection. Documents:
- Scenario + seed + sustained load table
- Threshold contract (regression guards, NOT measured baselines)
- Measured-baseline table with TBD placeholders + the canonical-
hardware capture procedure
- How to run the scale tier locally
- Four documented limitations (JWS-signed ACME, scheduler renewal
scan throughput, production-sized Postgres, pull-only deployment
model)
deploy/test/loadtest/README.md gains a short "Scale tier (Phase 8
SCALE-H2, 2026-05-14)" section pointing at scale.md as the canonical
operator-facing baseline source. Avoids duplication; the README
remains the harness-mechanics doc.
Deliberate deviations from the prompt
======================================
The Phase 8 prompt's "concrete deliverables" section referenced
`deploy/test/load/` (no `test` suffix) for the new k6 files. The actual
harness lives at `deploy/test/loadtest/`, so the new files land there
to match existing convention. The prompt's audit-questions section
also referenced `deploy/test/loadtest/` so the prompt was internally
inconsistent on this; repo truth wins.
The prompt described the ACME burst as "200 concurrent ACME orders
against /acme/profile/<id>/new-order ... pin the rate-limit response
shape." new-order is JWS-signed (RFC 8555 §7.4 requires JWS for
every POST except newAccount-pre-account-key flows). k6 doesn't
ship JWS and bundling a signer (e.g. lego) into the k6 container
would obscure the server-side latency the scenario is trying to
measure. Same trade-off the existing Phase 5 acme_flow.js made.
Phase 8's acme_burst.js measures the unauthenticated
directory + nonce + ARI surface at burst rate AND pins the 429
rate-limit response shape via a custom Counter that increments only
when the response is `application/problem+json` with the
`urn:ietf:params:acme:error:rateLimited` type. End-to-end JWS
conformance under load remains a follow-up; the canonical JWS
correctness gate is `make acme-rfc-conformance-test` (lego-based,
non-load).
Deferred (operator-side, not engineering)
==========================================
Canonical-hardware baseline capture. The TBD placeholders in
docs/operator/scale.md's measured-baseline table are intentional —
sandbox-captured numbers from a developer laptop are misleading
(same anti-pattern the original loadtest README guards against).
Operator triggers loadtest.yml from the Actions tab, waits for the
k6-scale matrix jobs to complete, downloads the per-scenario
summary artifacts, copies p50/p95/p99 into the table, commits the
captured numbers alongside the date + commit SHA.
Files changed (11):
  .github/workflows/loadtest.yml (+72 -1)
  Makefile (+47 -1)
  deploy/test/loadtest/README.md (+28 -1)
  deploy/test/loadtest/docker-compose.yml (+108 -1)
  deploy/test/loadtest/k6/bulk_renewal.js (new, 106 lines)
  deploy/test/loadtest/k6/acme_burst.js (new, 192 lines)
  deploy/test/loadtest/k6/agent_storm.js (new, 124 lines)
  deploy/test/loadtest/seed/01_bulk_renewal_certs.sql (new, 95 lines)
  deploy/test/loadtest/seed/02_agent_fleet.sql (new, 92 lines)
  deploy/test/loadtest/seed/README.md (new, 86 lines)
  docs/operator/scale.md (+109 -0)
Verification (sandbox-runnable):
python3 -c 'import yaml; yaml.safe_load(open("deploy/test/loadtest/docker-compose.yml"))'
→ compose YAML OK
python3 -c 'import yaml; yaml.safe_load(open(".github/workflows/loadtest.yml"))'
→ workflow YAML OK
grep -E 'bulk_renewal|acme_burst|agent_storm' deploy/test/loadtest/k6/*.js
→ all three scenarios + tags present
grep loadtest-scale Makefile
→ 4 new targets registered in .PHONY + 3 recipes + 1 aggregate
Runtime verification (deferred — requires docker on canonical hardware):
make loadtest-scale-bulk # 10K cert fixture + 5 req/s × 5min
make loadtest-scale-acme # 200 VU × 5min
make loadtest-scale-agent # 5K agent fixture + 167 req/s × 5min
make loadtest-scale # all three serially
Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-H2
387 lines
17 KiB
Markdown
# certctl Load-Test Harness

Closes the **#8 acquisition-readiness blocker** from the 2026-05-01 issuer
coverage audit.
Pre-fix, certctl had zero benchmarks or load tests for any API path; an
acquirer evaluating "can certctl handle our 50k-cert fleet at 47-day
rotation" had nothing to point at. This harness is the substantiation.

## What it measures

A k6 driver hits two scenarios in parallel for 5 minutes at a fixed 50 req/s each:

1. **`POST /api/v1/certificates`**: the issuance-acceptance hot path.
   Exercises auth, JSON decode, validation, `service.CreateCertificate`,
   and the `managed_certificates` insert. This is the operator-facing
   request-acceptance throughput an automation client (Terraform,
   Crossplane, GitOps controller) would generate.
2. **`GET /api/v1/certificates?per_page=50`**: the most-trafficked read
   endpoint. Exercises pagination + filtering on the cert list query.

Latency is reported as `avg / min / med / p95 / p99 / max`. The error
rate must stay below 1% (any 4xx/5xx counts as failed).

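As a rough illustration of the load shape described above, the two scenarios could be expressed with k6's `constant-arrival-rate` executor. This is a sketch, not the contents of `k6.js`: the executor choice, the `exec` function names, and the VU pool sizes are assumptions. Only the 50 req/s rate, the 5-minute duration, and the 5-second stagger come from this README.

```javascript
// Hypothetical k6 options mirroring the load shape in this README.
// Executor choice, exec names, and VU pool sizes are assumptions.
function arrivalRateScenario(exec, startTime) {
  return {
    executor: "constant-arrival-rate", // fixed start rate, independent of latency
    rate: 50,            // iterations started per timeUnit
    timeUnit: "1s",      // 50 req/s when each iteration sends one request
    duration: "5m",
    preAllocatedVUs: 50, // assumed pool size
    maxVUs: 200,         // assumed ceiling if responses slow down
    startTime,           // stagger between the two scenarios
    exec,                // name of the exported scenario function
  };
}

const options = {
  scenarios: {
    issuance_acceptance: arrivalRateScenario("issuanceAcceptance", "0s"),
    list_certificates: arrivalRateScenario("listCertificates", "5s"),
  },
};

// Combined target rate across both scenarios.
const combinedRate = Object.values(options.scenarios)
  .reduce((sum, scenario) => sum + scenario.rate, 0);
console.log(combinedRate); // 100
```

In an actual k6 script the object would be exported as `export const options`. An arrival-rate executor keeps the request rate fixed even when latency grows, which is what makes the p95/p99 numbers comparable between runs.
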
## What it explicitly does NOT measure

- **Issuer connector latency.** Connector calls (DigiCert, ACME, Vault,
  AWS ACM PCA, etc.) happen asynchronously via the renewal scheduler.
  Their latency is pinned by the `certctl_issuance_duration_seconds{issuer_type=...}`
  Prometheus histogram (audit fix #4). Driving them through k6 would
  load-test someone else's API, which is wrong.
- **Full ACME enrollment flow.** The audit prompt mentioned ACME-via-pebble;
  sustained 100/s through a multi-RTT order/challenge/finalize
  flow requires pebble tuning + crypto helpers k6 doesn't ship out of
  the box. Deferred to a follow-up.
- **Bulk-revoke / bulk-renew.** Those are admin endpoints with their
  own throughput characteristics and warrant a separate scenario.
- **Scheduler concurrency under bulk renewal.** That's audit fix #9's
  scope; the harness here measures the API tier, not the scheduler.

## Threshold contract

Any future change that breaches one of these fails the test:

| Scenario | p95 | p99 | Error rate |
|---|---|---|---|
| `issuance_acceptance` | < 2 s | < 5 s | n/a |
| `list_certificates` | < 800 ms | < 2 s | n/a |
| All requests | n/a | n/a | < 1% |

These are the regression guards, not the SLO. The SLO is whatever the
operator chooses based on the baseline below.

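The table above maps naturally onto k6 threshold expressions. The sketch below is an assumption about how such a block could look, not a copy of `k6.js`; the small `breached` helper and the sample measurements are made up purely to show how a single breach surfaces (k6 performs this evaluation itself).

```javascript
// Hypothetical k6 thresholds matching the table above (durations in ms).
const thresholds = {
  "http_req_duration{scenario:issuance_acceptance}": ["p(95)<2000", "p(99)<5000"],
  "http_req_duration{scenario:list_certificates}": ["p(95)<800", "p(99)<2000"],
  http_req_failed: ["rate<0.01"],
};

// Evaluate one "stat<limit" expression against measured values.
function passes(expression, measured) {
  const [stat, limit] = expression.split("<");
  return measured[stat] < Number(limit);
}

// Return every threshold expression the measurements violate.
function breached(thresholdMap, measuredByMetric) {
  const failures = [];
  for (const [metric, expressions] of Object.entries(thresholdMap)) {
    for (const expression of expressions) {
      if (!passes(expression, measuredByMetric[metric])) {
        failures.push(`${metric}: ${expression}`);
      }
    }
  }
  return failures;
}

// Made-up measurements: only the issuance p99 is over its limit.
const sample = {
  "http_req_duration{scenario:issuance_acceptance}": { "p(95)": 1200, "p(99)": 5400 },
  "http_req_duration{scenario:list_certificates}": { "p(95)": 300, "p(99)": 900 },
  http_req_failed: { rate: 0.002 },
};
const failures = breached(thresholds, sample);
console.log(failures);
```
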
## How to run

From the repo root:

```sh
make loadtest
```

This:

1. Builds the certctl image from the repo root `Dockerfile`.
2. Spins up postgres, the tls-init bootstrap, certctl-server (with
   `CERTCTL_DEMO_SEED=true` so the FK rows the script needs exist),
   and the k6 driver.
3. Runs the k6 script for ~5 minutes 5 seconds (5s stagger between
   scenarios + 5m duration).
4. Prints the summary text to stdout.
5. Exits non-zero if any threshold was breached.

The full machine-readable summary lands at
`deploy/test/loadtest/results/summary.json` (gitignored). The
human-readable summary lands at `results/summary.txt`.

To run against a server already booted on the host (skip the compose
spin-up):

```sh
docker run --rm \
  -e CERTCTL_BASE=https://localhost:8443 \
  -e CERTCTL_TOKEN=load-test-token \
  -e K6_INSECURE_SKIP_TLS_VERIFY=true \
  -v "$(pwd)/deploy/test/loadtest/k6.js:/scripts/k6.js:ro" \
  -v "$(pwd)/deploy/test/loadtest/results:/results" \
  --network host \
  grafana/k6:0.54.0 run /scripts/k6.js
```

## Current baseline

The first operator run captures real numbers and commits them into
this section. Pre-baseline this section reads "TBD: operator captures
on first `make loadtest` run." The threshold rows below are the agreed
minimum-acceptable thresholds, not the captured baseline; once captured,
the baseline goes here as a separate row so future regressions have a
diff target.

| Scenario | p50 | p95 | p99 | Error rate |
|---|---|---|---|---|
| **issuance_acceptance** (threshold) | n/a | < 2 s | < 5 s | < 1% |
| **issuance_acceptance** (baseline)[^1] | 2.12 ms | 6.19 ms | 8.58 ms | 0.00% |
| **list_certificates** (threshold) | n/a | < 800 ms | < 2 s | < 1% |
| **list_certificates** (baseline)[^1] | 2.12 ms | 6.19 ms | 8.58 ms | 0.00% |

[^1]: **Sandbox-aggregate placeholder**: captured at HEAD on a Linux/aarch64
    unprivileged sandbox (no Docker, no GitHub-hosted runner). Both rows show
    the same aggregate combined-load numbers because the sandbox run did not
    break out per-scenario tags in `summary.json`. Treat these as a sanity
    floor (proof the API tier handles 100 req/s combined with zero errors and
    sub-10ms p99), **not** as the per-scenario baselines the threshold contract
    is written against. Replace via `gh workflow run loadtest.yml` on the
    canonical `ubuntu-latest` runner, which produces per-scenario tagged
    metrics in `summary.json`.

**Methodology of the sandbox-placeholder capture above:**

- Hardware: Linux/aarch64 unprivileged sandbox (uid 1019, no root,
  ~1.2 GiB free disk). NOT canonical hardware.
- Postgres: 14.22 (Ubuntu, native binaries, unix-socket dir `/tmp/pg-sock`),
  unix sockets only, port 55432.
- certctl: built from HEAD via `go build -o bin/certctl-server ./cmd/server`.
- Concurrency: 50 req/s sustained per scenario, both scenarios in parallel
  (= 100 req/s combined).
- Duration: **10 seconds** per scenario (NOT 5 minutes; the sandbox bash-call
  budget is bounded, and the canonical-hardware run uses 5 minutes).
- TLS: ECDSA-P256 self-signed `localhost` cert at `/tmp/certctl-tls/`.
- Auth: api-key, single Bearer token (`CERTCTL_AUTH_SECRET=load-test-token`).
- Rate limiting: **disabled** (`CERTCTL_RATE_LIMIT_ENABLED=false`). Without
  this, the 100 req/s combined load trips the default token-bucket and
  drives error rate to ~40%, masking real latency.
- Encryption: `CERTCTL_CONFIG_ENCRYPTION_KEY` set (32+ bytes).
- Captured: 2026-05-02. Total: 1002 requests, 100.15 req/s sustained,
  0 failures, 100% checks passed. Raw `summary.json` is not committed
  (gitignored per the existing `results/` convention).

**Methodology pinned at canonical baseline capture (replace placeholder):**

- Hardware: GitHub-hosted `ubuntu-latest` runner (4 vCPU / 16 GiB / SSD).
  Run via `gh workflow run loadtest.yml`; raw `summary.json` is available
  for 90 days as a workflow artifact.
- Postgres: 16-alpine in compose, default config.
- certctl: image built from this repo at the commit referenced below.
- Concurrency: 50 req/s sustained per scenario (100 req/s total).
- Duration: 5 minutes per scenario, 5s stagger.
- Auth: api-key (Bearer token, single key).
- Encryption: `CERTCTL_CONFIG_ENCRYPTION_KEY` set (32+ bytes).

To recapture the baseline after a tuning commit:

```sh
make loadtest
# Inspect deploy/test/loadtest/results/summary.txt for the new numbers.
# Update the table above + the methodology line, commit alongside the
# tuning commit.
```

## Interpreting a regression

If a future PR's `make loadtest` run pushes p99 above the threshold,
the make target exits non-zero and CI fails. The summary.txt prints
which threshold breached. Triage:

1. Look at the per-scenario `http_req_duration` p95 + p99 in
   `summary.json`. If only one scenario regressed, the change is
   localized to that endpoint's hot path.
2. Look at the `iteration_duration` per scenario. If total iteration
   time grew but `http_req_duration` is flat, the latency is in k6
   client setup (rare; suggests something changed in the script).
3. Compare against the committed baseline. If p99 was 800 ms at
   baseline and is now 1.5 s but still under the 5 s threshold, the
   change is below the regression guard but still meaningful; flag it
   in the PR description.

The harness deliberately does NOT auto-tune. Tuning is informed by the
data; tuning commits land separately, each with their own captured
baseline update.

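The third triage step distinguishes a hard breach from drift that stays under the guard. A minimal sketch of that distinction follows; the `driftFactor` knob and the function name are hypothetical and not part of the harness, which only enforces the hard threshold.

```javascript
// Classify a measured p99 against the hard threshold and the committed baseline.
// driftFactor is a hypothetical knob; the harness itself defines no such value.
function classifyP99({ measuredMs, baselineMs, thresholdMs, driftFactor = 1.5 }) {
  if (measuredMs >= thresholdMs) return "breach"; // k6 fails the run
  if (measuredMs >= baselineMs * driftFactor) return "drift"; // passes, flag in the PR
  return "ok";
}

// The example from step 3: 800 ms baseline, 1.5 s now, 5 s threshold.
console.log(classifyP99({ measuredMs: 1500, baselineMs: 800, thresholdMs: 5000 })); // "drift"
console.log(classifyP99({ measuredMs: 5200, baselineMs: 800, thresholdMs: 5000 })); // "breach"
console.log(classifyP99({ measuredMs: 850, baselineMs: 800, thresholdMs: 5000 }));  // "ok"
```
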
## CI cadence

Defined in `.github/workflows/loadtest.yml`:

- **`workflow_dispatch`**: manual trigger from the Actions tab. Used
  before tagging a release or after a meaningful tuning commit.
- **Weekly cron**: Mondays at 06:00 UTC. Catches gradual regressions
  from cumulative changes that no single PR triggered.

The workflow does **not** run per-push. Load tests are minutes long
and would not provide useful per-PR signal; per-push pressure goes
through `make verify` (which is fast) and the deploy-vendor-e2e job.

## Connector-tier baseline (Bundle 10 of the 2026-05-02 deployment-target audit)

Bundle 10 extended the harness to cover per-target-type handshake throughput
in addition to the API-tier issuance/list throughput documented above. The
docker-compose stack now boots four target sidecars (nginx, apache, haproxy,
f5-mock) each serving a starter cert from a shared `target-tls-init`
container, and k6 runs four additional scenarios (`nginx_handshake`,
`apache_handshake`, `haproxy_handshake`, `f5_handshake`) at sustained
100 conns/min for 5 minutes against each.

### What the connector tier measures

End-to-end TCP connect + TLS handshake + tiny HTTP request/response latency
per target type, tagged via the k6 `target_type` label so summary.json's
`connector_tier` section breaks the numbers out per sidecar:

```json
{
  "connector_tier": {
    "nginx": { "p50": ..., "p95": ..., "p99": ..., "error_rate": ..., "iterations": ... },
    "apache": { ... },
    "haproxy": { ... },
    "f5": { ... }
  }
}
```

This validates the target sidecar daemons are operational under sustained
connection load. Procurement asks "can certctl's nginx target handle 5,000
endpoints at 47-day rotation?" The connector code's correctness is pinned
by per-connector unit tests; **the underlying daemon's connection-rate
ceiling is what these scenarios pin**.

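One way the `connector_tier` object could be assembled is from the per-tag sub-metrics k6 passes to a `handleSummary` callback. The sketch below is an assumption about that wiring rather than the harness's actual code: the sub-metric names, the presence of a `p(99)` stat, and the sample numbers are all hypothetical.

```javascript
const TARGET_TYPES = ["nginx", "apache", "haproxy", "f5"];

// Build the connector_tier section from a k6-style metrics map.
// Metric names such as "http_req_duration{target_type:nginx}" are assumed.
function buildConnectorTier(metrics) {
  const tier = {};
  for (const target of TARGET_TYPES) {
    const duration = metrics[`http_req_duration{target_type:${target}}`];
    const failed = metrics[`http_req_failed{target_type:${target}}`];
    const iterations = metrics[`iterations{target_type:${target}}`];
    if (!duration) continue; // sub-metric absent: scenario untagged or not run
    tier[target] = {
      p50: duration.values.med,
      p95: duration.values["p(95)"],
      p99: duration.values["p(99)"],
      error_rate: failed ? failed.values.rate : null,
      iterations: iterations ? iterations.values.count : null,
    };
  }
  return { connector_tier: tier };
}

// Made-up input covering a single target type.
const summary = buildConnectorTier({
  "http_req_duration{target_type:nginx}": { values: { med: 12, "p(95)": 40, "p(99)": 75 } },
  "http_req_failed{target_type:nginx}": { values: { rate: 0 } },
  "iterations{target_type:nginx}": { values: { count: 500 } },
});
console.log(JSON.stringify(summary, null, 2));
```

Target types with no matching sub-metric are left out of the result rather than reported as zeros, so a missing sidecar is visible in the summary.
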
### What the connector tier explicitly does NOT measure (v1)

- **The full agent-driven deploy hot path.** v1 measures handshake
  throughput against the sidecars directly. v2 of the harness is a
  follow-up that POSTs cert requests bound to per-target-type targets,
  polls the deployments endpoint until the agent reports complete, and
  measures the full POST → poll → cert-served loop. v2 needs the agent
  registration + target-binding API surface plumbed end-to-end in the
  loadtest stack. That is meaningful work, but not a blocker for the
  connection-rate procurement question.
- **Kubernetes connector.** kind-in-docker requires `privileged: true`
  and is operationally fragile in CI. Deferred until Bundle 2 (real
  `k8s.io/client-go`) lands and a CI-friendly envtest harness is wired.
- **Real F5 BIG-IP.** The harness uses the in-tree `f5-mock-icontrol`
  Go server (already used by the deploy-vendor-e2e CI job). Real F5
  appliance benchmarking is out of scope; operators with a real F5
  vagrant box per `docs/connector-f5.md` can substitute it manually.

### Threshold contract

Defined in `k6.js`'s `thresholds` block. Any change pushing past these
fails the test:

| Target type | p95 | p99 | Error rate |
|---|---|---|---|
| `nginx` | < 1 s | < 3 s | < 1% (global) |
| `apache` | < 1 s | < 3 s | < 1% (global) |
| `haproxy` | < 1 s | < 3 s | < 1% (global) |
| `f5` | < 1.5 s | < 5 s | < 1% (global) |

f5-mock's threshold is looser because the iControl REST handler does
slightly more work per request. The login+upload+install dance the F5
connector itself drives is not exercised here, but the daemon's request
handler is heavier.

### Connector-tier captured baseline

| Target type | p50 | p95 | p99 | Error rate | Iterations |
|---|---|---|---|---|---|
| **nginx** (threshold) | n/a | < 1 s | < 3 s | < 1% | n/a |
| **nginx** (baseline) | TBD | TBD | TBD | TBD | TBD |
| **apache** (threshold) | n/a | < 1 s | < 3 s | < 1% | n/a |
| **apache** (baseline) | TBD | TBD | TBD | TBD | TBD |
| **haproxy** (threshold) | n/a | < 1 s | < 3 s | < 1% | n/a |
| **haproxy** (baseline) | TBD | TBD | TBD | TBD | TBD |
| **f5** (threshold) | n/a | < 1.5 s | < 5 s | < 1% | n/a |
| **f5** (baseline) | TBD | TBD | TBD | TBD | TBD |

The TBD placeholders are deliberate: do **not** commit numeric values
without running the loadtest on canonical hardware first. Numbers from a
developer laptop are misleading. The first `gh workflow run loadtest.yml`
on a clean GitHub runner captures the baseline; commit the captured numbers
into the table above as a follow-up commit alongside the methodology line.

**Methodology pinned at baseline capture (canonical hardware):**

- Hardware: GitHub-hosted `ubuntu-latest` runners (currently 4 vCPU /
  16 GiB / SSD-backed). Operator captures from `gh workflow run loadtest.yml`
  to keep the hardware constant across runs.
- Sidecar images: nginx:1.27-alpine, httpd:2.4-alpine, haproxy:2.9-alpine,
  in-tree f5-mock-icontrol (built from `deploy/test/f5-mock-icontrol/`).
- Concurrency: 100 conns/min sustained per target type (400 conns/min
  total across the four target scenarios + 100 req/s on the API tier).
- Duration: 5 minutes per scenario, 10s stagger between API tier and
  connector tier so warmup overlap doesn't skew the first 30 seconds.
- TLS: starter cert from `target-tls-init` (ECDSA P-256, multi-SAN). The
  loadtest scenarios connect with `K6_INSECURE_SKIP_TLS_VERIFY=true`.

To recapture the connector-tier baseline after a tuning commit affecting
target sidecars or the connector code:

```sh
make loadtest
# Inspect deploy/test/loadtest/results/summary.json for the
# connector_tier object and update the table above.
```

## Files in this directory

```
deploy/test/loadtest/
├── README.md             (this file)
├── docker-compose.yml
├── k6.js                 (the load script)
├── certs/                (gitignored: tls-init writes here)
├── fixtures/             (Bundle 10: target sidecar configs + shared starter cert)
│   ├── nginx.conf
│   ├── httpd.conf
│   ├── haproxy.cfg
│   └── target-certs/     (gitignored: target-tls-init writes here)
└── results/              (gitignored: k6 writes summary.{json,txt} here)
```

## ACME flows (Phase 5)

The `deploy/test/loadtest/k6/acme_flow.js` scenario hammers the
unauthenticated ACME surface (directory + new-nonce + ARI synthetic
lookups) at constant 100 VUs for 5 minutes. JWS-signed paths
(new-account / new-order / finalize) are intentionally out of scope:
k6 doesn't ship JWS, and bundling lego inside k6 would obscure the
underlying-server p95 we're trying to measure. Instead, the
`make acme-rfc-conformance-test` target drives lego against the same
stack for the full happy-path conformance gate.

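To make the three unauthenticated probes concrete, the sketch below derives their URLs from an ACME directory document. `newNonce` is a standard RFC 8555 directory field and `renewalInfo` is the ARI directory field; the endpoint paths in the sample directory and the synthetic certificate id are made-up placeholders, not values taken from certctl.

```javascript
// Derive the unauthenticated probe URLs from an ACME directory document.
// An ARI lookup is a GET on the renewalInfo base URL plus "/<cert id>".
function unauthProbeUrls(directoryUrl, directory, syntheticCertId) {
  const ariBase = directory.renewalInfo.replace(/\/+$/, ""); // drop trailing slashes
  return {
    directory: directoryUrl,
    newNonce: directory.newNonce,
    renewalInfo: `${ariBase}/${syntheticCertId}`,
  };
}

// Hypothetical directory contents; only the directory URL appears in this README.
const directoryUrl = "https://localhost:8443/acme/profile/prof-test/directory";
const sampleDirectory = {
  newNonce: "https://localhost:8443/acme/profile/prof-test/new-nonce",
  renewalInfo: "https://localhost:8443/acme/profile/prof-test/renewal-info/",
};

const probes = unauthProbeUrls(directoryUrl, sampleDirectory, "AAAA.BBBB");
console.log(probes.renewalInfo);
```

Reading the endpoints from the directory, instead of hard-coding paths, keeps a load script valid if the server relocates them.
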
Run it:

```sh
cd deploy/test/loadtest
docker compose up -d certctl postgres
k6 run --env CERTCTL_ACME_DIRECTORY=https://localhost:8443/acme/profile/prof-test/directory \
  k6/acme_flow.js
```

### Baseline (ACME flows, 100 VUs × 5m)

The baseline is operator-captured on a workstation-class machine with
a single certctl-server container + a single postgres container.
Re-capture after schema migrations or transport changes; commit the
new numbers so regressions are visible in code review.

| Metric | Threshold | Last captured | Notes |
|---|---|---|---|
| `directory_duration` p95 | < 500 ms | _operator_ | Unauth GET; cache-friendly. |
| `new_nonce_duration` p95 | < 300 ms | _operator_ | Single Postgres INSERT under the hood. |
| `renewal_info_duration` p95 (synthetic id) | < 800 ms | _operator_ | Synthetic cert-id → 4xx fast path. |
| `http_req_failed` rate | < 1% | _operator_ | Should be ~0; failures here mean transport issues. |

Capture command: `make loadtest` after pointing the compose stack at
the ACME flow scenario. Operators with kind / cert-manager available
should pair this with `make acme-cert-manager-test` for end-to-end
verification.

## Scale tier (Phase 8 SCALE-H2, 2026-05-14)

Phase 8 closure added three new k6 scenarios that exercise the
scale-relevant load surfaces the API tier and connector tier left
uncovered:

| Scenario | k6 file | Seed | Make target |
|---|---|---|---|
| Bulk-renewal under load | `k6/bulk_renewal.js` | `seed/01_bulk_renewal_certs.sql` (10K certs) | `make loadtest-scale-bulk` |
| ACME enrollment burst | `k6/acme_burst.js` | (none; unauth surface) | `make loadtest-scale-acme` |
| Agent heartbeat storm | `k6/agent_storm.js` | `seed/02_agent_fleet.sql` (5K agents) | `make loadtest-scale-agent` |

The scale-tier scenarios live behind the `scale` compose profile so
the default `make loadtest` (API tier + connector tier, ~7 min)
stays fast. Run all three serially with `make loadtest-scale`, or
trigger the `loadtest.yml` workflow's `k6-scale` matrix jobs from
the Actions tab for canonical-hardware capture.

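For a sense of scale, the Phase 8 commit notes give sustained rates of 5 req/s for bulk renewal and 167 req/s for agent heartbeats, each over 5 minutes. The arithmetic below works out the resulting request volumes. The per-agent interval it derives is an inference from those numbers, not a documented setting.

```javascript
const DURATION_SECONDS = 5 * 60;
const AGENT_FLEET_SIZE = 5000; // rows seeded by seed/02_agent_fleet.sql

// Total requests a constant rate produces over the run.
const totalRequests = (ratePerSecond) => ratePerSecond * DURATION_SECONDS;

const bulkRenewRequests = totalRequests(5);   // 1500
const heartbeatRequests = totalRequests(167); // 50100

// Heartbeats per seeded agent if load is spread evenly across the fleet.
const heartbeatsPerAgent = heartbeatRequests / AGENT_FLEET_SIZE;

// Implied per-agent heartbeat interval in seconds (inferred, not documented).
const impliedIntervalSeconds = AGENT_FLEET_SIZE / 167;

console.log(bulkRenewRequests, heartbeatRequests);
console.log(heartbeatsPerAgent.toFixed(2), impliedIntervalSeconds.toFixed(1));
```

Read this way, 167 req/s corresponds to each of the 5,000 agents reporting roughly every 30 seconds.
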
Operator-facing baseline table + threshold contracts + documented
limitations live in [`docs/operator/scale.md`](../../../docs/operator/scale.md)
under the "Scale-tier scenarios (SCALE-H2, Phase 8)" section. Treat
that as the canonical source; this README only links to it.

The seed fixtures + their idempotency contract are documented in
[`seed/README.md`](seed/README.md).

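The Phase 8 commit notes describe `acme_burst.js` as pinning the 429 rate-limit response shape: a Counter increments only when the response is `application/problem+json` with the `urn:ietf:params:acme:error:rateLimited` type. A minimal sketch of such a check follows; the function name and the sample bodies are hypothetical, and the real scenario may structure the check differently.

```javascript
const RATE_LIMITED_TYPE = "urn:ietf:params:acme:error:rateLimited";

// True only for a 429 whose body is an RFC 7807 problem document
// carrying the ACME rateLimited error type.
function isAcmeRateLimitShape(status, contentType, bodyText) {
  if (status !== 429) return false;
  if (!/^application\/problem\+json\b/i.test(contentType || "")) return false;
  try {
    return JSON.parse(bodyText).type === RATE_LIMITED_TYPE;
  } catch {
    return false; // body was not usable JSON
  }
}

const goodBody = JSON.stringify({ type: RATE_LIMITED_TYPE, detail: "slow down" });
console.log(isAcmeRateLimitShape(429, "application/problem+json; charset=utf-8", goodBody)); // true
console.log(isAcmeRateLimitShape(429, "text/plain", "Too Many Requests"));                   // false
console.log(isAcmeRateLimitShape(200, "application/problem+json", goodBody));                // false
```

In a k6 script the boolean would gate a call such as `acme_rate_limit_shape_ok.add(1)`, so a 429 with the wrong shape is not counted as a pass.
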
## Audit references

- API tier: 2026-05-01 issuer coverage audit fix #8.
- Connector tier: 2026-05-02 deployment-target audit Bundle 10.
- ACME flows: Phase 5 master prompt (project notes).
- Scale tier: 2026-05-14 architecture diligence Phase 8 (SCALE-H2).