certctl/docs/operator/scale-baseline-2026-Q2.md
shankar0123 6acf3559a3 docs(scale): TEST-005 — split scale baseline into its own canonical record
Sprint 5 unified-master-audit closure. Pre-fix:

  - docs/operator/scale.md L163-185 held a TBD-laden table with 5
    scenario rows. The Phase 8 scenarios shipped 2026-05-14; baseline
    capture on canonical hardware was 'the next operational step'
    that had not been taken.
  - Acquirers + operators asking 'what's the scale ceiling?' got
    'TBD' as the in-tree answer.

The audit's fix wanted three things:
  1. Capture p50/p95/p99 + error rate + memory profile on a fixed-
     spec runner.
  2. Replace the scale.md TBD rows with real numbers.
  3. Archive k6 artifacts under deploy/test/loadtest-artifacts/.

The actual capture is a workflow_dispatch run the operator triggers
on a real Linux runner — it can't happen from a sandbox without
Docker. What I CAN deliver in this commit is the canonical-record
infrastructure that turns the next workflow run into a baseline that
sticks:

  - New docs/operator/scale-baseline-2026-Q2.md is the canonical
    record. Documents the three scenarios, the methodology, the
    capture procedure, and a 'Latest capture' table with
    placeholder rows ready to receive the workflow_dispatch run's
    numbers. The doc explicitly defends the 'ubuntu-latest runner'
    choice (reproducibility > paid-AWS-account specificity).
  - docs/operator/scale.md L163-185 — the TBD table — replaced with
    a pointer paragraph to the new baseline file. Per the
    canonical-doc-pointer pattern: the operator-posture doc changes
    when scenarios change; the baseline doc changes on every
    capture. Splitting them avoids review-noise on per-capture
    commits.
  - New deploy/test/loadtest-artifacts/ directory with a README
    documenting the long-term-archive contract (the GHA artifact
    retention is 90 days; numbers acquisition reviewers look at
    months later need a committed home).

Operator next steps to fill the placeholders:
  1. Trigger Actions → loadtest → Run workflow.
  2. Download the three matrix-leg artifacts.
  3. Update the baseline doc's 'Latest capture' rows.
  4. Commit the raw artifacts (or git-lfs for >100 MB archives) to
     deploy/test/loadtest-artifacts/.

Closes TEST-005 (infrastructure side). Numbers land on the next
canonical-runner workflow_dispatch capture.
2026-05-16 05:19:57 +00:00


# Scale baseline — 2026 Q2 canonical-hardware capture
> Last reviewed: 2026-05-16
## What this file is
The canonical record of certctl's load-test baselines for the
2026-Q2 reporting window. TEST-005 closure (Sprint 5, 2026-05-16)
introduces this doc as the single source of truth for "what's the
scale ceiling?" — replacing the TBD-laden table at
[`docs/operator/scale.md`](scale.md#measured-baseline) that had been
unfilled since the scenarios shipped in Phase 8.
The numbers below come from the `loadtest` GitHub Actions workflow
running its three canonical scenarios on `ubuntu-latest` runners:
- `bulk-renewal` — 10,000-cert seed + criteria-mode
`POST /api/v1/certificates/bulk-renew`, 200 concurrent VUs over 10
minutes.
- `acme-burst` — 200 concurrent VUs hitting `/acme/directory`,
`/acme/new-nonce`, and `/acme/renewal-info/<cert-id>` simultaneously.
- `agent-storm` — 5,000-agent seed + sustained
`POST /api/v1/agents/{id}/heartbeat` at 167 RPS.
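As a quick sanity check on the `agent-storm` rate: if the 5,000 seeded agents share the 167 RPS evenly, each agent heartbeats roughly once every 30 seconds. That interval is inferred from the two figures above and is not stated in the scenario definition; the helper name below is illustrative only.

```python
def per_agent_interval_seconds(agents: int, total_rps: float) -> float:
    """Seconds between heartbeats for one agent, assuming an even spread."""
    if agents <= 0 or total_rps <= 0:
        raise ValueError("agents and total_rps must be positive")
    return agents / total_rps


# Figures taken from the agent-storm scenario description.
interval = per_agent_interval_seconds(5000, 167)
print(f"{interval:.1f} s between heartbeats per agent")
```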
Thresholds are enforced inline in `deploy/test/loadtest/k6.js` (p99 < 5s
for issuance-acceptance, p99 < 2s for list, error rate < 1%). k6 exits
non-zero on any breach, and that exit code propagates through
`docker compose up --exit-code-from k6` → `make loadtest` → workflow exit.
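The gate logic can be sketched outside k6 as follows. This is an illustration of the pass/fail rule only, not the contents of `k6.js`: the limits are the ones quoted above, while the metric labels, function names, and the choice of `1` as the failing exit code are assumptions.

```python
# Limits quoted in the text; the dictionary keys are hypothetical labels.
P99_LIMITS_MS = {"issuance-acceptance": 5000.0, "list": 2000.0}
MAX_ERROR_RATE = 0.01


def breaches(p99_ms: dict, error_rate: float) -> list:
    """Return one human-readable line per threshold that was crossed."""
    found = []
    for name, limit in P99_LIMITS_MS.items():
        observed = p99_ms.get(name)
        if observed is not None and observed >= limit:
            found.append(f"{name}: p99 {observed:.0f} ms >= {limit:.0f} ms")
    if error_rate >= MAX_ERROR_RATE:
        found.append(f"error rate {error_rate:.2%} >= {MAX_ERROR_RATE:.0%}")
    return found


def exit_code(p99_ms: dict, error_rate: float) -> int:
    """Any non-zero value fails the pipeline; 1 is an arbitrary choice here."""
    return 1 if breaches(p99_ms, error_rate) else 0


# Made-up observations, purely to exercise the gate.
print(exit_code({"issuance-acceptance": 1200.0, "list": 300.0}, 0.001))
print(breaches({"issuance-acceptance": 5200.0, "list": 300.0}, 0.02))
```

Because each stage in the chain forwards the exit code of the stage before it, a single breached threshold is enough to fail the whole workflow run.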
## Capture procedure
1. Trigger the workflow:
- **Actions** → `loadtest` → **Run workflow**, branch `master`.
- Wait ~25 minutes for the three matrix legs to finish.
2. Download each scenario's artifact from the workflow run page:
- `k6-scale-bulk-renewal-<run-id>`
- `k6-scale-acme-burst-<run-id>`
- `k6-scale-agent-storm-<run-id>`
- Each archive contains the k6 `summary.json` + raw NDJSON
points (90-day GHA retention).
3. Run `scripts/scale-baseline/extract.sh <run-id>` to pull the three
artifacts and emit the table rows for this doc.
4. Paste the rows under the **Latest capture** section. Update
`> Last reviewed:` to today.
5. Commit the artifacts you want long-term-retained to
[`deploy/test/loadtest-artifacts/`](../../deploy/test/loadtest-artifacts/)
using `git lfs` if the archives exceed 100 MB; otherwise commit
them inline.
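For readers who want to see what step 3 amounts to, here is a minimal sketch of turning one k6 summary into a **Latest capture** row. It is not the `extract.sh` script. It assumes the summary carries `http_req_duration` and `http_req_failed` under `metrics`, either flat or nested under `values`, and that `p(99)` is present only when the k6 run was configured to report it. The function names and the sample numbers are made up for illustration.

```python
import json


def _stats(summary: dict, metric: str) -> dict:
    """Return the numeric fields for one metric, flat or nested."""
    entry = summary.get("metrics", {}).get(metric, {})
    # Assumption: some summary layouts nest the numbers under "values".
    return entry.get("values", entry)


def _ms(stats: dict, key: str) -> str:
    value = stats.get(key)
    return f"{value:.1f} ms" if value is not None else "n/a"


def capture_row(scenario, run_id, date, summary, peak_rss="n/a", notes="-"):
    """Format one row in the column order of the Latest capture table."""
    duration = _stats(summary, "http_req_duration")
    failed = _stats(summary, "http_req_failed")
    rate = failed.get("rate", failed.get("value"))
    error_rate = f"{rate:.2%}" if rate is not None else "n/a"
    cells = [
        f"**{scenario}**", str(run_id), date,
        _ms(duration, "med"), _ms(duration, "p(95)"), _ms(duration, "p(99)"),
        error_rate, peak_rss, notes,
    ]
    return "| " + " | ".join(cells) + " |"


# Made-up sample input, NOT a real capture.
sample = json.loads(
    '{"metrics": {"http_req_duration": {"med": 12.5, "p(95)": 48.0},'
    ' "http_req_failed": {"value": 0.002}}}'
)
print(capture_row("bulk-renewal", "<run-id>", "2026-05-16", sample))
```

Missing fields render as `n/a` rather than raising, so a summary without a p99 still produces a row that makes the gap visible.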
## Latest capture
| Scenario | Run ID | Date | p50 | p95 | p99 | Error rate | Peak server RSS | Notes |
|---|---|---|---|---|---|---|---|---|
| **bulk-renewal** | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | First post-TEST-005 capture; trigger via workflow_dispatch + extract via the procedure above. |
| **acme-burst** directory | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
| **acme-burst** new-nonce | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
| **acme-burst** renewal-info | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
| **agent-storm** | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
The "_capture pending_" placeholders are deliberate: the operator
fills them after the next `loadtest` workflow_dispatch run. Once a
row holds real numbers, do not overwrite it on later runs; add a new
row for each subsequent capture so the earlier row stays as evidence.
## Why "ubuntu-latest" instead of RDS-shaped hardware
The audit's fix language preferred RDS-shaped Postgres on a
fixed-spec runner. ubuntu-latest's 2-vCPU / 7-GB-RAM shape is
narrower than typical production Postgres, but it has two virtues:
1. **Reproducibility.** Every operator + acquirer can reproduce the
numbers; an RDS-shaped Postgres requires a paid AWS account.
2. **Conservative ceiling.** If the published numbers come from a
constrained runner, real-world deployments on production Postgres
sizes (db.m5.large and up) should only do better.
When an acquirer or operator asks for a production-equivalent
baseline, capture a second run on whatever infrastructure they want
to validate against and add it under a new **2026 Q3 capture**
section.
## Methodology
### Hardware
- **Runner:** GitHub Actions `ubuntu-latest` (currently Ubuntu 24.04, 2-vCPU, 7-GB RAM).
- **certctl image:** built from the same commit the workflow runs on.
- **Postgres:** `postgres:16-alpine@sha256:890480b08124ce7f79960a9bb16fe39729aa302bd384bfd7c408fee6c8f7adb7`, in-cluster, default config (no operator tuning).
- **Network:** runner localhost.
### Software
- **k6:** version pinned in `deploy/test/loadtest/Dockerfile`.
- **certctl tag:** the v* tag at workflow trigger time (matches `openapi.yaml info.version`).
### Metrics captured
- **p50 / p95 / p99 latency** — k6's `http_req_duration` percentiles.
- **Error rate** — k6 `http_req_failed` rate (non-2xx + connection errors).
- **Peak server RSS** — `docker stats` polled at 1 Hz for the
duration of the run; `max(memory_stats.usage)` taken from the
emitted JSON.
- **Acceptance gate** — the k6 thresholds in `k6.js`; if exceeded
the workflow fails.
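The peak-RSS reduction described above can be sketched as a single pass over the polled samples. This assumes one JSON object per line with a `memory_stats.usage` field in bytes, which is the field the text names; the function name and the sample values are hypothetical.

```python
import json


def peak_rss_bytes(ndjson_lines) -> int:
    """Return the largest memory_stats.usage seen across all samples."""
    peak = 0
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between samples
        sample = json.loads(line)
        usage = sample.get("memory_stats", {}).get("usage")
        if usage is not None:
            peak = max(peak, int(usage))
    return peak


# Made-up samples standing in for a 1 Hz poll.
samples = [
    '{"memory_stats": {"usage": 104857600}}',
    '{"memory_stats": {"usage": 157286400}}',
    '{"memory_stats": {"usage": 131072000}}',
]
print(f"{peak_rss_bytes(samples) / (1024 * 1024):.0f} MiB")
```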
### What's NOT captured
- **Cold-start latency** — these are steady-state baselines after the
k6 warmup ramp. Cold-start is a separate concern (renewal-loop
startup, scheduler tick boundary), not covered by these scenarios.
- **WAN latency** — runs are localhost; production-WAN-RTT additions
fall outside scope.
- **Federation overhead** — single-instance only; HA + replicas runs
are a future deliverable.
## Related reading
- [`docs/operator/scale.md`](scale.md) — the operator-facing scale
posture doc; baseline rows there point at this file.
- [`deploy/test/loadtest/README.md`](../../deploy/test/loadtest/README.md) —
scenario semantics + how to read the k6 output.
- [`deploy/test/loadtest-artifacts/`](../../deploy/test/loadtest-artifacts/) —
long-term archive of captured k6 results.