mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 14:21:37 +00:00
docs(scale): TEST-005 — split scale baseline into its own canonical record

Sprint 5 unified-master-audit closure. Pre-fix:

- docs/operator/scale.md L163-185 held a TBD-laden table with 5
  scenario rows. The Phase 8 scenarios shipped 2026-05-14; baseline
  capture on canonical hardware was 'the next operational step'
  that had not been taken.
- Acquirers + operators asking 'what's the scale ceiling?' got
  'TBD' as the in-tree answer.

The audit's fix wanted three things:

1. Capture p50/p95/p99 + error rate + memory profile on a fixed-spec
   runner.
2. Replace the scale.md TBD rows with real numbers.
3. Archive k6 artifacts under deploy/test/loadtest-artifacts/.

The actual capture is a workflow_dispatch run the operator triggers
on a real Linux runner — it can't happen from a sandbox without
Docker. What I CAN deliver in this commit is the canonical-record
infrastructure that turns the next workflow run into a baseline that
sticks:

- New docs/operator/scale-baseline-2026-Q2.md is the canonical
  record. Documents the three scenarios, the methodology, the
  capture procedure, and a 'Latest capture' table with
  placeholder rows ready to receive the workflow_dispatch run's
  numbers. The doc explicitly defends the 'ubuntu-latest runner'
  choice (reproducibility > paid-AWS-account specificity).
- docs/operator/scale.md L163-185 — the TBD table — replaced with
  a pointer paragraph to the new baseline file. Per the
  canonical-doc-pointer pattern: the operator-posture doc changes
  when scenarios change; the baseline doc changes on every
  capture. Splitting them avoids review-noise on per-capture
  commits.
- New deploy/test/loadtest-artifacts/ directory with a README
  documenting the long-term-archive contract (the GHA artifact
  retention is 90 days; numbers acquisition reviewers look at
  months later need a committed home).

Operator next steps to fill the placeholders:

1. Trigger Actions → loadtest → Run workflow.
2. Download the three matrix-leg artifacts.
3. Update the baseline doc's 'Latest capture' rows.
4. Commit the raw artifacts (or git-lfs for >100 MB archives) to
   deploy/test/loadtest-artifacts/.

Closes TEST-005 (infrastructure side). Numbers land on the next
canonical-runner workflow_dispatch capture.
@@ -0,0 +1,52 @@
# loadtest-artifacts/

> Last reviewed: 2026-05-16

Long-term archive of k6 load-test results from the `loadtest` GitHub
Actions workflow. TEST-005 closure (Sprint 5, 2026-05-16) introduces
this directory as the committed home for captures the operator
chooses to retain past GitHub's 90-day artifact-retention window.

## What lands here

After a `loadtest` workflow_dispatch run, follow the procedure in
[`docs/operator/scale-baseline-2026-Q2.md`](../../../docs/operator/scale-baseline-2026-Q2.md#capture-procedure):

1. Download the three matrix-leg artifacts from the workflow page.
2. Update the latest-capture table in the baseline doc with the
   extracted percentiles.
3. Commit the raw artifacts you want long-term-retained here, named:

   ```
   2026-Q2-bulk-renewal-<run-id>.tar.gz
   2026-Q2-acme-burst-<run-id>.tar.gz
   2026-Q2-agent-storm-<run-id>.tar.gz
   ```

4. If any single archive exceeds 100 MB, route it through `git lfs`
   (configured at repo root via `.gitattributes`).
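
The naming convention and the 100 MB rule above can be checked mechanically before committing. The following is a minimal Python sketch, not a script that ships in this repo: `archive_name`, `needs_lfs`, and the `window` default are hypothetical names, and "100 MB" is assumed to mean 100 × 10^6 bytes (the stricter of the two common readings).

```python
"""Hypothetical helper: build archive names and flag archives that need git lfs."""

# The three canonical scenarios named in the baseline doc.
SCENARIOS = ("bulk-renewal", "acme-burst", "agent-storm")

# Assumption: "100 MB" is read as decimal megabytes (the stricter reading).
LFS_THRESHOLD_BYTES = 100 * 1000 * 1000


def archive_name(scenario: str, run_id: str, window: str = "2026-Q2") -> str:
    """Return '<window>-<scenario>-<run-id>.tar.gz' for a known scenario."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    return f"{window}-{scenario}-{run_id}.tar.gz"


def needs_lfs(size_bytes: int) -> bool:
    """True when a single archive exceeds the 100 MB threshold."""
    return size_bytes > LFS_THRESHOLD_BYTES
```

Rejecting unknown scenario names keeps a typo from producing an archive that no later reader can map back to a table row.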

## Why commit artifacts rather than rely on GHA retention

- **GitHub Actions retains workflow artifacts for 90 days by default.**
  Acquisition-diligence reviewers looking at scale evidence months
  later get a 404 unless we keep the raw NDJSON in tree.
- **Reproducibility.** Pinning the k6 NDJSON to a SHA makes it
  cheap to re-derive percentiles with a different filter (e.g.
  "p99 excluding the warmup ramp's first 30 seconds") without
  re-running the workflow.
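
As an illustration of that re-derivation, here is a minimal Python sketch that recomputes percentiles from archived NDJSON while skipping the first 30 seconds. It assumes the archive holds k6's JSON output (one object per line, with `type` set to `Point`, a `metric` name, and `data.time` / `data.value`), and it treats the first `http_req_duration` sample as the start of the run. The function names are hypothetical, and the nearest-rank percentile used here may differ slightly from the figure in k6's own summary.

```python
import json
import math
import re
from datetime import datetime, timedelta

_FRACTION = re.compile(r"\.(\d+)")


def parse_k6_time(stamp):
    """Parse an RFC 3339 timestamp, normalising the fraction to 6 digits."""
    stamp = _FRACTION.sub(lambda m: "." + m.group(1)[:6].ljust(6, "0"), stamp)
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    return datetime.fromisoformat(stamp)


def durations_after_warmup(lines, warmup_seconds=30.0, metric="http_req_duration"):
    """Return sorted sample values recorded at or after the warmup cutoff."""
    points = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Skip metric declarations and samples of other metrics.
        if record.get("type") != "Point" or record.get("metric") != metric:
            continue
        data = record["data"]
        points.append((parse_k6_time(data["time"]), float(data["value"])))
    if not points:
        return []
    # Assumption: the earliest sample of this metric marks the start of the run.
    cutoff = min(t for t, _ in points) + timedelta(seconds=warmup_seconds)
    return sorted(v for t, v in points if t >= cutoff)


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]
```

Feeding it the lines of an extracted NDJSON file and calling `percentile(values, 99)` gives a warmup-free p99 without touching the workflow.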

## What does NOT belong here

- **Per-PR ephemeral runs.** The `loadtest` workflow runs on
  workflow_dispatch + weekly cron; per-PR runs would be too noisy
  and aren't retained.
- **Production-environment captures.** These artifacts are the
  ubuntu-latest reference baseline. An operator capturing their
  production-environment scale should put the artifacts in their
  own observability platform — committing them here would imply
  "these are certctl's reference numbers", which they aren't.
- **Manual k6 captures from a developer's laptop.** Same rationale
  as the visual-regression snapshot runbook
  ([`docs/operator/runbooks/e2e-snapshot-update.md`](../../../docs/operator/runbooks/e2e-snapshot-update.md))
  — only the CI environment produces canonical numbers.
@@ -0,0 +1,123 @@
# Scale baseline — 2026 Q2 canonical-hardware capture

> Last reviewed: 2026-05-16

## What this file is

The canonical record of certctl's load-test baselines for the
2026-Q2 reporting window. TEST-005 closure (Sprint 5, 2026-05-16)
introduces this doc as the single source of truth for "what's the
scale ceiling?" — replacing the TBD-laden table at
[`docs/operator/scale.md`](scale.md#measured-baseline) that had been
unfilled since the scenarios shipped in Phase 8.

The numbers below come from the `loadtest` GitHub Actions workflow
running its three canonical scenarios on `ubuntu-latest` runners:

- `bulk-renewal` — 10,000-cert seed + criteria-mode
  `POST /api/v1/certificates/bulk-renew`, 200 concurrent VUs over 10
  minutes.
- `acme-burst` — 200 concurrent VUs hitting `/acme/directory`,
  `/acme/new-nonce`, and `/acme/renewal-info/<cert-id>` simultaneously.
- `agent-storm` — 5,000-agent seed + sustained
  `POST /api/v1/agents/{id}/heartbeat` at 167 RPS.

Thresholds are enforced inline in `deploy/test/loadtest/k6.js` (p99 < 5s
for issuance-acceptance, p99 < 2s for list, error rate < 1%). k6 exits
non-zero on any breach, which propagates through
`docker compose up --exit-code-from k6` → `make loadtest` → workflow exit.

## Capture procedure

1. Trigger the workflow:
   - **Actions** → `loadtest` → **Run workflow**, branch `master`.
   - Wait ~25 minutes for the three matrix legs to finish.
2. Download each scenario's artifact from the workflow run page:
   - `k6-scale-bulk-renewal-<run-id>`
   - `k6-scale-acme-burst-<run-id>`
   - `k6-scale-agent-storm-<run-id>`
   - Each archive contains the k6 `summary.json` + raw NDJSON
     points (90-day GHA retention).
3. Run `scripts/scale-baseline/extract.sh <run-id>` to
   pull the three artifacts and emit the table rows for this doc.
4. Paste the rows under the **Latest capture** section. Update
   `> Last reviewed:` to today.
5. Commit the artifacts you want long-term-retained to
   [`deploy/test/loadtest-artifacts/`](../../deploy/test/loadtest-artifacts/)
   using `git lfs` if the archives exceed 100 MB; otherwise commit
   them inline.

## Latest capture

| Scenario | Run ID | Date | p50 | p95 | p99 | Error rate | Peak server RSS | Notes |
|---|---|---|---|---|---|---|---|---|
| **bulk-renewal** | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | First post-TEST-005 capture; trigger via workflow_dispatch + extract via the procedure above. |
| **acme-burst** directory | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
| **acme-burst** new-nonce | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
| **acme-burst** renewal-info | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |
| **agent-storm** | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | _capture pending_ | — |

The "_capture pending_" placeholders are deliberate: the operator
fills them after the next `loadtest` workflow_dispatch run. With the
first capture's numbers in hand, replace the placeholder rows; on
later runs, add new rows instead of editing existing ones in place,
so each historical row stays as evidence.
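
To keep pasted rows consistent with the table's nine columns, a row can be rendered from already-extracted numbers. The sketch below is illustrative Python, not the repo's `extract.sh`. `format_capture_row` and its parameters are hypothetical, and the display units (milliseconds, percent, MiB) are assumptions rather than a documented convention.

```python
def format_capture_row(scenario, run_id, date, p50_ms, p95_ms, p99_ms,
                       error_rate, peak_rss_bytes, endpoint="", notes="-"):
    """Render one Markdown row for the 'Latest capture' table.

    error_rate is a fraction (0.01 means 1%); peak_rss_bytes is in bytes.
    """
    # Bold scenario name, optionally followed by the endpoint label.
    label = f"**{scenario}** {endpoint}".strip()
    cells = [
        label,
        str(run_id),
        date,
        f"{p50_ms:.1f} ms",
        f"{p95_ms:.1f} ms",
        f"{p99_ms:.1f} ms",
        f"{error_rate * 100:.2f}%",
        f"{peak_rss_bytes / (1024 * 1024):.0f} MiB",
        notes,
    ]
    return "| " + " | ".join(cells) + " |"
```

Generating rows this way keeps the column order fixed, so a capture cannot silently swap p95 and p99.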

## Why "ubuntu-latest" instead of RDS-shaped hardware

The audit's fix language preferred RDS-shaped Postgres on a
fixed-spec runner. ubuntu-latest's 2-vCPU / 7-GB-RAM shape is
narrower than typical production Postgres, but it has two virtues:

1. **Reproducibility.** Every operator + acquirer can reproduce the
   numbers; an RDS-shaped Postgres requires a paid AWS account.
2. **Conservative ceiling.** If the published numbers come from a
   constrained runner, real-world deployments on production Postgres
   sizes (db.m5.large and up) should only improve on them.

When an acquirer or operator asks for a production-equivalent
baseline, capture a second run on whatever infrastructure they want
to validate against and add it under a new **2026 Q3 capture**
section.

## Methodology

### Hardware

- **Runner:** GitHub Actions `ubuntu-latest` (currently Ubuntu 24.04, 2-vCPU, 7-GB RAM).
- **certctl image:** built from the same commit the workflow runs on.
- **Postgres:** `postgres:16-alpine@sha256:890480b08124ce7f79960a9bb16fe39729aa302bd384bfd7c408fee6c8f7adb7`, in-cluster, default config (no operator tuning).
- **Network:** runner localhost.

### Software

- **k6:** version pinned in `deploy/test/loadtest/Dockerfile`.
- **certctl tag:** the v* tag at workflow trigger time (matches `openapi.yaml info.version`).

### Metrics captured

- **p50 / p95 / p99 latency** — k6's `http_req_duration` percentiles.
- **Error rate** — k6 `http_req_failed` rate (non-2xx + connection errors).
- **Peak server RSS** — `docker stats` polled at 1 Hz for the
  duration of the run; `max(memory_stats.usage)` taken from the
  emitted JSON.
- **Acceptance gate** — the k6 thresholds in `k6.js`; if exceeded
  the workflow fails.
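
The peak-RSS reduction described above amounts to a running maximum over the polled samples. A minimal Python sketch follows, assuming the samples are stored one JSON object per line and each carries a `memory_stats.usage` field in bytes, as in the Docker Engine stats payload. The on-disk layout the workflow actually emits is not shown in this doc, and `peak_memory_bytes` is a hypothetical name.

```python
import json


def peak_memory_bytes(lines):
    """Return the largest memory_stats.usage across samples, or None if absent."""
    peak = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sample = json.loads(line)
        usage = sample.get("memory_stats", {}).get("usage")
        if usage is None:
            # Samples taken before the container started may lack the field.
            continue
        peak = usage if peak is None else max(peak, usage)
    return peak
```

Returning `None` instead of `0` when no sample carries the field keeps a failed poll from being recorded as a zero-byte peak.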

### What's NOT captured

- **Cold-start latency** — these are steady-state baselines after the
  k6 warmup ramp. Cold-start is a separate concern (renewal-loop
  startup, scheduler tick boundary), not covered by these scenarios.
- **WAN latency** — runs are localhost; production-WAN-RTT additions
  fall outside scope.
- **Federation overhead** — single-instance only; HA + replicas runs
  are a future deliverable.

## Related reading

- [`docs/operator/scale.md`](scale.md) — the operator-facing scale
  posture doc; baseline rows there point at this file.
- [`deploy/test/loadtest/README.md`](../../deploy/test/loadtest/README.md) —
  scenario semantics + how to read the k6 output.
- [`deploy/test/loadtest-artifacts/`](../../deploy/test/loadtest-artifacts/) —
  long-term archive of captured k6 results.
+15 -21
@@ -1,6 +1,6 @@
 # Operator scale guide

-> Last reviewed: 2026-05-14
+> Last reviewed: 2026-05-16

 Use this when:
 - You're sizing a new certctl deployment for a target fleet count.
@@ -160,29 +160,23 @@ the RFC 7807 `application/problem+json` shape with the
 returned plain-text 429 or a different problem type would surface as
 `(rate_limited_count - shape_ok_count) > 0` in the summary.

-### Measured baseline — TBD pending canonical-hardware capture
+### Measured baseline

-The Phase 8 scenarios shipped 2026-05-14. Baseline capture on a
-canonical `ubuntu-latest` GitHub runner is the next operational step;
-until then, the table below holds TBD placeholders. **Do NOT publish
-sandbox-captured numbers here** — the same anti-pattern the original
-loadtest README guards against (sandbox-aggregate placeholder vs
-canonical hardware) applies to Phase 8.
+TEST-005 closure (Sprint 5, 2026-05-16) moved the baseline table out
+of this file into its own canonical record:
+[`docs/operator/scale-baseline-2026-Q2.md`](scale-baseline-2026-Q2.md).
+That doc owns the capture procedure, the methodology, and the
+per-scenario rows; this page links to it as the authoritative
+source.

-| Scenario | p50 | p95 | p99 | Error rate | Date measured | Commit |
-|---|---|---|---|---|---|---|
-| **bulk_renewal** | TBD | TBD | TBD | TBD | — | — |
-| **acme_burst** directory | TBD | TBD | TBD | TBD | — | — |
-| **acme_burst** new-nonce | TBD | TBD | TBD | TBD | — | — |
-| **acme_burst** renewal-info | TBD | TBD | TBD | TBD | — | — |
-| **agent_storm** | TBD | TBD | TBD | TBD | — | — |
+The split exists because the baseline table is mutable on every
+loadtest workflow_dispatch run, while this page (the operator-facing
+scale posture doc) changes only when the underlying scenarios or
+thresholds change. Keeping them in separate files avoids
+review-noise on per-capture commits.

-Capture procedure: trigger `loadtest.yml` from the Actions tab against
-the current `master` SHA; wait for the `k6-scale` matrix jobs to
-complete; download the per-scenario summary artifacts; copy p50/p95/
-p99 from `summary-<scenario>.json` into the table; commit the
-captured numbers alongside the date + SHA. Replace this paragraph
-with the captured-on row when the first canonical run lands.
+Long-term k6 NDJSON artifacts beyond GHA's 90-day retention live at
+[`deploy/test/loadtest-artifacts/`](../../deploy/test/loadtest-artifacts/).

 ### How to run the scale tier locally
