Commit Graph

3 Commits

Author SHA1 Message Date
shankar0123 6acf3559a3 docs(scale): TEST-005 — split scale baseline into its own canonical record
Sprint 5 unified-master-audit closure. Pre-fix:

  - docs/operator/scale.md L163-185 held a TBD-laden table with 5
    scenario rows. The Phase 8 scenarios shipped 2026-05-14; baseline
    capture on canonical hardware was 'the next operational step'
    that had not been taken.
  - Acquirers + operators asking 'what's the scale ceiling?' got
    'TBD' as the in-tree answer.

The audit's fix wanted three things:
  1. Capture p50/p95/p99 + error rate + memory profile on a fixed-
     spec runner.
  2. Replace the scale.md TBD rows with real numbers.
  3. Archive k6 artifacts under deploy/test/loadtest-artifacts/.

The actual capture is a workflow_dispatch run the operator triggers
on a real Linux runner — it can't happen from a sandbox without
Docker. What I CAN deliver in this commit is the canonical-record
infrastructure that turns the next workflow run into a baseline that
sticks:

  - New docs/operator/scale-baseline-2026-Q2.md is the canonical
    record. Documents the three scenarios, the methodology, the
    capture procedure, and a 'Latest capture' table with
    placeholder rows ready to receive the workflow_dispatch run's
    numbers. The doc explicitly defends the 'ubuntu-latest runner'
    choice (reproducibility > paid-AWS-account specificity).
  - docs/operator/scale.md L163-185 — the TBD table — replaced with
    a pointer paragraph to the new baseline file. Per the
    canonical-doc-pointer pattern: the operator-posture doc changes
    when scenarios change; the baseline doc changes on every
    capture. Splitting them avoids review-noise on per-capture
    commits.
  - New deploy/test/loadtest-artifacts/ directory with a README
    documenting the long-term-archive contract (the GHA artifact
    retention is 90 days; numbers acquisition reviewers look at
    months later need a committed home).

Operator next steps to fill the placeholders:
  1. Trigger Actions → loadtest → Run workflow.
  2. Download the three matrix-leg artifacts.
  3. Update the baseline doc's 'Latest capture' rows.
  4. Commit the raw artifacts (or git-lfs for >100 MB archives) to
     deploy/test/loadtest-artifacts/.

Closes TEST-005 (infrastructure side). Numbers land on the next
canonical-runner workflow_dispatch capture.
2026-05-16 05:19:57 +00:00
shankar0123 1279172e9b loadtest: close Phase 8 SCALE-H2 — add scale-tier scenarios
Phase 8 of the certctl architecture diligence remediation closes
SCALE-H2 by adding three new k6 scenarios that exercise the scale-
relevant load surfaces the API tier + connector tier left uncovered:
fleet-scale bulk renewal, ACME enrollment burst, and agent heartbeat
storm.

Audit miscount + path correction (live-grep at Phase 8 audit time)
==================================================================
- The Phase 8 prompt referenced both `deploy/test/load/` and
  `deploy/test/loadtest/`. Repo truth: the existing harness lives at
  `deploy/test/loadtest/`. New scenarios land there.
- The audit's prior framing "k6 covers the API tier at 50 req/s
  only" omitted Bundle 10 (2026-05-02) which added four connector-
  tier handshake scenarios (nginx/apache/haproxy/f5) at 100 conns/min
  each, plus the Phase 5 ACME directory/nonce/ARI scenario at 100 VUs
  in `k6/acme_flow.js`. Phase 8 appends to what's there rather than
  rewriting.

What ships
==========

Three new k6 scenario files under deploy/test/loadtest/k6/:

  bulk_renewal.js — 10K-cert seed + 5 req/s POST /bulk-renew × 5min
                    p99 < 5s, p95 < 2s, errors < 1%
  acme_burst.js   — 200 VU sustained × directory/nonce/ARI × 5min
                    directory p95 < 500ms, nonce p95 < 300ms,
                    renewal-info p95 < 800ms, 5xx-only < 0.1%
                    Pins RFC 7807 rate-limit response shape via
                    acme_rate_limit_shape_ok Counter.
  agent_storm.js  — 5K-agent seed + 167 req/s POST /heartbeat × 5min
                    p99 < 1s, p95 < 500ms, errors < 0.1%

Two seed SQL fixtures under deploy/test/loadtest/seed/:

  01_bulk_renewal_certs.sql — 10,000 managed_certificates rows
    linked to seed_demo.sql FKs (iss-local, o-alice, t-platform,
    rp-standard). status='active', expires_at distributed across
    next 30 days, name prefix `loadtest-bulk-` so the scenario
    can scope its criteria. Idempotent via
    ON CONFLICT (name) DO NOTHING.

  02_agent_fleet.sql — 5,000 agents rows with name prefix
    `loadtest-agent-`. status='Online', last_heartbeat_at
    staggered across prior 60s, OS distribution 80%/10%/10%
    linux/windows/darwin. Idempotent via
    ON CONFLICT (id) DO NOTHING.

Plus seed/README.md documenting the opt-in profile + when these
run vs the default `make loadtest` fast path.

Compose + Makefile + CI wiring
==============================

deploy/test/loadtest/docker-compose.yml gains four new services,
all gated behind the `scale` compose profile so the default
`make loadtest` is unchanged:

  scale-seed       — one-shot postgres:16-alpine container that runs
                     every ./seed/*.sql in lexical order against the
                     same postgres the server uses. Depends on
                     postgres healthy + certctl-server healthy (so
                     migrations + seed_demo.sql have already run).
  k6-scale-bulk    — grafana/k6:0.54.0 driver running bulk_renewal.js
  k6-scale-acme    — grafana/k6:0.54.0 driver running acme_burst.js
  k6-scale-agent   — grafana/k6:0.54.0 driver running agent_storm.js

Each driver depends_on scale-seed completed_successfully so the
scenarios never run against an unseeded DB (the acme scenario
doesn't need the seed itself but uses the same dependency chain for
ordering predictability).

Makefile gains four new phony targets:

  loadtest-scale-bulk   - runs bulk_renewal.js via compose --profile scale
  loadtest-scale-acme   - runs acme_burst.js
  loadtest-scale-agent  - runs agent_storm.js
  loadtest-scale        - all three serially

.github/workflows/loadtest.yml gains a new k6-scale matrix job that
runs after the existing k6 job (needs: k6) with a matrix on the
three scenarios — fail-fast: false so a regression in one scenario
doesn't cancel the others. Same workflow_dispatch + weekly cron
cadence as the existing API + connector tier job.

Documentation
=============

docs/operator/scale.md gains a new "Scale-tier scenarios (SCALE-H2,
Phase 8)" section between the cursor-pagination subsection and the
profiling-production subsection. Documents:
  - Scenario + seed + sustained load table
  - Threshold contract (regression guards, NOT measured baselines)
  - Measured-baseline table with TBD placeholders + the canonical-
    hardware capture procedure
  - How to run the scale tier locally
  - Four documented limitations (JWS-signed ACME, scheduler renewal
    scan throughput, production-sized Postgres, pull-only deployment
    model)

deploy/test/loadtest/README.md gains a short "Scale tier (Phase 8
SCALE-H2, 2026-05-14)" section pointing at scale.md as the canonical
operator-facing baseline source. Avoids duplication; the README
remains the harness-mechanics doc.

Deliberate deviations from the prompt
======================================

The Phase 8 prompt's "concrete deliverables" section referenced
`deploy/test/load/` (no -test) for the new k6 files. The actual
harness lives at `deploy/test/loadtest/` — the new files land there
to match existing convention. The prompt's audit-questions section
also referenced `deploy/test/loadtest/` so the prompt was internally
inconsistent on this; repo truth wins.

The prompt described the ACME burst as "200 concurrent ACME orders
against /acme/profile/<id>/new-order ... pin the rate-limit response
shape." new-order is JWS-signed (RFC 8555 §7.4 requires JWS for
every POST except newAccount-pre-account-key flows). k6 doesn't
ship JWS and bundling a signer (e.g. lego) into the k6 container
would obscure the server-side latency the scenario is trying to
measure. Same trade-off the existing Phase 5 acme_flow.js made.
Phase 8's acme_burst.js measures the unauthenticated
directory + nonce + ARI surface at burst rate AND pins the 429
rate-limit response shape via a custom Counter that increments only
when the response is `application/problem+json` with the
`urn:ietf:params:acme:error:rateLimited` type. End-to-end JWS
conformance under load remains a follow-up; the canonical JWS
correctness gate is `make acme-rfc-conformance-test` (lego-based,
non-load).

Deferred (operator-side, not engineering)
==========================================

Canonical-hardware baseline capture. The TBD placeholders in
docs/operator/scale.md's measured-baseline table are intentional —
sandbox-captured numbers from a developer laptop are misleading
(same anti-pattern the original loadtest README guards against).
Operator triggers loadtest.yml from the Actions tab, waits for the
k6-scale matrix jobs to complete, downloads the per-scenario
summary artifacts, copies p50/p95/p99 into the table, commits the
captured numbers alongside the date + commit SHA.

Files changed (10):
  .github/workflows/loadtest.yml                            (+72 -1)
  Makefile                                                  (+47 -1)
  deploy/test/loadtest/README.md                            (+28 -1)
  deploy/test/loadtest/docker-compose.yml                   (+108 -1)
  deploy/test/loadtest/k6/bulk_renewal.js                   (new, 106 lines)
  deploy/test/loadtest/k6/acme_burst.js                     (new, 192 lines)
  deploy/test/loadtest/k6/agent_storm.js                    (new, 124 lines)
  deploy/test/loadtest/seed/01_bulk_renewal_certs.sql       (new, 95 lines)
  deploy/test/loadtest/seed/02_agent_fleet.sql              (new, 92 lines)
  deploy/test/loadtest/seed/README.md                       (new, 86 lines)
  docs/operator/scale.md                                    (+109 -0)

Verification (sandbox-runnable):
  python3 -c 'import yaml; yaml.safe_load(open("deploy/test/loadtest/docker-compose.yml"))'
    → compose YAML OK
  python3 -c 'import yaml; yaml.safe_load(open(".github/workflows/loadtest.yml"))'
    → workflow YAML OK
  grep -E 'bulk_renewal|acme_burst|agent_storm' deploy/test/loadtest/k6/*.js
    → all three scenarios + tags present
  grep loadtest-scale Makefile
    → 4 new targets registered in .PHONY + 3 recipes + 1 aggregate

Runtime verification (deferred — requires docker on canonical hardware):
  make loadtest-scale-bulk    # 10K cert fixture + 5 req/s × 5min
  make loadtest-scale-acme    # 200 VU × 5min
  make loadtest-scale-agent   # 5K agent fixture + 167 req/s × 5min
  make loadtest-scale         # all three serially

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-H2
2026-05-14 03:25:15 +00:00
shankar0123 8191b1ee64 scheduler+db: close Phase 6 — scale hardening across pool, jitter, ETag, asyncpoll
Phase 6 of the certctl architecture diligence remediation. Five
findings across the same scheduler-and-DB-pool surface.

SCALE-M1 (Med) — DB pool default bumped 25 → 50
  internal/config/config.go line 1972:
    MaxConnections: getEnvInt("CERTCTL_DATABASE_MAX_CONNS", 50)
  Postgres default max_connections is 100; 50 leaves headroom for
  pg_dump + ad-hoc psql + a server replica without exhausting the
  DB-side cap. Operator override env var unchanged. Operator-tune
  ladder for larger fleets (5K / 50K certs) lives in
  docs/operator/scale.md as starter values pending Phase 8 load
  tests — explicitly marked TBD.

SCALE-M3 (Med) — async-CA poll budget operator-configurable
  Live state was partially-already-shipped: all 4 async-CA
  connectors (digicert, entrust, globalsign, sectigo) already have
  per-connector CERTCTL_<NAME>_POLL_MAX_WAIT_SECONDS (Audit fix #5
  closed pre-Phase-6). What was missing: a global package-default
  override. Shipped:
    - internal/connector/issuer/asyncpoll/asyncpoll.go gains
      SetDefaultMaxWait(d) + effectiveDefaultMaxWait var + the
      currentDefaultMaxWait() priority resolver.
    - cmd/server/main.go reads CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS
      at boot and calls SetDefaultMaxWait.
    - deploy/ENVIRONMENTS.md documents the new env var (G-3 guard
      green).
  Naming deviation from the prompt's CERTCTL_ASYNC_POLL_MAX_ATTEMPTS:
  the live code tracks wall-clock time (MaxWait), not attempt count.
  Matched the existing per-connector nomenclature (_POLL_MAX_WAIT_SECONDS)
  so the priority chain reads naturally.

SCALE-M5 (Med) — JitteredTicker wrapper for all 15 scheduler loops
  internal/scheduler/jitter.go ships NewJitteredTicker(interval,
  jitterPct) + DefaultSchedulerJitter (±10%). All 15 sites in
  internal/scheduler/scheduler.go migrated from bare time.NewTicker
  to NewJitteredTicker(interval, DefaultSchedulerJitter). Base
  intervals unchanged; only the per-tick envelope adds ±10%
  randomized delay so multiple loops with the same nominal cadence
  don't co-fire and spike CPU + DB at wall-clock boundaries.

  internal/scheduler/jitter_test.go pins:
    - Bounded envelope (each tick within ±jitterPct of interval)
    - Mean drift < 30% of nominal (sign-bug detector)
    - Stop() releases the goroutine + closes C
    - Stop() idempotent (no panic on repeat)
    - Zero-jitter behaves like time.NewTicker
    - Negative and >=1 jitterPct values clamped defensively

  CI guard scripts/ci-guards/no-bare-newticker-in-scheduler.sh blocks
  any future bare time.NewTicker in scheduler.go.

SCALE-L1 (Low) — renewal-sweep semaphore behavior documented
  docs/operator/scale.md "Scheduler tick budgets" section explains
  the per-tick concurrency semaphore (CERTCTL_RENEWAL_CONCURRENCY=25
  default), the ctx-cancellation drain on tick-budget overrun, and
  operator tuning advice (raise concurrency + DB pool together).
  No code change — the behavior is defensible as-is per the audit.

SCALE-L2 (Low) — ETag middleware for top-5 read endpoints
  internal/api/middleware/etag.go computes SHA-256 ETag over the
  buffered response body, respects If-None-Match, short-circuits
  to 304 Not Modified on match. GET/HEAD only; non-2xx responses
  pass through unchanged. 64 KiB buffer cap degrades gracefully on
  oversized responses (no caching, body still flushes intact).

  Wired around the top-5 read endpoints via etagged() helper in
  internal/api/router/router.go:
    GET /api/v1/certificates
    GET /api/v1/agents
    GET /api/v1/jobs
    GET /api/v1/audit
    GET /api/v1/discovered-certificates

  internal/api/middleware/etag_test.go pins 11 behaviors including
  304-on-repeat, 200-after-mutation-with-new-ETag, POST bypass,
  4xx/5xx pass-through, oversized-response degradation, wildcard
  match, HEAD-treated-like-GET, byte-equal pass-through.

Cross-cutting fixes:
  - internal/config/config_test.go::TestLoad_DefaultValues updated
    to assert the new 50 default (was 25).
  - deploy/helm/certctl/values.yaml comment corrected — agent
    pollInterval is hardcoded 30s, not env-configurable; the
    Phase 4 comment mistakenly referenced CERTCTL_AGENT_POLL_INTERVAL
    which G-3 caught as a phantom env var.
  - asyncpoll.go reformatted by gofmt; functionally unchanged.

Verification (all pass):
  grep -nE 'SetMaxOpenConns' internal/repository/postgres/db.go    # finds 1 site
  grep -nE 'CERTCTL_DATABASE_MAX_CONNS.*50' internal/config/config.go  # config default is 50
  grep -rnE 'CERTCTL_ASYNC_POLL_MAX_WAIT_SECONDS' internal/ deploy/ENVIRONMENTS.md  # wired
  grep -cE 'time\.NewTicker\(' internal/scheduler/scheduler.go    # 0 (all migrated)
  grep -cE 'JitteredTicker' internal/scheduler/scheduler.go         # 15
  ls internal/scheduler/jitter.go internal/api/middleware/etag.go   # both exist
  ls docs/operator/scale.md                                          # exists
  bash scripts/ci-guards/no-bare-newticker-in-scheduler.sh          # clean
  bash scripts/ci-guards/G-3-env-docs-drift.sh                      # clean
  go test ./internal/scheduler/ ./internal/api/middleware/ \
    ./internal/connector/issuer/asyncpoll/ ./internal/config/       # 4/4 packages green

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M3
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-M5
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L1
        cowork/certctl-architecture-diligence-audit.html#fix-SCALE-L2
2026-05-14 01:23:03 +00:00