docs: Phase 5 — testing-guide.md prune (8268 → 0 lines, content dispersed)

Per Phase 1 audit at cowork/docs-overhaul-phase-1-audit-2026-05-04/ and the section-by-section plan in testing-guide-tumor.md.

testing-guide.md was 30% of all docs/ content (8268 lines) but was integration test code written in markdown, not operator documentation. The audit's tumor analysis disposed of every Part:

- ~65% DELETE (test cases that already exist in code)
- ~22% MOVE to inline test code
- ~8% KEEP-COMPRESSED into focused operator-runbook docs
- Title + contents + release sign-off ~5% KEEP

This commit ships the KEEP-COMPRESSED dispersal:

docs/contributor/qa-prerequisites.md (NEW, ~120 lines):
  From testing-guide.md "Prerequisites" section. Stack boot procedure, demo data baseline, reference IDs operators reuse across QA docs.

docs/contributor/gui-qa-checklist.md (NEW, ~105 lines):
  From testing-guide.md "Part 35: GUI Testing". Manual GUI verification pass for release sign-off. 25-row table covering every dashboard page.

docs/contributor/release-sign-off.md (NEW, ~130 lines):
  From testing-guide.md "Release Sign-Off" section (originally 1009 lines of per-test detail tables). Compressed to a release-day checklist organized by gate category: code state, automated gates, manual QA passes, release artefact verification, branch protection, post-release.

docs/operator/performance-baselines.md (NEW, ~100 lines):
  From testing-guide.md "Part 39: Performance Spot Checks". Four operator-runnable benchmarks (API request handling, inventory list pagination, scheduler tick, bulk revoke) with baseline numbers and when-to-re-baseline guidance.

docs/operator/helm-deployment.md (NEW, ~120 lines):
  From testing-guide.md "Part 52: Helm Chart Deployment". Operator runbook for the bundled deploy/helm/certctl/ chart: prereqs, install, four cert-source patterns, verify, upgrade, troubleshooting.

docs/reference/cli.md (NEW, ~120 lines):
  From testing-guide.md "Part 28: CLI Tool". certctl-cli command reference with command-group breakdown, common workflows (list/filter, renew, revoke, bulk import, EST enrollment, status), output formats, CI/CD integration patterns.

docs/README.md navigation index updated to include the 6 new docs:

- Reference section gains: cli.md, release-verification.md (was added in Phase 13)
- Operator section gains: helm-deployment.md, performance-baselines.md
- Contributor section gains: qa-prerequisites.md, gui-qa-checklist.md, release-sign-off.md

docs/testing-guide.md deleted. Git history preserves the 8268 lines; if any specific test case is found missing from inline test code or the destination docs during future work, lift from `git show HEAD~1:docs/testing-guide.md`.

Net: docs/ total line count drops by ~7700 lines (28%), from 26,369 to 18,742. testing-guide.md was the single largest doc; pruning it is the single biggest content-edit win of the entire restructure.

Phase 5 is the last major content phase. Remaining: Phase 4 follow-on (per-connector page extractions from reference/connectors/index.md), Phase 15 (WHAT/HOW/WHY remediation), Phase 16 (final acceptance gate).
2026-06-07 13:41:30 +00:00 · 2026-05-05 03:38:54 +00:00
parent fd4eb3b165
commit b452013dd9
8 changed files with 641 additions and 8268 deletions
@@ -0,0 +1,120 @@
+# Helm Deployment
+
+> Last reviewed: 2026-05-05
+
+Operator runbook for deploying certctl on Kubernetes via the bundled Helm chart at `deploy/helm/certctl/`.
+
+## Prereqs
+
+- Kubernetes cluster, v1.27+
+- `kubectl` configured and authenticated
+- `helm` v3.13+
+- Storage class for the PostgreSQL StatefulSet PVC
+- TLS cert source: either an operator-supplied `kubernetes.io/tls` Secret OR a cert-manager `ClusterIssuer` / `Issuer`. The chart refuses to render without one. See [`tls.md`](tls.md) for the four cert provisioning patterns.
+
+## Install
+
+```bash
+helm install certctl deploy/helm/certctl/ \
+  --namespace certctl \
+  --create-namespace \
+  --set server.apiKey=$(openssl rand -hex 32) \
+  --set postgres.password=$(openssl rand -hex 32) \
+  --set server.tls.existingSecret=certctl-server-tls
+```
+
+`server.apiKey` and `postgres.password` should be high-entropy values. The example above generates them inline; production deployments use a secrets manager (Vault, External Secrets Operator, AWS Secrets Manager) instead.
+
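The inline `$(openssl rand -hex 32)` form never prints the generated values, so it can help to generate them first and record them before installing. A minimal sketch, assuming a local file is an acceptable stopgap until a secrets manager is in place; `certctl-secrets.env` is a hypothetical file name that the chart does not read:

```shell
# Generate both secrets once so they can be recorded before install.
# certctl-secrets.env is a hypothetical file name, not something the chart reads.
umask 077                                   # files created below are owner-readable only
API_KEY=$(openssl rand -hex 32)             # 32 random bytes -> 64 hex characters
PG_PASSWORD=$(openssl rand -hex 32)
printf 'API_KEY=%s\nPG_PASSWORD=%s\n' "$API_KEY" "$PG_PASSWORD" > certctl-secrets.env
```

Both values can then be passed to the install command above as `--set server.apiKey="$API_KEY"` and `--set postgres.password="$PG_PASSWORD"`.
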
+## What you get
+
+- **Server Deployment** with a configurable replica count (default 1; HA needs sticky sessions on the ACME server's nonce path)
+- **PostgreSQL StatefulSet** with PVC-backed persistence
+- **Agent DaemonSet** with one agent per node (set `agent.daemonset.enabled=false` if you don't want the in-cluster agent)
+- Health probes (`/health` liveness + `/ready` readiness)
+- Security contexts: non-root, read-only root filesystem
+- Optional Ingress (off by default; opt in via `ingress.enabled=true`)
+
+## Cert source patterns
+
+### Pattern 1 — operator-supplied Secret (recommended for non-cert-manager shops)
+
+```bash
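+# The Secret must exist before install, so the namespace has to exist first
+# (this command reports AlreadyExists, harmlessly, if it is already there)
+kubectl create namespace certctl
+
+kubectl create secret tls certctl-server-tls \
+  --cert=server.crt --key=server.key \
+  --namespace certctl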
+
+helm install certctl deploy/helm/certctl/ \
+  --namespace certctl \
+  --set server.tls.existingSecret=certctl-server-tls
+```
+
+### Pattern 2 — cert-manager Certificate CR (recommended for cert-manager shops)
+
+```bash
+helm install certctl deploy/helm/certctl/ \
+  --namespace certctl \
+  --set server.tls.certManager.enabled=true \
+  --set server.tls.certManager.issuerRef.name=my-cluster-issuer \
+  --set server.tls.certManager.issuerRef.kind=ClusterIssuer
+```
+
+### Refuses to render without one of the above
+
+```bash
+helm install certctl deploy/helm/certctl/ --namespace certctl
+# Error: server.tls.existingSecret OR server.tls.certManager.enabled must be set
+```
+
+The render-time guard catches the missing config at `helm install` time, not at pod-crash-loop time.
+
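The guard can also be exercised without touching the cluster, for example in CI. A minimal sketch, assuming the guard fires during `helm template` the same way it does during `helm install`; `render_ok` is a hypothetical helper name:

```shell
# render_ok: render the chart locally and report success or failure via exit status.
# Any extra arguments (for example --set flags) are passed straight through to helm.
render_ok() {
  helm template certctl deploy/helm/certctl/ --namespace certctl "$@" > /dev/null 2>&1
}
```

With this helper, `render_ok --set server.tls.existingSecret=certctl-server-tls` should succeed, while a bare `render_ok` should fail if the guard behaves as described above.
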
+## Verify the install
+
+```bash
+kubectl wait --for=condition=Ready --timeout=3m \
+  -n certctl pod -l app.kubernetes.io/name=certctl-server
+
+kubectl port-forward -n certctl svc/certctl-server 8443:8443 &
+
+# Extract the CA bundle from the Secret and use it to verify the server certificate
+kubectl get secret -n certctl certctl-server-tls -o jsonpath='{.data.ca\.crt}' \
+  | base64 -d > /tmp/certctl-ca.crt
+curl --cacert /tmp/certctl-ca.crt https://localhost:8443/health
+# {"status":"healthy"}
+```
+
+If the Secret has no `ca.crt` key (operator-supplied Secrets often don't), use `tls.crt` as the bundle. For a self-signed cert the two files are identical; for a chained cert distribute the root CA bundle separately via ConfigMap.
+
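The `ca.crt` to `tls.crt` fallback can be scripted. A minimal sketch, assuming `kubectl get -o jsonpath` prints nothing when the requested key is absent; `fetch_bundle` is a hypothetical helper name:

```shell
# fetch_bundle NAMESPACE SECRET: print the PEM bundle to trust for a TLS Secret.
# Prefers ca.crt and falls back to tls.crt when ca.crt is absent or empty.
fetch_bundle() {
  ns=$1
  secret=$2
  b64=$(kubectl get secret -n "$ns" "$secret" -o jsonpath='{.data.ca\.crt}')
  if [ -z "$b64" ]; then
    b64=$(kubectl get secret -n "$ns" "$secret" -o jsonpath='{.data.tls\.crt}')
  fi
  printf '%s' "$b64" | base64 -d
}
```

For example, `fetch_bundle certctl certctl-server-tls > /tmp/certctl-ca.crt` produces the file that the `curl --cacert` check above expects.
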
+## Upgrade
+
+```bash
+helm upgrade certctl deploy/helm/certctl/ \
+  --namespace certctl \
+  --reuse-values
+```
+
+Postgres state survives the upgrade because the PVC is retained. The server and agent images move to whatever the chart's `image.tag` specifies. See [`docs/archive/upgrades/`](../archive/upgrades/) for version-specific upgrade guidance.
+
+## Configuration reference
+
+Every value is documented at `deploy/helm/certctl/values.yaml`. Common tweaks:
+
+- `server.replicaCount` — replica count (default 1)
+- `server.resources.{requests,limits}` — pod resource bounds
+- `agent.daemonset.enabled` — toggle the in-cluster agent (default true)
+- `postgres.storageSize` — PVC size (default 10Gi)
+- `ingress.enabled` + `ingress.host` — opt into Ingress
+
+## Troubleshooting
+
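+**Pod crash-loops with TLS error.** The cert and key in the Secret don't pair. For an RSA key, compare `openssl x509 -noout -modulus -in server.crt | openssl md5` against `openssl rsa -noout -modulus -in server.key | openssl md5`; the two digests must match. (`openssl md5` is used because the bare `md5` command exists on macOS but is `md5sum` on most Linux distributions.)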
+
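The modulus comparison only applies to RSA keys. A key-type-independent sketch compares the public key embedded in the certificate with the one derived from the private key; `cert_matches_key` is a hypothetical helper name, and the key is assumed to be unencrypted:

```shell
# cert_matches_key CERT KEY: succeed only if the certificate's public key equals
# the public key derived from the private key. Returns 2 if either file is unreadable.
cert_matches_key() {
  cert_pub=$(openssl x509 -in "$1" -noout -pubkey) || return 2
  key_pub=$(openssl pkey -in "$2" -pubout) || return 2
  [ "$cert_pub" = "$key_pub" ]
}
```

For example: `cert_matches_key server.crt server.key && echo "pair OK"`.
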
+**Agent DaemonSet pods can't reach the server.** Service DNS / NetworkPolicy issue. Confirm the agent's `CERTCTL_SERVER_URL` env points at the in-cluster service name (`https://certctl-server.certctl.svc.cluster.local:8443`).
+
+**Postgres won't start.** PVC permissions. Check `kubectl describe pvc -n certctl certctl-postgres` and confirm the storage class supports `fsGroup`.
+
+## Related docs
+
+- [`tls.md`](tls.md) — cert provisioning patterns + SIGHUP rotation
+- [`security.md`](security.md) — production security posture
+- [`runbooks/disaster-recovery.md`](runbooks/disaster-recovery.md) — Postgres restore + recovery procedures
+- [`docs/archive/upgrades/`](../archive/upgrades/) — version-specific upgrade procedures
@@ -0,0 +1,106 @@
+# Performance Baselines
+
+> Last reviewed: 2026-05-05
+
+Operator-runnable benchmarks for spot-checking certctl performance against published baselines. Useful as a regression detector after upgrades or infra changes.
+
+## Why these specific spots?
+
+certctl's hot paths are dominated by three workloads:
+
+1. **API request handling** — auth, rate-limit decision, route dispatch, DB read
+2. **Renewal scheduler** — periodic scan + dispatch
+3. **Certificate inventory queries** — large list returns with sparse fields
+
+The first three baselines below cover those workloads; Baseline #4 (bulk revoke) is an additional check on a heavy write path.
+
+## Baseline #1: API request handling (single endpoint)
+
+Hit a hot read endpoint with a tight loop and compare against the baseline.
+
+```bash
+SERVER=https://localhost:8443
+CACERT="--cacert ./deploy/test/certs/ca.crt"
+AUTH="Authorization: Bearer change-me-in-production"
+
+# Warm the connection pool (5 requests, discard timing)
+for i in $(seq 1 5); do
+  curl -s $CACERT -H "$AUTH" $SERVER/api/v1/stats/summary > /dev/null
+done
+
+# Measured run: 100 requests; mean latency = reported `real` time / 100
+time (for i in $(seq 1 100); do
+  curl -s $CACERT -H "$AUTH" $SERVER/api/v1/stats/summary > /dev/null
+done)
+```
+
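+**Baseline (M3 MacBook Pro, Docker Desktop):** `real` time under 5 seconds for 100 sequential requests, i.e. a mean of under ~50ms per request. Each iteration starts a new `curl` process and TLS connection, so the figure includes that client-side overhead.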
+
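The `time` loop yields only a total, from which the mean follows. To see per-request numbers, including a median, curl's `-w '%{time_total}'` write-out can be collected and summarized. A minimal sketch; `summarize` and `sample` are hypothetical helper names, and `SERVER`, `CACERT`, and `AUTH` are the variables defined in the block above:

```shell
# summarize: read one latency (seconds) per line on stdin and print the sample
# count, the mean, and the median (lower median when the count is even).
summarize() {
  LC_ALL=C sort -n | LC_ALL=C awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) { print "no samples"; exit 1 }
      printf "n=%d mean=%.3fs p50=%.3fs\n", NR, sum / NR, v[int((NR + 1) / 2)]
    }'
}

# sample N: issue N sequential requests and print curl's total time for each.
# $CACERT is deliberately unquoted so it splits into "--cacert" and the path.
sample() {
  for i in $(seq 1 "$1"); do
    curl -s -o /dev/null -w '%{time_total}\n' $CACERT -H "$AUTH" "$SERVER/api/v1/stats/summary"
  done
}
```

Running `sample 100 | summarize` against a live server prints a single line with the count, mean, and p50.
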
+If you're seeing > 100ms mean, something is wrong: PostgreSQL connection pool exhaustion, an agent flooding the work-poll endpoint, or a mis-tuned rate limiter.
+
+## Baseline #2: Inventory list with cursor pagination
+
+```bash
+# Cursor-paginated full inventory walk (reuses SERVER, CACERT, AUTH from Baseline #1)
+NEXT=""
+PAGES=0
+START=$(date +%s)
+while true; do
+  RESP=$(curl -s $CACERT -H "$AUTH" "$SERVER/api/v1/certificates?limit=100&cursor=$NEXT")
+  NEXT=$(echo "$RESP" | jq -r '.next_cursor // empty')
+  PAGES=$((PAGES + 1))
+  [ -z "$NEXT" ] && break
+done
+END=$(date +%s)
+echo "Walked $PAGES pages in $((END - START))s"
+```
+
+**Baseline:** for the demo dataset (15 certificates, 1 page), under 1 second total. For a 1000-cert inventory (10 pages of 100), under 3 seconds total = ~300ms per page.
+
+If you're seeing > 1s per page on a 1000-cert inventory, the cursor index on `managed_certificates(created_at, id)` is missing or the query plan went wrong.
+
+## Baseline #3: Scheduler tick (renewal scan)
+
+The renewal scheduler runs every hour by default. Force a tick and observe the time-to-completion in the logs:
+
+```bash
+# Trigger an immediate renewal scan via the admin endpoint
+curl -s $CACERT -H "$AUTH" -X POST $SERVER/api/v1/admin/scheduler/run-now/renewal | jq .
+
+# Tail the log and look for the matching `renewal scan complete` line
+docker compose logs -f certctl-server | grep 'renewal'
+```
+
+**Baseline (15-cert demo dataset):** "renewal scan complete" within 100ms of the trigger.
+
+For a 1000-cert inventory: under 5 seconds. The dominant cost is resolving each cert's profile, policy, and alert channel, plus the threshold comparison. If you're seeing > 10 seconds, profile resolution is likely doing N+1 queries.
+
+## Baseline #4: Bulk revoke
+
+```bash
+# Bulk-revoke all certs from a (test) issuer
+CT="Content-Type: application/json"   # assumed header value; the request body is JSON
+TIME=$(date +%s)
+curl -s $CACERT -H "$AUTH" -H "$CT" -X POST $SERVER/api/v1/certificates/bulk-revoke \
+  -d '{"filter":{"issuer_id":"iss-test"},"reason":"superseded"}' | jq .
+echo "Bulk revoke: $(($(date +%s) - TIME))s"
+```
+
+**Baseline:** linear in cert count. For 100 certs from one issuer: under 5 seconds. For 1000 certs: under 30 seconds (dominated by per-cert audit row + per-cert CRL refresh).
+
+## When to re-baseline
+
+After any of:
+
+- Postgres major-version upgrade
+- Go major-version upgrade
+- Significant migration (e.g. adding a column to `managed_certificates`, adding an index)
+- Connection pool config change
+- Changing the renewal scheduler interval
+
+Capture timing in `cowork/loadtest-baselines/<date>.md` so future regressions surface against a real baseline rather than the operator's gut feeling.
+
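Recording a result can be as small as appending one line to a dated file. A minimal sketch, assuming one bullet per measurement is an acceptable layout; `record_baseline` and the entry format are hypothetical, and only the `cowork/loadtest-baselines/<date>.md` path comes from the guidance above:

```shell
# record_baseline LABEL SECONDS [DIR]: append one timing entry to today's baseline
# file, creating the directory and file on first use, then print the file path.
record_baseline() {
  dir=${3:-cowork/loadtest-baselines}
  file="$dir/$(date +%F).md"        # date +%F expands to YYYY-MM-DD
  mkdir -p "$dir"
  printf '%s\n' "- $1: ${2}s" >> "$file"
  echo "$file"
}
```

For example, `record_baseline "api-100-sequential-requests" 4` after the Baseline #1 run.
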
+## Related docs
+
+- [`docs/contributor/ci-pipeline.md`](../contributor/ci-pipeline.md) — CI guard for performance regression
+- [`docs/operator/security.md`](security.md) — rate limit tuning
+- [`docs/reference/architecture.md`](../reference/architecture.md) — request path through handler → service → repository