Commit Graph

13 Commits

Author SHA1 Message Date
shankar0123 374ec574c5 feat(ci): DEPL-005 + DATA-012 — weekly backup/restore smoke + audit-chain round-trip assertion
Acquisition-audit DEPL-005 (backup runbook exists but no CI restore
test) + DATA-012 closure (Sprint 4 ACQ, 2026-05-16).

A backup procedure that has never been restore-tested is not a backup
procedure. The Helm CronJob at deploy/helm/certctl/templates/backup-
cronjob.yaml and the operator runbook at
docs/operator/runbooks/postgres-backup.md both document a
`pg_dump -Fc --no-owner --no-acl`-based backup strategy, but the
dump shape has never been restored end-to-end under CI. This sprint
adds the missing assertion.

Each Monday at 07:00 UTC (1h offset from loadtest.yml's 06:00 slot so
the two jobs don't fight for runners), boot a real postgres:16-alpine
service container pinned to the SAME sha256 digest as
deploy/docker-compose.yml, exercise the audit_events hash chain
with 24 synthetic rows representing an issue/renew/revoke/auth-login
cycle, take a custom-format dump, DROP SCHEMA public CASCADE
(simulating an operator-side data-loss event), pg_restore, and
assert:

  pre.row_count        == post.row_count
  pre.chain_head_hash  == post.chain_head_hash    (BYTE-EXACT)
  post.first_break_id  == ""                      (verify_chain clean)
  post.verifier_walked == pre.row_count           (every row walked)

The chain-head byte-exact assertion is the load-bearing one.
Migration 000047 hashes each row's canonical payload with
`to_char(timestamp AT TIME ZONE 'UTC',
'YYYY-MM-DD"T"HH24:MI:SS.US"Z"')` — any TIMESTAMPTZ-precision loss
in the dump/restore path (a real concern across major Postgres
upgrades or with --format=plain) would corrupt the hash. The point
of testing is to PROVE the property, not to defend against a known
quirk.

Files
=====
- .github/workflows/backup-restore.yml — Mondays 07:00 UTC +
  workflow_dispatch. Postgres service container; Go 1.25.10;
  contents:read; 15-min timeout. Action SHAs pinned to match
  ci.yml's pinning convention.
- deploy/test/backup-restore-smoke.sh — bash orchestrator: preflight
  (postgresql-client + Go + python3 on PATH); wait-for-ready loop;
  DROP SCHEMA + workload + dump + DROP SCHEMA + restore + verify
  + python3 JSON diff. ::error:: prefix on any assertion failure.
  Same script runs unchanged locally against any reachable Postgres.
- deploy/test/backupsmoke/main.go — Go program with --mode=workload
  and --mode=verify. Imports the repo's
  internal/repository/postgres.RunMigrations and emits a small JSON
  snapshot to stdout. INSERT shape mirrors
  internal/repository/postgres/audit_chain_test.go.
- docs/operator/runbooks/postgres-backup.md — adds a 'CI restore
  verification' subsection after the existing quarterly-dry-run
  section, points at the new workflow + harness + smoke program,
  bumps the last-reviewed marker.

Verified locally: gofmt clean, go vet clean, staticcheck clean,
`go build ./deploy/test/backupsmoke` succeeds, bash -n on the shell
harness, python3 -c yaml.safe_load on the workflow, dry-run of the
JSON-diff python block on synthetic pre.json/post.json covers both
PASS and ::error:: paths.
2026-05-16 17:27:57 +00:00
shankar0123 3e09401502 test(ci): TEST-003 — flip Frontend E2E from informational to merge-gate
Sprint 5 unified-master-audit closure. The Phase 8 E2E workflow at
.github/workflows/e2e.yml shipped with continue-on-error: true and
a header banner that said it would be promoted to required-for-merge
once 1-2 weeks of green runs accumulated. The accumulation happened;
the flip didn't.

Ground-truth via api.github.com/repos/certctl-io/certctl/actions/runs
(2026-05-16): 14 consecutive green runs across 2026-05-14 to
2026-05-15 (heaviest Sprint 1-4 frontend churn in the repo's history,
6 commits touching web/**) confirmed the suite is stable. No flakes,
no flaps, no timeouts.

Fix:
  - .github/workflows/e2e.yml continue-on-error: true → false.
  - Workflow name strips the '(informational)' tag.
  - Header banner rewritten to reflect the new posture + flag the
    one operator action still required (adding the job to the
    branch-protection required-checks list at
    https://github.com/certctl-io/certctl/settings/branches).
  - New docs/operator/runbooks/e2e-snapshot-update.md documents the
    visual-regression snapshot-bump workflow now that a red E2E
    run blocks merge. Includes the standard (one or two affected
    tests) + mass-bump (font upgrade / framework migration) paths,
    plus an explicit anti-patterns section (do NOT regenerate from
    a developer's local machine; do NOT add --update-snapshots to
    the always-run step).

Closes TEST-003.
2026-05-16 05:19:38 +00:00
shankar0123 3ce05ab0a8 docs(runbook): DEPL-005 — rewrite postgres-backup automation paths to reference the shipped CronJob
Sprint 3 unified-master-audit closure. docs/operator/runbooks/postgres-backup.md
sections 110-143 still said 'certctl ships no backup CronJob template
in the Helm chart' and the three sample recipes that followed
included an 'in-cluster Postgres → S3' rollup that the operator
'should roll their own.' But the chart actually DOES ship that
CronJob:

  deploy/helm/certctl/templates/backup-cronjob.yaml (Phase 4
  DEPL-H2 closure, 2026-05-14) — opt-in via 'backup.enabled: true',
  PVC + S3 sinks, pg_dump shape byte-comparable with the manual
  command earlier in the runbook.

Operators following the pre-fix runbook would write a duplicate
CronJob from scratch while the working template sat unused under
their nose.

Rewrite of sections 110-143:
  - Lead with the shipped CronJob, two install one-liners (PVC + S3).
  - Move the recipes-by-topology block down to 'When the bundled
    CronJob is NOT the answer' — still call out managed Postgres
    (use provider PITR) and bare-VM Postgres (systemd + pg_dump +
    restic) as deliberately out-of-scope.
  - Add 'Recovery objectives' subsection: RPO ≈ 24h at the default
    nightly schedule, RTO ≈ 30-60min from the existing drill steps
    further down the page. Tells the reader where the bundled
    CronJob fits in their RPO/RTO budget without overpromising
    (anything below 24h RPO needs WAL-shipping, which the CronJob
    doesn't do).
  - Bump '> Last reviewed:' to today.

Closes DEPL-005.
2026-05-16 04:31:31 +00:00
shankar0123 d6f4d5c5e8 deploy(helm): close Phase 4 — chart surface + DR + ops runbooks
Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.

DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
  Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
  pg_dump --format=custom --no-owner --no-acl --dbname=certctl
  matching the canonical shape in
  docs/operator/runbooks/postgres-backup.md (so manual and
  automated dumps are byte-identical). Sink: PVC (default) OR S3
  via aws-cli. Documented as in-cluster-Postgres only — managed DB
  deployments rely on their provider's PITR.

DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
  deploy/helm/certctl/templates/migration-job.yaml — runs
  `certctl-server --migrate-only` before the server Deployment
  rolls. The --migrate-only flag (new in cmd/server/main.go) is a
  hermetic schema-mutation pass: load config, open DB pool, run
  RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
  no signing setup.

  Server's boot-time RunMigrations call is now gated on
  CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
  the boot path (the hook owns the work). Default still runs at
  boot, so Compose / VM / bare-metal deploys are unchanged.

  migrations.viaHook: false in values.yaml (off by default).

DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
  deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
    spec.updateStrategy.type: OnDelete
    spec.podManagementPolicy: OrderedReady
  Operator-controlled Postgres upgrades (the OnDelete strategy
  means a chart template tweak no longer triggers an immediate
  Postgres restart). OrderedReady aligns with the standard
  Postgres-on-Kubernetes pattern for any future HA work.

DEPL-M5 (Med) — per-fleet-size resource ladder documentation
  deploy/helm/certctl/values.yaml — extended comments next to
  server.resources + agent.resources documenting:
    "≤ 500 certs / 100 agents" → defaults are validated
    "5K certs / 1K agents" → starter suggestions, TBD Phase 8
    "50K certs / 10K agents" → starter suggestions, TBD Phase 8
  Numbers for the small-fleet case derive from the measured
  baselines in docs/operator/performance-baselines.md
  (50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
  fleet numbers explicitly marked TBD pending Phase 8 load-test
  runs — operators tune empirically until then.

DEPL-L1 (Low) — Helm rollback runbook
  docs/operator/runbooks/rollback.md — covers helm rollback
  mechanics, the schema-migration manual-cleanup path (when
  *.down.sql files apply vs. when full restore is the only safe
  path), and the per-migration-class safe-to-rollback table.

DEPL-L2 (Low) — Prometheus AlertManager rules
  deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
  monitoring.prometheusRules.enabled=true. Default OFF. Four
  starter rules using verified metric names from
  internal/api/handler/metrics.go:
    CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
    CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
    CertctlJobFailureRateHigh (failure rate over 5% for 15m)
    CertctlIssuanceFailures (any failures over 15m window)
  All thresholds operator-tunable via
  monitoring.prometheusRules.thresholds.* in values.

DEPL-L3 (Low) — Prometheus bearer-token setup runbook
  docs/operator/runbooks/prometheus-bearer-token.md — documents
  the API-key + Secret + values wiring for the RBAC-gated
  /api/v1/metrics/prometheus scrape endpoint. End-to-end
  procedure with troubleshooting steps + rotation guide.

CI guard: scripts/ci-guards/helm-templates-lint.sh
  Six-combo matrix: defaults / backup PVC / backup S3 /
  prometheusRules / migrations.viaHook / all-on. Each runs helm
  template + checks render success. helm lint also gated.
  Wired into the auto-pickup loop in .github/workflows/ci.yml;
  azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
  RED-2) installs helm v3.16.0 on the runner.

Verification (all pass):
  ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
  grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml  # 2 matches
  helm template deploy/helm/certctl/ --set backup.enabled=true \
    --set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
    | grep -E "kind: (CronJob|PrometheusRule|Job)"  # 3 matches
  helm lint deploy/helm/certctl/  # 0 failed
  ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
  bash scripts/ci-guards/helm-templates-lint.sh  # 6/6 matrix combinations pass

Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
2026-05-14 00:58:00 +00:00
shankar0123 69a2b5c55a config: default hardening + operator docs (Phase 2 closure — SEC-H1, SEC-H3, SEC-M4, DEPL-H1, DEPL-M2 + doc-only carve-outs)
Eleven findings from the architecture diligence audit's Phase 2 bundle
closed in one PR. All touch the same backend config + Helm chart +
operator docs surface, so reviewing in one diff is the natural fit.

config.go: three new fail-closed Validate() branches behind sentinels
=====================================================================

Three new error sentinels exported from internal/config/config.go for
tests to pin via errors.Is + message-text:
  - ErrAgentBootstrapTokenRequired (SEC-H1)
  - ErrACMEInsecureWithoutAck      (SEC-M4)
  - ErrDemoModeAckExpired          (SEC-H3)

SEC-H1 (staged): introduces CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY
as an opt-in feature flag. When true AND the bootstrap token is empty,
Validate() returns ErrAgentBootstrapTokenRequired and the server
refuses to start. Default in THIS release: false (warn-mode
pass-through preserved). WORKSPACE-ROADMAP.md schedules the default
flip to true for v2.2.0 — operators get one upgrade window.

SEC-M4: upgrades the existing boot-time WARN log for
CERTCTL_ACME_INSECURE=true into a hard refuse-to-start gate behind
CERTCTL_ACME_INSECURE_ACK=true. The ACK env var must be paired with
the existing INSECURE flag; either alone fails closed. The boot-time
WARN log at cmd/server/main.go:611 continues to fire for the ACK'd
case so every restart logs the reminder.

SEC-H3: tightens the sticky DemoModeAck bit so it expires after 24h.
When DemoModeAck=true, Validate() now requires CERTCTL_DEMO_MODE_ACK_TS
to be set as a unix-epoch timestamp within the last 24h (24h-tolerance
on the past side, 1-minute clock-skew on the future side). Catches the
"forgotten demo deployment promoted to production" failure mode —
next container restart past 24h refuses unless re-ack'd.

Tests in internal/config/config_test.go cover every new branch:
positive (passes when properly set), negative (each fail-closed path
fires with the matching sentinel + message-text). 11 new tests added.

Helm chart + HA runbook (DEPL-H1)
=================================

Created docs/operator/runbooks/ha.md documenting the three values
flips required for production HA: server.replicas, podDisruptionBudget,
service.sessionAffinity. Cross-link comments added to
deploy/helm/certctl/values.yaml next to the server.replicas (line 19)
and podDisruptionBudget (line 566) defaults. DEFAULTS DO NOT CHANGE
— that's the point per the prompt's 'do not flip networkPolicy default'
guidance: a default-enabled PDB blocks fresh helm install on
single-node clusters.

CI guard (DEPL-M2)
==================

scripts/ci-guards/no-change-me-in-prod-compose.sh grep-fails any
'change-me-' literal in compose files OTHER than docker-compose.demo.yml.
Catches the placeholder-credential-leak regression one layer earlier
than the runtime Validate() fail-closed guards from Bundle 2 (2026-05-12).
Excludes comment lines so docs explaining the pattern don't trip the
guard. Verified to fire on a synthetic leak; clean on the current tree.

Consolidated 'Security carve-outs' doc section
==============================================

docs/operator/security.md grows by one new section documenting the
seven existing carve-outs in one canonical place:
  - SEC-M3: 3 InsecureSkipVerify=true sites (Agent dev, verify probe, tlsprobe)
  - SEC-M5: F5 connector InsecureSkipVerify per-config field
  - SEC-M4: ACME insecure + new ACK gate
  - SEC-L1: CSP 'unsafe-inline' on style-src (Tailwind carve-out)
  - SEC-L2: break-glass Argon2id rest-defense reminder
  - SEC-L3: 1 MB body-size cap + CERTCTL_MAX_BODY_SIZE override
  - DEPL-M2: change-me-* placeholder credentials in demo overlay
  - DEPL-M3: K8s NetworkPolicy operator-opt-in default

Each entry cites the file:line, the rationale for the carve-out, and
the operator action.

CHANGELOG + ENVIRONMENTS coverage
==================================

CHANGELOG.md grows by one new '### Breaking changes (scheduled for
v2.2.0)' section under Unreleased, documenting SEC-H1 / SEC-M4 / SEC-H3
with explicit upgrade-window guidance for each.

deploy/ENVIRONMENTS.md adds five rows: AGENT_BOOTSTRAP_TOKEN +
AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY + DEMO_MODE_ACK + DEMO_MODE_ACK_TS +
ACME_INSECURE_ACK. G-3 env-docs-drift CI guard stays clean.

WORKSPACE-ROADMAP.md (cowork-side) schedules the SEC-H1 default-flip
for v2.2.0.

Sandbox limitation
==================

The certctl repo's working tree is 6.1 GB which fills the sandbox
volume; the go1.25.10 toolchain download (go.mod requires it,
sandbox has 1.25.9) keeps failing on disk-full. Local 'go build' /
'go test' were NOT run in this commit's verification path.
make verify MUST be run on the operator's workstation before push
per CLAUDE.md operating rules.

CI guards (no-change-me, G-3 env-docs-drift, doc-rot-detector, +
all existing) verified clean by running each individually.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-SEC-H1,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-H3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M4,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H1,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M2,
        cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M3,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-M5,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L1,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L2,
        cowork/certctl-architecture-diligence-audit.html#fix-SEC-L3
2026-05-13 19:50:00 +00:00
shankar0123 57b539c378 docs(b12): observability reference + Postgres backup runbook
Closes acquisition-diligence Bundle 12 — Observability, DR,
Operations Receipts, And Performance Proof. Source IDs: D5, D6, D8,
T9, finding 7, OPS-H1, OPS-M1, OPS-M2, LOW-7.

Two new operator-facing references; both non-audit-framed per the
Bundle 5 doc-placement policy.

docs/operator/observability.md — single canonical statement of what
certctl emits, what it doesn't, and what survives a restart:
  - Metrics surface: both /api/v1/metrics (JSON) and
    /api/v1/metrics/prometheus (text exposition v0.0.4); inventory of
    certctl_certificate_* gauges + certctl_issuance_duration_seconds
    per-issuer-type histogram + certctl_uptime_seconds.
  - Prometheus library vs hand-rolled exposition: explicit scope
    statement — hand-rolled fmt.Fprintf is intentional for v2.x given
    the shallow metric surface; client_golang migration tracked as
    v3 item (closes OPS-M1).
  - Tracing: explicit deferral — no OTel SDK setup, OTel packages
    are indirect-only in go.mod, no spans, no OTLP exporter; tracked
    as v3 item; in the meantime structured logs carry request_id and
    certctl_issuance_duration_seconds carries the per-issuer latency
    signal (closes OPS-M2).
  - Logging: structured JSON via log/slog; CERTCTL_LOG_LEVEL control;
    no key material / bearer tokens / session cookies in log lines.
  - Rate-limit semantics under restarts + replicas: per-process,
    in-memory, reset-on-restart, NOT shared across replicas; full
    inventory of the 5 limiter call sites (break-glass login,
    SCEP/Intune per-device, EST per-principal CSR, EST HTTP-Basic
    source-IP, ACME per-account); multi-replica + sticky-session
    implications; database-backed sliding window deferred to v3
    (closes D8).
  - Performance harness scope: cross-references the explicit
    'What it explicitly does NOT measure' list in
    deploy/test/loadtest/README.md (closes LOW-7 + finding 7).

docs/operator/runbooks/postgres-backup.md — operator-runnable
backup procedure:
  - Inventory of what to back up (DB + operator-managed file
    material that lives outside the DB: CA keys, RA keys, OCSP
    responder keys, trust bundles).
  - Logical backup recipe with docker-compose + Kubernetes variants,
    integrity verification step, off-host storage step.
  - Physical / PITR recipe pointing at pgbackrest / wal-g
    (certctl ships nothing here — standard PostgreSQL DBA work).
  - Three sample automation paths (in-cluster Postgres → S3 CronJob,
    managed Postgres PITR, self-hosted VM systemd timer + restic).
  - Quarterly restore-dry-run procedure.
  - Helm CronJob template deliberately not shipped — three
    documented reasons (deployment topology / secret-management
    integration / off-host storage all vary by operator) plus
    roadmap entry for shipping a starter template when a real
    operator asks for one (closes D6 + OPS-H1).

Both new docs wired into docs/README.md Operator + Runbooks tables.

D5 (ServiceMonitor) and T9 (canonical k6 load-test) were already
shipped in Bundle 3 (deploy/helm/certctl/templates/servicemonitor.yaml)
and in deploy/test/loadtest/ + .github/workflows/loadtest.yml
respectively; this bundle doesn't touch them — it just records the
closure in the audit HTML.

Verified:
  bash scripts/ci-guards/G-3-env-docs-drift.sh    # PASS
  bash scripts/ci-guards/doc-rot-detector.sh      # PASS
  All 35 scripts/ci-guards/*.sh green.
2026-05-13 02:09:11 +00:00
shankar0123 476022ca59 docs(b6): secret-custody reference + config-encryption upgrade runbook + private-key CI guard
Closes acquisition-diligence Bundle 6 findings on secret custody, config
encryption, and local artifact hygiene. Source IDs: S6, R4, SEC-M2,
RT-M1, RT-M2, RT-L1.

Surgical closures (artifact-only audit-framed memos stay out of the
public repo per the Bundle 5 lesson):

R4 / RT-L1 — local EC private key artifact
  rm cmd/agent/mc-001.key (gitignored, never in git history, leftover
  from a 2025-era agent dev run on the operator's workstation).
  Added scripts/ci-guards/B6-no-private-keys-in-tree.sh that fails the
  build if any TRACKED non-test file contains a PEM private-key block,
  so the next attempt to commit similar material gets caught at CI.
  Allowlist: *_test.go (hermetic-test PEMs), examples/*.md (sample
  walkthroughs), internal/scep/intune/testdata/ (certificates, not
  keys).

RT-M1 — landing-page HSM implication
  certctl.io/index.html: 'their hardware' / 'your hardware' colloquial
  comparisons rephrased to 'their custody' / 'your servers'. The phrase
  'Your keys. Your hardware. Your data. Your terms.' becomes 'Your
  keys. Your servers. Your data. Your terms.' to remove any inferred
  HSM-backed key-storage claim. The technical disclosure now lives in
  docs/operator/secret-custody.md (linked below); the landing page no
  longer makes a claim it cannot back.

S6 + SEC-M2 + RT-M2 (composite documentation closure)
  Added docs/operator/secret-custody.md — public operator reference
  enumerating every secret material on the control plane and on
  agents:
    - Local CA private key (FileDriver, file-on-disk, heap-resident
      with the L-014 carve-out documented in
      internal/connector/issuer/local/local.go).
    - Agent ECDSA P-256 keys (file on agent host, never transmitted).
    - OIDC client secret (AES-256-GCM v3, PBKDF2 600k).
    - Session signing key (same encryption regime).
    - Break-glass credential (Argon2id, never encrypted).
    - API-key bearer tokens (SHA-256 hash only; plaintext shown once).
    - CSR private keys mid-issuance (agent memory only).
    - Issuer-connector backend secrets (encrypted_config column,
      fail-closed for source='database', plaintext-by-design for
      source='env' with rationale).
  The Env-seeded-vs-DB-seeded plaintext policy is explained in plain
  text so a buyer review can independently verify the startup guard at
  cmd/server/main.go:222-262 makes sense.

  Added docs/operator/runbooks/config-encryption-upgrade.md — the
  procedural arm: how to force v1/v2 -> v3 re-seal across the
  database, plus the passphrase-rotation order. Documents the
  AEAD-driven read fallback (v3 -> v2 -> v1) and the fact that
  re-sealing happens passively on UPDATE. Open roadmap item: a
  certctl admin reseal --all command (tracked in
  WORKSPACE-ROADMAP.md).

  Both docs wired into docs/README.md Operator + Runbooks tables.

Verification:
  rg -n 'CONFIG_ENCRYPTION|encrypt|v1|private key|HSM|PKCS11|mc-001.key|\.key|Local CA' \
     internal cmd docs .gitignore README.md   # ambient (no NEW leaks)
  find . -name '*.key' \
     -not -path './.git/*' -not -path './web/node_modules/*'   # empty
  git ls-files | xargs grep -lE 'BEGIN .* PRIVATE KEY' \
     | grep -vE '_test\.go$|^examples/|^internal/scep/intune/testdata/'   # empty
  bash scripts/ci-guards/B6-no-private-keys-in-tree.sh   # PASS
  bash scripts/ci-guards/G-3-env-docs-drift.sh           # PASS
  bash scripts/ci-guards/doc-rot-detector.sh             # PASS

Residual roadmap (deliberately deferred):
  - signer.PKCS11Driver (HSM-token-backed CA-key custody).
  - signer.CloudKMSDriver (AWS/GCP/Azure KMS-backed CA-key custody).
  - FIPS 140-3 mode for the whole control plane.
  - HSM-backed session signing key.
  - Built-in 'certctl admin reseal --all' command.
  All five tracked in WORKSPACE-ROADMAP.md, not retracted.
2026-05-13 01:48:40 +00:00
shankar0123 56e2ea1ad7 docs: v2.1.0 release polish — strip internal bundle/phase tags, update status for OIDC ship
README:
- Rewrite Status block: drop the stale 'federated identity not yet
  shipped' line; flag v2.1.0 OIDC + sessions + back-channel logout
  + break-glass as early-access; encourage GitHub issues for IdP
  rough edges. (A1 framing — keep early-access umbrella, no
  SAML/WebAuthn/JIT roadmap teaser.)
- Add OIDC SSO bullet to 'What it does' covering per-IdP runbooks,
  group-claim → role mapping, AES-256-GCM client_secret encryption,
  JWKS auto-refresh, PKCE-S256, RFC 9700 §4.7.1 pre-login binding,
  RFC 9207 iss check, __Host- cookies, CSRF rotation, idle+absolute
  expiry, BCL, break-glass admin.
- Update Security paragraph: three auth paths (API keys / OIDC /
  break-glass), HMAC-signed sessions, CSRF rotation, RFC OIDC BCL.
- Correct CI coverage thresholds against
  .github/coverage-thresholds.yml (service 70%, handler 75%,
  crypto 88%, auth packages 85-95%); 'static analysis' replaces
  the inflated '11 linters' claim (actual count is 4 active).

Docs B3 sweep — strip operator-facing 'Bundle N' / 'Phase N' tags:
- docs/operator/auth-threat-model.md — rewrite intro; rename 5 H2
  sections (API-key + RBAC defenses / OIDC + sessions + break-glass
  defenses / OIDC + sessions threat catalogue / Closed federated-
  identity threats / Future-work threats); clean ~12 H3/prose hits.
- docs/operator/rbac.md — strip Bundle 1 framing from intro,
  scope_id deferral note, MCP tools section, day-0 bootstrap, and
  'Where to look next'.
- docs/operator/auth-benchmarks.md — drop 'Phase 14' framing from
  title intro, hardware floor caption, result table caption,
  methodology, and pre-merge audit section.
- docs/operator/security.md — already cleaned earlier this session
  (RBAC / day-0 / approval-bypass / OIDC federation / sessions /
  OIDC first-admin / break-glass H3s).
- docs/operator/oidc-runbooks/{index,keycloak,authentik,okta,
  azure-ad}.md — strip Auth Bundle 2 framing + Phase 10/3/4
  references; replace with feature-name prose.
- docs/operator/legacy-clients-tls-1.2.md — drop Bundle F / M-023
  audit-reference framing; keep CWE-326.
- docs/operator/database-tls.md — drop Bundle B / M-018 framing
  from intro + Helm section.
- docs/operator/runbooks/disaster-recovery.md — drop 'Production
  hardening II Phase 10' status callout.
- docs/migration/oidc-enable.md — retitle 'Enable OIDC SSO';
  strip Bundle 1/2 framing from prereqs, troubleshooting, related
  docs; update __Host- cookie callout from 'audit MED-14' to
  v2.1.0-BREAKING.
- docs/migration/api-keys-to-rbac.md — strip Bundle 1 framing from
  intro, migration table, IsAdmin section, and cross-references.
- docs/migration/acme-from-cert-manager.md — strip residual
  'Phase 5' tags from cert-manager integration test references.
- docs/reference/configuration.md — retitle Auth section.
- docs/reference/profiles.md — strip Bundle 1 Phase 9 framing
  from RequiresApproval section + Related list.
- docs/reference/auth-standards-implemented.md — rewrite intro
  (API-key + RBAC + OIDC + sessions + back-channel logout +
  break-glass); rename 'Bundle 1 (RBAC) standards covered
  separately' H2; clean per-row Phase references.
- docs/README.md — rewrite nav-table entries to drop Bundle 1/2
  parentheticals; retitle 'Enable OIDC SSO' migration entry.

No code or test changes; pure operator-facing prose polish for
the v2.1.0 tag.
2026-05-11 16:54:07 +00:00
shankar0123 75097909e9 2026-05-05 18:18:29 +00:00
shankar0123 d809874fa1 docs: retire compliance subtree + sweep framework name-drops from prose
Per operator decision the framework-mapping docs are gone. They
were aspirational (no audit, no certification, no validated
mapping); keeping them around was misleading.

Files deleted (1,883 lines):
- docs/compliance/index.md
- docs/compliance/soc2.md
- docs/compliance/pci-dss.md
- docs/compliance/nist-sp-800-57.md

Hyperlinks removed:
- README.md: 'Auditor / compliance' row in the doc table; the
  '(compliance mapping included)' parenthetical in the
  positioning paragraph
- docs/README.md: the '## Compliance' section table; the
  'Auditor / compliance team' reading-order-by-role row

Prose name-drops swept across 24 files:
- README.md: 'FedRAMP boundary CAs / financial-services policy
  CAs' → '4-level boundary CAs / 3-level policy CAs';
  'Compliance-grade for PCI-DSS Level 1, FedRAMP Moderate / High,
  SOC 2 Type II, HIPAA' → cut entirely
- getting-started/{quickstart,concepts,examples,why-certctl,
  advanced-demo}.md: 'compliance' → 'audit' / 'policy';
  'PCI-DSS / SOC 2 / NIST SP 800-57' framework lists cut;
  ''pci': 'true'' tag example → ''environment': 'production''
- migration/cert-manager-coexistence.md: 'compliance rules' →
  'policy rules'
- operator/approval-workflow.md: 'Compliance customers (PCI-DSS
  Level 1, FedRAMP Moderate / High, SOC 2 Type II, HIPAA)' →
  'Operators'; entire 'Compliance control mapping' table
  (PCI-DSS §6.4.5 / NIST SP 800-53 SA-15 / SOC 2 Type II CC6.1
  / HIPAA §164.308(a)(4)) deleted; 'compliance contract' →
  'two-person-integrity contract'; 'compliance auditors' →
  'reviewers'
- operator/legacy-clients-tls-1.2.md: 'PCI-DSS v4.0 Req 4 §2.2.5'
  audit-reference → CWE-326 (kept); 'PCI-DSS Req 4 §2.2.5
  attestation' section retitled to 'TLS posture summary' and
  rewritten without framework framing; 'PCI-DSS, NIST, and
  major browsers will eventually deprecate TLS 1.2' →
  'Major browsers and OS vendors will eventually deprecate
  TLS 1.2'
- operator/database-tls.md: PCI-DSS Req 4 §2.2.5 audit-ref →
  CWE-319 only; 'PCI-DSS scope' → 'sensitive data'; PCI-DSS
  Req 4 v4.0 prose footing → cut
- operator/runbooks/disaster-recovery.md: 'SOC 2 / PCI
  procurement-team deliverable' → 'on-call deliverable';
  'compliance auditors' → 'reviewers'
- reference/connectors/{acme,aws-acm,azure-kv,globalsign,
  local-ca,openssl,ssh,index}.md: 'compliance reporting
  (PCI-DSS §3.6, HIPAA §164.312)' → 'audit reporting';
  'Compliance environments (PCI-DSS Level 1, FedRAMP High,
  HIPAA)' → 'Regulated environments'; 'compliance audits' →
  'audit'; 'FedRAMP boundary CA' pattern names →
  '4-level boundary CA' (technically descriptive)
- reference/protocols/est.md: 'compliance-hook seam' →
  'device-state hook seam'; 'compliance gating' → 'device-state
  gating'; 'est_compliance_failed' → 'est_device_state_failed'
- reference/protocols/scep-intune.md: 'Optional compliance
  check' → 'Optional device-state check'; failure-counter
  'compliance_failed' → 'device_state_failed'; 'Conditional
  Access compliance gating' → 'Conditional Access
  device-state gating'
- reference/intermediate-ca-hierarchy.md: 'FedRAMP boundary-CA
  deployments where the regulator requires...' →
  'Boundary-CA deployments where you want separation of policy
  and issuing authorities'; pattern A retitled '4-level FedRAMP
  boundary CA' → '4-level boundary CA'
- reference/architecture.md: broken Related-docs link to
  compliance.md removed; the rest of that block had stale
  pre-Phase-2 paths (quickstart.md, demo-advanced.md,
  connectors.md, openapi.md, testing-guide.md, test-env.md) —
  retargeted to current locations
- reference/deployment-model.md: 'SOC 2 evidence-report
  generator' → 'Audit-evidence report generator'
- reference/vendor-matrix.md: 'SOC 2 / PCI auditors paste this
  into evidence packs' → 'reviewers paste this into
  vendor-evaluation packs'
- contributor/qa-test-suite.md: 'compliance exist' coverage
  description cut; 'Compliance (PCI / SOC2 / HIPAA-relevant)'
  risk-class label → 'Audit-relevant'

What was kept:
- CWE references (legitimate technical pointers)
- Microsoft API/feature names that happen to use 'compliance'
  literally ('Microsoft Graph compliance API',
  'device-compliance validators' — these are MS product names,
  not framework name-drops)
- 'NIST PQC' on the landing page (Post-Quantum Cryptography is
  the actual NIST standard family, not a compliance framework)

Verified: zero hyperlinks into docs/compliance/ remain. All 24
ci-guards/*.sh pass locally. qa-doc-seed-count.sh clean.
Net diff: 26 files / -1,883 deletions in compliance/ + -32 net
across the prose sweep.

Companion edits in cowork/ (CLAUDE.md doc-tree summary +
WORKSPACE-CHANGELOG.md retirement note) land separately.
2026-05-05 05:26:44 +00:00
shankar0123 12d7b1f51d docs: Phase 11 follow-on — fix inter-doc cross-references in deeper subdirs
Per Phase 1 audit at cowork/docs-overhaul-phase-1-audit-2026-05-04/.
Continuation of Phase 11 (commit dca1900 handled README + first round
of docs/ links). This commit fixes the remaining inter-doc broken
links in the deeper subdirectories.

Per source directory:

  docs/getting-started/quickstart.md (1 fix):
    (connectors.md) → (../reference/connectors/index.md)

  docs/contributor/test-environment.md (2 fixes):
    (tls.md) → (../operator/tls.md)
    (upgrade-to-tls.md) → (../archive/upgrades/to-tls-v2.2.md)

  docs/contributor/testing-strategy.md (4 fixes):
    `docs/security.md` → `docs/operator/security.md`
    (security.md) → (../operator/security.md)
    `docs/testing-guide.md` (kept; testing-guide.md still at top level
      pending Phase 5 prune)
    (testing-guide.md) → (../testing-guide.md)

  docs/migration/acme-from-traefik.md (2 sites, multi-link):
    (./acme-cert-manager-walkthrough.md) → (./acme-from-cert-manager.md)
    (./acme-server.md) → (../reference/protocols/acme-server.md)

  docs/migration/cert-manager-coexistence.md (1 fix):
    (./quickstart.md) → (../getting-started/quickstart.md)

  docs/migration/from-acmesh.md (2 fixes):
    (connectors.md) → (../reference/connectors/index.md)
    (./examples.md) → (../getting-started/examples.md)

  docs/migration/acme-from-caddy.md (multi-link):
    (./acme-cert-manager-walkthrough.md) → (./acme-from-cert-manager.md)
    (./acme-server.md) → (../reference/protocols/acme-server.md)

  docs/migration/acme-from-cert-manager.md (multi-link):
    (./acme-server.md) → (../reference/protocols/acme-server.md)
    (./acme-server-threat-model.md) → (../reference/protocols/acme-server-threat-model.md)
    (./acme-caddy-walkthrough.md) → (./acme-from-caddy.md)
    (./acme-traefik-walkthrough.md) → (./acme-from-traefik.md)

  docs/migration/from-certbot.md (2 fixes):
    (./concepts.md) → (../getting-started/concepts.md)
    (./examples.md) → (../getting-started/examples.md)

  docs/operator/tls.md (3 sites):
    (upgrade-to-tls.md) → (../archive/upgrades/to-tls-v2.2.md)
    (quickstart.md) → (../getting-started/quickstart.md)
    (test-env.md) → (../contributor/test-environment.md)

  docs/operator/runbooks/disaster-recovery.md (5 fixes):
    (crl-ocsp.md) → (../../reference/protocols/crl-ocsp.md)
    (tls.md) → (../../operator/tls.md)
    (security.md) → (../../operator/security.md)
    (scep-intune.md) → (../../reference/protocols/scep-intune.md)
    (est.md) → (../../reference/protocols/est.md)

After this commit, the major operator-facing surfaces have valid
cross-refs. Some lower-traffic docs (compliance/soc2.md, compliance/
nist-sp-800-57.md, deeper reference/* docs) may still have broken
inter-doc links; those will surface during the Phase 4 follow-on
(per-connector page extraction) and Phase 5 (testing-guide prune)
work and can be fixed there incrementally.
2026-05-05 03:31:05 +00:00
shankar0123 19c8fafe84 docs: Phase 14 — Last reviewed line sweep across docs/
Per Phase 1 audit at cowork/docs-overhaul-phase-1-audit-2026-05-04/.
Adds a `> Last reviewed: 2026-05-05` line right after the H1 heading
of every doc that didn't already have one (41 files).

This dates the freshness clock for the future Phase 4 per-doc review.
The discipline going forward: when a doc's content gets a meaningful
edit, bump the date. When the date gets old (e.g., >6 months), the
doc earns a freshness-review pass.

Mechanical insertion via awk one-liner, applied to every docs/*.md
that didn't already match `grep -q 'Last reviewed:'`. Files that
already carried the line from earlier Phase 2 work (the navigation
index, the new connector docs, the new SCEP server / legacy-clients-
TLS-1.2 / release-verification docs, and the 5 per-connector deep
dives) were skipped to avoid duplicate insertion.

Net: every doc in docs/ now has a Last reviewed line.
2026-05-05 03:26:46 +00:00
shankar0123 3a807ae37e docs: Phase 2 mechanical file moves to subdirectory structure
Pure git mv operations; no content edits. Internal links remain pointing
at old paths and will be fixed in Phase 11. Per the Phase 1 audit
recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.

35 files moved across 8 audience-organized subdirectories:

  docs/getting-started/ (5):
    quickstart.md, concepts.md, examples.md, advanced-demo.md (was
    demo-advanced.md), why-certctl.md

  docs/reference/ (6):
    architecture.md, api.md (was openapi.md), mcp.md,
    intermediate-ca-hierarchy.md, deployment-model.md (was
    deployment-atomicity.md), vendor-matrix.md (was
    deployment-vendor-matrix.md)

  docs/reference/protocols/ (6):
    acme-server.md, acme-server-threat-model.md, scep-intune.md,
    est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)

  docs/operator/ (4):
    security.md, tls.md, database-tls.md, approval-workflow.md

  docs/operator/runbooks/ (3):
    cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md
    (was runbook-expiry-alerts.md), disaster-recovery.md

  docs/migration/ (3):
    from-certbot.md (was migrate-from-certbot.md), from-acmesh.md
    (was migrate-from-acmesh.md), cert-manager-coexistence.md (was
    certctl-for-cert-manager-users.md)

  docs/compliance/ (4):
    index.md (was compliance.md), soc2.md (was compliance-soc2.md),
    pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was
    compliance-nist.md)

  docs/contributor/ (4):
    testing-strategy.md, test-environment.md (was test-env.md),
    ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)

Deferred to later Phase 2 sub-phases:
  - connectors.md split (Phase 4): docs/connectors.md +
    docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
  - testing-guide.md prune (Phase 5): docs/testing-guide.md still
    at top level
  - features.md disperse (Phase 6): docs/features.md still at top
    level
  - legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md
    still at top level
  - ACME walkthrough re-homing (Phase 8): three
    docs/acme-*-walkthrough.md still at top level
  - Upgrade docs archive (Phase 3): two docs/upgrade-*.md still
    at top level

Cross-reference updates (Phase 11) will happen after all moves and
content edits land. Internal links to docs/* paths are temporarily
broken until that phase completes.
2026-05-05 02:49:28 +00:00