certctl/docs/operator/approval-workflow.md
shankar0123 d809874fa1 docs: retire compliance subtree + sweep framework name-drops from prose
Per operator decision the framework-mapping docs are gone. They
were aspirational (no audit, no certification, no validated
mapping); keeping them around was misleading.

Files deleted (1,883 lines):
- docs/compliance/index.md
- docs/compliance/soc2.md
- docs/compliance/pci-dss.md
- docs/compliance/nist-sp-800-57.md

Hyperlinks removed:
- README.md: 'Auditor / compliance' row in the doc table; the
  '(compliance mapping included)' parenthetical in the
  positioning paragraph
- docs/README.md: the '## Compliance' section table; the
  'Auditor / compliance team' reading-order-by-role row

Prose name-drops swept across 24 files:
- README.md: 'FedRAMP boundary CAs / financial-services policy
  CAs' → '4-level boundary CAs / 3-level policy CAs';
  'Compliance-grade for PCI-DSS Level 1, FedRAMP Moderate / High,
  SOC 2 Type II, HIPAA' → cut entirely
- getting-started/{quickstart,concepts,examples,why-certctl,
  advanced-demo}.md: 'compliance' → 'audit' / 'policy';
  'PCI-DSS / SOC 2 / NIST SP 800-57' framework lists cut;
  ''pci': 'true'' tag example → ''environment': 'production''
- migration/cert-manager-coexistence.md: 'compliance rules' →
  'policy rules'
- operator/approval-workflow.md: 'Compliance customers (PCI-DSS
  Level 1, FedRAMP Moderate / High, SOC 2 Type II, HIPAA)' →
  'Operators'; entire 'Compliance control mapping' table
  (PCI-DSS §6.4.5 / NIST SP 800-53 SA-15 / SOC 2 Type II CC6.1
  / HIPAA §164.308(a)(4)) deleted; 'compliance contract' →
  'two-person-integrity contract'; 'compliance auditors' →
  'reviewers'
- operator/legacy-clients-tls-1.2.md: 'PCI-DSS v4.0 Req 4 §2.2.5'
  audit-reference → CWE-326 (kept); 'PCI-DSS Req 4 §2.2.5
  attestation' section retitled to 'TLS posture summary' and
  rewritten without framework framing; 'PCI-DSS, NIST, and
  major browsers will eventually deprecate TLS 1.2' →
  'Major browsers and OS vendors will eventually deprecate
  TLS 1.2'
- operator/database-tls.md: PCI-DSS Req 4 §2.2.5 audit-ref →
  CWE-319 only; 'PCI-DSS scope' → 'sensitive data'; PCI-DSS
  Req 4 v4.0 prose footing → cut
- operator/runbooks/disaster-recovery.md: 'SOC 2 / PCI
  procurement-team deliverable' → 'on-call deliverable';
  'compliance auditors' → 'reviewers'
- reference/connectors/{acme,aws-acm,azure-kv,globalsign,
  local-ca,openssl,ssh,index}.md: 'compliance reporting
  (PCI-DSS §3.6, HIPAA §164.312)' → 'audit reporting';
  'Compliance environments (PCI-DSS Level 1, FedRAMP High,
  HIPAA)' → 'Regulated environments'; 'compliance audits' →
  'audit'; 'FedRAMP boundary CA' pattern names →
  '4-level boundary CA' (technically descriptive)
- reference/protocols/est.md: 'compliance-hook seam' →
  'device-state hook seam'; 'compliance gating' → 'device-state
  gating'; 'est_compliance_failed' → 'est_device_state_failed'
- reference/protocols/scep-intune.md: 'Optional compliance
  check' → 'Optional device-state check'; failure-counter
  'compliance_failed' → 'device_state_failed'; 'Conditional
  Access compliance gating' → 'Conditional Access
  device-state gating'
- reference/intermediate-ca-hierarchy.md: 'FedRAMP boundary-CA
  deployments where the regulator requires...' →
  'Boundary-CA deployments where you want separation of policy
  and issuing authorities'; pattern A retitled '4-level FedRAMP
  boundary CA' → '4-level boundary CA'
- reference/architecture.md: broken Related-docs link to
  compliance.md removed; the rest of that block had stale
  pre-Phase-2 paths (quickstart.md, demo-advanced.md,
  connectors.md, openapi.md, testing-guide.md, test-env.md) —
  retargeted to current locations
- reference/deployment-model.md: 'SOC 2 evidence-report
  generator' → 'Audit-evidence report generator'
- reference/vendor-matrix.md: 'SOC 2 / PCI auditors paste this
  into evidence packs' → 'reviewers paste this into
  vendor-evaluation packs'
- contributor/qa-test-suite.md: 'compliance exist' coverage
  description cut; 'Compliance (PCI / SOC2 / HIPAA-relevant)'
  risk-class label → 'Audit-relevant'

What was kept:
- CWE references (legitimate technical pointers)
- Microsoft API/feature names that happen to use 'compliance'
  literally ('Microsoft Graph compliance API',
  'device-compliance validators' — these are MS product names,
  not framework name-drops)
- 'NIST PQC' on the landing page (Post-Quantum Cryptography is
  the actual NIST standard family, not a compliance framework)

Verified: zero hyperlinks into docs/compliance/ remain. All 24
ci-guards/*.sh pass locally. qa-doc-seed-count.sh clean.
Net diff: 26 files / -1,883 deletions in compliance/ + -32 net
across the prose sweep.

Companion edits in cowork/ (CLAUDE.md doc-tree summary +
WORKSPACE-CHANGELOG.md retirement note) land separately.
2026-05-05 05:26:44 +00:00


# Issuance approval workflow
> Last reviewed: 2026-05-05
certctl can gate certificate issuance + renewal on a per-profile, two-person-integrity check. Operators configure this on production-tier `CertificateProfile` rows so every renewal-loop tick or manual `POST /api/v1/certificates/{id}/renew` blocks at `JobStatusAwaitingApproval` until a different actor approves.
This closes the procurement-checklist question "How do you enforce two-person integrity on cert issuance?" Without this surface the answer is "we don't"; with `requires_approval=true` on the profile, the answer is "here's the RBAC contract, and here's the audit query that proves bypass mode is off in production."
## End-to-end flow
```mermaid
sequenceDiagram
autonumber
participant A as Operator A<br/>(or scheduler)
participant SVC as CertificateService<br/>.TriggerRenewal
participant JOB as Job + ApprovalRequest
participant B as Operator B
participant APR as ApprovalService.Approve
participant SCH as Scheduler
A->>SVC: POST /api/v1/certificates/{id}/renew<br/>(or renewal-loop tick)
SVC->>JOB: read profile.RequiresApproval;<br/>create Job @ JobStatusAwaitingApproval;<br/>create ApprovalRequest<br/>(state=pending, requested_by=Operator A)
Note over JOB,SCH: Scheduler skips —<br/>AwaitingApproval is NOT a dispatchable status
B->>JOB: GET /api/v1/approvals?state=pending
B->>APR: POST /api/v1/approvals/{id}/approve<br/>(decided_by=Operator B, note=...)
APR->>APR: RBAC: reject if Operator B == Operator A<br/>→ ErrApproveBySameActor (HTTP 403)
APR->>JOB: ApprovalRequest → state=approved;<br/>Job AwaitingApproval → Pending;<br/>audit row (action=approval_approved,<br/>actor=Operator B);<br/>certctl_approval_decisions_total<br/>{outcome=approved,profile_id=...}++
SCH->>JOB: pick up Pending → dispatch to issuer connector
JOB-->>A: cert issues normally
```
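The transitions in the diagram, together with the reject and expiry paths described later on this page, can be summarized as a small lookup. The following is a minimal Go sketch, not certctl's actual code: the type, constant, and function names are hypothetical, and only the status and state values are taken from this document. Treating `Pending` as the only dispatchable status is a simplification for the sketch.

```go
package main

import "fmt"

// Hypothetical types; only the string values come from this document.
type JobStatus string
type ApprovalState string

const (
	JobAwaitingApproval JobStatus = "AwaitingApproval"
	JobPending          JobStatus = "Pending"
	JobCancelled        JobStatus = "Cancelled"
)

const (
	StatePending  ApprovalState = "pending"
	StateApproved ApprovalState = "approved"
	StateRejected ApprovalState = "rejected"
	StateExpired  ApprovalState = "expired"
)

// nextJobStatus maps an approval state onto the status of the linked job.
func nextJobStatus(s ApprovalState) (JobStatus, error) {
	switch s {
	case StatePending:
		return JobAwaitingApproval, nil
	case StateApproved:
		return JobPending, nil
	case StateRejected, StateExpired:
		return JobCancelled, nil
	}
	return "", fmt.Errorf("unknown approval state %q", s)
}

// dispatchable reports whether the scheduler may pick the job up.
// Simplification: this sketch only models the three statuses above.
func dispatchable(j JobStatus) bool { return j == JobPending }

func main() {
	states := []ApprovalState{StatePending, StateApproved, StateRejected, StateExpired}
	for _, s := range states {
		j, _ := nextJobStatus(s)
		fmt.Printf("%-8s -> %-16s dispatchable=%v\n", s, j, dispatchable(j))
	}
}
```

The point of the table is that a gated job has exactly one way forward (an approval by a second actor) and two ways to be cancelled.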
## Configuration
Set `requires_approval=true` on a `CertificateProfile`:
```bash
curl -X PUT https://certctl/api/v1/profiles/p-prod-cdn \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production CDN",
    "requires_approval": true,
    ...
  }'
```
Every certificate bound to that profile is now gated. The default is `requires_approval=false` — existing profiles keep the historical unattended renewal path.
## RBAC: the two-person integrity rule
The actor that triggers a renewal **cannot** be the actor that approves it. The check happens at the service layer and surfaces as **HTTP 403** at the handler. The error message contains the substring `two-person integrity` so server-log greps detect attempted self-approvals.
This is the load-bearing two-person-integrity contract. Pinned by:
- `internal/service/approval_test.go::TestApproval_Approve_RejectsSameActor` — service-level pin.
- `internal/api/handler/approval_test.go::TestApproval_HandlerApproveAsSameActor_Returns403` — handler-level pin (HTTP 403 + body contains "two-person integrity").
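The rule itself is small enough to sketch. The Go below is an illustration under stated assumptions, not the code behind the tests listed above: the struct, the `approve` and `httpStatus` helpers, and the full error text are hypothetical. Only the error name `ErrApproveBySameActor`, the `two-person integrity` substring, and the 403 mapping come from this document.

```go
package main

import (
	"errors"
	"fmt"
)

// The name and the "two-person integrity" substring come from this document;
// the rest of the message is an assumption.
var ErrApproveBySameActor = errors.New("two-person integrity: requester cannot approve their own request")

// ApprovalRequest is a hypothetical, trimmed-down shape for illustration.
type ApprovalRequest struct {
	ID          string
	RequestedBy string
	State       string
	DecidedBy   string
}

// approve applies the same-actor check before mutating the request.
func approve(r *ApprovalRequest, decidedBy string) error {
	if r.State != "pending" {
		return fmt.Errorf("request %s is %s, not pending", r.ID, r.State)
	}
	if decidedBy == r.RequestedBy {
		return ErrApproveBySameActor
	}
	r.State = "approved"
	r.DecidedBy = decidedBy
	return nil
}

// httpStatus maps the service error onto a handler status code.
// Only the 403 case is documented; the default is a placeholder.
func httpStatus(err error) int {
	switch {
	case err == nil:
		return 200
	case errors.Is(err, ErrApproveBySameActor):
		return 403
	default:
		return 500
	}
}

func main() {
	r := &ApprovalRequest{ID: "ar-abc123", RequestedBy: "operator-a", State: "pending"}

	err := approve(r, "operator-a") // self-approval attempt
	fmt.Println(httpStatus(err), err)

	err = approve(r, "operator-b") // a different actor
	fmt.Println(httpStatus(err), r.State, r.DecidedBy)
}
```

Keeping the check in the service function rather than the handler means every caller, not only the HTTP path, goes through it.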
## Operator playbook: "I need to approve a renewal"
```bash
# 1. Find the pending request
curl -s "https://certctl/api/v1/approvals?state=pending" \
  -H "Authorization: Bearer $API_KEY" | jq

# 2. Inspect the request — confirm CN, SANs, requester
curl -s "https://certctl/api/v1/approvals/ar-abc123" \
  -H "Authorization: Bearer $API_KEY" | jq

# 3. Approve as a different actor than the requester
curl -X POST "https://certctl/api/v1/approvals/ar-abc123/approve" \
  -H "Authorization: Bearer $APPROVER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"note":"approved per ticket SECOPS-12345"}'

# 4. Confirm the job transitioned to Pending
curl -s "https://certctl/api/v1/jobs?certificate_id=mc-foo" \
  -H "Authorization: Bearer $API_KEY" | jq '.[] | {id,status,type}'
```
To **reject** instead, swap the path: `POST /api/v1/approvals/{id}/reject` with the same body shape. The job transitions to `Cancelled` and the `note` is recorded in the audit row.
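For teams that script decisions from Go rather than curl, the approve and reject calls differ only in the last path segment. The helper below is a hypothetical sketch (the function name and its parameters are not part of certctl); it builds the request using the paths, headers, and body shape shown in the playbook, and leaves sending it to the caller.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// newDecisionRequest builds a POST to .../approvals/{id}/approve or
// .../approvals/{id}/reject with a JSON {"note": ...} body.
// Hypothetical helper; only the path, headers, and body shape are documented.
func newDecisionRequest(baseURL, approvalID, decision, token, note string) (*http.Request, error) {
	if decision != "approve" && decision != "reject" {
		return nil, fmt.Errorf("decision must be approve or reject, got %q", decision)
	}
	body, err := json.Marshal(map[string]string{"note": note})
	if err != nil {
		return nil, err
	}
	target := baseURL + "/api/v1/approvals/" + url.PathEscape(approvalID) + "/" + decision
	req, err := http.NewRequest(http.MethodPost, target, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newDecisionRequest("https://certctl", "ar-abc123", "reject",
		"example-token", "rejected per ticket SECOPS-12345")
	if err != nil {
		panic(err)
	}
	// Pass req to an http.Client to actually send it.
	fmt.Println(req.Method, req.URL.String())
}
```

The token passed in must belong to a different actor than the requester when the decision is `approve`, otherwise the server answers with HTTP 403.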
## Operator playbook: "approval timed out"
The scheduler reaper transitions stale pending requests + their linked jobs after `CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT` (default `168h` = 7 days):
- `ApprovalRequest.state` → `expired`
- `Job.Status` → `Cancelled` (with `error_message="approval expired"`)
- One audit row per expiry (`action=approval_expired, actor=system-reaper, actorType=System`)
- `certctl_approval_decisions_total{outcome="expired",profile_id="..."}` increments
Resolve by re-triggering the renewal once the underlying delay is sorted:
```bash
curl -X POST "https://certctl/api/v1/certificates/mc-foo/renew" \
  -H "Authorization: Bearer $API_KEY"
```
Tighten the timeout for short-window deployments via the env var, e.g. `CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT=24h`.
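The expiry decision can be sketched as a pure function over request age. This is an illustration only: the `pendingRequest` type and `expired` function are hypothetical, and two details are assumptions rather than documented behavior, namely that the env var is parsed as a Go duration string and that a request expires when its age is greater than or equal to the timeout.

```go
package main

import (
	"fmt"
	"time"
)

// defaultTimeout mirrors the documented default of 168h (7 days).
const defaultTimeout = 168 * time.Hour

// pendingRequest is a hypothetical, trimmed-down shape for illustration.
type pendingRequest struct {
	ID        string
	CreatedAt time.Time
}

// expired returns the IDs of requests whose age at time now has reached
// the timeout. The inclusive boundary is an assumption.
func expired(reqs []pendingRequest, now time.Time, timeout time.Duration) []string {
	var ids []string
	for _, r := range reqs {
		if now.Sub(r.CreatedAt) >= timeout {
			ids = append(ids, r.ID)
		}
	}
	return ids
}

func main() {
	// Assumption: values such as "168h" or "24h" parse as Go durations.
	timeout, err := time.ParseDuration("24h")
	if err != nil {
		panic(err)
	}
	now := time.Date(2026, 5, 5, 12, 0, 0, 0, time.UTC)
	reqs := []pendingRequest{
		{ID: "ar-old", CreatedAt: now.Add(-30 * time.Hour)},
		{ID: "ar-new", CreatedAt: now.Add(-2 * time.Hour)},
	}
	fmt.Println("expired at 24h:", expired(reqs, now, timeout))
	fmt.Println("expired at default:", expired(reqs, now, defaultTimeout))
}
```

With a 24h timeout only the 30-hour-old request qualifies; under the 7-day default neither does, which is why tightening the value matters for short-window deployments.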
## Bypass mode (dev / CI ONLY)
Setting `CERTCTL_APPROVAL_BYPASS=true` short-circuits the workflow: every `RequestApproval` call auto-approves with `decided_by=system-bypass` and `actorType=System`. Used by dev / CI to keep renewal-scheduler tests fast without standing up an approver.
**Production deploys MUST leave this unset.** The bypass emits a typed audit event (`action=approval_bypassed`) so reviewers detect misuse via:
```sql
SELECT count(*) FROM audit_events WHERE actor = 'system-bypass';
```
The count should be **zero in production** and high in dev (`count(*)` always returns exactly one row; the value in that row is what matters). The certctl-server also logs a `WARN` line at boot when bypass is enabled; alert on that log line in production environments.
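To make the relationship between the bypass switch and the audit query concrete, here is a minimal in-memory Go sketch. The types and function names are hypothetical; the field values (`approval_bypassed`, `system-bypass`, `System`) are the ones this section documents, and `countBypassed` stands in for the SQL query rather than reproducing certctl's storage layer.

```go
package main

import "fmt"

// auditEvent carries only the fields this document mentions.
type auditEvent struct {
	Action    string
	Actor     string
	ActorType string
}

// requestApproval sketches the short-circuit: with bypass on, the request
// is approved immediately and a typed audit event is produced; otherwise
// it stays pending and no bypass event exists.
func requestApproval(bypass bool) (string, []auditEvent) {
	if bypass {
		return "approved", []auditEvent{{
			Action:    "approval_bypassed",
			Actor:     "system-bypass",
			ActorType: "System",
		}}
	}
	return "pending", nil
}

// countBypassed is an in-memory stand-in for the SQL audit query.
func countBypassed(events []auditEvent) int {
	n := 0
	for _, e := range events {
		if e.Actor == "system-bypass" {
			n++
		}
	}
	return n
}

func main() {
	for _, bypass := range []bool{false, true} {
		state, events := requestApproval(bypass)
		fmt.Printf("bypass=%v state=%s bypass_events=%d\n",
			bypass, state, countBypassed(events))
	}
}
```

Because every bypassed request leaves exactly one such event, a non-zero count in production is direct evidence that the switch was on at some point.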
## Prometheus metrics
```
certctl_approval_decisions_total{outcome,profile_id}   counter
certctl_approval_pending_age_seconds                   histogram
    (le buckets: 60, 300, 1800, 3600, 21600, 86400, +Inf)
```
`outcome` is one of `approved`, `rejected`, `expired`, `bypassed`. `profile_id` is the `CertificateProfile.ID` that triggered the gate (cardinality-bounded — operators have <100 profiles in production).
The pending-age histogram observes seconds-since-creation at the moment of decision. Alert when p99 climbs from minutes into hours or days; production deployments usually have a same-day decision deadline.
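When choosing an alert threshold it helps to see which bucket a given pending age lands in. The Go sketch below uses the `le` bounds listed above; the function name is hypothetical and the code models only the bucket lookup, not the Prometheus client library.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// le bounds from the metrics listing, in seconds.
var buckets = []float64{60, 300, 1800, 3600, 21600, 86400, math.Inf(1)}

// smallestBucket returns the smallest le bound that contains the
// observation. Bounds are inclusive, so 60 seconds lands in le=60.
func smallestBucket(seconds float64) float64 {
	i := sort.SearchFloat64s(buckets, seconds)
	return buckets[i]
}

func main() {
	// 45 seconds, 2 hours, 3 days.
	for _, age := range []float64{45, 7200, 259200} {
		fmt.Printf("age=%gs lands in le=%v\n", age, smallestBucket(age))
	}
}
```

Every decision older than one day falls into the `+Inf` bucket, so the histogram cannot tell a two-day wait from a six-day wait. Prometheus's `histogram_quantile` returns the upper bound of the second-highest bucket when the quantile falls in the highest one, so a p99 reading pinned at 86400 most likely means "more than a day", not "exactly a day".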
## Future free V2 work
- **M-of-N approver chains.** Today's primitive is single-approver. Future V2 work adds chains — e.g., "needs 2 of 3 platform-team members."
- **Time-windowed auto-approve.** Today's reaper hard-cancels at the static deadline. Policy-driven time-windowed auto-approve (T+30m unattended → cancel; T+24h business hours → escalate) is future work.
- **External ticketing integration.** ServiceNow / JIRA bridging so approval state mirrors the change-management record.
- **Per-owner / per-team routing.** Today's pool is global. Per-owner / per-team routing matches cert ownership to approver pools.
- **Approval delegation.** Today the same-actor rule is strict. Time-bounded delegation is future work.
Tracked in `WORKSPACE-ROADMAP.md` under the Future Free V2 Work section — every item ships free under BSL.