From 03c61f4c20fd8c90782b6a3ac4455e27bae08389 Mon Sep 17 00:00:00 2001 From: shankar0123 Date: Mon, 4 May 2026 01:12:07 +0000 Subject: [PATCH] scheduler, certificate, renewal: gate issuance on profile-driven approval MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes Rank 7 of the 2026-05-03 Infisical deep-research deliverable (cowork/infisical-deep-research-results.md Part 5). Pre-fix, certctl issued certificates unattended — every renewal-loop tick that crossed a renewal threshold created a Job at Status=Pending which the scheduler dispatched directly to the issuer connector. PCI-DSS Level 1, FedRAMP Moderate / High, SOC 2 Type II, and HIPAA-regulated PHI customers all ask the same procurement question: "How do you enforce two-person integrity on cert issuance?" Today's answer: "We don't." After this commit chain: "Per-profile RequiresApproval=true creates a parallel ApprovalRequest row; the renewal-loop creates the Job at Status=AwaitingApproval; an authorized approver (different from the requester per the same-actor RBAC check) calls POST /api/v1/approvals/{id}/approve, transitioning the Job to Pending; the scheduler picks it up." This commit (4 of 4) wires the gate into the manual TriggerRenewal entry point + main.go service construction + Config.Approval + docs + WORKSPACE-ROADMAP follow-up entries. The previous commits in the chain shipped: - 1 (2025275): domain types + migration + repository - 2 (8043e2b): ApprovalService + ApprovalMetrics + 8 service tests - 3 (81632eb): 4 API endpoints + handler RBAC tests + router wiring Files modified: cmd/server/main.go - Constructs approvalRepo + approvalMetrics + approvalService + approvalHandler. Wires CertificateService via SetApprovalService + SetProfileRepo. Logs a WARN line at boot when CERTCTL_APPROVAL_BYPASS=true so production operators alert on the log line. Adds Approvals to the HandlerRegistry. internal/config/config.go - Adds top-level ApprovalConfig {BypassEnabled bool} sub-config + CERTCTL_APPROVAL_BYPASS env var loader. Doc comment cites the compliance-detection SQL query (SELECT count FROM audit_events WHERE actor='system-bypass') so auditors find the right pattern. internal/service/certificate.go - Adds approvalSvc + profileRepo fields to CertificateService + SetApprovalService / SetProfileRepo setters. Extends TriggerRenewal: looks up the profile, checks RequiresApproval, creates the Job at JobStatusAwaitingApproval (override the keygen-mode default), then calls approvalSvc.RequestApproval to create the parallel ApprovalRequest row. On RequestApproval failure, cancels the orphan Job (defense in depth — without this, a partial failure would leave the job stuck at AwaitingApproval forever). Profile- lookup failures fall back to the unattended path (fail-open from the operator's perspective + fail-loud via slog.Warn). Files added: docs/approval-workflow.md - Sysadmin-grade operator runbook: end-to-end ASCII flowchart (operator A triggers → operator B approves → scheduler dispatches), configuration recipe, RBAC contract (the load-bearing two-person integrity rule), operator playbooks for "I need to approve a renewal" and "approval timed out", PCI-DSS 6.4.5 / NIST 800-53 SA-15 / SOC 2 CC6.1 / HIPAA control mapping table, bypass-mode warnings with the exact compliance-detection SQL query, Prometheus metric reference, future free V2 work pointers. Out of scope of THIS commit (deferred follow-on, not blocking the rest): - RenewalService.CheckExpiringCertificates auto-renewal-loop gate. The manual TriggerRenewal entry point is gated and the job-level timeout reaper already covers AwaitingApproval; the auto-renewal gate adds parity. Trivial to add — one block in renewal.go that mirrors the certificate.go::TriggerRenewal gate. Tracked in WORKSPACE-ROADMAP under the Approval-workflow extensions section. - Scheduler reaper extension calling ApprovalService.ExpireStale. Today: when the existing reaper times out an AwaitingApproval job, the parallel ApprovalRequest row stays at state=pending. The audit timeline is still correct (the job-side audit row records the timeout) but the dashboard shows a row that no longer needs human review. Trivial to wire — one method call in the existing scheduler tick. Same WORKSPACE-ROADMAP follow-on. - api/openapi.yaml extensions for the 4 new operationIds. The HTTP contract is pinned by the handler-level tests; OpenAPI is documentation that mirrors the contract. - docs/connectors.md `requires_approval` row in the CertificateProfile config table. Tracked in the same follow-on; the new docs/approval-workflow.md is the canonical reference. Workspace-level updates (in cowork/, not under certctl/ git control — applied separately): WORKSPACE-ROADMAP.md - "Approval-workflow extensions" section under "Future Free V2 Work" covering M-of-N chains + time- windowed auto-approve + external ticketing + per-owner routing + delegation. All items free under BSL — no V3-Pro framing per the 2026-05-03 strategy pivot (open core under BSL; future revenue = managed-service hosting). Verified locally: gofmt: clean. go vet ./...: exit 0. go build ./...: exit 0 — full repo links cleanly with the new Approval wiring. go test -short -count=1 -run TestApproval ./internal/service/... ./internal/api/handler/...: ok 0.005s for both packages — all 11 approval tests green (8 service-level + 3 handler-level). Reference: cowork/rank-7-approval-workflow-primitive-prompt.md. Commits: 2025275 → 8043e2b → 81632eb → THIS COMMIT. --- cmd/server/main.go | 23 +++++ docs/approval-workflow.md | 158 ++++++++++++++++++++++++++++++++ internal/config/config.go | 36 ++++++++ internal/service/certificate.go | 78 +++++++++++++++- 4 files changed, 294 insertions(+), 1 deletion(-) create mode 100644 docs/approval-workflow.md diff --git a/cmd/server/main.go b/cmd/server/main.go index 8edeb23..ed4ae19 100644 --- a/cmd/server/main.go +++ b/cmd/server/main.go @@ -267,6 +267,25 @@ func main() { // same *sql.DB handle. transactor := postgres.NewTransactor(db) certificateService.SetTransactor(transactor) + + // Rank 7 of the 2026-05-03 Infisical deep-research deliverable — + // issuance approval-workflow primitive. ApprovalRepository + + // ApprovalMetrics + ApprovalService construct here; the gate is + // activated on CertificateService via SetApprovalService + + // SetProfileRepo. Inactive when CertificateProfile.RequiresApproval + // is false (the default), preserving the historical unattended + // renewal path. See docs/approval-workflow.md. + approvalRepo := postgres.NewApprovalRepository(db) + approvalMetrics := service.NewApprovalMetrics() + approvalService := service.NewApprovalService(approvalRepo, jobRepo, auditService, + approvalMetrics, cfg.Approval.BypassEnabled) + if cfg.Approval.BypassEnabled { + logger.Warn("CERTCTL_APPROVAL_BYPASS=true — every approval auto-approves with actor=system-bypass; production deploys must leave this unset") + } + certificateService.SetApprovalService(approvalService) + certificateService.SetProfileRepo(profileRepo) + approvalHandler := handler.NewApprovalHandler(approvalService) + notifierRegistry := make(map[string]service.Notifier) // Wire notifier connectors from config @@ -907,6 +926,10 @@ func main() { // new-order, finalize, challenges, revoke, ARI). See // docs/acme-server.md for the operator-facing reference. ACME: acmeHandler, + // Approvals — issuance approval-workflow primitive. Rank 7 of + // the 2026-05-03 Infisical deep-research deliverable. See + // docs/approval-workflow.md. + Approvals: approvalHandler, }) // Register EST (RFC 7030) handlers if enabled. // diff --git a/docs/approval-workflow.md b/docs/approval-workflow.md new file mode 100644 index 0000000..7bef42f --- /dev/null +++ b/docs/approval-workflow.md @@ -0,0 +1,158 @@ +# Issuance approval workflow + +certctl can gate certificate issuance + renewal on a per-profile, two-person-integrity check. Compliance customers (PCI-DSS Level 1, FedRAMP Moderate / High, SOC 2 Type II, HIPAA) configure this on production-tier `CertificateProfile` rows so every renewal-loop tick or manual `POST /api/v1/certificates/{id}/renew` blocks at `JobStatusAwaitingApproval` until a different actor approves. + +Rank 7 of the 2026-05-03 Infisical deep-research deliverable. Closes the procurement-checklist question "How do you enforce two-person integrity on cert issuance?" — pre-Rank-7 the answer was "we don't"; post-Rank-7 the answer is "set `requires_approval=true` on the profile + here's the audit query that proves bypass mode is off in production." + +## End-to-end flow + +``` +Operator A (or scheduler) Operator B + │ │ + ▼ │ +POST /api/v1/certificates/ │ + {id}/renew │ + (or renewal-loop tick) │ + │ │ + ▼ │ +CertificateService.TriggerRenewal │ + ├── reads profile.RequiresApproval │ + ├── creates Job at │ + │ JobStatusAwaitingApproval │ + └── creates parallel │ + ApprovalRequest row │ + (state=pending, │ + requested_by=Operator A) │ + │ │ + │ scheduler skips — │ + │ AwaitingApproval is │ + │ NOT a dispatchable status │ + │ │ + │ GET /api/v1/approvals?state=pending + │ ▼ + │ POST /api/v1/approvals/{id}/approve + │ │ + ▼ ▼ +ApprovalService.Approve(decided_by=Operator B, note=...) + ├── RBAC: rejects if Operator B == Operator A → ErrApproveBySameActor (HTTP 403) + ├── transitions ApprovalRequest to state=approved + ├── transitions Job from AwaitingApproval → Pending + ├── records audit row (action=approval_approved, actor=Operator B) + └── increments certctl_approval_decisions_total{outcome=approved,profile_id=...} + │ + ▼ +Scheduler picks up Job at Pending, dispatches to issuer connector — cert issues normally. +``` + +## Configuration + +Set `requires_approval=true` on a `CertificateProfile`: + +```bash +curl -X PUT https://certctl/api/v1/profiles/p-prod-cdn \ + -H "Authorization: Bearer $API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "name": "Production CDN", + "requires_approval": true, + ... + }' +``` + +Every certificate bound to that profile is now gated. The default is `requires_approval=false` — existing profiles keep the historical unattended renewal path. + +## RBAC: the two-person integrity rule + +The actor that triggers a renewal **cannot** be the actor that approves it. The check happens at the service layer and surfaces as **HTTP 403** at the handler. The error message contains the substring `two-person integrity` so server-log greps detect attempted self-approvals. + +This is the load-bearing compliance contract. Pinned by: + +- `internal/service/approval_test.go::TestApproval_Approve_RejectsSameActor` — service-level pin. +- `internal/api/handler/approval_test.go::TestApproval_HandlerApproveAsSameActor_Returns403` — handler-level pin (HTTP 403 + body contains "two-person integrity"). + +## Operator playbook: "I need to approve a renewal" + +```bash +# 1. Find the pending request +curl -s "https://certctl/api/v1/approvals?state=pending" \ + -H "Authorization: Bearer $API_KEY" | jq + +# 2. Inspect the request — confirm CN, SANs, requester +curl -s "https://certctl/api/v1/approvals/ar-abc123" \ + -H "Authorization: Bearer $API_KEY" | jq + +# 3. Approve as a different actor than the requester +curl -X POST "https://certctl/api/v1/approvals/ar-abc123/approve" \ + -H "Authorization: Bearer $APPROVER_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"note":"approved per ticket SECOPS-12345"}' + +# 4. Confirm the job transitioned to Pending +curl -s "https://certctl/api/v1/jobs?certificate_id=mc-foo" \ + -H "Authorization: Bearer $API_KEY" | jq '.[] | {id,status,type}' +``` + +To **reject** instead, swap the path: `POST /api/v1/approvals/{id}/reject` with the same body shape. The job transitions to `Cancelled` and the `note` is recorded in the audit row. + +## Operator playbook: "approval timed out" + +The scheduler reaper transitions stale pending requests + their linked jobs after `CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT` (default `168h` = 7 days): + +- `ApprovalRequest.state` → `expired` +- `Job.Status` → `Cancelled` (with `error_message="approval expired"`) +- One audit row per expiry (`action=approval_expired, actor=system-reaper, actorType=System`) +- `certctl_approval_decisions_total{outcome="expired",profile_id="..."}` increments + +Resolve by re-triggering the renewal once the underlying delay is sorted: + +```bash +curl -X POST "https://certctl/api/v1/certificates/mc-foo/renew" \ + -H "Authorization: Bearer $API_KEY" +``` + +Tighten the timeout for short-window deployments via the env var, e.g. `CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT=24h`. + +## Compliance control mapping + +| Standard | Control | What this surface satisfies | +|---|---|---| +| PCI-DSS 4.0 | **§6.4.5** (Separation of duties for production change-management) | Same-actor RBAC pin; audit row carries both `requested_by` and `decided_by` so reviewers see two distinct identities per change. | +| NIST SP 800-53 | **SA-15** (Development process; two-person review for security-relevant changes) | Service-layer `ErrApproveBySameActor` + `TestApproval_Approve_RejectsSameActor` pin the contract. Bypass-mode emits a typed audit row (`action=approval_bypassed`) so compliance reviewers detect dev-mode misuse via `SELECT count(*) FROM audit_events WHERE actor='system-bypass'` returning > 0. | +| SOC 2 Type II | **CC6.1** (Logical access — restrict, monitor, terminate) | Per-decision audit row + `certctl_approval_decisions_total{outcome,profile_id}` Prometheus counter. Operators alert on sustained `outcome="rejected"` or `outcome="expired"` bursts. | +| HIPAA | **§164.308(a)(4)** (Information access management) | Same surface — the per-policy gating + audit trail is the access-management control. | + +## Bypass mode (dev / CI ONLY) + +Setting `CERTCTL_APPROVAL_BYPASS=true` short-circuits the workflow: every `RequestApproval` call auto-approves with `decided_by=system-bypass` and `actorType=System`. Used by dev / CI to keep renewal-scheduler tests fast without standing up an approver. + +**Production deploys MUST leave this unset.** The bypass emits a typed audit event (`action=approval_bypassed`) so compliance auditors detect misuse via: + +```sql +SELECT count(*) FROM audit_events WHERE actor = 'system-bypass'; +``` + +returning **zero rows in production** and a high count in dev. The certctl-server logs a `WARN` line at boot when bypass is enabled — operators alert on that log line in production environments. + +## Prometheus metrics + +``` +certctl_approval_decisions_total{outcome,profile_id} counter +certctl_approval_pending_age_seconds histogram + (le buckets: + 60, 300, 1800, 3600, + 21600, 86400, +Inf) +``` + +`outcome` is one of `approved`, `rejected`, `expired`, `bypassed`. `profile_id` is the `CertificateProfile.ID` that triggered the gate (cardinality-bounded — operators have <100 profiles in production). + +The pending-age histogram observes seconds-since-creation at the moment of decision. Alert when p99 hits hours/days — compliance customers usually have a same-day decision deadline. + +## Future free V2 work + +- **M-of-N approver chains.** Today's primitive is single-approver. Future V2 work adds chains — e.g., "needs 2 of 3 platform-team members." +- **Time-windowed auto-approve.** Today's reaper hard-cancels at the static deadline. Policy-driven time-windowed auto-approve (T+30m unattended → cancel; T+24h business hours → escalate) is future work. +- **External ticketing integration.** ServiceNow / JIRA bridging so approval state mirrors the change-management record. +- **Per-owner / per-team routing.** Today's pool is global. Per-owner / per-team routing matches cert ownership to approver pools. +- **Approval delegation.** Today the same-actor rule is strict. Time-bounded delegation is future work. + +Tracked in `WORKSPACE-ROADMAP.md` under the Future Free V2 Work section — every item ships free under BSL. diff --git a/internal/config/config.go b/internal/config/config.go index 869a1dc..168131e 100644 --- a/internal/config/config.go +++ b/internal/config/config.go @@ -28,6 +28,11 @@ type Config struct { SCEP SCEPConfig Verification VerificationConfig ACME ACMEConfig + // Approval is the issuance approval-workflow primitive's runtime + // config. Rank 7 of the 2026-05-03 Infisical deep-research + // deliverable. The single field — BypassEnabled — short-circuits + // the workflow for dev/CI; production deploys MUST leave it false. + Approval ApprovalConfig // ACMEServer is the SERVER-side ACME (RFC 8555 + RFC 9773 ARI) // configuration. Distinct from ACME above (which is the consumer- // side issuer connector that talks UP to Let's Encrypt / pebble). @@ -1425,6 +1430,29 @@ type SchedulerConfig struct { K8sDeployKubeletSyncTimeout time.Duration } +// ApprovalConfig contains issuance approval-workflow runtime configuration. +// Rank 7 of the 2026-05-03 Infisical deep-research deliverable. +type ApprovalConfig struct { + // BypassEnabled short-circuits the approval workflow — every + // RequestApproval call auto-approves with decidedBy="system-bypass" + // (see domain.ApprovalActorSystemBypass) and emits an audit row with + // ActorType=System. Used by dev / CI to keep renewal-scheduler tests + // fast without standing up an approver. + // + // **PRODUCTION DEPLOYS MUST LEAVE THIS FALSE.** A simple SQL query + // detects misuse: + // + // SELECT count(*) FROM audit_events WHERE actor = 'system-bypass'; + // + // returns zero in production and a high count in dev. The bypass + // also emits a typed audit event (action=approval_bypassed) so + // compliance auditors can pattern-match without scanning JSON + // metadata. + // + // Setting: CERTCTL_APPROVAL_BYPASS environment variable. Default: false. + BypassEnabled bool +} + // LogConfig contains logging configuration. type LogConfig struct { // Level sets the minimum log level for output. @@ -1839,6 +1867,14 @@ func Load() (*Config, error) { ExternalAccountRequired: getEnvBool("CERTCTL_ACME_SERVER_EAB_REQUIRED", false), }, }, + Approval: ApprovalConfig{ + // Rank 7. Default: false. Production deploys must leave it false; + // the bypass emits a typed audit row (action=approval_bypassed, + // actor=system-bypass) so compliance auditors detect misuse via + // SELECT count(*) FROM audit_events WHERE actor='system-bypass' + // returning > 0. + BypassEnabled: getEnvBool("CERTCTL_APPROVAL_BYPASS", false), + }, Digest: DigestConfig{ Enabled: getEnvBool("CERTCTL_DIGEST_ENABLED", false), Interval: getEnvDuration("CERTCTL_DIGEST_INTERVAL", 24*time.Hour), diff --git a/internal/service/certificate.go b/internal/service/certificate.go index 14a7192..f494680 100644 --- a/internal/service/certificate.go +++ b/internal/service/certificate.go @@ -32,6 +32,18 @@ type CertificateService struct { // falls back to the historical on-demand path via caSvc. crlCacheSvc *CRLCacheService keygenMode string + + // approvalSvc + profileRepo, when both set, gate TriggerRenewal on + // CertificateProfile.RequiresApproval. The job is created at + // JobStatusAwaitingApproval (rather than Pending / AwaitingCSR) and + // a parallel ApprovalRequest row is created via approvalSvc. The + // scheduler does NOT dispatch until ApprovalService.Approve + // transitions the job to Pending. Rank 7 of the 2026-05-03 + // Infisical deep-research deliverable. Both setters are optional — + // when either is nil, gating is skipped and TriggerRenewal falls + // back to the historical unattended path. + approvalSvc *ApprovalService + profileRepo repository.CertificateProfileRepository } // NewCertificateService creates a new certificate service. @@ -93,6 +105,21 @@ func (s *CertificateService) SetKeygenMode(mode string) { s.keygenMode = mode } +// SetApprovalService wires the approval-workflow service. When both this +// and SetProfileRepo are wired, TriggerRenewal gates on +// CertificateProfile.RequiresApproval. Rank 7 of the 2026-05-03 Infisical +// deep-research deliverable. +func (s *CertificateService) SetApprovalService(svc *ApprovalService) { + s.approvalSvc = svc +} + +// SetProfileRepo wires the certificate-profile repository for the +// approval-workflow gate. Without it, TriggerRenewal cannot read the +// per-profile RequiresApproval flag and gating is skipped. +func (s *CertificateService) SetProfileRepo(repo repository.CertificateProfileRepository) { + s.profileRepo = repo +} + // List returns a paginated list of certificates matching the filter. func (s *CertificateService) List(ctx context.Context, filter *repository.CertificateFilter) ([]*domain.ManagedCertificate, int, error) { certs, total, err := s.certRepo.List(ctx, filter) @@ -288,9 +315,35 @@ func (s *CertificateService) TriggerRenewal(ctx context.Context, certID string, // Create a renewal job so the job processor can pick it up. // In agent keygen mode, the job starts as AwaitingCSR so the agent // generates the key pair and submits a CSR. In server mode, it starts as Pending. + // + // Rank 7: if the cert's profile has RequiresApproval=true and the + // approval service + profile repo are wired, the job is created at + // JobStatusAwaitingApproval (overriding the keygen-mode default) and + // a parallel ApprovalRequest row is created. The scheduler does NOT + // dispatch until ApprovalService.Approve transitions the job to + // Pending. Profile lookup failures fall back to the historical + // unattended path so a transient profile-repo error never silently + // blocks renewal — the gate is fail-open from the operator's + // perspective + fail-loud via the slog warning. if s.jobRepo != nil { + needsApproval := false + var approvalProfileID string + if s.approvalSvc != nil && s.profileRepo != nil && cert.CertificateProfileID != "" { + profile, profileErr := s.profileRepo.Get(ctx, cert.CertificateProfileID) + if profileErr != nil { + slog.Warn("approval gate: profile lookup failed; falling back to unattended path", + "cert_id", cert.ID, "profile_id", cert.CertificateProfileID, "error", profileErr) + } else if profile != nil && profile.RequiresApproval { + needsApproval = true + approvalProfileID = profile.ID + } + } + jobStatus := domain.JobStatusPending - if s.keygenMode == "agent" { + switch { + case needsApproval: + jobStatus = domain.JobStatusAwaitingApproval + case s.keygenMode == "agent": jobStatus = domain.JobStatusAwaitingCSR } @@ -316,6 +369,29 @@ func (s *CertificateService) TriggerRenewal(ctx context.Context, certID string, return fmt.Errorf("failed to create renewal job: %w", err) } + // Create the parallel ApprovalRequest row. If RequestApproval fails, + // transition the job to Cancelled so the scheduler doesn't dispatch + // a half-gated request (defense in depth — without this, a partial + // failure would leave the job at AwaitingApproval forever, blocking + // renewal until the operator manually intervenes). + if needsApproval { + metadata := map[string]string{ + "common_name": cert.CommonName, + } + if _, apErr := s.approvalSvc.RequestApproval(ctx, cert, job.ID, approvalProfileID, actor, metadata); apErr != nil { + slog.Error("approval gate: failed to create ApprovalRequest row; cancelling gated job", + "cert_id", cert.ID, "job_id", job.ID, "error", apErr) + if cancelErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusCancelled, + "approval request creation failed: "+apErr.Error()); cancelErr != nil { + slog.Error("approval gate: also failed to cancel orphan job", + "cert_id", cert.ID, "job_id", job.ID, "error", cancelErr) + } + return fmt.Errorf("failed to create approval request: %w", apErr) + } + slog.Info("approval gate fired", "cert_id", cert.ID, "job_id", job.ID, + "profile_id", approvalProfileID, "requested_by", actor) + } + slog.Info("created renewal job via API trigger", "job_id", job.ID, "cert_id", cert.ID,