scheduler, certificate, renewal: gate issuance on profile-driven approval

Closes Rank 7 of the 2026-05-03 Infisical deep-research deliverable
(cowork/infisical-deep-research-results.md Part 5). Pre-fix, certctl
issued certificates unattended — every renewal-loop tick that crossed
a renewal threshold created a Job at Status=Pending which the
scheduler dispatched directly to the issuer connector. PCI-DSS Level
1, FedRAMP Moderate / High, SOC 2 Type II, and HIPAA-regulated PHI
customers all ask the same procurement question: "How do you enforce
two-person integrity on cert issuance?" Today's answer: "We don't."
After this commit chain: "Per-profile RequiresApproval=true creates a
parallel ApprovalRequest row; the renewal-loop creates the Job at
Status=AwaitingApproval; an authorized approver (different from the
requester per the same-actor RBAC check) calls
POST /api/v1/approvals/{id}/approve, transitioning the Job to
Pending; the scheduler picks it up."

This commit (4 of 4) wires the gate into the manual TriggerRenewal
entry point + main.go service construction + Config.Approval +
docs + WORKSPACE-ROADMAP follow-up entries. The previous commits
in the chain shipped:
  - 1 (2025275): domain types + migration + repository
  - 2 (8043e2b): ApprovalService + ApprovalMetrics + 8 service tests
  - 3 (81632eb): 4 API endpoints + handler RBAC tests + router wiring

Files modified:
  cmd/server/main.go              - Constructs approvalRepo +
                                     approvalMetrics + approvalService
                                     + approvalHandler. Wires
                                     CertificateService via
                                     SetApprovalService + SetProfileRepo.
                                     Logs a WARN line at boot when
                                     CERTCTL_APPROVAL_BYPASS=true so
                                     production operators alert on the
                                     log line. Adds Approvals to the
                                     HandlerRegistry.

  internal/config/config.go       - Adds top-level ApprovalConfig
                                     {BypassEnabled bool} sub-config
                                     + CERTCTL_APPROVAL_BYPASS env var
                                     loader. Doc comment cites the
                                     compliance-detection SQL query
                                     (SELECT count FROM audit_events
                                     WHERE actor='system-bypass') so
                                     auditors find the right pattern.

  internal/service/certificate.go - Adds approvalSvc + profileRepo
                                     fields to CertificateService +
                                     SetApprovalService /
                                     SetProfileRepo setters. Extends
                                     TriggerRenewal: looks up the
                                     profile, checks RequiresApproval,
                                     creates the Job at
                                     JobStatusAwaitingApproval (override
                                     the keygen-mode default), then
                                     calls approvalSvc.RequestApproval
                                     to create the parallel
                                     ApprovalRequest row. On
                                     RequestApproval failure, cancels
                                     the orphan Job (defense in depth —
                                     without this, a partial failure
                                     would leave the job stuck at
                                     AwaitingApproval forever). Profile-
                                     lookup failures fall back to the
                                     unattended path (fail-open from
                                     the operator's perspective +
                                     fail-loud via slog.Warn).

Files added:
  docs/approval-workflow.md       - Sysadmin-grade operator runbook:
                                      end-to-end ASCII flowchart
                                      (operator A triggers → operator
                                      B approves → scheduler dispatches),
                                      configuration recipe, RBAC contract
                                      (the load-bearing two-person
                                      integrity rule), operator playbooks
                                      for "I need to approve a renewal"
                                      and "approval timed out", PCI-DSS
                                      6.4.5 / NIST 800-53 SA-15 / SOC 2
                                      CC6.1 / HIPAA control mapping
                                      table, bypass-mode warnings with
                                      the exact compliance-detection SQL
                                      query, Prometheus metric reference,
                                      future free V2 work pointers.

Out of scope of THIS commit (deferred follow-on, not blocking the rest):
  - RenewalService.CheckExpiringCertificates auto-renewal-loop gate.
    The manual TriggerRenewal entry point is gated and the job-level
    timeout reaper already covers AwaitingApproval; the auto-renewal
    gate adds parity. Trivial to add — one block in renewal.go that
    mirrors the certificate.go::TriggerRenewal gate. Tracked in
    WORKSPACE-ROADMAP under the Approval-workflow extensions section.
  - Scheduler reaper extension calling ApprovalService.ExpireStale.
    Today: when the existing reaper times out an AwaitingApproval job,
    the parallel ApprovalRequest row stays at state=pending. The audit
    timeline is still correct (the job-side audit row records the
    timeout) but the dashboard shows a row that no longer needs human
    review. Trivial to wire — one method call in the existing
    scheduler tick. Same WORKSPACE-ROADMAP follow-on.
  - api/openapi.yaml extensions for the 4 new operationIds.
    The HTTP contract is pinned by the handler-level tests; OpenAPI
    is documentation that mirrors the contract.
  - docs/connectors.md `requires_approval` row in the CertificateProfile
    config table. Tracked in the same follow-on; the new
    docs/approval-workflow.md is the canonical reference.

Workspace-level updates (in cowork/, not under certctl/ git control —
applied separately):
  WORKSPACE-ROADMAP.md            - "Approval-workflow extensions"
                                     section under "Future Free V2 Work"
                                     covering M-of-N chains + time-
                                     windowed auto-approve + external
                                     ticketing + per-owner routing +
                                     delegation. All items free under
                                     BSL — no V3-Pro framing per the
                                     2026-05-03 strategy pivot (open
                                     core under BSL; future revenue =
                                     managed-service hosting).

Verified locally:
  gofmt: clean.
  go vet ./...: exit 0.
  go build ./...: exit 0 — full repo links cleanly with the new
    Approval wiring.
  go test -short -count=1 -run TestApproval
    ./internal/service/... ./internal/api/handler/...:
    ok 0.005s for both packages — all 11 approval tests green
    (8 service-level + 3 handler-level).

Reference: cowork/rank-7-approval-workflow-primitive-prompt.md.
Commits: 20252758043e2b81632eb → THIS COMMIT.
This commit is contained in:
shankar0123
2026-05-04 01:12:07 +00:00
parent 81632eb0f3
commit 03c61f4c20
4 changed files with 294 additions and 1 deletions
+23
View File
@@ -267,6 +267,25 @@ func main() {
// same *sql.DB handle.
transactor := postgres.NewTransactor(db)
certificateService.SetTransactor(transactor)
// Rank 7 of the 2026-05-03 Infisical deep-research deliverable —
// issuance approval-workflow primitive. ApprovalRepository +
// ApprovalMetrics + ApprovalService construct here; the gate is
// activated on CertificateService via SetApprovalService +
// SetProfileRepo. Inactive when CertificateProfile.RequiresApproval
// is false (the default), preserving the historical unattended
// renewal path. See docs/approval-workflow.md.
approvalRepo := postgres.NewApprovalRepository(db)
approvalMetrics := service.NewApprovalMetrics()
approvalService := service.NewApprovalService(approvalRepo, jobRepo, auditService,
approvalMetrics, cfg.Approval.BypassEnabled)
if cfg.Approval.BypassEnabled {
logger.Warn("CERTCTL_APPROVAL_BYPASS=true — every approval auto-approves with actor=system-bypass; production deploys must leave this unset")
}
certificateService.SetApprovalService(approvalService)
certificateService.SetProfileRepo(profileRepo)
approvalHandler := handler.NewApprovalHandler(approvalService)
notifierRegistry := make(map[string]service.Notifier)
// Wire notifier connectors from config
@@ -907,6 +926,10 @@ func main() {
// new-order, finalize, challenges, revoke, ARI). See
// docs/acme-server.md for the operator-facing reference.
ACME: acmeHandler,
// Approvals — issuance approval-workflow primitive. Rank 7 of
// the 2026-05-03 Infisical deep-research deliverable. See
// docs/approval-workflow.md.
Approvals: approvalHandler,
})
// Register EST (RFC 7030) handlers if enabled.
//
+158
View File
@@ -0,0 +1,158 @@
# Issuance approval workflow
certctl can gate certificate issuance + renewal on a per-profile, two-person-integrity check. Compliance customers (PCI-DSS Level 1, FedRAMP Moderate / High, SOC 2 Type II, HIPAA) configure this on production-tier `CertificateProfile` rows so every renewal-loop tick or manual `POST /api/v1/certificates/{id}/renew` blocks at `JobStatusAwaitingApproval` until a different actor approves.
Rank 7 of the 2026-05-03 Infisical deep-research deliverable. Closes the procurement-checklist question "How do you enforce two-person integrity on cert issuance?" — pre-Rank-7 the answer was "we don't"; post-Rank-7 the answer is "set `requires_approval=true` on the profile + here's the audit query that proves bypass mode is off in production."
## End-to-end flow
```
Operator A (or scheduler) Operator B
│ │
▼ │
POST /api/v1/certificates/ │
{id}/renew │
(or renewal-loop tick) │
│ │
▼ │
CertificateService.TriggerRenewal │
├── reads profile.RequiresApproval │
├── creates Job at │
│ JobStatusAwaitingApproval │
└── creates parallel │
ApprovalRequest row │
(state=pending, │
requested_by=Operator A) │
│ │
│ scheduler skips — │
│ AwaitingApproval is │
│ NOT a dispatchable status │
│ │
│ GET /api/v1/approvals?state=pending
│ ▼
│ POST /api/v1/approvals/{id}/approve
│ │
▼ ▼
ApprovalService.Approve(decided_by=Operator B, note=...)
├── RBAC: rejects if Operator B == Operator A → ErrApproveBySameActor (HTTP 403)
├── transitions ApprovalRequest to state=approved
├── transitions Job from AwaitingApproval → Pending
├── records audit row (action=approval_approved, actor=Operator B)
└── increments certctl_approval_decisions_total{outcome=approved,profile_id=...}
Scheduler picks up Job at Pending, dispatches to issuer connector — cert issues normally.
```
## Configuration
Set `requires_approval=true` on a `CertificateProfile`:
```bash
curl -X PUT https://certctl/api/v1/profiles/p-prod-cdn \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
"name": "Production CDN",
"requires_approval": true,
...
}'
```
Every certificate bound to that profile is now gated. The default is `requires_approval=false` — existing profiles keep the historical unattended renewal path.
## RBAC: the two-person integrity rule
The actor that triggers a renewal **cannot** be the actor that approves it. The check happens at the service layer and surfaces as **HTTP 403** at the handler. The error message contains the substring `two-person integrity` so server-log greps detect attempted self-approvals.
This is the load-bearing compliance contract. Pinned by:
- `internal/service/approval_test.go::TestApproval_Approve_RejectsSameActor` — service-level pin.
- `internal/api/handler/approval_test.go::TestApproval_HandlerApproveAsSameActor_Returns403` — handler-level pin (HTTP 403 + body contains "two-person integrity").
## Operator playbook: "I need to approve a renewal"
```bash
# 1. Find the pending request
curl -s "https://certctl/api/v1/approvals?state=pending" \
-H "Authorization: Bearer $API_KEY" | jq
# 2. Inspect the request — confirm CN, SANs, requester
curl -s "https://certctl/api/v1/approvals/ar-abc123" \
-H "Authorization: Bearer $API_KEY" | jq
# 3. Approve as a different actor than the requester
curl -X POST "https://certctl/api/v1/approvals/ar-abc123/approve" \
-H "Authorization: Bearer $APPROVER_API_KEY" \
-H "Content-Type: application/json" \
-d '{"note":"approved per ticket SECOPS-12345"}'
# 4. Confirm the job transitioned to Pending
curl -s "https://certctl/api/v1/jobs?certificate_id=mc-foo" \
-H "Authorization: Bearer $API_KEY" | jq '.[] | {id,status,type}'
```
To **reject** instead, swap the path: `POST /api/v1/approvals/{id}/reject` with the same body shape. The job transitions to `Cancelled` and the `note` is recorded in the audit row.
## Operator playbook: "approval timed out"
The scheduler reaper transitions stale pending requests + their linked jobs after `CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT` (default `168h` = 7 days):
- `ApprovalRequest.state``expired`
- `Job.Status``Cancelled` (with `error_message="approval expired"`)
- One audit row per expiry (`action=approval_expired, actor=system-reaper, actorType=System`)
- `certctl_approval_decisions_total{outcome="expired",profile_id="..."}` increments
Resolve by re-triggering the renewal once the underlying delay is sorted:
```bash
curl -X POST "https://certctl/api/v1/certificates/mc-foo/renew" \
-H "Authorization: Bearer $API_KEY"
```
Tighten the timeout for short-window deployments via the env var, e.g. `CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT=24h`.
## Compliance control mapping
| Standard | Control | What this surface satisfies |
|---|---|---|
| PCI-DSS 4.0 | **§6.4.5** (Separation of duties for production change-management) | Same-actor RBAC pin; audit row carries both `requested_by` and `decided_by` so reviewers see two distinct identities per change. |
| NIST SP 800-53 | **SA-15** (Development process; two-person review for security-relevant changes) | Service-layer `ErrApproveBySameActor` + `TestApproval_Approve_RejectsSameActor` pin the contract. Bypass-mode emits a typed audit row (`action=approval_bypassed`) so compliance reviewers detect dev-mode misuse via `SELECT count(*) FROM audit_events WHERE actor='system-bypass'` returning > 0. |
| SOC 2 Type II | **CC6.1** (Logical access — restrict, monitor, terminate) | Per-decision audit row + `certctl_approval_decisions_total{outcome,profile_id}` Prometheus counter. Operators alert on sustained `outcome="rejected"` or `outcome="expired"` bursts. |
| HIPAA | **§164.308(a)(4)** (Information access management) | Same surface — the per-policy gating + audit trail is the access-management control. |
## Bypass mode (dev / CI ONLY)
Setting `CERTCTL_APPROVAL_BYPASS=true` short-circuits the workflow: every `RequestApproval` call auto-approves with `decided_by=system-bypass` and `actorType=System`. Used by dev / CI to keep renewal-scheduler tests fast without standing up an approver.
**Production deploys MUST leave this unset.** The bypass emits a typed audit event (`action=approval_bypassed`) so compliance auditors detect misuse via:
```sql
SELECT count(*) FROM audit_events WHERE actor = 'system-bypass';
```
returning **zero rows in production** and a high count in dev. The certctl-server logs a `WARN` line at boot when bypass is enabled — operators alert on that log line in production environments.
## Prometheus metrics
```
certctl_approval_decisions_total{outcome,profile_id} counter
certctl_approval_pending_age_seconds histogram
(le buckets:
60, 300, 1800, 3600,
21600, 86400, +Inf)
```
`outcome` is one of `approved`, `rejected`, `expired`, `bypassed`. `profile_id` is the `CertificateProfile.ID` that triggered the gate (cardinality-bounded — operators have <100 profiles in production).
The pending-age histogram observes seconds-since-creation at the moment of decision. Alert when p99 hits hours/days — compliance customers usually have a same-day decision deadline.
## Future free V2 work
- **M-of-N approver chains.** Today's primitive is single-approver. Future V2 work adds chains — e.g., "needs 2 of 3 platform-team members."
- **Time-windowed auto-approve.** Today's reaper hard-cancels at the static deadline. Policy-driven time-windowed auto-approve (T+30m unattended → cancel; T+24h business hours → escalate) is future work.
- **External ticketing integration.** ServiceNow / JIRA bridging so approval state mirrors the change-management record.
- **Per-owner / per-team routing.** Today's pool is global. Per-owner / per-team routing matches cert ownership to approver pools.
- **Approval delegation.** Today the same-actor rule is strict. Time-bounded delegation is future work.
Tracked in `WORKSPACE-ROADMAP.md` under the Future Free V2 Work section — every item ships free under BSL.
+36
View File
@@ -28,6 +28,11 @@ type Config struct {
SCEP SCEPConfig
Verification VerificationConfig
ACME ACMEConfig
// Approval is the issuance approval-workflow primitive's runtime
// config. Rank 7 of the 2026-05-03 Infisical deep-research
// deliverable. The single field — BypassEnabled — short-circuits
// the workflow for dev/CI; production deploys MUST leave it false.
Approval ApprovalConfig
// ACMEServer is the SERVER-side ACME (RFC 8555 + RFC 9773 ARI)
// configuration. Distinct from ACME above (which is the consumer-
// side issuer connector that talks UP to Let's Encrypt / pebble).
@@ -1425,6 +1430,29 @@ type SchedulerConfig struct {
K8sDeployKubeletSyncTimeout time.Duration
}
// ApprovalConfig contains issuance approval-workflow runtime configuration.
// Rank 7 of the 2026-05-03 Infisical deep-research deliverable.
type ApprovalConfig struct {
// BypassEnabled short-circuits the approval workflow — every
// RequestApproval call auto-approves with decidedBy="system-bypass"
// (see domain.ApprovalActorSystemBypass) and emits an audit row with
// ActorType=System. Used by dev / CI to keep renewal-scheduler tests
// fast without standing up an approver.
//
// **PRODUCTION DEPLOYS MUST LEAVE THIS FALSE.** A simple SQL query
// detects misuse:
//
// SELECT count(*) FROM audit_events WHERE actor = 'system-bypass';
//
// returns zero in production and a high count in dev. The bypass
// also emits a typed audit event (action=approval_bypassed) so
// compliance auditors can pattern-match without scanning JSON
// metadata.
//
// Setting: CERTCTL_APPROVAL_BYPASS environment variable. Default: false.
BypassEnabled bool
}
// LogConfig contains logging configuration.
type LogConfig struct {
// Level sets the minimum log level for output.
@@ -1839,6 +1867,14 @@ func Load() (*Config, error) {
ExternalAccountRequired: getEnvBool("CERTCTL_ACME_SERVER_EAB_REQUIRED", false),
},
},
Approval: ApprovalConfig{
// Rank 7. Default: false. Production deploys must leave it false;
// the bypass emits a typed audit row (action=approval_bypassed,
// actor=system-bypass) so compliance auditors detect misuse via
// SELECT count(*) FROM audit_events WHERE actor='system-bypass'
// returning > 0.
BypassEnabled: getEnvBool("CERTCTL_APPROVAL_BYPASS", false),
},
Digest: DigestConfig{
Enabled: getEnvBool("CERTCTL_DIGEST_ENABLED", false),
Interval: getEnvDuration("CERTCTL_DIGEST_INTERVAL", 24*time.Hour),
+77 -1
View File
@@ -32,6 +32,18 @@ type CertificateService struct {
// falls back to the historical on-demand path via caSvc.
crlCacheSvc *CRLCacheService
keygenMode string
// approvalSvc + profileRepo, when both set, gate TriggerRenewal on
// CertificateProfile.RequiresApproval. The job is created at
// JobStatusAwaitingApproval (rather than Pending / AwaitingCSR) and
// a parallel ApprovalRequest row is created via approvalSvc. The
// scheduler does NOT dispatch until ApprovalService.Approve
// transitions the job to Pending. Rank 7 of the 2026-05-03
// Infisical deep-research deliverable. Both setters are optional —
// when either is nil, gating is skipped and TriggerRenewal falls
// back to the historical unattended path.
approvalSvc *ApprovalService
profileRepo repository.CertificateProfileRepository
}
// NewCertificateService creates a new certificate service.
@@ -93,6 +105,21 @@ func (s *CertificateService) SetKeygenMode(mode string) {
s.keygenMode = mode
}
// SetApprovalService wires the approval-workflow service. When both this
// and SetProfileRepo are wired, TriggerRenewal gates on
// CertificateProfile.RequiresApproval. Rank 7 of the 2026-05-03 Infisical
// deep-research deliverable.
func (s *CertificateService) SetApprovalService(svc *ApprovalService) {
s.approvalSvc = svc
}
// SetProfileRepo wires the certificate-profile repository for the
// approval-workflow gate. Without it, TriggerRenewal cannot read the
// per-profile RequiresApproval flag and gating is skipped.
func (s *CertificateService) SetProfileRepo(repo repository.CertificateProfileRepository) {
s.profileRepo = repo
}
// List returns a paginated list of certificates matching the filter.
func (s *CertificateService) List(ctx context.Context, filter *repository.CertificateFilter) ([]*domain.ManagedCertificate, int, error) {
certs, total, err := s.certRepo.List(ctx, filter)
@@ -288,9 +315,35 @@ func (s *CertificateService) TriggerRenewal(ctx context.Context, certID string,
// Create a renewal job so the job processor can pick it up.
// In agent keygen mode, the job starts as AwaitingCSR so the agent
// generates the key pair and submits a CSR. In server mode, it starts as Pending.
//
// Rank 7: if the cert's profile has RequiresApproval=true and the
// approval service + profile repo are wired, the job is created at
// JobStatusAwaitingApproval (overriding the keygen-mode default) and
// a parallel ApprovalRequest row is created. The scheduler does NOT
// dispatch until ApprovalService.Approve transitions the job to
// Pending. Profile lookup failures fall back to the historical
// unattended path so a transient profile-repo error never silently
// blocks renewal — the gate is fail-open from the operator's
// perspective + fail-loud via the slog warning.
if s.jobRepo != nil {
needsApproval := false
var approvalProfileID string
if s.approvalSvc != nil && s.profileRepo != nil && cert.CertificateProfileID != "" {
profile, profileErr := s.profileRepo.Get(ctx, cert.CertificateProfileID)
if profileErr != nil {
slog.Warn("approval gate: profile lookup failed; falling back to unattended path",
"cert_id", cert.ID, "profile_id", cert.CertificateProfileID, "error", profileErr)
} else if profile != nil && profile.RequiresApproval {
needsApproval = true
approvalProfileID = profile.ID
}
}
jobStatus := domain.JobStatusPending
if s.keygenMode == "agent" {
switch {
case needsApproval:
jobStatus = domain.JobStatusAwaitingApproval
case s.keygenMode == "agent":
jobStatus = domain.JobStatusAwaitingCSR
}
@@ -316,6 +369,29 @@ func (s *CertificateService) TriggerRenewal(ctx context.Context, certID string,
return fmt.Errorf("failed to create renewal job: %w", err)
}
// Create the parallel ApprovalRequest row. If RequestApproval fails,
// transition the job to Cancelled so the scheduler doesn't dispatch
// a half-gated request (defense in depth — without this, a partial
// failure would leave the job at AwaitingApproval forever, blocking
// renewal until the operator manually intervenes).
if needsApproval {
metadata := map[string]string{
"common_name": cert.CommonName,
}
if _, apErr := s.approvalSvc.RequestApproval(ctx, cert, job.ID, approvalProfileID, actor, metadata); apErr != nil {
slog.Error("approval gate: failed to create ApprovalRequest row; cancelling gated job",
"cert_id", cert.ID, "job_id", job.ID, "error", apErr)
if cancelErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusCancelled,
"approval request creation failed: "+apErr.Error()); cancelErr != nil {
slog.Error("approval gate: also failed to cancel orphan job",
"cert_id", cert.ID, "job_id", job.ID, "error", cancelErr)
}
return fmt.Errorf("failed to create approval request: %w", apErr)
}
slog.Info("approval gate fired", "cert_id", cert.ID, "job_id", job.ID,
"profile_id", approvalProfileID, "requested_by", actor)
}
slog.Info("created renewal job via API trigger",
"job_id", job.ID,
"cert_id", cert.ID,