Bundle C: Renewal/reliability cluster — 7 findings closed

Closes M-006 + M-007 + M-008 + M-015 + M-016 + M-019 + M-020 from
comprehensive-audit-2026-04-25. M-028 was already closed by the
Bundle B CI follow-up.

M-006 (CWE-913): Idempotent migration 000014
  migrations/000014_policy_violation_severity_check.up.sql:
    Prepended ALTER TABLE ... DROP CONSTRAINT IF EXISTS before the ADD.
    Mirrors the down migration's existing IF EXISTS shape and the M-7
    idempotent-index idiom. Re-runs against partially-applied DBs now
    succeed.

M-007: Bulk-op partial-failure tests (3 new)
  internal/api/handler/bulk_partial_failure_test.go:
    TestBulkRevoke_PartialFailure_ReportsBoth
    TestBulkRenew_PartialFailure_ReportsBoth
    TestBulkReassign_PartialFailure_ReportsBoth
    Each asserts HTTP 200 + both success/failure counters round-trip +
    per-cert errors[] preserved with non-empty messages so operators can
    correlate each failure to its certificate ID.

M-008: Admin-gated handler enumeration pin (verified-already-clean)
  Recon: only one admin-gated handler, bulk_revocation.go, with full
  3-branch test triplet already in place. health.go calls IsAdmin
  informationally to surface the flag to the GUI without gating.
  internal/api/handler/m008_admin_gate_test.go:
    Walks every handler .go file, asserts every middleware.IsAdmin call
    site is in AdminGatedHandlers (with required test triplet) or
    InformationalIsAdminCallers (justified). Adding a new admin gate
    without updating both the constant AND adding the test triplet
    fails CI.

M-015: Single-profile cardinality pin (verified-already-clean)
  Audit claim 'no cardinality validation' was wrong: it is enforced at
  struct level. domain.ManagedCertificate.{CertificateProfileID,
  RenewalPolicyID, IssuerID, OwnerID} and
  RenewalPolicy.CertificateProfileID are bare strings, not slices.
  internal/domain/m015_cardinality_test.go:
    reflect-based pin on kind=String. Schema change to N:N would have to
    update renewal.go's lookup loop in the same commit.

M-016 (CWE-754): Reap stale-agent jobs
  internal/repository/postgres/job.go::ListJobsWithOfflineAgents:
    JOIN jobs to agents on agent_id, filter (status=Running AND
    a.last_heartbeat_at < cutoff), exclude server-keygen jobs.
  internal/service/job.go::ReapJobsWithOfflineAgents:
    Flips matched jobs to Failed with reason agent_offline so the I-001
    retry loop re-queues them on a healthy agent. Records an audit event
    per reap.
  internal/scheduler/scheduler.go:
    Scheduler.runJobTimeout cycle now calls both reaper arms.
    agentOfflineJobTTL default 5min (5x agent-health-check default);
    SetAgentOfflineJobTTL knob for operator override.
  internal/service/job_offline_agent_reaper_test.go:
    6 unit tests cover happy path, server-keygen-skip, non-Running-skip,
    non-positive-TTL fail-loud, repo-error propagation, audit-event
    recording.

M-019: Configurable ARI HTTP timeout
  Audit claim 'no fallback timeout' was wrong: ari.go:52 already had a
  15s timeout. Bundle C makes it configurable.
  internal/connector/issuer/acme/acme.go:
    Config.ARIHTTPTimeoutSeconds field with env path
    CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS.
  internal/connector/issuer/acme/ari.go:
    Both HTTP clients (GetRenewalInfo + getARIEndpoint) now use the new
    ariHTTPTimeout() helper. Zero / negative / nil-config all fall back
    to the historic 15s default.
  ari_timeout_test.go: 4 dispatch arm tests.

M-020 (CWE-770): OCSP DoS hardening
  Pre-bundle, the noAuthHandler chain had no rate limit. An attacker
  could DoS the OCSP responder, which for fail-open relying parties is a
  revocation bypass.
  cmd/server/main.go:
    noAuthHandler refactored from fixed middleware.Chain(...) to a
    conditional slice that appends middleware.NewRateLimiter when
    cfg.RateLimit.Enabled. Per-IP keying applies; OCSP/CRL/EST/SCEP are
    unauth.
  docs/security.md (NEW):
    Operator runbook documenting the Must-Staple TLS Feature extension
    (RFC 7633) as the architectural fix for fail-open relying parties.
    Profile-flip guidance + nginx/Apache/HAProxy/Envoy stapling snippets
    + explicit scope statement on what the rate limiter alone does NOT
    solve.

Audit deliverables:
  cowork/comprehensive-audit-2026-04-25/audit-report.md:
    score 31/55 -> 38/55 closed (Medium 13/27 -> 20/27).
  cowork/comprehensive-audit-2026-04-25/findings.yaml:
    7 status flips open -> closed with closure notes citing the Bundle C
    mechanism.
  certctl/CHANGELOG.md: Bundle C section under [unreleased].

Verification:
  go vet ./internal/service ./internal/scheduler
    ./internal/connector/issuer/acme ./internal/api/handler
    ./internal/domain ./cmd/server: clean
  go test -count=1 -short on the same packages: all green
  helm template + helm lint: clean
  internal/repository/postgres: setup-fail caused by sandbox disk
    pressure (same on master HEAD before this branch)
2026-06-14 04:48:52 +00:00 · 2026-04-27 00:08:25 +00:00
parent e6422bc483
commit 62a412c488
18 changed files with 1034 additions and 18 deletions
@@ -237,6 +237,58 @@ func (s *JobService) RetryFailedJobs(ctx context.Context, maxRetries int) error
 	return nil
 }

+// ReapJobsWithOfflineAgents transitions jobs in Running status whose
+// owning agent has been silent longer than agentTTL to Failed with
+// reason "agent_offline". Bundle C / Audit M-016 (CWE-754): closes the
+// gap left by ReapTimedOutJobs (which only handles AwaitingCSR /
+// AwaitingApproval). I-001's retry loop then auto-promotes eligible
+// Failed jobs back to Pending so a healthy agent can claim them.
+func (s *JobService) ReapJobsWithOfflineAgents(ctx context.Context, agentTTL time.Duration) error {
+	if agentTTL <= 0 {
+		return fmt.Errorf("ReapJobsWithOfflineAgents: agentTTL must be positive, got %s", agentTTL)
+	}
+	cutoff := time.Now().Add(-agentTTL)
+
+	staleJobs, err := s.jobRepo.ListJobsWithOfflineAgents(ctx, cutoff)
+	if err != nil {
+		return fmt.Errorf("list jobs with offline agents: %w", err)
+	}
+
+	var reaped int
+	for _, job := range staleJobs {
+		oldStatus := job.Status
+		errMsg := fmt.Sprintf("agent offline (no heartbeat for >%s)", agentTTL)
+
+		job.Status = domain.JobStatusFailed
+		job.LastError = &errMsg
+
+		if err := s.jobRepo.Update(ctx, job); err != nil {
+			s.logger.Error("failed to transition offline-agent job",
+				"job_id", job.ID, "agent_id", job.AgentID, "error", err)
+			continue
+		}
+
+		if s.auditService != nil {
+			if auditErr := s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
+				"job_offline_agent_reap", "job", job.ID,
+				map[string]interface{}{
+					"old_status":     string(oldStatus),
+					"new_status":     string(domain.JobStatusFailed),
+					"timeout_reason": "agent_offline",
+					"agent_id":       job.AgentID,
+				}); auditErr != nil {
+				s.logger.Error("failed to record offline-agent reap audit event",
+					"job_id", job.ID, "error", auditErr)
+			}
+		}
+		reaped++
+	}
+
+	s.logger.Info("offline-agent job reaper completed",
+		"reaped", reaped, "total_stale", len(staleJobs))
+	return nil
+}
+
 // ReapTimedOutJobs transitions jobs stuck in AwaitingCSR or AwaitingApproval
 // to Failed if they've exceeded their TTL. I-001's retry loop then auto-promotes
 // eligible Failed jobs back to Pending (closes coverage gap I-003).