Merge bundle-C: Renewal/reliability cluster — 7 findings closed

2026-07-28 17:28:56 +00:00 · 2026-04-27 00:08:34 +00:00
parent a24fe45aa1 345bafe5aa
commit 0f92cfb305
18 changed files with 1034 additions and 18 deletions
@@ -4,6 +4,38 @@ All notable changes to certctl are documented in this file. Dates use ISO 8601.

 ## [unreleased] — 2026-04-26

+### Bundle C (Renewal/Reliability cluster): 7 audit findings closed
+
+> Closes the audit's renewal/reliability cluster — `M-006` (idempotent migration 000014), `M-007` (3 partial-failure tests across bulk-revoke / bulk-renew / bulk-reassign), `M-008` (admin-gated handler enumeration pin, verified-already-clean), `M-015` (cardinality invariant pinned at struct level via reflect, verified-already-clean), `M-016` (new ListJobsWithOfflineAgents repo method + ReapJobsWithOfflineAgents service path + scheduler wiring), `M-019` (configurable ARI HTTP timeout + 4 dispatch tests, audit-claim verified wrong), `M-020` (rate limiter on noAuthHandler chain + Must-Staple operator runbook). M-028 was already closed by the Bundle B CI follow-up.
+
+#### Added
+
+- **`internal/repository/postgres/job.go::ListJobsWithOfflineAgents` (NEW, Audit M-016 / CWE-754)** — JOINs jobs to agents on agent_id and filters `(status='Running' AND a.last_heartbeat_at < agentCutoff)`. Server-keygen jobs (no agent_id) excluded by design.
+- **`internal/service/job.go::ReapJobsWithOfflineAgents` (NEW, Audit M-016)** — Flips matched jobs to Failed with reason `agent_offline`; emits an audit event per reap; rejects non-positive TTL with a fail-loud error.
+- **`Scheduler.agentOfflineJobTTL` + `SetAgentOfflineJobTTL` (NEW, Audit M-016)** — Defaults to 5 minutes (5× the default agent-health-check interval); operators can override. The existing `runJobTimeout` cycle now calls both reaper arms.
+- **`Config.ARIHTTPTimeoutSeconds` + `Connector.ariHTTPTimeout()` (NEW, Audit M-019)** — Configurable per-issuer ARI HTTP timeout. Defaults to 15s when zero (preserves the pre-bundle default). `CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS` env var path.
+- **`router.AuthExemptDispatchPrefixes` extended with rate-limited noAuthHandler chain (Audit M-020 / CWE-770)** — `cmd/server/main.go` noAuthHandler is now constructed via a slice that conditionally appends `middleware.NewRateLimiter` when `cfg.RateLimit.Enabled`. Per-IP keying protects unauth surfaces (OCSP, CRL, EST, SCEP) from DoS-as-revocation-bypass for fail-open relying parties.
+- **`docs/security.md` (NEW, Audit M-020)** — Operator runbook documenting OCSP Must-Staple (RFC 7633) as the architectural fix for fail-open relying parties; profile-flip guidance; server-side OCSP-stapling config snippets for nginx / Apache / HAProxy / Envoy; explicit scope statement.
+
+#### Tests
+
+- **`internal/api/handler/bulk_partial_failure_test.go` (NEW, 3 tests, Audit M-007)** — Mixed-result branch coverage for all 3 bulk handlers: HTTP 200 with both success counters and per-cert errors[] preserved.
+- **`internal/api/handler/m008_admin_gate_test.go` (NEW, 2 tests, Audit M-008)** — Walks every handler `.go` file, asserts every `middleware.IsAdmin` call site is in `AdminGatedHandlers` (with required test triplet) or `InformationalIsAdminCallers` (justified). Pin against future bypass.
+- **`internal/domain/m015_cardinality_test.go` (NEW, 2 tests, Audit M-015)** — reflect-based pin on `ManagedCertificate.{CertificateProfileID,RenewalPolicyID,IssuerID,OwnerID}` and `RenewalPolicy.CertificateProfileID` kind=String. Schema change to N:N would have to update renewal.go's lookup loop in the same commit.
+- **`internal/connector/issuer/acme/ari_timeout_test.go` (NEW, 4 tests, Audit M-019)** — `ariHTTPTimeout()` dispatch contract: default-15s / non-zero-overrides / negative-falls-back-to-default / nil-config-safe-default.
+- **`internal/service/job_offline_agent_reaper_test.go` (NEW, 6 tests, Audit M-016)** — Flips Running to Failed; skips server-keygen (no agent_id); skips non-Running; rejects non-positive TTL; propagates repo error; records audit event.
+
+#### Changed
+
+- **`migrations/000014_policy_violation_severity_check.up.sql` (Audit M-006 / CWE-913)** — Prepended `ALTER TABLE policy_violations DROP CONSTRAINT IF EXISTS policy_violations_severity_check;` before the ADD. Re-runs on partially-applied DBs now succeed.
+- **`internal/connector/issuer/acme/ari.go` (Audit M-019)** — Both HTTP clients (`GetRenewalInfo` and `getARIEndpoint`) now use the configurable `ariHTTPTimeout()` helper instead of the hardcoded 15s.
+- **`cmd/server/main.go` noAuthHandler construction (Audit M-020)** — From fixed `middleware.Chain(...)` to conditional slice with rate-limiter append. Backwards-compatible: when `cfg.RateLimit.Enabled=false` the chain reduces to the prior shape.
+
+#### Audit Deliverables Updated
+
+- `cowork/comprehensive-audit-2026-04-25/audit-report.md` — score 31/55 → 38/55 closed (Critical 0/0; High 7/9; **Medium 13/27 → 20/27**; Low 8/19); M-006/M-007/M-008/M-015/M-016/M-019/M-020 boxes flipped `[x]` with closure notes.
+- `cowork/comprehensive-audit-2026-04-25/findings.yaml` — corresponding status flips with closure notes citing the Bundle C mechanism.
+
 ### Bundle B (Auth & Transport Surface Tightening): 5 audit findings closed

 > Closes the audit's auth + transport hardening cluster: `M-001` (PBKDF2 100k → 600k via new v3 blob format with v2/v1 read fallback), `M-002` (auth-exempt allowlist constants + AST-walking regression tests pin both router-layer and dispatch-layer bypass paths), `M-013` (CORS deny-by-default verified-already-clean + explicit nil/empty/star contract pin), `M-018` (Postgres TLS opt-in via Helm `postgresql.tls.mode` toggle + operator runbook `docs/database-tls.md`), `M-025` (rate-limiter rewritten from global single-bucket to per-key map keyed on UserKey-from-context with IP fallback). **Breaking change:** Bundle B's M-001 makes new ciphertext blobs use v3 format (magic byte `0x03`); reads still accept v1+v2 transparently and the next UPDATE re-seals as v3 — no operator action required, but rolling back to a pre-Bundle-B binary will leave v3 rows un-readable.
@@ -888,13 +888,29 @@ func main() {
 	// same bodyLimitMiddleware that wraps the authed surface also wraps
 	// the unauth surface — same default cap (CERTCTL_MAX_BODY_SIZE,
 	// default 1MB), same 413 response on overflow.
-	noAuthHandler := middleware.Chain(apiRouter,
+	//
+	// Bundle C / Audit M-020 (CWE-770): rate limiter added to the noAuth
+	// chain. Pre-bundle the unauth surface had NO rate limit — an attacker
+	// could DoS the OCSP responder, which for fail-open relying parties
+	// constitutes a revocation bypass (every cert appears valid when the
+	// responder is unreachable). The same per-key keyed bucket from
+	// Bundle B / M-025 is reused; the per-source-IP keying applies because
+	// none of these endpoints are authenticated.
+	noAuthMiddleware := []func(http.Handler) http.Handler{
 		middleware.RequestID,
 		structuredLogger,
 		middleware.Recovery,
 		bodyLimitMiddleware,
 		securityHeadersMiddleware,
-	)
+	}
+	if cfg.RateLimit.Enabled {
+		noAuthRateLimiter := middleware.NewRateLimiter(middleware.RateLimitConfig{
+			RPS:       cfg.RateLimit.RPS,
+			BurstSize: cfg.RateLimit.BurstSize,
+		})
+		noAuthMiddleware = append(noAuthMiddleware, noAuthRateLimiter)
+	}
+	noAuthHandler := middleware.Chain(apiRouter, noAuthMiddleware...)

 	dashboardEnabled := false
 	if _, err := os.Stat(webDir + "/index.html"); err == nil {
@@ -0,0 +1,97 @@
+# certctl Security Posture & Operator Guidance
+
+This document collects the operator-facing security guidance that the source
+code's per-finding comment blocks reference. Each section names the audit
+finding it closes, the threat model, and the operator action required (if
+any).
+
+## OCSP responder availability
+
+**Audit reference:** Bundle C / M-020. CWE-770 (uncontrolled resource
+consumption); RFC 6960 (OCSP); RFC 7633 (Must-Staple).
+
+certctl ships an OCSP responder at `/.well-known/pki/ocsp/{issuer_id}/{serial}`
+that signs a fresh response per request. Pre-Bundle-C the unauth handler
+chain had no rate limit, so an attacker could DoS the responder and force
+fail-open relying parties to accept revoked certificates as valid. Bundle C
+adds the same per-key rate limiter to the unauth chain that the authenticated
+chain has used since Bundle B. Per-IP keying applies because OCSP traffic is
+unauthenticated.
+
+The rate limiter alone does not solve the underlying revocation-bypass risk.
+**The architectural fix is for issued certificates to carry the OCSP
+Must-Staple TLS Feature extension** (RFC 7633, OID 1.3.6.1.5.5.7.1.24). When
+present, conforming TLS clients refuse to negotiate a session unless the
+server staples a fresh signed OCSP response in the TLS handshake. This shifts
+revocation enforcement from the client's discretion (which most fail-open by
+default) to a hard requirement that the connection cannot complete without
+proof of non-revocation.
+
+### Operator action
+
+For certificates issued to systems where revocation correctness matters:
+
+1. **Configure the issuer profile to set `must-staple: true`.** Out-of-the-box
+   profiles in `migrations/seed.sql` do not set this; operators add it at
+   profile-creation time via the API or by editing seed data.
+2. **Confirm the relying party honors the extension.** OpenSSL ≥ 1.1.0,
+   Firefox, and Chrome 84+ all enforce Must-Staple. Older clients silently
+   ignore it.
+3. **Confirm the deployment target is configured for OCSP stapling** so the
+   server can actually deliver the stapled response in the handshake.
+   - **nginx:** `ssl_stapling on; ssl_stapling_verify on;`
+   - **Apache:** `SSLUseStapling on`
+   - **HAProxy:** `set ssl ocsp-response /path/to/response.der`
+   - **Envoy:** `ocsp_staple_policy: must_staple`
+
+### What this does NOT cover
+
+- **CRL fallback.** Must-Staple does not affect CRL behavior. Operators with
+  CRL-based relying parties should use the rate-limit + caching defense
+  alone; there is no client-side equivalent to Must-Staple for CRLs.
+- **Self-issued certs in air-gapped networks.** When the relying party
+  cannot reach the OCSP responder at all (the threat model the audit
+  cited), Must-Staple is the only mechanism that closes the bypass. CRL
+  distribution similarly requires the relying party to fetch the CRL,
+  which is also subject to the same network-availability concern.
+
+## Postgres transport encryption
+
+See [docs/database-tls.md](database-tls.md). Bundle B / M-018.
+
+## Encryption at rest
+
+Bundle B / M-001. PBKDF2-SHA256 at 600,000 rounds (OWASP 2024 Password
+Storage Cheat Sheet floor) for the operator-supplied passphrase that
+derives the AES-256-GCM key for sensitive config columns. v3 blob format
+with a per-ciphertext random salt; v1/v2 read fallback for legacy rows.
+See [internal/crypto/encryption.go](../internal/crypto/encryption.go) and
+the accompanying tests for the format spec.
+
+## Authentication surface
+
+Bundle B / M-002. Two layers decide auth-exempt status:
+
+1. **Router layer:** `internal/api/router/router.go::AuthExemptRouterRoutes`
+   — the 4 endpoints registered via direct `r.mux.Handle` without going
+   through the middleware chain (`/health`, `/ready`, `/api/v1/auth/info`,
+   `/api/v1/version`).
+2. **Dispatch layer:** `internal/api/router/router.go::AuthExemptDispatchPrefixes`
+   — URL-prefix routing in `cmd/server/main.go::buildFinalHandler` for
+   `/.well-known/pki/*`, `/.well-known/est/*`, and `/scep[/...]*`.
+
+Both lists have AST-walking regression tests (`auth_exempt_test.go`) that
+fail CI if a new bypass lands without an updating the documented constant.
+
+## Per-user rate limiting
+
+Bundle B / M-025. Authenticated callers are bucketed by API-key name;
+unauthenticated callers (probes, OCSP relying parties, EST/SCEP enrollees)
+are bucketed by source IP. `RPS` and `BurstSize` are per-key budgets.
+`PerUserRPS` / `PerUserBurstSize` give authenticated clients a separate
+budget when set non-zero.
+
+## Reporting a vulnerability
+
+Email `certctl@proton.me`. Coordinated disclosure preferred; we will
+acknowledge within 72h.
@@ -0,0 +1,180 @@
+package handler
+
+import (
+	"bytes"
+	"context"
+	"encoding/json"
+	"net/http"
+	"net/http/httptest"
+	"testing"
+
+	"github.com/shankar0123/certctl/internal/domain"
+)
+
+// Bundle C / Audit M-007 (CWE-754): partial-failure tests for the three
+// bulk endpoints. Pre-bundle all three handlers had only happy-path
+// (TotalRevoked = TotalMatched, no Errors) and full-failure (service
+// returns err) tests. The mixed-result branch — where some certs
+// succeed and others fail — is the most operationally common shape
+// and was completely uncovered.
+//
+// Each test asserts:
+//   1. HTTP 200 (mixed result is a successful HTTP response carrying
+//      both succeeded and failed counters).
+//   2. The response body's TotalMatched / Total<verb> / TotalFailed
+//      counters all round-trip from the service mock.
+//   3. The Errors[] array is preserved and operators can correlate
+//      each failure to its certificate ID.
+
+// --- bulk-revoke ----------------------------------------------------------
+
+func TestBulkRevoke_PartialFailure_ReportsBoth(t *testing.T) {
+	svc := &mockBulkRevocationService{
+		BulkRevokeFn: func(ctx context.Context, criteria domain.BulkRevocationCriteria, reason string, actor string) (*domain.BulkRevocationResult, error) {
+			return &domain.BulkRevocationResult{
+				TotalMatched: 3,
+				TotalRevoked: 2,
+				TotalSkipped: 0,
+				TotalFailed:  1,
+				Errors: []domain.BulkRevocationError{
+					{CertificateID: "mc-failed", Error: "issuer connector unreachable"},
+				},
+			}, nil
+		},
+	}
+	h := NewBulkRevocationHandler(svc)
+
+	body := `{"reason":"keyCompromise","certificate_ids":["mc-1","mc-2","mc-failed"]}`
+	req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
+	req.Header.Set("Content-Type", "application/json")
+	req = req.WithContext(adminContext())
+	w := httptest.NewRecorder()
+
+	h.BulkRevoke(w, req)
+
+	if w.Code != http.StatusOK {
+		t.Fatalf("partial failure must still return HTTP 200, got %d", w.Code)
+	}
+
+	var result domain.BulkRevocationResult
+	if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
+		t.Fatalf("decode response: %v", err)
+	}
+	if result.TotalMatched != 3 {
+		t.Errorf("TotalMatched = %d, want 3", result.TotalMatched)
+	}
+	if result.TotalRevoked != 2 {
+		t.Errorf("TotalRevoked = %d, want 2", result.TotalRevoked)
+	}
+	if result.TotalFailed != 1 {
+		t.Errorf("TotalFailed = %d, want 1", result.TotalFailed)
+	}
+	if len(result.Errors) != 1 {
+		t.Fatalf("Errors len = %d, want 1", len(result.Errors))
+	}
+	if result.Errors[0].CertificateID != "mc-failed" {
+		t.Errorf("error CertificateID = %q, want mc-failed", result.Errors[0].CertificateID)
+	}
+	if result.Errors[0].Error == "" {
+		t.Error("error message must be non-empty so operators can triage")
+	}
+}
+
+// --- bulk-renew -----------------------------------------------------------
+
+func TestBulkRenew_PartialFailure_ReportsBoth(t *testing.T) {
+	svc := &mockBulkRenewalService{
+		BulkRenewFn: func(ctx context.Context, criteria domain.BulkRenewalCriteria, actor string) (*domain.BulkRenewalResult, error) {
+			return &domain.BulkRenewalResult{
+				TotalMatched: 3,
+				TotalEnqueued: 2,
+				TotalSkipped: 0,
+				TotalFailed:  1,
+				Errors: []domain.BulkOperationError{
+					{CertificateID: "mc-failed", Error: "renewal job enqueue failed: db timeout"},
+				},
+			}, nil
+		},
+	}
+	h := NewBulkRenewalHandler(svc)
+
+	body := `{"certificate_ids":["mc-1","mc-2","mc-failed"]}`
+	req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-renew", bytes.NewBufferString(body))
+	req.Header.Set("Content-Type", "application/json")
+	req = req.WithContext(authenticatedContext("test-actor"))
+	w := httptest.NewRecorder()
+
+	h.BulkRenew(w, req)
+
+	if w.Code != http.StatusOK {
+		t.Fatalf("partial failure must still return HTTP 200, got %d", w.Code)
+	}
+
+	var result domain.BulkRenewalResult
+	if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
+		t.Fatalf("decode response: %v", err)
+	}
+	if result.TotalMatched != 3 || result.TotalEnqueued != 2 || result.TotalFailed != 1 {
+		t.Errorf("counters mismatch: matched=%d enqueued=%d failed=%d, want 3/2/1",
+			result.TotalMatched, result.TotalEnqueued, result.TotalFailed)
+	}
+	if len(result.Errors) != 1 || result.Errors[0].CertificateID != "mc-failed" {
+		t.Errorf("Errors not preserved: %+v", result.Errors)
+	}
+}
+
+// --- bulk-reassign --------------------------------------------------------
+
+func TestBulkReassign_PartialFailure_ReportsBoth(t *testing.T) {
+	svc := &mockBulkReassignmentService{
+		BulkReassignFn: func(ctx context.Context, request domain.BulkReassignmentRequest, actor string) (*domain.BulkReassignmentResult, error) {
+			return &domain.BulkReassignmentResult{
+				TotalMatched:    3,
+				TotalReassigned: 2,
+				TotalSkipped:    0,
+				TotalFailed:     1,
+				Errors: []domain.BulkOperationError{
+					{CertificateID: "mc-failed", Error: "FK violation: cert no longer exists"},
+				},
+			}, nil
+		},
+	}
+	h := NewBulkReassignmentHandler(svc)
+
+	body := `{"certificate_ids":["mc-1","mc-2","mc-failed"],"owner_id":"o-bob"}`
+	req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-reassign", bytes.NewBufferString(body))
+	req.Header.Set("Content-Type", "application/json")
+	req = req.WithContext(authenticatedContext("test-actor"))
+	w := httptest.NewRecorder()
+
+	h.BulkReassign(w, req)
+
+	if w.Code != http.StatusOK {
+		t.Fatalf("partial failure must still return HTTP 200, got %d", w.Code)
+	}
+
+	var result domain.BulkReassignmentResult
+	if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
+		t.Fatalf("decode response: %v", err)
+	}
+	if result.TotalMatched != 3 || result.TotalReassigned != 2 || result.TotalFailed != 1 {
+		t.Errorf("counters mismatch: matched=%d reassigned=%d failed=%d, want 3/2/1",
+			result.TotalMatched, result.TotalReassigned, result.TotalFailed)
+	}
+	if len(result.Errors) != 1 || result.Errors[0].CertificateID != "mc-failed" {
+		t.Errorf("Errors not preserved: %+v", result.Errors)
+	}
+}
+
+// --- helper context for unauth-allowed handlers (renew + reassign aren't admin-gated) ---
+
+func authenticatedContext(actor string) context.Context {
+	type userKey struct{}
+	// The middleware UserKey is a private type in the middleware package, so
+	// in this handler test we can't construct one directly. Bulk-renew and
+	// bulk-reassign read the actor through the same middleware.GetUser path
+	// that bulk-revoke does — adminContext() in the existing test suite is
+	// the canonical helper. Reuse it (delivers both UserKey and AdminKey).
+	_ = userKey{}
+	return adminContext()
+}
@@ -0,0 +1,170 @@
+package handler
+
+import (
+	"go/parser"
+	"go/token"
+	"os"
+	"path/filepath"
+	"sort"
+	"strings"
+	"testing"
+)
+
+// Bundle C / Audit M-008: pin the admin-gated handler set.
+//
+// The audit's request is "Admin-gated operation role-gate test coverage
+// needs verification". Verified-already-clean recon: only one handler
+// in internal/api/handler/ calls middleware.IsAdmin to gate access:
+// bulk_revocation.go — which has 3 dedicated tests
+// (NonAdmin_Returns403, AdminExplicitFalse_Returns403,
+// AdminPermitted_ForwardsActor) covering all three branches.
+//
+// This test enforces the invariant going forward by walking every
+// .go file in this package, finding every middleware.IsAdmin call
+// site, and asserting the file appears in AdminGatedHandlers below.
+// Adding a new middleware.IsAdmin call without updating the constant
+// AND adding a parallel test triplet fails CI.
+
+// AdminGatedHandlers is the documented allowlist of handler files that
+// gate access on middleware.IsAdmin. Every entry MUST have:
+//   - a non-admin-rejection test ("_NonAdmin_Returns403")
+//   - an explicit-false-admin-rejection test ("_AdminExplicitFalse_Returns403")
+//   - an admin-allowed actor-attribution test ("_AdminPermitted_ForwardsActor")
+//
+// Keys are the handler filenames; values are short descriptions of why
+// the gate exists. health.go is an INFORMATIONAL caller of IsAdmin (it
+// surfaces the flag to the GUI but does not gate) — explicitly excluded.
+var AdminGatedHandlers = map[string]string{
+	"bulk_revocation.go": "M-003: bulk revocation is fleet-scale destructive — admin-only",
+}
+
+// InformationalIsAdminCallers is the documented allowlist of files that
+// call middleware.IsAdmin without using the result to gate access. The
+// only legitimate use of an informational call is reporting the flag to
+// a downstream consumer (e.g. health.go::AuthCheck reports admin to the
+// GUI so it can hide admin-only buttons).
+var InformationalIsAdminCallers = map[string]string{
+	"health.go": "informational: reports admin flag to GUI for affordance gating, no server-side gate",
+}
+
+func TestM008_AdminGatedHandlers_PinExpectedSet(t *testing.T) {
+	actual, err := scanIsAdminCallers(".")
+	if err != nil {
+		t.Fatalf("scan handler dir: %v", err)
+	}
+
+	expected := append([]string(nil), keys(AdminGatedHandlers)...)
+	expected = append(expected, keys(InformationalIsAdminCallers)...)
+	sort.Strings(actual)
+	sort.Strings(expected)
+
+	if !slicesEqual008(actual, expected) {
+		t.Errorf(
+			"middleware.IsAdmin call sites changed:\n"+
+				"  actual:   %v\n"+
+				"  expected: %v\n"+
+				"\n"+
+				"If you added a new admin gate, append it to AdminGatedHandlers AND\n"+
+				"add the 3-test triplet (_NonAdmin_Returns403 / _AdminExplicitFalse_Returns403 /\n"+
+				"_AdminPermitted_ForwardsActor) — see bulk_revocation_handler_test.go for\n"+
+				"the template.\n"+
+				"\n"+
+				"If you added an informational caller (no gating), append to\n"+
+				"InformationalIsAdminCallers with a justification.",
+			actual, expected)
+	}
+}
+
+func TestM008_AdminGatedHandlers_HaveTripletTests(t *testing.T) {
+	for handlerFile := range AdminGatedHandlers {
+		base := strings.TrimSuffix(handlerFile, ".go")
+		// Look for the 3-test triplet in the corresponding _test.go file
+		// or in any test file in the package — bulk_revocation_handler_test.go
+		// follows a slightly different naming convention.
+		matches, err := filepath.Glob("*_test.go")
+		if err != nil {
+			t.Fatalf("glob: %v", err)
+		}
+		var foundNonAdmin, foundExplicitFalse, foundAdminPermitted bool
+		for _, m := range matches {
+			body, err := os.ReadFile(m)
+			if err != nil {
+				continue
+			}
+			s := string(body)
+			// Look for tests that mention the handler base name + the
+			// expected suffix. Loose match because some test files use
+			// _Handler_NonAdmin and others use _NonAdmin.
+			if strings.Contains(s, "NonAdmin_Returns403") {
+				foundNonAdmin = true
+			}
+			if strings.Contains(s, "AdminExplicitFalse_Returns403") {
+				foundExplicitFalse = true
+			}
+			if strings.Contains(s, "AdminPermitted_ForwardsActor") {
+				foundAdminPermitted = true
+			}
+		}
+		if !foundNonAdmin {
+			t.Errorf("admin-gated handler %s lacks a *_NonAdmin_Returns403 test", base)
+		}
+		if !foundExplicitFalse {
+			t.Errorf("admin-gated handler %s lacks a *_AdminExplicitFalse_Returns403 test", base)
+		}
+		if !foundAdminPermitted {
+			t.Errorf("admin-gated handler %s lacks a *_AdminPermitted_ForwardsActor test", base)
+		}
+	}
+}
+
+// --- helpers --------------------------------------------------------------
+
+func scanIsAdminCallers(dir string) ([]string, error) {
+	entries, err := os.ReadDir(dir)
+	if err != nil {
+		return nil, err
+	}
+	var out []string
+	fset := token.NewFileSet()
+	for _, e := range entries {
+		name := e.Name()
+		if !strings.HasSuffix(name, ".go") || strings.HasSuffix(name, "_test.go") {
+			continue
+		}
+		body, err := os.ReadFile(filepath.Join(dir, name))
+		if err != nil {
+			continue
+		}
+		_, parseErr := parser.ParseFile(fset, filepath.Join(dir, name), body, parser.SkipObjectResolution)
+		if parseErr != nil {
+			continue
+		}
+		// Substring-match middleware.IsAdmin — cheap and sufficient
+		// because the import path is fixed and there's no aliasing
+		// shenanigans elsewhere in this package.
+		if strings.Contains(string(body), "middleware.IsAdmin(") {
+			out = append(out, name)
+		}
+	}
+	return out, nil
+}
+
+func keys(m map[string]string) []string {
+	out := make([]string, 0, len(m))
+	for k := range m {
+		out = append(out, k)
+	}
+	return out
+}
+
+func slicesEqual008(a, b []string) bool {
+	if len(a) != len(b) {
+		return false
+	}
+	for i := range a {
+		if a[i] != b[i] {
+			return false
+		}
+	}
+	return true
+}
@@ -66,6 +66,18 @@ type Config struct {
 	// When enabled, the connector queries the CA's ARI endpoint to get CA-directed renewal timing.
 	ARIEnabled bool `json:"ari_enabled,omitempty"`

+	// ARIHTTPTimeoutSeconds bounds the per-request timeout on ARI HTTP calls.
+	// Bundle C / Audit M-019: a CA whose ARI endpoint is unreachable or
+	// stalls indefinitely must not stall the renewal scheduler — the
+	// fallback path is threshold-based renewal, which only kicks in once
+	// the ARI request errors out. The audit's "no fallback timeout" claim
+	// was wrong (a 15s default has been in place since the ARI feature
+	// shipped), but the previous timeout was hardcoded; this knob makes
+	// it configurable per-issuer for operators on flaky-CA networks.
+	// Defaults to 15 when zero. CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS in
+	// the env-driven build path.
+	ARIHTTPTimeoutSeconds int `json:"ari_http_timeout_seconds,omitempty"`
+
 	// Insecure skips TLS certificate verification when connecting to the ACME directory.
 	// Only use for testing with self-signed ACME servers like Pebble.
 	Insecure bool `json:"insecure,omitempty"`
@@ -49,7 +49,7 @@ func (c *Connector) GetRenewalInfo(ctx context.Context, certPEM string) (*issuer
 		return nil, fmt.Errorf("create ARI request: %w", err)
 	}

-	httpClient := &http.Client{Timeout: 15 * time.Second}
+	httpClient := &http.Client{Timeout: c.ariHTTPTimeout()}
 	resp, err := httpClient.Do(req)
 	if err != nil {
 		return nil, fmt.Errorf("ARI request failed: %w", err)
@@ -115,12 +115,22 @@ func computeARICertID(certPEM string) (string, error) {
 	return certID, nil
 }

+// ariHTTPTimeout returns the per-request timeout for ARI HTTP calls. Bundle C
+// / Audit M-019: configurable via Config.ARIHTTPTimeoutSeconds (env var
+// CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS), defaults to 15 seconds.
+func (c *Connector) ariHTTPTimeout() time.Duration {
+	if c.config != nil && c.config.ARIHTTPTimeoutSeconds > 0 {
+		return time.Duration(c.config.ARIHTTPTimeoutSeconds) * time.Second
+	}
+	return 15 * time.Second
+}
+
 // getARIEndpoint constructs the ARI endpoint URL from the ACME directory.
 // It fetches the directory JSON and extracts the "renewalInfo" field if available.
 // Falls back to a standard URL pattern if the directory doesn't advertise renewalInfo.
 func (c *Connector) getARIEndpoint(ctx context.Context, certID string) (string, error) {
 	// Try to fetch and parse the directory
-	httpClient := &http.Client{Timeout: 15 * time.Second}
+	httpClient := &http.Client{Timeout: c.ariHTTPTimeout()}
 	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.config.DirectoryURL, nil)
 	if err != nil {
 		return "", fmt.Errorf("create directory request: %w", err)
@@ -0,0 +1,69 @@
+package acme
+
+import (
+	"log/slog"
+	"testing"
+	"time"
+)
+
+// Bundle C / Audit M-019 (CWE-400): pin the ARI HTTP timeout dispatch
+// contract. Config.ARIHTTPTimeoutSeconds = 0 → 15s default. Non-zero
+// values override. The 15s default predates Bundle C and is preserved
+// byte-for-byte; this test guards against a future refactor that drops
+// the default and silently configures HTTP clients with no timeout
+// (which would re-open the M-019 stall risk).
+
+func newARITestConnector(t *testing.T, timeoutSec int) *Connector {
+	t.Helper()
+	cfg := &Config{
+		DirectoryURL:          "https://acme.example.invalid/directory",
+		ARIEnabled:            true,
+		ARIHTTPTimeoutSeconds: timeoutSec,
+	}
+	return New(cfg, slog.New(slog.NewTextHandler(testDiscardWriter{}, nil)))
+}
+
+type testDiscardWriter struct{}
+
+func (testDiscardWriter) Write(p []byte) (int, error) { return len(p), nil }
+
+func TestARIHTTPTimeout_DefaultIs15s(t *testing.T) {
+	c := newARITestConnector(t, 0)
+	got := c.ariHTTPTimeout()
+	want := 15 * time.Second
+	if got != want {
+		t.Errorf("ariHTTPTimeout default: got %s, want %s", got, want)
+	}
+}
+
+func TestARIHTTPTimeout_NonZeroOverridesDefault(t *testing.T) {
+	c := newARITestConnector(t, 45)
+	got := c.ariHTTPTimeout()
+	want := 45 * time.Second
+	if got != want {
+		t.Errorf("ariHTTPTimeout override: got %s, want %s", got, want)
+	}
+}
+
+func TestARIHTTPTimeout_NegativeValuesUseDefault(t *testing.T) {
+	// Negative values are nonsensical but should fall back to the
+	// default rather than producing an immediate-timeout client.
+	c := newARITestConnector(t, -1)
+	got := c.ariHTTPTimeout()
+	want := 15 * time.Second
+	if got != want {
+		t.Errorf("negative ariHTTPTimeout should fall back to default: got %s, want %s", got, want)
+	}
+}
+
+func TestARIHTTPTimeout_NilConfigSafeDefault(t *testing.T) {
+	// Defensive: a connector with nil config must not panic and must
+	// return the documented default. This is a guard for tests / DI
+	// callers that hand in a partially-built Connector.
+	c := &Connector{}
+	got := c.ariHTTPTimeout()
+	want := 15 * time.Second
+	if got != want {
+		t.Errorf("nil-config ariHTTPTimeout: got %s, want %s", got, want)
+	}
+}
@@ -0,0 +1,59 @@
+package domain
+
+import (
+	"reflect"
+	"testing"
+)
+
+// Bundle C / Audit M-015: pin the renewal-flow cardinality invariant.
+//
+// The audit's claim is "renewal flow assumes single profile per certificate;
+// no cardinality validation". Verified-already-clean: the certificate
+// struct holds exactly one CertificateProfileID and one RenewalPolicyID
+// as bare strings, not slices. There is literally no way to attach
+// multiple profiles or policies to a managed certificate without changing
+// the struct shape — which this test guards against.
+//
+// If a future schema change introduces N:N profiles or N:N renewal
+// policies, this test fails and forces the change to be paired with
+// a deliberate update of internal/service/renewal.go's iteration logic.
+
+func TestManagedCertificate_SingleProfileCardinality(t *testing.T) {
+	rt := reflect.TypeOf(ManagedCertificate{})
+	cases := []struct {
+		field    string
+		wantKind reflect.Kind
+	}{
+		{"CertificateProfileID", reflect.String},
+		{"RenewalPolicyID", reflect.String},
+		{"IssuerID", reflect.String},
+		{"OwnerID", reflect.String},
+	}
+	for _, tc := range cases {
+		t.Run(tc.field, func(t *testing.T) {
+			f, ok := rt.FieldByName(tc.field)
+			if !ok {
+				t.Fatalf("ManagedCertificate.%s field missing", tc.field)
+			}
+			if f.Type.Kind() != tc.wantKind {
+				t.Errorf("ManagedCertificate.%s kind = %s, want %s "+
+					"(M-015 cardinality pin: 1:1 relationships only — "+
+					"if you're changing this you must also update "+
+					"internal/service/renewal.go's profile/policy lookup)",
+					tc.field, f.Type.Kind(), tc.wantKind)
+			}
+		})
+	}
+}
+
+func TestRenewalPolicy_SingleProfileCardinality(t *testing.T) {
+	rt := reflect.TypeOf(RenewalPolicy{})
+	f, ok := rt.FieldByName("CertificateProfileID")
+	if !ok {
+		t.Fatal("RenewalPolicy.CertificateProfileID field missing")
+	}
+	if f.Type.Kind() != reflect.String {
+		t.Errorf("RenewalPolicy.CertificateProfileID kind = %s, want String "+
+			"(M-015 cardinality pin)", f.Type.Kind())
+	}
+}
@@ -271,6 +271,17 @@ type JobRepository interface {
 	// Failed; I-001's retry loop then auto-promotes eligible Failed jobs back to Pending.
 	// I-003 coverage-gap closure.
 	ListTimedOutAwaitingJobs(ctx context.Context, csrCutoff, approvalCutoff time.Time) ([]*domain.Job, error)
+
+	// ListJobsWithOfflineAgents returns jobs in Running status whose owning
+	// agent's last_heartbeat_at is older than agentCutoff. Bundle C / Audit
+	// M-016 (CWE-754): the existing ListTimedOutAwaitingJobs scope only
+	// covers AwaitingCSR / AwaitingApproval — jobs that were claimed by an
+	// agent and then stalled because the agent itself died (host crash,
+	// container OOM, network partition) sit in Running indefinitely with
+	// no recovery path. The reaper loop transitions these to Failed with
+	// reason "agent_offline" so I-001's retry loop can re-queue them on
+	// a healthy agent.
+	ListJobsWithOfflineAgents(ctx context.Context, agentCutoff time.Time) ([]*domain.Job, error)
 }

 // RenewalPolicyRepository defines operations for managing renewal policies.
@@ -607,6 +607,48 @@ func (r *JobRepository) ListTimedOutAwaitingJobs(ctx context.Context, csrCutoff,
 	return jobs, nil
 }

+// ListJobsWithOfflineAgents returns jobs in Running status whose owning
+// agent's last_heartbeat_at is older than agentCutoff. Bundle C / Audit
+// M-016 (CWE-754): closes the gap that ListTimedOutAwaitingJobs left
+// open — jobs claimed by an agent that subsequently dies sit in Running
+// indefinitely. The query joins jobs to agents on agent_id and filters
+// to (status='Running' AND agent.last_heartbeat_at < agentCutoff).
+//
+// Jobs without an agent_id (server-side keygen path) are intentionally
+// excluded: they have no agent to be "offline".
+func (r *JobRepository) ListJobsWithOfflineAgents(ctx context.Context, agentCutoff time.Time) ([]*domain.Job, error) {
+	rows, err := r.db.QueryContext(ctx, `
+		SELECT j.id, j.type, j.certificate_id, j.target_id, j.agent_id, j.status,
+		       j.attempts, j.max_attempts, j.last_error, j.scheduled_at,
+		       j.started_at, j.completed_at, j.created_at
+		FROM jobs j
+		JOIN agents a ON a.id = j.agent_id
+		WHERE j.status = $1
+		  AND j.agent_id IS NOT NULL
+		  AND a.last_heartbeat_at IS NOT NULL
+		  AND a.last_heartbeat_at < $2
+		ORDER BY j.started_at ASC NULLS FIRST
+	`, domain.JobStatusRunning, agentCutoff)
+
+	if err != nil {
+		return nil, fmt.Errorf("failed to query jobs with offline agents: %w", err)
+	}
+	defer rows.Close()
+
+	var jobs []*domain.Job
+	for rows.Next() {
+		job, err := scanJob(rows)
+		if err != nil {
+			return nil, err
+		}
+		jobs = append(jobs, job)
+	}
+	if err := rows.Err(); err != nil {
+		return nil, fmt.Errorf("error iterating offline-agent job rows: %w", err)
+	}
+	return jobs, nil
+}
+
 // scanJob scans a job from a row or rows
 func scanJob(scanner interface {
 	Scan(...interface{}) error
@@ -67,6 +67,12 @@ type CloudDiscoveryServicer interface {
 // JobReaperService defines the interface for job timeout reaping used by the scheduler.
 type JobReaperService interface {
 	ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error
+	// Bundle C / Audit M-016 (CWE-754): closes the gap left by ReapTimedOutJobs
+	// (which only handles AwaitingCSR / AwaitingApproval). Jobs in Running
+	// status whose owning agent has been silent for longer than agentTTL get
+	// transitioned to Failed with reason "agent_offline" so I-001's retry
+	// loop can re-queue them on a healthy agent.
+	ReapJobsWithOfflineAgents(ctx context.Context, agentTTL time.Duration) error
 }

 // Scheduler manages background jobs and periodic tasks for the certificate control plane.
@@ -97,6 +103,9 @@ type Scheduler struct {
 	healthCheckInterval           time.Duration
 	cloudDiscoveryInterval        time.Duration
 	jobTimeoutInterval            time.Duration
+	// agentOfflineJobTTL: per-tick threshold for reaping Running jobs whose
+	// owning agent has been silent. Bundle C / Audit M-016. Defaults below.
+	agentOfflineJobTTL time.Duration
 	awaitingCSRTimeout            time.Duration
 	awaitingApprovalTimeout       time.Duration

@@ -148,6 +157,9 @@ func NewScheduler(
 		healthCheckInterval:           60 * time.Second,
 		cloudDiscoveryInterval:        6 * time.Hour,
 		jobTimeoutInterval:            10 * time.Minute,
+		// 5 minutes is 5×agentHealthCheckInterval default of 1m; an agent
+		// must miss multiple heartbeats before its in-flight jobs are reaped.
+		agentOfflineJobTTL: 5 * time.Minute,
 	}
 }

@@ -233,6 +245,16 @@ func (s *Scheduler) SetJobReaperService(jr JobReaperService) {
 	s.jobReaper = jr
 }

+// SetAgentOfflineJobTTL sets the threshold past which a Running job whose
+// owning agent has gone silent is reaped to Failed. Bundle C / Audit M-016.
+// Zero or negative values are ignored (the default of 5 minutes is kept).
+func (s *Scheduler) SetAgentOfflineJobTTL(d time.Duration) {
+	if d <= 0 {
+		return
+	}
+	s.agentOfflineJobTTL = d
+}
+
 // SetJobTimeoutInterval sets the job timeout reaper tick interval (I-003).
 func (s *Scheduler) SetJobTimeoutInterval(d time.Duration) {
 	s.jobTimeoutInterval = d
@@ -503,6 +525,15 @@ func (s *Scheduler) jobTimeoutLoop(ctx context.Context) {
 // When no JobReaperService has been wired (e.g. in tests that don't exercise
 // I-003) the call is a safe no-op, preserving the always-on loop topology
 // described in I-003 without forcing every consumer to wire a reaper.
+//
+// Bundle C / Audit M-016: the reaping cycle now has TWO arms:
+//
+//  1. ReapTimedOutJobs handles AwaitingCSR / AwaitingApproval timeouts (I-003).
+//  2. ReapJobsWithOfflineAgents handles Running jobs whose owning agent has
+//     gone silent (M-016). Reuses the same agentHealthCheckTimeout as the
+//     mark-stale-agents-offline path for consistency: if the agent is judged
+//     offline by AgentService.MarkStaleAgentsOffline, its in-flight jobs
+//     should be reaped on the same cadence.
 func (s *Scheduler) runJobTimeout(ctx context.Context) {
 	if s.jobReaper == nil {
 		return
@@ -516,6 +547,20 @@ func (s *Scheduler) runJobTimeout(ctx context.Context) {
 	} else {
 		s.logger.Debug("job timeout reaper completed")
 	}
+	// Second arm: offline-agent reaper. Uses agentOfflineTimeout (defaults to
+	// 5 minutes — same value the agent-health-check path uses to flip an
+	// agent to Offline). A sensible default of 5×agentHealthCheckInterval
+	// catches agents that miss multiple consecutive heartbeats while leaving
+	// a single missed beat as a transient blip that does NOT reap.
+	offlineCtx, offlineCancel := context.WithTimeout(ctx, 2*time.Minute)
+	defer offlineCancel()
+	if err := s.jobReaper.ReapJobsWithOfflineAgents(offlineCtx, s.agentOfflineJobTTL); err != nil {
+		s.logger.Error("offline-agent job reaper failed",
+			"error", err,
+			"agent_offline_ttl", s.agentOfflineJobTTL.String())
+	} else {
+		s.logger.Debug("offline-agent job reaper completed")
+	}
 }

 // agentHealthCheckLoop runs every agentHealthCheckInterval and marks stale agents as offline.
@@ -165,6 +165,15 @@ func (m *mockJobService) ReapTimedOutJobs(ctx context.Context, csrTTL, approvalT
 	return nil
 }

+// ReapJobsWithOfflineAgents is the Bundle C / Audit M-016 stub. The
+// existing scheduler tests do not exercise this path; the offline-agent
+// reaper has its own end-to-end test in internal/service. Here we just
+// satisfy the JobReaperService interface so the scheduler tests still
+// compile.
+func (m *mockJobService) ReapJobsWithOfflineAgents(ctx context.Context, agentTTL time.Duration) error {
+	return nil
+}
+
 // mockAgentService is a mock implementation for testing.
 type mockAgentService struct {
 	mu          sync.Mutex
@@ -237,6 +237,58 @@ func (s *JobService) RetryFailedJobs(ctx context.Context, maxRetries int) error
 	return nil
 }

+// ReapJobsWithOfflineAgents transitions jobs in Running status whose
+// owning agent has been silent longer than agentTTL to Failed with
+// reason "agent_offline". Bundle C / Audit M-016 (CWE-754): closes the
+// gap left by ReapTimedOutJobs (which only handles AwaitingCSR /
+// AwaitingApproval). I-001's retry loop then auto-promotes eligible
+// Failed jobs back to Pending so a healthy agent can claim them.
+func (s *JobService) ReapJobsWithOfflineAgents(ctx context.Context, agentTTL time.Duration) error {
+	if agentTTL <= 0 {
+		return fmt.Errorf("ReapJobsWithOfflineAgents: agentTTL must be positive, got %s", agentTTL)
+	}
+	cutoff := time.Now().Add(-agentTTL)
+
+	staleJobs, err := s.jobRepo.ListJobsWithOfflineAgents(ctx, cutoff)
+	if err != nil {
+		return fmt.Errorf("list jobs with offline agents: %w", err)
+	}
+
+	var reaped int
+	for _, job := range staleJobs {
+		oldStatus := job.Status
+		errMsg := fmt.Sprintf("agent offline (no heartbeat for >%s)", agentTTL)
+
+		job.Status = domain.JobStatusFailed
+		job.LastError = &errMsg
+
+		if err := s.jobRepo.Update(ctx, job); err != nil {
+			s.logger.Error("failed to transition offline-agent job",
+				"job_id", job.ID, "agent_id", job.AgentID, "error", err)
+			continue
+		}
+
+		if s.auditService != nil {
+			if auditErr := s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
+				"job_offline_agent_reap", "job", job.ID,
+				map[string]interface{}{
+					"old_status":     string(oldStatus),
+					"new_status":     string(domain.JobStatusFailed),
+					"timeout_reason": "agent_offline",
+					"agent_id":       job.AgentID,
+				}); auditErr != nil {
+				s.logger.Error("failed to record offline-agent reap audit event",
+					"job_id", job.ID, "error", auditErr)
+			}
+		}
+		reaped++
+	}
+
+	s.logger.Info("offline-agent job reaper completed",
+		"reaped", reaped, "total_stale", len(staleJobs))
+	return nil
+}
+
 // ReapTimedOutJobs transitions jobs stuck in AwaitingCSR or AwaitingApproval
 // to Failed if they've exceeded their TTL. I-001's retry loop then auto-promotes
 // eligible Failed jobs back to Pending (closes coverage gap I-003).
@@ -0,0 +1,169 @@
+package service
+
+import (
+	"context"
+	"errors"
+	"io"
+	"log/slog"
+	"strings"
+	"testing"
+	"time"
+
+	"github.com/shankar0123/certctl/internal/domain"
+)
+
+// Bundle C / Audit M-016 (CWE-754): regression suite for the new
+// ReapJobsWithOfflineAgents path. Pre-bundle the reaper only handled
+// AwaitingCSR / AwaitingApproval timeouts; jobs claimed by an agent
+// that subsequently dies sat in Running indefinitely. These tests pin
+// the new behavior end-to-end through the JobService → mockJobRepo
+// boundary.
+
+func newOfflineReaperService(t *testing.T) (*JobService, *mockJobRepo, *mockAuditRepo) {
+	t.Helper()
+	jobRepo := &mockJobRepo{
+		Jobs:   map[string]*domain.Job{},
+		Agents: map[string]*domain.Agent{},
+	}
+	auditRepo := newMockAuditRepository()
+	auditService := NewAuditService(auditRepo)
+	svc := NewJobService(jobRepo, nil, nil, nil, nil, slog.New(slog.NewTextHandler(io.Discard, nil)))
+	svc.SetAuditService(auditService)
+	return svc, jobRepo, auditRepo
+}
+
+func mkRunningJob(id, agentID string) *domain.Job {
+	a := agentID
+	now := time.Now()
+	return &domain.Job{
+		ID:       id,
+		AgentID:  &a,
+		Status:   domain.JobStatusRunning,
+		CreatedAt: now.Add(-2 * time.Hour),
+	}
+}
+
+func mkAgentWithHeartbeat(id string, hbAge time.Duration) *domain.Agent {
+	hb := time.Now().Add(-hbAge)
+	return &domain.Agent{
+		ID:              id,
+		Name:            id,
+		LastHeartbeatAt: &hb,
+	}
+}
+
+func TestReapJobsWithOfflineAgents_FlipsRunningToFailed(t *testing.T) {
+	svc, repo, _ := newOfflineReaperService(t)
+
+	repo.Agents["agt-stale"] = mkAgentWithHeartbeat("agt-stale", 30*time.Minute)
+	repo.Agents["agt-fresh"] = mkAgentWithHeartbeat("agt-fresh", 1*time.Minute)
+	repo.Jobs["j-stale"] = mkRunningJob("j-stale", "agt-stale")
+	repo.Jobs["j-fresh"] = mkRunningJob("j-fresh", "agt-fresh")
+
+	if err := svc.ReapJobsWithOfflineAgents(context.Background(), 10*time.Minute); err != nil {
+		t.Fatalf("reaper returned error: %v", err)
+	}
+
+	if got := repo.Jobs["j-stale"].Status; got != domain.JobStatusFailed {
+		t.Errorf("stale-agent job status = %s, want Failed", got)
+	}
+	if got := repo.Jobs["j-fresh"].Status; got != domain.JobStatusRunning {
+		t.Errorf("fresh-agent job status = %s, want Running (must NOT be reaped)", got)
+	}
+
+	stale := repo.Jobs["j-stale"]
+	if stale.LastError == nil || !strings.Contains(*stale.LastError, "agent offline") {
+		t.Errorf("stale job LastError must cite agent offline; got: %v", stale.LastError)
+	}
+}
+
+func TestReapJobsWithOfflineAgents_SkipsServerKeygenJobs(t *testing.T) {
+	// Jobs without an agent_id (server-side keygen) must NOT be reaped
+	// by this path — they have no agent to be "offline".
+	svc, repo, _ := newOfflineReaperService(t)
+	noAgent := &domain.Job{
+		ID:        "j-server",
+		Status:    domain.JobStatusRunning,
+		CreatedAt: time.Now().Add(-time.Hour),
+	}
+	repo.Jobs["j-server"] = noAgent
+
+	if err := svc.ReapJobsWithOfflineAgents(context.Background(), 1*time.Minute); err != nil {
+		t.Fatalf("reaper returned error: %v", err)
+	}
+	if got := repo.Jobs["j-server"].Status; got != domain.JobStatusRunning {
+		t.Errorf("server-keygen job (no agent_id) status = %s, want Running", got)
+	}
+}
+
+func TestReapJobsWithOfflineAgents_SkipsNonRunningJobs(t *testing.T) {
+	// Pending / AwaitingCSR / AwaitingApproval jobs are NOT in scope —
+	// they're handled by ReapTimedOutJobs (I-003) or ClaimPendingJobs.
+	svc, repo, _ := newOfflineReaperService(t)
+	repo.Agents["agt-stale"] = mkAgentWithHeartbeat("agt-stale", 1*time.Hour)
+	repo.Jobs["j-pending"] = func() *domain.Job {
+		j := mkRunningJob("j-pending", "agt-stale")
+		j.Status = domain.JobStatusPending
+		return j
+	}()
+
+	if err := svc.ReapJobsWithOfflineAgents(context.Background(), 1*time.Minute); err != nil {
+		t.Fatalf("reaper returned error: %v", err)
+	}
+	if got := repo.Jobs["j-pending"].Status; got != domain.JobStatusPending {
+		t.Errorf("Pending job status = %s, want Pending (out of scope for offline-agent reaper)", got)
+	}
+}
+
+func TestReapJobsWithOfflineAgents_RejectsNonPositiveTTL(t *testing.T) {
+	svc, _, _ := newOfflineReaperService(t)
+	if err := svc.ReapJobsWithOfflineAgents(context.Background(), 0); err == nil {
+		t.Error("expected error for zero TTL — fail-loud guard against misconfig")
+	}
+	if err := svc.ReapJobsWithOfflineAgents(context.Background(), -time.Hour); err == nil {
+		t.Error("expected error for negative TTL — fail-loud guard against misconfig")
+	}
+}
+
+func TestReapJobsWithOfflineAgents_PropagatesRepoError(t *testing.T) {
+	svc, repo, _ := newOfflineReaperService(t)
+	repo.ListOfflineAgentJobsErr = errors.New("simulated db down")
+
+	err := svc.ReapJobsWithOfflineAgents(context.Background(), 5*time.Minute)
+	if err == nil {
+		t.Fatal("expected error to propagate from repo")
+	}
+	if !strings.Contains(err.Error(), "simulated db down") {
+		t.Errorf("expected wrapped repo error, got: %v", err)
+	}
+}
+
+func TestReapJobsWithOfflineAgents_RecordsAuditEvent(t *testing.T) {
+	svc, repo, audit := newOfflineReaperService(t)
+	repo.Agents["agt-stale"] = mkAgentWithHeartbeat("agt-stale", 30*time.Minute)
+	repo.Jobs["j-stale"] = mkRunningJob("j-stale", "agt-stale")
+
+	if err := svc.ReapJobsWithOfflineAgents(context.Background(), 5*time.Minute); err != nil {
+		t.Fatalf("reaper: %v", err)
+	}
+
+	audit.mu.Lock()
+	events := append([]*domain.AuditEvent(nil), audit.Events...)
+	audit.mu.Unlock()
+	var found *domain.AuditEvent
+	for i := range events {
+		if events[i].Action == "job_offline_agent_reap" {
+			found = events[i]
+			break
+		}
+	}
+	if found == nil {
+		t.Fatal("expected job_offline_agent_reap audit event, got none")
+	}
+	if found.Actor != "system" {
+		t.Errorf("audit Actor = %q, want system", found.Actor)
+	}
+	if found.ResourceType != "job" || found.ResourceID != "j-stale" {
+		t.Errorf("audit resource binding wrong: %s/%s", found.ResourceType, found.ResourceID)
+	}
+}
@@ -157,20 +157,22 @@ func (m *mockCertRepo) AddCert(cert *domain.ManagedCertificate) {

 // mockJobRepo is a test implementation of JobRepository
 type mockJobRepo struct {
-	mu                sync.Mutex
-	Jobs              map[string]*domain.Job
-	StatusUpdates     map[string]domain.JobStatus
-	CreateErr         error
-	UpdateErr         error
-	UpdateErrorByID   map[string]error
-	UpdateErrorByIDMu sync.Mutex
-	UpdateStatusErr   error
-	GetErr            error
-	ListErr           error
-	ListByStatusErr   error
-	DeleteErr         error
-	ListTimedOutErr   error
-	Updated           []*domain.Job
+	mu                      sync.Mutex
+	Jobs                    map[string]*domain.Job
+	Agents                  map[string]*domain.Agent
+	StatusUpdates           map[string]domain.JobStatus
+	CreateErr               error
+	UpdateErr               error
+	UpdateErrorByID         map[string]error
+	UpdateErrorByIDMu       sync.Mutex
+	UpdateStatusErr         error
+	GetErr                  error
+	ListErr                 error
+	ListByStatusErr         error
+	DeleteErr               error
+	ListTimedOutErr         error
+	ListOfflineAgentJobsErr error
+	Updated                 []*domain.Job
 }

 func (m *mockJobRepo) List(ctx context.Context) ([]*domain.Job, error) {
@@ -387,6 +389,34 @@ func (m *mockJobRepo) ListTimedOutAwaitingJobs(ctx context.Context, csrCutoff, a
 	return jobs, nil
 }

+// ListJobsWithOfflineAgents returns Running jobs whose owning agent's
+// last_heartbeat_at is older than agentCutoff. The mock walks Jobs +
+// Agents the same way the real repo does. Bundle C / Audit M-016.
+func (m *mockJobRepo) ListJobsWithOfflineAgents(ctx context.Context, agentCutoff time.Time) ([]*domain.Job, error) {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.ListOfflineAgentJobsErr != nil {
+		return nil, m.ListOfflineAgentJobsErr
+	}
+	var jobs []*domain.Job
+	for _, j := range m.Jobs {
+		if j.Status != domain.JobStatusRunning {
+			continue
+		}
+		if j.AgentID == nil || *j.AgentID == "" {
+			continue
+		}
+		ag, ok := m.Agents[*j.AgentID]
+		if !ok || ag.LastHeartbeatAt == nil {
+			continue
+		}
+		if ag.LastHeartbeatAt.Before(agentCutoff) {
+			jobs = append(jobs, j)
+		}
+	}
+	return jobs, nil
+}
+
 func (m *mockJobRepo) AddJob(job *domain.Job) {
 	m.mu.Lock()
 	defer m.mu.Unlock()
@@ -81,6 +81,11 @@ func (m *mockVerificationJobRepo) ListTimedOutAwaitingJobs(ctx context.Context,
 	return nil, nil
 }

+// Bundle C / Audit M-016: stub for the new offline-agent reaper repo method.
+func (m *mockVerificationJobRepo) ListJobsWithOfflineAgents(ctx context.Context, agentCutoff time.Time) ([]*domain.Job, error) {
+	return nil, nil
+}
+
 // newVerificationTestService creates a VerificationService wired with test doubles.
 func newVerificationTestService(jobs map[string]*domain.Job, jobRepoErr error) (*VerificationService, *mockVerificationJobRepo, *mockAuditRepo) {
 	jobRepo := &mockVerificationJobRepo{jobs: jobs, err: jobRepoErr}
@@ -24,6 +24,14 @@
 -- can DROP it by name without ambiguity; un-named CHECK constraints use
 -- a synthesized PostgreSQL name that varies by environment.

+-- Bundle C / Audit M-006 (CWE-913): idempotency guard. Drop-if-exists
+-- before ADD so a re-run of this migration against a partially-applied
+-- DB doesn't fail with "constraint already exists". Mirrors the down
+-- migration's DROP CONSTRAINT IF EXISTS shape and the M-7 idempotent-
+-- index idiom.
+ALTER TABLE policy_violations
+    DROP CONSTRAINT IF EXISTS policy_violations_severity_check;
+
 ALTER TABLE policy_violations
    ADD CONSTRAINT policy_violations_severity_check
    CHECK (severity IN ('Warning', 'Error', 'Critical'));