From 04c7eca61598eb00a6f442dfcce3e668697ad17b Mon Sep 17 00:00:00 2001
From: shankar0123 <skreddy040@gmail.com>
Date: Mon, 20 Apr 2026 02:51:34 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20reconcile=20scheduler=20topology=20acro?=
 =?UTF-8?q?ss=20sibling=20docs=20(7=20=E2=86=92=2012=20loops)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The authoritative 12-loop table lives at docs/architecture.md:522-534 (committed via
the I-001/I-003/I-005 + M48/M50 milestone commits). This change brings five sibling
docs, plus the inline prose of docs/architecture.md itself, into parity with that
table. Every surface (user-facing features reference, SOC 2 compliance mapping,
connectors guide, advanced demo architecture diagram, testing guide, and inline
architecture prose) now reflects the same 8 always-on + 4 opt-in topology.

Touches:
- docs/architecture.md: 2 inline ordinal references (9th / 8th loop) replaced with
  descriptive names (opt-in cloud discovery / opt-in endpoint health) to prevent
  future ordinal rot; the cloud discovery reference also points at the
  authoritative table.
- docs/features.md: metric row (7 → 12), inline 9th-loop reference replaced with a
  descriptive name, and full scheduler table expanded with Always-on and Env Var
  columns plus I-001/I-003/I-005 and M48/M50 refs.
- docs/compliance-soc2.md: background scheduler monitoring bullets expanded to list
  all 12 loops with env vars + I-series refs; table row updated with 8 always-on +
  4 opt-in summary.
- docs/connectors.md: three inline ordinals (7th/6th/9th loop) replaced with
  descriptive names, cross-linked to architecture.md.
- docs/demo-advanced.md: Mermaid SCHED node label updated from '7 background loops'
  to '12 background loops (8 always-on + 4 opt-in)'.
- docs/testing-guide.md: Test 20.1.1 header + grep pattern expanded to include
  job-retry / job-timeout / notification-retry / digest / endpoint-health /
  cloud-discovery loops (head limit raised from 20 to 30 lines); sign-off chart
  row label updated.

Pure documentation reconciliation. No code changes. Master HEAD pre-commit: 6e646e0.
---
 docs/architecture.md    |  4 ++--
 docs/compliance-soc2.md | 25 +++++++++++++++----------
 docs/connectors.md      |  6 +++---
 docs/demo-advanced.md   |  2 +-
 docs/features.md        | 29 +++++++++++++++++------------
 docs/testing-guide.md   |  6 +++---
 6 files changed, 41 insertions(+), 31 deletions(-)
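
One way to check that no ordinal loop references remain after applying this patch
is a quick grep over the docs tree. The sketch below is illustrative only: the
function name `find_ordinal_loops` and the regular expression are assumptions, not
part of the repository, and the pattern only catches digit ordinals such as
"9th scheduler loop" or "6th loop".

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): list lines that still refer to a
# scheduler loop by ordinal, e.g. "9th scheduler loop" or "(6th loop)".
# Prints matching lines as path:line:text; prints nothing when none are found.
find_ordinal_loops() {
  # -r recurse, -n show line numbers, -E extended regex, -i ignore case.
  # "|| true" keeps the exit status at 0 when grep finds no matches.
  grep -rnEi '[0-9]+(st|nd|rd|th) (scheduler )?loop' "${1:-docs}" || true
}
```

Running `find_ordinal_loops docs` from the repository root would be expected to
print nothing once the patch is applied, assuming no ordinal references exist
outside the lines touched here.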

diff --git a/docs/architecture.md b/docs/architecture.md
index f039354..65cf1cc 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -1092,7 +1092,7 @@ flowchart TB
 
 1. **Pluggable sources** — Each cloud provider implements the `DiscoverySource` interface (Name, Type, Discover, ValidateConfig). Three built-in sources: AWS Secrets Manager, Azure Key Vault, GCP Secret Manager
 2. **CloudDiscoveryService orchestrator** — Iterates registered sources, calls `Discover()` on each, feeds reports into `ProcessDiscoveryReport()`. Errors from one source don't prevent other sources from running
-3. **Scheduler integration** — 9th scheduler loop (6h default), runs immediately on startup, `atomic.Bool` idempotency guard
+3. **Scheduler integration** — opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` 12-loop topology), runs immediately on startup, `atomic.Bool` idempotency guard
 4. **Sentinel agents** — Each source uses its own sentinel agent ID (`cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) for dedup and triage filtering
 5. **Source path format** — `aws-sm://{region}/{secret}`, `azure-kv://{cert-name}/{version}`, `gcp-sm://{project}/{secret}`
 6. **No new schema** — Reuses existing `discovered_certificates` and `discovery_scans` tables. Sentinel agent IDs leverage existing `(fingerprint_sha256, agent_id, source_path)` dedup constraint
@@ -1114,7 +1114,7 @@ This data flow is pull-based and non-blocking. Agents discover at their own pace
 
 Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.
 
-**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated 8th scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
+**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated opt-in endpoint health scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
 
 **State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.
 
diff --git a/docs/compliance-soc2.md b/docs/compliance-soc2.md
index 1d56f5a..c63e4e8 100644
--- a/docs/compliance-soc2.md
+++ b/docs/compliance-soc2.md
@@ -189,15 +189,20 @@ Each section includes:
 
 - **Health Endpoint** — `GET /health` returns 200 OK with service status. Consumed by Docker health checks and Kubernetes probes.
 - **Readiness Endpoint** — `GET /ready` returns 200 OK when the database is connected and migrations are applied.
-- **Background Scheduler Monitoring** — 7 background loops run on a fixed schedule:
-  - Renewal loop: every 1 hour, scans for certificates approaching renewal threshold
-  - Job processor loop: every 30 seconds, picks up pending/waiting jobs and advances their state
-  - Health check loop: every 2 minutes, pings agents to detect downtime
-  - Notification dispatcher loop: every 1 minute, sends queued alerts
-  - Short-lived cert expiry loop: every 30 seconds, marks expired short-lived credentials
-  - Network scanner loop: every 6 hours, scans enabled TLS endpoints for certificate discovery
-  - Digest emailer loop: every 24 hours, sends scheduled certificate digest email to configured recipients
-  Each loop includes error handling and logs failures via structured slog.
+- **Background Scheduler Monitoring** — 12 background loops (8 always-on + 4 opt-in) run on a fixed schedule. Authoritative topology in `docs/architecture.md`:
+  - Renewal loop (always-on, 1 hour): scans for certificates approaching renewal threshold
+  - Job processor loop (always-on, 30 seconds): picks up pending/waiting jobs and advances their state
+  - Job retry loop (always-on, 5 minutes, `CERTCTL_SCHEDULER_RETRY_INTERVAL`): retries Failed jobs (I-001)
+  - Job timeout reaper loop (always-on, 10 minutes, `CERTCTL_JOB_TIMEOUT_INTERVAL`): fails AwaitingCSR/AwaitingApproval jobs past timeout (I-003)
+  - Agent health check loop (always-on, 2 minutes): pings agents to detect downtime
+  - Notification dispatcher loop (always-on, 1 minute): sends queued alerts
+  - Notification retry loop (always-on, 2 minutes, `CERTCTL_NOTIFICATION_RETRY_INTERVAL`): exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005)
+  - Short-lived cert expiry loop (always-on, 30 seconds): marks expired short-lived credentials
+  - Network scanner loop (opt-in, 6 hours, `CERTCTL_NETWORK_SCAN_ENABLED`): scans enabled TLS endpoints for certificate discovery
+  - Digest emailer loop (opt-in, 24 hours, `CERTCTL_DIGEST_INTERVAL`): sends scheduled certificate digest email to configured recipients
+  - Endpoint health loop (opt-in, 60 seconds, `CERTCTL_HEALTH_CHECK_INTERVAL`): continuous TLS health probes (M48)
+  - Cloud discovery loop (opt-in, 6 hours, `CERTCTL_CLOUD_DISCOVERY_INTERVAL`): cloud secret manager certificate discovery (M50)
+  Each loop includes `atomic.Bool` idempotency guards, error handling, and structured slog failure logs.
 - **Metrics Endpoints** — Two formats for monitoring integration:
   - `GET /api/v1/metrics` — JSON object with gauges, counters, and uptime for custom dashboards
   - `GET /api/v1/metrics/prometheus` — Prometheus exposition format (`text/plain; version=0.0.4`) for native scraping by Prometheus, Grafana Agent, Datadog, and other OpenMetrics-compatible collectors
@@ -459,7 +464,7 @@ Each section includes:
 | | Metrics JSON Endpoint | `GET /api/v1/metrics` (gauges, counters, uptime) | ✅ | ✅ | Set thresholds, configure alerting |
 | | Stats API (time-series) | `GET /api/v1/stats/*` (summary, status, expiration, jobs, issuance) | ✅ | ✅ | Integrate into dashboards, SLO tracking |
 | | Structured Logging | `slog` middleware with request IDs | ✅ | ✅ | Aggregate logs to SIEM, define retention policy |
-| | Background Scheduler | 7 loops (renewal 1h, jobs 30s, health 2m, notifications 1m, short-lived 30s, network scan 6h, digest 24h) | ✅ | ✅ | Alert on scheduler loop failures |
+| | Background Scheduler | 12 loops (8 always-on: renewal 1h, jobs 30s, job retry 5m I-001, job timeout 10m I-003, health 2m, notifications 1m, notif retry 2m I-005, short-lived 30s; 4 opt-in: network scan 6h, digest 24h, endpoint health 60s M48, cloud discovery 6h M50) | ✅ | ✅ | Alert on scheduler loop failures |
 | **CC7.2** Anomaly Detection | Immutable API Audit Trail | `internal/api/middleware/audit.go`, `GET /api/v1/audit` | ✅ | Enhanced (SIEM export) | Integrate into SIEM, search for anomalies, archive long-term |
 | | Expiration Threshold Alerting | Configurable per-policy (default 30/14/7/0 days) | ✅ | ✅ | Configure thresholds, integrate notifications |
 | | Status Auto-Transitions | Active → Expiring (30d) → Expired (0d) | ✅ | ✅ | Monitor status changes in audit trail |
diff --git a/docs/connectors.md b/docs/connectors.md
index f6880b1..b9c5846 100644
--- a/docs/connectors.md
+++ b/docs/connectors.md
@@ -1126,7 +1126,7 @@ The digest HTML template includes:
 - Expiring certificates table (color-coded by urgency: 7d, 14d, 30d)
 - Auto-refresh and responsive email layout
 
-**Scheduler Integration:** The 7th scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency.
+**Scheduler Integration:** The opt-in digest scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency. See `docs/architecture.md` for the full scheduler topology (12 loops, 8 always-on + 4 opt-in).
 
 Configuration:
 
@@ -1389,7 +1389,7 @@ curl -s -X DELETE http://localhost:8443/api/v1/network-scan-targets/nst-dmz
 
 ### Scheduler Integration
 
-When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs a 6th scheduler loop (alongside renewal, jobs, health, notifications, and short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health.
+When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs the opt-in network scanner scheduler loop alongside the always-on loops (renewal, jobs, job retry, job timeout, agent health, notifications, notification retry, short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health. See `docs/architecture.md` for the full 12-loop scheduler topology.
 
 ### Use Cases
 
@@ -1447,7 +1447,7 @@ Source path format: `gcp-sm://{project}/{secret-name}`. Sentinel agent: `cloud-g
 
 ### Cloud Discovery Scheduler
 
-All enabled cloud sources run on a shared scheduler loop (9th loop). The interval is configurable:
+All enabled cloud sources run on a shared opt-in cloud discovery scheduler loop (see `docs/architecture.md` for the full 12-loop scheduler topology). The interval is configurable:
 
 | Variable | Description | Default |
 |---|---|---|
diff --git a/docs/demo-advanced.md b/docs/demo-advanced.md
index 55b3563..b33851a 100644
--- a/docs/demo-advanced.md
+++ b/docs/demo-advanced.md
@@ -1155,7 +1155,7 @@ flowchart TB
         API["REST API\nGo net/http"]
         SVC["Service Layer\nBusiness Logic"]
         REPO["Repository Layer\ndatabase/sql + lib/pq"]
-        SCHED["Scheduler\n7 background loops"]
+        SCHED["Scheduler\n12 background loops\n(8 always-on + 4 opt-in)"]
         CONN["Connector Registry\nIssuer + Target + Notifier"]
     end
 
diff --git a/docs/features.md b/docs/features.md
index 3f31aae..50cb934 100644
--- a/docs/features.md
+++ b/docs/features.md
@@ -16,7 +16,7 @@ Complete reference of every feature shipped in certctl through v2.1.0 (April 202
 | Target connectors | 14 |
 | Notifier connectors | 6 channels |
 | Database tables | 21 (across 10 migrations) |
-| Background scheduler loops | 7 |
+| Background scheduler loops | 12 (8 always-on + 4 opt-in) |
 | Web dashboard pages | 24 |
 | Test functions | 1850+ |
 | Supported platforms | linux/amd64, linux/arm64, darwin/amd64, darwin/arm64 |
@@ -903,7 +903,7 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor
 
 <!-- Source: internal/connector/discovery/awssm/, azurekv/, gcpsm/, internal/service/cloud_discovery.go -->
 
-Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the 9th scheduler loop (6h default).
+Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` for the full 12-loop scheduler topology).
 
 **Supported sources:**
 
@@ -1097,17 +1097,22 @@ Single SQL `UNION` query replaces the previous "fetch all, filter in Go" approac
 
 <!-- Source: internal/scheduler/scheduler.go -->
 
-7 background loops, each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown.
+12 background loops (8 always-on + 4 opt-in), each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown. Authoritative topology table lives in `docs/architecture.md`.
 
-| Loop | Default Interval | Description |
-|---|---|---|
-| Renewal check | 1 hour | Check expiring certs, query ARI, create renewal jobs |
-| Job processor | 30 seconds | Process pending jobs |
-| Agent health check | 2 minutes | Check agent heartbeat staleness |
-| Notification processor | 1 minute | Send queued notifications |
-| Short-lived expiry check | 30 seconds | Mark short-lived certs expired |
-| Network scan | 6 hours | Run network discovery scans |
-| Digest | 24 hours | Send certificate digest email (does not run on startup) |
+| Loop | Default Interval | Always-on | Env Var | Description |
+|---|---|---|---|---|
+| Renewal check | 1 hour | Yes | — | Check expiring certs, query ARI, create renewal jobs |
+| Job processor | 30 seconds | Yes | — | Process pending jobs |
+| Job retry | 5 minutes | Yes | `CERTCTL_SCHEDULER_RETRY_INTERVAL` | Retry Failed jobs (I-001) |
+| Job timeout reaper | 10 minutes | Yes | `CERTCTL_JOB_TIMEOUT_INTERVAL` | Fail AwaitingCSR/AwaitingApproval jobs past timeout (I-003) |
+| Agent health check | 2 minutes | Yes | — | Check agent heartbeat staleness |
+| Notification processor | 1 minute | Yes | — | Send queued notifications |
+| Notification retry | 2 minutes | Yes | `CERTCTL_NOTIFICATION_RETRY_INTERVAL` | Exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005) |
+| Short-lived expiry check | 30 seconds | Yes | — | Mark short-lived certs expired |
+| Network scan | 6 hours | Opt-in | `CERTCTL_NETWORK_SCAN_ENABLED` | Run network discovery scans |
+| Digest | 24 hours | Opt-in | `CERTCTL_DIGEST_INTERVAL` | Send certificate digest email (does not run on startup) |
+| Endpoint health | 60 seconds | Opt-in | `CERTCTL_HEALTH_CHECK_INTERVAL` | Continuous TLS health probes (M48) |
+| Cloud discovery | 6 hours | Opt-in | `CERTCTL_CLOUD_DISCOVERY_INTERVAL` | Cloud secret manager certificate discovery (M50) |
 
 ---
 
diff --git a/docs/testing-guide.md b/docs/testing-guide.md
index a1d4302..d5fff51 100644
--- a/docs/testing-guide.md
+++ b/docs/testing-guide.md
@@ -5002,10 +5002,10 @@ curl -s -w "HTTP %{http_code}\n" -X DELETE -H "$AUTH" "$SERVER/api/v1/audit/$EVE
 
 > **Tip:** Open a second terminal with `docker compose logs -f certctl-server` to watch scheduler log output in real time.
 
-**Test 20.1.1 — Scheduler startup: all 7 loops registered**
+**Test 20.1.1 — Scheduler startup: all 12 loops registered**
 
 ```bash
-docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|health check\|notification\|short-lived\|network scan" | head -20
+docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|job retry\|job timeout\|health check\|notification\|notification retry\|short-lived\|network scan\|digest\|endpoint health\|cloud discovery" | head -30
 ```
 
 **What:** Checks server startup logs for scheduler loop registration.
@@ -7340,7 +7340,7 @@ These must be green before starting manual QA:
 
 | Test | Description | Method | Pass? | Date | Notes |
 |------|-------------|--------|-------|------|-------|
-| 20.1.1 | Scheduler startup: all 7 loops registered | Manual | ☐ |  |  |
+| 20.1.1 | Scheduler startup: all 12 loops registered | Manual | ☐ |  |  |
 | 20.1.2 | Job processor loop fires (30s interval) | Manual | ☐ |  |  |
 | 20.1.3 | Agent health check marks offline (2m interval) | Manual | ☐ |  |  |
 | 20.1.4 | Notification processor fires (1m interval) | Manual | ☐ |  |  |