docs: reconcile scheduler topology across sibling docs (7 → 12 loops)

Authoritative 12-loop table lives at docs/architecture.md:522-534 (committed via the I-001/I-003/I-005 + M48/M50 milestone commits). This change brings six sibling docs into parity with that table so every surface (user-facing features reference, SOC 2 compliance mapping, connectors guide, advanced demo architecture diagram, testing guide, and in-line architecture prose) reflects the same 8 always-on + 4 opt-in topology.

Touches:

- docs/architecture.md: 2 inline ordinal references (9th / 8th loop) replaced with descriptive names (opt-in cloud discovery / opt-in endpoint health), cross-linked to the authoritative table to prevent future ordinal rot.
- docs/features.md: metric row (7 → 12), inline reference to 9th loop, and full scheduler table expanded to include Always-on column + env vars + I-001/I-003/I-005 refs.
- docs/compliance-soc2.md: background scheduler monitoring bullets expanded to list all 12 loops with env vars + I-series refs; table row updated with 8 always-on + 4 opt-in summary.
- docs/connectors.md: three inline ordinals (7th/6th/9th loop) replaced with descriptive names, cross-linked to architecture.md.
- docs/demo-advanced.md: Mermaid SCHED node label updated from '7 background loops' to '12 background loops (8 always-on + 4 opt-in)'.
- docs/testing-guide.md: Test 20.1.1 header + grep pattern expanded to include job-retry / job-timeout / notification-retry / digest / endpoint-health / cloud-discovery loops; sign-off chart row label updated.

Pure documentation reconciliation. No code changes. Master HEAD pre-commit: 25131a3.
2026-08-03 15:28:27 +00:00 · 2026-04-20 02:51:34 +00:00
parent 25131a377d
commit c38ba40200
6 changed files with 41 additions and 31 deletions
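
Reviewer aid, not part of the commit: the 7 → 12 reconciliation rests on the arithmetic "8 always-on + 4 opt-in". The sketch below tabulates the loops exactly as the expanded docs/features.md table in this diff lists them and counts each group. `LoopSpec`, `Topology`, and `Count` are hypothetical names chosen for illustration; they are not taken from `internal/scheduler/scheduler.go`.

```go
package main

import (
	"fmt"
	"time"
)

// LoopSpec is a hypothetical record for one row of the scheduler table.
type LoopSpec struct {
	Name     string
	Interval time.Duration // documented default interval
	AlwaysOn bool          // false means opt-in
	EnvVar   string        // empty where the docs list none
}

// Topology mirrors the 12 rows documented in this change.
var Topology = []LoopSpec{
	{"Renewal check", time.Hour, true, ""},
	{"Job processor", 30 * time.Second, true, ""},
	{"Job retry", 5 * time.Minute, true, "CERTCTL_SCHEDULER_RETRY_INTERVAL"},
	{"Job timeout reaper", 10 * time.Minute, true, "CERTCTL_JOB_TIMEOUT_INTERVAL"},
	{"Agent health check", 2 * time.Minute, true, ""},
	{"Notification processor", time.Minute, true, ""},
	{"Notification retry", 2 * time.Minute, true, "CERTCTL_NOTIFICATION_RETRY_INTERVAL"},
	{"Short-lived expiry check", 30 * time.Second, true, ""},
	{"Network scan", 6 * time.Hour, false, "CERTCTL_NETWORK_SCAN_ENABLED"},
	{"Digest", 24 * time.Hour, false, "CERTCTL_DIGEST_INTERVAL"},
	{"Endpoint health", 60 * time.Second, false, "CERTCTL_HEALTH_CHECK_INTERVAL"},
	{"Cloud discovery", 6 * time.Hour, false, "CERTCTL_CLOUD_DISCOVERY_INTERVAL"},
}

// Count returns how many loops are always-on and how many are opt-in.
func Count(specs []LoopSpec) (alwaysOn, optIn int) {
	for _, s := range specs {
		if s.AlwaysOn {
			alwaysOn++
		} else {
			optIn++
		}
	}
	return alwaysOn, optIn
}

func main() {
	a, o := Count(Topology)
	fmt.Printf("%d loops (%d always-on + %d opt-in)\n", len(Topology), a, o)
	// prints "12 loops (8 always-on + 4 opt-in)"
}
```

A table like this could also back a doc-lint check that fails when a sibling doc quotes a different loop count, which is the drift this commit cleans up by hand.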
@@ -1092,7 +1092,7 @@ flowchart TB

 1. **Pluggable sources** — Each cloud provider implements the `DiscoverySource` interface (Name, Type, Discover, ValidateConfig). Three built-in sources: AWS Secrets Manager, Azure Key Vault, GCP Secret Manager
 2. **CloudDiscoveryService orchestrator** — Iterates registered sources, calls `Discover()` on each, feeds reports into `ProcessDiscoveryReport()`. Errors from one source don't prevent other sources from running
-3. **Scheduler integration** — 9th scheduler loop (6h default), runs immediately on startup, `atomic.Bool` idempotency guard
+3. **Scheduler integration** — opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` 12-loop topology), runs immediately on startup, `atomic.Bool` idempotency guard
 4. **Sentinel agents** — Each source uses its own sentinel agent ID (`cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) for dedup and triage filtering
 5. **Source path format** — `aws-sm://{region}/{secret}`, `azure-kv://{cert-name}/{version}`, `gcp-sm://{project}/{secret}`
 6. **No new schema** — Reuses existing `discovered_certificates` and `discovery_scans` tables. Sentinel agent IDs leverage existing `(fingerprint_sha256, agent_id, source_path)` dedup constraint
@@ -1114,7 +1114,7 @@ This data flow is pull-based and non-blocking. Agents discover at their own pace

 Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.

-**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated 8th scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
+**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated opt-in endpoint health scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).

 **State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.
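
The transitions described above can be sketched as a small Go state machine. This is a minimal illustration under stated assumptions, not the `HealthCheckService` implementation: the type and method names (`State`, `Thresholds`, `Endpoint`, `Record`) are hypothetical, only the default thresholds (2 and 5 consecutive failures) come from the text, and the separate `cert_mismatch` status is left out.

```go
package main

import "fmt"

// State is a hypothetical representation of an endpoint's health state.
type State string

const (
	Healthy  State = "healthy"
	Degraded State = "degraded"
	Down     State = "down"
)

// Thresholds holds the consecutive-failure counts that trigger each state.
type Thresholds struct {
	Degraded int // documented default: 2
	Down     int // documented default: 5
}

// Endpoint tracks the current state and the consecutive failure counter.
type Endpoint struct {
	State    State
	Failures int
}

// Record applies one probe result and reports whether the state changed.
// A success resets the counter and returns the endpoint to Healthy.
func (e *Endpoint) Record(ok bool, th Thresholds) bool {
	prev := e.State
	if ok {
		e.Failures = 0
		e.State = Healthy
	} else {
		e.Failures++
		switch {
		case e.Failures >= th.Down:
			e.State = Down
		case e.Failures >= th.Degraded:
			e.State = Degraded
		}
	}
	return e.State != prev
}

func main() {
	th := Thresholds{Degraded: 2, Down: 5}
	e := &Endpoint{State: Healthy}
	// Five failed probes followed by one successful probe.
	for i, ok := range []bool{false, false, false, false, false, true} {
		changed := e.Record(ok, th)
		fmt.Printf("probe %d ok=%v state=%s failures=%d changed=%v\n",
			i+1, ok, e.State, e.Failures, changed)
	}
}
```

Returning a "changed" flag from `Record` is one way to fire a notification only on a transition rather than on every failed probe.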

@@ -189,15 +189,20 @@ Each section includes:

 - **Health Endpoint** — `GET /health` returns 200 OK with service status. Consumed by Docker health checks and Kubernetes probes.
 - **Readiness Endpoint** — `GET /ready` returns 200 OK when the database is connected and migrations are applied.
-- **Background Scheduler Monitoring** — 7 background loops run on a fixed schedule:
-  - Renewal loop: every 1 hour, scans for certificates approaching renewal threshold
-  - Job processor loop: every 30 seconds, picks up pending/waiting jobs and advances their state
-  - Health check loop: every 2 minutes, pings agents to detect downtime
-  - Notification dispatcher loop: every 1 minute, sends queued alerts
-  - Short-lived cert expiry loop: every 30 seconds, marks expired short-lived credentials
-  - Network scanner loop: every 6 hours, scans enabled TLS endpoints for certificate discovery
-  - Digest emailer loop: every 24 hours, sends scheduled certificate digest email to configured recipients
-  Each loop includes error handling and logs failures via structured slog.
+- **Background Scheduler Monitoring** — 12 background loops (8 always-on + 4 opt-in) run on a fixed schedule. Authoritative topology in `docs/architecture.md`:
+  - Renewal loop (always-on, 1 hour): scans for certificates approaching renewal threshold
+  - Job processor loop (always-on, 30 seconds): picks up pending/waiting jobs and advances their state
+  - Job retry loop (always-on, 5 minutes, `CERTCTL_SCHEDULER_RETRY_INTERVAL`): retries Failed jobs (I-001)
+  - Job timeout reaper loop (always-on, 10 minutes, `CERTCTL_JOB_TIMEOUT_INTERVAL`): fails AwaitingCSR/AwaitingApproval jobs past timeout (I-003)
+  - Agent health check loop (always-on, 2 minutes): pings agents to detect downtime
+  - Notification dispatcher loop (always-on, 1 minute): sends queued alerts
+  - Notification retry loop (always-on, 2 minutes, `CERTCTL_NOTIFICATION_RETRY_INTERVAL`): exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005)
+  - Short-lived cert expiry loop (always-on, 30 seconds): marks expired short-lived credentials
+  - Network scanner loop (opt-in, 6 hours, `CERTCTL_NETWORK_SCAN_ENABLED`): scans enabled TLS endpoints for certificate discovery
+  - Digest emailer loop (opt-in, 24 hours, `CERTCTL_DIGEST_INTERVAL`): sends scheduled certificate digest email to configured recipients
+  - Endpoint health loop (opt-in, 60 seconds, `CERTCTL_HEALTH_CHECK_INTERVAL`): continuous TLS health probes (M48)
+  - Cloud discovery loop (opt-in, 6 hours, `CERTCTL_CLOUD_DISCOVERY_INTERVAL`): cloud secret manager certificate discovery (M50)
+  Each loop includes `atomic.Bool` idempotency guards, error handling, and structured slog failure logs.
 - **Metrics Endpoints** — Two formats for monitoring integration:
  - `GET /api/v1/metrics` — JSON object with gauges, counters, and uptime for custom dashboards
  - `GET /api/v1/metrics/prometheus` — Prometheus exposition format (`text/plain; version=0.0.4`) for native scraping by Prometheus, Grafana Agent, Datadog, and other OpenMetrics-compatible collectors
@@ -459,7 +464,7 @@ Each section includes:
 | | Metrics JSON Endpoint | `GET /api/v1/metrics` (gauges, counters, uptime) | ✅ | ✅ | Set thresholds, configure alerting |
 | | Stats API (time-series) | `GET /api/v1/stats/*` (summary, status, expiration, jobs, issuance) | ✅ | ✅ | Integrate into dashboards, SLO tracking |
 | | Structured Logging | `slog` middleware with request IDs | ✅ | ✅ | Aggregate logs to SIEM, define retention policy |
-| | Background Scheduler | 7 loops (renewal 1h, jobs 30s, health 2m, notifications 1m, short-lived 30s, network scan 6h, digest 24h) | ✅ | ✅ | Alert on scheduler loop failures |
+| | Background Scheduler | 12 loops (8 always-on: renewal 1h, jobs 30s, job retry 5m I-001, job timeout 10m I-003, health 2m, notifications 1m, notif retry 2m I-005, short-lived 30s; 4 opt-in: network scan 6h, digest 24h, endpoint health 60s M48, cloud discovery 6h M50) | ✅ | ✅ | Alert on scheduler loop failures |
 | **CC7.2** Anomaly Detection | Immutable API Audit Trail | `internal/api/middleware/audit.go`, `GET /api/v1/audit` | ✅ | Enhanced (SIEM export) | Integrate into SIEM, search for anomalies, archive long-term |
 | | Expiration Threshold Alerting | Configurable per-policy (default 30/14/7/0 days) | ✅ | ✅ | Configure thresholds, integrate notifications |
 | | Status Auto-Transitions | Active → Expiring (30d) → Expired (0d) | ✅ | ✅ | Monitor status changes in audit trail |
@@ -1126,7 +1126,7 @@ The digest HTML template includes:
 - Expiring certificates table (color-coded by urgency: 7d, 14d, 30d)
 - Auto-refresh and responsive email layout

-**Scheduler Integration:** The 7th scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency.
+**Scheduler Integration:** The opt-in digest scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency. See `docs/architecture.md` for the full scheduler topology (12 loops, 8 always-on + 4 opt-in).

 Configuration:

@@ -1389,7 +1389,7 @@ curl -s -X DELETE http://localhost:8443/api/v1/network-scan-targets/nst-dmz

 ### Scheduler Integration

-When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs a 6th scheduler loop (alongside renewal, jobs, health, notifications, and short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health.
+When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs the opt-in network scanner scheduler loop alongside the always-on loops (renewal, jobs, job retry, job timeout, agent health, notifications, notification retry, short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health. See `docs/architecture.md` for the full 12-loop scheduler topology.

 ### Use Cases

@@ -1447,7 +1447,7 @@ Source path format: `gcp-sm://{project}/{secret-name}`. Sentinel agent: `cloud-g

 ### Cloud Discovery Scheduler

-All enabled cloud sources run on a shared scheduler loop (9th loop). The interval is configurable:
+All enabled cloud sources run on a shared opt-in cloud discovery scheduler loop (see `docs/architecture.md` for the full 12-loop scheduler topology). The interval is configurable:

 | Variable | Description | Default |
 |---|---|---|
@@ -1155,7 +1155,7 @@ flowchart TB
        API["REST API\nGo net/http"]
        SVC["Service Layer\nBusiness Logic"]
        REPO["Repository Layer\ndatabase/sql + lib/pq"]
-        SCHED["Scheduler\n7 background loops"]
+        SCHED["Scheduler\n12 background loops\n(8 always-on + 4 opt-in)"]
        CONN["Connector Registry\nIssuer + Target + Notifier"]
    end

@@ -16,7 +16,7 @@ Complete reference of every feature shipped in certctl through v2.1.0 (April 202
 | Target connectors | 14 |
 | Notifier connectors | 6 channels |
 | Database tables | 21 (across 10 migrations) |
-| Background scheduler loops | 7 |
+| Background scheduler loops | 12 (8 always-on + 4 opt-in) |
 | Web dashboard pages | 24 |
 | Test functions | 1850+ |
 | Supported platforms | linux/amd64, linux/arm64, darwin/amd64, darwin/arm64 |
@@ -903,7 +903,7 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor

 <!-- Source: internal/connector/discovery/awssm/, azurekv/, gcpsm/, internal/service/cloud_discovery.go -->

-Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the 9th scheduler loop (6h default).
+Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` for the full 12-loop scheduler topology).

 **Supported sources:**

@@ -1097,17 +1097,22 @@ Single SQL `UNION` query replaces the previous "fetch all, filter in Go" approac

 <!-- Source: internal/scheduler/scheduler.go -->

-7 background loops, each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown.
+12 background loops (8 always-on + 4 opt-in), each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown. Authoritative topology table lives in `docs/architecture.md`.

-| Loop | Default Interval | Description |
-|---|---|---|
-| Renewal check | 1 hour | Check expiring certs, query ARI, create renewal jobs |
-| Job processor | 30 seconds | Process pending jobs |
-| Agent health check | 2 minutes | Check agent heartbeat staleness |
-| Notification processor | 1 minute | Send queued notifications |
-| Short-lived expiry check | 30 seconds | Mark short-lived certs expired |
-| Network scan | 6 hours | Run network discovery scans |
-| Digest | 24 hours | Send certificate digest email (does not run on startup) |
+| Loop | Default Interval | Always-on | Env Var | Description |
+|---|---|---|---|---|
+| Renewal check | 1 hour | Yes | — | Check expiring certs, query ARI, create renewal jobs |
+| Job processor | 30 seconds | Yes | — | Process pending jobs |
+| Job retry | 5 minutes | Yes | `CERTCTL_SCHEDULER_RETRY_INTERVAL` | Retry Failed jobs (I-001) |
+| Job timeout reaper | 10 minutes | Yes | `CERTCTL_JOB_TIMEOUT_INTERVAL` | Fail AwaitingCSR/AwaitingApproval jobs past timeout (I-003) |
+| Agent health check | 2 minutes | Yes | — | Check agent heartbeat staleness |
+| Notification processor | 1 minute | Yes | — | Send queued notifications |
+| Notification retry | 2 minutes | Yes | `CERTCTL_NOTIFICATION_RETRY_INTERVAL` | Exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005) |
+| Short-lived expiry check | 30 seconds | Yes | — | Mark short-lived certs expired |
+| Network scan | 6 hours | Opt-in | `CERTCTL_NETWORK_SCAN_ENABLED` | Run network discovery scans |
+| Digest | 24 hours | Opt-in | `CERTCTL_DIGEST_INTERVAL` | Send certificate digest email (does not run on startup) |
+| Endpoint health | 60 seconds | Opt-in | `CERTCTL_HEALTH_CHECK_INTERVAL` | Continuous TLS health probes (M48) |
+| Cloud discovery | 6 hours | Opt-in | `CERTCTL_CLOUD_DISCOVERY_INTERVAL` | Cloud secret manager certificate discovery (M50) |

 ---

@@ -5002,10 +5002,10 @@ curl -s -w "HTTP %{http_code}\n" -X DELETE -H "$AUTH" "$SERVER/api/v1/audit/$EVE

 > **Tip:** Open a second terminal with `docker compose logs -f certctl-server` to watch scheduler log output in real time.

-**Test 20.1.1 — Scheduler startup: all 7 loops registered**
+**Test 20.1.1 — Scheduler startup: all 12 loops registered**

 ```bash
-docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|health check\|notification\|short-lived\|network scan" | head -20
+docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|job retry\|job timeout\|health check\|notification\|notification retry\|short-lived\|network scan\|digest\|endpoint health\|cloud discovery" | head -30
 ```

 **What:** Checks server startup logs for scheduler loop registration.
@@ -7340,7 +7340,7 @@ These must be green before starting manual QA:

 | Test | Description | Method | Pass? | Date | Notes |
 |------|-------------|--------|-------|------|-------|
-| 20.1.1 | Scheduler startup: all 7 loops registered | Manual | ☐ |  |  |
+| 20.1.1 | Scheduler startup: all 12 loops registered | Manual | ☐ |  |  |
 | 20.1.2 | Job processor loop fires (30s interval) | Manual | ☐ |  |  |
 | 20.1.3 | Agent health check marks offline (2m interval) | Manual | ☐ |  |  |
 | 20.1.4 | Notification processor fires (1m interval) | Manual | ☐ |  |  |