docs: reconcile scheduler topology across sibling docs (7 → 12 loops)

Authoritative 12-loop table lives at docs/architecture.md:522-534 (committed via
the I-001/I-003/I-005 + M48/M50 milestone commits). This change brings six sibling
docs into parity with that table so every surface — user-facing features reference,
SOC 2 compliance mapping, connectors guide, advanced demo architecture diagram,
testing guide, and in-line architecture prose — reflects the same 8 always-on + 4
opt-in topology.

Touches:
- docs/architecture.md: 2 inline ordinal references (9th / 8th loop) replaced with
  descriptive names (opt-in cloud discovery / opt-in endpoint health), cross-linked
  to the authoritative table to prevent future ordinal rot.
- docs/features.md: metric row (7 → 12), inline reference to 9th loop, and full
  scheduler table expanded to include Always-on column + env vars + I-001/I-003/I-005
  refs.
- docs/compliance-soc2.md: background scheduler monitoring bullets expanded to list
  all 12 loops with env vars + I-series refs; table row updated with 8 always-on +
  4 opt-in summary.
- docs/connectors.md: three inline ordinals (7th/6th/9th loop) replaced with
  descriptive names, cross-linked to architecture.md.
- docs/demo-advanced.md: Mermaid SCHED node label updated from '7 background loops'
  to '12 background loops (8 always-on + 4 opt-in)'.
- docs/testing-guide.md: Test 20.1.1 header + grep pattern expanded to include
  job-retry / job-timeout / notification-retry / digest / endpoint-health /
  cloud-discovery loops; sign-off chart row label updated.

Pure documentation reconciliation. No code changes. Master HEAD pre-commit: 25131a3.
This commit is contained in:
Shankar
2026-04-20 02:51:34 +00:00
parent 25131a377d
commit c38ba40200
6 changed files with 41 additions and 31 deletions
+2 -2
View File
@@ -1092,7 +1092,7 @@ flowchart TB
1. **Pluggable sources** — Each cloud provider implements the `DiscoverySource` interface (Name, Type, Discover, ValidateConfig). Three built-in sources: AWS Secrets Manager, Azure Key Vault, GCP Secret Manager
2. **CloudDiscoveryService orchestrator** — Iterates registered sources, calls `Discover()` on each, feeds reports into `ProcessDiscoveryReport()`. Errors from one source don't prevent other sources from running
3. **Scheduler integration**9th scheduler loop (6h default), runs immediately on startup, `atomic.Bool` idempotency guard
3. **Scheduler integration**opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` 12-loop topology), runs immediately on startup, `atomic.Bool` idempotency guard
4. **Sentinel agents** — Each source uses its own sentinel agent ID (`cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) for dedup and triage filtering
5. **Source path format**`aws-sm://{region}/{secret}`, `azure-kv://{cert-name}/{version}`, `gcp-sm://{project}/{secret}`
6. **No new schema** — Reuses existing `discovered_certificates` and `discovery_scans` tables. Sentinel agent IDs leverage existing `(fingerprint_sha256, agent_id, source_path)` dedup constraint
@@ -1114,7 +1114,7 @@ This data flow is pull-based and non-blocking. Agents discover at their own pace
Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.
**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated 8th scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated opt-in endpoint health scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
**State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.
+15 -10
View File
@@ -189,15 +189,20 @@ Each section includes:
- **Health Endpoint** — `GET /health` returns 200 OK with service status. Consumed by Docker health checks and Kubernetes probes.
- **Readiness Endpoint** — `GET /ready` returns 200 OK when the database is connected and migrations are applied.
- **Background Scheduler Monitoring** — 7 background loops run on a fixed schedule:
- Renewal loop: every 1 hour, scans for certificates approaching renewal threshold
- Job processor loop: every 30 seconds, picks up pending/waiting jobs and advances their state
- Health check loop: every 2 minutes, pings agents to detect downtime
- Notification dispatcher loop: every 1 minute, sends queued alerts
- Short-lived cert expiry loop: every 30 seconds, marks expired short-lived credentials
- Network scanner loop: every 6 hours, scans enabled TLS endpoints for certificate discovery
- Digest emailer loop: every 24 hours, sends scheduled certificate digest email to configured recipients
Each loop includes error handling and logs failures via structured slog.
- **Background Scheduler Monitoring** — 12 background loops (8 always-on + 4 opt-in) run on a fixed schedule. Authoritative topology in `docs/architecture.md`:
- Renewal loop (always-on, 1 hour): scans for certificates approaching renewal threshold
- Job processor loop (always-on, 30 seconds): picks up pending/waiting jobs and advances their state
- Job retry loop (always-on, 5 minutes, `CERTCTL_SCHEDULER_RETRY_INTERVAL`): retries Failed jobs (I-001)
- Job timeout reaper loop (always-on, 10 minutes, `CERTCTL_JOB_TIMEOUT_INTERVAL`): fails AwaitingCSR/AwaitingApproval jobs past timeout (I-003)
- Agent health check loop (always-on, 2 minutes): pings agents to detect downtime
- Notification dispatcher loop (always-on, 1 minute): sends queued alerts
- Notification retry loop (always-on, 2 minutes, `CERTCTL_NOTIFICATION_RETRY_INTERVAL`): exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005)
- Short-lived cert expiry loop (always-on, 30 seconds): marks expired short-lived credentials
- Network scanner loop (opt-in, 6 hours, `CERTCTL_NETWORK_SCAN_ENABLED`): scans enabled TLS endpoints for certificate discovery
- Digest emailer loop (opt-in, 24 hours, `CERTCTL_DIGEST_INTERVAL`): sends scheduled certificate digest email to configured recipients
- Endpoint health loop (opt-in, 60 seconds, `CERTCTL_HEALTH_CHECK_INTERVAL`): continuous TLS health probes (M48)
- Cloud discovery loop (opt-in, 6 hours, `CERTCTL_CLOUD_DISCOVERY_INTERVAL`): cloud secret manager certificate discovery (M50)
Each loop includes `atomic.Bool` idempotency guards, error handling, and structured slog failure logs.
- **Metrics Endpoints** — Two formats for monitoring integration:
- `GET /api/v1/metrics` — JSON object with gauges, counters, and uptime for custom dashboards
- `GET /api/v1/metrics/prometheus` — Prometheus exposition format (`text/plain; version=0.0.4`) for native scraping by Prometheus, Grafana Agent, Datadog, and other OpenMetrics-compatible collectors
@@ -459,7 +464,7 @@ Each section includes:
| | Metrics JSON Endpoint | `GET /api/v1/metrics` (gauges, counters, uptime) | ✅ | ✅ | Set thresholds, configure alerting |
| | Stats API (time-series) | `GET /api/v1/stats/*` (summary, status, expiration, jobs, issuance) | ✅ | ✅ | Integrate into dashboards, SLO tracking |
| | Structured Logging | `slog` middleware with request IDs | ✅ | ✅ | Aggregate logs to SIEM, define retention policy |
| | Background Scheduler | 7 loops (renewal 1h, jobs 30s, health 2m, notifications 1m, short-lived 30s, network scan 6h, digest 24h) | ✅ | ✅ | Alert on scheduler loop failures |
| | Background Scheduler | 12 loops (8 always-on: renewal 1h, jobs 30s, job retry 5m I-001, job timeout 10m I-003, health 2m, notifications 1m, notif retry 2m I-005, short-lived 30s; 4 opt-in: network scan 6h, digest 24h, endpoint health 60s M48, cloud discovery 6h M50) | ✅ | ✅ | Alert on scheduler loop failures |
| **CC7.2** Anomaly Detection | Immutable API Audit Trail | `internal/api/middleware/audit.go`, `GET /api/v1/audit` | ✅ | Enhanced (SIEM export) | Integrate into SIEM, search for anomalies, archive long-term |
| | Expiration Threshold Alerting | Configurable per-policy (default 30/14/7/0 days) | ✅ | ✅ | Configure thresholds, integrate notifications |
| | Status Auto-Transitions | Active → Expiring (30d) → Expired (0d) | ✅ | ✅ | Monitor status changes in audit trail |
+3 -3
View File
@@ -1126,7 +1126,7 @@ The digest HTML template includes:
- Expiring certificates table (color-coded by urgency: 7d, 14d, 30d)
- Auto-refresh and responsive email layout
**Scheduler Integration:** The 7th scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency.
**Scheduler Integration:** The opt-in digest scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency. See `docs/architecture.md` for the full scheduler topology (12 loops, 8 always-on + 4 opt-in).
Configuration:
@@ -1389,7 +1389,7 @@ curl -s -X DELETE http://localhost:8443/api/v1/network-scan-targets/nst-dmz
### Scheduler Integration
When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs a 6th scheduler loop (alongside renewal, jobs, health, notifications, and short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health.
When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs the opt-in network scanner scheduler loop alongside the always-on loops (renewal, jobs, job retry, job timeout, agent health, notifications, notification retry, short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health. See `docs/architecture.md` for the full 12-loop scheduler topology.
### Use Cases
@@ -1447,7 +1447,7 @@ Source path format: `gcp-sm://{project}/{secret-name}`. Sentinel agent: `cloud-g
### Cloud Discovery Scheduler
All enabled cloud sources run on a shared scheduler loop (9th loop). The interval is configurable:
All enabled cloud sources run on a shared opt-in cloud discovery scheduler loop (see `docs/architecture.md` for the full 12-loop scheduler topology). The interval is configurable:
| Variable | Description | Default |
|---|---|---|
+1 -1
View File
@@ -1155,7 +1155,7 @@ flowchart TB
API["REST API\nGo net/http"]
SVC["Service Layer\nBusiness Logic"]
REPO["Repository Layer\ndatabase/sql + lib/pq"]
SCHED["Scheduler\n7 background loops"]
SCHED["Scheduler\n12 background loops\n(8 always-on + 4 opt-in)"]
CONN["Connector Registry\nIssuer + Target + Notifier"]
end
+17 -12
View File
@@ -16,7 +16,7 @@ Complete reference of every feature shipped in certctl through v2.1.0 (April 202
| Target connectors | 14 |
| Notifier connectors | 6 channels |
| Database tables | 21 (across 10 migrations) |
| Background scheduler loops | 7 |
| Background scheduler loops | 12 (8 always-on + 4 opt-in) |
| Web dashboard pages | 24 |
| Test functions | 1850+ |
| Supported platforms | linux/amd64, linux/arm64, darwin/amd64, darwin/arm64 |
@@ -903,7 +903,7 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor
<!-- Source: internal/connector/discovery/awssm/, azurekv/, gcpsm/, internal/service/cloud_discovery.go -->
Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the 9th scheduler loop (6h default).
Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` for the full 12-loop scheduler topology).
**Supported sources:**
@@ -1097,17 +1097,22 @@ Single SQL `UNION` query replaces the previous "fetch all, filter in Go" approac
<!-- Source: internal/scheduler/scheduler.go -->
7 background loops, each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown.
12 background loops (8 always-on + 4 opt-in), each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown. Authoritative topology table lives in `docs/architecture.md`.
| Loop | Default Interval | Description |
|---|---|---|
| Renewal check | 1 hour | Check expiring certs, query ARI, create renewal jobs |
| Job processor | 30 seconds | Process pending jobs |
| Agent health check | 2 minutes | Check agent heartbeat staleness |
| Notification processor | 1 minute | Send queued notifications |
| Short-lived expiry check | 30 seconds | Mark short-lived certs expired |
| Network scan | 6 hours | Run network discovery scans |
| Digest | 24 hours | Send certificate digest email (does not run on startup) |
| Loop | Default Interval | Always-on | Env Var | Description |
|---|---|---|---|---|
| Renewal check | 1 hour | Yes | — | Check expiring certs, query ARI, create renewal jobs |
| Job processor | 30 seconds | Yes | — | Process pending jobs |
| Job retry | 5 minutes | Yes | `CERTCTL_SCHEDULER_RETRY_INTERVAL` | Retry Failed jobs (I-001) |
| Job timeout reaper | 10 minutes | Yes | `CERTCTL_JOB_TIMEOUT_INTERVAL` | Fail AwaitingCSR/AwaitingApproval jobs past timeout (I-003) |
| Agent health check | 2 minutes | Yes | — | Check agent heartbeat staleness |
| Notification processor | 1 minute | Yes | — | Send queued notifications |
| Notification retry | 2 minutes | Yes | `CERTCTL_NOTIFICATION_RETRY_INTERVAL` | Exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005) |
| Short-lived expiry check | 30 seconds | Yes | — | Mark short-lived certs expired |
| Network scan | 6 hours | Opt-in | `CERTCTL_NETWORK_SCAN_ENABLED` | Run network discovery scans |
| Digest | 24 hours | Opt-in | `CERTCTL_DIGEST_INTERVAL` | Send certificate digest email (does not run on startup) |
| Endpoint health | 60 seconds | Opt-in | `CERTCTL_HEALTH_CHECK_INTERVAL` | Continuous TLS health probes (M48) |
| Cloud discovery | 6 hours | Opt-in | `CERTCTL_CLOUD_DISCOVERY_INTERVAL` | Cloud secret manager certificate discovery (M50) |
---
+3 -3
View File
@@ -5002,10 +5002,10 @@ curl -s -w "HTTP %{http_code}\n" -X DELETE -H "$AUTH" "$SERVER/api/v1/audit/$EVE
> **Tip:** Open a second terminal with `docker compose logs -f certctl-server` to watch scheduler log output in real time.
**Test 20.1.1 — Scheduler startup: all 7 loops registered**
**Test 20.1.1 — Scheduler startup: all 12 loops registered**
```bash
docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|health check\|notification\|short-lived\|network scan" | head -20
docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|job retry\|job timeout\|health check\|notification\|notification retry\|short-lived\|network scan\|digest\|endpoint health\|cloud discovery" | head -30
```
**What:** Checks server startup logs for scheduler loop registration.
@@ -7340,7 +7340,7 @@ These must be green before starting manual QA:
| Test | Description | Method | Pass? | Date | Notes |
|------|-------------|--------|-------|------|-------|
| 20.1.1 | Scheduler startup: all 7 loops registered | Manual | ☐ | | |
| 20.1.1 | Scheduler startup: all 12 loops registered | Manual | ☐ | | |
| 20.1.2 | Job processor loop fires (30s interval) | Manual | ☐ | | |
| 20.1.3 | Agent health check marks offline (2m interval) | Manual | ☐ | | |
| 20.1.4 | Notification processor fires (1m interval) | Manual | ☐ | | |