docs: reconcile scheduler topology across sibling docs (7 → 12 loops)

Authoritative 12-loop table lives at docs/architecture.md:522-534 (committed via
the I-001/I-003/I-005 + M48/M50 milestone commits). This change brings six sibling
docs into parity with that table so every surface — user-facing features reference,
SOC 2 compliance mapping, connectors guide, advanced demo architecture diagram,
testing guide, and in-line architecture prose — reflects the same 8 always-on + 4
opt-in topology.

Touches:
- docs/architecture.md: 2 inline ordinal references (9th / 8th loop) replaced with
  descriptive names (opt-in cloud discovery / opt-in endpoint health), cross-linked
  to the authoritative table to prevent future ordinal rot.
- docs/features.md: metric row (7 → 12), inline reference to 9th loop, and full
  scheduler table expanded to include Always-on column + env vars + I-001/I-003/I-005
  refs.
- docs/compliance-soc2.md: background scheduler monitoring bullets expanded to list
  all 12 loops with env vars + I-series refs; table row updated with 8 always-on +
  4 opt-in summary.
- docs/connectors.md: three inline ordinals (7th/6th/9th loop) replaced with
  descriptive names, cross-linked to architecture.md.
- docs/demo-advanced.md: Mermaid SCHED node label updated from '7 background loops'
  to '12 background loops (8 always-on + 4 opt-in)'.
- docs/testing-guide.md: Test 20.1.1 header + grep pattern expanded to include
  job-retry / job-timeout / notification-retry / digest / endpoint-health /
  cloud-discovery loops; sign-off chart row label updated.

Pure documentation reconciliation. No code changes. Master HEAD pre-commit: 6e646e0.
shankar0123
2026-04-20 02:51:34 +00:00
parent 6e646e0fe8
commit 04c7eca615
6 changed files with 41 additions and 31 deletions
docs/architecture.md: +2 -2
@@ -1092,7 +1092,7 @@ flowchart TB
 1. **Pluggable sources** — Each cloud provider implements the `DiscoverySource` interface (Name, Type, Discover, ValidateConfig). Three built-in sources: AWS Secrets Manager, Azure Key Vault, GCP Secret Manager
 2. **CloudDiscoveryService orchestrator** — Iterates registered sources, calls `Discover()` on each, feeds reports into `ProcessDiscoveryReport()`. Errors from one source don't prevent other sources from running
-3. **Scheduler integration** — 9th scheduler loop (6h default), runs immediately on startup, `atomic.Bool` idempotency guard
+3. **Scheduler integration** — opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` 12-loop topology), runs immediately on startup, `atomic.Bool` idempotency guard
 4. **Sentinel agents** — Each source uses its own sentinel agent ID (`cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) for dedup and triage filtering
 5. **Source path format** — `aws-sm://{region}/{secret}`, `azure-kv://{cert-name}/{version}`, `gcp-sm://{project}/{secret}`
 6. **No new schema** — Reuses existing `discovered_certificates` and `discovery_scans` tables. Sentinel agent IDs leverage existing `(fingerprint_sha256, agent_id, source_path)` dedup constraint
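Reviewer sketch (not part of the diff): the orchestrator behaviour described in items 1 and 2 of the hunk above could look roughly like the following. The method signatures, the `staticSource` test double, and `RunAll` are assumptions made for illustration; only the four method names and the "one failing source does not stop the others" behaviour come from the documentation.

```go
package main

import (
	"errors"
	"fmt"
)

// DiscoverySource uses the four method names listed in the docs.
// The signatures are assumed, not taken from the certctl source.
type DiscoverySource interface {
	Name() string
	Type() string
	Discover() ([]string, error)
	ValidateConfig() error
}

// errUnavailable is a hypothetical failure used by the demo source.
var errUnavailable = errors.New("provider unavailable")

// staticSource is a hypothetical in-memory source for demonstration.
type staticSource struct {
	name  string
	certs []string
	err   error
}

func (s staticSource) Name() string                { return s.name }
func (s staticSource) Type() string                { return "static" }
func (s staticSource) Discover() ([]string, error) { return s.certs, s.err }
func (s staticSource) ValidateConfig() error       { return nil }

// RunAll calls Discover on every source. A failing source is recorded
// and skipped so the remaining sources still run.
func RunAll(sources []DiscoverySource) (map[string][]string, error) {
	found := make(map[string][]string)
	var errs []error
	for _, src := range sources {
		certs, err := src.Discover()
		if err != nil {
			errs = append(errs, fmt.Errorf("%s: %w", src.Name(), err))
			continue
		}
		found[src.Name()] = certs
	}
	// errors.Join returns nil when no errors were collected.
	return found, errors.Join(errs...)
}

func main() {
	sources := []DiscoverySource{
		staticSource{name: "cloud-aws-sm", err: errUnavailable},
		staticSource{name: "cloud-gcp-sm", certs: []string{"gcp-sm://demo-project/demo-secret"}},
	}
	found, err := RunAll(sources)
	fmt.Println("found:", found)
	fmt.Println("errors:", err)
}
```

Collecting errors instead of returning on the first one is what keeps a single misconfigured provider from blocking discovery for the rest.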
@@ -1114,7 +1114,7 @@ This data flow is pull-based and non-blocking. Agents discover at their own pace
 Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.
-**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated 8th scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
+**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated opt-in endpoint health scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
 **State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.
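Reviewer sketch (not part of the diff): the state machine paragraph above can be read as the transition function below. The `Endpoint` type, its field names, and `Observe` are hypothetical; the thresholds (2 and 5 consecutive failures) are the documented defaults. Resetting the failure counter on a successful probe that reports a mismatched fingerprint is an assumption, since the documentation only states the reset for recovery to healthy.

```go
package main

import "fmt"

// Status values mirror the states named in the docs.
type Status string

const (
	Healthy      Status = "healthy"
	Degraded     Status = "degraded"
	Down         Status = "down"
	CertMismatch Status = "cert_mismatch"
)

// Documented default thresholds (consecutive failures).
const (
	degradedThreshold = 2
	downThreshold     = 5
)

// Endpoint is a hypothetical in-memory view of one health check target.
type Endpoint struct {
	Status              Status
	ConsecutiveFailures int
	ExpectedFingerprint string
}

// Observe applies one probe result and returns the resulting status.
func (e *Endpoint) Observe(probeOK bool, observedFingerprint string) Status {
	switch {
	case !probeOK:
		e.ConsecutiveFailures++
		switch {
		case e.ConsecutiveFailures >= downThreshold:
			e.Status = Down
		case e.ConsecutiveFailures >= degradedThreshold:
			e.Status = Degraded
		}
	case observedFingerprint != e.ExpectedFingerprint:
		// Assumption: a reachable endpoint resets the failure counter
		// even when it serves an unexpected certificate.
		e.ConsecutiveFailures = 0
		e.Status = CertMismatch
	default:
		e.ConsecutiveFailures = 0
		e.Status = Healthy
	}
	return e.Status
}

func main() {
	e := &Endpoint{Status: Healthy, ExpectedFingerprint: "sha256:aaaa"}
	probes := []struct {
		ok bool
		fp string
	}{
		{false, ""}, {false, ""}, {false, ""}, {false, ""}, {false, ""},
		{true, "sha256:aaaa"},
		{true, "sha256:bbbb"},
	}
	for i, p := range probes {
		fmt.Printf("probe %d -> %s\n", i+1, e.Observe(p.ok, p.fp))
	}
}
```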
docs/compliance-soc2.md: +15 -10
@@ -189,15 +189,20 @@ Each section includes:
 - **Health Endpoint** — `GET /health` returns 200 OK with service status. Consumed by Docker health checks and Kubernetes probes.
 - **Readiness Endpoint** — `GET /ready` returns 200 OK when the database is connected and migrations are applied.
-- **Background Scheduler Monitoring** — 7 background loops run on a fixed schedule:
-  - Renewal loop: every 1 hour, scans for certificates approaching renewal threshold
-  - Job processor loop: every 30 seconds, picks up pending/waiting jobs and advances their state
-  - Health check loop: every 2 minutes, pings agents to detect downtime
-  - Notification dispatcher loop: every 1 minute, sends queued alerts
-  - Short-lived cert expiry loop: every 30 seconds, marks expired short-lived credentials
-  - Network scanner loop: every 6 hours, scans enabled TLS endpoints for certificate discovery
-  - Digest emailer loop: every 24 hours, sends scheduled certificate digest email to configured recipients
-  Each loop includes error handling and logs failures via structured slog.
+- **Background Scheduler Monitoring** — 12 background loops (8 always-on + 4 opt-in) run on a fixed schedule. Authoritative topology in `docs/architecture.md`:
+  - Renewal loop (always-on, 1 hour): scans for certificates approaching renewal threshold
+  - Job processor loop (always-on, 30 seconds): picks up pending/waiting jobs and advances their state
+  - Job retry loop (always-on, 5 minutes, `CERTCTL_SCHEDULER_RETRY_INTERVAL`): retries Failed jobs (I-001)
+  - Job timeout reaper loop (always-on, 10 minutes, `CERTCTL_JOB_TIMEOUT_INTERVAL`): fails AwaitingCSR/AwaitingApproval jobs past timeout (I-003)
+  - Agent health check loop (always-on, 2 minutes): pings agents to detect downtime
+  - Notification dispatcher loop (always-on, 1 minute): sends queued alerts
+  - Notification retry loop (always-on, 2 minutes, `CERTCTL_NOTIFICATION_RETRY_INTERVAL`): exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005)
+  - Short-lived cert expiry loop (always-on, 30 seconds): marks expired short-lived credentials
+  - Network scanner loop (opt-in, 6 hours, `CERTCTL_NETWORK_SCAN_ENABLED`): scans enabled TLS endpoints for certificate discovery
+  - Digest emailer loop (opt-in, 24 hours, `CERTCTL_DIGEST_INTERVAL`): sends scheduled certificate digest email to configured recipients
+  - Endpoint health loop (opt-in, 60 seconds, `CERTCTL_HEALTH_CHECK_INTERVAL`): continuous TLS health probes (M48)
+  - Cloud discovery loop (opt-in, 6 hours, `CERTCTL_CLOUD_DISCOVERY_INTERVAL`): cloud secret manager certificate discovery (M50)
+  Each loop includes `atomic.Bool` idempotency guards, error handling, and structured slog failure logs.
 - **Metrics Endpoints** — Two formats for monitoring integration:
   - `GET /api/v1/metrics` — JSON object with gauges, counters, and uptime for custom dashboards
   - `GET /api/v1/metrics/prometheus` — Prometheus exposition format (`text/plain; version=0.0.4`) for native scraping by Prometheus, Grafana Agent, Datadog, and other OpenMetrics-compatible collectors
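Reviewer sketch (not part of the diff): the notification retry bullet in the hunk above mentions exponential backoff with promotion to dead-letter after 5 attempts. One way to express that policy is shown below. The `nextRetry` helper, the doubling factor, and the 30-second base delay are assumptions; only the 5-attempt limit comes from the documentation.

```go
package main

import (
	"fmt"
	"time"
)

// maxAttempts is the documented limit before a notification is
// promoted to dead-letter.
const maxAttempts = 5

// nextRetry takes the number of delivery attempts already made and a
// base delay. It returns the wait before the next attempt, or
// deadLetter=true once the attempt limit has been reached.
// Doubling per attempt is an assumed backoff factor.
func nextRetry(attempts int, base time.Duration) (delay time.Duration, deadLetter bool) {
	if attempts < 0 {
		attempts = 0
	}
	if attempts >= maxAttempts {
		return 0, true
	}
	return base << attempts, false
}

func main() {
	base := 30 * time.Second // hypothetical base delay
	for attempts := 0; attempts <= maxAttempts; attempts++ {
		delay, dead := nextRetry(attempts, base)
		fmt.Printf("attempts=%d delay=%s deadLetter=%t\n", attempts, delay, dead)
	}
}
```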
@@ -459,7 +464,7 @@ Each section includes:
 | | Metrics JSON Endpoint | `GET /api/v1/metrics` (gauges, counters, uptime) | ✅ | ✅ | Set thresholds, configure alerting |
 | | Stats API (time-series) | `GET /api/v1/stats/*` (summary, status, expiration, jobs, issuance) | ✅ | ✅ | Integrate into dashboards, SLO tracking |
 | | Structured Logging | `slog` middleware with request IDs | ✅ | ✅ | Aggregate logs to SIEM, define retention policy |
-| | Background Scheduler | 7 loops (renewal 1h, jobs 30s, health 2m, notifications 1m, short-lived 30s, network scan 6h, digest 24h) | ✅ | ✅ | Alert on scheduler loop failures |
+| | Background Scheduler | 12 loops (8 always-on: renewal 1h, jobs 30s, job retry 5m I-001, job timeout 10m I-003, health 2m, notifications 1m, notif retry 2m I-005, short-lived 30s; 4 opt-in: network scan 6h, digest 24h, endpoint health 60s M48, cloud discovery 6h M50) | ✅ | ✅ | Alert on scheduler loop failures |
 | **CC7.2** Anomaly Detection | Immutable API Audit Trail | `internal/api/middleware/audit.go`, `GET /api/v1/audit` | ✅ | Enhanced (SIEM export) | Integrate into SIEM, search for anomalies, archive long-term |
 | | Expiration Threshold Alerting | Configurable per-policy (default 30/14/7/0 days) | ✅ | ✅ | Configure thresholds, integrate notifications |
 | | Status Auto-Transitions | Active → Expiring (30d) → Expired (0d) | ✅ | ✅ | Monitor status changes in audit trail |
docs/connectors.md: +3 -3
@@ -1126,7 +1126,7 @@ The digest HTML template includes:
 - Expiring certificates table (color-coded by urgency: 7d, 14d, 30d)
 - Auto-refresh and responsive email layout
-**Scheduler Integration:** The 7th scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency.
+**Scheduler Integration:** The opt-in digest scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency. See `docs/architecture.md` for the full scheduler topology (12 loops, 8 always-on + 4 opt-in).
 Configuration:
@@ -1389,7 +1389,7 @@ curl -s -X DELETE http://localhost:8443/api/v1/network-scan-targets/nst-dmz
 ### Scheduler Integration
-When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs a 6th scheduler loop (alongside renewal, jobs, health, notifications, and short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health.
+When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs the opt-in network scanner scheduler loop alongside the always-on loops (renewal, jobs, job retry, job timeout, agent health, notifications, notification retry, short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health. See `docs/architecture.md` for the full 12-loop scheduler topology.
 ### Use Cases
@@ -1447,7 +1447,7 @@ Source path format: `gcp-sm://{project}/{secret-name}`. Sentinel agent: `cloud-g
 ### Cloud Discovery Scheduler
-All enabled cloud sources run on a shared scheduler loop (9th loop). The interval is configurable:
+All enabled cloud sources run on a shared opt-in cloud discovery scheduler loop (see `docs/architecture.md` for the full 12-loop scheduler topology). The interval is configurable:
 | Variable | Description | Default |
 |---|---|---|
docs/demo-advanced.md: +1 -1
@@ -1155,7 +1155,7 @@ flowchart TB
 API["REST API\nGo net/http"]
 SVC["Service Layer\nBusiness Logic"]
 REPO["Repository Layer\ndatabase/sql + lib/pq"]
-SCHED["Scheduler\n7 background loops"]
+SCHED["Scheduler\n12 background loops\n(8 always-on + 4 opt-in)"]
 CONN["Connector Registry\nIssuer + Target + Notifier"]
 end
docs/features.md: +17 -12
@@ -16,7 +16,7 @@ Complete reference of every feature shipped in certctl through v2.1.0 (April 202
 | Target connectors | 14 |
 | Notifier connectors | 6 channels |
 | Database tables | 21 (across 10 migrations) |
-| Background scheduler loops | 7 |
+| Background scheduler loops | 12 (8 always-on + 4 opt-in) |
 | Web dashboard pages | 24 |
 | Test functions | 1850+ |
 | Supported platforms | linux/amd64, linux/arm64, darwin/amd64, darwin/arm64 |
@@ -903,7 +903,7 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor
 <!-- Source: internal/connector/discovery/awssm/, azurekv/, gcpsm/, internal/service/cloud_discovery.go -->
-Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the 9th scheduler loop (6h default).
+Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` for the full 12-loop scheduler topology).
 **Supported sources:**
@@ -1097,17 +1097,22 @@ Single SQL `UNION` query replaces the previous "fetch all, filter in Go" approac
 <!-- Source: internal/scheduler/scheduler.go -->
-7 background loops, each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown.
+12 background loops (8 always-on + 4 opt-in), each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown. Authoritative topology table lives in `docs/architecture.md`.
-| Loop | Default Interval | Description |
-|---|---|---|
-| Renewal check | 1 hour | Check expiring certs, query ARI, create renewal jobs |
-| Job processor | 30 seconds | Process pending jobs |
-| Agent health check | 2 minutes | Check agent heartbeat staleness |
-| Notification processor | 1 minute | Send queued notifications |
-| Short-lived expiry check | 30 seconds | Mark short-lived certs expired |
-| Network scan | 6 hours | Run network discovery scans |
-| Digest | 24 hours | Send certificate digest email (does not run on startup) |
+| Loop | Default Interval | Always-on | Env Var | Description |
+|---|---|---|---|---|
+| Renewal check | 1 hour | Yes | — | Check expiring certs, query ARI, create renewal jobs |
+| Job processor | 30 seconds | Yes | — | Process pending jobs |
+| Job retry | 5 minutes | Yes | `CERTCTL_SCHEDULER_RETRY_INTERVAL` | Retry Failed jobs (I-001) |
+| Job timeout reaper | 10 minutes | Yes | `CERTCTL_JOB_TIMEOUT_INTERVAL` | Fail AwaitingCSR/AwaitingApproval jobs past timeout (I-003) |
+| Agent health check | 2 minutes | Yes | — | Check agent heartbeat staleness |
+| Notification processor | 1 minute | Yes | — | Send queued notifications |
+| Notification retry | 2 minutes | Yes | `CERTCTL_NOTIFICATION_RETRY_INTERVAL` | Exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005) |
+| Short-lived expiry check | 30 seconds | Yes | — | Mark short-lived certs expired |
+| Network scan | 6 hours | Opt-in | `CERTCTL_NETWORK_SCAN_ENABLED` | Run network discovery scans |
+| Digest | 24 hours | Opt-in | `CERTCTL_DIGEST_INTERVAL` | Send certificate digest email (does not run on startup) |
+| Endpoint health | 60 seconds | Opt-in | `CERTCTL_HEALTH_CHECK_INTERVAL` | Continuous TLS health probes (M48) |
+| Cloud discovery | 6 hours | Opt-in | `CERTCTL_CLOUD_DISCOVERY_INTERVAL` | Cloud secret manager certificate discovery (M50) |
 ---
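Reviewer sketch (not part of the diff): the scheduler paragraph in the hunk above says every loop uses an `atomic.Bool` guard to prevent concurrent tick execution and a `sync.WaitGroup` for graceful shutdown. A minimal version of that pattern might look like this. The `guardedLoop` type and its `Tick` method are hypothetical names; `WaitForCompletion` borrows the name from the docs but its real signature is not shown there.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// guardedLoop is a hypothetical wrapper around one scheduler loop.
type guardedLoop struct {
	running atomic.Bool    // true while a tick is executing
	wg      sync.WaitGroup // tracks in-flight ticks for shutdown
}

// Tick starts work in a goroutine unless the previous tick is still
// running. It reports whether the work was started.
func (g *guardedLoop) Tick(work func()) bool {
	// CompareAndSwap flips false -> true atomically, so only one
	// caller can win while a tick is in flight.
	if !g.running.CompareAndSwap(false, true) {
		return false
	}
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		defer g.running.Store(false) // runs before wg.Done
		work()
	}()
	return true
}

// WaitForCompletion blocks until every started tick has finished.
func (g *guardedLoop) WaitForCompletion() { g.wg.Wait() }

func main() {
	var g guardedLoop
	release := make(chan struct{})

	first := g.Tick(func() { <-release }) // long-running tick
	second := g.Tick(func() {})           // overlaps, so it is skipped
	close(release)
	g.WaitForCompletion()
	third := g.Tick(func() {}) // previous tick finished, so it runs
	g.WaitForCompletion()

	fmt.Println(first, second, third) // prints "true false true"
}
```

Skipping an overlapping tick rather than queueing it keeps a slow run from piling up work behind it, at the cost of waiting for the next interval.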
docs/testing-guide.md: +3 -3
@@ -5002,10 +5002,10 @@ curl -s -w "HTTP %{http_code}\n" -X DELETE -H "$AUTH" "$SERVER/api/v1/audit/$EVE
 > **Tip:** Open a second terminal with `docker compose logs -f certctl-server` to watch scheduler log output in real time.
-**Test 20.1.1 — Scheduler startup: all 7 loops registered**
+**Test 20.1.1 — Scheduler startup: all 12 loops registered**
 ```bash
-docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|health check\|notification\|short-lived\|network scan" | head -20
+docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|job retry\|job timeout\|health check\|notification\|notification retry\|short-lived\|network scan\|digest\|endpoint health\|cloud discovery" | head -30
 ```
 **What:** Checks server startup logs for scheduler loop registration.
@@ -7340,7 +7340,7 @@ These must be green before starting manual QA:
 | Test | Description | Method | Pass? | Date | Notes |
 |------|-------------|--------|-------|------|-------|
-| 20.1.1 | Scheduler startup: all 7 loops registered | Manual | ☐ | | |
+| 20.1.1 | Scheduler startup: all 12 loops registered | Manual | ☐ | | |
 | 20.1.2 | Job processor loop fires (30s interval) | Manual | ☐ | | |
 | 20.1.3 | Agent health check marks offline (2m interval) | Manual | ☐ | | |
 | 20.1.4 | Notification processor fires (1m interval) | Manual | ☐ | | |