From 04c7eca61598eb00a6f442dfcce3e668697ad17b Mon Sep 17 00:00:00 2001
From: shankar0123
Date: Mon, 20 Apr 2026 02:51:34 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20reconcile=20scheduler=20topology=20acro?=
 =?UTF-8?q?ss=20sibling=20docs=20(7=20=E2=86=92=2012=20loops)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Authoritative 12-loop table lives at docs/architecture.md:522-534 (committed via the I-001/I-003/I-005 + M48/M50 milestone commits). This change brings six sibling docs into parity with that table so every surface — user-facing features reference, SOC 2 compliance mapping, connectors guide, advanced demo architecture diagram, testing guide, and in-line architecture prose — reflects the same 8 always-on + 4 opt-in topology.

Touches:
- docs/architecture.md: 2 inline ordinal references (9th / 8th loop) replaced with descriptive names (opt-in cloud discovery / opt-in endpoint health), cross-linked to the authoritative table to prevent future ordinal rot.
- docs/features.md: metric row (7 → 12), inline reference to 9th loop, and full scheduler table expanded to include Always-on column + env vars + I-001/I-003/I-005 refs.
- docs/compliance-soc2.md: background scheduler monitoring bullets expanded to list all 12 loops with env vars + I-series refs; table row updated with 8 always-on + 4 opt-in summary.
- docs/connectors.md: three inline ordinals (7th/6th/9th loop) replaced with descriptive names, cross-linked to architecture.md.
- docs/demo-advanced.md: Mermaid SCHED node label updated from '7 background loops' to '12 background loops (8 always-on + 4 opt-in)'.
- docs/testing-guide.md: Test 20.1.1 header + grep pattern expanded to include job-retry / job-timeout / notification-retry / digest / endpoint-health / cloud-discovery loops; sign-off chart row label updated.

Pure documentation reconciliation. No code changes. Master HEAD pre-commit: 6e646e0.
---
 docs/architecture.md    |  4 ++--
 docs/compliance-soc2.md | 25 +++++++++++++++----------
 docs/connectors.md      |  6 +++---
 docs/demo-advanced.md   |  2 +-
 docs/features.md        | 29 +++++++++++++++++------------
 docs/testing-guide.md   |  6 +++---
 6 files changed, 41 insertions(+), 31 deletions(-)

diff --git a/docs/architecture.md b/docs/architecture.md
index f039354..65cf1cc 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -1092,7 +1092,7 @@ flowchart TB

 1. **Pluggable sources** — Each cloud provider implements the `DiscoverySource` interface (Name, Type, Discover, ValidateConfig). Three built-in sources: AWS Secrets Manager, Azure Key Vault, GCP Secret Manager
 2. **CloudDiscoveryService orchestrator** — Iterates registered sources, calls `Discover()` on each, feeds reports into `ProcessDiscoveryReport()`. Errors from one source don't prevent other sources from running
-3. **Scheduler integration** — 9th scheduler loop (6h default), runs immediately on startup, `atomic.Bool` idempotency guard
+3. **Scheduler integration** — opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` 12-loop topology), runs immediately on startup, `atomic.Bool` idempotency guard
 4. **Sentinel agents** — Each source uses its own sentinel agent ID (`cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) for dedup and triage filtering
 5. **Source path format** — `aws-sm://{region}/{secret}`, `azure-kv://{cert-name}/{version}`, `gcp-sm://{project}/{secret}`
 6. **No new schema** — Reuses existing `discovered_certificates` and `discovery_scans` tables. Sentinel agent IDs leverage existing `(fingerprint_sha256, agent_id, source_path)` dedup constraint
@@ -1114,7 +1114,7 @@ This data flow is pull-based and non-blocking. Agents discover at their own pace

 Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.

-**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated 8th scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
+**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated opt-in endpoint health scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).

 **State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.

diff --git a/docs/compliance-soc2.md b/docs/compliance-soc2.md
index 1d56f5a..c63e4e8 100644
--- a/docs/compliance-soc2.md
+++ b/docs/compliance-soc2.md
@@ -189,15 +189,20 @@ Each section includes:

 - **Health Endpoint** — `GET /health` returns 200 OK with service status. Consumed by Docker health checks and Kubernetes probes.
 - **Readiness Endpoint** — `GET /ready` returns 200 OK when the database is connected and migrations are applied.
-- **Background Scheduler Monitoring** — 7 background loops run on a fixed schedule:
-  - Renewal loop: every 1 hour, scans for certificates approaching renewal threshold
-  - Job processor loop: every 30 seconds, picks up pending/waiting jobs and advances their state
-  - Health check loop: every 2 minutes, pings agents to detect downtime
-  - Notification dispatcher loop: every 1 minute, sends queued alerts
-  - Short-lived cert expiry loop: every 30 seconds, marks expired short-lived credentials
-  - Network scanner loop: every 6 hours, scans enabled TLS endpoints for certificate discovery
-  - Digest emailer loop: every 24 hours, sends scheduled certificate digest email to configured recipients
-  Each loop includes error handling and logs failures via structured slog.
+- **Background Scheduler Monitoring** — 12 background loops (8 always-on + 4 opt-in) run on a fixed schedule. Authoritative topology in `docs/architecture.md`:
+  - Renewal loop (always-on, 1 hour): scans for certificates approaching renewal threshold
+  - Job processor loop (always-on, 30 seconds): picks up pending/waiting jobs and advances their state
+  - Job retry loop (always-on, 5 minutes, `CERTCTL_SCHEDULER_RETRY_INTERVAL`): retries Failed jobs (I-001)
+  - Job timeout reaper loop (always-on, 10 minutes, `CERTCTL_JOB_TIMEOUT_INTERVAL`): fails AwaitingCSR/AwaitingApproval jobs past timeout (I-003)
+  - Agent health check loop (always-on, 2 minutes): pings agents to detect downtime
+  - Notification dispatcher loop (always-on, 1 minute): sends queued alerts
+  - Notification retry loop (always-on, 2 minutes, `CERTCTL_NOTIFICATION_RETRY_INTERVAL`): exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005)
+  - Short-lived cert expiry loop (always-on, 30 seconds): marks expired short-lived credentials
+  - Network scanner loop (opt-in, 6 hours, `CERTCTL_NETWORK_SCAN_ENABLED`): scans enabled TLS endpoints for certificate discovery
+  - Digest emailer loop (opt-in, 24 hours, `CERTCTL_DIGEST_INTERVAL`): sends scheduled certificate digest email to configured recipients
+  - Endpoint health loop (opt-in, 60 seconds, `CERTCTL_HEALTH_CHECK_INTERVAL`): continuous TLS health probes (M48)
+  - Cloud discovery loop (opt-in, 6 hours, `CERTCTL_CLOUD_DISCOVERY_INTERVAL`): cloud secret manager certificate discovery (M50)
+  Each loop includes `atomic.Bool` idempotency guards, error handling, and structured slog failure logs.
 - **Metrics Endpoints** — Two formats for monitoring integration:
   - `GET /api/v1/metrics` — JSON object with gauges, counters, and uptime for custom dashboards
   - `GET /api/v1/metrics/prometheus` — Prometheus exposition format (`text/plain; version=0.0.4`) for native scraping by Prometheus, Grafana Agent, Datadog, and other OpenMetrics-compatible collectors
@@ -459,7 +464,7 @@ Each section includes:
 | | Metrics JSON Endpoint | `GET /api/v1/metrics` (gauges, counters, uptime) | ✅ | ✅ | Set thresholds, configure alerting |
 | | Stats API (time-series) | `GET /api/v1/stats/*` (summary, status, expiration, jobs, issuance) | ✅ | ✅ | Integrate into dashboards, SLO tracking |
 | | Structured Logging | `slog` middleware with request IDs | ✅ | ✅ | Aggregate logs to SIEM, define retention policy |
-| | Background Scheduler | 7 loops (renewal 1h, jobs 30s, health 2m, notifications 1m, short-lived 30s, network scan 6h, digest 24h) | ✅ | ✅ | Alert on scheduler loop failures |
+| | Background Scheduler | 12 loops (8 always-on: renewal 1h, jobs 30s, job retry 5m I-001, job timeout 10m I-003, health 2m, notifications 1m, notif retry 2m I-005, short-lived 30s; 4 opt-in: network scan 6h, digest 24h, endpoint health 60s M48, cloud discovery 6h M50) | ✅ | ✅ | Alert on scheduler loop failures |
 | **CC7.2** Anomaly Detection | Immutable API Audit Trail | `internal/api/middleware/audit.go`, `GET /api/v1/audit` | ✅ | Enhanced (SIEM export) | Integrate into SIEM, search for anomalies, archive long-term |
 | | Expiration Threshold Alerting | Configurable per-policy (default 30/14/7/0 days) | ✅ | ✅ | Configure thresholds, integrate notifications |
 | | Status Auto-Transitions | Active → Expiring (30d) → Expired (0d) | ✅ | ✅ | Monitor status changes in audit trail |
diff --git a/docs/connectors.md b/docs/connectors.md
index f6880b1..b9c5846 100644
--- a/docs/connectors.md
+++ b/docs/connectors.md
@@ -1126,7 +1126,7 @@ The digest HTML template includes:
 - Expiring certificates table (color-coded by urgency: 7d, 14d, 30d)
 - Auto-refresh and responsive email layout

-**Scheduler Integration:** The 7th scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency.
+**Scheduler Integration:** The opt-in digest scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency. See `docs/architecture.md` for the full scheduler topology (12 loops, 8 always-on + 4 opt-in).

 Configuration:

@@ -1389,7 +1389,7 @@ curl -s -X DELETE http://localhost:8443/api/v1/network-scan-targets/nst-dmz

 ### Scheduler Integration

-When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs a 6th scheduler loop (alongside renewal, jobs, health, notifications, and short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health.
+When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs the opt-in network scanner scheduler loop alongside the always-on loops (renewal, jobs, job retry, job timeout, agent health, notifications, notification retry, short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health. See `docs/architecture.md` for the full 12-loop scheduler topology.

 ### Use Cases

@@ -1447,7 +1447,7 @@ Source path format: `gcp-sm://{project}/{secret-name}`. Sentinel agent: `cloud-g

 ### Cloud Discovery Scheduler

-All enabled cloud sources run on a shared scheduler loop (9th loop). The interval is configurable:
+All enabled cloud sources run on a shared opt-in cloud discovery scheduler loop (see `docs/architecture.md` for the full 12-loop scheduler topology). The interval is configurable:

 | Variable | Description | Default |
 |---|---|---|
diff --git a/docs/demo-advanced.md b/docs/demo-advanced.md
index 55b3563..b33851a 100644
--- a/docs/demo-advanced.md
+++ b/docs/demo-advanced.md
@@ -1155,7 +1155,7 @@ flowchart TB
         API["REST API\nGo net/http"]
         SVC["Service Layer\nBusiness Logic"]
         REPO["Repository Layer\ndatabase/sql + lib/pq"]
-        SCHED["Scheduler\n7 background loops"]
+        SCHED["Scheduler\n12 background loops\n(8 always-on + 4 opt-in)"]
         CONN["Connector Registry\nIssuer + Target + Notifier"]
     end

diff --git a/docs/features.md b/docs/features.md
index 3f31aae..50cb934 100644
--- a/docs/features.md
+++ b/docs/features.md
@@ -16,7 +16,7 @@ Complete reference of every feature shipped in certctl through v2.1.0 (April 202
 | Target connectors | 14 |
 | Notifier connectors | 6 channels |
 | Database tables | 21 (across 10 migrations) |
-| Background scheduler loops | 7 |
+| Background scheduler loops | 12 (8 always-on + 4 opt-in) |
 | Web dashboard pages | 24 |
 | Test functions | 1850+ |
 | Supported platforms | linux/amd64, linux/arm64, darwin/amd64, darwin/arm64 |
@@ -903,7 +903,7 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor
-Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the 9th scheduler loop (6h default).
+Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` for the full 12-loop scheduler topology).

 **Supported sources:**

@@ -1097,17 +1097,22 @@ Single SQL `UNION` query replaces the previous "fetch all, filter in Go" approac
-7 background loops, each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown.
+12 background loops (8 always-on + 4 opt-in), each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown. Authoritative topology table lives in `docs/architecture.md`.

-| Loop | Default Interval | Description |
-|---|---|---|
-| Renewal check | 1 hour | Check expiring certs, query ARI, create renewal jobs |
-| Job processor | 30 seconds | Process pending jobs |
-| Agent health check | 2 minutes | Check agent heartbeat staleness |
-| Notification processor | 1 minute | Send queued notifications |
-| Short-lived expiry check | 30 seconds | Mark short-lived certs expired |
-| Network scan | 6 hours | Run network discovery scans |
-| Digest | 24 hours | Send certificate digest email (does not run on startup) |
+| Loop | Default Interval | Always-on | Env Var | Description |
+|---|---|---|---|---|
+| Renewal check | 1 hour | Yes | — | Check expiring certs, query ARI, create renewal jobs |
+| Job processor | 30 seconds | Yes | — | Process pending jobs |
+| Job retry | 5 minutes | Yes | `CERTCTL_SCHEDULER_RETRY_INTERVAL` | Retry Failed jobs (I-001) |
+| Job timeout reaper | 10 minutes | Yes | `CERTCTL_JOB_TIMEOUT_INTERVAL` | Fail AwaitingCSR/AwaitingApproval jobs past timeout (I-003) |
+| Agent health check | 2 minutes | Yes | — | Check agent heartbeat staleness |
+| Notification processor | 1 minute | Yes | — | Send queued notifications |
+| Notification retry | 2 minutes | Yes | `CERTCTL_NOTIFICATION_RETRY_INTERVAL` | Exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005) |
+| Short-lived expiry check | 30 seconds | Yes | — | Mark short-lived certs expired |
+| Network scan | 6 hours | Opt-in | `CERTCTL_NETWORK_SCAN_ENABLED` | Run network discovery scans |
+| Digest | 24 hours | Opt-in | `CERTCTL_DIGEST_INTERVAL` | Send certificate digest email (does not run on startup) |
+| Endpoint health | 60 seconds | Opt-in | `CERTCTL_HEALTH_CHECK_INTERVAL` | Continuous TLS health probes (M48) |
+| Cloud discovery | 6 hours | Opt-in | `CERTCTL_CLOUD_DISCOVERY_INTERVAL` | Cloud secret manager certificate discovery (M50) |

 ---

diff --git a/docs/testing-guide.md b/docs/testing-guide.md
index a1d4302..d5fff51 100644
--- a/docs/testing-guide.md
+++ b/docs/testing-guide.md
@@ -5002,10 +5002,10 @@ curl -s -w "HTTP %{http_code}\n" -X DELETE -H "$AUTH" "$SERVER/api/v1/audit/$EVE

 > **Tip:** Open a second terminal with `docker compose logs -f certctl-server` to watch scheduler log output in real time.

-**Test 20.1.1 — Scheduler startup: all 7 loops registered**
+**Test 20.1.1 — Scheduler startup: all 12 loops registered**

 ```bash
-docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|health check\|notification\|short-lived\|network scan" | head -20
+docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|job retry\|job timeout\|health check\|notification\|notification retry\|short-lived\|network scan\|digest\|endpoint health\|cloud discovery" | head -30
 ```

 **What:** Checks server startup logs for scheduler loop registration.
@@ -7340,7 +7340,7 @@ These must be green before starting manual QA:

 | Test | Description | Method | Pass? | Date | Notes |
 |------|-------------|--------|-------|------|-------|
-| 20.1.1 | Scheduler startup: all 7 loops registered | Manual | ☐ | | |
+| 20.1.1 | Scheduler startup: all 12 loops registered | Manual | ☐ | | |
 | 20.1.2 | Job processor loop fires (30s interval) | Manual | ☐ | | |
 | 20.1.3 | Agent health check marks offline (2m interval) | Manual | ☐ | | |
 | 20.1.4 | Notification processor fires (1m interval) | Manual | ☐ | | |