mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 15:41:41 +00:00

shankar0123 264015059d ci(guards): fix G-3 (CERTCTL_MCP_READ_ONLY phantom) + S-1 (hardcoded 45)

Two CI guards tripped on the B4 + B5 closure commits:

1. G-3 env-docs-drift caught `CERTCTL_MCP_READ_ONLY` mentioned in
   docs/operator/security-bundle-5-audit-closure.md (Bundle 5 S8
   row) without a corresponding entry in internal/config/config.go.
   The env var is a v3 idea, not a shipped feature — the doc now
   describes the future gate without naming the literal env var,
   matching the G-3 phantom-env-var contract.

2. S-1 hardcoded-source-counts caught "all 45 migrations" in
   docs/operator/scheduler-ha.md (Bundle 4 D8 closure prose). Per
   the CLAUDE.md operating rule "Numeric claims about current state
   rot", swapped the literal count for the rebuild command
   `ls migrations/*.up.sql | wc -l`.

Both fixes are doc-only — no code change, no test change. The
underlying Bundle 4 + Bundle 5 closures stand.

Verification:
  bash scripts/ci-guards/G-3-env-docs-drift.sh            # clean
  bash scripts/ci-guards/S-1-hardcoded-source-counts.sh   # clean

2026-05-13 01:24:06 +00:00


Scheduler HA Semantics

Last reviewed: 2026-05-13

What happens when you run more than one certctl-server replica? Which scheduler loops are safe to run on every replica simultaneously, which need leader election, and which silently duplicate work today?

This page closes Bundle 4 audit findings D8 (singleton loop ambiguity) and HIGH-1 + MED-2 (HA semantics across the scheduler surface). It is a per-loop inventory of the 15 scheduler loops in internal/scheduler/scheduler.go, classified by HA-safety.

TL;DR

The only loops that are HA-safe today via FOR UPDATE SKIP LOCKED job claiming are jobProcessorLoop and jobRetryLoop. Every other loop is intra-process idempotent (a per-replica sync/atomic.Bool guard prevents a single replica from running the same loop twice at once) but is cross-replica duplicative — two replicas tick at the same interval and both do the work.

For eight of the thirteen non-job-claim loops this is harmless: the work is DB scanning that produces idempotent side effects (e.g., create-job-if-not-exists, send-notification-once-per-event-via-DB-ledger), and duplicate execution wastes CPU but cannot corrupt data. For five loops (notificationProcessLoop, notificationRetryLoop, digestLoop, crlGenerationLoop, cloudDiscoveryLoop) duplicate execution produces observable duplicate side effects in four classes: duplicate notifications (emails and webhooks), duplicate digest emails, duplicate CRL signings and writes, and duplicate cloud-discovery API calls. v2.1.0 supports server.replicas > 1 for read availability and API throughput, but operators running multi-replica should either accept these four duplication classes or pin replicas to 1 until the leader-election work lands.

True leader-election via Postgres advisory lock or Kubernetes lease is tracked in WORKSPACE-ROADMAP.md as a v3 work item.

Per-loop inventory

The 15 loops live in internal/scheduler/scheduler.go. Each is a func (s *Scheduler) <name>Loop(ctx context.Context) driven by a time.Ticker. The intra-process guard pattern is sync/atomic.Bool CompareAndSwap(false, true) at the top of the loop body — pre-Bundle-4 every loop already had this guard. Bundle 4 added the cross-replica classification below.
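The intra-process guard can be sketched as follows. This is a minimal illustration of the CompareAndSwap pattern, not the code in scheduler.go: the `loopGuard` type and `tryRun` method are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// loopGuard is a hypothetical helper wrapping the per-loop guard.
type loopGuard struct {
	running atomic.Bool
}

// tryRun executes work only if no other invocation is in flight in this
// process. It reports whether work actually ran.
func (g *loopGuard) tryRun(work func()) bool {
	if !g.running.CompareAndSwap(false, true) {
		return false // a previous tick is still running; skip this one
	}
	defer g.running.Store(false)
	work()
	return true
}

func main() {
	var g loopGuard
	nested := true
	ran := g.tryRun(func() {
		// A second tick arriving while the first is in flight is rejected.
		nested = g.tryRun(func() {})
	})
	fmt.Println(ran, nested)         // true false
	fmt.Println(g.tryRun(func() {})) // true: the guard is released after work returns
}
```

The guard lives in process memory, so it says nothing about other replicas. That gap is what the cross-replica classification covers.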

| # | Loop | HA mode | Side-effect duplication risk under N>1 replicas |
|---|------|---------|--------------------------------------------------|
| 1 | `renewalCheckLoop` | Idempotent: creates renewal jobs via `service.CheckExpiringCertificates`. | None. Duplicate ticks try to create the same `RenewalRequested` job; service-layer dedup (cert_id + status uniqueness window) collapses the second. Result: 2× CPU, 1× job, no data corruption. |
| 2 | `jobProcessorLoop` | HA-safe: `service.ProcessPendingJobs` ultimately calls `repository/postgres.JobRepository.ClaimPendingJobs` which uses `SELECT ... FOR UPDATE SKIP LOCKED`. | None. Postgres guarantees exactly-once row claim per tick across the replica set. |
| 3 | `jobRetryLoop` | HA-safe: `service.RetryFailedJobs` uses the same `ClaimPendingJobs` primitive (Bundle 1 audit fix H-6, commit `6cb4414`). | None. |
| 4 | `jobTimeoutLoop` | HA-safe-ish: `service.TimeoutStalledJobs` UPDATEs with `WHERE status = 'Running' AND started_at < $cutoff` inside a single statement. Two replicas may UPDATE the same row but the second UPDATE sees `Running → Failed` already applied and matches zero rows. | None. |
| 5 | `agentHealthCheckLoop` | Idempotent: UPDATEs `agents SET operational_status = 'Offline' WHERE last_heartbeat < $cutoff`. Two replicas running the same UPDATE land the same final state. | None. |
| 6 | `notificationProcessLoop` | Duplicates: reads pending notification queue, dispatches to Slack / PagerDuty / SMTP / Teams / OpsGenie, marks dispatched. The dispatch and the "mark dispatched" are not in a single transaction; two replicas can both dispatch the same notification before the mark lands. | Duplicate webhook + email sends. Bounded (at most N duplicates for N replicas) but operator-observable. |
| 7 | `notificationRetryLoop` | Duplicates: same shape as `notificationProcessLoop`. | Same as #6. |
| 8 | `shortLivedExpiryCheckLoop` | Idempotent: UPDATEs cert status to `Expired` based on `expires_at < NOW()`. Two replicas land the same status. | None. |
| 9 | `networkScanLoop` | Idempotent: invokes `service.NetworkScanService.ScanAllEnabledTargets` which iterates scan targets, probes each, and INSERTs discovered certs with `ON CONFLICT (fingerprint, agent_id, source_path) DO NOTHING`. | None on cert insertion. Duplicate TLS probes hit the operator's targets twice per tick. Operator may want to cap to 1 replica for low-egress-budget environments. |
| 10 | `digestLoop` | Duplicates: assembles the periodic digest email and dispatches via SMTP. Two replicas at the same digest tick both send. | Duplicate digest emails. |
| 11 | `healthCheckLoop` | Idempotent: runs the active TLS-fingerprint health-check sweep across deployed certs. Same idempotency story as #8. | None on state. Duplicate TLS probes to operator targets. |
| 12 | `cloudDiscoveryLoop` | Duplicates the scan; idempotent on the result store: fetches cert lists from AWS Secrets Manager / Azure Key Vault / GCP Secret Manager, INSERTs into discovered-certs with `ON CONFLICT DO NOTHING`. | Duplicate AWS/Azure/GCP API calls, which bill operator cloud accounts 2× per tick on the discovery API surface. Storage stays clean. |
| 13 | `crlGenerationLoop` | Duplicates the signing; last-writer wins on storage: regenerates CRL DER blobs per issuer, writes to `certificate_revocation_lists` table with `UPDATE ... WHERE issuer_id = $1`. Two replicas sign two CRLs with two `thisUpdate` timestamps; the later UPDATE wins. | Duplicate CA signing operations (cost on HSM-backed issuers). CRL output is single-valued but the audit trail records both signings. |
| 14 | `acmeGCLoop` | Idempotent: DELETEs ACME nonce / authz / order rows older than the retention window. Two replicas race the same DELETEs; second one matches zero rows. | None. |
| 15 | `sessionGCLoop` | Idempotent: DELETEs expired session rows. Same shape as #14. | None. |

What Bundle 4 closes

Bundle 4 does NOT introduce leader election. It introduces:

Documented HA truth table (this page) — operators know exactly which loops are safe to multi-replica and which produce operator-observable duplicates.
Migration HA via pg_advisory_lock + schema_migrations audit table (see internal/repository/postgres/db.go::RunMigrations). Pre-Bundle-4 every replica race-ran the full migrations directory on boot (count via ls migrations/*.up.sql | wc -l). Post-Bundle-4 the first replica acquires the lock, applies migrations, populates schema_migrations, releases the lock. Subsequent replicas block at the lock, then observe the audit table and skip every already-applied file.
Rate-limit scope statement at docs/operator/rate-limit-scope.md — process-local per-replica, restart-safe.
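The migration flow above can be illustrated with a small sketch of the skip step. It assumes the replica already holds the advisory lock and has read the applied file names from schema_migrations; the `pendingMigrations` function and the file names are hypothetical and do not mirror RunMigrations.

```go
package main

import (
	"fmt"
	"sort"
)

// pendingMigrations returns the migration files that are not yet recorded
// in the audit table, in lexical order. The caller is assumed to hold the
// advisory lock, so the applied set cannot change underneath it.
func pendingMigrations(files []string, applied map[string]bool) []string {
	var pending []string
	for _, f := range files {
		if !applied[f] {
			pending = append(pending, f)
		}
	}
	sort.Strings(pending)
	return pending
}

func main() {
	files := []string{"000002_b.up.sql", "000001_a.up.sql", "000003_c.up.sql"}

	// First replica: acquires the lock, sees an empty audit table, applies all.
	fmt.Println(pendingMigrations(files, map[string]bool{}))

	// Later replica: was blocked on the lock, now sees every file recorded.
	applied := map[string]bool{
		"000001_a.up.sql": true,
		"000002_b.up.sql": true,
		"000003_c.up.sql": true,
	}
	fmt.Println(len(pendingMigrations(files, applied))) // 0
}
```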

What Bundle 4 does NOT close (deferred, tracked in WORKSPACE-ROADMAP.md)

Leader election for notificationProcessLoop, notificationRetryLoop, digestLoop, cloudDiscoveryLoop, crlGenerationLoop. The cleanest implementation is a per-loop pg_try_advisory_lock(lock_id) at the top of runX so that only one replica per tick claims the work, with a small leader-renewal mechanic for long-running loops. This would close the four duplicate-side-effect classes above (the two notification loops share one class). It is a v3 work item.
Shared rate limits across replicas. See docs/operator/rate-limit-scope.md.

Operator guidance

Single-replica deployments (Helm server.replicas: 1 — the chart default): all 15 loops work as documented. No action needed.

Multi-replica deployments: review the duplicate-side-effect loops above against your tolerance:

If your alerting fan-out can swallow duplicate webhooks (PagerDuty deduplicates by dedup_key, Slack does not), set server.replicas > 1 and accept the duplication.
If your CRL signing uses an HSM with per-operation cost, pin to single-replica until leader election lands.
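If you're running cloud discovery against billed AWS/Azure/GCP secret-manager APIs, every additional replica multiplies the discovery API calls made per tick. At a 6 h discovery interval the absolute increase is small and usually bearable; at 30 min intervals the same doubling applies to twelve times as many ticks, so the extra spend is correspondingly larger.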

For any duplicate-side-effect class above, the operational mitigation is pinning server.replicas: 1 and scaling vertically. The certctl-server process is CPU-bound on issuance and IO-bound on Postgres; a single replica handles substantial fleets when given enough cores + a fast database.

Source-of-truth references

Scheduler loops: internal/scheduler/scheduler.go (15 <name>Loop functions, search ^func \(s \*Scheduler\) [a-zA-Z]+Loop).
Job claim primitive: internal/repository/postgres/job.go::ClaimPendingJobs (Bundle 1 H-6 closure, commit 6cb4414).
Migration HA: internal/repository/postgres/db.go::RunMigrations (Bundle 4 closure).
Rate-limit scope: docs/operator/rate-limit-scope.md.
Load-test scope: deploy/test/loadtest/README.md ("What it explicitly does NOT measure").
