certctl/docs/operator/runbooks/expiry-alerts.md

Runbook: certificate-expiry alerts (multi-channel)

This runbook covers the per-policy multi-channel expiry-alert dispatch path that ships in certctl post-2026-05-03 (Rank 4 of the Infisical deep-research deliverable). It complements the operator-facing "Routing expiry alerts across channels" section in docs/connectors.md.

Audience: a platform sysadmin or on-call engineer who needs to configure, debug, or audit certctl's expiry-alert routing. Not a walkthrough of how to install certctl — that lives in the README.


End-to-end flow

flowchart TD
    Tick["daily ticker (renewalCheckLoop)"]
    Check["RenewalService.CheckExpiringCertificates"]

    Tick --> Check --> Loop

    subgraph Loop["for cert in expiring (≤30 days)"]
        L1["1. Resolve RenewalPolicy"]
        L2["2. Compute daysUntil"]
        L3["3. updateCertExpiryStatus"]
        L4["4. sendThresholdAlerts"]
        L5["5. Create renewal job<br/>(if issuer registered +<br/>ARI allows)"]
        L1 --> L2 --> L3 --> L4 --> L5
    end

    L4 --> Threshold

    subgraph Threshold["per threshold"]
        T1["a. resolve severity tier<br/>via AlertSeverityMap"]
        T2["b. resolve channel set<br/>via AlertChannels[tier]"]
        T1 --> T2 --> Channel
    end

    subgraph Channel["for each channel (fault-isolating)"]
        C1["i. dedup via notification_events<br/>(cert, threshold, channel)"]
        C2["ii. SendThresholdAlertOnChannel<br/>→ notifierRegistry[channel]<br/>→ Send(recipient, subj, body)"]
        C3["iii. record audit row<br/>event_type=expiration_alert_sent<br/>metadata.channel, metadata.severity_tier"]
        C4["iv. bump Prometheus counter<br/>certctl_expiry_alerts_total<br/>{channel, threshold, result}"]
        C1 --> C2 --> C3 --> C4
    end

The dispatch loop's per-channel error handling is fault-isolating: PagerDuty's failure does NOT skip Slack/Email at the same threshold. Each channel runs independently, with its own dedup row + audit row + metric increment.
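
The fault-isolation property described above can be sketched in Go as follows. This is a minimal illustration, not certctl's actual code: the `Notifier` interface, `funcNotifier`, `dispatchAll`, `errUpstream`, and the result labels `"success"` and `"skipped"` are assumptions made for the example (only `"failure"` appears in this runbook).

```go
package main

import (
	"errors"
	"fmt"
)

// Notifier is a hypothetical stand-in for a per-channel notifier connector.
type Notifier interface {
	Send(recipient, subject, body string) error
}

// funcNotifier adapts a plain function to the Notifier interface.
type funcNotifier func(recipient, subject, body string) error

func (f funcNotifier) Send(recipient, subject, body string) error {
	return f(recipient, subject, body)
}

// errUpstream simulates a channel rejecting the send.
var errUpstream = errors.New("upstream rejected the send")

// dispatchAll attempts every channel and records one result per channel.
// A failure on one channel is recorded and the loop moves on, so later
// channels at the same threshold are still attempted.
func dispatchAll(registry map[string]Notifier, channels []string, recipient, subject, body string) map[string]string {
	results := make(map[string]string, len(channels))
	for _, ch := range channels {
		n, ok := registry[ch]
		if !ok {
			results[ch] = "skipped" // assumed label for an unregistered channel
			continue
		}
		if err := n.Send(recipient, subject, body); err != nil {
			results[ch] = "failure"
			continue // do not abort the remaining channels
		}
		results[ch] = "success" // assumed label
	}
	return results
}

func main() {
	registry := map[string]Notifier{
		"PagerDuty": funcNotifier(func(_, _, _ string) error { return errUpstream }),
		"Slack":     funcNotifier(func(_, _, _ string) error { return nil }),
		"Email":     funcNotifier(func(_, _, _ string) error { return nil }),
	}
	results := dispatchAll(registry, []string{"PagerDuty", "Slack", "Email"},
		"oncall@example.com", "certificate expiring", "threshold: 0 days")
	fmt.Println(results)
}
```

In the sketch the PagerDuty failure is recorded, and Slack and Email are still attempted because the loop continues instead of returning on the first error.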


Configuring the per-policy channel matrix

The matrix is a property of RenewalPolicy. Two new JSONB columns on the renewal_policies table back it (migration 000026):

  • alert_channels JSONB: map[severity_tier][]channel_name. Default {} → fall through to DefaultAlertChannels (Email-only at every tier).
  • alert_severity_map JSONB: map[threshold_days]severity_tier. Default {} → fall through to DefaultAlertSeverityMap (30→informational, 14→warning, 7→warning, 0→critical).
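
The fall-through behavior of the two columns can be sketched as below. The default values come from the bullets above; the `policy` struct, `resolveTier`, and `resolveChannels` are hypothetical names, and the handling of a threshold that is missing from a non-empty custom map (reported as "not found" here) is an assumption rather than documented behavior. The JSON keys are strings ("30"), while the sketch uses integer keys for brevity.

```go
package main

import "fmt"

// Defaults as described in this runbook.
var defaultSeverityMap = map[int]string{
	30: "informational",
	14: "warning",
	7:  "warning",
	0:  "critical",
}

var defaultChannels = map[string][]string{
	"informational": {"Email"},
	"warning":       {"Email"},
	"critical":      {"Email"},
}

// policy is a hypothetical, trimmed-down view of a renewal policy.
type policy struct {
	AlertChannels    map[string][]string
	AlertSeverityMap map[int]string
}

// resolveTier maps a threshold (in days) to a severity tier, using the
// defaults when the policy's own map is empty.
func resolveTier(p policy, thresholdDays int) (string, bool) {
	m := p.AlertSeverityMap
	if len(m) == 0 {
		m = defaultSeverityMap
	}
	tier, ok := m[thresholdDays]
	return tier, ok
}

// resolveChannels returns the channel list for a tier, using the defaults
// when the policy's own matrix is empty. An explicitly empty list for a
// tier is returned as-is (the opt-out case).
func resolveChannels(p policy, tier string) []string {
	m := p.AlertChannels
	if len(m) == 0 {
		m = defaultChannels
	}
	return m[tier]
}

func main() {
	production := policy{
		AlertChannels: map[string][]string{
			"informational": {"Slack"},
			"warning":       {"Slack", "Email"},
			"critical":      {"PagerDuty", "OpsGenie", "Email"},
		},
	}
	for _, days := range []int{30, 14, 7, 0} {
		tier, _ := resolveTier(production, days)
		fmt.Println(days, tier, resolveChannels(production, tier))
	}
}
```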

Example: production-grade routing

curl -X PUT https://certctl.example.com/api/v1/renewal-policies/rp-production \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Production CDN renewal policy",
    "renewal_window_days": 30,
    "auto_renew": true,
    "max_retries": 3,
    "retry_interval_seconds": 300,
    "alert_thresholds_days": [30, 14, 7, 0],
    "alert_channels": {
      "informational": ["Slack"],
      "warning":       ["Slack", "Email"],
      "critical":      ["PagerDuty", "OpsGenie", "Email"]
    },
    "alert_severity_map": {
      "30": "informational",
      "14": "warning",
      "7":  "warning",
      "0":  "critical"
    }
  }'

After this PUT, the next renewal-loop tick that finds a cert under this policy will fan out alerts as documented above.

Example: opt out of informational alerts

If your team doesn't want T-30 informational alerts (you'd rather hear about a cert only at warning tier and beyond):

"alert_channels": {
  "informational": [],
  "warning":       ["Email"],
  "critical":      ["PagerDuty", "Email"]
}

The empty informational list causes the dispatch loop to record an expiration_alert_skipped_no_channels audit row at T-30 and skip the dispatch. Other tiers still fire.
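
As a rough sketch of how a resolved channel list could map onto the audit event types named in this runbook: `planAudit`, `auditRow`, and `knownChannels` are hypothetical, the enum below lists only the channel names that appear in this document (the real closed enum is larger), and treating an invalid entry as skipping only that channel rather than the whole tier is an assumption.

```go
package main

import "fmt"

// knownChannels is an illustrative closed enum limited to the channel
// names mentioned in this runbook.
var knownChannels = map[string]bool{
	"Email":     true,
	"Slack":     true,
	"PagerDuty": true,
	"OpsGenie":  true,
}

// auditRow is a hypothetical, simplified audit record. In the real table
// the channel lives in the metadata column.
type auditRow struct {
	EventType string
	Channel   string
}

// planAudit returns the audit rows a threshold crossing would produce for
// the given resolved channel list.
func planAudit(channels []string) []auditRow {
	if len(channels) == 0 {
		// Opt-out case: the tier resolved to an empty list.
		return []auditRow{{EventType: "expiration_alert_skipped_no_channels"}}
	}
	rows := make([]auditRow, 0, len(channels))
	for _, ch := range channels {
		if !knownChannels[ch] {
			// Typo or unknown channel: skip it, keep processing the rest.
			rows = append(rows, auditRow{EventType: "expiration_alert_skipped_invalid_channel", Channel: ch})
			continue
		}
		rows = append(rows, auditRow{EventType: "expiration_alert_sent", Channel: ch})
	}
	return rows
}

func main() {
	fmt.Println(planAudit([]string{}))                 // informational tier opted out
	fmt.Println(planAudit([]string{"Email", "Slak"}))  // one valid channel, one typo
}
```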


Operator playbook

"Did the on-call team get paged?"

SELECT created_at,
       metadata->>'channel'        AS channel,
       metadata->>'threshold_days' AS threshold,
       metadata->>'severity_tier'  AS severity
FROM audit_events
WHERE event_type = 'expiration_alert_sent'
  AND resource_id = '<cert-id>'
ORDER BY created_at DESC;

One row per (channel, threshold) attempt. If you see a row with channel = 'PagerDuty' and severity = 'critical', the page went out (or was at least dispatched to the notifier).

"Why didn't I get an alert at T-7?"

Three places to look:

  1. Audit log: SELECT * FROM audit_events WHERE event_type IN ('expiration_alert_sent', 'expiration_alert_skipped_no_channels', 'expiration_alert_skipped_invalid_channel') AND resource_id = '<cert-id>'. If expiration_alert_skipped_no_channels appears, your policy's channel list is empty for the resolved tier. If expiration_alert_skipped_invalid_channel appears, your matrix has a typo (the metadata->>'invalid_channel' field tells you which value).

  2. Notifications table: SELECT * FROM notification_events WHERE certificate_id = '<cert-id>' AND type = 'ExpirationWarning' ORDER BY created_at DESC. If rows exist with channel = 'Slack' and status = 'failed', the dispatch reached the channel but the channel rejected the send. Look at the error column for the upstream message.

  3. Prometheus counters: curl -s https://certctl.example.com/api/v1/metrics/prometheus | grep certctl_expiry_alerts_total. Sustained {result="failure"} counts indicate a notifier connector misconfiguration (bad webhook URL, expired API key, etc.).
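
For the Prometheus check, a small Go helper can total the failure samples per channel from the scraped text. This is a sketch under assumptions: `failuresByChannel` is a hypothetical helper, the sample lines in `main` are made up to match the {channel, threshold, result} label set, and the label parsing is deliberately naive (it does not handle commas or escaped quotes inside label values).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

const metricPrefix = "certctl_expiry_alerts_total{"

// failuresByChannel scans Prometheus text exposition output and sums the
// samples of certctl_expiry_alerts_total whose result label is "failure",
// keyed by the channel label.
func failuresByChannel(exposition string) map[string]float64 {
	out := map[string]float64{}
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, metricPrefix) {
			continue // comments, other metrics, blank lines
		}
		end := strings.LastIndex(line, "}")
		if end < len(metricPrefix) {
			continue
		}
		fields := strings.Fields(line[end+1:])
		if len(fields) == 0 {
			continue
		}
		val, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		var channel, result string
		for _, pair := range strings.Split(line[len(metricPrefix):end], ",") {
			k, v, ok := strings.Cut(pair, "=")
			if !ok {
				continue
			}
			v = strings.Trim(v, `"`)
			switch strings.TrimSpace(k) {
			case "channel":
				channel = v
			case "result":
				result = v
			}
		}
		if result == "failure" {
			out[channel] += val
		}
	}
	return out
}

func main() {
	// Made-up sample input for illustration only.
	sample := `# TYPE certctl_expiry_alerts_total counter
certctl_expiry_alerts_total{channel="Slack",threshold="7",result="failure"} 3
certctl_expiry_alerts_total{channel="Slack",threshold="0",result="failure"} 2
certctl_expiry_alerts_total{channel="Email",threshold="7",result="success"} 10
`
	fmt.Println(failuresByChannel(sample))
}
```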

"How do I test the matrix without waiting for a real expiry?"

certctl ships an admin endpoint for this:

curl -X POST https://certctl.example.com/api/v1/admin/notifications/test \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "certificate_id": "mc-test-cert",
    "threshold_days": 0,
    "channel": "PagerDuty"
  }'

This calls NotificationService.SendThresholdAlertOnChannel directly and bypasses the renewal loop's threshold check. Useful for "did I configure PagerDuty correctly?" without having to set up a deliberately-expiring cert. The admin endpoint requires role=admin (V3-Pro RBAC); V2 deploys gate it on the bearer token only.

"How do I rotate a notifier credential without downtime?"

  1. Update the CERTCTL_PAGERDUTY_ROUTING_KEY (or equivalent) env var in your deployment.
  2. Restart certctl-server. The notifier registry rebuilds with the new credential.
  3. Confirm with the admin-test endpoint above against the cert you most care about.

The renewal loop is idempotent — a missed tick during the restart window does NOT cause double-dispatch on the next tick (per-channel dedup on the notification_events table guards against that).
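
The dedup guard can be pictured as a set keyed on (cert, threshold, channel). The sketch below uses an in-memory map as a stand-in for the notification_events table; `dedupKey`, `dedupStore`, and `markIfNew` are hypothetical names, not certctl's API.

```go
package main

import "fmt"

// dedupKey mirrors the (cert, threshold, channel) tuple used for dedup.
type dedupKey struct {
	CertID        string
	ThresholdDays int
	Channel       string
}

// dedupStore is an in-memory stand-in for the notification_events lookup.
type dedupStore struct {
	seen map[dedupKey]bool
}

func newDedupStore() *dedupStore {
	return &dedupStore{seen: map[dedupKey]bool{}}
}

// markIfNew reports true the first time a key is seen and false on every
// later call, so a repeated tick does not dispatch the same alert twice.
func (s *dedupStore) markIfNew(k dedupKey) bool {
	if s.seen[k] {
		return false
	}
	s.seen[k] = true
	return true
}

func main() {
	store := newDedupStore()
	key := dedupKey{CertID: "mc-test-cert", ThresholdDays: 7, Channel: "Slack"}

	// Two consecutive ticks evaluate the same cert at the same threshold.
	for tick := 1; tick <= 2; tick++ {
		if store.markIfNew(key) {
			fmt.Println("tick", tick, "dispatch")
		} else {
			fmt.Println("tick", tick, "already sent, skipping")
		}
	}
}
```

Because the key includes the channel, a retry on one channel does not suppress or duplicate alerts on another.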


Cardinality + cost

  • Default 6 channels × 4 thresholds × 3 results = 72 Prometheus series.
  • Custom-thresholds policies (e.g. [60, 45, 30, 14, 7, 3, 1, 0]) expand the threshold dimension proportionally — 6 × 8 × 3 = 144 series.
  • Closed-enum discipline at the dispatch site means typos in alert_channels do NOT grow this count.
  • A daily renewal-loop tick over 10K certs, each policy-bound to the matrix above, produces O(channels × thresholds × certs) audit rows + notification rows in the worst case (every cert has crossed every threshold and no dedup applies). Operators sizing Postgres should plan for an audit_events row count on the order of unique_certs × channels_per_critical_tier per fan-out batch, which is ~3-5× the pre-Rank-4 row count.
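
The arithmetic behind these figures is a plain product of the label dimensions. A small sketch for plugging in your own numbers follows; `seriesCount` and `worstCaseRows` are hypothetical helpers, and the row figure is only the upper bound implied by the O(channels × thresholds × certs) expression above, not a measured value.

```go
package main

import "fmt"

// seriesCount is the number of Prometheus series for the counter:
// one per (channel, threshold, result) combination.
func seriesCount(channels, thresholds, results int) int {
	return channels * thresholds * results
}

// worstCaseRows is the upper bound on rows when every cert has crossed
// every threshold on every channel and no dedup applies.
func worstCaseRows(channels, thresholds, certs int) int {
	return channels * thresholds * certs
}

func main() {
	fmt.Println(seriesCount(6, 4, 3))        // default thresholds: 72
	fmt.Println(seriesCount(6, 8, 3))        // eight custom thresholds: 144
	fmt.Println(worstCaseRows(6, 4, 10_000)) // upper bound for 10K certs: 240000
}
```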

V3-Pro forward path

Tracked at cowork/WORKSPACE-ROADMAP.md under "Adapter hardening":

  • Per-owner / per-team / per-tenant channel routing (the matrix is per-policy today, not per-owner).
  • Calendar-aware suppression (no T-30 alerts on weekends for non-on-call teams).
  • Escalation chains (T-1 unanswered for 30m → escalate to manager's PagerDuty).
  • Per-channel rate limiting (downstream of I-005's retry+DLQ).