Files
certctl/docs/operator/runbooks/expiry-alerts.md
T
shankar0123 b375df767e docs: Phase 2 mechanical file moves to subdirectory structure
Pure git mv operations; no content edits. Internal links remain pointing
at old paths and will be fixed in Phase 11. Per the Phase 1 audit
recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.

35 files moved across 8 audience-organized subdirectories:

  docs/getting-started/ (5):
    quickstart.md, concepts.md, examples.md, advanced-demo.md (was
    demo-advanced.md), why-certctl.md

  docs/reference/ (6):
    architecture.md, api.md (was openapi.md), mcp.md,
    intermediate-ca-hierarchy.md, deployment-model.md (was
    deployment-atomicity.md), vendor-matrix.md (was
    deployment-vendor-matrix.md)

  docs/reference/protocols/ (6):
    acme-server.md, acme-server-threat-model.md, scep-intune.md,
    est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)

  docs/operator/ (4):
    security.md, tls.md, database-tls.md, approval-workflow.md

  docs/operator/runbooks/ (3):
    cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md
    (was runbook-expiry-alerts.md), disaster-recovery.md

  docs/migration/ (3):
    from-certbot.md (was migrate-from-certbot.md), from-acmesh.md
    (was migrate-from-acmesh.md), cert-manager-coexistence.md (was
    certctl-for-cert-manager-users.md)

  docs/compliance/ (4):
    index.md (was compliance.md), soc2.md (was compliance-soc2.md),
    pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was
    compliance-nist.md)

  docs/contributor/ (4):
    testing-strategy.md, test-environment.md (was test-env.md),
    ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)

Deferred to later Phase 2 sub-phases:
  - connectors.md split (Phase 4): docs/connectors.md +
    docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
  - testing-guide.md prune (Phase 5): docs/testing-guide.md still
    at top level
  - features.md disperse (Phase 6): docs/features.md still at top
    level
  - legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md
    still at top level
  - ACME walkthrough re-homing (Phase 8): three
    docs/acme-*-walkthrough.md still at top level
  - Upgrade docs archive (Phase 3): two docs/upgrade-*.md still
    at top level

Cross-reference updates (Phase 11) will happen after all moves and
content edits land. Internal links to docs/* paths are temporarily
broken until that phase completes.
2026-05-05 02:49:28 +00:00

227 lines
7.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Runbook: certificate-expiry alerts (multi-channel)
This runbook covers the per-policy multi-channel expiry-alert dispatch
path that ships in certctl post-2026-05-03 (Rank 4 of the Infisical
deep-research deliverable). It complements the operator-facing
[Routing expiry alerts across channels](connectors.md#routing-expiry-alerts-across-channels)
section in `docs/connectors.md`.
Audience: a platform sysadmin or on-call engineer who needs to
configure, debug, or audit certctl's expiry-alert routing. Not a
walkthrough of how to install certctl — that lives in the README.
---
## End-to-end flow
```mermaid
flowchart TD
Tick["daily ticker (renewalCheckLoop)"]
Check["RenewalService.CheckExpiringCertificates"]
Tick --> Check --> Loop
subgraph Loop["for cert in expiring (≤30 days)"]
L1["1. Resolve RenewalPolicy"]
L2["2. Compute daysUntil"]
L3["3. updateCertExpiryStatus"]
L4["4. sendThresholdAlerts"]
L5["5. Create renewal job<br/>(if issuer registered +<br/>ARI allows)"]
L1 --> L2 --> L3 --> L4 --> L5
end
L4 --> Threshold
subgraph Threshold["per threshold"]
T1["a. resolve severity tier<br/>via AlertSeverityMap"]
T2["b. resolve channel set<br/>via AlertChannels[tier]"]
T1 --> T2 --> Channel
end
subgraph Channel["for each channel (fault-isolating)"]
C1["i. dedup via notification_events<br/>(cert, threshold, channel)"]
C2["ii. SendThresholdAlertOnChannel<br/>→ notifierRegistry[channel]<br/>→ Send(recipient, subj, body)"]
C3["iii. record audit row<br/>event_type=expiration_alert_sent<br/>metadata.channel, metadata.severity_tier"]
C4["iv. bump Prometheus counter<br/>certctl_expiry_alerts_total<br/>{channel, threshold, result}"]
C1 --> C2 --> C3 --> C4
end
```
The dispatch loop's per-channel error handling is
**fault-isolating**: PagerDuty's failure does NOT skip Slack/Email
at the same threshold. Each channel runs independently, with its
own dedup row + audit row + metric increment.
---
## Configuring the per-policy channel matrix
The matrix is a property of `RenewalPolicy`. Two new JSONB columns
on the `renewal_policies` table back it (migration 000026):
- `alert_channels JSONB``map[severity_tier][]channel_name`. Default `{}`
→ fall through to `DefaultAlertChannels` (Email-only at every tier).
- `alert_severity_map JSONB``map[threshold_days]severity_tier`. Default
`{}` → fall through to `DefaultAlertSeverityMap` (`30→informational,
14→warning, 7→warning, 0→critical`).
### Example: production-grade routing
```bash
curl -X PUT https://certctl.example.com/api/v1/renewal-policies/rp-production \
-H 'Authorization: Bearer ${TOKEN}' \
-H 'Content-Type: application/json' \
-d '{
"name": "Production CDN renewal policy",
"renewal_window_days": 30,
"auto_renew": true,
"max_retries": 3,
"retry_interval_seconds": 300,
"alert_thresholds_days": [30, 14, 7, 0],
"alert_channels": {
"informational": ["Slack"],
"warning": ["Slack", "Email"],
"critical": ["PagerDuty", "OpsGenie", "Email"]
},
"alert_severity_map": {
"30": "informational",
"14": "warning",
"7": "warning",
"0": "critical"
}
}'
```
After this PUT, the next renewal-loop tick that finds a cert under
this policy will fan out alerts as documented above.
### Example: opt out of informational alerts
If your team doesn't want T-30 informational alerts (you'd rather
hear about a cert only at warning tier and beyond):
```json
"alert_channels": {
"informational": [],
"warning": ["Email"],
"critical": ["PagerDuty", "Email"]
}
```
The empty `informational` list causes the dispatch loop to record
an `expiration_alert_skipped_no_channels` audit row at T-30 and
skip the dispatch. Other tiers still fire.
---
## Operator playbook
### "Did the on-call team get paged?"
```sql
SELECT created_at,
metadata->>'channel' AS channel,
metadata->>'threshold_days' AS threshold,
metadata->>'severity_tier' AS severity
FROM audit_events
WHERE event_type = 'expiration_alert_sent'
AND resource_id = '<cert-id>'
ORDER BY created_at DESC;
```
One row per (channel, threshold) attempt. If you see a row with
`channel = 'PagerDuty'` and `severity = 'critical'`, the page went
out (or was at least dispatched to the notifier).
### "Why didn't I get an alert at T-7?"
Three places to look:
1. **Audit log** — `SELECT FROM audit_events WHERE event_type IN
('expiration_alert_sent','expiration_alert_skipped_no_channels',
'expiration_alert_skipped_invalid_channel') AND resource_id =
'<cert-id>'`. If `expiration_alert_skipped_no_channels` appears,
your policy's tier list is empty for the resolved tier. If
`expiration_alert_skipped_invalid_channel` appears, your matrix
has a typo (the `metadata->>'invalid_channel'` field tells you
which value).
2. **Notifications table** —
`SELECT FROM notification_events WHERE certificate_id = '<cert-id>'
AND type = 'ExpirationWarning' ORDER BY created_at DESC`. If
rows exist with `channel = 'Slack'` and `status = 'failed'`,
the dispatch reached the channel but the channel rejected the
send. Look at the `error` column for the upstream message.
3. **Prometheus counters** —
`curl /api/v1/metrics/prometheus | grep certctl_expiry_alerts_total`.
Sustained `{result="failure"}` counts indicate a notifier
connector misconfiguration (bad webhook URL, expired API key,
etc.).
### "How do I test the matrix without waiting for a real expiry?"
certctl ships an admin endpoint for this:
```bash
curl -X POST https://certctl.example.com/api/v1/admin/notifications/test \
-H 'Authorization: Bearer ${TOKEN}' \
-H 'Content-Type: application/json' \
-d '{
"certificate_id": "mc-test-cert",
"threshold_days": 0,
"channel": "PagerDuty"
}'
```
This calls `NotificationService.SendThresholdAlertOnChannel`
directly and bypasses the renewal loop's threshold check. Useful
for "did I configure PagerDuty correctly?" without having to set
up a deliberately-expiring cert. The admin endpoint requires
`role=admin` (V3-Pro RBAC); V2 deploys gate it on the bearer
token only.
### "How do I rotate a notifier credential without downtime?"
1. Update the `CERTCTL_PAGERDUTY_ROUTING_KEY` (or equivalent) env
var in your deployment.
2. Restart `certctl-server`. The notifier registry rebuilds
with the new credential.
3. Confirm with the admin-test endpoint above against the cert
you most care about.
The renewal loop is idempotent — a missed tick during the restart
window does NOT cause double-dispatch on the next tick (per-channel
dedup on the `notification_events` table guards against that).
---
## Cardinality + cost
- Default 6 channels × 4 thresholds × 3 results = **72 Prometheus series**.
- Custom-thresholds policies (e.g. `[60, 45, 30, 14, 7, 3, 1, 0]`)
expand the threshold dimension proportionally — 6 × 8 × 3 = 144 series.
- Closed-enum discipline at the dispatch site means typos in
`alert_channels` do NOT grow this count.
- A daily renewal-loop tick over 10K certs each policy-bound to the
matrix above produces O(channels × thresholds × certs) audit rows
+ notification rows in the worst case (every cert has crossed
every threshold and no dedup applies). Operators sizing
Postgres should plan for an `audit_events` row count on the
order of `unique_certs × channels_per_critical_tier` per fan-out
batch — which is ~3-5× the pre-Rank-4 row count.
---
## V3-Pro forward path
Tracked at `cowork/WORKSPACE-ROADMAP.md` under "Adapter hardening":
- Per-owner / per-team / per-tenant channel routing (the matrix is
per-policy today, not per-owner).
- Calendar-aware suppression (no T-30 alerts on weekends for non-
on-call teams).
- Escalation chains (T-1 unanswered for 30m → escalate to
manager's PagerDuty).
- Per-channel rate limiting (downstream of I-005's retry+DLQ).