mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 22:01:36 +00:00
docs: Phase 2 mechanical file moves to subdirectory structure
Pure git mv operations; no content edits. Internal links remain pointing
at old paths and will be fixed in Phase 11. Per the Phase 1 audit
recommendations at cowork/docs-overhaul-phase-1-audit-2026-05-04/.
35 files moved across 8 audience-organized subdirectories:
docs/getting-started/ (5):
quickstart.md, concepts.md, examples.md, advanced-demo.md (was
demo-advanced.md), why-certctl.md
docs/reference/ (6):
architecture.md, api.md (was openapi.md), mcp.md,
intermediate-ca-hierarchy.md, deployment-model.md (was
deployment-atomicity.md), vendor-matrix.md (was
deployment-vendor-matrix.md)
docs/reference/protocols/ (6):
acme-server.md, acme-server-threat-model.md, scep-intune.md,
est.md, crl-ocsp.md, async-ca-polling.md (was async-polling.md)
docs/operator/ (4):
security.md, tls.md, database-tls.md, approval-workflow.md
docs/operator/runbooks/ (3):
cloud-targets.md (was runbook-cloud-targets.md), expiry-alerts.md
(was runbook-expiry-alerts.md), disaster-recovery.md
docs/migration/ (3):
from-certbot.md (was migrate-from-certbot.md), from-acmesh.md
(was migrate-from-acmesh.md), cert-manager-coexistence.md (was
certctl-for-cert-manager-users.md)
docs/compliance/ (4):
index.md (was compliance.md), soc2.md (was compliance-soc2.md),
pci-dss.md (was compliance-pci-dss.md), nist-sp-800-57.md (was
compliance-nist.md)
docs/contributor/ (4):
testing-strategy.md, test-environment.md (was test-env.md),
ci-pipeline.md, qa-test-suite.md (was qa-test-guide.md)
Deferred to later Phase 2 sub-phases:
- connectors.md split (Phase 4): docs/connectors.md +
docs/connector-{apache,f5,iis,k8s,nginx}.md still at top level
- testing-guide.md prune (Phase 5): docs/testing-guide.md still
at top level
- features.md disperse (Phase 6): docs/features.md still at top
level
- legacy-est-scep.md split (Phase 7): docs/legacy-est-scep.md
still at top level
- ACME walkthrough re-homing (Phase 8): three
docs/acme-*-walkthrough.md still at top level
- Upgrade docs archive (Phase 3): two docs/upgrade-*.md still
at top level
Cross-reference updates (Phase 11) will happen after all moves and
content edits land. Internal links to docs/* paths are temporarily
broken until that phase completes.
@@ -0,0 +1,141 @@
# Issuance approval workflow

certctl can gate certificate issuance + renewal on a per-profile, two-person-integrity check. Compliance customers (PCI-DSS Level 1, FedRAMP Moderate / High, SOC 2 Type II, HIPAA) configure this on production-tier `CertificateProfile` rows so every renewal-loop tick or manual `POST /api/v1/certificates/{id}/renew` blocks at `JobStatusAwaitingApproval` until a different actor approves.

This closes the procurement-checklist question "How do you enforce two-person integrity on cert issuance?" Without this surface the answer is "we don't"; with `requires_approval=true` on the profile, the answer is "here's the RBAC contract + here's the audit query that proves bypass mode is off in production."

## End-to-end flow

```mermaid
sequenceDiagram
    autonumber
    participant A as Operator A<br/>(or scheduler)
    participant SVC as CertificateService<br/>.TriggerRenewal
    participant JOB as Job + ApprovalRequest
    participant B as Operator B
    participant APR as ApprovalService.Approve
    participant SCH as Scheduler

    A->>SVC: POST /api/v1/certificates/{id}/renew<br/>(or renewal-loop tick)
    SVC->>JOB: read profile.RequiresApproval;<br/>create Job @ JobStatusAwaitingApproval;<br/>create ApprovalRequest<br/>(state=pending, requested_by=Operator A)
    Note over JOB,SCH: Scheduler skips —<br/>AwaitingApproval is NOT a dispatchable status
    B->>JOB: GET /api/v1/approvals?state=pending
    B->>APR: POST /api/v1/approvals/{id}/approve<br/>(decided_by=Operator B, note=...)
    APR->>APR: RBAC: reject if Operator B == Operator A<br/>→ ErrApproveBySameActor (HTTP 403)
    APR->>JOB: ApprovalRequest → state=approved;<br/>Job AwaitingApproval → Pending;<br/>audit row (action=approval_approved,<br/>actor=Operator B);<br/>certctl_approval_decisions_total<br/>{outcome=approved,profile_id=...}++
    SCH->>JOB: pick up Pending → dispatch to issuer connector
    JOB-->>A: cert issues normally
```

## Configuration

Set `requires_approval=true` on a `CertificateProfile`:

```bash
curl -X PUT https://certctl/api/v1/profiles/p-prod-cdn \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production CDN",
    "requires_approval": true,
    ...
  }'
```

Every certificate bound to that profile is now gated. The default is `requires_approval=false` — existing profiles keep the historical unattended renewal path.

## RBAC: the two-person integrity rule

The actor that triggers a renewal **cannot** be the actor that approves it. The check happens at the service layer and surfaces as **HTTP 403** at the handler. The error message contains the substring `two-person integrity` so server-log greps detect attempted self-approvals.

This is the load-bearing compliance contract. Pinned by:

- `internal/service/approval_test.go::TestApproval_Approve_RejectsSameActor` — service-level pin.
- `internal/api/handler/approval_test.go::TestApproval_HandlerApproveAsSameActor_Returns403` — handler-level pin (HTTP 403 + body contains "two-person integrity").
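
The rule itself reduces to one comparison at the service layer plus a status mapping at the handler. The following is a minimal Python sketch of that shape, not certctl's Go implementation: `ApprovalRequest`, `approve`, `SameActorError`, and `http_status_for` are hypothetical names chosen for illustration. Only the `two-person integrity` substring and the 403 mapping come from the text above.

```python
from dataclasses import dataclass
from typing import Optional


class SameActorError(Exception):
    """Hypothetical stand-in for the service-layer same-actor rejection."""


@dataclass
class ApprovalRequest:
    requested_by: str
    state: str = "pending"
    decided_by: Optional[str] = None


def approve(request: ApprovalRequest, decided_by: str) -> ApprovalRequest:
    # The approver must be a different identity than the requester.
    if decided_by == request.requested_by:
        raise SameActorError(
            "two-person integrity: approver must differ from requester"
        )
    request.state = "approved"
    request.decided_by = decided_by
    return request


def http_status_for(error: Exception) -> int:
    # Handler-level mapping: same-actor rejections surface as 403.
    return 403 if isinstance(error, SameActorError) else 500
```

Keeping the comparison in the service function rather than the handler means every caller, HTTP or otherwise, goes through the same check.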

## Operator playbook: "I need to approve a renewal"

```bash
# 1. Find the pending request
curl -s "https://certctl/api/v1/approvals?state=pending" \
  -H "Authorization: Bearer $API_KEY" | jq

# 2. Inspect the request — confirm CN, SANs, requester
curl -s "https://certctl/api/v1/approvals/ar-abc123" \
  -H "Authorization: Bearer $API_KEY" | jq

# 3. Approve as a different actor than the requester
curl -X POST "https://certctl/api/v1/approvals/ar-abc123/approve" \
  -H "Authorization: Bearer $APPROVER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"note":"approved per ticket SECOPS-12345"}'

# 4. Confirm the job transitioned to Pending
curl -s "https://certctl/api/v1/jobs?certificate_id=mc-foo" \
  -H "Authorization: Bearer $API_KEY" | jq '.[] | {id,status,type}'
```

To **reject** instead, swap the path: `POST /api/v1/approvals/{id}/reject` with the same body shape. The job transitions to `Cancelled` and the `note` is recorded in the audit row.

## Operator playbook: "approval timed out"

The scheduler reaper transitions stale pending requests + their linked jobs after `CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT` (default `168h` = 7 days):

- `ApprovalRequest.state` → `expired`
- `Job.Status` → `Cancelled` (with `error_message="approval expired"`)
- One audit row per expiry (`action=approval_expired, actor=system-reaper, actorType=System`)
- `certctl_approval_decisions_total{outcome="expired",profile_id="..."}` increments

Resolve by re-triggering the renewal once the underlying delay is sorted:

```bash
curl -X POST "https://certctl/api/v1/certificates/mc-foo/renew" \
  -H "Authorization: Bearer $API_KEY"
```

Tighten the timeout for short-window deployments via the env var, e.g. `CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT=24h`.
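
To reason about when a pending request becomes eligible for the reaper, the timeout arithmetic can be sketched as below. This is an illustrative Python sketch, not the reaper's code: `parse_timeout` and `is_expired` are hypothetical helpers, the parser accepts only a subset of Go duration syntax (`h`, `m`, `s` units), and treating the boundary as inclusive (`>=`) is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_timeout(value: str) -> timedelta:
    """Parse a subset of Go duration syntax such as '168h', '24h', or '1h30m'."""
    parts = re.findall(r"(\d+)([hms])", value)
    # Reject anything the simple pattern did not consume completely.
    if not parts or "".join(n + u for n, u in parts) != value:
        raise ValueError(f"unsupported duration: {value!r}")
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in parts))


def is_expired(created_at: datetime, now: datetime, timeout: timedelta) -> bool:
    # Assumption: a request is stale once its age reaches the timeout.
    return now - created_at >= timeout


created = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
deadline = created + parse_timeout("168h")
print("default timeout expires at", deadline.isoformat())
```

With the default `168h`, a request created on a Friday morning is cancelled the following Friday morning; with `24h` it is cancelled the next day.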

## Compliance control mapping

| Standard | Control | What this surface satisfies |
|---|---|---|
| PCI-DSS 4.0 | **§6.4.5** (Separation of duties for production change-management) | Same-actor RBAC pin; audit row carries both `requested_by` and `decided_by` so reviewers see two distinct identities per change. |
| NIST SP 800-53 | **SA-15** (Development process; two-person review for security-relevant changes) | Service-layer `ErrApproveBySameActor` + `TestApproval_Approve_RejectsSameActor` pin the contract. Bypass-mode emits a typed audit row (`action=approval_bypassed`) so compliance reviewers detect dev-mode misuse via `SELECT count(*) FROM audit_events WHERE actor='system-bypass'` returning > 0. |
| SOC 2 Type II | **CC6.1** (Logical access — restrict, monitor, terminate) | Per-decision audit row + `certctl_approval_decisions_total{outcome,profile_id}` Prometheus counter. Operators alert on sustained `outcome="rejected"` or `outcome="expired"` bursts. |
| HIPAA | **§164.308(a)(4)** (Information access management) | Same surface — the per-policy gating + audit trail is the access-management control. |

## Bypass mode (dev / CI ONLY)

Setting `CERTCTL_APPROVAL_BYPASS=true` short-circuits the workflow: every `RequestApproval` call auto-approves with `decided_by=system-bypass` and `actorType=System`. Used by dev / CI to keep renewal-scheduler tests fast without standing up an approver.

**Production deploys MUST leave this unset.** The bypass emits a typed audit event (`action=approval_bypassed`) so compliance auditors detect misuse via:

```sql
SELECT count(*) FROM audit_events WHERE actor = 'system-bypass';
```

which should return a count of **0 in production** and a high count in dev. The certctl-server logs a `WARN` line at boot when bypass is enabled; operators should alert on that log line in production environments.
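
Because `count(*)` always yields exactly one row, the production check is that the returned value is 0, not that the result set is empty. The sketch below demonstrates this against an in-memory SQLite table. The two-column `audit_events` schema is an assumption made for the demo (the real table lives in Postgres and has more columns), and `bypass_event_count` is a hypothetical helper.

```python
import sqlite3


def bypass_event_count(conn: sqlite3.Connection) -> int:
    """Run the audit query and return the single count value it yields."""
    row = conn.execute(
        "SELECT count(*) FROM audit_events WHERE actor = 'system-bypass'"
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
# Assumed minimal schema, for illustration only.
conn.execute("CREATE TABLE audit_events (actor TEXT, action TEXT)")
conn.execute("INSERT INTO audit_events VALUES ('operator-b', 'approval_approved')")
print(bypass_event_count(conn))  # 0: no bypass events recorded

conn.execute("INSERT INTO audit_events VALUES ('system-bypass', 'approval_bypassed')")
print(bypass_event_count(conn))  # 1: bypass mode was used at least once
```

A monitoring job built on this query would compare the value against zero rather than test for an empty result.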

## Prometheus metrics

```
certctl_approval_decisions_total{outcome,profile_id}    counter
certctl_approval_pending_age_seconds                    histogram
                                                        (le buckets:
                                                         60, 300, 1800, 3600,
                                                         21600, 86400, +Inf)
```

`outcome` is one of `approved`, `rejected`, `expired`, `bypassed`. `profile_id` is the `CertificateProfile.ID` that triggered the gate (cardinality stays bounded because operators typically have fewer than 100 profiles in production).

The pending-age histogram observes seconds-since-creation at the moment of decision. Alert when p99 climbs into the multi-hour buckets (`21600` and above); compliance customers usually have a same-day decision deadline.
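
To see how a p99 figure falls out of the bucket counts, the sketch below estimates a quantile by linear interpolation inside the bucket that contains the target rank. It follows the same general idea as PromQL's `histogram_quantile` but is a simplified illustration, not Prometheus code. `estimate_quantile` and the sample `counts` are hypothetical.

```python
import math

# Upper bounds (seconds) of the le buckets listed above.
BUCKETS = [60, 300, 1800, 3600, 21600, 86400, math.inf]


def estimate_quantile(q, cumulative_counts, bounds=BUCKETS):
    """Estimate a quantile from cumulative bucket counts by linear interpolation."""
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, (upper, count) in enumerate(zip(bounds, cumulative_counts)):
        if count >= rank:
            if math.isinf(upper):
                # Cannot interpolate into +Inf; report the last finite bound.
                return float(bounds[i - 1])
            lower = bounds[i - 1] if i > 0 else 0.0
            below = cumulative_counts[i - 1] if i > 0 else 0
            if count == below:
                return float(upper)
            return lower + (upper - lower) * (rank - below) / (count - below)
    return math.nan


# Hypothetical scrape: 100 decisions, two of which took longer than 6 hours.
counts = [50, 80, 90, 95, 98, 100, 100]
p99 = estimate_quantile(0.99, counts)
print(f"p99 pending age: {p99 / 3600:.1f}h")
```

For the sample counts the estimate is 54000 seconds (15 hours), which lands in the `86400` bucket and would breach a same-day decision target measured in business hours.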

## Future free V2 work

- **M-of-N approver chains.** Today's primitive is single-approver. Future V2 work adds chains — e.g., "needs 2 of 3 platform-team members."
- **Time-windowed auto-approve.** Today's reaper hard-cancels at the static deadline. Policy-driven time-windowed auto-approve (T+30m unattended → cancel; T+24h business hours → escalate) is future work.
- **External ticketing integration.** ServiceNow / JIRA bridging so approval state mirrors the change-management record.
- **Per-owner / per-team routing.** Today's pool is global. Per-owner / per-team routing matches cert ownership to approver pools.
- **Approval delegation.** Today the same-actor rule is strict. Time-bounded delegation is future work.

Tracked in `WORKSPACE-ROADMAP.md` under the Future Free V2 Work section — every item ships free under BSL.
@@ -0,0 +1,117 @@
# Database TLS — Postgres Transport Encryption

**Audit reference:** Bundle B / M-018. PCI-DSS v4.0 Req 4 §2.2.5; CWE-319.

certctl talks to Postgres over a single connection-string URL controlled by the
`CERTCTL_DATABASE_URL` env var. The `sslmode` query parameter on that URL
selects the transport-encryption posture. Pre-Bundle-B all the bundled
deployment artifacts (Helm chart, docker-compose) hard-coded `sslmode=disable`.
Bundle B exposes that as an operator-facing knob with a documented default and
explicit opt-in / opt-out paths for the four real-world deployment shapes.

## Quick reference

| Deployment shape | Default `sslmode` | When to change |
|---|---|---|
| Helm chart, bundled Postgres, in-cluster | `disable` | When the cluster does not provide pod-network encryption (CNI without WireGuard / IPSec) and the workload is in PCI-DSS scope. |
| Helm chart, external Postgres (RDS / Cloud SQL / Azure DB) | not auto-set | **Always** set to `verify-full` and provide the cloud provider's server CA bundle. |
| docker-compose, bundled Postgres on docker bridge | `disable` | Demo/dev only; not a deployment shape we expect operators to harden. |
| docker-compose / k8s with external Postgres | not auto-set | **Always** set `CERTCTL_DATABASE_URL` to a connection string with `sslmode=verify-full`. |

`sslmode` values come from `lib/pq` (the underlying driver). The full set is:
`disable`, `allow`, `prefer`, `require`, `verify-ca`, `verify-full`. PCI-DSS
v4.0 Req 4 §2.2.5 considers `verify-ca` the floor for sensitive-data transport;
`verify-full` is the floor for systems exposed to spoofing risk (it adds
hostname validation against the server cert's CN/SAN).
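
A pre-deploy check can read the `sslmode` straight out of the connection string and compare it against the floor described above. The sketch below is one hedged way to do that in Python. `sslmode_of` and `meets_floor` are hypothetical helpers, and treating a missing `sslmode` as a failure is a deliberately conservative assumption, since the effective default then depends on the driver.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Ordered weakest to strongest, matching the list above.
SSLMODES = ["disable", "allow", "prefer", "require", "verify-ca", "verify-full"]


def sslmode_of(database_url: str) -> Optional[str]:
    """Return the sslmode query parameter, or None when the URL omits it."""
    values = parse_qs(urlsplit(database_url).query).get("sslmode")
    return values[-1] if values else None


def meets_floor(database_url: str, floor: str = "verify-ca") -> bool:
    mode = sslmode_of(database_url)
    if mode not in SSLMODES:
        return False  # missing or unknown: treat as failing the check
    return SSLMODES.index(mode) >= SSLMODES.index(floor)


url = "postgres://certctl:secret@db.example.internal:5432/certctl?sslmode=require"
print(sslmode_of(url), meets_floor(url))
```

Running this in CI against the rendered `CERTCTL_DATABASE_URL` catches a values file that quietly ships `require` or `disable` into a scoped environment.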

## Helm chart (Bundle B)

Bundle B adds two values under `postgresql.tls`:

```yaml
postgresql:
  tls:
    mode: disable          # disable | require | verify-ca | verify-full
    caSecretRef: ""        # Secret with ca.crt key (required for verify-ca / verify-full)
```

The chart pipes `postgresql.tls.mode` into the `?sslmode=` parameter of the
generated `CERTCTL_DATABASE_URL` (see `templates/_helpers.tpl::certctl.databaseURL`).
For external Postgres, set `postgresql.enabled: false` and override
`server.env.CERTCTL_DATABASE_URL` directly with the full connection string —
the operator authoring an external-DB values file owns the entire URL.

### Example: external RDS with verify-full

```yaml
postgresql:
  enabled: false   # Disable bundled Postgres

# A single server.env map: a second top-level `server:` key would be a
# duplicate YAML key and could silently replace this one.
server:
  env:
    # `|-` strips the trailing newline so the URL is passed through clean.
    CERTCTL_DATABASE_URL: |-
      postgres://certctl:STRONGPW@my-db.cabc12345.us-east-1.rds.amazonaws.com:5432/certctl?sslmode=verify-full
    # lib/pq honors PGSSLROOTCERT for the verify-{ca,full} CA bundle path.
    PGSSLROOTCERT: /etc/postgresql-ca/ca.crt

# Provide the AWS RDS root CA bundle as a secret + mount.
# AWS publishes per-region root certs at https://truststore.pki.rds.amazonaws.com/
extraVolumes:
  - name: rds-ca
    secret:
      secretName: rds-ca-bundle   # kubectl create secret generic rds-ca-bundle --from-file=ca.crt=...

extraVolumeMounts:
  - name: rds-ca
    mountPath: /etc/postgresql-ca
    readOnly: true
```

## docker-compose (development / demo)

The bundled `deploy/docker-compose.yml` keeps `sslmode=disable` as the default
because the Postgres container shares the docker bridge network with the certctl
server and the compose file is not a production deployment artifact. To opt in:

```bash
export CERTCTL_DATABASE_URL='postgres://certctl:certctl@postgres:5432/certctl?sslmode=verify-full'
docker compose up
```

## Verification

For any non-`disable` mode, confirm the connection actually negotiated TLS:

```bash
# From inside the certctl-server container or any host with psql + the same URL:
psql "$CERTCTL_DATABASE_URL" -c "SELECT ssl, version, cipher FROM pg_stat_ssl WHERE pid = pg_backend_pid();"

# Expected output for verify-full: ssl=t, version=TLSv1.3 (or TLSv1.2), cipher=...
```

If `ssl=f` appears, the connection silently fell back to plaintext — investigate
the cert chain or sslmode value before treating the deployment as PCI-compliant.

## What this does NOT cover

* **Postgres-to-Postgres replication** — if you run a replica, replica-primary
  TLS is configured via the Postgres server itself (`pg_hba.conf` +
  `ssl=on`); it is independent of certctl's `CERTCTL_DATABASE_URL`.
* **Backup transport** — `pg_dump` / `pg_basebackup` honor the same `sslmode`
  parameter when invoked with the URL form, but the bundled chart's backup
  story (if any) is operator-owned.
* **Encryption at rest** — `sslmode` is a transport concern only. Disk
  encryption is the cloud provider's storage layer (RDS, EBS, etc.) or the
  operator's Postgres TDE / disk LUKS / etc.

## Reverting

If `sslmode=verify-full` causes connection failures (most common: missing CA
bundle, wrong hostname), drop temporarily to `sslmode=require` to confirm TLS
is at least negotiated, then add the CA bundle and ratchet back up. Never
revert to `sslmode=disable` on a system carrying real cert metadata —
audit_events alone contains enough operator/issuer/target identity to justify
TLS in any scoped environment.
@@ -0,0 +1,334 @@
# Runbook: cloud-target deployment connectors (AWS ACM + Azure Key Vault)

This runbook covers the SDK-driven cloud target connectors that ship in
certctl post-2026-05-03 (Rank 5 of the Infisical deep-research
deliverable). It complements the operator-facing
[AWS Certificate Manager](connectors.md#aws-certificate-manager-acm) and
[Azure Key Vault](connectors.md#azure-key-vault) sections in
`docs/connectors.md`.

Audience: a platform sysadmin or SRE who needs to configure, debug, or
audit certctl's cloud-target deploys. Not a walkthrough of how to
install certctl.

---

## End-to-end flow (cloud targets)

```mermaid
flowchart TD
    Renew["cert renewed → renewal job created"]
    Pick["agent picks up DeployCertificate work item"]
    Dispatch["target.Connector.DeployCertificate(ctx, request)"]

    Renew --> Pick --> Dispatch
    Dispatch --> AWS
    Dispatch --> AZ

    subgraph AWS["AWS ACM path"]
        A1["1. rotate-in-place only:<br/>DescribeCertificate(arn)"]
        A2["2. GetCertificate(arn) —<br/>capture snapshot bytes for rollback"]
        A3["3. ImportCertificate(arn, new_bytes) —<br/>fresh ARN OR rotate-in-place"]
        A4["4. AddTagsToCertificate(arn, provenance) —<br/>ACM strips on re-import; we re-apply"]
        A5["5. DescribeCertificate(arn) —<br/>verify serial matches expected"]
        A6["6. ON MISMATCH: rollback<br/>ImportCertificate(arn, snapshot_bytes)"]
        A1 --> A2 --> A3 --> A4 --> A5 --> A6
    end

    subgraph AZ["Azure Key Vault path"]
        Z1["1. GetCertificate(name, '' = latest) —<br/>capture snapshot CER bytes"]
        Z2["2. Build PFX from cert+chain+key<br/>(PKCS#12 via go-pkcs12)"]
        Z3["3. ImportCertificate(name, PFX, tags) —<br/>ALWAYS creates a new version"]
        Z4["4. Tags carried forward automatically"]
        Z5["5. GetCertificate(name, '' = latest) —<br/>verify serial matches expected"]
        Z6["6. ON MISMATCH: rollback<br/>ImportCertificate(name, snapshot_PFX) —<br/>new version"]
        Z1 --> Z2 --> Z3 --> Z4 --> Z5 --> Z6
    end

    A6 --> Audit
    Z6 --> Audit
    Audit["7. Audit row + Prometheus counters<br/>certctl_deploy_attempts_total{target_type, result}<br/>certctl_deploy_rollback_total{target_type, outcome}"]
```
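
Both paths in the diagram share one pattern: snapshot the live cert, import the new one, verify the serial, and roll back to the snapshot on mismatch. The Python sketch below shows that control flow against an in-memory fake. It is not the connector code: `FakeCertStore`, `deploy`, and the returned outcome strings are hypothetical, and the `corrupt_next_import` flag exists only to simulate a bad import so the rollback branch can be exercised.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCertStore:
    """In-memory stand-in for ACM / Key Vault; not a real SDK client."""
    current: dict = field(default_factory=dict)  # name -> (serial, pem)
    corrupt_next_import: bool = False

    def get(self, name):
        return self.current.get(name)

    def import_cert(self, name, serial, pem):
        if self.corrupt_next_import:
            # Simulate the store ending up with something other than what was sent.
            self.corrupt_next_import = False
            serial = "garbled"
        self.current[name] = (serial, pem)


def deploy(store, name, serial, pem):
    snapshot = store.get(name)            # capture current cert for rollback
    store.import_cert(name, serial, pem)  # import the renewed cert
    live = store.get(name)                # re-read and verify the serial
    if live is not None and live[0] == serial:
        return "success"
    if snapshot is None:
        return "failed_no_snapshot"       # first deploy: nothing to restore
    store.import_cert(name, *snapshot)    # rollback to the snapshot
    return "rolled_back"


store = FakeCertStore(current={"api-prod": ("01", "old-pem")})
print(deploy(store, "api-prod", "02", "new-pem"))
store.corrupt_next_import = True
print(deploy(store, "api-prod", "03", "newer-pem"))
print(store.get("api-prod"))
```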

---

## Configuring an AWS ACM target

### Minimum config

```bash
# Double quotes around the header so the shell expands ${TOKEN}.
curl -X POST https://certctl.example.com/api/v1/targets \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Production ALB cert",
    "type": "AWSACM",
    "agent_id": "ag-server",
    "config": {
      "region": "us-east-1",
      "tags": {"env": "production"}
    }
  }'
```

Empty `certificate_arn` on first deploy = ACM creates a fresh ARN; the
deployment record's Metadata captures it. Update the
`deployment_targets.config.certificate_arn` field via the GUI / API /
direct SQL to pin the ARN for subsequent renewals.

### Minimum IAM policy

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "acm:ImportCertificate",
      "acm:GetCertificate",
      "acm:DescribeCertificate",
      "acm:ListCertificates",
      "acm:AddTagsToCertificate"
    ],
    "Resource": "arn:aws:acm:us-east-1:*:certificate/*"
  }]
}
```

Pin `Resource` to the specific region / account where the ALB lives.
Cross-account deploys use AssumeRole — configure the agent's role with
`sts:AssumeRole` against the target account's role ARN.
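
Pinning `Resource` is easy to get wrong by hand when several regions or accounts are involved, so the policy can be generated instead. The sketch below is a hedged illustration in Python. `acm_policy` is a hypothetical helper and the account ID is the placeholder value used elsewhere in this runbook. One caveat worth verifying against the AWS service authorization reference: list-style actions such as `acm:ListCertificates` may not honor a certificate-scoped `Resource` and could need their own statement.

```python
import json

# The five actions from the minimum policy above.
ACTIONS = [
    "acm:ImportCertificate",
    "acm:GetCertificate",
    "acm:DescribeCertificate",
    "acm:ListCertificates",
    "acm:AddTagsToCertificate",
]


def acm_policy(region: str, account_id: str) -> dict:
    """Build the minimum policy with Resource pinned to one region and account."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ACTIONS,
            "Resource": f"arn:aws:acm:{region}:{account_id}:certificate/*",
        }],
    }


print(json.dumps(acm_policy("us-east-1", "123456789012"), indent=2))
```

Generating one policy per region keeps the wildcard confined to the certificate ID segment of the ARN.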

### Auth: IRSA (recommended for EKS-hosted agents)

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: certctl-agent
  namespace: certctl-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/certctl-acm-deployer
```

Trust policy on `certctl-acm-deployer`:

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE:sub": "system:serviceaccount:certctl-system:certctl-agent"
      }
    }
  }]
}
```

---

## Configuring an Azure Key Vault target

### Minimum config

```bash
# Double quotes around the header so the shell expands ${TOKEN}.
curl -X POST https://certctl.example.com/api/v1/targets \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Production AGW cert",
    "type": "AzureKeyVault",
    "agent_id": "ag-server",
    "config": {
      "vault_url": "https://prod-vault.vault.azure.net",
      "certificate_name": "api-prod",
      "credential_mode": "managed_identity",
      "tags": {"env": "production"}
    }
  }'
```

### Minimum RBAC role

Off-the-shelf builtin: **Key Vault Certificates Officer** (assign it at
the vault scope).

Custom minimum-permission role:

```json
{
  "properties": {
    "roleName": "certctl-keyvault-deployer",
    "description": "Minimum permissions for certctl Key Vault target",
    "assignableScopes": [
      "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault-name>"
    ],
    "permissions": [{
      "actions": [],
      "notActions": [],
      "dataActions": [
        "Microsoft.KeyVault/vaults/certificates/import/action",
        "Microsoft.KeyVault/vaults/certificates/read",
        "Microsoft.KeyVault/vaults/certificates/listversions/read"
      ],
      "notDataActions": []
    }]
  }
}
```

### Auth: AKS workload identity (recommended for AKS-hosted agents)

Annotate the agent's ServiceAccount:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: certctl-agent
  namespace: certctl-system
  annotations:
    azure.workload.identity/client-id: <app-registration-client-id>
  labels:
    azure.workload.identity/use: "true"
```

Federated credential on the app registration:

```json
{
  "name": "certctl-agent-federated",
  "issuer": "https://<oidc-issuer-url>",
  "subject": "system:serviceaccount:certctl-system:certctl-agent",
  "audiences": ["api://AzureADTokenExchange"]
}
```

Set `credential_mode: workload_identity` on the deployment_target
config.

---

## Operator playbook

### "Did the cert get imported to ACM / Key Vault?"

**AWS:**

```bash
aws acm describe-certificate \
  --certificate-arn arn:aws:acm:us-east-1:...:certificate/<id> \
  --query 'Certificate.{Status:Status,Serial:Serial,Imported:ImportedAt,NotAfter:NotAfter}'
# describe-certificate does not return tags; list them separately:
aws acm list-tags-for-certificate \
  --certificate-arn arn:aws:acm:us-east-1:...:certificate/<id>
```

**Azure:**

```bash
az keyvault certificate show \
  --vault-name prod-vault \
  --name api-prod \
  --query '{Thumbprint:x509ThumbprintHex, Version:id, NotAfter:attributes.expires, Tags:tags}'
```

In both cases, the `certctl-managed-by` tag confirms the cert was
imported by certctl (and not someone running aws-cli directly).

### "Why did the rollback fail?"

The Prometheus counter
`certctl_deploy_rollback_total{outcome="also_failed"}` ticks when the
rollback's own ImportCertificate / Set call also returns an error. Look
at the agent's slog at ERROR level for the per-call diagnostic; the
underlying cloud SDK error message tells you whether it was IAM
denial, throttling, or a structural input problem.

Manual recovery:

**AWS ACM:**

```bash
# Get the snapshot of a known-good cert from S3 / Vault / wherever the
# operator stores backup PEMs:
aws acm import-certificate \
  --certificate fileb://known-good.crt \
  --private-key fileb://known-good.key \
  --certificate-chain fileb://known-good.chain \
  --certificate-arn arn:aws:acm:us-east-1:...:certificate/<id>
# ACM does not accept tags on a re-import, so tag in a separate call:
aws acm add-tags-to-certificate \
  --certificate-arn arn:aws:acm:us-east-1:...:certificate/<id> \
  --tags Key=certctl-managed-by,Value=manual-recovery
```

**Azure Key Vault:**

```bash
# Import a fresh PFX as a new version under the same name:
az keyvault certificate import \
  --vault-name prod-vault \
  --name api-prod \
  --file known-good.pfx \
  --tags certctl-managed-by=manual-recovery
```

After the manual recovery, certctl's next renewal-loop tick re-verifies
the live cert via `ValidateDeployment` and resumes normal operation.

### "How do I know certctl is the only one writing to this ARN / vault cert?"

**AWS — via CloudTrail:**

```
EventName = "ImportCertificate"
Resources.ARN = "arn:aws:acm:us-east-1:...:certificate/<id>"
```

Filter by user identity to see which principal made each call. The
certctl agent's IAM role / IRSA-bound role should be the only writer.

**Azure — via Activity Log:**

```bash
az monitor activity-log list \
  --resource-id /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.KeyVault/vaults/<vault>/certificates/<name> \
  --offset 30d \
  --query "[?operationName.value=='Microsoft.KeyVault/vaults/certificates/import/action'].{caller:caller, time:eventTimestamp}"
```

---

## Cardinality + cost

- Per-target-type Prometheus counters: 2 new `target_type` label
  values (AWSACM + AzureKeyVault) × 2 results = 4
  `certctl_deploy_attempts_total` series. Comfortable.
- AWS ACM costs: ImportCertificate is free; CloudTrail logs at $2 per
  GB. Renewing 100 certs/month adds ~10 KB to CloudTrail.
- Azure Key Vault costs: certificate operations $0.03 per 10K
  operations (V2 pricing as of 2026-05). 100 certs/month = $0.0009 in
  cert-op spend, assuming the three Key Vault calls per deploy shown
  in the flow (snapshot, import, verify), i.e. 300 operations.
  Activity Log retention is configurable (default 90 days, free).
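
The arithmetic behind these figures is small enough to check directly. In the Python sketch below, the price and the target/result counts are the ones quoted in the bullets. The three-operations-per-deploy figure is an assumption taken from the Key Vault path in the flow diagram (snapshot, import, verify), and it is what makes the quoted $0.0009 work out. Both function names are hypothetical.

```python
KEY_VAULT_PRICE_PER_10K_OPS = 0.03  # USD, as quoted above
OPS_PER_DEPLOY = 3                  # assumption: snapshot + import + verify


def key_vault_monthly_cost(certs_per_month, ops_per_deploy=OPS_PER_DEPLOY):
    """Cert-operation spend in USD for one month of renewals."""
    operations = certs_per_month * ops_per_deploy
    return operations * KEY_VAULT_PRICE_PER_10K_OPS / 10_000


def deploy_attempt_series(target_types, results):
    """Number of certctl_deploy_attempts_total series for the new targets."""
    return target_types * results


print(f"${key_vault_monthly_cost(100):.4f} per month")
print(deploy_attempt_series(target_types=2, results=2), "series")
```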

---

## V3-Pro forward path

Tracked at `cowork/WORKSPACE-ROADMAP.md` under "Adapter hardening":

- **AWS CloudFront direct-attach** — UpdateDistribution after an ACM
  ImportCertificate so the CloudFront edge picks up the new cert
  without operator intervention. Requires `cloudfront:UpdateDistribution`
  IAM permission on top of the ACM minimum.
- **Azure Front Door direct-attach** — UpdateRoutingConfig equivalent.
- **AWS ALB / Azure App Gateway auto-bind** — currently operators
  attach the ARN / KID URI to the LB out-of-band (Terraform);
  V3-Pro adds the auto-attach step.
- **Soft-delete recovery for Azure Key Vault** — V2 always
  re-imports as a new version; V3 detects soft-deleted prior
  versions and offers operator-confirmed recovery.
- **GCP Certificate Manager target** — Google Cloud's equivalent to
  ACM; mirrors the AWS ACM connector shape. Separate cloud,
  separate connector.
@@ -0,0 +1,348 @@
# Disaster recovery runbook

> **Status (this document):** Production hardening II Phase 10
> deliverable. Codifies the fail-safe behaviors that already exist in
> the codebase and the operator procedures for recovering from
> common failure modes. Nothing in this runbook requires new code —
> if a procedure here doesn't work as documented, that's a bug in
> docs (file an issue).

This runbook is the SOC 2 / PCI procurement-team deliverable: it tells
auditors and on-call operators what to do when a piece of certctl's
state corrupts, when a CA key needs rotation, or when Postgres needs
a point-in-time restore. Read it once when you set up certctl; print
the [DR checklist](#dr-checklist) and pin it near your on-call rotation.

## Contents

1. [Overview — what's already automatic](#overview)
2. [CRL cache recovery](#crl-cache-recovery)
3. [OCSP responder cert recovery](#ocsp-responder-cert-recovery)
4. [OCSP response cache recovery](#ocsp-response-cache-recovery)
5. [CA private-key rotation](#ca-private-key-rotation)
6. [Postgres restore](#postgres-restore)
7. [Trust-bundle reload semantics (SCEP / EST / Intune)](#trust-bundle-reload-semantics)
8. [DR checklist](#dr-checklist)

## Overview

certctl is engineered so most failure modes are auto-recoverable
without operator action. The fail-safes in the codebase:

- **CRL cache corruption** — the scheduler's `crlGenerationLoop`
  regenerates the CRL for every issuer on its tick (default 1h via
  `CERTCTL_CRL_GENERATION_INTERVAL`). A corrupt or missing
  `crl_cache` row causes the next HTTP fetch to fall through to the
  live-signing path; the scheduler then writes the fresh CRL back to
  cache.
- **OCSP responder cert missing** — `ensureOCSPResponder` lazily
  bootstraps the responder cert on the first OCSP request after a
  missing row. The CA-key signing operation is rare (only at
  bootstrap / 7-day rotation cycle), so this is fast even on a
  cold cache.
- **OCSP response cache corruption** — the read-through facade in
  `CAOperationsSvc.GetOCSPResponseWithNonce` falls through to live
  signing on cache miss + writes the fresh response back. Operators
  can `DELETE FROM ocsp_response_cache;` and the cache rebuilds
  organically as relying parties query.
- **Trust anchor reload after a half-rotation** — `TrustAnchorHolder`
  (used by SCEP/Intune + EST mTLS) keeps the OLD pool in place when
  a SIGHUP-triggered reload fails (parse error, expired cert). The
  GUI reload modal surfaces the typed error so the operator can
  correct the file and retry without taking the EST/SCEP endpoint
  down.

These fail-safes mean most of this runbook is "delete the corrupt
row + wait for the next tick" rather than "restore from backup +
manually re-issue." The runbook documents the full procedures
anyway because compliance auditors need to see them written down.
|
||||
|
||||
## CRL cache recovery
|
||||
|
||||
**Symptom:** `GET /.well-known/pki/crl/{issuer_id}` returns 500, or
|
||||
the CRL it returns has the wrong revocations / wrong signature, or
|
||||
parses as garbage.
|
||||
|
||||
**Diagnosis:**
|
||||
|
||||
```bash
# 1. Look at the cached row directly:
psql -c "SELECT issuer_id, length(crl_der), this_update, next_update,
                generated_at, generation_duration_ms, revoked_count
         FROM crl_cache WHERE issuer_id = 'iss-local';"

# 2. Look at recent generation events:
psql -c "SELECT started_at, succeeded, error, duration_ms
         FROM crl_generation_events
         WHERE issuer_id = 'iss-local'
         ORDER BY started_at DESC LIMIT 10;"
```
|
||||
**Recovery:**
|
||||
|
||||
```bash
# Force regeneration on next request by deleting the cache row.
# The next HTTP fetch falls through to the live-signing path AND the
# next crlGenerationLoop tick (≤1h by default) writes a fresh row.
psql -c "DELETE FROM crl_cache WHERE issuer_id = 'iss-local';"

# Verify:
curl -sS --cacert /path/to/ca.crt \
  https://certctl.example.com:8443/.well-known/pki/crl/iss-local \
  | openssl crl -inform DER -noout -text \
  | head -20
```
|
||||
**Worst case** — if the underlying revocation data in
|
||||
`certificate_revocations` is also corrupt, restore Postgres
|
||||
(see [Postgres restore](#postgres-restore)) and the CRL regenerates
|
||||
from the restored data on the next tick.
|
||||
|
||||
## OCSP responder cert recovery
|
||||
|
||||
**Symptom:** OCSP requests return 500 with errors like "responder
|
||||
not configured" or "failed to load responder key."
|
||||
|
||||
**Diagnosis:**
|
||||
|
||||
```bash
psql -c "SELECT issuer_id, cert_subject, not_before, not_after,
                created_at, key_path
         FROM ocsp_responder_certs
         WHERE issuer_id = 'iss-local';"

# Check the on-disk responder key file (path from the row above):
ls -la /etc/certctl/ocsp-responder-keys/iss-local.key
```
|
||||
**Recovery:**
|
||||
|
||||
```bash
# Delete the responder row. The next OCSP request triggers
# ensureOCSPResponder which generates a fresh keypair, signs a new
# responder cert with the CA key (rare CA-key use), and persists
# the new row + the on-disk key file (mode 0600 enforced).
psql -c "DELETE FROM ocsp_responder_certs WHERE issuer_id = 'iss-local';"

# If the on-disk key file is also corrupt, delete it before
# triggering the bootstrap:
rm -f /etc/certctl/ocsp-responder-keys/iss-local.key

# Trigger the bootstrap by issuing one OCSP request:
curl -sS --cacert /path/to/ca.crt \
  https://certctl.example.com:8443/.well-known/pki/ocsp/iss-local/00 \
  > /dev/null

# Verify the new row + file:
psql -c "SELECT * FROM ocsp_responder_certs WHERE issuer_id = 'iss-local';"
ls -la /etc/certctl/ocsp-responder-keys/iss-local.key
```
|
||||
The new responder cert carries the same `id-pkix-ocsp-nocheck`
|
||||
extension as the original (per RFC 6960 §4.2.2.2.1) so relying
|
||||
parties accept it without recursing through OCSP for the responder
|
||||
itself.
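
Since the bootstrap is expected to leave the key file at mode 0600, a quick permission check after recovery can catch a key that was restored by hand with looser permissions. The helper below is a minimal sketch: `key_file_mode_ok` is a hypothetical name, not something certctl ships, and it only inspects the permission string that `ls` reports.

```bash
# Hypothetical helper: succeeds only when the path is a regular file
# whose permissions are rw------- (mode 0600).
key_file_mode_ok() {
  perms=$(ls -ld "$1" 2>/dev/null | cut -c1-10)
  [ "$perms" = "-rw-------" ]
}

# Example use against the key path shown in the recovery steps:
if key_file_mode_ok /etc/certctl/ocsp-responder-keys/iss-local.key; then
  echo "responder key mode is 0600"
else
  echo "responder key missing or mode is not 0600"
fi
```

If the check fails on a file you restored manually, `chmod 0600` it before restarting certctl.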
|
||||
|
||||
## OCSP response cache recovery
|
||||
|
||||
**Symptom:** an OCSP request returns a stale response (e.g. "good"
|
||||
for a cert you just revoked). This usually means the
|
||||
`InvalidateOnRevoke` wire failed to fire — see the warning logs from
|
||||
`RevocationSvc.RevokeCertificateWithActor`.
|
||||
|
||||
**Recovery:**
|
||||
|
||||
```bash
# Delete the stale cache entry. The next OCSP request falls through
# to live signing which reads the now-current revocation_status.
psql -c "DELETE FROM ocsp_response_cache
         WHERE issuer_id = 'iss-local' AND serial_hex = 'deadbeef...';"

# Verify the next fetch returns "revoked":
curl -sS --cacert /path/to/ca.crt \
  https://certctl.example.com:8443/.well-known/pki/ocsp/iss-local/deadbeef... \
  | openssl ocsp -respin /dev/stdin -resp_text -CAfile /path/to/ca.crt \
  | grep "Cert Status"
```
|
||||
For a fleet-wide invalidation (e.g. you rotated the CA key — see
|
||||
next section), nuke the whole cache:
|
||||
|
||||
```bash
psql -c "TRUNCATE ocsp_response_cache;"
```
|
||||
The cache rebuilds organically as relying parties query. There's no
|
||||
service-degradation window because the live-sign fallback is always
|
||||
available; only the per-request CPU cost goes up until the cache
|
||||
warms back up.
|
||||
|
||||
## CA private-key rotation
|
||||
|
||||
**Symptom:** scheduled rotation cycle (annual or longer), or
|
||||
emergency rotation due to suspected compromise.
|
||||
|
||||
This procedure rotates the CA private key for the local issuer.
|
||||
After rotation, every existing cert chains to the OLD CA cert which
|
||||
remains trusted by relying parties until its `notAfter` (typical
|
||||
10y); newly-issued certs chain to the NEW CA cert.
|
||||
|
||||
**Procedure:**
|
||||
|
||||
1. **Back up the current CA cert + key.** The on-disk paths are
   `CERTCTL_CA_CERT_PATH` / `CERTCTL_CA_KEY_PATH` (typically
   `/etc/certctl/ca.crt` + `/etc/certctl/ca.key`). Copy both to
   a secure offline location with at least 2y retention (relying
   parties may still send OCSP requests against certs the OLD CA
   issued).
2. **Generate a new keypair + cert.** For self-signed mode:
   ```bash
   openssl ecparam -name prime256v1 -genkey -noout -out new-ca.key
   openssl req -x509 -key new-ca.key -days 3650 \
     -subj "/CN=certctl Local CA" -out new-ca.crt
   ```
   For sub-CA mode, generate a CSR and have your enterprise root
   sign it instead.
3. **Stop certctl.** `kill -TERM <pid>` or `docker stop certctl`.
4. **Move the new files into place + back up the old:**
   ```bash
   mv /etc/certctl/ca.crt /etc/certctl/ca.crt.old-rotated-20XX-XX-XX
   mv /etc/certctl/ca.key /etc/certctl/ca.key.old-rotated-20XX-XX-XX
   mv new-ca.crt /etc/certctl/ca.crt
   mv new-ca.key /etc/certctl/ca.key
   chmod 0600 /etc/certctl/ca.key
   ```
5. **Clear the OCSP responder cert table** so the responder
   bootstrap re-fires against the new CA:
   ```bash
   psql -c "DELETE FROM ocsp_responder_certs;"
   ```
6. **Truncate the CRL cache** so the next `crlGenerationLoop` tick
   regenerates the CRL signed by the new CA:
   ```bash
   psql -c "TRUNCATE crl_cache;"
   ```
7. **Truncate the OCSP response cache** so future OCSP requests
   live-sign with the new CA's responder cert:
   ```bash
   psql -c "TRUNCATE ocsp_response_cache;"
   ```
8. **Start certctl.** The startup preflight loads the new CA cert +
   key. The next OCSP request bootstraps a new responder cert.
9. **Verify:**
   ```bash
   # Issue a test cert
   curl ... new-cert
   # Confirm chain to the new CA. In self-signed mode, also verify the
   # signature: the issuer name alone matches any CA with the same subject.
   openssl x509 -in new-cert -noout -issuer
   openssl verify -CAfile /etc/certctl/ca.crt new-cert
   ```
|
||||
**Future:** when the HSM/PKCS#11 driver bundle (`cowork/hsm-pkcs11-
|
||||
driver-prompt.md`) ships, this rotation procedure changes
|
||||
substantially — the HSM-backed key never moves, only the cert wrap
|
||||
rotates. The signer interface seam is the load-bearing prerequisite
|
||||
for that.
|
||||
|
||||
## Postgres restore
|
||||
|
||||
certctl's full state lives in Postgres. The on-disk artifacts (CA
|
||||
cert/key, RA cert/key for SCEP, responder keys for OCSP, trust
|
||||
bundles for SCEP/Intune/EST mTLS) are operator-managed; everything
|
||||
else is in DB rows.
|
||||
|
||||
**Restore procedure:**
|
||||
|
||||
1. Stop certctl. `kill -TERM <pid>` or `docker stop certctl`.
2. Restore the Postgres database from your point-in-time backup
   (`pg_restore` or your managed-DB equivalent).
3. Run any migrations newer than the backup's snapshot:
   ```bash
   migrate -path migrations/ -database "$DATABASE_URL" up
   ```
4. **Truncate the caches** that may now hold stale data referencing
   pre-restore rows:
   ```bash
   psql -c "TRUNCATE crl_cache;"
   psql -c "TRUNCATE ocsp_response_cache;"
   ```
5. Start certctl. The schedulers regenerate caches on their next
   ticks.
|
||||
**Recoverable from DB only:** managed certificates, revocations,
|
||||
audit log, jobs, agents, owners, teams, profiles, issuer/target/
|
||||
notifier configs, scheduled tasks, network scan results.
|
||||
|
||||
**Operator-managed (NOT in DB):**
|
||||
- CA cert + key (`CERTCTL_CA_CERT_PATH` / `CERTCTL_CA_KEY_PATH`)
|
||||
- SCEP RA cert + key per profile
|
||||
- OCSP responder keys per issuer (`CERTCTL_OCSP_RESPONDER_KEY_DIR`)
|
||||
- SCEP/Intune trust anchor PEM bundles
|
||||
- EST mTLS client CA trust bundles
|
||||
- `CERTCTL_API_KEY`, `CERTCTL_AGENT_BOOTSTRAP_TOKEN`,
|
||||
`CERTCTL_CONFIG_ENCRYPTION_KEY`
|
||||
|
||||
Back these up out-of-band on the same cadence as your Postgres
|
||||
backups. Without them, a restored DB is unusable.
|
||||
|
||||
## Trust-bundle reload semantics
|
||||
|
||||
This section codifies the fail-safe behavior that's already in code,
|
||||
for compliance auditors who need to see the procedure documented.
|
||||
|
||||
**Pattern:** every trust-bundle holder (`internal/trustanchor.Holder`,
|
||||
used by SCEP/Intune dispatcher + EST mTLS sibling route) implements
|
||||
the same SIGHUP-equivalent reload semantics:
|
||||
|
||||
- A bad reload (parse error, expired cert, empty bundle) keeps the
|
||||
OLD pool in place. The endpoint stays up; the operator sees the
|
||||
typed error in the GUI Reload modal.
|
||||
- The reload is atomic. There's no window where the holder is
|
||||
empty or pointing at a half-loaded bundle.
|
||||
- In-flight requests use a snapshot taken at request-start. A
|
||||
request that crosses a SIGHUP uses the OLD pool — no mid-request
|
||||
validation drift.
|
||||
|
||||
**Operator workflow:**
|
||||
|
||||
1. Receive the new trust bundle (e.g., rotated Intune Connector
|
||||
signing cert, rotated EST mTLS client CA).
|
||||
2. Overwrite the on-disk PEM file at the configured path.
|
||||
3. Trigger reload via the GUI (`/scep` Profiles tab → Reload trust
|
||||
anchor; `/est` Profiles tab → same) OR send `kill -HUP <certctl-pid>`
|
||||
directly.
|
||||
4. The Reload modal returns success or shows the typed error. On
|
||||
error, fix the file (`openssl x509 -in trust.pem -noout -text`
|
||||
to validate) and retry; the OLD pool stays in place between
|
||||
attempts.
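
One caveat with the validation command in step 4: `openssl x509 -in trust.pem` reads only the first certificate in a PEM file, so an expired certificate further down a multi-cert bundle goes unnoticed. The sketch below splits the bundle and checks each certificate in turn. `check_trust_bundle` is a hypothetical helper, not part of certctl, and it approximates only the "expired cert" and "empty bundle" failure cases listed above; the holder's own validation remains authoritative.

```bash
# Hypothetical pre-flight check for a PEM trust bundle.
# Prints "ok" when every certificate parses and is unexpired.
check_trust_bundle() {
  bundle=$1
  tmpdir=$(mktemp -d) || return 2
  # Split the bundle into one file per certificate.
  awk -v dir="$tmpdir" '
    /-----BEGIN CERTIFICATE-----/ { n++; out = dir "/cert-" n ".pem" }
    out != "" { print > out }
    /-----END CERTIFICATE-----/ { close(out); out = "" }
  ' "$bundle"
  rc=0
  count=0
  for cert in "$tmpdir"/cert-*.pem; do
    [ -e "$cert" ] || continue
    count=$((count + 1))
    # -checkend 0 exits non-zero when the certificate has already
    # expired; a parse failure also lands in this branch.
    if ! openssl x509 -in "$cert" -noout -checkend 0 >/dev/null 2>&1; then
      echo "expired or unparsable: $(basename "$cert")"
      rc=1
    fi
  done
  rm -rf "$tmpdir"
  if [ "$count" -eq 0 ]; then
    echo "empty bundle"
    return 1
  fi
  if [ "$rc" -eq 0 ]; then
    echo "ok"
  fi
  return "$rc"
}

# Usage: check_trust_bundle /path/to/trust.pem
```

Running this before step 3 avoids a failed reload, although a failed reload is itself safe because the OLD pool stays in place.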
|
||||
|
||||
## DR checklist
|
||||
|
||||
Print this. Pin it near your on-call rotation.
|
||||
|
||||
```
☐ Backups: Postgres backup runs nightly + retention ≥ 30 days
☐ Backups: CA cert + key offsite + retention ≥ NotAfter + 2y
☐ Backups: OCSP responder keys offsite (or accept rotate-from-CA on restore)
☐ Backups: Trust anchor PEMs offsite
☐ Backups: Operator-managed env vars (API_KEY, BOOTSTRAP_TOKEN,
           CONFIG_ENCRYPTION_KEY) in a separate secret manager

☐ Quarterly: dry-run a Postgres restore into a staging environment
☐ Quarterly: verify CA cert NotAfter > 1y
☐ Quarterly: rotate the OCSP responder cert (auto-handled by
             ensureOCSPResponder; verify the rotation actually fires by
             diffing the responder row's serial_number quarter-over-quarter)

☐ Annually: dry-run a full DR: restore Postgres + CA + responders
            into a clean environment + issue + revoke a test cert end-to-end
☐ Annually: rotate API_KEY, AGENT_BOOTSTRAP_TOKEN
☐ Every 5y: rotate the CA private key (see CA rotation section above)
```
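
The quarterly "CA cert NotAfter > 1y" item can be scripted with `openssl x509 -checkend`. The following is a minimal sketch, assuming the CA cert sits at the typical `/etc/certctl/ca.crt` path mentioned earlier; `ca_cert_has_a_year_left` is a hypothetical helper name.

```bash
# Hypothetical helper for the quarterly "CA cert NotAfter > 1y" check.
ca_cert_has_a_year_left() {
  cert=$1
  if [ ! -r "$cert" ]; then
    echo "unreadable: $cert"
    return 2
  fi
  # 31536000 seconds = 365 days. -checkend exits non-zero when the
  # certificate expires within that window.
  if openssl x509 -in "$cert" -noout -checkend 31536000 >/dev/null 2>&1; then
    echo "ok"
  else
    echo "expires within 365 days (or is not a parsable certificate)"
    return 1
  fi
}

# Usage: ca_cert_has_a_year_left /etc/certctl/ca.crt
```

A non-zero exit status makes the helper easy to wire into a cron job or monitoring probe.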
|
||||
|
||||
## Related docs
|
||||
|
||||
- [`crl-ocsp.md`](crl-ocsp.md) — CRL/OCSP responder operator guide.
|
||||
- [`tls.md`](tls.md) — control-plane TLS bootstrap.
|
||||
- [`security.md`](security.md) — production-grade security posture.
|
||||
- [`scep-intune.md`](scep-intune.md) — SCEP/Intune trust-anchor
|
||||
rotation specifics.
|
||||
- [`est.md`](est.md) — EST mTLS trust-bundle rotation specifics.
|
||||
@@ -0,0 +1,226 @@
|
||||
# Runbook: certificate-expiry alerts (multi-channel)
|
||||
|
||||
This runbook covers the per-policy multi-channel expiry-alert dispatch
|
||||
path that ships in certctl post-2026-05-03 (Rank 4 of the Infisical
|
||||
deep-research deliverable). It complements the operator-facing
|
||||
[Routing expiry alerts across channels](connectors.md#routing-expiry-alerts-across-channels)
|
||||
section in `docs/connectors.md`.
|
||||
|
||||
Audience: a platform sysadmin or on-call engineer who needs to
|
||||
configure, debug, or audit certctl's expiry-alert routing. Not a
|
||||
walkthrough of how to install certctl — that lives in the README.
|
||||
|
||||
---
|
||||
|
||||
## End-to-end flow
|
||||
|
||||
```mermaid
flowchart TD
    Tick["daily ticker (renewalCheckLoop)"]
    Check["RenewalService.CheckExpiringCertificates"]

    Tick --> Check --> Loop

    subgraph Loop["for cert in expiring (≤30 days)"]
        L1["1. Resolve RenewalPolicy"]
        L2["2. Compute daysUntil"]
        L3["3. updateCertExpiryStatus"]
        L4["4. sendThresholdAlerts"]
        L5["5. Create renewal job<br/>(if issuer registered +<br/>ARI allows)"]
        L1 --> L2 --> L3 --> L4 --> L5
    end

    L4 --> Threshold

    subgraph Threshold["per threshold"]
        T1["a. resolve severity tier<br/>via AlertSeverityMap"]
        T2["b. resolve channel set<br/>via AlertChannels[tier]"]
        T1 --> T2 --> Channel
    end

    subgraph Channel["for each channel (fault-isolating)"]
        C1["i. dedup via notification_events<br/>(cert, threshold, channel)"]
        C2["ii. SendThresholdAlertOnChannel<br/>→ notifierRegistry[channel]<br/>→ Send(recipient, subj, body)"]
        C3["iii. record audit row<br/>event_type=expiration_alert_sent<br/>metadata.channel, metadata.severity_tier"]
        C4["iv. bump Prometheus counter<br/>certctl_expiry_alerts_total<br/>{channel, threshold, result}"]
        C1 --> C2 --> C3 --> C4
    end
```
|
||||
The dispatch loop's per-channel error handling is
|
||||
**fault-isolating**: PagerDuty's failure does NOT skip Slack/Email
|
||||
at the same threshold. Each channel runs independently, with its
|
||||
own dedup row + audit row + metric increment.
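
The fault-isolating pattern can be illustrated with a small shell sketch. This is not certctl's Go dispatch code: `dispatch_all` and `fake_send` are hypothetical stand-ins, and the `success` result label is an assumption (only `failure` appears in this runbook). The point is that a failed send is recorded and the loop carries on to the next channel.

```bash
# Illustrative only: send_fn stands in for a notifier's Send call.
dispatch_all() {
  send_fn=$1
  shift
  for channel in "$@"; do
    if "$send_fn" "$channel"; then
      echo "$channel=success"
    else
      # Record the failure and keep going; one channel's error must
      # not prevent the remaining channels from being attempted.
      echo "$channel=failure"
    fi
  done
}

# Fake sender for demonstration: PagerDuty always fails.
fake_send() {
  [ "$1" != "PagerDuty" ]
}

dispatch_all fake_send PagerDuty Slack Email
```

Even though the first channel fails, Slack and Email are still attempted and reported independently.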
|
||||
|
||||
---
|
||||
|
||||
## Configuring the per-policy channel matrix
|
||||
|
||||
The matrix is a property of `RenewalPolicy`. Two new JSONB columns
|
||||
on the `renewal_policies` table back it (migration 000026):
|
||||
|
||||
- `alert_channels JSONB` — `map[severity_tier][]channel_name`. Default `{}`
|
||||
→ fall through to `DefaultAlertChannels` (Email-only at every tier).
|
||||
- `alert_severity_map JSONB` — `map[threshold_days]severity_tier`. Default
|
||||
`{}` → fall through to `DefaultAlertSeverityMap` (`30→informational,
|
||||
14→warning, 7→warning, 0→critical`).
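
As a quick reference, the default threshold-to-tier fallback listed above can be written out as a lookup. This sketch only mirrors the documented `DefaultAlertSeverityMap` values; the function name and the `unmapped` result for other thresholds are assumptions, since this runbook does not say how certctl treats thresholds absent from the map.

```bash
# Mirrors the documented defaults: 30→informational, 14→warning,
# 7→warning, 0→critical. Anything else is reported as "unmapped"
# (a placeholder, not certctl behavior).
default_severity_tier() {
  case "$1" in
    30)   echo "informational" ;;
    14|7) echo "warning" ;;
    0)    echo "critical" ;;
    *)    echo "unmapped"; return 1 ;;
  esac
}

for days in 30 14 7 0; do
  echo "T-$days -> $(default_severity_tier "$days")"
done
```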
|
||||
|
||||
### Example: production-grade routing
|
||||
|
||||
```bash
# Double quotes around the Authorization header so the shell expands ${TOKEN}.
curl -X PUT https://certctl.example.com/api/v1/renewal-policies/rp-production \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Production CDN renewal policy",
    "renewal_window_days": 30,
    "auto_renew": true,
    "max_retries": 3,
    "retry_interval_seconds": 300,
    "alert_thresholds_days": [30, 14, 7, 0],
    "alert_channels": {
      "informational": ["Slack"],
      "warning":       ["Slack", "Email"],
      "critical":      ["PagerDuty", "OpsGenie", "Email"]
    },
    "alert_severity_map": {
      "30": "informational",
      "14": "warning",
      "7":  "warning",
      "0":  "critical"
    }
  }'
```
|
||||
After this PUT, the next renewal-loop tick that finds a cert under
|
||||
this policy will fan out alerts as documented above.
|
||||
|
||||
### Example: opt out of informational alerts
|
||||
|
||||
If your team doesn't want T-30 informational alerts (you'd rather
|
||||
hear about a cert only at warning tier and beyond):
|
||||
|
||||
```json
"alert_channels": {
  "informational": [],
  "warning":       ["Email"],
  "critical":      ["PagerDuty", "Email"]
}
```
|
||||
The empty `informational` list causes the dispatch loop to record
|
||||
an `expiration_alert_skipped_no_channels` audit row at T-30 and
|
||||
skip the dispatch. Other tiers still fire.
|
||||
|
||||
---
|
||||
|
||||
## Operator playbook
|
||||
|
||||
### "Did the on-call team get paged?"
|
||||
|
||||
```sql
SELECT created_at,
       metadata->>'channel'        AS channel,
       metadata->>'threshold_days' AS threshold,
       metadata->>'severity_tier'  AS severity
FROM audit_events
WHERE event_type = 'expiration_alert_sent'
  AND resource_id = '<cert-id>'
ORDER BY created_at DESC;
```
|
||||
One row per (channel, threshold) attempt. If you see a row with
|
||||
`channel = 'PagerDuty'` and `severity = 'critical'`, the page went
|
||||
out (or was at least dispatched to the notifier).
|
||||
|
||||
### "Why didn't I get an alert at T-7?"
|
||||
|
||||
Three places to look:
|
||||
|
||||
1. **Audit log:** `SELECT * FROM audit_events WHERE event_type IN
   ('expiration_alert_sent','expiration_alert_skipped_no_channels',
   'expiration_alert_skipped_invalid_channel') AND resource_id =
   '<cert-id>'`. If `expiration_alert_skipped_no_channels` appears,
   your policy's tier list is empty for the resolved tier. If
   `expiration_alert_skipped_invalid_channel` appears, your matrix
   has a typo (the `metadata->>'invalid_channel'` field tells you
   which value).

2. **Notifications table:**
   `SELECT * FROM notification_events WHERE certificate_id = '<cert-id>'
   AND type = 'ExpirationWarning' ORDER BY created_at DESC`. If
   rows exist with `channel = 'Slack'` and `status = 'failed'`,
   the dispatch reached the channel but the channel rejected the
   send. Look at the `error` column for the upstream message.

3. **Prometheus counters:**
   `curl -sS https://certctl.example.com/api/v1/metrics/prometheus | grep certctl_expiry_alerts_total`.
   Sustained `{result="failure"}` counts indicate a notifier
   connector misconfiguration (bad webhook URL, expired API key,
   etc.).
|
||||
### "How do I test the matrix without waiting for a real expiry?"
|
||||
|
||||
certctl ships an admin endpoint for this:
|
||||
|
||||
```bash
# Double quotes around the Authorization header so the shell expands ${TOKEN}.
curl -X POST https://certctl.example.com/api/v1/admin/notifications/test \
  -H "Authorization: Bearer ${TOKEN}" \
  -H 'Content-Type: application/json' \
  -d '{
    "certificate_id": "mc-test-cert",
    "threshold_days": 0,
    "channel": "PagerDuty"
  }'
```
|
||||
This calls `NotificationService.SendThresholdAlertOnChannel`
|
||||
directly and bypasses the renewal loop's threshold check. Useful
|
||||
for "did I configure PagerDuty correctly?" without having to set
|
||||
up a deliberately-expiring cert. The admin endpoint requires
|
||||
`role=admin` (V3-Pro RBAC); V2 deploys gate it on the bearer
|
||||
token only.
|
||||
|
||||
### "How do I rotate a notifier credential without downtime?"
|
||||
|
||||
1. Update the `CERTCTL_PAGERDUTY_ROUTING_KEY` (or equivalent) env
|
||||
var in your deployment.
|
||||
2. Restart `certctl-server`. The notifier registry rebuilds
|
||||
with the new credential.
|
||||
3. Confirm with the admin-test endpoint above against the cert
|
||||
you most care about.
|
||||
|
||||
The renewal loop is idempotent — a missed tick during the restart
|
||||
window does NOT cause double-dispatch on the next tick (per-channel
|
||||
dedup on the `notification_events` table guards against that).
|
||||
|
||||
---
|
||||
|
||||
## Cardinality + cost
|
||||
|
||||
- Default 6 channels × 4 thresholds × 3 results = **72 Prometheus series**.
|
||||
- Custom-thresholds policies (e.g. `[60, 45, 30, 14, 7, 3, 1, 0]`)
|
||||
expand the threshold dimension proportionally — 6 × 8 × 3 = 144 series.
|
||||
- Closed-enum discipline at the dispatch site means typos in
|
||||
`alert_channels` do NOT grow this count.
|
||||
- A daily renewal-loop tick over 10K certs each policy-bound to the
  matrix above produces O(channels × thresholds × certs) audit rows
  and notification rows in the worst case (every cert has crossed
  every threshold and no dedup applies). Operators sizing
  Postgres should plan for an `audit_events` row count on the
  order of `unique_certs × channels_per_critical_tier` per fan-out
  batch, which is ~3-5× the pre-Rank-4 row count.
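
The series arithmetic above is a plain product of the three label dimensions, so it is easy to recompute for your own policy mix. A minimal sketch (`expiry_alert_series` is a hypothetical helper name):

```bash
# Upper bound on certctl_expiry_alerts_total series:
# channels x thresholds x results.
expiry_alert_series() {
  echo $(( $1 * $2 * $3 ))
}

expiry_alert_series 6 4 3   # default matrix, prints 72
expiry_alert_series 6 8 3   # eight custom thresholds, prints 144
```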
|
||||
|
||||
---
|
||||
|
||||
## V3-Pro forward path
|
||||
|
||||
Tracked at `cowork/WORKSPACE-ROADMAP.md` under "Adapter hardening":
|
||||
|
||||
- Per-owner / per-team / per-tenant channel routing (the matrix is
|
||||
per-policy today, not per-owner).
|
||||
- Calendar-aware suppression (no T-30 alerts on weekends for non-
|
||||
on-call teams).
|
||||
- Escalation chains (T-1 unanswered for 30m → escalate to
|
||||
manager's PagerDuty).
|
||||
- Per-channel rate limiting (downstream of I-005's retry+DLQ).
|
||||
@@ -0,0 +1,169 @@
|
||||
# certctl Security Posture & Operator Guidance
|
||||
|
||||
This document collects the operator-facing security guidance that the source
|
||||
code's per-finding comment blocks reference. Each section names the audit
|
||||
finding it closes, the threat model, and the operator action required (if
|
||||
any).
|
||||
|
||||
## OCSP responder availability
|
||||
|
||||
**Audit reference:** Bundle C / M-020. CWE-770 (uncontrolled resource
|
||||
consumption); RFC 6960 (OCSP); RFC 7633 (Must-Staple).
|
||||
|
||||
certctl ships an OCSP responder at `/.well-known/pki/ocsp/{issuer_id}/{serial}`
|
||||
that signs a fresh response per request. Pre-Bundle-C the unauth handler
|
||||
chain had no rate limit, so an attacker could DoS the responder and force
|
||||
fail-open relying parties to accept revoked certificates as valid. Bundle C
|
||||
adds the same per-key rate limiter to the unauth chain that the authenticated
|
||||
chain has used since Bundle B. Per-IP keying applies because OCSP traffic is
|
||||
unauthenticated.
|
||||
|
||||
The rate limiter alone does not solve the underlying revocation-bypass risk.
|
||||
**The architectural fix is for issued certificates to carry the OCSP
|
||||
Must-Staple TLS Feature extension** (RFC 7633, OID 1.3.6.1.5.5.7.1.24). When
|
||||
present, conforming TLS clients refuse to negotiate a session unless the
|
||||
server staples a fresh signed OCSP response in the TLS handshake. This shifts
|
||||
revocation enforcement from the client's discretion (which most fail-open by
|
||||
default) to a hard requirement that the connection cannot complete without
|
||||
proof of non-revocation.
|
||||
|
||||
### Operator action
|
||||
|
||||
For certificates issued to systems where revocation correctness matters:
|
||||
|
||||
1. **Configure the issuer profile to set `must-staple: true`.** Out-of-the-box
|
||||
profiles in `migrations/seed.sql` do not set this; operators add it at
|
||||
profile-creation time via the API or by editing seed data.
|
||||
2. **Confirm the relying party honors the extension.** Firefox enforces
   Must-Staple. Chrome does not, and OpenSSL-based clients enforce it only
   when the application requests and validates the stapled response.
   Clients without support silently ignore the extension.
3. **Confirm the deployment target is configured for OCSP stapling** so the
|
||||
server can actually deliver the stapled response in the handshake.
|
||||
- **nginx:** `ssl_stapling on; ssl_stapling_verify on;`
|
||||
- **Apache:** `SSLUseStapling on`
|
||||
- **HAProxy:** `set ssl ocsp-response /path/to/response.der`
|
||||
- **Envoy:** `ocsp_staple_policy: must_staple`
|
||||
|
||||
### What this does NOT cover
|
||||
|
||||
- **CRL fallback.** Must-Staple does not affect CRL behavior. Operators with
|
||||
CRL-based relying parties should use the rate-limit + caching defense
|
||||
alone; there is no client-side equivalent to Must-Staple for CRLs.
|
||||
- **Self-issued certs in air-gapped networks.** When the relying party
|
||||
cannot reach the OCSP responder at all (the threat model the audit
|
||||
cited), Must-Staple is the only mechanism that closes the bypass. CRL
|
||||
distribution similarly requires the relying party to fetch the CRL,
|
||||
which is also subject to the same network-availability concern.
|
||||
|
||||
## Postgres transport encryption
|
||||
|
||||
See [docs/database-tls.md](database-tls.md). Bundle B / M-018.
|
||||
|
||||
## Encryption at rest
|
||||
|
||||
Bundle B / M-001. PBKDF2-SHA256 at 600,000 rounds (OWASP 2024 Password
|
||||
Storage Cheat Sheet floor) for the operator-supplied passphrase that
|
||||
derives the AES-256-GCM key for sensitive config columns. v3 blob format
|
||||
with a per-ciphertext random salt; v1/v2 read fallback for legacy rows.
|
||||
See [internal/crypto/encryption.go](../internal/crypto/encryption.go) and
|
||||
the accompanying tests for the format spec.
|
||||
|
||||
## Authentication surface
|
||||
|
||||
Bundle B / M-002. Two layers decide auth-exempt status:
|
||||
|
||||
1. **Router layer:** `internal/api/router/router.go::AuthExemptRouterRoutes`
|
||||
— the 4 endpoints registered via direct `r.mux.Handle` without going
|
||||
through the middleware chain (`/health`, `/ready`, `/api/v1/auth/info`,
|
||||
`/api/v1/version`).
|
||||
2. **Dispatch layer:** `internal/api/router/router.go::AuthExemptDispatchPrefixes`
|
||||
— URL-prefix routing in `cmd/server/main.go::buildFinalHandler` for
|
||||
`/.well-known/pki/*`, `/.well-known/est/*`, and `/scep[/...]*`.
|
||||
|
||||
Both lists have AST-walking regression tests (`auth_exempt_test.go`) that
fail CI if a new bypass lands without updating the documented constant.
|
||||
## Per-user rate limiting
|
||||
|
||||
Bundle B / M-025. Authenticated callers are bucketed by API-key name;
|
||||
unauthenticated callers (probes, OCSP relying parties, EST/SCEP enrollees)
|
||||
are bucketed by source IP. `RPS` and `BurstSize` are per-key budgets.
|
||||
`PerUserRPS` / `PerUserBurstSize` give authenticated clients a separate
|
||||
budget when set non-zero.
|
||||
|
||||
## API key rotation
|
||||
|
||||
**Audit reference:** L-004. CWE-924 (improper enforcement of message integrity during transmission in a communication channel) — operator UX variant.
|
||||
|
||||
certctl's API keys are configured via the `CERTCTL_API_KEYS_NAMED` env var
|
||||
(format `name1:key1,name2:key2:admin`) and parsed at startup into an
|
||||
in-memory list. There is no DB-resident key store, no GUI, no `/api/v1/keys`
|
||||
endpoint — the env var IS the key inventory.
|
||||
|
||||
Pre-Bundle-G the env var rejected duplicate names, so rotating a key
|
||||
required: stop accepting OLDKEY → restart → roll NEWKEY out. Any client
|
||||
polling against OLDKEY during the restart window hit a 401.
|
||||
|
||||
Bundle G adds a **double-key rotation window**: two entries can share a
|
||||
name during the rollover, and both keys validate. Operators run the
|
||||
rotation as:
|
||||
|
||||
1. **Generate the new key.** `openssl rand -hex 32` produces a 256-bit
   value with sufficient entropy.

2. **Append the new entry to `CERTCTL_API_KEYS_NAMED`** alongside the
   existing one:
   ```
   CERTCTL_API_KEYS_NAMED="alice:OLDKEY:admin,alice:NEWKEY:admin"
   ```
   Both entries MUST carry the same admin flag; startup fails loudly if
   they don't (a non-admin shouldn't share an identity with an admin).

3. **Restart certctl.** A startup INFO log confirms the rotation window
   is active:
   ```
   INFO api-key rotation window active name=alice entries=2 see=docs/security.md::api-key-rotation
   ```

4. **Roll the new key out to all clients.** Both keys validate during
   this phase. Audit-trail actor + per-user rate-limit bucket stay
   consistent across the rollover (both entries produce the same
   `UserKey` context value, the shared name).

5. **Remove the old entry** from `CERTCTL_API_KEYS_NAMED`:
   ```
   CERTCTL_API_KEYS_NAMED="alice:NEWKEY:admin"
   ```

6. **Restart certctl.** OLDKEY now fails with 401. Rotation complete.
|
||||
The rotation window has no operator-set timeout — it lasts for as long
|
||||
as both entries are in the env var. Best practice is a 24-72h window
|
||||
covering a full deploy cadence; if a client hasn't rolled to NEWKEY by
|
||||
the end of step 4, extend the window before step 5.
|
||||
|
||||
### What the contract guarantees
|
||||
|
||||
- Two entries with the same `name`: **allowed** if both have the same
|
||||
`admin` flag.
|
||||
- Two entries with the same `name` but mismatched admin: **rejected at
|
||||
startup** (privilege escalation guard).
|
||||
- Two entries with the same `(name, key)` pair: **rejected at startup**
|
||||
(typo guard — rotation requires DIFFERENT keys under the same name).
|
||||
- Single-entry steady state: unchanged from pre-Bundle-G behavior.
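
To catch a malformed rotation entry before restarting, the rules above can be approximated with a small pre-flight check. This is a sketch only: `validate_api_keys_named` is a hypothetical helper, certctl's own startup validation remains authoritative, and the sketch assumes the third field is exactly `admin` for admin keys and that names and keys contain no `:` or `,` characters.

```bash
# Hypothetical pre-flight check for a CERTCTL_API_KEYS_NAMED value.
# Prints "ok", or one line per violated rule with a non-zero exit status.
validate_api_keys_named() {
  printf '%s\n' "$1" | tr ',' '\n' | awk -F: '
    NF < 2 { print "malformed entry: " $0; bad = 1; next }
    {
      name = $1; key = $2
      admin = ($3 == "admin") ? "admin" : "user"
      # Same (name, key) twice: rotation needs DIFFERENT keys per name.
      if ((name, key) in seen) {
        print "duplicate (name, key) pair for " name; bad = 1
      }
      seen[name, key] = 1
      # Same name with differing admin flags: privilege escalation guard.
      if ((name in flag) && flag[name] != admin) {
        print "admin flag mismatch for " name; bad = 1
      }
      flag[name] = admin
    }
    END { if (!bad) print "ok"; exit bad }
  '
}

validate_api_keys_named "alice:OLDKEY:admin,alice:NEWKEY:admin"
```

The sample call models the double-key rotation window from step 2 and passes, because both entries share a name and an admin flag but carry different keys.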
|
||||
|
||||
### What the contract does NOT do
|
||||
|
||||
- **No automatic expiration of OLDKEY.** The operator removes the entry
|
||||
in step 5; certctl doesn't track timestamps. A future enhancement
|
||||
could add a `rotated_at` annotation if operators ask for it.
|
||||
- **No GUI / API for key management.** Keys are env-var only by design;
|
||||
building a key-management surface is a separate feature project.
|
||||
- **No revocation list.** If a key leaks, the only path is to remove it
|
||||
from the env var and restart. That's appropriate for a small env-var
|
||||
inventory; it would not scale to a per-user-key-issued model.
|
||||
|
||||
## Reporting a vulnerability
|
||||
|
||||
Email `certctl@proton.me`. Coordinated disclosure preferred; we will
|
||||
acknowledge within 72h.
|
||||
@@ -0,0 +1,215 @@
|
||||
# TLS on the Control Plane
|
||||
|
||||
certctl's control plane is HTTPS-only as of v2.2. There is no plaintext `http://` listener, no `auto` mode, no dual-listener bridge, no TLS 1.2 escape hatch. The server refuses to start without a cert+key pair, the agent/CLI/MCP clients reject `http://` URLs at startup, and the Helm chart refuses to render without either an operator-supplied Secret or a cert-manager Certificate CR.
|
||||
|
||||
This doc covers four cert provisioning patterns, SIGHUP-based cert rotation, and the client-side CA-trust configuration agents and the CLI need to talk to the server. If you are upgrading from a pre-HTTPS release and want the step-by-step cutover procedure, read [`upgrade-to-tls.md`](upgrade-to-tls.md) first and come back here for reference.
|
||||
|
||||
## What you get
|
||||
|
||||
The server binds TLS 1.3 only with an explicit curve preference of `[X25519, P-256]`. TLS 1.3 cipher suites are non-negotiable (all three mandatory suites — AES-128-GCM-SHA256, AES-256-GCM-SHA384, CHACHA20-POLY1305-SHA256 — are always offered), so there is no `CipherSuites` knob to misconfigure. No TLS 1.2 fallback is available.
|
||||
|
||||
Two env vars are required on the server:
|
||||
|
||||
- `CERTCTL_SERVER_TLS_CERT_PATH` — filesystem path to the PEM-encoded server certificate
|
||||
- `CERTCTL_SERVER_TLS_KEY_PATH` — filesystem path to the PEM-encoded private key that signs the cert
|
||||
|
||||
Both paths are read during a fail-loud preflight in `cmd/server/main.go` (see `preflightServerTLS` in `cmd/server/tls.go`). If either is unset, unreadable, or the cert+key pair does not round-trip through `tls.LoadX509KeyPair`, the process refuses to start and emits a diagnostic pointing back at this doc. The rationale lives in §3 of the HTTPS-Everywhere milestone: a cert-lifecycle product should not silently bind plaintext.
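
Before restarting the server with a new pair, you can approximate the cert+key round-trip check from the shell by comparing public keys. This is a sketch under stated assumptions, not the server's preflight: `tls_pair_matches` is a hypothetical helper, and it only confirms that the certificate and the private key share a public key.

```bash
# Hypothetical pre-flight: do the cert and key belong together?
tls_pair_matches() {
  cert=$1
  key=$2
  if [ ! -r "$cert" ] || [ ! -r "$key" ]; then
    echo "unreadable cert or key"
    return 2
  fi
  # Extract the public key from each side in PEM form and compare.
  cert_pub=$(openssl x509 -in "$cert" -noout -pubkey 2>/dev/null) || cert_pub=""
  key_pub=$(openssl pkey -in "$key" -pubout 2>/dev/null) || key_pub=""
  if [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]; then
    echo "match"
  else
    echo "mismatch or parse failure"
    return 1
  fi
}

# Usage:
#   tls_pair_matches "$CERTCTL_SERVER_TLS_CERT_PATH" "$CERTCTL_SERVER_TLS_KEY_PATH"
```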
|
||||
|
||||
## Pattern 1 — Self-signed bootstrap for docker-compose demos
|
||||
|
||||
This is the default for the `deploy/docker-compose.yml` stack. It exists so `docker compose up -d --build` just works on a laptop without the operator standing up a CA first. It is not appropriate for any non-demo environment.
|
||||
|
||||
An init container named `certctl-tls-init` runs once before the server starts. It uses the `alpine/openssl` image and generates an ECDSA-P256 self-signed cert (SHA-256 signature):
|
||||
|
||||
```
|
||||
openssl req -x509 -newkey ec \
|
||||
-pkeyopt ec_paramgen_curve:P-256 \
|
||||
-nodes \
|
||||
-keyout /etc/certctl/tls/server.key \
|
||||
-out /etc/certctl/tls/server.crt \
|
||||
-days 3650 \
|
||||
-subj "/CN=certctl-server" \
|
||||
-addext "subjectAltName=DNS:certctl-server,DNS:localhost,IP:127.0.0.1,IP:::1"
|
||||
```
|
||||
|
||||
**Why ECDSA-P256 and not ed25519.** The pre-v2.0.48 demo bootstrap used ed25519 (small keys, fast signatures). Apple's TLS stack — Safari Network Framework and the macOS-bundled LibreSSL 3.3.6 `/usr/bin/curl` — does not advertise ed25519 in the ClientHello `signature_algorithms` extension for server certs, so an ed25519 server cert was rejected at handshake with `tls: peer doesn't support any of the certificate's signature algorithms` on the server side (and the generic TLS handshake error on the client side). Homebrew OpenSSL 3.x, Chrome, Firefox, and Linux curl all accepted ed25519 — Apple was the outlier. ECDSA-P256 with SHA-256 is universally supported, so the demo bootstrap uses it by default. To pick up the new algorithm on an existing demo install, tear the volume down and rebuild: `docker compose -f deploy/docker-compose.yml down -v && docker compose -f deploy/docker-compose.yml up -d --build`. **Helm and operator-supplied-Secret users (Patterns 2 and 3) are unaffected** — they bring their own cert, and `cmd/server/tls.go` is algorithm-agnostic (TLS 1.3 with curve preference `[X25519, P-256]` for key exchange — no constraint on the server cert's signature algorithm).
|
||||
|
||||
The cert, its matching key, and a copy of the cert published as `ca.crt` land in a named volume (`certs`) mounted at `/etc/certctl/tls/` in the server container (read-only) and the agent container (read-only). The bootstrap is idempotent — if `server.crt`, `server.key`, and `ca.crt` are already present on the volume, the init container logs `TLS cert already present at …` and exits cleanly.
|
||||
|
||||
Single-cert design. CN is `certctl-server` to match the Docker-network hostname. The SAN list is `[certctl-server, localhost, 127.0.0.1, ::1]`, which covers both container-internal agent→server traffic and operator browser/curl access to `https://localhost:8443`. There is no separate intermediate/root chain — the server cert and the CA bundle are the same PEM. This is the whole point of a demo bootstrap.
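To make the single-cert design concrete, the sketch below builds a certificate of the same shape in Go (ECDSA P-256, CN `certctl-server`, the same four SANs) and verifies it using itself as the only trusted root, which is what a client does when `ca.crt` is a copy of `server.crt`. It is not the init container's code; the function names are hypothetical, and the `IsCA` and key-usage settings are assumptions chosen so the self-verification works.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

// newDemoCert creates a self-signed ECDSA P-256 cert whose subject and SANs
// mirror the demo bootstrap. Hypothetical helper for illustration only.
func newDemoCert() (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "certctl-server"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().AddDate(0, 0, 3650),
		DNSNames:              []string{"certctl-server", "localhost"},
		IPAddresses:           []net.IP{net.ParseIP("127.0.0.1"), net.ParseIP("::1")},
		KeyUsage:              x509.KeyUsageDigitalSignature | x509.KeyUsageCertSign, // assumption
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
		IsCA:                  true, // assumption: lets the cert act as its own root
	}
	// Template and parent are the same certificate, so the result is self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// verifyAsOwnCA checks the cert for the given host, trusting only the cert itself.
func verifyAsOwnCA(cert *x509.Certificate, host string) error {
	pool := x509.NewCertPool()
	pool.AddCert(cert)
	_, err := cert.Verify(x509.VerifyOptions{Roots: pool, DNSName: host})
	return err
}

func main() {
	cert, err := newDemoCert()
	if err != nil {
		fmt.Println("generate:", err)
		return
	}
	for _, host := range []string{"certctl-server", "localhost", "127.0.0.1", "::1"} {
		fmt.Printf("%s verifies: %v\n", host, verifyAsOwnCA(cert, host) == nil)
	}
}
```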
|
||||
|
||||
To force regeneration (rotate the demo cert), tear the volume down: `docker compose down -v`. The next `up` re-runs the init container.
|
||||
|
||||
The server's Docker healthcheck and the agent both verify against `/etc/certctl/tls/ca.crt`; no `-k` / `InsecureSkipVerify` anywhere in the default stack.
|
||||
|
||||
## Pattern 2 — Operator-supplied `kubernetes.io/tls` Secret (Helm)
|
||||
|
||||
This is the default path for Helm installs. The operator provisions a Secret of type `kubernetes.io/tls` holding `tls.crt` + `tls.key` (and optionally `ca.crt` for mounting a CA bundle to clients in the same cluster) from whatever source they already trust — their internal CA, a manually-issued cert, step-ca, AWS ACM PCA exported to PEM, or the output of the self-signed bootstrap pattern above copied into a cluster Secret.
|
||||
|
||||
```
|
||||
kubectl create secret tls certctl-server-tls \
|
||||
--cert=server.crt \
|
||||
--key=server.key \
|
||||
--namespace certctl
|
||||
```
|
||||
|
||||
Then:
|
||||
|
||||
```
|
||||
helm install certctl deploy/helm/certctl \
|
||||
--namespace certctl \
|
||||
--set server.tls.existingSecret=certctl-server-tls
|
||||
```
|
||||
|
||||
The Secret is mounted read-only at `/etc/certctl/tls/` in the server pod. The `CERTCTL_SERVER_TLS_CERT_PATH` and `CERTCTL_SERVER_TLS_KEY_PATH` env vars are wired to `tls.crt` and `tls.key` keys inside that mount. If `ca.crt` is absent from the Secret, clients that need a CA bundle should use `tls.crt` as the bundle (self-signed case) or mount a separate ConfigMap with the root chain (operator-CA case).
|
||||
|
||||
If the operator sets neither `server.tls.existingSecret` nor `server.tls.certManager.enabled=true`, `helm template` / `helm install` fails at render-time with a diagnostic pointing at this doc. The guard is implemented in `deploy/helm/certctl/templates/_helpers.tpl` under the `certctl.tls.required` helper. This is deliberate: the HTTPS-only server would crash-loop on an empty path, so we fail earlier at Helm-render time.
|
||||
|
||||
## Pattern 3 — cert-manager `Certificate` CR (Helm, opt-in)
|
||||
|
||||
For clusters that already run cert-manager, the chart can provision a `Certificate` CR that writes into the Secret the server pod reads from. This is opt-in — the default is `server.tls.certManager.enabled: false` — because not every cluster has cert-manager installed, and we refuse to ship a chart that silently depends on an external controller.
|
||||
|
||||
```
|
||||
helm install certctl deploy/helm/certctl \
|
||||
--namespace certctl \
|
||||
--set server.tls.certManager.enabled=true \
|
||||
--set server.tls.certManager.issuerRef.name=my-cluster-issuer \
|
||||
--set server.tls.certManager.issuerRef.kind=ClusterIssuer
|
||||
```
|
||||
|
||||
The rendered `Certificate` (see `deploy/helm/certctl/templates/server-certificate.yaml`) writes `tls.crt` + `tls.key` + `ca.crt` into the Secret named by `server.tls.certManager.secretName` (defaults to `<fullname>-tls`). The server pod reads from that same Secret; the agent DaemonSet mounts the same Secret as its CA bundle source.
|
||||
|
||||
cert-manager handles rotation. certctl-server handles in-place reload — see the SIGHUP section below.
|
||||
|
||||
The chart enforces that if `server.tls.certManager.enabled=true`, `server.tls.certManager.issuerRef.name` must also be set. An empty `issuerRef.name` makes `helm template` fail with a diagnostic naming the missing flag.
|
||||
|
||||
## Pattern 4 — Manually-issued from an internal CA
|
||||
|
||||
For operators running neither Helm nor docker-compose (bare-metal / custom orchestration), the server just needs two files on disk pointed at by `CERTCTL_SERVER_TLS_CERT_PATH` and `CERTCTL_SERVER_TLS_KEY_PATH`. Issue the cert from your internal CA with:
|
||||
|
||||
- CN matching the hostname your agents and operators use to dial the server (e.g., `certctl.prod.example.com`)
|
||||
- SAN list covering every hostname and IP that appears in `CERTCTL_SERVER_URL` values across your agent fleet
|
||||
- Key usage: digital signature + key encipherment
|
||||
- Extended key usage: server auth
|
||||
|
||||
Store the key with mode `0600` and owner matching the UID the server runs as (`1000` in our shipped Dockerfile). The server process reads both files during `preflightServerTLS` at startup and again on every SIGHUP.
|
||||
|
||||
The full CA chain that signed the server cert should be distributed to agents, CLI operators, and MCP clients as their `CERTCTL_SERVER_CA_BUNDLE_PATH` — see the client section below.
|
||||
|
||||
## SIGHUP cert rotation
|
||||
|
||||
The server wraps its cert+key pair in a `*certHolder` (see `cmd/server/tls.go`) that guards the loaded `*tls.Certificate` under a `sync.Mutex`. The `*tls.Config` wires `GetCertificate` to the holder, so every new inbound TLS handshake reads whatever cert the holder currently has.
|
||||
|
||||
Send `SIGHUP` to the server PID and the holder re-reads both files from disk. On success, the next new connection uses the new cert; in-flight requests finish on the previous cert. A log line goes out:
|
||||
|
||||
```
|
||||
TLS cert reloaded via SIGHUP cert_path=/etc/certctl/tls/server.crt key_path=/etc/certctl/tls/server.key
|
||||
```
|
||||
|
||||
On failure (missing file, malformed PEM, key does not match cert), the old cert is retained and an error logs:
|
||||
|
||||
```
|
||||
TLS cert reload failed; continuing with previous cert cert_path=… key_path=… error=…
|
||||
```
|
||||
|
||||
This is deliberately fail-safe on reload (as opposed to fail-loud on startup). A cert-manager renewal race, a partially-copied file, a typo in a rotation script — none of those should crash a running server and drop every agent connection. The operator sees the error in logs, fixes the underlying issue, and sends another `SIGHUP`.
|
||||
|
||||
Pair with cert-manager, certbot `--post-hook`, or any rotation tool that can fire a signal. For docker-compose, `docker compose kill -s HUP certctl-server` works. For Kubernetes, reload is typically handled by cert-manager updating the Secret and the mounted file changing on the next kubelet sync — no explicit SIGHUP needed if the volume mount is `subPath`-free.
|
||||
|
||||
Startup is a different story. If the cert is missing or malformed at process start, the server exits non-zero rather than binding plaintext or attempting a retry loop. That's the HTTPS-only contract.
|
||||
|
||||
## Client-side TLS: agents, CLI, MCP
|
||||
|
||||
Everything that talks to the server enforces HTTPS on the URL.
|
||||
|
||||
### Agent
|
||||
|
||||
`CERTCTL_SERVER_URL` must be `https://…`. `http://`, bare hostnames, `ftp://`, `ws://`, and empty strings are rejected at startup by `validateHTTPSScheme` in `cmd/agent/main.go` with a diagnostic pointing at `upgrade-to-tls.md`. There is no warning-and-proceed path.
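A scheme check with this rejection matrix can be written in a few lines with `net/url`. The sketch below is an approximation, not the body of `validateHTTPSScheme`; the function name `validateHTTPSURL` and its error wording are hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateHTTPSURL rejects empty strings, unparseable values, any scheme
// other than https, and URLs without a host. Hypothetical helper.
func validateHTTPSURL(raw string) error {
	if raw == "" {
		return fmt.Errorf("server URL is empty")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("server URL %q is not parseable: %w", raw, err)
	}
	if u.Scheme != "https" {
		// Bare hostnames land here too: they parse with an empty scheme.
		return fmt.Errorf("server URL %q must use the https:// scheme", raw)
	}
	if u.Host == "" {
		return fmt.Errorf("server URL %q has no host", raw)
	}
	return nil
}

func main() {
	for _, raw := range []string{
		"https://certctl.example.com:8443",
		"http://certctl.example.com:8443",
		"certctl.example.com",
		"ftp://certctl.example.com",
		"ws://certctl.example.com",
		"",
	} {
		fmt.Printf("%q accepted: %v\n", raw, validateHTTPSURL(raw) == nil)
	}
}
```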
|
||||
|
||||
Two additional env vars control how the agent verifies the server cert:
|
||||
|
||||
- `CERTCTL_SERVER_CA_BUNDLE_PATH` — filesystem path to a PEM-encoded CA bundle that signed the server cert. Loaded into `*tls.Config.RootCAs` on the agent's HTTP client. If unset, the agent falls back to the OS system trust store.
|
||||
- `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY` — defaults to `false`. Setting it to `true` skips verification entirely. **Dev-only escape hatch.** The agent logs a prominent warning at startup (`TLS certificate verification is disabled … never enable this in production`). Use this only when dialing a demo server whose cert you haven't bothered to mount into the agent container.
|
||||
|
||||
Equivalent CLI flags: `--ca-bundle <path>` and `--insecure-skip-verify`.
|
||||
|
||||
If both the CA bundle and `InsecureSkipVerify=true` are set, `InsecureSkipVerify` wins — it's the whole point of the flag. Don't do this in production.
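The interaction between the CA bundle and the skip-verify switch might be wired roughly as follows. This is a sketch under assumptions, not the agent's actual code; `newClientTLSConfig` is a hypothetical name.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// newClientTLSConfig builds a client-side TLS config from the two settings
// described above. Hypothetical helper for illustration.
func newClientTLSConfig(caBundlePath string, insecure bool) (*tls.Config, error) {
	cfg := &tls.Config{}
	if caBundlePath != "" {
		data, err := os.ReadFile(caBundlePath)
		if err != nil {
			return nil, fmt.Errorf("read CA bundle %q: %w", caBundlePath, err)
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(data) {
			return nil, fmt.Errorf("no PEM certificates found in %q", caBundlePath)
		}
		cfg.RootCAs = pool
	}
	// A nil RootCAs makes crypto/tls fall back to the host's system roots.
	// When insecure is true, verification is skipped regardless of RootCAs.
	cfg.InsecureSkipVerify = insecure
	return cfg, nil
}

func main() {
	cfg, err := newClientTLSConfig(
		os.Getenv("CERTCTL_SERVER_CA_BUNDLE_PATH"),
		os.Getenv("CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY") == "true",
	)
	if err != nil {
		fmt.Println("client TLS config error:", err)
		return
	}
	fmt.Println("custom roots:", cfg.RootCAs != nil, "skip verify:", cfg.InsecureSkipVerify)
}
```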
|
||||
|
||||
### CLI (`certctl-cli`)
|
||||
|
||||
Same contract as the agent:
|
||||
|
||||
- `CERTCTL_SERVER_URL` must use the `https://` scheme; `http://` is rejected at startup
|
||||
- `--ca-bundle <path>` flag or `CERTCTL_SERVER_CA_BUNDLE_PATH` env var — CA bundle for server cert verification
|
||||
- `--insecure` flag or `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true` — skip verification (dev only)
|
||||
- Error diagnostic on empty URL explicitly mentions both `--server` and `CERTCTL_SERVER_URL` so operators see the right knob to turn
|
||||
|
||||
The CLI shares the URL-scheme validation with the agent; the test pins in `cmd/cli/main_test.go:TestValidateHTTPSScheme` cover the full rejection matrix.
|
||||
|
||||
### MCP server (`certctl-mcp-server`)
|
||||
|
||||
Same three controls as CLI, env-var-driven only (no flags — MCP runs as a stdio subprocess and inherits env from the launching LLM client):
|
||||
|
||||
- `CERTCTL_SERVER_URL` must start with `https://`
|
||||
- `CERTCTL_SERVER_CA_BUNDLE_PATH` optional CA bundle
|
||||
- `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY` optional skip
|
||||
|
||||
Claude Desktop / other MCP client configs should set all three in the tool's env block.
|
||||
|
||||
## Troubleshooting: fail-loud preflight errors
|
||||
|
||||
Every preflight failure message ends with `(see docs/tls.md)` so this doc is the first hit when an operator searches. Common failures:
|
||||
|
||||
**`CERTCTL_SERVER_TLS_CERT_PATH is empty: HTTPS-only control plane refuses to start`**
|
||||
Set the env var. For docker-compose this is already set to `/etc/certctl/tls/server.crt` in the shipped compose file — if you're seeing this, check the `certctl-tls-init` service logs to see why the init container didn't populate the volume. For Helm, check that `server.tls.existingSecret` or `server.tls.certManager.enabled=true` is set.
|
||||
|
||||
**`TLS cert file "…" unreadable: …`**
|
||||
The cert path is set but `os.Stat` failed. Check filesystem permissions — the server runs as UID 1000 in our shipped Dockerfile; the cert needs to be readable by that UID. Typos in the path also land here.
|
||||
|
||||
**`TLS cert/key pair invalid (cert="…" key="…"): …`**
|
||||
Both files exist but `tls.LoadX509KeyPair` refused them. Typical causes: the private key does not match the certificate's public key, the key is encrypted with a passphrase (not supported; remove the passphrase with `openssl pkey` before mounting), or one of the two is DER-encoded instead of PEM. Re-issue the cert and key together from a single CA request and re-mount.
|
||||
|
||||
**Client side: `tls: failed to verify certificate: x509: certificate signed by unknown authority`**
|
||||
The client did not trust the CA that signed the server cert. Either mount the CA bundle via `CERTCTL_SERVER_CA_BUNDLE_PATH`, add the CA to the system trust store on the client host, or (dev only) set `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true`.
|
||||
|
||||
**Client side: `tls: first record does not look like a TLS handshake`**
|
||||
The client is speaking plaintext HTTP to an HTTPS server (or vice-versa). Check that `CERTCTL_SERVER_URL` starts with `https://`. If you are upgrading from a pre-v2.2 release and your agents are old, they will surface this error until you roll the DaemonSet — see [`upgrade-to-tls.md`](upgrade-to-tls.md).
|
||||
|
||||
## InsecureSkipVerify justifications (Audit L-001)
|
||||
|
||||
`crypto/tls.Config.InsecureSkipVerify` short-circuits standard certificate
|
||||
chain validation. Each production use site below has a justification —
|
||||
the shape is "this code path is fundamentally pre-trust or
|
||||
trust-from-context, and chain validation in the stdlib path is not the
|
||||
right tool". Test-only sites are not enumerated here.
|
||||
|
||||
The CI grep guard `Forbidden bare InsecureSkipVerify regression guard
|
||||
(L-001)` in `.github/workflows/ci.yml` fails the build if any new
|
||||
`InsecureSkipVerify: true` lands in a non-test file without a
|
||||
`//nolint:gosec` comment carrying a justification — adding a new entry
|
||||
to this table is the right way to extend the surface.
|
||||
|
||||
| Site (file:line) | Trigger | Justification |
|
||||
|---|---|---|
|
||||
| `cmd/agent/main.go:59,125,136,1259,1262` | `--insecure-skip-verify` CLI flag | Dev escape hatch; docs/tls.md and the agent install script direct operators to use a real CA bundle in production. The agent emits a startup WARN when set. |
|
||||
| `cmd/agent/verify.go:70,78` | TLS deployment verification probe | The agent is verifying that its own freshly-deployed cert is being served. The chain may be self-signed or signed by an upstream the agent host doesn't trust; what matters is the leaf-cert match against what the agent just deployed. The verifier compares the served leaf bytes to the expected leaf, not the chain. |
|
||||
| `internal/tlsprobe/probe.go:33,47,54` | Network scanner / discovery probe | Discovery's job is to find every cert on the network, including expired, self-signed, and not-yet-deployed certs. Validating the chain would silently skip the broken-cert results that are precisely what operators want to know about. |
|
||||
| `internal/mcp/client.go:35` | MCP CLI `--insecure` flag | Dev escape hatch for local-only MCP testing against a self-signed control plane. |
|
||||
| `internal/cli/client.go:39` | `certctl --insecure` flag | Same shape as the agent flag — local dev only. |
|
||||
| `internal/connector/target/f5/f5.go:128` | F5 BIG-IP iControl REST | F5 default install ships with a self-signed cert; operators who haven't replaced it use `config.Insecure`. The connector logs this on every dial and the operator-facing config docs cover it. |
|
||||
| `internal/connector/issuer/acme/acme.go:146` | Pebble (ACME test server) | Hard-coded for tests that drive against Pebble locally. Pebble issues self-signed; verifying the chain would defeat the purpose. |
|
||||
| `internal/service/network_scan.go:460` | Network scanner probe | Same rationale as `tlsprobe/probe.go` above — discovery surfaces broken certs by design. |
|
||||
| `internal/api/acme/validators.go` (TLS-ALPN-01 validator) | RFC 8737 §3 TLS-ALPN-01 challenge validation | RFC 8737 mandates this: the responding TLS server presents a self-signed cert with the proof embedded in the `id-pe-acmeIdentifier` extension (OID 1.3.6.1.5.5.7.1.31). The chain is intentionally NOT validated — the proof is in the extension's SHA-256 of the key authorization, not the cert chain. Validating the chain would defeat the purpose: clients running TLS-ALPN-01 self-sign their challenge cert specifically because they don't have a trusted cert yet (that's what they're trying to obtain via ACME). The validator additionally checks that ALPN negotiated `acme-tls/1` and that the cert's `id-pe-acmeIdentifier` extension value is exactly SHA-256 of the expected key authorization. SSRF posture: the validator runs `validation.IsReservedIPForDial` against the resolved IP before the dial, refusing any private-IP target — same posture as the HTTP-01 validator. |
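The deployment verification row in the table relies on comparing leaf bytes rather than validating a chain. One way to express that in Go is to disable chain validation and pin the leaf in `VerifyPeerCertificate`, which `crypto/tls` still invokes when `InsecureSkipVerify` is set. This is a sketch of the idea only; `leafMatches` and `probeConfig` are hypothetical names, and the real `cmd/agent/verify.go` may perform the comparison differently.

```go
package main

import (
	"bytes"
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// leafMatches reports an error unless the first certificate the peer
// presented is byte-identical to the expected DER-encoded leaf.
func leafMatches(rawCerts [][]byte, expectedDER []byte) error {
	if len(rawCerts) == 0 {
		return errors.New("peer presented no certificate")
	}
	if !bytes.Equal(rawCerts[0], expectedDER) {
		return errors.New("served leaf does not match the deployed certificate")
	}
	return nil
}

// probeConfig skips chain validation but fails the handshake unless the
// served leaf equals expectedDER.
func probeConfig(expectedDER []byte) *tls.Config {
	return &tls.Config{
		InsecureSkipVerify: true, //nolint:gosec // leaf is pinned below; chain trust is not the question
		VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
			return leafMatches(rawCerts, expectedDER)
		},
	}
}

func main() {
	deployed := []byte{0x30, 0x82, 0x01, 0x0a} // placeholder bytes, not a real certificate
	cfg := probeConfig(deployed)
	fmt.Println("match:", cfg.VerifyPeerCertificate([][]byte{deployed}, nil) == nil)
	fmt.Println("mismatch:", cfg.VerifyPeerCertificate([][]byte{{0x00}}, nil) != nil)
}
```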
|
||||
|
||||
**What is NOT covered by this list:** `*_test.go` files use
|
||||
`InsecureSkipVerify` freely against `httptest.Server` instances; that's a
|
||||
test-fixture pattern, not a production trust decision. The grep guard
|
||||
ignores `_test.go`.
|
||||
|
||||
## Related docs
|
||||
|
||||
- [`upgrade-to-tls.md`](upgrade-to-tls.md) — one-step cutover from pre-HTTPS releases
|
||||
- [`quickstart.md`](quickstart.md) — docker-compose walkthrough with HTTPS examples
|
||||
- [`test-env.md`](test-env.md) — integration test environment (also HTTPS-only)
|
||||
- [`security.md`](security.md) — overall security posture, OCSP Must-Staple guidance, encryption-at-rest spec
|
||||
- Milestone spec: `prompts/https-everywhere-milestone.md` (authoritative source for locked decisions)
|
||||