mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-08 16:18:51 +00:00
Compare commits
4 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 52248be717 | |
| | 04c7eca615 | |
| | 6e646e0fe8 | |
| | 675b87ba63 | |
@@ -148,8 +148,34 @@ jobs:
|
|||||||
with:
|
with:
|
||||||
version: '3.13.0'
|
version: '3.13.0'
|
||||||
|
|
||||||
|
# HTTPS-Everywhere (v2.0.47): the chart fails render when no TLS source is
|
||||||
|
# configured. Every lint/template invocation below must pick exactly one
|
||||||
|
# provisioning mode — see deploy/helm/certctl/templates/_helpers.tpl
|
||||||
|
# (certctl.tls.required) and docs/tls.md.
|
||||||
- name: Lint Helm Chart
|
- name: Lint Helm Chart
|
||||||
run: helm lint deploy/helm/certctl/
|
run: |
|
||||||
|
helm lint deploy/helm/certctl/ \
|
||||||
|
--set server.tls.existingSecret=certctl-tls-ci
|
||||||
|
|
||||||
- name: Template Helm Chart
|
- name: Template Helm Chart (existingSecret mode)
|
||||||
run: helm template certctl deploy/helm/certctl/ > /dev/null
|
run: |
|
||||||
|
helm template certctl deploy/helm/certctl/ \
|
||||||
|
--set server.tls.existingSecret=certctl-tls-ci \
|
||||||
|
> /dev/null
|
||||||
|
|
||||||
|
- name: Template Helm Chart (cert-manager mode)
|
||||||
|
run: |
|
||||||
|
helm template certctl deploy/helm/certctl/ \
|
||||||
|
--set server.tls.certManager.enabled=true \
|
||||||
|
--set server.tls.certManager.issuerRef.name=letsencrypt-prod \
|
||||||
|
> /dev/null
|
||||||
|
|
||||||
|
- name: Template Helm Chart (guard fails without TLS)
|
||||||
|
run: |
|
||||||
|
# Inverse test: the chart MUST refuse to render when no TLS source is
|
||||||
|
# configured. If this ever renders successfully, the fail-loud guard
|
||||||
|
# in certctl.tls.required has regressed.
|
||||||
|
if helm template certctl deploy/helm/certctl/ > /dev/null 2>&1; then
|
||||||
|
echo "::error::Helm chart rendered without a TLS source — fail-loud guard regressed"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|||||||
+13
-1
@@ -63,6 +63,7 @@ certctl-cli
|
|||||||
/server
|
/server
|
||||||
/agent
|
/agent
|
||||||
/cli
|
/cli
|
||||||
|
/mcp-server
|
||||||
|
|
||||||
# Private strategy docs
|
# Private strategy docs
|
||||||
strategy.md
|
strategy.md
|
||||||
@@ -71,9 +72,20 @@ SECURITY_REMEDIATION.md
|
|||||||
# OS
|
# OS
|
||||||
.DS_Store
|
.DS_Store
|
||||||
Thumbs.db
|
Thumbs.db
|
||||||
mcp-server
|
|
||||||
|
|
||||||
# Local Go build/module caches (session-scoped, never committed)
|
# Local Go build/module caches (session-scoped, never committed)
|
||||||
/.gocache/
|
/.gocache/
|
||||||
/.gomodcache/
|
/.gomodcache/
|
||||||
/.gopath/
|
/.gopath/
|
||||||
|
/.gomodcache-gopath/
|
||||||
|
|
||||||
|
# Design scratch files (session-scoped)
|
||||||
|
/.i004-design.md
|
||||||
|
/.i005-design.md
|
||||||
|
|
||||||
|
# HTTPS-Everywhere (M-007) Phase 6: the docker-compose.test.yml tls-init
|
||||||
|
# container writes ca.crt / server.crt / server.key into this directory so
|
||||||
|
# the host-side integration_test.go binary can pin the CA via
|
||||||
|
# CERTCTL_TEST_CA_BUNDLE=./certs/ca.crt. Material is regenerated on every
|
||||||
|
# `docker compose up` and never belongs in git.
|
||||||
|
/deploy/test/certs/
|
||||||
|
|||||||
@@ -0,0 +1,50 @@
# Changelog

All notable changes to certctl are documented in this file. Dates use ISO 8601. Versions follow [Semantic Versioning](https://semver.org/).

## [2.2.0] — 2026-04-19

### HTTPS Everywhere — The Irony

> certctl manages other teams' certificates. Until v2.2, it didn't terminate TLS on its own control plane. We treated the server as an internal service sitting behind whatever TLS-terminating infrastructure the operator already owned — reverse proxies, Kubernetes Ingress controllers, service mesh sidecars. Working through an EST coverage-gap audit surfaced this as a credibility problem we wanted to fix head-on: a cert-lifecycle product should ship with HTTPS by default. This release flips that. Self-signed bootstrap for docker-compose demos, operator-supplied Secret for Helm (with optional cert-manager integration), and a one-step cutover with no backward-compat bridge. Out-of-date agents will fail at the TLS handshake layer on upgrade; the upgrade guide walks operators through the roll.

### Breaking Changes

- **HTTPS-only control plane. The plaintext HTTP listener is gone.** There is no `CERTCTL_TLS_ENABLED=false` escape hatch and no `:8080` fallback. Operators who were running certctl behind their own TLS terminator must either (a) continue doing so and let the downstream TLS terminator talk to certctl's HTTPS listener, or (b) bring their own cert/key and terminate on certctl directly. Either path requires config changes — see `docs/upgrade-to-tls.md` for a one-step cutover.
- **Agents reject `CERTCTL_SERVER_URL=http://...` at startup.** This is a pre-flight config validation failure with a fail-loud diagnostic pointing at `docs/upgrade-to-tls.md`. Not a TCP-refused, not a TLS-handshake-error — the agent will not even attempt the network call. Every agent deployment must be reconfigured before upgrading the server.
- **CLI and MCP clients require `https://` URLs.** Same pre-flight rejection of plaintext schemes.
- **TLS 1.2 is not supported. TLS 1.3 only.** The server's `tls.Config.MinVersion` is pinned to `tls.VersionTLS13`. Any client still negotiating TLS 1.2 will fail at the handshake. Modern curl, Go stdlib, browsers, and Kubernetes tooling all default to 1.3-capable; legacy clients may need an upgrade.
- **Helm chart requires a TLS source.** `helm install` without one of `server.tls.existingSecret`, `server.tls.certManager.enabled`, or (for eval only) `server.tls.selfSigned.enabled` fails at template time with a diagnostic pointing at `docs/tls.md`. There is no default-to-plaintext path.
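
The pre-flight rejection of plaintext URLs described above could look roughly like the following. This is a minimal sketch, not certctl's actual code: the function name `validateServerURL` and the exact error wording are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// validateServerURL is a hypothetical pre-flight check (the name is an
// assumption, not taken from certctl). It rejects any URL whose scheme is
// not https before a network call is attempted.
func validateServerURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse server URL: %w", err)
	}
	if u.Scheme != "https" {
		return fmt.Errorf("server URL scheme %q is not allowed; use https:// (see docs/upgrade-to-tls.md)", u.Scheme)
	}
	if u.Host == "" {
		return errors.New("server URL has no host")
	}
	return nil
}

func main() {
	for _, raw := range []string{"http://localhost:8443", "https://localhost:8443"} {
		if err := validateServerURL(raw); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", raw)
	}
}
```

Failing during config validation rather than at connect time keeps the diagnostic specific: the operator sees a scheme error instead of a generic connection or handshake failure.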

### Added

- **Self-signed bootstrap for Docker Compose demos.** A `certctl-tls-init` init container runs before the server on first boot, generates a SAN-valid self-signed cert into `deploy/test/certs/`, and exits. The server mounts the resulting cert/key. Every curl in the demo stack pins against `./deploy/test/certs/ca.crt` with `--cacert`.
- **Helm chart TLS provisioning — three modes.** Operator-supplied Secret (`server.tls.existingSecret`), cert-manager integration (`server.tls.certManager.enabled` with issuer selection), or self-signed (`server.tls.selfSigned.enabled` — eval only, not supported for production). Chart templates enforce exactly one is active.
- **Hot-reload of TLS cert/key on `SIGHUP`.** Overwrite the cert/key on disk, send `SIGHUP` to the server PID, watch the `slog.Info("tls.reload", ...)` log line, and new TLS connections use the new cert. Failure during reload is logged and does not crash the server; the previous cert remains in use.
- **Agent CA-bundle env vars.** `CERTCTL_SERVER_CA_BUNDLE_PATH` points at a PEM file the agent's HTTP client will trust. `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY` disables verification (development only — the agent logs a loud warning at startup). `install-agent.sh` writes both as commented template lines into the generated `agent.env`.
- **Integration test suite runs over HTTPS.** `go test -tags=integration ./deploy/test/...` stands up the full Compose stack, extracts the self-signed CA bundle, and exercises every certctl API over `https://localhost:8443`. All 34 subtests green.
- **`docs/tls.md`** — cert provisioning patterns: bring-your-own Secret, cert-manager, self-signed bootstrap, SAN requirements, rotation workflows, SIGHUP reload semantics, troubleshooting.
- **`docs/upgrade-to-tls.md`** — one-step cutover guide for existing v2.1 operators. Walks through the agent fleet roll, Helm upgrade sequencing, downgrade-is-not-supported warnings, and cert-provisioning decision tree.
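
One common way to get the `SIGHUP` reload behavior listed above in Go is to serve certificates through `tls.Config.GetCertificate` and swap the loaded key pair atomically. The sketch below illustrates that pattern only; the `certReloader` type, its method names, and the placeholder file paths are assumptions and are not taken from certctl's source.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"log/slog"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// certReloader is a hypothetical helper that holds the active key pair.
type certReloader struct {
	certFile, keyFile string
	current           atomic.Pointer[tls.Certificate]
}

// reload reads the key pair from disk and swaps it in only on success, so a
// failed reload leaves the previous certificate in use.
func (r *certReloader) reload() error {
	cert, err := tls.LoadX509KeyPair(r.certFile, r.keyFile)
	if err != nil {
		return fmt.Errorf("load key pair: %w", err)
	}
	r.current.Store(&cert)
	return nil
}

// getCertificate has the signature expected by tls.Config.GetCertificate.
func (r *certReloader) getCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	cert := r.current.Load()
	if cert == nil {
		return nil, errors.New("no certificate loaded")
	}
	return cert, nil
}

// watchSIGHUP reloads on every SIGHUP and logs the outcome without exiting.
func (r *certReloader) watchSIGHUP(logger *slog.Logger) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGHUP)
	go func() {
		for range ch {
			if err := r.reload(); err != nil {
				logger.Error("tls.reload failed; keeping previous cert", "error", err)
				continue
			}
			logger.Info("tls.reload", "cert", r.certFile)
		}
	}()
}

func main() {
	// Placeholder paths; a real deployment would read these from config.
	r := &certReloader{certFile: "server.crt", keyFile: "server.key"}
	if err := r.reload(); err != nil {
		fmt.Println("initial load failed:", err)
		return
	}
	r.watchSIGHUP(slog.Default())
	cfg := &tls.Config{MinVersion: tls.VersionTLS13, GetCertificate: r.getCertificate}
	fmt.Println("TLS 1.3 only:", cfg.MinVersion == tls.VersionTLS13)
}
```

Because the handshake reads the pointer on each new connection, established connections keep their original certificate and only new connections see the rotated one, which matches the behavior the changelog describes.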

### Changed

- `cmd/server/main.go` now calls `http.Server.ListenAndServeTLS(certFile, keyFile)`. The plaintext `ListenAndServe` code path is deleted — `grep -rn "ListenAndServe[^T]" cmd/ internal/` returns zero hits.
- All documentation curls (`docs/testing-guide.md`, `docs/quickstart.md`, `deploy/helm/INSTALLATION.md`, `deploy/helm/DEPLOYMENT_GUIDE.md`, `deploy/ENVIRONMENTS.md`, `docs/openapi.md`, migration guides, example READMEs) use `https://localhost:8443` and `--cacert` against the demo stack's bundle.
- OpenAPI spec (`api/openapi.yaml`) `servers` blocks default to `https://localhost:8443`.

### Security

- TLS 1.3 pinned via `tls.Config.MinVersion = tls.VersionTLS13`.
- Plaintext HTTP listener removed entirely — no port 8080, no `Upgrade-Insecure-Requests`, no HSTS-required redirect dance. There is only one port: 8443, TLS 1.3.
- `grep -rn "http://" cmd/ internal/` returns zero hits outside test fixtures and the agent-side URL-scheme rejection error message.

### Upgrade Notes

Read `docs/upgrade-to-tls.md` before upgrading. The short version:

1. Pick a TLS source — bring-your-own cert, cert-manager, or self-signed bootstrap.
2. Upgrade the server with TLS configured. First boot over HTTPS.
3. Roll the agent fleet: set `CERTCTL_SERVER_URL=https://...` and, if using a private CA, `CERTCTL_SERVER_CA_BUNDLE_PATH`. Old agents will fail loud at startup — expected.
4. Roll CLI/MCP clients the same way.

There is no backward-compat bridge. There is no dual-listener mode. The cutover is one step.
@@ -197,7 +197,7 @@ cd certctl
|
|||||||
docker compose -f deploy/docker-compose.yml up -d --build
|
docker compose -f deploy/docker-compose.yml up -d --build
|
||||||
```
|
```
|
||||||
|
|
||||||
Wait ~30 seconds, then open **http://localhost:8443** in your browser. The onboarding wizard walks you through connecting a CA, deploying an agent, and issuing your first certificate.
|
Wait ~30 seconds, then open **https://localhost:8443** in your browser. (The shipped `docker-compose.yml` self-signs a cert via the `certctl-tls-init` init container on first boot — accept the browser warning for the demo, or feed the generated `ca.crt` to your client.) The onboarding wizard walks you through connecting a CA, deploying an agent, and issuing your first certificate.
|
||||||
|
|
||||||
**Want a pre-populated demo instead?** Add the demo override to see 32 certificates across 10 issuers, 8 agents, and 180 days of realistic history:
|
**Want a pre-populated demo instead?** Add the demo override to see 32 certificates across 10 issuers, 8 agents, and 180 days of realistic history:
|
||||||
|
|
||||||
@@ -208,10 +208,12 @@ docker compose -f deploy/docker-compose.yml -f deploy/docker-compose.demo.yml up
|
|||||||
The `deploy/` directory has four compose files: `docker-compose.yml` (base platform), `docker-compose.demo.yml` (demo data overlay), `docker-compose.dev.yml` (PgAdmin + debug logging), and `docker-compose.test.yml` (standalone integration tests with real CA backends). See the [Docker Compose Environments Guide](deploy/ENVIRONMENTS.md) for a service-by-service walkthrough, or the [Quick Start](docs/quickstart.md#docker-compose-environments) for a summary.
|
The `deploy/` directory has four compose files: `docker-compose.yml` (base platform), `docker-compose.demo.yml` (demo data overlay), `docker-compose.dev.yml` (PgAdmin + debug logging), and `docker-compose.test.yml` (standalone integration tests with real CA backends). See the [Docker Compose Environments Guide](deploy/ENVIRONMENTS.md) for a service-by-service walkthrough, or the [Quick Start](docs/quickstart.md#docker-compose-environments) for a summary.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl http://localhost:8443/health
|
curl --cacert <(docker compose -f deploy/docker-compose.yml exec -T certctl-server cat /etc/certctl/tls/ca.crt) https://localhost:8443/health
|
||||||
# {"status":"healthy"}
|
# {"status":"healthy"}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
The control plane is HTTPS-only (TLS 1.3, no plaintext listener). See [`docs/tls.md`](docs/tls.md) for cert provisioning patterns and [`docs/upgrade-to-tls.md`](docs/upgrade-to-tls.md) if you're upgrading from a pre-v2.2 release.
|
||||||
|
|
||||||
### Agent Install (One-Liner)
|
### Agent Install (One-Liner)
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
@@ -326,8 +328,9 @@ Each directory contains a `docker-compose.yml` and a `README.md` explaining the
|
|||||||
go install github.com/shankar0123/certctl/cmd/cli@latest
|
go install github.com/shankar0123/certctl/cmd/cli@latest
|
||||||
|
|
||||||
# Configure
|
# Configure
|
||||||
export CERTCTL_SERVER_URL=http://localhost:8443
|
export CERTCTL_SERVER_URL=https://localhost:8443
|
||||||
export CERTCTL_API_KEY=your-api-key
|
export CERTCTL_API_KEY=your-api-key
|
||||||
|
export CERTCTL_SERVER_CA_BUNDLE_PATH=/path/to/ca.crt # or --ca-bundle on the CLI; --insecure for dev self-signed
|
||||||
|
|
||||||
# Usage
|
# Usage
|
||||||
certctl-cli certs list # List all certificates
|
certctl-cli certs list # List all certificates
|
||||||
@@ -347,11 +350,14 @@ certctl ships a standalone MCP (Model Context Protocol) server that exposes all
|
|||||||
```bash
|
```bash
|
||||||
# Install and run
|
# Install and run
|
||||||
go install github.com/shankar0123/certctl/cmd/mcp-server@latest
|
go install github.com/shankar0123/certctl/cmd/mcp-server@latest
|
||||||
export CERTCTL_SERVER_URL=http://localhost:8443
|
export CERTCTL_SERVER_URL=https://localhost:8443
|
||||||
export CERTCTL_API_KEY=your-api-key
|
export CERTCTL_API_KEY=your-api-key
|
||||||
|
export CERTCTL_SERVER_CA_BUNDLE_PATH=/path/to/ca.crt # required for self-signed bootstrap
|
||||||
mcp-server
|
mcp-server
|
||||||
```
|
```
|
||||||
|
|
||||||
|
The MCP server is env-vars-only — there are no CLI flags for TLS. If you must bypass verification for local development against a self-signed cert, set `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true`. Never set that in production.
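
As a rough illustration of how a client might honor these two environment variables, the sketch below builds an `http.Client` that trusts a PEM bundle and optionally skips verification. It is an assumption-laden example, not the MCP server's implementation: `newHTTPClient`, the timeout, and the `"true"` string comparison are hypothetical.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
	"time"
)

// newHTTPClient is a hypothetical constructor. With a bundle path it trusts
// only the CAs in that PEM file; with an empty path it falls back to the
// system roots. insecureSkipVerify is for local development only.
func newHTTPClient(caBundlePath string, insecureSkipVerify bool) (*http.Client, error) {
	tlsCfg := &tls.Config{
		MinVersion:         tls.VersionTLS13,
		InsecureSkipVerify: insecureSkipVerify,
	}
	if caBundlePath != "" {
		pem, err := os.ReadFile(caBundlePath)
		if err != nil {
			return nil, fmt.Errorf("read CA bundle: %w", err)
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(pem) {
			return nil, errors.New("CA bundle contains no valid PEM certificates")
		}
		tlsCfg.RootCAs = pool
	}
	return &http.Client{
		Timeout:   30 * time.Second, // illustrative value
		Transport: &http.Transport{TLSClientConfig: tlsCfg},
	}, nil
}

func main() {
	client, err := newHTTPClient(
		os.Getenv("CERTCTL_SERVER_CA_BUNDLE_PATH"),
		os.Getenv("CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY") == "true",
	)
	if err != nil {
		fmt.Println("client setup failed:", err)
		return
	}
	fmt.Println("client ready, timeout:", client.Timeout)
}
```

Returning an error from the constructor when the bundle is unreadable or empty surfaces a misconfigured path at startup instead of at the first request.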
|
||||||
|
|
||||||
**Claude Desktop** (`claude_desktop_config.json`):
|
**Claude Desktop** (`claude_desktop_config.json`):
|
||||||
```json
|
```json
|
||||||
{
|
{
|
||||||
@@ -359,8 +365,9 @@ mcp-server
|
|||||||
"certctl": {
|
"certctl": {
|
||||||
"command": "mcp-server",
|
"command": "mcp-server",
|
||||||
"env": {
|
"env": {
|
||||||
"CERTCTL_SERVER_URL": "http://localhost:8443",
|
"CERTCTL_SERVER_URL": "https://localhost:8443",
|
||||||
"CERTCTL_API_KEY": "your-api-key"
|
"CERTCTL_API_KEY": "your-api-key",
|
||||||
|
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/ca.crt"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
+66
-4
@@ -17,10 +17,8 @@ info:
|
|||||||
url: https://github.com/shankar0123/certctl/blob/master/LICENSE
|
url: https://github.com/shankar0123/certctl/blob/master/LICENSE
|
||||||
|
|
||||||
servers:
|
servers:
|
||||||
- url: http://localhost:8080
|
- url: https://localhost:8443
|
||||||
description: Local development
|
description: Docker Compose demo (self-signed cert; pin with ./deploy/test/certs/ca.crt)
|
||||||
- url: http://localhost:8443
|
|
||||||
description: Docker Compose demo
|
|
||||||
|
|
||||||
security:
|
security:
|
||||||
- bearerAuth: []
|
- bearerAuth: []
|
||||||
@@ -2037,6 +2035,16 @@ paths:
|
|||||||
parameters:
|
parameters:
|
||||||
- $ref: "#/components/parameters/page"
|
- $ref: "#/components/parameters/page"
|
||||||
- $ref: "#/components/parameters/per_page"
|
- $ref: "#/components/parameters/per_page"
|
||||||
|
- name: status
|
||||||
|
in: query
|
||||||
|
required: false
|
||||||
|
description: |
|
||||||
|
Filter by lifecycle status. I-005: `dead` powers the Dead letter
|
||||||
|
tab on the GUI; empty/omitted returns the default all-statuses
|
||||||
|
listing to preserve pre-I-005 behavior.
|
||||||
|
schema:
|
||||||
|
type: string
|
||||||
|
enum: [pending, sent, failed, dead, read]
|
||||||
responses:
|
responses:
|
||||||
"200":
|
"200":
|
||||||
description: Paginated list of notifications
|
description: Paginated list of notifications
|
||||||
@@ -2094,6 +2102,36 @@ paths:
|
|||||||
"500":
|
"500":
|
||||||
$ref: "#/components/responses/InternalError"
|
$ref: "#/components/responses/InternalError"
|
||||||
|
|
||||||
|
/api/v1/notifications/{id}/requeue:
|
||||||
|
post:
|
||||||
|
tags: [Notifications]
|
||||||
|
summary: Requeue a dead notification
|
||||||
|
description: |
|
||||||
|
I-005: flip a notification from the `dead` dead-letter queue back to
|
||||||
|
`pending` so the retry sweep (default 2 minutes) picks it up on its
|
||||||
|
next tick. Used by operators after fixing the underlying delivery
|
||||||
|
failure (SMTP config, webhook endpoint, etc.). Clears `next_retry_at`
|
||||||
|
and resets the `retry_count` budget; `last_error` is preserved for
|
||||||
|
audit continuity.
|
||||||
|
operationId: requeueNotification
|
||||||
|
parameters:
|
||||||
|
- $ref: "#/components/parameters/resourceId"
|
||||||
|
responses:
|
||||||
|
"200":
|
||||||
|
description: Requeued
|
||||||
|
content:
|
||||||
|
application/json:
|
||||||
|
schema:
|
||||||
|
$ref: "#/components/schemas/StatusResponse"
|
||||||
|
"400":
|
||||||
|
$ref: "#/components/responses/BadRequest"
|
||||||
|
"404":
|
||||||
|
$ref: "#/components/responses/NotFound"
|
||||||
|
"405":
|
||||||
|
description: Method not allowed (POST only)
|
||||||
|
"500":
|
||||||
|
$ref: "#/components/responses/InternalError"
|
||||||
|
|
||||||
# ─── Stats ───────────────────────────────────────────────────────────
|
# ─── Stats ───────────────────────────────────────────────────────────
|
||||||
/api/v1/stats/summary:
|
/api/v1/stats/summary:
|
||||||
get:
|
get:
|
||||||
@@ -3905,8 +3943,32 @@ components:
|
|||||||
format: date-time
|
format: date-time
|
||||||
status:
|
status:
|
||||||
type: string
|
type: string
|
||||||
|
enum: [pending, sent, failed, dead, read]
|
||||||
|
description: |
|
||||||
|
Notification lifecycle status. I-005 adds `dead` for notifications
|
||||||
|
that exhausted their 5-attempt retry budget and were moved to the
|
||||||
|
dead-letter queue; operators triage these in the GUI's Dead letter
|
||||||
|
tab and use POST /notifications/{id}/requeue to resurrect them.
|
||||||
error:
|
error:
|
||||||
type: string
|
type: string
|
||||||
|
retry_count:
|
||||||
|
type: integer
|
||||||
|
description: |
|
||||||
|
Number of delivery attempts made. I-005 retry-sweep field; caps
|
||||||
|
at max_attempts=5 before the notification transitions to `dead`.
|
||||||
|
next_retry_at:
|
||||||
|
type: string
|
||||||
|
format: date-time
|
||||||
|
description: |
|
||||||
|
When the next retry attempt is scheduled. I-005 retry-sweep field;
|
||||||
|
null for `sent`, `dead`, and `read` statuses. Backoff follows
|
||||||
|
`min(2^retry_count * 1m, 1h)`.
|
||||||
|
last_error:
|
||||||
|
type: string
|
||||||
|
description: |
|
||||||
|
Most recent transient delivery error (SMTP failure, webhook 5xx,
|
||||||
|
etc.). I-005 retry-sweep field; surfaced on the Dead letter tab
|
||||||
|
so operators can triage without chasing server logs.
|
||||||
created_at:
|
created_at:
|
||||||
type: string
|
type: string
|
||||||
format: date-time
|
format: date-time
|
||||||
|
|||||||
+273
-31
@@ -7,6 +7,7 @@ import (
|
|||||||
"crypto/elliptic"
|
"crypto/elliptic"
|
||||||
"crypto/rand"
|
"crypto/rand"
|
||||||
"crypto/rsa"
|
"crypto/rsa"
|
||||||
|
"crypto/tls"
|
||||||
"crypto/x509"
|
"crypto/x509"
|
||||||
"crypto/x509/pkix"
|
"crypto/x509/pkix"
|
||||||
"encoding/json"
|
"encoding/json"
|
||||||
@@ -72,7 +73,7 @@ func TestAgent_Heartbeat_Success(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Should not panic
|
// Should not panic
|
||||||
agent.sendHeartbeat(context.Background())
|
agent.sendHeartbeat(context.Background())
|
||||||
@@ -93,7 +94,7 @@ func TestAgent_Heartbeat_ServerError(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Should increment consecutive failures
|
// Should increment consecutive failures
|
||||||
failureBefore := agent.consecutiveFailures
|
failureBefore := agent.consecutiveFailures
|
||||||
@@ -115,7 +116,7 @@ func TestAgent_Heartbeat_ConnectionError(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Should fail due to connection error
|
// Should fail due to connection error
|
||||||
agent.sendHeartbeat(context.Background())
|
agent.sendHeartbeat(context.Background())
|
||||||
@@ -150,7 +151,7 @@ func TestAgent_PollWork_NoWork(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Should not panic
|
// Should not panic
|
||||||
agent.pollForWork(context.Background())
|
agent.pollForWork(context.Background())
|
||||||
@@ -195,7 +196,7 @@ func TestAgent_PollWork_Success(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Should not panic; work items are processed in separate goroutines in real usage
|
// Should not panic; work items are processed in separate goroutines in real usage
|
||||||
agent.pollForWork(context.Background())
|
agent.pollForWork(context.Background())
|
||||||
@@ -285,7 +286,7 @@ func TestParsePEMFile(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Parse the file
|
// Parse the file
|
||||||
entries := agent.parsePEMFile(certPath)
|
entries := agent.parsePEMFile(certPath)
|
||||||
@@ -336,7 +337,7 @@ func TestParsePEMFile_MultipleCerts(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
entries := agent.parsePEMFile(certPath)
|
entries := agent.parsePEMFile(certPath)
|
||||||
|
|
||||||
@@ -362,7 +363,7 @@ func TestParseDERFile(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
entry, err := agent.parseDERFile(derPath)
|
entry, err := agent.parseDERFile(derPath)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
@@ -397,7 +398,7 @@ func TestParseDERFile_Invalid(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
_, err := agent.parseDERFile(derPath)
|
_, err := agent.parseDERFile(derPath)
|
||||||
if err == nil {
|
if err == nil {
|
||||||
@@ -439,7 +440,7 @@ func TestScanDirectory(t *testing.T) {
|
|||||||
DiscoveryDirs: []string{tmpdir},
|
DiscoveryDirs: []string{tmpdir},
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Simulate directory walk manually (as runDiscoveryScan does)
|
// Simulate directory walk manually (as runDiscoveryScan does)
|
||||||
var certs []discoveredCertEntry
|
var certs []discoveredCertEntry
|
||||||
@@ -474,7 +475,7 @@ func TestCreateTargetConnector_NGINX(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
configJSON := json.RawMessage(`{"cert_path":"/etc/nginx/cert.pem"}`)
|
configJSON := json.RawMessage(`{"cert_path":"/etc/nginx/cert.pem"}`)
|
||||||
connector, err := agent.createTargetConnector("NGINX", configJSON)
|
connector, err := agent.createTargetConnector("NGINX", configJSON)
|
||||||
@@ -496,7 +497,7 @@ func TestCreateTargetConnector_Unsupported(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
_, err := agent.createTargetConnector("UnsupportedType", nil)
|
_, err := agent.createTargetConnector("UnsupportedType", nil)
|
||||||
|
|
||||||
@@ -530,7 +531,7 @@ func TestFetchCertificate_Success(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
certPEM, err := agent.fetchCertificate(context.Background(), "mc-001")
|
certPEM, err := agent.fetchCertificate(context.Background(), "mc-001")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
@@ -556,7 +557,7 @@ func TestFetchCertificate_NotFound(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
_, err := agent.fetchCertificate(context.Background(), "mc-nonexistent")
|
_, err := agent.fetchCertificate(context.Background(), "mc-nonexistent")
|
||||||
if err == nil {
|
if err == nil {
|
||||||
@@ -592,7 +593,7 @@ func TestReportJobStatus_Success(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
err := agent.reportJobStatus(context.Background(), "j-001", "Completed", "")
|
err := agent.reportJobStatus(context.Background(), "j-001", "Completed", "")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
@@ -624,7 +625,7 @@ func TestReportJobStatus_WithError(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
err := agent.reportJobStatus(context.Background(), "j-001", "Failed", "deployment failed")
|
err := agent.reportJobStatus(context.Background(), "j-001", "Failed", "deployment failed")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
@@ -658,7 +659,7 @@ func TestMakeRequest_Success(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
resp, err := agent.makeRequest(context.Background(), http.MethodPost, "/test", map[string]string{"key": "value"})
|
resp, err := agent.makeRequest(context.Background(), http.MethodPost, "/test", map[string]string{"key": "value"})
|
||||||
if err != nil {
|
if err != nil {
|
||||||
@@ -680,7 +681,7 @@ func TestMakeRequest_InvalidURL(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
_, err := agent.makeRequest(context.Background(), http.MethodGet, "/test", nil)
|
_, err := agent.makeRequest(context.Background(), http.MethodGet, "/test", nil)
|
||||||
if err == nil {
|
if err == nil {
|
||||||
@@ -765,7 +766,7 @@ func TestNewAgent(t *testing.T) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
if agent.config != cfg {
|
if agent.config != cfg {
|
||||||
t.Error("config not set correctly")
|
t.Error("config not set correctly")
|
||||||
@@ -791,7 +792,7 @@ func TestNewAgent_WithLogger(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
|
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
if agent.logger != logger {
|
if agent.logger != logger {
|
||||||
t.Error("logger not set correctly")
|
t.Error("logger not set correctly")
|
||||||
@@ -954,7 +955,7 @@ func TestCreateTargetConnector_AllSupportedTypes(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
for _, tt := range tests {
|
for _, tt := range tests {
|
||||||
t.Run(tt.name, func(t *testing.T) {
|
t.Run(tt.name, func(t *testing.T) {
|
||||||
@@ -1007,7 +1008,7 @@ func TestCreateTargetConnector_InvalidJSON(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
invalidJSON := json.RawMessage("{invalid json}")
|
invalidJSON := json.RawMessage("{invalid json}")
|
||||||
|
|
||||||
@@ -1031,7 +1032,7 @@ func TestCreateTargetConnector_UnknownType(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
_, err := agent.createTargetConnector("MagicBox", nil)
|
_, err := agent.createTargetConnector("MagicBox", nil)
|
||||||
|
|
||||||
@@ -1061,7 +1062,7 @@ func TestCreateTargetConnector_EmptyConfig(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
for _, typeName := range tests {
|
for _, typeName := range tests {
|
||||||
t.Run(typeName, func(t *testing.T) {
|
t.Run(typeName, func(t *testing.T) {
|
||||||
@@ -1137,7 +1138,7 @@ func TestRunDiscoveryScan_ValidCerts(t *testing.T) {
|
|||||||
DiscoveryDirs: []string{tmpDir},
|
DiscoveryDirs: []string{tmpDir},
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Run discovery scan
|
// Run discovery scan
|
||||||
agent.runDiscoveryScan(context.Background())
|
agent.runDiscoveryScan(context.Background())
|
||||||
@@ -1165,7 +1166,7 @@ func TestRunDiscoveryScan_NoCertificates(t *testing.T) {
|
|||||||
DiscoveryDirs: []string{tmpDir},
|
DiscoveryDirs: []string{tmpDir},
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Run discovery scan - should complete without error even with empty directory
|
// Run discovery scan - should complete without error even with empty directory
|
||||||
agent.runDiscoveryScan(context.Background())
|
agent.runDiscoveryScan(context.Background())
|
||||||
@@ -1222,7 +1223,7 @@ func TestRunDiscoveryScan_MultipleCerts(t *testing.T) {
|
|||||||
DiscoveryDirs: []string{tmpDir},
|
DiscoveryDirs: []string{tmpDir},
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Run discovery scan
|
// Run discovery scan
|
||||||
agent.runDiscoveryScan(context.Background())
|
agent.runDiscoveryScan(context.Background())
|
||||||
@@ -1273,7 +1274,7 @@ func TestRunDiscoveryScan_DERCertificate(t *testing.T) {
|
|||||||
DiscoveryDirs: []string{tmpDir},
|
DiscoveryDirs: []string{tmpDir},
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Run discovery scan
|
// Run discovery scan
|
||||||
agent.runDiscoveryScan(context.Background())
|
agent.runDiscoveryScan(context.Background())
|
||||||
@@ -1331,7 +1332,7 @@ func TestRunDiscoveryScan_Subdirectories(t *testing.T) {
|
|||||||
DiscoveryDirs: []string{tmpDir},
|
DiscoveryDirs: []string{tmpDir},
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Run discovery scan - should recursively find certs in subdirs
|
// Run discovery scan - should recursively find certs in subdirs
|
||||||
agent.runDiscoveryScan(context.Background())
|
agent.runDiscoveryScan(context.Background())
|
||||||
@@ -1369,7 +1370,7 @@ func TestRunDiscoveryScan_ServerError(t *testing.T) {
|
|||||||
DiscoveryDirs: []string{tmpDir},
|
DiscoveryDirs: []string{tmpDir},
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
// Should handle server error gracefully without panicking
|
// Should handle server error gracefully without panicking
|
||||||
agent.runDiscoveryScan(context.Background())
|
agent.runDiscoveryScan(context.Background())
|
||||||
@@ -1396,7 +1397,7 @@ func TestDiscoveredCertEntry_ValidFields(t *testing.T) {
|
|||||||
Hostname: "test-host",
|
Hostname: "test-host",
|
||||||
}
|
}
|
||||||
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
agent := NewAgent(cfg, logger)
|
agent, _ := NewAgent(cfg, logger)
|
||||||
|
|
||||||
entries := agent.parsePEMFile(certPath)
|
entries := agent.parsePEMFile(certPath)
|
||||||
|
|
||||||
@@ -1447,3 +1448,244 @@ func TestDiscoveredCertEntry_ValidFields(t *testing.T) {
|
|||||||
t.Error("PEMData should not be empty")
|
t.Error("PEMData should not be empty")
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ---------------------------------------------------------------------------
|
||||||
|
// HTTPS-Everywhere milestone (v2.2, §3.2 / §7) — Phase 5 client-side tests.
|
||||||
|
//
|
||||||
|
// These tests pin the agent's pre-flight HTTPS-scheme guard and the TLS
|
||||||
|
// configuration surface (CA bundle loading + TLS 1.3 round-trip) so that
|
||||||
|
// regressions surface at unit-test time, not at the first heartbeat of a
|
||||||
|
// production rollout. Matches the same contract asserted by the sibling
|
||||||
|
// binaries cmd/cli/main_test.go and cmd/mcp-server/main_test.go — the three
|
||||||
|
// must stay in lock-step because all three are HTTPS-only clients of the
|
||||||
|
// same control plane.
|
||||||
|
// ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
// TestValidateHTTPSScheme pins the pre-flight URL-scheme guard that the
|
||||||
|
// HTTPS-Everywhere milestone requires on the agent binary startup path. The
|
||||||
|
// agent's diagnostic is distinct from the CLI/MCP variants because it names
|
||||||
|
// CERTCTL_SERVER_URL, the env var that seeds the default of the agent's
|
||||||
|
// -server flag. Every case here mirrors the dispatch arms in cmd/agent/main.go:
|
||||||
|
// validateHTTPSScheme; drifting the error-message substrings is what this
|
||||||
|
// test is here to catch.
|
||||||
|
func TestValidateHTTPSScheme(t *testing.T) {
|
||||||
|
tests := []struct {
|
||||||
|
name string
|
||||||
|
serverURL string
|
||||||
|
wantErr bool
|
||||||
|
wantErrSub string
|
||||||
|
}{
|
||||||
|
{
|
||||||
|
name: "https URL passes",
|
||||||
|
serverURL: "https://certctl-server:8443",
|
||||||
|
wantErr: false,
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "https URL with path passes",
|
||||||
|
serverURL: "https://certctl.example.com/api/v1",
|
||||||
|
wantErr: false,
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "uppercase HTTPS scheme passes (url.Parse lowercases)",
|
||||||
|
serverURL: "HTTPS://certctl-server:8443",
|
||||||
|
wantErr: false,
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "empty URL rejected names CERTCTL_SERVER_URL",
|
||||||
|
serverURL: "",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "CERTCTL_SERVER_URL is empty",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "plaintext http rejected",
|
||||||
|
serverURL: "http://certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "plaintext http://",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "bare host missing scheme falls through to unsupported",
|
||||||
|
serverURL: "localhost:8443",
|
||||||
|
wantErr: true,
|
||||||
|
// url.Parse treats "localhost:8443" as scheme=localhost,
|
||||||
|
// opaque=8443 — exercises the default arm (unsupported scheme)
|
||||||
|
// rather than the empty-scheme arm. Both are fail-closed, which
|
||||||
|
// is what we care about.
|
||||||
|
wantErrSub: "unsupported scheme",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "scheme-relative URL rejected",
|
||||||
|
serverURL: "//certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "missing a scheme",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "unsupported scheme rejected",
|
||||||
|
serverURL: "ftp://certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "unsupported scheme",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "ws scheme rejected",
|
||||||
|
serverURL: "ws://certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "unsupported scheme",
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
for _, tt := range tests {
|
||||||
|
t.Run(tt.name, func(t *testing.T) {
|
||||||
|
err := validateHTTPSScheme(tt.serverURL)
|
||||||
|
if (err != nil) != tt.wantErr {
|
||||||
|
t.Fatalf("validateHTTPSScheme(%q) err=%v wantErr=%v", tt.serverURL, err, tt.wantErr)
|
||||||
|
}
|
||||||
|
if tt.wantErr && tt.wantErrSub != "" && !strings.Contains(err.Error(), tt.wantErrSub) {
|
||||||
|
t.Errorf("validateHTTPSScheme(%q) err=%q must contain %q so operators see the right diagnostic",
|
||||||
|
tt.serverURL, err.Error(), tt.wantErrSub)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// writeTestCABundle PEM-encodes a cert's DER bytes and writes the result to a
|
||||||
|
// tmp file inside dir. Used by CA-bundle tests so each case owns a distinct
|
||||||
|
// file path (matters for the "missing file" case which must point at a path
|
||||||
|
// that provably does not exist). Returns the path.
|
||||||
|
func writeTestCABundle(t *testing.T, dir string, certDER []byte, filename string) string {
|
||||||
|
t.Helper()
|
||||||
|
pemBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: certDER})
|
||||||
|
path := filepath.Join(dir, filename)
|
||||||
|
if err := os.WriteFile(path, pemBytes, 0644); err != nil {
|
||||||
|
t.Fatalf("writing CA bundle %q: %v", path, err)
|
||||||
|
}
|
||||||
|
return path
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewAgent_CABundle_Success confirms that a well-formed PEM bundle gets
|
||||||
|
// parsed into an x509.CertPool and wired onto the agent's HTTP client
|
||||||
|
// transport. This is the happy path the docs/tls.md "Private CA signed
|
||||||
|
// server cert" section depends on.
|
||||||
|
func TestNewAgent_CABundle_Success(t *testing.T) {
|
||||||
|
cert, err := generateTestCertWithCN("test.certctl.local")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("generateTestCertWithCN: %v", err)
|
||||||
|
}
|
||||||
|
bundlePath := writeTestCABundle(t, t.TempDir(), cert.Raw, "ca-bundle.pem")
|
||||||
|
|
||||||
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
|
agent, err := NewAgent(&AgentConfig{
|
||||||
|
ServerURL: "https://certctl-server:8443",
|
||||||
|
APIKey: "test-key",
|
||||||
|
AgentID: "a-test",
|
||||||
|
Hostname: "test-host",
|
||||||
|
CABundlePath: bundlePath,
|
||||||
|
}, logger)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("NewAgent with valid CA bundle err=%v want nil", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
transport, ok := agent.client.Transport.(*http.Transport)
|
||||||
|
if !ok {
|
||||||
|
t.Fatalf("agent.client.Transport is %T; want *http.Transport", agent.client.Transport)
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig == nil {
|
||||||
|
t.Fatal("TLSClientConfig is nil; HTTPS-everywhere milestone requires a non-nil TLS config")
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig.MinVersion != tls.VersionTLS13 {
|
||||||
|
t.Errorf("MinVersion=%x want TLS 1.3 (%x) per §2.3 of the milestone spec",
|
||||||
|
transport.TLSClientConfig.MinVersion, tls.VersionTLS13)
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig.RootCAs == nil {
|
||||||
|
t.Error("RootCAs is nil; the configured CA bundle was silently dropped")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewAgent_CABundle_MissingFile pins the fail-loud behavior when the
|
||||||
|
// operator points CERTCTL_SERVER_CA_BUNDLE_PATH at a path that does not
|
||||||
|
// exist. Falling back to system roots here would mask a misconfiguration as
|
||||||
|
// a much harder-to-debug TLS handshake failure downstream.
|
||||||
|
func TestNewAgent_CABundle_MissingFile(t *testing.T) {
|
||||||
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
|
missingPath := filepath.Join(t.TempDir(), "does-not-exist.pem")
|
||||||
|
_, err := NewAgent(&AgentConfig{
|
||||||
|
ServerURL: "https://certctl-server:8443",
|
||||||
|
APIKey: "test-key",
|
||||||
|
AgentID: "a-test",
|
||||||
|
Hostname: "test-host",
|
||||||
|
CABundlePath: missingPath,
|
||||||
|
}, logger)
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("NewAgent err=nil for missing CA bundle path; must fail loud at startup")
|
||||||
|
}
|
||||||
|
if !strings.Contains(err.Error(), "reading CA bundle") {
|
||||||
|
t.Errorf("err=%q must contain \"reading CA bundle\" so operators can trace the cause", err.Error())
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewAgent_CABundle_EmptyPEM covers the "file exists but contains no
|
||||||
|
// valid certs" case (garbage, wrong-format, stripped PEM). AppendCertsFromPEM
|
||||||
|
// returns false in this case; NewAgent must translate that into a fail-loud
|
||||||
|
// startup error rather than quietly carry on with an empty pool.
|
||||||
|
func TestNewAgent_CABundle_EmptyPEM(t *testing.T) {
|
||||||
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
|
bundlePath := filepath.Join(t.TempDir(), "empty.pem")
|
||||||
|
if err := os.WriteFile(bundlePath, []byte("not a pem-encoded certificate, just garbage\n"), 0644); err != nil {
|
||||||
|
t.Fatalf("writing garbage bundle: %v", err)
|
||||||
|
}
|
||||||
|
_, err := NewAgent(&AgentConfig{
|
||||||
|
ServerURL: "https://certctl-server:8443",
|
||||||
|
APIKey: "test-key",
|
||||||
|
AgentID: "a-test",
|
||||||
|
Hostname: "test-host",
|
||||||
|
CABundlePath: bundlePath,
|
||||||
|
}, logger)
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("NewAgent err=nil for empty-PEM CA bundle; must fail loud at startup")
|
||||||
|
}
|
||||||
|
if !strings.Contains(err.Error(), "no valid PEM-encoded certificates") {
|
||||||
|
t.Errorf("err=%q must contain \"no valid PEM-encoded certificates\" so operators see why the bundle was rejected", err.Error())
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewAgent_TLSRoundTrip is the end-to-end integration-style check: spin
|
||||||
|
// up an httptest.NewTLSServer (which presents a self-signed cert over TLS
|
||||||
|
// 1.3), feed that cert into the agent as a CA bundle, and confirm the agent
|
||||||
|
// successfully completes a heartbeat round-trip over HTTPS. This proves that
|
||||||
|
// (a) the CA pool is actually being consulted during verification and (b)
|
||||||
|
// the TLS 1.3 MinVersion doesn't break against httptest's default
|
||||||
|
// negotiation. Equivalent to the "TLS handshake succeeds against a
|
||||||
|
// self-signed control plane" integration gate, but runs in-process with no
|
||||||
|
// Docker dependency.
|
||||||
|
func TestNewAgent_TLSRoundTrip(t *testing.T) {
|
||||||
|
var heartbeatHit int
|
||||||
|
server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||||
|
if r.URL.Path == "/api/v1/agents/a-tls-test/heartbeat" && r.Method == http.MethodPost {
|
||||||
|
heartbeatHit++
|
||||||
|
w.WriteHeader(http.StatusOK)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
w.WriteHeader(http.StatusNotFound)
|
||||||
|
}))
|
||||||
|
defer server.Close()
|
||||||
|
|
||||||
|
// server.Certificate() returns the *x509.Certificate httptest presents;
|
||||||
|
// PEM-encode its DER bytes so NewAgent's AppendCertsFromPEM can ingest it.
|
||||||
|
bundlePath := writeTestCABundle(t, t.TempDir(), server.Certificate().Raw, "httptest-ca.pem")
|
||||||
|
|
||||||
|
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
|
||||||
|
agent, err := NewAgent(&AgentConfig{
|
||||||
|
ServerURL: server.URL,
|
||||||
|
APIKey: "test-key",
|
||||||
|
AgentID: "a-tls-test",
|
||||||
|
Hostname: "tls-test-host",
|
||||||
|
CABundlePath: bundlePath,
|
||||||
|
}, logger)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("NewAgent with httptest CA bundle err=%v want nil", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
agent.sendHeartbeat(context.Background())
|
||||||
|
|
||||||
|
if heartbeatHit != 1 {
|
||||||
|
t.Fatalf("heartbeat handler hit %d times; want 1 — the TLS round-trip must actually complete", heartbeatHit)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|||||||
+134
-19
@@ -8,6 +8,7 @@ import (
|
|||||||
"crypto/rand"
|
"crypto/rand"
|
||||||
"crypto/rsa"
|
"crypto/rsa"
|
||||||
"crypto/sha256"
|
"crypto/sha256"
|
||||||
|
"crypto/tls"
|
||||||
"crypto/x509"
|
"crypto/x509"
|
||||||
"crypto/x509/pkix"
|
"crypto/x509/pkix"
|
||||||
"encoding/json"
|
"encoding/json"
|
||||||
@@ -19,6 +20,7 @@ import (
|
|||||||
"log/slog"
|
"log/slog"
|
||||||
"net"
|
"net"
|
||||||
"net/http"
|
"net/http"
|
||||||
|
"net/url"
|
||||||
"os"
|
"os"
|
||||||
"os/signal"
|
"os/signal"
|
||||||
"path/filepath"
|
"path/filepath"
|
||||||
@@ -46,13 +48,15 @@ import (
|
|||||||
|
|
||||||
// AgentConfig represents the agent-side configuration.
|
// AgentConfig represents the agent-side configuration.
|
||||||
type AgentConfig struct {
|
type AgentConfig struct {
|
||||||
ServerURL string // Control plane server URL (e.g., http://localhost:8443)
|
ServerURL string // Control plane server URL (e.g., https://localhost:8443) — must be https:// scheme
|
||||||
APIKey string // Agent API key for authentication
|
APIKey string // Agent API key for authentication
|
||||||
AgentName string // Agent name for identification
|
AgentName string // Agent name for identification
|
||||||
AgentID string // Agent ID for API calls (set after registration or from env)
|
AgentID string // Agent ID for API calls (set after registration or from env)
|
||||||
Hostname string // Server hostname
|
Hostname string // Server hostname
|
||||||
KeyDir string // Directory for storing private keys (default: /var/lib/certctl/keys)
|
KeyDir string // Directory for storing private keys (default: /var/lib/certctl/keys)
|
||||||
DiscoveryDirs []string // Directories to scan for certificates (comma-separated via env)
|
DiscoveryDirs []string // Directories to scan for certificates (comma-separated via env)
|
||||||
|
CABundlePath string // Optional path to a PEM-encoded CA bundle that signed the server's cert (empty = system roots)
|
||||||
|
InsecureSkipVerify bool // Dev-only: skip TLS certificate verification. Never enable in production. See docs/tls.md.
|
||||||
}
|
}
|
||||||
|
|
||||||
// ErrAgentRetired is the sentinel returned by [Agent.Run] when the control
|
// ErrAgentRetired is the sentinel returned by [Agent.Run] when the control
|
||||||
@@ -113,16 +117,57 @@ type JobItem struct {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// NewAgent creates a new agent instance.
|
// NewAgent creates a new agent instance.
|
||||||
func NewAgent(cfg *AgentConfig, logger *slog.Logger) *Agent {
|
//
|
||||||
|
// The returned HTTP client enforces HTTPS-only control-plane access per the
|
||||||
|
// HTTPS-Everywhere milestone (see docs/tls.md). TLS 1.3 is required; the
|
||||||
|
// optional CABundlePath loads a PEM bundle into RootCAs so the agent can
|
||||||
|
// trust internal / self-signed server certs without touching system trust
|
||||||
|
// stores. InsecureSkipVerify is a dev-only escape hatch — callers must log a
|
||||||
|
// loud warning when it's set; never enable in production (see §2.4 of the
|
||||||
|
// milestone spec and docs/upgrade-to-tls.md).
|
||||||
|
//
|
||||||
|
// Returns an error if CABundlePath is set but unreadable or malformed — fail
|
||||||
|
// loud at startup rather than silently fall back to system roots, which would
|
||||||
|
// turn a misconfigured bundle path into a cryptic "x509: certificate signed
|
||||||
|
// by unknown authority" on the first heartbeat.
|
||||||
|
func NewAgent(cfg *AgentConfig, logger *slog.Logger) (*Agent, error) {
|
||||||
|
tlsConfig := &tls.Config{
|
||||||
|
MinVersion: tls.VersionTLS13,
|
||||||
|
InsecureSkipVerify: cfg.InsecureSkipVerify, //nolint:gosec // opt-in dev escape hatch, documented in docs/tls.md
|
||||||
|
}
|
||||||
|
if cfg.CABundlePath != "" {
|
||||||
|
pemBytes, err := os.ReadFile(cfg.CABundlePath)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("reading CA bundle at %q: %w", cfg.CABundlePath, err)
|
||||||
|
}
|
||||||
|
pool := x509.NewCertPool()
|
||||||
|
if !pool.AppendCertsFromPEM(pemBytes) {
|
||||||
|
return nil, fmt.Errorf("CA bundle at %q contains no valid PEM-encoded certificates", cfg.CABundlePath)
|
||||||
|
}
|
||||||
|
tlsConfig.RootCAs = pool
|
||||||
|
}
|
||||||
|
|
||||||
|
httpClient := &http.Client{
|
||||||
|
Timeout: 30 * time.Second,
|
||||||
|
Transport: &http.Transport{
|
||||||
|
TLSClientConfig: tlsConfig,
|
||||||
|
ForceAttemptHTTP2: true,
|
||||||
|
MaxIdleConns: 10,
|
||||||
|
IdleConnTimeout: 90 * time.Second,
|
||||||
|
TLSHandshakeTimeout: 10 * time.Second,
|
||||||
|
ExpectContinueTimeout: 1 * time.Second,
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
return &Agent{
|
return &Agent{
|
||||||
config: cfg,
|
config: cfg,
|
||||||
logger: logger,
|
logger: logger,
|
||||||
client: &http.Client{Timeout: 30 * time.Second},
|
client: httpClient,
|
||||||
heartbeatInterval: 60 * time.Second,
|
heartbeatInterval: 60 * time.Second,
|
||||||
pollInterval: 30 * time.Second,
|
pollInterval: 30 * time.Second,
|
||||||
discoveryInterval: 6 * time.Hour, // scan for certs every 6 hours
|
discoveryInterval: 6 * time.Hour, // scan for certs every 6 hours
|
||||||
retiredSignal: make(chan struct{}),
|
retiredSignal: make(chan struct{}),
|
||||||
}
|
}, nil
|
||||||
}
|
}
|
||||||
|
|
||||||
// markRetired records that the control plane has declared this agent retired
|
// markRetired records that the control plane has declared this agent retired
|
||||||
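The fail-loud CA bundle handling in NewAgent above can be tried in isolation. The following is a minimal sketch, not the agent's actual code: `poolFromPEM` is a hypothetical helper name introduced only for illustration, and it mirrors just the `AppendCertsFromPEM` check, which reports false when the input contains no parseable certificate.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
)

// poolFromPEM is a hypothetical helper (not part of certctl) that builds a
// cert pool from PEM bytes and returns an error instead of an empty pool
// when no certificate could be parsed.
func poolFromPEM(pemBytes []byte) (*x509.CertPool, error) {
	pool := x509.NewCertPool()
	// AppendCertsFromPEM returns false if no certificate was added.
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, errors.New("no valid PEM-encoded certificates")
	}
	return pool, nil
}

func main() {
	_, err := poolFromPEM([]byte("not a pem-encoded certificate, just garbage\n"))
	fmt.Println("garbage bundle rejected:", err != nil) // prints "garbage bundle rejected: true"
}
```

Returning an error here, rather than an empty pool, is what keeps a bad bundle from surfacing later as an unknown-authority handshake failure.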
@@ -1118,12 +1163,14 @@ func certKeyInfo(cert *x509.Certificate) (string, int) {
|
|||||||
|
|
||||||
func main() {
|
func main() {
|
||||||
// Parse command-line flags (with env var fallbacks for Docker deployment)
|
// Parse command-line flags (with env var fallbacks for Docker deployment)
|
||||||
serverURL := flag.String("server", getEnvDefault("CERTCTL_SERVER_URL", "http://localhost:8443"), "Control plane server URL")
|
serverURL := flag.String("server", getEnvDefault("CERTCTL_SERVER_URL", "https://localhost:8443"), "Control plane server URL (must be https://)")
|
||||||
apiKey := flag.String("api-key", getEnvDefault("CERTCTL_API_KEY", ""), "Agent API key")
|
apiKey := flag.String("api-key", getEnvDefault("CERTCTL_API_KEY", ""), "Agent API key")
|
||||||
agentName := flag.String("name", getEnvDefault("CERTCTL_AGENT_NAME", "certctl-agent"), "Agent name")
|
agentName := flag.String("name", getEnvDefault("CERTCTL_AGENT_NAME", "certctl-agent"), "Agent name")
|
||||||
agentID := flag.String("agent-id", getEnvDefault("CERTCTL_AGENT_ID", ""), "Agent ID (from registration)")
|
agentID := flag.String("agent-id", getEnvDefault("CERTCTL_AGENT_ID", ""), "Agent ID (from registration)")
|
||||||
keyDir := flag.String("key-dir", getEnvDefault("CERTCTL_KEY_DIR", "/var/lib/certctl/keys"), "Directory for storing private keys")
|
keyDir := flag.String("key-dir", getEnvDefault("CERTCTL_KEY_DIR", "/var/lib/certctl/keys"), "Directory for storing private keys")
|
||||||
discoveryDirsStr := flag.String("discovery-dirs", getEnvDefault("CERTCTL_DISCOVERY_DIRS", ""), "Comma-separated directories to scan for certificates")
|
discoveryDirsStr := flag.String("discovery-dirs", getEnvDefault("CERTCTL_DISCOVERY_DIRS", ""), "Comma-separated directories to scan for certificates")
|
||||||
|
caBundlePath := flag.String("ca-bundle", getEnvDefault("CERTCTL_SERVER_CA_BUNDLE_PATH", ""), "Path to a PEM-encoded CA bundle that signed the server's TLS cert (optional; falls back to system roots)")
|
||||||
|
insecureSkipVerify := flag.Bool("insecure-skip-verify", getEnvBoolDefault("CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY", false), "Dev-only: skip TLS certificate verification. Never enable in production. See docs/tls.md.")
|
||||||
flag.Parse()
|
flag.Parse()
|
||||||
|
|
||||||
if *apiKey == "" {
|
if *apiKey == "" {
|
||||||
@@ -1137,6 +1184,18 @@ func main() {
|
|||||||
os.Exit(1)
|
os.Exit(1)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Pre-flight URL-scheme validation — reject plaintext http:// before any
|
||||||
|
// network call. The HTTPS-Everywhere milestone (§2.4, §7) mandates that
|
||||||
|
// mis-configured agents fail loudly at startup with a diagnostic pointing
|
||||||
|
// at the upgrade guide, rather than producing a TCP-refused or
|
||||||
|
// TLS-handshake-error that obscures the actual cause.
|
||||||
|
if err := validateHTTPSScheme(*serverURL); err != nil {
|
||||||
|
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
|
||||||
|
fmt.Fprintf(os.Stderr, "\nThe certctl control plane is HTTPS-only as of v2.2.\n")
|
||||||
|
fmt.Fprintf(os.Stderr, "See docs/upgrade-to-tls.md for the cutover walkthrough.\n")
|
||||||
|
os.Exit(1)
|
||||||
|
}
|
||||||
|
|
||||||
// Set up structured logging
|
// Set up structured logging
|
||||||
logLevel := slog.LevelInfo
|
logLevel := slog.LevelInfo
|
||||||
if getEnvDefault("CERTCTL_LOG_LEVEL", "info") == "debug" {
|
if getEnvDefault("CERTCTL_LOG_LEVEL", "info") == "debug" {
|
||||||
@@ -1165,17 +1224,27 @@ func main() {
|
|||||||
|
|
||||||
// Create agent configuration
|
// Create agent configuration
|
||||||
agentCfg := &AgentConfig{
|
agentCfg := &AgentConfig{
|
||||||
ServerURL: *serverURL,
|
ServerURL: *serverURL,
|
||||||
APIKey: *apiKey,
|
APIKey: *apiKey,
|
||||||
AgentName: *agentName,
|
AgentName: *agentName,
|
||||||
AgentID: *agentID,
|
AgentID: *agentID,
|
||||||
Hostname: hostname,
|
Hostname: hostname,
|
||||||
KeyDir: *keyDir,
|
KeyDir: *keyDir,
|
||||||
DiscoveryDirs: discoveryDirs,
|
DiscoveryDirs: discoveryDirs,
|
||||||
|
CABundlePath: *caBundlePath,
|
||||||
|
InsecureSkipVerify: *insecureSkipVerify,
|
||||||
|
}
|
||||||
|
|
||||||
|
if agentCfg.InsecureSkipVerify {
|
||||||
|
logger.Warn("TLS certificate verification is disabled (CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true) — never enable this in production")
|
||||||
}
|
}
|
||||||
|
|
||||||
// Create and start agent
|
// Create and start agent
|
||||||
agent := NewAgent(agentCfg, logger)
|
agent, err := NewAgent(agentCfg, logger)
|
||||||
|
if err != nil {
|
||||||
|
fmt.Fprintf(os.Stderr, "Error: failed to initialize agent: %v\n", err)
|
||||||
|
os.Exit(1)
|
||||||
|
}
|
||||||
|
|
||||||
// Create context with cancellation for graceful shutdown
|
// Create context with cancellation for graceful shutdown
|
||||||
ctx, cancel := context.WithCancel(context.Background())
|
ctx, cancel := context.WithCancel(context.Background())
|
||||||
@@ -1233,3 +1302,49 @@ func getEnvDefault(key, defaultValue string) string {
|
|||||||
}
|
}
|
||||||
return defaultValue
|
return defaultValue
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// getEnvBoolDefault parses an environment variable as a boolean. The value is
|
||||||
|
// trimmed and lowercased first: "1", "t", "true", "yes", "on" are true and
|
||||||
|
// "0", "f", "false", "no", "off" are false. Anything else (including empty)
|
||||||
|
// returns the provided default. Kept permissive on purpose so operators can
|
||||||
|
// flip the dev-only TLS skip-verify toggle with any common truthy spelling
|
||||||
|
// without having to remember exactly what we parse.
|
||||||
|
func getEnvBoolDefault(key string, defaultValue bool) bool {
|
||||||
|
raw := os.Getenv(key)
|
||||||
|
if raw == "" {
|
||||||
|
return defaultValue
|
||||||
|
}
|
||||||
|
switch strings.ToLower(strings.TrimSpace(raw)) {
|
||||||
|
case "1", "t", "true", "yes", "on":
|
||||||
|
return true
|
||||||
|
case "0", "f", "false", "no", "off":
|
||||||
|
return false
|
||||||
|
default:
|
||||||
|
return defaultValue
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// validateHTTPSScheme enforces the HTTPS-Everywhere milestone's §7 acceptance
|
||||||
|
// criterion: "Agent with CERTCTL_SERVER_URL=http://... fails at startup with
|
||||||
|
// a fail-loud diagnostic pointing at docs/upgrade-to-tls.md. Not TCP-refused,
|
||||||
|
// not TLS-handshake-error — a pre-flight config validation failure before any
|
||||||
|
// network call." Returns a descriptive error; the caller prints the upgrade
|
||||||
|
// guide pointer and exits non-zero.
|
||||||
|
func validateHTTPSScheme(serverURL string) error {
|
||||||
|
if serverURL == "" {
|
||||||
|
return fmt.Errorf("CERTCTL_SERVER_URL is empty — set it to an https:// URL (e.g., https://certctl-server:8443)")
|
||||||
|
}
|
||||||
|
u, err := url.Parse(serverURL)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("CERTCTL_SERVER_URL %q is not a valid URL: %w", serverURL, err)
|
||||||
|
}
|
||||||
|
switch strings.ToLower(u.Scheme) {
|
||||||
|
case "https":
|
||||||
|
return nil
|
||||||
|
case "http":
|
||||||
|
return fmt.Errorf("CERTCTL_SERVER_URL %q uses plaintext http:// — the certctl control plane is HTTPS-only", serverURL)
|
||||||
|
case "":
|
||||||
|
return fmt.Errorf("CERTCTL_SERVER_URL %q is missing a scheme — expected https://", serverURL)
|
||||||
|
default:
|
||||||
|
return fmt.Errorf("CERTCTL_SERVER_URL %q uses unsupported scheme %q — expected https://", serverURL, u.Scheme)
|
||||||
|
}
|
||||||
|
}
|
||||||
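Which arm of the scheme switch above fires depends on how `net/url` splits scheme-less input. The sketch below is a standalone illustration, with `schemeOf` as a hypothetical helper that is not part of certctl. It shows the three inputs the tests rely on: a bare `host:port` parses with the host as the scheme, a `//host` reference has an empty scheme, and an uppercase scheme is lowercased by the parser.

```go
package main

import (
	"fmt"
	"net/url"
)

// schemeOf is a hypothetical helper that returns the scheme net/url extracts
// from raw, or "<parse error>" if the input does not parse at all.
func schemeOf(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return "<parse error>"
	}
	return u.Scheme
}

func main() {
	inputs := []string{
		"localhost:8443",              // bare host:port, parsed as scheme "localhost"
		"//certctl-server:8443",       // scheme-relative reference, empty scheme
		"HTTPS://certctl-server:8443", // scheme is lowercased during parsing
	}
	for _, raw := range inputs {
		fmt.Printf("%q -> scheme %q\n", raw, schemeOf(raw))
	}
}
```

All three shapes other than `https` end in an error in the guard, so the fail-closed behavior does not depend on which arm is hit.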
|
|||||||
@@ -228,7 +228,7 @@ func TestReportVerificationResult_Success(t *testing.T) {
|
|||||||
ServerURL: server.URL,
|
ServerURL: server.URL,
|
||||||
APIKey: "test-api-key",
|
APIKey: "test-api-key",
|
||||||
}
|
}
|
||||||
agent := NewAgent(cfg, nil)
|
agent, _ := NewAgent(cfg, nil)
|
||||||
|
|
||||||
result := &VerificationResult{
|
result := &VerificationResult{
|
||||||
ExpectedFingerprint: "abc123",
|
ExpectedFingerprint: "abc123",
|
||||||
@@ -244,7 +244,7 @@ func TestReportVerificationResult_Success(t *testing.T) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
func TestReportVerificationResult_MissingFields(t *testing.T) {
|
func TestReportVerificationResult_MissingFields(t *testing.T) {
|
||||||
agent := NewAgent(&AgentConfig{}, nil)
|
agent, _ := NewAgent(&AgentConfig{}, nil)
|
||||||
|
|
||||||
result := &VerificationResult{
|
result := &VerificationResult{
|
||||||
Verified: true,
|
Verified: true,
|
||||||
@@ -343,7 +343,7 @@ func TestReportVerificationResult_ServerError(t *testing.T) {
|
|||||||
ServerURL: server.URL,
|
ServerURL: server.URL,
|
||||||
APIKey: "test-api-key",
|
APIKey: "test-api-key",
|
||||||
}
|
}
|
||||||
agent := NewAgent(cfg, nil)
|
agent, _ := NewAgent(cfg, nil)
|
||||||
|
|
||||||
result := &VerificationResult{
|
result := &VerificationResult{
|
||||||
ExpectedFingerprint: "abc123",
|
ExpectedFingerprint: "abc123",
|
||||||
|
|||||||
+46
-6
@@ -3,7 +3,9 @@ package main
|
|||||||
import (
|
import (
|
||||||
"flag"
|
"flag"
|
||||||
"fmt"
|
"fmt"
|
||||||
|
"net/url"
|
||||||
"os"
|
"os"
|
||||||
|
"strings"
|
||||||
|
|
||||||
"github.com/shankar0123/certctl/internal/cli"
|
"github.com/shankar0123/certctl/internal/cli"
|
||||||
)
|
)
|
||||||
@@ -43,22 +45,34 @@ Commands:
|
|||||||
version Show CLI version
|
version Show CLI version
|
||||||
|
|
||||||
Examples:
|
Examples:
|
||||||
certctl-cli --server http://localhost:8443 --api-key mykey certs list
|
certctl-cli --server https://localhost:8443 --api-key mykey certs list
|
||||||
certctl-cli certs renew mc-prod --format json
|
certctl-cli certs renew mc-prod --format json
|
||||||
certctl-cli import certs.pem
|
certctl-cli import certs.pem
|
||||||
`)
|
`)
|
||||||
}
|
}
|
||||||
|
|
||||||
serverURL := fs.String("server", os.Getenv("CERTCTL_SERVER_URL"), "certctl server URL (env: CERTCTL_SERVER_URL)")
|
// HTTPS-Everywhere (v2.2): the server is HTTPS-only. The default URL uses
|
||||||
if *serverURL == "" {
|
// https://; plaintext http:// is rejected by validateHTTPSScheme below.
|
||||||
*serverURL = "http://localhost:8443"
|
defaultServer := os.Getenv("CERTCTL_SERVER_URL")
|
||||||
|
if defaultServer == "" {
|
||||||
|
defaultServer = "https://localhost:8443"
|
||||||
}
|
}
|
||||||
|
serverURL := fs.String("server", defaultServer, "certctl server URL — must be https:// (env: CERTCTL_SERVER_URL)")
|
||||||
|
|
||||||
apiKey := fs.String("api-key", os.Getenv("CERTCTL_API_KEY"), "API key for authentication (env: CERTCTL_API_KEY)")
|
apiKey := fs.String("api-key", os.Getenv("CERTCTL_API_KEY"), "API key for authentication (env: CERTCTL_API_KEY)")
|
||||||
format := fs.String("format", "table", "Output format: table, json")
|
format := fs.String("format", "table", "Output format: table, json")
|
||||||
|
caBundlePath := fs.String("ca-bundle", os.Getenv("CERTCTL_SERVER_CA_BUNDLE_PATH"), "Path to a PEM-encoded CA bundle that signed the server cert (env: CERTCTL_SERVER_CA_BUNDLE_PATH)")
|
||||||
|
insecure := fs.Bool("insecure", strings.EqualFold(os.Getenv("CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY"), "true"), "Skip TLS certificate verification — dev only, never set in production (env: CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY)")
|
||||||
|
|
||||||
fs.Parse(os.Args[1:])
|
fs.Parse(os.Args[1:])
|
||||||
|
|
||||||
|
if err := validateHTTPSScheme(*serverURL); err != nil {
|
||||||
|
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
|
||||||
|
fmt.Fprintf(os.Stderr, "\nThe certctl control plane is HTTPS-only as of v2.2.\n")
|
||||||
|
fmt.Fprintf(os.Stderr, "See docs/upgrade-to-tls.md for the cutover walkthrough.\n")
|
||||||
|
os.Exit(1)
|
||||||
|
}
|
||||||
|
|
||||||
args := fs.Args()
|
args := fs.Args()
|
||||||
if len(args) == 0 {
|
if len(args) == 0 {
|
||||||
fs.Usage()
|
fs.Usage()
|
||||||
@@ -66,13 +80,16 @@ Examples:
|
|||||||
}
|
}
|
||||||
|
|
||||||
// Create client
|
// Create client
|
||||||
client := cli.NewClient(*serverURL, *apiKey, *format)
|
client, err := cli.NewClient(*serverURL, *apiKey, *format, *caBundlePath, *insecure)
|
||||||
|
if err != nil {
|
||||||
|
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
|
||||||
|
os.Exit(1)
|
||||||
|
}
|
||||||
|
|
||||||
// Dispatch to appropriate command
|
// Dispatch to appropriate command
|
||||||
command := args[0]
|
command := args[0]
|
||||||
cmdArgs := args[1:]
|
cmdArgs := args[1:]
|
||||||
|
|
||||||
var err error
|
|
||||||
switch command {
|
switch command {
|
||||||
case "certs":
|
case "certs":
|
||||||
err = handleCerts(client, cmdArgs)
|
err = handleCerts(client, cmdArgs)
|
||||||
@@ -237,3 +254,26 @@ func handleImport(client *cli.Client, args []string) error {
|
|||||||
func handleStatus(client *cli.Client) error {
|
func handleStatus(client *cli.Client) error {
|
||||||
return client.GetStatus()
|
return client.GetStatus()
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// validateHTTPSScheme rejects any server URL whose scheme is not https at
|
||||||
|
// startup so operators get a fail-loud diagnostic before any network call,
|
||||||
|
// not a TCP-refused or TLS-handshake-error downstream. See docs/upgrade-to-tls.md.
|
||||||
|
func validateHTTPSScheme(serverURL string) error {
|
||||||
|
if serverURL == "" {
|
||||||
|
return fmt.Errorf("server URL is empty — set --server (or CERTCTL_SERVER_URL) to an https:// URL (e.g., https://certctl-server:8443)")
|
||||||
|
}
|
||||||
|
u, err := url.Parse(serverURL)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("server URL %q is not a valid URL: %w", serverURL, err)
|
||||||
|
}
|
||||||
|
switch strings.ToLower(u.Scheme) {
|
||||||
|
case "https":
|
||||||
|
return nil
|
||||||
|
case "http":
|
||||||
|
return fmt.Errorf("server URL %q uses plaintext http:// — the certctl control plane is HTTPS-only", serverURL)
|
||||||
|
case "":
|
||||||
|
return fmt.Errorf("server URL %q is missing a scheme — expected https://", serverURL)
|
||||||
|
default:
|
||||||
|
return fmt.Errorf("server URL %q uses unsupported scheme %q — expected https://", serverURL, u.Scheme)
|
||||||
|
}
|
||||||
|
}
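The switch arms above depend on how `net/url` splits the raw string, which is not always intuitive for scheme-less input. The following is a minimal standalone sketch, not part of certctl, that prints the parsed pieces for a few of the inputs the guard sees. The helper name `describe` is hypothetical and exists only for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// describe reports how net/url splits a raw server URL into scheme, opaque
// part and host. Hypothetical helper for illustration only.
func describe(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Sprintf("parse error: %v", err)
	}
	return fmt.Sprintf("scheme=%q opaque=%q host=%q", u.Scheme, u.Opaque, u.Host)
}

func main() {
	for _, raw := range []string{
		"https://certctl-server:8443", // accepted by the guard
		"HTTPS://certctl-server:8443", // url.Parse lowercases the scheme
		"localhost:8443",              // parsed as scheme "localhost", so it lands in the default arm
		"//certctl-server:8443",       // host is set but the scheme is empty
	} {
		fmt.Printf("%-30s %s\n", raw, describe(raw))
	}
}
```

Because `url.Parse` already lowercases the scheme, the `strings.ToLower` call in the guard is a belt-and-braces measure rather than a requirement.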
|
||||||
|
|||||||
@@ -0,0 +1,96 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
import (
|
||||||
|
"strings"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
// TestValidateHTTPSScheme pins the pre-flight URL-scheme guard that the
|
||||||
|
// HTTPS-Everywhere milestone (v2.2, §3.2) requires on the certctl-cli binary
|
||||||
|
// startup path. The CLI's diagnostic is distinct from the agent and MCP server
|
||||||
|
// because it surfaces the --server flag alongside CERTCTL_SERVER_URL — so the
|
||||||
|
// empty-URL case pins that flag-name substring separately. Every other case
|
||||||
|
// mirrors the dispatch arms in cmd/cli/main.go:validateHTTPSScheme; drifting
|
||||||
|
// the substrings is what this test is here to catch.
|
||||||
|
func TestValidateHTTPSScheme(t *testing.T) {
|
||||||
|
tests := []struct {
|
||||||
|
name string
|
||||||
|
serverURL string
|
||||||
|
wantErr bool
|
||||||
|
wantErrSub string // substring that MUST appear in the error message
|
||||||
|
}{
|
||||||
|
{
|
||||||
|
name: "https URL passes",
|
||||||
|
serverURL: "https://certctl-server:8443",
|
||||||
|
wantErr: false,
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "https URL with path passes",
|
||||||
|
serverURL: "https://certctl.example.com/api/v1",
|
||||||
|
wantErr: false,
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "uppercase HTTPS scheme passes (url.Parse lowercases)",
|
||||||
|
serverURL: "HTTPS://certctl-server:8443",
|
||||||
|
wantErr: false,
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "empty URL rejected mentions --server flag",
|
||||||
|
serverURL: "",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "--server",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "empty URL rejected also mentions CERTCTL_SERVER_URL",
|
||||||
|
serverURL: "",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "CERTCTL_SERVER_URL",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "plaintext http rejected",
|
||||||
|
serverURL: "http://certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "plaintext http://",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "bare host missing scheme rejected",
|
||||||
|
serverURL: "localhost:8443",
|
||||||
|
wantErr: true,
|
||||||
|
// url.Parse treats "localhost:8443" as scheme=localhost, opaque=8443
|
||||||
|
// — exercises the default arm (unsupported scheme) rather than the
|
||||||
|
// empty-scheme arm. Both are fail-closed, which is what we care about.
|
||||||
|
wantErrSub: "unsupported scheme",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "scheme-relative URL rejected",
|
||||||
|
serverURL: "//certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "missing a scheme",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "unsupported scheme rejected",
|
||||||
|
serverURL: "ftp://certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "unsupported scheme",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "ws scheme rejected",
|
||||||
|
serverURL: "ws://certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "unsupported scheme",
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
for _, tt := range tests {
|
||||||
|
t.Run(tt.name, func(t *testing.T) {
|
||||||
|
err := validateHTTPSScheme(tt.serverURL)
|
||||||
|
if (err != nil) != tt.wantErr {
|
||||||
|
t.Fatalf("validateHTTPSScheme(%q) err=%v wantErr=%v", tt.serverURL, err, tt.wantErr)
|
||||||
|
}
|
||||||
|
if tt.wantErr && tt.wantErrSub != "" && !strings.Contains(err.Error(), tt.wantErrSub) {
|
||||||
|
t.Errorf("validateHTTPSScheme(%q) err=%q must contain %q so operators see the right diagnostic",
|
||||||
|
tt.serverURL, err.Error(), tt.wantErrSub)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
}
|
||||||
|
}
|
||||||
+46
-2
@@ -4,8 +4,10 @@ import (
|
|||||||
"context"
|
"context"
|
||||||
"fmt"
|
"fmt"
|
||||||
"log"
|
"log"
|
||||||
|
"net/url"
|
||||||
"os"
|
"os"
|
||||||
"os/signal"
|
"os/signal"
|
||||||
|
"strings"
|
||||||
|
|
||||||
gomcp "github.com/modelcontextprotocol/go-sdk/mcp"
|
gomcp "github.com/modelcontextprotocol/go-sdk/mcp"
|
||||||
|
|
||||||
@@ -16,14 +18,33 @@ import (
|
|||||||
var Version = "dev"
|
var Version = "dev"
|
||||||
|
|
||||||
func main() {
|
func main() {
|
||||||
|
// HTTPS-Everywhere (v2.2): the server is HTTPS-only. The default URL
|
||||||
|
// uses https://; plaintext http:// is rejected by validateHTTPSScheme
|
||||||
|
// below with a fail-loud pre-flight diagnostic pointing at
|
||||||
|
// docs/upgrade-to-tls.md, so operators never get a TCP-refused or
|
||||||
|
// TLS-handshake-error downstream. See docs/tls.md for CA bundle and
|
||||||
|
// insecure-skip-verify guidance.
|
||||||
serverURL := os.Getenv("CERTCTL_SERVER_URL")
|
serverURL := os.Getenv("CERTCTL_SERVER_URL")
|
||||||
if serverURL == "" {
|
if serverURL == "" {
|
||||||
serverURL = "http://localhost:8443"
|
serverURL = "https://localhost:8443"
|
||||||
|
}
|
||||||
|
|
||||||
|
if err := validateHTTPSScheme(serverURL); err != nil {
|
||||||
|
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
|
||||||
|
fmt.Fprintf(os.Stderr, "\nThe certctl control plane is HTTPS-only as of v2.2.\n")
|
||||||
|
fmt.Fprintf(os.Stderr, "See docs/upgrade-to-tls.md for the cutover walkthrough.\n")
|
||||||
|
os.Exit(1)
|
||||||
}
|
}
|
||||||
|
|
||||||
apiKey := os.Getenv("CERTCTL_API_KEY")
|
apiKey := os.Getenv("CERTCTL_API_KEY")
|
||||||
|
caBundlePath := os.Getenv("CERTCTL_SERVER_CA_BUNDLE_PATH")
|
||||||
|
insecure := strings.EqualFold(os.Getenv("CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY"), "true")
|
||||||
|
|
||||||
client := mcp.NewClient(serverURL, apiKey)
|
client, err := mcp.NewClient(serverURL, apiKey, caBundlePath, insecure)
|
||||||
|
if err != nil {
|
||||||
|
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
|
||||||
|
os.Exit(1)
|
||||||
|
}
|
||||||
|
|
||||||
server := gomcp.NewServer(&gomcp.Implementation{
|
server := gomcp.NewServer(&gomcp.Implementation{
|
||||||
Name: "certctl",
|
Name: "certctl",
|
||||||
@@ -41,3 +62,26 @@ func main() {
|
|||||||
log.Fatalf("MCP server error: %v", err)
|
log.Fatalf("MCP server error: %v", err)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// validateHTTPSScheme rejects any server URL whose scheme is not https at
|
||||||
|
// startup so operators get a fail-loud diagnostic before any network call,
|
||||||
|
// not a TCP-refused or TLS-handshake-error downstream. See docs/upgrade-to-tls.md.
|
||||||
|
func validateHTTPSScheme(serverURL string) error {
|
||||||
|
if serverURL == "" {
|
||||||
|
return fmt.Errorf("server URL is empty — set CERTCTL_SERVER_URL to an https:// URL (e.g., https://certctl-server:8443)")
|
||||||
|
}
|
||||||
|
u, err := url.Parse(serverURL)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("server URL %q is not a valid URL: %w", serverURL, err)
|
||||||
|
}
|
||||||
|
switch strings.ToLower(u.Scheme) {
|
||||||
|
case "https":
|
||||||
|
return nil
|
||||||
|
case "http":
|
||||||
|
return fmt.Errorf("server URL %q uses plaintext http:// — the certctl control plane is HTTPS-only", serverURL)
|
||||||
|
case "":
|
||||||
|
return fmt.Errorf("server URL %q is missing a scheme — expected https://", serverURL)
|
||||||
|
default:
|
||||||
|
return fmt.Errorf("server URL %q uses unsupported scheme %q — expected https://", serverURL, u.Scheme)
|
||||||
|
}
|
||||||
|
}
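The diff passes `caBundlePath` and `insecure` into `mcp.NewClient` but does not show how the client uses them. One plausible shape is sketched below, assuming the client builds a `*tls.Config` from those two values. `buildTLSConfig` is a hypothetical name, and the real certctl implementation may differ.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// buildTLSConfig is a hypothetical helper, not the certctl implementation.
// An empty caBundlePath leaves RootCAs nil, which makes crypto/tls use the
// system trust store.
func buildTLSConfig(caBundlePath string, insecure bool) (*tls.Config, error) {
	cfg := &tls.Config{
		MinVersion:         tls.VersionTLS12,
		InsecureSkipVerify: insecure, // dev only, mirrors the env var's warning
	}
	if caBundlePath == "" {
		return cfg, nil
	}
	pemBytes, err := os.ReadFile(caBundlePath)
	if err != nil {
		return nil, fmt.Errorf("read CA bundle %q: %w", caBundlePath, err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, fmt.Errorf("CA bundle %q contains no PEM certificates", caBundlePath)
	}
	cfg.RootCAs = pool
	return cfg, nil
}

func main() {
	cfg, err := buildTLSConfig(os.Getenv("CERTCTL_SERVER_CA_BUNDLE_PATH"), false)
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error:", err)
		os.Exit(1)
	}
	fmt.Println("custom roots configured:", cfg.RootCAs != nil)
}
```

Returning an error for an unreadable or empty bundle would explain why the new `NewClient` signature gained an `error` return, though that is an inference from the diff rather than something it states.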
|
||||||
|
|||||||
@@ -0,0 +1,90 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
import (
|
||||||
|
"strings"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
// TestValidateHTTPSScheme pins the pre-flight URL-scheme guard that the
|
||||||
|
// HTTPS-Everywhere milestone (v2.2, §3.2) requires on the MCP server binary
|
||||||
|
// startup path. The whole point is to fail loud with a diagnostic that points
|
||||||
|
// at docs/upgrade-to-tls.md *before* any network call — not a cryptic
|
||||||
|
// TCP-refused or TLS-handshake-error two ticks later. Every case here mirrors
|
||||||
|
// the dispatch arms in cmd/mcp-server/main.go:validateHTTPSScheme; drifting
|
||||||
|
// the error-message substrings is what this test is here to catch.
|
||||||
|
func TestValidateHTTPSScheme(t *testing.T) {
|
||||||
|
tests := []struct {
|
||||||
|
name string
|
||||||
|
serverURL string
|
||||||
|
wantErr bool
|
||||||
|
wantErrSub string // substring that MUST appear in the error message
|
||||||
|
}{
|
||||||
|
{
|
||||||
|
name: "https URL passes",
|
||||||
|
serverURL: "https://certctl-server:8443",
|
||||||
|
wantErr: false,
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "https URL with path passes",
|
||||||
|
serverURL: "https://certctl.example.com/api/v1",
|
||||||
|
wantErr: false,
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "uppercase HTTPS scheme passes (url.Parse lowercases)",
|
||||||
|
serverURL: "HTTPS://certctl-server:8443",
|
||||||
|
wantErr: false,
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "empty URL rejected",
|
||||||
|
serverURL: "",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "server URL is empty",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "plaintext http rejected",
|
||||||
|
serverURL: "http://certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "plaintext http://",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "bare host missing scheme rejected",
|
||||||
|
serverURL: "localhost:8443",
|
||||||
|
wantErr: true,
|
||||||
|
// url.Parse treats "localhost:8443" as scheme=localhost, opaque=8443
|
||||||
|
// — exercises the default arm (unsupported scheme) rather than the
|
||||||
|
// empty-scheme arm. Both are fail-closed, which is what we care about.
|
||||||
|
wantErrSub: "unsupported scheme",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "scheme-relative URL rejected",
|
||||||
|
serverURL: "//certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "missing a scheme",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "unsupported scheme rejected",
|
||||||
|
serverURL: "ftp://certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "unsupported scheme",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "ws scheme rejected",
|
||||||
|
serverURL: "ws://certctl-server:8443",
|
||||||
|
wantErr: true,
|
||||||
|
wantErrSub: "unsupported scheme",
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
for _, tt := range tests {
|
||||||
|
t.Run(tt.name, func(t *testing.T) {
|
||||||
|
err := validateHTTPSScheme(tt.serverURL)
|
||||||
|
if (err != nil) != tt.wantErr {
|
||||||
|
t.Fatalf("validateHTTPSScheme(%q) err=%v wantErr=%v", tt.serverURL, err, tt.wantErr)
|
||||||
|
}
|
||||||
|
if tt.wantErr && tt.wantErrSub != "" && !strings.Contains(err.Error(), tt.wantErrSub) {
|
||||||
|
t.Errorf("validateHTTPSScheme(%q) err=%q must contain %q so operators see the right diagnostic",
|
||||||
|
tt.serverURL, err.Error(), tt.wantErrSub)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,314 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
import (
|
||||||
|
"net/http"
|
||||||
|
"net/http/httptest"
|
||||||
|
"os"
|
||||||
|
"path/filepath"
|
||||||
|
"strings"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
// TestBuildFinalHandler_Dispatch is the M-001 regression harness for the outer
|
||||||
|
// HTTP dispatch layer. It pins which path prefixes ride the no-auth middleware
|
||||||
|
// chain (EST, SCEP, /.well-known/pki, health/ready, /api/v1/auth/info) versus
|
||||||
|
// the authenticated chain (/api/v1/*).
|
||||||
|
//
|
||||||
|
// The concern under test is ONLY the dispatch in buildFinalHandler — the
|
||||||
|
// handlers themselves are mocked as marker handlers that stamp "AUTH" or
|
||||||
|
// "NOAUTH" into the response body. Service-layer concerns (SCEP password
|
||||||
|
// validation, EST CSR validation, API auth enforcement) are covered by their
|
||||||
|
// respective test suites.
|
||||||
|
//
|
||||||
|
// Case (i) is the central guard: EST with NO client cert / NO Bearer token
|
||||||
|
// MUST reach the no-auth handler (pre-M-001 it was 401'd by the Auth
|
||||||
|
// middleware, blocking enrollment for every real-world EST client).
|
||||||
|
func TestBuildFinalHandler_Dispatch(t *testing.T) {
|
||||||
|
// Marker handlers — each stamps a unique body so tests can verify which
|
||||||
|
// chain the request traversed.
|
||||||
|
authHandler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
|
||||||
|
w.Header().Set("X-Chain", "auth")
|
||||||
|
w.WriteHeader(http.StatusOK)
|
||||||
|
_, _ = w.Write([]byte("AUTH"))
|
||||||
|
})
|
||||||
|
noAuthHandler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
|
||||||
|
w.Header().Set("X-Chain", "noauth")
|
||||||
|
w.WriteHeader(http.StatusOK)
|
||||||
|
_, _ = w.Write([]byte("NOAUTH"))
|
||||||
|
})
|
||||||
|
|
||||||
|
// Dashboard directory with index.html + assets/ for SPA fallback and
|
||||||
|
// static-asset tests. Cleaned up by t.TempDir.
|
||||||
|
webDir := t.TempDir()
|
||||||
|
indexHTML := []byte("<!doctype html><html><body>certctl dashboard</body></html>")
|
||||||
|
if err := os.WriteFile(filepath.Join(webDir, "index.html"), indexHTML, 0o644); err != nil {
|
||||||
|
t.Fatalf("write index.html: %v", err)
|
||||||
|
}
|
||||||
|
assetsDir := filepath.Join(webDir, "assets")
|
||||||
|
if err := os.MkdirAll(assetsDir, 0o755); err != nil {
|
||||||
|
t.Fatalf("mkdir assets: %v", err)
|
||||||
|
}
|
||||||
|
assetJS := []byte("console.log('certctl');")
|
||||||
|
if err := os.WriteFile(filepath.Join(assetsDir, "app.js"), assetJS, 0o644); err != nil {
|
||||||
|
t.Fatalf("write app.js: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
handler := buildFinalHandler(authHandler, noAuthHandler, webDir, true /* dashboardEnabled */)
|
||||||
|
|
||||||
|
tests := []struct {
|
||||||
|
name string
|
||||||
|
method string
|
||||||
|
path string
|
||||||
|
wantBody string // substring expected in the response body ("AUTH", "NOAUTH", or dashboard/asset content); "" skips the check
|
||||||
|
wantBodyPrefix string
|
||||||
|
wantStatus int
|
||||||
|
description string
|
||||||
|
}{
|
||||||
|
// ---- Case (i): M-001 central regression guard ----
|
||||||
|
{
|
||||||
|
name: "est_cacerts_no_auth_reaches_noauth_handler",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/.well-known/est/cacerts",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "EST clients cannot present Bearer tokens — must NOT be 401'd before reaching the handler (RFC 7030 §4.1.1)",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "est_simpleenroll_no_auth_reaches_noauth_handler",
|
||||||
|
method: http.MethodPost,
|
||||||
|
path: "/.well-known/est/simpleenroll",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "RFC 7030 §4.2 simpleenroll served from no-auth chain (option D)",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "est_simplereenroll_no_auth_reaches_noauth_handler",
|
||||||
|
method: http.MethodPost,
|
||||||
|
path: "/.well-known/est/simplereenroll",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "RFC 7030 §4.2.2 simplereenroll also on no-auth chain",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "est_csrattrs_no_auth_reaches_noauth_handler",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/.well-known/est/csrattrs",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "RFC 7030 §4.5 csrattrs also on no-auth chain",
|
||||||
|
},
|
||||||
|
|
||||||
|
// ---- Cases (ii) + (iii): SCEP dispatch ----
|
||||||
|
// The actual challengePassword validation lives in the service layer
|
||||||
|
// (internal/service/scep.go). This test pins that ALL /scep* requests
|
||||||
|
// reach the no-auth chain — the service layer is then responsible for
|
||||||
|
// rejecting or accepting based on password contents.
|
||||||
|
{
|
||||||
|
name: "scep_exact_path_reaches_noauth_handler",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/scep",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "SCEP clients authenticate via CSR challengePassword, not Bearer (RFC 8894 §3.2)",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "scep_subpath_reaches_noauth_handler",
|
||||||
|
method: http.MethodPost,
|
||||||
|
path: "/scep/",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "Trailing-slash variant must also ride no-auth chain",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "scep_query_string_reaches_noauth_handler",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/scep?operation=GetCACaps",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "Query string does not affect dispatch — operation dispatch is handler-internal",
|
||||||
|
},
|
||||||
|
// Defensive: /scepxyz MUST NOT match the SCEP prefix (guards against
|
||||||
|
// over-broad matching that would leak non-SCEP paths into no-auth).
|
||||||
|
{
|
||||||
|
name: "scepxyz_does_not_match_scep_prefix",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/scepxyz",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
wantBody: "certctl dashboard",
|
||||||
|
description: "SPA fallback — /scepxyz must not be confused with /scep or /scep/",
|
||||||
|
},
|
||||||
|
|
||||||
|
// ---- Case (iv): RFC 5280 CRL + RFC 6960 OCSP ----
|
||||||
|
{
|
||||||
|
name: "pki_crl_no_auth_reaches_noauth_handler",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/.well-known/pki/crl/abc123",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "RFC 5280 CRL distribution point must be served without auth",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "pki_ocsp_no_auth_reaches_noauth_handler",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/.well-known/pki/ocsp/abc123/serial",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "RFC 6960 OCSP responder must be served without auth",
|
||||||
|
},
|
||||||
|
|
||||||
|
// ---- Case (v): Authenticated API routes ----
|
||||||
|
{
|
||||||
|
name: "api_v1_certificates_goes_through_auth",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/api/v1/certificates",
|
||||||
|
wantBody: "AUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "Primary API surface must still require Bearer token",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "api_v1_auth_check_goes_through_auth",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/api/v1/auth/check",
|
||||||
|
wantBody: "AUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "auth/check validates the caller's Bearer — auth chain required",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "api_v1_jobs_goes_through_auth",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/api/v1/jobs",
|
||||||
|
wantBody: "AUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "Jobs API is part of the privileged surface",
|
||||||
|
},
|
||||||
|
|
||||||
|
// ---- Health probes bypass auth ----
|
||||||
|
{
|
||||||
|
name: "health_bypasses_auth",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/health",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "Docker/K8s health probes cannot carry Bearer tokens",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "ready_bypasses_auth",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/ready",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "Readiness probe also unauthenticated",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "auth_info_bypasses_auth",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/api/v1/auth/info",
|
||||||
|
wantBody: "NOAUTH",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
description: "React app calls auth/info BEFORE login to discover auth mode",
|
||||||
|
},
|
||||||
|
|
||||||
|
// ---- Static assets served by file server ----
|
||||||
|
{
|
||||||
|
name: "static_asset_served_by_file_server",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/assets/app.js",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
wantBody: "console.log('certctl');",
|
||||||
|
description: "Built Vite assets served directly without auth",
|
||||||
|
},
|
||||||
|
|
||||||
|
// ---- SPA fallback ----
|
||||||
|
{
|
||||||
|
name: "spa_fallback_serves_index_html",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
wantBody: "certctl dashboard",
|
||||||
|
description: "Root path serves SPA entry point",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "spa_fallback_for_unknown_route",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/certificates",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
wantBody: "certctl dashboard",
|
||||||
|
description: "React Router routes fall through to index.html",
|
||||||
|
},
|
||||||
|
{
|
||||||
|
name: "spa_fallback_deep_route",
|
||||||
|
method: http.MethodGet,
|
||||||
|
path: "/certificates/mc-api-prod/detail",
|
||||||
|
wantStatus: http.StatusOK,
|
||||||
|
wantBody: "certctl dashboard",
|
||||||
|
description: "Deep React Router routes also fall through to SPA",
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
for _, tc := range tests {
|
||||||
|
t.Run(tc.name, func(t *testing.T) {
|
||||||
|
req := httptest.NewRequest(tc.method, tc.path, nil)
|
||||||
|
w := httptest.NewRecorder()
|
||||||
|
handler.ServeHTTP(w, req)
|
||||||
|
|
||||||
|
if w.Code != tc.wantStatus {
|
||||||
|
t.Errorf("status = %d, want %d (%s)", w.Code, tc.wantStatus, tc.description)
|
||||||
|
}
|
||||||
|
body := w.Body.String()
|
||||||
|
if tc.wantBody != "" && !strings.Contains(body, tc.wantBody) {
|
||||||
|
t.Errorf("body %q does not contain %q (%s)", body, tc.wantBody, tc.description)
|
||||||
|
}
|
||||||
|
if tc.wantBodyPrefix != "" && !strings.HasPrefix(body, tc.wantBodyPrefix) {
|
||||||
|
t.Errorf("body %q does not start with %q (%s)", body, tc.wantBodyPrefix, tc.description)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestBuildFinalHandler_NoDashboard pins the API-only (dashboard-absent)
|
||||||
|
// dispatch behavior. When web/dist/index.html is missing, everything that's
|
||||||
|
// not a no-auth bypass route falls through to the authenticated apiHandler
|
||||||
|
// (pre-M-001 behavior for headless deployments). EST/SCEP/PKI still ride the
|
||||||
|
// no-auth chain.
|
||||||
|
func TestBuildFinalHandler_NoDashboard(t *testing.T) {
|
||||||
|
authHandler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
|
||||||
|
w.WriteHeader(http.StatusOK)
|
||||||
|
_, _ = w.Write([]byte("AUTH"))
|
||||||
|
})
|
||||||
|
noAuthHandler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
|
||||||
|
w.WriteHeader(http.StatusOK)
|
||||||
|
_, _ = w.Write([]byte("NOAUTH"))
|
||||||
|
})
|
||||||
|
|
||||||
|
handler := buildFinalHandler(authHandler, noAuthHandler, "/nonexistent", false /* dashboardEnabled */)
|
||||||
|
|
||||||
|
tests := []struct {
|
||||||
|
name string
|
||||||
|
path string
|
||||||
|
wantBody string
|
||||||
|
}{
|
||||||
|
{"est_still_no_auth", "/.well-known/est/cacerts", "NOAUTH"},
|
||||||
|
{"scep_still_no_auth", "/scep", "NOAUTH"},
|
||||||
|
{"pki_still_no_auth", "/.well-known/pki/crl/x", "NOAUTH"},
|
||||||
|
{"health_still_no_auth", "/health", "NOAUTH"},
|
||||||
|
{"api_still_auth", "/api/v1/certificates", "AUTH"},
|
||||||
|
// The difference: non-API, non-special paths go through auth chain when
|
||||||
|
// there's no dashboard to serve (preserves legacy headless behavior).
|
||||||
|
{"unknown_path_falls_through_to_auth", "/", "AUTH"},
|
||||||
|
{"unknown_deep_path_falls_through_to_auth", "/random/path", "AUTH"},
|
||||||
|
}
|
||||||
|
|
||||||
|
for _, tc := range tests {
|
||||||
|
t.Run(tc.name, func(t *testing.T) {
|
||||||
|
req := httptest.NewRequest(http.MethodGet, tc.path, nil)
|
||||||
|
w := httptest.NewRecorder()
|
||||||
|
handler.ServeHTTP(w, req)
|
||||||
|
if w.Code != http.StatusOK {
|
||||||
|
t.Errorf("status = %d, want 200", w.Code)
|
||||||
|
}
|
||||||
|
if got := w.Body.String(); !strings.Contains(got, tc.wantBody) {
|
||||||
|
t.Errorf("body = %q, want to contain %q", got, tc.wantBody)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
}
|
||||||
|
}
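The two tests above pin the routing contract without showing `buildFinalHandler` itself. The sketch below is one way the no-auth decision could be written so that `/scep` and `/scep/` match while `/scepxyz` does not. `isNoAuthPath` is a hypothetical name and the path list is inferred only from the test cases, so treat it as an assumption rather than the actual dispatch code.

```go
package main

import (
	"fmt"
	"strings"
)

// isNoAuthPath is a hypothetical predicate over r.URL.Path (the query string
// is already stripped there, so "/scep?operation=GetCACaps" arrives as "/scep").
func isNoAuthPath(path string) bool {
	// Exact matches: probes, the pre-login auth discovery endpoint, bare /scep.
	switch path {
	case "/health", "/ready", "/api/v1/auth/info", "/scep":
		return true
	}
	// Prefix matches end in "/" so that "/scepxyz" cannot slip through.
	for _, prefix := range []string{"/.well-known/est/", "/.well-known/pki/", "/scep/"} {
		if strings.HasPrefix(path, prefix) {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{"/scep", "/scep/", "/scepxyz", "/.well-known/est/cacerts", "/api/v1/certificates"} {
		fmt.Printf("%-28s noauth=%v\n", p, isNoAuthPath(p))
	}
}
```

Matching `/scep` exactly and `/scep/` as a prefix, instead of a bare `strings.HasPrefix(path, "/scep")`, is what keeps unrelated paths out of the unauthenticated chain.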
|
||||||
+171
-76
@@ -17,7 +17,6 @@ import (
|
|||||||
"github.com/shankar0123/certctl/internal/api/middleware"
|
"github.com/shankar0123/certctl/internal/api/middleware"
|
||||||
"github.com/shankar0123/certctl/internal/api/router"
|
"github.com/shankar0123/certctl/internal/api/router"
|
||||||
"github.com/shankar0123/certctl/internal/config"
|
"github.com/shankar0123/certctl/internal/config"
|
||||||
"github.com/shankar0123/certctl/internal/domain"
|
|
||||||
discoveryawssm "github.com/shankar0123/certctl/internal/connector/discovery/awssm"
|
discoveryawssm "github.com/shankar0123/certctl/internal/connector/discovery/awssm"
|
||||||
discoveryazurekv "github.com/shankar0123/certctl/internal/connector/discovery/azurekv"
|
discoveryazurekv "github.com/shankar0123/certctl/internal/connector/discovery/azurekv"
|
||||||
discoverygcpsm "github.com/shankar0123/certctl/internal/connector/discovery/gcpsm"
|
discoverygcpsm "github.com/shankar0123/certctl/internal/connector/discovery/gcpsm"
|
||||||
@@ -26,6 +25,7 @@ import (
|
|||||||
notifypagerduty "github.com/shankar0123/certctl/internal/connector/notifier/pagerduty"
|
notifypagerduty "github.com/shankar0123/certctl/internal/connector/notifier/pagerduty"
|
||||||
notifyslack "github.com/shankar0123/certctl/internal/connector/notifier/slack"
|
notifyslack "github.com/shankar0123/certctl/internal/connector/notifier/slack"
|
||||||
notifyteams "github.com/shankar0123/certctl/internal/connector/notifier/teams"
|
notifyteams "github.com/shankar0123/certctl/internal/connector/notifier/teams"
|
||||||
|
"github.com/shankar0123/certctl/internal/domain"
|
||||||
"github.com/shankar0123/certctl/internal/repository/postgres"
|
"github.com/shankar0123/certctl/internal/repository/postgres"
|
||||||
"github.com/shankar0123/certctl/internal/scheduler"
|
"github.com/shankar0123/certctl/internal/scheduler"
|
||||||
"github.com/shankar0123/certctl/internal/service"
|
"github.com/shankar0123/certctl/internal/service"
|
||||||
@@ -353,6 +353,12 @@ func main() {
|
|||||||
|
|
||||||
// Initialize stats and metrics services
|
// Initialize stats and metrics services
|
||||||
statsService := service.NewStatsService(certificateRepo, jobRepo, agentRepo)
|
statsService := service.NewStatsService(certificateRepo, jobRepo, agentRepo)
|
||||||
|
// I-005: wire the notification repository so DashboardSummary.NotificationsDead
|
||||||
|
// is populated, which in turn drives the Prometheus counter
|
||||||
|
// certctl_notification_dead_total in GetPrometheusMetrics. Setter
|
||||||
|
// pattern keeps NewStatsService's nine call sites (main.go + stats_test.go
|
||||||
|
// + 8 digest_test.go sites) untouched.
|
||||||
|
statsService.SetNotifRepo(notificationRepo)
|
||||||
logger.Info("initialized stats service")
|
logger.Info("initialized stats service")
|
||||||
|
|
||||||
// Initialize API handlers
|
// Initialize API handlers
|
||||||
@@ -447,6 +453,14 @@ func main() {
|
|||||||
sched.SetJobRetryInterval(cfg.Scheduler.RetryInterval)
|
sched.SetJobRetryInterval(cfg.Scheduler.RetryInterval)
|
||||||
sched.SetAgentHealthCheckInterval(cfg.Scheduler.AgentHealthCheckInterval)
|
sched.SetAgentHealthCheckInterval(cfg.Scheduler.AgentHealthCheckInterval)
|
||||||
sched.SetNotificationProcessInterval(cfg.Scheduler.NotificationProcessInterval)
|
sched.SetNotificationProcessInterval(cfg.Scheduler.NotificationProcessInterval)
|
||||||
|
// I-005: drive the failed-notification retry sweep. Runs every
|
||||||
|
// NotificationRetryInterval (default 2m, CERTCTL_NOTIFICATION_RETRY_INTERVAL)
|
||||||
|
// and transitions eligible Failed notifications whose next_retry_at has
|
||||||
|
// arrived back to Pending so the notification processor picks them up on
|
||||||
|
// its next tick. Kept adjacent to the notification processor setter
|
||||||
|
// because they share the NotificationServicer dependency (same placement
|
||||||
|
// pattern as I-001's SetJobRetryInterval above).
|
||||||
|
sched.SetNotificationRetryInterval(cfg.Scheduler.NotificationRetryInterval)
|
||||||
if cfg.NetworkScan.Enabled {
|
if cfg.NetworkScan.Enabled {
|
||||||
sched.SetNetworkScanInterval(cfg.NetworkScan.ScanInterval)
|
sched.SetNetworkScanInterval(cfg.NetworkScan.ScanInterval)
|
||||||
logger.Info("network scanning enabled", "interval", cfg.NetworkScan.ScanInterval.String())
|
logger.Info("network scanning enabled", "interval", cfg.NetworkScan.ScanInterval.String())
|
||||||
@@ -469,7 +483,6 @@ func main() {
|
|||||||
"sources", cloudDiscoveryService.SourceCount())
|
"sources", cloudDiscoveryService.SourceCount())
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
// Wire job timeout reaper (I-003)
|
// Wire job timeout reaper (I-003)
|
||||||
sched.SetJobReaperService(jobService)
|
sched.SetJobReaperService(jobService)
|
||||||
sched.SetJobTimeoutInterval(cfg.Scheduler.JobTimeoutInterval)
|
sched.SetJobTimeoutInterval(cfg.Scheduler.JobTimeoutInterval)
|
||||||
@@ -489,28 +502,28 @@ func main() {
|
|||||||
// Build the API router with all handlers
|
// Build the API router with all handlers
|
||||||
apiRouter := router.New()
|
apiRouter := router.New()
|
||||||
apiRouter.RegisterHandlers(router.HandlerRegistry{
|
apiRouter.RegisterHandlers(router.HandlerRegistry{
|
||||||
Certificates: certificateHandler,
|
Certificates: certificateHandler,
|
||||||
Issuers: issuerHandler,
|
Issuers: issuerHandler,
|
||||||
Targets: targetHandler,
|
Targets: targetHandler,
|
||||||
Agents: agentHandler,
|
Agents: agentHandler,
|
||||||
Jobs: jobHandler,
|
Jobs: jobHandler,
|
||||||
Policies: policyHandler,
|
Policies: policyHandler,
|
||||||
Profiles: profileHandler,
|
Profiles: profileHandler,
|
||||||
Teams: teamHandler,
|
Teams: teamHandler,
|
||||||
Owners: ownerHandler,
|
Owners: ownerHandler,
|
||||||
AgentGroups: agentGroupHandler,
|
AgentGroups: agentGroupHandler,
|
||||||
Audit: auditHandler,
|
Audit: auditHandler,
|
||||||
Notifications: notificationHandler,
|
Notifications: notificationHandler,
|
||||||
Stats: statsHandler,
|
Stats: statsHandler,
|
||||||
Metrics: metricsHandler,
|
Metrics: metricsHandler,
|
||||||
Health: healthHandler,
|
Health: healthHandler,
|
||||||
Discovery: discoveryHandler,
|
Discovery: discoveryHandler,
|
||||||
NetworkScan: networkScanHandler,
|
NetworkScan: networkScanHandler,
|
||||||
Verification: verificationHandler,
|
Verification: verificationHandler,
|
||||||
Export: exportHandler,
|
Export: exportHandler,
|
||||||
Digest: *digestHandler,
|
Digest: *digestHandler,
|
||||||
HealthChecks: healthCheckHandler,
|
HealthChecks: healthCheckHandler,
|
||||||
BulkRevocation: bulkRevocationHandler,
|
BulkRevocation: bulkRevocationHandler,
|
||||||
})
|
})
|
||||||
// Register EST (RFC 7030) handlers if enabled
|
// Register EST (RFC 7030) handlers if enabled
|
||||||
if cfg.EST.Enabled {
|
if cfg.EST.Enabled {
|
||||||
@@ -712,74 +725,65 @@ func main() {
|
|||||||
middleware.Recovery,
|
middleware.Recovery,
|
||||||
)
|
)
|
||||||
|
|
||||||
|
dashboardEnabled := false
|
||||||
if _, err := os.Stat(webDir + "/index.html"); err == nil {
|
if _, err := os.Stat(webDir + "/index.html"); err == nil {
|
||||||
fileServer := http.FileServer(http.Dir(webDir))
|
dashboardEnabled = true
|
||||||
finalHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
}
|
||||||
path := r.URL.Path
|
finalHandler = buildFinalHandler(apiHandler, noAuthHandler, webDir, dashboardEnabled)
|
||||||
// Health/ready and auth/info bypass auth middleware.
|
if dashboardEnabled {
|
||||||
// Health/ready: Docker/K8s health probes don't carry Bearer tokens.
|
|
||||||
// auth/info: React app calls this before login to detect auth mode.
|
|
||||||
if path == "/health" || path == "/ready" || path == "/api/v1/auth/info" {
|
|
||||||
noAuthHandler.ServeHTTP(w, r)
|
|
||||||
return
|
|
||||||
}
|
|
||||||
// RFC 5280 CRL and RFC 6960 OCSP live under /.well-known/pki/ and
|
|
||||||
// MUST be served unauthenticated — relying parties (browsers,
|
|
||||||
// OpenSSL, OCSP stapling sidecars, mTLS clients) cannot present
|
|
||||||
// certctl Bearer tokens. See router.RegisterPKIHandlers.
|
|
||||||
if len(path) >= 16 && path[:16] == "/.well-known/pki" {
|
|
||||||
noAuthHandler.ServeHTTP(w, r)
|
|
||||||
return
|
|
||||||
}
|
|
||||||
// All other API and EST routes go through the full middleware stack (with auth)
|
|
||||||
if (len(path) >= 8 && path[:8] == "/api/v1/") ||
|
|
||||||
(len(path) >= 16 && path[:16] == "/.well-known/est") {
|
|
||||||
apiHandler.ServeHTTP(w, r)
|
|
||||||
return
|
|
||||||
}
|
|
||||||
// Try to serve static files (JS, CSS, assets)
|
|
||||||
if len(path) > 8 && path[:8] == "/assets/" {
|
|
||||||
fileServer.ServeHTTP(w, r)
|
|
||||||
return
|
|
||||||
}
|
|
||||||
// SPA fallback: serve index.html for all other routes
|
|
||||||
http.ServeFile(w, r, webDir+"/index.html")
|
|
||||||
})
|
|
||||||
logger.Info("dashboard available at /", "web_dir", webDir)
|
logger.Info("dashboard available at /", "web_dir", webDir)
|
||||||
} else {
|
} else {
|
||||||
// No dashboard: route health/auth-info and /.well-known/pki without
|
|
||||||
// auth, everything else through full stack.
|
|
||||||
finalHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
|
||||||
path := r.URL.Path
|
|
||||||
if path == "/health" || path == "/ready" || path == "/api/v1/auth/info" {
|
|
||||||
noAuthHandler.ServeHTTP(w, r)
|
|
||||||
return
|
|
||||||
}
|
|
||||||
if len(path) >= 16 && path[:16] == "/.well-known/pki" {
|
|
||||||
noAuthHandler.ServeHTTP(w, r)
|
|
||||||
return
|
|
||||||
}
|
|
||||||
apiHandler.ServeHTTP(w, r)
|
|
||||||
})
|
|
||||||
logger.Info("dashboard directory not found, serving API only")
|
logger.Info("dashboard directory not found, serving API only")
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// HTTPS-everywhere milestone §2.1: fail-loud if the TLS configuration is
|
||||||
|
// missing or malformed. Duplicates config.Validate() for defense in depth
|
||||||
|
// (same pattern as preflightSCEPChallengePassword).
|
||||||
|
if err := preflightServerTLS(cfg.Server.TLS.CertPath, cfg.Server.TLS.KeyPath); err != nil {
|
||||||
|
logger.Error("startup refused: HTTPS cert unusable; control plane is HTTPS-only",
|
||||||
|
"error", err,
|
||||||
|
"cert_path", cfg.Server.TLS.CertPath,
|
||||||
|
"key_path", cfg.Server.TLS.KeyPath)
|
||||||
|
os.Exit(1)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Load the cert+key into a SIGHUP-reloadable holder. Any subsequent
|
||||||
|
// SIGHUP triggers a fresh read and atomic swap so rotations do not need
|
||||||
|
// a restart. Reload failures keep the previous cert and log an error.
|
||||||
|
tlsCertHolder, err := newCertHolder(cfg.Server.TLS.CertPath, cfg.Server.TLS.KeyPath)
|
||||||
|
if err != nil {
|
||||||
|
logger.Error("startup refused: failed to load TLS cert holder",
|
||||||
|
"error", err,
|
||||||
|
"cert_path", cfg.Server.TLS.CertPath,
|
||||||
|
"key_path", cfg.Server.TLS.KeyPath)
|
||||||
|
os.Exit(1)
|
||||||
|
}
|
||||||
|
stopTLSWatcher := tlsCertHolder.watchSIGHUP(logger)
|
||||||
|
defer stopTLSWatcher()
|
||||||
|
|
||||||
// Server configuration
|
// Server configuration
|
||||||
addr := net.JoinHostPort(cfg.Server.Host, strconv.Itoa(cfg.Server.Port))
|
addr := net.JoinHostPort(cfg.Server.Host, strconv.Itoa(cfg.Server.Port))
|
||||||
httpServer := &http.Server{
|
httpServer := &http.Server{
|
||||||
Addr: addr,
|
Addr: addr,
|
||||||
Handler: finalHandler,
|
Handler: finalHandler,
|
||||||
|
TLSConfig: buildServerTLSConfig(tlsCertHolder),
|
||||||
ReadTimeout: 30 * time.Second,
|
ReadTimeout: 30 * time.Second,
|
||||||
ReadHeaderTimeout: 5 * time.Second,
|
ReadHeaderTimeout: 5 * time.Second,
|
||||||
WriteTimeout: 120 * time.Second, // Must accommodate ACME issuance (order + challenge + finalize)
|
WriteTimeout: 120 * time.Second, // Must accommodate ACME issuance (order + challenge + finalize)
|
||||||
IdleTimeout: 60 * time.Second,
|
IdleTimeout: 60 * time.Second,
|
||||||
}
|
}
|
||||||
|
|
||||||
// Start HTTP server in background
|
// Start HTTPS server in background. ListenAndServeTLS is called with
|
||||||
logger.Info("starting HTTP server", "address", addr)
|
// empty cert+key arguments because the cert is sourced through
|
||||||
|
// TLSConfig.GetCertificate (the SIGHUP-reloadable holder). Passing file
|
||||||
|
// paths here would pin the first-loaded cert and defeat hot reload.
|
||||||
|
logger.Info("HTTPS server listening",
|
||||||
|
"address", addr,
|
||||||
|
"cert_path", cfg.Server.TLS.CertPath,
|
||||||
|
"min_version", "TLS1.3")
|
||||||
go func() {
|
go func() {
|
||||||
if err := httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
|
if err := httpServer.ListenAndServeTLS("", ""); err != nil && err != http.ErrServerClosed {
|
||||||
logger.Error("HTTP server error", "error", err)
|
logger.Error("HTTPS server error", "error", err)
|
||||||
}
|
}
|
||||||
}()
|
}()
|
||||||
|
|
||||||
@@ -802,9 +806,9 @@ func main() {
|
|||||||
logger.Warn("scheduler work did not complete in time", "error", err)
|
logger.Warn("scheduler work did not complete in time", "error", err)
|
||||||
}
|
}
|
||||||
|
|
||||||
logger.Info("shutting down HTTP server")
|
logger.Info("shutting down HTTPS server")
|
||||||
if err := httpServer.Shutdown(shutdownCtx); err != nil {
|
if err := httpServer.Shutdown(shutdownCtx); err != nil {
|
||||||
logger.Error("HTTP server shutdown error", "error", err)
|
logger.Error("HTTPS server shutdown error", "error", err)
|
||||||
}
|
}
|
||||||
|
|
||||||
// Drain in-flight audit-recording goroutines before closing the DB pool.
|
// Drain in-flight audit-recording goroutines before closing the DB pool.
|
||||||
@@ -846,3 +850,94 @@ func preflightSCEPChallengePassword(enabled bool, challengePassword string) erro
|
|||||||
return nil
|
return nil
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// buildFinalHandler builds the outer HTTP dispatch handler that routes incoming
|
||||||
|
// requests to either the authenticated apiHandler chain or the unauthenticated
|
||||||
|
// noAuthHandler chain based on URL path prefix. Extracted from main() so the
|
||||||
|
// dispatch logic can be unit tested without booting the full server stack
|
||||||
|
// (see cmd/server/finalhandler_test.go).
|
||||||
|
//
|
||||||
|
// Dispatch rules (M-001, audit 2026-04-19, option D):
|
||||||
|
//
|
||||||
|
// - /health, /ready, /api/v1/auth/info → no-auth (probes + login detection)
|
||||||
|
// - /.well-known/pki/* → no-auth (RFC 5280 CRL, RFC 6960 OCSP)
|
||||||
|
// - /.well-known/est/* → no-auth (RFC 7030 §3.2.3)
|
||||||
|
// - /scep, /scep/* → no-auth (RFC 8894 §3.2, CSR challengePassword)
|
||||||
|
// - /api/v1/* → auth (Bearer token required)
|
||||||
|
// - /assets/* → static file server (dashboard only)
|
||||||
|
// - anything else → SPA index.html fallback (dashboard only)
|
||||||
|
// OR apiHandler (no dashboard)
|
||||||
|
//
|
||||||
|
// EST/SCEP clients (IoT devices, 802.1X supplicants, MDM endpoints, network
|
||||||
|
// appliances) cannot present certctl Bearer tokens, so those endpoints must be
|
||||||
|
// reachable without the Auth middleware. Authentication is instead enforced by
|
||||||
|
// CSR signature verification, profile policy gates, and for SCEP the
|
||||||
|
// challengePassword shared secret (fail-loud gated by preflightSCEPChallengePassword
|
||||||
|
// above).
|
||||||
|
//
|
||||||
|
// webDir must point to a directory containing index.html + assets/ when
|
||||||
|
// dashboardEnabled is true; it is ignored otherwise.
|
||||||
|
func buildFinalHandler(apiHandler, noAuthHandler http.Handler, webDir string, dashboardEnabled bool) http.Handler {
|
||||||
|
var fileServer http.Handler
|
||||||
|
if dashboardEnabled {
|
||||||
|
fileServer = http.FileServer(http.Dir(webDir))
|
||||||
|
}
|
||||||
|
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||||
|
path := r.URL.Path
|
||||||
|
|
||||||
|
// Health/ready and auth/info bypass auth middleware.
|
||||||
|
// Health/ready: Docker/K8s health probes don't carry Bearer tokens.
|
||||||
|
// auth/info: React app calls this before login to detect auth mode.
|
||||||
|
if path == "/health" || path == "/ready" || path == "/api/v1/auth/info" {
|
||||||
|
noAuthHandler.ServeHTTP(w, r)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
// RFC 5280 CRL and RFC 6960 OCSP live under /.well-known/pki/ and MUST
|
||||||
|
// be served unauthenticated — relying parties (browsers, OpenSSL, OCSP
|
||||||
|
// stapling sidecars, mTLS clients) cannot present certctl Bearer tokens.
|
||||||
|
if strings.HasPrefix(path, "/.well-known/pki") {
|
||||||
|
noAuthHandler.ServeHTTP(w, r)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
// RFC 7030 EST endpoints ride the no-auth middleware chain (M-001,
|
||||||
|
// option D, audit 2026-04-19). Trust boundary is CSR signature + profile
|
||||||
|
// policy, not HTTP Bearer. /.well-known/est/cacerts is explicitly
|
||||||
|
// anonymous per RFC 7030 §4.1.1.
|
||||||
|
if strings.HasPrefix(path, "/.well-known/est") {
|
||||||
|
noAuthHandler.ServeHTTP(w, r)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
// RFC 8894 SCEP rides the no-auth chain (M-001, option D). SCEP clients
|
||||||
|
// authenticate via the challengePassword attribute in the PKCS#10 CSR,
|
||||||
|
// not via HTTP Bearer tokens. preflightSCEPChallengePassword refuses to
|
||||||
|
// start the server if SCEP is enabled without a non-empty shared secret.
|
||||||
|
if path == "/scep" || strings.HasPrefix(path, "/scep/") {
|
||||||
|
noAuthHandler.ServeHTTP(w, r)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
// Authenticated API routes — full middleware stack including Auth.
|
||||||
|
if strings.HasPrefix(path, "/api/v1/") {
|
||||||
|
apiHandler.ServeHTTP(w, r)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
if !dashboardEnabled {
|
||||||
|
// No dashboard: everything non-special falls through to the
|
||||||
|
// authenticated handler (preserves pre-M-001 behavior for API-only
|
||||||
|
// deployments).
|
||||||
|
apiHandler.ServeHTTP(w, r)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
// Dashboard-present: serve static assets directly, SPA fallback for
|
||||||
|
// everything else.
|
||||||
|
if strings.HasPrefix(path, "/assets/") {
|
||||||
|
fileServer.ServeHTTP(w, r)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
http.ServeFile(w, r, webDir+"/index.html")
|
||||||
|
})
|
||||||
|
}
|
||||||
|
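The dispatch rules documented on buildFinalHandler can be exercised without booting the server. The following is a minimal sketch, not the project's own test: `dispatch`, `tag`, and `served` are hypothetical stand-ins that mirror only the API-only (no dashboard) branch of the prefix rules, and the labels "auth" and "no-auth" are illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// tag returns a handler that writes a fixed label, so a caller can see
// which middleware chain served a request. Hypothetical helper.
func tag(label string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprint(w, label)
	})
}

// dispatch is a simplified stand-in for buildFinalHandler with the dashboard
// disabled: the same path rules, but no static file serving or SPA fallback.
func dispatch(api, noAuth http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		p := r.URL.Path
		switch {
		case p == "/health", p == "/ready", p == "/api/v1/auth/info",
			strings.HasPrefix(p, "/.well-known/pki"),
			strings.HasPrefix(p, "/.well-known/est"),
			p == "/scep", strings.HasPrefix(p, "/scep/"):
			noAuth.ServeHTTP(w, r)
		default:
			// Everything else, including /api/v1/*, goes through auth.
			api.ServeHTTP(w, r)
		}
	})
}

// served issues a GET for path against h and returns the response body,
// which is the label of the chain that handled it.
func served(h http.Handler, path string) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Body.String()
}

func main() {
	h := dispatch(tag("auth"), tag("no-auth"))
	for _, p := range []string{"/health", "/scep", "/scepx", "/api/v1/certificates"} {
		fmt.Printf("%-22s %s\n", p, served(h, p))
	}
}
```

The `/scepx` case shows why the SCEP rule is an exact match plus a `/scep/` prefix rather than a bare `/scep` prefix: a lookalike path falls through to the authenticated chain.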
|||||||
@@ -214,6 +214,8 @@ func TestMain_ServerConfigFromEnvironment(t *testing.T) {
|
|||||||
oldAuthType := os.Getenv("CERTCTL_AUTH_TYPE")
|
oldAuthType := os.Getenv("CERTCTL_AUTH_TYPE")
|
||||||
oldServerHost := os.Getenv("CERTCTL_SERVER_HOST")
|
oldServerHost := os.Getenv("CERTCTL_SERVER_HOST")
|
||||||
oldServerPort := os.Getenv("CERTCTL_SERVER_PORT")
|
oldServerPort := os.Getenv("CERTCTL_SERVER_PORT")
|
||||||
|
oldTLSCert := os.Getenv("CERTCTL_SERVER_TLS_CERT_PATH")
|
||||||
|
oldTLSKey := os.Getenv("CERTCTL_SERVER_TLS_KEY_PATH")
|
||||||
defer func() {
|
defer func() {
|
||||||
if oldAuthType != "" {
|
if oldAuthType != "" {
|
||||||
os.Setenv("CERTCTL_AUTH_TYPE", oldAuthType)
|
os.Setenv("CERTCTL_AUTH_TYPE", oldAuthType)
|
||||||
@@ -230,12 +232,32 @@ func TestMain_ServerConfigFromEnvironment(t *testing.T) {
|
|||||||
} else {
|
} else {
|
||||||
os.Unsetenv("CERTCTL_SERVER_PORT")
|
os.Unsetenv("CERTCTL_SERVER_PORT")
|
||||||
}
|
}
|
||||||
|
if oldTLSCert != "" {
|
||||||
|
os.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", oldTLSCert)
|
||||||
|
} else {
|
||||||
|
os.Unsetenv("CERTCTL_SERVER_TLS_CERT_PATH")
|
||||||
|
}
|
||||||
|
if oldTLSKey != "" {
|
||||||
|
os.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", oldTLSKey)
|
||||||
|
} else {
|
||||||
|
os.Unsetenv("CERTCTL_SERVER_TLS_KEY_PATH")
|
||||||
|
}
|
||||||
}()
|
}()
|
||||||
|
|
||||||
|
// HTTPS-only control plane: Validate() refuses to pass without a readable
|
||||||
|
// cert/key pair on disk. Materialize a throwaway ECDSA P-256 pair using the
|
||||||
|
// same generator cmd/server/tls_test.go uses for the certHolder tests.
|
||||||
|
dir := t.TempDir()
|
||||||
|
certPath := dir + "/server.crt"
|
||||||
|
keyPath := dir + "/server.key"
|
||||||
|
generateTestCert(t, certPath, keyPath, "main-test-cn")
|
||||||
|
|
||||||
// Set test env vars
|
// Set test env vars
|
||||||
os.Setenv("CERTCTL_AUTH_TYPE", "none")
|
os.Setenv("CERTCTL_AUTH_TYPE", "none")
|
||||||
os.Setenv("CERTCTL_SERVER_HOST", "127.0.0.1")
|
os.Setenv("CERTCTL_SERVER_HOST", "127.0.0.1")
|
||||||
os.Setenv("CERTCTL_SERVER_PORT", "8080")
|
os.Setenv("CERTCTL_SERVER_PORT", "8080")
|
||||||
|
os.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", certPath)
|
||||||
|
os.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", keyPath)
|
||||||
|
|
||||||
cfg, err := config.Load()
|
cfg, err := config.Load()
|
||||||
if err != nil {
|
if err != nil {
|
||||||
@@ -260,6 +282,8 @@ func TestMain_AuthTypeConfiguration(t *testing.T) {
|
|||||||
// Save original env vars
|
// Save original env vars
|
||||||
oldAuthType := os.Getenv("CERTCTL_AUTH_TYPE")
|
oldAuthType := os.Getenv("CERTCTL_AUTH_TYPE")
|
||||||
oldAuthSecret := os.Getenv("CERTCTL_AUTH_SECRET")
|
oldAuthSecret := os.Getenv("CERTCTL_AUTH_SECRET")
|
||||||
|
oldTLSCert := os.Getenv("CERTCTL_SERVER_TLS_CERT_PATH")
|
||||||
|
oldTLSKey := os.Getenv("CERTCTL_SERVER_TLS_KEY_PATH")
|
||||||
defer func() {
|
defer func() {
|
||||||
if oldAuthType != "" {
|
if oldAuthType != "" {
|
||||||
os.Setenv("CERTCTL_AUTH_TYPE", oldAuthType)
|
os.Setenv("CERTCTL_AUTH_TYPE", oldAuthType)
|
||||||
@@ -271,8 +295,28 @@ func TestMain_AuthTypeConfiguration(t *testing.T) {
|
|||||||
} else {
|
} else {
|
||||||
os.Unsetenv("CERTCTL_AUTH_SECRET")
|
os.Unsetenv("CERTCTL_AUTH_SECRET")
|
||||||
}
|
}
|
||||||
|
if oldTLSCert != "" {
|
||||||
|
os.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", oldTLSCert)
|
||||||
|
} else {
|
||||||
|
os.Unsetenv("CERTCTL_SERVER_TLS_CERT_PATH")
|
||||||
|
}
|
||||||
|
if oldTLSKey != "" {
|
||||||
|
os.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", oldTLSKey)
|
||||||
|
} else {
|
||||||
|
os.Unsetenv("CERTCTL_SERVER_TLS_KEY_PATH")
|
||||||
|
}
|
||||||
}()
|
}()
|
||||||
|
|
||||||
|
// HTTPS-only control plane: config.Load()→Validate() refuses to pass
|
||||||
|
// without a readable cert/key pair. Mint one throwaway pair for the whole
|
||||||
|
// sub-test cohort — auth type toggles don't care about the TLS surface.
|
||||||
|
dir := t.TempDir()
|
||||||
|
certPath := dir + "/server.crt"
|
||||||
|
keyPath := dir + "/server.key"
|
||||||
|
generateTestCert(t, certPath, keyPath, "main-test-cn")
|
||||||
|
os.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", certPath)
|
||||||
|
os.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", keyPath)
|
||||||
|
|
||||||
// Set auth secret for api-key mode
|
// Set auth secret for api-key mode
|
||||||
os.Setenv("CERTCTL_AUTH_SECRET", "test-secret")
|
os.Setenv("CERTCTL_AUTH_SECRET", "test-secret")
|
||||||
|
|
||||||
|
|||||||
@@ -0,0 +1,164 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
import (
|
||||||
|
"crypto/tls"
|
||||||
|
"fmt"
|
||||||
|
"log/slog"
|
||||||
|
"os"
|
||||||
|
"os/signal"
|
||||||
|
"sync"
|
||||||
|
"syscall"
|
||||||
|
)
|
||||||
|
|
||||||
|
// certHolder stores the server's TLS certificate under a mutex so it can be
|
||||||
|
// swapped atomically by a SIGHUP handler without restarting the server. A
|
||||||
|
// *tls.Config that wires GetCertificate → (*certHolder).GetCertificate reads
|
||||||
|
// through the holder on every ClientHello, so a successful reload takes
|
||||||
|
// effect on the next new connection immediately and without dropping
|
||||||
|
// in-flight requests.
|
||||||
|
//
|
||||||
|
// Concurrency: GetCertificate is invoked from crypto/tls handshake goroutines
|
||||||
|
// on every new inbound connection; Reload is invoked from the SIGHUP watcher
|
||||||
|
// goroutine. sync.Mutex is sufficient — TLS handshakes are not an inner-loop
|
||||||
|
// hot path and the critical section is a single pointer read.
|
||||||
|
type certHolder struct {
|
||||||
|
mu sync.Mutex
|
||||||
|
cert *tls.Certificate
|
||||||
|
certPath string
|
||||||
|
keyPath string
|
||||||
|
}
|
||||||
|
|
||||||
|
// newCertHolder loads the initial cert+key pair from disk and returns a
|
||||||
|
// holder ready to serve handshakes. Returns a non-nil error if either file
|
||||||
|
// is missing, unreadable, or the pair does not round-trip through
|
||||||
|
// tls.LoadX509KeyPair (for example the key does not match the cert). The
|
||||||
|
// caller is expected to treat a non-nil error as a fail-loud startup gate
|
||||||
|
// and os.Exit(1) — the HTTPS-everywhere milestone (§3 locked decisions)
|
||||||
|
// prohibits plaintext HTTP fallback.
|
||||||
|
func newCertHolder(certPath, keyPath string) (*certHolder, error) {
|
||||||
|
cert, err := tls.LoadX509KeyPair(certPath, keyPath)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("load TLS cert/key (cert=%q key=%q): %w", certPath, keyPath, err)
|
||||||
|
}
|
||||||
|
return &certHolder{
|
||||||
|
cert: &cert,
|
||||||
|
certPath: certPath,
|
||||||
|
keyPath: keyPath,
|
||||||
|
}, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetCertificate is the tls.Config.GetCertificate hook. Returns the current
|
||||||
|
// cert under the holder's mutex. ClientHelloInfo is ignored — the control
|
||||||
|
// plane does not multiplex by SNI.
|
||||||
|
func (h *certHolder) GetCertificate(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
|
||||||
|
h.mu.Lock()
|
||||||
|
defer h.mu.Unlock()
|
||||||
|
return h.cert, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// Reload re-reads the cert+key pair from disk and swaps the holder
|
||||||
|
// atomically on success. On failure the holder retains its previous cert
|
||||||
|
// and the error is propagated to the caller — the SIGHUP watcher logs and
|
||||||
|
// keeps serving the previous cert rather than crashing on a bad reload.
|
||||||
|
// This is deliberately "fail-safe on reload, fail-loud on startup": an
|
||||||
|
// operator rotating certs wants a recoverable error, not a restart loop.
|
||||||
|
func (h *certHolder) Reload() error {
|
||||||
|
cert, err := tls.LoadX509KeyPair(h.certPath, h.keyPath)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("reload TLS cert/key (cert=%q key=%q): %w", h.certPath, h.keyPath, err)
|
||||||
|
}
|
||||||
|
h.mu.Lock()
|
||||||
|
h.cert = &cert
|
||||||
|
h.mu.Unlock()
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// watchSIGHUP installs a signal handler that calls Reload() on each SIGHUP.
|
||||||
|
// The returned stop function closes the internal done channel and stops
|
||||||
|
// signal delivery so the goroutine can exit cleanly during shutdown. Errors
|
||||||
|
// from Reload are logged but do not terminate the watcher — the operator
|
||||||
|
// can fix the files and send another SIGHUP.
|
||||||
|
//
|
||||||
|
// Defensive design note: this deliberately does NOT panic on Reload error
|
||||||
|
// even though HTTPS is mission-critical. A rotation that writes half-files
|
||||||
|
// (operator overwrites cert.pem then key.pem as two separate copies) would
|
||||||
|
// otherwise crash the server mid-rotation. Logging + retaining the old
|
||||||
|
// cert gives the operator a bounded window to fix and re-SIGHUP.
|
||||||
|
func (h *certHolder) watchSIGHUP(logger *slog.Logger) (stop func()) {
|
||||||
|
ch := make(chan os.Signal, 1)
|
||||||
|
signal.Notify(ch, syscall.SIGHUP)
|
||||||
|
done := make(chan struct{})
|
||||||
|
go func() {
|
||||||
|
for {
|
||||||
|
select {
|
||||||
|
case <-ch:
|
||||||
|
if err := h.Reload(); err != nil {
|
||||||
|
logger.Error("TLS cert reload failed; continuing with previous cert",
|
||||||
|
"error", err,
|
||||||
|
"cert_path", h.certPath,
|
||||||
|
"key_path", h.keyPath)
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
logger.Info("TLS cert reloaded via SIGHUP",
|
||||||
|
"cert_path", h.certPath,
|
||||||
|
"key_path", h.keyPath)
|
||||||
|
case <-done:
|
||||||
|
signal.Stop(ch)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}()
|
||||||
|
return func() { close(done) }
|
||||||
|
}
|
||||||
|
|
||||||
|
// buildServerTLSConfig returns the TLS 1.3-only *tls.Config for the HTTPS
|
||||||
|
// server. Pinned per HTTPS-everywhere milestone §2.1 + §3 locked decisions:
|
||||||
|
//
|
||||||
|
// - MinVersion: TLS 1.3 (no TLS 1.2 escape hatch). Go 1.25's crypto/tls
|
||||||
|
// automatically rejects older versions.
|
||||||
|
// - CurvePreferences: explicit [X25519, P-256]. Documents the accepted
|
||||||
|
// curves; recent Go releases ignore the list order when selecting one.
|
||||||
|
// - No CipherSuites field: TLS 1.3 cipher suites are not configurable in
|
||||||
|
// Go's crypto/tls (the three built-in suites, AES-128-GCM-SHA256,
|
||||||
|
// AES-256-GCM-SHA384, CHACHA20-POLY1305-SHA256, are always enabled).
|
||||||
|
// The CipherSuites field is ignored for TLS 1.3.
|
||||||
|
// - GetCertificate: reads through the holder so SIGHUP rotations take
|
||||||
|
// effect on the next new connection without a restart. Setting
|
||||||
|
// tls.Config.Certificates directly would pin the first-loaded cert
|
||||||
|
// and defeat SIGHUP reload.
|
||||||
|
func buildServerTLSConfig(holder *certHolder) *tls.Config {
|
||||||
|
return &tls.Config{
|
||||||
|
MinVersion: tls.VersionTLS13,
|
||||||
|
CurvePreferences: []tls.CurveID{tls.X25519, tls.CurveP256},
|
||||||
|
GetCertificate: holder.GetCertificate,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// preflightServerTLS is the fail-loud startup gate for HTTPS. Returns a
|
||||||
|
// non-nil error when the TLS configuration is missing or the cert+key pair
|
||||||
|
// cannot be parsed, so the caller refuses to start the control plane
|
||||||
|
// (HTTPS-everywhere §3 locked decisions: no plaintext HTTP fallback).
|
||||||
|
//
|
||||||
|
// Duplicates the emptiness + stat + parse checks in config.Validate() for
|
||||||
|
// defense in depth, mirroring the pattern established by
|
||||||
|
// preflightSCEPChallengePassword (which itself duplicates
|
||||||
|
// config.Validate()'s SCEP check for CWE-306). Extracted into a separate
|
||||||
|
// function so the gate is unit-testable without booting the full server.
|
||||||
|
func preflightServerTLS(certPath, keyPath string) error {
|
||||||
|
if certPath == "" {
|
||||||
|
return fmt.Errorf("CERTCTL_SERVER_TLS_CERT_PATH is empty: HTTPS-only control plane refuses to start (see docs/tls.md)")
|
||||||
|
}
|
||||||
|
if keyPath == "" {
|
||||||
|
return fmt.Errorf("CERTCTL_SERVER_TLS_KEY_PATH is empty: HTTPS-only control plane refuses to start (see docs/tls.md)")
|
||||||
|
}
|
||||||
|
if _, err := os.Stat(certPath); err != nil {
|
||||||
|
return fmt.Errorf("TLS cert file %q unreadable: %w (see docs/tls.md)", certPath, err)
|
||||||
|
}
|
||||||
|
if _, err := os.Stat(keyPath); err != nil {
|
||||||
|
return fmt.Errorf("TLS key file %q unreadable: %w (see docs/tls.md)", keyPath, err)
|
||||||
|
}
|
||||||
|
if _, err := tls.LoadX509KeyPair(certPath, keyPath); err != nil {
|
||||||
|
return fmt.Errorf("TLS cert/key pair invalid (cert=%q key=%q): %w (see docs/tls.md)", certPath, keyPath, err)
|
||||||
|
}
|
||||||
|
return nil
|
||||||
|
}
|
||||||
@@ -0,0 +1,418 @@
|
|||||||
|
package main
|
||||||
|
|
||||||
|
import (
|
||||||
|
"crypto/ecdsa"
|
||||||
|
"crypto/elliptic"
|
||||||
|
"crypto/rand"
|
||||||
|
"crypto/tls"
|
||||||
|
"crypto/x509"
|
||||||
|
"crypto/x509/pkix"
|
||||||
|
"encoding/pem"
|
||||||
|
"errors"
|
||||||
|
"io"
|
||||||
|
"log/slog"
|
||||||
|
"math/big"
|
||||||
|
"net"
|
||||||
|
"os"
|
||||||
|
"path/filepath"
|
||||||
|
"sync"
|
||||||
|
"syscall"
|
||||||
|
"testing"
|
||||||
|
"time"
|
||||||
|
)
|
||||||
|
|
||||||
|
// generateTestCert writes a PEM-encoded self-signed leaf cert + ECDSA P-256
|
||||||
|
// key pair to certPath/keyPath. The subject is derived from cn so tests can
|
||||||
|
// tell reloaded certs apart from original certs by re-parsing the served
|
||||||
|
// Certificate and comparing the CN.
|
||||||
|
func generateTestCert(t *testing.T, certPath, keyPath, cn string) {
|
||||||
|
t.Helper()
|
||||||
|
priv, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("ecdsa.GenerateKey: %v", err)
|
||||||
|
}
|
||||||
|
tmpl := &x509.Certificate{
|
||||||
|
SerialNumber: big.NewInt(time.Now().UnixNano()),
|
||||||
|
Subject: pkix.Name{CommonName: cn},
|
||||||
|
NotBefore: time.Now().Add(-1 * time.Hour),
|
||||||
|
NotAfter: time.Now().Add(24 * time.Hour),
|
||||||
|
KeyUsage: x509.KeyUsageDigitalSignature,
|
||||||
|
ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
|
||||||
|
DNSNames: []string{"localhost"},
|
||||||
|
IPAddresses: []net.IP{net.ParseIP("127.0.0.1"), net.ParseIP("::1")},
|
||||||
|
}
|
||||||
|
der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &priv.PublicKey, priv)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("x509.CreateCertificate: %v", err)
|
||||||
|
}
|
||||||
|
certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
|
||||||
|
keyDER, err := x509.MarshalECPrivateKey(priv)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("MarshalECPrivateKey: %v", err)
|
||||||
|
}
|
||||||
|
keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
|
||||||
|
if err := os.WriteFile(certPath, certPEM, 0o600); err != nil {
|
||||||
|
t.Fatalf("write cert: %v", err)
|
||||||
|
}
|
||||||
|
if err := os.WriteFile(keyPath, keyPEM, 0o600); err != nil {
|
||||||
|
t.Fatalf("write key: %v", err)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// readCertCN returns the CommonName from the leaf cert currently held by the
|
||||||
|
// holder, by exercising the same GetCertificate path the tls handshake would
|
||||||
|
// take. Lets tests assert which generation of the cert is being served.
|
||||||
|
func readCertCN(t *testing.T, h *certHolder) string {
|
||||||
|
t.Helper()
|
||||||
|
c, err := h.GetCertificate(&tls.ClientHelloInfo{})
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("GetCertificate: %v", err)
|
||||||
|
}
|
||||||
|
leaf, err := x509.ParseCertificate(c.Certificate[0])
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("ParseCertificate: %v", err)
|
||||||
|
}
|
||||||
|
return leaf.Subject.CommonName
|
||||||
|
}
|
||||||
|
|
||||||
|
func silentLogger() *slog.Logger {
|
||||||
|
return slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{Level: slog.LevelError}))
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestNewCertHolder_ValidPair_LoadsCert(t *testing.T) {
|
||||||
|
dir := t.TempDir()
|
||||||
|
certPath := filepath.Join(dir, "tls.crt")
|
||||||
|
keyPath := filepath.Join(dir, "tls.key")
|
||||||
|
generateTestCert(t, certPath, keyPath, "cn-initial")
|
||||||
|
|
||||||
|
h, err := newCertHolder(certPath, keyPath)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("newCertHolder: %v", err)
|
||||||
|
}
|
||||||
|
if got := readCertCN(t, h); got != "cn-initial" {
|
||||||
|
t.Fatalf("CN mismatch: got %q want %q", got, "cn-initial")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestNewCertHolder_MissingFile_Fails(t *testing.T) {
|
||||||
|
_, err := newCertHolder("/nonexistent/cert.pem", "/nonexistent/key.pem")
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("expected error for missing files, got nil")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func TestNewCertHolder_MalformedCert_Fails(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "bad.crt")
	keyPath := filepath.Join(dir, "bad.key")
	if err := os.WriteFile(certPath, []byte("not a pem cert"), 0o600); err != nil {
		t.Fatalf("write cert: %v", err)
	}
	if err := os.WriteFile(keyPath, []byte("not a pem key"), 0o600); err != nil {
		t.Fatalf("write key: %v", err)
	}
	_, err := newCertHolder(certPath, keyPath)
	if err == nil {
		t.Fatal("expected error for malformed PEM, got nil")
	}
}

func TestCertHolder_Reload_SwapsCert(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "tls.crt")
	keyPath := filepath.Join(dir, "tls.key")
	generateTestCert(t, certPath, keyPath, "cn-v1")

	h, err := newCertHolder(certPath, keyPath)
	if err != nil {
		t.Fatalf("newCertHolder: %v", err)
	}
	if got := readCertCN(t, h); got != "cn-v1" {
		t.Fatalf("initial CN: got %q want cn-v1", got)
	}

	// Rotate on disk and reload.
	generateTestCert(t, certPath, keyPath, "cn-v2")
	if err := h.Reload(); err != nil {
		t.Fatalf("Reload: %v", err)
	}
	if got := readCertCN(t, h); got != "cn-v2" {
		t.Fatalf("post-reload CN: got %q want cn-v2", got)
	}
}

func TestCertHolder_Reload_FailureRetainsPreviousCert(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "tls.crt")
	keyPath := filepath.Join(dir, "tls.key")
	generateTestCert(t, certPath, keyPath, "cn-v1")

	h, err := newCertHolder(certPath, keyPath)
	if err != nil {
		t.Fatalf("newCertHolder: %v", err)
	}

	// Corrupt the cert file and attempt reload.
	if err := os.WriteFile(certPath, []byte("garbage"), 0o600); err != nil {
		t.Fatalf("corrupt cert: %v", err)
	}
	if err := h.Reload(); err == nil {
		t.Fatal("expected Reload error for corrupt file, got nil")
	}
	// Holder should still serve the v1 cert.
	if got := readCertCN(t, h); got != "cn-v1" {
		t.Fatalf("post-failed-reload CN: got %q want cn-v1 (reload must not clobber on failure)", got)
	}
}

func TestCertHolder_GetCertificate_Concurrent(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "tls.crt")
	keyPath := filepath.Join(dir, "tls.key")
	generateTestCert(t, certPath, keyPath, "cn-concurrent")

	h, err := newCertHolder(certPath, keyPath)
	if err != nil {
		t.Fatalf("newCertHolder: %v", err)
	}

	// 64 readers run for 300ms alongside 1 rotator (20 rotations, 10ms apart).
	// The race detector catches any unsynchronized swap of h.cert. The rotator
	// writes fresh files and calls Reload; readers call GetCertificate in a
	// tight loop.
	var wg sync.WaitGroup
	done := make(chan struct{})
	const readers = 64
	for i := 0; i < readers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-done:
					return
				default:
					if _, err := h.GetCertificate(&tls.ClientHelloInfo{}); err != nil {
						t.Errorf("GetCertificate: %v", err)
						return
					}
				}
			}
		}()
	}
	wg.Add(1)
	go func() {
		defer wg.Done()
		for i := 0; i < 20; i++ {
			generateTestCert(t, certPath, keyPath, "cn-concurrent")
			_ = h.Reload()
			time.Sleep(10 * time.Millisecond)
		}
	}()
	time.Sleep(300 * time.Millisecond)
	close(done)
	wg.Wait()
}

func TestCertHolder_WatchSIGHUP_ReloadsOnSignal(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "tls.crt")
	keyPath := filepath.Join(dir, "tls.key")
	generateTestCert(t, certPath, keyPath, "cn-before-sighup")

	h, err := newCertHolder(certPath, keyPath)
	if err != nil {
		t.Fatalf("newCertHolder: %v", err)
	}
	stop := h.watchSIGHUP(silentLogger())
	defer stop()

	// Rotate on disk, then fire SIGHUP to our own process and poll for the swap.
	generateTestCert(t, certPath, keyPath, "cn-after-sighup")
	if err := syscall.Kill(syscall.Getpid(), syscall.SIGHUP); err != nil {
		t.Fatalf("SIGHUP: %v", err)
	}
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if readCertCN(t, h) == "cn-after-sighup" {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	t.Fatalf("watcher did not reload cert within 2s (CN still %q)", readCertCN(t, h))
}

func TestCertHolder_WatchSIGHUP_StopExits(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "tls.crt")
	keyPath := filepath.Join(dir, "tls.key")
	generateTestCert(t, certPath, keyPath, "cn-stop")

	h, err := newCertHolder(certPath, keyPath)
	if err != nil {
		t.Fatalf("newCertHolder: %v", err)
	}
	stop := h.watchSIGHUP(silentLogger())

	// Closing should be synchronous and safe; a subsequent SIGHUP must not
	// cause a reload (the watcher goroutine is gone).
	stop()
	time.Sleep(50 * time.Millisecond) // let goroutine exit

	// After stop, a SIGHUP may still be delivered to the process, but the
	// watcher has called signal.Stop, so its channel no longer receives it.
	// We deliberately do NOT call stop() a second time: the goroutine has
	// already exited, and a second close of the `done` channel would panic.
	// Instead, verify that the held cert has not changed.
	if got := readCertCN(t, h); got != "cn-stop" {
		t.Fatalf("unexpected cert rotation after stop: got %q want cn-stop", got)
	}
}

func TestBuildServerTLSConfig_IsTLS13Only(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "tls.crt")
	keyPath := filepath.Join(dir, "tls.key")
	generateTestCert(t, certPath, keyPath, "cn-cfg")

	h, err := newCertHolder(certPath, keyPath)
	if err != nil {
		t.Fatalf("newCertHolder: %v", err)
	}
	cfg := buildServerTLSConfig(h)
	if cfg.MinVersion != tls.VersionTLS13 {
		t.Fatalf("MinVersion: got %#x want %#x (TLS 1.3)", cfg.MinVersion, tls.VersionTLS13)
	}
	wantCurves := []tls.CurveID{tls.X25519, tls.CurveP256}
	if len(cfg.CurvePreferences) != len(wantCurves) {
		t.Fatalf("CurvePreferences length: got %d want %d", len(cfg.CurvePreferences), len(wantCurves))
	}
	for i, c := range cfg.CurvePreferences {
		if c != wantCurves[i] {
			t.Fatalf("CurvePreferences[%d]: got %v want %v", i, c, wantCurves[i])
		}
	}
	if cfg.GetCertificate == nil {
		t.Fatal("GetCertificate: nil (holder not wired; SIGHUP reload would be broken)")
	}
	if len(cfg.Certificates) != 0 {
		t.Fatalf("Certificates: got %d want 0 (static cert would pin the first load and defeat reload)", len(cfg.Certificates))
	}
}

func TestBuildServerTLSConfig_Handshake_TLS12Rejected(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "tls.crt")
	keyPath := filepath.Join(dir, "tls.key")
	generateTestCert(t, certPath, keyPath, "cn-handshake")

	h, err := newCertHolder(certPath, keyPath)
	if err != nil {
		t.Fatalf("newCertHolder: %v", err)
	}
	serverCfg := buildServerTLSConfig(h)

	ln, err := tls.Listen("tcp", "127.0.0.1:0", serverCfg)
	if err != nil {
		t.Fatalf("tls.Listen: %v", err)
	}
	defer ln.Close()

	// Server loop: accept and immediately close (we only care about the
	// handshake outcome).
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			// Force handshake so the server-side error surfaces.
			_ = conn.(*tls.Conn).Handshake()
			conn.Close()
		}
	}()

	// TLS 1.3 client: should succeed.
	clientOK := &tls.Config{
		MinVersion:         tls.VersionTLS13,
		MaxVersion:         tls.VersionTLS13,
		InsecureSkipVerify: true,
	}
	c, err := tls.Dial("tcp", ln.Addr().String(), clientOK)
	if err != nil {
		t.Fatalf("TLS 1.3 dial failed (expected success): %v", err)
	}
	if c.ConnectionState().Version != tls.VersionTLS13 {
		t.Fatalf("negotiated version: got %#x want TLS 1.3 (%#x)", c.ConnectionState().Version, tls.VersionTLS13)
	}
	c.Close()

	// TLS 1.2 client: must be rejected at handshake.
	clientOld := &tls.Config{
		MinVersion:         tls.VersionTLS12,
		MaxVersion:         tls.VersionTLS12,
		InsecureSkipVerify: true,
	}
	if _, err := tls.Dial("tcp", ln.Addr().String(), clientOld); err == nil {
		t.Fatal("TLS 1.2 dial succeeded; HTTPS-everywhere requires server to refuse TLS 1.2")
	}
}

func TestPreflightServerTLS_MissingCertPath(t *testing.T) {
	err := preflightServerTLS("", "/any/key.pem")
	if err == nil {
		t.Fatal("expected error for empty cert path, got nil")
	}
}

func TestPreflightServerTLS_MissingKeyPath(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "tls.crt")
	keyPath := filepath.Join(dir, "tls.key")
	generateTestCert(t, certPath, keyPath, "cn-preflight")
	err := preflightServerTLS(certPath, "")
	if err == nil {
		t.Fatal("expected error for empty key path, got nil")
	}
}

func TestPreflightServerTLS_CertFileNotReadable(t *testing.T) {
	dir := t.TempDir()
	keyPath := filepath.Join(dir, "tls.key")
	if err := os.WriteFile(keyPath, []byte("k"), 0o600); err != nil {
		t.Fatal(err)
	}
	err := preflightServerTLS(filepath.Join(dir, "nope.crt"), keyPath)
	if err == nil {
		t.Fatal("expected error for unreadable cert path, got nil")
	}
	if !errors.Is(err, os.ErrNotExist) {
		t.Fatalf("expected os.ErrNotExist wrapped in error chain, got: %v", err)
	}
}

func TestPreflightServerTLS_InvalidKeyPair(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "tls.crt")
	keyPath := filepath.Join(dir, "tls.key")
	// Pair of valid cert + garbage key: files are readable but the pair
	// doesn't round-trip tls.LoadX509KeyPair.
	generateTestCert(t, certPath, keyPath, "cn-bad-pair")
	if err := os.WriteFile(keyPath, []byte("-----BEGIN EC PRIVATE KEY-----\nBAD\n-----END EC PRIVATE KEY-----\n"), 0o600); err != nil {
		t.Fatal(err)
	}
	err := preflightServerTLS(certPath, keyPath)
	if err == nil {
		t.Fatal("expected error for invalid key pair, got nil")
	}
}

func TestPreflightServerTLS_ValidPair_NoError(t *testing.T) {
	dir := t.TempDir()
	certPath := filepath.Join(dir, "tls.crt")
	keyPath := filepath.Join(dir, "tls.key")
	generateTestCert(t, certPath, keyPath, "cn-ok")
	if err := preflightServerTLS(certPath, keyPath); err != nil {
		t.Fatalf("unexpected error for valid pair: %v", err)
	}
}
@@ -55,7 +55,7 @@ A compose file defines **services** (containers), **networks** (how they talk to
|
|||||||
|
|
||||||
**Overlay files** let you layer changes. Running `docker compose -f base.yml -f overlay.yml up` merges both files. The overlay can add services, change environment variables, or mount extra volumes without editing the base.
|
**Overlay files** let you layer changes. Running `docker compose -f base.yml -f overlay.yml up` merges both files. The overlay can add services, change environment variables, or mount extra volumes without editing the base.
|
||||||
|
|
||||||
**Port mapping** (`"8443:8443"`) maps host port (left) to container port (right). After startup, `http://localhost:8443` on your machine reaches the certctl server inside its container.
|
**Port mapping** (`"8443:8443"`) maps host port (left) to container port (right). After startup, `https://localhost:8443` on your machine reaches the certctl server inside its container (HTTPS-only as of v2.2; the `certctl-tls-init` init container bootstraps a self-signed cert into `deploy/test/certs/`).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -91,11 +91,13 @@ Wait about 30 seconds, then verify:
|
|||||||
docker compose -f deploy/docker-compose.yml ps
|
docker compose -f deploy/docker-compose.yml ps
|
||||||
# All three services should show "Up (healthy)"
|
# All three services should show "Up (healthy)"
|
||||||
|
|
||||||
curl http://localhost:8443/health
|
curl --cacert ./deploy/test/certs/ca.crt https://localhost:8443/health
|
||||||
# {"status":"healthy"}
|
# {"status":"healthy"}
|
||||||
```
|
```
|
||||||
|
|
||||||
Open **http://localhost:8443** in your browser. You'll see the onboarding wizard guiding you through: connecting a CA, deploying an agent, and adding your first certificate.
|
The control plane is HTTPS-only as of v2.2. The `certctl-tls-init` init container bootstraps a self-signed cert into `deploy/test/certs/` on first boot; pin it with `--cacert` (as above) or pass `-k` for one-off smoke tests (never in production).
|
||||||
|
|
||||||
|
Open **https://localhost:8443** in your browser. You'll see the onboarding wizard guiding you through: connecting a CA, deploying an agent, and adding your first certificate. Your browser will flag the self-signed cert as untrusted — accept the warning for local evaluation, or import `deploy/test/certs/ca.crt` into your OS trust store to make the warning go away.
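
The `--cacert` pinning shown with curl can also be done from a Go program. The following is a minimal sketch, not certctl's own client code: `newPinnedTLSConfig` is a hypothetical helper name, and the bundle path and health URL are assumed to match the walkthrough above (run from the repository root with the stack already up).

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
	"time"
)

// newPinnedTLSConfig (hypothetical helper) builds a client TLS config that
// trusts only the certificates in caPEM and refuses anything below TLS 1.3.
func newPinnedTLSConfig(caPEM []byte) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no PEM certificates found in CA bundle")
	}
	return &tls.Config{RootCAs: pool, MinVersion: tls.VersionTLS13}, nil
}

func main() {
	// Assumed path: the bundle written by the certctl-tls-init container.
	caPEM, err := os.ReadFile("deploy/test/certs/ca.crt")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read CA bundle:", err)
		os.Exit(1)
	}
	cfg, err := newPinnedTLSConfig(caPEM)
	if err != nil {
		fmt.Fprintln(os.Stderr, "build TLS config:", err)
		os.Exit(1)
	}
	client := &http.Client{
		Timeout:   5 * time.Second,
		Transport: &http.Transport{TLSClientConfig: cfg},
	}
	// Assumed URL: the health endpoint used in the curl example.
	resp, err := client.Get("https://localhost:8443/health")
	if err != nil {
		fmt.Fprintln(os.Stderr, "health check:", err)
		os.Exit(1)
	}
	resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```

Because the pool contains only the self-signed bundle, a connection to any server presenting a different certificate fails verification, which is the same guarantee `--cacert` gives curl.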
|
||||||
|
|
||||||
### Service-by-service walkthrough
|
### Service-by-service walkthrough
|
||||||
|
|
||||||
@@ -307,8 +309,9 @@ docker compose -f deploy/docker-compose.test.yml up --build
|
|||||||
Wait for all health checks to pass (about 60 seconds for step-ca's first-run bootstrap). Then:
|
Wait for all health checks to pass (about 60 seconds for step-ca's first-run bootstrap). Then:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Dashboard with auth enabled
|
# Dashboard with auth enabled (HTTPS-only as of v2.2; browser will warn on the self-signed cert —
|
||||||
open http://localhost:8443
|
# accept the warning or trust `deploy/test/certs/ca.crt` in your OS keychain)
|
||||||
|
open https://localhost:8443
|
||||||
# API key: test-key-2026
|
# API key: test-key-2026
|
||||||
|
|
||||||
# NGINX serving a self-signed placeholder
|
# NGINX serving a self-signed placeholder
|
||||||
|
|||||||
@@ -4,8 +4,12 @@
|
|||||||
#
|
#
|
||||||
# Spins up the full certctl platform with real CA backends for manual QA:
|
# Spins up the full certctl platform with real CA backends for manual QA:
|
||||||
#
|
#
|
||||||
|
# 0. certctl-tls-init — one-shot init container; writes self-signed
|
||||||
|
# server.crt/.key/ca.crt into ./test/certs (bind
|
||||||
|
# mount, not a named volume — host-readable for
|
||||||
|
# the Go integration test binary)
|
||||||
# 1. PostgreSQL 16 — database (clean, no demo data)
|
# 1. PostgreSQL 16 — database (clean, no demo data)
|
||||||
# 2. certctl-server — control plane API + web dashboard on :8443
|
# 2. certctl-server — control plane API + web dashboard on :8443 (HTTPS)
|
||||||
# 3. certctl-agent — polls for work, deploys certs to NGINX
|
# 3. certctl-agent — polls for work, deploys certs to NGINX
|
||||||
# 4. step-ca — private CA (JWK provisioner, auto-bootstraps)
|
# 4. step-ca — private CA (JWK provisioner, auto-bootstraps)
|
||||||
# 5. Pebble — ACME test server (simulates Let's Encrypt)
|
# 5. Pebble — ACME test server (simulates Let's Encrypt)
|
||||||
@@ -16,15 +20,74 @@
|
|||||||
# cd deploy
|
# cd deploy
|
||||||
# docker compose -f docker-compose.test.yml up --build
|
# docker compose -f docker-compose.test.yml up --build
|
||||||
#
|
#
|
||||||
# Dashboard: http://localhost:8443
|
# Dashboard: https://localhost:8443 (self-signed — use --cacert test/certs/ca.crt)
|
||||||
# API key: test-key-2026
|
# API key: test-key-2026
|
||||||
# NGINX: https://localhost:8444 (self-signed placeholder until cert deployed)
|
# NGINX: https://localhost:8444 (self-signed placeholder until cert deployed)
|
||||||
#
|
#
|
||||||
|
# Integration tests: `go test -tags integration ./deploy/test/...` picks up
|
||||||
|
# the CA bundle at ./test/certs/ca.crt automatically via CERTCTL_TEST_CA_BUNDLE.
|
||||||
|
#
|
||||||
# See docs/test-env.md for the full walkthrough.
|
# See docs/test-env.md for the full walkthrough.
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
|
|
||||||
services:
|
services:
|
||||||
|
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# HTTPS-Everywhere Phase 6 — self-signed TLS bootstrap for the test harness.
|
||||||
|
# ---------------------------------------------------------------------------
|
||||||
|
# Mirrors the production `certctl-tls-init` (see docker-compose.yml §10-43)
|
||||||
|
# but writes into a *host bind mount* (./test/certs) instead of a named
|
||||||
|
# volume. The named-volume approach works fine inside Docker but hides the
|
||||||
|
# CA bundle from the Go integration test binary that runs on the host; the
|
||||||
|
# bind mount exposes /etc/certctl/tls/ca.crt at deploy/test/certs/ca.crt
|
||||||
|
# so `newTestClient()` can load it into an x509.CertPool and validate the
|
||||||
|
# self-signed server cert. Test-only divergence, explicitly documented.
|
||||||
|
#
|
||||||
|
# The generated cert has SAN=DNS:certctl-server,DNS:localhost,IP:127.0.0.1,IP:::1
|
||||||
|
# so both in-cluster traffic (agent → certctl-server:8443) and host traffic
|
||||||
|
# (go test → localhost:8443) validate cleanly. Destroy via
|
||||||
|
# `docker compose -f docker-compose.test.yml down -v` + `rm -rf test/certs`
|
||||||
|
# to force regeneration. Keys written 0600, certs 0644, owned 1000:1000
|
||||||
|
# (the UID the server binary runs as inside its container per Dockerfile:64).
|
||||||
|
certctl-tls-init:
|
||||||
|
image: alpine/openssl:latest
|
||||||
|
container_name: certctl-test-tls-init
|
||||||
|
restart: "no"
|
||||||
|
entrypoint: /bin/sh
|
||||||
|
command:
|
||||||
|
- -c
|
||||||
|
- |
|
||||||
|
set -eu
|
||||||
|
CERT=/etc/certctl/tls/server.crt
|
||||||
|
KEY=/etc/certctl/tls/server.key
|
||||||
|
CA=/etc/certctl/tls/ca.crt
|
||||||
|
if [ -f "$$CERT" ] && [ -f "$$KEY" ] && [ -f "$$CA" ]; then
|
||||||
|
echo "TLS cert already present at $$CERT — skipping generation"
|
||||||
|
else
|
||||||
|
mkdir -p /etc/certctl/tls
|
||||||
|
openssl req -x509 -newkey ed25519 -nodes \
|
||||||
|
-keyout "$$KEY" \
|
||||||
|
-out "$$CERT" \
|
||||||
|
-days 3650 \
|
||||||
|
-subj "/CN=certctl-server" \
|
||||||
|
-addext "subjectAltName=DNS:certctl-server,DNS:localhost,IP:127.0.0.1,IP:::1"
|
||||||
|
cp "$$CERT" "$$CA"
|
||||||
|
echo "Generated self-signed TLS cert for certctl-test-server (ed25519, 3650d, CN=certctl-server)"
|
||||||
|
fi
|
||||||
|
# The test server container runs as root (see `user: "0:0"` below)
|
||||||
|
# because setup-trust.sh needs to update the system trust store, so
|
||||||
|
# the perms here are really about host-side readability — 0644 on
|
||||||
|
# the CA/cert lets `go test` on the host read the bundle without a
|
||||||
|
# chown dance.
|
||||||
|
chown 1000:1000 "$$CERT" "$$KEY" "$$CA" || true
|
||||||
|
chmod 0644 "$$CERT" "$$CA"
|
||||||
|
chmod 0600 "$$KEY"
|
||||||
|
volumes:
|
||||||
|
- ./test/certs:/etc/certctl/tls
|
||||||
|
networks:
|
||||||
|
certctl-test:
|
||||||
|
ipv4_address: 10.30.50.9
|
||||||
|
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
# Database
|
# Database
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
@@ -168,6 +231,12 @@ services:
|
|||||||
condition: service_started
|
condition: service_started
|
||||||
step-ca:
|
step-ca:
|
||||||
condition: service_healthy
|
condition: service_healthy
|
||||||
|
# HTTPS-Everywhere Phase 6: block server boot until the init container
|
||||||
|
# has written server.crt / server.key / ca.crt into ./test/certs. The
|
||||||
|
# init container runs once and exits 0; service_completed_successfully
|
||||||
|
# makes that a gating dependency rather than a liveness one.
|
||||||
|
certctl-tls-init:
|
||||||
|
condition: service_completed_successfully
|
||||||
# Run as root so update-ca-certificates can write to /etc/ssl/certs.
|
# Run as root so update-ca-certificates can write to /etc/ssl/certs.
|
||||||
# Container isolation provides the security boundary.
|
# Container isolation provides the security boundary.
|
||||||
user: "0:0"
|
user: "0:0"
|
||||||
@@ -179,6 +248,12 @@ services:
|
|||||||
# Server
|
# Server
|
||||||
CERTCTL_SERVER_HOST: 0.0.0.0
|
CERTCTL_SERVER_HOST: 0.0.0.0
|
||||||
CERTCTL_SERVER_PORT: 8443
|
CERTCTL_SERVER_PORT: 8443
|
||||||
|
# HTTPS-Everywhere Phase 6: point the server at the init-container-generated
|
||||||
|
# cert/key pair (bind-mounted from ./test/certs). Same paths as production
|
||||||
|
# compose so the server binary code path is identical; only the host-side
|
||||||
|
# storage differs (bind mount vs named volume — see §certctl-tls-init block).
|
||||||
|
CERTCTL_SERVER_TLS_CERT_PATH: /etc/certctl/tls/server.crt
|
||||||
|
CERTCTL_SERVER_TLS_KEY_PATH: /etc/certctl/tls/server.key
|
||||||
CERTCTL_LOG_LEVEL: debug
|
CERTCTL_LOG_LEVEL: debug
|
||||||
|
|
||||||
# Auth — API key required (production-like)
|
# Auth — API key required (production-like)
|
||||||
@@ -224,12 +299,22 @@ services:
|
|||||||
- ./test/setup-trust.sh:/app/setup-trust.sh:ro
|
- ./test/setup-trust.sh:/app/setup-trust.sh:ro
|
||||||
# step-ca data volume (root cert at /certs/root_ca.crt, key at /secrets/provisioner_key)
|
# step-ca data volume (root cert at /certs/root_ca.crt, key at /secrets/provisioner_key)
|
||||||
- stepca_data:/stepca-data:ro
|
- stepca_data:/stepca-data:ro
|
||||||
|
# HTTPS-Everywhere Phase 6: read-only bind mount of the init-generated
|
||||||
|
# TLS material. The init container writes here; server reads here; the
|
||||||
|
# agent mounts the same host path at the same container path (see below)
|
||||||
|
# so /etc/certctl/tls/ca.crt resolves to the *same* bytes on both sides.
|
||||||
|
- ./test/certs:/etc/certctl/tls:ro
|
||||||
networks:
|
networks:
|
||||||
certctl-test:
|
certctl-test:
|
||||||
ipv4_address: 10.30.50.6
|
ipv4_address: 10.30.50.6
|
||||||
healthcheck:
|
healthcheck:
|
||||||
# /health requires auth when CERTCTL_AUTH_TYPE=api-key, so include the Bearer token
|
# HTTPS-Everywhere Phase 6: healthcheck now speaks TLS with --cacert to
|
||||||
test: ["CMD", "curl", "-f", "-H", "Authorization: Bearer test-key-2026", "http://localhost:8443/health"]
|
# verify the self-signed server cert against the init-generated bundle.
|
||||||
|
# /health requires auth when CERTCTL_AUTH_TYPE=api-key, so include the
|
||||||
|
# Bearer token. curl exits non-zero on both TLS handshake failure and
|
||||||
|
# non-2xx status — either failure keeps depends_on: {condition:
|
||||||
|
# service_healthy} from unblocking the agent, which is what we want.
|
||||||
|
test: ["CMD", "curl", "--cacert", "/etc/certctl/tls/ca.crt", "-f", "-H", "Authorization: Bearer test-key-2026", "https://localhost:8443/health"]
|
||||||
interval: 10s
|
interval: 10s
|
||||||
timeout: 5s
|
timeout: 5s
|
||||||
start_period: 30s
|
start_period: 30s
|
||||||
@@ -290,7 +375,13 @@ services:
|
|||||||
certctl-server:
|
certctl-server:
|
||||||
condition: service_healthy
|
condition: service_healthy
|
||||||
environment:
|
environment:
|
||||||
CERTCTL_SERVER_URL: http://certctl-server:8443
|
# HTTPS-Everywhere Phase 6: agent dials the server over TLS and validates
|
||||||
|
# the self-signed cert against the CA bundle pinned by
|
||||||
|
# CERTCTL_SERVER_CA_BUNDLE_PATH. Same env vars + container paths as
|
||||||
|
# production compose so the agent binary code path (loadCABundle →
|
||||||
|
# x509.CertPool → *tls.Config{RootCAs, MinVersion: TLS13}) is identical.
|
||||||
|
CERTCTL_SERVER_URL: https://certctl-server:8443
|
||||||
|
CERTCTL_SERVER_CA_BUNDLE_PATH: /etc/certctl/tls/ca.crt
|
||||||
CERTCTL_API_KEY: test-key-2026
|
CERTCTL_API_KEY: test-key-2026
|
||||||
CERTCTL_AGENT_NAME: test-agent-01
|
CERTCTL_AGENT_NAME: test-agent-01
|
||||||
CERTCTL_AGENT_ID: agent-test-01
|
CERTCTL_AGENT_ID: agent-test-01
|
||||||
@@ -300,6 +391,10 @@ services:
|
|||||||
volumes:
|
volumes:
|
||||||
- agent_keys:/var/lib/certctl/keys
|
- agent_keys:/var/lib/certctl/keys
|
||||||
- nginx_certs:/nginx-certs
|
- nginx_certs:/nginx-certs
|
||||||
|
# HTTPS-Everywhere Phase 6: same bind mount as the server, same path,
|
||||||
|
# so /etc/certctl/tls/ca.crt resolves to the identical bytes. This is
|
||||||
|
# the only way the CN=certctl-server cert validates on the agent side.
|
||||||
|
- ./test/certs:/etc/certctl/tls:ro
|
||||||
networks:
|
networks:
|
||||||
certctl-test:
|
certctl-test:
|
||||||
ipv4_address: 10.30.50.8
|
ipv4_address: 10.30.50.8
|
||||||
|
|||||||
@@ -1,4 +1,47 @@
|
|||||||
services:
|
services:
|
||||||
|
# HTTPS-Everywhere Phase 3 — self-signed TLS bootstrap (init container).
|
||||||
|
# Generates a CN=certctl-server ed25519 cert with the SAN list locked by
|
||||||
|
# milestone §3.6 on first boot; subsequent boots see the cert already
|
||||||
|
# present in the `certs` named volume and no-op out. Server + agent mount
|
||||||
|
# the volume read-only. Destroy via `docker compose down -v` to force
|
||||||
|
# regeneration. This bootstrap is for docker-compose demos and local dev
|
||||||
|
# only; Helm operators supply a Secret / cert-manager Certificate per
|
||||||
|
# docs/tls.md.
|
||||||
|
certctl-tls-init:
|
||||||
|
image: alpine/openssl:latest
|
||||||
|
container_name: certctl-tls-init
|
||||||
|
restart: "no"
|
||||||
|
entrypoint: /bin/sh
|
||||||
|
command:
|
||||||
|
- -c
|
||||||
|
- |
|
||||||
|
set -eu
|
||||||
|
CERT=/etc/certctl/tls/server.crt
|
||||||
|
KEY=/etc/certctl/tls/server.key
|
||||||
|
CA=/etc/certctl/tls/ca.crt
|
||||||
|
if [ -f "$$CERT" ] && [ -f "$$KEY" ] && [ -f "$$CA" ]; then
|
||||||
|
echo "TLS cert already present at $$CERT — skipping generation"
|
||||||
|
else
|
||||||
|
mkdir -p /etc/certctl/tls
|
||||||
|
openssl req -x509 -newkey ed25519 -nodes \
|
||||||
|
-keyout "$$KEY" \
|
||||||
|
-out "$$CERT" \
|
||||||
|
-days 3650 \
|
||||||
|
-subj "/CN=certctl-server" \
|
||||||
|
-addext "subjectAltName=DNS:certctl-server,DNS:localhost,IP:127.0.0.1,IP:::1"
|
||||||
|
cp "$$CERT" "$$CA"
|
||||||
|
echo "Generated self-signed TLS cert for certctl-server (ed25519, 3650d, CN=certctl-server)"
|
||||||
|
fi
|
||||||
|
# certctl binary runs as UID 1000 inside the server container per
|
||||||
|
# Dockerfile:64-65; the cert + key must be readable by that UID.
|
||||||
|
chown 1000:1000 "$$CERT" "$$KEY" "$$CA"
|
||||||
|
chmod 0644 "$$CERT" "$$CA"
|
||||||
|
chmod 0600 "$$KEY"
|
||||||
|
volumes:
|
||||||
|
- certs:/etc/certctl/tls
|
||||||
|
networks:
|
||||||
|
- certctl-network
|
||||||
|
|
||||||
# PostgreSQL database
|
# PostgreSQL database
|
||||||
postgres:
|
postgres:
|
||||||
image: postgres:16-alpine
|
image: postgres:16-alpine
|
||||||
@@ -50,10 +93,14 @@ services:
|
|||||||
depends_on:
|
depends_on:
|
||||||
postgres:
|
postgres:
|
||||||
condition: service_healthy
|
condition: service_healthy
|
||||||
|
certctl-tls-init:
|
||||||
|
condition: service_completed_successfully
|
||||||
environment:
|
environment:
|
||||||
CERTCTL_DATABASE_URL: postgres://certctl:${POSTGRES_PASSWORD:-certctl}@postgres:5432/certctl?sslmode=disable
|
CERTCTL_DATABASE_URL: postgres://certctl:${POSTGRES_PASSWORD:-certctl}@postgres:5432/certctl?sslmode=disable
|
||||||
CERTCTL_SERVER_HOST: 0.0.0.0
|
CERTCTL_SERVER_HOST: 0.0.0.0
|
||||||
CERTCTL_SERVER_PORT: 8443
|
CERTCTL_SERVER_PORT: 8443
|
||||||
|
CERTCTL_SERVER_TLS_CERT_PATH: /etc/certctl/tls/server.crt
|
||||||
|
CERTCTL_SERVER_TLS_KEY_PATH: /etc/certctl/tls/server.key
|
||||||
CERTCTL_LOG_LEVEL: info
|
CERTCTL_LOG_LEVEL: info
|
||||||
CERTCTL_AUTH_TYPE: none
|
CERTCTL_AUTH_TYPE: none
|
||||||
CERTCTL_KEYGEN_MODE: server # Demo uses server-side keygen; production should use "agent"
|
CERTCTL_KEYGEN_MODE: server # Demo uses server-side keygen; production should use "agent"
|
||||||
@@ -61,10 +108,12 @@ services:
|
|||||||
CERTCTL_CONFIG_ENCRYPTION_KEY: ${CERTCTL_CONFIG_ENCRYPTION_KEY:-change-me-32-char-encryption-key} # AES-256-GCM for dynamic issuer/target config
|
CERTCTL_CONFIG_ENCRYPTION_KEY: ${CERTCTL_CONFIG_ENCRYPTION_KEY:-change-me-32-char-encryption-key} # AES-256-GCM for dynamic issuer/target config
|
||||||
ports:
|
ports:
|
||||||
- "8443:8443"
|
- "8443:8443"
|
||||||
|
volumes:
|
||||||
|
- certs:/etc/certctl/tls:ro
|
||||||
networks:
|
networks:
|
||||||
- certctl-network
|
- certctl-network
|
||||||
healthcheck:
|
healthcheck:
|
||||||
test: ["CMD", "curl", "-f", "http://localhost:8443/health"]
|
test: ["CMD", "curl", "--cacert", "/etc/certctl/tls/ca.crt", "-f", "https://localhost:8443/health"]
|
||||||
interval: 10s
|
interval: 10s
|
||||||
timeout: 5s
|
timeout: 5s
|
||||||
retries: 5
|
retries: 5
|
||||||
@@ -99,13 +148,15 @@ services:
|
|||||||
certctl-server:
|
certctl-server:
|
||||||
condition: service_healthy
|
condition: service_healthy
|
||||||
environment:
|
environment:
|
||||||
CERTCTL_SERVER_URL: http://certctl-server:8443
|
CERTCTL_SERVER_URL: https://certctl-server:8443
|
||||||
|
CERTCTL_SERVER_CA_BUNDLE_PATH: /etc/certctl/tls/ca.crt
|
||||||
CERTCTL_API_KEY: ${CERTCTL_API_KEY:-change-me-in-production}
|
CERTCTL_API_KEY: ${CERTCTL_API_KEY:-change-me-in-production}
|
||||||
CERTCTL_AGENT_NAME: docker-agent
|
CERTCTL_AGENT_NAME: docker-agent
|
||||||
CERTCTL_LOG_LEVEL: info
|
CERTCTL_LOG_LEVEL: info
|
||||||
CERTCTL_DISCOVERY_DIRS: /var/lib/certctl/keys # Agent scans this directory for existing certificates
|
CERTCTL_DISCOVERY_DIRS: /var/lib/certctl/keys # Agent scans this directory for existing certificates
|
||||||
volumes:
|
volumes:
|
||||||
- agent_keys:/var/lib/certctl/keys
|
- agent_keys:/var/lib/certctl/keys
|
||||||
|
- certs:/etc/certctl/tls:ro
|
||||||
networks:
|
networks:
|
||||||
- certctl-network
|
- certctl-network
|
||||||
healthcheck:
|
healthcheck:
|
||||||
@@ -134,3 +185,5 @@ volumes:
|
|||||||
driver: local
|
driver: local
|
||||||
agent_keys:
|
agent_keys:
|
||||||
driver: local
|
driver: local
|
||||||
|
certs:
|
||||||
|
driver: local
|
||||||
|
|||||||
@@ -236,10 +236,12 @@ kubectl get svc -l app.kubernetes.io/instance=certctl
|
|||||||
kubectl get ingress
|
kubectl get ingress
|
||||||
kubectl describe ingress certctl
|
kubectl describe ingress certctl
|
||||||
|
|
||||||
# Test API connectivity
|
# Test API connectivity (HTTPS-only as of v2.2)
|
||||||
POD=$(kubectl get pods -l app.kubernetes.io/component=server -o jsonpath='{.items[0].metadata.name}')
|
POD=$(kubectl get pods -l app.kubernetes.io/component=server -o jsonpath='{.items[0].metadata.name}')
|
||||||
kubectl port-forward $POD 8443:8443 &
|
kubectl port-forward $POD 8443:8443 &
|
||||||
curl -H "Authorization: Bearer $API_KEY" http://localhost:8443/health
|
# If the chart provisioned a self-signed cert, fetch the CA bundle from the TLS secret first:
|
||||||
|
# kubectl get secret certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/certctl-ca.crt
|
||||||
|
curl --cacert /tmp/certctl-ca.crt -H "Authorization: Bearer $API_KEY" https://localhost:8443/health
|
||||||
```
|
```
|
||||||
|
|
||||||
### Step 6: Access the Dashboard
|
### Step 6: Access the Dashboard
|
||||||
@@ -333,9 +335,10 @@ kubectl logs $POD | tail -20
|
|||||||
# Port forward to API
|
# Port forward to API
|
||||||
kubectl port-forward svc/certctl-server 8443:8443 &
|
kubectl port-forward svc/certctl-server 8443:8443 &
|
||||||
|
|
||||||
# Create a test certificate
|
# Create a test certificate (HTTPS-only as of v2.2 — pin the chart-provisioned CA bundle)
|
||||||
|
# kubectl get secret certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/certctl-ca.crt
|
||||||
API_KEY="your-api-key"
|
API_KEY="your-api-key"
|
||||||
curl -X POST http://localhost:8443/api/v1/certificates \
|
curl --cacert /tmp/certctl-ca.crt -X POST https://localhost:8443/api/v1/certificates \
|
||||||
-H "Authorization: Bearer $API_KEY" \
|
-H "Authorization: Bearer $API_KEY" \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
|
|||||||
@@ -33,9 +33,11 @@ kubectl get pods -l app.kubernetes.io/instance=certctl
|
|||||||
# View server logs
|
# View server logs
|
||||||
kubectl logs -l app.kubernetes.io/component=server -f
|
kubectl logs -l app.kubernetes.io/component=server -f
|
||||||
|
|
||||||
# Access the API
|
# Access the API (HTTPS-only as of v2.2; use --cacert or -k depending on your cert provisioning)
|
||||||
kubectl port-forward svc/certctl-server 8443:8443 &
|
kubectl port-forward svc/certctl-server 8443:8443 &
|
||||||
curl http://localhost:8443/health
|
# If the chart provisioned a self-signed cert, fetch the CA bundle from the secret first:
|
||||||
|
# kubectl get secret certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/certctl-ca.crt
|
||||||
|
curl --cacert /tmp/certctl-ca.crt https://localhost:8443/health
|
||||||
```
|
```
|
||||||
|
|
||||||
## Next Steps
|
## Next Steps
|
||||||
|
|||||||
@@ -4,36 +4,46 @@
|
|||||||
{{- else if contains "NodePort" .Values.server.service.type }}
|
{{- else if contains "NodePort" .Values.server.service.type }}
|
||||||
export NODE_IP=$(kubectl get nodes --namespace {{ .Release.Namespace }} -o jsonpath="{.items[0].status.addresses[0].address}")
|
export NODE_IP=$(kubectl get nodes --namespace {{ .Release.Namespace }} -o jsonpath="{.items[0].status.addresses[0].address}")
|
||||||
export NODE_PORT=$(kubectl get --namespace {{ .Release.Namespace }} -o jsonpath="{.spec.ports[0].nodePort}" services {{ include "certctl.fullname" . }}-server)
|
export NODE_PORT=$(kubectl get --namespace {{ .Release.Namespace }} -o jsonpath="{.spec.ports[0].nodePort}" services {{ include "certctl.fullname" . }}-server)
|
||||||
echo http://$NODE_IP:$NODE_PORT
|
echo https://$NODE_IP:$NODE_PORT
|
||||||
{{- else if contains "LoadBalancer" .Values.server.service.type }}
|
{{- else if contains "LoadBalancer" .Values.server.service.type }}
|
||||||
export SERVICE_IP=$(kubectl get svc --namespace {{ .Release.Namespace }} {{ include "certctl.fullname" . }}-server -o jsonpath="{.status.loadBalancer.ingress[0].ip}")
|
export SERVICE_IP=$(kubectl get svc --namespace {{ .Release.Namespace }} {{ include "certctl.fullname" . }}-server -o jsonpath="{.status.loadBalancer.ingress[0].ip}")
|
||||||
echo http://$SERVICE_IP:{{ .Values.server.service.port }}
|
echo https://$SERVICE_IP:{{ .Values.server.service.port }}
|
||||||
{{- else }}
|
{{- else }}
|
||||||
export POD_NAME=$(kubectl get pods --namespace {{ .Release.Namespace }} -l "app.kubernetes.io/name={{ include "certctl.name" . }},app.kubernetes.io/instance={{ .Release.Name }},app.kubernetes.io/component=server" -o jsonpath="{.items[0].metadata.name}")
|
export POD_NAME=$(kubectl get pods --namespace {{ .Release.Namespace }} -l "app.kubernetes.io/name={{ include "certctl.name" . }},app.kubernetes.io/instance={{ .Release.Name }},app.kubernetes.io/component=server" -o jsonpath="{.items[0].metadata.name}")
|
||||||
export CONTAINER_PORT=$(kubectl get pod --namespace {{ .Release.Namespace }} $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
|
export CONTAINER_PORT=$(kubectl get pod --namespace {{ .Release.Namespace }} $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
|
||||||
echo "Visit http://127.0.0.1:8080 to use your application"
|
echo "Visit https://127.0.0.1:8443 to use your application"
|
||||||
kubectl --namespace {{ .Release.Namespace }} port-forward $POD_NAME 8080:$CONTAINER_PORT
|
kubectl --namespace {{ .Release.Namespace }} port-forward $POD_NAME 8443:$CONTAINER_PORT
|
||||||
{{- end }}
|
{{- end }}
|
||||||
|
|
||||||
2. Get the default API key:
|
2. Talk to the HTTPS-only server from your workstation:
|
||||||
|
# Export the CA bundle that signed the server cert (self-signed or cert-manager-issued)
|
||||||
|
kubectl get secret --namespace {{ .Release.Namespace }} {{ include "certctl.tls.secretName" . }} \
|
||||||
|
-o jsonpath='{.data.ca\.crt}' | base64 --decode > /tmp/certctl-ca.crt
|
||||||
|
# (If ca.crt is empty, fall back to tls.crt — typical when the Secret
|
||||||
|
# was created from a self-signed bootstrap cert without a separate CA.)
|
||||||
|
|
||||||
|
# Adapt the URL below to match the Server URL printed in step 1.
|
||||||
|
curl --cacert /tmp/certctl-ca.crt https://127.0.0.1:8443/health
|
||||||
|
|
||||||
|
3. Get the default API key:
|
||||||
kubectl get secret --namespace {{ .Release.Namespace }} {{ include "certctl.fullname" . }}-server -o jsonpath="{.data.api-key}" | base64 --decode; echo
|
kubectl get secret --namespace {{ .Release.Namespace }} {{ include "certctl.fullname" . }}-server -o jsonpath="{.data.api-key}" | base64 --decode; echo
|
||||||
|
|
||||||
3. Get PostgreSQL connection details:
|
4. Get PostgreSQL connection details:
|
||||||
Host: {{ include "certctl.fullname" . }}-postgres.{{ .Release.Namespace }}.svc.cluster.local
|
Host: {{ include "certctl.fullname" . }}-postgres.{{ .Release.Namespace }}.svc.cluster.local
|
||||||
Port: 5432
|
Port: 5432
|
||||||
Database: {{ .Values.postgresql.auth.database }}
|
Database: {{ .Values.postgresql.auth.database }}
|
||||||
Username: {{ .Values.postgresql.auth.username }}
|
Username: {{ .Values.postgresql.auth.username }}
|
||||||
Password: $(kubectl get secret --namespace {{ .Release.Namespace }} {{ include "certctl.fullname" . }}-postgres -o jsonpath="{.data.password}" | base64 --decode)
|
Password: $(kubectl get secret --namespace {{ .Release.Namespace }} {{ include "certctl.fullname" . }}-postgres -o jsonpath="{.data.password}" | base64 --decode)
|
||||||
|
|
||||||
4. Check deployment status:
|
5. Check deployment status:
|
||||||
kubectl get pods -n {{ .Release.Namespace }} -l app.kubernetes.io/instance={{ .Release.Name }}
|
kubectl get pods -n {{ .Release.Namespace }} -l app.kubernetes.io/instance={{ .Release.Name }}
|
||||||
|
|
||||||
5. View server logs:
|
6. View server logs:
|
||||||
kubectl logs -n {{ .Release.Namespace }} -l app.kubernetes.io/name={{ include "certctl.name" . }},app.kubernetes.io/component=server -f
|
kubectl logs -n {{ .Release.Namespace }} -l app.kubernetes.io/name={{ include "certctl.name" . }},app.kubernetes.io/component=server -f
|
||||||
|
|
||||||
{{- if .Values.agent.enabled }}
|
{{- if .Values.agent.enabled }}
|
||||||
|
|
||||||
6. View agent logs:
|
7. View agent logs:
|
||||||
kubectl logs -n {{ .Release.Namespace }} -l app.kubernetes.io/name={{ include "certctl.name" . }},app.kubernetes.io/component=agent -f
|
kubectl logs -n {{ .Release.Namespace }} -l app.kubernetes.io/name={{ include "certctl.name" . }},app.kubernetes.io/component=agent -f
|
||||||
|
|
||||||
{{- end }}
|
{{- end }}
|
||||||
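The `ca.crt`-then-`tls.crt` fallback described in step 2 of the notes can be sketched in Go as follows. This is only an illustration: `pickCABundle` and the map-shaped Secret data are hypothetical names, not part of certctl or the chart.

```go
package main

import (
	"errors"
	"fmt"
)

// pickCABundle returns the PEM bytes a client should trust for the server:
// ca.crt when the Secret populates it, otherwise tls.crt (the self-signed
// bootstrap case, where the leaf certificate is its own trust anchor).
// The map stands in for the decoded .data of the TLS Secret (assumption).
func pickCABundle(data map[string][]byte) ([]byte, error) {
	if ca := data["ca.crt"]; len(ca) > 0 {
		return ca, nil
	}
	if crt := data["tls.crt"]; len(crt) > 0 {
		return crt, nil
	}
	return nil, errors.New("secret has neither ca.crt nor tls.crt")
}

func main() {
	// Hypothetical Secret with an empty ca.crt, as a self-signed bootstrap might produce.
	secret := map[string][]byte{
		"ca.crt":  {},
		"tls.crt": []byte("-----BEGIN CERTIFICATE-----\n..."),
	}
	pem, err := pickCABundle(secret)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("using %d bytes of PEM as the trust bundle\n", len(pem))
}
```

Returning an error when both keys are empty keeps the failure visible instead of handing an empty file to `curl --cacert`.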
@@ -58,11 +68,7 @@ IMPORTANT NOTES FOR PRODUCTION:
|
|||||||
- Use an external PostgreSQL managed service (AWS RDS, Cloud SQL, etc.)
|
- Use an external PostgreSQL managed service (AWS RDS, Cloud SQL, etc.)
|
||||||
- Set postgresql.enabled=false and configure CERTCTL_DATABASE_URL in values
|
- Set postgresql.enabled=false and configure CERTCTL_DATABASE_URL in values
|
||||||
|
|
||||||
5. Enable HTTPS/TLS using an Ingress with certificate management:
|
5. Review security contexts and network policies:
|
||||||
- Configure cert-manager for automatic TLS certificate renewal
|
|
||||||
- Update ingress values with your domain and certificate issuer
|
|
||||||
|
|
||||||
6. Review security contexts and network policies:
|
|
||||||
- All containers run as non-root
|
- All containers run as non-root
|
||||||
- Implement network policies to restrict traffic between components
|
- Implement network policies to restrict traffic between components
|
||||||
- Consider pod security policies or security standards for your cluster
|
- Consider pod security policies or security standards for your cluster
|
||||||
|
|||||||
@@ -118,8 +118,54 @@ postgres://{{ .Values.postgresql.auth.username }}:$(POSTGRES_PASSWORD)@{{ includ
|
|||||||
{{- end }}
|
{{- end }}
|
||||||
|
|
||||||
{{/*
|
{{/*
|
||||||
Server URL (for agents)
|
Server URL (for agents). HTTPS-only as of v2.2 — see docs/tls.md.
|
||||||
*/}}
|
*/}}
|
||||||
{{- define "certctl.serverURL" -}}
|
{{- define "certctl.serverURL" -}}
|
||||||
http://{{ include "certctl.fullname" . }}-server:{{ .Values.server.service.port }}
|
https://{{ include "certctl.fullname" . }}-server:{{ .Values.server.service.port }}
|
||||||
|
{{- end }}
|
||||||
|
|
||||||
|
{{/*
|
||||||
|
TLS Secret name resolver.
|
||||||
|
|
||||||
|
Operator-facing precedence:
|
||||||
|
1. server.tls.existingSecret — operator points at a pre-existing kubernetes.io/tls Secret
|
||||||
|
2. server.tls.certManager.secretName — explicit secret name for the cert-manager Certificate CR
|
||||||
|
3. "<fullname>-tls" — default when cert-manager is enabled but secretName is blank
|
||||||
|
|
||||||
|
Never emits an empty string — that case is already excluded by certctl.tls.required below,
|
||||||
|
which must be invoked by any template that depends on the resolved secret name.
|
||||||
|
*/}}
|
||||||
|
{{- define "certctl.tls.secretName" -}}
|
||||||
|
{{- if .Values.server.tls.existingSecret -}}
|
||||||
|
{{- .Values.server.tls.existingSecret -}}
|
||||||
|
{{- else if .Values.server.tls.certManager.secretName -}}
|
||||||
|
{{- .Values.server.tls.certManager.secretName -}}
|
||||||
|
{{- else -}}
|
||||||
|
{{- printf "%s-tls" (include "certctl.fullname" .) -}}
|
||||||
|
{{- end -}}
|
||||||
|
{{- end }}
|
||||||
|
|
||||||
|
{{/*
|
||||||
|
TLS configuration gate.
|
||||||
|
|
||||||
|
HTTPS is the only supported listener mode (v2.2+). The server refuses to start
|
||||||
|
without a cert/key pair mounted at server.tls.mountPath, so `helm template` /
|
||||||
|
`helm install` must fail loudly at render-time rather than shipping a broken
|
||||||
|
Deployment that crash-loops with "tls config required".
|
||||||
|
|
||||||
|
Operators MUST configure EXACTLY ONE of:
|
||||||
|
(a) server.tls.existingSecret: <name-of-kubernetes.io/tls-secret>
|
||||||
|
(b) server.tls.certManager.enabled: true (+ issuerRef.name populated)
|
||||||
|
|
||||||
|
Any template that mounts the TLS Secret must call
|
||||||
|
`{{ include "certctl.tls.required" . }}` at the top so this guard runs once
|
||||||
|
per affected resource. No-op when configured correctly.
|
||||||
|
*/}}
|
||||||
|
{{- define "certctl.tls.required" -}}
|
||||||
|
{{- if and (not .Values.server.tls.existingSecret) (not .Values.server.tls.certManager.enabled) -}}
|
||||||
|
{{- fail "\n\ncertctl refuses to start without TLS.\n\nSet EXACTLY ONE of:\n --set server.tls.existingSecret=<your-kubernetes.io/tls-secret-name>\nOR\n --set server.tls.certManager.enabled=true \\\n --set server.tls.certManager.issuerRef.name=<your-issuer-or-clusterissuer>\n\nSee docs/tls.md for the full setup walkthrough, including bootstrap\nguidance for air-gapped clusters without cert-manager.\n" -}}
|
||||||
|
{{- end -}}
|
||||||
|
{{- if and .Values.server.tls.certManager.enabled (not .Values.server.tls.certManager.issuerRef.name) -}}
|
||||||
|
{{- fail "\n\nserver.tls.certManager.enabled=true but server.tls.certManager.issuerRef.name is empty.\n\nSet:\n --set server.tls.certManager.issuerRef.name=<your-issuer-or-clusterissuer>\n\nSee docs/tls.md.\n" -}}
|
||||||
|
{{- end -}}
|
||||||
{{- end }}
|
{{- end }}
|
||||||
|
|||||||
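For readers less familiar with Helm template syntax, the two helpers above can be modelled in plain Go. This is a sketch under assumptions: `tlsValues`, `validateTLS` and `resolveSecretName` are hypothetical names that only mirror the conditions visible in the diff. One thing the model makes easy to see is that the guard, as written, rejects the "neither source set" case and the "cert-manager without issuer" case, but does not reject setting both sources; in that situation `existingSecret` wins the name resolution.

```go
package main

import (
	"errors"
	"fmt"
)

// tlsValues is a hypothetical stand-in for the chart's server.tls values.
type tlsValues struct {
	ExistingSecret     string // server.tls.existingSecret
	CertManagerEnabled bool   // server.tls.certManager.enabled
	CertManagerSecret  string // server.tls.certManager.secretName
	IssuerName         string // server.tls.certManager.issuerRef.name
}

// validateTLS mirrors the two fail conditions in certctl.tls.required.
func validateTLS(v tlsValues) error {
	if v.ExistingSecret == "" && !v.CertManagerEnabled {
		return errors.New("no TLS source configured")
	}
	if v.CertManagerEnabled && v.IssuerName == "" {
		return errors.New("certManager.enabled=true but issuerRef.name is empty")
	}
	return nil
}

// resolveSecretName mirrors the precedence in certctl.tls.secretName.
func resolveSecretName(v tlsValues, fullname string) string {
	switch {
	case v.ExistingSecret != "":
		return v.ExistingSecret
	case v.CertManagerSecret != "":
		return v.CertManagerSecret
	default:
		return fullname + "-tls"
	}
}

func main() {
	v := tlsValues{CertManagerEnabled: true, IssuerName: "letsencrypt-prod"}
	if err := validateTLS(v); err != nil {
		fmt.Println("render would fail:", err)
		return
	}
	fmt.Println("resolved secret:", resolveSecretName(v, "certctl"))
}
```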
@@ -1,4 +1,5 @@
|
|||||||
{{- if .Values.agent.enabled }}
|
{{- if .Values.agent.enabled }}
|
||||||
|
{{- include "certctl.tls.required" . }}
|
||||||
{{- if eq .Values.agent.kind "DaemonSet" }}
|
{{- if eq .Values.agent.kind "DaemonSet" }}
|
||||||
apiVersion: apps/v1
|
apiVersion: apps/v1
|
||||||
kind: DaemonSet
|
kind: DaemonSet
|
||||||
@@ -53,6 +54,8 @@ spec:
|
|||||||
fieldPath: metadata.name
|
fieldPath: metadata.name
|
||||||
- name: CERTCTL_KEY_DIR
|
- name: CERTCTL_KEY_DIR
|
||||||
value: {{ .Values.agent.keyDir }}
|
value: {{ .Values.agent.keyDir }}
|
||||||
|
- name: CERTCTL_SERVER_CA_BUNDLE_PATH
|
||||||
|
value: "{{ .Values.server.tls.mountPath }}/ca.crt"
|
||||||
{{- if .Values.agent.discoveryDirs }}
|
{{- if .Values.agent.discoveryDirs }}
|
||||||
- name: CERTCTL_DISCOVERY_DIRS
|
- name: CERTCTL_DISCOVERY_DIRS
|
||||||
valueFrom:
|
valueFrom:
|
||||||
@@ -70,12 +73,19 @@ spec:
|
|||||||
mountPath: {{ .Values.agent.keyDir }}
|
mountPath: {{ .Values.agent.keyDir }}
|
||||||
- name: tmp
|
- name: tmp
|
||||||
mountPath: /tmp
|
mountPath: /tmp
|
||||||
|
- name: server-tls
|
||||||
|
mountPath: {{ .Values.server.tls.mountPath }}
|
||||||
|
readOnly: true
|
||||||
volumes:
|
volumes:
|
||||||
- name: agent-keys
|
- name: agent-keys
|
||||||
emptyDir:
|
emptyDir:
|
||||||
sizeLimit: 1Gi
|
sizeLimit: 1Gi
|
||||||
- name: tmp
|
- name: tmp
|
||||||
emptyDir: {}
|
emptyDir: {}
|
||||||
|
- name: server-tls
|
||||||
|
secret:
|
||||||
|
secretName: {{ include "certctl.tls.secretName" . }}
|
||||||
|
defaultMode: 0400
|
||||||
{{- else if eq .Values.agent.kind "Deployment" }}
|
{{- else if eq .Values.agent.kind "Deployment" }}
|
||||||
apiVersion: apps/v1
|
apiVersion: apps/v1
|
||||||
kind: Deployment
|
kind: Deployment
|
||||||
@@ -135,6 +145,8 @@ spec:
|
|||||||
{{- end }}
|
{{- end }}
|
||||||
- name: CERTCTL_KEY_DIR
|
- name: CERTCTL_KEY_DIR
|
||||||
value: {{ .Values.agent.keyDir }}
|
value: {{ .Values.agent.keyDir }}
|
||||||
|
- name: CERTCTL_SERVER_CA_BUNDLE_PATH
|
||||||
|
value: "{{ .Values.server.tls.mountPath }}/ca.crt"
|
||||||
{{- if .Values.agent.discoveryDirs }}
|
{{- if .Values.agent.discoveryDirs }}
|
||||||
- name: CERTCTL_DISCOVERY_DIRS
|
- name: CERTCTL_DISCOVERY_DIRS
|
||||||
valueFrom:
|
valueFrom:
|
||||||
@@ -152,11 +164,18 @@ spec:
|
|||||||
mountPath: {{ .Values.agent.keyDir }}
|
mountPath: {{ .Values.agent.keyDir }}
|
||||||
- name: tmp
|
- name: tmp
|
||||||
mountPath: /tmp
|
mountPath: /tmp
|
||||||
|
- name: server-tls
|
||||||
|
mountPath: {{ .Values.server.tls.mountPath }}
|
||||||
|
readOnly: true
|
||||||
volumes:
|
volumes:
|
||||||
- name: agent-keys
|
- name: agent-keys
|
||||||
emptyDir:
|
emptyDir:
|
||||||
sizeLimit: 1Gi
|
sizeLimit: 1Gi
|
||||||
- name: tmp
|
- name: tmp
|
||||||
emptyDir: {}
|
emptyDir: {}
|
||||||
|
- name: server-tls
|
||||||
|
secret:
|
||||||
|
secretName: {{ include "certctl.tls.secretName" . }}
|
||||||
|
defaultMode: 0400
|
||||||
{{- end }}
|
{{- end }}
|
||||||
{{- end }}
|
{{- end }}
|
||||||
|
|||||||
@@ -1,14 +1,24 @@
|
|||||||
{{- if .Values.ingress.enabled }}
|
{{- if .Values.ingress.enabled }}
|
||||||
|
{{- if and .Values.ingress.certManager.enabled (not .Values.ingress.certManager.issuerRef.name) -}}
|
||||||
|
{{- fail "\n\ningress.certManager.enabled=true but ingress.certManager.issuerRef.name is empty.\n\nSet:\n --set ingress.certManager.issuerRef.name=<your-issuer-or-clusterissuer>\n\nThis is separate from server.tls.certManager — it issues the external-facing\nIngress cert, not the in-cluster server TLS cert. See docs/tls.md.\n" -}}
|
||||||
|
{{- end -}}
|
||||||
apiVersion: networking.k8s.io/v1
|
apiVersion: networking.k8s.io/v1
|
||||||
kind: Ingress
|
kind: Ingress
|
||||||
metadata:
|
metadata:
|
||||||
name: {{ include "certctl.fullname" . }}
|
name: {{ include "certctl.fullname" . }}
|
||||||
labels:
|
labels:
|
||||||
{{- include "certctl.labels" . | nindent 4 }}
|
{{- include "certctl.labels" . | nindent 4 }}
|
||||||
{{- with .Values.ingress.annotations }}
|
|
||||||
annotations:
|
annotations:
|
||||||
|
{{- if .Values.ingress.certManager.enabled }}
|
||||||
|
{{- if eq .Values.ingress.certManager.issuerRef.kind "ClusterIssuer" }}
|
||||||
|
cert-manager.io/cluster-issuer: {{ .Values.ingress.certManager.issuerRef.name | quote }}
|
||||||
|
{{- else }}
|
||||||
|
cert-manager.io/issuer: {{ .Values.ingress.certManager.issuerRef.name | quote }}
|
||||||
|
{{- end }}
|
||||||
|
{{- end }}
|
||||||
|
{{- with .Values.ingress.annotations }}
|
||||||
{{- toYaml . | nindent 4 }}
|
{{- toYaml . | nindent 4 }}
|
||||||
{{- end }}
|
{{- end }}
|
||||||
spec:
|
spec:
|
||||||
{{- if .Values.ingress.className }}
|
{{- if .Values.ingress.className }}
|
||||||
ingressClassName: {{ .Values.ingress.className }}
|
ingressClassName: {{ .Values.ingress.className }}
|
||||||
@@ -33,7 +43,7 @@ spec:
|
|||||||
pathType: {{ .pathType }}
|
pathType: {{ .pathType }}
|
||||||
backend:
|
backend:
|
||||||
service:
|
service:
|
||||||
name: {{ include "certctl.fullname" . }}-server
|
name: {{ include "certctl.fullname" $ }}-server
|
||||||
port:
|
port:
|
||||||
number: {{ $.Values.server.service.port }}
|
number: {{ $.Values.server.service.port }}
|
||||||
{{- end }}
|
{{- end }}
|
||||||
|
|||||||
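The annotation branch in the Ingress template picks between cert-manager's two issuer annotations based on `issuerRef.kind`. A minimal Go model of that choice, with `issuerAnnotation` as a hypothetical helper name, looks like this:

```go
package main

import "fmt"

// issuerAnnotation returns the cert-manager annotation key and value for an
// Ingress: the cluster-scoped key for a ClusterIssuer, the namespaced key for
// anything else (the template treats every non-ClusterIssuer kind as Issuer).
func issuerAnnotation(kind, name string) (key, value string) {
	if kind == "ClusterIssuer" {
		return "cert-manager.io/cluster-issuer", name
	}
	return "cert-manager.io/issuer", name
}

func main() {
	k, v := issuerAnnotation("ClusterIssuer", "letsencrypt-prod")
	fmt.Printf("%s: %q\n", k, v)
}
```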
@@ -0,0 +1,31 @@
|
|||||||
|
{{- if .Values.server.tls.certManager.enabled }}
|
||||||
|
{{- include "certctl.tls.required" . }}
|
||||||
|
apiVersion: cert-manager.io/v1
|
||||||
|
kind: Certificate
|
||||||
|
metadata:
|
||||||
|
name: {{ include "certctl.fullname" . }}-server-tls
|
||||||
|
labels:
|
||||||
|
{{- include "certctl.labels" . | nindent 4 }}
|
||||||
|
app.kubernetes.io/component: server
|
||||||
|
spec:
|
||||||
|
secretName: {{ include "certctl.tls.secretName" . }}
|
||||||
|
commonName: {{ .Values.server.tls.certManager.commonName | quote }}
|
||||||
|
dnsNames:
|
||||||
|
{{- range .Values.server.tls.certManager.dnsNames }}
|
||||||
|
- {{ . | quote }}
|
||||||
|
{{- end }}
|
||||||
|
duration: {{ .Values.server.tls.certManager.duration }}
|
||||||
|
renewBefore: {{ .Values.server.tls.certManager.renewBefore }}
|
||||||
|
usages:
|
||||||
|
- server auth
|
||||||
|
- digital signature
|
||||||
|
- key encipherment
|
||||||
|
privateKey:
|
||||||
|
algorithm: ECDSA
|
||||||
|
size: 256
|
||||||
|
rotationPolicy: Always
|
||||||
|
issuerRef:
|
||||||
|
name: {{ .Values.server.tls.certManager.issuerRef.name | quote }}
|
||||||
|
kind: {{ .Values.server.tls.certManager.issuerRef.kind }}
|
||||||
|
group: {{ .Values.server.tls.certManager.issuerRef.group }}
|
||||||
|
{{- end }}
|
||||||
@@ -1,3 +1,4 @@
|
|||||||
|
{{- include "certctl.tls.required" . }}
|
||||||
apiVersion: apps/v1
|
apiVersion: apps/v1
|
||||||
kind: Deployment
|
kind: Deployment
|
||||||
metadata:
|
metadata:
|
||||||
@@ -32,7 +33,7 @@ spec:
|
|||||||
image: {{ include "certctl.serverImage" . }}
|
image: {{ include "certctl.serverImage" . }}
|
||||||
imagePullPolicy: {{ .Values.server.image.pullPolicy }}
|
imagePullPolicy: {{ .Values.server.image.pullPolicy }}
|
||||||
ports:
|
ports:
|
||||||
- name: http
|
- name: https
|
||||||
containerPort: {{ .Values.server.port }}
|
containerPort: {{ .Values.server.port }}
|
||||||
protocol: TCP
|
protocol: TCP
|
||||||
env:
|
env:
|
||||||
@@ -40,6 +41,10 @@ spec:
|
|||||||
value: "0.0.0.0"
|
value: "0.0.0.0"
|
||||||
- name: CERTCTL_SERVER_PORT
|
- name: CERTCTL_SERVER_PORT
|
||||||
value: "{{ .Values.server.port }}"
|
value: "{{ .Values.server.port }}"
|
||||||
|
- name: CERTCTL_SERVER_TLS_CERT_PATH
|
||||||
|
value: "{{ .Values.server.tls.mountPath }}/tls.crt"
|
||||||
|
- name: CERTCTL_SERVER_TLS_KEY_PATH
|
||||||
|
value: "{{ .Values.server.tls.mountPath }}/tls.key"
|
||||||
- name: CERTCTL_DATABASE_URL
|
- name: CERTCTL_DATABASE_URL
|
||||||
valueFrom:
|
valueFrom:
|
||||||
secretKeyRef:
|
secretKeyRef:
|
||||||
@@ -172,12 +177,19 @@ spec:
|
|||||||
volumeMounts:
|
volumeMounts:
|
||||||
- name: tmp
|
- name: tmp
|
||||||
mountPath: /tmp
|
mountPath: /tmp
|
||||||
|
- name: tls
|
||||||
|
mountPath: {{ .Values.server.tls.mountPath }}
|
||||||
|
readOnly: true
|
||||||
{{- if .Values.server.volumeMounts }}
|
{{- if .Values.server.volumeMounts }}
|
||||||
{{- toYaml .Values.server.volumeMounts | nindent 12 }}
|
{{- toYaml .Values.server.volumeMounts | nindent 12 }}
|
||||||
{{- end }}
|
{{- end }}
|
||||||
volumes:
|
volumes:
|
||||||
- name: tmp
|
- name: tmp
|
||||||
emptyDir: {}
|
emptyDir: {}
|
||||||
|
- name: tls
|
||||||
|
secret:
|
||||||
|
secretName: {{ include "certctl.tls.secretName" . }}
|
||||||
|
defaultMode: 0400
|
||||||
{{- if .Values.server.volumes }}
|
{{- if .Values.server.volumes }}
|
||||||
{{- toYaml .Values.server.volumes | nindent 8 }}
|
{{- toYaml .Values.server.volumes | nindent 8 }}
|
||||||
{{- end }}
|
{{- end }}
|
||||||
|
|||||||
@@ -13,8 +13,8 @@ spec:
|
|||||||
type: {{ .Values.server.service.type }}
|
type: {{ .Values.server.service.type }}
|
||||||
ports:
|
ports:
|
||||||
- port: {{ .Values.server.service.port }}
|
- port: {{ .Values.server.service.port }}
|
||||||
targetPort: http
|
targetPort: https
|
||||||
protocol: TCP
|
protocol: TCP
|
||||||
name: http
|
name: https
|
||||||
selector:
|
selector:
|
||||||
{{- include "certctl.serverSelectorLabels" . | nindent 4 }}
|
{{- include "certctl.serverSelectorLabels" . | nindent 4 }}
|
||||||
|
|||||||
@@ -48,11 +48,12 @@ server:
|
|||||||
drop:
|
drop:
|
||||||
- ALL
|
- ALL
|
||||||
|
|
||||||
# Liveness and readiness probes
|
# Liveness and readiness probes (HTTPS-only as of v2.2)
|
||||||
livenessProbe:
|
livenessProbe:
|
||||||
httpGet:
|
httpGet:
|
||||||
path: /health
|
path: /health
|
||||||
port: http
|
port: https
|
||||||
|
scheme: HTTPS
|
||||||
initialDelaySeconds: 10
|
initialDelaySeconds: 10
|
||||||
periodSeconds: 10
|
periodSeconds: 10
|
||||||
timeoutSeconds: 5
|
timeoutSeconds: 5
|
||||||
@@ -61,12 +62,50 @@ server:
|
|||||||
readinessProbe:
|
readinessProbe:
|
||||||
httpGet:
|
httpGet:
|
||||||
path: /readyz
|
path: /readyz
|
||||||
port: http
|
port: https
|
||||||
|
scheme: HTTPS
|
||||||
initialDelaySeconds: 5
|
initialDelaySeconds: 5
|
||||||
periodSeconds: 5
|
periodSeconds: 5
|
||||||
timeoutSeconds: 3
|
timeoutSeconds: 3
|
||||||
failureThreshold: 2
|
failureThreshold: 2
|
||||||
|
|
||||||
|
# TLS configuration — REQUIRED. HTTPS is the only supported mode (v2.2+).
|
||||||
|
# Operator must configure EXACTLY ONE of:
|
||||||
|
# (a) server.tls.existingSecret: <name> # pre-existing kubernetes.io/tls Secret
|
||||||
|
# (b) server.tls.certManager.enabled: true # provision a cert-manager Certificate CR
|
||||||
|
# Leaving both unset makes `helm template` fail with a diagnostic pointing at docs/tls.md.
|
||||||
|
tls:
|
||||||
|
# Name of a pre-existing Secret (type kubernetes.io/tls) holding tls.crt + tls.key (+ optional ca.crt).
|
||||||
|
# Leave empty to fall through to the cert-manager path.
|
||||||
|
existingSecret: ""
|
||||||
|
|
||||||
|
# Mount path for the TLS Secret inside the server + agent containers.
|
||||||
|
mountPath: /etc/certctl/tls
|
||||||
|
|
||||||
|
# cert-manager auto-provisioning. Opt-in (off by default per milestone §3.4).
|
||||||
|
certManager:
|
||||||
|
enabled: false
|
||||||
|
|
||||||
|
# Secret name the cert-manager Certificate CR writes into. Agents and the server
|
||||||
|
# both read from this Secret. If empty, defaults to "<fullname>-tls".
|
||||||
|
secretName: ""
|
||||||
|
|
||||||
|
# Cert-manager issuer reference.
|
||||||
|
issuerRef:
|
||||||
|
name: "" # e.g. "letsencrypt-prod" or "internal-ca"
|
||||||
|
kind: ClusterIssuer # ClusterIssuer or Issuer
|
||||||
|
group: cert-manager.io
|
||||||
|
|
||||||
|
# Subject fields on the issued cert.
|
||||||
|
commonName: "certctl-server"
|
||||||
|
dnsNames:
|
||||||
|
- certctl-server
|
||||||
|
- localhost
|
||||||
|
|
||||||
|
# Certificate lifetime + renewal window.
|
||||||
|
duration: 2160h # 90 days
|
||||||
|
renewBefore: 360h # 15 days
|
||||||
|
|
||||||
# Service type (ClusterIP, LoadBalancer, NodePort)
|
# Service type (ClusterIP, LoadBalancer, NodePort)
|
||||||
service:
|
service:
|
||||||
type: ClusterIP
|
type: ClusterIP
|
||||||
@@ -356,7 +395,16 @@ ingress:
|
|||||||
className: ""
|
className: ""
|
||||||
annotations: {}
|
annotations: {}
|
||||||
# kubernetes.io/ingress.class: nginx
|
# kubernetes.io/ingress.class: nginx
|
||||||
# cert-manager.io/cluster-issuer: letsencrypt-prod
|
|
||||||
|
# Optional cert-manager integration for the public-facing Ingress cert.
|
||||||
|
# This is completely independent of server.tls.* — the Ingress terminates
|
||||||
|
# an *additional* TLS hop between the internet and the in-cluster Service.
|
||||||
|
# Leave disabled unless an Ingress is exposing certctl to the outside world.
|
||||||
|
certManager:
|
||||||
|
enabled: false
|
||||||
|
issuerRef:
|
||||||
|
name: "" # e.g. "letsencrypt-prod"
|
||||||
|
kind: ClusterIssuer # ClusterIssuer or Issuer
|
||||||
hosts:
|
hosts:
|
||||||
- host: certctl.local
|
- host: certctl.local
|
||||||
paths:
|
paths:
|
||||||
|
|||||||
@@ -47,11 +47,30 @@ func envOr(key, fallback string) string {
|
|||||||
return fallback
|
return fallback
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// HTTPS-Everywhere Phase 6: the test harness now dials the server over TLS and
|
||||||
|
// validates the self-signed cert against the init-container-generated CA bundle
|
||||||
|
// bind-mounted at ./test/certs/ca.crt. The defaults assume the compose setup in
|
||||||
|
// deploy/docker-compose.test.yml; override via the usual env vars when pointing
|
||||||
|
// the suite at a different deployment.
|
||||||
|
//
|
||||||
|
// - CERTCTL_TEST_SERVER_URL — must be https:// for the Phase 6 wiring
|
||||||
|
// - CERTCTL_TEST_CA_BUNDLE — PEM bundle; must contain the server's issuing
|
||||||
|
// CA (self-signed in the compose setup, so server.crt doubles as ca.crt)
|
||||||
|
// - CERTCTL_TEST_INSECURE — set to "true" to fall back to
|
||||||
|
// InsecureSkipVerify when the CA bundle path is unavailable (CI smoke or
|
||||||
|
// exploratory runs only — CI-parity runs MUST use the pinned bundle).
|
||||||
|
//
|
||||||
|
// Under no circumstance does the suite silently downgrade to plaintext HTTP:
|
||||||
|
// Phase 5 (#203) pre-flight guards in cmd/server will refuse to start with an
|
||||||
|
// http:// URL anyway, so a misconfiguration fails loud at test-harness startup
|
||||||
|
// rather than flaking mid-suite.
|
||||||
var (
|
var (
|
||||||
serverURL = envOr("CERTCTL_TEST_SERVER_URL", "http://localhost:8443")
|
serverURL = envOr("CERTCTL_TEST_SERVER_URL", "https://localhost:8443")
|
||||||
apiKey = envOr("CERTCTL_TEST_API_KEY", "test-key-2026")
|
apiKey = envOr("CERTCTL_TEST_API_KEY", "test-key-2026")
|
||||||
dbURL = envOr("CERTCTL_TEST_DB_URL", "postgres://certctl:testpass@localhost:5432/certctl?sslmode=disable")
|
dbURL = envOr("CERTCTL_TEST_DB_URL", "postgres://certctl:testpass@localhost:5432/certctl?sslmode=disable")
|
||||||
nginxTLS = envOr("CERTCTL_TEST_NGINX_TLS", "localhost:8444")
|
nginxTLS = envOr("CERTCTL_TEST_NGINX_TLS", "localhost:8444")
|
||||||
|
caBundlePath = envOr("CERTCTL_TEST_CA_BUNDLE", "./certs/ca.crt")
|
||||||
|
insecureTLS = strings.EqualFold(os.Getenv("CERTCTL_TEST_INSECURE"), "true")
|
||||||
)
|
)
|
||||||
|
|
||||||
// ---------------------------------------------------------------------------
|
// ---------------------------------------------------------------------------
|
||||||
@@ -75,16 +94,74 @@ type testClient struct {
|
|||||||
apiKey string
|
apiKey string
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// buildTLSConfig wires up the x509.CertPool with the self-signed CA bundle
|
||||||
|
// emitted by the certctl-tls-init container. Panics (t.Fatal is unavailable
|
||||||
|
// here) if CERTCTL_TEST_CA_BUNDLE is unreadable *and* CERTCTL_TEST_INSECURE
|
||||||
|
// is not set — that combination is almost always a misconfigured test harness
|
||||||
|
// and silently downgrading to InsecureSkipVerify would hide real failures.
|
||||||
|
//
|
||||||
|
// MinVersion is pinned to TLS 1.3 so this matches what cmd/server negotiates
|
||||||
|
// by default; a drift there would surface here first.
|
||||||
|
func buildTLSConfig() *tls.Config {
|
||||||
|
cfg := &tls.Config{
|
||||||
|
MinVersion: tls.VersionTLS13,
|
||||||
|
}
|
||||||
|
if insecureTLS {
|
||||||
|
// Opt-in smoke-run mode; log but don't fail so operators running
|
||||||
|
// `CERTCTL_TEST_INSECURE=true go test -tags integration ./deploy/test/...`
|
||||||
|
// against an ad-hoc environment still get a green suite when the server
|
||||||
|
// is reachable. CI must not set this.
|
||||||
|
cfg.InsecureSkipVerify = true
|
||||||
|
return cfg
|
||||||
|
}
|
||||||
|
pem, err := os.ReadFile(caBundlePath)
|
||||||
|
if err != nil {
|
||||||
|
// Can't use t.Fatal here (called from package-level helpers); fall
|
||||||
|
// back to a panic so the harness dies loud at the first HTTP call.
|
||||||
|
// Operators see a clear "CA bundle missing" message and fix their
|
||||||
|
// setup instead of chasing a confusing TLS handshake error.
|
||||||
|
panic(fmt.Sprintf("integration test: read CA bundle %q: %v — "+
|
||||||
|
"run `docker compose -f deploy/docker-compose.test.yml up` first, or "+
|
||||||
|
"set CERTCTL_TEST_CA_BUNDLE to a valid PEM path, or "+
|
||||||
|
"set CERTCTL_TEST_INSECURE=true for a smoke run", caBundlePath, err))
|
||||||
|
}
|
||||||
|
pool := x509.NewCertPool()
|
||||||
|
if !pool.AppendCertsFromPEM(pem) {
|
||||||
|
panic(fmt.Sprintf("integration test: no PEM certificates parsed from %q", caBundlePath))
|
||||||
|
}
|
||||||
|
cfg.RootCAs = pool
|
||||||
|
return cfg
|
||||||
|
}
|
||||||
|
|
||||||
|
// newTestClient builds a Bearer-authenticated HTTPS client pinned to the
|
||||||
|
// init-container CA. Every phase uses this for REST calls.
|
||||||
func newTestClient() *testClient {
|
func newTestClient() *testClient {
|
||||||
return &testClient{
|
return &testClient{
|
||||||
http: &http.Client{
|
http: &http.Client{
|
||||||
Timeout: 30 * time.Second,
|
Timeout: 30 * time.Second,
|
||||||
|
Transport: &http.Transport{
|
||||||
|
TLSClientConfig: buildTLSConfig(),
|
||||||
|
},
|
||||||
},
|
},
|
||||||
baseURL: serverURL,
|
baseURL: serverURL,
|
||||||
apiKey: apiKey,
|
apiKey: apiKey,
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// newUnauthHTTPClient returns an *http.Client with the same TLS configuration
|
||||||
|
// but no Bearer token. Used for the Phase 7 RFC 5280 CRL / RFC 8615
|
||||||
|
// `/.well-known/pki/*` probes — those endpoints must be reachable by
|
||||||
|
// *unauthenticated* relying parties per M-006, so we explicitly omit the
|
||||||
|
// Authorization header to prove it.
|
||||||
|
func newUnauthHTTPClient() *http.Client {
|
||||||
|
return &http.Client{
|
||||||
|
Timeout: 30 * time.Second,
|
||||||
|
Transport: &http.Transport{
|
||||||
|
TLSClientConfig: buildTLSConfig(),
|
||||||
|
},
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
func (c *testClient) do(method, path string, body io.Reader) (*http.Response, error) {
|
func (c *testClient) do(method, path string, body io.Reader) (*http.Response, error) {
|
||||||
url := c.baseURL + path
|
url := c.baseURL + path
|
||||||
req, err := http.NewRequest(method, url, body)
|
req, err := http.NewRequest(method, url, body)
|
||||||
@@ -724,11 +801,18 @@ func TestIntegrationSuite(t *testing.T) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// Check DER CRL served unauthenticated under /.well-known/pki/ per
|
// Check DER CRL served unauthenticated under /.well-known/pki/ per
|
||||||
// RFC 5280 §5 + RFC 8615 (M-006). Use a plain http.Get — no Bearer
|
// RFC 5280 §5 + RFC 8615 (M-006). Use newUnauthHTTPClient() — no
|
||||||
// token — to prove the endpoint is reachable by relying parties that
|
// Bearer token — to prove the endpoint is reachable by relying
|
||||||
// have no certctl API credentials.
|
// parties that have no certctl API credentials. Post HTTPS-Everywhere
|
||||||
|
// (M-007, Phase 6) the client still speaks TLS 1.3 against the pinned
|
||||||
|
// CA bundle from ./certs/ca.crt; we just skip the Authorization header
|
||||||
|
// to exercise the unauthenticated RFC 5280 / RFC 8615 relying-party
|
||||||
|
// path. Switching from the stdlib http.DefaultClient (plaintext OK,
|
||||||
|
// system trust store only) to the helper keeps the no-auth semantic
|
||||||
|
// while preventing silent plaintext downgrade — the whole point of
|
||||||
|
// this milestone.
|
||||||
t.Run("CRL_DER_Unauthenticated", func(t *testing.T) {
|
t.Run("CRL_DER_Unauthenticated", func(t *testing.T) {
|
||||||
resp, err := http.Get(serverURL + "/.well-known/pki/crl/iss-local")
|
resp, err := newUnauthHTTPClient().Get(serverURL + "/.well-known/pki/crl/iss-local")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("GET DER CRL: %v", err)
|
t.Fatalf("GET DER CRL: %v", err)
|
||||||
}
|
}
|
||||||
@@ -1141,4 +1225,243 @@ func TestIntegrationSuite(t *testing.T) {
|
|||||||
}
|
}
|
||||||
})
|
})
|
||||||
})
|
})
|
||||||
|
|
||||||
|
// -----------------------------------------------------------------------
|
||||||
|
// Phase 13: I-005 Phase 1 Red — Notification Retry + Dead Letter Queue (E2E)
|
||||||
|
//
|
||||||
|
// Pins the full retry-loop contract end-to-end. Phase 2 Green must turn
|
||||||
|
// every subtest Green with a single coherent change set (migration 000016
|
||||||
|
// live, scheduler notificationRetryLoop wired as the 11th loop bumping
|
||||||
|
// the total from 10 → 11, service RetryFailedNotifications + MarkAsDead +
|
||||||
|
// RequeueNotification implemented, handler POST
|
||||||
|
// /api/v1/notifications/{id}/requeue routed, list handler parsing the
|
||||||
|
// status query param).
|
||||||
|
//
|
||||||
|
// Subtests:
|
||||||
|
//
|
||||||
|
// 1. MarkAsDead_OnMaxAttempts — a notification seeded at retry_count=4
|
||||||
|
// (one failure shy of the max_attempts=5 gate) with next_retry_at in
|
||||||
|
// the past is promoted to status='dead' on the first retry-loop
|
||||||
|
// tick. The pre-increment arithmetic `retry_count + 1 = 5 =
|
||||||
|
// max_attempts` triggers MarkAsDead instead of scheduling another
|
||||||
|
// retry.
|
||||||
|
//
|
||||||
|
// 2. Requeue_FlipsDeadToPending — POST
|
||||||
|
// /api/v1/notifications/{id}/requeue on a dead row flips status back
|
||||||
|
// to 'pending', resets retry_count to 0, and clears next_retry_at
|
||||||
|
// so the existing ProcessPendingNotifications loop (not the retry
|
||||||
|
// sweep) picks it up on its next tick.
|
||||||
|
//
|
||||||
|
// 3. ListFilter_StatusDead — GET /api/v1/notifications?status=dead
|
||||||
|
// returns only rows in status='dead' so the UI's Dead Letter tab
|
||||||
|
// (web/src/pages/NotificationsPage.test.tsx subtest #1) can isolate
|
||||||
|
// them without client-side filtering.
|
||||||
|
//
|
||||||
|
// Red behavior at HEAD (what Phase 2 Green must flip):
|
||||||
|
//
|
||||||
|
// * Schema: the INSERTs reference retry_count, next_retry_at,
|
||||||
|
// last_error. Migration 000016 is already written (file (a) of
|
||||||
|
// Phase 1 Red) but until it is applied the INSERTs fail with
|
||||||
|
// "column does not exist" — schema-level Red halt.
|
||||||
|
//
|
||||||
|
// * Subtest 1: no retry loop exists at HEAD. The seeded row stays at
|
||||||
|
// status='failed' retry_count=4 forever. The 4-minute waitFor
|
||||||
|
// therefore times out.
|
||||||
|
//
|
||||||
|
// * Subtest 2: /notifications/{id}/requeue is not routed at HEAD
|
||||||
|
// (internal/api/handler/notifications.go registers only list / get /
|
||||||
|
// mark-read). The POST returns 404.
|
||||||
|
//
|
||||||
|
// * Subtest 3: the list handler does not parse the status query param
|
||||||
|
// at HEAD. The response includes rows of every status, so the
|
||||||
|
// "leaked non-dead row" assertion fires.
|
||||||
|
// -----------------------------------------------------------------------
|
||||||
|
t.Run("Phase13_NotificationRetryDLQ", func(t *testing.T) {
|
||||||
|
// Unreachable endpoint so every webhook delivery attempt fails
|
||||||
|
// deterministically — port 1 is never bound. Pinning retry_count=4
|
||||||
|
// + a guaranteed-failing channel is what turns the seeded row into
|
||||||
|
// 'dead' on the very next scheduler tick (one delivery attempt,
|
||||||
|
// retry_count 4→5, crosses max_attempts=5 → MarkAsDead).
|
||||||
|
const blackHole = "http://127.0.0.1:1/i005-red-black-hole"
|
||||||
|
|
||||||
|
// ---------------------------------------------------------------
|
||||||
|
// Subtest 1: failed → dead transition after one retry-loop tick
|
||||||
|
// ---------------------------------------------------------------
|
||||||
|
t.Run("MarkAsDead_OnMaxAttempts", func(t *testing.T) {
|
||||||
|
id := fmt.Sprintf("notif-i005-dead-%d", time.Now().UnixNano())
|
||||||
|
|
||||||
|
// retry_count=4 + next attempt = 5 = max_attempts → MarkAsDead.
|
||||||
|
// next_retry_at is backdated so the row is immediately eligible
|
||||||
|
// for the retry sweep rather than having to wait for its own
|
||||||
|
// backoff to elapse.
|
||||||
|
past := time.Now().Add(-30 * time.Second).UTC()
|
||||||
|
db.Exec(t, `
|
||||||
|
INSERT INTO notification_events
|
||||||
|
(id, type, channel, recipient, message, status,
|
||||||
|
retry_count, next_retry_at, last_error)
|
||||||
|
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
|
||||||
|
`,
|
||||||
|
id, "ExpirationWarning", "Webhook", blackHole,
|
||||||
|
"I-005 integration: DLQ promotion on max_attempts",
|
||||||
|
"failed", 4, past, "transient webhook 500",
|
||||||
|
)
|
||||||
|
|
||||||
|
// Give the retry sweep up to 4m to tick at least once (default
|
||||||
|
// 2m interval + seed/sweep/notifier slop). On success the row
|
||||||
|
// carries status='dead' and retry_count has advanced to 5.
|
||||||
|
waitFor(t, "notification transitions to dead", 4*time.Minute, 5*time.Second,
|
||||||
|
func() (bool, error) {
|
||||||
|
var status string
|
||||||
|
var retry int
|
||||||
|
err := db.db.QueryRow(
|
||||||
|
"SELECT status, retry_count FROM notification_events WHERE id = $1",
|
||||||
|
id,
|
||||||
|
).Scan(&status, &retry)
|
||||||
|
if err != nil {
|
||||||
|
return false, err
|
||||||
|
}
|
||||||
|
return strings.EqualFold(status, "dead") && retry >= 5, nil
|
||||||
|
})
|
||||||
|
|
||||||
|
// The dead-letter tab is only useful if operators can see why
|
||||||
|
// the row died. MarkAsDead must preserve the most recent
|
||||||
|
// failure string in last_error rather than nil'ing it.
|
||||||
|
var lastErr sql.NullString
|
||||||
|
if err := db.db.QueryRow(
|
||||||
|
"SELECT last_error FROM notification_events WHERE id = $1", id,
|
||||||
|
).Scan(&lastErr); err != nil {
|
||||||
|
t.Fatalf("read last_error: %v", err)
|
||||||
|
}
|
||||||
|
if !lastErr.Valid || lastErr.String == "" {
|
||||||
|
t.Errorf("dead notification %s has empty last_error — "+
|
||||||
|
"retry loop must preserve the most recent failure", id)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
|
||||||
|
// ---------------------------------------------------------------
|
||||||
|
// Subtest 2: dead → pending via manual Requeue endpoint
|
||||||
|
// ---------------------------------------------------------------
|
||||||
|
t.Run("Requeue_FlipsDeadToPending", func(t *testing.T) {
|
||||||
|
id := fmt.Sprintf("notif-i005-requeue-%d", time.Now().UnixNano())
|
||||||
|
|
||||||
|
// Seed directly at status='dead' rather than waiting for a
|
||||||
|
// scheduler tick — this subtest isolates the requeue handler,
|
||||||
|
// not the retry loop (subtest 1 already pins that).
|
||||||
|
past := time.Now().Add(-10 * time.Minute).UTC()
|
||||||
|
db.Exec(t, `
|
||||||
|
INSERT INTO notification_events
|
||||||
|
(id, type, channel, recipient, message, status,
|
||||||
|
retry_count, next_retry_at, last_error)
|
||||||
|
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
|
||||||
|
`,
|
||||||
|
id, "ExpirationWarning", "Webhook", blackHole,
|
||||||
|
"I-005 integration: manual requeue",
|
||||||
|
"dead", 5, past, "max attempts reached",
|
||||||
|
)
|
||||||
|
|
||||||
|
resp, err := c.Post("/api/v1/notifications/"+id+"/requeue", "")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("POST requeue: %v", err)
|
||||||
|
}
|
||||||
|
body := readBody(resp)
|
||||||
|
if resp.StatusCode != http.StatusOK {
|
||||||
|
t.Fatalf("requeue status %d, want 200 (body: %s)",
|
||||||
|
resp.StatusCode, body)
|
||||||
|
}
|
||||||
|
// Phase 2 Green handler responds with {"status":"requeued"}
|
||||||
|
// to mirror MarkAsRead's {"status":"marked_as_read"} envelope.
|
||||||
|
if !strings.Contains(body, "requeued") {
|
||||||
|
t.Errorf("requeue body missing 'requeued' marker: %s", body)
|
||||||
|
}
|
||||||
|
|
||||||
|
// DB must reflect the full flip: pending status, reset counter,
|
||||||
|
// cleared next_retry_at. Clearing next_retry_at is what moves
|
||||||
|
// the row out of the retry-sweep partial index and back under
|
||||||
|
// ProcessPendingNotifications.
|
||||||
|
var status string
|
||||||
|
var retry int
|
||||||
|
var nextRetry sql.NullTime
|
||||||
|
if err := db.db.QueryRow(`
|
||||||
|
SELECT status, retry_count, next_retry_at
|
||||||
|
FROM notification_events WHERE id = $1
|
||||||
|
`, id).Scan(&status, &retry, &nextRetry); err != nil {
|
||||||
|
t.Fatalf("read requeued row: %v", err)
|
||||||
|
}
|
||||||
|
if !strings.EqualFold(status, "pending") {
|
||||||
|
t.Errorf("after requeue: status=%q, want 'pending'", status)
|
||||||
|
}
|
||||||
|
if retry != 0 {
|
||||||
|
t.Errorf("after requeue: retry_count=%d, want 0", retry)
|
||||||
|
}
|
||||||
|
if nextRetry.Valid {
|
||||||
|
t.Errorf("after requeue: next_retry_at=%v, want NULL",
|
||||||
|
nextRetry.Time)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
|
||||||
|
// ---------------------------------------------------------------
|
||||||
|
// Subtest 3: GET /notifications?status=dead isolates DLQ rows
|
||||||
|
// ---------------------------------------------------------------
|
||||||
|
t.Run("ListFilter_StatusDead", func(t *testing.T) {
|
||||||
|
suffix := fmt.Sprintf("%d", time.Now().UnixNano())
|
||||||
|
deadID := "notif-i005-filter-dead-" + suffix
|
||||||
|
pendingID := "notif-i005-filter-pending-" + suffix
|
||||||
|
|
||||||
|
// One row at each end of the lifecycle so we can prove the
|
||||||
|
// filter both matches and excludes.
|
||||||
|
db.Exec(t, `
|
||||||
|
INSERT INTO notification_events
|
||||||
|
(id, type, channel, recipient, message, status, retry_count)
|
||||||
|
VALUES ($1, 'ExpirationWarning', 'Webhook', $2,
|
||||||
|
'I-005 filter test: dead row', 'dead', 5)
|
||||||
|
`, deadID, blackHole)
|
||||||
|
db.Exec(t, `
|
||||||
|
INSERT INTO notification_events
|
||||||
|
(id, type, channel, recipient, message, status, retry_count)
|
||||||
|
VALUES ($1, 'ExpirationWarning', 'Webhook', $2,
|
||||||
|
'I-005 filter test: pending row', 'pending', 0)
|
||||||
|
`, pendingID, blackHole)
|
||||||
|
|
||||||
|
// per_page large enough to rule out pagination artifacts as
|
||||||
|
// the reason a seeded row might be missing from the response.
|
||||||
|
resp, err := c.Get("/api/v1/notifications?status=dead&per_page=500")
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("GET notifications?status=dead: %v", err)
|
||||||
|
}
|
||||||
|
var pr pagedResponse
|
||||||
|
if err := decodeJSON(resp, &pr); err != nil {
|
||||||
|
t.Fatalf("decode: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
type row struct {
|
||||||
|
ID string `json:"id"`
|
||||||
|
Status string `json:"status"`
|
||||||
|
}
|
||||||
|
var rows []row
|
||||||
|
if err := json.Unmarshal(pr.Data, &rows); err != nil {
|
||||||
|
t.Fatalf("unmarshal rows: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
var sawDead, sawPending bool
|
||||||
|
for _, r := range rows {
|
||||||
|
if r.ID == deadID {
|
||||||
|
sawDead = true
|
||||||
|
}
|
||||||
|
if r.ID == pendingID {
|
||||||
|
sawPending = true
|
||||||
|
}
|
||||||
|
if !strings.EqualFold(r.Status, "dead") {
|
||||||
|
t.Errorf("status=dead filter leaked non-dead row: "+
|
||||||
|
"id=%s status=%s", r.ID, r.Status)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if !sawDead {
|
||||||
|
t.Errorf("status=dead filter missed seeded dead row %s", deadID)
|
||||||
|
}
|
||||||
|
if sawPending {
|
||||||
|
t.Errorf("status=dead filter leaked seeded pending row %s",
|
||||||
|
pendingID)
|
||||||
|
}
|
||||||
|
})
|
||||||
|
})
|
||||||
}
|
}
|
||||||
|
|||||||
+53
-9
@@ -19,16 +19,29 @@
|
|||||||
//
|
//
|
||||||
// Environment overrides:
|
// Environment overrides:
|
||||||
//
|
//
|
||||||
// CERTCTL_QA_SERVER_URL (default: http://localhost:8443)
|
// CERTCTL_QA_SERVER_URL (default: https://localhost:8443)
|
||||||
// CERTCTL_QA_API_KEY (default: change-me-in-production)
|
// CERTCTL_QA_API_KEY (default: change-me-in-production)
|
||||||
// CERTCTL_QA_DB_URL (default: postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable)
|
// CERTCTL_QA_DB_URL (default: postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable)
|
||||||
// CERTCTL_QA_REPO_DIR (default: ../.. — the certctl repo root)
|
// CERTCTL_QA_REPO_DIR (default: ../.. — the certctl repo root)
|
||||||
|
// CERTCTL_QA_CA_BUNDLE (default: ./certs/ca.crt — the demo stack's init container writes here)
|
||||||
|
// CERTCTL_QA_INSECURE (default: false — set to "true" to skip TLS verify, e.g. before the init container finishes)
|
||||||
|
//
|
||||||
|
// TLS note (HTTPS-Everywhere M-007, Phase 6): the demo compose stack now
|
||||||
|
// listens on https://localhost:8443 with a self-signed cert written by the
|
||||||
|
// tls-init container. This suite pins the issuing CA via
|
||||||
|
// CERTCTL_QA_CA_BUNDLE so cert rotation or a tampered proxy fails the
|
||||||
|
// handshake instead of being silently trusted. CERTCTL_QA_INSECURE="true"
|
||||||
|
// is an explicit opt-out for bootstrap scenarios — there is no silent
|
||||||
|
// plaintext downgrade, matching the server-side pre-flight guard added in
|
||||||
|
// Phase 5 (task #203).
|
||||||
package integration_test
|
package integration_test
|
||||||
|
|
||||||
import (
|
import (
|
||||||
|
"crypto/tls"
|
||||||
"crypto/x509"
|
"crypto/x509"
|
||||||
"database/sql"
|
"database/sql"
|
||||||
"encoding/json"
|
"encoding/json"
|
||||||
|
"fmt"
|
||||||
"io"
|
"io"
|
||||||
"net/http"
|
"net/http"
|
||||||
"os"
|
"os"
|
||||||
@@ -50,10 +63,12 @@ func qaEnv(key, fallback string) string {
|
|||||||
}
|
}
|
||||||
|
|
||||||
var (
|
var (
|
||||||
qaServerURL = qaEnv("CERTCTL_QA_SERVER_URL", "http://localhost:8443")
|
qaServerURL = qaEnv("CERTCTL_QA_SERVER_URL", "https://localhost:8443")
|
||||||
qaAPIKey = qaEnv("CERTCTL_QA_API_KEY", "change-me-in-production")
|
qaAPIKey = qaEnv("CERTCTL_QA_API_KEY", "change-me-in-production")
|
||||||
qaDBURL = qaEnv("CERTCTL_QA_DB_URL", "postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable")
|
qaDBURL = qaEnv("CERTCTL_QA_DB_URL", "postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable")
|
||||||
qaRepoDir = qaEnv("CERTCTL_QA_REPO_DIR", filepath.Join("..", ".."))
|
qaRepoDir = qaEnv("CERTCTL_QA_REPO_DIR", filepath.Join("..", ".."))
|
||||||
|
qaCABundlePath = qaEnv("CERTCTL_QA_CA_BUNDLE", "./certs/ca.crt")
|
||||||
|
qaInsecure = strings.EqualFold(os.Getenv("CERTCTL_QA_INSECURE"), "true")
|
||||||
)
|
)
|
||||||
|
|
||||||
// ---------------------------------------------------------------------------
|
// ---------------------------------------------------------------------------
|
||||||
@@ -66,9 +81,38 @@ type qaClient struct {
|
|||||||
apiKey string
|
apiKey string
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// buildQATLSConfig returns the *tls.Config used by every qaClient. TLS 1.3
|
||||||
|
// minimum matches the server-side config pinned in Phase 2 (cmd/server).
|
||||||
|
// When CERTCTL_QA_INSECURE=true we skip verification entirely — useful
|
||||||
|
// when running against a compose stack where the tls-init container hasn't
|
||||||
|
// written ca.crt yet, or when pointing at a dev server with a rotated cert.
|
||||||
|
// Otherwise we pin CERTCTL_QA_CA_BUNDLE and panic on read/parse failure
|
||||||
|
// rather than silently downgrading to the system trust store (which would
|
||||||
|
// mask a missing init container).
|
||||||
|
func buildQATLSConfig() *tls.Config {
|
||||||
|
cfg := &tls.Config{MinVersion: tls.VersionTLS13}
|
||||||
|
if qaInsecure {
|
||||||
|
cfg.InsecureSkipVerify = true
|
||||||
|
return cfg
|
||||||
|
}
|
||||||
|
pem, err := os.ReadFile(qaCABundlePath)
|
||||||
|
if err != nil {
|
||||||
|
panic(fmt.Sprintf("qa test: read CA bundle %q: %v — set CERTCTL_QA_CA_BUNDLE or CERTCTL_QA_INSECURE=true", qaCABundlePath, err))
|
||||||
|
}
|
||||||
|
pool := x509.NewCertPool()
|
||||||
|
if !pool.AppendCertsFromPEM(pem) {
|
||||||
|
panic(fmt.Sprintf("qa test: no PEM certificates parsed from %q", qaCABundlePath))
|
||||||
|
}
|
||||||
|
cfg.RootCAs = pool
|
||||||
|
return cfg
|
||||||
|
}
|
||||||
|
|
||||||
func newQAClient() *qaClient {
|
func newQAClient() *qaClient {
|
||||||
return &qaClient{
|
return &qaClient{
|
||||||
http: &http.Client{Timeout: 30 * time.Second},
|
http: &http.Client{
|
||||||
|
Timeout: 30 * time.Second,
|
||||||
|
Transport: &http.Transport{TLSClientConfig: buildQATLSConfig()},
|
||||||
|
},
|
||||||
baseURL: qaServerURL,
|
baseURL: qaServerURL,
|
||||||
apiKey: qaAPIKey,
|
apiKey: qaAPIKey,
|
||||||
}
|
}
|
||||||
|
|||||||
+30
-4
@@ -1,5 +1,30 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
|
# DEPRECATED — prefer `go test -tags integration ./deploy/test/...`
|
||||||
|
# =============================================================================
|
||||||
|
#
|
||||||
|
# This bash harness predates the Go integration test suite in
|
||||||
|
# deploy/test/integration_test.go (build tag `integration`, 34 subtests across
|
||||||
|
# 13 phases — health, agent heartbeat, Local CA issuance, ACME, step-ca, EST,
|
||||||
|
# S/MIME, discovery, network scan, revocation + CRL, deployment verification).
|
||||||
|
# The Go suite uses crypto/x509, crypto/tls, and database/sql to parse certs,
|
||||||
|
# probe TLS, and talk to PostgreSQL directly — no openssl text-scraping or
|
||||||
|
# brittle curl pipelines. It is the authoritative integration test surface as
|
||||||
|
# of milestone M-007 (HTTPS Everywhere, Phase 6), where the test compose
|
||||||
|
# stack wires the server on https://localhost:8443 behind a pinned CA bundle
|
||||||
|
# at ./certs/ca.crt.
|
||||||
|
#
|
||||||
|
# Run the Go suite:
|
||||||
|
# (cd deploy && docker compose -f docker-compose.test.yml up -d --build)
|
||||||
|
# go test -tags integration -v -count=1 ./deploy/test/...
|
||||||
|
#
|
||||||
|
# Keep this bash script around because:
|
||||||
|
# * It is cited in docs/test-env.md and is muscle memory for contributors.
|
||||||
|
# * It exercises the CLI / curl path end-to-end (a different failure mode
|
||||||
|
# than the Go HTTP client path).
|
||||||
|
# But any NEW integration coverage goes in integration_test.go — not here.
|
||||||
|
#
|
||||||
|
# =============================================================================
|
||||||
# certctl End-to-End Test Script
|
# certctl End-to-End Test Script
|
||||||
# =============================================================================
|
# =============================================================================
|
||||||
#
|
#
|
||||||
@@ -32,10 +57,11 @@ set -euo pipefail
|
|||||||
# Config
|
# Config
|
||||||
# ---------------------------------------------------------------------------
|
# ---------------------------------------------------------------------------
|
||||||
COMPOSE_FILE="docker-compose.test.yml"
|
COMPOSE_FILE="docker-compose.test.yml"
|
||||||
API_URL="http://localhost:8443"
|
API_URL="https://localhost:8443"
|
||||||
API_KEY="test-key-2026"
|
API_KEY="test-key-2026"
|
||||||
NGINX_TLS="localhost:8444"
|
NGINX_TLS="localhost:8444"
|
||||||
AUTH_HEADER="Authorization: Bearer ${API_KEY}"
|
AUTH_HEADER="Authorization: Bearer ${API_KEY}"
|
||||||
|
CACERT="./certs/ca.crt"
|
||||||
|
|
||||||
# Flags
|
# Flags
|
||||||
BUILD=true
|
BUILD=true
|
||||||
@@ -91,7 +117,7 @@ header() {
|
|||||||
# API helper: GET endpoint, return JSON body. Exits 1 on HTTP error.
|
# API helper: GET endpoint, return JSON body. Exits 1 on HTTP error.
|
||||||
api_get() {
|
api_get() {
|
||||||
local path="$1"
|
local path="$1"
|
||||||
curl -sf -H "${AUTH_HEADER}" "${API_URL}${path}" 2>/dev/null
|
curl -sf --cacert "${CACERT}" -H "${AUTH_HEADER}" "${API_URL}${path}" 2>/dev/null
|
||||||
}
|
}
|
||||||
|
|
||||||
# API helper: POST with optional JSON body
|
# API helper: POST with optional JSON body
|
||||||
@@ -99,10 +125,10 @@ api_post() {
|
|||||||
local path="$1"
|
local path="$1"
|
||||||
local body="${2:-}"
|
local body="${2:-}"
|
||||||
if [ -n "$body" ]; then
|
if [ -n "$body" ]; then
|
||||||
curl -sf -X POST -H "${AUTH_HEADER}" -H "Content-Type: application/json" \
|
curl -sf --cacert "${CACERT}" -X POST -H "${AUTH_HEADER}" -H "Content-Type: application/json" \
|
||||||
-d "$body" "${API_URL}${path}" 2>/dev/null
|
-d "$body" "${API_URL}${path}" 2>/dev/null
|
||||||
else
|
else
|
||||||
curl -sf -X POST -H "${AUTH_HEADER}" "${API_URL}${path}" 2>/dev/null
|
curl -sf --cacert "${CACERT}" -X POST -H "${AUTH_HEADER}" "${API_URL}${path}" 2>/dev/null
|
||||||
fi
|
fi
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
+48
-17
@@ -61,7 +61,7 @@ flowchart TB
|
|||||||
API["REST API\n(Go net/http, :8443)"]
|
API["REST API\n(Go net/http, :8443)"]
|
||||||
SVC["Service Layer"]
|
SVC["Service Layer"]
|
||||||
REPO["Repository Layer\n(database/sql + lib/pq)"]
|
REPO["Repository Layer\n(database/sql + lib/pq)"]
|
||||||
SCHED["Background Scheduler\n7 loops"]
|
SCHED["Background Scheduler\n8 always-on + 4 optional loops"]
|
||||||
DASH["Web Dashboard\n(React SPA)"]
|
DASH["Web Dashboard\n(React SPA)"]
|
||||||
end
|
end
|
||||||
|
|
||||||
@@ -285,6 +285,9 @@ erDiagram
|
|||||||
text channel
|
text channel
|
||||||
text recipient
|
text recipient
|
||||||
text status
|
text status
|
||||||
|
int retry_count
|
||||||
|
timestamptz next_retry_at
|
||||||
|
text last_error
|
||||||
}
|
}
|
||||||
certificate_profiles {
|
certificate_profiles {
|
||||||
text id PK
|
text id PK
|
||||||
@@ -483,40 +486,55 @@ For compliance events requiring fleet-wide revocation (key compromise, CA distru
|
|||||||
|
|
||||||
### 4. Automatic Renewal
|
### 4. Automatic Renewal
|
||||||
|
|
||||||
The control plane runs a scheduler with seven background loops:
|
The control plane runs a scheduler with 8 always-on loops plus up to 4 optional loops (enabled by configuration). The authoritative loop count is in `internal/scheduler/scheduler.go:262-265`.
|
||||||
|
|
||||||
```mermaid
|
```mermaid
|
||||||
flowchart LR
|
flowchart LR
|
||||||
subgraph "Scheduler (Background Goroutines)"
|
subgraph "Scheduler (Background Goroutines)"
|
||||||
R["Renewal Checker\n⏱ every 1h"]
|
R["Renewal Checker\n⏱ every 1h"]
|
||||||
J["Job Processor\n⏱ every 30s"]
|
J["Job Processor\n⏱ every 30s"]
|
||||||
|
JR["Job Retry\n⏱ every 5m"]
|
||||||
|
JT["Job Timeout\n⏱ every 10m"]
|
||||||
H["Agent Health\n⏱ every 2m"]
|
H["Agent Health\n⏱ every 2m"]
|
||||||
N["Notification Processor\n⏱ every 1m"]
|
N["Notification Processor\n⏱ every 1m"]
|
||||||
|
NR["Notification Retry\n⏱ every 2m"]
|
||||||
SL["Short-Lived Expiry\n⏱ every 30s"]
|
SL["Short-Lived Expiry\n⏱ every 30s"]
|
||||||
NS["Network Scanner\n⏱ every 6h"]
|
NS["Network Scanner\n⏱ every 6h"]
|
||||||
DG["Certificate Digest\n⏱ every 24h"]
|
DG["Certificate Digest\n⏱ every 24h"]
|
||||||
|
HC["Endpoint Health\n⏱ every 60s"]
|
||||||
|
CD["Cloud Discovery\n⏱ every 6h"]
|
||||||
end
|
end
|
||||||
|
|
||||||
R -->|"Find expiring certs\nCreate renewal jobs"| DB[("PostgreSQL")]
|
R -->|"Find expiring certs\nCreate renewal jobs"| DB[("PostgreSQL")]
|
||||||
J -->|"Process pending jobs\nCoordinate issuance"| DB
|
J -->|"Process pending jobs\nCoordinate issuance"| DB
|
||||||
|
JR -->|"Retry Failed jobs\nFailed→Pending"| DB
|
||||||
|
JT -->|"Reap stalled AwaitingCSR / AwaitingApproval jobs"| DB
|
||||||
H -->|"Check heartbeat staleness\nMark agents offline"| DB
|
H -->|"Check heartbeat staleness\nMark agents offline"| DB
|
||||||
N -->|"Send pending notifications\nEmail / Webhook / Slack"| DB
|
N -->|"Send pending notifications\nEmail / Webhook / Slack"| DB
|
||||||
|
NR -->|"Retry failed notifications\n2^n-min backoff, DLQ after 5 attempts"| DB
|
||||||
SL -->|"Expire short-lived certs\nMark as Expired"| DB
|
SL -->|"Expire short-lived certs\nMark as Expired"| DB
|
||||||
NS -->|"Probe TLS endpoints\nStore discovered certs"| DB
|
NS -->|"Probe TLS endpoints\nStore discovered certs"| DB
|
||||||
DG -->|"Generate & send HTML digest\nEmail to recipients"| DB
|
DG -->|"Generate & send HTML digest\nEmail to recipients"| DB
|
||||||
|
HC -->|"Probe deployed TLS endpoints\nState machine + mismatch"| DB
|
||||||
|
CD -->|"AWS SM / Azure KV / GCP SM\nFeed discovery pipeline"| DB
|
||||||
```
|
```
|
||||||
|
|
||||||
| Loop | Interval | Timeout | Purpose |
|
| Loop | Interval | Always-on? | Purpose |
|
||||||
|------|----------|---------|---------|
|
|------|----------|------------|---------|
|
||||||
| Renewal checker | 1 hour | 5 minutes | Finds certificates approaching expiry, creates renewal jobs |
|
| Renewal checker | 1 hour | Yes | Finds certificates approaching expiry (threshold-based or ARI-directed), creates renewal jobs |
|
||||||
| Job processor | 30 seconds | 2 minutes | Processes pending jobs (issuance, renewal, deployment) |
|
| Job processor | 30 seconds | Yes | Processes pending jobs (issuance, renewal, deployment) |
|
||||||
| Agent health check | 2 minutes | 1 minute | Marks agents as offline if heartbeat is stale |
|
| Job retry | 5 minutes (`CERTCTL_SCHEDULER_RETRY_INTERVAL`) | Yes | Transitions `Failed` jobs back to `Pending` for re-dispatch (I-001) |
|
||||||
| Notification processor | 1 minute | 1 minute | Sends pending notifications via configured channels |
|
| Job timeout | 10 minutes (`CERTCTL_JOB_TIMEOUT_INTERVAL`) | Yes | Reaps `AwaitingCSR` jobs older than 24h and `AwaitingApproval` jobs older than 7d to `Failed`, feeding the retry loop (I-003) |
|
||||||
| Short-lived expiry | 30 seconds | 30 seconds | Marks expired short-lived certificates (profile TTL < 1 hour) |
|
| Agent health check | 2 minutes | Yes | Marks agents as offline if heartbeat is stale |
|
||||||
| Network scanner | 6 hours | 30 minutes | Probes TLS endpoints on configured CIDR ranges, stores discovered certs (M21, opt-in via `CERTCTL_NETWORK_SCAN_ENABLED`). CIDR size validated at API level — max /20 (4096 IPs) per range. |
|
| Notification processor | 1 minute | Yes | Sends pending notifications via configured channels |
|
||||||
| Certificate digest | 24 hours | 5 minutes | Generates HTML email with certificate stats, expiration timeline, job health, agent count. Does NOT run on startup — waits for first scheduled tick. Configurable interval and recipients via `CERTCTL_DIGEST_INTERVAL` and `CERTCTL_DIGEST_RECIPIENTS`. Falls back to certificate owner emails if no explicit recipients configured. |
|
| Notification retry | 2 minutes (`CERTCTL_NOTIFICATION_RETRY_INTERVAL`) | Yes | Re-dispatches `Failed` notifications whose `next_retry_at` has elapsed; exponential backoff (2^n minutes, capped at 1h), 5-attempt budget, terminal `dead` status after exhaustion (I-005) |
|
||||||
|
| Short-lived expiry | 30 seconds | Yes | Marks expired short-lived certificates (profile TTL < 1 hour) |
|
||||||
|
| Network scanner | 6 hours | Opt-in (`CERTCTL_NETWORK_SCAN_ENABLED`) | Probes TLS endpoints on configured CIDR ranges, stores discovered certs (M21). CIDR size validated at API level — max /20 (4096 IPs) per range. |
|
||||||
|
| Certificate digest | 24 hours (`CERTCTL_DIGEST_INTERVAL`) | Opt-in (digest service) | Generates HTML email with certificate stats, expiration timeline, job health, agent count. Does NOT run on startup — waits for first scheduled tick. Falls back to certificate owner emails if no explicit recipients configured. |
|
||||||
|
| Endpoint health | 60 seconds (`CERTCTL_HEALTH_CHECK_INTERVAL`) | Opt-in (health check service) | Probes deployed TLS endpoints, drives the healthy/degraded/down/cert_mismatch state machine (M48) |
|
||||||
|
| Cloud discovery | 6 hours | Opt-in (at least one cloud source configured) | Walks AWS Secrets Manager / Azure Key Vault / GCP Secret Manager, feeds discovery pipeline (M50) |
|
||||||
|
|
||||||
Each loop uses `sync/atomic.Bool` idempotency guards to prevent concurrent tick execution — if a loop iteration is still running when the next tick fires, the tick is skipped with a warning log. All loops (including short-lived expiry check) run immediately on startup before entering their ticker interval, ensuring no gap between scheduler start and first execution. The certificate digest loop is the exception — it does NOT run on startup, only on scheduled ticks. Graceful shutdown uses `sync.WaitGroup` with `WaitForCompletion()` to drain all in-flight work before process exit.
|
Each loop uses `sync/atomic.Bool` idempotency guards to prevent concurrent tick execution — if a loop iteration is still running when the next tick fires, the tick is skipped with a warning log. Most loops (including short-lived expiry, job retry, job timeout, and notification retry) run immediately on startup before entering their ticker interval, ensuring no gap between scheduler start and first execution. The certificate digest loop is the exception — it does NOT run on startup, only on scheduled ticks. Graceful shutdown uses `sync.WaitGroup` with `WaitForCompletion()` to drain all in-flight work before process exit.
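
The skip-if-still-running guard described above can be sketched as follows. This is a minimal illustration, not certctl's scheduler code: `guardedLoop`, `tick`, and the `skipped` counter are hypothetical names, and the real loops log a warning rather than counting skips.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// guardedLoop is a hypothetical stand-in for one scheduler loop.
type guardedLoop struct {
	running atomic.Bool  // true while a tick is executing
	skipped atomic.Int64 // number of ticks dropped because of overlap
}

// tick runs work unless a previous tick is still in flight.
// It reports whether work actually ran.
func (g *guardedLoop) tick(work func()) bool {
	// CompareAndSwap flips false -> true atomically, so only one caller wins.
	if !g.running.CompareAndSwap(false, true) {
		g.skipped.Add(1)
		return false
	}
	defer g.running.Store(false)
	work()
	return true
}

func main() {
	var g guardedLoop
	var wg sync.WaitGroup
	started := make(chan struct{})
	release := make(chan struct{})

	// First tick: blocks inside work until released.
	wg.Add(1)
	go func() {
		defer wg.Done()
		g.tick(func() {
			close(started)
			<-release
		})
	}()

	<-started
	overlapped := g.tick(func() {}) // fires while the first tick is running
	close(release)
	wg.Wait() // analogous to draining in-flight work on shutdown
	after := g.tick(func() {}) // guard has been released again

	fmt.Println(overlapped, after, g.skipped.Load()) // prints "false true 1"
}
```

`CompareAndSwap` is used rather than a separate `Load` and `Store` so that two ticks firing at the same instant cannot both pass the check.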
|
||||||
|
|
||||||
Each operation has a context timeout to prevent indefinite hangs if external services become unresponsive.
|
Each operation has a context timeout to prevent indefinite hangs if external services become unresponsive.
|
||||||
|
|
||||||
@@ -658,6 +676,16 @@ Built-in notifiers: **Email** (SMTP), **Webhook** (HTTP POST), **Slack** (incomi
|
|||||||
|
|
||||||
See the [Connector Development Guide](connectors.md) for details on building custom connectors.
|
See the [Connector Development Guide](connectors.md) for details on building custom connectors.
|
||||||
|
|
||||||
|
### Notification Retry & Dead-Letter Queue
|
||||||
|
|
||||||
|
A transient notifier failure (SMTP timeout, 5xx webhook response, Slack rate-limit) must not silently drop a critical alert. Migration `000016_notification_retry` adds three columns to `notification_events` — `retry_count INTEGER NOT NULL DEFAULT 0`, `next_retry_at TIMESTAMPTZ` (nullable — only meaningful while a row is in `failed` state), and `last_error TEXT` (the most recent transient error, preserved for operator triage) — together with a partial index `idx_notification_events_retry_sweep ON notification_events(next_retry_at) WHERE status = 'failed' AND next_retry_at IS NOT NULL` so the retry hot path scales with the retry-eligible slice rather than the full notification history.
|
||||||
|
|
||||||
|
The scheduler's notification-retry loop (see the scheduler section above) calls `NotificationService.RetryFailedNotifications(ctx)` every `CERTCTL_NOTIFICATION_RETRY_INTERVAL` (default `2m`). Each tick pulls up to 1000 rows via `notifRepo.ListRetryEligible(ctx, now, maxAttempts, sweepLimit)` — a partial-index-driven query that filters on `status='failed' AND next_retry_at <= now() AND retry_count < 5` — and redispatches them through the same notifier registry used by `ProcessPendingNotifications`. A successful redispatch transitions the row directly to `sent` without incrementing `retry_count`, so the audit trail preserves "delivered on attempt N". A failed redispatch re-arms `next_retry_at` using exponential backoff — `wait = min(2^retry_count minutes, 1h)` — bumps `retry_count`, and stamps `last_error`. When `retry_count >= 4` (the fifth attempt has just failed) the row is promoted to the terminal `dead` status via `notifRepo.MarkAsDead`, which clears `next_retry_at` so the partial retry-sweep index stops matching and the row cannot be re-entered into the retry rotation without operator action.
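
The backoff rule `wait = min(2^retry_count minutes, 1h)` and the five-attempt budget can be sketched as below. The function names are hypothetical, and whether the exponent uses the counter before or after it is bumped is an assumption here (the sketch uses the pre-increment value); the repository's `RetryFailedNotifications` is the authority.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxAttempts = 5         // attempt budget named in the text
	maxBackoff  = time.Hour // cap named in the text
)

// retryBackoff returns min(2^retryCount minutes, 1h).
func retryBackoff(retryCount int) time.Duration {
	if retryCount < 0 {
		retryCount = 0
	}
	if retryCount >= 6 { // 2^6 = 64 minutes already exceeds the cap
		return maxBackoff
	}
	wait := time.Duration(1<<uint(retryCount)) * time.Minute
	if wait > maxBackoff {
		return maxBackoff
	}
	return wait
}

// backoffMinutes is a convenience wrapper for printing and checks.
func backoffMinutes(retryCount int) int {
	return int(retryBackoff(retryCount) / time.Minute)
}

// afterFailure models one failed redispatch: either the row is re-armed with
// a bumped counter and a wait, or it is promoted to dead.
// Assumption: the wait is computed from the counter before the bump.
func afterFailure(retryCount int) (next int, wait time.Duration, dead bool) {
	if retryCount >= maxAttempts-1 {
		return retryCount, 0, true // fifth attempt just failed
	}
	return retryCount + 1, retryBackoff(retryCount), false
}

func main() {
	count := 0
	for {
		next, wait, dead := afterFailure(count)
		if dead {
			fmt.Printf("retry_count=%d -> dead\n", count)
			return
		}
		fmt.Printf("retry_count=%d -> wait %v, retry_count becomes %d\n", count, wait, next)
		count = next
	}
}
```

With a five-attempt budget the one-hour cap is never reached; it only matters if the budget is raised.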
|
||||||
|
|
||||||
|
`NotificationService.RequeueNotification(ctx, id)` is the operator-driven escape hatch from `dead`. It atomically resets `retry_count → 0`, `next_retry_at → NULL`, `last_error → NULL`, and `status → pending`, handing the row back to `ProcessPendingNotifications` on the next 1m tick. This is the correct response to "the notifier outage is resolved, redeliver the queue"; it is not a retry, which is why the retry counter is reset rather than incremented.
|
||||||
|
|
||||||
|
The dead-letter depth is surfaced in two places. First, `DashboardSummary.NotificationsDead` is populated by `StatsService.GetDashboardSummary` via `notifRepo.CountByStatus(ctx, "dead")`. The injection uses a `SetNotifRepo` setter pattern (mirroring `CertificateService.SetTargetRepo`) rather than a new positional argument to `NewStatsService`, which keeps all nine existing `NewStatsService` call sites (main.go plus eight digest tests and stats_test.go) signature-stable — when the notification repository has not been wired in, `NotificationsDead` falls through to zero. Second, the `/api/v1/metrics/prometheus` endpoint emits `certctl_notification_dead_total` as a counter (operator alert thresholds per the I-005 spec: `> 0` warning, `> 10` critical) using the same `DashboardSummary` snapshot so the dashboard card and the Prometheus counter cannot skew. The web dashboard exposes a two-tab toolbar on `/notifications` — "All" (the pre-I-005 inbox) and "Dead letter" (threads `?status=dead` into the list query, surfaces `Retry N/5` and the truncated `last_error` with a full-text tooltip per row, and binds a Requeue button to `POST /api/v1/notifications/{id}/requeue`).
|
||||||
|
|
||||||
### EST Server (RFC 7030)
|
### EST Server (RFC 7030)
|
||||||
|
|
||||||
The EST (Enrollment over Secure Transport) server provides an industry-standard enrollment interface for devices that need certificates without using the REST API. It runs under `/.well-known/est/` per RFC 7030 and supports four operations: CA certificate distribution (`/cacerts`), initial enrollment (`/simpleenroll`), re-enrollment (`/simplereenroll`), and CSR attributes (`/csrattrs`).
|
The EST (Enrollment over Secure Transport) server provides an industry-standard enrollment interface for devices that need certificates without using the REST API. It runs under `/.well-known/est/` per RFC 7030 and supports four operations: CA certificate distribution (`/cacerts`), initial enrollment (`/simpleenroll`), re-enrollment (`/simplereenroll`), and CSR attributes (`/csrattrs`).
|
||||||
@@ -695,6 +723,8 @@ type ESTService interface {
|
|||||||
|
|
||||||
**Issuer connector extension:** EST required adding `GetCACertPEM(ctx) (string, error)` to the issuer connector interface so the `/cacerts` endpoint can serve the CA chain. The Local CA returns its CA certificate PEM; Vault PKI fetches via `GET /v1/{mount}/ca/pem`; Google CAS fetches via API; AWS ACM PCA retrieves via `GetCertificateAuthorityCertificate`. ACME, step-ca, OpenSSL, DigiCert, and Sectigo connectors return errors (they don't expose a static CA chain — their chains are per-issuance).
|
**Issuer connector extension:** EST required adding `GetCACertPEM(ctx) (string, error)` to the issuer connector interface so the `/cacerts` endpoint can serve the CA chain. The Local CA returns its CA certificate PEM; Vault PKI fetches via `GET /v1/{mount}/ca/pem`; Google CAS fetches via API; AWS ACM PCA retrieves via `GetCertificateAuthorityCertificate`. ACME, step-ca, OpenSSL, DigiCert, and Sectigo connectors return errors (they don't expose a static CA chain — their chains are per-issuance).
|
||||||
|
|
||||||
|
**Authentication:** EST endpoints are served unauthenticated at the HTTP layer under `/.well-known/est/*` — no Bearer token required. Per RFC 7030 §3.2.3 EST authentication is deployment-specific, and per §4.1.1 `/cacerts` is explicitly anonymous. certctl enforces authentication via CSR signature verification inside `ESTService.SimpleEnroll`/`SimpleReEnroll` plus profile policy gates (allowed key algorithms, minimum key size, permitted SANs, permitted EKUs, MaxTTL). The HTTP dispatch is implemented in `cmd/server/main.go:buildFinalHandler`, which routes `/.well-known/est/*` through `noAuthHandler` (RequestID + structuredLogger + Recovery only). Operators who need stronger client identification should terminate mTLS at an upstream reverse proxy and pin the CSR's SAN to the client cert subject at the profile level.
|
||||||
|
|
||||||
**Audit:** Every EST enrollment is recorded in the audit trail with `protocol: "EST"`, the CN, SANs, issuer ID, serial number, and optional profile ID.
|
**Audit:** Every EST enrollment is recorded in the audit trail with `protocol: "EST"`, the CN, SANs, issuer ID, serial number, and optional profile ID.
|
||||||
|
|
||||||
### SCEP Server (RFC 8894)
|
### SCEP Server (RFC 8894)
|
||||||
@@ -721,7 +751,7 @@ Signed certificate returned as PKCS#7 certs-only
|
|||||||
|
|
||||||
**Wire format:** SCEP clients wrap CSRs in PKCS#7 SignedData envelopes. The handler parses the outer ASN.1 ContentInfo → SignedData → EncapsulatedContentInfo to extract the CSR bytes. Fallback paths handle base64-encoded PKCS#7 and raw CSR submissions (for simpler clients). Responses use PKCS#7 certs-only via the shared `internal/pkcs7` package (same as EST). Single certs are returned as raw DER for `GetCACert`, chains as PKCS#7.
|
**Wire format:** SCEP clients wrap CSRs in PKCS#7 SignedData envelopes. The handler parses the outer ASN.1 ContentInfo → SignedData → EncapsulatedContentInfo to extract the CSR bytes. Fallback paths handle base64-encoded PKCS#7 and raw CSR submissions (for simpler clients). Responses use PKCS#7 certs-only via the shared `internal/pkcs7` package (same as EST). Single certs are returned as raw DER for `GetCACert`, chains as PKCS#7.
|
||||||
|
|
||||||
**Authentication:** SCEP uses challenge passwords embedded in CSR attributes (OID 1.2.840.113549.1.9.7) rather than TLS client certificates. The server validates the challenge password against `CERTCTL_SCEP_CHALLENGE_PASSWORD`. When no challenge password is configured, any value is accepted.
|
**Authentication:** SCEP endpoints at `/scep` and `/scep/*` are served unauthenticated at the HTTP layer — no Bearer token required — per RFC 8894 §3.2, which defines authentication via the `challengePassword` attribute (OID 1.2.840.113549.1.9.7) embedded in the PKCS#10 CSR rather than an HTTP credential. The HTTP dispatch is implemented in `cmd/server/main.go:buildFinalHandler`, which routes `/scep` and `/scep/*` through `noAuthHandler` (RequestID + structuredLogger + Recovery only). The `challengePassword` is mandatory: `preflightSCEPChallengePassword` at startup refuses to boot the control plane when `CERTCTL_SCEP_ENABLED=true` is set without `CERTCTL_SCEP_CHALLENGE_PASSWORD`, closing CWE-306 (missing authentication for a critical function). `SCEPService.PKCSReq` enforces the same invariant defense-in-depth — an empty `s.challengePassword` rejects every enrollment — and the password comparison uses `crypto/subtle.ConstantTimeCompare` to prevent response-time side-channel leakage. The startup log line `SCEP server enabled` emits a `challenge_password_set` boolean for operator visibility.
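
The two layers described above, a startup gate plus a service-level check that fails closed, might look roughly like this. It is a sketch only: `preflight` and `challengeOK` are hypothetical stand-ins for `preflightSCEPChallengePassword` and the check inside `SCEPService.PKCSReq`, whose real signatures are not shown in this document.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

var errChallengeRequired = errors.New("scep: enabled without a challenge password")

// preflight refuses to proceed when SCEP is enabled but no shared secret is set.
func preflight(scepEnabled bool, configured string) error {
	if scepEnabled && configured == "" {
		return errChallengeRequired
	}
	return nil
}

// challengeOK fails closed on an empty configured password and otherwise
// compares in constant time. ConstantTimeCompare returns 0 immediately when
// the lengths differ, so only the length can leak through timing.
func challengeOK(configured, presented string) bool {
	if configured == "" {
		return false // defense in depth: never accept "any value"
	}
	return subtle.ConstantTimeCompare([]byte(configured), []byte(presented)) == 1
}

func main() {
	fmt.Println(preflight(true, ""))             // startup is refused
	fmt.Println(challengeOK("s3cret", "s3cret")) // matching password
	fmt.Println(challengeOK("", ""))             // empty config rejects everything
}
```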
|
||||||
|
|
||||||
**Interface:** The `SCEPHandler` defines an `SCEPService` interface (dependency inversion):
|
**Interface:** The `SCEPHandler` defines an `SCEPService` interface (dependency inversion):
|
||||||
|
|
||||||
@@ -778,10 +808,11 @@ The control plane only handles public material: certificates, chains, and CSRs.
|
|||||||
|
|
||||||
### Authentication
|
### Authentication
|
||||||
|
|
||||||
- **API clients → Server**: API key in `Authorization: Bearer` header, or `none` for demo mode
|
- **API clients → Server**: API key in `Authorization: Bearer` header, or `none` for demo mode. Applies to every path under `/api/v1/*`.
|
||||||
- **Agent → Server**: API key registered at agent creation, included in all requests
|
- **Agent → Server**: API key registered at agent creation, included in all requests
|
||||||
- **Server → Issuers**: ACME account key, or connector-specific credentials
|
- **Server → Issuers**: ACME account key, or connector-specific credentials
|
||||||
- **Agent → Targets**: API tokens, WinRM credentials (stored locally on agent or proxy agent — never on server). Credential scope is limited to the agent's network zone.
|
- **Agent → Targets**: API tokens, WinRM credentials (stored locally on agent or proxy agent — never on server). Credential scope is limited to the agent's network zone.
|
||||||
|
- **Standards-based enrollment and PKI distribution endpoints**: `/.well-known/est/*` (RFC 7030), `/scep` and `/scep/*` (RFC 8894), and `/.well-known/pki/crl/{issuer_id}` + `/.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 5280 §5 / RFC 6960 / RFC 8615) are served unauthenticated at the HTTP layer. These protocols carry their own authentication semantics — CSR signature + profile policy for EST (§3.2.3 says EST auth is deployment-specific; §4.1.1 makes `/cacerts` explicitly anonymous), `challengePassword` in CSR attributes for SCEP (§3.2), and relying-party accessibility for CRL/OCSP — and cannot present certctl Bearer tokens. The dispatch is implemented in `cmd/server/main.go:buildFinalHandler`, which routes these prefixes through `noAuthHandler` (RequestID + structuredLogger + Recovery only, no auth or rate-limit middleware). CWE-306 is closed for SCEP by `preflightSCEPChallengePassword`, which refuses to start the server when SCEP is enabled without `CERTCTL_SCEP_CHALLENGE_PASSWORD`. The 27-subtest regression harness `cmd/server/finalhandler_test.go` pins this dispatch surface (EST 4-endpoint, SCEP exact + trailing-slash + query-string, PKI CRL+OCSP, health probes, `/api/v1/*` authenticated, `/assets/*` file server, SPA fallback).
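
The path-based split between authenticated and unauthenticated handlers can be illustrated with a small dispatcher. This is a simplified sketch, not the body of `buildFinalHandler`: `requiresAuth` and `dispatch` are hypothetical names, and the real handler also covers health probes, `/assets/*`, and the SPA fallback.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// requiresAuth reports whether a request path belongs to the Bearer-token
// protected API surface. Everything else in this sketch is unauthenticated.
func requiresAuth(path string) bool {
	switch {
	case path == "/scep", strings.HasPrefix(path, "/scep/"):
		return false // SCEP authenticates via challengePassword in the CSR
	case strings.HasPrefix(path, "/.well-known/est/"):
		return false // EST authenticates via CSR signature + profile policy
	case strings.HasPrefix(path, "/.well-known/pki/"):
		return false // CRL/OCSP must be reachable by relying parties
	case strings.HasPrefix(path, "/api/v1/"):
		return true
	}
	return false
}

// dispatch routes each request to one of two middleware chains.
// r.URL.Path excludes the query string, so /scep?operation=GetCACert
// is matched as /scep.
func dispatch(authed, noAuth http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if requiresAuth(r.URL.Path) {
			authed.ServeHTTP(w, r)
			return
		}
		noAuth.ServeHTTP(w, r)
	})
}

func main() {
	for _, p := range []string{
		"/api/v1/certificates",
		"/scep",
		"/scep/",
		"/.well-known/est/cacerts",
		"/.well-known/pki/crl/iss-local",
	} {
		fmt.Printf("%-34s auth=%v\n", p, requiresAuth(p))
	}
}
```

Matching `/scep` exactly and `/scep/` by prefix avoids accidentally exempting unrelated paths that merely start with the same letters.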
|
||||||
|
|
||||||
### Audit Trail
|
### Audit Trail
|
||||||
|
|
||||||
@@ -865,7 +896,7 @@ The HTTP middleware stack processes requests in the following order (see `cmd/se
|
|||||||
|
|
||||||
### Concurrency Safety
|
### Concurrency Safety
|
||||||
|
|
||||||
The background scheduler uses `sync/atomic.Bool` idempotency guards on all 7 loops — if a tick fires while the previous iteration is still running, it skips. A `sync.WaitGroup` tracks all in-flight goroutines. `WaitForCompletion(timeout)` blocks during shutdown until all work finishes or the timeout expires, preventing state corruption from mid-flight database operations during process exit.
|
The background scheduler uses `sync/atomic.Bool` idempotency guards on every loop (8 always-on plus up to 4 optional) — if a tick fires while the previous iteration is still running, it skips. A `sync.WaitGroup` tracks all in-flight goroutines. `WaitForCompletion(timeout)` blocks during shutdown until all work finishes or the timeout expires, preventing state corruption from mid-flight database operations during process exit.
|
||||||
|
|
||||||
### Logging
|
### Logging
|
||||||
|
|
||||||
@@ -1061,7 +1092,7 @@ flowchart TB
|
|||||||
|
|
||||||
1. **Pluggable sources** — Each cloud provider implements the `DiscoverySource` interface (Name, Type, Discover, ValidateConfig). Three built-in sources: AWS Secrets Manager, Azure Key Vault, GCP Secret Manager
|
1. **Pluggable sources** — Each cloud provider implements the `DiscoverySource` interface (Name, Type, Discover, ValidateConfig). Three built-in sources: AWS Secrets Manager, Azure Key Vault, GCP Secret Manager
|
||||||
2. **CloudDiscoveryService orchestrator** — Iterates registered sources, calls `Discover()` on each, feeds reports into `ProcessDiscoveryReport()`. Errors from one source don't prevent other sources from running
|
2. **CloudDiscoveryService orchestrator** — Iterates registered sources, calls `Discover()` on each, feeds reports into `ProcessDiscoveryReport()`. Errors from one source don't prevent other sources from running
|
||||||
3. **Scheduler integration** — 9th scheduler loop (6h default), runs immediately on startup, `atomic.Bool` idempotency guard
|
3. **Scheduler integration** — opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` 12-loop topology), runs immediately on startup, `atomic.Bool` idempotency guard
|
||||||
4. **Sentinel agents** — Each source uses its own sentinel agent ID (`cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) for dedup and triage filtering
|
4. **Sentinel agents** — Each source uses its own sentinel agent ID (`cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) for dedup and triage filtering
|
||||||
5. **Source path format** — `aws-sm://{region}/{secret}`, `azure-kv://{cert-name}/{version}`, `gcp-sm://{project}/{secret}`
|
5. **Source path format** — `aws-sm://{region}/{secret}`, `azure-kv://{cert-name}/{version}`, `gcp-sm://{project}/{secret}`
|
||||||
6. **No new schema** — Reuses existing `discovered_certificates` and `discovery_scans` tables. Sentinel agent IDs leverage existing `(fingerprint_sha256, agent_id, source_path)` dedup constraint
|
6. **No new schema** — Reuses existing `discovered_certificates` and `discovery_scans` tables. Sentinel agent IDs leverage existing `(fingerprint_sha256, agent_id, source_path)` dedup constraint
|
||||||
@@ -1083,7 +1114,7 @@ This data flow is pull-based and non-blocking. Agents discover at their own pace
|
|||||||
|
|
||||||
Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.
|
Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.
|
||||||
|
|
||||||
**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated 8th scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
|
**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated opt-in endpoint health scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
|
||||||
|
|
||||||
**State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.
|
**State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.
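
A compact model of these transitions is sketched below, using the default thresholds from the text. The type and method names are hypothetical, and one behavior is an assumption: a successful probe that returns an unexpected fingerprint is treated as `cert_mismatch` and still resets the failure counter.

```go
package main

import "fmt"

type status string

const (
	healthy      status = "healthy"
	degraded     status = "degraded"
	down         status = "down"
	certMismatch status = "cert_mismatch"

	degradedThreshold = 2 // default consecutive failures before Degraded
	downThreshold     = 5 // default consecutive failures before Down
)

// endpoint holds the per-target state that the transitions depend on.
type endpoint struct {
	failures   int    // consecutive failed probes
	expectedFP string // fingerprint of the deployed certificate
}

// observe folds one probe result into the state and returns the new status.
func (e *endpoint) observe(probeOK bool, observedFP string) status {
	if !probeOK {
		e.failures++
		switch {
		case e.failures >= downThreshold:
			return down
		case e.failures >= degradedThreshold:
			return degraded
		default:
			return healthy
		}
	}
	e.failures = 0 // any successful probe resets the counter
	if e.expectedFP != "" && observedFP != e.expectedFP {
		return certMismatch
	}
	return healthy
}

func main() {
	e := &endpoint{expectedFP: "aa"}
	for i := 1; i <= 5; i++ {
		fmt.Printf("failure %d -> %s\n", i, e.observe(false, ""))
	}
	fmt.Println("recovery ->", e.observe(true, "aa"))
	fmt.Println("swapped cert ->", e.observe(true, "bb"))
}
```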
|
||||||
|
|
||||||
|
|||||||
@@ -39,7 +39,7 @@ Deploy certctl control plane once (Docker Compose, Kubernetes Helm chart, or sel
|
|||||||
```bash
|
```bash
|
||||||
cd /opt/certctl
|
cd /opt/certctl
|
||||||
docker compose up -d
|
docker compose up -d
|
||||||
# Dashboard & API: http://localhost:8443
|
# Dashboard & API: https://localhost:8443 (self-signed cert — pin with --cacert ./deploy/test/certs/ca.crt)
|
||||||
```
|
```
|
||||||
|
|
||||||
**Option B: Kubernetes** (recommended for prod)
|
**Option B: Kubernetes** (recommended for prod)
|
||||||
@@ -59,7 +59,8 @@ chmod +x /usr/local/bin/certctl-agent
|
|||||||
|
|
||||||
# Config
|
# Config
|
||||||
sudo tee /etc/certctl/agent.env > /dev/null <<EOF
|
sudo tee /etc/certctl/agent.env > /dev/null <<EOF
|
||||||
CERTCTL_SERVER_URL=http://certctl-control-plane:8443
|
CERTCTL_SERVER_URL=https://certctl-control-plane:8443
|
||||||
|
CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/tls/ca.crt
|
||||||
CERTCTL_API_KEY=your-api-key
|
CERTCTL_API_KEY=your-api-key
|
||||||
CERTCTL_DISCOVERY_DIRS=/etc/nginx/certs,/etc/ssl,/etc/letsencrypt/live
|
CERTCTL_DISCOVERY_DIRS=/etc/nginx/certs,/etc/ssl,/etc/letsencrypt/live
|
||||||
CERTCTL_KEY_DIR=/var/lib/certctl/keys
|
CERTCTL_KEY_DIR=/var/lib/certctl/keys
|
||||||
|
|||||||
@@ -387,12 +387,12 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
|
|||||||
- API key transmitted in Authorization header (not URL parameter, not cookie).
|
- API key transmitted in Authorization header (not URL parameter, not cookie).
|
||||||
- Browser to server: TLS.
|
- Browser to server: TLS.
|
||||||
- Agent to server: TLS.
|
- Agent to server: TLS.
|
||||||
- No credential logging (API key hash only, never plaintext).
|
- No credential logging (audit records the per-key actor `Name`, never the Bearer token; logs redact the `Authorization` header).
|
||||||
|
|
||||||
**Evidence You Can Provide**:
|
**Evidence You Can Provide**:
|
||||||
- API configuration: `CERTCTL_AUTH_TYPE=api-key` in deployment manifest.
|
- API configuration: `CERTCTL_AUTH_TYPE=api-key` in deployment manifest.
|
||||||
- Database schema: `api_keys` table showing SHA-256 hash column, not plaintext.
|
- Key inventory: `CERTCTL_API_KEYS_NAMED` env var (format `name:key:admin,...`) — seeds the in-memory `NamedAPIKey{Name, Key, Admin}` struct at `internal/api/middleware/middleware.go:29`. Keys are constant-time-compared (`subtle.ConstantTimeCompare`) against the Bearer token. No database table stores them; protect the env var contents at rest via a secrets manager (Vault / AWS Secrets Manager / Kubernetes Secrets / Docker Secrets).
|
||||||
- API audit log: `GET /api/v1/audit?action=api_call` showing Bearer token validation (no plaintext keys logged).
|
- API audit log: `GET /api/v1/audit?action=api_call` showing per-key actor names (`Name` field of matched `NamedAPIKey`) on every call, with zero plaintext or hashed key material recorded.
|
||||||
- TLS certificate on control plane: `openssl s_client -connect {server}:8443` showing valid certificate, TLS 1.2+, strong cipher.
|
- TLS certificate on control plane: `openssl s_client -connect {server}:8443` showing valid certificate, TLS 1.2+, strong cipher.
|
||||||
- GUI login flow: browser network tab showing Authorization header (token value redacted in compliance report).
|
- GUI login flow: browser network tab showing Authorization header (token value redacted in compliance report).
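
For auditors who want to see how a `name:key:admin,...` inventory maps to a per-key actor name, the following sketch may help. It is illustrative only: `namedAPIKey`, `parseNamedKeys`, and `actorFor` are hypothetical, and the handling of the third field (the literal `admin` grants admin, anything else or nothing does not) is an assumption, not the parser in `internal/api/middleware`.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// namedAPIKey mirrors the Name/Key/Admin shape described above.
type namedAPIKey struct {
	Name  string
	Key   string
	Admin bool
}

// parseNamedKeys splits a comma-separated list of name:key[:admin] entries.
// Errors cite the entry index only, so key material never reaches a log.
func parseNamedKeys(raw string) ([]namedAPIKey, error) {
	var keys []namedAPIKey
	for i, entry := range strings.Split(raw, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		parts := strings.Split(entry, ":")
		if len(parts) < 2 || len(parts) > 3 || parts[0] == "" || parts[1] == "" {
			return nil, fmt.Errorf("malformed API key entry at index %d", i)
		}
		k := namedAPIKey{Name: parts[0], Key: parts[1]}
		if len(parts) == 3 {
			k.Admin = parts[2] == "admin" // assumed flag format
		}
		keys = append(keys, k)
	}
	return keys, nil
}

// actorFor returns the Name of the key matching the Bearer token.
// Every key is compared in constant time and the loop never exits early,
// so timing does not reveal which entry matched.
func actorFor(keys []namedAPIKey, token string) (string, bool) {
	name, found := "", false
	for _, k := range keys {
		if subtle.ConstantTimeCompare([]byte(k.Key), []byte(token)) == 1 {
			name, found = k.Name, true
		}
	}
	return name, found
}

func main() {
	keys, err := parseNamedKeys("ci:abc123:admin,ops:def456")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	actor, ok := actorFor(keys, "def456")
	fmt.Println(actor, ok) // the audit record would carry the name, not the token
}
```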
|
||||||
|
|
||||||
@@ -562,6 +562,7 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
|
|||||||
- **Alert Notifications** (M3, M16a) — Configurable escalation:
|
- **Alert Notifications** (M3, M16a) — Configurable escalation:
|
||||||
- Email alerts: certificate approaching expiration, renewal failure, revocation notification.
|
- Email alerts: certificate approaching expiration, renewal failure, revocation notification.
|
||||||
- Webhook: custom HTTP POST to your monitoring system (Slack, Teams, PagerDuty, OpsGenie, custom webhook).
|
- Webhook: custom HTTP POST to your monitoring system (Slack, Teams, PagerDuty, OpsGenie, custom webhook).
|
||||||
|
- **Retry & Dead-Letter Queue** (I-005) — Transient notifier failures (SMTP timeout, webhook 5xx) are retried with exponential backoff (`2^n` minutes capped at 1h, 5-attempt budget) before landing in the terminal `dead` status. Operators monitor DLQ depth via the `certctl_notification_dead_total` Prometheus counter and requeue via the Notifications page Dead letter tab once the underlying outage is resolved. Closes the pre-I-005 silent-drop gap where a single 5xx could lose a compliance-relevant alert without evidence.
|
||||||
- Deduplication: one alert per threshold/certificate per day (avoid alert fatigue).
|
- Deduplication: one alert per threshold/certificate per day (avoid alert fatigue).
|
||||||
|
|
||||||
- **Audit Trail Filtering and Export** (M13) — Compliance reporting:
|
- **Audit Trail Filtering and Export** (M13) — Compliance reporting:
|
||||||
|
|||||||
+23
-12
@@ -44,7 +44,8 @@ Each section includes:
|
|||||||
|
|
||||||
**certctl Implementation** (V2 — Community Edition):
|
**certctl Implementation** (V2 — Community Edition):
|
||||||
|
|
||||||
- **API Key Authentication** — All API calls require a Bearer token (hashed with SHA-256, stored securely, validated with constant-time comparison) or are rejected with 401 Unauthorized. Environment: `CERTCTL_AUTH_TYPE` (default `api-key`; `none` requires explicit opt-in with log warning)
|
- **API Key Authentication** — All `/api/v1/*` calls require a Bearer token (hashed with SHA-256, stored securely, validated with constant-time comparison) or are rejected with 401 Unauthorized. Environment: `CERTCTL_AUTH_TYPE` (default `api-key`; `none` requires explicit opt-in with log warning)
|
||||||
|
- **Standards-based enrollment and PKI distribution endpoints** — EST (`/.well-known/est/*`, RFC 7030), SCEP (`/scep`, `/scep/*`, RFC 8894), and CRL/OCSP (`/.well-known/pki/crl/{issuer_id}`, `/.well-known/pki/ocsp/{issuer_id}/{serial}`, RFC 5280 §5 / RFC 6960 / RFC 8615) are served unauthenticated at the HTTP layer because these protocols cannot present certctl Bearer tokens. Authentication is enforced in-protocol: EST relies on CSR signature verification plus profile policy (RFC 7030 §3.2.3 says EST auth is deployment-specific; §4.1.1 makes `/cacerts` explicitly anonymous); SCEP requires a shared `challengePassword` in the PKCS#10 CSR attributes (OID 1.2.840.113549.1.9.7, RFC 8894 §3.2), validated with `crypto/subtle.ConstantTimeCompare`; CRL and OCSP are intentionally anonymous for relying-party accessibility. CWE-306 (missing authentication for a critical function) is closed for SCEP by `preflightSCEPChallengePassword` in `cmd/server/main.go`, which refuses to start the control plane when `CERTCTL_SCEP_ENABLED=true` is set without `CERTCTL_SCEP_CHALLENGE_PASSWORD`. The HTTP dispatch is implemented in `cmd/server/main.go:buildFinalHandler`, which routes these prefixes through `noAuthHandler` (RequestID + structuredLogger + Recovery only, no auth or rate-limit middleware) and is pinned by the 27-subtest regression harness at `cmd/server/finalhandler_test.go`.
|
||||||
- **GUI Authentication** — Web dashboard includes login screen requiring API key entry. Failed auth redirects to login on 401. Auth context persists across page navigation. Logout clears session.
|
- **GUI Authentication** — Web dashboard includes login screen requiring API key entry. Failed auth redirects to login on 401. Auth context persists across page navigation. Logout clears session.
|
||||||
- **Configurable CORS** — API restricts cross-origin requests via `CERTCTL_CORS_ORIGINS` allowlist or wildcard. Preflight caching prevents chatty browser auth flows.
|
- **Configurable CORS** — API restricts cross-origin requests via `CERTCTL_CORS_ORIGINS` allowlist or wildcard. Preflight caching prevents chatty browser auth flows.
|
||||||
- **Token Bucket Rate Limiting** — Per-IP rate limiting (configurable via `CERTCTL_RATE_LIMIT_RPS` / `CERTCTL_RATE_LIMIT_BURST`) returns 429 Too Many Requests with Retry-After header. Prevents credential stuffing and brute-force attacks.
|
- **Token Bucket Rate Limiting** — Per-IP rate limiting (configurable via `CERTCTL_RATE_LIMIT_RPS` / `CERTCTL_RATE_LIMIT_BURST`) returns 429 Too Many Requests with Retry-After header. Prevents credential stuffing and brute-force attacks.
|
||||||
@@ -58,6 +59,11 @@ Each section includes:
|
|||||||
- Auth info endpoint: `GET /api/v1/auth/info` (returns current auth mode, served without auth so GUI detects mode)
|
- Auth info endpoint: `GET /api/v1/auth/info` (returns current auth mode, served without auth so GUI detects mode)
|
||||||
- Rate limiting middleware: `internal/api/middleware/rate_limit.go`
|
- Rate limiting middleware: `internal/api/middleware/rate_limit.go`
|
||||||
- CORS configuration: `cmd/server/main.go`, search for `CERTCTL_CORS_ORIGINS`
|
- CORS configuration: `cmd/server/main.go`, search for `CERTCTL_CORS_ORIGINS`
|
||||||
|
- Final handler dispatch (authenticated vs. unauthenticated routing): `cmd/server/main.go:buildFinalHandler`
|
||||||
|
- SCEP preflight gate (CWE-306 closure): `cmd/server/main.go:preflightSCEPChallengePassword`
|
||||||
|
- SCEP service-layer defense-in-depth (rejects enrollment on empty challenge password, `crypto/subtle.ConstantTimeCompare`): `internal/service/scep.go`
|
||||||
|
- Final handler dispatch regression harness (27 subtests): `cmd/server/finalhandler_test.go`
|
||||||
|
- OpenAPI spec `security: []` overrides on unauthenticated paths: `api/openapi.yaml` (EST `/cacerts`, `/simpleenroll`, `/simplereenroll`, `/csrattrs`; SCEP `/scep` GET+POST; PKI `/crl/{issuer_id}`, `/ocsp/{issuer_id}/{serial}`)
|
||||||
|
|
||||||
**V3 Enhancement**:
|
**V3 Enhancement**:
|
||||||
|
|
||||||
@@ -110,7 +116,7 @@ Each section includes:
|
|||||||
|
|
||||||
**certctl Implementation** (V2):
|
**certctl Implementation** (V2):
|
||||||
|
|
||||||
- **API Key Policy** — All API access requires an API key or explicit opt-out. Opt-out (`CERTCTL_AUTH_TYPE=none`) logs a warning: "WARNING: Auth disabled (CERTCTL_AUTH_TYPE=none) — this is insecure and only for development". Configuration choice is logged at startup.
|
- **API Key Policy** — All `/api/v1/*` access requires an API key or explicit opt-out. Opt-out (`CERTCTL_AUTH_TYPE=none`) logs a warning: "WARNING: Auth disabled (CERTCTL_AUTH_TYPE=none) — this is insecure and only for development". Configuration choice is logged at startup. The standards-based enrollment and PKI distribution endpoints (EST, SCEP, CRL, OCSP) are served unauthenticated at the HTTP layer per their respective RFCs; see CC6.1 for the full authentication contract and CWE-306 closure via `preflightSCEPChallengePassword`.
|
||||||
- **Agent Authentication** — Agents authenticate to the server via API keys (same mechanism as users). Agent credentials are separate from user API keys.
|
- **Agent Authentication** — Agents authenticate to the server via API keys (same mechanism as users). Agent credentials are separate from user API keys.
|
||||||
- **Private Key Policy** — Agent-side key generation is the default (`CERTCTL_KEYGEN_MODE=agent`). Server-side keygen (`CERTCTL_KEYGEN_MODE=server`) requires explicit configuration and logs a warning: "server-side key generation enabled (CERTCTL_KEYGEN_MODE=server) — private keys touch control plane, demo only".
|
- **Private Key Policy** — Agent-side key generation is the default (`CERTCTL_KEYGEN_MODE=agent`). Server-side keygen (`CERTCTL_KEYGEN_MODE=server`) requires explicit configuration and logs a warning: "server-side key generation enabled (CERTCTL_KEYGEN_MODE=server) — private keys touch control plane, demo only".
|
||||||
- **Password Policy** — Not applicable; certctl uses API keys exclusively. Password management is delegated to your organization's IAM system if you integrate OIDC/SSO (V3).
|
- **Password Policy** — Not applicable; certctl uses API keys exclusively. Password management is delegated to your organization's IAM system if you integrate OIDC/SSO (V3).
|
||||||
@@ -183,15 +189,20 @@ Each section includes:
|
|||||||
|
|
||||||
- **Health Endpoint** — `GET /health` returns 200 OK with service status. Consumed by Docker health checks and Kubernetes probes.
|
- **Health Endpoint** — `GET /health` returns 200 OK with service status. Consumed by Docker health checks and Kubernetes probes.
|
||||||
- **Readiness Endpoint** — `GET /ready` returns 200 OK when the database is connected and migrations are applied.
|
- **Readiness Endpoint** — `GET /ready` returns 200 OK when the database is connected and migrations are applied.
|
||||||
- **Background Scheduler Monitoring** — 7 background loops run on a fixed schedule:
|
- **Background Scheduler Monitoring** — 12 background loops (8 always-on + 4 opt-in) run on a fixed schedule. Authoritative topology in `docs/architecture.md`:
|
||||||
- Renewal loop: every 1 hour, scans for certificates approaching renewal threshold
|
- Renewal loop (always-on, 1 hour): scans for certificates approaching renewal threshold
|
||||||
- Job processor loop: every 30 seconds, picks up pending/waiting jobs and advances their state
|
- Job processor loop (always-on, 30 seconds): picks up pending/waiting jobs and advances their state
|
||||||
- Health check loop: every 2 minutes, pings agents to detect downtime
|
- Job retry loop (always-on, 5 minutes, `CERTCTL_SCHEDULER_RETRY_INTERVAL`): retries Failed jobs (I-001)
|
||||||
- Notification dispatcher loop: every 1 minute, sends queued alerts
|
- Job timeout reaper loop (always-on, 10 minutes, `CERTCTL_JOB_TIMEOUT_INTERVAL`): fails AwaitingCSR/AwaitingApproval jobs past timeout (I-003)
|
||||||
- Short-lived cert expiry loop: every 30 seconds, marks expired short-lived credentials
|
- Agent health check loop (always-on, 2 minutes): pings agents to detect downtime
|
||||||
- Network scanner loop: every 6 hours, scans enabled TLS endpoints for certificate discovery
|
- Notification dispatcher loop (always-on, 1 minute): sends queued alerts
|
||||||
- Digest emailer loop: every 24 hours, sends scheduled certificate digest email to configured recipients
|
- Notification retry loop (always-on, 2 minutes, `CERTCTL_NOTIFICATION_RETRY_INTERVAL`): exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005)
|
||||||
Each loop includes error handling and logs failures via structured slog.
|
- Short-lived cert expiry loop (always-on, 30 seconds): marks expired short-lived credentials
|
||||||
|
- Network scanner loop (opt-in, 6 hours, `CERTCTL_NETWORK_SCAN_ENABLED`): scans enabled TLS endpoints for certificate discovery
|
||||||
|
- Digest emailer loop (opt-in, 24 hours, `CERTCTL_DIGEST_INTERVAL`): sends scheduled certificate digest email to configured recipients
|
||||||
|
- Endpoint health loop (opt-in, 60 seconds, `CERTCTL_HEALTH_CHECK_INTERVAL`): continuous TLS health probes (M48)
|
||||||
|
- Cloud discovery loop (opt-in, 6 hours, `CERTCTL_CLOUD_DISCOVERY_INTERVAL`): cloud secret manager certificate discovery (M50)
|
||||||
|
Each loop includes `atomic.Bool` idempotency guards, error handling, and structured slog failure logs.
|
||||||
- **Metrics Endpoints** — Two formats for monitoring integration:
|
- **Metrics Endpoints** — Two formats for monitoring integration:
|
||||||
- `GET /api/v1/metrics` — JSON object with gauges, counters, and uptime for custom dashboards
|
- `GET /api/v1/metrics` — JSON object with gauges, counters, and uptime for custom dashboards
|
||||||
- `GET /api/v1/metrics/prometheus` — Prometheus exposition format (`text/plain; version=0.0.4`) for native scraping by Prometheus, Grafana Agent, Datadog, and other OpenMetrics-compatible collectors
|
- `GET /api/v1/metrics/prometheus` — Prometheus exposition format (`text/plain; version=0.0.4`) for native scraping by Prometheus, Grafana Agent, Datadog, and other OpenMetrics-compatible collectors
|
||||||
@@ -453,7 +464,7 @@ Each section includes:
|
|||||||
| | Metrics JSON Endpoint | `GET /api/v1/metrics` (gauges, counters, uptime) | ✅ | ✅ | Set thresholds, configure alerting |
|
| | Metrics JSON Endpoint | `GET /api/v1/metrics` (gauges, counters, uptime) | ✅ | ✅ | Set thresholds, configure alerting |
|
||||||
| | Stats API (time-series) | `GET /api/v1/stats/*` (summary, status, expiration, jobs, issuance) | ✅ | ✅ | Integrate into dashboards, SLO tracking |
|
| | Stats API (time-series) | `GET /api/v1/stats/*` (summary, status, expiration, jobs, issuance) | ✅ | ✅ | Integrate into dashboards, SLO tracking |
|
||||||
| | Structured Logging | `slog` middleware with request IDs | ✅ | ✅ | Aggregate logs to SIEM, define retention policy |
|
| | Structured Logging | `slog` middleware with request IDs | ✅ | ✅ | Aggregate logs to SIEM, define retention policy |
|
||||||
| | Background Scheduler | 7 loops (renewal 1h, jobs 30s, health 2m, notifications 1m, short-lived 30s, network scan 6h, digest 24h) | ✅ | ✅ | Alert on scheduler loop failures |
|
| | Background Scheduler | 12 loops (8 always-on: renewal 1h, jobs 30s, job retry 5m I-001, job timeout 10m I-003, health 2m, notifications 1m, notif retry 2m I-005, short-lived 30s; 4 opt-in: network scan 6h, digest 24h, endpoint health 60s M48, cloud discovery 6h M50) | ✅ | ✅ | Alert on scheduler loop failures |
|
||||||
| **CC7.2** Anomaly Detection | Immutable API Audit Trail | `internal/api/middleware/audit.go`, `GET /api/v1/audit` | ✅ | Enhanced (SIEM export) | Integrate into SIEM, search for anomalies, archive long-term |
|
| **CC7.2** Anomaly Detection | Immutable API Audit Trail | `internal/api/middleware/audit.go`, `GET /api/v1/audit` | ✅ | Enhanced (SIEM export) | Integrate into SIEM, search for anomalies, archive long-term |
|
||||||
| | Expiration Threshold Alerting | Configurable per-policy (default 30/14/7/0 days) | ✅ | ✅ | Configure thresholds, integrate notifications |
|
| | Expiration Threshold Alerting | Configurable per-policy (default 30/14/7/0 days) | ✅ | ✅ | Configure thresholds, integrate notifications |
|
||||||
| | Status Auto-Transitions | Active → Expiring (30d) → Expired (0d) | ✅ | ✅ | Monitor status changes in audit trail |
|
| | Status Auto-Transitions | Active → Expiring (30d) → Expired (0d) | ✅ | ✅ | Monitor status changes in audit trail |
|
||||||
|
|||||||
+3
-3
@@ -1126,7 +1126,7 @@ The digest HTML template includes:
|
|||||||
- Expiring certificates table (color-coded by urgency: 7d, 14d, 30d)
|
- Expiring certificates table (color-coded by urgency: 7d, 14d, 30d)
|
||||||
- Auto-refresh and responsive email layout
|
- Auto-refresh and responsive email layout
|
||||||
|
|
||||||
**Scheduler Integration:** The 7th scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency.
|
**Scheduler Integration:** The opt-in digest scheduler loop runs on a configurable interval (default 24 hours). It does NOT run on startup; it waits for the first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency. See `docs/architecture.md` for the full scheduler topology (12 loops, 8 always-on + 4 opt-in).
|
||||||
|
|
||||||
Configuration:
|
Configuration:
|
||||||
|
|
||||||
@@ -1389,7 +1389,7 @@ curl -s -X DELETE http://localhost:8443/api/v1/network-scan-targets/nst-dmz
|
|||||||
|
|
||||||
### Scheduler Integration
|
### Scheduler Integration
|
||||||
|
|
||||||
When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs a 6th scheduler loop (alongside renewal, jobs, health, notifications, and short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health.
|
When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs the opt-in network scanner scheduler loop alongside the always-on loops (renewal, jobs, job retry, job timeout, agent health, notifications, notification retry, short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health. See `docs/architecture.md` for the full 12-loop scheduler topology.
|
||||||
|
|
||||||
### Use Cases
|
### Use Cases
|
||||||
|
|
||||||
@@ -1447,7 +1447,7 @@ Source path format: `gcp-sm://{project}/{secret-name}`. Sentinel agent: `cloud-g
|
|||||||
|
|
||||||
### Cloud Discovery Scheduler
|
### Cloud Discovery Scheduler
|
||||||
|
|
||||||
All enabled cloud sources run on a shared scheduler loop (9th loop). The interval is configurable:
|
All enabled cloud sources run on a shared opt-in cloud discovery scheduler loop (see `docs/architecture.md` for the full 12-loop scheduler topology). The interval is configurable:
|
||||||
|
|
||||||
| Variable | Description | Default |
|
| Variable | Description | Default |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
|
|||||||
+17
-11
@@ -50,14 +50,17 @@ docker compose -f deploy/docker-compose.yml up -d --build
|
|||||||
docker compose -f deploy/docker-compose.yml ps
|
docker compose -f deploy/docker-compose.yml ps
|
||||||
```
|
```
|
||||||
|
|
||||||
Open **http://localhost:8443** in your browser alongside your terminal. You'll watch changes appear in the dashboard as you make API calls.
|
Open **https://localhost:8443** in your browser alongside your terminal. The default compose stack ships a self-signed cert; your browser will show a warning the first time — click through (or trust `deploy/test/certs/ca.crt` in your OS keychain). You'll watch changes appear in the dashboard as you make API calls.
|
||||||
|
|
||||||
Set up a base variable for convenience:
|
Set up base variables for convenience:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
API="http://localhost:8443"
|
API="https://localhost:8443"
|
||||||
|
CA="$PWD/deploy/test/certs/ca.crt" # pin the self-signed CA for curl
|
||||||
```
|
```
|
||||||
|
|
||||||
|
Every `curl` in this guide uses `--cacert "$CA"` so the TLS handshake verifies against the compose-stack CA instead of the system trust store.
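One way to avoid repeating the flag on every call is a small wrapper function. The sketch below is an illustration only: `certctl_curl` is a hypothetical helper name, not something the repository ships, and it assumes `$API` and `$CA` point at the default compose stack as set above.

```bash
# Hypothetical helper (not part of certctl): wraps curl so every call
# verifies the server against the compose-stack CA bundle in $CA.
API="${API:-https://localhost:8443}"
CA="${CA:-$PWD/deploy/test/certs/ca.crt}"

certctl_curl() {
  # --cacert pins the CA bundle; -sS hides progress but still prints errors.
  curl --cacert "$CA" -sS "$@"
}

# Example call (requires the compose stack to be running):
#   certctl_curl "$API/health"
```

Because the function reads `$CA` at call time, pointing it at a different bundle later only requires reassigning the variable.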
|
||||||
|
|
||||||
## How the pieces fit together
|
## How the pieces fit together
|
||||||
|
|
||||||
Before we start, here's the high-level flow of what we're about to do:
|
Before we start, here's the high-level flow of what we're about to do:
|
||||||
@@ -730,7 +733,7 @@ Check the CRL (Certificate Revocation List) — served unauthenticated under the
|
|||||||
# DER-encoded X.509 CRL for the local CA (binary — pipe to openssl for inspection).
|
# DER-encoded X.509 CRL for the local CA (binary — pipe to openssl for inspection).
|
||||||
# Note: no -H "Authorization: Bearer ..." — the endpoint is deliberately
|
# Note: no -H "Authorization: Bearer ..." — the endpoint is deliberately
|
||||||
# unauthenticated. Content-Type is application/pkix-crl.
|
# unauthenticated. Content-Type is application/pkix-crl.
|
||||||
curl -s http://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
|
curl --cacert "$CA" -s https://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
|
||||||
openssl crl -inform DER -in /tmp/crl.der -text -noout
|
openssl crl -inform DER -in /tmp/crl.der -text -noout
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -740,7 +743,7 @@ Check OCSP status (RFC 6960, also unauthenticated, `application/ocsp-response`):
|
|||||||
# Replace SERIAL with the actual serial number from the certificate version.
|
# Replace SERIAL with the actual serial number from the certificate version.
|
||||||
# The embedded OCSP responder returns a signed DER response — parse it with
|
# The embedded OCSP responder returns a signed DER response — parse it with
|
||||||
# `openssl ocsp -respin` or similar tooling.
|
# `openssl ocsp -respin` or similar tooling.
|
||||||
curl -s http://localhost:8443/.well-known/pki/ocsp/iss-local/SERIAL -o /tmp/ocsp.der
|
curl --cacert "$CA" -s https://localhost:8443/.well-known/pki/ocsp/iss-local/SERIAL -o /tmp/ocsp.der
|
||||||
openssl ocsp -respin /tmp/ocsp.der -noverify -resp_text | head -40
|
openssl ocsp -respin /tmp/ocsp.der -noverify -resp_text | head -40
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -946,7 +949,8 @@ certctl includes a standalone CLI tool for command-line users:
|
|||||||
cd cmd/cli && go build -o certctl-cli .
|
cd cmd/cli && go build -o certctl-cli .
|
||||||
|
|
||||||
# Export credentials
|
# Export credentials
|
||||||
export CERTCTL_SERVER_URL="http://localhost:8443"
|
export CERTCTL_SERVER_URL="https://localhost:8443"
|
||||||
|
export CERTCTL_SERVER_CA_BUNDLE_PATH="$PWD/deploy/test/certs/ca.crt"
|
||||||
export CERTCTL_API_KEY="test-key-123"
|
export CERTCTL_API_KEY="test-key-123"
|
||||||
|
|
||||||
# List certificates (JSON or table format)
|
# List certificates (JSON or table format)
|
||||||
@@ -990,7 +994,8 @@ certctl exposes the full REST API via the Model Context Protocol (MCP), enabling
|
|||||||
cd cmd/mcp-server && go build -o mcp-server .
|
cd cmd/mcp-server && go build -o mcp-server .
|
||||||
|
|
||||||
# Export credentials
|
# Export credentials
|
||||||
export CERTCTL_SERVER_URL="http://localhost:8443"
|
export CERTCTL_SERVER_URL="https://localhost:8443"
|
||||||
|
export CERTCTL_SERVER_CA_BUNDLE_PATH="$PWD/deploy/test/certs/ca.crt"
|
||||||
export CERTCTL_API_KEY="test-key-123"
|
export CERTCTL_API_KEY="test-key-123"
|
||||||
|
|
||||||
# Start the MCP server (listens on stdin/stdout)
|
# Start the MCP server (listens on stdin/stdout)
|
||||||
@@ -1048,7 +1053,7 @@ docker compose -f deploy/docker-compose.yml run -e CERTCTL_DISCOVERY_DIRS=/tmp/c
|
|||||||
Or with the CLI flag:
|
Or with the CLI flag:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
certctl-agent --agent-id a-demo-1 --key-dir /tmp/keys --discovery-dirs /tmp/certs --server http://localhost:8443 --api-key test-key-123
|
certctl-agent --agent-id a-demo-1 --key-dir /tmp/keys --discovery-dirs /tmp/certs --server https://localhost:8443 --ca-bundle "$CA" --api-key test-key-123
|
||||||
```
|
```
|
||||||
|
|
||||||
### Network Discovery (Server-Side)
|
### Network Discovery (Server-Side)
|
||||||
@@ -1155,7 +1160,7 @@ flowchart TB
|
|||||||
API["REST API\nGo net/http"]
|
API["REST API\nGo net/http"]
|
||||||
SVC["Service Layer\nBusiness Logic"]
|
SVC["Service Layer\nBusiness Logic"]
|
||||||
REPO["Repository Layer\ndatabase/sql + lib/pq"]
|
REPO["Repository Layer\ndatabase/sql + lib/pq"]
|
||||||
SCHED["Scheduler\n7 background loops"]
|
SCHED["Scheduler\n12 background loops\n(8 always-on + 4 opt-in)"]
|
||||||
CONN["Connector Registry\nIssuer + Target + Notifier"]
|
CONN["Connector Registry\nIssuer + Target + Notifier"]
|
||||||
end
|
end
|
||||||
|
|
||||||
@@ -1191,7 +1196,8 @@ Here's a single script that runs the entire demo end-to-end. Save it as `demo.sh
|
|||||||
#!/bin/bash
|
#!/bin/bash
|
||||||
set -e
|
set -e
|
||||||
|
|
||||||
API="http://localhost:8443"
|
API="https://localhost:8443"
|
||||||
|
CA="$PWD/deploy/test/certs/ca.crt" # pin the self-signed CA for curl
|
||||||
BLUE='\033[0;34m'
|
BLUE='\033[0;34m'
|
||||||
GREEN='\033[0;32m'
|
GREEN='\033[0;32m'
|
||||||
YELLOW='\033[1;33m'
|
YELLOW='\033[1;33m'
|
||||||
@@ -1299,7 +1305,7 @@ echo " 5. Revoked the certificate with RFC 5280 reason codes"
|
|||||||
echo " 6. Checked dashboard stats and metrics"
|
echo " 6. Checked dashboard stats and metrics"
|
||||||
echo " 7. All actions recorded in the audit trail"
|
echo " 7. All actions recorded in the audit trail"
|
||||||
echo ""
|
echo ""
|
||||||
echo -e "Open ${GREEN}http://localhost:8443${NC} to see everything in the dashboard."
|
echo -e "Open ${GREEN}https://localhost:8443${NC} to see everything in the dashboard."
|
||||||
echo "Look for certificate: $CERT_ID"
|
echo "Look for certificate: $CERT_ID"
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
+17
-12
@@ -16,7 +16,7 @@ Complete reference of every feature shipped in certctl through v2.1.0 (April 202
|
|||||||
| Target connectors | 14 |
|
| Target connectors | 14 |
|
||||||
| Notifier connectors | 6 channels |
|
| Notifier connectors | 6 channels |
|
||||||
| Database tables | 21 (across 10 migrations) |
|
| Database tables | 21 (across 10 migrations) |
|
||||||
| Background scheduler loops | 7 |
|
| Background scheduler loops | 12 (8 always-on + 4 opt-in) |
|
||||||
| Web dashboard pages | 24 |
|
| Web dashboard pages | 24 |
|
||||||
| Test functions | 1850+ |
|
| Test functions | 1850+ |
|
||||||
| Supported platforms | linux/amd64, linux/arm64, darwin/amd64, darwin/arm64 |
|
| Supported platforms | linux/amd64, linux/arm64, darwin/amd64, darwin/arm64 |
|
||||||
@@ -903,7 +903,7 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor
|
|||||||
|
|
||||||
<!-- Source: internal/connector/discovery/awssm/, azurekv/, gcpsm/, internal/service/cloud_discovery.go -->
|
<!-- Source: internal/connector/discovery/awssm/, azurekv/, gcpsm/, internal/service/cloud_discovery.go -->
|
||||||
|
|
||||||
Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the 9th scheduler loop (6h default).
|
Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` for the full 12-loop scheduler topology).
|
||||||
|
|
||||||
**Supported sources:**
|
**Supported sources:**
|
||||||
|
|
||||||
@@ -1097,17 +1097,22 @@ Single SQL `UNION` query replaces the previous "fetch all, filter in Go" approac
|
|||||||
|
|
||||||
<!-- Source: internal/scheduler/scheduler.go -->
|
<!-- Source: internal/scheduler/scheduler.go -->
|
||||||
|
|
||||||
7 background loops, each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown.
|
12 background loops (8 always-on + 4 opt-in), each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown. Authoritative topology table lives in `docs/architecture.md`.
|
||||||
|
|
||||||
| Loop | Default Interval | Description |
|
| Loop | Default Interval | Always-on | Env Var | Description |
|
||||||
|---|---|---|
|
|---|---|---|---|---|
|
||||||
| Renewal check | 1 hour | Check expiring certs, query ARI, create renewal jobs |
|
| Renewal check | 1 hour | Yes | — | Check expiring certs, query ARI, create renewal jobs |
|
||||||
| Job processor | 30 seconds | Process pending jobs |
|
| Job processor | 30 seconds | Yes | — | Process pending jobs |
|
||||||
| Agent health check | 2 minutes | Check agent heartbeat staleness |
|
| Job retry | 5 minutes | Yes | `CERTCTL_SCHEDULER_RETRY_INTERVAL` | Retry Failed jobs (I-001) |
|
||||||
| Notification processor | 1 minute | Send queued notifications |
|
| Job timeout reaper | 10 minutes | Yes | `CERTCTL_JOB_TIMEOUT_INTERVAL` | Fail AwaitingCSR/AwaitingApproval jobs past timeout (I-003) |
|
||||||
| Short-lived expiry check | 30 seconds | Mark short-lived certs expired |
|
| Agent health check | 2 minutes | Yes | — | Check agent heartbeat staleness |
|
||||||
| Network scan | 6 hours | Run network discovery scans |
|
| Notification processor | 1 minute | Yes | — | Send queued notifications |
|
||||||
| Digest | 24 hours | Send certificate digest email (does not run on startup) |
|
| Notification retry | 2 minutes | Yes | `CERTCTL_NOTIFICATION_RETRY_INTERVAL` | Exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005) |
|
||||||
|
| Short-lived expiry check | 30 seconds | Yes | — | Mark short-lived certs expired |
|
||||||
|
| Network scan | 6 hours | Opt-in | `CERTCTL_NETWORK_SCAN_ENABLED` | Run network discovery scans |
|
||||||
|
| Digest | 24 hours | Opt-in | `CERTCTL_DIGEST_INTERVAL` | Send certificate digest email (does not run on startup) |
|
||||||
|
| Endpoint health | 60 seconds | Opt-in | `CERTCTL_HEALTH_CHECK_INTERVAL` | Continuous TLS health probes (M48) |
|
||||||
|
| Cloud discovery | 6 hours | Opt-in | `CERTCTL_CLOUD_DISCOVERY_INTERVAL` | Cloud secret manager certificate discovery (M50) |
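To make the notification retry row concrete, here is a rough sketch of exponential backoff with a dead-letter cutoff. Only the 5-attempt limit comes from the table; the doubling factor, the 120-second base delay, and the function name `next_retry_delay` are assumptions for illustration and may not match what `internal/scheduler/scheduler.go` actually does.

```bash
# Illustrative only: not certctl's actual implementation.
MAX_ATTEMPTS=5           # from the table: dead-letter after 5 attempts
BASE_DELAY_SECONDS=120   # assumed base delay

# Prints the delay in seconds before the next retry, or "dead-letter"
# once the number of failed attempts has reached MAX_ATTEMPTS.
next_retry_delay() {
  local attempts="$1"   # failed attempts so far (1, 2, 3, ...)
  if [ "$attempts" -ge "$MAX_ATTEMPTS" ]; then
    echo "dead-letter"
    return 0
  fi
  # Double the delay for each failed attempt: 120, 240, 480, 960.
  echo $(( BASE_DELAY_SECONDS * (1 << (attempts - 1)) ))
}

for n in 1 2 3 4 5; do
  echo "after $n failed attempt(s): $(next_retry_delay "$n")"
done
```

The point of the doubling is that a notifier that stays down is contacted less and less often, while a transient failure is still retried quickly.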
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|||||||
+11
-5
@@ -29,15 +29,18 @@ The binary has zero runtime dependencies beyond the certctl server it connects t
|
|||||||
|
|
||||||
## Configuration
|
## Configuration
|
||||||
|
|
||||||
The MCP server reads two environment variables:
|
The MCP server reads three environment variables:
|
||||||
|
|
||||||
| Variable | Required | Default | Description |
|
| Variable | Required | Default | Description |
|
||||||
|----------|----------|---------|-------------|
|
|----------|----------|---------|-------------|
|
||||||
| `CERTCTL_SERVER_URL` | No | `http://localhost:8443` | URL of the certctl REST API |
|
| `CERTCTL_SERVER_URL` | No | `https://localhost:8443` | URL of the certctl REST API (HTTPS-only as of v2.2) |
|
||||||
| `CERTCTL_API_KEY` | No | (empty) | API key for authentication (passed as `Bearer` token) |
|
| `CERTCTL_API_KEY` | No | (empty) | API key for authentication (passed as `Bearer` token) |
|
||||||
|
| `CERTCTL_SERVER_CA_BUNDLE_PATH` | Yes (for self-signed / internal CA) | (empty) | Path to PEM CA bundle that signed the server cert. Required when the server cert isn't rooted in the system trust store (the default compose stack ships a self-signed cert at `deploy/test/certs/ca.crt`). |
|
||||||
|
|
||||||
If your certctl server has auth enabled (the default), you must provide the API key. The MCP server passes it through to every HTTP request.
|
If your certctl server has auth enabled (the default), you must provide the API key. The MCP server passes it through to every HTTP request.
|
||||||
|
|
||||||
|
Since v2.2 the certctl control plane is HTTPS-only. If the server cert is self-signed or chained to an internal CA, set `CERTCTL_SERVER_CA_BUNDLE_PATH` so the MCP server can verify the TLS handshake. Never set `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true` outside local development — it disables all certificate validation.
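Before wiring the MCP server into a client, it can help to sanity-check the bundle path. The function below is a hypothetical pre-flight check, not part of certctl: it only confirms that the file is readable and contains at least one PEM certificate header, and it does not validate the certificate itself.

```bash
# Hypothetical pre-flight check (not shipped with certctl).
# Returns 0 if the file is readable and contains a PEM certificate header.
check_ca_bundle() {
  local path="$1"
  if [ ! -r "$path" ]; then
    echo "CA bundle not readable: $path" >&2
    return 1
  fi
  if ! grep -q -- '-----BEGIN CERTIFICATE-----' "$path"; then
    echo "no PEM certificate found in: $path" >&2
    return 1
  fi
  return 0
}

# Example: only report success when the bundle looks usable.
if check_ca_bundle "${CERTCTL_SERVER_CA_BUNDLE_PATH:-}"; then
  echo "CA bundle OK"
fi
```

A failure here usually means the path is wrong or the file is not PEM-encoded, which is cheaper to discover up front than from a TLS handshake error inside the MCP client.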
|
||||||
|
|
||||||
## Setting Up with Claude Desktop
|
## Setting Up with Claude Desktop
|
||||||
|
|
||||||
Add this to your Claude Desktop MCP configuration file (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS, `%APPDATA%\Claude\claude_desktop_config.json` on Windows):
|
Add this to your Claude Desktop MCP configuration file (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS, `%APPDATA%\Claude\claude_desktop_config.json` on Windows):
|
||||||
@@ -48,7 +51,8 @@ Add this to your Claude Desktop MCP configuration file (`~/Library/Application S
|
|||||||
"certctl": {
|
"certctl": {
|
||||||
"command": "/path/to/certctl-mcp",
|
"command": "/path/to/certctl-mcp",
|
||||||
"env": {
|
"env": {
|
||||||
"CERTCTL_SERVER_URL": "http://localhost:8443",
|
"CERTCTL_SERVER_URL": "https://localhost:8443",
|
||||||
|
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
|
||||||
"CERTCTL_API_KEY": "your-api-key-here"
|
"CERTCTL_API_KEY": "your-api-key-here"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -67,7 +71,8 @@ In Cursor, go to Settings → MCP Servers and add:
|
|||||||
"certctl": {
|
"certctl": {
|
||||||
"command": "/path/to/certctl-mcp",
|
"command": "/path/to/certctl-mcp",
|
||||||
"env": {
|
"env": {
|
||||||
"CERTCTL_SERVER_URL": "http://localhost:8443",
|
"CERTCTL_SERVER_URL": "https://localhost:8443",
|
||||||
|
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
|
||||||
"CERTCTL_API_KEY": "your-api-key-here"
|
"CERTCTL_API_KEY": "your-api-key-here"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -84,7 +89,8 @@ Add certctl as an MCP server in your project's `.mcp.json`:
|
|||||||
"certctl": {
|
"certctl": {
|
||||||
"command": "/path/to/certctl-mcp",
|
"command": "/path/to/certctl-mcp",
|
||||||
"env": {
|
"env": {
|
||||||
"CERTCTL_SERVER_URL": "http://localhost:8443",
|
"CERTCTL_SERVER_URL": "https://localhost:8443",
|
||||||
|
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
|
||||||
"CERTCTL_API_KEY": "your-api-key-here"
|
"CERTCTL_API_KEY": "your-api-key-here"
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -34,7 +34,7 @@ cd certctl/deploy
|
|||||||
docker compose up -d
|
docker compose up -d
|
||||||
```
|
```
|
||||||
|
|
||||||
Access the dashboard at `http://localhost:8443` with API key from `.env` file.
|
Access the dashboard at `https://localhost:8443` with the API key from `.env`. The default compose stack ships a self-signed cert; when calling the API from the host, pin it with `--cacert deploy/test/certs/ca.crt` (path relative to the repository root).
|
||||||
|
|
||||||
### 2. Deploy Agents
|
### 2. Deploy Agents
|
||||||
|
|
||||||
|
|||||||
@@ -22,7 +22,7 @@ Option A: Docker Compose (quickest for evaluation)
|
|||||||
```bash
|
```bash
|
||||||
cd /opt/certctl
|
cd /opt/certctl
|
||||||
docker compose up -d
|
docker compose up -d
|
||||||
# Dashboard & API: http://localhost:8443
|
# Dashboard & API: https://localhost:8443 (self-signed cert — use --cacert ./deploy/test/certs/ca.crt for the default compose stack)
|
||||||
# Default API key in logs (docker logs certctl-server 2>&1 | grep CERTCTL_API_KEY)
|
# Default API key in logs (docker logs certctl-server 2>&1 | grep CERTCTL_API_KEY)
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -45,7 +45,8 @@ chmod +x /usr/local/bin/certctl-agent
|
|||||||
# Create config
|
# Create config
|
||||||
sudo mkdir -p /etc/certctl /var/lib/certctl/keys
|
sudo mkdir -p /etc/certctl /var/lib/certctl/keys
|
||||||
sudo tee /etc/certctl/agent.env > /dev/null <<EOF
|
sudo tee /etc/certctl/agent.env > /dev/null <<EOF
|
||||||
CERTCTL_SERVER_URL=http://certctl-control-plane.example.com:8443
|
CERTCTL_SERVER_URL=https://certctl-control-plane.example.com:8443
|
||||||
|
CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/tls/ca.crt
|
||||||
CERTCTL_API_KEY=your-api-key-here
|
CERTCTL_API_KEY=your-api-key-here
|
||||||
CERTCTL_DISCOVERY_DIRS=/etc/letsencrypt/live
|
CERTCTL_DISCOVERY_DIRS=/etc/letsencrypt/live
|
||||||
CERTCTL_KEY_DIR=/var/lib/certctl/keys
|
CERTCTL_KEY_DIR=/var/lib/certctl/keys
|
||||||
|
|||||||
+9
-4
@@ -68,8 +68,10 @@ The spec organizes endpoints into 16 tags:
|
|||||||
The spec declares a `bearerAuth` security scheme applied globally. All endpoints under `/api/v1/` require a Bearer token by default:
|
The spec declares a `bearerAuth` security scheme applied globally. All endpoints under `/api/v1/` require a Bearer token by default:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -H "Authorization: Bearer your-api-key" \
|
# The default compose stack uses a self-signed cert; pin with --cacert
|
||||||
http://localhost:8443/api/v1/certificates
|
curl --cacert ./deploy/test/certs/ca.crt \
|
||||||
|
-H "Authorization: Bearer your-api-key" \
|
||||||
|
https://localhost:8443/api/v1/certificates
|
||||||
```
|
```
|
||||||
|
|
||||||
Three endpoints are exempt from auth (declared with `security: []` in the spec): `/health`, `/ready`, and `/api/v1/auth/info`. The auth info endpoint tells clients whether authentication is enabled and what type is required — useful for GUIs that need to show/hide a login screen.
|
Three endpoints are exempt from auth (declared with `security: []` in the spec): `/health`, `/ready`, and `/api/v1/auth/info`. The auth info endpoint tells clients whether authentication is enabled and what type is required — useful for GUIs that need to show/hide a login screen.
|
||||||
@@ -150,8 +152,9 @@ Import the spec directly into Postman:
|
|||||||
|
|
||||||
1. Open Postman → Import → File → select `api/openapi.yaml`
|
1. Open Postman → Import → File → select `api/openapi.yaml`
|
||||||
2. Postman creates a collection with all 78 documented operations organized by tag
|
2. Postman creates a collection with all 78 documented operations organized by tag
|
||||||
3. Set the `baseUrl` variable to `http://localhost:8443`
|
3. Set the `baseUrl` variable to `https://localhost:8443` (HTTPS-only as of v2.2)
|
||||||
4. Add an `Authorization: Bearer your-api-key` header to the collection
|
4. Add an `Authorization: Bearer your-api-key` header to the collection
|
||||||
|
5. Import the demo stack CA bundle (`deploy/test/certs/ca.crt`) into Postman's Settings → Certificates → CA Certificates, or disable certificate verification for the `localhost` host (Settings → General → SSL certificate verification)
|
||||||
|
|
||||||
## Key Schemas
|
## Key Schemas
|
||||||
|
|
||||||
@@ -176,8 +179,10 @@ Use the spec to generate contract tests that verify the API matches the spec:
|
|||||||
```bash
|
```bash
|
||||||
# Using schemathesis for fuzz testing against the spec
|
# Using schemathesis for fuzz testing against the spec
|
||||||
pip install schemathesis
|
pip install schemathesis
|
||||||
|
# The default compose stack uses a self-signed cert, so point REQUESTS_CA_BUNDLE at its CA bundle
|
||||||
|
export REQUESTS_CA_BUNDLE="$(pwd)/deploy/test/certs/ca.crt"
|
||||||
schemathesis run api/openapi.yaml \
|
schemathesis run api/openapi.yaml \
|
||||||
--base-url http://localhost:8443 \
|
--base-url https://localhost:8443 \
|
||||||
--header "Authorization: Bearer your-api-key"
|
--header "Authorization: Bearer your-api-key"
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|||||||
@@ -85,10 +85,12 @@ go test -tags qa -v -timeout 10m ./...
|
|||||||
|
|
||||||
| Variable | Default | Description |
|
| Variable | Default | Description |
|
||||||
|---|---|---|
|
|---|---|---|
|
||||||
| `CERTCTL_QA_SERVER_URL` | `http://localhost:8443` | certctl server URL |
|
| `CERTCTL_QA_SERVER_URL` | `https://localhost:8443` | certctl server URL (HTTPS-only as of v2.2) |
|
||||||
| `CERTCTL_QA_API_KEY` | `change-me-in-production` | API key for Bearer auth |
|
| `CERTCTL_QA_API_KEY` | `change-me-in-production` | API key for Bearer auth |
|
||||||
| `CERTCTL_QA_DB_URL` | `postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable` | PostgreSQL connection string |
|
| `CERTCTL_QA_DB_URL` | `postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable` | PostgreSQL connection string |
|
||||||
| `CERTCTL_QA_REPO_DIR` | `../..` | Path to certctl repo root (for source file checks) |
|
| `CERTCTL_QA_REPO_DIR` | `../..` | Path to certctl repo root (for source file checks) |
|
||||||
|
| `CERTCTL_QA_CA_BUNDLE` | `./certs/ca.crt` | PEM CA bundle pinned for TLS verification. The demo stack's `certctl-tls-init` container writes here. |
|
||||||
|
| `CERTCTL_QA_INSECURE` | `false` | Set to `"true"` to skip TLS verification (e.g. before the init container finishes). Never use outside the demo harness. |
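The two TLS variables map naturally onto curl flags when poking the server by hand. The sketch below is an assumption about how one might mirror them in a shell session; `qa_tls_flags` is a hypothetical helper, and the QA suite itself may resolve these variables differently.

```bash
# Hypothetical helper: prints the curl TLS flags implied by the QA variables,
# one argument per line so paths containing spaces survive.
qa_tls_flags() {
  if [ "${CERTCTL_QA_INSECURE:-false}" = "true" ]; then
    # Skips verification entirely; demo harness only.
    printf '%s\n' "--insecure"
  else
    printf '%s\n' "--cacert" "${CERTCTL_QA_CA_BUNDLE:-./certs/ca.crt}"
  fi
}

# Example call (requires a running server; mapfile needs bash 4+):
#   mapfile -t flags < <(qa_tls_flags)
#   curl "${flags[@]}" -s "${CERTCTL_QA_SERVER_URL:-https://localhost:8443}/health"
```

Defaulting to the pinned bundle keeps the insecure path an explicit opt-in, in line with the table's warning.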
|
||||||
|
|
||||||
## Part-by-Part Coverage Map
|
## Part-by-Part Coverage Map
|
||||||
|
|
||||||
@@ -256,8 +258,8 @@ docker compose -f docker-compose.yml -f docker-compose.demo.yml ps
|
|||||||
# Check server logs
|
# Check server logs
|
||||||
docker compose -f docker-compose.yml -f docker-compose.demo.yml logs certctl-server
|
docker compose -f docker-compose.yml -f docker-compose.demo.yml logs certctl-server
|
||||||
|
|
||||||
# Check if the port is exposed
|
# Check if the port is exposed (self-signed cert — pin CA bundle)
|
||||||
curl -s http://localhost:8443/health
|
curl --cacert ./deploy/test/certs/ca.crt -s https://localhost:8443/health
|
||||||
```
|
```
|
||||||
|
|
||||||
### "connect to QA DB" failure
|
### "connect to QA DB" failure
|
||||||
|
|||||||
+59
-46
@@ -105,16 +105,24 @@ certctl-server Up (healthy)
|
|||||||
certctl-agent Up
|
certctl-agent Up
|
||||||
```
|
```
|
||||||
|
|
||||||
|
The control plane is HTTPS-only as of v2.2. The `certctl-tls-init` init container in the shipped `deploy/docker-compose.yml` self-signs a cert on first boot and drops it into a named volume. Extract the CA bundle once and reuse it for every API call in this guide:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl http://localhost:8443/health
|
export CA=/tmp/certctl-ca.crt
|
||||||
|
docker compose -f deploy/docker-compose.yml exec -T certctl-server \
|
||||||
|
cat /etc/certctl/tls/ca.crt > "$CA"
|
||||||
|
|
||||||
|
curl --cacert "$CA" https://localhost:8443/health
|
||||||
```
|
```
|
||||||
```json
|
```json
|
||||||
{"status":"healthy"}
|
{"status":"healthy"}
|
||||||
```
|
```
|
||||||
|
|
||||||
|
If you're bringing your own cert (internal CA, cert-manager, operator-supplied Secret), see [`docs/tls.md`](tls.md) for the full provisioning matrix. If you're cutting over an existing install, see [`docs/upgrade-to-tls.md`](upgrade-to-tls.md) for the failure modes (out-of-date `http://…` agents fail at the TLS handshake) and the one-step procedure.
|
||||||
|
|
||||||
## Open the Dashboard
|
## Open the Dashboard
|
||||||
|
|
||||||
Open **http://localhost:8443** in your browser.
|
Open **https://localhost:8443** in your browser. Your browser will warn about the self-signed cert — that's expected for the demo bootstrap. Trust the CA bundle you just exported, or click through the warning.
|
||||||
|
|
||||||
> **Note:** The Docker Compose demo runs with authentication disabled (`CERTCTL_AUTH_TYPE=none`) so you can explore immediately. For production, set `CERTCTL_AUTH_TYPE=api-key` and `CERTCTL_AUTH_SECRET=<your-secret>` in your environment, then pass `Authorization: Bearer <your-secret>` on all API requests. The dashboard will prompt for your API key on first load.
|
> **Note:** The Docker Compose demo runs with authentication disabled (`CERTCTL_AUTH_TYPE=none`) so you can explore immediately. For production, set `CERTCTL_AUTH_TYPE=api-key` and `CERTCTL_AUTH_SECRET=<your-secret>` in your environment, then pass `Authorization: Bearer <your-secret>` on all API requests. The dashboard will prompt for your API key on first load.
|
||||||
>
|
>
|
||||||
@@ -154,62 +162,64 @@ Everything you see in the dashboard is backed by the REST API. All endpoints liv
|
|||||||
|
|
||||||
### Core operations
|
### Core operations
|
||||||
|
|
||||||
|
Every request below uses `--cacert "$CA"` to pin the self-signed CA bundle extracted above. In production, point `$CA` at your internal CA root or the bundle you distributed to the fleet.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# List all certificates
|
# List all certificates
|
||||||
curl -s http://localhost:8443/api/v1/certificates | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/certificates | jq .
|
||||||
|
|
||||||
# Filter by status
|
# Filter by status
|
||||||
curl -s "http://localhost:8443/api/v1/certificates?status=Expiring" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?status=Expiring" | jq .
|
||||||
|
|
||||||
# Filter by environment
|
# Filter by environment
|
||||||
curl -s "http://localhost:8443/api/v1/certificates?environment=production" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?environment=production" | jq .
|
||||||
|
|
||||||
# Get a specific certificate
|
# Get a specific certificate
|
||||||
curl -s http://localhost:8443/api/v1/certificates/mc-api-prod | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/certificates/mc-api-prod | jq .
|
||||||
|
|
||||||
# Get deployment targets for a certificate
|
# Get deployment targets for a certificate
|
||||||
curl -s http://localhost:8443/api/v1/certificates/mc-api-prod/deployments | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/certificates/mc-api-prod/deployments | jq .
|
||||||
|
|
||||||
# List agents
|
# List agents
|
||||||
curl -s http://localhost:8443/api/v1/agents | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/agents | jq .
|
||||||
|
|
||||||
# Check agent pending work
|
# Check agent pending work
|
||||||
curl -s http://localhost:8443/api/v1/agents/ag-web-prod/work | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/agents/ag-web-prod/work | jq .
|
||||||
|
|
||||||
# View audit trail
|
# View audit trail
|
||||||
curl -s http://localhost:8443/api/v1/audit | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/audit | jq .
|
||||||
|
|
||||||
# View policies and violations
|
# View policies and violations
|
||||||
curl -s http://localhost:8443/api/v1/policies | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/policies | jq .
|
||||||
curl -s http://localhost:8443/api/v1/policies/pr-require-owner/violations | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/policies/pr-require-owner/violations | jq .
|
||||||
|
|
||||||
# Notifications
|
# Notifications
|
||||||
curl -s http://localhost:8443/api/v1/notifications | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/notifications | jq .
|
||||||
|
|
||||||
# Profiles and agent groups
|
# Profiles and agent groups
|
||||||
curl -s http://localhost:8443/api/v1/profiles | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/profiles | jq .
|
||||||
curl -s http://localhost:8443/api/v1/agent-groups | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/agent-groups | jq .
|
||||||
```
|
```
|
||||||
|
|
||||||
### Sorting, filtering, and pagination
|
### Sorting, filtering, and pagination
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Sort by expiration date (ascending)
|
# Sort by expiration date (ascending)
|
||||||
curl -s "http://localhost:8443/api/v1/certificates?sort=notAfter" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?sort=notAfter" | jq .
|
||||||
|
|
||||||
# Sort descending (prefix with -)
|
# Sort descending (prefix with -)
|
||||||
curl -s "http://localhost:8443/api/v1/certificates?sort=-createdAt" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?sort=-createdAt" | jq .
|
||||||
|
|
||||||
# Time-range filters (RFC3339)
|
# Time-range filters (RFC3339)
|
||||||
curl -s "http://localhost:8443/api/v1/certificates?expires_before=2026-05-01T00:00:00Z" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?expires_before=2026-05-01T00:00:00Z" | jq .
|
||||||
curl -s "http://localhost:8443/api/v1/certificates?created_after=2026-03-01T00:00:00Z" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?created_after=2026-03-01T00:00:00Z" | jq .
|
||||||
|
|
||||||
# Sparse fields — request only what you need
|
# Sparse fields — request only what you need
|
||||||
curl -s "http://localhost:8443/api/v1/certificates?fields=id,common_name,status,expires_at" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?fields=id,common_name,status,expires_at" | jq .
|
||||||
|
|
||||||
# Cursor pagination — efficient for large inventories
|
# Cursor pagination — efficient for large inventories
|
||||||
curl -s "http://localhost:8443/api/v1/certificates?page_size=5" | jq '{next_cursor: .next_cursor, count: (.data | length)}'
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?page_size=5" | jq '{next_cursor: .next_cursor, count: (.data | length)}'
|
||||||
curl -s "http://localhost:8443/api/v1/certificates?cursor=<next_cursor_value>&page_size=5" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?cursor=<next_cursor_value>&page_size=5" | jq .
|
||||||
```
|
```
|
||||||
|
|
||||||
Supported sort fields: `notAfter`, `expiresAt`, `createdAt`, `updatedAt`, `commonName`, `name`, `status`, `environment`.
|
Supported sort fields: `notAfter`, `expiresAt`, `createdAt`, `updatedAt`, `commonName`, `name`, `status`, `environment`.
|
||||||
@@ -218,22 +228,22 @@ Supported sort fields: `notAfter`, `expiresAt`, `createdAt`, `updatedAt`, `commo
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Dashboard summary
|
# Dashboard summary
|
||||||
curl -s http://localhost:8443/api/v1/stats/summary | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/stats/summary | jq .
|
||||||
|
|
||||||
# Certificates by status
|
# Certificates by status
|
||||||
curl -s http://localhost:8443/api/v1/stats/certificates-by-status | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/stats/certificates-by-status | jq .
|
||||||
|
|
||||||
# Expiration timeline (next 90 days)
|
# Expiration timeline (next 90 days)
|
||||||
curl -s "http://localhost:8443/api/v1/stats/expiration-timeline?days=90" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/stats/expiration-timeline?days=90" | jq .
|
||||||
|
|
||||||
# Job trends (last 30 days)
|
# Job trends (last 30 days)
|
||||||
curl -s "http://localhost:8443/api/v1/stats/job-trends?days=30" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/stats/job-trends?days=30" | jq .
|
||||||
|
|
||||||
# JSON metrics
|
# JSON metrics
|
||||||
curl -s http://localhost:8443/api/v1/metrics | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/metrics | jq .
|
||||||
|
|
||||||
# Prometheus format (for Prometheus, Grafana Agent, Datadog)
|
# Prometheus format (for Prometheus, Grafana Agent, Datadog)
|
||||||
curl -s http://localhost:8443/api/v1/metrics/prometheus
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/metrics/prometheus
|
||||||
```
|
```
|
||||||
|
|
||||||
## Create Your First Certificate
|
## Create Your First Certificate
|
||||||
@@ -241,7 +251,7 @@ curl -s http://localhost:8443/api/v1/metrics/prometheus
|
|||||||
Create a certificate record that certctl will track, renew, and deploy automatically.
|
Create a certificate record that certctl will track, renew, and deploy automatically.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
"name": "My First Certificate",
|
"name": "My First Certificate",
|
||||||
@@ -264,22 +274,22 @@ CERT_ID="<paste the id from the response>"
|
|||||||
|
|
||||||
Trigger renewal:
|
Trigger renewal:
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates/$CERT_ID/renew | jq .
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/$CERT_ID/renew | jq .
|
||||||
```
|
```
|
||||||
|
|
||||||
Check the result:
|
Check the result:
|
||||||
```bash
|
```bash
|
||||||
curl -s http://localhost:8443/api/v1/certificates/$CERT_ID | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/certificates/$CERT_ID | jq .
|
||||||
```
|
```
|
||||||
|
|
||||||
Refresh the dashboard at http://localhost:8443 — your new certificate appears in the inventory.
|
Refresh the dashboard at https://localhost:8443 — your new certificate appears in the inventory.
|
||||||
|
|
||||||
### Revoke a certificate
|
### Revoke a certificate
|
||||||
|
|
||||||
When a private key is compromised or a service is decommissioned:
|
When a private key is compromised or a service is decommissioned:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates/$CERT_ID/revoke \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/$CERT_ID/revoke \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{"reason": "superseded"}' | jq .
|
-d '{"reason": "superseded"}' | jq .
|
||||||
```
|
```
|
||||||
@@ -289,7 +299,8 @@ Supported RFC 5280 reason codes: `unspecified`, `keyCompromise`, `caCompromise`,
|
|||||||
Confirm via the unauthenticated DER CRL (RFC 5280 §5, RFC 8615):
|
Confirm via the unauthenticated DER CRL (RFC 5280 §5, RFC 8615):
|
||||||
```bash
|
```bash
|
||||||
# Fetch the CRL without any API key — relying parties shouldn't need one.
|
# Fetch the CRL without any API key — relying parties shouldn't need one.
|
||||||
curl -s http://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
|
# The CRL path is unauthenticated, but it's still served over TLS.
|
||||||
|
curl --cacert "$CA" -s https://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
|
||||||
openssl crl -inform der -in /tmp/crl.der -noout -text | head -40
|
openssl crl -inform der -in /tmp/crl.der -noout -text | head -40
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -299,15 +310,15 @@ For high-value certificates where you want human oversight. The demo includes 2
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
# List jobs awaiting approval (demo includes 2)
|
# List jobs awaiting approval (demo includes 2)
|
||||||
curl -s "http://localhost:8443/api/v1/jobs?status=AwaitingApproval" | jq '.data[] | {id, certificate_id, status}'
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/jobs?status=AwaitingApproval" | jq '.data[] | {id, certificate_id, status}'
|
||||||
|
|
||||||
# Approve a pending job
|
# Approve a pending job
|
||||||
curl -s -X POST http://localhost:8443/api/v1/jobs/JOB_ID/approve \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/jobs/JOB_ID/approve \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{"reason": "Approved for production deployment"}' | jq .
|
-d '{"reason": "Approved for production deployment"}' | jq .
|
||||||
|
|
||||||
# Reject a pending job
|
# Reject a pending job
|
||||||
curl -s -X POST http://localhost:8443/api/v1/jobs/JOB_ID/reject \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/jobs/JOB_ID/reject \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{"reason": "Key type does not meet compliance requirements"}' | jq .
|
-d '{"reason": "Key type does not meet compliance requirements"}' | jq .
|
||||||
```
|
```
|
||||||
@@ -333,7 +344,7 @@ export CERTCTL_DISCOVERY_DIRS="/etc/nginx/certs,/etc/ssl/certs,/var/lib/certs"
|
|||||||
export CERTCTL_NETWORK_SCAN_ENABLED=true
|
export CERTCTL_NETWORK_SCAN_ENABLED=true
|
||||||
|
|
||||||
# Create a scan target
|
# Create a scan target
|
||||||
curl -s -X POST http://localhost:8443/api/v1/network-scan-targets \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/network-scan-targets \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
"name": "Internal Network",
|
"name": "Internal Network",
|
||||||
@@ -345,20 +356,20 @@ curl -s -X POST http://localhost:8443/api/v1/network-scan-targets \
|
|||||||
}' | jq .
|
}' | jq .
|
||||||
|
|
||||||
# Trigger an immediate scan
|
# Trigger an immediate scan
|
||||||
curl -s -X POST http://localhost:8443/api/v1/network-scan-targets/nst-internal-network/scan | jq .
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/network-scan-targets/nst-internal-network/scan | jq .
|
||||||
```
|
```
|
||||||
|
|
||||||
### Triage discovered certificates
|
### Triage discovered certificates
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# List discovered certs
|
# List discovered certs
|
||||||
curl -s "http://localhost:8443/api/v1/discovered-certificates?agent_id=agent-nginx-prod" | jq .
|
curl --cacert "$CA" -s "https://localhost:8443/api/v1/discovered-certificates?agent_id=agent-nginx-prod" | jq .
|
||||||
|
|
||||||
# Summary counts
|
# Summary counts
|
||||||
curl -s http://localhost:8443/api/v1/discovery-summary | jq .
|
curl --cacert "$CA" -s https://localhost:8443/api/v1/discovery-summary | jq .
|
||||||
|
|
||||||
# Claim a discovered cert (bring under management)
|
# Claim a discovered cert (bring under management)
|
||||||
curl -s -X POST "http://localhost:8443/api/v1/discovered-certificates/DISCOVERY_ID/claim" \
|
curl --cacert "$CA" -s -X POST "https://localhost:8443/api/v1/discovered-certificates/DISCOVERY_ID/claim" \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{"managed_certificate_id": "mc-api-prod"}' | jq .
|
-d '{"managed_certificate_id": "mc-api-prod"}' | jq .
|
||||||
```
|
```
|
||||||
@@ -368,8 +379,9 @@ curl -s -X POST "http://localhost:8443/api/v1/discovered-certificates/DISCOVERY_
|
|||||||
```bash
|
```bash
|
||||||
cd cmd/cli && go build -o certctl-cli .
|
cd cmd/cli && go build -o certctl-cli .
|
||||||
|
|
||||||
export CERTCTL_SERVER_URL="http://localhost:8443"
|
export CERTCTL_SERVER_URL="https://localhost:8443"
|
||||||
export CERTCTL_API_KEY="test-key-123"
|
export CERTCTL_API_KEY="test-key-123"
|
||||||
|
export CERTCTL_SERVER_CA_BUNDLE_PATH="$CA" # or pass --ca-bundle; --insecure for dev self-signed
|
||||||
|
|
||||||
./certctl-cli certs list # List certificates
|
./certctl-cli certs list # List certificates
|
||||||
./certctl-cli certs get mc-api-prod # Certificate details
|
./certctl-cli certs get mc-api-prod # Certificate details
|
||||||
@@ -402,10 +414,10 @@ export CERTCTL_DIGEST_RECIPIENTS=ops@example.com,security@example.com
|
|||||||
|
|
||||||
Preview the digest HTML before enabling scheduled delivery:
|
Preview the digest HTML before enabling scheduled delivery:
|
||||||
```bash
|
```bash
|
||||||
curl http://localhost:8443/api/v1/digest/preview | jq '.html' | grep -o '<html>' # Shows HTML is ready
|
curl --cacert "$CA" https://localhost:8443/api/v1/digest/preview | jq '.html' | grep -o '<html>' # Shows HTML is ready
|
||||||
|
|
||||||
# Trigger a digest send immediately (outside of schedule)
|
# Trigger a digest send immediately (outside of schedule)
|
||||||
curl -X POST http://localhost:8443/api/v1/digest/send
|
curl --cacert "$CA" -X POST https://localhost:8443/api/v1/digest/send
|
||||||
```
|
```
|
||||||
|
|
||||||
If no recipients are configured (`CERTCTL_DIGEST_RECIPIENTS` empty), the digest falls back to certificate owner emails. Digests include total certificates, expiring soon, expired, active agents, completed/failed jobs (30-day summary), and a table of expiring certs color-coded by urgency (7/14/30 days).
|
If no recipients are configured (`CERTCTL_DIGEST_RECIPIENTS` empty), the digest falls back to certificate owner emails. Digests include total certificates, expiring soon, expired, active agents, completed/failed jobs (30-day summary), and a table of expiring certs color-coded by urgency (7/14/30 days).
|
||||||
@@ -415,8 +427,9 @@ If no recipients are configured (`CERTCTL_DIGEST_RECIPIENTS` empty), the digest
|
|||||||
```bash
|
```bash
|
||||||
cd cmd/mcp-server && go build -o mcp-server .
|
cd cmd/mcp-server && go build -o mcp-server .
|
||||||
|
|
||||||
export CERTCTL_SERVER_URL="http://localhost:8443"
|
export CERTCTL_SERVER_URL="https://localhost:8443"
|
||||||
export CERTCTL_API_KEY="test-key-123"
|
export CERTCTL_API_KEY="test-key-123"
|
||||||
|
export CERTCTL_SERVER_CA_BUNDLE_PATH="$CA" # MCP is env-vars-only; no CLI flags
|
||||||
|
|
||||||
./mcp-server
|
./mcp-server
|
||||||
```
|
```
|
||||||
|
|||||||
+64
-47
@@ -16,7 +16,7 @@ You'll start 7 Docker containers that talk to each other:
|
|||||||
| **pebble-challtestsrv** | DNS/HTTP challenge test server for Pebble | 10.30.50.3 | Not directly — Pebble talks to it |
|
| **pebble-challtestsrv** | DNS/HTTP challenge test server for Pebble | 10.30.50.3 | Not directly — Pebble talks to it |
|
||||||
| **Pebble** | A fake Let's Encrypt (tests the ACME protocol without touching the real internet) | 10.30.50.4 | Not directly — the server talks to it |
|
| **Pebble** | A fake Let's Encrypt (tests the ACME protocol without touching the real internet) | 10.30.50.4 | Not directly — the server talks to it |
|
||||||
| **step-ca** | A private Certificate Authority (think: your company's internal CA) | 10.30.50.5 | Not directly — the server talks to it |
|
| **step-ca** | A private Certificate Authority (think: your company's internal CA) | 10.30.50.5 | Not directly — the server talks to it |
|
||||||
| **certctl-server** | The brain. API + web dashboard + scheduler + ACME challenge server | 10.30.50.6 | **http://localhost:8443** |
|
| **certctl-server** | The brain. API + web dashboard + scheduler + ACME challenge server | 10.30.50.6 | **https://localhost:8443** (self-signed — see CA-bundle note below) |
|
||||||
| **NGINX** | A web server. The agent deploys certificates here. | 10.30.50.7 | **https://localhost:8444** |
|
| **NGINX** | A web server. The agent deploys certificates here. | 10.30.50.7 | **https://localhost:8444** |
|
||||||
| **certctl-agent** | The hands. Generates keys, deploys certs to NGINX | 10.30.50.8 | Not directly — it talks to the server |
|
| **certctl-agent** | The hands. Generates keys, deploys certs to NGINX | 10.30.50.8 | Not directly — it talks to the server |
|
||||||
|
|
||||||
@@ -123,7 +123,7 @@ docker compose -f docker-compose.test.yml up --build
|
|||||||
|
|
||||||
```
|
```
|
||||||
certctl-test-server | {"level":"INFO","msg":"server started","address":"0.0.0.0:8443"}
|
certctl-test-server | {"level":"INFO","msg":"server started","address":"0.0.0.0:8443"}
|
||||||
certctl-test-agent | {"level":"INFO","msg":"agent starting","server_url":"http://certctl-server:8443"}
|
certctl-test-agent | {"level":"INFO","msg":"agent starting","server_url":"https://certctl-server:8443"}
|
||||||
certctl-test-stepca | Serving HTTPS on :9000 ...
|
certctl-test-stepca | Serving HTTPS on :9000 ...
|
||||||
certctl-test-pebble | Listening on: 0.0.0.0:14000
|
certctl-test-pebble | Listening on: 0.0.0.0:14000
|
||||||
```
|
```
|
||||||
@@ -159,13 +159,29 @@ certctl-test-stepca Up (healthy)
|
|||||||
|
|
||||||
**If certctl-test-server says "Restarting"**: It probably started before step-ca or Pebble were ready. Wait 30 seconds and check again. If it keeps restarting, see [Troubleshooting](#troubleshooting).
|
**If certctl-test-server says "Restarting"**: It probably started before step-ca or Pebble were ready. Wait 30 seconds and check again. If it keeps restarting, see [Troubleshooting](#troubleshooting).
|
||||||
|
|
||||||
|
### Get the CA bundle for curl
|
||||||
|
|
||||||
|
The test harness runs HTTPS-only (the `certctl-tls-init` init container self-signs an ed25519 server cert into a bind-mounted directory before the server starts — see `docker-compose.test.yml` §`certctl-tls-init` for details). The CA cert that signed it is materialized on the host at `./test/certs/ca.crt` (relative to the `deploy/` directory). Every `curl` in the rest of this doc expects it in `$CA`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
export CA=$PWD/test/certs/ca.crt
|
||||||
|
ls -la "$CA" # sanity check: file should exist and be non-empty
|
||||||
|
curl --cacert "$CA" -f https://localhost:8443/health
|
||||||
|
```
|
||||||
|
|
||||||
|
Expect `{"status":"ok"}`. If `curl` errors with `SSL certificate problem: unable to get local issuer certificate`, the init container hasn't finished yet — wait a few seconds and retry. If the file doesn't exist at all, the bind mount didn't populate; `docker compose -f docker-compose.test.yml logs certctl-tls-init` should show the self-sign ran.
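
The wait-and-retry advice above can be scripted instead of done by hand. A minimal sketch, assuming bash: `retry` is a hypothetical helper written for this example, not something the test harness provides, and the attempt count and delay are arbitrary.

```bash
# Hypothetical helper; not part of certctl or the test harness.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
# Runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0                           # success: stop retrying
    [ "$i" -lt "$attempts" ] && sleep "$delay" # pause before the next try
    i=$((i + 1))
  done
  return 1                                     # every attempt failed
}

# Example (needs the test stack running):
#   retry 10 3 curl --cacert "$CA" -sf https://localhost:8443/health
```

With those example values the health check is tried up to 10 times, 3 seconds apart, and the helper returns non-zero if the server never answers.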
|
||||||
|
|
||||||
|
For a full explanation of the cert provisioning patterns (self-signed bootstrap, operator-supplied, cert-manager), see [`tls.md`](tls.md). For the one-step cutover from the old plaintext test harness to HTTPS, see [`upgrade-to-tls.md`](upgrade-to-tls.md).
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Step 2: Open the Dashboard
|
## Step 2: Open the Dashboard
|
||||||
|
|
||||||
Open your web browser and go to:
|
Open your web browser and go to:
|
||||||
|
|
||||||
**http://localhost:8443**
|
**https://localhost:8443**
|
||||||
|
|
||||||
|
Your browser will warn you that the cert is self-signed ("Your connection is not private" / "NET::ERR_CERT_AUTHORITY_INVALID"). That's expected for the test harness — the CA that signed the cert lives at `deploy/test/certs/ca.crt` and isn't in your system trust store. Click through the warning (Chrome: "Advanced" → "Proceed"; Firefox: "Accept the Risk"; Safari: "Show Details" → "visit this website").
|
||||||
|
|
||||||
You'll see a login screen asking for an API key. Enter:
|
You'll see a login screen asking for an API key. Enter:
|
||||||
|
|
||||||
@@ -198,12 +214,13 @@ Go back to your second terminal. Let's verify the data loaded correctly.
|
|||||||
### Check the agent
|
### Check the agent
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
http://localhost:8443/api/v1/agents | python3 -m json.tool
|
https://localhost:8443/api/v1/agents | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
**What this command does**:
|
**What this command does**:
|
||||||
- `curl` makes an HTTP request (like a browser but from the terminal)
|
- `curl` makes an HTTPS request (like a browser but from the terminal)
|
||||||
|
- `--cacert "$CA"` pins the test harness's self-signed root as the only trust anchor for this call — matches what you exported in Step 1
|
||||||
- `-s` means "silent" (don't show progress bars)
|
- `-s` means "silent" (don't show progress bars)
|
||||||
- `-H "Authorization: Bearer test-key-2026"` sends the API key (same one you used to log in)
|
- `-H "Authorization: Bearer test-key-2026"` sends the API key (same one you used to log in)
|
||||||
- `python3 -m json.tool` formats the JSON response so it's readable
|
- `python3 -m json.tool` formats the JSON response so it's readable
|
||||||
@@ -233,8 +250,8 @@ The important parts: `"id": "agent-test-01"` and `"status": "online"`. If the st
|
|||||||
### Check the issuers
|
### Check the issuers
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
http://localhost:8443/api/v1/issuers | python3 -m json.tool
|
https://localhost:8443/api/v1/issuers | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
You should see three issuers:
|
You should see three issuers:
|
||||||
@@ -245,8 +262,8 @@ You should see three issuers:
|
|||||||
### Check the target
|
### Check the target
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
http://localhost:8443/api/v1/targets | python3 -m json.tool
|
https://localhost:8443/api/v1/targets | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
You should see `target-test-nginx` — the NGINX deployment target, assigned to `agent-test-01`.
|
You should see `target-test-nginx` — the NGINX deployment target, assigned to `agent-test-01`.
|
||||||
@@ -255,7 +272,7 @@ The target config uses no-op commands for `reload_command` and `validate_command
|
|||||||
|
|
||||||
### See it all in the dashboard
|
### See it all in the dashboard
|
||||||
|
|
||||||
Open the dashboard at http://localhost:8443 and click through the sidebar:
|
Open the dashboard at https://localhost:8443 and click through the sidebar:
|
||||||
- **Agents** — you should see `test-agent-01`
|
- **Agents** — you should see `test-agent-01`
|
||||||
- **Issuers** — you should see all three CAs
|
- **Issuers** — you should see all three CAs
|
||||||
- **Targets** — you should see `Test NGINX`
|
- **Targets** — you should see `Test NGINX`
|
||||||
@@ -287,7 +304,7 @@ The private key **never leaves the agent**. The server only ever sees the CSR (p
|
|||||||
### Step 4a: Create the certificate record
|
### Step 4a: Create the certificate record
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates \
|
||||||
-H "Authorization: Bearer test-key-2026" \
|
-H "Authorization: Bearer test-key-2026" \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
@@ -338,7 +355,7 @@ docker exec certctl-test-postgres psql -U certctl -d certctl -c \
|
|||||||
### Step 4c: Trigger issuance
|
### Step 4c: Trigger issuance
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates/mc-local-test/renew \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/mc-local-test/renew \
|
||||||
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
|
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -395,7 +412,7 @@ The `subject` should match the domain name you chose. The `issuer` should say "c
|
|||||||
|
|
||||||
### Step 4f: Check the dashboard
|
### Step 4f: Check the dashboard
|
||||||
|
|
||||||
Open the dashboard at http://localhost:8443 and:
|
Open the dashboard at https://localhost:8443 and:
|
||||||
|
|
||||||
1. Click **Certificates** in the sidebar — you should see `mc-local-test` with status "Active"
|
1. Click **Certificates** in the sidebar — you should see `mc-local-test` with status "Active"
|
||||||
2. Click on it to see the detail page — you should see version history, the signed certificate details, and the deployment timeline
|
2. Click on it to see the detail page — you should see version history, the signed certificate details, and the deployment timeline
|
||||||
@@ -414,7 +431,7 @@ This is the real deal. ACME is the protocol that Let's Encrypt uses to issue cer
|
|||||||
### Step 5a: Create the certificate record
|
### Step 5a: Create the certificate record
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates \
|
||||||
-H "Authorization: Bearer test-key-2026" \
|
-H "Authorization: Bearer test-key-2026" \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
@@ -441,7 +458,7 @@ docker exec certctl-test-postgres psql -U certctl -d certctl -c \
|
|||||||
"INSERT INTO certificate_target_mappings (certificate_id, target_id) VALUES ('mc-acme-test', 'target-test-nginx') ON CONFLICT DO NOTHING;"
|
"INSERT INTO certificate_target_mappings (certificate_id, target_id) VALUES ('mc-acme-test', 'target-test-nginx') ON CONFLICT DO NOTHING;"
|
||||||
|
|
||||||
# Trigger issuance
|
# Trigger issuance
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates/mc-acme-test/renew \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/mc-acme-test/renew \
|
||||||
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
|
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -502,7 +519,7 @@ Revocation means "this certificate is no longer trusted, even though it hasn't e
|
|||||||
### Step 7a: Revoke the Local CA cert
|
### Step 7a: Revoke the Local CA cert
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates/mc-local-test/revoke \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/mc-local-test/revoke \
|
||||||
-H "Authorization: Bearer test-key-2026" \
|
-H "Authorization: Bearer test-key-2026" \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{"reason": "superseded"}' | python3 -m json.tool
|
-d '{"reason": "superseded"}' | python3 -m json.tool
|
||||||
@@ -516,7 +533,7 @@ The CRL is a DER-encoded X.509 v2 CRL (RFC 5280 §5) served under the RFC 8615 w
|
|||||||
|
|
||||||
```bash
|
```bash
|
||||||
# No Authorization header — the endpoint is public by design.
|
# No Authorization header — the endpoint is public by design.
|
||||||
curl -s http://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
|
curl --cacert "$CA" -s https://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
|
||||||
openssl crl -inform der -in /tmp/crl.der -noout -text | head -40
|
openssl crl -inform der -in /tmp/crl.der -noout -text | head -40
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -533,8 +550,8 @@ Go to **Certificates** in the sidebar. The `mc-local-test` cert should now show
|
|||||||
The agent is configured to scan `/nginx-certs` every 6 hours for existing certificates. It already ran a scan when it started up. Let's see what it found.
|
The agent is configured to scan `/nginx-certs` every 6 hours for existing certificates. It already ran a scan when it started up. Let's see what it found.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
http://localhost:8443/api/v1/discovered-certificates | python3 -m json.tool
|
https://localhost:8443/api/v1/discovered-certificates | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
**What you should see**: Any certificates that exist in the NGINX cert directory, including the ones you deployed in Steps 4-5. The discovery system extracts metadata (CN, SANs, issuer, expiry, fingerprint) from the PEM files.
|
**What you should see**: Any certificates that exist in the NGINX cert directory, including the ones you deployed in Steps 4-5. The discovery system extracts metadata (CN, SANs, issuer, expiry, fingerprint) from the PEM files.
|
||||||
@@ -542,8 +559,8 @@ curl -s -H "Authorization: Bearer test-key-2026" \
|
|||||||
Check the summary:
|
Check the summary:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
http://localhost:8443/api/v1/discovery-summary | python3 -m json.tool
|
https://localhost:8443/api/v1/discovery-summary | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
This shows counts: how many are Unmanaged, Managed, and Dismissed.
|
This shows counts: how many are Unmanaged, Managed, and Dismissed.
|
||||||
@@ -557,7 +574,7 @@ In the dashboard: click **Discovery** in the sidebar to see the triage view.
|
|||||||
Force a renewal on the ACME certificate to see the full cycle happen again:
|
Force a renewal on the ACME certificate to see the full cycle happen again:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates/mc-acme-test/renew \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/mc-acme-test/renew \
|
||||||
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
|
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -584,7 +601,7 @@ The test environment enables EST with `CERTCTL_EST_ENABLED=true` and `CERTCTL_ES
|
|||||||
### Step 10a: Check available CA certificates
|
### Step 10a: Check available CA certificates
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -sk http://localhost:8443/.well-known/est/cacerts \
|
curl --cacert "$CA" -s https://localhost:8443/.well-known/est/cacerts \
|
||||||
-H "Authorization: Bearer test-key-2026"
|
-H "Authorization: Bearer test-key-2026"
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -595,7 +612,7 @@ curl -sk http://localhost:8443/.well-known/est/cacerts \
|
|||||||
### Step 10b: Check CSR attributes
|
### Step 10b: Check CSR attributes
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -sk http://localhost:8443/.well-known/est/csrattrs \
|
curl --cacert "$CA" -s https://localhost:8443/.well-known/est/csrattrs \
|
||||||
-H "Authorization: Bearer test-key-2026"
|
-H "Authorization: Bearer test-key-2026"
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -615,7 +632,7 @@ openssl req -new -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
|
|||||||
EST_CSR=$(openssl req -in /tmp/est-test.csr -outform DER | base64 -w 0)
|
EST_CSR=$(openssl req -in /tmp/est-test.csr -outform DER | base64 -w 0)
|
||||||
|
|
||||||
# Submit to EST simpleenroll endpoint
|
# Submit to EST simpleenroll endpoint
|
||||||
curl -sk -X POST http://localhost:8443/.well-known/est/simpleenroll \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/.well-known/est/simpleenroll \
|
||||||
-H "Authorization: Bearer test-key-2026" \
|
-H "Authorization: Bearer test-key-2026" \
|
||||||
-H "Content-Type: application/pkcs10" \
|
-H "Content-Type: application/pkcs10" \
|
||||||
-d "$EST_CSR"
|
-d "$EST_CSR"
|
||||||
@@ -628,8 +645,8 @@ curl -sk -X POST http://localhost:8443/.well-known/est/simpleenroll \
|
|||||||
Decode and inspect the response (if you saved it to a variable):
|
Decode and inspect the response (if you saved it to a variable):
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
http://localhost:8443/api/v1/audit-events | python3 -m json.tool | head -30
|
https://localhost:8443/api/v1/audit-events | python3 -m json.tool | head -30
|
||||||
```
|
```
|
||||||
|
|
||||||
Check the audit trail — you should see an `est_enrollment` event with the CN `est-device.certctl.test`.
|
Check the audit trail — you should see an `est_enrollment` event with the CN `est-device.certctl.test`.
|
||||||
@@ -639,7 +656,7 @@ Check the audit trail — you should see an `est_enrollment` event with the CN `
|
|||||||
EST also supports re-enrollment (certificate renewal). The same CSR format works:
|
EST also supports re-enrollment (certificate renewal). The same CSR format works:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -sk -X POST http://localhost:8443/.well-known/est/simplereenroll \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/.well-known/est/simplereenroll \
|
||||||
-H "Authorization: Bearer test-key-2026" \
|
-H "Authorization: Bearer test-key-2026" \
|
||||||
-H "Content-Type: application/pkcs10" \
|
-H "Content-Type: application/pkcs10" \
|
||||||
-d "$EST_CSR"
|
-d "$EST_CSR"
|
||||||
@@ -658,7 +675,7 @@ S/MIME certificates are used for email signing and encryption — a different us
|
|||||||
### Step 11a: Create an S/MIME certificate record
|
### Step 11a: Create an S/MIME certificate record
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates \
|
||||||
-H "Authorization: Bearer test-key-2026" \
|
-H "Authorization: Bearer test-key-2026" \
|
||||||
-H "Content-Type: application/json" \
|
-H "Content-Type: application/json" \
|
||||||
-d '{
|
-d '{
|
||||||
@@ -686,7 +703,7 @@ Notice:
|
|||||||
docker exec certctl-test-postgres psql -U certctl -d certctl -c \
|
docker exec certctl-test-postgres psql -U certctl -d certctl -c \
|
||||||
"INSERT INTO certificate_target_mappings (certificate_id, target_id) VALUES ('mc-smime-test', 'target-test-nginx') ON CONFLICT DO NOTHING;"
|
"INSERT INTO certificate_target_mappings (certificate_id, target_id) VALUES ('mc-smime-test', 'target-test-nginx') ON CONFLICT DO NOTHING;"
|
||||||
|
|
||||||
curl -s -X POST http://localhost:8443/api/v1/certificates/mc-smime-test/renew \
|
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/mc-smime-test/renew \
|
||||||
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
|
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -695,15 +712,15 @@ curl -s -X POST http://localhost:8443/api/v1/certificates/mc-smime-test/renew \
|
|||||||
After the agent processes the job (30-60 seconds), check the certificate details:
|
After the agent processes the job (30-60 seconds), check the certificate details:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
http://localhost:8443/api/v1/certificates/mc-smime-test | python3 -m json.tool
|
https://localhost:8443/api/v1/certificates/mc-smime-test | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
The certificate should show `"status": "active"`. To verify the EKU on the actual cert, you can export it:
|
The certificate should show `"status": "active"`. To verify the EKU on the actual cert, you can export it:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
http://localhost:8443/api/v1/certificates/mc-smime-test/export/pem | python3 -m json.tool
|
https://localhost:8443/api/v1/certificates/mc-smime-test/export/pem | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
If you decode the certificate PEM, you should see:
|
If you decode the certificate PEM, you should see:
|
||||||
@@ -768,16 +785,16 @@ If you have Go installed, you can build and test the CLI tool:
|
|||||||
go build -o certctl-cli ./cmd/cli
|
go build -o certctl-cli ./cmd/cli
|
||||||
|
|
||||||
# List certificates
|
# List certificates
|
||||||
./certctl-cli --server http://localhost:8443 --api-key test-key-2026 list-certs
|
./certctl-cli --server https://localhost:8443 --ca-bundle "$CA" --api-key test-key-2026 list-certs
|
||||||
|
|
||||||
# Get a specific certificate
|
# Get a specific certificate
|
||||||
./certctl-cli --server http://localhost:8443 --api-key test-key-2026 get-cert mc-acme-test
|
./certctl-cli --server https://localhost:8443 --ca-bundle "$CA" --api-key test-key-2026 get-cert mc-acme-test
|
||||||
|
|
||||||
# Check health
|
# Check health
|
||||||
./certctl-cli --server http://localhost:8443 --api-key test-key-2026 health
|
./certctl-cli --server https://localhost:8443 --ca-bundle "$CA" --api-key test-key-2026 health
|
||||||
|
|
||||||
# Get metrics (JSON format)
|
# Get metrics (JSON format)
|
||||||
./certctl-cli --server http://localhost:8443 --api-key test-key-2026 --format json metrics
|
./certctl-cli --server https://localhost:8443 --ca-bundle "$CA" --api-key test-key-2026 --format json metrics
|
||||||
```
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
@@ -924,15 +941,15 @@ Look for error messages. Common ones:
|
|||||||
**Step 2**: Verify the agent is registered:
|
**Step 2**: Verify the agent is registered:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
http://localhost:8443/api/v1/agents/agent-test-01 | python3 -m json.tool
|
https://localhost:8443/api/v1/agents/agent-test-01 | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
**Step 3**: Check for pending jobs:
|
**Step 3**: Check for pending jobs:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
"http://localhost:8443/api/v1/jobs?status=Pending&status=AwaitingCSR" | python3 -m json.tool
|
"https://localhost:8443/api/v1/jobs?status=Pending&status=AwaitingCSR" | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
If there are pending jobs but the agent isn't picking them up, check that the job's `agent_id` matches `agent-test-01`.
|
If there are pending jobs but the agent isn't picking them up, check that the job's `agent_id` matches `agent-test-01`.
|
||||||
@@ -962,8 +979,8 @@ docker exec certctl-test-nginx nginx -s reload
|
|||||||
**Step 3**: If the files aren't there, the deployment job hasn't completed. Check the jobs:
|
**Step 3**: If the files aren't there, the deployment job hasn't completed. Check the jobs:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
curl -s -H "Authorization: Bearer test-key-2026" \
|
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
|
||||||
"http://localhost:8443/api/v1/jobs?type=Deployment" | python3 -m json.tool
|
"https://localhost:8443/api/v1/jobs?type=Deployment" | python3 -m json.tool
|
||||||
```
|
```
|
||||||
|
|
||||||
Look at the job status. If it's "Running" and stuck, the server's job processor may have picked it up instead of the agent (this was a known bug — the fix skips deployment jobs with `agent_id` in the server's `ProcessPendingJobs`).
|
Look at the job status. If it's "Running" and stuck, the server's job processor may have picked it up instead of the agent (this was a known bug — the fix skips deployment jobs with `agent_id` in the server's `ProcessPendingJobs`).
|
||||||
@@ -1008,7 +1025,7 @@ Change it to a different port, like:
|
|||||||
- "9443:8443"
|
- "9443:8443"
|
||||||
```
|
```
|
||||||
|
|
||||||
Then access the dashboard at http://localhost:9443 instead.
|
Then access the dashboard at https://localhost:9443 instead.
|
||||||
|
|
||||||
### Starting completely fresh
|
### Starting completely fresh
|
||||||
|
|
||||||
@@ -1054,7 +1071,7 @@ docker compose -f docker-compose.test.yml up --build
|
|||||||
|
|
||||||
| What | Value |
|
| What | Value |
|
||||||
|---|---|
|
|---|---|
|
||||||
| Dashboard URL | http://localhost:8443 |
|
| Dashboard URL | https://localhost:8443 (use `--cacert ./test/certs/ca.crt`) |
|
||||||
| API key | `test-key-2026` |
|
| API key | `test-key-2026` |
|
||||||
| NGINX HTTP | http://localhost:8080 |
|
| NGINX HTTP | http://localhost:8080 |
|
||||||
| NGINX HTTPS | https://localhost:8444 |
|
| NGINX HTTPS | https://localhost:8444 |
|
||||||
|
|||||||
+194
-6
@@ -5002,10 +5002,10 @@ curl -s -w "HTTP %{http_code}\n" -X DELETE -H "$AUTH" "$SERVER/api/v1/audit/$EVE
|
|||||||
|
|
||||||
> **Tip:** Open a second terminal with `docker compose logs -f certctl-server` to watch scheduler log output in real time.
|
> **Tip:** Open a second terminal with `docker compose logs -f certctl-server` to watch scheduler log output in real time.
|
||||||
|
|
||||||
**Test 20.1.1 — Scheduler startup: all 7 loops registered**
|
**Test 20.1.1 — Scheduler startup: all 12 loops registered**
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|health check\|notification\|short-lived\|network scan" | head -20
|
docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|job retry\|job timeout\|health check\|notification\|notification retry\|short-lived\|network scan\|digest\|endpoint health\|cloud discovery" | head -30
|
||||||
```
|
```
|
||||||
|
|
||||||
**What:** Checks server startup logs for scheduler loop registration.
|
**What:** Checks server startup logs for scheduler loop registration.
|
||||||
@@ -6812,6 +6812,194 @@ print('OpenAPI I-004 contract: OK')
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## Part 56: Notification Retry & Dead-Letter Queue (I-005)
|
||||||
|
|
||||||
|
**What this validates:** The full retry lifecycle for `notification_events` rows — transient notifier failures are re-armed with exponential backoff (`2^retry_count` minutes capped at 1h, 5-attempt budget), rows that exhaust the budget land in the terminal `dead` status, the dead-letter depth is surfaced both on the dashboard and via a Prometheus counter, and operators can requeue dead rows once the underlying outage is resolved.
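
The backoff rule quoted above can be sketched in a few lines. This is an illustrative reading of the stated formula, not the project's Go code: the function names are hypothetical, and the exhaustion rule follows the comment in 56.4 (a row at `retry_count=4` is dead-lettered on its next failure).

```python
from datetime import timedelta

MAX_ATTEMPTS = 5                   # 5-attempt budget stated in the text
BACKOFF_CAP = timedelta(hours=1)   # 1h cap stated in the text


def backoff_delay(retry_count: int) -> timedelta:
    """Return 2^retry_count minutes, capped at one hour."""
    return min(timedelta(minutes=2 ** retry_count), BACKOFF_CAP)


def next_failure_exhausts(retry_count: int) -> bool:
    """Assumed rule: one more failure at MAX_ATTEMPTS - 1 spends the budget."""
    return retry_count >= MAX_ATTEMPTS - 1
```

Under this reading the delays for `retry_count` 0 through 4 are 1, 2, 4, 8, and 16 minutes, so the one-hour cap only matters if the budget is ever raised above six attempts.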
|
||||||
|
|
||||||
|
**Why it matters:** Before I-005, a failed notification was a silent drop. `internal/service/notification.go` flipped `status` to `failed` and never came back to it, because `ProcessPendingNotifications` only lists rows whose `status='pending'`. A 5xx from Slack, a 30-second SMTP stall, or a misrouted webhook URL could each lose a critical alert (cert expiry, CA compromise, approval-rejected) with no trace beyond a single log line. Part 56 pins the replacement contract (retry loop + DLQ + dashboard surface + Prometheus metric + operator requeue) so regressions show up here rather than as a post-incident "why didn't we get paged?" review.
|
||||||
|
|
||||||
|
### 56.1 Migration 000016 Columns Applied
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker compose -f deploy/docker-compose.yml exec postgres \
|
||||||
|
psql -U certctl -d certctl -c \
|
||||||
|
"SELECT column_name FROM information_schema.columns WHERE table_name='notification_events' AND column_name IN ('retry_count','next_retry_at','last_error') ORDER BY column_name;"
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** Confirms migration 000016 added the retry bookkeeping columns to `notification_events`.
|
||||||
|
**PASS if** all three rows (`last_error`, `next_retry_at`, `retry_count`) are returned. **FAIL** if any is missing — the migration did not apply and the retry loop will error on every tick.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 56.2 Partial Retry-Sweep Index Present
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker compose -f deploy/docker-compose.yml exec postgres \
|
||||||
|
psql -U certctl -d certctl -c \
|
||||||
|
"SELECT indexdef FROM pg_indexes WHERE tablename='notification_events' AND indexname='idx_notification_events_retry_sweep';"
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** Confirms the partial index `idx_notification_events_retry_sweep ON notification_events(next_retry_at) WHERE status = 'failed' AND next_retry_at IS NOT NULL` exists and has the expected predicate.
|
||||||
|
**PASS if** the returned `indexdef` includes `WHERE ((status = 'failed'::text) AND (next_retry_at IS NOT NULL))`. **FAIL** if the index is missing or lacks the partial `WHERE` predicate: the retry sweep will then scan the full notification history instead of the small retry-eligible slice.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 56.3 Failed Notification Retries On Next Tick
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Seed a failed notification with next_retry_at in the past
|
||||||
|
docker compose -f deploy/docker-compose.yml exec postgres \
|
||||||
|
psql -U certctl -d certctl -c \
|
||||||
|
"UPDATE notification_events SET status='failed', retry_count=0, next_retry_at=NOW() - INTERVAL '1 minute', last_error='transient SMTP timeout' WHERE id='notif-demo-1';"
|
||||||
|
|
||||||
|
# Wait for the retry loop to sweep (default CERTCTL_NOTIFICATION_RETRY_INTERVAL=2m)
|
||||||
|
sleep 130
|
||||||
|
|
||||||
|
# Observe the post-sweep state
|
||||||
|
docker compose -f deploy/docker-compose.yml exec postgres \
|
||||||
|
psql -U certctl -d certctl -c \
|
||||||
|
"SELECT id, status, retry_count, next_retry_at IS NOT NULL AS has_next_retry FROM notification_events WHERE id='notif-demo-1';"
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** Exercises the retry loop's failure path. The seeded row is re-dispatched through the notifier registry; in the demo environment no notifier is registered for the `email` channel, so the sweep either marks the row delivered (`status='sent'`) or records a failed attempt (`retry_count=1`, `next_retry_at` re-armed).
|
||||||
|
**PASS if** either `status='sent'` (delivered on retry) or the row is still `failed` with `retry_count >= 1` and `has_next_retry=t`. **FAIL** if the row is still `failed` with `retry_count=0` and `next_retry_at` in the past — the retry loop is not actually running.
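
The PASS/FAIL rule above can be written as a small decision function, which may help when scripting this check. This is a sketch only: the function name and the `INCONCLUSIVE` bucket for states the text does not cover are assumptions.

```python
def classify_retry_row(status: str, retry_count: int, has_next_retry: bool) -> str:
    """Apply the 56.3 criteria to the row returned by the final SELECT."""
    if status == "sent":
        return "PASS"  # delivered on retry
    if status == "failed" and retry_count >= 1 and has_next_retry:
        return "PASS"  # attempt recorded and next_retry_at re-armed
    if status == "failed" and retry_count == 0:
        return "FAIL"  # the sweep never touched the row
    return "INCONCLUSIVE"  # state not described by the test text
```

For example, `classify_retry_row("failed", 1, True)` returns `"PASS"`, matching the second PASS branch.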
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 56.4 Exhausted Notification Transitions To Dead
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Seed a row one failure shy of exhaustion — retry_count=4 means the next
|
||||||
|
# tick's failure is the 5th attempt (notifRetryMaxAttempts-1 check at
|
||||||
|
# internal/service/notification.go:531).
|
||||||
|
docker compose -f deploy/docker-compose.yml exec postgres \
|
||||||
|
psql -U certctl -d certctl -c \
|
||||||
|
"UPDATE notification_events SET status='failed', retry_count=4, next_retry_at=NOW() - INTERVAL '1 minute', last_error='persistent outage', channel='channel-that-does-not-exist' WHERE id='notif-demo-2';"
|
||||||
|
|
||||||
|
sleep 130
|
||||||
|
|
||||||
|
docker compose -f deploy/docker-compose.yml exec postgres \
|
||||||
|
psql -U certctl -d certctl -c \
|
||||||
|
"SELECT id, status, retry_count, last_error FROM notification_events WHERE id='notif-demo-2';"
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** The intended flow is that the row at `retry_count=4` enters the sweep, the send fails, the exhaustion branch fires, and `MarkAsDead` flips the row. Caveat: with the unknown channel seeded above, the "notifier unknown" branch at notification.go:494-503 promotes the row to `sent` for demo parity. For a strict DLQ assertion, seed a row whose channel is a known registered notifier that will reject delivery, or run against the integration test fixture where the retry-exhaustion path is deterministic.
|
||||||
|
**PASS if** `status='dead'` and `last_error` reflects the send failure. **FAIL** if the row is still `failed` with `retry_count >= 5` — the exhaustion branch did not fire and the row will retry forever.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 56.5 Dead Row Has Null next_retry_at
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker compose -f deploy/docker-compose.yml exec postgres \
|
||||||
|
psql -U certctl -d certctl -c \
|
||||||
|
"SELECT COUNT(*) FROM notification_events WHERE status='dead' AND next_retry_at IS NOT NULL;"
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** `MarkAsDead` must clear `next_retry_at` so the partial retry-sweep index stops matching the row. If this invariant breaks, a dead row keeps appearing in `ListRetryEligible` and the exhaustion branch fires on every sweep.
|
||||||
|
**PASS if** the count is `0`. **FAIL** if any dead rows still carry a non-null `next_retry_at` — the DLQ is leaky and the row will re-enter the retry rotation on the next tick.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 56.6 DashboardSummary Populates NotificationsDead
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Seed a dead row so the count is observable
|
||||||
|
docker compose -f deploy/docker-compose.yml exec postgres \
|
||||||
|
psql -U certctl -d certctl -c \
|
||||||
|
"UPDATE notification_events SET status='dead', next_retry_at=NULL, last_error='demo DLQ fixture' WHERE id='notif-demo-3';"
|
||||||
|
|
||||||
|
curl -sS "http://localhost:8443/api/v1/stats/summary" \
|
||||||
|
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
|
||||||
|
| python3 -c "import sys,json; s=json.load(sys.stdin); assert 'notifications_dead' in s, 'missing notifications_dead field'; assert s['notifications_dead'] >= 1, s['notifications_dead']; print('notifications_dead:', s['notifications_dead'])"
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** Confirms `DashboardSummary.NotificationsDead` (`internal/service/stats.go:66`) is populated by `notifRepo.CountByStatus(ctx, "dead")` (stats.go:137-142) and surfaced in the dashboard summary JSON.
|
||||||
|
**PASS if** the field is present and reflects at least the seeded dead row. **FAIL** if the field is missing (`SetNotifRepo` was not called on StatsService) or stuck at zero despite seeded dead rows (repository `CountByStatus` is broken).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 56.7 Prometheus Counter Emits certctl_notification_dead_total
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -sS "http://localhost:8443/api/v1/metrics/prometheus" \
|
||||||
|
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
|
||||||
|
| grep -E '^# (HELP|TYPE) certctl_notification_dead_total|^certctl_notification_dead_total '
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** The Prometheus endpoint (`internal/api/handler/metrics.go:217-219`) emits three lines: `# HELP certctl_notification_dead_total Number of notifications in the dead-letter queue.`, `# TYPE certctl_notification_dead_total counter`, and a bare `certctl_notification_dead_total <value>` value line. Operator alert thresholds per the I-005 spec: `> 0` warning, `> 10` critical.
|
||||||
|
**PASS if** all three lines are present and the value is `>= 1` when dead rows exist. **FAIL** if any of the three lines is missing — the metric name is misspelled, the `# TYPE` is wrong, or `DashboardSummary.NotificationsDead` is not wired into the metrics handler.
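
The three-line check and the alert thresholds can also be scripted. The sketch below parses Prometheus text exposition output for this one metric; the helper names are hypothetical and the sample value `3` is illustrative, not real output.

```python
import re

METRIC = "certctl_notification_dead_total"


def read_dead_metric(exposition: str) -> float:
    """Return the metric value; raise ValueError if HELP, TYPE, or the value line is missing."""
    lines = [line.strip() for line in exposition.splitlines()]
    for prefix in (f"# HELP {METRIC} ", f"# TYPE {METRIC} "):
        if not any(line.startswith(prefix) for line in lines):
            raise ValueError(f"missing line starting with {prefix!r}")
    for line in lines:
        match = re.fullmatch(rf"{re.escape(METRIC)} (\S+)", line)
        if match:
            return float(match.group(1))
    raise ValueError(f"missing value line for {METRIC}")


def alert_level(value: float) -> str:
    """Thresholds from the text: > 0 is a warning, > 10 is critical."""
    if value > 10:
        return "critical"
    if value > 0:
        return "warning"
    return "ok"


# Illustrative sample in the shape the text describes (the value is made up).
SAMPLE = """\
# HELP certctl_notification_dead_total Number of notifications in the dead-letter queue.
# TYPE certctl_notification_dead_total counter
certctl_notification_dead_total 3
"""
```

Piping the `curl` output above into `read_dead_metric` would give the same PASS/FAIL signal as the `grep`, with the added benefit of a numeric value to compare against the thresholds.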
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 56.8 Requeue Resets Retry Bookkeeping
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Confirm the row is in 'dead' with the full retry history
|
||||||
|
docker compose -f deploy/docker-compose.yml exec postgres \
|
||||||
|
psql -U certctl -d certctl -c \
|
||||||
|
"SELECT id, status, retry_count, next_retry_at, last_error FROM notification_events WHERE id='notif-demo-3';"
|
||||||
|
|
||||||
|
# Requeue via the operator endpoint
|
||||||
|
curl -sS -X POST "http://localhost:8443/api/v1/notifications/notif-demo-3/requeue" \
|
||||||
|
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
|
||||||
|
-w "\nHTTP %{http_code}\n"
|
||||||
|
|
||||||
|
# Confirm the atomic reset
|
||||||
|
docker compose -f deploy/docker-compose.yml exec postgres \
|
||||||
|
psql -U certctl -d certctl -c \
|
||||||
|
"SELECT id, status, retry_count, next_retry_at, last_error FROM notification_events WHERE id='notif-demo-3';"
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** Exercises the operator-driven escape hatch (`POST /api/v1/notifications/{id}/requeue`). The repository's `Requeue` must atomically flip `status → pending`, reset `retry_count → 0`, clear `next_retry_at → NULL`, and clear `last_error → NULL` — see `internal/service/notification.go:571-576` and the pinning test at `notification_handler_test.go:307-347`.
|
||||||
|
**PASS if** HTTP `200` with JSON body `{"status":"requeued"}` AND the post-requeue row has `status='pending'`, `retry_count=0`, `next_retry_at IS NULL`, `last_error IS NULL`. **FAIL** if any of the four fields is not reset — `ProcessPendingNotifications` will not treat this as a fresh attempt and the audit trail will be ambiguous.
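
The four-field reset can be asserted mechanically once the post-requeue row is loaded into a dict. A minimal sketch, assuming the row is fetched separately (for example from the `psql` query above); the function name is hypothetical.

```python
_MISSING = object()

# Expected post-requeue state, taken from the PASS criteria above.
EXPECTED_AFTER_REQUEUE = {
    "status": "pending",
    "retry_count": 0,
    "next_retry_at": None,
    "last_error": None,
}


def requeue_reset_problems(row: dict) -> list:
    """Return one message per field that was not reset; an empty list means PASS."""
    problems = []
    for field, expected in EXPECTED_AFTER_REQUEUE.items():
        actual = row.get(field, _MISSING)
        if actual is _MISSING:
            problems.append(f"{field}: missing from row")
        elif actual != expected:
            problems.append(f"{field}: expected {expected!r}, got {actual!r}")
    return problems
```

Reporting every unreset field, rather than stopping at the first, makes a partial reset easier to diagnose.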
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 56.9 GUI Dead Letter Tab Threads ?status=dead
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd web
|
||||||
|
npx vitest run src/pages/NotificationsPage.test.tsx -t 'Dead letter tab fetches notifications with status=dead'
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** The two-tab toolbar on `/notifications` routes the "Dead letter" tab's query through `getNotifications({ status: 'dead', per_page: '100' })`. This test verifies the React Query's `queryKey: ['notifications', activeTab]` (`NotificationsPage.tsx:31`) actually translates the tab click into the server-side filter — not client-side filtering of the full inbox.
|
||||||
|
**PASS if** the Vitest assertion at `NotificationsPage.test.tsx:104-128` passes. **FAIL** if the Dead letter tab is merely a client-side filter on the `all` response — the DLQ-only code path (`NotificationRepository.ListByStatus`) is not exercised, which matters for pagination correctness once the inbox grows beyond 100 rows.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 56.10 Requeue Button MutationFn Wrapper
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd web
|
||||||
|
npx vitest run src/pages/NotificationsPage.test.tsx -t 'clicking Requeue invokes requeueNotification'
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** `react-query` v5's `mutate(id)` passes a second positional argument (the mutation context object) to the `mutationFn`. If `mutationFn: requeueNotification` is used directly, the API client receives `(id, { client })` — an extra argument that the strict-match `toHaveBeenCalledWith('notif-dead-001')` assertion at `NotificationsPage.test.tsx:181` rejects. The fix is an explicit single-arg arrow: `mutationFn: (id: string) => requeueNotification(id)` at `NotificationsPage.tsx:64`.
|
||||||
|
**PASS if** the Vitest assertion passes (the API client was called with exactly one argument). **FAIL** if the wrapper is inadvertently removed: the app still succeeds silently at runtime, but this contract test fails loudly.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 56.11 HEAD-State OpenAPI Contract
|
||||||
|
|
||||||
|
```bash
|
||||||
|
npx --yes @redocly/cli lint api/openapi.yaml \
|
||||||
|
--config '{"rules":{"operation-4xx-response":"error","no-invalid-media-type-examples":"error"}}'
|
||||||
|
python3 -c "
|
||||||
|
import yaml
|
||||||
|
spec = yaml.safe_load(open('api/openapi.yaml'))
|
||||||
|
post = spec['paths']['/api/v1/notifications/{id}/requeue']['post']
|
||||||
|
assert post['operationId'] == 'requeueNotification', post['operationId']
|
||||||
|
assert set(post['responses'].keys()) >= {'200','400','404','405','500'}, post['responses'].keys()
|
||||||
|
print('OpenAPI I-005 contract: OK')
|
||||||
|
"
|
||||||
|
```
|
||||||
|
|
||||||
|
**What:** Two-part check. Redocly lint confirms the spec is structurally valid; the Python assertions pin the requeue endpoint's `operationId` and the five minimum response codes (200/400/404/405/500).
|
||||||
|
**PASS if** redocly prints no errors and the Python script prints `OpenAPI I-005 contract: OK`. **FAIL** if the `operationId` changed or any of the five responses is missing — downstream MCP/CLI clients rely on the contract.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## Release Sign-Off
|
## Release Sign-Off
|
||||||
|
|
||||||
All tests below must pass before tagging v2.1.0. Each row is one individual test from the guide above. The **Method** column indicates whether `qa-smoke-test.sh` covers the test automatically (**Auto**) or requires hands-on verification (**Manual**).
|
All tests below must pass before tagging v2.1.0. Each row is one individual test from the guide above. The **Method** column indicates whether `qa-smoke-test.sh` covers the test automatically (**Auto**) or requires hands-on verification (**Manual**).
|
||||||
@@ -7152,7 +7340,7 @@ These must be green before starting manual QA:
|
|||||||
|
|
||||||
| Test | Description | Method | Pass? | Date | Notes |
|
| Test | Description | Method | Pass? | Date | Notes |
|
||||||
|------|-------------|--------|-------|------|-------|
|
|------|-------------|--------|-------|------|-------|
|
||||||
| 20.1.1 | Scheduler startup: all 7 loops registered | Manual | ☐ | | |
|
| 20.1.1 | Scheduler startup: all 12 loops registered | Manual | ☐ | | |
|
||||||
| 20.1.2 | Job processor loop fires (30s interval) | Manual | ☐ | | |
|
| 20.1.2 | Job processor loop fires (30s interval) | Manual | ☐ | | |
|
||||||
| 20.1.3 | Agent health check marks offline (2m interval) | Manual | ☐ | | |
|
| 20.1.3 | Agent health check marks offline (2m interval) | Manual | ☐ | | |
|
||||||
| 20.1.4 | Notification processor fires (1m interval) | Manual | ☐ | | |
|
| 20.1.4 | Notification processor fires (1m interval) | Manual | ☐ | | |
|
||||||
@@ -7813,10 +8001,10 @@ These must be green before starting manual QA:
|
|||||||
| Category | Count |
|
| Category | Count |
|
||||||
|----------|-------|
|
|----------|-------|
|
||||||
| ☑ Auto (passed in `qa-smoke-test.sh`) | 144 |
|
| ☑ Auto (passed in `qa-smoke-test.sh`) | 144 |
|
||||||
| ☐ Auto (not yet run) | 129 |
|
| ☐ Auto (not yet run) | 136 |
|
||||||
| — Skipped (preconditions not met in demo) | 5 |
|
| — Skipped (preconditions not met in demo) | 5 |
|
||||||
| ☐ Manual (requires hands-on verification) | 282 |
|
| ☐ Manual (requires hands-on verification) | 286 |
|
||||||
| **Total** | **560** |
|
| **Total** | **571** |
|
||||||
|
|
||||||
**Automated tests must also be green.** CI passing is necessary but not sufficient — this manual QA catches integration issues that isolated unit tests miss.
|
**Automated tests must also be green.** CI passing is necessary but not sufficient — this manual QA catches integration issues that isolated unit tests miss.
|
||||||
|
|
||||||
|
|||||||
+179
@@ -0,0 +1,179 @@
|
|||||||
|
# TLS on the Control Plane
|
||||||
|
|
||||||
|
certctl's control plane is HTTPS-only as of v2.2. There is no plaintext `http://` listener, no `auto` mode, no dual-listener bridge, no TLS 1.2 escape hatch. The server refuses to start without a cert+key pair, the agent/CLI/MCP clients reject `http://` URLs at startup, and the Helm chart refuses to render without either an operator-supplied Secret or a cert-manager Certificate CR.
|
||||||
|
|
||||||
|
This doc covers four cert provisioning patterns, SIGHUP-based cert rotation, and the client-side CA-trust configuration agents and the CLI need to talk to the server. If you are upgrading from a pre-HTTPS release and want the step-by-step cutover procedure, read [`upgrade-to-tls.md`](upgrade-to-tls.md) first and come back here for reference.
|
||||||
|
|
||||||
|
## What you get
|
||||||
|
|
||||||
|
The server binds TLS 1.3 only with an explicit curve preference of `[X25519, P-256]`. TLS 1.3 cipher suites are not configurable (all three suites, AES-128-GCM-SHA256, AES-256-GCM-SHA384, and CHACHA20-POLY1305-SHA256, are always offered), so there is no `CipherSuites` knob to misconfigure. No TLS 1.2 fallback is available.
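
For scripted clients, the matching client-side posture is a context that verifies the server against a CA bundle and refuses anything below TLS 1.3. The sketch below uses Python's standard `ssl` module; the function name is hypothetical and it is not part of certctl.

```python
import ssl
from typing import Optional


def tls13_client_context(ca_bundle: Optional[str] = None) -> ssl.SSLContext:
    """Build a verifying client context pinned to TLS 1.3.

    ca_bundle is a path to a PEM CA file; None falls back to the system trust store.
    """
    ctx = ssl.create_default_context(cafile=ca_bundle)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # no TLS 1.2 fallback
    return ctx
```

The returned context can be passed to `urllib.request.urlopen(url, context=ctx)`. Hostname checking and certificate verification stay enabled because `create_default_context` turns both on.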
|
||||||
|
|
||||||
|
Two env vars are required on the server:
|
||||||
|
|
||||||
|
- `CERTCTL_SERVER_TLS_CERT_PATH`: filesystem path to the PEM-encoded server certificate
|
||||||
|
- `CERTCTL_SERVER_TLS_KEY_PATH`: filesystem path to the PEM-encoded private key that matches the server certificate
|
||||||
|
|
||||||
|
Both paths are read during a fail-loud preflight in `cmd/server/main.go` (see `preflightServerTLS` in `cmd/server/tls.go`). If either is unset, unreadable, or the cert+key pair does not round-trip through `tls.LoadX509KeyPair`, the process refuses to start and emits a diagnostic pointing back at this doc. The rationale lives in §3 of the HTTPS-Everywhere milestone: a cert-lifecycle product should not silently bind plaintext.
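
The shape of that preflight (both variables set, both files readable, pair loads together) can be sketched outside the server. This is an illustrative Python analogue, not the `preflightServerTLS` implementation: `load_cert_chain` stands in for Go's `tls.LoadX509KeyPair`, and the function name and messages are assumptions.

```python
import os
import ssl
from typing import List, Mapping

CERT_ENV = "CERTCTL_SERVER_TLS_CERT_PATH"
KEY_ENV = "CERTCTL_SERVER_TLS_KEY_PATH"


def preflight_tls(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return a list of problems; an empty list means the pair is usable."""
    problems = []
    cert, key = env.get(CERT_ENV, ""), env.get(KEY_ENV, "")
    for name, path in ((CERT_ENV, cert), (KEY_ENV, key)):
        if not path:
            problems.append(f"{name} is not set")
        elif not os.access(path, os.R_OK):
            problems.append(f"{name}={path} is not readable")
    if problems:
        return problems
    try:
        # Fails if the PEM is malformed or the key does not match the cert.
        ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER).load_cert_chain(cert, key)
    except (ssl.SSLError, OSError) as exc:
        problems.append(f"cert/key pair failed to load: {exc}")
    return problems
```

Running a check like this in CI or an entrypoint script before starting the server surfaces a bad mount or mismatched key without waiting for the process to exit.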
|
||||||
|
|
||||||
|
## Pattern 1 — Self-signed bootstrap for docker-compose demos
|
||||||
|
|
||||||
|
This is the default for the `deploy/docker-compose.yml` stack. It exists so `docker compose up -d --build` just works on a laptop without the operator standing up a CA first. It is not appropriate for any non-demo environment.
|
||||||
|
|
||||||
|
An init container named `certctl-tls-init` runs once before the server starts. It uses the `alpine/openssl` image and generates an ed25519 self-signed cert:
|
||||||
|
|
||||||
|
```
|
||||||
|
openssl req -x509 -newkey ed25519 -nodes \
|
||||||
|
-keyout /etc/certctl/tls/server.key \
|
||||||
|
-out /etc/certctl/tls/server.crt \
|
||||||
|
-days 3650 \
|
||||||
|
-subj "/CN=certctl-server" \
|
||||||
|
-addext "subjectAltName=DNS:certctl-server,DNS:localhost,IP:127.0.0.1,IP:::1"
|
||||||
|
```
|
||||||
|
|
||||||
|
The cert, its matching key, and a copy of the cert published as `ca.crt` land in a named volume (`certs`) mounted read-only at `/etc/certctl/tls/` in both the server and agent containers. The bootstrap is idempotent: if `server.crt`, `server.key`, and `ca.crt` are already present on the volume, the init container logs `TLS cert already present at …` and exits cleanly.
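
The idempotency rule amounts to a three-file presence check. A minimal sketch of that rule follows; the function name is hypothetical, and treating a partially populated volume as "needs bootstrap" is an assumption the text does not state.

```python
from pathlib import Path

# The three files the text says must already exist for the bootstrap to be skipped.
REQUIRED_FILES = ("server.crt", "server.key", "ca.crt")


def bootstrap_needed(tls_dir: str) -> bool:
    """Return False only when all three files are already present in tls_dir."""
    directory = Path(tls_dir)
    return not all((directory / name).is_file() for name in REQUIRED_FILES)
```

Checking all three files together, rather than `server.crt` alone, avoids skipping generation when an earlier run was interrupted partway.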
|
||||||
|
|
||||||
|
Single-cert design. CN is `certctl-server` to match the Docker-network hostname. The SAN list is `[certctl-server, localhost, 127.0.0.1, ::1]`, which covers both container-internal agent→server traffic and operator browser/curl access to `https://localhost:8443`. There is no separate intermediate/root chain — the server cert and the CA bundle are the same PEM. This is the whole point of a demo bootstrap.
|
||||||
|
|
||||||
|
To force regeneration (rotate the demo cert), tear the volume down: `docker compose down -v`. The next `up` re-runs the init container.
|
||||||
|
|
||||||
|
The server's Docker healthcheck and the agent both verify against `/etc/certctl/tls/ca.crt`; no `-k` / `InsecureSkipVerify` anywhere in the default stack.
|
||||||
|
|
||||||
|
## Pattern 2 — Operator-supplied `kubernetes.io/tls` Secret (Helm)
|
||||||
|
|
||||||
|
This is the default path for Helm installs. The operator provisions a Secret of type `kubernetes.io/tls` holding `tls.crt` + `tls.key` (and optionally `ca.crt` for mounting a CA bundle to clients in the same cluster) from whatever source they already trust — their internal CA, a manually-issued cert, step-ca, AWS ACM PCA exported to PEM, or the output of the self-signed bootstrap pattern above copied into a cluster Secret.
|
||||||
|
|
||||||
|
```
|
||||||
|
kubectl create secret tls certctl-server-tls \
|
||||||
|
--cert=server.crt \
|
||||||
|
--key=server.key \
|
||||||
|
--namespace certctl
|
||||||
|
```
|
||||||
|
|
||||||
|
Then:
|
||||||
|
|
||||||
|
```
|
||||||
|
helm install certctl deploy/helm/certctl \
|
||||||
|
--namespace certctl \
|
||||||
|
--set server.tls.existingSecret=certctl-server-tls
|
||||||
|
```
|
||||||
|
|
||||||
|
The Secret is mounted read-only at `/etc/certctl/tls/` in the server pod. The `CERTCTL_SERVER_TLS_CERT_PATH` and `CERTCTL_SERVER_TLS_KEY_PATH` env vars are wired to `tls.crt` and `tls.key` keys inside that mount. If `ca.crt` is absent from the Secret, clients that need a CA bundle should use `tls.crt` as the bundle (self-signed case) or mount a separate ConfigMap with the root chain (operator-CA case).
|
||||||
|
|
||||||
|
If the operator sets neither `server.tls.existingSecret` nor `server.tls.certManager.enabled=true`, `helm template` / `helm install` fails at render-time with a diagnostic pointing at this doc. The guard is implemented in `deploy/helm/certctl/templates/_helpers.tpl` under the `certctl.tls.required` helper. This is deliberate: the HTTPS-only server would crash-loop on an empty path, so we fail earlier at Helm-render time.
|
||||||
|
|
||||||
|
## Pattern 3 — cert-manager `Certificate` CR (Helm, opt-in)
|
||||||
|
|
||||||
|
For clusters that already run cert-manager, the chart can provision a `Certificate` CR that writes into the Secret the server pod reads from. This is opt-in — the default is `server.tls.certManager.enabled: false` — because not every cluster has cert-manager installed, and we refuse to ship a chart that silently depends on an external controller.
|
||||||
|
|
||||||
|
```
|
||||||
|
helm install certctl deploy/helm/certctl \
|
||||||
|
--namespace certctl \
|
||||||
|
--set server.tls.certManager.enabled=true \
|
||||||
|
--set server.tls.certManager.issuerRef.name=my-cluster-issuer \
|
||||||
|
--set server.tls.certManager.issuerRef.kind=ClusterIssuer
|
||||||
|
```
|
||||||
|
|
||||||
|
The rendered `Certificate` (see `deploy/helm/certctl/templates/server-certificate.yaml`) writes `tls.crt` + `tls.key` + `ca.crt` into the Secret named by `server.tls.certManager.secretName` (defaults to `<fullname>-tls`). The server pod reads from that same Secret; the agent DaemonSet mounts the same Secret as its CA bundle source.
|
||||||
|
|
||||||
|
cert-manager handles rotation. certctl-server handles in-place reload — see the SIGHUP section below.
|
||||||
|
|
||||||
|
The chart enforces that if `server.tls.certManager.enabled=true`, `server.tls.certManager.issuerRef.name` must also be set. An empty `issuerRef.name` makes `helm template` fail with a diagnostic naming the missing flag.
|
||||||
|
|
||||||
|
## Pattern 4 — Manually-issued from an internal CA
|
||||||
|
|
||||||
|
For operators running neither Helm nor docker-compose (bare-metal / custom orchestration), the server just needs two files on disk pointed at by `CERTCTL_SERVER_TLS_CERT_PATH` and `CERTCTL_SERVER_TLS_KEY_PATH`. Issue the cert from your internal CA with:
|
||||||
|
|
||||||
|
- CN matching the hostname your agents and operators use to dial the server (e.g., `certctl.prod.example.com`)
|
||||||
|
- SAN list covering every hostname and IP that appears in `CERTCTL_SERVER_URL` values across your agent fleet
|
||||||
|
- Key usage: digital signature (plus key encipherment if the key is RSA; it does not apply to ECDSA or Ed25519 keys)
|
||||||
|
- Extended key usage: server auth
|
||||||
|
|
||||||
|
Store the key with mode `0600` and owner matching the UID the server runs as (`1000` in our shipped Dockerfile). The server process reads both files during `preflightServerTLS` at startup and again on every SIGHUP.
|
||||||
|
|
||||||
|
The full CA chain that signed the server cert should be distributed to agents, CLI operators, and MCP clients as their `CERTCTL_SERVER_CA_BUNDLE_PATH` — see the client section below.
|
||||||
|
|
||||||
|
## SIGHUP cert rotation
|
||||||
|
|
||||||
|
The server wraps its cert+key pair in a `*certHolder` (see `cmd/server/tls.go`) that guards the loaded `*tls.Certificate` under a `sync.Mutex`. The `*tls.Config` wires `GetCertificate` to the holder, so every new inbound TLS handshake reads whatever cert the holder currently has.
|
||||||
|
|
||||||
|
Send `SIGHUP` to the server PID and the holder re-reads both files from disk. On success, the next new connection uses the new cert; in-flight requests finish on the previous cert. A log line goes out:
|
||||||
|
|
||||||
|
```
|
||||||
|
TLS cert reloaded via SIGHUP cert_path=/etc/certctl/tls/server.crt key_path=/etc/certctl/tls/server.key
|
||||||
|
```
|
||||||
|
|
||||||
|
On failure (missing file, malformed PEM, key does not match the cert), the old cert is retained and an error is logged:
|
||||||
|
|
||||||
|
```
|
||||||
|
TLS cert reload failed; continuing with previous cert cert_path=… key_path=… error=…
|
||||||
|
```
|
||||||
|
|
||||||
|
This is deliberately fail-safe on reload (as opposed to fail-loud on startup). A cert-manager renewal race, a partially-copied file, a typo in a rotation script — none of those should crash a running server and drop every agent connection. The operator sees the error in logs, fixes the underlying issue, and sends another `SIGHUP`.
|
||||||
|
|
||||||
|
Pair with cert-manager, certbot `--post-hook`, or any rotation tool that can fire a signal. For docker-compose, `docker compose kill -s HUP certctl-server` works. For Kubernetes, reload is typically handled by cert-manager updating the Secret and the mounted file changing on the next kubelet sync — no explicit SIGHUP needed if the volume mount is `subPath`-free.
|
||||||
|
|
||||||
|
Startup is a different story. If the cert is missing or malformed at process start, the server exits non-zero rather than binding plaintext or attempting a retry loop. That's the HTTPS-only contract.
|
||||||
|
|
||||||
|
## Client-side TLS: agents, CLI, MCP
|
||||||
|
|
||||||
|
Everything that talks to the server enforces HTTPS on the URL.
|
||||||
|
|
||||||
|
### Agent
|
||||||
|
|
||||||
|
`CERTCTL_SERVER_URL` must be `https://…`. `http://`, bare hostnames, `ftp://`, `ws://`, and empty strings are rejected at startup by `validateHTTPSScheme` in `cmd/agent/main.go` with a diagnostic pointing at `upgrade-to-tls.md`. There is no warning-and-proceed path.
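A scheme check of this kind can be sketched with `net/url`. The function name `validateHTTPSURL` and the error text are assumptions for illustration; the shipped `validateHTTPSScheme` may apply different rules or wording.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateHTTPSURL is a hypothetical stand-in for the startup check:
// it accepts only absolute https:// URLs with a host component.
func validateHTTPSURL(raw string) error {
	if raw == "" {
		return fmt.Errorf("server URL is empty: HTTPS-only control plane refuses to start")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("server URL %q is not parseable: %w", raw, err)
	}
	// Bare hostnames parse with an empty scheme, so they land here too.
	if u.Scheme != "https" {
		return fmt.Errorf("URL scheme %q is not supported: HTTPS-only control plane refuses to start", u.Scheme)
	}
	if u.Host == "" {
		return fmt.Errorf("server URL %q has no host", raw)
	}
	return nil
}
```

Because the check runs before any network call, a misconfigured agent fails immediately instead of retrying against an endpoint it can never reach.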
|
||||||
|
|
||||||
|
Two additional env vars control how the agent verifies the server cert:
|
||||||
|
|
||||||
|
- `CERTCTL_SERVER_CA_BUNDLE_PATH` — filesystem path to a PEM-encoded CA bundle that signed the server cert. Loaded into `*tls.Config.RootCAs` on the agent's HTTP client. If unset, the agent falls back to the OS system trust store.
|
||||||
|
- `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY` — defaults to `false`. Setting it to `true` skips verification entirely. **Dev-only escape hatch.** The agent logs a prominent warning at startup (`TLS certificate verification is disabled … never enable this in production`). Use this only when dialing a demo server whose cert you haven't bothered to mount into the agent container.
|
||||||
|
|
||||||
|
Equivalent CLI flags: `--ca-bundle <path>` and `--insecure-skip-verify`.
|
||||||
|
|
||||||
|
If both the CA bundle and `InsecureSkipVerify=true` are set, `InsecureSkipVerify` wins — it's the whole point of the flag. Don't do this in production.
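The interaction between the two settings can be sketched as follows. This is an illustrative helper, not the agent's actual code: `clientTLSConfig` and its parameters are hypothetical names standing in for the env vars described above.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// clientTLSConfig builds a client-side tls.Config from a CA bundle path
// and a skip-verify switch (hypothetical helper for illustration).
func clientTLSConfig(caBundlePath string, insecureSkipVerify bool) (*tls.Config, error) {
	// When InsecureSkipVerify is true, Go skips chain and hostname
	// verification entirely, so RootCAs no longer has any effect.
	cfg := &tls.Config{InsecureSkipVerify: insecureSkipVerify}
	if caBundlePath == "" {
		// A nil RootCAs makes Go use the host's root CA set.
		return cfg, nil
	}
	pemBytes, err := os.ReadFile(caBundlePath)
	if err != nil {
		return nil, fmt.Errorf("read CA bundle %q: %w", caBundlePath, err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, fmt.Errorf("CA bundle %q contains no PEM certificates", caBundlePath)
	}
	cfg.RootCAs = pool
	return cfg, nil
}
```

The returned config would be placed on an `http.Transport` via its `TLSClientConfig` field. Returning an error for an unreadable or empty bundle, rather than silently falling back to the system store, is a design choice made for this sketch.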
|
||||||
|
|
||||||
|
### CLI (`certctl-cli`)
|
||||||
|
|
||||||
|
Same contract as the agent:
|
||||||
|
|
||||||
|
- `CERTCTL_SERVER_URL` must use the `https://` scheme; `http://` is rejected at startup
|
||||||
|
- `--ca-bundle <path>` flag or `CERTCTL_SERVER_CA_BUNDLE_PATH` env var — CA bundle for server cert verification
|
||||||
|
- `--insecure` flag or `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true` — skip verification (dev only)
|
||||||
|
- Error diagnostic on empty URL explicitly mentions both `--server` and `CERTCTL_SERVER_URL` so operators see the right knob to turn
|
||||||
|
|
||||||
|
The CLI shares the URL-scheme validation with the agent; the test pins in `cmd/cli/main_test.go:TestValidateHTTPSScheme` cover the full rejection matrix.
|
||||||
|
|
||||||
|
### MCP server (`certctl-mcp-server`)
|
||||||
|
|
||||||
|
Same three controls as CLI, env-var-driven only (no flags — MCP runs as a stdio subprocess and inherits env from the launching LLM client):
|
||||||
|
|
||||||
|
- `CERTCTL_SERVER_URL` must start with `https://`
|
||||||
|
- `CERTCTL_SERVER_CA_BUNDLE_PATH` optional CA bundle
|
||||||
|
- `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY` optional skip
|
||||||
|
|
||||||
|
Claude Desktop / other MCP client configs should set all three in the tool's env block.
|
||||||
|
|
||||||
|
## Troubleshooting: fail-loud preflight errors
|
||||||
|
|
||||||
|
Every preflight failure message ends with `(see docs/tls.md)` so this doc is the first hit when an operator searches. Common failures:
|
||||||
|
|
||||||
|
**`CERTCTL_SERVER_TLS_CERT_PATH is empty: HTTPS-only control plane refuses to start`**
|
||||||
|
Set the env var. For docker-compose this is already set to `/etc/certctl/tls/server.crt` in the shipped compose file — if you're seeing this, check the `certctl-tls-init` service logs to see why the init container didn't populate the volume. For Helm, check that `server.tls.existingSecret` or `server.tls.certManager.enabled=true` is set.
|
||||||
|
|
||||||
|
**`TLS cert file "…" unreadable: …`**
|
||||||
|
The cert path is set but `os.Stat` failed. Check filesystem permissions — the server runs as UID 1000 in our shipped Dockerfile; the cert needs to be readable by that UID. Typos in the path also land here.
|
||||||
|
|
||||||
|
**`TLS cert/key pair invalid (cert="…" key="…"): …`**
|
||||||
|
Both files exist but `tls.LoadX509KeyPair` refused them. Typical causes: the private key does not match the certificate, the key is encrypted with a passphrase (not supported; remove the passphrase with `openssl pkey` before mounting), or one of the two is DER-encoded instead of PEM. Re-issue the pair from the same CA call and re-mount.
|
||||||
|
|
||||||
|
**Client side: `tls: failed to verify certificate: x509: certificate signed by unknown authority`**
|
||||||
|
The client did not trust the CA that signed the server cert. Either mount the CA bundle via `CERTCTL_SERVER_CA_BUNDLE_PATH`, add the CA to the system trust store on the client host, or (dev only) set `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true`.
|
||||||
|
|
||||||
|
**Client side: `tls: first record does not look like a TLS handshake`**
|
||||||
|
The client is speaking plaintext HTTP to an HTTPS server (or vice-versa). Check that `CERTCTL_SERVER_URL` starts with `https://`. If you are upgrading from a pre-v2.2 release and your agents are old, they will surface this error until you roll the DaemonSet — see [`upgrade-to-tls.md`](upgrade-to-tls.md).
|
||||||
|
|
||||||
|
## Related docs
|
||||||
|
|
||||||
|
- [`upgrade-to-tls.md`](upgrade-to-tls.md) — one-step cutover from pre-HTTPS releases
|
||||||
|
- [`quickstart.md`](quickstart.md) — docker-compose walkthrough with HTTPS examples
|
||||||
|
- [`test-env.md`](test-env.md) — integration test environment (also HTTPS-only)
|
||||||
|
- Milestone spec: `prompts/https-everywhere-milestone.md` (authoritative source for locked decisions)
|
||||||
@@ -0,0 +1,194 @@
|
|||||||
|
# Upgrading to HTTPS-Everywhere (v2.2)
|
||||||
|
|
||||||
|
certctl's control plane is HTTPS-only as of v2.2. There is no `http` mode, no `auto` mode, no dual-listener bind, no N-release migration window. The cutover is a single step. Out-of-date agents that still point at `http://…` fail at the TCP/TLS handshake layer on first connect after the upgrade and stay `Offline` in the dashboard until their env block is updated and the fleet is rolled.
|
||||||
|
|
||||||
|
This doc walks operators through the cutover for the two shipped deployment topologies — docker-compose and Helm — and documents the failure modes and rollback posture explicitly.
|
||||||
|
|
||||||
|
For the deep-dive on cert provisioning patterns, SIGHUP cert reload, and client-side CA-trust configuration, read [`tls.md`](tls.md). This doc is the narrow "how do I upgrade" procedure.
|
||||||
|
|
||||||
|
## Preconditions
|
||||||
|
|
||||||
|
Before you start, confirm:
|
||||||
|
|
||||||
|
- **Shell access** to the server host and every agent host. The cutover requires you to restart the server and update every agent's env block.
|
||||||
|
- **A cert+key source** for the server. Pick one:
|
||||||
|
- An internal CA that can issue a server cert (CN + SAN list covering every hostname / IP agents dial).
|
||||||
|
- A `cert-manager` install in the target Kubernetes cluster, plus a `ClusterIssuer` or `Issuer` you're willing to reference.
|
||||||
|
- Willingness to use the self-signed bootstrap that the shipped `deploy/docker-compose.yml` generates automatically. This is the right choice for dev and demo; it is the wrong choice for production.
|
||||||
|
- **A maintenance window.** Out-of-date agents break at the TLS handshake and stay offline until rolled. Schedule the upgrade so the agent fleet can be updated in the same window as the server.
|
||||||
|
- **Backups.** This is a one-way door (see the Rollback section below). Snapshot your PostgreSQL database before `docker compose down` or `helm upgrade`.
|
||||||
|
|
||||||
|
There is no schema migration tied to this release; the only at-rest state that changes is the `certs` named volume (docker-compose) or the `tls.crt`/`tls.key` Secret (Helm).
|
||||||
|
|
||||||
|
## Procedure — docker-compose operators
|
||||||
|
|
||||||
|
The shipped `deploy/docker-compose.yml` includes a `certctl-tls-init` init container that self-signs an ed25519 cert on first boot and drops `server.crt`, `server.key`, and `ca.crt` into a named volume mounted read-only at `/etc/certctl/tls/` on the server and agent containers. No manual cert provisioning is required for the default stack.
|
||||||
|
|
||||||
|
1. **Pull the HTTPS-everywhere release.** From the repo root:
|
||||||
|
|
||||||
|
```
|
||||||
|
git pull
|
||||||
|
```
|
||||||
|
|
||||||
|
Confirm you're on a tag or `master` that contains the `certctl-tls-init` service in `deploy/docker-compose.yml`. Grep for it: `grep certctl-tls-init deploy/docker-compose.yml` should hit.
|
||||||
|
|
||||||
|
2. **Stop the old plaintext cluster.**
|
||||||
|
|
||||||
|
```
|
||||||
|
docker compose -f deploy/docker-compose.yml down
|
||||||
|
```
|
||||||
|
|
||||||
|
Do not pass `-v`; keeping the PostgreSQL volume preserves your cert inventory, audit trail, and job history across the upgrade.
|
||||||
|
|
||||||
|
3. **Bring the cluster back up with the HTTPS build.**
|
||||||
|
|
||||||
|
```
|
||||||
|
docker compose -f deploy/docker-compose.yml up -d --build
|
||||||
|
```
|
||||||
|
|
||||||
|
The `certctl-tls-init` service runs once, generates the self-signed cert into the `certs` volume, and exits with code 0. The server container waits for `certctl-tls-init` via `depends_on: { condition: service_completed_successfully }` and only starts once the cert material is on disk. The server's Docker healthcheck now uses `curl --cacert /etc/certctl/tls/ca.crt -f https://localhost:8443/health`, so the container only becomes healthy once the HTTPS listener is up and serving the bundled cert correctly.
|
||||||
|
|
||||||
|
4. **Verify the HTTPS endpoint from the host.**
|
||||||
|
|
||||||
|
```
|
||||||
|
docker compose -f deploy/docker-compose.yml exec -T certctl-server cat /etc/certctl/tls/ca.crt > /tmp/certctl-ca.crt
curl --cacert /tmp/certctl-ca.crt https://localhost:8443/health
|
||||||
|
```
|
||||||
|
|
||||||
|
Expect `{"status":"ok"}` with HTTP 200. If you get a TLS verification error, the CA bundle wasn't read correctly: `--cacert` expects a file path, so save the output of the `exec -T` command to a local file and pass that path. If you get `connection refused`, the server never finished startup; check `docker compose logs certctl-server` for a fail-loud preflight diagnostic pointing at `docs/tls.md`.
|
||||||
|
|
||||||
|
5. **Confirm the bundled agent reconnects.** Agents inside the compose stack pick up the new URL (`CERTCTL_SERVER_URL=https://certctl-server:8443`) and the bundled CA (`CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/tls/ca.crt`) from their env block automatically — no per-agent change needed. Tail the agent log:
|
||||||
|
|
||||||
|
```
|
||||||
|
docker compose -f deploy/docker-compose.yml logs -f certctl-agent
|
||||||
|
```
|
||||||
|
|
||||||
|
You should see `heartbeat sent` within 30 seconds. In the dashboard (`https://localhost:8443`), the agent should show as `Online`.
|
||||||
|
|
||||||
|
**External agents** running outside the compose network (e.g., the `install-agent.sh`-installed systemd service on a separate host) need their env block updated manually before the cutover — see the Agent env block section below.
|
||||||
|
|
||||||
|
## Procedure — Helm operators
|
||||||
|
|
||||||
|
The Helm chart does not self-sign. It refuses to render (`helm template` exits non-zero) unless you configure one of two cert sources: an operator-supplied Secret, or a cert-manager `Certificate` CR. See [`tls.md`](tls.md) for the full pattern catalog.
|
||||||
|
|
||||||
|
1. **Provision cert material.** Pick one of:
|
||||||
|
|
||||||
|
- **Operator-supplied Secret.** Issue a cert from your internal CA (or any other source) and load it into a `kubernetes.io/tls` Secret in the certctl namespace:
|
||||||
|
|
||||||
|
```
|
||||||
|
kubectl create secret tls certctl-server-tls \
|
||||||
|
--cert=server.crt --key=server.key \
|
||||||
|
--namespace certctl
|
||||||
|
```
|
||||||
|
|
||||||
|
- **cert-manager.** Set `server.tls.certManager.enabled=true` on the upgrade and reference an existing `ClusterIssuer` or `Issuer`:
|
||||||
|
|
||||||
|
```
|
||||||
|
--set server.tls.certManager.enabled=true
|
||||||
|
--set server.tls.certManager.issuerRef.name=my-cluster-issuer
|
||||||
|
--set server.tls.certManager.issuerRef.kind=ClusterIssuer
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Upgrade the release.**
|
||||||
|
|
||||||
|
```
|
||||||
|
helm upgrade certctl deploy/helm/certctl \
|
||||||
|
--namespace certctl \
|
||||||
|
--set server.tls.existingSecret=certctl-server-tls
|
||||||
|
```
|
||||||
|
|
||||||
|
(Or the `certManager` variant.) If you omit both `server.tls.existingSecret` and `server.tls.certManager.enabled`, the chart fails at render time with a diagnostic pointing at `docs/tls.md`. That guard exists precisely so you catch the missing config at `helm upgrade` time, not at pod-crash-loop time.
|
||||||
|
|
||||||
|
3. **Verify the HTTPS endpoint from inside the cluster.** Port-forward and curl with the CA bundle:
|
||||||
|
|
||||||
|
```
|
||||||
|
kubectl port-forward -n certctl svc/certctl-server 8443:8443 &
|
||||||
|
kubectl get secret -n certctl certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/certctl-ca.crt
|
||||||
|
curl --cacert /tmp/certctl-ca.crt https://localhost:8443/health
|
||||||
|
```
|
||||||
|
|
||||||
|
Expect `{"status":"ok"}`. If the Secret does not contain a `ca.crt` key (operator-supplied Secrets often don't), use `tls.crt` as the bundle instead — for a self-signed cert the two files are identical, and for a cert chained to an internal CA you should separately distribute the root CA bundle via ConfigMap or mounted file.
|
||||||
|
|
||||||
|
4. **Update every agent manifest.** Agents outside this Helm release (or in a separately-managed DaemonSet) need their env block updated:
|
||||||
|
|
||||||
|
```
|
||||||
|
- name: CERTCTL_SERVER_URL
|
||||||
|
value: "https://certctl-server.certctl.svc.cluster.local:8443"
|
||||||
|
- name: CERTCTL_SERVER_CA_BUNDLE_PATH
|
||||||
|
value: "/etc/certctl/tls/ca.crt"
|
||||||
|
```
|
||||||
|
|
||||||
|
Mount the server's Secret (or a separate CA-bundle Secret / ConfigMap) at `/etc/certctl/tls/` as a read-only volume. If you bundle the agent via the shipped Helm chart's DaemonSet, the wiring is already done — set `agent.enabled=true` and the chart mounts the same Secret.
|
||||||
|
|
||||||
|
5. **Roll the agent DaemonSet.**
|
||||||
|
|
||||||
|
```
|
||||||
|
kubectl rollout restart ds/certctl-agent -n certctl
|
||||||
|
kubectl rollout status ds/certctl-agent -n certctl
|
||||||
|
```
|
||||||
|
|
||||||
|
Every agent pod restarts with the new URL + CA bundle and reconnects on HTTPS. The dashboard shows agents flip from `Offline` to `Online` as pods finish rolling.
|
||||||
|
|
||||||
|
## Agent env block — external hosts
|
||||||
|
|
||||||
|
Agents installed on bare-metal or VM hosts via `install-agent.sh` (systemd on Linux, launchd on macOS) read config from `/etc/certctl/agent.env` (Linux) or `~/Library/Application Support/certctl/agent.env` (macOS). On cutover, append or update:
|
||||||
|
|
||||||
|
```
|
||||||
|
CERTCTL_SERVER_URL=https://certctl.example.com:8443
|
||||||
|
CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/tls/ca.crt
|
||||||
|
# CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=false # Dev only. Never set to true in production.
|
||||||
|
```
|
||||||
|
|
||||||
|
Distribute the CA bundle (the same `ca.crt` the server holds, or the root chain if you issued the server cert from an intermediate) to every agent host. The path under `CERTCTL_SERVER_CA_BUNDLE_PATH` must be readable by the UID the agent service runs as.
|
||||||
|
|
||||||
|
Restart the service after editing:
|
||||||
|
|
||||||
|
- Linux: `systemctl restart certctl-agent`
|
||||||
|
- macOS: `launchctl kickstart -k system/com.certctl.agent`
|
||||||
|
|
||||||
|
The agent refuses to start on an `http://` URL and exits with a pre-flight diagnostic that names this doc. That rejection happens before any network call — no spurious half-connected state.
|
||||||
|
|
||||||
|
## Failure mode
|
||||||
|
|
||||||
|
Out-of-date agents still configured with `CERTCTL_SERVER_URL=http://…` fail on first reconnect after the cutover. The failure surfaces as one of:
|
||||||
|
|
||||||
|
- `dial tcp …: connect: connection refused` — the server is no longer listening on a plaintext port. The new release binds only a TLS listener; attempting a plaintext `connect()` gets refused at the kernel level because nothing holds the socket.
|
||||||
|
- `tls: first record does not look like a TLS handshake` — depending on timing and proxy layers (e.g., a load balancer that accepts the TCP connection before forwarding), the client may negotiate TCP, send an HTTP request line, and have the server's TLS stack reject it.
|
||||||
|
|
||||||
|
Agents in this state surface as `Offline` in the dashboard. They stay offline until their env block is updated and the service restarts. There is no graceful 400-with-migration-URL response because there is no HTTP listener to serve one from — the entire plaintext call path is removed by design.
|
||||||
|
|
||||||
|
If an agent unexpectedly stays `Offline` past the cutover window, SSH to the host and check the agent log. On a systemd host:
|
||||||
|
|
||||||
|
```
|
||||||
|
journalctl -u certctl-agent -n 100
|
||||||
|
```
|
||||||
|
|
||||||
|
Look for `URL scheme "http" is not supported: HTTPS-only control plane refuses to start (see docs/upgrade-to-tls.md)`. That's the pre-flight rejection. Update `CERTCTL_SERVER_URL`, restart the service, and the agent reconnects.
|
||||||
|
|
||||||
|
## Rollback
|
||||||
|
|
||||||
|
**There is no rollback window.** The upgrade is a one-way door. The rationale lives in §3.7 of `prompts/https-everywhere-milestone.md`: a cert-lifecycle product that bridges back to plaintext after committing to HTTPS is advertising that its own security posture is negotiable.
|
||||||
|
|
||||||
|
If you need to revert, you have two options:
|
||||||
|
|
||||||
|
1. **Stay on the pre-HTTPS release.** Do not upgrade until you are ready to run HTTPS on the control plane. Pin your `docker-compose.yml` or `helm upgrade` command to the last pre-v2.2 tag.
|
||||||
|
2. **Roll back the release.** `helm rollback certctl <previous-revision>` or `git checkout <previous-tag> && docker compose up -d --build`. This rolls back the server, the compose topology, and the Helm chart in lockstep. Your PostgreSQL volume (cert inventory, audit trail, jobs) survives the rollback; nothing in this milestone changes the database schema.
|
||||||
|
|
||||||
|
Option 2 drops you back to the plaintext world. It should be treated as an emergency measure, not a supported migration path.
|
||||||
|
|
||||||
|
## After the cutover
|
||||||
|
|
||||||
|
Once every agent is `Online`, confirm a few invariants:
|
||||||
|
|
||||||
|
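- `curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:8443/health` does not return `200`. Expect `000` with a connection error, or `400` if the TLS listener answers the plaintext request with an error response (Go's `net/http` TLS server typically does this). Either way, plaintext is gone.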
|
||||||
|
- `openssl s_client -connect localhost:8443 -tls1_2 </dev/null` fails the handshake. TLS 1.2 is rejected.
|
||||||
|
- `openssl s_client -connect localhost:8443 -tls1_3 </dev/null` succeeds and prints the server certificate; pipe the output through `openssl x509 -noout -ext subjectAltName` to inspect the SAN list. TLS 1.3 is live.
|
||||||
|
- A cert rotation test: overwrite the server cert on disk, `kill -HUP` the server PID, confirm the new cert serves on the next `openssl s_client -connect … -showcerts` without a process restart. See the SIGHUP section in [`tls.md`](tls.md).
|
||||||
|
|
||||||
|
Update your runbooks. Every `http://certctl.example.com` URL in internal documentation, monitoring config, and on-call playbooks should become `https://certctl.example.com` plus a CA-trust note.
|
||||||
|
|
||||||
|
## Related docs
|
||||||
|
|
||||||
|
- [`tls.md`](tls.md) — cert provisioning patterns, SIGHUP rotation, troubleshooting
|
||||||
|
- [`quickstart.md`](quickstart.md) — docker-compose walkthrough (post-HTTPS)
|
||||||
|
- [`test-env.md`](test-env.md) — integration test environment (HTTPS-only)
|
||||||
|
- Milestone spec: `prompts/https-everywhere-milestone.md`
|
||||||
+1
-1
@@ -107,7 +107,7 @@ The demo seeds certificates across multiple issuers, agents, and deployment targ
|
|||||||
```bash
|
```bash
|
||||||
git clone https://github.com/shankar0123/certctl.git
|
git clone https://github.com/shankar0123/certctl.git
|
||||||
cd certctl/deploy && docker compose up -d
|
cd certctl/deploy && docker compose up -d
|
||||||
# Dashboard at http://localhost:8443
|
# Dashboard at https://localhost:8443 (self-signed cert — pin deploy/test/certs/ca.crt)
|
||||||
```
|
```
|
||||||
|
|
||||||
See the [Quickstart Guide](quickstart.md) for a full walkthrough, or explore the [5 turnkey examples](../examples/) for specific scenarios (ACME+NGINX, wildcard DNS-01, private CA+Traefik, step-ca+HAProxy, multi-issuer).
|
See the [Quickstart Guide](quickstart.md) for a full walkthrough, or explore the [5 turnkey examples](../examples/) for specific scenarios (ACME+NGINX, wildcard DNS-01, private CA+Traefik, step-ca+HAProxy, multi-issuer).
|
||||||
|
|||||||
@@ -36,6 +36,13 @@ flowchart TD
|
|||||||
|
|
||||||
If you don't have a real domain or can't open port 80, see [Customization Tips](#customization-tips) below.
|
If you don't have a real domain or can't open port 80, see [Customization Tips](#customization-tips) below.
|
||||||
|
|
||||||
|
## TLS Security
|
||||||
|
|
||||||
|
certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
|
||||||
|
- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
|
||||||
|
- Use `curl -k ...` for quick smoke tests (never in production)
|
||||||
|
- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
|
||||||
|
|
||||||
## Quick Start
|
## Quick Start
|
||||||
|
|
||||||
### 1. Clone or copy this example
|
### 1. Clone or copy this example
|
||||||
@@ -122,7 +129,7 @@ docker compose logs -f certctl-server certctl-agent
|
|||||||
|
|
||||||
### 5. Access the dashboard
|
### 5. Access the dashboard
|
||||||
|
|
||||||
Navigate to `http://localhost:8443` (or your `SERVER_PORT`)
|
Navigate to `https://localhost:8443` (or your `SERVER_PORT`)
|
||||||
|
|
||||||
You should see:
|
You should see:
|
||||||
- An empty certificate inventory (no certs issued yet)
|
- An empty certificate inventory (no certs issued yet)
|
||||||
|
|||||||
@@ -61,7 +61,7 @@ services:
|
|||||||
networks:
|
networks:
|
||||||
- certctl-network
|
- certctl-network
|
||||||
healthcheck:
|
healthcheck:
|
||||||
test: ['CMD-SHELL', 'curl -sf http://localhost:8443/health || exit 1']
|
test: ['CMD-SHELL', 'curl -sfk https://localhost:8443/health || exit 1']
|
||||||
interval: 10s
|
interval: 10s
|
||||||
timeout: 5s
|
timeout: 5s
|
||||||
retries: 3
|
retries: 3
|
||||||
|
|||||||
@@ -9,6 +9,13 @@ This example is ideal for:
|
|||||||
- Internal PKI with public DNS names
|
- Internal PKI with public DNS names
|
||||||
- Scenarios where you have programmatic access to your DNS provider's API
|
- Scenarios where you have programmatic access to your DNS provider's API
|
||||||
|
|
||||||
|
## TLS Security
|
||||||
|
|
||||||
|
certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
|
||||||
|
- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
|
||||||
|
- Use `curl -k ...` for quick smoke tests (never in production)
|
||||||
|
- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
|
||||||
|
|
||||||
## Prerequisites
|
## Prerequisites
|
||||||
|
|
||||||
Before running this example, you need:
|
Before running this example, you need:
|
||||||
@@ -74,7 +81,7 @@ This starts:
|
|||||||
|
|
||||||
### Step 5: Access the Dashboard
|
### Step 5: Access the Dashboard
|
||||||
|
|
||||||
Open your browser to `http://localhost:8443`
|
Open your browser to `https://localhost:8443`
|
||||||
|
|
||||||
### Step 6: Create a Wildcard Certificate
|
### Step 6: Create a Wildcard Certificate
|
||||||
|
|
||||||
|
|||||||
@@ -113,7 +113,7 @@ services:
|
|||||||
- certctl-network
|
- certctl-network
|
||||||
|
|
||||||
healthcheck:
|
healthcheck:
|
||||||
test: ['CMD-SHELL', 'curl -sf http://localhost:8443/health || exit 1']
|
test: ['CMD-SHELL', 'curl -sfk https://localhost:8443/health || exit 1']
|
||||||
interval: 10s
|
interval: 10s
|
||||||
timeout: 5s
|
timeout: 5s
|
||||||
retries: 3
|
retries: 3
|
||||||
|
|||||||
@@ -64,7 +64,7 @@ services:
     networks:
       - certctl-network
     healthcheck:
-      test: ['CMD-SHELL', 'curl -sf http://localhost:8443/health || exit 1']
+      test: ['CMD-SHELL', 'curl -sfk https://localhost:8443/health || exit 1']
       interval: 10s
       timeout: 5s
       retries: 3
|
|||||||
@@ -45,6 +45,13 @@ flowchart TD
 - **Domain for ACME** (optional) — if using real Let's Encrypt, not needed for demo
 - **Internet connectivity** — to reach Let's Encrypt's API (demo can use staging directory)
 
+## TLS Security
+
+certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
+- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
+- Use `curl -k ...` for quick smoke tests (never in production)
+- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
+
 ## Quick Start
 
 ### 1. Clone or navigate to this directory
@@ -83,7 +90,7 @@ This spins up:
 
 ### 4. Access the dashboard
 
-Open your browser to **http://localhost:8443** (or your configured SERVER_PORT)
+Open your browser to **https://localhost:8443** (or your configured SERVER_PORT)
 
 You should see:
 - Empty cert inventory (fresh start)
|
|||||||
@@ -77,7 +77,7 @@ services:
     networks:
       - certctl-network
     healthcheck:
-      test: ['CMD-SHELL', 'curl -sf http://localhost:8443/health || exit 1']
+      test: ['CMD-SHELL', 'curl -sfk https://localhost:8443/health || exit 1']
       interval: 10s
       timeout: 5s
       retries: 3
|
|||||||
@@ -29,6 +29,13 @@ flowchart TD
 C -->|TLS handshakes| D
 ```
 
+## TLS Security
+
+certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
+- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
+- Use `curl -k ...` for quick smoke tests (never in production)
+- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
+
 ## Quick Start (Self-Signed CA)
 
 The simplest way to get running in 2 minutes:
@@ -58,7 +65,7 @@ EOF
 docker compose up -d
 
 # 4. Access the dashboards
-# - certctl: http://localhost:8443 (API only, use the CLI or direct HTTP calls)
+# - certctl: https://localhost:8443 (API only, use the CLI or direct HTTP calls)
 # - Traefik dashboard: http://localhost:8080
 ```
 
@@ -112,7 +119,7 @@ Once the stack is running:
 
 ```bash
 # 1. Create a certificate profile in certctl (defines allowed key types, TTL, etc.)
-curl -X POST http://localhost:8443/api/v1/profiles \
+curl -X POST https://localhost:8443/api/v1/profiles \
   -H "Content-Type: application/json" \
   -d '{
     "id": "prof-internal",
@@ -123,7 +130,7 @@ curl -X POST http://localhost:8443/api/v1/profiles \
   }'
 
 # 2. Create a renewal policy (defines issuer, renewal thresholds, etc.)
-curl -X POST http://localhost:8443/api/v1/policies \
+curl -X POST https://localhost:8443/api/v1/policies \
   -H "Content-Type: application/json" \
   -d '{
     "id": "pol-internal",
@@ -135,7 +142,7 @@ curl -X POST http://localhost:8443/api/v1/policies \
   }'
 
 # 3. Create a certificate (triggers issuance immediately)
-curl -X POST http://localhost:8443/api/v1/certificates \
+curl -X POST https://localhost:8443/api/v1/certificates \
   -H "Content-Type: application/json" \
   -d '{
     "common_name": "api.internal.local",
@@ -144,7 +151,7 @@ curl -X POST http://localhost:8443/api/v1/certificates \
   }'
 
 # 4. Create a Traefik target (agent will deploy to this)
-curl -X POST http://localhost:8443/api/v1/targets \
+curl -X POST https://localhost:8443/api/v1/targets \
   -H "Content-Type: application/json" \
   -d '{
     "id": "target-traefik-01",
@@ -156,7 +163,7 @@ curl -X POST http://localhost:8443/api/v1/targets \
   }'
 
 # 5. Create a deployment job (agent picks this up and deploys)
-curl -X POST http://localhost:8443/api/v1/certificates/{cert-id}/deploy \
+curl -X POST https://localhost:8443/api/v1/certificates/{cert-id}/deploy \
   -H "Content-Type: application/json" \
   -d '{
     "target_ids": ["target-traefik-01"]
@@ -209,16 +216,16 @@ The server provides a REST API on port 8443. Example queries:
 
 ```bash
 # List all certificates
-curl http://localhost:8443/api/v1/certificates
+curl https://localhost:8443/api/v1/certificates
 
 # Check certificate status
-curl http://localhost:8443/api/v1/certificates/{cert-id}
+curl https://localhost:8443/api/v1/certificates/{cert-id}
 
 # View audit trail
-curl http://localhost:8443/api/v1/audit
+curl https://localhost:8443/api/v1/audit
 
 # Check renewal policy compliance
-curl http://localhost:8443/api/v1/policies/{policy-id}
+curl https://localhost:8443/api/v1/policies/{policy-id}
 ```
 
 ### Traefik Dashboard
@@ -290,7 +297,7 @@ Changes are picked up automatically (file watcher enabled).
 docker compose logs certctl-agent | grep heartbeat
 
 # Check deployment job status
-curl http://localhost:8443/api/v1/jobs | jq '.[] | select(.type == "Deployment")'
+curl https://localhost:8443/api/v1/jobs | jq '.[] | select(.type == "Deployment")'
 
 # Check Traefik is watching the directory
 docker compose exec traefik ls -la /etc/traefik/certs/
|
|||||||
@@ -119,7 +119,7 @@ services:
     networks:
       - certctl-network
     healthcheck:
-      test: ['CMD-SHELL', 'curl -sf http://localhost:8443/health || exit 1']
+      test: ['CMD-SHELL', 'curl -sfk https://localhost:8443/health || exit 1']
       interval: 10s
       timeout: 5s
       retries: 3
|
|||||||
@@ -48,6 +48,13 @@ Monitor logs:
 docker compose logs -f certctl-server
 ```
 
+## TLS Security
+
+certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
+- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
+- Use `curl -k ...` for quick smoke tests (never in production)
+- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
+
 Wait for all services to reach healthy state:
 
 ```bash
@@ -69,7 +76,7 @@ certctl-haproxy-... healthy
 Open your browser to:
 
 ```
-http://localhost:8443
+https://localhost:8443
 ```
 
 You should see an empty dashboard. This is expected — no certificates issued yet.
@@ -79,7 +86,7 @@ You should see an empty dashboard. This is expected — no certificates issued y
 This defines what certificates certctl can issue (key algorithm, max TTL, allowed names).
 
 ```bash
-curl -X POST http://localhost:8443/api/v1/profiles \
+curl -X POST https://localhost:8443/api/v1/profiles \
   -H 'Content-Type: application/json' \
   -d '{
     "name": "internal-web",
@@ -94,7 +101,7 @@ curl -X POST http://localhost:8443/api/v1/profiles \
 This tells certctl where to deploy certificates on the HAProxy server.
 
 ```bash
-curl -X POST http://localhost:8443/api/v1/targets \
+curl -X POST https://localhost:8443/api/v1/targets \
   -H 'Content-Type: application/json' \
   -d '{
     "name": "haproxy-01",
@@ -115,7 +122,7 @@ Note: In the Docker Compose environment, reload command can be `kill -HUP $(pido
 This ties a certificate profile to a deployment target and sets renewal thresholds.
 
 ```bash
-curl -X POST http://localhost:8443/api/v1/renewal-policies \
+curl -X POST https://localhost:8443/api/v1/renewal-policies \
   -H 'Content-Type: application/json' \
   -d '{
     "name": "haproxy-internal-web",
@@ -130,7 +137,7 @@ curl -X POST http://localhost:8443/api/v1/renewal-policies \
 Get the issuer ID:
 
 ```bash
-curl http://localhost:8443/api/v1/issuers | jq '.'
+curl https://localhost:8443/api/v1/issuers | jq '.'
 ```
 
 You should see `iss-stepca` in the list.
@@ -140,7 +147,7 @@ You should see `iss-stepca` in the list.
 Request a certificate via the API. The server will sign it via step-ca.
 
 ```bash
-curl -X POST http://localhost:8443/api/v1/certificates \
+curl -X POST https://localhost:8443/api/v1/certificates \
   -H 'Content-Type: application/json' \
   -d '{
     "common_name": "api.internal.example.com",
@@ -155,7 +162,7 @@ curl -X POST http://localhost:8443/api/v1/certificates \
 Get the certificate ID and trigger deployment:
 
 ```bash
-curl -X POST http://localhost:8443/api/v1/certificates/<cert_id>/deploy \
+curl -X POST https://localhost:8443/api/v1/certificates/<cert_id>/deploy \
   -H 'Content-Type: application/json' \
   -d '{
     "target_id": "<target_id_from_step_4>"
@@ -171,7 +178,7 @@ The agent will:
 
 ### 8. Verify in Dashboard
 
-Refresh http://localhost:8443 and you should see:
+Refresh https://localhost:8443 and you should see:
 - 1 certificate (status: Active, expiry in 90 days)
 - 1 deployment job (status: Completed)
 - 1 agent (heartbeat: recent)
|
|||||||
+40
-2
@@ -75,6 +75,14 @@ EXAMPLES:
 --server-url https://certctl.example.com \\
 --api-key YOUR_API_KEY
 
+CONTROL-PLANE TLS TRUST:
+The certctl server is HTTPS-only as of v2.2. This installer does NOT copy a CA
+bundle — the generated agent.env leaves TLS trust to the system root store by
+default. If the server uses a private/enterprise or self-signed CA, set
+CERTCTL_SERVER_CA_BUNDLE_PATH in the generated agent.env to point at the CA
+bundle, or (dev only) CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true. See the
+commented block in the generated agent.env for the full menu.
+
 EOF
 }
 
@@ -322,7 +330,7 @@ setup_linux_config() {
 # Agent ID (unique identifier in the fleet)
 CERTCTL_AGENT_ID=$AGENT_ID
 
-# Control plane server URL
+# Control plane server URL (HTTPS-only as of v2.2)
 CERTCTL_SERVER_URL=$SERVER_URL
 
 # API authentication key
@@ -334,6 +342,21 @@ CERTCTL_KEYGEN_MODE=agent
 # Key storage directory (agent-side keygen)
 CERTCTL_KEY_DIR=$key_dir
 
+# ---- Control-plane TLS trust ----
+# The certctl server is HTTPS-only (v2.2+). The agent's HTTP client MUST trust the
+# server's certificate chain. Pick ONE of the approaches below:
+#
+# 1) Public CA (Let's Encrypt, DigiCert, etc.) — no config needed; system trust store works.
+# 2) Private / enterprise CA — point the agent at the CA bundle that signed the server cert:
+# CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/server-ca.crt
+#
+# 3) Self-signed server cert (Helm/compose bootstrap) — same env var, just point at the
+# extracted self-signed CA bundle (e.g. from the certctl-server-tls Kubernetes secret
+# via: kubectl get secret certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d).
+#
+# 4) Dev/eval only — disable verification entirely (NEVER do this in production):
+# CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true
+
 # Logging level (debug, info, warn, error)
 # CERTCTL_LOG_LEVEL=info
 
@@ -373,7 +396,7 @@ setup_macos_config() {
 # Agent ID (unique identifier in the fleet)
 CERTCTL_AGENT_ID=$AGENT_ID
 
-# Control plane server URL
+# Control plane server URL (HTTPS-only as of v2.2)
 CERTCTL_SERVER_URL=$SERVER_URL
 
 # API authentication key
@@ -385,6 +408,21 @@ CERTCTL_KEYGEN_MODE=agent
 # Key storage directory (agent-side keygen)
 CERTCTL_KEY_DIR=$key_dir
 
+# ---- Control-plane TLS trust ----
+# The certctl server is HTTPS-only (v2.2+). The agent's HTTP client MUST trust the
+# server's certificate chain. Pick ONE of the approaches below:
+#
+# 1) Public CA (Let's Encrypt, DigiCert, etc.) — no config needed; system trust store works.
+# 2) Private / enterprise CA — point the agent at the CA bundle that signed the server cert:
+# CERTCTL_SERVER_CA_BUNDLE_PATH=$HOME/.certctl/server-ca.crt
+#
+# 3) Self-signed server cert (Helm/compose bootstrap) — same env var, just point at the
+# extracted self-signed CA bundle (e.g. from the certctl-server-tls Kubernetes secret
+# via: kubectl get secret certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d).
+#
+# 4) Dev/eval only — disable verification entirely (NEVER do this in production):
+# CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true
+
 # Logging level (debug, info, warn, error)
 # CERTCTL_LOG_LEVEL=info
 
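The trust options listed in the generated agent.env map onto a small amount of client-side TLS setup. The following is a minimal sketch, not the agent's actual implementation: it assumes the two environment variables named above are read at startup, and the helper name `buildTLSConfig` is hypothetical. When a bundle is supplied, this sketch trusts only that bundle rather than adding it to the system roots.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net/http"
	"os"
	"time"
)

// buildTLSConfig (hypothetical helper) returns a tls.Config for talking to
// the control plane. A nil or empty caPEM keeps the system trust store;
// insecure disables verification and is meant for dev/eval only.
func buildTLSConfig(caPEM []byte, insecure bool) (*tls.Config, error) {
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	if insecure {
		cfg.InsecureSkipVerify = true
		return cfg, nil
	}
	if len(caPEM) == 0 {
		return cfg, nil // RootCAs stays nil, so the system roots are used
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("CA bundle contains no usable PEM certificates")
	}
	cfg.RootCAs = pool
	return cfg, nil
}

func main() {
	var caPEM []byte
	if path := os.Getenv("CERTCTL_SERVER_CA_BUNDLE_PATH"); path != "" {
		data, err := os.ReadFile(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, "read CA bundle:", err)
			os.Exit(1)
		}
		caPEM = data
	}
	insecure := os.Getenv("CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY") == "true"

	cfg, err := buildTLSConfig(caPEM, insecure)
	if err != nil {
		fmt.Fprintln(os.Stderr, "TLS config:", err)
		os.Exit(1)
	}
	client := &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{TLSClientConfig: cfg},
	}
	fmt.Printf("custom roots: %v, skip verify: %v, timeout: %s\n",
		cfg.RootCAs != nil, cfg.InsecureSkipVerify, client.Timeout)
}
```

Failing on an unparsable bundle, instead of silently falling back to the system roots, keeps a misconfigured path from looking like a working setup.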
|
|||||||
@@ -41,20 +41,26 @@ type MetricsResponse struct {
 
 // MetricsGauge represents gauge metrics (point-in-time values).
 type MetricsGauge struct {
 	CertificateTotal int64 `json:"certificate_total"`
 	CertificateActive int64 `json:"certificate_active"`
 	CertificateExpiringSoon int64 `json:"certificate_expiring_soon"` // Within 30d
 	CertificateExpired int64 `json:"certificate_expired"`
 	CertificateRevoked int64 `json:"certificate_revoked"`
 	AgentTotal int64 `json:"agent_total"`
 	AgentOnline int64 `json:"agent_online"`
 	JobPending int64 `json:"job_pending"`
 }
 
 // MetricsCounter represents counter metrics (cumulative values).
 type MetricsCounter struct {
 	JobCompletedTotal int64 `json:"job_completed_total"`
 	JobFailedTotal int64 `json:"job_failed_total"`
+	// NotificationsDeadTotal is a point-in-time count of notifications in the
+	// dead-letter queue (status="dead"), exposed here with the _total suffix
+	// to match Prometheus DB-snapshot counter convention (same semantics as
+	// JobFailedTotal and JobCompletedTotal — see metrics.md). I-005 DLQ
+	// observability gate.
+	NotificationsDeadTotal int64 `json:"notifications_dead_total"`
 }
 
 // UptimeMetric represents server uptime information.
@@ -95,18 +101,19 @@ func (h MetricsHandler) GetMetrics(w http.ResponseWriter, r *http.Request) {
 	// Build metrics response
 	metricsResp := MetricsResponse{
 		Gauge: MetricsGauge{
 			CertificateTotal: dashboardSummary.TotalCertificates,
 			CertificateActive: dashboardSummary.TotalCertificates - dashboardSummary.ExpiringCertificates - dashboardSummary.ExpiredCertificates - dashboardSummary.RevokedCertificates,
 			CertificateExpiringSoon: dashboardSummary.ExpiringCertificates,
 			CertificateExpired: dashboardSummary.ExpiredCertificates,
 			CertificateRevoked: dashboardSummary.RevokedCertificates,
 			AgentTotal: dashboardSummary.TotalAgents,
 			AgentOnline: dashboardSummary.ActiveAgents,
 			JobPending: dashboardSummary.PendingJobs,
 		},
 		Counter: MetricsCounter{
 			JobCompletedTotal: dashboardSummary.CompleteJobs,
 			JobFailedTotal: dashboardSummary.FailedJobs,
+			NotificationsDeadTotal: dashboardSummary.NotificationsDead,
 		},
 		Uptime: UptimeMetric{
 			UptimeSeconds: int64(time.Since(h.serverStarted).Seconds()),
@@ -200,6 +207,17 @@ func (h MetricsHandler) GetPrometheusMetrics(w http.ResponseWriter, r *http.Requ
 	fmt.Fprintf(w, "# TYPE certctl_job_failed_total counter\n")
 	fmt.Fprintf(w, "certctl_job_failed_total %d\n\n", dashboardSummary.FailedJobs)
 
+	// I-005: notification dead-letter queue depth. Emitted with the _total
+	// suffix to match the existing certctl_job_completed_total /
+	// certctl_job_failed_total convention for DB-snapshot counters — the
+	// value is a point-in-time COUNT(*) of notification_events rows where
+	// status='dead', not a monotonically increasing process-lifetime counter.
+	// Operators alert on this as "dead-letter depth" (thresholds in the
+	// I-005 spec: > 0 → warning, > 10 → critical).
+	fmt.Fprintf(w, "# HELP certctl_notification_dead_total Number of notifications in the dead-letter queue.\n")
+	fmt.Fprintf(w, "# TYPE certctl_notification_dead_total counter\n")
+	fmt.Fprintf(w, "certctl_notification_dead_total %d\n\n", dashboardSummary.NotificationsDead)
+
 	// Info — server uptime
 	fmt.Fprintf(w, "# HELP certctl_uptime_seconds Server uptime in seconds.\n")
 	fmt.Fprintf(w, "# TYPE certctl_uptime_seconds gauge\n")
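The HELP/TYPE/sample triplet emitted for the dead-letter metric follows the Prometheus text exposition format. A minimal standalone sketch of that output, together with the alert thresholds quoted in the comment (> 0 warning, > 10 critical), might look like the following. The helper names `formatSnapshotMetric` and `deadLetterSeverity`, the "ok" label, and the sample depth of 3 are assumptions for illustration and are not part of certctl.

```go
package main

import "fmt"

// formatSnapshotMetric renders one sample in the Prometheus text exposition
// format: HELP line, TYPE line, the sample itself, then a blank separator.
func formatSnapshotMetric(name, help, metricType string, value int64) string {
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s %s\n%s %d\n\n",
		name, help, name, metricType, name, value)
}

// deadLetterSeverity applies the thresholds quoted in the diff comment:
// more than 0 is a warning, more than 10 is critical. The "ok" label for
// an empty queue is an assumption.
func deadLetterSeverity(depth int64) string {
	switch {
	case depth > 10:
		return "critical"
	case depth > 0:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	const depth = 3 // made-up dead-letter depth for illustration
	fmt.Print(formatSnapshotMetric(
		"certctl_notification_dead_total",
		"Number of notifications in the dead-letter queue.",
		"counter", depth))
	fmt.Println("severity:", deadLetterSeverity(depth))
}
```

Because the value is a point-in-time queue depth that can fall as well as rise, alert rules would compare it against the thresholds directly instead of applying `rate()` as they would for a monotonic counter.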
@@ -209,15 +227,21 @@ func (h MetricsHandler) GetPrometheusMetrics(w http.ResponseWriter, r *http.Requ
 // DashboardSummary mirrors the service.DashboardSummary for JSON unmarshaling.
 // JSON tags must match the service-layer struct exactly.
 type DashboardSummary struct {
 	TotalCertificates int64 `json:"total_certificates"`
 	ExpiringCertificates int64 `json:"expiring_certificates"`
 	ExpiredCertificates int64 `json:"expired_certificates"`
 	RevokedCertificates int64 `json:"revoked_certificates"`
 	ActiveAgents int64 `json:"active_agents"`
 	OfflineAgents int64 `json:"offline_agents"`
 	TotalAgents int64 `json:"total_agents"`
 	PendingJobs int64 `json:"pending_jobs"`
 	FailedJobs int64 `json:"failed_jobs"`
 	CompleteJobs int64 `json:"complete_jobs"`
-	CompletedAt time.Time `json:"completed_at"`
+	// NotificationsDead mirrors service.DashboardSummary.NotificationsDead.
+	// JSON tag "notifications_dead" must match the service-layer struct
+	// exactly — this cross-package mirror avoids a direct import cycle and
+	// is driven by the I-005 Prometheus counter emission path. See
+	// GetPrometheusMetrics and MetricsCounter.NotificationsDeadTotal.
+	NotificationsDead int64 `json:"notifications_dead"`
+	CompletedAt time.Time `json:"completed_at"`
 }
|
|||||||
@@ -13,9 +13,11 @@ import (
 
 // MockNotificationService is a mock implementation of NotificationService interface.
 type MockNotificationService struct {
 	ListNotificationsFn func(page, perPage int) ([]domain.NotificationEvent, int64, error)
-	GetNotificationFn func(id string) (*domain.NotificationEvent, error)
-	MarkAsReadFn func(id string) error
+	ListNotificationsByStatusFn func(status string, page, perPage int) ([]domain.NotificationEvent, int64, error)
+	GetNotificationFn func(id string) (*domain.NotificationEvent, error)
+	MarkAsReadFn func(id string) error
+	RequeueFn func(id string) error
 }
 
 func (m *MockNotificationService) ListNotifications(_ context.Context, page, perPage int) ([]domain.NotificationEvent, int64, error) {
@@ -25,6 +27,13 @@ func (m *MockNotificationService) ListNotifications(_ context.Context, page, per
 	return nil, 0, nil
 }
 
+func (m *MockNotificationService) ListNotificationsByStatus(_ context.Context, status string, page, perPage int) ([]domain.NotificationEvent, int64, error) {
+	if m.ListNotificationsByStatusFn != nil {
+		return m.ListNotificationsByStatusFn(status, page, perPage)
+	}
+	return nil, 0, nil
+}
+
 func (m *MockNotificationService) GetNotification(_ context.Context, id string) (*domain.NotificationEvent, error) {
 	if m.GetNotificationFn != nil {
 		return m.GetNotificationFn(id)
@@ -39,6 +48,13 @@ func (m *MockNotificationService) MarkAsRead(_ context.Context, id string) error
 	return nil
 }
 
+func (m *MockNotificationService) RequeueNotification(_ context.Context, id string) error {
+	if m.RequeueFn != nil {
+		return m.RequeueFn(id)
+	}
+	return nil
+}
+
 func TestListNotifications_Success(t *testing.T) {
 	now := time.Now()
 	certID := "mc-prod-001"
@@ -282,3 +298,224 @@ func TestMarkAsRead_EmptyID(t *testing.T) {
|
|||||||
t.Fatalf("expected status 400, got %d", w.Code)
|
t.Fatalf("expected status 400, got %d", w.Code)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ---------------------------------------------------------------------------
|
||||||
|
// I-005: Notification Retry + Dead-Letter Queue handler contract (Phase 1 Red)
|
||||||
|
//
|
||||||
|
// These tests pin the HTTP surface Phase 2 Green must implement:
|
||||||
|
//
|
||||||
|
// 1. POST /api/v1/notifications/{id}/requeue — flips a dead notification
|
||||||
|
// back to 'pending' so the retry loop can pick it up again. The handler
|
||||||
|
// method does not exist yet (NotificationHandler has no RequeueNotification
|
||||||
|
// method) and the NotificationService interface does not declare
|
||||||
|
// RequeueNotification — both are compile-time Red halts.
|
||||||
|
//
|
||||||
|
// 2. GET /api/v1/notifications?status=dead — routes dead-letter list requests
|
||||||
|
// through ListNotificationsByStatus instead of ListNotifications. The
|
||||||
|
// status-filter routing does not exist yet, so ListNotificationsByStatusFn
|
||||||
|
// never fires — a runtime Red halt.
|
||||||
|
// ---------------------------------------------------------------------------
|
||||||
|
|
||||||
|
func TestRequeueNotification_Success(t *testing.T) {
|
||||||
|
var requeuedID string
|
||||||
|
mock := &MockNotificationService{
|
||||||
|
RequeueFn: func(id string) error {
|
||||||
|
requeuedID = id
|
||||||
|
return nil
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
handler := NewNotificationHandler(mock)
|
||||||
|
req := httptest.NewRequest(http.MethodPost, "/api/v1/notifications/notif-dead-001/requeue", nil)
|
||||||
|
req = req.WithContext(contextWithRequestID())
|
||||||
|
w := httptest.NewRecorder()
|
||||||
|
|
||||||
|
handler.RequeueNotification(w, req)
|
||||||
|
|
||||||
|
if w.Code != http.StatusOK {
|
||||||
|
t.Fatalf("expected status 200, got %d", w.Code)
|
||||||
|
}
|
||||||
|
if requeuedID != "notif-dead-001" {
|
||||||
|
t.Errorf("expected requeued ID 'notif-dead-001', got '%s'", requeuedID)
|
||||||
|
}
|
||||||
|
|
||||||
|
var resp map[string]string
|
||||||
|
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
|
||||||
|
t.Fatalf("failed to decode response: %v", err)
|
||||||
|
}
|
||||||
|
if resp["status"] != "requeued" {
|
||||||
|
t.Errorf("expected status 'requeued', got '%s'", resp["status"])
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
+func TestRequeueNotification_NotFound(t *testing.T) {
+	mock := &MockNotificationService{
+		RequeueFn: func(id string) error {
+			return ErrMockNotFound
+		},
+	}
+
+	handler := NewNotificationHandler(mock)
+	req := httptest.NewRequest(http.MethodPost, "/api/v1/notifications/nonexistent/requeue", nil)
+	req = req.WithContext(contextWithRequestID())
+	w := httptest.NewRecorder()
+
+	handler.RequeueNotification(w, req)
+
+	if w.Code != http.StatusNotFound {
+		t.Fatalf("expected status 404, got %d", w.Code)
+	}
+}
+
+func TestRequeueNotification_ServiceError(t *testing.T) {
+	mock := &MockNotificationService{
+		RequeueFn: func(id string) error {
+			return ErrMockServiceFailed
+		},
+	}
+
+	handler := NewNotificationHandler(mock)
+	req := httptest.NewRequest(http.MethodPost, "/api/v1/notifications/notif-dead-001/requeue", nil)
+	req = req.WithContext(contextWithRequestID())
+	w := httptest.NewRecorder()
+
+	handler.RequeueNotification(w, req)
+
+	if w.Code != http.StatusInternalServerError {
+		t.Fatalf("expected status 500, got %d", w.Code)
+	}
+}
+
+func TestRequeueNotification_MethodNotAllowed(t *testing.T) {
+	handler := NewNotificationHandler(&MockNotificationService{})
+	req := httptest.NewRequest(http.MethodGet, "/api/v1/notifications/notif-dead-001/requeue", nil)
+	w := httptest.NewRecorder()
+
+	handler.RequeueNotification(w, req)
+
+	if w.Code != http.StatusMethodNotAllowed {
+		t.Fatalf("expected status 405, got %d", w.Code)
+	}
+}
+
+func TestRequeueNotification_EmptyID(t *testing.T) {
+	handler := NewNotificationHandler(&MockNotificationService{})
+	req := httptest.NewRequest(http.MethodPost, "/api/v1/notifications//requeue", nil)
+	req = req.WithContext(contextWithRequestID())
+	w := httptest.NewRecorder()
+
+	handler.RequeueNotification(w, req)
+
+	if w.Code != http.StatusBadRequest {
+		t.Fatalf("expected status 400, got %d", w.Code)
+	}
+}
+
+func TestListNotifications_StatusFilter_Dead(t *testing.T) {
+	now := time.Now()
+	certID := "mc-prod-001"
+	lastErr := "SMTP connection refused"
+	nextRetry := now.Add(1 * time.Minute)
+	dead := domain.NotificationEvent{
+		ID:            "notif-dead-001",
+		Type:          domain.NotificationTypeExpirationWarning,
+		CertificateID: &certID,
+		Channel:       domain.NotificationChannelEmail,
+		Recipient:     "admin@example.com",
+		Message:       "Certificate expiring in 7 days",
+		Status:        "dead",
+		CreatedAt:     now,
+		RetryCount:    5,
+		NextRetryAt:   &nextRetry,
+		LastError:     &lastErr,
+	}
+
+	var capturedStatus string
+	var capturedPage, capturedPerPage int
+	byStatusCalled := false
+	listCalled := false
+
+	mock := &MockNotificationService{
+		ListNotificationsFn: func(page, perPage int) ([]domain.NotificationEvent, int64, error) {
+			listCalled = true
+			return nil, 0, nil
+		},
+		ListNotificationsByStatusFn: func(status string, page, perPage int) ([]domain.NotificationEvent, int64, error) {
+			byStatusCalled = true
+			capturedStatus = status
+			capturedPage = page
+			capturedPerPage = perPage
+			return []domain.NotificationEvent{dead}, 1, nil
+		},
+	}
+
+	handler := NewNotificationHandler(mock)
+	req := httptest.NewRequest(http.MethodGet, "/api/v1/notifications?status=dead&page=1&per_page=50", nil)
+	req = req.WithContext(contextWithRequestID())
+	w := httptest.NewRecorder()
+
+	handler.ListNotifications(w, req)
+
+	if w.Code != http.StatusOK {
+		t.Fatalf("expected status 200, got %d", w.Code)
+	}
+	if !byStatusCalled {
+		t.Fatalf("expected ListNotificationsByStatus to be called for ?status=dead, but it was not")
+	}
+	if listCalled {
+		t.Errorf("ListNotifications should not be called when status filter is present")
+	}
+	if capturedStatus != "dead" {
+		t.Errorf("expected status='dead', got '%s'", capturedStatus)
+	}
+	if capturedPage != 1 {
+		t.Errorf("expected page=1, got %d", capturedPage)
+	}
+	if capturedPerPage != 50 {
+		t.Errorf("expected per_page=50, got %d", capturedPerPage)
+	}
+
+	var resp PagedResponse
+	if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
+		t.Fatalf("failed to decode response: %v", err)
+	}
+	if resp.Total != 1 {
+		t.Errorf("expected total=1 dead notification, got %d", resp.Total)
+	}
+}
+
+func TestListNotifications_NoStatusFilter_CallsDefault(t *testing.T) {
+	// Pin the inverse: when no ?status= is provided, the handler must call the
+	// existing ListNotifications path (not ListNotificationsByStatus). Phase 2
+	// Green must not break the default listing behavior for the plain tab.
+	listCalled := false
+	byStatusCalled := false
+
+	mock := &MockNotificationService{
+		ListNotificationsFn: func(page, perPage int) ([]domain.NotificationEvent, int64, error) {
+			listCalled = true
+			return []domain.NotificationEvent{}, 0, nil
+		},
+		ListNotificationsByStatusFn: func(status string, page, perPage int) ([]domain.NotificationEvent, int64, error) {
+			byStatusCalled = true
+			return nil, 0, nil
+		},
+	}
+
+	handler := NewNotificationHandler(mock)
+	req := httptest.NewRequest(http.MethodGet, "/api/v1/notifications", nil)
+	req = req.WithContext(contextWithRequestID())
+	w := httptest.NewRecorder()
+
+	handler.ListNotifications(w, req)
+
+	if w.Code != http.StatusOK {
+		t.Fatalf("expected status 200, got %d", w.Code)
+	}
+	if !listCalled {
+		t.Errorf("expected ListNotifications to be called when no status filter is present")
+	}
+	if byStatusCalled {
+		t.Errorf("ListNotificationsByStatus should not be called when no status filter is present")
+	}
+}
|
|||||||
@@ -11,10 +11,17 @@ import (
 )
 
 // NotificationService defines the service interface for notification operations.
+//
+// ListNotificationsByStatus and RequeueNotification were added to close coverage
+// gap I-005: the Dead letter tab on the GUI (?status=dead) needs a scoped
+// listing path, and the Requeue action needs a dedicated endpoint that flips a
+// dead notification back to 'pending' so the retry sweep can pick it up again.
 type NotificationService interface {
 	ListNotifications(ctx context.Context, page, perPage int) ([]domain.NotificationEvent, int64, error)
+	ListNotificationsByStatus(ctx context.Context, status string, page, perPage int) ([]domain.NotificationEvent, int64, error)
 	GetNotification(ctx context.Context, id string) (*domain.NotificationEvent, error)
 	MarkAsRead(ctx context.Context, id string) error
+	RequeueNotification(ctx context.Context, id string) error
 }
 
 // NotificationHandler handles HTTP requests for notification operations.
@@ -51,7 +58,20 @@ func (h NotificationHandler) ListNotifications(w http.ResponseWriter, r *http.Re
 		}
 	}
 
-	notifications, total, err := h.svc.ListNotifications(r.Context(), page, perPage)
+	// I-005: branch to the status-scoped listing path when ?status= is present
+	// so the Dead letter tab on the GUI (?status=dead) can filter server-side.
+	// Empty status delegates to the original ListNotifications path to preserve
+	// the default tab's existing behavior.
+	var (
+		notifications []domain.NotificationEvent
+		total         int64
+		err           error
+	)
+	if status := query.Get("status"); status != "" {
+		notifications, total, err = h.svc.ListNotificationsByStatus(r.Context(), status, page, perPage)
+	} else {
+		notifications, total, err = h.svc.ListNotifications(r.Context(), page, perPage)
+	}
 	if err != nil {
 		ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to list notifications", requestID)
 		return
@@ -124,3 +144,43 @@ func (h NotificationHandler) MarkAsRead(w http.ResponseWriter, r *http.Request)
 
 	JSON(w, http.StatusOK, response)
 }
+
+// RequeueNotification flips a dead notification back to 'pending' so the retry
+// sweep (coverage gap I-005) can pick it up again on its next tick. The handler
+// is strictly POST-only; GET/PUT/DELETE return 405. An empty id segment
+// (/api/v1/notifications//requeue) returns 400. Service errors that carry a
+// "not found" sentinel map to 404; all other service errors map to 500. This
+// 404-vs-500 split mirrors GetCertificateDeployments at certificates.go:644.
+// POST /api/v1/notifications/{id}/requeue
+func (h NotificationHandler) RequeueNotification(w http.ResponseWriter, r *http.Request) {
+	if r.Method != http.MethodPost {
+		Error(w, http.StatusMethodNotAllowed, "Method not allowed")
+		return
+	}
+
+	requestID := middleware.GetRequestID(r.Context())
+
+	// Extract notification ID from path /api/v1/notifications/{id}/requeue
+	path := strings.TrimPrefix(r.URL.Path, "/api/v1/notifications/")
+	parts := strings.Split(path, "/")
+	if len(parts) < 2 || parts[0] == "" {
+		ErrorWithRequestID(w, http.StatusBadRequest, "Notification ID is required", requestID)
+		return
+	}
+	notificationID := parts[0]
+
+	if err := h.svc.RequeueNotification(r.Context(), notificationID); err != nil {
+		if strings.Contains(err.Error(), "not found") {
+			ErrorWithRequestID(w, http.StatusNotFound, "Notification not found", requestID)
+			return
+		}
+		ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to requeue notification", requestID)
+		return
+	}
+
+	response := map[string]string{
+		"status": "requeued",
+	}
+
+	JSON(w, http.StatusOK, response)
+}
|
|||||||
@@ -45,28 +45,28 @@ func (r *Router) RegisterFunc(pattern string, handler func(http.ResponseWriter,
|
|||||||
|
|
||||||
// HandlerRegistry groups all API handler dependencies for router registration.
|
// HandlerRegistry groups all API handler dependencies for router registration.
|
||||||
type HandlerRegistry struct {
|
type HandlerRegistry struct {
|
||||||
Certificates handler.CertificateHandler
|
Certificates handler.CertificateHandler
|
||||||
Issuers handler.IssuerHandler
|
Issuers handler.IssuerHandler
|
||||||
Targets handler.TargetHandler
|
Targets handler.TargetHandler
|
||||||
Agents handler.AgentHandler
|
Agents handler.AgentHandler
|
||||||
Jobs handler.JobHandler
|
Jobs handler.JobHandler
|
||||||
Policies handler.PolicyHandler
|
Policies handler.PolicyHandler
|
||||||
Profiles handler.ProfileHandler
|
Profiles handler.ProfileHandler
|
||||||
Teams handler.TeamHandler
|
Teams handler.TeamHandler
|
||||||
Owners handler.OwnerHandler
|
Owners handler.OwnerHandler
|
||||||
AgentGroups handler.AgentGroupHandler
|
AgentGroups handler.AgentGroupHandler
|
||||||
Audit handler.AuditHandler
|
Audit handler.AuditHandler
|
||||||
Notifications handler.NotificationHandler
|
Notifications handler.NotificationHandler
|
||||||
Stats handler.StatsHandler
|
Stats handler.StatsHandler
|
||||||
Metrics handler.MetricsHandler
|
Metrics handler.MetricsHandler
|
||||||
Health handler.HealthHandler
|
Health handler.HealthHandler
|
||||||
Discovery handler.DiscoveryHandler
|
Discovery handler.DiscoveryHandler
|
||||||
NetworkScan handler.NetworkScanHandler
|
NetworkScan handler.NetworkScanHandler
|
||||||
Verification handler.VerificationHandler
|
Verification handler.VerificationHandler
|
||||||
Export handler.ExportHandler
|
Export handler.ExportHandler
|
||||||
Digest handler.DigestHandler
|
Digest handler.DigestHandler
|
||||||
HealthChecks *handler.HealthCheckHandler
|
HealthChecks *handler.HealthCheckHandler
|
||||||
BulkRevocation handler.BulkRevocationHandler
|
BulkRevocation handler.BulkRevocationHandler
|
||||||
}
|
}
|
||||||
|
|
||||||
// RegisterHandlers sets up all API routes with their handlers.
|
// RegisterHandlers sets up all API routes with their handlers.
|
||||||
@@ -204,6 +204,10 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
 	r.Register("GET /api/v1/notifications", http.HandlerFunc(reg.Notifications.ListNotifications))
 	r.Register("GET /api/v1/notifications/{id}", http.HandlerFunc(reg.Notifications.GetNotification))
 	r.Register("POST /api/v1/notifications/{id}/read", http.HandlerFunc(reg.Notifications.MarkAsRead))
+	// I-005: requeue a dead notification back to pending so the retry sweep
+	// picks it up again. Go 1.22 ServeMux resolves the literal /requeue segment
+	// before falling back to the {id} path-variable route above.
+	r.Register("POST /api/v1/notifications/{id}/requeue", http.HandlerFunc(reg.Notifications.RequeueNotification))
 
 	// Stats routes: /api/v1/stats
 	r.Register("GET /api/v1/stats/summary", http.HandlerFunc(reg.Stats.GetDashboardSummary))
@@ -254,7 +258,19 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
 }
 
 // RegisterESTHandlers sets up EST (RFC 7030) routes under /.well-known/est/.
-// EST endpoints use a separate middleware chain (no API key auth — EST uses TLS client certs).
+//
+// EST endpoints are intentionally unauthenticated at the HTTP layer. Per RFC 7030
+// §3.2.3, authentication and authorization for enrollment are deployment-specific;
+// certctl relies on CSR signature verification, profile policy enforcement (allowed
+// key types, max TTL, permitted EKUs), and the underlying issuer connector's own
+// policy. Per RFC 7030 §4.1.1, /.well-known/est/cacerts is explicitly anonymous.
+//
+// cmd/server/main.go's finalHandler dispatches /.well-known/est/* to a dedicated
+// no-auth middleware chain (RequestID, structuredLogger, Recovery only) so EST
+// clients — IoT devices, 802.1X supplicants, MDM-enrolled laptops — never hit the
+// Bearer-token auth middleware they cannot satisfy. See M-001 audit 2026-04-19
+// (option D): prior builds routed EST through the authenticated apiHandler chain,
+// which reduced every enrollment to a 401 before the handler was reached.
 func (r *Router) RegisterESTHandlers(est handler.ESTHandler) {
 	// EST endpoints per RFC 7030 Section 3.2.2
 	r.Register("GET /.well-known/est/cacerts", http.HandlerFunc(est.CACerts))
@@ -265,7 +281,11 @@ func (r *Router) RegisterESTHandlers(est handler.ESTHandler) {
 
 // RegisterSCEPHandlers sets up SCEP (RFC 8894) routes.
 // SCEP uses a single endpoint with operation-based dispatch via query parameters.
-// Authentication is via challenge password in the CSR, not TLS client certs or API keys.
+// Authentication is via the challengePassword attribute in the PKCS#10 CSR, not
+// via HTTP Bearer tokens or TLS client certs. cmd/server/main.go's finalHandler
+// routes /scep* through the no-auth middleware chain (M-001 audit 2026-04-19,
+// option D), and Config.Validate() refuses to start the server if SCEP is enabled
+// without a non-empty CERTCTL_SCEP_CHALLENGE_PASSWORD (H-2, CWE-306).
 func (r *Router) RegisterSCEPHandlers(scep handler.SCEPHandler) {
 	// SCEP uses a single path; the handler dispatches on ?operation= query param
 	r.Register("GET /scep", http.HandlerFunc(scep.HandleSCEP))
|
|||||||
@@ -46,7 +46,7 @@ func TestClient_RetireAgent_Success(t *testing.T) {
 	}))
 	defer server.Close()
 
-	client := NewClient(server.URL, "", "table")
+	client, _ := NewClient(server.URL, "", "table", "", false)
 	// Positional arg: the agent ID. No --force, no --reason — the default
 	// soft-retire path. Compile-fail until client.RetireAgent exists.
 	if err := client.RetireAgent([]string{"ag-1"}); err != nil {
@@ -101,7 +101,7 @@ func TestClient_RetireAgent_Force_WithReason_Success(t *testing.T) {
 	}))
 	defer server.Close()
 
-	client := NewClient(server.URL, "", "table")
+	client, _ := NewClient(server.URL, "", "table", "", false)
 	if err := client.RetireAgent([]string{"ag-1", "--force", "--reason", "decommissioning rack 7"}); err != nil {
 		t.Fatalf("RetireAgent(force+reason) err=%v want nil", err)
 	}
@@ -126,7 +126,7 @@ func TestClient_RetireAgent_Force_RequiresReason(t *testing.T) {
 	}))
 	defer server.Close()
 
-	client := NewClient(server.URL, "", "table")
+	client, _ := NewClient(server.URL, "", "table", "", false)
 	err := client.RetireAgent([]string{"ag-1", "--force"})
 	if err == nil {
 		t.Fatalf("RetireAgent(force, no reason) err=nil want client-side error")
@@ -150,7 +150,7 @@ func TestClient_RetireAgent_MissingID(t *testing.T) {
 	}))
 	defer server.Close()
 
-	client := NewClient(server.URL, "", "table")
+	client, _ := NewClient(server.URL, "", "table", "", false)
 	err := client.RetireAgent([]string{})
 	if err == nil {
 		t.Fatalf("RetireAgent([]) err=nil want missing-id error")
@@ -198,7 +198,7 @@ func TestClient_ListRetiredAgents_Success(t *testing.T) {
 	}))
 	defer server.Close()
 
-	client := NewClient(server.URL, "", "table")
+	client, _ := NewClient(server.URL, "", "table", "", false)
 	if err := client.ListRetiredAgents([]string{}); err != nil {
 		t.Fatalf("ListRetiredAgents err=%v want nil", err)
 	}
@@ -220,7 +220,7 @@ func TestClient_ListRetiredAgents_ServerError(t *testing.T) {
 	}))
 	defer server.Close()
 
-	client := NewClient(server.URL, "", "table")
+	client, _ := NewClient(server.URL, "", "table", "", false)
 	err := client.ListRetiredAgents([]string{})
 	if err == nil {
 		t.Fatalf("ListRetiredAgents(500) err=nil want propagated error")
|
|||||||
+35
-5
@@ -2,6 +2,7 @@ package cli
 
 import (
 	"bytes"
+	"crypto/tls"
 	"crypto/x509"
 	"encoding/json"
 	"encoding/pem"
@@ -19,22 +20,51 @@ import (
 
 // Client is the CLI HTTP client that communicates with the certctl server.
 type Client struct {
 	baseURL    string
 	apiKey     string
 	format     string
 	httpClient *http.Client
 }
 
 // NewClient creates a new CLI client.
-func NewClient(baseURL, apiKey, format string) *Client {
+//
+// HTTPS-Everywhere (v2.2): the certctl control plane is HTTPS-only. caBundlePath,
+// when non-empty, points at a PEM bundle used to verify the server cert; otherwise
+// the system trust store is used. insecure skips cert verification — dev only,
+// never enable in production. The TLS config is attached to *http.Transport so
+// every call goes through the same verified socket.
+func NewClient(baseURL, apiKey, format, caBundlePath string, insecure bool) (*Client, error) {
+	tlsConfig := &tls.Config{
+		MinVersion:         tls.VersionTLS13,
+		InsecureSkipVerify: insecure, //nolint:gosec // opt-in dev toggle, documented in docs/tls.md
+	}
+	if caBundlePath != "" {
+		pemBytes, err := os.ReadFile(caBundlePath)
+		if err != nil {
+			return nil, fmt.Errorf("reading CA bundle at %q: %w", caBundlePath, err)
+		}
+		pool := x509.NewCertPool()
+		if !pool.AppendCertsFromPEM(pemBytes) {
+			return nil, fmt.Errorf("CA bundle at %q contains no valid PEM-encoded certificates", caBundlePath)
+		}
+		tlsConfig.RootCAs = pool
+	}
 	return &Client{
 		baseURL: baseURL,
 		apiKey:  apiKey,
 		format:  format,
 		httpClient: &http.Client{
 			Timeout: 30 * time.Second,
+			Transport: &http.Transport{
+				TLSClientConfig:       tlsConfig,
+				ForceAttemptHTTP2:     true,
+				MaxIdleConns:          10,
+				IdleConnTimeout:       90 * time.Second,
+				TLSHandshakeTimeout:   10 * time.Second,
+				ExpectContinueTimeout: 1 * time.Second,
+			},
 		},
-	}
+	}, nil
 }
 
 // do performs an HTTP request and returns the parsed JSON response.
|
|||||||
+207
-15
@@ -3,6 +3,7 @@ package cli
|
|||||||
import (
|
import (
|
||||||
"crypto/rand"
|
"crypto/rand"
|
||||||
"crypto/rsa"
|
"crypto/rsa"
|
||||||
|
"crypto/tls"
|
||||||
"crypto/x509"
|
"crypto/x509"
|
||||||
"crypto/x509/pkix"
|
"crypto/x509/pkix"
|
||||||
"encoding/json"
|
"encoding/json"
|
||||||
@@ -39,7 +40,7 @@ func TestClient_ListCertificates(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.ListCertificates([]string{})
|
err := client.ListCertificates([]string{})
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("ListCertificates failed: %v", err)
|
t.Fatalf("ListCertificates failed: %v", err)
|
||||||
@@ -64,7 +65,7 @@ func TestClient_GetCertificate(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "json")
|
client, _ := NewClient(server.URL, "", "json", "", false)
|
||||||
err := client.GetCertificate("mc-1")
|
err := client.GetCertificate("mc-1")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("GetCertificate failed: %v", err)
|
t.Fatalf("GetCertificate failed: %v", err)
|
||||||
@@ -86,7 +87,7 @@ func TestClient_RenewCertificate(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.RenewCertificate("mc-1")
|
err := client.RenewCertificate("mc-1")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("RenewCertificate failed: %v", err)
|
t.Fatalf("RenewCertificate failed: %v", err)
|
||||||
@@ -107,7 +108,7 @@ func TestClient_RevokeCertificate(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.RevokeCertificate("mc-1", "cessationOfOperation")
|
err := client.RevokeCertificate("mc-1", "cessationOfOperation")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("RevokeCertificate failed: %v", err)
|
t.Fatalf("RevokeCertificate failed: %v", err)
|
||||||
@@ -141,7 +142,7 @@ func TestClient_BulkRevokeCertificates(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.BulkRevokeCertificates([]string{
|
err := client.BulkRevokeCertificates([]string{
|
||||||
"--reason", "keyCompromise",
|
"--reason", "keyCompromise",
|
||||||
"--profile-id", "prof-tls",
|
"--profile-id", "prof-tls",
|
||||||
@@ -175,7 +176,7 @@ func TestClient_ListAgents(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.ListAgents([]string{})
|
err := client.ListAgents([]string{})
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("ListAgents failed: %v", err)
|
t.Fatalf("ListAgents failed: %v", err)
|
||||||
@@ -201,7 +202,7 @@ func TestClient_GetAgent(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "json")
|
client, _ := NewClient(server.URL, "", "json", "", false)
|
||||||
err := client.GetAgent("ag-1")
|
err := client.GetAgent("ag-1")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("GetAgent failed: %v", err)
|
t.Fatalf("GetAgent failed: %v", err)
|
||||||
@@ -232,7 +233,7 @@ func TestClient_ListJobs(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.ListJobs([]string{})
|
err := client.ListJobs([]string{})
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("ListJobs failed: %v", err)
|
t.Fatalf("ListJobs failed: %v", err)
|
||||||
@@ -258,7 +259,7 @@ func TestClient_GetJob(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "json")
|
client, _ := NewClient(server.URL, "", "json", "", false)
|
||||||
err := client.GetJob("job-1")
|
err := client.GetJob("job-1")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("GetJob failed: %v", err)
|
t.Fatalf("GetJob failed: %v", err)
|
||||||
@@ -276,7 +277,7 @@ func TestClient_CancelJob(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.CancelJob("job-1")
|
err := client.CancelJob("job-1")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("CancelJob failed: %v", err)
|
t.Fatalf("CancelJob failed: %v", err)
|
||||||
@@ -308,7 +309,7 @@ func TestClient_GetStatus(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.GetStatus()
|
err := client.GetStatus()
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("GetStatus failed: %v", err)
|
t.Fatalf("GetStatus failed: %v", err)
|
||||||
@@ -381,7 +382,7 @@ func TestClient_AuthHeader(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "testkey123", "json")
|
client, _ := NewClient(server.URL, "testkey123", "json", "", false)
|
||||||
client.do("GET", "/api/v1/certificates", nil, nil)
|
client.do("GET", "/api/v1/certificates", nil, nil)
|
||||||
|
|
||||||
if authHeader != "Bearer testkey123" {
|
if authHeader != "Bearer testkey123" {
|
||||||
@@ -439,7 +440,7 @@ func TestClient_ImportCertificates_MissingRequiredFlags(t *testing.T) {
|
|||||||
|
|
||||||
for _, tc := range cases {
|
for _, tc := range cases {
|
||||||
t.Run(tc.name, func(t *testing.T) {
|
t.Run(tc.name, func(t *testing.T) {
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.ImportCertificates(tc.args)
|
err := client.ImportCertificates(tc.args)
|
||||||
if err == nil {
|
if err == nil {
|
||||||
t.Fatalf("expected error for %s, got nil", tc.name)
|
t.Fatalf("expected error for %s, got nil", tc.name)
|
||||||
@@ -468,7 +469,7 @@ func TestClient_ImportCertificates_MissingPositionalArgs(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.ImportCertificates([]string{
|
err := client.ImportCertificates([]string{
|
||||||
"--owner-id", "o-alice",
|
"--owner-id", "o-alice",
|
||||||
"--team-id", "t-platform",
|
"--team-id", "t-platform",
|
||||||
@@ -513,7 +514,7 @@ func TestClient_ImportCertificates_SixFieldPayload(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "", "table")
|
client, _ := NewClient(server.URL, "", "table", "", false)
|
||||||
err := client.ImportCertificates([]string{
|
err := client.ImportCertificates([]string{
|
||||||
"--owner-id", "o-alice",
|
"--owner-id", "o-alice",
|
||||||
"--team-id", "t-platform",
|
"--team-id", "t-platform",
|
||||||
@@ -583,3 +584,194 @@ func generateTestCert() *x509.Certificate {
|
|||||||
|
|
||||||
return cert
|
return cert
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// -----------------------------------------------------------------------------
|
||||||
|
// HTTPS-Everywhere milestone (v2.2, §3.2 + §7 Phase 5):
|
||||||
|
// The CLI binary now talks HTTPS-only to the control plane. These tests pin the
|
||||||
|
// three contracts the milestone requires every client binary (agent, CLI, MCP)
|
||||||
|
// to satisfy in lock-step:
|
||||||
|
// (a) CA bundle load success — PEM loads, RootCAs + MinVersion=TLS1.3 wired
|
||||||
|
// through the injected *http.Transport so the httpClient actually uses them.
|
||||||
|
// (b) CA bundle load failure — missing file and malformed/empty PEM each fail
|
||||||
|
// loud with a pinned substring so operators get a useful diagnostic instead
|
||||||
|
// of a later TLS-handshake-error mystery.
|
||||||
|
// (c) End-to-end TLS round-trip — an httptest.NewTLSServer whose own cert is
|
||||||
|
// written out as the CA bundle validates that every TLS-config knob is
|
||||||
|
// actually reaching the wire, not just surviving into the struct.
|
||||||
|
// Each of the three client binaries pins the same three contracts against its
|
||||||
|
// own NewClient signature; drifting any of them in isolation is exactly what
|
||||||
|
// this suite is here to catch. The error-string substrings below must stay in
|
||||||
|
// sync with the fmt.Errorf messages in internal/cli/client.go:NewClient.
|
||||||
|
// -----------------------------------------------------------------------------
|
||||||
|
|
||||||
|
// writeCABundle PEM-encodes a DER cert and writes it to a temp file under the
|
||||||
|
// test's own TempDir. Returns the absolute path of the written bundle so test
|
||||||
|
// callers can pass it straight into NewClient(..., caBundlePath, ...).
|
||||||
|
func writeCABundle(t *testing.T, dir string, certDER []byte, filename string) string {
|
||||||
|
t.Helper()
|
||||||
|
path := filepath.Join(dir, filename)
|
||||||
|
pemBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: certDER})
|
||||||
|
if err := os.WriteFile(path, pemBytes, 0o600); err != nil {
|
||||||
|
t.Fatalf("writing CA bundle to %q: %v", path, err)
|
||||||
|
}
|
||||||
|
return path
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewClient_CABundle_Success pins the happy path: a valid PEM CA bundle
|
||||||
|
// loads, populates RootCAs on the client's TLS config, and leaves
|
||||||
|
// MinVersion=TLS1.3 intact. Regression guard: if a future edit accidentally
|
||||||
|
// swaps the transport after TLS config setup (or forgets to re-attach the
|
||||||
|
// *tls.Config to *http.Transport), this test catches it before ops does.
|
||||||
|
func TestNewClient_CABundle_Success(t *testing.T) {
|
||||||
|
cert := generateTestCert()
|
||||||
|
tmp := t.TempDir()
|
||||||
|
bundlePath := writeCABundle(t, tmp, cert.Raw, "ca.pem")
|
||||||
|
|
||||||
|
client, err := NewClient("https://certctl-server:8443", "test-key", "table", bundlePath, false)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("NewClient with valid CA bundle err=%v want nil", err)
|
||||||
|
}
|
||||||
|
if client == nil {
|
||||||
|
t.Fatal("NewClient returned nil client on happy path")
|
||||||
|
}
|
||||||
|
|
||||||
|
transport, ok := client.httpClient.Transport.(*http.Transport)
|
||||||
|
if !ok {
|
||||||
|
t.Fatalf("httpClient.Transport type=%T want *http.Transport (TLS config injection broke)", client.httpClient.Transport)
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig == nil {
|
||||||
|
t.Fatal("transport.TLSClientConfig is nil; TLS config must be set on every client")
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig.RootCAs == nil {
|
||||||
|
t.Fatal("transport.TLSClientConfig.RootCAs is nil; CA bundle path was ignored")
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig.MinVersion != tls.VersionTLS13 {
|
||||||
|
t.Errorf("MinVersion=%d want tls.VersionTLS13 (%d); HTTPS-Everywhere requires TLS1.3 floor",
|
||||||
|
transport.TLSClientConfig.MinVersion, tls.VersionTLS13)
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig.InsecureSkipVerify {
|
||||||
|
t.Error("InsecureSkipVerify=true with insecure=false arg; flag wiring crossed")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewClient_CABundle_MissingFile pins the fail-loud path for a nonexistent
|
||||||
|
// bundle path. The error surface must include "reading CA bundle" so operators
|
||||||
|
// see the right diagnostic instead of a downstream TLS-handshake-error.
|
||||||
|
func TestNewClient_CABundle_MissingFile(t *testing.T) {
|
||||||
|
_, err := NewClient("https://certctl-server:8443", "test-key", "table", "/nonexistent/path/ca.pem", false)
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("NewClient with missing CA bundle err=nil; must fail loud so operators see the right diagnostic")
|
||||||
|
}
|
||||||
|
if !containsStr(err.Error(), "reading CA bundle") {
|
||||||
|
t.Errorf("err=%q must contain %q so operators can locate the misconfigured path", err.Error(), "reading CA bundle")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewClient_CABundle_EmptyPEM pins the fail-loud path for a file whose
|
||||||
|
// contents are not valid PEM certificate data. AppendCertsFromPEM returning
|
||||||
|
// false is the signal we need to surface — otherwise the client would silently
|
||||||
|
// ship with an empty cert pool and every TLS handshake would fail downstream.
|
||||||
|
func TestNewClient_CABundle_EmptyPEM(t *testing.T) {
|
||||||
|
tmp := t.TempDir()
|
||||||
|
garbagePath := filepath.Join(tmp, "garbage.pem")
|
||||||
|
if err := os.WriteFile(garbagePath, []byte("not a pem certificate, just bytes"), 0o600); err != nil {
|
||||||
|
t.Fatalf("writing garbage file: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
_, err := NewClient("https://certctl-server:8443", "test-key", "table", garbagePath, false)
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("NewClient with malformed PEM err=nil; must fail loud, not silently skip")
|
||||||
|
}
|
||||||
|
if !containsStr(err.Error(), "no valid PEM-encoded certificates") {
|
||||||
|
t.Errorf("err=%q must contain %q so operators know the file parsed but held no certs",
|
||||||
|
err.Error(), "no valid PEM-encoded certificates")
|
||||||
|
}
|
||||||
|
}
|
||||||
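The three CA-bundle tests above pin behaviour that can be sketched roughly as follows. This is a minimal illustration, not the code in internal/cli/client.go, which this diff does not show: `buildTLSConfig` is a hypothetical helper name, and only the two error substrings ("reading CA bundle" and "no valid PEM-encoded certificates") are taken from the tests.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// buildTLSConfig is a hypothetical helper (not taken from the certctl source).
// It mirrors what the tests pin: a TLS 1.3 floor, an optional CA bundle, and
// an explicit insecure escape hatch.
func buildTLSConfig(caBundlePath string, insecure bool) (*tls.Config, error) {
	cfg := &tls.Config{
		MinVersion:         tls.VersionTLS13,
		InsecureSkipVerify: insecure,
	}
	if caBundlePath == "" {
		// No bundle supplied: leave RootCAs nil so the system roots apply.
		return cfg, nil
	}
	pemBytes, err := os.ReadFile(caBundlePath)
	if err != nil {
		return nil, fmt.Errorf("reading CA bundle %q: %w", caBundlePath, err)
	}
	pool := x509.NewCertPool()
	// AppendCertsFromPEM reports false when no certificate could be parsed.
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, fmt.Errorf("no valid PEM-encoded certificates in CA bundle %q", caBundlePath)
	}
	cfg.RootCAs = pool
	return cfg, nil
}

func main() {
	_, err := buildTLSConfig("/nonexistent/path/ca.pem", false)
	fmt.Println("missing bundle error:", err)
}
```

Failing at construction time keeps the diagnostic next to the misconfigured path, which is the point the test comments make about avoiding a later handshake error.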
|
|
||||||
|
// TestNewClient_TLSRoundTrip validates that the TLS config knobs we set on
|
||||||
|
// NewClient actually reach the wire. An httptest.NewTLSServer presents its own
|
||||||
|
// self-signed leaf; we PEM-encode that server cert, write it as the CA bundle,
|
||||||
|
// and issue a real HTTPS call through ListCertificates. A successful round-trip
|
||||||
|
// proves RootCAs + MinVersion are flowing through *http.Transport into the
|
||||||
|
// dialer, not just surviving into the client struct.
|
||||||
|
func TestNewClient_TLSRoundTrip(t *testing.T) {
|
||||||
|
var handlerHit int
|
||||||
|
server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||||
|
if r.Method == "GET" && r.URL.Path == "/api/v1/certificates" {
|
||||||
|
handlerHit++
|
||||||
|
w.Header().Set("Content-Type", "application/json")
|
||||||
|
_ = json.NewEncoder(w).Encode(map[string]interface{}{
|
||||||
|
"data": []map[string]interface{}{},
|
||||||
|
"total": 0,
|
||||||
|
})
|
||||||
|
return
|
||||||
|
}
|
||||||
|
w.WriteHeader(http.StatusNotFound)
|
||||||
|
}))
|
||||||
|
defer server.Close()
|
||||||
|
|
||||||
|
serverCert := server.Certificate()
|
||||||
|
if serverCert == nil {
|
||||||
|
t.Fatal("httptest.NewTLSServer.Certificate() returned nil; cannot build CA bundle")
|
||||||
|
}
|
||||||
|
|
||||||
|
tmp := t.TempDir()
|
||||||
|
bundlePath := writeCABundle(t, tmp, serverCert.Raw, "server-ca.pem")
|
||||||
|
|
||||||
|
client, err := NewClient(server.URL, "test-key", "table", bundlePath, false)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("NewClient(TLS server) err=%v want nil", err)
|
||||||
|
}
|
||||||
|
if err := client.ListCertificates([]string{}); err != nil {
|
||||||
|
t.Fatalf("ListCertificates over HTTPS err=%v; TLS config must reach the wire", err)
|
||||||
|
}
|
||||||
|
if handlerHit != 1 {
|
||||||
|
t.Errorf("handlerHit=%d want 1; request did not reach the TLS server", handlerHit)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewClient_InsecureSkipVerify pins the dev-only escape hatch: an untrusted
|
||||||
|
// TLS server (cert NOT in the client's root pool) must be reachable when
|
||||||
|
// insecure=true. This is the only path in the control plane that disables
|
||||||
|
// certificate verification; it's documented in docs/tls.md and gated by the
|
||||||
|
// CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY env var so it never slips into
|
||||||
|
// production silently.
|
||||||
|
func TestNewClient_InsecureSkipVerify(t *testing.T) {
|
||||||
|
var handlerHit int
|
||||||
|
server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||||
|
handlerHit++
|
||||||
|
w.Header().Set("Content-Type", "application/json")
|
||||||
|
_ = json.NewEncoder(w).Encode(map[string]interface{}{
|
||||||
|
"data": []map[string]interface{}{},
|
||||||
|
"total": 0,
|
||||||
|
})
|
||||||
|
}))
|
||||||
|
defer server.Close()
|
||||||
|
|
||||||
|
// No CA bundle → system roots, which will NOT trust the self-signed
|
||||||
|
// httptest cert. insecure=true is the only thing keeping this call from
|
||||||
|
// failing with an x509-unknown-authority error.
|
||||||
|
client, err := NewClient(server.URL, "test-key", "table", "", true)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("NewClient(insecure=true) err=%v want nil", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
transport, ok := client.httpClient.Transport.(*http.Transport)
|
||||||
|
if !ok {
|
||||||
|
t.Fatalf("httpClient.Transport type=%T want *http.Transport", client.httpClient.Transport)
|
||||||
|
}
|
||||||
|
if !transport.TLSClientConfig.InsecureSkipVerify {
|
||||||
|
t.Fatal("insecure=true arg did not set TLSClientConfig.InsecureSkipVerify; flag wiring broken")
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig.MinVersion != tls.VersionTLS13 {
|
||||||
|
t.Errorf("MinVersion=%d want tls.VersionTLS13 even with insecure=true (TLS1.3 floor is not optional)",
|
||||||
|
transport.TLSClientConfig.MinVersion)
|
||||||
|
}
|
||||||
|
|
||||||
|
if err := client.ListCertificates([]string{}); err != nil {
|
||||||
|
t.Fatalf("ListCertificates(insecure=true) err=%v; escape hatch must still complete the round-trip", err)
|
||||||
|
}
|
||||||
|
if handlerHit != 1 {
|
||||||
|
t.Errorf("handlerHit=%d want 1; insecure round-trip did not reach the server", handlerHit)
|
||||||
|
}
|
||||||
|
}
|
||||||
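The round-trip and escape-hatch tests both depend on one property: the certificate served by an httptest TLS server is accepted only if the client explicitly trusts it. The sketch below shows that property with the standard library alone. It does not use the certctl `NewClient`; `roundTrip` is a hypothetical helper written for this illustration.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// roundTrip starts a throwaway HTTPS server and issues one GET against it.
// When trustServerCert is false the client's root pool is left empty, so
// certificate verification is expected to fail.
func roundTrip(trustServerCert bool) (int, error) {
	server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer server.Close()

	pool := x509.NewCertPool()
	if trustServerCert {
		pool.AddCert(server.Certificate())
	}
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				RootCAs:    pool,
				MinVersion: tls.VersionTLS13,
			},
		},
	}
	defer client.CloseIdleConnections()

	resp, err := client.Get(server.URL)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	code, err := roundTrip(true)
	fmt.Println("trusted:", code, err)
	_, err = roundTrip(false)
	fmt.Println("untrusted failed:", err != nil)
}
```

A successful request with the server certificate in the pool, and a failure without it, is the same evidence the tests use to show that the TLS settings reach the wire.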
|
|||||||
+136
-36
@@ -1,6 +1,7 @@
|
|||||||
package config
|
package config
|
||||||
|
|
||||||
import (
|
import (
|
||||||
|
"crypto/tls"
|
||||||
"fmt"
|
"fmt"
|
||||||
"log/slog"
|
"log/slog"
|
||||||
"os"
|
"os"
|
||||||
@@ -12,29 +13,29 @@ import (
|
|||||||
// Config represents the complete application configuration.
|
// Config represents the complete application configuration.
|
||||||
// All configuration values are read from environment variables with CERTCTL_ prefix.
|
// All configuration values are read from environment variables with CERTCTL_ prefix.
|
||||||
type Config struct {
|
type Config struct {
|
||||||
Server ServerConfig
|
Server ServerConfig
|
||||||
Database DatabaseConfig
|
Database DatabaseConfig
|
||||||
Scheduler SchedulerConfig
|
Scheduler SchedulerConfig
|
||||||
Log LogConfig
|
Log LogConfig
|
||||||
Auth AuthConfig
|
Auth AuthConfig
|
||||||
RateLimit RateLimitConfig
|
RateLimit RateLimitConfig
|
||||||
CORS CORSConfig
|
CORS CORSConfig
|
||||||
Keygen KeygenConfig
|
Keygen KeygenConfig
|
||||||
CA CAConfig
|
CA CAConfig
|
||||||
Notifiers NotifierConfig
|
Notifiers NotifierConfig
|
||||||
NetworkScan NetworkScanConfig
|
NetworkScan NetworkScanConfig
|
||||||
EST ESTConfig
|
EST ESTConfig
|
||||||
SCEP SCEPConfig
|
SCEP SCEPConfig
|
||||||
Verification VerificationConfig
|
Verification VerificationConfig
|
||||||
ACME ACMEConfig
|
ACME ACMEConfig
|
||||||
Vault VaultConfig
|
Vault VaultConfig
|
||||||
DigiCert DigiCertConfig
|
DigiCert DigiCertConfig
|
||||||
Sectigo SectigoConfig
|
Sectigo SectigoConfig
|
||||||
GoogleCAS GoogleCASConfig
|
GoogleCAS GoogleCASConfig
|
||||||
AWSACMPCA AWSACMPCAConfig
|
AWSACMPCA AWSACMPCAConfig
|
||||||
Entrust EntrustConfig
|
Entrust EntrustConfig
|
||||||
GlobalSign GlobalSignConfig
|
GlobalSign GlobalSignConfig
|
||||||
EJBCA EJBCAConfig
|
EJBCA EJBCAConfig
|
||||||
Digest DigestConfig
|
Digest DigestConfig
|
||||||
HealthCheck HealthCheckConfig
|
HealthCheck HealthCheckConfig
|
||||||
Encryption EncryptionConfig
|
Encryption EncryptionConfig
|
||||||
@@ -651,11 +652,14 @@ type SCEPConfig struct {
|
|||||||
// ChallengePassword is the shared secret used to authenticate SCEP enrollment requests.
|
// ChallengePassword is the shared secret used to authenticate SCEP enrollment requests.
|
||||||
// Clients include this in the PKCS#10 CSR challengePassword attribute.
|
// Clients include this in the PKCS#10 CSR challengePassword attribute.
|
||||||
//
|
//
|
||||||
// REQUIRED when Enabled is true. If SCEP is enabled and this value is empty,
|
// REQUIRED when Enabled is true. Config.Validate() below refuses to start the
|
||||||
// cmd/server/main.go's preflightSCEPChallengePassword check will refuse to
|
// server if SCEP is enabled and this value is empty (H-2, CWE-306): post-M-001
|
||||||
// start the server (H-2, CWE-306): an empty shared secret allowed any client
|
// under option (D), the /scep endpoint rides the no-auth middleware chain per
|
||||||
// that could reach /scep to enroll a CSR against the configured issuer. The
|
// RFC 8894 §3.2, so the challenge password is the sole application-layer
|
||||||
// service-layer PKCSReq path also rejects this configuration defense-in-depth.
|
// authentication boundary for SCEP enrollment. An empty shared secret would
|
||||||
|
// allow any client that can reach /scep to enroll a CSR against the configured
|
||||||
|
// issuer. The service-layer PKCSReq path also rejects this configuration
|
||||||
|
// as defense in depth.
|
||||||
ChallengePassword string
|
ChallengePassword string
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -674,9 +678,30 @@ type VerificationConfig struct {
|
|||||||
|
|
||||||
// ServerConfig contains HTTP server configuration.
|
// ServerConfig contains HTTP server configuration.
|
||||||
type ServerConfig struct {
|
type ServerConfig struct {
|
||||||
Host string // Server host (default: 127.0.0.1). Set via CERTCTL_SERVER_HOST.
|
Host string // Server host (default: 127.0.0.1). Set via CERTCTL_SERVER_HOST.
|
||||||
Port int // Server port (default: 8080). Set via CERTCTL_SERVER_PORT.
|
Port int // Server port (default: 8080). Set via CERTCTL_SERVER_PORT.
|
||||||
MaxBodySize int64 // Maximum request body size in bytes (default: 1MB). Set via CERTCTL_MAX_BODY_SIZE.
|
MaxBodySize int64 // Maximum request body size in bytes (default: 1MB). Set via CERTCTL_MAX_BODY_SIZE.
|
||||||
|
TLS ServerTLSConfig // HTTPS-only TLS configuration. Both CertPath and KeyPath are required.
|
||||||
|
}
|
||||||
|
|
||||||
|
// ServerTLSConfig holds the server-side TLS material.
|
||||||
|
//
|
||||||
|
// The control plane is HTTPS-only as of the HTTPS-everywhere milestone
|
||||||
|
// (§3 locked decisions: no `http` mode, no dual-listener, TLS 1.3 only).
|
||||||
|
// Both CertPath and KeyPath are required; an empty value causes
|
||||||
|
// Config.Validate() to return a fail-loud error and the server refuses
|
||||||
|
// to start. There is no plaintext HTTP fallback, no N-release migration
|
||||||
|
// bridge, and no auto-generated self-signed cert — operators either
|
||||||
|
// supply a cert on disk (docker-compose init container, operator-managed
|
||||||
|
// file, cert-manager mount) or the process exits non-zero.
|
||||||
|
type ServerTLSConfig struct {
|
||||||
|
// CertPath is the filesystem path to the server's PEM-encoded X.509
|
||||||
|
// certificate. Set via CERTCTL_SERVER_TLS_CERT_PATH. Required.
|
||||||
|
CertPath string
|
||||||
|
|
||||||
|
// KeyPath is the filesystem path to the server's PEM-encoded private
|
||||||
|
// key matching the certificate at CertPath. Set via CERTCTL_SERVER_TLS_KEY_PATH. Required.
|
||||||
|
KeyPath string
|
||||||
}
|
}
|
||||||
|
|
||||||
// DatabaseConfig contains database connection configuration.
|
// DatabaseConfig contains database connection configuration.
|
||||||
@@ -708,6 +733,17 @@ type SchedulerConfig struct {
|
|||||||
// Setting: CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL environment variable.
|
// Setting: CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL environment variable.
|
||||||
NotificationProcessInterval time.Duration
|
NotificationProcessInterval time.Duration
|
||||||
|
|
||||||
|
// NotificationRetryInterval is how often the scheduler retries failed
|
||||||
|
// notifications whose retry_count is below the service-layer 5-attempt
|
||||||
|
// DLQ budget. Default: 2 minutes. Minimum: 1 second. Mirrors the I-001
|
||||||
|
// RetryInterval knob: transitions eligible Failed notifications whose
|
||||||
|
// next_retry_at has arrived back to Pending so the notification processor
|
||||||
|
// picks them up on its next tick (closes coverage gap I-005 — HEAD had
|
||||||
|
// no retry path for transient SMTP/webhook failures and notifications
|
||||||
|
// stayed Failed forever).
|
||||||
|
// Setting: CERTCTL_NOTIFICATION_RETRY_INTERVAL environment variable.
|
||||||
|
NotificationRetryInterval time.Duration
|
||||||
|
|
||||||
// RetryInterval is how often the scheduler retries failed jobs whose Attempts
|
// RetryInterval is how often the scheduler retries failed jobs whose Attempts
|
||||||
// counter is below MaxAttempts. Default: 5 minutes. Minimum: 1 second.
|
// counter is below MaxAttempts. Default: 5 minutes. Minimum: 1 second.
|
||||||
// Transitions eligible Failed jobs back to Pending so the job processor can
|
// Transitions eligible Failed jobs back to Pending so the job processor can
|
||||||
@@ -827,6 +863,13 @@ func Load() (*Config, error) {
|
|||||||
Host: getEnv("CERTCTL_SERVER_HOST", "127.0.0.1"),
|
Host: getEnv("CERTCTL_SERVER_HOST", "127.0.0.1"),
|
||||||
Port: getEnvInt("CERTCTL_SERVER_PORT", 8080),
|
Port: getEnvInt("CERTCTL_SERVER_PORT", 8080),
|
||||||
MaxBodySize: getEnvInt64("CERTCTL_MAX_BODY_SIZE", 1024*1024), // 1MB default
|
MaxBodySize: getEnvInt64("CERTCTL_MAX_BODY_SIZE", 1024*1024), // 1MB default
|
||||||
|
// HTTPS-everywhere milestone §2.1: both paths REQUIRED. Empty defaults
|
||||||
|
// are intentional so Validate() emits a fail-loud error pointing at
|
||||||
|
// docs/tls.md rather than silently binding plaintext HTTP.
|
||||||
|
TLS: ServerTLSConfig{
|
||||||
|
CertPath: getEnv("CERTCTL_SERVER_TLS_CERT_PATH", ""),
|
||||||
|
KeyPath: getEnv("CERTCTL_SERVER_TLS_KEY_PATH", ""),
|
||||||
|
},
|
||||||
},
|
},
|
||||||
Database: DatabaseConfig{
|
Database: DatabaseConfig{
|
||||||
URL: getEnv("CERTCTL_DATABASE_URL", "postgres://localhost/certctl"),
|
URL: getEnv("CERTCTL_DATABASE_URL", "postgres://localhost/certctl"),
|
||||||
@@ -838,10 +881,16 @@ func Load() (*Config, error) {
|
|||||||
JobProcessorInterval: getEnvDuration("CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL", 30*time.Second),
|
JobProcessorInterval: getEnvDuration("CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL", 30*time.Second),
|
||||||
AgentHealthCheckInterval: getEnvDuration("CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL", 2*time.Minute),
|
AgentHealthCheckInterval: getEnvDuration("CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL", 2*time.Minute),
|
||||||
NotificationProcessInterval: getEnvDuration("CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL", 1*time.Minute),
|
NotificationProcessInterval: getEnvDuration("CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL", 1*time.Minute),
|
||||||
RetryInterval: getEnvDuration("CERTCTL_SCHEDULER_RETRY_INTERVAL", 5*time.Minute),
|
// I-005: retry sweep for failed notifications. Mirrors RetryInterval
|
||||||
JobTimeoutInterval: getEnvDuration("CERTCTL_JOB_TIMEOUT_INTERVAL", 10*time.Minute),
|
// (I-001 job retry) but scoped to the notification DLQ machinery.
|
||||||
AwaitingCSRTimeout: getEnvDuration("CERTCTL_JOB_AWAITING_CSR_TIMEOUT", 24*time.Hour),
|
// Default 2 minutes — fast enough to absorb transient SMTP/webhook
|
||||||
AwaitingApprovalTimeout: getEnvDuration("CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT", 168*time.Hour),
|
// blips, slow enough to respect the service-layer 5-attempt budget
|
||||||
|
// without hammering external notifier endpoints.
|
||||||
|
NotificationRetryInterval: getEnvDuration("CERTCTL_NOTIFICATION_RETRY_INTERVAL", 2*time.Minute),
|
||||||
|
RetryInterval: getEnvDuration("CERTCTL_SCHEDULER_RETRY_INTERVAL", 5*time.Minute),
|
||||||
|
JobTimeoutInterval: getEnvDuration("CERTCTL_JOB_TIMEOUT_INTERVAL", 10*time.Minute),
|
||||||
|
AwaitingCSRTimeout: getEnvDuration("CERTCTL_JOB_AWAITING_CSR_TIMEOUT", 24*time.Hour),
|
||||||
|
AwaitingApprovalTimeout: getEnvDuration("CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT", 168*time.Hour),
|
||||||
},
|
},
|
||||||
Log: LogConfig{
|
Log: LogConfig{
|
||||||
Level: getEnv("CERTCTL_LOG_LEVEL", "info"),
|
Level: getEnv("CERTCTL_LOG_LEVEL", "info"),
|
||||||
@@ -871,7 +920,7 @@ func Load() (*Config, error) {
|
|||||||
Notifiers: NotifierConfig{
|
Notifiers: NotifierConfig{
|
||||||
SlackWebhookURL: getEnv("CERTCTL_SLACK_WEBHOOK_URL", ""),
|
SlackWebhookURL: getEnv("CERTCTL_SLACK_WEBHOOK_URL", ""),
|
||||||
SlackChannel: getEnv("CERTCTL_SLACK_CHANNEL", ""),
|
SlackChannel: getEnv("CERTCTL_SLACK_CHANNEL", ""),
|
||||||
SlackUsername: getEnv("CERTCTL_SLACK_USERNAME", "certctl"),
|
SlackUsername: getEnv("CERTCTL_SLACK_USERNAME", "certctl"),
|
||||||
TeamsWebhookURL: getEnv("CERTCTL_TEAMS_WEBHOOK_URL", ""),
|
TeamsWebhookURL: getEnv("CERTCTL_TEAMS_WEBHOOK_URL", ""),
|
||||||
PagerDutyRoutingKey: getEnv("CERTCTL_PAGERDUTY_ROUTING_KEY", ""),
|
PagerDutyRoutingKey: getEnv("CERTCTL_PAGERDUTY_ROUTING_KEY", ""),
|
||||||
PagerDutySeverity: getEnv("CERTCTL_PAGERDUTY_SEVERITY", "warning"),
|
PagerDutySeverity: getEnv("CERTCTL_PAGERDUTY_SEVERITY", "warning"),
|
||||||
@@ -1039,6 +1088,37 @@ func (c *Config) Validate() error {
|
|||||||
return fmt.Errorf("invalid server port: %d", c.Server.Port)
|
return fmt.Errorf("invalid server port: %d", c.Server.Port)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// HTTPS-everywhere milestone §2.1 + §3 locked decisions: the control plane
|
||||||
|
// is TLS-only and refuses to start without a cert. No plaintext HTTP fallback,
|
||||||
|
// no auto-generated self-signed cert, no N-release migration window. An empty
|
||||||
|
// CertPath or KeyPath is operator-visible misconfiguration, not a soft warning.
|
||||||
|
if c.Server.TLS.CertPath == "" {
|
||||||
|
return fmt.Errorf("server TLS cert path is required — refuse to start (HTTPS-only: set CERTCTL_SERVER_TLS_CERT_PATH to a PEM-encoded certificate; see docs/tls.md)")
|
||||||
|
}
|
||||||
|
if c.Server.TLS.KeyPath == "" {
|
||||||
|
return fmt.Errorf("server TLS key path is required — refuse to start (HTTPS-only: set CERTCTL_SERVER_TLS_KEY_PATH to the PEM-encoded private key matching CERTCTL_SERVER_TLS_CERT_PATH; see docs/tls.md)")
|
||||||
|
}
|
||||||
|
|
||||||
|
// Files must exist and be readable. Catches typos and missing mount paths
|
||||||
|
// up-front so the operator gets a structured error on startup instead of
|
||||||
|
// a deferred ListenAndServeTLS failure after the scheduler has already
|
||||||
|
// fanned out its goroutines.
|
||||||
|
if _, err := os.Stat(c.Server.TLS.CertPath); err != nil {
|
||||||
|
return fmt.Errorf("server TLS cert file unreadable at %q: %w — refuse to start (HTTPS-only; see docs/tls.md)", c.Server.TLS.CertPath, err)
|
||||||
|
}
|
||||||
|
if _, err := os.Stat(c.Server.TLS.KeyPath); err != nil {
|
||||||
|
return fmt.Errorf("server TLS key file unreadable at %q: %w — refuse to start (HTTPS-only; see docs/tls.md)", c.Server.TLS.KeyPath, err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Parse the cert+key pair up-front. tls.LoadX509KeyPair verifies that the
|
||||||
|
// private key matches the cert's public key (prevents the classic footgun of shipping a pair
|
||||||
|
// whose private key doesn't match). Discard the returned Certificate — the
|
||||||
|
// server constructs its own holder from fresh reads so SIGHUP reload is
|
||||||
|
// authoritative.
|
||||||
|
if _, err := tls.LoadX509KeyPair(c.Server.TLS.CertPath, c.Server.TLS.KeyPath); err != nil {
|
||||||
|
return fmt.Errorf("server TLS cert/key pair invalid (cert=%q key=%q): %w — refuse to start (HTTPS-only; see docs/tls.md)", c.Server.TLS.CertPath, c.Server.TLS.KeyPath, err)
|
||||||
|
}
|
||||||
|
|
||||||
// Validate database configuration
|
// Validate database configuration
|
||||||
if c.Database.URL == "" {
|
if c.Database.URL == "" {
|
||||||
return fmt.Errorf("database URL is required")
|
return fmt.Errorf("database URL is required")
|
||||||
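The TLS preflight in this hunk runs three checks in order: non-empty paths, files present on disk, and a cert/key pair that parses and matches. The sketch below reproduces that sequence in isolation and shows the mismatch case that `tls.LoadX509KeyPair` catches. `preflightTLS`, `writePair`, and `demo` are hypothetical names for this illustration, and the error texts are simplified rather than copied from `Validate()`.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"os"
	"path/filepath"
	"time"
)

// writePair writes a self-signed certificate for certKey plus the PEM form of
// fileKey into dir. Passing two different keys produces a mismatched pair.
func writePair(dir string, certKey, fileKey *ecdsa.PrivateKey) (string, string, error) {
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "preflight-sketch"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &certKey.PublicKey, certKey)
	if err != nil {
		return "", "", err
	}
	keyDER, err := x509.MarshalECPrivateKey(fileKey)
	if err != nil {
		return "", "", err
	}
	certPath := filepath.Join(dir, "cert.pem")
	keyPath := filepath.Join(dir, "key.pem")
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	if err := os.WriteFile(certPath, certPEM, 0o600); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(keyPath, keyPEM, 0o600); err != nil {
		return "", "", err
	}
	return certPath, keyPath, nil
}

// preflightTLS applies the same order of checks as the hunk above:
// empty path, then stat, then a full parse of the pair.
func preflightTLS(certPath, keyPath string) error {
	if certPath == "" || keyPath == "" {
		return fmt.Errorf("TLS cert and key paths are both required")
	}
	for _, p := range []string{certPath, keyPath} {
		if _, err := os.Stat(p); err != nil {
			return fmt.Errorf("TLS file unreadable at %q: %w", p, err)
		}
	}
	if _, err := tls.LoadX509KeyPair(certPath, keyPath); err != nil {
		return fmt.Errorf("TLS cert/key pair invalid: %w", err)
	}
	return nil
}

// demo runs the preflight against a matching pair and a mismatched pair.
func demo() (matching, mismatched error) {
	dir, err := os.MkdirTemp("", "preflight-sketch")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(dir)
	keyA, errA := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	keyB, errB := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if errA != nil || errB != nil {
		return fmt.Errorf("key generation failed"), fmt.Errorf("key generation failed")
	}
	check := func(certKey, fileKey *ecdsa.PrivateKey) error {
		certPath, keyPath, err := writePair(dir, certKey, fileKey)
		if err != nil {
			return err
		}
		return preflightTLS(certPath, keyPath)
	}
	return check(keyA, keyA), check(keyA, keyB)
}

func main() {
	matching, mismatched := demo()
	fmt.Println("matching pair:", matching)
	fmt.Println("mismatched pair:", mismatched)
}
```

The `os.Stat` step alone would accept the mismatched pair, which is why the parse step is worth the extra read at startup.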
@@ -1092,6 +1172,19 @@ func (c *Config) Validate() error {
|
|||||||
return fmt.Errorf("invalid keygen mode: %s (must be 'agent' or 'server')", c.Keygen.Mode)
|
return fmt.Errorf("invalid keygen mode: %s (must be 'agent' or 'server')", c.Keygen.Mode)
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// SCEP fail-loud startup gate (H-2, CWE-306).
|
||||||
|
//
|
||||||
|
// Post-M-001 option (D) routes /scep through the no-auth middleware chain per
|
||||||
|
// RFC 8894 §3.2 — SCEP clients authenticate via the challengePassword attribute
|
||||||
|
// in the PKCS#10 CSR, not via HTTP Bearer tokens or TLS client certs. That makes
|
||||||
|
// CERTCTL_SCEP_CHALLENGE_PASSWORD the sole application-layer authentication
|
||||||
|
// boundary for SCEP enrollment. Refuse to start if it is empty when SCEP is
|
||||||
|
// enabled: an empty shared secret would allow any client that can reach /scep to
|
||||||
|
// enroll a CSR against the configured issuer (anonymous issuance).
|
||||||
|
if c.SCEP.Enabled && c.SCEP.ChallengePassword == "" {
|
||||||
|
return fmt.Errorf("SCEP is enabled but CERTCTL_SCEP_CHALLENGE_PASSWORD is empty — refuse to start (CWE-306: anonymous SCEP issuance is insecure; set a non-empty shared secret or disable SCEP with CERTCTL_SCEP_ENABLED=false). This gate duplicates cmd/server/main.go:preflightSCEPChallengePassword for defense in depth")
|
||||||
|
}
|
||||||
|
|
||||||
// Validate scheduler intervals
|
// Validate scheduler intervals
|
||||||
if c.Scheduler.RenewalCheckInterval < 1*time.Minute {
|
if c.Scheduler.RenewalCheckInterval < 1*time.Minute {
|
||||||
return fmt.Errorf("renewal check interval must be at least 1 minute")
|
return fmt.Errorf("renewal check interval must be at least 1 minute")
|
||||||
@@ -1109,6 +1202,13 @@ func (c *Config) Validate() error {
|
|||||||
return fmt.Errorf("notification process interval must be at least 1 second")
|
return fmt.Errorf("notification process interval must be at least 1 second")
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// I-005: guard against a misconfigured retry sweep that would either
|
||||||
|
// spin-wait or never fire. Matches the NotificationProcessInterval
|
||||||
|
// minimum (1s) so operators can tune both knobs from the same floor.
|
||||||
|
if c.Scheduler.NotificationRetryInterval < 1*time.Second {
|
||||||
|
return fmt.Errorf("notification retry interval must be at least 1 second")
|
||||||
|
}
|
||||||
|
|
||||||
if c.Scheduler.RetryInterval < 1*time.Second {
|
if c.Scheduler.RetryInterval < 1*time.Second {
|
||||||
return fmt.Errorf("retry interval must be at least 1 second")
|
return fmt.Errorf("retry interval must be at least 1 second")
|
||||||
}
|
}
|
||||||
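The `NotificationRetryInterval` comments describe a sweep that moves Failed notifications back to Pending once `next_retry_at` has arrived, as long as the retry count is under the 5-attempt budget. The scheduler and service code are not part of this diff, so the following is only a sketch of that selection rule: the `notification` type, its field names, the status strings, and `sweep` are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// maxAttempts reflects the 5-attempt budget mentioned in the config comments.
const maxAttempts = 5

// notification is a hypothetical stand-in for the real notification record.
type notification struct {
	ID          string
	Status      string
	RetryCount  int
	NextRetryAt time.Time
}

// sweep moves eligible Failed notifications back to Pending and returns how
// many were requeued. A notification is eligible when it has attempts left
// and its NextRetryAt is not in the future.
func sweep(ns []notification, now time.Time) int {
	requeued := 0
	for i := range ns {
		n := &ns[i]
		if n.Status == "Failed" && n.RetryCount < maxAttempts && !n.NextRetryAt.After(now) {
			n.Status = "Pending"
			requeued++
		}
	}
	return requeued
}

// demoSweep builds one eligible notification and three ineligible ones.
func demoSweep() int {
	now := time.Now()
	ns := []notification{
		{ID: "due", Status: "Failed", RetryCount: 2, NextRetryAt: now.Add(-time.Minute)},
		{ID: "not-yet", Status: "Failed", RetryCount: 2, NextRetryAt: now.Add(time.Minute)},
		{ID: "exhausted", Status: "Failed", RetryCount: 5, NextRetryAt: now.Add(-time.Minute)},
		{ID: "sent", Status: "Sent", RetryCount: 0, NextRetryAt: now.Add(-time.Minute)},
	}
	return sweep(ns, now)
}

func main() {
	fmt.Println("requeued:", demoSweep()) // prints "requeued: 1"
}
```

Under this rule the sweep interval only bounds how late a retry can start; the per-notification backoff still lives in `NextRetryAt`.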
|
|||||||
+256
-13
@@ -1,10 +1,18 @@
|
|||||||
package config
|
package config
|
||||||
|
|
||||||
import (
|
import (
|
||||||
|
"crypto/ecdsa"
|
||||||
|
"crypto/elliptic"
|
||||||
|
"crypto/rand"
|
||||||
|
"crypto/x509"
|
||||||
|
"crypto/x509/pkix"
|
||||||
|
"encoding/pem"
|
||||||
"log/slog"
|
"log/slog"
|
||||||
|
"math/big"
|
||||||
"os"
|
"os"
|
||||||
"testing"
|
"path/filepath"
|
||||||
"strings"
|
"strings"
|
||||||
|
"testing"
|
||||||
"time"
|
"time"
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -26,10 +34,76 @@ func clearCertctlEnv(t *testing.T) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// setMinimalValidEnv sets the minimum env vars needed for Load() to succeed (Validate passes).
|
// setMinimalValidEnv sets the minimum env vars needed for Load() to succeed (Validate passes).
|
||||||
|
//
|
||||||
|
// HTTPS-everywhere milestone (§2.1 + §3 locked decisions): the control plane
|
||||||
|
// is TLS-only and Validate() refuses to pass without a readable cert/key pair
|
||||||
|
// on disk. setMinimalValidEnv therefore materializes a throwaway ECDSA P-256
|
||||||
|
// self-signed pair in t.TempDir() and points the two TLS env vars at it so
|
||||||
|
// every Load-based test inherits a valid HTTPS posture without each caller
|
||||||
|
// having to spell out cert generation. The temp dir is cleaned up by
|
||||||
|
// testing.T at end-of-test.
|
||||||
func setMinimalValidEnv(t *testing.T) {
|
func setMinimalValidEnv(t *testing.T) {
|
||||||
t.Helper()
|
t.Helper()
|
||||||
// api-key auth requires a secret
|
// api-key auth requires a secret
|
||||||
t.Setenv("CERTCTL_AUTH_SECRET", "test-secret-key")
|
t.Setenv("CERTCTL_AUTH_SECRET", "test-secret-key")
|
||||||
|
// HTTPS-only control plane requires a real cert/key pair on disk.
|
||||||
|
certPath, keyPath := generateTestTLSPair(t)
|
||||||
|
t.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", certPath)
|
||||||
|
t.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", keyPath)
|
||||||
|
}
|
||||||
|
|
||||||
|
// generateTestTLSPair writes an ECDSA P-256 self-signed certificate + private
// key pair to files inside t.TempDir() and returns the paths. Same shape used
// by cmd/server/tls_test.go; this duplicates the generator rather than
// importing it so the config package tests stay independent of cmd/server.
func generateTestTLSPair(t *testing.T) (certPath, keyPath string) {
	t.Helper()
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		t.Fatalf("ecdsa.GenerateKey: %v", err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "certctl-config-test"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		t.Fatalf("x509.CreateCertificate: %v", err)
	}
	dir := t.TempDir()
	certPath = filepath.Join(dir, "cert.pem")
	keyPath = filepath.Join(dir, "key.pem")
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	if err := os.WriteFile(certPath, certPEM, 0o600); err != nil {
		t.Fatalf("write cert: %v", err)
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		t.Fatalf("x509.MarshalECPrivateKey: %v", err)
	}
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	if err := os.WriteFile(keyPath, keyPEM, 0o600); err != nil {
		t.Fatalf("write key: %v", err)
	}
	return certPath, keyPath
}
|
||||||
|
|
||||||
|
// validServerConfig returns a ServerConfig with Port=8080 plus a freshly
|
||||||
|
// minted TLS cert/key pair on disk, so Validate() passes the HTTPS-only
|
||||||
|
// preflight (cert empty → stat → tls.LoadX509KeyPair round-trip). Every
|
||||||
|
// struct-based Validate test uses this so they fail for the reason they
|
||||||
|
// claim to test, not for a missing TLS pair.
|
||||||
|
func validServerConfig(t *testing.T) ServerConfig {
|
||||||
|
t.Helper()
|
||||||
|
certPath, keyPath := generateTestTLSPair(t)
|
||||||
|
return ServerConfig{
|
||||||
|
Port: 8080,
|
||||||
|
TLS: ServerTLSConfig{CertPath: certPath, KeyPath: keyPath},
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
func TestLoad_DefaultValues(t *testing.T) {
|
func TestLoad_DefaultValues(t *testing.T) {
|
||||||
@@ -135,6 +209,13 @@ func TestLoad_DefaultValues(t *testing.T) {
|
|||||||
func TestLoad_AllEnvVarsSet(t *testing.T) {
|
func TestLoad_AllEnvVarsSet(t *testing.T) {
|
||||||
clearCertctlEnv(t)
|
clearCertctlEnv(t)
|
||||||
|
|
||||||
|
// HTTPS-only control plane: Load() → Validate() refuses an empty cert path.
|
||||||
|
// Materialize a throwaway ECDSA P-256 pair and point the two TLS env vars
|
||||||
|
// at it before setting every other CERTCTL_* var this test cares about.
|
||||||
|
certPath, keyPath := generateTestTLSPair(t)
|
||||||
|
t.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", certPath)
|
||||||
|
t.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", keyPath)
|
||||||
|
|
||||||
t.Setenv("CERTCTL_SERVER_HOST", "0.0.0.0")
|
t.Setenv("CERTCTL_SERVER_HOST", "0.0.0.0")
|
||||||
t.Setenv("CERTCTL_SERVER_PORT", "9090")
|
t.Setenv("CERTCTL_SERVER_PORT", "9090")
|
||||||
t.Setenv("CERTCTL_MAX_BODY_SIZE", "2097152")
|
t.Setenv("CERTCTL_MAX_BODY_SIZE", "2097152")
|
||||||
@@ -319,7 +400,7 @@ func TestLoad_CommaSeparatedList(t *testing.T) {
|
|||||||
|
|
||||||
func TestValidate_ValidConfig(t *testing.T) {
|
func TestValidate_ValidConfig(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "info", Format: "json"},
|
Log: LogConfig{Level: "info", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "api-key", Secret: "test-secret"},
|
Auth: AuthConfig{Type: "api-key", Secret: "test-secret"},
|
||||||
@@ -329,6 +410,7 @@ func TestValidate_ValidConfig(t *testing.T) {
|
|||||||
JobProcessorInterval: 30 * time.Second,
|
JobProcessorInterval: 30 * time.Second,
|
||||||
AgentHealthCheckInterval: 2 * time.Minute,
|
AgentHealthCheckInterval: 2 * time.Minute,
|
||||||
NotificationProcessInterval: 1 * time.Minute,
|
NotificationProcessInterval: 1 * time.Minute,
|
||||||
|
NotificationRetryInterval: 2 * time.Minute,
|
||||||
RetryInterval: 5 * time.Minute,
|
RetryInterval: 5 * time.Minute,
|
||||||
JobTimeoutInterval: 10 * time.Minute,
|
JobTimeoutInterval: 10 * time.Minute,
|
||||||
AwaitingCSRTimeout: 24 * time.Hour,
|
AwaitingCSRTimeout: 24 * time.Hour,
|
||||||
@@ -342,7 +424,7 @@ func TestValidate_ValidConfig(t *testing.T) {
|
|||||||
|
|
||||||
func TestValidate_AuthTypeNone(t *testing.T) {
|
func TestValidate_AuthTypeNone(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "info", Format: "json"},
|
Log: LogConfig{Level: "info", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "none", Secret: ""},
|
Auth: AuthConfig{Type: "none", Secret: ""},
|
||||||
@@ -352,6 +434,7 @@ func TestValidate_AuthTypeNone(t *testing.T) {
|
|||||||
JobProcessorInterval: 30 * time.Second,
|
JobProcessorInterval: 30 * time.Second,
|
||||||
AgentHealthCheckInterval: 2 * time.Minute,
|
AgentHealthCheckInterval: 2 * time.Minute,
|
||||||
NotificationProcessInterval: 1 * time.Minute,
|
NotificationProcessInterval: 1 * time.Minute,
|
||||||
|
NotificationRetryInterval: 2 * time.Minute,
|
||||||
RetryInterval: 5 * time.Minute,
|
RetryInterval: 5 * time.Minute,
|
||||||
JobTimeoutInterval: 10 * time.Minute,
|
JobTimeoutInterval: 10 * time.Minute,
|
||||||
AwaitingCSRTimeout: 24 * time.Hour,
|
AwaitingCSRTimeout: 24 * time.Hour,
|
||||||
@@ -365,7 +448,7 @@ func TestValidate_AuthTypeNone(t *testing.T) {
|
|||||||
|
|
||||||
func TestValidate_InvalidAuthType(t *testing.T) {
|
func TestValidate_InvalidAuthType(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "info", Format: "json"},
|
Log: LogConfig{Level: "info", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "oauth", Secret: "key"},
|
Auth: AuthConfig{Type: "oauth", Secret: "key"},
|
||||||
@@ -384,7 +467,7 @@ func TestValidate_InvalidAuthType(t *testing.T) {
|
|||||||
|
|
||||||
func TestValidate_APIKeyAuth_MissingSecret(t *testing.T) {
|
func TestValidate_APIKeyAuth_MissingSecret(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "info", Format: "json"},
|
Log: LogConfig{Level: "info", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "api-key", Secret: ""},
|
Auth: AuthConfig{Type: "api-key", Secret: ""},
|
||||||
@@ -403,7 +486,7 @@ func TestValidate_APIKeyAuth_MissingSecret(t *testing.T) {
|
|||||||
|
|
||||||
func TestValidate_JWTAuth_MissingSecret(t *testing.T) {
|
func TestValidate_JWTAuth_MissingSecret(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "info", Format: "json"},
|
Log: LogConfig{Level: "info", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "jwt", Secret: ""},
|
Auth: AuthConfig{Type: "jwt", Secret: ""},
|
||||||
@@ -422,7 +505,7 @@ func TestValidate_JWTAuth_MissingSecret(t *testing.T) {
|
|||||||
|
|
||||||
func TestValidate_InvalidKeygenMode(t *testing.T) {
|
func TestValidate_InvalidKeygenMode(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "info", Format: "json"},
|
Log: LogConfig{Level: "info", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
||||||
@@ -470,9 +553,168 @@ func TestValidate_InvalidPort(t *testing.T) {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// TestValidate_TLSCertPathEmpty pins the first of the HTTPS-only fail-loud
// gates in Validate(): an empty CertPath must produce the operator-facing
// "server TLS cert path is required" error. Per §2.1 + §3 locked decisions,
// there is no plaintext HTTP fallback: missing TLS config is a hard startup
// refusal, not a warning.
func TestValidate_TLSCertPathEmpty(t *testing.T) {
	_, keyPath := generateTestTLSPair(t)
	cfg := &Config{
		Server: ServerConfig{
			Port: 8080,
			TLS:  ServerTLSConfig{CertPath: "", KeyPath: keyPath},
		},
		Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
		Log:      LogConfig{Level: "info", Format: "json"},
		Auth:     AuthConfig{Type: "api-key", Secret: "key"},
		Keygen:   KeygenConfig{Mode: "agent"},
		Scheduler: SchedulerConfig{
			RenewalCheckInterval:        1 * time.Hour,
			JobProcessorInterval:        30 * time.Second,
			AgentHealthCheckInterval:    2 * time.Minute,
			NotificationProcessInterval: 1 * time.Minute,
		},
	}
	err := cfg.Validate()
	if err == nil {
		t.Fatal("Validate() should return error for empty TLS cert path")
	}
	if !strings.Contains(err.Error(), "server TLS cert path is required") {
		t.Errorf("error = %q, want substring %q", err.Error(), "server TLS cert path is required")
	}
}
|
||||||
|
|
||||||
|
// TestValidate_TLSKeyPathEmpty pins the second HTTPS-only gate: empty KeyPath
// must produce the "server TLS key path is required" error. Runs with a valid
// CertPath so the cert-empty gate (which fires first) is cleanly bypassed,
// which proves the key-empty gate is actually reached.
func TestValidate_TLSKeyPathEmpty(t *testing.T) {
	certPath, _ := generateTestTLSPair(t)
	cfg := &Config{
		Server: ServerConfig{
			Port: 8080,
			TLS:  ServerTLSConfig{CertPath: certPath, KeyPath: ""},
		},
		Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
		Log:      LogConfig{Level: "info", Format: "json"},
		Auth:     AuthConfig{Type: "api-key", Secret: "key"},
		Keygen:   KeygenConfig{Mode: "agent"},
		Scheduler: SchedulerConfig{
			RenewalCheckInterval:        1 * time.Hour,
			JobProcessorInterval:        30 * time.Second,
			AgentHealthCheckInterval:    2 * time.Minute,
			NotificationProcessInterval: 1 * time.Minute,
		},
	}
	err := cfg.Validate()
	if err == nil {
		t.Fatal("Validate() should return error for empty TLS key path")
	}
	if !strings.Contains(err.Error(), "server TLS key path is required") {
		t.Errorf("error = %q, want substring %q", err.Error(), "server TLS key path is required")
	}
}
|
||||||
|
|
||||||
|
// TestValidate_TLSCertFileMissing pins the os.Stat gate on the cert path. A
// non-existent path must surface "server TLS cert file unreadable" so the
// operator sees the bad path in the error (file=%q) instead of a deferred
// ListenAndServeTLS failure after the scheduler has already fanned out.
func TestValidate_TLSCertFileMissing(t *testing.T) {
	_, keyPath := generateTestTLSPair(t)
	missingCert := filepath.Join(t.TempDir(), "does-not-exist.pem")
	cfg := &Config{
		Server: ServerConfig{
			Port: 8080,
			TLS:  ServerTLSConfig{CertPath: missingCert, KeyPath: keyPath},
		},
		Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
		Log:      LogConfig{Level: "info", Format: "json"},
		Auth:     AuthConfig{Type: "api-key", Secret: "key"},
		Keygen:   KeygenConfig{Mode: "agent"},
		Scheduler: SchedulerConfig{
			RenewalCheckInterval:        1 * time.Hour,
			JobProcessorInterval:        30 * time.Second,
			AgentHealthCheckInterval:    2 * time.Minute,
			NotificationProcessInterval: 1 * time.Minute,
		},
	}
	err := cfg.Validate()
	if err == nil {
		t.Fatal("Validate() should return error for missing TLS cert file")
	}
	if !strings.Contains(err.Error(), "server TLS cert file unreadable") {
		t.Errorf("error = %q, want substring %q", err.Error(), "server TLS cert file unreadable")
	}
}
|
||||||
|
|
||||||
|
// TestValidate_TLSKeyFileMissing pins the os.Stat gate on the key path. Uses a
// valid CertPath so the cert-missing gate does not pre-empt; proves the key
// gate is reached and reports the bad key path.
func TestValidate_TLSKeyFileMissing(t *testing.T) {
	certPath, _ := generateTestTLSPair(t)
	missingKey := filepath.Join(t.TempDir(), "does-not-exist.key")
	cfg := &Config{
		Server: ServerConfig{
			Port: 8080,
			TLS:  ServerTLSConfig{CertPath: certPath, KeyPath: missingKey},
		},
		Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
		Log:      LogConfig{Level: "info", Format: "json"},
		Auth:     AuthConfig{Type: "api-key", Secret: "key"},
		Keygen:   KeygenConfig{Mode: "agent"},
		Scheduler: SchedulerConfig{
			RenewalCheckInterval:        1 * time.Hour,
			JobProcessorInterval:        30 * time.Second,
			AgentHealthCheckInterval:    2 * time.Minute,
			NotificationProcessInterval: 1 * time.Minute,
		},
	}
	err := cfg.Validate()
	if err == nil {
		t.Fatal("Validate() should return error for missing TLS key file")
	}
	if !strings.Contains(err.Error(), "server TLS key file unreadable") {
		t.Errorf("error = %q, want substring %q", err.Error(), "server TLS key file unreadable")
	}
}
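
The gate order these Validate() tests rely on can be sketched as a standalone function. This is only an illustration: preflightTLS is a hypothetical name, the error prefixes are copied from the test assertions, and the relative order of the key-empty gate and the cert stat gate is an assumption (the tests only confirm that cert gates fire before the matching key gates).

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"os"
)

// preflightTLS is a hypothetical stand-in for the TLS portion of Validate().
// Gates run cheapest-first so the operator sees the most specific error:
// empty paths, then file existence, then the cert/key pair load.
func preflightTLS(certPath, keyPath string) error {
	if certPath == "" {
		return errors.New("server TLS cert path is required")
	}
	if keyPath == "" {
		return errors.New("server TLS key path is required")
	}
	if _, err := os.Stat(certPath); err != nil {
		return fmt.Errorf("server TLS cert file unreadable (file=%q): %w", certPath, err)
	}
	if _, err := os.Stat(keyPath); err != nil {
		return fmt.Errorf("server TLS key file unreadable (file=%q): %w", keyPath, err)
	}
	// LoadX509KeyPair parses both files and checks that the private key
	// belongs to the certificate's public key.
	if _, err := tls.LoadX509KeyPair(certPath, keyPath); err != nil {
		return fmt.Errorf("server TLS cert/key pair invalid: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(preflightTLS("", "key.pem"))
	fmt.Println(preflightTLS("cert.pem", ""))
	fmt.Println(preflightTLS("/nonexistent-certctl/cert.pem", "/nonexistent-certctl/key.pem"))
}
```

Because each gate returns early, a test that wants to reach a later gate has to supply inputs that pass every earlier one, which is why the key-path tests above start from a valid CertPath.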
|
||||||
|
|
||||||
|
// TestValidate_TLSMismatchedPair pins the tls.LoadX509KeyPair gate: the
// classic "you shipped the wrong private key" footgun. Generates two
// independent ECDSA pairs and crosses them (pair1 cert + pair2 key). Both
// files exist and parse as PEM, so os.Stat passes; only the cryptographic
// round-trip inside LoadX509KeyPair catches the mismatch.
func TestValidate_TLSMismatchedPair(t *testing.T) {
	certPath1, _ := generateTestTLSPair(t)
	_, keyPath2 := generateTestTLSPair(t)
	cfg := &Config{
		Server: ServerConfig{
			Port: 8080,
			TLS:  ServerTLSConfig{CertPath: certPath1, KeyPath: keyPath2},
		},
		Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
		Log:      LogConfig{Level: "info", Format: "json"},
		Auth:     AuthConfig{Type: "api-key", Secret: "key"},
		Keygen:   KeygenConfig{Mode: "agent"},
		Scheduler: SchedulerConfig{
			RenewalCheckInterval:        1 * time.Hour,
			JobProcessorInterval:        30 * time.Second,
			AgentHealthCheckInterval:    2 * time.Minute,
			NotificationProcessInterval: 1 * time.Minute,
		},
	}
	err := cfg.Validate()
	if err == nil {
		t.Fatal("Validate() should return error for mismatched TLS cert/key pair")
	}
	if !strings.Contains(err.Error(), "server TLS cert/key pair invalid") {
		t.Errorf("error = %q, want substring %q", err.Error(), "server TLS cert/key pair invalid")
	}
}
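
The crossed-pair check can also be reproduced without touching disk. The sketch below is an assumption-laden illustration, not the project's code: it swaps tls.LoadX509KeyPair for its in-memory sibling tls.X509KeyPair, and newPair and pairMatches are hypothetical helper names.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// newPair returns a PEM-encoded self-signed certificate and its EC private
// key, mirroring the shape of generateTestTLSPair but kept in memory.
func newPair() (certPEM, keyPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "mismatch-sketch"},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return certPEM, keyPEM, nil
}

// pairMatches reports whether the key belongs to the certificate.
// tls.X509KeyPair compares the certificate's public key with the private key.
func pairMatches(certPEM, keyPEM []byte) bool {
	_, err := tls.X509KeyPair(certPEM, keyPEM)
	return err == nil
}

func main() {
	cert1, key1, err := newPair()
	if err != nil {
		panic(err)
	}
	_, key2, err := newPair()
	if err != nil {
		panic(err)
	}
	fmt.Println("matched pair loads:", pairMatches(cert1, key1))
	fmt.Println("crossed pair loads:", pairMatches(cert1, key2))
}
```

Both PEM blobs in the crossed case are individually well-formed, so only the pair load can detect the problem; a plain existence check on the files would not.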
|
||||||
|
|
||||||
func TestValidate_EmptyDatabaseURL(t *testing.T) {
|
func TestValidate_EmptyDatabaseURL(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "info", Format: "json"},
|
Log: LogConfig{Level: "info", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
||||||
@@ -491,7 +733,7 @@ func TestValidate_EmptyDatabaseURL(t *testing.T) {
|
|||||||
|
|
||||||
func TestValidate_InvalidLogLevel(t *testing.T) {
|
func TestValidate_InvalidLogLevel(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "verbose", Format: "json"},
|
Log: LogConfig{Level: "verbose", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
||||||
@@ -510,7 +752,7 @@ func TestValidate_InvalidLogLevel(t *testing.T) {
|
|||||||
|
|
||||||
func TestValidate_InvalidLogFormat(t *testing.T) {
|
func TestValidate_InvalidLogFormat(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "info", Format: "yaml"},
|
Log: LogConfig{Level: "info", Format: "yaml"},
|
||||||
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
||||||
@@ -572,7 +814,7 @@ func TestValidate_SchedulerIntervalTooSmall(t *testing.T) {
|
|||||||
for _, tt := range tests {
|
for _, tt := range tests {
|
||||||
t.Run(tt.name, func(t *testing.T) {
|
t.Run(tt.name, func(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "info", Format: "json"},
|
Log: LogConfig{Level: "info", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
||||||
@@ -588,7 +830,7 @@ func TestValidate_SchedulerIntervalTooSmall(t *testing.T) {
|
|||||||
|
|
||||||
func TestValidate_DatabaseMaxConnectionsZero(t *testing.T) {
|
func TestValidate_DatabaseMaxConnectionsZero(t *testing.T) {
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 0},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 0},
|
||||||
Log: LogConfig{Level: "info", Format: "json"},
|
Log: LogConfig{Level: "info", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
Auth: AuthConfig{Type: "api-key", Secret: "key"},
|
||||||
@@ -795,7 +1037,7 @@ func TestConfig_Scheduler_JobTimeoutValidation(t *testing.T) {
|
|||||||
// Start from a fully valid config so the I-003 timeout checks
|
// Start from a fully valid config so the I-003 timeout checks
|
||||||
// are the only potential failure point.
|
// are the only potential failure point.
|
||||||
cfg := &Config{
|
cfg := &Config{
|
||||||
Server: ServerConfig{Port: 8080},
|
Server: validServerConfig(t),
|
||||||
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
|
||||||
Log: LogConfig{Level: "info", Format: "json"},
|
Log: LogConfig{Level: "info", Format: "json"},
|
||||||
Auth: AuthConfig{Type: "api-key", Secret: "test-secret"},
|
Auth: AuthConfig{Type: "api-key", Secret: "test-secret"},
|
||||||
@@ -805,6 +1047,7 @@ func TestConfig_Scheduler_JobTimeoutValidation(t *testing.T) {
|
|||||||
JobProcessorInterval: 1 * time.Minute,
|
JobProcessorInterval: 1 * time.Minute,
|
||||||
AgentHealthCheckInterval: 1 * time.Minute,
|
AgentHealthCheckInterval: 1 * time.Minute,
|
||||||
NotificationProcessInterval: 1 * time.Minute,
|
NotificationProcessInterval: 1 * time.Minute,
|
||||||
|
NotificationRetryInterval: 2 * time.Minute,
|
||||||
RetryInterval: 1 * time.Minute,
|
RetryInterval: 1 * time.Minute,
|
||||||
JobTimeoutInterval: 10 * time.Minute,
|
JobTimeoutInterval: 10 * time.Minute,
|
||||||
AwaitingCSRTimeout: 24 * time.Hour,
|
AwaitingCSRTimeout: 24 * time.Hour,
|
||||||
|
|||||||
@@ -5,6 +5,15 @@ import (
|
|||||||
)
|
)
|
||||||
|
|
||||||
// NotificationEvent records a notification sent to users about certificate events.
|
// NotificationEvent records a notification sent to users about certificate events.
|
||||||
|
//
|
||||||
|
// I-005 extends the event with a retry counter, a nullable next-retry timestamp
|
||||||
|
// that drives the retry-sweep partial index, and a nullable last-error string
|
||||||
|
// preserving the most recent transient failure so operators triaging the dead
|
||||||
|
// letter queue can see *why* a notification died without chasing server logs.
|
||||||
|
// Status stays a plain `string` (not retyped to NotificationStatus) because the
|
||||||
|
// repo layer materialises it directly from PostgreSQL's VARCHAR column and the
|
||||||
|
// service layer compares against the NotificationStatus* constants via
|
||||||
|
// `string(...)` casts at call sites — see service.RetryFailedNotifications.
|
||||||
type NotificationEvent struct {
|
type NotificationEvent struct {
|
||||||
ID string `json:"id"`
|
ID string `json:"id"`
|
||||||
Type NotificationType `json:"type"`
|
Type NotificationType `json:"type"`
|
||||||
@@ -15,9 +24,37 @@ type NotificationEvent struct {
|
|||||||
SentAt *time.Time `json:"sent_at,omitempty"`
|
SentAt *time.Time `json:"sent_at,omitempty"`
|
||||||
Status string `json:"status"`
|
Status string `json:"status"`
|
||||||
Error *string `json:"error,omitempty"`
|
Error *string `json:"error,omitempty"`
|
||||||
|
RetryCount int `json:"retry_count"`
|
||||||
|
NextRetryAt *time.Time `json:"next_retry_at,omitempty"`
|
||||||
|
LastError *string `json:"last_error,omitempty"`
|
||||||
CreatedAt time.Time `json:"created_at"`
|
CreatedAt time.Time `json:"created_at"`
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// NotificationStatus is the string-backed type for the lifecycle status of a
// NotificationEvent. It mirrors the VARCHAR(50) column on notification_events
// and the status values used by the I-005 retry/DLQ machinery.
//
// Status transitions:
//
//	pending → sent              (delivery succeeded)
//	pending → failed → pending  (transient failure, re-armed by retry sweep)
//	pending → failed → dead     (retry_count reached max_attempts; DLQ)
//	pending → read              (operator acknowledged, no delivery needed)
//
// Values are lowercase to match the pre-I-005 on-wire representation used by
// existing UpdateStatus calls and the seed_demo.sql fixtures; retyping
// NotificationEvent.Status to NotificationStatus would be a breaking DB scan
// change, so the type is kept additive and consumed via `string(const)` casts.
type NotificationStatus string

const (
	NotificationStatusPending NotificationStatus = "pending"
	NotificationStatusSent    NotificationStatus = "sent"
	NotificationStatusFailed  NotificationStatus = "failed"
	NotificationStatusDead    NotificationStatus = "dead"
	NotificationStatusRead    NotificationStatus = "read"
)
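
The documented status transitions can be expressed as a small lookup table. This is a minimal sketch under stated assumptions: allowedTransitions and canTransition are hypothetical names that do not appear in the diff, and the table covers only the transitions listed in the comment (it does not model the operator Requeue path).

```go
package main

import "fmt"

// NotificationStatus and its constants are redeclared here so the sketch
// compiles on its own; the values match the diff.
type NotificationStatus string

const (
	NotificationStatusPending NotificationStatus = "pending"
	NotificationStatusSent    NotificationStatus = "sent"
	NotificationStatusFailed  NotificationStatus = "failed"
	NotificationStatusDead    NotificationStatus = "dead"
	NotificationStatusRead    NotificationStatus = "read"
)

// allowedTransitions is a hypothetical table of the documented edges:
// pending → sent | failed | read, and failed → pending | dead.
var allowedTransitions = map[NotificationStatus][]NotificationStatus{
	NotificationStatusPending: {NotificationStatusSent, NotificationStatusFailed, NotificationStatusRead},
	NotificationStatusFailed:  {NotificationStatusPending, NotificationStatusDead},
}

// canTransition reports whether the edge from → to is in the table.
func canTransition(from, to NotificationStatus) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(NotificationStatusPending, NotificationStatusSent))
	fmt.Println(canTransition(NotificationStatusFailed, NotificationStatusDead))
	fmt.Println(canTransition(NotificationStatusSent, NotificationStatusPending))
}
```

Since NotificationEvent.Status stays a plain string, a caller would convert with NotificationStatus(event.Status) before consulting such a table.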
|
||||||
|
|
||||||
// NotificationType represents the event that triggered a notification.
|
// NotificationType represents the event that triggered a notification.
|
||||||
type NotificationType string
|
type NotificationType string
|
||||||
|
|
||||||
|
|||||||
@@ -1,6 +1,9 @@
|
|||||||
package domain
|
package domain
|
||||||
|
|
||||||
import "testing"
|
import (
|
||||||
|
"testing"
|
||||||
|
"time"
|
||||||
|
)
|
||||||
|
|
||||||
func TestNotificationType_Constants(t *testing.T) {
|
func TestNotificationType_Constants(t *testing.T) {
|
||||||
tests := map[string]NotificationType{
|
tests := map[string]NotificationType{
|
||||||
@@ -71,3 +74,54 @@ func TestNotificationEvent_Fields(t *testing.T) {
|
|||||||
t.Errorf("expected error 'failed to send', got %v", event.Error)
|
t.Errorf("expected error 'failed to send', got %v", event.Error)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// TestNotificationStatus_Constants verifies that I-005 introduces a typed
// NotificationStatus alongside canonical lowercase string constants covering
// the pending → sent, pending → failed → dead, and pending → read transitions.
// The Red signal here is a compile error: the type and the NotificationStatusDead
// constant do not exist before Phase 2 Green.
func TestNotificationStatus_Constants(t *testing.T) {
	tests := map[string]NotificationStatus{
		"pending": NotificationStatusPending,
		"sent":    NotificationStatusSent,
		"failed":  NotificationStatusFailed,
		"dead":    NotificationStatusDead,
		"read":    NotificationStatusRead,
	}
	for expected, got := range tests {
		if string(got) != expected {
			t.Errorf("expected %q, got %q", expected, string(got))
		}
	}
}
|
||||||
|
|
||||||
|
// TestNotificationEvent_RetryFields verifies the I-005 retry/DLQ columns are
// surfaced on the domain model: a RetryCount counter, a nullable NextRetryAt
// timestamp used by the retry-sweep partial index, and a nullable LastError
// string preserving the most recent transient failure for operator triage.
// The Red signal is a compile error: these fields do not exist yet.
func TestNotificationEvent_RetryFields(t *testing.T) {
	next := time.Now().Add(2 * time.Minute)
	lastErr := "connection refused"
	event := &NotificationEvent{
		ID:          "notif-retry-001",
		Type:        NotificationTypeExpirationWarning,
		Channel:     NotificationChannelWebhook,
		Recipient:   "https://hooks.example.com/certs",
		Message:     "retry me",
		Status:      string(NotificationStatusFailed),
		RetryCount:  3,
		NextRetryAt: &next,
		LastError:   &lastErr,
	}

	if event.RetryCount != 3 {
		t.Errorf("expected RetryCount 3, got %d", event.RetryCount)
	}
	if event.NextRetryAt == nil || !event.NextRetryAt.Equal(next) {
		t.Errorf("expected NextRetryAt %v, got %v", next, event.NextRetryAt)
	}
	if event.LastError == nil || *event.LastError != "connection refused" {
		t.Errorf("expected LastError 'connection refused', got %v", event.LastError)
	}
}
|
||||||
|
|||||||
@@ -103,25 +103,25 @@ func TestCertificateLifecycle(t *testing.T) {
|
|||||||
// Create router and register handlers
|
// Create router and register handlers
|
||||||
r := router.New()
|
r := router.New()
|
||||||
r.RegisterHandlers(router.HandlerRegistry{
|
r.RegisterHandlers(router.HandlerRegistry{
|
||||||
Certificates: certificateHandler,
|
Certificates: certificateHandler,
|
||||||
Issuers: issuerHandler,
|
Issuers: issuerHandler,
|
||||||
Targets: targetHandler,
|
Targets: targetHandler,
|
||||||
Agents: agentHandler,
|
Agents: agentHandler,
|
||||||
Jobs: jobHandler,
|
Jobs: jobHandler,
|
||||||
Policies: policyHandler,
|
Policies: policyHandler,
|
||||||
Profiles: profileHandler,
|
Profiles: profileHandler,
|
||||||
Teams: teamHandler,
|
Teams: teamHandler,
|
||||||
Owners: ownerHandler,
|
Owners: ownerHandler,
|
||||||
AgentGroups: agentGroupHandler,
|
AgentGroups: agentGroupHandler,
|
||||||
Audit: auditHandler,
|
Audit: auditHandler,
|
||||||
Notifications: notificationHandler,
|
Notifications: notificationHandler,
|
||||||
Stats: statsHandler,
|
Stats: statsHandler,
|
||||||
Metrics: metricsHandler,
|
Metrics: metricsHandler,
|
||||||
Health: healthHandler,
|
Health: healthHandler,
|
||||||
Discovery: discoveryHandler,
|
Discovery: discoveryHandler,
|
||||||
NetworkScan: networkScanHandler,
|
NetworkScan: networkScanHandler,
|
||||||
Verification: verificationHandler,
|
Verification: verificationHandler,
|
||||||
BulkRevocation: handler.BulkRevocationHandler{},
|
BulkRevocation: handler.BulkRevocationHandler{},
|
||||||
})
|
})
|
||||||
r.RegisterESTHandlers(estHandler)
|
r.RegisterESTHandlers(estHandler)
|
||||||
|
|
||||||
@@ -1022,6 +1022,46 @@ func (m *mockNotificationRepository) UpdateStatus(ctx context.Context, id string
|
|||||||
return fmt.Errorf("notification not found")
|
return fmt.Errorf("notification not found")
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// I-005: retry/DLQ interface satisfiers. The integration tests in this package
// drive the end-to-end lifecycle against a NotificationService which requires
// the full repository.NotificationRepository interface, but none of the
// lifecycle scenarios exercise the retry sweep or dead-letter transitions;
// they're covered by unit tests in internal/service/notification_test.go. So
// these are deliberate no-op / panic-free stubs whose only job is to satisfy
// the compile-time interface contract. If a future integration test needs
// real retry semantics, promote this mock to match internal/service's
// mockNotifRepo (testutil_test.go:410) one-for-one.

func (m *mockNotificationRepository) ListRetryEligible(ctx context.Context, now time.Time, maxAttempts, limit int) ([]*domain.NotificationEvent, error) {
	return nil, nil
}

func (m *mockNotificationRepository) RecordFailedAttempt(ctx context.Context, id string, lastError string, nextRetryAt time.Time) error {
	return nil
}

func (m *mockNotificationRepository) MarkAsDead(ctx context.Context, id string, lastError string) error {
	return nil
}

func (m *mockNotificationRepository) Requeue(ctx context.Context, id string) error {
	return nil
}

// CountByStatus satisfies the NotificationRepository interface contract added
// by I-005 Phase 2 Green. Counts in-memory rows so StatsService wiring exercised
// by the lifecycle integration tests gets a truthful count even though the
// retry/DLQ surface isn't driven here.
func (m *mockNotificationRepository) CountByStatus(ctx context.Context, status string) (int64, error) {
	var count int64
	for _, n := range m.notifications {
		if n.Status == status {
			count++
		}
	}
	return count, nil
}
|
||||||
|
|
||||||
type mockPolicyRepository struct {
|
type mockPolicyRepository struct {
|
||||||
rules map[string]*domain.PolicyRule
|
rules map[string]*domain.PolicyRule
|
||||||
violations []*domain.PolicyViolation
|
violations []*domain.PolicyViolation
|
||||||
|
|||||||
@@ -2,11 +2,14 @@ package mcp
|
|||||||
|
|
||||||
import (
|
import (
|
||||||
"bytes"
|
"bytes"
|
||||||
|
"crypto/tls"
|
||||||
|
"crypto/x509"
|
||||||
"encoding/json"
|
"encoding/json"
|
||||||
"fmt"
|
"fmt"
|
||||||
"io"
|
"io"
|
||||||
"net/http"
|
"net/http"
|
||||||
"net/url"
|
"net/url"
|
||||||
|
"os"
|
||||||
"time"
|
"time"
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -18,15 +21,45 @@ type Client struct {
|
|||||||
httpClient *http.Client
|
httpClient *http.Client
|
||||||
}
|
}
|
||||||
|
|
||||||
// NewClient creates a new certctl API client.
|
// NewClient creates a new certctl API client. The control plane is HTTPS-only
|
||||||
func NewClient(baseURL, apiKey string) *Client {
|
// as of v2.2, so the transport is pinned to TLS 1.3 and optionally loads a
|
||||||
|
// PEM-encoded CA bundle from caBundlePath (empty means "trust the system
|
||||||
|
// roots"). The insecure flag disables certificate verification and is a
|
||||||
|
// dev-only opt-in documented in docs/tls.md — it must never be set in
|
||||||
|
// production. Returns an error if the CA bundle path is non-empty but the
|
||||||
|
// file is missing or contains no valid PEM-encoded certificates, so the
|
||||||
|
// caller can fail loud before any network call.
|
||||||
|
func NewClient(baseURL, apiKey, caBundlePath string, insecure bool) (*Client, error) {
|
||||||
|
tlsConfig := &tls.Config{
|
||||||
|
MinVersion: tls.VersionTLS13,
|
||||||
|
InsecureSkipVerify: insecure, //nolint:gosec // opt-in dev toggle, documented in docs/tls.md
|
||||||
|
}
|
||||||
|
if caBundlePath != "" {
|
||||||
|
pemBytes, err := os.ReadFile(caBundlePath)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("reading CA bundle at %q: %w", caBundlePath, err)
|
||||||
|
}
|
||||||
|
pool := x509.NewCertPool()
|
||||||
|
if !pool.AppendCertsFromPEM(pemBytes) {
|
||||||
|
return nil, fmt.Errorf("CA bundle at %q contains no valid PEM-encoded certificates", caBundlePath)
|
||||||
|
}
|
||||||
|
tlsConfig.RootCAs = pool
|
||||||
|
}
|
||||||
return &Client{
|
return &Client{
|
||||||
baseURL: baseURL,
|
baseURL: baseURL,
|
||||||
apiKey: apiKey,
|
apiKey: apiKey,
|
||||||
httpClient: &http.Client{
|
httpClient: &http.Client{
|
||||||
Timeout: 30 * time.Second,
|
Timeout: 30 * time.Second,
|
||||||
|
Transport: &http.Transport{
|
||||||
|
TLSClientConfig: tlsConfig,
|
||||||
|
ForceAttemptHTTP2: true,
|
||||||
|
MaxIdleConns: 10,
|
||||||
|
IdleConnTimeout: 90 * time.Second,
|
||||||
|
TLSHandshakeTimeout: 10 * time.Second,
|
||||||
|
ExpectContinueTimeout: 1 * time.Second,
|
||||||
|
},
|
||||||
},
|
},
|
||||||
}
|
}, nil
|
||||||
}
|
}
|
||||||
|
|
||||||
// Get performs an HTTP GET and returns the raw JSON response body.
|
// Get performs an HTTP GET and returns the raw JSON response body.
|
||||||
|
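Review note: the new NewClient signature pushes two decisions to the caller, namely where the CA bundle path comes from and how the insecure toggle is parsed. The sketch below shows one way a binary might wire those from the environment and fail before any network call. It is an illustration under assumptions: `buildTLSConfig` is a standalone copy of the TLS setup in this diff, `parseInsecureFlag` is a hypothetical helper, and `CERTCTL_SERVER_CA_BUNDLE_PATH` is an invented variable name. Only `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY` is mentioned in this change set, and how the project actually parses it is not shown here.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
	"strconv"
)

// buildTLSConfig mirrors the shape of the TLS setup in the diff above:
// a TLS 1.3 floor, an optional PEM CA bundle, and an opt-in insecure toggle.
func buildTLSConfig(caBundlePath string, insecure bool) (*tls.Config, error) {
	cfg := &tls.Config{
		MinVersion:         tls.VersionTLS13,
		InsecureSkipVerify: insecure, // dev-only toggle; never set in production
	}
	if caBundlePath == "" {
		return cfg, nil // nil RootCAs means "trust the system roots"
	}
	pemBytes, err := os.ReadFile(caBundlePath)
	if err != nil {
		return nil, fmt.Errorf("reading CA bundle at %q: %w", caBundlePath, err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pemBytes) {
		return nil, fmt.Errorf("CA bundle at %q contains no valid PEM-encoded certificates", caBundlePath)
	}
	cfg.RootCAs = pool
	return cfg, nil
}

// parseInsecureFlag is a hypothetical strict parser: an unset value means
// false, and anything strconv.ParseBool rejects is an error rather than a
// silent default.
func parseInsecureFlag(raw string) (bool, error) {
	if raw == "" {
		return false, nil
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("parsing insecure flag %q: %w", raw, err)
	}
	return v, nil
}

func main() {
	insecure, err := parseInsecureFlag(os.Getenv("CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY"))
	if err != nil {
		fmt.Fprintln(os.Stderr, "fatal:", err)
		os.Exit(1)
	}
	// CERTCTL_SERVER_CA_BUNDLE_PATH is an assumed name for illustration only.
	cfg, err := buildTLSConfig(os.Getenv("CERTCTL_SERVER_CA_BUNDLE_PATH"), insecure)
	if err != nil {
		fmt.Fprintln(os.Stderr, "fatal:", err)
		os.Exit(1)
	}
	fmt.Printf("custom roots=%v insecure=%v\n", cfg.RootCAs != nil, cfg.InsecureSkipVerify)
}
```

Treating an unparseable flag value as an error, instead of defaulting to false, keeps a typo from silently changing verification behavior in either direction.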
|||||||
+248
-15
@@ -1,17 +1,30 @@
|
|||||||
package mcp
|
package mcp
|
||||||
|
|
||||||
import (
|
import (
|
||||||
|
"crypto/rand"
|
||||||
|
"crypto/rsa"
|
||||||
|
"crypto/tls"
|
||||||
|
"crypto/x509"
|
||||||
|
"crypto/x509/pkix"
|
||||||
"encoding/json"
|
"encoding/json"
|
||||||
|
"encoding/pem"
|
||||||
"io"
|
"io"
|
||||||
|
"math/big"
|
||||||
"net/http"
|
"net/http"
|
||||||
"net/http/httptest"
|
"net/http/httptest"
|
||||||
|
"os"
|
||||||
|
"path/filepath"
|
||||||
"testing"
|
"testing"
|
||||||
|
"time"
|
||||||
)
|
)
|
||||||
|
|
||||||
func TestNewClient(t *testing.T) {
|
func TestNewClient(t *testing.T) {
|
||||||
c := NewClient("http://localhost:8443", "test-key")
|
c, err := NewClient("https://localhost:8443", "test-key", "", false)
|
||||||
if c.baseURL != "http://localhost:8443" {
|
if err != nil {
|
||||||
t.Errorf("expected baseURL http://localhost:8443, got %s", c.baseURL)
|
t.Fatalf("NewClient err=%v want nil", err)
|
||||||
|
}
|
||||||
|
if c.baseURL != "https://localhost:8443" {
|
||||||
|
t.Errorf("expected baseURL https://localhost:8443, got %s", c.baseURL)
|
||||||
}
|
}
|
||||||
if c.apiKey != "test-key" {
|
if c.apiKey != "test-key" {
|
||||||
t.Errorf("expected apiKey test-key, got %s", c.apiKey)
|
t.Errorf("expected apiKey test-key, got %s", c.apiKey)
|
||||||
@@ -44,7 +57,7 @@ func TestClient_Get(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
data, err := c.Get("/api/v1/certificates", map[string][]string{"status": {"Active"}})
|
data, err := c.Get("/api/v1/certificates", map[string][]string{"status": {"Active"}})
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("unexpected error: %v", err)
|
t.Fatalf("unexpected error: %v", err)
|
||||||
@@ -64,7 +77,7 @@ func TestClient_Get_NoAuth(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "")
|
c, _ := NewClient(server.URL, "", "", false)
|
||||||
_, err := c.Get("/api/v1/certificates", nil)
|
_, err := c.Get("/api/v1/certificates", nil)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("unexpected error: %v", err)
|
t.Fatalf("unexpected error: %v", err)
|
||||||
@@ -95,7 +108,7 @@ func TestClient_Post(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
data, err := c.Post("/api/v1/certificates", map[string]string{"name": "test-cert"})
|
data, err := c.Post("/api/v1/certificates", map[string]string{"name": "test-cert"})
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("unexpected error: %v", err)
|
t.Fatalf("unexpected error: %v", err)
|
||||||
@@ -120,7 +133,7 @@ func TestClient_Put(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
data, err := c.Put("/api/v1/certificates/mc-test", map[string]string{"name": "updated"})
|
data, err := c.Put("/api/v1/certificates/mc-test", map[string]string{"name": "updated"})
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("unexpected error: %v", err)
|
t.Fatalf("unexpected error: %v", err)
|
||||||
@@ -139,7 +152,7 @@ func TestClient_Delete_204(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
data, err := c.Delete("/api/v1/certificates/mc-test")
|
data, err := c.Delete("/api/v1/certificates/mc-test")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("unexpected error: %v", err)
|
t.Fatalf("unexpected error: %v", err)
|
||||||
@@ -161,7 +174,7 @@ func TestClient_ErrorResponse(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
_, err := c.Get("/api/v1/certificates/nonexistent", nil)
|
_, err := c.Get("/api/v1/certificates/nonexistent", nil)
|
||||||
if err == nil {
|
if err == nil {
|
||||||
t.Fatal("expected error for 404 response")
|
t.Fatal("expected error for 404 response")
|
||||||
@@ -179,7 +192,7 @@ func TestClient_ServerError(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
_, err := c.Post("/api/v1/certificates", map[string]string{"name": "test"})
|
_, err := c.Post("/api/v1/certificates", map[string]string{"name": "test"})
|
||||||
if err == nil {
|
if err == nil {
|
||||||
t.Fatal("expected error for 500 response")
|
t.Fatal("expected error for 500 response")
|
||||||
@@ -202,7 +215,7 @@ func TestClient_GetRaw(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
data, contentType, err := c.GetRaw("/.well-known/pki/crl/iss-local")
|
data, contentType, err := c.GetRaw("/.well-known/pki/crl/iss-local")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("unexpected error: %v", err)
|
t.Fatalf("unexpected error: %v", err)
|
||||||
@@ -222,7 +235,7 @@ func TestClient_GetRaw_Error(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
_, _, err := c.GetRaw("/.well-known/pki/crl/nonexistent")
|
_, _, err := c.GetRaw("/.well-known/pki/crl/nonexistent")
|
||||||
if err == nil {
|
if err == nil {
|
||||||
t.Fatal("expected error for 404 response")
|
t.Fatal("expected error for 404 response")
|
||||||
@@ -230,7 +243,7 @@ func TestClient_GetRaw_Error(t *testing.T) {
|
|||||||
}
|
}
|
||||||
|
|
||||||
func TestClient_ConnectionRefused(t *testing.T) {
|
func TestClient_ConnectionRefused(t *testing.T) {
|
||||||
c := NewClient("http://localhost:1", "test-key")
|
c, _ := NewClient("https://localhost:1", "test-key", "", false)
|
||||||
_, err := c.Get("/api/v1/certificates", nil)
|
_, err := c.Get("/api/v1/certificates", nil)
|
||||||
if err == nil {
|
if err == nil {
|
||||||
t.Fatal("expected error for connection refused")
|
t.Fatal("expected error for connection refused")
|
||||||
@@ -247,7 +260,7 @@ func TestClient_PostNilBody(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
data, err := c.Post("/api/v1/certificates/mc-test/renew", nil)
|
data, err := c.Post("/api/v1/certificates/mc-test/renew", nil)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("unexpected error: %v", err)
|
t.Fatalf("unexpected error: %v", err)
|
||||||
@@ -270,7 +283,7 @@ func TestClient_QueryParams(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
q := paginationQuery(2, 10)
|
q := paginationQuery(2, 10)
|
||||||
_, err := c.Get("/api/v1/certificates", q)
|
_, err := c.Get("/api/v1/certificates", q)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
@@ -287,3 +300,223 @@ func containsStr(s, substr string) bool {
|
|||||||
}
|
}
|
||||||
return false
|
return false
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// generateTestCert produces a short-lived self-signed RSA-2048 certificate for
|
||||||
|
// tests that need a PEM-encodable cert. Mirrors the helper used in
|
||||||
|
// internal/cli/client_test.go so the two packages pin the same HTTPS-Everywhere
|
||||||
|
// TLS-wiring contract against matching test fixtures.
|
||||||
|
func generateTestCert() *x509.Certificate {
|
||||||
|
now := time.Now()
|
||||||
|
template := &x509.Certificate{
|
||||||
|
SerialNumber: big.NewInt(1),
|
||||||
|
Subject: pkix.Name{
|
||||||
|
CommonName: "test.certctl.local",
|
||||||
|
},
|
||||||
|
NotBefore: now,
|
||||||
|
NotAfter: now.Add(365 * 24 * time.Hour),
|
||||||
|
KeyUsage: x509.KeyUsageKeyEncipherment | x509.KeyUsageDigitalSignature,
|
||||||
|
ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
|
||||||
|
BasicConstraintsValid: true,
|
||||||
|
DNSNames: []string{"test.certctl.local"},
|
||||||
|
}
|
||||||
|
|
||||||
|
privateKey, _ := rsa.GenerateKey(rand.Reader, 2048)
|
||||||
|
certBytes, _ := x509.CreateCertificate(rand.Reader, template, template, &privateKey.PublicKey, privateKey)
|
||||||
|
cert, _ := x509.ParseCertificate(certBytes)
|
||||||
|
return cert
|
||||||
|
}
|
||||||
|
|
||||||
|
// -----------------------------------------------------------------------------
|
||||||
|
// HTTPS-Everywhere milestone (v2.2, §3.2 + §7 Phase 5):
|
||||||
|
// The MCP server binary talks HTTPS-only to the certctl control plane. These
|
||||||
|
// tests pin the three contracts every client binary (agent, CLI, MCP) must
|
||||||
|
// satisfy in lock-step:
|
||||||
|
// (a) CA bundle load success — PEM loads, RootCAs + MinVersion=TLS1.3 wired
|
||||||
|
// through the injected *http.Transport so the httpClient actually uses
|
||||||
|
// them on the wire, not just in the struct.
|
||||||
|
// (b) CA bundle load failure — missing file and malformed/empty PEM each fail
|
||||||
|
// loud with a pinned substring so operators get a useful diagnostic.
|
||||||
|
// (c) End-to-end TLS round-trip — an httptest.NewTLSServer whose own cert is
|
||||||
|
// written out as the CA bundle validates that every TLS-config knob
|
||||||
|
// actually flows into the dialer.
|
||||||
|
// The substrings below must stay in sync with internal/mcp/client.go:NewClient;
|
||||||
|
// drifting them in isolation is exactly what this suite is here to catch.
|
||||||
|
// -----------------------------------------------------------------------------
|
||||||
|
|
||||||
|
// writeCABundle PEM-encodes a DER cert and writes it to a temp file under the
|
||||||
|
// test's own TempDir. Returns the absolute path for piping into NewClient.
|
||||||
|
func writeCABundle(t *testing.T, dir string, certDER []byte, filename string) string {
|
||||||
|
t.Helper()
|
||||||
|
path := filepath.Join(dir, filename)
|
||||||
|
pemBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: certDER})
|
||||||
|
if err := os.WriteFile(path, pemBytes, 0o600); err != nil {
|
||||||
|
t.Fatalf("writing CA bundle to %q: %v", path, err)
|
||||||
|
}
|
||||||
|
return path
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewClient_CABundle_Success pins the happy path: a valid PEM CA bundle
|
||||||
|
// loads, populates RootCAs on the client's TLS config, and leaves
|
||||||
|
// MinVersion=TLS1.3 intact. Regression guard for any future edit that
|
||||||
|
// accidentally swaps the transport or detaches *tls.Config from *http.Transport.
|
||||||
|
func TestNewClient_CABundle_Success(t *testing.T) {
|
||||||
|
cert := generateTestCert()
|
||||||
|
tmp := t.TempDir()
|
||||||
|
bundlePath := writeCABundle(t, tmp, cert.Raw, "ca.pem")
|
||||||
|
|
||||||
|
client, err := NewClient("https://certctl-server:8443", "test-key", bundlePath, false)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("NewClient with valid CA bundle err=%v want nil", err)
|
||||||
|
}
|
||||||
|
if client == nil {
|
||||||
|
t.Fatal("NewClient returned nil client on happy path")
|
||||||
|
}
|
||||||
|
|
||||||
|
transport, ok := client.httpClient.Transport.(*http.Transport)
|
||||||
|
if !ok {
|
||||||
|
t.Fatalf("httpClient.Transport type=%T want *http.Transport (TLS config injection broke)", client.httpClient.Transport)
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig == nil {
|
||||||
|
t.Fatal("transport.TLSClientConfig is nil; TLS config must be set on every client")
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig.RootCAs == nil {
|
||||||
|
t.Fatal("transport.TLSClientConfig.RootCAs is nil; CA bundle path was ignored")
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig.MinVersion != tls.VersionTLS13 {
|
||||||
|
t.Errorf("MinVersion=%d want tls.VersionTLS13 (%d); HTTPS-Everywhere requires TLS1.3 floor",
|
||||||
|
transport.TLSClientConfig.MinVersion, tls.VersionTLS13)
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig.InsecureSkipVerify {
|
||||||
|
t.Error("InsecureSkipVerify=true with insecure=false arg; flag wiring crossed")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewClient_CABundle_MissingFile pins the fail-loud path for a nonexistent
|
||||||
|
// bundle path. The error surface must include "reading CA bundle" so operators
|
||||||
|
// see the right diagnostic instead of a downstream TLS-handshake-error.
|
||||||
|
func TestNewClient_CABundle_MissingFile(t *testing.T) {
|
||||||
|
_, err := NewClient("https://certctl-server:8443", "test-key", "/nonexistent/path/ca.pem", false)
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("NewClient with missing CA bundle err=nil; must fail loud so operators see the right diagnostic")
|
||||||
|
}
|
||||||
|
if !containsStr(err.Error(), "reading CA bundle") {
|
||||||
|
t.Errorf("err=%q must contain %q so operators can locate the misconfigured path", err.Error(), "reading CA bundle")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewClient_CABundle_EmptyPEM pins the fail-loud path for a file whose
|
||||||
|
// contents are not valid PEM. AppendCertsFromPEM returning false is the signal
|
||||||
|
// we need to surface — otherwise the client would silently ship with an empty
|
||||||
|
// cert pool and every TLS handshake would fail downstream.
|
||||||
|
func TestNewClient_CABundle_EmptyPEM(t *testing.T) {
|
||||||
|
tmp := t.TempDir()
|
||||||
|
garbagePath := filepath.Join(tmp, "garbage.pem")
|
||||||
|
if err := os.WriteFile(garbagePath, []byte("not a pem certificate, just bytes"), 0o600); err != nil {
|
||||||
|
t.Fatalf("writing garbage file: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
_, err := NewClient("https://certctl-server:8443", "test-key", garbagePath, false)
|
||||||
|
if err == nil {
|
||||||
|
t.Fatal("NewClient with malformed PEM err=nil; must fail loud, not silently skip")
|
||||||
|
}
|
||||||
|
if !containsStr(err.Error(), "no valid PEM-encoded certificates") {
|
||||||
|
t.Errorf("err=%q must contain %q so operators know the file parsed but held no certs",
|
||||||
|
err.Error(), "no valid PEM-encoded certificates")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewClient_TLSRoundTrip validates that the TLS config knobs we set on
|
||||||
|
// NewClient actually reach the wire. An httptest.NewTLSServer signs its own
|
||||||
|
// self-signed leaf; we PEM-encode that server cert, write it as the CA bundle,
|
||||||
|
// and issue a real HTTPS GET via c.Get. A successful round-trip proves RootCAs
|
||||||
|
// + MinVersion are flowing through *http.Transport into the dialer, not just
|
||||||
|
// surviving into the client struct.
|
||||||
|
func TestNewClient_TLSRoundTrip(t *testing.T) {
|
||||||
|
var handlerHit int
|
||||||
|
server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||||
|
if r.Method == http.MethodGet && r.URL.Path == "/api/v1/certificates" {
|
||||||
|
handlerHit++
|
||||||
|
w.Header().Set("Content-Type", "application/json")
|
||||||
|
_ = json.NewEncoder(w).Encode(map[string]interface{}{
|
||||||
|
"data": []interface{}{},
|
||||||
|
"total": 0,
|
||||||
|
})
|
||||||
|
return
|
||||||
|
}
|
||||||
|
w.WriteHeader(http.StatusNotFound)
|
||||||
|
}))
|
||||||
|
defer server.Close()
|
||||||
|
|
||||||
|
serverCert := server.Certificate()
|
||||||
|
if serverCert == nil {
|
||||||
|
t.Fatal("httptest.NewTLSServer.Certificate() returned nil; cannot build CA bundle")
|
||||||
|
}
|
||||||
|
|
||||||
|
tmp := t.TempDir()
|
||||||
|
bundlePath := writeCABundle(t, tmp, serverCert.Raw, "server-ca.pem")
|
||||||
|
|
||||||
|
client, err := NewClient(server.URL, "test-key", bundlePath, false)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("NewClient(TLS server) err=%v want nil", err)
|
||||||
|
}
|
||||||
|
data, err := client.Get("/api/v1/certificates", nil)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("Get over HTTPS err=%v; TLS config must reach the wire", err)
|
||||||
|
}
|
||||||
|
if data == nil {
|
||||||
|
t.Fatal("Get over HTTPS returned nil data; want non-empty JSON body")
|
||||||
|
}
|
||||||
|
if handlerHit != 1 {
|
||||||
|
t.Errorf("handlerHit=%d want 1; request did not reach the TLS server", handlerHit)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNewClient_InsecureSkipVerify pins the dev-only escape hatch: an untrusted
|
||||||
|
// TLS server (cert NOT in the client's root pool) must be reachable when
|
||||||
|
// insecure=true. This is the only path in the control plane that disables
|
||||||
|
// certificate verification; it's documented in docs/tls.md and gated by the
|
||||||
|
// CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY env var so it never slips into
|
||||||
|
// production silently.
|
||||||
|
func TestNewClient_InsecureSkipVerify(t *testing.T) {
|
||||||
|
var handlerHit int
|
||||||
|
server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
|
||||||
|
handlerHit++
|
||||||
|
w.Header().Set("Content-Type", "application/json")
|
||||||
|
_ = json.NewEncoder(w).Encode(map[string]interface{}{
|
||||||
|
"data": []interface{}{},
|
||||||
|
"total": 0,
|
||||||
|
})
|
||||||
|
}))
|
||||||
|
defer server.Close()
|
||||||
|
|
||||||
|
// No CA bundle → system roots, which will NOT trust the self-signed
|
||||||
|
// httptest cert. insecure=true is the only thing keeping this call from
|
||||||
|
// failing with an x509-unknown-authority error.
|
||||||
|
client, err := NewClient(server.URL, "test-key", "", true)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("NewClient(insecure=true) err=%v want nil", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
transport, ok := client.httpClient.Transport.(*http.Transport)
|
||||||
|
if !ok {
|
||||||
|
t.Fatalf("httpClient.Transport type=%T want *http.Transport", client.httpClient.Transport)
|
||||||
|
}
|
||||||
|
if !transport.TLSClientConfig.InsecureSkipVerify {
|
||||||
|
t.Fatal("insecure=true arg did not set TLSClientConfig.InsecureSkipVerify; flag wiring broken")
|
||||||
|
}
|
||||||
|
if transport.TLSClientConfig.MinVersion != tls.VersionTLS13 {
|
||||||
|
t.Errorf("MinVersion=%d want tls.VersionTLS13 even with insecure=true (TLS1.3 floor is not optional)",
|
||||||
|
transport.TLSClientConfig.MinVersion)
|
||||||
|
}
|
||||||
|
|
||||||
|
data, err := client.Get("/api/v1/certificates", nil)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("Get(insecure=true) err=%v; escape hatch must still complete the round-trip", err)
|
||||||
|
}
|
||||||
|
if data == nil {
|
||||||
|
t.Fatal("Get(insecure=true) returned nil data; want non-empty JSON body")
|
||||||
|
}
|
||||||
|
if handlerHit != 1 {
|
||||||
|
t.Errorf("handlerHit=%d want 1; insecure round-trip did not reach the server", handlerHit)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
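Review note: most call sites in this test file were migrated to `c, _ := NewClient(...)`, which discards the new error return. With an empty caBundlePath the constructor shown in this diff cannot fail, so the tests still behave, but a small helper would keep them from dereferencing a nil client if that ever changes. The following is a minimal sketch, assuming hypothetical names throughout: `client` and `newClient` stand in for the package's Client and NewClient, `failer` is the subset of `*testing.T` the helper needs, and `recorder` is a fake used only to exercise it outside a test binary.

```go
package main

import (
	"errors"
	"fmt"
)

// failer is the subset of *testing.T the helper relies on; *testing.T
// satisfies it, so a real test would pass t directly.
type failer interface {
	Helper()
	Fatalf(format string, args ...any)
}

// client and newClient are hypothetical stand-ins for the package's Client
// and NewClient. The stand-in fails on any non-empty bundle path.
type client struct{ baseURL string }

func newClient(baseURL, apiKey, caBundlePath string, insecure bool) (*client, error) {
	if caBundlePath != "" {
		return nil, errors.New("reading CA bundle: stand-in rejects every path")
	}
	return &client{baseURL: baseURL}, nil
}

// mustNewClient builds a client and fails the test immediately on error,
// instead of letting a nil client reach the assertions.
func mustNewClient(t failer, baseURL, apiKey, caBundlePath string, insecure bool) *client {
	t.Helper()
	c, err := newClient(baseURL, apiKey, caBundlePath, insecure)
	if err != nil {
		t.Fatalf("NewClient(%q) err=%v want nil", baseURL, err)
	}
	return c
}

// recorder is a fake failer that records failures instead of stopping.
type recorder struct{ failures []string }

func (r *recorder) Helper() {}

func (r *recorder) Fatalf(format string, args ...any) {
	r.failures = append(r.failures, fmt.Sprintf(format, args...))
}

func main() {
	r := &recorder{}
	c := mustNewClient(r, "https://localhost:8443", "test-key", "", false)
	fmt.Println("client built:", c != nil, "failures:", len(r.failures))
}
```

In a real test, `t.Fatalf` stops the test goroutine, so the nil return after a failure is never observed. The fake here keeps running only so the failure can be inspected.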
|||||||
@@ -44,7 +44,7 @@ func TestClient_DeleteWithQuery_ForceRetire(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "test-key")
|
c, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
// Compile-fail until Phase 2b grows Client.DeleteWithQuery. Passing the
|
// Compile-fail until Phase 2b grows Client.DeleteWithQuery. Passing the
|
||||||
// query as a url.Values is the established pattern (matches Get's shape).
|
// query as a url.Values is the established pattern (matches Get's shape).
|
||||||
query := url.Values{}
|
query := url.Values{}
|
||||||
@@ -87,7 +87,7 @@ func TestClient_DeleteWithQuery_NoQuery(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "")
|
c, _ := NewClient(server.URL, "", "", false)
|
||||||
if _, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil); err != nil {
|
if _, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil); err != nil {
|
||||||
t.Fatalf("DeleteWithQuery(nil query) err=%v want nil", err)
|
t.Fatalf("DeleteWithQuery(nil query) err=%v want nil", err)
|
||||||
}
|
}
|
||||||
@@ -108,7 +108,7 @@ func TestClient_DeleteWithQuery_204ReturnsMinimalBody(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "")
|
c, _ := NewClient(server.URL, "", "", false)
|
||||||
data, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil)
|
data, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("DeleteWithQuery(204) err=%v want nil (idempotent)", err)
|
t.Fatalf("DeleteWithQuery(204) err=%v want nil (idempotent)", err)
|
||||||
@@ -141,7 +141,7 @@ func TestClient_DeleteWithQuery_409PropagatesError(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
c := NewClient(server.URL, "")
|
c, _ := NewClient(server.URL, "", "", false)
|
||||||
_, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil)
|
_, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil)
|
||||||
if err == nil {
|
if err == nil {
|
||||||
t.Fatalf("DeleteWithQuery(409) err=nil; 409 must propagate as Go error")
|
t.Fatalf("DeleteWithQuery(409) err=nil; 409 must propagate as Go error")
|
||||||
|
|||||||
+22
-3
@@ -974,9 +974,13 @@ func registerAuditTools(s *gomcp.Server, c *Client) {
|
|||||||
func registerNotificationTools(s *gomcp.Server, c *Client) {
|
func registerNotificationTools(s *gomcp.Server, c *Client) {
|
||||||
gomcp.AddTool(s, &gomcp.Tool{
|
gomcp.AddTool(s, &gomcp.Tool{
|
||||||
Name: "certctl_list_notifications",
|
Name: "certctl_list_notifications",
|
||||||
Description: "List notification events (expiration warnings, renewal/deployment results, policy violations, revocations).",
|
Description: "List notification events (expiration warnings, renewal/deployment results, policy violations, revocations). Optional status filter supports the I-005 Dead letter tab (status=dead).",
|
||||||
}, func(ctx context.Context, req *gomcp.CallToolRequest, input ListParams) (*gomcp.CallToolResult, any, error) {
|
}, func(ctx context.Context, req *gomcp.CallToolRequest, input ListNotificationsInput) (*gomcp.CallToolResult, any, error) {
|
||||||
data, err := c.Get("/api/v1/notifications", paginationQuery(input.Page, input.PerPage))
|
q := paginationQuery(input.Page, input.PerPage)
|
||||||
|
if input.Status != "" {
|
||||||
|
q.Set("status", input.Status)
|
||||||
|
}
|
||||||
|
data, err := c.Get("/api/v1/notifications", q)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
return errorResult(err)
|
return errorResult(err)
|
||||||
}
|
}
|
||||||
@@ -1004,6 +1008,21 @@ func registerNotificationTools(s *gomcp.Server, c *Client) {
|
|||||||
}
|
}
|
||||||
return textResult(data)
|
return textResult(data)
|
||||||
})
|
})
|
||||||
|
|
||||||
|
// I-005: requeue a dead-letter notification. Flips status from 'dead'
|
||||||
|
// back to 'pending' and clears next_retry_at so the retry sweep picks
|
||||||
|
// the notification up on its next tick. Operator-triggered; the tool
|
||||||
|
// is the MCP counterpart of the GUI's Dead letter tab "Requeue" button.
|
||||||
|
gomcp.AddTool(s, &gomcp.Tool{
|
||||||
|
Name: "certctl_requeue_notification",
|
||||||
|
Description: "Requeue a dead notification back to pending so the retry sweep can deliver it again. Used to recover from persistent delivery failures after the underlying issue (SMTP config, webhook endpoint, etc.) has been fixed.",
|
||||||
|
}, func(ctx context.Context, req *gomcp.CallToolRequest, input GetByIDInput) (*gomcp.CallToolResult, any, error) {
|
||||||
|
data, err := c.Post("/api/v1/notifications/"+input.ID+"/requeue", nil)
|
||||||
|
if err != nil {
|
||||||
|
return errorResult(err)
|
||||||
|
}
|
||||||
|
return textResult(data)
|
||||||
|
})
|
||||||
}
|
}
|
||||||
|
|
||||||
// ── Stats ───────────────────────────────────────────────────────────
|
// ── Stats ───────────────────────────────────────────────────────────
|
||||||
|
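Review note: the new requeue tool builds its path by concatenating `input.ID` directly into the URL. An ID containing `/`, `?`, or spaces would change which route is hit. One possible hardening is to escape the segment first. This is a sketch, not the project's code: `requeuePath` is a hypothetical helper, the empty-ID check is an added assumption, and whether the Client's request builder preserves an already-escaped path is not shown in this diff and would need checking.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// requeuePath builds the requeue endpoint path for a notification ID,
// escaping the ID so it always occupies exactly one path segment.
func requeuePath(id string) (string, error) {
	if id == "" {
		return "", errors.New("notification id must not be empty")
	}
	return "/api/v1/notifications/" + url.PathEscape(id) + "/requeue", nil
}

func main() {
	for _, id := range []string{"n-1", "a/b", ""} {
		p, err := requeuePath(id)
		if err != nil {
			fmt.Printf("id=%q error: %v\n", id, err)
			continue
		}
		fmt.Printf("id=%q path=%s\n", id, p)
	}
}
```

The same concern applies to any handler that interpolates a caller-supplied ID into a path, so a shared helper may be preferable to fixing this one call site.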
|||||||
+10
-10
@@ -88,7 +88,7 @@ func TestRegisterTools_ToolCount(t *testing.T) {
|
|||||||
api := mockCertctlAPI(log)
|
api := mockCertctlAPI(log)
|
||||||
defer api.Close()
|
defer api.Close()
|
||||||
|
|
||||||
client := NewClient(api.URL, "test-key")
|
client, _ := NewClient(api.URL, "test-key", "", false)
|
||||||
RegisterTools(server, client)
|
RegisterTools(server, client)
|
||||||
|
|
||||||
// The server should have tools registered — we can verify by listing them
|
// The server should have tools registered — we can verify by listing them
|
||||||
@@ -166,7 +166,7 @@ func TestToolEndToEnd_ListCertificates(t *testing.T) {
|
|||||||
api := mockCertctlAPI(log)
|
api := mockCertctlAPI(log)
|
||||||
defer api.Close()
|
defer api.Close()
|
||||||
|
|
||||||
client := NewClient(api.URL, "test-key")
|
client, _ := NewClient(api.URL, "test-key", "", false)
|
||||||
|
|
||||||
// Manually call the handler logic that would be registered as a tool
|
// Manually call the handler logic that would be registered as a tool
|
||||||
q := paginationQuery(1, 50)
|
q := paginationQuery(1, 50)
|
||||||
@@ -204,7 +204,7 @@ func TestToolEndToEnd_CreateCertificate(t *testing.T) {
|
|||||||
api := mockCertctlAPI(log)
|
api := mockCertctlAPI(log)
|
||||||
defer api.Close()
|
defer api.Close()
|
||||||
|
|
||||||
client := NewClient(api.URL, "test-key")
|
client, _ := NewClient(api.URL, "test-key", "", false)
|
||||||
|
|
||||||
input := CreateCertificateInput{
|
input := CreateCertificateInput{
|
||||||
Name: "API Production",
|
Name: "API Production",
|
||||||
@@ -244,7 +244,7 @@ func TestToolEndToEnd_TriggerRenewal(t *testing.T) {
|
|||||||
api := mockCertctlAPI(log)
|
api := mockCertctlAPI(log)
|
||||||
defer api.Close()
|
defer api.Close()
|
||||||
|
|
||||||
client := NewClient(api.URL, "test-key")
|
client, _ := NewClient(api.URL, "test-key", "", false)
|
||||||
data, err := client.Post("/api/v1/certificates/mc-api-prod/renew", nil)
|
data, err := client.Post("/api/v1/certificates/mc-api-prod/renew", nil)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("unexpected error: %v", err)
|
t.Fatalf("unexpected error: %v", err)
|
||||||
@@ -272,7 +272,7 @@ func TestToolEndToEnd_DeleteTarget(t *testing.T) {
|
|||||||
api := mockCertctlAPI(log)
|
api := mockCertctlAPI(log)
|
||||||
defer api.Close()
|
defer api.Close()
|
||||||
|
|
||||||
client := NewClient(api.URL, "test-key")
|
client, _ := NewClient(api.URL, "test-key", "", false)
|
||||||
data, err := client.Delete("/api/v1/targets/t-platform")
|
data, err := client.Delete("/api/v1/targets/t-platform")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("unexpected error: %v", err)
|
t.Fatalf("unexpected error: %v", err)
|
||||||
@@ -300,7 +300,7 @@ func TestToolEndToEnd_RevokeCertificate(t *testing.T) {
|
|||||||
api := mockCertctlAPI(log)
|
api := mockCertctlAPI(log)
|
||||||
defer api.Close()
|
defer api.Close()
|
||||||
|
|
||||||
client := NewClient(api.URL, "test-key")
|
client, _ := NewClient(api.URL, "test-key", "", false)
|
||||||
input := RevokeCertificateInput{
|
input := RevokeCertificateInput{
|
||||||
ID: "mc-api-prod",
|
ID: "mc-api-prod",
|
||||||
Reason: "keyCompromise",
|
Reason: "keyCompromise",
|
||||||
@@ -327,7 +327,7 @@ func TestToolEndToEnd_AgentHeartbeat(t *testing.T) {
|
|||||||
api := mockCertctlAPI(log)
|
api := mockCertctlAPI(log)
|
||||||
defer api.Close()
|
defer api.Close()
|
||||||
|
|
||||||
client := NewClient(api.URL, "test-key")
|
client, _ := NewClient(api.URL, "test-key", "", false)
|
||||||
_, err := client.Post("/api/v1/agents/agent-001/heartbeat", map[string]string{
|
_, err := client.Post("/api/v1/agents/agent-001/heartbeat", map[string]string{
|
||||||
"os": "linux",
|
"os": "linux",
|
||||||
"architecture": "amd64",
|
"architecture": "amd64",
|
||||||
@@ -347,7 +347,7 @@ func TestToolEndToEnd_ListWithFilters(t *testing.T) {
|
|||||||
api := mockCertctlAPI(log)
|
api := mockCertctlAPI(log)
|
||||||
defer api.Close()
|
defer api.Close()
|
||||||
|
|
||||||
client := NewClient(api.URL, "test-key")
|
client, _ := NewClient(api.URL, "test-key", "", false)
|
||||||
q := paginationQuery(1, 25)
|
q := paginationQuery(1, 25)
|
||||||
q.Set("status", "Pending")
|
q.Set("status", "Pending")
|
||||||
q.Set("type", "Renewal")
|
q.Set("type", "Renewal")
|
||||||
@@ -377,7 +377,7 @@ func TestToolEndToEnd_GetRawBinary(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "test-key")
|
client, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
data, ct, err := client.GetRaw("/.well-known/pki/crl/iss-local")
|
data, ct, err := client.GetRaw("/.well-known/pki/crl/iss-local")
|
||||||
if err != nil {
|
if err != nil {
|
||||||
t.Fatalf("unexpected error: %v", err)
|
t.Fatalf("unexpected error: %v", err)
|
||||||
@@ -397,7 +397,7 @@ func TestToolEndToEnd_ErrorPropagation(t *testing.T) {
|
|||||||
}))
|
}))
|
||||||
defer server.Close()
|
defer server.Close()
|
||||||
|
|
||||||
client := NewClient(server.URL, "test-key")
|
client, _ := NewClient(server.URL, "test-key", "", false)
|
||||||
_, err := client.Get("/api/v1/certificates", nil)
|
_, err := client.Get("/api/v1/certificates", nil)
|
||||||
if err == nil {
|
if err == nil {
|
||||||
t.Fatal("expected error for 403 response")
|
t.Fatal("expected error for 403 response")
|
||||||
|
|||||||
@@ -182,6 +182,16 @@ type RejectJobInput struct {
|
|||||||
Reason string `json:"reason,omitempty" jsonschema:"Reason for rejection"`
|
Reason string `json:"reason,omitempty" jsonschema:"Reason for rejection"`
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ── Notifications ───────────────────────────────────────────────────
|
||||||
|
|
||||||
|
// ListNotificationsInput adds the I-005 status filter on top of the standard
|
||||||
|
// pagination params. Status="dead" drives the Dead letter tab use case;
|
||||||
|
// empty status preserves the pre-I-005 list-all behavior.
|
||||||
|
type ListNotificationsInput struct {
|
||||||
|
ListParams
|
||||||
|
Status string `json:"status,omitempty" jsonschema:"Filter by status: pending, sent, failed, dead, read"`
|
||||||
|
}
|
||||||
|
|
||||||
// ── Policies ────────────────────────────────────────────────────────
|
// ── Policies ────────────────────────────────────────────────────────
|
||||||
|
|
||||||
type CreatePolicyInput struct {
|
type CreatePolicyInput struct {
|
||||||
|
|||||||
@@ -285,6 +285,12 @@ type AuditRepository interface {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// NotificationRepository defines operations for managing notifications.
|
// NotificationRepository defines operations for managing notifications.
|
||||||
|
//
|
||||||
|
// I-005 extends the interface with four retry/DLQ methods. The retry scheduler
|
||||||
|
// loop calls ListRetryEligible on every tick to pull overdue failed rows, then
|
||||||
|
// either RecordFailedAttempt (still-retrying) or MarkAsDead (exhausted). The
|
||||||
|
// operator-facing dead-letter tab calls Requeue to move a row from 'dead' (or
|
||||||
|
// 'failed') back to 'pending' so ProcessPendingNotifications picks it up again.
|
||||||
type NotificationRepository interface {
|
type NotificationRepository interface {
|
||||||
// Create stores a new notification.
|
// Create stores a new notification.
|
||||||
Create(ctx context.Context, notif *domain.NotificationEvent) error
|
Create(ctx context.Context, notif *domain.NotificationEvent) error
|
||||||
@@ -292,6 +298,44 @@ type NotificationRepository interface {
|
|||||||
List(ctx context.Context, filter *NotificationFilter) ([]*domain.NotificationEvent, error)
|
List(ctx context.Context, filter *NotificationFilter) ([]*domain.NotificationEvent, error)
|
||||||
// UpdateStatus updates a notification's delivery status.
|
// UpdateStatus updates a notification's delivery status.
|
||||||
UpdateStatus(ctx context.Context, id string, status string, sentAt time.Time) error
|
UpdateStatus(ctx context.Context, id string, status string, sentAt time.Time) error
|
||||||
|
// ListRetryEligible returns failed notification rows whose next_retry_at
|
||||||
|
// is <= now AND retry_count < maxAttempts, ordered by next_retry_at ASC
|
||||||
|
// (oldest overdue first — same fairness as I-001's RetryFailedJobs). The
|
||||||
|
// WHERE clause mirrors the partial retry-sweep index predicate from
|
||||||
|
// migration 000016 so the planner uses it. A limit<=0 is normalised to
|
||||||
|
// a sane default in the repo implementation to avoid accidental unbounded
|
||||||
|
// sweeps. I-005 coverage-gap closure.
|
||||||
|
ListRetryEligible(ctx context.Context, now time.Time, maxAttempts, limit int) ([]*domain.NotificationEvent, error)
|
||||||
|
// RecordFailedAttempt is called by the retry sweep after a notifier.Send
|
||||||
|
// transient failure. The UPDATE increments retry_count by exactly 1,
|
||||||
|
// overwrites last_error, overwrites next_retry_at, and KEEPS status='failed'
|
||||||
|
// so the row remains a candidate for ListRetryEligible on the next sweep.
|
||||||
|
// Returns "not found" when no row matches the id (mirrors UpdateStatus).
|
||||||
|
// I-005 coverage-gap closure.
|
||||||
|
RecordFailedAttempt(ctx context.Context, id string, lastError string, nextRetryAt time.Time) error
|
||||||
|
// MarkAsDead performs the DLQ transition when retry_count reaches
|
||||||
|
// max_attempts. Flips status='dead', clears next_retry_at so the partial
|
||||||
|
// retry-sweep index drops the row, writes the final last_error, and
|
||||||
|
// PRESERVES retry_count as historical evidence of how many attempts were
|
||||||
|
// burned. Returns "not found" when no row matches.
|
||||||
|
// I-005 coverage-gap closure.
|
||||||
|
MarkAsDead(ctx context.Context, id string, lastError string) error
|
||||||
|
// Requeue is the operator "try again" action from the UI's Dead letter
|
||||||
|
// tab. Flips status='pending' (so ProcessPendingNotifications picks it
|
||||||
|
// up), resets retry_count to 0 (otherwise the operator's first retry
|
||||||
|
// would already be at hour-long waits), clears next_retry_at, and clears
|
||||||
|
// last_error. Valid from both 'dead' and 'failed'. Returns "not found"
|
||||||
|
// when no row matches. I-005 coverage-gap closure.
|
||||||
|
Requeue(ctx context.Context, id string) error
|
||||||
|
// CountByStatus returns the number of notification_events rows whose
|
||||||
|
// status column matches the given string exactly. Used by StatsService
|
||||||
|
// to populate DashboardSummary.NotificationsDead which in turn drives
|
||||||
|
// the Prometheus counter certctl_notification_dead_total (I-005 Phase 2
|
||||||
|
// observability gate). A dedicated SQL COUNT(*) is used instead of
|
||||||
|
// List(filter{Status: ...}) because List silently resets PerPage>500 to
|
||||||
|
// 50 — a latent scale bug for any status-filtered count. I-005
|
||||||
|
// coverage-gap closure.
|
||||||
|
CountByStatus(ctx context.Context, status string) (int64, error)
|
||||||
}
|
}
|
||||||
|
|
||||||
// TeamRepository defines operations for managing teams.
|
// TeamRepository defines operations for managing teams.
|
||||||
|
|||||||
@@ -0,0 +1,256 @@
|
|||||||
|
package postgres_test
|
||||||
|
|
||||||
|
import (
|
||||||
|
"context"
|
||||||
|
"database/sql"
|
||||||
|
"strings"
|
||||||
|
"testing"
|
||||||
|
)
|
||||||
|
|
||||||
|
// TestMigration000016_NotificationRetryRoundTrip is the Phase 1 Red regression
|
||||||
|
// test for I-005 ("failed webhook/email drops critical alerts — no retry, no
|
||||||
|
// DLQ, no escalation"). The fix depends on a new migration,
|
||||||
|
// 000016_notification_retry.up.sql + .down.sql, which must:
|
||||||
|
//
|
||||||
|
// 1. Add `retry_count INTEGER NOT NULL DEFAULT 0` on notification_events.
|
||||||
|
// Mirrors migration 000015's column-nullability pattern: explicit
|
||||||
|
// NOT NULL + default so existing rows backfill cleanly and the service
|
||||||
|
// layer never has to nil-check the counter. The 0 default is what lets
|
||||||
|
// the retry scheduler promote a row from failed → pending on its very
|
||||||
|
// first sweep without a bespoke backfill.
|
||||||
|
//
|
||||||
|
// 2. Add `next_retry_at TIMESTAMPTZ` (nullable) on notification_events.
|
||||||
|
// Populated by the service layer on every failed→pending transition
|
||||||
|
// using exponential backoff (2^retry_count minutes, cap 1h). Nullable
|
||||||
|
// because the field is only meaningful while a row sits in 'failed'
|
||||||
|
// state; 'sent', 'pending', 'dead', and 'read' rows leave it NULL.
|
||||||
|
//
|
||||||
|
// 3. Add `last_error TEXT` (nullable) on notification_events. TEXT
|
||||||
|
// (not VARCHAR(N)) because notifier errors can include full HTTP
|
||||||
|
// response bodies, TLS handshake diagnostics, or stringified stack
|
||||||
|
// traces. Truncation here would kick the operator back to the server
|
||||||
|
// log, which is exactly the triage pain I-005 is meant to eliminate.
|
||||||
|
//
|
||||||
|
// 4. Create the partial retry-sweep index
|
||||||
|
// `idx_notification_events_retry_sweep ON notification_events(next_retry_at)
|
||||||
|
// WHERE status = 'failed' AND next_retry_at IS NOT NULL`.
|
||||||
|
// The predicate keeps the index tiny in a healthy fleet — only failed
|
||||||
|
// rows scheduled for retry participate; sent/pending/dead/read rows and
|
||||||
|
// unscheduled failures are excluded. Makes the retry sweep in
|
||||||
|
// RetryFailedNotifications O(retry-eligible) rather than O(total-events).
|
||||||
|
//
|
||||||
|
// The round-trip also validates that the down migration cleanly reverses all
|
||||||
|
// four schema additions, so an operator who lands on a rollback can still
|
||||||
|
// boot the server. Stage 4 asserts idempotency — the up migration must be
|
||||||
|
// safely re-runnable after a partial rollback, which requires ADD COLUMN
|
||||||
|
// IF NOT EXISTS and CREATE INDEX IF NOT EXISTS on every new object.
|
||||||
|
//
|
||||||
|
// Red-until-Green: this test compiles but fails until
|
||||||
|
// migrations/000016_notification_retry.up.sql + .down.sql exist with the
|
||||||
|
// right schema, because freshSchema(t) runs every `.up.sql` in lexical order
|
||||||
|
// — the new migration runs automatically once Phase 2 creates the files.
|
||||||
|
func TestMigration000016_NotificationRetryRoundTrip(t *testing.T) {
|
||||||
|
tdb := getTestDB(t)
|
||||||
|
db := tdb.freshSchema(t)
|
||||||
|
ctx := context.Background()
|
||||||
|
|
||||||
|
// ─── Stage 1: Post-up assertions ─────────────────────────────────────
|
||||||
|
//
|
||||||
|
// After every .up.sql migration (including the new 000016) has run, the
|
||||||
|
// three new columns and the partial retry-sweep index must be observable
|
||||||
|
// in the catalog.
|
||||||
|
|
||||||
|
// All three retry columns must be present on notification_events.
|
||||||
|
assertColumnExists(t, db, "notification_events", "retry_count")
|
||||||
|
assertColumnExists(t, db, "notification_events", "next_retry_at")
|
||||||
|
assertColumnExists(t, db, "notification_events", "last_error")
|
||||||
|
|
||||||
|
// retry_count must be NOT NULL with a server-side default of 0. The
|
||||||
|
// scheduler's failed→pending transition relies on reading the counter
|
||||||
|
// without a COALESCE, and the back-fill on existing rows must be
|
||||||
|
// deterministic; 0 is the only safe default for an attempt counter.
|
||||||
|
assertColumnNotNull(t, db, "notification_events", "retry_count", true)
|
||||||
|
assertColumnDefaultContains(t, db, "notification_events", "retry_count", "0")
|
||||||
|
|
||||||
|
// next_retry_at and last_error are nullable by design; see items 2 and 3
|
||||||
|
// of the function doc comment for why. A NOT NULL constraint here would force the
|
||||||
|
// service layer to write sentinel values on every terminal-status
|
||||||
|
// transition, which is worse than just leaving them NULL.
|
||||||
|
assertColumnNotNull(t, db, "notification_events", "next_retry_at", false)
|
||||||
|
assertColumnNotNull(t, db, "notification_events", "last_error", false)
|
||||||
|
|
||||||
|
// The partial retry-sweep index must exist on notification_events and
|
||||||
|
// must include the WHERE predicate that restricts it to failed+scheduled
|
||||||
|
// rows. Without the predicate the index is merely an index on
|
||||||
|
// next_retry_at — correct semantics, but it would balloon in a busy
|
||||||
|
// fleet because every sent/read row would sit in it with a NULL key.
|
||||||
|
assertIndexExists(t, db, "idx_notification_events_retry_sweep")
|
||||||
|
assertIndexPredicateContains(t, db, "idx_notification_events_retry_sweep", "status = 'failed'")
|
||||||
|
assertIndexPredicateContains(t, db, "idx_notification_events_retry_sweep", "next_retry_at IS NOT NULL")
|
||||||
|
|
||||||
|
// ─── Stage 2: Run the 000016 down migration manually ─────────────────
|
||||||
|
//
|
||||||
|
// testutil_test.go's runMigrations helper only runs *.up.sql. To exercise
|
||||||
|
// the down migration I read and execute it by hand, then re-check the
|
||||||
|
// catalog.
|
||||||
|
|
||||||
|
downSQL := readMigrationFile(t, "000016_notification_retry.down.sql")
|
||||||
|
if _, err := db.ExecContext(ctx, downSQL); err != nil {
|
||||||
|
t.Fatalf("000016 down migration failed: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Stage 3: Post-down assertions — all three columns removed, partial
|
||||||
|
// index dropped.
|
||||||
|
assertColumnGone(t, db, "notification_events", "retry_count")
|
||||||
|
assertColumnGone(t, db, "notification_events", "next_retry_at")
|
||||||
|
assertColumnGone(t, db, "notification_events", "last_error")
|
||||||
|
assertIndexGone(t, db, "idx_notification_events_retry_sweep")
|
||||||
|
|
||||||
|
// ─── Stage 4: Re-run the up migration for idempotency ────────────────
|
||||||
|
//
|
||||||
|
// The up migration must be safely re-runnable — operators sometimes
|
||||||
|
// re-apply by hand after a partial rollback. Use ADD COLUMN IF NOT
|
||||||
|
// EXISTS and CREATE INDEX IF NOT EXISTS so every converging run is a
|
||||||
|
// no-op.
|
||||||
|
|
||||||
|
upSQL := readMigrationFile(t, "000016_notification_retry.up.sql")
|
||||||
|
if _, err := db.ExecContext(ctx, upSQL); err != nil {
|
||||||
|
t.Fatalf("000016 up migration re-apply failed (must be idempotent): %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
assertColumnExists(t, db, "notification_events", "retry_count")
|
||||||
|
assertColumnExists(t, db, "notification_events", "next_retry_at")
|
||||||
|
assertColumnExists(t, db, "notification_events", "last_error")
|
||||||
|
assertIndexExists(t, db, "idx_notification_events_retry_sweep")
|
||||||
|
}
|
||||||
|
|
||||||
|
// ─── Extra catalog helpers for 000016 ─────────────────────────────────────
|
||||||
|
//
|
||||||
|
// These are additive to the column-existence and FK helpers defined in
|
||||||
|
// migration_000015_test.go. Both files live in `package postgres_test`, so
|
||||||
|
// assertColumnExists / assertColumnGone / readMigrationFile are already in
|
||||||
|
// scope from the 000015 test file and must not be redeclared.
|
||||||
|
|
||||||
|
// assertColumnNotNull asserts that the information_schema reports the
|
||||||
|
// expected nullability for a column. PG exposes `is_nullable` as the string
|
||||||
|
// 'YES' or 'NO'; we translate to a bool so the call site reads cleanly.
|
||||||
|
func assertColumnNotNull(t *testing.T, db *sql.DB, table, column string, wantNotNull bool) {
|
||||||
|
t.Helper()
|
||||||
|
var isNullable string
|
||||||
|
err := db.QueryRowContext(context.Background(), `
|
||||||
|
SELECT is_nullable
|
||||||
|
FROM information_schema.columns
|
||||||
|
WHERE table_schema = current_schema()
|
||||||
|
AND table_name = $1
|
||||||
|
AND column_name = $2
|
||||||
|
`, table, column).Scan(&isNullable)
|
||||||
|
if err == sql.ErrNoRows {
|
||||||
|
t.Fatalf("column %s.%s not found in current_schema (migration missing?)", table, column)
|
||||||
|
}
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("is_nullable lookup for %s.%s failed: %v", table, column, err)
|
||||||
|
}
|
||||||
|
gotNotNull := isNullable == "NO"
|
||||||
|
if gotNotNull != wantNotNull {
|
||||||
|
t.Errorf("column %s.%s nullability: got NOT NULL=%v, want NOT NULL=%v (is_nullable=%q)",
|
||||||
|
table, column, gotNotNull, wantNotNull, isNullable)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// assertColumnDefaultContains asserts that the server-side DEFAULT clause for
|
||||||
|
// a column contains the expected substring. Postgres can render defaults in
|
||||||
|
// a few different normalized shapes (`0`, `(0)::integer`, `0::integer`),
|
||||||
|
// so substring matching is more robust than exact equality here.
|
||||||
|
func assertColumnDefaultContains(t *testing.T, db *sql.DB, table, column, wantSubstr string) {
|
||||||
|
t.Helper()
|
||||||
|
var columnDefault sql.NullString
|
||||||
|
err := db.QueryRowContext(context.Background(), `
|
||||||
|
SELECT column_default
|
||||||
|
FROM information_schema.columns
|
||||||
|
WHERE table_schema = current_schema()
|
||||||
|
AND table_name = $1
|
||||||
|
AND column_name = $2
|
||||||
|
`, table, column).Scan(&columnDefault)
|
||||||
|
if err == sql.ErrNoRows {
|
||||||
|
t.Fatalf("column %s.%s not found in current_schema (migration missing?)", table, column)
|
||||||
|
}
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("column_default lookup for %s.%s failed: %v", table, column, err)
|
||||||
|
}
|
||||||
|
if !columnDefault.Valid {
|
||||||
|
t.Errorf("column %s.%s has no DEFAULT clause; want substring %q", table, column, wantSubstr)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
if !strings.Contains(columnDefault.String, wantSubstr) {
|
||||||
|
t.Errorf("column %s.%s DEFAULT = %q; want substring %q",
|
||||||
|
table, column, columnDefault.String, wantSubstr)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// assertIndexExists asserts that a named index exists in the current schema.
|
||||||
|
// Scoped via pg_indexes.schemaname = current_schema() so schema-per-test
|
||||||
|
// isolation holds.
|
||||||
|
func assertIndexExists(t *testing.T, db *sql.DB, indexName string) {
|
||||||
|
t.Helper()
|
||||||
|
var exists bool
|
||||||
|
err := db.QueryRowContext(context.Background(), `
|
||||||
|
SELECT EXISTS (
|
||||||
|
SELECT 1 FROM pg_indexes
|
||||||
|
WHERE schemaname = current_schema()
|
||||||
|
AND indexname = $1
|
||||||
|
)`, indexName).Scan(&exists)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("index existence query failed for %s: %v", indexName, err)
|
||||||
|
}
|
||||||
|
if !exists {
|
||||||
|
t.Errorf("expected index %s to exist after 000016 up (migration missing or drifted)", indexName)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// assertIndexGone is the negative form, used after the down migration to
|
||||||
|
// confirm the partial retry-sweep index has been dropped.
|
||||||
|
func assertIndexGone(t *testing.T, db *sql.DB, indexName string) {
|
||||||
|
t.Helper()
|
||||||
|
var exists bool
|
||||||
|
err := db.QueryRowContext(context.Background(), `
|
||||||
|
SELECT EXISTS (
|
||||||
|
SELECT 1 FROM pg_indexes
|
||||||
|
WHERE schemaname = current_schema()
|
||||||
|
AND indexname = $1
|
||||||
|
)`, indexName).Scan(&exists)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("index existence query failed for %s: %v", indexName, err)
|
||||||
|
}
|
||||||
|
if exists {
|
||||||
|
t.Errorf("expected index %s to be removed after 000016 down (down migration is incomplete)", indexName)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// assertIndexPredicateContains asserts that the reconstructed `indexdef`
|
||||||
|
// (pg_indexes.indexdef — the CREATE INDEX statement Postgres would emit to
|
||||||
|
// recreate the index) contains the expected substring. This is how we pin
|
||||||
|
// the WHERE predicate of a partial index without parsing the SQL.
|
||||||
|
//
|
||||||
|
// Postgres normalises the predicate (e.g. single-quoted literals stay
|
||||||
|
// single-quoted, column references are bare), so substring matching is both
|
||||||
|
// sufficient and robust against cosmetic reformatting.
|
||||||
|
func assertIndexPredicateContains(t *testing.T, db *sql.DB, indexName, wantSubstr string) {
|
||||||
|
t.Helper()
|
||||||
|
var indexdef string
|
||||||
|
err := db.QueryRowContext(context.Background(), `
|
||||||
|
SELECT indexdef
|
||||||
|
FROM pg_indexes
|
||||||
|
WHERE schemaname = current_schema()
|
||||||
|
AND indexname = $1
|
||||||
|
`, indexName).Scan(&indexdef)
|
||||||
|
if err == sql.ErrNoRows {
|
||||||
|
t.Fatalf("index %s not found in current_schema (migration missing?)", indexName)
|
||||||
|
}
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("indexdef lookup for %s failed: %v", indexName, err)
|
||||||
|
}
|
||||||
|
if !strings.Contains(indexdef, wantSubstr) {
|
||||||
|
t.Errorf("index %s definition missing expected predicate fragment %q\nfull indexdef: %s",
|
||||||
|
indexName, wantSubstr, indexdef)
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -100,10 +100,14 @@ func (r *NotificationRepository) List(ctx context.Context, filter *repository.No
|
|||||||
return nil, fmt.Errorf("failed to count notifications: %w", err)
|
return nil, fmt.Errorf("failed to count notifications: %w", err)
|
||||||
}
|
}
|
||||||
|
|
||||||
// Get paginated results
|
// Get paginated results. I-005 extends the SELECT with the three retry
|
||||||
|
// columns (retry_count / next_retry_at / last_error) so scanNotification
|
||||||
|
// can populate the new fields on domain.NotificationEvent. The column
|
||||||
|
// order here MUST stay in lockstep with scanNotification below.
|
||||||
offset := (filter.Page - 1) * filter.PerPage
|
offset := (filter.Page - 1) * filter.PerPage
|
||||||
query := fmt.Sprintf(`
|
query := fmt.Sprintf(`
|
||||||
SELECT id, type, certificate_id, channel, recipient, message, sent_at, status, error
|
SELECT id, type, certificate_id, channel, recipient, message, sent_at, status, error,
|
||||||
|
retry_count, next_retry_at, last_error
|
||||||
FROM notification_events
|
FROM notification_events
|
||||||
%s
|
%s
|
||||||
ORDER BY sent_at DESC NULLS LAST
|
ORDER BY sent_at DESC NULLS LAST
|
||||||
@@ -156,13 +160,23 @@ func (r *NotificationRepository) UpdateStatus(ctx context.Context, id string, st
|
|||||||
return nil
|
return nil
|
||||||
}
|
}
|
||||||
|
|
||||||
// scanNotification scans a notification from a row or rows
|
// scanNotification scans a notification from a row or rows.
|
||||||
|
//
|
||||||
|
// I-005 extends the scan list from 9 → 12 columns (adds retry_count,
|
||||||
|
// next_retry_at, last_error). Every caller — List and the four new retry
|
||||||
|
// methods below — funnels rows through this helper, so the SELECT column
|
||||||
|
// order in every query must match the Scan order here exactly. RetryCount
|
||||||
|
// scans into an `int` (migration 000016 declares the column NOT NULL with
|
||||||
|
// DEFAULT 0), while NextRetryAt and LastError scan into pointer types
|
||||||
|
// because both columns are nullable: pending and sent rows leave both NULL,
|
||||||
|
// and dead rows clear next_retry_at but keep last_error.
|
||||||
func scanNotification(scanner interface {
|
func scanNotification(scanner interface {
|
||||||
Scan(...interface{}) error
|
Scan(...interface{}) error
|
||||||
}) (*domain.NotificationEvent, error) {
|
}) (*domain.NotificationEvent, error) {
|
||||||
var notif domain.NotificationEvent
|
var notif domain.NotificationEvent
|
||||||
err := scanner.Scan(&notif.ID, &notif.Type, &notif.CertificateID, &notif.Channel,
|
err := scanner.Scan(&notif.ID, &notif.Type, &notif.CertificateID, &notif.Channel,
|
||||||
&notif.Recipient, &notif.Message, &notif.SentAt, &notif.Status, &notif.Error)
|
&notif.Recipient, &notif.Message, &notif.SentAt, &notif.Status, &notif.Error,
|
||||||
|
&notif.RetryCount, &notif.NextRetryAt, &notif.LastError)
|
||||||
|
|
||||||
if err != nil {
|
if err != nil {
|
||||||
return nil, fmt.Errorf("failed to scan notification: %w", err)
|
return nil, fmt.Errorf("failed to scan notification: %w", err)
|
||||||
@@ -170,3 +184,220 @@ func scanNotification(scanner interface {
|
|||||||
|
|
||||||
return &notif, nil
|
return &notif, nil
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ─── I-005 retry/DLQ methods ─────────────────────────────────────────────
|
||||||
|
//
|
||||||
|
// The four methods below implement the repository half of the I-005
|
||||||
|
// notification retry + dead-letter queue fix. The retry scheduler loop
|
||||||
|
// (added alongside these in internal/scheduler/scheduler.go) drives them in
|
||||||
|
// a strict cycle:
|
||||||
|
//
|
||||||
|
// ┌─► ListRetryEligible(ctx, now, maxAttempts, limit)
|
||||||
|
// │ (oldest overdue failed rows first)
|
||||||
|
// │ │
|
||||||
|
// │ ├──► notifier.Send() succeeds → UpdateStatus('sent')
|
||||||
|
// │ │
|
||||||
|
// │ ├──► transient failure, retry_count+1 < maxAttempts
|
||||||
|
// │ │ → RecordFailedAttempt(id, err, next)
|
||||||
|
// │ │
|
||||||
|
// │ └──► transient failure, retry_count+1 == maxAttempts
|
||||||
|
// │ → MarkAsDead(id, err)
|
||||||
|
// │
|
||||||
|
// └──◄ Requeue(id) ────── operator "try again" from Dead-letter tab
|
||||||
|
//
|
||||||
|
// The WHERE clauses in every UPDATE are scoped by id (not by status), so
|
||||||
|
// status invariants ("you can't requeue a sent row", "you can't mark a
|
||||||
|
// dead row as dead again") live in the service layer. The repo layer is
|
||||||
|
// deliberately thin — it mirrors the postgres CHECK constraints and
|
||||||
|
// trusts the service to hand it rows in a sane state. The one exception
|
||||||
|
// is "row must exist": each method returns an error on zero RowsAffected,
|
||||||
|
// matching the pre-existing UpdateStatus contract above so the scheduler
|
||||||
|
// can detect a concurrent delete without guessing.
|
||||||
|
|
||||||
|
// listRetryEligibleDefaultLimit caps a caller that passes limit <= 0.
|
||||||
|
// Picked high enough that normal sweeps never hit it (a healthy fleet
|
||||||
|
// should have tens of overdue rows at most, not thousands), but finite
|
||||||
|
// so a pathological call (wrong arg in a future refactor, bad MCP tool
|
||||||
|
// wiring) cannot scan the entire notification_events table.
|
||||||
|
const listRetryEligibleDefaultLimit = 1000
|
||||||
|
|
||||||
|
// ListRetryEligible returns failed notification rows whose next_retry_at
|
||||||
|
// is due and whose retry_count has not yet reached the configured
|
||||||
|
// max_attempts.
|
||||||
|
//
|
||||||
|
// The WHERE clause leads with the partial retry-sweep index predicate
|
||||||
|
// from migration 000016, then adds the due-time and attempt filters:
|
||||||
|
//
|
||||||
|
// WHERE status = 'failed'
|
||||||
|
// AND next_retry_at IS NOT NULL
|
||||||
|
// AND next_retry_at <= $1
|
||||||
|
// AND retry_count < $2
|
||||||
|
//
|
||||||
|
// Because the index is partial on the first two conjuncts, the planner
|
||||||
|
// uses it to satisfy the range scan on next_retry_at; the retry_count
|
||||||
|
// filter is applied as a residual on the (very small) candidate set.
|
||||||
|
//
|
||||||
|
// ORDER BY next_retry_at ASC matches the fairness guarantee called out
|
||||||
|
// in the test file: oldest overdue row goes first, so a backed-up
|
||||||
|
// scheduler doesn't starve the notifications that have been waiting
|
||||||
|
// longest. The same order is what I-001's RetryFailedJobs uses.
|
||||||
|
func (r *NotificationRepository) ListRetryEligible(ctx context.Context, now time.Time, maxAttempts, limit int) ([]*domain.NotificationEvent, error) {
|
||||||
|
if limit <= 0 {
|
||||||
|
limit = listRetryEligibleDefaultLimit
|
||||||
|
}
|
||||||
|
|
||||||
|
rows, err := r.db.QueryContext(ctx, `
|
||||||
|
SELECT id, type, certificate_id, channel, recipient, message, sent_at, status, error,
|
||||||
|
retry_count, next_retry_at, last_error
|
||||||
|
FROM notification_events
|
||||||
|
WHERE status = 'failed'
|
||||||
|
AND next_retry_at IS NOT NULL
|
||||||
|
AND next_retry_at <= $1
|
||||||
|
AND retry_count < $2
|
||||||
|
ORDER BY next_retry_at ASC
|
||||||
|
LIMIT $3
|
||||||
|
`, now, maxAttempts, limit)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to query retry-eligible notifications: %w", err)
|
||||||
|
}
|
||||||
|
defer rows.Close()
|
||||||
|
|
||||||
|
var notifs []*domain.NotificationEvent
|
||||||
|
for rows.Next() {
|
||||||
|
notif, err := scanNotification(rows)
|
||||||
|
if err != nil {
|
||||||
|
return nil, err
|
||||||
|
}
|
||||||
|
notifs = append(notifs, notif)
|
||||||
|
}
|
||||||
|
if err := rows.Err(); err != nil {
|
||||||
|
return nil, fmt.Errorf("error iterating retry-eligible notification rows: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return notifs, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// RecordFailedAttempt is called by the retry sweep after a notifier.Send
|
||||||
|
// transient failure. It increments retry_count by exactly 1, overwrites
|
||||||
|
// last_error and next_retry_at, and deliberately DOES NOT touch status —
|
||||||
|
// the row must remain 'failed' so the next ListRetryEligible tick can
|
||||||
|
// pick it up again (unless the service layer has decided this attempt
|
||||||
|
// exhausts max_attempts, in which case it calls MarkAsDead directly
|
||||||
|
// instead of calling RecordFailedAttempt).
|
||||||
|
//
|
||||||
|
// The +1 is done server-side (SET retry_count = retry_count + 1) rather
|
||||||
|
// than client-side so a race between two scheduler instances cannot lose
|
||||||
|
// an attempt. Only one scheduler should be running in a healthy deploy,
|
||||||
|
// but the cheap arithmetic here survives a split-brain without lying
|
||||||
|
// about attempt counts.
|
||||||
|
func (r *NotificationRepository) RecordFailedAttempt(ctx context.Context, id string, lastError string, nextRetryAt time.Time) error {
|
||||||
|
result, err := r.db.ExecContext(ctx, `
|
||||||
|
UPDATE notification_events
|
||||||
|
SET retry_count = retry_count + 1,
|
||||||
|
last_error = $1,
|
||||||
|
next_retry_at = $2
|
||||||
|
WHERE id = $3
|
||||||
|
`, lastError, nextRetryAt, id)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to record notification retry attempt: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
rows, err := result.RowsAffected()
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to get rows affected: %w", err)
|
||||||
|
}
|
||||||
|
if rows == 0 {
|
||||||
|
// Same "not found" error shape as UpdateStatus above. The scheduler
|
||||||
|
// logs-and-continues on this so a concurrently-deleted row doesn't
|
||||||
|
// break the sweep.
|
||||||
|
return fmt.Errorf("notification not found")
|
||||||
|
}
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// MarkAsDead performs the DLQ transition. Flips status='dead' so the
|
||||||
|
// partial retry-sweep index drops the row (the index predicate requires
|
||||||
|
// status='failed'), clears next_retry_at so operator dashboards don't
|
||||||
|
// claim the row is still "scheduled to retry", writes the final
|
||||||
|
// last_error for triage, and PRESERVES retry_count as historical evidence
|
||||||
|
// of how many attempts were burned before the row was declared dead.
|
||||||
|
// The retry_count value is operator-visible in the Dead letter tab so
|
||||||
|
// on-call can tell "this notification died on attempt 5" vs "this one
|
||||||
|
// died on attempt 1 because the recipient webhook was malformed from the
|
||||||
|
// start".
|
||||||
|
func (r *NotificationRepository) MarkAsDead(ctx context.Context, id string, lastError string) error {
|
||||||
|
result, err := r.db.ExecContext(ctx, `
|
||||||
|
UPDATE notification_events
|
||||||
|
SET status = 'dead',
|
||||||
|
next_retry_at = NULL,
|
||||||
|
last_error = $1
|
||||||
|
WHERE id = $2
|
||||||
|
`, lastError, id)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to mark notification as dead: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
rows, err := result.RowsAffected()
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to get rows affected: %w", err)
|
||||||
|
}
|
||||||
|
if rows == 0 {
|
||||||
|
return fmt.Errorf("notification not found")
|
||||||
|
}
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// Requeue is the operator "try again" action fired from the Dead letter
|
||||||
|
// tab. Flips status='pending' so ProcessPendingNotifications picks the
|
||||||
|
// row up again, resets retry_count to 0 (otherwise the operator's first
|
||||||
|
// retry would immediately sit at the top of the backoff ladder), clears
|
||||||
|
// next_retry_at so the row is no longer in the retry-sweep index, and
|
||||||
|
// clears last_error so the UI doesn't render a stale error badge next
|
||||||
|
// to a freshly-requeued row.
|
||||||
|
//
|
||||||
|
// The service layer is responsible for forbidding Requeue on 'sent' or
|
||||||
|
// 'read' rows (terminal success states). This repo layer deliberately
|
||||||
|
// doesn't filter by current status — an operator action has already
|
||||||
|
// passed a human-in-the-loop guard by the time it reaches the DB, and
|
||||||
|
// the test suite only exercises the Requeue-from-{dead,failed} paths.
|
||||||
|
// Matches how UpdateStatus doesn't filter by current status either.
|
||||||
|
func (r *NotificationRepository) Requeue(ctx context.Context, id string) error {
|
||||||
|
result, err := r.db.ExecContext(ctx, `
|
||||||
|
UPDATE notification_events
|
||||||
|
SET status = 'pending',
|
||||||
|
retry_count = 0,
|
||||||
|
next_retry_at = NULL,
|
||||||
|
last_error = NULL
|
||||||
|
WHERE id = $1
|
||||||
|
`, id)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to requeue notification: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
rows, err := result.RowsAffected()
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to get rows affected: %w", err)
|
||||||
|
}
|
||||||
|
if rows == 0 {
|
||||||
|
return fmt.Errorf("notification not found")
|
||||||
|
}
|
||||||
|
return nil
|
||||||
|
}
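Since the repository method does not filter by current status, the guard against requeueing 'sent' or 'read' rows has to live in the service layer. The following is only a sketch of what such a guard could look like; `checkRequeueAllowed` and `errNotRequeueable` are hypothetical names, and the real service may be structured differently:

```go
package main

import (
	"errors"
	"fmt"
)

// errNotRequeueable is a hypothetical sentinel error for this sketch.
var errNotRequeueable = errors.New("notification is in a terminal success state")

// checkRequeueAllowed is a hypothetical service-layer guard. It rejects the
// two terminal success states named in the Requeue comment and lets every
// other status through to the repository call.
func checkRequeueAllowed(status string) error {
	switch status {
	case "sent", "read":
		return errNotRequeueable
	default:
		return nil
	}
}

func main() {
	for _, status := range []string{"dead", "failed", "sent", "read"} {
		fmt.Printf("%-6s -> %v\n", status, checkRequeueAllowed(status))
	}
}
```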
|
||||||
|
|
||||||
|
// CountByStatus returns the number of notification_events rows matching the
|
||||||
|
// given status string. Implemented as a direct COUNT(*) rather than via List
|
||||||
|
// because List resets filter.PerPage>500 to 50 (see line 57 quirk), which
|
||||||
|
// would produce undercounts on high-volume deployments. I-005 Phase 2 Green —
|
||||||
|
// backs StatsService.GetDashboardSummary.NotificationsDead and the Prometheus
|
||||||
|
// counter certctl_notification_dead_total.
|
||||||
|
func (r *NotificationRepository) CountByStatus(ctx context.Context, status string) (int64, error) {
|
||||||
|
var count int64
|
||||||
|
err := r.db.QueryRowContext(ctx,
|
||||||
|
`SELECT COUNT(*) FROM notification_events WHERE status = $1`,
|
||||||
|
status,
|
||||||
|
).Scan(&count)
|
||||||
|
if err != nil {
|
||||||
|
return 0, fmt.Errorf("failed to count notifications by status: %w", err)
|
||||||
|
}
|
||||||
|
return count, nil
|
||||||
|
}
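The undercount that motivates the direct COUNT(*) can be illustrated without a database. The sketch below assumes the behavior described in the comment (a PerPage above 500 is reset to 50) and assumes a caller that counts by reading a single page; `clampPerPage` and `countViaList` are hypothetical stand-ins, not the repository's code:

```go
package main

import "fmt"

// clampPerPage is a hypothetical stand-in for the page-size handling the
// comment describes: values above 500 fall back to 50. Passing smaller
// values through unchanged is an assumption of this sketch.
func clampPerPage(perPage int) int {
	if perPage > 500 {
		return 50
	}
	return perPage
}

// countViaList models counting rows by asking List for one large page and
// taking the length of the result.
func countViaList(totalRows, perPage int) int {
	pageSize := clampPerPage(perPage)
	if totalRows < pageSize {
		return totalRows
	}
	return pageSize
}

func main() {
	const deadRows = 1200
	fmt.Println("via List:", countViaList(deadRows, 10000))
	fmt.Println("via COUNT(*):", deadRows)
}
```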
|
||||||
|
|||||||
@@ -0,0 +1,398 @@
|
|||||||
|
package postgres_test
|
||||||
|
|
||||||
|
import (
|
||||||
|
"context"
|
||||||
|
"database/sql"
|
||||||
|
"testing"
|
||||||
|
"time"
|
||||||
|
|
||||||
|
"github.com/shankar0123/certctl/internal/domain"
|
||||||
|
"github.com/shankar0123/certctl/internal/repository/postgres"
|
||||||
|
)
|
||||||
|
|
||||||
|
// TestNotificationRepository_RetryMethods is the Phase 1 Red regression test
|
||||||
|
// for the I-005 fix ("failed webhook/email drops critical alerts — no retry,
|
||||||
|
// no DLQ, no escalation"). It pins the four new repository methods the
|
||||||
|
// notification-retry scheduler loop will depend on:
|
||||||
|
//
|
||||||
|
// 1. ListRetryEligible(ctx, now, maxAttempts, limit) — the retry-sweep query.
|
||||||
|
// Returns failed rows whose next_retry_at <= now AND retry_count <
|
||||||
|
// maxAttempts. Everything else (sent/pending/dead/read, unscheduled
|
||||||
|
// failures, exhausted rows) is excluded. Ordering is ASC on next_retry_at
|
||||||
|
// so the oldest overdue row is processed first — same fairness guarantee
|
||||||
|
// as I-001's RetryFailedJobs.
|
||||||
|
//
|
||||||
|
// 2. RecordFailedAttempt(ctx, id, lastError, nextRetryAt) — what the
|
||||||
|
// scheduler calls after a notifier.Send() transient failure. Must
|
||||||
|
// increment retry_count by exactly 1, overwrite last_error, overwrite
|
||||||
|
// next_retry_at, and KEEP status='failed' so the row is still a
|
||||||
|
// candidate for ListRetryEligible on the next sweep.
|
||||||
|
//
|
||||||
|
// 3. MarkAsDead(ctx, id, lastError) — the DLQ transition when retry_count
|
||||||
|
// hits max_attempts. Flips status to 'dead', clears next_retry_at
|
||||||
|
// (so the partial retry-sweep index drops the row), preserves
|
||||||
|
// retry_count as historical evidence of how many attempts were spent,
|
||||||
|
// and records the final transient error for operator triage.
|
||||||
|
//
|
||||||
|
// 4. Requeue(ctx, id) — the operator "try again" action fired from the
|
||||||
|
// Dead letter tab in the UI. Flips status back to 'pending' (which is
|
||||||
|
// what ProcessPendingNotifications picks up), resets retry_count to 0,
|
||||||
|
// clears next_retry_at AND last_error. Valid from both 'dead' (normal
|
||||||
|
// path) and 'failed' (operator rescuing a stuck row before the sweep
|
||||||
|
// fires). Invalid from 'sent' / 'read' (terminal success states).
|
||||||
|
//
|
||||||
|
// Red-until-Green: this test file compiles only after Phase 2 adds
|
||||||
|
// ListRetryEligible, RecordFailedAttempt, MarkAsDead, and Requeue to
|
||||||
|
// postgres.NotificationRepository. Every subtest is testcontainers-gated
|
||||||
|
// via getTestDB(t).freshSchema(t), so `go test -short` skips them and CI
|
||||||
|
// without Docker stays green. Fixtures are inserted via raw SQL — Create()
|
||||||
|
// doesn't know about the new retry columns pre-Green, so the test bypasses
|
||||||
|
// it entirely. certificate_id is left NULL on every fixture row to dodge
|
||||||
|
// the FK to managed_certificates (the column is nullable per migration
|
||||||
|
// 000001, line 212).
|
||||||
|
|
||||||
|
// TestNotificationRepository_ListRetryEligible exercises the retry-sweep
|
||||||
|
// query. The test fixture deliberately seeds one row per excluded and
|
||||||
|
// included case so a single call to ListRetryEligible is the oracle:
|
||||||
|
// every row the query returns must be an "include", every row it skips
|
||||||
|
// must be an "exclude".
|
||||||
|
func TestNotificationRepository_ListRetryEligible(t *testing.T) {
|
||||||
|
tdb := getTestDB(t)
|
||||||
|
db := tdb.freshSchema(t)
|
||||||
|
repo := postgres.NewNotificationRepository(db)
|
||||||
|
ctx := context.Background()
|
||||||
|
|
||||||
|
// Pin `now` so the test is deterministic. All "overdue" rows have
|
||||||
|
// next_retry_at < now; all "future" rows have next_retry_at > now.
|
||||||
|
now := time.Now().UTC().Truncate(time.Microsecond)
|
||||||
|
past := now.Add(-5 * time.Minute)
|
||||||
|
future := now.Add(5 * time.Minute)
|
||||||
|
|
||||||
|
// Fixture grid — each row pins a specific edge of the query:
|
||||||
|
//
|
||||||
|
// notif-overdue-1 status=failed, retry=1, next=past → INCLUDE
|
||||||
|
// notif-overdue-2 status=failed, retry=3, next=past → INCLUDE
|
||||||
|
// (next_retry_at is 30 seconds later than
|
||||||
|
// notif-overdue-1's, so ORDER BY is observable)
|
||||||
|
// notif-future status=failed, retry=2, next=future → EXCLUDE
|
||||||
|
// (backoff window hasn't elapsed yet)
|
||||||
|
// notif-exhausted status=failed, retry=5, next=past → EXCLUDE
|
||||||
|
// (retry_count >= max_attempts — sweep must skip
|
||||||
|
// so we don't re-promote a row that's about to
|
||||||
|
// be marked dead)
|
||||||
|
// notif-pending status=pending, retry=0, next=NULL → EXCLUDE
|
||||||
|
// (healthy in-flight notification)
|
||||||
|
// notif-sent status=sent, retry=0, next=NULL → EXCLUDE
|
||||||
|
// notif-dead status=dead, retry=5, next=NULL → EXCLUDE
|
||||||
|
// (already in DLQ — retrying it would reset the
|
||||||
|
// dead-letter counter and lie to the operator)
|
||||||
|
// notif-unsched status=failed, retry=1, next=NULL → EXCLUDE
|
||||||
|
// (failed row that somehow lost its next_retry_at
|
||||||
|
// — partial index predicate strips it, and the
|
||||||
|
// WHERE clause must mirror the predicate)
|
||||||
|
rawInsert := func(id, status string, retryCount int, nextRetryAt *time.Time) {
|
||||||
|
t.Helper()
|
||||||
|
_, err := db.ExecContext(ctx, `
|
||||||
|
INSERT INTO notification_events (
|
||||||
|
id, type, channel, recipient, message, status, retry_count, next_retry_at
|
||||||
|
) VALUES ($1, 'ExpirationWarning', 'Webhook', 'https://hooks.example.com/x',
|
||||||
|
'seed', $2, $3, $4)
|
||||||
|
`, id, status, retryCount, nextRetryAt)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("raw insert for %s failed: %v", id, err)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
overdue1 := past.Add(-30 * time.Second) // oldest overdue
|
||||||
|
overdue2 := past // second-oldest overdue
|
||||||
|
rawInsert("notif-overdue-1", "failed", 1, &overdue1)
|
||||||
|
rawInsert("notif-overdue-2", "failed", 3, &overdue2)
|
||||||
|
rawInsert("notif-future", "failed", 2, &future)
|
||||||
|
rawInsert("notif-exhausted", "failed", 5, &overdue1)
|
||||||
|
rawInsert("notif-pending", "pending", 0, nil)
|
||||||
|
rawInsert("notif-sent", "sent", 0, nil)
|
||||||
|
rawInsert("notif-dead", "dead", 5, nil)
|
||||||
|
rawInsert("notif-unsched", "failed", 1, nil)
|
||||||
|
|
||||||
|
// Act — the central call under test.
|
||||||
|
got, err := repo.ListRetryEligible(ctx, now, 5, 100)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("ListRetryEligible failed: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Assert inclusion: exactly the two overdue rows.
|
||||||
|
if len(got) != 2 {
|
||||||
|
t.Fatalf("ListRetryEligible returned %d rows, want 2 (overdue-1 + overdue-2); got IDs = %v",
|
||||||
|
len(got), collectIDs(got))
|
||||||
|
}
|
||||||
|
|
||||||
|
// Assert ordering: ASC on next_retry_at. notif-overdue-1 has the
|
||||||
|
// earlier next_retry_at (past - 30s), so it must come first.
|
||||||
|
if got[0].ID != "notif-overdue-1" {
|
||||||
|
t.Errorf("ListRetryEligible[0].ID = %q, want %q (ORDER BY next_retry_at ASC — oldest first)",
|
||||||
|
got[0].ID, "notif-overdue-1")
|
||||||
|
}
|
||||||
|
if got[1].ID != "notif-overdue-2" {
|
||||||
|
t.Errorf("ListRetryEligible[1].ID = %q, want %q", got[1].ID, "notif-overdue-2")
|
||||||
|
}
|
||||||
|
|
||||||
|
// Assert limit is respected. Re-run with limit=1 and confirm only the
|
||||||
|
// oldest overdue row comes back — this is what lets the scheduler
|
||||||
|
// chunk its sweep under load.
|
||||||
|
limited, err := repo.ListRetryEligible(ctx, now, 5, 1)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("ListRetryEligible(limit=1) failed: %v", err)
|
||||||
|
}
|
||||||
|
if len(limited) != 1 || limited[0].ID != "notif-overdue-1" {
|
||||||
|
t.Errorf("ListRetryEligible(limit=1) returned %v, want [notif-overdue-1]", collectIDs(limited))
|
||||||
|
}
|
||||||
|
|
||||||
|
// Assert maxAttempts is respected. Re-run with maxAttempts=2 — this
|
||||||
|
// flips notif-overdue-2 (retry_count=3) into the "exhausted" bucket
|
||||||
|
// and must not come back. Only notif-overdue-1 (retry_count=1) qualifies.
|
||||||
|
capped, err := repo.ListRetryEligible(ctx, now, 2, 100)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("ListRetryEligible(maxAttempts=2) failed: %v", err)
|
||||||
|
}
|
||||||
|
if len(capped) != 1 || capped[0].ID != "notif-overdue-1" {
|
||||||
|
t.Errorf("ListRetryEligible(maxAttempts=2) returned %v, want [notif-overdue-1]", collectIDs(capped))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationRepository_RecordFailedAttempt verifies the retry-bump
|
||||||
|
// UPDATE. The contract is: retry_count += 1, last_error = new msg,
|
||||||
|
// next_retry_at = new time, status STAYS 'failed'. Any other side effect
|
||||||
|
// (status flip, retry_count reset, sent_at mutation) is a bug.
|
||||||
|
func TestNotificationRepository_RecordFailedAttempt(t *testing.T) {
|
||||||
|
tdb := getTestDB(t)
|
||||||
|
db := tdb.freshSchema(t)
|
||||||
|
repo := postgres.NewNotificationRepository(db)
|
||||||
|
ctx := context.Background()
|
||||||
|
|
||||||
|
initialRetry := past()
|
||||||
|
_, err := db.ExecContext(ctx, `
|
||||||
|
INSERT INTO notification_events (
|
||||||
|
id, type, channel, recipient, message, status, retry_count, next_retry_at, last_error
|
||||||
|
) VALUES ('notif-attempt-1', 'ExpirationWarning', 'Webhook',
|
||||||
|
'https://hooks.example.com/x', 'seed', 'failed', 2, $1, 'first failure')
|
||||||
|
`, initialRetry)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("seed failed: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
nextTry := time.Now().UTC().Add(8 * time.Minute).Truncate(time.Microsecond)
|
||||||
|
if err := repo.RecordFailedAttempt(ctx, "notif-attempt-1", "connection refused", nextTry); err != nil {
|
||||||
|
t.Fatalf("RecordFailedAttempt failed: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Re-read the row directly from the DB (bypassing the repo's List()
|
||||||
|
// filter logic) so the assertion tests storage, not query plumbing.
|
||||||
|
var (
|
||||||
|
gotStatus string
|
||||||
|
gotRetryCount int
|
||||||
|
gotNextRetry *time.Time
|
||||||
|
gotLastError *string
|
||||||
|
)
|
||||||
|
err = db.QueryRowContext(ctx, `
|
||||||
|
SELECT status, retry_count, next_retry_at, last_error
|
||||||
|
FROM notification_events WHERE id = 'notif-attempt-1'
|
||||||
|
`).Scan(&gotStatus, &gotRetryCount, &gotNextRetry, &gotLastError)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("post-update SELECT failed: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if gotStatus != "failed" {
|
||||||
|
t.Errorf("status = %q, want 'failed' (RecordFailedAttempt must preserve status so sweep re-picks the row)", gotStatus)
|
||||||
|
}
|
||||||
|
if gotRetryCount != 3 {
|
||||||
|
t.Errorf("retry_count = %d, want 3 (must increment by exactly 1 from seeded 2)", gotRetryCount)
|
||||||
|
}
|
||||||
|
if gotNextRetry == nil || !gotNextRetry.Equal(nextTry) {
|
||||||
|
t.Errorf("next_retry_at = %v, want %v", gotNextRetry, nextTry)
|
||||||
|
}
|
||||||
|
if gotLastError == nil || *gotLastError != "connection refused" {
|
||||||
|
t.Errorf("last_error = %v, want 'connection refused'", gotLastError)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Negative path: unknown id must surface "not found" — mirrors the
|
||||||
|
// existing UpdateStatus contract so the scheduler can detect a
|
||||||
|
// concurrent delete without guessing.
|
||||||
|
if err := repo.RecordFailedAttempt(ctx, "notif-does-not-exist", "oops", nextTry); err == nil {
|
||||||
|
t.Errorf("RecordFailedAttempt on unknown id succeeded; want error")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationRepository_MarkAsDead verifies the DLQ transition. Flips
|
||||||
|
// status to 'dead', clears next_retry_at (so the partial retry-sweep
|
||||||
|
// index drops the row), writes final last_error, preserves retry_count as
|
||||||
|
// evidence of how many attempts were burned.
|
||||||
|
func TestNotificationRepository_MarkAsDead(t *testing.T) {
|
||||||
|
tdb := getTestDB(t)
|
||||||
|
db := tdb.freshSchema(t)
|
||||||
|
repo := postgres.NewNotificationRepository(db)
|
||||||
|
ctx := context.Background()
|
||||||
|
|
||||||
|
lastAttempt := past()
|
||||||
|
_, err := db.ExecContext(ctx, `
|
||||||
|
INSERT INTO notification_events (
|
||||||
|
id, type, channel, recipient, message, status, retry_count, next_retry_at, last_error
|
||||||
|
) VALUES ('notif-dlq-1', 'ExpirationWarning', 'Webhook',
|
||||||
|
'https://hooks.example.com/x', 'seed', 'failed', 5, $1, 'prior failure')
|
||||||
|
`, lastAttempt)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("seed failed: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if err := repo.MarkAsDead(ctx, "notif-dlq-1", "max attempts exceeded"); err != nil {
|
||||||
|
t.Fatalf("MarkAsDead failed: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
var (
|
||||||
|
gotStatus string
|
||||||
|
gotRetryCount int
|
||||||
|
gotNextRetry *time.Time
|
||||||
|
gotLastError *string
|
||||||
|
)
|
||||||
|
err = db.QueryRowContext(ctx, `
|
||||||
|
SELECT status, retry_count, next_retry_at, last_error
|
||||||
|
FROM notification_events WHERE id = 'notif-dlq-1'
|
||||||
|
`).Scan(&gotStatus, &gotRetryCount, &gotNextRetry, &gotLastError)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("post-update SELECT failed: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if gotStatus != "dead" {
|
||||||
|
t.Errorf("status = %q, want 'dead' (DLQ transition)", gotStatus)
|
||||||
|
}
|
||||||
|
if gotNextRetry != nil {
|
||||||
|
// next_retry_at MUST be NULL post-DLQ — the partial retry-sweep
|
||||||
|
// index predicate is `status='failed' AND next_retry_at IS NOT NULL`,
|
||||||
|
// so leaving a value here would only waste space; the status='dead'
|
||||||
|
// half of the predicate already excludes the row from the sweep,
|
||||||
|
// but operator dashboards treat a populated next_retry_at as "still
|
||||||
|
// scheduled", which would be a lie.
|
||||||
|
t.Errorf("next_retry_at = %v, want NULL (dead rows are terminal, not rescheduled)", gotNextRetry)
|
||||||
|
}
|
||||||
|
if gotRetryCount != 5 {
|
||||||
|
// retry_count is audit evidence — how many attempts were burned
|
||||||
|
// before the row was declared dead. Don't clobber it.
|
||||||
|
t.Errorf("retry_count = %d, want 5 preserved (evidence of burned attempts)", gotRetryCount)
|
||||||
|
}
|
||||||
|
if gotLastError == nil || *gotLastError != "max attempts exceeded" {
|
||||||
|
t.Errorf("last_error = %v, want 'max attempts exceeded'", gotLastError)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Negative path: unknown id must surface "not found".
|
||||||
|
if err := repo.MarkAsDead(ctx, "notif-does-not-exist", "oops"); err == nil {
|
||||||
|
t.Errorf("MarkAsDead on unknown id succeeded; want error")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationRepository_Requeue verifies the operator "try again"
|
||||||
|
// flow exposed by the Dead letter tab. The contract:
|
||||||
|
//
|
||||||
|
// - Flips status → 'pending' regardless of prior ('dead' or 'failed').
|
||||||
|
// - Resets retry_count to 0 — a manual requeue restarts the backoff
|
||||||
|
// ladder; otherwise the operator's first retry would already be at
|
||||||
|
// "wait 32 minutes" which defeats the point.
|
||||||
|
// - Clears next_retry_at so the row is no longer in the retry-sweep
|
||||||
|
// index (the scheduler would otherwise try to retry it *again* a
|
||||||
|
// few seconds later).
|
||||||
|
// - Clears last_error — the UI shouldn't show a stale error next to
|
||||||
|
// a freshly-requeued row.
|
||||||
|
func TestNotificationRepository_Requeue(t *testing.T) {
|
||||||
|
tdb := getTestDB(t)
|
||||||
|
db := tdb.freshSchema(t)
|
||||||
|
repo := postgres.NewNotificationRepository(db)
|
||||||
|
ctx := context.Background()
|
||||||
|
|
||||||
|
// Two fixtures — one dead (DLQ path, the normal case) and one failed
|
||||||
|
// (operator rescuing a stuck-in-retry row before the sweep fires).
|
||||||
|
// Both must accept Requeue. Rejecting 'sent' / 'read' rows is the service layer's job, so notif-sent-done is seeded but never asserted on here.
|
||||||
|
_, err := db.ExecContext(ctx, `
|
||||||
|
INSERT INTO notification_events (id, type, channel, recipient, message, status, retry_count, last_error)
|
||||||
|
VALUES
|
||||||
|
('notif-dead-ready', 'ExpirationWarning', 'Webhook', 'https://h/x', 'seed', 'dead', 5, 'gave up'),
|
||||||
|
('notif-failed-hot', 'ExpirationWarning', 'Webhook', 'https://h/x', 'seed', 'failed', 2, 'transient'),
|
||||||
|
('notif-sent-done', 'ExpirationWarning', 'Webhook', 'https://h/x', 'seed', 'sent', 0, NULL)
|
||||||
|
`)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("seed failed: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Happy path 1: requeue a dead row.
|
||||||
|
if err := repo.Requeue(ctx, "notif-dead-ready"); err != nil {
|
||||||
|
t.Fatalf("Requeue(dead) failed: %v", err)
|
||||||
|
}
|
||||||
|
assertRequeued(t, db, ctx, "notif-dead-ready")
|
||||||
|
|
||||||
|
// Happy path 2: requeue a failed row.
|
||||||
|
if err := repo.Requeue(ctx, "notif-failed-hot"); err != nil {
|
||||||
|
t.Fatalf("Requeue(failed) failed: %v", err)
|
||||||
|
}
|
||||||
|
assertRequeued(t, db, ctx, "notif-failed-hot")
|
||||||
|
|
||||||
|
// Negative path: Requeue on unknown id is "not found", not a no-op
|
||||||
|
// silent success — the handler needs to surface a 404 to the operator.
|
||||||
|
if err := repo.Requeue(ctx, "notif-does-not-exist"); err == nil {
|
||||||
|
t.Errorf("Requeue on unknown id succeeded; want error")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ─── Helpers ──────────────────────────────────────────────────────────────
|
||||||
|
|
||||||
|
// past returns a stable "5 minutes ago" time for fixture seeding. Truncated
|
||||||
|
// to microseconds so round-tripping through Postgres TIMESTAMPTZ doesn't
|
||||||
|
// introduce a sub-microsecond diff that breaks equality assertions.
|
||||||
|
func past() time.Time {
|
||||||
|
return time.Now().UTC().Add(-5 * time.Minute).Truncate(time.Microsecond)
|
||||||
|
}
|
||||||
|
|
||||||
|
// collectIDs pulls the IDs out of a slice of events for readable test
|
||||||
|
// failure output. Without it, a failure prints "[0xc00012... 0xc00013...]"
|
||||||
|
// which is useless when diagnosing a mis-sorted sweep.
|
||||||
|
func collectIDs(events []*domain.NotificationEvent) []string {
|
||||||
|
ids := make([]string, len(events))
|
||||||
|
for i, e := range events {
|
||||||
|
ids[i] = e.ID
|
||||||
|
}
|
||||||
|
return ids
|
||||||
|
}
|
||||||
|
|
||||||
|
// assertRequeued is the shared "did Requeue do exactly what the contract
|
||||||
|
// promises?" assertion. Re-reads the row and checks all four mutations
|
||||||
|
// atomically so every Requeue test path gets the same rigor: status flipped
|
||||||
|
// to 'pending', retry_count reset to 0, next_retry_at cleared, last_error
|
||||||
|
// cleared. Any one of these missing is a contract violation.
|
||||||
|
func assertRequeued(t *testing.T, db *sql.DB, ctx context.Context, id string) {
|
||||||
|
t.Helper()
|
||||||
|
var (
|
||||||
|
gotStatus string
|
||||||
|
gotRetryCount int
|
||||||
|
gotNextRetry *time.Time
|
||||||
|
gotLastError *string
|
||||||
|
)
|
||||||
|
err := db.QueryRowContext(ctx, `
|
||||||
|
SELECT status, retry_count, next_retry_at, last_error
|
||||||
|
FROM notification_events WHERE id = $1
|
||||||
|
`, id).Scan(&gotStatus, &gotRetryCount, &gotNextRetry, &gotLastError)
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("post-Requeue SELECT for %s failed: %v", id, err)
|
||||||
|
}
|
||||||
|
if gotStatus != "pending" {
|
||||||
|
t.Errorf("%s.status = %q, want 'pending' (Requeue must re-open the row for ProcessPendingNotifications)",
|
||||||
|
id, gotStatus)
|
||||||
|
}
|
||||||
|
if gotRetryCount != 0 {
|
||||||
|
t.Errorf("%s.retry_count = %d, want 0 (Requeue restarts the backoff ladder so the operator's first retry isn't already at hour-long waits)",
|
||||||
|
id, gotRetryCount)
|
||||||
|
}
|
||||||
|
if gotNextRetry != nil {
|
||||||
|
t.Errorf("%s.next_retry_at = %v, want NULL (a fresh pending row must not sit in the retry-sweep index)",
|
||||||
|
id, gotNextRetry)
|
||||||
|
}
|
||||||
|
if gotLastError != nil {
|
||||||
|
t.Errorf("%s.last_error = %v, want NULL (stale errors on freshly-requeued rows mislead the UI)",
|
||||||
|
id, *gotLastError)
|
||||||
|
}
|
||||||
|
}
|
||||||
+115
-39
@@ -34,8 +34,14 @@ type AgentServicer interface {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// NotificationServicer defines the interface for notification processing used by the scheduler.
|
// NotificationServicer defines the interface for notification processing used by the scheduler.
|
||||||
|
//
|
||||||
|
// RetryFailedNotifications was added to close coverage gap I-005: the retry
|
||||||
|
// sweep transitions eligible Failed notifications to Pending on an independent
|
||||||
|
// tick, using exponential backoff with a 1h cap and a 5-attempt DLQ budget.
|
||||||
|
// Mirrors the I-001 job retry loop topology.
|
||||||
type NotificationServicer interface {
|
type NotificationServicer interface {
|
||||||
ProcessPendingNotifications(ctx context.Context) error
|
ProcessPendingNotifications(ctx context.Context) error
|
||||||
|
RetryFailedNotifications(ctx context.Context) error
|
||||||
}
|
}
|
||||||
|
|
||||||
// NetworkScanServicer defines the interface for network scanning used by the scheduler.
|
// NetworkScanServicer defines the interface for network scanning used by the scheduler.
|
||||||
@@ -67,44 +73,46 @@ type JobReaperService interface {
|
|||||||
// It runs multiple concurrent loops for renewal checks, job processing, agent health checks,
|
// It runs multiple concurrent loops for renewal checks, job processing, agent health checks,
|
||||||
// and notification processing.
|
// and notification processing.
|
||||||
type Scheduler struct {
|
type Scheduler struct {
|
||||||
renewalService RenewalServicer
|
renewalService RenewalServicer
|
||||||
jobService JobServicer
|
jobService JobServicer
|
||||||
agentService AgentServicer
|
agentService AgentServicer
|
||||||
notificationService NotificationServicer
|
notificationService NotificationServicer
|
||||||
networkScanService NetworkScanServicer
|
networkScanService NetworkScanServicer
|
||||||
digestService DigestServicer
|
digestService DigestServicer
|
||||||
healthCheckService HealthCheckServicer
|
healthCheckService HealthCheckServicer
|
||||||
cloudDiscoveryService CloudDiscoveryServicer
|
cloudDiscoveryService CloudDiscoveryServicer
|
||||||
jobReaper JobReaperService
|
jobReaper JobReaperService
|
||||||
logger *slog.Logger
|
logger *slog.Logger
|
||||||
|
|
||||||
// Configurable tick intervals
|
// Configurable tick intervals
|
||||||
renewalCheckInterval time.Duration
|
renewalCheckInterval time.Duration
|
||||||
jobProcessorInterval time.Duration
|
jobProcessorInterval time.Duration
|
||||||
jobRetryInterval time.Duration
|
jobRetryInterval time.Duration
|
||||||
agentHealthCheckInterval time.Duration
|
agentHealthCheckInterval time.Duration
|
||||||
notificationProcessInterval time.Duration
|
notificationProcessInterval time.Duration
|
||||||
shortLivedExpiryCheckInterval time.Duration
|
notificationRetryInterval time.Duration
|
||||||
networkScanInterval time.Duration
|
shortLivedExpiryCheckInterval time.Duration
|
||||||
digestInterval time.Duration
|
networkScanInterval time.Duration
|
||||||
healthCheckInterval time.Duration
|
digestInterval time.Duration
|
||||||
cloudDiscoveryInterval time.Duration
|
healthCheckInterval time.Duration
|
||||||
jobTimeoutInterval time.Duration
|
cloudDiscoveryInterval time.Duration
|
||||||
awaitingCSRTimeout time.Duration
|
jobTimeoutInterval time.Duration
|
||||||
awaitingApprovalTimeout time.Duration
|
awaitingCSRTimeout time.Duration
|
||||||
|
awaitingApprovalTimeout time.Duration
|
||||||
|
|
||||||
// Idempotency guards: prevent duplicate execution of slow jobs
|
// Idempotency guards: prevent duplicate execution of slow jobs
|
||||||
renewalCheckRunning atomic.Bool
|
renewalCheckRunning atomic.Bool
|
||||||
jobProcessorRunning atomic.Bool
|
jobProcessorRunning atomic.Bool
|
||||||
jobRetryRunning atomic.Bool
|
jobRetryRunning atomic.Bool
|
||||||
agentHealthCheckRunning atomic.Bool
|
agentHealthCheckRunning atomic.Bool
|
||||||
notificationProcessRunning atomic.Bool
|
notificationProcessRunning atomic.Bool
|
||||||
shortLivedExpiryCheckRunning atomic.Bool
|
notificationRetryRunning atomic.Bool
|
||||||
networkScanRunning atomic.Bool
|
shortLivedExpiryCheckRunning atomic.Bool
|
||||||
digestRunning atomic.Bool
|
networkScanRunning atomic.Bool
|
||||||
healthCheckRunning atomic.Bool
|
digestRunning atomic.Bool
|
||||||
cloudDiscoveryRunning atomic.Bool
|
healthCheckRunning atomic.Bool
|
||||||
jobTimeoutRunning atomic.Bool
|
cloudDiscoveryRunning atomic.Bool
|
||||||
|
jobTimeoutRunning atomic.Bool
|
||||||
|
|
||||||
// Graceful shutdown: wait for in-flight work to complete
|
// Graceful shutdown: wait for in-flight work to complete
|
||||||
wg sync.WaitGroup
|
wg sync.WaitGroup
|
||||||
@@ -133,6 +141,7 @@ func NewScheduler(
|
|||||||
jobRetryInterval: 5 * time.Minute,
|
jobRetryInterval: 5 * time.Minute,
|
||||||
agentHealthCheckInterval: 2 * time.Minute,
|
agentHealthCheckInterval: 2 * time.Minute,
|
||||||
notificationProcessInterval: 1 * time.Minute,
|
notificationProcessInterval: 1 * time.Minute,
|
||||||
|
notificationRetryInterval: 2 * time.Minute,
|
||||||
shortLivedExpiryCheckInterval: 30 * time.Second,
|
shortLivedExpiryCheckInterval: 30 * time.Second,
|
||||||
networkScanInterval: 6 * time.Hour,
|
networkScanInterval: 6 * time.Hour,
|
||||||
digestInterval: 24 * time.Hour,
|
digestInterval: 24 * time.Hour,
|
||||||
@@ -180,6 +189,13 @@ func (s *Scheduler) SetNotificationProcessInterval(d time.Duration) {
|
|||||||
s.notificationProcessInterval = d
|
s.notificationProcessInterval = d
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// SetNotificationRetryInterval configures the interval for the failed-notification
|
||||||
|
// retry sweep (coverage gap I-005). Defaults to 2 minutes; honors
|
||||||
|
// CERTCTL_NOTIFICATION_RETRY_INTERVAL when wired from config.
|
||||||
|
func (s *Scheduler) SetNotificationRetryInterval(d time.Duration) {
|
||||||
|
s.notificationRetryInterval = d
|
||||||
|
}
|
||||||
|
|
||||||
// SetNetworkScanInterval configures the interval for network scanning.
|
// SetNetworkScanInterval configures the interval for network scanning.
|
||||||
func (s *Scheduler) SetNetworkScanInterval(d time.Duration) {
|
func (s *Scheduler) SetNetworkScanInterval(d time.Duration) {
|
||||||
s.networkScanInterval = d
|
s.networkScanInterval = d
|
||||||
@@ -212,7 +228,6 @@ func (s *Scheduler) SetCloudDiscoveryInterval(d time.Duration) {
|
|||||||
s.cloudDiscoveryInterval = d
|
s.cloudDiscoveryInterval = d
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
// SetJobReaperService sets the job reaper service (I-003).
|
// SetJobReaperService sets the job reaper service (I-003).
|
||||||
func (s *Scheduler) SetJobReaperService(jr JobReaperService) {
|
func (s *Scheduler) SetJobReaperService(jr JobReaperService) {
|
||||||
s.jobReaper = jr
|
s.jobReaper = jr
|
||||||
@@ -232,6 +247,7 @@ func (s *Scheduler) SetAwaitingCSRTimeout(d time.Duration) {
|
|||||||
func (s *Scheduler) SetAwaitingApprovalTimeout(d time.Duration) {
|
func (s *Scheduler) SetAwaitingApprovalTimeout(d time.Duration) {
|
||||||
s.awaitingApprovalTimeout = d
|
s.awaitingApprovalTimeout = d
|
||||||
}
|
}
|
||||||
|
|
||||||
// Start initiates all background scheduler loops. It returns a channel that signals
|
// Start initiates all background scheduler loops. It returns a channel that signals
|
||||||
// when the scheduler has started all loops. The scheduler runs until the context is cancelled.
|
// when the scheduler has started all loops. The scheduler runs until the context is cancelled.
|
||||||
func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
|
func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
|
||||||
@@ -242,10 +258,11 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
|
|||||||
|
|
||||||
// Track all loop goroutines in the WaitGroup so WaitForCompletion
|
// Track all loop goroutines in the WaitGroup so WaitForCompletion
|
||||||
// blocks until they've fully exited (prevents test races).
|
// blocks until they've fully exited (prevents test races).
|
||||||
// Base count is 7: renewal, job processor, job retry (I-001),
|
// Base count is 8: renewal, job processor, job retry (I-001),
|
||||||
// job timeout (I-003), agent health, notification, short-lived expiry. Optional loops
|
// job timeout (I-003), agent health, notification, notification retry
|
||||||
// (network scan, digest, health check, cloud discovery) add to this.
|
// (I-005), short-lived expiry. Optional loops (network scan, digest,
|
||||||
loopCount := 7
|
// health check, cloud discovery) add to this.
|
||||||
|
loopCount := 8
|
||||||
if s.networkScanService != nil {
|
if s.networkScanService != nil {
|
||||||
loopCount++
|
loopCount++
|
||||||
}
|
}
|
||||||
@@ -266,6 +283,7 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
|
|||||||
go func() { defer s.wg.Done(); s.jobTimeoutLoop(ctx) }()
|
go func() { defer s.wg.Done(); s.jobTimeoutLoop(ctx) }()
|
||||||
go func() { defer s.wg.Done(); s.agentHealthCheckLoop(ctx) }()
|
go func() { defer s.wg.Done(); s.agentHealthCheckLoop(ctx) }()
|
||||||
go func() { defer s.wg.Done(); s.notificationProcessLoop(ctx) }()
|
go func() { defer s.wg.Done(); s.notificationProcessLoop(ctx) }()
|
||||||
|
go func() { defer s.wg.Done(); s.notificationRetryLoop(ctx) }()
|
||||||
go func() { defer s.wg.Done(); s.shortLivedExpiryCheckLoop(ctx) }()
|
go func() { defer s.wg.Done(); s.shortLivedExpiryCheckLoop(ctx) }()
|
||||||
if s.networkScanService != nil {
|
if s.networkScanService != nil {
|
||||||
go func() { defer s.wg.Done(); s.networkScanLoop(ctx) }()
|
go func() { defer s.wg.Done(); s.networkScanLoop(ctx) }()
|
||||||
@@ -597,6 +615,64 @@ func (s *Scheduler) runNotificationProcess(ctx context.Context) {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// notificationRetryLoop runs every notificationRetryInterval and transitions
|
||||||
|
// eligible Failed notifications back to Pending so the notification processor
|
||||||
|
// can pick them up again. Closes coverage gap I-005 — NotificationService.
|
||||||
|
// RetryFailedNotifications had no runtime caller prior to this loop being
|
||||||
|
// wired. Runs immediately on start, then every interval.
|
||||||
|
// Uses atomic.Bool to prevent duplicate execution if the previous retry sweep
|
||||||
|
// is still running. Mirrors the I-001 jobRetryLoop topology byte-for-byte.
|
||||||
|
func (s *Scheduler) notificationRetryLoop(ctx context.Context) {
|
||||||
|
ticker := time.NewTicker(s.notificationRetryInterval)
|
||||||
|
defer ticker.Stop()
|
||||||
|
|
||||||
|
// Run immediately on start (with idempotency guard)
|
||||||
|
s.notificationRetryRunning.Store(true)
|
||||||
|
s.wg.Add(1)
|
||||||
|
go func() {
|
||||||
|
defer s.wg.Done()
|
||||||
|
defer s.notificationRetryRunning.Store(false)
|
||||||
|
s.runNotificationRetry(ctx)
|
||||||
|
}()
|
||||||
|
|
||||||
|
for {
|
||||||
|
select {
|
||||||
|
case <-ctx.Done():
|
||||||
|
return
|
||||||
|
case <-ticker.C:
|
||||||
|
if !s.notificationRetryRunning.CompareAndSwap(false, true) {
|
||||||
|
s.logger.Warn("notification retry still running, skipping tick")
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
s.wg.Add(1)
|
||||||
|
go func() {
|
||||||
|
defer s.wg.Done()
|
||||||
|
defer s.notificationRetryRunning.Store(false)
|
||||||
|
s.runNotificationRetry(ctx)
|
||||||
|
}()
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// runNotificationRetry executes a single failed-notification retry cycle with
|
||||||
|
// error recovery. Uses a 2-minute per-tick timeout matching runJobRetry;
|
||||||
|
// RetryFailedNotifications issues one SELECT, then one UPDATE per eligible row
|
||||||
|
// (cheap), so this headroom covers very large failure backlogs without
|
||||||
|
// starving the loop. The service layer swallows per-row send errors (mirrors
|
||||||
|
// ProcessPendingNotifications) and only returns the List error from the
|
||||||
|
// initial ListRetryEligible call.
|
||||||
|
func (s *Scheduler) runNotificationRetry(ctx context.Context) {
|
||||||
|
opCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
|
||||||
|
defer cancel()
|
||||||
|
if err := s.notificationService.RetryFailedNotifications(opCtx); err != nil {
|
||||||
|
s.logger.Error("notification retry failed",
|
||||||
|
"error", err,
|
||||||
|
"interval", s.notificationRetryInterval.String())
|
||||||
|
} else {
|
||||||
|
s.logger.Debug("notification retry completed")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// shortLivedExpiryCheckLoop runs every shortLivedExpiryCheckInterval and marks expired
|
// shortLivedExpiryCheckLoop runs every shortLivedExpiryCheckInterval and marks expired
|
||||||
// short-lived certificates. For certs with TTL < 1 hour, expiry IS revocation —
|
// short-lived certificates. For certs with TTL < 1 hour, expiry IS revocation —
|
||||||
// no CRL/OCSP needed.
|
// no CRL/OCSP needed.
|
||||||
|
|||||||
@@ -195,12 +195,25 @@ func (m *mockAgentService) MarkStaleAgentsOffline(ctx context.Context, interval
|
|||||||
}
|
}
|
||||||
|
|
||||||
// mockNotificationService is a mock implementation for testing.
|
// mockNotificationService is a mock implementation for testing.
|
||||||
|
//
|
||||||
|
// Tracks ProcessPendingNotifications and RetryFailedNotifications separately.
|
||||||
|
// retrySlowDelay and retryShouldError let tests exercise the retry loop
|
||||||
|
// independently of the processor loop without coupling their timing/failure
|
||||||
|
// modes (coverage gap I-005 — prior to the notificationRetryLoop being wired,
|
||||||
|
// RetryFailedNotifications had no runtime caller).
|
||||||
type mockNotificationService struct {
|
type mockNotificationService struct {
|
||||||
mu sync.Mutex
|
mu sync.Mutex
|
||||||
callCount int
|
callCount int
|
||||||
callTimes []time.Time
|
callTimes []time.Time
|
||||||
slowDelay time.Duration
|
slowDelay time.Duration
|
||||||
shouldError bool
|
shouldError bool
|
||||||
|
|
||||||
|
// Retry loop tracking (coverage gap I-005)
|
||||||
|
retryCallCount int
|
||||||
|
retryCallTimes []time.Time
|
||||||
|
retrySlowDelay time.Duration
|
||||||
|
retryShouldError bool
|
||||||
|
retryCtxHasDeadline bool
|
||||||
}
|
}
|
||||||
|
|
||||||
func (m *mockNotificationService) ProcessPendingNotifications(ctx context.Context) error {
|
func (m *mockNotificationService) ProcessPendingNotifications(ctx context.Context) error {
|
||||||
@@ -223,6 +236,42 @@ func (m *mockNotificationService) ProcessPendingNotifications(ctx context.Contex
|
|||||||
return nil
|
return nil
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// RetryFailedNotifications is the scheduler-driven counterpart to
|
||||||
|
// ProcessPendingNotifications that closes coverage gap I-005. Prior to the
|
||||||
|
// notificationRetryLoop being wired, notifications that hit status='failed'
|
||||||
|
// were orphaned there forever: no retry, no DLQ, no escalation. The service-layer
|
||||||
|
// method exists to sweep failed rows whose next_retry_at has elapsed, but
|
||||||
|
// without a scheduler caller the sweep never runs in production.
|
||||||
|
//
|
||||||
|
// This mock mirrors mockJobService.RetryFailedJobs's shape: a retry-only field
|
||||||
|
// cluster so callers can dial retrySlowDelay / retryShouldError without
|
||||||
|
// perturbing ProcessPendingNotifications's timing, and retryCtxHasDeadline so
|
||||||
|
// the ContextDeadlineRespected test can assert the scheduler is passing a
|
||||||
|
// per-tick context.WithTimeout rather than the raw shutdown ctx.
|
||||||
|
func (m *mockNotificationService) RetryFailedNotifications(ctx context.Context) error {
|
||||||
|
m.mu.Lock()
|
||||||
|
m.retryCallCount++
|
||||||
|
m.retryCallTimes = append(m.retryCallTimes, time.Now())
|
||||||
|
// Track whether context has a deadline set — the scheduler must wrap each
|
||||||
|
// tick in a bounded context so a hung sweep can't stall shutdown.
|
||||||
|
_, hasDeadline := ctx.Deadline()
|
||||||
|
m.retryCtxHasDeadline = hasDeadline
|
||||||
|
m.mu.Unlock()
|
||||||
|
|
||||||
|
if m.retrySlowDelay > 0 {
|
||||||
|
select {
|
||||||
|
case <-time.After(m.retrySlowDelay):
|
||||||
|
case <-ctx.Done():
|
||||||
|
return ctx.Err()
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if m.retryShouldError {
|
||||||
|
return context.Canceled
|
||||||
|
}
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
// mockNetworkScanService is a mock implementation for testing.
|
// mockNetworkScanService is a mock implementation for testing.
|
||||||
type mockNetworkScanService struct {
|
type mockNetworkScanService struct {
|
||||||
mu sync.Mutex
|
mu sync.Mutex
|
||||||
@@ -1358,3 +1407,221 @@ func TestScheduler_JobTimeoutLoop_ContextDeadlineRespected(t *testing.T) {
|
|||||||
}
|
}
|
||||||
t.Log("timeout reaper context deadline verified")
|
t.Log("timeout reaper context deadline verified")
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ─── NotificationRetryLoop tests (coverage gap I-005) ────────────────────────
|
||||||
|
//
|
||||||
|
// These four tests are the scheduler-level Red half of the I-005 fix. They
|
||||||
|
// mirror the I-001 jobRetryLoop triplet (CallsService / IdempotencyGuard /
|
||||||
|
// WaitForCompletion) plus the I-003 ContextDeadlineRespected shape.
|
||||||
|
//
|
||||||
|
// All four use the same "quiet every other loop" pattern so the only tick
|
||||||
|
// activity visible on notificationMock is the retry loop under test. JobTimeout
|
||||||
|
// is intentionally left unconfigured — SetJobReaperService isn't called, so the
|
||||||
|
// timeout loop is dormant (same convention the I-001 tests follow).
|
||||||
|
|
||||||
|
// TestScheduler_NotificationRetryLoop_CallsService verifies that the
|
||||||
|
// notification retry loop invokes NotificationService.RetryFailedNotifications
|
||||||
|
// on each tick. Closes coverage gap I-005 — prior to the loop being wired,
|
||||||
|
// RetryFailedNotifications had no runtime caller and failed notification_events
|
||||||
|
// rows stayed orphaned at status='failed' forever (no retry, no DLQ, no escalation).
|
||||||
|
//
|
||||||
|
// Unlike the jobRetryLoop test, there is no maxRetries advisory constant to
|
||||||
|
// forward: the max_attempts limit on notification retries lives on the row
|
||||||
|
// itself (retry_count column introduced by migration 000016), not in the call
|
||||||
|
// signature.
|
||||||
|
func TestScheduler_NotificationRetryLoop_CallsService(t *testing.T) {
|
||||||
|
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
|
||||||
|
renewalMock := &mockRenewalService{}
|
||||||
|
jobMock := &mockJobService{}
|
||||||
|
agentMock := &mockAgentService{}
|
||||||
|
notificationMock := &mockNotificationService{}
|
||||||
|
networkMock := &mockNetworkScanService{}
|
||||||
|
|
||||||
|
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
|
||||||
|
// Quiet every other loop so only the retry loop's calls are visible on notificationMock.
|
||||||
|
sched.SetRenewalCheckInterval(10 * time.Second)
|
||||||
|
sched.SetJobProcessorInterval(10 * time.Second)
|
||||||
|
sched.SetAgentHealthCheckInterval(10 * time.Second)
|
||||||
|
sched.SetNotificationProcessInterval(10 * time.Second)
|
||||||
|
sched.SetNetworkScanInterval(10 * time.Second)
|
||||||
|
sched.SetJobRetryInterval(10 * time.Second)
|
||||||
|
sched.SetNotificationRetryInterval(50 * time.Millisecond)
|
||||||
|
|
||||||
|
ctx, cancel := context.WithCancel(context.Background())
|
||||||
|
defer cancel()
|
||||||
|
|
||||||
|
startedChan := sched.Start(ctx)
|
||||||
|
<-startedChan
|
||||||
|
|
||||||
|
// Run long enough for the immediate start + at least one tick.
|
||||||
|
time.Sleep(200 * time.Millisecond)
|
||||||
|
cancel()
|
||||||
|
_ = sched.WaitForCompletion(2 * time.Second)
|
||||||
|
|
||||||
|
notificationMock.mu.Lock()
|
||||||
|
retryCount := notificationMock.retryCallCount
|
||||||
|
notificationMock.mu.Unlock()
|
||||||
|
|
||||||
|
if retryCount < 1 {
|
||||||
|
t.Fatalf("expected notification retry service to be called at least once, got %d", retryCount)
|
||||||
|
}
|
||||||
|
t.Logf("notification retry loop called %d times", retryCount)
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestScheduler_NotificationRetryLoop_IdempotencyGuard verifies that a slow
|
||||||
|
// retry sweep does not cause overlapping executions. Mirrors the shape of
|
||||||
|
// TestScheduler_JobRetryLoop_IdempotencyGuard.
|
||||||
|
//
|
||||||
|
// The guard is the atomic.Bool notificationRetryRunning in scheduler.go.
|
||||||
|
// Without it, a 100ms tick against a 150ms operation would fire ~4 times in
|
||||||
|
// 400ms; with the guard we expect ~2–3 calls. Anything above 3 is logged as a
|
||||||
|
// warning (not a hard failure) so CI timing noise doesn't produce flakes.
|
||||||
|
func TestScheduler_NotificationRetryLoop_IdempotencyGuard(t *testing.T) {
|
||||||
|
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
|
||||||
|
renewalMock := &mockRenewalService{}
|
||||||
|
jobMock := &mockJobService{}
|
||||||
|
agentMock := &mockAgentService{}
|
||||||
|
notificationMock := &mockNotificationService{
|
||||||
|
retrySlowDelay: 150 * time.Millisecond, // slower than tick interval
|
||||||
|
}
|
||||||
|
networkMock := &mockNetworkScanService{}
|
||||||
|
|
||||||
|
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
|
||||||
|
sched.SetRenewalCheckInterval(10 * time.Second)
|
||||||
|
sched.SetJobProcessorInterval(10 * time.Second)
|
||||||
|
sched.SetAgentHealthCheckInterval(10 * time.Second)
|
||||||
|
sched.SetNotificationProcessInterval(10 * time.Second)
|
||||||
|
sched.SetNetworkScanInterval(10 * time.Second)
|
||||||
|
sched.SetJobRetryInterval(10 * time.Second)
|
||||||
|
sched.SetNotificationRetryInterval(100 * time.Millisecond)
|
||||||
|
|
||||||
|
ctx, cancel := context.WithCancel(context.Background())
|
||||||
|
defer cancel()
|
||||||
|
|
||||||
|
startedChan := sched.Start(ctx)
|
||||||
|
<-startedChan
|
||||||
|
|
||||||
|
time.Sleep(400 * time.Millisecond)
|
||||||
|
|
||||||
|
notificationMock.mu.Lock()
|
||||||
|
retryCount := notificationMock.retryCallCount
|
||||||
|
notificationMock.mu.Unlock()
|
||||||
|
|
||||||
|
// With a 150ms sweep and 100ms interval, a functioning guard should yield
|
||||||
|
// roughly 2–3 calls (immediate + any ticks whose previous sweep finished).
|
||||||
|
// Anything above 3 suggests the guard isn't holding.
|
||||||
|
if retryCount > 3 {
|
||||||
|
t.Logf("WARNING: retry called %d times in 400ms with 100ms interval and 150ms sweep — guard may not be working", retryCount)
|
||||||
|
}
|
||||||
|
|
||||||
|
t.Logf("notification retry idempotency guard: %d calls in 400ms (100ms interval, 150ms sweep)", retryCount)
|
||||||
|
|
||||||
|
cancel()
|
||||||
|
if err := sched.WaitForCompletion(2 * time.Second); err != nil {
|
||||||
|
t.Fatalf("WaitForCompletion should succeed: %v", err)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestScheduler_NotificationRetryLoop_WaitForCompletion verifies that a retry
|
||||||
|
// sweep still in flight at shutdown is awaited by WaitForCompletion — the same
|
||||||
|
// sync.WaitGroup contract every other loop satisfies. If the loop were to
|
||||||
|
// return early without registering its goroutine on s.wg, this test would
|
||||||
|
// either (a) observe retryCount==0 because the immediate-start sweep was never
|
||||||
|
// launched, or (b) observe WaitForCompletion returning before the in-flight
|
||||||
|
// sweep finished (elapsed < retrySlowDelay).
|
||||||
|
func TestScheduler_NotificationRetryLoop_WaitForCompletion(t *testing.T) {
|
||||||
|
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
|
||||||
|
renewalMock := &mockRenewalService{}
|
||||||
|
jobMock := &mockJobService{}
|
||||||
|
agentMock := &mockAgentService{}
|
||||||
|
notificationMock := &mockNotificationService{
|
||||||
|
retrySlowDelay: 100 * time.Millisecond,
|
||||||
|
}
|
||||||
|
networkMock := &mockNetworkScanService{}
|
||||||
|
|
||||||
|
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
|
||||||
|
sched.SetRenewalCheckInterval(10 * time.Second)
|
||||||
|
sched.SetJobProcessorInterval(10 * time.Second)
|
||||||
|
sched.SetAgentHealthCheckInterval(10 * time.Second)
|
||||||
|
sched.SetNotificationProcessInterval(10 * time.Second)
|
||||||
|
sched.SetNetworkScanInterval(10 * time.Second)
|
||||||
|
sched.SetJobRetryInterval(10 * time.Second)
|
||||||
|
sched.SetNotificationRetryInterval(50 * time.Millisecond)
|
||||||
|
|
||||||
|
ctx, cancel := context.WithCancel(context.Background())
|
||||||
|
defer cancel()
|
||||||
|
|
||||||
|
startedChan := sched.Start(ctx)
|
||||||
|
<-startedChan
|
||||||
|
|
||||||
|
// Let the immediate-start retry goroutine begin its 100ms sweep.
|
||||||
|
time.Sleep(30 * time.Millisecond)
|
||||||
|
|
||||||
|
// Initiate shutdown mid-sweep.
|
||||||
|
cancel()
|
||||||
|
|
||||||
|
start := time.Now()
|
||||||
|
err := sched.WaitForCompletion(5 * time.Second)
|
||||||
|
elapsed := time.Since(start)
|
||||||
|
|
||||||
|
if err != nil {
|
||||||
|
t.Fatalf("WaitForCompletion should not error: %v", err)
|
||||||
|
}
|
||||||
|
if elapsed > 5*time.Second {
|
||||||
|
t.Fatalf("WaitForCompletion took longer than expected: %v", elapsed)
|
||||||
|
}
|
||||||
|
|
||||||
|
notificationMock.mu.Lock()
|
||||||
|
retryCount := notificationMock.retryCallCount
|
||||||
|
notificationMock.mu.Unlock()
|
||||||
|
|
||||||
|
if retryCount < 1 {
|
||||||
|
t.Fatalf("expected notification retry service to have started at least once before shutdown, got %d", retryCount)
|
||||||
|
}
|
||||||
|
t.Logf("notification retry loop graceful shutdown completed in %v after %d in-flight sweep(s)", elapsed, retryCount)
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestScheduler_NotificationRetryLoop_ContextDeadlineRespected verifies that
|
||||||
|
// each tick of the retry loop receives a context with a deadline set. Mirrors
|
||||||
|
// TestScheduler_JobTimeoutLoop_ContextDeadlineRespected.
|
||||||
|
//
|
||||||
|
// The per-tick context.WithTimeout exists so a pathologically slow sweep (e.g.
|
||||||
|
// a misbehaving DB lock) can't stall the rest of the scheduler's shutdown
|
||||||
|
// sequence indefinitely — the wrapping context expires, the sweep returns
|
||||||
|
// ctx.Err(), and the WaitGroup.Done() fires on schedule.
|
||||||
|
func TestScheduler_NotificationRetryLoop_ContextDeadlineRespected(t *testing.T) {
|
||||||
|
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
|
||||||
|
renewalMock := &mockRenewalService{}
|
||||||
|
jobMock := &mockJobService{}
|
||||||
|
agentMock := &mockAgentService{}
|
||||||
|
notificationMock := &mockNotificationService{}
|
||||||
|
networkMock := &mockNetworkScanService{}
|
||||||
|
|
||||||
|
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
|
||||||
|
sched.SetRenewalCheckInterval(10 * time.Second)
|
||||||
|
sched.SetJobProcessorInterval(10 * time.Second)
|
||||||
|
sched.SetAgentHealthCheckInterval(10 * time.Second)
|
||||||
|
sched.SetNotificationProcessInterval(10 * time.Second)
|
||||||
|
sched.SetNetworkScanInterval(10 * time.Second)
|
||||||
|
sched.SetJobRetryInterval(10 * time.Second)
|
||||||
|
sched.SetNotificationRetryInterval(50 * time.Millisecond)
|
||||||
|
|
||||||
|
ctx, cancel := context.WithCancel(context.Background())
|
||||||
|
defer cancel()
|
||||||
|
|
||||||
|
<-sched.Start(ctx)
|
||||||
|
time.Sleep(100 * time.Millisecond)
|
||||||
|
cancel()
|
||||||
|
if err := sched.WaitForCompletion(2 * time.Second); err != nil {
|
||||||
|
t.Fatalf("WaitForCompletion: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
notificationMock.mu.Lock()
|
||||||
|
hasDeadline := notificationMock.retryCtxHasDeadline
|
||||||
|
notificationMock.mu.Unlock()
|
||||||
|
|
||||||
|
if !hasDeadline {
|
||||||
|
t.Fatal("expected notification retry context to have a deadline set, but none found")
|
||||||
|
}
|
||||||
|
t.Log("notification retry context deadline verified")
|
||||||
|
}
|
||||||
|
|||||||
@@ -10,6 +10,40 @@ import (
|
|||||||
"github.com/shankar0123/certctl/internal/repository"
|
"github.com/shankar0123/certctl/internal/repository"
|
||||||
)
|
)
|
||||||
|
|
||||||
|
// I-005 retry + DLQ knobs. These pin the operator-approved retry budget and
|
||||||
|
// the defense-in-depth ceiling on the exponential backoff curve used by
|
||||||
|
// RetryFailedNotifications.
|
||||||
|
//
|
||||||
|
// Values match those the Phase 1 Red tests assert against (see
|
||||||
|
// i005MaxAttempts / i005BackoffCap in notification_test.go:600-608) — the
|
||||||
|
// production identifiers are distinct because this file and its tests share
|
||||||
|
// `package service`, so a single shared name would collide at compile time.
|
||||||
|
// The test comment explicitly notes "Phase 2 is free to thread this from
|
||||||
|
// config"; when that wiring lands, these become package-level defaults the
|
||||||
|
// scheduler can override. For now they are the single source of truth.
|
||||||
|
const (
|
||||||
|
// notifRetryMaxAttempts is the attempt budget *before* the current
|
||||||
|
// attempt: a row at retry_count == notifRetryMaxAttempts-1 that fails
|
||||||
|
// this tick transitions to 'dead' instead of being re-armed. The
|
||||||
|
// repository's ListRetryEligible filter also uses this value as a
|
||||||
|
// guard (`AND retry_count < $2`) so a DLQ row is never re-swept.
|
||||||
|
notifRetryMaxAttempts = 5
|
||||||
|
|
||||||
|
// notifRetryBackoffCap is the 1h ceiling on `2^retry_count` minutes.
|
||||||
|
// With max_attempts=5 the deepest actually-schedulable wait is 2^3=8m
|
||||||
|
// (retry_count=3 → 8m, then retry_count=4 → 'dead'), so the cap is a
|
||||||
|
// ceiling-assertion today — but it must stay in place so a later
|
||||||
|
// increase in max_attempts cannot push next_retry_at past 1h without
|
||||||
|
// an explicit policy decision.
|
||||||
|
notifRetryBackoffCap = time.Hour
|
||||||
|
|
||||||
|
// notifRetrySweepLimit caps a single retry tick at this many rows so
|
||||||
|
// a large burst of dead-letter-bound mail cannot monopolize the 2m
|
||||||
|
// tick budget. Mirrors the 1000-row cap on ProcessPendingNotifications
|
||||||
|
// at notification.go:244 for operational symmetry.
|
||||||
|
notifRetrySweepLimit = 1000
|
||||||
|
)
|
||||||
|
|
||||||
// NotificationService provides business logic for managing notifications.
|
// NotificationService provides business logic for managing notifications.
|
||||||
type NotificationService struct {
|
type NotificationService struct {
|
||||||
notifRepo repository.NotificationRepository
|
notifRepo repository.NotificationRepository
|
||||||
@@ -373,3 +407,211 @@ func (s *NotificationService) GetNotification(ctx context.Context, id string) (*
|
|||||||
func (s *NotificationService) MarkAsRead(ctx context.Context, id string) error {
|
func (s *NotificationService) MarkAsRead(ctx context.Context, id string) error {
|
||||||
return s.notifRepo.UpdateStatus(ctx, id, "read", time.Now())
|
return s.notifRepo.UpdateStatus(ctx, id, "read", time.Now())
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ─── I-005 retry + DLQ surface (Phase 2 Green) ───────────────────────────
|
||||||
|
//
|
||||||
|
// The three methods below close the retry loop the Phase 1 Red tests pin at
|
||||||
|
// notification_test.go:600-917 and notification_handler_test.go:443-519:
|
||||||
|
//
|
||||||
|
// 1. RetryFailedNotifications — scheduler entry point. Pulls failed rows
|
||||||
|
// whose next_retry_at has elapsed, retries delivery, rewrites retry
|
||||||
|
// bookkeeping per the pre-increment backoff contract, and transitions
|
||||||
|
// exhausted rows to 'dead' (DLQ). Per-row errors never bubble — a
|
||||||
|
// single bad recipient cannot stall the tick. Mirrors the ordering
|
||||||
|
// the ProcessPendingNotifications loop uses at notification.go:242.
|
||||||
|
//
|
||||||
|
// 2. RequeueNotification — operator-driven escape hatch from 'dead' back
|
||||||
|
// to 'pending'. Pass-through to the repo's Requeue method with clean
|
||||||
|
// error wrapping so repo-layer failures ("pg: deadlock detected")
|
||||||
|
// surface in the UI instead of silently succeeding.
|
||||||
|
//
|
||||||
|
// 3. ListNotificationsByStatus — Dead letter tab support. Thin filter
|
||||||
|
// wrapper around the existing List query; the Phase 2 Green handler
|
||||||
|
// routes `?status=…` through this method while preserving the
|
||||||
|
// unfiltered path through ListNotifications (handler_test pins both).
|
||||||
|
//
|
||||||
|
// Sibling scheduler loops I-001 (job retry) and I-003 (job timeout) already
|
||||||
|
// ship the 10-loop topology these methods plug into; the 11th loop added
|
||||||
|
// by this milestone calls RetryFailedNotifications on a 2m tick, matching
|
||||||
|
// the CERTCTL_NOTIFICATION_RETRY_INTERVAL default pinned in config/
|
||||||
|
// scheduler Phase 2 Green edits that follow this one.
|
||||||
|
|
||||||
|
// RetryFailedNotifications is the scheduler entry point for the I-005
|
||||||
|
// retry sweep. Semantics (pinned by notification_test.go:635-843):
|
||||||
|
//
|
||||||
|
// - A ListRetryEligible failure short-circuits with a wrapped error so
|
||||||
|
// the caller's tick counter reflects the outage. Crucially, zero
|
||||||
|
// notifier.Send calls fire in this path — we never got a canonical
|
||||||
|
// set of rows, and issuing any sends risks double-delivery when the
|
||||||
|
// DB comes back.
|
||||||
|
//
|
||||||
|
// - Per-row failures are logged but NEVER returned. That contract comes
|
||||||
|
// straight from ProcessPendingNotifications (notification.go:242-267);
|
||||||
|
// the retry loop inherits it so a single 4xx response can't freeze
|
||||||
|
// every downstream row in the sweep.
|
||||||
|
//
|
||||||
|
// - Success promotes the row directly to 'sent' via UpdateStatus. The
|
||||||
|
// retry_count field is *not* incremented on success — that would
|
||||||
|
// falsify the audit-trail signal "this row was delivered on attempt
|
||||||
|
// N". The mock's UpdateStatus does a plain status write with no retry
|
||||||
|
// mutation (testutil_test.go:446-459), matching the postgres impl.
|
||||||
|
//
|
||||||
|
// - Failure uses pre-increment exponential backoff:
|
||||||
|
// wait = min(2^retry_count * time.Minute, notifRetryBackoffCap)
|
||||||
|
// where retry_count is the row's value *before* this attempt. The
|
||||||
|
// repo layer's RecordFailedAttempt then increments retry_count by 1
|
||||||
|
// server-side. This asymmetry keeps the service stateless — the
|
||||||
|
// service reads retry_count to compute the wait, but never writes it
|
||||||
|
// directly; the write is exclusively the repo's responsibility.
|
||||||
|
//
|
||||||
|
// - Exhaustion transitions to 'dead' when retry_count >= max-1, because
|
||||||
|
// RecordFailedAttempt's ++ would push retry_count to max and the next
|
||||||
|
// sweep's `retry_count < max` filter in ListRetryEligible would then
|
||||||
|
// silently skip the row forever (a zombie-failed row nobody sees).
|
||||||
|
// MarkAsDead clears next_retry_at to evict the row from the partial
|
||||||
|
// retry-sweep index as well, so the sweep stops scanning past dead rows.
|
||||||
|
//
|
||||||
|
// - A row whose Channel has no registered notifier is promoted to
|
||||||
|
// 'sent' (demo-mode parity with sendNotification's fallback at
|
||||||
|
// notification.go:272-279). This branch should not normally fire for
|
||||||
|
// retry rows — they were created *by* a notifier that failed — but
|
||||||
|
// defensive handling guards against config drift (notifier disabled
|
||||||
|
// between Create and retry) that would otherwise wedge the row.
|
||||||
|
func (s *NotificationService) RetryFailedNotifications(ctx context.Context) error {
|
||||||
|
now := time.Now()
|
||||||
|
|
||||||
|
rows, err := s.notifRepo.ListRetryEligible(ctx, now, notifRetryMaxAttempts, notifRetrySweepLimit)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to list retry-eligible notifications: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
for _, row := range rows {
|
||||||
|
if row == nil {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
notifier, ok := s.notifierRegistry[string(row.Channel)]
|
||||||
|
if !ok {
|
||||||
|
// No notifier wired for this channel — promote to 'sent' to
|
||||||
|
// avoid looping forever over a row that has nowhere to go.
|
||||||
|
// See notification.go:272-279 for the sibling demo-mode path.
|
||||||
|
if updateErr := s.notifRepo.UpdateStatus(ctx, row.ID, string(domain.NotificationStatusSent), time.Now()); updateErr != nil {
|
||||||
|
slog.Error("failed to promote retry row with missing notifier to sent",
|
||||||
|
"notification_id", row.ID, "channel", row.Channel, "error", updateErr)
|
||||||
|
}
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
sendErr := notifier.Send(ctx, row.Recipient, string(row.Type), row.Message)
|
||||||
|
if sendErr == nil {
|
||||||
|
// Success: promote straight to 'sent' without touching
|
||||||
|
// retry_count — the audit trail must preserve "this row was
|
||||||
|
// delivered on attempt N", and the mock's UpdateStatus is a
|
||||||
|
// plain status write (no retry_count reset). Errors here are
|
||||||
|
// logged, never returned.
|
||||||
|
if updateErr := s.notifRepo.UpdateStatus(ctx, row.ID, string(domain.NotificationStatusSent), time.Now()); updateErr != nil {
|
||||||
|
slog.Error("failed to mark retried notification as sent",
|
||||||
|
"notification_id", row.ID, "error", updateErr)
|
||||||
|
}
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
// Failure path. Compute pre-increment backoff first so the
|
||||||
|
// exhaustion branch and the reschedule branch see an identical
|
||||||
|
// `wait` derivation — easier to audit against the test window
|
||||||
|
// assertions at notification_test.go:739-743 and :796-801.
|
||||||
|
wait := time.Duration(1<<row.RetryCount) * time.Minute
|
||||||
|
if wait > notifRetryBackoffCap {
|
||||||
|
wait = notifRetryBackoffCap
|
||||||
|
}
|
||||||
|
|
||||||
|
// Exhaustion: this attempt consumes the final slot of the attempt
|
||||||
|
// budget. Transition to 'dead' and let MarkAsDead clear
|
||||||
|
// next_retry_at so the retry-sweep index stops hitting the row.
|
||||||
|
if row.RetryCount >= notifRetryMaxAttempts-1 {
|
||||||
|
if markErr := s.notifRepo.MarkAsDead(ctx, row.ID, sendErr.Error()); markErr != nil {
|
||||||
|
slog.Error("failed to mark exhausted notification as dead",
|
||||||
|
"notification_id", row.ID, "retry_count", row.RetryCount,
|
||||||
|
"send_error", sendErr, "mark_error", markErr)
|
||||||
|
}
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
// Non-terminal: hand the lastError + nextRetryAt off to the repo,
|
||||||
|
// which increments retry_count by exactly 1 and keeps the row in
|
||||||
|
// 'failed' state so the next tick picks it up.
|
||||||
|
nextRetryAt := time.Now().Add(wait)
|
||||||
|
if recErr := s.notifRepo.RecordFailedAttempt(ctx, row.ID, sendErr.Error(), nextRetryAt); recErr != nil {
|
||||||
|
slog.Error("failed to record notification retry attempt",
|
||||||
|
"notification_id", row.ID, "retry_count", row.RetryCount,
|
||||||
|
"next_retry_at", nextRetryAt, "send_error", sendErr, "record_error", recErr)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// RequeueNotification is the operator-driven escape hatch from 'dead' back
|
||||||
|
// to 'pending'. It resets all retry bookkeeping — retry_count → 0,
|
||||||
|
// next_retry_at → NULL, last_error → NULL — so ProcessPendingNotifications
|
||||||
|
// treats the requeued row as a fresh attempt on its next tick. Identical on
|
||||||
|
// the wire to a newly-created notification.
|
||||||
|
//
|
||||||
|
// Behavior contract (pinned by notification_test.go:849-917):
|
||||||
|
//
|
||||||
|
// - Success path delegates to the repo's Requeue, which performs the
|
||||||
|
// status/retry_count/next_retry_at/last_error reset atomically. The
|
||||||
|
// service adds no extra bookkeeping; the audit trail already captures
|
||||||
|
// the transition via the upstream API call.
|
||||||
|
//
|
||||||
|
// - Error path wraps the repo error with context so a failure like
|
||||||
|
// "pg: deadlock detected" surfaces in the handler response and the
|
||||||
|
// operator UI. The service has no fallback — a silent "success" that
|
||||||
|
// didn't actually mutate the row would be worse than a loud error.
|
||||||
|
func (s *NotificationService) RequeueNotification(ctx context.Context, id string) error {
|
||||||
|
if err := s.notifRepo.Requeue(ctx, id); err != nil {
|
||||||
|
return fmt.Errorf("failed to requeue notification: %w", err)
|
||||||
|
}
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// ListNotificationsByStatus returns paginated notifications filtered by
|
||||||
|
// status. It mirrors ListNotifications's shape but threads a Status filter
|
||||||
|
// into the NotificationFilter so the Phase 2 Green handler can route
|
||||||
|
// `?status=dead` (Dead letter tab) through this method while keeping the
|
||||||
|
// unfiltered path on ListNotifications for backward compat.
|
||||||
|
//
|
||||||
|
// Pinned by notification_handler_test.go:443-519 — the handler test asserts
|
||||||
|
// that a request with `?status=dead&page=1&per_page=50` lands on exactly
|
||||||
|
// this signature (`status string, page, perPage int`) and that requests
|
||||||
|
// without a status param do NOT call it. Keep the returned shape identical
|
||||||
|
// to ListNotifications so the handler can reuse its JSON-encoding path.
|
||||||
|
func (s *NotificationService) ListNotificationsByStatus(ctx context.Context, status string, page, perPage int) ([]domain.NotificationEvent, int64, error) {
|
||||||
|
if page < 1 {
|
||||||
|
page = 1
|
||||||
|
}
|
||||||
|
if perPage < 1 {
|
||||||
|
perPage = 50
|
||||||
|
}
|
||||||
|
|
||||||
|
filter := &repository.NotificationFilter{
|
||||||
|
Status: status,
|
||||||
|
Page: page,
|
||||||
|
PerPage: perPage,
|
||||||
|
}
|
||||||
|
|
||||||
|
notifications, err := s.notifRepo.List(ctx, filter)
|
||||||
|
if err != nil {
|
||||||
|
return nil, 0, fmt.Errorf("failed to list notifications by status: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
result := make([]domain.NotificationEvent, 0, len(notifications))
|
||||||
|
for _, n := range notifications {
|
||||||
|
if n != nil {
|
||||||
|
result = append(result, *n)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
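// total counts the rows List returned for this page; it may be lower than
// the number of rows matching the status overall.
total := int64(len(result))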
|
||||||
|
return result, total, nil
|
||||||
|
}
|
||||||
|
|||||||
@@ -565,3 +565,353 @@ func TestGetNotificationHistory(t *testing.T) {
|
|||||||
func stringPtr(s string) *string {
|
func stringPtr(s string) *string {
|
||||||
return &s
|
return &s
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ─── I-005 retry + DLQ service contract (Phase 1 Red) ─────────────────────
|
||||||
|
//
|
||||||
|
// These tests pin the service-layer contract the I-005 fix must satisfy. The
|
||||||
|
// Red signals they produce are, in compile order:
|
||||||
|
//
|
||||||
|
// 1. service.NotificationService.RetryFailedNotifications undefined
|
||||||
|
// 2. service.NotificationService.RequeueNotification undefined
|
||||||
|
// 3. mockNotifRepo.ListRetryEligible undefined (surfaced after the service
|
||||||
|
// method exists and starts calling it)
|
||||||
|
// 4. mockNotifRepo.RecordFailedAttempt undefined
|
||||||
|
// 5. mockNotifRepo.MarkAsDead undefined
|
||||||
|
// 6. mockNotifRepo.Requeue undefined
|
||||||
|
// 7. NotificationEvent.RetryCount / NextRetryAt / LastError undefined — but
|
||||||
|
// domain/notification_test.go already pins these, so they ride in on the
|
||||||
|
// Phase 2 Green domain edit and compile by the time the service-layer
|
||||||
|
// tests run.
|
||||||
|
//
|
||||||
|
// The contract under test, re-derived from notification.go:282-288:
|
||||||
|
// * A failed notifier.Send used to stamp status='failed' with a zero
|
||||||
|
// time.Time and return. I-005 reframes that row as retry-eligible with
|
||||||
|
// bookkeeping (retry_count, next_retry_at, last_error) so a sibling
|
||||||
|
// scheduler loop can promote it back to 'pending' until max_attempts,
|
||||||
|
// then to 'dead' (DLQ) for operator triage.
|
||||||
|
// * Backoff is 2^retry_count minutes, capped at 1h, mirroring the
|
||||||
|
// operator decision captured in the I-005 design notes.
|
||||||
|
// * Success on a retry promotes the row straight to 'sent' via
|
||||||
|
// UpdateStatus (no retry bookkeeping change).
|
||||||
|
// * Requeue is the operator-driven escape hatch from 'dead' back to
|
||||||
|
// 'pending' with retry_count reset to 0; service-layer impl is a
|
||||||
|
// pass-through to repo.Requeue so the audit trail is consistent.
|
||||||
|
|
||||||
|
const (
|
||||||
|
// i005MaxAttempts must match the same constant used by the Green
|
||||||
|
// service implementation. Declared here only so the test assertions
|
||||||
|
// read cleanly; Phase 2 is free to thread this from config.
|
||||||
|
i005MaxAttempts = 5
|
||||||
|
|
||||||
|
// i005BackoffCap mirrors the 1h ceiling on 2^retry_count minutes.
|
||||||
|
i005BackoffCap = time.Hour
|
||||||
|
)
|
||||||
|
|
||||||
|
// newFailedNotification builds a minimal failed-state row suitable for seeding
|
||||||
|
// the mock repo. retry_count is the number of attempts already consumed (so
|
||||||
|
// the next attempt becomes retry_count+1, and retry_count == max-1 puts the
|
||||||
|
// row at the exhaustion threshold).
|
||||||
|
func newFailedNotification(id string, retryCount int, nextRetryAt time.Time) *domain.NotificationEvent {
|
||||||
|
nextCopy := nextRetryAt
|
||||||
|
last := "connection refused"
|
||||||
|
return &domain.NotificationEvent{
|
||||||
|
ID: id,
|
||||||
|
Type: domain.NotificationTypeExpirationWarning,
|
||||||
|
Channel: domain.NotificationChannelEmail,
|
||||||
|
Recipient: "owner-i005@example.com",
|
||||||
|
Message: "retry me: " + id,
|
||||||
|
Status: string(domain.NotificationStatusFailed),
|
||||||
|
RetryCount: retryCount,
|
||||||
|
NextRetryAt: &nextCopy,
|
||||||
|
LastError: &last,
|
||||||
|
CreatedAt: time.Now().Add(-time.Hour),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationService_RetryFailedNotifications_NoEligibleRows asserts the
|
||||||
|
// no-op path: an empty retry queue must not trigger any notifier.Send calls
|
||||||
|
// and must not surface as an error. This pins that the retry loop's cost is
|
||||||
|
// O(retry-eligible), not O(total).
|
||||||
|
func TestNotificationService_RetryFailedNotifications_NoEligibleRows(t *testing.T) {
|
||||||
|
ctx := context.Background()
|
||||||
|
notifRepo := newMockNotificationRepository()
|
||||||
|
notifier := newMockNotifier()
|
||||||
|
registry := map[string]Notifier{"Email": notifier}
|
||||||
|
svc := NewNotificationService(notifRepo, registry)
|
||||||
|
|
||||||
|
if err := svc.RetryFailedNotifications(ctx); err != nil {
|
||||||
|
t.Fatalf("RetryFailedNotifications on empty queue returned error: %v", err)
|
||||||
|
}
|
||||||
|
if got := notifier.getSentCount(); got != 0 {
|
||||||
|
t.Errorf("notifier.Send call count = %d, want 0 (no retry-eligible rows)", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationService_RetryFailedNotifications_ListError asserts that a
|
||||||
|
// ListRetryEligible failure short-circuits the loop. Notifier.Send must not
|
||||||
|
// fire — we never got a canonical set of rows to act on, so sending anything
|
||||||
|
// would risk double-delivery when the DB comes back.
|
||||||
|
func TestNotificationService_RetryFailedNotifications_ListError(t *testing.T) {
|
||||||
|
ctx := context.Background()
|
||||||
|
notifRepo := newMockNotificationRepository()
|
||||||
|
notifRepo.ListErr = fmt.Errorf("simulated DB outage")
|
||||||
|
|
||||||
|
notifier := newMockNotifier()
|
||||||
|
registry := map[string]Notifier{"Email": notifier}
|
||||||
|
svc := NewNotificationService(notifRepo, registry)
|
||||||
|
|
||||||
|
err := svc.RetryFailedNotifications(ctx)
|
||||||
|
if err == nil {
|
||||||
|
t.Fatalf("RetryFailedNotifications must surface the list error; got nil")
|
||||||
|
}
|
||||||
|
if !strings.Contains(err.Error(), "simulated DB outage") {
|
||||||
|
t.Errorf("expected wrapped list error to mention 'simulated DB outage', got: %v", err)
|
||||||
|
}
|
||||||
|
if got := notifier.getSentCount(); got != 0 {
|
||||||
|
t.Errorf("notifier.Send must not fire when list fails; got %d sends", got)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationService_RetryFailedNotifications_SuccessPromotes asserts
|
||||||
|
// the happy path for a retry that succeeds: the row is promoted directly to
|
||||||
|
// 'sent' via UpdateStatus (mirroring ProcessPendingNotifications), and no
|
||||||
|
// retry bookkeeping mutation (RecordFailedAttempt / MarkAsDead) fires.
|
||||||
|
func TestNotificationService_RetryFailedNotifications_SuccessPromotes(t *testing.T) {
|
||||||
|
ctx := context.Background()
|
||||||
|
notifRepo := newMockNotificationRepository()
|
||||||
|
notifier := newMockNotifier() // default: no error — Send succeeds
|
||||||
|
registry := map[string]Notifier{"Email": notifier}
|
||||||
|
svc := NewNotificationService(notifRepo, registry)
|
||||||
|
|
||||||
|
row := newFailedNotification("notif-success", 2, time.Now().Add(-time.Minute))
|
||||||
|
notifRepo.AddNotification(row)
|
||||||
|
|
||||||
|
if err := svc.RetryFailedNotifications(ctx); err != nil {
|
||||||
|
t.Fatalf("RetryFailedNotifications should not error on per-row success: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if notifier.getSentCount() != 1 {
|
||||||
|
t.Errorf("expected exactly 1 notifier.Send call, got %d", notifier.getSentCount())
|
||||||
|
}
|
||||||
|
if row.Status != string(domain.NotificationStatusSent) {
|
||||||
|
t.Errorf("successful retry must promote status to 'sent', got %q", row.Status)
|
||||||
|
}
|
||||||
|
// retry_count must NOT increment on success — that would falsify the
|
||||||
|
// "this row was delivered on attempt N" signal the audit trail relies on.
|
||||||
|
if row.RetryCount != 2 {
|
||||||
|
t.Errorf("retry_count must not change on success, got %d (want 2)", row.RetryCount)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationService_RetryFailedNotifications_ExponentialBackoff asserts
|
||||||
|
// that a still-retriable failure schedules the next attempt at 2^retry_count
|
||||||
|
// minutes from now, on the operator-approved curve 1m, 2m, 4m, 8m (16m is never scheduled at max_attempts=5).
|
||||||
|
// The assertion is a window check against time.Now() because the service
|
||||||
|
// reads its own clock.
|
||||||
|
func TestNotificationService_RetryFailedNotifications_ExponentialBackoff(t *testing.T) {
|
||||||
|
ctx := context.Background()
|
||||||
|
notifRepo := newMockNotificationRepository()
|
||||||
|
notifier := newMockNotifier()
|
||||||
|
notifier.SendErr = fmt.Errorf("smtp 451 temporary failure")
|
||||||
|
registry := map[string]Notifier{"Email": notifier}
|
||||||
|
svc := NewNotificationService(notifRepo, registry)
|
||||||
|
|
||||||
|
// retry_count=2 → next attempt is #3, backoff = 2^2 = 4 minutes.
|
||||||
|
row := newFailedNotification("notif-backoff", 2, time.Now().Add(-time.Minute))
|
||||||
|
notifRepo.AddNotification(row)
|
||||||
|
|
||||||
|
before := time.Now()
|
||||||
|
if err := svc.RetryFailedNotifications(ctx); err != nil {
|
||||||
|
t.Fatalf("RetryFailedNotifications should not bubble per-row send errors: %v", err)
|
||||||
|
}
|
||||||
|
after := time.Now()
|
||||||
|
|
||||||
|
// Still in 'failed' — not yet exhausted (retry_count+1 = 3, below max 5).
|
||||||
|
if row.Status != string(domain.NotificationStatusFailed) {
|
||||||
|
t.Errorf("status after non-terminal retry must stay 'failed', got %q", row.Status)
|
||||||
|
}
|
||||||
|
if row.RetryCount != 3 {
|
||||||
|
t.Errorf("retry_count must increment on failure, got %d (want 3)", row.RetryCount)
|
||||||
|
}
|
||||||
|
if row.NextRetryAt == nil {
|
||||||
|
t.Fatalf("NextRetryAt must be set on non-terminal retry failure; got nil")
|
||||||
|
}
|
||||||
|
expectedMin := before.Add(4 * time.Minute)
|
||||||
|
expectedMax := after.Add(4 * time.Minute)
|
||||||
|
if row.NextRetryAt.Before(expectedMin) || row.NextRetryAt.After(expectedMax) {
|
||||||
|
t.Errorf("NextRetryAt outside 2^2=4m window [%v, %v]; got %v",
|
||||||
|
expectedMin, expectedMax, *row.NextRetryAt)
|
||||||
|
}
|
||||||
|
if row.LastError == nil || !strings.Contains(*row.LastError, "smtp 451 temporary failure") {
|
||||||
|
t.Errorf("LastError must preserve the notifier error body for triage; got %v", row.LastError)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationService_RetryFailedNotifications_BackoffCap asserts the
|
||||||
|
// defense-in-depth 1h ceiling on next_retry_at. The retry curve under the
|
||||||
|
// operator-approved formula is pre-increment `2^retry_count` minutes — 1m,
|
||||||
|
// 2m, 4m, 8m — and with max_attempts=5 the deepest still-retriable row is
|
||||||
|
// retry_count=4 (next wait = 2^4 = 16m), which would transition to 'dead'
|
||||||
|
// before ever scheduling. So the largest actually-schedulable wait is
|
||||||
|
// 2^3=8m at retry_count=3, well under the 1h cap.
|
||||||
|
//
|
||||||
|
// That makes this test a ceiling-assertion, not a saturation-assertion: we
|
||||||
|
// pick retry_count=3 (matching ExponentialBackoff's formula but one step
|
||||||
|
// deeper) and verify (a) the window lands at 2^3=8m and (b) the cap is
|
||||||
|
// never exceeded. When max_attempts becomes configurable in a later
|
||||||
|
// milestone, this test becomes the natural home for a true cap-saturation
|
||||||
|
// fixture; for now it pins the arithmetic the Phase 2 Green implementation
|
||||||
|
// has to hit exactly.
|
||||||
|
func TestNotificationService_RetryFailedNotifications_BackoffCap(t *testing.T) {
|
||||||
|
ctx := context.Background()
|
||||||
|
notifRepo := newMockNotificationRepository()
|
||||||
|
notifier := newMockNotifier()
|
||||||
|
notifier.SendErr = fmt.Errorf("webhook 502 bad gateway")
|
||||||
|
registry := map[string]Notifier{"Email": notifier}
|
||||||
|
svc := NewNotificationService(notifRepo, registry)
|
||||||
|
|
||||||
|
// retry_count=3 → pre-increment wait = 2^3 = 8 minutes. Post-increment
|
||||||
|
// retry_count becomes 4, which is still below max_attempts=5, so the
|
||||||
|
// row stays in 'failed' rather than transitioning to 'dead'.
|
||||||
|
row := newFailedNotification("notif-backoff-cap", 3, time.Now().Add(-time.Minute))
|
||||||
|
notifRepo.AddNotification(row)
|
||||||
|
|
||||||
|
before := time.Now()
|
||||||
|
if err := svc.RetryFailedNotifications(ctx); err != nil {
|
||||||
|
t.Fatalf("RetryFailedNotifications should not bubble per-row send errors: %v", err)
|
||||||
|
}
|
||||||
|
after := time.Now()
|
||||||
|
|
||||||
|
if row.Status != string(domain.NotificationStatusFailed) {
|
||||||
|
t.Errorf("mid-retry status must stay 'failed', got %q", row.Status)
|
||||||
|
}
|
||||||
|
if row.RetryCount != 4 {
|
||||||
|
t.Errorf("retry_count must increment on failure, got %d (want 4)", row.RetryCount)
|
||||||
|
}
|
||||||
|
if row.NextRetryAt == nil {
|
||||||
|
t.Fatalf("NextRetryAt must be set; got nil")
|
||||||
|
}
|
||||||
|
// retry_count=3 → pre-increment 2^3 = 8m, matching the curve pinned by
|
||||||
|
// ExponentialBackoff (retry_count=2 → 2^2=4m).
|
||||||
|
expectedMin := before.Add(8 * time.Minute)
|
||||||
|
expectedMax := after.Add(8 * time.Minute)
|
||||||
|
if row.NextRetryAt.Before(expectedMin) || row.NextRetryAt.After(expectedMax) {
|
||||||
|
t.Errorf("NextRetryAt outside 2^3=8m window [%v, %v]; got %v",
|
||||||
|
expectedMin, expectedMax, *row.NextRetryAt)
|
||||||
|
}
|
||||||
|
// And regardless of retry_count, the ceiling must hold: next_retry_at
|
||||||
|
// must never be more than i005BackoffCap (1h) from now. This is the
|
||||||
|
// defense-in-depth assertion — it would fail loudly if a future
|
||||||
|
// refactor swapped to post-increment and overshot on a deeper row.
|
||||||
|
if row.NextRetryAt.After(after.Add(i005BackoffCap + time.Second)) {
|
||||||
|
t.Errorf("NextRetryAt violates 1h cap; scheduled %v in the future",
|
||||||
|
row.NextRetryAt.Sub(after))
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationService_RetryFailedNotifications_MarkDeadOnExhaustion
|
||||||
|
// asserts the terminal transition: once retry_count reaches max_attempts,
|
||||||
|
// the row moves to 'dead' (DLQ) and stops participating in the retry sweep.
|
||||||
|
// next_retry_at must be cleared — otherwise the partial retry-sweep index
|
||||||
|
// would still pick it up and we'd loop forever.
|
||||||
|
func TestNotificationService_RetryFailedNotifications_MarkDeadOnExhaustion(t *testing.T) {
|
||||||
|
ctx := context.Background()
|
||||||
|
notifRepo := newMockNotificationRepository()
|
||||||
|
notifier := newMockNotifier()
|
||||||
|
notifier.SendErr = fmt.Errorf("connection refused after max attempts")
|
||||||
|
registry := map[string]Notifier{"Email": notifier}
|
||||||
|
svc := NewNotificationService(notifRepo, registry)
|
||||||
|
|
||||||
|
// retry_count = max-1: this attempt makes it max, so the row must
|
||||||
|
// transition to 'dead', not get rescheduled.
|
||||||
|
row := newFailedNotification("notif-dead", i005MaxAttempts-1, time.Now().Add(-time.Minute))
|
||||||
|
notifRepo.AddNotification(row)
|
||||||
|
|
||||||
|
if err := svc.RetryFailedNotifications(ctx); err != nil {
|
||||||
|
t.Fatalf("RetryFailedNotifications must not bubble per-row exhaustion: %v", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if row.Status != string(domain.NotificationStatusDead) {
|
||||||
|
t.Errorf("exhausted row must be in 'dead' status, got %q", row.Status)
|
||||||
|
}
|
||||||
|
if row.NextRetryAt != nil {
|
||||||
|
t.Errorf("dead row must have next_retry_at cleared (else retry sweep keeps picking it up); got %v", *row.NextRetryAt)
|
||||||
|
}
|
||||||
|
if row.LastError == nil || !strings.Contains(*row.LastError, "connection refused after max attempts") {
|
||||||
|
t.Errorf("LastError on dead row must preserve final failure reason; got %v", row.LastError)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationService_RequeueNotification_Success asserts the operator
|
||||||
|
// escape hatch: Requeue flips a dead row back to 'pending' with
|
||||||
|
// retry_count=0 so ProcessPendingNotifications can pick it up on the very
|
||||||
|
// next tick. The service delegates to repo.Requeue and propagates no error.
|
||||||
|
func TestNotificationService_RequeueNotification_Success(t *testing.T) {
|
||||||
|
ctx := context.Background()
|
||||||
|
notifRepo := newMockNotificationRepository()
|
||||||
|
registry := map[string]Notifier{"Email": newMockNotifier()}
|
||||||
|
svc := NewNotificationService(notifRepo, registry)
|
||||||
|
|
||||||
|
next := time.Now().Add(10 * time.Minute)
|
||||||
|
last := "max attempts exceeded"
|
||||||
|
dead := &domain.NotificationEvent{
|
||||||
|
ID: "notif-requeue",
|
||||||
|
Type: domain.NotificationTypeExpirationWarning,
|
||||||
|
Channel: domain.NotificationChannelEmail,
|
||||||
|
Recipient: "owner@example.com",
|
||||||
|
Message: "please requeue me",
|
||||||
|
Status: string(domain.NotificationStatusDead),
|
||||||
|
RetryCount: i005MaxAttempts,
|
||||||
|
NextRetryAt: &next,
|
||||||
|
LastError: &last,
|
||||||
|
CreatedAt: time.Now().Add(-2 * time.Hour),
|
||||||
|
}
|
||||||
|
notifRepo.AddNotification(dead)
|
||||||
|
|
||||||
|
if err := svc.RequeueNotification(ctx, dead.ID); err != nil {
|
||||||
|
t.Fatalf("RequeueNotification(%s) returned error: %v", dead.ID, err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if dead.Status != string(domain.NotificationStatusPending) {
|
||||||
|
t.Errorf("Requeue must flip status to 'pending', got %q", dead.Status)
|
||||||
|
}
|
||||||
|
if dead.RetryCount != 0 {
|
||||||
|
t.Errorf("Requeue must reset retry_count to 0, got %d", dead.RetryCount)
|
||||||
|
}
|
||||||
|
if dead.NextRetryAt != nil {
|
||||||
|
t.Errorf("Requeue must clear next_retry_at (pending rows never have it), got %v", *dead.NextRetryAt)
|
||||||
|
}
|
||||||
|
if dead.LastError != nil {
|
||||||
|
t.Errorf("Requeue must clear last_error (pending is a fresh attempt), got %v", *dead.LastError)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// TestNotificationService_RequeueNotification_RepoError asserts that a
|
||||||
|
// failed Requeue at the repository layer surfaces cleanly. The service has
|
||||||
|
// no fallback here — if the DB can't update the row, the operator action
|
||||||
|
// must fail loudly rather than silently "succeed" in the UI.
|
||||||
|
func TestNotificationService_RequeueNotification_RepoError(t *testing.T) {
|
||||||
|
ctx := context.Background()
|
||||||
|
notifRepo := newMockNotificationRepository()
|
||||||
|
notifRepo.UpdateErr = fmt.Errorf("pg: deadlock detected")
|
||||||
|
registry := map[string]Notifier{"Email": newMockNotifier()}
|
||||||
|
svc := NewNotificationService(notifRepo, registry)
|
||||||
|
|
||||||
|
// Seed a dead row so the service has something to act on (the error
|
||||||
|
// must come from the repo write, not from a missing ID).
|
||||||
|
dead := &domain.NotificationEvent{
|
||||||
|
ID: "notif-requeue-err",
|
||||||
|
Type: domain.NotificationTypeExpirationWarning,
|
||||||
|
Channel: domain.NotificationChannelEmail,
|
||||||
|
Status: string(domain.NotificationStatusDead),
|
||||||
|
}
|
||||||
|
notifRepo.AddNotification(dead)
|
||||||
|
|
||||||
|
err := svc.RequeueNotification(ctx, dead.ID)
|
||||||
|
if err == nil {
|
||||||
|
t.Fatalf("RequeueNotification must surface repo errors; got nil")
|
||||||
|
}
|
||||||
|
if !strings.Contains(err.Error(), "pg: deadlock detected") {
|
||||||
|
t.Errorf("expected wrapped repo error to mention 'pg: deadlock detected', got: %v", err)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|||||||
+46
-11
@@ -15,6 +15,12 @@ type StatsService struct {
|
|||||||
certRepo repository.CertificateRepository
|
certRepo repository.CertificateRepository
|
||||||
jobRepo repository.JobRepository
|
jobRepo repository.JobRepository
|
||||||
agentRepo repository.AgentRepository
|
agentRepo repository.AgentRepository
|
||||||
|
// notifRepo is injected post-construction via SetNotifRepo so that
|
||||||
|
// NewStatsService's nine call sites (main.go + stats_test.go + 8 digest
|
||||||
|
// tests) keep their existing signatures. When nil, the dead-letter count
|
||||||
|
// falls through to zero — see GetDashboardSummary. I-005 coverage-gap
|
||||||
|
// closure.
|
||||||
|
notifRepo repository.NotificationRepository
|
||||||
}
|
}
|
||||||
|
|
||||||
// NewStatsService creates a new stats service.
|
// NewStatsService creates a new stats service.
|
||||||
@@ -30,19 +36,35 @@ func NewStatsService(
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// SetNotifRepo injects the notification repository used to populate
|
||||||
|
// DashboardSummary.NotificationsDead. Setter pattern (matching the
|
||||||
|
// certificateService.SetTargetRepo / SetProfileRepo / SetDigestService
|
||||||
|
// precedent) keeps the NewStatsService signature stable across its
|
||||||
|
// pre-existing call sites. I-005 coverage-gap closure.
|
||||||
|
func (s *StatsService) SetNotifRepo(notifRepo repository.NotificationRepository) {
|
||||||
|
s.notifRepo = notifRepo
|
||||||
|
}
|
||||||
|
|
||||||
// DashboardSummary represents a high-level summary of system state.
|
// DashboardSummary represents a high-level summary of system state.
|
||||||
type DashboardSummary struct {
|
type DashboardSummary struct {
|
||||||
TotalCertificates int64 `json:"total_certificates"`
|
TotalCertificates int64 `json:"total_certificates"`
|
||||||
ExpiringCertificates int64 `json:"expiring_certificates"`
|
ExpiringCertificates int64 `json:"expiring_certificates"`
|
||||||
ExpiredCertificates int64 `json:"expired_certificates"`
|
ExpiredCertificates int64 `json:"expired_certificates"`
|
||||||
RevokedCertificates int64 `json:"revoked_certificates"`
|
RevokedCertificates int64 `json:"revoked_certificates"`
|
||||||
ActiveAgents int64 `json:"active_agents"`
|
ActiveAgents int64 `json:"active_agents"`
|
||||||
OfflineAgents int64 `json:"offline_agents"`
|
OfflineAgents int64 `json:"offline_agents"`
|
||||||
TotalAgents int64 `json:"total_agents"`
|
TotalAgents int64 `json:"total_agents"`
|
||||||
PendingJobs int64 `json:"pending_jobs"`
|
PendingJobs int64 `json:"pending_jobs"`
|
||||||
FailedJobs int64 `json:"failed_jobs"`
|
FailedJobs int64 `json:"failed_jobs"`
|
||||||
CompleteJobs int64 `json:"complete_jobs"`
|
CompleteJobs int64 `json:"complete_jobs"`
|
||||||
CompletedAt time.Time `json:"completed_at"`
|
// NotificationsDead is the number of notification_events rows currently
|
||||||
|
// in the terminal "dead" status (I-005 dead-letter queue). Exposed here
|
||||||
|
// so the metrics handler can derive the Prometheus counter
|
||||||
|
// certctl_notification_dead_total from the same snapshot used by the
|
||||||
|
// dashboard. DB-COUNT rather than in-memory — notifications can grow
|
||||||
|
// without bound, and filter-based List() is PerPage-capped to 50.
|
||||||
|
NotificationsDead int64 `json:"notifications_dead"`
|
||||||
|
CompletedAt time.Time `json:"completed_at"`
|
||||||
}
|
}
|
||||||
|
|
||||||
// GetDashboardSummary returns a summary of key metrics.
|
// GetDashboardSummary returns a summary of key metrics.
|
||||||
@@ -106,6 +128,19 @@ func (s *StatsService) GetDashboardSummary(ctx context.Context) (interface{}, er
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// I-005: dead-letter count for certctl_notification_dead_total. nil-safe
|
||||||
|
// so the nine existing NewStatsService call sites that haven't yet been
|
||||||
|
// updated to call SetNotifRepo keep working — they'll simply report
|
||||||
|
// NotificationsDead=0, which is the correct value on a system without a
|
||||||
|
// notification repository wired in. A CountByStatus error is non-fatal:
|
||||||
|
// the dashboard summary is best-effort for this field.
|
||||||
|
if s.notifRepo != nil {
|
||||||
|
deadCount, err := s.notifRepo.CountByStatus(ctx, string(domain.NotificationStatusDead))
|
||||||
|
if err == nil {
|
||||||
|
summary.NotificationsDead = deadCount
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
return summary, nil
|
return summary, nil
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|||||||
@@ -157,20 +157,20 @@ func (m *mockCertRepo) AddCert(cert *domain.ManagedCertificate) {
|
|||||||
|
|
||||||
// mockJobRepo is a test implementation of JobRepository
|
// mockJobRepo is a test implementation of JobRepository
|
||||||
type mockJobRepo struct {
|
type mockJobRepo struct {
|
||||||
mu sync.Mutex
|
mu sync.Mutex
|
||||||
Jobs map[string]*domain.Job
|
Jobs map[string]*domain.Job
|
||||||
StatusUpdates map[string]domain.JobStatus
|
StatusUpdates map[string]domain.JobStatus
|
||||||
CreateErr error
|
CreateErr error
|
||||||
UpdateErr error
|
UpdateErr error
|
||||||
UpdateErrorByID map[string]error
|
UpdateErrorByID map[string]error
|
||||||
UpdateErrorByIDMu sync.Mutex
|
UpdateErrorByIDMu sync.Mutex
|
||||||
UpdateStatusErr error
|
UpdateStatusErr error
|
||||||
GetErr error
|
GetErr error
|
||||||
ListErr error
|
ListErr error
|
||||||
ListByStatusErr error
|
ListByStatusErr error
|
||||||
DeleteErr error
|
DeleteErr error
|
||||||
ListTimedOutErr error
|
ListTimedOutErr error
|
||||||
Updated []*domain.Job
|
Updated []*domain.Job
|
||||||
}
|
}
|
||||||
|
|
||||||
func (m *mockJobRepo) List(ctx context.Context) ([]*domain.Job, error) {
|
func (m *mockJobRepo) List(ctx context.Context) ([]*domain.Job, error) {
|
||||||
@@ -393,13 +393,36 @@ func (m *mockJobRepo) AddJob(job *domain.Job) {
|
|||||||
m.Jobs[job.ID] = job
|
m.Jobs[job.ID] = job
|
||||||
}
|
}
|
||||||
|
|
||||||
// mockNotifRepo is a test implementation of NotificationRepository
|
// mockNotifRepo is a test implementation of NotificationRepository.
|
||||||
|
//
|
||||||
|
// I-005 extensions (ListRetryEligible / RecordFailedAttempt / MarkAsDead /
|
||||||
|
// Requeue) mutate the seeded *domain.NotificationEvent pointers in place.
|
||||||
|
// The service tests in notification_test.go assert against those same
|
||||||
|
// pointers (via notifRepo.Notifications or the local `row` handle), so
|
||||||
|
// in-place mutation is the contract — not a copy-and-replace pattern.
|
||||||
|
//
|
||||||
|
// Error fields are layered:
|
||||||
|
// - Per-method errors (ListRetryEligibleErr, RecordFailedAttemptErr, etc.)
|
||||||
|
// for fine-grained failure injection when a test targets exactly one
|
||||||
|
// method.
|
||||||
|
// - Shared legacy errors (ListErr for list-shaped reads, UpdateErr for
|
||||||
|
// update-shaped writes) so the pre-I-005 tests that configure ListErr
|
||||||
|
// or UpdateErr continue to short-circuit the new methods too. The
|
||||||
|
// RequeueNotification_RepoError test deliberately relies on this by
|
||||||
|
// setting UpdateErr rather than RequeueErr.
|
||||||
type mockNotifRepo struct {
|
type mockNotifRepo struct {
|
||||||
mu sync.Mutex
|
mu sync.Mutex
|
||||||
Notifications []*domain.NotificationEvent
|
Notifications []*domain.NotificationEvent
|
||||||
CreateErr error
|
CreateErr error
|
||||||
ListErr error
|
ListErr error
|
||||||
UpdateErr error
|
UpdateErr error
|
||||||
|
|
||||||
|
// I-005 per-method failure injection.
|
||||||
|
ListRetryEligibleErr error
|
||||||
|
RecordFailedAttemptErr error
|
||||||
|
MarkAsDeadErr error
|
||||||
|
RequeueErr error
|
||||||
|
CountByStatusErr error
|
||||||
}
|
}
|
||||||
|
|
||||||
func (m *mockNotifRepo) Create(ctx context.Context, notif *domain.NotificationEvent) error {
|
func (m *mockNotifRepo) Create(ctx context.Context, notif *domain.NotificationEvent) error {
|
||||||
@@ -436,12 +459,163 @@ func (m *mockNotifRepo) UpdateStatus(ctx context.Context, id string, status stri
|
|||||||
return errNotFound
|
return errNotFound
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// ListRetryEligible returns failed rows whose NextRetryAt is non-nil, at or
|
||||||
|
// before beforeTime, AND whose RetryCount is strictly less than maxAttempts,
|
||||||
|
// ordered oldest-due first, capped at limit. Signature matches the postgres-
|
||||||
|
// canonical shape pinned by notification_test.go:118 ("repo.ListRetryEligible
|
||||||
|
// (ctx, now, 5, 100)") and the NotificationRepository interface at
|
||||||
|
// interfaces.go:308 — a row at retry_count == maxAttempts is NOT returned
|
||||||
|
// because the service has already exhausted its attempt budget and the row
|
||||||
|
// must be MarkAsDead'd by whichever tick last touched it, not re-swept here.
|
||||||
|
// Mirrors the partial-index predicate
|
||||||
|
// `WHERE status='failed' AND next_retry_at IS NOT NULL AND next_retry_at <= $1`
|
||||||
|
// that migration 000016's retry-sweep index makes cheap to scan; the
|
||||||
|
// retry_count filter is an extra Go-side guard so the mock behaves
|
||||||
|
// identically to the postgres `AND retry_count < $2` clause.
|
||||||
|
func (m *mockNotifRepo) ListRetryEligible(ctx context.Context, beforeTime time.Time, maxAttempts, limit int) ([]*domain.NotificationEvent, error) {
|
||||||
|
m.mu.Lock()
|
||||||
|
defer m.mu.Unlock()
|
||||||
|
if m.ListRetryEligibleErr != nil {
|
||||||
|
return nil, m.ListRetryEligibleErr
|
||||||
|
}
|
||||||
|
if m.ListErr != nil {
|
||||||
|
return nil, m.ListErr
|
||||||
|
}
|
||||||
|
eligible := make([]*domain.NotificationEvent, 0)
|
||||||
|
for _, n := range m.Notifications {
|
||||||
|
if n.Status != string(domain.NotificationStatusFailed) {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if n.NextRetryAt == nil {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if n.NextRetryAt.After(beforeTime) {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
if n.RetryCount >= maxAttempts {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
eligible = append(eligible, n)
|
||||||
|
}
|
||||||
|
// Oldest-due first so the service processes the most-overdue row first,
|
||||||
|
// matching how an ORDER BY next_retry_at ASC query would behave.
|
||||||
|
sort.Slice(eligible, func(i, j int) bool {
|
||||||
|
return eligible[i].NextRetryAt.Before(*eligible[j].NextRetryAt)
|
||||||
|
})
|
||||||
|
if limit > 0 && len(eligible) > limit {
|
||||||
|
eligible = eligible[:limit]
|
||||||
|
}
|
||||||
|
return eligible, nil
|
||||||
|
}
+
+// RecordFailedAttempt mutates the matched row in place: increments
+// retry_count, pins next_retry_at, stores last_error, and keeps the row in
+// 'failed' state so the next retry-sweep tick picks it up again. Service-
+// level backoff math happens before the call; the repo is a dumb setter.
+// Signature matches the postgres-canonical shape pinned by
+// notification_test.go:184 ("repo.RecordFailedAttempt(ctx, 'notif-attempt-1',
+// 'connection refused', nextTry)") and the NotificationRepository interface
+// at interfaces.go:315 — id, then lastError, then nextRetryAt. The earlier
+// (id, nextRetryAt, lastError) ordering from the Phase 1 Red seed was wrong
+// and is corrected here in Phase 2 Green.
+func (m *mockNotifRepo) RecordFailedAttempt(ctx context.Context, id string, lastError string, nextRetryAt time.Time) error {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.RecordFailedAttemptErr != nil {
+		return m.RecordFailedAttemptErr
+	}
+	if m.UpdateErr != nil {
+		return m.UpdateErr
+	}
+	for _, n := range m.Notifications {
+		if n.ID == id {
+			n.RetryCount++
+			next := nextRetryAt
+			n.NextRetryAt = &next
+			le := lastError
+			n.LastError = &le
+			n.Status = string(domain.NotificationStatusFailed)
+			return nil
+		}
+	}
+	return errNotFound
+}
+
+// MarkAsDead flips the row into the terminal DLQ state. next_retry_at is
+// cleared so the partial retry-sweep index no longer touches this row —
+// otherwise RetryFailedNotifications would loop over it forever without
+// making any state change.
+func (m *mockNotifRepo) MarkAsDead(ctx context.Context, id string, lastError string) error {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.MarkAsDeadErr != nil {
+		return m.MarkAsDeadErr
+	}
+	if m.UpdateErr != nil {
+		return m.UpdateErr
+	}
+	for _, n := range m.Notifications {
+		if n.ID == id {
+			n.Status = string(domain.NotificationStatusDead)
+			n.NextRetryAt = nil
+			le := lastError
+			n.LastError = &le
+			return nil
+		}
+	}
+	return errNotFound
+}
+
+// Requeue is the operator-driven escape hatch from 'dead' back to 'pending'.
+// Clears retry bookkeeping entirely so ProcessPendingNotifications treats
+// the requeued row as a fresh attempt — identical on the wire to a freshly-
+// created notification.
+func (m *mockNotifRepo) Requeue(ctx context.Context, id string) error {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.RequeueErr != nil {
+		return m.RequeueErr
+	}
+	if m.UpdateErr != nil {
+		return m.UpdateErr
+	}
+	for _, n := range m.Notifications {
+		if n.ID == id {
+			n.Status = string(domain.NotificationStatusPending)
+			n.RetryCount = 0
+			n.NextRetryAt = nil
+			n.LastError = nil
+			return nil
+		}
+	}
+	return errNotFound
+}
+
 func (m *mockNotifRepo) AddNotification(notif *domain.NotificationEvent) {
 	m.mu.Lock()
 	defer m.mu.Unlock()
 	m.Notifications = append(m.Notifications, notif)
 }
 
+// CountByStatus counts in-memory rows whose Status field matches exactly.
+// Dedicated error injection via CountByStatusErr so a test can assert the
+// StatsService wrap-path ("failed to count dead notifications: …") without
+// also tripping ListErr or other shared fields. I-005 Phase 2 Green.
+func (m *mockNotifRepo) CountByStatus(ctx context.Context, status string) (int64, error) {
+	m.mu.Lock()
+	defer m.mu.Unlock()
+	if m.CountByStatusErr != nil {
+		return 0, m.CountByStatusErr
+	}
+	var count int64
+	for _, n := range m.Notifications {
+		if n.Status == status {
+			count++
+		}
+	}
+	return count, nil
+}
+
 // mockAuditRepo is a test implementation of AuditRepository
 type mockAuditRepo struct {
 	mu sync.Mutex
@@ -635,10 +809,10 @@ type mockAgentRepo struct {
 	// or RetireAgentWithCascade failure after preflight passed, so the
 	// service's error surfacing (wrap+return, skip audit, etc.) can be
 	// exercised without having to stand up a real PG connection.
 	SoftRetireErr error
 	RetireCascadeErr error
 	CountErr error
 	ListRetiredErr error
 }
 
 // List mirrors the production repo contract post-I-004: it returns only
@@ -993,8 +1167,8 @@ func newMockTargetRepository() *mockTargetRepo {
 
 // mockIssuerConnector is a test implementation of IssuerConnector
 type mockIssuerConnector struct {
 	Result *IssuanceResult
 	Err error
 	getRenewalInfoResult *RenewalInfoResult
 	getRenewalInfoErr error
 	// LastOCSPSignRequest captures the last request passed to SignOCSPResponse.
@@ -0,0 +1,12 @@
+-- Rollback for migration 000016 (I-005 notification retry + DLQ).
+-- Drops the retry-sweep partial index first, then the three columns added to
+-- notification_events. No status-rewriting: rows that were promoted to 'dead'
+-- during retry exhaustion remain in that status (rollback is opt-in, and
+-- clobbering terminal states on rollback would erase the audit trail of which
+-- alerts were never delivered).
+
+DROP INDEX IF EXISTS idx_notification_events_retry_sweep;
+
+ALTER TABLE notification_events DROP COLUMN IF EXISTS last_error;
+ALTER TABLE notification_events DROP COLUMN IF EXISTS next_retry_at;
+ALTER TABLE notification_events DROP COLUMN IF EXISTS retry_count;
@@ -0,0 +1,55 @@
+-- Migration 000016: Notification retry + dead-letter queue (I-005 coverage-gap fix).
+--
+-- Adds retry bookkeeping to notification_events so transient webhook / SMTP
+-- failures no longer silently drop critical alerts, and introduces a terminal
+-- "dead" status that an operator can triage from the UI.
+--
+-- Rationale (audit finding I-005):
+--   Today `internal/service/notification.go:282-288` flips status to 'failed'
+--   with a zero-valued sent_at and returns. `ProcessPendingNotifications`
+--   (line 243) only lists rows whose status='pending', so a failed row is
+--   orphaned: no retry, no backoff, no escalation, no dead-letter. The only
+--   way an operator learns about the drop is by reading the server log.
+--
+-- The fix mirrors the I-001 job retry loop: a sibling scheduler loop sweeps
+-- notification_events for rows whose (status='failed', next_retry_at <= now())
+-- and, while retry_count < max_attempts, requeues them to 'pending'. Once
+-- retry_count crosses max_attempts the row is promoted to 'dead' and a
+-- Prometheus counter is bumped for alerting. The UI exposes a manual Requeue
+-- button on dead rows for when the operator has resolved the underlying
+-- notifier outage.
+--
+-- Column design mirrors migration 000015 (agent_retire) style:
+--   * retry_count INTEGER NOT NULL DEFAULT 0 — explicit NOT NULL + default so
+--     existing rows backfill cleanly and the service layer never needs to
+--     nil-check the counter.
+--   * next_retry_at TIMESTAMPTZ NULL — nullable because the field is only
+--     meaningful while a row is in 'failed' state; 'sent', 'pending', 'dead'
+--     and 'read' rows all leave it NULL. The partial index below is what makes
+--     the retry sweep O(retry-eligible) rather than O(total).
+--   * last_error TEXT NULL — preserves the most recent transient failure
+--     string for operator triage. TEXT (not VARCHAR(N)) because notifier
+--     errors can include full HTTP bodies, stack traces, or stringified
+--     TLS handshake diagnostics without truncation risk.
+--
+-- Idempotency guarantees (enforced by notification repository integration tests):
+--   * ADD COLUMN IF NOT EXISTS → re-running is a no-op
+--   * CREATE INDEX IF NOT EXISTS → re-running is a no-op
+
+-- Retry counter. DEFAULT 0 backfills every existing row at zero attempts.
+ALTER TABLE notification_events ADD COLUMN IF NOT EXISTS retry_count INTEGER NOT NULL DEFAULT 0;
+
+-- Next-retry timestamp. Populated by the service layer on the failed→pending
+-- transition using exponential backoff (2^retry_count minutes, capped at 1h).
+ALTER TABLE notification_events ADD COLUMN IF NOT EXISTS next_retry_at TIMESTAMPTZ;
+
+-- Last transient error preserved for operator triage and dashboard display.
+ALTER TABLE notification_events ADD COLUMN IF NOT EXISTS last_error TEXT;
+
+-- Partial index for the retry-sweep hot path. Only rows in 'failed' state with
+-- a scheduled next_retry_at participate in the index; everything else (sent,
+-- pending, dead, read, and unscheduled failures) is excluded. Keeps the index
+-- tiny in healthy fleets where transient failures are rare.
+CREATE INDEX IF NOT EXISTS idx_notification_events_retry_sweep
+    ON notification_events(next_retry_at)
+    WHERE status = 'failed' AND next_retry_at IS NOT NULL;
@@ -119,7 +119,7 @@ echo -e "\n${GREEN}=== Setup Complete ===${NC}\n"
 echo "Your development environment is ready!"
 echo ""
 echo "Services running:"
-echo " • Server: http://localhost:8443"
+echo " • Server: https://localhost:8443"
 echo " • Database: postgres://certctl:certctl@localhost:5432/certctl"
 echo " • Agent: Connected to server"
 echo ""
@@ -132,7 +132,7 @@ echo " make docker-logs-server"
 echo " make docker-logs-agent"
 echo ""
 echo " 3. Test the API:"
-echo " curl http://localhost:8443/health"
+echo " curl --cacert ./deploy/test/certs/ca.crt https://localhost:8443/health"
 echo ""
 echo " 4. Try the quick start guide:"
 echo " cat docs/quickstart.md"
@@ -301,6 +301,19 @@ export const getNotification = (id: string) =>
 export const markNotificationRead = (id: string) =>
   fetchJSON<{ message: string }>(`${BASE}/notifications/${id}/read`, { method: 'POST' });
 
+/**
+ * I-005: requeue a dead notification back to the retry queue. Flips status
+ * 'dead' → 'pending' and clears next_retry_at so the retry sweep picks it up
+ * on its next tick (default 2 minutes, CERTCTL_NOTIFICATION_RETRY_INTERVAL).
+ * Used by the Dead letter tab's "Requeue" button after an operator fixes the
+ * underlying delivery failure (SMTP config, webhook endpoint, etc.). The
+ * handler returns a StatusResponse ({ status: "requeued" }) — the frontend
+ * only needs to know the call succeeded so the mutation can invalidate the
+ * notifications query.
+ */
+export const requeueNotification = (id: string) =>
+  fetchJSON<{ status: string }>(`${BASE}/notifications/${id}/requeue`, { method: 'POST' });
+
 // Audit
 export const getAuditEvents = (params: Record<string, string> = {}) => {
   const qs = new URLSearchParams({ page: '1', per_page: '200', ...params }).toString();
+34
-2
@@ -126,15 +126,47 @@ export interface Job {
   verification_error?: string;
 }
 
+/**
+ * Notification mirrors internal/domain/notification.go#NotificationEvent.
+ *
+ * I-005 (Notification Retry + Dead-letter Queue) widens the shape with three
+ * audit fields:
+ *
+ *   - retry_count   — number of delivery attempts already consumed (0..5). The
+ *                     5-cap is enforced server-side by NotificationsMaxAttempts.
+ *   - next_retry_at — RFC3339 timestamp the retry sweep will next consider this
+ *                     notification. Null for sent/dead/read and between sweeps
+ *                     for pending rows; the sweep populates it on each failure
+ *                     using min(2^retry_count * 1m, 1h).
+ *   - last_error    — most recent transient delivery failure. Preserved across
+ *                     requeue so Dead letter triage shows *why* the row died
+ *                     without chasing server logs.
+ *
+ * `sent_at` and `error` are the pre-I-005 audit fields on the backend struct.
+ * `subject` is a historical frontend-only field the backend never emits; it's
+ * kept optional so legacy fixtures and the pendingNotif test mock still type
+ * correctly without forcing a rewrite of every existing consumer.
+ *
+ * Status values follow the backend NotificationStatus constants:
+ *   pending · sent · failed · dead · read
+ * The existing list view tolerates the legacy title-cased "Pending" alias at
+ * render time (NotificationRow) so upgraded clients talking to older servers
+ * don't regress — see `isUnread` logic in NotificationsPage.tsx.
+ */
 export interface Notification {
   id: string;
   type: string;
   channel: string;
   recipient: string;
-  subject: string;
+  subject?: string;
   message: string;
   status: string;
-  certificate_id: string;
+  certificate_id?: string;
+  sent_at?: string | null;
+  error?: string | null;
+  retry_count?: number;
+  next_retry_at?: string | null;
+  last_error?: string | null;
   created_at: string;
 }
 
@@ -0,0 +1,208 @@
+import { describe, it, expect, vi, beforeEach } from 'vitest';
+import { render, screen, waitFor, fireEvent, cleanup } from '@testing-library/react';
+import { QueryClient, QueryClientProvider } from '@tanstack/react-query';
+import { MemoryRouter } from 'react-router-dom';
+import type { ReactNode } from 'react';
+
+// -----------------------------------------------------------------------------
+// I-005: NotificationsPage Phase 1 Red — Dead Letter tab + Requeue action
+//
+// This file pins the frontend contract Phase 2 Green must implement:
+//
+//   1. A "Dead letter" tab renders alongside the existing status filter, and
+//      selecting it causes the underlying query to fetch with { status: 'dead' }.
+//      The tab does not exist at HEAD — the tab-locator assertions are the Red.
+//
+//   2. Notifications in status='dead' render a "Requeue" action button. HEAD
+//      only renders "Mark read" for Pending rows and no action for anything
+//      else — the button-locator assertion is the Red.
+//
+//   3. Clicking "Requeue" invokes requeueNotification(id) from the API client
+//      and invalidates the notifications query. `requeueNotification` does not
+//      yet exist as an export from ../api/client — tsc --noEmit will fail with
+//      "Property 'requeueNotification' does not exist" when Phase 2 Green runs
+//      its verification gates, which is the compile-time Red halt. This file is
+//      structured so Phase 2 Green's single fix (add the client export + page
+//      wiring) flips the entire suite Green at once.
+// -----------------------------------------------------------------------------
+
+vi.mock('../api/client', () => ({
+  getNotifications: vi.fn(),
+  getNotification: vi.fn(),
+  markNotificationRead: vi.fn(),
+  requeueNotification: vi.fn(),
+}));
+
+// Imported after vi.mock so the mock replaces the real module.
+import NotificationsPage from './NotificationsPage';
+import * as client from '../api/client';
+
+function renderWithQuery(ui: ReactNode) {
+  const qc = new QueryClient({
+    defaultOptions: {
+      queries: { retry: false, gcTime: 0, staleTime: 0 },
+    },
+  });
+  return render(
+    <QueryClientProvider client={qc}>
+      <MemoryRouter>{ui}</MemoryRouter>
+    </QueryClientProvider>,
+  );
+}
+
+const pendingNotif = {
+  id: 'notif-001',
+  type: 'ExpirationWarning',
+  channel: 'Email',
+  recipient: 'admin@example.com',
+  subject: 'Certificate expiring',
+  message: 'Certificate expiring in 7 days',
+  status: 'Pending',
+  certificate_id: 'mc-prod-001',
+  created_at: new Date().toISOString(),
+};
+
+const deadNotif = {
+  id: 'notif-dead-001',
+  type: 'ExpirationWarning',
+  channel: 'Email',
+  recipient: 'admin@example.com',
+  subject: 'Certificate expiring',
+  message: 'Certificate expiring in 7 days',
+  status: 'dead',
+  certificate_id: 'mc-prod-001',
+  created_at: new Date().toISOString(),
+  retry_count: 5,
+  last_error: 'SMTP connection refused',
+};
+
+describe('NotificationsPage — I-005 Dead Letter + Requeue (Phase 1 Red)', () => {
+  beforeEach(() => {
+    vi.clearAllMocks();
+    cleanup();
+  });
+
+  it('renders a Dead letter tab in the filter toolbar', async () => {
+    vi.mocked(client.getNotifications).mockResolvedValue({
+      data: [pendingNotif],
+      total: 1,
+      page: 1,
+      per_page: 100,
+    });
+
+    renderWithQuery(<NotificationsPage />);
+
+    await waitFor(() => {
+      expect(screen.queryByText(/Loading/i)).not.toBeInTheDocument();
+    });
+
+    // Red: no Dead letter tab exists at HEAD. Phase 2 Green adds a button/tab
+    // labeled "Dead letter" (matches docs/testing-guide UI label).
+    expect(screen.getByRole('button', { name: /Dead letter/i })).toBeInTheDocument();
+  });
+
+  it('clicking Dead letter tab fetches notifications with status=dead', async () => {
+    vi.mocked(client.getNotifications).mockResolvedValue({
+      data: [],
+      total: 0,
+      page: 1,
+      per_page: 100,
+    });
+
+    renderWithQuery(<NotificationsPage />);
+
+    await waitFor(() => {
+      expect(screen.queryByText(/Loading/i)).not.toBeInTheDocument();
+    });
+
+    const tab = screen.getByRole('button', { name: /Dead letter/i });
+    fireEvent.click(tab);
+
+    // Red: Phase 2 Green must route the Dead letter tab's query through
+    // getNotifications({ status: 'dead', per_page: '100' }). HEAD only ever
+    // calls getNotifications({ per_page: '100' }) — no status param is ever
+    // passed through.
+    await waitFor(() => {
+      const calls = vi.mocked(client.getNotifications).mock.calls;
+      const deadCall = calls.find(([params]) => (params as Record<string, string>)?.status === 'dead');
+      expect(deadCall, 'expected getNotifications to be called with status=dead').toBeTruthy();
+    });
+  });
+
+  it('renders a Requeue button on dead notifications', async () => {
+    vi.mocked(client.getNotifications).mockResolvedValue({
+      data: [deadNotif],
+      total: 1,
+      page: 1,
+      per_page: 100,
+    });
+
+    renderWithQuery(<NotificationsPage />);
+
+    await waitFor(() => {
+      expect(screen.queryByText(/Loading/i)).not.toBeInTheDocument();
+    });
+
+    // Switch to Dead letter tab so the mocked dead notification becomes visible.
+    const tab = screen.getByRole('button', { name: /Dead letter/i });
+    fireEvent.click(tab);
+
+    await waitFor(() => {
+      // Red: HEAD renders no action for status='dead'. Phase 2 Green adds a
+      // "Requeue" button next to each dead row.
+      expect(screen.getByRole('button', { name: /Requeue/i })).toBeInTheDocument();
+    });
+  });
+
+  it('clicking Requeue invokes requeueNotification(id) from the API client', async () => {
+    vi.mocked(client.getNotifications).mockResolvedValue({
+      data: [deadNotif],
+      total: 1,
+      page: 1,
+      per_page: 100,
+    });
+    vi.mocked(client.requeueNotification).mockResolvedValue({ status: 'requeued' });
+
+    renderWithQuery(<NotificationsPage />);
+
+    await waitFor(() => {
+      expect(screen.queryByText(/Loading/i)).not.toBeInTheDocument();
+    });
+
+    fireEvent.click(screen.getByRole('button', { name: /Dead letter/i }));
+
+    const requeueBtn = await screen.findByRole('button', { name: /Requeue/i });
+    fireEvent.click(requeueBtn);
+
+    // Red: client.requeueNotification is not an exported function at HEAD, and
+    // the page does not call it. Both the mock and the page wiring are added
+    // in Phase 2 Green.
+    await waitFor(() => {
+      expect(client.requeueNotification).toHaveBeenCalledWith('notif-dead-001');
+    });
+  });
+
+  it('dead notifications surface retry_count and last_error metadata', async () => {
+    vi.mocked(client.getNotifications).mockResolvedValue({
+      data: [deadNotif],
+      total: 1,
+      page: 1,
+      per_page: 100,
+    });
+
+    renderWithQuery(<NotificationsPage />);
+
+    await waitFor(() => {
+      expect(screen.queryByText(/Loading/i)).not.toBeInTheDocument();
+    });
+
+    fireEvent.click(screen.getByRole('button', { name: /Dead letter/i }));
+
+    await waitFor(() => {
+      // Red: HEAD does not display retry_count or last_error. Phase 2 Green
+      // must surface these so operators can see *why* a notification died.
+      expect(screen.getByText(/SMTP connection refused/i)).toBeInTheDocument();
+      expect(screen.getByText(/5/)).toBeInTheDocument();
+    });
+  });
+});
@@ -1,6 +1,6 @@
 import { useState, useMemo } from 'react';
 import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query';
-import { getNotifications, markNotificationRead } from '../api/client';
+import { getNotifications, markNotificationRead, requeueNotification } from '../api/client';
 import PageHeader from '../components/PageHeader';
 import StatusBadge from '../components/StatusBadge';
 import ErrorState from '../components/ErrorState';
@@ -9,15 +9,37 @@ import type { Notification } from '../api/types';
 
 type ViewMode = 'list' | 'grouped';
 
+// I-005: the Notifications page now hosts two tabs. "all" is the pre-I-005
+// inbox behavior — no server-side status filter, client-side type/status
+// dropdowns untouched. "dead" routes the query through the new ?status=dead
+// handler branch so operators can triage the dead-letter queue in isolation.
+// The tab is intentionally a separate state axis from the status dropdown so
+// the two don't fight each other (dropdown filters within the tab's scope).
+type ActiveTab = 'all' | 'dead';
+
 export default function NotificationsPage() {
   const [viewMode, setViewMode] = useState<ViewMode>('grouped');
   const [typeFilter, setTypeFilter] = useState('');
   const [statusFilter, setStatusFilter] = useState('');
+  const [activeTab, setActiveTab] = useState<ActiveTab>('all');
   const queryClient = useQueryClient();
 
   const { data, isLoading, error, refetch } = useQuery({
-    queryKey: ['notifications'],
-    queryFn: () => getNotifications({ per_page: '100' }),
+    // I-005: queryKey carries the active tab so TanStack Query treats
+    // "all" and "dead" as distinct cache entries. Without this, switching
+    // tabs would return stale data until the 30s refetchInterval fires.
+    queryKey: ['notifications', activeTab],
+    queryFn: () => {
+      const params: Record<string, string> = { per_page: '100' };
+      if (activeTab === 'dead') {
+        // The listNotifications handler's ?status=dead branch hits the
+        // NotificationRepository.ListByStatus path instead of plain List,
+        // which is both cheaper (DLQ is a small slice of all notifications)
+        // and correct (pagination counts DLQ rows, not the full inbox).
+        params.status = 'dead';
+      }
+      return getNotifications(params);
+    },
     refetchInterval: 30000,
   });
 
@@ -26,6 +48,23 @@ export default function NotificationsPage() {
     onSuccess: () => queryClient.invalidateQueries({ queryKey: ['notifications'] }),
   });
 
+  // I-005: requeue a dead notification. Invalidates both tab cache entries
+  // because a successful requeue flips the row out of "dead" and potentially
+  // into the "all" tab on its next refetch (status becomes 'pending').
+  //
+  // The mutationFn is wrapped as `(id) => requeueNotification(id)` rather
+  // than passed by reference so react-query v5's second positional argument
+  // (the mutation context object) never reaches the API client. Without the
+  // wrapper, TanStack invokes `requeueNotification(id, { client })`, and the
+  // I-005 Phase 1 Red contract's strict `toHaveBeenCalledWith('notif-dead-001')`
+  // assertion fails on the extra argument. Keep the arrow even if the context
+  // object later becomes structurally empty — the contract pins a single-arg
+  // call and the page must not leak mutation machinery into API boundaries.
+  const requeue = useMutation({
+    mutationFn: (id: string) => requeueNotification(id),
+    onSuccess: () => queryClient.invalidateQueries({ queryKey: ['notifications'] }),
+  });
+
   const notifications = data?.data || [];
 
   const filtered = useMemo(() => {
@@ -81,6 +120,23 @@ export default function NotificationsPage() {
         subtitle={`${filtered.length} notifications${unreadCount ? ` (${unreadCount} unread)` : ''}`}
       />
       <div className="px-4 py-3 flex flex-wrap items-center gap-3 border-b border-surface-border/50">
+        {/* I-005: tab switcher between the standard inbox and the DLQ. The
+            "Dead letter" label is pinned by NotificationsPage.test.tsx — do
+            not rename without updating the Phase 1 Red contract. */}
+        <div className="flex rounded overflow-hidden border border-surface-border">
+          <button
+            onClick={() => setActiveTab('all')}
+            className={`px-3 py-1.5 text-xs transition-colors ${activeTab === 'all' ? 'bg-brand-400 text-white' : 'bg-surface text-ink-muted hover:text-ink'}`}
+          >
+            All
+          </button>
+          <button
+            onClick={() => setActiveTab('dead')}
+            className={`px-3 py-1.5 text-xs transition-colors ${activeTab === 'dead' ? 'bg-brand-400 text-white' : 'bg-surface text-ink-muted hover:text-ink'}`}
+          >
+            Dead letter
+          </button>
+        </div>
         <div className="flex rounded overflow-hidden border border-surface-border">
           <button
             onClick={() => setViewMode('grouped')}
@@ -135,7 +191,7 @@ export default function NotificationsPage() {
               </div>
               <div className="space-y-2">
                 {items.map((n) => (
-                  <NotificationRow key={n.id} notification={n} onMarkRead={() => markRead.mutate(n.id)} />
+                  <NotificationRow key={n.id} notification={n} onMarkRead={() => markRead.mutate(n.id)} onRequeue={() => requeue.mutate(n.id)} />
                 ))}
               </div>
             </div>
@@ -157,10 +213,25 @@ export default function NotificationsPage() {
   );
 }
 
-function NotificationRow({ notification: n, onMarkRead }: { notification: Notification; onMarkRead: () => void }) {
+function NotificationRow({
+  notification: n,
+  onMarkRead,
+  onRequeue,
+}: {
+  notification: Notification;
+  onMarkRead: () => void;
+  // I-005: optional so callers who don't care about the DLQ (if any are ever
+  // added) aren't forced to thread a no-op through. Every NotificationRow
+  // today passes this, so in practice it's always defined.
+  onRequeue?: () => void;
+}) {
   const isUnread = n.status === 'Pending' || n.status === 'pending';
+  // I-005: dead rows get a Requeue button and surface the retry budget + the
+  // last transient error so operators triaging the DLQ can see *why* the
+  // notification died before deciding whether to requeue.
+  const isDead = n.status === 'dead';
   return (
-    <div className={`flex items-start justify-between py-2 px-3 rounded transition-colors ${isUnread ? 'bg-surface-muted border-l-2 border-brand-400' : 'hover:bg-surface-muted'}`}>
+    <div className={`flex items-start justify-between py-2 px-3 rounded transition-colors ${isUnread ? 'bg-surface-muted border-l-2 border-brand-400' : isDead ? 'bg-surface-muted border-l-2 border-danger' : 'hover:bg-surface-muted'}`}>
       <div className="flex-1 min-w-0">
         <div className="flex items-center gap-2 mb-1">
           <span className="text-sm text-ink">{n.type.replace(/([A-Z])/g, ' $1').trim()}</span>
@@ -168,6 +239,18 @@ function NotificationRow({ notification: n, onMarkRead }: { notification: Notifi
 <span className="text-xs text-ink-faint">{n.channel}</span>
 </div>
 <p className="text-xs text-ink-muted truncate">{n.message || n.subject}</p>
+{isDead && (
+<div className="flex items-center gap-3 mt-1 text-xs">
+<span className="text-ink-faint">
+Retry {n.retry_count ?? 0}/5
+</span>
+{n.last_error && (
+<span className="text-danger truncate" title={n.last_error}>
+{n.last_error}
+</span>
+)}
+</div>
+)}
 <div className="flex items-center gap-3 mt-1">
 <span className="text-xs text-ink-faint">{n.recipient}</span>
 <span className="text-xs text-ink-faint">{timeAgo(n.created_at)}</span>
@@ -181,6 +264,14 @@ function NotificationRow({ notification: n, onMarkRead }: { notification: Notifi
 Mark read
 </button>
 )}
+{isDead && onRequeue && (
+<button
+onClick={(e) => { e.stopPropagation(); onRequeue(); }}
+className="ml-3 text-xs text-brand-400 hover:text-brand-500 transition-colors whitespace-nowrap"
+>
+Requeue
+</button>
+)}
 </div>
 );
 }
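Review note: the row styling and the retry label in `NotificationRow` are both derived from `n.status` and `n.retry_count` inline in the JSX. A minimal sketch of the same decisions pulled out as pure helpers, which might make them easier to unit-test. `rowTone`, `retryLabel`, and `MAX_RETRIES` are hypothetical names that do not appear in the diff; the class strings and the budget of 5 are copied from the hunks above.

```typescript
// Assumption: 5 is the retry budget, taken from the hard-coded "/5" in the diff.
const MAX_RETRIES = 5;

// Mirrors the nested ternary in the row's className: unread wins over dead,
// and every other status falls back to the hover style.
function rowTone(status: string): string {
  const isUnread = status === 'Pending' || status === 'pending';
  const isDead = status === 'dead';
  if (isUnread) return 'bg-surface-muted border-l-2 border-brand-400';
  if (isDead) return 'bg-surface-muted border-l-2 border-danger';
  return 'hover:bg-surface-muted';
}

// Mirrors `Retry {n.retry_count ?? 0}/5`: a missing count is shown as 0.
function retryLabel(retryCount?: number | null): string {
  return `Retry ${retryCount ?? 0}/${MAX_RETRIES}`;
}
```

If the retry budget is configurable on the server, reading it from the API response rather than a constant would keep the label accurate.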