Compare commits

...

30 Commits

Author SHA1 Message Date
shankar0123 36e722ba12 WIP: M-1 handler sentinel error mapping (checkpoint before branch cleanup)
Uncommitted migration work at the time of branch cleanup. Tagged as
checkpoint/m1-migration-wip so the commit survives git gc --prune=now.

Session context: Phase 3 Part B+C of the M-1 sentinel error migration
was in progress. 38 modified files, 4 new files (errors.go + errors_test.go
in internal/service/ and internal/api/handler/). Resume from this commit
via 'git checkout checkpoint/m1-migration-wip'.
2026-04-24 00:35:12 +00:00
shankar0123 d6959a75c1 Merge branch 'test/l1-repo-integration-coverage' 2026-04-20 20:39:10 +00:00
shankar0123 97b23e98d9 test(repository): close L-1 integration-coverage gap for HealthCheck + RenewalPolicy
The coverage-gap audit flagged L-1 (P2): `HealthCheckRepository` (453 LOC,
11 methods) and `RenewalPolicyRepository` (289 LOC, 5 methods post-G-1 —
the audit's "92 lines, 2 methods" figure was stale) ship to production
with zero live-DB integration coverage. The existing `repo_test.go`
header self-documents the gap: "15 of 17 PostgreSQL repository files".

Operationally load-bearing piece: M48's scheduler calls
`HealthCheckRepository.ListDueForCheck` every tick to drive continuous
TLS health monitoring. A silent SQL regression there — wrong INTERVAL
math, NULL-handling slip, lost ORDER BY — would fail open: operator
adds endpoint → scheduler never picks it up → endpoint degrades in
production → no alert. The loop continues ticking and logs "processed
0 endpoints" normally, so the failure mode is operationally invisible.

Closure shape (test-only; no production code touched):

- internal/repository/postgres/health_check_test.go (new file, 7 tests)
  · TestHealthCheckRepository_CRUD
  · TestHealthCheckRepository_GetByEndpoint
  · TestHealthCheckRepository_List_Filters
  · TestHealthCheckRepository_ListDueForCheck  (the load-bearing one —
    seeds four rows with differing last_checked_at+interval
    relationships to NOW() plus one NULL-last_checked_at row,
    asserts the correct subset returns and ORDER BY last_checked_at
    ASC NULLS FIRST holds)
  · TestHealthCheckRepository_RecordHistory_GetHistory
  · TestHealthCheckRepository_PurgeHistory
  · TestHealthCheckRepository_GetSummary

- internal/repository/postgres/renewal_policy_test.go (new file, 3 tests)
  · TestRenewalPolicyRepository_CRUD  (exercises auto-generated
    rp-<slug(name)> PK, JSONB round-trip of [30,14,7,0] thresholds,
    UpdatedAt monotonic advance, ORDER BY name for List)
  · TestRenewalPolicyRepository_DuplicateName  (asserts
    errors.Is(err, repository.ErrRenewalPolicyDuplicateName) on both
    Create-name-unique and Update-name-unique collision paths, the pg
    23505 sentinel mapping)
  · TestRenewalPolicyRepository_DeleteInUse  (raw-INSERTs a
    managed_certificates row FK'ing the policy, asserts
    errors.Is(err, repository.ErrRenewalPolicyInUse) from pg 23503
    ON DELETE RESTRICT, cleans up, then asserts not-found surfaces
    distinctly)

- internal/repository/postgres/repo_test.go (one-line header flip)
  "covering 15 of 17 ... repository files" → "17 of 17"; added
  cross-reference pointing readers at the two sibling files.

Both new files use the existing getTestDB(t) + schema-per-test-isolation
convention and skip via testing.Short() in CI, matching M26 TICKET-003
scaffolding byte-for-byte. Repository/postgres is not in the CI
coverage-gate path (grep -nE "internal/repository/postgres"
.github/workflows/ci.yml → no hits), so adding test-only files cannot
regress gated coverage elsewhere.

Verification gates run locally (sandbox without Docker, so the -short
skip gate itself is what's exercised; operator runs the testcontainer
path locally):

  1.  go vet ./...                                              — clean
  2.  go build ./...                                            — clean
  3.  go test -short -count=1 ./...                             — clean
  4.  go test -race -short ./internal/repository/postgres/...   — clean
  5.  staticcheck                         — absent; CI checkset holds
  6.  govulncheck                         — skipped; test-only, no deps
  7.  per-layer coverage no-regression    — N/A; repo/pg not gated
  8.  tsc --noEmit                        — N/A; no frontend change
  9.  vitest run                          — N/A; no frontend change
  10. vite build                          — N/A; no frontend change
  11. OpenAPI lint                        — N/A; no spec change

No migration, no interface change, no production code diff. The
RenewalPolicyRepository drift between audit ("92 lines, 2 methods")
and HEAD (289 lines, 5 methods post-G-1) is documented honestly in
the audit report's Resolution Log, not papered over.

Closes: coverage-gap-audit L-1 (P2)
2026-04-20 20:39:06 +00:00
shankar0123 4cf5fcdb4f Merge branch 'fix/d1-cli-status-endpoint' 2026-04-20 19:41:03 +00:00
shankar0123 1ee67b7792 D-1: correct certctl-cli status endpoint path (/api/v1/health -> /health)
The CLI's GetStatus() was issuing GET /api/v1/health, but the real
liveness route is GET /health at internal/api/router/router.go:76
(mounted at root, not under /api/v1/). Every 'certctl-cli status'
invocation 404'd since M16b.

The regression was masked because TestClient_GetStatus encoded the
same wrong path on both sides of the contract -- the mock server
also dispatched on /api/v1/health -- so the production request
matched the test's buggy dispatch and the green bar hid the bug.

Two-line fix:
  - internal/cli/client.go:615: "/api/v1/health" -> "/health"
  - internal/cli/client_test.go:296: mock dispatch to match

Red receipt captured before the green fix: with the test fixture
corrected but production still wrong, TestClient_GetStatus fails
'parsing response: unexpected end of JSON input' (the client falls
through the mock's if/else to the default 200 OK empty body and
the JSON decoder chokes). After the production edit the test
passes.

GetStatus()'s response decoder is already compatible with the real
/health shape (graceful 'ok' check on health["status"], optional
health["timestamp"]). No interface change. No migration. No
frontend change. No OpenAPI delta -- /health is a root-level
liveness probe, not part of the /api/v1/ surface.
2026-04-20 19:40:58 +00:00
shankar0123 128d0eeaa8 Merge branch 'fix/g1-renewal-policies-api'
G-1: renewal-policies API + frontend FK-drift fix. Adds /api/v1/renewal-policies
CRUD backing the dropdown that managed_certificates.renewal_policy_id FKs into.
Three frontend call sites swapped from getPolicies() (pol-*, compliance rules)
to getRenewalPolicies() (rp-*, lifecycle policies). Validation bounds, pg
23503/23505 error mapping to HTTP 409, OpenAPI coverage, test suite.

No migration — renewal_policies table already exists from schema 000001.
2026-04-20 18:53:09 +00:00
shankar0123 9834b4e4a4 G-1: renewal-policies API + frontend FK-drift fix
Three frontend call sites (OnboardingWizard.tsx:603, CertificatesPage.tsx:52,
CertificateDetailPage.tsx:169) populated the renewal_policy_id dropdown from
getPolicies() — the compliance-rule endpoint returning pol-* IDs — which
violated the FK managed_certificates.renewal_policy_id REFERENCES
renewal_policies(id) ON DELETE RESTRICT. Create would fail pg 23503 at insert.

Backend (new):
- RenewalPolicyRepository CRUD + ListAll/ExistsByID (pg 23503 → ErrRenewalPolicyInUse
  → HTTP 409; pg 23505 → ErrRenewalPolicyDuplicateName → HTTP 409)
- RenewalPolicyService with repo-only constructor. Service sentinels
  var-alias the repo sentinels so errors.Is walks across layers.
- RenewalPolicyHandler with validation bounds: name 1–255;
  renewal_window_days [1,365] default 30; max_retries [0,10] not defaulted;
  retry_interval_seconds [60,86400] default 3600; alert_thresholds_days
  [0,365] default [30,14,7,0]. Auto-generated IDs rp-<slug(name)>.
- Router registers 5 routes under /api/v1/renewal-policies[/{id}].

Frontend:
- CertificatesPage/CertificateDetailPage/OnboardingWizard now call
  getRenewalPolicies() and render rp-* IDs.
- client.ts adds getRenewalPolicies/createRenewalPolicy/updateRenewalPolicy/
  deleteRenewalPolicy. types.ts adds the RenewalPolicy shape.

OpenAPI: RenewalPolicies tag + 5 operations + 3 schemas (RenewalPolicy,
RenewalPolicyCreateRequest, RenewalPolicyUpdateRequest). 409 responses
on create/update duplicate-name and delete FK-in-use.

No migration — renewal_policies table already exists from the initial
schema (000001).

Tests:
- internal/service/renewal_policy_test.go: CRUD + validation + sentinel
  error wrapping.
- internal/api/handler/renewal_policy_handler_test.go: handler endpoint
  contracts including 400/404/409.
- web/src/api/client.test.ts: 4 subtests covering the 4 new API functions.

Phase 3 gates all green: go vet, build, short tests, race tests (service/
handler/router/scheduler), staticcheck (G-1 packages), govulncheck (0
reachable), coverage (service 69.7%, handler 79.0%, domain 86.9%,
middleware 80.6% — all above thresholds), tsc, vitest (256 passed),
vite build, OpenAPI structural validation.
2026-04-20 18:53:01 +00:00
shankar0123 cab579368b Merge branch 'fix/audit-f001-f002-f003'
Closes F-001 (CRL scoped query via composite index), F-002 (digest error
body sanitization), and F-003 (ctx-aware sleep at three sites).

Verification: build, vet, race-short test sweep across all packages green.
govulncheck clean. golangci-lint run deferred — local environment's
golangci-lint is v1.64.8 built with go1.24 and rejects the go1.25.9
project; fresh install blocked by disk constraints. CI lane will cover it.
2026-04-20 16:52:00 +00:00
shankar0123 4e5522a999 F-001/F-002/F-003: CRL prefix-scan, digest error sanitization, ctx-aware sleeps
F-001 (P3): GenerateDERCRL scoped to issuer via composite index
  - Add RevocationRepository.ListByIssuer leveraging migration 000012's
    idx_certificate_revocations_issuer_serial composite index as a
    prefix-scan target. Previously CAOperationsSvc.GenerateDERCRL called
    ListAll() and filtered by IssuerID in Go — O(total revocations)
    regardless of how many revocations belonged to the target issuer.
  - Rewrite GenerateDERCRL to call ListByIssuer(ctx, issuerID) so PostgreSQL
    drives a prefix scan of the composite index. Drops the in-memory filter.
  - New regression test in ca_operations_test.go asserts the CRL hot path
    invokes ListByIssuer exactly once and never ListAll, and that the
    issuerID is threaded through correctly.

F-002 (P3): digest.go admin-auth endpoints no longer leak internal errors
  - PreviewDigest (GET /api/v1/digest/preview) and SendDigest
    (POST /api/v1/digest/send) previously wrote err.Error() into the HTTP
    response body on 500s. Replace with slog.Error server-side logging plus
    a generic "internal error" response body, matching the house pattern
    in certificates.go and export.go.

F-003 (P4): three blocking time.Sleep sites now honor ctx cancellation
  - internal/connector/issuer/acme/acme.go:672 (DNS-01 propagation wait)
    now runs under a select{case <-ctx.Done(): CleanUp + return ctx.Err();
    case <-time.After(d):} so graceful shutdown doesn't get stuck behind
    the propagation delay.
  - internal/connector/issuer/acme/acme.go:786 (dns-persist-01 propagation
    wait) same pattern, returns ctx.Err() on cancel.
  - cmd/agent/main.go:272 (polling backoff inside the heartbeat loop) now
    wraps the sleep in select{case <-ctx.Done(): continue; case <-time.After(backoff):}
    so the outer <-ctx.Done() case on the parent loop fires cleanly.

Verification: build, vet, and race-enabled short tests green across all
55+ packages. govulncheck reports zero vulnerabilities in the code path.
No migration needed — F-001 reuses the existing 000012 composite index.
No frontend changes.
2026-04-20 16:51:52 +00:00
shankar0123 55ce86b132 v2.0.48: swap self-signed TLS bootstrap algorithm ed25519 → ECDSA-P256
Follow-up to v2.0.47 (HTTPS-Everywhere). The Phase-3 self-signed
bootstrap sidecar shipped an ed25519 server cert. Apple's TLS stack —
Safari Network Framework and the macOS-bundled LibreSSL 3.3.6
/usr/bin/curl — does not advertise ed25519 in the ClientHello
signature_algorithms extension for server certs, so the handshake fails
with the server-side log line:

  tls: peer doesn't support any of the certificate's signature algorithms

Homebrew OpenSSL 3.x, Chrome, Firefox, and Linux curl all accept
ed25519 server certs fine. Apple is the outlier. Rather than gate the
demo stack behind "install Homebrew OpenSSL first," swap the bootstrap
algorithm to ECDSA-P256 with SHA-256 — universally supported, including
on the Apple stack.

Changes
- deploy/docker-compose.yml: certctl-tls-init openssl invocation swapped
  to `-newkey ec -pkeyopt ec_paramgen_curve:P-256 -nodes`; header comment
  + echo line updated; multi-line rationale paragraph added.
- deploy/docker-compose.test.yml: same openssl swap + echo update for
  the test harness sidecar that writes to the bind-mounted ./test/certs
  directory the Go integration_test.go pins via CERTCTL_TEST_CA_BUNDLE.
- docs/tls.md: Pattern 1 description + code block updated;
  "Why ECDSA-P256 and not ed25519" rationale paragraph added covering
  pre-v2.0.48 history, the Apple diagnosis, accepting clients, and
  the operator migration command. Patterns 2 (existing Secret) and 3
  (cert-manager) explicitly called out as unaffected.
- docs/upgrade-to-tls.md: docker-compose procedure sentence updated
  with cross-reference to tls.md Pattern 1.
- docs/test-env.md: "Get the CA bundle for curl" sentence updated.

Migration
Existing demo installs must tear the `certs` named volume down to pick
up the new algorithm:

  docker compose -f deploy/docker-compose.yml down -v
  docker compose -f deploy/docker-compose.yml up -d --build

Not touched
- cmd/server/tls.go: algorithm-agnostic. TLS 1.3 min version with
  [X25519, P-256] curve preferences for key exchange is orthogonal to
  the server cert's signature algorithm. No Go code change needed.
- Helm chart: Patterns 2 and 3 operators supply their own cert; this
  patch does not affect them.
- Unrelated ed25519 uses (agent key algorithm detection, profile
  algorithm options, SSH key path examples, tlsprobe key metadata,
  cloud discovery key-algo display): all orthogonal to the server TLS
  bootstrap cert.

Incidental cleanup
- .gitignore: dropped dangling `strategy.md` entry (file doesn't exist
  in repo; entry was cruft).
2026-04-20 04:17:05 +00:00
shankar0123 52248be717 v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP
Breaking change release. Plaintext HTTP listener removed. The certctl
control plane now terminates TLS 1.3 on :8443 via
http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape
hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md.

Server
- cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert
  swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback),
  preflightServerTLS validation
- cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe,
  watchSIGHUP wiring, cert/key path config threading
- tls_test.go: 418-line regression coverage of reload, preflight,
  callback behavior, SAN validation

Config
- CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required)
- Plaintext rejection: agents/CLI/MCP pre-flight-fail on http://
  URLs with a pointer to docs/upgrade-to-tls.md

Agents, CLI, MCP
- All three pre-flight-reject http:// URLs with fail-loud diagnostic
- CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust
- CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass
  (loud warning on startup)
- install-agent.sh emits both vars as commented template lines

docker-compose
- certctl-tls-init sidecar generates SAN-valid self-signed cert into
  deploy/test/certs/ on first boot
- All demo-stack curls pin against ca.crt with --cacert

Helm chart
- Three TLS provisioning modes, exactly one required:
  - server.tls.existingSecret (operator-supplied)
  - server.tls.certManager.enabled (cert-manager integration)
  - server.tls.selfSigned.enabled (eval only — not for production)
- server-certificate.yaml template for cert-manager mode
- helm install without a TLS source fails at template render with
  a pointer to docs/tls.md

CI
- .github/workflows/ci.yml Helm Chart Validation step renders the
  chart in both existingSecret and cert-manager modes, plus an
  inverse guard-regression test that asserts helm template MUST
  refuse to render when no TLS source is configured. Previously
  the single `helm template` invocation hit the certctl.tls.required
  fail-loud guard and exit-1'd CI. Four invocations now: lint
  (existingSecret), template (existingSecret), template
  (cert-manager), template (no args — must fail).

Integration tests
- deploy/test/integration_test.go stands up the Compose stack over
  HTTPS, extracts the CA bundle, and exercises every certctl API
  over https://localhost:8443
- All 34 integration subtests green (per Phase 8 local CI-parity)

Documentation
- New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload)
- New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade
  warnings, fleet-roll sequencing)
- CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry
  (file heading unchanged; release tag is v2.0.47)
- All curls in docs/, examples/, deploy/helm/ guides use
  https://localhost:8443 --cacert

Verification
- grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits
- grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin
  API default, SSRF doc comment) — zero certctl endpoints
- Tasks #197–#206 (Phases 0–8) all closed in the tracker

Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix).
2026-04-20 03:43:10 +00:00
shankar0123 04c7eca615 docs: reconcile scheduler topology across sibling docs (7 → 12 loops)
Authoritative 12-loop table lives at docs/architecture.md:522-534 (committed via
the I-001/I-003/I-005 + M48/M50 milestone commits). This change brings six sibling
docs into parity with that table so every surface — user-facing features reference,
SOC 2 compliance mapping, connectors guide, advanced demo architecture diagram,
testing guide, and in-line architecture prose — reflects the same 8 always-on + 4
opt-in topology.

Touches:
- docs/architecture.md: 2 inline ordinal references (9th / 8th loop) replaced with
  descriptive names (opt-in cloud discovery / opt-in endpoint health), cross-linked
  to the authoritative table to prevent future ordinal rot.
- docs/features.md: metric row (7 → 12), inline reference to 9th loop, and full
  scheduler table expanded to include Always-on column + env vars + I-001/I-003/I-005
  refs.
- docs/compliance-soc2.md: background scheduler monitoring bullets expanded to list
  all 12 loops with env vars + I-series refs; table row updated with 8 always-on +
  4 opt-in summary.
- docs/connectors.md: three inline ordinals (7th/6th/9th loop) replaced with
  descriptive names, cross-linked to architecture.md.
- docs/demo-advanced.md: Mermaid SCHED node label updated from '7 background loops'
  to '12 background loops (8 always-on + 4 opt-in)'.
- docs/testing-guide.md: Test 20.1.1 header + grep pattern expanded to include
  job-retry / job-timeout / notification-retry / digest / endpoint-health /
  cloud-discovery loops; sign-off chart row label updated.

Pure documentation reconciliation. No code changes. Master HEAD pre-commit: 6e646e0.
2026-04-20 02:51:34 +00:00
shankar0123 6e646e0fe8 M-001/M-006: strip HTTP auth from EST/SCEP + fail-loud SCEP preflight
Closes CWE-306 (missing authentication for critical function) for SCEP
via a fail-loud startup gate, and aligns EST/SCEP HTTP dispatch with
their respective RFCs. CRL/OCSP remain unauthenticated under
.well-known/pki/* per RFC 5280 §5 / RFC 6960 / RFC 8615. Option (D):
no mTLS in this milestone.

- RFC 7030 §3.2.3 (EST auth is deployment-specific) and §4.1.1
  (/cacerts explicitly anonymous): EST paths served unauthenticated;
  CSR-signature + profile policy enforce identity inside ESTService.
- RFC 8894 §3.2: SCEP authenticates via the challengePassword
  PKCS#10 attribute (OID 1.2.840.113549.1.9.7), not an HTTP credential.
  HTTP dispatch is unauthenticated; preflightSCEPChallengePassword
  refuses to start when CERTCTL_SCEP_ENABLED=true without
  CERTCTL_SCEP_CHALLENGE_PASSWORD. SCEPService.PKCSReq enforces the
  same invariant defense-in-depth and compares with
  crypto/subtle.ConstantTimeCompare.

cmd/server/main.go:
- Extract buildFinalHandler(apiHandler, noAuthHandler, webDir,
  dashboardEnabled); route /.well-known/est/*, /scep, /scep/*,
  /.well-known/pki/crl/{id}, /.well-known/pki/ocsp/{id}/{serial},
  and health probes through noAuthHandler (RequestID +
  structuredLogger + Recovery only).
- Add preflightSCEPChallengePassword fail-loud gate; startup log
  emits challenge_password_set boolean for operator visibility.

cmd/server/finalhandler_test.go (new, 314 lines, 27 subtests):
- TestBuildFinalHandler_Dispatch (20) + TestBuildFinalHandler_NoDashboard
  (7) pin the dispatch surface: EST 4-endpoint, SCEP exact +
  trailing-slash + query-string, PKI CRL+OCSP, health, /api/v1/*
  authenticated, /assets/* file server, SPA fallback.

internal/api/router/router.go, internal/config/config.go:
- Router-level comments explain why EST/SCEP/PKI dispatchers sit
  outside the authenticated mux; SCEP challenge password config
  plumbed through.

docs/architecture.md:
- New EST Authentication subsection (RFC 7030 §3.2.3 + §4.1.1,
  buildFinalHandler + noAuthHandler references).
- Rewrite SCEP Authentication subsection; replaces pre-existing
  factually-incorrect "any value accepted" claim with CWE-306
  preflight, service-layer defense-in-depth, and
  crypto/subtle.ConstantTimeCompare.
- Top-level Authentication section: qualify /api/v1/* scope on API
  clients bullet; add standards-based-endpoints bullet referencing
  the 27-subtest regression harness.

docs/compliance-soc2.md:
- CC6.1: scope API Key Authentication to /api/v1/*; add
  standards-based endpoints bullet citing RFCs and CWE-306 closure.
- CC6.3: scope API Key Policy to /api/v1/* with cross-reference to
  CC6.1.
- Evidence Locations augmented with buildFinalHandler,
  preflightSCEPChallengePassword, scep.go defense path, regression
  harness, and OpenAPI security:[] overrides.

api/openapi.yaml: verified already correct (global bearerAuth
default overridden with security:[] on /cacerts, /simpleenroll,
/simplereenroll, /csrattrs, /scep GET+POST, /crl/{issuer_id},
/ocsp/{issuer_id}/{serial}); no edits needed.
2026-04-19 17:20:05 +00:00
shankar0123 675b87ba63 I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.

Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
  (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
  Dead rows clear next_retry_at so the index stops matching them.

Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
  exponential backoff capped at 1h (notifRetryBackoffCap) with
  5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
  status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
  (audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
  wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
  retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
  status -> pending via notifRepo.Requeue.

Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
  CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.

StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
  NewStatsService call sites (main.go + stats_test.go + 8 digest
  tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
  notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
  (reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
  best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
  the same snapshot.

Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.

Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
  with queryKey: ['notifications', activeTab] so switching tabs
  doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
  full-text title tooltip.
- Requeue mutation wrapped as
    mutationFn: (id: string) => requeueNotification(id)
  to prevent react-query v5's positional context argument from
  leaking into the API client — pinned against future refactors
  by strict-match toHaveBeenCalledWith('notif-dead-001') in
  NotificationsPage.test.tsx:181.

Closes I-005.
2026-04-19 15:17:27 +00:00
shankar0123 707d8de4fb UX-001: sidebar re-entry + inline team/owner creation in wizard
Closes UX-001 (OnboardingWizard CertificateStep dead-end): users no
longer have to navigate away from the wizard and lose their in-flight
state when the required Owner/Team dropdowns are empty.

Layout.tsx
  - Adds persistent 'Setup guide' button in the left sidebar.
  - Clears localStorage 'certctl:onboarding-dismissed' then navigates
    to /?onboarding=1 as a re-entry signal that overrides dismissal.
  - localStorage.removeItem wrapped in try/catch to tolerate storage
    access errors (private browsing, quota, etc.).

DashboardPage.tsx
  - Reads ?onboarding=1 via useSearchParams as a forceOnboarding flag.
  - forceOnboarding bypasses the latched first-run gate so the wizard
    reopens even after dismissal or with certs/issuers already present.
  - onDismiss now also strips ?onboarding=1 via setSearchParams(next,
    { replace: true }) so a page refresh does not relaunch the wizard.

OnboardingWizard.tsx
  - Adds CreateTeamModalInline and CreateOwnerModalInline inside
    CertificateStep. Both wire through React Query: createTeam /
    createOwner mutation on success invalidates ['teams'] / ['owners']
    and calls onCreated(id) so the parent select auto-selects the new
    row as soon as the refetch lands.
  - '+ New team' and '+ New owner' buttons placed next to the select
    labels; empty-state copy replaced with inline 'create one now'
    buttons (no more Link back to /owners /teams).
  - CreateOwner coerces empty teamId to undefined before mutation so
    the server contract matches OwnersPage.

Tests (12 new, all green; total suite 252 passed / 0 failed):
  - Layout.test.tsx (4): Setup guide button renders, clicking it clears
    the dismissal key and navigates to /?onboarding=1, tolerates
    localStorage.removeItem throwing.
  - DashboardPage.test.tsx (4): first-run auto-open, ?onboarding=1
    re-entry after dismissal, onDismiss writes localStorage + strips
    the query param, dismissed-with-no-param stays closed.
  - OnboardingWizard.test.tsx (4): Skip-Skip reaches CertificateStep
    with '+ New team' / '+ New owner' buttons visible; '+ New team'
    happy path with React Query invalidation + parent-select
    auto-select via option-parent traversal (label is a sibling, not
    htmlFor-linked); '+ New owner' happy path pins team_id: undefined
    coercion; Cancel abort never mutates.

Test infrastructure notes:
  - Closure-driven vi.fn().mockImplementation pattern drives the
    post-invalidation refetch: the mutation mock mutates a closure
    variable that the getTeams/getOwners mock reads, so the parent
    select's new <option> exists by the time the refetch lands.
  - Anchored regex (/^Create Team$/, /^Create Owner$/) disambiguates
    the modal submit from the '+ New team' / '+ New owner' triggers.

Verification gates (all green):
  - vitest run: 252 passed / 0 failed (8 files, 13.98s)
  - tsc --noEmit: 0 errors
  - vite build: clean production bundle (851.77 kB js / 226.81 kB gzip)

No new runtime dependencies. Frontend-only change.
2026-04-19 14:49:04 +00:00
shankar0123 0725713e19 Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
  migrations/000015_agent_retire.up.sql flips
  deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
  RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
  boundary instead of quietly destroying targets. Both `agents` and
  `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
  TEXT pair (TEXT not VARCHAR so operator comments are never
  truncated), indexed via partial indexes WHERE retired_at IS NOT
  NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
  CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
  EXISTS) so repeated runs against partially-migrated databases
  converge. migrations/000015_agent_retire.down.sql restores CASCADE
  and drops the new columns for clean rollback. A dedicated
  repository-layer testcontainers test
  (internal/repository/postgres/migration_000015_test.go) asserts the
  before/after FK action, column presence, index presence, and
  round-trip idempotency under up→down→up.

Domain — sentinel guard + dependency counts:
  internal/domain/connector.go gains IsRetired() on Agent, the
  exported SentinelAgentIDs slice listing server-scanner,
  cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
  four reserved IDs documented in CLAUDE.md and created at startup in
  cmd/server/main.go), IsSentinelAgent(id string) predicate,
  AgentDependencyCounts{ActiveTargets, ActiveCertificates,
  PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
  ActorTypeSystem enum values used by audit emission downstream.
  Coverage locked down by internal/domain/connector_test.go.

Service — 8-step ordered contract:
  internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
  opts{Force, Reason}) enforces a fixed execution order:
  (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch — ErrAgentNotFound on miss.
  (3) idempotency — if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts — collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard — opts.Force=true with empty Reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard — HasDependencies() with opts.Force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation — single pinned retiredAt := time.Now(); agent
      retirement first, then cascade target retirement if opts.Force,
      all under the repo's single transaction so the two retired_at
      stamps match to the second.
  (8) best-effort audit — agent_retired always; agent_retirement_
      cascaded additionally on the force path. Actor is whatever the
      handler resolves from the request; actor type is mapped by
      resolveActorType (system/agent-prefix→Agent/else→User). Audit
      emission failures are logged via slog.Error but do not abort
      the retirement (matches the house convention used by every
      other scheduler-emitted event).

  BlockedByDependenciesError implements Error() as
  "active_targets=%d, active_certificates=%d, pending_jobs=%d" and
  Unwrap() → ErrBlockedByDependencies. The single struct satisfies
  errors.Is via Unwrap (used by scheduler-level tests) and errors.As
  via the concrete type (used by the handler to fish out Counts for
  the 409 body). ListRetiredAgents(page, perPage) adds a separate
  paginated accessor with page<1→1 and perPage<1→50 normalization so
  retired rows are queryable without polluting the default agent
  listing.

  Sentinel guard coverage is asymmetric by design: all four reserved
  IDs are protected, and force=true cannot override. Regression tests
  in internal/service/agent_retire_test.go assert each of the eight
  steps in order, plus sentinel bypass attempts and idempotency
  replay.

Handler + router — status-code surface:
  internal/api/handler/agents.go:RetireAgent exposes seven status
  codes on DELETE /agents/{id}:
    200 on a fresh retirement (body echoes AgentRetirementResult).
    204 on idempotent replay (AlreadyRetired=true; no new audit).
    400 on ErrForceReasonRequired.
    403 on ErrAgentIsSentinel.
    404 on ErrAgentNotFound.
    409 on BlockedByDependenciesError, with a custom body shape
        {error, counts{active_targets, active_certificates,
        pending_jobs}} that bypasses the default ErrorWithRequestID
        envelope so callers get the per-bucket numbers directly.
    500 on any other error.
  Heartbeat HandleHeartbeat returns 410 Gone when the agent is
  retired (ErrAgentRetired), signalling the agent to shut down.
  Query params `force=true` and `reason=<text>` drive the cascade
  path; both are forwarded as url.Values through the new MCP
  transport.

  internal/api/router/router.go registers GET /api/v1/agents/retired
  literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
  literal-beats-pattern-var precedence routes "retired" to the
  paginated retired-agents listing instead of fetching a hypothetical
  agent named "retired".

Agent binary — clean shutdown on 410:
  cmd/agent/main.go gains the ErrAgentRetired sentinel, a
  retiredOnce sync.Once, and a retiredSignal chan struct{}. A
  markRetired(source, statusCode, body) helper closes the channel
  exactly once; the Run() select loop observes the close and returns
  ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
  and exits cleanly instead of spinning in the heartbeat retry loop.
  The 410 Gone surface is therefore terminal for the agent process.

MCP transport:
  internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
  a new additive transport method. Client.Delete is path-only; without
  this method the retire tool would silently drop `force` and `reason`,
  turning every cascade retire into a default soft-retire. The new
  method shares do()'s 204 normalization and 4xx/5xx error
  propagation so tool authors get one contract.
  internal/mcp/tools.go + internal/mcp/types.go expose the
  retire_agent tool with Force+Reason inputs wired through
  DeleteWithQuery.

CLI:
  cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
  `agents list --retired` (client-side strip of --retired then
  delegation to ListRetiredAgents, sharing --page/--per-page parsing
  with the default listing) and `agents retire <id> [--force --reason
  "…"]` (mirrors ErrForceReasonRequired — force without reason is
  rejected client-side before the request is sent). JSON + table
  output modes both honor the new columns.

Frontend:
  web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
  web/src/api/client.ts + web/src/api/types.ts expose the retire
  endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
  api/openapi.yaml documents DELETE /agents/{id} with all seven
  status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go           — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go               — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go               — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
  internal/domain/connector_test.go               — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
  api/openapi.yaml                                — DELETE + 410 + 409 body shape
  cmd/agent/main.go                               — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go                                 — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md,
    docs/testing-guide.md                         — retirement contract narrative
  internal/api/handler/agents.go                  — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go      — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go                   — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go               — new
  internal/cli/client.go                          — ListRetiredAgents + RetireAgent
  internal/domain/connector.go                    — IsRetired, SentinelAgentIDs,
                                                    IsSentinelAgent, AgentDependencyCounts,
                                                    ActorTypeAgent/System
  internal/domain/connector_test.go               — new
  internal/integration/lifecycle_test.go          — retirement fixture
  internal/mcp/client.go                          — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go               — new
  internal/mcp/tools.go, internal/mcp/types.go    — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go               — AgentRepository retirement methods
  internal/repository/postgres/agent.go           — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go                       — wire into AgentService surface
  internal/service/agent_retire.go                — new 8-step contract
  internal/service/agent_retire_test.go           — new
  internal/service/deployment.go                  — skip retired agents
  internal/service/target.go                      — skip retired agents
  internal/service/testutil_test.go               — shared mocks extended
  migrations/000015_agent_retire.up.sql           — new
  migrations/000015_agent_retire.down.sql         — new
  web/src/api/client.ts, types.ts + tests         — retire endpoint wiring
  web/src/pages/AgentsPage.tsx                    — retire UI
2026-04-19 05:24:00 +00:00
shankar0123 1ee77c89f8 I-003: job timeout reaper closes AwaitingCSR/AwaitingApproval gap
Add 11th always-on scheduler loop that transitions jobs stuck in
AwaitingCSR (default 24h TTL) or AwaitingApproval (default 168h TTL)
to Failed. I-001's retry loop then auto-promotes eligible Failed jobs
back to Pending. No new status enum, no schema migration.

- JobRepository.ListTimedOutAwaitingJobs with per-status cutoff WHERE
- JobService.ReapTimedOutJobs mirrors RetryFailedJobs structure
- Scheduler jobTimeoutLoop with atomic.Bool idempotency guard, 2m
  per-tick context, WaitGroup shutdown drain
- Config: CERTCTL_JOB_TIMEOUT_INTERVAL (10m), CERTCTL_JOB_AWAITING_CSR_TIMEOUT
  (24h), CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT (168h)
- Audit event per transition: actor=system, actorType=System,
  action=job_timeout, details={old_status, new_status, timeout_reason,
  age_hours}
- 14 new tests: 3 config, 7 service, 4 scheduler
2026-04-19 01:37:18 +00:00
shankar0123 4bc8b3e723 fix(config): add RetryInterval to TestValidate_ValidConfig + TestValidate_AuthTypeNone fixtures (I-001 follow-up)
Problem:
  TestValidate_ValidConfig and TestValidate_AuthTypeNone construct a
  SchedulerConfig without RetryInterval, so Validate() fails the
  'retry interval must be at least 1 second' check at config.go:1086
  with 'retry interval must be at least 1 second'. Both tests expect
  success, so they fail whenever run.

Root cause (re-derived from source, not inherited from memory):
  git log -S 'retry interval must be at least' --source --all shows
  the validation was introduced in 0200c7f (I-001, RetryFailedJobs
  scheduler wiring). git log -- internal/config/config_test.go shows
  the test file was last touched in 7382e5f, which predates 0200c7f.
  I-001 added a new Validate() rule without updating the two positive
  test fixtures — a gap in I-001's verification pass.

  This is NOT C-001 fallout. The config_test.go file was untouched by
  the C-001 closure commits 91642e2 and 4696116. The failure surfaced
  during the full test suite run after C-001 landed because no one
  had run 'go test ./internal/config/...' since I-001.

Scope:
  - internal/config/config_test.go (2 fixtures: TestValidate_ValidConfig,
    TestValidate_AuthTypeNone).

Implementation:
  Added 'RetryInterval: 5 * time.Minute' to both SchedulerConfig
  literals. 5 minutes matches the I-001 default at config.go:818:

    RetryInterval: getEnvDuration("CERTCTL_SCHEDULER_RETRY_INTERVAL", 5*time.Minute)

  The other two TestValidate_* tests (InvalidAuthType, APIKeyAuth_
  MissingSecret) are unaffected because they expect Validate() to
  error at the auth-type check (line 1052) or auth-secret check
  (line 1057), both of which fire before the RetryInterval check at
  line 1086.

Verification:
  - go test -count=1 -run 'TestValidate_' ./internal/config/...: PASS
  - go test -short -count=1 ./...: all packages PASS
  - go vet ./...: exit 0

Residual:
  None. This is a pure test-fixture fix — production code is unchanged.

Commit:
  0200c7f (I-001) should have included this edit. Attributed here for
  traceability.
2026-04-19 00:33:22 +00:00
shankar0123 469611650c fix(cli): add missing os + path/filepath imports to client_test.go
Follow-up to 91642e2. TestClient_ImportCertificates_SixFieldPayload
uses filepath.Join(t.TempDir(), ...) and os.WriteFile to stage a
test PEM, but the import block only listed encoding/json,
encoding/pem, net/http, etc. — neither os nor path/filepath was
imported. go vet rejected the package with 'undefined: filepath'
(and would have caught 'undefined: os' next).

Add both imports. No behavioral change — the referenced symbols
are the standard library's usual names for their respective
packages, so the test compiles and runs exactly as intended.
CI should now pass go build + go vet on the cli package.
2026-04-19 00:27:11 +00:00
shankar0123 91642e2860 C-001 scope expansion: tighten parallel POST /api/v1/certificates call sites to six-field contract
Problem:
a53a4b8 closed C-001 at the handler boundary by tightening the
ValidateRequired contract on POST /api/v1/certificates to require six
fields: name, common_name, renewal_policy_id, issuer_id, owner_id,
team_id. (Correction re-derived from source: the handler
ValidateRequired calls on owner_id/team_id/renewal_policy_id were
actually installed in 3287e17 under M-002/M-003/M-006 auth unification
— a53a4b8's commit message overstates scope.) Post-audit on
2026-04-18 found three parallel call sites still shipping
three-to-four-field payloads that the newly strict handler would
reject with HTTP 400:
  - GUI: OnboardingWizard CertificateStep (common_name + sans +
    issuer_id + environment only)
  - CLI: certctl-cli import (common_name + issuer_id + status only;
    no required-flag gating)
  - Tests: deploy/test/qa_test.go Part03 positive paths

Scope:
Bring every POST /api/v1/certificates caller to six-field parity. No
handler changes — the contract is authoritative; the callers must
conform.

Implementation:

  GUI — OnboardingWizard CertificateStep expansion:
    web/src/pages/OnboardingWizard.tsx adds name/owner_id/team_id/
    renewal_policy_id state. React Query hooks for getOwners/
    getTeams/getPolicies use per_page: '500' to populate dropdowns
    without pagination-driven truncation. Payload ships all six
    required fields plus sans/certificate_profile_id/environment.
    nextDisabled gate enforces all six before the Continue button
    activates.

  CLI — ImportCertificates rewrite:
    internal/cli/client.go rewrites ImportCertificates with
    flag.NewFlagSet("import", flag.ContinueOnError). Required flags:
    --owner-id, --team-id, --renewal-policy-id, --issuer-id. Optional:
    --name-template (default {cn}, templated via strings.ReplaceAll
    against cert.Subject.CommonName), --environment (default
    imported). Missing required flags fail pre-HTTP with a clear
    error. Request map ships all six required fields plus sans/
    environment/status/optional serial_number.
    cmd/cli/main.go — usage string updated to document the new
    required/optional flags.

  Tests — qa_test.go Part03 positive paths:
    deploy/test/qa_test.go Part03 Create_Minimal and Create_Full
    updated to include all six fields. Uses seed_demo.sql-supplied IDs
    (o-alice, t-platform, rp-standard) — docker-compose.demo.yml is
    the run context. C-001 explanatory comment added above
    Create_Minimal so future readers understand why the minimal
    payload is no longer minimal.

  MCP parity:
    Verified no-op. internal/mcp/types.go:28 CreateCertificateInput
    already declares all six fields; internal/mcp/tools.go:102
    forwards the typed struct unchanged.

Verification:

  Go CLI regression tests (internal/cli/client_test.go):
    * TestClient_ImportCertificates_MissingRequiredFlags — 5 subtests,
      one per missing required flag, confirms flag.ContinueOnError
      rejects with non-nil error before any HTTP call is attempted.
    * TestClient_ImportCertificates_MissingPositionalArgs — confirms
      the "usage: import <file>" error path when no PEM file is
      supplied after the flags.
    * TestClient_ImportCertificates_SixFieldPayload — uses httptest
      to decode the POST body and assert all six required fields
      plus sans/environment are present on the wire.

  Frontend regression test (web/src/api/client.test.ts):
    'createCertificate accepts and transmits all six required fields'
    pins the wire shape for both GUI call sites (OnboardingWizard
    CertificateStep + CertificatesPage CreateCertificateModal). If
    either UI surface accidentally drops a field, this assertion
    fails in CI rather than surfacing as a 400 at runtime.

  Grep-based call-site sweep:
    Enumerated every POST /api/v1/certificates create caller. Four
    total: OnboardingWizard, CertificatesPage, MCP tools, CLI import.
    All four now ship six-field payloads. Claim path
    (internal/service/discovery.go) updates existing rows and does
    not POST. EST/SCEP handlers invoke internal
    certService.CreateVersion, not the public API. Negative-path
    tests (qa_test.go:1085/1267/1274/1288/1298) remain valid: they
    assert 400/non-500 on oversized/malformed/missing-CN/UTF-8/empty
    bodies, and these properties still hold under the stricter
    handler.

  Static gates:
    go build ./..., go vet ./..., go test ./internal/cli/..., and
    cd web && npm run test deferred to operator pre-push — the Go
    toolchain is not available in the session sandbox. Grep-based
    verification confirms the syntactic shape of every changed file.

Residual:
None. Every POST /api/v1/certificates call site now conforms to the
six-field contract; the wire shape is pinned by both Go and
TypeScript regression tests.

Commit:
TBD-SHA (audit doc + CLAUDE.md carry TBD-SHA placeholders to be
amended after commit)
2026-04-19 00:25:10 +00:00
shankar0123 0200c7f4a4 Close I-001 (RetryFailedJobs never invoked) coverage-gap finding
Operator decision answered as Option A: JobService.RetryFailedJobs is
now wired into the scheduler as an always-on 10th loop. Prior to this
commit the method was implemented, unit-tested, and exported but had
zero runtime callers — any job that transitioned to status=Failed stayed
Failed forever regardless of how many attempts it had remaining.

Scheduler — 10th loop:
  internal/scheduler/scheduler.go grows a jobRetryLoop alongside the
  existing nine loops (renewal, jobs, health, notifications, short-lived,
  network scan, digest, health check, cloud discovery). The loop follows
  the established run-immediately-then-tick pattern (same shape as
  jobProcessorLoop), gated by a sync/atomic.Bool idempotency guard and
  joined into the scheduler's sync.WaitGroup so WaitForCompletion drains
  it on graceful shutdown. Each tick runs under a 2-minute context
  timeout mirroring jobProcessorLoop's opCtx budget. The runJobRetry
  helper invokes jobService.RetryFailedJobs(ctx, 3) — the advisory
  maxRetries cap is belt-and-suspenders; per-job eligibility is still
  enforced inside the service via Attempts < MaxAttempts.

  The JobServicer scheduler-interface gains RetryFailedJobs so the
  scheduler's dependency surface stays explicit and mockable.

Service — audit trail per retry:
  internal/service/job.go:RetryFailedJobs now emits an audit event for
  every Failed→Pending transition. Following the house convention used
  by all scheduler-emitted events, actor='system' and actorType=
  domain.ActorTypeSystem; action='job_retry'; details capture
  old_status, new_status, attempts, max_attempts. JobService carries an
  optional *AuditService (SetAuditService) that nil-guards to preserve
  test-wiring ergonomics — existing tests that construct JobService
  without an audit service continue to pass unchanged.

Config — env var with sane default:
  internal/config/config.go:SchedulerConfig grows RetryInterval, wired
  to CERTCTL_SCHEDULER_RETRY_INTERVAL with a 5-minute default. Validate
  rejects intervals below 1 second (matches other scheduler interval
  validators).

Server wiring:
  cmd/server/main.go calls jobService.SetAuditService(auditService)
  after JobService construction and sched.SetJobRetryInterval(
  cfg.Scheduler.RetryInterval) alongside the other SetXxxInterval calls.

Regression coverage:
  internal/service/job_test.go (3 new)
    - TestJobService_RetryFailedJobs_EligibleJobTransitionsAndAudits
    - TestJobService_RetryFailedJobs_SkipsJobsAtMaxAttempts
    - TestJobService_RetryFailedJobs_NoAuditServiceOK
  internal/scheduler/scheduler_test.go (3 new)
    - TestScheduler_JobRetryLoop_CallsService
    - TestScheduler_JobRetryLoop_IdempotencyGuard
    - TestScheduler_JobRetryLoop_WaitForCompletion

  The service tests assert status transitions, attempt-cap short-
  circuiting, and audit event shape (actor='system', action='job_retry',
  details keys). The scheduler tests assert the loop invokes the service,
  the atomic.Bool guard skips overlapping ticks with the expected
  'still running, skipping tick' log, and WaitForCompletion drains the
  in-flight tick on Stop.

Residual follow-up (not in scope for this commit):
  internal/service/renewal.go:RetryFailedJobs is a parallel dead-code
  duplicate of the same logic on RenewalService — untested and has no
  runtime caller. The audit finding called this out as 'implemented
  twice'. Removing it is a separate cleanup and does not block the
  Option-A wiring this commit delivers.

Files:
  cmd/server/main.go                     — SetAuditService + SetJobRetryInterval
  internal/config/config.go              — RetryInterval field + env + validate
  internal/scheduler/scheduler.go        — 10th loop, interface, field, setter
  internal/scheduler/scheduler_test.go   — 3 new scheduler-loop tests
  internal/service/job.go                — RetryFailedJobs audit emission + SetAuditService
  internal/service/job_test.go           — 3 new service-layer tests
2026-04-18 23:24:54 +00:00
shankar0123 fe7e766510 Close M-004 (OCSP issuer binding) and M-005 (discovery actor propagation) coverage-gap findings
M-004 — OCSP issuer binding (composite key):
  The OCSP lookup path now binds (issuer_id, serial) as a composite key
  rather than resolving by serial alone. CertificateRepository and
  RevocationRepository gain GetByIssuerAndSerial methods; ca_operations.go
  scopes both lookups by the issuer_id path param. When no managed cert
  binds to that (issuer, serial) tuple, GetOCSPResponse constructs an
  RFC 6960 §2.2 'unknown' response (CertStatus=2) instead of the prior
  default 'good'. Short-lived cert exemption (profile TTL < 1h) is
  preserved. Real repo errors (non-sql.ErrNoRows) fail closed with a log.

  Regression coverage: internal/service/ca_operations_test.go
    - TestCAOperationsSvc_GetOCSPResponse_Unknown_CrossIssuer
    - TestCAOperationsSvc_GetOCSPResponse_Unknown_UnknownSerial

M-005 — Discovery Claim/Dismiss actor propagation:
  DiscoveryService.ClaimDiscovered and DismissDiscovered now accept an
  explicit 'actor string' parameter (propagation pattern mirrors
  bulk_revocation.go / revocation_svc.go). The handler layer passes
  resolveActor(r.Context()) — the named-key identity established by the
  M-002 auth unification — and the service falls back to 'api' (the same
  safe sentinel resolveActor uses when no auth context is present) only
  when the caller passes an empty string. Never falls back to 'operator'.

  Regression coverage: internal/service/discovery_test.go
    - TestDiscoveryService_ClaimDiscovered_AuditActor
    - TestDiscoveryService_DismissDiscovered_AuditActor
    - TestDiscoveryService_ClaimDiscovered_EmptyActorFallsBackToAPI
    - TestDiscoveryService_DismissDiscovered_EmptyActorFallsBackToAPI

Each new test asserts event.Actor matches the caller-supplied string (or
'api' on empty input) and explicitly asserts event.Actor != 'operator'
to lock in the historical fix intent.

Files:
  internal/api/handler/discovery.go          — pass resolveActor(ctx)
  internal/api/handler/discovery_handler_test.go — updated call sites
  internal/integration/lifecycle_test.go     — updated mock wiring
  internal/repository/interfaces.go          — GetByIssuerAndSerial on
                                               CertificateRepository +
                                               RevocationRepository
  internal/repository/postgres/certificate.go — composite key lookup
  internal/service/ca_operations.go          — (issuer_id, serial) scoping
  internal/service/ca_operations_test.go     — 2 new M-004 tests
  internal/service/discovery.go              — actor parameter + 'api' fallback
  internal/service/discovery_test.go         — 4 new M-005 tests
  internal/service/shortlived_test.go        — mock signature update
  internal/service/testutil_test.go          — mock GetByIssuerAndSerial
2026-04-18 22:20:25 +00:00
shankar0123 ff7357f889 fix(lint): godoc comment on NewAuthWithNamedKeys must lead with function name (ST1020)
CI failure on master (commit 3287e17) — staticcheck ST1020:

  internal/api/middleware/middleware.go:125:1: ST1020: comment on exported
  function NewAuthWithNamedKeys should be of the form
  "NewAuthWithNamedKeys ..." (staticcheck)

When NewAuth was renamed to NewAuthWithNamedKeys during the M-002 auth
unification, the leading godoc sentence was left pointing at the old name.
Rewrite the comment so its first sentence starts with the new function
name, and expand the body to describe the named-key + admin-flag contract
introduced in 3287e17.

Also gitignore /.gopath/ — session-scoped tool install cache, same
category as /.gocache/ and /.gomodcache/.

Verification:
  go vet ./internal/api/middleware/...          — clean
  go build ./internal/api/middleware/...        — clean
  go test ./internal/api/middleware/...         — PASS (0.245s)
  staticcheck -checks=all,<project exclusions>  — clean across
    middleware, handler, service, domain, cmd/server, scheduler

Closes: CI failure on 3287e17.
2026-04-18 21:38:46 +00:00
shankar0123 3287e174dc Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001)
Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006)
on top of the C-001/C-002 ownership + agent-FK contract fixes landed in
a53a4b8. The work lands as a single commit spanning server, docs, tests,
and the React client.

M-002 — Named API keys with per-key actor propagation
  * Migration 000014 adds the 'api_keys' table (id, name, hash,
    principal, role, created_at, last_used_at, disabled_at) so every
    credential carries an identifiable principal instead of the
    opaque 'anonymous'/'api-key' sentinel.
  * Auth middleware now rotates through configured keys, performs
    constant-time hash comparison, stamps 'last_used_at', and emits
    an actor struct via contextWithActor(). The audit middleware,
    bulk-revocation handler, approval handlers, and MCP tool layer
    now read the principal off the context and persist it on every
    audit_events row.
  * Regression coverage:
      - internal/api/middleware/audit_test.go — actor propagation,
        principal redaction for disabled keys, anonymous fallback for
        unauthenticated endpoints.
      - internal/api/handler/bulk_revocation_handler_test.go,
        job_handler_test.go — principal-on-audit assertions.

M-003 — Authorization gates (Phase B)
  * Approval handler rejects self-approval / self-rejection with 403
    when the actor principal equals the job's requested_by field.
  * Bulk revocation is gated behind the 'admin' role; operators and
    viewers receive 403.
  * Regression coverage:
      - internal/service/job_test.go — TestApproveJob_NotSelf,
        TestRejectJob_NotSelf.
      - internal/api/handler/bulk_revocation_handler_test.go —
        TestBulkRevoke_RequiresAdmin, TestBulkRevoke_AdminSucceeds.

M-006 — RFC-compliant CRL/OCSP on the unauthenticated .well-known mux
  * Per RFC 8615, relying parties cannot reasonably be asked to
    authenticate against the issuing certctl instance to retrieve
    revocation material. CRL and OCSP move off the authenticated
    '/api/v1/crl*' and '/api/v1/ocsp/*' paths onto:
        GET /.well-known/pki/crl/{issuer_id}
            Content-Type: application/pkix-crl   (RFC 5280 §5)
        GET /.well-known/pki/ocsp/{issuer_id}/{serial}
            Content-Type: application/ocsp-response  (RFC 6960)
  * Non-standard JSON CRL shape is removed; only DER is served.
  * Short-lived certificate exemption (profile TTL < 1h → skip
    CRL/OCSP) is preserved; the response simply omits the serial.
  * Routes are registered on the unauthenticated 'finalHandler' mux
    in cmd/server/main.go alongside EST ('/.well-known/est/*') and
    SCEP ('/scep'). Legacy authenticated paths return 404.
  * Regression coverage:
      - internal/api/handler/certificate_handler_test.go — content
        type, DER parseability, 404 for unknown issuer.
      - internal/api/handler/adversarial_path_test.go — unauthenticated
        access asserted for CRL, OCSP, EST, SCEP.
      - internal/api/router/router_test.go — route-table assertion
        that '.well-known/pki/*', '.well-known/est/*', and '/scep' are
        mounted on the unauthenticated branch.

M-001 — Auto-closed by M-002
  EST and SCEP were already registered on the unauthenticated
  'finalHandler' mux; the router comment at
  internal/api/router/router.go:247 now matches reality. The
  adversarial-path tests above lock the behavior in.

Verification (all gates green):
  * go vet ./...                                           — clean
  * go build ./...                                         — ok
  * go test -short ./... (55+ packages)                    — all pass
  * web/ : npm test (225 Vitest tests)                     — all pass
  * web/ : npx tsc --noEmit                                — clean
  * grep sweep for '/api/v1/(crl|ocsp)' — 13 surviving hits,
    all intentional M-006 tombstone/relocation comments.

Documentation:
  * coverage-gap-audit.md — status flips M-001/M-002/M-003/M-006 →
    Fixed, with per-finding resolution paragraphs citing regression
    test IDs. (Audit file lives outside this repo; see cowork root.)
  * CLAUDE.md Project Status line updated with the auth-unification
    closure note.
  * docs/features.md, docs/architecture.md, docs/quickstart.md,
    docs/concepts.md, docs/connectors.md, docs/test-env.md,
    docs/testing-guide.md, docs/compliance-*.md, docs/demo-advanced.md
    — refreshed for the new '.well-known/pki/*' namespace and named
    API keys.
  * api/openapi.yaml — documents the new unauthenticated endpoints
    and removes the legacy '/api/v1/crl*' + '/api/v1/ocsp/*' paths.

.gitignore: adds '/.gocache/' and '/.gomodcache/' for the session-
scoped Go caches so they never enter the tree.
2026-04-18 18:17:41 +00:00
shankar0123 a53a4b845b fix(gui,api): close C-001 + C-002 — ownership + agent FK contract
C-001 — CreateCertificate was server-accepted with null owner_id,
team_id, renewal_policy_id because the GUI neither collected the fields
nor enforced them, even though the backend's ManagedCertificate schema
and handler contract treat them as required. Fix the contract at all
four layers:

  - web/src/pages/CertificatesPage.tsx: replace owner_id/team_id free-
    text inputs with <select> elements fed by getOwners/getTeams/
    getPolicies queries; mark all three required; gate the Create
    button on owner_id + team_id + renewal_policy_id being set.
  - internal/api/handler/certificates.go: ValidateRequired for
    owner_id, team_id, renewal_policy_id on CreateCertificate so the
    handler returns HTTP 400 with the offending field name before the
    service layer is reached.
  - internal/mcp/types.go: drop ',omitempty' from
    CreateCertificateInput.RenewalPolicyID so the MCP schema reflects
    the required contract; Update inputs keep partial-update semantics.
  - api/openapi.yaml: 'required: [name, common_name, renewal_policy_id,
    issuer_id, owner_id, team_id]' was already present on the Create
    schema; clarified DeploymentTarget.agent_id description to note the
    FK contract.

C-002 — CreateTargetWizard accepted an empty or bogus agent_id and the
service inserted directly, producing a Postgres 23503 FK-violation that
bubbled out as a generic HTTP 500. The FK itself (migration 000001 line
104: agent_id TEXT NOT NULL REFERENCES agents(id)) is correct; we keep
the schema strict and add validation at three layers:

  - internal/service/target.go: introduce
    ErrAgentNotFound sentinel and pre-validate agent_id in
    TargetService.CreateTarget — empty string returns
    'agent_id is required'; a nonexistent id returns the full
    'referenced agent does not exist: <id>' error. Both wrap
    ErrAgentNotFound via fmt.Errorf %w so callers can use errors.Is.
  - internal/api/handler/targets.go: ValidateRequired on agent_id; map
    errors.Is(err, service.ErrAgentNotFound) to HTTP 400 instead of
    letting it fall through to the generic 500 branch.
  - internal/mcp/types.go: drop ',omitempty' from
    CreateTargetInput.AgentID to match the required contract.
  - web/src/pages/TargetsPage.tsx: replace the free-text Agent ID input
    with a <select> populated from getAgents(); include agent in the
    canProceedToReview gate so Next is disabled until an agent is
    chosen.

Regression coverage (21 new subtests total):

  - TestCreateCertificate_MissingRequiredField_Returns400 — 6 subtests,
    one per required field, each proves the handler guard fires before
    the mock service is called.
  - TestCreateTarget_MissingAgentID_Returns400 — handler guard.
  - TestCreateTarget_NonexistentAgent_Returns400 — pins the
    ErrAgentNotFound -> 400 translation.
  - TestTargetService_CreateTarget_MissingAgentID — errors.Is sentinel.
  - TestTargetService_CreateTarget_NonexistentAgentID — errors.Is.
  - The existing TestTargetService_CreateTarget_Success, along with
    TestCreateTarget_{MissingName,MissingType,NameTooLong}_* handler
    tests, were updated to seed a real agent or include agent_id in
    the request body so the happy paths still run cleanly.

Gates (Phase 4):
  - go build/vet/test/race: green
  - go test -cover: internal/service 68.7% (gate 55%),
    internal/api/handler 78.9% (gate 60%)
  - golangci-lint on service+handler+mcp: 0 issues
  - govulncheck: no reachable vulns
  - tsc --noEmit: clean
  - vitest: 223/223 passing

See cowork/certctl-coverage-gap-audit.md entries C-001 and C-002.
2026-04-18 16:01:40 +00:00
shankar0123 9143da5fa8 Merge branch 'fix/d-008-policy-engine-drift' 2026-04-18 14:56:06 +00:00
shankar0123 b3cc7cbdb2 fix(policies): close the D-006 loop — TitleCase seed canonicals + severity-aware, config-consuming rule engine (D-008)
D-008 was a three-part drift in the policy engine that made the
D-005/D-006 remediation cosmetic below the DB layer:

  (a) migrations/seed.sql INSERTed rules with pre-D-005 lowercase
      types ('ownership', 'environment', 'lifetime', 'renewal_window')
      that the handler validator rejects on Create/Update but that
      raw SQL INSERTs bypassed entirely. At runtime evaluateRule's
      switch fell through to the default "unknown policy rule type"
      error branch on every demo rule × every cert × every cycle,
      flooding logs while emitting zero violations.

  (b) migrations/seed_demo.sql persisted lowercase severity values
      ('critical', 'error', 'warning') on policy_violations rows.
      INSERT succeeded because that column had no CHECK, but any
      frontend comparing against the canonical PolicySeverity enum
      mis-categorized every seeded violation.

  (c) evaluateRule hardcoded Severity: PolicySeverityWarning on
      every emitted violation and ignored rule.Config entirely —
      so the D-006 per-rule severity column (000013) and every
      per-arm Config JSON ({allowed_issuer_ids, allowed_domains,
      required_keys, allowed, lead_time_days, max_days}) was dead
      data below the evaluation layer.

This commit lands (a)+(b)+(c) atomically. Shipping any subset
leaves the feature half-working.

## Changes

Domain (internal/domain/policy.go):
  * Add PolicyTypeCertificateLifetime as the 6th TitleCase canonical.
    Pre-D-008 the seeded "max-certificate-lifetime" rule had no engine
    arm — routing it through RenewalLeadTime would conflate "how
    close to expiry before we renew" with "how long can the cert
    possibly be", two distinct semantics. The new type accepts
    config {"max_days": int} and flags certs whose
    NotAfter - NotBefore exceeds the cap.

Handler validator (internal/api/handler/validation.go):
  * ValidatePolicyType allowlist grown to 6 canonicals
    (AllowedIssuers, AllowedDomains, RequiredMetadata,
    AllowedEnvironments, RenewalLeadTime, CertificateLifetime).

OpenAPI (api/openapi.yaml):
  * PolicyType enum grown to match domain.

Frontend (web/src/api/types.ts, types.test.ts):
  * POLICY_TYPES tuple gains CertificateLifetime; pin test asserts
    all 6 canonicals and rejects casing drift.

Migration 000014 (policy_violations severity CHECK):
  * Named CHECK constraint (policy_violations_severity_check)
    mirroring 000013's allowlist, defense-in-depth at the DB layer
    against future drift from bypassed writes (migrations, psql
    sessions, future callers). Symmetric down migration drops by
    name.

Seed data:
  * migrations/seed.sql rewritten to emit TitleCase canonicals with
    per-arm config JSON that actually exercises the config-consuming
    paths (not the missing-field backstops):
      - pr-require-owner         → RequiredMetadata     {"required_keys":["owner"]}                        Warning
      - pr-allowed-environments  → AllowedEnvironments  {"allowed":["production","staging","development"]} Error
      - pr-max-certificate-lifetime → CertificateLifetime {"max_days":90}                                   Critical
      - pr-min-renewal-window    → RenewalLeadTime      {"lead_time_days":14}                              Warning
    Severities are now differentiated per rule (D-006 intent).
  * migrations/seed_demo.sql violation rows flipped to TitleCase
    severity ('Critical', 'Error', 'Warning') so migration 000014
    applies cleanly on upgrade paths.

Engine rewrite (internal/service/policy.go):
  * evaluateRule rewritten. All six arms now:
      1. Parse rule.Config into the per-arm typed struct.
      2. Bad JSON → log at ValidateCertificate boundary and skip
         this rule (no co-located poisoning of other rules in the
         same batch).
      3. Empty/null Config → emit the pre-D-008 missing-field
         violation (backwards compat invariant — operators who
         haven't reconfigured still see the same output).
      4. Violations emitted carry rule.Severity (no more hardcoded
         Warning); D-006 column is now load-bearing.
  * CertificateLifetime arm reads NotBefore/NotAfter from the
    certificate's latest version via CertRepo. Injected via
    PolicyService.SetCertRepo() setter — avoids churning ~36
    NewPolicyService call sites while keeping the lifetime arm
    optional (degrades to a log+skip if the setter is not wired).

Server wiring (cmd/server/main.go):
  * policyService.SetCertRepo(certRepo) wired after construction.

Tests (internal/service/policy_test.go):
  * 25 new subtests across 5 groups:
      - TestEvaluateRule_SeverityPassThrough (6): every rule type
        emits violations carrying rule.Severity, not hardcoded.
      - TestEvaluateRule_ConfigConsumed (12): every per-arm Config
        path exercised positive + negative.
      - TestEvaluateRule_EmptyConfig_BackCompat (3): empty/null
        Config still emits pre-D-008 missing-field violations.
      - TestEvaluateRule_BadConfig_SkipsRule: malformed JSON logs
        and skips cleanly without poisoning neighbors.
      - TestEvaluateRule_CertificateLifetime_RepoScenarios (3):
        ok when repo wired, log+skip when not, handles missing
        NotBefore/NotAfter edges.

Provenance: D-008 surfaced during D-005/D-006 remediation review
in eef1db0. That commit added persistence and CI pins for the
severity field but did not re-verify the evaluation layer
consumed it; this finding and fix close the audit-process gap.
2026-04-18 14:55:56 +00:00
shankar0123 eef1db0f0a fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006)
Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a
single commit because they share the same root cause — policy CRUD sending
values the backend silently rejects — and splitting them would leave a
half-working UI between commits.

## D-005 (P0): PoliciesPage dropdown 400s every Create Policy

Root cause
----------
`web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a
hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array.
The backend's `internal/api/handler/validators.go::ValidatePolicyType`
enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`,
`RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in
`internal/domain/policy.go`. Every Create Policy request was rejected with
`400 invalid policy type`. The error surfaced only as a transient toast;
the modal closed anyway. Silent user-visible failure.

Fix
---
- `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES`
  tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and
  `PolicyViolation.severity` to the literal-union types. Dropdown is now
  sourced from the tuple; casing drift becomes a compile error.
- `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` /
  `severityDots` to the TitleCase values, added `humanize()` for display
  (AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral`
  fallback that was papering over the mismatch.
- `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone
  edits one side of the frontend/backend contract without the other, CI
  fails with a clear assertion. Pure-TS vitest, no RTL dependency.

## D-006 (P1): `severity` field silently dropped on create/update

Root cause
----------
`PolicyRule` had no `Severity` field in `internal/domain/policy.go`. The
frontend has always sent `severity` on create/update, but Go's
`json.Decoder` (default settings, no `DisallowUnknownFields`) silently
dropped it. The value never reached PostgreSQL. Every rule rendered with
the same severity because there was no severity — just a display
computation downstream.

Fix: option (b), full-stack schema add (not delete-the-field)
-------------------------------------------------------------
- Migration `000013_policy_rule_severity` (up + down): adds
  `severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with
  CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No
  index — three-value column on a low-thousands-rows table, planner will
  seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data.
- `internal/domain/policy.go`: added `Severity PolicySeverity` field.
- `internal/repository/postgres/policy.go`: plumbed `severity` through
  ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT,
  UpdateRule UPDATE (4 queries).
- `internal/service/policy.go::UpdatePolicy`: if the client omits
  severity on a PUT (zero-value empty string), fetch the existing rule
  and preserve its severity. Without this, partial updates would trip the
  NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type
  (out of scope).
- `internal/api/handler/policies.go::CreatePolicy`: default empty severity
  to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with
  clear message instead of 500 on CHECK violation. `UpdatePolicy`:
  validates severity only when provided.
- `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional
  `severity` on the MCP `create_policy` / `update_policy` tool inputs so
  LLM callers stay in sync with the wire contract.
- `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with
  the enum and default.

Acceptance criterion (user-defined)
-----------------------------------
"Create a rule with severity=Critical, reload the page, and still see
Critical — no silent drops." Verified end-to-end: frontend sends
`severity: "Critical"`, handler validates, service persists, DB stores,
GET returns, React renders the correct badge.

Seed data
---------
`migrations/seed.sql`: four demo rules now have differentiated severities
— `pr-require-owner` → Warning, `pr-allowed-environments` → Error,
`pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` →
Warning. The user called out that seeding all four at the same severity
makes the feature look decorative; differentiation demonstrates the
column carries real signal.

## Integration test fix (side effect of D-006)

`internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy`
was sending `"severity": "High"` — a value from the pre-audit severity
vocabulary that the new `ValidatePolicySeverity` correctly rejects with
400. Changed to `"Error"` (closest semantic match in the new TitleCase
allowlist). Only severity reference in the integration/ directory;
verified via grep.

## Out of scope, logged for follow-up (d/D-008)

Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly
deferred per direction:

1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values
   (`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`).
   These are load-bearing on `internal/service/policy.go::evaluateRule`'s
   `switch rule.Type` (which also uses the lowercase strings). Migrating
   requires coordinated changes across seed + evaluation engine.
2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'`
   severity — will now fail the new CHECK constraint. Separate fix.
3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on
   emitted violations and ignores the configured `rule.Config`. The new
   severity column is read correctly on the CRUD path but not yet
   consulted during evaluation.

## Verification

Backend:
- `go build ./...` — clean
- `go vet ./...` — clean
- `go test -short ./...` — all packages green, including
  `internal/service` (policy service), `internal/api/handler` (policy +
  MCP handler tests), `internal/integration` (e2e_test.go after fix),
  `internal/domain`, `internal/repository/postgres`.

Frontend:
- `tsc --noEmit` — clean
- `vitest run` — 223/223 passing (4 new assertions in types.test.ts)
- `vite build` — clean (only the pre-existing chunk-size warning)
2026-04-18 13:02:04 +00:00
shankar0123 72f5246ce3 Merge branch 'fix/m11-cosign-v3-sign-blob-bundle': M-11 cosign v3 sign-blob migration 2026-04-18 09:29:25 +00:00
shankar0123 cb308bb4c7 ci(release): migrate cosign sign-blob to --bundle (cosign v3.0)
Cosign v3.0 (shipped by default with sigstore/cosign-installer@cad07c2e,
release v3.0.5) removed --output-signature and --output-certificate from
the sign-blob subcommand. The replacement is a single --bundle flag that
emits a unified Sigstore bundle (.sigstore.json) containing the
signature, certificate chain, and Rekor inclusion proof in one file.

This change migrates both sign-blob invocations in .github/workflows/
release.yml (per-binary matrix signing and aggregate checksums.txt
signing), updates the artefact upload paths, the artefact aggregation
case filter, the GitHub Release asset list, and the release-notes body
verify-blob example. The README cosign verification snippet and sidecar
description are also updated to the --bundle / .sigstore.json shape.

No cosign version pinning. No legacy fallback. OCI image signing
(cosign sign on image digest) is unchanged — only sign-blob flags
changed in v3.0. See M-11 in certctl-audit-report.md.

Verification gates:
- YAML parse: OK
- go vet ./...: exit 0
- go build ./...: exit 0
- grep 'cosign sign-blob' release.yml: 2 (expected: 2)
- grep '.sigstore.json' release.yml: 9 (expected: >=5)
- grep '.sig/.pem' release.yml non-comment: 0 (expected: 0)
- README legacy cosign refs: 0 (expected: 0)
- docs/ legacy cosign refs: 0 (expected: 0)

Coverage: unchanged (CI workflow edit + README — zero Go code touched).
2026-04-18 09:29:20 +00:00
206 changed files with 22440 additions and 1735 deletions
+29 -3
View File
@@ -148,8 +148,34 @@ jobs:
with:
version: '3.13.0'
# HTTPS-Everywhere (v2.0.47): the chart fails render when no TLS source is
# configured. Every lint/template invocation below must pick exactly one
# provisioning mode — see deploy/helm/certctl/templates/_helpers.tpl
# (certctl.tls.required) and docs/tls.md.
- name: Lint Helm Chart
run: helm lint deploy/helm/certctl/
run: |
helm lint deploy/helm/certctl/ \
--set server.tls.existingSecret=certctl-tls-ci
- name: Template Helm Chart
run: helm template certctl deploy/helm/certctl/ > /dev/null
- name: Template Helm Chart (existingSecret mode)
run: |
helm template certctl deploy/helm/certctl/ \
--set server.tls.existingSecret=certctl-tls-ci \
> /dev/null
- name: Template Helm Chart (cert-manager mode)
run: |
helm template certctl deploy/helm/certctl/ \
--set server.tls.certManager.enabled=true \
--set server.tls.certManager.issuerRef.name=letsencrypt-prod \
> /dev/null
- name: Template Helm Chart (guard fails without TLS)
run: |
# Inverse test: the chart MUST refuse to render when no TLS source is
# configured. If this ever renders successfully, the fail-loud guard
# in certctl.tls.required has regressed.
if helm template certctl deploy/helm/certctl/ > /dev/null 2>&1; then
echo "::error::Helm chart rendered without a TLS source — fail-loud guard regressed"
exit 1
fi
+15 -12
View File
@@ -79,10 +79,14 @@ jobs:
OUTPUT_NAME: ${{ steps.build.outputs.output_name }}
run: |
set -euo pipefail
# Cosign v3.0 (shipped by cosign-installer@v4.1.1 default
# cosign-release=v3.0.5) removed --output-signature/--output-certificate
# on sign-blob. The replacement is --bundle, which emits a unified
# Sigstore bundle (signature + cert chain + Rekor inclusion proof) as
# a single .sigstore.json artefact. M-11.
cosign sign-blob \
--yes \
--output-signature "dist/${OUTPUT_NAME}.sig" \
--output-certificate "dist/${OUTPUT_NAME}.pem" \
--bundle "dist/${OUTPUT_NAME}.sigstore.json" \
"dist/${OUTPUT_NAME}"
- name: Compute SHA-256 sidecar
@@ -100,8 +104,7 @@ jobs:
name: binary-${{ steps.build.outputs.output_name }}
path: |
dist/${{ steps.build.outputs.output_name }}
dist/${{ steps.build.outputs.output_name }}.sig
dist/${{ steps.build.outputs.output_name }}.pem
dist/${{ steps.build.outputs.output_name }}.sigstore.json
dist/${{ steps.build.outputs.output_name }}.sbom.spdx.json
dist/${{ steps.build.outputs.output_name }}.sha256
if-no-files-found: error
@@ -138,7 +141,7 @@ jobs:
: > checksums.txt
for f in certctl-*; do
case "$f" in
*.sig|*.pem|*.sbom.spdx.json|*.sha256|checksums.txt)
*.sigstore.json|*.sbom.spdx.json|*.sha256|checksums.txt)
continue ;;
esac
sha256sum "$f" >> checksums.txt
@@ -156,10 +159,11 @@ jobs:
run: |
set -euo pipefail
cd artifacts
# Cosign v3.0 --bundle replaces the removed v2 flag pair
# --output-signature / --output-certificate. See M-11.
cosign sign-blob \
--yes \
--output-signature checksums.txt.sig \
--output-certificate checksums.txt.pem \
--bundle checksums.txt.sigstore.json \
checksums.txt
- name: Upload artefacts to GitHub Release
@@ -169,8 +173,7 @@ jobs:
files: |
artifacts/certctl-*
artifacts/checksums.txt
artifacts/checksums.txt.sig
artifacts/checksums.txt.pem
artifacts/checksums.txt.sigstore.json
# ----------------------------------------------------------------------
# provenance-binaries (M-3): SLSA Level 3 provenance for every binary.
@@ -402,15 +405,15 @@ jobs:
```bash
cosign verify-blob \
--certificate checksums.txt.pem \
--signature checksums.txt.sig \
--bundle checksums.txt.sigstore.json \
--certificate-identity-regexp '^https://github\.com/shankar0123/certctl/\.github/workflows/release\.yml@refs/tags/' \
--certificate-oidc-issuer 'https://token.actions.githubusercontent.com' \
checksums.txt
```
Replace `checksums.txt` with any individual binary name to verify that
artefact directly (each binary ships with its own `.sig` + `.pem` sidecar).
artefact directly (each binary ships with its own `.sigstore.json`
bundle, e.g. `cosign verify-blob --bundle certctl-agent-linux-amd64.sigstore.json …`).
**3. Verify SLSA Level 3 provenance (binaries):**
+18 -2
View File
@@ -63,12 +63,28 @@ certctl-cli
/server
/agent
/cli
/mcp-server
# Private strategy docs
strategy.md
SECURITY_REMEDIATION.md
# OS
.DS_Store
Thumbs.db
mcp-server
# Local Go build/module caches (session-scoped, never committed)
/.gocache/
/.gomodcache/
/.gopath/
/.gomodcache-gopath/
# Design scratch files (session-scoped)
/.i004-design.md
/.i005-design.md
# HTTPS-Everywhere (M-007) Phase 6: the docker-compose.test.yml tls-init
# container writes ca.crt / server.crt / server.key into this directory so
# the host-side integration_test.go binary can pin the CA via
# CERTCTL_TEST_CA_BUNDLE=./certs/ca.crt. Material is regenerated on every
# `docker compose up` and never belongs in git.
/deploy/test/certs/
+50
View File
@@ -0,0 +1,50 @@
# Changelog
All notable changes to certctl are documented in this file. Dates use ISO 8601. Versions follow [Semantic Versioning](https://semver.org/).
## [2.2.0] — 2026-04-19
### HTTPS Everywhere — The Irony
> certctl manages other teams' certificates. Until v2.2, it didn't terminate TLS on its own control plane. We treated the server as an internal service sitting behind whatever TLS-terminating infrastructure the operator already owned — reverse proxies, Kubernetes Ingress controllers, service mesh sidecars. Working through an EST coverage-gap audit surfaced this as a credibility problem we wanted to fix head-on: a cert-lifecycle product should ship with HTTPS by default. This release flips that. Self-signed bootstrap for docker-compose demos, operator-supplied Secret for Helm (with optional cert-manager integration), and a one-step cutover with no backward-compat bridge. Out-of-date agents will fail at the TLS handshake layer on upgrade; the upgrade guide walks operators through the roll.
### Breaking Changes
- **HTTPS-only control plane. The plaintext HTTP listener is gone.** There is no `CERTCTL_TLS_ENABLED=false` escape hatch and no `:8080` fallback. Operators who were running certctl behind their own TLS terminator must either (a) continue doing so and let the downstream TLS terminator talk to certctl's HTTPS listener, or (b) bring their own cert/key and terminate on certctl directly. Either path requires config changes — see `docs/upgrade-to-tls.md` for a one-step cutover.
- **Agents reject `CERTCTL_SERVER_URL=http://...` at startup.** This is a pre-flight config validation failure with a fail-loud diagnostic pointing at `docs/upgrade-to-tls.md`. Not a TCP-refused, not a TLS-handshake-error — the agent will not even attempt the network call. Every agent deployment must be reconfigured before upgrading the server.
- **CLI and MCP clients require `https://` URLs.** Same pre-flight rejection of plaintext schemes.
- **TLS 1.2 is not supported. TLS 1.3 only.** The server's `tls.Config.MinVersion` is pinned to `tls.VersionTLS13`. Any client still negotiating TLS 1.2 will fail at the handshake. Modern curl, Go stdlib, browsers, and Kubernetes tooling all default to 1.3-capable; legacy clients may need an upgrade.
- **Helm chart requires a TLS source.** `helm install` without one of `server.tls.existingSecret`, `server.tls.certManager.enabled`, or (for eval only) `server.tls.selfSigned.enabled` fails at template time with a diagnostic pointing at `docs/tls.md`. There is no default-to-plaintext path.
### Added
- **Self-signed bootstrap for Docker Compose demos.** A `certctl-tls-init` init container runs before the server on first boot, generates a SAN-valid self-signed cert into `deploy/test/certs/`, and exits. The server mounts the resulting cert/key. Every curl in the demo stack pins against `./deploy/test/certs/ca.crt` with `--cacert`.
- **Helm chart TLS provisioning — three modes.** Operator-supplied Secret (`server.tls.existingSecret`), cert-manager integration (`server.tls.certManager.enabled` with issuer selection), or self-signed (`server.tls.selfSigned.enabled` — eval only, not supported for production). Chart templates enforce exactly one is active.
- **Hot-reload of TLS cert/key on `SIGHUP`.** Overwrite the cert/key on disk, send `SIGHUP` to the server PID, watch the `slog.Info("tls.reload", ...)` log line, and new TLS connections use the new cert. Failure during reload is logged and does not crash the server; the previous cert remains in use.
- **Agent CA-bundle env vars.** `CERTCTL_SERVER_CA_BUNDLE_PATH` points at a PEM file the agent's HTTP client will trust. `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY` disables verification (development only — the agent logs a loud warning at startup). `install-agent.sh` writes both as commented template lines into the generated `agent.env`.
- **Integration test suite runs over HTTPS.** `go test -tags=integration ./deploy/test/...` stands up the full Compose stack, extracts the self-signed CA bundle, and exercises every certctl API over `https://localhost:8443`. All 34 subtests green.
- **`docs/tls.md`** — cert provisioning patterns: bring-your-own Secret, cert-manager, self-signed bootstrap, SAN requirements, rotation workflows, SIGHUP reload semantics, troubleshooting.
- **`docs/upgrade-to-tls.md`** — one-step cutover guide for existing v2.1 operators. Walks through the agent fleet roll, Helm upgrade sequencing, downgrade-is-not-supported warnings, and cert-provisioning decision tree.
### Changed
- `cmd/server/main.go` now calls `http.Server.ListenAndServeTLS(certFile, keyFile)`. The plaintext `ListenAndServe` code path is deleted — `grep -rn "ListenAndServe[^T]" cmd/ internal/` returns zero hits.
- All documentation curls (`docs/testing-guide.md`, `docs/quickstart.md`, `deploy/helm/INSTALLATION.md`, `deploy/helm/DEPLOYMENT_GUIDE.md`, `deploy/ENVIRONMENTS.md`, `docs/openapi.md`, migration guides, example READMEs) use `https://localhost:8443` and `--cacert` against the demo stack's bundle.
- OpenAPI spec (`api/openapi.yaml`) `servers` blocks default to `https://localhost:8443`.
### Security
- TLS 1.3 pinned via `tls.Config.MinVersion = tls.VersionTLS13`.
- Plaintext HTTP listener removed entirely — no port 8080, no `Upgrade-Insecure-Requests`, no HSTS-required redirect dance. There is only one port: 8443, TLS 1.3.
- `grep -rn "http://" cmd/ internal/` returns zero hits outside test fixtures and the agent-side URL-scheme rejection error message.
### Upgrade Notes
Read `docs/upgrade-to-tls.md` before upgrading. The short version:
1. Pick a TLS source — bring-your-own cert, cert-manager, or self-signed bootstrap.
2. Upgrade the server with TLS configured. First boot over HTTPS.
3. Roll the agent fleet: set `CERTCTL_SERVER_URL=https://...` and, if using a private CA, `CERTCTL_SERVER_CA_BUNDLE_PATH`. Old agents will fail loud at startup — expected.
4. Roll CLI/MCP clients the same way.
There is no backward-compat bridge. There is no dual-listener mode. The cutover is one step.
+19 -10
View File
@@ -197,7 +197,7 @@ cd certctl
docker compose -f deploy/docker-compose.yml up -d --build
```
Wait ~30 seconds, then open **http://localhost:8443** in your browser. The onboarding wizard walks you through connecting a CA, deploying an agent, and issuing your first certificate.
Wait ~30 seconds, then open **https://localhost:8443** in your browser. (The shipped `docker-compose.yml` self-signs a cert via the `certctl-tls-init` init container on first boot — accept the browser warning for the demo, or feed the generated `ca.crt` to your client.) The onboarding wizard walks you through connecting a CA, deploying an agent, and issuing your first certificate.
**Want a pre-populated demo instead?** Add the demo override to see 32 certificates across 10 issuers, 8 agents, and 180 days of realistic history:
@@ -208,10 +208,12 @@ docker compose -f deploy/docker-compose.yml -f deploy/docker-compose.demo.yml up
The `deploy/` directory has four compose files: `docker-compose.yml` (base platform), `docker-compose.demo.yml` (demo data overlay), `docker-compose.dev.yml` (PgAdmin + debug logging), and `docker-compose.test.yml` (standalone integration tests with real CA backends). See the [Docker Compose Environments Guide](deploy/ENVIRONMENTS.md) for a service-by-service walkthrough, or the [Quick Start](docs/quickstart.md#docker-compose-environments) for a summary.
```bash
curl http://localhost:8443/health
curl --cacert $(docker compose -f deploy/docker-compose.yml exec -T certctl-server cat /etc/certctl/tls/ca.crt) https://localhost:8443/health
# {"status":"healthy"}
```
The control plane is HTTPS-only (TLS 1.3, no plaintext listener). See [`docs/tls.md`](docs/tls.md) for cert provisioning patterns and [`docs/upgrade-to-tls.md`](docs/upgrade-to-tls.md) if you're upgrading from a pre-v2.2 release.
### Agent Install (One-Liner)
```bash
@@ -260,15 +262,17 @@ sha256sum -c checksums.txt
```bash
cosign verify-blob \
--certificate checksums.txt.pem \
--signature checksums.txt.sig \
--bundle checksums.txt.sigstore.json \
--certificate-identity-regexp '^https://github\.com/shankar0123/certctl/\.github/workflows/release\.yml@refs/tags/' \
--certificate-oidc-issuer 'https://token.actions.githubusercontent.com' \
checksums.txt
```
Every individual binary has its own `.sig` + `.pem` sidecar; swap
`checksums.txt` for any binary name to verify it directly.
Every individual binary ships with its own `.sigstore.json` bundle
(unified Sigstore bundle containing signature, certificate chain, and
Rekor inclusion proof). Swap `checksums.txt` for any binary name and
point `--bundle` at the matching `<binary>.sigstore.json` to verify it
directly.
**3. Verify SLSA Level 3 provenance on a binary:**
@@ -324,8 +328,9 @@ Each directory contains a `docker-compose.yml` and a `README.md` explaining the
go install github.com/shankar0123/certctl/cmd/cli@latest
# Configure
export CERTCTL_SERVER_URL=http://localhost:8443
export CERTCTL_SERVER_URL=https://localhost:8443
export CERTCTL_API_KEY=your-api-key
export CERTCTL_SERVER_CA_BUNDLE_PATH=/path/to/ca.crt # or --ca-bundle on the CLI; --insecure for dev self-signed
# Usage
certctl-cli certs list # List all certificates
@@ -345,11 +350,14 @@ certctl ships a standalone MCP (Model Context Protocol) server that exposes all
```bash
# Install and run
go install github.com/shankar0123/certctl/cmd/mcp-server@latest
export CERTCTL_SERVER_URL=http://localhost:8443
export CERTCTL_SERVER_URL=https://localhost:8443
export CERTCTL_API_KEY=your-api-key
export CERTCTL_SERVER_CA_BUNDLE_PATH=/path/to/ca.crt # required for self-signed bootstrap
mcp-server
```
The MCP server is env-vars-only — there are no CLI flags for TLS. If you must bypass verification for local development against a self-signed cert, set `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true`. Never set that in production.
**Claude Desktop** (`claude_desktop_config.json`):
```json
{
@@ -357,8 +365,9 @@ mcp-server
"certctl": {
"command": "mcp-server",
"env": {
"CERTCTL_SERVER_URL": "http://localhost:8443",
"CERTCTL_API_KEY": "your-api-key"
"CERTCTL_SERVER_URL": "https://localhost:8443",
"CERTCTL_API_KEY": "your-api-key",
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/ca.crt"
}
}
}
+594 -50
View File
@@ -17,10 +17,8 @@ info:
url: https://github.com/shankar0123/certctl/blob/master/LICENSE
servers:
- url: http://localhost:8080
description: Local development
- url: http://localhost:8443
description: Docker Compose demo
- url: https://localhost:8443
description: Docker Compose demo (self-signed cert; pin with ./deploy/test/certs/ca.crt)
security:
- bearerAuth: []
@@ -29,7 +27,11 @@ tags:
- name: Certificates
description: Certificate lifecycle — CRUD, versions, renewal, deployment, revocation
- name: CRL & OCSP
description: Certificate revocation list and OCSP responder
description: |
Certificate revocation list (RFC 5280) and OCSP responder (RFC 6960).
Served unauthenticated under `/.well-known/pki/*` (RFC 8615) so
relying parties can retrieve revocation status without a certctl
API key.
- name: Issuers
description: CA issuer connector management (Local CA, ACME, step-ca)
- name: Targets
@@ -40,6 +42,8 @@ tags:
description: Job queue — issuance, renewal, deployment, validation
- name: Policies
description: Policy rules and violation tracking
- name: RenewalPolicies
description: Lifecycle renewal policies (distinct from compliance policy rules above)
- name: Profiles
description: Certificate enrollment profiles with crypto constraints
- name: Teams
@@ -493,50 +497,28 @@ paths:
"500":
$ref: "#/components/responses/InternalError"
# ─── CRL & OCSP ─────────────────────────────────────────────────────
/api/v1/crl:
# ─── PKI (CRL & OCSP, RFC 5280 / 6960 / 8615) ──────────────────────
#
# Relying parties (browsers, OpenSSL clients, OCSP stapling sidecars,
# mTLS clients) cannot present a certctl Bearer token, so these two
# endpoints are unauthenticated and live under the RFC 8615
# `.well-known` namespace. They were previously mounted at
# /api/v1/crl/{issuer_id} and /api/v1/ocsp/{issuer_id}/{serial}; those
# paths were removed in M-006.
#
# The non-standard JSON CRL endpoint (GET /api/v1/crl) was also
# removed — RFC 5280 defines only the DER wire format.
/.well-known/pki/crl/{issuer_id}:
get:
tags: [CRL & OCSP]
summary: Get JSON CRL
description: Returns all revoked certificates in JSON format.
operationId: getCRL
responses:
"200":
description: JSON CRL
content:
application/json:
schema:
type: object
properties:
version:
type: integer
example: 1
entries:
type: array
items:
type: object
properties:
serial_number:
type: string
revocation_date:
type: string
format: date-time
revocation_reason:
type: string
total:
type: integer
generated_at:
type: string
format: date-time
"500":
$ref: "#/components/responses/InternalError"
/api/v1/crl/{issuer_id}:
get:
tags: [CRL & OCSP]
summary: Get DER-encoded X.509 CRL
description: Returns a proper DER-encoded CRL signed by the issuing CA. 24-hour validity.
summary: Get DER-encoded X.509 CRL (RFC 5280)
description: |
Returns a DER-encoded CRL signed by the issuing CA (RFC 5280 §5),
served unauthenticated per RFC 8615 `.well-known` semantics so
relying parties can retrieve it without a certctl API key.
Validity is 24 hours.
operationId: getDERCRL
security: []
parameters:
- name: issuer_id
in: path
@@ -560,12 +542,17 @@ paths:
"501":
description: Issuer does not support CRL generation
/api/v1/ocsp/{issuer_id}/{serial}:
/.well-known/pki/ocsp/{issuer_id}/{serial}:
get:
tags: [CRL & OCSP]
summary: OCSP responder
description: Returns signed OCSP response (good/revoked/unknown) for the given serial number.
summary: OCSP responder (RFC 6960)
description: |
Returns a signed OCSP response (good/revoked/unknown) for the
given serial number per RFC 6960 §2.1, served unauthenticated
per RFC 8615 so relying parties and OCSP stapling sidecars can
query revocation status without a certctl API key.
operationId: handleOCSP
security: []
parameters:
- name: issuer_id
in: path
@@ -893,6 +880,40 @@ paths:
"500":
$ref: "#/components/responses/InternalError"
/api/v1/agents/retired:
get:
tags: [Agents]
summary: List retired agents
description: |
I-004: opt-in listing of soft-retired agents. The default
`GET /api/v1/agents` endpoint filters retired rows out; this is the
dedicated surface for reading them back (e.g., the operator UI's
"Retired" tab, audit and forensics workflows). Pagination defaults
match the default agent listing (page=1, per_page=50, max 500). Go
1.22's enhanced ServeMux routes `/agents/retired` to this handler
via the literal-beats-pattern-var precedence rule, so the sibling
`/agents/{id}` route does not shadow it.
operationId: listRetiredAgents
parameters:
- $ref: "#/components/parameters/page"
- $ref: "#/components/parameters/per_page"
responses:
"200":
description: Paginated list of retired agents
content:
application/json:
schema:
allOf:
- $ref: "#/components/schemas/PaginationEnvelope"
- type: object
properties:
data:
type: array
items:
$ref: "#/components/schemas/Agent"
"500":
$ref: "#/components/responses/InternalError"
/api/v1/agents/{id}:
get:
tags: [Agents]
@@ -913,12 +934,116 @@ paths:
$ref: "#/components/responses/NotFound"
"500":
$ref: "#/components/responses/InternalError"
delete:
tags: [Agents]
summary: Soft-retire agent
description: |
I-004: soft-retirement. The agent row is preserved (so its audit
trail and historical job links remain intact) and `retired_at` is
stamped. A retired agent receives `410 Gone` on subsequent
heartbeats so it can shut down cleanly.
Behavior matrix:
| Scenario | Query | Status | Body |
| --- | --- | --- | --- |
| Clean retire (no active dependencies) | none | `200` | `RetireAgentResponse` with `cascade=false`, zero counts |
| Blocked by active targets/certs/jobs | none | `409` | `BlockedByDependenciesResponse` with per-bucket counts |
| Force-cascade retire | `force=true&reason=...` | `200` | `RetireAgentResponse` with `cascade=true`, pre-cascade counts |
| Idempotent re-retire | either | `204` | (empty — downstream consumers break on stray bodies) |
| `force=true` without reason | `force=true` | `400` | ErrorResponse (ErrForceReasonRequired) |
| Reserved sentinel agent | any | `403` | ErrorResponse (ErrAgentIsSentinel) |
| Unknown agent id | any | `404` | ErrorResponse |
Sentinel agents are the four reserved identities backing non-agent
discovery subsystems (`server-scanner`, `cloud-aws-sm`,
`cloud-azure-kv`, `cloud-gcp-sm`). Retiring them would orphan the
scanner or a cloud secret-manager source, so the handler refuses
unconditionally — even with `force=true`.
operationId: retireAgent
parameters:
- $ref: "#/components/parameters/resourceId"
- name: force
in: query
required: false
schema:
type: boolean
default: false
description: |
Cascade-retire active downstream targets, certificates, and
jobs. When `true`, a non-empty `reason` is required. A
malformed value (anything strconv.ParseBool rejects) is
silently treated as `false` so a typoed query can never
accidentally enable the cascade.
- name: reason
in: query
required: false
schema:
type: string
description: |
Human-readable reason recorded on the retired row and in the
immutable audit trail. Required (non-empty after trimming)
when `force=true`.
responses:
"200":
description: |
Agent retired (clean retire or successful force-cascade). Body
is `RetireAgentResponse`.
content:
application/json:
schema:
$ref: "#/components/schemas/RetireAgentResponse"
"204":
description: |
Idempotent retire — the agent was already retired. Response
body is empty (the 200-path shape does not apply, and
downstream clients that tee responses into dashboards would
break on spurious bodies).
"400":
description: |
`force=true` was sent without a non-empty `reason`
(ErrForceReasonRequired).
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
"403":
description: |
Agent is a reserved sentinel and cannot be retired even with
`?force=true` (ErrAgentIsSentinel).
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
"404":
$ref: "#/components/responses/NotFound"
"409":
description: |
Blocked by active downstream dependencies. Body carries
per-bucket counts so the operator UI can show the user which
dependency is holding up the retire. Re-run with
`?force=true&reason=...` to cascade.
content:
application/json:
schema:
$ref: "#/components/schemas/BlockedByDependenciesResponse"
"405":
description: Method not allowed (only DELETE, GET are routed to this path)
"500":
$ref: "#/components/responses/InternalError"
/api/v1/agents/{id}/heartbeat:
post:
tags: [Agents]
summary: Agent heartbeat
description: Reports agent liveness and metadata (OS, architecture, IP, version).
description: |
Reports agent liveness and metadata (OS, architecture, IP, version).
I-004: a retired agent still polling the heartbeat endpoint receives
`410 Gone` so `cmd/agent` detects the terminal signal and shuts down
cleanly instead of looping forever against a decommissioned identity.
The retired-agent check runs before any "not found" string match so
it can never be masked by a sibling error branch.
operationId: agentHeartbeat
parameters:
- $ref: "#/components/parameters/resourceId"
@@ -949,6 +1074,14 @@ paths:
$ref: "#/components/responses/BadRequest"
"404":
$ref: "#/components/responses/NotFound"
"410":
description: |
I-004: the agent has been soft-retired. The agent process should
treat this as a terminal signal and shut down cleanly.
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
"500":
$ref: "#/components/responses/InternalError"
@@ -1397,6 +1530,137 @@ paths:
"500":
$ref: "#/components/responses/InternalError"
# ─── Renewal Policies ────────────────────────────────────────────────
# G-1: lifecycle policies (rp-* ids, table renewal_policies). DISTINCT from
# /api/v1/policies above, which returns compliance rules (pol-* ids, table
# policy_rules). `managed_certificates.renewal_policy_id` FK points at
# renewal_policies(id) — populating that dropdown from /api/v1/policies
# caused 23503 FK violations; hence this endpoint.
/api/v1/renewal-policies:
get:
tags: [RenewalPolicies]
summary: List renewal policies
operationId: listRenewalPolicies
parameters:
- $ref: "#/components/parameters/page"
- $ref: "#/components/parameters/per_page"
responses:
"200":
description: Paginated list of renewal policies
content:
application/json:
schema:
allOf:
- $ref: "#/components/schemas/PaginationEnvelope"
- type: object
properties:
data:
type: array
items:
$ref: "#/components/schemas/RenewalPolicy"
"500":
$ref: "#/components/responses/InternalError"
post:
tags: [RenewalPolicies]
summary: Create renewal policy
operationId: createRenewalPolicy
requestBody:
required: true
content:
application/json:
schema:
$ref: "#/components/schemas/RenewalPolicyCreateRequest"
responses:
"201":
description: Renewal policy created
content:
application/json:
schema:
$ref: "#/components/schemas/RenewalPolicy"
"400":
$ref: "#/components/responses/BadRequest"
"409":
description: Duplicate policy name
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"500":
$ref: "#/components/responses/InternalError"
/api/v1/renewal-policies/{id}:
get:
tags: [RenewalPolicies]
summary: Get renewal policy
operationId: getRenewalPolicy
parameters:
- $ref: "#/components/parameters/resourceId"
responses:
"200":
description: Renewal policy details
content:
application/json:
schema:
$ref: "#/components/schemas/RenewalPolicy"
"400":
$ref: "#/components/responses/BadRequest"
"404":
$ref: "#/components/responses/NotFound"
"500":
$ref: "#/components/responses/InternalError"
put:
tags: [RenewalPolicies]
summary: Update renewal policy
operationId: updateRenewalPolicy
parameters:
- $ref: "#/components/parameters/resourceId"
requestBody:
required: true
content:
application/json:
schema:
$ref: "#/components/schemas/RenewalPolicyUpdateRequest"
responses:
"200":
description: Renewal policy updated
content:
application/json:
schema:
$ref: "#/components/schemas/RenewalPolicy"
"400":
$ref: "#/components/responses/BadRequest"
"404":
$ref: "#/components/responses/NotFound"
"409":
description: Duplicate policy name
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"500":
$ref: "#/components/responses/InternalError"
delete:
tags: [RenewalPolicies]
summary: Delete renewal policy
operationId: deleteRenewalPolicy
parameters:
- $ref: "#/components/parameters/resourceId"
responses:
"204":
description: Renewal policy deleted
"400":
$ref: "#/components/responses/BadRequest"
"404":
$ref: "#/components/responses/NotFound"
"409":
description: Policy in use by one or more certificates (FK restrict)
content:
application/json:
schema:
$ref: "#/components/schemas/Error"
"500":
$ref: "#/components/responses/InternalError"
# ─── Profiles ────────────────────────────────────────────────────────
/api/v1/profiles:
get:
@@ -1904,6 +2168,16 @@ paths:
parameters:
- $ref: "#/components/parameters/page"
- $ref: "#/components/parameters/per_page"
- name: status
in: query
required: false
description: |
Filter by lifecycle status. I-005: `dead` powers the Dead letter
tab on the GUI; empty/omitted returns the default all-statuses
listing to preserve pre-I-005 behavior.
schema:
type: string
enum: [pending, sent, failed, dead, read]
responses:
"200":
description: Paginated list of notifications
@@ -1961,6 +2235,36 @@ paths:
"500":
$ref: "#/components/responses/InternalError"
/api/v1/notifications/{id}/requeue:
post:
tags: [Notifications]
summary: Requeue a dead notification
description: |
I-005: flip a notification from the `dead` dead-letter queue back to
`pending` so the retry sweep (default 2 minutes) picks it up on its
next tick. Used by operators after fixing the underlying delivery
failure (SMTP config, webhook endpoint, etc.). Clears `next_retry_at`
and resets the `retry_count` budget; `last_error` is preserved for
audit continuity.
operationId: requeueNotification
parameters:
- $ref: "#/components/parameters/resourceId"
responses:
"200":
description: Requeued
content:
application/json:
schema:
$ref: "#/components/schemas/StatusResponse"
"400":
$ref: "#/components/responses/BadRequest"
"404":
$ref: "#/components/responses/NotFound"
"405":
description: Method not allowed (POST only)
"500":
$ref: "#/components/responses/InternalError"
# ─── Stats ───────────────────────────────────────────────────────────
/api/v1/stats/summary:
get:
@@ -3326,6 +3630,7 @@ components:
DeploymentTarget:
type: object
required: [name, type, agent_id]
properties:
id:
type: string
@@ -3335,6 +3640,12 @@ components:
$ref: "#/components/schemas/TargetType"
agent_id:
type: string
description: |
ID of the agent that manages this target. Required because
deployment_targets.agent_id is a NOT NULL foreign key to agents(id)
(migration 000001). Empty or nonexistent agent IDs are rejected
with HTTP 400 by the service layer (see C-002 in the coverage-gap
audit).
config:
type: object
description: Target-specific configuration (varies by type)
@@ -3379,6 +3690,85 @@ components:
type: string
version:
type: string
retired_at:
type: string
format: date-time
nullable: true
description: |
I-004: soft-retirement timestamp. `null` (or field absent) means the
agent is active. A non-null value is the canonical "retired" state —
the operational `status` column is preserved at retirement time as
the last-seen value, but `retired_at` is the source of truth for
filtering agents out of active listings.
retired_reason:
type: string
nullable: true
description: |
I-004: human-readable reason captured at retirement time. Only set
when the agent was retired via `?force=true&reason=...` cascade; a
default soft-retire leaves this field null.
AgentDependencyCounts:
type: object
description: |
I-004: preflight counts of active downstream rows that would be
orphaned by retiring an agent. Returned in the 409
`blocked_by_dependencies` body so the operator UI can tell the user
which bucket is blocking the retire, and also in the 200 response
body on a successful `?force=true` cascade as a snapshot of what
was cascaded.
properties:
active_targets:
type: integer
description: Deployment targets with this agent assigned and retired_at IS NULL
active_certificates:
type: integer
description: Certificates currently deployed via one of this agent's active targets
pending_jobs:
type: integer
description: Jobs with agent_id=this in status Pending, AwaitingCSR, AwaitingApproval, or Running
RetireAgentResponse:
type: object
description: |
I-004: response body for a successful retire on DELETE /api/v1/agents/{id}.
Returned on both clean retires (cascade=false, zero counts) and
force-cascade retires (cascade=true, counts snapshot of the
pre-cascade dependency state). The 204 idempotent-retire path does
NOT emit this body — re-retiring an already-retired agent returns
an empty response.
properties:
retired_at:
type: string
format: date-time
already_retired:
type: boolean
description: |
Always false on the 200 response — the already-retired path
returns 204 No Content with no body. Surfaced in the schema
only so downstream consumers have a complete field map.
cascade:
type: boolean
description: True when the retire was invoked with ?force=true
counts:
$ref: "#/components/schemas/AgentDependencyCounts"
BlockedByDependenciesResponse:
type: object
description: |
I-004: 409 response body for a retire request blocked by active
downstream dependencies. Returned when `force=true` is not set and
any of the three counts is non-zero. The operator UI renders these
counts so the human can retire or reassign the blocking rows
before re-running the retire, or tick the force checkbox to cascade.
properties:
error:
type: string
example: blocked_by_dependencies
message:
type: string
counts:
$ref: "#/components/schemas/AgentDependencyCounts"
WorkItem:
type: object
@@ -3461,6 +3851,7 @@ components:
- RequiredMetadata
- AllowedEnvironments
- RenewalLeadTime
- CertificateLifetime
PolicySeverity:
type: string
@@ -3480,6 +3871,9 @@ components:
description: Policy-specific configuration (varies by type)
enabled:
type: boolean
severity:
$ref: "#/components/schemas/PolicySeverity"
description: Severity level applied to violations of this rule. Defaults to Warning on create when omitted.
created_at:
type: string
format: date-time
@@ -3504,6 +3898,132 @@ components:
type: string
format: date-time
# ─── Renewal Policies ─────────────────────────────────────────────
# G-1: renewal_policies table — lifecycle policies, referenced by
# managed_certificates.renewal_policy_id ON DELETE RESTRICT. Distinct
# from PolicyRule above (compliance rules, table policy_rules).
RenewalPolicy:
type: object
required:
- id
- name
- renewal_window_days
- auto_renew
- max_retries
- retry_interval_seconds
- alert_thresholds_days
- created_at
- updated_at
properties:
id:
type: string
description: Human-readable ID, prefixed `rp-` (e.g., `rp-default`).
name:
type: string
description: Unique display name (UNIQUE in DB).
renewal_window_days:
type: integer
minimum: 1
maximum: 365
description: Days before expiry to trigger renewal.
auto_renew:
type: boolean
description: Whether renewal is triggered automatically by the scheduler.
max_retries:
type: integer
minimum: 0
maximum: 10
description: Maximum renewal retry attempts on failure.
retry_interval_seconds:
type: integer
minimum: 60
maximum: 86400
description: Seconds to wait between retry attempts.
alert_thresholds_days:
type: array
items:
type: integer
minimum: 0
maximum: 365
description: Days-before-expiry thresholds at which to emit alerts.
certificate_profile_id:
type: string
nullable: true
description: Optional certificate profile binding. Read-only at this endpoint; UI does not currently edit this field.
created_at:
type: string
format: date-time
updated_at:
type: string
format: date-time
RenewalPolicyCreateRequest:
type: object
required:
- name
properties:
id:
type: string
description: Optional human-readable ID. Auto-generated from name when omitted.
name:
type: string
minLength: 1
maxLength: 255
renewal_window_days:
type: integer
minimum: 1
maximum: 365
default: 30
auto_renew:
type: boolean
default: true
max_retries:
type: integer
minimum: 0
maximum: 10
description: Required. Not defaulted — 0 is a valid operator choice.
retry_interval_seconds:
type: integer
minimum: 60
maximum: 86400
default: 3600
alert_thresholds_days:
type: array
items:
type: integer
minimum: 0
maximum: 365
default: [30, 14, 7, 0]
RenewalPolicyUpdateRequest:
type: object
description: Partial update. Omitted fields are left unchanged.
properties:
name:
type: string
minLength: 1
maxLength: 255
renewal_window_days:
type: integer
minimum: 1
maximum: 365
auto_renew:
type: boolean
max_retries:
type: integer
minimum: 0
maximum: 10
retry_interval_seconds:
type: integer
minimum: 60
maximum: 86400
alert_thresholds_days:
type: array
items:
type: integer
minimum: 0
maximum: 365
# ─── Profiles ────────────────────────────────────────────────────
CertificateProfile:
type: object
@@ -3682,8 +4202,32 @@ components:
format: date-time
status:
type: string
enum: [pending, sent, failed, dead, read]
description: |
Notification lifecycle status. I-005 adds `dead` for notifications
that exhausted their 5-attempt retry budget and were moved to the
dead-letter queue; operators triage these in the GUI's Dead letter
tab and use POST /notifications/{id}/requeue to resurrect them.
error:
type: string
retry_count:
type: integer
description: |
Number of delivery attempts made. I-005 retry-sweep field; caps
at max_attempts=5 before the notification transitions to `dead`.
next_retry_at:
type: string
format: date-time
description: |
When the next retry attempt is scheduled. I-005 retry-sweep field;
null for `sent`, `dead`, and `read` statuses. Backoff follows
`min(2^retry_count * 1m, 1h)`.
last_error:
type: string
description: |
Most recent transient delivery error (SMTP failure, webhook 5xx,
etc.). I-005 retry-sweep field; surfaced on the Dead letter tab
so operators can triage without chasing server logs.
created_at:
type: string
format: date-time
+273 -31
View File
@@ -7,6 +7,7 @@ import (
"crypto/elliptic"
"crypto/rand"
"crypto/rsa"
"crypto/tls"
"crypto/x509"
"crypto/x509/pkix"
"encoding/json"
@@ -72,7 +73,7 @@ func TestAgent_Heartbeat_Success(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Should not panic
agent.sendHeartbeat(context.Background())
@@ -93,7 +94,7 @@ func TestAgent_Heartbeat_ServerError(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Should increment consecutive failures
failureBefore := agent.consecutiveFailures
@@ -115,7 +116,7 @@ func TestAgent_Heartbeat_ConnectionError(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Should fail due to connection error
agent.sendHeartbeat(context.Background())
@@ -150,7 +151,7 @@ func TestAgent_PollWork_NoWork(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Should not panic
agent.pollForWork(context.Background())
@@ -195,7 +196,7 @@ func TestAgent_PollWork_Success(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Should not panic; work items are processed in separate gorines in real usage
agent.pollForWork(context.Background())
@@ -285,7 +286,7 @@ func TestParsePEMFile(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Parse the file
entries := agent.parsePEMFile(certPath)
@@ -336,7 +337,7 @@ func TestParsePEMFile_MultipleCerts(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
entries := agent.parsePEMFile(certPath)
@@ -362,7 +363,7 @@ func TestParseDERFile(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
entry, err := agent.parseDERFile(derPath)
if err != nil {
@@ -397,7 +398,7 @@ func TestParseDERFile_Invalid(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
_, err := agent.parseDERFile(derPath)
if err == nil {
@@ -439,7 +440,7 @@ func TestScanDirectory(t *testing.T) {
DiscoveryDirs: []string{tmpdir},
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Simulate directory walk manually (as runDiscoveryScan does)
var certs []discoveredCertEntry
@@ -474,7 +475,7 @@ func TestCreateTargetConnector_NGINX(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
configJSON := json.RawMessage(`{"cert_path":"/etc/nginx/cert.pem"}`)
connector, err := agent.createTargetConnector("NGINX", configJSON)
@@ -496,7 +497,7 @@ func TestCreateTargetConnector_Unsupported(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
_, err := agent.createTargetConnector("UnsupportedType", nil)
@@ -530,7 +531,7 @@ func TestFetchCertificate_Success(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
certPEM, err := agent.fetchCertificate(context.Background(), "mc-001")
if err != nil {
@@ -556,7 +557,7 @@ func TestFetchCertificate_NotFound(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
_, err := agent.fetchCertificate(context.Background(), "mc-nonexistent")
if err == nil {
@@ -592,7 +593,7 @@ func TestReportJobStatus_Success(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
err := agent.reportJobStatus(context.Background(), "j-001", "Completed", "")
if err != nil {
@@ -624,7 +625,7 @@ func TestReportJobStatus_WithError(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
err := agent.reportJobStatus(context.Background(), "j-001", "Failed", "deployment failed")
if err != nil {
@@ -658,7 +659,7 @@ func TestMakeRequest_Success(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
resp, err := agent.makeRequest(context.Background(), http.MethodPost, "/test", map[string]string{"key": "value"})
if err != nil {
@@ -680,7 +681,7 @@ func TestMakeRequest_InvalidURL(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
_, err := agent.makeRequest(context.Background(), http.MethodGet, "/test", nil)
if err == nil {
@@ -765,7 +766,7 @@ func TestNewAgent(t *testing.T) {
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
if agent.config != cfg {
t.Error("config not set correctly")
@@ -791,7 +792,7 @@ func TestNewAgent_WithLogger(t *testing.T) {
Hostname: "test-host",
}
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
if agent.logger != logger {
t.Error("logger not set correctly")
@@ -954,7 +955,7 @@ func TestCreateTargetConnector_AllSupportedTypes(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
@@ -1007,7 +1008,7 @@ func TestCreateTargetConnector_InvalidJSON(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
invalidJSON := json.RawMessage("{invalid json}")
@@ -1031,7 +1032,7 @@ func TestCreateTargetConnector_UnknownType(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
_, err := agent.createTargetConnector("MagicBox", nil)
@@ -1061,7 +1062,7 @@ func TestCreateTargetConnector_EmptyConfig(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
for _, typeName := range tests {
t.Run(typeName, func(t *testing.T) {
@@ -1137,7 +1138,7 @@ func TestRunDiscoveryScan_ValidCerts(t *testing.T) {
DiscoveryDirs: []string{tmpDir},
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Run discovery scan
agent.runDiscoveryScan(context.Background())
@@ -1165,7 +1166,7 @@ func TestRunDiscoveryScan_NoCertificates(t *testing.T) {
DiscoveryDirs: []string{tmpDir},
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Run discovery scan - should complete without error even with empty directory
agent.runDiscoveryScan(context.Background())
@@ -1222,7 +1223,7 @@ func TestRunDiscoveryScan_MultipleCerts(t *testing.T) {
DiscoveryDirs: []string{tmpDir},
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Run discovery scan
agent.runDiscoveryScan(context.Background())
@@ -1273,7 +1274,7 @@ func TestRunDiscoveryScan_DERCertificate(t *testing.T) {
DiscoveryDirs: []string{tmpDir},
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Run discovery scan
agent.runDiscoveryScan(context.Background())
@@ -1331,7 +1332,7 @@ func TestRunDiscoveryScan_Subdirectories(t *testing.T) {
DiscoveryDirs: []string{tmpDir},
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Run discovery scan - should recursively find certs in subdirs
agent.runDiscoveryScan(context.Background())
@@ -1369,7 +1370,7 @@ func TestRunDiscoveryScan_ServerError(t *testing.T) {
DiscoveryDirs: []string{tmpDir},
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
// Should handle server error gracefully without panicking
agent.runDiscoveryScan(context.Background())
@@ -1396,7 +1397,7 @@ func TestDiscoveredCertEntry_ValidFields(t *testing.T) {
Hostname: "test-host",
}
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent := NewAgent(cfg, logger)
agent, _ := NewAgent(cfg, logger)
entries := agent.parsePEMFile(certPath)
@@ -1447,3 +1448,244 @@ func TestDiscoveredCertEntry_ValidFields(t *testing.T) {
t.Error("PEMData should not be empty")
}
}
// ---------------------------------------------------------------------------
// HTTPS-Everywhere milestone (v2.2, §3.2 / §7) — Phase 5 client-side tests.
//
// These tests pin the agent's pre-flight HTTPS-scheme guard and the TLS
// configuration surface (CA bundle loading + TLS 1.3 round-trip) so that
// regressions surface at unit-test time, not at the first heartbeat of a
// production rollout. Matches the same contract asserted by the sibling
// binaries cmd/cli/main_test.go and cmd/mcp-server/main_test.go — the three
// must stay in lock-step because all three are HTTPS-only clients of the
// same control plane.
// ---------------------------------------------------------------------------
// TestValidateHTTPSScheme pins the pre-flight URL-scheme guard that the
// HTTPS-Everywhere milestone requires on the agent binary startup path. The
// agent's diagnostic is distinct from the CLI/MCP variants because it names
// CERTCTL_SERVER_URL (the only input channel — no --server flag on the
// agent). Every case here mirrors the dispatch arms in cmd/agent/main.go:
// validateHTTPSScheme; drifting the error-message substrings is what this
// test is here to catch.
func TestValidateHTTPSScheme(t *testing.T) {
tests := []struct {
name string
serverURL string
wantErr bool
wantErrSub string
}{
{
name: "https URL passes",
serverURL: "https://certctl-server:8443",
wantErr: false,
},
{
name: "https URL with path passes",
serverURL: "https://certctl.example.com/api/v1",
wantErr: false,
},
{
name: "uppercase HTTPS scheme passes (url.Parse lowercases)",
serverURL: "HTTPS://certctl-server:8443",
wantErr: false,
},
{
name: "empty URL rejected names CERTCTL_SERVER_URL",
serverURL: "",
wantErr: true,
wantErrSub: "CERTCTL_SERVER_URL is empty",
},
{
name: "plaintext http rejected",
serverURL: "http://certctl-server:8443",
wantErr: true,
wantErrSub: "plaintext http://",
},
{
name: "bare host missing scheme falls through to unsupported",
serverURL: "localhost:8443",
wantErr: true,
// url.Parse treats "localhost:8443" as scheme=localhost,
// opaque=8443 — exercises the default arm (unsupported scheme)
// rather than the empty-scheme arm. Both are fail-closed, which
// is what we care about.
wantErrSub: "unsupported scheme",
},
{
name: "path-only URL rejected",
serverURL: "//certctl-server:8443",
wantErr: true,
wantErrSub: "missing a scheme",
},
{
name: "unsupported scheme rejected",
serverURL: "ftp://certctl-server:8443",
wantErr: true,
wantErrSub: "unsupported scheme",
},
{
name: "ws scheme rejected",
serverURL: "ws://certctl-server:8443",
wantErr: true,
wantErrSub: "unsupported scheme",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
err := validateHTTPSScheme(tt.serverURL)
if (err != nil) != tt.wantErr {
t.Fatalf("validateHTTPSScheme(%q) err=%v wantErr=%v", tt.serverURL, err, tt.wantErr)
}
if tt.wantErr && tt.wantErrSub != "" && !strings.Contains(err.Error(), tt.wantErrSub) {
t.Errorf("validateHTTPSScheme(%q) err=%q must contain %q so operators see the right diagnostic",
tt.serverURL, err.Error(), tt.wantErrSub)
}
})
}
}
// writeTestCABundle PEM-encodes a cert's DER bytes and writes the result to a
// tmp file inside dir. Used by CA-bundle tests so each case owns a distinct
// file path (matters for the "missing file" case which must point at a path
// that provably does not exist). Returns the path.
func writeTestCABundle(t *testing.T, dir string, certDER []byte, filename string) string {
t.Helper()
pemBytes := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: certDER})
path := filepath.Join(dir, filename)
if err := os.WriteFile(path, pemBytes, 0644); err != nil {
t.Fatalf("writing CA bundle %q: %v", path, err)
}
return path
}
// TestNewAgent_CABundle_Success confirms that a well-formed PEM bundle gets
// parsed into an x509.CertPool and wired onto the agent's HTTP client
// transport. This is the happy path the docs/tls.md "Private CA signed
// server cert" section depends on.
func TestNewAgent_CABundle_Success(t *testing.T) {
cert, err := generateTestCertWithCN("test.certctl.local")
if err != nil {
t.Fatalf("generateTestCertWithCN: %v", err)
}
bundlePath := writeTestCABundle(t, t.TempDir(), cert.Raw, "ca-bundle.pem")
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent, err := NewAgent(&AgentConfig{
ServerURL: "https://certctl-server:8443",
APIKey: "test-key",
AgentID: "a-test",
Hostname: "test-host",
CABundlePath: bundlePath,
}, logger)
if err != nil {
t.Fatalf("NewAgent with valid CA bundle err=%v want nil", err)
}
transport, ok := agent.client.Transport.(*http.Transport)
if !ok {
t.Fatalf("agent.client.Transport is %T; want *http.Transport", agent.client.Transport)
}
if transport.TLSClientConfig == nil {
t.Fatal("TLSClientConfig is nil; HTTPS-everywhere milestone requires a non-nil TLS config")
}
if transport.TLSClientConfig.MinVersion != tls.VersionTLS13 {
t.Errorf("MinVersion=%x want TLS 1.3 (%x) per §2.3 of the milestone spec",
transport.TLSClientConfig.MinVersion, tls.VersionTLS13)
}
if transport.TLSClientConfig.RootCAs == nil {
t.Error("RootCAs is nil; the configured CA bundle was silently dropped")
}
}
// TestNewAgent_CABundle_MissingFile pins the fail-loud behavior when the
// operator points CERTCTL_SERVER_CA_BUNDLE_PATH at a path that does not
// exist. Falling back to system roots here would mask a misconfiguration as
// a much harder-to-debug TLS handshake failure downstream.
func TestNewAgent_CABundle_MissingFile(t *testing.T) {
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
missingPath := filepath.Join(t.TempDir(), "does-not-exist.pem")
_, err := NewAgent(&AgentConfig{
ServerURL: "https://certctl-server:8443",
APIKey: "test-key",
AgentID: "a-test",
Hostname: "test-host",
CABundlePath: missingPath,
}, logger)
if err == nil {
t.Fatal("NewAgent err=nil for missing CA bundle path; must fail loud at startup")
}
if !strings.Contains(err.Error(), "reading CA bundle") {
t.Errorf("err=%q must contain \"reading CA bundle\" so operators can trace the cause", err.Error())
}
}
// TestNewAgent_CABundle_EmptyPEM covers the "file exists but contains no
// valid certs" case (garbage, wrong-format, stripped PEM). AppendCertsFromPEM
// returns false in this case; NewAgent must translate that into a fail-loud
// startup error rather than quietly carry on with an empty pool.
func TestNewAgent_CABundle_EmptyPEM(t *testing.T) {
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
bundlePath := filepath.Join(t.TempDir(), "empty.pem")
if err := os.WriteFile(bundlePath, []byte("not a pem-encoded certificate, just garbage\n"), 0644); err != nil {
t.Fatalf("writing garbage bundle: %v", err)
}
_, err := NewAgent(&AgentConfig{
ServerURL: "https://certctl-server:8443",
APIKey: "test-key",
AgentID: "a-test",
Hostname: "test-host",
CABundlePath: bundlePath,
}, logger)
if err == nil {
t.Fatal("NewAgent err=nil for empty-PEM CA bundle; must fail loud at startup")
}
if !strings.Contains(err.Error(), "no valid PEM-encoded certificates") {
t.Errorf("err=%q must contain \"no valid PEM-encoded certificates\" so operators see why the bundle was rejected", err.Error())
}
}
// TestNewAgent_TLSRoundTrip is the end-to-end integration-style check: spin
// up an httptest.NewTLSServer (which presents a self-signed cert over TLS
// 1.3), feed that cert into the agent as a CA bundle, and confirm the agent
// successfully completes a heartbeat round-trip over HTTPS. This proves that
// (a) the CA pool is actually being consulted during verification and (b)
// the TLS 1.3 MinVersion doesn't break against httptest's default
// negotiation. Equivalent to the "TLS handshake succeeds against a
// self-signed control plane" integration gate, but runs in-process with no
// Docker dependency.
func TestNewAgent_TLSRoundTrip(t *testing.T) {
var heartbeatHit int
server := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.URL.Path == "/api/v1/agents/a-tls-test/heartbeat" && r.Method == http.MethodPost {
heartbeatHit++
w.WriteHeader(http.StatusOK)
return
}
w.WriteHeader(http.StatusNotFound)
}))
defer server.Close()
// server.Certificate() returns the *x509.Certificate httptest presents;
// PEM-encode its DER bytes so NewAgent's AppendCertsFromPEM can ingest it.
bundlePath := writeTestCABundle(t, t.TempDir(), server.Certificate().Raw, "httptest-ca.pem")
logger := slog.New(slog.NewTextHandler(io.Discard, nil))
agent, err := NewAgent(&AgentConfig{
ServerURL: server.URL,
APIKey: "test-key",
AgentID: "a-tls-test",
Hostname: "tls-test-host",
CABundlePath: bundlePath,
}, logger)
if err != nil {
t.Fatalf("NewAgent with httptest CA bundle err=%v want nil", err)
}
agent.sendHeartbeat(context.Background())
if heartbeatHit != 1 {
t.Fatalf("heartbeat handler hit %d times; want 1 — the TLS round-trip must actually complete", heartbeatHit)
}
}
+242 -20
View File
@@ -8,21 +8,25 @@ import (
"crypto/rand"
"crypto/rsa"
"crypto/sha256"
"crypto/tls"
"crypto/x509"
"crypto/x509/pkix"
"encoding/json"
"encoding/pem"
"errors"
"flag"
"fmt"
"io"
"log/slog"
"net"
"net/http"
"net/url"
"os"
"os/signal"
"path/filepath"
"runtime"
"strings"
"sync"
"syscall"
"time"
@@ -44,15 +48,27 @@ import (
// AgentConfig represents the agent-side configuration.
type AgentConfig struct {
ServerURL string // Control plane server URL (e.g., http://localhost:8443)
APIKey string // Agent API key for authentication
AgentName string // Agent name for identification
AgentID string // Agent ID for API calls (set after registration or from env)
Hostname string // Server hostname
KeyDir string // Directory for storing private keys (default: /var/lib/certctl/keys)
DiscoveryDirs []string // Directories to scan for certificates (comma-separated via env)
ServerURL string // Control plane server URL (e.g., https://localhost:8443) — must be https:// scheme
APIKey string // Agent API key for authentication
AgentName string // Agent name for identification
AgentID string // Agent ID for API calls (set after registration or from env)
Hostname string // Server hostname
KeyDir string // Directory for storing private keys (default: /var/lib/certctl/keys)
DiscoveryDirs []string // Directories to scan for certificates (comma-separated via env)
CABundlePath string // Optional path to a PEM-encoded CA bundle that signed the server's cert (empty = system roots)
InsecureSkipVerify bool // Dev-only: skip TLS certificate verification. Never enable in production. See docs/tls.md.
}
// ErrAgentRetired is the sentinel returned by [Agent.Run] when the control
// plane responds with HTTP 410 Gone to a heartbeat or work-poll request — the
// canonical signal that this agent's row has been soft-retired server-side
// (see I-004 in cowork/certctl-coverage-gap-audit.md). The binary must
// terminate cleanly: an init-system restart would only produce another 410
// and wedge the host in a restart loop. main() translates this sentinel into
// a zero exit code so systemd (Restart=on-failure) and launchd do not respawn
// the process. Do not wrap this error — main() matches it with errors.Is.
var ErrAgentRetired = fmt.Errorf("agent retired by control plane")
// Agent represents the local agent that runs on target servers.
// It periodically sends heartbeats, polls for work, executes deployment and CSR jobs,
// and scans configured directories for existing certificates.
@@ -68,6 +84,17 @@ type Agent struct {
pollInterval time.Duration
discoveryInterval time.Duration
consecutiveFailures int
// I-004: terminal retirement signal. retiredSignal is closed exactly once
// (guarded by retiredOnce) when either sendHeartbeat or pollForWork
// observes HTTP 410 Gone. The Run() select loop picks up the close and
// returns ErrAgentRetired, unwinding the goroutine cleanly so main() can
// log + exit(0). Using a channel + sync.Once (rather than an atomic bool
// + polling) lets us fall through the select statement immediately instead
// of waiting for the next ticker; the zero-allocation close is safe to
// race with ctx.Done() and other cases.
retiredOnce sync.Once
retiredSignal chan struct{}
}
// WorkResponse represents the response from the work polling endpoint.
@@ -90,15 +117,78 @@ type JobItem struct {
}
// NewAgent creates a new agent instance.
func NewAgent(cfg *AgentConfig, logger *slog.Logger) *Agent {
//
// The returned HTTP client enforces HTTPS-only control-plane access per the
// HTTPS-Everywhere milestone (see docs/tls.md). TLS 1.3 is required; the
// optional CABundlePath loads a PEM bundle into RootCAs so the agent can
// trust internal / self-signed server certs without touching system trust
// stores. InsecureSkipVerify is a dev-only escape hatch — callers must log a
// loud warning when it's set; never enable in production (see §2.4 of the
// milestone spec and docs/upgrade-to-tls.md).
//
// Returns an error if CABundlePath is set but unreadable or malformed — fail
// loud at startup rather than silently fall back to system roots, which would
// turn a misconfigured bundle path into a cryptic "x509: certificate signed
// by unknown authority" on the first heartbeat.
func NewAgent(cfg *AgentConfig, logger *slog.Logger) (*Agent, error) {
tlsConfig := &tls.Config{
MinVersion: tls.VersionTLS13,
InsecureSkipVerify: cfg.InsecureSkipVerify, //nolint:gosec // opt-in dev escape hatch, documented in docs/tls.md
}
if cfg.CABundlePath != "" {
pemBytes, err := os.ReadFile(cfg.CABundlePath)
if err != nil {
return nil, fmt.Errorf("reading CA bundle at %q: %w", cfg.CABundlePath, err)
}
pool := x509.NewCertPool()
if !pool.AppendCertsFromPEM(pemBytes) {
return nil, fmt.Errorf("CA bundle at %q contains no valid PEM-encoded certificates", cfg.CABundlePath)
}
tlsConfig.RootCAs = pool
}
httpClient := &http.Client{
Timeout: 30 * time.Second,
Transport: &http.Transport{
TLSClientConfig: tlsConfig,
ForceAttemptHTTP2: true,
MaxIdleConns: 10,
IdleConnTimeout: 90 * time.Second,
TLSHandshakeTimeout: 10 * time.Second,
ExpectContinueTimeout: 1 * time.Second,
},
}
return &Agent{
config: cfg,
logger: logger,
client: &http.Client{Timeout: 30 * time.Second},
client: httpClient,
heartbeatInterval: 60 * time.Second,
pollInterval: 30 * time.Second,
discoveryInterval: 6 * time.Hour, // scan for certs every 6 hours
}
retiredSignal: make(chan struct{}),
}, nil
}
// markRetired records that the control plane has declared this agent retired
// (HTTP 410 Gone on heartbeat or work poll). Idempotent via sync.Once — if
// both the heartbeat and work-poll paths observe 410 in the same tick, only
// the first close() runs and we avoid a runtime panic. Emits an ERROR-level
// log line so init-system journaling captures it prominently, and includes
// the source (heartbeat/work_poll), response body, and status code so the
// operator can verify it's a genuine retirement signal rather than a
// misrouted request. After this returns, the select-loop case in Run()
// observes the closed channel on its next iteration and returns
// ErrAgentRetired.
func (a *Agent) markRetired(source string, statusCode int, body string) {
a.retiredOnce.Do(func() {
a.logger.Error("agent has been retired by control plane — shutting down",
"source", source,
"status", statusCode,
"body", body,
"agent_id", a.config.AgentID)
close(a.retiredSignal)
})
}
// Run starts the agent's main loop.
@@ -154,6 +244,19 @@ func (a *Agent) Run(ctx context.Context) error {
a.logger.Info("agent shutting down", "reason", ctx.Err())
return ctx.Err()
// I-004: retiredSignal is closed exactly once (via markRetired's
// sync.Once) when either sendHeartbeat or pollForWork observes HTTP 410
// Gone from the control plane. Falling through this case immediately
// (rather than waiting for the next ticker) lets the agent shut down
// quickly once retirement is confirmed — every extra heartbeat against a
// retired row is wasted work and noise in the audit trail. Returning
// ErrAgentRetired propagates up to main(), which matches it with
// errors.Is and exits(0) so systemd/launchd do not respawn the process.
case <-a.retiredSignal:
a.logger.Info("agent retired signal received — exiting event loop",
"agent_id", a.config.AgentID)
return ErrAgentRetired
case <-heartbeatTicker.C:
a.sendHeartbeat(ctx)
@@ -166,7 +269,14 @@ func (a *Agent) Run(ctx context.Context) error {
a.logger.Warn("backing off due to consecutive failures",
"failures", a.consecutiveFailures,
"backoff", backoff.String())
time.Sleep(backoff)
// F-003: ctx-aware wait so graceful shutdown does not stall on
// a long backoff. If ctx cancels mid-backoff, return to the
// outer loop so the <-ctx.Done() case can trigger clean exit.
select {
case <-ctx.Done():
continue
case <-time.After(backoff):
}
}
a.pollForWork(ctx)
@@ -209,6 +319,22 @@ func (a *Agent) sendHeartbeat(ctx context.Context) {
}
defer resp.Body.Close()
// I-004: HTTP 410 Gone is the terminal signal from the control plane that
// this agent's row has been soft-retired (see internal/api/handler/agent.go
// heartbeat path + AgentRetirementService). Treat it separately from the
// generic non-200 error branch: record the event to markRetired (which closes
// retiredSignal exactly once via sync.Once) and return without bumping
// consecutiveFailures — this is not a transient failure, it's a clean
// shutdown. The Run() select loop picks up the closed channel on its next
// iteration and returns ErrAgentRetired, which main() translates into an
// exit(0) so systemd/launchd don't respawn the process into another 410
// loop.
if resp.StatusCode == http.StatusGone {
body, _ := io.ReadAll(resp.Body)
a.markRetired("heartbeat", resp.StatusCode, string(body))
return
}
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
a.logger.Error("heartbeat rejected",
@@ -237,6 +363,19 @@ func (a *Agent) pollForWork(ctx context.Context) {
}
defer resp.Body.Close()
// I-004: same terminal-retirement handling as sendHeartbeat. Work-poll is the
// other hot path that can observe an agent's soft-retirement; if the
// heartbeat tick happens to fire after a work-poll tick within the same
// retirement window, this branch catches it first. markRetired's sync.Once
// guards idempotency so racing both paths in the same tick only closes the
// signal channel once. No consecutiveFailures increment — retirement is
// not a transient failure.
if resp.StatusCode == http.StatusGone {
body, _ := io.ReadAll(resp.Body)
a.markRetired("work_poll", resp.StatusCode, string(body))
return
}
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
a.logger.Error("work poll rejected",
@@ -1031,12 +1170,14 @@ func certKeyInfo(cert *x509.Certificate) (string, int) {
func main() {
// Parse command-line flags (with env var fallbacks for Docker deployment)
serverURL := flag.String("server", getEnvDefault("CERTCTL_SERVER_URL", "http://localhost:8443"), "Control plane server URL")
serverURL := flag.String("server", getEnvDefault("CERTCTL_SERVER_URL", "https://localhost:8443"), "Control plane server URL (must be https://)")
apiKey := flag.String("api-key", getEnvDefault("CERTCTL_API_KEY", ""), "Agent API key")
agentName := flag.String("name", getEnvDefault("CERTCTL_AGENT_NAME", "certctl-agent"), "Agent name")
agentID := flag.String("agent-id", getEnvDefault("CERTCTL_AGENT_ID", ""), "Agent ID (from registration)")
keyDir := flag.String("key-dir", getEnvDefault("CERTCTL_KEY_DIR", "/var/lib/certctl/keys"), "Directory for storing private keys")
discoveryDirsStr := flag.String("discovery-dirs", getEnvDefault("CERTCTL_DISCOVERY_DIRS", ""), "Comma-separated directories to scan for certificates")
caBundlePath := flag.String("ca-bundle", getEnvDefault("CERTCTL_SERVER_CA_BUNDLE_PATH", ""), "Path to a PEM-encoded CA bundle that signed the server's TLS cert (optional; falls back to system roots)")
insecureSkipVerify := flag.Bool("insecure-skip-verify", getEnvBoolDefault("CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY", false), "Dev-only: skip TLS certificate verification. Never enable in production. See docs/tls.md.")
flag.Parse()
if *apiKey == "" {
@@ -1050,6 +1191,18 @@ func main() {
os.Exit(1)
}
// Pre-flight URL-scheme validation — reject plaintext http:// before any
// network call. The HTTPS-Everywhere milestone (§2.4, §7) mandates that
// mis-configured agents fail loudly at startup with a diagnostic pointing
// at the upgrade guide, rather than producing a TCP-refused or
// TLS-handshake-error that obscures the actual cause.
if err := validateHTTPSScheme(*serverURL); err != nil {
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
fmt.Fprintf(os.Stderr, "\nThe certctl control plane is HTTPS-only as of v2.2.\n")
fmt.Fprintf(os.Stderr, "See docs/upgrade-to-tls.md for the cutover walkthrough.\n")
os.Exit(1)
}
// Set up structured logging
logLevel := slog.LevelInfo
if getEnvDefault("CERTCTL_LOG_LEVEL", "info") == "debug" {
@@ -1078,17 +1231,27 @@ func main() {
// Create agent configuration
agentCfg := &AgentConfig{
ServerURL: *serverURL,
APIKey: *apiKey,
AgentName: *agentName,
AgentID: *agentID,
Hostname: hostname,
KeyDir: *keyDir,
DiscoveryDirs: discoveryDirs,
ServerURL: *serverURL,
APIKey: *apiKey,
AgentName: *agentName,
AgentID: *agentID,
Hostname: hostname,
KeyDir: *keyDir,
DiscoveryDirs: discoveryDirs,
CABundlePath: *caBundlePath,
InsecureSkipVerify: *insecureSkipVerify,
}
if agentCfg.InsecureSkipVerify {
logger.Warn("TLS certificate verification is disabled (CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true) — never enable this in production")
}
// Create and start agent
agent := NewAgent(agentCfg, logger)
agent, err := NewAgent(agentCfg, logger)
if err != nil {
fmt.Fprintf(os.Stderr, "Error: failed to initialize agent: %v\n", err)
os.Exit(1)
}
// Create context with cancellation for graceful shutdown
ctx, cancel := context.WithCancel(context.Background())
@@ -1117,6 +1280,19 @@ func main() {
cancel()
<-errChan
case err := <-errChan:
// I-004: ErrAgentRetired is a terminal, *clean* shutdown — the control
// plane responded HTTP 410 Gone on heartbeat/work-poll, meaning this
// agent's row has been soft-retired and will never be reachable again.
// Exit 0 so systemd's Restart=on-failure and launchd's KeepAlive do NOT
// respawn the process into another 410 loop (which would wedge the host
// and spam the control plane). Operators can observe the retirement via
// audit_events or the AgentsPage retired tab; the terminal log line on
// the way out is enough for post-mortem forensics.
if errors.Is(err, ErrAgentRetired) {
logger.Info("agent retired by control plane — exiting without restart",
"agent_id", agentCfg.AgentID)
return
}
if err != context.Canceled {
logger.Error("agent error", "error", err)
os.Exit(1)
@@ -1133,3 +1309,49 @@ func getEnvDefault(key, defaultValue string) string {
}
return defaultValue
}
// getEnvBoolDefault parses an environment variable as a boolean. Accepts "1",
// "t", "true", "T", "TRUE", "True" as true; anything else (including empty)
// returns the provided default. Kept permissive on purpose so operators can
// flip the dev-only TLS skip-verify toggle with any common truthy spelling
// without having to remember exactly what we parse.
func getEnvBoolDefault(key string, defaultValue bool) bool {
raw := os.Getenv(key)
if raw == "" {
return defaultValue
}
switch strings.ToLower(strings.TrimSpace(raw)) {
case "1", "t", "true", "yes", "on":
return true
case "0", "f", "false", "no", "off":
return false
default:
return defaultValue
}
}
// validateHTTPSScheme enforces the HTTPS-Everywhere milestone's §7 acceptance
// criterion: "Agent with CERTCTL_SERVER_URL=http://... fails at startup with
// a fail-loud diagnostic pointing at docs/upgrade-to-tls.md. Not TCP-refused,
// not TLS-handshake-error — a pre-flight config validation failure before any
// network call." Returns a descriptive error; the caller prints the upgrade
// guide pointer and exits non-zero.
func validateHTTPSScheme(serverURL string) error {
if serverURL == "" {
return fmt.Errorf("CERTCTL_SERVER_URL is empty — set it to an https:// URL (e.g., https://certctl-server:8443)")
}
u, err := url.Parse(serverURL)
if err != nil {
return fmt.Errorf("CERTCTL_SERVER_URL %q is not a valid URL: %w", serverURL, err)
}
switch strings.ToLower(u.Scheme) {
case "https":
return nil
case "http":
return fmt.Errorf("CERTCTL_SERVER_URL %q uses plaintext http:// — the certctl control plane is HTTPS-only", serverURL)
case "":
return fmt.Errorf("CERTCTL_SERVER_URL %q is missing a scheme — expected https://", serverURL)
default:
return fmt.Errorf("CERTCTL_SERVER_URL %q uses unsupported scheme %q — expected https://", serverURL, u.Scheme)
}
}
+3 -3
View File
@@ -228,7 +228,7 @@ func TestReportVerificationResult_Success(t *testing.T) {
ServerURL: server.URL,
APIKey: "test-api-key",
}
agent := NewAgent(cfg, nil)
agent, _ := NewAgent(cfg, nil)
result := &VerificationResult{
ExpectedFingerprint: "abc123",
@@ -244,7 +244,7 @@ func TestReportVerificationResult_Success(t *testing.T) {
}
func TestReportVerificationResult_MissingFields(t *testing.T) {
agent := NewAgent(&AgentConfig{}, nil)
agent, _ := NewAgent(&AgentConfig{}, nil)
result := &VerificationResult{
Verified: true,
@@ -343,7 +343,7 @@ func TestReportVerificationResult_ServerError(t *testing.T) {
ServerURL: server.URL,
APIKey: "test-api-key",
}
agent := NewAgent(cfg, nil)
agent, _ := NewAgent(cfg, nil)
result := &VerificationResult{
ExpectedFingerprint: "abc123",
+84 -10
View File
@@ -3,7 +3,9 @@ package main
import (
"flag"
"fmt"
"net/url"
"os"
"strings"
"github.com/shankar0123/certctl/internal/cli"
)
@@ -27,35 +29,50 @@ Commands:
certs renew ID Trigger certificate renewal
certs revoke ID Revoke a certificate
agents list List agents
agents get ID Get agent details
agents list List agents (add --retired to list soft-retired agents)
agents get ID Get agent details
agents retire ID Soft-retire an agent (add --force --reason "…" to cascade)
jobs list List jobs
jobs get ID Get job details
jobs cancel ID Cancel a pending job
import FILE Bulk import certificates from PEM file(s)
Required: --owner-id, --team-id, --renewal-policy-id, --issuer-id
Optional: --name-template (default {cn}), --environment (default imported)
status Show server health + summary stats
version Show CLI version
Examples:
certctl-cli --server http://localhost:8443 --api-key mykey certs list
certctl-cli --server https://localhost:8443 --api-key mykey certs list
certctl-cli certs renew mc-prod --format json
certctl-cli import certs.pem
`)
}
serverURL := fs.String("server", os.Getenv("CERTCTL_SERVER_URL"), "certctl server URL (env: CERTCTL_SERVER_URL)")
if *serverURL == "" {
*serverURL = "http://localhost:8443"
// HTTPS-Everywhere (v2.2): the server is HTTPS-only. The default URL uses
// https://; plaintext http:// is rejected by validateHTTPSScheme below.
defaultServer := os.Getenv("CERTCTL_SERVER_URL")
if defaultServer == "" {
defaultServer = "https://localhost:8443"
}
serverURL := fs.String("server", defaultServer, "certctl server URL — must be https:// (env: CERTCTL_SERVER_URL)")
apiKey := fs.String("api-key", os.Getenv("CERTCTL_API_KEY"), "API key for authentication (env: CERTCTL_API_KEY)")
format := fs.String("format", "table", "Output format: table, json")
caBundlePath := fs.String("ca-bundle", os.Getenv("CERTCTL_SERVER_CA_BUNDLE_PATH"), "Path to a PEM-encoded CA bundle that signed the server cert (env: CERTCTL_SERVER_CA_BUNDLE_PATH)")
insecure := fs.Bool("insecure", strings.EqualFold(os.Getenv("CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY"), "true"), "Skip TLS certificate verification — dev only, never set in production (env: CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY)")
fs.Parse(os.Args[1:])
if err := validateHTTPSScheme(*serverURL); err != nil {
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
fmt.Fprintf(os.Stderr, "\nThe certctl control plane is HTTPS-only as of v2.2.\n")
fmt.Fprintf(os.Stderr, "See docs/upgrade-to-tls.md for the cutover walkthrough.\n")
os.Exit(1)
}
args := fs.Args()
if len(args) == 0 {
fs.Usage()
@@ -63,13 +80,16 @@ Examples:
}
// Create client
client := cli.NewClient(*serverURL, *apiKey, *format)
client, err := cli.NewClient(*serverURL, *apiKey, *format, *caBundlePath, *insecure)
if err != nil {
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
os.Exit(1)
}
// Dispatch to appropriate command
command := args[0]
cmdArgs := args[1:]
var err error
switch command {
case "certs":
err = handleCerts(client, cmdArgs)
@@ -138,9 +158,19 @@ func handleCerts(client *cli.Client, args []string) error {
}
}
// handleAgents dispatches the `agents` subcommands.
//
// I-004 additions:
//
// agents list --retired — hit the opt-in /agents/retired endpoint
// instead of the default listing (which
// filters retired rows out).
// agents retire <id> — soft-retire an agent (DELETE /agents/{id}).
// --force cascades; --reason is required with
// --force (mirrors ErrForceReasonRequired).
func handleAgents(client *cli.Client, args []string) error {
if len(args) == 0 {
fmt.Fprintf(os.Stderr, "usage: agents <list|get> [options]\n")
fmt.Fprintf(os.Stderr, "usage: agents <list|get|retire> [options]\n")
return nil
}
@@ -149,13 +179,34 @@ func handleAgents(client *cli.Client, args []string) error {
switch subcommand {
case "list":
return client.ListAgents(subArgs)
// --retired flag splits to a separate endpoint. We intercept it
// client-side and strip it before delegating, so both code paths
// share the --page/--per-page flag parsing inside the client.
retired := false
rest := make([]string, 0, len(subArgs))
for _, a := range subArgs {
if a == "--retired" {
retired = true
continue
}
rest = append(rest, a)
}
if retired {
return client.ListRetiredAgents(rest)
}
return client.ListAgents(rest)
case "get":
if len(subArgs) == 0 {
fmt.Fprintf(os.Stderr, "usage: agents get <id>\n")
return nil
}
return client.GetAgent(subArgs[0])
case "retire":
if len(subArgs) == 0 {
fmt.Fprintf(os.Stderr, "usage: agents retire <id> [--force] [--reason <reason>]\n")
return nil
}
return client.RetireAgent(subArgs)
default:
fmt.Fprintf(os.Stderr, "unknown subcommand: agents %s\n", subcommand)
return nil
@@ -203,3 +254,26 @@ func handleImport(client *cli.Client, args []string) error {
func handleStatus(client *cli.Client) error {
return client.GetStatus()
}
// validateHTTPSScheme rejects plaintext and empty-scheme server URLs at
// startup so operators get a fail-loud diagnostic before any network call,
// not a TCP-refused or TLS-handshake-error downstream. See docs/upgrade-to-tls.md.
func validateHTTPSScheme(serverURL string) error {
if serverURL == "" {
return fmt.Errorf("server URL is empty — set --server (or CERTCTL_SERVER_URL) to an https:// URL (e.g., https://certctl-server:8443)")
}
u, err := url.Parse(serverURL)
if err != nil {
return fmt.Errorf("server URL %q is not a valid URL: %w", serverURL, err)
}
switch strings.ToLower(u.Scheme) {
case "https":
return nil
case "http":
return fmt.Errorf("server URL %q uses plaintext http:// — the certctl control plane is HTTPS-only", serverURL)
case "":
return fmt.Errorf("server URL %q is missing a scheme — expected https://", serverURL)
default:
return fmt.Errorf("server URL %q uses unsupported scheme %q — expected https://", serverURL, u.Scheme)
}
}
+96
View File
@@ -0,0 +1,96 @@
package main
import (
"strings"
"testing"
)
// TestValidateHTTPSScheme pins the pre-flight URL-scheme guard that the
// HTTPS-Everywhere milestone (v2.2, §3.2) requires on the certctl-cli binary
// startup path. The CLI's diagnostic is distinct from the agent and MCP server
// because it surfaces the --server flag alongside CERTCTL_SERVER_URL — so the
// empty-URL case pins that flag-name substring separately. Every other case
// mirrors the dispatch arms in cmd/cli/main.go:validateHTTPSScheme; drifting
// the substrings is what this test is here to catch.
func TestValidateHTTPSScheme(t *testing.T) {
tests := []struct {
name string
serverURL string
wantErr bool
wantErrSub string // substring that MUST appear in the error message
}{
{
name: "https URL passes",
serverURL: "https://certctl-server:8443",
wantErr: false,
},
{
name: "https URL with path passes",
serverURL: "https://certctl.example.com/api/v1",
wantErr: false,
},
{
name: "uppercase HTTPS scheme passes (url.Parse lowercases)",
serverURL: "HTTPS://certctl-server:8443",
wantErr: false,
},
{
name: "empty URL rejected mentions --server flag",
serverURL: "",
wantErr: true,
wantErrSub: "--server",
},
{
name: "empty URL rejected also mentions CERTCTL_SERVER_URL",
serverURL: "",
wantErr: true,
wantErrSub: "CERTCTL_SERVER_URL",
},
{
name: "plaintext http rejected",
serverURL: "http://certctl-server:8443",
wantErr: true,
wantErrSub: "plaintext http://",
},
{
name: "bare host missing scheme rejected",
serverURL: "localhost:8443",
wantErr: true,
// url.Parse treats "localhost:8443" as scheme=localhost, opaque=8443
// — exercises the default arm (unsupported scheme) rather than the
// empty-scheme arm. Both are fail-closed, which is what we care about.
wantErrSub: "unsupported scheme",
},
{
name: "path-only URL rejected",
serverURL: "//certctl-server:8443",
wantErr: true,
wantErrSub: "missing a scheme",
},
{
name: "unsupported scheme rejected",
serverURL: "ftp://certctl-server:8443",
wantErr: true,
wantErrSub: "unsupported scheme",
},
{
name: "ws scheme rejected",
serverURL: "ws://certctl-server:8443",
wantErr: true,
wantErrSub: "unsupported scheme",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
err := validateHTTPSScheme(tt.serverURL)
if (err != nil) != tt.wantErr {
t.Fatalf("validateHTTPSScheme(%q) err=%v wantErr=%v", tt.serverURL, err, tt.wantErr)
}
if tt.wantErr && tt.wantErrSub != "" && !strings.Contains(err.Error(), tt.wantErrSub) {
t.Errorf("validateHTTPSScheme(%q) err=%q must contain %q so operators see the right diagnostic",
tt.serverURL, err.Error(), tt.wantErrSub)
}
})
}
}
+46 -2
View File
@@ -4,8 +4,10 @@ import (
"context"
"fmt"
"log"
"net/url"
"os"
"os/signal"
"strings"
gomcp "github.com/modelcontextprotocol/go-sdk/mcp"
@@ -16,14 +18,33 @@ import (
var Version = "dev"
func main() {
// HTTPS-Everywhere (v2.2): the server is HTTPS-only. The default URL
// uses https://; plaintext http:// is rejected by validateHTTPSScheme
// below with a fail-loud pre-flight diagnostic pointing at
// docs/upgrade-to-tls.md, so operators never get a TCP-refused or
// TLS-handshake-error downstream. See docs/tls.md for CA bundle and
// insecure-skip-verify guidance.
serverURL := os.Getenv("CERTCTL_SERVER_URL")
if serverURL == "" {
serverURL = "http://localhost:8443"
serverURL = "https://localhost:8443"
}
if err := validateHTTPSScheme(serverURL); err != nil {
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
fmt.Fprintf(os.Stderr, "\nThe certctl control plane is HTTPS-only as of v2.2.\n")
fmt.Fprintf(os.Stderr, "See docs/upgrade-to-tls.md for the cutover walkthrough.\n")
os.Exit(1)
}
apiKey := os.Getenv("CERTCTL_API_KEY")
caBundlePath := os.Getenv("CERTCTL_SERVER_CA_BUNDLE_PATH")
insecure := strings.EqualFold(os.Getenv("CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY"), "true")
client := mcp.NewClient(serverURL, apiKey)
client, err := mcp.NewClient(serverURL, apiKey, caBundlePath, insecure)
if err != nil {
fmt.Fprintf(os.Stderr, "Error: %v\n", err)
os.Exit(1)
}
server := gomcp.NewServer(&gomcp.Implementation{
Name: "certctl",
@@ -41,3 +62,26 @@ func main() {
log.Fatalf("MCP server error: %v", err)
}
}
// validateHTTPSScheme rejects plaintext and empty-scheme server URLs at
// startup so operators get a fail-loud diagnostic before any network call,
// not a TCP-refused or TLS-handshake-error downstream. See docs/upgrade-to-tls.md.
func validateHTTPSScheme(serverURL string) error {
if serverURL == "" {
return fmt.Errorf("server URL is empty — set CERTCTL_SERVER_URL to an https:// URL (e.g., https://certctl-server:8443)")
}
u, err := url.Parse(serverURL)
if err != nil {
return fmt.Errorf("server URL %q is not a valid URL: %w", serverURL, err)
}
switch strings.ToLower(u.Scheme) {
case "https":
return nil
case "http":
return fmt.Errorf("server URL %q uses plaintext http:// — the certctl control plane is HTTPS-only", serverURL)
case "":
return fmt.Errorf("server URL %q is missing a scheme — expected https://", serverURL)
default:
return fmt.Errorf("server URL %q uses unsupported scheme %q — expected https://", serverURL, u.Scheme)
}
}
+90
View File
@@ -0,0 +1,90 @@
package main
import (
"strings"
"testing"
)
// TestValidateHTTPSScheme pins the pre-flight URL-scheme guard that the
// HTTPS-Everywhere milestone (v2.2, §3.2) requires on the MCP server binary
// startup path. The whole point is to fail loud with a diagnostic that points
// at docs/upgrade-to-tls.md *before* any network call — not a cryptic
// TCP-refused or TLS-handshake-error two ticks later. Every case here mirrors
// the dispatch arms in cmd/mcp-server/main.go:validateHTTPSScheme; drifting
// the error-message substrings is what this test is here to catch.
func TestValidateHTTPSScheme(t *testing.T) {
tests := []struct {
name string
serverURL string
wantErr bool
wantErrSub string // substring that MUST appear in the error message
}{
{
name: "https URL passes",
serverURL: "https://certctl-server:8443",
wantErr: false,
},
{
name: "https URL with path passes",
serverURL: "https://certctl.example.com/api/v1",
wantErr: false,
},
{
name: "uppercase HTTPS scheme passes (url.Parse lowercases)",
serverURL: "HTTPS://certctl-server:8443",
wantErr: false,
},
{
name: "empty URL rejected",
serverURL: "",
wantErr: true,
wantErrSub: "server URL is empty",
},
{
name: "plaintext http rejected",
serverURL: "http://certctl-server:8443",
wantErr: true,
wantErrSub: "plaintext http://",
},
{
name: "bare host missing scheme rejected",
serverURL: "localhost:8443",
wantErr: true,
// url.Parse treats "localhost:8443" as scheme=localhost, opaque=8443
// — exercises the default arm (unsupported scheme) rather than the
// empty-scheme arm. Both are fail-closed, which is what we care about.
wantErrSub: "unsupported scheme",
},
{
name: "path-only URL rejected",
serverURL: "//certctl-server:8443",
wantErr: true,
wantErrSub: "missing a scheme",
},
{
name: "unsupported scheme rejected",
serverURL: "ftp://certctl-server:8443",
wantErr: true,
wantErrSub: "unsupported scheme",
},
{
name: "ws scheme rejected",
serverURL: "ws://certctl-server:8443",
wantErr: true,
wantErrSub: "unsupported scheme",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
err := validateHTTPSScheme(tt.serverURL)
if (err != nil) != tt.wantErr {
t.Fatalf("validateHTTPSScheme(%q) err=%v wantErr=%v", tt.serverURL, err, tt.wantErr)
}
if tt.wantErr && tt.wantErrSub != "" && !strings.Contains(err.Error(), tt.wantErrSub) {
t.Errorf("validateHTTPSScheme(%q) err=%q must contain %q so operators see the right diagnostic",
tt.serverURL, err.Error(), tt.wantErrSub)
}
})
}
}
+314
View File
@@ -0,0 +1,314 @@
package main
import (
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"strings"
"testing"
)
// TestBuildFinalHandler_Dispatch is the M-001 regression harness for the outer
// HTTP dispatch layer. It pins which path prefixes ride the no-auth middleware
// chain (EST, SCEP, /.well-known/pki, health/ready, /api/v1/auth/info) versus
// the authenticated chain (/api/v1/*).
//
// The concern under test is ONLY the dispatch in buildFinalHandler — the
// handlers themselves are mocked as marker handlers that stamp "AUTH" or
// "NOAUTH" into the response body. Service-layer concerns (SCEP password
// validation, EST CSR validation, API auth enforcement) are covered by their
// respective test suites.
//
// Case (i) is the central guard: EST with NO client cert / NO Bearer token
// MUST reach the no-auth handler (pre-M-001 it was 401'd by the Auth
// middleware, blocking enrollment for every real-world EST client).
func TestBuildFinalHandler_Dispatch(t *testing.T) {
// Marker handlers — each stamps a unique body so tests can verify which
// chain the request traversed.
authHandler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
w.Header().Set("X-Chain", "auth")
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("AUTH"))
})
noAuthHandler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
w.Header().Set("X-Chain", "noauth")
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("NOAUTH"))
})
// Dashboard directory with index.html + assets/ for SPA fallback and
// static-asset tests. Cleaned up by t.TempDir.
webDir := t.TempDir()
indexHTML := []byte("<!doctype html><html><body>certctl dashboard</body></html>")
if err := os.WriteFile(filepath.Join(webDir, "index.html"), indexHTML, 0o644); err != nil {
t.Fatalf("write index.html: %v", err)
}
assetsDir := filepath.Join(webDir, "assets")
if err := os.MkdirAll(assetsDir, 0o755); err != nil {
t.Fatalf("mkdir assets: %v", err)
}
assetJS := []byte("console.log('certctl');")
if err := os.WriteFile(filepath.Join(assetsDir, "app.js"), assetJS, 0o644); err != nil {
t.Fatalf("write app.js: %v", err)
}
handler := buildFinalHandler(authHandler, noAuthHandler, webDir, true /* dashboardEnabled */)
tests := []struct {
name string
method string
path string
wantBody string // "AUTH" | "NOAUTH" | "" (== substring match against response body)
wantBodyPrefix string
wantStatus int
description string
}{
// ---- Case (i): M-001 central regression guard ----
{
name: "est_cacerts_no_auth_reaches_noauth_handler",
method: http.MethodGet,
path: "/.well-known/est/cacerts",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "EST clients cannot present Bearer tokens — must NOT be 401'd before reaching the handler (RFC 7030 §4.1.1)",
},
{
name: "est_simpleenroll_no_auth_reaches_noauth_handler",
method: http.MethodPost,
path: "/.well-known/est/simpleenroll",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "RFC 7030 §4.2 simpleenroll served from no-auth chain (option D)",
},
{
name: "est_simplereenroll_no_auth_reaches_noauth_handler",
method: http.MethodPost,
path: "/.well-known/est/simplereenroll",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "RFC 7030 §4.2.2 simplereenroll also on no-auth chain",
},
{
name: "est_csrattrs_no_auth_reaches_noauth_handler",
method: http.MethodGet,
path: "/.well-known/est/csrattrs",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "RFC 7030 §4.5 csrattrs also on no-auth chain",
},
// ---- Cases (ii) + (iii): SCEP dispatch ----
// The actual challengePassword validation lives in the service layer
// (internal/service/scep.go). This test pins that ALL /scep* requests
// reach the no-auth chain — the service layer is then responsible for
// rejecting or accepting based on password contents.
{
name: "scep_exact_path_reaches_noauth_handler",
method: http.MethodGet,
path: "/scep",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "SCEP clients authenticate via CSR challengePassword, not Bearer (RFC 8894 §3.2)",
},
{
name: "scep_subpath_reaches_noauth_handler",
method: http.MethodPost,
path: "/scep/",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "Trailing-slash variant must also ride no-auth chain",
},
{
name: "scep_query_string_reaches_noauth_handler",
method: http.MethodGet,
path: "/scep?operation=GetCACaps",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "Query string does not affect dispatch — operation dispatch is handler-internal",
},
// Defensive: /scepxyz MUST NOT match the SCEP prefix (guards against
// over-broad matching that would leak non-SCEP paths into no-auth).
{
name: "scepxyz_does_not_match_scep_prefix",
method: http.MethodGet,
path: "/scepxyz",
wantStatus: http.StatusOK,
wantBody: "certctl dashboard",
description: "SPA fallback — /scepxyz must not be confused with /scep or /scep/",
},
// ---- Case (iv): RFC 5280 CRL + RFC 6960 OCSP ----
{
name: "pki_crl_no_auth_reaches_noauth_handler",
method: http.MethodGet,
path: "/.well-known/pki/crl/abc123",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "RFC 5280 CRL distribution point must be served without auth",
},
{
name: "pki_ocsp_no_auth_reaches_noauth_handler",
method: http.MethodGet,
path: "/.well-known/pki/ocsp/abc123/serial",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "RFC 6960 OCSP responder must be served without auth",
},
// ---- Case (v): Authenticated API routes ----
{
name: "api_v1_certificates_goes_through_auth",
method: http.MethodGet,
path: "/api/v1/certificates",
wantBody: "AUTH",
wantStatus: http.StatusOK,
description: "Primary API surface must still require Bearer token",
},
{
name: "api_v1_auth_check_goes_through_auth",
method: http.MethodGet,
path: "/api/v1/auth/check",
wantBody: "AUTH",
wantStatus: http.StatusOK,
description: "auth/check validates the caller's Bearer — auth chain required",
},
{
name: "api_v1_jobs_goes_through_auth",
method: http.MethodGet,
path: "/api/v1/jobs",
wantBody: "AUTH",
wantStatus: http.StatusOK,
description: "Jobs API is part of the privileged surface",
},
// ---- Health probes bypass auth ----
{
name: "health_bypasses_auth",
method: http.MethodGet,
path: "/health",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "Docker/K8s health probes cannot carry Bearer tokens",
},
{
name: "ready_bypasses_auth",
method: http.MethodGet,
path: "/ready",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "Readiness probe also unauthenticated",
},
{
name: "auth_info_bypasses_auth",
method: http.MethodGet,
path: "/api/v1/auth/info",
wantBody: "NOAUTH",
wantStatus: http.StatusOK,
description: "React app calls auth/info BEFORE login to discover auth mode",
},
// ---- Static assets served by file server ----
{
name: "static_asset_served_by_file_server",
method: http.MethodGet,
path: "/assets/app.js",
wantStatus: http.StatusOK,
wantBody: "console.log('certctl');",
description: "Built Vite assets served directly without auth",
},
// ---- SPA fallback ----
{
name: "spa_fallback_serves_index_html",
method: http.MethodGet,
path: "/",
wantStatus: http.StatusOK,
wantBody: "certctl dashboard",
description: "Root path serves SPA entry point",
},
{
name: "spa_fallback_for_unknown_route",
method: http.MethodGet,
path: "/certificates",
wantStatus: http.StatusOK,
wantBody: "certctl dashboard",
description: "React Router routes fall through to index.html",
},
{
name: "spa_fallback_deep_route",
method: http.MethodGet,
path: "/certificates/mc-api-prod/detail",
wantStatus: http.StatusOK,
wantBody: "certctl dashboard",
description: "Deep React Router routes also fall through to SPA",
},
}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
req := httptest.NewRequest(tc.method, tc.path, nil)
w := httptest.NewRecorder()
handler.ServeHTTP(w, req)
if w.Code != tc.wantStatus {
t.Errorf("status = %d, want %d (%s)", w.Code, tc.wantStatus, tc.description)
}
body := w.Body.String()
if tc.wantBody != "" && !strings.Contains(body, tc.wantBody) {
t.Errorf("body %q does not contain %q (%s)", body, tc.wantBody, tc.description)
}
if tc.wantBodyPrefix != "" && !strings.HasPrefix(body, tc.wantBodyPrefix) {
t.Errorf("body %q does not start with %q (%s)", body, tc.wantBodyPrefix, tc.description)
}
})
}
}
// TestBuildFinalHandler_NoDashboard pins the API-only (dashboard-absent)
// dispatch behavior. When web/dist/index.html is missing, everything that's
// not a no-auth bypass route falls through to the authenticated apiHandler
// (pre-M-001 behavior for headless deployments). EST/SCEP/PKI still ride the
// no-auth chain.
func TestBuildFinalHandler_NoDashboard(t *testing.T) {
authHandler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("AUTH"))
})
noAuthHandler := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
w.WriteHeader(http.StatusOK)
_, _ = w.Write([]byte("NOAUTH"))
})
handler := buildFinalHandler(authHandler, noAuthHandler, "/nonexistent", false /* dashboardEnabled */)
tests := []struct {
name string
path string
wantBody string
}{
{"est_still_no_auth", "/.well-known/est/cacerts", "NOAUTH"},
{"scep_still_no_auth", "/scep", "NOAUTH"},
{"pki_still_no_auth", "/.well-known/pki/crl/x", "NOAUTH"},
{"health_still_no_auth", "/health", "NOAUTH"},
{"api_still_auth", "/api/v1/certificates", "AUTH"},
// The difference: non-API, non-special paths go through auth chain when
// there's no dashboard to serve (preserves legacy headless behavior).
{"unknown_path_falls_through_to_auth", "/", "AUTH"},
{"unknown_deep_path_falls_through_to_auth", "/random/path", "AUTH"},
}
for _, tc := range tests {
t.Run(tc.name, func(t *testing.T) {
req := httptest.NewRequest(http.MethodGet, tc.path, nil)
w := httptest.NewRecorder()
handler.ServeHTTP(w, req)
if w.Code != http.StatusOK {
t.Errorf("status = %d, want 200", w.Code)
}
if got := w.Body.String(); !strings.Contains(got, tc.wantBody) {
t.Errorf("body = %q, want to contain %q", got, tc.wantBody)
}
})
}
}
+256 -68
View File
@@ -9,6 +9,7 @@ import (
"os"
"os/signal"
"strconv"
"strings"
"syscall"
"time"
@@ -16,7 +17,6 @@ import (
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/api/router"
"github.com/shankar0123/certctl/internal/config"
"github.com/shankar0123/certctl/internal/domain"
discoveryawssm "github.com/shankar0123/certctl/internal/connector/discovery/awssm"
discoveryazurekv "github.com/shankar0123/certctl/internal/connector/discovery/azurekv"
discoverygcpsm "github.com/shankar0123/certctl/internal/connector/discovery/gcpsm"
@@ -25,6 +25,7 @@ import (
notifypagerduty "github.com/shankar0123/certctl/internal/connector/notifier/pagerduty"
notifyslack "github.com/shankar0123/certctl/internal/connector/notifier/slack"
notifyteams "github.com/shankar0123/certctl/internal/connector/notifier/teams"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/repository/postgres"
"github.com/shankar0123/certctl/internal/scheduler"
"github.com/shankar0123/certctl/internal/service"
@@ -145,6 +146,12 @@ func main() {
// Initialize services (following the dependency graph)
auditService := service.NewAuditService(auditRepo)
policyService := service.NewPolicyService(policyRepo, auditService)
policyService.SetCertRepo(certificateRepo) // D-008: CertificateLifetime arm needs CertificateVersion.NotBefore/NotAfter
// G-1: RenewalPolicyService — distinct from PolicyService (compliance rules).
// Drives /api/v1/renewal-policies CRUD; the service layer owns slugify + validation,
// the repo layer owns sentinel translation for 23505 (name UNIQUE) and 23503
// (FK-RESTRICT against managed_certificates.renewal_policy_id).
renewalPolicyService := service.NewRenewalPolicyService(renewalPolicyRepo)
certificateService := service.NewCertificateService(certificateRepo, policyService, auditService)
notifierRegistry := make(map[string]service.Notifier)
@@ -222,7 +229,10 @@ func main() {
renewalService := service.NewRenewalService(certificateRepo, jobRepo, renewalPolicyRepo, profileRepo, auditService, notificationService, issuerRegistry, cfg.Keygen.Mode)
renewalService.SetTargetRepo(targetRepo)
deploymentService := service.NewDeploymentService(jobRepo, targetRepo, agentRepo, certificateRepo, auditService, notificationService)
jobService := service.NewJobService(jobRepo, renewalService, deploymentService, logger)
jobService := service.NewJobService(jobRepo, certificateRepo, ownerRepo, renewalService, deploymentService, logger)
// I-001: emit "job_retry" audit events when the scheduler resets Failed→Pending.
// SetAuditService is optional — JobService falls back to nil-guarded no-op if unwired.
jobService.SetAuditService(auditService)
agentService := service.NewAgentService(agentRepo, certificateRepo, jobRepo, targetRepo, auditService, issuerRegistry, renewalService)
agentService.SetProfileRepo(profileRepo)
issuerService := service.NewIssuerService(issuerRepo, auditService, issuerRegistry, encryptionKey, logger)
@@ -348,6 +358,12 @@ func main() {
// Initialize stats and metrics services
statsService := service.NewStatsService(certificateRepo, jobRepo, agentRepo)
// I-005: wire the notification repository so DashboardSummary.NotificationsDead
// is populated, which in turn drives the Prometheus counter
// certctl_notification_dead_total in GetPrometheusMetrics. Setter
// pattern keeps NewStatsService's nine call sites (main.go + stats_test.go
// + 8 digest_test.go sites) untouched.
statsService.SetNotifRepo(notificationRepo)
logger.Info("initialized stats service")
// Initialize API handlers
@@ -357,6 +373,10 @@ func main() {
agentHandler := handler.NewAgentHandler(agentService)
jobHandler := handler.NewJobHandler(jobService)
policyHandler := handler.NewPolicyHandler(policyService)
// G-1: RenewalPolicyHandler — /api/v1/renewal-policies CRUD. Value-returning
// constructor matches the house pattern (PolicyHandler, IssuerHandler etc.);
// the registry stores it by value in HandlerRegistry.RenewalPolicies.
renewalPolicyHandler := handler.NewRenewalPolicyHandler(renewalPolicyService)
profileHandler := handler.NewProfileHandler(profileService)
teamHandler := handler.NewTeamHandler(teamService)
ownerHandler := handler.NewOwnerHandler(ownerService)
@@ -436,8 +456,20 @@ func main() {
// Configure scheduler intervals from config
sched.SetRenewalCheckInterval(cfg.Scheduler.RenewalCheckInterval)
sched.SetJobProcessorInterval(cfg.Scheduler.JobProcessorInterval)
// I-001: drive the failed-job retry loop. Runs on start + every RetryInterval
// (default 5m, CERTCTL_SCHEDULER_RETRY_INTERVAL). Kept adjacent to the job
// processor setter because they share the JobServicer dependency.
sched.SetJobRetryInterval(cfg.Scheduler.RetryInterval)
sched.SetAgentHealthCheckInterval(cfg.Scheduler.AgentHealthCheckInterval)
sched.SetNotificationProcessInterval(cfg.Scheduler.NotificationProcessInterval)
// I-005: drive the failed-notification retry sweep. Runs every
// NotificationRetryInterval (default 2m, CERTCTL_NOTIFICATION_RETRY_INTERVAL)
// and transitions eligible Failed notifications whose next_retry_at has
// arrived back to Pending so the notification processor picks them up on
// its next tick. Kept adjacent to the notification processor setter
// because they share the NotificationServicer dependency (same placement
// pattern as I-001's SetJobRetryInterval above).
sched.SetNotificationRetryInterval(cfg.Scheduler.NotificationRetryInterval)
if cfg.NetworkScan.Enabled {
sched.SetNetworkScanInterval(cfg.NetworkScan.ScanInterval)
logger.Info("network scanning enabled", "interval", cfg.NetworkScan.ScanInterval.String())
@@ -460,6 +492,16 @@ func main() {
"sources", cloudDiscoveryService.SourceCount())
}
// Wire job timeout reaper (I-003)
sched.SetJobReaperService(jobService)
sched.SetJobTimeoutInterval(cfg.Scheduler.JobTimeoutInterval)
sched.SetAwaitingCSRTimeout(cfg.Scheduler.AwaitingCSRTimeout)
sched.SetAwaitingApprovalTimeout(cfg.Scheduler.AwaitingApprovalTimeout)
logger.Info("job timeout reaper enabled",
"interval", cfg.Scheduler.JobTimeoutInterval.String(),
"csr_timeout", cfg.Scheduler.AwaitingCSRTimeout.String(),
"approval_timeout", cfg.Scheduler.AwaitingApprovalTimeout.String())
// Start scheduler
logger.Info("starting scheduler")
startedChan := sched.Start(ctx)
@@ -469,28 +511,29 @@ func main() {
// Build the API router with all handlers
apiRouter := router.New()
apiRouter.RegisterHandlers(router.HandlerRegistry{
Certificates: certificateHandler,
Issuers: issuerHandler,
Targets: targetHandler,
Agents: agentHandler,
Jobs: jobHandler,
Policies: policyHandler,
Profiles: profileHandler,
Teams: teamHandler,
Owners: ownerHandler,
AgentGroups: agentGroupHandler,
Audit: auditHandler,
Notifications: notificationHandler,
Stats: statsHandler,
Metrics: metricsHandler,
Health: healthHandler,
Discovery: discoveryHandler,
NetworkScan: networkScanHandler,
Verification: verificationHandler,
Export: exportHandler,
Digest: *digestHandler,
HealthChecks: healthCheckHandler,
BulkRevocation: bulkRevocationHandler,
Certificates: certificateHandler,
Issuers: issuerHandler,
Targets: targetHandler,
Agents: agentHandler,
Jobs: jobHandler,
Policies: policyHandler,
RenewalPolicies: renewalPolicyHandler,
Profiles: profileHandler,
Teams: teamHandler,
Owners: ownerHandler,
AgentGroups: agentGroupHandler,
Audit: auditHandler,
Notifications: notificationHandler,
Stats: statsHandler,
Metrics: metricsHandler,
Health: healthHandler,
Discovery: discoveryHandler,
NetworkScan: networkScanHandler,
Verification: verificationHandler,
Export: exportHandler,
Digest: *digestHandler,
HealthChecks: healthCheckHandler,
BulkRevocation: bulkRevocationHandler,
})
// Register EST (RFC 7030) handlers if enabled
if cfg.EST.Enabled {
@@ -551,13 +594,63 @@ func main() {
"endpoints", "/scep?operation={GetCACaps,GetCACert,PKIOperation}")
}
// Register RFC 5280 CRL and RFC 6960 OCSP handlers under /.well-known/pki/.
// These are always enabled (no config gate) — revocation data must be
// reachable to relying parties for any cert certctl issues. The finalHandler
// routing gate below strips auth middleware for this prefix so browsers,
// OpenSSL, OCSP stapling sidecars, and mTLS clients can fetch without
// presenting certctl Bearer tokens.
apiRouter.RegisterPKIHandlers(certificateHandler)
logger.Info("PKI endpoints registered",
"endpoints", "/.well-known/pki/{crl/{issuer_id},ocsp/{issuer_id}/{serial}}")
logger.Info("registered all API handlers")
// Build middleware stack
authMiddleware := middleware.NewAuth(middleware.AuthConfig{
Type: cfg.Auth.Type,
Secret: cfg.Auth.Secret,
})
// Build middleware stack.
//
// Authentication unification (M-002): every authenticated request now
// carries a named actor in the request context so audit events record
// the real key identity instead of the hardcoded "api-key-user" string.
// Named keys come from CERTCTL_API_KEYS_NAMED (preferred). For backward
// compatibility CERTCTL_AUTH_SECRET is synthesized into legacy-key-N
// entries with Admin=false.
var namedKeys []middleware.NamedAPIKey
if cfg.Auth.Type != "none" {
// Translate typed config.NamedAPIKey -> middleware.NamedAPIKey. The
// two structs are field-compatible but live in different packages to
// preserve the config→middleware dependency direction.
for _, nk := range cfg.Auth.NamedKeys {
namedKeys = append(namedKeys, middleware.NamedAPIKey{
Name: nk.Name,
Key: nk.Key,
Admin: nk.Admin,
})
}
// Back-compat: if no named keys but legacy Secret is configured,
// synthesize named entries so the audit trail still attributes the
// action (instead of falling back to "api-key-user" / "anonymous").
if len(namedKeys) == 0 && cfg.Auth.Secret != "" {
parts := strings.Split(cfg.Auth.Secret, ",")
idx := 0
for _, p := range parts {
p = strings.TrimSpace(p)
if p == "" {
continue
}
namedKeys = append(namedKeys, middleware.NamedAPIKey{
Name: fmt.Sprintf("legacy-key-%d", idx),
Key: p,
Admin: false,
})
idx++
}
if len(namedKeys) > 0 {
logger.Warn("CERTCTL_AUTH_SECRET is deprecated — set CERTCTL_API_KEYS_NAMED for named actor attribution and admin gating",
"synthesized_keys", len(namedKeys))
}
}
}
authMiddleware := middleware.NewAuthWithNamedKeys(namedKeys)
corsMiddleware := middleware.NewCORS(middleware.CORSConfig{
AllowedOrigins: cfg.CORS.AllowedOrigins,
})
@@ -642,61 +735,65 @@ func main() {
middleware.Recovery,
)
dashboardEnabled := false
if _, err := os.Stat(webDir + "/index.html"); err == nil {
fileServer := http.FileServer(http.Dir(webDir))
finalHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
path := r.URL.Path
// Health/ready and auth/info bypass auth middleware.
// Health/ready: Docker/K8s health probes don't carry Bearer tokens.
// auth/info: React app calls this before login to detect auth mode.
if path == "/health" || path == "/ready" || path == "/api/v1/auth/info" {
noAuthHandler.ServeHTTP(w, r)
return
}
// All other API and EST routes go through the full middleware stack (with auth)
if (len(path) >= 8 && path[:8] == "/api/v1/") ||
(len(path) >= 16 && path[:16] == "/.well-known/est") {
apiHandler.ServeHTTP(w, r)
return
}
// Try to serve static files (JS, CSS, assets)
if len(path) > 8 && path[:8] == "/assets/" {
fileServer.ServeHTTP(w, r)
return
}
// SPA fallback: serve index.html for all other routes
http.ServeFile(w, r, webDir+"/index.html")
})
dashboardEnabled = true
}
finalHandler = buildFinalHandler(apiHandler, noAuthHandler, webDir, dashboardEnabled)
if dashboardEnabled {
logger.Info("dashboard available at /", "web_dir", webDir)
} else {
// No dashboard: route health/auth-info without auth, everything else through full stack
finalHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
path := r.URL.Path
if path == "/health" || path == "/ready" || path == "/api/v1/auth/info" {
noAuthHandler.ServeHTTP(w, r)
return
}
apiHandler.ServeHTTP(w, r)
})
logger.Info("dashboard directory not found, serving API only")
}
// HTTPS-everywhere milestone §2.1: fail-loud if the TLS configuration is
// missing or malformed. Duplicates config.Validate() for defense in depth
// (same pattern as preflightSCEPChallengePassword).
if err := preflightServerTLS(cfg.Server.TLS.CertPath, cfg.Server.TLS.KeyPath); err != nil {
logger.Error("startup refused: HTTPS cert unusable; control plane is HTTPS-only",
"error", err,
"cert_path", cfg.Server.TLS.CertPath,
"key_path", cfg.Server.TLS.KeyPath)
os.Exit(1)
}
// Load the cert+key into a SIGHUP-reloadable holder. Any subsequent
// SIGHUP triggers a fresh read and atomic swap so rotations do not need
// a restart. Reload failures keep the previous cert and log a warning.
tlsCertHolder, err := newCertHolder(cfg.Server.TLS.CertPath, cfg.Server.TLS.KeyPath)
if err != nil {
logger.Error("startup refused: failed to load TLS cert holder",
"error", err,
"cert_path", cfg.Server.TLS.CertPath,
"key_path", cfg.Server.TLS.KeyPath)
os.Exit(1)
}
stopTLSWatcher := tlsCertHolder.watchSIGHUP(logger)
defer stopTLSWatcher()
// Server configuration
addr := net.JoinHostPort(cfg.Server.Host, strconv.Itoa(cfg.Server.Port))
httpServer := &http.Server{
Addr: addr,
Handler: finalHandler,
TLSConfig: buildServerTLSConfig(tlsCertHolder),
ReadTimeout: 30 * time.Second,
ReadHeaderTimeout: 5 * time.Second,
WriteTimeout: 120 * time.Second, // Must accommodate ACME issuance (order + challenge + finalize)
IdleTimeout: 60 * time.Second,
}
// Start HTTP server in background
logger.Info("starting HTTP server", "address", addr)
// Start HTTPS server in background. ListenAndServeTLS is called with
// empty cert+key arguments because the cert is sourced through
// TLSConfig.GetCertificate (the SIGHUP-reloadable holder). Passing file
// paths here would pin the first-loaded cert and defeat hot reload.
logger.Info("HTTPS server listening",
"address", addr,
"cert_path", cfg.Server.TLS.CertPath,
"min_version", "TLS1.3")
go func() {
if err := httpServer.ListenAndServe(); err != nil && err != http.ErrServerClosed {
logger.Error("HTTP server error", "error", err)
if err := httpServer.ListenAndServeTLS("", ""); err != nil && err != http.ErrServerClosed {
logger.Error("HTTPS server error", "error", err)
}
}()
@@ -719,9 +816,9 @@ func main() {
logger.Warn("scheduler work did not complete in time", "error", err)
}
logger.Info("shutting down HTTP server")
logger.Info("shutting down HTTPS server")
if err := httpServer.Shutdown(shutdownCtx); err != nil {
logger.Error("HTTP server shutdown error", "error", err)
logger.Error("HTTPS server shutdown error", "error", err)
}
// Drain in-flight audit-recording goroutines before closing the DB pool.
@@ -763,3 +860,94 @@ func preflightSCEPChallengePassword(enabled bool, challengePassword string) erro
return nil
}
// buildFinalHandler builds the outer HTTP dispatch handler that routes incoming
// requests to either the authenticated apiHandler chain or the unauthenticated
// noAuthHandler chain based on URL path prefix. Extracted from main() so the
// dispatch logic can be unit tested without booting the full server stack
// (see cmd/server/finalhandler_test.go).
//
// Dispatch rules (M-001, audit 2026-04-19, option D):
//
// - /health, /ready, /api/v1/auth/info → no-auth (probes + login detection)
// - /.well-known/pki/* → no-auth (RFC 5280 CRL, RFC 6960 OCSP)
// - /.well-known/est/* → no-auth (RFC 7030 §3.2.3)
// - /scep, /scep/* → no-auth (RFC 8894 §3.2, CSR challengePassword)
// - /api/v1/* → auth (Bearer token required)
// - /assets/* → static file server (dashboard only)
// - anything else → SPA index.html fallback (dashboard only)
// OR apiHandler (no dashboard)
//
// EST/SCEP clients (IoT devices, 802.1X supplicants, MDM endpoints, network
// appliances) cannot present certctl Bearer tokens, so those endpoints must be
// reachable without the Auth middleware. Authentication is instead enforced by
// CSR signature verification, profile policy gates, and for SCEP the
// challengePassword shared secret (fail-loud gated by preflightSCEPChallengePassword
// above).
//
// webDir must point to a directory containing index.html + assets/ when
// dashboardEnabled is true; it is ignored otherwise.
func buildFinalHandler(apiHandler, noAuthHandler http.Handler, webDir string, dashboardEnabled bool) http.Handler {
var fileServer http.Handler
if dashboardEnabled {
fileServer = http.FileServer(http.Dir(webDir))
}
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
path := r.URL.Path
// Health/ready and auth/info bypass auth middleware.
// Health/ready: Docker/K8s health probes don't carry Bearer tokens.
// auth/info: React app calls this before login to detect auth mode.
if path == "/health" || path == "/ready" || path == "/api/v1/auth/info" {
noAuthHandler.ServeHTTP(w, r)
return
}
// RFC 5280 CRL and RFC 6960 OCSP live under /.well-known/pki/ and MUST
// be served unauthenticated — relying parties (browsers, OpenSSL, OCSP
// stapling sidecars, mTLS clients) cannot present certctl Bearer tokens.
if strings.HasPrefix(path, "/.well-known/pki") {
noAuthHandler.ServeHTTP(w, r)
return
}
// RFC 7030 EST endpoints ride the no-auth middleware chain (M-001,
// option D, audit 2026-04-19). Trust boundary is CSR signature + profile
// policy, not HTTP Bearer. /.well-known/est/cacerts is explicitly
// anonymous per RFC 7030 §4.1.1.
if strings.HasPrefix(path, "/.well-known/est") {
noAuthHandler.ServeHTTP(w, r)
return
}
// RFC 8894 SCEP rides the no-auth chain (M-001, option D). SCEP clients
// authenticate via the challengePassword attribute in the PKCS#10 CSR,
// not via HTTP Bearer tokens. preflightSCEPChallengePassword refuses to
// start the server if SCEP is enabled without a non-empty shared secret.
if path == "/scep" || strings.HasPrefix(path, "/scep/") {
noAuthHandler.ServeHTTP(w, r)
return
}
// Authenticated API routes — full middleware stack including Auth.
if strings.HasPrefix(path, "/api/v1/") {
apiHandler.ServeHTTP(w, r)
return
}
if !dashboardEnabled {
// No dashboard: everything non-special falls through to the
// authenticated handler (preserves pre-M-001 behavior for API-only
// deployments).
apiHandler.ServeHTTP(w, r)
return
}
// Dashboard-present: serve static assets directly, SPA fallback for
// everything else.
if strings.HasPrefix(path, "/assets/") {
fileServer.ServeHTTP(w, r)
return
}
http.ServeFile(w, r, webDir+"/index.html")
})
}
+44
View File
@@ -214,6 +214,8 @@ func TestMain_ServerConfigFromEnvironment(t *testing.T) {
oldAuthType := os.Getenv("CERTCTL_AUTH_TYPE")
oldServerHost := os.Getenv("CERTCTL_SERVER_HOST")
oldServerPort := os.Getenv("CERTCTL_SERVER_PORT")
oldTLSCert := os.Getenv("CERTCTL_SERVER_TLS_CERT_PATH")
oldTLSKey := os.Getenv("CERTCTL_SERVER_TLS_KEY_PATH")
defer func() {
if oldAuthType != "" {
os.Setenv("CERTCTL_AUTH_TYPE", oldAuthType)
@@ -230,12 +232,32 @@ func TestMain_ServerConfigFromEnvironment(t *testing.T) {
} else {
os.Unsetenv("CERTCTL_SERVER_PORT")
}
if oldTLSCert != "" {
os.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", oldTLSCert)
} else {
os.Unsetenv("CERTCTL_SERVER_TLS_CERT_PATH")
}
if oldTLSKey != "" {
os.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", oldTLSKey)
} else {
os.Unsetenv("CERTCTL_SERVER_TLS_KEY_PATH")
}
}()
// HTTPS-only control plane: Validate() refuses to pass without a readable
// cert/key pair on disk. Materialize a throwaway ECDSA P-256 pair using the
// same generator cmd/server/tls_test.go uses for the certHolder tests.
dir := t.TempDir()
certPath := dir + "/server.crt"
keyPath := dir + "/server.key"
generateTestCert(t, certPath, keyPath, "main-test-cn")
// Set test env vars
os.Setenv("CERTCTL_AUTH_TYPE", "none")
os.Setenv("CERTCTL_SERVER_HOST", "127.0.0.1")
os.Setenv("CERTCTL_SERVER_PORT", "8080")
os.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", certPath)
os.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", keyPath)
cfg, err := config.Load()
if err != nil {
@@ -260,6 +282,8 @@ func TestMain_AuthTypeConfiguration(t *testing.T) {
// Save original env vars
oldAuthType := os.Getenv("CERTCTL_AUTH_TYPE")
oldAuthSecret := os.Getenv("CERTCTL_AUTH_SECRET")
oldTLSCert := os.Getenv("CERTCTL_SERVER_TLS_CERT_PATH")
oldTLSKey := os.Getenv("CERTCTL_SERVER_TLS_KEY_PATH")
defer func() {
if oldAuthType != "" {
os.Setenv("CERTCTL_AUTH_TYPE", oldAuthType)
@@ -271,8 +295,28 @@ func TestMain_AuthTypeConfiguration(t *testing.T) {
} else {
os.Unsetenv("CERTCTL_AUTH_SECRET")
}
if oldTLSCert != "" {
os.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", oldTLSCert)
} else {
os.Unsetenv("CERTCTL_SERVER_TLS_CERT_PATH")
}
if oldTLSKey != "" {
os.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", oldTLSKey)
} else {
os.Unsetenv("CERTCTL_SERVER_TLS_KEY_PATH")
}
}()
// HTTPS-only control plane: config.Load()→Validate() refuses to pass
// without a readable cert/key pair. Mint one throwaway pair for the whole
// sub-test cohort — auth type toggles don't care about the TLS surface.
dir := t.TempDir()
certPath := dir + "/server.crt"
keyPath := dir + "/server.key"
generateTestCert(t, certPath, keyPath, "main-test-cn")
os.Setenv("CERTCTL_SERVER_TLS_CERT_PATH", certPath)
os.Setenv("CERTCTL_SERVER_TLS_KEY_PATH", keyPath)
// Set auth secret for api-key mode
os.Setenv("CERTCTL_AUTH_SECRET", "test-secret")
+164
View File
@@ -0,0 +1,164 @@
package main
import (
"crypto/tls"
"fmt"
"log/slog"
"os"
"os/signal"
"sync"
"syscall"
)
// certHolder stores the server's TLS certificate under a mutex so it can be
// swapped atomically by a SIGHUP handler without restarting the server. A
// *tls.Config that wires GetCertificate → (*certHolder).GetCertificate reads
// through the holder on every ClientHello, so a successful reload takes
// effect on the next new connection immediately and without dropping
// in-flight requests.
//
// Concurrency: GetCertificate is invoked from crypto/tls handshake goroutines
// on every new inbound connection; Reload is invoked from the SIGHUP watcher
// goroutine. sync.Mutex is sufficient — TLS handshakes are not an inner-loop
// hot path and the critical section is a single pointer read.
type certHolder struct {
mu sync.Mutex
cert *tls.Certificate
certPath string
keyPath string
}
// newCertHolder loads the initial cert+key pair from disk and returns a
// holder ready to serve handshakes. Returns a non-nil error if either file
// is missing, unreadable, or the pair does not round-trip through
// tls.LoadX509KeyPair (for example the key does not sign the cert). The
// caller is expected to treat a non-nil error as a fail-loud startup gate
// and os.Exit(1) — the HTTPS-everywhere milestone (§3 locked decisions)
// prohibits plaintext HTTP fallback.
func newCertHolder(certPath, keyPath string) (*certHolder, error) {
cert, err := tls.LoadX509KeyPair(certPath, keyPath)
if err != nil {
return nil, fmt.Errorf("load TLS cert/key (cert=%q key=%q): %w", certPath, keyPath, err)
}
return &certHolder{
cert: &cert,
certPath: certPath,
keyPath: keyPath,
}, nil
}
// GetCertificate is the tls.Config.GetCertificate hook. Returns the current
// cert under the holder's mutex. ClientHelloInfo is ignored — the control
// plane does not multiplex by SNI.
func (h *certHolder) GetCertificate(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
h.mu.Lock()
defer h.mu.Unlock()
return h.cert, nil
}
// Reload re-reads the cert+key pair from disk and swaps the holder
// atomically on success. On failure the holder retains its previous cert
// and the error is propagated to the caller — the SIGHUP watcher logs and
// keeps serving the previous cert rather than crashing on a bad reload.
// This is deliberately "fail-safe on reload, fail-loud on startup": an
// operator rotating certs wants a recoverable error, not a restart loop.
func (h *certHolder) Reload() error {
cert, err := tls.LoadX509KeyPair(h.certPath, h.keyPath)
if err != nil {
return fmt.Errorf("reload TLS cert/key (cert=%q key=%q): %w", h.certPath, h.keyPath, err)
}
h.mu.Lock()
h.cert = &cert
h.mu.Unlock()
return nil
}
// watchSIGHUP installs a signal handler that calls Reload() on each SIGHUP.
// The returned stop function closes the internal done channel and stops
// signal delivery so the goroutine can exit cleanly during shutdown. Errors
// from Reload are logged but do not terminate the watcher — the operator
// can fix the files and send another SIGHUP.
//
// Defensive design note: this deliberately does NOT panic on Reload error
// even though HTTPS is mission-critical. A rotation that writes half-files
// (operator overwrites cert.pem then key.pem as two separate copies) would
// otherwise crash the server mid-rotation. Logging + retaining the old
// cert gives the operator a bounded window to fix and re-SIGHUP.
func (h *certHolder) watchSIGHUP(logger *slog.Logger) (stop func()) {
ch := make(chan os.Signal, 1)
signal.Notify(ch, syscall.SIGHUP)
done := make(chan struct{})
go func() {
for {
select {
case <-ch:
if err := h.Reload(); err != nil {
logger.Error("TLS cert reload failed; continuing with previous cert",
"error", err,
"cert_path", h.certPath,
"key_path", h.keyPath)
continue
}
logger.Info("TLS cert reloaded via SIGHUP",
"cert_path", h.certPath,
"key_path", h.keyPath)
case <-done:
signal.Stop(ch)
return
}
}
}()
return func() { close(done) }
}
// buildServerTLSConfig returns the TLS 1.3-only *tls.Config for the HTTPS
// server. Pinned per HTTPS-everywhere milestone §2.1 + §3 locked decisions:
//
// - MinVersion: TLS 1.3 (no TLS 1.2 escape hatch). Go 1.25's crypto/tls
// automatically rejects older versions.
// - CurvePreferences: explicit [X25519, P-256]. Explicit ordering keeps
// the handshake deterministic and documents the accepted curves.
// - No CipherSuites field: TLS 1.3 cipher suites are not negotiable in
// the handshake (all three mandatory suites — AES-128-GCM-SHA256,
// AES-256-GCM-SHA384, CHACHA20-POLY1305-SHA256 — are always offered).
// Go's crypto/tls ignores CipherSuites for TLS 1.3.
// - GetCertificate: reads through the holder so SIGHUP rotations take
// effect on the next new connection without a restart. Setting
// tls.Config.Certificates directly would pin the first-loaded cert
// and defeat SIGHUP reload.
func buildServerTLSConfig(holder *certHolder) *tls.Config {
return &tls.Config{
MinVersion: tls.VersionTLS13,
CurvePreferences: []tls.CurveID{tls.X25519, tls.CurveP256},
GetCertificate: holder.GetCertificate,
}
}
// preflightServerTLS is the fail-loud startup gate for HTTPS. Returns a
// non-nil error when the TLS configuration is missing or the cert+key pair
// cannot be parsed, so the caller refuses to start the control plane
// (HTTPS-everywhere §3 locked decisions: no plaintext HTTP fallback).
//
// Duplicates the emptiness + stat + parse checks in config.Validate() for
// defense in depth, mirroring the pattern established by
// preflightSCEPChallengePassword (which itself duplicates
// config.Validate()'s SCEP check for CWE-306). Extracted into a separate
// function so the gate is unit-testable without booting the full server.
func preflightServerTLS(certPath, keyPath string) error {
if certPath == "" {
return fmt.Errorf("CERTCTL_SERVER_TLS_CERT_PATH is empty: HTTPS-only control plane refuses to start (see docs/tls.md)")
}
if keyPath == "" {
return fmt.Errorf("CERTCTL_SERVER_TLS_KEY_PATH is empty: HTTPS-only control plane refuses to start (see docs/tls.md)")
}
if _, err := os.Stat(certPath); err != nil {
return fmt.Errorf("TLS cert file %q unreadable: %w (see docs/tls.md)", certPath, err)
}
if _, err := os.Stat(keyPath); err != nil {
return fmt.Errorf("TLS key file %q unreadable: %w (see docs/tls.md)", keyPath, err)
}
if _, err := tls.LoadX509KeyPair(certPath, keyPath); err != nil {
return fmt.Errorf("TLS cert/key pair invalid (cert=%q key=%q): %w (see docs/tls.md)", certPath, keyPath, err)
}
return nil
}
+418
View File
@@ -0,0 +1,418 @@
package main
import (
"crypto/ecdsa"
"crypto/elliptic"
"crypto/rand"
"crypto/tls"
"crypto/x509"
"crypto/x509/pkix"
"encoding/pem"
"errors"
"io"
"log/slog"
"math/big"
"net"
"os"
"path/filepath"
"sync"
"syscall"
"testing"
"time"
)
// generateTestCert writes a PEM-encoded self-signed leaf cert + ECDSA P-256
// key pair to certPath/keyPath. The subject is derived from cn so tests can
// tell reloaded certs apart from original certs by re-parsing the served
// Certificate and comparing the CN.
func generateTestCert(t *testing.T, certPath, keyPath, cn string) {
t.Helper()
priv, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
if err != nil {
t.Fatalf("ecdsa.GenerateKey: %v", err)
}
tmpl := &x509.Certificate{
SerialNumber: big.NewInt(time.Now().UnixNano()),
Subject: pkix.Name{CommonName: cn},
NotBefore: time.Now().Add(-1 * time.Hour),
NotAfter: time.Now().Add(24 * time.Hour),
KeyUsage: x509.KeyUsageDigitalSignature,
ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
DNSNames: []string{"localhost"},
IPAddresses: []net.IP{net.ParseIP("127.0.0.1"), net.ParseIP("::1")},
}
der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &priv.PublicKey, priv)
if err != nil {
t.Fatalf("x509.CreateCertificate: %v", err)
}
certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
keyDER, err := x509.MarshalECPrivateKey(priv)
if err != nil {
t.Fatalf("MarshalECPrivateKey: %v", err)
}
keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
if err := os.WriteFile(certPath, certPEM, 0o600); err != nil {
t.Fatalf("write cert: %v", err)
}
if err := os.WriteFile(keyPath, keyPEM, 0o600); err != nil {
t.Fatalf("write key: %v", err)
}
}
// readCertCN returns the CommonName from the leaf cert currently held by the
// holder, by exercising the same GetCertificate path the tls handshake would
// take. Lets tests assert which generation of the cert is being served.
func readCertCN(t *testing.T, h *certHolder) string {
t.Helper()
c, err := h.GetCertificate(&tls.ClientHelloInfo{})
if err != nil {
t.Fatalf("GetCertificate: %v", err)
}
leaf, err := x509.ParseCertificate(c.Certificate[0])
if err != nil {
t.Fatalf("ParseCertificate: %v", err)
}
return leaf.Subject.CommonName
}
func silentLogger() *slog.Logger {
return slog.New(slog.NewTextHandler(io.Discard, &slog.HandlerOptions{Level: slog.LevelError}))
}
func TestNewCertHolder_ValidPair_LoadsCert(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
generateTestCert(t, certPath, keyPath, "cn-initial")
h, err := newCertHolder(certPath, keyPath)
if err != nil {
t.Fatalf("newCertHolder: %v", err)
}
if got := readCertCN(t, h); got != "cn-initial" {
t.Fatalf("CN mismatch: got %q want %q", got, "cn-initial")
}
}
func TestNewCertHolder_MissingFile_Fails(t *testing.T) {
_, err := newCertHolder("/nonexistent/cert.pem", "/nonexistent/key.pem")
if err == nil {
t.Fatal("expected error for missing files, got nil")
}
}
func TestNewCertHolder_MalformedCert_Fails(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "bad.crt")
keyPath := filepath.Join(dir, "bad.key")
if err := os.WriteFile(certPath, []byte("not a pem cert"), 0o600); err != nil {
t.Fatalf("write cert: %v", err)
}
if err := os.WriteFile(keyPath, []byte("not a pem key"), 0o600); err != nil {
t.Fatalf("write key: %v", err)
}
_, err := newCertHolder(certPath, keyPath)
if err == nil {
t.Fatal("expected error for malformed PEM, got nil")
}
}
func TestCertHolder_Reload_SwapsCert(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
generateTestCert(t, certPath, keyPath, "cn-v1")
h, err := newCertHolder(certPath, keyPath)
if err != nil {
t.Fatalf("newCertHolder: %v", err)
}
if got := readCertCN(t, h); got != "cn-v1" {
t.Fatalf("initial CN: got %q want cn-v1", got)
}
// Rotate on disk and reload.
generateTestCert(t, certPath, keyPath, "cn-v2")
if err := h.Reload(); err != nil {
t.Fatalf("Reload: %v", err)
}
if got := readCertCN(t, h); got != "cn-v2" {
t.Fatalf("post-reload CN: got %q want cn-v2", got)
}
}
func TestCertHolder_Reload_FailureRetainsPreviousCert(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
generateTestCert(t, certPath, keyPath, "cn-v1")
h, err := newCertHolder(certPath, keyPath)
if err != nil {
t.Fatalf("newCertHolder: %v", err)
}
// Corrupt the cert file and attempt reload.
if err := os.WriteFile(certPath, []byte("garbage"), 0o600); err != nil {
t.Fatalf("corrupt cert: %v", err)
}
if err := h.Reload(); err == nil {
t.Fatal("expected Reload error for corrupt file, got nil")
}
// Holder should still serve the v1 cert.
if got := readCertCN(t, h); got != "cn-v1" {
t.Fatalf("post-failed-reload CN: got %q want cn-v1 (reload must not clobber on failure)", got)
}
}
func TestCertHolder_GetCertificate_Concurrent(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
generateTestCert(t, certPath, keyPath, "cn-concurrent")
h, err := newCertHolder(certPath, keyPath)
if err != nil {
t.Fatalf("newCertHolder: %v", err)
}
// 64 readers + 1 rotator for 500ms. Race detector catches any unsynchronized
// swap of h.cert. Rotator writes fresh files + Reload, readers call
// GetCertificate in a tight loop.
var wg sync.WaitGroup
done := make(chan struct{})
const readers = 64
for i := 0; i < readers; i++ {
wg.Add(1)
go func() {
defer wg.Done()
for {
select {
case <-done:
return
default:
if _, err := h.GetCertificate(&tls.ClientHelloInfo{}); err != nil {
t.Errorf("GetCertificate: %v", err)
return
}
}
}
}()
}
wg.Add(1)
go func() {
defer wg.Done()
for i := 0; i < 20; i++ {
generateTestCert(t, certPath, keyPath, "cn-concurrent")
_ = h.Reload()
time.Sleep(10 * time.Millisecond)
}
}()
time.Sleep(300 * time.Millisecond)
close(done)
wg.Wait()
}
func TestCertHolder_WatchSIGHUP_ReloadsOnSignal(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
generateTestCert(t, certPath, keyPath, "cn-before-sighup")
h, err := newCertHolder(certPath, keyPath)
if err != nil {
t.Fatalf("newCertHolder: %v", err)
}
stop := h.watchSIGHUP(silentLogger())
defer stop()
// Rotate on disk, then fire SIGHUP to our own process and poll for the swap.
generateTestCert(t, certPath, keyPath, "cn-after-sighup")
if err := syscall.Kill(syscall.Getpid(), syscall.SIGHUP); err != nil {
t.Fatalf("SIGHUP: %v", err)
}
deadline := time.Now().Add(2 * time.Second)
for time.Now().Before(deadline) {
if readCertCN(t, h) == "cn-after-sighup" {
return
}
time.Sleep(10 * time.Millisecond)
}
t.Fatalf("watcher did not reload cert within 2s (CN still %q)", readCertCN(t, h))
}
func TestCertHolder_WatchSIGHUP_StopExits(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
generateTestCert(t, certPath, keyPath, "cn-stop")
h, err := newCertHolder(certPath, keyPath)
if err != nil {
t.Fatalf("newCertHolder: %v", err)
}
stop := h.watchSIGHUP(silentLogger())
// Closing should be synchronous and safe; a subsequent SIGHUP must not
// cause a reload (the watcher goroutine is gone).
stop()
time.Sleep(50 * time.Millisecond) // let goroutine exit
// After stop, the signal may still be delivered to the process but the
// watcher has called signal.Stop so this channel is no longer receiving.
// Simply assert that calling stop() twice does not panic — the goroutine
// has already exited, so a second close would panic on the `done`
// channel; we do NOT call stop twice. Instead verify no regression in
// the held cert.
if got := readCertCN(t, h); got != "cn-stop" {
t.Fatalf("unexpected cert rotation after stop: got %q want cn-stop", got)
}
}
func TestBuildServerTLSConfig_IsTLS13Only(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
generateTestCert(t, certPath, keyPath, "cn-cfg")
h, err := newCertHolder(certPath, keyPath)
if err != nil {
t.Fatalf("newCertHolder: %v", err)
}
cfg := buildServerTLSConfig(h)
if cfg.MinVersion != tls.VersionTLS13 {
t.Fatalf("MinVersion: got %#x want %#x (TLS 1.3)", cfg.MinVersion, tls.VersionTLS13)
}
wantCurves := []tls.CurveID{tls.X25519, tls.CurveP256}
if len(cfg.CurvePreferences) != len(wantCurves) {
t.Fatalf("CurvePreferences length: got %d want %d", len(cfg.CurvePreferences), len(wantCurves))
}
for i, c := range cfg.CurvePreferences {
if c != wantCurves[i] {
t.Fatalf("CurvePreferences[%d]: got %v want %v", i, c, wantCurves[i])
}
}
if cfg.GetCertificate == nil {
t.Fatal("GetCertificate: nil (holder not wired; SIGHUP reload would be broken)")
}
if len(cfg.Certificates) != 0 {
t.Fatalf("Certificates: got %d want 0 (static cert would pin the first load and defeat reload)", len(cfg.Certificates))
}
}
func TestBuildServerTLSConfig_Handshake_TLS12Rejected(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
generateTestCert(t, certPath, keyPath, "cn-handshake")
h, err := newCertHolder(certPath, keyPath)
if err != nil {
t.Fatalf("newCertHolder: %v", err)
}
serverCfg := buildServerTLSConfig(h)
ln, err := tls.Listen("tcp", "127.0.0.1:0", serverCfg)
if err != nil {
t.Fatalf("tls.Listen: %v", err)
}
defer ln.Close()
// Server loop: accept and immediately close (we only care about the
// handshake outcome).
go func() {
for {
conn, err := ln.Accept()
if err != nil {
return
}
// Force handshake so the server-side error surfaces.
_ = conn.(*tls.Conn).Handshake()
conn.Close()
}
}()
// TLS 1.3 client — should succeed.
clientOK := &tls.Config{
MinVersion: tls.VersionTLS13,
MaxVersion: tls.VersionTLS13,
InsecureSkipVerify: true,
}
c, err := tls.Dial("tcp", ln.Addr().String(), clientOK)
if err != nil {
t.Fatalf("TLS 1.3 dial failed (expected success): %v", err)
}
if c.ConnectionState().Version != tls.VersionTLS13 {
t.Fatalf("negotiated version: got %#x want TLS 1.3 (%#x)", c.ConnectionState().Version, tls.VersionTLS13)
}
c.Close()
// TLS 1.2 client — must be rejected at handshake.
clientOld := &tls.Config{
MinVersion: tls.VersionTLS12,
MaxVersion: tls.VersionTLS12,
InsecureSkipVerify: true,
}
if _, err := tls.Dial("tcp", ln.Addr().String(), clientOld); err == nil {
t.Fatal("TLS 1.2 dial succeeded; HTTPS-everywhere requires server to refuse TLS 1.2")
}
}
func TestPreflightServerTLS_MissingCertPath(t *testing.T) {
err := preflightServerTLS("", "/any/key.pem")
if err == nil {
t.Fatal("expected error for empty cert path, got nil")
}
}
func TestPreflightServerTLS_MissingKeyPath(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
generateTestCert(t, certPath, keyPath, "cn-preflight")
err := preflightServerTLS(certPath, "")
if err == nil {
t.Fatal("expected error for empty key path, got nil")
}
}
func TestPreflightServerTLS_CertFileNotReadable(t *testing.T) {
dir := t.TempDir()
keyPath := filepath.Join(dir, "tls.key")
if err := os.WriteFile(keyPath, []byte("k"), 0o600); err != nil {
t.Fatal(err)
}
err := preflightServerTLS(filepath.Join(dir, "nope.crt"), keyPath)
if err == nil {
t.Fatal("expected error for unreadable cert path, got nil")
}
if !errors.Is(err, os.ErrNotExist) {
t.Fatalf("expected os.ErrNotExist wrapped in error chain, got: %v", err)
}
}
func TestPreflightServerTLS_InvalidKeyPair(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
// Pair of valid cert + garbage key — files are readable but the pair
// doesn't round-trip tls.LoadX509KeyPair.
generateTestCert(t, certPath, keyPath, "cn-bad-pair")
if err := os.WriteFile(keyPath, []byte("-----BEGIN EC PRIVATE KEY-----\nBAD\n-----END EC PRIVATE KEY-----\n"), 0o600); err != nil {
t.Fatal(err)
}
err := preflightServerTLS(certPath, keyPath)
if err == nil {
t.Fatal("expected error for invalid key pair, got nil")
}
}
func TestPreflightServerTLS_ValidPair_NoError(t *testing.T) {
dir := t.TempDir()
certPath := filepath.Join(dir, "tls.crt")
keyPath := filepath.Join(dir, "tls.key")
generateTestCert(t, certPath, keyPath, "cn-ok")
if err := preflightServerTLS(certPath, keyPath); err != nil {
t.Fatalf("unexpected error for valid pair: %v", err)
}
}
+8 -5
View File
@@ -55,7 +55,7 @@ A compose file defines **services** (containers), **networks** (how they talk to
**Overlay files** let you layer changes. Running `docker compose -f base.yml -f overlay.yml up` merges both files. The overlay can add services, change environment variables, or mount extra volumes without editing the base.
**Port mapping** (`"8443:8443"`) maps host port (left) to container port (right). After startup, `http://localhost:8443` on your machine reaches the certctl server inside its container.
**Port mapping** (`"8443:8443"`) maps host port (left) to container port (right). After startup, `https://localhost:8443` on your machine reaches the certctl server inside its container (HTTPS-only as of v2.2; the `certctl-tls-init` init container bootstraps a self-signed cert into `deploy/test/certs/`).
---
@@ -91,11 +91,13 @@ Wait about 30 seconds, then verify:
docker compose -f deploy/docker-compose.yml ps
# All three services should show "Up (healthy)"
curl http://localhost:8443/health
curl --cacert ./deploy/test/certs/ca.crt https://localhost:8443/health
# {"status":"healthy"}
```
Open **http://localhost:8443** in your browser. You'll see the onboarding wizard guiding you through: connecting a CA, deploying an agent, and adding your first certificate.
The control plane is HTTPS-only as of v2.2. The `certctl-tls-init` init container bootstraps a self-signed cert into `deploy/test/certs/` on first boot; pin it with `--cacert` (as above) or pass `-k` for one-off smoke tests (never in production).
Open **https://localhost:8443** in your browser. You'll see the onboarding wizard guiding you through: connecting a CA, deploying an agent, and adding your first certificate. Your browser will flag the self-signed cert as untrusted — accept the warning for local evaluation, or import `deploy/test/certs/ca.crt` into your OS trust store to make the warning go away.
### Service-by-service walkthrough
@@ -307,8 +309,9 @@ docker compose -f deploy/docker-compose.test.yml up --build
Wait for all health checks to pass (about 60 seconds for step-ca's first-run bootstrap). Then:
```bash
# Dashboard with auth enabled
open http://localhost:8443
# Dashboard with auth enabled (HTTPS-only as of v2.2; browser will warn on the self-signed cert —
# accept the warning or trust `deploy/test/certs/ca.crt` in your OS keychain)
open https://localhost:8443
# API key: test-key-2026
# NGINX serving a self-signed placeholder
+102 -5
View File
@@ -4,8 +4,12 @@
#
# Spins up the full certctl platform with real CA backends for manual QA:
#
# 0. certctl-tls-init — one-shot init container; writes self-signed
# server.crt/.key/ca.crt into ./test/certs (bind
# mount, not a named volume — host-readable for
# the Go integration test binary)
# 1. PostgreSQL 16 — database (clean, no demo data)
# 2. certctl-server — control plane API + web dashboard on :8443
# 2. certctl-server — control plane API + web dashboard on :8443 (HTTPS)
# 3. certctl-agent — polls for work, deploys certs to NGINX
# 4. step-ca — private CA (JWK provisioner, auto-bootstraps)
# 5. Pebble — ACME test server (simulates Let's Encrypt)
@@ -16,15 +20,76 @@
# cd deploy
# docker compose -f docker-compose.test.yml up --build
#
# Dashboard: http://localhost:8443
# Dashboard: https://localhost:8443 (self-signed — use --cacert test/certs/ca.crt)
# API key: test-key-2026
# NGINX: https://localhost:8444 (self-signed placeholder until cert deployed)
#
# Integration tests: `go test -tags integration ./deploy/test/...` picks up
# the CA bundle at ./test/certs/ca.crt automatically via CERTCTL_TEST_CA_BUNDLE.
#
# See docs/test-env.md for the full walkthrough.
# =============================================================================
services:
# ---------------------------------------------------------------------------
# HTTPS-Everywhere Phase 6 — self-signed TLS bootstrap for the test harness.
# ---------------------------------------------------------------------------
# Mirrors the production `certctl-tls-init` (see docker-compose.yml §10-43)
# but writes into a *host bind mount* (./test/certs) instead of a named
# volume. The named-volume approach works fine inside Docker but hides the
# CA bundle from the Go integration test binary that runs on the host; the
# bind mount exposes /etc/certctl/tls/ca.crt at deploy/test/certs/ca.crt
# so `newTestClient()` can load it into an x509.CertPool and validate the
# self-signed server cert. Test-only divergence, explicitly documented.
#
# The generated cert has SAN=DNS:certctl-server,DNS:localhost,IP:127.0.0.1
# so both in-cluster traffic (agent → certctl-server:8443) and host traffic
# (go test → localhost:8443) validate cleanly. Destroy via
# `docker compose -f docker-compose.test.yml down -v` + `rm -rf test/certs`
# to force regeneration. Keys written 0600, certs 0644, owned 1000:1000
# (the UID the server binary runs as inside its container per Dockerfile:64).
certctl-tls-init:
image: alpine/openssl:latest
container_name: certctl-test-tls-init
restart: "no"
entrypoint: /bin/sh
command:
- -c
- |
set -eu
CERT=/etc/certctl/tls/server.crt
KEY=/etc/certctl/tls/server.key
CA=/etc/certctl/tls/ca.crt
if [ -f "$$CERT" ] && [ -f "$$KEY" ] && [ -f "$$CA" ]; then
echo "TLS cert already present at $$CERT — skipping generation"
else
mkdir -p /etc/certctl/tls
openssl req -x509 -newkey ec \
-pkeyopt ec_paramgen_curve:P-256 \
-nodes \
-keyout "$$KEY" \
-out "$$CERT" \
-days 3650 \
-subj "/CN=certctl-server" \
-addext "subjectAltName=DNS:certctl-server,DNS:localhost,IP:127.0.0.1,IP:::1"
cp "$$CERT" "$$CA"
echo "Generated self-signed TLS cert for certctl-test-server (ECDSA-P256/SHA-256, 3650d, CN=certctl-server)"
fi
# The test server container runs as root (see `user: "0:0"` below)
# because setup-trust.sh needs to update the system trust store, so
# the perms here are really about host-side readability — 0644 on
# the CA/cert lets `go test` on the host read the bundle without a
# chown dance.
chown 1000:1000 "$$CERT" "$$KEY" "$$CA" || true
chmod 0644 "$$CERT" "$$CA"
chmod 0600 "$$KEY"
volumes:
- ./test/certs:/etc/certctl/tls
networks:
certctl-test:
ipv4_address: 10.30.50.9
# ---------------------------------------------------------------------------
# Database
# ---------------------------------------------------------------------------
@@ -168,6 +233,12 @@ services:
condition: service_started
step-ca:
condition: service_healthy
# HTTPS-Everywhere Phase 6: block server boot until the init container
# has written server.crt / server.key / ca.crt into ./test/certs. The
# init container runs once and exits 0; service_completed_successfully
# makes that a gating dependency rather than a liveness one.
certctl-tls-init:
condition: service_completed_successfully
# Run as root so update-ca-certificates can write to /etc/ssl/certs.
# Container isolation provides the security boundary.
user: "0:0"
@@ -179,6 +250,12 @@ services:
# Server
CERTCTL_SERVER_HOST: 0.0.0.0
CERTCTL_SERVER_PORT: 8443
# HTTPS-Everywhere Phase 6: point the server at the init-container-generated
# cert/key pair (bind-mounted from ./test/certs). Same paths as production
# compose so the server binary code path is identical; only the host-side
# storage differs (bind mount vs named volume — see §certctl-tls-init block).
CERTCTL_SERVER_TLS_CERT_PATH: /etc/certctl/tls/server.crt
CERTCTL_SERVER_TLS_KEY_PATH: /etc/certctl/tls/server.key
CERTCTL_LOG_LEVEL: debug
# Auth — API key required (production-like)
@@ -224,12 +301,22 @@ services:
- ./test/setup-trust.sh:/app/setup-trust.sh:ro
# step-ca data volume (root cert at /certs/root_ca.crt, key at /secrets/provisioner_key)
- stepca_data:/stepca-data:ro
# HTTPS-Everywhere Phase 6: read-only bind mount of the init-generated
# TLS material. The init container writes here; server reads here; the
# agent mounts the same host path at the same container path (see below)
# so /etc/certctl/tls/ca.crt resolves to the *same* bytes on both sides.
- ./test/certs:/etc/certctl/tls:ro
networks:
certctl-test:
ipv4_address: 10.30.50.6
healthcheck:
# /health requires auth when CERTCTL_AUTH_TYPE=api-key, so include the Bearer token
test: ["CMD", "curl", "-f", "-H", "Authorization: Bearer test-key-2026", "http://localhost:8443/health"]
# HTTPS-Everywhere Phase 6: healthcheck now speaks TLS with --cacert to
# verify the self-signed server cert against the init-generated bundle.
# /health requires auth when CERTCTL_AUTH_TYPE=api-key, so include the
# Bearer token. curl exits non-zero on both TLS handshake failure and
# non-2xx status — either failure keeps depends_on: {condition:
# service_healthy} from unblocking the agent, which is what we want.
test: ["CMD", "curl", "--cacert", "/etc/certctl/tls/ca.crt", "-f", "-H", "Authorization: Bearer test-key-2026", "https://localhost:8443/health"]
interval: 10s
timeout: 5s
start_period: 30s
@@ -290,7 +377,13 @@ services:
certctl-server:
condition: service_healthy
environment:
CERTCTL_SERVER_URL: http://certctl-server:8443
# HTTPS-Everywhere Phase 6: agent dials the server over TLS and validates
# the self-signed cert against the CA bundle pinned by
# CERTCTL_SERVER_CA_BUNDLE_PATH. Same env vars + container paths as
# production compose so the agent binary code path (loadCABundle →
# x509.CertPool → *tls.Config{RootCAs, MinVersion: TLS13}) is identical.
CERTCTL_SERVER_URL: https://certctl-server:8443
CERTCTL_SERVER_CA_BUNDLE_PATH: /etc/certctl/tls/ca.crt
CERTCTL_API_KEY: test-key-2026
CERTCTL_AGENT_NAME: test-agent-01
CERTCTL_AGENT_ID: agent-test-01
@@ -300,6 +393,10 @@ services:
volumes:
- agent_keys:/var/lib/certctl/keys
- nginx_certs:/nginx-certs
# HTTPS-Everywhere Phase 6: same bind mount as the server, same path,
# so /etc/certctl/tls/ca.crt resolves to the identical bytes. This is
# the only way the CN=certctl-server cert validates on the agent side.
- ./test/certs:/etc/certctl/tls:ro
networks:
certctl-test:
ipv4_address: 10.30.50.8
+65 -2
View File
@@ -1,4 +1,57 @@
services:
# HTTPS-Everywhere Phase 3 — self-signed TLS bootstrap (init container).
# Generates a CN=certctl-server ECDSA-P256 (SHA-256 signature) cert with
# the SAN list locked by milestone §3.6 on first boot; subsequent boots
# see the cert already present in the `certs` named volume and no-op out.
# Server + agent mount the volume read-only. Destroy via `docker compose
# down -v` to force regeneration. This bootstrap is for docker-compose
# demos and local dev only; Helm operators supply a Secret / cert-manager
# Certificate per docs/tls.md.
#
# Rationale for ECDSA-P256 (was ed25519 pre-v2.0.48): Apple's TLS stack
# — Safari Network Framework and the macOS-bundled LibreSSL 3.3.6
# /usr/bin/curl — does not advertise ed25519 in the ClientHello
# signature_algorithms extension for server certs, yielding "tls: peer
# doesn't support any of the certificate's signature algorithms" at
# handshake. ECDSA-P256 with SHA-256 is universally supported. See
# docs/tls.md Pattern 1.
certctl-tls-init:
image: alpine/openssl:latest
container_name: certctl-tls-init
restart: "no"
entrypoint: /bin/sh
command:
- -c
- |
set -eu
CERT=/etc/certctl/tls/server.crt
KEY=/etc/certctl/tls/server.key
CA=/etc/certctl/tls/ca.crt
if [ -f "$$CERT" ] && [ -f "$$KEY" ] && [ -f "$$CA" ]; then
echo "TLS cert already present at $$CERT — skipping generation"
else
mkdir -p /etc/certctl/tls
openssl req -x509 -newkey ec \
-pkeyopt ec_paramgen_curve:P-256 \
-nodes \
-keyout "$$KEY" \
-out "$$CERT" \
-days 3650 \
-subj "/CN=certctl-server" \
-addext "subjectAltName=DNS:certctl-server,DNS:localhost,IP:127.0.0.1,IP:::1"
cp "$$CERT" "$$CA"
echo "Generated self-signed TLS cert for certctl-server (ECDSA-P256/SHA-256, 3650d, CN=certctl-server)"
fi
# certctl binary runs as UID 1000 inside the server container per
# Dockerfile:64-65; the cert + key must be readable by that UID.
chown 1000:1000 "$$CERT" "$$KEY" "$$CA"
chmod 0644 "$$CERT" "$$CA"
chmod 0600 "$$KEY"
volumes:
- certs:/etc/certctl/tls
networks:
- certctl-network
# PostgreSQL database
postgres:
image: postgres:16-alpine
@@ -50,10 +103,14 @@ services:
depends_on:
postgres:
condition: service_healthy
certctl-tls-init:
condition: service_completed_successfully
environment:
CERTCTL_DATABASE_URL: postgres://certctl:${POSTGRES_PASSWORD:-certctl}@postgres:5432/certctl?sslmode=disable
CERTCTL_SERVER_HOST: 0.0.0.0
CERTCTL_SERVER_PORT: 8443
CERTCTL_SERVER_TLS_CERT_PATH: /etc/certctl/tls/server.crt
CERTCTL_SERVER_TLS_KEY_PATH: /etc/certctl/tls/server.key
CERTCTL_LOG_LEVEL: info
CERTCTL_AUTH_TYPE: none
CERTCTL_KEYGEN_MODE: server # Demo uses server-side keygen; production should use "agent"
@@ -61,10 +118,12 @@ services:
CERTCTL_CONFIG_ENCRYPTION_KEY: ${CERTCTL_CONFIG_ENCRYPTION_KEY:-change-me-32-char-encryption-key} # AES-256-GCM for dynamic issuer/target config
ports:
- "8443:8443"
volumes:
- certs:/etc/certctl/tls:ro
networks:
- certctl-network
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8443/health"]
test: ["CMD", "curl", "--cacert", "/etc/certctl/tls/ca.crt", "-f", "https://localhost:8443/health"]
interval: 10s
timeout: 5s
retries: 5
@@ -99,13 +158,15 @@ services:
certctl-server:
condition: service_healthy
environment:
CERTCTL_SERVER_URL: http://certctl-server:8443
CERTCTL_SERVER_URL: https://certctl-server:8443
CERTCTL_SERVER_CA_BUNDLE_PATH: /etc/certctl/tls/ca.crt
CERTCTL_API_KEY: ${CERTCTL_API_KEY:-change-me-in-production}
CERTCTL_AGENT_NAME: docker-agent
CERTCTL_LOG_LEVEL: info
CERTCTL_DISCOVERY_DIRS: /var/lib/certctl/keys # Agent scans this directory for existing certificates
volumes:
- agent_keys:/var/lib/certctl/keys
- certs:/etc/certctl/tls:ro
networks:
- certctl-network
healthcheck:
@@ -134,3 +195,5 @@ volumes:
driver: local
agent_keys:
driver: local
certs:
driver: local
+7 -4
View File
@@ -236,10 +236,12 @@ kubectl get svc -l app.kubernetes.io/instance=certctl
kubectl get ingress
kubectl describe ingress certctl
# Test API connectivity
# Test API connectivity (HTTPS-only as of v2.2)
POD=$(kubectl get pods -l app.kubernetes.io/component=server -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward $POD 8443:8443 &
curl -H "Authorization: Bearer $API_KEY" http://localhost:8443/health
# If the chart provisioned a self-signed cert, fetch the CA bundle from the TLS secret first:
# kubectl get secret certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/certctl-ca.crt
curl --cacert /tmp/certctl-ca.crt -H "Authorization: Bearer $API_KEY" https://localhost:8443/health
```
### Step 6: Access the Dashboard
@@ -333,9 +335,10 @@ kubectl logs $POD | tail -20
# Port forward to API
kubectl port-forward svc/certctl-server 8443:8443 &
# Create a test certificate
# Create a test certificate (HTTPS-only as of v2.2 — pin the chart-provisioned CA bundle)
# kubectl get secret certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/certctl-ca.crt
API_KEY="your-api-key"
curl -X POST http://localhost:8443/api/v1/certificates \
curl --cacert /tmp/certctl-ca.crt -X POST https://localhost:8443/api/v1/certificates \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{
+4 -2
View File
@@ -33,9 +33,11 @@ kubectl get pods -l app.kubernetes.io/instance=certctl
# View server logs
kubectl logs -l app.kubernetes.io/component=server -f
# Access the API
# Access the API (HTTPS-only as of v2.2; use --cacert or -k depending on your cert provisioning)
kubectl port-forward svc/certctl-server 8443:8443 &
curl http://localhost:8443/health
# If the chart provisioned a self-signed cert, fetch the CA bundle from the secret first:
# kubectl get secret certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/certctl-ca.crt
curl --cacert /tmp/certctl-ca.crt https://localhost:8443/health
```
## Next Steps
+20 -14
View File
@@ -4,36 +4,46 @@
{{- else if contains "NodePort" .Values.server.service.type }}
export NODE_IP=$(kubectl get nodes --namespace {{ .Release.Namespace }} -o jsonpath="{.items[0].status.addresses[0].address}")
export NODE_PORT=$(kubectl get --namespace {{ .Release.Namespace }} -o jsonpath="{.spec.ports[0].nodePort}" services {{ include "certctl.fullname" . }}-server)
echo http://$NODE_IP:$NODE_PORT
echo https://$NODE_IP:$NODE_PORT
{{- else if contains "LoadBalancer" .Values.server.service.type }}
export SERVICE_IP=$(kubectl get svc --namespace {{ .Release.Namespace }} {{ include "certctl.fullname" . }}-server --template "{.status.loadBalancer.ingress[0].ip}")
echo http://$SERVICE_IP:{{ .Values.server.service.port }}
echo https://$SERVICE_IP:{{ .Values.server.service.port }}
{{- else }}
export POD_NAME=$(kubectl get pods --namespace {{ .Release.Namespace }} -l "app.kubernetes.io/name={{ include "certctl.name" . }},app.kubernetes.io/instance={{ .Release.Name }},app.kubernetes.io/component=server" -o jsonpath="{.items[0].metadata.name}")
export CONTAINER_PORT=$(kubectl get pod --namespace {{ .Release.Namespace }} $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
echo "Visit http://127.0.0.1:8080 to use your application"
kubectl --namespace {{ .Release.Namespace }} port-forward $POD_NAME 8080:$CONTAINER_PORT
echo "Visit https://127.0.0.1:8443 to use your application"
kubectl --namespace {{ .Release.Namespace }} port-forward $POD_NAME 8443:$CONTAINER_PORT
{{- end }}
2. Get the default API key:
2. Talk to the HTTPS-only server from your workstation:
# Export the CA bundle that signed the server cert (self-signed or cert-manager-issued)
kubectl get secret --namespace {{ .Release.Namespace }} {{ include "certctl.tls.secretName" . }} \
-o jsonpath='{.data.ca\.crt}' | base64 --decode > /tmp/certctl-ca.crt
# (If ca.crt is empty, fall back to tls.crt — typical when the Secret
# was created from a self-signed bootstrap cert without a separate CA.)
# Adapt the URL below to match the Server URL printed in step 1.
curl --cacert /tmp/certctl-ca.crt https://127.0.0.1:8443/health
3. Get the default API key:
kubectl get secret --namespace {{ .Release.Namespace }} {{ include "certctl.fullname" . }}-server -o jsonpath="{.data.api-key}" | base64 --decode; echo
3. Get PostgreSQL connection details:
4. Get PostgreSQL connection details:
Host: {{ include "certctl.fullname" . }}-postgres.{{ .Release.Namespace }}.svc.cluster.local
Port: 5432
Database: {{ .Values.postgresql.auth.database }}
Username: {{ .Values.postgresql.auth.username }}
Password: $(kubectl get secret --namespace {{ .Release.Namespace }} {{ include "certctl.fullname" . }}-postgres -o jsonpath="{.data.password}" | base64 --decode)
4. Check deployment status:
5. Check deployment status:
kubectl get pods -n {{ .Release.Namespace }} -l app.kubernetes.io/instance={{ .Release.Name }}
5. View server logs:
6. View server logs:
kubectl logs -n {{ .Release.Namespace }} -l app.kubernetes.io/name={{ include "certctl.name" . }},app.kubernetes.io/component=server -f
{{- if .Values.agent.enabled }}
6. View agent logs:
7. View agent logs:
kubectl logs -n {{ .Release.Namespace }} -l app.kubernetes.io/name={{ include "certctl.name" . }},app.kubernetes.io/component=agent -f
{{- end }}
@@ -58,11 +68,7 @@ IMPORTANT NOTES FOR PRODUCTION:
- Use an external PostgreSQL managed service (AWS RDS, Cloud SQL, etc.)
- Set postgresql.enabled=false and configure CERTCTL_DATABASE_URL in values
5. Enable HTTPS/TLS using an Ingress with certificate management:
- Configure cert-manager for automatic TLS certificate renewal
- Update ingress values with your domain and certificate issuer
6. Review security contexts and network policies:
5. Review security contexts and network policies:
- All containers run as non-root
- Implement network policies to restrict traffic between components
- Consider pod security policies or security standards for your cluster
+48 -2
View File
@@ -118,8 +118,54 @@ postgres://{{ .Values.postgresql.auth.username }}:$(POSTGRES_PASSWORD)@{{ includ
{{- end }}
{{/*
Server URL (for agents)
Server URL (for agents). HTTPS-only as of v2.2 see docs/tls.md.
*/}}
{{- define "certctl.serverURL" -}}
http://{{ include "certctl.fullname" . }}-server:{{ .Values.server.service.port }}
https://{{ include "certctl.fullname" . }}-server:{{ .Values.server.service.port }}
{{- end }}
{{/*
TLS Secret name resolver.
Operator-facing precedence:
1. server.tls.existingSecret operator points at a pre-existing kubernetes.io/tls Secret
2. server.tls.certManager.secretName explicit secret name for the cert-manager Certificate CR
3. "<fullname>-tls" default when cert-manager is enabled but secretName is blank
Never emits an empty string that case is already excluded by certctl.tls.required below,
which must be invoked by any template that depends on the resolved secret name.
*/}}
{{- define "certctl.tls.secretName" -}}
{{- if .Values.server.tls.existingSecret -}}
{{- .Values.server.tls.existingSecret -}}
{{- else if .Values.server.tls.certManager.secretName -}}
{{- .Values.server.tls.certManager.secretName -}}
{{- else -}}
{{- printf "%s-tls" (include "certctl.fullname" .) -}}
{{- end -}}
{{- end }}
{{/*
TLS configuration gate.
HTTPS is the only supported listener mode (v2.2+). The server refuses to start
without a cert/key pair mounted at server.tls.mountPath, so `helm template` /
`helm install` must fail loudly at render-time rather than shipping a broken
Deployment that crash-loops with "tls config required".
Operators MUST configure EXACTLY ONE of:
(a) server.tls.existingSecret: <name-of-kubernetes.io/tls-secret>
(b) server.tls.certManager.enabled: true (+ issuerRef.name populated)
Any template that mounts the TLS Secret must call
`{{ include "certctl.tls.required" . }}` at the top so this guard runs once
per affected resource. No-op when configured correctly.
*/}}
{{- define "certctl.tls.required" -}}
{{- if and (not .Values.server.tls.existingSecret) (not .Values.server.tls.certManager.enabled) -}}
{{- fail "\n\ncertctl refuses to start without TLS.\n\nSet EXACTLY ONE of:\n --set server.tls.existingSecret=<your-kubernetes.io/tls-secret-name>\nOR\n --set server.tls.certManager.enabled=true \\\n --set server.tls.certManager.issuerRef.name=<your-issuer-or-clusterissuer>\n\nSee docs/tls.md for the full setup walkthrough, including bootstrap\nguidance for air-gapped clusters without cert-manager.\n" -}}
{{- end -}}
{{- if and .Values.server.tls.certManager.enabled (not .Values.server.tls.certManager.issuerRef.name) -}}
{{- fail "\n\nserver.tls.certManager.enabled=true but server.tls.certManager.issuerRef.name is empty.\n\nSet:\n --set server.tls.certManager.issuerRef.name=<your-issuer-or-clusterissuer>\n\nSee docs/tls.md.\n" -}}
{{- end -}}
{{- end }}
@@ -1,4 +1,5 @@
{{- if .Values.agent.enabled }}
{{- include "certctl.tls.required" . }}
{{- if eq .Values.agent.kind "DaemonSet" }}
apiVersion: apps/v1
kind: DaemonSet
@@ -53,6 +54,8 @@ spec:
fieldPath: metadata.name
- name: CERTCTL_KEY_DIR
value: {{ .Values.agent.keyDir }}
- name: CERTCTL_SERVER_CA_BUNDLE_PATH
value: "{{ .Values.server.tls.mountPath }}/ca.crt"
{{- if .Values.agent.discoveryDirs }}
- name: CERTCTL_DISCOVERY_DIRS
valueFrom:
@@ -70,12 +73,19 @@ spec:
mountPath: {{ .Values.agent.keyDir }}
- name: tmp
mountPath: /tmp
- name: server-tls
mountPath: {{ .Values.server.tls.mountPath }}
readOnly: true
volumes:
- name: agent-keys
emptyDir:
sizeLimit: 1Gi
- name: tmp
emptyDir: {}
- name: server-tls
secret:
secretName: {{ include "certctl.tls.secretName" . }}
defaultMode: 0400
{{- else if eq .Values.agent.kind "Deployment" }}
apiVersion: apps/v1
kind: Deployment
@@ -135,6 +145,8 @@ spec:
{{- end }}
- name: CERTCTL_KEY_DIR
value: {{ .Values.agent.keyDir }}
- name: CERTCTL_SERVER_CA_BUNDLE_PATH
value: "{{ .Values.server.tls.mountPath }}/ca.crt"
{{- if .Values.agent.discoveryDirs }}
- name: CERTCTL_DISCOVERY_DIRS
valueFrom:
@@ -152,11 +164,18 @@ spec:
mountPath: {{ .Values.agent.keyDir }}
- name: tmp
mountPath: /tmp
- name: server-tls
mountPath: {{ .Values.server.tls.mountPath }}
readOnly: true
volumes:
- name: agent-keys
emptyDir:
sizeLimit: 1Gi
- name: tmp
emptyDir: {}
- name: server-tls
secret:
secretName: {{ include "certctl.tls.secretName" . }}
defaultMode: 0400
{{- end }}
{{- end }}
+13 -3
View File
@@ -1,14 +1,24 @@
{{- if .Values.ingress.enabled }}
{{- if and .Values.ingress.certManager.enabled (not .Values.ingress.certManager.issuerRef.name) -}}
{{- fail "\n\ningress.certManager.enabled=true but ingress.certManager.issuerRef.name is empty.\n\nSet:\n --set ingress.certManager.issuerRef.name=<your-issuer-or-clusterissuer>\n\nThis is separate from server.tls.certManager — it issues the external-facing\nIngress cert, not the in-cluster server TLS cert. See docs/tls.md.\n" -}}
{{- end -}}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: {{ include "certctl.fullname" . }}
labels:
{{- include "certctl.labels" . | nindent 4 }}
{{- with .Values.ingress.annotations }}
annotations:
{{- if .Values.ingress.certManager.enabled }}
{{- if eq .Values.ingress.certManager.issuerRef.kind "ClusterIssuer" }}
cert-manager.io/cluster-issuer: {{ .Values.ingress.certManager.issuerRef.name | quote }}
{{- else }}
cert-manager.io/issuer: {{ .Values.ingress.certManager.issuerRef.name | quote }}
{{- end }}
{{- end }}
{{- with .Values.ingress.annotations }}
{{- toYaml . | nindent 4 }}
{{- end }}
{{- end }}
spec:
{{- if .Values.ingress.className }}
ingressClassName: {{ .Values.ingress.className }}
@@ -33,7 +43,7 @@ spec:
pathType: {{ .pathType }}
backend:
service:
name: {{ include "certctl.fullname" . }}-server
name: {{ include "certctl.fullname" $ }}-server
port:
number: {{ $.Values.server.service.port }}
{{- end }}
@@ -0,0 +1,31 @@
{{- if .Values.server.tls.certManager.enabled }}
{{- include "certctl.tls.required" . }}
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: {{ include "certctl.fullname" . }}-server-tls
labels:
{{- include "certctl.labels" . | nindent 4 }}
app.kubernetes.io/component: server
spec:
secretName: {{ include "certctl.tls.secretName" . }}
commonName: {{ .Values.server.tls.certManager.commonName | quote }}
dnsNames:
{{- range .Values.server.tls.certManager.dnsNames }}
- {{ . | quote }}
{{- end }}
duration: {{ .Values.server.tls.certManager.duration }}
renewBefore: {{ .Values.server.tls.certManager.renewBefore }}
usages:
- server auth
- digital signature
- key encipherment
privateKey:
algorithm: ECDSA
size: 256
rotationPolicy: Always
issuerRef:
name: {{ .Values.server.tls.certManager.issuerRef.name | quote }}
kind: {{ .Values.server.tls.certManager.issuerRef.kind }}
group: {{ .Values.server.tls.certManager.issuerRef.group }}
{{- end }}
@@ -1,3 +1,4 @@
{{- include "certctl.tls.required" . }}
apiVersion: apps/v1
kind: Deployment
metadata:
@@ -32,7 +33,7 @@ spec:
image: {{ include "certctl.serverImage" . }}
imagePullPolicy: {{ .Values.server.image.pullPolicy }}
ports:
- name: http
- name: https
containerPort: {{ .Values.server.port }}
protocol: TCP
env:
@@ -40,6 +41,10 @@ spec:
value: "0.0.0.0"
- name: CERTCTL_SERVER_PORT
value: "{{ .Values.server.port }}"
- name: CERTCTL_SERVER_TLS_CERT_PATH
value: "{{ .Values.server.tls.mountPath }}/tls.crt"
- name: CERTCTL_SERVER_TLS_KEY_PATH
value: "{{ .Values.server.tls.mountPath }}/tls.key"
- name: CERTCTL_DATABASE_URL
valueFrom:
secretKeyRef:
@@ -172,12 +177,19 @@ spec:
volumeMounts:
- name: tmp
mountPath: /tmp
- name: tls
mountPath: {{ .Values.server.tls.mountPath }}
readOnly: true
{{- if .Values.server.volumeMounts }}
{{- toYaml .Values.server.volumeMounts | nindent 12 }}
{{- end }}
volumes:
- name: tmp
emptyDir: {}
- name: tls
secret:
secretName: {{ include "certctl.tls.secretName" . }}
defaultMode: 0400
{{- if .Values.server.volumes }}
{{- toYaml .Values.server.volumes | nindent 8 }}
{{- end }}
@@ -13,8 +13,8 @@ spec:
type: {{ .Values.server.service.type }}
ports:
- port: {{ .Values.server.service.port }}
targetPort: http
targetPort: https
protocol: TCP
name: http
name: https
selector:
{{- include "certctl.serverSelectorLabels" . | nindent 4 }}
+52 -4
View File
@@ -48,11 +48,12 @@ server:
drop:
- ALL
# Liveness and readiness probes
# Liveness and readiness probes (HTTPS-only as of v2.2)
livenessProbe:
httpGet:
path: /health
port: http
port: https
scheme: HTTPS
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
@@ -61,12 +62,50 @@ server:
readinessProbe:
httpGet:
path: /readyz
port: http
port: https
scheme: HTTPS
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
# TLS configuration — REQUIRED. HTTPS is the only supported mode (v2.2+).
# Operator must configure EXACTLY ONE of:
# (a) server.tls.existingSecret: <name> # pre-existing kubernetes.io/tls Secret
# (b) server.tls.certManager.enabled: true # provision a cert-manager Certificate CR
# Refusing to set either makes `helm template` fail with a diagnostic pointing at docs/tls.md.
tls:
# Name of a pre-existing Secret (type kubernetes.io/tls) holding tls.crt + tls.key (+ optional ca.crt).
# Leave empty to fall through to the cert-manager path.
existingSecret: ""
# Mount path for the TLS Secret inside the server + agent containers.
mountPath: /etc/certctl/tls
# cert-manager auto-provisioning. Opt-in (off by default per milestone §3.4).
certManager:
enabled: false
# Secret name the cert-manager Certificate CR writes into. Agents and the server
# both read from this Secret. If empty, defaults to "<fullname>-tls".
secretName: ""
# Cert-manager issuer reference.
issuerRef:
name: "" # e.g. "letsencrypt-prod" or "internal-ca"
kind: ClusterIssuer # ClusterIssuer or Issuer
group: cert-manager.io
# Subject fields on the issued cert.
commonName: "certctl-server"
dnsNames:
- certctl-server
- localhost
# Certificate lifetime + renewal window.
duration: 2160h # 90 days
renewBefore: 360h # 15 days
# Service type (ClusterIP, LoadBalancer, NodePort)
service:
type: ClusterIP
@@ -356,7 +395,16 @@ ingress:
className: ""
annotations: {}
# kubernetes.io/ingress.class: nginx
# cert-manager.io/cluster-issuer: letsencrypt-prod
# Optional cert-manager integration for the public-facing Ingress cert.
# This is completely independent of server.tls.* — the Ingress terminates
# an *additional* TLS hop between the internet and the in-cluster Service.
# Leave disabled unless an Ingress is exposing certctl to the outside world.
certManager:
enabled: false
issuerRef:
name: "" # e.g. "letsencrypt-prod"
kind: ClusterIssuer # ClusterIssuer or Issuer
hosts:
- host: certctl.local
paths:
+364 -23
View File
@@ -47,11 +47,30 @@ func envOr(key, fallback string) string {
return fallback
}
// HTTPS-Everywhere Phase 6: the test harness now dials the server over TLS and
// validates the self-signed cert against the init-container-generated CA bundle
// bind-mounted at ./test/certs/ca.crt. The defaults assume the compose setup in
// deploy/docker-compose.test.yml; override via the usual env vars when pointing
// the suite at a different deployment.
//
// - CERTCTL_TEST_SERVER_URL — must be https:// for the Phase 6 wiring
// - CERTCTL_TEST_CA_BUNDLE — PEM bundle; must contain the server's issuing
// CA (self-signed in the compose setup, so server.crt doubles as ca.crt)
// - CERTCTL_TEST_INSECURE — set to "true" to fall back to
// InsecureSkipVerify when the CA bundle path is unavailable (CI smoke or
// exploratory runs only — CI-parity runs MUST use the pinned bundle).
//
// Under no circumstance does the suite silently downgrade to plaintext HTTP:
// Phase 5 (#203) pre-flight guards in cmd/server will refuse to start with an
// http:// URL anyway, so a misconfiguration fails loud at test-harness startup
// rather than flaking mid-suite.
var (
serverURL = envOr("CERTCTL_TEST_SERVER_URL", "http://localhost:8443")
apiKey = envOr("CERTCTL_TEST_API_KEY", "test-key-2026")
dbURL = envOr("CERTCTL_TEST_DB_URL", "postgres://certctl:testpass@localhost:5432/certctl?sslmode=disable")
nginxTLS = envOr("CERTCTL_TEST_NGINX_TLS", "localhost:8444")
serverURL = envOr("CERTCTL_TEST_SERVER_URL", "https://localhost:8443")
apiKey = envOr("CERTCTL_TEST_API_KEY", "test-key-2026")
dbURL = envOr("CERTCTL_TEST_DB_URL", "postgres://certctl:testpass@localhost:5432/certctl?sslmode=disable")
nginxTLS = envOr("CERTCTL_TEST_NGINX_TLS", "localhost:8444")
caBundlePath = envOr("CERTCTL_TEST_CA_BUNDLE", "./certs/ca.crt")
insecureTLS = strings.EqualFold(os.Getenv("CERTCTL_TEST_INSECURE"), "true")
)
// ---------------------------------------------------------------------------
@@ -75,16 +94,74 @@ type testClient struct {
apiKey string
}
// buildTLSConfig wires up the x509.CertPool with the self-signed CA bundle
// emitted by the certctl-tls-init container. Panics via t.Fatal on the happy
// path if both CERTCTL_TEST_CA_BUNDLE is unreadable *and* CERTCTL_TEST_INSECURE
// is not set — that combination is almost always a misconfigured test harness
// and silently downgrading to InsecureSkipVerify would hide real failures.
//
// MinVersion is pinned to TLS 1.3 so this matches what cmd/server negotiates
// by default; a drift there would surface here first.
func buildTLSConfig() *tls.Config {
cfg := &tls.Config{
MinVersion: tls.VersionTLS13,
}
if insecureTLS {
// Opt-in smoke-run mode; log but don't fail so operators running
// `CERTCTL_TEST_INSECURE=true go test -tags integration ./deploy/test/...`
// against an ad-hoc environment still get a green suite when the server
// is reachable. CI must not set this.
cfg.InsecureSkipVerify = true
return cfg
}
pem, err := os.ReadFile(caBundlePath)
if err != nil {
// Can't use t.Fatal here (called from package-level helpers); fall
// back to a panic so the harness dies loud at the first HTTP call.
// Operators see a clear "CA bundle missing" message and fix their
// setup instead of chasing a confusing TLS handshake error.
panic(fmt.Sprintf("integration test: read CA bundle %q: %v — "+
"run `docker compose -f deploy/docker-compose.test.yml up` first, or "+
"set CERTCTL_TEST_CA_BUNDLE to a valid PEM path, or "+
"set CERTCTL_TEST_INSECURE=true for a smoke run", caBundlePath, err))
}
pool := x509.NewCertPool()
if !pool.AppendCertsFromPEM(pem) {
panic(fmt.Sprintf("integration test: no PEM certificates parsed from %q", caBundlePath))
}
cfg.RootCAs = pool
return cfg
}
// newTestClient builds a Bearer-authenticated HTTPS client pinned to the
// init-container CA. Every phase uses this for REST calls.
func newTestClient() *testClient {
return &testClient{
http: &http.Client{
Timeout: 30 * time.Second,
Transport: &http.Transport{
TLSClientConfig: buildTLSConfig(),
},
},
baseURL: serverURL,
apiKey: apiKey,
}
}
// newUnauthHTTPClient returns an *http.Client with the same TLS configuration
// but no Bearer token. Used for the Phase 7 RFC 5280 CRL / RFC 8615
// `/.well-known/pki/*` probes — those endpoints must be reachable by
// *unauthenticated* relying parties per M-006, so we explicitly omit the
// Authorization header to prove it.
func newUnauthHTTPClient() *http.Client {
return &http.Client{
Timeout: 30 * time.Second,
Transport: &http.Transport{
TLSClientConfig: buildTLSConfig(),
},
}
}
func (c *testClient) do(method, path string, body io.Reader) (*http.Response, error) {
url := c.baseURL + path
req, err := http.NewRequest(method, url, body)
@@ -195,16 +272,11 @@ type metricsResponse struct {
Uptime float64 `json:"uptime_seconds"`
}
// crlResponse for the CRL endpoint.
type crlResponse struct {
Version int `json:"version"`
Total int `json:"total"`
Entries []struct {
Serial string `json:"serial_number"`
Reason string `json:"reason"`
RevokedAt string `json:"revoked_at"`
} `json:"entries"`
}
// M-006: The non-standard JSON CRL endpoint (`GET /api/v1/crl`) was removed.
// RFC 5280 §5 defines only the DER wire format, which is now served
// unauthenticated at `/.well-known/pki/crl/{issuer_id}` per RFC 8615.
// The `crlResponse` Go struct that used to decode the JSON envelope is gone;
// Phase 7 parses the DER bytes directly via `x509.ParseRevocationList`.
// ---------------------------------------------------------------------------
// PostgreSQL test helper
@@ -728,18 +800,48 @@ func TestIntegrationSuite(t *testing.T) {
t.Fatalf("revocation response unexpected: %s", body)
}
// Check CRL
t.Run("CRL", func(t *testing.T) {
resp, err := c.Get("/api/v1/crl")
// Check DER CRL served unauthenticated under /.well-known/pki/ per
// RFC 5280 §5 + RFC 8615 (M-006). Use newUnauthHTTPClient() — no
// Bearer token — to prove the endpoint is reachable by relying
// parties that have no certctl API credentials. Post HTTPS-Everywhere
// (M-007, Phase 6) the client still speaks TLS 1.3 against the pinned
// CA bundle from ./certs/ca.crt; we just skip the Authorization header
// to exercise the unauthenticated RFC 5280 / RFC 8615 relying-party
// path. Switching from the stdlib http.DefaultClient (plaintext OK,
// system trust store only) to the helper keeps the no-auth semantic
// while preventing silent plaintext downgrade — the whole point of
// this milestone.
t.Run("CRL_DER_Unauthenticated", func(t *testing.T) {
resp, err := newUnauthHTTPClient().Get(serverURL + "/.well-known/pki/crl/iss-local")
if err != nil {
t.Fatalf("GET CRL: %v", err)
t.Fatalf("GET DER CRL: %v", err)
}
var crl crlResponse
if err := decodeJSON(resp, &crl); err != nil {
t.Fatalf("decode CRL: %v", err)
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
t.Fatalf("unexpected status: got %d, want 200 (body=%s)", resp.StatusCode, string(body))
}
if crl.Total < 1 {
t.Fatalf("CRL total: got %d, want >= 1", crl.Total)
if ct := resp.Header.Get("Content-Type"); ct != "application/pkix-crl" {
t.Errorf("Content-Type: got %q, want %q", ct, "application/pkix-crl")
}
body, err := io.ReadAll(resp.Body)
if err != nil {
t.Fatalf("read CRL body: %v", err)
}
if len(body) == 0 {
t.Fatal("CRL body empty")
}
// Parse the DER bytes as an X.509 CRL (RFC 5280) and verify the
// just-revoked certificate is listed.
crl, err := x509.ParseRevocationList(body)
if err != nil {
t.Fatalf("parse DER CRL: %v", err)
}
if len(crl.RevokedCertificateEntries) < 1 {
t.Fatalf("CRL entries: got %d, want >= 1", len(crl.RevokedCertificateEntries))
}
})
@@ -1123,4 +1225,243 @@ func TestIntegrationSuite(t *testing.T) {
}
})
})
// -----------------------------------------------------------------------
// Phase 13: I-005 Phase 1 Red — Notification Retry + Dead Letter Queue (E2E)
//
// Pins the full retry-loop contract end-to-end. Phase 2 Green must turn
// every subtest Green with a single coherent change set (migration 000016
// live, scheduler notificationRetryLoop wired as the 11th loop bumping
// the total from 10 → 11, service RetryFailedNotifications + MarkAsDead +
// RequeueNotification implemented, handler POST
// /api/v1/notifications/{id}/requeue routed, list handler parsing the
// status query param).
//
// Subtests:
//
// 1. MarkAsDead_OnMaxAttempts — a notification seeded at retry_count=4
// (one failure shy of the max_attempts=5 gate) with next_retry_at in
// the past is promoted to status='dead' on the first retry-loop
// tick. The pre-increment arithmetic `retry_count + 1 = 5 =
// max_attempts` triggers MarkAsDead instead of scheduling another
// retry.
//
// 2. Requeue_FlipsDeadToPending — POST
// /api/v1/notifications/{id}/requeue on a dead row flips status back
// to 'pending', resets retry_count to 0, and clears next_retry_at
// so the existing ProcessPendingNotifications loop (not the retry
// sweep) picks it up on its next tick.
//
// 3. ListFilter_StatusDead — GET /api/v1/notifications?status=dead
// returns only rows in status='dead' so the UI's Dead Letter tab
// (web/src/pages/NotificationsPage.test.tsx subtest #1) can isolate
// them without client-side filtering.
//
// Red behavior at HEAD (what Phase 2 Green must flip):
//
// * Schema: the INSERTs reference retry_count, next_retry_at,
// last_error. Migration 000016 is already written (file (a) of
// Phase 1 Red) but until it is applied the INSERTs fail with
// "column does not exist" — schema-level Red halt.
//
// * Subtest 1: no retry loop exists at HEAD. The seeded row stays at
// status='failed' retry_count=4 forever. The 4-minute waitFor
// therefore times out.
//
// * Subtest 2: /notifications/{id}/requeue is not routed at HEAD
// (internal/api/handler/notifications.go registers only list / get /
// mark-read). The POST returns 404.
//
// * Subtest 3: the list handler does not parse the status query param
// at HEAD. The response includes rows of every status, so the
// "leaked non-dead row" assertion fires.
// -----------------------------------------------------------------------
t.Run("Phase13_NotificationRetryDLQ", func(t *testing.T) {
// Unreachable endpoint so every webhook delivery attempt fails
// deterministically — port 1 is never bound. Pinning retry_count=4
// + a guaranteed-failing channel is what turns the seeded row into
// 'dead' on the very next scheduler tick (one delivery attempt,
// retry_count 4→5, crosses max_attempts=5 → MarkAsDead).
const blackHole = "http://127.0.0.1:1/i005-red-black-hole"
// ---------------------------------------------------------------
// Subtest 1: failed → dead transition after one retry-loop tick
// ---------------------------------------------------------------
t.Run("MarkAsDead_OnMaxAttempts", func(t *testing.T) {
id := fmt.Sprintf("notif-i005-dead-%d", time.Now().UnixNano())
// retry_count=4 + next attempt = 5 = max_attempts → MarkAsDead.
// next_retry_at is backdated so the row is immediately eligible
// for the retry sweep rather than having to wait for its own
// backoff to elapse.
past := time.Now().Add(-30 * time.Second).UTC()
db.Exec(t, `
INSERT INTO notification_events
(id, type, channel, recipient, message, status,
retry_count, next_retry_at, last_error)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
`,
id, "ExpirationWarning", "Webhook", blackHole,
"I-005 integration: DLQ promotion on max_attempts",
"failed", 4, past, "transient webhook 500",
)
// Give the retry sweep up to 4m to tick at least once (default
// 2m interval + seed/sweep/notifier slop). On success the row
// carries status='dead' and retry_count has advanced to 5.
waitFor(t, "notification transitions to dead", 4*time.Minute, 5*time.Second,
func() (bool, error) {
var status string
var retry int
err := db.db.QueryRow(
"SELECT status, retry_count FROM notification_events WHERE id = $1",
id,
).Scan(&status, &retry)
if err != nil {
return false, err
}
return strings.EqualFold(status, "dead") && retry >= 5, nil
})
// The dead-letter tab is only useful if operators can see why
// the row died. MarkAsDead must preserve the most recent
// failure string in last_error rather than nil'ing it.
var lastErr sql.NullString
if err := db.db.QueryRow(
"SELECT last_error FROM notification_events WHERE id = $1", id,
).Scan(&lastErr); err != nil {
t.Fatalf("read last_error: %v", err)
}
if !lastErr.Valid || lastErr.String == "" {
t.Errorf("dead notification %s has empty last_error — "+
"retry loop must preserve the most recent failure", id)
}
})
// ---------------------------------------------------------------
// Subtest 2: dead → pending via manual Requeue endpoint
// ---------------------------------------------------------------
t.Run("Requeue_FlipsDeadToPending", func(t *testing.T) {
id := fmt.Sprintf("notif-i005-requeue-%d", time.Now().UnixNano())
// Seed directly at status='dead' rather than waiting for a
// scheduler tick — this subtest isolates the requeue handler,
// not the retry loop (subtest 1 already pins that).
past := time.Now().Add(-10 * time.Minute).UTC()
db.Exec(t, `
INSERT INTO notification_events
(id, type, channel, recipient, message, status,
retry_count, next_retry_at, last_error)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
`,
id, "ExpirationWarning", "Webhook", blackHole,
"I-005 integration: manual requeue",
"dead", 5, past, "max attempts reached",
)
resp, err := c.Post("/api/v1/notifications/"+id+"/requeue", "")
if err != nil {
t.Fatalf("POST requeue: %v", err)
}
body := readBody(resp)
if resp.StatusCode != http.StatusOK {
t.Fatalf("requeue status %d, want 200 (body: %s)",
resp.StatusCode, body)
}
// Phase 2 Green handler responds with {"status":"requeued"}
// to mirror MarkAsRead's {"status":"marked_as_read"} envelope.
if !strings.Contains(body, "requeued") {
t.Errorf("requeue body missing 'requeued' marker: %s", body)
}
// DB must reflect the full flip: pending status, reset counter,
// cleared next_retry_at. Clearing next_retry_at is what moves
// the row out of the retry-sweep partial index and back under
// ProcessPendingNotifications.
var status string
var retry int
var nextRetry sql.NullTime
if err := db.db.QueryRow(`
SELECT status, retry_count, next_retry_at
FROM notification_events WHERE id = $1
`, id).Scan(&status, &retry, &nextRetry); err != nil {
t.Fatalf("read requeued row: %v", err)
}
if !strings.EqualFold(status, "pending") {
t.Errorf("after requeue: status=%q, want 'pending'", status)
}
if retry != 0 {
t.Errorf("after requeue: retry_count=%d, want 0", retry)
}
if nextRetry.Valid {
t.Errorf("after requeue: next_retry_at=%v, want NULL",
nextRetry.Time)
}
})
// ---------------------------------------------------------------
// Subtest 3: GET /notifications?status=dead isolates DLQ rows
// ---------------------------------------------------------------
t.Run("ListFilter_StatusDead", func(t *testing.T) {
suffix := fmt.Sprintf("%d", time.Now().UnixNano())
deadID := "notif-i005-filter-dead-" + suffix
pendingID := "notif-i005-filter-pending-" + suffix
// One row at each end of the lifecycle so we can prove the
// filter both matches and excludes.
db.Exec(t, `
INSERT INTO notification_events
(id, type, channel, recipient, message, status, retry_count)
VALUES ($1, 'ExpirationWarning', 'Webhook', $2,
'I-005 filter test: dead row', 'dead', 5)
`, deadID, blackHole)
db.Exec(t, `
INSERT INTO notification_events
(id, type, channel, recipient, message, status, retry_count)
VALUES ($1, 'ExpirationWarning', 'Webhook', $2,
'I-005 filter test: pending row', 'pending', 0)
`, pendingID, blackHole)
// per_page large enough to rule out pagination artifacts as
// the reason a seeded row might be missing from the response.
resp, err := c.Get("/api/v1/notifications?status=dead&per_page=500")
if err != nil {
t.Fatalf("GET notifications?status=dead: %v", err)
}
var pr pagedResponse
if err := decodeJSON(resp, &pr); err != nil {
t.Fatalf("decode: %v", err)
}
type row struct {
ID string `json:"id"`
Status string `json:"status"`
}
var rows []row
if err := json.Unmarshal(pr.Data, &rows); err != nil {
t.Fatalf("unmarshal rows: %v", err)
}
var sawDead, sawPending bool
for _, r := range rows {
if r.ID == deadID {
sawDead = true
}
if r.ID == pendingID {
sawPending = true
}
if !strings.EqualFold(r.Status, "dead") {
t.Errorf("status=dead filter leaked non-dead row: "+
"id=%s status=%s", r.ID, r.Status)
}
}
if !sawDead {
t.Errorf("status=dead filter missed seeded dead row %s", deadID)
}
if sawPending {
t.Errorf("status=dead filter leaked seeded pending row %s",
pendingID)
}
})
})
}
+98 -17
View File
@@ -19,15 +19,29 @@
//
// Environment overrides:
//
// CERTCTL_QA_SERVER_URL (default: http://localhost:8443)
// CERTCTL_QA_API_KEY (default: change-me-in-production)
// CERTCTL_QA_DB_URL (default: postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable)
// CERTCTL_QA_REPO_DIR (default: ../.. — the certctl repo root)
// CERTCTL_QA_SERVER_URL (default: https://localhost:8443)
// CERTCTL_QA_API_KEY (default: change-me-in-production)
// CERTCTL_QA_DB_URL (default: postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable)
// CERTCTL_QA_REPO_DIR (default: ../.. — the certctl repo root)
// CERTCTL_QA_CA_BUNDLE (default: ./certs/ca.crt — the demo stack's init container writes here)
// CERTCTL_QA_INSECURE (default: false — set to "true" to skip TLS verify, e.g. before the init container finishes)
//
// TLS note (HTTPS-Everywhere M-007, Phase 6): the demo compose stack now
// listens on https://localhost:8443 with a self-signed cert written by the
// tls-init container. This suite pins the issuing CA via
// CERTCTL_QA_CA_BUNDLE so cert rotation or a tampered proxy fails the
// handshake instead of being silently trusted. CERTCTL_QA_INSECURE="true"
// is an explicit opt-out for bootstrap scenarios — there is no silent
// plaintext downgrade, matching the server-side pre-flight guard added in
// Phase 5 (task #203).
package integration_test
import (
"crypto/tls"
"crypto/x509"
"database/sql"
"encoding/json"
"fmt"
"io"
"net/http"
"os"
@@ -49,10 +63,12 @@ func qaEnv(key, fallback string) string {
}
var (
qaServerURL = qaEnv("CERTCTL_QA_SERVER_URL", "http://localhost:8443")
qaAPIKey = qaEnv("CERTCTL_QA_API_KEY", "change-me-in-production")
qaDBURL = qaEnv("CERTCTL_QA_DB_URL", "postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable")
qaRepoDir = qaEnv("CERTCTL_QA_REPO_DIR", filepath.Join("..", ".."))
qaServerURL = qaEnv("CERTCTL_QA_SERVER_URL", "https://localhost:8443")
qaAPIKey = qaEnv("CERTCTL_QA_API_KEY", "change-me-in-production")
qaDBURL = qaEnv("CERTCTL_QA_DB_URL", "postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable")
qaRepoDir = qaEnv("CERTCTL_QA_REPO_DIR", filepath.Join("..", ".."))
qaCABundlePath = qaEnv("CERTCTL_QA_CA_BUNDLE", "./certs/ca.crt")
qaInsecure = strings.EqualFold(os.Getenv("CERTCTL_QA_INSECURE"), "true")
)
// ---------------------------------------------------------------------------
@@ -65,9 +81,38 @@ type qaClient struct {
apiKey string
}
// buildQATLSConfig returns the *tls.Config used by every qaClient. TLS 1.3
// minimum matches the server-side config pinned in Phase 2 (cmd/server).
// When CERTCTL_QA_INSECURE=true we skip verification entirely — useful
// when running against a compose stack where the tls-init container hasn't
// written ca.crt yet, or when pointing at a dev server with a rotated cert.
// Otherwise we pin CERTCTL_QA_CA_BUNDLE and panic on read/parse failure
// rather than silently downgrading to the system trust store (which would
// mask a missing init container).
func buildQATLSConfig() *tls.Config {
cfg := &tls.Config{MinVersion: tls.VersionTLS13}
if qaInsecure {
cfg.InsecureSkipVerify = true
return cfg
}
pem, err := os.ReadFile(qaCABundlePath)
if err != nil {
panic(fmt.Sprintf("qa test: read CA bundle %q: %v — set CERTCTL_QA_CA_BUNDLE or CERTCTL_QA_INSECURE=true", qaCABundlePath, err))
}
pool := x509.NewCertPool()
if !pool.AppendCertsFromPEM(pem) {
panic(fmt.Sprintf("qa test: no PEM certificates parsed from %q", qaCABundlePath))
}
cfg.RootCAs = pool
return cfg
}
func newQAClient() *qaClient {
return &qaClient{
http: &http.Client{Timeout: 30 * time.Second},
http: &http.Client{
Timeout: 30 * time.Second,
Transport: &http.Transport{TLSClientConfig: buildQATLSConfig()},
},
baseURL: qaServerURL,
apiKey: qaAPIKey,
}
@@ -434,10 +479,19 @@ func TestQA(t *testing.T) {
// ===================================================================
t.Run("Part03_CertCRUD", func(t *testing.T) {
t.Run("Create_Minimal", func(t *testing.T) {
// C-001 scope-expansion: the handler's ValidateRequired
// contract now gates common_name, owner_id, team_id,
// issuer_id, name, and renewal_policy_id. A 3-field
// payload would 400 regardless of the id hint, so the
// "minimal" variant carries every required field.
code, body := c.bodyStr(t, "POST", "/api/v1/certificates", `{
"id": "mc-qa-minimal",
"name": "qa-minimal",
"common_name": "qa-minimal.example.com",
"issuer_id": "iss-local"
"issuer_id": "iss-local",
"owner_id": "o-alice",
"team_id": "t-platform",
"renewal_policy_id": "rp-standard"
}`)
if code != 201 && code != 200 {
t.Fatalf("create cert: status %d, body: %s", code, body)
@@ -447,11 +501,14 @@ func TestQA(t *testing.T) {
t.Run("Create_Full", func(t *testing.T) {
code, body := c.bodyStr(t, "POST", "/api/v1/certificates", `{
"id": "mc-qa-full",
"name": "qa-full",
"common_name": "qa-full.example.com",
"sans": ["qa-full-alt.example.com"],
"issuer_id": "iss-local",
"environment": "staging",
"owner_id": "o-alice"
"owner_id": "o-alice",
"team_id": "t-platform",
"renewal_policy_id": "rp-standard"
}`)
if code != 201 && code != 200 {
t.Fatalf("create cert: status %d, body: %s", code, body)
@@ -596,13 +653,37 @@ func TestQA(t *testing.T) {
}
})
t.Run("CRL_JSON", func(t *testing.T) {
code, body := c.bodyStr(t, "GET", "/api/v1/crl", "")
if code != 200 {
t.Fatalf("CRL = %d", code)
// M-006: The non-standard JSON CRL endpoint was removed. RFC 5280 §5
// defines only the DER wire format, now served unauthenticated at
// `/.well-known/pki/crl/{issuer_id}` per RFC 8615. Use a plain
// http.Get — no Bearer — to prove the endpoint is reachable by
// relying parties with no API credentials.
t.Run("CRL_DER_Unauthenticated", func(t *testing.T) {
resp, err := http.Get(qaServerURL + "/.well-known/pki/crl/iss-local")
if err != nil {
t.Fatalf("GET DER CRL: %v", err)
}
if !strings.Contains(body, "entries") {
t.Fatalf("CRL response missing entries field")
defer resp.Body.Close()
if resp.StatusCode != 200 {
b, _ := io.ReadAll(resp.Body)
t.Fatalf("CRL = %d (body=%s)", resp.StatusCode, string(b))
}
if ct := resp.Header.Get("Content-Type"); ct != "application/pkix-crl" {
t.Errorf("Content-Type: got %q, want %q", ct, "application/pkix-crl")
}
body, err := io.ReadAll(resp.Body)
if err != nil {
t.Fatalf("read CRL body: %v", err)
}
if len(body) == 0 {
t.Fatal("CRL body empty")
}
crl, err := x509.ParseRevocationList(body)
if err != nil {
t.Fatalf("parse DER CRL: %v", err)
}
if len(crl.RevokedCertificateEntries) < 1 {
t.Fatalf("CRL entries: got %d, want >= 1", len(crl.RevokedCertificateEntries))
}
})
})
+45 -10
View File
@@ -1,5 +1,30 @@
#!/usr/bin/env bash
# =============================================================================
# DEPRECATED — prefer `go test -tags integration ./deploy/test/...`
# =============================================================================
#
# This bash harness predates the Go integration test suite in
# deploy/test/integration_test.go (build tag `integration`, 34 subtests across
# 13 phases — health, agent heartbeat, Local CA issuance, ACME, step-ca, EST,
# S/MIME, discovery, network scan, revocation + CRL, deployment verification).
# The Go suite uses crypto/x509, crypto/tls, and database/sql to parse certs,
# probe TLS, and talk to PostgreSQL directly — no openssl text-scraping or
# brittle curl pipelines. It is the authoritative integration test surface as
# of milestone M-007 (HTTPS Everywhere, Phase 6), where the test compose
# stack wires the server on https://localhost:8443 behind a pinned CA bundle
# at ./certs/ca.crt.
#
# Run the Go suite:
# (cd deploy && docker compose -f docker-compose.test.yml up -d --build)
# go test -tags integration -v -count=1 ./deploy/test/...
#
# Keep this bash script around because:
# * It is cited in docs/test-env.md and muscle-memory for contributors.
# * It exercises the CLI / curl path end-to-end (a different failure mode
# than the Go HTTP client path).
# But any NEW integration coverage goes in integration_test.go — not here.
#
# =============================================================================
# certctl End-to-End Test Script
# =============================================================================
#
@@ -32,10 +57,11 @@ set -euo pipefail
# Config
# ---------------------------------------------------------------------------
COMPOSE_FILE="docker-compose.test.yml"
API_URL="http://localhost:8443"
API_URL="https://localhost:8443"
API_KEY="test-key-2026"
NGINX_TLS="localhost:8444"
AUTH_HEADER="Authorization: Bearer ${API_KEY}"
CACERT="./certs/ca.crt"
# Flags
BUILD=true
@@ -91,7 +117,7 @@ header() {
# API helper: GET endpoint, return JSON body. Exits 1 on HTTP error.
api_get() {
local path="$1"
curl -sf -H "${AUTH_HEADER}" "${API_URL}${path}" 2>/dev/null
curl -sf --cacert "${CACERT}" -H "${AUTH_HEADER}" "${API_URL}${path}" 2>/dev/null
}
# API helper: POST with optional JSON body
@@ -99,10 +125,10 @@ api_post() {
local path="$1"
local body="${2:-}"
if [ -n "$body" ]; then
curl -sf -X POST -H "${AUTH_HEADER}" -H "Content-Type: application/json" \
curl -sf --cacert "${CACERT}" -X POST -H "${AUTH_HEADER}" -H "Content-Type: application/json" \
-d "$body" "${API_URL}${path}" 2>/dev/null
else
curl -sf -X POST -H "${AUTH_HEADER}" "${API_URL}${path}" 2>/dev/null
curl -sf --cacert "${CACERT}" -X POST -H "${AUTH_HEADER}" "${API_URL}${path}" 2>/dev/null
fi
}
@@ -608,13 +634,22 @@ else
fail "Revocation failed" "$REVOKE_RESP"
fi
info "Checking CRL..."
CRL_RESP=$(api_get "/api/v1/crl" 2>/dev/null || echo '{"total":0}')
CRL_TOTAL=$(echo "$CRL_RESP" | python3 -c "import sys,json; print(json.load(sys.stdin).get('total',0))" 2>/dev/null || echo 0)
if [ "$CRL_TOTAL" -ge 1 ]; then
pass "CRL contains $CRL_TOTAL revoked certificate(s)"
info "Checking DER CRL under /.well-known/pki (RFC 5280 §5, RFC 8615)..."
# The JSON CRL endpoint (`GET /api/v1/crl`) was removed in M-006. RFC 5280
# defines only the DER wire format, now served unauthenticated at
# `/.well-known/pki/crl/{issuer_id}`. Fetch without the Bearer header to
# prove the endpoint is reachable by relying parties with no API key.
CRL_TMP=$(mktemp)
CRL_HEADERS=$(mktemp)
CRL_HTTP_CODE=$(curl -s -o "$CRL_TMP" -D "$CRL_HEADERS" -w "%{http_code}" "${API_URL}/.well-known/pki/crl/iss-local" 2>/dev/null || echo "000")
CRL_SIZE=$(wc -c < "$CRL_TMP" | tr -d ' ')
CRL_CONTENT_TYPE=$(awk 'tolower($1)=="content-type:" { sub(/\r$/,"",$2); print tolower($2) }' "$CRL_HEADERS" | head -n1)
rm -f "$CRL_TMP" "$CRL_HEADERS"
if [ "$CRL_HTTP_CODE" = "200" ] && [ "$CRL_CONTENT_TYPE" = "application/pkix-crl" ] && [ "$CRL_SIZE" -gt 0 ]; then
pass "DER CRL served unauthenticated (HTTP 200, Content-Type application/pkix-crl, ${CRL_SIZE} bytes)"
else
fail "CRL empty after revocation"
fail "DER CRL fetch failed: HTTP=$CRL_HTTP_CODE Content-Type=$CRL_CONTENT_TYPE size=$CRL_SIZE"
fi
CERT_STATUS=$(api_get "/api/v1/certificates/mc-local-test" | python3 -c "import sys,json; print(json.load(sys.stdin).get('status',''))" 2>/dev/null || echo "unknown")
+60 -19
View File
@@ -61,7 +61,7 @@ flowchart TB
API["REST API\n(Go net/http, :8443)"]
SVC["Service Layer"]
REPO["Repository Layer\n(database/sql + lib/pq)"]
SCHED["Background Scheduler\n7 loops"]
SCHED["Background Scheduler\n8 always-on + 4 optional loops"]
DASH["Web Dashboard\n(React SPA)"]
end
@@ -139,6 +139,16 @@ The agent runs two background loops: a heartbeat (every 60 seconds) to signal it
**Agent groups (M11b):** Dynamic device grouping allows organizing agents by metadata criteria. Agent groups can match by OS, architecture, IP CIDR, and version. Groups support both dynamic matching (agents automatically join when criteria match) and manual membership (explicit include/exclude). Renewal policies can be scoped to agent groups via the `agent_group_id` foreign key. The GUI provides full CRUD management for agent groups with visual match criteria badges.
**Agent soft-retirement (I-004):** `DELETE /api/v1/agents/{id}` is a soft-delete surface — the row is never removed. Retirement stamps `agents.retired_at` (TIMESTAMPTZ) and `agents.retired_reason` (TEXT) and flips the operational status to `Offline`. Default listings (`GET /api/v1/agents`, the dashboard stats counter, and the stale-offline sweeper) filter retired rows out via `AgentRepository.ListActive`; retired rows are surfaced only through the opt-in `GET /api/v1/agents/retired` view. The endpoint follows a preflight → block → escape-hatch contract:
- **Clean retire** (no active dependencies) — `200 OK` with `RetireAgentResponse` (`cascade=false`, zero counts).
- **Blocked by active dependencies**`409 Conflict` with `BlockedByDependenciesResponse`. The three counts (`active_targets`, `active_certificates`, `pending_jobs`) tell the operator exactly which rows would be orphaned. The schema diverges from `ErrorResponse` because downstream dashboards parse the stable three-key shape.
- **Force cascade**`DELETE /api/v1/agents/{id}?force=true&reason=...`. `reason` is required (400 otherwise). Transactionally soft-retires downstream `deployment_targets`, cancels pending jobs, and soft-retires the agent, emitting an `agent_retirement_cascaded` audit event with actor + reason + per-bucket counts.
- **Idempotent re-retire** — a retire attempt against an already-retired agent returns `204 No Content` with an empty body (no second audit event, no response shape — callers that POST again on a retry get a clean no-op).
- **Sentinel refusal** — the four sentinel agent IDs (`server-scanner`, `cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) back non-agent discovery subsystems (the network scanner and the three cloud secret-manager sources). They are refused unconditionally — even with `force=true` — via `ErrAgentIsSentinel``403 Forbidden`. The ID list lives in `internal/domain/connector.go` (`SentinelAgentIDs`) so handler, repository, and scheduler code can filter them without importing `service`.
Retired agents receive `410 Gone` on subsequent heartbeats (`service.ErrAgentRetired`). `cmd/agent` treats 410 as a terminal signal and exits cleanly so retired agents stop phoning home. Migration `000015` flipped `deployment_targets.agent_id` from `ON DELETE CASCADE` to `ON DELETE RESTRICT`, making the old hard-delete path a schema error and forcing all retirement through this contract.
### Web Dashboard
The web dashboard is the primary operational interface for certctl. It is built with Vite + React + TypeScript and uses TanStack Query for server state management (caching, background refetching, optimistic updates).
@@ -275,6 +285,9 @@ erDiagram
text channel
text recipient
text status
int retry_count
timestamptz next_retry_at
text last_error
}
certificate_profiles {
text id PK
@@ -463,7 +476,7 @@ sequenceDiagram
API-->>U: 200 OK
```
The revocation is recorded in the `certificate_revocations` table (separate from the certificate status update) for CRL generation. The DER-encoded CRL at `GET /api/v1/crl/{issuer_id}` is generated on-demand by querying this table and signing with the issuing CA's key. The OCSP responder at `GET /api/v1/ocsp/{issuer_id}/{serial}` checks both the certificate status and the revocations table to return signed good/revoked/unknown responses.
The revocation is recorded in the `certificate_revocations` table (separate from the certificate status update) for CRL generation. The DER-encoded CRL at `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5, RFC 8615) is generated on-demand by querying this table and signing with the issuing CA's key. The OCSP responder at `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960) checks both the certificate status and the revocations table to return signed good/revoked/unknown responses. Both endpoints are served unauthenticated — relying parties (TLS clients, hardware appliances, browsers) must be able to reach them without a certctl API key — and carry the IANA-registered media types `application/pkix-crl` and `application/ocsp-response` respectively.
Short-lived certificates (those with profile TTL < 1 hour) return "good" from OCSP and are excluded from CRL — their rapid expiry is treated as sufficient revocation.
@@ -473,40 +486,55 @@ For compliance events requiring fleet-wide revocation (key compromise, CA distru
### 4. Automatic Renewal
The control plane runs a scheduler with seven background loops:
The control plane runs a scheduler with 8 always-on loops plus up to 4 optional loops (enabled by configuration). `internal/scheduler/scheduler.go:262-265` is the authoritative count.
```mermaid
flowchart LR
subgraph "Scheduler (Background Goroutines)"
R["Renewal Checker\n⏱ every 1h"]
J["Job Processor\n⏱ every 30s"]
JR["Job Retry\n⏱ every 5m"]
JT["Job Timeout\n⏱ every 10m"]
H["Agent Health\n⏱ every 2m"]
N["Notification Processor\n⏱ every 1m"]
NR["Notification Retry\n⏱ every 2m"]
SL["Short-Lived Expiry\n⏱ every 30s"]
NS["Network Scanner\n⏱ every 6h"]
DG["Certificate Digest\n⏱ every 24h"]
HC["Endpoint Health\n⏱ every 60s"]
CD["Cloud Discovery\n⏱ every 6h"]
end
R -->|"Find expiring certs\nCreate renewal jobs"| DB[("PostgreSQL")]
J -->|"Process pending jobs\nCoordinate issuance"| DB
JR -->|"Retry Failed jobs\nFailed→Pending"| DB
JT -->|"Reap stalled AwaitingCSR / AwaitingApproval jobs"| DB
H -->|"Check heartbeat staleness\nMark agents offline"| DB
N -->|"Send pending notifications\nEmail / Webhook / Slack"| DB
NR -->|"Retry failed notifications\n2^n-min backoff, DLQ after 5 attempts"| DB
SL -->|"Expire short-lived certs\nMark as Expired"| DB
NS -->|"Probe TLS endpoints\nStore discovered certs"| DB
DG -->|"Generate & send HTML digest\nEmail to recipients"| DB
HC -->|"Probe deployed TLS endpoints\nState machine + mismatch"| DB
CD -->|"AWS SM / Azure KV / GCP SM\nFeed discovery pipeline"| DB
```
| Loop | Interval | Timeout | Purpose |
|------|----------|---------|---------|
| Renewal checker | 1 hour | 5 minutes | Finds certificates approaching expiry, creates renewal jobs |
| Job processor | 30 seconds | 2 minutes | Processes pending jobs (issuance, renewal, deployment) |
| Agent health check | 2 minutes | 1 minute | Marks agents as offline if heartbeat is stale |
| Notification processor | 1 minute | 1 minute | Sends pending notifications via configured channels |
| Short-lived expiry | 30 seconds | 30 seconds | Marks expired short-lived certificates (profile TTL < 1 hour) |
| Network scanner | 6 hours | 30 minutes | Probes TLS endpoints on configured CIDR ranges, stores discovered certs (M21, opt-in via `CERTCTL_NETWORK_SCAN_ENABLED`). CIDR size validated at API level — max /20 (4096 IPs) per range. |
| Certificate digest | 24 hours | 5 minutes | Generates HTML email with certificate stats, expiration timeline, job health, agent count. Does NOT run on startup — waits for first scheduled tick. Configurable interval and recipients via `CERTCTL_DIGEST_INTERVAL` and `CERTCTL_DIGEST_RECIPIENTS`. Falls back to certificate owner emails if no explicit recipients configured. |
| Loop | Interval | Always-on? | Purpose |
|------|----------|------------|---------|
| Renewal checker | 1 hour | Yes | Finds certificates approaching expiry (threshold-based or ARI-directed), creates renewal jobs |
| Job processor | 30 seconds | Yes | Processes pending jobs (issuance, renewal, deployment) |
| Job retry | 5 minutes (`CERTCTL_SCHEDULER_RETRY_INTERVAL`) | Yes | Transitions `Failed` jobs back to `Pending` for re-dispatch (I-001) |
| Job timeout | 10 minutes (`CERTCTL_JOB_TIMEOUT_INTERVAL`) | Yes | Reaps `AwaitingCSR` jobs older than 24h and `AwaitingApproval` jobs older than 7d to `Failed`, feeding the retry loop (I-003) |
| Agent health check | 2 minutes | Yes | Marks agents as offline if heartbeat is stale |
| Notification processor | 1 minute | Yes | Sends pending notifications via configured channels |
| Notification retry | 2 minutes (`CERTCTL_NOTIFICATION_RETRY_INTERVAL`) | Yes | Re-dispatches `Failed` notifications whose `next_retry_at` has elapsed; exponential backoff (2^n minutes, capped at 1h), 5-attempt budget, terminal `dead` status after exhaustion (I-005) |
| Short-lived expiry | 30 seconds | Yes | Marks expired short-lived certificates (profile TTL < 1 hour) |
| Network scanner | 6 hours | Opt-in (`CERTCTL_NETWORK_SCAN_ENABLED`) | Probes TLS endpoints on configured CIDR ranges, stores discovered certs (M21). CIDR size validated at API level — max /20 (4096 IPs) per range. |
| Certificate digest | 24 hours (`CERTCTL_DIGEST_INTERVAL`) | Opt-in (digest service) | Generates HTML email with certificate stats, expiration timeline, job health, agent count. Does NOT run on startup — waits for first scheduled tick. Falls back to certificate owner emails if no explicit recipients configured. |
| Endpoint health | 60 seconds (`CERTCTL_HEALTH_CHECK_INTERVAL`) | Opt-in (health check service) | Probes deployed TLS endpoints, drives the healthy/degraded/down/cert_mismatch state machine (M48) |
| Cloud discovery | 6 hours | Opt-in (at least one cloud source configured) | Walks AWS Secrets Manager / Azure Key Vault / GCP Secret Manager, feeds discovery pipeline (M50) |
Each loop uses `sync/atomic.Bool` idempotency guards to prevent concurrent tick execution — if a loop iteration is still running when the next tick fires, the tick is skipped with a warning log. All loops (including short-lived expiry check) run immediately on startup before entering their ticker interval, ensuring no gap between scheduler start and first execution. The certificate digest loop is the exception — it does NOT run on startup, only on scheduled ticks. Graceful shutdown uses `sync.WaitGroup` with `WaitForCompletion()` to drain all in-flight work before process exit.
Each loop uses `sync/atomic.Bool` idempotency guards to prevent concurrent tick execution — if a loop iteration is still running when the next tick fires, the tick is skipped with a warning log. Most loops (including short-lived expiry, job retry, job timeout, and notification retry) run immediately on startup before entering their ticker interval, ensuring no gap between scheduler start and first execution. The certificate digest loop is the exception — it does NOT run on startup, only on scheduled ticks. Graceful shutdown uses `sync.WaitGroup` with `WaitForCompletion()` to drain all in-flight work before process exit.
Each operation has a context timeout to prevent indefinite hangs if external services become unresponsive.
@@ -648,6 +676,16 @@ Built-in notifiers: **Email** (SMTP), **Webhook** (HTTP POST), **Slack** (incomi
See the [Connector Development Guide](connectors.md) for details on building custom connectors.
### Notification Retry & Dead-Letter Queue
A transient notifier failure (SMTP timeout, 5xx webhook response, Slack rate-limit) must not silently drop a critical alert. Migration `000016_notification_retry` adds three columns to `notification_events``retry_count INTEGER NOT NULL DEFAULT 0`, `next_retry_at TIMESTAMPTZ` (nullable — only meaningful while a row is in `failed` state), and `last_error TEXT` (the most recent transient error, preserved for operator triage) — together with a partial index `idx_notification_events_retry_sweep ON notification_events(next_retry_at) WHERE status = 'failed' AND next_retry_at IS NOT NULL` so the retry hot path scales with the retry-eligible slice rather than the full notification history.
The scheduler's notification-retry loop (see the scheduler section above) calls `NotificationService.RetryFailedNotifications(ctx)` every `CERTCTL_NOTIFICATION_RETRY_INTERVAL` (default `2m`). Each tick pulls up to 1000 rows via `notifRepo.ListRetryEligible(ctx, now, maxAttempts, sweepLimit)` — a partial-index-driven query that filters on `status='failed' AND next_retry_at <= now() AND retry_count < 5` — and redispatches them through the same notifier registry used by `ProcessPendingNotifications`. A successful redispatch transitions the row directly to `sent` without incrementing `retry_count`, so the audit trail preserves "delivered on attempt N". A failed redispatch re-arms `next_retry_at` using exponential backoff — `wait = min(2^retry_count minutes, 1h)` — bumps `retry_count`, and stamps `last_error`. When `retry_count >= 4` (the fifth attempt has just failed) the row is promoted to the terminal `dead` status via `notifRepo.MarkAsDead`, which clears `next_retry_at` so the partial retry-sweep index stops matching and the row cannot be re-entered into the retry rotation without operator action.
`NotificationService.RequeueNotification(ctx, id)` is the operator-driven escape hatch from `dead`. It atomically resets `retry_count → 0`, `next_retry_at → NULL`, `last_error → NULL`, and `status → pending`, handing the row back to `ProcessPendingNotifications` on the next 1m tick. This is the correct response to "the notifier outage is resolved, redeliver the queue"; it is not a retry, which is why the retry counter is reset rather than incremented.
The dead-letter depth is surfaced in two places. First, `DashboardSummary.NotificationsDead` is populated by `StatsService.GetDashboardSummary` via `notifRepo.CountByStatus(ctx, "dead")`. The injection uses a `SetNotifRepo` setter pattern (mirroring `CertificateService.SetTargetRepo`) rather than a new positional argument to `NewStatsService`, which keeps all nine existing `NewStatsService` call sites (main.go plus eight digest tests and stats_test.go) signature-stable — when the notification repository has not been wired in, `NotificationsDead` falls through to zero. Second, the `/api/v1/metrics/prometheus` endpoint emits `certctl_notification_dead_total` as a counter (operator alert thresholds per the I-005 spec: `> 0` warning, `> 10` critical) using the same `DashboardSummary` snapshot so the dashboard card and the Prometheus counter cannot skew. The web dashboard exposes a two-tab toolbar on `/notifications` — "All" (the pre-I-005 inbox) and "Dead letter" (threads `?status=dead` into the list query, surfaces `Retry N/5` and the truncated `last_error` with a full-text tooltip per row, and binds a Requeue button to `POST /api/v1/notifications/{id}/requeue`).
### EST Server (RFC 7030)
The EST (Enrollment over Secure Transport) server provides an industry-standard enrollment interface for devices that need certificates without using the REST API. It runs under `/.well-known/est/` per RFC 7030 and supports four operations: CA certificate distribution (`/cacerts`), initial enrollment (`/simpleenroll`), re-enrollment (`/simplereenroll`), and CSR attributes (`/csrattrs`).
@@ -685,6 +723,8 @@ type ESTService interface {
**Issuer connector extension:** EST required adding `GetCACertPEM(ctx) (string, error)` to the issuer connector interface so the `/cacerts` endpoint can serve the CA chain. The Local CA returns its CA certificate PEM; Vault PKI fetches via `GET /v1/{mount}/ca/pem`; Google CAS fetches via API; AWS ACM PCA retrieves via `GetCertificateAuthorityCertificate`. ACME, step-ca, OpenSSL, DigiCert, and Sectigo connectors return errors (they don't expose a static CA chain — their chains are per-issuance).
**Authentication:** EST endpoints are served unauthenticated at the HTTP layer under `/.well-known/est/*` — no Bearer token required. Per RFC 7030 §3.2.3 EST authentication is deployment-specific, and per §4.1.1 `/cacerts` is explicitly anonymous. certctl enforces authentication via CSR signature verification inside `ESTService.SimpleEnroll`/`SimpleReEnroll` plus profile policy gates (allowed key algorithms, minimum key size, permitted SANs, permitted EKUs, MaxTTL). The HTTP dispatch is implemented in `cmd/server/main.go:buildFinalHandler`, which routes `/.well-known/est/*` through `noAuthHandler` (RequestID + structuredLogger + Recovery only). Operators who need stronger client identification should terminate mTLS at an upstream reverse proxy and pin the CSR's SAN to the client cert subject at the profile level.
**Audit:** Every EST enrollment is recorded in the audit trail with `protocol: "EST"`, the CN, SANs, issuer ID, serial number, and optional profile ID.
### SCEP Server (RFC 8894)
@@ -711,7 +751,7 @@ Signed certificate returned as PKCS#7 certs-only
**Wire format:** SCEP clients wrap CSRs in PKCS#7 SignedData envelopes. The handler parses the outer ASN.1 ContentInfo → SignedData → EncapsulatedContentInfo to extract the CSR bytes. Fallback paths handle base64-encoded PKCS#7 and raw CSR submissions (for simpler clients). Responses use PKCS#7 certs-only via the shared `internal/pkcs7` package (same as EST). Single certs are returned as raw DER for `GetCACert`, chains as PKCS#7.
**Authentication:** SCEP uses challenge passwords embedded in CSR attributes (OID 1.2.840.113549.1.9.7) rather than TLS client certificates. The server validates the challenge password against `CERTCTL_SCEP_CHALLENGE_PASSWORD`. When no challenge password is configured, any value is accepted.
**Authentication:** SCEP endpoints at `/scep` and `/scep/*` are served unauthenticated at the HTTP layer — no Bearer token required — per RFC 8894 §3.2, which defines authentication via the `challengePassword` attribute (OID 1.2.840.113549.1.9.7) embedded in the PKCS#10 CSR rather than an HTTP credential. The HTTP dispatch is implemented in `cmd/server/main.go:buildFinalHandler`, which routes `/scep` and `/scep/*` through `noAuthHandler` (RequestID + structuredLogger + Recovery only). The `challengePassword` is mandatory: `preflightSCEPChallengePassword` at startup refuses to boot the control plane when `CERTCTL_SCEP_ENABLED=true` is set without `CERTCTL_SCEP_CHALLENGE_PASSWORD`, closing CWE-306 (missing authentication for a critical function). `SCEPService.PKCSReq` enforces the same invariant defense-in-depth — an empty `s.challengePassword` rejects every enrollment — and the password comparison uses `crypto/subtle.ConstantTimeCompare` to prevent response-time side-channel leakage. The startup log line `SCEP server enabled` emits a `challenge_password_set` boolean for operator visibility.
**Interface:** The `SCEPHandler` defines an `SCEPService` interface (dependency inversion):
@@ -768,10 +808,11 @@ The control plane only handles public material: certificates, chains, and CSRs.
### Authentication
- **API clients → Server**: API key in `Authorization: Bearer` header, or `none` for demo mode
- **API clients → Server**: API key in `Authorization: Bearer` header, or `none` for demo mode. Applies to every path under `/api/v1/*`.
- **Agent → Server**: API key registered at agent creation, included in all requests
- **Server → Issuers**: ACME account key, or connector-specific credentials
- **Agent → Targets**: API tokens, WinRM credentials (stored locally on agent or proxy agent — never on server). Credential scope is limited to the agent's network zone.
- **Standards-based enrollment and PKI distribution endpoints**: `/.well-known/est/*` (RFC 7030), `/scep` and `/scep/*` (RFC 8894), and `/.well-known/pki/crl/{issuer_id}` + `/.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 5280 §5 / RFC 6960 / RFC 8615) are served unauthenticated at the HTTP layer. These protocols carry their own authentication semantics — CSR signature + profile policy for EST (§3.2.3 says EST auth is deployment-specific; §4.1.1 makes `/cacerts` explicitly anonymous), `challengePassword` in CSR attributes for SCEP (§3.2), and relying-party accessibility for CRL/OCSP — and cannot present certctl Bearer tokens. The dispatch is implemented in `cmd/server/main.go:buildFinalHandler`, which routes these prefixes through `noAuthHandler` (RequestID + structuredLogger + Recovery only, no auth or rate-limit middleware). CWE-306 is closed for SCEP by `preflightSCEPChallengePassword`, which refuses to start the server when SCEP is enabled without `CERTCTL_SCEP_CHALLENGE_PASSWORD`. The 27-subtest regression harness `cmd/server/finalhandler_test.go` pins this dispatch surface (EST 4-endpoint, SCEP exact + trailing-slash + query-string, PKI CRL+OCSP, health probes, `/api/v1/*` authenticated, `/assets/*` file server, SPA fallback).
### Audit Trail
@@ -855,7 +896,7 @@ The HTTP middleware stack processes requests in the following order (see `cmd/se
### Concurrency Safety
The background scheduler uses `sync/atomic.Bool` idempotency guards on all 7 loops — if a tick fires while the previous iteration is still running, it skips. A `sync.WaitGroup` tracks all in-flight goroutines. `WaitForCompletion(timeout)` blocks during shutdown until all work finishes or the timeout expires, preventing state corruption from mid-flight database operations during process exit.
The background scheduler uses `sync/atomic.Bool` idempotency guards on every loop (8 always-on plus up to 4 optional) — if a tick fires while the previous iteration is still running, it skips. A `sync.WaitGroup` tracks all in-flight goroutines. `WaitForCompletion(timeout)` blocks during shutdown until all work finishes or the timeout expires, preventing state corruption from mid-flight database operations during process exit.
### Logging
@@ -889,7 +930,7 @@ Jobs support additional action endpoints: `POST /api/v1/jobs/{id}/cancel`, `POST
- **Additional filters**: `?agent_id=`, `?profile_id=` (in addition to existing status, environment, owner_id, team_id, issuer_id).
- **Deployments**: `GET /api/v1/certificates/{id}/deployments` returns deployment targets for a certificate.
Certificate revocation: `POST /api/v1/certificates/{id}/revoke` with optional `{"reason": "keyCompromise"}`. Supports RFC 5280 reason codes (unspecified, keyCompromise, caCompromise, affiliationChanged, superseded, cessationOfOperation, certificateHold, privilegeWithdrawn). Returns the updated certificate status. Best-effort issuer notification — the revocation succeeds even if the issuer connector is unavailable. A JSON-formatted CRL is available at `GET /api/v1/crl`, and a DER-encoded X.509 CRL signed by the issuing CA at `GET /api/v1/crl/{issuer_id}`. An embedded OCSP responder serves signed responses at `GET /api/v1/ocsp/{issuer_id}/{serial}`. Short-lived certificates (profile TTL < 1 hour) are exempt from CRL/OCSP — expiry is sufficient revocation.
Certificate revocation: `POST /api/v1/certificates/{id}/revoke` with optional `{"reason": "keyCompromise"}`. Supports RFC 5280 reason codes (unspecified, keyCompromise, caCompromise, affiliationChanged, superseded, cessationOfOperation, certificateHold, privilegeWithdrawn). Returns the updated certificate status. Best-effort issuer notification — the revocation succeeds even if the issuer connector is unavailable. The DER-encoded X.509 CRL signed by the issuing CA is served unauthenticated at `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5 + RFC 8615, `Content-Type: application/pkix-crl`). The embedded OCSP responder serves signed responses unauthenticated at `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960, `Content-Type: application/ocsp-response`). Both endpoints are accessible to relying parties with no certctl API credentials, as RFC-compliant PKI consumers expect. Short-lived certificates (profile TTL < 1 hour) are exempt from CRL/OCSP — expiry is sufficient revocation.
Certificate export (M27): `GET /api/v1/certificates/{id}/export/pem` returns PEM-encoded certificate and chain, and `POST /api/v1/certificates/{id}/export/pkcs12` returns a PKCS#12 bundle (binary). Private keys are never exported — they remain on agents. All exports are audited with actor, timestamp, and format.
@@ -1051,7 +1092,7 @@ flowchart TB
1. **Pluggable sources** — Each cloud provider implements the `DiscoverySource` interface (Name, Type, Discover, ValidateConfig). Three built-in sources: AWS Secrets Manager, Azure Key Vault, GCP Secret Manager
2. **CloudDiscoveryService orchestrator** — Iterates registered sources, calls `Discover()` on each, feeds reports into `ProcessDiscoveryReport()`. Errors from one source don't prevent other sources from running
3. **Scheduler integration**9th scheduler loop (6h default), runs immediately on startup, `atomic.Bool` idempotency guard
3. **Scheduler integration**opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` 12-loop topology), runs immediately on startup, `atomic.Bool` idempotency guard
4. **Sentinel agents** — Each source uses its own sentinel agent ID (`cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) for dedup and triage filtering
5. **Source path format**`aws-sm://{region}/{secret}`, `azure-kv://{cert-name}/{version}`, `gcp-sm://{project}/{secret}`
6. **No new schema** — Reuses existing `discovered_certificates` and `discovery_scans` tables. Sentinel agent IDs leverage existing `(fingerprint_sha256, agent_id, source_path)` dedup constraint
@@ -1073,7 +1114,7 @@ This data flow is pull-based and non-blocking. Agents discover at their own pace
Beyond one-time discovery, certctl continuously monitors TLS endpoints for certificate health using a shared TLS probing package and a state-machine-driven health check service. Endpoints transition between states (Healthy → Degraded → Down) based on consecutive failures, and `cert_mismatch` status alerts when a deployed certificate is unexpectedly replaced.
**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated 8th scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
**Architecture:** Probing is extracted into a shared `internal/tlsprobe/` package used by both the network scanner (M21) and the health monitor. The `HealthCheckService` manages 8 API endpoints for CRUD operations and state transitions. A dedicated opt-in endpoint health scheduler loop runs every 60 seconds (configurable via `CERTCTL_HEALTH_CHECK_INTERVAL`). Individual health check targets have their own check intervals (default 300 seconds) — the scheduler queries only endpoints due for check via `ListDueForCheck()`. Results are stored with historical tracking for 30 days (configurable via `CERTCTL_HEALTH_CHECK_HISTORY_RETENTION`). State transitions trigger notifications (critical for down endpoints, warning for degraded, high for cert_mismatch).
**State Machine:** Healthy → Degraded (configurable threshold, default 2 consecutive failures) → Down (default 5 failures). The `cert_mismatch` status is special — it fires whenever the observed certificate fingerprint differs from the expected (deployed) fingerprint, catching silent rollbacks and unauthorized cert replacements. Recovery from degraded/down transitions back to healthy and resets the failure counter.
+3 -2
View File
@@ -39,7 +39,7 @@ Deploy certctl control plane once (Docker Compose, Kubernetes Helm chart, or sel
```bash
cd /opt/certctl
docker compose up -d
# Dashboard & API: http://localhost:8443
# Dashboard & API: https://localhost:8443 (self-signed cert — pin with --cacert ./deploy/test/certs/ca.crt)
```
**Option B: Kubernetes** (recommended for prod)
@@ -59,7 +59,8 @@ chmod +x /usr/local/bin/certctl-agent
# Config
sudo tee /etc/certctl/agent.env > /dev/null <<EOF
CERTCTL_SERVER_URL=http://certctl-control-plane:8443
CERTCTL_SERVER_URL=https://certctl-control-plane:8443
CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/tls/ca.crt
CERTCTL_API_KEY=your-api-key
CERTCTL_DISCOVERY_DIRS=/etc/nginx/certs,/etc/ssl,/etc/letsencrypt/live
CERTCTL_KEY_DIR=/var/lib/certctl/keys
+6 -4
View File
@@ -210,15 +210,17 @@ NIST SP 800-57 Part 1 Section 6.2 addresses secure key distribution to minimize
- Proxy agent executes deployment via appliance API
**Revocation Distribution**
- Certificate Revocation List (CRL) via `GET /api/v1/crl/{issuer_id}`
- Returns DER-encoded X.509 CRL signed by issuing CA
- Certificate Revocation List (CRL) via `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5, RFC 8615)
- Returns DER-encoded X.509 CRL signed by issuing CA (`Content-Type: application/pkix-crl`)
- 24-hour validity period
- Includes all revoked serials, reasons, and revocation timestamps
- Served unauthenticated so relying parties without certctl API credentials can fetch it
- Subject to URL caching; OCSP preferred for real-time revocation
- OCSP via `GET /api/v1/ocsp/{issuer_id}/{serial}`
- Returns DER-encoded OCSP response (OCSPResponse ASN.1 structure)
- OCSP via `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960)
- Returns DER-encoded OCSP response (OCSPResponse ASN.1 structure, `Content-Type: application/ocsp-response`)
- Signed by issuing CA (or delegated OCSP signing cert)
- Responds with good/revoked/unknown status
- Served unauthenticated — the RFC 6960 relying-party model does not assume API credentials
- Real-time, more bandwidth-efficient than CRL polling
## Revocation and Compromise (NIST SP 800-57 Part 3)
+16 -14
View File
@@ -92,10 +92,10 @@ Your QSA will request evidence that your certificate and key management systems
- **Certificate Status Tracking** — Four statuses: Active (deployed, not yet expired), Expiring (within threshold, awaiting renewal), Expired (past not-after date), Revoked (revoked via RFC 5280 revocation API). Dashboard charts show status distribution.
- **Revocation Infrastructure** (M15a, M15b):
- **Revocation Infrastructure** (M15a, M15b, M-006):
- Revocation API: `POST /api/v1/certificates/{id}/revoke` with RFC 5280 reason codes
- CRL endpoint: `GET /api/v1/crl` (JSON format) or `GET /api/v1/crl/{issuer_id}` (DER X.509 CRL, 24h validity, signed by issuing CA)
- OCSP responder: `GET /api/v1/ocsp/{issuer_id}/{serial}` (returns DER-encoded OCSP response: good/revoked/unknown)
- CRL endpoint: `GET /.well-known/pki/crl/{issuer_id}` DER X.509 CRL, 24h validity, signed by issuing CA, served unauthenticated (RFC 5280 §5, RFC 8615, `Content-Type: application/pkix-crl`)
- OCSP responder: `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` DER-encoded OCSP response (good/revoked/unknown), served unauthenticated (RFC 6960, `Content-Type: application/ocsp-response`)
- Bulk revocation (V2.2): `POST /api/v1/certificates/bulk-revoke` with filter criteria (profile, owner, agent, issuer) for fleet-wide incident response
- Short-lived cert exemption: certs with TTL < 1 hour skip CRL/OCSP (expiry is sufficient revocation)
@@ -109,7 +109,7 @@ Your QSA will request evidence that your certificate and key management systems
- Discovered certificate report: `GET /api/v1/discovered-certificates` JSON export showing all certs on systems, fingerprints, and status.
- Managed certificate inventory: `GET /api/v1/certificates` with filters (`?status=Expiring` for upcoming renewals).
- Expiration alert configuration: policy JSON showing `alert_thresholds_days` for each environment.
- CRL/OCSP availability proof: HTTP GET requests to `/api/v1/crl` and `/api/v1/ocsp/{issuer}/{serial}` with signed responses.
- CRL/OCSP availability proof: unauthenticated HTTP GET requests to `/.well-known/pki/crl/{issuer_id}` (DER, `application/pkix-crl`) and `/.well-known/pki/ocsp/{issuer_id}/{serial}` (DER, `application/ocsp-response`) with signed responses.
- Audit trail for certificate creation/renewal/revocation: `GET /api/v1/audit?type=certificate_issued,certificate_renewed,certificate_revoked`.
- Dashboard charts showing expiration timeline, renewal success trends, status distribution.
@@ -328,9 +328,10 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
- Issuer notified (best-effort; ACME lacks standard revocation, Local CA skips issuer step).
- Revocation notifications sent to owner via email/webhook/Slack/Teams/PagerDuty.
- **CRL and OCSP Publication** (M15b) — Revoked certificates published in:
- CRL: `GET /api/v1/crl` (JSON format) or `GET /api/v1/crl/{issuer_id}` (DER X.509, signed by CA, 24h validity)
- OCSP: `GET /api/v1/ocsp/{issuer_id}/{serial}` (returns revoked status for clients validating certificate chain)
- **CRL and OCSP Publication** (M15b, M-006) — Revoked certificates published in:
- CRL: `GET /.well-known/pki/crl/{issuer_id}` (DER X.509 signed by CA, 24h validity, RFC 5280 §5 + RFC 8615, `Content-Type: application/pkix-crl`)
- OCSP: `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (returns revoked status for clients validating certificate chain, RFC 6960, `Content-Type: application/ocsp-response`)
- Both endpoints are served unauthenticated so relying parties (browsers, TLS appliances) without certctl API keys can verify revocation — this is the RFC-compliant PKI model.
- Clients checking certificate status via OCSP or CRL see revoked status within 24 hours.
- **Bulk Revocation for Incident Response** (V2.2) — `POST /api/v1/certificates/bulk-revoke` with filter criteria (profile, owner, agent, issuer) revokes all matching certificates in a single operation. PCI-DSS Req 4 requires rapid response to data transmission security incidents — bulk revocation enables operators to revoke an entire certificate set (e.g., all certs used by a compromised team or endpoint) in minutes rather than hours.
@@ -342,8 +343,8 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
**Evidence You Can Provide**:
- Revocation requests: `GET /api/v1/audit?type=certificate_revoked` with RFC 5280 reason codes.
- CRL publication: HTTP GET `/api/v1/crl` and parse JSON to show revoked serial numbers and timestamps.
- OCSP responder validation: Query `GET /api/v1/ocsp/{issuer}/{serial}` for a known-revoked cert; response includes `revoked` status.
- CRL publication: HTTP GET `/.well-known/pki/crl/{issuer_id}` (unauthenticated) returns a DER X.509 CRL — parse with `openssl crl -inform der -noout -text` to show revoked serial numbers, reasons, and timestamps.
- OCSP responder validation: Query `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (unauthenticated) for a known-revoked cert; response includes `revoked` status and can be parsed with `openssl ocsp` tooling.
- Audit trail: Certificate status transitions (Active → Revoked) recorded in `audit_events`.
**Operator Responsibility**:
@@ -386,12 +387,12 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
- API key transmitted in Authorization header (not URL parameter, not cookie).
- Browser to server: TLS.
- Agent to server: TLS.
- No credential logging (API key hash only, never plaintext).
- No credential logging (audit records the per-key actor `Name`, never the Bearer token; logs redact the `Authorization` header).
**Evidence You Can Provide**:
- API configuration: `CERTCTL_AUTH_TYPE=api-key` in deployment manifest.
- Database schema: `api_keys` table showing SHA-256 hash column, not plaintext.
- API audit log: `GET /api/v1/audit?action=api_call` showing Bearer token validation (no plaintext keys logged).
- Key inventory: `CERTCTL_API_KEYS_NAMED` env var (format `name:key:admin,...`) — seeds the in-memory `NamedAPIKey{Name, Key, Admin}` struct at `internal/api/middleware/middleware.go:29`. Keys are constant-time-compared (`subtle.ConstantTimeCompare`) against the Bearer token. No database table stores them; protect the env var contents at rest via a secrets manager (Vault / AWS Secrets Manager / Kubernetes Secrets / Docker Secrets).
- API audit log: `GET /api/v1/audit?action=api_call` showing per-key actor names (`Name` field of matched `NamedAPIKey`) on every call, with zero plaintext or hashed key material recorded.
- TLS certificate on control plane: `openssl s_client -connect {server}:8443` showing valid certificate, TLS 1.2+, strong cipher.
- GUI login flow: browser network tab showing Authorization header (token value redacted in compliance report).
@@ -561,6 +562,7 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
- **Alert Notifications** (M3, M16a) — Configurable escalation:
- Email alerts: certificate approaching expiration, renewal failure, revocation notification.
- Webhook: custom HTTP POST to your monitoring system (Slack, Teams, PagerDuty, OpsGenie, custom webhook).
- **Retry & Dead-Letter Queue** (I-005) — Transient notifier failures (SMTP timeout, webhook 5xx) are retried with exponential backoff (`2^n` minutes capped at 1h, 5-attempt budget) before landing in the terminal `dead` status. Operators monitor DLQ depth via the `certctl_notification_dead_total` Prometheus counter and requeue via the Notifications page Dead letter tab once the underlying outage is resolved. Closes the pre-I-005 silent-drop gap where a single 5xx could lose a compliance-relevant alert without evidence.
- Deduplication: one alert per threshold/certificate per day (avoid alert fatigue).
- **Audit Trail Filtering and Export** (M13) — Compliance reporting:
@@ -721,12 +723,12 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
| PCI-DSS Requirement | certctl Feature | API/UI Evidence | Database/Config | Audit Trail | Status |
|---|---|---|---|---|---|
| **4.2.1** Strong Crypto | TLS cert issuance, ACME/step-ca/Local CA, RSA 2048+/ECDSA P-256 | `GET /api/v1/certificates` (key_type, key_size) | Certificate profiles | `GET /api/v1/audit?type=certificate_issued` | Available |
| **4.2.2** Cert Inventory & Validation | Managed cert CRUD, discovery (M18b), expiration alerting, CRL/OCSP | `GET /api/v1/certificates`, `GET /api/v1/discovered-certificates`, `GET /api/v1/crl`, `GET /api/v1/ocsp/{issuer}/{serial}` | `managed_certificates`, `discovered_certificates` tables | `GET /api/v1/audit?type=certificate_*` | Available |
| **4.2.2** Cert Inventory & Validation | Managed cert CRUD, discovery (M18b), expiration alerting, CRL/OCSP | `GET /api/v1/certificates`, `GET /api/v1/discovered-certificates`, `GET /.well-known/pki/crl/{issuer_id}`, `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (both unauthenticated, RFC 5280 / RFC 6960) | `managed_certificates`, `discovered_certificates` tables | `GET /api/v1/audit?type=certificate_*` | Available |
| **3.6** Key Documentation | Profiles, owner/team tracking, issuer config, audit trail | `GET /api/v1/profiles`, `GET /api/v1/issuers`, certificate detail with owner/team | Profiles, certificate owner/team fields, issuer config | `GET /api/v1/audit?resource_type=certificate` | Available |
| **3.7.1** Key Generation | Agent-side ECDSA P-256, server keygen (demo only) | Agent logs, renewal job detail, CSR audit | `CERTCTL_KEYGEN_MODE=agent` (config), job_type=AwaitingCSR | `GET /api/v1/audit?type=certificate_issued` with CSR hash | Available |
| **3.7.2** Key Storage | Agent `/var/lib/certctl/keys` (0600), env var secrets, .env excluded | Deployment manifest (env var refs), agent key dir listing | `.env` file (git-ignored), `CERTCTL_KEY_DIR`, `CERTCTL_CA_KEY_PATH` | No API audit (keys off-platform) | Available |
| **3.7.3** Key Rotation | Auto renewal, expiration thresholds, renewal jobs | Dashboard renewal trends, `GET /api/v1/jobs?type=Renewal`, certificate versions | Renewal policies, certificate version history | `GET /api/v1/audit?type=certificate_renewed` | Available |
| **3.7.4** Key Destruction | Revocation API (RFC 5280), CRL/OCSP, private key cleanup | `POST /api/v1/certificates/{id}/revoke`, `GET /api/v1/crl`, OCSP endpoint | `certificate_revocations` table, CRL publication | `GET /api/v1/audit?type=certificate_revoked` | Available |
| **3.7.4** Key Destruction | Revocation API (RFC 5280), CRL/OCSP, private key cleanup | `POST /api/v1/certificates/{id}/revoke`, unauthenticated `GET /.well-known/pki/crl/{issuer_id}` and `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` | `certificate_revocations` table, CRL publication | `GET /api/v1/audit?type=certificate_revoked` | Available |
| **8.3** Strong Authentication | API key (SHA-256 hash, TLS), GUI login, 401 redirect | GUI login screenshot, API key auth header, TLS cert | API key hash in database | `GET /api/v1/audit` showing API calls | Available |
| **8.6** Acct Management | Credentials out of source, .env excluded, env var config | Code review (no hardcoded secrets), `.gitignore` check | Deployment manifests showing env var refs only | No account lifecycle audit (outside scope) | Available in part |
| **10.2** Audit Logging | API audit middleware (M19), certificate lifecycle events | `GET /api/v1/audit` with filter/pagination | `audit_events` table (every API call) | Real-time via API | Available |
+27 -16
View File
@@ -44,7 +44,8 @@ Each section includes:
**certctl Implementation** (V2 — Community Edition):
- **API Key Authentication** — All API calls require a Bearer token (hashed with SHA-256, stored securely, validated with constant-time comparison) or are rejected with 401 Unauthorized. Environment: `CERTCTL_AUTH_TYPE` (default `api-key`; `none` requires explicit opt-in with log warning)
- **API Key Authentication** — All `/api/v1/*` calls require a Bearer token (hashed with SHA-256, stored securely, validated with constant-time comparison) or are rejected with 401 Unauthorized. Environment: `CERTCTL_AUTH_TYPE` (default `api-key`; `none` requires explicit opt-in with log warning)
- **Standards-based enrollment and PKI distribution endpoints** — EST (`/.well-known/est/*`, RFC 7030), SCEP (`/scep`, `/scep/*`, RFC 8894), and CRL/OCSP (`/.well-known/pki/crl/{issuer_id}`, `/.well-known/pki/ocsp/{issuer_id}/{serial}`, RFC 5280 §5 / RFC 6960 / RFC 8615) are served unauthenticated at the HTTP layer because these protocols cannot present certctl Bearer tokens. Authentication is enforced in-protocol: EST relies on CSR signature verification plus profile policy (RFC 7030 §3.2.3 says EST auth is deployment-specific; §4.1.1 makes `/cacerts` explicitly anonymous); SCEP requires a shared `challengePassword` in the PKCS#10 CSR attributes (OID 1.2.840.113549.1.9.7, RFC 8894 §3.2), validated with `crypto/subtle.ConstantTimeCompare`; CRL and OCSP are intentionally anonymous for relying-party accessibility. CWE-306 (missing authentication for a critical function) is closed for SCEP by `preflightSCEPChallengePassword` in `cmd/server/main.go`, which refuses to start the control plane when `CERTCTL_SCEP_ENABLED=true` is set without `CERTCTL_SCEP_CHALLENGE_PASSWORD`. The HTTP dispatch is implemented in `cmd/server/main.go:buildFinalHandler`, which routes these prefixes through `noAuthHandler` (RequestID + structuredLogger + Recovery only, no auth or rate-limit middleware) and is pinned by the 27-subtest regression harness at `cmd/server/finalhandler_test.go`.
- **GUI Authentication** — Web dashboard includes login screen requiring API key entry. Failed auth redirects to login on 401. Auth context persists across page navigation. Logout clears session.
- **Configurable CORS** — API restricts cross-origin requests via `CERTCTL_CORS_ORIGINS` allowlist or wildcard. Preflight caching prevents chatty browser auth flows.
- **Token Bucket Rate Limiting** — Per-IP rate limiting (configurable via `CERTCTL_RATE_LIMIT_RPS` / `CERTCTL_RATE_LIMIT_BURST`) returns 429 Too Many Requests with Retry-After header. Prevents credential stuffing and brute-force attacks.
@@ -58,6 +59,11 @@ Each section includes:
- Auth info endpoint: `GET /api/v1/auth/info` (returns current auth mode, served without auth so GUI detects mode)
- Rate limiting middleware: `internal/api/middleware/rate_limit.go`
- CORS configuration: `cmd/server/main.go`, search for `CERTCTL_CORS_ORIGINS`
- Final handler dispatch (authenticated vs. unauthenticated routing): `cmd/server/main.go:buildFinalHandler`
- SCEP preflight gate (CWE-306 closure): `cmd/server/main.go:preflightSCEPChallengePassword`
- SCEP service-layer defense-in-depth (rejects enrollment on empty challenge password, `crypto/subtle.ConstantTimeCompare`): `internal/service/scep.go`
- Final handler dispatch regression harness (27 subtests): `cmd/server/finalhandler_test.go`
- OpenAPI spec `security: []` overrides on unauthenticated paths: `api/openapi.yaml` (EST `/cacerts`, `/simpleenroll`, `/simplereenroll`, `/csrattrs`; SCEP `/scep` GET+POST; PKI `/crl/{issuer_id}`, `/ocsp/{issuer_id}/{serial}`)
**V3 Enhancement**:
@@ -110,7 +116,7 @@ Each section includes:
**certctl Implementation** (V2):
- **API Key Policy** — All API access requires an API key or explicit opt-out. Opt-out (`CERTCTL_AUTH_TYPE=none`) logs a warning: "WARNING: Auth disabled (CERTCTL_AUTH_TYPE=none) — this is insecure and only for development". Configuration choice is logged at startup.
- **API Key Policy** — All `/api/v1/*` access requires an API key or explicit opt-out. Opt-out (`CERTCTL_AUTH_TYPE=none`) logs a warning: "WARNING: Auth disabled (CERTCTL_AUTH_TYPE=none) — this is insecure and only for development". Configuration choice is logged at startup. The standards-based enrollment and PKI distribution endpoints (EST, SCEP, CRL, OCSP) are served unauthenticated at the HTTP layer per their respective RFCs; see CC6.1 for the full authentication contract and CWE-306 closure via `preflightSCEPChallengePassword`.
- **Agent Authentication** — Agents authenticate to the server via API keys (same mechanism as users). Agent credentials are separate from user API keys.
- **Private Key Policy** — Agent-side key generation is the default (`CERTCTL_KEYGEN_MODE=agent`). Server-side keygen (`CERTCTL_KEYGEN_MODE=server`) requires explicit configuration and logs a warning: "server-side key generation enabled (CERTCTL_KEYGEN_MODE=server) — private keys touch control plane, demo only".
- **Password Policy** — Not applicable; certctl uses API keys exclusively. Password management is delegated to your organization's IAM system if you integrate OIDC/SSO (V3).
@@ -183,15 +189,20 @@ Each section includes:
- **Health Endpoint**`GET /health` returns 200 OK with service status. Consumed by Docker health checks and Kubernetes probes.
- **Readiness Endpoint**`GET /ready` returns 200 OK when the database is connected and migrations are applied.
- **Background Scheduler Monitoring**7 background loops run on a fixed schedule:
- Renewal loop: every 1 hour, scans for certificates approaching renewal threshold
- Job processor loop: every 30 seconds, picks up pending/waiting jobs and advances their state
- Health check loop: every 2 minutes, pings agents to detect downtime
- Notification dispatcher loop: every 1 minute, sends queued alerts
- Short-lived cert expiry loop: every 30 seconds, marks expired short-lived credentials
- Network scanner loop: every 6 hours, scans enabled TLS endpoints for certificate discovery
- Digest emailer loop: every 24 hours, sends scheduled certificate digest email to configured recipients
Each loop includes error handling and logs failures via structured slog.
- **Background Scheduler Monitoring**12 background loops (8 always-on + 4 opt-in) run on a fixed schedule. Authoritative topology in `docs/architecture.md`:
- Renewal loop (always-on, 1 hour): scans for certificates approaching renewal threshold
- Job processor loop (always-on, 30 seconds): picks up pending/waiting jobs and advances their state
- Job retry loop (always-on, 5 minutes, `CERTCTL_SCHEDULER_RETRY_INTERVAL`): retries Failed jobs (I-001)
- Job timeout reaper loop (always-on, 10 minutes, `CERTCTL_JOB_TIMEOUT_INTERVAL`): fails AwaitingCSR/AwaitingApproval jobs past timeout (I-003)
- Agent health check loop (always-on, 2 minutes): pings agents to detect downtime
- Notification dispatcher loop (always-on, 1 minute): sends queued alerts
- Notification retry loop (always-on, 2 minutes, `CERTCTL_NOTIFICATION_RETRY_INTERVAL`): exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005)
- Short-lived cert expiry loop (always-on, 30 seconds): marks expired short-lived credentials
- Network scanner loop (opt-in, 6 hours, `CERTCTL_NETWORK_SCAN_ENABLED`): scans enabled TLS endpoints for certificate discovery
- Digest emailer loop (opt-in, 24 hours, `CERTCTL_DIGEST_INTERVAL`): sends scheduled certificate digest email to configured recipients
- Endpoint health loop (opt-in, 60 seconds, `CERTCTL_HEALTH_CHECK_INTERVAL`): continuous TLS health probes (M48)
- Cloud discovery loop (opt-in, 6 hours, `CERTCTL_CLOUD_DISCOVERY_INTERVAL`): cloud secret manager certificate discovery (M50)
Each loop includes `atomic.Bool` idempotency guards, error handling, and structured slog failure logs.
- **Metrics Endpoints** — Two formats for monitoring integration:
- `GET /api/v1/metrics` — JSON object with gauges, counters, and uptime for custom dashboards
- `GET /api/v1/metrics/prometheus` — Prometheus exposition format (`text/plain; version=0.0.4`) for native scraping by Prometheus, Grafana Agent, Datadog, and other OpenMetrics-compatible collectors
@@ -282,8 +293,8 @@ Each section includes:
- `certificateHold` — temporary revocation (can be "unhold" by reissue)
- `privilegeWithdrawn` — access rights revoked
Revocation is **immediate** (no approval workflow). The certificate is marked `Revoked` in inventory, an audit event is logged, and optional issuer notification is best-effort. All revoked certs are excluded from active deployments.
- **CRL Endpoint**`GET /api/v1/crl` returns a JSON-formatted Certificate Revocation List (serial, reason, timestamp for each revoked cert). `GET /api/v1/crl/{issuer_id}` returns a DER-encoded X.509 CRL signed by the issuing CA (useful for legacy clients that don't support OCSP).
- **OCSP Responder**`GET /api/v1/ocsp/{issuer_id}/{serial}` returns a signed OCSP response indicating whether a cert is good, revoked, or unknown. Clients (browsers, TLS libraries) query this endpoint to verify cert validity in real-time.
- **CRL Endpoint**`GET /.well-known/pki/crl/{issuer_id}` returns a DER-encoded X.509 CRL signed by the issuing CA (RFC 5280 §5, RFC 8615, `Content-Type: application/pkix-crl`), served unauthenticated for relying parties that don't hold certctl API credentials.
- **OCSP Responder**`GET /.well-known/pki/ocsp/{issuer_id}/{serial}` returns a signed OCSP response indicating whether a cert is good, revoked, or unknown (RFC 6960, `Content-Type: application/ocsp-response`). Also unauthenticated. Clients (browsers, TLS libraries) query this endpoint to verify cert validity in real-time.
- **Revocation Notifications** — When a cert is revoked, notifications are sent to:
- Certificate owner (email)
- Configured webhooks (if you have a SIEM that subscribes)
@@ -453,15 +464,15 @@ Each section includes:
| | Metrics JSON Endpoint | `GET /api/v1/metrics` (gauges, counters, uptime) | ✅ | ✅ | Set thresholds, configure alerting |
| | Stats API (time-series) | `GET /api/v1/stats/*` (summary, status, expiration, jobs, issuance) | ✅ | ✅ | Integrate into dashboards, SLO tracking |
| | Structured Logging | `slog` middleware with request IDs | ✅ | ✅ | Aggregate logs to SIEM, define retention policy |
| | Background Scheduler | 7 loops (renewal 1h, jobs 30s, health 2m, notifications 1m, short-lived 30s, network scan 6h, digest 24h) | ✅ | ✅ | Alert on scheduler loop failures |
| | Background Scheduler | 12 loops (8 always-on: renewal 1h, jobs 30s, job retry 5m I-001, job timeout 10m I-003, health 2m, notifications 1m, notif retry 2m I-005, short-lived 30s; 4 opt-in: network scan 6h, digest 24h, endpoint health 60s M48, cloud discovery 6h M50) | ✅ | ✅ | Alert on scheduler loop failures |
| **CC7.2** Anomaly Detection | Immutable API Audit Trail | `internal/api/middleware/audit.go`, `GET /api/v1/audit` | ✅ | Enhanced (SIEM export) | Integrate into SIEM, search for anomalies, archive long-term |
| | Expiration Threshold Alerting | Configurable per-policy (default 30/14/7/0 days) | ✅ | ✅ | Configure thresholds, integrate notifications |
| | Status Auto-Transitions | Active → Expiring (30d) → Expired (0d) | ✅ | ✅ | Monitor status changes in audit trail |
| | Notification Routing | Email, Slack, Teams, PagerDuty, OpsGenie | ✅ | ✅ | Configure notifiers, on-call integration |
| | Deployment Rollback | Redeploy previous cert version via GUI | ✅ | ✅ | Audit rollback decisions |
| **CC7.3** Incident Response | Revocation API (RFC 5280 reasons) | `POST /api/v1/certificates/{id}/revoke` | ✅ | Enhanced (bulk revocation) | Establish incident response policy |
| | CRL Endpoint (JSON + DER) | `GET /api/v1/crl`, `GET /api/v1/crl/{issuer_id}` | ✅ | ✅ | Ensure CRL/OCSP accessible to all clients |
| | OCSP Responder | `GET /api/v1/ocsp/{issuer_id}/{serial}` | ✅ | ✅ | Test revocation in staging |
| | CRL Endpoint (DER, RFC 5280 §5) | `GET /.well-known/pki/crl/{issuer_id}` (unauthenticated, `application/pkix-crl`) | ✅ | ✅ | Ensure CRL/OCSP accessible to all clients without API keys |
| | OCSP Responder (RFC 6960) | `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (unauthenticated, `application/ocsp-response`) | ✅ | ✅ | Test revocation in staging |
| | Revocation Notifications | Email, webhook, Slack/Teams on revocation | ✅ | ✅ | Integrate into on-call, document justification separately |
| | Short-Lived Cert Exemption | TTL < 1h skip CRL/OCSP | ✅ | ✅ | Configure profiles appropriately |
| **CC7.4** Risk Mitigation | Renewal Job Tracking | Job state machine (Pending → Running → Completed/Failed) | ✅ | ✅ | Monitor renewal success rate |
+4 -2
View File
@@ -123,6 +123,8 @@ At no point does the private key leave the agent. This is a fundamental security
Agents also report **metadata** about themselves — their operating system, CPU architecture, IP address, hostname, and version — with every heartbeat. This gives ops teams fleet-wide visibility (e.g., "how many agents are running on ARM?", "which agents are still on v1.0.0?") and powers **agent groups** — dynamic device grouping where policies can be scoped to specific agent criteria like OS type, architecture, or network subnet.
**Retiring an agent.** When you decommission a server, the certctl record for its agent needs to be retired, not deleted. certctl uses a **soft-delete** model: `DELETE /api/v1/agents/{id}` stamps the row with a retired-at timestamp and a reason, instead of removing it. This is deliberate — an audit trail of "who owned this certificate, on which host, for which team" stays intact forever, and the downstream deployment_targets, certificates, and jobs keep valid foreign keys. Retired agents are filtered out of default list views and the dashboard's agent counter, but remain visible through a separate retired-agents view for compliance reconciliation. If the agent still has active deployment targets, deployed certificates, or pending jobs, retirement is blocked by default so you don't silently orphan those rows; the API responds with the exact counts so you can retire or reassign each dependency explicitly. A force-retire escape hatch (`?force=true&reason=...`) is available for true decommission scenarios — it transactionally retires the downstream targets, cancels pending jobs, and records the cascade in the audit trail with the reason you provided. Four internal sentinel agents that back the network scanner and the cloud secret-manager discovery sources cannot be retired at all, even with force, because retiring them would orphan their subsystems. Once retired, an agent that still attempts to heartbeat receives `410 Gone` — the agent process reads that as "you've been retired, shut down" and exits cleanly.
### Deployment Targets
Targets are the systems where certificates actually get installed — NGINX web servers, Apache httpd servers, HAProxy load balancers, Traefik reverse proxies, Caddy servers, Envoy gateways, Postfix/Dovecot mail servers, Microsoft IIS servers, and network appliances. Each target type has a **connector** that knows how to deploy certificates to that specific system (e.g., writing files and reloading NGINX or Apache config, building a combined PEM for HAProxy).
@@ -216,9 +218,9 @@ certctl implements revocation using three complementary mechanisms:
**Bulk Revocation** (Fleet-Level Incident Response): For large-scale incidents like CA compromise or team infrastructure decommissioning, `POST /api/v1/certificates/bulk-revoke` revokes all certificates matching filter criteria in a single operation. Filter by profile, owner, team, agent group, or issuer to target the affected certificate set. This is essential for incident response — instead of revoking certificates one-by-one, operators can revoke an entire fleet in minutes. Bulk revocation creates individual revocation jobs that reuse the existing revocation pipeline, ensuring every certificate is audited and notifications are sent.
**Certificate Revocation List (CRL)**: certctl serves both a JSON-formatted CRL at `GET /api/v1/crl` and DER-encoded X.509 CRLs per issuer at `GET /api/v1/crl/{issuer_id}`. The DER CRL is signed by the issuing CA's key and has 24-hour validity clients can download it periodically to check revocation status offline.
**Certificate Revocation List (CRL)**: certctl serves DER-encoded X.509 CRLs per issuer at `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5 wire format, RFC 8615 well-known namespace). The endpoint is unauthenticated so any relying party — browser, TLS client, hardware appliance — can fetch it without a certctl API key. The CRL is signed by the issuing CA's key and has 24-hour validity; clients can download it periodically to check revocation status offline. The response carries `Content-Type: application/pkix-crl`.
**OCSP Responder**: For real-time revocation checking, certctl includes an embedded OCSP responder at `GET /api/v1/ocsp/{issuer_id}/{serial}`. It returns signed OCSP responses (good, revoked, or unknown) so clients can verify certificate status without downloading the full CRL.
**OCSP Responder**: For real-time revocation checking, certctl includes an embedded OCSP responder at `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960). Like the CRL endpoint, it is unauthenticated and returns signed OCSP responses (good, revoked, or unknown) with `Content-Type: application/ocsp-response`, so clients can verify certificate status without downloading the full CRL.
Short-lived certificates (those assigned to profiles with TTL under 1 hour) are exempt from CRL and OCSP — their rapid expiry is considered sufficient revocation. This is a deliberate design choice to reduce infrastructure overhead for ephemeral machine-to-machine credentials.
+5 -5
View File
@@ -155,7 +155,7 @@ The Local CA issuer signs certificates using Go's `crypto/x509` library. It supp
**Sub-CA mode:** Loads a CA certificate and private key from disk (`CERTCTL_CA_CERT_PATH` + `CERTCTL_CA_KEY_PATH`). The CA cert is signed by an upstream CA (e.g., ADCS), so all issued certificates chain to the enterprise root trust hierarchy. Clients that already trust the enterprise root automatically trust certctl-issued certs. Supports RSA, ECDSA, and PKCS#8 key formats. If the paths are not set, falls back to self-signed mode. The loaded certificate must have `IsCA=true` and `KeyUsageCertSign`.
**CRL and OCSP support (M15b):** The Local CA supports DER-encoded X.509 CRL generation via `GET /api/v1/crl/{issuer_id}` with 24-hour validity. An embedded OCSP responder at `GET /api/v1/ocsp/{issuer_id}/{serial}` returns signed OCSP responses for issued certificates (good/revoked/unknown status). Certificates with profile TTL < 1 hour automatically skip CRL/OCSP — expiry is treated as sufficient revocation for short-lived credentials.
**CRL and OCSP support (M15b):** The Local CA supports DER-encoded X.509 CRL generation served unauthenticated at `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5, RFC 8615, `Content-Type: application/pkix-crl`) with 24-hour validity. An embedded OCSP responder at `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960, `Content-Type: application/ocsp-response`) returns signed OCSP responses for issued certificates (good/revoked/unknown status). Both endpoints are reachable by relying parties with no certctl API credentials, which is how standard TLS clients, browsers, and hardware appliances consume these resources. Certificates with profile TTL < 1 hour automatically skip CRL/OCSP — expiry is treated as sufficient revocation for short-lived credentials.
**Extended Key Usage (EKU) support (M27):** The Local CA respects EKU constraints from certificate profiles and adjusts key usage flags accordingly. For S/MIME certificates (emailProtection EKU), it uses `DigitalSignature | ContentCommitment` instead of the TLS default. For TLS certificates (serverAuth/clientAuth EKU), it uses `DigitalSignature | KeyEncipherment`. This enables support for multiple certificate types — TLS, S/MIME, code signing, timestamping — from a single CA.
@@ -287,7 +287,7 @@ Environment variables:
The connector is registered in the issuer registry under `iss-stepca`. step-ca also works with the existing ACME connector (point `iss-acme-*` at step-ca's ACME directory URL for ACME-based issuance).
**Note:** step-ca-issued certificates rely on step-ca's own CRL/OCSP infrastructure. certctl's local CRL/OCSP endpoints (`GET /api/v1/crl/{issuer_id}` and `GET /api/v1/ocsp/{issuer_id}/{serial}`) are populated from step-ca's revocation data if available, but clients should validate against step-ca's endpoints for the authoritative status.
**Note:** step-ca-issued certificates rely on step-ca's own CRL/OCSP infrastructure. certctl's local CRL/OCSP endpoints (`GET /.well-known/pki/crl/{issuer_id}` and `GET /.well-known/pki/ocsp/{issuer_id}/{serial}`, served unauthenticated per RFC 5280 §5 / RFC 6960 / RFC 8615) are populated from step-ca's revocation data if available, but clients should validate against step-ca's endpoints for the authoritative status.
**MaxTTL enforcement (M11c):** When a certificate profile defines a maximum TTL, the step-ca connector caps the `NotAfter` field to ensure the issued certificate does not exceed the profile limit, regardless of the step-ca provisioner's own maximum.
@@ -1126,7 +1126,7 @@ The digest HTML template includes:
- Expiring certificates table (color-coded by urgency: 7d, 14d, 30d)
- Auto-refresh and responsive email layout
**Scheduler Integration:** The 7th scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency.
**Scheduler Integration:** The opt-in digest scheduler loop runs on configurable interval (default 24 hours). It does NOT run on startup — waits for first scheduled tick. Operation timeout is 5 minutes. Each loop execution is guarded by `sync/atomic.Bool` idempotency. See `docs/architecture.md` for the full scheduler topology (12 loops, 8 always-on + 4 opt-in).
Configuration:
@@ -1389,7 +1389,7 @@ curl -s -X DELETE http://localhost:8443/api/v1/network-scan-targets/nst-dmz
### Scheduler Integration
When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs a 6th scheduler loop (alongside renewal, jobs, health, notifications, and short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health.
When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs the opt-in network scanner scheduler loop alongside the always-on loops (renewal, jobs, job retry, job timeout, agent health, notifications, notification retry, short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health. See `docs/architecture.md` for the full 12-loop scheduler topology.
### Use Cases
@@ -1447,7 +1447,7 @@ Source path format: `gcp-sm://{project}/{secret-name}`. Sentinel agent: `cloud-g
### Cloud Discovery Scheduler
All enabled cloud sources run on a shared scheduler loop (9th loop). The interval is configurable:
All enabled cloud sources run on a shared opt-in cloud discovery scheduler loop (see `docs/architecture.md` for the full 12-loop scheduler topology). The interval is configurable:
| Variable | Description | Default |
|---|---|---|
+26 -18
View File
@@ -50,14 +50,17 @@ docker compose -f deploy/docker-compose.yml up -d --build
docker compose -f deploy/docker-compose.yml ps
```
Open **http://localhost:8443** in your browser alongside your terminal. You'll watch changes appear in the dashboard as you make API calls.
Open **https://localhost:8443** in your browser alongside your terminal. The default compose stack ships a self-signed cert; your browser will show a warning the first time — click through (or trust `deploy/test/certs/ca.crt` in your OS keychain). You'll watch changes appear in the dashboard as you make API calls.
Set up a base variable for convenience:
Set up base variables for convenience:
```bash
API="http://localhost:8443"
API="https://localhost:8443"
CA="$PWD/deploy/test/certs/ca.crt" # pin the self-signed CA for curl
```
Every `curl` in this guide uses `--cacert "$CA"` so the TLS handshake verifies against the compose-stack CA instead of the system trust store.
## How the pieces fit together
Before we start, here's the high-level flow of what we're about to do:
@@ -724,22 +727,24 @@ curl -s -X POST $API/api/v1/certificates/mc-demo-payments/revoke \
6. Creates an audit trail entry
7. Sends revocation notifications via configured channels
Check the CRL (Certificate Revocation List):
Check the CRL (Certificate Revocation List) — served unauthenticated under the RFC 8615 well-known namespace so relying parties without a certctl API key can still verify revocation (RFC 5280 §5):
```bash
# JSON-formatted CRL
curl -s $API/api/v1/crl | jq .
# DER-encoded X.509 CRL for the local CA (binary — pipe to openssl for inspection)
curl -s $API/api/v1/crl/iss-local -o /tmp/crl.der
# DER-encoded X.509 CRL for the local CA (binary — pipe to openssl for inspection).
# Note: no -H "Authorization: Bearer ..." — the endpoint is deliberately
# unauthenticated. Content-Type is application/pkix-crl.
curl --cacert "$CA" -s https://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
openssl crl -inform DER -in /tmp/crl.der -text -noout
```
Check OCSP status:
Check OCSP status (RFC 6960, also unauthenticated, `application/ocsp-response`):
```bash
# Replace SERIAL with the actual serial number from the certificate version
curl -s $API/api/v1/ocsp/iss-local/SERIAL | jq .
# Replace SERIAL with the actual serial number from the certificate version.
# The embedded OCSP responder returns a signed DER response — parse it with
# `openssl ocsp -respin` or similar tooling.
curl --cacert "$CA" -s https://localhost:8443/.well-known/pki/ocsp/iss-local/SERIAL -o /tmp/ocsp.der
openssl ocsp -respin /tmp/ocsp.der -noverify -resp_text | head -40
```
**Why RFC 5280 reason codes:** The reason code isn't just metadata — it tells clients *why* the certificate was revoked. A `keyCompromise` revocation means the private key was exposed and the certificate should be distrusted immediately. A `superseded` revocation means a newer certificate replaced it — less urgent. CRLs and OCSP responses include the reason code so client software can make informed trust decisions.
@@ -944,7 +949,8 @@ certctl includes a standalone CLI tool for command-line users:
cd cmd/cli && go build -o certctl-cli .
# Export credentials
export CERTCTL_SERVER_URL="http://localhost:8443"
export CERTCTL_SERVER_URL="https://localhost:8443"
export CERTCTL_SERVER_CA_BUNDLE_PATH="$PWD/deploy/test/certs/ca.crt"
export CERTCTL_API_KEY="test-key-123"
# List certificates (JSON or table format)
@@ -988,7 +994,8 @@ certctl exposes the full REST API via the Model Context Protocol (MCP), enabling
cd cmd/mcp-server && go build -o mcp-server .
# Export credentials
export CERTCTL_SERVER_URL="http://localhost:8443"
export CERTCTL_SERVER_URL="https://localhost:8443"
export CERTCTL_SERVER_CA_BUNDLE_PATH="$PWD/deploy/test/certs/ca.crt"
export CERTCTL_API_KEY="test-key-123"
# Start the MCP server (listens on stdin/stdout)
@@ -1046,7 +1053,7 @@ docker compose -f deploy/docker-compose.yml run -e CERTCTL_DISCOVERY_DIRS=/tmp/c
Or with the CLI flag:
```bash
certctl-agent --agent-id a-demo-1 --key-dir /tmp/keys --discovery-dirs /tmp/certs --server http://localhost:8443 --api-key test-key-123
certctl-agent --agent-id a-demo-1 --key-dir /tmp/keys --discovery-dirs /tmp/certs --server https://localhost:8443 --ca-bundle "$CA" --api-key test-key-123
```
### Network Discovery (Server-Side)
@@ -1153,7 +1160,7 @@ flowchart TB
API["REST API\nGo net/http"]
SVC["Service Layer\nBusiness Logic"]
REPO["Repository Layer\ndatabase/sql + lib/pq"]
SCHED["Scheduler\n7 background loops"]
SCHED["Scheduler\n12 background loops\n(8 always-on + 4 opt-in)"]
CONN["Connector Registry\nIssuer + Target + Notifier"]
end
@@ -1189,7 +1196,8 @@ Here's a single script that runs the entire demo end-to-end. Save it as `demo.sh
#!/bin/bash
set -e
API="http://localhost:8443"
API="https://localhost:8443"
CA="$PWD/deploy/test/certs/ca.crt" # pin the self-signed CA for curl
BLUE='\033[0;34m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
@@ -1297,7 +1305,7 @@ echo " 5. Revoked the certificate with RFC 5280 reason codes"
echo " 6. Checked dashboard stats and metrics"
echo " 7. All actions recorded in the audit trail"
echo ""
echo -e "Open ${GREEN}http://localhost:8443${NC} to see everything in the dashboard."
echo -e "Open ${GREEN}https://localhost:8443${NC} to see everything in the dashboard."
echo "Look for certificate: $CERT_ID"
```
+22 -16
View File
@@ -16,7 +16,7 @@ Complete reference of every feature shipped in certctl through v2.1.0 (April 202
| Target connectors | 14 |
| Notifier connectors | 6 channels |
| Database tables | 21 (across 10 migrations) |
| Background scheduler loops | 7 |
| Background scheduler loops | 12 (8 always-on + 4 opt-in) |
| Web dashboard pages | 24 |
| Test functions | 1850+ |
| Supported platforms | linux/amd64, linux/arm64, darwin/amd64, darwin/arm64 |
@@ -228,14 +228,15 @@ Revocation is a 7-step process: validate eligibility → get serial → update s
- Audit trail: single `bulk_revocation_initiated` event logs the criteria and actor
- Optional `--reason` defaults to `unspecified` if omitted
### CRL Endpoints
### CRL Endpoint
- `GET /api/v1/crl` — JSON-formatted CRL (version, entries array, total count, timestamp)
- `GET /api/v1/crl/{issuer_id}` — DER-encoded X.509 CRL signed by issuing CA, 24-hour validity
- `GET /.well-known/pki/crl/{issuer_id}` — DER-encoded X.509 CRL signed by the issuing CA, 24-hour validity (RFC 5280 §5 + RFC 8615). Served unauthenticated with `Content-Type: application/pkix-crl` so relying parties without certctl API credentials can fetch it.
Prior non-standard JSON CRL and authenticated `/api/v1/crl*` paths were removed in M-006 — RFC 5280 defines only the DER wire format and relying parties do not have API keys.
### OCSP Responder
`GET /api/v1/ocsp/{issuer_id}/{serial}` — signed OCSP responses (good/revoked/unknown). Signs with issuing CA key. Requires CA key access (Local CA, step-CA connectors).
`GET /.well-known/pki/ocsp/{issuer_id}/{serial}` — signed OCSP responses (good/revoked/unknown) per RFC 6960. Served unauthenticated with `Content-Type: application/ocsp-response`. Signs with the issuing CA key; requires CA key access (Local CA, step-CA connectors).
### Short-Lived Certificate Exemption
@@ -902,7 +903,7 @@ Server-side active TLS scanning of CIDR ranges. Concurrent probing with semaphor
<!-- Source: internal/connector/discovery/awssm/, azurekv/, gcpsm/, internal/service/cloud_discovery.go -->
Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the 9th scheduler loop (6h default).
Discovers certificates stored in cloud secret managers and brings them into the certctl inventory. Extends the existing discovery pipeline with pluggable `DiscoverySource` implementations. Each source runs as part of the opt-in cloud discovery scheduler loop (6h default; see `docs/architecture.md` for the full 12-loop scheduler topology).
**Supported sources:**
@@ -1096,17 +1097,22 @@ Single SQL `UNION` query replaces the previous "fetch all, filter in Go" approac
<!-- Source: internal/scheduler/scheduler.go -->
7 background loops, each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown.
12 background loops (8 always-on + 4 opt-in), each with an `atomic.Bool` idempotency guard preventing concurrent tick execution. `sync.WaitGroup` + `WaitForCompletion()` for graceful shutdown. Authoritative topology table lives in `docs/architecture.md`.
| Loop | Default Interval | Description |
|---|---|---|
| Renewal check | 1 hour | Check expiring certs, query ARI, create renewal jobs |
| Job processor | 30 seconds | Process pending jobs |
| Agent health check | 2 minutes | Check agent heartbeat staleness |
| Notification processor | 1 minute | Send queued notifications |
| Short-lived expiry check | 30 seconds | Mark short-lived certs expired |
| Network scan | 6 hours | Run network discovery scans |
| Digest | 24 hours | Send certificate digest email (does not run on startup) |
| Loop | Default Interval | Always-on | Env Var | Description |
|---|---|---|---|---|
| Renewal check | 1 hour | Yes | — | Check expiring certs, query ARI, create renewal jobs |
| Job processor | 30 seconds | Yes | — | Process pending jobs |
| Job retry | 5 minutes | Yes | `CERTCTL_SCHEDULER_RETRY_INTERVAL` | Retry Failed jobs (I-001) |
| Job timeout reaper | 10 minutes | Yes | `CERTCTL_JOB_TIMEOUT_INTERVAL` | Fail AwaitingCSR/AwaitingApproval jobs past timeout (I-003) |
| Agent health check | 2 minutes | Yes | — | Check agent heartbeat staleness |
| Notification processor | 1 minute | Yes | — | Send queued notifications |
| Notification retry | 2 minutes | Yes | `CERTCTL_NOTIFICATION_RETRY_INTERVAL` | Exponential backoff retry for failed notifications; promote to dead-letter after 5 attempts (I-005) |
| Short-lived expiry check | 30 seconds | Yes | — | Mark short-lived certs expired |
| Network scan | 6 hours | Opt-in | `CERTCTL_NETWORK_SCAN_ENABLED` | Run network discovery scans |
| Digest | 24 hours | Opt-in | `CERTCTL_DIGEST_INTERVAL` | Send certificate digest email (does not run on startup) |
| Endpoint health | 60 seconds | Opt-in | `CERTCTL_HEALTH_CHECK_INTERVAL` | Continuous TLS health probes (M48) |
| Cloud discovery | 6 hours | Opt-in | `CERTCTL_CLOUD_DISCOVERY_INTERVAL` | Cloud secret manager certificate discovery (M50) |
---
+11 -5
View File
@@ -29,15 +29,18 @@ The binary has zero runtime dependencies beyond the certctl server it connects t
## Configuration
The MCP server reads two environment variables:
The MCP server reads three environment variables:
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `CERTCTL_SERVER_URL` | No | `http://localhost:8443` | URL of the certctl REST API |
| `CERTCTL_SERVER_URL` | No | `https://localhost:8443` | URL of the certctl REST API (HTTPS-only as of v2.2) |
| `CERTCTL_API_KEY` | No | (empty) | API key for authentication (passed as `Bearer` token) |
| `CERTCTL_SERVER_CA_BUNDLE_PATH` | Yes (for self-signed / internal CA) | (empty) | Path to PEM CA bundle that signed the server cert. Required when the server cert isn't rooted in the system trust store (the default compose stack ships a self-signed cert at `deploy/test/certs/ca.crt`). |
If your certctl server has auth enabled (the default), you must provide the API key. The MCP server passes it through to every HTTP request.
Since v2.2 the certctl control plane is HTTPS-only. If the server cert is self-signed or chained to an internal CA, set `CERTCTL_SERVER_CA_BUNDLE_PATH` so the MCP server can verify the TLS handshake. Never set `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true` outside local development — it disables all certificate validation.
## Setting Up with Claude Desktop
Add this to your Claude Desktop MCP configuration file (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS, `%APPDATA%\Claude\claude_desktop_config.json` on Windows):
@@ -48,7 +51,8 @@ Add this to your Claude Desktop MCP configuration file (`~/Library/Application S
"certctl": {
"command": "/path/to/certctl-mcp",
"env": {
"CERTCTL_SERVER_URL": "http://localhost:8443",
"CERTCTL_SERVER_URL": "https://localhost:8443",
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
"CERTCTL_API_KEY": "your-api-key-here"
}
}
@@ -67,7 +71,8 @@ In Cursor, go to Settings → MCP Servers and add:
"certctl": {
"command": "/path/to/certctl-mcp",
"env": {
"CERTCTL_SERVER_URL": "http://localhost:8443",
"CERTCTL_SERVER_URL": "https://localhost:8443",
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
"CERTCTL_API_KEY": "your-api-key-here"
}
}
@@ -84,7 +89,8 @@ Add certctl as an MCP server in your project's `.mcp.json`:
"certctl": {
"command": "/path/to/certctl-mcp",
"env": {
"CERTCTL_SERVER_URL": "http://localhost:8443",
"CERTCTL_SERVER_URL": "https://localhost:8443",
"CERTCTL_SERVER_CA_BUNDLE_PATH": "/path/to/certctl/deploy/test/certs/ca.crt",
"CERTCTL_API_KEY": "your-api-key-here"
}
}
+1 -1
View File
@@ -34,7 +34,7 @@ cd certctl/deploy
docker compose up -d
```
Access the dashboard at `http://localhost:8443` with API key from `.env` file.
Access the dashboard at `https://localhost:8443` with the API key from `.env`. The default compose stack ships a self-signed cert; pin with `--cacert ./deploy/test/certs/ca.crt` when calling the API from the host.
### 2. Deploy Agents
+3 -2
View File
@@ -22,7 +22,7 @@ Option A: Docker Compose (quickest for evaluation)
```bash
cd /opt/certctl
docker compose up -d
# Dashboard & API: http://localhost:8443
# Dashboard & API: https://localhost:8443 (self-signed cert — use --cacert ./deploy/test/certs/ca.crt for the default compose stack)
# Default API key in logs (grep CERTCTL_API_KEY docker logs certctl-server)
```
@@ -45,7 +45,8 @@ chmod +x /usr/local/bin/certctl-agent
# Create config
sudo mkdir -p /etc/certctl /var/lib/certctl/keys
sudo tee /etc/certctl/agent.env > /dev/null <<EOF
CERTCTL_SERVER_URL=http://certctl-control-plane.example.com:8443
CERTCTL_SERVER_URL=https://certctl-control-plane.example.com:8443
CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/tls/ca.crt
CERTCTL_API_KEY=your-api-key-here
CERTCTL_DISCOVERY_DIRS=/etc/letsencrypt/live
CERTCTL_KEY_DIR=/var/lib/certctl/keys
+9 -4
View File
@@ -68,8 +68,10 @@ The spec organizes endpoints into 16 tags:
The spec declares a `bearerAuth` security scheme applied globally. All endpoints under `/api/v1/` require a Bearer token by default:
```bash
curl -H "Authorization: Bearer your-api-key" \
http://localhost:8443/api/v1/certificates
# The default compose stack uses a self-signed cert; pin with --cacert
curl --cacert ./deploy/test/certs/ca.crt \
-H "Authorization: Bearer your-api-key" \
https://localhost:8443/api/v1/certificates
```
Three endpoints are exempt from auth (declared with `security: []` in the spec): `/health`, `/ready`, and `/api/v1/auth/info`. The auth info endpoint tells clients whether authentication is enabled and what type is required — useful for GUIs that need to show/hide a login screen.
@@ -150,8 +152,9 @@ Import the spec directly into Postman:
1. Open Postman → Import → File → select `api/openapi.yaml`
2. Postman creates a collection with all 78 documented operations organized by tag
3. Set the `baseUrl` variable to `http://localhost:8443`
3. Set the `baseUrl` variable to `https://localhost:8443` (HTTPS-only as of v2.2)
4. Add an `Authorization: Bearer your-api-key` header to the collection
5. Import the demo stack CA bundle (`deploy/test/certs/ca.crt`) into Postman's Settings → Certificates → CA Certificates, or disable certificate verification for the `localhost` host (Settings → General → SSL certificate verification)
## Key Schemas
@@ -176,8 +179,10 @@ Use the spec to generate contract tests that verify the API matches the spec:
```bash
# Using schemathesis for fuzz testing against the spec
pip install schemathesis
# The default compose stack uses a self-signed cert — export a CA bundle or set REQUESTS_CA_BUNDLE
export REQUESTS_CA_BUNDLE=$(pwd)/deploy/test/certs/ca.crt
schemathesis run api/openapi.yaml \
--base-url http://localhost:8443 \
--base-url https://localhost:8443 \
--header "Authorization: Bearer your-api-key"
```
+5 -3
View File
@@ -85,10 +85,12 @@ go test -tags qa -v -timeout 10m ./...
| Variable | Default | Description |
|---|---|---|
| `CERTCTL_QA_SERVER_URL` | `http://localhost:8443` | certctl server URL |
| `CERTCTL_QA_SERVER_URL` | `https://localhost:8443` | certctl server URL (HTTPS-only as of v2.2) |
| `CERTCTL_QA_API_KEY` | `change-me-in-production` | API key for Bearer auth |
| `CERTCTL_QA_DB_URL` | `postgres://certctl:certctl@localhost:5432/certctl?sslmode=disable` | PostgreSQL connection string |
| `CERTCTL_QA_REPO_DIR` | `../..` | Path to certctl repo root (for source file checks) |
| `CERTCTL_QA_CA_BUNDLE` | `./certs/ca.crt` | PEM CA bundle pinned for TLS verification. The demo stack's `certctl-tls-init` container writes here. |
| `CERTCTL_QA_INSECURE` | `false` | Set to `"true"` to skip TLS verification (e.g. before the init container finishes). Never use outside the demo harness. |
## Part-by-Part Coverage Map
@@ -256,8 +258,8 @@ docker compose -f docker-compose.yml -f docker-compose.demo.yml ps
# Check server logs
docker compose -f docker-compose.yml -f docker-compose.demo.yml logs certctl-server
# Check if the port is exposed
curl -s http://localhost:8443/health
# Check if the port is exposed (self-signed cert — pin CA bundle)
curl --cacert ./deploy/test/certs/ca.crt -s https://localhost:8443/health
```
### "connect to QA DB" failure
+62 -47
View File
@@ -105,16 +105,24 @@ certctl-server Up (healthy)
certctl-agent Up
```
The control plane is HTTPS-only as of v2.2. The `certctl-tls-init` init container in the shipped `deploy/docker-compose.yml` self-signs a cert on first boot and drops it into a named volume. Extract the CA bundle once and reuse it for every API call in this guide:
```bash
curl http://localhost:8443/health
export CA=/tmp/certctl-ca.crt
docker compose -f deploy/docker-compose.yml exec -T certctl-server \
cat /etc/certctl/tls/ca.crt > "$CA"
curl --cacert "$CA" https://localhost:8443/health
```
```json
{"status":"healthy"}
```
If you're bringing your own cert (internal CA, cert-manager, operator-supplied Secret), see [`docs/tls.md`](tls.md) for the full provisioning matrix. If you're cutting over an existing install, see [`docs/upgrade-to-tls.md`](upgrade-to-tls.md) for the failure modes (out-of-date `http://…` agents fail at the TLS handshake) and the one-step procedure.
## Open the Dashboard
Open **http://localhost:8443** in your browser.
Open **https://localhost:8443** in your browser. Your browser will warn about the self-signed cert — that's expected for the demo bootstrap. Trust the CA bundle you just exported, or click through the warning.
> **Note:** The Docker Compose demo runs with authentication disabled (`CERTCTL_AUTH_TYPE=none`) so you can explore immediately. For production, set `CERTCTL_AUTH_TYPE=api-key` and `CERTCTL_AUTH_SECRET=<your-secret>` in your environment, then pass `Authorization: Bearer <your-secret>` on all API requests. The dashboard will prompt for your API key on first load.
>
@@ -154,62 +162,64 @@ Everything you see in the dashboard is backed by the REST API. All endpoints liv
### Core operations
Every request below uses `--cacert "$CA"` to pin the self-signed CA bundle extracted above. In production, point `$CA` at your internal CA root or the bundle you distributed to the fleet.
```bash
# List all certificates
curl -s http://localhost:8443/api/v1/certificates | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/certificates | jq .
# Filter by status
curl -s "http://localhost:8443/api/v1/certificates?status=Expiring" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?status=Expiring" | jq .
# Filter by environment
curl -s "http://localhost:8443/api/v1/certificates?environment=production" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?environment=production" | jq .
# Get a specific certificate
curl -s http://localhost:8443/api/v1/certificates/mc-api-prod | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/certificates/mc-api-prod | jq .
# Get deployment targets for a certificate
curl -s http://localhost:8443/api/v1/certificates/mc-api-prod/deployments | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/certificates/mc-api-prod/deployments | jq .
# List agents
curl -s http://localhost:8443/api/v1/agents | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/agents | jq .
# Check agent pending work
curl -s http://localhost:8443/api/v1/agents/ag-web-prod/work | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/agents/ag-web-prod/work | jq .
# View audit trail
curl -s http://localhost:8443/api/v1/audit | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/audit | jq .
# View policies and violations
curl -s http://localhost:8443/api/v1/policies | jq .
curl -s http://localhost:8443/api/v1/policies/pr-require-owner/violations | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/policies | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/policies/pr-require-owner/violations | jq .
# Notifications
curl -s http://localhost:8443/api/v1/notifications | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/notifications | jq .
# Profiles and agent groups
curl -s http://localhost:8443/api/v1/profiles | jq .
curl -s http://localhost:8443/api/v1/agent-groups | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/profiles | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/agent-groups | jq .
```
### Sorting, filtering, and pagination
```bash
# Sort by expiration date (ascending)
curl -s "http://localhost:8443/api/v1/certificates?sort=notAfter" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?sort=notAfter" | jq .
# Sort descending (prefix with -)
curl -s "http://localhost:8443/api/v1/certificates?sort=-createdAt" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?sort=-createdAt" | jq .
# Time-range filters (RFC3339)
curl -s "http://localhost:8443/api/v1/certificates?expires_before=2026-05-01T00:00:00Z" | jq .
curl -s "http://localhost:8443/api/v1/certificates?created_after=2026-03-01T00:00:00Z" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?expires_before=2026-05-01T00:00:00Z" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?created_after=2026-03-01T00:00:00Z" | jq .
# Sparse fields — request only what you need
curl -s "http://localhost:8443/api/v1/certificates?fields=id,common_name,status,expires_at" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?fields=id,common_name,status,expires_at" | jq .
# Cursor pagination — efficient for large inventories
curl -s "http://localhost:8443/api/v1/certificates?page_size=5" | jq '{next_cursor: .next_cursor, count: (.data | length)}'
curl -s "http://localhost:8443/api/v1/certificates?cursor=<next_cursor_value>&page_size=5" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?page_size=5" | jq '{next_cursor: .next_cursor, count: (.data | length)}'
curl --cacert "$CA" -s "https://localhost:8443/api/v1/certificates?cursor=<next_cursor_value>&page_size=5" | jq .
```
Supported sort fields: `notAfter`, `expiresAt`, `createdAt`, `updatedAt`, `commonName`, `name`, `status`, `environment`.
@@ -218,22 +228,22 @@ Supported sort fields: `notAfter`, `expiresAt`, `createdAt`, `updatedAt`, `commo
```bash
# Dashboard summary
curl -s http://localhost:8443/api/v1/stats/summary | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/stats/summary | jq .
# Certificates by status
curl -s http://localhost:8443/api/v1/stats/certificates-by-status | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/stats/certificates-by-status | jq .
# Expiration timeline (next 90 days)
curl -s "http://localhost:8443/api/v1/stats/expiration-timeline?days=90" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/stats/expiration-timeline?days=90" | jq .
# Job trends (last 30 days)
curl -s "http://localhost:8443/api/v1/stats/job-trends?days=30" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/stats/job-trends?days=30" | jq .
# JSON metrics
curl -s http://localhost:8443/api/v1/metrics | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/metrics | jq .
# Prometheus format (for Prometheus, Grafana Agent, Datadog)
curl -s http://localhost:8443/api/v1/metrics/prometheus
curl --cacert "$CA" -s https://localhost:8443/api/v1/metrics/prometheus
```
## Create Your First Certificate
@@ -241,7 +251,7 @@ curl -s http://localhost:8443/api/v1/metrics/prometheus
Create a certificate record that certctl will track, renew, and deploy automatically.
```bash
curl -s -X POST http://localhost:8443/api/v1/certificates \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates \
-H "Content-Type: application/json" \
-d '{
"name": "My First Certificate",
@@ -264,31 +274,34 @@ CERT_ID="<paste the id from the response>"
Trigger renewal:
```bash
curl -s -X POST http://localhost:8443/api/v1/certificates/$CERT_ID/renew | jq .
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/$CERT_ID/renew | jq .
```
Check the result:
```bash
curl -s http://localhost:8443/api/v1/certificates/$CERT_ID | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/certificates/$CERT_ID | jq .
```
Refresh the dashboard at http://localhost:8443 — your new certificate appears in the inventory.
Refresh the dashboard at https://localhost:8443 — your new certificate appears in the inventory.
### Revoke a certificate
When a private key is compromised or a service is decommissioned:
```bash
curl -s -X POST http://localhost:8443/api/v1/certificates/$CERT_ID/revoke \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/$CERT_ID/revoke \
-H "Content-Type: application/json" \
-d '{"reason": "superseded"}' | jq .
```
Supported RFC 5280 reason codes: `unspecified`, `keyCompromise`, `caCompromise`, `affiliationChanged`, `superseded`, `cessationOfOperation`, `certificateHold`, `privilegeWithdrawn`.
Confirm via CRL:
Confirm via the unauthenticated DER CRL (RFC 5280 §5, RFC 8615):
```bash
curl -s http://localhost:8443/api/v1/crl | jq .
# Fetch the CRL without any API key — relying parties shouldn't need one.
# The CRL path is unauthenticated, but it's still served over TLS.
curl --cacert "$CA" -s https://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
openssl crl -inform der -in /tmp/crl.der -noout -text | head -40
```
### Interactive approval workflow
@@ -297,15 +310,15 @@ For high-value certificates where you want human oversight. The demo includes 2
```bash
# List jobs awaiting approval (demo includes 2)
curl -s "http://localhost:8443/api/v1/jobs?status=AwaitingApproval" | jq '.data[] | {id, certificate_id, status}'
curl --cacert "$CA" -s "https://localhost:8443/api/v1/jobs?status=AwaitingApproval" | jq '.data[] | {id, certificate_id, status}'
# Approve a pending job
curl -s -X POST http://localhost:8443/api/v1/jobs/JOB_ID/approve \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/jobs/JOB_ID/approve \
-H "Content-Type: application/json" \
-d '{"reason": "Approved for production deployment"}' | jq .
# Reject a pending job
curl -s -X POST http://localhost:8443/api/v1/jobs/JOB_ID/reject \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/jobs/JOB_ID/reject \
-H "Content-Type: application/json" \
-d '{"reason": "Key type does not meet compliance requirements"}' | jq .
```
@@ -331,7 +344,7 @@ export CERTCTL_DISCOVERY_DIRS="/etc/nginx/certs,/etc/ssl/certs,/var/lib/certs"
export CERTCTL_NETWORK_SCAN_ENABLED=true
# Create a scan target
curl -s -X POST http://localhost:8443/api/v1/network-scan-targets \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/network-scan-targets \
-H "Content-Type: application/json" \
-d '{
"name": "Internal Network",
@@ -343,20 +356,20 @@ curl -s -X POST http://localhost:8443/api/v1/network-scan-targets \
}' | jq .
# Trigger an immediate scan
curl -s -X POST http://localhost:8443/api/v1/network-scan-targets/nst-internal-network/scan | jq .
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/network-scan-targets/nst-internal-network/scan | jq .
```
### Triage discovered certificates
```bash
# List discovered certs
curl -s "http://localhost:8443/api/v1/discovered-certificates?agent_id=agent-nginx-prod" | jq .
curl --cacert "$CA" -s "https://localhost:8443/api/v1/discovered-certificates?agent_id=agent-nginx-prod" | jq .
# Summary counts
curl -s http://localhost:8443/api/v1/discovery-summary | jq .
curl --cacert "$CA" -s https://localhost:8443/api/v1/discovery-summary | jq .
# Claim a discovered cert (bring under management)
curl -s -X POST "http://localhost:8443/api/v1/discovered-certificates/DISCOVERY_ID/claim" \
curl --cacert "$CA" -s -X POST "https://localhost:8443/api/v1/discovered-certificates/DISCOVERY_ID/claim" \
-H "Content-Type: application/json" \
-d '{"managed_certificate_id": "mc-api-prod"}' | jq .
```
@@ -366,8 +379,9 @@ curl -s -X POST "http://localhost:8443/api/v1/discovered-certificates/DISCOVERY_
```bash
cd cmd/cli && go build -o certctl-cli .
export CERTCTL_SERVER_URL="http://localhost:8443"
export CERTCTL_SERVER_URL="https://localhost:8443"
export CERTCTL_API_KEY="test-key-123"
export CERTCTL_SERVER_CA_BUNDLE_PATH="$CA" # or pass --ca-bundle; --insecure for dev self-signed
./certctl-cli certs list # List certificates
./certctl-cli certs get mc-api-prod # Certificate details
@@ -400,10 +414,10 @@ export CERTCTL_DIGEST_RECIPIENTS=ops@example.com,security@example.com
Preview the digest HTML before enabling scheduled delivery:
```bash
curl http://localhost:8443/api/v1/digest/preview | jq '.html' | grep -o '<html>' # Shows HTML is ready
curl --cacert "$CA" https://localhost:8443/api/v1/digest/preview | jq '.html' | grep -o '<html>' # Shows HTML is ready
# Trigger a digest send immediately (outside of schedule)
curl -X POST http://localhost:8443/api/v1/digest/send
curl --cacert "$CA" -X POST https://localhost:8443/api/v1/digest/send
```
If no recipients are configured (`CERTCTL_DIGEST_RECIPIENTS` empty), the digest falls back to certificate owner emails. Digests include total certificates, expiring soon, expired, active agents, completed/failed jobs (30-day summary), and a table of expiring certs color-coded by urgency (7/14/30 days).
@@ -413,8 +427,9 @@ If no recipients are configured (`CERTCTL_DIGEST_RECIPIENTS` empty), the digest
```bash
cd cmd/mcp-server && go build -o mcp-server .
export CERTCTL_SERVER_URL="http://localhost:8443"
export CERTCTL_SERVER_URL="https://localhost:8443"
export CERTCTL_API_KEY="test-key-123"
export CERTCTL_SERVER_CA_BUNDLE_PATH="$CA" # MCP is env-vars-only; no CLI flags
./mcp-server
```
+69 -49
View File
@@ -16,7 +16,7 @@ You'll start 7 Docker containers that talk to each other:
| **pebble-challtestsrv** | DNS/HTTP challenge test server for Pebble | 10.30.50.3 | Not directly — Pebble talks to it |
| **Pebble** | A fake Let's Encrypt (tests the ACME protocol without touching the real internet) | 10.30.50.4 | Not directly — the server talks to it |
| **step-ca** | A private Certificate Authority (think: your company's internal CA) | 10.30.50.5 | Not directly — the server talks to it |
| **certctl-server** | The brain. API + web dashboard + scheduler + ACME challenge server | 10.30.50.6 | **http://localhost:8443** |
| **certctl-server** | The brain. API + web dashboard + scheduler + ACME challenge server | 10.30.50.6 | **https://localhost:8443** (self-signed — see CA-bundle note below) |
| **NGINX** | A web server. The agent deploys certificates here. | 10.30.50.7 | **https://localhost:8444** |
| **certctl-agent** | The hands. Generates keys, deploys certs to NGINX | 10.30.50.8 | Not directly — it talks to the server |
@@ -123,7 +123,7 @@ docker compose -f docker-compose.test.yml up --build
```
certctl-test-server | {"level":"INFO","msg":"server started","address":"0.0.0.0:8443"}
certctl-test-agent | {"level":"INFO","msg":"agent starting","server_url":"http://certctl-server:8443"}
certctl-test-agent | {"level":"INFO","msg":"agent starting","server_url":"https://certctl-server:8443"}
certctl-test-stepca | Serving HTTPS on :9000 ...
certctl-test-pebble | Listening on: 0.0.0.0:14000
```
@@ -159,13 +159,29 @@ certctl-test-stepca Up (healthy)
**If certctl-test-server says "Restarting"**: It probably started before step-ca or Pebble were ready. Wait 30 seconds and check again. If it keeps restarting, see [Troubleshooting](#troubleshooting).
### Get the CA bundle for curl
The test harness runs HTTPS-only (the `certctl-tls-init` init container self-signs an ECDSA-P256 server cert with a SHA-256 signature into a bind-mounted directory before the server starts — see `docker-compose.test.yml` §`certctl-tls-init` for details). The CA cert that signed it is materialized on the host at `./test/certs/ca.crt` (relative to the `deploy/` directory). Every `curl` in the rest of this doc expects it in `$CA`:
```bash
export CA=$PWD/test/certs/ca.crt
ls -la "$CA" # sanity check: file should exist and be non-empty
curl --cacert "$CA" -f https://localhost:8443/health
```
Expect `{"status":"ok"}`. If `curl` errors with `SSL certificate problem: unable to get local issuer certificate`, the init container hasn't finished yet — wait a few seconds and retry. If the file doesn't exist at all, the bind mount didn't populate; `docker compose -f docker-compose.test.yml logs certctl-tls-init` should show the self-sign ran.
For a full explanation of the cert provisioning patterns (self-signed bootstrap, operator-supplied, cert-manager), see [`tls.md`](tls.md). For the one-step cutover from the old plaintext test harness to HTTPS, see [`upgrade-to-tls.md`](upgrade-to-tls.md).
---
## Step 2: Open the Dashboard
Open your web browser and go to:
**http://localhost:8443**
**https://localhost:8443**
Your browser will warn you that the cert is self-signed ("Your connection is not private" / "NET::ERR_CERT_AUTHORITY_INVALID"). That's expected for the test harness — the CA that signed the cert lives at `deploy/test/certs/ca.crt` and isn't in your system trust store. Click through the warning (Chrome: "Advanced" → "Proceed"; Firefox: "Accept the Risk"; Safari: "Show Details" → "visit this website").
You'll see a login screen asking for an API key. Enter:
@@ -198,12 +214,13 @@ Go back to your second terminal. Let's verify the data loaded correctly.
### Check the agent
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/agents | python3 -m json.tool
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
https://localhost:8443/api/v1/agents | python3 -m json.tool
```
**What this command does**:
- `curl` makes an HTTP request (like a browser but from the terminal)
- `curl` makes an HTTPS request (like a browser but from the terminal)
- `--cacert "$CA"` pins the test harness's self-signed root as the only trust anchor for this call — matches what you exported in Step 1
- `-s` means "silent" (don't show progress bars)
- `-H "Authorization: Bearer test-key-2026"` sends the API key (same one you used to log in)
- `python3 -m json.tool` formats the JSON response so it's readable
@@ -233,8 +250,8 @@ The important parts: `"id": "agent-test-01"` and `"status": "online"`. If the st
### Check the issuers
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/issuers | python3 -m json.tool
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
https://localhost:8443/api/v1/issuers | python3 -m json.tool
```
You should see three issuers:
@@ -245,8 +262,8 @@ You should see three issuers:
### Check the target
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/targets | python3 -m json.tool
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
https://localhost:8443/api/v1/targets | python3 -m json.tool
```
You should see `target-test-nginx` — the NGINX deployment target, assigned to `agent-test-01`.
@@ -255,7 +272,7 @@ The target config uses no-op commands for `reload_command` and `validate_command
### See it all in the dashboard
Open the dashboard at http://localhost:8443 and click through the sidebar:
Open the dashboard at https://localhost:8443 and click through the sidebar:
- **Agents** — you should see `test-agent-01`
- **Issuers** — you should see all three CAs
- **Targets** — you should see `Test NGINX`
@@ -287,7 +304,7 @@ The private key **never leaves the agent**. The server only ever sees the CSR (p
### Step 4a: Create the certificate record
```bash
curl -s -X POST http://localhost:8443/api/v1/certificates \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates \
-H "Authorization: Bearer test-key-2026" \
-H "Content-Type: application/json" \
-d '{
@@ -338,7 +355,7 @@ docker exec certctl-test-postgres psql -U certctl -d certctl -c \
### Step 4c: Trigger issuance
```bash
curl -s -X POST http://localhost:8443/api/v1/certificates/mc-local-test/renew \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/mc-local-test/renew \
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
```
@@ -395,7 +412,7 @@ The `subject` should match the domain name you chose. The `issuer` should say "c
### Step 4f: Check the dashboard
Open the dashboard at http://localhost:8443 and:
Open the dashboard at https://localhost:8443 and:
1. Click **Certificates** in the sidebar — you should see `mc-local-test` with status "Active"
2. Click on it to see the detail page — you should see version history, the signed certificate details, and the deployment timeline
@@ -414,7 +431,7 @@ This is the real deal. ACME is the protocol that Let's Encrypt uses to issue cer
### Step 5a: Create the certificate record
```bash
curl -s -X POST http://localhost:8443/api/v1/certificates \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates \
-H "Authorization: Bearer test-key-2026" \
-H "Content-Type: application/json" \
-d '{
@@ -441,7 +458,7 @@ docker exec certctl-test-postgres psql -U certctl -d certctl -c \
"INSERT INTO certificate_target_mappings (certificate_id, target_id) VALUES ('mc-acme-test', 'target-test-nginx') ON CONFLICT DO NOTHING;"
# Trigger issuance
curl -s -X POST http://localhost:8443/api/v1/certificates/mc-acme-test/renew \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/mc-acme-test/renew \
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
```
@@ -502,7 +519,7 @@ Revocation means "this certificate is no longer trusted, even though it hasn't e
### Step 7a: Revoke the Local CA cert
```bash
curl -s -X POST http://localhost:8443/api/v1/certificates/mc-local-test/revoke \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/mc-local-test/revoke \
-H "Authorization: Bearer test-key-2026" \
-H "Content-Type: application/json" \
-d '{"reason": "superseded"}' | python3 -m json.tool
@@ -512,12 +529,15 @@ curl -s -X POST http://localhost:8443/api/v1/certificates/mc-local-test/revoke \
### Step 7b: Check the CRL (Certificate Revocation List)
The CRL is a DER-encoded X.509 v2 CRL (RFC 5280 §5) served under the RFC 8615 well-known namespace. It is deliberately unauthenticated — relying parties that need to verify revocation don't have certctl API keys.
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/crl | python3 -m json.tool
# No Authorization header — the endpoint is public by design.
curl --cacert "$CA" -s https://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
openssl crl -inform der -in /tmp/crl.der -noout -text | head -40
```
**What you should see**: A list that includes the revoked certificate's serial number, the reason, and the timestamp.
**What you should see**: `openssl` prints the CRL issuer DN, `This Update` / `Next Update` timestamps, and at least one entry whose `Serial Number` matches the cert you just revoked, with `CRL Reason Code: Superseded` (or whichever reason you passed in step 7a). The response's `Content-Type` header is `application/pkix-crl`.
### Step 7c: Check in the dashboard
@@ -530,8 +550,8 @@ Go to **Certificates** in the sidebar. The `mc-local-test` cert should now show
The agent is configured to scan `/nginx-certs` every 6 hours for existing certificates. It already ran a scan when it started up. Let's see what it found.
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/discovered-certificates | python3 -m json.tool
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
https://localhost:8443/api/v1/discovered-certificates | python3 -m json.tool
```
**What you should see**: Any certificates that exist in the NGINX cert directory, including the ones you deployed in Steps 4-5. The discovery system extracts metadata (CN, SANs, issuer, expiry, fingerprint) from the PEM files.
@@ -539,8 +559,8 @@ curl -s -H "Authorization: Bearer test-key-2026" \
Check the summary:
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/discovery-summary | python3 -m json.tool
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
https://localhost:8443/api/v1/discovery-summary | python3 -m json.tool
```
This shows counts: how many are Unmanaged, Managed, and Dismissed.
@@ -554,7 +574,7 @@ In the dashboard: click **Discovery** in the sidebar to see the triage view.
Force a renewal on the ACME certificate to see the full cycle happen again:
```bash
curl -s -X POST http://localhost:8443/api/v1/certificates/mc-acme-test/renew \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/mc-acme-test/renew \
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
```
@@ -581,7 +601,7 @@ The test environment enables EST with `CERTCTL_EST_ENABLED=true` and `CERTCTL_ES
### Step 10a: Check available CA certificates
```bash
curl -sk http://localhost:8443/.well-known/est/cacerts \
curl --cacert "$CA" -s https://localhost:8443/.well-known/est/cacerts \
-H "Authorization: Bearer test-key-2026"
```
@@ -592,7 +612,7 @@ curl -sk http://localhost:8443/.well-known/est/cacerts \
### Step 10b: Check CSR attributes
```bash
curl -sk http://localhost:8443/.well-known/est/csrattrs \
curl --cacert "$CA" -s https://localhost:8443/.well-known/est/csrattrs \
-H "Authorization: Bearer test-key-2026"
```
@@ -612,7 +632,7 @@ openssl req -new -newkey ec -pkeyopt ec_paramgen_curve:P-256 \
EST_CSR=$(openssl req -in /tmp/est-test.csr -outform DER | base64 -w 0)
# Submit to EST simpleenroll endpoint
curl -sk -X POST http://localhost:8443/.well-known/est/simpleenroll \
curl --cacert "$CA" -s -X POST https://localhost:8443/.well-known/est/simpleenroll \
-H "Authorization: Bearer test-key-2026" \
-H "Content-Type: application/pkcs10" \
-d "$EST_CSR"
@@ -625,8 +645,8 @@ curl -sk -X POST http://localhost:8443/.well-known/est/simpleenroll \
Decode and inspect the response (if you saved it to a variable):
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/audit-events | python3 -m json.tool | head -30
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
https://localhost:8443/api/v1/audit-events | python3 -m json.tool | head -30
```
Check the audit trail — you should see an `est_enrollment` event with the CN `est-device.certctl.test`.
@@ -636,7 +656,7 @@ Check the audit trail — you should see an `est_enrollment` event with the CN `
EST also supports re-enrollment (certificate renewal). The same CSR format works:
```bash
curl -sk -X POST http://localhost:8443/.well-known/est/simplereenroll \
curl --cacert "$CA" -s -X POST https://localhost:8443/.well-known/est/simplereenroll \
-H "Authorization: Bearer test-key-2026" \
-H "Content-Type: application/pkcs10" \
-d "$EST_CSR"
@@ -655,7 +675,7 @@ S/MIME certificates are used for email signing and encryption — a different us
### Step 11a: Create an S/MIME certificate record
```bash
curl -s -X POST http://localhost:8443/api/v1/certificates \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates \
-H "Authorization: Bearer test-key-2026" \
-H "Content-Type: application/json" \
-d '{
@@ -683,7 +703,7 @@ Notice:
docker exec certctl-test-postgres psql -U certctl -d certctl -c \
"INSERT INTO certificate_target_mappings (certificate_id, target_id) VALUES ('mc-smime-test', 'target-test-nginx') ON CONFLICT DO NOTHING;"
curl -s -X POST http://localhost:8443/api/v1/certificates/mc-smime-test/renew \
curl --cacert "$CA" -s -X POST https://localhost:8443/api/v1/certificates/mc-smime-test/renew \
-H "Authorization: Bearer test-key-2026" | python3 -m json.tool
```
@@ -692,15 +712,15 @@ curl -s -X POST http://localhost:8443/api/v1/certificates/mc-smime-test/renew \
After the agent processes the job (30-60 seconds), check the certificate details:
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/certificates/mc-smime-test | python3 -m json.tool
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
https://localhost:8443/api/v1/certificates/mc-smime-test | python3 -m json.tool
```
The certificate should show `"status": "active"`. To verify the EKU on the actual cert, you can export it:
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/certificates/mc-smime-test/export/pem | python3 -m json.tool
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
https://localhost:8443/api/v1/certificates/mc-smime-test/export/pem | python3 -m json.tool
```
If you decode the certificate PEM, you should see:
@@ -765,16 +785,16 @@ If you have Go installed, you can build and test the CLI tool:
go build -o certctl-cli ./cmd/cli
# List certificates
./certctl-cli --server http://localhost:8443 --api-key test-key-2026 list-certs
./certctl-cli --server https://localhost:8443 --ca-bundle "$CA" --api-key test-key-2026 list-certs
# Get a specific certificate
./certctl-cli --server http://localhost:8443 --api-key test-key-2026 get-cert mc-acme-test
./certctl-cli --server https://localhost:8443 --ca-bundle "$CA" --api-key test-key-2026 get-cert mc-acme-test
# Check health
./certctl-cli --server http://localhost:8443 --api-key test-key-2026 health
./certctl-cli --server https://localhost:8443 --ca-bundle "$CA" --api-key test-key-2026 health
# Get metrics (JSON format)
./certctl-cli --server http://localhost:8443 --api-key test-key-2026 --format json metrics
./certctl-cli --server https://localhost:8443 --ca-bundle "$CA" --api-key test-key-2026 --format json metrics
```
---
@@ -921,15 +941,15 @@ Look for error messages. Common ones:
**Step 2**: Verify the agent is registered:
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/agents/agent-test-01 | python3 -m json.tool
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
https://localhost:8443/api/v1/agents/agent-test-01 | python3 -m json.tool
```
**Step 3**: Check for pending jobs:
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
"http://localhost:8443/api/v1/jobs?status=Pending&status=AwaitingCSR" | python3 -m json.tool
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
"https://localhost:8443/api/v1/jobs?status=Pending&status=AwaitingCSR" | python3 -m json.tool
```
If there are pending jobs but the agent isn't picking them up, check that the job's `agent_id` matches `agent-test-01`.
@@ -959,8 +979,8 @@ docker exec certctl-test-nginx nginx -s reload
**Step 3**: If the files aren't there, the deployment job hasn't completed. Check the jobs:
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
"http://localhost:8443/api/v1/jobs?type=Deployment" | python3 -m json.tool
curl --cacert "$CA" -s -H "Authorization: Bearer test-key-2026" \
"https://localhost:8443/api/v1/jobs?type=Deployment" | python3 -m json.tool
```
Look at the job status. If it's "Running" and stuck, the server's job processor may have picked it up instead of the agent (this was a known bug — the fix skips deployment jobs with `agent_id` in the server's `ProcessPendingJobs`).
@@ -1005,7 +1025,7 @@ Change it to a different port, like:
- "9443:8443"
```
Then access the dashboard at http://localhost:9443 instead.
Then access the dashboard at https://localhost:9443 instead.
### Starting completely fresh
@@ -1051,7 +1071,7 @@ docker compose -f docker-compose.test.yml up --build
| What | Value |
|---|---|
| Dashboard URL | http://localhost:8443 |
| Dashboard URL | https://localhost:8443 (use `--cacert ./test/certs/ca.crt`) |
| API key | `test-key-2026` |
| NGINX HTTP | http://localhost:8080 |
| NGINX HTTPS | https://localhost:8444 |
+473 -67
View File
@@ -1297,66 +1297,59 @@ curl -s -H "$AUTH" "$SERVER/api/v1/audit?per_page=5" | jq '[.items[] | select(.a
### 5.3 CRL & OCSP
**Test 5.3.1 — JSON CRL endpoint**
> **M-006 note:** The non-standard JSON CRL (`GET /api/v1/crl`) and the authenticated DER CRL (`GET /api/v1/crl/{issuer_id}`) and OCSP (`GET /api/v1/ocsp/{issuer_id}/{serial}`) paths were removed. Revocation-status distribution now lives under the RFC 8615 well-known namespace (`/.well-known/pki/crl/{issuer_id}` and `/.well-known/pki/ocsp/{issuer_id}/{serial}`), served unauthenticated because relying parties (browsers, TLS clients, hardware appliances) do not have certctl API keys.
**Test 5.3.1 — DER CRL endpoint (RFC 5280 §5, unauthenticated)**
```bash
curl -s -w "\nHTTP %{http_code}\n" -H "$AUTH" "$SERVER/api/v1/crl" | jq '{total: .total, entries_count: (.entries | length)}'
curl -s -D - -o /tmp/crl.der "$SERVER/.well-known/pki/crl/iss-local" | grep -i "content-type"
openssl crl -inform der -in /tmp/crl.der -noout -text | head -40
```
**What:** Fetches the JSON-formatted Certificate Revocation List.
**Why:** CRL is how relying parties check if a certificate has been revoked. The JSON CRL is the machine-readable API view.
**Expected:** HTTP 200. `total` > 0 (we revoked several certs above). Entries array contains serial numbers.
**PASS if** HTTP 200 and `total` > 0. **FAIL** if total = 0 or 500.
**What:** Fetches the DER-encoded X.509 CRL for the local issuer without presenting any API credentials.
**Why:** Relying parties (browsers, TLS libraries, network appliances) don't have certctl API keys. RFC 5280 §5 defines only the DER wire format, and RFC 8615 defines `.well-known/pki/*` as the relying-party namespace. The Content-Type must be `application/pkix-crl` and `openssl crl -inform der` must parse the body.
**Expected:** `Content-Type: application/pkix-crl`, `openssl` prints a valid CRL with the revoked serials we created above.
**PASS if** Content-Type matches and `openssl crl` parses the body. **FAIL** if JSON/HTML, 401/403, or parse error.
---
**Test 5.3.2 — DER CRL endpoint**
**Test 5.3.2 — OCSP: good response for non-revoked cert (RFC 6960, unauthenticated)**
```bash
curl -s -D - -o /dev/null -H "$AUTH" "$SERVER/api/v1/crl/iss-local" | grep -i "content-type"
curl -s -w "\nHTTP %{http_code}\n" "$SERVER/.well-known/pki/ocsp/iss-local/mc-api-prod" -o /tmp/ocsp.der
openssl ocsp -respin /tmp/ocsp.der -noverify -text 2>/dev/null | head -20
```
**What:** Fetches the DER-encoded X.509 CRL for the local issuer.
**Why:** Standard CRL consumers (browsers, TLS libraries) expect DER-encoded CRLs, not JSON. The Content-Type must be correct.
**Expected:** `Content-Type: application/pkix-crl`
**PASS if** Content-Type is `application/pkix-crl`. **FAIL** if JSON or other.
**What:** Queries the OCSP responder for a non-revoked certificate without any Authorization header.
**Why:** OCSP is the real-time alternative to CRL. RFC 6960 relying parties do not authenticate to the responder, so the endpoint must be public and return `Content-Type: application/ocsp-response`.
**Expected:** HTTP 200 with OCSP response indicating "good" status when `openssl ocsp -respin` parses the body.
**PASS if** HTTP 200 and cert status prints "good". **FAIL** if 401/403/500 or "revoked"/"unknown".
---
**Test 5.3.3 — OCSP: good response for non-revoked cert**
**Test 5.3.3 — OCSP: revoked response for revoked cert (unauthenticated)**
```bash
curl -s -w "\nHTTP %{http_code}\n" -H "$AUTH" "$SERVER/api/v1/ocsp/iss-local/mc-api-prod"
```
**What:** Queries the OCSP responder for a non-revoked certificate.
**Why:** OCSP is the real-time alternative to CRL. A "good" response means the cert is valid.
**Expected:** HTTP 200 with OCSP response indicating "good" status.
**PASS if** HTTP 200. **FAIL** if 500.
---
**Test 5.3.4 — OCSP: revoked response for revoked cert**
```bash
curl -s -w "\nHTTP %{http_code}\n" -H "$AUTH" "$SERVER/api/v1/ocsp/iss-local/mc-test-full"
curl -s -w "\nHTTP %{http_code}\n" "$SERVER/.well-known/pki/ocsp/iss-local/mc-test-full" -o /tmp/ocsp.der
openssl ocsp -respin /tmp/ocsp.der -noverify -text 2>/dev/null | grep -i "cert status"
```
**What:** Queries OCSP for a certificate we revoked earlier.
**Why:** OCSP must return "revoked" status for revoked certs. If it still returns "good," relying parties will trust a compromised certificate.
**Why:** OCSP must return "revoked" status for revoked certs. If it still returns "good," relying parties will trust a compromised certificate. Endpoint is unauthenticated per RFC 6960.
**Expected:** HTTP 200 with OCSP response indicating "revoked" status.
**PASS if** HTTP 200 and response indicates revoked. **FAIL** if response indicates "good".
**PASS if** HTTP 200 and status prints "revoked". **FAIL** if status is "good".
---
**Test 5.3.5 — OCSP: unknown serial**
**Test 5.3.4 — OCSP: unknown serial (unauthenticated)**
```bash
curl -s -w "\nHTTP %{http_code}\n" -H "$AUTH" "$SERVER/api/v1/ocsp/iss-local/nonexistent-serial"
curl -s -w "\nHTTP %{http_code}\n" "$SERVER/.well-known/pki/ocsp/iss-local/nonexistent-serial" -o /tmp/ocsp.der
openssl ocsp -respin /tmp/ocsp.der -noverify -text 2>/dev/null | grep -i "cert status"
```
**What:** Queries OCSP for a serial number the server doesn't recognize.
**Why:** OCSP must return "unknown" for serials it doesn't manage, not "good" (which would be a false positive).
**Why:** OCSP must return "unknown" for serials it doesn't manage, not "good" (which would be a false positive). Endpoint is public per RFC 6960.
**Expected:** HTTP 200 with OCSP "unknown" response, or HTTP 404.
**PASS if** response is "unknown" or 404. **FAIL** if "good".
@@ -2102,9 +2095,10 @@ go test ./internal/connector/issuer/local/ -run "TestSubCA" -v
**What:** In sub-CA mode, the DER CRL (Part 31.1) should be signed by the sub-CA key, not a self-signed root.
```bash
# After starting in sub-CA mode and revoking a cert:
curl -s -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/crl/iss-local" -o /tmp/subca-crl.der
# After starting in sub-CA mode and revoking a cert. The CRL is
# published unauthenticated under the RFC 8615 well-known namespace
# because relying parties don't carry certctl API keys.
curl -s "http://localhost:8443/.well-known/pki/crl/iss-local" -o /tmp/subca-crl.der
openssl crl -in /tmp/subca-crl.der -inform DER -noout -issuer
```
@@ -3706,23 +3700,24 @@ go test ./internal/service/ -run TestCSRRenewal -v
**Why:** TLS clients need to verify that certificates haven't been revoked. Without OCSP/CRL, a compromised certificate remains trusted until it expires. The short-lived exemption avoids bloating the CRL with certs that expire before distribution.
### 24.1: DER-Encoded CRL
> **M-006 note:** CRL and OCSP are published at `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5, `application/pkix-crl`) and `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960, `application/ocsp-response`). Per RFC 8615, `.well-known/pki/*` is the relying-party namespace, and the endpoints are served **unauthenticated** — browsers, TLS libraries, and network appliances do not have certctl API keys. The legacy `GET /api/v1/crl`, `GET /api/v1/crl/{issuer_id}`, and `GET /api/v1/ocsp/{issuer_id}/{serial}` routes were removed.
**What:** `GET /api/v1/crl/{issuer_id}` returns a DER-encoded X.509 CRL signed by the issuing CA. Content-Type is `application/pkix-crl`. The CRL has 24-hour validity.
### 24.1: DER-Encoded CRL (unauthenticated)
**Why:** This is the standard CRL format that browsers, TLS libraries, and LDAP directories consume. The existing JSON CRL at `GET /api/v1/crl` is certctl-specific; the DER CRL is interoperable.
**What:** `GET /.well-known/pki/crl/{issuer_id}` returns a DER-encoded X.509 CRL signed by the issuing CA. Content-Type is `application/pkix-crl`. The CRL has 24-hour validity.
**Why:** This is the RFC 5280 §5 wire format that browsers, TLS libraries, and LDAP directories consume. It must be reachable without any Authorization header so that relying parties — who have no certctl credentials — can fetch it.
```bash
# Request DER CRL for the local issuer
curl -s -D - -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/crl/iss-local" \
# Request DER CRL for the local issuer. No Authorization header.
curl -s -D - "http://localhost:8443/.well-known/pki/crl/iss-local" \
-o /tmp/crl.der
# Verify it's valid DER CRL with openssl
openssl crl -in /tmp/crl.der -inform DER -noout -text
```
**Expected:** 200 OK, Content-Type `application/pkix-crl`, Cache-Control `public, max-age=3600`.
**Expected:** 200 OK, Content-Type `application/pkix-crl`.
**PASS if:**
- `openssl crl` parses the DER file successfully
@@ -3730,33 +3725,34 @@ openssl crl -in /tmp/crl.der -inform DER -noout -text
- Validity period is present (thisUpdate / nextUpdate)
- If any certs have been revoked, they appear in the revocation list with serial + reason
**FAIL if:** Response is JSON (wrong endpoint), `openssl` rejects the DER format, or headers are wrong.
**FAIL if:** Response is JSON (wrong endpoint), `openssl` rejects the DER format, headers are wrong, or the server returns 401/403 (auth must NOT be required).
### 24.2: DER CRL — Nonexistent Issuer
```bash
curl -s -w "\n%{http_code}" -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/crl/iss-nonexistent"
curl -s -w "\n%{http_code}" \
"http://localhost:8443/.well-known/pki/crl/iss-nonexistent"
```
**Expected:** 404 Not Found.
**PASS if** status code is 404 and body contains "not found".
### 24.3: OCSP Responder — Good Status
### 24.3: OCSP Responder — Good Status (unauthenticated)
**What:** `GET /api/v1/ocsp/{issuer_id}/{serial}` returns a signed OCSP response. For a non-revoked certificate, the status is "good".
**What:** `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` returns a signed OCSP response. For a non-revoked certificate, the status is "good".
**Why:** OCSP is the real-time revocation check that TLS clients perform during the handshake. A "good" response tells the client the cert is still valid.
**Why:** OCSP is the real-time RFC 6960 revocation check that TLS clients perform during the handshake. A "good" response tells the client the cert is still valid. Relying parties fetch this without API credentials.
```bash
# First, get a certificate's serial number
# First, get a certificate's serial number (this uses the authenticated API
# because the operator has an API key — that is different from the relying
# party fetching the OCSP response).
SERIAL=$(curl -s -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/certificates/mc-api-prod" | jq -r '.latest_version.serial_number // empty')
# If serial is available, query OCSP
# Query OCSP without any Authorization header.
if [ -n "$SERIAL" ]; then
curl -s -D - -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/ocsp/iss-local/$SERIAL" \
curl -s -D - "http://localhost:8443/.well-known/pki/ocsp/iss-local/$SERIAL" \
-o /tmp/ocsp.der
# Parse OCSP response
@@ -3771,7 +3767,7 @@ fi
- Certificate status is "good" for a non-revoked cert
- Response is signed (producedAt timestamp present)
**FAIL if:** Response is JSON, OCSP status is wrong, or `openssl` rejects the response.
**FAIL if:** Response is JSON, OCSP status is wrong, `openssl` rejects the response, or the endpoint requires auth.
### 24.4: OCSP Responder — Revoked Status
@@ -3784,9 +3780,8 @@ curl -s -X POST -H "Authorization: Bearer $API_KEY" \
-d '{"reason": "keyCompromise"}' \
"http://localhost:8443/api/v1/certificates/$CERT_ID/revoke"
# Then query OCSP
curl -s -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/ocsp/iss-local/$SERIAL" \
# Then query OCSP — unauthenticated.
curl -s "http://localhost:8443/.well-known/pki/ocsp/iss-local/$SERIAL" \
-o /tmp/ocsp-revoked.der
openssl ocsp -respin /tmp/ocsp-revoked.der -text -noverify
@@ -3801,8 +3796,7 @@ openssl ocsp -respin /tmp/ocsp-revoked.der -text -noverify
**What:** Querying a serial number that doesn't exist in the inventory returns an "unknown" OCSP status (not an error — this is the correct OCSP behavior per RFC 6960).
```bash
curl -s -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/ocsp/iss-local/DEADBEEF" \
curl -s "http://localhost:8443/.well-known/pki/ocsp/iss-local/DEADBEEF" \
-o /tmp/ocsp-unknown.der
openssl ocsp -respin /tmp/ocsp-unknown.der -text -noverify
@@ -3820,9 +3814,8 @@ openssl ocsp -respin /tmp/ocsp-unknown.der -text -noverify
To test: revoke a cert that was issued under the `prof-short-lived` profile, then check the DER CRL. The revoked short-lived cert should NOT appear.
```bash
# After revoking a short-lived cert (serial SHORT_SERIAL):
curl -s -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/crl/iss-local" -o /tmp/crl.der
# After revoking a short-lived cert (serial SHORT_SERIAL). No auth needed.
curl -s "http://localhost:8443/.well-known/pki/crl/iss-local" -o /tmp/crl.der
openssl crl -in /tmp/crl.der -inform DER -text | grep -i "$SHORT_SERIAL"
```
@@ -5009,10 +5002,10 @@ curl -s -w "HTTP %{http_code}\n" -X DELETE -H "$AUTH" "$SERVER/api/v1/audit/$EVE
> **Tip:** Open a second terminal with `docker compose logs -f certctl-server` to watch scheduler log output in real time.
**Test 20.1.1 — Scheduler startup: all 7 loops registered**
**Test 20.1.1 — Scheduler startup: all 12 loops registered**
```bash
docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|health check\|notification\|short-lived\|network scan" | head -20
docker compose logs certctl-server 2>&1 | grep -i "scheduler\|renewal check\|job processor\|job retry\|job timeout\|health check\|notification\|notification retry\|short-lived\|network scan\|digest\|endpoint health\|cloud discovery" | head -30
```
**What:** Checks server startup logs for scheduler loop registration.
@@ -6594,6 +6587,419 @@ helm template certctl deploy/helm/certctl/ --set server.replicaCount=3 | grep 'r
---
## Part 55: Agent Soft-Retirement (I-004)
**What this validates:** The full `DELETE /api/v1/agents/{id}` soft-retirement contract — seven HTTP status codes (200/204/400/403/404/405/409/500), opt-in retired-agent listing, sentinel refusal, `410 Gone` heartbeat response, and the force-cascade escape hatch.
**Why it matters:** Before I-004, there was no retirement surface at all — `DELETE` did not exist and agents could only be removed via raw SQL against the `agents` table. Worse, the schema declared `deployment_targets.agent_id ON DELETE CASCADE`, so any such manual delete silently cascaded through four tables with zero audit trail. This part pins the replacement contract (soft-delete + preflight + force-cascade + sentinel guard + heartbeat 410) so regressions show up here first rather than as orphaned targets in production.
### 55.1 Migration 000015 Applied
```bash
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT column_name FROM information_schema.columns WHERE table_name='agents' AND column_name IN ('retired_at','retired_reason') ORDER BY column_name;"
```
**What:** Confirms migration 000015 added the archival columns to the `agents` table.
**PASS if** both `retired_at` and `retired_reason` rows are returned. **FAIL** if either is missing (migration did not apply).
---
### 55.2 FK Constraint Flipped to RESTRICT
```bash
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT confdeltype FROM pg_constraint WHERE conname='deployment_targets_agent_id_fkey';"
```
**What:** `confdeltype` is PostgreSQL's one-character code for the FK delete action: `r` = RESTRICT, `c` = CASCADE.
**PASS if** the value is `r`. **FAIL** if it is still `c` — that means migration 000015's FK flip did not run, and a hard `DELETE` against an agent row would silently cascade.
---
### 55.3 Clean Retire — 200
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-test-clean" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Retires an agent that has no active deployment targets, no deployed certificates, and no pending jobs.
**PASS if** status code is `200` and response body includes `"retired_at":"<ISO8601>"`, `"cascade":false`, and zero-valued counts.
---
### 55.4 Idempotent Re-Retire — 204
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-test-clean" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Retires an agent that is already retired.
**PASS if** status code is `204` and response body is completely empty (not even a trailing newline from the handler). The 200-shape must NOT be emitted — this is the terminal no-op.
---
### 55.5 Blocked by Dependencies — 409
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-with-deps" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Attempts to retire an agent that still has active targets/certificates/jobs.
**PASS if** status code is `409` and response body is the three-key `BlockedByDependenciesResponse` shape: `{"error":"blocked_by_dependencies", "message": "...", "counts": {"active_targets": N, "active_certificates": N, "pending_jobs": N}}`. Must NOT be the generic `ErrorResponse` shape — downstream dashboards parse the `counts` key.
---
### 55.6 Force Cascade — 200
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-with-deps?force=true&reason=decommissioning+rack-7" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Uses the force escape hatch to cascade-retire the dependencies.
**PASS if** status code is `200`, response includes `"cascade":true` with the pre-cascade counts, and the subsequent `GET /api/v1/audit-events?action=agent_retirement_cascaded` shows the event with the supplied `reason` and actor.
---
### 55.7 Force Without Reason — 400
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-other?force=true" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Verifies the `ErrForceReasonRequired` guard — `force=true` without `reason` must be rejected before any state mutation.
**PASS if** status code is `400` and no agent/target/job rows were modified.
---
### 55.8 Sentinel Refusal — 403
```bash
for id in server-scanner cloud-aws-sm cloud-azure-kv cloud-gcp-sm; do
echo "=== $id ==="
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/${id}?force=true&reason=attempt" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
done
```
**What:** Verifies all four sentinel agents refuse retirement even with `force=true`.
**PASS if** every request returns `403` and the response body's `error` value is `sentinel_agent` (or the equivalent `ErrAgentIsSentinel` mapping). **FAIL** if any sentinel accepts the request — retiring one silently orphans the network scanner or one of the three cloud secret-manager discovery sources.
---
### 55.9 Unknown ID — 404
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-does-not-exist" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Verifies `ErrAgentNotFound` maps to 404 (not 500). Ordering matters — the not-found check must come after the sentinel check so a typo'd sentinel ID still returns 403, not 404.
**PASS if** status code is `404`.
---
### 55.10 Heartbeat on Retired Agent — 410
```bash
curl -sS -X POST "http://localhost:8443/api/v1/agents/ag-test-clean/heartbeat" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-H "Content-Type: application/json" \
-d '{"os":"linux","architecture":"amd64","hostname":"test","ip_address":"10.0.0.1","version":"2.1.0"}' \
-w "\nHTTP %{http_code}\n"
```
**What:** Retired agents get `410 Gone` — the canonical "resource is permanently gone, stop retrying" signal — so `cmd/agent` detects it and exits cleanly.
**PASS if** status code is `410`. **FAIL** if it is `404` (wrong ordering — retired-check must run before not-found) or `200` (retired filter missing entirely — agent would keep phoning home forever).
---
### 55.11 Default List Excludes Retired
```bash
curl -sS "http://localhost:8443/api/v1/agents" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
| jq -r '.data[] | select(.id=="ag-test-clean") | .id'
```
**What:** Verifies the default `/agents` listing filters retired rows via `AgentRepository.ListActive`.
**PASS if** output is empty (the retired agent does NOT appear). **FAIL** if `ag-test-clean` shows up — default listings must not expose retired rows.
---
### 55.12 Retired Agents Opt-In View
```bash
curl -sS "http://localhost:8443/api/v1/agents/retired" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
| jq -r '.data[] | select(.id=="ag-test-clean") | {id, retired_at, retired_reason}'
```
**What:** Verifies the opt-in retired-agents view returns the row with `retired_at` and `retired_reason` populated. Go 1.22 ServeMux literal-beats-pattern-var precedence routes `/agents/retired` to this handler rather than `/agents/{id}`.
**PASS if** the row appears with non-null `retired_at`. **FAIL** if the row is missing (listing broken) or `retired_at` is null (serialization broken).
---
### 55.13 Dashboard Stats Counter Excludes Retired
```bash
curl -sS "http://localhost:8443/api/v1/stats/summary" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
| jq -r '.total_agents'
```
**What:** Stats dashboard uses `ListActive`, not `List` — retired agents must not inflate the count.
**PASS if** the counter reflects only non-retired rows (verify against `SELECT count(*) FROM agents WHERE retired_at IS NULL`).
---
### 55.14 CLI Retire Subcommand
```bash
certctl-cli agents retire ag-cli-test --force --reason "smoke test"
certctl-cli agents list --retired | grep ag-cli-test
```
**What:** Verifies the CLI `agents retire` subcommand forwards `--force` and `--reason` via `DeleteWithQuery` and the `agents list --retired` flag hits `/agents/retired` rather than the default listing.
**PASS if** the first command succeeds and the second shows the agent in the retired view.
---
### 55.15 MCP Retire Tool Schema
```bash
go test ./internal/mcp/ -run TestRetireAgent -v -count=1
```
**What:** Verifies the `certctl_retire_agent` MCP tool's input schema accepts `id`, `force`, and `reason`, and that the tool actually propagates `force`/`reason` into the outbound DELETE query string (not the body).
**PASS if** exit code 0.
---
### 55.16 HEAD-State OpenAPI Contract
```bash
npx --yes @redocly/cli lint api/openapi.yaml \
--config '{"rules":{"operation-4xx-response":"error","no-invalid-media-type-examples":"error"}}'
python3 -c "
import yaml
spec = yaml.safe_load(open('api/openapi.yaml'))
del_op = spec['paths']['/api/v1/agents/{id}']['delete']
assert set(del_op['responses'].keys()) == {'200','204','400','403','404','405','409','500'}, del_op['responses'].keys()
hb = spec['paths']['/api/v1/agents/{id}/heartbeat']['post']
assert '410' in hb['responses'], hb['responses'].keys()
assert spec['paths']['/api/v1/agents/retired']['get']['operationId'] == 'listRetiredAgents'
print('OpenAPI I-004 contract: OK')
"
```
**What:** Two-part check. Redocly lint confirms the spec is structurally valid; the Python assertions pin the seven DELETE status codes, the 410 heartbeat response, and the retired-agents operationId.
**PASS if** redocly prints no errors and the Python script prints `OpenAPI I-004 contract: OK`.
---
## Part 56: Notification Retry & Dead-Letter Queue (I-005)
**What this validates:** The full retry lifecycle for `notification_events` rows — transient notifier failures are re-armed with exponential backoff (`2^retry_count` minutes capped at 1h, 5-attempt budget), rows that exhaust the budget land in the terminal `dead` status, the dead-letter depth is surfaced both on the dashboard and via a Prometheus counter, and operators can requeue dead rows once the underlying outage is resolved.
**Why it matters:** Before I-005, a failed notification was a silent drop. `internal/service/notification.go` flipped `status` to `failed` and never came back to it, because `ProcessPendingNotifications` only lists rows whose `status='pending'`. A 5xx from Slack, a 30-second SMTP stall, or a misrouted webhook URL could each lose a critical alert (cert expiry, CA compromise, approval-rejected) with no trace beyond a single log line. Part 56 pins the replacement contract (retry loop + DLQ + dashboard surface + Prometheus metric + operator requeue) so regressions show up here rather than as a post-incident "why didn't we get paged?" review.
### 56.1 Migration 000016 Columns Applied
```bash
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT column_name FROM information_schema.columns WHERE table_name='notification_events' AND column_name IN ('retry_count','next_retry_at','last_error') ORDER BY column_name;"
```
**What:** Confirms migration 000016 added the retry bookkeeping columns to `notification_events`.
**PASS if** all three rows (`last_error`, `next_retry_at`, `retry_count`) are returned. **FAIL** if any is missing — the migration did not apply and the retry loop will error on every tick.
---
### 56.2 Partial Retry-Sweep Index Present
```bash
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT indexdef FROM pg_indexes WHERE tablename='notification_events' AND indexname='idx_notification_events_retry_sweep';"
```
**What:** Confirms the partial index `idx_notification_events_retry_sweep ON notification_events(next_retry_at) WHERE status = 'failed' AND next_retry_at IS NOT NULL` exists and has the expected predicate.
**PASS if** the returned `indexdef` includes `WHERE ((status = 'failed'::text) AND (next_retry_at IS NOT NULL))`. **FAIL** if the index is missing or unpartialed — the retry sweep will scan the full notification history instead of the small retry-eligible slice.
---
### 56.3 Failed Notification Retries On Next Tick
```bash
# Seed a failed notification with next_retry_at in the past
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"UPDATE notification_events SET status='failed', retry_count=0, next_retry_at=NOW() - INTERVAL '1 minute', last_error='transient SMTP timeout' WHERE id='notif-demo-1';"
# Wait for the retry loop to sweep (default CERTCTL_NOTIFICATION_RETRY_INTERVAL=2m)
sleep 130
# Observe the post-sweep state
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT id, status, retry_count, next_retry_at IS NOT NULL AS has_next_retry FROM notification_events WHERE id='notif-demo-1';"
```
**What:** Exercises the retry loop's failure path. The seeded row is re-dispatched through the notifier registry; in the demo environment the notifier does not exist for `email` so the sweep either delivers (`status='sent'`) or records a failed attempt (`retry_count=1`, `next_retry_at` re-armed).
**PASS if** either `status='sent'` (delivered on retry) or the row is still `failed` with `retry_count >= 1` and `has_next_retry=t`. **FAIL** if the row is still `failed` with `retry_count=0` and `next_retry_at` in the past — the retry loop is not actually running.
---
### 56.4 Exhausted Notification Transitions To Dead
```bash
# Seed a row one failure shy of exhaustion — retry_count=4 means the next
# tick's failure is the 5th attempt (notifRetryMaxAttempts-1 check at
# internal/service/notification.go:531).
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"UPDATE notification_events SET status='failed', retry_count=4, next_retry_at=NOW() - INTERVAL '1 minute', last_error='persistent outage', channel='channel-that-does-not-exist' WHERE id='notif-demo-2';"
sleep 130
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT id, status, retry_count, last_error FROM notification_events WHERE id='notif-demo-2';"
```
**What:** The row at `retry_count=4` enters the sweep, the notifier lookup fails (channel unknown), the exhaustion branch fires, and `MarkAsDead` flips the row. Note: the "notifier unknown" branch at notification.go:494-503 promotes to `sent` for demo parity, so for a strict DLQ assertion seed a row whose channel is a known registered notifier that will reject delivery — alternatively run against the integration test fixture where the retry-exhaustion path is deterministic.
**PASS if** `status='dead'` and `last_error` reflects the send failure. **FAIL** if the row is still `failed` with `retry_count >= 5` — the exhaustion branch did not fire and the row will retry forever.
---
### 56.5 Dead Row Has Null next_retry_at
```bash
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT COUNT(*) FROM notification_events WHERE status='dead' AND next_retry_at IS NOT NULL;"
```
**What:** `MarkAsDead` must clear `next_retry_at` so the partial retry-sweep index stops matching the row. If this invariant breaks, a dead row keeps appearing in `ListRetryEligible` and the exhaustion branch fires on every sweep.
**PASS if** the count is `0`. **FAIL** if any dead rows still carry a non-null `next_retry_at` — the DLQ is leaky and the row will re-enter the retry rotation on the next tick.
---
### 56.6 DashboardSummary Populates NotificationsDead
```bash
# Seed a dead row so the count is observable
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"UPDATE notification_events SET status='dead', next_retry_at=NULL, last_error='demo DLQ fixture' WHERE id='notif-demo-3';"
curl -sS "http://localhost:8443/api/v1/stats/summary" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
| python3 -c "import sys,json; s=json.load(sys.stdin); assert 'notifications_dead' in s, 'missing notifications_dead field'; assert s['notifications_dead'] >= 1, s['notifications_dead']; print('notifications_dead:', s['notifications_dead'])"
```
**What:** Confirms `DashboardSummary.NotificationsDead` (`internal/service/stats.go:66`) is populated by `notifRepo.CountByStatus(ctx, "dead")` (stats.go:137-142) and surfaced in the dashboard summary JSON.
**PASS if** the field is present and reflects at least the seeded dead row. **FAIL** if the field is missing (`SetNotifRepo` was not called on StatsService) or stuck at zero despite seeded dead rows (repository `CountByStatus` is broken).
---
### 56.7 Prometheus Counter Emits certctl_notification_dead_total
```bash
curl -sS "http://localhost:8443/api/v1/metrics/prometheus" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
| grep -E '^# (HELP|TYPE) certctl_notification_dead_total|^certctl_notification_dead_total '
```
**What:** The Prometheus endpoint (`internal/api/handler/metrics.go:217-219`) emits three lines: `# HELP certctl_notification_dead_total Number of notifications in the dead-letter queue.`, `# TYPE certctl_notification_dead_total counter`, and a bare `certctl_notification_dead_total <value>` value line. Operator alert thresholds per the I-005 spec: `> 0` warning, `> 10` critical.
**PASS if** all three lines are present and the value is `>= 1` when dead rows exist. **FAIL** if any of the three lines is missing — the metric name is misspelled, the `# TYPE` is wrong, or `DashboardSummary.NotificationsDead` is not wired into the metrics handler.
---
### 56.8 Requeue Resets Retry Bookkeeping
```bash
# Confirm the row is in 'dead' with the full retry history
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT id, status, retry_count, next_retry_at, last_error FROM notification_events WHERE id='notif-demo-3';"
# Requeue via the operator endpoint
curl -sS -X POST "http://localhost:8443/api/v1/notifications/notif-demo-3/requeue" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
# Confirm the atomic reset
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT id, status, retry_count, next_retry_at, last_error FROM notification_events WHERE id='notif-demo-3';"
```
**What:** Exercises the operator-driven escape hatch (`POST /api/v1/notifications/{id}/requeue`). The repository's `Requeue` must atomically flip `status → pending`, reset `retry_count → 0`, clear `next_retry_at → NULL`, and clear `last_error → NULL` — see `internal/service/notification.go:571-576` and the pinning test at `notification_handler_test.go:307-347`.
**PASS if** HTTP `200` with JSON body `{"status":"requeued"}` AND the post-requeue row has `status='pending'`, `retry_count=0`, `next_retry_at IS NULL`, `last_error IS NULL`. **FAIL** if any of the four fields is not reset — `ProcessPendingNotifications` will not treat this as a fresh attempt and the audit trail will be ambiguous.
---
### 56.9 GUI Dead Letter Tab Threads ?status=dead
```bash
cd web
npx vitest run src/pages/NotificationsPage.test.tsx -t 'Dead letter tab fetches notifications with status=dead'
```
**What:** The two-tab toolbar on `/notifications` routes the "Dead letter" tab's query through `getNotifications({ status: 'dead', per_page: '100' })`. This test verifies the React Query's `queryKey: ['notifications', activeTab]` (`NotificationsPage.tsx:31`) actually translates the tab click into the server-side filter — not client-side filtering of the full inbox.
**PASS if** the Vitest assertion at `NotificationsPage.test.tsx:104-128` passes. **FAIL** if the Dead letter tab is merely a client-side filter on the `all` response — the DLQ-only code path (`NotificationRepository.ListByStatus`) is not exercised, which matters for pagination correctness once the inbox grows beyond 100 rows.
---
### 56.10 Requeue Button MutationFn Wrapper
```bash
cd web
npx vitest run src/pages/NotificationsPage.test.tsx -t 'clicking Requeue invokes requeueNotification'
```
**What:** `react-query` v5's `mutate(id)` passes a second positional argument (the mutation context object) to the `mutationFn`. If `mutationFn: requeueNotification` is used directly, the API client receives `(id, { client })` — an extra argument that the strict-match `toHaveBeenCalledWith('notif-dead-001')` assertion at `NotificationsPage.test.tsx:181` rejects. The fix is an explicit single-arg arrow: `mutationFn: (id: string) => requeueNotification(id)` at `NotificationsPage.tsx:64`.
**PASS if** the Vitest assertion passes (the API client was called with exactly one argument). **FAIL** if the wrapper is inadvertently removed — silent success in runtime, loud failure in this contract.
---
### 56.11 HEAD-State OpenAPI Contract
```bash
npx --yes @redocly/cli lint api/openapi.yaml \
--config '{"rules":{"operation-4xx-response":"error","no-invalid-media-type-examples":"error"}}'
python3 -c "
import yaml
spec = yaml.safe_load(open('api/openapi.yaml'))
post = spec['paths']['/api/v1/notifications/{id}/requeue']['post']
assert post['operationId'] == 'requeueNotification', post['operationId']
assert set(post['responses'].keys()) >= {'200','400','404','405','500'}, post['responses'].keys()
print('OpenAPI I-005 contract: OK')
"
```
**What:** Two-part check. Redocly lint confirms the spec is structurally valid; the Python assertions pin the requeue endpoint's `operationId` and the five minimum response codes (200/400/404/405/500).
**PASS if** redocly prints no errors and the Python script prints `OpenAPI I-005 contract: OK`. **FAIL** if the `operationId` changed or any of the five responses is missing — downstream MCP/CLI clients rely on the contract.
---
## Release Sign-Off
All tests below must pass before tagging v2.1.0. Each row is one individual test from the guide above. The **Method** column indicates whether `qa-smoke-test.sh` covers the test automatically (**Auto**) or requires hands-on verification (**Manual**).
@@ -6934,7 +7340,7 @@ These must be green before starting manual QA:
| Test | Description | Method | Pass? | Date | Notes |
|------|-------------|--------|-------|------|-------|
| 20.1.1 | Scheduler startup: all 7 loops registered | Manual | ☐ | | |
| 20.1.1 | Scheduler startup: all 12 loops registered | Manual | ☐ | | |
| 20.1.2 | Job processor loop fires (30s interval) | Manual | ☐ | | |
| 20.1.3 | Agent health check marks offline (2m interval) | Manual | ☐ | | |
| 20.1.4 | Notification processor fires (1m interval) | Manual | ☐ | | |
@@ -7595,10 +8001,10 @@ These must be green before starting manual QA:
| Category | Count |
|----------|-------|
| ☑ Auto (passed in `qa-smoke-test.sh`) | 144 |
| ☐ Auto (not yet run) | 129 |
| ☐ Auto (not yet run) | 136 |
| — Skipped (preconditions not met in demo) | 5 |
| ☐ Manual (requires hands-on verification) | 282 |
| **Total** | **560** |
| ☐ Manual (requires hands-on verification) | 286 |
| **Total** | **571** |
**Automated tests must also be green.** CI passing is necessary but not sufficient — this manual QA catches integration issues that isolated unit tests miss.
+183
View File
@@ -0,0 +1,183 @@
# TLS on the Control Plane
certctl's control plane is HTTPS-only as of v2.2. There is no plaintext `http://` listener, no `auto` mode, no dual-listener bridge, no TLS 1.2 escape hatch. The server refuses to start without a cert+key pair, the agent/CLI/MCP clients reject `http://` URLs at startup, and the Helm chart refuses to render without either an operator-supplied Secret or a cert-manager Certificate CR.
This doc covers four cert provisioning patterns, SIGHUP-based cert rotation, and the client-side CA-trust configuration agents and the CLI need to talk to the server. If you are upgrading from a pre-HTTPS release and want the step-by-step cutover procedure, read [`upgrade-to-tls.md`](upgrade-to-tls.md) first and come back here for reference.
## What you get
The server binds TLS 1.3 only with an explicit curve preference of `[X25519, P-256]`. TLS 1.3 cipher suites are non-negotiable (all three mandatory suites — AES-128-GCM-SHA256, AES-256-GCM-SHA384, CHACHA20-POLY1305-SHA256 — are always offered), so there is no `CipherSuites` knob to misconfigure. No TLS 1.2 fallback is available.
Two env vars are required on the server:
- `CERTCTL_SERVER_TLS_CERT_PATH` — filesystem path to the PEM-encoded server certificate
- `CERTCTL_SERVER_TLS_KEY_PATH` — filesystem path to the PEM-encoded private key that signs the cert
Both paths are read during a fail-loud preflight in `cmd/server/main.go` (see `preflightServerTLS` in `cmd/server/tls.go`). If either is unset, unreadable, or the cert+key pair does not round-trip through `tls.LoadX509KeyPair`, the process refuses to start and emits a diagnostic pointing back at this doc. The rationale lives in §3 of the HTTPS-Everywhere milestone: a cert-lifecycle product should not silently bind plaintext.
## Pattern 1 — Self-signed bootstrap for docker-compose demos
This is the default for the `deploy/docker-compose.yml` stack. It exists so `docker compose up -d --build` just works on a laptop without the operator standing up a CA first. It is not appropriate for any non-demo environment.
An init container named `certctl-tls-init` runs once before the server starts. It uses the `alpine/openssl` image and generates an ECDSA-P256 self-signed cert (SHA-256 signature):
```
openssl req -x509 -newkey ec \
-pkeyopt ec_paramgen_curve:P-256 \
-nodes \
-keyout /etc/certctl/tls/server.key \
-out /etc/certctl/tls/server.crt \
-days 3650 \
-subj "/CN=certctl-server" \
-addext "subjectAltName=DNS:certctl-server,DNS:localhost,IP:127.0.0.1,IP:::1"
```
**Why ECDSA-P256 and not ed25519.** The pre-v2.0.48 demo bootstrap used ed25519 (small keys, fast signatures). Apple's TLS stack — Safari Network Framework and the macOS-bundled LibreSSL 3.3.6 `/usr/bin/curl` — does not advertise ed25519 in the ClientHello `signature_algorithms` extension for server certs, so an ed25519 server cert was rejected at handshake with `tls: peer doesn't support any of the certificate's signature algorithms` on the server side (and the generic TLS handshake error on the client side). Homebrew OpenSSL 3.x, Chrome, Firefox, and Linux curl all accepted ed25519 — Apple was the outlier. ECDSA-P256 with SHA-256 is universally supported, so the demo bootstrap uses it by default. To pick up the new algorithm on an existing demo install, tear the volume down and rebuild: `docker compose -f deploy/docker-compose.yml down -v && docker compose -f deploy/docker-compose.yml up -d --build`. **Helm and operator-supplied-Secret users (Patterns 2 and 3) are unaffected** — they bring their own cert, and `cmd/server/tls.go` is algorithm-agnostic (TLS 1.3 with curve preference `[X25519, P-256]` for key exchange — no constraint on the server cert's signature algorithm).
The cert, its matching key, and a copy of the cert published as `ca.crt` land in a named volume (`certs`) mounted at `/etc/certctl/tls/` in the server container (read-only) and the agent container (read-only). The bootstrap is idempotent — if `server.crt`, `server.key`, and `ca.crt` are already present on the volume, the init container logs `TLS cert already present at …` and exits cleanly.
Single-cert design. CN is `certctl-server` to match the Docker-network hostname. The SAN list is `[certctl-server, localhost, 127.0.0.1, ::1]`, which covers both container-internal agent→server traffic and operator browser/curl access to `https://localhost:8443`. There is no separate intermediate/root chain — the server cert and the CA bundle are the same PEM. This is the whole point of a demo bootstrap.
To force regeneration (rotate the demo cert), tear the volume down: `docker compose down -v`. The next `up` re-runs the init container.
The server's Docker healthcheck and the agent both verify against `/etc/certctl/tls/ca.crt`; no `-k` / `InsecureSkipVerify` anywhere in the default stack.
## Pattern 2 — Operator-supplied `kubernetes.io/tls` Secret (Helm)
This is the default path for Helm installs. The operator provisions a Secret of type `kubernetes.io/tls` holding `tls.crt` + `tls.key` (and optionally `ca.crt` for mounting a CA bundle to clients in the same cluster) from whatever source they already trust — their internal CA, a manually-issued cert, step-ca, AWS ACM PCA exported to PEM, or the output of the self-signed bootstrap pattern above copied into a cluster Secret.
```
kubectl create secret tls certctl-server-tls \
--cert=server.crt \
--key=server.key \
--namespace certctl
```
Then:
```
helm install certctl deploy/helm/certctl \
--namespace certctl \
--set server.tls.existingSecret=certctl-server-tls
```
The Secret is mounted read-only at `/etc/certctl/tls/` in the server pod. The `CERTCTL_SERVER_TLS_CERT_PATH` and `CERTCTL_SERVER_TLS_KEY_PATH` env vars are wired to `tls.crt` and `tls.key` keys inside that mount. If `ca.crt` is absent from the Secret, clients that need a CA bundle should use `tls.crt` as the bundle (self-signed case) or mount a separate ConfigMap with the root chain (operator-CA case).
If the operator sets neither `server.tls.existingSecret` nor `server.tls.certManager.enabled=true`, `helm template` / `helm install` fails at render-time with a diagnostic pointing at this doc. The guard is implemented in `deploy/helm/certctl/templates/_helpers.tpl` under the `certctl.tls.required` helper. This is deliberate: the HTTPS-only server would crash-loop on an empty path, so we fail earlier at Helm-render time.
## Pattern 3 — cert-manager `Certificate` CR (Helm, opt-in)
For clusters that already run cert-manager, the chart can provision a `Certificate` CR that writes into the Secret the server pod reads from. This is opt-in — the default is `server.tls.certManager.enabled: false` — because not every cluster has cert-manager installed, and we refuse to ship a chart that silently depends on an external controller.
```
helm install certctl deploy/helm/certctl \
--namespace certctl \
--set server.tls.certManager.enabled=true \
--set server.tls.certManager.issuerRef.name=my-cluster-issuer \
--set server.tls.certManager.issuerRef.kind=ClusterIssuer
```
The rendered `Certificate` (see `deploy/helm/certctl/templates/server-certificate.yaml`) writes `tls.crt` + `tls.key` + `ca.crt` into the Secret named by `server.tls.certManager.secretName` (defaults to `<fullname>-tls`). The server pod reads from that same Secret; the agent DaemonSet mounts the same Secret as its CA bundle source.
cert-manager handles rotation. certctl-server handles in-place reload — see the SIGHUP section below.
The chart enforces that if `server.tls.certManager.enabled=true`, `server.tls.certManager.issuerRef.name` must also be set. An empty `issuerRef.name` makes `helm template` fail with a diagnostic naming the missing flag.
## Pattern 4 — Manually-issued from an internal CA
For operators running neither Helm nor docker-compose (bare-metal / custom orchestration), the server just needs two files on disk pointed at by `CERTCTL_SERVER_TLS_CERT_PATH` and `CERTCTL_SERVER_TLS_KEY_PATH`. Issue the cert from your internal CA with:
- CN matching the hostname your agents and operators use to dial the server (e.g., `certctl.prod.example.com`)
- SAN list covering every hostname and IP that appears in `CERTCTL_SERVER_URL` values across your agent fleet
- Key usage: digital signature + key encipherment
- Extended key usage: server auth
Store the key with mode `0600` and owner matching the UID the server runs as (`1000` in our shipped Dockerfile). The server process reads both files during `preflightServerTLS` at startup and again on every SIGHUP.
The full CA chain that signed the server cert should be distributed to agents, CLI operators, and MCP clients as their `CERTCTL_SERVER_CA_BUNDLE_PATH` — see the client section below.
## SIGHUP cert rotation
The server wraps its cert+key pair in a `*certHolder` (see `cmd/server/tls.go`) that guards the loaded `*tls.Certificate` under a `sync.Mutex`. The `*tls.Config` wires `GetCertificate` to the holder, so every new inbound TLS handshake reads whatever cert the holder currently has.
Send `SIGHUP` to the server PID and the holder re-reads both files from disk. On success, the next new connection uses the new cert; in-flight requests finish on the previous cert. A log line goes out:
```
TLS cert reloaded via SIGHUP cert_path=/etc/certctl/tls/server.crt key_path=/etc/certctl/tls/server.key
```
On failure (missing file, malformed PEM, key does not sign cert), the old cert is retained and an error logs:
```
TLS cert reload failed; continuing with previous cert cert_path=… key_path=… error=…
```
This is deliberately fail-safe on reload (as opposed to fail-loud on startup). A cert-manager renewal race, a partially-copied file, a typo in a rotation script — none of those should crash a running server and drop every agent connection. The operator sees the error in logs, fixes the underlying issue, and sends another `SIGHUP`.
Pair with cert-manager, certbot `--post-hook`, or any rotation tool that can fire a signal. For docker-compose, `docker compose kill -s HUP certctl-server` works. For Kubernetes, reload is typically handled by cert-manager updating the Secret and the mounted file changing on the next kubelet sync — no explicit SIGHUP needed if the volume mount is `subPath`-free.
Startup is a different story. If the cert is missing or malformed at process start, the server exits non-zero rather than binding plaintext or attempting a retry loop. That's the HTTPS-only contract.
## Client-side TLS: agents, CLI, MCP
Everything that talks to the server enforces HTTPS on the URL.
### Agent
`CERTCTL_SERVER_URL` must be `https://…`. `http://`, bare hostnames, `ftp://`, `ws://`, and empty strings are rejected at startup by `validateHTTPSScheme` in `cmd/agent/main.go` with a diagnostic pointing at `upgrade-to-tls.md`. There is no warning-and-proceed path.
Two additional env vars control how the agent verifies the server cert:
- `CERTCTL_SERVER_CA_BUNDLE_PATH` — filesystem path to a PEM-encoded CA bundle that signed the server cert. Loaded into `*tls.Config.RootCAs` on the agent's HTTP client. If unset, the agent falls back to the OS system trust store.
- `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY` — defaults to `false`. Setting it to `true` skips verification entirely. **Dev-only escape hatch.** The agent logs a prominent warning at startup (`TLS certificate verification is disabled … never enable this in production`). Use this only when dialing a demo server whose cert you haven't bothered to mount into the agent container.
Equivalent CLI flags: `--ca-bundle <path>` and `--insecure-skip-verify`.
If both the CA bundle and `InsecureSkipVerify=true` are set, `InsecureSkipVerify` wins — it's the whole point of the flag. Don't do this in production.
### CLI (`certctl-cli`)
Same contract as the agent:
- `CERTCTL_SERVER_URL` defaults to `https://` scheme; `http://` rejected at startup
- `--ca-bundle <path>` flag or `CERTCTL_SERVER_CA_BUNDLE_PATH` env var — CA bundle for server cert verification
- `--insecure` flag or `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true` — skip verification (dev only)
- Error diagnostic on empty URL explicitly mentions both `--server` and `CERTCTL_SERVER_URL` so operators see the right knob to turn
The CLI shares the URL-scheme validation with the agent; the test pins in `cmd/cli/main_test.go:TestValidateHTTPSScheme` cover the full rejection matrix.
### MCP server (`certctl-mcp-server`)
Same three controls as CLI, env-var-driven only (no flags — MCP runs as a stdio subprocess and inherits env from the launching LLM client):
- `CERTCTL_SERVER_URL` must start with `https://`
- `CERTCTL_SERVER_CA_BUNDLE_PATH` optional CA bundle
- `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY` optional skip
Claude Desktop / other MCP client configs should set all three in the tool's env block.
## Troubleshooting: fail-loud preflight errors
Every preflight failure message ends with `(see docs/tls.md)` so this doc is the first hit when an operator searches. Common failures:
**`CERTCTL_SERVER_TLS_CERT_PATH is empty: HTTPS-only control plane refuses to start`**
Set the env var. For docker-compose this is already set to `/etc/certctl/tls/server.crt` in the shipped compose file — if you're seeing this, check the `certctl-tls-init` service logs to see why the init container didn't populate the volume. For Helm, check that `server.tls.existingSecret` or `server.tls.certManager.enabled=true` is set.
**`TLS cert file "…" unreadable: …`**
The cert path is set but `os.Stat` failed. Check filesystem permissions — the server runs as UID 1000 in our shipped Dockerfile; the cert needs to be readable by that UID. Typos in the path also land here.
**`TLS cert/key pair invalid (cert="…" key="…"): …`**
Both files exist but `tls.LoadX509KeyPair` refused them. Typical causes: the private key does not sign the certificate, the key is encrypted with a passphrase (not supported — remove the passphrase with `openssl pkey` before mounting), or one of the two is DER-encoded instead of PEM. Re-issue the pair from the same CA call and re-mount.
**Client side: `tls: failed to verify certificate: x509: certificate signed by unknown authority`**
The client did not trust the CA that signed the server cert. Either mount the CA bundle via `CERTCTL_SERVER_CA_BUNDLE_PATH`, add the CA to the system trust store on the client host, or (dev only) set `CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true`.
**Client side: `tls: first record does not look like a TLS handshake`**
The client is speaking plaintext HTTP to an HTTPS server (or vice-versa). Check that `CERTCTL_SERVER_URL` starts with `https://`. If you are upgrading from a pre-v2.2 release and your agents are old, they will surface this error until you roll the DaemonSet — see [`upgrade-to-tls.md`](upgrade-to-tls.md).
## Related docs
- [`upgrade-to-tls.md`](upgrade-to-tls.md) — one-step cutover from pre-HTTPS releases
- [`quickstart.md`](quickstart.md) — docker-compose walkthrough with HTTPS examples
- [`test-env.md`](test-env.md) — integration test environment (also HTTPS-only)
- Milestone spec: `prompts/https-everywhere-milestone.md` (authoritative source for locked decisions)
+194
View File
@@ -0,0 +1,194 @@
# Upgrading to HTTPS-Everywhere (v2.2)
certctl's control plane is HTTPS-only as of v2.2. There is no `http` mode, no `auto` mode, no dual-listener bind, no N-release migration window. The cutover is a single step. Out-of-date agents that still point at `http://…` fail at the TCP/TLS handshake layer on first connect after the upgrade and stay `Offline` in the dashboard until their env block is updated and the fleet is rolled.
This doc walks operators through the cutover for the two shipped deployment topologies — docker-compose and Helm — and documents the failure modes and rollback posture explicitly.
For the deep-dive on cert provisioning patterns, SIGHUP cert reload, and client-side CA-trust configuration, read [`tls.md`](tls.md). This doc is the narrow "how do I upgrade" procedure.
## Preconditions
Before you start, confirm:
- **Shell access** to the server host and every agent host. The cutover requires you to restart the server and update every agent's env block.
- **A cert+key source** for the server. Pick one:
- An internal CA that can issue a server cert (CN + SAN list covering every hostname / IP agents dial).
- A `cert-manager` install in the target Kubernetes cluster, plus a `ClusterIssuer` or `Issuer` you're willing to reference.
- Willingness to use the self-signed bootstrap that the shipped `deploy/docker-compose.yml` generates automatically. This is the right choice for dev and demo; it is the wrong choice for production.
- **A maintenance window.** Out-of-date agents break at the TLS handshake and stay offline until rolled. Schedule the upgrade so the agent fleet can be updated in the same window as the server.
- **Backups.** This is a one-way door (see the Rollback section below). Snapshot your PostgreSQL database before `docker compose down` or `helm upgrade`.
There is no schema migration tied to this release; the only at-rest state that changes is the `certs` named volume (docker-compose) or the `tls.crt`/`tls.key` Secret (Helm).
## Procedure — docker-compose operators
The shipped `deploy/docker-compose.yml` includes a `certctl-tls-init` init container that self-signs an ECDSA-P256 (SHA-256 signature) cert on first boot and drops `server.crt`, `server.key`, and `ca.crt` into a named volume mounted read-only at `/etc/certctl/tls/` on the server and agent containers. No manual cert provisioning is required for the default stack. (Pre-v2.0.48 this was an ed25519 cert; see [`tls.md`](tls.md) Pattern 1 for the rationale and the `down -v && up --build` migration note.)
1. **Pull the HTTPS-everywhere release.** From the repo root:
```
git pull
```
Confirm you're on a tag or `master` that contains the `certctl-tls-init` service in `deploy/docker-compose.yml`. Grep for it: `grep certctl-tls-init deploy/docker-compose.yml` should hit.
2. **Stop the old plaintext cluster.**
```
docker compose -f deploy/docker-compose.yml down
```
Do not pass `-v`; keeping the PostgreSQL volume preserves your cert inventory, audit trail, and job history across the upgrade.
3. **Bring the cluster back up with the HTTPS build.**
```
docker compose -f deploy/docker-compose.yml up -d --build
```
The `certctl-tls-init` service runs once, generates the self-signed cert into the `certs` volume, and exits with code 0. The server container waits for `certctl-tls-init` via `depends_on: { condition: service_completed_successfully }` and only starts once the cert material is on disk. The server's Docker healthcheck now uses `curl --cacert /etc/certctl/tls/ca.crt -f https://localhost:8443/health`, so the container only becomes healthy once the HTTPS listener is up and serving the bundled cert correctly.
4. **Verify the HTTPS endpoint from the host.**
```
curl --cacert $(docker compose -f deploy/docker-compose.yml exec -T certctl-server cat /etc/certctl/tls/ca.crt) https://localhost:8443/health
```
Expect `{"status":"ok"}` with HTTP 200. If you get a TLS verification error, the CA bundle wasn't read correctly — re-run the `exec -T` command and pipe the output directly into `--cacert @-` or save it to a local file first. If you get `connection refused`, the server never finished startup — check `docker compose logs certctl-server` for a fail-loud preflight diagnostic pointing at `docs/tls.md`.
5. **Confirm the bundled agent reconnects.** Agents inside the compose stack pick up the new URL (`CERTCTL_SERVER_URL=https://certctl-server:8443`) and the bundled CA (`CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/tls/ca.crt`) from their env block automatically — no per-agent change needed. Tail the agent log:
```
docker compose -f deploy/docker-compose.yml logs -f certctl-agent
```
You should see `heartbeat sent` within 30 seconds. In the dashboard (`https://localhost:8443`), the agent should show as `Online`.
**External agents** running outside the compose network (e.g., the `install-agent.sh`-installed systemd service on a separate host) need their env block updated manually before the cutover — see the Agent env block section below.
## Procedure — Helm operators
The Helm chart does not self-sign. It refuses to render (`helm template` exits non-zero) unless you configure one of two cert sources: an operator-supplied Secret, or a cert-manager `Certificate` CR. See [`tls.md`](tls.md) for the full pattern catalog.
1. **Provision cert material.** Pick one of:
- **Operator-supplied Secret.** Issue a cert from your internal CA (or any other source) and load it into a `kubernetes.io/tls` Secret in the certctl namespace:
```
kubectl create secret tls certctl-server-tls \
--cert=server.crt --key=server.key \
--namespace certctl
```
- **cert-manager.** Set `server.tls.certManager.enabled=true` on the upgrade and reference an existing `ClusterIssuer` or `Issuer`:
```
--set server.tls.certManager.enabled=true
--set server.tls.certManager.issuerRef.name=my-cluster-issuer
--set server.tls.certManager.issuerRef.kind=ClusterIssuer
```
2. **Upgrade the release.**
```
helm upgrade certctl deploy/helm/certctl \
--namespace certctl \
--set server.tls.existingSecret=certctl-server-tls
```
(Or the `certManager` variant.) If you omit both `server.tls.existingSecret` and `server.tls.certManager.enabled`, the chart fails at render time with a diagnostic pointing at `docs/tls.md`. That guard exists precisely so you catch the missing config at `helm upgrade` time, not at pod-crash-loop time.
3. **Verify the HTTPS endpoint from inside the cluster.** Port-forward and curl with the CA bundle:
```
kubectl port-forward -n certctl svc/certctl-server 8443:8443 &
kubectl get secret -n certctl certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/certctl-ca.crt
curl --cacert /tmp/certctl-ca.crt https://localhost:8443/health
```
Expect `{"status":"ok"}`. If the Secret does not contain a `ca.crt` key (operator-supplied Secrets often don't), use `tls.crt` as the bundle instead — for a self-signed cert the two files are identical, and for a cert chained to an internal CA you should separately distribute the root CA bundle via ConfigMap or mounted file.
4. **Update every agent manifest.** Agents outside this Helm release (or in a separately-managed DaemonSet) need their env block updated:
```
- name: CERTCTL_SERVER_URL
value: "https://certctl-server.certctl.svc.cluster.local:8443"
- name: CERTCTL_SERVER_CA_BUNDLE_PATH
value: "/etc/certctl/tls/ca.crt"
```
Mount the server's Secret (or a separate CA-bundle Secret / ConfigMap) at `/etc/certctl/tls/` as a read-only volume. If you bundle the agent via the shipped Helm chart's DaemonSet, the wiring is already done — set `agent.enabled=true` and the chart mounts the same Secret.
5. **Roll the agent DaemonSet.**
```
kubectl rollout restart ds/certctl-agent -n certctl
kubectl rollout status ds/certctl-agent -n certctl
```
Every agent pod restarts with the new URL + CA bundle and reconnects on HTTPS. The dashboard shows agents flip from `Offline` to `Online` as pods finish rolling.
## Agent env block — external hosts
Agents installed on bare-metal or VM hosts via `install-agent.sh` (systemd on Linux, launchd on macOS) read config from `/etc/certctl/agent.env` (Linux) or `~/Library/Application Support/certctl/agent.env` (macOS). On cutover, append or update:
```
CERTCTL_SERVER_URL=https://certctl.example.com:8443
CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/tls/ca.crt
# CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=false # Dev only. Never set to true in production.
```
Distribute the CA bundle (the same `ca.crt` the server holds, or the root chain if you issued the server cert from an intermediate) to every agent host. The path under `CERTCTL_SERVER_CA_BUNDLE_PATH` must be readable by the UID the agent service runs as.
Restart the service after editing:
- Linux: `systemctl restart certctl-agent`
- macOS: `launchctl kickstart -k system/com.certctl.agent`
The agent refuses to start on an `http://` URL and exits with a pre-flight diagnostic that names this doc. That rejection happens before any network call — no spurious half-connected state.
## Failure mode
Out-of-date agents still configured with `CERTCTL_SERVER_URL=http://…` fail on first reconnect after the cutover. The failure surfaces as one of:
- `dial tcp …: connect: connection refused` — the server is no longer listening on a plaintext port. The new release binds only a TLS listener; attempting a plaintext `connect()` gets refused at the kernel level because nothing holds the socket.
- `tls: first record does not look like a TLS handshake` — depending on timing and proxy layers (e.g., a load balancer that accepts the TCP connection before forwarding), the client may negotiate TCP, send an HTTP request line, and have the server's TLS stack reject it.
Agents in this state surface as `Offline` in the dashboard. They stay offline until their env block is updated and the service restarts. There is no graceful 400-with-migration-URL response because there is no HTTP listener to serve one from — the entire plaintext call path is removed by design.
If you see an unexpected agent stay `Offline` past the cutover window, SSH to the host and check the agent log. On a systemd host:
```
journalctl -u certctl-agent -n 100
```
Look for `URL scheme "http" is not supported: HTTPS-only control plane refuses to start (see docs/upgrade-to-tls.md)`. That's the pre-flight rejection. Update `CERTCTL_SERVER_URL`, restart the service, and the agent reconnects.
## Rollback
**There is no rollback window.** The upgrade is a one-way door. The rationale lives in §3.7 of `prompts/https-everywhere-milestone.md`: a cert-lifecycle product that bridges back to plaintext after committing to HTTPS is advertising that its own security posture is negotiable.
If you need to revert, you have two options:
1. **Stay on the pre-HTTPS release.** Do not upgrade until you are ready to run HTTPS on the control plane. Pin your `docker-compose.yml` or `helm upgrade` command to the last pre-v2.2 tag.
2. **Rollback the release.** `helm rollback certctl <previous-revision>` or `git checkout <previous-tag> && docker compose up -d --build`. This rolls back the server, the compose topology, and the Helm chart in lockstep. Your PostgreSQL volume — cert inventory, audit trail, jobs — survives the rollback; nothing in this milestone changes the database schema.
Option 2 drops you back to the plaintext world. It should be treated as an emergency measure, not a supported migration path.
## After the cutover
Once every agent is `Online`, confirm a few invariants:
- `curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:8443/health` returns `000` with `Connection refused` (no HTTP listener). Plaintext is gone.
- `openssl s_client -connect localhost:8443 -tls1_2 </dev/null` fails the handshake. TLS 1.2 is rejected.
- `openssl s_client -connect localhost:8443 -tls1_3 </dev/null` succeeds and prints the server's SAN list. TLS 1.3 is live.
- A cert rotation test: overwrite the server cert on disk, `kill -HUP` the server PID, confirm the new cert serves on the next `openssl s_client -connect … -showcerts` without a process restart. See the SIGHUP section in [`tls.md`](tls.md).
Update your runbooks. Every `http://certctl.example.com` URL in internal documentation, monitoring config, and on-call playbooks should become `https://certctl.example.com` plus a CA-trust note.
## Related docs
- [`tls.md`](tls.md) — cert provisioning patterns, SIGHUP rotation, troubleshooting
- [`quickstart.md`](quickstart.md) — docker-compose walkthrough (post-HTTPS)
- [`test-env.md`](test-env.md) — integration test environment (HTTPS-only)
- Milestone spec: `prompts/https-everywhere-milestone.md`
+1 -1
View File
@@ -107,7 +107,7 @@ The demo seeds certificates across multiple issuers, agents, and deployment targ
```bash
git clone https://github.com/shankar0123/certctl.git
cd certctl/deploy && docker compose up -d
# Dashboard at http://localhost:8443
# Dashboard at https://localhost:8443 (self-signed cert — pin deploy/test/certs/ca.crt)
```
See the [Quickstart Guide](quickstart.md) for a full walkthrough, or explore the [5 turnkey examples](../examples/) for specific scenarios (ACME+NGINX, wildcard DNS-01, private CA+Traefik, step-ca+HAProxy, multi-issuer).
+8 -1
View File
@@ -36,6 +36,13 @@ flowchart TD
If you don't have a real domain or can't open port 80, see [Customization Tips](#customization-tips) below.
## TLS Security
certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
- Use `curl -k ...` for quick smoke tests (never in production)
- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
## Quick Start
### 1. Clone or copy this example
@@ -122,7 +129,7 @@ docker compose logs -f certctl-server certctl-agent
### 5. Access the dashboard
Navigate to `http://localhost:8443` (or your `SERVER_PORT`)
Navigate to `https://localhost:8443` (or your `SERVER_PORT`)
You should see:
- An empty certificate inventory (no certs issued yet)
+1 -1
View File
@@ -61,7 +61,7 @@ services:
networks:
- certctl-network
healthcheck:
test: ['CMD-SHELL', 'curl -sf http://localhost:8443/health || exit 1']
test: ['CMD-SHELL', 'curl -sfk https://localhost:8443/health || exit 1']
interval: 10s
timeout: 5s
retries: 3
@@ -9,6 +9,13 @@ This example is ideal for:
- Internal PKI with public DNS names
- Scenarios where you have programmatic access to your DNS provider's API
## TLS Security
certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
- Use `curl -k ...` for quick smoke tests (never in production)
- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
## Prerequisites
Before running this example, you need:
@@ -74,7 +81,7 @@ This starts:
### Step 5: Access the Dashboard
Open your browser to `http://localhost:8443`
Open your browser to `https://localhost:8443`
### Step 6: Create a Wildcard Certificate
@@ -113,7 +113,7 @@ services:
- certctl-network
healthcheck:
test: ['CMD-SHELL', 'curl -sf http://localhost:8443/health || exit 1']
test: ['CMD-SHELL', 'curl -sfk https://localhost:8443/health || exit 1']
interval: 10s
timeout: 5s
retries: 3
+1 -1
View File
@@ -64,7 +64,7 @@ services:
networks:
- certctl-network
healthcheck:
test: ['CMD-SHELL', 'curl -sf http://localhost:8443/health || exit 1']
test: ['CMD-SHELL', 'curl -sfk https://localhost:8443/health || exit 1']
interval: 10s
timeout: 5s
retries: 3
+8 -1
View File
@@ -45,6 +45,13 @@ flowchart TD
- **Domain for ACME** (optional) — if using real Let's Encrypt, not needed for demo
- **Internet connectivity** — to reach Let's Encrypt's API (demo can use staging directory)
## TLS Security
certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
- Use `curl -k ...` for quick smoke tests (never in production)
- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
## Quick Start
### 1. Clone or navigate to this directory
@@ -83,7 +90,7 @@ This spins up:
### 4. Access the dashboard
Open your browser to **http://localhost:8443** (or your configured SERVER_PORT)
Open your browser to **https://localhost:8443** (or your configured SERVER_PORT)
You should see:
- Empty cert inventory (fresh start)
@@ -77,7 +77,7 @@ services:
networks:
- certctl-network
healthcheck:
test: ['CMD-SHELL', 'curl -sf http://localhost:8443/health || exit 1']
test: ['CMD-SHELL', 'curl -sfk https://localhost:8443/health || exit 1']
interval: 10s
timeout: 5s
retries: 3
@@ -29,6 +29,13 @@ flowchart TD
C -->|TLS handshakes| D
```
## TLS Security
certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
- Use `curl -k ...` for quick smoke tests (never in production)
- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
## Quick Start (Self-Signed CA)
The simplest way to get running in 2 minutes:
@@ -58,7 +65,7 @@ EOF
docker compose up -d
# 4. Access the dashboards
# - certctl: http://localhost:8443 (API only, use the CLI or direct HTTP calls)
# - certctl: https://localhost:8443 (API only, use the CLI or direct HTTP calls)
# - Traefik dashboard: http://localhost:8080
```
@@ -112,7 +119,7 @@ Once the stack is running:
```bash
# 1. Create a certificate profile in certctl (defines allowed key types, TTL, etc.)
curl -X POST http://localhost:8443/api/v1/profiles \
curl -X POST https://localhost:8443/api/v1/profiles \
-H "Content-Type: application/json" \
-d '{
"id": "prof-internal",
@@ -123,7 +130,7 @@ curl -X POST http://localhost:8443/api/v1/profiles \
}'
# 2. Create a renewal policy (defines issuer, renewal thresholds, etc.)
curl -X POST http://localhost:8443/api/v1/policies \
curl -X POST https://localhost:8443/api/v1/policies \
-H "Content-Type: application/json" \
-d '{
"id": "pol-internal",
@@ -135,7 +142,7 @@ curl -X POST http://localhost:8443/api/v1/policies \
}'
# 3. Create a certificate (triggers issuance immediately)
curl -X POST http://localhost:8443/api/v1/certificates \
curl -X POST https://localhost:8443/api/v1/certificates \
-H "Content-Type: application/json" \
-d '{
"common_name": "api.internal.local",
@@ -144,7 +151,7 @@ curl -X POST http://localhost:8443/api/v1/certificates \
}'
# 4. Create a Traefik target (agent will deploy to this)
curl -X POST http://localhost:8443/api/v1/targets \
curl -X POST https://localhost:8443/api/v1/targets \
-H "Content-Type: application/json" \
-d '{
"id": "target-traefik-01",
@@ -156,7 +163,7 @@ curl -X POST http://localhost:8443/api/v1/targets \
}'
# 5. Create a deployment job (agent picks this up and deploys)
curl -X POST http://localhost:8443/api/v1/certificates/{cert-id}/deploy \
curl -X POST https://localhost:8443/api/v1/certificates/{cert-id}/deploy \
-H "Content-Type: application/json" \
-d '{
"target_ids": ["target-traefik-01"]
@@ -209,16 +216,16 @@ The server provides a REST API on port 8443. Example queries:
```bash
# List all certificates
curl http://localhost:8443/api/v1/certificates
curl https://localhost:8443/api/v1/certificates
# Check certificate status
curl http://localhost:8443/api/v1/certificates/{cert-id}
curl https://localhost:8443/api/v1/certificates/{cert-id}
# View audit trail
curl http://localhost:8443/api/v1/audit
curl https://localhost:8443/api/v1/audit
# Check renewal policy compliance
curl http://localhost:8443/api/v1/policies/{policy-id}
curl https://localhost:8443/api/v1/policies/{policy-id}
```
### Traefik Dashboard
@@ -290,7 +297,7 @@ Changes are picked up automatically (file watcher enabled).
docker compose logs certctl-agent | grep heartbeat
# Check deployment job status
curl http://localhost:8443/api/v1/jobs | jq '.[] | select(.type == "Deployment")'
curl https://localhost:8443/api/v1/jobs | jq '.[] | select(.type == "Deployment")'
# Check Traefik is watching the directory
docker compose exec traefik ls -la /etc/traefik/certs/
+1 -1
View File
@@ -119,7 +119,7 @@ services:
networks:
- certctl-network
healthcheck:
test: ['CMD-SHELL', 'curl -sf http://localhost:8443/health || exit 1']
test: ['CMD-SHELL', 'curl -sfk https://localhost:8443/health || exit 1']
interval: 10s
timeout: 5s
retries: 3
+15 -8
View File
@@ -48,6 +48,13 @@ Monitor logs:
docker compose logs -f certctl-server
```
## TLS Security
certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
- Use `curl -k ...` for quick smoke tests (never in production)
- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
Wait for all services to reach healthy state:
```bash
@@ -69,7 +76,7 @@ certctl-haproxy-... healthy
Open your browser to:
```
http://localhost:8443
https://localhost:8443
```
You should see an empty dashboard. This is expected — no certificates issued yet.
@@ -79,7 +86,7 @@ You should see an empty dashboard. This is expected — no certificates issued y
This defines what certificates certctl can issue (key algorithm, max TTL, allowed names).
```bash
curl -X POST http://localhost:8443/api/v1/profiles \
curl -X POST https://localhost:8443/api/v1/profiles \
-H 'Content-Type: application/json' \
-d '{
"name": "internal-web",
@@ -94,7 +101,7 @@ curl -X POST http://localhost:8443/api/v1/profiles \
This tells certctl where to deploy certificates on the HAProxy server.
```bash
curl -X POST http://localhost:8443/api/v1/targets \
curl -X POST https://localhost:8443/api/v1/targets \
-H 'Content-Type: application/json' \
-d '{
"name": "haproxy-01",
@@ -115,7 +122,7 @@ Note: In the Docker Compose environment, reload command can be `kill -HUP $(pido
This ties a certificate profile to a deployment target and sets renewal thresholds.
```bash
curl -X POST http://localhost:8443/api/v1/renewal-policies \
curl -X POST https://localhost:8443/api/v1/renewal-policies \
-H 'Content-Type: application/json' \
-d '{
"name": "haproxy-internal-web",
@@ -130,7 +137,7 @@ curl -X POST http://localhost:8443/api/v1/renewal-policies \
Get the issuer ID:
```bash
curl http://localhost:8443/api/v1/issuers | jq '.'
curl https://localhost:8443/api/v1/issuers | jq '.'
```
You should see `iss-stepca` in the list.
@@ -140,7 +147,7 @@ You should see `iss-stepca` in the list.
Request a certificate via the API. The server will sign it via step-ca.
```bash
curl -X POST http://localhost:8443/api/v1/certificates \
curl -X POST https://localhost:8443/api/v1/certificates \
-H 'Content-Type: application/json' \
-d '{
"common_name": "api.internal.example.com",
@@ -155,7 +162,7 @@ curl -X POST http://localhost:8443/api/v1/certificates \
Get the certificate ID and trigger deployment:
```bash
curl -X POST http://localhost:8443/api/v1/certificates/<cert_id>/deploy \
curl -X POST https://localhost:8443/api/v1/certificates/<cert_id>/deploy \
-H 'Content-Type: application/json' \
-d '{
"target_id": "<target_id_from_step_4>"
@@ -171,7 +178,7 @@ The agent will:
### 8. Verify in Dashboard
Refresh http://localhost:8443 and you should see:
Refresh https://localhost:8443 and you should see:
- 1 certificate (status: Active, expiry in 90 days)
- 1 deployment job (status: Completed)
- 1 agent (heartbeat: recent)
+40 -2
View File
@@ -75,6 +75,14 @@ EXAMPLES:
--server-url https://certctl.example.com \\
--api-key YOUR_API_KEY
CONTROL-PLANE TLS TRUST:
The certctl server is HTTPS-only as of v2.2. This installer does NOT copy a CA
bundle — the generated agent.env leaves TLS trust to the system root store by
default. If the server uses a private/enterprise or self-signed CA, set
CERTCTL_SERVER_CA_BUNDLE_PATH in the generated agent.env to point at the CA
bundle, or (dev only) CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true. See the
commented block in the generated agent.env for the full menu.
EOF
}
@@ -322,7 +330,7 @@ setup_linux_config() {
# Agent ID (unique identifier in the fleet)
CERTCTL_AGENT_ID=$AGENT_ID
# Control plane server URL
# Control plane server URL (HTTPS-only as of v2.2)
CERTCTL_SERVER_URL=$SERVER_URL
# API authentication key
@@ -334,6 +342,21 @@ CERTCTL_KEYGEN_MODE=agent
# Key storage directory (agent-side keygen)
CERTCTL_KEY_DIR=$key_dir
# ---- Control-plane TLS trust ----
# The certctl server is HTTPS-only (v2.2+). The agent's HTTP client MUST trust the
# server's certificate chain. Pick ONE of the approaches below:
#
# 1) Public CA (Let's Encrypt, DigiCert, etc.) — no config needed; system trust store works.
# 2) Private / enterprise CA — point the agent at the CA bundle that signed the server cert:
# CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/server-ca.crt
#
# 3) Self-signed server cert (Helm/compose bootstrap) — same env var, just point at the
# extracted self-signed CA bundle (e.g. from the certctl-server-tls Kubernetes secret
# via: kubectl get secret certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d).
#
# 4) Dev/eval only — disable verification entirely (NEVER do this in production):
# CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true
# Logging level (debug, info, warn, error)
# CERTCTL_LOG_LEVEL=info
@@ -373,7 +396,7 @@ setup_macos_config() {
# Agent ID (unique identifier in the fleet)
CERTCTL_AGENT_ID=$AGENT_ID
# Control plane server URL
# Control plane server URL (HTTPS-only as of v2.2)
CERTCTL_SERVER_URL=$SERVER_URL
# API authentication key
@@ -385,6 +408,21 @@ CERTCTL_KEYGEN_MODE=agent
# Key storage directory (agent-side keygen)
CERTCTL_KEY_DIR=$key_dir
# ---- Control-plane TLS trust ----
# The certctl server is HTTPS-only (v2.2+). The agent's HTTP client MUST trust the
# server's certificate chain. Pick ONE of the approaches below:
#
# 1) Public CA (Let's Encrypt, DigiCert, etc.) — no config needed; system trust store works.
# 2) Private / enterprise CA — point the agent at the CA bundle that signed the server cert:
# CERTCTL_SERVER_CA_BUNDLE_PATH=$HOME/.certctl/server-ca.crt
#
# 3) Self-signed server cert (Helm/compose bootstrap) — same env var, just point at the
# extracted self-signed CA bundle (e.g. from the certctl-server-tls Kubernetes secret
# via: kubectl get secret certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d).
#
# 4) Dev/eval only — disable verification entirely (NEVER do this in production):
# CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true
# Logging level (debug, info, warn, error)
# CERTCTL_LOG_LEVEL=info
+24 -18
View File
@@ -247,26 +247,30 @@ func TestGetCertificateVersions_MultiSegment(t *testing.T) {
}
// TestHandleOCSP_MultiSegment exercises the OCSP responder's 2-segment path
// parser (/api/v1/ocsp/{issuer_id}/{serial_hex}). Each leg is attacker-
// controlled and the serial can be arbitrary length. This is a key adversarial
// surface because the serial is passed directly to the CA-operations service,
// which is expected to treat it as an opaque identifier.
// parser (/.well-known/pki/ocsp/{issuer_id}/{serial_hex}). Each leg is
// attacker-controlled and the serial can be arbitrary length. This is a key
// adversarial surface because the serial is passed directly to the
// CA-operations service, which is expected to treat it as an opaque
// identifier.
//
// M-006 relocation: these paths were previously served at /api/v1/ocsp/*;
// under RFC 8615 and RFC 6960 they now live under /.well-known/pki/ocsp/*.
func TestHandleOCSP_MultiSegment(t *testing.T) {
cases := []struct {
name string
path string
}{
{"missing_serial", "/api/v1/ocsp/iss-local"},
{"missing_both", "/api/v1/ocsp/"},
{"empty_issuer", "/api/v1/ocsp//01ABCDEF"},
{"empty_serial", "/api/v1/ocsp/iss-local/"},
{"traversal_issuer", "/api/v1/ocsp/..%2F..%2Fetc/passwd/01"},
{"null_byte_serial", "/api/v1/ocsp/iss-local/01\x00FF"},
{"sql_injection_serial", "/api/v1/ocsp/iss-local/01'; DROP TABLE--"},
{"negative_hex_serial", "/api/v1/ocsp/iss-local/-1"},
{"unicode_serial", "/api/v1/ocsp/iss-local/01\u2010FF"},
{"extremely_long_serial", "/api/v1/ocsp/iss-local/" + strings.Repeat("F", 10000)},
{"extra_segments", "/api/v1/ocsp/iss-local/01FF/extra/segments"},
{"missing_serial", "/.well-known/pki/ocsp/iss-local"},
{"missing_both", "/.well-known/pki/ocsp/"},
{"empty_issuer", "/.well-known/pki/ocsp//01ABCDEF"},
{"empty_serial", "/.well-known/pki/ocsp/iss-local/"},
{"traversal_issuer", "/.well-known/pki/ocsp/..%2F..%2Fetc/passwd/01"},
{"null_byte_serial", "/.well-known/pki/ocsp/iss-local/01\x00FF"},
{"sql_injection_serial", "/.well-known/pki/ocsp/iss-local/01'; DROP TABLE--"},
{"negative_hex_serial", "/.well-known/pki/ocsp/iss-local/-1"},
{"unicode_serial", "/.well-known/pki/ocsp/iss-local/01\u2010FF"},
{"extremely_long_serial", "/.well-known/pki/ocsp/iss-local/" + strings.Repeat("F", 10000)},
{"extra_segments", "/.well-known/pki/ocsp/iss-local/01FF/extra/segments"},
}
for _, tc := range cases {
@@ -301,7 +305,9 @@ func TestHandleOCSP_MultiSegment(t *testing.T) {
}
}
// TestGetDERCRL_IssuerPathInjection exercises /api/v1/crl/{issuer_id}.
// TestGetDERCRL_IssuerPathInjection exercises
// /.well-known/pki/crl/{issuer_id} (RFC 5280 CRL; M-006 relocation from
// /api/v1/crl/{issuer_id}).
func TestGetDERCRL_IssuerPathInjection(t *testing.T) {
for _, tc := range adversarialPathInputs() {
t.Run(tc.name, func(t *testing.T) {
@@ -316,8 +322,8 @@ func TestGetDERCRL_IssuerPathInjection(t *testing.T) {
return nil, ErrMockNotFound
}
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl/x", nil)
req.URL.Path = "/api/v1/crl/" + tc.input
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/x", nil)
req.URL.Path = "/.well-known/pki/crl/" + tc.input
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
+58 -9
View File
@@ -3,12 +3,14 @@ package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"strconv"
"strings"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// AgentGroupService defines the service interface for agent group operations.
@@ -22,6 +24,19 @@ type AgentGroupService interface {
}
// AgentGroupHandler handles HTTP requests for agent group operations.
//
// Error dispatch (post-M-1): every service error routes through the [errToStatus]
// choke point via `errors.Is` walking the wrap chain, with one explicit
// [service.ErrValidation] arm on the write paths (Create, Update) so the
// composed "validation: <field-specific reason>" message the service layer
// attaches via `fmt.Errorf("%w: ...", ErrValidation)` can be passed through to
// the 400 response body. Before M-1, the Create handler branched on
// `strings.Contains(err.Error(), "invalid"|"required")` — fragile because a
// single reword in [service.validateAgentGroup] would demote the 400 to 500
// with no compile-time signal — and the Update/Delete handlers branched on
// `strings.Contains(err.Error(), "not found")`, coupling HTTP classification
// to repository human-readable strings. Both now ride the typed
// [repository.ErrNotFound] wrap through errToStatus. Mirrors ProfileHandler.
type AgentGroupHandler struct {
svc AgentGroupService
}
@@ -89,7 +104,18 @@ func (h AgentGroupHandler) GetAgentGroup(w http.ResponseWriter, r *http.Request)
group, err := h.svc.GetAgentGroup(r.Context(), id)
if err != nil {
ErrorWithRequestID(w, http.StatusNotFound, "Agent group not found", requestID)
// M-1: route through errToStatus so a repo-level `sql.ErrNoRows`
// (wrapped as repository.ErrNotFound) becomes 404, but a transient DB
// failure no longer masquerades as 404 — it correctly surfaces 500. The
// pre-M-1 "any error → 404" shortcut was plausible when Get's only
// expected failure was "not found", but the choke point now gives us
// correct dispatch for free. Mirrors GetProfile.
status := errToStatus(err)
msg := "Failed to get agent group"
if status == http.StatusNotFound {
msg = "Agent group not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -123,7 +149,15 @@ func (h AgentGroupHandler) CreateAgentGroup(w http.ResponseWriter, r *http.Reque
created, err := h.svc.CreateAgentGroup(r.Context(), group)
if err != nil {
if strings.Contains(err.Error(), "invalid") || strings.Contains(err.Error(), "required") {
// M-1: replace the 2-term substring net (`"invalid"|"required"`) with a
// single `errors.Is(err, service.ErrValidation)` arm. validateAgentGroup
// wraps every field-specific failure via `fmt.Errorf("%w: <reason>",
// ErrValidation)`, so `err.Error()` still contains the human-readable
// reason (e.g., "agent group name is required") and can be safely passed
// to the 400 body — but the status decision no longer depends on the
// exact wording. Other errors redact to a generic 500. Mirrors
// CreateProfile.
if errors.Is(err, service.ErrValidation) {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
@@ -160,11 +194,22 @@ func (h AgentGroupHandler) UpdateAgentGroup(w http.ResponseWriter, r *http.Reque
updated, err := h.svc.UpdateAgentGroup(r.Context(), id, group)
if err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Agent group not found", requestID)
// M-1: explicit ErrValidation arm preserves the user-facing reason in
// the 400 body (validateAgentGroup wraps every failure via
// `fmt.Errorf("%w: ...", ErrValidation)`); every other error — including
// repo-layer ErrNotFound on a missing row — routes through errToStatus
// so the 404/500 decision no longer depends on substring matching.
// Mirrors UpdateProfile.
if errors.Is(err, service.ErrValidation) {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to update agent group", requestID)
status := errToStatus(err)
msg := "Failed to update agent group"
if status == http.StatusNotFound {
msg = "Agent group not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -188,11 +233,15 @@ func (h AgentGroupHandler) DeleteAgentGroup(w http.ResponseWriter, r *http.Reque
}
if err := h.svc.DeleteAgentGroup(r.Context(), id); err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Agent group not found", requestID)
return
// M-1: sentinel dispatch replaces the substring 404 check — see the
// parallel comment block in UpdateAgentGroup for the rationale. Mirrors
// DeleteProfile.
status := errToStatus(err)
msg := "Failed to delete agent group"
if status == http.StatusNotFound {
msg = "Agent group not found"
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to delete agent group", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -10,6 +10,7 @@ import (
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// MockAgentService is a mock implementation of AgentService interface.
@@ -24,6 +25,11 @@ type MockAgentService struct {
GetWorkFn func(agentID string) ([]domain.Job, error)
GetWorkWithTargetsFn func(agentID string) ([]domain.WorkItem, error)
UpdateJobStatusFn func(agentID string, jobID string, status string, errMsg string) error
// I-004: soft-retirement hooks. Tests that don't set these receive nil
// results and nil errors, which mirrors the safest default (no-op) for
// unrelated suites that mock only the legacy surface.
RetireAgentFn func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error)
ListRetiredAgentsFn func(page, perPage int) ([]domain.Agent, int64, error)
}
func (m *MockAgentService) ListAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) {
@@ -96,6 +102,25 @@ func (m *MockAgentService) UpdateJobStatus(_ context.Context, agentID string, jo
return nil
}
// RetireAgent is the I-004 soft-retirement entrypoint. Tests that don't set
// RetireAgentFn get a nil result + nil error, which is a no-op response that
// lets unrelated suites compile without caring about the retirement surface.
func (m *MockAgentService) RetireAgent(_ context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if m.RetireAgentFn != nil {
return m.RetireAgentFn(agentID, actor, force, reason)
}
return nil, nil
}
// ListRetiredAgents returns retired rows for the retired-agents tab / audit
// views. Same zero-value default as RetireAgent for unrelated tests.
func (m *MockAgentService) ListRetiredAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) {
if m.ListRetiredAgentsFn != nil {
return m.ListRetiredAgentsFn(page, perPage)
}
return nil, 0, nil
}
// Test ListAgents - success case
func TestListAgents_Success(t *testing.T) {
now := time.Now()
@@ -0,0 +1,393 @@
package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"net/http/httptest"
"testing"
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// agentRetireTestSetup builds an AgentHandler with a mock AgentService whose
// RetireAgent / ListRetiredAgents / Heartbeat behavior is driven by the
// returned mock. Keeps every I-004 handler test self-contained so a single
// failing assertion can't cascade through a shared fixture.
func agentRetireTestSetup() (*MockAgentService, AgentHandler) {
mock := &MockAgentService{}
handler := NewAgentHandler(mock)
return mock, handler
}
// TestRetireAgentHandler_Success_200 pins the happy-path contract for the
// soft-retirement HTTP surface: DELETE /api/v1/agents/{id} with no dependency
// fallout returns 200 OK and a JSON body echoing retirement metadata
// (retired_at timestamp, already_retired=false, cascade=false, zero counts).
// Operators building dashboards parse these fields; keep the shape stable.
func TestRetireAgentHandler_Success_200(t *testing.T) {
retiredAt := time.Date(2026, 4, 18, 12, 0, 0, 0, time.UTC)
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if agentID != "a-prod-001" {
t.Fatalf("retire handler received agentID=%q want a-prod-001", agentID)
}
if force {
t.Fatalf("retire handler set force=true unexpectedly; default path must be force=false")
}
return &service.AgentRetirementResult{
AlreadyRetired: false,
Cascade: false,
RetiredAt: retiredAt,
Counts: domain.AgentDependencyCounts{},
}, nil
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var body struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode 200 body: %v", err)
}
if !body.RetiredAt.Equal(retiredAt) {
t.Errorf("retired_at=%v want %v", body.RetiredAt, retiredAt)
}
if body.AlreadyRetired {
t.Errorf("already_retired=true want false on clean retire")
}
if body.Cascade {
t.Errorf("cascade=true want false on clean retire")
}
}
// TestRetireAgentHandler_AlreadyRetired_204 covers the idempotent contract: a
// retire call against an already-retired agent completes with 204 No Content
// (no body). This lets operators safely re-issue the DELETE after a network
// blip without fearing duplicate audit events or state mutations.
func TestRetireAgentHandler_AlreadyRetired_204(t *testing.T) {
mock, handler := agentRetireTestSetup()
past := time.Now().Add(-24 * time.Hour)
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return &service.AgentRetirementResult{
AlreadyRetired: true,
Cascade: false,
RetiredAt: past,
Counts: domain.AgentDependencyCounts{},
}, nil
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusNoContent {
t.Fatalf("status=%d body=%s want 204", w.Code, w.Body.String())
}
// 204 No Content must have zero body. If anything leaks through, downstream
// clients (curl scripts, dashboards) break.
if w.Body.Len() != 0 {
t.Errorf("204 body=%q want empty", w.Body.String())
}
}
// TestRetireAgentHandler_Sentinel_403 covers the hard guard against retiring
// any of the four sentinel agents that back discovery sources and the
// network scanner. These IDs are reserved; the handler must surface the
// service-layer ErrAgentIsSentinel as 403 Forbidden regardless of force/reason
// because no operator intent can legitimately retire them.
func TestRetireAgentHandler_Sentinel_403(t *testing.T) {
sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}
for _, id := range sentinels {
t.Run(id, func(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, service.ErrAgentIsSentinel
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/"+id, nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusForbidden {
t.Fatalf("sentinel %q status=%d body=%s want 403", id, w.Code, w.Body.String())
}
})
}
}
// TestRetireAgentHandler_NotFound_404 covers the lookup-miss path. Service
// returns a not-found error; handler maps to 404. Keeping the error
// discrimination at the service layer (sentinel errors.Is) rather than string
// matching is the whole point of wrapping.
func TestRetireAgentHandler_NotFound_404(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, errors.New("agent not found")
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/unknown-id", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusNotFound {
t.Fatalf("status=%d body=%s want 404", w.Code, w.Body.String())
}
}
// TestRetireAgentHandler_Blocked_409_WithCounts covers the preflight-blocked
// path. Service returns *BlockedByDependenciesError wrapping
// ErrBlockedByDependencies; handler unwraps via errors.As, maps to 409, and
// MUST include the counts in the response body so operators know what's
// blocking them. Without counts the 409 is useless — the operator has to
// guess which downstream dependency is holding up the retirement.
func TestRetireAgentHandler_Blocked_409_WithCounts(t *testing.T) {
mock, handler := agentRetireTestSetup()
blockCounts := domain.AgentDependencyCounts{
ActiveTargets: 3,
ActiveCertificates: 7,
PendingJobs: 2,
}
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, &service.BlockedByDependenciesError{Counts: blockCounts}
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusConflict {
t.Fatalf("status=%d body=%s want 409", w.Code, w.Body.String())
}
var body struct {
Error string `json:"error"`
Message string `json:"message"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode 409 body: %v", err)
}
if body.Counts.ActiveTargets != 3 {
t.Errorf("counts.active_targets=%d want 3", body.Counts.ActiveTargets)
}
if body.Counts.ActiveCertificates != 7 {
t.Errorf("counts.active_certificates=%d want 7", body.Counts.ActiveCertificates)
}
if body.Counts.PendingJobs != 2 {
t.Errorf("counts.pending_jobs=%d want 2", body.Counts.PendingJobs)
}
if body.Message == "" {
t.Errorf("409 body missing human-readable message; operators need guidance")
}
}
// TestRetireAgentHandler_Force_NoReason_400 covers the force-escape-hatch
// guardrail: force=true without a non-empty reason must be rejected at the
// handler seam BEFORE the service performs any DB work, because a
// reason-less cascade is unauditable. Service returns ErrForceReasonRequired;
// handler maps to 400.
func TestRetireAgentHandler_Force_NoReason_400(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if !force {
t.Fatalf("handler did not forward force=true; force query param was dropped")
}
if reason != "" {
t.Fatalf("handler passed reason=%q; empty reason must reach service for error path", reason)
}
return nil, service.ErrForceReasonRequired
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001?force=true", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("status=%d body=%s want 400", w.Code, w.Body.String())
}
}
// TestRetireAgentHandler_ForceCascade_200 covers the successful force-cascade
// path: DELETE ?force=true&reason=... → service executes transactional
// cascade → 200 with cascade=true and the pre-cascade counts echoed back so
// the operator's confirmation dialog can show "I just retired N targets,
// M certificates, K pending jobs."
func TestRetireAgentHandler_ForceCascade_200(t *testing.T) {
mock, handler := agentRetireTestSetup()
retiredAt := time.Date(2026, 4, 18, 14, 30, 0, 0, time.UTC)
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if !force {
t.Fatalf("handler did not forward force=true; query-param parsing broken")
}
if reason != "decommissioning rack 7" {
t.Fatalf("handler forwarded reason=%q want %q", reason, "decommissioning rack 7")
}
return &service.AgentRetirementResult{
AlreadyRetired: false,
Cascade: true,
RetiredAt: retiredAt,
Counts: domain.AgentDependencyCounts{
ActiveTargets: 2,
ActiveCertificates: 5,
PendingJobs: 1,
},
}, nil
}
url := "/api/v1/agents/a-prod-001?force=true&reason=decommissioning+rack+7"
req := httptest.NewRequest(http.MethodDelete, url, nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var body struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode force-cascade 200 body: %v", err)
}
if !body.Cascade {
t.Errorf("cascade=false want true on ?force=true successful retire")
}
if body.Counts.ActiveTargets != 2 || body.Counts.ActiveCertificates != 5 || body.Counts.PendingJobs != 1 {
t.Errorf("counts=%+v want {ActiveTargets:2 ActiveCertificates:5 PendingJobs:1}", body.Counts)
}
}
// TestHeartbeatHandler_RetiredAgent_410 covers the agent-shutdown signal. A
// retired agent that is still polling must be told its identity is gone
// (410 Gone) rather than offered the normal 200 "recorded" response.
// cmd/agent treats 410 as a terminal signal and exits rather than looping
// forever against a decommissioned identity. Service returns ErrAgentRetired;
// handler maps to 410.
func TestHeartbeatHandler_RetiredAgent_410(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.HeartbeatFn = func(agentID string, metadata *domain.AgentMetadata) error {
return service.ErrAgentRetired
}
req := httptest.NewRequest(http.MethodPost, "/api/v1/agents/a-prod-001/heartbeat", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.Heartbeat(w, req)
if w.Code != http.StatusGone {
t.Fatalf("heartbeat(retired) status=%d body=%s want 410", w.Code, w.Body.String())
}
}
// TestListRetiredAgentsHandler_Success covers the audit/forensics-facing
// endpoint GET /api/v1/agents/retired. Returns a paged list of retired rows
// alongside total count so the GUI can render a "Retired Agents" tab with
// pagination. Default listing (GET /agents) hides retired rows; this is the
// opt-in surface for them.
func TestListRetiredAgentsHandler_Success(t *testing.T) {
past := time.Now().Add(-48 * time.Hour)
reason := "old hardware"
retired := []domain.Agent{
{
ID: "agent-retired-01",
Name: "decom-01",
Hostname: "server-old",
Status: domain.AgentStatusOffline,
RegisteredAt: past,
RetiredAt: &past,
RetiredReason: &reason,
},
}
mock, handler := agentRetireTestSetup()
mock.ListRetiredAgentsFn = func(page, perPage int) ([]domain.Agent, int64, error) {
if page != 1 || perPage != 50 {
t.Fatalf("ListRetired handler received page=%d perPage=%d want 1/50 defaults", page, perPage)
}
return retired, 1, nil
}
req := httptest.NewRequest(http.MethodGet, "/api/v1/agents/retired", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.ListRetiredAgents(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var response PagedResponse
if err := json.NewDecoder(w.Body).Decode(&response); err != nil {
t.Fatalf("decode list-retired body: %v", err)
}
if response.Total != 1 {
t.Errorf("total=%d want 1", response.Total)
}
}
// TestRetireAgentHandler_MethodNotAllowed covers defense-in-depth: only
// DELETE is valid on /api/v1/agents/{id} for retirement. Using POST/PUT/PATCH
// must be rejected with 405 so misconfigured callers don't accidentally
// trigger retirement via a wrong-method request.
func TestRetireAgentHandler_MethodNotAllowed(t *testing.T) {
_, handler := agentRetireTestSetup()
for _, method := range []string{http.MethodPost, http.MethodPut, http.MethodPatch} {
t.Run(method, func(t *testing.T) {
req := httptest.NewRequest(method, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusMethodNotAllowed {
t.Fatalf("method=%s status=%d want 405", method, w.Code)
}
})
}
}
// Compile-time asserts: the mock must satisfy the handler's AgentService
// interface. Red state: this fails until the interface grows RetireAgent +
// ListRetiredAgents. Once Phase 2b adds those methods to AgentService, this
// assertion goes green along with every test above.
var _ AgentService = (*MockAgentService)(nil)
// Unused-import suppressor for context — the package-level tests already
// pull context from agent_handler_test.go, but leaving this here documents
// that the mock methods receive context.Context values even though this
// file's tests don't construct them directly (they ride on httptest.NewRequest).
var _ = context.Background
+251 -5
View File
@@ -3,16 +3,24 @@ package handler
import (
"context"
"encoding/json"
"errors"
"log/slog"
"net/http"
"strconv"
"strings"
"time"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// AgentService defines the service interface for agent operations.
//
// I-004 expansion: RetireAgent + ListRetiredAgents back the soft-retirement
// surface. The handler depends on the service-package's AgentRetirementResult
// and BlockedByDependenciesError types for result shape + errors.As unwrap,
// which is why this file imports internal/service.
type AgentService interface {
ListAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error)
GetAgent(ctx context.Context, id string) (*domain.Agent, error)
@@ -24,6 +32,10 @@ type AgentService interface {
GetWork(ctx context.Context, agentID string) ([]domain.Job, error)
GetWorkWithTargets(ctx context.Context, agentID string) ([]domain.WorkItem, error)
UpdateJobStatus(ctx context.Context, agentID string, jobID string, status string, errMsg string) error
// I-004 soft-retirement API. Both default to no-op (nil result / nil error)
// in mocks that don't override them — handler tests opt in per suite.
RetireAgent(ctx context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error)
ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error)
}
// AgentHandler handles HTTP requests for agent operations.
@@ -96,7 +108,19 @@ func (h AgentHandler) GetAgent(w http.ResponseWriter, r *http.Request) {
agent, err := h.svc.GetAgent(r.Context(), id)
if err != nil {
ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
// M-1 (P2): route through errToStatus so a repo-level
// sql.ErrNoRows (wrapped as repository.ErrNotFound) becomes 404,
// but a transient DB failure no longer masquerades as 404 — it
// correctly surfaces 500. The pre-M-1 "any error → 404" shortcut
// was plausible when Get's only expected failure was "not found",
// but the choke point now gives us correct dispatch for free.
// Mirrors GetAgentGroup / GetProfile.
status := errToStatus(err)
msg := "Failed to get agent"
if status == http.StatusNotFound {
msg = "Agent not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -135,11 +159,20 @@ func (h AgentHandler) RegisterAgent(w http.ResponseWriter, r *http.Request) {
created, err := h.svc.RegisterAgent(r.Context(), agent)
if err != nil {
errMsg := err.Error()
if strings.Contains(errMsg, "unique") || strings.Contains(errMsg, "duplicate") || strings.Contains(errMsg, "already exists") {
// M-1 (P2): replace the 3-term substring net
// (`"unique"|"duplicate"|"already exists"`) with a typed
// errors.Is(err, service.ErrConflict) arm. The service layer now
// wraps pg SQLSTATE 23505 duplicate-name violations via
// fmt.Errorf("%w: agent name already exists", ErrConflict), so
// classification no longer depends on the exact driver wording.
// Other errors redact to a generic 500 with slog.Error server-
// side diagnostic capture (F-002). Mirrors CreateProfile's
// ErrValidation arm pattern, adapted for the conflict case.
if errors.Is(err, service.ErrConflict) {
ErrorWithRequestID(w, http.StatusConflict, "Agent with this name already exists", requestID)
return
}
slog.Error("RegisterAgent failed", "name", agent.Name, "error", err.Error())
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to register agent", requestID)
return
}
@@ -190,7 +223,24 @@ func (h AgentHandler) Heartbeat(w http.ResponseWriter, r *http.Request) {
}
if err := h.svc.Heartbeat(r.Context(), agentID, metadata); err != nil {
if strings.Contains(err.Error(), "not found") {
// I-004: a retired agent still polling must receive 410 Gone so
// cmd/agent detects the terminal signal and shuts down cleanly
// instead of looping forever against a decommissioned identity.
// Check this FIRST — before "not found" string matching — so the
// retired-path is never masked by a sibling error branch.
if errors.Is(err, service.ErrAgentRetired) {
ErrorWithRequestID(w, http.StatusGone, "Agent has been retired", requestID)
return
}
// M-1 (P2): the pre-M-1 `strings.Contains(err.Error(), "not
// found")` branch now rides the errToStatus choke point, which
// recognizes repository.ErrNotFound via errors.Is. The retired-
// agent sentinel is still checked FIRST above so the 410 Gone
// short-circuit is never masked by the 404 arm. Any other error
// redacts to a generic 500 with slog.Error server-side diagnostic
// capture (F-002). Mirrors GetAgentGroup.
status := errToStatus(err)
if status == http.StatusNotFound {
ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
return
}
@@ -287,7 +337,16 @@ func (h AgentHandler) AgentCertificatePickup(w http.ResponseWriter, r *http.Requ
certPEM, err := h.svc.CertificatePickup(r.Context(), agentID, certID)
if err != nil {
ErrorWithRequestID(w, http.StatusNotFound, "Certificate not found or not ready", requestID)
// M-1 (P2): route through errToStatus so a repo-level
// sql.ErrNoRows (wrapped as repository.ErrNotFound) becomes 404,
// but a transient DB failure no longer masquerades as 404 — it
// correctly surfaces 500. Mirrors GetAgent.
status := errToStatus(err)
msg := "Failed to retrieve certificate"
if status == http.StatusNotFound {
msg = "Certificate not found or not ready"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -376,3 +435,190 @@ func (h AgentHandler) AgentReportJobStatus(w http.ResponseWriter, r *http.Reques
"status": "updated",
})
}
// RetireAgent executes the I-004 soft-retirement surface.
// DELETE /api/v1/agents/{id}[?force=true&reason=...]
//
// Contract (pinned by agent_retire_handler_test.go):
//
// 405 any method other than DELETE
// 200 clean retire (body: retired_at, already_retired=false, cascade=false, counts=0s)
// 200 force-cascade retire (body: cascade=true, counts=pre-cascade snapshot)
// 204 idempotent retire of an already-retired agent (NO body — downstream
// clients that tee responses into dashboards break on spurious bodies)
// 400 force=true without a non-empty reason (ErrForceReasonRequired)
// 403 one of the four reserved sentinel IDs (ErrAgentIsSentinel)
// 404 agent does not exist ("not found" string match, kept for compat with
// repo error strings; sentinel checks run first so they never mask)
// 409 blocked by preflight counts (*BlockedByDependenciesError) — body
// carries the per-bucket counts so the operator UI can tell the
// human which downstream dependency is holding up the retirement,
// rather than forcing them to re-run the DELETE with ?force=true
// and guess
// 500 anything else
//
// The 409 body intentionally does NOT go through ErrorWithRequestID because
// that helper's ErrorResponse shape has no `counts` field — we inline-marshal
// a custom body instead. Keeping this shape stable is important: the GUI
// pattern is "show the 409 dialog, list the N targets / M certs / K jobs
// blocking, let the operator retire them first or tick the force checkbox."
func (h AgentHandler) RetireAgent(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodDelete {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
// Extract {id} from /api/v1/agents/{id}. Mirror GetAgent's pattern so
// the path parser is identical across the agent handler surface and a
// future refactor can extract it once without introducing drift.
rawID := strings.TrimPrefix(r.URL.Path, "/api/v1/agents/")
parts := strings.Split(rawID, "/")
if len(parts) == 0 || parts[0] == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Agent ID is required", requestID)
return
}
id := parts[0]
// Parse optional force + reason. A missing `force` param is treated as
// force=false (the default, safe path); anything strconv.ParseBool rejects
// is also force=false so a malformed query can never silently enable the
// cascade. The reason string is passed through verbatim — the service
// owns the "force=true requires reason" rule.
query := r.URL.Query()
force := false
if fv := query.Get("force"); fv != "" {
if parsed, err := strconv.ParseBool(fv); err == nil {
force = parsed
}
}
reason := query.Get("reason")
actor := resolveActor(r.Context())
result, err := h.svc.RetireAgent(r.Context(), id, actor, force, reason)
if err != nil {
// Sentinel + typed-error checks run BEFORE string matching on "not
// found" so a repo error that happens to contain those words can
// never mask a structural refusal (403/400/409). Order matters.
if errors.Is(err, service.ErrAgentIsSentinel) {
ErrorWithRequestID(w, http.StatusForbidden, "Agent is a reserved sentinel and cannot be retired", requestID)
return
}
if errors.Is(err, service.ErrForceReasonRequired) {
ErrorWithRequestID(w, http.StatusBadRequest, "force=true requires a non-empty reason", requestID)
return
}
var blocked *service.BlockedByDependenciesError
if errors.As(err, &blocked) {
// Custom 409 body with per-bucket counts. ErrorResponse has no
// `counts` field, so we marshal a bespoke struct instead.
// Keep `error`/`message`/`counts` as the stable shape — any
// dashboard parsing this relies on those three keys.
body := struct {
Error string `json:"error"`
Message string `json:"message"`
Counts domain.AgentDependencyCounts `json:"counts"`
}{
Error: "blocked_by_dependencies",
Message: "Agent has active downstream dependencies. Retire or reassign them " +
"first, or re-run with ?force=true&reason=... to cascade.",
Counts: blocked.Counts,
}
JSON(w, http.StatusConflict, body)
return
}
// M-1 (P2): the pre-M-1 `strings.Contains(err.Error(), "not
// found")` branch now rides the errToStatus choke point, which
// recognizes repository.ErrNotFound via errors.Is. The sentinel
// (ErrAgentIsSentinel, ErrForceReasonRequired) and typed
// (*BlockedByDependenciesError) checks above still run FIRST so
// the 403/400/409 structural refusals are never masked by the
// 404 arm. Any other error redacts to a generic 500 with
// slog.Error server-side diagnostic capture (F-002).
status := errToStatus(err)
if status == http.StatusNotFound {
ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
return
}
slog.Error("RetireAgent failed", "agent_id", id, "error", err.Error())
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to retire agent", requestID)
return
}
// Idempotent retire: the agent was already retired, so we return 204 No
// Content with a ZERO-length body. The Red contract (test line 106) fails
// if even a trailing newline leaks into the response. WriteHeader alone
// emits the status without invoking the JSON encoder.
if result.AlreadyRetired {
w.WriteHeader(http.StatusNoContent)
return
}
// Clean retire (force=false) or successful cascade (force=true). Body
// shape pinned by Red contract: retired_at, already_retired, cascade,
// counts. Omitempty is deliberately NOT used — operators parsing the
// response expect every field to always be present.
JSON(w, http.StatusOK, struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}{
RetiredAt: result.RetiredAt,
AlreadyRetired: result.AlreadyRetired,
Cascade: result.Cascade,
Counts: result.Counts,
})
}
// ListRetiredAgents returns the opt-in listing of retired agents for the
// operator UI's "Retired" tab and for audit/forensics workflows.
// GET /api/v1/agents/retired?page=1&per_page=50
//
// The default ListAgents handler hides retired rows; this is the dedicated
// surface for reading them back. Pagination defaults match ListAgents so
// the GUI can reuse the same query hook (page=1, per_page=50, cap 500).
//
// Go 1.22's enhanced ServeMux routes `/agents/retired` to this handler via
// the literal-beats-pattern-var precedence rule (literal `retired` wins over
// `{id}` in the sibling GET /api/v1/agents/{id} route), so both entries can
// coexist without conflict. If that precedence ever regresses, the failure
// mode is TestListRetiredAgentsHandler_Success blowing up with a 404 — which
// is the fast signal we want.
func (h AgentHandler) ListRetiredAgents(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
page := 1
perPage := 50
query := r.URL.Query()
if p := query.Get("page"); p != "" {
if parsed, err := strconv.Atoi(p); err == nil && parsed > 0 {
page = parsed
}
}
if pp := query.Get("per_page"); pp != "" {
if parsed, err := strconv.Atoi(pp); err == nil && parsed > 0 && parsed <= 500 {
perPage = parsed
}
}
agents, total, err := h.svc.ListRetiredAgents(r.Context(), page, perPage)
if err != nil {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to list retired agents", requestID)
return
}
JSON(w, http.StatusOK, PagedResponse{
Data: agents,
Total: total,
Page: page,
PerPage: perPage,
})
}
+19 -1
View File
@@ -2,6 +2,7 @@ package handler
import (
"context"
"log/slog"
"net/http"
"strconv"
"strings"
@@ -86,7 +87,24 @@ func (h AuditHandler) GetAuditEvent(w http.ResponseWriter, r *http.Request) {
event, err := h.svc.GetAuditEvent(r.Context(), id)
if err != nil {
ErrorWithRequestID(w, http.StatusNotFound, "Audit event not found", requestID)
// M-1 (P2): dispatch routes through errToStatus. Pre-M-1 this branch was
// a blanket `any error → 404 Audit event not found` shortcut — which
// silently demoted transient DB failures from the service's auditRepo.List
// wrap to 404 Not Found with no observable external signal. Post-M-1:
// service/audit.go GetAuditEvent only wraps the genuine zero-events path
// with service.ErrNotFound via %w, and the repo.List wrap surfaces without
// that sentinel — so errors.Is(err, service.ErrNotFound) picks up the real
// 404s and everything else correctly surfaces as 500 with server-side
// slog.Error capture (F-002 redacted-500 pattern preserved).
status := errToStatus(err)
if status == http.StatusInternalServerError {
slog.Error("GetAuditEvent failed", "audit_event_id", id, "error", err.Error())
}
msg := "Failed to get audit event"
if status == http.StatusNotFound {
msg = "Audit event not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
+17 -5
View File
@@ -37,6 +37,11 @@ type bulkRevokeRequest struct {
// BulkRevoke handles bulk certificate revocation.
// POST /api/v1/certificates/bulk-revoke
//
// M-003: admin-only. Bulk revocation is a fleet-scale destructive operation —
// a non-admin caller must not be able to invalidate certificates across
// profiles/owners/agents. The gate is enforced here (before body parsing) so a
// non-admin never sees its request criteria evaluated.
func (h BulkRevocationHandler) BulkRevoke(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
@@ -45,6 +50,16 @@ func (h BulkRevocationHandler) BulkRevoke(w http.ResponseWriter, r *http.Request
requestID := middleware.GetRequestID(r.Context())
// M-003: admin-only gate. Non-admin callers are rejected before any
// criteria/body processing to avoid leaking validation behavior to
// unauthorized actors.
if !middleware.IsAdmin(r.Context()) {
ErrorWithRequestID(w, http.StatusForbidden,
"Bulk revocation requires admin privileges",
requestID)
return
}
var req bulkRevokeRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, "Invalid request body", requestID)
@@ -78,11 +93,8 @@ func (h BulkRevocationHandler) BulkRevoke(w http.ResponseWriter, r *http.Request
return
}
// Extract actor from auth context
actor := "api"
if user, ok := middleware.GetUser(r.Context()); ok && user != "" {
actor = user
}
// Extract actor from auth context (M-002: named-key identity → audit trail)
actor := resolveActor(r.Context())
result, err := h.svc.BulkRevoke(r.Context(), criteria, req.Reason, actor)
if err != nil {
@@ -7,8 +7,10 @@ import (
"fmt"
"net/http"
"net/http/httptest"
"strings"
"testing"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
)
@@ -24,6 +26,15 @@ func (m *mockBulkRevocationService) BulkRevoke(ctx context.Context, criteria dom
return &domain.BulkRevocationResult{}, nil
}
// adminContext returns a context carrying the admin flag, mimicking what the
// auth middleware sets for named-key callers whose entry is admin-tagged.
// M-003: bulk revocation handler requires admin context to reach the service.
func adminContext() context.Context {
ctx := context.WithValue(context.Background(), middleware.RequestIDKey{}, "test-request-id-bulk")
ctx = context.WithValue(ctx, middleware.AdminKey{}, true)
return ctx
}
func TestBulkRevoke_Success_WithIDs(t *testing.T) {
svc := &mockBulkRevocationService{
BulkRevokeFn: func(ctx context.Context, criteria domain.BulkRevocationCriteria, reason string, actor string) (*domain.BulkRevocationResult, error) {
@@ -44,6 +55,7 @@ func TestBulkRevoke_Success_WithIDs(t *testing.T) {
body := `{"reason":"keyCompromise","certificate_ids":["mc-1","mc-2"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -82,6 +94,7 @@ func TestBulkRevoke_Success_WithProfile(t *testing.T) {
body := `{"reason":"keyCompromise","profile_id":"prof-tls"}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -97,6 +110,7 @@ func TestBulkRevoke_MissingReason_400(t *testing.T) {
body := `{"certificate_ids":["mc-1"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -112,6 +126,7 @@ func TestBulkRevoke_EmptyCriteria_400(t *testing.T) {
body := `{"reason":"keyCompromise"}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -127,6 +142,7 @@ func TestBulkRevoke_InvalidReason_400(t *testing.T) {
body := `{"reason":"totallyBogus","certificate_ids":["mc-1"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -139,6 +155,8 @@ func TestBulkRevoke_InvalidReason_400(t *testing.T) {
func TestBulkRevoke_MethodNotAllowed_405(t *testing.T) {
h := NewBulkRevocationHandler(&mockBulkRevocationService{})
// Method check fires before the admin gate, so 405 must hold even for a
// non-admin caller — asserting this keeps the ordering explicit.
req := httptest.NewRequest(http.MethodGet, "/api/v1/certificates/bulk-revoke", nil)
w := httptest.NewRecorder()
@@ -160,6 +178,7 @@ func TestBulkRevoke_ServiceError_500(t *testing.T) {
body := `{"reason":"keyCompromise","certificate_ids":["mc-1"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -168,3 +187,103 @@ func TestBulkRevoke_ServiceError_500(t *testing.T) {
t.Errorf("expected 500, got %d", w.Code)
}
}
// --- M-003: admin-only gate on bulk revocation ---
// TestBulkRevoke_NonAdmin_Returns403 is the central authorization regression
// for M-003. A caller without an admin-tagged context must be rejected with
// HTTP 403, regardless of how well-formed its body is, and the service layer
// must never see the request.
func TestBulkRevoke_NonAdmin_Returns403(t *testing.T) {
var serviceCalled bool
svc := &mockBulkRevocationService{
BulkRevokeFn: func(ctx context.Context, criteria domain.BulkRevocationCriteria, reason string, actor string) (*domain.BulkRevocationResult, error) {
serviceCalled = true
return &domain.BulkRevocationResult{}, nil
},
}
h := NewBulkRevocationHandler(svc)
// Well-formed body + well-formed reason + filter — the only thing
// missing is an admin-tagged context. The gate must still fire.
body := `{"reason":"keyCompromise","certificate_ids":["mc-1","mc-2"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(contextWithRequestID()) // request id only, no admin flag
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
if w.Code != http.StatusForbidden {
t.Fatalf("expected status 403, got %d (body=%q)", w.Code, w.Body.String())
}
var resp map[string]any
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
msg, _ := resp["message"].(string)
if !strings.Contains(strings.ToLower(msg), "admin") {
t.Errorf("expected message to mention admin requirement, got %q", msg)
}
if serviceCalled {
t.Errorf("service was invoked despite non-admin caller — gate failed open")
}
}
// TestBulkRevoke_AdminExplicitFalse_Returns403 pins the specific case where the
// AdminKey exists but is set to false — e.g., a non-admin named-key caller.
// Without this we could regress to "key missing == deny, key present == allow"
// which would silently grant a false flag.
func TestBulkRevoke_AdminExplicitFalse_Returns403(t *testing.T) {
h := NewBulkRevocationHandler(&mockBulkRevocationService{})
body := `{"reason":"keyCompromise","certificate_ids":["mc-1"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
ctx := context.WithValue(context.Background(), middleware.RequestIDKey{}, "test-request-id")
ctx = context.WithValue(ctx, middleware.AdminKey{}, false)
req = req.WithContext(ctx)
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
if w.Code != http.StatusForbidden {
t.Fatalf("expected status 403 for admin=false, got %d", w.Code)
}
}
// TestBulkRevoke_AdminPermitted_ForwardsActor confirms the happy path:
// an admin-tagged context reaches the service and the actor (from the auth
// UserKey) is propagated through to BulkRevoke. This keeps the admin gate and
// the M-002 actor-propagation wired together in a single regression.
func TestBulkRevoke_AdminPermitted_ForwardsActor(t *testing.T) {
var capturedActor string
svc := &mockBulkRevocationService{
BulkRevokeFn: func(ctx context.Context, criteria domain.BulkRevocationCriteria, reason string, actor string) (*domain.BulkRevocationResult, error) {
capturedActor = actor
return &domain.BulkRevocationResult{TotalMatched: 1, TotalRevoked: 1}, nil
},
}
h := NewBulkRevocationHandler(svc)
body := `{"reason":"keyCompromise","certificate_ids":["mc-1"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
ctx := context.WithValue(context.Background(), middleware.RequestIDKey{}, "test-request-id")
ctx = context.WithValue(ctx, middleware.AdminKey{}, true)
ctx = context.WithValue(ctx, middleware.UserKey{}, "ops-admin")
req = req.WithContext(ctx)
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200 for admin caller, got %d (body=%q)", w.Code, w.Body.String())
}
if capturedActor != "ops-admin" {
t.Errorf("expected actor ops-admin, got %q", capturedActor)
}
}
+82 -130
View File
@@ -432,6 +432,66 @@ func TestCreateCertificate_ServiceError(t *testing.T) {
}
}
// TestCreateCertificate_MissingRequiredField_Returns400 pins the C-001 handler
// contract: handler MUST reject a create payload that omits any of the five
// required fields (name, common_name, owner_id, team_id, issuer_id,
// renewal_policy_id) with HTTP 400 before the service is invoked. The mock
// service here would succeed if called; every subtest proving 400 therefore
// proves the handler guard fires.
func TestCreateCertificate_MissingRequiredField_Returns400(t *testing.T) {
baseBody := map[string]interface{}{
"name": "API Prod",
"common_name": "api.example.com",
"owner_id": "o-alice",
"team_id": "t-platform",
"issuer_id": "iss-local",
"renewal_policy_id": "rp-standard",
}
cases := []struct {
name string
missingField string
}{
{"missing name", "name"},
{"missing common_name", "common_name"},
{"missing owner_id", "owner_id"},
{"missing team_id", "team_id"},
{"missing issuer_id", "issuer_id"},
{"missing renewal_policy_id", "renewal_policy_id"},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
body := make(map[string]interface{}, len(baseBody))
for k, v := range baseBody {
body[k] = v
}
delete(body, tc.missingField)
bodyBytes, _ := json.Marshal(body)
mock := &MockCertificateService{
CreateCertificateFn: func(_ context.Context, cert domain.ManagedCertificate) (*domain.ManagedCertificate, error) {
// Would succeed if handler guard did not fire.
cert.ID = "mc-would-be-created"
return &cert, nil
},
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
req.Header.Set("Content-Type", "application/json")
w := httptest.NewRecorder()
handler.CreateCertificate(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("%s: expected 400, got %d — body=%s", tc.name, w.Code, w.Body.String())
}
})
}
}
// Test UpdateCertificate - success case
func TestUpdateCertificate_Success(t *testing.T) {
updated := &domain.ManagedCertificate{
@@ -958,127 +1018,13 @@ func TestRevokeCertificate_Handler_ServerError(t *testing.T) {
}
}
// === CRL Handler Tests ===
func TestGetCRL_Success(t *testing.T) {
mock := &MockCertificateService{
GetRevokedCertificatesFn: func(_ context.Context) ([]*domain.CertificateRevocation, error) {
return []*domain.CertificateRevocation{
{
ID: "rev-1",
CertificateID: "cert-1",
SerialNumber: "ABC123",
Reason: "keyCompromise",
RevokedAt: time.Date(2026, 3, 20, 10, 0, 0, 0, time.UTC),
},
{
ID: "rev-2",
CertificateID: "cert-2",
SerialNumber: "DEF456",
Reason: "superseded",
RevokedAt: time.Date(2026, 3, 21, 14, 30, 0, 0, time.UTC),
},
}, nil
},
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.GetCRL(w, req)
if w.Code != http.StatusOK {
t.Errorf("expected status %d, got %d", http.StatusOK, w.Code)
}
var resp map[string]interface{}
json.NewDecoder(w.Body).Decode(&resp)
if resp["version"] != float64(1) {
t.Errorf("expected version 1, got %v", resp["version"])
}
if resp["total"] != float64(2) {
t.Errorf("expected total 2, got %v", resp["total"])
}
entries, ok := resp["entries"].([]interface{})
if !ok {
t.Fatal("expected entries to be an array")
}
if len(entries) != 2 {
t.Errorf("expected 2 entries, got %d", len(entries))
}
entry1 := entries[0].(map[string]interface{})
if entry1["serial_number"] != "ABC123" {
t.Errorf("expected serial ABC123, got %v", entry1["serial_number"])
}
if entry1["revocation_reason"] != "keyCompromise" {
t.Errorf("expected reason keyCompromise, got %v", entry1["revocation_reason"])
}
}
func TestGetCRL_Empty(t *testing.T) {
mock := &MockCertificateService{
GetRevokedCertificatesFn: func(_ context.Context) ([]*domain.CertificateRevocation, error) {
return nil, nil
},
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.GetCRL(w, req)
if w.Code != http.StatusOK {
t.Errorf("expected status %d, got %d", http.StatusOK, w.Code)
}
var resp map[string]interface{}
json.NewDecoder(w.Body).Decode(&resp)
if resp["total"] != float64(0) {
t.Errorf("expected total 0, got %v", resp["total"])
}
}
func TestGetCRL_ServiceError(t *testing.T) {
mock := &MockCertificateService{
GetRevokedCertificatesFn: func(_ context.Context) ([]*domain.CertificateRevocation, error) {
return nil, fmt.Errorf("revocation repository not configured")
},
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.GetCRL(w, req)
if w.Code != http.StatusInternalServerError {
t.Errorf("expected status %d, got %d", http.StatusInternalServerError, w.Code)
}
}
func TestGetCRL_MethodNotAllowed(t *testing.T) {
mock := &MockCertificateService{}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/crl", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.GetCRL(w, req)
if w.Code != http.StatusMethodNotAllowed {
t.Errorf("expected status %d, got %d", http.StatusMethodNotAllowed, w.Code)
}
}
// M15b: DER CRL and OCSP Handler Tests
// === CRL and OCSP Handler Tests (RFC 5280 / RFC 6960, served under /.well-known/pki/) ===
//
// M-006 relocated these endpoints from /api/v1/crl* and /api/v1/ocsp/* to the
// RFC-compliant /.well-known/pki/ namespace and deleted the non-standard JSON
// CRL endpoint. The DER-encoded X.509 CRL (application/pkix-crl) and the
// DER-encoded OCSP response (application/ocsp-response) are the only wire
// formats certctl supports for revocation data.
func TestGetDERCRL_Success(t *testing.T) {
derCRLData := []byte{0x30, 0x82, 0x01, 0x00} // Mock DER CRL bytes
@@ -1092,7 +1038,7 @@ func TestGetDERCRL_Success(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl/iss-local", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/iss-local", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1107,6 +1053,9 @@ func TestGetDERCRL_Success(t *testing.T) {
if len(responseBody) == 0 {
t.Error("expected non-empty response body")
}
if ct := w.Header().Get("Content-Type"); ct != "application/pkix-crl" {
t.Errorf("expected Content-Type application/pkix-crl, got %q", ct)
}
}
func TestGetDERCRL_IssuerNotFound(t *testing.T) {
@@ -1117,7 +1066,7 @@ func TestGetDERCRL_IssuerNotFound(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl/nonexistent", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/nonexistent", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1136,7 +1085,7 @@ func TestGetDERCRL_NotSupported(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl/iss-acme", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/iss-acme", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1151,7 +1100,7 @@ func TestGetDERCRL_NotSupported(t *testing.T) {
func TestGetDERCRL_MethodNotAllowed(t *testing.T) {
mock := &MockCertificateService{}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/crl/iss-local", nil)
req := httptest.NewRequest(http.MethodPost, "/.well-known/pki/crl/iss-local", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1174,7 +1123,7 @@ func TestHandleOCSP_Success(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/ocsp/iss-local/12345", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/ocsp/iss-local/12345", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1188,12 +1137,15 @@ func TestHandleOCSP_Success(t *testing.T) {
if len(responseBody) == 0 {
t.Error("expected non-empty OCSP response body")
}
if ct := w.Header().Get("Content-Type"); ct != "application/ocsp-response" {
t.Errorf("expected Content-Type application/ocsp-response, got %q", ct)
}
}
func TestHandleOCSP_MissingSerial(t *testing.T) {
mock := &MockCertificateService{}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/ocsp/iss-local/", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/ocsp/iss-local/", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1212,7 +1164,7 @@ func TestHandleOCSP_IssuerNotFound(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/ocsp/nonexistent/ABC123", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/ocsp/nonexistent/ABC123", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1231,7 +1183,7 @@ func TestHandleOCSP_CertNotFound(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/ocsp/iss-local/UNKNOWN", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/ocsp/iss-local/UNKNOWN", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1245,7 +1197,7 @@ func TestHandleOCSP_CertNotFound(t *testing.T) {
func TestHandleOCSP_MethodNotAllowed(t *testing.T) {
mock := &MockCertificateService{}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/ocsp/iss-local/12345", nil)
req := httptest.NewRequest(http.MethodPost, "/.well-known/pki/ocsp/iss-local/12345", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
+70 -112
View File
@@ -298,12 +298,13 @@ func (h CertificateHandler) UpdateCertificate(w http.ResponseWriter, r *http.Req
updated, err := h.svc.UpdateCertificate(r.Context(), id, cert)
if err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Certificate not found", requestID)
return
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("UpdateCertificate failed", "cert_id", id, "error", err)
msg = "internal error"
}
slog.Error("UpdateCertificate failed", "cert_id", id, "error", err.Error())
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to update certificate", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -327,11 +328,13 @@ func (h CertificateHandler) ArchiveCertificate(w http.ResponseWriter, r *http.Re
}
if err := h.svc.ArchiveCertificate(r.Context(), id); err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Certificate not found", requestID)
return
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("ArchiveCertificate failed", "cert_id", id, "error", err)
msg = "internal error"
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to archive certificate", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -373,12 +376,13 @@ func (h CertificateHandler) GetCertificateVersions(w http.ResponseWriter, r *htt
versions, total, err := h.svc.GetCertificateVersions(r.Context(), certID, page, perPage)
if err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Certificate not found", requestID)
return
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("GetCertificateVersions failed", "cert_id", certID, "error", err)
msg = "internal error"
}
slog.Error("GetCertificateVersions failed", "cert_id", certID, "error", err.Error())
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to get certificate versions", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -411,21 +415,16 @@ func (h CertificateHandler) TriggerRenewal(w http.ResponseWriter, r *http.Reques
}
certID := parts[0]
if err := h.svc.TriggerRenewal(r.Context(), certID, "api"); err != nil {
errMsg := err.Error()
if strings.Contains(errMsg, "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Certificate not found", requestID)
return
actor := resolveActor(r.Context())
if err := h.svc.TriggerRenewal(r.Context(), certID, actor); err != nil {
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("TriggerRenewal failed", "cert_id", certID, "error", err)
msg = "internal error"
}
if strings.Contains(errMsg, "cannot renew") {
ErrorWithRequestID(w, http.StatusBadRequest, errMsg, requestID)
return
}
if strings.Contains(errMsg, "already in progress") {
ErrorWithRequestID(w, http.StatusConflict, errMsg, requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to trigger renewal", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -467,7 +466,9 @@ func (h CertificateHandler) TriggerDeployment(w http.ResponseWriter, r *http.Req
}
}
if err := h.svc.TriggerDeployment(r.Context(), certID, req.TargetID, "api"); err != nil {
actor := resolveActor(r.Context())
if err := h.svc.TriggerDeployment(r.Context(), certID, req.TargetID, actor); err != nil {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to trigger deployment", requestID)
return
}
@@ -509,69 +510,28 @@ func (h CertificateHandler) RevokeCertificate(w http.ResponseWriter, r *http.Req
}
}
if err := h.svc.RevokeCertificate(r.Context(), certID, req.Reason, "api"); err != nil {
// Distinguish between client errors and server errors
errMsg := err.Error()
if strings.Contains(errMsg, "already revoked") ||
strings.Contains(errMsg, "cannot revoke") ||
strings.Contains(errMsg, "invalid revocation reason") {
ErrorWithRequestID(w, http.StatusBadRequest, errMsg, requestID)
return
actor := resolveActor(r.Context())
if err := h.svc.RevokeCertificate(r.Context(), certID, req.Reason, actor); err != nil {
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("RevokeCertificate failed", "cert_id", certID, "error", err)
msg = "internal error"
}
if strings.Contains(errMsg, "not found") || strings.Contains(errMsg, "failed to fetch") || strings.Contains(errMsg, "failed to get") {
ErrorWithRequestID(w, http.StatusNotFound, "Certificate not found", requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to revoke certificate", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
JSON(w, http.StatusOK, map[string]string{"status": "revoked"})
}
// GetCRL returns the Certificate Revocation List as structured JSON.
// GET /api/v1/crl
// Note: DER-encoded X.509 CRL generation (requiring CA key access) is planned for M15b
// alongside the embedded OCSP responder. This endpoint provides the same data in JSON format.
func (h CertificateHandler) GetCRL(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
revocations, err := h.svc.GetRevokedCertificates(r.Context())
if err != nil {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to generate CRL", requestID)
return
}
type CRLEntry struct {
SerialNumber string `json:"serial_number"`
RevocationDate string `json:"revocation_date"`
RevocationReason string `json:"revocation_reason"`
}
entries := make([]CRLEntry, 0, len(revocations))
for _, rev := range revocations {
entries = append(entries, CRLEntry{
SerialNumber: rev.SerialNumber,
RevocationDate: rev.RevokedAt.Format("2006-01-02T15:04:05Z"),
RevocationReason: rev.Reason,
})
}
JSON(w, http.StatusOK, map[string]interface{}{
"version": 1,
"entries": entries,
"total": len(entries),
"generated_at": time.Now().UTC().Format("2006-01-02T15:04:05Z"),
})
}
// GetDERCRL returns a DER-encoded X.509 CRL signed by the specified issuer.
// GET /api/v1/crl/{issuer_id}
// GET /.well-known/pki/crl/{issuer_id}
//
// RFC 5280 § 5. Served unauthenticated under the /.well-known/pki/ namespace so
// relying parties (browsers, OpenSSL, OCSP stapling sidecars) can fetch the CRL
// without presenting certctl API credentials.
func (h CertificateHandler) GetDERCRL(w http.ResponseWriter, r *http.Request) {
requestID, _ := r.Context().Value("request_id").(string)
@@ -580,7 +540,7 @@ func (h CertificateHandler) GetDERCRL(w http.ResponseWriter, r *http.Request) {
return
}
issuerID := strings.TrimPrefix(r.URL.Path, "/api/v1/crl/")
issuerID := strings.TrimPrefix(r.URL.Path, "/.well-known/pki/crl/")
if issuerID == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Issuer ID is required", requestID)
return
@@ -588,16 +548,13 @@ func (h CertificateHandler) GetDERCRL(w http.ResponseWriter, r *http.Request) {
derBytes, err := h.svc.GenerateDERCRL(r.Context(), issuerID)
if err != nil {
errMsg := err.Error()
if strings.Contains(errMsg, "not found") {
ErrorWithRequestID(w, http.StatusNotFound, errMsg, requestID)
return
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("GenerateDERCRL failed", "issuer_id", issuerID, "error", err)
msg = "internal error"
}
if strings.Contains(errMsg, "do not support") || strings.Contains(errMsg, "does not support") {
ErrorWithRequestID(w, http.StatusNotImplemented, errMsg, requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to generate CRL", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -608,8 +565,11 @@ func (h CertificateHandler) GetDERCRL(w http.ResponseWriter, r *http.Request) {
}
// HandleOCSP processes OCSP requests.
// GET /api/v1/ocsp/{issuer_id}/{serial_hex}
// For simplicity, use GET with path params instead of binary POST.
// GET /.well-known/pki/ocsp/{issuer_id}/{serial_hex}
//
// RFC 6960. Served unauthenticated under the /.well-known/pki/ namespace. For
// simplicity we accept GET with path params rather than the binary POST body
// form — the response is a valid DER-encoded OCSP response either way.
func (h CertificateHandler) HandleOCSP(w http.ResponseWriter, r *http.Request) {
requestID, _ := r.Context().Value("request_id").(string)
@@ -618,8 +578,8 @@ func (h CertificateHandler) HandleOCSP(w http.ResponseWriter, r *http.Request) {
return
}
// Extract issuer_id and serial from path: /api/v1/ocsp/{issuer_id}/{serial_hex}
path := strings.TrimPrefix(r.URL.Path, "/api/v1/ocsp/")
// Extract issuer_id and serial from path: /.well-known/pki/ocsp/{issuer_id}/{serial_hex}
path := strings.TrimPrefix(r.URL.Path, "/.well-known/pki/ocsp/")
parts := strings.SplitN(path, "/", 2)
if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Issuer ID and serial number are required", requestID)
@@ -630,16 +590,13 @@ func (h CertificateHandler) HandleOCSP(w http.ResponseWriter, r *http.Request) {
derBytes, err := h.svc.GetOCSPResponse(r.Context(), issuerID, serialHex)
if err != nil {
errMsg := err.Error()
if strings.Contains(errMsg, "not found") {
ErrorWithRequestID(w, http.StatusNotFound, errMsg, requestID)
return
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("GetOCSPResponse failed", "issuer_id", issuerID, "serial", serialHex, "error", err)
msg = "internal error"
}
if strings.Contains(errMsg, "do not support") || strings.Contains(errMsg, "does not support") {
ErrorWithRequestID(w, http.StatusNotImplemented, errMsg, requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to generate OCSP response", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -670,12 +627,13 @@ func (h CertificateHandler) GetCertificateDeployments(w http.ResponseWriter, r *
deployments, err := h.svc.GetCertificateDeployments(r.Context(), certID)
if err != nil {
errMsg := err.Error()
if strings.Contains(errMsg, "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Certificate not found", requestID)
return
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("GetCertificateDeployments failed", "cert_id", certID, "error", err)
msg = "internal error"
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to get deployments", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
+5 -2
View File
@@ -3,6 +3,7 @@ package handler
import (
"context"
"encoding/json"
"log/slog"
"net/http"
)
@@ -39,7 +40,8 @@ func (h *DigestHandler) PreviewDigest(w http.ResponseWriter, r *http.Request) {
html, err := h.service.PreviewDigest(r.Context())
if err != nil {
http.Error(w, err.Error(), http.StatusInternalServerError)
slog.Error("digest preview failed", "error", err.Error())
http.Error(w, "internal error", http.StatusInternalServerError)
return
}
@@ -64,9 +66,10 @@ func (h *DigestHandler) SendDigest(w http.ResponseWriter, r *http.Request) {
}
if err := h.service.SendDigest(r.Context()); err != nil {
slog.Error("digest send failed", "error", err.Error())
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusInternalServerError)
json.NewEncoder(w).Encode(map[string]string{"error": err.Error()})
json.NewEncoder(w).Encode(map[string]string{"error": "internal error"})
return
}
+9 -4
View File
@@ -11,12 +11,17 @@ import (
)
// DiscoveryService defines the interface used by the discovery handler.
// ClaimDiscovered and DismissDiscovered accept an explicit actor parameter so
// the handler can flow the authenticated named-key identity into the audit
// trail (M-005). Services that call these methods from non-request contexts
// pass a descriptive sentinel (e.g., "system") or "" (which falls back to
// "api").
type DiscoveryService interface {
ProcessDiscoveryReport(ctx context.Context, report *domain.DiscoveryReport) (*domain.DiscoveryScan, error)
ListDiscovered(ctx context.Context, agentID, status string, page, perPage int) ([]*domain.DiscoveredCertificate, int, error)
GetDiscovered(ctx context.Context, id string) (*domain.DiscoveredCertificate, error)
ClaimDiscovered(ctx context.Context, id string, managedCertID string) error
DismissDiscovered(ctx context.Context, id string) error
ClaimDiscovered(ctx context.Context, id string, managedCertID string, actor string) error
DismissDiscovered(ctx context.Context, id string, actor string) error
ListScans(ctx context.Context, agentID string, page, perPage int) ([]*domain.DiscoveryScan, int, error)
GetScan(ctx context.Context, id string) (*domain.DiscoveryScan, error)
GetDiscoverySummary(ctx context.Context) (map[string]int, error)
@@ -142,7 +147,7 @@ func (h DiscoveryHandler) ClaimDiscovered(w http.ResponseWriter, r *http.Request
return
}
if err := h.svc.ClaimDiscovered(r.Context(), id, body.ManagedCertificateID); err != nil {
if err := h.svc.ClaimDiscovered(r.Context(), id, body.ManagedCertificateID, resolveActor(r.Context())); err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to claim certificate: %v", err))
return
}
@@ -166,7 +171,7 @@ func (h DiscoveryHandler) DismissDiscovered(w http.ResponseWriter, r *http.Reque
return
}
if err := h.svc.DismissDiscovered(r.Context(), id); err != nil {
if err := h.svc.DismissDiscovered(r.Context(), id, resolveActor(r.Context())); err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to dismiss certificate: %v", err))
return
}
+10 -10
View File
@@ -19,8 +19,8 @@ type MockDiscoveryService struct {
ProcessDiscoveryReportFn func(ctx context.Context, report *domain.DiscoveryReport) (*domain.DiscoveryScan, error)
ListDiscoveredFn func(ctx context.Context, agentID, status string, page, perPage int) ([]*domain.DiscoveredCertificate, int, error)
GetDiscoveredFn func(ctx context.Context, id string) (*domain.DiscoveredCertificate, error)
ClaimDiscoveredFn func(ctx context.Context, id string, managedCertID string) error
DismissDiscoveredFn func(ctx context.Context, id string) error
ClaimDiscoveredFn func(ctx context.Context, id string, managedCertID string, actor string) error
DismissDiscoveredFn func(ctx context.Context, id string, actor string) error
ListScansFn func(ctx context.Context, agentID string, page, perPage int) ([]*domain.DiscoveryScan, int, error)
GetScanFn func(ctx context.Context, id string) (*domain.DiscoveryScan, error)
GetDiscoverySummaryFn func(ctx context.Context) (map[string]int, error)
@@ -47,16 +47,16 @@ func (m *MockDiscoveryService) GetDiscovered(ctx context.Context, id string) (*d
return nil, nil
}
func (m *MockDiscoveryService) ClaimDiscovered(ctx context.Context, id string, managedCertID string) error {
func (m *MockDiscoveryService) ClaimDiscovered(ctx context.Context, id string, managedCertID string, actor string) error {
if m.ClaimDiscoveredFn != nil {
return m.ClaimDiscoveredFn(ctx, id, managedCertID)
return m.ClaimDiscoveredFn(ctx, id, managedCertID, actor)
}
return nil
}
func (m *MockDiscoveryService) DismissDiscovered(ctx context.Context, id string) error {
func (m *MockDiscoveryService) DismissDiscovered(ctx context.Context, id string, actor string) error {
if m.DismissDiscoveredFn != nil {
return m.DismissDiscoveredFn(ctx, id)
return m.DismissDiscoveredFn(ctx, id, actor)
}
return nil
}
@@ -352,7 +352,7 @@ func TestGetDiscovered_NotFound(t *testing.T) {
// Test ClaimDiscovered - success case
func TestClaimDiscovered_Success(t *testing.T) {
mock := &MockDiscoveryService{
ClaimDiscoveredFn: func(ctx context.Context, id string, managedCertID string) error {
ClaimDiscoveredFn: func(ctx context.Context, id string, managedCertID string, actor string) error {
if id == "dcert-1" && managedCertID == "mc-prod-1" {
return nil
}
@@ -411,7 +411,7 @@ func TestClaimDiscovered_MissingManagedCertID(t *testing.T) {
// Test ClaimDiscovered - discovered cert not found
func TestClaimDiscovered_NotFound(t *testing.T) {
mock := &MockDiscoveryService{
ClaimDiscoveredFn: func(ctx context.Context, id string, managedCertID string) error {
ClaimDiscoveredFn: func(ctx context.Context, id string, managedCertID string, actor string) error {
return fmt.Errorf("discovered certificate not found")
},
}
@@ -438,7 +438,7 @@ func TestClaimDiscovered_NotFound(t *testing.T) {
// Test DismissDiscovered - success case
func TestDismissDiscovered_Success(t *testing.T) {
mock := &MockDiscoveryService{
DismissDiscoveredFn: func(ctx context.Context, id string) error {
DismissDiscoveredFn: func(ctx context.Context, id string, actor string) error {
if id == "dcert-1" {
return nil
}
@@ -614,7 +614,7 @@ func TestGetDiscoverySummary_MethodNotAllowed(t *testing.T) {
// Test DismissDiscovered - service error
func TestDismissDiscovered_ServiceError(t *testing.T) {
mock := &MockDiscoveryService{
DismissDiscoveredFn: func(ctx context.Context, id string) error {
DismissDiscoveredFn: func(ctx context.Context, id string, actor string) error {
return fmt.Errorf("database error")
},
}
+117
View File
@@ -0,0 +1,117 @@
package handler
import (
"errors"
"net/http"
"github.com/lib/pq"
"github.com/shankar0123/certctl/internal/repository"
"github.com/shankar0123/certctl/internal/service"
)
// errToStatus is the single choke point that maps a service-layer or
// repository-layer error to its HTTP status code. Before M-1 (P2), 42 switch
// branches across 11 handler files classified errors via
// `strings.Contains(err.Error(), ...)` substring matching — a pattern that
// made every HTTP status mapping one sentinel-message reword away from silent
// regression (see M-003 self-approval privilege boundary: a reword of
// ErrSelfApproval.Error() would have demoted 403 Forbidden to 500 Internal
// Server Error with no compile-time error, no test failure, and no observable
// external signal).
//
// All handler branches now route through this function via errors.Is and
// errors.As, which walks the wrap chain built by fmt.Errorf("%w: ...", ...).
// The generic sentinels live in internal/service/errors.go; domain-specific
// sentinels (ErrSelfApproval, ErrAgentIsSentinel, ErrBlockedByDependencies,
// ErrForceReasonRequired, ErrAgentNotFound) wrap those generics via %w so both
// errors.Is(err, ErrSelfApproval) and errors.Is(err, ErrForbidden) succeed on
// the same wrapped error.
//
// # Dispatch order
//
// 1. ErrAgentRetired → 410 Gone. Tested FIRST. It is deliberately NOT wrapped
// under any generic sentinel — 410 Gone is semantically distinct from
// 403/404/409 (permanently-terminated resource identity that drives
// deterministic agent-binary shutdown at cmd/agent/main.go:1291). Must
// short-circuit before any generic check so wrapping can never demote it.
// 2. ErrNotFound → 404 Not Found. Both service.ErrNotFound and
// repository.ErrNotFound route here — repositories wrap sql.ErrNoRows with
// repository.ErrNotFound so a "row not found" escapes the repo layer as a
// typed sentinel rather than an untyped fmt.Errorf string. Tested BEFORE
// ErrForbidden so RFC 7235's preference for hiding resource existence from
// unauthorized callers is preserved (a caller who cannot see a resource
// should get 404, not 403).
// 3. ErrUnauthenticated → 401 Unauthorized. SCEP challenge-password mismatch
// and similar credential failures.
// 4. ErrForbidden → 403 Forbidden. M-003 gate. Tested BEFORE ErrValidation so
// double-wrapping (e.g., a future fmt.Errorf("%w: ctx", ErrSelfApproval)
// in a wrapping call site) cannot demote 403 to 400.
// 5. ErrConflict / repository.ErrRenewalPolicyDuplicateName /
// repository.ErrRenewalPolicyInUse → 409 Conflict. The repo-layer sentinels
// are routed here explicitly so handlers do not need their own dispatch
// tree for G-1's renewal-policy FK + unique-name violations.
// 6. ErrValidation → 400 Bad Request. Generic input validation / malformed
// request bodies / invalid state transitions that the caller could correct
// by changing their request.
// 7. ErrUnprocessable → 422 Unprocessable Entity. Distinct from
// ErrValidation: ErrValidation is "caller sent bad input" (400), while
// ErrUnprocessable is "caller's input was fine but our stored data can't
// satisfy the operation" — e.g., an X.509 PEM in the inventory that fails
// to decode. The pre-M-1 ExportPKCS12 handler pinned 422 on
// strings.Contains(err.Error(), "cannot be parsed"); the sentinel makes
// that dispatch survive message rewording.
// 8. ErrNotImplemented → 501 Not Implemented. Reserved for feature-flag-gated
// code paths.
// 9. *pq.Error fallback on SQLSTATE 23503 (FK violation) / 23505 (unique
// violation) → 409 Conflict. Final branch before the default 500. Anything
// that reaches here is technically a code smell (the repository layer
// should normally wrap driver errors into a typed sentinel) but the status
// mapping is still correct.
//
// # Why a function, not a middleware
//
// Handlers must continue to call [Error] / [ErrorWithRequestID] with a
// caller-chosen human-readable message (sometimes the wrapped err.Error(),
// sometimes a redacted "internal error" for 500s per F-002). This function
// gives handlers the status code; the handler keeps control of the body.
func errToStatus(err error) int {
if err == nil {
return http.StatusOK
}
switch {
case errors.Is(err, service.ErrAgentRetired):
return http.StatusGone // 410 — must short-circuit before generic dispatch
case errors.Is(err, service.ErrNotFound),
errors.Is(err, repository.ErrNotFound):
return http.StatusNotFound // 404 — before ErrForbidden (RFC 7235 existence hiding)
case errors.Is(err, service.ErrUnauthenticated):
return http.StatusUnauthorized // 401
case errors.Is(err, service.ErrForbidden):
return http.StatusForbidden // 403 — before ErrValidation (preserves M-003 gate under double-wrap)
case errors.Is(err, service.ErrConflict),
errors.Is(err, repository.ErrRenewalPolicyDuplicateName),
errors.Is(err, repository.ErrRenewalPolicyInUse):
return http.StatusConflict // 409
case errors.Is(err, service.ErrValidation):
return http.StatusBadRequest // 400
case errors.Is(err, service.ErrUnprocessable):
return http.StatusUnprocessableEntity // 422 — stored-data-unparseable, not caller-input-bad
case errors.Is(err, service.ErrNotImplemented):
return http.StatusNotImplemented // 501
}
// Driver-level fallback. Raw *pq.Error escaping the repository layer is a
// code smell but a real escape hatch today — we still want a correct 409
// instead of a generic 500 for FK/unique violations.
var pgErr *pq.Error
if errors.As(err, &pgErr) {
switch pgErr.Code {
case "23503", "23505":
return http.StatusConflict
}
}
return http.StatusInternalServerError
}
+120
View File
@@ -0,0 +1,120 @@
package handler
import (
"errors"
"fmt"
"net/http"
"testing"
"github.com/lib/pq"
"github.com/shankar0123/certctl/internal/repository"
"github.com/shankar0123/certctl/internal/service"
)
// TestErrToStatus_DispatchMatrix pins the handler's single error → HTTP
// status choke point. Each row covers one branch of the dispatch switch and
// the dispatch order invariants documented in errors.go:
//
// - ErrAgentRetired FIRST (410 short-circuits before generic checks)
// - ErrNotFound before ErrForbidden (RFC 7235 existence hiding)
// - ErrForbidden before ErrValidation (preserves M-003 gate under double-wrap)
// - Repo sentinels route to 409 alongside ErrConflict
// - *pq.Error on 23503 / 23505 routes to 409 as the driver-level fallback
// - Default path is 500
func TestErrToStatus_DispatchMatrix(t *testing.T) {
cases := []struct {
name string
err error
want int
}{
{"nil → 200", nil, http.StatusOK},
// Each generic sentinel resolves to its documented status code.
{"ErrNotFound → 404", service.ErrNotFound, http.StatusNotFound},
{"ErrValidation → 400", service.ErrValidation, http.StatusBadRequest},
{"ErrConflict → 409", service.ErrConflict, http.StatusConflict},
{"ErrForbidden → 403", service.ErrForbidden, http.StatusForbidden},
{"ErrUnauthenticated → 401", service.ErrUnauthenticated, http.StatusUnauthorized},
{"ErrNotImplemented → 501", service.ErrNotImplemented, http.StatusNotImplemented},
// Wrapped domain sentinels route through their generic wrap.
{"ErrSelfApproval → 403 (via ErrForbidden)", service.ErrSelfApproval, http.StatusForbidden},
{"ErrAgentIsSentinel → 403 (via ErrForbidden)", service.ErrAgentIsSentinel, http.StatusForbidden},
{"ErrBlockedByDependencies → 409 (via ErrConflict)", service.ErrBlockedByDependencies, http.StatusConflict},
{"ErrForceReasonRequired → 400 (via ErrValidation)", service.ErrForceReasonRequired, http.StatusBadRequest},
{"ErrAgentNotFound → 400 (via ErrValidation)", service.ErrAgentNotFound, http.StatusBadRequest},
// ErrAgentRetired is standalone — 410 Gone must fire before any
// generic dispatch. This locks in the semantic-distinct short-circuit.
{"ErrAgentRetired → 410", service.ErrAgentRetired, http.StatusGone},
// Repository-layer sentinels (G-1 + M-1).
{"repo.ErrNotFound → 404", repository.ErrNotFound, http.StatusNotFound},
{"wrapped repo.ErrNotFound → 404",
fmt.Errorf("%w: renewal policy rp-foo", repository.ErrNotFound),
http.StatusNotFound},
{"repo.ErrRenewalPolicyDuplicateName → 409", repository.ErrRenewalPolicyDuplicateName, http.StatusConflict},
{"repo.ErrRenewalPolicyInUse → 409", repository.ErrRenewalPolicyInUse, http.StatusConflict},
// Wrapped errors with additional context survive the dispatch.
{"wrapped ErrNotFound with context → 404",
fmt.Errorf("lookup failed: %w", service.ErrNotFound),
http.StatusNotFound},
{"wrapped ErrSelfApproval with context → 403",
fmt.Errorf("approval gate: %w", service.ErrSelfApproval),
http.StatusForbidden},
// Driver-level fallback: raw *pq.Error escaping repo layer.
{"*pq.Error 23503 → 409", &pq.Error{Code: "23503"}, http.StatusConflict},
{"*pq.Error 23505 → 409", &pq.Error{Code: "23505"}, http.StatusConflict},
{"*pq.Error 08006 → 500", &pq.Error{Code: "08006"}, http.StatusInternalServerError},
// Default path.
{"unknown error → 500", errors.New("something arbitrary"), http.StatusInternalServerError},
}
for _, c := range cases {
t.Run(c.name, func(t *testing.T) {
got := errToStatus(c.err)
if got != c.want {
t.Errorf("errToStatus(%v) = %d, want %d", c.err, got, c.want)
}
})
}
}
// TestErrToStatus_AgentRetiredShortCircuit is a dedicated regression guard
// for the most fragile dispatch invariant: ErrAgentRetired's 410 Gone must
// fire FIRST. If a future commit wraps it under ErrForbidden (e.g., to
// include it in a generic "agent operations forbidden" bucket), this test
// goes red and the agent-binary shutdown at cmd/agent/main.go:1291 would
// silently stop triggering.
func TestErrToStatus_AgentRetiredShortCircuit(t *testing.T) {
if got := errToStatus(service.ErrAgentRetired); got != http.StatusGone {
t.Fatalf("ErrAgentRetired → %d, want 410 Gone (short-circuit must fire before any generic dispatch)", got)
}
}
// TestErrToStatus_NotFoundBeforeForbidden locks the RFC 7235 existence-
// hiding dispatch order. If someone were to reorder the switch arms to put
// ErrForbidden first, an authorization failure on a nonexistent resource
// would leak existence via a 403 instead of masking it with a 404.
func TestErrToStatus_NotFoundBeforeForbidden(t *testing.T) {
// A hypothetical wrapping where both would match — contrived but the
// ordering guarantee is what we're testing.
both := fmt.Errorf("%w: layered with %w", service.ErrNotFound, service.ErrForbidden)
if got := errToStatus(both); got != http.StatusNotFound {
t.Errorf("dual-wrapped err → %d, want 404 (ErrNotFound must dispatch before ErrForbidden)", got)
}
}
// TestErrToStatus_ForbiddenBeforeValidation guards the M-003 self-approval
// gate against a future call site that double-wraps ErrSelfApproval under
// ErrValidation (intentionally or accidentally). The dispatch must pick
// 403, not 400.
func TestErrToStatus_ForbiddenBeforeValidation(t *testing.T) {
doubled := fmt.Errorf("%w: %w", service.ErrSelfApproval, service.ErrValidation)
if got := errToStatus(doubled); got != http.StatusForbidden {
t.Errorf("double-wrapped err → %d, want 403 (ErrForbidden must dispatch before ErrValidation — M-003 gate)", got)
}
}
+43 -13
View File
@@ -46,12 +46,26 @@ func (h ExportHandler) ExportPEM(w http.ResponseWriter, r *http.Request) {
result, err := h.svc.ExportPEM(r.Context(), id)
if err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Certificate not found", requestID)
return
// M-1 (P2): dispatch routes through errToStatus. Pre-M-1 this branch
// classified 404 via strings.Contains(err.Error(), "not found"), which
// gave false positives on any error whose rendered text happened to
// contain "not found" — notably a transient DB failure when the service
// layer wrapped every certRepo.Get error with "certificate not found".
// Post-M-1: service/export.go now wraps with "failed to get certificate"
// and only the genuine sql.ErrNoRows path surfaces repository.ErrNotFound
// through the wrap chain, so errors.Is(err, repository.ErrNotFound) picks
// up the real 404s and everything else — including transient DB errors —
// correctly surfaces as 500 with server-side slog.Error capture (F-002
// redacted-500 pattern preserved).
status := errToStatus(err)
if status == http.StatusInternalServerError {
slog.Error("ExportPEM failed", "cert_id", id, "error", err.Error())
}
slog.Error("ExportPEM failed", "cert_id", id, "error", err.Error())
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to export certificate", requestID)
msg := "Failed to export certificate"
if status == http.StatusNotFound {
msg = "Certificate not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -94,16 +108,32 @@ func (h ExportHandler) ExportPKCS12(w http.ResponseWriter, r *http.Request) {
pfxData, err := h.svc.ExportPKCS12(r.Context(), id, req.Password)
if err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Certificate not found", requestID)
return
// M-1 (P2): dispatch routes through errToStatus. The pre-M-1 3-term
// substring net (`"not found"|"cannot be parsed"|"no certificates
// found"`) is replaced with sentinel dispatch:
// - repository.ErrNotFound (from certificate.go Get/GetLatestVersion
// sql.ErrNoRows wrap) → 404
// - service.ErrUnprocessable (from service/export.go ExportPKCS12's
// parsePEMCertificates-failure and empty-chain wraps) → 422 —
// semantically correct because the caller's request is fine; our
// stored PEM chain is what cannot be processed
// - everything else → 500 with slog.Error capture (F-002 redacted-500
// pattern preserved)
// A transient DB failure that pre-M-1 would have been swept into the
// 404 substring branch (because the service wrapped every certRepo.Get
// error with "certificate not found") now correctly surfaces as 500.
status := errToStatus(err)
if status == http.StatusInternalServerError {
slog.Error("ExportPKCS12 failed", "cert_id", id, "error", err.Error())
}
if strings.Contains(err.Error(), "cannot be parsed") || strings.Contains(err.Error(), "no certificates found") {
ErrorWithRequestID(w, http.StatusUnprocessableEntity, "Certificate data cannot be parsed as X.509", requestID)
return
msg := "Failed to export PKCS#12"
switch status {
case http.StatusNotFound:
msg = "Certificate not found"
case http.StatusUnprocessableEntity:
msg = "Certificate data cannot be parsed as X.509"
}
slog.Error("ExportPKCS12 failed", "cert_id", id, "error", err.Error())
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to export PKCS#12", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
+37 -2
View File
@@ -108,9 +108,17 @@ func TestExportPEM_Download(t *testing.T) {
}
func TestExportPEM_NotFound(t *testing.T) {
// M-1 (P2): wrap with service.ErrNotFound via %w so the handler's
// errToStatus choke point dispatches to 404 via errors.Is. Pre-M-1 this
// test used a raw `fmt.Errorf("certificate not found")` string and relied
// on the handler's strings.Contains(err.Error(), "not found") classifier
// — which was the same mechanism that silently misclassified transient DB
// failures whose text happened to include "not found" (see docblock on
// ExportPEM handler). Pinning the sentinel contract makes this test
// regression-proof against wrap-text changes.
mockSvc := &MockExportService{
ExportPEMFn: func(_ context.Context, _ string) (*service.ExportPEMResult, error) {
return nil, fmt.Errorf("certificate not found")
return nil, fmt.Errorf("%w: certificate", service.ErrNotFound)
},
}
h := NewExportHandler(mockSvc)
@@ -214,9 +222,11 @@ func TestExportPKCS12_EmptyPassword(t *testing.T) {
}
func TestExportPKCS12_NotFound(t *testing.T) {
// M-1 (P2): same sentinel migration as TestExportPEM_NotFound — see
// rationale there.
mockSvc := &MockExportService{
ExportPKCS12Fn: func(_ context.Context, _ string, _ string) ([]byte, error) {
return nil, fmt.Errorf("certificate not found")
return nil, fmt.Errorf("%w: certificate", service.ErrNotFound)
},
}
h := NewExportHandler(mockSvc)
@@ -231,6 +241,31 @@ func TestExportPKCS12_NotFound(t *testing.T) {
}
}
// TestExportPKCS12_Unprocessable pins the M-1 (P2) 422 contract: when the
// service layer wraps a parse failure with service.ErrUnprocessable, the
// handler's errToStatus choke point must dispatch to 422 Unprocessable
// Entity. Pre-M-1 this was classified via a 2-term substring net
// (`"cannot be parsed"|"no certificates found"`) at export.go:101, which
// would have been silently broken by a message reword in service/export.go.
// The new sentinel makes the dispatch survive message rewording.
func TestExportPKCS12_Unprocessable(t *testing.T) {
mockSvc := &MockExportService{
ExportPKCS12Fn: func(_ context.Context, _ string, _ string) ([]byte, error) {
return nil, fmt.Errorf("%w: certificate data cannot be parsed as X.509: asn1 decode error", service.ErrUnprocessable)
},
}
h := NewExportHandler(mockSvc)
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/mc-test-1/export/pkcs12", nil)
w := httptest.NewRecorder()
h.ExportPKCS12(w, req)
if w.Code != http.StatusUnprocessableEntity {
t.Fatalf("expected 422, got %d", w.Code)
}
}
func TestExportPKCS12_ServiceError(t *testing.T) {
mockSvc := &MockExportService{
ExportPKCS12Fn: func(_ context.Context, _ string, _ string) ([]byte, error) {
+19 -3
View File
@@ -2,6 +2,8 @@ package handler
import (
"net/http"
"github.com/shankar0123/certctl/internal/api/middleware"
)
// HealthHandler handles health and readiness check endpoints.
@@ -55,9 +57,23 @@ func (h HealthHandler) AuthInfo(w http.ResponseWriter, r *http.Request) {
JSON(w, http.StatusOK, response)
}
// AuthCheck returns 200 if the request has valid auth credentials.
// The auth middleware runs before this handler, so reaching here means auth passed.
// AuthCheck returns 200 if the request has valid auth credentials, along with
// the resolved named-key identity and admin flag so the GUI can gate
// admin-only affordances (e.g., the bulk-revoke button).
//
// M-003 (Phase B.4): surface the admin flag so the frontend hides affordances
// that would otherwise 403 at the server. This is a hint for UX only —
// authorization remains enforced at the handler layer (bulk_revocation.go).
//
// The auth middleware runs before this handler, so reaching here means auth
// passed. `user` falls back to an empty string when auth is disabled
// (CERTCTL_AUTH_TYPE=none).
// GET /api/v1/auth/check
func (h HealthHandler) AuthCheck(w http.ResponseWriter, r *http.Request) {
JSON(w, http.StatusOK, map[string]string{"status": "authenticated"})
response := map[string]interface{}{
"status": "authenticated",
"user": middleware.GetUser(r.Context()),
"admin": middleware.IsAdmin(r.Context()),
}
JSON(w, http.StatusOK, response)
}
+115 -2
View File
@@ -1,10 +1,13 @@
package handler
import (
"context"
"encoding/json"
"net/http"
"net/http/httptest"
"testing"
"github.com/shankar0123/certctl/internal/api/middleware"
)
func TestHealth_ReturnsOK(t *testing.T) {
@@ -204,8 +207,8 @@ func TestAuthCheck_ReturnsOK(t *testing.T) {
t.Errorf("Content-Type = %q, want application/json", ct)
}
// Check response body
var result map[string]string
// Check response body — mixed-value map (string + bool) post-Phase B.4.
var result map[string]any
if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
@@ -232,3 +235,113 @@ func TestAuthCheck_MethodNotAllowed(t *testing.T) {
t.Logf("AuthCheck returned status %d (note: method not enforced in handler)", status)
}
}
// --- M-003 (Phase B.4): /auth/check surfaces admin flag + user identity ---
// TestAuthCheck_AdminCaller_ReportsAdminTrue confirms that when the auth
// middleware sets AdminKey{}=true (i.e., named key was admin-tagged), the
// /auth/check endpoint reports admin=true so the GUI can show admin-only
// affordances.
func TestAuthCheck_AdminCaller_ReportsAdminTrue(t *testing.T) {
handler := NewHealthHandler("api-key")
req := httptest.NewRequest(http.MethodGet, "/api/v1/auth/check", nil)
ctx := context.WithValue(req.Context(), middleware.AdminKey{}, true)
ctx = context.WithValue(ctx, middleware.UserKey{}, "ops-admin")
req = req.WithContext(ctx)
w := httptest.NewRecorder()
handler.AuthCheck(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
var result map[string]any
if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
if result["status"] != "authenticated" {
t.Errorf("status = %q, want authenticated", result["status"])
}
admin, ok := result["admin"].(bool)
if !ok {
t.Fatalf("admin field missing or wrong type: %T", result["admin"])
}
if !admin {
t.Errorf("admin = false, want true")
}
if result["user"] != "ops-admin" {
t.Errorf("user = %q, want ops-admin", result["user"])
}
}
// TestAuthCheck_NonAdminCaller_ReportsAdminFalse pins the negative case: the
// auth middleware has stored AdminKey{}=false (non-admin named key) — the
// endpoint must report admin=false so the GUI hides admin-only affordances.
func TestAuthCheck_NonAdminCaller_ReportsAdminFalse(t *testing.T) {
handler := NewHealthHandler("api-key")
req := httptest.NewRequest(http.MethodGet, "/api/v1/auth/check", nil)
ctx := context.WithValue(req.Context(), middleware.AdminKey{}, false)
ctx = context.WithValue(ctx, middleware.UserKey{}, "alice")
req = req.WithContext(ctx)
w := httptest.NewRecorder()
handler.AuthCheck(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
var result map[string]any
if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
admin, ok := result["admin"].(bool)
if !ok {
t.Fatalf("admin field missing or wrong type: %T", result["admin"])
}
if admin {
t.Errorf("admin = true, want false")
}
if result["user"] != "alice" {
t.Errorf("user = %q, want alice", result["user"])
}
}
// TestAuthCheck_NoAuthContext_DefaultsToEmptyUserAndFalseAdmin covers the
// CERTCTL_AUTH_TYPE=none deployment, where the auth middleware doesn't set
// any keys. Response must still be well-formed with empty user + admin=false.
func TestAuthCheck_NoAuthContext_DefaultsToEmptyUserAndFalseAdmin(t *testing.T) {
handler := NewHealthHandler("none")
req := httptest.NewRequest(http.MethodGet, "/api/v1/auth/check", nil)
w := httptest.NewRecorder()
handler.AuthCheck(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
var result map[string]any
if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
if result["status"] != "authenticated" {
t.Errorf("status = %q, want authenticated", result["status"])
}
admin, ok := result["admin"].(bool)
if !ok {
t.Fatalf("admin field missing or wrong type: %T", result["admin"])
}
if admin {
t.Errorf("admin = true for no-auth context, want false")
}
if result["user"] != "" {
t.Errorf("user = %q, want empty string", result["user"])
}
}
+18 -24
View File
@@ -135,16 +135,13 @@ func (h IssuerHandler) CreateIssuer(w http.ResponseWriter, r *http.Request) {
created, err := h.svc.CreateIssuer(r.Context(), issuer)
if err != nil {
h.logger.Error("failed to create issuer", "error", err, "name", issuer.Name, "type", issuer.Type)
errMsg := err.Error()
switch {
case strings.Contains(errMsg, "unique") || strings.Contains(errMsg, "duplicate"):
ErrorWithRequestID(w, http.StatusConflict, "An issuer with this name already exists", requestID)
case strings.Contains(errMsg, "unsupported issuer type"):
ErrorWithRequestID(w, http.StatusBadRequest, errMsg, requestID)
default:
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to create issuer", requestID)
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
h.logger.Error("failed to create issuer", "error", err, "name", issuer.Name, "type", issuer.Type)
msg = "internal error"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -177,16 +174,13 @@ func (h IssuerHandler) UpdateIssuer(w http.ResponseWriter, r *http.Request) {
updated, err := h.svc.UpdateIssuer(r.Context(), id, issuer)
if err != nil {
h.logger.Error("failed to update issuer", "error", err, "id", id)
errMsg := err.Error()
switch {
case strings.Contains(errMsg, "unique") || strings.Contains(errMsg, "duplicate"):
ErrorWithRequestID(w, http.StatusConflict, "An issuer with this name already exists", requestID)
case strings.Contains(errMsg, "not found"):
ErrorWithRequestID(w, http.StatusNotFound, "Issuer not found", requestID)
default:
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to update issuer", requestID)
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
h.logger.Error("failed to update issuer", "error", err, "id", id)
msg = "internal error"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -210,13 +204,13 @@ func (h IssuerHandler) DeleteIssuer(w http.ResponseWriter, r *http.Request) {
}
if err := h.svc.DeleteIssuer(r.Context(), id); err != nil {
if strings.Contains(err.Error(), "violates foreign key") || strings.Contains(err.Error(), "RESTRICT") {
ErrorWithRequestID(w, http.StatusConflict, "Cannot delete issuer: certificates are still using this issuer", requestID)
} else if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Issuer not found", requestID)
} else {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to delete issuer", requestID)
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
h.logger.Error("DeleteIssuer failed", "issuer_id", id, "error", err)
msg = "internal error"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
+61 -12
View File
@@ -11,15 +11,18 @@ import (
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// MockJobService is a mock implementation of JobService interface.
// Approve/Reject closures now take the actor string so tests can assert
// actor propagation from the auth middleware → handler → service.
type MockJobService struct {
ListJobsFn func(status, jobType string, page, perPage int) ([]domain.Job, int64, error)
GetJobFn func(id string) (*domain.Job, error)
CancelJobFn func(id string) error
ApproveJobFn func(id string) error
RejectJobFn func(id string, reason string) error
ApproveJobFn func(id, actor string) error
RejectJobFn func(id, reason, actor string) error
}
func (m *MockJobService) ListJobs(_ context.Context, status, jobType string, page, perPage int) ([]domain.Job, int64, error) {
@@ -43,16 +46,16 @@ func (m *MockJobService) CancelJob(_ context.Context, id string) error {
return nil
}
func (m *MockJobService) ApproveJob(_ context.Context, id string) error {
func (m *MockJobService) ApproveJob(_ context.Context, id, actor string) error {
if m.ApproveJobFn != nil {
return m.ApproveJobFn(id)
return m.ApproveJobFn(id, actor)
}
return nil
}
func (m *MockJobService) RejectJob(_ context.Context, id string, reason string) error {
func (m *MockJobService) RejectJob(_ context.Context, id, reason, actor string) error {
if m.RejectJobFn != nil {
return m.RejectJobFn(id, reason)
return m.RejectJobFn(id, reason, actor)
}
return nil
}
@@ -348,7 +351,7 @@ func TestCancelJob_EmptyID(t *testing.T) {
func TestApproveJob_Success(t *testing.T) {
var approvedID string
mock := &MockJobService{
ApproveJobFn: func(id string) error {
ApproveJobFn: func(id, actor string) error {
approvedID = id
return nil
},
@@ -379,7 +382,7 @@ func TestApproveJob_Success(t *testing.T) {
func TestApproveJob_NotFound(t *testing.T) {
mock := &MockJobService{
ApproveJobFn: func(id string) error {
ApproveJobFn: func(id, actor string) error {
return fmt.Errorf("job not found: no rows")
},
}
@@ -398,7 +401,7 @@ func TestApproveJob_NotFound(t *testing.T) {
func TestApproveJob_BadStatus(t *testing.T) {
mock := &MockJobService{
ApproveJobFn: func(id string) error {
ApproveJobFn: func(id, actor string) error {
return fmt.Errorf("cannot approve job with status Running")
},
}
@@ -427,10 +430,56 @@ func TestApproveJob_MethodNotAllowed(t *testing.T) {
}
}
// TestApproveJob_SelfApproval_Returns403 verifies the M-003 separation-of-duties
// wire: when the service returns ErrSelfApproval the handler must surface HTTP
// 403 Forbidden (NOT 500). The error sentinel crosses the service boundary via
// errors.Is so the handler can pattern-match regardless of any fmt.Errorf
// wrapping that may be added later.
func TestApproveJob_SelfApproval_Returns403(t *testing.T) {
var capturedActor string
mock := &MockJobService{
ApproveJobFn: func(id, actor string) error {
capturedActor = actor
return service.ErrSelfApproval
},
}
h := NewJobHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/jobs/job-self/approve", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
h.ApproveJob(w, req)
if w.Code != http.StatusForbidden {
t.Fatalf("expected status 403, got %d", w.Code)
}
var resp map[string]any
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
// Response body should name the self-approval condition explicitly so
// operators triaging a 403 can distinguish it from other forbid paths.
// The ErrorResponse envelope uses "error" for the status text and
// "message" for the human-readable explanation — we assert on message.
msg, _ := resp["message"].(string)
if !strings.Contains(strings.ToLower(msg), "self-approval") {
t.Errorf("expected message to mention self-approval, got %q", msg)
}
// The handler resolves the actor from the auth context; in this test the
// request has no auth context, so the propagated actor is the anonymous
// fallback ("" or "anonymous" depending on middleware wiring). We only
// assert the closure observed *some* actor string — the detailed actor
// threading is covered by resolveActor unit tests.
_ = capturedActor
}
func TestRejectJob_Success(t *testing.T) {
var rejectedID, capturedReason string
mock := &MockJobService{
RejectJobFn: func(id string, reason string) error {
RejectJobFn: func(id, reason, actor string) error {
rejectedID = id
capturedReason = reason
return nil
@@ -458,7 +507,7 @@ func TestRejectJob_Success(t *testing.T) {
func TestRejectJob_NoReason(t *testing.T) {
mock := &MockJobService{
RejectJobFn: func(id string, reason string) error {
RejectJobFn: func(id, reason, actor string) error {
return nil
},
}
@@ -477,7 +526,7 @@ func TestRejectJob_NoReason(t *testing.T) {
func TestRejectJob_NotFound(t *testing.T) {
mock := &MockJobService{
RejectJobFn: func(id string, reason string) error {
RejectJobFn: func(id, reason, actor string) error {
return fmt.Errorf("job not found: no rows")
},
}
+26 -20
View File
@@ -4,6 +4,7 @@ import (
"context"
"encoding/json"
"io"
"log/slog"
"net/http"
"strconv"
"strings"
@@ -17,8 +18,13 @@ type JobService interface {
ListJobs(ctx context.Context, status, jobType string, page, perPage int) ([]domain.Job, int64, error)
GetJob(ctx context.Context, id string) (*domain.Job, error)
CancelJob(ctx context.Context, id string) error
ApproveJob(ctx context.Context, id string) error
RejectJob(ctx context.Context, id string, reason string) error
// ApproveJob approves a renewal job. actor is the named-key identity
// resolved from the auth middleware; the service returns ErrSelfApproval
// (mapped to 403) when actor matches the certificate owner.
ApproveJob(ctx context.Context, id, actor string) error
// RejectJob rejects a renewal job. actor is the named-key identity
// recorded for audit attribution; no not-self restriction.
RejectJob(ctx context.Context, id, reason, actor string) error
}
// JobHandler handles HTTP requests for job operations.
@@ -150,16 +156,16 @@ func (h JobHandler) ApproveJob(w http.ResponseWriter, r *http.Request) {
}
jobID := parts[0]
if err := h.svc.ApproveJob(r.Context(), jobID); err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Job not found", requestID)
return
actor := resolveActor(r.Context())
if err := h.svc.ApproveJob(r.Context(), jobID, actor); err != nil {
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("ApproveJob failed", "job_id", jobID, "error", err)
msg = "internal error"
}
if strings.Contains(err.Error(), "cannot approve") {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to approve job", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -194,16 +200,16 @@ func (h JobHandler) RejectJob(w http.ResponseWriter, r *http.Request) {
}
}
if err := h.svc.RejectJob(r.Context(), jobID, body.Reason); err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Job not found", requestID)
return
actor := resolveActor(r.Context())
if err := h.svc.RejectJob(r.Context(), jobID, body.Reason, actor); err != nil {
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("RejectJob failed", "job_id", jobID, "error", err)
msg = "internal error"
}
if strings.Contains(err.Error(), "cannot reject") {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to reject job", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
+52 -28
View File
@@ -41,20 +41,26 @@ type MetricsResponse struct {
// MetricsGauge represents gauge metrics (point-in-time values).
type MetricsGauge struct {
CertificateTotal int64 `json:"certificate_total"`
CertificateActive int64 `json:"certificate_active"`
CertificateExpiringSoon int64 `json:"certificate_expiring_soon"` // Within 30d
CertificateExpired int64 `json:"certificate_expired"`
CertificateRevoked int64 `json:"certificate_revoked"`
AgentTotal int64 `json:"agent_total"`
AgentOnline int64 `json:"agent_online"`
JobPending int64 `json:"job_pending"`
CertificateTotal int64 `json:"certificate_total"`
CertificateActive int64 `json:"certificate_active"`
CertificateExpiringSoon int64 `json:"certificate_expiring_soon"` // Within 30d
CertificateExpired int64 `json:"certificate_expired"`
CertificateRevoked int64 `json:"certificate_revoked"`
AgentTotal int64 `json:"agent_total"`
AgentOnline int64 `json:"agent_online"`
JobPending int64 `json:"job_pending"`
}
// MetricsCounter represents counter metrics (cumulative values).
type MetricsCounter struct {
JobCompletedTotal int64 `json:"job_completed_total"`
JobFailedTotal int64 `json:"job_failed_total"`
// NotificationsDeadTotal is a point-in-time count of notifications in the
// dead-letter queue (status="dead"), exposed here with the _total suffix
// to match Prometheus DB-snapshot counter convention (same semantics as
// JobFailedTotal and JobCompletedTotal — see metrics.md). I-005 DLQ
// observability gate.
NotificationsDeadTotal int64 `json:"notifications_dead_total"`
}
// UptimeMetric represents server uptime information.
@@ -95,18 +101,19 @@ func (h MetricsHandler) GetMetrics(w http.ResponseWriter, r *http.Request) {
// Build metrics response
metricsResp := MetricsResponse{
Gauge: MetricsGauge{
CertificateTotal: dashboardSummary.TotalCertificates,
CertificateActive: dashboardSummary.TotalCertificates - dashboardSummary.ExpiringCertificates - dashboardSummary.ExpiredCertificates - dashboardSummary.RevokedCertificates,
CertificateTotal: dashboardSummary.TotalCertificates,
CertificateActive: dashboardSummary.TotalCertificates - dashboardSummary.ExpiringCertificates - dashboardSummary.ExpiredCertificates - dashboardSummary.RevokedCertificates,
CertificateExpiringSoon: dashboardSummary.ExpiringCertificates,
CertificateExpired: dashboardSummary.ExpiredCertificates,
CertificateRevoked: dashboardSummary.RevokedCertificates,
AgentTotal: dashboardSummary.TotalAgents,
AgentOnline: dashboardSummary.ActiveAgents,
JobPending: dashboardSummary.PendingJobs,
CertificateExpired: dashboardSummary.ExpiredCertificates,
CertificateRevoked: dashboardSummary.RevokedCertificates,
AgentTotal: dashboardSummary.TotalAgents,
AgentOnline: dashboardSummary.ActiveAgents,
JobPending: dashboardSummary.PendingJobs,
},
Counter: MetricsCounter{
JobCompletedTotal: dashboardSummary.CompleteJobs,
JobFailedTotal: dashboardSummary.FailedJobs,
JobCompletedTotal: dashboardSummary.CompleteJobs,
JobFailedTotal: dashboardSummary.FailedJobs,
NotificationsDeadTotal: dashboardSummary.NotificationsDead,
},
Uptime: UptimeMetric{
UptimeSeconds: int64(time.Since(h.serverStarted).Seconds()),
@@ -200,6 +207,17 @@ func (h MetricsHandler) GetPrometheusMetrics(w http.ResponseWriter, r *http.Requ
fmt.Fprintf(w, "# TYPE certctl_job_failed_total counter\n")
fmt.Fprintf(w, "certctl_job_failed_total %d\n\n", dashboardSummary.FailedJobs)
// I-005: notification dead-letter queue depth. Emitted with the _total
// suffix to match the existing certctl_job_completed_total /
// certctl_job_failed_total convention for DB-snapshot counters — the
// value is a point-in-time COUNT(*) of notification_events rows where
// status='dead', not a monotonically increasing process-lifetime counter.
// Operators alert on this as "dead-letter depth" (thresholds in the
// I-005 spec: > 0 → warning, > 10 → critical).
fmt.Fprintf(w, "# HELP certctl_notification_dead_total Number of notifications in the dead-letter queue.\n")
fmt.Fprintf(w, "# TYPE certctl_notification_dead_total counter\n")
fmt.Fprintf(w, "certctl_notification_dead_total %d\n\n", dashboardSummary.NotificationsDead)
// Info — server uptime
fmt.Fprintf(w, "# HELP certctl_uptime_seconds Server uptime in seconds.\n")
fmt.Fprintf(w, "# TYPE certctl_uptime_seconds gauge\n")
@@ -209,15 +227,21 @@ func (h MetricsHandler) GetPrometheusMetrics(w http.ResponseWriter, r *http.Requ
// DashboardSummary mirrors the service.DashboardSummary for JSON unmarshaling.
// JSON tags must match the service-layer struct exactly.
type DashboardSummary struct {
TotalCertificates int64 `json:"total_certificates"`
ExpiringCertificates int64 `json:"expiring_certificates"`
ExpiredCertificates int64 `json:"expired_certificates"`
RevokedCertificates int64 `json:"revoked_certificates"`
ActiveAgents int64 `json:"active_agents"`
OfflineAgents int64 `json:"offline_agents"`
TotalAgents int64 `json:"total_agents"`
PendingJobs int64 `json:"pending_jobs"`
FailedJobs int64 `json:"failed_jobs"`
CompleteJobs int64 `json:"complete_jobs"`
CompletedAt time.Time `json:"completed_at"`
TotalCertificates int64 `json:"total_certificates"`
ExpiringCertificates int64 `json:"expiring_certificates"`
ExpiredCertificates int64 `json:"expired_certificates"`
RevokedCertificates int64 `json:"revoked_certificates"`
ActiveAgents int64 `json:"active_agents"`
OfflineAgents int64 `json:"offline_agents"`
TotalAgents int64 `json:"total_agents"`
PendingJobs int64 `json:"pending_jobs"`
FailedJobs int64 `json:"failed_jobs"`
CompleteJobs int64 `json:"complete_jobs"`
// NotificationsDead mirrors service.DashboardSummary.NotificationsDead.
// JSON tag "notifications_dead" must match the service-layer struct
// exactly — this cross-package mirror avoids a direct import cycle and
// is driven by the I-005 Prometheus counter emission path. See
// GetPrometheusMetrics and MetricsCounter.NotificationsDeadTotal.
NotificationsDead int64 `json:"notifications_dead"`
CompletedAt time.Time `json:"completed_at"`
}
@@ -13,9 +13,11 @@ import (
// MockNotificationService is a mock implementation of NotificationService interface.
type MockNotificationService struct {
ListNotificationsFn func(page, perPage int) ([]domain.NotificationEvent, int64, error)
GetNotificationFn func(id string) (*domain.NotificationEvent, error)
MarkAsReadFn func(id string) error
ListNotificationsFn func(page, perPage int) ([]domain.NotificationEvent, int64, error)
ListNotificationsByStatusFn func(status string, page, perPage int) ([]domain.NotificationEvent, int64, error)
GetNotificationFn func(id string) (*domain.NotificationEvent, error)
MarkAsReadFn func(id string) error
RequeueFn func(id string) error
}
func (m *MockNotificationService) ListNotifications(_ context.Context, page, perPage int) ([]domain.NotificationEvent, int64, error) {
@@ -25,6 +27,13 @@ func (m *MockNotificationService) ListNotifications(_ context.Context, page, per
return nil, 0, nil
}
func (m *MockNotificationService) ListNotificationsByStatus(_ context.Context, status string, page, perPage int) ([]domain.NotificationEvent, int64, error) {
if m.ListNotificationsByStatusFn != nil {
return m.ListNotificationsByStatusFn(status, page, perPage)
}
return nil, 0, nil
}
func (m *MockNotificationService) GetNotification(_ context.Context, id string) (*domain.NotificationEvent, error) {
if m.GetNotificationFn != nil {
return m.GetNotificationFn(id)
@@ -39,6 +48,13 @@ func (m *MockNotificationService) MarkAsRead(_ context.Context, id string) error
return nil
}
func (m *MockNotificationService) RequeueNotification(_ context.Context, id string) error {
if m.RequeueFn != nil {
return m.RequeueFn(id)
}
return nil
}
func TestListNotifications_Success(t *testing.T) {
now := time.Now()
certID := "mc-prod-001"
@@ -282,3 +298,224 @@ func TestMarkAsRead_EmptyID(t *testing.T) {
t.Fatalf("expected status 400, got %d", w.Code)
}
}
// ---------------------------------------------------------------------------
// I-005: Notification Retry + Dead-Letter Queue handler contract (Phase 1 Red)
//
// These tests pin the HTTP surface Phase 2 Green must implement:
//
// 1. POST /api/v1/notifications/{id}/requeue — flips a dead notification
// back to 'pending' so the retry loop can pick it up again. The handler
// method does not exist yet (NotificationHandler has no RequeueNotification
// method) and the NotificationService interface does not declare
// RequeueNotification — both are compile-time Red halts.
//
// 2. GET /api/v1/notifications?status=dead — routes dead-letter list requests
// through ListNotificationsByStatus instead of ListNotifications. The
// status-filter routing does not exist yet, so ListNotificationsByStatusFn
// never fires — a runtime Red halt.
// ---------------------------------------------------------------------------
func TestRequeueNotification_Success(t *testing.T) {
var requeuedID string
mock := &MockNotificationService{
RequeueFn: func(id string) error {
requeuedID = id
return nil
},
}
handler := NewNotificationHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/notifications/notif-dead-001/requeue", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RequeueNotification(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
if requeuedID != "notif-dead-001" {
t.Errorf("expected requeued ID 'notif-dead-001', got '%s'", requeuedID)
}
var resp map[string]string
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
if resp["status"] != "requeued" {
t.Errorf("expected status 'requeued', got '%s'", resp["status"])
}
}
func TestRequeueNotification_NotFound(t *testing.T) {
mock := &MockNotificationService{
RequeueFn: func(id string) error {
return ErrMockNotFound
},
}
handler := NewNotificationHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/notifications/nonexistent/requeue", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RequeueNotification(w, req)
if w.Code != http.StatusNotFound {
t.Fatalf("expected status 404, got %d", w.Code)
}
}
func TestRequeueNotification_ServiceError(t *testing.T) {
mock := &MockNotificationService{
RequeueFn: func(id string) error {
return ErrMockServiceFailed
},
}
handler := NewNotificationHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/notifications/notif-dead-001/requeue", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RequeueNotification(w, req)
if w.Code != http.StatusInternalServerError {
t.Fatalf("expected status 500, got %d", w.Code)
}
}
func TestRequeueNotification_MethodNotAllowed(t *testing.T) {
handler := NewNotificationHandler(&MockNotificationService{})
req := httptest.NewRequest(http.MethodGet, "/api/v1/notifications/notif-dead-001/requeue", nil)
w := httptest.NewRecorder()
handler.RequeueNotification(w, req)
if w.Code != http.StatusMethodNotAllowed {
t.Fatalf("expected status 405, got %d", w.Code)
}
}
func TestRequeueNotification_EmptyID(t *testing.T) {
handler := NewNotificationHandler(&MockNotificationService{})
req := httptest.NewRequest(http.MethodPost, "/api/v1/notifications//requeue", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RequeueNotification(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("expected status 400, got %d", w.Code)
}
}
func TestListNotifications_StatusFilter_Dead(t *testing.T) {
now := time.Now()
certID := "mc-prod-001"
lastErr := "SMTP connection refused"
nextRetry := now.Add(1 * time.Minute)
dead := domain.NotificationEvent{
ID: "notif-dead-001",
Type: domain.NotificationTypeExpirationWarning,
CertificateID: &certID,
Channel: domain.NotificationChannelEmail,
Recipient: "admin@example.com",
Message: "Certificate expiring in 7 days",
Status: "dead",
CreatedAt: now,
RetryCount: 5,
NextRetryAt: &nextRetry,
LastError: &lastErr,
}
var capturedStatus string
var capturedPage, capturedPerPage int
byStatusCalled := false
listCalled := false
mock := &MockNotificationService{
ListNotificationsFn: func(page, perPage int) ([]domain.NotificationEvent, int64, error) {
listCalled = true
return nil, 0, nil
},
ListNotificationsByStatusFn: func(status string, page, perPage int) ([]domain.NotificationEvent, int64, error) {
byStatusCalled = true
capturedStatus = status
capturedPage = page
capturedPerPage = perPage
return []domain.NotificationEvent{dead}, 1, nil
},
}
handler := NewNotificationHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/notifications?status=dead&page=1&per_page=50", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.ListNotifications(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
if !byStatusCalled {
t.Fatalf("expected ListNotificationsByStatus to be called for ?status=dead, but it was not")
}
if listCalled {
t.Errorf("ListNotifications should not be called when status filter is present")
}
if capturedStatus != "dead" {
t.Errorf("expected status='dead', got '%s'", capturedStatus)
}
if capturedPage != 1 {
t.Errorf("expected page=1, got %d", capturedPage)
}
if capturedPerPage != 50 {
t.Errorf("expected per_page=50, got %d", capturedPerPage)
}
var resp PagedResponse
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
if resp.Total != 1 {
t.Errorf("expected total=1 dead notification, got %d", resp.Total)
}
}
func TestListNotifications_NoStatusFilter_CallsDefault(t *testing.T) {
// Pin the inverse: when no ?status= is provided, the handler must call the
// existing ListNotifications path (not ListNotificationsByStatus). Phase 2
// Green must not break the default listing behavior for the plain tab.
listCalled := false
byStatusCalled := false
mock := &MockNotificationService{
ListNotificationsFn: func(page, perPage int) ([]domain.NotificationEvent, int64, error) {
listCalled = true
return []domain.NotificationEvent{}, 0, nil
},
ListNotificationsByStatusFn: func(status string, page, perPage int) ([]domain.NotificationEvent, int64, error) {
byStatusCalled = true
return nil, 0, nil
},
}
handler := NewNotificationHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/notifications", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.ListNotifications(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
if !listCalled {
t.Errorf("expected ListNotifications to be called when no status filter is present")
}
if byStatusCalled {
t.Errorf("ListNotificationsByStatus should not be called when no status filter is present")
}
}
+96 -2
View File
@@ -2,6 +2,7 @@ package handler
import (
"context"
"log/slog"
"net/http"
"strconv"
"strings"
@@ -11,10 +12,17 @@ import (
)
// NotificationService defines the service interface for notification operations.
//
// ListNotificationsByStatus and RequeueNotification were added to close coverage
// gap I-005: the Dead letter tab on the GUI (?status=dead) needs a scoped
// listing path, and the Requeue action needs a dedicated endpoint that flips a
// dead notification back to 'pending' so the retry sweep can pick it up again.
type NotificationService interface {
ListNotifications(ctx context.Context, page, perPage int) ([]domain.NotificationEvent, int64, error)
ListNotificationsByStatus(ctx context.Context, status string, page, perPage int) ([]domain.NotificationEvent, int64, error)
GetNotification(ctx context.Context, id string) (*domain.NotificationEvent, error)
MarkAsRead(ctx context.Context, id string) error
RequeueNotification(ctx context.Context, id string) error
}
// NotificationHandler handles HTTP requests for notification operations.
@@ -51,7 +59,20 @@ func (h NotificationHandler) ListNotifications(w http.ResponseWriter, r *http.Re
}
}
notifications, total, err := h.svc.ListNotifications(r.Context(), page, perPage)
// I-005: branch to the status-scoped listing path when ?status= is present
// so the Dead letter tab on the GUI (?status=dead) can filter server-side.
// Empty status delegates to the original ListNotifications path to preserve
// the default tab's existing behavior.
var (
notifications []domain.NotificationEvent
total int64
err error
)
if status := query.Get("status"); status != "" {
notifications, total, err = h.svc.ListNotificationsByStatus(r.Context(), status, page, perPage)
} else {
notifications, total, err = h.svc.ListNotifications(r.Context(), page, perPage)
}
if err != nil {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to list notifications", requestID)
return
@@ -87,7 +108,25 @@ func (h NotificationHandler) GetNotification(w http.ResponseWriter, r *http.Requ
notification, err := h.svc.GetNotification(r.Context(), id)
if err != nil {
ErrorWithRequestID(w, http.StatusNotFound, "Notification not found", requestID)
// M-1 (P2): dispatch routes through errToStatus. Pre-M-1 this branch was
// a blanket `any error → 404 Notification not found` shortcut — which
// silently demoted transient DB failures from the service's List() wrap
// (notification.go:386 pre-M-1) to 404 Not Found with no observable
// external signal. Post-M-1: service/notification.go GetNotification only
// wraps the genuine "not found" path with service.ErrNotFound via %w, and
// every other error (including the List() wrap) surfaces without that
// sentinel — so errors.Is(err, service.ErrNotFound) picks up the real
// 404s and everything else correctly surfaces as 500 with server-side
// slog.Error capture (F-002 redacted-500 pattern preserved).
status := errToStatus(err)
if status == http.StatusInternalServerError {
slog.Error("GetNotification failed", "notification_id", id, "error", err.Error())
}
msg := "Failed to get notification"
if status == http.StatusNotFound {
msg = "Notification not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -124,3 +163,58 @@ func (h NotificationHandler) MarkAsRead(w http.ResponseWriter, r *http.Request)
JSON(w, http.StatusOK, response)
}
// RequeueNotification flips a dead notification back to 'pending' so the retry
// sweep (coverage gap I-005) can pick it up again on its next tick. The handler
// is strictly POST-only; GET/PUT/DELETE return 405. An empty id segment
// (/api/v1/notifications//requeue) returns 400. Service errors that carry a
// "not found" sentinel map to 404; all other service errors map to 500. This
// 404-vs-500 split mirrors GetCertificateDeployments at certificates.go:644.
// POST /api/v1/notifications/{id}/requeue
func (h NotificationHandler) RequeueNotification(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
// Extract notification ID from path /api/v1/notifications/{id}/requeue
path := strings.TrimPrefix(r.URL.Path, "/api/v1/notifications/")
parts := strings.Split(path, "/")
if len(parts) < 2 || parts[0] == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Notification ID is required", requestID)
return
}
notificationID := parts[0]
if err := h.svc.RequeueNotification(r.Context(), notificationID); err != nil {
// M-1 (P2): dispatch routes through errToStatus. Pre-M-1 this branch
// classified 404 via strings.Contains(err.Error(), "not found"), which
// would have given false positives on any error whose rendered text
// happened to contain "not found" — notably a transient DB failure whose
// driver message mentioned a missing relation or column. Post-M-1: the
// repo-layer Requeue (postgres/notification.go) wraps zero-rows-affected
// with repository.ErrNotFound via %w, and the service layer forwards
// that error with %w — so errors.Is(err, repository.ErrNotFound) walks
// the full wrap chain and picks up the real 404s while everything else
// (including transient DB errors) correctly surfaces as 500 with
// server-side slog.Error capture (F-002 redacted-500 pattern preserved).
status := errToStatus(err)
if status == http.StatusInternalServerError {
slog.Error("RequeueNotification failed", "notification_id", notificationID, "error", err.Error())
}
msg := "Failed to requeue notification"
if status == http.StatusNotFound {
msg = "Notification not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
response := map[string]string{
"status": "requeued",
}
JSON(w, http.StatusOK, response)
}
+7 -6
View File
@@ -3,6 +3,7 @@ package handler
import (
"context"
"encoding/json"
"log/slog"
"net/http"
"strconv"
"strings"
@@ -184,13 +185,13 @@ func (h OwnerHandler) DeleteOwner(w http.ResponseWriter, r *http.Request) {
id = parts[0]
if err := h.svc.DeleteOwner(r.Context(), id); err != nil {
if strings.Contains(err.Error(), "violates foreign key") || strings.Contains(err.Error(), "RESTRICT") {
ErrorWithRequestID(w, http.StatusConflict, "Cannot delete owner: certificates are still assigned to this owner", requestID)
} else if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Owner not found", requestID)
} else {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to delete owner", requestID)
status := errToStatus(err)
msg := err.Error()
if status == http.StatusInternalServerError {
slog.Error("DeleteOwner failed", "owner_id", id, "error", err)
msg = "internal error"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
+17
View File
@@ -127,6 +127,17 @@ func (h PolicyHandler) CreatePolicy(w http.ResponseWriter, r *http.Request) {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
// Severity is optional on create; default matches the DB default.
// Any explicit value must pass the TitleCase allowlist; the DB CHECK
// constraint enforces the same set, but catching it here gives a 400
// with a clear message instead of a 500 on constraint violation.
if policy.Severity == "" {
policy.Severity = domain.PolicySeverityWarning
}
if err := ValidatePolicySeverity(policy.Severity); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
created, err := h.svc.CreatePolicy(r.Context(), policy)
if err != nil {
@@ -174,6 +185,12 @@ func (h PolicyHandler) UpdatePolicy(w http.ResponseWriter, r *http.Request) {
return
}
}
if policy.Severity != "" {
if err := ValidatePolicySeverity(policy.Severity); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
}
updated, err := h.svc.UpdatePolicy(r.Context(), id, policy)
if err != nil {
+56 -15
View File
@@ -3,12 +3,14 @@ package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"strconv"
"strings"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// ProfileService defines the service interface for certificate profile operations.
@@ -21,6 +23,20 @@ type ProfileService interface {
}
// ProfileHandler handles HTTP requests for certificate profile operations.
//
// Error dispatch (post-M-1): every service error routes through the [errToStatus]
// choke point via `errors.Is` walking the wrap chain, with one explicit
// [service.ErrValidation] arm on the write paths (Create, Update) so the
// composed "validation: <field-specific reason>" message the service layer
// attaches via `fmt.Errorf("%w: ...", ErrValidation)` can be passed through to
// the 400 response body. Before M-1, the Create and Update handlers branched on
// `strings.Contains(err.Error(), "invalid"|"required"|"must be"|"cannot")` — a
// fragile pattern where a single reword in [service.validateProfile] would
// demote the 400 to 500 with no compile-time signal. The substring-based 404
// branches on Update and Delete likewise depended on the repository's
// human-readable "profile not found" message surviving forever; both now ride
// the same [repository.ErrNotFound] wrap that G-1's renewal-policy and M-1's
// other repositories use.
type ProfileHandler struct {
svc ProfileService
}
@@ -88,7 +104,18 @@ func (h ProfileHandler) GetProfile(w http.ResponseWriter, r *http.Request) {
profile, err := h.svc.GetProfile(r.Context(), id)
if err != nil {
ErrorWithRequestID(w, http.StatusNotFound, "Profile not found", requestID)
// M-1: route through errToStatus so a repo-level `sql.ErrNoRows`
// (wrapped as repository.ErrNotFound) becomes 404, but a transient DB
// failure no longer masquerades as 404 — it correctly surfaces 500. The
// pre-M-1 "any error → 404" shortcut was plausible when Get's only
// expected failure was "not found", but the choke point now gives us
// correct dispatch for free. Mirrors GetRenewalPolicy.
status := errToStatus(err)
msg := "Failed to get profile"
if status == http.StatusNotFound {
msg = "Profile not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -123,9 +150,15 @@ func (h ProfileHandler) CreateProfile(w http.ResponseWriter, r *http.Request) {
created, err := h.svc.CreateProfile(r.Context(), profile)
if err != nil {
// Check if it's a validation error from the service
if strings.Contains(err.Error(), "invalid") || strings.Contains(err.Error(), "required") ||
strings.Contains(err.Error(), "must be") || strings.Contains(err.Error(), "cannot") {
// M-1: replace the 4-term substring net
// (`"invalid"|"required"|"must be"|"cannot"`) with a single
// `errors.Is(err, service.ErrValidation)` arm. validateProfile wraps
// every field-specific failure via `fmt.Errorf("%w: <reason>",
// ErrValidation)`, so `err.Error()` still contains the human-readable
// reason (e.g., "RSA minimum key size must be at least 2048") and can be
// safely passed to the 400 body — but the status decision no longer
// depends on the exact wording. Other errors redact to a generic 500.
if errors.Is(err, service.ErrValidation) {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
@@ -162,16 +195,21 @@ func (h ProfileHandler) UpdateProfile(w http.ResponseWriter, r *http.Request) {
updated, err := h.svc.UpdateProfile(r.Context(), id, profile)
if err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Profile not found", requestID)
return
}
if strings.Contains(err.Error(), "invalid") || strings.Contains(err.Error(), "required") ||
strings.Contains(err.Error(), "must be") || strings.Contains(err.Error(), "cannot") {
// M-1: explicit ErrValidation arm preserves the user-facing reason in
// the 400 body (validateProfile wraps every failure via
// `fmt.Errorf("%w: ...", ErrValidation)`); every other error — including
// repo-layer ErrNotFound on a missing row — routes through errToStatus
// so the 404/500 decision no longer depends on substring matching.
if errors.Is(err, service.ErrValidation) {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to update profile", requestID)
status := errToStatus(err)
msg := "Failed to update profile"
if status == http.StatusNotFound {
msg = "Profile not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
@@ -195,11 +233,14 @@ func (h ProfileHandler) DeleteProfile(w http.ResponseWriter, r *http.Request) {
}
if err := h.svc.DeleteProfile(r.Context(), id); err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Profile not found", requestID)
return
// M-1: sentinel dispatch replaces the substring 404 check — see the
// parallel comment block in UpdateProfile for the rationale.
status := errToStatus(err)
msg := "Failed to delete profile"
if status == http.StatusNotFound {
msg = "Profile not found"
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to delete profile", requestID)
ErrorWithRequestID(w, status, msg, requestID)
return
}
+273
View File
@@ -0,0 +1,273 @@
package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"strconv"
"strings"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// RenewalPolicyService defines the service interface for renewal policy
// operations. G-1: all methods take ctx so the handler can propagate
// request-scoped cancellation/deadlines through the full stack.
type RenewalPolicyService interface {
ListRenewalPolicies(ctx context.Context, page, perPage int) ([]domain.RenewalPolicy, int64, error)
GetRenewalPolicy(ctx context.Context, id string) (*domain.RenewalPolicy, error)
CreateRenewalPolicy(ctx context.Context, rp domain.RenewalPolicy) (*domain.RenewalPolicy, error)
UpdateRenewalPolicy(ctx context.Context, id string, rp domain.RenewalPolicy) (*domain.RenewalPolicy, error)
DeleteRenewalPolicy(ctx context.Context, id string) error
}
// RenewalPolicyHandler serves /api/v1/renewal-policies CRUD endpoints.
//
// Error dispatch (post-M-1): every service error routes through the [errToStatus]
// choke point via `errors.Is` walking the wrap chain. Three sentinel identities
// cover the full dispatch surface:
//
// - [service.ErrRenewalPolicyDuplicateName] / [service.ErrRenewalPolicyInUse]
// are `var`-aliased to the repository-layer sentinels of the same name (G-1),
// so handler-side `errors.Is` succeeds against a sentinel raised three layers
// deep in [internal/repository/postgres.RenewalPolicyRepository] without the
// service layer having to translate. [errToStatus] routes both to 409.
// - [repository.ErrNotFound] is wrapped around `sql.ErrNoRows` inside the
// repo's Get/Update/Delete methods via `fmt.Errorf("%w: renewal policy %s",
// repository.ErrNotFound, id)` (M-1). [errToStatus] routes that to 404 in
// the same switch arm as [service.ErrNotFound], preserving the existing
// 404-on-missing behavior that the pre-M-1 substring check provided —
// without the rewording-regression risk that motivated the migration.
//
// The handler layer keeps two explicit `errors.Is` arms for the
// duplicate-name / in-use sentinels so each 409 response can carry a
// constraint-specific human-readable message ("with that name" vs. "still
// referenced by managed certificates"); every other error path — including
// not-found — delegates the status decision to [errToStatus] and provides a
// generic body via the F-002 redacted-500 pattern.
type RenewalPolicyHandler struct {
svc RenewalPolicyService
}
// NewRenewalPolicyHandler constructs the handler with its service dependency.
// Returned by value to match the house pattern (PolicyHandler, IssuerHandler
// etc.) — the registry stores handlers by value in router.HandlerRegistry.
func NewRenewalPolicyHandler(svc RenewalPolicyService) RenewalPolicyHandler {
return RenewalPolicyHandler{svc: svc}
}
// ListRenewalPolicies lists all renewal policies (paginated).
// GET /api/v1/renewal-policies?page=1&per_page=50
func (h RenewalPolicyHandler) ListRenewalPolicies(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
page := 1
perPage := 50
query := r.URL.Query()
if p := query.Get("page"); p != "" {
if parsed, err := strconv.Atoi(p); err == nil && parsed > 0 {
page = parsed
}
}
if pp := query.Get("per_page"); pp != "" {
if parsed, err := strconv.Atoi(pp); err == nil && parsed > 0 && parsed <= 500 {
perPage = parsed
}
}
policies, total, err := h.svc.ListRenewalPolicies(r.Context(), page, perPage)
if err != nil {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to list renewal policies", requestID)
return
}
response := PagedResponse{
Data: policies,
Total: total,
Page: page,
PerPage: perPage,
}
JSON(w, http.StatusOK, response)
}
// GetRenewalPolicy retrieves a single renewal policy by ID.
// GET /api/v1/renewal-policies/{id}
func (h RenewalPolicyHandler) GetRenewalPolicy(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
id := strings.TrimPrefix(r.URL.Path, "/api/v1/renewal-policies/")
parts := strings.Split(id, "/")
if len(parts) == 0 || parts[0] == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Renewal policy ID is required", requestID)
return
}
id = parts[0]
policy, err := h.svc.GetRenewalPolicy(r.Context(), id)
if err != nil {
// M-1: route through errToStatus so a repo-level `sql.ErrNoRows`
// (wrapped as repository.ErrNotFound) becomes 404, but a transient DB
// failure no longer masquerades as 404 — it correctly surfaces 500.
// The pre-M-1 "any error → 404" shortcut was plausible when Get's only
// expected failure was "not found", but the choke point now gives us
// correct dispatch for free.
status := errToStatus(err)
msg := "Failed to get renewal policy"
if status == http.StatusNotFound {
msg = "Renewal policy not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
JSON(w, http.StatusOK, policy)
}
// CreateRenewalPolicy inserts a new renewal policy.
// POST /api/v1/renewal-policies
//
// Error mapping:
// - invalid JSON / missing name → 400
// - ErrRenewalPolicyDuplicateName (pg 23505 on name UNIQUE) → 409
// - anything else → 500
func (h RenewalPolicyHandler) CreateRenewalPolicy(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
var rp domain.RenewalPolicy
if err := json.NewDecoder(r.Body).Decode(&rp); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, "Invalid request body", requestID)
return
}
if err := ValidateRequired("name", rp.Name); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
created, err := h.svc.CreateRenewalPolicy(r.Context(), rp)
if err != nil {
if errors.Is(err, service.ErrRenewalPolicyDuplicateName) {
ErrorWithRequestID(w, http.StatusConflict, "A renewal policy with that name already exists", requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to create renewal policy", requestID)
return
}
JSON(w, http.StatusCreated, created)
}
// UpdateRenewalPolicy replaces the fields of an existing renewal policy.
// PUT /api/v1/renewal-policies/{id}
//
// Error mapping (post-M-1, sentinel-driven):
// - invalid JSON / empty ID → 400
// - ErrRenewalPolicyDuplicateName (pg 23505) → 409 (explicit arm, custom msg)
// - ErrNotFound (wrapping sql.ErrNoRows) → 404 (via errToStatus)
// - anything else → 500 (via errToStatus, body redacted)
func (h RenewalPolicyHandler) UpdateRenewalPolicy(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPut {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
id := strings.TrimPrefix(r.URL.Path, "/api/v1/renewal-policies/")
parts := strings.Split(id, "/")
if len(parts) == 0 || parts[0] == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Renewal policy ID is required", requestID)
return
}
id = parts[0]
var rp domain.RenewalPolicy
if err := json.NewDecoder(r.Body).Decode(&rp); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, "Invalid request body", requestID)
return
}
updated, err := h.svc.UpdateRenewalPolicy(r.Context(), id, rp)
if err != nil {
if errors.Is(err, service.ErrRenewalPolicyDuplicateName) {
ErrorWithRequestID(w, http.StatusConflict, "A renewal policy with that name already exists", requestID)
return
}
// M-1: drop the `strings.Contains(err.Error(), "not found")` branch.
// [repository.ErrNotFound] now wraps sql.ErrNoRows at the three
// renewal-policy repo methods (Get/Update/Delete), so errToStatus
// routes a missing row to 404 via errors.Is without depending on the
// repo's fmt.Errorf format string surviving a future reword.
status := errToStatus(err)
msg := "Failed to update renewal policy"
if status == http.StatusNotFound {
msg = "Renewal policy not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
JSON(w, http.StatusOK, updated)
}
// DeleteRenewalPolicy removes a renewal policy.
// DELETE /api/v1/renewal-policies/{id}
//
// Error mapping (post-M-1, sentinel-driven):
// - empty ID (trailing slash) → 400
// - ErrRenewalPolicyInUse (pg 23503 FK-RESTRICT) → 409 (explicit arm, custom msg)
// - ErrNotFound (wrapping sql.ErrNoRows) → 404 (via errToStatus)
// - anything else → 500 (via errToStatus, body redacted)
func (h RenewalPolicyHandler) DeleteRenewalPolicy(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodDelete {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
id := strings.TrimPrefix(r.URL.Path, "/api/v1/renewal-policies/")
parts := strings.Split(id, "/")
if len(parts) == 0 || parts[0] == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Renewal policy ID is required", requestID)
return
}
id = parts[0]
if err := h.svc.DeleteRenewalPolicy(r.Context(), id); err != nil {
if errors.Is(err, service.ErrRenewalPolicyInUse) {
ErrorWithRequestID(w, http.StatusConflict, "Renewal policy is still referenced by managed certificates", requestID)
return
}
// M-1: sentinel dispatch replaces the substring check — see the
// parallel comment block in UpdateRenewalPolicy for the rationale.
status := errToStatus(err)
msg := "Failed to delete renewal policy"
if status == http.StatusNotFound {
msg = "Renewal policy not found"
}
ErrorWithRequestID(w, status, msg, requestID)
return
}
w.WriteHeader(http.StatusNoContent)
}
@@ -0,0 +1,434 @@
package handler
import (
"bytes"
"context"
"encoding/json"
"net/http"
"net/http/httptest"
"testing"
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// G-1 red tests: lock in the HTTP surface of /api/v1/renewal-policies before
// the production code exists. Every subtest here references a symbol that
// Phase 2b must introduce:
//
// - NewRenewalPolicyHandler(svc) (constructor)
// - RenewalPolicyService (service-layer interface, in this package)
// - handler.ListRenewalPolicies / GetRenewalPolicy / CreateRenewalPolicy /
// UpdateRenewalPolicy / DeleteRenewalPolicy
// - service.ErrRenewalPolicyDuplicateName (pg 23505 → 409 mapping)
// - service.ErrRenewalPolicyInUse (pg 23503 → 409 mapping)
// MockRenewalPolicyService is a mock implementation of RenewalPolicyService.
type MockRenewalPolicyService struct {
ListRenewalPoliciesFn func(page, perPage int) ([]domain.RenewalPolicy, int64, error)
GetRenewalPolicyFn func(id string) (*domain.RenewalPolicy, error)
CreateRenewalPolicyFn func(rp domain.RenewalPolicy) (*domain.RenewalPolicy, error)
UpdateRenewalPolicyFn func(id string, rp domain.RenewalPolicy) (*domain.RenewalPolicy, error)
DeleteRenewalPolicyFn func(id string) error
}
func (m *MockRenewalPolicyService) ListRenewalPolicies(_ context.Context, page, perPage int) ([]domain.RenewalPolicy, int64, error) {
if m.ListRenewalPoliciesFn != nil {
return m.ListRenewalPoliciesFn(page, perPage)
}
return nil, 0, nil
}
func (m *MockRenewalPolicyService) GetRenewalPolicy(_ context.Context, id string) (*domain.RenewalPolicy, error) {
if m.GetRenewalPolicyFn != nil {
return m.GetRenewalPolicyFn(id)
}
return nil, nil
}
func (m *MockRenewalPolicyService) CreateRenewalPolicy(_ context.Context, rp domain.RenewalPolicy) (*domain.RenewalPolicy, error) {
if m.CreateRenewalPolicyFn != nil {
return m.CreateRenewalPolicyFn(rp)
}
return nil, nil
}
func (m *MockRenewalPolicyService) UpdateRenewalPolicy(_ context.Context, id string, rp domain.RenewalPolicy) (*domain.RenewalPolicy, error) {
if m.UpdateRenewalPolicyFn != nil {
return m.UpdateRenewalPolicyFn(id, rp)
}
return nil, nil
}
func (m *MockRenewalPolicyService) DeleteRenewalPolicy(_ context.Context, id string) error {
if m.DeleteRenewalPolicyFn != nil {
return m.DeleteRenewalPolicyFn(id)
}
return nil
}
// ----- List -----
func TestListRenewalPolicies_Success(t *testing.T) {
now := time.Now()
rp1 := domain.RenewalPolicy{
ID: "rp-default", Name: "Default", RenewalWindowDays: 30,
MaxRetries: 3, RetryInterval: 3600, AutoRenew: true,
CreatedAt: now, UpdatedAt: now,
}
rp2 := domain.RenewalPolicy{
ID: "rp-urgent", Name: "Urgent", RenewalWindowDays: 7,
MaxRetries: 5, RetryInterval: 600, AutoRenew: true,
CreatedAt: now, UpdatedAt: now,
}
mock := &MockRenewalPolicyService{
ListRenewalPoliciesFn: func(page, perPage int) ([]domain.RenewalPolicy, int64, error) {
return []domain.RenewalPolicy{rp1, rp2}, 2, nil
},
}
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/renewal-policies", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.ListRenewalPolicies(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
var resp PagedResponse
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
if resp.Total != 2 {
t.Errorf("expected total 2, got %d", resp.Total)
}
}
func TestListRenewalPolicies_ServiceError(t *testing.T) {
mock := &MockRenewalPolicyService{
ListRenewalPoliciesFn: func(page, perPage int) ([]domain.RenewalPolicy, int64, error) {
return nil, 0, ErrMockServiceFailed
},
}
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/renewal-policies", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.ListRenewalPolicies(w, req)
if w.Code != http.StatusInternalServerError {
t.Fatalf("expected status 500, got %d", w.Code)
}
}
func TestListRenewalPolicies_MethodNotAllowed(t *testing.T) {
handler := NewRenewalPolicyHandler(&MockRenewalPolicyService{})
req := httptest.NewRequest(http.MethodDelete, "/api/v1/renewal-policies", nil)
w := httptest.NewRecorder()
handler.ListRenewalPolicies(w, req)
if w.Code != http.StatusMethodNotAllowed {
t.Fatalf("expected status 405, got %d", w.Code)
}
}
// ----- Get -----
func TestGetRenewalPolicy_Success(t *testing.T) {
now := time.Now()
mock := &MockRenewalPolicyService{
GetRenewalPolicyFn: func(id string) (*domain.RenewalPolicy, error) {
return &domain.RenewalPolicy{
ID: id, Name: "Default", RenewalWindowDays: 30,
MaxRetries: 3, RetryInterval: 3600, AutoRenew: true,
CreatedAt: now, UpdatedAt: now,
}, nil
},
}
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/renewal-policies/rp-default", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.GetRenewalPolicy(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
}
func TestGetRenewalPolicy_NotFound(t *testing.T) {
mock := &MockRenewalPolicyService{
GetRenewalPolicyFn: func(id string) (*domain.RenewalPolicy, error) {
return nil, ErrMockNotFound
},
}
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/renewal-policies/nonexistent", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.GetRenewalPolicy(w, req)
if w.Code != http.StatusNotFound {
t.Fatalf("expected status 404, got %d", w.Code)
}
}
// ----- Create -----
func TestCreateRenewalPolicy_Success(t *testing.T) {
now := time.Now()
mock := &MockRenewalPolicyService{
CreateRenewalPolicyFn: func(rp domain.RenewalPolicy) (*domain.RenewalPolicy, error) {
rp.ID = "rp-new"
rp.CreatedAt = now
rp.UpdatedAt = now
return &rp, nil
},
}
body := map[string]interface{}{
"name": "New Policy",
"renewal_window_days": 30,
"max_retries": 3,
"retry_interval_seconds": 3600,
"auto_renew": true,
}
bodyBytes, _ := json.Marshal(body)
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/renewal-policies", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.CreateRenewalPolicy(w, req)
if w.Code != http.StatusCreated {
t.Fatalf("expected status 201, got %d", w.Code)
}
}
func TestCreateRenewalPolicy_MissingName(t *testing.T) {
body := map[string]interface{}{
"renewal_window_days": 30,
"max_retries": 3,
"retry_interval_seconds": 3600,
}
bodyBytes, _ := json.Marshal(body)
handler := NewRenewalPolicyHandler(&MockRenewalPolicyService{})
req := httptest.NewRequest(http.MethodPost, "/api/v1/renewal-policies", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.CreateRenewalPolicy(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("expected status 400, got %d", w.Code)
}
}
func TestCreateRenewalPolicy_InvalidJSON(t *testing.T) {
handler := NewRenewalPolicyHandler(&MockRenewalPolicyService{})
req := httptest.NewRequest(http.MethodPost, "/api/v1/renewal-policies", bytes.NewReader([]byte("not json")))
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.CreateRenewalPolicy(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("expected status 400, got %d", w.Code)
}
}
func TestCreateRenewalPolicy_DuplicateName(t *testing.T) {
// Service bubbles up ErrRenewalPolicyDuplicateName (pg 23505) → handler maps to 409.
mock := &MockRenewalPolicyService{
CreateRenewalPolicyFn: func(rp domain.RenewalPolicy) (*domain.RenewalPolicy, error) {
return nil, service.ErrRenewalPolicyDuplicateName
},
}
body := map[string]interface{}{
"name": "Duplicate",
"renewal_window_days": 30,
"max_retries": 3,
"retry_interval_seconds": 3600,
}
bodyBytes, _ := json.Marshal(body)
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/renewal-policies", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.CreateRenewalPolicy(w, req)
if w.Code != http.StatusConflict {
t.Fatalf("expected status 409 on duplicate name, got %d", w.Code)
}
}
func TestCreateRenewalPolicy_MethodNotAllowed(t *testing.T) {
handler := NewRenewalPolicyHandler(&MockRenewalPolicyService{})
req := httptest.NewRequest(http.MethodGet, "/api/v1/renewal-policies", nil)
w := httptest.NewRecorder()
handler.CreateRenewalPolicy(w, req)
if w.Code != http.StatusMethodNotAllowed {
t.Fatalf("expected status 405, got %d", w.Code)
}
}
// ----- Update -----
func TestUpdateRenewalPolicy_Success(t *testing.T) {
now := time.Now()
mock := &MockRenewalPolicyService{
UpdateRenewalPolicyFn: func(id string, rp domain.RenewalPolicy) (*domain.RenewalPolicy, error) {
return &domain.RenewalPolicy{
ID: id, Name: rp.Name, RenewalWindowDays: rp.RenewalWindowDays,
MaxRetries: rp.MaxRetries, RetryInterval: rp.RetryInterval,
AutoRenew: rp.AutoRenew,
CreatedAt: now, UpdatedAt: now,
}, nil
},
}
body := map[string]interface{}{
"name": "Updated Policy",
"renewal_window_days": 45,
"max_retries": 5,
"retry_interval_seconds": 1800,
"auto_renew": true,
}
bodyBytes, _ := json.Marshal(body)
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodPut, "/api/v1/renewal-policies/rp-default", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.UpdateRenewalPolicy(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
}
func TestUpdateRenewalPolicy_NotFound(t *testing.T) {
mock := &MockRenewalPolicyService{
UpdateRenewalPolicyFn: func(id string, rp domain.RenewalPolicy) (*domain.RenewalPolicy, error) {
return nil, ErrMockNotFound
},
}
body := map[string]interface{}{
"name": "Updated",
"renewal_window_days": 30,
"max_retries": 3,
"retry_interval_seconds": 3600,
}
bodyBytes, _ := json.Marshal(body)
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodPut, "/api/v1/renewal-policies/rp-missing", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.UpdateRenewalPolicy(w, req)
if w.Code != http.StatusNotFound {
t.Fatalf("expected status 404, got %d", w.Code)
}
}
// ----- Delete -----
func TestDeleteRenewalPolicy_Success(t *testing.T) {
var deletedID string
mock := &MockRenewalPolicyService{
DeleteRenewalPolicyFn: func(id string) error {
deletedID = id
return nil
},
}
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodDelete, "/api/v1/renewal-policies/rp-default", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.DeleteRenewalPolicy(w, req)
if w.Code != http.StatusNoContent {
t.Fatalf("expected status 204, got %d", w.Code)
}
if deletedID != "rp-default" {
t.Errorf("expected deleted ID 'rp-default', got '%s'", deletedID)
}
}
func TestDeleteRenewalPolicy_NotFound(t *testing.T) {
mock := &MockRenewalPolicyService{
DeleteRenewalPolicyFn: func(id string) error {
return ErrMockNotFound
},
}
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodDelete, "/api/v1/renewal-policies/rp-missing", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.DeleteRenewalPolicy(w, req)
if w.Code != http.StatusNotFound {
t.Fatalf("expected status 404, got %d", w.Code)
}
}
func TestDeleteRenewalPolicy_InUseConflict(t *testing.T) {
// Service bubbles up ErrRenewalPolicyInUse (pg 23503 FK-RESTRICT) → handler maps to 409.
mock := &MockRenewalPolicyService{
DeleteRenewalPolicyFn: func(id string) error {
return service.ErrRenewalPolicyInUse
},
}
handler := NewRenewalPolicyHandler(mock)
req := httptest.NewRequest(http.MethodDelete, "/api/v1/renewal-policies/rp-active", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.DeleteRenewalPolicy(w, req)
if w.Code != http.StatusConflict {
t.Fatalf("expected status 409 on in-use conflict, got %d", w.Code)
}
}
func TestDeleteRenewalPolicy_EmptyID(t *testing.T) {
handler := NewRenewalPolicyHandler(&MockRenewalPolicyService{})
req := httptest.NewRequest(http.MethodDelete, "/api/v1/renewal-policies/", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.DeleteRenewalPolicy(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("expected status 400, got %d", w.Code)
}
}
+20
View File
@@ -1,14 +1,34 @@
package handler
import (
"context"
"encoding/base64"
"encoding/json"
"fmt"
"net/http"
"strings"
"time"
"github.com/shankar0123/certctl/internal/api/middleware"
)
// resolveActor extracts the authenticated named-key identity from the request
// context for audit-trail attribution. Returns the named-key name when set by
// the auth middleware, or "api" as a safe sentinel when the auth middleware
// did not populate the context (e.g., AUTH_TYPE=none, or internal/system calls
// that bypass auth).
//
// Post-M-002: this is the single source of truth for handler-layer actor
// resolution. Handlers must NOT hardcode string literals like "api-key-user"
// or "api" — always go through this helper so the named-key identity flows to
// services and the audit trail.
func resolveActor(ctx context.Context) string {
if user := middleware.GetUser(ctx); user != "" {
return user
}
return "api"
}
// PagedResponse represents a paginated API response.
type PagedResponse struct {
Data interface{} `json:"data"`
+19 -3
View File
@@ -6,6 +6,7 @@ import (
"encoding/asn1"
"encoding/base64"
"encoding/pem"
"errors"
"fmt"
"io"
"net/http"
@@ -14,6 +15,7 @@ import (
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/pkcs7"
"github.com/shankar0123/certctl/internal/service"
)
// SCEPService defines the service interface for SCEP enrollment operations.
@@ -171,11 +173,25 @@ func (h SCEPHandler) pkiOperation(w http.ResponseWriter, r *http.Request) {
result, err := h.svc.PKCSReq(r.Context(), csrPEM, challengePassword, transactionID)
if err != nil {
if strings.Contains(err.Error(), "challenge password") {
ErrorWithRequestID(w, http.StatusForbidden, "Invalid challenge password", requestID)
// M-1 (P2): typed-sentinel dispatch replaces the pre-M-1 substring
// branch `strings.Contains(err.Error(), "challenge password")`. The
// service layer now wraps both challenge-password failure modes (server
// misconfigured / client credential wrong) via `fmt.Errorf("%w: ...",
// ErrUnauthenticated)`, so errors.Is walks the wrap chain without
// depending on the exact wording of the error string. This also
// corrects the HTTP status: pre-M-1 returned 403 Forbidden, but RFC
// 7235 classifies this as 401 Unauthorized (authentication failure, not
// authorization denial). The errToStatus doc block enumerates this as
// the canonical ErrUnauthenticated call site.
if errors.Is(err, service.ErrUnauthenticated) {
ErrorWithRequestID(w, http.StatusUnauthorized, "Invalid challenge password", requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, fmt.Sprintf("Enrollment failed: %v", err), requestID)
// F-002 redacted-500: every other enrollment failure (CSR parse errors,
// issuer-layer failures, audit-layer failures) returns an opaque body
// so we don't leak internal state through the SCEP response. The
// logger already captured the real error at the service layer.
ErrorWithRequestID(w, http.StatusInternalServerError, "Enrollment failed", requestID)
return
}
+17 -3
View File
@@ -4,12 +4,14 @@ import (
"context"
"encoding/pem"
"errors"
"fmt"
"net/http"
"net/http/httptest"
"strings"
"testing"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// mockSCEPService implements SCEPService for testing.
@@ -214,9 +216,21 @@ func TestSCEP_PKIOperation_Success_RawCSR(t *testing.T) {
}
}
// TestSCEP_PKIOperation_ChallengePasswordRejected pins the M-1 (P2) dispatch
// contract: when the service wraps the failure via `fmt.Errorf("%w: ...",
// service.ErrUnauthenticated)` the handler's errToStatus choke point must
// return 401 Unauthorized, NOT 403 Forbidden.
//
// This is a deliberate semantic correction. Pre-M-1 the handler inspected
// err.Error() for the "challenge password" substring and returned 403, which
// misclassified the RFC 7235 condition (the caller presented no valid
// application-layer credential — that is auth failure, not authorization
// denial). The errToStatus doc explicitly cites this code path as the
// canonical ErrUnauthenticated consumer; see handler/errors.go and the
// symmetric M-1 comment block at handler/scep.go in the pkiOperation arm.
func TestSCEP_PKIOperation_ChallengePasswordRejected(t *testing.T) {
svc := &mockSCEPService{
EnrollErr: errors.New("invalid challenge password"),
EnrollErr: fmt.Errorf("%w: invalid challenge password", service.ErrUnauthenticated),
}
h := NewSCEPHandler(svc)
@@ -230,8 +244,8 @@ func TestSCEP_PKIOperation_ChallengePasswordRejected(t *testing.T) {
w := httptest.NewRecorder()
h.HandleSCEP(w, req)
if w.Code != http.StatusForbidden {
t.Errorf("expected 403, got %d: %s", w.Code, w.Body.String())
if w.Code != http.StatusUnauthorized {
t.Errorf("expected 401 Unauthorized (M-1 dispatch of service.ErrUnauthenticated), got %d: %s", w.Code, w.Body.String())
}
}
+70 -6
View File
@@ -10,6 +10,7 @@ import (
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// MockTargetService is a mock implementation of TargetService interface.
@@ -239,8 +240,9 @@ func TestCreateTarget_Success(t *testing.T) {
}
body := map[string]interface{}{
"name": "New Target",
"type": "nginx",
"name": "New Target",
"type": "nginx",
"agent_id": "agent-001",
}
bodyBytes, _ := json.Marshal(body)
@@ -258,7 +260,8 @@ func TestCreateTarget_Success(t *testing.T) {
func TestCreateTarget_MissingName(t *testing.T) {
body := map[string]interface{}{
"type": "nginx",
"type": "nginx",
"agent_id": "agent-001",
}
bodyBytes, _ := json.Marshal(body)
@@ -276,7 +279,8 @@ func TestCreateTarget_MissingName(t *testing.T) {
func TestCreateTarget_MissingType(t *testing.T) {
body := map[string]interface{}{
"name": "New Target",
"name": "New Target",
"agent_id": "agent-001",
}
bodyBytes, _ := json.Marshal(body)
@@ -311,8 +315,9 @@ func TestCreateTarget_NameTooLong(t *testing.T) {
longName += "x"
}
body := map[string]interface{}{
"name": longName,
"type": "nginx",
"name": longName,
"type": "nginx",
"agent_id": "agent-001",
}
bodyBytes, _ := json.Marshal(body)
@@ -340,6 +345,65 @@ func TestCreateTarget_MethodNotAllowed(t *testing.T) {
}
}
// TestCreateTarget_MissingAgentID_Returns400 pins the C-002 handler contract:
// handler MUST reject a create payload that omits agent_id with HTTP 400
// before the service is invoked. Using a mock that would return 201-worthy
// success proves the guard fires.
func TestCreateTarget_MissingAgentID_Returns400(t *testing.T) {
body := map[string]interface{}{
"name": "New Target",
"type": "nginx",
// agent_id intentionally omitted
}
bodyBytes, _ := json.Marshal(body)
mock := &MockTargetService{
CreateTargetFn: func(_ context.Context, target domain.DeploymentTarget) (*domain.DeploymentTarget, error) {
// Would succeed if handler guard did not fire.
target.ID = "t-would-be-created"
return &target, nil
},
}
handler := NewTargetHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/targets", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.CreateTarget(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("expected 400, got %d — body=%s", w.Code, w.Body.String())
}
}
// TestCreateTarget_NonexistentAgent_Returns400 pins the C-002 handler↔service
// translation: when the service returns service.ErrAgentNotFound, the handler
// MUST map it to HTTP 400, not the generic 500 used for other service errors.
func TestCreateTarget_NonexistentAgent_Returns400(t *testing.T) {
mock := &MockTargetService{
CreateTargetFn: func(_ context.Context, target domain.DeploymentTarget) (*domain.DeploymentTarget, error) {
return nil, service.ErrAgentNotFound
},
}
body := map[string]interface{}{
"name": "New Target",
"type": "nginx",
"agent_id": "agent-does-not-exist",
}
bodyBytes, _ := json.Marshal(body)
handler := NewTargetHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/targets", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.CreateTarget(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("expected 400 for nonexistent agent, got %d — body=%s", w.Code, w.Body.String())
}
}
func TestUpdateTarget_Success(t *testing.T) {
now := time.Now()
mock := &MockTargetService{
+16
View File
@@ -3,12 +3,14 @@ package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"strconv"
"strings"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// TargetService defines the service interface for deployment target operations.
@@ -125,9 +127,23 @@ func (h TargetHandler) CreateTarget(w http.ResponseWriter, r *http.Request) {
ErrorWithRequestID(w, http.StatusBadRequest, "type is required", requestID)
return
}
// C-002: agent_id is a NOT NULL FK in deployment_targets (migration 000001
// line 104). Reject empty values at the boundary so callers get a clean 400
// with the field name rather than a generic "Failed to create target" 500.
if err := ValidateRequired("agent_id", target.AgentID); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
created, err := h.svc.CreateTarget(r.Context(), target)
if err != nil {
// C-002: a nonexistent agent_id is a client error, not a server error.
// The service returns ErrAgentNotFound (wrapped via fmt.Errorf %w) when
// agentRepo.Get fails; we translate that to 400 via errors.Is.
if errors.Is(err, service.ErrAgentNotFound) {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to create target", requestID)
return
}

Some files were not shown because too many files have changed in this diff Show More