mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 20:21:29 +00:00
v2.0.51
58 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
a3d8b9c607 |
fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master)
GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the
canonical quickstart (docker compose -f deploy/docker-compose.yml up -d
--build); postgres reported unhealthy indefinitely, dependent containers
never started.
Root cause: deploy/docker-compose.yml mounted a hand-curated subset of
migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/.
Postgres applied them at initdb time. Once seed.sql referenced columns
added by migrations *after* the mounted cutoff (e.g., policy_rules.severity
from migration 000013), initdb crashed mid-seed and the container loop
wedged. Two sources of truth (compose mount list vs in-tree migration
ladder) diverged the moment a seed-touching migration shipped, and the only
thing that fixed it was hand-editing the compose file every release.
Fix: remove the dual source. Postgres boots empty; the server applies
migrations + seed at startup via RunMigrations + RunSeed. Helm has used
this pattern since day one (postgres-init emptyDir); compose now matches.
Bundled with four ride-along audit findings whose fixes share the same
schema/db code surface, so operators take the schema-change pain only once:
- cat-u-seed_initdb_schema_drift [P1, primary]: initdb-mount fix
- cat-o-retry_interval_unit_mismatch [P1]: column rename minutes→seconds
- cat-o-notification_created_at_dead_field [P2]: add column + populate
- cat-o-health_check_column_orphans [P1]: drop unwired columns
- cat-u-no_version_endpoint [P2]: add /api/v1/version
Single migration (000017_db_coupling_cleanup) bundles the three schema
changes under a DO $$ guard so re-application is safe; reduces
operator-visible 'schema-change releases' from four to one.
Backend:
- internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed
(gated by CERTCTL_DEMO_SEED). Both idempotent (ON CONFLICT DO NOTHING in
every shipped INSERT) so repeated boots are safe; missing-file is no-op so
custom packaging that strips seeds still boots cleanly.
- cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set)
immediately after RunMigrations.
- internal/repository/postgres/notification.go:
NotificationRepository.Create now sets created_at (with time.Now() fallback
when caller leaves it zero); scanNotification reads it back; List +
ListRetryEligible SELECT extended.
- internal/repository/postgres/renewal_policy.go: column references updated
to retry_interval_seconds across SELECT/INSERT/UPDATE sites.
- internal/api/handler/version.go: new VersionHandler exposes {version,
commit, modified, build_time, go_version} from
runtime/debug.ReadBuildInfo() with ldflags-supplied Version override.
- internal/api/router/router.go: register GET /api/v1/version through the
no-auth chain (CORS + ContentType) alongside /health, /ready,
/api/v1/auth/info.
- cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit
ExcludePaths so rollout polling doesn't dominate the audit trail.
- internal/config/config.go: add DatabaseConfig.DemoSeed +
CERTCTL_DEMO_SEED env var.
Migration:
- migrations/000017_db_coupling_cleanup.up.sql + .down.sql:
(1) renewal_policies.retry_interval_minutes → retry_interval_seconds
(DO $$ guard, idempotent re-application)
(2) notification_events ADD COLUMN created_at TIMESTAMPTZ NOT NULL
DEFAULT NOW()
(3) network_scan_targets DROP orphan health_check_enabled +
health_check_interval_seconds
- migrations/seed.sql: column reference updated to retry_interval_seconds.
- migrations/seed_demo.sql: same column rename + applied at runtime now via
RunDemoSeed (no longer initdb-mounted).
Compose:
- deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files +
seed.sql); add start_period: 30s to postgres + certctl-server healthchecks
to absorb the runtime migration + seed application window on first boot.
- deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount
removed; that file never existed); same healthcheck start_period.
- deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with
CERTCTL_DEMO_SEED=true env var on certctl-server.
Tests:
- internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo,
TestVersion_RejectsNonGet, TestVersion_LdflagsOverride.
- internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently,
TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently,
TestMigration000017_RetryIntervalRename,
TestMigration000017_NotificationCreatedAt,
TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips).
- internal/repository/postgres/notification_test.go:
TestNotificationRepository_CreatedAt_IsPersisted +
TestNotificationRepository_CreatedAt_DefaultsToNow.
CI guardrail:
- .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb
(U-3)' step grep-fails the build if any migrations/*.sql or seed*.sql
re-appears in /docker-entrypoint-initdb.d in any compose file. Catches
future drift before a fresh-clone operator hits it.
Spec / Docs:
- api/openapi.yaml: add /api/v1/version operation under Health tag.
- docs/architecture.md: replace the 'initdb may run the same SQL' paragraph
with a post-U-3 single-source-of-truth explanation.
- CHANGELOG.md: full unreleased-section entry covering all 5 closures,
breaking changes, and the new env var.
Audit doc:
- coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14
cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to
✅ RESOLVED with closure prose pointing at this commit.
Verification: build/vet/test -short -race all clean across all touched
packages locally; govulncheck reports 0 vulnerabilities affecting our code;
OpenAPI YAML parses; CI U-3 grep guardrail clears against the post-fix tree.
|
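The version endpoint described in this commit can be sketched roughly as follows. This is a minimal illustration, not the repository's VersionHandler: the type and function names are hypothetical, and mapping vcs.time to build_time is an assumption about how the fields are filled.

```go
package main

import (
	"encoding/json"
	"net/http"
	"runtime/debug"
)

// Version is intended to be overridden at link time, for example with
// -ldflags "-X main.Version=v2.0.51". The variable name is hypothetical.
var Version = ""

// versionInfo mirrors the field set named in the commit message.
type versionInfo struct {
	Version   string `json:"version"`
	Commit    string `json:"commit"`
	Modified  bool   `json:"modified"`
	BuildTime string `json:"build_time"`
	GoVersion string `json:"go_version"`
}

// collectVersion reads the embedded build info and applies the ldflags
// override last, so an explicit Version always wins.
func collectVersion() versionInfo {
	info := versionInfo{Version: "dev"}
	if bi, ok := debug.ReadBuildInfo(); ok {
		info.GoVersion = bi.GoVersion
		if bi.Main.Version != "" {
			info.Version = bi.Main.Version
		}
		for _, s := range bi.Settings {
			switch s.Key {
			case "vcs.revision":
				info.Commit = s.Value
			case "vcs.time": // assumption: used as build_time
				info.BuildTime = s.Value
			case "vcs.modified":
				info.Modified = s.Value == "true"
			}
		}
	}
	if Version != "" {
		info.Version = Version
	}
	return info
}

// versionHandler serves the info as JSON and rejects non-GET methods.
func versionHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(collectVersion())
}

func main() {
	b, _ := json.Marshal(collectVersion())
	println(string(b))
}
```

Applying the override after reading the build info keeps release binaries authoritative while still reporting something useful for plain `go build` output.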
||
|
|
9c1d446e40 |
fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1)
The pre-G-1 config validator accepted CERTCTL_AUTH_TYPE=jwt and the
startup log faithfully echoed 'authentication enabled type=jwt'.
Reasonable people read that and concluded JWT auth was on. It wasn't.
The auth-middleware wiring at cmd/server/main.go unconditionally routed
every request through the api-key bearer middleware regardless of
cfg.Auth.Type. So CERTCTL_AUTH_TYPE=jwt quietly compared the incoming
'Authorization: Bearer <token>' against whatever string the operator put
in CERTCTL_AUTH_SECRET — real JWT clients got 401, and operators who
treated CERTCTL_AUTH_SECRET as a *signing* secret (because they thought
they were configuring JWT) had effectively handed an attacker an api-key.
A security finding masquerading as a config option.
We chose the audit-recommended structural fix: remove the option, fail
fast at startup, and add the gateway-fronting pattern as the documented
forward path. Implementing JWT middleware would have meant jwks vs
static-secret rotation, claim mapping, expiry enforcement, audience and
issuer validation, key rollover semantics, and regression coverage at the
same depth as the existing api-key path — a feature, not a fix. Operators
who genuinely need JWT/OIDC front certctl with an authenticating gateway
(oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium /
Authelia) and run the upstream certctl with CERTCTL_AUTH_TYPE=none. Same
shape works on docker-compose and Helm.
The change is comprehensive across 7 phases — every surface that
mentioned 'jwt' as a certctl-auth-type is updated, plus structural
backstops (typed enum, runtime guard, helm template validation, CI grep
guard) so the lie can't reappear.
Files changed:
Phase 1 — production code (typed enum + jwt removal):
- internal/config/config.go: AuthType typed alias + AuthTypeAPIKey /
AuthTypeNone constants + ValidAuthTypes() helper. Validate() routes
literal 'jwt' through a dedicated multi-line diagnostic naming the
authenticating-gateway pattern, then cross-checks against
ValidAuthTypes(). Secret-required branch simplified to api-key-only.
Field comment on AuthConfig.Type rewritten to drop jwt and point at
the gateway pattern.
- internal/api/middleware/middleware.go: AuthConfig.Type field comment
references the typed config.AuthType constants.
- internal/api/handler/health.go: same treatment for HealthHandler.AuthType.
- cmd/server/main.go: defense-in-depth runtime switch immediately after
config.Load() — exits 1 on any unsupported auth-type that bypassed the
validator. Auth-disabled startup log explicitly names the
authenticating-gateway pattern.
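A rough sketch of the Phase 1 shape (typed alias, allowed-set helper, dedicated jwt rejection). The names AuthType, AuthTypeAPIKey, AuthTypeNone and ValidAuthTypes come from the commit message; validateAuth, errJWTRemoved and the exact error wording are hypothetical stand-ins for the real Validate() logic.

```go
package main

import (
	"errors"
	"fmt"
)

// AuthType is a typed alias so stray string literals stand out in review.
type AuthType string

const (
	AuthTypeAPIKey AuthType = "api-key"
	AuthTypeNone   AuthType = "none"
)

// ValidAuthTypes returns the full allowed set; jwt is deliberately absent.
func ValidAuthTypes() []AuthType {
	return []AuthType{AuthTypeAPIKey, AuthTypeNone}
}

// errJWTRemoved is a hypothetical sentinel for the dedicated diagnostic.
var errJWTRemoved = errors.New(
	"CERTCTL_AUTH_TYPE=jwt is no longer accepted: run an authenticating " +
		"gateway in front of certctl and set CERTCTL_AUTH_TYPE=none")

// validateAuth checks the jwt literal first so operators get the targeted
// message rather than the generic invalid-value error.
func validateAuth(t AuthType, secret string) error {
	if t == "jwt" {
		return errJWTRemoved
	}
	valid := false
	for _, v := range ValidAuthTypes() {
		if t == v {
			valid = true
		}
	}
	if !valid {
		return fmt.Errorf("invalid auth type %q (allowed: %v)", t, ValidAuthTypes())
	}
	if t == AuthTypeAPIKey && secret == "" {
		return errors.New("CERTCTL_AUTH_SECRET is required when auth type is api-key")
	}
	return nil
}

func main() {
	for _, t := range []AuthType{"api-key", "none", "jwt", "basic"} {
		fmt.Printf("%-8s -> %v\n", t, validateAuth(t, "s3cret"))
	}
}
```

Checking the jwt literal before the generic allowlist means the dedicated error fires whether or not a secret is set, which is the behavior the Phase 2 tests pin.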
Phase 2 — tests (Red→Green, contract pinning):
- internal/config/config_test.go: TestValidate_JWTAuth_RejectedDedicated
(two table rows pinning the dedicated G-1 error fires regardless of
whether Secret is set), TestValidAuthTypesDoesNotContainJWT (property
guard against future re-introduction),
TestValidAuthTypesIsExactly_APIKey_None (allowed-set contract),
TestValidate_GenericInvalidAuthType (pins non-jwt invalid values still
hit the generic invalid-auth-type error). Removed the prior
TestValidate_JWTAuth_MissingSecret happy-path since its premise is
inverted post-G-1.
- internal/api/handler/health_test.go: removed
TestAuthInfo_ReturnsAuthType_JWT (which baked the silent-downgrade lie
into the regression suite). Pre-existing _APIKey test continues to
cover the api-key happy path.
Phase 3 — spec, docs, env templates:
- api/openapi.yaml: auth_type enum dropped to [api-key, none] with
inline comment naming the G-1 closure.
- .env.example (root): CERTCTL_AUTH_TYPE comment block rewritten to drop
jwt and point at the gateway pattern; secret-required conditional
simplified to api-key-only.
- docs/architecture.md: middleware-stack bullet rewritten to drop the
JWT mention; new H3 'Authenticating-gateway pattern (JWT, OIDC, mTLS)'
section explaining the design rationale and listing oauth2-proxy /
Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia / Caddy
forward_auth / Apache mod_auth_openidc / nginx auth_request as the
standard fronting options.
- docs/upgrade-to-v2-jwt-removal.md (new ~125 lines): migration guide
with preconditions, what-changes, both recovery paths, complete
docker-compose oauth2-proxy walkthrough, Traefik ForwardAuth and Envoy
ext_authz patterns, rollback posture.
Phase 4 — Helm chart (template validation + docs):
- deploy/helm/certctl/templates/_helpers.tpl: new certctl.validateAuthType
helper mirroring the existing certctl.tls.required pattern. Fails
template render on any server.auth.type outside {api-key, none} with
a multi-line diagnostic.
- deploy/helm/certctl/templates/server-deployment.yaml,
server-configmap.yaml, server-secret.yaml: invoke the helper at the
top of each template that depends on .Values.server.auth.type.
- deploy/helm/certctl/values.yaml: auth: block comment expanded with the
G-1 rationale and gateway-pattern cross-reference.
- deploy/helm/CHART_SUMMARY.md: server.auth.type table row now surfaces
the allowed set and points at the upgrade doc.
- deploy/helm/certctl/README.md: new 'JWT / OIDC via authenticating
gateway' section with a Kubernetes-flavored oauth2-proxy + certctl
walkthrough.
Phase 5 — release surface:
- CHANGELOG.md: new [unreleased] top entry with Breaking / Removed /
Added / Changed sections; explicit pointer at
docs/upgrade-to-v2-jwt-removal.md from the Breaking subsection.
Phase 6 — CI guardrail:
- .github/workflows/ci.yml: new 'Forbidden auth-type literal regression
guard (G-1)' step. Scoped patterns catch the actual regression shapes
(map literal, slice literal, switch case, OpenAPI enum, env-file
default, AuthType('jwt') cast). Comments and the dedicated rejection
branch are intentionally exempt; connector-package JWT references
(Google OAuth2 / step-ca) are exempt as out-of-scope external
protocols. Verified locally: the guard passes on the actual tree and
fires on all 4 synthetic regression patterns.
Out of scope (explicitly untouched):
- internal/connector/discovery/gcpsm/gcpsm.go — Google OAuth2 service-
account JWT (external protocol).
- internal/connector/issuer/googlecas/googlecas.go — same.
- internal/connector/issuer/stepca/stepca.go — step-ca's provisioner
one-time-token JWT for /sign API.
- docs/test-env.md, docs/connectors.md, docs/features.md — describe
external CAs' use of JWT, not certctl's auth shape.
- Implementing actual JWT middleware. Feature, not a fix.
Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go test -short ./... — every package green
- go test -short -race ./internal/config/... ./internal/api/... — clean
- govulncheck ./... — no vulnerabilities in our code
- helm lint deploy/helm/certctl/ — clean
- helm template with auth.type=api-key — renders OK
- helm template with auth.type=none — renders OK
- helm template with auth.type=jwt — fails with validateAuthType
diagnostic (exit 1)
- python3 yaml.safe_load on api/openapi.yaml — parses
- CI guardrail mirror — clean on real tree, fires on all 4 synthetic
regression patterns
- Smoke test: 'CERTCTL_AUTH_TYPE=jwt ./certctl-server' exits non-zero
with: 'Failed to load configuration: CERTCTL_AUTH_TYPE=jwt is no
longer accepted (G-1 silent auth downgrade): no JWT middleware ships
with certctl. To use JWT/OIDC, run an authenticating gateway
(oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium) in
front of certctl and set CERTCTL_AUTH_TYPE=none on the upstream.
See docs/architecture.md "Authenticating-gateway pattern" and
docs/upgrade-to-v2-jwt-removal.md for the migration walkthrough'
config pkg coverage: ValidAuthTypes 100%, Validate 94.7%, total 75.5%.
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-g-jwt_silent_auth_downgrade
Audit recommendation followed verbatim: 'Remove jwt from
validAuthTypes until middleware ships'.
|
||
|
|
9834b4e4a4 |
G-1: renewal-policies API + frontend FK-drift fix
Three frontend call sites (OnboardingWizard.tsx:603, CertificatesPage.tsx:52,
CertificateDetailPage.tsx:169) populated the renewal_policy_id dropdown from
getPolicies(), the compliance-rule endpoint that returns pol-* IDs. Those IDs
violate the FK managed_certificates.renewal_policy_id REFERENCES
renewal_policies(id) ON DELETE RESTRICT, so Create would fail with pg 23503
(foreign_key_violation) at insert.
Backend (new):
- RenewalPolicyRepository CRUD + ListAll/ExistsByID (pg 23503 → ErrRenewalPolicyInUse
→ HTTP 409; pg 23505 → ErrRenewalPolicyDuplicateName → HTTP 409)
- RenewalPolicyService with repo-only constructor. Service sentinels
var-alias the repo sentinels so errors.Is walks across layers.
- RenewalPolicyHandler with validation bounds: name 1–255;
renewal_window_days [1,365] default 30; max_retries [0,10] not defaulted;
retry_interval_seconds [60,86400] default 3600; alert_thresholds_days
[0,365] default [30,14,7,0]. Auto-generated IDs rp-<slug(name)>.
- Router registers 5 routes under /api/v1/renewal-policies[/{id}].
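The validation bounds listed above can be illustrated with a short sketch. The struct, the normalize function and the slug rule are hypothetical. Treating a zero value as "apply the default" and measuring the name length in bytes are both assumptions, and the real slug algorithm is not described in the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// policy is a hypothetical request shape for illustration only.
type policy struct {
	ID                   string
	Name                 string
	RenewalWindowDays    int   // 0 is treated as "unset" (assumption)
	MaxRetries           int   // not defaulted; 0 is a legal value
	RetryIntervalSeconds int   // 0 is treated as "unset" (assumption)
	AlertThresholdsDays  []int // nil is treated as "unset" (assumption)
}

// slug lowercases the name and collapses non-alphanumeric runs to '-'.
func slug(name string) string {
	var b strings.Builder
	dash := false
	for _, r := range strings.ToLower(name) {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
			dash = false
		} else if !dash && b.Len() > 0 {
			b.WriteByte('-')
			dash = true
		}
	}
	return strings.TrimSuffix(b.String(), "-")
}

// normalize applies defaults, then enforces the documented ranges.
func normalize(p *policy) error {
	if n := len(p.Name); n < 1 || n > 255 {
		return fmt.Errorf("name length %d outside [1,255]", n)
	}
	if p.RenewalWindowDays == 0 {
		p.RenewalWindowDays = 30
	}
	if p.RenewalWindowDays < 1 || p.RenewalWindowDays > 365 {
		return fmt.Errorf("renewal_window_days %d outside [1,365]", p.RenewalWindowDays)
	}
	if p.MaxRetries < 0 || p.MaxRetries > 10 {
		return fmt.Errorf("max_retries %d outside [0,10]", p.MaxRetries)
	}
	if p.RetryIntervalSeconds == 0 {
		p.RetryIntervalSeconds = 3600
	}
	if p.RetryIntervalSeconds < 60 || p.RetryIntervalSeconds > 86400 {
		return fmt.Errorf("retry_interval_seconds %d outside [60,86400]", p.RetryIntervalSeconds)
	}
	if p.AlertThresholdsDays == nil {
		p.AlertThresholdsDays = []int{30, 14, 7, 0}
	}
	for _, d := range p.AlertThresholdsDays {
		if d < 0 || d > 365 {
			return fmt.Errorf("alert threshold %d outside [0,365]", d)
		}
	}
	if p.ID == "" {
		p.ID = "rp-" + slug(p.Name)
	}
	return nil
}

func main() {
	p := policy{Name: "Prod 90-day!"}
	fmt.Println(normalize(&p), p.ID, p.RenewalWindowDays, p.RetryIntervalSeconds)
}
```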
Frontend:
- CertificatesPage/CertificateDetailPage/OnboardingWizard now call
getRenewalPolicies() and render rp-* IDs.
- client.ts adds getRenewalPolicies/createRenewalPolicy/updateRenewalPolicy/
deleteRenewalPolicy. types.ts adds the RenewalPolicy shape.
OpenAPI: RenewalPolicies tag + 5 operations + 3 schemas (RenewalPolicy,
RenewalPolicyCreateRequest, RenewalPolicyUpdateRequest). 409 responses
on create/update duplicate-name and delete FK-in-use.
No migration — renewal_policies table already exists from the initial
schema (000001).
Tests:
- internal/service/renewal_policy_test.go: CRUD + validation + sentinel
error wrapping.
- internal/api/handler/renewal_policy_handler_test.go: handler endpoint
contracts including 400/404/409.
- web/src/api/client.test.ts: 4 subtests covering the 4 new API functions.
Phase 3 gates all green: go vet, build, short tests, race tests (service/
handler/router/scheduler), staticcheck (G-1 packages), govulncheck (0
reachable), coverage (service 69.7%, handler 79.0%, domain 86.9%,
middleware 80.6% — all above thresholds), tsc, vitest (256 passed),
vite build, OpenAPI structural validation.
|
||
|
|
52248be717 |
v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP
Breaking change release. Plaintext HTTP listener removed. The certctl
control plane now terminates TLS 1.3 on :8443 via
http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape hatch.
No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md.
Server:
- cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert swap,
buildServerTLSConfig (TLS 1.3 min, GetCertificate callback),
preflightServerTLS validation
- cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe,
watchSIGHUP wiring, cert/key path config threading
- tls_test.go: 418-line regression coverage of reload, preflight, callback
behavior, SAN validation
Config:
- CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required)
- Plaintext rejection: agents/CLI/MCP pre-flight-fail on http:// URLs with
a pointer to docs/upgrade-to-tls.md
Agents, CLI, MCP:
- All three pre-flight-reject http:// URLs with fail-loud diagnostic
- CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust
- CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass (loud warning
on startup)
- install-agent.sh emits both vars as commented template lines
docker-compose:
- certctl-tls-init sidecar generates SAN-valid self-signed cert into
deploy/test/certs/ on first boot
- All demo-stack curls pin against ca.crt with --cacert
Helm chart:
- Three TLS provisioning modes, exactly one required:
  - server.tls.existingSecret (operator-supplied)
  - server.tls.certManager.enabled (cert-manager integration)
  - server.tls.selfSigned.enabled (eval only, not for production)
- server-certificate.yaml template for cert-manager mode
- helm install without a TLS source fails at template render with a pointer
to docs/tls.md
CI:
- .github/workflows/ci.yml Helm Chart Validation step renders the chart in
both existingSecret and cert-manager modes, plus an inverse
guard-regression test that asserts helm template MUST refuse to render when
no TLS source is configured. Previously the single `helm template`
invocation hit the certctl.tls.required fail-loud guard and exit-1'd CI.
Four invocations now: lint (existingSecret), template (existingSecret),
template (cert-manager), template (no args, must fail).
Integration tests:
- deploy/test/integration_test.go stands up the Compose stack over HTTPS,
extracts the CA bundle, and exercises every certctl API over
https://localhost:8443
- All 34 integration subtests green (per Phase 8 local CI-parity)
Documentation:
- New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload)
- New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade warnings,
fleet-roll sequencing)
- CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry (file heading
unchanged; release tag is v2.0.47)
- All curls in docs/, examples/, deploy/helm/ guides use
https://localhost:8443 --cacert
Verification:
- grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits
- grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin API
default, SSRF doc comment), zero certctl endpoints
- Tasks #197–#206 (Phases 0–8) all closed in the tracker
Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix).
|
||
|
|
6e646e0fe8 |
M-001/M-006: strip HTTP auth from EST/SCEP + fail-loud SCEP preflight
Closes CWE-306 (missing authentication for critical function) for SCEP
via a fail-loud startup gate, and aligns EST/SCEP HTTP dispatch with
their respective RFCs. CRL/OCSP remain unauthenticated under
.well-known/pki/* per RFC 5280 §5 / RFC 6960 / RFC 8615. Option (D):
no mTLS in this milestone.
- RFC 7030 §3.2.3 (EST auth is deployment-specific) and §4.1.1
(/cacerts explicitly anonymous): EST paths served unauthenticated;
CSR-signature + profile policy enforce identity inside ESTService.
- RFC 8894 §3.2: SCEP authenticates via the challengePassword
PKCS#10 attribute (OID 1.2.840.113549.1.9.7), not an HTTP credential.
HTTP dispatch is unauthenticated; preflightSCEPChallengePassword
refuses to start when CERTCTL_SCEP_ENABLED=true without
CERTCTL_SCEP_CHALLENGE_PASSWORD. SCEPService.PKCSReq enforces the
same invariant defense-in-depth and compares with
crypto/subtle.ConstantTimeCompare.
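The two SCEP checks described here (a startup gate plus a constant-time comparison at request time) can be sketched as below. The function name preflightSCEPChallengePassword appears in the commit, but its signature here is an assumption, and challengeMatches is a hypothetical helper.

```go
package main

import (
	"crypto/subtle"
	"errors"
)

// preflightSCEPChallengePassword refuses startup when SCEP is enabled
// without a shared secret. The signature is assumed for illustration.
func preflightSCEPChallengePassword(enabled bool, password string) error {
	if enabled && password == "" {
		return errors.New("CERTCTL_SCEP_ENABLED=true requires CERTCTL_SCEP_CHALLENGE_PASSWORD")
	}
	return nil
}

// challengeMatches compares the configured secret with the value from the
// PKCS#10 challengePassword attribute without an early exit on the first
// differing byte. An empty configured secret never matches.
func challengeMatches(configured, presented string) bool {
	if configured == "" {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(configured), []byte(presented)) == 1
}

func main() {
	if err := preflightSCEPChallengePassword(true, ""); err != nil {
		println("startup refused:", err.Error())
	}
	println("match:", challengeMatches("s3cret", "s3cret"))
}
```

Note that subtle.ConstantTimeCompare returns immediately when the two lengths differ, so only the content comparison is constant-time, not the length check.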
cmd/server/main.go:
- Extract buildFinalHandler(apiHandler, noAuthHandler, webDir,
dashboardEnabled); route /.well-known/est/*, /scep, /scep/*,
/.well-known/pki/crl/{id}, /.well-known/pki/ocsp/{id}/{serial},
and health probes through noAuthHandler (RequestID +
structuredLogger + Recovery only).
- Add preflightSCEPChallengePassword fail-loud gate; startup log
emits challenge_password_set boolean for operator visibility.
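A simplified sketch of the dispatch split: protocol endpoints and health probes go to the unauthenticated chain, everything else to the authenticated API handler. The two-argument dispatch function and isNoAuthPath are hypothetical and omit the dashboard and static-file arms of the real buildFinalHandler; /health and /ready are assumed to be the probe paths.

```go
package main

import (
	"net/http"
	"strings"
)

// isNoAuthPath reports whether a request path belongs to the
// standards-based or probe surface that bypasses API-key auth.
func isNoAuthPath(p string) bool {
	switch {
	case p == "/health" || p == "/ready":
		return true
	case p == "/scep" || strings.HasPrefix(p, "/scep/"):
		return true
	case strings.HasPrefix(p, "/.well-known/est/"):
		return true
	case strings.HasPrefix(p, "/.well-known/pki/crl/"),
		strings.HasPrefix(p, "/.well-known/pki/ocsp/"):
		return true
	}
	return false
}

// dispatch routes on URL.Path, which excludes the query string, so
// /scep?operation=GetCACaps lands on the same arm as /scep.
func dispatch(apiHandler, noAuthHandler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isNoAuthPath(r.URL.Path) {
			noAuthHandler.ServeHTTP(w, r)
			return
		}
		apiHandler.ServeHTTP(w, r)
	})
}

func main() {
	for _, p := range []string{"/scep", "/scepx", "/.well-known/est/cacerts", "/api/v1/certificates"} {
		println(p, isNoAuthPath(p))
	}
}
```

Matching /scep exactly and /scep/ as a prefix avoids accidentally exempting unrelated paths such as /scepx.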
cmd/server/finalhandler_test.go (new, 314 lines, 27 subtests):
- TestBuildFinalHandler_Dispatch (20) + TestBuildFinalHandler_NoDashboard
(7) pin the dispatch surface: EST 4-endpoint, SCEP exact +
trailing-slash + query-string, PKI CRL+OCSP, health, /api/v1/*
authenticated, /assets/* file server, SPA fallback.
internal/api/router/router.go, internal/config/config.go:
- Router-level comments explain why EST/SCEP/PKI dispatchers sit
outside the authenticated mux; SCEP challenge password config
plumbed through.
docs/architecture.md:
- New EST Authentication subsection (RFC 7030 §3.2.3 + §4.1.1,
buildFinalHandler + noAuthHandler references).
- Rewrite SCEP Authentication subsection; replaces pre-existing
factually-incorrect "any value accepted" claim with CWE-306
preflight, service-layer defense-in-depth, and
crypto/subtle.ConstantTimeCompare.
- Top-level Authentication section: qualify /api/v1/* scope on API
clients bullet; add standards-based-endpoints bullet referencing
the 27-subtest regression harness.
docs/compliance-soc2.md:
- CC6.1: scope API Key Authentication to /api/v1/*; add
standards-based endpoints bullet citing RFCs and CWE-306 closure.
- CC6.3: scope API Key Policy to /api/v1/* with cross-reference to
CC6.1.
- Evidence Locations augmented with buildFinalHandler,
preflightSCEPChallengePassword, scep.go defense path, regression
harness, and OpenAPI security:[] overrides.
api/openapi.yaml: verified already correct (global bearerAuth
default overridden with security:[] on /cacerts, /simpleenroll,
/simplereenroll, /csrattrs, /scep GET+POST, /crl/{issuer_id},
/ocsp/{issuer_id}/{serial}); no edits needed.
|
||
|
|
675b87ba63 |
I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.
Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
(next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
Dead rows clear next_retry_at so the index stops matching them.
Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
exponential backoff capped at 1h (notifRetryBackoffCap) with
5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
(audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
status -> pending via notifRepo.Requeue.
Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.
StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
NewStatsService call sites (main.go + stats_test.go + 8 digest
tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
(reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
the same snapshot.
Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.
Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
with queryKey: ['notifications', activeTab] so switching tabs
doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
full-text title tooltip.
- Requeue mutation wrapped as
mutationFn: (id: string) => requeueNotification(id)
to prevent react-query v5's positional context argument from
leaking into the API client — pinned against future refactors
by strict-match toHaveBeenCalledWith('notif-dead-001') in
NotificationsPage.test.tsx:181.
Closes I-005.
|
||
|
|
1ee77c89f8 |
I-003: job timeout reaper closes AwaitingCSR/AwaitingApproval gap
Add 11th always-on scheduler loop that transitions jobs stuck in
AwaitingCSR (default 24h TTL) or AwaitingApproval (default 168h TTL)
to Failed. I-001's retry loop then auto-promotes eligible Failed jobs
back to Pending. No new status enum, no schema migration.
- JobRepository.ListTimedOutAwaitingJobs with per-status cutoff WHERE
- JobService.ReapTimedOutJobs mirrors RetryFailedJobs structure
- Scheduler jobTimeoutLoop with atomic.Bool idempotency guard, 2m
per-tick context, WaitGroup shutdown drain
- Config: CERTCTL_JOB_TIMEOUT_INTERVAL (10m), CERTCTL_JOB_AWAITING_CSR_TIMEOUT
(24h), CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT (168h)
- Audit event per transition: actor=system, actorType=System,
action=job_timeout, details={old_status, new_status, timeout_reason,
age_hours}
- 14 new tests: 3 config, 7 service, 4 scheduler
|
||
|
|
0200c7f4a4 |
Close I-001 (RetryFailedJobs never invoked) coverage-gap finding
Operator decision answered as Option A: JobService.RetryFailedJobs is
now wired into the scheduler as an always-on 10th loop. Prior to this
commit the method was implemented, unit-tested, and exported but had
zero runtime callers — any job that transitioned to status=Failed stayed
Failed forever regardless of how many attempts it had remaining.
Scheduler — 10th loop:
internal/scheduler/scheduler.go grows a jobRetryLoop alongside the
existing nine loops (renewal, jobs, health, notifications, short-lived,
network scan, digest, health check, cloud discovery). The loop follows
the established run-immediately-then-tick pattern (same shape as
jobProcessorLoop), gated by a sync/atomic.Bool idempotency guard and
joined into the scheduler's sync.WaitGroup so WaitForCompletion drains
it on graceful shutdown. Each tick runs under a 2-minute context
timeout mirroring jobProcessorLoop's opCtx budget. The runJobRetry
helper invokes jobService.RetryFailedJobs(ctx, 3) — the advisory
maxRetries cap is belt-and-suspenders; per-job eligibility is still
enforced inside the service via Attempts < MaxAttempts.
The JobServicer scheduler-interface gains RetryFailedJobs so the
scheduler's dependency surface stays explicit and mockable.
Service — audit trail per retry:
internal/service/job.go:RetryFailedJobs now emits an audit event for
every Failed→Pending transition. Following the house convention used
by all scheduler-emitted events, actor='system' and actorType=
domain.ActorTypeSystem; action='job_retry'; details capture
old_status, new_status, attempts, max_attempts. JobService carries an
optional *AuditService (SetAuditService) that nil-guards to preserve
test-wiring ergonomics — existing tests that construct JobService
without an audit service continue to pass unchanged.
Config — env var with sane default:
internal/config/config.go:SchedulerConfig grows RetryInterval, wired
to CERTCTL_SCHEDULER_RETRY_INTERVAL with a 5-minute default. Validate
rejects intervals below 1 second (matches other scheduler interval
validators).
Server wiring:
cmd/server/main.go calls jobService.SetAuditService(auditService)
after JobService construction and sched.SetJobRetryInterval(
cfg.Scheduler.RetryInterval) alongside the other SetXxxInterval calls.
Regression coverage:
internal/service/job_test.go (3 new)
- TestJobService_RetryFailedJobs_EligibleJobTransitionsAndAudits
- TestJobService_RetryFailedJobs_SkipsJobsAtMaxAttempts
- TestJobService_RetryFailedJobs_NoAuditServiceOK
internal/scheduler/scheduler_test.go (3 new)
- TestScheduler_JobRetryLoop_CallsService
- TestScheduler_JobRetryLoop_IdempotencyGuard
- TestScheduler_JobRetryLoop_WaitForCompletion
The service tests assert status transitions, attempt-cap short-
circuiting, and audit event shape (actor='system', action='job_retry',
details keys). The scheduler tests assert the loop invokes the service,
the atomic.Bool guard skips overlapping ticks with the expected
'still running, skipping tick' log, and WaitForCompletion drains the
in-flight tick on Stop.
Residual follow-up (not in scope for this commit):
internal/service/renewal.go:RetryFailedJobs is a parallel dead-code
duplicate of the same logic on RenewalService — untested and has no
runtime caller. The audit finding called this out as 'implemented
twice'. Removing it is a separate cleanup and does not block the
Option-A wiring this commit delivers.
Files:
cmd/server/main.go — SetAuditService + SetJobRetryInterval
internal/config/config.go — RetryInterval field + env + validate
internal/scheduler/scheduler.go — 10th loop, interface, field, setter
internal/scheduler/scheduler_test.go — 3 new scheduler-loop tests
internal/service/job.go — RetryFailedJobs audit emission + SetAuditService
internal/service/job_test.go — 3 new service-layer tests
|
||
|
|
3287e174dc |
Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001)
Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006)
on top of the C-001/C-002 ownership + agent-FK contract fixes landed in
|
||
|
|
b3cc7cbdb2 |
fix(policies): close the D-006 loop — TitleCase seed canonicals + severity-aware, config-consuming rule engine (D-008)
D-008 was a three-part drift in the policy engine that made the
D-005/D-006 remediation cosmetic below the DB layer:
(a) migrations/seed.sql INSERTed rules with pre-D-005 lowercase
types ('ownership', 'environment', 'lifetime', 'renewal_window')
that the handler validator rejects on Create/Update but that
raw SQL INSERTs bypassed entirely. At runtime evaluateRule's
switch fell through to the default "unknown policy rule type"
error branch on every demo rule × every cert × every cycle,
flooding logs while emitting zero violations.
(b) migrations/seed_demo.sql persisted lowercase severity values
('critical', 'error', 'warning') on policy_violations rows.
INSERT succeeded because that column had no CHECK, but any
frontend comparing against the canonical PolicySeverity enum
mis-categorized every seeded violation.
(c) evaluateRule hardcoded Severity: PolicySeverityWarning on
every emitted violation and ignored rule.Config entirely —
so the D-006 per-rule severity column (000013) and every
per-arm Config JSON ({allowed_issuer_ids, allowed_domains,
required_keys, allowed, lead_time_days, max_days}) was dead
data below the evaluation layer.
This commit lands (a)+(b)+(c) atomically. Shipping any subset
leaves the feature half-working.
## Changes
Domain (internal/domain/policy.go):
* Add PolicyTypeCertificateLifetime as the 6th TitleCase canonical.
Pre-D-008 the seeded "max-certificate-lifetime" rule had no engine
arm — routing it through RenewalLeadTime would conflate "how
close to expiry before we renew" with "how long can the cert
possibly be", two distinct semantics. The new type accepts
config {"max_days": int} and flags certs whose
NotAfter - NotBefore exceeds the cap.
Handler validator (internal/api/handler/validation.go):
* ValidatePolicyType allowlist grown to 6 canonicals
(AllowedIssuers, AllowedDomains, RequiredMetadata,
AllowedEnvironments, RenewalLeadTime, CertificateLifetime).
OpenAPI (api/openapi.yaml):
* PolicyType enum grown to match domain.
Frontend (web/src/api/types.ts, types.test.ts):
* POLICY_TYPES tuple gains CertificateLifetime; pin test asserts
all 6 canonicals and rejects casing drift.
Migration 000014 (policy_violations severity CHECK):
* Named CHECK constraint (policy_violations_severity_check)
mirroring 000013's allowlist, defense-in-depth at the DB layer
against future drift from bypassed writes (migrations, psql
sessions, future callers). Symmetric down migration drops by
name.
Seed data:
* migrations/seed.sql rewritten to emit TitleCase canonicals with
per-arm config JSON that actually exercises the config-consuming
paths (not the missing-field backstops):
- pr-require-owner → RequiredMetadata {"required_keys":["owner"]} Warning
- pr-allowed-environments → AllowedEnvironments {"allowed":["production","staging","development"]} Error
- pr-max-certificate-lifetime → CertificateLifetime {"max_days":90} Critical
- pr-min-renewal-window → RenewalLeadTime {"lead_time_days":14} Warning
Severities are now differentiated per rule (D-006 intent).
* migrations/seed_demo.sql violation rows flipped to TitleCase
severity ('Critical', 'Error', 'Warning') so migration 000014
applies cleanly on upgrade paths.
Engine rewrite (internal/service/policy.go):
* evaluateRule rewritten. All six arms now:
1. Parse rule.Config into the per-arm typed struct.
2. Bad JSON → log at ValidateCertificate boundary and skip
this rule (no co-located poisoning of other rules in the
same batch).
3. Empty/null Config → emit the pre-D-008 missing-field
violation (backwards compat invariant — operators who
haven't reconfigured still see the same output).
4. Violations emitted carry rule.Severity (no more hardcoded
Warning); D-006 column is now load-bearing.
* CertificateLifetime arm reads NotBefore/NotAfter from the
certificate's latest version via CertRepo. Injected via
PolicyService.SetCertRepo() setter — avoids churning ~36
NewPolicyService call sites while keeping the lifetime arm
optional (degrades to a log+skip if the setter is not wired).
Server wiring (cmd/server/main.go):
* policyService.SetCertRepo(certRepo) wired after construction.
Tests (internal/service/policy_test.go):
* 25 new subtests across 5 groups:
- TestEvaluateRule_SeverityPassThrough (6): every rule type
emits violations carrying rule.Severity, not hardcoded.
- TestEvaluateRule_ConfigConsumed (12): every per-arm Config
path exercised positive + negative.
- TestEvaluateRule_EmptyConfig_BackCompat (3): empty/null
Config still emits pre-D-008 missing-field violations.
- TestEvaluateRule_BadConfig_SkipsRule: malformed JSON logs
and skips cleanly without poisoning neighbors.
- TestEvaluateRule_CertificateLifetime_RepoScenarios (3):
ok when repo wired, log+skip when not, handles missing
NotBefore/NotAfter edges.
Provenance: D-008 surfaced during D-005/D-006 remediation review
in
|
||
|
|
d14a45401b |
fix(audit): drain in-flight recording goroutines on shutdown (M-1)
Audit events spawned from the HTTP middleware ran in detached goroutines
using context.Background(). On SIGTERM the DB pool was closed before those
goroutines finished writing, silently dropping audit events (CWE-662
Improper Synchronization / CWE-400 Uncontrolled Resource Consumption).
NewAuditLog now returns an *AuditMiddleware struct that tracks every
spawned goroutine with sync.WaitGroup. Callers wire the middleware via its
Middleware method value (preserves the existing
func(http.Handler) http.Handler shape) and drain the WaitGroup with
Flush(ctx), which blocks until in-flight recordings complete or the
provided context is cancelled, mirroring scheduler.WaitForCompletion.
Flush is invoked in cmd/server/main.go between http.Server.Shutdown (no new
requests accepted) and db.Close (pool torn down), with a timeout returning
ErrAuditFlushTimeout wrapping ctx.Err().
Request-derived inputs (method, path, status) are snapshotted before the
goroutine spawn so the worker does not race with http.Server reusing r
after the handler returns.
Tests:
- TestAuditLog_FlushDrainsInFlightGoroutines
- TestAuditLog_FlushTimeoutReturnsErrAuditFlushTimeout
Verification:
- go build ./... : 0
- go vet ./... : 0
- go test -race -short ./... : 0 (all packages)
- go test -cover ./internal/api/middleware : 81.4%
- golangci-lint run : 0 issues
- govulncheck ./... : 0 vulns in called code
|
||
|
|
27afa4463d |
fix(repository): idempotent sentinel agent creation via ON CONFLICT (M-6)
Sentinel agents (server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm)
were created on startup with a plain INSERT whose duplicate-key error was
swallowed unconditionally. That silenced every other DB failure too
(connectivity drop, permissions change, unrelated constraint violation), so
a restart after the first boot quietly de-fanged cloud discovery and the
network scanner (CWE-662, CWE-209-adjacent).
Shape A: add AgentRepository.CreateIfNotExists using
ON CONFLICT (id) DO NOTHING RETURNING id + sql.ErrNoRows discrimination.
This keeps the strict Create semantics (duplicate-key is an error) intact
for real agent registration and gives sentinels their own idempotent path.
- repo: CreateIfNotExists returns (created bool, err error); false,nil on
pre-existing row; false,wrapped err on anything else.
- interface: CreateIfNotExists added to AgentRepository.
- main.go: 4 sentinel sites log Error/Info/Debug distinctly.
- mocks: service + integration mocks implement the new method.
- tests: 4 new testcontainers integration tests cover first-insert,
idempotent second-call, concurrent 16-goroutine race (exactly one creator,
no duplicate-key panic), and pre-cancelled context surfacing.
Coverage gates (go test -cover): service 67.6%/55, handler 78.6%/60,
domain 92.7%/40, middleware 80.0%/30, crypto 86.7%/85.
Race/vet/golangci-lint v2.11.4 (0 issues)/govulncheck v1.2.0 clean across
all touched packages.
|
||
|
|
5abeeb882b | fix(crypto): per-ciphertext PBKDF2 salt + v2 versioned format with v1 fallback (M-8) | ||
|
|
b813660c74 |
security: require SCEP challenge password when SCEP enabled (fixes H-2)
Problem (CWE-306 Missing Authentication for Critical Function):
internal/service/scep.go PKCSReq skipped the shared-secret check when
s.challengePassword was empty. An unconfigured-but-enabled SCEP server
accepted any unauthenticated client reaching /scep and issued a
certificate against the configured issuer for any CSR with a valid
signature. No audit trail distinguished authenticated from
unauthenticated enrollments. This matches the two-layer fail-closed
pattern already used for C-2 (
|
||
|
|
f549a7aa79 |
security: fail closed when CERTCTL_CONFIG_ENCRYPTION_KEY is unset (fixes C-2)
EncryptIfKeySet/DecryptIfKeySet in internal/crypto/encryption.go previously
returned plaintext + wasEncrypted=false when the operator had not configured
CERTCTL_CONFIG_ENCRYPTION_KEY. That produced a data-at-rest confidentiality
bypass (CWE-311): sensitive fields on dynamically-configured issuer and
target rows (source='database') were persisted to PostgreSQL without any
encryption, and no caller could distinguish the encrypted from the plaintext
branch at runtime. The only visible signal was a single warning log line
emitted once at startup.
Fail closed instead:
- EncryptIfKeySet / DecryptIfKeySet now return crypto.ErrEncryptionKeyRequired
(a new exported sentinel, errors.Is-unwrappable) when the key is empty or
nil, rather than silently emitting plaintext. The (result, wasEncrypted,
err) tuple signature is preserved for source compatibility; only the
semantics of the no-key branch changed.
- cmd/server/main.go grows a startup pre-flight check: if no encryption key
is configured the server lists issuers and targets, counts rows with
source='database', and refuses to start (os.Exit(1)) if any exist. Operators
must either configure CERTCTL_CONFIG_ENCRYPTION_KEY or remove the exposed
rows before the control plane can boot. The warning-only path is retained
for the clean-slate case (no database rows).
- internal/service/issuer.go's SeedFromEnvVars now guards the encryption call
with len(s.encryptionKey) > 0 so env-seeded rows (source='env', which are
reconstructable on every boot from process env) continue to persist as
plaintext in the 'config' column when no key is configured. Registry load
already falls through to cfg.Config when EncryptedConfig is nil. GUI/API
write paths (source='database') remain fail-closed via propagation of
ErrEncryptionKeyRequired.
- Integration tests that exercise CreateIssuer via the handler layer now
supply a real 32-byte AES-256 test key so the encrypt path runs instead of
returning ErrEncryptionKeyRequired. Same pattern in internal/service/
testutil_test.go for consolidated service-layer tests.
- internal/crypto/encryption_test.go grows regression guards:
TestEncryptIfKeySet_EmptyKeyFailsClosed (nil_key + empty_key subtests),
TestDecryptIfKeySet_EmptyKeyFailsClosed (nil_key + empty_key subtests),
TestEncryptDecryptIfKeySet_RoundTripProducesDifferentCiphertext,
TestDecryptIfKeySet_RejectsTamperedCiphertext, and
TestEncryptIfKeySet_PreservesErrEncryptionKeyRequiredSentinel (verifies
the sentinel unwraps through fmt.Errorf(%w)-style wrapping).
Wire format is unchanged: AES-256-GCM Encrypt/Decrypt/DeriveKey, the
12-byte nonce prefix, the GCM auth tag, the PBKDF2 salt
('certctl-config-encryption-v1'), and the 100,000 iteration count are all
byte-identical. Ciphertexts produced before this change remain decryptable.
Verified:
- go build ./... : clean
- go vet ./... : clean
- go test -race ./internal/crypto/... ./internal/service/... \
./internal/integration/... ./cmd/server/... : pass
- golangci-lint run ./... : 0 issues
- govulncheck ./... : 0 reachable vulnerabilities
- rg 'return plaintext, false, nil' internal/ : no matches
- Coverage: crypto 85.0% (unchanged), service 67.8% (was 67.9%, noise),
cmd/server 0.0% (unchanged baseline). All above CI thresholds.
See certctl-audit-report.md for the full finding record and resolution log.
|
||
|
|
13cd4d98ba |
feat(V2.2): bulk revocation — filter-based fleet-wide certificate revocation
Add POST /api/v1/certificates/bulk-revoke with filter criteria (profile_id, owner_id, agent_id, issuer_id, team_id, certificate_ids), partial-failure tolerance, and audit trail. Includes MCP tool, CLI command (certs bulk-revoke), server-side bulk modal in GUI replacing client-side sequential loop, OpenAPI spec, compliance mapping updates, and 21 new tests (12 service, 7 handler, 1 CLI, 1 frontend). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
e1bcde4cf1 |
feat(M50): cloud secret manager discovery — AWS SM, Azure KV, GCP SM
Extend certificate discovery from filesystem + network to cloud secret managers. Three pluggable DiscoverySource connectors feed into the existing discovery pipeline via sentinel agent pattern, with a 9th scheduler loop for periodic cloud scanning. - AWS Secrets Manager: aws-sdk-go-v2, tag/prefix filtering, 10 tests - Azure Key Vault: stdlib HTTP + OAuth2, base64 DER/PEM, 16 tests - GCP Secret Manager: stdlib HTTP + JWT OAuth2, label filter, 14 tests - CloudDiscoveryService orchestrator with 9 tests - 9th scheduler loop (6h default, atomic.Bool idempotency) - Discovery page: color-coded source type badges - 14 new env vars across CloudDiscoveryConfig structs - Docs: connectors.md, architecture.md, features.md, README updated 49 new tests. All CI checks pass (go vet, race, lint, coverage). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
596d86a206 |
feat(M48): continuous TLS health monitoring — endpoint state machine, shared tlsprobe, 8 API endpoints, GUI
Adds continuous TLS endpoint health monitoring that closes the
deploy→verify→monitor loop. After M25 verifies a deployment succeeded once,
M48 continuously confirms it stays healthy.
Key components:
- Shared `internal/tlsprobe/` package extracted from network scanner for reuse
- Health status state machine: healthy → degraded (2 failures) → down
(5 failures), plus cert_mismatch when served fingerprint differs from expected
- 8th scheduler loop (60s tick, per-endpoint configurable intervals)
- PostgreSQL migration 000011: endpoint_health_checks +
endpoint_health_history tables
- 8 REST API endpoints (CRUD, history, acknowledge, summary)
- Health Monitor GUI page with summary bar, status table, create modal,
auto-refresh
- 38 new tests (5 tlsprobe + 11 domain + 10 service + 8 handler + 4 frontend)
- All coverage thresholds maintained (service 68%, handler 83%, domain 87%,
middleware 63%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
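The health status state machine in the M48 commit above (healthy, degraded after 2 failures, down after 5, plus cert_mismatch) might be modelled along these lines. The type and method names are hypothetical, and two behaviours are assumptions of this sketch rather than facts from the commit: failures are counted consecutively, and any successful probe resets the counter.

```go
package main

import "fmt"

type HealthStatus string

const (
	StatusHealthy      HealthStatus = "healthy"
	StatusDegraded     HealthStatus = "degraded"
	StatusDown         HealthStatus = "down"
	StatusCertMismatch HealthStatus = "cert_mismatch"
)

// Thresholds taken from the commit message.
const (
	degradedAfter = 2
	downAfter     = 5
)

type Endpoint struct {
	ExpectedFingerprint string
	ConsecutiveFailures int
	Status              HealthStatus
}

// Observe folds one probe result into the endpoint state.
func (e *Endpoint) Observe(probeOK bool, servedFingerprint string) {
	if !probeOK {
		e.ConsecutiveFailures++
		switch {
		case e.ConsecutiveFailures >= downAfter:
			e.Status = StatusDown
		case e.ConsecutiveFailures >= degradedAfter:
			e.Status = StatusDegraded
		}
		return // a single failure leaves the previous status in place
	}
	e.ConsecutiveFailures = 0 // assumption: any success resets the counter
	if e.ExpectedFingerprint != "" && servedFingerprint != e.ExpectedFingerprint {
		e.Status = StatusCertMismatch
		return
	}
	e.Status = StatusHealthy
}

func main() {
	e := &Endpoint{ExpectedFingerprint: "aa", Status: StatusHealthy}
	for i := 0; i < 5; i++ {
		e.Observe(false, "")
		fmt.Println(i+1, e.Status)
	}
	e.Observe(true, "bb")
	fmt.Println("after success with a different cert:", e.Status)
}
```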
|
|
f2e60b93a3 |
feat(M11c): crypto policy enforcement — CSR validation, MaxTTL caps, key metadata
Enforce certificate profile crypto constraints across all 5 issuance paths (renewal, agent CSR, EST, SCEP). ValidateCSRAgainstProfile() rejects CSRs with key algorithm/size that don't match profile rules. MaxTTL enforcement caps certificate validity per issuer connector (Local CA, Vault, step-ca enforce directly; ACME/DigiCert/Sectigo pass through). Key algorithm and size are now persisted in certificate_versions for audit compliance. 16 new tests (12 service-layer + 4 Local CA connector). Removes hardcoded version number from GUI sidebar. Documentation updated across architecture, features, connectors, and README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
bcefb11e65 |
feat(M51): add SCEP server (RFC 8894) for MDM and network device enrollment
Implements Simple Certificate Enrollment Protocol with single-endpoint operation-based dispatch (GetCACaps, GetCACert, PKIOperation), PKCS#7 SignedData CSR extraction with fallback for raw/base64 CSR, challenge password authentication via CSR attributes, and shared internal/pkcs7 package extracted from EST handler to eliminate code duplication. 24 new tests (11 service + 13 handler) plus 5 shared pkcs7 package tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
370f856725 |
fix: resolve 8 staticcheck lint errors in test files
SA1029: use typed context key instead of string in main_test.go S1039: remove unnecessary fmt.Sprintf in validation_test.go SA4023: fix unreachable nil check on concrete error type SA4006: fix unused variable assignments in stepca_test.go (4 occurrences) SA4000: fix duplicate expression in ssh_test.go (BEGIN vs END CERTIFICATE) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
7382e5f03b |
test: comprehensive test gap closure across 24 packages
Close coverage gaps identified by dual-audit (qualitative + quantitative). New test files for config (0%→98%), router (0%→100%), handler validation, health, audit, response helpers, webhook notifier (0%→88%), email notifier, middleware (recovery, rate limiter), domain profile, service nil-safety, config helpers, issuer bootstrap, and server bootstrap wiring. Expanded existing tests for ACME (34%→42%), step-ca (42%→52%), F5, SSH, agent (43%→63%), scheduler (88%→99%), renewal service, and issuerfactory. All tests pass: go test -short, go vet, go test -race clean. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
e6088c79a3 |
feat(M35): dynamic target configuration with encrypted config, test connection, and GUI updates
Mirror M34's dynamic issuer config pattern for deployment targets: AES-256-GCM encrypted config storage, sensitive field redaction in API responses, agent heartbeat-based test connection endpoint, and full frontend updates including test status indicators, source badges, and removal of stale hostname/status fields from the Target interface. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
995b72df05 |
feat(M34): dynamic issuer configuration with encrypted config storage
Replace static env-var-based issuer wiring with GUI-driven dynamic
configuration stored encrypted in PostgreSQL. Operators can now configure,
test, enable/disable, and manage issuers from the dashboard without
restarting the server.
Key changes:
- AES-256-GCM encryption for sensitive issuer config at rest (PBKDF2 key
derivation with 100k iterations)
- Dynamic IssuerRegistry with sync.RWMutex replacing static map
- Connector factory pattern (issuerfactory.NewFromConfig) replacing 140
lines of static wiring in main.go
- Migration 000009: encrypted_config, last_tested_at, test_status, source
columns on issuers table
- Env var seeding on first boot with ON CONFLICT DO NOTHING
- Registry Rebuild() for atomic map swap after CRUD operations
- Issuer type validation against domain constants on Create
- Audit trail for test connection results
- Conditional seeding for step-ca/OpenSSL (only when env vars set)
- GUI: source badge, connection test status on issuer detail page
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
5a53b648b1 |
feat(M44): Google CAS issuer connector
Google Cloud Certificate Authority Service integration via REST API with OAuth2 service account auth (JWT→access token). Synchronous issuance model, CA pool selection, mutex-guarded token caching, revocation with RFC 5280 reason mapping. No Google SDK dependency — all stdlib. 19 tests with httptest mock OAuth2 + CAS API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
3a11e447cf |
feat(M43): Sectigo SCM issuer connector
Implement Sectigo Certificate Manager REST API connector with async order model (enroll → poll → collect PEM), 3-header auth, DV/OV/EV support, collect-not-ready (400/-183) graceful handling, and RFC 5280 revocation reason mapping. 20 tests with httptest mock API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
b059ec930f |
fix: end-to-end certificate lifecycle bugs + integration test environment
Fixes 12 production bugs preventing the full issuance→deployment flow from
working with ACME (Pebble/Let's Encrypt) and step-ca issuers:
ACME connector (acme.go):
- Save orderURI before WaitOrder overwrites it (Go crypto/acme bug)
- Add CreateOrderCert fallback via WaitOrder+FetchCert
- Remove defer-reset in ValidateConfig that caused nil pointer panic
- Add Insecure TLS option for self-signed ACME servers (Pebble)
step-ca connector (stepca.go, jwe.go):
- Real JWE provisioner key loading + decryption (was using ephemeral keys)
- Fix JWT audience (/1.0/sign), sha claim (key fingerprint), kid header
- Custom root CA trust via RootCertPath config
- Remove hardcoded 90-day validity default (let step-ca decide)
NGINX target connector (nginx.go):
- Use sh -c for validate/reload commands (shell interpretation)
- Use filepath.Dir instead of fragile string slicing
- Add private key file writing (agent-mode keys were never deployed)
- Make chain_path write conditional
Server/service layer:
- TriggerRenewalWithActor now creates actual Job records (was no-op)
- createDeploymentJobs falls back to DB query when cert.TargetIDs empty
- ProcessPendingJobs skips agent-routed deployment jobs
- Agent cert pickup path parsing: len(parts)<4 → len(parts)<3
- Health/ready/auth-info endpoints bypass auth middleware
- Write timeout 15s→120s for ACME issuance
- Cert fingerprint computed on CSR submission
Integration test environment (deploy/test/):
- 10-phase test script covering Local CA, ACME, step-ca, revocation,
discovery, renewal, and API spot checks
- Docker Compose with 7 containers (server, agent, postgres, nginx, pebble,
challtestsrv, step-ca) on isolated network
- TLS verification checks SAN (not just Subject CN) for modern CA compat
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
6375909591 |
feat: add Vault PKI and DigiCert CertCentral issuer connectors (M32 + M37)
Vault PKI: synchronous issuance via /v1/{mount}/sign/{role}, token auth,
revocation, CA cert retrieval, 14 tests. DigiCert CertCentral: async order
model (submit → poll → download), X-DC-DEVKEY auth, OV/EV support, PEM
bundle parsing, 16 tests. Both conditionally registered based on env vars.
Includes OpenAPI enum updates, seed data, connector docs, architecture docs,
README badges, and testing guide sign-off (Parts 38 + 39, 12 automated
smoke test assertions all passing).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
11173a74c6 |
feat(M31): agent work routing — scope jobs to assigned agents
Deployment jobs now set agent_id from target→agent relationship at creation time. GetPendingWork() uses ListPendingByAgentID() with a 3-way UNION query (direct match, legacy NULL fallback via target JOIN, AwaitingCSR via cert→target→agent chain) so each agent only receives its own jobs. - Added AgentID *string to Job domain struct - Added agent_id to all job SQL queries (5 SELECTs, INSERT, UPDATE, scanJob) - New ListPendingByAgentID() repository method - Rewrote GetPendingWork() from ~25 lines to single scoped query - 4 new Go tests (3 agent routing + 1 deployment agent_id) - Frontend: agent_id/target_id on Job type Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
ec21c9bb29 |
feat(m28+m29+m30): ACME ARI, email digest, and Helm chart
M28: ACME Renewal Information (RFC 9773): CA-directed renewal timing with
cert ID computation, directory endpoint discovery, graceful degradation for
non-ARI CAs. 19 tests.
M29: Email notifier wiring + scheduled certificate digest: SMTP connector
bridged to service layer via NotifierAdapter, DigestService with HTML email
template, 7th scheduler loop (24h), digest preview/send API endpoints and
GUI card. 21 tests.
M30: Production-ready Helm chart: server Deployment, PostgreSQL
StatefulSet, agent DaemonSet, ConfigMaps, Secrets, Ingress, security
contexts, health probes, example values for dev/prod/ACME scenarios.
Also: OpenAPI spec updates, MCP tool additions, CI helm-lint job,
documentation updates across 5 doc files and README.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
a00bb349c4 |
feat(m27): certificate export (PEM/PKCS#12) and S/MIME EKU support
Add certificate export in PEM (JSON or file download) and PKCS#12 formats. Private keys are never included — they stay on agents. Add EKU-aware issuance threading profile EKUs (serverAuth, clientAuth, codeSigning, emailProtection, timeStamping) through the full issuance pipeline. Fix agent CSR SAN splitting for email addresses, adaptive KeyUsage flags for S/MIME vs TLS, and a pre-existing generateID collision bug in deployment job creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
1f98f31f83 |
chore: bump version to 2.0.9
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
6d508cf53f |
fix: security audit remediation (AUDIT-001, 003, 004, 005, 006, 018)
- AUDIT-001: Validate OpenSSL revoke inputs (hex-only serials, RFC 5280 reasons)
- AUDIT-003: Enforce /20 CIDR size cap at API level (create + update)
- AUDIT-004: Support comma-separated CERTCTL_AUTH_SECRET for zero-downtime
key rotation
- AUDIT-005: Add ReadHeaderTimeout (5s) to prevent Slowloris
- AUDIT-006: Document audit trail query parameter exclusion rationale
- AUDIT-018: Add immediate-run-on-start to short-lived expiry scheduler loop
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
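Two of the remediations in the audit commit above, the /20 CIDR size cap (AUDIT-003) and the comma-separated secret list (AUDIT-004), lend themselves to a short sketch. The function names are hypothetical, expressing the cap as "at most 12 host bits" (so it also applies to IPv6 prefixes) is an assumption, and the constant-time comparison is a choice of this sketch rather than something the commit states.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net"
	"strings"
)

// A /20 IPv4 network has 32-20 = 12 host bits (4096 addresses).
const maxHostBits = 12

// validateCIDRSize rejects scan targets larger than a /20.
func validateCIDRSize(cidr string) error {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return fmt.Errorf("invalid CIDR %q: %w", cidr, err)
	}
	ones, bits := ipnet.Mask.Size()
	if bits-ones > maxHostBits {
		return fmt.Errorf("CIDR %q is larger than the /20 cap", cidr)
	}
	return nil
}

// matchesAnySecret accepts a token if it equals any entry in a
// comma-separated list, so old and new secrets can overlap during rotation.
func matchesAnySecret(configured, presented string) bool {
	if presented == "" {
		return false
	}
	matched := false
	for _, s := range strings.Split(configured, ",") {
		s = strings.TrimSpace(s)
		if s == "" {
			continue
		}
		if subtle.ConstantTimeCompare([]byte(s), []byte(presented)) == 1 {
			matched = true
		}
	}
	return matched
}

func main() {
	fmt.Println(validateCIDRSize("10.0.0.0/20"))
	fmt.Println(validateCIDRSize("10.0.0.0/16"))
	fmt.Println(matchesAnySecret("old-secret, new-secret", "new-secret"))
}
```

During a rotation, an operator would list both secrets, roll clients over to the new one, then drop the old entry; no restart window is needed where valid clients are rejected.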
|
|
6da60d1287 |
chore: bump version to 2.0.8, replace static README badge with dynamic GitHub Release badge
- Layout.tsx: v2.0.7 → v2.0.8 - cmd/server/main.go: 2.0.7 → 2.0.8 - README.md: static version badge → shields.io/github/v/release (auto-updates) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
baafab50c5 |
feat(gui): add create modals for issuers, policies, profiles, owners, teams, agent groups
Six pages were read-only viewers despite the API client having all create functions wired up. Users deploying certctl had no way to create CAs or other objects from the GUI — reported in GitHub issue. - IssuersPage: 2-step create modal (type selection → config) for Local CA, ACME, step-ca, OpenSSL/Custom issuer types - PoliciesPage: create modal with type, severity, JSON config, enabled - ProfilesPage: create modal with name, description, max TTL, short-lived - OwnersPage: create modal with name, email, team dropdown - TeamsPage: create modal with name, description - AgentGroupsPage: create modal with match criteria fields - Layout.tsx: version v2.0.5 → v2.0.7 - cmd/server/main.go: version 0.1.0 → 2.0.7 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
de9264baf7 |
docs: synchronize project documentation with codebase
Implements 3 deferred security tickets (TICKET-003, TICKET-007, TICKET-010)
and performs comprehensive documentation audit to eliminate drift between
code and docs.
Code changes:
- TICKET-003: Repository integration tests with testcontainers-go (50+ subtests)
- TICKET-007: CertificateService decomposition into RevocationSvc +
CAOperationsSvc
- TICKET-010: Request body size limits via http.MaxBytesReader middleware
- Fix missing slog import in certificate.go after service decomposition
Documentation updates:
- README: Fix endpoint count (97→93), expand env var reference (15→39 vars)
- CLAUDE.md: Fix OpenAPI operation count (85→93), update file locations
- architecture.md: Add body size limits section, middleware chain ordering
- CONTRIBUTING.md: New contributor guide with architecture conventions,
test patterns, middleware ordering, CI thresholds
- SECURITY_REMEDIATION.md: Removed from repo (moved to cowork, gitignored)
- Test files: Add doc comments to all new test files
Documentation that should exist but doesn't yet:
- Architecture diagrams (C4 model or similar)
- Threat model document
- Testing philosophy guide
- Disaster recovery runbook
- Upgrade guide (migration between versions)
- API versioning strategy document
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
677c28aeca |
refactor(api): TICKET-006 replace 18-param RegisterHandlers with HandlerRegistry struct
Replace the 18-parameter RegisterHandlers function signature with a cleaner
HandlerRegistry struct that groups all API handler dependencies. This
eliminates the signature explosion that made the function difficult to read
and maintain.
Changes:
- Added HandlerRegistry struct with 18 fields grouping all handler types
- Updated RegisterHandlers to accept a single HandlerRegistry parameter
- Updated all internal handler references to use reg.FieldName syntax
- Updated call sites in cmd/server/main.go and integration tests
- No functional changes, purely structural refactoring
Resolves TICKET-006: RegisterHandlers Signature Explosion
|
||
|
|
3e3e68fd3a |
fix(security): TICKET-009 add HTTP timeouts to notifier clients
- Added TestSlack_ClientHasTimeout to verify 10-second timeout
- Added TestTeams_ClientHasTimeout to verify 10-second timeout
- Added TestPagerDuty_ClientHasTimeout to verify 10-second timeout
- Added TestOpsGenie_ClientHasTimeout to verify 10-second timeout
- All notifiers already configured with 10 second timeout in New()
- Tests verify timeout is set and matches expected value
|
||
|
|
be72627aeb |
feat: M25 post-deployment TLS verification + M26 Traefik/Caddy targets
M25: After deploying a certificate, the agent probes the live TLS
endpoint and compares SHA-256 fingerprints to verify the correct cert
is being served. Best-effort — failures don't block deployments.
New endpoints: POST /jobs/{id}/verify, GET /jobs/{id}/verification.
Migration 000008 adds verification columns to jobs table.
M26: Traefik target connector (file provider, auto-reload) and Caddy
target connector (dual-mode: admin API hot-reload or file-based).
Both wired into agent dispatch.
Also: restructured README to highlight supported integrations (issuers,
targets, notifiers) earlier, moved API/CLI/MCP sections lower. Updated
all docs (features, connectors, architecture, testing guide, why-certctl)
and fixed integration tests for 18-param RegisterHandlers signature.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
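The fingerprint comparison behind the M25 verification step above can be sketched as follows. The function names are hypothetical, and two details are assumptions of this sketch: fingerprints are hex-encoded, and the probe skips chain validation (`InsecureSkipVerify`) because the fingerprint match, not trust-chain validation, is what is being checked.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"errors"
	"fmt"
	"net"
	"strings"
	"time"
)

// fingerprint returns the hex SHA-256 digest of a DER-encoded certificate.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// sameFingerprint compares two fingerprints, ignoring case and colon separators.
func sameFingerprint(expected, served string) bool {
	norm := func(s string) string { return strings.ToLower(strings.ReplaceAll(s, ":", "")) }
	return norm(expected) == norm(served)
}

// probeLeafFingerprint dials addr and fingerprints the leaf certificate served.
func probeLeafFingerprint(addr string, timeout time.Duration) (string, error) {
	dialer := &net.Dialer{Timeout: timeout}
	// Chain validation is skipped on purpose: only the served bytes matter here.
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return "", err
	}
	defer conn.Close()
	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return "", errors.New("no peer certificate presented")
	}
	return fingerprint(certs[0].Raw), nil
}

func main() {
	// Offline demonstration with stand-in bytes; a real agent would call
	// probeLeafFingerprint("host:443", timeout) and treat errors as best-effort.
	deployed := fingerprint([]byte("stand-in for deployed DER bytes"))
	served := fingerprint([]byte("stand-in for deployed DER bytes"))
	fmt.Println("match:", sameFingerprint(deployed, served))
}
```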
|
|
03593d4304 |
feat: wire ACME EAB into account registration + ZeroSSL auto-fetch
EAB credentials (KID + HMAC) were defined in the ACME connector config but never wired into the acme.Account registration call. This fixes the dead code and adds automatic EAB credential fetching for ZeroSSL — when the directory URL is detected as ZeroSSL and no EAB credentials are provided, certctl calls ZeroSSL's public API to get them automatically. Changes: - Wire EABKid/EABHmac into acme.Account.ExternalAccountBinding - Add isZeroSSL() detection and fetchZeroSSLEAB() auto-fetch - Add CERTCTL_ACME_EAB_KID/CERTCTL_ACME_EAB_HMAC env vars to main.go - Add 13 ACME connector tests (config validation, EAB decode, ZeroSSL auto-EAB with mock servers, URL detection) - Update docs: README, architecture, connectors, demo-advanced, testing-guide with EAB/auto-EAB documentation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
e19c240a79 |
feat: add ACME DNS-PERSIST-01 challenge support (IETF draft-ietf-acme-dns-persist)
Standing TXT record at _validation-persist.<domain> eliminates per-renewal DNS updates. Auto-fallback to dns-01 if CA doesn't offer dns-persist-01. ScriptDNSSolver extended with PresentPersist method. Configurable via CERTCTL_ACME_CHALLENGE_TYPE=dns-persist-01 and CERTCTL_ACME_DNS_PERSIST_ISSUER_DOMAIN env vars. Also fixes IsExpired edge-case test in discovery_test.go that always failed due to time.Now() drift between test setup and method invocation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
7d14635a72 |
feat: add EST server (RFC 7030) for device certificate enrollment (M23)
Implement Enrollment over Secure Transport protocol with 4 endpoints under /.well-known/est/ — cacerts (CA chain distribution), simpleenroll (initial enrollment), simplereenroll (certificate renewal), and csrattrs (CSR attributes). PKCS#7 certs-only wire format with hand-rolled ASN.1, accepts both PEM and base64-encoded DER CSRs, configurable issuer and profile binding, full audit trail. 28 new tests (18 handler + 10 service). Also includes: - GetCACertPEM added to issuer connector interface (all 4 issuers updated) - EST integration tests wired into e2e test suite (13 test cases) - QA testing guide Part 26 (15 manual EST test cases) - All docs updated: README, features, architecture, concepts, connectors, quickstart, demo-advanced (endpoint counts, MCP wording, agent IDs, issuer interface, resource lists, OpenSSL status) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
4f90be9311 |
feat: add network certificate discovery (M21) and Prometheus metrics (M22)
M21 adds server-side active TLS scanning of CIDR ranges with concurrent probing, sentinel agent pattern for pipeline reuse, and full CRUD API for scan targets. M22 adds Prometheus exposition format endpoint alongside existing JSON metrics. Comprehensive documentation audit updates all docs to reflect 91 endpoints, 19 tables, 6 scheduler loops, and 900+ tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
667a30870d |
feat: M18b Filesystem Certificate Discovery — agent scanning, server dedup, triage API
Agent-side:
- Filesystem scanner walks configured directories (CERTCTL_DISCOVERY_DIRS)
- Parses PEM (.pem, .crt, .cer, .cert) and DER (.der) certificate files
- Extracts CN, SANs, serial, issuer/subject DN, validity, key info, SHA-256 fingerprint
- Reports discoveries to control plane on startup + every 6 hours
- Skips files >1MB and private key files
Server-side:
- Migration 000006: discovered_certificates + discovery_scans tables
- Domain model: DiscoveredCertificate, DiscoveryScan, DiscoveryReport
- Three triage states: Unmanaged, Managed (claimed), Dismissed
- Repository with upsert dedup (fingerprint + agent + path)
- Service layer: process reports, claim, dismiss, list, summary
- 7 new API endpoints (84 total):
POST /agents/{id}/discoveries, GET /discovered-certificates,
GET /discovered-certificates/{id}, POST .../claim, POST .../dismiss,
GET /discovery-scans, GET /discovery-summary
- Audit trail: scan_completed, cert_claimed, cert_dismissed events
Tests: 28 new test functions (domain, handler, service layers)
Docs: README, quickstart, demo-guide, demo-advanced, architecture,
concepts, connectors, features.md all updated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
e078a686bf |
feat: M20 Enhanced Query API — sort, time-range filters, cursor pagination, sparse fields, deployments endpoint
V2 (free) query enhancements for certificates:
- `sort` param with direction (`?sort=-notAfter` for descending)
- Time-range filters: `expires_before`, `expires_after`, `created_after`, `updated_after`
- Cursor-based pagination (`?cursor=token&page_size=100`) alongside page-based pagination
- Sparse field selection (`?fields=id,commonName,status`)
- Additional filters: `agent_id`, `profile_id`
- New endpoint: `GET /api/v1/certificates/{id}/deployments`
25 new tests (12 handler + 13 e2e) covering all M20 features.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
df1aaa37f8 |
feat: M17 OpenSSL/Custom CA issuer connector + M16b CLI tool with bulk import
M17: Script-based issuer connector delegating sign/revoke/CRL to user-provided scripts. Compatible with any CA tooling (OpenSSL, cfssl, custom PKI). Configurable timeout, environment variable passthrough. 14 tests including timeout enforcement.
M16b: certctl-cli wraps all 76 REST API endpoints for terminal workflows. Supports certs/agents/jobs list/get/renew/revoke/cancel, bulk PEM import with progress reporting, server health status, table and JSON output formats. Zero external dependencies (stdlib only). 14 tests with mock HTTP server.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
9b0ff37973 |
feat: M19 API audit log + M16a notifier connectors (Slack, Teams, PagerDuty, OpsGenie)
M19: HTTP middleware records every API call to the immutable audit trail with method, path, actor, SHA-256 body hash, status, and latency. Best-effort async recording via goroutine. Health/ready probes excluded.
M16a: Four pluggable notifier connectors: Slack (incoming webhook), Teams (MessageCard), PagerDuty (Events API v2), OpsGenie (Alert API v2). Each enabled by config env var.
30 new tests across middleware and connectors.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
ee75f149ae |
feat: M14 — Observability (dashboard charts, agent fleet, stats API, metrics, structured logging, rollback)
Backend: StatsService with 5 aggregation methods, JSON metrics endpoint, slog-based structured logging middleware. Stats API: dashboard summary, certificates-by-status, expiration timeline, job trends, issuance rate. 23 new backend tests.
Frontend: Recharts-powered dashboard with 4 charts (status pie, expiration heatmap, job trends line, issuance bar), agent fleet overview page with OS/arch grouping and version breakdown, deployment rollback buttons on version history. 7 new frontend tests.
78 API endpoints, 744+ total tests (658 Go + 86 Vitest).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
762c523d59 |
feat: M15b — OCSP responder, DER CRL, short-lived exemption, revocation GUI
Backend:
- Embedded OCSP responder: GET /api/v1/ocsp/{issuer_id}/{serial} returns
signed OCSP responses (good/revoked/unknown) using CA key
- DER-encoded X.509 CRL: GET /api/v1/crl/{issuer_id} returns proper DER CRL
signed by issuing CA with 24h validity window
- Short-lived cert exemption: certs with profile TTL < 1 hour skip CRL/OCSP
(expiry is sufficient revocation for ephemeral workloads)
- Extended issuer connector interface with GenerateCRL and SignOCSPResponse
- Local CA implements full CRL/OCSP signing; ACME and step-ca return
appropriate "use native endpoint" errors
- IssuerConnectorAdapter bridges new methods between layers
Frontend:
- Revoke button on certificate detail page with RFC 5280 reason modal
- Revocation banner with reason display and timestamp
- Revocation status indicators in lifecycle section
- "Revoked" filter option in certificates list
- API client: revokeCertificate() function and Certificate type extensions
Tests: ~31 new tests across connector, service, handler, and adapter layers
Docs: milestones renumbered (M13-M14, M16-M18), M15b marked complete
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
5d98e373e3 |
feat: M15a — certificate revocation API, CRL endpoint, and revocation notifications
Implements core revocation infrastructure: POST /api/v1/certificates/{id}/revoke
with all 8 RFC 5280 reason codes, JSON-formatted CRL at GET /api/v1/crl, webhook
and email revocation notifications, best-effort issuer notification, and immutable
revocation audit trail. Includes 48 new tests across service, handler, integration,
and domain layers (600+ total). Fixes 3 pre-existing test bugs (team_test error
matching, agent_group delete status code, team handler per_page validation).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|