mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 20:01:31 +00:00
2893f9b48e35141497c36286a6faca533b663376
51 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
9c679a5960 |
auth-bundle-2 Phase 5: OIDC + session HTTP surface (13 endpoints),
pre-login store, OpenID Connect Back-Channel Logout 1.0, cookieAuth
scheme, 7 new auth permissions, CI guard, handler tests
Phase 5 of the bundle puts the Phase 3 OIDC service + Phase 4 session
service on the wire. 13 HTTP endpoints split into three logical groups:
Public OIDC handshake (auth-exempt; protocol-mediated):
GET /auth/oidc/login?provider=<id> -> 302 to IdP authorization URL
+ sets certctl_oidc_pending cookie
(10-min TTL, Path=/auth/oidc/,
SameSite=Lax)
GET /auth/oidc/callback?code=...&state=... -> consume pre-login row,
run Phase 3's 11-step token
validation, mint post-login
session, 302 to dashboard
POST /auth/oidc/back-channel-logout -> OpenID Connect BCL 1.0 — IdP
POSTs logout_token JWT; certctl
validates signature against IdP
JWKS via Phase 3 alg allow-list,
required claims (iss/aud/iat/jti/
events; exactly one of sub/sid;
nonce ABSENT per spec §2.4),
revokes matching sessions,
returns 200 with
Cache-Control: no-store
POST /auth/logout -> revoke caller's session
Session management (RBAC-gated auth.session.*):
GET /api/v1/auth/sessions -> auth.session.list (own / all)
DELETE /api/v1/auth/sessions/{id} -> auth.session.revoke (own bypass)
OIDC provider + group-mapping CRUD (RBAC-gated auth.oidc.*):
GET /api/v1/auth/oidc/providers -> auth.oidc.list
POST /api/v1/auth/oidc/providers -> auth.oidc.create
(client_secret encrypted
at rest via
internal/crypto.EncryptIfKeySet)
PUT /api/v1/auth/oidc/providers/{id} -> auth.oidc.edit
DELETE /api/v1/auth/oidc/providers/{id} -> auth.oidc.delete
(refused via
ErrOIDCProviderInUse → 409
when users authenticated
via this provider)
POST /api/v1/auth/oidc/providers/{id}/refresh -> auth.oidc.edit
(re-runs IdP downgrade
defense via
OIDCService.RefreshKeys)
GET /api/v1/auth/oidc/group-mappings -> auth.oidc.list
POST /api/v1/auth/oidc/group-mappings -> auth.oidc.edit
DELETE /api/v1/auth/oidc/group-mappings/{id} -> auth.oidc.edit
Migration 000037 ships:
- oidc_pre_login_sessions table (10-min absolute TTL, FK CASCADE on
oidc_provider_id, FK RESTRICT on signing_key_id; index on
absolute_expires_at for the GC sweep);
- 7 new permissions seeded into r-admin only:
auth.session.list, auth.session.list.all, auth.session.revoke,
auth.oidc.list, auth.oidc.create, auth.oidc.edit, auth.oidc.delete
CanonicalPermissions extended in lockstep at internal/domain/auth/
validate.go.
Pre-login machinery:
- internal/repository/oidc.go gains PreLoginRepository interface +
PreLoginSession struct + ErrPreLoginNotFound / ErrPreLoginExpired
sentinels.
- internal/repository/postgres/oidc_prelogin.go ships the impl;
LookupAndConsume uses DELETE ... RETURNING for atomic single-use.
- internal/auth/oidc/prelogin.go is the PreLoginAdapter that bridges
the OIDC service's Phase 3 PreLoginStore interface to the new
repository, signing the cookie value under the active
SessionSigningKey via the same v1.<id>.<key>.<HMAC> wire format
Phase 4 uses for post-login cookies. Defense-in-depth: the
pre-login `pl-` prefix is enforced by ParseCookieValue(prefix);
a stolen pre-login cookie cannot be replayed against the
post-login Validate path (pinned by
TestService_Validate_RejectsPreLoginCookieAtPostLoginGate).
Session package extension:
- internal/auth/session/service.go gains exported SignCookieValue,
ParseCookieValue (with caller-supplied id-1 prefix), ComputeCookieHMAC,
DecryptKeyMaterial wrappers so the OIDC pre-login adapter shares
the same length-prefixed HMAC math without code duplication.
- parseCookie no longer hardcodes the `ses-` prefix check (moved to
Validate as defense-in-depth; pre-login cookie verification uses
the `pl-` prefix via ParseCookieValue).
Cookie attributes (all Phase 5 endpoints honor CERTCTL_SESSION_SAMESITE
+ Secure=true via SessionCookieAttrs from Phase 4 config):
- certctl_oidc_pending: Path=/auth/oidc/, MaxAge=600s, SameSite=Lax
(cannot be Strict because the IdP-initiated callback is a top-level
navigation from a different origin).
- certctl_session: Path=/, Expires=8h, SameSite=Lax|Strict, HttpOnly.
- certctl_csrf: Path=/, Expires=8h, HttpOnly=false (intentional —
GUI must read it to echo into X-CSRF-Token header).
Audit logging on every mutating operation (event_category="auth"):
auth.oidc_login_succeeded / failed / unmapped_groups
auth.oidc_back_channel_logout / failed
auth.session_revoked
auth.oidc_provider_{created,updated,deleted,refreshed}
auth.group_mapping_{added,removed}
OpenAPI updates:
- cookieAuth security scheme added to api/openapi.yaml under
components.securitySchemes (apiKey / cookie / certctl_session).
- The 13 Phase 5 routes are added to SpecParityExceptions with a
deferral note: full per-endpoint OpenAPI rows land in a follow-on
commit alongside the GUI work (Phase 8) so the ergonomic shape can
be validated against the live GUI client.
CI guard: scripts/ci-guards/N-bundle-2-security-empty-preserved.sh
asserts api/openapi.yaml has ≥ 14 'security: []' occurrences (the
pre-Bundle-2 baseline). Reducing the count below 14 would silently
force a Bearer-or-cookie requirement onto an endpoint that legitimately
runs without certctl-issued credentials; the guard fires before that
regression lands.
Handler tests (internal/api/handler/auth_session_oidc_test.go):
- All 6 prompt-mandated negative cases:
BCL with missing events claim -> 400
BCL with nonce present -> 400 (per spec §2.4)
BCL with sig signed by an unknown key -> 400
Callback with replayed state -> 400
Callback with PKCE verifier mismatch -> 400
Callback with expired pre-login row -> 400
- Plus happy paths for every endpoint, edge cases (missing-cookie,
duplicate-name, in-use-409, wrong-tenant), and the Helper-function
coverage (peekIssuer, classifyOIDCFailure, defaultIfBlank,
defaultIntIfZero, clientIPFromRequest, encryptClientSecret).
Coverage on internal/api/handler/auth_session_oidc.go: 80.9% per-function
(above the Phase 5 spec's ≥ 80% floor).
Server wiring (cmd/server/main.go):
Wired AFTER sessionService (Phase 4) so the OIDC PreLoginAdapter can
sign pre-login cookies under the active SessionSigningKey:
oidcProviderRepo + oidcMappingRepo + oidcUserRepo + oidcPreLoginRepo
-> preLoginAdapter -> oidcService -> authSessionOIDCHandler.
sessionMinterAdapter shim bridges *session.Service.Create to the
oidcsvc.SessionMinter port the OIDC service consumes.
Router wiring (internal/api/router/router.go):
4 public OIDC routes via direct r.mux.Handle (auth-exempt; pinned in
AuthExemptRouterRoutes); 9 RBAC-gated routes via r.Register +
rbacGate(checker, perm, h). Routes only register when
reg.AuthSessionOIDC != nil so pre-Phase-5 builds skip the block
entirely.
Verifications: gofmt clean, go vet clean across all touched packages,
go test -short -count=1 green across internal/api/handler (74 tests +
new Phase 5 batch), internal/api/router (parity + auth-exempt
allowlist), internal/auth/oidc + session (no regressions), full domain
+ scheduler + config sweeps green, ci-guard
N-bundle-2-security-empty-preserved.sh green (17 ≥ 14 baseline).
|
||
|
|
2d9110b0c4 |
auth-bundle-2 Phase 0: dependency-add + oidc auth-type literal + runtime guard
Bundle 2 Phase 0 stages the dependencies + auth-type discriminator
literal that later phases consume. No handler chain wired yet; an
operator who sets CERTCTL_AUTH_TYPE=oidc on this commit gets a clear
refuse-to-start error rather than a silent fallback to api-key (the
G-1 failure mode that drove "jwt" out of the allowed set).
Deliverables:
* go.mod: github.com/coreos/go-oidc/v3 v3.18.0 added as a direct
require. Per the pre-bundle dependency audit (Apache-2.0, zero CVEs
ever per OSV.dev, 2,400+ stars, used by Hashicorp Vault + Dex +
Hydra + Authentik + every Kubernetes OIDC integration), this is the
ecosystem-standard Go OIDC client. Pinned to a specific minor
(v3.18.0) per the prompt's "no bare latest" rule.
* go.mod: golang.org/x/oauth2 promoted from // indirect to direct,
bumped from v0.34.0 to v0.36.0 by go mod tidy. Both versions are
OSV-clean. Maintained by the Go team.
* No JSON-path library added (forbidden by the dependency audit; the
group-claim resolver is hand-rolled in Phase 3).
* internal/config/config.go: AuthTypeOIDC constant added with a
load-bearing comment explaining (a) this is the AUTH-TYPE literal,
not a JWT alg literal, so the G-1 closure invariant is preserved
("jwt" stays out of ValidAuthTypes forever); (b) the runtime guard
in cmd/server/main.go intentionally refuses-to-start when oidc is
set pre-Phase-6 to avoid the silent-downgrade failure mode.
ValidAuthTypes() now returns {api-key, none, oidc}.
* internal/config/config_test.go: TestValidAuthTypesIsExactly_APIKey_None
renamed to TestValidAuthTypesIsExactly_APIKey_None_OIDC and now pins
the 3-entry set. TestValidAuthTypesDoesNotContainJWT (G-1 closure
test) still passes because "jwt" is never added back.
TestValidate_GenericInvalidAuthType's bad-types list updated:
"oidc" removed (now valid), "saml" added (correctly rejected per
Decision 5's SAML deferral).
* cmd/server/main.go: defense-in-depth runtime auth-type guard now
has an explicit AuthTypeOIDC case that exit(1)s with an actionable
message: "the OIDC auth chain is not yet wired in this build (Auth
Bundle 2 Phase 6 ships the session middleware that consumes this
auth-type literal)." This closes the lying-field gap the literal
would otherwise create. Phase 6 of Bundle 2 relaxes this case to
fall through alongside api-key + none.
* api/openapi.yaml: /v1/auth/info auth_type enum extended from
[api-key, none] to [api-key, none, oidc] with an in-line comment
explaining the Phase-0-vs-Phase-6 timing so an OpenAPI consumer
isn't surprised by "oidc" appearing here pre-Bundle-2-merge.
* deploy/helm/certctl/templates/_helpers.tpl::certctl.validateAuthType:
valid set extended to include "oidc". Chart-time validation now
passes for type=oidc; the binary's runtime guard takes over to
refuse the start. Once Bundle 2 ships, the runtime guard relaxes
and OIDC works end-to-end with no further chart edits.
* .env.example: CERTCTL_AUTH_TYPE comment block updated to document
the three valid values + the Phase-0-vs-Phase-6 timing.
* internal/auth/oidc/doc.go: new package directory with package doc
+ transitional blank imports for coreos/go-oidc/v3 + x/oauth2 so
go mod tidy keeps both deps as direct requires until Phase 3's
service.go replaces the blanks with real symbol use. Doc explains
the package layout (oidc/ + oidc/domain/ + oidc/groupclaim/ +
oidc/testfixtures/) so the post-Bundle-2 reader can navigate.
Verifications:
* gofmt clean on every changed file.
* go vet clean on internal/config + cmd/server + internal/auth/oidc.
* go test -short -count=1 green on internal/config (including the
G-1 closure + new validation tests), cmd/server, internal/auth (all
Bundle 1 packages), internal/service/auth.
* govulncheck ./... clean (M-024 hard CI gate).
* All 24 ci-guards pass locally.
Phase 0 exit criteria from cowork/auth-bundle-2-prompt.md:
* go.mod shows coreos/go-oidc/v3 as direct: yes.
* golang.org/x/oauth2 is direct (not indirect): yes.
* govulncheck ./... clean: yes.
* No JSON-path library in go.mod / go.sum deltas: confirmed (only
v3 of go-oidc + the x/oauth2 bump landed).
* make verify green: gofmt + vet + go test pass; full make verify
(which would invoke golangci-lint) deferred to CI since the
sandbox doesn't have golangci-lint installed; the operator runs
make verify locally before pushing per CLAUDE.md operating rule.
|
||
|
|
3e91c7a1f0 |
chore(security): bump Go toolchain 1.25.9 -> 1.25.10 + golang.org/x/net 0.49 -> 0.53
CI run #484's Go Build & Test job failed govulncheck (M-024 hard gate). Six standard-library CVEs land in go1.25.9 + one golang.org/x/net CVE in v0.49.0; all are fixed in go1.25.10 + x/net v0.53.0 respectively. The advisories that fired were: GO-2026-4986 Quadratic string concat in net/mail.consumeComment — called via internal/api/handler/validation.go's ValidateCommonName -> mail.ParseAddress GO-2026-4977 Quadratic string concat in net/mail.consumePhrase — same call site GO-2026-4982 Bypass of meta-content URL escaping in html/template — called via internal/service/digest.go's RenderDigestHTML -> Template.Execute GO-2026-4980 Escaper bypass in html/template — same call site GO-2026-4971 Panic in net.Dial / LookupPort on Windows NUL bytes — many call sites (email notifier, SSH connector, ACME validators, validation.ValidateSafeURL, ...) GO-2026-4918 Infinite loop in net/http2 transport on bad SETTINGS_MAX_FRAME_SIZE — called via internal/connector/target/f5.go's F5Client.Authenticate -> http.Client.Do Bumps applied: * `go.mod`: `go 1.25.9` -> `go 1.25.10`; `golang.org/x/net v0.49.0` -> `v0.53.0` (kept indirect — the upgrade is force-pulled by the module-version directive; transitive deps will pick the higher). * `.github/workflows/{ci,codeql,release}.yml`: setup-go pin and the release.yml `GO_VERSION` env var bumped to 1.25.10. The security-deep-scan.yml workflow uses the major-minor `1.25` pin which auto-resolves to the latest 1.25.x and is unaffected. * `Dockerfile` + `Dockerfile.agent`: `golang:1.25-alpine@sha256:5caa...` re-pinned to `golang:1.25.10-alpine@sha256:8d22e29d960bc50cd0...` (digest looked up against `registry-1.docker.io/v2/library/golang/ manifests/1.25.10-alpine`; verified by the digest-validity ci-guard). The explicit `1.25.10-alpine` tag form replaces the moving `1.25-alpine` pin so the image-spec is reproducible end-to-end even without the digest reference. * `deploy/test/f5-mock-icontrol/Dockerfile`: `golang:1.25.9-bookworm @sha256:1a14...` re-pinned to `golang:1.25.10-bookworm@sha256: e3a54b77385b4f8a31c1...` (looked up the same way). * `deploy/test/f5-mock-icontrol/go.mod`: `go 1.25.9` -> `go 1.25.10`. * `internal/api/handler/version.go` + `api/openapi.yaml`: the `runtime.Version()`-shape comment + OpenAPI `example: go1.25.9` bumped to keep doc/example freshness. * `docs/contributor/ci-pipeline.md` + `docs/reference/connectors/ iis.md`: doc-only `Go 1.25.9` -> `Go 1.25.10` references. Verification done in-tree: * All `scripts/ci-guards/*.sh` pass locally including `digest-validity.sh` (the new digests resolve cleanly against Docker Hub). * `S-1-hardcoded-source-counts.sh` clean (the false-positive on "Bundle 1 migrations" was fixed in the prior commit). Operator step required post-push (sandbox has no Go toolchain): cd certctl && go mod tidy This regenerates go.sum's `golang.org/x/net v0.49.0` h1: lines into v0.53.0 ones. CI's `go mod tidy && git diff --exit-code go.mod go.sum` step will catch the drift if missed; in that case run the command, commit, and push the go.sum-only delta. |
||
|
|
3ef45e2ad4 |
auth-bundle-1 Phase 6-7-8: bootstrap path + scope-down CLI + auditor-role split
# Phase 6 — day-0 admin bootstrap * internal/auth/bootstrap/ (new package): Strategy interface + EnvTokenStrategy with constant-time compare, one-shot consumption via sync.Mutex, optional admin-existence probe. Bundle 2's OIDC- first-admin will plug in alongside as an alternate Strategy. * BootstrapService.ValidateAndMint: validates the operator's CERTCTL_BOOTSTRAP_TOKEN, mints a 32-byte (64-hex-char) random API key value, persists the SHA-256 hash to api_keys, grants r-admin via actor_roles, AddHashed's the runtime keystore so the just- minted key authenticates the next request without restart, and records bootstrap.consume to the audit trail with category=auth. * internal/auth/keystore.go (new): KeyStore interface + StaticKeyStore (immutable env-var-only path) + MutableKeyStore (env-var keys + DB-loaded api_keys + runtime AddHashed). The auth middleware now consumes a KeyStore so the bootstrap path can extend the lookup table at runtime. * migrations/000031_api_keys.up/down.sql: api_keys table with (id, name UNIQUE, key_hash UNIQUE, tenant_id, admin, created_by, created_at, expires_at, last_used_at). Idempotent. * /v1/auth/bootstrap GET (probe) + POST (mint) — auth-exempt. Both routes documented in api/openapi.yaml + AuthExemptRouterRoutes allowlist updated. The token never leaves internal/auth/bootstrap; the minted plaintext key flows only into the HTTP response body. * Startup warning emitted when CERTCTL_BOOTSTRAP_TOKEN is set AND admin actors already exist (config drift signal). * Tests: 4 strategy invariants (empty token born disabled, wrong token=ErrInvalidToken without consumption, one-shot consumption, admin-exists closes path), 5 service tests (happy path + actor- name validation + propagation of strategy errors + nil-deps guard + 32-byte entropy budget), 8 HTTP-handler tests (status 201/410/401/400 mapping + token-leak hygiene scan of slog + audit details + Location header). Token-leak test redirects slog.Default to a buffer for the test scope. # Phase 7 — API-key migration + scope-down CLI * GET /v1/auth/keys handler + service method ListKeys backed by ActorRoleRepository.ListDistinctActors. Returns one row per (actor_id, actor_type) pair with the slice of role IDs they hold. Permission: auth.role.list. * internal/cli/auth_scope_down.go: AuthListKeys, AuthScopeDown (interactive), AuthScopeDownNonInteractive (JSON config), AuthScopeDownSuggest (--suggest with optional --apply). The synthetic actor-demo-anon is filtered out of every interactive / bulk path; non-interactive flow logs and skips it explicitly. * SuggestRoleFromAuditEvents (pure function): walks 30 days of audit events per actor and returns the narrowest matching role (admin / mcp / viewer / agent / operator) plus a one-line reason. Classification: any admin-shaped action wins; otherwise all-MCP → mcp; all-read-only → viewer; all-agent-shaped → agent; otherwise operator. Test table pins all six classifications. * CLI subcommand tree extended: 'auth keys list' + 'auth keys scope-down [--non-interactive <cfg>] [--suggest [--apply]]'. * CHANGELOG.md leads v2.1.0 with the SECURITY: AUDIT YOUR API KEYS call-out + four flow examples. # Phase 8 — auditor role + event_category column * migrations/000032_audit_category.up/down.sql: ALTER TABLE audit_events ADD COLUMN event_category TEXT NOT NULL DEFAULT 'cert_lifecycle' + CHECK constraint (cert_lifecycle/auth/config) + (event_category) and (event_category, timestamp DESC) indexes for the auditor-filter query path. WORM trigger from migration 000018 continues to enforce append-only at the DB layer (DDL is not blocked). * domain.AuditEvent gains EventCategory string (omitempty); domain.EventCategoryCertLifecycle / Auth / Config constants. * AuditService.RecordEventWithCategory sibling of RecordEvent; legacy callers stay on RecordEvent (defaults to cert_lifecycle). Auth callers (RoleService, ActorRoleService, BootstrapService) switched to RecordEventWithCategory(..., 'auth', ...). * GET /v1/audit?category=<cat>: handler accepts the optional query param, validates against the enum (400 on invalid value), dispatches through ListAuditEventsByCategory. OpenAPI updated with the new query param + AuditEvent.event_category schema. * Postgres AuditRepository.Create now writes event_category; AuditRepository.List filters on it; AuditFilter.EventCategory gates the WHERE clause. * Tests: 5 audit-category-filter HTTP tests (dispatch routing, back-compat fallback, 400 for invalid values, all 3 enum values accepted, page+category combine, JSON output surfaces the field). 3 auditor-role invariants (auditor holds exactly audit.read+audit.export, no mutating perms, disjoint from viewer except audit.read). # Cross-phase wiring * HandlerRegistry.Bootstrap field added; cmd/server/main.go wires the bootstrap service ahead of RegisterHandlers (extracted assembleNamedAPIKeys helper into auth_backfill.go, moved the keystore + bootstrap construction up alongside the auth repos). * AuthCheckResolver / AuthActorRoleService extended with ListKeys to satisfy the Phase 7 surface; existing fakes updated. * fakeAudit + mockAuditService stubs in tests gain RecordEventWithCategory + ListAuditEventsByCategory; existing tests untouched. # Verifications * gofmt -l: clean across every modified file. * go vet ./...: clean. * staticcheck across internal/auth + handler + router + cli + service + repository + cmd + domain: clean. * go test -short -count=1: green across every Bundle-1-touched package — internal/auth (incl. bootstrap), internal/api/handler, internal/api/router, internal/cli, internal/service/auth, internal/service, internal/domain/auth, internal/repository/postgres, cmd/server, cmd/cli, plus internal/scheduler, internal/api/middleware, cmd/agent, internal/mcp. |
||
|
|
60a589ab96 |
auth-bundle-1 Phase 0-5 closure: demo-mode wire, named-key backfill, AuthCheck enrichment, OpenAPI schema, intermediate-ca comment refresh
Closes the 5 gaps the post-Phase-5 audit flagged on dev/auth-bundle-1.
C1: cmd/server/main.go now selects auth.NewDemoModeAuth() when
CERTCTL_AUTH_TYPE=none and falls back to auth.NewAuthWithNamedKeys
otherwise. Pre-closure, the no-op pass-through that
NewAuthWithNamedKeys returns for empty keys would have left
ActorIDKey / ActorTypeKey / TenantIDKey unpopulated and 401'd
every Phase-3.5 rbacGate-wrapped admin route + every Phase-4
RBAC handler in demo deployments. NewDemoModeAuth injects the
synthetic 'actor-demo-anon' actor seeded by migration 000029,
which holds r-admin at global scope.
C2: backfillNamedKeyActorRoles startup hook (cmd/server/auth_backfill.go)
iterates CERTCTL_API_KEYS_NAMED entries (and legacy
CERTCTL_AUTH_SECRET synthesized fallbacks) and grants r-admin
or r-viewer to each via authActorRoleRepo.Grant before the
HTTP server starts accepting requests. Idempotent via
ON CONFLICT DO NOTHING in the repo. Failures log a warning but
are non-fatal — the server still starts and the operator can
fix grants via /v1/auth/keys. Helper extracted from main.go so
the role-mapping invariant is pinned by 4 focused unit tests
(admin->r-admin, non-admin->r-viewer, empty no-op,
grant-error non-fatal, nil-logger safe).
M1: HealthHandler.AuthCheck now returns actor_id, actor_type,
tenant_id, roles, effective_permissions, and admin_via_role
when the optional AuthCheckResolver is wired (production path:
authCheckResolverAdapter wraps the postgres ActorRoleRepository
in main.go). Nil resolver preserves the legacy {status, user,
admin} contract for back-compat with pre-Bundle-1 GUIs and
test fixtures. Adds 2 regression tests + 1 fake resolver shim.
M2: refreshes the stale 'Admin gate: every method calls
auth.IsAdmin first' comment on IntermediateCAHandler — the gate
moved to router.go::rbacGate via auth.RequirePermission
middleware in Phase 3.5; the new comment block points readers
there.
M4: 11 RBAC routes (auth/me, auth/permissions, 5 role lifecycle,
2 role-permission grant/revoke, 2 actor-role grant/revoke) added
to api/openapi.yaml under the [Auth] tag with operationIds and
shared AuthRole / AuthRolePermission schemas. AuthCheck path
extended with the Bundle-1 enrichment fields. The 11 entries
removed from openapi_parity_test.go::SpecParityExceptions.
Tests: go vet + staticcheck + go test -short -count=1 green
across cmd/server/, internal/auth/, internal/api/router/, and
internal/api/handler/. New tests: 4 backfill unit tests,
2 AuthCheck M1 enrichment tests, 1 demo-mode + rbacGate chain
integration test (TestRBACGate_DemoModeChainReachesHandler).
Branch SECURITY.md (cowork/auth-bundle-1-SECURITY.md, not part
of this commit) captures the full posture of dev/auth-bundle-1
as of this closure for the operator's pre-merge review.
|
||
|
|
34adcfbbe5 |
api, handler: 4 admin-gated CA hierarchy endpoints + OpenAPI (Rank 8 commit 4)
Rank 8 commit 4 of 5. The API + RBAC layer that operators drive
the new hierarchy management surface from.
Endpoints (all admin-gated via middleware.IsAdmin; non-admin Bearer
callers get 403):
POST /api/v1/issuers/{id}/intermediates
Discriminator on body shape:
empty parent_ca_id + root_cert_pem + key_driver_id
→ CreateRoot (registers operator-supplied root CA).
parent_ca_id non-empty
→ CreateChild (signs new sub-CA cert under parent).
Service-layer error → HTTP code mapping:
ErrCANotSelfSigned → 400
ErrCAKeyMismatch → 400
ErrPathLenExceeded → 400
ErrNameConstraintExceeded → 400
ErrInvalidCertPEM → 400
ErrParentCANotActive → 409
ErrIntermediateCANotFound → 404
(other) → 500
GET /api/v1/issuers/{id}/intermediates
Returns flat list ordered by created_at; caller renders the
tree from each row's parent_ca_id (nil = root).
GET /api/v1/intermediates/{id}
Single-row detail.
POST /api/v1/intermediates/{id}/retire
Two-phase: confirm=false → active→retiring; confirm=true →
retiring→retired with active-children check (drain-first
semantics; ErrCAStillHasActiveChildren → 409).
Files changed:
internal/api/handler/intermediate_ca.go — 4 handlers
+ handler-defined
service interface
(dependency
inversion).
internal/api/handler/intermediate_ca_test.go — 8 test variants
(M-008 admin-
gate triplet
complete).
internal/api/handler/m008_admin_gate_test.go — register the
new admin-gated
handler in
AdminGatedHandlers
so the M-008
coherence
scanner stays
green.
internal/api/router/router.go — 4 r.Register
calls + new
IntermediateCAs
field on
HandlerRegistry.
cmd/server/main.go — wire the
postgres repo +
service +
handler. Reuses
the same
signer.FileDriver
instance the
OCSP responder
bootstrap path
feeds.
api/openapi.yaml — 4 new
operationIds,
full body
schema + status-
code dispatch.
Tests (8 in this commit):
TestIntermediateCA_Handler_NonAdmin_Returns403 (admin gate
— table-driven across all 4 endpoints)
TestIntermediateCA_Handler_AdminExplicitFalse_Returns403
(defensive: AdminKey present but false ≠ AdminKey absent)
TestIntermediateCA_Handler_AdminPermitted_ForwardsActor
(admin actor forwarded to service for audit attribution)
TestIntermediateCA_HandlerCreate_RootDispatch
(body discriminator: empty parent_ca_id → CreateRoot)
TestIntermediateCA_HandlerCreate_ChildDispatch
(body discriminator: parent_ca_id present → CreateChild)
TestIntermediateCA_HandlerCreate_BadRequestOnMissingRootBundle
(validation: no parent + no root bundle → 400)
TestIntermediateCA_HandlerCreate_ServiceErrorMappings
(table-driven: 7 service errors → expected HTTP codes)
TestIntermediateCA_HandlerRetire_TwoPhaseConfirm
(confirm=false then confirm=true forwarded correctly)
TestIntermediateCA_HandlerRetire_StillHasActiveChildren_Returns409
(drain-first contract — 409 not 500)
Verified locally:
gofmt: clean.
go vet ./...: exit 0.
go test -short -count=1 ./internal/api/handler/...: ok 4.498s.
bash scripts/ci-guards/openapi-handler-parity.sh: clean
(router routes: 182, openapi operations: 148; the +4 new routes
have +4 new operationIds — parity preserved).
bash scripts/ci-guards/* (all 24 guards): clean.
Out of scope of THIS commit (commit 5):
- web/src/pages/IssuerHierarchyPage.tsx (recursive tree render).
- docs/intermediate-ca-hierarchy.md sysadmin runbook (FedRAMP /
financial-services / internal-PKI patterns).
- docs/connectors.md hierarchy_mode row.
- WORKSPACE-ROADMAP entries (HSM-backed roots, automated
rotation, CRL chaining, NameConstraints templates, D3
dendrogram).
Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md, commit 4.
|
||
|
|
31e50d987f |
ci: fix Rank 7 lint + openapi-handler-parity drift on master
Two CI failures from the Rank 7 chain push (#438): Go Build & Test — staticcheck ST1021: internal/service/approval_metrics.go:97 comment for ApprovalDecisionEntry doesn't start with the type name internal/service/approval_metrics.go:130 comment for ApprovalPendingAgeSnapshot doesn't start with the type name Frontend Build — scripts/ci-guards/openapi-handler-parity.sh: 4 router routes have no OpenAPI operationId: GET /api/v1/approvals GET /api/v1/approvals/{id} POST /api/v1/approvals/{id}/approve POST /api/v1/approvals/{id}/reject The Rank 7 commit-3 spec deferred OpenAPI extension to commit 4 with a 'batched alongside the integration changes' note; commit 4 didn't actually add them. This commit closes that gap. Fixes: approval_metrics.go — split the doc comment that was attached to SnapshotApprovalDecisions (the function) but visually preceded ApprovalDecisionEntry (the type), so the type appeared to staticcheck as having a comment that named the function instead of the type. Same fix on ApprovalPendingAgeSnapshot. Now each exported type has its own type-name-leading comment per Go convention. api/openapi.yaml — added 4 new operationIds (listApprovalRequests, getApprovalRequest, approveApprovalRequest, rejectApprovalRequest) + new ApprovalRequest schema component under components/schemas. Inline 401 response (the Unauthorized component does not exist in this spec; the canonical pattern in the rest of the file is inline 'description: Authentication required'). The two-person integrity contract surface is documented in the description of the approve / reject endpoints so external readers see the RBAC contract from the spec alone. Verified locally: go vet ./internal/service/...: exit 0. scripts/ci-guards/openapi-handler-parity.sh: clean (140 ops vs 174 routes, 36 documented exceptions). Third CI failure (image-and-supply-chain) was a transient apt-fetch 'Connection reset by peer' from deb.debian.org while pulling libasan6_10.2.1-6_amd64.deb. Not a code issue; just re-run the workflow. No code change needed. |
||
|
|
0729ee46e0 |
chore: sweep github.com/shankar0123/certctl URL refs to certctl-io/certctl
Post-transfer cosmetic + release-critical URL refresh after moving the
repo from github.com/shankar0123/certctl to github.com/certctl-io/certctl
(2026-05-03). GitHub HTTP redirects continue to forward old URLs forever,
so existing operators are not broken — but aligns the canonical
references with the new owner so:
- procurement engineers / contributors browsing the docs see the right
URL on first read
- operators copying the agent install one-liner hit the new path
directly without going through a redirect
- the Helm chart's default image repository points at the canonical org
registry path
- the OnboardingWizard rendered to first-run UI users shows the new
URL in the install snippets and doc anchor links
- the GitHub Actions release workflow pushes container images to
ghcr.io/certctl-io/certctl-{server,agent} (was: shankar0123)
- the release-notes Markdown body in release.yml — which gets stamped
into every future release page — references the post-transfer
cert-identity (cosign keyless signing now uses the certctl-io
workflow URL) and the post-transfer SLSA provenance source-uri.
Without this, every cosign verify / slsa-verifier command on a
v2.1.0+ release would fail because the cert-identity-regexp would
not match the signing identity GitHub Actions OIDC issues post-
transfer. Old releases (v2.0.67 and earlier) keep their immutable
release-notes pointing at the shankar0123 path and remain
verifiable via their own published instructions.
Customer impact:
- Operators on ghcr.io/shankar0123/certctl-{server,agent}:latest
silently freeze on whatever tag was current at transfer time. They
get no errors; they just stop receiving updates. The next release
notes need a one-line callout (Phase 3.1 of cowork/transfer-
certctl-to-org.md) telling them to update their image path to
ghcr.io/certctl-io/certctl-{server,agent}.
- All other URLs (git clone, install one-liner, raw.githubusercontent
URLs, browser links, GitHub API) continue to resolve via permanent
HTTP redirects. The sweep is cosmetic for those.
Files swept (30 total):
.github/workflows/release.yml — IMAGE_NAMESPACE, source-uri,
cosign cert-identity-regexp, IMAGE= snippet (5 refs total).
CHANGELOG.md, README.md — anchor links, badges, install one-liner,
cosign verify snippets in operator-facing sections.
api/openapi.yaml — info / externalDocs URLs.
install-agent.sh — GITHUB_REPO const + systemd unit Documentation=
field.
deploy/ENVIRONMENTS.md, deploy/helm/{CHART_SUMMARY,INDEX,
INSTALLATION,README}.md, deploy/helm/certctl/{Chart.yaml,
README.md,values.yaml}, deploy/helm/examples/values-*.yaml —
chart docs + image repository defaults across dev / prod-ha
overrides.
docs/{certctl-for-cert-manager-users,connector-iis,connectors,
migrate-from-acmesh,migrate-from-certbot,quickstart,test-env,
why-certctl}.md — operator-facing doc URLs.
examples/{acme-nginx,acme-wildcard-dns01,multi-issuer,
private-ca-traefik,step-ca-haproxy}/docker-compose.yml +
examples/step-ca-haproxy/step-ca-haproxy.md — example image:
paths and accompanying narrative.
web/src/pages/OnboardingWizard.tsx — first-run-UI URL refs (curl
install one-liners, agent docker image path, doc anchor links).
Files intentionally NOT swept (Choice A from cowork/transfer-certctl-
to-org.md):
go.mod, go.sum — module declaration stays github.com/shankar0123/
certctl. Existing imports compile because Go uses the path
declared in go.mod, not the URL it was fetched from. Internal-
only project; no external Go consumers; rename will land as a
mechanical sed when one materializes.
~250 *.go files — every import remains github.com/shankar0123/
certctl/internal/...
deploy/test/f5-mock-icontrol/go.mod — separate test sub-module;
same Choice A logic; module path stays.
Files intentionally NOT swept (other reasons):
README.md lines 244-245 — Scarf-pixel docker-pull commands.
shankar0123.docker.scarf.sh/... is a Scarf-account hostname
(per-user, not per-repo) and the pixel keeps tracking pulls
against the operator's personal Scarf account. Migrating to a
certctl-io Scarf account is a separate decision (create org
Scarf account → re-create package → update README).
deploy/test/f5-mock-icontrol/f5-mock-icontrol — checked-in
compiled binary with shankar0123/certctl baked into Go build
info via the sub-module path. Out of scope for a URL sweep;
will refresh on the next `make test-integration` rebuild.
Verification:
gofmt: clean (no .go files touched).
go vet ./...: clean (verified at this SHA in 1.3 of the transfer
checklist; no .go changes since).
go build ./...: clean (same).
go test -short on representative packages: green (same).
Diff shape: 30 files, 74 insertions / 74 deletions, net-zero size,
pure URL substitution.
|
||
|
|
4dc8d3fa5b |
acme-server: key rollover + revocation + ARI (Phase 4/7)
Closes the RFC 8555 + RFC 9773 surface beyond the issuance happy-path:
- POST /acme/profile/<id>/key-change (RFC 8555 §7.3.5)
- POST /acme/profile/<id>/revoke-cert (RFC 8555 §7.6)
- GET /acme/profile/<id>/renewal-info/<cert-id> (RFC 9773 ARI)
After this commit, ACME clients can rotate account keys, revoke certs
through the ACME surface (rather than only via the certctl GUI/API),
and fetch ARI for proactive renewal scheduling.
Architecture:
- Key rollover: outer JWS verified against the registered account key
(existing kid path); the inner JWS — embedded as the outer's payload
— verified against the embedded NEW jwk in a new dedicated routine
(ParseAndVerifyKeyChangeInner) that enforces RFC 8555 §7.3.5
inner-only invariants: MUST use jwk + MUST NOT use kid, payload
.account == outer.kid, payload.oldKey thumbprint-equals registered.
A single WithinTx swaps the stored thumbprint+pem and writes the
audit row. Concurrent-rollover safety via SELECT…FOR UPDATE on the
conflicting account row in UpdateAccountJWKWithTx; the loser
observes the winner's new thumbprint and is told to retry (409).
- Revocation: two auth paths. kid → AccountOwnsCertificate single-
indexed COUNT lookup over acme_orders. jwk → constant-time RFC 7638
thumbprint compare against the cert's pubkey. Both paths route
through service.RevocationSvc.RevokeCertificateWithActor so the
existing CRL/OCSP refresh + audit + metrics pipeline applies. RFC
5280 §5.3.1 numeric reason codes clamp to certctl's
domain.ValidRevocationReasons; codes 8 (removeFromCRL) + 10
(aACompromise) clamp to 'unspecified' since they aren't in the set.
- ARI is GET-only and unauth per RFC 9773 §4. Cert-id wire shape is
base64url(AKI).base64url(serial); ParseARICertID strict-decodes,
SerialHex emits the canonical certctl-shape lowercase-no-leading-
zeros hex used in certificate_versions.serial_number.
ComputeRenewalWindow has 3 branches: bound RenewalPolicy →
[notAfter - days, notAfter - days/2]; no policy → last 33% of
validity; past expiry → [now, now + 1d] (renew immediately).
Retry-After honors CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL.
What ships:
- internal/api/acme/{keychange,ari}.go (+ phase4_test.go: 15 tests).
- internal/api/acme/order.go: RevokeCertRequest wire shape.
- internal/api/handler/acme.go: KeyChange, RevokeCert, RenewalInfo
+ 11 new writeServiceError mappings.
- internal/repository/postgres/acme.go: UpdateAccountJWKWithTx (FOR
UPDATE + expectedOldThumbprint precondition; ErrACMEAccountKey-
ConcurrentUpdate sentinel) + AccountOwnsCertificate.
- internal/service/acme.go: RotateAccountKey + RevokeCert +
RenewalInfo; CertificateRevoker + RenewalPolicyLookup interfaces;
SetRevocationDelegate + SetRenewalPolicyLookup wiring; 11 new
sentinels; 6 new metrics.
- internal/service/acme_phase4_test.go: service-layer tests for
RotateAccountKey (happy + duplicate-key) + RevokeCert (kid mismatch
+ jwk mismatch + jwk happy + already-revoked + reason-clamping) +
RenewalInfo (disabled + bad cert-id).
- internal/api/router/router.go: 6 new register calls (3 per-profile
+ 3 shorthand). Router parity exceptions extended in lockstep
(in-tree SpecParityExceptions + CI-only openapi-handler-exceptions
.yaml).
- cmd/server/main.go: SetRevocationDelegate(revocationSvc) +
SetRenewalPolicyLookup(renewalPolicyRepo) at startup.
- internal/config/config.go: CERTCTL_ACME_SERVER_ARI_ENABLED (default
true) + CERTCTL_ACME_SERVER_ARI_POLL_INTERVAL (default 6h);
BuildDirectory's ariEnabled flag now flips on under
cfg.ARIEnabled.
- docs/acme-server.md: phase status flipped to Phase 4; endpoints
table grows 6 rows (3 per-profile + 3 shorthand); FAQ section
appended explaining how to rotate keys, revoke certs, and consume
ARI.
Tests:
- 'go vet ./...' clean across the repo.
- 'go test -short -count=1 ./...' green across every package.
- phase4_test.go covers: keychange happy-path + 5 negatives +
MapKeyChangeErrorToProblem coverage; ARI cert-id round-trip + 6
malformed cases + BuildARICertID from a generated cert; window-
math 3 branches.
- service-layer tests confirm: RotateAccountKey atomically swaps the
thumbprint (verifies persisted state) and rejects duplicate keys;
RevokeCert routes through the stub RevocationSvc with the right
actor string + reason on the jwk path, rejects mismatched keys,
rejects already-revoked certs, clamps reason codes correctly;
RenewalInfo respects ARIEnabled + cert-id format.
Engineering history: cowork/WORKSPACE-CHANGELOG.md 'ACME-Server-4'.
|
||
|
|
9bc845304e |
acme-server: HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (Phase 3/7)
Wires up the actual challenge-validation machinery so profiles in
acme_auth_mode='challenge' resolve end-to-end. After this commit,
cert-manager 1.15+ with `solver: http01: ingress` against a
challenge-mode profile completes a real HTTP-01 flow and gets a cert.
DNS-01 + TLS-ALPN-01 share the same code path with the appropriate
validator selection.
Architecture (the load-bearing parts):
- 3 separate semaphore-bounded worker pools (one per challenge type),
so HTTP-01 and DNS-01 can't starve each other under load. Default
weight 10 per type; tunable via CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY,
DNS01_CONCURRENCY, TLSALPN01_CONCURRENCY.
- 30s per-challenge timeout (configurable via PoolConfig.PerChallengeTimeout).
- HTTP-01 validator runs validation.IsReservedIPForDial (newly
exported wrapper preserving the existing private impl byte-for-byte
for the network scanner + ValidateSafeURL paths) on the resolved
IP — both at the initial dial and every redirect hop. SSRF probes
into private IP space are refused before the connect.
- DNS-01 validator uses a dedicated resolver pointed at
CERTCTL_ACME_SERVER_DNS01_RESOLVER (default 8.8.8.8:53) — does
NOT use the system resolver to keep behavior deterministic across
deployments. Wildcard handling: `*.example.com` queries
_acme-challenge.example.com.
- TLS-ALPN-01 validator (RFC 8737) connects with ALPN `acme-tls/1`,
inspects the id-pe-acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31),
asserts the ASN.1 OCTET STRING value equals SHA-256 of the key
authorization. Cert chain is intentionally NOT validated
(InsecureSkipVerify=true is correct per RFC 8737 — the proof is
in the extension, not the chain). Documented in docs/tls.md L-001
table + the //nolint:gosec comment carries the justification.
SSRF guard: same posture as HTTP-01.
- Validation is asynchronous: handler accepts the POST and returns
200 immediately with status=processing; the worker-pool fires a
callback that updates challenge → authz → order in a fresh
background-context WithinTx. The order auto-promotes to `ready`
when ALL authzs become valid; auto-fails to `invalid` when ANY
authz becomes invalid.
What ships:
- internal/api/acme/challenge.go: KeyAuthorization (RFC 8555 §8.1) +
DNS01TXTRecordValue (§8.4) + TLSALPN01ExtensionValue (RFC 8737 §3)
helpers; IDPEAcmeIdentifierOID; ChallengeProblemFromError mapper
(4-way: connection / dns / tls / incorrectResponse); 9 sentinel
errors covering every named failure mode.
- internal/api/acme/validators.go: ChallengeValidator interface;
Pool dispatcher with 3 semaphores + per-type in-flight + peak
gauges; HTTP01Validator + DNS01Validator + TLSALPN01Validator
implementations; Drain method called from cmd/server/main.go's
shutdown sequence.
- internal/api/acme/validators_test.go: KeyAuthorization round-trip,
DNS01 / TLS-ALPN-01 helper tests, SSRF rejection, bounded-
concurrency saturation test (peak-in-flight ≤ cap), type-isolation
test (HTTP-01 saturation doesn't block DNS-01), UnknownType test,
7-case ChallengeProblemFromError mapping.
- internal/repository/postgres/acme.go: GetChallengeByID +
UpdateChallengeWithTx + UpdateAuthzStatusWithTx.
- internal/service/acme.go: SetValidatorPool wires the *acme.Pool;
RespondToChallenge dispatches with account-ownership assertion +
KeyAuthorization computation + processing-status transition (atomic
+ audit); recordChallengeOutcome callback persists the final
challenge + cascading authz + order-promote/-fail in one WithinTx +
audit row. 4 new metrics.
- internal/api/handler/acme.go: Challenge handler; round-trips
account.JWKPEM through ParseJWKFromPEM to recover the *jose.JSONWebKey
the validator pool needs.
- internal/api/router/router.go + openapi_parity_test.go +
api/openapi-handler-exceptions.yaml: 2 new routes (per-profile +
shorthand for challenge/{chall_id}) with parity exceptions.
- cmd/server/main.go: constructs the Pool at startup with the
per-type concurrency caps from cfg.ACMEServer; ACMEService.ValidatorPool()
accessor exposed for the shutdown drain sequence.
- internal/validation/ssrf.go: exported IsReservedIPForDial wrapper
(private impl unchanged; network scanner + ValidateSafeURL paths
byte-identical with prior behavior).
- docs/tls.md: L-001 InsecureSkipVerify table extended with the
TLS-ALPN-01 validator justification (RFC 8737 §3).
- docs/acme-server.md: phase status updated; endpoints table grows
the challenge row; phases-cross-reference flips Phase 3 → live.
Tests:
- 80%+ coverage on the new files.
- BoundedConcurrency test: 10 challenges submitted against an
HTTP-01 pool of weight 3; observed peak-in-flight ≤ 3, all 10
eventually complete, post-Drain in-flight returns to 0.
- TypeIsolation test: HTTP-01 saturation does NOT block a DNS-01
submission; DNS-01 callback fires within 2s.
- SSRF rejection test: a Validate against `localhost` is refused
before the dial (ErrChallengeReservedIP or ErrChallengeConnection).
Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-3".
|
||
|
|
c351bba41a |
acme-server: orders + authorizations + finalize + cert download (Phase 2/7)
Closes the issuance loop in trust_authenticated mode (commits |
||
|
|
a05a7d3dad |
ci: fix Phase 1b post-push CI failures (3 guards)
Phase 1b push (commit
|
||
|
|
b7a3162028 |
ci-pipeline-cleanup Phases 7-9: image-and-supply-chain job
Bundle: ci-pipeline-cleanup, Phases 7-9 / frozen decisions 0.8 + 0.10 + 0.11.
NEW image-and-supply-chain job (Ubuntu, ~3 min). Three steps:
PHASE 7 — Digest validity
scripts/ci-guards/digest-validity.sh resolves every @sha256:<digest>
ref in deploy/**/*.{yml,Dockerfile*} against its registry. Closes the
H-001 lying-field gap that Bundle II hit (11 fabricated digests passed
H-001's regex-only check and failed docker pull in CI).
Sandbox verification: 16/16 digests in deploy/* + Dockerfiles all
return HTTP 200 from registry-1.docker.io / ghcr.io / mcr.microsoft.com.
PHASE 8 — Docker build smoke (all 4 Dockerfiles)
Per frozen decision 0.10: build Dockerfile, Dockerfile.agent,
deploy/test/f5-mock-icontrol/Dockerfile, deploy/test/libest/Dockerfile.
Catches syntax errors + COPY path drift before tag-time release.yml.
The test-sidecar Dockerfiles are load-bearing for vendor-e2e — a
syntax error there silently breaks the e2e suite.
PHASE 9 — OpenAPI ↔ handler operationId parity
scripts/ci-guards/openapi-handler-parity.sh extracts router routes
(r.mux.Handle / r.Register "METHOD /path" syntax — Go 1.22+ ServeMux),
extracts OpenAPI operations (paths × HTTP methods), and fails if any
router route has no operationId AND is not documented in the new
api/openapi-handler-exceptions.yaml.
Verified gap at HEAD
|
||
|
|
5a682db8e2 |
EST RFC 7030 hardening master bundle Phases 10-11: libest sidecar e2e
+ Cisco IOS quirk fixtures + ManagedCertificate.Source provenance + EST bulk-revoke endpoint + 13 typed audit action codes. Phase 10.1 — libest reference-client sidecar: - deploy/test/libest/Dockerfile: multi-stage Debian-bookworm-slim build of Cisco's libest v3.2.0-2 from source (autoconf/automake/ libtool + libcurl4-openssl-dev + libssl-dev). Runtime stage carries only estclient + bash + openssl + ca-certificates so the exec surface stays small + predictable. - docker-compose.test.yml libest-client entry (profiles: [est-e2e]) with bind mounts for /config/est (test workspace) + /config/certs (certctl CA bundle for TLS pinning); IP 10.30.50.9 (10.30.50.8 was already taken by certctl-agent). - deploy/test/est/.gitkeep keeps the bind-mount target tracked. Phase 10.2 — 5 integration tests (//go:build integration) in deploy/test/est_e2e_test.go: - TestEST_LibESTClient_Enrollment_Integration (cacerts → simpleenroll → cert-shape assertion) - TestEST_LibESTClient_MTLSEnrollment_Integration (mTLS sibling-route cert auth; skip when bootstrap cert absent) - TestEST_LibESTClient_ServerKeygen_Integration (RFC 7030 §4.4 multipart; skip when profile gate disabled) - TestEST_LibESTClient_RateLimited_Integration (4th enroll trips per-principal cap, asserts 429-shaped error) - TestEST_LibESTClient_ChannelBinding_Integration (libest --tls-exporter; skip when libest build lacks the flag). - requireESTSidecar guard skips the suite when the operator forgot --profile est-e2e; helpful error message includes the exact command to bring the sidecar up. Phase 10.3 — Cisco IOS quirk fixtures + 3 unit tests in internal/api/handler/cisco_ios_quirks_test.go: - testdata/cisco_ios_15x_pem_csr.txt: PEM body sent with Content-Type application/x-pem-file. Handler dispatches on body-prefix not Content-Type — accepts cleanly. - testdata/cisco_ios_16x_trailing_newline_csr.txt: extra trailing newlines after base64 body. strings.TrimSpace tolerates. - testdata/cisco_ios_crlf_b64_csr.txt: CRLF-wrapped base64. base64.StdEncoding handles CRLF + LF identically. Phase 11.1 — ManagedCertificate.Source provenance: - New domain.CertificateSource enum (Unspecified/EST/SCEP/API/Agent). - Migration 000023_managed_certificates_source.up.sql adds source TEXT NOT NULL DEFAULT '' so existing rows scan as CertificateSourceUnspecified — back-compat: bulk-revoke filter treats empty as "any source". - Postgres repo Insert/Update/scan paths all wire the new column. Phase 11.2 — EST bulk-revoke endpoint: - BulkRevocationCriteria.Source field (Source-only requests rejected as too broad — must accompany at least one narrower criterion). - service.bulk_revocation.resolveCertificates post-filter by Source (empty=any, no SQL change so existing CertificateFilter callers unaffected). - New BulkRevocationHandler.BulkRevokeEST method pins Source=EST + dispatches; new route POST /api/v1/est/certificates/bulk-revoke (M-008 admin-gated). openapi.yaml documented + parity-guard green. Phase 11.3 — 13 typed audit action codes in internal/service/est_audit_actions.go: - est_simple_enroll_success / _failed - est_simple_reenroll_success / _failed - est_server_keygen_success / _failed - est_auth_failed_basic / _mtls / _channel_binding - est_rate_limited - est_csr_policy_violation - est_bulk_revoke - est_trust_anchor_reloaded - ESTService.processEnrollment + SimpleServerKeygen + ReloadTrust split-emit BOTH the legacy bare action codes (back-compat for the GUI activity-tab chip filters that match by exact string + existing audit-log analysers) AND the new typed _success / _failed variants (operator grep target + per-failure-mode counter). Tests: - internal/api/handler/bulk_revocation_est_test.go — 5 cases (admin-true happy path pins Source=EST + non-admin 403 + empty-criteria 400 + invalid-reason 400 + method-not-allowed). - internal/service/est_audit_actions_test.go — 5 cases (SimpleEnroll legacy+typed emission / SimpleReEnroll typed / IssuerError typed-failed / PolicyViolation triple-emit / unique-string invariant). Pre-commit verification (sandbox): gofmt clean, go vet clean (excluding repository/postgres testcontainers limit), staticcheck clean across api/handler/api/router/domain/service/deploy/test, go test -short -count=1 green for every non-postgres Go package + integration build (`go build -tags integration ./deploy/test/...`) clean. G-3 docs-drift guard reproduced locally clean (Phases 10-11 added zero new env vars). Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases 12-13 (docs/est.md + WiFi/802.1X / IoT bootstrap / FreeRADIUS recipes; release prep + tag) remain — post-2.1.0 work. |
||
|
|
43075a1b5c |
EST RFC 7030 hardening master bundle Phases 5-7: end-to-end serverkeygen
+ profile-driven csrattrs + admin observability with per-status counters + reload-trust endpoint. Phase 5 — RFC 7030 §4.4 server-driven key generation: - internal/pkcs7/envelopeddata_builder.go is the inverse of the existing parser/decryptor: AES-256-CBC content cipher + RSA PKCS#1 v1.5 keyTrans + per-call random IV. Round-trip pinned in test (BuildEnvelopedData → ParseEnvelopedData → Decrypt returns the original plaintext byte-for-byte). - ESTService.SimpleServerKeygen runs the full §4.4 flow: parse client CSR → require RSA pubkey for keyTrans → resolve per-profile algorithm (RSA-2048 default; honors AllowedKeyAlgorithms) → in- memory keygen → re-build CSR with server pubkey → run existing issuer pipeline → marshal PKCS#8 → CMS-EnvelopedData wrap to a synthetic recipient cert wrapping the device's CSR-supplied pubkey → zeroize plaintext + PKCS#8 bytes → return CertPEM + ChainPEM + EncryptedKey. Typed sentinels ErrServerKeygenRequiresKey- Encipherment / ErrServerKeygenUnsupportedAlgorithm / ErrServerKeygenDisabled. - ESTHandler.ServerKeygen + ServerKeygenMTLS emit RFC 7030 §4.4.2 multipart/mixed with random per-response boundary; per-profile SetServerKeygenEnabled gate returns 404 when off (defense in depth even if the route was registered). - New routes POST /.well-known/est/[<PathID>/]serverkeygen + /.well-known/est-mtls/<PathID>/serverkeygen; openapi.yaml + openapi-parity guard updated. Phase 6 — Real csrattrs implementation: - New CertificateProfile.RequiredCSRAttributes []string + migration 000022_certificate_profiles_csrattrs.up.sql. The migration also lands the previously-unwired must_staple column (closes the 5.6 follow-up loop where the field shipped at the domain + service layer but the postgres scan/insert/update never persisted it). - domain.EKUStringToOID + AttributeStringToOID lookup tables: id-kp-* EKUs (RFC 5280 §4.2.1.12) + RFC 5280 DN attributes + RFC 2985 PKCS#10 attributes + Microsoft Intune device-serial OID. - ESTService.GetCSRAttrs replaces the v2.0.x nil/204 stub with a profile-derived SEQUENCE OF OID ASN.1 marshal. Unknown EKU / attribute strings dropped + warning-logged so a typo doesn't take down the entire endpoint. Phase 7 — Admin observability + counters + reload-trust: - internal/service/est_counters.go: estCounterTab (sync/atomic; 12 named labels) + ESTStatsSnapshot per-profile shape + ESTService.Stats(now) zero-allocation accessor + ReloadTrust() SIGHUP-equivalent + SetESTAdminMetadata setter. - Counter ticks wired into processEnrollment + SimpleServerKeygen at every success/failure leg. - internal/api/handler/admin_est.go mirrors AdminSCEPIntune verbatim: Profiles + ReloadTrust handlers + AdminESTServiceImpl. Both endpoints admin-gated (M-008 triplet pinned + admin_est.go added to AdminGatedHandlers). - New routes GET /api/v1/admin/est/profiles + POST /api/v1/admin/ est/reload-trust; openapi.yaml documented; openapi-parity guard reproduced clean. - cmd/server/main.go grows estServices map populated by the per- profile EST loop + handed to AdminEST. New MTLSTrust() + HasMTLSTrust() accessors on ESTHandler so main.go can pull the trust holder for the admin-metadata wire-up. - Per-profile counter isolation regression test (internal/service/est_profile_counter_isolation_test.go) proves a future shared-counter refactor would fail at compile-time pointer-identity check. Pre-commit verification (sandbox): gofmt clean, go vet clean (excluding repository/postgres which the sandbox can't build — disk-space testcontainers download), staticcheck clean across cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/ service/pkcs7/domain/cmd/server, go test -short -count=1 green for every non-postgres package. G-3 docs-drift guard reproduced locally clean (Phases 5-7 added zero new env vars; Phase 1 already documented per-profile SERVER_KEYGEN_ENABLED). Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases 8-13 (GUI ESTAdminPage / CLI+MCP / libest e2e / bulk revocation / docs/est.md / release prep) remain — post-2.1.0 work. |
||
|
|
506cff137d |
feat(scep): SCEP probe in network scanner for fleet-readiness assessment
Phase 11.5 of the SCEP RFC 8894 + Intune master bundle. Adds an
operator-facing SCEP probe that issues GetCACaps + GetCACert against
an arbitrary SCEP server URL and returns a structured posture snapshot
(reachable + advertised caps + RFC 8894 / AES / POST / Renewal /
SHA-256 / SHA-512 support flags + CA cert subject + issuer + NotBefore
+ NotAfter + days-to-expiry + algorithm + chain length).
Two operator use cases per the master prompt:
1. Pre-migration assessment — probe an existing EJBCA / NDES SCEP
server before switching to certctl to see what capabilities it
advertises and what the CA cert looks like.
2. Compliance posture audits — periodic ad-hoc probes against the
operator's own SCEP servers to flag drift.
Capability-only — does NOT POST a CSR per the spec (would consume slot
allocations on the target server + create audit noise). Standalone CLI
binary explicitly out of scope (per the master prompt §11.5.6 and the
operator's confirmation): the probe code lands inside certctl; a
future thin Cobra wrapper is a separate decision.
Backend (six new + one extended file):
* internal/domain/network_scan.go — new SCEPProbeResult struct with
every probe field documented for the GUI's display layer.
* migrations/000021_scep_probe_results.up.sql + .down.sql — new
scep_probe_results table with TEXT id, target_url, all probe
flags, CA cert metadata, probed_at, probe_duration_ms, error.
Two indexes: idx_scep_probe_results_probed_at (DESC) for the
'recent probes' GUI query, idx_scep_probe_results_target_url
(target_url, probed_at DESC) for the future per-URL history view.
* internal/repository/interfaces.go — new SCEPProbeResultRepository
interface (Insert + ListRecent).
* internal/repository/postgres/scep_probe_results.go — Postgres
implementation. ListRecent clamps limit to [1, 200]; on read
re-derives ca_cert_days_to_expiry against the query-time wall
clock so 'X days remaining' stays fresh.
* internal/service/scep_probe.go — ProbeSCEP(ctx, url) on
NetworkScanService. Validation order:
1. Up-front URL validation via validation.ValidateSafeURL
(defaults to validation.ValidateSafeURL but injectable for
tests via the new scepValidateURL field on the service).
2. Dial-time SSRF re-check via SafeHTTPDialContext on the
http.Transport (defends against DNS rebinding).
3. GET ?operation=GetCACaps + GET ?operation=GetCACert.
GetCACert handles three response shapes: PKCS#7 SignedData
certs-only envelope (multi-cert), raw DER (single-cert),
and PEM-wrapped DER (non-conforming servers).
Times out at 30s; uses a 1MB body cap for DoS defense; wraps
the result + persists via the repo (nil-safe) before returning.
describeCertAlgorithm helper returns 'RSA-N' / 'ECDSA-curve' /
'Ed25519' / 'DSA' for the GUI's algorithm column.
* internal/service/network_scan.go — added scepProbeRepo +
scepHTTPClient + scepValidateURL + scepIDFn + nowFn fields;
SetSCEPProbeRepo wires the repo at startup.
* internal/api/handler/network_scan.go — extended NetworkScanService
interface with ProbeSCEP + ListRecentSCEPProbes; added two new
HTTP handlers:
POST /api/v1/network-scan/scep-probe (body {url})
GET /api/v1/network-scan/scep-probes (recent history)
Synchronous probe; HTTP 200 with the result body for both success
and reachable-but-failed cases (so the GUI can render the failure
tone with the operator-actionable error message).
* internal/api/router/router.go — registered the two routes inline
after the existing network-scan target endpoints.
* api/openapi.yaml — documented both endpoints (operationId
probeSCEP + listSCEPProbes) with full schema + response codes.
* cmd/server/main.go — wires the new SCEPProbeResultRepository
onto the network scan service via SetSCEPProbeRepo right after
the existing NewNetworkScanService construction.
Backend tests (6 new — exit-criteria-named per the master prompt):
* TestProbeSCEP_AdvertisesAllCaps — happy path, full RFC 8894
capability set, ECDSA P-256 CA cert, 365-day expiry.
* TestProbeSCEP_MissingSCEPStandard — pre-RFC-8894 server (only
POSTPKIOperation + SHA-1 + DES3); SupportsRFC8894 = false.
* TestProbeSCEP_GetCACertExpired — CA cert NotAfter 30d in the
past; CACertExpired = true.
* TestProbeSCEP_Unreachable — connect to TCP port 1; probe
returns Reachable=false + non-empty Error.
* TestProbeSCEP_RejectsReservedIP — http://169.254.169.254/scep
(EC2 metadata literal) rejected by the up-front
validation.ValidateSafeURL gate; result captures the error
without ever issuing the HTTP call.
* TestProbeSCEP_PEMWrappedCert — server returns PEM instead of
raw DER for GetCACert; the fallback parse path handles it.
Frontend (one extended file + types/client):
* web/src/api/types.ts — SCEPProbeResult + SCEPProbesResponse.
* web/src/api/client.ts — probeSCEPServer + listSCEPProbes
helpers.
* web/src/pages/NetworkScanPage.tsx — new SCEPProbeSection
component + ProbeResultPanel (with capability badges + CA cert
details panel + raw caps line) + SCEPProbeHistoryTable. Form
rejects empty URL with inline error before calling the API.
Reload mutation goes through useTrackedMutation with explicit
invalidates: [['scep-probes']] (M-009 contract).
Frontend tests (5 new + 0 regressions):
* Scep probe section header + form renders.
* Empty URL is rejected with inline error and never calls the
probe endpoint.
* Successful probe renders capability badges + CA cert subject
+ days-remaining inline panel.
* Probe-level errors are surfaced in the inline panel (no result
panel rendered).
* Recent-probes history table renders one row per probe.
* (Existing 2 NetworkScanPage XSS-hardening tests stub the new
listSCEPProbes endpoint to an empty list so they still pass.)
Verification:
* gofmt clean on touched files
* go vet ./... clean
* staticcheck on service+handler+router+repository+cmd-server clean
* go test -short across service+handler+router+repository+cmd-server
+ integration: all green (existing + 6 new probe tests pass)
* Frontend tsc --noEmit clean
* Vitest: 7/7 NetworkScanPage tests pass (2 existing XSS + 5 new
probe section)
* G-3 docs-drift CI guard reproduced locally clean (no new env vars)
* M-009 hard-zero useMutation guard clean (probe mutation goes
through useTrackedMutation)
* openapi-parity guard satisfied (both new routes documented)
* The mockNetworkScanService in handler + integration packages
extended with stub Probe methods; targeted coverage stays in
scep_probe_test.go.
Out of scope (per master prompt §11.5.6 + operator confirmation):
* Standalone certctl-scan CLI binary — separate decision, ~1d of
follow-up work when/if shipped.
Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 11.5
cowork/scep-rfc8894-intune/progress.md
|
||
|
|
0be889ff1d |
refactor(scep-gui): rebrand SCEP admin surface to per-profile tabbed interface (Profiles + Intune + Recent Activity)
Phase 9 follow-up to the SCEP RFC 8894 + Intune master bundle. The
Phase 9.4 GUI shipped 'SCEP Intune Monitoring' at /scep/intune, which
made the per-profile observability surface look Intune-only — operators
running EJBCA + Jamf would never click that nav link expecting per-
profile RA cert + mTLS observability. The page is per-profile keyed
under the hood; this commit rebrands + restructures so the surface
matches what operators actually need.
Spec: cowork/scep-gui-restructure-prompt.md.
User-visible change:
- Nav link renamed: 'SCEP Intune' → 'SCEP Admin'.
- Route: /scep is the new canonical path; /scep/intune kept as a
backward-compat alias that lands directly on the Intune tab.
- Page header: 'SCEP Administration'.
- Three tabs:
* Profiles (default) — per-profile lean cards with RA cert
expiry countdown, mTLS sibling-route status badge, Intune
enabled/disabled badge, challenge-password-set indicator.
'View Intune details →' link on Intune-enabled cards
deep-links into the Intune tab.
* Intune Monitoring — the existing Phase 9.4 deep-dive
(per-status counters, trust anchor expiry, recent failures
table, reload-trust button + confirmation modal).
* Recent Activity — full SCEP audit log filter merging all
four action codes (scep_pkcsreq + scep_renewalreq +
scep_pkcsreq_intune + scep_renewalreq_intune); chip filters
for All / Initial / Renewal / Intune / Static.
Backend:
* internal/service/scep.go — new SCEPProfileStatsSnapshot type +
IntuneSection sub-block + ProfileStats(now) accessor. Adds
raCertSubject/raCertNotBefore/raCertNotAfter + mtlsEnabled +
mtlsTrustBundlePath fields with SetRACert + SetMTLSConfig setters.
Existing IntuneStatsSnapshot + IntuneStats(now) preserved
UNCHANGED for /admin/scep/intune/stats backward compat (the
JSON shape stays byte-stable for external consumers — the
aliasing approach the prompt initially suggested doesn't work
because the new shape nests Intune while the old one is flat).
ChallengePasswordSet is derived from challengePassword != ''
(the secret value itself is never surfaced).
* internal/api/handler/admin_scep_intune.go — new Profiles handler
method on AdminSCEPIntuneHandler with the same M-008 admin gate.
AdminSCEPIntuneServiceImpl extended (in place; same
map[string]*service.SCEPService) to satisfy the new
AdminSCEPProfileService interface. Single handler file gets the
third method so the M-008 pin entry count stays steady (no new
file, no new triplet of admin-gate test files — just three new
Profiles tests inside the existing test file).
* internal/api/router/router.go — one new route
'GET /api/v1/admin/scep/profiles' registered to
reg.AdminSCEPIntune.Profiles. HandlerRegistry unchanged.
* api/openapi.yaml — new operation 'listSCEPProfiles' documenting
the request body / response shape / error mapping. Existing
Intune entries unchanged.
* cmd/server/main.go — per-profile loop now calls
scepService.SetMTLSConfig(profile.MTLSEnabled,
profile.MTLSClientCATrustBundlePath) right after SetPathID, and
scepService.SetRACert(raCert) right after loadSCEPRAPair returns
the leaf cert. Both setters are nil-safe.
* internal/api/handler/m008_admin_gate_test.go — extended the
existing admin_scep_intune.go entry's justification to mention
the third endpoint. No new map entry needed (file already
listed).
Backend tests (8 new):
* TestAdminSCEPProfiles_NonAdmin_Returns403
* TestAdminSCEPProfiles_AdminExplicitFalse_Returns403
* TestAdminSCEPProfiles_AdminPermitted_ForwardsActor — also pins
that Intune-enabled profiles emit an 'intune' sub-block while
Intune-disabled profiles OMIT it.
* TestAdminSCEPProfiles_RejectsNonGetMethod
* TestAdminSCEPProfiles_PropagatesServiceError
* TestAdminSCEPProfilesServiceImpl_NilMapReturnsEmpty
* (existing 16 Phase 9 admin tests still pass — backward-compat
preserved)
Frontend:
* web/src/api/types.ts — new SCEPProfileStatsSnapshot +
IntuneSection + SCEPProfilesResponse types. Existing
IntuneStatsSnapshot et al unchanged.
* web/src/api/client.ts — new getAdminSCEPProfiles helper.
* web/src/pages/SCEPAdminPage.tsx — full rewrite as the tabbed
surface. Reuses the existing ConfirmReloadModal and Intune
deep-dive card components verbatim; adds ProfileSummaryCard
(lean card for the Profiles tab) and ActivityTab. URL state
sync via useSearchParams so deep links survive reloads + browser
back/forward. The legacy /scep/intune route alias defaults the
activeTab to 'intune' on mount.
* web/src/main.tsx — new <Route path='scep' /> + preserved
<Route path='scep/intune' /> alias. Both render SCEPAdminPage.
* web/src/components/Layout.tsx — nav link rebranded:
label 'SCEP Intune' → 'SCEP Admin', to '/scep/intune' → '/scep'.
Frontend tests (20 — full rebuild):
* Admin gate (non-admin sees gated banner + zero admin API calls)
* Profiles tab default + Intune tab tabswitch + ?tab=intune deep
link + legacy /scep/intune alias all land on Intune
* Profiles tab status badges (Intune + mTLS + challenge-set)
reflect each profile's flags
* RA cert expiry tone bands (good ≥30d / warn 7-30d / bad <7d /
EXPIRED) verified across three fixture profiles
* 'View Intune details →' only renders for Intune-enabled
profiles AND switches tabs on click
* Empty-state banner when no profiles configured
* Intune tab counters render with the existing Phase 9 deep-dive
shape; reload modal Open/Confirm/Cancel/Error paths all pinned
* Recent Activity tab merges all four SCEP audit actions across
four parallel useQuery calls; filter chips
(all/initial/renewal/intune/static) narrow correctly
* Error path surfaces ErrorState on the active tab
Docs:
* docs/scep-intune.md — Operational monitoring section heading
expanded to '(SCEP Administration → Intune Monitoring tab)'.
Page-surface description rewritten for the tabbed shape;
admin-endpoints list extended with the new /admin/scep/profiles
entry.
* docs/architecture.md — Microsoft Intune Connector trust anchor
subsection updated to reference the Intune Monitoring tab inside
the SCEP Administration page + lists all three admin endpoints.
* docs/legacy-est-scep.md — forward-ref expanded with a parallel
sentence for the per-profile observability surface (independent
of Intune).
* README.md — Enrollment Protocols bullet for Intune updated to
'admin GUI SCEP Administration page at /scep' with the three
tabs called out.
Verification:
* gofmt clean on touched files
* go vet ./... clean
* staticcheck on intune+service+handler+router+cmd-server clean
* go test -short across intune+service+handler+router+cmd-server:
all green (existing Phase 9 tests + new Profiles tests)
* Frontend tsc --noEmit clean
* Vitest: 20/20 SCEPAdminPage tests + 3/3 sibling AuditPage tests
pass
* G-3 docs-drift CI guard reproduced locally: clean (no new env
vars; existing CERTCTL_SCEP_ allowlist prefix covers everything)
* M-009 hard-zero useMutation guard reproduced locally: clean
(the existing reload mutation already used useTrackedMutation
from the Phase 9 follow-up commit
|
||
|
|
77e0281a0e |
feat(scep-intune): GUI monitoring tab + admin endpoints
Phase 9 of the SCEP RFC 8894 + Intune master bundle. Lands the operator-
facing Intune Monitoring tab plus the two admin-gated endpoints it reads
from. Per the constitutional 'complete path' rule: counters tick on
every typed dispatcher branch, the GUI poll is live (30s for stats,
60s for the audit log filter), and the SIGHUP-equivalent reload action
is one click + a confirmation modal — no follow-up plumbing required.
Backend (Phase 9.1 + 9.2 + 9.3):
* internal/service/scep.go gains:
- intuneCounterTab — atomic per-status counters keyed by the same
labels intuneFailReason() emits (success / signature_invalid /
expired / not_yet_valid / wrong_audience / replay / rate_limited /
claim_mismatch / compliance_failed / malformed / unknown_version).
Lock-free on the dispatcher hot path; snapshot() returns a
zero-allocation map for the admin endpoint.
- dispatchIntuneChallenge wires intuneCounters.inc(...) on every
typed return path INCLUDING the success leg (credited before
processEnrollment so a downstream issuer-connector failure
doesn't double-count).
- SetPathID + PathID accessors (so admin rows surface the SCEP
profile path ID per row).
- IntuneStatsSnapshot + IntuneTrustAnchorInfo public types, plus
IntuneStats(now) accessor that walks the trust holder pool and
packages a per-profile snapshot. ReloadIntuneTrust() is the
typed wrapper around TrustAnchorHolder.Reload that returns
ErrSCEPProfileIntuneDisabled when called on a profile where
Intune isn't enabled (admin endpoint maps that to HTTP 409).
* internal/api/handler/admin_scep_intune.go:
- AdminSCEPIntuneService narrow interface (Stats + ReloadTrust)
so the handler depends on a small surface; AdminSCEPIntuneServiceImpl
is the production walker over the per-profile SCEPService map.
- AdminSCEPIntuneHandler.Stats handles GET /api/v1/admin/scep/intune/stats
with the M-008 admin gate (non-admin → 403 + service never
invoked); returns {profiles, profile_count, generated_at}.
- AdminSCEPIntuneHandler.ReloadTrust handles POST
/api/v1/admin/scep/intune/reload-trust. Body is {path_id: '<id>'};
empty body targets the legacy /scep root profile. Returns 200 on
success / 404 on unknown PathID / 409 when the profile is Intune-
disabled / 500 on a parse error from intune.LoadTrustAnchor (the
holder retains its previous pool — fail-safe). 400 on malformed
JSON.
- ErrAdminSCEPProfileNotFound typed error so the handler can
distinguish 'wrong profile' from 'broken file'.
* internal/api/router/router.go: HandlerRegistry gains
AdminSCEPIntune; both routes registered as bearer-auth-required
(the admin-gate is at the handler layer per the M-008 pattern).
* cmd/server/main.go: declares scepServices map[string]*service.SCEPService
BEFORE HandlerRegistry construction so the same map can be referenced
from both the admin handler (constructed early) and the SCEP startup
loop (which populates it later by reference). The per-profile loop
now calls scepService.SetPathID(profile.PathID) and stores the service
pointer into the shared map. AdminSCEPIntune handler is constructed
at the same time as AdminCRLCache.
* internal/api/handler/m008_admin_gate_test.go: AdminGatedHandlers
map gains 'admin_scep_intune.go' with a one-line justification —
the regression scanner enforces the per-handler test triplet
(TestAdminSCEPIntune_NonAdmin_Returns403 + _AdminExplicitFalse_Returns403
+ _AdminPermitted_ForwardsActor) plus their POST siblings for
ReloadTrust.
* api/openapi.yaml: documents both endpoints with request body /
response shape / error mapping; openapi-parity-test now matches
the registered routes.
Frontend (Phase 9.4):
* web/src/pages/SCEPAdminPage.tsx — single-page Intune Monitoring
surface:
- Per-profile cards (one card per SCEP profile). Enabled profiles
get the full counter grid + trust-anchor-expiry badge tone
(good ≥30d / warn 7-30d / bad <7d / EXPIRED). Disabled profiles
get an off-state pill with the env-var hint to opt in.
- Counters polled every 30s via TanStack Query against
GET /admin/scep/intune/stats.
- Recent failures table (last 50) populated from the audit log
filtered to action=scep_pkcsreq_intune AND scep_renewalreq_intune;
merged + sorted by timestamp descending. Polled every 60s.
- Reload trust anchor button per profile + confirmation modal that
explains the SIGHUP equivalence and the fail-safe behavior.
onConfirm runs a TanStack mutation, refetches the stats query
on success, surfaces the underlying error (eg 'trust anchor
cert expired') in the modal on failure (modal stays open so
operator can retry).
- Admin gate: when authRequired && !admin the page renders an
'Admin access required' banner and the underlying admin API
requests are never issued (React Query enabled flag gated on
auth.admin) — server-side enforcement is M-008.
* web/src/api/types.ts: IntuneStatsSnapshot + IntuneTrustAnchorInfo +
IntuneStatsResponse + IntuneReloadTrustResponse.
* web/src/api/client.ts: getAdminSCEPIntuneStats +
reloadAdminSCEPIntuneTrust(pathID).
* web/src/main.tsx: new route /scep/intune. The route is unconditional;
the gating is at the page level so deep-links land cleanly.
* web/src/components/Layout.tsx: 'SCEP Intune' nav link between
Observability and Audit Trail with the appropriate sidebar icon.
Tests (Phase 9.5):
* internal/api/handler/admin_scep_intune_test.go (16 tests):
- M-008 admin-gate triplet for both Stats (GET) and ReloadTrust
(POST): NonAdmin / AdminExplicitFalse / AdminPermitted.
- Method-gate tests (Stats rejects POST, ReloadTrust rejects GET).
- Stats propagates service errors as 500.
- ReloadTrust maps ErrAdminSCEPProfileNotFound→404,
ErrSCEPProfileIntuneDisabled→409, generic err→500.
- Empty body targets legacy root PathID.
- Malformed JSON→400.
- AdminSCEPIntuneServiceImpl handles nil map + unknown PathID.
* web/src/pages/SCEPAdminPage.test.tsx (13 tests):
- Admin gate (non-admin sees gated banner + zero admin API calls;
admin sees the page; no-auth dev mode also passes).
- Profile rendering (counters with correct labels, expiry badge
tone for ≥30d / EXPIRED states, off-state pill for disabled
profiles, empty-state banner when no profiles configured).
- Reload modal (opens on click, calls mutation on Confirm,
keeps modal open + shows error on failure, Cancel skips
mutation).
- Error path renders ErrorState with retry.
- Audit log filter merges PKCSReq + RenewalReq events and sorts
descending.
Verification:
* gofmt clean on touched files
* go vet ./... clean
* staticcheck on intune/service/api/cmd-server clean
* go test -short across api+service+intune+cmd-server: all green
* web tsc --noEmit clean
* Vitest: SCEPAdminPage.test.tsx 13/13 + sibling page suites all
pass
* G-3 docs-drift CI guard: Phase 9 adds no new CERTCTL_* env vars
so the guard does not fire
* openapi-parity-test green (both new admin endpoints documented)
* M-008 regression scanner enforces the per-handler test triplet —
pin updated, all triplets present
Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 9
cowork/scep-rfc8894-intune/progress.md
|
||
|
|
a4df1f86ae |
crl/ocsp: admin observability endpoint + Phase 6 e2e scaffold
Phase 5 (admin endpoint slice) + Phase 6 (e2e test stub) of the
CRL/OCSP responder bundle. Closes the deferred items from the
backend-slice merge (
|
||
|
|
dc1e0bfbaa |
crl/ocsp: POST OCSP endpoint (RFC 6960 §A.1.1) + cache integration
Phase 4 (final phase) of the CRL/OCSP responder bundle. Closes the
backend slice; HTTP layer is now production-ready for relying parties.
What landed:
* POST /.well-known/pki/ocsp/{issuer_id} (handler.HandleOCSPPost)
- Accepts binary application/ocsp-request body per RFC 6960 §A.1.1
- Tolerant of missing Content-Type (some clients omit); validates
via ocsp.ParseRequest, returns 400 on malformed
- Returns 415 on explicit wrong Content-Type
- Reuses the existing service path (h.svc.GetOCSPResponse) — the
only new logic is body decoding + serial-from-OCSPRequest extraction
- GET form preserved unchanged for ad-hoc curl + human URL paths
- Auth-exempt under /.well-known/pki/ prefix (already in
AuthExemptDispatchPrefixes — no router changes for that)
- 7 new tests: success, method-not-allowed, wrong content-type,
missing content-type accepted, malformed body, missing issuer,
service error propagation
* router.go: r.Register("POST /.well-known/pki/ocsp/{issuer_id}", ...)
* CertificateService.GenerateDERCRL — cache-aware:
- New SetCRLCacheSvc(svc) setter (matches existing SetCAOperationsSvc
pattern — optional dep)
- When wired, GenerateDERCRL calls crlCacheSvc.Get → cheap DB read
on cache hit, singleflight-coalesced regen on miss
- When unwired, falls back to historical caSvc.GenerateDERCRL path
- GET /.well-known/pki/crl/{issuer_id} handler unchanged — calls
the same service method, gets cache benefit transparently when
the cache service is wired in cmd/server/main.go
Coverage: handler 79.8% (floor 75), service unchanged, scheduler 78%.
What's deferred (intentional scope cut for this session):
* cmd/server/main.go wiring of CRLCacheService + responder service
setters into the local issuer factory + scheduler. The wiring is
mechanical (NewCRLCacheService + scheduler.SetCRLCacheService call
in the existing wiring block); deferring keeps this commit focused
on the responder + cache primitives. Operator can wire when ready.
* Phase 5 (GUI), Phase 6 (e2e test against kind), Phase 7 (release
prep) — separate follow-up sessions.
* OCSP cache integration: today's GET/POST OCSP path goes through
the on-demand SignOCSPResponse (already cheap with the dedicated
responder cert from Phase 2). A cached-OCSP path is V3-Pro polish.
The bundle's V2 backend slice (Phases 0-4) is complete. All 4 phases
shipped 4 commits + 1 amend on this branch. CI will validate the
testcontainers repository tests on push.
|
||
|
|
f0865bb051 |
fix(api,web,mcp): add bulk-renew + bulk-reassign endpoints, drop client-side N×HTTP loops (L-1 master)
Two audit findings, both category cat-l, both rooted in
web/src/pages/CertificatesPage.tsx. Pre-L-1 the GUI looped per-cert
HTTP calls — 100 selected certs = 100 sequential round-trips × ~50–200
ms each = a 5–20-second wedge during which the operator stared at a
progress bar. Post-L-1 each workflow is a single POST.
cat-l-fa0c1ac07ab5 [P1, primary] — bulk renew loop
handleBulkRenewal: for/await triggerRenewal(id)
cat-l-8a1fb258a38a [P2] — bulk reassign loop
handleReassign: for/await updateCertificate(id, {owner_id})
The bulk-revoke endpoint (POST /api/v1/certificates/bulk-revoke +
BulkRevocationCriteria/Result) already existed as the canonical shape
in v2.0.x — L-1 ports that pattern to renew + reassign with per-action
twists.
Backend (Go)
- internal/domain/bulk_renewal.go: BulkRenewalCriteria mirrors
BulkRevocationCriteria (criteria + IDs modes); BulkRenewalResult
envelope adds EnqueuedJobs[] for per-cert {certificate_id, job_id};
shared BulkOperationError type for all bulk paths.
- internal/domain/bulk_reassignment.go: narrower shape — IDs-only,
owner_id required, team_id optional.
- internal/service/bulk_renewal.go::BulkRenewalService.BulkRenew:
resolves criteria → status filter (Archived/Revoked/Expired/
RenewalInProgress all silent-skip) → per-cert status flip + job
create. Keygen-mode-aware so jobs land in the same initial status
as single-cert TriggerRenewal. Single bulk audit event per call,
not N.
- internal/service/bulk_reassignment.go::BulkReassignmentService.
BulkReassign: validates owner_id upfront via the
ErrBulkReassignOwnerNotFound typed sentinel — non-existent owner
returns 400 before any cert is touched. Already-owned-by-target
is silent-skip. Single bulk audit event.
- internal/api/handler/{bulk_renewal,bulk_reassignment}.go: HTTP
shape mirrors bulk_revocation.go. NOT admin-gated (renew is non-
destructive; reassign is a common-case workflow). Sentinel-error
→ 400 mapping for OwnerNotFound.
- internal/api/router/router.go: three bulk-* routes registered as a
block before the {id} routes. HandlerRegistry gains BulkRenewal +
BulkReassignment fields.
- cmd/server/main.go: NewBulkRenewalService threads cfg.Keygen.Mode
so bulk-renew jobs land in same initial state as single-cert path.
Frontend
- web/src/api/client.ts: bulkRenewCertificates(criteria) +
bulkReassignCertificates(request) functions with full TS types.
- web/src/pages/CertificatesPage.tsx: handleBulkRenewal + handleReassign
rewritten from N-call loops to single calls. Result envelope drives
progress UI; first-error message surfaced when total_failed > 0.
Stale triggerRenewal + updateCertificate imports removed.
MCP
- internal/mcp/types.go: BulkRenewCertificatesInput +
BulkReassignCertificatesInput.
- internal/mcp/tools.go: certctl_bulk_renew_certificates +
certctl_bulk_reassign_certificates tools mirroring the existing
certctl_bulk_revoke_certificates pattern.
OpenAPI
- api/openapi.yaml: two new operations (bulkRenewCertificates,
bulkReassignCertificates) under Certificates tag. Four new schemas
(BulkRenewRequest, BulkRenewResult, BulkEnqueuedJob,
BulkReassignRequest, BulkReassignResult).
Tests
- Domain: BulkRenewalCriteria.IsEmpty + BulkReassignmentRequest.IsEmpty
IsEmpty contracts; JSON round-trip shape pinning.
- Service: 7 BulkRenew tests (happy/criteria-mode/skips-RenewalInProgress/
skips-revoked-archived/empty-criteria-error/partial-failure/
audit-event-emitted) + 8 BulkReassign tests (happy/skips-already-
owned/owner-required/empty-IDs/owner-not-found-sentinel/team-id-
optional/team-id-provided/partial-failure/audit-event-emitted).
- Handler: 5 BulkRenew handler tests (happy/empty-body-400/wrong-
method-405/actor-attribution/service-error-500) + 6 BulkReassign
handler tests (happy/empty-IDs-400/missing-owner-400/owner-not-
found-400-via-sentinel/wrong-method-405/generic-error-500).
CI guardrail
- .github/workflows/ci.yml: 'Forbidden client-side bulk-action loop
regression guard (L-1)'. Greps web/src/pages/CertificatesPage.tsx
for 'for(...) await triggerRenewal(...)' and 'for(...) await
updateCertificate(...)' patterns; comment lines exempt; test files
exempt. Verified locally (passes against post-fix tree, fires
against synthetic regression).
Counts (deltas)
- Routes: 119 → 121 (+2)
- OpenAPI operations: 123 → 125 (+2)
- MCP tools: 83 → 85 (+2)
Performance
- 100-cert bulk-renew: ~10s of sequential HTTP → ~100ms (99% latency
reduction on the canonical operator workflow).
- Audit event volume: 1 + N per operation → 1.
Out of scope (deferred follow-ups)
- cat-b-31ceb6aaa9f1: updateOwner/updateTeam/updateAgentGroup orphan
(different shape — wire existing PUT to GUI, not new bulk endpoint).
- cat-k-e85d1099b2d7: CertificatesPage no pagination UI.
- cat-i-b0924b6675f8: MCP missing claim/dismiss/acknowledge (L-1 added
2 new tools but does not close that finding).
Verification
- go build / vet / test -short / test -short -race all clean.
- web tsc --noEmit + vitest run all clean (296 tests passing).
- OpenAPI YAML parses (89 paths, 125 ops).
- L-1 CI guardrail passes against post-fix tree, fires against
synthetic regression.
No push.
|
||
|
|
9dc0742e77 |
fix(web): close StatusBadge enum drift + Certificate TS phantom fields (D-1 master)
Five audit findings, all category cat-d or cat-f, all rooted in two
frontend files. The dashboard silently lied:
cat-d-359e92c20cbf [P1, primary] — Agent: 'Stale' dead key + 'Degraded'
neutral fallthrough
cat-d-9f4c8e4a91f1 [P2] — Notification: 'dead' missing
cat-d-1447e04732e7 [P3] — Cert: 'PendingIssuance' dead key
cat-f-cert_detail_page_key_render_fallback [P2] — render-site reads
cert.key_algorithm directly
cat-f-ae0d06b6588f [P2] — Certificate TS phantom fields (root cause)
Pre-D-1, agents in the only Go AgentStatus that means 'needs operator
attention' (Degraded) rendered as default neutral grey because StatusBadge
mapped 'Stale' (a key Go has never emitted) to yellow. Dead-letter
notifications visually equated with 'read' (operator-acknowledged). The
Certificate badge map carried a 'PendingIssuance' key no Go enum emits.
CertificateDetailPage's Key Algorithm and Key Size rows always rendered
'—' even when the data was a single fetch away — the lookup went through
cert.key_algorithm / cert.key_size directly, both phantom Certificate TS
fields. Trim the TS type so the missing-data case is explicit; fix the
render site to use latestVersion?.field; pin the contract with a 38-case
Vitest property test that walks every Go enum.
StatusBadge (web/src/components/StatusBadge.tsx)
- Drop 'Stale' (Agent dead key) + 'PendingIssuance' (Cert dead key).
- Add 'Degraded' (Agent → badge-warning) + 'dead' (Notification → badge-danger).
- Add leading docblock naming Go-side source-of-truth file for every
status family and pointing at the property test as regression vector.
Property test (web/src/components/StatusBadge.test.tsx — 38 cases)
- Iterates every Go-emitted enum value (AgentStatus, CertificateStatus,
JobStatus, NotificationStatus, DiscoveryStatus, HealthStatus) plus the
two frontend-synthesized Enabled/Disabled labels, asserts every value
gets a non-default class (or an explicit 'badge badge-neutral' for the
five intentionally-neutral terminal values: Archived, Cancelled,
Dismissed, read, unknown).
- Negative assertions: 'Stale' and 'PendingIssuance' must fall through
to the dictionary default — re-adding either key surfaces here.
- Specific UX-correctness assertions: 'dead' → badge-danger,
'Degraded' → badge-warning.
- Unknown-status fallthrough preserves label text.
Certificate TS trim (web/src/api/types.ts)
- Drop serial_number?, fingerprint_sha256?, key_algorithm?, key_size?,
issued_at? from Certificate. Go's ManagedCertificate has never carried
these — they live on CertificateVersion. Post-trim a cert.X access for
any of the five fields is a TS compile error.
- Leading docblock cross-references the closure rationale and the
latestVersion fallback pattern.
Render-site fix (web/src/pages/CertificateDetailPage.tsx)
- Key Algorithm / Key Size rows now read latestVersion?.key_algorithm /
latestVersion?.key_size, mirroring the existing latestVersion fallback
used a few lines above for serial_number / fingerprint_sha256.
- The same edit also tightened the serial / fingerprint / issued_at
derivations to drop the now-impossible 'cert.X || latestVersion?.X'
cert-side leg (cert.serial_number is a TS error post-trim).
Type-test regression (web/src/api/types.test.ts)
- Certificate literal construction pinned post-trim — adding any of the
five fields back makes the literal an excess-property TS error.
- Sibling CertificateVersion literal pinning the trimmed fields still
live on the version envelope (so the CertificateDetailPage fallback
path can't break).
OpenAPI (api/openapi.yaml)
- ManagedCertificate schema unchanged — was already correct (no phantom
fields). Added a leading comment cross-referencing the D-5 closure for
future readers.
CI guardrail (.github/workflows/ci.yml)
- 'Forbidden StatusBadge dead-key + Certificate phantom-field regression
guard (D-1)'. Two grep blocks: catches Stale/PendingIssuance map
literals in StatusBadge.tsx; uses an awk-scoped window over the
'export interface Certificate {' block in types.ts to catch the five
phantom fields reappearing while explicitly excluding CertificateVersion
(which legitimately carries them). Comments + test files exempt.
Verification
- Backend build/vet/test -short -race all clean across handler/router/
middleware packages.
- Frontend tsc --noEmit clean.
- Vitest 256 → 296 tests (+40: 38 from new StatusBadge test, 2 from D-5
Certificate trim regression in types.test.ts).
- OpenAPI YAML parses (87 paths).
- Both CI guardrail patterns clear on the post-fix tree; both fire
against synthetic regression patterns (re-add Stale → fires; re-add
serial_number? to Certificate → fires).
Out of scope (deferred)
- diff-05x06-* type drifts for Agent/DeploymentTarget/Notification/
DiscoveredCertificate/Issuer TS interfaces. Per-type field-by-field
Go ↔ TS diff is codegen-shaped, not edit-shaped — warrants its own
D-2 master prompt. Noted in CHANGELOG follow-ups section.
|
||
|
|
a3d8b9c607 |
fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master)
GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build); postgres reported unhealthy indefinitely, dependent containers never started. Root cause: deploy/docker-compose.yml mounted a hand-curated subset of migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/. Postgres applied them at initdb time. Once seed.sql referenced columns added by migrations *after* the mounted cutoff (e.g., policy_rules.severity from migration 000013), initdb crashed mid-seed and the container loop wedged. Two sources of truth (compose mount list vs in-tree migration ladder) diverged the moment a seed-touching migration shipped, and the only thing that fixed it was hand-editing the compose file every release. Fix: remove the dual source. Postgres boots empty; the server applies migrations + seed at startup via RunMigrations + RunSeed. Helm has used this pattern since day one (postgres-init emptyDir); compose now matches. Bundled with four ride-along audit findings whose fixes share the same schema/db code surface, so operators take the schema-change pain only once: cat-u-seed_initdb_schema_drift [P1, primary] — initdb-mount fix cat-o-retry_interval_unit_mismatch [P1] — column rename minutes→seconds cat-o-notification_created_at_dead_field [P2] — add column + populate cat-o-health_check_column_orphans [P1] — drop unwired columns cat-u-no_version_endpoint [P2] — add /api/v1/version Single migration (000017_db_coupling_cleanup) bundles the three schema changes under a DO \$\$ guard so re-application is safe; reduces operator-visible 'schema-change releases' from four to one. Backend - internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed (gated by CERTCTL_DEMO_SEED). Both idempotent (ON CONFLICT DO NOTHING in every shipped INSERT) so repeated boots are safe; missing-file is no-op so custom packaging that strips seeds still boots cleanly. - cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set) immediately after RunMigrations. - internal/repository/postgres/notification.go: NotificationRepository.Create now sets created_at (with time.Now() fallback when caller leaves it zero); scanNotification reads it back; List + ListRetryEligible SELECT extended. - internal/repository/postgres/renewal_policy.go: column references updated to retry_interval_seconds across SELECT/INSERT/UPDATE sites. - internal/api/handler/version.go: new VersionHandler exposes {version, commit, modified, build_time, go_version} from runtime/debug.ReadBuildInfo() with ldflags-supplied Version override. - internal/api/router/router.go: register GET /api/v1/version through the no-auth chain (CORS + ContentType) alongside /health, /ready, /api/v1/auth/info. - cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit ExcludePaths so rollout polling doesn't dominate the audit trail. - internal/config/config.go: add DatabaseConfig.DemoSeed + CERTCTL_DEMO_SEED env var. Migration - migrations/000017_db_coupling_cleanup.up.sql + .down.sql: (1) renewal_policies.retry_interval_minutes → retry_interval_seconds (DO \$\$ guard, idempotent re-application) (2) notification_events ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() (3) network_scan_targets DROP orphan health_check_enabled + health_check_interval_seconds - migrations/seed.sql: column reference updated to retry_interval_seconds. - migrations/seed_demo.sql: same column rename + applied at runtime now via RunDemoSeed (no longer initdb-mounted). Compose - deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files + seed.sql); add start_period: 30s to postgres + certctl-server healthchecks to absorb the runtime migration + seed application window on first boot. - deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount removed; that file never existed); same healthcheck start_period. - deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with CERTCTL_DEMO_SEED=true env var on certctl-server. Tests - internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo, TestVersion_RejectsNonGet, TestVersion_LdflagsOverride. - internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently, TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently, TestMigration000017_RetryIntervalRename, TestMigration000017_NotificationCreatedAt, TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips). - internal/repository/postgres/notification_test.go: TestNotificationRepository_CreatedAt_IsPersisted + TestNotificationRepository_CreatedAt_DefaultsToNow. CI guardrail - .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb (U-3)' step grep-fails the build if any migrations/*.sql or seed*.sql re-appears in /docker-entrypoint-initdb.d in any compose file. Catches future drift before a fresh-clone operator hits it. Spec / Docs - api/openapi.yaml: add /api/v1/version operation under Health tag. - docs/architecture.md: replace the 'initdb may run the same SQL' paragraph with a post-U-3 single-source-of-truth explanation. - CHANGELOG.md: full unreleased-section entry covering all 5 closures, breaking changes, and the new env var. Audit doc - coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14 cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to ✅ RESOLVED with closure prose pointing at this commit. Verification: build/vet/test -short -race all clean across all touched packages locally; govulncheck reports 0 vulnerabilities affecting our code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the post-fix tree. |
||
|
|
87213128cc |
fix(security,domain): redact Agent.APIKeyHash from JSON wire shape (G-2)
Pre-G-2 internal/domain/connector.go::Agent::APIKeyHash was tagged
`json:"api_key_hash"` and shipped on every wire surface that returned
domain.Agent — GET /api/v1/agents (PagedResponse{Data: agents}),
GET /api/v1/agents/{id}, GET /api/v1/agents/retired, and the
POST /api/v1/agents registration response. Every authenticated client
(browser, CLI --json, MCP tool calls) received the SHA-256-of-the-API-key
string. The browser silently dropped it because web/src/api/types.ts
omits the field, but CLI and MCP consumers print full JSON so the hash
was visible there. Even though the value is a hash and not the plaintext
key, shipping it gives an attacker an offline brute-force target if the
API-key entropy is low (certctl doesn't enforce a minimum on operator-
supplied keys), and there's no business reason for any client to ever
receive it — the value is server-internal, used only for the lookup at
internal/repository/postgres/agent.go::GetByAPIKey. (Audit:
cat-s5-apikey_leak in coverage-gap-audit-2026-04-24-v5/unified-audit.md.)
We chose the audit's recommended fix (json:"-") plus a defense-in-depth
MarshalJSON plus a CI guardrail. Three layers because struct-tag
redaction alone is one rebase away from being silently reverted, the
custom MarshalJSON catches the case where a parent struct embeds Agent
under a different tag, and the CI grep blocks reintroduction at the spec
or frontend boundary even without a code review catching it.
Files changed:
Phase 1 — Domain redaction:
- internal/domain/connector.go: APIKeyHash tag flipped from
`json:"api_key_hash"` to `json:"-"`. New Agent.MarshalJSON
with value receiver + type-alias-recursion-break that explicitly
zeroes APIKeyHash on the marshal-time copy. Long-form docblock
explaining the G-2 closure rationale + cross-references to
service.RegisterAgent (populator), repository.AgentRepository::
GetByAPIKey (consumer), docs/architecture.md (DB-shape vs
API-shape distinction), and the audit finding.
Phase 2 — Domain tests (5 test functions):
- internal/domain/connector_test.go: TestAgent_MarshalJSON_RedactsAPIKeyHash
pins the marshal-boundary contract on a value receiver. ...RedactsViaPointer
pins the *Agent path. ...RedactsInSlice pins the []Agent path that the
ListAgents handler actually emits via PagedResponse. ...DoesNotMutateReceiver
pins the by-value-receiver contract so a future refactor that switches
to pointer-receiver gets caught. ...RoundTrip pins the wire-shape
guarantee that APIKeyHash is dropped on encode and cannot reappear on
decode. Single sentinel value ("sha256:LEAKED-CREDENTIAL-DERIVATIVE-
SENTINEL") flows through every fixture for grep-ability on regression.
Phase 3 — Handler tests (4 test functions):
- internal/api/handler/agent_handler_test.go: TestListAgents_DoesNotLeakAPIKeyHash,
TestGetAgent_DoesNotLeakAPIKeyHash, TestRegisterAgent_DoesNotLeakAPIKeyHash,
TestListRetiredAgents_DoesNotLeakAPIKeyHash. Each asserts (a) the
literal substring "api_key_hash" is absent from the httptest-captured
body, (b) the leak sentinel value is absent, (c) the non-leaked fields
ARE present (sanity that the handler is serving real data, not just
empty payloads). Shared sentinel "sha256:LEAKED-CREDENTIAL-DERIVATIVE-
HANDLER-SENTINEL" so a single grep over a failing test's output
identifies the leak surface immediately.
Phase 4 — Spec / docs:
- api/openapi.yaml: api_key_hash property REMOVED from Agent schema
(was at line 3690). Inline G-2 comment naming the closure + the
database-vs-API-shape distinction so a future spec edit doesn't
silently re-introduce the field.
- docs/architecture.md: ER-diagram block already documents the agents
table including api_key_hash (DB shape — correct). Added a sibling
note paragraph immediately below the diagram explaining that several
columns are intentionally server-internal (api_key_hash redaction
+ issuers.config / deployment_targets.config encrypted shadow), with
cross-references to the redaction enforcement site, the OpenAPI
schema, the frontend interface, and the CI guardrail.
- web/src/api/types.ts: Agent interface unchanged in shape (already
omitted the field) but added a leading comment block explaining
WHY the omission is intentional — stops a future frontend dev from
"completing" the interface from the OpenAPI spec or the Go struct.
Phase 5 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden api_key_hash JSON-shape
regression guard (G-2)" step. Scoped patterns catch the actual
regression shapes — Go struct tag (json:"api_key_hash"), frontend
interface declaration, OpenAPI schema property, YAML enum/array
membership. Repository / migration / seed / service / integration /
unit-test / comment lines exempt. Verified locally on the real tree
(passes) and against 4 synthetic regression patterns (each fires
the guardrail). Mirrors the G-1 pattern from .github/workflows/
ci.yml lines 47-108.
Phase 5b — Sweep verification (no changes, results documented for the
next reader):
- internal/api/middleware/audit.go: doesn't serialize Agent struct;
records request body only. No leak.
- service.RegisterAgent audit-event payload: `map[string]interface{}{
"name": name, "hostname": hostname}` — name + hostname only,
no APIKeyHash. No leak.
- All 9 slog sites that mention agent: scalar attrs only ("agent_id",
"error", "agent_hostname"), never the full struct. No leak.
- internal/mcp, internal/cli, cmd/cli, cmd/mcp-server: zero matches
for APIKeyHash / api_key_hash. Both pass server JSON verbatim, so
the wire-side fix transitively closes them.
Verification (all gates pass):
- go build ./...
- go vet ./...
- go test -short ./... — every package green
- go test -short -race ./internal/domain/... ./internal/api/handler/... — clean
- govulncheck ./... — no vulnerabilities in our code
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds
- python3 yaml.safe_load on api/openapi.yaml — parses
- OpenAPI Agent schema scan: no api_key_hash property
- CI guardrail mirror: clean on real tree, fires on all 4 synthetic
regression patterns
- Domain pkg coverage: Agent.MarshalJSON 100%, connector.go total 87.5%
- Handler pkg coverage: 79.2%
Sample response body (httptest captured during verification, GET
/api/v1/agents/{id} via the new handler test):
{"id":"agent-demo","name":"demo-agent","hostname":"demo.host",
"status":"Online","last_heartbeat_at":"2026-04-24T11:59:30Z",
"registered_at":"2026-04-24T12:00:00Z","os":"linux",
"architecture":"amd64","ip_address":"10.0.0.42",
"version":"v2.0.49"}
Note the absence of any api_key_hash key, even though the in-memory
struct passed to the handler had APIKeyHash set to a sentinel.
Out of scope (intentionally untouched):
- internal/repository/postgres/agent.go SELECT/INSERT/UPDATE/scan
paths and GetByAPIKey lookup — DB column stays, repo still
populates the struct, auth lookup still works. The redaction is a
marshal-boundary concern.
- migrations/000001_initial_schema.up.sql + migrations/seed_*.sql —
DB schema and seed data unchanged.
- internal/service/agent.go::RegisterAgent — service-side hashing
and persistence unchanged.
- Other domain types with potential credential-derivative fields
(Issuer.Config, DeploymentTarget.Config, notifier configs). Not
flagged by the audit; some are already protected (e.g.,
DeploymentTarget.EncryptedConfig []byte `json:"-"`). File a
separate audit pass if recon surfaces additional leaks.
- Per-resource DTO layer across every handler. Single audit
finding, single domain type.
- A separate possible follow-up: the v2 RegisterAgent endpoint
doesn't return the plaintext API key to the agent, which may
mean self-bootstrap via POST /api/v1/agents is broken. Verified
during recon; out of scope for G-2; should be its own ticket.
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-s5-apikey_leak
Audit recommendation: 'json:"-" or API-response DTO
excluding APIKeyHash' — went with the json:"-" + MarshalJSON
defense-in-depth pair plus CI guardrail and structural docs.
|
||
|
|
9c1d446e40 |
fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1)
The pre-G-1 config validator accepted CERTCTL_AUTH_TYPE=jwt and the
startup log faithfully echoed 'authentication enabled type=jwt'.
Reasonable people read that and concluded JWT auth was on. It wasn't.
The auth-middleware wiring at cmd/server/main.go unconditionally routed
every request through the api-key bearer middleware regardless of
cfg.Auth.Type. So CERTCTL_AUTH_TYPE=jwt quietly compared the incoming
'Authorization: Bearer <token>' against whatever string the operator put
in CERTCTL_AUTH_SECRET — real JWT clients got 401, and operators who
treated CERTCTL_AUTH_SECRET as a *signing* secret (because they thought
they were configuring JWT) had effectively handed an attacker an api-key.
A security finding masquerading as a config option.
We chose the audit-recommended structural fix: remove the option, fail
fast at startup, and add the gateway-fronting pattern as the documented
forward path. Implementing JWT middleware would have meant jwks vs
static-secret rotation, claim mapping, expiry enforcement, audience and
issuer validation, key rollover semantics, and regression coverage at the
same depth as the existing api-key path — a feature, not a fix. Operators
who genuinely need JWT/OIDC front certctl with an authenticating gateway
(oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium /
Authelia) and run the upstream certctl with CERTCTL_AUTH_TYPE=none. Same
shape works on docker-compose and Helm.
The change is comprehensive across 7 phases — every surface that
mentioned 'jwt' as a certctl-auth-type is updated, plus structural
backstops (typed enum, runtime guard, helm template validation, CI grep
guard) so the lie can't reappear.
Files changed:
Phase 1 — production code (typed enum + jwt removal):
- internal/config/config.go: AuthType typed alias + AuthTypeAPIKey /
AuthTypeNone constants + ValidAuthTypes() helper. Validate() routes
literal 'jwt' through a dedicated multi-line diagnostic naming the
authenticating-gateway pattern, then cross-checks against
ValidAuthTypes(). Secret-required branch simplified to api-key-only.
Field comment on AuthConfig.Type rewritten to drop jwt and point at
the gateway pattern.
- internal/api/middleware/middleware.go: AuthConfig.Type field comment
references the typed config.AuthType constants.
- internal/api/handler/health.go: same treatment for HealthHandler.AuthType.
- cmd/server/main.go: defense-in-depth runtime switch immediately after
config.Load() — exits 1 on any unsupported auth-type that bypassed the
validator. Auth-disabled startup log explicitly names the
authenticating-gateway pattern.
Phase 2 — tests (Red→Green, contract pinning):
- internal/config/config_test.go: TestValidate_JWTAuth_RejectedDedicated
(two table rows pinning the dedicated G-1 error fires regardless of
whether Secret is set), TestValidAuthTypesDoesNotContainJWT (property
guard against future re-introduction),
TestValidAuthTypesIsExactly_APIKey_None (allowed-set contract),
TestValidate_GenericInvalidAuthType (pins non-jwt invalid values still
hit the generic invalid-auth-type error). Removed the prior
TestValidate_JWTAuth_MissingSecret happy-path since its premise is
inverted post-G-1.
- internal/api/handler/health_test.go: removed
TestAuthInfo_ReturnsAuthType_JWT (which baked the silent-downgrade lie
into the regression suite). Pre-existing _APIKey test continues to
cover the api-key happy path.
Phase 3 — spec, docs, env templates:
- api/openapi.yaml: auth_type enum dropped to [api-key, none] with
inline comment naming the G-1 closure.
- .env.example (root): CERTCTL_AUTH_TYPE comment block rewritten to drop
jwt and point at the gateway pattern; secret-required conditional
simplified to api-key-only.
- docs/architecture.md: middleware-stack bullet rewritten to drop the
JWT mention; new H3 'Authenticating-gateway pattern (JWT, OIDC, mTLS)'
section explaining the design rationale and listing oauth2-proxy /
Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia / Caddy
forward_auth / Apache mod_auth_openidc / nginx auth_request as the
standard fronting options.
- docs/upgrade-to-v2-jwt-removal.md (new ~125 lines): migration guide
with preconditions, what-changes, both recovery paths, complete
docker-compose oauth2-proxy walkthrough, Traefik ForwardAuth and Envoy
ext_authz patterns, rollback posture.
Phase 4 — Helm chart (template validation + docs):
- deploy/helm/certctl/templates/_helpers.tpl: new certctl.validateAuthType
helper mirroring the existing certctl.tls.required pattern. Fails
template render on any server.auth.type outside {api-key, none} with
a multi-line diagnostic.
- deploy/helm/certctl/templates/server-deployment.yaml,
server-configmap.yaml, server-secret.yaml: invoke the helper at the
top of each template that depends on .Values.server.auth.type.
- deploy/helm/certctl/values.yaml: auth: block comment expanded with the
G-1 rationale and gateway-pattern cross-reference.
- deploy/helm/CHART_SUMMARY.md: server.auth.type table row now surfaces
the allowed set and points at the upgrade doc.
- deploy/helm/certctl/README.md: new 'JWT / OIDC via authenticating
gateway' section with a Kubernetes-flavored oauth2-proxy + certctl
walkthrough.
Phase 5 — release surface:
- CHANGELOG.md: new [unreleased] top entry with Breaking / Removed /
Added / Changed sections; explicit pointer at
docs/upgrade-to-v2-jwt-removal.md from the Breaking subsection.
Phase 6 — CI guardrail:
- .github/workflows/ci.yml: new 'Forbidden auth-type literal regression
guard (G-1)' step. Scoped patterns catch the actual regression shapes
(map literal, slice literal, switch case, OpenAPI enum, env-file
default, AuthType('jwt') cast). Comments and the dedicated rejection
branch are intentionally exempt; connector-package JWT references
(Google OAuth2 / step-ca) are exempt as out-of-scope external
protocols. Verified locally: the guard passes on the actual tree and
fires on all 4 synthetic regression patterns.
Out of scope (explicitly untouched):
- internal/connector/discovery/gcpsm/gcpsm.go — Google OAuth2 service-
account JWT (external protocol).
- internal/connector/issuer/googlecas/googlecas.go — same.
- internal/connector/issuer/stepca/stepca.go — step-ca's provisioner
one-time-token JWT for /sign API.
- docs/test-env.md, docs/connectors.md, docs/features.md — describe
external CAs' use of JWT, not certctl's auth shape.
- Implementing actual JWT middleware. Feature, not a fix.
Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go test -short ./... — every package green
- go test -short -race ./internal/config/... ./internal/api/... — clean
- govulncheck ./... — no vulnerabilities in our code
- helm lint deploy/helm/certctl/ — clean
- helm template with auth.type=api-key — renders OK
- helm template with auth.type=none — renders OK
- helm template with auth.type=jwt — fails with validateAuthType
diagnostic (exit 1)
- python3 yaml.safe_load on api/openapi.yaml — parses
- CI guardrail mirror — clean on real tree, fires on all 4 synthetic
regression patterns
- Smoke test: 'CERTCTL_AUTH_TYPE=jwt ./certctl-server' exits non-zero
with: 'Failed to load configuration: CERTCTL_AUTH_TYPE=jwt is no
longer accepted (G-1 silent auth downgrade): no JWT middleware ships
with certctl. To use JWT/OIDC, run an authenticating gateway
(oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium) in
front of certctl and set CERTCTL_AUTH_TYPE=none on the upstream.
See docs/architecture.md "Authenticating-gateway pattern" and
docs/upgrade-to-v2-jwt-removal.md for the migration walkthrough'
config pkg coverage: ValidAuthTypes 100%, Validate 94.7%, total 75.5%.
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-g-jwt_silent_auth_downgrade
Audit recommendation followed verbatim: 'Remove jwt from
validAuthTypes until middleware ships'.
|
||
|
|
9834b4e4a4 |
G-1: renewal-policies API + frontend FK-drift fix
Three frontend call sites (OnboardingWizard.tsx:603, CertificatesPage.tsx:52,
CertificateDetailPage.tsx:169) populated the renewal_policy_id dropdown from
getPolicies() — the compliance-rule endpoint returning pol-* IDs — which
violated the FK managed_certificates.renewal_policy_id REFERENCES
renewal_policies(id) ON DELETE RESTRICT. Create would fail pg 23503 at insert.
Backend (new):
- RenewalPolicyRepository CRUD + ListAll/ExistsByID (pg 23503 → ErrRenewalPolicyInUse
→ HTTP 409; pg 23505 → ErrRenewalPolicyDuplicateName → HTTP 409)
- RenewalPolicyService with repo-only constructor. Service sentinels
var-alias the repo sentinels so errors.Is walks across layers.
- RenewalPolicyHandler with validation bounds: name 1–255;
renewal_window_days [1,365] default 30; max_retries [0,10] not defaulted;
retry_interval_seconds [60,86400] default 3600; alert_thresholds_days
[0,365] default [30,14,7,0]. Auto-generated IDs rp-<slug(name)>.
- Router registers 5 routes under /api/v1/renewal-policies[/{id}].
Frontend:
- CertificatesPage/CertificateDetailPage/OnboardingWizard now call
getRenewalPolicies() and render rp-* IDs.
- client.ts adds getRenewalPolicies/createRenewalPolicy/updateRenewalPolicy/
deleteRenewalPolicy. types.ts adds the RenewalPolicy shape.
OpenAPI: RenewalPolicies tag + 5 operations + 3 schemas (RenewalPolicy,
RenewalPolicyCreateRequest, RenewalPolicyUpdateRequest). 409 responses
on create/update duplicate-name and delete FK-in-use.
No migration — renewal_policies table already exists from the initial
schema (000001).
Tests:
- internal/service/renewal_policy_test.go: CRUD + validation + sentinel
error wrapping.
- internal/api/handler/renewal_policy_handler_test.go: handler endpoint
contracts including 400/404/409.
- web/src/api/client.test.ts: 4 subtests covering the 4 new API functions.
Phase 3 gates all green: go vet, build, short tests, race tests (service/
handler/router/scheduler), staticcheck (G-1 packages), govulncheck (0
reachable), coverage (service 69.7%, handler 79.0%, domain 86.9%,
middleware 80.6% — all above thresholds), tsc, vitest (256 passed),
vite build, OpenAPI structural validation.
|
||
|
|
52248be717 |
v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP
Breaking change release. Plaintext HTTP listener removed. The certctl control plane now terminates TLS 1.3 on :8443 via http.Server.ListenAndServeTLS. No CERTCTL_TLS_ENABLED=false escape hatch. No dual-listener mode. One-step cutover per docs/upgrade-to-tls.md. Server - cmd/server/tls.go: certHolder with SIGHUP hot-reload + atomic cert swap, buildServerTLSConfig (TLS 1.3 min, GetCertificate callback), preflightServerTLS validation - cmd/server/main.go: ListenAndServeTLS in place of ListenAndServe, watchSIGHUP wiring, cert/key path config threading - tls_test.go: 418-line regression coverage of reload, preflight, callback behavior, SAN validation Config - CERTCTL_TLS_CERT_PATH / CERTCTL_TLS_KEY_PATH (required) - Plaintext rejection: agents/CLI/MCP pre-flight-fail on http:// URLs with a pointer to docs/upgrade-to-tls.md Agents, CLI, MCP - All three pre-flight-reject http:// URLs with fail-loud diagnostic - CERTCTL_SERVER_CA_BUNDLE_PATH for private-CA trust - CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY for dev-only bypass (loud warning on startup) - install-agent.sh emits both vars as commented template lines docker-compose - certctl-tls-init sidecar generates SAN-valid self-signed cert into deploy/test/certs/ on first boot - All demo-stack curls pin against ca.crt with --cacert Helm chart - Three TLS provisioning modes, exactly one required: - server.tls.existingSecret (operator-supplied) - server.tls.certManager.enabled (cert-manager integration) - server.tls.selfSigned.enabled (eval only — not for production) - server-certificate.yaml template for cert-manager mode - helm install without a TLS source fails at template render with a pointer to docs/tls.md CI - .github/workflows/ci.yml Helm Chart Validation step renders the chart in both existingSecret and cert-manager modes, plus an inverse guard-regression test that asserts helm template MUST refuse to render when no TLS source is configured. Previously the single `helm template` invocation hit the certctl.tls.required fail-loud guard and exit-1'd CI. Four invocations now: lint (existingSecret), template (existingSecret), template (cert-manager), template (no args — must fail). Integration tests - deploy/test/integration_test.go stands up the Compose stack over HTTPS, extracts the CA bundle, and exercises every certctl API over https://localhost:8443 - All 34 integration subtests green (per Phase 8 local CI-parity) Documentation - New: docs/tls.md (provisioning patterns, rotation, SIGHUP reload) - New: docs/upgrade-to-tls.md (one-step cutover, no-downgrade warnings, fleet-roll sequencing) - CHANGELOG.md: v2.2.0 "HTTPS Everywhere — The Irony" entry (file heading unchanged; release tag is v2.0.47) - All curls in docs/, examples/, deploy/helm/ guides use https://localhost:8443 --cacert Verification - grep -rn "ListenAndServe[^T]" cmd/ internal/ → 0 hits - grep -rn "\"http://" cmd/ internal/ → 2 benign hits (Caddy admin API default, SSRF doc comment) — zero certctl endpoints - Tasks #197–#206 (Phases 0–8) all closed in the tracker Files: 65 changed, 3489 insertions, 372 deletions (pre-CI-fix). |
||
|
|
675b87ba63 |
I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.
Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
(next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
Dead rows clear next_retry_at so the index stops matching them.
Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
exponential backoff capped at 1h (notifRetryBackoffCap) with
5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
(audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
status -> pending via notifRepo.Requeue.
Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.
StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
NewStatsService call sites (main.go + stats_test.go + 8 digest
tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
(reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
the same snapshot.
Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.
Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
with queryKey: ['notifications', activeTab] so switching tabs
doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
full-text title tooltip.
- Requeue mutation wrapped as
mutationFn: (id: string) => requeueNotification(id)
to prevent react-query v5's positional context argument from
leaking into the API client — pinned against future refactors
by strict-match toHaveBeenCalledWith('notif-dead-001') in
NotificationsPage.test.tsx:181.
Closes I-005.
|
||
|
|
0725713e19 |
Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.
Schema — migration 000015:
migrations/000015_agent_retire.up.sql flips
deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
boundary instead of quietly destroying targets. Both `agents` and
`deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
TEXT pair (TEXT not VARCHAR so operator comments are never
truncated), indexed via partial indexes WHERE retired_at IS NOT
NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
EXISTS) so repeated runs against partially-migrated databases
converge. migrations/000015_agent_retire.down.sql restores CASCADE
and drops the new columns for clean rollback. A dedicated
repository-layer testcontainers test
(internal/repository/postgres/migration_000015_test.go) asserts the
before/after FK action, column presence, index presence, and
round-trip idempotency under up→down→up.
Domain — sentinel guard + dependency counts:
internal/domain/connector.go gains IsRetired() on Agent, the
exported SentinelAgentIDs slice listing server-scanner,
cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
four reserved IDs documented in CLAUDE.md and created at startup in
cmd/server/main.go), IsSentinelAgent(id string) predicate,
AgentDependencyCounts{ActiveTargets, ActiveCertificates,
PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
ActorTypeSystem enum values used by audit emission downstream.
Coverage locked down by internal/domain/connector_test.go.
Service — 8-step ordered contract:
internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
opts{Force, Reason}) enforces a fixed execution order:
(1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
unconditionally; force=true does NOT bypass it.
(2) fetch — ErrAgentNotFound on miss.
(3) idempotency — if IsRetired() already, return
AgentRetirementResult{AlreadyRetired: true} with no new audit
event and no state change (safe to replay from flaky clients).
(4) preflight counts — collectAgentDependencyCounts runs
ActiveTargets, ActiveCertificates, PendingJobs sequentially
(not in parallel; keeps the per-query timeout predictable and
matches the repo's existing call-chain shape).
(5) force-reason guard — opts.Force=true with empty Reason returns
ErrForceReasonRequired (wired into the 400 status surface).
(6) dependency guard — HasDependencies() with opts.Force=false
returns BlockedByDependenciesError{Counts} (wired into the 409
body with per-bucket counts).
(7) mutation — single pinned retiredAt := time.Now(); agent
retirement first, then cascade target retirement if opts.Force,
all under the repo's single transaction so the two retired_at
stamps match to the second.
(8) best-effort audit — agent_retired always; agent_retirement_
cascaded additionally on the force path. Actor is whatever the
handler resolves from the request; actor type is mapped by
resolveActorType (system/agent-prefix→Agent/else→User). Audit
emission failures are logged via slog.Error but do not abort
the retirement (matches the house convention used by every
other scheduler-emitted event).
BlockedByDependenciesError implements Error() as
"active_targets=%d, active_certificates=%d, pending_jobs=%d" and
Unwrap() → ErrBlockedByDependencies. The single struct satisfies
errors.Is via Unwrap (used by scheduler-level tests) and errors.As
via the concrete type (used by the handler to fish out Counts for
the 409 body). ListRetiredAgents(page, perPage) adds a separate
paginated accessor with page<1→1 and perPage<1→50 normalization so
retired rows are queryable without polluting the default agent
listing.
Sentinel guard coverage is asymmetric by design: all four reserved
IDs are protected, and force=true cannot override. Regression tests
in internal/service/agent_retire_test.go assert each of the eight
steps in order, plus sentinel bypass attempts and idempotency
replay.
Handler + router — status-code surface:
internal/api/handler/agents.go:RetireAgent exposes seven status
codes on DELETE /agents/{id}:
200 on a fresh retirement (body echoes AgentRetirementResult).
204 on idempotent replay (AlreadyRetired=true; no new audit).
400 on ErrForceReasonRequired.
403 on ErrAgentIsSentinel.
404 on ErrAgentNotFound.
409 on BlockedByDependenciesError, with a custom body shape
{error, counts{active_targets, active_certificates,
pending_jobs}} that bypasses the default ErrorWithRequestID
envelope so callers get the per-bucket numbers directly.
500 on any other error.
Heartbeat HandleHeartbeat returns 410 Gone when the agent is
retired (ErrAgentRetired), signalling the agent to shut down.
Query params `force=true` and `reason=<text>` drive the cascade
path; both are forwarded as url.Values through the new MCP
transport.
internal/api/router/router.go registers GET /api/v1/agents/retired
literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
literal-beats-pattern-var precedence routes "retired" to the
paginated retired-agents listing instead of fetching a hypothetical
agent named "retired".
Agent binary — clean shutdown on 410:
cmd/agent/main.go gains the ErrAgentRetired sentinel, a
retiredOnce sync.Once, and a retiredSignal chan struct{}. A
markRetired(source, statusCode, body) helper closes the channel
exactly once; the Run() select loop observes the close and returns
ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
and exits cleanly instead of spinning in the heartbeat retry loop.
The 410 Gone surface is therefore terminal for the agent process.
MCP transport:
internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
a new additive transport method. Client.Delete is path-only; without
this method the retire tool would silently drop `force` and `reason`,
turning every cascade retire into a default soft-retire. The new
method shares do()'s 204 normalization and 4xx/5xx error
propagation so tool authors get one contract.
internal/mcp/tools.go + internal/mcp/types.go expose the
retire_agent tool with Force+Reason inputs wired through
DeleteWithQuery.
CLI:
cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
`agents list --retired` (client-side strip of --retired then
delegation to ListRetiredAgents, sharing --page/--per-page parsing
with the default listing) and `agents retire <id> [--force --reason
"…"]` (mirrors ErrForceReasonRequired — force without reason is
rejected client-side before the request is sent). JSON + table
output modes both honor the new columns.
Frontend:
web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
web/src/api/client.ts + web/src/api/types.ts expose the retire
endpoint and the retired-listing. 4 new Vitest regression cases.
OpenAPI:
api/openapi.yaml documents DELETE /agents/{id} with all seven
status codes, 410 on heartbeat, and the 409 per-bucket body shape.
Regression coverage (six new test files, all green):
internal/service/agent_retire_test.go — 8-step contract + sentinel guards
internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies
Files:
api/openapi.yaml — DELETE + 410 + 409 body shape
cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
cmd/cli/main.go — handleAgents list/get/retire dispatch
docs/architecture.md, docs/concepts.md,
docs/testing-guide.md — retirement contract narrative
internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
internal/api/handler/agent_handler_test.go — extended coverage
internal/api/handler/agent_retire_handler_test.go — new
internal/api/router/router.go — /agents/retired before /agents/{id}
internal/cli/agent_retire_test.go — new
internal/cli/client.go — ListRetiredAgents + RetireAgent
internal/domain/connector.go — IsRetired, SentinelAgentIDs,
IsSentinelAgent, AgentDependencyCounts,
ActorTypeAgent/System
internal/domain/connector_test.go — new
internal/integration/lifecycle_test.go — retirement fixture
internal/mcp/client.go — DeleteWithQuery additive transport
internal/mcp/retire_agent_test.go — new
internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
internal/repository/interfaces.go — AgentRepository retirement methods
internal/repository/postgres/agent.go — retire + cascade target retire + counts
internal/repository/postgres/migration_000015_test.go — new
internal/service/agent.go — wire into AgentService surface
internal/service/agent_retire.go — new 8-step contract
internal/service/agent_retire_test.go — new
internal/service/deployment.go — skip retired agents
internal/service/target.go — skip retired agents
internal/service/testutil_test.go — shared mocks extended
migrations/000015_agent_retire.up.sql — new
migrations/000015_agent_retire.down.sql — new
web/src/api/client.ts, types.ts + tests — retire endpoint wiring
web/src/pages/AgentsPage.tsx — retire UI
|
||
|
|
3287e174dc |
Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001)
Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006)
on top of the C-001/C-002 ownership + agent-FK contract fixes landed in
|
||
|
|
a53a4b845b |
fix(gui,api): close C-001 + C-002 — ownership + agent FK contract
C-001 — CreateCertificate was server-accepted with null owner_id,
team_id, renewal_policy_id because the GUI neither collected the fields
nor enforced them, even though the backend's ManagedCertificate schema
and handler contract treat them as required. Fix the contract at all
four layers:
- web/src/pages/CertificatesPage.tsx: replace owner_id/team_id free-
text inputs with <select> elements fed by getOwners/getTeams/
getPolicies queries; mark all three required; gate the Create
button on owner_id + team_id + renewal_policy_id being set.
- internal/api/handler/certificates.go: ValidateRequired for
owner_id, team_id, renewal_policy_id on CreateCertificate so the
handler returns HTTP 400 with the offending field name before the
service layer is reached.
- internal/mcp/types.go: drop ',omitempty' from
CreateCertificateInput.RenewalPolicyID so the MCP schema reflects
the required contract; Update inputs keep partial-update semantics.
- api/openapi.yaml: 'required: [name, common_name, renewal_policy_id,
issuer_id, owner_id, team_id]' was already present on the Create
schema; clarified DeploymentTarget.agent_id description to note the
FK contract.
C-002 — CreateTargetWizard accepted an empty or bogus agent_id and the
service inserted directly, producing a Postgres 23503 FK-violation that
bubbled out as a generic HTTP 500. The FK itself (migration 000001 line
104: agent_id TEXT NOT NULL REFERENCES agents(id)) is correct; we keep
the schema strict and add validation at three layers:
- internal/service/target.go: introduce
ErrAgentNotFound sentinel and pre-validate agent_id in
TargetService.CreateTarget — empty string returns
'agent_id is required'; a nonexistent id returns the full
'referenced agent does not exist: <id>' error. Both wrap
ErrAgentNotFound via fmt.Errorf %w so callers can use errors.Is.
- internal/api/handler/targets.go: ValidateRequired on agent_id; map
errors.Is(err, service.ErrAgentNotFound) to HTTP 400 instead of
letting it fall through to the generic 500 branch.
- internal/mcp/types.go: drop ',omitempty' from
CreateTargetInput.AgentID to match the required contract.
- web/src/pages/TargetsPage.tsx: replace the free-text Agent ID input
with a <select> populated from getAgents(); include agent in the
canProceedToReview gate so Next is disabled until an agent is
chosen.
Regression coverage (21 new subtests total):
- TestCreateCertificate_MissingRequiredField_Returns400 — 6 subtests,
one per required field, each proves the handler guard fires before
the mock service is called.
- TestCreateTarget_MissingAgentID_Returns400 — handler guard.
- TestCreateTarget_NonexistentAgent_Returns400 — pins the
ErrAgentNotFound -> 400 translation.
- TestTargetService_CreateTarget_MissingAgentID — errors.Is sentinel.
- TestTargetService_CreateTarget_NonexistentAgentID — errors.Is.
- The existing TestTargetService_CreateTarget_Success, along with
TestCreateTarget_{MissingName,MissingType,NameTooLong}_* handler
tests, were updated to seed a real agent or include agent_id in
the request body so the happy paths still run cleanly.
Gates (Phase 4):
- go build/vet/test/race: green
- go test -cover: internal/service 68.7% (gate 55%),
internal/api/handler 78.9% (gate 60%)
- golangci-lint on service+handler+mcp: 0 issues
- govulncheck: no reachable vulns
- tsc --noEmit: clean
- vitest: 223/223 passing
See cowork/certctl-coverage-gap-audit.md entries C-001 and C-002.
|
||
|
|
b3cc7cbdb2 |
fix(policies): close the D-006 loop — TitleCase seed canonicals + severity-aware, config-consuming rule engine (D-008)
D-008 was a three-part drift in the policy engine that made the
D-005/D-006 remediation cosmetic below the DB layer:
(a) migrations/seed.sql INSERTed rules with pre-D-005 lowercase
types ('ownership', 'environment', 'lifetime', 'renewal_window')
that the handler validator rejects on Create/Update but that
raw SQL INSERTs bypassed entirely. At runtime evaluateRule's
switch fell through to the default "unknown policy rule type"
error branch on every demo rule × every cert × every cycle,
flooding logs while emitting zero violations.
(b) migrations/seed_demo.sql persisted lowercase severity values
('critical', 'error', 'warning') on policy_violations rows.
INSERT succeeded because that column had no CHECK, but any
frontend comparing against the canonical PolicySeverity enum
mis-categorized every seeded violation.
(c) evaluateRule hardcoded Severity: PolicySeverityWarning on
every emitted violation and ignored rule.Config entirely —
so the D-006 per-rule severity column (000013) and every
per-arm Config JSON ({allowed_issuer_ids, allowed_domains,
required_keys, allowed, lead_time_days, max_days}) was dead
data below the evaluation layer.
This commit lands (a)+(b)+(c) atomically. Shipping any subset
leaves the feature half-working.
## Changes
Domain (internal/domain/policy.go):
* Add PolicyTypeCertificateLifetime as the 6th TitleCase canonical.
Pre-D-008 the seeded "max-certificate-lifetime" rule had no engine
arm — routing it through RenewalLeadTime would conflate "how
close to expiry before we renew" with "how long can the cert
possibly be", two distinct semantics. The new type accepts
config {"max_days": int} and flags certs whose
NotAfter - NotBefore exceeds the cap.
Handler validator (internal/api/handler/validation.go):
* ValidatePolicyType allowlist grown to 6 canonicals
(AllowedIssuers, AllowedDomains, RequiredMetadata,
AllowedEnvironments, RenewalLeadTime, CertificateLifetime).
OpenAPI (api/openapi.yaml):
* PolicyType enum grown to match domain.
Frontend (web/src/api/types.ts, types.test.ts):
* POLICY_TYPES tuple gains CertificateLifetime; pin test asserts
all 6 canonicals and rejects casing drift.
Migration 000014 (policy_violations severity CHECK):
* Named CHECK constraint (policy_violations_severity_check)
mirroring 000013's allowlist, defense-in-depth at the DB layer
against future drift from bypassed writes (migrations, psql
sessions, future callers). Symmetric down migration drops by
name.
Seed data:
* migrations/seed.sql rewritten to emit TitleCase canonicals with
per-arm config JSON that actually exercises the config-consuming
paths (not the missing-field backstops):
- pr-require-owner → RequiredMetadata {"required_keys":["owner"]} Warning
- pr-allowed-environments → AllowedEnvironments {"allowed":["production","staging","development"]} Error
- pr-max-certificate-lifetime → CertificateLifetime {"max_days":90} Critical
- pr-min-renewal-window → RenewalLeadTime {"lead_time_days":14} Warning
Severities are now differentiated per rule (D-006 intent).
* migrations/seed_demo.sql violation rows flipped to TitleCase
severity ('Critical', 'Error', 'Warning') so migration 000014
applies cleanly on upgrade paths.
Engine rewrite (internal/service/policy.go):
* evaluateRule rewritten. All six arms now:
1. Parse rule.Config into the per-arm typed struct.
2. Bad JSON → log at ValidateCertificate boundary and skip
this rule (no co-located poisoning of other rules in the
same batch).
3. Empty/null Config → emit the pre-D-008 missing-field
violation (backwards compat invariant — operators who
haven't reconfigured still see the same output).
4. Violations emitted carry rule.Severity (no more hardcoded
Warning); D-006 column is now load-bearing.
* CertificateLifetime arm reads NotBefore/NotAfter from the
certificate's latest version via CertRepo. Injected via
PolicyService.SetCertRepo() setter — avoids churning ~36
NewPolicyService call sites while keeping the lifetime arm
optional (degrades to a log+skip if the setter is not wired).
Server wiring (cmd/server/main.go):
* policyService.SetCertRepo(certRepo) wired after construction.
Tests (internal/service/policy_test.go):
* 25 new subtests across 5 groups:
- TestEvaluateRule_SeverityPassThrough (6): every rule type
emits violations carrying rule.Severity, not hardcoded.
- TestEvaluateRule_ConfigConsumed (12): every per-arm Config
path exercised positive + negative.
- TestEvaluateRule_EmptyConfig_BackCompat (3): empty/null
Config still emits pre-D-008 missing-field violations.
- TestEvaluateRule_BadConfig_SkipsRule: malformed JSON logs
and skips cleanly without poisoning neighbors.
- TestEvaluateRule_CertificateLifetime_RepoScenarios (3):
ok when repo wired, log+skip when not, handles missing
NotBefore/NotAfter edges.
Provenance: D-008 surfaced during D-005/D-006 remediation review
in
|
||
|
|
eef1db0f0a |
fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006)
Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a
single commit because they share the same root cause — policy CRUD sending
values the backend silently rejects — and splitting them would leave a
half-working UI between commits.
## D-005 (P0): PoliciesPage dropdown 400s every Create Policy
Root cause
----------
`web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a
hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array.
The backend's `internal/api/handler/validators.go::ValidatePolicyType`
enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`,
`RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in
`internal/domain/policy.go`. Every Create Policy request was rejected with
`400 invalid policy type`. The error surfaced only as a transient toast;
the modal closed anyway. Silent user-visible failure.
Fix
---
- `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES`
tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and
`PolicyViolation.severity` to the literal-union types. Dropdown is now
sourced from the tuple; casing drift becomes a compile error.
- `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` /
`severityDots` to the TitleCase values, added `humanize()` for display
(AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral`
fallback that was papering over the mismatch.
- `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone
edits one side of the frontend/backend contract without the other, CI
fails with a clear assertion. Pure-TS vitest, no RTL dependency.
## D-006 (P1): `severity` field silently dropped on create/update
Root cause
----------
`PolicyRule` had no `Severity` field in `internal/domain/policy.go`. The
frontend has always sent `severity` on create/update, but Go's
`json.Decoder` (default settings, no `DisallowUnknownFields`) silently
dropped it. The value never reached PostgreSQL. Every rule rendered with
the same severity because there was no severity — just a display
computation downstream.
Fix: option (b), full-stack schema add (not delete-the-field)
-------------------------------------------------------------
- Migration `000013_policy_rule_severity` (up + down): adds
`severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with
CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No
index — three-value column on a low-thousands-rows table, planner will
seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data.
- `internal/domain/policy.go`: added `Severity PolicySeverity` field.
- `internal/repository/postgres/policy.go`: plumbed `severity` through
ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT,
UpdateRule UPDATE (4 queries).
- `internal/service/policy.go::UpdatePolicy`: if the client omits
severity on a PUT (zero-value empty string), fetch the existing rule
and preserve its severity. Without this, partial updates would trip the
NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type
(out of scope).
- `internal/api/handler/policies.go::CreatePolicy`: default empty severity
to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with
clear message instead of 500 on CHECK violation. `UpdatePolicy`:
validates severity only when provided.
- `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional
`severity` on the MCP `create_policy` / `update_policy` tool inputs so
LLM callers stay in sync with the wire contract.
- `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with
the enum and default.
Acceptance criterion (user-defined)
-----------------------------------
"Create a rule with severity=Critical, reload the page, and still see
Critical — no silent drops." Verified end-to-end: frontend sends
`severity: "Critical"`, handler validates, service persists, DB stores,
GET returns, React renders the correct badge.
Seed data
---------
`migrations/seed.sql`: four demo rules now have differentiated severities
— `pr-require-owner` → Warning, `pr-allowed-environments` → Error,
`pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` →
Warning. The user called out that seeding all four at the same severity
makes the feature look decorative; differentiation demonstrates the
column carries real signal.
## Integration test fix (side effect of D-006)
`internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy`
was sending `"severity": "High"` — a value from the pre-audit severity
vocabulary that the new `ValidatePolicySeverity` correctly rejects with
400. Changed to `"Error"` (closest semantic match in the new TitleCase
allowlist). Only severity reference in the integration/ directory;
verified via grep.
## Out of scope, logged for follow-up (d/D-008)
Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly
deferred per direction:
1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values
(`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`).
These are load-bearing on `internal/service/policy.go::evaluateRule`'s
`switch rule.Type` (which also uses the lowercase strings). Migrating
requires coordinated changes across seed + evaluation engine.
2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'`
severity — will now fail the new CHECK constraint. Separate fix.
3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on
emitted violations and ignores the configured `rule.Config`. The new
severity column is read correctly on the CRUD path but not yet
consulted during evaluation.
## Verification
Backend:
- `go build ./...` — clean
- `go vet ./...` — clean
- `go test -short ./...` — all packages green, including
`internal/service` (policy service), `internal/api/handler` (policy +
MCP handler tests), `internal/integration` (e2e_test.go after fix),
`internal/domain`, `internal/repository/postgres`.
Frontend:
- `tsc --noEmit` — clean
- `vitest run` — 223/223 passing (4 new assertions in types.test.ts)
- `vite build` — clean (only the pre-existing chunk-size warning)
|
||
|
|
9d0c3dfa15 |
docs(openapi): reconcile api/openapi.yaml with router routes (M-10)
Add 9 missing operations to api/openapi.yaml that exist in router.go but
were absent from the spec. Spec-only change with no runtime Go code
changes; all 106 pre-existing operationIds preserved byte-identical.
New operationIds:
- testTargetConnection (POST /api/v1/targets/{id}/test)
- verifyDeployment (POST /api/v1/jobs/{id}/verify)
- getJobVerification (GET /api/v1/jobs/{id}/verification)
- estCACerts (GET /.well-known/est/cacerts)
- estSimpleEnroll (POST /.well-known/est/simpleenroll)
- estSimpleReEnroll (POST /.well-known/est/simplereenroll)
- estCSRAttrs (GET /.well-known/est/csrattrs)
- scepGet (GET /scep)
- scepPost (POST /scep)
Spec operations: 106 → 115 (matches 115 router routes exactly).
Verification:
- openapi-spec-validator: OK
- go build ./...: clean
- go vet ./...: clean
- go test -race -count=1 -short ./...: 54 packages ok, 0 FAIL
- golangci-lint run ./...: 0 issues
- govulncheck ./...: 0 vulnerabilities in our code
- tsc --noEmit: 0 errors
- vitest run: 3 files, 218 tests passed
sha256 before: 7c14f77107a86f8de82fe91b7f5e16cca11206d1e1fab7b7bd77ff396620fdf3
sha256 after: 87bd92d0407d63643bec612d27261bf489563beb90d0791ea71cde26346f83d3
|
||
|
|
13cd4d98ba |
feat(V2.2): bulk revocation — filter-based fleet-wide certificate revocation
Add POST /api/v1/certificates/bulk-revoke with filter criteria (profile_id, owner_id, agent_id, issuer_id, team_id, certificate_ids), partial-failure tolerance, and audit trail. Includes MCP tool, CLI command (certs bulk-revoke), server-side bulk modal in GUI replacing client-side sequential loop, OpenAPI spec, compliance mapping updates, and 21 new tests (12 service, 7 handler, 1 CLI, 1 frontend). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
3f619bcaac |
feat(M49): Entrust, GlobalSign & EJBCA issuer connectors
Add three new issuer connectors completing commercial and open-source CA coverage. Entrust uses mTLS client certificate auth with sync/async issuance. GlobalSign Atlas uses mTLS + API key/secret dual auth with serial-based tracking. EJBCA supports dual auth (mTLS or OAuth2) for self-hosted Keyfactor CAs. Each connector implements the full issuer.Connector interface (9 methods), includes httptest-based unit tests (~14 each), and follows established patterns (injectable HTTP clients, RFC 5280 revocation reason mapping, CRL/OCSP delegated to CA). Also includes: issuer factory cases, env var seeding, config structs, domain types, seed data (3 rows, all disabled), OpenAPI enum updates, frontend issuer catalog entries with config fields, and full docs (connectors.md, architecture.md, features.md, README). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
596d86a206 |
feat(M48): continuous TLS health monitoring — endpoint state machine, shared tlsprobe, 8 API endpoints, GUI
Adds continuous TLS endpoint health monitoring that closes the deploy→verify→monitor loop. After M25 verifies a deployment succeeded once, M48 continuously confirms it stays healthy. Key components: - Shared `internal/tlsprobe/` package extracted from network scanner for reuse - Health status state machine: healthy → degraded (2 failures) → down (5 failures), plus cert_mismatch when served fingerprint differs from expected - 8th scheduler loop (60s tick, per-endpoint configurable intervals) - PostgreSQL migration 000011: endpoint_health_checks + endpoint_health_history tables - 8 REST API endpoints (CRUD, history, acknowledge, summary) - Health Monitor GUI page with summary bar, status table, create modal, auto-refresh - 38 new tests (5 tlsprobe + 11 domain + 10 service + 8 handler + 4 frontend) - All coverage thresholds maintained (service 68%, handler 83%, domain 87%, middleware 63%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
5567d4b411 |
feat(M47): add Kubernetes Secrets target + AWS ACM PCA issuer connectors
Implement both M47 connectors with full cross-layer wiring: Kubernetes Secrets target: DNS-1123 validation, kubernetes.io/tls Secret create-or-update, chain concatenation, serial number validation, Helm RBAC gating. 18 tests. AWS ACM Private CA issuer: synchronous issuance (like Vault), ARN regex validation, RFC 5280 revocation reason mapping, CA cert retrieval, factory + env var seeding. 23 tests. Cross-cutting: domain types, service validation, config, factory, agent dispatch, frontend (TargetsPage, issuerTypes), OpenAPI, seed data, Helm chart, connectors docs, README. Testing docs (testing-guide, qa-test-guide, qa_test.go) with Parts thematically integrated near related connectors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
7d6ef44e21 |
feat(M46): Windows Certificate Store + Java Keystore target connectors, shared certutil package
Extract shared certutil helpers (CreatePFX, ParsePrivateKey, ComputeThumbprint, GenerateRandomPassword, ParseCertificatePEM) from IIS connector for reuse. Add WinCertStore connector (PowerShell Import-PfxCertificate, dual local/WinRM mode, configurable store/location, expired cert cleanup) and JavaKeystore connector (PEM→PKCS#12→keytool pipeline, JKS/PKCS12 support, shell injection prevention, path traversal protection). 53 new tests, all passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
697c0be9f3 |
feat(M38): SSH target connector for agentless deployment via SSH/SFTP
Adds a new target connector enabling certificate deployment to any Linux/Unix server without installing the certctl agent binary. Uses the proxy agent pattern — a single agent in the same network zone deploys certs to remote servers over SSH/SFTP. Key additions: - SSH/SFTP connector with key auth (file/inline) + password auth - Injectable SSHClient interface for cross-platform testing (25 tests) - Shell injection prevention via validation.ValidateShellCommand() - Configurable cert/key/chain paths with octal permissions - GUI: 11 SSH config fields in target create wizard Also fixes pre-existing frontend bug where all target type strings (nginx, apache, etc.) were sent as lowercase but the backend expects proper-case (NGINX, Apache, etc.), breaking GUI-created targets. Adds missing TargetTypeSSH to validTargetTypes service map. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
5a53b648b1 |
feat(M44): Google CAS issuer connector
Google Cloud Certificate Authority Service integration via REST API with OAuth2 service account auth (JWT→access token). Synchronous issuance model, CA pool selection, mutex-guarded token caching, revocation with RFC 5280 reason mapping. No Google SDK dependency — all stdlib. 19 tests with httptest mock OAuth2 + CAS API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
3a11e447cf |
feat(M43): Sectigo SCM issuer connector
Implement Sectigo Certificate Manager REST API connector with async order model (enroll → poll → collect PEM), 3-header auth, DV/OV/EV support, collect-not-ready (400/-183) graceful handling, and RFC 5280 revocation reason mapping. 20 tests with httptest mock API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
9feb6c796d |
feat(M42): Postfix/Dovecot mail server target connector
Dual-mode TLS connector for mail servers — single package with mode field selecting Postfix or Dovecot defaults. File-based cert/key deployment with correct permissions (cert 0644, key 0600), optional chain append, shell injection prevention, and configurable reload/validate commands. 18 tests covering config validation, deployment, and security. GUI wizard fields and OpenAPI enum updated. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
fd05bacb76 |
feat(M41): Envoy target connector with SDS support
File-based deployment for Envoy service mesh — writes cert/key/chain to watched directory with optional SDS JSON config for xDS bootstrap. Path traversal prevention, configurable filenames, 15 tests passing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
9a41d0ca39 |
feat(M39): IIS WinRM proxy agent mode + front-to-back wiring
Complete the IIS target connector with dual-mode deployment: - WinRM proxy agent mode via masterzen/winrm for remote Windows servers - Base64 PFX transfer with try/finally cleanup on remote host - GUI wizard updated with 13 IIS config fields including WinRM settings - TargetDetailPage sensitive field redaction (password/secret/token/key) - OpenAPI TargetType enum updated (added Traefik, Caddy) - connectors.md fully documented with WinRM proxy config example - 38 total IIS tests (10 new WinRM tests), all passing with race detection Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
6375909591 |
feat: add Vault PKI and DigiCert CertCentral issuer connectors (M32 + M37)
Vault PKI: synchronous issuance via /v1/{mount}/sign/{role}, token auth,
revocation, CA cert retrieval, 14 tests. DigiCert CertCentral: async order
model (submit → poll → download), X-DC-DEVKEY auth, OV/EV support, PEM
bundle parsing, 16 tests. Both conditionally registered based on env vars.
Includes OpenAPI enum updates, seed data, connector docs, architecture docs,
README badges, and testing guide sign-off (Parts 38 + 39, 12 automated
smoke test assertions all passing).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
a8fc177118 |
fix: resolve NULL csr_pem scan errors and QA smoke test failures
Root cause: certificate_versions.csr_pem is nullable in the schema but Go code scanned it into a plain string. Used sql.NullString in ListVersions and GetLatestVersion to handle NULL values correctly. Also includes: partial update fetch-merge-update pattern to prevent FK violations, nil directory guard in discovery service, diagnostic slog logging in handlers, export handler 422 for unparseable PEM, OpenAPI spec corrections, MCP tool description improvements, and test fixes. Rewrites the Release Sign-Off section in testing-guide.md to individual test-level granularity (320 rows) with smoke test results audited and checked off (121 pass, 5 skip, 194 manual remaining). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
ec21c9bb29 |
feat(m28+m29+m30): ACME ARI, email digest, and Helm chart
M28: ACME Renewal Information (RFC 9702) — CA-directed renewal timing with cert ID computation, directory endpoint discovery, graceful degradation for non-ARI CAs. 19 tests. M29: Email notifier wiring + scheduled certificate digest — SMTP connector bridged to service layer via NotifierAdapter, DigestService with HTML email template, 7th scheduler loop (24h), digest preview/send API endpoints and GUI card. 21 tests. M30: Production-ready Helm chart — server Deployment, PostgreSQL StatefulSet, agent DaemonSet, ConfigMaps, Secrets, Ingress, security contexts, health probes, example values for dev/prod/ACME scenarios. Also: OpenAPI spec updates, MCP tool additions, CI helm-lint job, documentation updates across 5 doc files and README. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
a00bb349c4 |
feat(m27): certificate export (PEM/PKCS#12) and S/MIME EKU support
Add certificate export in PEM (JSON or file download) and PKCS#12 formats. Private keys are never included — they stay on agents. Add EKU-aware issuance threading profile EKUs (serverAuth, clientAuth, codeSigning, emailProtection, timeStamping) through the full issuance pipeline. Fix agent CSR SAN splitting for email addresses, adaptive KeyUsage flags for S/MIME vs TLS, and a pre-existing generateID collision bug in deployment job creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
6c10c33572 |
docs: complete V2 audit remediation — OpenAPI spec, demos, and features
- Add 15 missing operations to openapi.yaml (M18b discovery, M20 deployments, M21 network scan, M22 Prometheus) — spec now has 93 operations matching all 93 router routes - Add 6 new component schemas (DiscoveredCertificate, DiscoveryScan, DiscoveryReport, NetworkScanTarget, NetworkScanTargetCreate, StatusMessageResponse) - Add Discovery and Network Scan tags to OpenAPI spec - Fix stale "Prometheus format deferred to V3" claim in metrics description - Add Part 4.5 (Target CRUD) to demo-advanced.md with create/update/delete curl examples - Expand Certificate Profiles section in features.md with list/get/update curl examples - Add Deployment Trigger section to features.md with curl examples - Add discovery-summary and discovery-scans curl examples to features.md - Remove 3 empty directories (internal/agent/, internal/audit/, internal/policy/) - Update features.md OpenAPI scope from "78 documented" to "93 operations" Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |