mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 17:31:30 +00:00
12705efe36781efd95450bf8a450ec1220544d7e
69 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
def4be9b38 |
fix(migrations): two cold-DB regressions surfaced by Phase-9 docker compose smoke
The v2.1.0 release-gate Phase-9 docker compose smoke run against a
fresh Postgres surfaced two real defects in the migration files that
testcontainers schema-per-test never exercised. Both reproduce by
running 'docker compose down -v && docker compose up --build'
against the current master tree.
Bug A — migration 000045_users_deactivated_at.up.sql is malformed.
The 000029 schema defines:
permissions (id TEXT PRIMARY KEY, name TEXT NOT NULL UNIQUE,
namespace TEXT NOT NULL)
role_permissions (..., permission_id TEXT NOT NULL REFERENCES ..., ...)
But 000045 was written as:
INSERT INTO permissions (name) VALUES ... -- missing id + namespace
INSERT INTO role_permissions (role_id, permission, ...) VALUES ...
^^ wrong column name
On a cold-DB run this fails immediately with:
pq: null value in column "id" of relation "permissions"
violates not-null constraint
Fix: provide id + namespace columns, use permission_id (the actual
column name), ON CONFLICT (id) DO NOTHING. The new permission ids
follow the existing 'p-auth-*' prefix convention (p-auth-user-read +
p-auth-user-deactivate) used by 000029.
Bug B — migration 000029_rbac.up.sql is not idempotent post-000043.
000029 originally created actor_roles with:
UNIQUE (actor_id, actor_type, role_id, tenant_id)
Audit 2026-05-10 HIGH-10 closure / migration 000043 drops that
constraint and re-creates it WITH scope columns:
UNIQUE (actor_id, actor_type, role_id, scope_type, scope_id, tenant_id)
The migration runner (internal/repository/postgres/db.go::RunMigrations)
is naive — no tracker table — and re-runs every *.up.sql file on
every server boot. On the second-and-later boots, 000029's seed
INSERT for actor-demo-anon-admin still references the
pre-000043 constraint name in its ON CONFLICT clause:
ON CONFLICT (actor_id, actor_type, role_id, tenant_id) DO NOTHING
Postgres errors out with:
pq: there is no unique or exclusion constraint matching the
ON CONFLICT specification
Fix: pin the conflict target to the row's primary key 'id' column
(always present, never altered). The seed row's deterministic id
'ar-demo-anon-admin' makes ON CONFLICT (id) work under both pre-
and post-000043 schemas.
Why testcontainers schema-per-test missed these:
Each test in internal/repository/postgres/*_test.go spins up a
fresh schema and applies every .up.sql in order ONCE. The full
'000029 -> 000043 -> retry 000029' cascade never happens because
migrations don't re-run within a test. Phase-9 docker compose
smoke is the only test path that exercises the server-restart-
on-error retry, which is exactly the missing coverage.
Verify (sandbox): go test ./internal/repository/postgres/ PASS.
Workstation re-runs 'docker compose down -v && docker compose up'
to confirm both bugs are closed.
|
||
|
|
e1e43c8924 |
feat(auth): foundation for MED-11 — users.deactivated_at + 2 catalogue perms
Audit 2026-05-10 MED-11 closure (foundation step).
WHAT.
Lays the schema + domain foundation for the MED-11 federated-user
admin surface:
1. Migration 000045 adds users.deactivated_at TIMESTAMPTZ (nullable;
non-NULL = deactivated). Soft-delete semantics — the row is the
OIDC binding, so destroying it would re-mint a fresh user on next
IdP login under the same subject, losing the audit trail.
2. Seeds 2 new catalogue permissions:
- auth.user.read (admin / operator / auditor)
- auth.user.deactivate (admin ONLY)
3. Extends User domain struct with DeactivatedAt *time.Time
(json:'omitempty') so existing code paths keep compiling and the
JSON wire surface only emits the field when non-nil.
WHY.
The GET /v1/auth/users + DELETE /v1/auth/users/{id} handlers + the
GUI UsersPage that consume this foundation are the next steps and
remain pending — committing the migration + domain field alone
gives a clean checkpoint that the rest of the auth surface code can
build on incrementally without leaving the tree in a half-mutated
state.
HOW.
migrations/000045_users_deactivated_at.up.sql:
- ALTER TABLE users ADD COLUMN IF NOT EXISTS deactivated_at TIMESTAMPTZ
- INSERT 2 permissions into permissions
- INSERT role_permissions rows (read in r-admin/operator/auditor;
deactivate in r-admin)
- Single BEGIN/COMMIT, idempotent (ON CONFLICT DO NOTHING)
migrations/000045_users_deactivated_at.down.sql:
- reverse-order DELETE + DROP COLUMN
internal/auth/user/domain/types.go:
- User.DeactivatedAt *time.Time, JSON tag omitempty.
VERIFY.
- go vet ./internal/auth/user/... ./internal/auth/oidc/...
./internal/repository/... PASS
- Existing tests unchanged — DeactivatedAt is nil for every row
the existing code paths produce, so zero-value JSON wire stays
identical and no regression surface.
Refs: cowork/auth-bundles-audit-2026-05-10.md MED-11
cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 14
|
||
|
|
2a1a0b347c |
harden(oidc): pre-login UA/IP binding (MED-16) — RFC 9700 §4.7.1
Audit 2026-05-10 MED-16 closure.
WHAT.
Binds the OIDC pre-login row to the (clientIP, userAgent) tuple of
the /auth/oidc/login request, and enforces a constant-time compare
against the /auth/oidc/callback request at consume time. Defeats
replay of a stolen pre-login cookie by a different browser /
source — the secondary defense layer recommended by RFC 9700 §4.7.1
when the primary layer (HMAC integrity + Path=/ + SameSite=Lax on
the cookie) is bypassed via CSRF / XSS / TLS-termination leak.
WHY.
Pre-fix, the pre-login cookie's HMAC verified only that 'some'
caller of /auth/oidc/login was talking to /auth/oidc/callback; it
did not verify that the SAME browser / source was on both sides.
An attacker who exfiltrated the cookie value via any vector could
replay the bytes through their own user-agent and ride the victim's
authorization. RFC 9700 §4.7.1 calls out the gap explicitly and
recommends binding state to a user-agent fingerprint + source IP.
HOW.
Migration:
migrations/000044_prelogin_uaip.up.sql
ALTER TABLE oidc_pre_login_sessions
ADD COLUMN IF NOT EXISTS client_ip TEXT,
ADD COLUMN IF NOT EXISTS user_agent TEXT;
Both nullable for in-flight rolling-deploy compat — the consume-
side check only enforces when both row AND request carry non-empty
values for the leg in question.
Domain:
internal/repository/oidc.go (PreLoginSession) — adds ClientIP +
UserAgent fields.
Repository:
internal/repository/postgres/oidc_prelogin.go — Create persists
via sql.NullString (empty → NULL); LookupAndConsume reads back.
Re-uses package-local nullableString from discovery.go.
Service:
internal/auth/oidc/service.go
- PreLoginStore.CreatePreLogin signature takes (clientIP,
userAgent) as positions 5–6.
- PreLoginStore.LookupAndConsume returns (clientIP, userAgent)
as positions 5–6.
- HandleAuthRequest signature gains (clientIP, userAgent),
threaded to the store.
- HandleCallback adds Step 1.5 — UA / IP constant-time compare
between stored row and incoming request. Per-leg toggles via
preLoginRequireUA / preLoginRequireIP service fields. Empty
values on either side pass through (rolling-deploy + headless-
proxy compat).
- New sentinels ErrPreLoginUAMismatch, ErrPreLoginIPMismatch.
- SetPreLoginBindingRequirements(requireUA, requireIP) helper
for main.go config wiring.
Adapter:
internal/auth/oidc/prelogin.go — PreLoginAdapter passes the new
fields through to the repo row.
Handler:
internal/api/handler/auth_session_oidc.go
- OIDCAuthHandshaker.HandleAuthRequest signature updated.
- LoginInitiate captures clientIPFromRequest + r.UserAgent()
and passes to the service.
- classifyOIDCFailure adds errors.Is dispatch for the two new
sentinels → prelogin_ua_mismatch / prelogin_ip_mismatch
audit categories.
Config:
internal/config/config.go
+ AuthConfig.OIDCPreLoginRequireUA (default true)
env CERTCTL_OIDC_PRELOGIN_REQUIRE_UA
+ AuthConfig.OIDCPreLoginRequireIP (default true)
env CERTCTL_OIDC_PRELOGIN_REQUIRE_IP
cmd/server/main.go calls oidcService.SetPreLoginBindingRequirements
from cfg.Auth.OIDCPreLoginRequire{UA,IP}.
Tests (internal/auth/oidc/service_test.go):
- TestService_HandleCallback_MED16_UAMismatchRejected
- TestService_HandleCallback_MED16_IPMismatchRejected
- TestService_HandleCallback_MED16_BothMatch_Succeeds
- TestService_HandleCallback_MED16_LegacyRowEmptyValues (rolling-
deploy compat — empty stored values pass through)
- TestService_HandleCallback_MED16_RequireUAFalse_AllowsMismatch
(operator escape-hatch — UA mismatch silently allowed)
Mechanical fan-out:
- stubPreLogin / stubPreLoginRepo signatures updated.
- All existing call sites in service_test.go (~40), prelogin_test.go,
bench_test.go, logging_test.go, provider_enabled_test.go,
integration_keycloak_test.go, integration_okta_smoke_test.go,
auth_session_oidc_test.go updated to pass empty strings for the
new params — pre-existing tests do not exercise UA/IP binding
semantics.
VERIFY.
- go vet ./internal/auth/oidc/... ./internal/api/handler/...
./internal/config/... PASS
- go test -short -count=1 -run MED16 ./internal/auth/oidc/... PASS (5/5)
- go test -short -count=1 ./internal/auth/oidc/... PASS (4.6s)
- go test -short -count=1 ./internal/api/handler/... PASS (4.3s)
- go test -short -count=1 ./internal/config/... PASS
Refs: cowork/auth-bundles-audit-2026-05-10.md MED-16
cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 6
RFC 9700 §4.7.1 — OAuth 2.0 Security Best Current Practice
|
||
|
|
72b54ce850 |
feat(auth/rbac): scope_type+scope_id+expires_at on role grants (HIGH-10)
Audit 2026-05-10 — close HIGH-10 from the HANDOFF.md backend batch
(item 1). Per-actor scoped + time-bound role grants are now
expressible via the API.
Migration 000043: adds scope_type TEXT NOT NULL DEFAULT 'global' +
scope_id TEXT to actor_roles. Constraints:
- actor_roles_scope_type_enum: scope_type ∈ {global, profile, issuer}
- actor_roles_scope_id_required_when_not_global: scope_id is NULL
iff scope_type='global'
- Uniqueness extended: (actor_id, actor_type, role_id, scope_type,
scope_id, tenant_id) — so an operator can grant the same role to
the same actor scoped to multiple profiles/issuers (e.g.
r-operator on p-finance AND on p-engineering).
Index idx_actor_roles_scope for non-global lookup hot paths.
Domain: ActorRole.ScopeType (ScopeType enum) + ScopeID (*string).
Authorizer.CheckPermission already understands the tuple via the
parallel role_permissions columns; this addition gives operators a
per-actor knob without forking roles.
Postgres repo: Grant writes scope_type+scope_id with ON CONFLICT keyed
on the new uniqueness tuple. Defaults to (global, NULL) when caller
omits.
Handler: assignRoleRequest extended with scope_type / scope_id /
expires_at. Validation:
- role_id required (unchanged)
- scope_type defaults to 'global'; allowed values global/profile/
issuer; anything else → 400
- scope_id required when scope_type ∈ {profile, issuer}; rejected
(must be empty) when scope_type='global'
- expires_at must be in the future when present; nil = standing
Regression matrix in internal/api/handler/auth_test.go (6 cases):
- TestAssignRoleToKey_HIGH10_ProfileScopeBoundGrantPersists
- TestAssignRoleToKey_HIGH10_TimeBoundGrantPersists
- TestAssignRoleToKey_HIGH10_RejectsScopeIDWithGlobalScope
- TestAssignRoleToKey_HIGH10_RejectsMissingScopeIDOnProfile
- TestAssignRoleToKey_HIGH10_RejectsPastExpiry
- TestAssignRoleToKey_HIGH10_RejectsInvalidScopeType
HIGH-10 marked CLOSED in audit-doc — the v3 deferral from the prior
session is reversed; everything lands in v2.
Refs: cowork/auth-bundles-fixes-2026-05-10/HANDOFF.md item 1
cowork/auth-bundles-audit-2026-05-10.md HIGH-10
|
||
|
|
925523e06e |
feat(oidc): Enabled toggle on OIDCProvider (MED-9)
Audit 2026-05-10 Fix 13 Phase B — close MED-9. MED-4/5/6/7 deferred to v3.
MED-9: ship the OIDCProvider.Enabled boolean. Pre-fix, the only way
to take a provider offline during an incident was DELETE, which
breaks active user_oidc_provider FK references and orphans any
session that minted under the provider. Post-fix:
- Migration 000042 adds enabled BOOLEAN NOT NULL DEFAULT TRUE.
Default-true means existing pre-migration rows are all enabled
post-deploy; no breaking-change window.
- internal/auth/oidc/domain/types.go::OIDCProvider.Enabled ships
the domain field with JSON tag 'enabled'.
- Repository read/write paths (List, Get, GetByName, Create, Update)
all carry the column.
- internal/auth/oidc/service.go::HandleAuthRequest rejects with
the new ErrProviderDisabled sentinel when cfgRow.Enabled=false.
- cmd/server/main.go::oidcProvidersListAdapter.List filters
disabled providers before constructing OIDCProviderInfo so the
LoginPage's 'Sign in with X' buttons never render for offline
IdPs.
- Defense-in-depth: the ErrProviderDisabled service-layer check
is the guard for direct API / MCP / CLI callers that bypass the
GUI.
Regression test: internal/auth/oidc/provider_enabled_test.go warms
the entry cache via a successful HandleAuthRequest, flips
cfgRow.Enabled=false on the cached entry, then asserts the next call
returns ErrProviderDisabled (errors.Is). Test fixtures (newValidProvider,
makeProvider) updated to set Enabled: true so existing tests stay
green.
Operators can toggle Enabled today via the existing PUT
/api/v1/auth/oidc/providers/{id} body field. A dedicated GUI
toggle on OIDCProviderDetailPage and a single-purpose PUT-just-enabled
endpoint are deferred to the v3 GUI-polish bundle — the load-bearing
wire is in place now.
MED-4 (GUI advanced fields on edit), MED-5 (POST .../test endpoint
+ button), MED-6 (JWKS auto-refresh on cache-miss), MED-7 (JWKS
health endpoint + GUI panel): DEFERRED to v3 with explicit
annotations in the audit doc. Workarounds: MED-4 fields are
PUT-editable via curl/MCP; MED-5 → call refresh post-create;
MED-6 → call refresh manually on key rotation.
Refs: cowork/auth-bundles-audit-2026-05-10.md MED-4, MED-5, MED-6,
MED-7, MED-9
Spec: cowork/auth-bundles-fixes-2026-05-10/13-med-bundle.md Phase B
|
||
|
|
90210c9334 |
fix(oidc/prelogin): encrypt state/nonce/PKCE-verifier at rest (HIGH-5)
Pre-login rows previously persisted the OIDC state, nonce, and PKCE
verifier as plaintext columns; an operator restoring an unredacted
backup of oidc_pre_login_sessions to a debug environment leaked every
in-flight handshake. If the IdP also leaked the auth code in the same
window (logged at a misconfigured TLS terminator, etc.), the attacker
could exchange code + verifier directly. RFC 7636 §7 requires verifier
confidentiality.
This commit:
- Migration 000041 adds {state,nonce,pkce_verifier}_enc BYTEA columns
and makes the legacy plaintext columns nullable. A follow-up
migration drops the plaintext columns once the rolling deploy
completes.
- internal/repository/postgres/oidc_prelogin.go::Create encrypts the
three secrets via crypto.EncryptIfKeySet (v3 magic 0x03 + per-row
salt + nonce + AES-256-GCM tag) and writes only the encrypted
columns; legacy plaintext stays NULL on the write path.
- LookupAndConsume prefers encrypted columns via materialize(),
falling back to the legacy plaintext only when _enc is NULL — the
rolling-deploy compat layer that 000042 will retire.
- NewPreLoginRepository takes encryptionKey; cmd/server/main.go threads
cfg.Encryption.ConfigEncryptionKey in.
- Encryption key reuses CERTCTL_CONFIG_ENCRYPTION_KEY (same passphrase
already protecting OIDC client secrets and SessionSigningKey material).
No new env var.
Why encryption-at-rest, not HMAC: the spec's HMAC approach required
moving plaintext into the cookie (the cookie currently carries only
row ID + HMAC). Re-shaping the cookie wire format would be a larger
refactor; the audit explicitly admits encryption-at-rest is an
acceptable closure (weaker because backups still contain decryptable
ciphertext, but the encryption key is held separately from the DB
backup, and the 10-minute TTL further bounds usable secret window).
Three new regression tests in oidc_prelogin_encryption_test.go pin:
(a) _enc columns contain v3-format ciphertext, NOT plaintext
substrings, post-Create
(b) legacy plaintext columns are NULL post-Create (defends against
future patches that re-introduce plaintext writes)
(c) LookupAndConsume round-trips state/nonce/verifier byte-for-byte
A fourth test pins the legacy-row fallback for rolling-deploy compat.
Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-5
Spec: cowork/auth-bundles-fixes-2026-05-10/09-high-5-prelogin-secret-protection.md
|
||
|
|
15435ca02b |
fix(oidc/bcl): jti replay-cache + iat freshness check (HIGH-3 closure)
Closes HIGH-3 of the 2026-05-10 audit. Pre-fix the BCL handler
accepted any logout_token whose iat + jti were syntactically present
but never checked (a) that iat fell within a skew window or (b) that
jti hadn't been seen before. A captured logout_token was replayable
indefinitely; once CRIT-2 was fixed, every replay would revoke the
user's current sessions — persistent DoS. RFC 9700 §2.7 + OIDC BCL
1.0 §2.5 require jti replay defense.
- Migration 000040_bcl_replay_cache: oidc_bcl_consumed_jtis table with
composite PK on (jti, issuer_url) — RFC 7519 §4.1.7 per-issuer
uniqueness — and an expires_at index for the GC sweep.
- repository.BCLReplayRepository interface + ErrBCLJTIAlreadyConsumed
sentinel. Postgres impl uses INSERT...ON CONFLICT DO NOTHING
RETURNING true for atomic single-use semantics in one round-trip.
- handler.DefaultBCLVerifier gains WithMaxAge + nowFn clock seam. iat
freshness check rejects tokens whose iat is in the future beyond
max-age OR stale beyond it. Verifier signature extended:
Verify(ctx, jwt) (iss, sub, sid, jti string, iat int64, err error).
- handler.AuthSessionOIDCHandler gains BCLReplayConsumer (interface)
+ WithBCLReplayConsumer(consumer, maxAge) setter. BackChannelLogout
consumes the jti post-verify with TTL = max(24h, 2*maxAge):
- first-receive → 200, sessions revoked, audit outcome=revoked
- replay (ErrBCLJTIAlreadyConsumed) → 200 + Cache-Control: no-store,
audit outcome=jti_replayed, sessions NOT re-revoked
- transient (non-AlreadyConsumed error) → 503 so the IdP retries
- internal/scheduler/scheduler.go: SetBCLReplayGarbageCollector wires
SweepExpired into the existing session-GC tick (no separate ticker
for short-lived replay rows).
- cmd/server/main.go: bclMaxAge from cfg.Auth.OIDCBCLMaxAgeSeconds
(default 60s, env CERTCTL_OIDC_BCL_MAX_AGE_SECONDS); bclReplayRepo
wired into the verifier + handler + scheduler.
- Three regression tests in internal/api/handler/bcl_replay_test.go:
TestBackChannelLogout_FirstReceiveConsumesJTI,
TestBackChannelLogout_ReplayedJTIReturns200WithAudit,
TestBackChannelLogout_TransientConsumeFailureReturns503.
- internal/api/handler/auth_session_oidc_test.go: stubBCLVerifier
gains jti + iat fields; existing TestBackChannelLogout_* tests
rewritten for the new Verify return.
Verification gate green: gofmt clean, go vet clean, go test -short
-count=1 on internal/api/handler / internal/api/router /
internal/scheduler / cmd/server / internal/auth/oidc /
internal/auth/breakglass — all pass.
CRIT-1..CRIT-5 + HIGH-1 + HIGH-2 + HIGH-3 of the 2026-05-10 audit
now closed on this branch. Spec at
cowork/auth-bundles-fixes-2026-05-10/07-high-3-bcl-replay-defense.md.
Refs: cowork/auth-bundles-audit-2026-05-10.md HIGH-3
|
||
|
|
68ca42fef1 |
fix(auth): apply rbacGate to every state-changing + read handler (CRIT-1 closure)
Closes the wire-layer authorization gap surfaced by the 2026-05-10 audit
(CRIT-1). Before this commit only ~24 of ~140 routes carried rbacGate
enforcement — all of them admin-only fine-grained perms (auth.session.*,
auth.oidc.*, auth.breakglass.admin, cert.bulk_revoke, crl.admin, scep.admin,
est.admin, ca.hierarchy.manage). Every catalogued legacy-CRUD perm
(cert.read/issue/revoke/delete, profile.edit/delete, issuer.edit/delete,
target.*, agent.*, plus role-mgmt verbs) was declared in
internal/domain/auth/validate.go but never wired at the router. A r-viewer
Bearer was essentially r-admin minus five verbs at the wire layer (CWE-862).
This commit:
- Adds rbacGateScoped(checker, perm, scopeType, scopeFn, h) helper to
internal/api/router/router.go for path-bound scope resolution. Per-profile
and per-issuer grants (Decision 2) now reach the wire layer.
- Wraps every state-changing route AND every read endpoint in router.go
with rbacGate (global) or rbacGateScoped (path-bound). The auth-management
routes (POST /api/v1/auth/roles, etc.) gain router-level enforcement
in addition to the existing service-layer Authorizer check — defense in
depth (HIGH-9 of the same audit collapses into this closure).
- Auth-exempt surfaces stay un-gated by design: login, callback, BCL,
logout, breakglass-login, bootstrap, health, auth-info, version. Allowlist
is documented in TestRouterRBACGateCoverage.
- Extends internal/domain/auth/validate.go CanonicalPermissions with 30 new
perms across 12 namespaces: cert.edit; job.read, job.cancel; approval.read,
approval.approve, approval.reject; policy.read/edit/delete;
team.read/edit/delete; owner.read/edit/delete; notification.read/edit;
discovery.read/run/claim; network_scan.read/edit/run;
healthcheck.read/edit/delete/acknowledge; digest.read, digest.send;
verification.read, verification.run; stats.read; metrics.read.
- Updates DefaultRoles for r-admin / r-operator / r-viewer / r-mcp / r-cli /
r-agent. r-auditor gets NOTHING new — the auditor pin
(TestAuditorRoleHoldsExactlyAuditReadAndExport) stays invariant.
- Migration 000039_audit_crit1_perms seeds the new perm rows + role grants
per the updated DefaultRoles map. Idempotent ON CONFLICT DO NOTHING.
Reverse migration removes role_permissions before permissions
(ON DELETE RESTRICT on the FK).
- AST-level CI guard TestRouterRBACGateCoverage in
internal/api/router/router_rbac_coverage_test.go walks router.go and
asserts every state-changing + read route is wrapped (or in the
documented allowlist). Adding a new ungated route fails CI.
- Updates docs/operator/rbac.md permission-catalogue table with the new
namespaces + footer link to the AST CI guard.
- Updates certctl/CHANGELOG.md v2.1.0 section with the closure narrative.
Audit doc cowork/auth-bundles-audit-2026-05-10.md CRIT-1 row annotated
CLOSED 2026-05-10. Bundle's exit-gate spec lives at
cowork/auth-bundles-fixes-2026-05-10/01-crit-1-rbac-gates.md.
CRIT-2 / CRIT-3 / CRIT-4 / CRIT-5 of the same audit remain open and
continue to block the v2.1.0 tag.
Verification gate green:
- gofmt -d (no diff after gofmt -w on the touched files)
- go vet ./...
- go test -short -count=1 ./... (all packages pass including auditor pin)
- go build ./...
HIGH-9 of the audit closes via this commit's router-layer rbacGate on
POST /api/v1/auth/keys/{id}/roles + DELETE /api/v1/auth/keys/{id}/roles/{role_id}
(defense-in-depth on top of the existing service-layer privilege check).
Refs: cowork/auth-bundles-audit-2026-05-10.md CRIT-1 HIGH-9
|
||
|
|
1d01c87663 |
auth-bundle-2 Phase 7 + Phase 7.5: OIDC first-admin bootstrap +
break-glass admin (Argon2id, lockout, default-OFF, surface-invisibility)
Phase 7 — OIDC first-admin bootstrap (Decision 3):
- Optional AdminBootstrapHook closure on *oidc.Service. When wired,
HandleCallback consults the hook AFTER group resolution + user
upsert and BEFORE the empty-mapping fail-closed check. Hook
receives (providerID, groups, userID); returns grantAdmin=true
when the user matches CERTCTL_BOOTSTRAP_ADMIN_GROUPS AND no
admin exists yet in the tenant.
- cmd/server/main.go wires the hook as a closure that:
* Filters by CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID (if configured).
* Probes AdminExists via authActorRoleRepo (admin-already-exists
silently returns false; bootstrap mode is one-shot per tenant).
* Walks group intersection.
* On match: grants r-admin via authActorRoleRepo.Grant + emits
the bootstrap.oidc_first_admin audit row with
event_category=auth + INFO log.
- Coexists with the Bundle 1 env-var-token bootstrap. Both paths
can be configured; first match wins (admin-existence probe
short-circuits the second).
- HandleCallback's empty-mapping fail-closed check moved AFTER the
hook so a fresh deployment with zero group_role_mappings can
still mint the first admin.
- 5 tests in service_test.go: hook grants admin on match, hook
returns false preserves empty-mapping fail-closed, admin-already-
exists silently falls through to normal mapping, hook-error wraps
+ bubbles, idempotent when admin is already in the mapped role set.
Phase 7.5 — Break-glass admin (Decision 4, default-OFF):
Migration 000038 ships:
- breakglass_credentials table — at-most-one-credential-per-actor
(UNIQUE(actor_id)), Argon2id PHC-format password_hash, lockout
state machine (failure_count, locked_until, last_failure_at).
FK CASCADE on users(id) so deleting a user atomically removes
their credential.
- Two new permissions seeded into r-admin only:
auth.breakglass.admin — set/rotate/unlock/remove credentials.
auth.breakglass.login — actor uses break-glass to log in.
CanonicalPermissions extended in lockstep.
internal/auth/breakglass/service.go (~580 LOC):
- Service.Enabled() reflects CERTCTL_BREAKGLASS_ENABLED.
- SetPassword: Argon2id with OWASP 2024 params (m=64MiB, t=3, p=4,
salt=16 random bytes, output=32 bytes); per-password random salt;
PHC-format hash output. Min 12 / max 256 byte input.
- Authenticate: constant-time-compare via subtle.ConstantTimeCompare
on every code path. Identical 401 + identical timing across the
wrong-password / locked-account / non-existent-actor paths so an
attacker cannot probe whether a given actor has break-glass
configured. Non-existent-actor + locked-account paths run a
verifyDummy() Argon2id pass for timing parity. Lockout state
machine: failure_count++ on every wrong attempt; threshold (default
5) trips locked_until = NOW() + duration (default 15m). Successful
Authenticate resets the counter. Reset-window: failures aged out
after CERTCTL_BREAKGLASS_LOCKOUT_RESET_INTERVAL (default 1h)
auto-reset on next attempt.
- Unlock + RemoveCredential: admin-only (auth.breakglass.admin
gated at the router via rbacGate). Audit rows on every operation.
- All public methods refuse to act when Enabled()==false (returns
ErrDisabled; the handler maps to HTTP 404 — surface invisibility).
internal/repository/postgres/breakglass.go ships the 5-method
postgres impl with atomic single-statement IncrementFailure (so
concurrent racing wrong-password attempts can't observe an
intermediate state and slip past the threshold) and idempotent
ResetFailureCount.
internal/api/handler/auth_breakglass.go ships the 4-endpoint HTTP
surface:
- POST /auth/breakglass/login (auth-exempt; 5/min rate-limited per
source IP via the existing rate limiter; returns 404 when
disabled). On success sets the post-login session cookie + CSRF
cookie via SessionService.Create + 204. On any failure:
uniform 401 + identical timing (the service has already audited
the specific failure category).
- POST /api/v1/auth/breakglass/credentials (auth.breakglass.admin)
- POST /api/v1/auth/breakglass/credentials/{actor_id}/unlock
(auth.breakglass.admin)
- DELETE /api/v1/auth/breakglass/credentials/{actor_id}
(auth.breakglass.admin)
Admin endpoints share the surface-invisibility property: when
CERTCTL_BREAKGLASS_ENABLED=false, every admin endpoint also returns
404 (not 403) so probing via the admin surface gets the same signal
as probing the login endpoint.
Tests (internal/auth/breakglass/service_test.go):
All 8 Phase 7.5 spec-mandated negative cases:
1. Service.Enabled()==false → all ops return ErrDisabled.
2. Wrong password → ErrInvalidCredentials, failure_count++,
audit row with event_category=auth.
3. Failure_count exceeds threshold → locked, subsequent attempts
(including with the CORRECT password) return identical-shape
401 while the lockout window holds.
4. Lockout window expires → next attempt with correct password
succeeds + resets the counter.
5. Password < 12 bytes (or > 256 bytes) → ErrWeakPassword.
6. Password leak hygiene — the service has zero slog calls; the
audit-row map literal never includes the password plaintext.
7. Argon2id hash never appears in logs OR API responses — pinned
by `json:"-"` tag on BreakglassCredential.PasswordHash + a
belt-and-braces json.Marshal probe asserting the hash bytes
never appear in the marshaled output.
8. Constant-time-compare verified via timing-statistical test —
wrong-password vs no-credential paths take statistically
indistinguishable time (within 5x ratio). The verifyDummy()
hash compute on the no-credential + locked paths is what
keeps timing parity; absent that, an attacker could side-
channel "actor doesn't have a credential" via timing.
Plus coverage-lift batch covering: SetPassword first-time vs rotate,
no-caller-id rejection, no-target-id rejection, RNG failure surface,
Authenticate happy-path mints session, no-credential audit row,
session-mint-failure surface, FailureResetInterval recycle, Unlock
+ RemoveCredential happy paths, hash-format unit tests (round-trip,
mismatch, malformed/wrong-version/bad-base64 formats), nil-audit +
nil-session pass-through.
Coverage on internal/auth/breakglass/ at 91.5% per-statement (above
the Phase 7.5 spec ≥ 90% floor).
cmd/server/main.go wiring:
- Constructs breakglassRepo + breakglassService + breakglassHandler
after the OIDC service block.
- breakglassSessionMinterAdapter shim bridges *session.Service.Create
to the breakglass.SessionMinter port.
- Logs WARN at boot when CERTCTL_BREAKGLASS_ENABLED=true (operator
visibility for the deliberate SSO-bypass).
internal/config/config.go gains:
- AuthConfig.BootstrapAdminGroups + BootstrapOIDCProviderID for
Phase 7 (CERTCTL_BOOTSTRAP_ADMIN_GROUPS comma-list +
CERTCTL_BOOTSTRAP_OIDC_PROVIDER_ID).
- AuthConfig.Breakglass nested struct with 4 env vars
(CERTCTL_BREAKGLASS_ENABLED + LOCKOUT_THRESHOLD + LOCKOUT_DURATION
+ LOCKOUT_RESET_INTERVAL).
Router wiring:
- 4 new breakglass routes registered when reg.AuthBreakglass != nil;
public login route via direct r.mux.Handle (auth-exempt), 3 admin
routes via r.Register + rbacGate(auth.breakglass.admin).
- POST /auth/breakglass/login pinned in AuthExemptRouterRoutes
allowlist with Phase 7.5 justification.
- SpecParityExceptions extended with 4 new entries documenting
the Phase 7.5 deferral of full per-endpoint OpenAPI rows
(handler doc-block at the top of auth_breakglass.go is the
operator-facing reference).
Threat model (encoded in service.go + auth_breakglass.go doc-blocks
+ migration 000038 docstrings, to be promoted to docs/operator/auth-
threat-model.md in Phase 12):
- Break-glass is a deliberate bypass of the SSO security boundary.
An attacker who phishes the password OR finds it in a compromised
password manager bypasses MFA, OIDC, and every group-claim gate.
- Recommendation: keep CERTCTL_BREAKGLASS_ENABLED=false in steady-
state. Enable only during SSO-broken incidents. Disable after
recovery.
- WebAuthn pairing (v3 per Decision 12) is the load-bearing second
factor. Without it, break-glass is best treated as an emergency-
only path.
- Audit trail surfaces every break-glass action under
event_category=auth; the auditor role can monitor for unexpected
break-glass logins.
Verifications: gofmt clean, go vet clean across all touched packages,
go test -short -count=1 green across internal/auth/oidc (3.0s; new
Phase 7 hook tests integrated alongside the 21+ Phase 3 negatives),
internal/auth/breakglass (3.6s; 8 spec-mandated negatives + coverage
batch passing), internal/config + internal/domain/auth + internal/api/
router + internal/api/handler all green, no regressions in Bundle 1
packages.
|
||
|
|
9c679a5960 |
auth-bundle-2 Phase 5: OIDC + session HTTP surface (13 endpoints),
pre-login store, OpenID Connect Back-Channel Logout 1.0, cookieAuth
scheme, 7 new auth permissions, CI guard, handler tests
Phase 5 of the bundle puts the Phase 3 OIDC service + Phase 4 session
service on the wire. 13 HTTP endpoints split into three logical groups:
Public OIDC handshake (auth-exempt; protocol-mediated):
GET /auth/oidc/login?provider=<id> -> 302 to IdP authorization URL
+ sets certctl_oidc_pending cookie
(10-min TTL, Path=/auth/oidc/,
SameSite=Lax)
GET /auth/oidc/callback?code=...&state=... -> consume pre-login row,
run Phase 3's 11-step token
validation, mint post-login
session, 302 to dashboard
POST /auth/oidc/back-channel-logout -> OpenID Connect BCL 1.0 — IdP
POSTs logout_token JWT; certctl
validates signature against IdP
JWKS via Phase 3 alg allow-list,
required claims (iss/aud/iat/jti/
events; exactly one of sub/sid;
nonce ABSENT per spec §2.4),
revokes matching sessions,
returns 200 with
Cache-Control: no-store
POST /auth/logout -> revoke caller's session
Session management (RBAC-gated auth.session.*):
GET /api/v1/auth/sessions -> auth.session.list (own / all)
DELETE /api/v1/auth/sessions/{id} -> auth.session.revoke (own bypass)
OIDC provider + group-mapping CRUD (RBAC-gated auth.oidc.*):
GET /api/v1/auth/oidc/providers -> auth.oidc.list
POST /api/v1/auth/oidc/providers -> auth.oidc.create
(client_secret encrypted
at rest via
internal/crypto.EncryptIfKeySet)
PUT /api/v1/auth/oidc/providers/{id} -> auth.oidc.edit
DELETE /api/v1/auth/oidc/providers/{id} -> auth.oidc.delete
(refused via
ErrOIDCProviderInUse → 409
when users authenticated
via this provider)
POST /api/v1/auth/oidc/providers/{id}/refresh -> auth.oidc.edit
(re-runs IdP downgrade
defense via
OIDCService.RefreshKeys)
GET /api/v1/auth/oidc/group-mappings -> auth.oidc.list
POST /api/v1/auth/oidc/group-mappings -> auth.oidc.edit
DELETE /api/v1/auth/oidc/group-mappings/{id} -> auth.oidc.edit
Migration 000037 ships:
- oidc_pre_login_sessions table (10-min absolute TTL, FK CASCADE on
oidc_provider_id, FK RESTRICT on signing_key_id; index on
absolute_expires_at for the GC sweep);
- 7 new permissions seeded into r-admin only:
auth.session.list, auth.session.list.all, auth.session.revoke,
auth.oidc.list, auth.oidc.create, auth.oidc.edit, auth.oidc.delete
CanonicalPermissions extended in lockstep at internal/domain/auth/
validate.go.
Pre-login machinery:
- internal/repository/oidc.go gains PreLoginRepository interface +
PreLoginSession struct + ErrPreLoginNotFound / ErrPreLoginExpired
sentinels.
- internal/repository/postgres/oidc_prelogin.go ships the impl;
LookupAndConsume uses DELETE ... RETURNING for atomic single-use.
- internal/auth/oidc/prelogin.go is the PreLoginAdapter that bridges
the OIDC service's Phase 3 PreLoginStore interface to the new
repository, signing the cookie value under the active
SessionSigningKey via the same v1.<id>.<key>.<HMAC> wire format
Phase 4 uses for post-login cookies. Defense-in-depth: the
pre-login `pl-` prefix is enforced by ParseCookieValue(prefix);
a stolen pre-login cookie cannot be replayed against the
post-login Validate path (pinned by
TestService_Validate_RejectsPreLoginCookieAtPostLoginGate).
Session package extension:
- internal/auth/session/service.go gains exported SignCookieValue,
ParseCookieValue (with caller-supplied id-1 prefix), ComputeCookieHMAC,
DecryptKeyMaterial wrappers so the OIDC pre-login adapter shares
the same length-prefixed HMAC math without code duplication.
- parseCookie no longer hardcodes the `ses-` prefix check (moved to
Validate as defense-in-depth; pre-login cookie verification uses
the `pl-` prefix via ParseCookieValue).
Cookie attributes (all Phase 5 endpoints honor CERTCTL_SESSION_SAMESITE
+ Secure=true via SessionCookieAttrs from Phase 4 config):
- certctl_oidc_pending: Path=/auth/oidc/, MaxAge=600s, SameSite=Lax
(cannot be Strict because the IdP-initiated callback is a top-level
navigation from a different origin).
- certctl_session: Path=/, Expires=8h, SameSite=Lax|Strict, HttpOnly.
- certctl_csrf: Path=/, Expires=8h, HttpOnly=false (intentional —
GUI must read it to echo into X-CSRF-Token header).
Audit logging on every mutating operation (event_category="auth"):
auth.oidc_login_succeeded / failed / unmapped_groups
auth.oidc_back_channel_logout / failed
auth.session_revoked
auth.oidc_provider_{created,updated,deleted,refreshed}
auth.group_mapping_{added,removed}
OpenAPI updates:
- cookieAuth security scheme added to api/openapi.yaml under
components.securitySchemes (apiKey / cookie / certctl_session).
- The 13 Phase 5 routes are added to SpecParityExceptions with a
deferral note: full per-endpoint OpenAPI rows land in a follow-on
commit alongside the GUI work (Phase 8) so the ergonomic shape can
be validated against the live GUI client.
CI guard: scripts/ci-guards/N-bundle-2-security-empty-preserved.sh
asserts api/openapi.yaml has ≥ 14 'security: []' occurrences (the
pre-Bundle-2 baseline). Reducing the count below 14 would silently
force a Bearer-or-cookie requirement onto an endpoint that legitimately
runs without certctl-issued credentials; the guard fires before that
regression lands.
Handler tests (internal/api/handler/auth_session_oidc_test.go):
- All 6 prompt-mandated negative cases:
BCL with missing events claim -> 400
BCL with nonce present -> 400 (per spec §2.4)
BCL with sig signed by an unknown key -> 400
Callback with replayed state -> 400
Callback with PKCE verifier mismatch -> 400
Callback with expired pre-login row -> 400
- Plus happy paths for every endpoint, edge cases (missing-cookie,
duplicate-name, in-use-409, wrong-tenant), and the Helper-function
coverage (peekIssuer, classifyOIDCFailure, defaultIfBlank,
defaultIntIfZero, clientIPFromRequest, encryptClientSecret).
Coverage on internal/api/handler/auth_session_oidc.go: 80.9% per-function
(above the Phase 5 spec's ≥ 80% floor).
Server wiring (cmd/server/main.go):
Wired AFTER sessionService (Phase 4) so the OIDC PreLoginAdapter can
sign pre-login cookies under the active SessionSigningKey:
oidcProviderRepo + oidcMappingRepo + oidcUserRepo + oidcPreLoginRepo
-> preLoginAdapter -> oidcService -> authSessionOIDCHandler.
sessionMinterAdapter shim bridges *session.Service.Create to the
oidcsvc.SessionMinter port the OIDC service consumes.
Router wiring (internal/api/router/router.go):
4 public OIDC routes via direct r.mux.Handle (auth-exempt; pinned in
AuthExemptRouterRoutes); 9 RBAC-gated routes via r.Register +
rbacGate(checker, perm, h). Routes only register when
reg.AuthSessionOIDC != nil so pre-Phase-5 builds skip the block
entirely.
Verifications: gofmt clean, go vet clean across all touched packages,
go test -short -count=1 green across internal/api/handler (74 tests +
new Phase 5 batch), internal/api/router (parity + auth-exempt
allowlist), internal/auth/oidc + session (no regressions), full domain
+ scheduler + config sweeps green, ci-guard
N-bundle-2-security-empty-preserved.sh green (17 ≥ 14 baseline).
|
||
|
|
315e132981 |
auth-bundle-2 Phase 2a: SQL migrations (oidc_providers, sessions, users)
Three new idempotent transactional migrations that materialize the
Phase 1 domain types into Postgres tables. Repository implementations
+ integration tests land as Phase 2b in the next commit.
migrations/000034_oidc_providers.up.sql:
oidc_providers table with the full OIDCProvider field set
(issuer_url + client_id + client_secret_encrypted v2 blob +
redirect_uri + groups_claim_path + groups_claim_format +
fetch_userinfo + scopes[] + allowed_email_domains[] +
iat_window_seconds + jwks_cache_ttl_seconds + tenant_id).
group_role_mappings table linking provider+group_name to role_id.
Closed-enum CHECK on groups_claim_format ('string-array' or
'json-path').
Defense-in-depth bounds CHECKs on iat_window_seconds (1..600) and
jwks_cache_ttl_seconds (>= 60); app-layer Validate() also
enforces these.
ON DELETE CASCADE on group_role_mappings.provider_id so deleting a
provider cleans up its mappings.
ON DELETE RESTRICT on group_role_mappings.role_id so an in-use role
can't be silently dropped.
migrations/000035_sessions.up.sql:
session_signing_keys table with key_material_encrypted v2 blob +
retired_at nullable + the retired-after-created CHECK.
Partial index on (tenant_id, created_at DESC) WHERE retired_at IS
NULL backs the GetActive hot path.
sessions table covers BOTH the post-login row (1h-idle/8h-absolute
cookie lifecycle) AND the Phase 5 pre-login row (10-minute TTL,
is_pre_login=true). csrf_token_hash holds the SHA-256 of the
CSRF token plaintext (the plaintext lives in a separate
JS-readable cookie, hashed here so a DB-read leak can't replay).
Two CHECK constraints pin the expiry order (absolute > idle, idle >
created); these match the Phase 1 domain Validate() pre-write
invariants but enforce them at the DB layer too so direct SQL
inserts can't silently land malformed rows.
Partial indexes on actor_id (active sessions only), the active
session lookup, the pre-login GC sweep (created_at), and the
absolute-expired GC sweep (absolute_expires_at) cover the four
hot paths Phase 4's service consumes.
ON DELETE RESTRICT on sessions.signing_key_id so a signing key
referenced by an active session can't be dropped (the retention
window keeps retired keys valid; full purge waits until every
session signed under that key has expired).
migrations/000036_users.up.sql:
users table for federated-human identity (per-(provider, subject)
tuple via UNIQUE constraint, not global - identity is per-IdP by
design).
webauthn_credentials JSONB DEFAULT '[]' reserved for v3 (Decision
12); Bundle 2 always stores [].
Email index for the GUI's "find user by email" surface (not unique
because the same email can appear in multiple providers per the
per-IdP identity model).
ON DELETE RESTRICT on users.oidc_provider_id keeps Phase 3's "delete
provider only when no users authenticated via it" rule enforced
at the DB layer; the OIDCProviderRepository.Delete impl will
translate SQLSTATE 23503 into a 409 sentinel.
All three migrations:
Wrapped in BEGIN/COMMIT so partial-fail leaves no half-state.
IF NOT EXISTS / IF EXISTS / ON CONFLICT DO NOTHING for idempotency
(the certctl-server boot path applies every migration on every
start per CLAUDE.md "Idempotent migrations" architecture rule).
TIMESTAMPTZ for time columns (no TIMESTAMP WITHOUT TIME ZONE).
TEXT primary keys with prefixes per CLAUDE.md "Architecture
Decisions" (op- / grm- / sk- / ses- / u-).
Multi-tenant ready: tenant_id column with DEFAULT 't-default' on
every row, FK to tenants(id) ON DELETE CASCADE. Bundle 2 ships
single-tenant; managed-service activation adds tenants without a
schema migration.
Down migrations exist in lockstep, drop tables in FK-safe order
(group_role_mappings -> oidc_providers; sessions ->
session_signing_keys; users alone). Down-migrations are destructive;
docstrings call this out.
Verifications:
Migration count: ls migrations/*.up.sql | wc -l = 36 (33 from
Bundle 1 + 3 new).
BEGIN/COMMIT pair counts: each new migration is 1:1.
No Docker in this sandbox, so the migrations are not applied
end-to-end here; CI's testcontainers harness runs them via
postgres.RunMigrations on every push. Phase 2b's repository
integration tests will exercise the schema against Postgres 16
Alpine.
|
||
|
|
efea4d0e03 |
auth-bundle-1 fix: bundled certctl-agent restart loop (latent since 2026-03-14)
The bundled `docker-compose.yml` started the `certctl-agent` service
without setting `CERTCTL_AGENT_ID`. `cmd/agent/main.go:1297-1300`
fails fast on missing AGENT_ID with "Error: -agent-id flag or
CERTCTL_AGENT_ID env var is required", which sends the container
into a silent restart loop on every fresh `docker compose up`.
Latent since commit
|
||
|
|
45122d7edb |
auth-bundle-1 fix: migration 000029 role_permissions NULL scope_id
Real bug an external tester (operator) hit on first docker compose up:
failed to execute migration 000029_rbac.up.sql: pq: null value in
column "scope_id" of relation "role_permissions" violates
not-null constraint
# Root cause
The role_permissions table declared scope_id TEXT (nullable) but
also declared
PRIMARY KEY (role_id, permission_id, scope_type, scope_id)
In Postgres, PRIMARY KEY columns are implicitly NOT NULL — the
PK constraint silently overrode the column-level nullability. So
every global-scope INSERT (which legitimately has scope_id=NULL
per the CHECK constraint that requires it) tripped the NOT NULL.
The schema was never reachable in the unit-test suite because
the in-memory fakes don't enforce Postgres semantics, and the
postgres integration tests skip on -short. First contact with a
real postgres:16-alpine boot caught it.
# Fix
Switch to a synthetic BIGSERIAL primary key + a UNIQUE NULLS NOT
DISTINCT constraint on the natural key
(role_id, permission_id, scope_type, scope_id):
- BIGSERIAL primary key satisfies Postgres's PK-implies-NOT-NULL.
- UNIQUE NULLS NOT DISTINCT (Postgres 15+; the project targets
postgres:16-alpine) treats two NULL scope_ids as colliding,
which is what the seed's ON CONFLICT (...) DO NOTHING relies
on to make re-running the migration idempotent.
- The CHECK (scope_type='global' AND scope_id IS NULL OR
scope_type IN ('profile','issuer') AND scope_id IS NOT NULL)
still enforces the per-row invariant.
The ON CONFLICT (col1, col2, ...) clauses in the seed and in
RoleRepository.AddPermission infer the unique index from the
column list and still resolve correctly against the renamed
constraint — no other changes needed.
# Verification
After this commit, docker compose up -d --build should boot
clean: postgres becomes healthy, certctl-tls-init exits 0,
certctl-server applies all 33 migrations including 000029,
backfills the 7 default roles + 33-permission catalogue + the
synthetic actor-demo-anon admin grant, and starts serving on
:8443.
docker compose -f deploy/docker-compose.yml \
-f deploy/docker-compose.demo.yml down -v
docker compose -f deploy/docker-compose.yml \
-f deploy/docker-compose.demo.yml up -d --build
sleep 15
curl -sk https://localhost:8443/api/v1/auth/me | jq
# Expect: actor_id=actor-demo-anon, admin=true, roles=[r-admin]
|
||
|
|
69a508dfcf |
auth-bundle-1 Phase 9 + 10: approval-bypass closure + RBAC GUI
# Phase 9 — approval-bypass closure (Decision 9, option a)
* Migration 000033_approval_kinds.up.sql: ALTER TABLE
issuance_approval_requests ADD COLUMN approval_kind +
payload JSONB; relax certificate_id + job_id to nullable;
CHECK (approval_kind IN ('cert_issuance','profile_edit'))
+ CHECK (per-kind nullability invariant) + index on
approval_kind. Idempotent throughout via DO blocks.
* domain.ApprovalKind enum (cert_issuance / profile_edit) +
IsValidApprovalKind. ApprovalRequest gains Kind +
Payload []byte for the pending profile diff.
* postgres.ApprovalRepository.Create + scanApprovalRow extended
to round-trip the new columns; certificate_id + job_id
switched to sql.NullString so profile_edit rows persist
cleanly. Default Kind=cert_issuance preserves back-compat
for every Phase-7-2026-05-03 caller.
* ApprovalService.RequestProfileEditApproval: new entry point
that creates a pending profile-edit row carrying the
serialized profile diff. Bypass mode (CERTCTL_APPROVAL_BYPASS)
short-circuits the same way it does for cert_issuance.
* ApprovalService.SetProfileEditApply hook: cmd/server/main.go
registers a closure that deserializes req.Payload + persists
via profileRepo.Update + emits a profile.edit_applied audit
row with category=auth. The hook avoids the Approval ↔
Profile import cycle.
* ProfileService.UpdateProfile: gates when (a) the live
profile carries RequiresApproval=true, OR (b) the proposed
edit would set it true. Returns ErrProfileEditPendingApproval
with the new approval ID; ProfileHandler maps to HTTP 202
Accepted + {pending_approval_id}. Both arms close the
flip-flop loophole because every transition through an
approval-tier profile fires the gate.
* TestProfileEdit_RequiresApprovalLoopholeClosed pins all 3
bypass attempts (flip-off / kept-on / flip-on) gated; nil-
approval-service preserves pre-Phase-9 direct-apply for
test fixtures.
* Approval service tests gain 4 profile_edit rows: pending row
shape; same-actor self-approve rejected with
ErrApproveBySameActor (load-bearing two-person integrity);
approve fails-closed when apply callback unwired;
apply callback invoked on approve.
* docs/reference/profiles.md (new) explains the gate +
edit response shape (202) + same-actor invariant + bypass
+ audit hooks.
# Phase 10 — RBAC management GUI
* useAuthMe hook (web/src/hooks/useAuthMe.ts): TanStack Query
fetches /api/v1/auth/me on app boot, caches for 60s, exposes
hasPerm(p) + hasAnyPerm + isAdmin predicates. Every Phase-10
page consumes this on mount + gates affordances against the
cached effective_permissions slice. Server-side enforcement
is the load-bearing gate; client-side hide/disable is UX.
* New routes:
- /auth/roles — list (auth.role.list); create-role modal
(auth.role.create) hidden when missing.
- /auth/roles/:id — detail + permissions; edit
(auth.role.edit), delete (auth.role.delete), add/remove
permission affordances each gated.
- /auth/keys — list of every actor with role grants; assign
+ revoke modals (auth.role.assign). actor-demo-anon
flagged system-managed; mutation buttons hidden for it.
- /auth/settings — stub showing /v1/auth/me identity +
bootstrap-endpoint availability via /v1/auth/bootstrap.
* AuditPage extended with category filter ('All categories'
+ the 3 enum values from migration 000032). Selection flows
to the API call params + the URL-driven query state.
* Layout: 3 new nav entries (Roles / API Keys / Auth Settings).
* api/client.ts: 12 new exported functions for the RBAC
surface (authMe, list/get/create/update/delete role,
list/add/remove role permissions, list keys, assign/revoke
key role, bootstrap-availability probe).
* data-testid attributes on every interactive element so a
future Playwright suite can assert behavior without brittle
CSS selectors.
* Empty state, error state, and unsaved-changes warnings on
every form per the prompt's implementation rules.
# Frontend tests
* RolesPage.test.tsx (6 tests): list render, empty state,
error state, hide-create-button-without-perm,
show-create-button-with-perm, submit-create-modal.
* KeysPage.test.tsx (3 tests): demo-anon flagged
system-managed (no buttons), permission-gated affordance
hide for auditor caller, assign-modal-POST contract.
* AuthSettingsPage.test.tsx (2 tests): identity surface,
bootstrap-OPEN-status surface.
* AuditPage.test.tsx (+1): category-filter select renders
with the 4 documented options.
15 frontend tests total in src/pages/auth/ + the audit
category-filter test; all pass via npx vitest run.
# Verifications
* go vet ./... clean.
* staticcheck across internal/auth + handler + router + cli +
service + repository + cmd + domain: clean.
* gofmt -l clean repo-wide.
* go test -short -count=1 green across internal/service,
internal/api/handler, internal/api/router, internal/auth,
internal/auth/bootstrap, internal/service/auth,
internal/domain/auth, cmd/server, cmd/cli, internal/cli.
* npx tsc --noEmit clean.
* npm run build green (vite build produces dist/index.html
+ 946KB JS bundle; chunk-size warning is pre-existing).
* npx vitest run src/pages/auth/ src/pages/AuditPage.test.tsx
green (15 tests, 4 files).
|
||
|
|
3ef45e2ad4 |
auth-bundle-1 Phase 6-7-8: bootstrap path + scope-down CLI + auditor-role split
# Phase 6 — day-0 admin bootstrap * internal/auth/bootstrap/ (new package): Strategy interface + EnvTokenStrategy with constant-time compare, one-shot consumption via sync.Mutex, optional admin-existence probe. Bundle 2's OIDC- first-admin will plug in alongside as an alternate Strategy. * BootstrapService.ValidateAndMint: validates the operator's CERTCTL_BOOTSTRAP_TOKEN, mints a 32-byte (64-hex-char) random API key value, persists the SHA-256 hash to api_keys, grants r-admin via actor_roles, AddHashed's the runtime keystore so the just- minted key authenticates the next request without restart, and records bootstrap.consume to the audit trail with category=auth. * internal/auth/keystore.go (new): KeyStore interface + StaticKeyStore (immutable env-var-only path) + MutableKeyStore (env-var keys + DB-loaded api_keys + runtime AddHashed). The auth middleware now consumes a KeyStore so the bootstrap path can extend the lookup table at runtime. * migrations/000031_api_keys.up/down.sql: api_keys table with (id, name UNIQUE, key_hash UNIQUE, tenant_id, admin, created_by, created_at, expires_at, last_used_at). Idempotent. * /v1/auth/bootstrap GET (probe) + POST (mint) — auth-exempt. Both routes documented in api/openapi.yaml + AuthExemptRouterRoutes allowlist updated. The token never leaves internal/auth/bootstrap; the minted plaintext key flows only into the HTTP response body. * Startup warning emitted when CERTCTL_BOOTSTRAP_TOKEN is set AND admin actors already exist (config drift signal). * Tests: 4 strategy invariants (empty token born disabled, wrong token=ErrInvalidToken without consumption, one-shot consumption, admin-exists closes path), 5 service tests (happy path + actor- name validation + propagation of strategy errors + nil-deps guard + 32-byte entropy budget), 8 HTTP-handler tests (status 201/410/401/400 mapping + token-leak hygiene scan of slog + audit details + Location header). Token-leak test redirects slog.Default to a buffer for the test scope. # Phase 7 — API-key migration + scope-down CLI * GET /v1/auth/keys handler + service method ListKeys backed by ActorRoleRepository.ListDistinctActors. Returns one row per (actor_id, actor_type) pair with the slice of role IDs they hold. Permission: auth.role.list. * internal/cli/auth_scope_down.go: AuthListKeys, AuthScopeDown (interactive), AuthScopeDownNonInteractive (JSON config), AuthScopeDownSuggest (--suggest with optional --apply). The synthetic actor-demo-anon is filtered out of every interactive / bulk path; non-interactive flow logs and skips it explicitly. * SuggestRoleFromAuditEvents (pure function): walks 30 days of audit events per actor and returns the narrowest matching role (admin / mcp / viewer / agent / operator) plus a one-line reason. Classification: any admin-shaped action wins; otherwise all-MCP → mcp; all-read-only → viewer; all-agent-shaped → agent; otherwise operator. Test table pins all six classifications. * CLI subcommand tree extended: 'auth keys list' + 'auth keys scope-down [--non-interactive <cfg>] [--suggest [--apply]]'. * CHANGELOG.md leads v2.1.0 with the SECURITY: AUDIT YOUR API KEYS call-out + four flow examples. # Phase 8 — auditor role + event_category column * migrations/000032_audit_category.up/down.sql: ALTER TABLE audit_events ADD COLUMN event_category TEXT NOT NULL DEFAULT 'cert_lifecycle' + CHECK constraint (cert_lifecycle/auth/config) + (event_category) and (event_category, timestamp DESC) indexes for the auditor-filter query path. WORM trigger from migration 000018 continues to enforce append-only at the DB layer (DDL is not blocked). * domain.AuditEvent gains EventCategory string (omitempty); domain.EventCategoryCertLifecycle / Auth / Config constants. * AuditService.RecordEventWithCategory sibling of RecordEvent; legacy callers stay on RecordEvent (defaults to cert_lifecycle). Auth callers (RoleService, ActorRoleService, BootstrapService) switched to RecordEventWithCategory(..., 'auth', ...). * GET /v1/audit?category=<cat>: handler accepts the optional query param, validates against the enum (400 on invalid value), dispatches through ListAuditEventsByCategory. OpenAPI updated with the new query param + AuditEvent.event_category schema. * Postgres AuditRepository.Create now writes event_category; AuditRepository.List filters on it; AuditFilter.EventCategory gates the WHERE clause. * Tests: 5 audit-category-filter HTTP tests (dispatch routing, back-compat fallback, 400 for invalid values, all 3 enum values accepted, page+category combine, JSON output surfaces the field). 3 auditor-role invariants (auditor holds exactly audit.read+audit.export, no mutating perms, disjoint from viewer except audit.read). # Cross-phase wiring * HandlerRegistry.Bootstrap field added; cmd/server/main.go wires the bootstrap service ahead of RegisterHandlers (extracted assembleNamedAPIKeys helper into auth_backfill.go, moved the keystore + bootstrap construction up alongside the auth repos). * AuthCheckResolver / AuthActorRoleService extended with ListKeys to satisfy the Phase 7 surface; existing fakes updated. * fakeAudit + mockAuditService stubs in tests gain RecordEventWithCategory + ListAuditEventsByCategory; existing tests untouched. # Verifications * gofmt -l: clean across every modified file. * go vet ./...: clean. * staticcheck across internal/auth + handler + router + cli + service + repository + cmd + domain: clean. * go test -short -count=1: green across every Bundle-1-touched package — internal/auth (incl. bootstrap), internal/api/handler, internal/api/router, internal/cli, internal/service/auth, internal/service, internal/domain/auth, internal/repository/postgres, cmd/server, cmd/cli, plus internal/scheduler, internal/api/middleware, cmd/agent, internal/mcp. |
||
|
|
7ff2e2de08 |
auth-bundle-1 Phase 3.5: handler IsAdmin -> router-wrapped RequirePermission
Phase 3.5 atomic conversion. The five legacy admin-gated handlers (bulk_revocation, admin_crl_cache, admin_scep_intune, admin_est, intermediate_ca) had their in-body auth.IsAdmin checks removed; the gate moved to router.go via auth.RequirePermission middleware wrapping each route. Non-admin operators with the right scoped permission can now reach these endpoints; legacy in-body admin checks no longer block them.
Migration 000030_rbac_admin_perms.up.sql ships five admin-only fine-grained permissions: cert.bulk_revoke, crl.admin, scep.admin, est.admin, ca.hierarchy.manage. All five are seeded into r-admin only; operator/viewer/agent/mcp/cli/auditor do not receive them by default. Operators can grant any of these to a custom role via the Phase 4 RBAC API. Idempotent + transaction-wrapped.
internal/domain/auth/validate.go::CanonicalPermissions extended with the five new entries so RoleService.AddPermission accepts them.
internal/api/router/router.go: HandlerRegistry gains a Checker field (auth.PermissionChecker). New rbacGate(checker, perm, handler) helper wraps a handler with auth.RequirePermission middleware; nil-checker fall-through preserves test/demo deployments without the RBAC stack. 12 admin routes wrapped: cert.bulk_revoke (POST /api/v1/certificates/bulk-revoke + POST /api/v1/est/certificates/bulk-revoke), crl.admin (GET /api/v1/admin/crl/cache), scep.admin (GET /api/v1/admin/scep/profiles + GET /api/v1/admin/scep/intune/stats + POST /api/v1/admin/scep/intune/reload-trust), est.admin (GET /api/v1/admin/est/profiles + POST /api/v1/admin/est/reload-trust), ca.hierarchy.manage (POST /api/v1/issuers/{id}/intermediates + GET /api/v1/issuers/{id}/intermediates + POST /api/v1/intermediates/{id}/retire + GET /api/v1/intermediates/{id}).
cmd/server/main.go: HandlerRegistry.Checker wired with the same authPermissionCheckerAdapter shim Phase 4 introduced for AuthHandler. Same adapter; one source of truth.
Handler bodies: removed eight in-body auth.IsAdmin checks across the 5 files. bulk_revocation.go's BulkRevoke + BulkRevokeEST, admin_crl_cache.go::ListCache, admin_scep_intune.go's three methods, admin_est.go's two methods, intermediate_ca.go's four methods. Replaced each with a comment naming the new gate location. Unused 'github.com/certctl-io/certctl/internal/auth' imports removed.
Test triplet rewrite: deleted obsolete _NonAdmin_Returns403 and _AdminExplicitFalse_Returns403 tests across 6 test files (5 handler tests + bulk_revocation_est_test.go) — they tested the now-removed in-body gate. _AdminPermitted_ForwardsActor tests stay intact: they pin the actor-passthrough invariant which is still relevant. Added internal/api/router/rbac_gate_integration_test.go with four router-level integration tests pinning the new gate: deny → 403 + handler not reached, permit → 200 + handler reached, nil-checker → fall-through, no-actor → 401.
M-008 admin-gate registry: AdminGatedHandlers map now empty (Phase 3.5 invariant: zero in-handler auth.IsAdmin call sites; only health.go's informational caller remains). m008_admin_gate_test.go retains the scan to enforce the invariant going forward; new admin-gated routes must wrap at router.go::rbacGate, not gate in-handler. Updated error message to direct future contributors to the new pattern.
Verifications: gofmt clean across all touched files; go vet ./... clean; go test -short across internal/auth, internal/service/auth, internal/api/handler, internal/api/router, cmd/server all green.
Branch: dev/auth-bundle-1. Commit chain:
|
||
|
|
19497eef87 |
auth-bundle-1 Phase 1: RBAC schema + domain types + repository layer
Bundle 1 / Phase 1: ships the RBAC primitive as schema + domain types + repo layer. Service-layer wiring lands in Phase 2; middleware integration in Phase 3.
Schema (migrations/000029_rbac.up.sql, 272 lines, idempotent, transaction-wrapped):
tenants, roles, permissions, role_permissions, actor_roles. TEXT primary keys with prefixes (t-, r-, p-, ar-) per CLAUDE.md Architecture Decisions. TIMESTAMPTZ time columns. FK cascade explicit (tenant CASCADE, role RESTRICT, actor CASCADE). Three-value scope_type CHECK ('global', 'profile', 'issuer') matched 1:1 with internal/domain/auth.ScopeType. UNIQUE(tenant_id, name) on roles, UNIQUE(name) on permissions, UNIQUE(actor_id, actor_type, role_id, tenant_id) on actor_roles.
Seeds: t-default tenant, 7 default roles (admin, operator, viewer, agent, mcp, cli, auditor), 33-permission canonical catalogue (cert.* / profile.* / issuer.* / target.* / agent.* / audit.* / auth.role.* / auth.key.* / auth.bootstrap.use), full role->permission grant matrix at global scope. Demo-mode preservation: actor-demo-anon seeded with admin role unconditionally; Phase 3 wires the auth middleware to inject this actor into the context when CERTCTL_AUTH_TYPE=none. Reserved system actor; Phase 4 API rejects mutations / deletions targeting it with 409 Conflict.
Domain types (internal/domain/auth/{types,validate,validate_test}.go):
Tenant, Role, Permission, RolePermission, ActorRole structs with JSON tags. ScopeType enum (global/profile/issuer). ActorTypeValue mirrors internal/domain.ActorType to avoid an import cycle. CanonicalPermissions slice + DefaultRoles map are the single source of truth referenced by the migration; validate_test.go pins (a) no duplicate permissions, (b) every default-role permission is canonical, (c) admin holds the full catalogue, (d) seeded IDs carry the prefix convention, (e) ScopeType enum has exactly 3 values matching the CHECK constraint.
Extended internal/domain/audit.go: added ActorTypeAPIKey + ActorTypeAnonymous to the existing User/System/Agent enum so the audit trail can distinguish API-key requests from federated humans (Bundle 2 OIDC) and demo-mode (CERTCTL_AUTH_TYPE=none). Existing code that records actor_type=User keeps working; new APIKey value used by Bundle 1 Phase 3 middleware.
Repository layer (internal/repository/auth.go + internal/repository/postgres/auth.go):
TenantRepository (Get, List, EnsureDefault). RoleRepository (Get, GetByName, List, Create, Update, Delete with ErrAuthRoleInUse on FK RESTRICT, ListPermissions, AddPermission idempotent, RemovePermission). PermissionRepository (List, GetByName, IsCanonical for fail-fast catalog check). ActorRoleRepository (ListByActor, ListByRole, Grant idempotent, Revoke, EffectivePermissions which is the JOIN that auth.RequirePermission will use in Phase 3 — returns deduplicated (permission, scope) triples honouring the not-yet-expired predicate so future time-bound grants work without code change). Sentinel errors ErrAuthNotFound, ErrAuthDuplicateName, ErrAuthRoleInUse, ErrAuthReservedActor, ErrAuthUnknownPermission for handler-layer 404/409/400 mapping.
Verification: gofmt clean, go vet ./... clean, go test -short ./internal/domain/auth ./internal/repository/postgres pass. Integration tests against a live Postgres are gated by testing.Short() per repo convention; Phase 12 wires the testcontainers harness for full e2e coverage.
Branch: dev/auth-bundle-1. Phase 0 was
|
||
|
|
7a9ae3157f |
fix(seed): repair deployment_targets FK violation crashing fresh demo boot
The Rank 5 cloud-target seed rows in `seed_demo.sql` referenced a non-existent `ag-server` agent_id. On every fresh-clone `docker compose -f deploy/docker-compose.yml -f deploy/docker-compose.demo.yml up` the server crash-looped at the demo-seed step: pq: insert or update on table "deployment_targets" violates foreign key constraint "deployment_targets_agent_id_fkey" Origin: commit |
||
|
|
75097909e9 | |||
|
|
66d2af36a7 |
domain, migrations: IntermediateCA type + intermediate_cas + Issuer.HierarchyMode
Rank 8 of the 2026-05-03 deep-research deliverable, commit 1 of 5
(cowork/rank-8-intermediate-ca-hierarchy-prompt.md). Closes the multi-
level CA hierarchy gap for FedRAMP boundary-CA, financial-services
policy-CA, and OT network-CA deployments where regulator-mandated
certificate-policy separation requires multiple layers (root → policy
→ issuing).
This commit lands ONLY the foundation — schema, types, repository
interface, postgres implementation. No service / connector / handler
wiring yet. The 5-commit chain is bisectable: this commit can ship
with no operator-visible behavior change until commits 2-5 wire the
service layer + the local-connector tree-mode + admin API + GUI tree
view + operator runbook. The default value for issuers.hierarchy_mode
is 'single' so every existing operator's behavior is byte-identical
post-migration.
Existing scaffolding REUSED (not redefined):
- internal/crypto/signer.Driver seam — every IntermediateCA carries
a key_driver_id pointing at the signer.Driver instance that owns
its private key. Defense in depth: NEVER persist key bytes in a
row. FileDriver is the production default; future PKCS11Driver /
CloudKMSDriver close the disk-exposure leg via the same seam.
- issuers.id row — the new intermediate_cas FK references it.
Files added:
internal/domain/intermediate_ca.go — IntermediateCA type,
IntermediateCAState
closed enum (active /
retiring / retired),
IsValidIntermediateCAState
+ IsTerminal helpers,
NameConstraint struct
(RFC 5280 §4.2.1.10
permitted+excluded
subtree subset
semantics for service-
layer enforcement),
HierarchyModeSingle /
HierarchyModeTree
constants.
internal/repository/postgres/intermediate_ca.go — IntermediateCARepository
impl: Create (ica-<slug>
ID gen, JSONB +
nullable-column round-
trip, lib/pq 23505 →
ErrAlreadyExists),
Get, ListByIssuer,
ListChildren,
UpdateState,
GetActiveRoot,
WalkAncestry (recursive
CTE — single SQL
round-trip, O(depth)
rows, leaf-first
ordering).
migrations/000028_intermediate_ca_hierarchy.{up,down}.sql
— idempotent schema.
issuers.hierarchy_mode
VARCHAR(20) DEFAULT
'single'. New
intermediate_cas table
with FKs to
issuers / self
(parent_ca_id) +
CHECK constraints
(closed-enum state,
not_after >
not_before, no self-
parent) + 6 indexes
(partial-unique
active root per
issuer, partial-
unique name per
issuer, owning
issuer, parent,
state, expiring).
Files modified:
internal/domain/connector.go — adds Issuer.HierarchyMode field
with full doc comment + JSON tag.
Empty string ≡ single mode for
back-compat.
internal/repository/interfaces.go — adds IntermediateCARepository
interface (7 methods).
Verified locally:
gofmt: clean.
go vet ./internal/domain/... ./internal/repository/...: exit 0.
go build ./internal/domain/... ./internal/repository/...: exit 0.
Out of scope for this commit (lands in commits 2-5):
- service/intermediate_ca.go (CreateRoot / CreateChild / Retire /
LoadHierarchy / AssembleChain + RFC 5280 §4.2.1.9 path-len +
§4.2.1.10 NameConstraints subset enforcement + 9 service tests).
- local connector rewrite + byte-equivalence pin
(TestLocal_HierarchyMode_SingleVsTree_ByteIdentical — the load-
bearing backwards-compat refusal-to-ship test).
- 4 admin-gated handler endpoints + OpenAPI extension + handler tests.
- web/src/pages/IssuerHierarchyPage.tsx.
- docs/intermediate-ca-hierarchy.md sysadmin runbook + connectors.md
row + WORKSPACE-ROADMAP follow-ons.
Reference: cowork/rank-8-intermediate-ca-hierarchy-prompt.md.
|
||
|
|
aebfd8bd7c |
Revert "chore: drop 'Infisical' label from internal references"
This reverts commit
|
||
|
|
19706e56b3 |
chore: drop 'Infisical' label from internal references
Strategic naming cleanup. Earlier doc-comments + commit messages framed Rank 4 / Rank 5 / Rank 7 work as 'Rank N of the 2026-05-03 Infisical deep-research deliverable' — the 'Infisical' qualifier was a holdover from the original deep-research framing where Infisical (a competing secrets-management platform) was the comparator. Keeping the comparator's name in our source adds noise without value; an external reader sees 'Infisical' and assumes a dependency or shared lineage rather than reading it as the competitive context it was. Mechanical sed across 34 files (32 source / docs + 2 follow-up Python passes to collapse 'deep-research deep-research' duplicates that emerged where the original phrase wrapped across lines): s|Infisical deep-research|deep-research|g s|infisical-deep-research-results|deep-research-results-2026-05-03|g s|infisical-deep-research-prompt|deep-research-prompt-2026-05-03|g s|infisical-deep-research|deep-research|g s|Infisical|deep-research|g s|deep-research deep-research|deep-research|g # collapse-pass Net diff: 63 insertions / 64 deletions across cmd/, docs/, internal/, migrations/. Pure text substitution; zero behavior change. Code path unchanged — go vet clean, tests for TestApproval pass on both internal/service and internal/api/handler packages. Workspace docs (cowork/) carry the same references and will be swept separately — they're not under certctl/ git control. The two filename references (cowork/infisical-deep-research-results.md + cowork/infisical-deep-research-prompt.md) get renamed alongside that sweep to deep-research-results-2026-05-03.md / deep-research-prompt-2026-05-03.md so cross-references in the certctl repo doc-comments resolve cleanly. |
||
|
|
2025275b43 |
domain, migrations: ApprovalRequest type + issuance_approval_requests + RequiresApproval
Rank 7 of the 2026-05-03 Infisical deep-research deliverable, commit 1 of 4
(cowork/rank-7-approval-workflow-primitive-prompt.md). The four-commit
chain ships the issuance approval-workflow primitive (request → human review
→ CA call) closing the two-person integrity / four-eyes principle
procurement gap for PCI-DSS Level 1, FedRAMP Moderate / High, SOC 2
Type II, and HIPAA-regulated PHI deployments.
This commit lands ONLY the foundation — schema, types, repository
interface, postgres implementation. No service / handler wiring yet.
The four-commit shape is bisectable: the schema can land in production
behind a flag (via the default RequiresApproval=false on every existing
profile) without any operator-visible behavior change until commits 2-4
wire the surrounding workflow.
Existing scaffolding REUSED (not redefined here):
- JobStatusAwaitingApproval enum value (internal/domain/job.go).
- JobRepository.ListTimedOutAwaitingJobs (postgres reaper query).
- Config.Scheduler.AwaitingApprovalTimeout (env-mapped via
CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT, default 168h = 7 days).
- Scheduler.SetAwaitingApprovalTimeout wiring.
Files added:
internal/domain/approval.go - ApprovalRequest type,
ApprovalState closed enum
(pending/approved/rejected/
expired), IsValidApprovalState +
IsTerminal helpers, outcome
const block + bypass-actor
sentinel.
internal/repository/postgres/approval.go - ApprovalRepository
implementation: Create
(ar-<slug> ID gen + JSONB
metadata round-trip + lib/pq
23505 → ErrAlreadyExists
translation), Get, GetByJobID,
List (paginated with state /
cert / requester filters),
UpdateState (pending→terminal
transitions only, with
already-terminal disambiguation),
ExpireStale (bulk reaper,
decided_by='system-reaper').
migrations/000027_approval_workflow.{up,down}.sql
- Idempotent IF NOT EXISTS /
IF EXISTS. Adds
certificate_profiles.requires_approval
BOOLEAN NOT NULL DEFAULT false,
issuance_approval_requests
table with FK to
managed_certificates / jobs /
certificate_profiles, four
indexes (state, certificate,
pending-age, partial-unique
pending-per-job), and the
approval_decision_consistency
CHECK constraint enforcing
decided_by/decided_at must be
non-null for terminal states.
Files modified:
internal/domain/profile.go - Adds CertificateProfile.RequiresApproval
bool field with full doc
comment + JSON tag. Defaults
to false (back-compat — every
existing profile keeps the
unattended renewal path).
internal/repository/interfaces.go - Adds ApprovalRepository
interface (6 methods) +
ApprovalFilter struct.
internal/repository/errors.go - Adds ErrAlreadyExists sentinel
for postgres SQLSTATE 23505
(unique-constraint violations
from the partial-unique
pending-per-job index, plus
the "already terminal" state-
transition signal). Mirrors
the existing ErrNotFound +
ErrForeignKeyConstraint shape.
Verified:
gofmt: clean.
go vet ./internal/domain/... ./internal/repository/...: exit 0.
go build ./internal/domain/... ./internal/repository/...: exit 0.
Out of scope for this commit (lands in commits 2-4):
- service/approval.go (RequestApproval / Approve / Reject / ListPending
/ ExpireStale + same-actor RBAC + bypass mode + audit + metrics).
- service/approval_metrics.go (decisions counter + pending-age histogram).
- 8 service-level table-driven tests including the load-bearing
TestApproval_Approve_RejectsSameActor two-person integrity pin.
- api/handler/approval.go (5 endpoints + RBAC integration).
- api/openapi.yaml (5 new operationIds).
- Integration into CertificateService.TriggerRenewal +
RenewalService.CheckExpiringCertificates + Scheduler.ReapTimedOutJobs.
- cmd/server/main.go wiring.
- Config.Approval.BypassEnabled + CERTCTL_APPROVAL_BYPASS env var.
- docs/connectors.md CertificateProfile config-table row.
- docs/approval-workflow.md operator playbook + compliance control mapping.
Reference: cowork/infisical-deep-research-results.md Part 5 Rank 7.
Acquisition prompt: cowork/rank-7-approval-workflow-primitive-prompt.md.
|
||
|
|
9a7e818f3e |
docs, seed: cloud-target operator runbook + AWS ACM / Azure KV demo seed rows
Wraps up Rank 5 of the 2026-05-03 Infisical deep-research deliverable (commits |
||
|
|
109f32ff41 |
notifications: per-policy multi-channel expiry-alert routing
Closes Rank 4 of the 2026-05-03 Infisical deep-research deliverable
(see cowork/infisical-deep-research-results.md Part 5). Pre-fix,
RenewalService.CheckExpiringCertificates already ran daily,
RenewalPolicy.AlertThresholdsDays drove per-cert thresholds, and
NotificationService.SendThresholdAlert deduped per (cert, threshold)
— but the channel was hardcoded to Email
(internal/service/notification.go:118 pre-fix). Operators who
configured PagerDuty / Slack / Teams / OpsGenie via
CERTCTL_PAGERDUTY_ROUTING_KEY etc. got nothing at any threshold
unless SMTP was also wired. Their first signal of an expired cert
was a 3 AM outage.
This commit lands the routing matrix on top of the existing
infrastructure:
1. RenewalPolicy gains AlertChannels (per-tier channel list) +
AlertSeverityMap (per-threshold tier assignment) +
EffectiveAlertChannels / EffectiveAlertSeverity accessors.
Default*() helpers preserve the back-compat Email-only
behaviour for operators who haven't touched their policies
post-upgrade. Migration 000026 adds the JSONB columns
idempotently.
2. NotificationService.SendThresholdAlertOnChannel — the new
per-channel dispatch helper. Old SendThresholdAlert stays as
an Email-only alias so non-policy callers (admin "send test
alert" surfaces) keep working byte-for-byte.
3. NotificationService.HasThresholdNotificationOnChannel — per-
(cert, threshold, channel) deduplication so a transient
PagerDuty 5xx today does NOT suppress today's Slack alert and
tomorrow's PagerDuty retry will still fire.
4. RenewalService.sendThresholdAlerts walks the resolved channel
set per threshold tier, fans out to every configured channel,
handles per-channel failures independently, defensively drops
off-enum channels with an audit row trail, and records a per-
channel audit event with metadata.channel + metadata.severity_tier.
5. service.ExpiryAlertMetrics — atomic counter table mirrored on
the VaultRenewalMetrics shape from the 2026-05-03 audit fix #5
(commit
|
||
|
|
ec88a61274 |
acme-server: foundation — directory + new-nonce + per-profile routing (Phase 1a/7)
First slice of the RFC 8555 ACME server endpoint (master plan at cowork/acme-server-endpoint-prompt.md, per-phase prompts at cowork/acme-server-prompts/). This commit lands the smallest viable end-to-end deployable slice: an ACME client running curl -sk https://certctl/acme/profile/<id>/directory curl -sk -I https://certctl/acme/profile/<id>/new-nonce successfully fetches the directory document and a Replay-Nonce. Account creation, JWS verification, orders, challenges, and revocation are all out of scope for this phase and arrive in Phases 1b–4. Closes the Rank 1 LHF from the 2026-05-03 Infisical deep-research (cowork/infisical-deep-research-results.md). Pre-fix, certctl was an ACME consumer only — no /acme/directory endpoint, no JWS verifier, no challenge validators. K8s customers running cert-manager could not point at certctl as an ACME issuer; they had to deploy a certctl agent on every node. What ships: - internal/api/acme/{directory,nonce,errors}.go (+ tests). - internal/api/handler/acme.go + acme_handler_test.go. - internal/repository/postgres/acme.go (nonce ops only — Phase 1b extends with account CRUD; Phases 2-4 extend with order / authz / challenge CRUD). - internal/service/acme.go (BuildDirectory + IssueNonce stubs; Phase 1b adds VerifyJWS / NewAccount / etc.). - migrations/000025_acme_server.{up,down}.sql ships the full 5-table ACME schema (acme_accounts / acme_orders / acme_authorizations / acme_challenges / acme_nonces) PLUS the per-profile certificate_profiles.acme_auth_mode column. Phase 1a actively uses only acme_nonces; remaining tables are empty until Phases 1b-4 plug in. - internal/config/config.go: ACMEServerConfig struct + ACMEServer field on Config. Env vars use CERTCTL_ACME_SERVER_* prefix to avoid colliding with the existing consumer-side ACMEConfig at config.go:1746 (CERTCTL_ACME_DIRECTORY_URL / PROFILE / CHALLENGE_TYPE etc.). Phase 1a wires Enabled + DefaultAuthMode + DefaultProfileID + NonceTTL + DirectoryMeta; Order/Authz TTLs + per-challenge-type concurrency caps + DNS01 resolver are reserved fields parsed in 1a so operators can set them ahead of Phases 2/3. - cmd/server/main.go: wire ACMEHandler into the HandlerRegistry literal alongside the existing certificate / EST / SCEP / etc. handlers. - internal/api/router/router.go: HandlerRegistry.ACME field + 6 Register calls (3 per-profile + 3 shorthand). - internal/api/router/openapi_parity_test.go: 6 new entries in SpecParityExceptions. ACME is a wire-protocol surface (JWS-signed JSON over HTTPS per RFC 7515) whose semantics are dictated by RFC 8555 + RFC 9773 rather than by an OpenAPI document, same precedent as SCEP/EST. The canonical reference is docs/acme-server.md. - docs/acme-server.md: Phase-1a-shaped reference. Configuration table for every CERTCTL_ACME_SERVER_* env var. Per-profile auth-mode decision tree skeleton. TLS trust bootstrap section flagging cert-manager's ClusterIssuer.spec.acme.caBundle requirement (the single biggest first-time-deploy footgun; the full cert-manager walkthrough lands in Phase 6 but the requirement is documented up front). Architecture decisions baked in: - URL family is /acme/profile/<id>/* (per-profile, canonical) with /acme/* shorthand active when CERTCTL_ACME_SERVER_DEFAULT_PROFILE_ID is set. Path matches existing per-profile precedent in EST + SCEP. - Auth mode is per-profile (acme_auth_mode column on certificate_profiles), NOT server-wide. One certctl-server can serve trust_authenticated for an internal-PKI profile and challenge for a public-trust-style profile simultaneously. The column is read at request time, not cached at server start — operators flipping a profile's mode via SQL take effect on the next order without restart. - Nonces are DB-backed (acme_nonces table). Survive server restart. The RFC 8555 §6.5 replay defense requires the store to outlast the client's nonce caching window; an in-memory-only nonce store would lose every in-flight order on restart. - Per-op atomic counters on service.ACMEService.Metrics() — certctl_acme_directory_total, certctl_acme_directory_failures_total, certctl_acme_new_nonce_total, certctl_acme_new_nonce_failures_total. Naming follows certctl frozen decision 0.10 cardinality discipline. Phase 1b will extend with new_account counters; Phase 2 with order / finalize / cert; Phase 3 with per-challenge-type counters. Audit fixes #11 + #12 (cowork/acme-server-prompts/audit-additions.md) applied: - #11: CERTCTL_ACME_SERVER_* prefix avoids the consumer-side CERTCTL_ACME_* namespace collision. - #12: prior-attempt WIP from two failed Phase-1 dispatches was discarded at phase start; this commit starts from a clean tree. Tests: - 14 unit tests in internal/api/acme/ (directory, nonce, errors). - 7 handler-level tests via httptest.NewServer + mockACMEService (mirrors the mockSCEPService pattern at scep_handler_test.go). - 7 service-layer tests with mocked repo + injected profileLookup. - All pass under -race -count=1 -short. Deferred to Phase 1b: - JWS verification (go-jose v4 — see master-prompt §8a for the API surface and audit doc for the speculation pitfalls). - new-account / account/<id> endpoints + AccountService. - Nonce *consumption* path (issue path is in this commit; consume is only invoked by JWS-verified POSTs which Phase 1b adds). Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-1a". Per-phase implementation plan: cowork/acme-server-prompts/. Master plan + audit fixes: cowork/acme-server-endpoint-prompt.md + cowork/acme-server-prompt-audit.md + cowork/acme-server-prompts/audit-additions.md. |
||
|
|
40fd96a416 |
feat(ocsp): pre-signed response cache + invalidate-on-revoke (Phase 2)
Production hardening II Phase 2 — closes the per-request live-signing
bottleneck for OCSP. Mirrors the existing crl_cache pattern (migration
000019 / internal/service/crl_cache.go) but per (issuer_id, serial_hex)
instead of per-issuer.
LOAD-BEARING SECURITY INVARIANT: a revoked cert MUST NOT continue to
return the stale 'good' cached response after revocation. The
RevocationSvc.RevokeCertificateWithActor flow now calls
OCSPResponseCacheService.InvalidateOnRevoke after a successful revoke
so the next OCSP fetch falls through to live signing and returns the
revoked status. Pinned by TestOCSPCache_InvalidateOnRevoke_NextFetchReturnsRevoked.
NEW migrations/000024_ocsp_response_cache.{up,down}.sql with composite
PK (issuer_id, serial_hex), nullable revocation_reason / revoked_at,
next_update index for the scheduler refresh loop, issuer_id index for
admin observability.
NEW internal/domain/ocsp_response_cache.go::OCSPResponseCacheEntry +
IsStale helper.
NEW internal/repository/postgres/ocsp_response_cache.go implementing
repository.OCSPResponseCacheRepository (Get / Put / Delete /
CountByIssuer). Interface defined in internal/repository/interfaces.go.
NEW internal/service/ocsp_response_cache.go::OCSPResponseCacheService
with read-through facade + sync.Map singleflight + InvalidateOnRevoke.
On cache miss, calls caOperationsSvc.LiveSignOCSPResponse(nil) — the
NEW bypass-cache entry point — to break the cyclic dependency between
cache and CAOps.
REFACTORED internal/service/ca_operations.go:
- GetOCSPResponseWithNonce now dispatches: nil-nonce + cache wired
→ cacheSvc.Get (cache); nonce != nil OR cache nil → live-sign.
- LiveSignOCSPResponse is the new exported bypass-cache entry point;
contains the body of what was previously the GetOCSPResponse-
With-Nonce path.
- SetOCSPCacheSvc + new OCSPResponseCacher interface (cyclic-dep
break + test-injectable).
The cache stores nil-nonce blobs by design. Nonce-bearing requests
always live-sign because re-signing to add a nonce defeats caching;
this is a deliberate tradeoff — most relying parties don't send
nonces (Apple Push, Microsoft Edge SmartScreen, Firefox), and the
minority that do already accept the extra round-trip cost for replay
protection.
WIRED in cmd/server/main.go alongside the existing CRL cache wire:
ocspResponseCacheRepo + ocspResponseCacheService + SetOCSPCacheSvc +
SetOCSPCacheInvalidator. Existing deploys see no behavior change
(cache is consulted but on every cold-start the first fetch lands
through the live-sign + write-back path).
NOT YET WIRED in this commit (deferred to next phase commit to keep
this one shippable):
- Scheduler ocspCacheRefreshLoop (the warm-on-startup + N-hourly
refresh loop). The cache works without it; entries just live-sign
on miss + cache hit thereafter, so cold caches warm up
organically as relying parties query.
- Admin observability endpoint /api/v1/admin/ocsp/cache.
- CERTCTL_OCSP_CACHE_REFRESH_INTERVAL env var.
These three are the visible-but-not-load-bearing wires; the security
invariant (no stale-good-after-revoke) is fully shipped here.
7 new tests in internal/service/ocsp_response_cache_test.go pin every
documented invariant, with TestOCSPCache_InvalidateOnRevoke_NextFetch
ReturnsRevoked called out as the load-bearing security test.
Pre-commit verification: go build ./... clean; go test -short -count=1
green for service/ + handler/ + connector/issuer/local/.
|
||
|
|
5a682db8e2 |
EST RFC 7030 hardening master bundle Phases 10-11: libest sidecar e2e
+ Cisco IOS quirk fixtures + ManagedCertificate.Source provenance + EST bulk-revoke endpoint + 13 typed audit action codes. Phase 10.1 — libest reference-client sidecar: - deploy/test/libest/Dockerfile: multi-stage Debian-bookworm-slim build of Cisco's libest v3.2.0-2 from source (autoconf/automake/ libtool + libcurl4-openssl-dev + libssl-dev). Runtime stage carries only estclient + bash + openssl + ca-certificates so the exec surface stays small + predictable. - docker-compose.test.yml libest-client entry (profiles: [est-e2e]) with bind mounts for /config/est (test workspace) + /config/certs (certctl CA bundle for TLS pinning); IP 10.30.50.9 (10.30.50.8 was already taken by certctl-agent). - deploy/test/est/.gitkeep keeps the bind-mount target tracked. Phase 10.2 — 5 integration tests (//go:build integration) in deploy/test/est_e2e_test.go: - TestEST_LibESTClient_Enrollment_Integration (cacerts → simpleenroll → cert-shape assertion) - TestEST_LibESTClient_MTLSEnrollment_Integration (mTLS sibling-route cert auth; skip when bootstrap cert absent) - TestEST_LibESTClient_ServerKeygen_Integration (RFC 7030 §4.4 multipart; skip when profile gate disabled) - TestEST_LibESTClient_RateLimited_Integration (4th enroll trips per-principal cap, asserts 429-shaped error) - TestEST_LibESTClient_ChannelBinding_Integration (libest --tls-exporter; skip when libest build lacks the flag). - requireESTSidecar guard skips the suite when the operator forgot --profile est-e2e; helpful error message includes the exact command to bring the sidecar up. Phase 10.3 — Cisco IOS quirk fixtures + 3 unit tests in internal/api/handler/cisco_ios_quirks_test.go: - testdata/cisco_ios_15x_pem_csr.txt: PEM body sent with Content-Type application/x-pem-file. Handler dispatches on body-prefix not Content-Type — accepts cleanly. - testdata/cisco_ios_16x_trailing_newline_csr.txt: extra trailing newlines after base64 body. strings.TrimSpace tolerates. - testdata/cisco_ios_crlf_b64_csr.txt: CRLF-wrapped base64. base64.StdEncoding handles CRLF + LF identically. Phase 11.1 — ManagedCertificate.Source provenance: - New domain.CertificateSource enum (Unspecified/EST/SCEP/API/Agent). - Migration 000023_managed_certificates_source.up.sql adds source TEXT NOT NULL DEFAULT '' so existing rows scan as CertificateSourceUnspecified — back-compat: bulk-revoke filter treats empty as "any source". - Postgres repo Insert/Update/scan paths all wire the new column. Phase 11.2 — EST bulk-revoke endpoint: - BulkRevocationCriteria.Source field (Source-only requests rejected as too broad — must accompany at least one narrower criterion). - service.bulk_revocation.resolveCertificates post-filter by Source (empty=any, no SQL change so existing CertificateFilter callers unaffected). - New BulkRevocationHandler.BulkRevokeEST method pins Source=EST + dispatches; new route POST /api/v1/est/certificates/bulk-revoke (M-008 admin-gated). openapi.yaml documented + parity-guard green. Phase 11.3 — 13 typed audit action codes in internal/service/est_audit_actions.go: - est_simple_enroll_success / _failed - est_simple_reenroll_success / _failed - est_server_keygen_success / _failed - est_auth_failed_basic / _mtls / _channel_binding - est_rate_limited - est_csr_policy_violation - est_bulk_revoke - est_trust_anchor_reloaded - ESTService.processEnrollment + SimpleServerKeygen + ReloadTrust split-emit BOTH the legacy bare action codes (back-compat for the GUI activity-tab chip filters that match by exact string + existing audit-log analysers) AND the new typed _success / _failed variants (operator grep target + per-failure-mode counter). Tests: - internal/api/handler/bulk_revocation_est_test.go — 5 cases (admin-true happy path pins Source=EST + non-admin 403 + empty-criteria 400 + invalid-reason 400 + method-not-allowed). - internal/service/est_audit_actions_test.go — 5 cases (SimpleEnroll legacy+typed emission / SimpleReEnroll typed / IssuerError typed-failed / PolicyViolation triple-emit / unique-string invariant). Pre-commit verification (sandbox): gofmt clean, go vet clean (excluding repository/postgres testcontainers limit), staticcheck clean across api/handler/api/router/domain/service/deploy/test, go test -short -count=1 green for every non-postgres Go package + integration build (`go build -tags integration ./deploy/test/...`) clean. G-3 docs-drift guard reproduced locally clean (Phases 10-11 added zero new env vars). Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases 12-13 (docs/est.md + WiFi/802.1X / IoT bootstrap / FreeRADIUS recipes; release prep + tag) remain — post-2.1.0 work. |
||
|
|
43075a1b5c |
EST RFC 7030 hardening master bundle Phases 5-7: end-to-end serverkeygen
+ profile-driven csrattrs + admin observability with per-status counters + reload-trust endpoint. Phase 5 — RFC 7030 §4.4 server-driven key generation: - internal/pkcs7/envelopeddata_builder.go is the inverse of the existing parser/decryptor: AES-256-CBC content cipher + RSA PKCS#1 v1.5 keyTrans + per-call random IV. Round-trip pinned in test (BuildEnvelopedData → ParseEnvelopedData → Decrypt returns the original plaintext byte-for-byte). - ESTService.SimpleServerKeygen runs the full §4.4 flow: parse client CSR → require RSA pubkey for keyTrans → resolve per-profile algorithm (RSA-2048 default; honors AllowedKeyAlgorithms) → in- memory keygen → re-build CSR with server pubkey → run existing issuer pipeline → marshal PKCS#8 → CMS-EnvelopedData wrap to a synthetic recipient cert wrapping the device's CSR-supplied pubkey → zeroize plaintext + PKCS#8 bytes → return CertPEM + ChainPEM + EncryptedKey. Typed sentinels ErrServerKeygenRequiresKey- Encipherment / ErrServerKeygenUnsupportedAlgorithm / ErrServerKeygenDisabled. - ESTHandler.ServerKeygen + ServerKeygenMTLS emit RFC 7030 §4.4.2 multipart/mixed with random per-response boundary; per-profile SetServerKeygenEnabled gate returns 404 when off (defense in depth even if the route was registered). - New routes POST /.well-known/est/[<PathID>/]serverkeygen + /.well-known/est-mtls/<PathID>/serverkeygen; openapi.yaml + openapi-parity guard updated. Phase 6 — Real csrattrs implementation: - New CertificateProfile.RequiredCSRAttributes []string + migration 000022_certificate_profiles_csrattrs.up.sql. The migration also lands the previously-unwired must_staple column (closes the 5.6 follow-up loop where the field shipped at the domain + service layer but the postgres scan/insert/update never persisted it). - domain.EKUStringToOID + AttributeStringToOID lookup tables: id-kp-* EKUs (RFC 5280 §4.2.1.12) + RFC 5280 DN attributes + RFC 2985 PKCS#10 attributes + Microsoft Intune device-serial OID. - ESTService.GetCSRAttrs replaces the v2.0.x nil/204 stub with a profile-derived SEQUENCE OF OID ASN.1 marshal. Unknown EKU / attribute strings dropped + warning-logged so a typo doesn't take down the entire endpoint. Phase 7 — Admin observability + counters + reload-trust: - internal/service/est_counters.go: estCounterTab (sync/atomic; 12 named labels) + ESTStatsSnapshot per-profile shape + ESTService.Stats(now) zero-allocation accessor + ReloadTrust() SIGHUP-equivalent + SetESTAdminMetadata setter. - Counter ticks wired into processEnrollment + SimpleServerKeygen at every success/failure leg. - internal/api/handler/admin_est.go mirrors AdminSCEPIntune verbatim: Profiles + ReloadTrust handlers + AdminESTServiceImpl. Both endpoints admin-gated (M-008 triplet pinned + admin_est.go added to AdminGatedHandlers). - New routes GET /api/v1/admin/est/profiles + POST /api/v1/admin/ est/reload-trust; openapi.yaml documented; openapi-parity guard reproduced clean. - cmd/server/main.go grows estServices map populated by the per- profile EST loop + handed to AdminEST. New MTLSTrust() + HasMTLSTrust() accessors on ESTHandler so main.go can pull the trust holder for the admin-metadata wire-up. - Per-profile counter isolation regression test (internal/service/est_profile_counter_isolation_test.go) proves a future shared-counter refactor would fail at compile-time pointer-identity check. Pre-commit verification (sandbox): gofmt clean, go vet clean (excluding repository/postgres which the sandbox can't build — disk-space testcontainers download), staticcheck clean across cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/ service/pkcs7/domain/cmd/server, go test -short -count=1 green for every non-postgres package. G-3 docs-drift guard reproduced locally clean (Phases 5-7 added zero new env vars; Phase 1 already documented per-profile SERVER_KEYGEN_ENABLED). Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases 8-13 (GUI ESTAdminPage / CLI+MCP / libest e2e / bulk revocation / docs/est.md / release prep) remain — post-2.1.0 work. |
||
|
|
506cff137d |
feat(scep): SCEP probe in network scanner for fleet-readiness assessment
Phase 11.5 of the SCEP RFC 8894 + Intune master bundle. Adds an
operator-facing SCEP probe that issues GetCACaps + GetCACert against
an arbitrary SCEP server URL and returns a structured posture snapshot
(reachable + advertised caps + RFC 8894 / AES / POST / Renewal /
SHA-256 / SHA-512 support flags + CA cert subject + issuer + NotBefore
+ NotAfter + days-to-expiry + algorithm + chain length).
Two operator use cases per the master prompt:
1. Pre-migration assessment — probe an existing EJBCA / NDES SCEP
server before switching to certctl to see what capabilities it
advertises and what the CA cert looks like.
2. Compliance posture audits — periodic ad-hoc probes against the
operator's own SCEP servers to flag drift.
Capability-only — does NOT POST a CSR per the spec (would consume slot
allocations on the target server + create audit noise). Standalone CLI
binary explicitly out of scope (per the master prompt §11.5.6 and the
operator's confirmation): the probe code lands inside certctl; a
future thin Cobra wrapper is a separate decision.
Backend (six new + one extended file):
* internal/domain/network_scan.go — new SCEPProbeResult struct with
every probe field documented for the GUI's display layer.
* migrations/000021_scep_probe_results.up.sql + .down.sql — new
scep_probe_results table with TEXT id, target_url, all probe
flags, CA cert metadata, probed_at, probe_duration_ms, error.
Two indexes: idx_scep_probe_results_probed_at (DESC) for the
'recent probes' GUI query, idx_scep_probe_results_target_url
(target_url, probed_at DESC) for the future per-URL history view.
* internal/repository/interfaces.go — new SCEPProbeResultRepository
interface (Insert + ListRecent).
* internal/repository/postgres/scep_probe_results.go — Postgres
implementation. ListRecent clamps limit to [1, 200]; on read
re-derives ca_cert_days_to_expiry against the query-time wall
clock so 'X days remaining' stays fresh.
* internal/service/scep_probe.go — ProbeSCEP(ctx, url) on
NetworkScanService. Validation order:
1. Up-front URL validation via validation.ValidateSafeURL
(defaults to validation.ValidateSafeURL but injectable for
tests via the new scepValidateURL field on the service).
2. Dial-time SSRF re-check via SafeHTTPDialContext on the
http.Transport (defends against DNS rebinding).
3. GET ?operation=GetCACaps + GET ?operation=GetCACert.
GetCACert handles three response shapes: PKCS#7 SignedData
certs-only envelope (multi-cert), raw DER (single-cert),
and PEM-wrapped DER (non-conforming servers).
Times out at 30s; uses a 1MB body cap for DoS defense; wraps
the result + persists via the repo (nil-safe) before returning.
describeCertAlgorithm helper returns 'RSA-N' / 'ECDSA-curve' /
'Ed25519' / 'DSA' for the GUI's algorithm column.
* internal/service/network_scan.go — added scepProbeRepo +
scepHTTPClient + scepValidateURL + scepIDFn + nowFn fields;
SetSCEPProbeRepo wires the repo at startup.
* internal/api/handler/network_scan.go — extended NetworkScanService
interface with ProbeSCEP + ListRecentSCEPProbes; added two new
HTTP handlers:
POST /api/v1/network-scan/scep-probe (body {url})
GET /api/v1/network-scan/scep-probes (recent history)
Synchronous probe; HTTP 200 with the result body for both success
and reachable-but-failed cases (so the GUI can render the failure
tone with the operator-actionable error message).
* internal/api/router/router.go — registered the two routes inline
after the existing network-scan target endpoints.
* api/openapi.yaml — documented both endpoints (operationId
probeSCEP + listSCEPProbes) with full schema + response codes.
* cmd/server/main.go — wires the new SCEPProbeResultRepository
onto the network scan service via SetSCEPProbeRepo right after
the existing NewNetworkScanService construction.
Backend tests (6 new — exit-criteria-named per the master prompt):
* TestProbeSCEP_AdvertisesAllCaps — happy path, full RFC 8894
capability set, ECDSA P-256 CA cert, 365-day expiry.
* TestProbeSCEP_MissingSCEPStandard — pre-RFC-8894 server (only
POSTPKIOperation + SHA-1 + DES3); SupportsRFC8894 = false.
* TestProbeSCEP_GetCACertExpired — CA cert NotAfter 30d in the
past; CACertExpired = true.
* TestProbeSCEP_Unreachable — connect to TCP port 1; probe
returns Reachable=false + non-empty Error.
* TestProbeSCEP_RejectsReservedIP — http://169.254.169.254/scep
(EC2 metadata literal) rejected by the up-front
validation.ValidateSafeURL gate; result captures the error
without ever issuing the HTTP call.
* TestProbeSCEP_PEMWrappedCert — server returns PEM instead of
raw DER for GetCACert; the fallback parse path handles it.
Frontend (one extended file + types/client):
* web/src/api/types.ts — SCEPProbeResult + SCEPProbesResponse.
* web/src/api/client.ts — probeSCEPServer + listSCEPProbes
helpers.
* web/src/pages/NetworkScanPage.tsx — new SCEPProbeSection
component + ProbeResultPanel (with capability badges + CA cert
details panel + raw caps line) + SCEPProbeHistoryTable. Form
rejects empty URL with inline error before calling the API.
Reload mutation goes through useTrackedMutation with explicit
invalidates: [['scep-probes']] (M-009 contract).
Frontend tests (5 new + 0 regressions):
* Scep probe section header + form renders.
* Empty URL is rejected with inline error and never calls the
probe endpoint.
* Successful probe renders capability badges + CA cert subject
+ days-remaining inline panel.
* Probe-level errors are surfaced in the inline panel (no result
panel rendered).
* Recent-probes history table renders one row per probe.
* (Existing 2 NetworkScanPage XSS-hardening tests stub the new
listSCEPProbes endpoint to an empty list so they still pass.)
Verification:
* gofmt clean on touched files
* go vet ./... clean
* staticcheck on service+handler+router+repository+cmd-server clean
* go test -short across service+handler+router+repository+cmd-server
+ integration: all green (existing + 6 new probe tests pass)
* Frontend tsc --noEmit clean
* Vitest: 7/7 NetworkScanPage tests pass (2 existing XSS + 5 new
probe section)
* G-3 docs-drift CI guard reproduced locally clean (no new env vars)
* M-009 hard-zero useMutation guard clean (probe mutation goes
through useTrackedMutation)
* openapi-parity guard satisfied (both new routes documented)
* The mockNetworkScanService in handler + integration packages
extended with stub Probe methods; targeted coverage stays in
scep_probe_test.go.
Out of scope (per master prompt §11.5.6 + operator confirmation):
* Standalone certctl-scan CLI binary — separate decision, ~1d of
follow-up work when/if shipped.
Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 11.5
cowork/scep-rfc8894-intune/progress.md
|
||
|
|
a0b7f7da9d |
ocsp/responder: dedicated OCSP responder cert per issuer (RFC 6960 §2.6)
Phase 2 of the CRL/OCSP responder bundle. Stops signing OCSP responses
with the CA private key directly; the local issuer now bootstraps a
dedicated responder cert + key per issuer, persists them, and rotates
within a grace window before expiry.
Why this matters:
- Every relying-party OCSP poll today triggers a CA-key signing op.
With this change those polls hit a cheap responder key; the CA key
only signs at responder bootstrap / rotation (rare).
- When the CA key lives on an HSM (PKCS#11 driver, V3-Pro item 3),
the dedicated responder removes the per-poll-HSM-op pressure.
- Carries id-pkix-ocsp-nocheck (RFC 6960 §4.2.2.2.1) so OCSP clients
do NOT recursively check the responder cert's revocation status.
What landed:
* migration 000020_ocsp_responder.up.sql (+down) — ocsp_responders table
keyed by issuer_id; rotated_from records the prior cert serial for
audit; not_after index drives the rotation scheduler query
* internal/domain/ocsp_responder.go — OCSPResponder type + NeedsRotation
helper (configurable grace window; default 7 days before expiry)
* internal/repository/postgres/ocsp_responder.go — Postgres impl with
upsert-on-Put + ListExpiring for the future rotation scheduler
* internal/repository/interfaces.go — OCSPResponderRepository interface
* internal/connector/issuer/local/ocsp_responder.go — bootstrap +
rotation logic; under c.mu so concurrent first-call OCSP requests
don't double-bootstrap; recovers gracefully from corrupt key ref
or corrupt cert PEM rather than failing the OCSP request
* internal/connector/issuer/local/local.go:
- Connector struct gains optional dependencies (ocspResponderRepo,
signerDriver, issuerID, rotation grace, validity, key dir)
- Set*() helpers for each dep matching the existing SCEPService
pattern (SetProfileRepo / SetProfileID)
- SignOCSPResponse refactored: ensureOCSPResponder dispatches on
whether deps are wired; fallback path (deps unset) preserves
pre-Phase-2 behavior of signing with CA key directly
* internal/connector/issuer/local/ocsp_responder_test.go — bootstrap
happy path; reuse-across-calls; fallback (no deps wired); rotation
on grace window; corrupt-key-ref recovery; corrupt-cert-PEM recovery;
SetOCSPResponderKeyDir setter
Coverage: local issuer 86.3% (above CI floor of 86; was 86.5% before
Phase 2 added ~140 LoC of new code). The recovered-from-drop tests are
real behavior tests of the new error paths I introduced, not
coverage-game artifacts.
Backward compat: unchanged for any caller that doesn't wire the
responder deps. The factory at internal/connector/issuerfactory/factory.go
still calls local.New(&cfg, logger) with no responder wiring; OCSP
responses continue to be signed by the CA key directly until the
operator wires the deps. cmd/server/main.go wiring lands in Phase 3
alongside the CRL cache service.
|
||
|
|
30765ba1ed |
crl/cache: schema + repository for crl_cache + crl_generation_events
Phase 1 of the CRL/OCSP responder bundle. Adds:
* migration 000019 — crl_cache (one row per issuer; pre-generated CRL DER,
monotonic crl_number per RFC 5280 §5.2.3, this_update/next_update,
generation duration metric, revoked_count) + crl_generation_events
(append-only audit log of every regeneration attempt, succeeded
+ error fields for ops grep)
* internal/domain/crl_cache.go — CRLCacheEntry + IsStale helper +
CRLGenerationEvent (raw DER omitted from JSON to avoid bloating
admin responses; CRLDERBase64 field for explicit transit shaping)
* internal/repository/interfaces.go — CRLCacheRepository interface
(Get / Put / NextCRLNumber / RecordGenerationEvent /
ListGenerationEvents)
* internal/repository/postgres/crl_cache.go — Postgres impl with
SERIALIZABLE-isolated NextCRLNumber to defeat the monotonicity
race between concurrent generations of the same issuer
* internal/repository/postgres/crl_cache_test.go — testcontainers
suite (round-trip, overwrite, monotonicity, event recording,
failure-event-with-error)
No behavior change at the HTTP layer yet — Phase 3 wires the cache into
GetDERCRL via a new CRLCacheService + crlGenerationLoop.
|
||
|
|
62a412c488 |
Bundle C: Renewal/reliability cluster — 7 findings closed
Closes M-006 + M-007 + M-008 + M-015 + M-016 + M-019 + M-020 from
comprehensive-audit-2026-04-25. M-028 was already closed by the
Bundle B CI follow-up.
M-006 (CWE-913) — Idempotent migration 000014
migrations/000014_policy_violation_severity_check.up.sql:
Prepended ALTER TABLE ... DROP CONSTRAINT IF EXISTS before the
ADD. Mirrors the down migration's existing IF EXISTS shape and
the M-7 idempotent-index idiom. Re-runs against partially-applied
DBs now succeed.
M-007 — Bulk-op partial-failure tests (3 new)
internal/api/handler/bulk_partial_failure_test.go:
TestBulkRevoke_PartialFailure_ReportsBoth
TestBulkRenew_PartialFailure_ReportsBoth
TestBulkReassign_PartialFailure_ReportsBoth
Each asserts HTTP 200 + both success/failure counters round-trip
+ per-cert errors[] preserved with non-empty messages so operators
can correlate each failure to its certificate ID.
M-008 — Admin-gated handler enumeration pin (verified-already-clean)
Recon: only one admin-gated handler — bulk_revocation.go — with
full 3-branch test triplet already in place. health.go calls
IsAdmin informationally to surface the flag to the GUI without
gating.
internal/api/handler/m008_admin_gate_test.go:
Walks every handler .go file, asserts every middleware.IsAdmin
call site is in AdminGatedHandlers (with required test triplet)
or InformationalIsAdminCallers (justified). Adding a new admin
gate without updating both the constant AND adding the test
triplet fails CI.
M-015 — Single-profile cardinality pin (verified-already-clean)
Audit claim 'no cardinality validation' was wrong — enforced at
struct level. domain.ManagedCertificate.{CertificateProfileID,
RenewalPolicyID,IssuerID,OwnerID} and RenewalPolicy.
CertificateProfileID are bare strings, not slices.
internal/domain/m015_cardinality_test.go:
reflect-based pin on kind=String. Schema change to N:N would
have to update renewal.go's lookup loop in the same commit.
M-016 (CWE-754) — Reap stale-agent jobs
internal/repository/postgres/job.go::ListJobsWithOfflineAgents:
JOIN jobs to agents on agent_id, filter (status=Running AND
a.last_heartbeat_at < cutoff), exclude server-keygen jobs.
internal/service/job.go::ReapJobsWithOfflineAgents:
Flips matched jobs to Failed reason agent_offline so I-001
retry loop re-queues them on a healthy agent. Records audit
event per reap.
internal/scheduler/scheduler.go:
Scheduler.runJobTimeout cycle now calls both reaper arms.
agentOfflineJobTTL default 5min (5x agent-health-check default);
SetAgentOfflineJobTTL knob for operator override.
internal/service/job_offline_agent_reaper_test.go: 6 unit tests
cover happy path, server-keygen-skip, non-Running-skip, non-
positive-TTL fail-loud, repo-error propagation, audit-event
recording.
M-019 — Configurable ARI HTTP timeout
Audit claim 'no fallback timeout' was wrong — ari.go:52 already
had a 15s timeout. Bundle C makes it configurable.
internal/connector/issuer/acme/acme.go:
Config.ARIHTTPTimeoutSeconds field with env path
CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS.
internal/connector/issuer/acme/ari.go:
Both HTTP clients (GetRenewalInfo + getARIEndpoint) now use the
new ariHTTPTimeout() helper. Zero / negative / nil-config all
fall back to the historic 15s default.
ari_timeout_test.go: 4 dispatch arm tests.
M-020 (CWE-770) — OCSP DoS hardening
Pre-bundle the noAuthHandler chain had no rate limit. An attacker
could DoS the OCSP responder, which for fail-open relying parties
is a revocation bypass.
cmd/server/main.go:
noAuthHandler refactored from fixed middleware.Chain(...) to a
conditional slice that appends middleware.NewRateLimiter when
cfg.RateLimit.Enabled. Per-IP keying applies; OCSP/CRL/EST/SCEP
are unauth.
docs/security.md (NEW):
Operator runbook documenting Must-Staple TLS Feature extension
RFC 7633 as the architectural fix for fail-open relying parties.
Profile-flip guidance + nginx/Apache/HAProxy/Envoy stapling
snippets + explicit scope statement on what the rate limiter
alone does NOT solve.
Audit deliverables:
cowork/comprehensive-audit-2026-04-25/audit-report.md: score
31/55 -> 38/55 closed (Medium 13/27 -> 20/27).
cowork/comprehensive-audit-2026-04-25/findings.yaml: 7 status
flips open -> closed with closure notes citing the Bundle C
mechanism.
certctl/CHANGELOG.md: Bundle C section under [unreleased].
Verification:
go vet ./internal/service ./internal/scheduler ./internal/connector/issuer/acme
./internal/api/handler ./internal/domain ./cmd/server clean
go test -count=1 -short on the same packages all green
helm template + helm lint clean
internal/repository/postgres setup-fail sandbox disk
pressure (same on master HEAD before this branch)
|
||
|
|
1d6c7a0552 |
fix(bundle-6): Audit Integrity + Privacy — 3 audit findings closed
Closes Audit-2026-04-25 H-008 (High), M-017 (Medium), M-022 (Medium).
Hardens audit-trail tamper-resistance + minimizes PII leakage in one
cohesive change, with both controls applying automatically and no
operator action required at install time.
What changed
- internal/service/audit_redact.go (NEW) — RedactDetailsForAudit:
* credentialKeys deny-list (api_key, password, *_pem, eab_secret, ...)
* piiKeys deny-list (email, phone, ssn, name, address, ip_address, ...)
* case-insensitive key match; recurses into nested maps + arrays
* mutation-free; surfaces redacted_keys array for operator visibility
* nil/empty input → nil out (preserves pre-Bundle-6 behaviour)
- internal/service/audit.go — RecordEvent now routes details through
RedactDetailsForAudit BEFORE marshaling. No call-site changes required.
- internal/service/audit_redact_test.go (NEW) — full coverage:
* credential keys (~30 entries)
* PII keys (~20 entries)
* nested maps + arrays
* case-insensitivity
* mutation-free invariant
* JSON round-trip (catches type-assertion regressions)
* scalar pass-through (no panic on int/bool/nil)
- migrations/000018_audit_events_worm.up.sql (NEW) — DB-level WORM:
* BEFORE UPDATE OR DELETE trigger raises check_violation with
diagnostic citing the rationale + compliance-superuser hint
* REVOKE UPDATE,DELETE ON audit_events FROM certctl (defence-in-depth)
* REVOKE wrapped in pg_roles existence check so test fixtures
without the certctl role stay idempotent
- migrations/000018_audit_events_worm.down.sql (NEW) — clean teardown
for dev resets; not for production use.
- internal/repository/postgres/audit_worm_test.go (NEW, testcontainers,
-short gated) — INSERT succeeds; UPDATE + DELETE fail with
check_violation; second INSERT after blocked modification still
succeeds (no trigger-state corruption).
- docs/compliance.md — new section "Audit-Trail Integrity & Privacy
(Bundle 6)" with verification psql snippet, compliance-superuser
pattern (NOT auto-created), redactor before/after example, and a
maintenance note for adding new credential keys.
Compliance mapping
- H-008 (CWE-532 Insertion of Sensitive Information into Log File)
- M-017 (HIPAA Technical Safeguards §164.312(b) — audit controls)
- M-022 (GDPR Art. 32 — data minimization)
Threat model: TB-3 (audit log tampering), TB-1 (operator/orchestrator).
Verification
- go vet ./... → clean
- go build ./... → clean
- go test -short -count=1 ./... → all packages pass
- go test -count=1 -run TestRedactDetailsForAudit ./internal/service/...
→ all pass
- (testcontainers, gated by -short) audit_worm_test.go pins WORM contract
- npx tsc --noEmit (web) → clean (no frontend changes)
- python3 yaml.safe_load(api/openapi.yaml) → 89 paths
Backward compatibility
- Trigger applies forward only — existing rows unchanged.
- nil/empty details from RecordEvent callers → nil out (preserves prior
behaviour for the many existing call sites that pass nil).
- Compliance superusers (provisioned out-of-band) bypass the trigger.
Bundle 6 of the 2026-04-25 comprehensive audit.
|
||
|
|
a3d8b9c607 |
fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master)
GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build); postgres reported unhealthy indefinitely, dependent containers never started. Root cause: deploy/docker-compose.yml mounted a hand-curated subset of migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/. Postgres applied them at initdb time. Once seed.sql referenced columns added by migrations *after* the mounted cutoff (e.g., policy_rules.severity from migration 000013), initdb crashed mid-seed and the container loop wedged. Two sources of truth (compose mount list vs in-tree migration ladder) diverged the moment a seed-touching migration shipped, and the only thing that fixed it was hand-editing the compose file every release. Fix: remove the dual source. Postgres boots empty; the server applies migrations + seed at startup via RunMigrations + RunSeed. Helm has used this pattern since day one (postgres-init emptyDir); compose now matches. Bundled with four ride-along audit findings whose fixes share the same schema/db code surface, so operators take the schema-change pain only once: cat-u-seed_initdb_schema_drift [P1, primary] — initdb-mount fix cat-o-retry_interval_unit_mismatch [P1] — column rename minutes→seconds cat-o-notification_created_at_dead_field [P2] — add column + populate cat-o-health_check_column_orphans [P1] — drop unwired columns cat-u-no_version_endpoint [P2] — add /api/v1/version Single migration (000017_db_coupling_cleanup) bundles the three schema changes under a DO \$\$ guard so re-application is safe; reduces operator-visible 'schema-change releases' from four to one. Backend - internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed (gated by CERTCTL_DEMO_SEED). Both idempotent (ON CONFLICT DO NOTHING in every shipped INSERT) so repeated boots are safe; missing-file is no-op so custom packaging that strips seeds still boots cleanly. - cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set) immediately after RunMigrations. - internal/repository/postgres/notification.go: NotificationRepository.Create now sets created_at (with time.Now() fallback when caller leaves it zero); scanNotification reads it back; List + ListRetryEligible SELECT extended. - internal/repository/postgres/renewal_policy.go: column references updated to retry_interval_seconds across SELECT/INSERT/UPDATE sites. - internal/api/handler/version.go: new VersionHandler exposes {version, commit, modified, build_time, go_version} from runtime/debug.ReadBuildInfo() with ldflags-supplied Version override. - internal/api/router/router.go: register GET /api/v1/version through the no-auth chain (CORS + ContentType) alongside /health, /ready, /api/v1/auth/info. - cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit ExcludePaths so rollout polling doesn't dominate the audit trail. - internal/config/config.go: add DatabaseConfig.DemoSeed + CERTCTL_DEMO_SEED env var. Migration - migrations/000017_db_coupling_cleanup.up.sql + .down.sql: (1) renewal_policies.retry_interval_minutes → retry_interval_seconds (DO \$\$ guard, idempotent re-application) (2) notification_events ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() (3) network_scan_targets DROP orphan health_check_enabled + health_check_interval_seconds - migrations/seed.sql: column reference updated to retry_interval_seconds. - migrations/seed_demo.sql: same column rename + applied at runtime now via RunDemoSeed (no longer initdb-mounted). Compose - deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files + seed.sql); add start_period: 30s to postgres + certctl-server healthchecks to absorb the runtime migration + seed application window on first boot. - deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount removed; that file never existed); same healthcheck start_period. - deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with CERTCTL_DEMO_SEED=true env var on certctl-server. Tests - internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo, TestVersion_RejectsNonGet, TestVersion_LdflagsOverride. - internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently, TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently, TestMigration000017_RetryIntervalRename, TestMigration000017_NotificationCreatedAt, TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips). - internal/repository/postgres/notification_test.go: TestNotificationRepository_CreatedAt_IsPersisted + TestNotificationRepository_CreatedAt_DefaultsToNow. CI guardrail - .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb (U-3)' step grep-fails the build if any migrations/*.sql or seed*.sql re-appears in /docker-entrypoint-initdb.d in any compose file. Catches future drift before a fresh-clone operator hits it. Spec / Docs - api/openapi.yaml: add /api/v1/version operation under Health tag. - docs/architecture.md: replace the 'initdb may run the same SQL' paragraph with a post-U-3 single-source-of-truth explanation. - CHANGELOG.md: full unreleased-section entry covering all 5 closures, breaking changes, and the new env var. Audit doc - coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14 cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to ✅ RESOLVED with closure prose pointing at this commit. Verification: build/vet/test -short -race all clean across all touched packages locally; govulncheck reports 0 vulnerabilities affecting our code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the post-fix tree. |
||
|
|
675b87ba63 |
I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.
Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
(next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
Dead rows clear next_retry_at so the index stops matching them.
Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
exponential backoff capped at 1h (notifRetryBackoffCap) with
5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
(audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
status -> pending via notifRepo.Requeue.
Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.
StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
NewStatsService call sites (main.go + stats_test.go + 8 digest
tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
(reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
the same snapshot.
Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.
Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
with queryKey: ['notifications', activeTab] so switching tabs
doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
full-text title tooltip.
- Requeue mutation wrapped as
mutationFn: (id: string) => requeueNotification(id)
to prevent react-query v5's positional context argument from
leaking into the API client — pinned against future refactors
by strict-match toHaveBeenCalledWith('notif-dead-001') in
NotificationsPage.test.tsx:181.
Closes I-005.
|
||
|
|
0725713e19 |
Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.
Schema — migration 000015:
migrations/000015_agent_retire.up.sql flips
deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
boundary instead of quietly destroying targets. Both `agents` and
`deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
TEXT pair (TEXT not VARCHAR so operator comments are never
truncated), indexed via partial indexes WHERE retired_at IS NOT
NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
EXISTS) so repeated runs against partially-migrated databases
converge. migrations/000015_agent_retire.down.sql restores CASCADE
and drops the new columns for clean rollback. A dedicated
repository-layer testcontainers test
(internal/repository/postgres/migration_000015_test.go) asserts the
before/after FK action, column presence, index presence, and
round-trip idempotency under up→down→up.
Domain — sentinel guard + dependency counts:
internal/domain/connector.go gains IsRetired() on Agent, the
exported SentinelAgentIDs slice listing server-scanner,
cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
four reserved IDs documented in CLAUDE.md and created at startup in
cmd/server/main.go), IsSentinelAgent(id string) predicate,
AgentDependencyCounts{ActiveTargets, ActiveCertificates,
PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
ActorTypeSystem enum values used by audit emission downstream.
Coverage locked down by internal/domain/connector_test.go.
Service — 8-step ordered contract:
internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
opts{Force, Reason}) enforces a fixed execution order:
(1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
unconditionally; force=true does NOT bypass it.
(2) fetch — ErrAgentNotFound on miss.
(3) idempotency — if IsRetired() already, return
AgentRetirementResult{AlreadyRetired: true} with no new audit
event and no state change (safe to replay from flaky clients).
(4) preflight counts — collectAgentDependencyCounts runs
ActiveTargets, ActiveCertificates, PendingJobs sequentially
(not in parallel; keeps the per-query timeout predictable and
matches the repo's existing call-chain shape).
(5) force-reason guard — opts.Force=true with empty Reason returns
ErrForceReasonRequired (wired into the 400 status surface).
(6) dependency guard — HasDependencies() with opts.Force=false
returns BlockedByDependenciesError{Counts} (wired into the 409
body with per-bucket counts).
(7) mutation — single pinned retiredAt := time.Now(); agent
retirement first, then cascade target retirement if opts.Force,
all under the repo's single transaction so the two retired_at
stamps match to the second.
(8) best-effort audit — agent_retired always; agent_retirement_
cascaded additionally on the force path. Actor is whatever the
handler resolves from the request; actor type is mapped by
resolveActorType (system/agent-prefix→Agent/else→User). Audit
emission failures are logged via slog.Error but do not abort
the retirement (matches the house convention used by every
other scheduler-emitted event).
BlockedByDependenciesError implements Error() as
"active_targets=%d, active_certificates=%d, pending_jobs=%d" and
Unwrap() → ErrBlockedByDependencies. The single struct satisfies
errors.Is via Unwrap (used by scheduler-level tests) and errors.As
via the concrete type (used by the handler to fish out Counts for
the 409 body). ListRetiredAgents(page, perPage) adds a separate
paginated accessor with page<1→1 and perPage<1→50 normalization so
retired rows are queryable without polluting the default agent
listing.
Sentinel guard coverage is asymmetric by design: all four reserved
IDs are protected, and force=true cannot override. Regression tests
in internal/service/agent_retire_test.go assert each of the eight
steps in order, plus sentinel bypass attempts and idempotency
replay.
Handler + router — status-code surface:
internal/api/handler/agents.go:RetireAgent exposes seven status
codes on DELETE /agents/{id}:
200 on a fresh retirement (body echoes AgentRetirementResult).
204 on idempotent replay (AlreadyRetired=true; no new audit).
400 on ErrForceReasonRequired.
403 on ErrAgentIsSentinel.
404 on ErrAgentNotFound.
409 on BlockedByDependenciesError, with a custom body shape
{error, counts{active_targets, active_certificates,
pending_jobs}} that bypasses the default ErrorWithRequestID
envelope so callers get the per-bucket numbers directly.
500 on any other error.
Heartbeat HandleHeartbeat returns 410 Gone when the agent is
retired (ErrAgentRetired), signalling the agent to shut down.
Query params `force=true` and `reason=<text>` drive the cascade
path; both are forwarded as url.Values through the new MCP
transport.
internal/api/router/router.go registers GET /api/v1/agents/retired
literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
literal-beats-pattern-var precedence routes "retired" to the
paginated retired-agents listing instead of fetching a hypothetical
agent named "retired".
Agent binary — clean shutdown on 410:
cmd/agent/main.go gains the ErrAgentRetired sentinel, a
retiredOnce sync.Once, and a retiredSignal chan struct{}. A
markRetired(source, statusCode, body) helper closes the channel
exactly once; the Run() select loop observes the close and returns
ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
and exits cleanly instead of spinning in the heartbeat retry loop.
The 410 Gone surface is therefore terminal for the agent process.
MCP transport:
internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
a new additive transport method. Client.Delete is path-only; without
this method the retire tool would silently drop `force` and `reason`,
turning every cascade retire into a default soft-retire. The new
method shares do()'s 204 normalization and 4xx/5xx error
propagation so tool authors get one contract.
internal/mcp/tools.go + internal/mcp/types.go expose the
retire_agent tool with Force+Reason inputs wired through
DeleteWithQuery.
CLI:
cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
`agents list --retired` (client-side strip of --retired then
delegation to ListRetiredAgents, sharing --page/--per-page parsing
with the default listing) and `agents retire <id> [--force --reason
"…"]` (mirrors ErrForceReasonRequired — force without reason is
rejected client-side before the request is sent). JSON + table
output modes both honor the new columns.
Frontend:
web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
web/src/api/client.ts + web/src/api/types.ts expose the retire
endpoint and the retired-listing. 4 new Vitest regression cases.
OpenAPI:
api/openapi.yaml documents DELETE /agents/{id} with all seven
status codes, 410 on heartbeat, and the 409 per-bucket body shape.
Regression coverage (six new test files, all green):
internal/service/agent_retire_test.go — 8-step contract + sentinel guards
internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies
Files:
api/openapi.yaml — DELETE + 410 + 409 body shape
cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
cmd/cli/main.go — handleAgents list/get/retire dispatch
docs/architecture.md, docs/concepts.md,
docs/testing-guide.md — retirement contract narrative
internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
internal/api/handler/agent_handler_test.go — extended coverage
internal/api/handler/agent_retire_handler_test.go — new
internal/api/router/router.go — /agents/retired before /agents/{id}
internal/cli/agent_retire_test.go — new
internal/cli/client.go — ListRetiredAgents + RetireAgent
internal/domain/connector.go — IsRetired, SentinelAgentIDs,
IsSentinelAgent, AgentDependencyCounts,
ActorTypeAgent/System
internal/domain/connector_test.go — new
internal/integration/lifecycle_test.go — retirement fixture
internal/mcp/client.go — DeleteWithQuery additive transport
internal/mcp/retire_agent_test.go — new
internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
internal/repository/interfaces.go — AgentRepository retirement methods
internal/repository/postgres/agent.go — retire + cascade target retire + counts
internal/repository/postgres/migration_000015_test.go — new
internal/service/agent.go — wire into AgentService surface
internal/service/agent_retire.go — new 8-step contract
internal/service/agent_retire_test.go — new
internal/service/deployment.go — skip retired agents
internal/service/target.go — skip retired agents
internal/service/testutil_test.go — shared mocks extended
migrations/000015_agent_retire.up.sql — new
migrations/000015_agent_retire.down.sql — new
web/src/api/client.ts, types.ts + tests — retire endpoint wiring
web/src/pages/AgentsPage.tsx — retire UI
|
||
|
|
b3cc7cbdb2 |
fix(policies): close the D-006 loop — TitleCase seed canonicals + severity-aware, config-consuming rule engine (D-008)
D-008 was a three-part drift in the policy engine that made the
D-005/D-006 remediation cosmetic below the DB layer:
(a) migrations/seed.sql INSERTed rules with pre-D-005 lowercase
types ('ownership', 'environment', 'lifetime', 'renewal_window')
that the handler validator rejects on Create/Update but that
raw SQL INSERTs bypassed entirely. At runtime evaluateRule's
switch fell through to the default "unknown policy rule type"
error branch on every demo rule × every cert × every cycle,
flooding logs while emitting zero violations.
(b) migrations/seed_demo.sql persisted lowercase severity values
('critical', 'error', 'warning') on policy_violations rows.
INSERT succeeded because that column had no CHECK, but any
frontend comparing against the canonical PolicySeverity enum
mis-categorized every seeded violation.
(c) evaluateRule hardcoded Severity: PolicySeverityWarning on
every emitted violation and ignored rule.Config entirely —
so the D-006 per-rule severity column (000013) and every
per-arm Config JSON ({allowed_issuer_ids, allowed_domains,
required_keys, allowed, lead_time_days, max_days}) was dead
data below the evaluation layer.
This commit lands (a)+(b)+(c) atomically. Shipping any subset
leaves the feature half-working.
## Changes
Domain (internal/domain/policy.go):
* Add PolicyTypeCertificateLifetime as the 6th TitleCase canonical.
Pre-D-008 the seeded "max-certificate-lifetime" rule had no engine
arm — routing it through RenewalLeadTime would conflate "how
close to expiry before we renew" with "how long can the cert
possibly be", two distinct semantics. The new type accepts
config {"max_days": int} and flags certs whose
NotAfter - NotBefore exceeds the cap.
Handler validator (internal/api/handler/validation.go):
* ValidatePolicyType allowlist grown to 6 canonicals
(AllowedIssuers, AllowedDomains, RequiredMetadata,
AllowedEnvironments, RenewalLeadTime, CertificateLifetime).
OpenAPI (api/openapi.yaml):
* PolicyType enum grown to match domain.
Frontend (web/src/api/types.ts, types.test.ts):
* POLICY_TYPES tuple gains CertificateLifetime; pin test asserts
all 6 canonicals and rejects casing drift.
Migration 000014 (policy_violations severity CHECK):
* Named CHECK constraint (policy_violations_severity_check)
mirroring 000013's allowlist, defense-in-depth at the DB layer
against future drift from bypassed writes (migrations, psql
sessions, future callers). Symmetric down migration drops by
name.
Seed data:
* migrations/seed.sql rewritten to emit TitleCase canonicals with
per-arm config JSON that actually exercises the config-consuming
paths (not the missing-field backstops):
- pr-require-owner → RequiredMetadata {"required_keys":["owner"]} Warning
- pr-allowed-environments → AllowedEnvironments {"allowed":["production","staging","development"]} Error
- pr-max-certificate-lifetime → CertificateLifetime {"max_days":90} Critical
- pr-min-renewal-window → RenewalLeadTime {"lead_time_days":14} Warning
Severities are now differentiated per rule (D-006 intent).
* migrations/seed_demo.sql violation rows flipped to TitleCase
severity ('Critical', 'Error', 'Warning') so migration 000014
applies cleanly on upgrade paths.
Engine rewrite (internal/service/policy.go):
* evaluateRule rewritten. All six arms now:
1. Parse rule.Config into the per-arm typed struct.
2. Bad JSON → log at ValidateCertificate boundary and skip
this rule (no co-located poisoning of other rules in the
same batch).
3. Empty/null Config → emit the pre-D-008 missing-field
violation (backwards compat invariant — operators who
haven't reconfigured still see the same output).
4. Violations emitted carry rule.Severity (no more hardcoded
Warning); D-006 column is now load-bearing.
* CertificateLifetime arm reads NotBefore/NotAfter from the
certificate's latest version via CertRepo. Injected via
PolicyService.SetCertRepo() setter — avoids churning ~36
NewPolicyService call sites while keeping the lifetime arm
optional (degrades to a log+skip if the setter is not wired).
Server wiring (cmd/server/main.go):
* policyService.SetCertRepo(certRepo) wired after construction.
Tests (internal/service/policy_test.go):
* 25 new subtests across 5 groups:
- TestEvaluateRule_SeverityPassThrough (6): every rule type
emits violations carrying rule.Severity, not hardcoded.
- TestEvaluateRule_ConfigConsumed (12): every per-arm Config
path exercised positive + negative.
- TestEvaluateRule_EmptyConfig_BackCompat (3): empty/null
Config still emits pre-D-008 missing-field violations.
- TestEvaluateRule_BadConfig_SkipsRule: malformed JSON logs
and skips cleanly without poisoning neighbors.
- TestEvaluateRule_CertificateLifetime_RepoScenarios (3):
ok when repo wired, log+skip when not, handles missing
NotBefore/NotAfter edges.
Provenance: D-008 surfaced during D-005/D-006 remediation review
in
|
||
|
|
eef1db0f0a |
fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006)
Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a
single commit because they share the same root cause — policy CRUD sending
values the backend silently rejects — and splitting them would leave a
half-working UI between commits.
## D-005 (P0): PoliciesPage dropdown 400s every Create Policy
Root cause
----------
`web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a
hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array.
The backend's `internal/api/handler/validators.go::ValidatePolicyType`
enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`,
`RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in
`internal/domain/policy.go`. Every Create Policy request was rejected with
`400 invalid policy type`. The error surfaced only as a transient toast;
the modal closed anyway. Silent user-visible failure.
Fix
---
- `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES`
tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and
`PolicyViolation.severity` to the literal-union types. Dropdown is now
sourced from the tuple; casing drift becomes a compile error.
- `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` /
`severityDots` to the TitleCase values, added `humanize()` for display
(AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral`
fallback that was papering over the mismatch.
- `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone
edits one side of the frontend/backend contract without the other, CI
fails with a clear assertion. Pure-TS vitest, no RTL dependency.
## D-006 (P1): `severity` field silently dropped on create/update
Root cause
----------
`PolicyRule` had no `Severity` field in `internal/domain/policy.go`. The
frontend has always sent `severity` on create/update, but Go's
`json.Decoder` (default settings, no `DisallowUnknownFields`) silently
dropped it. The value never reached PostgreSQL. Every rule rendered with
the same severity because there was no severity — just a display
computation downstream.
Fix: option (b), full-stack schema add (not delete-the-field)
-------------------------------------------------------------
- Migration `000013_policy_rule_severity` (up + down): adds
`severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with
CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No
index — three-value column on a low-thousands-rows table, planner will
seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data.
- `internal/domain/policy.go`: added `Severity PolicySeverity` field.
- `internal/repository/postgres/policy.go`: plumbed `severity` through
ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT,
UpdateRule UPDATE (4 queries).
- `internal/service/policy.go::UpdatePolicy`: if the client omits
severity on a PUT (zero-value empty string), fetch the existing rule
and preserve its severity. Without this, partial updates would trip the
NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type
(out of scope).
- `internal/api/handler/policies.go::CreatePolicy`: default empty severity
to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with
clear message instead of 500 on CHECK violation. `UpdatePolicy`:
validates severity only when provided.
- `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional
`severity` on the MCP `create_policy` / `update_policy` tool inputs so
LLM callers stay in sync with the wire contract.
- `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with
the enum and default.
Acceptance criterion (user-defined)
-----------------------------------
"Create a rule with severity=Critical, reload the page, and still see
Critical — no silent drops." Verified end-to-end: frontend sends
`severity: "Critical"`, handler validates, service persists, DB stores,
GET returns, React renders the correct badge.
Seed data
---------
`migrations/seed.sql`: four demo rules now have differentiated severities
— `pr-require-owner` → Warning, `pr-allowed-environments` → Error,
`pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` →
Warning. The user called out that seeding all four at the same severity
makes the feature look decorative; differentiation demonstrates the
column carries real signal.
## Integration test fix (side effect of D-006)
`internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy`
was sending `"severity": "High"` — a value from the pre-audit severity
vocabulary that the new `ValidatePolicySeverity` correctly rejects with
400. Changed to `"Error"` (closest semantic match in the new TitleCase
allowlist). Only severity reference in the integration/ directory;
verified via grep.
## Out of scope, logged for follow-up (d/D-008)
Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly
deferred per direction:
1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values
(`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`).
These are load-bearing on `internal/service/policy.go::evaluateRule`'s
`switch rule.Type` (which also uses the lowercase strings). Migrating
requires coordinated changes across seed + evaluation engine.
2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'`
severity — will now fail the new CHECK constraint. Separate fix.
3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on
emitted violations and ignores the configured `rule.Config`. The new
severity column is read correctly on the CRUD path but not yet
consulted during evaluation.
## Verification
Backend:
- `go build ./...` — clean
- `go vet ./...` — clean
- `go test -short ./...` — all packages green, including
`internal/service` (policy service), `internal/api/handler` (policy +
MCP handler tests), `internal/integration` (e2e_test.go after fix),
`internal/domain`, `internal/repository/postgres`.
Frontend:
- `tsc --noEmit` — clean
- `vitest run` — 223/223 passing (4 new assertions in types.test.ts)
- `vite build` — clean (only the pre-existing chunk-size warning)
|
||
|
|
387fb555ac |
security: scope revocation unique index to (issuer_id, serial_number) (fixes H-1)
RFC 5280 §5.2.3 defines certificate serial number uniqueness per issuing CA,
not globally. The prior unique index on `certificate_revocations.serial_number`
enforced a stricter invariant than the spec: with 12 issuer connectors (Local
CA, ACME, Vault, step-ca, OpenSSL, DigiCert, Sectigo, Google CAS, AWS ACM PCA,
Entrust, GlobalSign, EJBCA), two distinct certificates legitimately issued by
different CAs can share a serial number. Recording a revocation for the second
collision silently dropped via `ON CONFLICT DO NOTHING`, leaving the second
cert persistently absent from OCSP/CRL responses.
Changes:
- Migration 000012 drops `idx_certificate_revocations_serial` and creates
`idx_certificate_revocations_issuer_serial` UNIQUE ON (issuer_id,
serial_number). Adds a non-unique `idx_certificate_revocations_serial_lookup`
to preserve the serial-only fast path for OCSP/CRL probes that already know
the issuer scope.
- `CertificateRevocationRepository.Create` targets the new composite key in
`ON CONFLICT` — same-issuer idempotency preserved, cross-issuer collisions
now recorded as distinct rows.
- `GetBySerial(serial)` renamed `GetByIssuerAndSerial(issuerID, serial)` on
the interface and Postgres impl. All callers (OCSP responder, CRL
generator, short-lived-cert exemption check) already have `issuerID` in
scope because the protocol paths carry it (`/api/v1/ocsp/{issuer_id}/{serial}`,
`/api/v1/crl/{issuer_id}`).
- Repository integration test added: `TestRevocationRepository_CrossIssuerSerialCollision`
asserts that serial `CAFEBABE01` can be stored under two issuers
simultaneously, that lookups return the correct row per (issuer, serial),
and that same-issuer idempotency still works (re-inserting (issuer, serial)
does not error and does not duplicate).
- Existing tests and service/integration mocks updated for the rename.
Wire-format invariants preserved: CRL DER bytes, OCSP response bytes, and
AES-256-GCM config encryption are unaffected — this change touches only
revocation-record uniqueness scope.
CWE-664.
|
||
|
|
e1bcde4cf1 |
feat(M50): cloud secret manager discovery — AWS SM, Azure KV, GCP SM
Extend certificate discovery from filesystem + network to cloud secret managers. Three pluggable DiscoverySource connectors feed into the existing discovery pipeline via sentinel agent pattern, with a 9th scheduler loop for periodic cloud scanning. - AWS Secrets Manager: aws-sdk-go-v2, tag/prefix filtering, 10 tests - Azure Key Vault: stdlib HTTP + OAuth2, base64 DER/PEM, 16 tests - GCP Secret Manager: stdlib HTTP + JWT OAuth2, label filter, 14 tests - CloudDiscoveryService orchestrator with 9 tests - 9th scheduler loop (6h default, atomic.Bool idempotency) - Discovery page: color-coded source type badges - 14 new env vars across CloudDiscoveryConfig structs - Docs: connectors.md, architecture.md, features.md, README updated 49 new tests. All CI checks pass (go vet, race, lint, coverage). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
3f619bcaac |
feat(M49): Entrust, GlobalSign & EJBCA issuer connectors
Add three new issuer connectors completing commercial and open-source CA coverage. Entrust uses mTLS client certificate auth with sync/async issuance. GlobalSign Atlas uses mTLS + API key/secret dual auth with serial-based tracking. EJBCA supports dual auth (mTLS or OAuth2) for self-hosted Keyfactor CAs. Each connector implements the full issuer.Connector interface (9 methods), includes httptest-based unit tests (~14 each), and follows established patterns (injectable HTTP clients, RFC 5280 revocation reason mapping, CRL/OCSP delegated to CA). Also includes: issuer factory cases, env var seeding, config structs, domain types, seed data (3 rows, all disabled), OpenAPI enum updates, frontend issuer catalog entries with config fields, and full docs (connectors.md, architecture.md, features.md, README). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
596d86a206 |
feat(M48): continuous TLS health monitoring — endpoint state machine, shared tlsprobe, 8 API endpoints, GUI
Adds continuous TLS endpoint health monitoring that closes the deploy→verify→monitor loop. After M25 verifies a deployment succeeded once, M48 continuously confirms it stays healthy. Key components: - Shared `internal/tlsprobe/` package extracted from network scanner for reuse - Health status state machine: healthy → degraded (2 failures) → down (5 failures), plus cert_mismatch when served fingerprint differs from expected - 8th scheduler loop (60s tick, per-endpoint configurable intervals) - PostgreSQL migration 000011: endpoint_health_checks + endpoint_health_history tables - 8 REST API endpoints (CRUD, history, acknowledge, summary) - Health Monitor GUI page with summary bar, status table, create modal, auto-refresh - 38 new tests (5 tlsprobe + 11 domain + 10 service + 8 handler + 4 frontend) - All coverage thresholds maintained (service 68%, handler 83%, domain 87%, middleware 63%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
c015cab2f4 |
docs: rewrite features.md, audit README + architecture against repo
Rewrote docs/features.md from scratch as authoritative feature inventory (1255 lines, every claim verified against source files). Audited README.md and architecture.md against repo — fixed 19 stale references: K8s Secrets status, issuer counts, dashboard page counts, CI thresholds, missing connectors in Mermaid diagrams, OpenAPI operation count, GetCACertPEM behavior, and V2/V4 roadmap accuracy. Also includes related fixes discovered during audit: - Scheduler skips expired/failed/revoked certs from auto-renewal - Seed demo expiry dates moved outside 31-day scheduler query window - Agent pages use correct last_heartbeat_at field name Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
5567d4b411 |
feat(M47): add Kubernetes Secrets target + AWS ACM PCA issuer connectors
Implement both M47 connectors with full cross-layer wiring: Kubernetes Secrets target: DNS-1123 validation, kubernetes.io/tls Secret create-or-update, chain concatenation, serial number validation, Helm RBAC gating. 18 tests. AWS ACM Private CA issuer: synchronous issuance (like Vault), ARN regex validation, RFC 5280 revocation reason mapping, CA cert retrieval, factory + env var seeding. 23 tests. Cross-cutting: domain types, service validation, config, factory, agent dispatch, frontend (TargetsPage, issuerTypes), OpenAPI, seed data, Helm chart, connectors docs, README. Testing docs (testing-guide, qa-test-guide, qa_test.go) with Parts thematically integrated near related connectors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
e6088c79a3 |
feat(M35): dynamic target configuration with encrypted config, test connection, and GUI updates
Mirror M34's dynamic issuer config pattern for deployment targets: AES-256-GCM encrypted config storage, sensitive field redaction in API responses, agent heartbeat-based test connection endpoint, and full frontend updates including test status indicators, source badges, and removal of stale hostname/status fields from the Target interface. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
995b72df05 |
feat(M34): dynamic issuer configuration with encrypted config storage
Replace static env-var-based issuer wiring with GUI-driven dynamic configuration stored encrypted in PostgreSQL. Operators can now configure, test, enable/disable, and manage issuers from the dashboard without restarting the server. Key changes: - AES-256-GCM encryption for sensitive issuer config at rest (PBKDF2 key derivation with 100k iterations) - Dynamic IssuerRegistry with sync.RWMutex replacing static map - Connector factory pattern (issuerfactory.NewFromConfig) replacing 140 lines of static wiring in main.go - Migration 000009: encrypted_config, last_tested_at, test_status, source columns on issuers table - Env var seeding on first boot with ON CONFLICT DO NOTHING - Registry Rebuild() for atomic map swap after CRUD operations - Issuer type validation against domain constants on Create - Audit trail for test connection results - Conditional seeding for step-ca/OpenSSL (only when env vars set) - GUI: source badge, connection test status on issuer detail page Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
5a53b648b1 |
feat(M44): Google CAS issuer connector
Google Cloud Certificate Authority Service integration via REST API with OAuth2 service account auth (JWT→access token). Synchronous issuance model, CA pool selection, mutex-guarded token caching, revocation with RFC 5280 reason mapping. No Google SDK dependency — all stdlib. 19 tests with httptest mock OAuth2 + CAS API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
3a11e447cf |
feat(M43): Sectigo SCM issuer connector
Implement Sectigo Certificate Manager REST API connector with async order model (enroll → poll → collect PEM), 3-header auth, DV/OV/EV support, collect-not-ready (400/-183) graceful handling, and RFC 5280 revocation reason mapping. 20 tests with httptest mock API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
4c3b7cbb16 |
docs: fix stale references, seed data case bugs, and convert ASCII diagrams to Mermaid
Audit all docs and examples against current codebase state. Fix seed_demo.sql domain constant casing (IssuerType, TargetType, AgentStatus) that would cause agent dispatch failures. Fix example docker-compose health endpoints (/health not /api/v1/health) and env var names (CERTCTL_DATABASE_URL). Update connector counts, test numbers, and planned→implemented status across docs. Convert 3 ASCII flow diagrams to Mermaid. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |