mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 21:41:39 +00:00
12003f5ca521dddf00488edab6536c57541cd6fd
46 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
46800f3365 |
Bundle C tail: integration mock stub for ListJobsWithOfflineAgents
CI on the bundle-C merge (run #24970879984) failed go vet because internal/integration/lifecycle_test.go::mockJobRepository didn't implement the new JobRepository.ListJobsWithOfflineAgents method that Bundle C added. The lifecycle integration test does not exercise the offline-agent reaper path (the unit-level test in internal/service covers that), so the integration-mock stub is a no-op returning (nil, nil) — same shape as the existing M-7 / I-003 stubs in this file. Verification: go vet ./internal/integration clean go test -count=1 -short ./internal/integration green |
||
|
|
85e60b24ec |
fix(bundle-5): Operational Liveness + Bootstrap — 4 audit findings closed
Closes Audit-2026-04-25 H-006 (High), H-007 (High), M-011 (Medium),
L-006 (Low — verified-already-closed via C-1 master closure in v2.0.54).
Hardens the orchestrator-facing surface — k8s probes, agent enrollment,
shutdown audit drain, scheduler config plumbing.
What changed
- internal/api/handler/health.go — split contract:
* /health stays shallow 200 (k8s liveness — process alive)
* /ready accepts *sql.DB; runs db.PingContext(2s); 503 on failure
* Nil DB path returns 200 + db=not_configured (test fixtures)
- internal/api/handler/agent_bootstrap.go (NEW) — verifyBootstrapToken:
* empty expected = warn-mode pass-through
* non-empty = `Authorization: Bearer <token>` required
* crypto/subtle.ConstantTimeCompare; length-mismatch path runs dummy
compare to keep timing uniform
* ErrBootstrapTokenInvalid sentinel
- internal/api/handler/agents.go — RegisterAgent calls verifyBootstrapToken
BEFORE body parse so unauth probes don't even allocate a JSON decoder
- internal/config/config.go — two new env vars:
* CERTCTL_AGENT_BOOTSTRAP_TOKEN (Auth.AgentBootstrapToken)
* CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS (Server.AuditFlushTimeoutSeconds)
- cmd/server/main.go — 3 changes:
* pass *sql.DB into NewHealthHandler (H-006)
* pass cfg.Auth.AgentBootstrapToken into NewAgentHandler (H-007)
* configurable shutdown audit-flush timeout (M-011)
* one-shot startup WARN when bootstrap token unset (deprecation)
- new tests: agent_bootstrap_test.go (full deny/accept/warn-mode coverage,
constant-time compare path, length-mismatch); health_test.go extended
with /ready DB-probe failure (503), nil-DB pass-through, /health-shallow
L-006 verified
- cmd/server/main.go:557 already calls
sched.SetShortLivedExpiryCheckInterval(cfg.Scheduler.ShortLivedExpiryCheckInterval)
per the C-1 master closure in v2.0.54. Bundle 5 confirms; no code change.
Threat model: TB-1 (operator/orchestrator), TB-2 (Agent↔Server).
- CWE-754 (Improper Check for Unusual or Exceptional Conditions) for H-006
- CWE-306 + CWE-288 (Missing Authentication for Critical Function) for H-007
Verification
- go vet ./... → clean
- go build ./... → clean
- go test -short -count=1 ./... → all packages pass
- targeted Bundle-5 regressions → all pass
- npx tsc --noEmit (web) → clean
- npx vitest run (web) → in-flight (sandbox 45s
ceiling exceeded; no failure markers in dot stream; no frontend
changes in this bundle so no regression risk)
- python3 yaml.safe_load(api/openapi.yaml) → 89 paths
Backward compatibility
- Bootstrap token defaults to empty (warn-mode) — existing demo
deployments unaffected. Server logs deprecation WARN; v2.2.0 will
require it.
- Audit flush timeout default 30s preserves prior behaviour.
- Helm chart already routes readiness probe to /ready (no chart change
needed); now /ready actually probes the DB.
Bundle 5 of the 2026-04-25 comprehensive audit.
|
||
|
|
1c099071d1 |
fix(bundle-4): EST/SCEP Attack Surface Hardening — 3 audit findings closed
Closes 3 findings (1 High + 1 Medium + 1 Low) from
/Users/shankar/Desktop/cowork/comprehensive-audit-2026-04-25/.
Bundle 4 hardens the only attack surface reachable by an anonymous network
attacker in certctl: the unauthenticated EST + SCEP enrollment endpoints.
Findings closed:
- H-004 (High): Hand-rolled ASN.1 parser had no fuzz target.
The audit's original framing pointed at internal/pkcs7/, but recon
confirmed that package is an ASN.1 ENCODER (BuildCertsOnlyPKCS7,
ASN1Wrap*, ASN1EncodeLength) — not a parser. The actual hand-rolled
PKCS#7 PARSING reachable via anonymous network is in
internal/api/handler/scep.go::extractCSRFromPKCS7 +
parseSignedDataForCSR. Added native go fuzz targets:
* internal/api/handler/scep_fuzz_test.go::FuzzExtractCSRFromPKCS7
* internal/api/handler/scep_fuzz_test.go::FuzzParseSignedDataForCSR
* internal/pkcs7/pkcs7_fuzz_test.go::FuzzPEMToDERChain (defense-in-depth)
* internal/pkcs7/pkcs7_fuzz_test.go::FuzzASN1EncodeLength (defense-in-depth)
Local 15s fuzz session: 150k execs on FuzzExtractCSRFromPKCS7,
937k on FuzzPEMToDERChain, 925k on FuzzASN1EncodeLength — zero panics.
- M-021 (Medium): EST TLS-Unique channel binding (RFC 7030 §3.2.3).
Added internal/api/handler/est.go::verifyESTTransport — defense-in-depth
TLS pre-conditions (r.TLS != nil; HandshakeComplete; TLS ≥ 1.2).
The full §3.2.3 channel binding only applies when EST mTLS is in use;
certctl does not currently support EST mTLS, so the §3.2.3 requirement
is moot today. RFC 9266 (TLS 1.3 tls-exporter) and EST mTLS are
documented as deferred follow-ups in the verifyESTTransport doc comment.
- L-005 (Low): EST/SCEP issuer-binding fail-loud at startup.
Pre-Bundle-4 cmd/server/main.go validated that CERTCTL_EST_ISSUER_ID and
CERTCTL_SCEP_ISSUER_ID existed in the registry but did NOT validate the
issuer TYPE could emit a CA cert. An operator binding EST to an ACME
issuer (whose GetCACertPEM returns explicit error) booted successfully
and only failed at first /est/cacerts request. Post-Bundle-4: new
preflightEnrollmentIssuer helper calls GetCACertPEM(ctx) at startup
with a 10s timeout. Failure logs the connector error + the candidate
issuer types and os.Exit(1).
Tests added/modified:
- internal/api/handler/est_transport_test.go (new) — 5 verifyESTTransport
table cases covering plaintext-rejected, incomplete-handshake-rejected,
TLS 1.0 rejected, TLS 1.2/1.3 accepted
- cmd/server/preflight_test.go (new) — TestPreflightEnrollmentIssuer
covering nil-connector, error-from-issuer, empty-PEM, valid cases
- internal/api/handler/est_handler_test.go (modified) — 7 POST sites
now stamp r.TLS to satisfy the new transport pre-condition
- internal/integration/negative_test.go (modified) — setupTestServer
wraps the test handler with a fake-TLS-state injector so the EST
handler receives r.TLS != nil; production paths still rely on the
real TLS listener
Threat model reference: TB-11 (EST/SCEP client ↔ Server) per
cowork/comprehensive-audit-2026-04-25/threat-model.md.
Standards: RFC 7030 §3.2.3, RFC 8894 §3, RFC 5652, RFC 9266 (deferred).
|
||
|
|
0e29c416b1 |
refactor(handler,repo): replace strings.Contains error dispatch with typed sentinels (S-2)
Closes one 2026-04-24 audit finding (P2):
- cat-s6-efc7f6f6bd50: 30 strings.Contains(err.Error(), ...) sites
in internal/api/handler/ — brittle to repository-layer message
changes, untyped against the actual failure mode.
Approach (Option B from prompt design notes):
- New typed sentinels in internal/repository/errors.go:
ErrNotFound, ErrForeignKeyConstraint
IsForeignKeyError(err) helper (the only place substring
matching at the lib/pq boundary is allowed; isolates the
DB-driver string knowledge to one function).
- New typed sentinel in internal/domain/errors.go:
ErrValidation (reserved for future per-entity validation
wrappers; not yet used by all handlers).
- 49 sites in internal/repository/postgres/*.go updated to wrap
sql.ErrNoRows-derived errors via fmt.Errorf("...: %w",
repository.ErrNotFound).
- 18 not-found handler sites + 2 FK-constraint handler sites
refactored to errors.Is(err, repository.ErrNotFound) /
repository.IsForeignKeyError(err).
- 23 inline `fmt.Errorf("X not found")` test fixtures across
handler tests rewrapped to wrap repository.ErrNotFound.
- test_utils.go::ErrMockNotFound rewrapped to wrap
repository.ErrNotFound; renewal_policy.go closure docblock
updated to reflect the new convention.
- integration test mockJobRepository.Get wraps repository.ErrNotFound.
CI regression guardrail:
- .github/workflows/ci.yml::"Forbidden strings.Contains(err.Error())
regression guard (S-2)" greps for the three patterns ("not found",
"violates foreign key", "RESTRICT") under internal/api/handler/
and fails the build on regression.
Verification:
- go build ./... — clean
- go vet ./... — clean
- go test ./... -short -count=1 — all packages pass (handler +
repository + service + integration)
- golangci-lint v2.11.4 run ./... — 0 issues
- S-2 guardrail dry-run on post-fix tree → empty (good)
- All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1, F-1, P-1) pass
Audit findings closed:
- cat-s6-efc7f6f6bd50 (P2)
Deferred follow-ups:
- 6 domain-specific substring patterns still inline in handlers
("cannot approve", "cannot reject", "cannot be parsed",
"no certificates found", "challenge password", "invalid"/
"required" validation chains in profiles + agent_groups). Each
needs its own typed sentinel, scoped per service. Documented
by the S-2 CI guardrail's allowlist for closure-comments only.
- Per-entity not-found sentinels (Option A — ErrCertificateNotFound,
ErrAgentNotFound, etc.) deferred. Generic ErrNotFound covers the
current dispatch needs; per-entity precision would let handlers
return entity-aware error bodies without a domain.Type field,
but not blocking.
|
||
|
|
9834b4e4a4 |
G-1: renewal-policies API + frontend FK-drift fix
Three frontend call sites (OnboardingWizard.tsx:603, CertificatesPage.tsx:52,
CertificateDetailPage.tsx:169) populated the renewal_policy_id dropdown from
getPolicies() — the compliance-rule endpoint returning pol-* IDs — which
violated the FK managed_certificates.renewal_policy_id REFERENCES
renewal_policies(id) ON DELETE RESTRICT. Create would fail pg 23503 at insert.
Backend (new):
- RenewalPolicyRepository CRUD + ListAll/ExistsByID (pg 23503 → ErrRenewalPolicyInUse
→ HTTP 409; pg 23505 → ErrRenewalPolicyDuplicateName → HTTP 409)
- RenewalPolicyService with repo-only constructor. Service sentinels
var-alias the repo sentinels so errors.Is walks across layers.
- RenewalPolicyHandler with validation bounds: name 1–255;
renewal_window_days [1,365] default 30; max_retries [0,10] not defaulted;
retry_interval_seconds [60,86400] default 3600; alert_thresholds_days
[0,365] default [30,14,7,0]. Auto-generated IDs rp-<slug(name)>.
- Router registers 5 routes under /api/v1/renewal-policies[/{id}].
Frontend:
- CertificatesPage/CertificateDetailPage/OnboardingWizard now call
getRenewalPolicies() and render rp-* IDs.
- client.ts adds getRenewalPolicies/createRenewalPolicy/updateRenewalPolicy/
deleteRenewalPolicy. types.ts adds the RenewalPolicy shape.
OpenAPI: RenewalPolicies tag + 5 operations + 3 schemas (RenewalPolicy,
RenewalPolicyCreateRequest, RenewalPolicyUpdateRequest). 409 responses
on create/update duplicate-name and delete FK-in-use.
No migration — renewal_policies table already exists from the initial
schema (000001).
Tests:
- internal/service/renewal_policy_test.go: CRUD + validation + sentinel
error wrapping.
- internal/api/handler/renewal_policy_handler_test.go: handler endpoint
contracts including 400/404/409.
- web/src/api/client.test.ts: 4 subtests covering the 4 new API functions.
Phase 3 gates all green: go vet, build, short tests, race tests (service/
handler/router/scheduler), staticcheck (G-1 packages), govulncheck (0
reachable), coverage (service 69.7%, handler 79.0%, domain 86.9%,
middleware 80.6% — all above thresholds), tsc, vitest (256 passed),
vite build, OpenAPI structural validation.
|
||
|
|
4e5522a999 |
F-001/F-002/F-003: CRL prefix-scan, digest error sanitization, ctx-aware sleeps
F-001 (P3): GenerateDERCRL scoped to issuer via composite index
- Add RevocationRepository.ListByIssuer leveraging migration 000012's
idx_certificate_revocations_issuer_serial composite index as a
prefix-scan target. Previously CAOperationsSvc.GenerateDERCRL called
ListAll() and filtered by IssuerID in Go — O(total revocations)
regardless of how many revocations belonged to the target issuer.
- Rewrite GenerateDERCRL to call ListByIssuer(ctx, issuerID) so PostgreSQL
drives a prefix scan of the composite index. Drops the in-memory filter.
- New regression test in ca_operations_test.go asserts the CRL hot path
invokes ListByIssuer exactly once and never ListAll, and that the
issuerID is threaded through correctly.
F-002 (P3): digest.go admin-auth endpoints no longer leak internal errors
- PreviewDigest (GET /api/v1/digest/preview) and SendDigest
(POST /api/v1/digest/send) previously wrote err.Error() into the HTTP
response body on 500s. Replace with slog.Error server-side logging plus
a generic "internal error" response body, matching the house pattern
in certificates.go and export.go.
F-003 (P4): three blocking time.Sleep sites now honor ctx cancellation
- internal/connector/issuer/acme/acme.go:672 (DNS-01 propagation wait)
now runs under a select{case <-ctx.Done(): CleanUp + return ctx.Err();
case <-time.After(d):} so graceful shutdown doesn't get stuck behind
the propagation delay.
- internal/connector/issuer/acme/acme.go:786 (dns-persist-01 propagation
wait) same pattern, returns ctx.Err() on cancel.
- cmd/agent/main.go:272 (polling backoff inside the heartbeat loop) now
wraps the sleep in select{case <-ctx.Done(): continue; case <-time.After(backoff):}
so the outer <-ctx.Done() case on the parent loop fires cleanly.
Verification: build, vet, and race-enabled short tests green across all
55+ packages. govulncheck reports zero vulnerabilities in the code path.
No migration needed — F-001 reuses the existing 000012 composite index.
No frontend changes.
|
||
|
|
675b87ba63 |
I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.
Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
(next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
Dead rows clear next_retry_at so the index stops matching them.
Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
exponential backoff capped at 1h (notifRetryBackoffCap) with
5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
(audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
status -> pending via notifRepo.Requeue.
Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.
StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
NewStatsService call sites (main.go + stats_test.go + 8 digest
tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
(reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
the same snapshot.
Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.
Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
with queryKey: ['notifications', activeTab] so switching tabs
doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
full-text title tooltip.
- Requeue mutation wrapped as
mutationFn: (id: string) => requeueNotification(id)
to prevent react-query v5's positional context argument from
leaking into the API client — pinned against future refactors
by strict-match toHaveBeenCalledWith('notif-dead-001') in
NotificationsPage.test.tsx:181.
Closes I-005.
|
||
|
|
0725713e19 |
Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.
Schema — migration 000015:
migrations/000015_agent_retire.up.sql flips
deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
boundary instead of quietly destroying targets. Both `agents` and
`deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
TEXT pair (TEXT not VARCHAR so operator comments are never
truncated), indexed via partial indexes WHERE retired_at IS NOT
NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
EXISTS) so repeated runs against partially-migrated databases
converge. migrations/000015_agent_retire.down.sql restores CASCADE
and drops the new columns for clean rollback. A dedicated
repository-layer testcontainers test
(internal/repository/postgres/migration_000015_test.go) asserts the
before/after FK action, column presence, index presence, and
round-trip idempotency under up→down→up.
Domain — sentinel guard + dependency counts:
internal/domain/connector.go gains IsRetired() on Agent, the
exported SentinelAgentIDs slice listing server-scanner,
cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
four reserved IDs documented in CLAUDE.md and created at startup in
cmd/server/main.go), IsSentinelAgent(id string) predicate,
AgentDependencyCounts{ActiveTargets, ActiveCertificates,
PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
ActorTypeSystem enum values used by audit emission downstream.
Coverage locked down by internal/domain/connector_test.go.
Service — 8-step ordered contract:
internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
opts{Force, Reason}) enforces a fixed execution order:
(1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
unconditionally; force=true does NOT bypass it.
(2) fetch — ErrAgentNotFound on miss.
(3) idempotency — if IsRetired() already, return
AgentRetirementResult{AlreadyRetired: true} with no new audit
event and no state change (safe to replay from flaky clients).
(4) preflight counts — collectAgentDependencyCounts runs
ActiveTargets, ActiveCertificates, PendingJobs sequentially
(not in parallel; keeps the per-query timeout predictable and
matches the repo's existing call-chain shape).
(5) force-reason guard — opts.Force=true with empty Reason returns
ErrForceReasonRequired (wired into the 400 status surface).
(6) dependency guard — HasDependencies() with opts.Force=false
returns BlockedByDependenciesError{Counts} (wired into the 409
body with per-bucket counts).
(7) mutation — single pinned retiredAt := time.Now(); agent
retirement first, then cascade target retirement if opts.Force,
all under the repo's single transaction so the two retired_at
stamps match to the second.
(8) best-effort audit — agent_retired always; agent_retirement_
cascaded additionally on the force path. Actor is whatever the
handler resolves from the request; actor type is mapped by
resolveActorType (system/agent-prefix→Agent/else→User). Audit
emission failures are logged via slog.Error but do not abort
the retirement (matches the house convention used by every
other scheduler-emitted event).
BlockedByDependenciesError implements Error() as
"active_targets=%d, active_certificates=%d, pending_jobs=%d" and
Unwrap() → ErrBlockedByDependencies. The single struct satisfies
errors.Is via Unwrap (used by scheduler-level tests) and errors.As
via the concrete type (used by the handler to fish out Counts for
the 409 body). ListRetiredAgents(page, perPage) adds a separate
paginated accessor with page<1→1 and perPage<1→50 normalization so
retired rows are queryable without polluting the default agent
listing.
Sentinel guard coverage is asymmetric by design: all four reserved
IDs are protected, and force=true cannot override. Regression tests
in internal/service/agent_retire_test.go assert each of the eight
steps in order, plus sentinel bypass attempts and idempotency
replay.
Handler + router — status-code surface:
internal/api/handler/agents.go:RetireAgent exposes seven status
codes on DELETE /agents/{id}:
200 on a fresh retirement (body echoes AgentRetirementResult).
204 on idempotent replay (AlreadyRetired=true; no new audit).
400 on ErrForceReasonRequired.
403 on ErrAgentIsSentinel.
404 on ErrAgentNotFound.
409 on BlockedByDependenciesError, with a custom body shape
{error, counts{active_targets, active_certificates,
pending_jobs}} that bypasses the default ErrorWithRequestID
envelope so callers get the per-bucket numbers directly.
500 on any other error.
Heartbeat HandleHeartbeat returns 410 Gone when the agent is
retired (ErrAgentRetired), signalling the agent to shut down.
Query params `force=true` and `reason=<text>` drive the cascade
path; both are forwarded as url.Values through the new MCP
transport.
internal/api/router/router.go registers GET /api/v1/agents/retired
literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
literal-beats-pattern-var precedence routes "retired" to the
paginated retired-agents listing instead of fetching a hypothetical
agent named "retired".
Agent binary — clean shutdown on 410:
cmd/agent/main.go gains the ErrAgentRetired sentinel, a
retiredOnce sync.Once, and a retiredSignal chan struct{}. A
markRetired(source, statusCode, body) helper closes the channel
exactly once; the Run() select loop observes the close and returns
ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
and exits cleanly instead of spinning in the heartbeat retry loop.
The 410 Gone surface is therefore terminal for the agent process.
MCP transport:
internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
a new additive transport method. Client.Delete is path-only; without
this method the retire tool would silently drop `force` and `reason`,
turning every cascade retire into a default soft-retire. The new
method shares do()'s 204 normalization and 4xx/5xx error
propagation so tool authors get one contract.
internal/mcp/tools.go + internal/mcp/types.go expose the
retire_agent tool with Force+Reason inputs wired through
DeleteWithQuery.
CLI:
cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
`agents list --retired` (client-side strip of --retired then
delegation to ListRetiredAgents, sharing --page/--per-page parsing
with the default listing) and `agents retire <id> [--force --reason
"…"]` (mirrors ErrForceReasonRequired — force without reason is
rejected client-side before the request is sent). JSON + table
output modes both honor the new columns.
Frontend:
web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
web/src/api/client.ts + web/src/api/types.ts expose the retire
endpoint and the retired-listing. 4 new Vitest regression cases.
OpenAPI:
api/openapi.yaml documents DELETE /agents/{id} with all seven
status codes, 410 on heartbeat, and the 409 per-bucket body shape.
Regression coverage (six new test files, all green):
internal/service/agent_retire_test.go — 8-step contract + sentinel guards
internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies
Files:
api/openapi.yaml — DELETE + 410 + 409 body shape
cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
cmd/cli/main.go — handleAgents list/get/retire dispatch
docs/architecture.md, docs/concepts.md,
docs/testing-guide.md — retirement contract narrative
internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
internal/api/handler/agent_handler_test.go — extended coverage
internal/api/handler/agent_retire_handler_test.go — new
internal/api/router/router.go — /agents/retired before /agents/{id}
internal/cli/agent_retire_test.go — new
internal/cli/client.go — ListRetiredAgents + RetireAgent
internal/domain/connector.go — IsRetired, SentinelAgentIDs,
IsSentinelAgent, AgentDependencyCounts,
ActorTypeAgent/System
internal/domain/connector_test.go — new
internal/integration/lifecycle_test.go — retirement fixture
internal/mcp/client.go — DeleteWithQuery additive transport
internal/mcp/retire_agent_test.go — new
internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
internal/repository/interfaces.go — AgentRepository retirement methods
internal/repository/postgres/agent.go — retire + cascade target retire + counts
internal/repository/postgres/migration_000015_test.go — new
internal/service/agent.go — wire into AgentService surface
internal/service/agent_retire.go — new 8-step contract
internal/service/agent_retire_test.go — new
internal/service/deployment.go — skip retired agents
internal/service/target.go — skip retired agents
internal/service/testutil_test.go — shared mocks extended
migrations/000015_agent_retire.up.sql — new
migrations/000015_agent_retire.down.sql — new
web/src/api/client.ts, types.ts + tests — retire endpoint wiring
web/src/pages/AgentsPage.tsx — retire UI
|
||
|
|
1ee77c89f8 |
I-003: job timeout reaper closes AwaitingCSR/AwaitingApproval gap
Add 11th always-on scheduler loop that transitions jobs stuck in
AwaitingCSR (default 24h TTL) or AwaitingApproval (default 168h TTL)
to Failed. I-001's retry loop then auto-promotes eligible Failed jobs
back to Pending. No new status enum, no schema migration.
- JobRepository.ListTimedOutAwaitingJobs with per-status cutoff WHERE
- JobService.ReapTimedOutJobs mirrors RetryFailedJobs structure
- Scheduler jobTimeoutLoop with atomic.Bool idempotency guard, 2m
per-tick context, WaitGroup shutdown drain
- Config: CERTCTL_JOB_TIMEOUT_INTERVAL (10m), CERTCTL_JOB_AWAITING_CSR_TIMEOUT
(24h), CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT (168h)
- Audit event per transition: actor=system, actorType=System,
action=job_timeout, details={old_status, new_status, timeout_reason,
age_hours}
- 14 new tests: 3 config, 7 service, 4 scheduler
|
||
|
|
fe7e766510 |
Close M-004 (OCSP issuer binding) and M-005 (discovery actor propagation) coverage-gap findings
M-004 — OCSP issuer binding (composite key):
The OCSP lookup path now binds (issuer_id, serial) as a composite key
rather than resolving by serial alone. CertificateRepository and
RevocationRepository gain GetByIssuerAndSerial methods; ca_operations.go
scopes both lookups by the issuer_id path param. When no managed cert
binds to that (issuer, serial) tuple, GetOCSPResponse constructs an
RFC 6960 §2.2 'unknown' response (CertStatus=2) instead of the prior
default 'good'. Short-lived cert exemption (profile TTL < 1h) is
preserved. Real repo errors (non-sql.ErrNoRows) fail closed with a log.
Regression coverage: internal/service/ca_operations_test.go
- TestCAOperationsSvc_GetOCSPResponse_Unknown_CrossIssuer
- TestCAOperationsSvc_GetOCSPResponse_Unknown_UnknownSerial
M-005 — Discovery Claim/Dismiss actor propagation:
DiscoveryService.ClaimDiscovered and DismissDiscovered now accept an
explicit 'actor string' parameter (propagation pattern mirrors
bulk_revocation.go / revocation_svc.go). The handler layer passes
resolveActor(r.Context()) — the named-key identity established by the
M-002 auth unification — and the service falls back to 'api' (the same
safe sentinel resolveActor uses when no auth context is present) only
when the caller passes an empty string. Never falls back to 'operator'.
Regression coverage: internal/service/discovery_test.go
- TestDiscoveryService_ClaimDiscovered_AuditActor
- TestDiscoveryService_DismissDiscovered_AuditActor
- TestDiscoveryService_ClaimDiscovered_EmptyActorFallsBackToAPI
- TestDiscoveryService_DismissDiscovered_EmptyActorFallsBackToAPI
Each new test asserts event.Actor matches the caller-supplied string (or
'api' on empty input) and explicitly asserts event.Actor != 'operator'
to lock in the historical fix intent.
Files:
internal/api/handler/discovery.go — pass resolveActor(ctx)
internal/api/handler/discovery_handler_test.go — updated call sites
internal/integration/lifecycle_test.go — updated mock wiring
internal/repository/interfaces.go — GetByIssuerAndSerial on
CertificateRepository +
RevocationRepository
internal/repository/postgres/certificate.go — composite key lookup
internal/service/ca_operations.go — (issuer_id, serial) scoping
internal/service/ca_operations_test.go — 2 new M-004 tests
internal/service/discovery.go — actor parameter + 'api' fallback
internal/service/discovery_test.go — 4 new M-005 tests
internal/service/shortlived_test.go — mock signature update
internal/service/testutil_test.go — mock GetByIssuerAndSerial
|
||
|
|
3287e174dc |
Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001)
Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006)
on top of the C-001/C-002 ownership + agent-FK contract fixes landed in
|
||
|
|
eef1db0f0a |
fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006)
Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a
single commit because they share the same root cause — policy CRUD sending
values the backend silently rejects — and splitting them would leave a
half-working UI between commits.
## D-005 (P0): PoliciesPage dropdown 400s every Create Policy
Root cause
----------
`web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a
hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array.
The backend's `internal/api/handler/validators.go::ValidatePolicyType`
enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`,
`RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in
`internal/domain/policy.go`. Every Create Policy request was rejected with
`400 invalid policy type`. The error surfaced only as a transient toast;
the modal closed anyway. Silent user-visible failure.
Fix
---
- `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES`
tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and
`PolicyViolation.severity` to the literal-union types. Dropdown is now
sourced from the tuple; casing drift becomes a compile error.
- `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` /
`severityDots` to the TitleCase values, added `humanize()` for display
(AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral`
fallback that was papering over the mismatch.
- `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone
edits one side of the frontend/backend contract without the other, CI
fails with a clear assertion. Pure-TS vitest, no RTL dependency.
## D-006 (P1): `severity` field silently dropped on create/update
Root cause
----------
`PolicyRule` had no `Severity` field in `internal/domain/policy.go`. The
frontend has always sent `severity` on create/update, but Go's
`json.Decoder` (default settings, no `DisallowUnknownFields`) silently
dropped it. The value never reached PostgreSQL. Every rule rendered with
the same severity because there was no severity — just a display
computation downstream.
Fix: option (b), full-stack schema add (not delete-the-field)
-------------------------------------------------------------
- Migration `000013_policy_rule_severity` (up + down): adds
`severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with
CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No
index — three-value column on a low-thousands-rows table, planner will
seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data.
- `internal/domain/policy.go`: added `Severity PolicySeverity` field.
- `internal/repository/postgres/policy.go`: plumbed `severity` through
ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT,
UpdateRule UPDATE (4 queries).
- `internal/service/policy.go::UpdatePolicy`: if the client omits
severity on a PUT (zero-value empty string), fetch the existing rule
and preserve its severity. Without this, partial updates would trip the
NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type
(out of scope).
- `internal/api/handler/policies.go::CreatePolicy`: default empty severity
to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with
clear message instead of 500 on CHECK violation. `UpdatePolicy`:
validates severity only when provided.
- `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional
`severity` on the MCP `create_policy` / `update_policy` tool inputs so
LLM callers stay in sync with the wire contract.
- `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with
the enum and default.
Acceptance criterion (user-defined)
-----------------------------------
"Create a rule with severity=Critical, reload the page, and still see
Critical — no silent drops." Verified end-to-end: frontend sends
`severity: "Critical"`, handler validates, service persists, DB stores,
GET returns, React renders the correct badge.
Seed data
---------
`migrations/seed.sql`: four demo rules now have differentiated severities
— `pr-require-owner` → Warning, `pr-allowed-environments` → Error,
`pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` →
Warning. The user called out that seeding all four at the same severity
makes the feature look decorative; differentiation demonstrates the
column carries real signal.
## Integration test fix (side effect of D-006)
`internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy`
was sending `"severity": "High"` — a value from the pre-audit severity
vocabulary that the new `ValidatePolicySeverity` correctly rejects with
400. Changed to `"Error"` (closest semantic match in the new TitleCase
allowlist). Only severity reference in the integration/ directory;
verified via grep.
## Out of scope, logged for follow-up (d/D-008)
Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly
deferred per direction:
1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values
(`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`).
These are load-bearing on `internal/service/policy.go::evaluateRule`'s
`switch rule.Type` (which also uses the lowercase strings). Migrating
requires coordinated changes across seed + evaluation engine.
2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'`
severity — will now fail the new CHECK constraint. Separate fix.
3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on
emitted violations and ignores the configured `rule.Config`. The new
severity column is read correctly on the CRUD path but not yet
consulted during evaluation.
## Verification
Backend:
- `go build ./...` — clean
- `go vet ./...` — clean
- `go test -short ./...` — all packages green, including
`internal/service` (policy service), `internal/api/handler` (policy +
MCP handler tests), `internal/integration` (e2e_test.go after fix),
`internal/domain`, `internal/repository/postgres`.
Frontend:
- `tsc --noEmit` — clean
- `vitest run` — 223/223 passing (4 new assertions in types.test.ts)
- `vite build` — clean (only the pre-existing chunk-size warning)
|
||
|
|
2497be496d |
M-2 PR-C: Collapse Policy/Profile/Owner/Team services to ctx-first signatures
- Add ctx first param to 21 service-layer handler-interface methods
across policy.go (6), profile.go (5), owner.go (5), team.go (5)
- Replace 24 context.Background() call sites with received ctx; use
context.WithoutCancel(ctx) for subsidiary audit-recording ops to
preserve fire-and-forget audit semantics without inheriting caller
cancellation
- Add ctx first param to 21 handler-interface method signatures across
policies.go (6), profiles.go (5), owners.go (5), teams.go (5)
- Thread r.Context() through 21 HTTP handler sites (ListPolicies,
GetPolicy, CreatePolicy, UpdatePolicy, DeletePolicy, ListViolations,
ListProfiles, GetProfile, CreateProfile, UpdateProfile, DeleteProfile,
ListOwners, GetOwner, CreateOwner, UpdateOwner, DeleteOwner,
ListTeams, GetTeam, CreateTeam, UpdateTeam, DeleteTeam)
- Update MockPolicyService/MockProfileService/MockOwnerService/
MockTeamService mock method impls with _ context.Context first param
(Fn fields unchanged — closures do not need ctx); update mock impls
in integration/lifecycle_test.go for all four services
- Update 12 service-layer test callsites (policy_test.go ×2,
owner_test.go ×5, team_test.go ×5, profile_test.go ×13) to pass
context.Background() at the call site
Audit complete. Commit:
|
||
|
|
eb14236166 |
M-2 PR-B: Collapse IssuerService + TargetService to ctx-first signatures
- Delete bare TestConnection wrapper in IssuerService; rename
TestConnectionWithContext → TestConnection
- Delete TestTargetConnection delegate shim in TargetService (canonical
TestConnection already ctx-first)
- Add ctx first param to 10 handler-interface methods
(ListIssuers/GetIssuer/CreateIssuer/UpdateIssuer/DeleteIssuer and
ListTargets/GetTarget/CreateTarget/UpdateTarget/DeleteTarget)
- Replace 16 context.Background() call sites with received ctx
- Thread r.Context() through 12 HTTP handler sites in issuers.go and
targets.go (outer TargetHandler.TestTargetConnection HTTP method name
preserved for router compatibility)
- Update MockIssuerService, MockTargetService, and mockTargetService
(integration) for ctx-first forwarding; update test callsite literals
Audit complete. Commit:
|
||
|
|
27afa4463d |
fix(repository): idempotent sentinel agent creation via ON CONFLICT (M-6)
Sentinel agents (server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm) were created on startup with a plain INSERT whose duplicate-key error was swallowed unconditionally. That silenced every other DB failure too (connectivity drop, permissions change, unrelated constraint violation) — a restart after the first boot quietly de-fanged cloud discovery and the network scanner (CWE-662, CWE-209- adjacent). Shape A: add AgentRepository.CreateIfNotExists using ON CONFLICT (id) DO NOTHING RETURNING id + sql.ErrNoRows discrimination. This keeps the strict Create semantics (duplicate-key is an error) intact for real agent registration and gives sentinels their own idempotent path. - repo: CreateIfNotExists returns (created bool, err error); false,nil on pre-existing row; false,wrapped err on anything else. - interface: CreateIfNotExists added to AgentRepository. - main.go: 4 sentinel sites log Error/Info/Debug distinctly. - mocks: service + integration mocks implement the new method. - tests: 4 new testcontainers integration tests cover first-insert, idempotent second-call, concurrent 16-goroutine race (exactly one creator, no duplicate-key panic), and pre-cancelled context surfacing. Coverage gates (go test -cover): service 67.6%/55, handler 78.6%/60, domain 92.7%/40, middleware 80.0%/30, crypto 86.7%/85. Race/vet/ golangci-lint v2.11.4 (0 issues)/govulncheck v1.2.0 clean across all touched packages. |
||
|
|
5abeeb882b | fix(crypto): per-ciphertext PBKDF2 salt + v2 versioned format with v1 fallback (M-8) | ||
|
|
89b910a8f1 |
security: atomic pending-job claim with FOR UPDATE SKIP LOCKED (H-6)
Fixes H-6 (CWE-362) — GetPendingJobs returned pending rows without row
locks, so two scheduler replicas in an HA deployment could both read the
same row, both decide it was theirs, and race on UpdateStatus, producing
duplicate Running jobs and duplicate certificate issuances.
Remediation: a claim-style repository API that selects + transitions
Pending -> Running in one transaction with SELECT ... FOR UPDATE SKIP
LOCKED. Concurrent claimants observe disjoint row sets; no worker ever
sees another worker's claimed row.
Repository changes (internal/repository/postgres/job.go):
- New ClaimPendingJobs(ctx, jobType, limit): BEGIN; SELECT id,...
FROM jobs WHERE status='Pending' (optional type filter, optional
LIMIT) FOR UPDATE SKIP LOCKED; UPDATE jobs SET status='Running',
updated_at=NOW() WHERE id = ANY($ids); COMMIT. Returns the claimed
rows with status already flipped.
- New ClaimPendingByAgentID(ctx, agentID): mirrors M31 UNION ALL
semantics (direct agent_id match, target->agent JOIN fallback,
certificate->target->agent chain for AwaitingCSR) but wraps each
branch in FOR UPDATE SKIP LOCKED and flips Deployment/Renewal rows
to Running. AwaitingCSR rows are returned in place (state
transition deferred until SubmitCSR, consistent with M8 semantics).
- Existing GetPendingJobs / ListPendingByAgentID retained for legacy
compatibility; their godoc now directs production callers to the
Claim* variants.
Production caller switches:
- internal/service/job.go ProcessPendingJobs: ListByStatus(Pending)
-> ClaimPendingJobs(ctx, "", 0). Eliminates the real scheduler
race between two replicas tick-firing simultaneously.
- internal/service/agent.go GetPendingWork: ListPendingByAgentID ->
ClaimPendingByAgentID. Eliminates the race between two pollers
for the same agent (e.g. brief network blip causing duplicate
poll) and between a scheduler tick and an agent poll.
Safety argument for pre-flipping Pending -> Running inside the claim
transaction: ProcessRenewalJob and ProcessDeploymentJob both call
UpdateStatus(Running) unconditionally on entry, so an early flip is
idempotent. On panic, the scheduler's panic recovery leaves the job
in Running which the existing stale-running reaper handles.
Tests (internal/repository/postgres/repo_test.go, skipped in -short):
- TestJobRepository_ClaimPendingJobs_FlipsToRunning: seed 5 Pending,
claim once, assert all 5 returned + DB rows Running, residual
claim returns 0.
- TestJobRepository_ClaimPendingJobs_ConcurrentDisjoint: seed M=40
Pending Renewals, spawn N=8 goroutines each calling
ClaimPendingJobs(_, JobTypeRenewal, 1) in a loop. Invariants:
(a) no job ID claimed by more than one worker, (b) sum of claims
== 40, (c) all 40 rows in Running state in the DB. Bounded
empty-streak guard (20 iterations) covers SKIP LOCKED transient
zeros under contention.
- TestJobRepository_ClaimPendingByAgentID_TransitionsDeployments:
seeds 2 Pending Deployment + 1 AwaitingCSR for agent A plus 1
Pending Renewal for agent B (scope check). Asserts deployments
flip to Running, AwaitingCSR is returned but preserved, agent B's
renewal never appears.
Mock updates: testutil_test.go, lifecycle_test.go, verification_test.go
gained ClaimPendingJobs/ClaimPendingByAgentID on their mock job repos
mirroring the real Pending -> Running semantics. Mocks intentionally
do NOT write to StatusUpdates (that map tracks UpdateStatus() call
history specifically; the real claim path uses a bulk UPDATE, not
UpdateStatus).
Verification (CI-scope):
- go build ./cmd/...: ok
- go vet ./...: ok
- go test -race -short on service, api/handler, api/middleware,
scheduler, connector/..., domain, validation, tlsprobe: ok
- Coverage gates: service 67.6% (>=55), handler 78.6% (>=60),
middleware 80.0% (>=30), domain 92.7% (>=40). All hold.
- golangci-lint 2.11.4: 0 issues
- govulncheck: no vulnerabilities in call graph
- Frontend: tsc clean, 218 vitest tests pass, vite build ok
- helm lint + helm template: ok
- Invariant sweeps: FOR UPDATE SKIP LOCKED present in job.go;
H-1 through H-5 fixtures unchanged.
Refs: H-6 in certctl-audit-report.md
|
||
|
|
387fb555ac |
security: scope revocation unique index to (issuer_id, serial_number) (fixes H-1)
RFC 5280 §5.2.3 defines certificate serial number uniqueness per issuing CA,
not globally. The prior unique index on `certificate_revocations.serial_number`
enforced a stricter invariant than the spec: with 12 issuer connectors (Local
CA, ACME, Vault, step-ca, OpenSSL, DigiCert, Sectigo, Google CAS, AWS ACM PCA,
Entrust, GlobalSign, EJBCA), two distinct certificates legitimately issued by
different CAs can share a serial number. Recording a revocation for the second
collision silently dropped via `ON CONFLICT DO NOTHING`, leaving the second
cert persistently absent from OCSP/CRL responses.
Changes:
- Migration 000012 drops `idx_certificate_revocations_serial` and creates
`idx_certificate_revocations_issuer_serial` UNIQUE ON (issuer_id,
serial_number). Adds a non-unique `idx_certificate_revocations_serial_lookup`
to preserve the serial-only fast path for OCSP/CRL probes that already know
the issuer scope.
- `CertificateRevocationRepository.Create` targets the new composite key in
`ON CONFLICT` — same-issuer idempotency preserved, cross-issuer collisions
now recorded as distinct rows.
- `GetBySerial(serial)` renamed `GetByIssuerAndSerial(issuerID, serial)` on
the interface and Postgres impl. All callers (OCSP responder, CRL
generator, short-lived-cert exemption check) already have `issuerID` in
scope because the protocol paths carry it (`/api/v1/ocsp/{issuer_id}/{serial}`,
`/api/v1/crl/{issuer_id}`).
- Repository integration test added: `TestRevocationRepository_CrossIssuerSerialCollision`
asserts that serial `CAFEBABE01` can be stored under two issuers
simultaneously, that lookups return the correct row per (issuer, serial),
and that same-issuer idempotency still works (re-inserting (issuer, serial)
does not error and does not duplicate).
- Existing tests and service/integration mocks updated for the rename.
Wire-format invariants preserved: CRL DER bytes, OCSP response bytes, and
AES-256-GCM config encryption are unaffected — this change touches only
revocation-record uniqueness scope.
CWE-664.
|
||
|
|
f549a7aa79 |
security: fail closed when CERTCTL_CONFIG_ENCRYPTION_KEY is unset (fixes C-2)
EncryptIfKeySet/DecryptIfKeySet in internal/crypto/encryption.go previously
returned plaintext + wasEncrypted=false when the operator had not configured
CERTCTL_CONFIG_ENCRYPTION_KEY. That produced a data-at-rest confidentiality
bypass (CWE-311): sensitive fields on dynamically-configured issuer and
target rows (source='database') were persisted to PostgreSQL without any
encryption, and no caller could distinguish the encrypted from the plaintext
branch at runtime. The only visible signal was a single warning log line
emitted once at startup.
Fail closed instead:
- EncryptIfKeySet / DecryptIfKeySet now return crypto.ErrEncryptionKeyRequired
(a new exported sentinel, errors.Is-unwrappable) when the key is empty or
nil, rather than silently emitting plaintext. The (result, wasEncrypted,
err) tuple signature is preserved for source compatibility; only the
semantics of the no-key branch changed.
- cmd/server/main.go grows a startup pre-flight check: if no encryption key
is configured the server lists issuers and targets, counts rows with
source='database', and refuses to start (os.Exit(1)) if any exist. Operators
must either configure CERTCTL_CONFIG_ENCRYPTION_KEY or remove the exposed
rows before the control plane can boot. The warning-only path is retained
for the clean-slate case (no database rows).
- internal/service/issuer.go's SeedFromEnvVars now guards the encryption call
with len(s.encryptionKey) > 0 so env-seeded rows (source='env', which are
reconstructable on every boot from process env) continue to persist as
plaintext in the 'config' column when no key is configured. Registry load
already falls through to cfg.Config when EncryptedConfig is nil. GUI/API
write paths (source='database') remain fail-closed via propagation of
ErrEncryptionKeyRequired.
- Integration tests that exercise CreateIssuer via the handler layer now
supply a real 32-byte AES-256 test key so the encrypt path runs instead of
returning ErrEncryptionKeyRequired. Same pattern in internal/service/
testutil_test.go for consolidated service-layer tests.
- internal/crypto/encryption_test.go grows regression guards:
TestEncryptIfKeySet_EmptyKeyFailsClosed (nil_key + empty_key subtests),
TestDecryptIfKeySet_EmptyKeyFailsClosed (nil_key + empty_key subtests),
TestEncryptDecryptIfKeySet_RoundTripProducesDifferentCiphertext,
TestDecryptIfKeySet_RejectsTamperedCiphertext, and
TestEncryptIfKeySet_PreservesErrEncryptionKeyRequiredSentinel (verifies
the sentinel unwraps through fmt.Errorf(%w)-style wrapping).
Wire format is unchanged: AES-256-GCM Encrypt/Decrypt/DeriveKey, the
12-byte nonce prefix, the GCM auth tag, the PBKDF2 salt
('certctl-config-encryption-v1'), and the 100,000 iteration count are all
byte-identical. Ciphertexts produced before this change remain decryptable.
Verified:
- go build ./... : clean
- go vet ./... : clean
- go test -race ./internal/crypto/... ./internal/service/... \
./internal/integration/... ./cmd/server/... : pass
- golangci-lint run ./... : 0 issues
- govulncheck ./... : 0 reachable vulnerabilities
- rg 'return plaintext, false, nil' internal/ : no matches
- Coverage: crypto 85.0% (unchanged), service 67.8% (was 67.9%, noise),
cmd/server 0.0% (unchanged baseline). All above CI thresholds.
See certctl-audit-report.md for the full finding record and resolution log.
|
||
|
|
13cd4d98ba |
feat(V2.2): bulk revocation — filter-based fleet-wide certificate revocation
Add POST /api/v1/certificates/bulk-revoke with filter criteria (profile_id, owner_id, agent_id, issuer_id, team_id, certificate_ids), partial-failure tolerance, and audit trail. Includes MCP tool, CLI command (certs bulk-revoke), server-side bulk modal in GUI replacing client-side sequential loop, OpenAPI spec, compliance mapping updates, and 21 new tests (12 service, 7 handler, 1 CLI, 1 frontend). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
e6088c79a3 |
feat(M35): dynamic target configuration with encrypted config, test connection, and GUI updates
Mirror M34's dynamic issuer config pattern for deployment targets: AES-256-GCM encrypted config storage, sensitive field redaction in API responses, agent heartbeat-based test connection endpoint, and full frontend updates including test status indicators, source badges, and removal of stale hostname/status fields from the Target interface. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
995b72df05 |
feat(M34): dynamic issuer configuration with encrypted config storage
Replace static env-var-based issuer wiring with GUI-driven dynamic configuration stored encrypted in PostgreSQL. Operators can now configure, test, enable/disable, and manage issuers from the dashboard without restarting the server. Key changes: - AES-256-GCM encryption for sensitive issuer config at rest (PBKDF2 key derivation with 100k iterations) - Dynamic IssuerRegistry with sync.RWMutex replacing static map - Connector factory pattern (issuerfactory.NewFromConfig) replacing 140 lines of static wiring in main.go - Migration 000009: encrypted_config, last_tested_at, test_status, source columns on issuers table - Env var seeding on first boot with ON CONFLICT DO NOTHING - Registry Rebuild() for atomic map swap after CRUD operations - Issuer type validation against domain constants on Create - Audit trail for test connection results - Conditional seeding for step-ca/OpenSSL (only when env vars set) - GUI: source badge, connection test status on issuer detail page Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
11173a74c6 |
feat(M31): agent work routing — scope jobs to assigned agents
Deployment jobs now set agent_id from target→agent relationship at creation time. GetPendingWork() uses ListPendingByAgentID() with a 3-way UNION query (direct match, legacy NULL fallback via target JOIN, AwaitingCSR via cert→target→agent chain) so each agent only receives its own jobs. - Added AgentID *string to Job domain struct - Added agent_id to all job SQL queries (5 SELECTs, INSERT, UPDATE, scanJob) - New ListPendingByAgentID() repository method - Rewrote GetPendingWork() from ~25 lines to single scoped query - 4 new Go tests (3 agent routing + 1 deployment agent_id) - Frontend: agent_id/target_id on Job type Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
fde5b39d53 |
fix: resolve test compilation and runtime failures across codebase
- Add context.Context to handler test mocks (agent, agent_group) - Refactor scheduler to use local interfaces instead of concrete service types - Wire RevocationSvc/CAOperationsSvc sub-services in integration tests - Add context.Background() to service test calls (agent, agent_group) - Fix repo integration tests: add FK prerequisite records (team, owner, issuer, renewal_policy) before creating certificates - Set MaxOpenConns(1) on test DB to preserve SET search_path across queries - Fix Apache/HAProxy tests: replace "echo ok"/"echo reload" with "true" binary to avoid macOS exec.Command PATH resolution failure - Fix validation tests: correct error expectations for regex-first checks, replace null byte strings with strings.Repeat for length tests - Fix scheduler timeout test flakiness with t.Skip fallback - Remove unused imports (context in ca_operations_test, service in scheduler) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
677c28aeca |
refactor(api): TICKET-006 replace 18-param RegisterHandlers with HandlerRegistry struct
Replace the 18-parameter RegisterHandlers function signature with a cleaner HandlerRegistry struct that groups all API handler dependencies. This eliminates the signature explosion that made the function difficult to read and maintain. Changes: - Added HandlerRegistry struct with 18 fields grouping all handler types - Updated RegisterHandlers to accept a single HandlerRegistry parameter - Updated all internal handler references to use reg.FieldName syntax - Updated call sites in cmd/server/main.go and integration tests - No functional changes, purely structural refactoring Resolves TICKET-006: RegisterHandlers Signature Explosion |
||
|
|
b4ac0cda43 |
fix: use context.Context instead of interface{} in VerificationService interface
The handler's VerificationService interface used interface{} for the ctx
parameter, but the service implementation uses context.Context. This caused
a compile error: *service.VerificationService does not implement
handler.VerificationService.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
be72627aeb |
feat: M25 post-deployment TLS verification + M26 Traefik/Caddy targets
M25: After deploying a certificate, the agent probes the live TLS
endpoint and compares SHA-256 fingerprints to verify the correct cert
is being served. Best-effort — failures don't block deployments.
New endpoints: POST /jobs/{id}/verify, GET /jobs/{id}/verification.
Migration 000008 adds verification columns to jobs table.
M26: Traefik target connector (file provider, auto-reload) and Caddy
target connector (dual-mode: admin API hot-reload or file-based).
Both wired into agent dispatch.
Also: restructured README to highlight supported integrations (issuers,
targets, notifiers) earlier, moved API/CLI/MCP sections lower. Updated
all docs (features, connectors, architecture, testing guide, why-certctl)
and fixed integration tests for 18-param RegisterHandlers signature.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
50c520e1ff |
feat: dashboard theme overhaul — light content area with branded teal sidebar
Complete frontend visual redesign using certctl logo color palette: - Deep teal sidebar (#0c2e25) with prominent centered logo (64px in white pill) - Light content area (#f0f4f8) with white cards and visible borders - Brand colors from logo: teal (#2ea88f), blue (#3b7dd8), orange (#e8873a), green (#4ebe6e) - Inter + JetBrains Mono typography, colored stat card top borders - All 17 pages + 7 components updated (25 files, ~700 lines changed) - 15 new dashboard screenshots replacing old dark theme screenshots - Prometheus metrics e2e test added, integration test mock fixes - Docs updated: architecture.md theme description, testing-guide.md DNS-PERSIST-01 coverage Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
7d14635a72 |
feat: add EST server (RFC 7030) for device certificate enrollment (M23)
Implement Enrollment over Secure Transport protocol with 4 endpoints under /.well-known/est/ — cacerts (CA chain distribution), simpleenroll (initial enrollment), simplereenroll (certificate renewal), and csrattrs (CSR attributes). PKCS#7 certs-only wire format with hand-rolled ASN.1, accepts both PEM and base64-encoded DER CSRs, configurable issuer and profile binding, full audit trail. 28 new tests (18 handler + 10 service). Also includes: - GetCACertPEM added to issuer connector interface (all 4 issuers updated) - EST integration tests wired into e2e test suite (13 test cases) - QA testing guide Part 26 (15 manual EST test cases) - All docs updated: README, features, architecture, concepts, connectors, quickstart, demo-advanced (endpoint counts, MCP wording, agent IDs, issuer interface, resource lists, OpenSSL status) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
4f90be9311 |
feat: add network certificate discovery (M21) and Prometheus metrics (M22)
M21 adds server-side active TLS scanning of CIDR ranges with concurrent probing, sentinel agent pattern for pipeline reuse, and full CRUD API for scan targets. M22 adds Prometheus exposition format endpoint alongside existing JSON metrics. Comprehensive documentation audit updates all docs to reflect 91 endpoints, 19 tables, 6 scheduler loops, and 900+ tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
667a30870d |
feat: M18b Filesystem Certificate Discovery — agent scanning, server dedup, triage API
Agent-side:
- Filesystem scanner walks configured directories (CERTCTL_DISCOVERY_DIRS)
- Parses PEM (.pem, .crt, .cer, .cert) and DER (.der) certificate files
- Extracts CN, SANs, serial, issuer/subject DN, validity, key info, SHA-256 fingerprint
- Reports discoveries to control plane on startup + every 6 hours
- Skips files >1MB and private key files
Server-side:
- Migration 000006: discovered_certificates + discovery_scans tables
- Domain model: DiscoveredCertificate, DiscoveryScan, DiscoveryReport
- Three triage states: Unmanaged, Managed (claimed), Dismissed
- Repository with upsert dedup (fingerprint + agent + path)
- Service layer: process reports, claim, dismiss, list, summary
- 7 new API endpoints (84 total):
POST /agents/{id}/discoveries, GET /discovered-certificates,
GET /discovered-certificates/{id}, POST .../claim, POST .../dismiss,
GET /discovery-scans, GET /discovery-summary
- Audit trail: scan_completed, cert_claimed, cert_dismissed events
Tests: 28 new test functions (domain, handler, service layers)
Docs: README, quickstart, demo-guide, demo-advanced, architecture,
concepts, connectors, features.md all updated
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
e078a686bf |
feat: M20 Enhanced Query API — sort, time-range filters, cursor pagination, sparse fields, deployments endpoint
V2 (free) query enhancements for certificates:
- `sort` param with direction (`?sort=-notAfter` for descending)
- Time-range filters: `expires_before`, `expires_after`, `created_after`, `updated_after`
- Cursor-based pagination (`?cursor=token&page_size=100`) alongside page-based
- Sparse field selection (`?fields=id,commonName,status`)
- Additional filters: `agent_id`, `profile_id`
- New endpoint: `GET /api/v1/certificates/{id}/deployments`
25 new tests (12 handler + 13 e2e) covering all M20 features.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
373346a0ba |
feat: comprehensive e2e test suite for cross-milestone integration testing
7 test functions with 35+ subtests covering stats/metrics endpoints, cross-resource workflows (policy→cert→agent→renew→revoke→audit), job approval workflows, notification endpoints, CRL endpoint, pagination across 6 endpoints, and issuer/target CRUD lifecycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
ee75f149ae |
feat: M14 — Observability (dashboard charts, agent fleet, stats API, metrics, structured logging, rollback)
Backend: StatsService with 5 aggregation methods, JSON metrics endpoint, slog-based structured logging middleware. Stats API: dashboard summary, certificates-by-status, expiration timeline, job trends, issuance rate. 23 new backend tests. Frontend: Recharts-powered dashboard with 4 charts (status pie, expiration heatmap, job trends line, issuance bar), agent fleet overview page with OS/arch grouping and version breakdown, deployment rollback buttons on version history. 7 new frontend tests. 78 API endpoints, 744+ total tests (658 Go + 86 Vitest). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
d4fd46155e |
fix: unused certRepo variable and missing revocation wiring
- revocation_test.go: certRepo unused in TestGenerateDERCRL_Success, replaced with blank identifier - lifecycle_test.go: missing revocationRepo init and setter calls (SetRevocationRepo, SetNotificationService, SetIssuerRegistry) that negative_test.go already had Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
5d98e373e3 |
feat: M15a — certificate revocation API, CRL endpoint, and revocation notifications
Implements core revocation infrastructure: POST /api/v1/certificates/{id}/revoke
with all 8 RFC 5280 reason codes, JSON-formatted CRL at GET /api/v1/crl, webhook
and email revocation notifications, best-effort issuer notification, and immutable
revocation audit trail. Includes 48 new tests across service, handler, integration,
and domain layers (600+ total). Fixes 3 pre-existing test bugs (team_test error
matching, agent_group delete status code, team handler per_page validation).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
||
|
|
690765b53e |
test: comprehensive test expansion — 330+ to 525+ tests, close M11b coverage gaps
Add 195+ new tests across service, handler, connector, and integration layers: - Service tests: team (23), owner (21), agent_group (25), issuer (18), issuer_adapter (6) - Handler tests: teams (26), owners (21) - NGINX target connector tests (13): config validation, deployment, reload - Integration tests: 19 M11b endpoint subtests (teams, owners, agent groups CRUD) - CI pipeline: add ./internal/connector/target/... to test coverage path - Docs: update test counts to 525+ across README, architecture, CLAUDE.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
b0549e6f05 |
feat: M11b — ownership tracking, agent groups, interactive renewal approval
Ownership: owners/teams GUI pages, notification email resolution via resolveRecipient (owner_id → owner.email lookup). Agent groups: dynamic device grouping by OS/arch/IP CIDR/version with manual include/exclude membership, migration 000004, full CRUD stack (domain → repo → service → handler → frontend). Interactive approval: AwaitingApproval job state, approve/reject API endpoints with reason tracking. Tests: 12 agent group handler tests, 8 approve/reject job handler tests, integration tests updated for 13-param RegisterHandlers. Docs updated across architecture, concepts, and seed data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
a579a84c7f |
feat: M11a — certificate profiles, crypto policy enforcement, short-lived cert expiry
Add certificate profiles as named enrollment templates that control allowed key algorithms, max TTL, permitted EKUs, required SAN patterns, and optional SPIFFE URI SANs. CSR submissions are validated against profile rules at signing time (key type + minimum size). Short-lived certs (TTL < 1 hour) auto-expire via a new scheduler loop — expiry acts as revocation, no CRL/OCSP needed. New files: - Migration 000003: certificate_profiles table, FK columns on managed_certificates/renewal_policies, key metadata on certificate_versions - domain/profile.go: CertificateProfile + KeyAlgorithmRule structs - repository/postgres/profile.go: full CRUD with JSONB marshaling - service/profile.go: ProfileService with validation + audit logging - service/crypto_validation.go: CSR-against-profile validation (RSA/ECDSA/Ed25519) - handler/profiles.go: 5 HTTP endpoints under /api/v1/profiles - web/src/pages/ProfilesPage.tsx: profiles management page Modified: - renewal.go: CSR validation in CompleteAgentCSRRenewal, ExpireShortLivedCertificates - scheduler.go: 30s short-lived expiry check loop - certificate.go (repo): nullable profile FK, key metadata on versions - main.go: profile repo/service/handler wiring, 8-param NewRenewalService - router.go: 12-param RegisterHandlers with profile routes - seed_demo.sql: 4 demo profiles (standard, mtls, short-lived, high-security) - Frontend: types, API client, routing, sidebar nav Tests: 40 new tests across handler (15), service (13), crypto validation (12) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
07275bf92f |
feat: M10 — agent metadata collection, Apache httpd + HAProxy target connectors
Agents now report OS, architecture, IP address, hostname, and version via heartbeat using runtime.GOOS, runtime.GOARCH, and net.Dial. New migration adds columns to agents table. Heartbeat handler, service, and repository updated to accept and persist metadata. GUI shows OS/Arch in agent list and full system info in agent detail page. Apache httpd connector: separate cert/chain/key files, apachectl configtest validation, graceful reload. HAProxy connector: combined PEM file (cert+chain+key), optional config validation, reload. Both wired into agent binary's target connector switch. 14 tests for new connectors. All existing tests updated for new Heartbeat/UpdateHeartbeat signatures. Docs updated across README, architecture, concepts, and connectors guides. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
66f04f7afe |
style: run gofmt -s across all Go files
Fixes Go Report Card gofmt score from 52% to 100%. Pure formatting changes — no logic modifications. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
ac594d88c0 |
Fix agent registration URL in negative test and connector test path in CI
negative_test.go used /api/v1/agents/register but the route is POST /api/v1/agents. ci.yml used ./internal/connector/... which includes packages with no test files causing "no such tool covdata" errors; narrowed to ./internal/connector/issuer/local/... which has tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
8526314a44 |
Implement M9: test hardening with handler tests, negative paths, CI coverage gates
All 7 handler files now have test coverage: jobs (14 tests), notifications (11), policies (15), issuers (15), targets (14). Negative-path integration tests cover nonexistent resources, invalid payloads, malformed CSR, expired cert lifecycle, and method-not-allowed errors. CI now enforces coverage thresholds (service 60%+, handler 50%+) and includes connector tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
e2821c448a |
Implement M8: agent-side key generation with ECDSA P-256
Private keys never leave agent infrastructure. Agents generate ECDSA P-256 key pairs locally, store them with 0600 permissions, and submit only the CSR (public key) to the control plane. New AwaitingCSR job state pauses renewal/issuance jobs until the agent submits its CSR. Server-side keygen retained behind CERTCTL_KEYGEN_MODE=server for demo/development. Key changes: - Dual keygen mode via CERTCTL_KEYGEN_MODE (agent default, server for demo) - AwaitingCSR job state with CommonName/SANs in work response - Agent ECDSA P-256 keygen, local key storage, CSR-only submission - CompleteAgentCSRRenewal server-side flow for agent-submitted CSRs - DeploymentRequest.KeyPEM for agent-provided keys during deployment - Dockerfile.agent creates /var/lib/certctl/keys with correct ownership Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
11a5ca6135 |
Fix NewHealthHandler call in integration test for M7 signature change
Pass "none" auth type to match updated NewHealthHandler(authType string). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
5553568495 |
Implement M4: comprehensive test coverage with 120 tests
Service layer (63 tests): certificate, agent, audit, job, notification, policy, and renewal services with mock repositories covering threshold alerting, deduplication, status transitions, and job processing. Handler layer (46 tests): certificate and agent HTTP handlers using httptest with mock service interfaces, covering success/error paths, pagination, JSON marshaling, and path parameter extraction. Integration (11 subtests): end-to-end certificate lifecycle test exercising real services and Local CA issuer through HTTP API — create cert, trigger renewal, process jobs, register agent, heartbeat, verify audit trail. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |