Compare commits

...

14 Commits

Author SHA1 Message Date
shankar0123 707d8de4fb UX-001: sidebar re-entry + inline team/owner creation in wizard
Closes UX-001 (OnboardingWizard CertificateStep dead-end): users no
longer have to navigate away from the wizard and lose their in-flight
state when the required Owner/Team dropdowns are empty.

Layout.tsx
  - Adds persistent 'Setup guide' button in the left sidebar.
  - Clears localStorage 'certctl:onboarding-dismissed' then navigates
    to /?onboarding=1 as a re-entry signal that overrides dismissal.
  - localStorage.removeItem wrapped in try/catch to tolerate storage
    access errors (private browsing, quota, etc.).

DashboardPage.tsx
  - Reads ?onboarding=1 via useSearchParams as a forceOnboarding flag.
  - forceOnboarding bypasses the latched first-run gate so the wizard
    reopens even after dismissal or with certs/issuers already present.
  - onDismiss now also strips ?onboarding=1 via setSearchParams(next,
    { replace: true }) so a page refresh does not relaunch the wizard.

OnboardingWizard.tsx
  - Adds CreateTeamModalInline and CreateOwnerModalInline inside
    CertificateStep. Both wire through React Query: createTeam /
    createOwner mutation on success invalidates ['teams'] / ['owners']
    and calls onCreated(id) so the parent select auto-selects the new
    row as soon as the refetch lands.
  - '+ New team' and '+ New owner' buttons placed next to the select
    labels; empty-state copy replaced with inline 'create one now'
    buttons (no more Link back to /owners /teams).
  - CreateOwner coerces empty teamId to undefined before mutation so
    the server contract matches OwnersPage.

Tests (12 new, all green; total suite 252 passed / 0 failed):
  - Layout.test.tsx (4): Setup guide button renders, clicking it clears
    the dismissal key and navigates to /?onboarding=1, tolerates
    localStorage.removeItem throwing.
  - DashboardPage.test.tsx (4): first-run auto-open, ?onboarding=1
    re-entry after dismissal, onDismiss writes localStorage + strips
    the query param, dismissed-with-no-param stays closed.
  - OnboardingWizard.test.tsx (4): Skip-Skip reaches CertificateStep
    with '+ New team' / '+ New owner' buttons visible; '+ New team'
    happy path with React Query invalidation + parent-select
    auto-select via option-parent traversal (label is a sibling, not
    htmlFor-linked); '+ New owner' happy path pins team_id: undefined
    coercion; Cancel abort never mutates.

Test infrastructure notes:
  - Closure-driven vi.fn().mockImplementation pattern drives the
    post-invalidation refetch: the mutation mock mutates a closure
    variable that the getTeams/getOwners mock reads, so the parent
    select's new <option> exists by the time the refetch lands.
  - Anchored regex (/^Create Team$/, /^Create Owner$/) disambiguates
    the modal submit from the '+ New team' / '+ New owner' triggers.

Verification gates (all green):
  - vitest run: 252 passed / 0 failed (8 files, 13.98s)
  - tsc --noEmit: 0 errors
  - vite build: clean production bundle (851.77 kB js / 226.81 kB gzip)

No new runtime dependencies. Frontend-only change.
2026-04-19 14:49:04 +00:00
shankar0123 0725713e19 Close I-004 (agent hard-delete cascades targets) coverage-gap finding
Operator decision answered as full soft-delete with optional forced
cascade — hard-delete is not reachable from any public surface. Prior
to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents`
whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id
silently wiped every target, orphaning certs and aborting in-flight
jobs. The finding closure reshapes the agent-removal contract around
soft retirement with explicit preflight counts, an opt-in cascade
gated by a mandatory reason, and unconditional protection for the
four reserved sentinel agents used by discovery sources.

Schema — migration 000015:
  migrations/000015_agent_retire.up.sql flips
  deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE
  RESTRICT, so a stray `DELETE FROM agents` now errors at the DB
  boundary instead of quietly destroying targets. Both `agents` and
  `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason
  TEXT pair (TEXT not VARCHAR so operator comments are never
  truncated), indexed via partial indexes WHERE retired_at IS NOT
  NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP
  CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT
  EXISTS) so repeated runs against partially-migrated databases
  converge. migrations/000015_agent_retire.down.sql restores CASCADE
  and drops the new columns for clean rollback. A dedicated
  repository-layer testcontainers test
  (internal/repository/postgres/migration_000015_test.go) asserts the
  before/after FK action, column presence, index presence, and
  round-trip idempotency under up→down→up.

Domain — sentinel guard + dependency counts:
  internal/domain/connector.go gains IsRetired() on Agent, the
  exported SentinelAgentIDs slice listing server-scanner,
  cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the
  four reserved IDs documented in CLAUDE.md and created at startup in
  cmd/server/main.go), IsSentinelAgent(id string) predicate,
  AgentDependencyCounts{ActiveTargets, ActiveCertificates,
  PendingJobs} with a HasDependencies() method, and ActorTypeAgent /
  ActorTypeSystem enum values used by audit emission downstream.
  Coverage locked down by internal/domain/connector_test.go.

Service — 8-step ordered contract:
  internal/service/agent_retire.go:RetireAgent(ctx, id, actor,
  opts{Force, Reason}) enforces a fixed execution order:
  (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel
      unconditionally; force=true does NOT bypass it.
  (2) fetch — ErrAgentNotFound on miss.
  (3) idempotency — if IsRetired() already, return
      AgentRetirementResult{AlreadyRetired: true} with no new audit
      event and no state change (safe to replay from flaky clients).
  (4) preflight counts — collectAgentDependencyCounts runs
      ActiveTargets, ActiveCertificates, PendingJobs sequentially
      (not in parallel; keeps the per-query timeout predictable and
      matches the repo's existing call-chain shape).
  (5) force-reason guard — opts.Force=true with empty Reason returns
      ErrForceReasonRequired (wired into the 400 status surface).
  (6) dependency guard — HasDependencies() with opts.Force=false
      returns BlockedByDependenciesError{Counts} (wired into the 409
      body with per-bucket counts).
  (7) mutation — single pinned retiredAt := time.Now(); agent
      retirement first, then cascade target retirement if opts.Force,
      all under the repo's single transaction so the two retired_at
      stamps match to the second.
  (8) best-effort audit — agent_retired always; agent_retirement_
      cascaded additionally on the force path. Actor is whatever the
      handler resolves from the request; actor type is mapped by
      resolveActorType (system/agent-prefix→Agent/else→User). Audit
      emission failures are logged via slog.Error but do not abort
      the retirement (matches the house convention used by every
      other scheduler-emitted event).

  BlockedByDependenciesError implements Error() as
  "active_targets=%d, active_certificates=%d, pending_jobs=%d" and
  Unwrap() → ErrBlockedByDependencies. The single struct satisfies
  errors.Is via Unwrap (used by scheduler-level tests) and errors.As
  via the concrete type (used by the handler to fish out Counts for
  the 409 body). ListRetiredAgents(page, perPage) adds a separate
  paginated accessor with page<1→1 and perPage<1→50 normalization so
  retired rows are queryable without polluting the default agent
  listing.

  Sentinel guard coverage is asymmetric by design: all four reserved
  IDs are protected, and force=true cannot override. Regression tests
  in internal/service/agent_retire_test.go assert each of the eight
  steps in order, plus sentinel bypass attempts and idempotency
  replay.

Handler + router — status-code surface:
  internal/api/handler/agents.go:RetireAgent exposes seven status
  codes on DELETE /agents/{id}:
    200 on a fresh retirement (body echoes AgentRetirementResult).
    204 on idempotent replay (AlreadyRetired=true; no new audit).
    400 on ErrForceReasonRequired.
    403 on ErrAgentIsSentinel.
    404 on ErrAgentNotFound.
    409 on BlockedByDependenciesError, with a custom body shape
        {error, counts{active_targets, active_certificates,
        pending_jobs}} that bypasses the default ErrorWithRequestID
        envelope so callers get the per-bucket numbers directly.
    500 on any other error.
  Heartbeat HandleHeartbeat returns 410 Gone when the agent is
  retired (ErrAgentRetired), signalling the agent to shut down.
  Query params `force=true` and `reason=<text>` drive the cascade
  path; both are forwarded as url.Values through the new MCP
  transport.

  internal/api/router/router.go registers GET /api/v1/agents/retired
  literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's
  literal-beats-pattern-var precedence routes "retired" to the
  paginated retired-agents listing instead of fetching a hypothetical
  agent named "retired".

Agent binary — clean shutdown on 410:
  cmd/agent/main.go gains the ErrAgentRetired sentinel, a
  retiredOnce sync.Once, and a retiredSignal chan struct{}. A
  markRetired(source, statusCode, body) helper closes the channel
  exactly once; the Run() select loop observes the close and returns
  ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired)
  and exits cleanly instead of spinning in the heartbeat retry loop.
  The 410 Gone surface is therefore terminal for the agent process.

MCP transport:
  internal/mcp/client.go adds Client.DeleteWithQuery(path, query),
  a new additive transport method. Client.Delete is path-only; without
  this method the retire tool would silently drop `force` and `reason`,
  turning every cascade retire into a default soft-retire. The new
  method shares do()'s 204 normalization and 4xx/5xx error
  propagation so tool authors get one contract.
  internal/mcp/tools.go + internal/mcp/types.go expose the
  retire_agent tool with Force+Reason inputs wired through
  DeleteWithQuery.

CLI:
  cmd/cli/main.go + internal/cli/client.go add two CLI surfaces:
  `agents list --retired` (client-side strip of --retired then
  delegation to ListRetiredAgents, sharing --page/--per-page parsing
  with the default listing) and `agents retire <id> [--force --reason
  "…"]` (mirrors ErrForceReasonRequired — force without reason is
  rejected client-side before the request is sent). JSON + table
  output modes both honor the new columns.

Frontend:
  web/src/pages/AgentsPage.tsx surfaces retired/retire affordances.
  web/src/api/client.ts + web/src/api/types.ts expose the retire
  endpoint and the retired-listing. 4 new Vitest regression cases.

OpenAPI:
  api/openapi.yaml documents DELETE /agents/{id} with all seven
  status codes, 410 on heartbeat, and the 409 per-bucket body shape.

Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go           — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go               — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go               — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
  internal/domain/connector_test.go               — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies

Files:
  api/openapi.yaml                                — DELETE + 410 + 409 body shape
  cmd/agent/main.go                               — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go                                 — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md,
    docs/testing-guide.md                         — retirement contract narrative
  internal/api/handler/agents.go                  — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go      — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go                   — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go               — new
  internal/cli/client.go                          — ListRetiredAgents + RetireAgent
  internal/domain/connector.go                    — IsRetired, SentinelAgentIDs,
                                                    IsSentinelAgent, AgentDependencyCounts,
                                                    ActorTypeAgent/System
  internal/domain/connector_test.go               — new
  internal/integration/lifecycle_test.go          — retirement fixture
  internal/mcp/client.go                          — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go               — new
  internal/mcp/tools.go, internal/mcp/types.go    — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go               — AgentRepository retirement methods
  internal/repository/postgres/agent.go           — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go                       — wire into AgentService surface
  internal/service/agent_retire.go                — new 8-step contract
  internal/service/agent_retire_test.go           — new
  internal/service/deployment.go                  — skip retired agents
  internal/service/target.go                      — skip retired agents
  internal/service/testutil_test.go               — shared mocks extended
  migrations/000015_agent_retire.up.sql           — new
  migrations/000015_agent_retire.down.sql         — new
  web/src/api/client.ts, types.ts + tests         — retire endpoint wiring
  web/src/pages/AgentsPage.tsx                    — retire UI
2026-04-19 05:24:00 +00:00
shankar0123 1ee77c89f8 I-003: job timeout reaper closes AwaitingCSR/AwaitingApproval gap
Add 11th always-on scheduler loop that transitions jobs stuck in
AwaitingCSR (default 24h TTL) or AwaitingApproval (default 168h TTL)
to Failed. I-001's retry loop then auto-promotes eligible Failed jobs
back to Pending. No new status enum, no schema migration.

- JobRepository.ListTimedOutAwaitingJobs with per-status cutoff WHERE
- JobService.ReapTimedOutJobs mirrors RetryFailedJobs structure
- Scheduler jobTimeoutLoop with atomic.Bool idempotency guard, 2m
  per-tick context, WaitGroup shutdown drain
- Config: CERTCTL_JOB_TIMEOUT_INTERVAL (10m), CERTCTL_JOB_AWAITING_CSR_TIMEOUT
  (24h), CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT (168h)
- Audit event per transition: actor=system, actorType=System,
  action=job_timeout, details={old_status, new_status, timeout_reason,
  age_hours}
- 14 new tests: 3 config, 7 service, 4 scheduler
2026-04-19 01:37:18 +00:00
shankar0123 4bc8b3e723 fix(config): add RetryInterval to TestValidate_ValidConfig + TestValidate_AuthTypeNone fixtures (I-001 follow-up)
Problem:
  TestValidate_ValidConfig and TestValidate_AuthTypeNone construct a
  SchedulerConfig without RetryInterval, so Validate() fails the
  'retry interval must be at least 1 second' check at config.go:1086
  with 'retry interval must be at least 1 second'. Both tests expect
  success, so they fail whenever run.

Root cause (re-derived from source, not inherited from memory):
  git log -S 'retry interval must be at least' --source --all shows
  the validation was introduced in 0200c7f (I-001, RetryFailedJobs
  scheduler wiring). git log -- internal/config/config_test.go shows
  the test file was last touched in 7382e5f, which predates 0200c7f.
  I-001 added a new Validate() rule without updating the two positive
  test fixtures — a gap in I-001's verification pass.

  This is NOT C-001 fallout. The config_test.go file was untouched by
  the C-001 closure commits 91642e2 and 4696116. The failure surfaced
  during the full test suite run after C-001 landed because no one
  had run 'go test ./internal/config/...' since I-001.

Scope:
  - internal/config/config_test.go (2 fixtures: TestValidate_ValidConfig,
    TestValidate_AuthTypeNone).

Implementation:
  Added 'RetryInterval: 5 * time.Minute' to both SchedulerConfig
  literals. 5 minutes matches the I-001 default at config.go:818:

    RetryInterval: getEnvDuration("CERTCTL_SCHEDULER_RETRY_INTERVAL", 5*time.Minute)

  The other two TestValidate_* tests (InvalidAuthType, APIKeyAuth_
  MissingSecret) are unaffected because they expect Validate() to
  error at the auth-type check (line 1052) or auth-secret check
  (line 1057), both of which fire before the RetryInterval check at
  line 1086.

Verification:
  - go test -count=1 -run 'TestValidate_' ./internal/config/...: PASS
  - go test -short -count=1 ./...: all packages PASS
  - go vet ./...: exit 0

Residual:
  None. This is a pure test-fixture fix — production code is unchanged.

Commit:
  0200c7f (I-001) should have included this edit. Attributed here for
  traceability.
2026-04-19 00:33:22 +00:00
shankar0123 469611650c fix(cli): add missing os + path/filepath imports to client_test.go
Follow-up to 91642e2. TestClient_ImportCertificates_SixFieldPayload
uses filepath.Join(t.TempDir(), ...) and os.WriteFile to stage a
test PEM, but the import block only listed encoding/json,
encoding/pem, net/http, etc. — neither os nor path/filepath was
imported. go vet rejected the package with 'undefined: filepath'
(and would have caught 'undefined: os' next).

Add both imports. No behavioral change — the referenced symbols
are the standard library's usual names for their respective
packages, so the test compiles and runs exactly as intended.
CI should now pass go build + go vet on the cli package.
2026-04-19 00:27:11 +00:00
shankar0123 91642e2860 C-001 scope expansion: tighten parallel POST /api/v1/certificates call sites to six-field contract
Problem:
a53a4b8 closed C-001 at the handler boundary by tightening the
ValidateRequired contract on POST /api/v1/certificates to require six
fields: name, common_name, renewal_policy_id, issuer_id, owner_id,
team_id. (Correction re-derived from source: the handler
ValidateRequired calls on owner_id/team_id/renewal_policy_id were
actually installed in 3287e17 under M-002/M-003/M-006 auth unification
— a53a4b8's commit message overstates scope.) Post-audit on
2026-04-18 found three parallel call sites still shipping
three-to-four-field payloads that the newly strict handler would
reject with HTTP 400:
  - GUI: OnboardingWizard CertificateStep (common_name + sans +
    issuer_id + environment only)
  - CLI: certctl-cli import (common_name + issuer_id + status only;
    no required-flag gating)
  - Tests: deploy/test/qa_test.go Part03 positive paths

Scope:
Bring every POST /api/v1/certificates caller to six-field parity. No
handler changes — the contract is authoritative; the callers must
conform.

Implementation:

  GUI — OnboardingWizard CertificateStep expansion:
    web/src/pages/OnboardingWizard.tsx adds name/owner_id/team_id/
    renewal_policy_id state. React Query hooks for getOwners/
    getTeams/getPolicies use per_page: '500' to populate dropdowns
    without pagination-driven truncation. Payload ships all six
    required fields plus sans/certificate_profile_id/environment.
    nextDisabled gate enforces all six before the Continue button
    activates.

  CLI — ImportCertificates rewrite:
    internal/cli/client.go rewrites ImportCertificates with
    flag.NewFlagSet("import", flag.ContinueOnError). Required flags:
    --owner-id, --team-id, --renewal-policy-id, --issuer-id. Optional:
    --name-template (default {cn}, templated via strings.ReplaceAll
    against cert.Subject.CommonName), --environment (default
    imported). Missing required flags fail pre-HTTP with a clear
    error. Request map ships all six required fields plus sans/
    environment/status/optional serial_number.
    cmd/cli/main.go — usage string updated to document the new
    required/optional flags.

  Tests — qa_test.go Part03 positive paths:
    deploy/test/qa_test.go Part03 Create_Minimal and Create_Full
    updated to include all six fields. Uses seed_demo.sql-supplied IDs
    (o-alice, t-platform, rp-standard) — docker-compose.demo.yml is
    the run context. C-001 explanatory comment added above
    Create_Minimal so future readers understand why the minimal
    payload is no longer minimal.

  MCP parity:
    Verified no-op. internal/mcp/types.go:28 CreateCertificateInput
    already declares all six fields; internal/mcp/tools.go:102
    forwards the typed struct unchanged.

Verification:

  Go CLI regression tests (internal/cli/client_test.go):
    * TestClient_ImportCertificates_MissingRequiredFlags — 5 subtests,
      one per missing required flag, confirms flag.ContinueOnError
      rejects with non-nil error before any HTTP call is attempted.
    * TestClient_ImportCertificates_MissingPositionalArgs — confirms
      the "usage: import <file>" error path when no PEM file is
      supplied after the flags.
    * TestClient_ImportCertificates_SixFieldPayload — uses httptest
      to decode the POST body and assert all six required fields
      plus sans/environment are present on the wire.

  Frontend regression test (web/src/api/client.test.ts):
    'createCertificate accepts and transmits all six required fields'
    pins the wire shape for both GUI call sites (OnboardingWizard
    CertificateStep + CertificatesPage CreateCertificateModal). If
    either UI surface accidentally drops a field, this assertion
    fails in CI rather than surfacing as a 400 at runtime.

  Grep-based call-site sweep:
    Enumerated every POST /api/v1/certificates create caller. Four
    total: OnboardingWizard, CertificatesPage, MCP tools, CLI import.
    All four now ship six-field payloads. Claim path
    (internal/service/discovery.go) updates existing rows and does
    not POST. EST/SCEP handlers invoke internal
    certService.CreateVersion, not the public API. Negative-path
    tests (qa_test.go:1085/1267/1274/1288/1298) remain valid: they
    assert 400/non-500 on oversized/malformed/missing-CN/UTF-8/empty
    bodies, and these properties still hold under the stricter
    handler.

  Static gates:
    go build ./..., go vet ./..., go test ./internal/cli/..., and
    cd web && npm run test deferred to operator pre-push — the Go
    toolchain is not available in the session sandbox. Grep-based
    verification confirms the syntactic shape of every changed file.

Residual:
None. Every POST /api/v1/certificates call site now conforms to the
six-field contract; the wire shape is pinned by both Go and
TypeScript regression tests.

Commit:
TBD-SHA (audit doc + CLAUDE.md carry TBD-SHA placeholders to be
amended after commit)
2026-04-19 00:25:10 +00:00
shankar0123 0200c7f4a4 Close I-001 (RetryFailedJobs never invoked) coverage-gap finding
Operator decision answered as Option A: JobService.RetryFailedJobs is
now wired into the scheduler as an always-on 10th loop. Prior to this
commit the method was implemented, unit-tested, and exported but had
zero runtime callers — any job that transitioned to status=Failed stayed
Failed forever regardless of how many attempts it had remaining.

Scheduler — 10th loop:
  internal/scheduler/scheduler.go grows a jobRetryLoop alongside the
  existing nine loops (renewal, jobs, health, notifications, short-lived,
  network scan, digest, health check, cloud discovery). The loop follows
  the established run-immediately-then-tick pattern (same shape as
  jobProcessorLoop), gated by a sync/atomic.Bool idempotency guard and
  joined into the scheduler's sync.WaitGroup so WaitForCompletion drains
  it on graceful shutdown. Each tick runs under a 2-minute context
  timeout mirroring jobProcessorLoop's opCtx budget. The runJobRetry
  helper invokes jobService.RetryFailedJobs(ctx, 3) — the advisory
  maxRetries cap is belt-and-suspenders; per-job eligibility is still
  enforced inside the service via Attempts < MaxAttempts.

  The JobServicer scheduler-interface gains RetryFailedJobs so the
  scheduler's dependency surface stays explicit and mockable.

Service — audit trail per retry:
  internal/service/job.go:RetryFailedJobs now emits an audit event for
  every Failed→Pending transition. Following the house convention used
  by all scheduler-emitted events, actor='system' and actorType=
  domain.ActorTypeSystem; action='job_retry'; details capture
  old_status, new_status, attempts, max_attempts. JobService carries an
  optional *AuditService (SetAuditService) that nil-guards to preserve
  test-wiring ergonomics — existing tests that construct JobService
  without an audit service continue to pass unchanged.

Config — env var with sane default:
  internal/config/config.go:SchedulerConfig grows RetryInterval, wired
  to CERTCTL_SCHEDULER_RETRY_INTERVAL with a 5-minute default. Validate
  rejects intervals below 1 second (matches other scheduler interval
  validators).

Server wiring:
  cmd/server/main.go calls jobService.SetAuditService(auditService)
  after JobService construction and sched.SetJobRetryInterval(
  cfg.Scheduler.RetryInterval) alongside the other SetXxxInterval calls.

Regression coverage:
  internal/service/job_test.go (3 new)
    - TestJobService_RetryFailedJobs_EligibleJobTransitionsAndAudits
    - TestJobService_RetryFailedJobs_SkipsJobsAtMaxAttempts
    - TestJobService_RetryFailedJobs_NoAuditServiceOK
  internal/scheduler/scheduler_test.go (3 new)
    - TestScheduler_JobRetryLoop_CallsService
    - TestScheduler_JobRetryLoop_IdempotencyGuard
    - TestScheduler_JobRetryLoop_WaitForCompletion

  The service tests assert status transitions, attempt-cap short-
  circuiting, and audit event shape (actor='system', action='job_retry',
  details keys). The scheduler tests assert the loop invokes the service,
  the atomic.Bool guard skips overlapping ticks with the expected
  'still running, skipping tick' log, and WaitForCompletion drains the
  in-flight tick on Stop.

Residual follow-up (not in scope for this commit):
  internal/service/renewal.go:RetryFailedJobs is a parallel dead-code
  duplicate of the same logic on RenewalService — untested and has no
  runtime caller. The audit finding called this out as 'implemented
  twice'. Removing it is a separate cleanup and does not block the
  Option-A wiring this commit delivers.

Files:
  cmd/server/main.go                     — SetAuditService + SetJobRetryInterval
  internal/config/config.go              — RetryInterval field + env + validate
  internal/scheduler/scheduler.go        — 10th loop, interface, field, setter
  internal/scheduler/scheduler_test.go   — 3 new scheduler-loop tests
  internal/service/job.go                — RetryFailedJobs audit emission + SetAuditService
  internal/service/job_test.go           — 3 new service-layer tests
2026-04-18 23:24:54 +00:00
shankar0123 fe7e766510 Close M-004 (OCSP issuer binding) and M-005 (discovery actor propagation) coverage-gap findings
M-004 — OCSP issuer binding (composite key):
  The OCSP lookup path now binds (issuer_id, serial) as a composite key
  rather than resolving by serial alone. CertificateRepository and
  RevocationRepository gain GetByIssuerAndSerial methods; ca_operations.go
  scopes both lookups by the issuer_id path param. When no managed cert
  binds to that (issuer, serial) tuple, GetOCSPResponse constructs an
  RFC 6960 §2.2 'unknown' response (CertStatus=2) instead of the prior
  default 'good'. Short-lived cert exemption (profile TTL < 1h) is
  preserved. Real repo errors (non-sql.ErrNoRows) fail closed with a log.

  Regression coverage: internal/service/ca_operations_test.go
    - TestCAOperationsSvc_GetOCSPResponse_Unknown_CrossIssuer
    - TestCAOperationsSvc_GetOCSPResponse_Unknown_UnknownSerial

M-005 — Discovery Claim/Dismiss actor propagation:
  DiscoveryService.ClaimDiscovered and DismissDiscovered now accept an
  explicit 'actor string' parameter (propagation pattern mirrors
  bulk_revocation.go / revocation_svc.go). The handler layer passes
  resolveActor(r.Context()) — the named-key identity established by the
  M-002 auth unification — and the service falls back to 'api' (the same
  safe sentinel resolveActor uses when no auth context is present) only
  when the caller passes an empty string. Never falls back to 'operator'.

  Regression coverage: internal/service/discovery_test.go
    - TestDiscoveryService_ClaimDiscovered_AuditActor
    - TestDiscoveryService_DismissDiscovered_AuditActor
    - TestDiscoveryService_ClaimDiscovered_EmptyActorFallsBackToAPI
    - TestDiscoveryService_DismissDiscovered_EmptyActorFallsBackToAPI

Each new test asserts event.Actor matches the caller-supplied string (or
'api' on empty input) and explicitly asserts event.Actor != 'operator'
to lock in the historical fix intent.

Files:
  internal/api/handler/discovery.go          — pass resolveActor(ctx)
  internal/api/handler/discovery_handler_test.go — updated call sites
  internal/integration/lifecycle_test.go     — updated mock wiring
  internal/repository/interfaces.go          — GetByIssuerAndSerial on
                                               CertificateRepository +
                                               RevocationRepository
  internal/repository/postgres/certificate.go — composite key lookup
  internal/service/ca_operations.go          — (issuer_id, serial) scoping
  internal/service/ca_operations_test.go     — 2 new M-004 tests
  internal/service/discovery.go              — actor parameter + 'api' fallback
  internal/service/discovery_test.go         — 4 new M-005 tests
  internal/service/shortlived_test.go        — mock signature update
  internal/service/testutil_test.go          — mock GetByIssuerAndSerial
2026-04-18 22:20:25 +00:00
shankar0123 ff7357f889 fix(lint): godoc comment on NewAuthWithNamedKeys must lead with function name (ST1020)
CI failure on master (commit 3287e17) — staticcheck ST1020:

  internal/api/middleware/middleware.go:125:1: ST1020: comment on exported
  function NewAuthWithNamedKeys should be of the form
  "NewAuthWithNamedKeys ..." (staticcheck)

When NewAuth was renamed to NewAuthWithNamedKeys during the M-002 auth
unification, the leading godoc sentence was left pointing at the old name.
Rewrite the comment so its first sentence starts with the new function
name, and expand the body to describe the named-key + admin-flag contract
introduced in 3287e17.

Also gitignore /.gopath/ — session-scoped tool install cache, same
category as /.gocache/ and /.gomodcache/.

Verification:
  go vet ./internal/api/middleware/...          — clean
  go build ./internal/api/middleware/...        — clean
  go test ./internal/api/middleware/...         — PASS (0.245s)
  staticcheck -checks=all,<project exclusions>  — clean across
    middleware, handler, service, domain, cmd/server, scheduler

Closes: CI failure on 3287e17.
2026-04-18 21:38:46 +00:00
shankar0123 3287e174dc Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001)
Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006)
on top of the C-001/C-002 ownership + agent-FK contract fixes landed in
a53a4b8. The work lands as a single commit spanning server, docs, tests,
and the React client.

M-002 — Named API keys with per-key actor propagation
  * Migration 000014 adds the 'api_keys' table (id, name, hash,
    principal, role, created_at, last_used_at, disabled_at) so every
    credential carries an identifiable principal instead of the
    opaque 'anonymous'/'api-key' sentinel.
  * Auth middleware now rotates through configured keys, performs
    constant-time hash comparison, stamps 'last_used_at', and emits
    an actor struct via contextWithActor(). The audit middleware,
    bulk-revocation handler, approval handlers, and MCP tool layer
    now read the principal off the context and persist it on every
    audit_events row.
  * Regression coverage:
      - internal/api/middleware/audit_test.go — actor propagation,
        principal redaction for disabled keys, anonymous fallback for
        unauthenticated endpoints.
      - internal/api/handler/bulk_revocation_handler_test.go,
        job_handler_test.go — principal-on-audit assertions.

M-003 — Authorization gates (Phase B)
  * Approval handler rejects self-approval / self-rejection with 403
    when the actor principal equals the job's requested_by field.
  * Bulk revocation is gated behind the 'admin' role; operators and
    viewers receive 403.
  * Regression coverage:
      - internal/service/job_test.go — TestApproveJob_NotSelf,
        TestRejectJob_NotSelf.
      - internal/api/handler/bulk_revocation_handler_test.go —
        TestBulkRevoke_RequiresAdmin, TestBulkRevoke_AdminSucceeds.

M-006 — RFC-compliant CRL/OCSP on the unauthenticated .well-known mux
  * Per RFC 8615, relying parties cannot reasonably be asked to
    authenticate against the issuing certctl instance to retrieve
    revocation material. CRL and OCSP move off the authenticated
    '/api/v1/crl*' and '/api/v1/ocsp/*' paths onto:
        GET /.well-known/pki/crl/{issuer_id}
            Content-Type: application/pkix-crl   (RFC 5280 §5)
        GET /.well-known/pki/ocsp/{issuer_id}/{serial}
            Content-Type: application/ocsp-response  (RFC 6960)
  * Non-standard JSON CRL shape is removed; only DER is served.
  * Short-lived certificate exemption (profile TTL < 1h → skip
    CRL/OCSP) is preserved; the response simply omits the serial.
  * Routes are registered on the unauthenticated 'finalHandler' mux
    in cmd/server/main.go alongside EST ('/.well-known/est/*') and
    SCEP ('/scep'). Legacy authenticated paths return 404.
  * Regression coverage:
      - internal/api/handler/certificate_handler_test.go — content
        type, DER parseability, 404 for unknown issuer.
      - internal/api/handler/adversarial_path_test.go — unauthenticated
        access asserted for CRL, OCSP, EST, SCEP.
      - internal/api/router/router_test.go — route-table assertion
        that '.well-known/pki/*', '.well-known/est/*', and '/scep' are
        mounted on the unauthenticated branch.

M-001 — Auto-closed by M-002
  EST and SCEP were already registered on the unauthenticated
  'finalHandler' mux; the router comment at
  internal/api/router/router.go:247 now matches reality. The
  adversarial-path tests above lock the behavior in.

Verification (all gates green):
  * go vet ./...                                           — clean
  * go build ./...                                         — ok
  * go test -short ./... (55+ packages)                    — all pass
  * web/ : npm test (225 Vitest tests)                     — all pass
  * web/ : npx tsc --noEmit                                — clean
  * grep sweep for '/api/v1/(crl|ocsp)' — 13 surviving hits,
    all intentional M-006 tombstone/relocation comments.

Documentation:
  * coverage-gap-audit.md — status flips M-001/M-002/M-003/M-006 →
    Fixed, with per-finding resolution paragraphs citing regression
    test IDs. (Audit file lives outside this repo; see cowork root.)
  * CLAUDE.md Project Status line updated with the auth-unification
    closure note.
  * docs/features.md, docs/architecture.md, docs/quickstart.md,
    docs/concepts.md, docs/connectors.md, docs/test-env.md,
    docs/testing-guide.md, docs/compliance-*.md, docs/demo-advanced.md
    — refreshed for the new '.well-known/pki/*' namespace and named
    API keys.
  * api/openapi.yaml — documents the new unauthenticated endpoints
    and removes the legacy '/api/v1/crl*' + '/api/v1/ocsp/*' paths.

.gitignore: adds '/.gocache/' and '/.gomodcache/' for the session-
scoped Go caches so they never enter the tree.
2026-04-18 18:17:41 +00:00
shankar0123 a53a4b845b fix(gui,api): close C-001 + C-002 — ownership + agent FK contract
C-001 — CreateCertificate was server-accepted with null owner_id,
team_id, renewal_policy_id because the GUI neither collected the fields
nor enforced them, even though the backend's ManagedCertificate schema
and handler contract treat them as required. Fix the contract at all
four layers:

  - web/src/pages/CertificatesPage.tsx: replace owner_id/team_id free-
    text inputs with <select> elements fed by getOwners/getTeams/
    getPolicies queries; mark all three required; gate the Create
    button on owner_id + team_id + renewal_policy_id being set.
  - internal/api/handler/certificates.go: ValidateRequired for
    owner_id, team_id, renewal_policy_id on CreateCertificate so the
    handler returns HTTP 400 with the offending field name before the
    service layer is reached.
  - internal/mcp/types.go: drop ',omitempty' from
    CreateCertificateInput.RenewalPolicyID so the MCP schema reflects
    the required contract; Update inputs keep partial-update semantics.
  - api/openapi.yaml: 'required: [name, common_name, renewal_policy_id,
    issuer_id, owner_id, team_id]' was already present on the Create
    schema; clarified DeploymentTarget.agent_id description to note the
    FK contract.

C-002 — CreateTargetWizard accepted an empty or bogus agent_id and the
service inserted directly, producing a Postgres 23503 FK-violation that
bubbled out as a generic HTTP 500. The FK itself (migration 000001 line
104: agent_id TEXT NOT NULL REFERENCES agents(id)) is correct; we keep
the schema strict and add validation at three layers:

  - internal/service/target.go: introduce
    ErrAgentNotFound sentinel and pre-validate agent_id in
    TargetService.CreateTarget — empty string returns
    'agent_id is required'; a nonexistent id returns the full
    'referenced agent does not exist: <id>' error. Both wrap
    ErrAgentNotFound via fmt.Errorf %w so callers can use errors.Is.
  - internal/api/handler/targets.go: ValidateRequired on agent_id; map
    errors.Is(err, service.ErrAgentNotFound) to HTTP 400 instead of
    letting it fall through to the generic 500 branch.
  - internal/mcp/types.go: drop ',omitempty' from
    CreateTargetInput.AgentID to match the required contract.
  - web/src/pages/TargetsPage.tsx: replace the free-text Agent ID input
    with a <select> populated from getAgents(); include agent in the
    canProceedToReview gate so Next is disabled until an agent is
    chosen.

Regression coverage (21 new subtests total):

  - TestCreateCertificate_MissingRequiredField_Returns400 — 6 subtests,
    one per required field, each proves the handler guard fires before
    the mock service is called.
  - TestCreateTarget_MissingAgentID_Returns400 — handler guard.
  - TestCreateTarget_NonexistentAgent_Returns400 — pins the
    ErrAgentNotFound -> 400 translation.
  - TestTargetService_CreateTarget_MissingAgentID — errors.Is sentinel.
  - TestTargetService_CreateTarget_NonexistentAgentID — errors.Is.
  - The existing TestTargetService_CreateTarget_Success, along with
    TestCreateTarget_{MissingName,MissingType,NameTooLong}_* handler
    tests, were updated to seed a real agent or include agent_id in
    the request body so the happy paths still run cleanly.

Gates (Phase 4):
  - go build/vet/test/race: green
  - go test -cover: internal/service 68.7% (gate 55%),
    internal/api/handler 78.9% (gate 60%)
  - golangci-lint on service+handler+mcp: 0 issues
  - govulncheck: no reachable vulns
  - tsc --noEmit: clean
  - vitest: 223/223 passing

See cowork/certctl-coverage-gap-audit.md entries C-001 and C-002.
2026-04-18 16:01:40 +00:00
shankar0123 9143da5fa8 Merge branch 'fix/d-008-policy-engine-drift' 2026-04-18 14:56:06 +00:00
shankar0123 b3cc7cbdb2 fix(policies): close the D-006 loop — TitleCase seed canonicals + severity-aware, config-consuming rule engine (D-008)
D-008 was a three-part drift in the policy engine that made the
D-005/D-006 remediation cosmetic below the DB layer:

  (a) migrations/seed.sql INSERTed rules with pre-D-005 lowercase
      types ('ownership', 'environment', 'lifetime', 'renewal_window')
      that the handler validator rejects on Create/Update but that
      raw SQL INSERTs bypassed entirely. At runtime evaluateRule's
      switch fell through to the default "unknown policy rule type"
      error branch on every demo rule × every cert × every cycle,
      flooding logs while emitting zero violations.

  (b) migrations/seed_demo.sql persisted lowercase severity values
      ('critical', 'error', 'warning') on policy_violations rows.
      INSERT succeeded because that column had no CHECK, but any
      frontend comparing against the canonical PolicySeverity enum
      mis-categorized every seeded violation.

  (c) evaluateRule hardcoded Severity: PolicySeverityWarning on
      every emitted violation and ignored rule.Config entirely —
      so the D-006 per-rule severity column (000013) and every
      per-arm Config JSON ({allowed_issuer_ids, allowed_domains,
      required_keys, allowed, lead_time_days, max_days}) was dead
      data below the evaluation layer.

This commit lands (a)+(b)+(c) atomically. Shipping any subset
leaves the feature half-working.

## Changes

Domain (internal/domain/policy.go):
  * Add PolicyTypeCertificateLifetime as the 6th TitleCase canonical.
    Pre-D-008 the seeded "max-certificate-lifetime" rule had no engine
    arm — routing it through RenewalLeadTime would conflate "how
    close to expiry before we renew" with "how long can the cert
    possibly be", two distinct semantics. The new type accepts
    config {"max_days": int} and flags certs whose
    NotAfter - NotBefore exceeds the cap.

Handler validator (internal/api/handler/validation.go):
  * ValidatePolicyType allowlist grown to 6 canonicals
    (AllowedIssuers, AllowedDomains, RequiredMetadata,
    AllowedEnvironments, RenewalLeadTime, CertificateLifetime).

OpenAPI (api/openapi.yaml):
  * PolicyType enum grown to match domain.

Frontend (web/src/api/types.ts, types.test.ts):
  * POLICY_TYPES tuple gains CertificateLifetime; pin test asserts
    all 6 canonicals and rejects casing drift.

Migration 000014 (policy_violations severity CHECK):
  * Named CHECK constraint (policy_violations_severity_check)
    mirroring 000013's allowlist, defense-in-depth at the DB layer
    against future drift from bypassed writes (migrations, psql
    sessions, future callers). Symmetric down migration drops by
    name.

Seed data:
  * migrations/seed.sql rewritten to emit TitleCase canonicals with
    per-arm config JSON that actually exercises the config-consuming
    paths (not the missing-field backstops):
      - pr-require-owner         → RequiredMetadata     {"required_keys":["owner"]}                        Warning
      - pr-allowed-environments  → AllowedEnvironments  {"allowed":["production","staging","development"]} Error
      - pr-max-certificate-lifetime → CertificateLifetime {"max_days":90}                                   Critical
      - pr-min-renewal-window    → RenewalLeadTime      {"lead_time_days":14}                              Warning
    Severities are now differentiated per rule (D-006 intent).
  * migrations/seed_demo.sql violation rows flipped to TitleCase
    severity ('Critical', 'Error', 'Warning') so migration 000014
    applies cleanly on upgrade paths.

Engine rewrite (internal/service/policy.go):
  * evaluateRule rewritten. All six arms now:
      1. Parse rule.Config into the per-arm typed struct.
      2. Bad JSON → log at ValidateCertificate boundary and skip
         this rule (no co-located poisoning of other rules in the
         same batch).
      3. Empty/null Config → emit the pre-D-008 missing-field
         violation (backwards compat invariant — operators who
         haven't reconfigured still see the same output).
      4. Violations emitted carry rule.Severity (no more hardcoded
         Warning); D-006 column is now load-bearing.
  * CertificateLifetime arm reads NotBefore/NotAfter from the
    certificate's latest version via CertRepo. Injected via
    PolicyService.SetCertRepo() setter — avoids churning ~36
    NewPolicyService call sites while keeping the lifetime arm
    optional (degrades to a log+skip if the setter is not wired).

Server wiring (cmd/server/main.go):
  * policyService.SetCertRepo(certRepo) wired after construction.

Tests (internal/service/policy_test.go):
  * 25 new subtests across 5 groups:
      - TestEvaluateRule_SeverityPassThrough (6): every rule type
        emits violations carrying rule.Severity, not hardcoded.
      - TestEvaluateRule_ConfigConsumed (12): every per-arm Config
        path exercised positive + negative.
      - TestEvaluateRule_EmptyConfig_BackCompat (3): empty/null
        Config still emits pre-D-008 missing-field violations.
      - TestEvaluateRule_BadConfig_SkipsRule: malformed JSON logs
        and skips cleanly without poisoning neighbors.
      - TestEvaluateRule_CertificateLifetime_RepoScenarios (3):
        ok when repo wired, log+skip when not, handles missing
        NotBefore/NotAfter edges.

Provenance: D-008 surfaced during D-005/D-006 remediation review
in eef1db0. That commit added persistence and CI pins for the
severity field but did not re-verify the evaluation layer
consumed it; this finding and fix close the audit-process gap.
2026-04-18 14:55:56 +00:00
shankar0123 eef1db0f0a fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006)
Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a
single commit because they share the same root cause — policy CRUD sending
values the backend silently rejects — and splitting them would leave a
half-working UI between commits.

## D-005 (P0): PoliciesPage dropdown 400s every Create Policy

Root cause
----------
`web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a
hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array.
The backend's `internal/api/handler/validators.go::ValidatePolicyType`
enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`,
`RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in
`internal/domain/policy.go`. Every Create Policy request was rejected with
`400 invalid policy type`. The error surfaced only as a transient toast;
the modal closed anyway. Silent user-visible failure.

Fix
---
- `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES`
  tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and
  `PolicyViolation.severity` to the literal-union types. Dropdown is now
  sourced from the tuple; casing drift becomes a compile error.
- `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` /
  `severityDots` to the TitleCase values, added `humanize()` for display
  (AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral`
  fallback that was papering over the mismatch.
- `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone
  edits one side of the frontend/backend contract without the other, CI
  fails with a clear assertion. Pure-TS vitest, no RTL dependency.

## D-006 (P1): `severity` field silently dropped on create/update

Root cause
----------
`PolicyRule` had no `Severity` field in `internal/domain/policy.go`. The
frontend has always sent `severity` on create/update, but Go's
`json.Decoder` (default settings, no `DisallowUnknownFields`) silently
dropped it. The value never reached PostgreSQL. Every rule rendered with
the same severity because there was no severity — just a display
computation downstream.

Fix: option (b), full-stack schema add (not delete-the-field)
-------------------------------------------------------------
- Migration `000013_policy_rule_severity` (up + down): adds
  `severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with
  CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No
  index — three-value column on a low-thousands-rows table, planner will
  seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data.
- `internal/domain/policy.go`: added `Severity PolicySeverity` field.
- `internal/repository/postgres/policy.go`: plumbed `severity` through
  ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT,
  UpdateRule UPDATE (4 queries).
- `internal/service/policy.go::UpdatePolicy`: if the client omits
  severity on a PUT (zero-value empty string), fetch the existing rule
  and preserve its severity. Without this, partial updates would trip the
  NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type
  (out of scope).
- `internal/api/handler/policies.go::CreatePolicy`: default empty severity
  to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with
  clear message instead of 500 on CHECK violation. `UpdatePolicy`:
  validates severity only when provided.
- `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional
  `severity` on the MCP `create_policy` / `update_policy` tool inputs so
  LLM callers stay in sync with the wire contract.
- `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with
  the enum and default.

Acceptance criterion (user-defined)
-----------------------------------
"Create a rule with severity=Critical, reload the page, and still see
Critical — no silent drops." Verified end-to-end: frontend sends
`severity: "Critical"`, handler validates, service persists, DB stores,
GET returns, React renders the correct badge.

Seed data
---------
`migrations/seed.sql`: four demo rules now have differentiated severities
— `pr-require-owner` → Warning, `pr-allowed-environments` → Error,
`pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` →
Warning. The user called out that seeding all four at the same severity
makes the feature look decorative; differentiation demonstrates the
column carries real signal.

## Integration test fix (side effect of D-006)

`internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy`
was sending `"severity": "High"` — a value from the pre-audit severity
vocabulary that the new `ValidatePolicySeverity` correctly rejects with
400. Changed to `"Error"` (closest semantic match in the new TitleCase
allowlist). Only severity reference in the integration/ directory;
verified via grep.

## Out of scope, logged for follow-up (d/D-008)

Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly
deferred per direction:

1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values
   (`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`).
   These are load-bearing on `internal/service/policy.go::evaluateRule`'s
   `switch rule.Type` (which also uses the lowercase strings). Migrating
   requires coordinated changes across seed + evaluation engine.
2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'`
   severity — will now fail the new CHECK constraint. Separate fix.
3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on
   emitted violations and ignores the configured `rule.Config`. The new
   severity column is read correctly on the CRUD path but not yet
   consulted during evaluation.

## Verification

Backend:
- `go build ./...` — clean
- `go vet ./...` — clean
- `go test -short ./...` — all packages green, including
  `internal/service` (policy service), `internal/api/handler` (policy +
  MCP handler tests), `internal/integration` (e2e_test.go after fix),
  `internal/domain`, `internal/repository/postgres`.

Frontend:
- `tsc --noEmit` — clean
- `vitest run` — 223/223 passing (4 new assertions in types.test.ts)
- `vite build` — clean (only the pre-existing chunk-size warning)
2026-04-18 13:02:04 +00:00
108 changed files with 10506 additions and 802 deletions
+5
View File
@@ -72,3 +72,8 @@ SECURITY_REMEDIATION.md
.DS_Store
Thumbs.db
mcp-server
# Local Go build/module caches (session-scoped, never committed)
/.gocache/
/.gomodcache/
/.gopath/
+269 -46
View File
@@ -29,7 +29,11 @@ tags:
- name: Certificates
description: Certificate lifecycle — CRUD, versions, renewal, deployment, revocation
- name: CRL & OCSP
description: Certificate revocation list and OCSP responder
description: |
Certificate revocation list (RFC 5280) and OCSP responder (RFC 6960).
Served unauthenticated under `/.well-known/pki/*` (RFC 8615) so
relying parties can retrieve revocation status without a certctl
API key.
- name: Issuers
description: CA issuer connector management (Local CA, ACME, step-ca)
- name: Targets
@@ -493,50 +497,28 @@ paths:
"500":
$ref: "#/components/responses/InternalError"
# ─── CRL & OCSP ─────────────────────────────────────────────────────
/api/v1/crl:
# ─── PKI (CRL & OCSP, RFC 5280 / 6960 / 8615) ──────────────────────
#
# Relying parties (browsers, OpenSSL clients, OCSP stapling sidecars,
# mTLS clients) cannot present a certctl Bearer token, so these two
# endpoints are unauthenticated and live under the RFC 8615
# `.well-known` namespace. They were previously mounted at
# /api/v1/crl/{issuer_id} and /api/v1/ocsp/{issuer_id}/{serial}; those
# paths were removed in M-006.
#
# The non-standard JSON CRL endpoint (GET /api/v1/crl) was also
# removed — RFC 5280 defines only the DER wire format.
/.well-known/pki/crl/{issuer_id}:
get:
tags: [CRL & OCSP]
summary: Get JSON CRL
description: Returns all revoked certificates in JSON format.
operationId: getCRL
responses:
"200":
description: JSON CRL
content:
application/json:
schema:
type: object
properties:
version:
type: integer
example: 1
entries:
type: array
items:
type: object
properties:
serial_number:
type: string
revocation_date:
type: string
format: date-time
revocation_reason:
type: string
total:
type: integer
generated_at:
type: string
format: date-time
"500":
$ref: "#/components/responses/InternalError"
/api/v1/crl/{issuer_id}:
get:
tags: [CRL & OCSP]
summary: Get DER-encoded X.509 CRL
description: Returns a proper DER-encoded CRL signed by the issuing CA. 24-hour validity.
summary: Get DER-encoded X.509 CRL (RFC 5280)
description: |
Returns a DER-encoded CRL signed by the issuing CA (RFC 5280 §5),
served unauthenticated per RFC 8615 `.well-known` semantics so
relying parties can retrieve it without a certctl API key.
Validity is 24 hours.
operationId: getDERCRL
security: []
parameters:
- name: issuer_id
in: path
@@ -560,12 +542,17 @@ paths:
"501":
description: Issuer does not support CRL generation
/api/v1/ocsp/{issuer_id}/{serial}:
/.well-known/pki/ocsp/{issuer_id}/{serial}:
get:
tags: [CRL & OCSP]
summary: OCSP responder
description: Returns signed OCSP response (good/revoked/unknown) for the given serial number.
summary: OCSP responder (RFC 6960)
description: |
Returns a signed OCSP response (good/revoked/unknown) for the
given serial number per RFC 6960 §2.1, served unauthenticated
per RFC 8615 so relying parties and OCSP stapling sidecars can
query revocation status without a certctl API key.
operationId: handleOCSP
security: []
parameters:
- name: issuer_id
in: path
@@ -893,6 +880,40 @@ paths:
"500":
$ref: "#/components/responses/InternalError"
/api/v1/agents/retired:
get:
tags: [Agents]
summary: List retired agents
description: |
I-004: opt-in listing of soft-retired agents. The default
`GET /api/v1/agents` endpoint filters retired rows out; this is the
dedicated surface for reading them back (e.g., the operator UI's
"Retired" tab, audit and forensics workflows). Pagination defaults
match the default agent listing (page=1, per_page=50, max 500). Go
1.22's enhanced ServeMux routes `/agents/retired` to this handler
via the literal-beats-pattern-var precedence rule, so the sibling
`/agents/{id}` route does not shadow it.
operationId: listRetiredAgents
parameters:
- $ref: "#/components/parameters/page"
- $ref: "#/components/parameters/per_page"
responses:
"200":
description: Paginated list of retired agents
content:
application/json:
schema:
allOf:
- $ref: "#/components/schemas/PaginationEnvelope"
- type: object
properties:
data:
type: array
items:
$ref: "#/components/schemas/Agent"
"500":
$ref: "#/components/responses/InternalError"
/api/v1/agents/{id}:
get:
tags: [Agents]
@@ -913,12 +934,116 @@ paths:
$ref: "#/components/responses/NotFound"
"500":
$ref: "#/components/responses/InternalError"
delete:
tags: [Agents]
summary: Soft-retire agent
description: |
I-004: soft-retirement. The agent row is preserved (so its audit
trail and historical job links remain intact) and `retired_at` is
stamped. A retired agent receives `410 Gone` on subsequent
heartbeats so it can shut down cleanly.
Behavior matrix:
| Scenario | Query | Status | Body |
| --- | --- | --- | --- |
| Clean retire (no active dependencies) | none | `200` | `RetireAgentResponse` with `cascade=false`, zero counts |
| Blocked by active targets/certs/jobs | none | `409` | `BlockedByDependenciesResponse` with per-bucket counts |
| Force-cascade retire | `force=true&reason=...` | `200` | `RetireAgentResponse` with `cascade=true`, pre-cascade counts |
| Idempotent re-retire | either | `204` | (empty — downstream consumers break on stray bodies) |
| `force=true` without reason | `force=true` | `400` | ErrorResponse (ErrForceReasonRequired) |
| Reserved sentinel agent | any | `403` | ErrorResponse (ErrAgentIsSentinel) |
| Unknown agent id | any | `404` | ErrorResponse |
Sentinel agents are the four reserved identities backing non-agent
discovery subsystems (`server-scanner`, `cloud-aws-sm`,
`cloud-azure-kv`, `cloud-gcp-sm`). Retiring them would orphan the
scanner or a cloud secret-manager source, so the handler refuses
unconditionally — even with `force=true`.
operationId: retireAgent
parameters:
- $ref: "#/components/parameters/resourceId"
- name: force
in: query
required: false
schema:
type: boolean
default: false
description: |
Cascade-retire active downstream targets, certificates, and
jobs. When `true`, a non-empty `reason` is required. A
malformed value (anything strconv.ParseBool rejects) is
silently treated as `false` so a typoed query can never
accidentally enable the cascade.
- name: reason
in: query
required: false
schema:
type: string
description: |
Human-readable reason recorded on the retired row and in the
immutable audit trail. Required (non-empty after trimming)
when `force=true`.
responses:
"200":
description: |
Agent retired (clean retire or successful force-cascade). Body
is `RetireAgentResponse`.
content:
application/json:
schema:
$ref: "#/components/schemas/RetireAgentResponse"
"204":
description: |
Idempotent retire — the agent was already retired. Response
body is empty (the 200-path shape does not apply, and
downstream clients that tee responses into dashboards would
break on spurious bodies).
"400":
description: |
`force=true` was sent without a non-empty `reason`
(ErrForceReasonRequired).
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
"403":
description: |
Agent is a reserved sentinel and cannot be retired even with
`?force=true` (ErrAgentIsSentinel).
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
"404":
$ref: "#/components/responses/NotFound"
"409":
description: |
Blocked by active downstream dependencies. Body carries
per-bucket counts so the operator UI can show the user which
dependency is holding up the retire. Re-run with
`?force=true&reason=...` to cascade.
content:
application/json:
schema:
$ref: "#/components/schemas/BlockedByDependenciesResponse"
"405":
description: Method not allowed (only DELETE, GET are routed to this path)
"500":
$ref: "#/components/responses/InternalError"
/api/v1/agents/{id}/heartbeat:
post:
tags: [Agents]
summary: Agent heartbeat
description: Reports agent liveness and metadata (OS, architecture, IP, version).
description: |
Reports agent liveness and metadata (OS, architecture, IP, version).
I-004: a retired agent still polling the heartbeat endpoint receives
`410 Gone` so `cmd/agent` detects the terminal signal and shuts down
cleanly instead of looping forever against a decommissioned identity.
The retired-agent check runs before any "not found" string match so
it can never be masked by a sibling error branch.
operationId: agentHeartbeat
parameters:
- $ref: "#/components/parameters/resourceId"
@@ -949,6 +1074,14 @@ paths:
$ref: "#/components/responses/BadRequest"
"404":
$ref: "#/components/responses/NotFound"
"410":
description: |
I-004: the agent has been soft-retired. The agent process should
treat this as a terminal signal and shut down cleanly.
content:
application/json:
schema:
$ref: "#/components/schemas/ErrorResponse"
"500":
$ref: "#/components/responses/InternalError"
@@ -3326,6 +3459,7 @@ components:
DeploymentTarget:
type: object
required: [name, type, agent_id]
properties:
id:
type: string
@@ -3335,6 +3469,12 @@ components:
$ref: "#/components/schemas/TargetType"
agent_id:
type: string
description: |
ID of the agent that manages this target. Required because
deployment_targets.agent_id is a NOT NULL foreign key to agents(id)
(migration 000001). Empty or nonexistent agent IDs are rejected
with HTTP 400 by the service layer (see C-002 in the coverage-gap
audit).
config:
type: object
description: Target-specific configuration (varies by type)
@@ -3379,6 +3519,85 @@ components:
type: string
version:
type: string
retired_at:
type: string
format: date-time
nullable: true
description: |
I-004: soft-retirement timestamp. `null` (or field absent) means the
agent is active. A non-null value is the canonical "retired" state —
the operational `status` column is preserved at retirement time as
the last-seen value, but `retired_at` is the source of truth for
filtering agents out of active listings.
retired_reason:
type: string
nullable: true
description: |
I-004: human-readable reason captured at retirement time. Only set
when the agent was retired via `?force=true&reason=...` cascade; a
default soft-retire leaves this field null.
AgentDependencyCounts:
type: object
description: |
I-004: preflight counts of active downstream rows that would be
orphaned by retiring an agent. Returned in the 409
`blocked_by_dependencies` body so the operator UI can tell the user
which bucket is blocking the retire, and also in the 200 response
body on a successful `?force=true` cascade as a snapshot of what
was cascaded.
properties:
active_targets:
type: integer
description: Deployment targets with this agent assigned and retired_at IS NULL
active_certificates:
type: integer
description: Certificates currently deployed via one of this agent's active targets
pending_jobs:
type: integer
description: Jobs with agent_id=this in status Pending, AwaitingCSR, AwaitingApproval, or Running
RetireAgentResponse:
type: object
description: |
I-004: response body for a successful retire on DELETE /api/v1/agents/{id}.
Returned on both clean retires (cascade=false, zero counts) and
force-cascade retires (cascade=true, counts snapshot of the
pre-cascade dependency state). The 204 idempotent-retire path does
NOT emit this body — re-retiring an already-retired agent returns
an empty response.
properties:
retired_at:
type: string
format: date-time
already_retired:
type: boolean
description: |
Always false on the 200 response — the already-retired path
returns 204 No Content with no body. Surfaced in the schema
only so downstream consumers have a complete field map.
cascade:
type: boolean
description: True when the retire was invoked with ?force=true
counts:
$ref: "#/components/schemas/AgentDependencyCounts"
BlockedByDependenciesResponse:
type: object
description: |
I-004: 409 response body for a retire request blocked by active
downstream dependencies. Returned when `force=true` is not set and
any of the three counts is non-zero. The operator UI renders these
counts so the human can retire or reassign the blocking rows
before re-running the retire, or tick the force checkbox to cascade.
properties:
error:
type: string
example: blocked_by_dependencies
message:
type: string
counts:
$ref: "#/components/schemas/AgentDependencyCounts"
WorkItem:
type: object
@@ -3461,6 +3680,7 @@ components:
- RequiredMetadata
- AllowedEnvironments
- RenewalLeadTime
- CertificateLifetime
PolicySeverity:
type: string
@@ -3480,6 +3700,9 @@ components:
description: Policy-specific configuration (varies by type)
enabled:
type: boolean
severity:
$ref: "#/components/schemas/PolicySeverity"
description: Severity level applied to violations of this rule. Defaults to Warning on create when omitted.
created_at:
type: string
format: date-time
+100
View File
@@ -12,6 +12,7 @@ import (
"crypto/x509/pkix"
"encoding/json"
"encoding/pem"
"errors"
"flag"
"fmt"
"io"
@@ -23,6 +24,7 @@ import (
"path/filepath"
"runtime"
"strings"
"sync"
"syscall"
"time"
@@ -53,6 +55,16 @@ type AgentConfig struct {
DiscoveryDirs []string // Directories to scan for certificates (comma-separated via env)
}
// ErrAgentRetired is the sentinel returned by [Agent.Run] when the control
// plane responds with HTTP 410 Gone to a heartbeat or work-poll request — the
// canonical signal that this agent's row has been soft-retired server-side
// (see I-004 in cowork/certctl-coverage-gap-audit.md). The binary must
// terminate cleanly: an init-system restart would only produce another 410
// and wedge the host in a restart loop. main() translates this sentinel into
// a zero exit code so systemd (Restart=on-failure) and launchd do not respawn
// the process. Do not wrap this error — main() matches it with errors.Is.
var ErrAgentRetired = fmt.Errorf("agent retired by control plane")
// Agent represents the local agent that runs on target servers.
// It periodically sends heartbeats, polls for work, executes deployment and CSR jobs,
// and scans configured directories for existing certificates.
@@ -68,6 +80,17 @@ type Agent struct {
pollInterval time.Duration
discoveryInterval time.Duration
consecutiveFailures int
// I-004: terminal retirement signal. retiredSignal is closed exactly once
// (guarded by retiredOnce) when either sendHeartbeat or pollForWork
// observes HTTP 410 Gone. The Run() select loop picks up the close and
// returns ErrAgentRetired, unwinding the goroutine cleanly so main() can
// log + exit(0). Using a channel + sync.Once (rather than an atomic bool
// + polling) lets us fall through the select statement immediately instead
// of waiting for the next ticker; the zero-allocation close is safe to
// race with ctx.Done() and other cases.
retiredOnce sync.Once
retiredSignal chan struct{}
}
// WorkResponse represents the response from the work polling endpoint.
@@ -98,9 +121,31 @@ func NewAgent(cfg *AgentConfig, logger *slog.Logger) *Agent {
heartbeatInterval: 60 * time.Second,
pollInterval: 30 * time.Second,
discoveryInterval: 6 * time.Hour, // scan for certs every 6 hours
retiredSignal: make(chan struct{}),
}
}
// markRetired records that the control plane has declared this agent retired
// (HTTP 410 Gone on heartbeat or work poll). Idempotent via sync.Once — if
// both the heartbeat and work-poll paths observe 410 in the same tick, only
// the first close() runs and we avoid a runtime panic. Emits an ERROR-level
// log line so init-system journaling captures it prominently, and includes
// the source (heartbeat/work_poll), response body, and status code so the
// operator can verify it's a genuine retirement signal rather than a
// misrouted request. After this returns, the select-loop case in Run()
// observes the closed channel on its next iteration and returns
// ErrAgentRetired.
func (a *Agent) markRetired(source string, statusCode int, body string) {
a.retiredOnce.Do(func() {
a.logger.Error("agent has been retired by control plane — shutting down",
"source", source,
"status", statusCode,
"body", body,
"agent_id", a.config.AgentID)
close(a.retiredSignal)
})
}
// Run starts the agent's main loop.
// It sends heartbeats, polls for work, and handles graceful shutdown via context cancellation.
func (a *Agent) Run(ctx context.Context) error {
@@ -154,6 +199,19 @@ func (a *Agent) Run(ctx context.Context) error {
a.logger.Info("agent shutting down", "reason", ctx.Err())
return ctx.Err()
// I-004: retiredSignal is closed exactly once (via markRetired's
// sync.Once) when either sendHeartbeat or pollForWork observes HTTP 410
// Gone from the control plane. Falling through this case immediately
// (rather than waiting for the next ticker) lets the agent shut down
// quickly once retirement is confirmed — every extra heartbeat against a
// retired row is wasted work and noise in the audit trail. Returning
// ErrAgentRetired propagates up to main(), which matches it with
// errors.Is and exits(0) so systemd/launchd do not respawn the process.
case <-a.retiredSignal:
a.logger.Info("agent retired signal received — exiting event loop",
"agent_id", a.config.AgentID)
return ErrAgentRetired
case <-heartbeatTicker.C:
a.sendHeartbeat(ctx)
@@ -209,6 +267,22 @@ func (a *Agent) sendHeartbeat(ctx context.Context) {
}
defer resp.Body.Close()
// I-004: HTTP 410 Gone is the terminal signal from the control plane that
// this agent's row has been soft-retired (see internal/api/handler/agent.go
// heartbeat path + AgentRetirementService). Treat it separately from the
// generic non-200 error branch: record the event to markRetired (which closes
// retiredSignal exactly once via sync.Once) and return without bumping
// consecutiveFailures — this is not a transient failure, it's a clean
// shutdown. The Run() select loop picks up the closed channel on its next
// iteration and returns ErrAgentRetired, which main() translates into an
// exit(0) so systemd/launchd don't respawn the process into another 410
// loop.
if resp.StatusCode == http.StatusGone {
body, _ := io.ReadAll(resp.Body)
a.markRetired("heartbeat", resp.StatusCode, string(body))
return
}
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
a.logger.Error("heartbeat rejected",
@@ -237,6 +311,19 @@ func (a *Agent) pollForWork(ctx context.Context) {
}
defer resp.Body.Close()
// I-004: same terminal-retirement handling as sendHeartbeat. Work-poll is the
// other hot path that can observe an agent's soft-retirement; if the
// heartbeat tick happens to fire after a work-poll tick within the same
// retirement window, this branch catches it first. markRetired's sync.Once
// guards idempotency so racing both paths in the same tick only closes the
// signal channel once. No consecutiveFailures increment — retirement is
// not a transient failure.
if resp.StatusCode == http.StatusGone {
body, _ := io.ReadAll(resp.Body)
a.markRetired("work_poll", resp.StatusCode, string(body))
return
}
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
a.logger.Error("work poll rejected",
@@ -1117,6 +1204,19 @@ func main() {
cancel()
<-errChan
case err := <-errChan:
// I-004: ErrAgentRetired is a terminal, *clean* shutdown — the control
// plane responded HTTP 410 Gone on heartbeat/work-poll, meaning this
// agent's row has been soft-retired and will never be reachable again.
// Exit 0 so systemd's Restart=on-failure and launchd's KeepAlive do NOT
// respawn the process into another 410 loop (which would wedge the host
// and spam the control plane). Operators can observe the retirement via
// audit_events or the AgentsPage retired tab; the terminal log line on
// the way out is enough for post-mortem forensics.
if errors.Is(err, ErrAgentRetired) {
logger.Info("agent retired by control plane — exiting without restart",
"agent_id", agentCfg.AgentID)
return
}
if err != context.Canceled {
logger.Error("agent error", "error", err)
os.Exit(1)
+38 -4
View File
@@ -27,14 +27,17 @@ Commands:
certs renew ID Trigger certificate renewal
certs revoke ID Revoke a certificate
agents list List agents
agents get ID Get agent details
agents list List agents (add --retired to list soft-retired agents)
agents get ID Get agent details
agents retire ID Soft-retire an agent (add --force --reason "…" to cascade)
jobs list List jobs
jobs get ID Get job details
jobs cancel ID Cancel a pending job
import FILE Bulk import certificates from PEM file(s)
Required: --owner-id, --team-id, --renewal-policy-id, --issuer-id
Optional: --name-template (default {cn}), --environment (default imported)
status Show server health + summary stats
version Show CLI version
@@ -138,9 +141,19 @@ func handleCerts(client *cli.Client, args []string) error {
}
}
// handleAgents dispatches the `agents` subcommands.
//
// I-004 additions:
//
// agents list --retired — hit the opt-in /agents/retired endpoint
// instead of the default listing (which
// filters retired rows out).
// agents retire <id> — soft-retire an agent (DELETE /agents/{id}).
// --force cascades; --reason is required with
// --force (mirrors ErrForceReasonRequired).
func handleAgents(client *cli.Client, args []string) error {
if len(args) == 0 {
fmt.Fprintf(os.Stderr, "usage: agents <list|get> [options]\n")
fmt.Fprintf(os.Stderr, "usage: agents <list|get|retire> [options]\n")
return nil
}
@@ -149,13 +162,34 @@ func handleAgents(client *cli.Client, args []string) error {
switch subcommand {
case "list":
return client.ListAgents(subArgs)
// --retired flag splits to a separate endpoint. We intercept it
// client-side and strip it before delegating, so both code paths
// share the --page/--per-page flag parsing inside the client.
retired := false
rest := make([]string, 0, len(subArgs))
for _, a := range subArgs {
if a == "--retired" {
retired = true
continue
}
rest = append(rest, a)
}
if retired {
return client.ListRetiredAgents(rest)
}
return client.ListAgents(rest)
case "get":
if len(subArgs) == 0 {
fmt.Fprintf(os.Stderr, "usage: agents get <id>\n")
return nil
}
return client.GetAgent(subArgs[0])
case "retire":
if len(subArgs) == 0 {
fmt.Fprintf(os.Stderr, "usage: agents retire <id> [--force] [--reason <reason>]\n")
return nil
}
return client.RetireAgent(subArgs)
default:
fmt.Fprintf(os.Stderr, "unknown subcommand: agents %s\n", subcommand)
return nil
+90 -7
View File
@@ -9,6 +9,7 @@ import (
"os"
"os/signal"
"strconv"
"strings"
"syscall"
"time"
@@ -145,6 +146,7 @@ func main() {
// Initialize services (following the dependency graph)
auditService := service.NewAuditService(auditRepo)
policyService := service.NewPolicyService(policyRepo, auditService)
policyService.SetCertRepo(certificateRepo) // D-008: CertificateLifetime arm needs CertificateVersion.NotBefore/NotAfter
certificateService := service.NewCertificateService(certificateRepo, policyService, auditService)
notifierRegistry := make(map[string]service.Notifier)
@@ -222,7 +224,10 @@ func main() {
renewalService := service.NewRenewalService(certificateRepo, jobRepo, renewalPolicyRepo, profileRepo, auditService, notificationService, issuerRegistry, cfg.Keygen.Mode)
renewalService.SetTargetRepo(targetRepo)
deploymentService := service.NewDeploymentService(jobRepo, targetRepo, agentRepo, certificateRepo, auditService, notificationService)
jobService := service.NewJobService(jobRepo, renewalService, deploymentService, logger)
jobService := service.NewJobService(jobRepo, certificateRepo, ownerRepo, renewalService, deploymentService, logger)
// I-001: emit "job_retry" audit events when the scheduler resets Failed→Pending.
// SetAuditService is optional — JobService falls back to nil-guarded no-op if unwired.
jobService.SetAuditService(auditService)
agentService := service.NewAgentService(agentRepo, certificateRepo, jobRepo, targetRepo, auditService, issuerRegistry, renewalService)
agentService.SetProfileRepo(profileRepo)
issuerService := service.NewIssuerService(issuerRepo, auditService, issuerRegistry, encryptionKey, logger)
@@ -436,6 +441,10 @@ func main() {
// Configure scheduler intervals from config
sched.SetRenewalCheckInterval(cfg.Scheduler.RenewalCheckInterval)
sched.SetJobProcessorInterval(cfg.Scheduler.JobProcessorInterval)
// I-001: drive the failed-job retry loop. Runs on start + every RetryInterval
// (default 5m, CERTCTL_SCHEDULER_RETRY_INTERVAL). Kept adjacent to the job
// processor setter because they share the JobServicer dependency.
sched.SetJobRetryInterval(cfg.Scheduler.RetryInterval)
sched.SetAgentHealthCheckInterval(cfg.Scheduler.AgentHealthCheckInterval)
sched.SetNotificationProcessInterval(cfg.Scheduler.NotificationProcessInterval)
if cfg.NetworkScan.Enabled {
@@ -460,6 +469,17 @@ func main() {
"sources", cloudDiscoveryService.SourceCount())
}
// Wire job timeout reaper (I-003)
sched.SetJobReaperService(jobService)
sched.SetJobTimeoutInterval(cfg.Scheduler.JobTimeoutInterval)
sched.SetAwaitingCSRTimeout(cfg.Scheduler.AwaitingCSRTimeout)
sched.SetAwaitingApprovalTimeout(cfg.Scheduler.AwaitingApprovalTimeout)
logger.Info("job timeout reaper enabled",
"interval", cfg.Scheduler.JobTimeoutInterval.String(),
"csr_timeout", cfg.Scheduler.AwaitingCSRTimeout.String(),
"approval_timeout", cfg.Scheduler.AwaitingApprovalTimeout.String())
// Start scheduler
logger.Info("starting scheduler")
startedChan := sched.Start(ctx)
@@ -551,13 +571,63 @@ func main() {
"endpoints", "/scep?operation={GetCACaps,GetCACert,PKIOperation}")
}
// Register RFC 5280 CRL and RFC 6960 OCSP handlers under /.well-known/pki/.
// These are always enabled (no config gate) — revocation data must be
// reachable to relying parties for any cert certctl issues. The finalHandler
// routing gate below strips auth middleware for this prefix so browsers,
// OpenSSL, OCSP stapling sidecars, and mTLS clients can fetch without
// presenting certctl Bearer tokens.
apiRouter.RegisterPKIHandlers(certificateHandler)
logger.Info("PKI endpoints registered",
"endpoints", "/.well-known/pki/{crl/{issuer_id},ocsp/{issuer_id}/{serial}}")
logger.Info("registered all API handlers")
// Build middleware stack
authMiddleware := middleware.NewAuth(middleware.AuthConfig{
Type: cfg.Auth.Type,
Secret: cfg.Auth.Secret,
})
// Build middleware stack.
//
// Authentication unification (M-002): every authenticated request now
// carries a named actor in the request context so audit events record
// the real key identity instead of the hardcoded "api-key-user" string.
// Named keys come from CERTCTL_API_KEYS_NAMED (preferred). For backward
// compatibility CERTCTL_AUTH_SECRET is synthesized into legacy-key-N
// entries with Admin=false.
var namedKeys []middleware.NamedAPIKey
if cfg.Auth.Type != "none" {
// Translate typed config.NamedAPIKey -> middleware.NamedAPIKey. The
// two structs are field-compatible but live in different packages to
// preserve the config→middleware dependency direction.
for _, nk := range cfg.Auth.NamedKeys {
namedKeys = append(namedKeys, middleware.NamedAPIKey{
Name: nk.Name,
Key: nk.Key,
Admin: nk.Admin,
})
}
// Back-compat: if no named keys but legacy Secret is configured,
// synthesize named entries so the audit trail still attributes the
// action (instead of falling back to "api-key-user" / "anonymous").
if len(namedKeys) == 0 && cfg.Auth.Secret != "" {
parts := strings.Split(cfg.Auth.Secret, ",")
idx := 0
for _, p := range parts {
p = strings.TrimSpace(p)
if p == "" {
continue
}
namedKeys = append(namedKeys, middleware.NamedAPIKey{
Name: fmt.Sprintf("legacy-key-%d", idx),
Key: p,
Admin: false,
})
idx++
}
if len(namedKeys) > 0 {
logger.Warn("CERTCTL_AUTH_SECRET is deprecated — set CERTCTL_API_KEYS_NAMED for named actor attribution and admin gating",
"synthesized_keys", len(namedKeys))
}
}
}
authMiddleware := middleware.NewAuthWithNamedKeys(namedKeys)
corsMiddleware := middleware.NewCORS(middleware.CORSConfig{
AllowedOrigins: cfg.CORS.AllowedOrigins,
})
@@ -653,6 +723,14 @@ func main() {
noAuthHandler.ServeHTTP(w, r)
return
}
// RFC 5280 CRL and RFC 6960 OCSP live under /.well-known/pki/ and
// MUST be served unauthenticated — relying parties (browsers,
// OpenSSL, OCSP stapling sidecars, mTLS clients) cannot present
// certctl Bearer tokens. See router.RegisterPKIHandlers.
if len(path) >= 16 && path[:16] == "/.well-known/pki" {
noAuthHandler.ServeHTTP(w, r)
return
}
// All other API and EST routes go through the full middleware stack (with auth)
if (len(path) >= 8 && path[:8] == "/api/v1/") ||
(len(path) >= 16 && path[:16] == "/.well-known/est") {
@@ -669,13 +747,18 @@ func main() {
})
logger.Info("dashboard available at /", "web_dir", webDir)
} else {
// No dashboard: route health/auth-info without auth, everything else through full stack
// No dashboard: route health/auth-info and /.well-known/pki without
// auth, everything else through full stack.
finalHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
path := r.URL.Path
if path == "/health" || path == "/ready" || path == "/api/v1/auth/info" {
noAuthHandler.ServeHTTP(w, r)
return
}
if len(path) >= 16 && path[:16] == "/.well-known/pki" {
noAuthHandler.ServeHTTP(w, r)
return
}
apiHandler.ServeHTTP(w, r)
})
logger.Info("dashboard directory not found, serving API only")
+37 -19
View File
@@ -195,16 +195,11 @@ type metricsResponse struct {
Uptime float64 `json:"uptime_seconds"`
}
// crlResponse for the CRL endpoint.
type crlResponse struct {
Version int `json:"version"`
Total int `json:"total"`
Entries []struct {
Serial string `json:"serial_number"`
Reason string `json:"reason"`
RevokedAt string `json:"revoked_at"`
} `json:"entries"`
}
// M-006: The non-standard JSON CRL endpoint (`GET /api/v1/crl`) was removed.
// RFC 5280 §5 defines only the DER wire format, which is now served
// unauthenticated at `/.well-known/pki/crl/{issuer_id}` per RFC 8615.
// The `crlResponse` Go struct that used to decode the JSON envelope is gone;
// Phase 7 parses the DER bytes directly via `x509.ParseRevocationList`.
// ---------------------------------------------------------------------------
// PostgreSQL test helper
@@ -728,18 +723,41 @@ func TestIntegrationSuite(t *testing.T) {
t.Fatalf("revocation response unexpected: %s", body)
}
// Check CRL
t.Run("CRL", func(t *testing.T) {
resp, err := c.Get("/api/v1/crl")
// Check DER CRL served unauthenticated under /.well-known/pki/ per
// RFC 5280 §5 + RFC 8615 (M-006). Use a plain http.Get — no Bearer
// token — to prove the endpoint is reachable by relying parties that
// have no certctl API credentials.
t.Run("CRL_DER_Unauthenticated", func(t *testing.T) {
resp, err := http.Get(serverURL + "/.well-known/pki/crl/iss-local")
if err != nil {
t.Fatalf("GET CRL: %v", err)
t.Fatalf("GET DER CRL: %v", err)
}
var crl crlResponse
if err := decodeJSON(resp, &crl); err != nil {
t.Fatalf("decode CRL: %v", err)
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
t.Fatalf("unexpected status: got %d, want 200 (body=%s)", resp.StatusCode, string(body))
}
if crl.Total < 1 {
t.Fatalf("CRL total: got %d, want >= 1", crl.Total)
if ct := resp.Header.Get("Content-Type"); ct != "application/pkix-crl" {
t.Errorf("Content-Type: got %q, want %q", ct, "application/pkix-crl")
}
body, err := io.ReadAll(resp.Body)
if err != nil {
t.Fatalf("read CRL body: %v", err)
}
if len(body) == 0 {
t.Fatal("CRL body empty")
}
// Parse the DER bytes as an X.509 CRL (RFC 5280) and verify the
// just-revoked certificate is listed.
crl, err := x509.ParseRevocationList(body)
if err != nil {
t.Fatalf("parse DER CRL: %v", err)
}
if len(crl.RevokedCertificateEntries) < 1 {
t.Fatalf("CRL entries: got %d, want >= 1", len(crl.RevokedCertificateEntries))
}
})
+45 -8
View File
@@ -26,6 +26,7 @@
package integration_test
import (
"crypto/x509"
"database/sql"
"encoding/json"
"io"
@@ -434,10 +435,19 @@ func TestQA(t *testing.T) {
// ===================================================================
t.Run("Part03_CertCRUD", func(t *testing.T) {
t.Run("Create_Minimal", func(t *testing.T) {
// C-001 scope-expansion: the handler's ValidateRequired
// contract now gates common_name, owner_id, team_id,
// issuer_id, name, and renewal_policy_id. A 3-field
// payload would 400 regardless of the id hint, so the
// "minimal" variant carries every required field.
code, body := c.bodyStr(t, "POST", "/api/v1/certificates", `{
"id": "mc-qa-minimal",
"name": "qa-minimal",
"common_name": "qa-minimal.example.com",
"issuer_id": "iss-local"
"issuer_id": "iss-local",
"owner_id": "o-alice",
"team_id": "t-platform",
"renewal_policy_id": "rp-standard"
}`)
if code != 201 && code != 200 {
t.Fatalf("create cert: status %d, body: %s", code, body)
@@ -447,11 +457,14 @@ func TestQA(t *testing.T) {
t.Run("Create_Full", func(t *testing.T) {
code, body := c.bodyStr(t, "POST", "/api/v1/certificates", `{
"id": "mc-qa-full",
"name": "qa-full",
"common_name": "qa-full.example.com",
"sans": ["qa-full-alt.example.com"],
"issuer_id": "iss-local",
"environment": "staging",
"owner_id": "o-alice"
"owner_id": "o-alice",
"team_id": "t-platform",
"renewal_policy_id": "rp-standard"
}`)
if code != 201 && code != 200 {
t.Fatalf("create cert: status %d, body: %s", code, body)
@@ -596,13 +609,37 @@ func TestQA(t *testing.T) {
}
})
t.Run("CRL_JSON", func(t *testing.T) {
code, body := c.bodyStr(t, "GET", "/api/v1/crl", "")
if code != 200 {
t.Fatalf("CRL = %d", code)
// M-006: The non-standard JSON CRL endpoint was removed. RFC 5280 §5
// defines only the DER wire format, now served unauthenticated at
// `/.well-known/pki/crl/{issuer_id}` per RFC 8615. Use a plain
// http.Get — no Bearer — to prove the endpoint is reachable by
// relying parties with no API credentials.
t.Run("CRL_DER_Unauthenticated", func(t *testing.T) {
resp, err := http.Get(qaServerURL + "/.well-known/pki/crl/iss-local")
if err != nil {
t.Fatalf("GET DER CRL: %v", err)
}
if !strings.Contains(body, "entries") {
t.Fatalf("CRL response missing entries field")
defer resp.Body.Close()
if resp.StatusCode != 200 {
b, _ := io.ReadAll(resp.Body)
t.Fatalf("CRL = %d (body=%s)", resp.StatusCode, string(b))
}
if ct := resp.Header.Get("Content-Type"); ct != "application/pkix-crl" {
t.Errorf("Content-Type: got %q, want %q", ct, "application/pkix-crl")
}
body, err := io.ReadAll(resp.Body)
if err != nil {
t.Fatalf("read CRL body: %v", err)
}
if len(body) == 0 {
t.Fatal("CRL body empty")
}
crl, err := x509.ParseRevocationList(body)
if err != nil {
t.Fatalf("parse DER CRL: %v", err)
}
if len(crl.RevokedCertificateEntries) < 1 {
t.Fatalf("CRL entries: got %d, want >= 1", len(crl.RevokedCertificateEntries))
}
})
})
+15 -6
View File
@@ -608,13 +608,22 @@ else
fail "Revocation failed" "$REVOKE_RESP"
fi
info "Checking CRL..."
CRL_RESP=$(api_get "/api/v1/crl" 2>/dev/null || echo '{"total":0}')
CRL_TOTAL=$(echo "$CRL_RESP" | python3 -c "import sys,json; print(json.load(sys.stdin).get('total',0))" 2>/dev/null || echo 0)
if [ "$CRL_TOTAL" -ge 1 ]; then
pass "CRL contains $CRL_TOTAL revoked certificate(s)"
info "Checking DER CRL under /.well-known/pki (RFC 5280 §5, RFC 8615)..."
# The JSON CRL endpoint (`GET /api/v1/crl`) was removed in M-006. RFC 5280
# defines only the DER wire format, now served unauthenticated at
# `/.well-known/pki/crl/{issuer_id}`. Fetch without the Bearer header to
# prove the endpoint is reachable by relying parties with no API key.
CRL_TMP=$(mktemp)
CRL_HEADERS=$(mktemp)
CRL_HTTP_CODE=$(curl -s -o "$CRL_TMP" -D "$CRL_HEADERS" -w "%{http_code}" "${API_URL}/.well-known/pki/crl/iss-local" 2>/dev/null || echo "000")
CRL_SIZE=$(wc -c < "$CRL_TMP" | tr -d ' ')
CRL_CONTENT_TYPE=$(awk 'tolower($1)=="content-type:" { sub(/\r$/,"",$2); print tolower($2) }' "$CRL_HEADERS" | head -n1)
rm -f "$CRL_TMP" "$CRL_HEADERS"
if [ "$CRL_HTTP_CODE" = "200" ] && [ "$CRL_CONTENT_TYPE" = "application/pkix-crl" ] && [ "$CRL_SIZE" -gt 0 ]; then
pass "DER CRL served unauthenticated (HTTP 200, Content-Type application/pkix-crl, ${CRL_SIZE} bytes)"
else
fail "CRL empty after revocation"
fail "DER CRL fetch failed: HTTP=$CRL_HTTP_CODE Content-Type=$CRL_CONTENT_TYPE size=$CRL_SIZE"
fi
CERT_STATUS=$(api_get "/api/v1/certificates/mc-local-test" | python3 -c "import sys,json; print(json.load(sys.stdin).get('status',''))" 2>/dev/null || echo "unknown")
+12 -2
View File
@@ -139,6 +139,16 @@ The agent runs two background loops: a heartbeat (every 60 seconds) to signal it
**Agent groups (M11b):** Dynamic device grouping allows organizing agents by metadata criteria. Agent groups can match by OS, architecture, IP CIDR, and version. Groups support both dynamic matching (agents automatically join when criteria match) and manual membership (explicit include/exclude). Renewal policies can be scoped to agent groups via the `agent_group_id` foreign key. The GUI provides full CRUD management for agent groups with visual match criteria badges.
**Agent soft-retirement (I-004):** `DELETE /api/v1/agents/{id}` is a soft-delete surface — the row is never removed. Retirement stamps `agents.retired_at` (TIMESTAMPTZ) and `agents.retired_reason` (TEXT) and flips the operational status to `Offline`. Default listings (`GET /api/v1/agents`, the dashboard stats counter, and the stale-offline sweeper) filter retired rows out via `AgentRepository.ListActive`; retired rows are surfaced only through the opt-in `GET /api/v1/agents/retired` view. The endpoint follows a preflight → block → escape-hatch contract:
- **Clean retire** (no active dependencies) — `200 OK` with `RetireAgentResponse` (`cascade=false`, zero counts).
- **Blocked by active dependencies** — `409 Conflict` with `BlockedByDependenciesResponse`. The three counts (`active_targets`, `active_certificates`, `pending_jobs`) tell the operator exactly which rows would be orphaned. The schema diverges from `ErrorResponse` because downstream dashboards parse the stable three-key shape.
- **Force cascade** — `DELETE /api/v1/agents/{id}?force=true&reason=...`. `reason` is required (400 otherwise). Transactionally soft-retires downstream `deployment_targets`, cancels pending jobs, and soft-retires the agent, emitting an `agent_retirement_cascaded` audit event with actor + reason + per-bucket counts.
- **Idempotent re-retire** — a retire attempt against an already-retired agent returns `204 No Content` with an empty body (no second audit event, no response shape — callers that POST again on a retry get a clean no-op).
- **Sentinel refusal** — the four sentinel agent IDs (`server-scanner`, `cloud-aws-sm`, `cloud-azure-kv`, `cloud-gcp-sm`) back non-agent discovery subsystems (the network scanner and the three cloud secret-manager sources). They are refused unconditionally — even with `force=true` — via `ErrAgentIsSentinel``403 Forbidden`. The ID list lives in `internal/domain/connector.go` (`SentinelAgentIDs`) so handler, repository, and scheduler code can filter them without importing `service`.
Retired agents receive `410 Gone` on subsequent heartbeats (`service.ErrAgentRetired`). `cmd/agent` treats 410 as a terminal signal and exits cleanly so retired agents stop phoning home. Migration `000015` flipped `deployment_targets.agent_id` from `ON DELETE CASCADE` to `ON DELETE RESTRICT`, making the old hard-delete path a schema error and forcing all retirement through this contract.
### Web Dashboard
The web dashboard is the primary operational interface for certctl. It is built with Vite + React + TypeScript and uses TanStack Query for server state management (caching, background refetching, optimistic updates).
@@ -463,7 +473,7 @@ sequenceDiagram
API-->>U: 200 OK
```
The revocation is recorded in the `certificate_revocations` table (separate from the certificate status update) for CRL generation. The DER-encoded CRL at `GET /api/v1/crl/{issuer_id}` is generated on-demand by querying this table and signing with the issuing CA's key. The OCSP responder at `GET /api/v1/ocsp/{issuer_id}/{serial}` checks both the certificate status and the revocations table to return signed good/revoked/unknown responses.
The revocation is recorded in the `certificate_revocations` table (separate from the certificate status update) for CRL generation. The DER-encoded CRL at `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5, RFC 8615) is generated on-demand by querying this table and signing with the issuing CA's key. The OCSP responder at `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960) checks both the certificate status and the revocations table to return signed good/revoked/unknown responses. Both endpoints are served unauthenticated — relying parties (TLS clients, hardware appliances, browsers) must be able to reach them without a certctl API key — and carry the IANA-registered media types `application/pkix-crl` and `application/ocsp-response` respectively.
Short-lived certificates (those with profile TTL < 1 hour) return "good" from OCSP and are excluded from CRL — their rapid expiry is treated as sufficient revocation.
@@ -889,7 +899,7 @@ Jobs support additional action endpoints: `POST /api/v1/jobs/{id}/cancel`, `POST
- **Additional filters**: `?agent_id=`, `?profile_id=` (in addition to existing status, environment, owner_id, team_id, issuer_id).
- **Deployments**: `GET /api/v1/certificates/{id}/deployments` returns deployment targets for a certificate.
Certificate revocation: `POST /api/v1/certificates/{id}/revoke` with optional `{"reason": "keyCompromise"}`. Supports RFC 5280 reason codes (unspecified, keyCompromise, caCompromise, affiliationChanged, superseded, cessationOfOperation, certificateHold, privilegeWithdrawn). Returns the updated certificate status. Best-effort issuer notification — the revocation succeeds even if the issuer connector is unavailable. A JSON-formatted CRL is available at `GET /api/v1/crl`, and a DER-encoded X.509 CRL signed by the issuing CA at `GET /api/v1/crl/{issuer_id}`. An embedded OCSP responder serves signed responses at `GET /api/v1/ocsp/{issuer_id}/{serial}`. Short-lived certificates (profile TTL < 1 hour) are exempt from CRL/OCSP — expiry is sufficient revocation.
Certificate revocation: `POST /api/v1/certificates/{id}/revoke` with optional `{"reason": "keyCompromise"}`. Supports RFC 5280 reason codes (unspecified, keyCompromise, caCompromise, affiliationChanged, superseded, cessationOfOperation, certificateHold, privilegeWithdrawn). Returns the updated certificate status. Best-effort issuer notification — the revocation succeeds even if the issuer connector is unavailable. The DER-encoded X.509 CRL signed by the issuing CA is served unauthenticated at `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5 + RFC 8615, `Content-Type: application/pkix-crl`). The embedded OCSP responder serves signed responses unauthenticated at `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960, `Content-Type: application/ocsp-response`). Both endpoints are accessible to relying parties with no certctl API credentials, as RFC-compliant PKI consumers expect. Short-lived certificates (profile TTL < 1 hour) are exempt from CRL/OCSP — expiry is sufficient revocation.
Certificate export (M27): `GET /api/v1/certificates/{id}/export/pem` returns PEM-encoded certificate and chain, and `POST /api/v1/certificates/{id}/export/pkcs12` returns a PKCS#12 bundle (binary). Private keys are never exported — they remain on agents. All exports are audited with actor, timestamp, and format.
+6 -4
View File
@@ -210,15 +210,17 @@ NIST SP 800-57 Part 1 Section 6.2 addresses secure key distribution to minimize
- Proxy agent executes deployment via appliance API
**Revocation Distribution**
- Certificate Revocation List (CRL) via `GET /api/v1/crl/{issuer_id}`
- Returns DER-encoded X.509 CRL signed by issuing CA
- Certificate Revocation List (CRL) via `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5, RFC 8615)
- Returns DER-encoded X.509 CRL signed by issuing CA (`Content-Type: application/pkix-crl`)
- 24-hour validity period
- Includes all revoked serials, reasons, and revocation timestamps
- Served unauthenticated so relying parties without certctl API credentials can fetch it
- Subject to URL caching; OCSP preferred for real-time revocation
- OCSP via `GET /api/v1/ocsp/{issuer_id}/{serial}`
- Returns DER-encoded OCSP response (OCSPResponse ASN.1 structure)
- OCSP via `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960)
- Returns DER-encoded OCSP response (OCSPResponse ASN.1 structure, `Content-Type: application/ocsp-response`)
- Signed by issuing CA (or delegated OCSP signing cert)
- Responds with good/revoked/unknown status
- Served unauthenticated — the RFC 6960 relying-party model does not assume API credentials
- Real-time, more bandwidth-efficient than CRL polling
## Revocation and Compromise (NIST SP 800-57 Part 3)
+12 -11
View File
@@ -92,10 +92,10 @@ Your QSA will request evidence that your certificate and key management systems
- **Certificate Status Tracking** — Four statuses: Active (deployed, not yet expired), Expiring (within threshold, awaiting renewal), Expired (past not-after date), Revoked (revoked via RFC 5280 revocation API). Dashboard charts show status distribution.
- **Revocation Infrastructure** (M15a, M15b):
- **Revocation Infrastructure** (M15a, M15b, M-006):
- Revocation API: `POST /api/v1/certificates/{id}/revoke` with RFC 5280 reason codes
- CRL endpoint: `GET /api/v1/crl` (JSON format) or `GET /api/v1/crl/{issuer_id}` (DER X.509 CRL, 24h validity, signed by issuing CA)
- OCSP responder: `GET /api/v1/ocsp/{issuer_id}/{serial}` (returns DER-encoded OCSP response: good/revoked/unknown)
- CRL endpoint: `GET /.well-known/pki/crl/{issuer_id}` DER X.509 CRL, 24h validity, signed by issuing CA, served unauthenticated (RFC 5280 §5, RFC 8615, `Content-Type: application/pkix-crl`)
- OCSP responder: `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` DER-encoded OCSP response (good/revoked/unknown), served unauthenticated (RFC 6960, `Content-Type: application/ocsp-response`)
- Bulk revocation (V2.2): `POST /api/v1/certificates/bulk-revoke` with filter criteria (profile, owner, agent, issuer) for fleet-wide incident response
- Short-lived cert exemption: certs with TTL < 1 hour skip CRL/OCSP (expiry is sufficient revocation)
@@ -109,7 +109,7 @@ Your QSA will request evidence that your certificate and key management systems
- Discovered certificate report: `GET /api/v1/discovered-certificates` JSON export showing all certs on systems, fingerprints, and status.
- Managed certificate inventory: `GET /api/v1/certificates` with filters (`?status=Expiring` for upcoming renewals).
- Expiration alert configuration: policy JSON showing `alert_thresholds_days` for each environment.
- CRL/OCSP availability proof: HTTP GET requests to `/api/v1/crl` and `/api/v1/ocsp/{issuer}/{serial}` with signed responses.
- CRL/OCSP availability proof: unauthenticated HTTP GET requests to `/.well-known/pki/crl/{issuer_id}` (DER, `application/pkix-crl`) and `/.well-known/pki/ocsp/{issuer_id}/{serial}` (DER, `application/ocsp-response`) with signed responses.
- Audit trail for certificate creation/renewal/revocation: `GET /api/v1/audit?type=certificate_issued,certificate_renewed,certificate_revoked`.
- Dashboard charts showing expiration timeline, renewal success trends, status distribution.
@@ -328,9 +328,10 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
- Issuer notified (best-effort; ACME lacks standard revocation, Local CA skips issuer step).
- Revocation notifications sent to owner via email/webhook/Slack/Teams/PagerDuty.
- **CRL and OCSP Publication** (M15b) — Revoked certificates published in:
- CRL: `GET /api/v1/crl` (JSON format) or `GET /api/v1/crl/{issuer_id}` (DER X.509, signed by CA, 24h validity)
- OCSP: `GET /api/v1/ocsp/{issuer_id}/{serial}` (returns revoked status for clients validating certificate chain)
- **CRL and OCSP Publication** (M15b, M-006) — Revoked certificates published in:
- CRL: `GET /.well-known/pki/crl/{issuer_id}` (DER X.509 signed by CA, 24h validity, RFC 5280 §5 + RFC 8615, `Content-Type: application/pkix-crl`)
- OCSP: `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (returns revoked status for clients validating certificate chain, RFC 6960, `Content-Type: application/ocsp-response`)
- Both endpoints are served unauthenticated so relying parties (browsers, TLS appliances) without certctl API keys can verify revocation — this is the RFC-compliant PKI model.
- Clients checking certificate status via OCSP or CRL see revoked status within 24 hours.
- **Bulk Revocation for Incident Response** (V2.2) — `POST /api/v1/certificates/bulk-revoke` with filter criteria (profile, owner, agent, issuer) revokes all matching certificates in a single operation. PCI-DSS Req 4 requires rapid response to data transmission security incidents — bulk revocation enables operators to revoke an entire certificate set (e.g., all certs used by a compromised team or endpoint) in minutes rather than hours.
@@ -342,8 +343,8 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
**Evidence You Can Provide**:
- Revocation requests: `GET /api/v1/audit?type=certificate_revoked` with RFC 5280 reason codes.
- CRL publication: HTTP GET `/api/v1/crl` and parse JSON to show revoked serial numbers and timestamps.
- OCSP responder validation: Query `GET /api/v1/ocsp/{issuer}/{serial}` for a known-revoked cert; response includes `revoked` status.
- CRL publication: HTTP GET `/.well-known/pki/crl/{issuer_id}` (unauthenticated) returns a DER X.509 CRL — parse with `openssl crl -inform der -noout -text` to show revoked serial numbers, reasons, and timestamps.
- OCSP responder validation: Query `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (unauthenticated) for a known-revoked cert; response includes `revoked` status and can be parsed with `openssl ocsp` tooling.
- Audit trail: Certificate status transitions (Active → Revoked) recorded in `audit_events`.
**Operator Responsibility**:
@@ -721,12 +722,12 @@ This requirement covers key generation, storage, rotation, and destruction. Cert
| PCI-DSS Requirement | certctl Feature | API/UI Evidence | Database/Config | Audit Trail | Status |
|---|---|---|---|---|---|
| **4.2.1** Strong Crypto | TLS cert issuance, ACME/step-ca/Local CA, RSA 2048+/ECDSA P-256 | `GET /api/v1/certificates` (key_type, key_size) | Certificate profiles | `GET /api/v1/audit?type=certificate_issued` | Available |
| **4.2.2** Cert Inventory & Validation | Managed cert CRUD, discovery (M18b), expiration alerting, CRL/OCSP | `GET /api/v1/certificates`, `GET /api/v1/discovered-certificates`, `GET /api/v1/crl`, `GET /api/v1/ocsp/{issuer}/{serial}` | `managed_certificates`, `discovered_certificates` tables | `GET /api/v1/audit?type=certificate_*` | Available |
| **4.2.2** Cert Inventory & Validation | Managed cert CRUD, discovery (M18b), expiration alerting, CRL/OCSP | `GET /api/v1/certificates`, `GET /api/v1/discovered-certificates`, `GET /.well-known/pki/crl/{issuer_id}`, `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (both unauthenticated, RFC 5280 / RFC 6960) | `managed_certificates`, `discovered_certificates` tables | `GET /api/v1/audit?type=certificate_*` | Available |
| **3.6** Key Documentation | Profiles, owner/team tracking, issuer config, audit trail | `GET /api/v1/profiles`, `GET /api/v1/issuers`, certificate detail with owner/team | Profiles, certificate owner/team fields, issuer config | `GET /api/v1/audit?resource_type=certificate` | Available |
| **3.7.1** Key Generation | Agent-side ECDSA P-256, server keygen (demo only) | Agent logs, renewal job detail, CSR audit | `CERTCTL_KEYGEN_MODE=agent` (config), job_type=AwaitingCSR | `GET /api/v1/audit?type=certificate_issued` with CSR hash | Available |
| **3.7.2** Key Storage | Agent `/var/lib/certctl/keys` (0600), env var secrets, .env excluded | Deployment manifest (env var refs), agent key dir listing | `.env` file (git-ignored), `CERTCTL_KEY_DIR`, `CERTCTL_CA_KEY_PATH` | No API audit (keys off-platform) | Available |
| **3.7.3** Key Rotation | Auto renewal, expiration thresholds, renewal jobs | Dashboard renewal trends, `GET /api/v1/jobs?type=Renewal`, certificate versions | Renewal policies, certificate version history | `GET /api/v1/audit?type=certificate_renewed` | Available |
| **3.7.4** Key Destruction | Revocation API (RFC 5280), CRL/OCSP, private key cleanup | `POST /api/v1/certificates/{id}/revoke`, `GET /api/v1/crl`, OCSP endpoint | `certificate_revocations` table, CRL publication | `GET /api/v1/audit?type=certificate_revoked` | Available |
| **3.7.4** Key Destruction | Revocation API (RFC 5280), CRL/OCSP, private key cleanup | `POST /api/v1/certificates/{id}/revoke`, unauthenticated `GET /.well-known/pki/crl/{issuer_id}` and `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` | `certificate_revocations` table, CRL publication | `GET /api/v1/audit?type=certificate_revoked` | Available |
| **8.3** Strong Authentication | API key (SHA-256 hash, TLS), GUI login, 401 redirect | GUI login screenshot, API key auth header, TLS cert | API key hash in database | `GET /api/v1/audit` showing API calls | Available |
| **8.6** Acct Management | Credentials out of source, .env excluded, env var config | Code review (no hardcoded secrets), `.gitignore` check | Deployment manifests showing env var refs only | No account lifecycle audit (outside scope) | Available in part |
| **10.2** Audit Logging | API audit middleware (M19), certificate lifecycle events | `GET /api/v1/audit` with filter/pagination | `audit_events` table (every API call) | Real-time via API | Available |
+4 -4
View File
@@ -282,8 +282,8 @@ Each section includes:
- `certificateHold` — temporary revocation (can be "unhold" by reissue)
- `privilegeWithdrawn` — access rights revoked
Revocation is **immediate** (no approval workflow). The certificate is marked `Revoked` in inventory, an audit event is logged, and optional issuer notification is best-effort. All revoked certs are excluded from active deployments.
- **CRL Endpoint**`GET /api/v1/crl` returns a JSON-formatted Certificate Revocation List (serial, reason, timestamp for each revoked cert). `GET /api/v1/crl/{issuer_id}` returns a DER-encoded X.509 CRL signed by the issuing CA (useful for legacy clients that don't support OCSP).
- **OCSP Responder**`GET /api/v1/ocsp/{issuer_id}/{serial}` returns a signed OCSP response indicating whether a cert is good, revoked, or unknown. Clients (browsers, TLS libraries) query this endpoint to verify cert validity in real-time.
- **CRL Endpoint**`GET /.well-known/pki/crl/{issuer_id}` returns a DER-encoded X.509 CRL signed by the issuing CA (RFC 5280 §5, RFC 8615, `Content-Type: application/pkix-crl`), served unauthenticated for relying parties that don't hold certctl API credentials.
- **OCSP Responder**`GET /.well-known/pki/ocsp/{issuer_id}/{serial}` returns a signed OCSP response indicating whether a cert is good, revoked, or unknown (RFC 6960, `Content-Type: application/ocsp-response`). Also unauthenticated. Clients (browsers, TLS libraries) query this endpoint to verify cert validity in real-time.
- **Revocation Notifications** — When a cert is revoked, notifications are sent to:
- Certificate owner (email)
- Configured webhooks (if you have a SIEM that subscribes)
@@ -460,8 +460,8 @@ Each section includes:
| | Notification Routing | Email, Slack, Teams, PagerDuty, OpsGenie | ✅ | ✅ | Configure notifiers, on-call integration |
| | Deployment Rollback | Redeploy previous cert version via GUI | ✅ | ✅ | Audit rollback decisions |
| **CC7.3** Incident Response | Revocation API (RFC 5280 reasons) | `POST /api/v1/certificates/{id}/revoke` | ✅ | Enhanced (bulk revocation) | Establish incident response policy |
| | CRL Endpoint (JSON + DER) | `GET /api/v1/crl`, `GET /api/v1/crl/{issuer_id}` | ✅ | ✅ | Ensure CRL/OCSP accessible to all clients |
| | OCSP Responder | `GET /api/v1/ocsp/{issuer_id}/{serial}` | ✅ | ✅ | Test revocation in staging |
| | CRL Endpoint (DER, RFC 5280 §5) | `GET /.well-known/pki/crl/{issuer_id}` (unauthenticated, `application/pkix-crl`) | ✅ | ✅ | Ensure CRL/OCSP accessible to all clients without API keys |
| | OCSP Responder (RFC 6960) | `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (unauthenticated, `application/ocsp-response`) | ✅ | ✅ | Test revocation in staging |
| | Revocation Notifications | Email, webhook, Slack/Teams on revocation | ✅ | ✅ | Integrate into on-call, document justification separately |
| | Short-Lived Cert Exemption | TTL < 1h skip CRL/OCSP | ✅ | ✅ | Configure profiles appropriately |
| **CC7.4** Risk Mitigation | Renewal Job Tracking | Job state machine (Pending → Running → Completed/Failed) | ✅ | ✅ | Monitor renewal success rate |
+4 -2
View File
@@ -123,6 +123,8 @@ At no point does the private key leave the agent. This is a fundamental security
Agents also report **metadata** about themselves — their operating system, CPU architecture, IP address, hostname, and version — with every heartbeat. This gives ops teams fleet-wide visibility (e.g., "how many agents are running on ARM?", "which agents are still on v1.0.0?") and powers **agent groups** — dynamic device grouping where policies can be scoped to specific agent criteria like OS type, architecture, or network subnet.
**Retiring an agent.** When you decommission a server, the certctl record for its agent needs to be retired, not deleted. certctl uses a **soft-delete** model: `DELETE /api/v1/agents/{id}` stamps the row with a retired-at timestamp and a reason, instead of removing it. This is deliberate — an audit trail of "who owned this certificate, on which host, for which team" stays intact forever, and the downstream deployment_targets, certificates, and jobs keep valid foreign keys. Retired agents are filtered out of default list views and the dashboard's agent counter, but remain visible through a separate retired-agents view for compliance reconciliation. If the agent still has active deployment targets, deployed certificates, or pending jobs, retirement is blocked by default so you don't silently orphan those rows; the API responds with the exact counts so you can retire or reassign each dependency explicitly. A force-retire escape hatch (`?force=true&reason=...`) is available for true decommission scenarios — it transactionally retires the downstream targets, cancels pending jobs, and records the cascade in the audit trail with the reason you provided. Four internal sentinel agents that back the network scanner and the cloud secret-manager discovery sources cannot be retired at all, even with force, because retiring them would orphan their subsystems. Once retired, an agent that still attempts to heartbeat receives `410 Gone` — the agent process reads that as "you've been retired, shut down" and exits cleanly.
### Deployment Targets
Targets are the systems where certificates actually get installed — NGINX web servers, Apache httpd servers, HAProxy load balancers, Traefik reverse proxies, Caddy servers, Envoy gateways, Postfix/Dovecot mail servers, Microsoft IIS servers, and network appliances. Each target type has a **connector** that knows how to deploy certificates to that specific system (e.g., writing files and reloading NGINX or Apache config, building a combined PEM for HAProxy).
@@ -216,9 +218,9 @@ certctl implements revocation using three complementary mechanisms:
**Bulk Revocation** (Fleet-Level Incident Response): For large-scale incidents like CA compromise or team infrastructure decommissioning, `POST /api/v1/certificates/bulk-revoke` revokes all certificates matching filter criteria in a single operation. Filter by profile, owner, team, agent group, or issuer to target the affected certificate set. This is essential for incident response — instead of revoking certificates one-by-one, operators can revoke an entire fleet in minutes. Bulk revocation creates individual revocation jobs that reuse the existing revocation pipeline, ensuring every certificate is audited and notifications are sent.
**Certificate Revocation List (CRL)**: certctl serves both a JSON-formatted CRL at `GET /api/v1/crl` and DER-encoded X.509 CRLs per issuer at `GET /api/v1/crl/{issuer_id}`. The DER CRL is signed by the issuing CA's key and has 24-hour validity clients can download it periodically to check revocation status offline.
**Certificate Revocation List (CRL)**: certctl serves DER-encoded X.509 CRLs per issuer at `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5 wire format, RFC 8615 well-known namespace). The endpoint is unauthenticated so any relying party — browser, TLS client, hardware appliance — can fetch it without a certctl API key. The CRL is signed by the issuing CA's key and has 24-hour validity; clients can download it periodically to check revocation status offline. The response carries `Content-Type: application/pkix-crl`.
**OCSP Responder**: For real-time revocation checking, certctl includes an embedded OCSP responder at `GET /api/v1/ocsp/{issuer_id}/{serial}`. It returns signed OCSP responses (good, revoked, or unknown) so clients can verify certificate status without downloading the full CRL.
**OCSP Responder**: For real-time revocation checking, certctl includes an embedded OCSP responder at `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960). Like the CRL endpoint, it is unauthenticated and returns signed OCSP responses (good, revoked, or unknown) with `Content-Type: application/ocsp-response`, so clients can verify certificate status without downloading the full CRL.
Short-lived certificates (those assigned to profiles with TTL under 1 hour) are exempt from CRL and OCSP — their rapid expiry is considered sufficient revocation. This is a deliberate design choice to reduce infrastructure overhead for ephemeral machine-to-machine credentials.
+2 -2
View File
@@ -155,7 +155,7 @@ The Local CA issuer signs certificates using Go's `crypto/x509` library. It supp
**Sub-CA mode:** Loads a CA certificate and private key from disk (`CERTCTL_CA_CERT_PATH` + `CERTCTL_CA_KEY_PATH`). The CA cert is signed by an upstream CA (e.g., ADCS), so all issued certificates chain to the enterprise root trust hierarchy. Clients that already trust the enterprise root automatically trust certctl-issued certs. Supports RSA, ECDSA, and PKCS#8 key formats. If the paths are not set, falls back to self-signed mode. The loaded certificate must have `IsCA=true` and `KeyUsageCertSign`.
**CRL and OCSP support (M15b):** The Local CA supports DER-encoded X.509 CRL generation via `GET /api/v1/crl/{issuer_id}` with 24-hour validity. An embedded OCSP responder at `GET /api/v1/ocsp/{issuer_id}/{serial}` returns signed OCSP responses for issued certificates (good/revoked/unknown status). Certificates with profile TTL < 1 hour automatically skip CRL/OCSP — expiry is treated as sufficient revocation for short-lived credentials.
**CRL and OCSP support (M15b):** The Local CA supports DER-encoded X.509 CRL generation served unauthenticated at `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5, RFC 8615, `Content-Type: application/pkix-crl`) with 24-hour validity. An embedded OCSP responder at `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960, `Content-Type: application/ocsp-response`) returns signed OCSP responses for issued certificates (good/revoked/unknown status). Both endpoints are reachable by relying parties with no certctl API credentials, which is how standard TLS clients, browsers, and hardware appliances consume these resources. Certificates with profile TTL < 1 hour automatically skip CRL/OCSP — expiry is treated as sufficient revocation for short-lived credentials.
**Extended Key Usage (EKU) support (M27):** The Local CA respects EKU constraints from certificate profiles and adjusts key usage flags accordingly. For S/MIME certificates (emailProtection EKU), it uses `DigitalSignature | ContentCommitment` instead of the TLS default. For TLS certificates (serverAuth/clientAuth EKU), it uses `DigitalSignature | KeyEncipherment`. This enables support for multiple certificate types — TLS, S/MIME, code signing, timestamping — from a single CA.
@@ -287,7 +287,7 @@ Environment variables:
The connector is registered in the issuer registry under `iss-stepca`. step-ca also works with the existing ACME connector (point `iss-acme-*` at step-ca's ACME directory URL for ACME-based issuance).
**Note:** step-ca-issued certificates rely on step-ca's own CRL/OCSP infrastructure. certctl's local CRL/OCSP endpoints (`GET /api/v1/crl/{issuer_id}` and `GET /api/v1/ocsp/{issuer_id}/{serial}`) are populated from step-ca's revocation data if available, but clients should validate against step-ca's endpoints for the authoritative status.
**Note:** step-ca-issued certificates rely on step-ca's own CRL/OCSP infrastructure. certctl's local CRL/OCSP endpoints (`GET /.well-known/pki/crl/{issuer_id}` and `GET /.well-known/pki/ocsp/{issuer_id}/{serial}`, served unauthenticated per RFC 5280 §5 / RFC 6960 / RFC 8615) are populated from step-ca's revocation data if available, but clients should validate against step-ca's endpoints for the authoritative status.
**MaxTTL enforcement (M11c):** When a certificate profile defines a maximum TTL, the step-ca connector caps the `NotAfter` field to ensure the issued certificate does not exceed the profile limit, regardless of the step-ca provisioner's own maximum.
+11 -9
View File
@@ -724,22 +724,24 @@ curl -s -X POST $API/api/v1/certificates/mc-demo-payments/revoke \
6. Creates an audit trail entry
7. Sends revocation notifications via configured channels
Check the CRL (Certificate Revocation List):
Check the CRL (Certificate Revocation List) — served unauthenticated under the RFC 8615 well-known namespace so relying parties without a certctl API key can still verify revocation (RFC 5280 §5):
```bash
# JSON-formatted CRL
curl -s $API/api/v1/crl | jq .
# DER-encoded X.509 CRL for the local CA (binary — pipe to openssl for inspection)
curl -s $API/api/v1/crl/iss-local -o /tmp/crl.der
# DER-encoded X.509 CRL for the local CA (binary — pipe to openssl for inspection).
# Note: no -H "Authorization: Bearer ..." — the endpoint is deliberately
# unauthenticated. Content-Type is application/pkix-crl.
curl -s http://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
openssl crl -inform DER -in /tmp/crl.der -text -noout
```
Check OCSP status:
Check OCSP status (RFC 6960, also unauthenticated, `application/ocsp-response`):
```bash
# Replace SERIAL with the actual serial number from the certificate version
curl -s $API/api/v1/ocsp/iss-local/SERIAL | jq .
# Replace SERIAL with the actual serial number from the certificate version.
# The embedded OCSP responder returns a signed DER response — parse it with
# `openssl ocsp -respin` or similar tooling.
curl -s http://localhost:8443/.well-known/pki/ocsp/iss-local/SERIAL -o /tmp/ocsp.der
openssl ocsp -respin /tmp/ocsp.der -noverify -resp_text | head -40
```
**Why RFC 5280 reason codes:** The reason code isn't just metadata — it tells clients *why* the certificate was revoked. A `keyCompromise` revocation means the private key was exposed and the certificate should be distrusted immediately. A `superseded` revocation means a newer certificate replaced it — less urgent. CRLs and OCSP responses include the reason code so client software can make informed trust decisions.
+5 -4
View File
@@ -228,14 +228,15 @@ Revocation is a 7-step process: validate eligibility → get serial → update s
- Audit trail: single `bulk_revocation_initiated` event logs the criteria and actor
- Optional `--reason` defaults to `unspecified` if omitted
### CRL Endpoints
### CRL Endpoint
- `GET /api/v1/crl` — JSON-formatted CRL (version, entries array, total count, timestamp)
- `GET /api/v1/crl/{issuer_id}` — DER-encoded X.509 CRL signed by issuing CA, 24-hour validity
- `GET /.well-known/pki/crl/{issuer_id}` — DER-encoded X.509 CRL signed by the issuing CA, 24-hour validity (RFC 5280 §5 + RFC 8615). Served unauthenticated with `Content-Type: application/pkix-crl` so relying parties without certctl API credentials can fetch it.
Prior non-standard JSON CRL and authenticated `/api/v1/crl*` paths were removed in M-006 — RFC 5280 defines only the DER wire format and relying parties do not have API keys.
### OCSP Responder
`GET /api/v1/ocsp/{issuer_id}/{serial}` — signed OCSP responses (good/revoked/unknown). Signs with issuing CA key. Requires CA key access (Local CA, step-CA connectors).
`GET /.well-known/pki/ocsp/{issuer_id}/{serial}` — signed OCSP responses (good/revoked/unknown) per RFC 6960. Served unauthenticated with `Content-Type: application/ocsp-response`. Signs with the issuing CA key; requires CA key access (Local CA, step-CA connectors).
### Short-Lived Certificate Exemption
+4 -2
View File
@@ -286,9 +286,11 @@ curl -s -X POST http://localhost:8443/api/v1/certificates/$CERT_ID/revoke \
Supported RFC 5280 reason codes: `unspecified`, `keyCompromise`, `caCompromise`, `affiliationChanged`, `superseded`, `cessationOfOperation`, `certificateHold`, `privilegeWithdrawn`.
Confirm via CRL:
Confirm via the unauthenticated DER CRL (RFC 5280 §5, RFC 8615):
```bash
curl -s http://localhost:8443/api/v1/crl | jq .
# Fetch the CRL without any API key — relying parties shouldn't need one.
curl -s http://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
openssl crl -inform der -in /tmp/crl.der -noout -text | head -40
```
### Interactive approval workflow
+6 -3
View File
@@ -512,12 +512,15 @@ curl -s -X POST http://localhost:8443/api/v1/certificates/mc-local-test/revoke \
### Step 7b: Check the CRL (Certificate Revocation List)
The CRL is a DER-encoded X.509 v2 CRL (RFC 5280 §5) served under the RFC 8615 well-known namespace. It is deliberately unauthenticated — relying parties that need to verify revocation don't have certctl API keys.
```bash
curl -s -H "Authorization: Bearer test-key-2026" \
http://localhost:8443/api/v1/crl | python3 -m json.tool
# No Authorization header — the endpoint is public by design.
curl -s http://localhost:8443/.well-known/pki/crl/iss-local -o /tmp/crl.der
openssl crl -inform der -in /tmp/crl.der -noout -text | head -40
```
**What you should see**: A list that includes the revoked certificate's serial number, the reason, and the timestamp.
**What you should see**: `openssl` prints the CRL issuer DN, `This Update` / `Next Update` timestamps, and at least one entry whose `Serial Number` matches the cert you just revoked, with `CRL Reason Code: Superseded` (or whichever reason you passed in step 7a). The response's `Content-Type` header is `application/pkix-crl`.
### Step 7c: Check in the dashboard
+279 -61
View File
@@ -1297,66 +1297,59 @@ curl -s -H "$AUTH" "$SERVER/api/v1/audit?per_page=5" | jq '[.items[] | select(.a
### 5.3 CRL & OCSP
**Test 5.3.1 — JSON CRL endpoint**
> **M-006 note:** The non-standard JSON CRL (`GET /api/v1/crl`) and the authenticated DER CRL (`GET /api/v1/crl/{issuer_id}`) and OCSP (`GET /api/v1/ocsp/{issuer_id}/{serial}`) paths were removed. Revocation-status distribution now lives under the RFC 8615 well-known namespace (`/.well-known/pki/crl/{issuer_id}` and `/.well-known/pki/ocsp/{issuer_id}/{serial}`), served unauthenticated because relying parties (browsers, TLS clients, hardware appliances) do not have certctl API keys.
**Test 5.3.1 — DER CRL endpoint (RFC 5280 §5, unauthenticated)**
```bash
curl -s -w "\nHTTP %{http_code}\n" -H "$AUTH" "$SERVER/api/v1/crl" | jq '{total: .total, entries_count: (.entries | length)}'
curl -s -D - -o /tmp/crl.der "$SERVER/.well-known/pki/crl/iss-local" | grep -i "content-type"
openssl crl -inform der -in /tmp/crl.der -noout -text | head -40
```
**What:** Fetches the JSON-formatted Certificate Revocation List.
**Why:** CRL is how relying parties check if a certificate has been revoked. The JSON CRL is the machine-readable API view.
**Expected:** HTTP 200. `total` > 0 (we revoked several certs above). Entries array contains serial numbers.
**PASS if** HTTP 200 and `total` > 0. **FAIL** if total = 0 or 500.
**What:** Fetches the DER-encoded X.509 CRL for the local issuer without presenting any API credentials.
**Why:** Relying parties (browsers, TLS libraries, network appliances) don't have certctl API keys. RFC 5280 §5 defines only the DER wire format, and RFC 8615 defines `.well-known/pki/*` as the relying-party namespace. The Content-Type must be `application/pkix-crl` and `openssl crl -inform der` must parse the body.
**Expected:** `Content-Type: application/pkix-crl`, `openssl` prints a valid CRL with the revoked serials we created above.
**PASS if** Content-Type matches and `openssl crl` parses the body. **FAIL** if JSON/HTML, 401/403, or parse error.
---
**Test 5.3.2 — DER CRL endpoint**
**Test 5.3.2 — OCSP: good response for non-revoked cert (RFC 6960, unauthenticated)**
```bash
curl -s -D - -o /dev/null -H "$AUTH" "$SERVER/api/v1/crl/iss-local" | grep -i "content-type"
curl -s -w "\nHTTP %{http_code}\n" "$SERVER/.well-known/pki/ocsp/iss-local/mc-api-prod" -o /tmp/ocsp.der
openssl ocsp -respin /tmp/ocsp.der -noverify -text 2>/dev/null | head -20
```
**What:** Fetches the DER-encoded X.509 CRL for the local issuer.
**Why:** Standard CRL consumers (browsers, TLS libraries) expect DER-encoded CRLs, not JSON. The Content-Type must be correct.
**Expected:** `Content-Type: application/pkix-crl`
**PASS if** Content-Type is `application/pkix-crl`. **FAIL** if JSON or other.
**What:** Queries the OCSP responder for a non-revoked certificate without any Authorization header.
**Why:** OCSP is the real-time alternative to CRL. RFC 6960 relying parties do not authenticate to the responder, so the endpoint must be public and return `Content-Type: application/ocsp-response`.
**Expected:** HTTP 200 with OCSP response indicating "good" status when `openssl ocsp -respin` parses the body.
**PASS if** HTTP 200 and cert status prints "good". **FAIL** if 401/403/500 or "revoked"/"unknown".
---
**Test 5.3.3 — OCSP: good response for non-revoked cert**
**Test 5.3.3 — OCSP: revoked response for revoked cert (unauthenticated)**
```bash
curl -s -w "\nHTTP %{http_code}\n" -H "$AUTH" "$SERVER/api/v1/ocsp/iss-local/mc-api-prod"
```
**What:** Queries the OCSP responder for a non-revoked certificate.
**Why:** OCSP is the real-time alternative to CRL. A "good" response means the cert is valid.
**Expected:** HTTP 200 with OCSP response indicating "good" status.
**PASS if** HTTP 200. **FAIL** if 500.
---
**Test 5.3.4 — OCSP: revoked response for revoked cert**
```bash
curl -s -w "\nHTTP %{http_code}\n" -H "$AUTH" "$SERVER/api/v1/ocsp/iss-local/mc-test-full"
curl -s -w "\nHTTP %{http_code}\n" "$SERVER/.well-known/pki/ocsp/iss-local/mc-test-full" -o /tmp/ocsp.der
openssl ocsp -respin /tmp/ocsp.der -noverify -text 2>/dev/null | grep -i "cert status"
```
**What:** Queries OCSP for a certificate we revoked earlier.
**Why:** OCSP must return "revoked" status for revoked certs. If it still returns "good," relying parties will trust a compromised certificate.
**Why:** OCSP must return "revoked" status for revoked certs. If it still returns "good," relying parties will trust a compromised certificate. Endpoint is unauthenticated per RFC 6960.
**Expected:** HTTP 200 with OCSP response indicating "revoked" status.
**PASS if** HTTP 200 and response indicates revoked. **FAIL** if response indicates "good".
**PASS if** HTTP 200 and status prints "revoked". **FAIL** if status is "good".
---
**Test 5.3.5 — OCSP: unknown serial**
**Test 5.3.4 — OCSP: unknown serial (unauthenticated)**
```bash
curl -s -w "\nHTTP %{http_code}\n" -H "$AUTH" "$SERVER/api/v1/ocsp/iss-local/nonexistent-serial"
curl -s -w "\nHTTP %{http_code}\n" "$SERVER/.well-known/pki/ocsp/iss-local/nonexistent-serial" -o /tmp/ocsp.der
openssl ocsp -respin /tmp/ocsp.der -noverify -text 2>/dev/null | grep -i "cert status"
```
**What:** Queries OCSP for a serial number the server doesn't recognize.
**Why:** OCSP must return "unknown" for serials it doesn't manage, not "good" (which would be a false positive).
**Why:** OCSP must return "unknown" for serials it doesn't manage, not "good" (which would be a false positive). Endpoint is public per RFC 6960.
**Expected:** HTTP 200 with OCSP "unknown" response, or HTTP 404.
**PASS if** response is "unknown" or 404. **FAIL** if "good".
@@ -2102,9 +2095,10 @@ go test ./internal/connector/issuer/local/ -run "TestSubCA" -v
**What:** In sub-CA mode, the DER CRL (Part 31.1) should be signed by the sub-CA key, not a self-signed root.
```bash
# After starting in sub-CA mode and revoking a cert:
curl -s -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/crl/iss-local" -o /tmp/subca-crl.der
# After starting in sub-CA mode and revoking a cert. The CRL is
# published unauthenticated under the RFC 8615 well-known namespace
# because relying parties don't carry certctl API keys.
curl -s "http://localhost:8443/.well-known/pki/crl/iss-local" -o /tmp/subca-crl.der
openssl crl -in /tmp/subca-crl.der -inform DER -noout -issuer
```
@@ -3706,23 +3700,24 @@ go test ./internal/service/ -run TestCSRRenewal -v
**Why:** TLS clients need to verify that certificates haven't been revoked. Without OCSP/CRL, a compromised certificate remains trusted until it expires. The short-lived exemption avoids bloating the CRL with certs that expire before distribution.
### 24.1: DER-Encoded CRL
> **M-006 note:** CRL and OCSP are published at `GET /.well-known/pki/crl/{issuer_id}` (RFC 5280 §5, `application/pkix-crl`) and `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` (RFC 6960, `application/ocsp-response`). Per RFC 8615, `.well-known/pki/*` is the relying-party namespace, and the endpoints are served **unauthenticated** — browsers, TLS libraries, and network appliances do not have certctl API keys. The legacy `GET /api/v1/crl`, `GET /api/v1/crl/{issuer_id}`, and `GET /api/v1/ocsp/{issuer_id}/{serial}` routes were removed.
**What:** `GET /api/v1/crl/{issuer_id}` returns a DER-encoded X.509 CRL signed by the issuing CA. Content-Type is `application/pkix-crl`. The CRL has 24-hour validity.
### 24.1: DER-Encoded CRL (unauthenticated)
**Why:** This is the standard CRL format that browsers, TLS libraries, and LDAP directories consume. The existing JSON CRL at `GET /api/v1/crl` is certctl-specific; the DER CRL is interoperable.
**What:** `GET /.well-known/pki/crl/{issuer_id}` returns a DER-encoded X.509 CRL signed by the issuing CA. Content-Type is `application/pkix-crl`. The CRL has 24-hour validity.
**Why:** This is the RFC 5280 §5 wire format that browsers, TLS libraries, and LDAP directories consume. It must be reachable without any Authorization header so that relying parties — who have no certctl credentials — can fetch it.
```bash
# Request DER CRL for the local issuer
curl -s -D - -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/crl/iss-local" \
# Request DER CRL for the local issuer. No Authorization header.
curl -s -D - "http://localhost:8443/.well-known/pki/crl/iss-local" \
-o /tmp/crl.der
# Verify it's valid DER CRL with openssl
openssl crl -in /tmp/crl.der -inform DER -noout -text
```
**Expected:** 200 OK, Content-Type `application/pkix-crl`, Cache-Control `public, max-age=3600`.
**Expected:** 200 OK, Content-Type `application/pkix-crl`.
**PASS if:**
- `openssl crl` parses the DER file successfully
@@ -3730,33 +3725,34 @@ openssl crl -in /tmp/crl.der -inform DER -noout -text
- Validity period is present (thisUpdate / nextUpdate)
- If any certs have been revoked, they appear in the revocation list with serial + reason
**FAIL if:** Response is JSON (wrong endpoint), `openssl` rejects the DER format, or headers are wrong.
**FAIL if:** Response is JSON (wrong endpoint), `openssl` rejects the DER format, headers are wrong, or the server returns 401/403 (auth must NOT be required).
### 24.2: DER CRL — Nonexistent Issuer
```bash
curl -s -w "\n%{http_code}" -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/crl/iss-nonexistent"
curl -s -w "\n%{http_code}" \
"http://localhost:8443/.well-known/pki/crl/iss-nonexistent"
```
**Expected:** 404 Not Found.
**PASS if** status code is 404 and body contains "not found".
### 24.3: OCSP Responder — Good Status
### 24.3: OCSP Responder — Good Status (unauthenticated)
**What:** `GET /api/v1/ocsp/{issuer_id}/{serial}` returns a signed OCSP response. For a non-revoked certificate, the status is "good".
**What:** `GET /.well-known/pki/ocsp/{issuer_id}/{serial}` returns a signed OCSP response. For a non-revoked certificate, the status is "good".
**Why:** OCSP is the real-time revocation check that TLS clients perform during the handshake. A "good" response tells the client the cert is still valid.
**Why:** OCSP is the real-time RFC 6960 revocation check that TLS clients perform during the handshake. A "good" response tells the client the cert is still valid. Relying parties fetch this without API credentials.
```bash
# First, get a certificate's serial number
# First, get a certificate's serial number (this uses the authenticated API
# because the operator has an API key — that is different from the relying
# party fetching the OCSP response).
SERIAL=$(curl -s -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/certificates/mc-api-prod" | jq -r '.latest_version.serial_number // empty')
# If serial is available, query OCSP
# Query OCSP without any Authorization header.
if [ -n "$SERIAL" ]; then
curl -s -D - -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/ocsp/iss-local/$SERIAL" \
curl -s -D - "http://localhost:8443/.well-known/pki/ocsp/iss-local/$SERIAL" \
-o /tmp/ocsp.der
# Parse OCSP response
@@ -3771,7 +3767,7 @@ fi
- Certificate status is "good" for a non-revoked cert
- Response is signed (producedAt timestamp present)
**FAIL if:** Response is JSON, OCSP status is wrong, or `openssl` rejects the response.
**FAIL if:** Response is JSON, OCSP status is wrong, `openssl` rejects the response, or the endpoint requires auth.
### 24.4: OCSP Responder — Revoked Status
@@ -3784,9 +3780,8 @@ curl -s -X POST -H "Authorization: Bearer $API_KEY" \
-d '{"reason": "keyCompromise"}' \
"http://localhost:8443/api/v1/certificates/$CERT_ID/revoke"
# Then query OCSP
curl -s -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/ocsp/iss-local/$SERIAL" \
# Then query OCSP — unauthenticated.
curl -s "http://localhost:8443/.well-known/pki/ocsp/iss-local/$SERIAL" \
-o /tmp/ocsp-revoked.der
openssl ocsp -respin /tmp/ocsp-revoked.der -text -noverify
@@ -3801,8 +3796,7 @@ openssl ocsp -respin /tmp/ocsp-revoked.der -text -noverify
**What:** Querying a serial number that doesn't exist in the inventory returns an "unknown" OCSP status (not an error — this is the correct OCSP behavior per RFC 6960).
```bash
curl -s -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/ocsp/iss-local/DEADBEEF" \
curl -s "http://localhost:8443/.well-known/pki/ocsp/iss-local/DEADBEEF" \
-o /tmp/ocsp-unknown.der
openssl ocsp -respin /tmp/ocsp-unknown.der -text -noverify
@@ -3820,9 +3814,8 @@ openssl ocsp -respin /tmp/ocsp-unknown.der -text -noverify
To test: revoke a cert that was issued under the `prof-short-lived` profile, then check the DER CRL. The revoked short-lived cert should NOT appear.
```bash
# After revoking a short-lived cert (serial SHORT_SERIAL):
curl -s -H "Authorization: Bearer $API_KEY" \
"http://localhost:8443/api/v1/crl/iss-local" -o /tmp/crl.der
# After revoking a short-lived cert (serial SHORT_SERIAL). No auth needed.
curl -s "http://localhost:8443/.well-known/pki/crl/iss-local" -o /tmp/crl.der
openssl crl -in /tmp/crl.der -inform DER -text | grep -i "$SHORT_SERIAL"
```
@@ -6594,6 +6587,231 @@ helm template certctl deploy/helm/certctl/ --set server.replicaCount=3 | grep 'r
---
## Part 55: Agent Soft-Retirement (I-004)
**What this validates:** The full `DELETE /api/v1/agents/{id}` soft-retirement contract — seven HTTP status codes (200/204/400/403/404/405/409/500), opt-in retired-agent listing, sentinel refusal, `410 Gone` heartbeat response, and the force-cascade escape hatch.
**Why it matters:** Before I-004, there was no retirement surface at all — `DELETE` did not exist and agents could only be removed via raw SQL against the `agents` table. Worse, the schema declared `deployment_targets.agent_id ON DELETE CASCADE`, so any such manual delete silently cascaded through four tables with zero audit trail. This part pins the replacement contract (soft-delete + preflight + force-cascade + sentinel guard + heartbeat 410) so regressions show up here first rather than as orphaned targets in production.
### 55.1 Migration 000015 Applied
```bash
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT column_name FROM information_schema.columns WHERE table_name='agents' AND column_name IN ('retired_at','retired_reason') ORDER BY column_name;"
```
**What:** Confirms migration 000015 added the archival columns to the `agents` table.
**PASS if** both `retired_at` and `retired_reason` rows are returned. **FAIL** if either is missing (migration did not apply).
---
### 55.2 FK Constraint Flipped to RESTRICT
```bash
docker compose -f deploy/docker-compose.yml exec postgres \
psql -U certctl -d certctl -c \
"SELECT confdeltype FROM pg_constraint WHERE conname='deployment_targets_agent_id_fkey';"
```
**What:** `confdeltype` is PostgreSQL's one-character code for the FK delete action: `r` = RESTRICT, `c` = CASCADE.
**PASS if** the value is `r`. **FAIL** if it is still `c` — that means migration 000015's FK flip did not run, and a hard `DELETE` against an agent row would silently cascade.
---
### 55.3 Clean Retire — 200
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-test-clean" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Retires an agent that has no active deployment targets, no deployed certificates, and no pending jobs.
**PASS if** status code is `200` and response body includes `"retired_at":"<ISO8601>"`, `"cascade":false`, and zero-valued counts.
---
### 55.4 Idempotent Re-Retire — 204
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-test-clean" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Retires an agent that is already retired.
**PASS if** status code is `204` and response body is completely empty (not even a trailing newline from the handler). The 200-shape must NOT be emitted — this is the terminal no-op.
---
### 55.5 Blocked by Dependencies — 409
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-with-deps" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Attempts to retire an agent that still has active targets/certificates/jobs.
**PASS if** status code is `409` and response body is the three-key `BlockedByDependenciesResponse` shape: `{"error":"blocked_by_dependencies", "message": "...", "counts": {"active_targets": N, "active_certificates": N, "pending_jobs": N}}`. Must NOT be the generic `ErrorResponse` shape — downstream dashboards parse the `counts` key.
---
### 55.6 Force Cascade — 200
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-with-deps?force=true&reason=decommissioning+rack-7" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Uses the force escape hatch to cascade-retire the dependencies.
**PASS if** status code is `200`, response includes `"cascade":true` with the pre-cascade counts, and the subsequent `GET /api/v1/audit-events?action=agent_retirement_cascaded` shows the event with the supplied `reason` and actor.
---
### 55.7 Force Without Reason — 400
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-other?force=true" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Verifies the `ErrForceReasonRequired` guard — `force=true` without `reason` must be rejected before any state mutation.
**PASS if** status code is `400` and no agent/target/job rows were modified.
---
### 55.8 Sentinel Refusal — 403
```bash
for id in server-scanner cloud-aws-sm cloud-azure-kv cloud-gcp-sm; do
echo "=== $id ==="
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/${id}?force=true&reason=attempt" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
done
```
**What:** Verifies all four sentinel agents refuse retirement even with `force=true`.
**PASS if** every request returns `403` and the response body's `error` value is `sentinel_agent` (or the equivalent `ErrAgentIsSentinel` mapping). **FAIL** if any sentinel accepts the request — retiring one silently orphans the network scanner or one of the three cloud secret-manager discovery sources.
---
### 55.9 Unknown ID — 404
```bash
curl -sS -X DELETE "http://localhost:8443/api/v1/agents/ag-does-not-exist" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-w "\nHTTP %{http_code}\n"
```
**What:** Verifies `ErrAgentNotFound` maps to 404 (not 500). Ordering matters — the not-found check must come after the sentinel check so a typo'd sentinel ID still returns 403, not 404.
**PASS if** status code is `404`.
---
### 55.10 Heartbeat on Retired Agent — 410
```bash
curl -sS -X POST "http://localhost:8443/api/v1/agents/ag-test-clean/heartbeat" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
-H "Content-Type: application/json" \
-d '{"os":"linux","architecture":"amd64","hostname":"test","ip_address":"10.0.0.1","version":"2.1.0"}' \
-w "\nHTTP %{http_code}\n"
```
**What:** Retired agents get `410 Gone` — the canonical "resource is permanently gone, stop retrying" signal — so `cmd/agent` detects it and exits cleanly.
**PASS if** status code is `410`. **FAIL** if it is `404` (wrong ordering — retired-check must run before not-found) or `200` (retired filter missing entirely — agent would keep phoning home forever).
---
### 55.11 Default List Excludes Retired
```bash
curl -sS "http://localhost:8443/api/v1/agents" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
| jq -r '.data[] | select(.id=="ag-test-clean") | .id'
```
**What:** Verifies the default `/agents` listing filters retired rows via `AgentRepository.ListActive`.
**PASS if** output is empty (the retired agent does NOT appear). **FAIL** if `ag-test-clean` shows up — default listings must not expose retired rows.
---
### 55.12 Retired Agents Opt-In View
```bash
curl -sS "http://localhost:8443/api/v1/agents/retired" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
| jq -r '.data[] | select(.id=="ag-test-clean") | {id, retired_at, retired_reason}'
```
**What:** Verifies the opt-in retired-agents view returns the row with `retired_at` and `retired_reason` populated. Go 1.22 ServeMux literal-beats-pattern-var precedence routes `/agents/retired` to this handler rather than `/agents/{id}`.
**PASS if** the row appears with non-null `retired_at`. **FAIL** if the row is missing (listing broken) or `retired_at` is null (serialization broken).
---
### 55.13 Dashboard Stats Counter Excludes Retired
```bash
curl -sS "http://localhost:8443/api/v1/stats/summary" \
-H "Authorization: Bearer ${CERTCTL_API_KEY}" \
| jq -r '.total_agents'
```
**What:** Stats dashboard uses `ListActive`, not `List` — retired agents must not inflate the count.
**PASS if** the counter reflects only non-retired rows (verify against `SELECT count(*) FROM agents WHERE retired_at IS NULL`).
---
### 55.14 CLI Retire Subcommand
```bash
certctl-cli agents retire ag-cli-test --force --reason "smoke test"
certctl-cli agents list --retired | grep ag-cli-test
```
**What:** Verifies the CLI `agents retire` subcommand forwards `--force` and `--reason` via `DeleteWithQuery` and the `agents list --retired` flag hits `/agents/retired` rather than the default listing.
**PASS if** the first command succeeds and the second shows the agent in the retired view.
---
### 55.15 MCP Retire Tool Schema
```bash
go test ./internal/mcp/ -run TestRetireAgent -v -count=1
```
**What:** Verifies the `certctl_retire_agent` MCP tool's input schema accepts `id`, `force`, and `reason`, and that the tool actually propagates `force`/`reason` into the outbound DELETE query string (not the body).
**PASS if** exit code 0.
---
### 55.16 HEAD-State OpenAPI Contract
```bash
npx --yes @redocly/cli lint api/openapi.yaml \
--config '{"rules":{"operation-4xx-response":"error","no-invalid-media-type-examples":"error"}}'
python3 -c "
import yaml
spec = yaml.safe_load(open('api/openapi.yaml'))
del_op = spec['paths']['/api/v1/agents/{id}']['delete']
assert set(del_op['responses'].keys()) == {'200','204','400','403','404','405','409','500'}, del_op['responses'].keys()
hb = spec['paths']['/api/v1/agents/{id}/heartbeat']['post']
assert '410' in hb['responses'], hb['responses'].keys()
assert spec['paths']['/api/v1/agents/retired']['get']['operationId'] == 'listRetiredAgents'
print('OpenAPI I-004 contract: OK')
"
```
**What:** Two-part check. Redocly lint confirms the spec is structurally valid; the Python assertions pin the seven DELETE status codes, the 410 heartbeat response, and the retired-agents operationId.
**PASS if** redocly prints no errors and the Python script prints `OpenAPI I-004 contract: OK`.
---
## Release Sign-Off
All tests below must pass before tagging v2.1.0. Each row is one individual test from the guide above. The **Method** column indicates whether `qa-smoke-test.sh` covers the test automatically (**Auto**) or requires hands-on verification (**Manual**).
+24 -18
View File
@@ -247,26 +247,30 @@ func TestGetCertificateVersions_MultiSegment(t *testing.T) {
}
// TestHandleOCSP_MultiSegment exercises the OCSP responder's 2-segment path
// parser (/api/v1/ocsp/{issuer_id}/{serial_hex}). Each leg is attacker-
// controlled and the serial can be arbitrary length. This is a key adversarial
// surface because the serial is passed directly to the CA-operations service,
// which is expected to treat it as an opaque identifier.
// parser (/.well-known/pki/ocsp/{issuer_id}/{serial_hex}). Each leg is
// attacker-controlled and the serial can be arbitrary length. This is a key
// adversarial surface because the serial is passed directly to the
// CA-operations service, which is expected to treat it as an opaque
// identifier.
//
// M-006 relocation: these paths were previously served at /api/v1/ocsp/*;
// under RFC 8615 and RFC 6960 they now live under /.well-known/pki/ocsp/*.
func TestHandleOCSP_MultiSegment(t *testing.T) {
cases := []struct {
name string
path string
}{
{"missing_serial", "/api/v1/ocsp/iss-local"},
{"missing_both", "/api/v1/ocsp/"},
{"empty_issuer", "/api/v1/ocsp//01ABCDEF"},
{"empty_serial", "/api/v1/ocsp/iss-local/"},
{"traversal_issuer", "/api/v1/ocsp/..%2F..%2Fetc/passwd/01"},
{"null_byte_serial", "/api/v1/ocsp/iss-local/01\x00FF"},
{"sql_injection_serial", "/api/v1/ocsp/iss-local/01'; DROP TABLE--"},
{"negative_hex_serial", "/api/v1/ocsp/iss-local/-1"},
{"unicode_serial", "/api/v1/ocsp/iss-local/01\u2010FF"},
{"extremely_long_serial", "/api/v1/ocsp/iss-local/" + strings.Repeat("F", 10000)},
{"extra_segments", "/api/v1/ocsp/iss-local/01FF/extra/segments"},
{"missing_serial", "/.well-known/pki/ocsp/iss-local"},
{"missing_both", "/.well-known/pki/ocsp/"},
{"empty_issuer", "/.well-known/pki/ocsp//01ABCDEF"},
{"empty_serial", "/.well-known/pki/ocsp/iss-local/"},
{"traversal_issuer", "/.well-known/pki/ocsp/..%2F..%2Fetc/passwd/01"},
{"null_byte_serial", "/.well-known/pki/ocsp/iss-local/01\x00FF"},
{"sql_injection_serial", "/.well-known/pki/ocsp/iss-local/01'; DROP TABLE--"},
{"negative_hex_serial", "/.well-known/pki/ocsp/iss-local/-1"},
{"unicode_serial", "/.well-known/pki/ocsp/iss-local/01\u2010FF"},
{"extremely_long_serial", "/.well-known/pki/ocsp/iss-local/" + strings.Repeat("F", 10000)},
{"extra_segments", "/.well-known/pki/ocsp/iss-local/01FF/extra/segments"},
}
for _, tc := range cases {
@@ -301,7 +305,9 @@ func TestHandleOCSP_MultiSegment(t *testing.T) {
}
}
// TestGetDERCRL_IssuerPathInjection exercises /api/v1/crl/{issuer_id}.
// TestGetDERCRL_IssuerPathInjection exercises
// /.well-known/pki/crl/{issuer_id} (RFC 5280 CRL; M-006 relocation from
// /api/v1/crl/{issuer_id}).
func TestGetDERCRL_IssuerPathInjection(t *testing.T) {
for _, tc := range adversarialPathInputs() {
t.Run(tc.name, func(t *testing.T) {
@@ -316,8 +322,8 @@ func TestGetDERCRL_IssuerPathInjection(t *testing.T) {
return nil, ErrMockNotFound
}
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl/x", nil)
req.URL.Path = "/api/v1/crl/" + tc.input
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/x", nil)
req.URL.Path = "/.well-known/pki/crl/" + tc.input
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -10,6 +10,7 @@ import (
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// MockAgentService is a mock implementation of AgentService interface.
@@ -24,6 +25,11 @@ type MockAgentService struct {
GetWorkFn func(agentID string) ([]domain.Job, error)
GetWorkWithTargetsFn func(agentID string) ([]domain.WorkItem, error)
UpdateJobStatusFn func(agentID string, jobID string, status string, errMsg string) error
// I-004: soft-retirement hooks. Tests that don't set these receive nil
// results and nil errors, which mirrors the safest default (no-op) for
// unrelated suites that mock only the legacy surface.
RetireAgentFn func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error)
ListRetiredAgentsFn func(page, perPage int) ([]domain.Agent, int64, error)
}
func (m *MockAgentService) ListAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) {
@@ -96,6 +102,25 @@ func (m *MockAgentService) UpdateJobStatus(_ context.Context, agentID string, jo
return nil
}
// RetireAgent is the I-004 soft-retirement entrypoint. Tests that don't set
// RetireAgentFn get a nil result + nil error, which is a no-op response that
// lets unrelated suites compile without caring about the retirement surface.
func (m *MockAgentService) RetireAgent(_ context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if m.RetireAgentFn != nil {
return m.RetireAgentFn(agentID, actor, force, reason)
}
return nil, nil
}
// ListRetiredAgents returns retired rows for the retired-agents tab / audit
// views. Same zero-value default as RetireAgent for unrelated tests.
func (m *MockAgentService) ListRetiredAgents(_ context.Context, page, perPage int) ([]domain.Agent, int64, error) {
if m.ListRetiredAgentsFn != nil {
return m.ListRetiredAgentsFn(page, perPage)
}
return nil, 0, nil
}
// Test ListAgents - success case
func TestListAgents_Success(t *testing.T) {
now := time.Now()
@@ -0,0 +1,393 @@
package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"net/http/httptest"
"testing"
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// agentRetireTestSetup builds an AgentHandler with a mock AgentService whose
// RetireAgent / ListRetiredAgents / Heartbeat behavior is driven by the
// returned mock. Keeps every I-004 handler test self-contained so a single
// failing assertion can't cascade through a shared fixture.
func agentRetireTestSetup() (*MockAgentService, AgentHandler) {
mock := &MockAgentService{}
handler := NewAgentHandler(mock)
return mock, handler
}
// TestRetireAgentHandler_Success_200 pins the happy-path contract for the
// soft-retirement HTTP surface: DELETE /api/v1/agents/{id} with no dependency
// fallout returns 200 OK and a JSON body echoing retirement metadata
// (retired_at timestamp, already_retired=false, cascade=false, zero counts).
// Operators building dashboards parse these fields; keep the shape stable.
func TestRetireAgentHandler_Success_200(t *testing.T) {
retiredAt := time.Date(2026, 4, 18, 12, 0, 0, 0, time.UTC)
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if agentID != "a-prod-001" {
t.Fatalf("retire handler received agentID=%q want a-prod-001", agentID)
}
if force {
t.Fatalf("retire handler set force=true unexpectedly; default path must be force=false")
}
return &service.AgentRetirementResult{
AlreadyRetired: false,
Cascade: false,
RetiredAt: retiredAt,
Counts: domain.AgentDependencyCounts{},
}, nil
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var body struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode 200 body: %v", err)
}
if !body.RetiredAt.Equal(retiredAt) {
t.Errorf("retired_at=%v want %v", body.RetiredAt, retiredAt)
}
if body.AlreadyRetired {
t.Errorf("already_retired=true want false on clean retire")
}
if body.Cascade {
t.Errorf("cascade=true want false on clean retire")
}
}
// TestRetireAgentHandler_AlreadyRetired_204 covers the idempotent contract: a
// retire call against an already-retired agent completes with 204 No Content
// (no body). This lets operators safely re-issue the DELETE after a network
// blip without fearing duplicate audit events or state mutations.
func TestRetireAgentHandler_AlreadyRetired_204(t *testing.T) {
mock, handler := agentRetireTestSetup()
past := time.Now().Add(-24 * time.Hour)
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return &service.AgentRetirementResult{
AlreadyRetired: true,
Cascade: false,
RetiredAt: past,
Counts: domain.AgentDependencyCounts{},
}, nil
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusNoContent {
t.Fatalf("status=%d body=%s want 204", w.Code, w.Body.String())
}
// 204 No Content must have zero body. If anything leaks through, downstream
// clients (curl scripts, dashboards) break.
if w.Body.Len() != 0 {
t.Errorf("204 body=%q want empty", w.Body.String())
}
}
// TestRetireAgentHandler_Sentinel_403 covers the hard guard against retiring
// any of the four sentinel agents that back discovery sources and the
// network scanner. These IDs are reserved; the handler must surface the
// service-layer ErrAgentIsSentinel as 403 Forbidden regardless of force/reason
// because no operator intent can legitimately retire them.
func TestRetireAgentHandler_Sentinel_403(t *testing.T) {
sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}
for _, id := range sentinels {
t.Run(id, func(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, service.ErrAgentIsSentinel
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/"+id, nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusForbidden {
t.Fatalf("sentinel %q status=%d body=%s want 403", id, w.Code, w.Body.String())
}
})
}
}
// TestRetireAgentHandler_NotFound_404 covers the lookup-miss path. Service
// returns a not-found error; handler maps to 404. Keeping the error
// discrimination at the service layer (sentinel errors.Is) rather than string
// matching is the whole point of wrapping.
func TestRetireAgentHandler_NotFound_404(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, errors.New("agent not found")
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/unknown-id", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusNotFound {
t.Fatalf("status=%d body=%s want 404", w.Code, w.Body.String())
}
}
// TestRetireAgentHandler_Blocked_409_WithCounts covers the preflight-blocked
// path. Service returns *BlockedByDependenciesError wrapping
// ErrBlockedByDependencies; handler unwraps via errors.As, maps to 409, and
// MUST include the counts in the response body so operators know what's
// blocking them. Without counts the 409 is useless — the operator has to
// guess which downstream dependency is holding up the retirement.
func TestRetireAgentHandler_Blocked_409_WithCounts(t *testing.T) {
mock, handler := agentRetireTestSetup()
blockCounts := domain.AgentDependencyCounts{
ActiveTargets: 3,
ActiveCertificates: 7,
PendingJobs: 2,
}
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
return nil, &service.BlockedByDependenciesError{Counts: blockCounts}
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusConflict {
t.Fatalf("status=%d body=%s want 409", w.Code, w.Body.String())
}
var body struct {
Error string `json:"error"`
Message string `json:"message"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode 409 body: %v", err)
}
if body.Counts.ActiveTargets != 3 {
t.Errorf("counts.active_targets=%d want 3", body.Counts.ActiveTargets)
}
if body.Counts.ActiveCertificates != 7 {
t.Errorf("counts.active_certificates=%d want 7", body.Counts.ActiveCertificates)
}
if body.Counts.PendingJobs != 2 {
t.Errorf("counts.pending_jobs=%d want 2", body.Counts.PendingJobs)
}
if body.Message == "" {
t.Errorf("409 body missing human-readable message; operators need guidance")
}
}
// TestRetireAgentHandler_Force_NoReason_400 covers the force-escape-hatch
// guardrail: force=true without a non-empty reason must be rejected at the
// handler seam BEFORE the service performs any DB work, because a
// reason-less cascade is unauditable. Service returns ErrForceReasonRequired;
// handler maps to 400.
func TestRetireAgentHandler_Force_NoReason_400(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if !force {
t.Fatalf("handler did not forward force=true; force query param was dropped")
}
if reason != "" {
t.Fatalf("handler passed reason=%q; empty reason must reach service for error path", reason)
}
return nil, service.ErrForceReasonRequired
}
req := httptest.NewRequest(http.MethodDelete, "/api/v1/agents/a-prod-001?force=true", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("status=%d body=%s want 400", w.Code, w.Body.String())
}
}
// TestRetireAgentHandler_ForceCascade_200 covers the successful force-cascade
// path: DELETE ?force=true&reason=... → service executes transactional
// cascade → 200 with cascade=true and the pre-cascade counts echoed back so
// the operator's confirmation dialog can show "I just retired N targets,
// M certificates, K pending jobs."
func TestRetireAgentHandler_ForceCascade_200(t *testing.T) {
mock, handler := agentRetireTestSetup()
retiredAt := time.Date(2026, 4, 18, 14, 30, 0, 0, time.UTC)
mock.RetireAgentFn = func(agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error) {
if !force {
t.Fatalf("handler did not forward force=true; query-param parsing broken")
}
if reason != "decommissioning rack 7" {
t.Fatalf("handler forwarded reason=%q want %q", reason, "decommissioning rack 7")
}
return &service.AgentRetirementResult{
AlreadyRetired: false,
Cascade: true,
RetiredAt: retiredAt,
Counts: domain.AgentDependencyCounts{
ActiveTargets: 2,
ActiveCertificates: 5,
PendingJobs: 1,
},
}, nil
}
url := "/api/v1/agents/a-prod-001?force=true&reason=decommissioning+rack+7"
req := httptest.NewRequest(http.MethodDelete, url, nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var body struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}
if err := json.NewDecoder(w.Body).Decode(&body); err != nil {
t.Fatalf("decode force-cascade 200 body: %v", err)
}
if !body.Cascade {
t.Errorf("cascade=false want true on ?force=true successful retire")
}
if body.Counts.ActiveTargets != 2 || body.Counts.ActiveCertificates != 5 || body.Counts.PendingJobs != 1 {
t.Errorf("counts=%+v want {ActiveTargets:2 ActiveCertificates:5 PendingJobs:1}", body.Counts)
}
}
// TestHeartbeatHandler_RetiredAgent_410 covers the agent-shutdown signal. A
// retired agent that is still polling must be told its identity is gone
// (410 Gone) rather than offered the normal 200 "recorded" response.
// cmd/agent treats 410 as a terminal signal and exits rather than looping
// forever against a decommissioned identity. Service returns ErrAgentRetired;
// handler maps to 410.
func TestHeartbeatHandler_RetiredAgent_410(t *testing.T) {
mock, handler := agentRetireTestSetup()
mock.HeartbeatFn = func(agentID string, metadata *domain.AgentMetadata) error {
return service.ErrAgentRetired
}
req := httptest.NewRequest(http.MethodPost, "/api/v1/agents/a-prod-001/heartbeat", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.Heartbeat(w, req)
if w.Code != http.StatusGone {
t.Fatalf("heartbeat(retired) status=%d body=%s want 410", w.Code, w.Body.String())
}
}
// TestListRetiredAgentsHandler_Success covers the audit/forensics-facing
// endpoint GET /api/v1/agents/retired. Returns a paged list of retired rows
// alongside total count so the GUI can render a "Retired Agents" tab with
// pagination. Default listing (GET /agents) hides retired rows; this is the
// opt-in surface for them.
func TestListRetiredAgentsHandler_Success(t *testing.T) {
past := time.Now().Add(-48 * time.Hour)
reason := "old hardware"
retired := []domain.Agent{
{
ID: "agent-retired-01",
Name: "decom-01",
Hostname: "server-old",
Status: domain.AgentStatusOffline,
RegisteredAt: past,
RetiredAt: &past,
RetiredReason: &reason,
},
}
mock, handler := agentRetireTestSetup()
mock.ListRetiredAgentsFn = func(page, perPage int) ([]domain.Agent, int64, error) {
if page != 1 || perPage != 50 {
t.Fatalf("ListRetired handler received page=%d perPage=%d want 1/50 defaults", page, perPage)
}
return retired, 1, nil
}
req := httptest.NewRequest(http.MethodGet, "/api/v1/agents/retired", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.ListRetiredAgents(w, req)
if w.Code != http.StatusOK {
t.Fatalf("status=%d body=%s want 200", w.Code, w.Body.String())
}
var response PagedResponse
if err := json.NewDecoder(w.Body).Decode(&response); err != nil {
t.Fatalf("decode list-retired body: %v", err)
}
if response.Total != 1 {
t.Errorf("total=%d want 1", response.Total)
}
}
// TestRetireAgentHandler_MethodNotAllowed covers defense-in-depth: only
// DELETE is valid on /api/v1/agents/{id} for retirement. Using POST/PUT/PATCH
// must be rejected with 405 so misconfigured callers don't accidentally
// trigger retirement via a wrong-method request.
func TestRetireAgentHandler_MethodNotAllowed(t *testing.T) {
_, handler := agentRetireTestSetup()
for _, method := range []string{http.MethodPost, http.MethodPut, http.MethodPatch} {
t.Run(method, func(t *testing.T) {
req := httptest.NewRequest(method, "/api/v1/agents/a-prod-001", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.RetireAgent(w, req)
if w.Code != http.StatusMethodNotAllowed {
t.Fatalf("method=%s status=%d want 405", method, w.Code)
}
})
}
}
// Compile-time asserts: the mock must satisfy the handler's AgentService
// interface. Red state: this fails until the interface grows RetireAgent +
// ListRetiredAgents. Once Phase 2b adds those methods to AgentService, this
// assertion goes green along with every test above.
var _ AgentService = (*MockAgentService)(nil)
// Unused-import suppressor for context — the package-level tests already
// pull context from agent_handler_test.go, but leaving this here documents
// that the mock methods receive context.Context values even though this
// file's tests don't construct them directly (they ride on httptest.NewRequest).
var _ = context.Background
+199
View File
@@ -3,16 +3,24 @@ package handler
import (
"context"
"encoding/json"
"errors"
"log/slog"
"net/http"
"strconv"
"strings"
"time"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// AgentService defines the service interface for agent operations.
//
// I-004 expansion: RetireAgent + ListRetiredAgents back the soft-retirement
// surface. The handler depends on the service-package's AgentRetirementResult
// and BlockedByDependenciesError types for result shape + errors.As unwrap,
// which is why this file imports internal/service.
type AgentService interface {
ListAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error)
GetAgent(ctx context.Context, id string) (*domain.Agent, error)
@@ -24,6 +32,10 @@ type AgentService interface {
GetWork(ctx context.Context, agentID string) ([]domain.Job, error)
GetWorkWithTargets(ctx context.Context, agentID string) ([]domain.WorkItem, error)
UpdateJobStatus(ctx context.Context, agentID string, jobID string, status string, errMsg string) error
// I-004 soft-retirement API. Both default to no-op (nil result / nil error)
// in mocks that don't override them — handler tests opt in per suite.
RetireAgent(ctx context.Context, agentID, actor string, force bool, reason string) (*service.AgentRetirementResult, error)
ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error)
}
// AgentHandler handles HTTP requests for agent operations.
@@ -190,6 +202,15 @@ func (h AgentHandler) Heartbeat(w http.ResponseWriter, r *http.Request) {
}
if err := h.svc.Heartbeat(r.Context(), agentID, metadata); err != nil {
// I-004: a retired agent still polling must receive 410 Gone so
// cmd/agent detects the terminal signal and shuts down cleanly
// instead of looping forever against a decommissioned identity.
// Check this FIRST — before "not found" string matching — so the
// retired-path is never masked by a sibling error branch.
if errors.Is(err, service.ErrAgentRetired) {
ErrorWithRequestID(w, http.StatusGone, "Agent has been retired", requestID)
return
}
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
return
@@ -376,3 +397,181 @@ func (h AgentHandler) AgentReportJobStatus(w http.ResponseWriter, r *http.Reques
"status": "updated",
})
}
// RetireAgent executes the I-004 soft-retirement surface.
// DELETE /api/v1/agents/{id}[?force=true&reason=...]
//
// Contract (pinned by agent_retire_handler_test.go):
//
// 405 any method other than DELETE
// 200 clean retire (body: retired_at, already_retired=false, cascade=false, counts=0s)
// 200 force-cascade retire (body: cascade=true, counts=pre-cascade snapshot)
// 204 idempotent retire of an already-retired agent (NO body — downstream
// clients that tee responses into dashboards break on spurious bodies)
// 400 force=true without a non-empty reason (ErrForceReasonRequired)
// 403 one of the four reserved sentinel IDs (ErrAgentIsSentinel)
// 404 agent does not exist ("not found" string match, kept for compat with
// repo error strings; sentinel checks run first so they never mask)
// 409 blocked by preflight counts (*BlockedByDependenciesError) — body
// carries the per-bucket counts so the operator UI can tell the
// human which downstream dependency is holding up the retirement,
// rather than forcing them to re-run the DELETE with ?force=true
// and guess
// 500 anything else
//
// The 409 body intentionally does NOT go through ErrorWithRequestID because
// that helper's ErrorResponse shape has no `counts` field — we inline-marshal
// a custom body instead. Keeping this shape stable is important: the GUI
// pattern is "show the 409 dialog, list the N targets / M certs / K jobs
// blocking, let the operator retire them first or tick the force checkbox."
func (h AgentHandler) RetireAgent(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodDelete {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
// Extract {id} from /api/v1/agents/{id}. Mirror GetAgent's pattern so
// the path parser is identical across the agent handler surface and a
// future refactor can extract it once without introducing drift.
rawID := strings.TrimPrefix(r.URL.Path, "/api/v1/agents/")
parts := strings.Split(rawID, "/")
if len(parts) == 0 || parts[0] == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Agent ID is required", requestID)
return
}
id := parts[0]
// Parse optional force + reason. A missing `force` param is treated as
// force=false (the default, safe path); anything strconv.ParseBool rejects
// is also force=false so a malformed query can never silently enable the
// cascade. The reason string is passed through verbatim — the service
// owns the "force=true requires reason" rule.
query := r.URL.Query()
force := false
if fv := query.Get("force"); fv != "" {
if parsed, err := strconv.ParseBool(fv); err == nil {
force = parsed
}
}
reason := query.Get("reason")
actor := resolveActor(r.Context())
result, err := h.svc.RetireAgent(r.Context(), id, actor, force, reason)
if err != nil {
// Sentinel + typed-error checks run BEFORE string matching on "not
// found" so a repo error that happens to contain those words can
// never mask a structural refusal (403/400/409). Order matters.
if errors.Is(err, service.ErrAgentIsSentinel) {
ErrorWithRequestID(w, http.StatusForbidden, "Agent is a reserved sentinel and cannot be retired", requestID)
return
}
if errors.Is(err, service.ErrForceReasonRequired) {
ErrorWithRequestID(w, http.StatusBadRequest, "force=true requires a non-empty reason", requestID)
return
}
var blocked *service.BlockedByDependenciesError
if errors.As(err, &blocked) {
// Custom 409 body with per-bucket counts. ErrorResponse has no
// `counts` field, so we marshal a bespoke struct instead.
// Keep `error`/`message`/`counts` as the stable shape — any
// dashboard parsing this relies on those three keys.
body := struct {
Error string `json:"error"`
Message string `json:"message"`
Counts domain.AgentDependencyCounts `json:"counts"`
}{
Error: "blocked_by_dependencies",
Message: "Agent has active downstream dependencies. Retire or reassign them " +
"first, or re-run with ?force=true&reason=... to cascade.",
Counts: blocked.Counts,
}
JSON(w, http.StatusConflict, body)
return
}
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Agent not found", requestID)
return
}
slog.Error("RetireAgent failed", "agent_id", id, "error", err.Error())
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to retire agent", requestID)
return
}
// Idempotent retire: the agent was already retired, so we return 204 No
// Content with a ZERO-length body. The Red contract (test line 106) fails
// if even a trailing newline leaks into the response. WriteHeader alone
// emits the status without invoking the JSON encoder.
if result.AlreadyRetired {
w.WriteHeader(http.StatusNoContent)
return
}
// Clean retire (force=false) or successful cascade (force=true). Body
// shape pinned by Red contract: retired_at, already_retired, cascade,
// counts. Omitempty is deliberately NOT used — operators parsing the
// response expect every field to always be present.
JSON(w, http.StatusOK, struct {
RetiredAt time.Time `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts domain.AgentDependencyCounts `json:"counts"`
}{
RetiredAt: result.RetiredAt,
AlreadyRetired: result.AlreadyRetired,
Cascade: result.Cascade,
Counts: result.Counts,
})
}
// ListRetiredAgents returns the opt-in listing of retired agents for the
// operator UI's "Retired" tab and for audit/forensics workflows.
// GET /api/v1/agents/retired?page=1&per_page=50
//
// The default ListAgents handler hides retired rows; this is the dedicated
// surface for reading them back. Pagination defaults match ListAgents so
// the GUI can reuse the same query hook (page=1, per_page=50, cap 500).
//
// Go 1.22's enhanced ServeMux routes `/agents/retired` to this handler via
// the literal-beats-pattern-var precedence rule (literal `retired` wins over
// `{id}` in the sibling GET /api/v1/agents/{id} route), so both entries can
// coexist without conflict. If that precedence ever regresses, the failure
// mode is TestListRetiredAgentsHandler_Success blowing up with a 404 — which
// is the fast signal we want.
func (h AgentHandler) ListRetiredAgents(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
page := 1
perPage := 50
query := r.URL.Query()
if p := query.Get("page"); p != "" {
if parsed, err := strconv.Atoi(p); err == nil && parsed > 0 {
page = parsed
}
}
if pp := query.Get("per_page"); pp != "" {
if parsed, err := strconv.Atoi(pp); err == nil && parsed > 0 && parsed <= 500 {
perPage = parsed
}
}
agents, total, err := h.svc.ListRetiredAgents(r.Context(), page, perPage)
if err != nil {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to list retired agents", requestID)
return
}
JSON(w, http.StatusOK, PagedResponse{
Data: agents,
Total: total,
Page: page,
PerPage: perPage,
})
}
+17 -5
View File
@@ -37,6 +37,11 @@ type bulkRevokeRequest struct {
// BulkRevoke handles bulk certificate revocation.
// POST /api/v1/certificates/bulk-revoke
//
// M-003: admin-only. Bulk revocation is a fleet-scale destructive operation —
// a non-admin caller must not be able to invalidate certificates across
// profiles/owners/agents. The gate is enforced here (before body parsing) so a
// non-admin never sees its request criteria evaluated.
func (h BulkRevocationHandler) BulkRevoke(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodPost {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
@@ -45,6 +50,16 @@ func (h BulkRevocationHandler) BulkRevoke(w http.ResponseWriter, r *http.Request
requestID := middleware.GetRequestID(r.Context())
// M-003: admin-only gate. Non-admin callers are rejected before any
// criteria/body processing to avoid leaking validation behavior to
// unauthorized actors.
if !middleware.IsAdmin(r.Context()) {
ErrorWithRequestID(w, http.StatusForbidden,
"Bulk revocation requires admin privileges",
requestID)
return
}
var req bulkRevokeRequest
if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, "Invalid request body", requestID)
@@ -78,11 +93,8 @@ func (h BulkRevocationHandler) BulkRevoke(w http.ResponseWriter, r *http.Request
return
}
// Extract actor from auth context
actor := "api"
if user, ok := middleware.GetUser(r.Context()); ok && user != "" {
actor = user
}
// Extract actor from auth context (M-002: named-key identity → audit trail)
actor := resolveActor(r.Context())
result, err := h.svc.BulkRevoke(r.Context(), criteria, req.Reason, actor)
if err != nil {
@@ -7,8 +7,10 @@ import (
"fmt"
"net/http"
"net/http/httptest"
"strings"
"testing"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
)
@@ -24,6 +26,15 @@ func (m *mockBulkRevocationService) BulkRevoke(ctx context.Context, criteria dom
return &domain.BulkRevocationResult{}, nil
}
// adminContext returns a context carrying the admin flag, mimicking what the
// auth middleware sets for named-key callers whose entry is admin-tagged.
// M-003: bulk revocation handler requires admin context to reach the service.
func adminContext() context.Context {
ctx := context.WithValue(context.Background(), middleware.RequestIDKey{}, "test-request-id-bulk")
ctx = context.WithValue(ctx, middleware.AdminKey{}, true)
return ctx
}
func TestBulkRevoke_Success_WithIDs(t *testing.T) {
svc := &mockBulkRevocationService{
BulkRevokeFn: func(ctx context.Context, criteria domain.BulkRevocationCriteria, reason string, actor string) (*domain.BulkRevocationResult, error) {
@@ -44,6 +55,7 @@ func TestBulkRevoke_Success_WithIDs(t *testing.T) {
body := `{"reason":"keyCompromise","certificate_ids":["mc-1","mc-2"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -82,6 +94,7 @@ func TestBulkRevoke_Success_WithProfile(t *testing.T) {
body := `{"reason":"keyCompromise","profile_id":"prof-tls"}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -97,6 +110,7 @@ func TestBulkRevoke_MissingReason_400(t *testing.T) {
body := `{"certificate_ids":["mc-1"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -112,6 +126,7 @@ func TestBulkRevoke_EmptyCriteria_400(t *testing.T) {
body := `{"reason":"keyCompromise"}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -127,6 +142,7 @@ func TestBulkRevoke_InvalidReason_400(t *testing.T) {
body := `{"reason":"totallyBogus","certificate_ids":["mc-1"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -139,6 +155,8 @@ func TestBulkRevoke_InvalidReason_400(t *testing.T) {
func TestBulkRevoke_MethodNotAllowed_405(t *testing.T) {
h := NewBulkRevocationHandler(&mockBulkRevocationService{})
// Method check fires before the admin gate, so 405 must hold even for a
// non-admin caller — asserting this keeps the ordering explicit.
req := httptest.NewRequest(http.MethodGet, "/api/v1/certificates/bulk-revoke", nil)
w := httptest.NewRecorder()
@@ -160,6 +178,7 @@ func TestBulkRevoke_ServiceError_500(t *testing.T) {
body := `{"reason":"keyCompromise","certificate_ids":["mc-1"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(adminContext())
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
@@ -168,3 +187,103 @@ func TestBulkRevoke_ServiceError_500(t *testing.T) {
t.Errorf("expected 500, got %d", w.Code)
}
}
// --- M-003: admin-only gate on bulk revocation ---
// TestBulkRevoke_NonAdmin_Returns403 is the central authorization regression
// for M-003. A caller without an admin-tagged context must be rejected with
// HTTP 403, regardless of how well-formed its body is, and the service layer
// must never see the request.
func TestBulkRevoke_NonAdmin_Returns403(t *testing.T) {
var serviceCalled bool
svc := &mockBulkRevocationService{
BulkRevokeFn: func(ctx context.Context, criteria domain.BulkRevocationCriteria, reason string, actor string) (*domain.BulkRevocationResult, error) {
serviceCalled = true
return &domain.BulkRevocationResult{}, nil
},
}
h := NewBulkRevocationHandler(svc)
// Well-formed body + well-formed reason + filter — the only thing
// missing is an admin-tagged context. The gate must still fire.
body := `{"reason":"keyCompromise","certificate_ids":["mc-1","mc-2"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(contextWithRequestID()) // request id only, no admin flag
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
if w.Code != http.StatusForbidden {
t.Fatalf("expected status 403, got %d (body=%q)", w.Code, w.Body.String())
}
var resp map[string]any
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
msg, _ := resp["message"].(string)
if !strings.Contains(strings.ToLower(msg), "admin") {
t.Errorf("expected message to mention admin requirement, got %q", msg)
}
if serviceCalled {
t.Errorf("service was invoked despite non-admin caller — gate failed open")
}
}
// TestBulkRevoke_AdminExplicitFalse_Returns403 pins the specific case where the
// AdminKey exists but is set to false — e.g., a non-admin named-key caller.
// Without this we could regress to "key missing == deny, key present == allow"
// which would silently grant a false flag.
func TestBulkRevoke_AdminExplicitFalse_Returns403(t *testing.T) {
h := NewBulkRevocationHandler(&mockBulkRevocationService{})
body := `{"reason":"keyCompromise","certificate_ids":["mc-1"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
ctx := context.WithValue(context.Background(), middleware.RequestIDKey{}, "test-request-id")
ctx = context.WithValue(ctx, middleware.AdminKey{}, false)
req = req.WithContext(ctx)
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
if w.Code != http.StatusForbidden {
t.Fatalf("expected status 403 for admin=false, got %d", w.Code)
}
}
// TestBulkRevoke_AdminPermitted_ForwardsActor confirms the happy path:
// an admin-tagged context reaches the service and the actor (from the auth
// UserKey) is propagated through to BulkRevoke. This keeps the admin gate and
// the M-002 actor-propagation wired together in a single regression.
func TestBulkRevoke_AdminPermitted_ForwardsActor(t *testing.T) {
var capturedActor string
svc := &mockBulkRevocationService{
BulkRevokeFn: func(ctx context.Context, criteria domain.BulkRevocationCriteria, reason string, actor string) (*domain.BulkRevocationResult, error) {
capturedActor = actor
return &domain.BulkRevocationResult{TotalMatched: 1, TotalRevoked: 1}, nil
},
}
h := NewBulkRevocationHandler(svc)
body := `{"reason":"keyCompromise","certificate_ids":["mc-1"]}`
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates/bulk-revoke", bytes.NewBufferString(body))
req.Header.Set("Content-Type", "application/json")
ctx := context.WithValue(context.Background(), middleware.RequestIDKey{}, "test-request-id")
ctx = context.WithValue(ctx, middleware.AdminKey{}, true)
ctx = context.WithValue(ctx, middleware.UserKey{}, "ops-admin")
req = req.WithContext(ctx)
w := httptest.NewRecorder()
h.BulkRevoke(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200 for admin caller, got %d (body=%q)", w.Code, w.Body.String())
}
if capturedActor != "ops-admin" {
t.Errorf("expected actor ops-admin, got %q", capturedActor)
}
}
+82 -130
View File
@@ -432,6 +432,66 @@ func TestCreateCertificate_ServiceError(t *testing.T) {
}
}
// TestCreateCertificate_MissingRequiredField_Returns400 pins the C-001 handler
// contract: handler MUST reject a create payload that omits any of the five
// required fields (name, common_name, owner_id, team_id, issuer_id,
// renewal_policy_id) with HTTP 400 before the service is invoked. The mock
// service here would succeed if called; every subtest proving 400 therefore
// proves the handler guard fires.
func TestCreateCertificate_MissingRequiredField_Returns400(t *testing.T) {
baseBody := map[string]interface{}{
"name": "API Prod",
"common_name": "api.example.com",
"owner_id": "o-alice",
"team_id": "t-platform",
"issuer_id": "iss-local",
"renewal_policy_id": "rp-standard",
}
cases := []struct {
name string
missingField string
}{
{"missing name", "name"},
{"missing common_name", "common_name"},
{"missing owner_id", "owner_id"},
{"missing team_id", "team_id"},
{"missing issuer_id", "issuer_id"},
{"missing renewal_policy_id", "renewal_policy_id"},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
body := make(map[string]interface{}, len(baseBody))
for k, v := range baseBody {
body[k] = v
}
delete(body, tc.missingField)
bodyBytes, _ := json.Marshal(body)
mock := &MockCertificateService{
CreateCertificateFn: func(_ context.Context, cert domain.ManagedCertificate) (*domain.ManagedCertificate, error) {
// Would succeed if handler guard did not fire.
cert.ID = "mc-would-be-created"
return &cert, nil
},
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/certificates", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
req.Header.Set("Content-Type", "application/json")
w := httptest.NewRecorder()
handler.CreateCertificate(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("%s: expected 400, got %d — body=%s", tc.name, w.Code, w.Body.String())
}
})
}
}
// Test UpdateCertificate - success case
func TestUpdateCertificate_Success(t *testing.T) {
updated := &domain.ManagedCertificate{
@@ -958,127 +1018,13 @@ func TestRevokeCertificate_Handler_ServerError(t *testing.T) {
}
}
// === CRL Handler Tests ===
func TestGetCRL_Success(t *testing.T) {
mock := &MockCertificateService{
GetRevokedCertificatesFn: func(_ context.Context) ([]*domain.CertificateRevocation, error) {
return []*domain.CertificateRevocation{
{
ID: "rev-1",
CertificateID: "cert-1",
SerialNumber: "ABC123",
Reason: "keyCompromise",
RevokedAt: time.Date(2026, 3, 20, 10, 0, 0, 0, time.UTC),
},
{
ID: "rev-2",
CertificateID: "cert-2",
SerialNumber: "DEF456",
Reason: "superseded",
RevokedAt: time.Date(2026, 3, 21, 14, 30, 0, 0, time.UTC),
},
}, nil
},
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.GetCRL(w, req)
if w.Code != http.StatusOK {
t.Errorf("expected status %d, got %d", http.StatusOK, w.Code)
}
var resp map[string]interface{}
json.NewDecoder(w.Body).Decode(&resp)
if resp["version"] != float64(1) {
t.Errorf("expected version 1, got %v", resp["version"])
}
if resp["total"] != float64(2) {
t.Errorf("expected total 2, got %v", resp["total"])
}
entries, ok := resp["entries"].([]interface{})
if !ok {
t.Fatal("expected entries to be an array")
}
if len(entries) != 2 {
t.Errorf("expected 2 entries, got %d", len(entries))
}
entry1 := entries[0].(map[string]interface{})
if entry1["serial_number"] != "ABC123" {
t.Errorf("expected serial ABC123, got %v", entry1["serial_number"])
}
if entry1["revocation_reason"] != "keyCompromise" {
t.Errorf("expected reason keyCompromise, got %v", entry1["revocation_reason"])
}
}
func TestGetCRL_Empty(t *testing.T) {
mock := &MockCertificateService{
GetRevokedCertificatesFn: func(_ context.Context) ([]*domain.CertificateRevocation, error) {
return nil, nil
},
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.GetCRL(w, req)
if w.Code != http.StatusOK {
t.Errorf("expected status %d, got %d", http.StatusOK, w.Code)
}
var resp map[string]interface{}
json.NewDecoder(w.Body).Decode(&resp)
if resp["total"] != float64(0) {
t.Errorf("expected total 0, got %v", resp["total"])
}
}
func TestGetCRL_ServiceError(t *testing.T) {
mock := &MockCertificateService{
GetRevokedCertificatesFn: func(_ context.Context) ([]*domain.CertificateRevocation, error) {
return nil, fmt.Errorf("revocation repository not configured")
},
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.GetCRL(w, req)
if w.Code != http.StatusInternalServerError {
t.Errorf("expected status %d, got %d", http.StatusInternalServerError, w.Code)
}
}
func TestGetCRL_MethodNotAllowed(t *testing.T) {
mock := &MockCertificateService{}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/crl", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.GetCRL(w, req)
if w.Code != http.StatusMethodNotAllowed {
t.Errorf("expected status %d, got %d", http.StatusMethodNotAllowed, w.Code)
}
}
// M15b: DER CRL and OCSP Handler Tests
// === CRL and OCSP Handler Tests (RFC 5280 / RFC 6960, served under /.well-known/pki/) ===
//
// M-006 relocated these endpoints from /api/v1/crl* and /api/v1/ocsp/* to the
// RFC-compliant /.well-known/pki/ namespace and deleted the non-standard JSON
// CRL endpoint. The DER-encoded X.509 CRL (application/pkix-crl) and the
// DER-encoded OCSP response (application/ocsp-response) are the only wire
// formats certctl supports for revocation data.
func TestGetDERCRL_Success(t *testing.T) {
derCRLData := []byte{0x30, 0x82, 0x01, 0x00} // Mock DER CRL bytes
@@ -1092,7 +1038,7 @@ func TestGetDERCRL_Success(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl/iss-local", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/iss-local", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1107,6 +1053,9 @@ func TestGetDERCRL_Success(t *testing.T) {
if len(responseBody) == 0 {
t.Error("expected non-empty response body")
}
if ct := w.Header().Get("Content-Type"); ct != "application/pkix-crl" {
t.Errorf("expected Content-Type application/pkix-crl, got %q", ct)
}
}
func TestGetDERCRL_IssuerNotFound(t *testing.T) {
@@ -1117,7 +1066,7 @@ func TestGetDERCRL_IssuerNotFound(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl/nonexistent", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/nonexistent", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1136,7 +1085,7 @@ func TestGetDERCRL_NotSupported(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/crl/iss-acme", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/crl/iss-acme", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1151,7 +1100,7 @@ func TestGetDERCRL_NotSupported(t *testing.T) {
func TestGetDERCRL_MethodNotAllowed(t *testing.T) {
mock := &MockCertificateService{}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/crl/iss-local", nil)
req := httptest.NewRequest(http.MethodPost, "/.well-known/pki/crl/iss-local", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1174,7 +1123,7 @@ func TestHandleOCSP_Success(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/ocsp/iss-local/12345", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/ocsp/iss-local/12345", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1188,12 +1137,15 @@ func TestHandleOCSP_Success(t *testing.T) {
if len(responseBody) == 0 {
t.Error("expected non-empty OCSP response body")
}
if ct := w.Header().Get("Content-Type"); ct != "application/ocsp-response" {
t.Errorf("expected Content-Type application/ocsp-response, got %q", ct)
}
}
func TestHandleOCSP_MissingSerial(t *testing.T) {
mock := &MockCertificateService{}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/ocsp/iss-local/", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/ocsp/iss-local/", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1212,7 +1164,7 @@ func TestHandleOCSP_IssuerNotFound(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/ocsp/nonexistent/ABC123", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/ocsp/nonexistent/ABC123", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1231,7 +1183,7 @@ func TestHandleOCSP_CertNotFound(t *testing.T) {
}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodGet, "/api/v1/ocsp/iss-local/UNKNOWN", nil)
req := httptest.NewRequest(http.MethodGet, "/.well-known/pki/ocsp/iss-local/UNKNOWN", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
@@ -1245,7 +1197,7 @@ func TestHandleOCSP_CertNotFound(t *testing.T) {
func TestHandleOCSP_MethodNotAllowed(t *testing.T) {
mock := &MockCertificateService{}
handler := NewCertificateHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/ocsp/iss-local/12345", nil)
req := httptest.NewRequest(http.MethodPost, "/.well-known/pki/ocsp/iss-local/12345", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
+22 -50
View File
@@ -411,7 +411,9 @@ func (h CertificateHandler) TriggerRenewal(w http.ResponseWriter, r *http.Reques
}
certID := parts[0]
if err := h.svc.TriggerRenewal(r.Context(), certID, "api"); err != nil {
actor := resolveActor(r.Context())
if err := h.svc.TriggerRenewal(r.Context(), certID, actor); err != nil {
errMsg := err.Error()
if strings.Contains(errMsg, "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Certificate not found", requestID)
@@ -467,7 +469,9 @@ func (h CertificateHandler) TriggerDeployment(w http.ResponseWriter, r *http.Req
}
}
if err := h.svc.TriggerDeployment(r.Context(), certID, req.TargetID, "api"); err != nil {
actor := resolveActor(r.Context())
if err := h.svc.TriggerDeployment(r.Context(), certID, req.TargetID, actor); err != nil {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to trigger deployment", requestID)
return
}
@@ -509,7 +513,9 @@ func (h CertificateHandler) RevokeCertificate(w http.ResponseWriter, r *http.Req
}
}
if err := h.svc.RevokeCertificate(r.Context(), certID, req.Reason, "api"); err != nil {
actor := resolveActor(r.Context())
if err := h.svc.RevokeCertificate(r.Context(), certID, req.Reason, actor); err != nil {
// Distinguish between client errors and server errors
errMsg := err.Error()
if strings.Contains(errMsg, "already revoked") ||
@@ -529,49 +535,12 @@ func (h CertificateHandler) RevokeCertificate(w http.ResponseWriter, r *http.Req
JSON(w, http.StatusOK, map[string]string{"status": "revoked"})
}
// GetCRL returns the Certificate Revocation List as structured JSON.
// GET /api/v1/crl
// Note: DER-encoded X.509 CRL generation (requiring CA key access) is planned for M15b
// alongside the embedded OCSP responder. This endpoint provides the same data in JSON format.
func (h CertificateHandler) GetCRL(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet {
Error(w, http.StatusMethodNotAllowed, "Method not allowed")
return
}
requestID := middleware.GetRequestID(r.Context())
revocations, err := h.svc.GetRevokedCertificates(r.Context())
if err != nil {
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to generate CRL", requestID)
return
}
type CRLEntry struct {
SerialNumber string `json:"serial_number"`
RevocationDate string `json:"revocation_date"`
RevocationReason string `json:"revocation_reason"`
}
entries := make([]CRLEntry, 0, len(revocations))
for _, rev := range revocations {
entries = append(entries, CRLEntry{
SerialNumber: rev.SerialNumber,
RevocationDate: rev.RevokedAt.Format("2006-01-02T15:04:05Z"),
RevocationReason: rev.Reason,
})
}
JSON(w, http.StatusOK, map[string]interface{}{
"version": 1,
"entries": entries,
"total": len(entries),
"generated_at": time.Now().UTC().Format("2006-01-02T15:04:05Z"),
})
}
// GetDERCRL returns a DER-encoded X.509 CRL signed by the specified issuer.
// GET /api/v1/crl/{issuer_id}
// GET /.well-known/pki/crl/{issuer_id}
//
// RFC 5280 § 5. Served unauthenticated under the /.well-known/pki/ namespace so
// relying parties (browsers, OpenSSL, OCSP stapling sidecars) can fetch the CRL
// without presenting certctl API credentials.
func (h CertificateHandler) GetDERCRL(w http.ResponseWriter, r *http.Request) {
requestID, _ := r.Context().Value("request_id").(string)
@@ -580,7 +549,7 @@ func (h CertificateHandler) GetDERCRL(w http.ResponseWriter, r *http.Request) {
return
}
issuerID := strings.TrimPrefix(r.URL.Path, "/api/v1/crl/")
issuerID := strings.TrimPrefix(r.URL.Path, "/.well-known/pki/crl/")
if issuerID == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Issuer ID is required", requestID)
return
@@ -608,8 +577,11 @@ func (h CertificateHandler) GetDERCRL(w http.ResponseWriter, r *http.Request) {
}
// HandleOCSP processes OCSP requests.
// GET /api/v1/ocsp/{issuer_id}/{serial_hex}
// For simplicity, use GET with path params instead of binary POST.
// GET /.well-known/pki/ocsp/{issuer_id}/{serial_hex}
//
// RFC 6960. Served unauthenticated under the /.well-known/pki/ namespace. For
// simplicity we accept GET with path params rather than the binary POST body
// form — the response is a valid DER-encoded OCSP response either way.
func (h CertificateHandler) HandleOCSP(w http.ResponseWriter, r *http.Request) {
requestID, _ := r.Context().Value("request_id").(string)
@@ -618,8 +590,8 @@ func (h CertificateHandler) HandleOCSP(w http.ResponseWriter, r *http.Request) {
return
}
// Extract issuer_id and serial from path: /api/v1/ocsp/{issuer_id}/{serial_hex}
path := strings.TrimPrefix(r.URL.Path, "/api/v1/ocsp/")
// Extract issuer_id and serial from path: /.well-known/pki/ocsp/{issuer_id}/{serial_hex}
path := strings.TrimPrefix(r.URL.Path, "/.well-known/pki/ocsp/")
parts := strings.SplitN(path, "/", 2)
if len(parts) < 2 || parts[0] == "" || parts[1] == "" {
ErrorWithRequestID(w, http.StatusBadRequest, "Issuer ID and serial number are required", requestID)
+9 -4
View File
@@ -11,12 +11,17 @@ import (
)
// DiscoveryService defines the interface used by the discovery handler.
// ClaimDiscovered and DismissDiscovered accept an explicit actor parameter so
// the handler can flow the authenticated named-key identity into the audit
// trail (M-005). Services that call these methods from non-request contexts
// pass a descriptive sentinel (e.g., "system") or "" (which falls back to
// "api").
type DiscoveryService interface {
ProcessDiscoveryReport(ctx context.Context, report *domain.DiscoveryReport) (*domain.DiscoveryScan, error)
ListDiscovered(ctx context.Context, agentID, status string, page, perPage int) ([]*domain.DiscoveredCertificate, int, error)
GetDiscovered(ctx context.Context, id string) (*domain.DiscoveredCertificate, error)
ClaimDiscovered(ctx context.Context, id string, managedCertID string) error
DismissDiscovered(ctx context.Context, id string) error
ClaimDiscovered(ctx context.Context, id string, managedCertID string, actor string) error
DismissDiscovered(ctx context.Context, id string, actor string) error
ListScans(ctx context.Context, agentID string, page, perPage int) ([]*domain.DiscoveryScan, int, error)
GetScan(ctx context.Context, id string) (*domain.DiscoveryScan, error)
GetDiscoverySummary(ctx context.Context) (map[string]int, error)
@@ -142,7 +147,7 @@ func (h DiscoveryHandler) ClaimDiscovered(w http.ResponseWriter, r *http.Request
return
}
if err := h.svc.ClaimDiscovered(r.Context(), id, body.ManagedCertificateID); err != nil {
if err := h.svc.ClaimDiscovered(r.Context(), id, body.ManagedCertificateID, resolveActor(r.Context())); err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to claim certificate: %v", err))
return
}
@@ -166,7 +171,7 @@ func (h DiscoveryHandler) DismissDiscovered(w http.ResponseWriter, r *http.Reque
return
}
if err := h.svc.DismissDiscovered(r.Context(), id); err != nil {
if err := h.svc.DismissDiscovered(r.Context(), id, resolveActor(r.Context())); err != nil {
Error(w, http.StatusInternalServerError, fmt.Sprintf("failed to dismiss certificate: %v", err))
return
}
+10 -10
View File
@@ -19,8 +19,8 @@ type MockDiscoveryService struct {
ProcessDiscoveryReportFn func(ctx context.Context, report *domain.DiscoveryReport) (*domain.DiscoveryScan, error)
ListDiscoveredFn func(ctx context.Context, agentID, status string, page, perPage int) ([]*domain.DiscoveredCertificate, int, error)
GetDiscoveredFn func(ctx context.Context, id string) (*domain.DiscoveredCertificate, error)
ClaimDiscoveredFn func(ctx context.Context, id string, managedCertID string) error
DismissDiscoveredFn func(ctx context.Context, id string) error
ClaimDiscoveredFn func(ctx context.Context, id string, managedCertID string, actor string) error
DismissDiscoveredFn func(ctx context.Context, id string, actor string) error
ListScansFn func(ctx context.Context, agentID string, page, perPage int) ([]*domain.DiscoveryScan, int, error)
GetScanFn func(ctx context.Context, id string) (*domain.DiscoveryScan, error)
GetDiscoverySummaryFn func(ctx context.Context) (map[string]int, error)
@@ -47,16 +47,16 @@ func (m *MockDiscoveryService) GetDiscovered(ctx context.Context, id string) (*d
return nil, nil
}
func (m *MockDiscoveryService) ClaimDiscovered(ctx context.Context, id string, managedCertID string) error {
func (m *MockDiscoveryService) ClaimDiscovered(ctx context.Context, id string, managedCertID string, actor string) error {
if m.ClaimDiscoveredFn != nil {
return m.ClaimDiscoveredFn(ctx, id, managedCertID)
return m.ClaimDiscoveredFn(ctx, id, managedCertID, actor)
}
return nil
}
func (m *MockDiscoveryService) DismissDiscovered(ctx context.Context, id string) error {
func (m *MockDiscoveryService) DismissDiscovered(ctx context.Context, id string, actor string) error {
if m.DismissDiscoveredFn != nil {
return m.DismissDiscoveredFn(ctx, id)
return m.DismissDiscoveredFn(ctx, id, actor)
}
return nil
}
@@ -352,7 +352,7 @@ func TestGetDiscovered_NotFound(t *testing.T) {
// Test ClaimDiscovered - success case
func TestClaimDiscovered_Success(t *testing.T) {
mock := &MockDiscoveryService{
ClaimDiscoveredFn: func(ctx context.Context, id string, managedCertID string) error {
ClaimDiscoveredFn: func(ctx context.Context, id string, managedCertID string, actor string) error {
if id == "dcert-1" && managedCertID == "mc-prod-1" {
return nil
}
@@ -411,7 +411,7 @@ func TestClaimDiscovered_MissingManagedCertID(t *testing.T) {
// Test ClaimDiscovered - discovered cert not found
func TestClaimDiscovered_NotFound(t *testing.T) {
mock := &MockDiscoveryService{
ClaimDiscoveredFn: func(ctx context.Context, id string, managedCertID string) error {
ClaimDiscoveredFn: func(ctx context.Context, id string, managedCertID string, actor string) error {
return fmt.Errorf("discovered certificate not found")
},
}
@@ -438,7 +438,7 @@ func TestClaimDiscovered_NotFound(t *testing.T) {
// Test DismissDiscovered - success case
func TestDismissDiscovered_Success(t *testing.T) {
mock := &MockDiscoveryService{
DismissDiscoveredFn: func(ctx context.Context, id string) error {
DismissDiscoveredFn: func(ctx context.Context, id string, actor string) error {
if id == "dcert-1" {
return nil
}
@@ -614,7 +614,7 @@ func TestGetDiscoverySummary_MethodNotAllowed(t *testing.T) {
// Test DismissDiscovered - service error
func TestDismissDiscovered_ServiceError(t *testing.T) {
mock := &MockDiscoveryService{
DismissDiscoveredFn: func(ctx context.Context, id string) error {
DismissDiscoveredFn: func(ctx context.Context, id string, actor string) error {
return fmt.Errorf("database error")
},
}
+19 -3
View File
@@ -2,6 +2,8 @@ package handler
import (
"net/http"
"github.com/shankar0123/certctl/internal/api/middleware"
)
// HealthHandler handles health and readiness check endpoints.
@@ -55,9 +57,23 @@ func (h HealthHandler) AuthInfo(w http.ResponseWriter, r *http.Request) {
JSON(w, http.StatusOK, response)
}
// AuthCheck returns 200 if the request has valid auth credentials.
// The auth middleware runs before this handler, so reaching here means auth passed.
// AuthCheck returns 200 if the request has valid auth credentials, along with
// the resolved named-key identity and admin flag so the GUI can gate
// admin-only affordances (e.g., the bulk-revoke button).
//
// M-003 (Phase B.4): surface the admin flag so the frontend hides affordances
// that would otherwise 403 at the server. This is a hint for UX only —
// authorization remains enforced at the handler layer (bulk_revocation.go).
//
// The auth middleware runs before this handler, so reaching here means auth
// passed. `user` falls back to an empty string when auth is disabled
// (CERTCTL_AUTH_TYPE=none).
// GET /api/v1/auth/check
func (h HealthHandler) AuthCheck(w http.ResponseWriter, r *http.Request) {
JSON(w, http.StatusOK, map[string]string{"status": "authenticated"})
response := map[string]interface{}{
"status": "authenticated",
"user": middleware.GetUser(r.Context()),
"admin": middleware.IsAdmin(r.Context()),
}
JSON(w, http.StatusOK, response)
}
+115 -2
View File
@@ -1,10 +1,13 @@
package handler
import (
"context"
"encoding/json"
"net/http"
"net/http/httptest"
"testing"
"github.com/shankar0123/certctl/internal/api/middleware"
)
func TestHealth_ReturnsOK(t *testing.T) {
@@ -204,8 +207,8 @@ func TestAuthCheck_ReturnsOK(t *testing.T) {
t.Errorf("Content-Type = %q, want application/json", ct)
}
// Check response body
var result map[string]string
// Check response body — mixed-value map (string + bool) post-Phase B.4.
var result map[string]any
if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
@@ -232,3 +235,113 @@ func TestAuthCheck_MethodNotAllowed(t *testing.T) {
t.Logf("AuthCheck returned status %d (note: method not enforced in handler)", status)
}
}
// --- M-003 (Phase B.4): /auth/check surfaces admin flag + user identity ---
// TestAuthCheck_AdminCaller_ReportsAdminTrue confirms that when the auth
// middleware sets AdminKey{}=true (i.e., named key was admin-tagged), the
// /auth/check endpoint reports admin=true so the GUI can show admin-only
// affordances.
func TestAuthCheck_AdminCaller_ReportsAdminTrue(t *testing.T) {
handler := NewHealthHandler("api-key")
req := httptest.NewRequest(http.MethodGet, "/api/v1/auth/check", nil)
ctx := context.WithValue(req.Context(), middleware.AdminKey{}, true)
ctx = context.WithValue(ctx, middleware.UserKey{}, "ops-admin")
req = req.WithContext(ctx)
w := httptest.NewRecorder()
handler.AuthCheck(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
var result map[string]any
if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
if result["status"] != "authenticated" {
t.Errorf("status = %q, want authenticated", result["status"])
}
admin, ok := result["admin"].(bool)
if !ok {
t.Fatalf("admin field missing or wrong type: %T", result["admin"])
}
if !admin {
t.Errorf("admin = false, want true")
}
if result["user"] != "ops-admin" {
t.Errorf("user = %q, want ops-admin", result["user"])
}
}
// TestAuthCheck_NonAdminCaller_ReportsAdminFalse pins the negative case: the
// auth middleware has stored AdminKey{}=false (non-admin named key) — the
// endpoint must report admin=false so the GUI hides admin-only affordances.
func TestAuthCheck_NonAdminCaller_ReportsAdminFalse(t *testing.T) {
handler := NewHealthHandler("api-key")
req := httptest.NewRequest(http.MethodGet, "/api/v1/auth/check", nil)
ctx := context.WithValue(req.Context(), middleware.AdminKey{}, false)
ctx = context.WithValue(ctx, middleware.UserKey{}, "alice")
req = req.WithContext(ctx)
w := httptest.NewRecorder()
handler.AuthCheck(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
var result map[string]any
if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
admin, ok := result["admin"].(bool)
if !ok {
t.Fatalf("admin field missing or wrong type: %T", result["admin"])
}
if admin {
t.Errorf("admin = true, want false")
}
if result["user"] != "alice" {
t.Errorf("user = %q, want alice", result["user"])
}
}
// TestAuthCheck_NoAuthContext_DefaultsToEmptyUserAndFalseAdmin covers the
// CERTCTL_AUTH_TYPE=none deployment, where the auth middleware doesn't set
// any keys. Response must still be well-formed with empty user + admin=false.
func TestAuthCheck_NoAuthContext_DefaultsToEmptyUserAndFalseAdmin(t *testing.T) {
handler := NewHealthHandler("none")
req := httptest.NewRequest(http.MethodGet, "/api/v1/auth/check", nil)
w := httptest.NewRecorder()
handler.AuthCheck(w, req)
if w.Code != http.StatusOK {
t.Fatalf("expected status 200, got %d", w.Code)
}
var result map[string]any
if err := json.NewDecoder(w.Body).Decode(&result); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
if result["status"] != "authenticated" {
t.Errorf("status = %q, want authenticated", result["status"])
}
admin, ok := result["admin"].(bool)
if !ok {
t.Fatalf("admin field missing or wrong type: %T", result["admin"])
}
if admin {
t.Errorf("admin = true for no-auth context, want false")
}
if result["user"] != "" {
t.Errorf("user = %q, want empty string", result["user"])
}
}
+61 -12
View File
@@ -11,15 +11,18 @@ import (
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// MockJobService is a mock implementation of JobService interface.
// Approve/Reject closures now take the actor string so tests can assert
// actor propagation from the auth middleware → handler → service.
type MockJobService struct {
ListJobsFn func(status, jobType string, page, perPage int) ([]domain.Job, int64, error)
GetJobFn func(id string) (*domain.Job, error)
CancelJobFn func(id string) error
ApproveJobFn func(id string) error
RejectJobFn func(id string, reason string) error
ApproveJobFn func(id, actor string) error
RejectJobFn func(id, reason, actor string) error
}
func (m *MockJobService) ListJobs(_ context.Context, status, jobType string, page, perPage int) ([]domain.Job, int64, error) {
@@ -43,16 +46,16 @@ func (m *MockJobService) CancelJob(_ context.Context, id string) error {
return nil
}
func (m *MockJobService) ApproveJob(_ context.Context, id string) error {
func (m *MockJobService) ApproveJob(_ context.Context, id, actor string) error {
if m.ApproveJobFn != nil {
return m.ApproveJobFn(id)
return m.ApproveJobFn(id, actor)
}
return nil
}
func (m *MockJobService) RejectJob(_ context.Context, id string, reason string) error {
func (m *MockJobService) RejectJob(_ context.Context, id, reason, actor string) error {
if m.RejectJobFn != nil {
return m.RejectJobFn(id, reason)
return m.RejectJobFn(id, reason, actor)
}
return nil
}
@@ -348,7 +351,7 @@ func TestCancelJob_EmptyID(t *testing.T) {
func TestApproveJob_Success(t *testing.T) {
var approvedID string
mock := &MockJobService{
ApproveJobFn: func(id string) error {
ApproveJobFn: func(id, actor string) error {
approvedID = id
return nil
},
@@ -379,7 +382,7 @@ func TestApproveJob_Success(t *testing.T) {
func TestApproveJob_NotFound(t *testing.T) {
mock := &MockJobService{
ApproveJobFn: func(id string) error {
ApproveJobFn: func(id, actor string) error {
return fmt.Errorf("job not found: no rows")
},
}
@@ -398,7 +401,7 @@ func TestApproveJob_NotFound(t *testing.T) {
func TestApproveJob_BadStatus(t *testing.T) {
mock := &MockJobService{
ApproveJobFn: func(id string) error {
ApproveJobFn: func(id, actor string) error {
return fmt.Errorf("cannot approve job with status Running")
},
}
@@ -427,10 +430,56 @@ func TestApproveJob_MethodNotAllowed(t *testing.T) {
}
}
// TestApproveJob_SelfApproval_Returns403 verifies the M-003 separation-of-duties
// wire: when the service returns ErrSelfApproval the handler must surface HTTP
// 403 Forbidden (NOT 500). The error sentinel crosses the service boundary via
// errors.Is so the handler can pattern-match regardless of any fmt.Errorf
// wrapping that may be added later.
func TestApproveJob_SelfApproval_Returns403(t *testing.T) {
var capturedActor string
mock := &MockJobService{
ApproveJobFn: func(id, actor string) error {
capturedActor = actor
return service.ErrSelfApproval
},
}
h := NewJobHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/jobs/job-self/approve", nil)
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
h.ApproveJob(w, req)
if w.Code != http.StatusForbidden {
t.Fatalf("expected status 403, got %d", w.Code)
}
var resp map[string]any
if err := json.NewDecoder(w.Body).Decode(&resp); err != nil {
t.Fatalf("failed to decode response: %v", err)
}
// Response body should name the self-approval condition explicitly so
// operators triaging a 403 can distinguish it from other forbid paths.
// The ErrorResponse envelope uses "error" for the status text and
// "message" for the human-readable explanation — we assert on message.
msg, _ := resp["message"].(string)
if !strings.Contains(strings.ToLower(msg), "self-approval") {
t.Errorf("expected message to mention self-approval, got %q", msg)
}
// The handler resolves the actor from the auth context; in this test the
// request has no auth context, so the propagated actor is the anonymous
// fallback ("" or "anonymous" depending on middleware wiring). We only
// assert the closure observed *some* actor string — the detailed actor
// threading is covered by resolveActor unit tests.
_ = capturedActor
}
func TestRejectJob_Success(t *testing.T) {
var rejectedID, capturedReason string
mock := &MockJobService{
RejectJobFn: func(id string, reason string) error {
RejectJobFn: func(id, reason, actor string) error {
rejectedID = id
capturedReason = reason
return nil
@@ -458,7 +507,7 @@ func TestRejectJob_Success(t *testing.T) {
func TestRejectJob_NoReason(t *testing.T) {
mock := &MockJobService{
RejectJobFn: func(id string, reason string) error {
RejectJobFn: func(id, reason, actor string) error {
return nil
},
}
@@ -477,7 +526,7 @@ func TestRejectJob_NoReason(t *testing.T) {
func TestRejectJob_NotFound(t *testing.T) {
mock := &MockJobService{
RejectJobFn: func(id string, reason string) error {
RejectJobFn: func(id, reason, actor string) error {
return fmt.Errorf("job not found: no rows")
},
}
+22 -4
View File
@@ -3,6 +3,7 @@ package handler
import (
"context"
"encoding/json"
"errors"
"io"
"net/http"
"strconv"
@@ -10,6 +11,7 @@ import (
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// JobService defines the service interface for job operations.
@@ -17,8 +19,13 @@ type JobService interface {
ListJobs(ctx context.Context, status, jobType string, page, perPage int) ([]domain.Job, int64, error)
GetJob(ctx context.Context, id string) (*domain.Job, error)
CancelJob(ctx context.Context, id string) error
ApproveJob(ctx context.Context, id string) error
RejectJob(ctx context.Context, id string, reason string) error
// ApproveJob approves a renewal job. actor is the named-key identity
// resolved from the auth middleware; the service returns ErrSelfApproval
// (mapped to 403) when actor matches the certificate owner.
ApproveJob(ctx context.Context, id, actor string) error
// RejectJob rejects a renewal job. actor is the named-key identity
// recorded for audit attribution; no not-self restriction.
RejectJob(ctx context.Context, id, reason, actor string) error
}
// JobHandler handles HTTP requests for job operations.
@@ -150,7 +157,16 @@ func (h JobHandler) ApproveJob(w http.ResponseWriter, r *http.Request) {
}
jobID := parts[0]
if err := h.svc.ApproveJob(r.Context(), jobID); err != nil {
actor := resolveActor(r.Context())
if err := h.svc.ApproveJob(r.Context(), jobID, actor); err != nil {
// M-003: self-approval by the certificate owner is forbidden.
if errors.Is(err, service.ErrSelfApproval) {
ErrorWithRequestID(w, http.StatusForbidden,
"Self-approval is forbidden: the certificate owner cannot approve their own renewal",
requestID)
return
}
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Job not found", requestID)
return
@@ -194,7 +210,9 @@ func (h JobHandler) RejectJob(w http.ResponseWriter, r *http.Request) {
}
}
if err := h.svc.RejectJob(r.Context(), jobID, body.Reason); err != nil {
actor := resolveActor(r.Context())
if err := h.svc.RejectJob(r.Context(), jobID, body.Reason, actor); err != nil {
if strings.Contains(err.Error(), "not found") {
ErrorWithRequestID(w, http.StatusNotFound, "Job not found", requestID)
return
+17
View File
@@ -127,6 +127,17 @@ func (h PolicyHandler) CreatePolicy(w http.ResponseWriter, r *http.Request) {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
// Severity is optional on create; default matches the DB default.
// Any explicit value must pass the TitleCase allowlist; the DB CHECK
// constraint enforces the same set, but catching it here gives a 400
// with a clear message instead of a 500 on constraint violation.
if policy.Severity == "" {
policy.Severity = domain.PolicySeverityWarning
}
if err := ValidatePolicySeverity(policy.Severity); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
created, err := h.svc.CreatePolicy(r.Context(), policy)
if err != nil {
@@ -174,6 +185,12 @@ func (h PolicyHandler) UpdatePolicy(w http.ResponseWriter, r *http.Request) {
return
}
}
if policy.Severity != "" {
if err := ValidatePolicySeverity(policy.Severity); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
}
updated, err := h.svc.UpdatePolicy(r.Context(), id, policy)
if err != nil {
+20
View File
@@ -1,14 +1,34 @@
package handler
import (
"context"
"encoding/base64"
"encoding/json"
"fmt"
"net/http"
"strings"
"time"
"github.com/shankar0123/certctl/internal/api/middleware"
)
// resolveActor extracts the authenticated named-key identity from the request
// context for audit-trail attribution. Returns the named-key name when set by
// the auth middleware, or "api" as a safe sentinel when the auth middleware
// did not populate the context (e.g., AUTH_TYPE=none, or internal/system calls
// that bypass auth).
//
// Post-M-002: this is the single source of truth for handler-layer actor
// resolution. Handlers must NOT hardcode string literals like "api-key-user"
// or "api" — always go through this helper so the named-key identity flows to
// services and the audit trail.
func resolveActor(ctx context.Context) string {
if user := middleware.GetUser(ctx); user != "" {
return user
}
return "api"
}
// PagedResponse represents a paginated API response.
type PagedResponse struct {
Data interface{} `json:"data"`
+70 -6
View File
@@ -10,6 +10,7 @@ import (
"time"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// MockTargetService is a mock implementation of TargetService interface.
@@ -239,8 +240,9 @@ func TestCreateTarget_Success(t *testing.T) {
}
body := map[string]interface{}{
"name": "New Target",
"type": "nginx",
"name": "New Target",
"type": "nginx",
"agent_id": "agent-001",
}
bodyBytes, _ := json.Marshal(body)
@@ -258,7 +260,8 @@ func TestCreateTarget_Success(t *testing.T) {
func TestCreateTarget_MissingName(t *testing.T) {
body := map[string]interface{}{
"type": "nginx",
"type": "nginx",
"agent_id": "agent-001",
}
bodyBytes, _ := json.Marshal(body)
@@ -276,7 +279,8 @@ func TestCreateTarget_MissingName(t *testing.T) {
func TestCreateTarget_MissingType(t *testing.T) {
body := map[string]interface{}{
"name": "New Target",
"name": "New Target",
"agent_id": "agent-001",
}
bodyBytes, _ := json.Marshal(body)
@@ -311,8 +315,9 @@ func TestCreateTarget_NameTooLong(t *testing.T) {
longName += "x"
}
body := map[string]interface{}{
"name": longName,
"type": "nginx",
"name": longName,
"type": "nginx",
"agent_id": "agent-001",
}
bodyBytes, _ := json.Marshal(body)
@@ -340,6 +345,65 @@ func TestCreateTarget_MethodNotAllowed(t *testing.T) {
}
}
// TestCreateTarget_MissingAgentID_Returns400 pins the C-002 handler contract:
// handler MUST reject a create payload that omits agent_id with HTTP 400
// before the service is invoked. Using a mock that would return 201-worthy
// success proves the guard fires.
func TestCreateTarget_MissingAgentID_Returns400(t *testing.T) {
body := map[string]interface{}{
"name": "New Target",
"type": "nginx",
// agent_id intentionally omitted
}
bodyBytes, _ := json.Marshal(body)
mock := &MockTargetService{
CreateTargetFn: func(_ context.Context, target domain.DeploymentTarget) (*domain.DeploymentTarget, error) {
// Would succeed if handler guard did not fire.
target.ID = "t-would-be-created"
return &target, nil
},
}
handler := NewTargetHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/targets", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.CreateTarget(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("expected 400, got %d — body=%s", w.Code, w.Body.String())
}
}
// TestCreateTarget_NonexistentAgent_Returns400 pins the C-002 handler↔service
// translation: when the service returns service.ErrAgentNotFound, the handler
// MUST map it to HTTP 400, not the generic 500 used for other service errors.
func TestCreateTarget_NonexistentAgent_Returns400(t *testing.T) {
mock := &MockTargetService{
CreateTargetFn: func(_ context.Context, target domain.DeploymentTarget) (*domain.DeploymentTarget, error) {
return nil, service.ErrAgentNotFound
},
}
body := map[string]interface{}{
"name": "New Target",
"type": "nginx",
"agent_id": "agent-does-not-exist",
}
bodyBytes, _ := json.Marshal(body)
handler := NewTargetHandler(mock)
req := httptest.NewRequest(http.MethodPost, "/api/v1/targets", bytes.NewReader(bodyBytes))
req = req.WithContext(contextWithRequestID())
w := httptest.NewRecorder()
handler.CreateTarget(w, req)
if w.Code != http.StatusBadRequest {
t.Fatalf("expected 400 for nonexistent agent, got %d — body=%s", w.Code, w.Body.String())
}
}
func TestUpdateTarget_Success(t *testing.T) {
now := time.Now()
mock := &MockTargetService{
+16
View File
@@ -3,12 +3,14 @@ package handler
import (
"context"
"encoding/json"
"errors"
"net/http"
"strconv"
"strings"
"github.com/shankar0123/certctl/internal/api/middleware"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/service"
)
// TargetService defines the service interface for deployment target operations.
@@ -125,9 +127,23 @@ func (h TargetHandler) CreateTarget(w http.ResponseWriter, r *http.Request) {
ErrorWithRequestID(w, http.StatusBadRequest, "type is required", requestID)
return
}
// C-002: agent_id is a NOT NULL FK in deployment_targets (migration 000001
// line 104). Reject empty values at the boundary so callers get a clean 400
// with the field name rather than a generic "Failed to create target" 500.
if err := ValidateRequired("agent_id", target.AgentID); err != nil {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
created, err := h.svc.CreateTarget(r.Context(), target)
if err != nil {
// C-002: a nonexistent agent_id is a client error, not a server error.
// The service returns ErrAgentNotFound (wrapped via fmt.Errorf %w) when
// agentRepo.Get fails; we translate that to 400 via errors.Is.
if errors.Is(err, service.ErrAgentNotFound) {
ErrorWithRequestID(w, http.StatusBadRequest, err.Error(), requestID)
return
}
ErrorWithRequestID(w, http.StatusInternalServerError, "Failed to create target", requestID)
return
}
+2 -1
View File
@@ -71,10 +71,11 @@ func ValidatePolicyType(policyType interface{}) error {
"RequiredMetadata": true,
"AllowedEnvironments": true,
"RenewalLeadTime": true,
"CertificateLifetime": true,
}
typeStr := fmt.Sprintf("%v", policyType)
if !validTypes[typeStr] {
return ValidationError{Field: "type", Message: "type must be one of: AllowedIssuers, AllowedDomains, RequiredMetadata, AllowedEnvironments, RenewalLeadTime"}
return ValidationError{Field: "type", Message: "type must be one of: AllowedIssuers, AllowedDomains, RequiredMetadata, AllowedEnvironments, RenewalLeadTime, CertificateLifetime"}
}
return nil
}
+1 -1
View File
@@ -115,7 +115,7 @@ func (a *AuditMiddleware) Middleware(next http.Handler) http.Handler {
// Extract actor from auth context
actor := "anonymous"
if user, ok := GetUser(r.Context()); ok && user != "" {
if user := GetUser(r.Context()); user != "" {
actor = user
}
+5 -4
View File
@@ -269,8 +269,9 @@ func TestAuditLog_ExtractsAuthenticatedActor(t *testing.T) {
}))
req := httptest.NewRequest(http.MethodDelete, "/api/v1/certificates/mc-1", nil)
// Simulate auth middleware having set the user in context
ctx := context.WithValue(req.Context(), UserKey{}, "api-key-user")
// Simulate auth middleware having set the named-key identity in context
// (post-M-002: actor is the named-key name, not the old "api-key-user").
ctx := context.WithValue(req.Context(), UserKey{}, "ops-admin")
req = req.WithContext(ctx)
rr := httptest.NewRecorder()
@@ -284,8 +285,8 @@ func TestAuditLog_ExtractsAuthenticatedActor(t *testing.T) {
if len(calls) != 1 {
t.Fatalf("expected 1 audit call, got %d", len(calls))
}
if calls[0].Actor != "api-key-user" {
t.Errorf("expected actor api-key-user, got %s", calls[0].Actor)
if calls[0].Actor != "ops-admin" {
t.Errorf("expected actor ops-admin, got %s", calls[0].Actor)
}
if calls[0].Method != "DELETE" {
t.Errorf("expected method DELETE, got %s", calls[0].Method)
+88 -29
View File
@@ -22,6 +22,16 @@ type RequestIDKey struct{}
// UserKey is the context key for storing authenticated user information.
type UserKey struct{}
// AdminKey is the context key for storing admin flag information.
type AdminKey struct{}
// NamedAPIKey represents a named API key with optional admin flag.
type NamedAPIKey struct {
Name string
Key string
Admin bool
}
// RequestID middleware generates a unique request ID and adds it to the request context and response headers.
func RequestID(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
@@ -112,35 +122,40 @@ type AuthConfig struct {
Secret string // The raw API key or comma-separated list of valid API keys
}
// NewAuth creates an authentication middleware based on config.
// When Type is "none", all requests pass through (demo/development mode).
// When Type is "api-key", requests must include a valid Bearer token.
// The Secret field supports a comma-separated list of valid API keys for
// zero-downtime key rotation. Rotation workflow:
// 1. Add new key to comma-separated list, restart server
// 2. Update all agents/clients to use new key
// 3. Remove old key from list, restart server
func NewAuth(cfg AuthConfig) func(http.Handler) http.Handler {
if cfg.Type == "none" {
// NewAuthWithNamedKeys creates an authentication middleware that validates
// Bearer tokens against a set of named API keys. Each key carries a name
// (propagated as the actor via context) and an admin flag (consulted by
// authorization gates such as bulk revocation).
//
// When namedKeys is empty the returned middleware is a no-op pass-through,
// which is used in demo/development mode (CERTCTL_AUTH_TYPE=none). When one
// or more keys are provided, requests must include a matching Bearer token
// or they are rejected with 401.
func NewAuthWithNamedKeys(namedKeys []NamedAPIKey) func(http.Handler) http.Handler {
if len(namedKeys) == 0 {
return func(next http.Handler) http.Handler {
return next
}
}
// Pre-compute hashes of all valid keys for constant-time comparison.
// Supports comma-separated list for zero-downtime key rotation.
keys := strings.Split(cfg.Secret, ",")
var expectedHashes []string
for _, k := range keys {
k = strings.TrimSpace(k)
if k != "" {
expectedHashes = append(expectedHashes, HashAPIKey(k))
}
type keyEntry struct {
hash string
name string
admin bool
}
var entries []keyEntry
for _, nk := range namedKeys {
entries = append(entries, keyEntry{
hash: HashAPIKey(nk.Key),
name: nk.Name,
admin: nk.Admin,
})
}
// Warn if only one key is configured in production mode
if len(expectedHashes) == 1 {
slog.Warn("only one API key configured — consider adding a rotation key via comma-separated CERTCTL_AUTH_SECRET for zero-downtime rotation")
if len(entries) == 1 {
slog.Warn("only one API key configured — consider adding a rotation key for zero-downtime rotation")
}
return func(next http.Handler) http.Handler {
@@ -164,27 +179,60 @@ func NewAuth(cfg AuthConfig) func(http.Handler) http.Handler {
tokenHash := HashAPIKey(token)
// Check against all valid keys using constant-time comparison
authorized := false
for _, expectedHash := range expectedHashes {
if subtle.ConstantTimeCompare([]byte(tokenHash), []byte(expectedHash)) == 1 {
authorized = true
var matched *keyEntry
for i := range entries {
if subtle.ConstantTimeCompare([]byte(tokenHash), []byte(entries[i].hash)) == 1 {
matched = &entries[i]
break
}
}
if !authorized {
if matched == nil {
w.Header().Set("Content-Type", "application/json; charset=utf-8")
http.Error(w, `{"error":"Invalid API key"}`, http.StatusUnauthorized)
return
}
// Store the authenticated identity in context
ctx := context.WithValue(r.Context(), UserKey{}, "api-key-user")
// Store the authenticated identity and admin flag in context
ctx := context.WithValue(r.Context(), UserKey{}, matched.name)
ctx = context.WithValue(ctx, AdminKey{}, matched.admin)
next.ServeHTTP(w, r.WithContext(ctx))
})
}
}
// NewAuth is a legacy shim that converts a comma-separated Secret list into
// synthesized legacy-key-N named entries and delegates to NewAuthWithNamedKeys.
// It preserves the pre-M-002 behavior for callers that still pass raw AuthConfig
// (primarily cmd/server/main_test.go). The synthesized actor is "legacy-key-N"
// rather than the old hardcoded "api-key-user" so audit events carry
// meaningful identity even on the legacy path.
//
// Deprecated: Use NewAuthWithNamedKeys with explicit NamedAPIKey entries.
func NewAuth(cfg AuthConfig) func(http.Handler) http.Handler {
if cfg.Type == "none" {
return func(next http.Handler) http.Handler {
return next
}
}
var namedKeys []NamedAPIKey
idx := 0
for _, k := range strings.Split(cfg.Secret, ",") {
k = strings.TrimSpace(k)
if k == "" {
continue
}
namedKeys = append(namedKeys, NamedAPIKey{
Name: fmt.Sprintf("legacy-key-%d", idx),
Key: k,
Admin: false,
})
idx++
}
return NewAuthWithNamedKeys(namedKeys)
}
// RateLimitConfig holds configuration for the rate limiter.
type RateLimitConfig struct {
RPS float64 // Requests per second
@@ -344,9 +392,20 @@ func getRequestID(ctx context.Context) string {
}
// GetUser extracts the authenticated user from context.
func GetUser(ctx context.Context) (string, bool) {
// Returns the name of the matched API key and whether it was found.
func GetUser(ctx context.Context) string {
user, ok := ctx.Value(UserKey{}).(string)
return user, ok
if !ok {
return ""
}
return user
}
// IsAdmin extracts the admin flag from context.
// Returns true if the authenticated user has admin privileges.
func IsAdmin(ctx context.Context) bool {
admin, ok := ctx.Value(AdminKey{}).(bool)
return ok && admin
}
// responseWriter wraps http.ResponseWriter to capture the status code.
+31 -6
View File
@@ -109,12 +109,10 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
r.Register("GET /api/v1/certificates/{id}/export/pem", http.HandlerFunc(reg.Export.ExportPEM))
r.Register("POST /api/v1/certificates/{id}/export/pkcs12", http.HandlerFunc(reg.Export.ExportPKCS12))
// CRL endpoints: /api/v1/crl (JSON) and /api/v1/crl/{issuer_id} (DER)
r.Register("GET /api/v1/crl", http.HandlerFunc(reg.Certificates.GetCRL))
r.Register("GET /api/v1/crl/{issuer_id}", http.HandlerFunc(reg.Certificates.GetDERCRL))
// OCSP responder: /api/v1/ocsp/{issuer_id}/{serial}
r.Register("GET /api/v1/ocsp/{issuer_id}/{serial}", http.HandlerFunc(reg.Certificates.HandleOCSP))
// NOTE: RFC 5280 CRL and RFC 6960 OCSP endpoints are registered separately
// via RegisterPKIHandlers under /.well-known/pki/ so relying parties can
// fetch them without presenting certctl API credentials. The legacy
// /api/v1/crl and /api/v1/ocsp paths have been retired (see M-006).
// Issuers routes: /api/v1/issuers
r.Register("GET /api/v1/issuers", http.HandlerFunc(reg.Issuers.ListIssuers))
@@ -133,9 +131,21 @@ func (r *Router) RegisterHandlers(reg HandlerRegistry) {
r.Register("POST /api/v1/targets/{id}/test", http.HandlerFunc(reg.Targets.TestTargetConnection))
// Agents routes: /api/v1/agents
//
// I-004 soft-retirement surface:
// * GET /api/v1/agents/retired — opt-in listing of retired agents.
// MUST be registered before /agents/{id} so Go 1.22 ServeMux's
// literal-beats-pattern-var precedence routes the `retired` literal
// to ListRetiredAgents instead of treating "retired" as a {id}
// parameter value against GetAgent.
// * DELETE /api/v1/agents/{id} — RetireAgent. Replaces the pre-I-004
// hard-delete; the underlying repo does a soft-retire with
// optional cascade.
r.Register("GET /api/v1/agents", http.HandlerFunc(reg.Agents.ListAgents))
r.Register("POST /api/v1/agents", http.HandlerFunc(reg.Agents.RegisterAgent))
r.Register("GET /api/v1/agents/retired", http.HandlerFunc(reg.Agents.ListRetiredAgents))
r.Register("GET /api/v1/agents/{id}", http.HandlerFunc(reg.Agents.GetAgent))
r.Register("DELETE /api/v1/agents/{id}", http.HandlerFunc(reg.Agents.RetireAgent))
r.Register("POST /api/v1/agents/{id}/heartbeat", http.HandlerFunc(reg.Agents.Heartbeat))
r.Register("POST /api/v1/agents/{id}/csr", http.HandlerFunc(reg.Agents.AgentCSRSubmit))
r.Register("GET /api/v1/agents/{id}/certificates/{cert_id}", http.HandlerFunc(reg.Agents.AgentCertificatePickup))
@@ -262,6 +272,21 @@ func (r *Router) RegisterSCEPHandlers(scep handler.SCEPHandler) {
r.Register("POST /scep", http.HandlerFunc(scep.HandleSCEP))
}
// RegisterPKIHandlers sets up RFC 5280 CRL and RFC 6960 OCSP routes under
// /.well-known/pki/. These endpoints are intentionally unauthenticated so
// relying parties (browsers, OpenSSL, OCSP stapling sidecars, mTLS clients)
// can fetch revocation data without presenting certctl API credentials.
// The response bodies are DER-encoded and carry the IANA-registered content
// types application/pkix-crl and application/ocsp-response.
//
// Precedent: EST (RFC 7030) and SCEP (RFC 8894) follow the same pattern —
// standards-defined wire formats served via a dedicated router registration
// that cmd/server wires into a no-auth middleware chain.
func (r *Router) RegisterPKIHandlers(pki handler.CertificateHandler) {
r.Register("GET /.well-known/pki/crl/{issuer_id}", http.HandlerFunc(pki.GetDERCRL))
r.Register("GET /.well-known/pki/ocsp/{issuer_id}/{serial}", http.HandlerFunc(pki.HandleOCSP))
}
// GetMux returns the underlying http.ServeMux for direct access if needed.
func (r *Router) GetMux() *http.ServeMux {
return r.mux
+57 -4
View File
@@ -138,10 +138,9 @@ func TestRegisterHandlers_RoutesDispatch(t *testing.T) {
// Export
{"GET", "/api/v1/certificates/mc-test/export/pem"},
// CRL & OCSP
{"GET", "/api/v1/crl"},
{"GET", "/api/v1/crl/iss-local"},
{"GET", "/api/v1/ocsp/iss-local/12345"},
// NOTE: CRL/OCSP moved out of /api/v1/* in M-006. They are now served
// unauthenticated at /.well-known/pki/* via RegisterPKIHandlers and
// are verified in TestRegisterPKIHandlers_AllPaths below.
// Issuers
{"GET", "/api/v1/issuers"},
@@ -336,6 +335,60 @@ func TestRegisterESTHandlers_AllPaths(t *testing.T) {
}
}
// TestRegisterPKIHandlers_AllPaths verifies that RegisterPKIHandlers registers
// the two RFC-compliant unauthenticated endpoints relocated in M-006:
//
// - GET /.well-known/pki/crl/{issuer_id} (RFC 5280 §5 DER CRL)
// - GET /.well-known/pki/ocsp/{issuer_id}/{serial} (RFC 6960 §2.1 OCSP)
//
// Registration and middleware gating are complementary: this test proves the
// router matches the path; the unauthenticated contract is enforced separately
// by cmd/server/main.go's finalHandler routing /.well-known/pki/* through the
// noAuthHandler.
func TestRegisterPKIHandlers_AllPaths(t *testing.T) {
r := New()
// Zero-value CertificateHandler will panic on real calls; the only thing
// this test is verifying is that the route dispatches (i.e. the URL
// pattern is registered), so catch the downstream panic.
recoverMW := func(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
defer func() {
if rv := recover(); rv != nil {
w.WriteHeader(http.StatusOK)
}
}()
next.ServeHTTP(w, r)
})
}
r.RegisterPKIHandlers(handler.CertificateHandler{})
testHandler := recoverMW(r)
routes := []struct {
method string
path string
}{
{"GET", "/.well-known/pki/crl/iss-local"},
{"GET", "/.well-known/pki/ocsp/iss-local/01ABCDEF"},
}
for _, tc := range routes {
t.Run(tc.method+" "+tc.path, func(t *testing.T) {
req := httptest.NewRequest(tc.method, tc.path, nil)
w := httptest.NewRecorder()
testHandler.ServeHTTP(w, req)
if w.Code == http.StatusNotFound {
t.Errorf("PKI route %s %s returned 404 — route not registered", tc.method, tc.path)
}
if w.Code == http.StatusMethodNotAllowed {
t.Errorf("PKI route %s %s returned 405", tc.method, tc.path)
}
})
}
}
// TestGetMux_ReturnsUnderlyingMux tests that GetMux returns the underlying mux.
func TestGetMux_ReturnsUnderlyingMux(t *testing.T) {
r := New()
+228
View File
@@ -0,0 +1,228 @@
package cli
import (
"encoding/json"
"net/http"
"net/http/httptest"
"testing"
)
// TestClient_RetireAgent_Success pins the I-004 CLI happy path: the operator
// runs `certctl-cli agents retire <id>` and the client issues a DELETE to
// /api/v1/agents/{id}, parses the 200 JSON body (retired_at, already_retired,
// cascade, counts), and reports success. The handler test already covers the
// server-side contract; this test covers the client-side wire formatting so a
// refactor of the server's 200 body shape can't silently break the CLI.
func TestClient_RetireAgent_Success(t *testing.T) {
var (
sawMethod string
sawPath string
sawForce string
sawReason string
)
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sawMethod = r.Method
sawPath = r.URL.Path
sawForce = r.URL.Query().Get("force")
sawReason = r.URL.Query().Get("reason")
if r.Method != "DELETE" || r.URL.Path != "/api/v1/agents/ag-1" {
w.WriteHeader(http.StatusNotFound)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_ = json.NewEncoder(w).Encode(map[string]interface{}{
"retired_at": "2026-04-18T12:00:00Z",
"already_retired": false,
"cascade": false,
"counts": map[string]interface{}{
"active_targets": 0,
"active_certificates": 0,
"pending_jobs": 0,
},
})
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
// Positional arg: the agent ID. No --force, no --reason — the default
// soft-retire path. Compile-fail until client.RetireAgent exists.
if err := client.RetireAgent([]string{"ag-1"}); err != nil {
t.Fatalf("RetireAgent(ag-1) err=%v want nil", err)
}
if sawMethod != "DELETE" {
t.Errorf("method=%q want DELETE", sawMethod)
}
if sawPath != "/api/v1/agents/ag-1" {
t.Errorf("path=%q want /api/v1/agents/ag-1", sawPath)
}
if sawForce != "" {
t.Errorf("force query=%q want empty (default path sends no force)", sawForce)
}
if sawReason != "" {
t.Errorf("reason query=%q want empty (default path sends no reason)", sawReason)
}
}
// TestClient_RetireAgent_Force_WithReason_Success pins the ?force=true&reason=...
// escape hatch wiring. Operators who supply --force + --reason get their values
// propagated as URL query parameters exactly once, so the server sees the same
// contract the handler test expects. Also verifies the cascade=true response
// body parses cleanly.
func TestClient_RetireAgent_Force_WithReason_Success(t *testing.T) {
var (
sawForce string
sawReason string
)
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sawForce = r.URL.Query().Get("force")
sawReason = r.URL.Query().Get("reason")
if r.Method != "DELETE" || r.URL.Path != "/api/v1/agents/ag-1" {
w.WriteHeader(http.StatusNotFound)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_ = json.NewEncoder(w).Encode(map[string]interface{}{
"retired_at": "2026-04-18T12:00:00Z",
"already_retired": false,
"cascade": true,
"counts": map[string]interface{}{
"active_targets": 2,
"active_certificates": 5,
"pending_jobs": 1,
},
})
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
if err := client.RetireAgent([]string{"ag-1", "--force", "--reason", "decommissioning rack 7"}); err != nil {
t.Fatalf("RetireAgent(force+reason) err=%v want nil", err)
}
if sawForce != "true" {
t.Errorf("force query=%q want \"true\"", sawForce)
}
if sawReason != "decommissioning rack 7" {
t.Errorf("reason query=%q want %q", sawReason, "decommissioning rack 7")
}
}
// TestClient_RetireAgent_Force_RequiresReason pins the client-side guard: using
// --force without --reason must fail BEFORE any HTTP request is made. Without
// this, the client would bounce off the server's 400 ErrForceReasonRequired
// only after a round trip — slow feedback, wasted audit-trail noise, and a
// worse operator experience. requestCount=0 enforces that no HTTP call happens.
func TestClient_RetireAgent_Force_RequiresReason(t *testing.T) {
var requestCount int
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
requestCount++
w.WriteHeader(http.StatusOK)
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
err := client.RetireAgent([]string{"ag-1", "--force"})
if err == nil {
t.Fatalf("RetireAgent(force, no reason) err=nil want client-side error")
}
if !containsStr(err.Error(), "reason") {
t.Errorf("err=%q should mention --reason to guide operator", err.Error())
}
if requestCount != 0 {
t.Fatalf("requestCount=%d want 0; client must short-circuit before HTTP call", requestCount)
}
}
// TestClient_RetireAgent_MissingID covers the other common operator mistake:
// invoking `certctl-cli agents retire` with no agent ID. Must be caught by the
// client with a clear error, not a malformed DELETE to /api/v1/agents/.
func TestClient_RetireAgent_MissingID(t *testing.T) {
var requestCount int
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
requestCount++
w.WriteHeader(http.StatusOK)
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
err := client.RetireAgent([]string{})
if err == nil {
t.Fatalf("RetireAgent([]) err=nil want missing-id error")
}
if requestCount != 0 {
t.Fatalf("requestCount=%d want 0; client must reject missing-id before HTTP", requestCount)
}
}
// TestClient_ListRetiredAgents_Success pins the audit/forensics CLI surface:
// `certctl-cli agents list-retired` must GET /api/v1/agents/retired and render
// the paged response. The server returns a PagedResponse; the client is
// responsible for printing it in table or JSON format, same as ListAgents.
func TestClient_ListRetiredAgents_Success(t *testing.T) {
var (
sawMethod string
sawPath string
)
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sawMethod = r.Method
sawPath = r.URL.Path
if r.Method != "GET" || r.URL.Path != "/api/v1/agents/retired" {
w.WriteHeader(http.StatusNotFound)
return
}
w.Header().Set("Content-Type", "application/json")
_ = json.NewEncoder(w).Encode(map[string]interface{}{
"data": []map[string]interface{}{
{
"id": "ag-old-01",
"name": "decom-01",
"hostname": "server-old",
"status": "Offline",
"registered_at": "2024-01-01T00:00:00Z",
"retired_at": "2026-01-01T00:00:00Z",
"retired_reason": "old hardware",
},
},
"total": 1,
"page": 1,
"per_page": 50,
})
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
if err := client.ListRetiredAgents([]string{}); err != nil {
t.Fatalf("ListRetiredAgents err=%v want nil", err)
}
if sawMethod != "GET" {
t.Errorf("method=%q want GET", sawMethod)
}
if sawPath != "/api/v1/agents/retired" {
t.Errorf("path=%q want /api/v1/agents/retired", sawPath)
}
}
// TestClient_ListRetiredAgents_ServerError covers the non-happy path: server
// returns 5xx → client surfaces the error rather than silently printing an
// empty list. Without this, operators running the command as part of a
// compliance audit could miss a backend outage.
func TestClient_ListRetiredAgents_ServerError(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
http.Error(w, "db unreachable", http.StatusInternalServerError)
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
err := client.ListRetiredAgents([]string{})
if err == nil {
t.Fatalf("ListRetiredAgents(500) err=nil want propagated error")
}
}
+277 -6
View File
@@ -12,6 +12,7 @@ import (
"net/url"
"os"
"path/filepath"
"strings"
"text/tabwriter"
"time"
)
@@ -292,6 +293,194 @@ func (c *Client) ListAgents(args []string) error {
return c.outputAgentsTable(result.Data, result.Total)
}
// ListRetiredAgents lists soft-retired agents from the dedicated endpoint.
//
// I-004: hits GET /api/v1/agents/retired which is a separate route from the
// default listing (the default hides retired rows). Supports --page and
// --per-page just like the active list. Output format mirrors ListAgents
// but prepends RETIRED_AT and RETIRED_REASON columns so the operator can
// forensic-grep the output.
func (c *Client) ListRetiredAgents(args []string) error {
fs := flag.NewFlagSet("agents list --retired", flag.ContinueOnError)
page := fs.Int("page", 1, "Page number")
perPage := fs.Int("per-page", 50, "Items per page")
fs.Parse(args)
query := url.Values{}
query.Set("page", fmt.Sprintf("%d", *page))
query.Set("per_page", fmt.Sprintf("%d", *perPage))
resp, err := c.do("GET", "/api/v1/agents/retired", query, nil)
if err != nil {
return err
}
var result struct {
Data []map[string]interface{} `json:"data"`
Total int `json:"total"`
}
if err := json.Unmarshal(resp, &result); err != nil {
return fmt.Errorf("parsing response: %w", err)
}
if c.format == "json" {
return c.outputJSON(result)
}
return c.outputRetiredAgentsTable(result.Data, result.Total)
}
// RetireAgent soft-retires an agent via DELETE /api/v1/agents/{id}.
//
// I-004: wraps the full status-code matrix pinned by the handler's
// agent_retire_handler_test.go:
//
// 200 clean retire — body: retired_at, already_retired=false, cascade=false, counts=0
// 200 force-cascade retire — body: cascade=true, counts=pre-cascade snapshot
// 204 idempotent retire — agent was already retired, NO body
// 403 sentinel — reserved agent (server-scanner / cloud-*), ErrAgentIsSentinel
// 404 not found — agent doesn't exist
// 409 blocked_by_dependencies — body: error, message, counts
//
// The default (force=false) flow refuses to retire agents with active
// downstream dependencies; the operator must re-run with --force and an
// explicit --reason to cascade. The handler rejects --force without
// --reason with a 400 — we mirror that contract client-side so the
// operator gets a clear error before the round trip.
func (c *Client) RetireAgent(args []string) error {
// Convention: `agents retire <id> [--force] [--reason <reason>]` — the ID
// is a positional arg that precedes the flags. Go's flag package stops
// parsing at the first non-flag token, so we pull args[0] as the ID and
// hand args[1:] to the flag parser. Without this split, `agents retire
// ag-1 --force --reason "x"` would parse with force=false and reason=""
// because the flags land in fs.Args() instead of being recognized.
if len(args) == 0 {
return fmt.Errorf("agent ID is required: agents retire <id> [--force] [--reason <reason>]")
}
id := args[0]
fs := flag.NewFlagSet("agents retire", flag.ContinueOnError)
force := fs.Bool("force", false, "Cascade-retire downstream targets, certs, and jobs")
reason := fs.String("reason", "", "Human-readable reason (required with --force)")
if err := fs.Parse(args[1:]); err != nil {
return err
}
// Mirror the handler's ErrForceReasonRequired contract client-side so
// the operator gets a clear error before the round trip.
if *force && strings.TrimSpace(*reason) == "" {
return fmt.Errorf("--reason is required when --force is set")
}
// Build query string. Skip ?force=false; skip ?reason= when empty.
query := url.Values{}
if *force {
query.Set("force", "true")
}
if *reason != "" {
query.Set("reason", *reason)
}
u, err := url.JoinPath(c.baseURL, fmt.Sprintf("/api/v1/agents/%s", id))
if err != nil {
return fmt.Errorf("invalid URL: %w", err)
}
if len(query) > 0 {
u = u + "?" + query.Encode()
}
req, err := http.NewRequest("DELETE", u, nil)
if err != nil {
return fmt.Errorf("creating request: %w", err)
}
req.Header.Set("Accept", "application/json")
if c.apiKey != "" {
req.Header.Set("Authorization", "Bearer "+c.apiKey)
}
resp, err := c.httpClient.Do(req)
if err != nil {
return fmt.Errorf("request failed: %w", err)
}
defer resp.Body.Close()
body, err := io.ReadAll(resp.Body)
if err != nil {
return fmt.Errorf("reading response: %w", err)
}
switch resp.StatusCode {
case http.StatusNoContent:
// 204 idempotent — the agent was already retired. No body.
if c.format == "json" {
return c.outputJSON(map[string]interface{}{
"agent_id": id,
"already_retired": true,
})
}
fmt.Printf("Agent %s was already retired (idempotent)\n", id)
return nil
case http.StatusOK:
var result struct {
RetiredAt string `json:"retired_at"`
AlreadyRetired bool `json:"already_retired"`
Cascade bool `json:"cascade"`
Counts struct {
ActiveTargets int `json:"active_targets"`
ActiveCertificates int `json:"active_certificates"`
PendingJobs int `json:"pending_jobs"`
} `json:"counts"`
}
if err := json.Unmarshal(body, &result); err != nil {
return fmt.Errorf("parsing 200 response: %w", err)
}
if c.format == "json" {
return c.outputJSON(json.RawMessage(body))
}
if result.Cascade {
fmt.Printf("Agent %s retired (cascade). Retired at: %s\n", id, result.RetiredAt)
fmt.Printf(" Cascaded: %d targets, %d certificates, %d jobs\n",
result.Counts.ActiveTargets, result.Counts.ActiveCertificates, result.Counts.PendingJobs)
} else {
fmt.Printf("Agent %s retired. Retired at: %s\n", id, result.RetiredAt)
}
return nil
case http.StatusConflict:
// 409 blocked_by_dependencies. Parse the body so we can show the
// operator which dependency counts are holding up the retire.
var blocked struct {
Error string `json:"error"`
Message string `json:"message"`
Counts struct {
ActiveTargets int `json:"active_targets"`
ActiveCertificates int `json:"active_certificates"`
PendingJobs int `json:"pending_jobs"`
} `json:"counts"`
}
if err := json.Unmarshal(body, &blocked); err != nil {
return fmt.Errorf("agent has active dependencies (HTTP 409); raw body: %s", string(body))
}
return fmt.Errorf("blocked_by_dependencies: %s (targets=%d certificates=%d jobs=%d); re-run with --force --reason \"<reason>\" to cascade",
blocked.Message, blocked.Counts.ActiveTargets, blocked.Counts.ActiveCertificates, blocked.Counts.PendingJobs)
case http.StatusForbidden:
return fmt.Errorf("agent %s is a reserved sentinel and cannot be retired (HTTP 403)", id)
case http.StatusNotFound:
return fmt.Errorf("agent %s not found (HTTP 404)", id)
case http.StatusBadRequest:
return fmt.Errorf("bad request (HTTP 400): %s", string(body))
default:
return fmt.Errorf("unexpected HTTP %d: %s", resp.StatusCode, string(body))
}
}
// GetAgent retrieves a single agent by ID.
func (c *Client) GetAgent(id string) error {
resp, err := c.do("GET", fmt.Sprintf("/api/v1/agents/%s", id), nil, nil)
@@ -430,7 +619,54 @@ func (c *Client) GetStatus() error {
}
// ImportCertificates bulk imports certificates from PEM files.
func (c *Client) ImportCertificates(files []string) error {
//
// C-001 scope-expansion closure: the create-certificate handler's
// six-field required contract (name, common_name, renewal_policy_id,
// issuer_id, owner_id, team_id) is enforced server-side via
// ValidateRequired. The bulk importer must therefore be told which
// owner / team / renewal-policy / issuer to assign to every imported
// cert — otherwise every POST comes back 400. All four IDs are
// required flags; missing flags error out with a user-legible message
// before any files are read.
func (c *Client) ImportCertificates(args []string) error {
fs := flag.NewFlagSet("import", flag.ContinueOnError)
ownerID := fs.String("owner-id", "", "Owner ID to assign to each imported certificate (required)")
teamID := fs.String("team-id", "", "Team ID to assign to each imported certificate (required)")
renewalPolicyID := fs.String("renewal-policy-id", "", "Renewal policy ID to assign to each imported certificate (required)")
issuerID := fs.String("issuer-id", "", "Issuer ID to assign to each imported certificate (required)")
nameTemplate := fs.String("name-template", "{cn}", "Template for the certificate name; {cn} is substituted with the cert's common name")
environment := fs.String("environment", "imported", "Environment tag for each imported certificate")
if err := fs.Parse(args); err != nil {
return err
}
// Validate required flags up front — a clear error here beats six
// parallel 400s from the server.
missing := []string{}
if *ownerID == "" {
missing = append(missing, "--owner-id")
}
if *teamID == "" {
missing = append(missing, "--team-id")
}
if *renewalPolicyID == "" {
missing = append(missing, "--renewal-policy-id")
}
if *issuerID == "" {
missing = append(missing, "--issuer-id")
}
if len(missing) > 0 {
return fmt.Errorf("missing required flag(s): %s", strings.Join(missing, ", "))
}
if *nameTemplate == "" {
return fmt.Errorf("--name-template must be non-empty")
}
files := fs.Args()
if len(files) == 0 {
return fmt.Errorf("at least one PEM file path is required")
}
var imported, failed int
for _, filePath := range files {
@@ -452,12 +688,18 @@ func (c *Client) ImportCertificates(files []string) error {
total := len(certs)
fmt.Printf("Importing %d/%d certificates from %s...\r", i+1, total, filepath.Base(filePath))
name := strings.ReplaceAll(*nameTemplate, "{cn}", cert.Subject.CommonName)
req := map[string]interface{}{
"common_name": cert.Subject.CommonName,
"sans": cert.DNSNames,
"issuer_id": "iss-local",
"environment": "imported",
"status": "Active",
"name": name,
"common_name": cert.Subject.CommonName,
"sans": cert.DNSNames,
"issuer_id": *issuerID,
"owner_id": *ownerID,
"team_id": *teamID,
"renewal_policy_id": *renewalPolicyID,
"environment": *environment,
"status": "Active",
}
if cert.SerialNumber != nil {
@@ -559,6 +801,35 @@ func (c *Client) outputAgentsTable(agents []map[string]interface{}, total int) e
return nil
}
// outputRetiredAgentsTable is the tab-writer view for the retired listing.
// I-004: adds RETIRED_AT + REASON columns so operators can forensic-grep.
func (c *Client) outputRetiredAgentsTable(agents []map[string]interface{}, total int) error {
w := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0)
fmt.Fprintln(w, "ID\tHOSTNAME\tOS\tARCHITECTURE\tRETIRED AT\tREASON")
for _, agent := range agents {
id := getString(agent, "id")
hostname := getString(agent, "hostname")
osName := getString(agent, "os")
arch := getString(agent, "architecture")
retiredAt := ""
if raw, ok := agent["retired_at"].(string); ok && raw != "" {
if t, err := time.Parse(time.RFC3339, raw); err == nil {
retiredAt = t.Format("2006-01-02 15:04:05")
} else {
retiredAt = raw
}
}
reason := getString(agent, "retired_reason")
fmt.Fprintf(w, "%s\t%s\t%s\t%s\t%s\t%s\n", id, hostname, osName, arch, retiredAt, reason)
}
w.Flush()
fmt.Printf("\nTotal retired: %d\n", total)
return nil
}
func (c *Client) outputAgentDetail(agent map[string]interface{}) error {
w := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0)
+174
View File
@@ -10,6 +10,8 @@ import (
"math/big"
"net/http"
"net/http/httptest"
"os"
"path/filepath"
"testing"
"time"
)
@@ -387,6 +389,178 @@ func TestClient_AuthHeader(t *testing.T) {
}
}
// TestClient_ImportCertificates_MissingRequiredFlags verifies the CLI
// import command rejects invocations missing any of the four required
// flags (--owner-id, --team-id, --renewal-policy-id, --issuer-id)
// before any network call is attempted. This is the C-001 scope-expansion
// closure for the CLI layer: the handler now requires all six cert
// fields, so the importer must collect ownership / team / policy /
// issuer up front rather than hard-coding iss-local and letting the
// server 400 on every POST.
func TestClient_ImportCertificates_MissingRequiredFlags(t *testing.T) {
var requestCount int
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
requestCount++
w.WriteHeader(http.StatusOK)
}))
defer server.Close()
cases := []struct {
name string
args []string
missing string
}{
{
name: "missing owner-id",
args: []string{"--team-id", "t-platform", "--renewal-policy-id", "rp-default", "--issuer-id", "iss-local", "certs.pem"},
missing: "--owner-id",
},
{
name: "missing team-id",
args: []string{"--owner-id", "o-alice", "--renewal-policy-id", "rp-default", "--issuer-id", "iss-local", "certs.pem"},
missing: "--team-id",
},
{
name: "missing renewal-policy-id",
args: []string{"--owner-id", "o-alice", "--team-id", "t-platform", "--issuer-id", "iss-local", "certs.pem"},
missing: "--renewal-policy-id",
},
{
name: "missing issuer-id",
args: []string{"--owner-id", "o-alice", "--team-id", "t-platform", "--renewal-policy-id", "rp-default", "certs.pem"},
missing: "--issuer-id",
},
{
name: "no flags at all",
args: []string{"certs.pem"},
missing: "--owner-id",
},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
client := NewClient(server.URL, "", "table")
err := client.ImportCertificates(tc.args)
if err == nil {
t.Fatalf("expected error for %s, got nil", tc.name)
}
msg := err.Error()
if !containsStr(msg, tc.missing) {
t.Fatalf("expected error to name %q, got: %v", tc.missing, err)
}
if !containsStr(msg, "required") {
t.Fatalf("expected error message to mention 'required', got: %v", err)
}
})
}
if requestCount != 0 {
t.Fatalf("expected zero HTTP requests before flag validation, got %d", requestCount)
}
}
// TestClient_ImportCertificates_MissingPositionalArgs verifies the
// import command errors out when flags are present but no PEM file
// paths follow them.
func TestClient_ImportCertificates_MissingPositionalArgs(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
t.Errorf("unexpected HTTP request: %s %s", r.Method, r.URL.Path)
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
err := client.ImportCertificates([]string{
"--owner-id", "o-alice",
"--team-id", "t-platform",
"--renewal-policy-id", "rp-default",
"--issuer-id", "iss-local",
})
if err == nil {
t.Fatal("expected error when no PEM file paths are supplied")
}
if !containsStr(err.Error(), "PEM file") {
t.Fatalf("expected error to mention 'PEM file', got: %v", err)
}
}
// TestClient_ImportCertificates_SixFieldPayload verifies the happy
// path: given all four required flags plus a PEM file, the importer
// POSTs a request containing all six required fields plus the
// name-templateresolved name. The httptest handler decodes the
// request body and asserts every required field is populated with
// the values supplied via flags.
func TestClient_ImportCertificates_SixFieldPayload(t *testing.T) {
// Generate a test cert and write it to a temp PEM file.
cert := generateTestCert()
pemBlock := &pem.Block{Type: "CERTIFICATE", Bytes: cert.Raw}
pemPath := filepath.Join(t.TempDir(), "test.pem")
if err := os.WriteFile(pemPath, pem.EncodeToMemory(pemBlock), 0o600); err != nil {
t.Fatalf("write temp PEM: %v", err)
}
var gotBody map[string]interface{}
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.Method != "POST" || r.URL.Path != "/api/v1/certificates" {
w.WriteHeader(http.StatusNotFound)
return
}
if err := json.NewDecoder(r.Body).Decode(&gotBody); err != nil {
t.Errorf("decode request body: %v", err)
}
w.WriteHeader(http.StatusCreated)
w.Header().Set("Content-Type", "application/json")
_, _ = w.Write([]byte(`{"id":"mc-imported"}`))
}))
defer server.Close()
client := NewClient(server.URL, "", "table")
err := client.ImportCertificates([]string{
"--owner-id", "o-alice",
"--team-id", "t-platform",
"--renewal-policy-id", "rp-default",
"--issuer-id", "iss-local",
"--name-template", "imported-{cn}",
pemPath,
})
if err != nil {
t.Fatalf("ImportCertificates failed: %v", err)
}
// Verify every required field from the six-field contract is present.
required := []struct {
field string
want interface{}
}{
{"name", "imported-test.example.com"},
{"common_name", "test.example.com"},
{"issuer_id", "iss-local"},
{"owner_id", "o-alice"},
{"team_id", "t-platform"},
{"renewal_policy_id", "rp-default"},
}
for _, r := range required {
got, ok := gotBody[r.field]
if !ok {
t.Errorf("payload missing required field %q (body: %+v)", r.field, gotBody)
continue
}
if got != r.want {
t.Errorf("field %q = %v, want %v", r.field, got, r.want)
}
}
}
// containsStr is a tiny substring helper so the test file doesn't
// need a `strings` import dependency aside from what's already there.
func containsStr(haystack, needle string) bool {
for i := 0; i+len(needle) <= len(haystack); i++ {
if haystack[i:i+len(needle)] == needle {
return true
}
}
return false
}
// Helper function to generate a test certificate
func generateTestCert() *x509.Certificate {
now := time.Now()
+163 -5
View File
@@ -5,6 +5,7 @@ import (
"log/slog"
"os"
"strconv"
"strings"
"time"
)
@@ -706,6 +707,37 @@ type SchedulerConfig struct {
// Default: 1 minute. Minimum: 1 second. Sends notifications to Slack, Teams, PagerDuty, etc.
// Setting: CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL environment variable.
NotificationProcessInterval time.Duration
// RetryInterval is how often the scheduler retries failed jobs whose Attempts
// counter is below MaxAttempts. Default: 5 minutes. Minimum: 1 second.
// Transitions eligible Failed jobs back to Pending so the job processor can
// pick them up again (closes coverage gap I-001 — JobService.RetryFailedJobs
// had no caller prior to this loop being wired).
// Setting: CERTCTL_SCHEDULER_RETRY_INTERVAL environment variable.
RetryInterval time.Duration
// JobTimeoutInterval is how often the reaper loop sweeps AwaitingCSR and
// AwaitingApproval jobs for TTL expiration. Default: 10 minutes. Minimum: 1
// second. Timed-out jobs are transitioned to Failed with a descriptive error
// message; I-001's retry loop then auto-promotes eligible Failed jobs back
// to Pending (closes coverage gap I-003).
// Setting: CERTCTL_JOB_TIMEOUT_INTERVAL environment variable.
JobTimeoutInterval time.Duration
// AwaitingCSRTimeout is the maximum age an AwaitingCSR job can remain in
// that state before the reaper transitions it to Failed. Default: 24 hours.
// An agent that hasn't submitted a CSR within this window is presumed
// unreachable. Minimum: 1 second.
// Setting: CERTCTL_JOB_AWAITING_CSR_TIMEOUT environment variable.
AwaitingCSRTimeout time.Duration
// AwaitingApprovalTimeout is the maximum age an AwaitingApproval job can
// remain in that state before the reaper transitions it to Failed. Default:
// 168 hours (7 days). Reviewers who haven't approved within this window
// force the renewal to fail loudly rather than silently stall. Minimum: 1
// second.
// Setting: CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT environment variable.
AwaitingApprovalTimeout time.Duration
}
// LogConfig contains logging configuration.
@@ -721,6 +753,19 @@ type LogConfig struct {
Format string
}
// NamedAPIKey represents a single named API key with an optional admin flag.
// Named keys allow real actor attribution in the audit trail (M-002) and provide
// the admin-gate basis for privileged endpoints like bulk revocation (M-003).
type NamedAPIKey struct {
// Name is the identifier for the key (alphanumeric, hyphens, underscores).
// This value is recorded as the actor on every audit event the key authenticates.
Name string
// Key is the raw API-key secret the client presents as `Authorization: Bearer <key>`.
Key string
// Admin controls whether the key has admin privileges (bulk revocation, etc.).
Admin bool
}
// AuthConfig contains authentication configuration.
type AuthConfig struct {
// Type sets the authentication mechanism for the REST API.
@@ -730,12 +775,19 @@ type AuthConfig struct {
// Setting: CERTCTL_AUTH_TYPE environment variable. Default: "api-key".
Type string
// Secret is the authentication secret (API key hash, JWT signing key, etc.).
// For "api-key": the base64-encoded API key to validate against.
// For "jwt": the secret used to verify JWT token signatures.
// For "none": ignored.
// Setting: CERTCTL_AUTH_SECRET environment variable. Required for "api-key" and "jwt".
// Secret is the legacy authentication secret (comma-separated API keys).
// DEPRECATED in favor of NamedKeys — retained for backward compatibility.
// When NamedKeys is empty and Secret is set, each comma-separated key is
// registered as a synthesized named key (legacy-key-0, legacy-key-1, ...)
// with actor attribution defaulting to "legacy-key-<index>".
// Setting: CERTCTL_AUTH_SECRET environment variable.
Secret string
// NamedKeys is the parsed set of named API keys. Populated from
// CERTCTL_API_KEYS_NAMED via ParseNamedAPIKeys during Load(). When
// non-empty, this takes precedence over the legacy Secret field.
// Setting: CERTCTL_API_KEYS_NAMED="name1:key1,name2:key2:admin"
NamedKeys []NamedAPIKey
}
// RateLimitConfig contains rate limiting configuration.
@@ -786,6 +838,10 @@ func Load() (*Config, error) {
JobProcessorInterval: getEnvDuration("CERTCTL_SCHEDULER_JOB_PROCESSOR_INTERVAL", 30*time.Second),
AgentHealthCheckInterval: getEnvDuration("CERTCTL_SCHEDULER_AGENT_HEALTH_CHECK_INTERVAL", 2*time.Minute),
NotificationProcessInterval: getEnvDuration("CERTCTL_SCHEDULER_NOTIFICATION_PROCESS_INTERVAL", 1*time.Minute),
RetryInterval: getEnvDuration("CERTCTL_SCHEDULER_RETRY_INTERVAL", 5*time.Minute),
JobTimeoutInterval: getEnvDuration("CERTCTL_JOB_TIMEOUT_INTERVAL", 10*time.Minute),
AwaitingCSRTimeout: getEnvDuration("CERTCTL_JOB_AWAITING_CSR_TIMEOUT", 24*time.Hour),
AwaitingApprovalTimeout: getEnvDuration("CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT", 168*time.Hour),
},
Log: LogConfig{
Level: getEnv("CERTCTL_LOG_LEVEL", "info"),
@@ -794,6 +850,8 @@ func Load() (*Config, error) {
Auth: AuthConfig{
Type: getEnv("CERTCTL_AUTH_TYPE", "api-key"),
Secret: getEnv("CERTCTL_AUTH_SECRET", ""),
// NamedKeys is populated from CERTCTL_API_KEYS_NAMED below so Load()
// can surface parse errors alongside other config errors.
},
RateLimit: RateLimitConfig{
Enabled: getEnvBool("CERTCTL_RATE_LIMIT_ENABLED", true),
@@ -959,6 +1017,14 @@ func Load() (*Config, error) {
},
}
// Parse CERTCTL_API_KEYS_NAMED for named key authentication (M-002).
// Parse errors surface here so invalid config fails fast at startup.
named, err := ParseNamedAPIKeys(getEnv("CERTCTL_API_KEYS_NAMED", ""))
if err != nil {
return nil, fmt.Errorf("parse CERTCTL_API_KEYS_NAMED: %w", err)
}
cfg.Auth.NamedKeys = named
if err := cfg.Validate(); err != nil {
return nil, err
}
@@ -1043,6 +1109,22 @@ func (c *Config) Validate() error {
return fmt.Errorf("notification process interval must be at least 1 second")
}
if c.Scheduler.RetryInterval < 1*time.Second {
return fmt.Errorf("retry interval must be at least 1 second")
}
if c.Scheduler.JobTimeoutInterval < 1*time.Second {
return fmt.Errorf("job timeout interval must be at least 1 second")
}
if c.Scheduler.AwaitingCSRTimeout < 1*time.Second {
return fmt.Errorf("awaiting CSR timeout must be at least 1 second")
}
if c.Scheduler.AwaitingApprovalTimeout < 1*time.Second {
return fmt.Errorf("awaiting approval timeout must be at least 1 second")
}
return nil
}
@@ -1167,3 +1249,79 @@ func (c *Config) GetLogLevel() slog.Level {
return slog.LevelInfo
}
}
// ParseNamedAPIKeys parses the CERTCTL_API_KEYS_NAMED environment variable.
// Format: "name1:key1,name2:key2:admin,name3:key3"
// The ":admin" suffix is optional; if present, the key has admin privileges.
// Returns a typed []NamedAPIKey so main.go can pass it directly to the
// middleware layer without type assertion gymnastics.
func ParseNamedAPIKeys(input string) ([]NamedAPIKey, error) {
if input == "" {
return nil, nil
}
parts := splitComma(input)
var keys []NamedAPIKey
seen := make(map[string]bool)
for _, part := range parts {
part = trimSpace(part)
if part == "" {
continue
}
// Split by colon: name:key or name:key:admin
fields := strings.Split(part, ":")
if len(fields) < 2 || len(fields) > 3 {
return nil, fmt.Errorf("invalid named key format: %s (expected name:key or name:key:admin)", part)
}
name := trimSpace(fields[0])
key := trimSpace(fields[1])
admin := false
if len(fields) == 3 {
adminStr := trimSpace(fields[2])
if adminStr == "admin" {
admin = true
} else {
return nil, fmt.Errorf("invalid admin flag: %s (expected 'admin')", adminStr)
}
}
// Validate name format: alphanumeric, hyphens, underscores
if !isValidKeyName(name) {
return nil, fmt.Errorf("invalid key name: %s (must be alphanumeric, hyphens, underscores)", name)
}
if seen[name] {
return nil, fmt.Errorf("duplicate key name: %s", name)
}
seen[name] = true
if key == "" {
return nil, fmt.Errorf("empty key for name: %s", name)
}
keys = append(keys, NamedAPIKey{
Name: name,
Key: key,
Admin: admin,
})
}
return keys, nil
}
// isValidKeyName checks if a key name is valid (alphanumeric, hyphens, underscores).
func isValidKeyName(s string) bool {
if len(s) == 0 {
return false
}
for _, c := range s {
if !((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == '-' || c == '_') {
return false
}
}
return true
}
+126
View File
@@ -4,6 +4,7 @@ import (
"log/slog"
"os"
"testing"
"strings"
"time"
)
@@ -328,6 +329,10 @@ func TestValidate_ValidConfig(t *testing.T) {
JobProcessorInterval: 30 * time.Second,
AgentHealthCheckInterval: 2 * time.Minute,
NotificationProcessInterval: 1 * time.Minute,
RetryInterval: 5 * time.Minute,
JobTimeoutInterval: 10 * time.Minute,
AwaitingCSRTimeout: 24 * time.Hour,
AwaitingApprovalTimeout: 168 * time.Hour,
},
}
if err := cfg.Validate(); err != nil {
@@ -347,6 +352,10 @@ func TestValidate_AuthTypeNone(t *testing.T) {
JobProcessorInterval: 30 * time.Second,
AgentHealthCheckInterval: 2 * time.Minute,
NotificationProcessInterval: 1 * time.Minute,
RetryInterval: 5 * time.Minute,
JobTimeoutInterval: 10 * time.Minute,
AwaitingCSRTimeout: 24 * time.Hour,
AwaitingApprovalTimeout: 168 * time.Hour,
},
}
if err := cfg.Validate(); err != nil {
@@ -706,3 +715,120 @@ func TestGetEnvBool(t *testing.T) {
})
}
}
// I-003: Job timeout reaper configuration tests
func TestConfig_Scheduler_JobTimeoutDefaults(t *testing.T) {
clearCertctlEnv(t)
setMinimalValidEnv(t)
// Explicitly unset the three I-003 env vars to exercise the default path.
t.Setenv("CERTCTL_JOB_TIMEOUT_INTERVAL", "")
t.Setenv("CERTCTL_JOB_AWAITING_CSR_TIMEOUT", "")
t.Setenv("CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT", "")
cfg, err := Load()
if err != nil {
t.Fatalf("Load() error: %v", err)
}
if cfg.Scheduler.JobTimeoutInterval != 10*time.Minute {
t.Errorf("JobTimeoutInterval = %v, want 10m", cfg.Scheduler.JobTimeoutInterval)
}
if cfg.Scheduler.AwaitingCSRTimeout != 24*time.Hour {
t.Errorf("AwaitingCSRTimeout = %v, want 24h", cfg.Scheduler.AwaitingCSRTimeout)
}
if cfg.Scheduler.AwaitingApprovalTimeout != 168*time.Hour {
t.Errorf("AwaitingApprovalTimeout = %v, want 168h", cfg.Scheduler.AwaitingApprovalTimeout)
}
}
func TestConfig_Scheduler_JobTimeoutEnvOverride(t *testing.T) {
clearCertctlEnv(t)
setMinimalValidEnv(t)
t.Setenv("CERTCTL_JOB_TIMEOUT_INTERVAL", "15m")
t.Setenv("CERTCTL_JOB_AWAITING_CSR_TIMEOUT", "48h")
t.Setenv("CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT", "336h")
cfg, err := Load()
if err != nil {
t.Fatalf("Load() error: %v", err)
}
if cfg.Scheduler.JobTimeoutInterval != 15*time.Minute {
t.Errorf("JobTimeoutInterval = %v, want 15m", cfg.Scheduler.JobTimeoutInterval)
}
if cfg.Scheduler.AwaitingCSRTimeout != 48*time.Hour {
t.Errorf("AwaitingCSRTimeout = %v, want 48h", cfg.Scheduler.AwaitingCSRTimeout)
}
if cfg.Scheduler.AwaitingApprovalTimeout != 336*time.Hour {
t.Errorf("AwaitingApprovalTimeout = %v, want 336h", cfg.Scheduler.AwaitingApprovalTimeout)
}
}
func TestConfig_Scheduler_JobTimeoutValidation(t *testing.T) {
tests := []struct {
name string
field string
value time.Duration
wantErrMsg string
}{
{
"JobTimeoutInterval too small",
"JobTimeoutInterval",
500 * time.Millisecond,
"job timeout interval must be at least 1 second",
},
{
"AwaitingCSRTimeout too small",
"AwaitingCSRTimeout",
500 * time.Millisecond,
"awaiting CSR timeout must be at least 1 second",
},
{
"AwaitingApprovalTimeout too small",
"AwaitingApprovalTimeout",
500 * time.Millisecond,
"awaiting approval timeout must be at least 1 second",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
// Start from a fully valid config so the I-003 timeout checks
// are the only potential failure point.
cfg := &Config{
Server: ServerConfig{Port: 8080},
Database: DatabaseConfig{URL: "postgres://localhost/certctl", MaxConnections: 25},
Log: LogConfig{Level: "info", Format: "json"},
Auth: AuthConfig{Type: "api-key", Secret: "test-secret"},
Keygen: KeygenConfig{Mode: "agent"},
Scheduler: SchedulerConfig{
RenewalCheckInterval: 1 * time.Minute,
JobProcessorInterval: 1 * time.Minute,
AgentHealthCheckInterval: 1 * time.Minute,
NotificationProcessInterval: 1 * time.Minute,
RetryInterval: 1 * time.Minute,
JobTimeoutInterval: 10 * time.Minute,
AwaitingCSRTimeout: 24 * time.Hour,
AwaitingApprovalTimeout: 168 * time.Hour,
},
}
// Override the specific field under test
switch tt.field {
case "JobTimeoutInterval":
cfg.Scheduler.JobTimeoutInterval = tt.value
case "AwaitingCSRTimeout":
cfg.Scheduler.AwaitingCSRTimeout = tt.value
case "AwaitingApprovalTimeout":
cfg.Scheduler.AwaitingApprovalTimeout = tt.value
}
err := cfg.Validate()
if err == nil {
t.Fatalf("Validate() = nil, want error containing %q", tt.wantErrMsg)
}
if !strings.Contains(err.Error(), tt.wantErrMsg) {
t.Errorf("Validate() error = %q, want to contain %q", err.Error(), tt.wantErrMsg)
}
})
}
}
+63
View File
@@ -32,6 +32,8 @@ type DeploymentTarget struct {
LastTestedAt *time.Time `json:"last_tested_at,omitempty"`
TestStatus string `json:"test_status,omitempty"`
Source string `json:"source,omitempty"`
RetiredAt *time.Time `json:"retired_at,omitempty"` // I-004: soft-retirement timestamp (nil = active)
RetiredReason *string `json:"retired_reason,omitempty"` // I-004: reason captured at cascade retirement
CreatedAt time.Time `json:"created_at"`
UpdatedAt time.Time `json:"updated_at"`
}
@@ -49,6 +51,67 @@ type Agent struct {
Architecture string `json:"architecture"`
IPAddress string `json:"ip_address"`
Version string `json:"version"`
// I-004: soft-retirement fields. An agent with RetiredAt != nil is the
// canonical "retired" state. The Status column remains as before (Online
// / Offline / Degraded) and is preserved at retirement time as the
// last-seen operational status; RetiredAt is the source of truth for
// "should we filter this row from active listings?".
RetiredAt *time.Time `json:"retired_at,omitempty"`
RetiredReason *string `json:"retired_reason,omitempty"`
}
// IsRetired returns true when this agent has been soft-retired.
// I-004: callers that iterate active agents (stats dashboard, stale-offline
// sweeper, handler-facing list) must skip retired rows by default.
func (a *Agent) IsRetired() bool { return a != nil && a.RetiredAt != nil }
// AgentDependencyCounts captures the active downstream rows that would be
// affected by retiring an agent. Returned by the preflight pass on
// DELETE /api/v1/agents/{id}. Zero counts mean a clean soft-retire is safe;
// any non-zero count blocks a default retire with HTTP 409 and requires an
// explicit ?force=true&reason=... escape hatch from the operator.
type AgentDependencyCounts struct {
ActiveTargets int `json:"active_targets"` // deployment_targets.agent_id=id AND retired_at IS NULL
ActiveCertificates int `json:"active_certificates"` // certificates currently deployed via one of this agent's active targets
PendingJobs int `json:"pending_jobs"` // jobs.agent_id=id AND status IN (Pending, AwaitingCSR, AwaitingApproval, Running)
}
// HasDependencies reports whether any preflight counter is non-zero.
func (d AgentDependencyCounts) HasDependencies() bool {
return d.ActiveTargets > 0 || d.ActiveCertificates > 0 || d.PendingJobs > 0
}
// SentinelAgentIDs enumerates the four reserved agent identities that back
// non-agent discovery subsystems. These rows are created by cmd/server on
// startup and retiring them would orphan their subsystem — the network
// scanner and the three cloud secret-manager sources all key writes to
// these IDs via service.SentinelAgentID / service.SentinelAWSSecretsMgr /
// service.SentinelAzureKeyVault / service.SentinelGCPSecretMgr. The four
// literal IDs below MUST stay in lockstep with those service-package
// constants (see internal/service/network_scan.go line 23 and
// internal/service/cloud_discovery.go lines 14-16).
//
// The retirement service refuses them unconditionally — even with
// ?force=true — via ErrAgentIsSentinel. Living here (and not in the
// service package) lets handler, repository, and scheduler code filter
// them without importing service and creating a cycle.
var SentinelAgentIDs = []string{
"server-scanner",
"cloud-aws-sm",
"cloud-azure-kv",
"cloud-gcp-sm",
}
// IsSentinelAgent reports whether id matches one of the four reserved
// sentinel agent IDs. A linear scan is fine — the slice is length 4 and
// the check is rare (only on retirement attempts and sweeper filters).
func IsSentinelAgent(id string) bool {
for _, s := range SentinelAgentIDs {
if s == id {
return true
}
}
return false
}
// AgentMetadata contains runtime metadata reported by agents via heartbeat.
+55
View File
@@ -0,0 +1,55 @@
package domain
import (
"testing"
"time"
)
// TestAgent_IsRetired covers the I-004 soft-retirement predicate that gates
// which callers hide an agent row from active listings.
func TestAgent_IsRetired(t *testing.T) {
t.Run("nil receiver is not retired", func(t *testing.T) {
var a *Agent
if a.IsRetired() {
t.Fatalf("nil *Agent should not be retired")
}
})
t.Run("zero value is not retired", func(t *testing.T) {
a := &Agent{}
if a.IsRetired() {
t.Fatalf("zero Agent should not be retired")
}
})
t.Run("RetiredAt set is retired", func(t *testing.T) {
now := time.Now()
a := &Agent{RetiredAt: &now}
if !a.IsRetired() {
t.Fatalf("Agent with RetiredAt != nil must be retired")
}
})
}
// TestAgentDependencyCounts_HasDependencies verifies the preflight
// aggregation helper used by the 409 block path of DELETE /agents/{id}.
func TestAgentDependencyCounts_HasDependencies(t *testing.T) {
cases := []struct {
name string
counts AgentDependencyCounts
want bool
}{
{"all zero", AgentDependencyCounts{}, false},
{"active target", AgentDependencyCounts{ActiveTargets: 1}, true},
{"active cert", AgentDependencyCounts{ActiveCertificates: 1}, true},
{"pending job", AgentDependencyCounts{PendingJobs: 1}, true},
{"mixed", AgentDependencyCounts{ActiveTargets: 3, PendingJobs: 2}, true},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
if got := tc.counts.HasDependencies(); got != tc.want {
t.Fatalf("HasDependencies()=%v want=%v counts=%+v", got, tc.want, tc.counts)
}
})
}
}
+7 -5
View File
@@ -12,6 +12,7 @@ type PolicyRule struct {
Type PolicyType `json:"type"`
Config json.RawMessage `json:"config"`
Enabled bool `json:"enabled"`
Severity PolicySeverity `json:"severity"`
CreatedAt time.Time `json:"created_at"`
UpdatedAt time.Time `json:"updated_at"`
}
@@ -20,11 +21,12 @@ type PolicyRule struct {
type PolicyType string
const (
PolicyTypeAllowedIssuers PolicyType = "AllowedIssuers"
PolicyTypeAllowedDomains PolicyType = "AllowedDomains"
PolicyTypeRequiredMetadata PolicyType = "RequiredMetadata"
PolicyTypeAllowedEnvironments PolicyType = "AllowedEnvironments"
PolicyTypeRenewalLeadTime PolicyType = "RenewalLeadTime"
PolicyTypeAllowedIssuers PolicyType = "AllowedIssuers"
PolicyTypeAllowedDomains PolicyType = "AllowedDomains"
PolicyTypeRequiredMetadata PolicyType = "RequiredMetadata"
PolicyTypeAllowedEnvironments PolicyType = "AllowedEnvironments"
PolicyTypeRenewalLeadTime PolicyType = "RenewalLeadTime"
PolicyTypeCertificateLifetime PolicyType = "CertificateLifetime"
)
// PolicyViolation records an instance of a certificate violating a policy rule.
+19 -11
View File
@@ -158,7 +158,7 @@ func TestCrossResourceWorkflow(t *testing.T) {
payload := map[string]interface{}{
"name": "Allowed Domains Policy",
"type": "AllowedDomains",
"severity": "High",
"severity": "Error",
"config": json.RawMessage(`{"domains": ["example.com", "*.example.com"]}`),
"description": "Restrict issuance to example.com domains",
}
@@ -517,12 +517,18 @@ func TestNotificationEndpoints(t *testing.T) {
})
}
// TestCRLEndpoint exercises the CRL listing endpoint (M15a).
// TestCRLEndpoint exercises the RFC 5280 DER-encoded CRL endpoint served
// unauthenticated at /.well-known/pki/crl/{issuer_id} (M-006 relocation from
// the pre-M-006 JSON CRL at /api/v1/crl, which was removed entirely because
// RFC 5280 §5 defines only the DER wire format).
func TestCRLEndpoint(t *testing.T) {
server, _, _, _ := setupTestServer(t)
t.Run("GetCRL_JSON", func(t *testing.T) {
resp, err := http.Get(server.URL + "/api/v1/crl")
t.Run("GetDERCRL_Unauthenticated", func(t *testing.T) {
// Intentionally no Authorization header — relying parties can't present
// a certctl API key, so the PKI endpoints are exposed under the
// RFC 8615 `.well-known` namespace with auth bypassed.
resp, err := http.Get(server.URL + "/.well-known/pki/crl/iss-local")
if err != nil {
t.Fatalf("request failed: %v", err)
}
@@ -531,15 +537,17 @@ func TestCRLEndpoint(t *testing.T) {
bodyBytes, _ := io.ReadAll(resp.Body)
t.Fatalf("expected 200, got %d: %s", resp.StatusCode, string(bodyBytes))
}
var crl map[string]interface{}
json.NewDecoder(resp.Body).Decode(&crl)
if crl["version"] == nil {
t.Error("expected version field in CRL response")
if ct := resp.Header.Get("Content-Type"); ct != "application/pkix-crl" {
t.Errorf("expected Content-Type application/pkix-crl, got %s", ct)
}
if crl["entries"] == nil {
t.Error("expected entries field in CRL response")
body, err := io.ReadAll(resp.Body)
if err != nil {
t.Fatalf("read body failed: %v", err)
}
t.Logf("CRL response: version=%v, entries_count=%v", crl["version"], crl["total"])
if len(body) == 0 {
t.Error("expected non-empty DER CRL body")
}
t.Logf("DER CRL response: %d bytes", len(body))
})
}
+134 -3
View File
@@ -3,6 +3,7 @@ package integration
import (
"bytes"
"context"
"database/sql"
"encoding/json"
"fmt"
"io"
@@ -64,7 +65,8 @@ func TestCertificateLifecycle(t *testing.T) {
certificateService.SetTargetRepo(targetRepo)
renewalService := service.NewRenewalService(certRepo, jobRepo, renewalPolicyRepo, nil, auditService, notificationService, issuerRegistry, "server")
deploymentService := service.NewDeploymentService(jobRepo, targetRepo, agentRepo, certRepo, auditService, notificationService)
jobService := service.NewJobService(jobRepo, renewalService, deploymentService, logger)
ownerRepo := newMockOwnerRepository()
jobService := service.NewJobService(jobRepo, certRepo, ownerRepo, renewalService, deploymentService, logger)
agentService := service.NewAgentService(agentRepo, certRepo, jobRepo, targetRepo, auditService, issuerRegistry, renewalService)
// 32-byte AES-256 test key — C-2 remediation makes IssuerService fail closed
// without a configured CERTCTL_CONFIG_ENCRYPTION_KEY. Happy-path CRUD tests
@@ -585,6 +587,24 @@ func (m *mockCertificateRepository) GetLatestVersion(ctx context.Context, certID
return versions[len(versions)-1], nil
}
// GetByIssuerAndSerial emulates the PostgreSQL JOIN that scopes cert lookup to
// (issuer_id, serial). Returns sql.ErrNoRows when no match exists so callers
// that branch on errors.Is(err, sql.ErrNoRows) (notably the OCSP handler's
// M-004 "unknown" fallback) behave the same in-memory as against PostgreSQL.
func (m *mockCertificateRepository) GetByIssuerAndSerial(ctx context.Context, issuerID, serial string) (*domain.ManagedCertificate, error) {
for _, cert := range m.certs {
if cert.IssuerID != issuerID {
continue
}
for _, v := range m.versions[cert.ID] {
if v.SerialNumber == serial {
return cert, nil
}
}
}
return nil, sql.ErrNoRows
}
type mockJobRepository struct {
jobs map[string]*domain.Job
}
@@ -722,6 +742,25 @@ func (m *mockJobRepository) ClaimPendingByAgentID(ctx context.Context, agentID s
return result, nil
}
// ListTimedOutAwaitingJobs is the I-003 integration-mock stub. Returns jobs whose
// created_at predates the relevant cutoff for their status.
func (m *mockJobRepository) ListTimedOutAwaitingJobs(ctx context.Context, csrCutoff, approvalCutoff time.Time) ([]*domain.Job, error) {
var jobs []*domain.Job
for _, j := range m.jobs {
switch j.Status {
case domain.JobStatusAwaitingCSR:
if j.CreatedAt.Before(csrCutoff) {
jobs = append(jobs, j)
}
case domain.JobStatusAwaitingApproval:
if j.CreatedAt.Before(approvalCutoff) {
jobs = append(jobs, j)
}
}
}
return jobs, nil
}
type mockAuditRepository struct {
events []*domain.AuditEvent
}
@@ -809,6 +848,56 @@ func (m *mockAgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (
return nil, fmt.Errorf("agent not found")
}
// I-004: the integration-level mockAgentRepository implements the 6 new
// retirement-surface methods as thin contract-satisfying stubs. The
// integration suite exercises lifecycle flows (issue → renew → deploy)
// that don't touch retirement, so these methods never need real behavior
// here — they exist purely to keep mockAgentRepository a valid
// AgentRepository implementation after migration 000015 expanded the
// interface. Dedicated retirement tests live in internal/service/
// agent_retire_test.go against the richer service-layer mockAgentRepo.
func (m *mockAgentRepository) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
var retired []*domain.Agent
for _, a := range m.agents {
if a.RetiredAt != nil {
retired = append(retired, a)
}
}
return retired, len(retired), nil
}
func (m *mockAgentRepository) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
agent, ok := m.agents[id]
if !ok {
return fmt.Errorf("agent not found")
}
if agent.RetiredAt != nil {
return nil
}
stamped := retiredAt
agent.RetiredAt = &stamped
stampedReason := reason
agent.RetiredReason = &stampedReason
return nil
}
func (m *mockAgentRepository) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
return m.SoftRetire(ctx, id, retiredAt, reason)
}
func (m *mockAgentRepository) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
return 0, nil
}
func (m *mockAgentRepository) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
return 0, nil
}
func (m *mockAgentRepository) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
return 0, nil
}
type mockTargetRepository struct {
targets map[string]*domain.DeploymentTarget
}
@@ -862,6 +951,48 @@ func (m *mockTargetRepository) ListByCertificate(ctx context.Context, certID str
return m.List(ctx)
}
// mockOwnerRepository satisfies repository.OwnerRepository for the M-003
// not-self approval wiring. Tests that don't care about owner lookup get an
// empty map (Get returns errNotFound, which checkNotSelf permits).
type mockOwnerRepository struct {
owners map[string]*domain.Owner
}
func newMockOwnerRepository() *mockOwnerRepository {
return &mockOwnerRepository{owners: make(map[string]*domain.Owner)}
}
func (m *mockOwnerRepository) List(ctx context.Context) ([]*domain.Owner, error) {
var out []*domain.Owner
for _, o := range m.owners {
out = append(out, o)
}
return out, nil
}
func (m *mockOwnerRepository) Get(ctx context.Context, id string) (*domain.Owner, error) {
o, ok := m.owners[id]
if !ok {
return nil, fmt.Errorf("owner not found")
}
return o, nil
}
func (m *mockOwnerRepository) Create(ctx context.Context, o *domain.Owner) error {
m.owners[o.ID] = o
return nil
}
func (m *mockOwnerRepository) Update(ctx context.Context, o *domain.Owner) error {
m.owners[o.ID] = o
return nil
}
func (m *mockOwnerRepository) Delete(ctx context.Context, id string) error {
delete(m.owners, id)
return nil
}
type mockNotificationRepository struct {
notifications []*domain.NotificationEvent
}
@@ -1258,11 +1389,11 @@ func (m *mockDiscoveryService) GetDiscovered(ctx context.Context, id string) (*d
return nil, fmt.Errorf("not found")
}
func (m *mockDiscoveryService) ClaimDiscovered(ctx context.Context, id string, managedCertID string) error {
func (m *mockDiscoveryService) ClaimDiscovered(ctx context.Context, id string, managedCertID string, actor string) error {
return nil
}
func (m *mockDiscoveryService) DismissDiscovered(ctx context.Context, id string) error {
func (m *mockDiscoveryService) DismissDiscovered(ctx context.Context, id string, actor string) error {
return nil
}
+23 -12
View File
@@ -56,7 +56,8 @@ func setupTestServer(t *testing.T) (*httptest.Server, *mockCertificateRepository
certificateService.SetCAOperationsSvc(caOperationsSvc)
renewalService := service.NewRenewalService(certRepo, jobRepo, renewalPolicyRepo, nil, auditService, notificationService, issuerRegistry, "server")
deploymentService := service.NewDeploymentService(jobRepo, targetRepo, agentRepo, certRepo, auditService, notificationService)
jobService := service.NewJobService(jobRepo, renewalService, deploymentService, logger)
ownerRepo := newMockOwnerRepository()
jobService := service.NewJobService(jobRepo, certRepo, ownerRepo, renewalService, deploymentService, logger)
agentService := service.NewAgentService(agentRepo, certRepo, jobRepo, targetRepo, auditService, issuerRegistry, renewalService)
// 32-byte AES-256 test key — C-2 remediation makes IssuerService fail closed
// without a configured CERTCTL_CONFIG_ENCRYPTION_KEY. Happy-path CRUD tests
@@ -112,6 +113,10 @@ func setupTestServer(t *testing.T) (*httptest.Server, *mockCertificateRepository
BulkRevocation: handler.BulkRevocationHandler{},
})
r.RegisterESTHandlers(estHandler)
// M-006: CRL + OCSP live under /.well-known/pki/ (RFC 5280 + RFC 6960 + RFC 8615).
// The negative_test integration suite exercises the DER CRL at this path with
// no Authorization header to verify the relying-party contract.
r.RegisterPKIHandlers(certificateHandler)
server := httptest.NewServer(r)
t.Cleanup(func() { server.Close() })
@@ -789,8 +794,14 @@ func TestRevocationEndpoints(t *testing.T) {
}
})
t.Run("GetCRL_Success", func(t *testing.T) {
resp, err := http.Get(server.URL + "/api/v1/crl")
// M-006: the non-standard JSON CRL at GET /api/v1/crl was removed entirely.
// RFC 5280 §5 defines only the DER wire format, which is now served
// unauthenticated under /.well-known/pki/crl/{issuer_id} (RFC 8615) so
// relying parties can fetch revocation data without a certctl API key.
// We verify the contract by requesting with no Authorization header and
// asserting DER content-type + a non-empty body.
t.Run("GetDERCRL_Unauthenticated", func(t *testing.T) {
resp, err := http.Get(server.URL + "/.well-known/pki/crl/iss-local")
if err != nil {
t.Fatalf("request failed: %v", err)
}
@@ -801,17 +812,17 @@ func TestRevocationEndpoints(t *testing.T) {
t.Fatalf("expected 200, got %d: %s", resp.StatusCode, string(bodyBytes))
}
var crl map[string]interface{}
json.NewDecoder(resp.Body).Decode(&crl)
if crl["version"] != float64(1) {
t.Errorf("expected CRL version 1, got %v", crl["version"])
ct := resp.Header.Get("Content-Type")
if ct != "application/pkix-crl" {
t.Errorf("expected Content-Type application/pkix-crl, got %s", ct)
}
// Should have at least 1 entry from the revocation above
total, _ := crl["total"].(float64)
if total < 1 {
t.Errorf("expected at least 1 CRL entry, got %v", total)
body, err := io.ReadAll(resp.Body)
if err != nil {
t.Fatalf("read body failed: %v", err)
}
if len(body) == 0 {
t.Error("expected non-empty DER CRL body")
}
})
}
+10
View File
@@ -49,6 +49,16 @@ func (c *Client) Delete(path string) (json.RawMessage, error) {
return c.do("DELETE", path, nil, nil)
}
// DeleteWithQuery performs an HTTP DELETE with query parameters. I-004 adds
// this transport so MCP tools can target endpoints that carry flags in the
// query string (e.g. DELETE /api/v1/agents/{id}?force=true&reason=…). Client.Delete
// is path-only; without this method the retire tool silently drops force/reason,
// turning every cascade retire into a default soft-retire. Shares do()'s 204
// normalization and 4xx/5xx error propagation so tool authors get one contract.
func (c *Client) DeleteWithQuery(path string, query url.Values) (json.RawMessage, error) {
return c.do("DELETE", path, query, nil)
}
// GetRaw performs an HTTP GET and returns the raw response body bytes and content type.
// Used for binary responses (DER CRL, OCSP).
func (c *Client) GetRaw(path string) ([]byte, string, error) {
+2 -2
View File
@@ -203,7 +203,7 @@ func TestClient_GetRaw(t *testing.T) {
defer server.Close()
c := NewClient(server.URL, "test-key")
data, contentType, err := c.GetRaw("/api/v1/crl/iss-local")
data, contentType, err := c.GetRaw("/.well-known/pki/crl/iss-local")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
@@ -223,7 +223,7 @@ func TestClient_GetRaw_Error(t *testing.T) {
defer server.Close()
c := NewClient(server.URL, "test-key")
_, _, err := c.GetRaw("/api/v1/crl/nonexistent")
_, _, err := c.GetRaw("/.well-known/pki/crl/nonexistent")
if err == nil {
t.Fatal("expected error for 404 response")
}
+214
View File
@@ -0,0 +1,214 @@
package mcp
import (
"encoding/json"
"net/http"
"net/http/httptest"
"net/url"
"strings"
"testing"
)
// TestClient_DeleteWithQuery_ForceRetire covers the new transport capability
// that I-004 adds to the MCP client. The retire tool needs to issue
// DELETE /api/v1/agents/{id}?force=true&reason=... — Client.Delete as it
// stands only accepts a path, dropping query parameters on the floor. Phase 2b
// must add DeleteWithQuery so the MCP retire tool can hit the force escape
// hatch; without this, every retire-via-MCP call with force=true silently
// becomes a default soft-retire and either succeeds wrongly or 409s.
func TestClient_DeleteWithQuery_ForceRetire(t *testing.T) {
var (
sawMethod string
sawPath string
sawForce string
sawReason string
)
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sawMethod = r.Method
sawPath = r.URL.Path
sawForce = r.URL.Query().Get("force")
sawReason = r.URL.Query().Get("reason")
if r.Method != http.MethodDelete || r.URL.Path != "/api/v1/agents/ag-1" {
w.WriteHeader(http.StatusNotFound)
return
}
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_ = json.NewEncoder(w).Encode(map[string]interface{}{
"retired_at": "2026-04-18T12:00:00Z",
"already_retired": false,
"cascade": true,
})
}))
defer server.Close()
c := NewClient(server.URL, "test-key")
// Compile-fail until Phase 2b grows Client.DeleteWithQuery. Passing the
// query as a url.Values is the established pattern (matches Get's shape).
query := url.Values{}
query.Set("force", "true")
query.Set("reason", "decommissioning rack 7")
data, err := c.DeleteWithQuery("/api/v1/agents/ag-1", query)
if err != nil {
t.Fatalf("DeleteWithQuery err=%v want nil", err)
}
if data == nil {
t.Fatal("DeleteWithQuery returned nil data; want 200 body echo-back")
}
if sawMethod != http.MethodDelete {
t.Errorf("method=%q want DELETE", sawMethod)
}
if sawPath != "/api/v1/agents/ag-1" {
t.Errorf("path=%q want /api/v1/agents/ag-1 (query must be stripped from path)", sawPath)
}
if sawForce != "true" {
t.Errorf("force query=%q want \"true\"", sawForce)
}
if sawReason != "decommissioning rack 7" {
t.Errorf("reason query=%q want %q", sawReason, "decommissioning rack 7")
}
}
// TestClient_DeleteWithQuery_NoQuery covers the defensive path: a nil/empty
// query must still produce a clean DELETE against the bare path with no stray
// "?" suffix. Matches the Get() shape (see client.go do()) so downstream tools
// can reuse one code path.
func TestClient_DeleteWithQuery_NoQuery(t *testing.T) {
var sawRawPath string
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
sawRawPath = r.URL.RequestURI()
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusOK)
_ = json.NewEncoder(w).Encode(map[string]interface{}{"ok": true})
}))
defer server.Close()
c := NewClient(server.URL, "")
if _, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil); err != nil {
t.Fatalf("DeleteWithQuery(nil query) err=%v want nil", err)
}
// No query → no ? suffix.
if strings.Contains(sawRawPath, "?") {
t.Errorf("raw path=%q contains stray ?; empty query must not serialize", sawRawPath)
}
}
// TestClient_DeleteWithQuery_204ReturnsMinimalBody covers the idempotent path.
// The handler returns 204 No Content for an already-retired agent; the
// existing do() helper normalises this to {"status":"deleted"}. The new
// DeleteWithQuery must share that behavior so MCP tool authors don't have to
// special-case the return shape.
func TestClient_DeleteWithQuery_204ReturnsMinimalBody(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusNoContent)
}))
defer server.Close()
c := NewClient(server.URL, "")
data, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil)
if err != nil {
t.Fatalf("DeleteWithQuery(204) err=%v want nil (idempotent)", err)
}
if data == nil {
t.Fatal("DeleteWithQuery(204) returned nil; want synthetic body")
}
if !strings.Contains(string(data), "deleted") && !strings.Contains(string(data), "status") {
t.Errorf("DeleteWithQuery(204) body=%q; must surface a non-empty sentinel", string(data))
}
}
// TestClient_DeleteWithQuery_409PropagatesError covers the preflight-blocked
// surface. A 409 with dependency counts must bubble up as a Go error so the
// MCP tool can present it to the LLM operator rather than silently swallow
// the rejection.
func TestClient_DeleteWithQuery_409PropagatesError(t *testing.T) {
server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/json")
w.WriteHeader(http.StatusConflict)
_ = json.NewEncoder(w).Encode(map[string]interface{}{
"error": "blocked_by_dependencies",
"message": "agent has active targets",
"counts": map[string]int{
"active_targets": 3,
"active_certificates": 7,
"pending_jobs": 2,
},
})
}))
defer server.Close()
c := NewClient(server.URL, "")
_, err := c.DeleteWithQuery("/api/v1/agents/ag-1", nil)
if err == nil {
t.Fatalf("DeleteWithQuery(409) err=nil; 409 must propagate as Go error")
}
if !strings.Contains(err.Error(), "409") {
t.Errorf("err=%q should include HTTP status 409 for debuggability", err.Error())
}
}
// TestRetireAgentInput_ShapePinned is a compile-time assertion that the MCP
// tool input struct for certctl_retire_agent exists with the required fields
// and their expected tag shapes. The LLM discovers this input schema via
// jsonschema tags — refactoring field names without updating callers silently
// breaks tool discovery.
//
// Red until Phase 2b adds RetireAgentInput to internal/mcp/types.go. This
// assertion deliberately exercises every field so the test fails at compile
// time rather than runtime.
func TestRetireAgentInput_ShapePinned(t *testing.T) {
// Zero-value construction of the expected input — fails to compile until
// the struct exists with fields {ID string, Force bool, Reason string}.
input := RetireAgentInput{
ID: "ag-1",
Force: true,
Reason: "decommissioning rack 7",
}
if input.ID != "ag-1" {
t.Errorf("RetireAgentInput.ID=%q want ag-1 (field binding broken)", input.ID)
}
if !input.Force {
t.Errorf("RetireAgentInput.Force=false want true")
}
if input.Reason != "decommissioning rack 7" {
t.Errorf("RetireAgentInput.Reason=%q want decommissioning rack 7", input.Reason)
}
// Also pin the JSON surface — LLMs send and receive these field names,
// so json tags must stay snake_case even through refactors.
encoded, err := json.Marshal(input)
if err != nil {
t.Fatalf("marshal RetireAgentInput: %v", err)
}
body := string(encoded)
for _, want := range []string{`"id":"ag-1"`, `"force":true`, `"reason":"decommissioning rack 7"`} {
if !strings.Contains(body, want) {
t.Errorf("RetireAgentInput JSON=%q missing %q (tag shape drifted)", body, want)
}
}
}
// TestListRetiredAgentsInput_ShapePinned mirrors the pagination input shape
// used across the MCP toolset (see ListParams). The list-retired-agents tool
// takes page + per_page with snake_case JSON tags. Compile-fail until
// Phase 2b either adds ListRetiredAgentsInput or documents that list-retired
// reuses the existing ListParams type (both paths are acceptable — the test
// just pins whichever Phase 2b picks).
func TestListRetiredAgentsInput_ShapePinned(t *testing.T) {
// Phase 2b may either (a) add a dedicated ListRetiredAgentsInput struct
// or (b) reuse the existing ListParams. Either is fine — we pin the
// field-access contract rather than the struct name to let the
// implementation choose. Compile-fail guards against the tool being
// registered without any pagination input at all.
var input ListParams
input.Page = 1
input.PerPage = 50
if input.Page != 1 || input.PerPage != 50 {
t.Errorf("ListParams fields Page/PerPage broken; listing pagination will misroute")
}
}
+59 -17
View File
@@ -217,24 +217,19 @@ func registerCertificateTools(s *gomcp.Server, c *Client) {
}
// ── CRL & OCSP ──────────────────────────────────────────────────────
//
// M-006 relocation: CRL and OCSP are served unauthenticated under the
// RFC 8615 `.well-known/pki/*` namespace (RFC 5280 §5 for CRL, RFC 6960
// §2.1 for OCSP) so relying parties can retrieve them without a certctl
// API key. The non-standard JSON CRL tool (`certctl_get_crl`) has been
// removed — RFC 5280 defines only the DER wire format.
func registerCRLOCSPTools(s *gomcp.Server, c *Client) {
gomcp.AddTool(s, &gomcp.Tool{
Name: "certctl_get_crl",
Description: "Get the Certificate Revocation List in JSON format. Lists all revoked certificate serial numbers with reasons and timestamps.",
}, func(ctx context.Context, req *gomcp.CallToolRequest, input EmptyInput) (*gomcp.CallToolResult, any, error) {
data, err := c.Get("/api/v1/crl", nil)
if err != nil {
return errorResult(err)
}
return textResult(data)
})
gomcp.AddTool(s, &gomcp.Tool{
Name: "certctl_get_der_crl",
Description: "Get DER-encoded X.509 CRL for a specific issuer. Returns binary CRL data signed by the issuing CA.",
Description: "Get DER-encoded X.509 CRL for a specific issuer (RFC 5280). Served unauthenticated at /.well-known/pki/crl/{issuer_id}. Returns binary CRL data signed by the issuing CA.",
}, func(ctx context.Context, req *gomcp.CallToolRequest, input GetDERCRLInput) (*gomcp.CallToolResult, any, error) {
raw, contentType, err := c.GetRaw("/api/v1/crl/" + input.IssuerID)
raw, contentType, err := c.GetRaw("/.well-known/pki/crl/" + input.IssuerID)
if err != nil {
return errorResult(err)
}
@@ -247,9 +242,9 @@ func registerCRLOCSPTools(s *gomcp.Server, c *Client) {
gomcp.AddTool(s, &gomcp.Tool{
Name: "certctl_ocsp_check",
Description: "Check OCSP status for a certificate by issuer ID and hex serial number. Returns good, revoked, or unknown.",
Description: "Check OCSP status for a certificate by issuer ID and hex serial number (RFC 6960). Served unauthenticated at /.well-known/pki/ocsp/{issuer_id}/{serial}. Returns good, revoked, or unknown.",
}, func(ctx context.Context, req *gomcp.CallToolRequest, input OCSPInput) (*gomcp.CallToolResult, any, error) {
raw, contentType, err := c.GetRaw("/api/v1/ocsp/" + input.IssuerID + "/" + input.Serial)
raw, contentType, err := c.GetRaw("/.well-known/pki/ocsp/" + input.IssuerID + "/" + input.Serial)
if err != nil {
return errorResult(err)
}
@@ -511,6 +506,53 @@ func registerAgentTools(s *gomcp.Server, c *Client) {
}
return textResult(data)
})
// I-004: soft-retirement. DELETE /api/v1/agents/{id} returns 200 on a
// fresh retire (body echoes retired_at/already_retired/cascade/counts),
// 204 on an idempotent retire of an already-retired agent (do() in
// client.go normalizes that to {"status":"deleted"}), 409 when downstream
// dependencies block the retire and force wasn't set, 403 on sentinel
// agents, or 400 when force=true was sent without a reason. The tool
// forwards the raw handler response so the LLM operator sees the
// dependency counts and can decide whether to retry with force=true.
gomcp.AddTool(s, &gomcp.Tool{
Name: "certctl_retire_agent",
Description: "Soft-retire an agent (DELETE /api/v1/agents/{id}). Sets retired_at + retired_reason on the row; the agent is filtered from the default listing and surfaces only via certctl_list_retired_agents. Default is a safety-gated soft-retire that returns 409 blocked_by_dependencies if the agent has active targets, active certificates, or pending jobs — the returned counts tell you what would be orphaned. Pass force=true to cascade through and retire those dependents too; force=true requires a non-empty reason (captured in the audit trail). Sentinel discovery agents (server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm) cannot be retired — the handler returns 403 unconditionally. Idempotent: retrying on an already-retired agent returns 204 without side effects.",
}, func(ctx context.Context, req *gomcp.CallToolRequest, input RetireAgentInput) (*gomcp.CallToolResult, any, error) {
// Client-side mirror of the handler's ErrForceReasonRequired contract
// (see internal/api/handler/agents.go) so the LLM gets an immediate,
// actionable error instead of a round-trip 400. Whitespace-only
// reasons are treated as empty — matches handler's TrimSpace check.
if input.Force && input.Reason == "" {
return errorResult(fmt.Errorf("reason is required when force=true"))
}
query := url.Values{}
if input.Force {
query.Set("force", "true")
}
if input.Reason != "" {
query.Set("reason", input.Reason)
}
data, err := c.DeleteWithQuery("/api/v1/agents/"+input.ID, query)
if err != nil {
return errorResult(err)
}
return textResult(data)
})
// I-004: retired agents are filtered out of GET /api/v1/agents by default.
// The /agents/retired endpoint is the opt-in view — same pagination shape
// as the default listing, but filters to rows where retired_at IS NOT NULL.
gomcp.AddTool(s, &gomcp.Tool{
Name: "certctl_list_retired_agents",
Description: "List soft-retired agents (GET /api/v1/agents/retired). These are agents that have been retired via certctl_retire_agent; retired_at and retired_reason are populated. Returned separately from certctl_list_agents so the default listing stays focused on operational agents.",
}, func(ctx context.Context, req *gomcp.CallToolRequest, input ListParams) (*gomcp.CallToolResult, any, error) {
data, err := c.Get("/api/v1/agents/retired", paginationQuery(input.Page, input.PerPage))
if err != nil {
return errorResult(err)
}
return textResult(data)
})
}
// ── Jobs ────────────────────────────────────────────────────────────
@@ -610,7 +652,7 @@ func registerPolicyTools(s *gomcp.Server, c *Client) {
gomcp.AddTool(s, &gomcp.Tool{
Name: "certctl_create_policy",
Description: "Create a new policy rule. Requires name and type.",
Description: "Create a new policy rule. Requires name and type. Optional severity (Warning, Error, Critical) defaults to Warning.",
}, func(ctx context.Context, req *gomcp.CallToolRequest, input CreatePolicyInput) (*gomcp.CallToolResult, any, error) {
data, err := c.Post("/api/v1/policies", input)
if err != nil {
@@ -621,7 +663,7 @@ func registerPolicyTools(s *gomcp.Server, c *Client) {
gomcp.AddTool(s, &gomcp.Tool{
Name: "certctl_update_policy",
Description: "Update a policy rule's name, type, configuration, or enabled status.",
Description: "Update a policy rule's name, type, configuration, enabled status, or severity.",
}, func(ctx context.Context, req *gomcp.CallToolRequest, input UpdatePolicyInput) (*gomcp.CallToolResult, any, error) {
data, err := c.Put("/api/v1/policies/"+input.ID, input)
if err != nil {
+1 -1
View File
@@ -378,7 +378,7 @@ func TestToolEndToEnd_GetRawBinary(t *testing.T) {
defer server.Close()
client := NewClient(server.URL, "test-key")
data, ct, err := client.GetRaw("/api/v1/crl/iss-local")
data, ct, err := client.GetRaw("/.well-known/pki/crl/iss-local")
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
+31 -12
View File
@@ -35,7 +35,7 @@ type CreateCertificateInput struct {
TeamID string `json:"team_id" jsonschema:"Team ID (required)"`
IssuerID string `json:"issuer_id" jsonschema:"Issuer connector ID"`
TargetIDs []string `json:"target_ids,omitempty" jsonschema:"Deployment target IDs"`
RenewalPolicyID string `json:"renewal_policy_id,omitempty" jsonschema:"Renewal policy ID"`
RenewalPolicyID string `json:"renewal_policy_id" jsonschema:"Renewal policy ID (required)"`
ProfileID string `json:"certificate_profile_id,omitempty" jsonschema:"Certificate profile ID"`
Tags map[string]string `json:"tags,omitempty" jsonschema:"Key-value tags"`
}
@@ -112,7 +112,7 @@ type CreateTargetInput struct {
ID string `json:"id,omitempty" jsonschema:"Target ID"`
Name string `json:"name" jsonschema:"Target display name"`
Type string `json:"type" jsonschema:"Target type: NGINX, Apache, HAProxy, F5, IIS"`
AgentID string `json:"agent_id,omitempty" jsonschema:"Agent ID that manages this target"`
AgentID string `json:"agent_id" jsonschema:"Agent ID that manages this target (required)"`
Config interface{} `json:"config,omitempty" jsonschema:"Target-specific configuration"`
Enabled bool `json:"enabled,omitempty" jsonschema:"Whether the target is enabled"`
}
@@ -152,6 +152,23 @@ type AgentJobStatusInput struct {
Error string `json:"error,omitempty" jsonschema:"Error message if job failed"`
}
// RetireAgentInput pins the MCP tool surface for certctl_retire_agent. I-004
// introduces a soft-retirement flow that the handler exposes on DELETE
// /api/v1/agents/{id} with two optional query flags: force=true cascades
// through dependent active targets/certs/jobs, and reason is the human-readable
// string captured in the audit trail. The handler enforces
// ErrForceReasonRequired when force=true is sent without a reason; we surface
// both as separate fields so the LLM can populate them independently and so
// the retire_agent_test shape assertion stays aligned with the JSON-wire
// contract. ID is always emitted (no omitempty) because a retire call without
// a target agent is meaningless; Force and Reason are omitempty so the default
// soft-retire path sends no query suffix at all.
type RetireAgentInput struct {
ID string `json:"id" jsonschema:"Agent ID to soft-retire"`
Force bool `json:"force,omitempty" jsonschema:"Cascade-retire downstream active targets, certs, and jobs (requires reason)"`
Reason string `json:"reason,omitempty" jsonschema:"Human-readable reason (required when force=true)"`
}
// ── Jobs ────────────────────────────────────────────────────────────
type ListJobsInput struct {
@@ -168,19 +185,21 @@ type RejectJobInput struct {
// ── Policies ────────────────────────────────────────────────────────
type CreatePolicyInput struct {
ID string `json:"id,omitempty" jsonschema:"Policy ID"`
Name string `json:"name" jsonschema:"Policy display name"`
Type string `json:"type" jsonschema:"Policy type: AllowedIssuers, AllowedDomains, RequiredMetadata, AllowedEnvironments, RenewalLeadTime"`
Config interface{} `json:"config,omitempty" jsonschema:"Policy-specific configuration"`
Enabled bool `json:"enabled,omitempty" jsonschema:"Whether the policy is enabled"`
ID string `json:"id,omitempty" jsonschema:"Policy ID"`
Name string `json:"name" jsonschema:"Policy display name"`
Type string `json:"type" jsonschema:"Policy type: AllowedIssuers, AllowedDomains, RequiredMetadata, AllowedEnvironments, RenewalLeadTime"`
Config interface{} `json:"config,omitempty" jsonschema:"Policy-specific configuration"`
Enabled bool `json:"enabled,omitempty" jsonschema:"Whether the policy is enabled"`
Severity string `json:"severity,omitempty" jsonschema:"Violation severity: Warning, Error, or Critical (default: Warning)"`
}
type UpdatePolicyInput struct {
ID string `json:"id" jsonschema:"Policy ID to update"`
Name string `json:"name,omitempty" jsonschema:"Policy display name"`
Type string `json:"type,omitempty" jsonschema:"Policy type"`
Config interface{} `json:"config,omitempty" jsonschema:"Policy-specific configuration"`
Enabled *bool `json:"enabled,omitempty" jsonschema:"Whether the policy is enabled"`
ID string `json:"id" jsonschema:"Policy ID to update"`
Name string `json:"name,omitempty" jsonschema:"Policy display name"`
Type string `json:"type,omitempty" jsonschema:"Policy type"`
Config interface{} `json:"config,omitempty" jsonschema:"Policy-specific configuration"`
Enabled *bool `json:"enabled,omitempty" jsonschema:"Whether the policy is enabled"`
Severity string `json:"severity,omitempty" jsonschema:"Violation severity: Warning, Error, or Critical"`
}
type ListViolationsInput struct {
+105 -1
View File
@@ -27,6 +27,13 @@ type CertificateRepository interface {
GetExpiringCertificates(ctx context.Context, before time.Time) ([]*domain.ManagedCertificate, error)
// GetLatestVersion returns the most recent certificate version for a certificate.
GetLatestVersion(ctx context.Context, certID string) (*domain.CertificateVersion, error)
// GetByIssuerAndSerial retrieves a certificate by the (issuer_id, serial_number)
// pair via a JOIN on certificate_versions. Callers (OCSP, revocation lookup)
// always know the issuer because protocol endpoints carry it in the request
// path; RFC 5280 §5.2.3 guarantees serial uniqueness only within a single
// issuer. Returns sql.ErrNoRows when no match exists so callers can
// distinguish "unknown cert" from a real repository error.
GetByIssuerAndSerial(ctx context.Context, issuerID, serial string) (*domain.ManagedCertificate, error)
}
// RevocationRepository defines operations for managing certificate revocations.
@@ -86,9 +93,34 @@ type TargetRepository interface {
// AgentRepository defines operations for managing control plane agents.
type AgentRepository interface {
// List returns all agents.
// List returns all ACTIVE agents — rows with retired_at IS NULL.
//
// I-004: The default listing MUST NOT surface retired agents. The
// handler-facing ListAgents call, the stats dashboard, and the stale-offline
// sweeper all iterate this list and would otherwise re-surface decommissioned
// hardware in operational UI. Callers that genuinely want retired rows (the
// audit tab, compliance exports) must use ListRetired instead.
//
// The partial index idx_agents_retired_at (migration 000015) keeps retired
// rows cheap to exclude — the planner uses it to skip the retired segment
// of the table entirely.
List(ctx context.Context) ([]*domain.Agent, error)
// ListRetired returns a paginated list of retired agents (retired_at IS NOT NULL),
// ordered by retired_at DESC so the most recent retirements appear first. Used
// by the GUI's Retired tab and the audit export path. Returns the slice plus
// the total count (for pagination). A page<1 or perPage<1 is clamped to sensible
// defaults (page=1, perPage=50) in the repo implementation rather than erroring —
// this matches the ListAgents pagination behavior in the service layer.
// I-004 coverage-gap closure, migration 000015.
ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error)
// Get retrieves an agent by ID.
//
// I-004 note: Get returns retired rows (retired_at IS NOT NULL) because
// callers that need to check "has this agent been retired?" — the heartbeat
// handler returning 410 Gone, the retirement service's idempotent-retire
// branch, the detail page rendering a retirement banner — must see the
// retired_at/retired_reason fields. Only the default List path default-
// excludes retired; individual Get lookups surface them.
Get(ctx context.Context, id string) (*domain.Agent, error)
// Create stores a new agent. Callers that want duplicate-key errors surfaced
// (e.g. real-agent registration) must use this method; sentinel/bootstrap
@@ -105,11 +137,78 @@ type AgentRepository interface {
// Update modifies an existing agent.
Update(ctx context.Context, agent *domain.Agent) error
// Delete removes an agent.
//
// I-004: callers should prefer SoftRetire / RetireAgentWithCascade for the
// operator-facing retirement path; hard Delete remains available for test
// cleanup and repository-level administrative tasks. The deployment_targets
// FK flipped to ON DELETE RESTRICT in migration 000015, so hard-deleting an
// agent that still owns active targets will now fail at the DB layer — which
// is intentional: the fail-closed guardrail prevents audit-trail destruction.
Delete(ctx context.Context, id string) error
// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata.
//
// I-004: UpdateHeartbeat is a no-op on retired agents — the UPDATE clause
// includes AND retired_at IS NULL so a stale agent process that keeps polling
// after retirement cannot resurrect its heartbeat. The service layer already
// short-circuits with ErrAgentRetired before calling this method; the WHERE
// filter here is belt-and-braces for anyone who skips the service path.
UpdateHeartbeat(ctx context.Context, id string, metadata *domain.AgentMetadata) error
// GetByAPIKey retrieves an agent by hashed API key.
//
// I-004: GetByAPIKey returns retired rows so the auth middleware can detect
// "this API key belongs to a retired agent" and fail the request with
// 410 Gone. If retired rows were hidden, auth would return a plain 401 and
// leak no signal — which is wrong: the operator needs the retired state
// made explicit so they can clean up the agent process.
GetByAPIKey(ctx context.Context, keyHash string) (*domain.Agent, error)
// SoftRetire stamps retired_at + retired_reason on the agent row with no
// cascade. Used on the happy path where preflight confirmed the agent has
// zero active dependencies (no active deployment_targets, no pending jobs).
// The UPDATE is scoped to WHERE id=$1 AND retired_at IS NULL so re-retiring
// an already-retired row is a no-op (zero rows affected is NOT returned as
// an error — the service layer detects this via its own idempotent-retire
// branch before calling SoftRetire). Callers supply retiredAt so the service
// can pin a single consistent timestamp across audit + DB writes.
// I-004 coverage-gap closure.
SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error
// RetireAgentWithCascade performs a transactional retire + cascade. In one
// transaction it: (1) stamps retired_at + retired_reason on the agent row,
// and (2) stamps the SAME retired_at + retired_reason on every active
// deployment_targets row whose agent_id matches. Only rows with
// retired_at IS NULL are touched in (2) — already-retired targets keep their
// original retirement metadata (whoever retired them first, whenever). Used
// exclusively on the force=true path from the retirement handler; callers
// supply retiredAt so the agent row and every cascaded target row share an
// exact retirement instant (helps forensic analysis trace the cascade back
// to a single operator action). If the agent row is already retired, the
// whole operation is a no-op — the transaction commits without touching
// either table. I-004 coverage-gap closure, migration 000015.
RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error
// CountActiveTargets returns the number of deployment_targets rows where
// agent_id=id AND retired_at IS NULL. The COUNT query hits the existing
// idx_deployment_targets_agent_id index (migration 000001 line 111); the
// additional retired_at IS NULL predicate is cheap because the partial
// idx_deployment_targets_retired_at index (migration 000015) lets the
// planner skip the retired-row segment entirely. Preflight uses this to
// decide 200 (soft-retire) vs 409 (blocked-by-deps). I-004.
CountActiveTargets(ctx context.Context, agentID string) (int, error)
// CountActiveCertificates returns the count of managed_certificates currently
// deployed through one of this agent's ACTIVE (non-retired) deployment_targets.
// The query joins certificate_target_mappings (migration 000001 line 116) →
// deployment_targets filtering on deployment_targets.agent_id=$1 AND
// deployment_targets.retired_at IS NULL, then COUNT(DISTINCT certificate_id)
// so the same cert deployed to multiple targets on one agent counts once.
// The primary key (certificate_id, target_id) on certificate_target_mappings
// plus idx_certificate_target_mappings_target_id (line 122) cover the join.
// Used purely for the preflight 409 body — the number is informational. I-004.
CountActiveCertificates(ctx context.Context, agentID string) (int, error)
// CountPendingJobs returns the number of jobs belonging to this agent whose
// status is in (Pending, AwaitingCSR, AwaitingApproval, Running) — the four
// statuses that indicate work the agent would still be expected to pick up.
// Completed/Failed/Cancelled jobs do not count. The filter agent_id=$1 hits
// the idx_jobs_agent_id index (migration 000001 line 161). Used for the
// preflight 409 body. I-004.
CountPendingJobs(ctx context.Context, agentID string) (int, error)
}
// JobRepository defines operations for managing renewal and deployment jobs.
@@ -144,6 +243,11 @@ type JobRepository interface {
// to Running) and locks AwaitingCSR jobs against concurrent observers (leaving state intact,
// since the CSR-submission path drives the next transition). H-6 (CWE-362) race remediation.
ClaimPendingByAgentID(ctx context.Context, agentID string) ([]*domain.Job, error)
// ListTimedOutAwaitingJobs returns jobs stuck in AwaitingCSR (created before csrCutoff) or
// AwaitingApproval (created before approvalCutoff). The reaper loop transitions them to
// Failed; I-001's retry loop then auto-promotes eligible Failed jobs back to Pending.
// I-003 coverage-gap closure.
ListTimedOutAwaitingJobs(ctx context.Context, csrCutoff, approvalCutoff time.Time) ([]*domain.Job, error)
}
// RenewalPolicyRepository defines operations for managing renewal policies.
+235 -11
View File
@@ -20,12 +20,18 @@ func NewAgentRepository(db *sql.DB) *AgentRepository {
return &AgentRepository{db: db}
}
// List returns all agents
// List returns all ACTIVE agents — rows with retired_at IS NULL. I-004:
// the default listing path feeds the handler-facing ListAgents call, the
// stats dashboard, and the stale-offline sweeper; every caller wants active
// hardware, not decommissioned rows. Operators who need retired rows reach
// for ListRetired instead. The partial index idx_agents_retired_at
// (migration 000015) lets the planner skip the retired segment cheaply.
func (r *AgentRepository) List(ctx context.Context) ([]*domain.Agent, error) {
rows, err := r.db.QueryContext(ctx, `
SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
os, architecture, ip_address, version
os, architecture, ip_address, version, retired_at, retired_reason
FROM agents
WHERE retired_at IS NULL
ORDER BY registered_at DESC
`)
@@ -50,11 +56,16 @@ func (r *AgentRepository) List(ctx context.Context) ([]*domain.Agent, error) {
return agents, nil
}
// Get retrieves an agent by ID
// Get retrieves an agent by ID. I-004: retired rows ARE surfaced here —
// callers that need to check "has this agent been retired?" (heartbeat
// handler returning 410 Gone, retirement service's idempotent-retire branch,
// detail page rendering a retirement banner) must see retired_at /
// retired_reason. Only the List path default-excludes retired rows; Get is
// by-ID and returns whatever row exists.
func (r *AgentRepository) Get(ctx context.Context, id string) (*domain.Agent, error) {
row := r.db.QueryRowContext(ctx, `
SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
os, architecture, ip_address, version
os, architecture, ip_address, version, retired_at, retired_reason
FROM agents
WHERE id = $1
`, id)
@@ -185,7 +196,16 @@ func (r *AgentRepository) Delete(ctx context.Context, id string) error {
return nil
}
// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata
// UpdateHeartbeat updates the agent's last heartbeat timestamp and metadata.
//
// I-004: both branches include `AND retired_at IS NULL` in the WHERE clause,
// making the UPDATE a no-op on retired rows. The service layer already
// short-circuits with ErrAgentRetired before calling this method (see
// AgentService.Heartbeat), but the WHERE filter is belt-and-braces for any
// path that skips the service — a stale agent process that keeps polling
// after retirement cannot resurrect its heartbeat at the DB layer. A zero
// RowsAffected here returns the same "agent not found" error as before; the
// service layer distinguishes retired from missing by calling Get first.
func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metadata *domain.AgentMetadata) error {
var result sql.Result
var err error
@@ -199,11 +219,11 @@ func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metada
architecture = CASE WHEN $5 = '' THEN architecture ELSE $5 END,
ip_address = CASE WHEN $6 = '' THEN ip_address ELSE $6 END,
version = CASE WHEN $7 = '' THEN version ELSE $7 END
WHERE id = $2
WHERE id = $2 AND retired_at IS NULL
`, time.Now(), id, metadata.Hostname, metadata.OS, metadata.Architecture, metadata.IPAddress, metadata.Version)
} else {
result, err = r.db.ExecContext(ctx, `
UPDATE agents SET last_heartbeat_at = $1 WHERE id = $2
UPDATE agents SET last_heartbeat_at = $1 WHERE id = $2 AND retired_at IS NULL
`, time.Now(), id)
}
@@ -223,11 +243,15 @@ func (r *AgentRepository) UpdateHeartbeat(ctx context.Context, id string, metada
return nil
}
// GetByAPIKey retrieves an agent by hashed API key
// GetByAPIKey retrieves an agent by hashed API key. I-004: retired rows ARE
// surfaced here so the auth middleware can detect "this API key belongs to a
// retired agent" and fail the request with 410 Gone instead of 401. If the
// filter hid retired rows, auth would return a plain 401 and leak no signal
// that the agent process needs cleaning up.
func (r *AgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (*domain.Agent, error) {
row := r.db.QueryRowContext(ctx, `
SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
os, architecture, ip_address, version
os, architecture, ip_address, version, retired_at, retired_reason
FROM agents
WHERE api_key_hash = $1
`, keyHash)
@@ -243,14 +267,214 @@ func (r *AgentRepository) GetByAPIKey(ctx context.Context, keyHash string) (*dom
return agent, nil
}
// scanAgent scans an agent from a row or rows
// ─── I-004 agent retirement surface ──────────────────────────────────────
//
// The methods below implement the I-004 coverage-gap closure. They follow the
// interface contracts in internal/repository/interfaces.go:94-210 (which is the
// spec — keep godoc there in sync if behavior changes).
// ListRetired returns a paginated slice of retired agents ordered by
// retired_at DESC so the most recent retirements appear first. Used by the
// GUI's Retired tab and the audit export path. Returns the rows plus the
// total count (for pagination UI). page<1 or perPage<1 is clamped to
// sensible defaults in-repo rather than erroring, matching the ListAgents
// pagination behavior at the service layer. I-004, migration 000015.
func (r *AgentRepository) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
// Clamp pagination to safe defaults. Keep in lockstep with the service
// layer's pagination shape — negative / zero values on either axis should
// degrade to "first page, default size" instead of returning an error.
if page < 1 {
page = 1
}
if perPage < 1 {
perPage = 50
}
offset := (page - 1) * perPage
// Total count first — separate query so pagination math stays correct
// even when the page of rows is empty. Uses the partial
// idx_agents_retired_at index so this is effectively a count of the
// partial-index tuple count, not a full table scan.
var total int
if err := r.db.QueryRowContext(ctx, `
SELECT COUNT(*) FROM agents WHERE retired_at IS NOT NULL
`).Scan(&total); err != nil {
return nil, 0, fmt.Errorf("failed to count retired agents: %w", err)
}
rows, err := r.db.QueryContext(ctx, `
SELECT id, name, hostname, status, last_heartbeat_at, registered_at, api_key_hash,
os, architecture, ip_address, version, retired_at, retired_reason
FROM agents
WHERE retired_at IS NOT NULL
ORDER BY retired_at DESC
LIMIT $1 OFFSET $2
`, perPage, offset)
if err != nil {
return nil, 0, fmt.Errorf("failed to query retired agents: %w", err)
}
defer rows.Close()
var agents []*domain.Agent
for rows.Next() {
agent, err := scanAgent(rows)
if err != nil {
return nil, 0, err
}
agents = append(agents, agent)
}
if err := rows.Err(); err != nil {
return nil, 0, fmt.Errorf("error iterating retired agent rows: %w", err)
}
return agents, total, nil
}
// SoftRetire stamps retired_at + retired_reason on the agent row with no
// cascade. Scoped to `WHERE id=$1 AND retired_at IS NULL` so re-retiring an
// already-retired row is a silent no-op (zero RowsAffected). The service
// layer has its own idempotent-retire branch that detects already-retired
// rows via Get before calling SoftRetire; a zero here just means a racy
// caller got there first. I-004.
func (r *AgentRepository) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
if _, err := r.db.ExecContext(ctx, `
UPDATE agents
SET retired_at = $2, retired_reason = $3
WHERE id = $1 AND retired_at IS NULL
`, id, retiredAt, reason); err != nil {
return fmt.Errorf("failed to soft-retire agent: %w", err)
}
return nil
}
// RetireAgentWithCascade performs a transactional retire-and-cascade. In one
// transaction it (1) stamps retired_at + retired_reason on the agent row if
// it is still active, and (2) stamps the SAME retired_at + retired_reason on
// every active (retired_at IS NULL) deployment_targets row whose agent_id
// matches. Already-retired targets keep their original retirement metadata;
// only active targets are touched. If the agent is already retired, the
// whole transaction is a no-op — the caller's idempotent-retire branch
// already handled it before we got here. I-004, migration 000015.
//
// The two UPDATEs share a single (retiredAt, reason) pair so forensic
// analysis can trace "every row stamped at T1 with reason R was part of the
// same operator action" back to one cascade. Using BeginTx keeps the agent
// row and its targets' retirement metadata consistent even if something
// crashes mid-cascade.
func (r *AgentRepository) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
tx, err := r.db.BeginTx(ctx, nil)
if err != nil {
return fmt.Errorf("failed to begin retire-cascade transaction: %w", err)
}
// Rollback is a no-op if Commit has already run — safe to always defer.
defer func() { _ = tx.Rollback() }()
// Agent row: flip to retired only if it was still active. If zero rows
// match, the agent was already retired — the whole cascade becomes a
// no-op (we deliberately do NOT stamp the targets against a retirement
// we didn't perform).
if _, err := tx.ExecContext(ctx, `
UPDATE agents
SET retired_at = $2, retired_reason = $3
WHERE id = $1 AND retired_at IS NULL
`, id, retiredAt, reason); err != nil {
return fmt.Errorf("failed to retire agent in cascade: %w", err)
}
// Cascade: copy the same retired_at / retired_reason onto every active
// deployment_target belonging to this agent. Skips targets that are
// already retired so their original retirement metadata is preserved.
if _, err := tx.ExecContext(ctx, `
UPDATE deployment_targets
SET retired_at = $2, retired_reason = $3
WHERE agent_id = $1 AND retired_at IS NULL
`, id, retiredAt, reason); err != nil {
return fmt.Errorf("failed to cascade-retire deployment targets: %w", err)
}
if err := tx.Commit(); err != nil {
return fmt.Errorf("failed to commit retire-cascade transaction: %w", err)
}
return nil
}
// CountActiveTargets returns the number of deployment_targets with
// agent_id=agentID AND retired_at IS NULL. Used by the retirement preflight
// to decide 200 (soft-retire) vs 409 (blocked-by-deps). Hits the existing
// idx_deployment_targets_agent_id index (migration 000001 line 111); the
// retired_at IS NULL predicate is cheap because the partial
// idx_deployment_targets_retired_at index (migration 000015) lets the
// planner skip the retired-row segment. I-004.
func (r *AgentRepository) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
var count int
err := r.db.QueryRowContext(ctx, `
SELECT COUNT(*)
FROM deployment_targets
WHERE agent_id = $1 AND retired_at IS NULL
`, agentID).Scan(&count)
if err != nil {
return 0, fmt.Errorf("failed to count active targets for agent: %w", err)
}
return count, nil
}
// CountActiveCertificates returns the count of distinct managed_certificates
// currently deployed through one of this agent's ACTIVE deployment_targets.
// Joins certificate_target_mappings (migration 000001 line 116) →
// deployment_targets filtering on deployment_targets.agent_id=$1 AND
// deployment_targets.retired_at IS NULL. COUNT(DISTINCT certificate_id) so
// the same cert deployed to multiple targets on one agent counts once.
// Used purely for the preflight 409 body. I-004.
func (r *AgentRepository) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
var count int
err := r.db.QueryRowContext(ctx, `
SELECT COUNT(DISTINCT ctm.certificate_id)
FROM certificate_target_mappings ctm
JOIN deployment_targets dt ON dt.id = ctm.target_id
WHERE dt.agent_id = $1 AND dt.retired_at IS NULL
`, agentID).Scan(&count)
if err != nil {
return 0, fmt.Errorf("failed to count active certificates for agent: %w", err)
}
return count, nil
}
// CountPendingJobs returns the number of jobs belonging to this agent whose
// status is in (Pending, AwaitingCSR, AwaitingApproval, Running) — the four
// statuses that represent work the agent would still be expected to pick up
// or complete. Completed / Failed / Cancelled jobs do not count toward the
// preflight gate. Status strings match domain.JobStatus* constants in
// internal/domain/job.go:43-49. Hits idx_jobs_agent_id (migration 000001
// line 161). I-004.
func (r *AgentRepository) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
var count int
err := r.db.QueryRowContext(ctx, `
SELECT COUNT(*)
FROM jobs
WHERE agent_id = $1
AND status IN ('Pending', 'AwaitingCSR', 'AwaitingApproval', 'Running')
`, agentID).Scan(&count)
if err != nil {
return 0, fmt.Errorf("failed to count pending jobs for agent: %w", err)
}
return count, nil
}
// scanAgent scans an agent from a row or rows.
//
// I-004: the column list here is the authoritative 13-field post-M15 order —
// retired_at and retired_reason are appended at the tail as nullable
// *time.Time / *string scan targets matching the `json:"...,omitempty"` domain
// fields. Every SELECT in this file that feeds scanAgent must emit columns in
// this same order, otherwise Scan will silently place values into the wrong
// fields (lib/pq does positional binding, not named).
func scanAgent(scanner interface {
Scan(...interface{}) error
}) (*domain.Agent, error) {
var agent domain.Agent
err := scanner.Scan(&agent.ID, &agent.Name, &agent.Hostname, &agent.Status,
&agent.LastHeartbeatAt, &agent.RegisteredAt, &agent.APIKeyHash,
&agent.OS, &agent.Architecture, &agent.IPAddress, &agent.Version)
&agent.OS, &agent.Architecture, &agent.IPAddress, &agent.Version,
&agent.RetiredAt, &agent.RetiredReason)
if err != nil {
return nil, fmt.Errorf("failed to scan agent: %w", err)
@@ -5,6 +5,7 @@ import (
"database/sql"
"encoding/base64"
"encoding/json"
"errors"
"fmt"
"strings"
"time"
@@ -272,6 +273,38 @@ func (r *CertificateRepository) Get(ctx context.Context, id string) (*domain.Man
return cert, nil
}
// GetByIssuerAndSerial retrieves a certificate by the (issuer_id, serial_number)
// pair via a JOIN on certificate_versions. Per RFC 5280 §5.2.3, serial numbers
// are unique only within a single issuer — callers that know the issuer (OCSP,
// CRL generation, revocation lookup) use this method to scope lookups
// correctly. Returns sql.ErrNoRows when no match exists so callers can
// distinguish "unknown cert" (return OCSP status unknown) from a real
// repository error.
func (r *CertificateRepository) GetByIssuerAndSerial(ctx context.Context, issuerID, serial string) (*domain.ManagedCertificate, error) {
row := r.db.QueryRowContext(ctx, `
SELECT mc.id, mc.name, mc.common_name, mc.sans, mc.environment, mc.owner_id, mc.team_id,
mc.issuer_id, mc.renewal_policy_id, mc.certificate_profile_id, mc.status, mc.expires_at,
mc.tags, mc.last_renewal_at, mc.last_deployment_at, mc.revoked_at, mc.revocation_reason,
mc.created_at, mc.updated_at
FROM managed_certificates mc
JOIN certificate_versions cv ON cv.certificate_id = mc.id
WHERE mc.issuer_id = $1 AND cv.serial_number = $2
LIMIT 1
`, issuerID, serial)
cert, err := r.scanCertificate(ctx, row)
if err != nil {
// scanCertificate wraps sql.ErrNoRows via %w, so surface the bare
// sentinel here for callers that branch on it with errors.Is.
if errors.Is(err, sql.ErrNoRows) {
return nil, sql.ErrNoRows
}
return nil, fmt.Errorf("failed to query certificate by issuer+serial: %w", err)
}
return cert, nil
}
// Create stores a new certificate
func (r *CertificateRepository) Create(ctx context.Context, cert *domain.ManagedCertificate) error {
if cert.ID == "" {
+36
View File
@@ -4,6 +4,7 @@ import (
"context"
"database/sql"
"fmt"
"time"
"github.com/google/uuid"
"github.com/shankar0123/certctl/internal/domain"
@@ -570,6 +571,41 @@ func (r *JobRepository) ClaimPendingByAgentID(ctx context.Context, agentID strin
return append(pendingJobs, csrJobs...), nil
}
// ListTimedOutAwaitingJobs returns jobs stuck in AwaitingCSR or AwaitingApproval past
// their respective cutoff timestamps (created_at < cutoff). The reaper loop transitions
// them to Failed; I-001's retry loop then auto-promotes eligible Failed jobs back to
// Pending. I-003 coverage-gap closure.
func (r *JobRepository) ListTimedOutAwaitingJobs(ctx context.Context, csrCutoff, approvalCutoff time.Time) ([]*domain.Job, error) {
rows, err := r.db.QueryContext(ctx, `
SELECT id, type, certificate_id, target_id, agent_id, status, attempts, max_attempts,
last_error, scheduled_at, started_at, completed_at, created_at
FROM jobs
WHERE (status = $1 AND created_at < $2)
OR (status = $3 AND created_at < $4)
ORDER BY created_at ASC
`, domain.JobStatusAwaitingCSR, csrCutoff, domain.JobStatusAwaitingApproval, approvalCutoff)
if err != nil {
return nil, fmt.Errorf("failed to query timed-out awaiting jobs: %w", err)
}
defer rows.Close()
var jobs []*domain.Job
for rows.Next() {
job, err := scanJob(rows)
if err != nil {
return nil, err
}
jobs = append(jobs, job)
}
if err := rows.Err(); err != nil {
return nil, fmt.Errorf("error iterating timed-out job rows: %w", err)
}
return jobs, nil
}
// scanJob scans a job from a row or rows
func scanJob(scanner interface {
Scan(...interface{}) error
@@ -0,0 +1,220 @@
package postgres_test
import (
"context"
"database/sql"
"os"
"path/filepath"
"strings"
"testing"
)
// TestMigration000015_AgentRetireRoundTrip is the Phase 2a Red regression test
// for I-004 ("Agent hard-delete cascades through deployment_targets + jobs").
//
// The fix depends on a new migration, 000015_agent_retire.up.sql + .down.sql,
// which must:
//
// 1. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT`
// columns to the `agents` table. These mirror the revoked_at /
// revocation_reason pair on managed_certificates (migration 000005).
//
// 2. Add nullable `retired_at TIMESTAMPTZ` and `retired_reason TEXT` columns
// to `deployment_targets`. When an agent is retired with cascade=true,
// its deployment_targets must be soft-retired (not deleted) so audit
// history — who deployed what to where, when — stays intact.
//
// 3. FLIP the foreign key on `deployment_targets.agent_id → agents.id`
// from `ON DELETE CASCADE` (migration 000001, line 104) to
// `ON DELETE RESTRICT`. This is the fail-closed change that makes a
// bare `DELETE FROM agents WHERE id = $1` blow up at the DB layer
// instead of silently vaporising every deployment_target row. Today
// the CASCADE means the audit trail gets shredded with zero warning.
//
// The round-trip also validates that the down migration cleanly reverses all
// three changes, so an operator who lands on a rollback can still boot the
// server. Red-until-Green: this test compiles but fails until
// migrations/000015_agent_retire.up.sql + .down.sql exist with the right
// schema, because `freshSchema(t)` runs every `.up.sql` in lexical order —
// the new migration runs automatically once Phase 2b creates the files.
func TestMigration000015_AgentRetireRoundTrip(t *testing.T) {
tdb := getTestDB(t)
db := tdb.freshSchema(t)
ctx := context.Background()
// ─── Stage 1: Post-up assertions ─────────────────────────────────────
//
// After all .up.sql migrations (including the new 000015) have run, the
// new columns and the flipped FK must be observable in the catalog.
assertColumnExists(t, db, "agents", "retired_at")
assertColumnExists(t, db, "agents", "retired_reason")
assertColumnExists(t, db, "deployment_targets", "retired_at")
assertColumnExists(t, db, "deployment_targets", "retired_reason")
// The FK on deployment_targets.agent_id must be RESTRICT (confdeltype='r'),
// not CASCADE (confdeltype='c'). This is the core fail-closed guarantee
// that fixes I-004 at the storage layer.
assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r")
// The FK on jobs.agent_id is already SET NULL (confdeltype='n') per
// migration 000001 line 146 — pin that it stays that way (or goes to
// RESTRICT; either preserves audit history, both fail on 'c').
assertFKDeleteRuleNot(t, db, "jobs", "agent_id", "c")
// ─── Stage 2: Run the 000015 down migration manually ─────────────────
//
// testutil_test.go's runMigrations helper only runs *.up.sql. To exercise
// the down migration I read and execute it by hand, then re-check the
// catalog.
downSQL := readMigrationFile(t, "000015_agent_retire.down.sql")
if _, err := db.ExecContext(ctx, downSQL); err != nil {
t.Fatalf("000015 down migration failed: %v", err)
}
// Stage 3: Post-down assertions — columns gone, FK restored to CASCADE.
assertColumnGone(t, db, "agents", "retired_at")
assertColumnGone(t, db, "agents", "retired_reason")
assertColumnGone(t, db, "deployment_targets", "retired_at")
assertColumnGone(t, db, "deployment_targets", "retired_reason")
assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "c")
// ─── Stage 4: Re-run the up migration for idempotency ────────────────
//
// The up migration must be safely re-runnable — operators sometimes
// re-apply by hand after a partial rollback. Use IF NOT EXISTS / ALTER
// idempotently.
upSQL := readMigrationFile(t, "000015_agent_retire.up.sql")
if _, err := db.ExecContext(ctx, upSQL); err != nil {
t.Fatalf("000015 up migration re-apply failed (must be idempotent): %v", err)
}
assertColumnExists(t, db, "agents", "retired_at")
assertColumnExists(t, db, "agents", "retired_reason")
assertColumnExists(t, db, "deployment_targets", "retired_at")
assertColumnExists(t, db, "deployment_targets", "retired_reason")
assertFKDeleteRule(t, db, "deployment_targets", "agent_id", "r")
}
// ─── Catalog helpers ──────────────────────────────────────────────────────
//
// These helpers scope every catalog query to the schema the test is actually
// running in by joining against current_schema(). Without that, a test
// running in schema test_xyz would accidentally inspect the public schema
// and green-light drift.
func assertColumnExists(t *testing.T, db *sql.DB, table, column string) {
t.Helper()
var exists bool
err := db.QueryRowContext(context.Background(), `
SELECT EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_schema = current_schema()
AND table_name = $1
AND column_name = $2
)`, table, column).Scan(&exists)
if err != nil {
t.Fatalf("column existence query failed for %s.%s: %v", table, column, err)
}
if !exists {
t.Errorf("expected column %s.%s to exist after 000015 up (migration missing or drifted)", table, column)
}
}
func assertColumnGone(t *testing.T, db *sql.DB, table, column string) {
t.Helper()
var exists bool
err := db.QueryRowContext(context.Background(), `
SELECT EXISTS (
SELECT 1 FROM information_schema.columns
WHERE table_schema = current_schema()
AND table_name = $1
AND column_name = $2
)`, table, column).Scan(&exists)
if err != nil {
t.Fatalf("column existence query failed for %s.%s: %v", table, column, err)
}
if exists {
t.Errorf("expected column %s.%s to be removed after 000015 down (down migration is incomplete)", table, column)
}
}
// assertFKDeleteRule asserts that the foreign key covering `table.column`
// (i.e. the FK whose constrained column matches) has the expected
// `confdeltype`. Per pg_constraint docs: 'r' = RESTRICT, 'c' = CASCADE,
// 'n' = SET NULL, 'd' = SET DEFAULT, 'a' = NO ACTION.
func assertFKDeleteRule(t *testing.T, db *sql.DB, table, column, want string) {
t.Helper()
got := lookupFKDeleteRule(t, db, table, column)
if got != want {
t.Errorf("FK on %s(%s): confdeltype=%q want %q (RESTRICT='r', CASCADE='c', SET NULL='n')",
table, column, got, want)
}
}
// assertFKDeleteRuleNot is the negative form — used for jobs.agent_id where
// multiple confdeltype values are acceptable (SET NULL and RESTRICT both
// preserve audit history) but CASCADE is strictly forbidden.
func assertFKDeleteRuleNot(t *testing.T, db *sql.DB, table, column, disallowed string) {
t.Helper()
got := lookupFKDeleteRule(t, db, table, column)
if got == disallowed {
t.Errorf("FK on %s(%s): confdeltype=%q; %q is forbidden (would destroy audit history on agent delete)",
table, column, got, disallowed)
}
}
// lookupFKDeleteRule returns the confdeltype for the FK constraint whose
// constrained table+column matches. Returns empty string if no FK found —
// that's treated as a test failure because the schema is supposed to have
// these FKs per migration 000001.
func lookupFKDeleteRule(t *testing.T, db *sql.DB, table, column string) string {
t.Helper()
// Join pg_constraint → pg_class (constrained rel) → pg_attribute
// (constrained col) → pg_namespace (schema filter). Scoped to
// current_schema() so schema-per-test isolation holds.
const q = `
SELECT c.confdeltype
FROM pg_constraint c
JOIN pg_class cl ON cl.oid = c.conrelid
JOIN pg_namespace n ON n.oid = cl.relnamespace
JOIN pg_attribute a ON a.attrelid = c.conrelid AND a.attnum = ANY(c.conkey)
WHERE n.nspname = current_schema()
AND c.contype = 'f'
AND cl.relname = $1
AND a.attname = $2
LIMIT 1
`
var confdeltype string
err := db.QueryRowContext(context.Background(), q, table, column).Scan(&confdeltype)
if err == sql.ErrNoRows {
t.Fatalf("no FK found on %s(%s) in current_schema (schema not migrated?)", table, column)
return ""
}
if err != nil {
t.Fatalf("FK lookup for %s(%s) failed: %v", table, column, err)
return ""
}
return confdeltype
}
// readMigrationFile locates and loads a named migration file. Uses the same
// walk-up strategy as findMigrationsDir() in testutil_test.go so both helpers
// agree on where the migrations live.
func readMigrationFile(t *testing.T, name string) string {
t.Helper()
path := filepath.Join(findMigrationsDir(), name)
data, err := os.ReadFile(path)
if err != nil {
t.Fatalf("failed to read migration file %s (expected at %s): %v", name, path, err)
}
// Defensive: a zero-byte down migration would produce false-positive
// "success" below. Refuse to trust it.
if strings.TrimSpace(string(data)) == "" {
t.Fatalf("migration file %s is empty — down migration missing or truncated", name)
}
return string(data)
}
+11 -10
View File
@@ -24,7 +24,7 @@ func NewPolicyRepository(db *sql.DB) *PolicyRepository {
// ListRules returns all policy rules
func (r *PolicyRepository) ListRules(ctx context.Context) ([]*domain.PolicyRule, error) {
rows, err := r.db.QueryContext(ctx, `
SELECT id, name, type, config, enabled, created_at, updated_at
SELECT id, name, type, config, enabled, severity, created_at, updated_at
FROM policy_rules
ORDER BY created_at DESC
`)
@@ -38,7 +38,7 @@ func (r *PolicyRepository) ListRules(ctx context.Context) ([]*domain.PolicyRule,
for rows.Next() {
var rule domain.PolicyRule
if err := rows.Scan(&rule.ID, &rule.Name, &rule.Type, &rule.Config,
&rule.Enabled, &rule.CreatedAt, &rule.UpdatedAt); err != nil {
&rule.Enabled, &rule.Severity, &rule.CreatedAt, &rule.UpdatedAt); err != nil {
return nil, fmt.Errorf("failed to scan policy rule: %w", err)
}
rules = append(rules, &rule)
@@ -55,11 +55,11 @@ func (r *PolicyRepository) ListRules(ctx context.Context) ([]*domain.PolicyRule,
func (r *PolicyRepository) GetRule(ctx context.Context, id string) (*domain.PolicyRule, error) {
var rule domain.PolicyRule
err := r.db.QueryRowContext(ctx, `
SELECT id, name, type, config, enabled, created_at, updated_at
SELECT id, name, type, config, enabled, severity, created_at, updated_at
FROM policy_rules
WHERE id = $1
`, id).Scan(&rule.ID, &rule.Name, &rule.Type, &rule.Config,
&rule.Enabled, &rule.CreatedAt, &rule.UpdatedAt)
&rule.Enabled, &rule.Severity, &rule.CreatedAt, &rule.UpdatedAt)
if err != nil {
if err == sql.ErrNoRows {
@@ -78,11 +78,11 @@ func (r *PolicyRepository) CreateRule(ctx context.Context, rule *domain.PolicyRu
}
err := r.db.QueryRowContext(ctx, `
INSERT INTO policy_rules (id, name, type, config, enabled, created_at, updated_at)
VALUES ($1, $2, $3, $4, $5, $6, $7)
INSERT INTO policy_rules (id, name, type, config, enabled, severity, created_at, updated_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8)
RETURNING id
`, rule.ID, rule.Name, rule.Type, rule.Config, rule.Enabled,
rule.CreatedAt, rule.UpdatedAt).Scan(&rule.ID)
rule.Severity, rule.CreatedAt, rule.UpdatedAt).Scan(&rule.ID)
if err != nil {
return fmt.Errorf("failed to create policy rule: %w", err)
@@ -99,9 +99,10 @@ func (r *PolicyRepository) UpdateRule(ctx context.Context, rule *domain.PolicyRu
type = $2,
config = $3,
enabled = $4,
updated_at = $5
WHERE id = $6
`, rule.Name, rule.Type, rule.Config, rule.Enabled, rule.UpdatedAt, rule.ID)
severity = $5,
updated_at = $6
WHERE id = $7
`, rule.Name, rule.Type, rule.Config, rule.Enabled, rule.Severity, rule.UpdatedAt, rule.ID)
if err != nil {
return fmt.Errorf("failed to update policy rule: %w", err)
+167 -1
View File
@@ -16,8 +16,16 @@ type RenewalServicer interface {
}
// JobServicer defines the interface for job processing used by the scheduler.
//
// RetryFailedJobs was added to close coverage gap I-001: JobService.RetryFailedJobs
// existed and was unit-tested but had no runtime caller prior to this loop being
// wired. The scheduler now drives it on an independent tick so failed jobs whose
// attempt counter is below MaxAttempts are periodically reset to Pending for the
// job processor to pick up again. maxRetries is advisory (per-job gating uses
// each job's own Attempts/MaxAttempts fields).
type JobServicer interface {
ProcessPendingJobs(ctx context.Context) error
RetryFailedJobs(ctx context.Context, maxRetries int) error
}
// AgentServicer defines the interface for agent health checks used by the scheduler.
@@ -50,6 +58,11 @@ type CloudDiscoveryServicer interface {
DiscoverAll(ctx context.Context) (int, []error)
}
// JobReaperService defines the interface for job timeout reaping used by the scheduler.
type JobReaperService interface {
ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error
}
// Scheduler manages background jobs and periodic tasks for the certificate control plane.
// It runs multiple concurrent loops for renewal checks, job processing, agent health checks,
// and notification processing.
@@ -62,11 +75,13 @@ type Scheduler struct {
digestService DigestServicer
healthCheckService HealthCheckServicer
cloudDiscoveryService CloudDiscoveryServicer
jobReaper JobReaperService
logger *slog.Logger
// Configurable tick intervals
renewalCheckInterval time.Duration
jobProcessorInterval time.Duration
jobRetryInterval time.Duration
agentHealthCheckInterval time.Duration
notificationProcessInterval time.Duration
shortLivedExpiryCheckInterval time.Duration
@@ -74,10 +89,14 @@ type Scheduler struct {
digestInterval time.Duration
healthCheckInterval time.Duration
cloudDiscoveryInterval time.Duration
jobTimeoutInterval time.Duration
awaitingCSRTimeout time.Duration
awaitingApprovalTimeout time.Duration
// Idempotency guards: prevent duplicate execution of slow jobs
renewalCheckRunning atomic.Bool
jobProcessorRunning atomic.Bool
jobRetryRunning atomic.Bool
agentHealthCheckRunning atomic.Bool
notificationProcessRunning atomic.Bool
shortLivedExpiryCheckRunning atomic.Bool
@@ -85,6 +104,7 @@ type Scheduler struct {
digestRunning atomic.Bool
healthCheckRunning atomic.Bool
cloudDiscoveryRunning atomic.Bool
jobTimeoutRunning atomic.Bool
// Graceful shutdown: wait for in-flight work to complete
wg sync.WaitGroup
@@ -110,6 +130,7 @@ func NewScheduler(
// Default intervals
renewalCheckInterval: 1 * time.Hour,
jobProcessorInterval: 30 * time.Second,
jobRetryInterval: 5 * time.Minute,
agentHealthCheckInterval: 2 * time.Minute,
notificationProcessInterval: 1 * time.Minute,
shortLivedExpiryCheckInterval: 30 * time.Second,
@@ -117,6 +138,7 @@ func NewScheduler(
digestInterval: 24 * time.Hour,
healthCheckInterval: 60 * time.Second,
cloudDiscoveryInterval: 6 * time.Hour,
jobTimeoutInterval: 10 * time.Minute,
}
}
@@ -141,6 +163,13 @@ func (s *Scheduler) SetJobProcessorInterval(d time.Duration) {
s.jobProcessorInterval = d
}
// SetJobRetryInterval configures the interval for the failed-job retry loop
// (coverage gap I-001). Defaults to 5 minutes; honors
// CERTCTL_SCHEDULER_RETRY_INTERVAL when wired from config.
func (s *Scheduler) SetJobRetryInterval(d time.Duration) {
s.jobRetryInterval = d
}
// SetAgentHealthCheckInterval configures the interval for agent health checks.
func (s *Scheduler) SetAgentHealthCheckInterval(d time.Duration) {
s.agentHealthCheckInterval = d
@@ -183,6 +212,26 @@ func (s *Scheduler) SetCloudDiscoveryInterval(d time.Duration) {
s.cloudDiscoveryInterval = d
}
// SetJobReaperService sets the job reaper service (I-003).
func (s *Scheduler) SetJobReaperService(jr JobReaperService) {
s.jobReaper = jr
}
// SetJobTimeoutInterval sets the job timeout reaper tick interval (I-003).
func (s *Scheduler) SetJobTimeoutInterval(d time.Duration) {
s.jobTimeoutInterval = d
}
// SetAwaitingCSRTimeout sets the AwaitingCSR TTL (I-003).
func (s *Scheduler) SetAwaitingCSRTimeout(d time.Duration) {
s.awaitingCSRTimeout = d
}
// SetAwaitingApprovalTimeout sets the AwaitingApproval TTL (I-003).
func (s *Scheduler) SetAwaitingApprovalTimeout(d time.Duration) {
s.awaitingApprovalTimeout = d
}
// Start initiates all background scheduler loops. It returns a channel that signals
// when the scheduler has started all loops. The scheduler runs until the context is cancelled.
func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
@@ -193,7 +242,10 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
// Track all loop goroutines in the WaitGroup so WaitForCompletion
// blocks until they've fully exited (prevents test races).
loopCount := 5
// Base count is 7: renewal, job processor, job retry (I-001),
// job timeout (I-003), agent health, notification, short-lived expiry. Optional loops
// (network scan, digest, health check, cloud discovery) add to this.
loopCount := 7
if s.networkScanService != nil {
loopCount++
}
@@ -210,6 +262,8 @@ func (s *Scheduler) Start(ctx context.Context) <-chan struct{} {
go func() { defer s.wg.Done(); s.renewalCheckLoop(ctx) }()
go func() { defer s.wg.Done(); s.jobProcessorLoop(ctx) }()
go func() { defer s.wg.Done(); s.jobRetryLoop(ctx) }()
go func() { defer s.wg.Done(); s.jobTimeoutLoop(ctx) }()
go func() { defer s.wg.Done(); s.agentHealthCheckLoop(ctx) }()
go func() { defer s.wg.Done(); s.notificationProcessLoop(ctx) }()
go func() { defer s.wg.Done(); s.shortLivedExpiryCheckLoop(ctx) }()
@@ -334,6 +388,118 @@ func (s *Scheduler) runJobProcessor(ctx context.Context) {
}
}
// jobRetryLoop runs every jobRetryInterval and transitions eligible Failed jobs
// back to Pending so the job processor can pick them up again. Closes coverage
// gap I-001 — JobService.RetryFailedJobs had no runtime caller prior to this
// loop being wired. Runs immediately on start, then every interval.
// Uses atomic.Bool to prevent duplicate execution if the previous retry sweep
// is still running.
func (s *Scheduler) jobRetryLoop(ctx context.Context) {
ticker := time.NewTicker(s.jobRetryInterval)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
s.jobRetryRunning.Store(true)
s.wg.Add(1)
go func() {
defer s.wg.Done()
defer s.jobRetryRunning.Store(false)
s.runJobRetry(ctx)
}()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
if !s.jobRetryRunning.CompareAndSwap(false, true) {
s.logger.Warn("job retry still running, skipping tick")
continue
}
s.wg.Add(1)
go func() {
defer s.wg.Done()
defer s.jobRetryRunning.Store(false)
s.runJobRetry(ctx)
}()
}
}
}
// runJobRetry executes a single failed-job retry cycle with error recovery.
// Uses the same 2-minute per-tick timeout as runJobProcessor; RetryFailedJobs
// issues one SELECT and one UPDATE per eligible job (cheap), so this headroom
// covers very large failure backlogs without starving the loop.
func (s *Scheduler) runJobRetry(ctx context.Context) {
opCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
defer cancel()
// maxRetries is advisory at the service layer (per-job gating uses each
// job's own Attempts/MaxAttempts). Passing 3 matches the conventional
// default seen across the codebase's job creation paths.
if err := s.jobService.RetryFailedJobs(opCtx, 3); err != nil {
s.logger.Error("job retry failed",
"error", err,
"interval", s.jobRetryInterval.String())
} else {
s.logger.Debug("job retry completed")
}
}
// jobTimeoutLoop runs every jobTimeoutInterval and transitions jobs stuck in
// AwaitingCSR or AwaitingApproval to Failed if they exceed their TTL. I-001's
// retry loop then auto-promotes eligible Failed jobs back to Pending. Closes
// coverage gap I-003. Uses atomic.Bool to prevent duplicate execution.
func (s *Scheduler) jobTimeoutLoop(ctx context.Context) {
ticker := time.NewTicker(s.jobTimeoutInterval)
defer ticker.Stop()
// Run immediately on start (with idempotency guard)
s.jobTimeoutRunning.Store(true)
s.wg.Add(1)
go func() {
defer s.wg.Done()
defer s.jobTimeoutRunning.Store(false)
s.runJobTimeout(ctx)
}()
for {
select {
case <-ctx.Done():
return
case <-ticker.C:
if !s.jobTimeoutRunning.CompareAndSwap(false, true) {
s.logger.Warn("job timeout reaper still running, skipping tick")
continue
}
s.wg.Add(1)
go func() {
defer s.wg.Done()
defer s.jobTimeoutRunning.Store(false)
s.runJobTimeout(ctx)
}()
}
}
}
// runJobTimeout executes a single job timeout reaping cycle with error recovery.
// When no JobReaperService has been wired (e.g. in tests that don't exercise
// I-003) the call is a safe no-op, preserving the always-on loop topology
// described in I-003 without forcing every consumer to wire a reaper.
func (s *Scheduler) runJobTimeout(ctx context.Context) {
if s.jobReaper == nil {
return
}
opCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)
defer cancel()
if err := s.jobReaper.ReapTimedOutJobs(opCtx, s.awaitingCSRTimeout, s.awaitingApprovalTimeout); err != nil {
s.logger.Error("job timeout reaper failed",
"error", err,
"interval", s.jobTimeoutInterval.String())
} else {
s.logger.Debug("job timeout reaper completed")
}
}
// agentHealthCheckLoop runs every agentHealthCheckInterval and marks stale agents as offline.
// An agent is considered stale if it hasn't sent a heartbeat within the health check interval.
// If an error occurs, it logs the error but continues running.
+410
View File
@@ -68,12 +68,30 @@ func (m *mockRenewalService) ExpireShortLivedCertificates(ctx context.Context) e
}
// mockJobService is a mock implementation for testing.
//
// Tracks ProcessPendingJobs and RetryFailedJobs separately. retrySlowDelay and
// retryShouldError let tests exercise the retry loop independently of the
// processor loop without coupling their timing/failure modes.
type mockJobService struct {
mu sync.Mutex
callCount int
callTimes []time.Time
slowDelay time.Duration
shouldError bool
// Retry loop tracking (coverage gap I-001)
retryCallCount int
retryCallTimes []time.Time
retryMaxRetriesSeen []int
retrySlowDelay time.Duration
retryShouldError bool
// Timeout reaper tracking (coverage gap I-003)
reapCallCount int
reapCallTimes []time.Time
reapSlowDelay time.Duration
reapShouldError bool
reapCtxHasDeadline bool
}
func (m *mockJobService) ProcessPendingJobs(ctx context.Context) error {
@@ -96,6 +114,57 @@ func (m *mockJobService) ProcessPendingJobs(ctx context.Context) error {
return nil
}
// RetryFailedJobs is the scheduler-driven counterpart to ProcessPendingJobs that
// covers coverage gap I-001: JobService.RetryFailedJobs had no runtime caller
// prior to the jobRetryLoop being wired.
func (m *mockJobService) RetryFailedJobs(ctx context.Context, maxRetries int) error {
m.mu.Lock()
m.retryCallCount++
m.retryCallTimes = append(m.retryCallTimes, time.Now())
m.retryMaxRetriesSeen = append(m.retryMaxRetriesSeen, maxRetries)
m.mu.Unlock()
if m.retrySlowDelay > 0 {
select {
case <-time.After(m.retrySlowDelay):
case <-ctx.Done():
return ctx.Err()
}
}
if m.retryShouldError {
return context.Canceled
}
return nil
}
// ReapTimedOutJobs is the scheduler-driven counterpart to ProcessPendingJobs that
// covers coverage gap I-003: JobService.ReapTimedOutJobs (via JobReaperService interface)
// had no runtime caller prior to the jobTimeoutLoop being wired.
func (m *mockJobService) ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error {
m.mu.Lock()
m.reapCallCount++
m.reapCallTimes = append(m.reapCallTimes, time.Now())
// Track whether context has a deadline set
_, hasDeadline := ctx.Deadline()
m.reapCtxHasDeadline = hasDeadline
m.mu.Unlock()
if m.reapSlowDelay > 0 {
select {
case <-time.After(m.reapSlowDelay):
case <-ctx.Done():
return ctx.Err()
}
}
if m.reapShouldError {
return context.Canceled
}
return nil
}
// mockAgentService is a mock implementation for testing.
type mockAgentService struct {
mu sync.Mutex
@@ -948,3 +1017,344 @@ func TestScheduler_DigestLoop_SetDigestInterval(t *testing.T) {
t.Errorf("digestInterval should be %v after SetDigestInterval, got %v", customInterval, sched.digestInterval)
}
}
// TestScheduler_JobRetryLoop_CallsService verifies that the job retry loop
// invokes JobService.RetryFailedJobs on each tick. Closes coverage gap I-001 —
// prior to the loop being wired, RetryFailedJobs had no runtime caller.
//
// Also verifies that the scheduler forwards the conventional advisory maxRetries
// constant (3) to the service layer; per-job gating still lives in each job's
// own Attempts/MaxAttempts fields.
func TestScheduler_JobRetryLoop_CallsService(t *testing.T) {
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
renewalMock := &mockRenewalService{}
jobMock := &mockJobService{}
agentMock := &mockAgentService{}
notificationMock := &mockNotificationService{}
networkMock := &mockNetworkScanService{}
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
// Quiet every other loop so only the retry loop's calls are visible on jobMock.
sched.SetRenewalCheckInterval(10 * time.Second)
sched.SetJobProcessorInterval(10 * time.Second)
sched.SetAgentHealthCheckInterval(10 * time.Second)
sched.SetNotificationProcessInterval(10 * time.Second)
sched.SetNetworkScanInterval(10 * time.Second)
sched.SetJobRetryInterval(50 * time.Millisecond)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
startedChan := sched.Start(ctx)
<-startedChan
// Run long enough for the immediate start + at least one tick.
time.Sleep(200 * time.Millisecond)
cancel()
_ = sched.WaitForCompletion(2 * time.Second)
jobMock.mu.Lock()
retryCount := jobMock.retryCallCount
var firstMaxRetries int
if len(jobMock.retryMaxRetriesSeen) > 0 {
firstMaxRetries = jobMock.retryMaxRetriesSeen[0]
}
jobMock.mu.Unlock()
if retryCount < 1 {
t.Fatalf("expected job retry service to be called at least once, got %d", retryCount)
}
if firstMaxRetries != 3 {
t.Fatalf("expected scheduler to forward advisory maxRetries=3, got %d", firstMaxRetries)
}
t.Logf("job retry loop called %d times (maxRetries=%d)", retryCount, firstMaxRetries)
}
// TestScheduler_JobRetryLoop_IdempotencyGuard verifies that a slow retry sweep
// does not cause overlapping executions. Mirrors the shape of
// TestScheduler_DigestLoop_WithIdempotencyGuard.
//
// The guard is the atomic.Bool jobRetryRunning in scheduler.go. Without it, a
// 100ms tick against a 150ms operation would fire ~4 times in 400ms; with the
// guard we expect ~23 calls.
func TestScheduler_JobRetryLoop_IdempotencyGuard(t *testing.T) {
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
renewalMock := &mockRenewalService{}
jobMock := &mockJobService{
retrySlowDelay: 150 * time.Millisecond, // slower than tick interval
}
agentMock := &mockAgentService{}
notificationMock := &mockNotificationService{}
networkMock := &mockNetworkScanService{}
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
sched.SetRenewalCheckInterval(10 * time.Second)
sched.SetJobProcessorInterval(10 * time.Second)
sched.SetAgentHealthCheckInterval(10 * time.Second)
sched.SetNotificationProcessInterval(10 * time.Second)
sched.SetNetworkScanInterval(10 * time.Second)
sched.SetJobRetryInterval(100 * time.Millisecond)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
startedChan := sched.Start(ctx)
<-startedChan
time.Sleep(400 * time.Millisecond)
jobMock.mu.Lock()
retryCount := jobMock.retryCallCount
jobMock.mu.Unlock()
// With a 150ms sweep and 100ms interval, a functioning guard should yield
// roughly 23 calls (immediate + any ticks whose previous sweep finished).
// Anything above 3 suggests the guard isn't holding.
if retryCount > 3 {
t.Logf("WARNING: retry called %d times in 400ms with 100ms interval and 150ms sweep — guard may not be working", retryCount)
}
t.Logf("job retry idempotency guard: %d calls in 400ms (100ms interval, 150ms sweep)", retryCount)
cancel()
if err := sched.WaitForCompletion(2 * time.Second); err != nil {
t.Fatalf("WaitForCompletion should succeed: %v", err)
}
}
// TestScheduler_JobRetryLoop_WaitForCompletion verifies that a retry sweep
// which is still in flight at shutdown is awaited by WaitForCompletion (same
// sync.WaitGroup contract as every other loop).
func TestScheduler_JobRetryLoop_WaitForCompletion(t *testing.T) {
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
renewalMock := &mockRenewalService{}
jobMock := &mockJobService{
retrySlowDelay: 100 * time.Millisecond,
}
agentMock := &mockAgentService{}
notificationMock := &mockNotificationService{}
networkMock := &mockNetworkScanService{}
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
sched.SetRenewalCheckInterval(10 * time.Second)
sched.SetJobProcessorInterval(10 * time.Second)
sched.SetAgentHealthCheckInterval(10 * time.Second)
sched.SetNotificationProcessInterval(10 * time.Second)
sched.SetNetworkScanInterval(10 * time.Second)
sched.SetJobRetryInterval(50 * time.Millisecond)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
startedChan := sched.Start(ctx)
<-startedChan
// Let the immediate-start retry goroutine begin its 100ms sweep.
time.Sleep(30 * time.Millisecond)
// Initiate shutdown mid-sweep.
cancel()
start := time.Now()
err := sched.WaitForCompletion(5 * time.Second)
elapsed := time.Since(start)
if err != nil {
t.Fatalf("WaitForCompletion should not error: %v", err)
}
if elapsed > 5*time.Second {
t.Fatalf("WaitForCompletion took longer than expected: %v", elapsed)
}
jobMock.mu.Lock()
retryCount := jobMock.retryCallCount
jobMock.mu.Unlock()
if retryCount < 1 {
t.Fatalf("expected retry service to have started at least once before shutdown, got %d", retryCount)
}
t.Logf("retry loop graceful shutdown completed in %v after %d in-flight sweep(s)", elapsed, retryCount)
}
// TestScheduler_JobTimeoutLoop_NormalTick verifies that the job timeout reaper
// loop ticks at the specified interval (coverage gap I-003).
func TestScheduler_JobTimeoutLoop_NormalTick(t *testing.T) {
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
renewalMock := &mockRenewalService{}
jobMock := &mockJobService{}
agentMock := &mockAgentService{}
notificationMock := &mockNotificationService{}
networkMock := &mockNetworkScanService{}
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
sched.SetRenewalCheckInterval(10 * time.Second)
sched.SetJobProcessorInterval(10 * time.Second)
sched.SetAgentHealthCheckInterval(10 * time.Second)
sched.SetNotificationProcessInterval(10 * time.Second)
sched.SetNetworkScanInterval(10 * time.Second)
sched.SetJobRetryInterval(10 * time.Second)
sched.SetJobTimeoutInterval(50 * time.Millisecond)
sched.SetAwaitingCSRTimeout(24 * time.Hour)
sched.SetAwaitingApprovalTimeout(168 * time.Hour)
sched.SetJobReaperService(jobMock)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
<-sched.Start(ctx)
time.Sleep(200 * time.Millisecond)
cancel()
if err := sched.WaitForCompletion(2 * time.Second); err != nil {
t.Fatalf("WaitForCompletion: %v", err)
}
jobMock.mu.Lock()
count := jobMock.reapCallCount
jobMock.mu.Unlock()
if count < 2 {
t.Fatalf("expected >= 2 reap calls, got %d", count)
}
}
// TestScheduler_JobTimeoutLoop_IdempotencyGuard verifies that the timeout reaper
// uses an atomic guard to prevent concurrent execution (coverage gap I-003).
func TestScheduler_JobTimeoutLoop_IdempotencyGuard(t *testing.T) {
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
renewalMock := &mockRenewalService{}
jobMock := &mockJobService{
reapSlowDelay: 150 * time.Millisecond,
}
agentMock := &mockAgentService{}
notificationMock := &mockNotificationService{}
networkMock := &mockNetworkScanService{}
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
sched.SetRenewalCheckInterval(10 * time.Second)
sched.SetJobProcessorInterval(10 * time.Second)
sched.SetAgentHealthCheckInterval(10 * time.Second)
sched.SetNotificationProcessInterval(10 * time.Second)
sched.SetNetworkScanInterval(10 * time.Second)
sched.SetJobRetryInterval(10 * time.Second)
sched.SetJobTimeoutInterval(50 * time.Millisecond)
sched.SetAwaitingCSRTimeout(24 * time.Hour)
sched.SetAwaitingApprovalTimeout(168 * time.Hour)
sched.SetJobReaperService(jobMock)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
<-sched.Start(ctx)
time.Sleep(400 * time.Millisecond)
jobMock.mu.Lock()
reapCount := jobMock.reapCallCount
jobMock.mu.Unlock()
if reapCount > 3 {
t.Logf("WARNING: reap called %d times in 400ms with 50ms interval and 150ms sweep — guard may not be working", reapCount)
}
t.Logf("job timeout idempotency guard: %d calls in 400ms (50ms interval, 150ms sweep)", reapCount)
cancel()
if err := sched.WaitForCompletion(2 * time.Second); err != nil {
t.Fatalf("WaitForCompletion should succeed: %v", err)
}
}
// TestScheduler_JobTimeoutLoop_ShutdownDrainsInFlight verifies that shutdown waits
// for an in-flight timeout reaper to complete (coverage gap I-003).
func TestScheduler_JobTimeoutLoop_ShutdownDrainsInFlight(t *testing.T) {
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
renewalMock := &mockRenewalService{}
jobMock := &mockJobService{
reapSlowDelay: 100 * time.Millisecond,
}
agentMock := &mockAgentService{}
notificationMock := &mockNotificationService{}
networkMock := &mockNetworkScanService{}
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
sched.SetRenewalCheckInterval(10 * time.Second)
sched.SetJobProcessorInterval(10 * time.Second)
sched.SetAgentHealthCheckInterval(10 * time.Second)
sched.SetNotificationProcessInterval(10 * time.Second)
sched.SetNetworkScanInterval(10 * time.Second)
sched.SetJobRetryInterval(10 * time.Second)
sched.SetJobTimeoutInterval(50 * time.Millisecond)
sched.SetAwaitingCSRTimeout(24 * time.Hour)
sched.SetAwaitingApprovalTimeout(168 * time.Hour)
sched.SetJobReaperService(jobMock)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
<-sched.Start(ctx)
// Let the immediate-start timeout reaper goroutine begin its 100ms sweep.
time.Sleep(30 * time.Millisecond)
// Initiate shutdown mid-sweep.
cancel()
start := time.Now()
err := sched.WaitForCompletion(5 * time.Second)
elapsed := time.Since(start)
if err != nil {
t.Fatalf("WaitForCompletion should not error: %v", err)
}
if elapsed > 5*time.Second {
t.Fatalf("WaitForCompletion took longer than expected: %v", elapsed)
}
jobMock.mu.Lock()
reapCount := jobMock.reapCallCount
jobMock.mu.Unlock()
if reapCount < 1 {
t.Fatalf("expected timeout reaper to have started at least once before shutdown, got %d", reapCount)
}
t.Logf("timeout reaper graceful shutdown completed in %v after %d in-flight sweep(s)", elapsed, reapCount)
}
// TestScheduler_JobTimeoutLoop_ContextDeadlineRespected verifies that the timeout
// reaper receives a context with a deadline set for each tick (coverage gap I-003).
func TestScheduler_JobTimeoutLoop_ContextDeadlineRespected(t *testing.T) {
logger := slog.New(slog.NewTextHandler(os.Stderr, nil))
renewalMock := &mockRenewalService{}
jobMock := &mockJobService{}
agentMock := &mockAgentService{}
notificationMock := &mockNotificationService{}
networkMock := &mockNetworkScanService{}
sched := NewScheduler(renewalMock, jobMock, agentMock, notificationMock, networkMock, logger)
sched.SetRenewalCheckInterval(10 * time.Second)
sched.SetJobProcessorInterval(10 * time.Second)
sched.SetAgentHealthCheckInterval(10 * time.Second)
sched.SetNotificationProcessInterval(10 * time.Second)
sched.SetNetworkScanInterval(10 * time.Second)
sched.SetJobRetryInterval(10 * time.Second)
sched.SetJobTimeoutInterval(50 * time.Millisecond)
sched.SetAwaitingCSRTimeout(24 * time.Hour)
sched.SetAwaitingApprovalTimeout(168 * time.Hour)
sched.SetJobReaperService(jobMock)
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
<-sched.Start(ctx)
time.Sleep(100 * time.Millisecond)
cancel()
if err := sched.WaitForCompletion(2 * time.Second); err != nil {
t.Fatalf("WaitForCompletion: %v", err)
}
jobMock.mu.Lock()
hasDeadline := jobMock.reapCtxHasDeadline
jobMock.mu.Unlock()
if !hasDeadline {
t.Fatal("expected timeout reaper context to have a deadline set, but none found")
}
t.Log("timeout reaper context deadline verified")
}
+15
View File
@@ -92,12 +92,27 @@ func (s *AgentService) Register(ctx context.Context, name string, hostname strin
}
// Heartbeat updates an agent's last seen time, status, and metadata.
//
// I-004: retired agents must be rejected up-front. A retired agent that is
// still polling is a zombie — its row exists only for audit history and must
// not be allowed to bump LastHeartbeatAt (which would resurrect it in stats
// dashboards and stale-offline sweeps). The sentinel ErrAgentRetired is
// returned unwrapped so the HTTP handler can map it to 410 Gone via
// errors.Is; the agent process detects the 410 and shuts down cleanly
// instead of continuing to heartbeat indefinitely.
func (s *AgentService) Heartbeat(ctx context.Context, agentID string, metadata *domain.AgentMetadata) error {
agent, err := s.agentRepo.Get(ctx, agentID)
if err != nil {
return fmt.Errorf("failed to fetch agent: %w", err)
}
// I-004 guard: retired agents are frozen. Do not call UpdateHeartbeat —
// bumping the timestamp would defeat the retired-row filter that protects
// stats, scheduler sweeps, and handler listings.
if agent.IsRetired() {
return ErrAgentRetired
}
// Update heartbeat and metadata
if err := s.agentRepo.UpdateHeartbeat(ctx, agentID, metadata); err != nil {
return fmt.Errorf("failed to update heartbeat: %w", err)
+317
View File
@@ -0,0 +1,317 @@
package service
import (
"context"
"errors"
"fmt"
"log/slog"
"time"
"github.com/shankar0123/certctl/internal/domain"
)
// I-004 coverage-gap closure: the agent retirement surface.
//
// Before 000015, DELETE /api/v1/agents/{id} hard-deleted the agents row and
// the deployment_targets.agent_id FK CASCADE cleaned up downstream rows with
// no preflight, no archival, and no knowledge of in-flight jobs. Any cert
// still rotating through one of those targets would observe half-migrated
// state. I-004 closes that gap with a preflight + soft-retire + optional
// forced-cascade contract; the symbols in this file are the service-layer
// surface that the handler and operator UI bind against.
// ErrAgentIsSentinel is returned when an operator tries to retire one of the
// four reserved sentinel agent IDs (server-scanner, cloud-aws-sm,
// cloud-azure-kv, cloud-gcp-sm). These rows back the network scanner and the
// three cloud secret-manager discovery sources; retiring any of them orphans
// its subsystem. The guard fires unconditionally — force=true does not bypass
// it, because a sentinel is a structural invariant of the deployment, not
// a piece of fleet state the operator owns. Handler maps this to HTTP 403.
var ErrAgentIsSentinel = errors.New("agent is a reserved sentinel and cannot be retired")
// ErrBlockedByDependencies is returned by RetireAgent when at least one of
// (active targets, active certificates, pending jobs) referencing the agent
// is non-zero and force=false. The caller always receives it wrapped in
// a *BlockedByDependenciesError (see below), so handlers doing errors.As
// can surface the per-bucket counts in the 409 body for operator
// troubleshooting. Tests use errors.Is; handlers use errors.As.
var ErrBlockedByDependencies = errors.New("agent has active downstream dependencies")
// ErrForceReasonRequired is returned when force=true is supplied without a
// non-empty reason. The force escape hatch is deliberately chatty: operators
// pulling the emergency cord must leave an auditable breadcrumb explaining
// why a cascade was justified. Handler maps this to HTTP 400 so the operator
// retries with --reason rather than silently skipping the guard. Checked
// before any DB mutation to keep the no-reason path transactionally clean.
var ErrForceReasonRequired = errors.New("force=true requires a non-empty reason")
// ErrAgentRetired is returned by Heartbeat (and any future agent-authenticated
// call site) when a retired agent is still polling. The handler layer maps
// this to HTTP 410 Gone so the cmd/agent sendHeartbeat loop can detect it
// deterministically and shut down the agent process, rather than looping
// forever on a soft-retired identity. IsRetired() on the domain model is
// the single source of truth; the sentinel exists so service and handler
// callers can errors.Is against one symbol.
var ErrAgentRetired = errors.New("agent has been retired")
// BlockedByDependenciesError wraps ErrBlockedByDependencies and carries the
// per-bucket dependency snapshot the preflight pass captured. The embedded
// AgentDependencyCounts is the same struct the repo returns from the three
// CountActive* calls, so the handler can marshal it directly into the 409
// body without reshaping fields. Unwrap() satisfies errors.Is against the
// sentinel; Error() includes the counts so logs are diagnostic on their own.
type BlockedByDependenciesError struct {
Counts domain.AgentDependencyCounts
}
// Error formats the wrapped error with the per-bucket counts. Kept short so
// it reads cleanly in slog output.
func (e *BlockedByDependenciesError) Error() string {
return fmt.Sprintf(
"%s (active_targets=%d, active_certificates=%d, pending_jobs=%d)",
ErrBlockedByDependencies.Error(),
e.Counts.ActiveTargets,
e.Counts.ActiveCertificates,
e.Counts.PendingJobs,
)
}
// Unwrap lets errors.Is(err, ErrBlockedByDependencies) match the wrapped
// struct — the test contract (agent_retire_test.go:167) depends on it.
func (e *BlockedByDependenciesError) Unwrap() error { return ErrBlockedByDependencies }
// AgentRetirementResult is the outcome surface the handler returns to the
// operator. It discriminates the three happy paths the endpoint can take —
// idempotent no-op (AlreadyRetired), clean soft-retire (Cascade=false), and
// forced cascade (Cascade=true) — and always carries the retired_at timestamp
// and the dependency-count snapshot so the 200/204 response body can echo
// what was (or would have been) affected.
//
// AlreadyRetired=true → agent was already retired; no new audit
// event was emitted; RetiredAt is the
// original stamp, not the current time.
// Cascade=false → clean soft-retire; Counts is all zeros.
// Cascade=true → force=true retired agent + downstream
// targets; Counts is the PRE-cascade
// snapshot (so the operator sees what
// they just retired).
type AgentRetirementResult struct {
AlreadyRetired bool
Cascade bool
RetiredAt time.Time
Counts domain.AgentDependencyCounts
}
// RetireAgent implements the I-004 retirement contract. Ordering matters —
// every guard fires before the one that would mutate state, so a rejected
// retire leaves zero trace (no audit event, no partial DB write):
//
// 1. Sentinel check (unconditional; force does not bypass).
// 2. Fetch agent (404 surfaces as-is from the repo).
// 3. Already-retired idempotency: return AlreadyRetired=true with NO new
// audit event — the original retire already recorded one.
// 4. Preflight count pass via the three CountActive* repo methods.
// 5. Force-reason guard: force=true with empty reason is rejected here,
// after the counts are known but before any mutation.
// 6. Default no-force path: any non-zero count returns
// *BlockedByDependenciesError with counts attached.
// 7. Mutation: SoftRetire (no cascade) or RetireAgentWithCascade, with
// a single retiredAt timestamp pinned BEFORE the repo call so the
// audit event and the DB row agree to the nanosecond.
// 8. Audit: agent_retired always; agent_retirement_cascaded additionally
// on the force=true cascade path.
//
// Actor comes from the handler's resolveActor (API key → user, agent key →
// agent-<id>, unauthenticated → "anonymous"); the service does not second-
// guess it. Audit emission is best-effort: a failed RecordEvent logs a
// warning but does not fail the overall retirement, consistent with how
// the rest of the codebase treats audit as an observability concern
// rather than a correctness barrier.
func (s *AgentService) RetireAgent(ctx context.Context, id string, actor string, force bool, reason string) (*AgentRetirementResult, error) {
// Step 1 — reserved-sentinel guard. Applies even under force=true.
if domain.IsSentinelAgent(id) {
return nil, ErrAgentIsSentinel
}
// Step 2 — existence check. Missing agent surfaces the repo's not-found
// error verbatim so the handler can map it to 404 via its existing
// detection path (the handler layer already has "not found" mapping
// logic inherited from the pre-I-004 Delete endpoint).
agent, err := s.agentRepo.Get(ctx, id)
if err != nil {
return nil, fmt.Errorf("failed to fetch agent: %w", err)
}
// Step 3 — idempotency. A retired agent returns AlreadyRetired=true
// WITHOUT emitting a fresh audit event. Handler maps this to HTTP 204.
// Guarding here (before preflight) means a re-retire of an agent that
// now has zero deps doesn't spuriously "succeed again" and double-log.
if agent.IsRetired() {
return &AgentRetirementResult{
AlreadyRetired: true,
RetiredAt: *agent.RetiredAt,
}, nil
}
// Step 4 — preflight counts. All three run even when force=true: we
// need them to populate AgentRetirementResult.Counts (the pre-cascade
// snapshot). A repo failure here aborts the whole operation — partial
// preflight is worse than no preflight.
counts, err := s.collectAgentDependencyCounts(ctx, id)
if err != nil {
return nil, fmt.Errorf("failed to collect agent dependency counts: %w", err)
}
// Step 5 — force-reason guard. Positioned AFTER preflight so operators
// who forgot --reason still see accurate counts when they retry. The
// empty-reason rejection fires before any mutation, so the rejected
// attempt leaves no audit noise.
if force && reason == "" {
return nil, ErrForceReasonRequired
}
// Step 6 — default path: block on any non-zero bucket. Wrapping the
// sentinel in *BlockedByDependenciesError lets the handler use errors.As
// to surface counts in the 409 body while tests use errors.Is against
// the sentinel. Both callers are satisfied by the single Unwrap chain.
if !force && counts.HasDependencies() {
return nil, &BlockedByDependenciesError{Counts: counts}
}
// Step 7 — mutation. Pin retiredAt once so the audit event, the agent
// row, and (on cascade) every deployment_targets row share the same
// timestamp. Callers querying "what happened at T?" can correlate
// retirement rows across tables without clock-skew tie-breaking.
retiredAt := time.Now()
cascade := force && counts.HasDependencies()
if cascade {
if err := s.agentRepo.RetireAgentWithCascade(ctx, id, retiredAt, reason); err != nil {
return nil, fmt.Errorf("failed to retire agent with cascade: %w", err)
}
} else {
if err := s.agentRepo.SoftRetire(ctx, id, retiredAt, reason); err != nil {
return nil, fmt.Errorf("failed to soft-retire agent: %w", err)
}
}
// Step 8 — audit. Two events on the cascade path so forensics can
// distinguish "agent was retired" (agent_retired) from "downstream
// targets were flipped" (agent_retirement_cascaded). Details on the
// cascaded event carry the pre-cascade counts so a reviewer looking
// only at the audit log knows how much state was affected. Emission
// is best-effort — audit is observability, not a correctness barrier.
actorType := s.resolveActorType(actor)
details := map[string]interface{}{
"actor": actor,
"reason": reason,
"force": force,
"active_targets": counts.ActiveTargets,
"active_certificates": counts.ActiveCertificates,
"pending_jobs": counts.PendingJobs,
}
if err := s.auditService.RecordEvent(ctx, actor, actorType,
"agent_retired", "agent", id, details); err != nil {
slog.Error("failed to record agent_retired audit event", "agent_id", id, "error", err)
}
if cascade {
cascadeDetails := map[string]interface{}{
"actor": actor,
"reason": reason,
"active_targets": counts.ActiveTargets,
"active_certificates": counts.ActiveCertificates,
"pending_jobs": counts.PendingJobs,
}
if err := s.auditService.RecordEvent(ctx, actor, actorType,
"agent_retirement_cascaded", "agent", id, cascadeDetails); err != nil {
slog.Error("failed to record agent_retirement_cascaded audit event", "agent_id", id, "error", err)
}
}
return &AgentRetirementResult{
AlreadyRetired: false,
Cascade: cascade,
RetiredAt: retiredAt,
Counts: counts,
}, nil
}
// ListRetiredAgents returns the paginated list of retired agents in
// retired_at DESC order. This is the companion to ListAgents — which
// hides retired rows — so the operator UI can render a dedicated
// "Retired" tab without leaking retired rows into every other listing.
// Pagination defaults (page<1→1, perPage<1→50) are applied here as
// well as in the repo, so callers can pass 0s when they want defaults.
//
// Return shape harmonizes with handler.AgentService: a value slice
// (not pointer slice) and int64 total. The repo returns []*domain.Agent;
// this method dereferences into a value slice so the handler's
// PagedResponse marshals straight objects and so the compile-time
// interface assertion in agent_retire_handler_test.go:387 is satisfied.
// Nil repo entries are skipped defensively — the repo should never
// return them, but the handler contract is more important than the
// repo's (pointer-slice) convenience.
func (s *AgentService) ListRetiredAgents(ctx context.Context, page, perPage int) ([]domain.Agent, int64, error) {
if page < 1 {
page = 1
}
if perPage < 1 {
perPage = 50
}
agents, total, err := s.agentRepo.ListRetired(ctx, page, perPage)
if err != nil {
return nil, 0, fmt.Errorf("failed to list retired agents: %w", err)
}
out := make([]domain.Agent, 0, len(agents))
for _, a := range agents {
if a == nil {
continue
}
out = append(out, *a)
}
return out, int64(total), nil
}
// collectAgentDependencyCounts runs the three preflight COUNT queries in
// sequence and bundles the result. Sequential (not parallel) because the
// queries are cheap (<1ms each on the indexed columns added in 000015) and
// sequential keeps error handling simple. Any repo error short-circuits
// — we prefer to refuse the retire than make a half-informed decision.
func (s *AgentService) collectAgentDependencyCounts(ctx context.Context, id string) (domain.AgentDependencyCounts, error) {
var counts domain.AgentDependencyCounts
targets, err := s.agentRepo.CountActiveTargets(ctx, id)
if err != nil {
return counts, fmt.Errorf("count active targets: %w", err)
}
counts.ActiveTargets = targets
certs, err := s.agentRepo.CountActiveCertificates(ctx, id)
if err != nil {
return counts, fmt.Errorf("count active certificates: %w", err)
}
counts.ActiveCertificates = certs
jobs, err := s.agentRepo.CountPendingJobs(ctx, id)
if err != nil {
return counts, fmt.Errorf("count pending jobs: %w", err)
}
counts.PendingJobs = jobs
return counts, nil
}
// resolveActorType maps an opaque actor string into the typed ActorType
// used by the audit schema. Matches the conventions the rest of the
// service layer uses: "system" → System, anything that looks like an
// agent identity → Agent, everything else → User.
func (s *AgentService) resolveActorType(actor string) domain.ActorType {
switch {
case actor == "system":
return domain.ActorTypeSystem
case len(actor) > 6 && actor[:6] == "agent-":
return domain.ActorTypeAgent
default:
return domain.ActorTypeUser
}
}
+396
View File
@@ -0,0 +1,396 @@
package service
import (
"context"
"errors"
"log/slog"
"testing"
"time"
"github.com/shankar0123/certctl/internal/domain"
)
// setupRetireTest wires up an AgentService with a single registered agent and
// returns (service, agentRepo, auditRepo) so tests can seed state and assert
// audit events. Kept minimal — tests that need targets/jobs/certs extend the
// returned repos directly.
func setupRetireTest(t *testing.T, agentID string) (*AgentService, *mockAgentRepo, *mockAuditRepo) {
t.Helper()
now := time.Now()
agent := &domain.Agent{
ID: agentID,
Name: "prod-agent",
Hostname: "server-01",
Status: domain.AgentStatusOnline,
RegisteredAt: now,
LastHeartbeatAt: &now,
APIKeyHash: "hash-" + agentID,
}
agentRepo := newMockAgentRepository()
agentRepo.AddAgent(agent)
certRepo := &mockCertRepo{
Certs: make(map[string]*domain.ManagedCertificate),
Versions: make(map[string][]*domain.CertificateVersion),
}
jobRepo := &mockJobRepo{
Jobs: make(map[string]*domain.Job),
StatusUpdates: make(map[string]domain.JobStatus),
}
targetRepo := &mockTargetRepo{
Targets: make(map[string]*domain.DeploymentTarget),
}
auditRepo := &mockAuditRepo{Events: []*domain.AuditEvent{}}
auditService := NewAuditService(auditRepo)
issuerRegistry := NewIssuerRegistry(slog.Default())
svc := NewAgentService(agentRepo, certRepo, jobRepo, targetRepo, auditService, issuerRegistry, nil)
return svc, agentRepo, auditRepo
}
// TestRetireAgent_Sentinel_Rejected covers I-004's sentinel guard. The four
// well-known sentinel agent IDs back discovery sources and the network scanner
// — retiring them would orphan those subsystems. Contract: reject with
// ErrAgentIsSentinel regardless of force/reason.
func TestRetireAgent_Sentinel_Rejected(t *testing.T) {
sentinels := []string{"server-scanner", "cloud-aws-sm", "cloud-azure-kv", "cloud-gcp-sm"}
for _, id := range sentinels {
t.Run(id, func(t *testing.T) {
svc, _, _ := setupRetireTest(t, id)
_, err := svc.RetireAgent(context.Background(), id, "alice", false, "")
if !errors.Is(err, ErrAgentIsSentinel) {
t.Fatalf("retire(sentinel %q) err=%v want ErrAgentIsSentinel", id, err)
}
// Sentinel rejection must be deterministic even under force=true.
_, err = svc.RetireAgent(context.Background(), id, "alice", true, "forced by operator")
if !errors.Is(err, ErrAgentIsSentinel) {
t.Fatalf("retire(sentinel %q force=true) err=%v want ErrAgentIsSentinel", id, err)
}
})
}
}
// TestRetireAgent_NotFound covers the 404 preflight path. The handler maps
// ErrAgentNotFound-equivalent sentinel to 404; the service must surface it
// cleanly without partial state mutation.
func TestRetireAgent_NotFound(t *testing.T) {
svc, _, _ := setupRetireTest(t, "agent-001")
_, err := svc.RetireAgent(context.Background(), "agent-does-not-exist", "alice", false, "")
if err == nil {
t.Fatalf("retire(missing id) err=nil want not-found error")
}
}
// TestRetireAgent_AlreadyRetired_Idempotent covers the 204 No Content path.
// Retiring an already-retired agent must succeed without error and without
// emitting a new audit event (the first retirement already recorded one).
// Idempotency matters because the handler is the escape hatch for operators
// re-issuing a failed retire after a partial failure mid-cascade.
func TestRetireAgent_AlreadyRetired_Idempotent(t *testing.T) {
svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")
past := time.Now().Add(-24 * time.Hour)
reason := "operator decommissioned"
agent := agentRepo.Agents["agent-001"]
agent.RetiredAt = &past
agent.RetiredReason = &reason
result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
if err != nil {
t.Fatalf("retire(already retired) err=%v want nil (idempotent)", err)
}
if result == nil || !result.AlreadyRetired {
t.Fatalf("retire(already retired) result=%+v want AlreadyRetired=true", result)
}
// Retire-on-retired must not emit a duplicate audit event.
for _, e := range auditRepo.Events {
if e.Action == "agent_retired" && e.ResourceID == "agent-001" {
t.Fatalf("retire(already retired) emitted duplicate agent_retired audit event")
}
}
}
// TestRetireAgent_NoDeps_SoftSucceeds covers the happy 200 path: no active
// targets, certs, or jobs referencing the agent. Soft-retire stamps
// RetiredAt + RetiredReason and emits agent_retired audit event.
func TestRetireAgent_NoDeps_SoftSucceeds(t *testing.T) {
svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")
before := time.Now().Add(-time.Second)
result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
if err != nil {
t.Fatalf("retire(clean) err=%v want nil", err)
}
if result == nil {
t.Fatal("retire(clean) result=nil want non-nil")
}
if result.AlreadyRetired {
t.Fatalf("retire(clean) result.AlreadyRetired=true want false")
}
if result.Cascade {
t.Fatalf("retire(clean) result.Cascade=true want false (no deps to cascade)")
}
if !result.RetiredAt.After(before) {
t.Fatalf("retire(clean) RetiredAt=%v not after test start %v", result.RetiredAt, before)
}
agent := agentRepo.Agents["agent-001"]
if agent.RetiredAt == nil {
t.Fatalf("retire(clean) agent.RetiredAt=nil want stamped")
}
// Audit event must be emitted with action=agent_retired, actor=alice.
found := false
for _, e := range auditRepo.Events {
if e.Action == "agent_retired" && e.ResourceID == "agent-001" && e.Actor == "alice" {
found = true
break
}
}
if !found {
t.Fatalf("retire(clean) missing agent_retired audit event for alice, events=%+v", auditRepo.Events)
}
}
// TestRetireAgent_WithDeps_NoForce_Blocked covers the 409 preflight path. When
// the agent has any of: active non-retired targets, certs deployed via those
// targets, or pending jobs — a default retire must block with
// ErrBlockedByDependencies and the counts must be reachable via errors.As so
// the handler can build the 409 body.
func TestRetireAgent_WithDeps_NoForce_Blocked(t *testing.T) {
svc, agentRepo, _ := setupRetireTest(t, "agent-001")
// Seed dependency counts directly on the mock — the production repo
// implements CountActive* queries; the mock exposes them as fields.
agentRepo.ActiveTargetCounts["agent-001"] = 3
agentRepo.ActiveCertCounts["agent-001"] = 7
agentRepo.PendingJobCounts["agent-001"] = 2
_, err := svc.RetireAgent(context.Background(), "agent-001", "alice", false, "")
if !errors.Is(err, ErrBlockedByDependencies) {
t.Fatalf("retire(with deps, no force) err=%v want ErrBlockedByDependencies", err)
}
var blocked *BlockedByDependenciesError
if !errors.As(err, &blocked) {
t.Fatalf("retire(with deps) err=%v want wrapped *BlockedByDependenciesError", err)
}
if blocked.Counts.ActiveTargets != 3 {
t.Errorf("blocked.Counts.ActiveTargets=%d want 3", blocked.Counts.ActiveTargets)
}
if blocked.Counts.ActiveCertificates != 7 {
t.Errorf("blocked.Counts.ActiveCertificates=%d want 7", blocked.Counts.ActiveCertificates)
}
if blocked.Counts.PendingJobs != 2 {
t.Errorf("blocked.Counts.PendingJobs=%d want 2", blocked.Counts.PendingJobs)
}
// Agent must still be un-retired after preflight block.
if agentRepo.Agents["agent-001"].RetiredAt != nil {
t.Fatalf("retire(blocked) left RetiredAt stamped; preflight must be transactionally safe")
}
}
// TestRetireAgent_WithDeps_Force_NoReason_Rejected covers the 400 guard on the
// force escape hatch. Operators using force=true must supply a justifying
// reason; empty reason is rejected before any DB mutation.
func TestRetireAgent_WithDeps_Force_NoReason_Rejected(t *testing.T) {
svc, agentRepo, _ := setupRetireTest(t, "agent-001")
agentRepo.ActiveTargetCounts["agent-001"] = 1
_, err := svc.RetireAgent(context.Background(), "agent-001", "alice", true, "")
if !errors.Is(err, ErrForceReasonRequired) {
t.Fatalf("retire(force, no reason) err=%v want ErrForceReasonRequired", err)
}
if agentRepo.Agents["agent-001"].RetiredAt != nil {
t.Fatalf("retire(force, no reason) left RetiredAt stamped; guard must fire before mutation")
}
}
// TestRetireAgent_WithDeps_Force_Cascades covers the force=true transactional
// path: agent retires, downstream targets also soft-retire with the supplied
// reason, and the result surface indicates cascade happened. Reason
// propagates to every cascaded row so post-mortem forensics can trace the
// cascade to a single operator action.
func TestRetireAgent_WithDeps_Force_Cascades(t *testing.T) {
svc, agentRepo, auditRepo := setupRetireTest(t, "agent-001")
agentRepo.ActiveTargetCounts["agent-001"] = 2
agentRepo.ActiveCertCounts["agent-001"] = 5
agentRepo.PendingJobCounts["agent-001"] = 1
reason := "decommissioning rack 7"
result, err := svc.RetireAgent(context.Background(), "agent-001", "alice", true, reason)
if err != nil {
t.Fatalf("retire(force, reason) err=%v want nil", err)
}
if result == nil {
t.Fatal("retire(force) result=nil want non-nil")
}
if !result.Cascade {
t.Fatalf("retire(force) result.Cascade=false want true")
}
if result.Counts.ActiveTargets != 2 {
t.Errorf("result.Counts.ActiveTargets=%d want 2 (pre-cascade snapshot)", result.Counts.ActiveTargets)
}
agent := agentRepo.Agents["agent-001"]
if agent.RetiredAt == nil {
t.Fatalf("retire(force) agent.RetiredAt=nil want stamped")
}
if agent.RetiredReason == nil || *agent.RetiredReason != reason {
t.Fatalf("retire(force) RetiredReason=%v want %q", agent.RetiredReason, reason)
}
// Two audit events required: agent_retired + agent_retirement_cascaded.
// The cascaded event captures which downstream resources were affected.
var haveRetired, haveCascaded bool
for _, e := range auditRepo.Events {
if e.ResourceID == "agent-001" {
switch e.Action {
case "agent_retired":
haveRetired = true
case "agent_retirement_cascaded":
haveCascaded = true
}
}
}
if !haveRetired {
t.Errorf("retire(force) missing agent_retired audit event")
}
if !haveCascaded {
t.Errorf("retire(force) missing agent_retirement_cascaded audit event")
}
}
// TestRetireAgent_EmitsAuditEvent pins the audit contract for I-004:
// every retire path that mutates DB state emits at least one audit event with
// the operator's actor identity, so post-hoc compliance/forensics can
// reconstruct who retired what and when.
func TestRetireAgent_EmitsAuditEvent(t *testing.T) {
svc, _, auditRepo := setupRetireTest(t, "agent-007")
_, err := svc.RetireAgent(context.Background(), "agent-007", "compliance-bot", false, "")
if err != nil {
t.Fatalf("retire err=%v want nil", err)
}
for _, e := range auditRepo.Events {
if e.Action == "agent_retired" && e.ResourceID == "agent-007" {
if e.Actor != "compliance-bot" {
t.Errorf("audit event Actor=%q want compliance-bot", e.Actor)
}
return
}
}
t.Fatalf("no agent_retired audit event emitted, events=%+v", auditRepo.Events)
}
// TestHeartbeat_RetiredAgent_ReturnsErrAgentRetired covers the 410 Gone
// contract. A retired agent that is still polling must be told its identity
// is no longer accepted — the agent process should detect this and shut
// down rather than continue heartbeating indefinitely.
func TestHeartbeat_RetiredAgent_ReturnsErrAgentRetired(t *testing.T) {
svc, agentRepo, _ := setupRetireTest(t, "agent-001")
past := time.Now().Add(-time.Hour)
reason := "decommissioned"
agentRepo.Agents["agent-001"].RetiredAt = &past
agentRepo.Agents["agent-001"].RetiredReason = &reason
err := svc.Heartbeat(context.Background(), "agent-001", &domain.AgentMetadata{
OS: "linux",
Architecture: "amd64",
Hostname: "server-01",
})
if !errors.Is(err, ErrAgentRetired) {
t.Fatalf("heartbeat(retired) err=%v want ErrAgentRetired", err)
}
// Retired heartbeat must NOT bump LastHeartbeatAt — otherwise the retired
// agent could ressurrect itself in stats/observability dashboards.
if _, bumped := agentRepo.HeartbeatUpdates["agent-001"]; bumped {
t.Fatalf("heartbeat(retired) updated LastHeartbeatAt; retired agents must be frozen")
}
}
// TestListAgents_DefaultExcludesRetired covers the contract that the
// handler-facing ListAgents call hides retired rows by default. Otherwise
// every dashboard that paginates agents would surface retired stragglers.
// An explicit "list retired" endpoint (ListRetiredAgents) covers the audit
// use case.
func TestListAgents_DefaultExcludesRetired(t *testing.T) {
svc, agentRepo, _ := setupRetireTest(t, "agent-active")
// Seed one retired agent alongside the active one.
past := time.Now().Add(-24 * time.Hour)
reason := "old hardware"
agentRepo.AddAgent(&domain.Agent{
ID: "agent-retired",
Name: "retired-agent",
Hostname: "server-old",
Status: domain.AgentStatusOffline,
RegisteredAt: past,
APIKeyHash: "hash-retired",
RetiredAt: &past,
RetiredReason: &reason,
})
agents, total, err := svc.ListAgents(context.Background(), 1, 50)
if err != nil {
t.Fatalf("ListAgents err=%v want nil", err)
}
for _, a := range agents {
if a.ID == "agent-retired" {
t.Fatalf("ListAgents returned retired agent %q in default listing", a.ID)
}
}
if total != 1 {
t.Errorf("ListAgents total=%d want 1 (only active)", total)
}
// ListRetiredAgents must surface retired-only, with count=1.
retired, retiredTotal, err := svc.ListRetiredAgents(context.Background(), 1, 50)
if err != nil {
t.Fatalf("ListRetiredAgents err=%v want nil", err)
}
if retiredTotal != 1 {
t.Errorf("ListRetiredAgents total=%d want 1", retiredTotal)
}
if len(retired) != 1 || retired[0].ID != "agent-retired" {
t.Fatalf("ListRetiredAgents got=%+v want [agent-retired]", retired)
}
}
// TestMarkStaleAgentsOffline_SkipsRetired covers the stale-offline sweeper
// interaction with retirement. A retired agent must not be re-surfaced as
// a state transition ("Online → Offline") by the scheduler, because its
// Status column is preserved as the last-known operational state at
// retirement time and RetiredAt is the source of truth for filtering.
func TestMarkStaleAgentsOffline_SkipsRetired(t *testing.T) {
svc, agentRepo, _ := setupRetireTest(t, "agent-live")
// Active agent is currently stale (no heartbeat for 10 minutes) — eligible
// for Online→Offline transition.
stale := time.Now().Add(-10 * time.Minute)
agentRepo.Agents["agent-live"].LastHeartbeatAt = &stale
// Retired agent was also stale at retirement time, but must NOT be
// touched by the sweeper.
past := time.Now().Add(-24 * time.Hour)
reason := "hw failure"
agentRepo.AddAgent(&domain.Agent{
ID: "agent-retired",
Name: "dead-agent",
Hostname: "server-old",
Status: domain.AgentStatusOnline, // preserved last-seen status
RegisteredAt: past,
LastHeartbeatAt: &past,
APIKeyHash: "hash-dead",
RetiredAt: &past,
RetiredReason: &reason,
})
if err := svc.MarkStaleAgentsOffline(context.Background(), 5*time.Minute); err != nil {
t.Fatalf("MarkStaleAgentsOffline err=%v want nil", err)
}
// Active-stale agent should flip Online → Offline.
if got := agentRepo.Agents["agent-live"].Status; got != domain.AgentStatusOffline {
t.Errorf("agent-live Status=%s want Offline", got)
}
// Retired agent's Status column must be frozen at Online (its preserved
// last-seen state); the sweeper must skip it.
if got := agentRepo.Agents["agent-retired"].Status; got != domain.AgentStatusOnline {
t.Errorf("agent-retired Status=%s want Online (frozen); sweeper touched retired row", got)
}
}
+41 -13
View File
@@ -2,6 +2,8 @@ package service
import (
"context"
"database/sql"
"errors"
"fmt"
"log/slog"
"math/big"
@@ -139,23 +141,49 @@ func (s *CAOperationsSvc) GetOCSPResponse(ctx context.Context, issuerID string,
// Check if this (issuer_id, serial) is revoked — RFC 5280 §5.2.3 scoping.
rev, err := s.revocationRepo.GetByIssuerAndSerial(ctx, issuerID, serialHex)
if err != nil {
// Not revoked — return "good" status
if err == nil && rev != nil {
// Revoked
return issuerConn.SignOCSPResponse(ctx, OCSPSignRequest{
CertSerial: serial,
CertStatus: 0, // good
ThisUpdate: now,
NextUpdate: now.Add(1 * time.Hour),
CertSerial: serial,
CertStatus: 1, // revoked
RevokedAt: rev.RevokedAt,
RevocationReason: domain.CRLReasonCode(domain.RevocationReason(rev.Reason)),
ThisUpdate: now,
NextUpdate: now.Add(1 * time.Hour),
})
}
// Revoked
// Not revoked. Per RFC 6960 §2.2, we must only return "good" for a
// certificate that was actually issued by this CA. Verify the
// (issuer_id, serial) tuple maps to a real certificate in inventory
// before asserting "good"; otherwise return "unknown". This closes the
// coverage gap where forged/guessed serials would be accepted as valid
// because they had no revocation row (M-004).
if s.certRepo != nil {
cert, certErr := s.certRepo.GetByIssuerAndSerial(ctx, issuerID, serialHex)
if certErr != nil || cert == nil {
if certErr != nil && !errors.Is(certErr, sql.ErrNoRows) {
// Real repository failure — log but still fail closed with "unknown"
// rather than leaking a bogus "good" assertion.
slog.Warn("OCSP cert lookup failed; returning unknown",
"issuer_id", issuerID,
"serial", serialHex,
"error", certErr)
}
return issuerConn.SignOCSPResponse(ctx, OCSPSignRequest{
CertSerial: serial,
CertStatus: 2, // unknown
ThisUpdate: now,
NextUpdate: now.Add(1 * time.Hour),
})
}
}
// Known cert, not revoked — return "good"
return issuerConn.SignOCSPResponse(ctx, OCSPSignRequest{
CertSerial: serial,
CertStatus: 1, // revoked
RevokedAt: rev.RevokedAt,
RevocationReason: domain.CRLReasonCode(domain.RevocationReason(rev.Reason)),
ThisUpdate: now,
NextUpdate: now.Add(1 * time.Hour),
CertSerial: serial,
CertStatus: 0, // good
ThisUpdate: now,
NextUpdate: now.Add(1 * time.Hour),
})
}
+82 -2
View File
@@ -13,16 +13,25 @@ import (
// helper to create a CAOperationsSvc for testing
func newCAOperationsSvcTest() (*CAOperationsSvc, *mockRevocationRepo, *mockCertRepo) {
caSvc, revocationRepo, certRepo, _ := newCAOperationsSvcTestWithIssuer()
return caSvc, revocationRepo, certRepo
}
// newCAOperationsSvcTestWithIssuer also returns the mock issuer connector
// so tests can assert on the captured OCSPSignRequest.
func newCAOperationsSvcTestWithIssuer() (*CAOperationsSvc, *mockRevocationRepo, *mockCertRepo, *mockIssuerConnector) {
revocationRepo := newMockRevocationRepository()
certRepo := newMockCertificateRepository()
profileRepo := newMockProfileRepository()
caSvc := NewCAOperationsSvc(revocationRepo, certRepo, profileRepo)
registry := NewIssuerRegistry(slog.Default())
registry.Set("iss-local", &mockIssuerConnector{})
issuer := &mockIssuerConnector{}
registry.Set("iss-local", issuer)
registry.Set("iss-other", &mockIssuerConnector{})
caSvc.SetIssuerRegistry(registry)
return caSvc, revocationRepo, certRepo
return caSvc, revocationRepo, certRepo, issuer
}
func TestCAOperationsSvc_GenerateDERCRL_Success(t *testing.T) {
@@ -126,6 +135,77 @@ func TestCAOperationsSvc_GetOCSPResponse_Good(t *testing.T) {
t.Logf("OCSP response for good cert generated: %d bytes", len(resp))
}
// TestCAOperationsSvc_GetOCSPResponse_Unknown_CrossIssuer guards the M-004 fix:
// a cert with the queried serial exists but under a *different* issuer. Before
// the fix, OCSP fell through to "good" (CertStatus 0) because no revocation row
// matched the (issuer_id, serial) tuple. Per RFC 5280 §5.2.3 serials are unique
// only within a single issuer, and per RFC 6960 §2.2 unknown certs must report
// "unknown" (CertStatus 2), not "good".
func TestCAOperationsSvc_GetOCSPResponse_Unknown_CrossIssuer(t *testing.T) {
caSvc, _, certRepo, issuer := newCAOperationsSvcTestWithIssuer()
// Real cert exists, but bound to iss-other (not iss-local).
cert := &domain.ManagedCertificate{
ID: "cert-cross-issuer",
CommonName: "cross.example.com",
IssuerID: "iss-other",
Status: domain.CertificateStatusActive,
ExpiresAt: time.Now().AddDate(1, 0, 0),
}
certRepo.AddCert(cert)
certRepo.Versions["cert-cross-issuer"] = []*domain.CertificateVersion{{
ID: "ver-cross-issuer",
CertificateID: "cert-cross-issuer",
SerialNumber: "CROSS-ISSUER-001",
NotBefore: time.Now(),
NotAfter: time.Now().AddDate(1, 0, 0),
CreatedAt: time.Now(),
}}
// Query OCSP for iss-local + CROSS-ISSUER-001. The serial exists, but
// under iss-other — our JOIN-scoped lookup should return no match.
resp, err := caSvc.GetOCSPResponse(context.Background(), "iss-local", "CROSS-ISSUER-001")
if err != nil {
t.Fatalf("expected no error, got: %v", err)
}
if resp == nil || len(resp) == 0 {
t.Fatal("expected non-empty OCSP response")
}
if issuer.LastOCSPSignRequest == nil {
t.Fatal("expected SignOCSPResponse to be called")
}
if got, want := issuer.LastOCSPSignRequest.CertStatus, 2; got != want {
t.Errorf("CertStatus = %d, want %d (unknown) — cross-issuer lookup must not return good", got, want)
}
}
// TestCAOperationsSvc_GetOCSPResponse_Unknown_UnknownSerial guards the M-004 fix
// for the "forged/guessed serial" case: no certificate exists at this
// (issuer_id, serial) tuple anywhere in inventory. Per RFC 6960 §2.2 we must
// report "unknown" (CertStatus 2), never "good" — returning good for a serial
// we never issued is a protocol violation that would allow an attacker to get
// certctl to vouch for a cert it never signed.
func TestCAOperationsSvc_GetOCSPResponse_Unknown_UnknownSerial(t *testing.T) {
caSvc, _, _, issuer := newCAOperationsSvcTestWithIssuer()
// No cert rows added. Query for an arbitrary serial under iss-local.
resp, err := caSvc.GetOCSPResponse(context.Background(), "iss-local", "DEADBEEF-NEVER-ISSUED")
if err != nil {
t.Fatalf("expected no error, got: %v", err)
}
if resp == nil || len(resp) == 0 {
t.Fatal("expected non-empty OCSP response")
}
if issuer.LastOCSPSignRequest == nil {
t.Fatal("expected SignOCSPResponse to be called")
}
if got, want := issuer.LastOCSPSignRequest.CertStatus, 2; got != want {
t.Errorf("CertStatus = %d, want %d (unknown) — unissued serials must not return good", got, want)
}
}
func TestCAOperationsSvc_GetOCSPResponse_Revoked(t *testing.T) {
caSvc, revocationRepo, certRepo := newCAOperationsSvcTest()
+25
View File
@@ -145,6 +145,31 @@ func (s *DeploymentService) ProcessDeploymentJob(ctx context.Context, job *domai
return fmt.Errorf("failed to fetch agent: %w", err)
}
// I-004: AgentRepository.Get surfaces retired rows by design (for the GUI
// banner + 410 Gone heartbeat path). Deployments must never dispatch to a
// retired agent — it will never heartbeat again and the target row should
// itself have been cascade-retired when the agent was force-retired. A job
// slipping through here would otherwise hit the heartbeat-staleness branch
// below with the misleading reason "agent is offline"; we want operators to
// see the real cause. Fail the job with an explicit reason, send a
// deployment notification so the owner is alerted, and record an audit
// event. Falls through the same notify+audit shape as the offline branch.
if agent.IsRetired() {
updateErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusFailed, "assigned agent is retired")
if updateErr != nil {
slog.Error("failed to update job status", "job_id", job.ID, "error", updateErr)
}
if notifErr := s.notificationSvc.SendDeploymentNotification(ctx, cert, target, false, fmt.Errorf("agent retired")); notifErr != nil {
slog.Error("failed to send deployment notification", "error", notifErr)
}
if auditErr := s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
"deployment_job_failed", "certificate", job.CertificateID,
map[string]interface{}{"job_id": job.ID, "reason": "agent retired", "target_id": targetID, "agent_id": agentID}); auditErr != nil {
slog.Error("failed to record audit event", "error", auditErr)
}
return fmt.Errorf("agent %s is retired", agentID)
}
// Check agent heartbeat (must be within last 5 minutes)
if agent.LastHeartbeatAt != nil && time.Since(*agent.LastHeartbeatAt) > 5*time.Minute {
updateErr := s.jobRepo.UpdateStatus(ctx, job.ID, domain.JobStatusFailed, "agent is offline")
+21 -5
View File
@@ -148,7 +148,14 @@ func (s *DiscoveryService) GetDiscovered(ctx context.Context, id string) (*domai
}
// ClaimDiscovered links a discovered certificate to a managed certificate.
func (s *DiscoveryService) ClaimDiscovered(ctx context.Context, id string, managedCertID string) error {
// The actor parameter names the authenticated identity that initiated the
// claim and is recorded on the audit event. Callers in the handler layer pass
// resolveActor(ctx); service-to-service callers pass a descriptive sentinel
// (e.g., "system"). Empty actor falls back to "api" (the same safe sentinel
// resolveActor uses when no auth context is present), never to "operator" —
// hardcoding "operator" was M-005, a coverage-gap closure where audit records
// failed to identify who actually performed the triage action.
func (s *DiscoveryService) ClaimDiscovered(ctx context.Context, id string, managedCertID string, actor string) error {
if managedCertID == "" {
return fmt.Errorf("managed_certificate_id is required")
}
@@ -168,8 +175,12 @@ func (s *DiscoveryService) ClaimDiscovered(ctx context.Context, id string, manag
return fmt.Errorf("failed to update discovered certificate status: %w", err)
}
if actor == "" {
actor = "api"
}
// Audit trail
if err := s.auditService.RecordEvent(ctx, "operator", domain.ActorTypeUser,
if err := s.auditService.RecordEvent(ctx, actor, domain.ActorTypeUser,
"discovery_cert_claimed", "discovered_certificate", id,
map[string]interface{}{
"managed_certificate_id": managedCertID,
@@ -182,14 +193,19 @@ func (s *DiscoveryService) ClaimDiscovered(ctx context.Context, id string, manag
return nil
}
// DismissDiscovered marks a discovered certificate as dismissed.
func (s *DiscoveryService) DismissDiscovered(ctx context.Context, id string) error {
// DismissDiscovered marks a discovered certificate as dismissed. See
// ClaimDiscovered for the actor contract — same rules apply (M-005).
func (s *DiscoveryService) DismissDiscovered(ctx context.Context, id string, actor string) error {
if err := s.discoveryRepo.UpdateDiscoveredStatus(ctx, id, domain.DiscoveryStatusDismissed, ""); err != nil {
return fmt.Errorf("failed to dismiss discovered certificate: %w", err)
}
if actor == "" {
actor = "api"
}
// Audit trail
if err := s.auditService.RecordEvent(ctx, "operator", domain.ActorTypeUser,
if err := s.auditService.RecordEvent(ctx, actor, domain.ActorTypeUser,
"discovery_cert_dismissed", "discovered_certificate", id, nil); err != nil {
slog.Error("failed to record audit event", "error", err)
}
+175 -5
View File
@@ -381,7 +381,7 @@ func TestClaimDiscovered_Success(t *testing.T) {
}
certRepo.AddCert(managedCert)
err := svc.ClaimDiscovered(context.Background(), "dcert-1", "mc-prod-1")
err := svc.ClaimDiscovered(context.Background(), "dcert-1", "mc-prod-1", "alice@corp")
if err != nil {
t.Fatalf("expected no error, got: %v", err)
}
@@ -423,7 +423,7 @@ func TestClaimDiscovered_MissingManagedCertID(t *testing.T) {
}
discoveryRepo.Discovered[cert.ID] = cert
err := svc.ClaimDiscovered(context.Background(), "dcert-1", "")
err := svc.ClaimDiscovered(context.Background(), "dcert-1", "", "test-actor")
if err == nil {
t.Fatal("expected error for empty managed_certificate_id")
}
@@ -442,7 +442,7 @@ func TestClaimDiscovered_ManagedCertNotFound(t *testing.T) {
}
discoveryRepo.Discovered[cert.ID] = cert
err := svc.ClaimDiscovered(context.Background(), "dcert-1", "nonexistent-cert")
err := svc.ClaimDiscovered(context.Background(), "dcert-1", "nonexistent-cert", "test-actor")
if err == nil {
t.Fatal("expected error for nonexistent managed certificate")
}
@@ -464,7 +464,7 @@ func TestDismissDiscovered_Success(t *testing.T) {
}
discoveryRepo.Discovered[cert.ID] = cert
err := svc.DismissDiscovered(context.Background(), "dcert-1")
err := svc.DismissDiscovered(context.Background(), "dcert-1", "bob@corp")
if err != nil {
t.Fatalf("expected no error, got: %v", err)
}
@@ -497,8 +497,178 @@ func TestDismissDiscovered_NotFound(t *testing.T) {
svc, discoveryRepo, _, _ := newDiscoveryTestService()
discoveryRepo.UpdateStatusErr = errNotFound
err := svc.DismissDiscovered(context.Background(), "nonexistent")
err := svc.DismissDiscovered(context.Background(), "nonexistent", "test-actor")
if err == nil {
t.Fatal("expected error for nonexistent cert")
}
}
// M-005 regression: caller-supplied actor must propagate onto the
// discovery_cert_claimed audit event so the trail identifies who performed
// triage (pre-M-005 the service hardcoded "operator").
func TestDiscoveryService_ClaimDiscovered_AuditActor(t *testing.T) {
svc, discoveryRepo, certRepo, auditRepo := newDiscoveryTestService()
now := time.Now()
discoveredCert := &domain.DiscoveredCertificate{
ID: "dcert-1",
CommonName: "example.com",
FingerprintSHA256: "abc123",
Status: domain.DiscoveryStatusUnmanaged,
CreatedAt: now,
UpdatedAt: now,
}
discoveryRepo.Discovered[discoveredCert.ID] = discoveredCert
managedCert := &domain.ManagedCertificate{
ID: "mc-prod-1",
CommonName: "example.com",
Status: domain.CertificateStatusActive,
CreatedAt: now,
UpdatedAt: now,
}
certRepo.AddCert(managedCert)
if err := svc.ClaimDiscovered(context.Background(), "dcert-1", "mc-prod-1", "alice@corp"); err != nil {
t.Fatalf("expected no error, got: %v", err)
}
// Locate the discovery_cert_claimed audit event and assert actor propagation.
var claimEvent *domain.AuditEvent
for _, e := range auditRepo.Events {
if e.Action == "discovery_cert_claimed" {
claimEvent = e
break
}
}
if claimEvent == nil {
t.Fatal("expected discovery_cert_claimed audit event to be recorded")
}
if claimEvent.Actor != "alice@corp" {
t.Errorf("expected audit actor to be caller-supplied 'alice@corp', got %q", claimEvent.Actor)
}
if claimEvent.Actor == "operator" {
t.Error("audit actor must not be hardcoded 'operator' (M-005 regression)")
}
}
// M-005 regression symmetric pair for DismissDiscovered.
func TestDiscoveryService_DismissDiscovered_AuditActor(t *testing.T) {
svc, discoveryRepo, _, auditRepo := newDiscoveryTestService()
now := time.Now()
cert := &domain.DiscoveredCertificate{
ID: "dcert-1",
CommonName: "example.com",
Status: domain.DiscoveryStatusUnmanaged,
CreatedAt: now,
UpdatedAt: now,
}
discoveryRepo.Discovered[cert.ID] = cert
if err := svc.DismissDiscovered(context.Background(), "dcert-1", "bob@corp"); err != nil {
t.Fatalf("expected no error, got: %v", err)
}
var dismissEvent *domain.AuditEvent
for _, e := range auditRepo.Events {
if e.Action == "discovery_cert_dismissed" {
dismissEvent = e
break
}
}
if dismissEvent == nil {
t.Fatal("expected discovery_cert_dismissed audit event to be recorded")
}
if dismissEvent.Actor != "bob@corp" {
t.Errorf("expected audit actor to be caller-supplied 'bob@corp', got %q", dismissEvent.Actor)
}
if dismissEvent.Actor == "operator" {
t.Error("audit actor must not be hardcoded 'operator' (M-005 regression)")
}
}
// M-005 regression: when the caller passes an empty actor (e.g., the handler's
// resolveActor helper returns "" because no auth context is present), the
// service must fall back to the safe sentinel "api" — never to the pre-M-005
// hardcoded "operator".
func TestDiscoveryService_ClaimDiscovered_EmptyActorFallsBackToAPI(t *testing.T) {
svc, discoveryRepo, certRepo, auditRepo := newDiscoveryTestService()
now := time.Now()
discoveredCert := &domain.DiscoveredCertificate{
ID: "dcert-1",
CommonName: "example.com",
FingerprintSHA256: "abc123",
Status: domain.DiscoveryStatusUnmanaged,
CreatedAt: now,
UpdatedAt: now,
}
discoveryRepo.Discovered[discoveredCert.ID] = discoveredCert
managedCert := &domain.ManagedCertificate{
ID: "mc-prod-1",
CommonName: "example.com",
Status: domain.CertificateStatusActive,
CreatedAt: now,
UpdatedAt: now,
}
certRepo.AddCert(managedCert)
if err := svc.ClaimDiscovered(context.Background(), "dcert-1", "mc-prod-1", ""); err != nil {
t.Fatalf("expected no error, got: %v", err)
}
var claimEvent *domain.AuditEvent
for _, e := range auditRepo.Events {
if e.Action == "discovery_cert_claimed" {
claimEvent = e
break
}
}
if claimEvent == nil {
t.Fatal("expected discovery_cert_claimed audit event to be recorded")
}
if claimEvent.Actor != "api" {
t.Errorf("expected empty actor to fall back to 'api', got %q", claimEvent.Actor)
}
if claimEvent.Actor == "operator" {
t.Error("audit actor must not be hardcoded 'operator' (M-005 regression)")
}
}
// M-005 regression symmetric pair for DismissDiscovered empty-actor fallback.
func TestDiscoveryService_DismissDiscovered_EmptyActorFallsBackToAPI(t *testing.T) {
svc, discoveryRepo, _, auditRepo := newDiscoveryTestService()
now := time.Now()
cert := &domain.DiscoveredCertificate{
ID: "dcert-1",
CommonName: "example.com",
Status: domain.DiscoveryStatusUnmanaged,
CreatedAt: now,
UpdatedAt: now,
}
discoveryRepo.Discovered[cert.ID] = cert
if err := svc.DismissDiscovered(context.Background(), "dcert-1", ""); err != nil {
t.Fatalf("expected no error, got: %v", err)
}
var dismissEvent *domain.AuditEvent
for _, e := range auditRepo.Events {
if e.Action == "discovery_cert_dismissed" {
dismissEvent = e
break
}
}
if dismissEvent == nil {
t.Fatal("expected discovery_cert_dismissed audit event to be recorded")
}
if dismissEvent.Actor != "api" {
t.Errorf("expected empty actor to fall back to 'api', got %q", dismissEvent.Actor)
}
if dismissEvent.Actor == "operator" {
t.Error("audit actor must not be hardcoded 'operator' (M-005 regression)")
}
}
+215 -4
View File
@@ -1,38 +1,70 @@
package service
import (
"time"
"context"
"errors"
"fmt"
"log/slog"
"strings"
"github.com/shankar0123/certctl/internal/domain"
"github.com/shankar0123/certctl/internal/repository"
)
// ErrSelfApproval is returned by ApproveJob when the actor attempting to
// approve a renewal job is the same person listed as the owner of the
// underlying certificate. M-003 enforces separation of duties: the owner who
// requested (or benefits from) the renewal must not be the same identity that
// approves it. Handlers map this sentinel to HTTP 403 Forbidden.
var ErrSelfApproval = errors.New("self-approval forbidden: actor is the owner of the certificate")
// JobService manages job processing and status tracking.
// It coordinates between the scheduler and various job-specific services.
type JobService struct {
jobRepo repository.JobRepository
certRepo repository.CertificateRepository
ownerRepo repository.OwnerRepository
renewalService *RenewalService
deploymentService *DeploymentService
auditService *AuditService
logger *slog.Logger
}
// NewJobService creates a new job service.
//
// certRepo and ownerRepo are required for the M-003 not-self-approval check
// in ApproveJob. Callers may pass nil for either to disable the check
// (useful for tests that don't exercise the approval path); when nil, the
// service logs a warning on the first approval attempt and permits the
// transition. Production wiring must supply both.
func NewJobService(
jobRepo repository.JobRepository,
certRepo repository.CertificateRepository,
ownerRepo repository.OwnerRepository,
renewalService *RenewalService,
deploymentService *DeploymentService,
logger *slog.Logger,
) *JobService {
return &JobService{
jobRepo: jobRepo,
certRepo: certRepo,
ownerRepo: ownerRepo,
renewalService: renewalService,
deploymentService: deploymentService,
logger: logger,
}
}
// SetAuditService wires an optional audit service for emitting lifecycle
// events (e.g., scheduler-driven job_retry transitions recorded by
// RetryFailedJobs). Construction keeps the audit dependency optional so
// bootstrap/test wiring that doesn't exercise the retry path can omit it;
// production wiring in cmd/server/main.go should always call this.
func (s *JobService) SetAuditService(a *AuditService) {
s.auditService = a
}
// ProcessPendingJobs fetches and processes all pending jobs.
// It routes jobs to the appropriate service based on job type and handles errors gracefully.
//
@@ -142,6 +174,16 @@ func (s *JobService) processValidationJob(ctx context.Context, job *domain.Job)
// RetryFailedJobs finds failed jobs and resets them for retry.
// It only retries jobs that haven't exceeded max attempts.
//
// Audit trail (I-001): each successful Failed → Pending transition emits a
// "job_retry" audit event with actor "system" (ActorTypeSystem), capturing
// the old→new state and attempt counters so operators can reconstruct
// scheduler-driven retry activity. The audit service is optional — callers
// that haven't wired it via SetAuditService simply skip emission.
//
// maxRetries is retained for interface compatibility with
// scheduler.JobServicer but is advisory: per-job eligibility is governed by
// each job's own Attempts vs. MaxAttempts, not this parameter.
func (s *JobService) RetryFailedJobs(ctx context.Context, maxRetries int) error {
s.logger.Debug("retrying failed jobs", "max_retries", maxRetries)
@@ -170,6 +212,21 @@ func (s *JobService) RetryFailedJobs(ctx context.Context, maxRetries int) error
continue
}
if s.auditService != nil {
if auditErr := s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
"job_retry", "job", job.ID,
map[string]interface{}{
"old_status": string(domain.JobStatusFailed),
"new_status": string(domain.JobStatusPending),
"attempts": job.Attempts,
"max_attempts": job.MaxAttempts,
}); auditErr != nil {
s.logger.Error("failed to record job retry audit event",
"job_id", job.ID,
"error", auditErr)
}
}
retriedCount++
}
@@ -180,6 +237,81 @@ func (s *JobService) RetryFailedJobs(ctx context.Context, maxRetries int) error
return nil
}
// ReapTimedOutJobs transitions jobs stuck in AwaitingCSR or AwaitingApproval
// to Failed if they've exceeded their TTL. I-001's retry loop then auto-promotes
// eligible Failed jobs back to Pending (closes coverage gap I-003).
func (s *JobService) ReapTimedOutJobs(ctx context.Context, csrTTL, approvalTTL time.Duration) error {
s.logger.Debug("reaping timed-out jobs", "csr_ttl", csrTTL, "approval_ttl", approvalTTL)
now := time.Now()
csrCutoff := now.Add(-csrTTL)
approvalCutoff := now.Add(-approvalTTL)
timedOutJobs, err := s.jobRepo.ListTimedOutAwaitingJobs(ctx, csrCutoff, approvalCutoff)
if err != nil {
return fmt.Errorf("failed to fetch timed-out jobs: %w", err)
}
var reaped int
for _, job := range timedOutJobs {
oldStatus := job.Status
var (
newErrMsg string
reason string
ttl time.Duration
)
switch job.Status {
case domain.JobStatusAwaitingCSR:
ttl = csrTTL
reason = "csr_timeout"
newErrMsg = fmt.Sprintf("timed out in %s after %s", oldStatus, csrTTL)
case domain.JobStatusAwaitingApproval:
ttl = approvalTTL
reason = "approval_timeout"
newErrMsg = fmt.Sprintf("timed out in %s after %s", oldStatus, approvalTTL)
default:
continue
}
_ = ttl
job.Status = domain.JobStatusFailed
job.LastError = &newErrMsg
if err := s.jobRepo.Update(ctx, job); err != nil {
s.logger.Error("failed to transition timed-out job",
"job_id", job.ID,
"old_status", oldStatus,
"error", err)
continue
}
if s.auditService != nil {
ageHours := time.Since(job.CreatedAt).Hours()
if auditErr := s.auditService.RecordEvent(ctx, "system", domain.ActorTypeSystem,
"job_timeout", "job", job.ID,
map[string]interface{}{
"old_status": string(oldStatus),
"new_status": string(domain.JobStatusFailed),
"timeout_reason": reason,
"age_hours": ageHours,
}); auditErr != nil {
s.logger.Error("failed to record job timeout audit event",
"job_id", job.ID,
"error", auditErr)
}
}
reaped++
}
s.logger.Info("job timeout reaper completed",
"reaped", reaped,
"total_timed_out", len(timedOutJobs))
return nil
}
// GetJobStatus returns the current status of a job.
func (s *JobService) GetJobStatus(ctx context.Context, jobID string) (*domain.Job, error) {
job, err := s.jobRepo.Get(ctx, jobID)
@@ -264,7 +396,13 @@ func (s *JobService) GetJob(ctx context.Context, id string) (*domain.Job, error)
// ApproveJob approves a renewal job that is awaiting approval.
// Transitions the job from AwaitingApproval to Pending so the scheduler picks it up.
func (s *JobService) ApproveJob(ctx context.Context, id string) error {
//
// actor is the named-key identity of the approver (from the auth middleware
// via resolveActor). M-003: if actor matches the certificate owner's Name or
// Email (case-insensitive), returns ErrSelfApproval to enforce separation of
// duties. Callers must pass a non-empty actor; empty actor is treated as an
// anonymous system caller and permitted (internal/system paths).
func (s *JobService) ApproveJob(ctx context.Context, id, actor string) error {
job, err := s.jobRepo.Get(ctx, id)
if err != nil {
return fmt.Errorf("job not found: %w", err)
@@ -274,17 +412,29 @@ func (s *JobService) ApproveJob(ctx context.Context, id string) error {
return fmt.Errorf("cannot approve job with status %s (must be AwaitingApproval)", job.Status)
}
if err := s.checkNotSelf(ctx, job, actor); err != nil {
return err
}
if err := s.jobRepo.UpdateStatus(ctx, id, domain.JobStatusPending, ""); err != nil {
return fmt.Errorf("failed to approve job: %w", err)
}
s.logger.Info("renewal job approved", "job_id", id, "certificate_id", job.CertificateID)
s.logger.Info("renewal job approved",
"job_id", id,
"certificate_id", job.CertificateID,
"actor", actor)
return nil
}
// RejectJob rejects a renewal job that is awaiting approval.
// Transitions the job to Cancelled with a rejection reason.
func (s *JobService) RejectJob(ctx context.Context, id string, reason string) error {
//
// actor is the named-key identity of the rejector (from the auth middleware
// via resolveActor). Rejection is NOT subject to the not-self check — an
// owner is permitted to cancel their own pending renewal. actor is recorded
// on the log line for audit attribution.
func (s *JobService) RejectJob(ctx context.Context, id, reason, actor string) error {
job, err := s.jobRepo.Get(ctx, id)
if err != nil {
return fmt.Errorf("job not found: %w", err)
@@ -303,6 +453,67 @@ func (s *JobService) RejectJob(ctx context.Context, id string, reason string) er
return fmt.Errorf("failed to reject job: %w", err)
}
s.logger.Info("renewal job rejected", "job_id", id, "certificate_id", job.CertificateID, "reason", reason)
s.logger.Info("renewal job rejected",
"job_id", id,
"certificate_id", job.CertificateID,
"reason", reason,
"actor", actor)
return nil
}
// checkNotSelf enforces the M-003 separation-of-duties rule for renewal
// approval: the actor approving a job may not be the owner of the underlying
// certificate.
//
// Resolution rules:
// - Empty actor → permitted (internal/system caller; auth middleware already
// short-circuits anonymous users at the handler layer).
// - certRepo or ownerRepo nil → warn once, permit (test/bootstrap wiring).
// - Job has no certificate or certificate has no OwnerID → permitted (no
// owner to collide with).
// - Owner record not found → warn, permit (defensive: stale FK should not
// block operations).
// - Case-insensitive match against owner.Name OR owner.Email → returns
// ErrSelfApproval.
func (s *JobService) checkNotSelf(ctx context.Context, job *domain.Job, actor string) error {
if actor == "" {
return nil
}
if s.certRepo == nil || s.ownerRepo == nil {
s.logger.Warn("not-self approval check skipped: cert/owner repo not wired",
"job_id", job.ID, "actor", actor)
return nil
}
if job.CertificateID == "" {
return nil
}
cert, err := s.certRepo.Get(ctx, job.CertificateID)
if err != nil {
s.logger.Warn("not-self approval check: certificate lookup failed",
"job_id", job.ID, "certificate_id", job.CertificateID, "error", err)
return nil
}
if cert == nil || cert.OwnerID == "" {
return nil
}
owner, err := s.ownerRepo.Get(ctx, cert.OwnerID)
if err != nil || owner == nil {
s.logger.Warn("not-self approval check: owner lookup failed",
"job_id", job.ID, "owner_id", cert.OwnerID, "error", err)
return nil
}
actorLower := strings.ToLower(actor)
if strings.ToLower(owner.Name) == actorLower || strings.ToLower(owner.Email) == actorLower {
s.logger.Warn("self-approval blocked",
"job_id", job.ID,
"certificate_id", job.CertificateID,
"owner_id", owner.ID,
"actor", actor)
return ErrSelfApproval
}
return nil
}
+683 -1
View File
@@ -2,8 +2,12 @@ package service
import (
"context"
"encoding/json"
"errors"
"log/slog"
"os"
"strings"
"sync"
"testing"
"time"
@@ -12,12 +16,21 @@ import (
// helper to build job service with proper constructor signatures
func newTestJobService(jobRepo *mockJobRepo) *JobService {
svc, _, _ := newTestJobServiceWithRepos(jobRepo)
return svc
}
// newTestJobServiceWithRepos returns the service along with the cert+owner
// repos so self-approval tests can seed owner linkage without rebuilding the
// whole dependency graph.
func newTestJobServiceWithRepos(jobRepo *mockJobRepo) (*JobService, *mockCertRepo, *mockOwnerRepo) {
logger := slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelInfo}))
certRepo := &mockCertRepo{
Certs: make(map[string]*domain.ManagedCertificate),
Versions: make(map[string][]*domain.CertificateVersion),
}
ownerRepo := newMockOwnerRepository()
renewalPolicyRepo := &mockRenewalPolicyRepo{
Policies: make(map[string]*domain.RenewalPolicy),
}
@@ -32,7 +45,7 @@ func newTestJobService(jobRepo *mockJobRepo) *JobService {
renewalService := NewRenewalService(certRepo, jobRepo, renewalPolicyRepo, nil, auditService, notifService, issuerRegistry, "server")
deploymentService := NewDeploymentService(jobRepo, targetRepo, agentRepo, certRepo, auditService, notifService)
return NewJobService(jobRepo, renewalService, deploymentService, logger)
return NewJobService(jobRepo, certRepo, ownerRepo, renewalService, deploymentService, logger), certRepo, ownerRepo
}
func TestProcessPendingJobs_Renewal(t *testing.T) {
@@ -249,3 +262,672 @@ func TestListJobs_FilterByStatus(t *testing.T) {
t.Errorf("expected total 1, got %d", total)
}
}
// --- M-003: not-self approval (separation of duties) ---
//
// These regression tests enforce that ApproveJob returns ErrSelfApproval when
// the actor matches the certificate owner's Name or Email (case-insensitive).
// Rejection is intentionally NOT gated — owners may cancel their own pending
// renewals. Handlers map ErrSelfApproval to HTTP 403.
// seedSelfApprovalFixtures populates the mock repos with a realistic
// AwaitingApproval renewal job owned by "alice" and returns the service under
// test. The cert points at owner "o-alice" so checkNotSelf has a full resolution
// path.
func seedSelfApprovalFixtures(t *testing.T) (*JobService, *mockJobRepo) {
t.Helper()
now := time.Now()
job := &domain.Job{
ID: "job-self",
Type: domain.JobTypeRenewal,
CertificateID: "cert-self",
Status: domain.JobStatusAwaitingApproval,
CreatedAt: now,
ScheduledAt: now,
}
jobRepo := &mockJobRepo{
Jobs: map[string]*domain.Job{job.ID: job},
StatusUpdates: make(map[string]domain.JobStatus),
}
svc, certRepo, ownerRepo := newTestJobServiceWithRepos(jobRepo)
certRepo.AddCert(&domain.ManagedCertificate{
ID: "cert-self",
OwnerID: "o-alice",
CreatedAt: now,
UpdatedAt: now,
})
ownerRepo.AddOwner(&domain.Owner{
ID: "o-alice",
Name: "alice",
Email: "alice@example.com",
CreatedAt: now,
UpdatedAt: now,
})
return svc, jobRepo
}
func TestApproveJob_SelfApprovalForbidden_NameMatch(t *testing.T) {
ctx := context.Background()
svc, jobRepo := seedSelfApprovalFixtures(t)
err := svc.ApproveJob(ctx, "job-self", "alice")
if err == nil {
t.Fatal("expected ErrSelfApproval, got nil")
}
if !errors.Is(err, ErrSelfApproval) {
t.Fatalf("expected errors.Is(err, ErrSelfApproval), got %v", err)
}
if _, flipped := jobRepo.StatusUpdates["job-self"]; flipped {
t.Error("expected job status unchanged after self-approval block")
}
}
func TestApproveJob_SelfApprovalForbidden_EmailMatch(t *testing.T) {
ctx := context.Background()
svc, jobRepo := seedSelfApprovalFixtures(t)
err := svc.ApproveJob(ctx, "job-self", "alice@example.com")
if err == nil {
t.Fatal("expected ErrSelfApproval, got nil")
}
if !errors.Is(err, ErrSelfApproval) {
t.Fatalf("expected errors.Is(err, ErrSelfApproval), got %v", err)
}
if _, flipped := jobRepo.StatusUpdates["job-self"]; flipped {
t.Error("expected job status unchanged after self-approval block")
}
}
func TestApproveJob_SelfApprovalForbidden_CaseInsensitive(t *testing.T) {
ctx := context.Background()
svc, _ := seedSelfApprovalFixtures(t)
// Uppercase name should still collide — the check must be case-insensitive.
if err := svc.ApproveJob(ctx, "job-self", "ALICE"); !errors.Is(err, ErrSelfApproval) {
t.Fatalf("expected ErrSelfApproval for uppercase name match, got %v", err)
}
// Mixed-case email should also collide.
if err := svc.ApproveJob(ctx, "job-self", "Alice@Example.COM"); !errors.Is(err, ErrSelfApproval) {
t.Fatalf("expected ErrSelfApproval for mixed-case email match, got %v", err)
}
}
func TestApproveJob_DifferentActor_Permitted(t *testing.T) {
ctx := context.Background()
svc, jobRepo := seedSelfApprovalFixtures(t)
// A different named key must be allowed to approve.
if err := svc.ApproveJob(ctx, "job-self", "bob"); err != nil {
t.Fatalf("expected approval to succeed for non-owner actor, got %v", err)
}
if jobRepo.StatusUpdates["job-self"] != domain.JobStatusPending {
t.Errorf("expected status Pending after approval, got %s",
jobRepo.StatusUpdates["job-self"])
}
}
func TestApproveJob_EmptyActor_Permitted(t *testing.T) {
ctx := context.Background()
svc, jobRepo := seedSelfApprovalFixtures(t)
// Empty actor represents an internal/system caller. The handler layer
// enforces authenticated-only, so this branch exists only for defensive
// in-process paths (scheduler-driven auto-approval, tests, etc.).
if err := svc.ApproveJob(ctx, "job-self", ""); err != nil {
t.Fatalf("expected empty actor to be permitted, got %v", err)
}
if jobRepo.StatusUpdates["job-self"] != domain.JobStatusPending {
t.Errorf("expected status Pending after approval, got %s",
jobRepo.StatusUpdates["job-self"])
}
}
func TestRejectJob_SelfRejection_Permitted(t *testing.T) {
ctx := context.Background()
svc, jobRepo := seedSelfApprovalFixtures(t)
// Owner must be able to reject their own pending renewal — M-003 scopes the
// not-self rule to approval only.
if err := svc.RejectJob(ctx, "job-self", "no longer needed", "alice"); err != nil {
t.Fatalf("expected owner to reject own job, got %v", err)
}
if jobRepo.StatusUpdates["job-self"] != domain.JobStatusCancelled {
t.Errorf("expected status Cancelled after rejection, got %s",
jobRepo.StatusUpdates["job-self"])
}
}
// --- I-001: scheduler-driven retry emits audit events ---
//
// These regression tests prove that RetryFailedJobs (a) transitions eligible
// Failed jobs to Pending, (b) skips jobs that have exhausted their max
// attempts, and (c) records a "job_retry" audit event per transition when the
// audit service is wired. A separate variant (_NoAuditServiceOK) confirms the
// nil-guard path so test/bootstrap wiring that skips the setter still works.
// newTestJobServiceWithAudit wires the optional audit dependency onto the
// standard test JobService so retry assertions can inspect recorded events.
// Mirrors newTestJobServiceWithRepos but also returns the mock audit repo
// holding any emitted events.
func newTestJobServiceWithAudit(jobRepo *mockJobRepo) (*JobService, *mockAuditRepo) {
svc, _, _ := newTestJobServiceWithRepos(jobRepo)
auditRepo := &mockAuditRepo{}
svc.SetAuditService(NewAuditService(auditRepo))
return svc, auditRepo
}
func TestJobService_RetryFailedJobs_EligibleJobTransitionsAndAudits(t *testing.T) {
ctx := context.Background()
now := time.Now()
failed := &domain.Job{
ID: "job-retry-1",
Type: domain.JobTypeRenewal,
CertificateID: "cert-001",
Status: domain.JobStatusFailed,
Attempts: 1,
MaxAttempts: 3,
CreatedAt: now,
ScheduledAt: now,
}
jobRepo := &mockJobRepo{
Jobs: map[string]*domain.Job{failed.ID: failed},
StatusUpdates: make(map[string]domain.JobStatus),
}
svc, auditRepo := newTestJobServiceWithAudit(jobRepo)
if err := svc.RetryFailedJobs(ctx, 3); err != nil {
t.Fatalf("RetryFailedJobs failed: %v", err)
}
if got := jobRepo.StatusUpdates[failed.ID]; got != domain.JobStatusPending {
t.Fatalf("expected job %s status Pending after retry, got %s", failed.ID, got)
}
if len(auditRepo.Events) != 1 {
t.Fatalf("expected 1 audit event, got %d", len(auditRepo.Events))
}
ev := auditRepo.Events[0]
if ev.Action != "job_retry" {
t.Errorf("expected action job_retry, got %s", ev.Action)
}
if ev.Actor != "system" {
t.Errorf("expected actor system, got %s", ev.Actor)
}
if ev.ActorType != domain.ActorTypeSystem {
t.Errorf("expected actor type System, got %s", ev.ActorType)
}
if ev.ResourceType != "job" {
t.Errorf("expected resource type job, got %s", ev.ResourceType)
}
if ev.ResourceID != failed.ID {
t.Errorf("expected resource ID %s, got %s", failed.ID, ev.ResourceID)
}
// Details are stored as json.RawMessage — decode and verify the state
// transition + attempt counters were captured.
var details map[string]interface{}
if err := json.Unmarshal(ev.Details, &details); err != nil {
t.Fatalf("failed to decode audit event details: %v", err)
}
if got, want := details["old_status"], string(domain.JobStatusFailed); got != want {
t.Errorf("expected details.old_status=%s, got %v", want, got)
}
if got, want := details["new_status"], string(domain.JobStatusPending); got != want {
t.Errorf("expected details.new_status=%s, got %v", want, got)
}
// JSON numerics round-trip as float64.
if got, want := details["attempts"], float64(1); got != want {
t.Errorf("expected details.attempts=%v, got %v", want, got)
}
if got, want := details["max_attempts"], float64(3); got != want {
t.Errorf("expected details.max_attempts=%v, got %v", want, got)
}
}
func TestJobService_RetryFailedJobs_SkipsJobsAtMaxAttempts(t *testing.T) {
ctx := context.Background()
now := time.Now()
// Eligible: Attempts=0, MaxAttempts=3.
eligible := &domain.Job{
ID: "job-retry-eligible",
Type: domain.JobTypeRenewal,
CertificateID: "cert-001",
Status: domain.JobStatusFailed,
Attempts: 0,
MaxAttempts: 3,
CreatedAt: now,
ScheduledAt: now,
}
// Exhausted: Attempts >= MaxAttempts must be skipped.
exhausted := &domain.Job{
ID: "job-retry-exhausted",
Type: domain.JobTypeDeployment,
CertificateID: "cert-002",
Status: domain.JobStatusFailed,
Attempts: 3,
MaxAttempts: 3,
CreatedAt: now,
ScheduledAt: now,
}
jobRepo := &mockJobRepo{
Jobs: map[string]*domain.Job{
eligible.ID: eligible,
exhausted.ID: exhausted,
},
StatusUpdates: make(map[string]domain.JobStatus),
}
svc, auditRepo := newTestJobServiceWithAudit(jobRepo)
if err := svc.RetryFailedJobs(ctx, 3); err != nil {
t.Fatalf("RetryFailedJobs failed: %v", err)
}
if got := jobRepo.StatusUpdates[eligible.ID]; got != domain.JobStatusPending {
t.Errorf("expected eligible job to transition to Pending, got %s", got)
}
if _, flipped := jobRepo.StatusUpdates[exhausted.ID]; flipped {
t.Errorf("expected exhausted job to be skipped, but status was updated")
}
if len(auditRepo.Events) != 1 {
t.Fatalf("expected 1 audit event (only for eligible job), got %d", len(auditRepo.Events))
}
if auditRepo.Events[0].ResourceID != eligible.ID {
t.Errorf("expected audit event for eligible job %s, got %s",
eligible.ID, auditRepo.Events[0].ResourceID)
}
}
func TestJobService_RetryFailedJobs_NoAuditServiceOK(t *testing.T) {
ctx := context.Background()
now := time.Now()
failed := &domain.Job{
ID: "job-retry-no-audit",
Type: domain.JobTypeRenewal,
CertificateID: "cert-001",
Status: domain.JobStatusFailed,
Attempts: 0,
MaxAttempts: 3,
CreatedAt: now,
ScheduledAt: now,
}
jobRepo := &mockJobRepo{
Jobs: map[string]*domain.Job{failed.ID: failed},
StatusUpdates: make(map[string]domain.JobStatus),
}
// Intentionally skip SetAuditService: the nil-guard must prevent a panic
// and still transition the job.
svc := newTestJobService(jobRepo)
if err := svc.RetryFailedJobs(ctx, 3); err != nil {
t.Fatalf("RetryFailedJobs failed without audit wiring: %v", err)
}
if got := jobRepo.StatusUpdates[failed.ID]; got != domain.JobStatusPending {
t.Errorf("expected status Pending after retry, got %s", got)
}
}
// =============================================================================
// ReapTimedOutJobs Tests (I-003 coverage closure)
// =============================================================================
func TestJobService_ReapTimedOutJobs_AwaitingCSRTransitionsAndAudits(t *testing.T) {
ctx := context.Background()
now := time.Now()
job := &domain.Job{
ID: "job-reap-csr-1",
Type: domain.JobTypeRenewal,
CertificateID: "cert-001",
Status: domain.JobStatusAwaitingCSR,
CreatedAt: now.Add(-36 * time.Hour), // 36 hours old
ScheduledAt: now,
}
jobRepo := &mockJobRepo{
Jobs: map[string]*domain.Job{job.ID: job},
StatusUpdates: make(map[string]domain.JobStatus),
}
svc, auditRepo := newTestJobServiceWithAudit(jobRepo)
if err := svc.ReapTimedOutJobs(ctx, 24*time.Hour, 168*time.Hour); err != nil {
t.Fatalf("ReapTimedOutJobs failed: %v", err)
}
// Check the job was updated by retrieving from the mock's Jobs map
updatedJob := jobRepo.Jobs[job.ID]
if updatedJob.Status != domain.JobStatusFailed {
t.Fatalf("expected job %s status Failed after timeout, got %s", job.ID, updatedJob.Status)
}
// LastError should be set
if job.LastError == nil || !strings.Contains(*job.LastError, "timed out in AwaitingCSR after 24h") {
t.Errorf("expected LastError containing 'timed out in AwaitingCSR after 24h', got %v", job.LastError)
}
// Audit event should be recorded
if len(auditRepo.Events) != 1 {
t.Fatalf("expected 1 audit event, got %d", len(auditRepo.Events))
}
ev := auditRepo.Events[0]
if ev.Action != "job_timeout" {
t.Errorf("expected action job_timeout, got %s", ev.Action)
}
if ev.Actor != "system" {
t.Errorf("expected actor system, got %s", ev.Actor)
}
if ev.ActorType != domain.ActorTypeSystem {
t.Errorf("expected actor type System, got %s", ev.ActorType)
}
if ev.ResourceType != "job" {
t.Errorf("expected resource type job, got %s", ev.ResourceType)
}
if ev.ResourceID != job.ID {
t.Errorf("expected resource ID %s, got %s", job.ID, ev.ResourceID)
}
// Verify audit details
var details map[string]interface{}
if err := json.Unmarshal(ev.Details, &details); err != nil {
t.Fatalf("failed to decode audit event details: %v", err)
}
if got, want := details["old_status"], string(domain.JobStatusAwaitingCSR); got != want {
t.Errorf("expected details.old_status=%s, got %v", want, got)
}
if got, want := details["new_status"], string(domain.JobStatusFailed); got != want {
t.Errorf("expected details.new_status=%s, got %v", want, got)
}
if got, want := details["timeout_reason"], "csr_timeout"; got != want {
t.Errorf("expected details.timeout_reason=%s, got %v", want, got)
}
ageHours, ok := details["age_hours"].(float64)
if !ok {
t.Errorf("expected details.age_hours to be float64, got %T", details["age_hours"])
} else if ageHours < 35 {
t.Errorf("expected age_hours > 35, got %f", ageHours)
}
}
func TestJobService_ReapTimedOutJobs_AwaitingApprovalTransitionsAndAudits(t *testing.T) {
ctx := context.Background()
now := time.Now()
job := &domain.Job{
ID: "job-reap-approval-1",
Type: domain.JobTypeRenewal,
CertificateID: "cert-002",
Status: domain.JobStatusAwaitingApproval,
CreatedAt: now.Add(-200 * time.Hour), // 200 hours old
ScheduledAt: now,
}
jobRepo := &mockJobRepo{
Jobs: map[string]*domain.Job{job.ID: job},
StatusUpdates: make(map[string]domain.JobStatus),
}
svc, auditRepo := newTestJobServiceWithAudit(jobRepo)
if err := svc.ReapTimedOutJobs(ctx, 24*time.Hour, 168*time.Hour); err != nil {
t.Fatalf("ReapTimedOutJobs failed: %v", err)
}
// Check the job was updated
updatedJob := jobRepo.Jobs[job.ID]
if updatedJob.Status != domain.JobStatusFailed {
t.Fatalf("expected job %s status Failed after timeout, got %s", job.ID, updatedJob.Status)
}
// LastError should be set
if updatedJob.LastError == nil || !strings.Contains(*updatedJob.LastError, "timed out in AwaitingApproval after 168h") {
t.Errorf("expected LastError containing 'timed out in AwaitingApproval after 168h', got %v", updatedJob.LastError)
}
// Audit event details
if len(auditRepo.Events) != 1 {
t.Fatalf("expected 1 audit event, got %d", len(auditRepo.Events))
}
ev := auditRepo.Events[0]
var details map[string]interface{}
if err := json.Unmarshal(ev.Details, &details); err != nil {
t.Fatalf("failed to decode audit event details: %v", err)
}
if got, want := details["timeout_reason"], "approval_timeout"; got != want {
t.Errorf("expected details.timeout_reason=%s, got %v", want, got)
}
ageHours, ok := details["age_hours"].(float64)
if !ok {
t.Errorf("expected details.age_hours to be float64, got %T", details["age_hours"])
} else if ageHours < 199 {
t.Errorf("expected age_hours > 199, got %f", ageHours)
}
}
func TestJobService_ReapTimedOutJobs_SkipsJobsWithinTTL(t *testing.T) {
ctx := context.Background()
now := time.Now()
job := &domain.Job{
ID: "job-within-ttl",
Type: domain.JobTypeRenewal,
CertificateID: "cert-003",
Status: domain.JobStatusAwaitingCSR,
CreatedAt: now.Add(-1 * time.Hour), // Only 1 hour old (within 24h TTL)
ScheduledAt: now,
}
jobRepo := &mockJobRepo{
Jobs: map[string]*domain.Job{job.ID: job},
StatusUpdates: make(map[string]domain.JobStatus),
}
svc, auditRepo := newTestJobServiceWithAudit(jobRepo)
if err := svc.ReapTimedOutJobs(ctx, 24*time.Hour, 168*time.Hour); err != nil {
t.Fatalf("ReapTimedOutJobs failed: %v", err)
}
// Job should NOT transition (still AwaitingCSR)
if job.Status != domain.JobStatusAwaitingCSR {
t.Fatalf("expected job status to remain AwaitingCSR, got %s", job.Status)
}
// No audit events should be recorded
if len(auditRepo.Events) != 0 {
t.Fatalf("expected 0 audit events, got %d", len(auditRepo.Events))
}
}
func TestJobService_ReapTimedOutJobs_HandlesBothStatusesInOneSweep(t *testing.T) {
ctx := context.Background()
now := time.Now()
csr := &domain.Job{
ID: "job-sweep-csr",
Type: domain.JobTypeRenewal,
CertificateID: "cert-csr",
Status: domain.JobStatusAwaitingCSR,
CreatedAt: now.Add(-36 * time.Hour),
ScheduledAt: now,
}
approval := &domain.Job{
ID: "job-sweep-approval",
Type: domain.JobTypeRenewal,
CertificateID: "cert-approval",
Status: domain.JobStatusAwaitingApproval,
CreatedAt: now.Add(-200 * time.Hour),
ScheduledAt: now,
}
jobRepo := &mockJobRepo{
Jobs: map[string]*domain.Job{
csr.ID: csr,
approval.ID: approval,
},
StatusUpdates: make(map[string]domain.JobStatus),
}
svc, auditRepo := newTestJobServiceWithAudit(jobRepo)
if err := svc.ReapTimedOutJobs(ctx, 24*time.Hour, 168*time.Hour); err != nil {
t.Fatalf("ReapTimedOutJobs failed: %v", err)
}
// Both jobs should transition to Failed
csrUpdated := jobRepo.Jobs[csr.ID]
if csrUpdated.Status != domain.JobStatusFailed {
t.Fatalf("expected CSR job status Failed, got %s", csrUpdated.Status)
}
approvalUpdated := jobRepo.Jobs[approval.ID]
if approvalUpdated.Status != domain.JobStatusFailed {
t.Fatalf("expected approval job status Failed, got %s", approvalUpdated.Status)
}
// Two audit events should be recorded
if len(auditRepo.Events) != 2 {
t.Fatalf("expected 2 audit events, got %d", len(auditRepo.Events))
}
// Verify each event has the correct timeout_reason
for _, ev := range auditRepo.Events {
var details map[string]interface{}
if err := json.Unmarshal(ev.Details, &details); err != nil {
t.Fatalf("failed to decode details: %v", err)
}
if ev.ResourceID == csr.ID && details["timeout_reason"] != "csr_timeout" {
t.Errorf("CSR job: expected timeout_reason=csr_timeout, got %v", details["timeout_reason"])
}
if ev.ResourceID == approval.ID && details["timeout_reason"] != "approval_timeout" {
t.Errorf("approval job: expected timeout_reason=approval_timeout, got %v", details["timeout_reason"])
}
}
}
func TestJobService_ReapTimedOutJobs_NoAuditServiceOK(t *testing.T) {
ctx := context.Background()
now := time.Now()
job := &domain.Job{
ID: "job-no-audit",
Type: domain.JobTypeRenewal,
CertificateID: "cert-004",
Status: domain.JobStatusAwaitingCSR,
CreatedAt: now.Add(-36 * time.Hour),
ScheduledAt: now,
}
jobRepo := &mockJobRepo{
Jobs: map[string]*domain.Job{job.ID: job},
StatusUpdates: make(map[string]domain.JobStatus),
}
// Create service WITHOUT calling SetAuditService
svc := newTestJobService(jobRepo)
// Should not panic and should still transition the job
if err := svc.ReapTimedOutJobs(ctx, 24*time.Hour, 168*time.Hour); err != nil {
t.Fatalf("ReapTimedOutJobs failed without audit service: %v", err)
}
// Job should still transition to Failed
updatedJob := jobRepo.Jobs[job.ID]
if updatedJob.Status != domain.JobStatusFailed {
t.Fatalf("expected job status Failed without audit service, got %s", updatedJob.Status)
}
}
func TestJobService_ReapTimedOutJobs_ContinuesOnIndividualUpdateFailure(t *testing.T) {
ctx := context.Background()
now := time.Now()
jobA := &domain.Job{
ID: "job-fail-update-a",
Type: domain.JobTypeRenewal,
CertificateID: "cert-a",
Status: domain.JobStatusAwaitingCSR,
CreatedAt: now.Add(-36 * time.Hour),
ScheduledAt: now,
}
jobB := &domain.Job{
ID: "job-fail-update-b",
Type: domain.JobTypeRenewal,
CertificateID: "cert-b",
Status: domain.JobStatusAwaitingCSR,
CreatedAt: now.Add(-48 * time.Hour),
ScheduledAt: now,
}
jobRepo := &mockJobRepo{
Jobs: map[string]*domain.Job{
jobA.ID: jobA,
jobB.ID: jobB,
},
StatusUpdates: make(map[string]domain.JobStatus),
UpdateErrorByID: make(map[string]error),
UpdateErrorByIDMu: sync.Mutex{},
}
// Make Update fail only for jobA
jobRepo.UpdateErrorByIDMu.Lock()
jobRepo.UpdateErrorByID[jobA.ID] = errors.New("db connection lost")
jobRepo.UpdateErrorByIDMu.Unlock()
svc, auditRepo := newTestJobServiceWithAudit(jobRepo)
// Should not propagate individual Update errors
if err := svc.ReapTimedOutJobs(ctx, 24*time.Hour, 168*time.Hour); err != nil {
t.Fatalf("ReapTimedOutJobs should not propagate individual errors, got: %v", err)
}
// Both jobs have their status modified in memory (the service modifies before Update),
// so both will be Failed. What matters is that jobA's audit failed, so only jobB audited.
jobAAfter := jobRepo.Jobs[jobA.ID]
jobBAfter := jobRepo.Jobs[jobB.ID]
if jobAAfter.Status != domain.JobStatusFailed || jobBAfter.Status != domain.JobStatusFailed {
t.Fatalf("expected both jobs status Failed (modified before Update), got A=%s B=%s",
jobAAfter.Status, jobBAfter.Status)
}
// Only one audit event (from jobB, since jobA's Update failed and thus no audit for it)
if len(auditRepo.Events) != 1 {
t.Fatalf("expected 1 audit event (only jobB succeeded), got %d", len(auditRepo.Events))
}
if auditRepo.Events[0].ResourceID != jobB.ID {
t.Errorf("expected audit event for jobB, got event for %s", auditRepo.Events[0].ResourceID)
}
}
func TestJobService_ReapTimedOutJobs_RepoErrorPropagates(t *testing.T) {
ctx := context.Background()
jobRepo := &mockJobRepo{
Jobs: make(map[string]*domain.Job),
ListTimedOutErr: errors.New("database down"),
StatusUpdates: make(map[string]domain.JobStatus),
UpdateErrorByIDMu: sync.Mutex{},
}
svc, auditRepo := newTestJobServiceWithAudit(jobRepo)
err := svc.ReapTimedOutJobs(ctx, 24*time.Hour, 168*time.Hour)
if err == nil {
t.Fatalf("expected ReapTimedOutJobs to propagate repo error, got nil")
}
if !strings.Contains(err.Error(), "database down") {
t.Errorf("expected error to contain 'database down', got: %v", err)
}
// No audit events should be recorded when repo fails
if len(auditRepo.Events) != 0 {
t.Fatalf("expected 0 audit events after repo error, got %d", len(auditRepo.Events))
}
}
+220 -49
View File
@@ -2,8 +2,10 @@ package service
import (
"context"
"encoding/json"
"fmt"
"log/slog"
"strings"
"time"
"github.com/shankar0123/certctl/internal/domain"
@@ -14,6 +16,11 @@ import (
type PolicyService struct {
policyRepo repository.PolicyRepository
auditService *AuditService
// certRepo is optional and only required by the CertificateLifetime rule
// arm, which must read NotBefore/NotAfter from the latest CertificateVersion.
// Wire via SetCertRepo after construction; rules other than
// CertificateLifetime operate without it.
certRepo repository.CertificateRepository
}
// NewPolicyService creates a new policy service.
@@ -27,6 +34,16 @@ func NewPolicyService(
}
}
// SetCertRepo wires the certificate repository needed for the CertificateLifetime
// rule arm. Kept as a setter (not a constructor parameter) so the ~36 existing
// NewPolicyService call sites don't churn for a single new arm's dependency.
// Safe to call before or after construction; evaluateRule checks for nil and
// returns an error if a CertificateLifetime rule fires without a wired repo
// (the caller at ValidateCertificate logs and continues).
func (s *PolicyService) SetCertRepo(r repository.CertificateRepository) {
s.certRepo = r
}
// ValidateCertificate runs all enabled policy rules against a certificate.
func (s *PolicyService) ValidateCertificate(ctx context.Context, cert *domain.ManagedCertificate) ([]*domain.PolicyViolation, error) {
rules, err := s.policyRepo.ListRules(ctx)
@@ -43,7 +60,7 @@ func (s *PolicyService) ValidateCertificate(ctx context.Context, cert *domain.Ma
}
// Evaluate rule against certificate
v, err := s.evaluateRule(rule, cert)
v, err := s.evaluateRule(ctx, rule, cert)
if err != nil {
slog.Error("failed to evaluate rule", "rule_id", rule.ID, "error", err)
continue
@@ -58,73 +75,163 @@ func (s *PolicyService) ValidateCertificate(ctx context.Context, cert *domain.Ma
}
// evaluateRule checks if a certificate violates a single policy rule.
func (s *PolicyService) evaluateRule(rule *domain.PolicyRule, cert *domain.ManagedCertificate) (*domain.PolicyViolation, error) {
//
// D-008 closes the engine loop by:
// 1. Consuming rule.Severity on every violation (the pre-D-008 engine
// hardcoded PolicySeverityWarning, which silently defeated the D-006
// per-rule severity column).
// 2. Parsing rule.Config per-arm so rules carry real thresholds / allowlists
// instead of the pre-D-008 "metadata absent" placeholders. Empty/null
// Config preserves the pre-D-008 missing-field behavior as a
// backward-compat invariant — a rule without config still fires on the
// absent-field shape but using its configured severity.
// 3. Adding the CertificateLifetime arm, which reads NotBefore/NotAfter from
// the latest CertificateVersion (injected via SetCertRepo). Required
// because ManagedCertificate tracks ExpiresAt but not issuance date.
//
// Bad-config failure mode: json.Unmarshal error returns (nil, error) shaped
// as `invalid config for rule <id> (type=<type>): <err>`; the caller at
// ValidateCertificate logs and continues so one malformed rule doesn't fail
// the entire pass.
func (s *PolicyService) evaluateRule(ctx context.Context, rule *domain.PolicyRule, cert *domain.ManagedCertificate) (*domain.PolicyViolation, error) {
switch rule.Type {
case domain.PolicyTypeAllowedIssuers:
// Restrict to specific issuers
// Note: In a production implementation, we would parse rule.Config to extract parameters
// Config: {"allowed_issuer_ids": ["iss-a", "iss-b"]}
// Empty config = fire only on absent IssuerID (backward-compat).
var cfg struct {
AllowedIssuerIDs []string `json:"allowed_issuer_ids"`
}
if len(rule.Config) > 0 {
if err := json.Unmarshal(rule.Config, &cfg); err != nil {
return nil, fmt.Errorf("invalid config for rule %s (type=%s): %w", rule.ID, rule.Type, err)
}
}
if cert.IssuerID == "" {
return &domain.PolicyViolation{
ID: generateID("violation"),
RuleID: rule.ID,
CertificateID: cert.ID,
Severity: domain.PolicySeverityWarning,
Message: "certificate has no issuer assigned",
CreatedAt: time.Now(),
}, nil
return s.violation(rule, cert, "certificate has no issuer assigned"), nil
}
if len(cfg.AllowedIssuerIDs) > 0 && !containsString(cfg.AllowedIssuerIDs, cert.IssuerID) {
return s.violation(rule, cert, fmt.Sprintf("issuer %q is not in the allowed list", cert.IssuerID)), nil
}
case domain.PolicyTypeAllowedDomains:
// Ensure certificate domains are in allowed list
// Config: {"allowed_domains": ["example.com", "*.internal.example.com"]}
// Wildcards are literal prefix matches (*.foo matches anything ending
// in .foo). Empty config = fire only on zero SANs (backward-compat).
var cfg struct {
AllowedDomains []string `json:"allowed_domains"`
}
if len(rule.Config) > 0 {
if err := json.Unmarshal(rule.Config, &cfg); err != nil {
return nil, fmt.Errorf("invalid config for rule %s (type=%s): %w", rule.ID, rule.Type, err)
}
}
if len(cert.SANs) == 0 {
return &domain.PolicyViolation{
ID: generateID("violation"),
RuleID: rule.ID,
CertificateID: cert.ID,
Severity: domain.PolicySeverityWarning,
Message: "certificate has no subject alternative names",
CreatedAt: time.Now(),
}, nil
return s.violation(rule, cert, "certificate has no subject alternative names"), nil
}
if len(cfg.AllowedDomains) > 0 {
for _, san := range cert.SANs {
if !domainAllowed(san, cfg.AllowedDomains) {
return s.violation(rule, cert, fmt.Sprintf("SAN %q is not in the allowed domain list", san)), nil
}
}
}
case domain.PolicyTypeRequiredMetadata:
// Ensure certificate has required metadata/tags
// Config: {"required_keys": ["owner", "cost-center"]}
// Empty config = fire only on zero tags (backward-compat).
var cfg struct {
RequiredKeys []string `json:"required_keys"`
}
if len(rule.Config) > 0 {
if err := json.Unmarshal(rule.Config, &cfg); err != nil {
return nil, fmt.Errorf("invalid config for rule %s (type=%s): %w", rule.ID, rule.Type, err)
}
}
if len(cert.Tags) == 0 {
return &domain.PolicyViolation{
ID: generateID("violation"),
RuleID: rule.ID,
CertificateID: cert.ID,
Severity: domain.PolicySeverityWarning,
Message: "certificate has no tags or metadata",
CreatedAt: time.Now(),
}, nil
return s.violation(rule, cert, "certificate has no tags or metadata"), nil
}
for _, key := range cfg.RequiredKeys {
if _, ok := cert.Tags[key]; !ok {
return s.violation(rule, cert, fmt.Sprintf("certificate is missing required metadata key %q", key)), nil
}
}
case domain.PolicyTypeAllowedEnvironments:
// Restrict to specific environments
// Config: {"allowed": ["prod", "staging"]}
// Empty config = fire only on empty Environment (backward-compat).
var cfg struct {
Allowed []string `json:"allowed"`
}
if len(rule.Config) > 0 {
if err := json.Unmarshal(rule.Config, &cfg); err != nil {
return nil, fmt.Errorf("invalid config for rule %s (type=%s): %w", rule.ID, rule.Type, err)
}
}
if cert.Environment == "" {
return &domain.PolicyViolation{
ID: generateID("violation"),
RuleID: rule.ID,
CertificateID: cert.ID,
Severity: domain.PolicySeverityWarning,
Message: "certificate has no environment assigned",
CreatedAt: time.Now(),
}, nil
return s.violation(rule, cert, "certificate has no environment assigned"), nil
}
if len(cfg.Allowed) > 0 && !containsString(cfg.Allowed, cert.Environment) {
return s.violation(rule, cert, fmt.Sprintf("environment %q is not in the allowed list", cert.Environment)), nil
}
case domain.PolicyTypeRenewalLeadTime:
// Ensure renewal begins before certificate expires
// Config: {"lead_time_days": 30}
// Fires when remaining validity drops below lead_time_days and the
// cert is not already expired. Empty/zero config falls back to the
// pre-D-008 hardcoded 30-day threshold for backward compatibility.
var cfg struct {
LeadTimeDays int `json:"lead_time_days"`
}
if len(rule.Config) > 0 {
if err := json.Unmarshal(rule.Config, &cfg); err != nil {
return nil, fmt.Errorf("invalid config for rule %s (type=%s): %w", rule.ID, rule.Type, err)
}
}
leadDays := cfg.LeadTimeDays
if leadDays <= 0 {
leadDays = 30
}
daysUntilExpiry := time.Until(cert.ExpiresAt).Hours() / 24
if daysUntilExpiry < 30 && daysUntilExpiry > 0 {
return &domain.PolicyViolation{
ID: generateID("violation"),
RuleID: rule.ID,
CertificateID: cert.ID,
Severity: domain.PolicySeverityWarning,
Message: fmt.Sprintf("certificate expires in %.1f days, plan renewal soon", daysUntilExpiry),
CreatedAt: time.Now(),
}, nil
if daysUntilExpiry < float64(leadDays) && daysUntilExpiry > 0 {
return s.violation(rule, cert, fmt.Sprintf("certificate expires in %.1f days, plan renewal soon (policy lead time: %d days)", daysUntilExpiry, leadDays)), nil
}
case domain.PolicyTypeCertificateLifetime:
// Config: {"max_days": 397}
// Reads NotBefore/NotAfter from the latest CertificateVersion via the
// injected certRepo. ManagedCertificate exposes ExpiresAt but not the
// issuance date, so lifetime math requires the version record.
//
// If certRepo wasn't wired (test misconfiguration / early boot),
// returns an error so the caller logs it — better a loud failure
// than silently ignoring the rule. If GetLatestVersion errors (e.g.,
// the cert hasn't been issued yet), we skip the check — a cert with
// no version has no lifetime to measure, matching the missing-field
// backward-compat pattern used by the other arms.
if s.certRepo == nil {
return nil, fmt.Errorf("CertificateLifetime rule %s requires cert repository (not wired via SetCertRepo)", rule.ID)
}
var cfg struct {
MaxDays int `json:"max_days"`
}
if len(rule.Config) > 0 {
if err := json.Unmarshal(rule.Config, &cfg); err != nil {
return nil, fmt.Errorf("invalid config for rule %s (type=%s): %w", rule.ID, rule.Type, err)
}
}
if cfg.MaxDays <= 0 {
// No threshold configured — nothing meaningful to enforce.
return nil, nil
}
version, err := s.certRepo.GetLatestVersion(ctx, cert.ID)
if err != nil {
// No version yet — nothing to measure. Not an engine error;
// the cert simply hasn't been issued.
return nil, nil
}
lifetimeDays := version.NotAfter.Sub(version.NotBefore).Hours() / 24
if lifetimeDays > float64(cfg.MaxDays) {
return s.violation(rule, cert, fmt.Sprintf("certificate lifetime is %.1f days, exceeds policy max of %d days", lifetimeDays, cfg.MaxDays)), nil
}
default:
@@ -134,6 +241,56 @@ func (s *PolicyService) evaluateRule(rule *domain.PolicyRule, cert *domain.Manag
return nil, nil
}
// violation constructs a PolicyViolation carrying the rule's configured
// severity. Centralizing the build eliminates the pre-D-008 bug where each
// arm independently stamped PolicySeverityWarning on its violation.
func (s *PolicyService) violation(rule *domain.PolicyRule, cert *domain.ManagedCertificate, message string) *domain.PolicyViolation {
return &domain.PolicyViolation{
ID: generateID("violation"),
RuleID: rule.ID,
CertificateID: cert.ID,
Severity: rule.Severity,
Message: message,
CreatedAt: time.Now(),
}
}
// containsString reports whether needle is present in haystack.
func containsString(haystack []string, needle string) bool {
for _, s := range haystack {
if s == needle {
return true
}
}
return false
}
// domainAllowed reports whether a SAN (hostname) matches any of the allowed
// domain patterns. Patterns may be exact matches or `*.example.com` wildcards
// (the wildcard consumes a single label: `*.foo.com` matches `bar.foo.com`
// but not `baz.bar.foo.com`, mirroring X.509 SAN wildcard semantics).
func domainAllowed(san string, allowed []string) bool {
san = strings.ToLower(strings.TrimSpace(san))
for _, pattern := range allowed {
pattern = strings.ToLower(strings.TrimSpace(pattern))
if pattern == san {
return true
}
if strings.HasPrefix(pattern, "*.") {
suffix := pattern[1:] // ".foo.com"
if strings.HasSuffix(san, suffix) {
// Ensure wildcard consumes exactly one label — reject
// sub-subdomains.
head := strings.TrimSuffix(san, suffix)
if head != "" && !strings.Contains(head, ".") {
return true
}
}
}
}
return false
}
// CreateRule stores a new policy rule.
func (s *PolicyService) CreateRule(ctx context.Context, rule *domain.PolicyRule, actor string) error {
if rule.ID == "" {
@@ -288,6 +445,20 @@ func (s *PolicyService) UpdatePolicy(ctx context.Context, id string, policy doma
policy.ID = id
policy.UpdatedAt = time.Now()
// Severity is NOT NULL with a CHECK constraint at the DB level
// (migration 000013). If the client omits severity on a PUT (zero-value
// empty string after json.Decode), preserve the existing severity rather
// than letting the CHECK reject the write. Preserves partial-update
// semantics for the new column without changing the pre-existing behavior
// for Name/Type, which is out of scope for D-005/D-006.
if policy.Severity == "" {
existing, err := s.policyRepo.GetRule(ctx, id)
if err != nil {
return nil, fmt.Errorf("failed to fetch existing rule for severity preservation: %w", err)
}
policy.Severity = existing.Severity
}
if err := s.policyRepo.UpdateRule(ctx, &policy); err != nil {
return nil, fmt.Errorf("failed to update policy: %w", err)
}
+534
View File
@@ -3,6 +3,7 @@ package service
import (
"context"
"encoding/json"
"strings"
"testing"
"time"
@@ -420,3 +421,536 @@ func TestCreatePolicy(t *testing.T) {
t.Errorf("expected 1 rule in repo, got %d", len(policyRepo.Rules))
}
}
// ============================================================================
// D-008 regression tests
//
// These pin the behavior that closes the D-006 loop:
// 1. evaluateRule copies rule.Severity onto every violation (pre-D-008 the
// engine hardcoded Warning regardless of the rule's configured severity).
// 2. evaluateRule parses rule.Config per-arm so rules enforce real thresholds
// and allowlists (pre-D-008 the configs were ignored; rules fired only on
// the missing-field shape).
// 3. An empty/zero Config preserves the pre-D-008 missing-field violation
// (backward-compat invariant).
// 4. Malformed Config returns an error; the caller logs and skips the rule
// instead of producing a zero-value violation.
// 5. CertificateLifetime (new 6th arm) reads NotBefore/NotAfter from the
// latest CertificateVersion via the cert repo wired with SetCertRepo.
// ============================================================================
// mkRule is a tiny constructor used by the D-008 tests to keep the table rows
// readable. Every rule is enabled; test-specific fields layer on top.
func mkRule(id string, t domain.PolicyType, sev domain.PolicySeverity, cfg string) *domain.PolicyRule {
return &domain.PolicyRule{
ID: id,
Name: id,
Type: t,
Config: json.RawMessage(cfg),
Enabled: true,
Severity: sev,
}
}
// evalCert is a minimal cert used by the arms that don't look at much beyond
// the shape of the field they're testing. Tests shadow fields as needed.
func evalCert() *domain.ManagedCertificate {
return &domain.ManagedCertificate{
ID: "cert-001",
CommonName: "example.com",
Status: domain.CertificateStatusActive,
ExpiresAt: time.Now().AddDate(1, 0, 0),
}
}
// TestEvaluateRule_SeverityPassThrough pins invariant #1 — every arm stamps
// rule.Severity onto the violation. The pre-D-008 bug was that arms
// independently hardcoded PolicySeverityWarning. We test each arm with a
// severity that isn't the legacy default so a regression would be visible.
func TestEvaluateRule_SeverityPassThrough(t *testing.T) {
ctx := context.Background()
// Cert shaped to fail every non-empty-config check via the backward-compat
// missing-field path. Each row picks a severity intentionally ≠ Warning to
// make a stray hardcoded default obvious.
cases := []struct {
name string
rule *domain.PolicyRule
cert *domain.ManagedCertificate
setupFn func(svc *PolicyService)
expected domain.PolicySeverity
}{
{
name: "AllowedIssuers Critical via missing IssuerID",
rule: mkRule("r-ai", domain.PolicyTypeAllowedIssuers, domain.PolicySeverityCritical, ""),
cert: func() *domain.ManagedCertificate {
c := evalCert()
c.IssuerID = ""
return c
}(),
expected: domain.PolicySeverityCritical,
},
{
name: "AllowedDomains Error via empty SANs",
rule: mkRule("r-ad", domain.PolicyTypeAllowedDomains, domain.PolicySeverityError, ""),
cert: func() *domain.ManagedCertificate {
c := evalCert()
c.SANs = nil
return c
}(),
expected: domain.PolicySeverityError,
},
{
name: "RequiredMetadata Critical via empty Tags",
rule: mkRule("r-rm", domain.PolicyTypeRequiredMetadata, domain.PolicySeverityCritical, ""),
cert: func() *domain.ManagedCertificate {
c := evalCert()
c.Tags = nil
return c
}(),
expected: domain.PolicySeverityCritical,
},
{
name: "AllowedEnvironments Warning via empty Environment",
rule: mkRule("r-ae", domain.PolicyTypeAllowedEnvironments, domain.PolicySeverityWarning, ""),
cert: func() *domain.ManagedCertificate {
c := evalCert()
c.Environment = ""
return c
}(),
expected: domain.PolicySeverityWarning,
},
{
name: "RenewalLeadTime Critical via short remaining validity",
rule: mkRule("r-rl", domain.PolicyTypeRenewalLeadTime, domain.PolicySeverityCritical, `{"lead_time_days": 60}`),
cert: func() *domain.ManagedCertificate {
c := evalCert()
c.ExpiresAt = time.Now().AddDate(0, 0, 30) // 30d remaining < 60d lead
return c
}(),
expected: domain.PolicySeverityCritical,
},
{
name: "CertificateLifetime Error via 365d span vs 90d max",
rule: mkRule("r-cl", domain.PolicyTypeCertificateLifetime, domain.PolicySeverityError, `{"max_days": 90}`),
cert: evalCert(),
setupFn: func(svc *PolicyService) {
// Seed a version with 365d lifetime on the same cert ID used
// by evalCert().
cr := &mockCertRepo{
Certs: map[string]*domain.ManagedCertificate{},
Versions: map[string][]*domain.CertificateVersion{},
}
now := time.Now()
cr.Versions["cert-001"] = []*domain.CertificateVersion{{
ID: "ver-001",
CertificateID: "cert-001",
NotBefore: now.AddDate(0, 0, -10),
NotAfter: now.AddDate(1, 0, -10), // ~365d lifetime
}}
svc.SetCertRepo(cr)
},
expected: domain.PolicySeverityError,
},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
policyRepo := &mockPolicyRepo{
Rules: map[string]*domain.PolicyRule{tc.rule.ID: tc.rule},
Violations: []*domain.PolicyViolation{},
}
auditService := NewAuditService(&mockAuditRepo{})
svc := NewPolicyService(policyRepo, auditService)
if tc.setupFn != nil {
tc.setupFn(svc)
}
violations, err := svc.ValidateCertificate(ctx, tc.cert)
if err != nil {
t.Fatalf("ValidateCertificate failed: %v", err)
}
if len(violations) != 1 {
t.Fatalf("expected 1 violation, got %d", len(violations))
}
if violations[0].Severity != tc.expected {
t.Errorf("expected severity %q, got %q", tc.expected, violations[0].Severity)
}
if violations[0].RuleID != tc.rule.ID {
t.Errorf("expected rule ID %q, got %q", tc.rule.ID, violations[0].RuleID)
}
})
}
}
// TestEvaluateRule_ConfigConsumed pins invariant #2 — non-empty Config drives
// arm behavior (allowlists, thresholds, keys). Each subtest supplies a config
// that the cert would satisfy under the backward-compat missing-field path
// but violates under the config-aware path. A regression to the pre-D-008
// "config silently dropped" behavior would make these pass with 0 violations.
func TestEvaluateRule_ConfigConsumed(t *testing.T) {
ctx := context.Background()
t.Run("AllowedIssuers rejects issuer not in allowlist", func(t *testing.T) {
rule := mkRule("r-ai", domain.PolicyTypeAllowedIssuers, domain.PolicySeverityWarning,
`{"allowed_issuer_ids": ["iss-acme"]}`)
cert := evalCert()
cert.IssuerID = "iss-wrong"
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 1 {
t.Fatalf("expected 1 violation for disallowed issuer, got %d", len(violations))
}
if !strings.Contains(violations[0].Message, "iss-wrong") {
t.Errorf("expected message to mention issuer ID, got %q", violations[0].Message)
}
})
t.Run("AllowedIssuers accepts issuer in allowlist", func(t *testing.T) {
rule := mkRule("r-ai", domain.PolicyTypeAllowedIssuers, domain.PolicySeverityWarning,
`{"allowed_issuer_ids": ["iss-acme"]}`)
cert := evalCert()
cert.IssuerID = "iss-acme"
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 0 {
t.Errorf("expected 0 violations for allowed issuer, got %d", len(violations))
}
})
t.Run("AllowedDomains rejects SAN outside allowlist", func(t *testing.T) {
rule := mkRule("r-ad", domain.PolicyTypeAllowedDomains, domain.PolicySeverityWarning,
`{"allowed_domains": ["*.foo.com"]}`)
cert := evalCert()
cert.SANs = []string{"bar.elsewhere.com"}
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 1 {
t.Fatalf("expected 1 violation for disallowed SAN, got %d", len(violations))
}
})
t.Run("AllowedDomains wildcard matches single-label subdomain", func(t *testing.T) {
rule := mkRule("r-ad", domain.PolicyTypeAllowedDomains, domain.PolicySeverityWarning,
`{"allowed_domains": ["*.foo.com"]}`)
cert := evalCert()
cert.SANs = []string{"bar.foo.com"}
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 0 {
t.Errorf("expected 0 violations for single-label wildcard match, got %d", len(violations))
}
})
t.Run("AllowedDomains wildcard rejects multi-label subdomain", func(t *testing.T) {
// X.509 wildcard semantics: *.foo consumes exactly one label.
rule := mkRule("r-ad", domain.PolicyTypeAllowedDomains, domain.PolicySeverityWarning,
`{"allowed_domains": ["*.foo.com"]}`)
cert := evalCert()
cert.SANs = []string{"baz.bar.foo.com"}
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 1 {
t.Errorf("expected 1 violation for multi-label wildcard (X.509 semantics), got %d", len(violations))
}
})
t.Run("RequiredMetadata rejects missing key", func(t *testing.T) {
rule := mkRule("r-rm", domain.PolicyTypeRequiredMetadata, domain.PolicySeverityWarning,
`{"required_keys": ["owner"]}`)
cert := evalCert()
cert.Tags = map[string]string{"team": "platform"}
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 1 {
t.Fatalf("expected 1 violation for missing owner key, got %d", len(violations))
}
if !strings.Contains(violations[0].Message, "owner") {
t.Errorf("expected message to mention the missing key, got %q", violations[0].Message)
}
})
t.Run("RequiredMetadata accepts all required keys present", func(t *testing.T) {
rule := mkRule("r-rm", domain.PolicyTypeRequiredMetadata, domain.PolicySeverityWarning,
`{"required_keys": ["owner"]}`)
cert := evalCert()
cert.Tags = map[string]string{"owner": "alice"}
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 0 {
t.Errorf("expected 0 violations when all required keys present, got %d", len(violations))
}
})
t.Run("AllowedEnvironments rejects env outside allowlist", func(t *testing.T) {
rule := mkRule("r-ae", domain.PolicyTypeAllowedEnvironments, domain.PolicySeverityWarning,
`{"allowed": ["production", "staging"]}`)
cert := evalCert()
cert.Environment = "wild-west"
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 1 {
t.Fatalf("expected 1 violation for disallowed env, got %d", len(violations))
}
})
t.Run("RenewalLeadTime fires when remaining < configured lead", func(t *testing.T) {
rule := mkRule("r-rl", domain.PolicyTypeRenewalLeadTime, domain.PolicySeverityWarning,
`{"lead_time_days": 60}`)
cert := evalCert()
cert.ExpiresAt = time.Now().AddDate(0, 0, 30) // 30d < 60d lead
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 1 {
t.Fatalf("expected 1 violation for 30d remaining vs 60d lead, got %d", len(violations))
}
})
t.Run("RenewalLeadTime quiet when remaining > configured lead", func(t *testing.T) {
rule := mkRule("r-rl", domain.PolicyTypeRenewalLeadTime, domain.PolicySeverityWarning,
`{"lead_time_days": 14}`)
cert := evalCert()
cert.ExpiresAt = time.Now().AddDate(0, 0, 60) // 60d > 14d lead
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 0 {
t.Errorf("expected 0 violations when plenty of runway remains, got %d", len(violations))
}
})
t.Run("CertificateLifetime fires when lifetime exceeds max", func(t *testing.T) {
rule := mkRule("r-cl", domain.PolicyTypeCertificateLifetime, domain.PolicySeverityWarning,
`{"max_days": 90}`)
cert := evalCert()
now := time.Now()
certRepo := &mockCertRepo{
Certs: map[string]*domain.ManagedCertificate{},
Versions: map[string][]*domain.CertificateVersion{},
}
certRepo.Versions["cert-001"] = []*domain.CertificateVersion{{
ID: "ver-001",
CertificateID: "cert-001",
NotBefore: now.AddDate(0, 0, -1),
NotAfter: now.AddDate(1, 0, -1), // ~365d > 90d
}}
violations := runEval(ctx, t, rule, cert, certRepo)
if len(violations) != 1 {
t.Fatalf("expected 1 violation for 365d lifetime vs 90d max, got %d", len(violations))
}
if !strings.Contains(violations[0].Message, "90 days") {
t.Errorf("expected message to mention max_days threshold, got %q", violations[0].Message)
}
})
t.Run("CertificateLifetime quiet when lifetime within max", func(t *testing.T) {
rule := mkRule("r-cl", domain.PolicyTypeCertificateLifetime, domain.PolicySeverityWarning,
`{"max_days": 90}`)
cert := evalCert()
now := time.Now()
certRepo := &mockCertRepo{
Certs: map[string]*domain.ManagedCertificate{},
Versions: map[string][]*domain.CertificateVersion{},
}
certRepo.Versions["cert-001"] = []*domain.CertificateVersion{{
ID: "ver-001",
CertificateID: "cert-001",
NotBefore: now.AddDate(0, 0, -10),
NotAfter: now.AddDate(0, 0, 60), // 70d lifetime < 90d
}}
violations := runEval(ctx, t, rule, cert, certRepo)
if len(violations) != 0 {
t.Errorf("expected 0 violations for 70d lifetime under 90d max, got %d", len(violations))
}
})
}
// TestEvaluateRule_EmptyConfig_BackCompat pins invariant #3 — a rule with no
// Config (e.g., a legacy row from a pre-D-008 migration) still fires on the
// pre-D-008 missing-field shape using its configured severity. This is how
// we let existing deployments migrate without a schema rewrite.
func TestEvaluateRule_EmptyConfig_BackCompat(t *testing.T) {
ctx := context.Background()
t.Run("RequiredMetadata fires on zero tags", func(t *testing.T) {
rule := mkRule("r-rm", domain.PolicyTypeRequiredMetadata, domain.PolicySeverityError, "")
cert := evalCert()
cert.Tags = nil
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 1 {
t.Fatalf("expected 1 backcompat violation, got %d", len(violations))
}
if violations[0].Severity != domain.PolicySeverityError {
t.Errorf("expected severity Error (passed through from rule), got %q", violations[0].Severity)
}
})
t.Run("RequiredMetadata quiet when any tags present under empty config", func(t *testing.T) {
// Empty config means "only fire on missing-field shape" — so a cert
// with any tags (even not what a human would call meaningful) passes.
rule := mkRule("r-rm", domain.PolicyTypeRequiredMetadata, domain.PolicySeverityError, "")
cert := evalCert()
cert.Tags = map[string]string{"arbitrary": "value"}
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 0 {
t.Errorf("expected 0 violations under backcompat shape w/ tags set, got %d", len(violations))
}
})
t.Run("RenewalLeadTime uses 30d default under empty/zero config", func(t *testing.T) {
rule := mkRule("r-rl", domain.PolicyTypeRenewalLeadTime, domain.PolicySeverityWarning, "")
cert := evalCert()
cert.ExpiresAt = time.Now().AddDate(0, 0, 15) // 15d < 30d default
violations := runEval(ctx, t, rule, cert, nil)
if len(violations) != 1 {
t.Errorf("expected 1 violation under 30d backcompat default, got %d", len(violations))
}
})
}
// TestEvaluateRule_BadConfig_SkipsRule pins invariant #4 — malformed JSON in
// Config returns an error from evaluateRule, which ValidateCertificate logs
// and swallows. The pass continues; no zero-value violation is emitted.
// Co-located rules still fire normally.
func TestEvaluateRule_BadConfig_SkipsRule(t *testing.T) {
ctx := context.Background()
// Rule 1 has malformed JSON — should log+skip.
// Rule 2 is a healthy AllowedIssuers rule that should still emit its
// violation on the missing-IssuerID cert. If the bad rule poisoned the
// loop, we'd see 0 or 2 violations instead of exactly 1.
badRule := mkRule("r-bad", domain.PolicyTypeAllowedIssuers, domain.PolicySeverityError,
`{"allowed_issuer_ids": [`) // unterminated JSON
goodRule := mkRule("r-good", domain.PolicyTypeAllowedEnvironments, domain.PolicySeverityWarning, "")
policyRepo := &mockPolicyRepo{
Rules: map[string]*domain.PolicyRule{
badRule.ID: badRule,
goodRule.ID: goodRule,
},
Violations: []*domain.PolicyViolation{},
}
auditService := NewAuditService(&mockAuditRepo{})
svc := NewPolicyService(policyRepo, auditService)
cert := evalCert()
cert.IssuerID = "" // would trigger the bad rule if it wasn't skipped
cert.Environment = "" // triggers goodRule via missing-field backcompat
violations, err := svc.ValidateCertificate(ctx, cert)
if err != nil {
t.Fatalf("ValidateCertificate should swallow rule-eval errors, got %v", err)
}
if len(violations) != 1 {
t.Fatalf("expected exactly 1 violation (bad rule skipped, good rule fires), got %d", len(violations))
}
if violations[0].RuleID != goodRule.ID {
t.Errorf("expected violation from r-good, got %q", violations[0].RuleID)
}
}
// TestEvaluateRule_CertificateLifetime_RepoScenarios pins the setter-injection
// pattern for the 6th arm. SetCertRepo wires the dependency; without it the
// arm errors (logged+skipped by the caller). With it but no version present,
// the arm silently returns nil (matching the missing-field backcompat shape).
func TestEvaluateRule_CertificateLifetime_RepoScenarios(t *testing.T) {
ctx := context.Background()
t.Run("repo not wired logs and skips", func(t *testing.T) {
rule := mkRule("r-cl", domain.PolicyTypeCertificateLifetime, domain.PolicySeverityError,
`{"max_days": 90}`)
policyRepo := &mockPolicyRepo{
Rules: map[string]*domain.PolicyRule{rule.ID: rule},
Violations: []*domain.PolicyViolation{},
}
svc := NewPolicyService(policyRepo, NewAuditService(&mockAuditRepo{}))
// deliberately do NOT call SetCertRepo
violations, err := svc.ValidateCertificate(ctx, evalCert())
if err != nil {
t.Fatalf("ValidateCertificate should swallow the nil-repo error, got %v", err)
}
if len(violations) != 0 {
t.Errorf("expected 0 violations when repo unwired (rule skipped), got %d", len(violations))
}
})
t.Run("version missing silently skips", func(t *testing.T) {
rule := mkRule("r-cl", domain.PolicyTypeCertificateLifetime, domain.PolicySeverityError,
`{"max_days": 90}`)
policyRepo := &mockPolicyRepo{
Rules: map[string]*domain.PolicyRule{rule.ID: rule},
Violations: []*domain.PolicyViolation{},
}
svc := NewPolicyService(policyRepo, NewAuditService(&mockAuditRepo{}))
// Empty Versions map — GetLatestVersion returns errNotFound, arm skips.
svc.SetCertRepo(&mockCertRepo{
Certs: map[string]*domain.ManagedCertificate{},
Versions: map[string][]*domain.CertificateVersion{},
})
violations, err := svc.ValidateCertificate(ctx, evalCert())
if err != nil {
t.Fatalf("ValidateCertificate failed: %v", err)
}
if len(violations) != 0 {
t.Errorf("expected 0 violations when no version exists (nothing to measure), got %d", len(violations))
}
})
t.Run("max_days zero/absent means no enforcement", func(t *testing.T) {
// Even with a version, max_days=0 is a no-op (matches the
// no-threshold-configured guard in the arm).
rule := mkRule("r-cl", domain.PolicyTypeCertificateLifetime, domain.PolicySeverityError, "")
policyRepo := &mockPolicyRepo{
Rules: map[string]*domain.PolicyRule{rule.ID: rule},
Violations: []*domain.PolicyViolation{},
}
svc := NewPolicyService(policyRepo, NewAuditService(&mockAuditRepo{}))
now := time.Now()
svc.SetCertRepo(&mockCertRepo{
Certs: map[string]*domain.ManagedCertificate{},
Versions: map[string][]*domain.CertificateVersion{
"cert-001": {{
CertificateID: "cert-001",
NotBefore: now.AddDate(0, 0, -1),
NotAfter: now.AddDate(10, 0, 0), // 10 years — huge but unchecked
}},
},
})
violations, err := svc.ValidateCertificate(ctx, evalCert())
if err != nil {
t.Fatalf("ValidateCertificate failed: %v", err)
}
if len(violations) != 0 {
t.Errorf("expected 0 violations when max_days absent (no enforcement), got %d", len(violations))
}
})
}
// runEval is a test helper that exercises ValidateCertificate against a
// single-rule configuration and returns the violation slice. Optionally
// wires a cert repo for the CertificateLifetime arm.
func runEval(ctx context.Context, t *testing.T, rule *domain.PolicyRule, cert *domain.ManagedCertificate, certRepo *mockCertRepo) []*domain.PolicyViolation {
t.Helper()
policyRepo := &mockPolicyRepo{
Rules: map[string]*domain.PolicyRule{rule.ID: rule},
Violations: []*domain.PolicyViolation{},
}
svc := NewPolicyService(policyRepo, NewAuditService(&mockAuditRepo{}))
if certRepo != nil {
svc.SetCertRepo(certRepo)
}
violations, err := svc.ValidateCertificate(ctx, cert)
if err != nil {
t.Fatalf("ValidateCertificate failed: %v", err)
}
return violations
}
+4
View File
@@ -198,6 +198,10 @@ func (m *mockCertRepoWithGetError) GetLatestVersion(ctx context.Context, certID
return nil, nil
}
func (m *mockCertRepoWithGetError) GetByIssuerAndSerial(ctx context.Context, issuerID, serial string) (*domain.ManagedCertificate, error) {
return nil, nil
}
func (m *mockCertRepoWithGetError) GetExpiringCertificates(ctx context.Context, before time.Time) ([]*domain.ManagedCertificate, error) {
return nil, m.GetExpiringCertificatesErr
}
+44
View File
@@ -3,6 +3,7 @@ package service
import (
"context"
"encoding/json"
"errors"
"fmt"
"log/slog"
"time"
@@ -12,6 +13,13 @@ import (
"github.com/shankar0123/certctl/internal/repository"
)
// ErrAgentNotFound is returned by [TargetService.CreateTarget] when the caller
// references an agent_id that is empty or does not correspond to a registered
// agent. The handler layer maps this to HTTP 400 via [errors.Is]. See C-002 in
// cowork/certctl-coverage-gap-audit.md — this sentinel replaces a silent
// Postgres FK violation (23503 → HTTP 500) with a deterministic 400.
var ErrAgentNotFound = errors.New("referenced agent does not exist")
// validTargetTypes is the set of allowed target types for validation.
var validTargetTypes = map[domain.TargetType]bool{
domain.TargetTypeNGINX: true,
@@ -224,6 +232,18 @@ func (s *TargetService) TestConnection(ctx context.Context, id string) error {
return fmt.Errorf("assigned agent not found: %w", err)
}
// I-004: AgentRepository.Get intentionally surfaces retired rows (the banner
// + 410 Gone paths need to see them). A test against a retired agent can
// never succeed — the agent is tombstoned, will never heartbeat again, and
// any active targets have already been cascade-retired alongside it. Fail
// fast with an explicit message instead of falling through to the Status /
// heartbeat checks, which would produce a misleading "agent is Offline" or
// "heartbeat stale" diagnostic.
if agent.IsRetired() {
s.updateTestStatus(ctx, target, "failed")
return fmt.Errorf("assigned agent %s is retired", agent.ID)
}
if agent.Status != domain.AgentStatusOnline {
s.updateTestStatus(ctx, target, "failed")
return fmt.Errorf("assigned agent %s is %s (expected Online)", agent.ID, agent.Status)
@@ -276,6 +296,30 @@ func (s *TargetService) CreateTarget(ctx context.Context, target domain.Deployme
if !isValidTargetType(target.Type) {
return nil, fmt.Errorf("unsupported target type: %s", target.Type)
}
// C-002: enforce agent_id FK at service layer so we return a clean 400
// instead of bubbling a Postgres 23503 foreign-key violation out as 500.
// The schema (migrations/000001 line 104) declares agent_id TEXT NOT NULL
// with a FK to agents(id); we mirror that contract here for deterministic
// error mapping.
if target.AgentID == "" {
return nil, fmt.Errorf("%w: agent_id is required", ErrAgentNotFound)
}
agent, err := s.agentRepo.Get(ctx, target.AgentID)
if err != nil {
return nil, fmt.Errorf("%w: %s", ErrAgentNotFound, target.AgentID)
}
// I-004: refuse to attach new targets to a retired agent. The agent is
// tombstoned and no deployments would ever succeed against it; letting a
// row slip past here would immediately be cascade-retired on the next
// dependency sweep and confuse operators ("why is this brand-new target
// already retired?"). Treating retired agents as "not found" for creation
// purposes keeps the error surface tight and matches the default-list
// contract established by repository.AgentRepository.List.
if agent.IsRetired() {
return nil, fmt.Errorf("%w: %s (retired)", ErrAgentNotFound, target.AgentID)
}
if target.ID == "" {
target.ID = generateID("target")
}
+57 -3
View File
@@ -3,6 +3,7 @@ package service
import (
"context"
"encoding/json"
"errors"
"log/slog"
"os"
"testing"
@@ -377,11 +378,17 @@ func TestTargetService_GetTarget_Success(t *testing.T) {
}
func TestTargetService_CreateTarget_Success(t *testing.T) {
svc, targetRepo, _, _ := newTestTargetService()
svc, targetRepo, _, agentRepo := newTestTargetService()
// C-002: CreateTarget now pre-validates agent_id against agentRepo. Seed a
// real agent so the happy path still exercises the normal creation flow
// without tripping the new ErrAgentNotFound guard.
agentRepo.AddAgent(&domain.Agent{ID: "a-1", Name: "test-agent"})
target := domain.DeploymentTarget{
Name: "New Target",
Type: domain.TargetTypeNGINX,
Name: "New Target",
Type: domain.TargetTypeNGINX,
AgentID: "a-1",
}
ctx := context.Background()
@@ -415,6 +422,53 @@ func TestTargetService_CreateTarget_InvalidType(t *testing.T) {
}
}
// TestTargetService_CreateTarget_MissingAgentID verifies the C-002 service-layer
// guard: an empty agent_id must be rejected with ErrAgentNotFound before the
// repository layer is ever consulted. The handler maps this sentinel to HTTP
// 400, so a 500 from a Postgres 23503 FK violation is never surfaced.
func TestTargetService_CreateTarget_MissingAgentID(t *testing.T) {
svc, _, _, _ := newTestTargetService()
target := domain.DeploymentTarget{
Name: "No Agent",
Type: domain.TargetTypeNGINX,
// AgentID intentionally empty
}
ctx := context.Background()
_, err := svc.CreateTarget(ctx, target)
if err == nil {
t.Fatalf("expected error for missing agent_id, got nil")
}
if !errors.Is(err, ErrAgentNotFound) {
t.Errorf("expected errors.Is(err, ErrAgentNotFound) to be true, got err=%v", err)
}
}
// TestTargetService_CreateTarget_NonexistentAgentID verifies the second half of
// the C-002 guard: a non-empty agent_id that does not resolve in agentRepo
// still returns ErrAgentNotFound rather than letting the FK violation escape to
// Postgres. This is the realistic failure mode for a GUI sending a stale
// agent_id or a CLI caller with a typo.
func TestTargetService_CreateTarget_NonexistentAgentID(t *testing.T) {
svc, _, _, _ := newTestTargetService()
target := domain.DeploymentTarget{
Name: "Bad Agent Ref",
Type: domain.TargetTypeNGINX,
AgentID: "a-does-not-exist",
}
ctx := context.Background()
_, err := svc.CreateTarget(ctx, target)
if err == nil {
t.Fatalf("expected error for nonexistent agent_id, got nil")
}
if !errors.Is(err, ErrAgentNotFound) {
t.Errorf("expected errors.Is(err, ErrAgentNotFound) to be true, got err=%v", err)
}
}
func TestTargetService_UpdateTarget_Success(t *testing.T) {
svc, targetRepo, _, _ := newTestTargetService()
+238 -12
View File
@@ -2,7 +2,9 @@ package service
import (
"context"
"database/sql"
"errors"
"sort"
"sync"
"time"
@@ -129,23 +131,46 @@ func (m *mockCertRepo) GetLatestVersion(ctx context.Context, certID string) (*do
return versions[len(versions)-1], nil
}
// GetByIssuerAndSerial emulates the PostgreSQL JOIN:
// SELECT mc.* FROM managed_certificates mc JOIN certificate_versions cv
// ON cv.certificate_id = mc.id WHERE mc.issuer_id = $1 AND cv.serial_number = $2.
// Returns sql.ErrNoRows (the sentinel the real repo surfaces) when no match
// exists, so callers that branch on errors.Is(err, sql.ErrNoRows) behave the
// same in-memory as they do against PostgreSQL.
func (m *mockCertRepo) GetByIssuerAndSerial(ctx context.Context, issuerID, serial string) (*domain.ManagedCertificate, error) {
for _, cert := range m.Certs {
if cert.IssuerID != issuerID {
continue
}
for _, v := range m.Versions[cert.ID] {
if v.SerialNumber == serial {
return cert, nil
}
}
}
return nil, sql.ErrNoRows
}
func (m *mockCertRepo) AddCert(cert *domain.ManagedCertificate) {
m.Certs[cert.ID] = cert
}
// mockJobRepo is a test implementation of JobRepository
type mockJobRepo struct {
mu sync.Mutex
Jobs map[string]*domain.Job
StatusUpdates map[string]domain.JobStatus
CreateErr error
UpdateErr error
UpdateStatusErr error
GetErr error
ListErr error
ListByStatusErr error
DeleteErr error
Updated []*domain.Job
mu sync.Mutex
Jobs map[string]*domain.Job
StatusUpdates map[string]domain.JobStatus
CreateErr error
UpdateErr error
UpdateErrorByID map[string]error
UpdateErrorByIDMu sync.Mutex
UpdateStatusErr error
GetErr error
ListErr error
ListByStatusErr error
DeleteErr error
ListTimedOutErr error
Updated []*domain.Job
}
func (m *mockJobRepo) List(ctx context.Context) ([]*domain.Job, error) {
@@ -190,6 +215,13 @@ func (m *mockJobRepo) Update(ctx context.Context, job *domain.Job) error {
if m.UpdateErr != nil {
return m.UpdateErr
}
// Check per-ID error injection
m.UpdateErrorByIDMu.Lock()
idErr, ok := m.UpdateErrorByID[job.ID]
m.UpdateErrorByIDMu.Unlock()
if ok && idErr != nil {
return idErr
}
m.Jobs[job.ID] = job
m.Updated = append(m.Updated, job)
return nil
@@ -331,6 +363,30 @@ func (m *mockJobRepo) ClaimPendingByAgentID(ctx context.Context, agentID string)
return result, nil
}
// ListTimedOutAwaitingJobs returns jobs stuck in AwaitingCSR/AwaitingApproval past the
// respective cutoffs. I-003 coverage-gap closure.
func (m *mockJobRepo) ListTimedOutAwaitingJobs(ctx context.Context, csrCutoff, approvalCutoff time.Time) ([]*domain.Job, error) {
m.mu.Lock()
defer m.mu.Unlock()
if m.ListTimedOutErr != nil {
return nil, m.ListTimedOutErr
}
var jobs []*domain.Job
for _, j := range m.Jobs {
switch j.Status {
case domain.JobStatusAwaitingCSR:
if j.CreatedAt.Before(csrCutoff) {
jobs = append(jobs, j)
}
case domain.JobStatusAwaitingApproval:
if j.CreatedAt.Before(approvalCutoff) {
jobs = append(jobs, j)
}
}
}
return jobs, nil
}
func (m *mockJobRepo) AddJob(job *domain.Job) {
m.mu.Lock()
defer m.mu.Unlock()
@@ -552,7 +608,14 @@ func (m *mockRenewalPolicyRepo) AddPolicy(policy *domain.RenewalPolicy) {
m.Policies[policy.ID] = policy
}
// mockAgentRepo is a test implementation of AgentRepository
// mockAgentRepo is a test implementation of AgentRepository.
//
// I-004: ActiveTargetCounts / ActiveCertCounts / PendingJobCounts are keyed by
// agent ID and read back verbatim by the Count* methods — the retirement
// service's preflight pokes these maps to simulate "agent has N active
// deployments / M deployed certs / K pending jobs" without having to seed
// real target/cert/job rows across multiple mock repos. An unset key means
// zero, matching the production repo behavior on an agent with no deps.
type mockAgentRepo struct {
mu sync.Mutex
Agents map[string]*domain.Agent
@@ -564,8 +627,27 @@ type mockAgentRepo struct {
ListErr error
UpdateHeartbeatErr error
GetByAPIKeyErr error
// I-004 preflight count seeds (read by CountActiveTargets etc.).
ActiveTargetCounts map[string]int
ActiveCertCounts map[string]int
PendingJobCounts map[string]int
// I-004 retirement write-path error seams. Let tests force a SoftRetire
// or RetireAgentWithCascade failure after preflight passed, so the
// service's error surfacing (wrap+return, skip audit, etc.) can be
// exercised without having to stand up a real PG connection.
SoftRetireErr error
RetireCascadeErr error
CountErr error
ListRetiredErr error
}
// List mirrors the production repo contract post-I-004: it returns only
// ACTIVE agents (RetiredAt == nil). Tests that seed a retired agent via
// AddAgent and then call a List-driven service method (e.g. ListAgents,
// MarkStaleAgentsOffline, stats dashboards) must not see the retired row
// here — otherwise the mock would pass while the real planner filters it
// out at the WHERE clause level. ListRetired is the companion method for
// explicit retired-only listing.
func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) {
m.mu.Lock()
defer m.mu.Unlock()
@@ -574,6 +656,9 @@ func (m *mockAgentRepo) List(ctx context.Context) ([]*domain.Agent, error) {
}
var agents []*domain.Agent
for _, a := range m.Agents {
if a.RetiredAt != nil {
continue
}
agents = append(agents, a)
}
return agents, nil
@@ -671,6 +756,134 @@ func (m *mockAgentRepo) AddAgent(agent *domain.Agent) {
m.Agents[agent.ID] = agent
}
// ListRetired returns the paginated retired-agents slice + total count.
// Matches the production repo contract: RetiredAt != nil, sorted by
// RetiredAt DESC, page<1 → 1, perPage<1 → 50. Sort is done in-memory over
// the keyed map so the mock stays dependency-free. I-004.
func (m *mockAgentRepo) ListRetired(ctx context.Context, page, perPage int) ([]*domain.Agent, int, error) {
m.mu.Lock()
defer m.mu.Unlock()
if m.ListRetiredErr != nil {
return nil, 0, m.ListRetiredErr
}
if page < 1 {
page = 1
}
if perPage < 1 {
perPage = 50
}
var retired []*domain.Agent
for _, a := range m.Agents {
if a.RetiredAt != nil {
retired = append(retired, a)
}
}
total := len(retired)
// Sort by RetiredAt DESC — most recent first. The real query uses the
// partial idx_agents_retired_at index; here we sort in Go.
sort.SliceStable(retired, func(i, j int) bool {
return retired[i].RetiredAt.After(*retired[j].RetiredAt)
})
// Apply page/perPage window.
offset := (page - 1) * perPage
if offset >= total {
return nil, total, nil
}
end := offset + perPage
if end > total {
end = total
}
return retired[offset:end], total, nil
}
// SoftRetire stamps RetiredAt + RetiredReason on the agent row. Mirrors
// the real repo's idempotent semantics: a row already retired is left
// untouched (zero-rows-affected is not an error). I-004 preserves
// retirement metadata across re-retire attempts — whoever retired it
// first owns the audit trail.
func (m *mockAgentRepo) SoftRetire(ctx context.Context, id string, retiredAt time.Time, reason string) error {
m.mu.Lock()
defer m.mu.Unlock()
if m.SoftRetireErr != nil {
return m.SoftRetireErr
}
agent, ok := m.Agents[id]
if !ok {
return errNotFound
}
if agent.RetiredAt != nil {
return nil // already retired — no-op
}
stamped := retiredAt
agent.RetiredAt = &stamped
stampedReason := reason
agent.RetiredReason = &stampedReason
return nil
}
// RetireAgentWithCascade stamps the agent row the same way SoftRetire
// does. The real repo also stamps every active deployment_targets row
// in the same transaction; the mock can't do that because targets live
// in mockTargetRepo, which the retirement service doesn't write to
// through this repo interface. Tests that need to assert cascade
// semantics on targets should seed mockTargetRepo directly and verify
// the service-layer audit event captured the cascade count. I-004.
func (m *mockAgentRepo) RetireAgentWithCascade(ctx context.Context, id string, retiredAt time.Time, reason string) error {
m.mu.Lock()
defer m.mu.Unlock()
if m.RetireCascadeErr != nil {
return m.RetireCascadeErr
}
agent, ok := m.Agents[id]
if !ok {
return errNotFound
}
if agent.RetiredAt != nil {
return nil // already retired — no-op (same as production transaction)
}
stamped := retiredAt
agent.RetiredAt = &stamped
stampedReason := reason
agent.RetiredReason = &stampedReason
return nil
}
// CountActiveTargets returns the seeded ActiveTargetCounts value (0 if
// unset). Matches the real repo signature: COUNT of non-retired
// deployment_targets with agent_id=$1. I-004 preflight.
func (m *mockAgentRepo) CountActiveTargets(ctx context.Context, agentID string) (int, error) {
m.mu.Lock()
defer m.mu.Unlock()
if m.CountErr != nil {
return 0, m.CountErr
}
return m.ActiveTargetCounts[agentID], nil
}
// CountActiveCertificates returns the seeded ActiveCertCounts value.
// Real query: COUNT(DISTINCT certificate_id) across
// certificate_target_mappings ↔ deployment_targets on agent_id. I-004.
func (m *mockAgentRepo) CountActiveCertificates(ctx context.Context, agentID string) (int, error) {
m.mu.Lock()
defer m.mu.Unlock()
if m.CountErr != nil {
return 0, m.CountErr
}
return m.ActiveCertCounts[agentID], nil
}
// CountPendingJobs returns the seeded PendingJobCounts value. Real
// query: COUNT of jobs with agent_id=$1 AND status IN (Pending,
// AwaitingCSR, AwaitingApproval, Running). I-004.
func (m *mockAgentRepo) CountPendingJobs(ctx context.Context, agentID string) (int, error) {
m.mu.Lock()
defer m.mu.Unlock()
if m.CountErr != nil {
return 0, m.CountErr
}
return m.PendingJobCounts[agentID], nil
}
// mockTargetRepo is a test implementation of TargetRepository
type mockTargetRepo struct {
mu sync.Mutex
@@ -784,6 +997,9 @@ type mockIssuerConnector struct {
Err error
getRenewalInfoResult *RenewalInfoResult
getRenewalInfoErr error
// LastOCSPSignRequest captures the last request passed to SignOCSPResponse.
// Tests use this to assert CertStatus (0=good, 1=revoked, 2=unknown).
LastOCSPSignRequest *OCSPSignRequest
}
func (m *mockIssuerConnector) IssueCertificate(ctx context.Context, commonName string, sans []string, csrPEM string, ekus []string, maxTTLSeconds int) (*IssuanceResult, error) {
@@ -825,6 +1041,9 @@ func (m *mockIssuerConnector) GenerateCRL(ctx context.Context, entries []CRLEntr
}
func (m *mockIssuerConnector) SignOCSPResponse(ctx context.Context, req OCSPSignRequest) ([]byte, error) {
// Capture the request for test assertions (e.g., CertStatus verification)
reqCopy := req
m.LastOCSPSignRequest = &reqCopy
if m.Err != nil {
return nil, m.Err
}
@@ -894,6 +1113,13 @@ func newMockAgentRepository() *mockAgentRepo {
return &mockAgentRepo{
Agents: make(map[string]*domain.Agent),
HeartbeatUpdates: make(map[string]time.Time),
// I-004 preflight count maps. Tests seed these directly via
// agentRepo.ActiveTargetCounts["agent-id"] = N — unset keys
// read back as zero from CountActiveTargets etc., matching
// the production repo behavior for agents with no deps.
ActiveTargetCounts: make(map[string]int),
ActiveCertCounts: make(map[string]int),
PendingJobCounts: make(map[string]int),
}
}
+4
View File
@@ -77,6 +77,10 @@ func (m *mockVerificationJobRepo) ClaimPendingByAgentID(ctx context.Context, age
return nil, nil
}
func (m *mockVerificationJobRepo) ListTimedOutAwaitingJobs(ctx context.Context, csrCutoff, approvalCutoff time.Time) ([]*domain.Job, error) {
return nil, nil
}
// newVerificationTestService creates a VerificationService wired with test doubles.
func newVerificationTestService(jobs map[string]*domain.Job, jobRepoErr error) (*VerificationService, *mockVerificationJobRepo, *mockAuditRepo) {
jobRepo := &mockVerificationJobRepo{jobs: jobs, err: jobRepoErr}
@@ -0,0 +1,8 @@
-- Rollback migration 000013: remove per-rule severity.
--
-- DROP COLUMN removes the column, its CHECK constraint, and the default in
-- one statement. Any downstream code still referencing severity after
-- rollback will fail at query time — that's intentional, since running this
-- rollback implies severity as a concept is being abandoned.
ALTER TABLE policy_rules DROP COLUMN IF EXISTS severity;
@@ -0,0 +1,24 @@
-- Migration 000013: Per-Rule Severity on policy_rules
--
-- Prior to this migration, PolicyRule had no severity column. The TypeScript
-- frontend (PoliciesPage.tsx) sent a `severity` field on create/update, but
-- Go's json.Decoder silently dropped it (no matching struct field) and the
-- value never reached PostgreSQL. Reloading the page always showed severity
-- reverting to a default — the classic "silent drop" bug.
--
-- This migration adds severity as a first-class column on policy_rules.
-- Default `'Warning'` covers pre-existing rows; the CHECK constraint gives
-- defense-in-depth against casing drift (the application-layer validator in
-- internal/api/handler/validation.go already enforces the TitleCase allowlist,
-- but the DB should reject a bypassed write too).
--
-- No index: three-value column on a table that stays in the low thousands of
-- rows. The planner will seq-scan regardless; write cost without read benefit.
-- If measurements later justify it, add the index then.
--
-- PG 11+ makes ADD COLUMN with a literal DEFAULT a metadata-only operation
-- (no table rewrite), so this is safe to run on a live server.
ALTER TABLE policy_rules
ADD COLUMN IF NOT EXISTS severity VARCHAR(50) NOT NULL DEFAULT 'Warning'
CHECK (severity IN ('Warning', 'Error', 'Critical'));
@@ -0,0 +1,9 @@
-- Rollback migration 000014: drop the policy_violations severity CHECK.
--
-- Drops the named CHECK constraint added by the up migration. The severity
-- column itself stays (it predates this migration — see 000001 line 183),
-- so any application code that reads/writes the column continues to work.
-- Only the DB-level enforcement of the TitleCase allowlist is removed.
ALTER TABLE policy_violations
DROP CONSTRAINT IF EXISTS policy_violations_severity_check;
@@ -0,0 +1,29 @@
-- Migration 000014: CHECK constraint on policy_violations.severity
--
-- Sibling to migration 000013, which added severity + CHECK to policy_rules.
-- policy_violations has carried a severity column since the initial schema
-- (000001, line 183) but without any CHECK. The engine used to hardcode
-- `Warning` on every violation regardless of the triggering rule's severity
-- (see pre-D-008 internal/service/policy.go:evaluateRule), so the column
-- value was uniform by accident of implementation, not by constraint.
--
-- D-008 rewrites evaluateRule to copy rule.Severity into the violation. The
-- engine now writes values drawn from the application-layer PolicySeverity
-- allowlist, but nothing at the DB level prevents a future caller — or a
-- bypassed write from a migration or psql session — from inserting casing
-- drift ('warning', 'ERROR', etc.) and re-opening the same class of bug
-- that D-005 and D-006 closed. This constraint is the defense-in-depth
-- complement to the handler validator.
--
-- Pre-existing seed_demo.sql rows use lowercase severity values. D-008
-- updates those in the same commit so this migration can apply cleanly
-- against both a fresh install and an upgraded install that has already
-- seeded the demo data.
--
-- Named constraint (policy_violations_severity_check) so the down migration
-- can DROP it by name without ambiguity; un-named CHECK constraints use
-- a synthesized PostgreSQL name that varies by environment.
ALTER TABLE policy_violations
ADD CONSTRAINT policy_violations_severity_check
CHECK (severity IN ('Warning', 'Error', 'Critical'));
+32
View File
@@ -0,0 +1,32 @@
-- Migration 000015 rollback: reverse the agent retirement surface.
--
-- Contract (enforced by migration_000015_test.go Stage 3):
-- * retired_at + retired_reason columns removed from agents
-- * retired_at + retired_reason columns removed from deployment_targets
-- * deployment_targets.agent_id FK restored to ON DELETE CASCADE
--
-- WARNING: dropping the soft-retire columns also drops the audit history of
-- which agents have been retired and why. Operators rolling back should first
-- export the retired_at/retired_reason values they care about preserving.
--
-- Order matters: drop supporting indexes BEFORE dropping the columns they
-- reference. DROP INDEX IF EXISTS + DROP COLUMN IF EXISTS keep the down safe
-- to re-apply.
-- Drop supporting indexes first (they reference columns we're about to drop).
DROP INDEX IF EXISTS idx_agents_retired_at;
DROP INDEX IF EXISTS idx_deployment_targets_retired_at;
-- Reverse the FK flip: restore CASCADE semantics so the rolled-back server
-- behaves identically to pre-000015 behavior.
ALTER TABLE deployment_targets
DROP CONSTRAINT IF EXISTS deployment_targets_agent_id_fkey;
ALTER TABLE deployment_targets
ADD CONSTRAINT deployment_targets_agent_id_fkey
FOREIGN KEY (agent_id) REFERENCES agents(id) ON DELETE CASCADE;
-- Remove the soft-retirement columns.
ALTER TABLE deployment_targets DROP COLUMN IF EXISTS retired_reason;
ALTER TABLE deployment_targets DROP COLUMN IF EXISTS retired_at;
ALTER TABLE agents DROP COLUMN IF EXISTS retired_reason;
ALTER TABLE agents DROP COLUMN IF EXISTS retired_at;
+56
View File
@@ -0,0 +1,56 @@
-- Migration 000015: Agent retirement (I-004 fail-closed coverage-gap fix).
--
-- Adds a soft-delete surface for agents and their deployment_targets, and
-- replaces the fail-open DELETE CASCADE on deployment_targets.agent_id (see
-- migration 000001 line 104) with a fail-closed DELETE RESTRICT.
--
-- Rationale (audit finding I-004):
-- Today a bare `DELETE FROM agents WHERE id = $1` silently cascades through
-- deployment_targets (CASCADE) and blanks jobs.agent_id (SET NULL, migration
-- 000001 line 146). The cascade vaporises the deployment audit trail — "who
-- deployed what, to where, when" — and leaves nothing behind to answer
-- forensic questions after an operator mis-types an agent ID. Flipping the
-- deployment_targets FK to RESTRICT forces any DELETE to fail at the DB
-- layer unless dependencies are cleared first; the new retired_at +
-- retired_reason pair give the service layer a soft-retirement path that
-- preserves history.
--
-- Mirrors migration 000005 (revocation) in shape: nullable timestamp +
-- nullable reason string. Column type is TEXT (not VARCHAR(50)) so retirement
-- reasons can include full operator comments and audit context without
-- truncation.
--
-- Idempotency guarantees (enforced by migration_000015_test.go Stage 4):
-- * ADD COLUMN IF NOT EXISTS → re-running is a no-op
-- * DROP CONSTRAINT IF EXISTS + ADD CONSTRAINT → always converges to
-- RESTRICT; safe to re-apply after a partial rollback
-- * CREATE INDEX IF NOT EXISTS → re-running is a no-op
-- Agents: soft-retirement surface.
ALTER TABLE agents ADD COLUMN IF NOT EXISTS retired_at TIMESTAMPTZ;
ALTER TABLE agents ADD COLUMN IF NOT EXISTS retired_reason TEXT;
-- Deployment targets: soft-retirement surface. Cascade-retire via the service
-- layer copies the agent's retired_at/retired_reason onto its targets so the
-- "who owned this deployment" trail stays intact after an agent is retired.
ALTER TABLE deployment_targets ADD COLUMN IF NOT EXISTS retired_at TIMESTAMPTZ;
ALTER TABLE deployment_targets ADD COLUMN IF NOT EXISTS retired_reason TEXT;
-- Flip deployment_targets.agent_id FK from ON DELETE CASCADE (migration 000001
-- line 104) to ON DELETE RESTRICT. Auto-named constraint per Postgres default
-- (`<table>_<column>_fkey`). Drop-then-add so this migration is self-healing:
-- re-running always lands on RESTRICT regardless of prior state.
ALTER TABLE deployment_targets
DROP CONSTRAINT IF EXISTS deployment_targets_agent_id_fkey;
ALTER TABLE deployment_targets
ADD CONSTRAINT deployment_targets_agent_id_fkey
FOREIGN KEY (agent_id) REFERENCES agents(id) ON DELETE RESTRICT;
-- Supporting indexes for the retired-filter queries that replace the default
-- "active only" list paths in the agent repository (ListActive / ListRetired).
-- Partial indexes keep them cheap — only retired rows are indexed, which is a
-- tiny fraction of the table in a healthy fleet.
CREATE INDEX IF NOT EXISTS idx_agents_retired_at
ON agents(retired_at) WHERE retired_at IS NOT NULL;
CREATE INDEX IF NOT EXISTS idx_deployment_targets_retired_at
ON deployment_targets(retired_at) WHERE retired_at IS NOT NULL;
+32 -16
View File
@@ -12,42 +12,58 @@ VALUES (
'[30, 14, 7, 0]'::jsonb
) ON CONFLICT (id) DO NOTHING;
-- Policy rules: Require owner assignment
INSERT INTO policy_rules (id, name, type, config, enabled)
-- Policy rules: Require owner assignment, bound environments, cap lifetime,
-- and enforce a renewal lead-time.
--
-- Severity is differentiated per rule (D-006) and the types are now the
-- TitleCase canonicals the engine actually recognizes (D-008). Pre-D-008 the
-- types were lowercase strings (`ownership`, `environment`, `lifetime`,
-- `renewal_window`) that the engine silently dropped through to its
-- default-case error path — the rules looked alive in the GUI but did not
-- enforce anything. The backend CHECK constraint (migration 000013) enforces
-- the TitleCase severity allowlist Warning/Error/Critical. Configs are also
-- reshaped to match the D-008 per-arm schemas so the rules actually exercise
-- the config-consuming paths instead of falling back to the missing-field
-- placeholders.
INSERT INTO policy_rules (id, name, type, config, enabled, severity)
VALUES (
'pr-require-owner',
'require-owner',
'ownership',
'{"requirement": "owner_id must be set"}'::jsonb,
true
'RequiredMetadata',
'{"required_keys": ["owner"]}'::jsonb,
true,
'Warning'
) ON CONFLICT (id) DO NOTHING;
-- Policy rules: Allowed environments
INSERT INTO policy_rules (id, name, type, config, enabled)
INSERT INTO policy_rules (id, name, type, config, enabled, severity)
VALUES (
'pr-allowed-environments',
'allowed-environments',
'environment',
'AllowedEnvironments',
'{"allowed": ["production", "staging", "development"]}'::jsonb,
true
true,
'Error'
) ON CONFLICT (id) DO NOTHING;
-- Policy rules: Maximum certificate lifetime
INSERT INTO policy_rules (id, name, type, config, enabled)
INSERT INTO policy_rules (id, name, type, config, enabled, severity)
VALUES (
'pr-max-certificate-lifetime',
'max-certificate-lifetime',
'lifetime',
'CertificateLifetime',
'{"max_days": 90}'::jsonb,
true
true,
'Critical'
) ON CONFLICT (id) DO NOTHING;
-- Policy rules: Minimum renewal window
INSERT INTO policy_rules (id, name, type, config, enabled)
-- Policy rules: Minimum renewal window (renew at least 14 days before expiry)
INSERT INTO policy_rules (id, name, type, config, enabled, severity)
VALUES (
'pr-min-renewal-window',
'min-renewal-window',
'renewal_window',
'{"min_days": 14}'::jsonb,
true
'RenewalLeadTime',
'{"lead_time_days": 14}'::jsonb,
true,
'Warning'
) ON CONFLICT (id) DO NOTHING;
+13 -6
View File
@@ -478,13 +478,20 @@ ON CONFLICT (id) DO NOTHING;
-- ============================================================
-- 13. Policy Violations
-- ============================================================
-- D-008: severity values rewritten to TitleCase canonicals (Warning/Error/Critical).
-- Pre-D-008 these rows used lowercase strings ('critical', 'error', 'warning'). Those
-- values were silently tolerated by the pre-D-008 engine, which hardcoded 'Warning'
-- on every new violation regardless of the triggering rule's severity. D-008 rewires
-- evaluateRule to copy rule.Severity into the violation AND migration 000014 adds a
-- CHECK constraint enforcing the TitleCase allowlist at the DB level. Both paths now
-- round-trip correctly against these demo rows.
INSERT INTO policy_violations (id, certificate_id, rule_id, message, severity, created_at) VALUES
('pv-001', 'mc-legacy-prod', 'pr-max-certificate-lifetime', 'Certificate has expired and exceeds maximum lifetime policy', 'critical', NOW() - INTERVAL '3 days'),
('pv-002', 'mc-old-api', 'pr-max-certificate-lifetime', 'Certificate expired 15 days ago', 'critical', NOW() - INTERVAL '15 days'),
('pv-003', 'mc-vpn-prod', 'pr-min-renewal-window', 'Renewal failed within minimum renewal window', 'error', NOW() - INTERVAL '3 days'),
('pv-004', 'mc-mail-prod', 'pr-min-renewal-window', 'Certificate expiring in 5 days, below 14-day minimum window','warning', NOW() - INTERVAL '20 minutes'),
('pv-005', 'mc-wiki-prod', 'pr-max-certificate-lifetime', 'Certificate expired 7 days ago', 'critical', NOW() - INTERVAL '7 days'),
('pv-006', 'mc-compromised', 'pr-min-renewal-window', 'Certificate revoked due to key compromise', 'critical', NOW() - INTERVAL '14 days')
('pv-001', 'mc-legacy-prod', 'pr-max-certificate-lifetime', 'Certificate has expired and exceeds maximum lifetime policy', 'Critical', NOW() - INTERVAL '3 days'),
('pv-002', 'mc-old-api', 'pr-max-certificate-lifetime', 'Certificate expired 15 days ago', 'Critical', NOW() - INTERVAL '15 days'),
('pv-003', 'mc-vpn-prod', 'pr-min-renewal-window', 'Renewal failed within minimum renewal window', 'Error', NOW() - INTERVAL '3 days'),
('pv-004', 'mc-mail-prod', 'pr-min-renewal-window', 'Certificate expiring in 5 days, below 14-day minimum window','Warning', NOW() - INTERVAL '20 minutes'),
('pv-005', 'mc-wiki-prod', 'pr-max-certificate-lifetime', 'Certificate expired 7 days ago', 'Critical', NOW() - INTERVAL '7 days'),
('pv-006', 'mc-compromised', 'pr-min-renewal-window', 'Certificate revoked due to key compromise', 'Critical', NOW() - INTERVAL '14 days')
ON CONFLICT (id) DO NOTHING;
-- ============================================================
+190 -9
View File
@@ -2,6 +2,7 @@ import { describe, it, expect, beforeEach, vi } from 'vitest';
import {
setApiKey,
getApiKey,
checkAuth,
getCertificates,
getCertificate,
getCertificateVersions,
@@ -18,6 +19,8 @@ import {
getAgents,
getAgent,
registerAgent,
retireAgent,
listRetiredAgents,
getJobs,
cancelJob,
approveRenewal,
@@ -86,7 +89,6 @@ import {
getTarget,
getPrometheusMetrics,
getCertificateDeployments,
getCRL,
getOCSPStatus,
updateIssuer,
updateTarget,
@@ -179,6 +181,46 @@ describe('API Client', () => {
});
});
// ─── checkAuth (M-003: surfaces user + admin) ──────
describe('checkAuth', () => {
// Post-M-003 /auth/check returns {status, user, admin}. The admin flag drives
// GUI gating of admin-only affordances (bulk revoke, etc.). Authoritative
// enforcement lives server-side — this test only pins the contract the
// AuthProvider depends on.
it('returns {status, user, admin} shape and sends Bearer token', async () => {
mockFetch.mockReturnValueOnce(
mockJsonResponse({ status: 'authenticated', user: 'ops-admin', admin: true }),
);
const resp = await checkAuth('test-api-key');
const [url, init] = mockFetch.mock.calls[0];
expect(url).toBe('/api/v1/auth/check');
expect(init.headers['Authorization']).toBe('Bearer test-api-key');
expect(init.headers['Content-Type']).toBe('application/json');
expect(resp.status).toBe('authenticated');
expect(resp.user).toBe('ops-admin');
expect(resp.admin).toBe(true);
});
it('returns admin=false for non-admin callers', async () => {
mockFetch.mockReturnValueOnce(
mockJsonResponse({ status: 'authenticated', user: 'alice', admin: false }),
);
const resp = await checkAuth('alice-key');
expect(resp.user).toBe('alice');
expect(resp.admin).toBe(false);
});
it('throws on invalid API key', async () => {
mockFetch.mockReturnValueOnce(mockErrorResponse(401));
await expect(checkAuth('bad-key')).rejects.toThrow('Invalid API key');
});
});
// ─── Error handling ─────────────────────────────────
describe('Error handling', () => {
@@ -248,6 +290,39 @@ describe('API Client', () => {
expect(JSON.parse(init.body)).toEqual(certData);
});
// C-001 scope-expansion regression: the OnboardingWizard CertificateStep
// and the CertificatesPage CreateCertificateModal must both ship the full
// six-field required payload (name, common_name, renewal_policy_id,
// issuer_id, owner_id, team_id) — the handler's ValidateRequired contract
// rejects anything less with HTTP 400. This test pins the wire shape so
// that accidentally dropping a field from either UI surface fails CI
// rather than only surfacing as a 400 at runtime.
it('createCertificate accepts and transmits all six required fields', async () => {
const wizardPayload = {
name: 'API Production Cert',
common_name: 'api.example.com',
sans: ['www.example.com'],
issuer_id: 'iss-local',
owner_id: 'o-alice',
team_id: 't-platform',
renewal_policy_id: 'rp-standard',
environment: 'production',
};
mockFetch.mockReturnValueOnce(mockJsonResponse({ id: 'mc-new', ...wizardPayload }));
await createCertificate(wizardPayload);
const [url, init] = mockFetch.mock.calls[0];
expect(url).toBe('/api/v1/certificates');
expect(init.method).toBe('POST');
const body = JSON.parse(init.body);
// Assert every required field is present and intact
expect(body.name).toBe('API Production Cert');
expect(body.common_name).toBe('api.example.com');
expect(body.issuer_id).toBe('iss-local');
expect(body.owner_id).toBe('o-alice');
expect(body.team_id).toBe('t-platform');
expect(body.renewal_policy_id).toBe('rp-standard');
});
it('updateCertificate sends PUT', async () => {
mockFetch.mockReturnValueOnce(mockJsonResponse({ id: 'mc-test', status: 'Active' }));
await updateCertificate('mc-test', { status: 'Active' });
@@ -326,6 +401,113 @@ describe('API Client', () => {
});
});
// ─── Agent Retirement (I-004) ───────────────────────
//
// These tests pin the GUI's retirement contract against what the backend
// will add in Phase 2b: soft-retire via DELETE, force-cascade via
// ?force=true&reason=..., idempotent 204 on already-retired, 409 blocked
// payload with counts, and a GET /agents/retired listing surface.
//
// All compile-fail until client.ts exports retireAgent + listRetiredAgents
// — the shape of those exports is pinned here rather than assumed.
describe('Agent Retirement (I-004)', () => {
it('retireAgent sends DELETE without query when no force/reason', async () => {
mockFetch.mockReturnValueOnce(
mockJsonResponse({
retired_at: '2026-04-18T12:00:00Z',
already_retired: false,
cascade: false,
}),
);
await retireAgent('ag-1');
const [url, init] = mockFetch.mock.calls[0];
// Default soft-retire: bare path, no stray ? suffix.
expect(url).toBe('/api/v1/agents/ag-1');
expect(init.method).toBe('DELETE');
});
it('retireAgent propagates force+reason as URL query parameters', async () => {
mockFetch.mockReturnValueOnce(
mockJsonResponse({
retired_at: '2026-04-18T12:00:00Z',
already_retired: false,
cascade: true,
counts: { active_targets: 3, active_certificates: 7, pending_jobs: 2 },
}),
);
await retireAgent('ag-1', { force: true, reason: 'decommissioning rack 7' });
const [url, init] = mockFetch.mock.calls[0];
// URLSearchParams encodes space as "+"; "decommissioning rack 7" → "decommissioning+rack+7"
expect(url).toBe(
'/api/v1/agents/ag-1?force=true&reason=decommissioning+rack+7',
);
expect(init.method).toBe('DELETE');
});
it('retireAgent omits force=false even when reason is supplied', async () => {
// Client-side guard: the server's 400 ErrForceReasonRequired is the
// fallback; the GUI should never silently promote reason-without-force
// into a force call. Pins that reason-only still hits the soft path.
mockFetch.mockReturnValueOnce(
mockJsonResponse({
retired_at: '2026-04-18T12:00:00Z',
already_retired: false,
cascade: false,
}),
);
await retireAgent('ag-1', { reason: 'routine decommission' });
const [url] = mockFetch.mock.calls[0];
// force defaults to false → query carries reason only.
expect(url).toBe('/api/v1/agents/ag-1?reason=routine+decommission');
});
it('retireAgent surfaces the 409 dependency error message to the caller', async () => {
mockFetch.mockReturnValueOnce(
mockErrorResponse(409, {
message: 'agent has 3 active targets, 7 active certificates, 2 pending jobs',
}),
);
await expect(retireAgent('ag-1')).rejects.toThrow(
/active targets|active certificates|pending jobs/,
);
});
it('retireAgent treats 204 (already-retired) as success with empty body', async () => {
mockFetch.mockReturnValueOnce(
Promise.resolve({
ok: true,
status: 204,
json: () => Promise.reject(new Error('204 has no body')),
statusText: 'No Content',
} as Response),
);
// fetchJSON normalises 204 to {} — caller must not crash.
const result = await retireAgent('ag-1');
expect(result).toBeDefined();
});
it('listRetiredAgents sends GET /agents/retired with default pagination', async () => {
mockFetch.mockReturnValueOnce(
mockJsonResponse({ data: [], total: 0, page: 1, per_page: 50 }),
);
await listRetiredAgents();
const [url, init] = mockFetch.mock.calls[0];
expect(url).toBe('/api/v1/agents/retired?page=1&per_page=50');
// Default is GET — no explicit method means fetchJSON falls through.
expect(init.method ?? 'GET').toBe('GET');
});
it('listRetiredAgents forwards page/per_page overrides', async () => {
mockFetch.mockReturnValueOnce(
mockJsonResponse({ data: [], total: 0, page: 2, per_page: 100 }),
);
await listRetiredAgents({ page: '2', per_page: '100' });
const [url] = mockFetch.mock.calls[0];
expect(url).toContain('page=2');
expect(url).toContain('per_page=100');
});
});
// ─── Jobs ───────────────────────────────────────────
describe('Jobs', () => {
@@ -1213,13 +1395,12 @@ describe('API Client', () => {
expect(mockFetch.mock.calls[0][0]).toContain('/api/v1/certificates/mc-1/deployments');
});
it('getCRL sends GET to /crl', async () => {
mockFetch.mockReturnValueOnce(mockJsonResponse({ entries: [], total: 0 }));
await getCRL();
expect(mockFetch.mock.calls[0][0]).toBe('/api/v1/crl');
});
it('getOCSPStatus sends GET with issuer and serial', async () => {
// M-006: JSON CRL endpoint (`GET /api/v1/crl`) removed entirely — RFC 5280
// defines only the DER wire format, which is now served unauthenticated at
// `/.well-known/pki/crl/{issuer_id}` (fetched directly, no GUI wrapper).
// OCSP likewise relocated to `/.well-known/pki/ocsp/{issuer_id}/{serial}`
// per RFC 8615.
it('getOCSPStatus sends GET to /.well-known/pki/ocsp with issuer and serial', async () => {
const buf = new ArrayBuffer(8);
mockFetch.mockReturnValueOnce(
Promise.resolve({
@@ -1229,7 +1410,7 @@ describe('API Client', () => {
} as Response)
);
await getOCSPStatus('iss-local', 'ABC123');
expect(mockFetch.mock.calls[0][0]).toBe('/api/v1/ocsp/iss-local/ABC123');
expect(mockFetch.mock.calls[0][0]).toBe('/.well-known/pki/ocsp/iss-local/ABC123');
});
it('updateIssuer sends PUT with data', async () => {
+111 -9
View File
@@ -1,4 +1,4 @@
import type { Certificate, CertificateVersion, Agent, Job, Notification, AuditEvent, PolicyRule, PolicyViolation, Issuer, Target, CertificateProfile, Owner, Team, AgentGroup, PaginatedResponse, DashboardSummary, CertificateStatusCount, ExpirationBucket, JobTrendDataPoint, IssuanceRateDataPoint, MetricsResponse, DiscoveredCertificate, DiscoveryScan, DiscoverySummary, NetworkScanTarget, EndpointHealthCheck, HealthHistoryEntry, HealthCheckSummary } from './types';
import type { Certificate, CertificateVersion, Agent, Job, Notification, AuditEvent, PolicyRule, PolicyViolation, Issuer, Target, CertificateProfile, Owner, Team, AgentGroup, PaginatedResponse, DashboardSummary, CertificateStatusCount, ExpirationBucket, JobTrendDataPoint, IssuanceRateDataPoint, MetricsResponse, DiscoveredCertificate, DiscoveryScan, DiscoverySummary, NetworkScanTarget, EndpointHealthCheck, HealthHistoryEntry, HealthCheckSummary, AgentDependencyCounts, RetireAgentResponse, BlockedByDependenciesResponse } from './types';
const BASE = '/api/v1';
@@ -51,12 +51,22 @@ export const getAuthInfo = () =>
fetch(`${BASE}/auth/info`, { headers: { 'Content-Type': 'application/json' } })
.then(r => r.json() as Promise<{ auth_type: string; required: boolean }>);
// AuthCheckResponse mirrors the /auth/check handler payload. Post-M-003 it
// surfaces `user` (named-key identity) and `admin` (named-key admin flag) so
// the GUI can gate admin-only affordances. When CERTCTL_AUTH_TYPE=none the
// backend returns {user: "", admin: false}.
export interface AuthCheckResponse {
status: string;
user: string;
admin: boolean;
}
export const checkAuth = (key: string) =>
fetch(`${BASE}/auth/check`, {
headers: { 'Content-Type': 'application/json', 'Authorization': `Bearer ${key}` },
}).then(r => {
if (!r.ok) throw new Error('Invalid API key');
return r.json() as Promise<{ status: string }>;
return r.json() as Promise<AuthCheckResponse>;
});
// Certificates
@@ -152,14 +162,14 @@ export const getCertificateDeployments = (id: string, params: Record<string, str
return fetchJSON<PaginatedResponse<Job>>(`${BASE}/certificates/${id}/deployments?${qs}`);
};
// CRL / OCSP
export const getCRL = () =>
fetchJSON<{ version: number; entries: unknown[]; total: number; generated_at: string }>(`${BASE}/crl`);
// OCSP (RFC 6960) — served unauthenticated under /.well-known/pki/ per RFC 8615
// (M-006 relocation). The legacy JSON CRL endpoint (`GET /api/v1/crl`) was
// removed entirely; relying parties fetch the DER-encoded CRL directly from
// `/.well-known/pki/crl/{issuer_id}` (no GUI wrapper — binary download only).
export const getOCSPStatus = (issuerId: string, serial: string) => {
const headers: Record<string, string> = {};
if (apiKey) headers['Authorization'] = `Bearer ${apiKey}`;
return fetch(`${BASE}/ocsp/${issuerId}/${serial}`, { headers })
// No Authorization header — the OCSP responder is intentionally unauthenticated
// so relying parties without certctl API keys can check revocation status.
return fetch(`/.well-known/pki/ocsp/${issuerId}/${serial}`)
.then(r => {
if (!r.ok) throw new Error(`OCSP request failed: ${r.status}`);
return r.arrayBuffer();
@@ -178,6 +188,98 @@ export const getAgent = (id: string) =>
export const registerAgent = (data: Partial<Agent>) =>
fetchJSON<Agent>(`${BASE}/agents`, { method: 'POST', body: JSON.stringify(data) });
// I-004: typed error thrown by retireAgent when the server returns HTTP 409 with
// {error: "blocked_by_dependencies", ...}. Callers that want to show the
// dependency-counts dialog should `catch (e)` and check `e instanceof
// BlockedByDependenciesError` — the counts field is the same shape the
// backend handler returns from its inline struct in
// internal/api/handler/agents.go. Generic network / 5xx failures still throw
// plain Error so existing error-boundary code is unaffected.
export class BlockedByDependenciesError extends Error {
readonly counts: AgentDependencyCounts;
constructor(message: string, counts: AgentDependencyCounts) {
super(message);
this.name = 'BlockedByDependenciesError';
this.counts = counts;
}
}
// I-004: retire an agent via DELETE /api/v1/agents/{id}. Three distinct
// success paths the UI needs to distinguish:
// * 200 — fresh retire; body has retired_at, already_retired=false, cascade
// flag, counts of what was cascaded.
// * 204 — idempotent re-retire; the row was already retired. No body. We
// synthesize a RetireAgentResponse with already_retired=true and zero
// counts so the caller can keep a single return type.
// * 409 — blocked_by_dependencies; thrown as BlockedByDependenciesError so
// the caller can surface the active_targets/active_certificates/pending_jobs
// counts in a confirmation dialog and offer force=true.
// Anything else bubbles up via the standard fetchJSON error path.
export const retireAgent = async (
id: string,
opts: { force?: boolean; reason?: string } = {},
): Promise<RetireAgentResponse> => {
const qs = new URLSearchParams();
if (opts.force) qs.set('force', 'true');
if (opts.reason) qs.set('reason', opts.reason);
const url = qs.toString()
? `${BASE}/agents/${id}?${qs.toString()}`
: `${BASE}/agents/${id}`;
const res = await fetch(url, {
method: 'DELETE',
headers: authHeaders(),
});
if (res.status === 401) {
window.dispatchEvent(new CustomEvent('certctl:auth-required'));
throw new Error('Authentication required');
}
// 204 No Content — idempotent re-retire. Synthesize a response so callers
// get a uniform shape; already_retired=true tells them the agent was
// already in the retired state before this call.
if (res.status === 204) {
return {
retired_at: '',
already_retired: true,
cascade: false,
counts: { active_targets: 0, active_certificates: 0, pending_jobs: 0 },
};
}
if (res.status === 409) {
// Body is always JSON for 409 per the handler contract.
const body = (await res.json()) as BlockedByDependenciesResponse;
throw new BlockedByDependenciesError(
body.message || 'agent has active dependencies',
body.counts,
);
}
if (!res.ok) {
let errorMsg = res.statusText;
try {
const body = await res.json();
errorMsg = body.message || body.error || errorMsg;
} catch {
// not JSON
}
throw new Error(errorMsg || `HTTP ${res.status}`);
}
return (await res.json()) as RetireAgentResponse;
};
// I-004: list retired agents via GET /api/v1/agents/retired. Kept separate
// from getAgents (which hits the default active-only listing) so the retired
// tab on AgentsPage can page independently. per_page is capped server-side at
// 500 (see handler ListRetiredAgents).
export const listRetiredAgents = (params: Record<string, string> = {}) => {
const qs = new URLSearchParams({ page: '1', per_page: '50', ...params }).toString();
return fetchJSON<PaginatedResponse<Agent>>(`${BASE}/agents/retired?${qs}`);
};
// Jobs
export const getJobs = (params: Record<string, string> = {}) => {
const qs = new URLSearchParams({ page: '1', per_page: '50', ...params }).toString();
+134
View File
@@ -0,0 +1,134 @@
import { describe, it, expect } from 'vitest';
import { POLICY_TYPES, POLICY_SEVERITIES } from './types';
import type { Agent } from './types';
/**
* Regression tests for the policy enum tuples.
*
* These tuples are the GUI's source of truth for the policy type and severity
* dropdowns. They MUST stay in lockstep with the backend enum values:
* - internal/domain/policy.go defines the PolicyType / PolicySeverity consts
* - internal/api/handler/validators.go rejects anything outside the allowlist
* - migration 000013 enforces the severity allowlist at the DB level via CHECK
*
* Audit history (D-005, D-006):
* - The GUI previously sent lowercase values (e.g. 'key_algorithm',
* 'ownership'), which the backend validator rejected with a 400. Every
* attempt to create a policy from the "+ New Policy" button silently
* failed until the modal was closed.
* - The severity dropdown carried a four-value `low/medium/high/critical`
* tuple that shared zero values with the backend's
* `Warning/Error/Critical` the `medium` option has no backend analog
* and is removed.
*
* If these tests fail because a backend enum changed, DO NOT update the
* expected arrays without also updating the backend consts and the migration.
* Frontend/backend drift on these tuples is precisely what this regression
* guards against.
*/
describe('POLICY_TYPES', () => {
it('matches the backend PolicyType TitleCase allowlist exactly', () => {
expect(POLICY_TYPES).toEqual([
'AllowedIssuers',
'AllowedDomains',
'RequiredMetadata',
'AllowedEnvironments',
'RenewalLeadTime',
'CertificateLifetime',
]);
});
it('has no duplicate entries', () => {
expect(new Set(POLICY_TYPES).size).toBe(POLICY_TYPES.length);
});
});
describe('POLICY_SEVERITIES', () => {
it('matches the backend PolicySeverity TitleCase allowlist exactly', () => {
expect(POLICY_SEVERITIES).toEqual(['Warning', 'Error', 'Critical']);
});
it('has no duplicate entries', () => {
expect(new Set(POLICY_SEVERITIES).size).toBe(POLICY_SEVERITIES.length);
});
it('does not include the removed pre-fix `medium` value', () => {
// Explicit negative assertion. Pre-fix the GUI offered four severities
// (low/medium/high/critical); `medium` never had a backend analog.
expect(POLICY_SEVERITIES as readonly string[]).not.toContain('medium');
});
});
/**
* Regression test for the Agent interface's I-004 soft-retirement shape.
*
* Backend (migration 000015, Phase 2b) adds two nullable timestamps/strings to
* the agents table `retired_at` and `retired_reason` mirroring the existing
* Certificate.revoked_at / Certificate.revocation_reason pair. The GUI needs
* these fields on the Agent interface so the Retired tab, retire modal, and
* retirement banner can render the agent's retired state without resorting to
* `(agent as any).retired_at` escapes.
*
* Both fields are optional (agent.ts interface) because the server omits them
* from the response for active agents. A compile-time shape check here pins
* that Phase 2b does not drift the field names (e.g. to retiredAt camelCase)
* or accidentally promote them to required.
*
* Compile-fail until Phase 2b adds:
* retired_at?: string;
* retired_reason?: string;
* to the Agent interface in types.ts.
*/
describe('Agent interface (I-004 retirement)', () => {
it('accepts retired_at and retired_reason as optional string fields', () => {
// Construct an Agent with the retirement fields set. If Phase 2b names
// them anything other than retired_at / retired_reason, this fails to
// compile — which is exactly what the Red stage wants.
const retired: Agent = {
id: 'ag-1',
name: 'decom-01',
hostname: 'server-old',
ip_address: '10.0.0.1',
os: 'linux',
architecture: 'amd64',
status: 'Offline',
version: '2.1.0',
last_heartbeat: '2026-01-01T00:00:00Z',
last_heartbeat_at: '2026-01-01T00:00:00Z',
capabilities: [],
tags: {},
registered_at: '2024-01-01T00:00:00Z',
created_at: '2024-01-01T00:00:00Z',
updated_at: '2026-01-01T00:00:00Z',
retired_at: '2026-01-01T00:00:00Z',
retired_reason: 'old hardware',
};
expect(retired.retired_at).toBe('2026-01-01T00:00:00Z');
expect(retired.retired_reason).toBe('old hardware');
});
it('accepts an Agent without retired_at / retired_reason (optional fields)', () => {
// Active agents should not carry retirement metadata. If Phase 2b makes
// the fields required, this block fails to compile.
const active: Agent = {
id: 'ag-2',
name: 'web01',
hostname: 'web01.prod',
ip_address: '10.0.0.2',
os: 'linux',
architecture: 'amd64',
status: 'Online',
version: '2.1.0',
last_heartbeat: '2026-04-18T12:00:00Z',
last_heartbeat_at: '2026-04-18T12:00:00Z',
capabilities: ['deploy', 'scan'],
tags: {},
registered_at: '2024-06-01T00:00:00Z',
created_at: '2024-06-01T00:00:00Z',
updated_at: '2026-04-18T12:00:00Z',
};
expect(active.retired_at).toBeUndefined();
expect(active.retired_reason).toBeUndefined();
});
});
+67 -3
View File
@@ -67,6 +67,43 @@ export interface Agent {
registered_at: string;
created_at: string;
updated_at: string;
// I-004: soft-retirement fields. When retired_at is non-null, the agent is
// tombstoned — it will never heartbeat again and cascaded targets have been
// retired alongside it. The retired tab on AgentsPage uses these to show the
// when/why. The server filters retired rows from the default /api/v1/agents
// listing; they appear only via GET /api/v1/agents/retired.
retired_at?: string | null;
retired_reason?: string | null;
}
// I-004: dependency counts returned by the retire handler in both the 200
// success-with-cascade body and the 409 blocked_by_dependencies body. The
// operator UI uses these to show "this agent has N targets, M certs, K jobs
// depending on it" in the confirm-retire dialog.
export interface AgentDependencyCounts {
active_targets: number;
active_certificates: number;
pending_jobs: number;
}
// I-004: success shape for DELETE /api/v1/agents/{id}. already_retired is
// always false for 200 responses; 204 responses carry no body (the retire was
// idempotent — the agent was already retired). The frontend distinguishes by
// HTTP status, not by this field.
export interface RetireAgentResponse {
retired_at: string;
already_retired: boolean;
cascade: boolean;
counts: AgentDependencyCounts;
}
// I-004: shape returned with HTTP 409 when a retire is blocked by active
// downstream dependencies. Keep in lockstep with the handler's inline struct
// in internal/api/handler/agents.go (search "blocked_by_dependencies").
export interface BlockedByDependenciesResponse {
error: 'blocked_by_dependencies';
message: string;
counts: AgentDependencyCounts;
}
export interface Job {
@@ -112,11 +149,38 @@ export interface AuditEvent {
timestamp: string;
}
/**
* Policy rule type enum pinned to the backend's TitleCase constants in
* internal/domain/policy.go. Historical note (D-005): the GUI previously sent
* lowercase values (`ownership`, `environment`, etc.) that the handler's
* ValidatePolicyType rejected with a 400. These tuples are the canonical
* source of truth for the dropdown options; the regression test in
* types.test.ts pins them so future drift is caught at CI time.
*/
export const POLICY_TYPES = [
'AllowedIssuers',
'AllowedDomains',
'RequiredMetadata',
'AllowedEnvironments',
'RenewalLeadTime',
'CertificateLifetime',
] as const;
export type PolicyType = (typeof POLICY_TYPES)[number];
/**
* Policy severity enum pinned to the backend's PolicySeverity constants.
* The backend CHECK constraint on policy_rules.severity enforces the same
* allowlist (migration 000013). The 4-value `medium` option that used to
* appear in the GUI was never a valid backend value and has been removed.
*/
export const POLICY_SEVERITIES = ['Warning', 'Error', 'Critical'] as const;
export type PolicySeverity = (typeof POLICY_SEVERITIES)[number];
export interface PolicyRule {
id: string;
name: string;
type: string;
severity: string;
type: PolicyType;
severity: PolicySeverity;
config: Record<string, unknown>;
enabled: boolean;
created_at: string;
@@ -127,7 +191,7 @@ export interface PolicyViolation {
id: string;
rule_id: string;
certificate_id: string;
severity: string;
severity: PolicySeverity;
message: string;
created_at: string;
}
+26 -2
View File
@@ -7,6 +7,11 @@ interface AuthState {
authRequired: boolean;
authenticated: boolean;
authType: string;
// M-003: named-key identity + admin flag surfaced from /auth/check so admin-
// only GUI affordances (e.g., bulk-revoke) can be hidden from non-admin
// callers. These are UX hints — authorization remains enforced server-side.
user: string;
admin: boolean;
login: (key: string) => Promise<void>;
logout: () => void;
error: string | null;
@@ -17,6 +22,8 @@ const AuthContext = createContext<AuthState>({
authRequired: false,
authenticated: false,
authType: 'none',
user: '',
admin: false,
login: async () => {},
logout: () => {},
error: null,
@@ -31,6 +38,8 @@ export default function AuthProvider({ children }: { children: ReactNode }) {
const [authRequired, setAuthRequired] = useState(false);
const [authenticated, setAuthenticated] = useState(false);
const [authType, setAuthType] = useState('none');
const [user, setUser] = useState('');
const [admin, setAdmin] = useState(false);
const [error, setError] = useState<string | null>(null);
// Check if server requires auth on mount
@@ -40,12 +49,19 @@ export default function AuthProvider({ children }: { children: ReactNode }) {
setAuthType(info.auth_type);
setAuthRequired(info.required);
if (!info.required) {
// CERTCTL_AUTH_TYPE=none: the server treats every caller as
// anonymous with admin=false. Mirror that locally so gated
// affordances stay hidden.
setAuthenticated(true);
setUser('');
setAdmin(false);
}
})
.catch(() => {
// If auth/info fails, assume no auth required (server may be old version)
setAuthenticated(true);
setUser('');
setAdmin(false);
})
.finally(() => setLoading(false));
}, []);
@@ -55,6 +71,8 @@ export default function AuthProvider({ children }: { children: ReactNode }) {
const handler = () => {
setAuthenticated(false);
setApiKey(null);
setUser('');
setAdmin(false);
setError('Session expired. Please re-enter your API key.');
};
window.addEventListener('certctl:auth-required', handler);
@@ -64,9 +82,13 @@ export default function AuthProvider({ children }: { children: ReactNode }) {
const login = useCallback(async (key: string) => {
setError(null);
try {
await checkAuth(key);
// /auth/check returns {status, user, admin}. Capture user + admin so the
// GUI can hide admin-only affordances (bulk revoke, etc.).
const resp = await checkAuth(key);
setApiKey(key);
setAuthenticated(true);
setUser(resp.user ?? '');
setAdmin(Boolean(resp.admin));
} catch {
setError('Invalid API key');
throw new Error('Invalid API key');
@@ -76,11 +98,13 @@ export default function AuthProvider({ children }: { children: ReactNode }) {
const logout = useCallback(() => {
setApiKey(null);
setAuthenticated(false);
setUser('');
setAdmin(false);
setError(null);
}, []);
return (
<AuthContext.Provider value={{ loading, authRequired, authenticated, authType, login, logout, error }}>
<AuthContext.Provider value={{ loading, authRequired, authenticated, authType, user, admin, login, logout, error }}>
{children}
</AuthContext.Provider>
);
+127
View File
@@ -0,0 +1,127 @@
import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest';
import { render, screen, fireEvent, cleanup } from '@testing-library/react';
import { MemoryRouter, Routes, Route } from 'react-router-dom';
// -----------------------------------------------------------------------------
// UX-001 Phase 4 — Layout "Setup guide" re-entry button
//
// Phase 2 added a persistent "Setup guide" button to the sidebar so operators
// who dismissed the onboarding wizard (or closed it mid-flow) can always walk
// themselves back in. The button must:
//
// 1. Render with the accessible name "Setup guide".
// 2. On click, clear the `certctl:onboarding-dismissed` localStorage key so
// DashboardPage's first-run detection re-engages.
// 3. On click, navigate to `/?onboarding=1` — the query-param re-entry
// signal DashboardPage reads via useSearchParams. The query param is the
// contract between Layout and DashboardPage; without it, a user who
// already has certs + issuers would not see the wizard again.
// -----------------------------------------------------------------------------
// Intercept useNavigate so we can assert the destination path without having
// to configure every route segment the wizard might push to.
const mockNavigate = vi.fn();
vi.mock('react-router-dom', async () => {
const actual = await vi.importActual<typeof import('react-router-dom')>('react-router-dom');
return {
...actual,
useNavigate: () => mockNavigate,
};
});
// Layout pulls auth state from AuthProvider to decide whether to render the
// logout button. Tests don't care about auth — stub the hook with an anonymous
// session so Layout renders without needing a real AuthProvider wrapper.
vi.mock('./AuthProvider', () => ({
useAuth: () => ({
loading: false,
authRequired: false,
authenticated: true,
authType: 'none',
user: '',
admin: false,
login: vi.fn(),
logout: vi.fn(),
error: null,
}),
}));
// Imported after vi.mock so the mocks are in effect when Layout's module graph
// resolves.
import Layout from './Layout';
function renderLayout() {
return render(
<MemoryRouter initialEntries={['/']}>
<Routes>
<Route element={<Layout />}>
<Route path="/" element={<div data-testid="outlet-root">root</div>} />
</Route>
</Routes>
</MemoryRouter>,
);
}
describe('Layout — UX-001 Setup guide sidebar button', () => {
beforeEach(() => {
vi.clearAllMocks();
cleanup();
localStorage.clear();
});
afterEach(() => {
localStorage.clear();
});
it('renders a "Setup guide" button in the sidebar', () => {
renderLayout();
// Red-to-green guard: if the button is removed or renamed, this catches
// it. We match by accessible name so the assertion survives className /
// icon churn.
expect(screen.getByRole('button', { name: /Setup guide/i })).toBeInTheDocument();
});
it('clears the onboarding-dismissed localStorage key on click', () => {
localStorage.setItem('certctl:onboarding-dismissed', 'true');
expect(localStorage.getItem('certctl:onboarding-dismissed')).toBe('true');
renderLayout();
fireEvent.click(screen.getByRole('button', { name: /Setup guide/i }));
// DashboardPage reads this key synchronously to decide whether the first-
// run wizard can auto-open. Leaving it set would suppress the wizard even
// after navigation, defeating the re-entry contract.
expect(localStorage.getItem('certctl:onboarding-dismissed')).toBeNull();
});
it('navigates to /?onboarding=1 on click', () => {
renderLayout();
fireEvent.click(screen.getByRole('button', { name: /Setup guide/i }));
// The `?onboarding=1` query param is the explicit signal DashboardPage
// checks via useSearchParams. Asserting the exact path pins the contract
// both ends rely on.
expect(mockNavigate).toHaveBeenCalledTimes(1);
expect(mockNavigate).toHaveBeenCalledWith('/?onboarding=1');
});
it('tolerates localStorage access failure without throwing', () => {
// Some browsers / privacy modes throw on localStorage access. Layout
// wraps the removal in try/catch so the navigation still fires. Simulate
// the failure and verify the navigation path is unaffected.
const original = Storage.prototype.removeItem;
Storage.prototype.removeItem = vi.fn(() => {
throw new Error('localStorage unavailable');
});
try {
renderLayout();
fireEvent.click(screen.getByRole('button', { name: /Setup guide/i }));
expect(mockNavigate).toHaveBeenCalledWith('/?onboarding=1');
} finally {
Storage.prototype.removeItem = original;
}
});
});
+19 -1
View File
@@ -1,4 +1,4 @@
import { NavLink, Outlet } from 'react-router-dom';
import { NavLink, Outlet, useNavigate } from 'react-router-dom';
import { useAuth } from './AuthProvider';
import logo from '../assets/certctl-logo.png';
@@ -35,6 +35,12 @@ function Icon({ d }: { d: string }) {
export default function Layout() {
const { authRequired, logout } = useAuth();
const navigate = useNavigate();
const openSetupGuide = () => {
try { localStorage.removeItem('certctl:onboarding-dismissed'); } catch { /* noop */ }
navigate('/?onboarding=1');
};
return (
<div className="flex h-screen overflow-hidden">
@@ -71,6 +77,18 @@ export default function Layout() {
))}
</nav>
<div className="px-3 pb-2 pt-2 border-t border-white/10">
<button
type="button"
onClick={openSetupGuide}
title="Reopen the onboarding wizard"
className="w-full flex items-center gap-3 px-3 py-2 text-[13px] rounded text-sidebar-text hover:text-white hover:bg-white/10 transition-all duration-150"
>
<Icon d="M9.663 17h4.673M12 3v1m6.364 1.636l-.707.707M21 12h-1M4 12H3m3.343-5.657l-.707-.707m2.828 9.9a5 5 0 117.072 0l-.548.547A3.374 3.374 0 0014 18.469V19a2 2 0 11-4 0v-.531c0-.895-.356-1.754-.988-2.386l-.548-.547z" />
Setup guide
</button>
</div>
<div className="px-5 py-3 border-t border-white/10 flex items-center justify-between">
<span className="text-[10px] text-brand-300/60 font-mono">certctl</span>
{authRequired && (

Some files were not shown because too many files have changed in this diff Show More