certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 21:01:31 +00:00

Author	SHA1	Message	Date
shankar0123	3b92048242	metrics: add per-issuer-type issuance counters, histogram, and failure classifier Closes the #4 acquisition-readiness blocker from the 2026-05-01 issuer coverage audit. Before this commit, certctl's Prometheus exposition had zero per-issuer-type signal — operators answering "is DigiCert slow?" or "is Sectigo failing more than ACME?" had to grep logs by issuer name. This commit adds three series labelled by issuer type: certctl_issuance_total{issuer_type, outcome} certctl_issuance_duration_seconds{issuer_type} (histogram) certctl_issuance_failures_total{issuer_type, error_class} The histogram covers 0.05–120 second buckets to span the local-issuer fast path and async-CA slow path (DigiCert/Sectigo/Entrust polling can take minutes). error_class is a closed enum of eight values (timeout, auth, rate_limited, validation, upstream_5xx, upstream_4xx, network, other) classified once in service.ClassifyError. Cardinality budget is ~276 new series, well within Prometheus's comfortable range. Implementation: - service.IssuanceMetrics is the thread-safe counter + histogram table. Three independent views (counters / failures / durations) exposed via SnapshotCounters / SnapshotFailures / SnapshotDurations. sync.RWMutex protects the map shape; per-key sync/atomic.Uint64 primitives keep the recording hot path lock-free under concurrent service-layer goroutines. - service.IssuanceCounterEntry / IssuanceFailureEntry / IssuanceDurationEntry / IssuanceMetricsSnapshotter live in service (not handler) to avoid an import cycle: handler already imports service for admin_est.go etc., so service can't import handler back. Handler's exposer takes the snapshotter via the service-defined interface. - service.ClassifyError pure function maps error → error_class. 
context.DeadlineExceeded / context.Canceled → timeout; net.OpError → network; substring matches against canonical AWS / DigiCert / Sectigo error shapes for auth / rate_limited / validation / upstream_5xx / upstream_4xx / network; unknown → other. Each branch has at least one representative test case in TestClassifyError. - IssuerConnectorAdapter.SetMetrics wires per-adapter recording (issuerType + metrics). Existing 28+ test call sites of NewIssuerConnectorAdapter keep their one-arg signature; production wiring goes through SetMetrics post-construction. - IssuerRegistry.SetIssuanceMetrics + Rebuild type-asserts to IssuerConnectorAdapter and calls SetMetrics with the issuer type string. nil-guarded — tests that hand-build adapters without metrics get no-op recording. - IssuerConnectorAdapter.IssueCertificate / RenewCertificate wrap the underlying connector call with start := time.Now() and recordIssuance(start, err). Renewal is recorded into the same certctl_issuance_* series as initial issuance — operationally, renewal IS issuance from the connector's perspective (matches the audit prompt's guidance on series naming). - handler/metrics.go GetPrometheusMetrics gains a new exposer block emitting all three series in stable label order with correct Prometheus format (_bucket / _sum / _count for the histogram, +Inf bucket appended). Sorted via sort.Slice for stable output. nil- guarded so deploys without the wire produce clean exposition. - formatLE helper trims trailing zeros from histogram bucket labels via strconv.FormatFloat(le, 'f', -1, 64) so the `le` labels match Prometheus client conventions ("0.05", "30", "120", not "0.0500" etc.). - cmd/server/main.go wires a single IssuanceMetrics instance into both the IssuerRegistry (recording) and the MetricsHandler (exposer) using DefaultIssuanceBucketBoundaries. Tests: - TestIssuanceMetrics_RecordAndSnapshot — happy-path counter + histogram + failure recording, BucketBoundaries returns a copy (not shared storage). 
- TestIssuanceMetrics_HistogramCumulative — pins the cumulative-buckets contract. 100ms observation lands in 0.1 bucket and every larger bucket; 750ms only in the 1.0 bucket. Off-by-one here would corrupt every quantile query downstream. - TestIssuanceMetrics_Concurrency — 100 goroutines × 1000 ops under the race detector. Asserts atomic counter integrity across contended writes. - TestClassifyError — 17 cases covering every branch of the closed enum plus the nil-error special case. Implementation chooses the existing hand-rolled fmt.Fprintf exposition pattern (no prometheus/client_golang dependency added) to stay consistent with the OCSP / deploy counter blocks already in the file. Out of scope (separate follow-ups): - Revocation metrics (certctl_revocation_*) — symmetric to issuance but the audit didn't ask; explicit follow-up commit. - Discovery / health-check duration histograms. - prometheus/client_golang migration. Verified locally: - gofmt clean - go vet ./... clean - staticcheck ./... clean - golangci-lint run --timeout 5m ./... → 0 issues - go test -short -count=1 ./internal/service/ green - go test -short -count=1 -race -run TestIssuanceMetrics ./internal/service/ green - go test -short -count=1 ./internal/api/handler/ green - go build ./... success Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md Top-10 fix #4 (Part 3, narrative section).	2026-05-02 00:39:25 +00:00
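The cumulative-bucket contract and the `le` label formatting described in commit 3b92048242 can be sketched as follows. This is a minimal illustration, not the repository's `service.IssuanceMetrics`: the `histogram` type, its method names, and the intermediate bucket boundaries are assumptions; only the `strconv.FormatFloat(le, 'f', -1, 64)` call and the 0.05 to 120 second range come from the commit message.

```go
package main

import (
	"fmt"
	"strconv"
	"sync/atomic"
)

// histogram is a hypothetical, simplified stand-in for the commit's
// IssuanceMetrics duration view. Boundaries are fixed at construction,
// so recording needs no lock, only atomic adds.
type histogram struct {
	bounds []float64       // upper bounds in seconds, ascending
	counts []atomic.Uint64 // counts[i] = observations <= bounds[i] (cumulative)
	total  atomic.Uint64   // feeds the +Inf bucket and the _count series
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]atomic.Uint64, len(bounds))}
}

// Observe increments every bucket whose upper bound is >= v, which is
// what makes the buckets cumulative.
func (h *histogram) Observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i].Add(1)
		}
	}
	h.total.Add(1)
}

// bucketCount returns the cumulative count for bucket i.
func (h *histogram) bucketCount(i int) uint64 { return h.counts[i].Load() }

// formatLE renders a bucket bound without trailing zeros ("0.05", "30").
func formatLE(le float64) string {
	return strconv.FormatFloat(le, 'f', -1, 64)
}

func main() {
	// Assumed boundaries; the commit only states the 0.05-120 second span.
	h := newHistogram([]float64{0.05, 0.1, 0.5, 1, 30, 120})
	h.Observe(0.1)  // 100ms
	h.Observe(0.75) // 750ms
	for i, b := range h.bounds {
		fmt.Printf("certctl_issuance_duration_seconds_bucket{issuer_type=\"local\",le=%q} %d\n",
			formatLE(b), h.bucketCount(i))
	}
	fmt.Printf("certctl_issuance_duration_seconds_bucket{issuer_type=\"local\",le=\"+Inf\"} %d\n",
		h.total.Load())
}
```

In this sketch a 100ms observation is counted in the 0.1 bucket and every larger one, while a 750ms observation first appears in the 1 bucket and is then counted in every larger one.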
shankar0123	b0efdbe2f8	repo,service: introduce WithinTx and atomic audit rows for issue/renew/revoke Closes the #3 acquisition-readiness blocker from the 2026-05-01 issuer coverage audit (Part 1.5 finding #1: audit row not transactional with issuance). AuditRepository.Create previously ran on the package-level sql.DB while the certificate insert / version insert / revocation insert ran on independent connections — a failed audit INSERT after a successful operation INSERT was silently lost. SOX §404 over IT general controls, PCI-DSS §10 audit logging, HIPAA §164.312(b) audit controls, and CA/B Forum Baseline Requirements §5.4.1 audit log records all presume audit-with-operation atomicity. Design — Option A (Querier abstraction). The chosen pattern: a shared repository.Querier interface (subset of sql.DB and sql.Tx) plus a postgres.WithinTx helper that begins a tx, runs fn, commits on nil error, rolls back on error or panic, and returns the wrapped result. Repository methods that participate in a service-layer transaction expose a WithTx variant taking repository.Querier; the bare methods remain for stand-alone use. A repository.Transactor abstracts the "begin tx, run fn, commit/rollback" lifecycle so service-layer code runs multi-write operations atomically without holding sql.DB directly. Option B (UnitOfWork) was considered but adds boilerplate without behavioral benefit for the current scope. Option C (context-carried tx) was explicitly rejected — it hides the transactional boundary from the type system, reproducing the class of bug we're fixing. This commit: - Adds internal/repository/querier.go with the Querier interface (compile-time guards that sql.DB and sql.Tx satisfy it) and the Transactor interface for service-layer use. - Adds internal/repository/postgres/tx.go with the WithinTx helper (begin/fn/commit/rollback with panic recovery) and a transactor type that satisfies repository.Transactor. 
- Adds CreateWithTx variants on AuditRepository, CertificateRepository (Create + Update + CreateVersion), and RevocationRepository. Existing bare methods now delegate to the WithTx variant using the package-level sql.DB so existing call sites are behavior-preserving. - Updates repository/interfaces.go: AuditRepository, CertificateRepository, and RevocationRepository declare the new WithTx methods. Adds an atomicity contract doc-comment on AuditRepository pointing at WithinTx + the audit blocker. - Adds AuditService.RecordEventWithTx, mirroring RecordEvent but routing through CreateWithTx so the audit row is part of the caller's transaction. Same redaction + marshalling contract. - Refactors three audit-emitting service paths to use Transactor.WithinTx when SetTransactor was wired, with a legacy fallback for backward compat: * CertificateService.Create — cert insert + audit row in one tx. * RevocationSvc.RevokeCertificateWithActor — cert status update + revocation row + audit row in one tx. The OCSP cache invalidate remains best-effort (out of scope per the prompt). * RenewalService CompleteServerRenewal — cert version insert + cert update + audit row in one tx. Job status update stays outside the audit-atomicity scope (job state lives outside the operator-facing audit trail). - Adds SetTransactor on CertificateService, RevocationSvc, and RenewalService. cmd/server/main.go wires a single Transactor instance shared across all three so all audit-emitting paths run their writes in transactions backed by the same sql.DB handle. - Updates 5 mock implementations to satisfy the new interface methods: mockCertRepo (testutil_test.go), mockCertRepoWithGetError (shortlived_test.go), fakeRevocationRepo (crl_cache_test.go), intuneE2EAuditRepo (scep_intune_e2e_test.go), and the integration- test mocks (lifecycle_test.go: mockCertificateRepository, mockAuditRepository, mockRevocationRepository). 
All WithTx mocks ignore the Querier and delegate to the bare method (mocks have no DB; in-memory state is shared regardless of "tx"). - Adds a service-layer test mockTransactor with BeginTxErr and CommitErr knobs so the atomic-audit tests can assert error propagation through the transactional boundary. - Adds internal/repository/postgres/tx_test.go: unit-level test that WithinTx surfaces "begin tx" wrap when BeginTx fails, and that Transactor.WithinTx delegates correctly. Real-Postgres rollback semantics are covered by the testcontainers tests in the postgres package — sandbox disk pressure prevented adding a sqlmock dep for the in-fn / commit-failure unit test, so those scenarios are exercised through atomic_audit_test.go using the mockTransactor's CommitErr / BeginTxErr fields. - Adds internal/service/atomic_audit_test.go: * TestCertificateService_Create_AtomicWithTx — asserts audit insert failure inside the tx surfaces as the operation's error (closes the blocker contract). * TestCertificateService_Create_LegacyPathLogs — pins the backward-compat behavior when SetTransactor isn't wired: audit failure is logged-not-failed, matching pre-fix. * TestCertificateService_Create_TransactorBeginFailure — BeginTx error path: operation fails, no cert insert, no audit insert. * TestCertificateService_Create_TransactorCommitFailure — Commit error after successful in-fn writes surfaces as the operation's error. Real Postgres can fail Commit on serialization conflicts; the service must report this. Out of scope (separate follow-up commits, same shape): - Issuer CRUD audit atomicity. - Target CRUD audit atomicity. - Agent retire (already transactional via RetireAgentWithCascade; verified, not changed). - Renewal-policy CRUD audit atomicity. - Owner/team/agent-group CRUD audit atomicity. - Discovery / health-check audit atomicity. Verified locally: - gofmt -l . clean - go vet ./... clean - staticcheck ./... clean - golangci-lint run --timeout 5m ./... 
→ 0 issues - go test -short -count=1 ./internal/service/ green - go test -short -count=1 ./internal/api/handler/ green - go test -short -count=1 ./internal/integration/ green - go test -short -count=1 ./internal/repository/postgres/ green - go build ./... success Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md Top-10 fix #3 (Part 3, narrative section).	2026-05-02 00:29:09 +00:00
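The begin / fn / commit / rollback lifecycle that commit b0efdbe2f8 attributes to `WithinTx` might look roughly like the sketch below. To stay runnable without a database driver it uses hypothetical `tx` and `beginner` interfaces plus in-memory fakes instead of `sql.DB` and `sql.Tx`; re-panicking after rollback is an assumption, and only the "begin tx" error wrap is taken from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// tx and beginner are hypothetical minimal interfaces standing in for
// *sql.Tx and *sql.DB so the sketch runs without a database.
type tx interface {
	Commit() error
	Rollback() error
}

type beginner interface {
	Begin() (tx, error)
}

// withinTx begins a transaction, runs fn, commits when fn returns nil,
// and rolls back when fn returns an error or panics.
func withinTx(db beginner, fn func(tx) error) (err error) {
	t, err := db.Begin()
	if err != nil {
		return fmt.Errorf("begin tx: %w", err)
	}
	defer func() {
		if p := recover(); p != nil {
			_ = t.Rollback()
			panic(p) // assumption: propagate the panic after rolling back
		}
		if err != nil {
			_ = t.Rollback()
			return
		}
		if cerr := t.Commit(); cerr != nil {
			err = fmt.Errorf("commit tx: %w", cerr)
		}
	}()
	return fn(t)
}

// fakeTx and fakeDB record what happened so the behavior is observable.
type fakeTx struct {
	committed, rolledBack bool
	commitErr             error
}

func (f *fakeTx) Commit() error {
	if f.commitErr != nil {
		return f.commitErr
	}
	f.committed = true
	return nil
}

func (f *fakeTx) Rollback() error { f.rolledBack = true; return nil }

type fakeDB struct {
	last                *fakeTx
	beginErr, commitErr error
}

func (d *fakeDB) Begin() (tx, error) {
	if d.beginErr != nil {
		return nil, d.beginErr
	}
	d.last = &fakeTx{commitErr: d.commitErr}
	return d.last, nil
}

var errAudit = errors.New("audit insert failed")

func main() {
	db := &fakeDB{}
	err := withinTx(db, func(_ tx) error {
		// The certificate insert would succeed here; the audit insert then fails.
		return errAudit
	})
	fmt.Println(err, "rolledBack:", db.last.rolledBack, "committed:", db.last.committed)
}
```

The point of the shape is that an audit failure inside `fn` becomes the operation's error and undoes the earlier writes, rather than being logged and lost.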
shankar0123	7cb453a336	chore(fmt): repo-wide gofmt -w sweep — close drift surfaced by ci-pipeline-cleanup Phase 4 Mechanical reformat. The new 'gofmt drift' CI step (added in ci-pipeline-cleanup Phase 4, commit `0f205a8`) surfaced 111 files with accumulated gofmt drift across cmd/, internal/, and deploy/test/. Each file's diff is gofmt-standard: whitespace adjustments, intra- group import sorting (alphabetical by import path within blank-line- separated groups), and struct-tag column alignment. No semantic changes — verified via 'git diff --ignore-all-space' which shows only the line-position deltas from import reordering. The gate stays in place after this commit. Going forward it catches gofmt drift at PR time.	2026-04-30 22:33:57 +00:00
shankar0123	8637131f80	chore: gofmt fixes across deploy-hardening I new files Phase 13 verification surfaced gofmt-formatting drift in 6 files across the bundle's new code: - internal/api/handler/metrics.go (struct field alignment) - internal/connector/target/k8ssecret/validate_only_test.go (alignment) - internal/connector/target/nginx/nginx.go (alignment) - internal/connector/target/postfix/postfix.go (alignment) - internal/connector/target/ssh/validate_only_test.go (alignment) - internal/service/deploy_counters.go (alignment) Pure mechanical gofmt -w fixes; no behavior changes. CI's make verify gate (which runs `go fmt ./...`) didn't catch these because go fmt is more lenient than gofmt -l, but golangci-lint v2.11.4 + the explicit gofmt step in Phase 13 verification did. Phase 13 full-matrix verification all green: - gofmt -l: empty across all bundle-touched files - go vet ./internal/deploy/... ./internal/connector/target/... ./internal/service/ ./internal/api/handler/ ./cmd/agent/: clean - golangci-lint v2.11.4 (the version CI runs): 0 issues - go test -race -count=1 across deploy + nginx + apache + haproxy + agent + service: all green - INTEGRATION=1 go test -tags integration -run Deploy ./deploy/test/...: 4/4 e2e tests green Phase 14 next: release prep — Active Focus update, release notes, Reddit-beat draft, final tag handoff to operator.	2026-04-30 15:33:33 +00:00
shankar0123	135b271197	feat(metrics): per-target-type deploy counters wired into /metrics/prometheus Phase 10 of the deploy-hardening I master bundle. Mirrors the production-hardening-II Phase 8 OCSP-counter pattern. Per frozen decision 0.9, the metric naming convention is `certctl_deploy_<area>_total` with target_type + sub-label. internal/service/deploy_counters.go: - DeployCounters struct with sync.Map of per-target-type buckets (apache, nginx, etc.). Lock-free fast path via sync/atomic Uint64 counters; LoadOrStore on first tick. - 8 sub-counters per target-type bucket: - attemptsSuccess / attemptsFailure - validateFailures (PreCommit returned error) - reloadFailures (PostCommit returned error → rollback ran) - postVerifyFails (post-deploy TLS handshake failed) - rollbackRestored (rollback succeeded) - rollbackAlsoFail (operator-actionable escalation) - idempotentSkips (SHA-256 match → no-op deploy) - Snapshot returns []DeploySnapshot for the Prometheus exposer. internal/service/deploy_counters_test.go: - 5 tests: zero-state, per-target-type tick isolation, race-detector smoke under concurrent ticks, cross-target bucket isolation, snapshot-mutation-doesn't-affect-counter. internal/api/handler/metrics.go: - New DeployCounterSnapshotter interface (mirrors CounterSnapshotter for the OCSP counters but uses the per-target-type tuple shape). - New DeploySnapshotEntry struct copying the service-layer shape; avoids importing the service package directly so the handler stays dependency-light. - New SetDeployCounters setter on MetricsHandler (mirrors SetOCSPCounters wiring). 
- Prometheus exposer extended with 6 new metric blocks per frozen decision 0.9: - certctl_deploy_attempts_total{target_type, result} - certctl_deploy_validate_failures_total{target_type} - certctl_deploy_reload_failures_total{target_type} - certctl_deploy_post_verify_failures_total{target_type} - certctl_deploy_rollback_total{target_type, outcome} - certctl_deploy_idempotent_skip_total{target_type} - Output sorted by target_type for stable diffs across requests. The agent-side wire-up (cmd/agent/main.go ticking counters in the DeployCertificate dispatch site) is intentionally deferred to a follow-up commit — Phase 10's load-bearing change is the infrastructure; per-connector tick wiring is a mechanical follow-on. Build + go vet clean. go test -count=1 green for service + handler packages. Phase 11 next: cross-cutting integration tests at deploy/test/.	2026-04-30 15:25:38 +00:00
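A per-target-type counter table built on `sync.Map` and `sync/atomic`, as commit 135b271197 describes, could be sketched like this. The type and field names here are illustrative rather than the repository's `DeployCounters`, only three of the eight sub-counters are shown, and sorting inside `Snapshot` is an assumption (the commit sorts in the exposer).

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"sync/atomic"
)

// bucket holds the counters for one target type (three of eight shown).
type bucket struct {
	attemptsSuccess atomic.Uint64
	attemptsFailure atomic.Uint64
	idempotentSkips atomic.Uint64
}

// deployCounters maps target_type -> *bucket.
type deployCounters struct{ m sync.Map }

// bucketFor returns the bucket for targetType, creating it on first tick.
func (c *deployCounters) bucketFor(targetType string) *bucket {
	if b, ok := c.m.Load(targetType); ok {
		return b.(*bucket) // fast path: no allocation once the bucket exists
	}
	b, _ := c.m.LoadOrStore(targetType, &bucket{})
	return b.(*bucket)
}

// snapshot is a plain-value copy that is safe to hand to an exposer.
type snapshot struct {
	TargetType                                        string
	AttemptsSuccess, AttemptsFailure, IdempotentSkips uint64
}

// Snapshot copies every bucket and sorts by target type for stable output.
func (c *deployCounters) Snapshot() []snapshot {
	var out []snapshot
	c.m.Range(func(k, v any) bool {
		b := v.(*bucket)
		out = append(out, snapshot{
			TargetType:      k.(string),
			AttemptsSuccess: b.attemptsSuccess.Load(),
			AttemptsFailure: b.attemptsFailure.Load(),
			IdempotentSkips: b.idempotentSkips.Load(),
		})
		return true
	})
	sort.Slice(out, func(i, j int) bool { return out[i].TargetType < out[j].TargetType })
	return out
}

func main() {
	var c deployCounters
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.bucketFor("nginx").attemptsSuccess.Add(1)
		}()
	}
	wg.Wait()
	c.bucketFor("apache").attemptsFailure.Add(1)
	for _, s := range c.Snapshot() {
		fmt.Printf("certctl_deploy_attempts_total{target_type=%q,result=\"success\"} %d\n",
			s.TargetType, s.AttemptsSuccess)
	}
}
```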
shankar0123	2d83342bbe	feat(metrics): extend /metrics/prometheus with per-area OCSP counters (Phase 8) Production hardening II Phase 8 — surface the OCSP per-event counters shipped in Phase 1+2 through the existing /api/v1/metrics/prometheus endpoint. Operators now alert on certctl_ocsp_counter_total{label="rate_limited"} (Phase 3 trip), {label="nonce_malformed"} (Phase 1 reject), {label="signing_failed"} (issuer connector fails), etc. NEW interface CounterSnapshotter (handler/metrics.go) — minimum surface the Prometheus exposer needs from any per-area counter table: just Snapshot() map[string]uint64. service.OCSPCounters.Snapshot (Phase 1) satisfies it; future per-area counters (CRL, cert-export, EST per-profile, SCEP per-profile, Intune per-profile) plug in the same way as separate SetXxxCounters setters. Naming convention per frozen decision 0.10: certctl_<area>_counter_total{label="<event>"} <value> This commit ships only the OCSP block. The remaining areas (CRL, cert-export, EST, SCEP, Intune) plug in via the same SetXxxCounters pattern in follow-up commits — the wire-up cost per area is one new field + one setter + one block of fmt.Fprintf lines. The bundle's S-1 docs-count guard means we don't claim a specific total in prose; operators run `curl /api/v1/metrics/prometheus | grep certctl_` to enumerate. Wired in cmd/server/main.go: a single shared *service.OCSPCounters instance is created once and passed to BOTH the ocspResponseCacheService (so the cache hot path ticks counters) AND metricsHandler.SetOCSPCounters (so the Prometheus exposer reads them). Existing dashboard metrics (certctl_certificate_total, certctl_agent_total, etc.) remain unchanged at the same line offsets — back-compat preserved. Pre-commit verification: go build ./... clean; go test -short -count=1 green for handler/ + service/. 
The existing TestGetPrometheusMetrics_Success tests still pass (the new counter block is additive at the END of the response body, after the existing dashboard metrics + uptime line).	2026-04-30 05:15:05 +00:00
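The `CounterSnapshotter` surface and the `certctl_<area>_counter_total{label="<event>"} <value>` naming from commit 2d83342bbe can be illustrated with a hand-rolled exposer. The interface signature is the one quoted in the commit; `staticCounters`, `writeCounterBlock`, `render`, the `# TYPE` line, and the label sorting are assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"sort"
	"strings"
)

// CounterSnapshotter is the minimal surface named in the commit message.
type CounterSnapshotter interface {
	Snapshot() map[string]uint64
}

// staticCounters is a hypothetical fixed table used only for illustration.
type staticCounters map[string]uint64

func (s staticCounters) Snapshot() map[string]uint64 {
	out := make(map[string]uint64, len(s))
	for k, v := range s {
		out[k] = v
	}
	return out
}

// writeCounterBlock emits one per-area block in Prometheus text format.
// A nil source emits nothing, so deploys without the wire stay clean.
func writeCounterBlock(w io.Writer, area string, src CounterSnapshotter) {
	if src == nil {
		return
	}
	snap := src.Snapshot()
	labels := make([]string, 0, len(snap))
	for l := range snap {
		labels = append(labels, l)
	}
	sort.Strings(labels) // map order is random; sort for stable scrapes

	name := "certctl_" + area + "_counter_total"
	fmt.Fprintf(w, "# TYPE %s counter\n", name)
	for _, l := range labels {
		// %q is adequate for simple identifiers like "rate_limited";
		// arbitrary label values would need Prometheus-specific escaping.
		fmt.Fprintf(w, "%s{label=%q} %d\n", name, l, snap[l])
	}
}

// render is a small helper that captures the block as a string.
func render(area string, src CounterSnapshotter) string {
	var sb strings.Builder
	writeCounterBlock(&sb, area, src)
	return sb.String()
}

func main() {
	ocsp := staticCounters{"rate_limited": 2, "nonce_malformed": 1, "signing_failed": 0}
	writeCounterBlock(os.Stdout, "ocsp", ocsp)
}
```

Adding another area under this pattern would amount to one more `CounterSnapshotter` value and one more `writeCounterBlock` call.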
shankar0123	db854ecc6f	feat(crl): HTTP caching headers (ETag + If-None-Match 304) per RFC 7232 (Phase 4) Production hardening II Phase 4 — wire RFC 7232 conditional-request support into GetDERCRL so CDNs and reverse proxies in front of certctl can serve repeated CRL fetches from edge caches. Saves bandwidth + removes the per-request DB read on the certctl side when a relying party honors max-age. ETag: weak form (W/) per RFC 7232 §2.3 wrapping the first 16 bytes of SHA-256(DER) — sufficient ID space for the cache layer + leaves headroom for a future builder that might emit signature randomness that doesn't change the CRL semantics. If-None-Match: when the inbound header matches the computed ETag, short-circuit to 304 Not Modified with no body. Identical inbound ETag → identical CRL → no need to retransmit the bytes. Cache-Control: public, max-age=3600, must-revalidate. The 1h max-age matches the default CRL regen cadence; relying parties that cache won't re-fetch within the window. must-revalidate forces revalidation once the window expires (so a stale relying party doesn't keep returning expired-cache CRLs after the regen tick). The pre-existing Cache-Control: max-age=3600 is preserved syntactically (the new line replaces it with the more complete form); existing relying parties see the same ceiling, just with the addition of public + must-revalidate hints for downstream caches. Pre-commit verification: go build ./... clean; go test -short -count=1 green for handler/. The existing TestGetDERCRL_* tests still pass — the new headers are additive, the response body is unchanged.	2026-04-30 05:09:28 +00:00
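The conditional-request flow in commit db854ecc6f (weak ETag over the first 16 bytes of SHA-256(DER), `If-None-Match` short-circuit to 304, and the `Cache-Control` value) can be sketched as below. The hex encoding of the digest, the handler and helper names, and the `/crl` path are assumptions; the exact-string comparison is also a simplification, since a full implementation would handle comma-separated lists and `*`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// weakETag wraps the first 16 bytes of SHA-256(der) in the weak form W/"...".
// Hex encoding is an assumption; the commit does not name the encoding.
func weakETag(der []byte) string {
	sum := sha256.Sum256(der)
	return `W/"` + hex.EncodeToString(sum[:16]) + `"`
}

// serveCRL returns a handler that serves der with caching headers and
// answers 304 Not Modified when the client already holds the same ETag.
func serveCRL(der []byte) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		etag := weakETag(der)
		w.Header().Set("ETag", etag)
		w.Header().Set("Cache-Control", "public, max-age=3600, must-revalidate")
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified) // no body on 304
			return
		}
		w.Header().Set("Content-Type", "application/pkix-crl")
		_, _ = w.Write(der)
	}
}

// fetchStatus issues an in-memory GET and returns the response status code.
func fetchStatus(h http.Handler, ifNoneMatch string) int {
	req := httptest.NewRequest(http.MethodGet, "/crl", nil) // hypothetical path
	if ifNoneMatch != "" {
		req.Header.Set("If-None-Match", ifNoneMatch)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	der := []byte("stand-in for DER-encoded CRL bytes")
	h := serveCRL(der)
	fmt.Println("first fetch:", fetchStatus(h, ""))
	fmt.Println("revalidation:", fetchStatus(h, weakETag(der)))
}
```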
shankar0123	ed19312df6	feat(ratelimit): per-endpoint rate limit on OCSP + cert-export (Phase 3) Production hardening II Phase 3 — wire the existing internal/ratelimit/SlidingWindowLimiter into the OCSP and cert-export handlers. Removes the DoS vector where an unauthenticated relying party (or compromised admin token) can hammer the responder / key-export endpoint at unbounded rates. OCSP: per-source-IP cap. Default 1000 req/min/IP, 50k tracked IPs (matches the SCEP/Intune replay cache cap). Configurable via CERTCTL_OCSP_RATE_LIMIT_PER_IP_MIN; zero disables. Source IP comes from net.SplitHostPort(r.RemoteAddr) — we deliberately do NOT honor X-Forwarded-For because OCSP is publicly reachable and untrusted intermediaries could spoof the header to bypass the limit. On rate-limit trip: respond with the canonical ocsp.UnauthorizedErrorResponse pre-built blob from x/crypto/ocsp (status 6 per RFC 6960 §2.3) plus Retry-After: 60. Using the unauthorized status (instead of TryLater) avoids hand-rolling DER for a single rejection path; relying parties retry on any non-good status anyway. Cert-export: per-actor cap. Default 50 exports/hr/operator. Configurable via CERTCTL_CERT_EXPORT_RATE_LIMIT_PER_ACTOR_HR; zero disables. Actor extracted from the X-Actor request header (set by the auth middleware); falls back to RemoteAddr if empty (defensive). On rate-limit trip: HTTP 429 + JSON body {"error":"rate_limit_exceeded","retry_after_seconds":3600} + Retry-After: 3600. NEW config fields in internal/config/config.go::SchedulerConfig: OCSPRateLimitPerIPMin (default 1000) CertExportRateLimitPerActorHr (default 50) WIRED in cmd/server/main.go: ocspLimiter constructed with the configured cap, 1m window, 50k map cap; exportLimiter same shape with 1h window. Both wired via SetOCSPRateLimiter / SetExportRateLimiter on their respective handlers. Existing deploys see no behavior change unless the env vars are set to non-default values + traffic exceeds the cap. 
Pre-commit verification: go build ./... clean; go test -short -count=1 green for handler + service + config.	2026-04-30 05:08:04 +00:00
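The per-source-IP limiting in commit ed19312df6 can be sketched with a small sliding-window limiter keyed by the host part of `RemoteAddr`. This is not the repository's `internal/ratelimit` `SlidingWindowLimiter`: the type, its `Allow` signature with an explicit clock, and the `demo` helper are hypothetical, and the cap on tracked keys (50k in the commit) is omitted for brevity.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// slidingWindow allows at most limit hits per key within window.
type slidingWindow struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	hits   map[string][]time.Time
}

func newSlidingWindow(limit int, window time.Duration) *slidingWindow {
	return &slidingWindow{limit: limit, window: window, hits: make(map[string][]time.Time)}
}

// Allow reports whether key may proceed at time now. A limit of zero or
// less disables limiting, mirroring the "zero disables" config semantics.
func (s *slidingWindow) Allow(key string, now time.Time) bool {
	if s.limit <= 0 {
		return true
	}
	s.mu.Lock()
	defer s.mu.Unlock()

	cutoff := now.Add(-s.window)
	kept := s.hits[key][:0]
	for _, t := range s.hits[key] {
		if t.After(cutoff) { // drop hits that have aged out of the window
			kept = append(kept, t)
		}
	}
	if len(kept) >= s.limit {
		s.hits[key] = kept
		return false
	}
	s.hits[key] = append(kept, now)
	return true
}

// clientIP takes the host from RemoteAddr and deliberately ignores
// X-Forwarded-For, which an untrusted client could spoof.
func clientIP(remoteAddr string) string {
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}

// demo runs four requests from one IP against a 2-per-minute limit.
func demo() []bool {
	lim := newSlidingWindow(2, time.Minute)
	ip := clientIP("203.0.113.7:51234")
	t0 := time.Unix(1_700_000_000, 0)
	return []bool{
		lim.Allow(ip, t0),
		lim.Allow(ip, t0.Add(10*time.Second)),
		lim.Allow(ip, t0.Add(20*time.Second)), // third hit inside the window
		lim.Allow(ip, t0.Add(61*time.Second)), // first hit has aged out
	}
}

func main() {
	fmt.Println(demo())
}
```

Passing the clock in explicitly keeps the sketch deterministic; a production limiter would more likely read `time.Now()` internally.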
shankar0123	3d15a3e5af	feat(ocsp): RFC 6960 §4.4.1 nonce extension support — echo client nonce in response, reject malformed Production hardening II Phase 1. The OCSP responder previously ignored the request's nonce extension entirely, leaving relying parties vulnerable to replay attacks. RFC 6960 §4.4.1 defines the OPTIONAL id-pkix-ocsp-nonce extension (OID 1.3.6.1.5.5.7.48.1.2): when present in the request, the responder MUST echo the same value in the response; when absent, no nonce in the response (back-compat with relying parties that don't send one). NEW internal/service/ocsp_nonce.go: ParseOCSPRequestNonce walks raw DER (golang.org/x/crypto/ocsp.Request doesn't expose the request's extensions field — the library only exposes IssuerNameHash + IssuerKeyHash + SerialNumber). Returns one of three states: - (nil, false, nil) — no nonce extension in request - (nonce, true, nil) — well-formed nonce, ≤ MaxOCSPNonceLength (32) - (nil, false, ErrOCSPNonceMalformed) — empty or oversized NEW internal/service/ocsp_counters.go: sync/atomic counter table for OCSP request lifecycle (request_get/post, request_success/invalid, nonce_echoed, nonce_malformed, rate_limited, ...). Mirrors the EST/ SCEP counter pattern; Phase 8 wires these into /metrics/prometheus. CertSrv types extended: - internal/connector/issuer/interface.go::OCSPSignRequest gains Nonce []byte field. - internal/service/renewal.go::OCSPSignRequest (the service-layer duplicate used by ca_operations.go) gains the same field. - internal/service/issuer_adapter.go bridges the two. Service path: CAOperationsSvc.GetOCSPResponseWithNonce(ctx, issuerID, serialHex, nonce) is the new entry point that plumbs the nonce through every signing site (good / revoked / unknown / short-lived). The legacy GetOCSPResponse becomes a nil-nonce wrapper for back- compat — every existing caller (tests, the GET handler) sees no behavior change. 
CertificateService gains the same WithNonce variant; the handler interface adds it to the contract. MockCertificateService in tests extended with the new method (delegates to the legacy fn when no override is set, so existing tests that don't care about the nonce keep working). Local issuer's SignOCSPResponse appends the id-pkix-ocsp-nonce extension (non-Critical per RFC 6960 §4.4) to the response template's ExtraExtensions when req.Nonce != nil. The extnValue is the nonce bytes wrapped in an OCTET STRING per RFC 6960 §4.4.1. POST OCSP handler (HandleOCSPPost): - After ocsp.ParseRequest succeeds, calls ParseOCSPRequestNonce on the raw body to extract the optional nonce. - On ErrOCSPNonceMalformed (empty or > 32 bytes): writes an 'unauthorized' OCSP response (status 6 per RFC 6960 §2.3) using the canonical ocsp.UnauthorizedErrorResponse from x/crypto/ocsp. Does NOT echo malicious bytes back. - On well-formed nonce: passes it through GetOCSPResponseWithNonce. - On no nonce: nil passed through; back-compat preserved. GET OCSP handler unchanged — the GET form has no body to carry a nonce extension. 6 new tests in internal/service/ocsp_nonce_test.go pin every documented failure mode + the 32-byte boundary. The test fixture builds an OCSPRequest via golang.org/x/crypto/ocsp.CreateRequest then splices in a [2] EXPLICIT Extensions element by hand (the library doesn't expose extension construction either). Pre-commit verification: gofmt clean, go vet clean across affected packages, go test -short -count=1 green for service/ + handler/ + connector/issuer/local/. No new env vars introduced (Phase 1 is always-on per RFC; no operator opt-out).	2026-04-30 04:55:06 +00:00
shankar0123	5834e5b866	fix(est): plumb context through ESTService.ReloadTrust to satisfy contextcheck CI golangci-lint v2.11.4 flagged internal/api/handler/admin_est.go:178: the AdminESTServiceImpl.ReloadTrust method took ctx context.Context but called svc.ReloadTrust() with no context, then the underlying ESTService.ReloadTrust used context.Background() internally for the audit RecordEvent call. That's the contextcheck linter's textbook 'context discarded at boundary' violation. Fix: change ESTService.ReloadTrust signature to ReloadTrust(ctx context.Context) and forward the caller-supplied ctx into auditService.RecordEvent. AdminESTServiceImpl.ReloadTrust now passes its received ctx through. The HTTP handler already forwards r.Context() one layer up, so the request-scoped trace identifiers now flow end-to-end into the audit row instead of being severed at the service boundary. Verified locally with golangci-lint v2.11.4 (the same version CI runs) against ./internal/api/handler/... ./internal/service/... — '0 issues.' All cmd/* binaries build clean, go test -short -count=1 green for both packages.	2026-04-30 01:59:04 +00:00
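The contextcheck fix in commit 5834e5b866 comes down to forwarding the caller's context instead of minting `context.Background()` at the service boundary. A minimal before/after sketch follows; the `traceKey` context key, the `auditRecorder` type, and the way the trace identifier is stored are all hypothetical stand-ins for whatever the request-scoped middleware actually attaches.

```go
package main

import (
	"context"
	"fmt"
)

type ctxKey string

// traceKey is a hypothetical key for a request-scoped trace identifier.
const traceKey ctxKey = "trace_id"

// auditRecorder is a stand-in for the audit service; it records which
// trace identifier, if any, was visible when the event was written.
type auditRecorder struct{ seen []string }

func (a *auditRecorder) RecordEvent(ctx context.Context, action string) {
	id, _ := ctx.Value(traceKey).(string)
	a.seen = append(a.seen, action+":"+id)
}

type estService struct{ audit *auditRecorder }

// reloadTrustLegacy shows the flagged shape: the caller's context is
// discarded and a fresh Background context reaches the audit call.
func (s *estService) reloadTrustLegacy() {
	s.audit.RecordEvent(context.Background(), "est_trust_anchor_reloaded")
}

// ReloadTrust shows the fixed shape: the caller-supplied ctx is forwarded.
func (s *estService) ReloadTrust(ctx context.Context) {
	s.audit.RecordEvent(ctx, "est_trust_anchor_reloaded")
}

// demo calls both variants with the same request context and returns
// what the audit recorder saw for each.
func demo() []string {
	svc := &estService{audit: &auditRecorder{}}
	reqCtx := context.WithValue(context.Background(), traceKey, "req-42")
	_ = reqCtx // the legacy variant has no way to receive it
	svc.reloadTrustLegacy()
	svc.ReloadTrust(reqCtx)
	return svc.audit.seen
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```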
shankar0123	5a682db8e2	EST RFC 7030 hardening master bundle Phases 10-11: libest sidecar e2e + Cisco IOS quirk fixtures + ManagedCertificate.Source provenance + EST bulk-revoke endpoint + 13 typed audit action codes. Phase 10.1 — libest reference-client sidecar: - deploy/test/libest/Dockerfile: multi-stage Debian-bookworm-slim build of Cisco's libest v3.2.0-2 from source (autoconf/automake/ libtool + libcurl4-openssl-dev + libssl-dev). Runtime stage carries only estclient + bash + openssl + ca-certificates so the exec surface stays small + predictable. - docker-compose.test.yml libest-client entry (profiles: [est-e2e]) with bind mounts for /config/est (test workspace) + /config/certs (certctl CA bundle for TLS pinning); IP 10.30.50.9 (10.30.50.8 was already taken by certctl-agent). - deploy/test/est/.gitkeep keeps the bind-mount target tracked. Phase 10.2 — 5 integration tests (//go:build integration) in deploy/test/est_e2e_test.go: - TestEST_LibESTClient_Enrollment_Integration (cacerts → simpleenroll → cert-shape assertion) - TestEST_LibESTClient_MTLSEnrollment_Integration (mTLS sibling-route cert auth; skip when bootstrap cert absent) - TestEST_LibESTClient_ServerKeygen_Integration (RFC 7030 §4.4 multipart; skip when profile gate disabled) - TestEST_LibESTClient_RateLimited_Integration (4th enroll trips per-principal cap, asserts 429-shaped error) - TestEST_LibESTClient_ChannelBinding_Integration (libest --tls-exporter; skip when libest build lacks the flag). - requireESTSidecar guard skips the suite when the operator forgot --profile est-e2e; helpful error message includes the exact command to bring the sidecar up. Phase 10.3 — Cisco IOS quirk fixtures + 3 unit tests in internal/api/handler/cisco_ios_quirks_test.go: - testdata/cisco_ios_15x_pem_csr.txt: PEM body sent with Content-Type application/x-pem-file. Handler dispatches on body-prefix not Content-Type — accepts cleanly. 
- testdata/cisco_ios_16x_trailing_newline_csr.txt: extra trailing newlines after base64 body. strings.TrimSpace tolerates. - testdata/cisco_ios_crlf_b64_csr.txt: CRLF-wrapped base64. base64.StdEncoding handles CRLF + LF identically. Phase 11.1 — ManagedCertificate.Source provenance: - New domain.CertificateSource enum (Unspecified/EST/SCEP/API/Agent). - Migration 000023_managed_certificates_source.up.sql adds source TEXT NOT NULL DEFAULT '' so existing rows scan as CertificateSourceUnspecified — back-compat: bulk-revoke filter treats empty as "any source". - Postgres repo Insert/Update/scan paths all wire the new column. Phase 11.2 — EST bulk-revoke endpoint: - BulkRevocationCriteria.Source field (Source-only requests rejected as too broad — must accompany at least one narrower criterion). - service.bulk_revocation.resolveCertificates post-filter by Source (empty=any, no SQL change so existing CertificateFilter callers unaffected). - New BulkRevocationHandler.BulkRevokeEST method pins Source=EST + dispatches; new route POST /api/v1/est/certificates/bulk-revoke (M-008 admin-gated). openapi.yaml documented + parity-guard green. Phase 11.3 — 13 typed audit action codes in internal/service/est_audit_actions.go: - est_simple_enroll_success / _failed - est_simple_reenroll_success / _failed - est_server_keygen_success / _failed - est_auth_failed_basic / _mtls / _channel_binding - est_rate_limited - est_csr_policy_violation - est_bulk_revoke - est_trust_anchor_reloaded - ESTService.processEnrollment + SimpleServerKeygen + ReloadTrust split-emit BOTH the legacy bare action codes (back-compat for the GUI activity-tab chip filters that match by exact string + existing audit-log analysers) AND the new typed _success / _failed variants (operator grep target + per-failure-mode counter). Tests: - internal/api/handler/bulk_revocation_est_test.go — 5 cases (admin-true happy path pins Source=EST + non-admin 403 + empty-criteria 400 + invalid-reason 400 + method-not-allowed). 
- internal/service/est_audit_actions_test.go — 5 cases (SimpleEnroll legacy+typed emission / SimpleReEnroll typed / IssuerError typed-failed / PolicyViolation triple-emit / unique-string invariant). Pre-commit verification (sandbox): gofmt clean, go vet clean (excluding repository/postgres testcontainers limit), staticcheck clean across api/handler/api/router/domain/service/deploy/test, go test -short -count=1 green for every non-postgres Go package + integration build (`go build -tags integration ./deploy/test/...`) clean. G-3 docs-drift guard reproduced locally clean (Phases 10-11 added zero new env vars). Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases 12-13 (docs/est.md + WiFi/802.1X / IoT bootstrap / FreeRADIUS recipes; release prep + tag) remain — post-2.1.0 work.	2026-04-30 00:52:43 +00:00
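The three Cisco IOS quirks listed under Phase 10.3 of commit 5a682db8e2 (PEM body regardless of Content-Type, trailing newlines, CRLF-wrapped base64) can all be absorbed by a body decoder that dispatches on the body prefix. The sketch below is an illustration, not the handler's code: `decodeCSRBody`, `wrapBase64`, `pemBody`, and the sample payload are hypothetical. It relies on the documented behavior that Go's base64 decoders ignore `\r` and `\n`.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/pem"
	"errors"
	"fmt"
	"strings"
)

// decodeCSRBody returns the raw CSR bytes from an EST request body.
// It looks at the body prefix, not the Content-Type header.
func decodeCSRBody(body []byte) ([]byte, error) {
	trimmed := bytes.TrimSpace(body) // tolerates extra trailing newlines
	if bytes.HasPrefix(trimmed, []byte("-----BEGIN")) {
		block, _ := pem.Decode(body)
		if block == nil {
			return nil, errors.New("invalid PEM body")
		}
		return block.Bytes, nil
	}
	// The decoder ignores \r and \n, so CRLF- and LF-wrapped bodies
	// decode identically.
	return base64.StdEncoding.DecodeString(string(trimmed))
}

// wrapBase64 encodes data and wraps it at width characters using eol.
func wrapBase64(data []byte, width int, eol string) string {
	enc := base64.StdEncoding.EncodeToString(data)
	var sb strings.Builder
	for len(enc) > width {
		sb.WriteString(enc[:width])
		sb.WriteString(eol)
		enc = enc[width:]
	}
	sb.WriteString(enc)
	sb.WriteString(eol)
	return sb.String()
}

// pemBody wraps data in a CERTIFICATE REQUEST PEM block.
func pemBody(data []byte) []byte {
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: data})
}

// sample is arbitrary stand-in bytes, not a real PKCS#10 request.
var sample = []byte("stand-in bytes for a DER-encoded PKCS#10 CSR")

func main() {
	bodies := map[string][]byte{
		"pem":              pemBody(sample),
		"trailing-newline": []byte(wrapBase64(sample, 64, "\n") + "\n\n"),
		"crlf":             []byte(wrapBase64(sample, 16, "\r\n")),
	}
	for name, body := range bodies {
		out, err := decodeCSRBody(body)
		fmt.Println(name, bytes.Equal(out, sample), err)
	}
}
```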
shankar0123	43075a1b5c	EST RFC 7030 hardening master bundle Phases 5-7: end-to-end serverkeygen + profile-driven csrattrs + admin observability with per-status counters + reload-trust endpoint. Phase 5 — RFC 7030 §4.4 server-driven key generation: - internal/pkcs7/envelopeddata_builder.go is the inverse of the existing parser/decryptor: AES-256-CBC content cipher + RSA PKCS#1 v1.5 keyTrans + per-call random IV. Round-trip pinned in test (BuildEnvelopedData → ParseEnvelopedData → Decrypt returns the original plaintext byte-for-byte). - ESTService.SimpleServerKeygen runs the full §4.4 flow: parse client CSR → require RSA pubkey for keyTrans → resolve per-profile algorithm (RSA-2048 default; honors AllowedKeyAlgorithms) → in- memory keygen → re-build CSR with server pubkey → run existing issuer pipeline → marshal PKCS#8 → CMS-EnvelopedData wrap to a synthetic recipient cert wrapping the device's CSR-supplied pubkey → zeroize plaintext + PKCS#8 bytes → return CertPEM + ChainPEM + EncryptedKey. Typed sentinels ErrServerKeygenRequiresKey- Encipherment / ErrServerKeygenUnsupportedAlgorithm / ErrServerKeygenDisabled. - ESTHandler.ServerKeygen + ServerKeygenMTLS emit RFC 7030 §4.4.2 multipart/mixed with random per-response boundary; per-profile SetServerKeygenEnabled gate returns 404 when off (defense in depth even if the route was registered). - New routes POST /.well-known/est/[<PathID>/]serverkeygen + /.well-known/est-mtls/<PathID>/serverkeygen; openapi.yaml + openapi-parity guard updated. Phase 6 — Real csrattrs implementation: - New CertificateProfile.RequiredCSRAttributes []string + migration 000022_certificate_profiles_csrattrs.up.sql. The migration also lands the previously-unwired must_staple column (closes the 5.6 follow-up loop where the field shipped at the domain + service layer but the postgres scan/insert/update never persisted it). 
- domain.EKUStringToOID + AttributeStringToOID lookup tables: id-kp-* EKUs (RFC 5280 §4.2.1.12) + RFC 5280 DN attributes + RFC 2985 PKCS#10 attributes + Microsoft Intune device-serial OID. - ESTService.GetCSRAttrs replaces the v2.0.x nil/204 stub with a profile-derived SEQUENCE OF OID ASN.1 marshal. Unknown EKU / attribute strings dropped + warning-logged so a typo doesn't take down the entire endpoint. Phase 7 — Admin observability + counters + reload-trust: - internal/service/est_counters.go: estCounterTab (sync/atomic; 12 named labels) + ESTStatsSnapshot per-profile shape + ESTService.Stats(now) zero-allocation accessor + ReloadTrust() SIGHUP-equivalent + SetESTAdminMetadata setter. - Counter ticks wired into processEnrollment + SimpleServerKeygen at every success/failure leg. - internal/api/handler/admin_est.go mirrors AdminSCEPIntune verbatim: Profiles + ReloadTrust handlers + AdminESTServiceImpl. Both endpoints admin-gated (M-008 triplet pinned + admin_est.go added to AdminGatedHandlers). - New routes GET /api/v1/admin/est/profiles + POST /api/v1/admin/ est/reload-trust; openapi.yaml documented; openapi-parity guard reproduced clean. - cmd/server/main.go grows estServices map populated by the per- profile EST loop + handed to AdminEST. New MTLSTrust() + HasMTLSTrust() accessors on ESTHandler so main.go can pull the trust holder for the admin-metadata wire-up. - Per-profile counter isolation regression test (internal/service/est_profile_counter_isolation_test.go) proves a future shared-counter refactor would fail at compile-time pointer-identity check. Pre-commit verification (sandbox): gofmt clean, go vet clean (excluding repository/postgres which the sandbox can't build — disk-space testcontainers download), staticcheck clean across cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/ service/pkcs7/domain/cmd/server, go test -short -count=1 green for every non-postgres package. 
G-3 docs-drift guard reproduced locally clean (Phases 5-7 added zero new env vars; Phase 1 already documented per-profile SERVER_KEYGEN_ENABLED). Spec preserved at cowork/est-rfc7030-hardening-prompt.md. Phases 8-13 (GUI ESTAdminPage / CLI+MCP / libest e2e / bulk revocation / docs/est.md / release prep) remain — post-2.1.0 work.	2026-04-29 23:57:45 +00:00
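The Phase 6 csrattrs behaviour described in the commit above (resolve profile strings to OIDs, drop and warn on unknown names, marshal a SEQUENCE OF OID) can be sketched roughly as follows. This is a minimal illustration, not the code in certctl: `attrOIDs` and `marshalCSRAttrs` are hypothetical names standing in for the `domain.EKUStringToOID` / `AttributeStringToOID` tables and `ESTService.GetCSRAttrs`, and the three table entries are only examples.

```go
package main

import (
	"encoding/asn1"
	"fmt"
	"log"
)

// attrOIDs is a hypothetical, deliberately tiny lookup table used only
// for this sketch. The OID values themselves are the standard ones.
var attrOIDs = map[string]asn1.ObjectIdentifier{
	"serverAuth":        {1, 3, 6, 1, 5, 5, 7, 3, 1},   // id-kp-serverAuth
	"clientAuth":        {1, 3, 6, 1, 5, 5, 7, 3, 2},   // id-kp-clientAuth
	"challengePassword": {1, 2, 840, 113549, 1, 9, 7}, // PKCS#9 challengePassword
}

// marshalCSRAttrs resolves each name to an OID, skips unknown names with a
// logged warning, and DER-encodes the survivors as a SEQUENCE OF OID.
func marshalCSRAttrs(names []string) ([]byte, error) {
	oids := make([]asn1.ObjectIdentifier, 0, len(names))
	for _, n := range names {
		oid, ok := attrOIDs[n]
		if !ok {
			// A typo in one attribute should not fail the whole response.
			log.Printf("csrattrs: dropping unknown attribute %q", n)
			continue
		}
		oids = append(oids, oid)
	}
	// encoding/asn1 marshals a Go slice as an ASN.1 SEQUENCE.
	return asn1.Marshal(oids)
}

func main() {
	der, err := marshalCSRAttrs([]string{"serverAuth", "clientAuht", "challengePassword"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%x\n", der)
}
```

In this sketch the misspelled "clientAuht" is skipped, so the encoded sequence carries two OIDs rather than failing the request.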
shankar0123	aa139ee0d9	EST RFC 7030 hardening master bundle Phases 2-4: end-to-end mTLS sibling route + RFC 9266 channel binding + HTTP Basic enrollment-password + per-source-IP failed-auth limit + per-(CN, sourceIP) sliding-window cap. Two new shared packages so EST + Intune share infrastructure: - internal/cms/ — RFC 9266 tls-exporter extractor (ExtractTLSExporter with stdlib-panic recovery for synthetic ConnectionStates) + CSR-side channel-binding parser via raw TBSCertificationRequestInfo walk (the stdlib's csr.Attributes can't represent the OCTET STRING binding value), VerifyChannelBinding composite, EmbedChannel- BindingAttribute fixture helper, typed sentinel errors for missing / mismatch / not-TLS-1.3 mapped to HTTP 400 / 409 / 426 in handler. - internal/trustanchor/ — extracted from scep/intune/trust_anchor.go so the EST mTLS sibling route + Intune dispatcher share the same SIGHUP-reloadable PEM bundle primitive. intune.TrustAnchorHolder is now `= trustanchor.Holder` (type alias) + NewTrustAnchorHolder = trustanchor.New (function alias) — every existing call site compiles unchanged. Intune's LoadTrustAnchor is a thin wrapper over trustanchor.LoadBundle. White-box tests moved to the new package. - internal/ratelimit/ — extracted from scep/intune/rate_limit.go (this was Phase 4.1, in the same bundle). intune.PerDeviceRateLimiter is now a thin wrapper preserving the (subject, issuer)→key composition; EST handler reaches for SlidingWindowLimiter directly. ESTHandler grew six optional fields wired by per-profile setters (SetMTLSTrust / SetChannelBindingRequired / SetEnrollmentPassword / SetSourceIPRateLimiter / SetPerPrincipalRateLimiter / SetLabelForLog) plus four new mTLS-route methods (CACertsMTLS / SimpleEnrollMTLS / SimpleReEnrollMTLS / CSRAttrsMTLS); shared internal pipeline handleEnrollOrReEnroll(reEnroll, viaMTLS) keeps the auth/binding/ rate-limit gates DRY. 
New router method RegisterESTMTLSHandlers registers /.well-known/est-mtls/<PathID>/{cacerts,simpleenroll, simplereenroll,csrattrs}; AuthExemptDispatchPrefixes extends the no-auth chain to /.well-known/est-mtls. cmd/server/main.go's EST loop wires per-profile mTLS holder + channel-binding policy + per-principal limiter + (when EnrollmentPassword non-empty) Basic + source-IP limiter; new preflightESTMTLSClientCATrust- Bundle returns trustanchor.Holder so SIGHUP rotates the EST mTLS bundle live without restart. SCEP + EST mTLS profiles now share a single union mtlsUnionPoolForTLS passed to buildServerTLSConfigWithMTLS (replaces the protocol-specific scepMTLSUnionPoolForTLS); per-handler re-verify enforces "cert must chain to THIS profile's bundle" so cross-protocol bleed is blocked at the application layer even though the TLS layer trusts certs from either pool's union. Phase 3.3 source-IP failed-Basic limiter defaults: 10 attempts / 1h / 50k tracked IPs (no env var; tunable in a follow-up). Phase 4.2 per-principal limiter cap from CERTCTL_EST_PROFILE_<NAME>_RATE_ LIMIT_PER_PRINCIPAL_24H (existing field, Phase 1 shipped). 
New tests: - internal/cms/channelbinding_test.go: extractor + CSR-side parser + composite + TLS-1.3 round-trip end-to-end + EmbedChannelBinding- Attribute round-trip - internal/trustanchor/holder_test.go: parseBundlePEM white-box + LoadBundle + Holder Get/Pool/SetLabelForLog/Reload-happy/ Reload-keeps-old-on-failure/Reload-keeps-old-on-expired/ WatchSIGHUP-reloads-pool/WatchSIGHUP-stop-clean - internal/api/handler/est_hardening_test.go: 16 named cases covering mTLS no-trust-pool 500 + no-cert 401 + cross-profile cert 401 + happy-path 200 + CACertsMTLS auth gate + CSRAttrsMTLS auth gate + channel-binding required-absent-rejected + not-required-absent- allowed + writeChannelBindingError mapping + Basic no-header 401 + Basic wrong-password 401 + Basic correct-200 + Basic-no-password no-gate + per-IP failed-attempt lockout 429 + per-principal blocks-after-cap + different-principals-independent + no-limiter- unbounded. Pre-commit verification (sandbox): gofmt clean, go vet clean (excluding repository/postgres which the sandbox can't build — disk-space testcontainers download), staticcheck clean for cms/trustanchor/api/handler/api/router/scep/intune/ratelimit/ cmd/server, go test -short -count=1 green for cms/trustanchor/ api/handler/api/router/scep/intune/ratelimit/service. G-3 docs-drift guard reproduced locally clean (Phase 1 already documented every new env var; Phases 2-4 added zero new env vars).	2026-04-29 23:15:35 +00:00
shankar0123	530593507b	fix(scep-intune): close 11 audit gaps from 2026-04-29 pre-tag review Closes the eleven gaps identified in the pre-v2.1.0 audit of the SCEP RFC 8894 + Intune master bundle (cowork/scep-bundle-gap-closure-prompt.md). Constitutional rule from cowork/CLAUDE.md::Operating Rules — 'Always take the complete path, not the easy path' — drove this closure: each gap was a load-bearing wire that crossed multiple layers (config → validator → service wire-up → tests → docs) and shipping the bundle without them would have produced lying-field footguns where operator- visible config options stored values without affecting behavior. WHAT LANDS: Phase A — Clock-skew tolerance (master prompt §15 hazard closure) internal/scep/intune/challenge.go: ValidateChallenge migrated from positional args to ValidateOptions{} struct; new ClockSkewTolerance field with default 0 (strict). 24 call sites updated mechanically. Asymmetric application: now+tolerance >= iat AND now-tolerance < exp. internal/config/config.go: SCEPIntuneProfileConfig.ClockSkewTolerance default 60s + Validate() refusal when >= ChallengeValidity. cmd/server/main.go: SetIntuneIntegration signature extended; per-profile env-var loader honors CERTCTL_SCEP_PROFILE_<NAME>_INTUNE_CLOCK_SKEW_TOLERANCE. internal/service/scep.go: intuneClockSkew field + IntuneStatsSnapshot surfaces clock_skew_tolerance_ns. web/src/api/types.ts mirrors. 4 new tests in challenge_test.go covering accept-within-tolerance, reject-beyond-tolerance, accept-expired-within-tolerance, negative-treated-as-zero defensive normalization. docs/scep-intune.md updated with the new env var + time-bounds rule. Phase B — unknown-version-rejected golden test internal/scep/intune/golden_helper_test.go: goldenUnknownVersionPayload helper + signGoldenChallengeAny generic signer. 
challenge_golden_test.go: TestGoldenChallenge_UnknownVersionRejected uses an in-process ECDSA fixture (the on-disk PEM was generated with a Go-stdlib version that produces different ecdsa.GenerateKey bytes from the current call). TestRegenerateGoldenFixtures emits the new unknown_version fixture file too. Phase C — Two named Intune e2e tests internal/api/handler/scep_intune_e2e_test.go: TestSCEPIntuneEnrollment_RateLimited_E2E (cap=2 + 3 attempts; 3rd returns FAILURE+badRequest with rate_limited counter ticked) TestSCEPIntuneEnrollment_TrustAnchorSIGHUPReload_E2E (rotate on-disk PEM + holder.Reload(); old-key challenge fails with badMessageCheck; signature_invalid counter ticked) intuneE2EFixture struct extended with trustHolder + trustPath fields so tests can rotate. Phase D — Four new ChromeOS hermetic tests (10 total now) internal/api/handler/scep_chromeos_test.go: _RAKeyMismatch — PKIMessage encrypted to wrong RA cert; handler rejects without reaching service. _3DESBackwardCompat — RFC 8894 §3.5.2 legacy fallback verified. _RSACSR + _ECDSACSR — explicit matrix-pair pinning. buildTestECDSACSR helper for ECDSA P-256 CSR construction; tripleDESCBCEncrypt mirrors aesCBCEncrypt for 3DES-CBC; assertChromeOSPositiveCertRep shared assertion. Phase E — Per-profile counter isolation test internal/api/handler/scep_profile_counter_isolation_test.go: TestSCEPHandler_PerProfileIntuneCountersIsolated wires two SCEPService instances + drives distinct PKIMessages + asserts counter isolation. Guards against a future cmd/server/main.go refactor that shares a *intuneCounterTab across profiles. buildPerProfileIntuneFixture parameterized helper. Phase F — Server-boot regression tests cmd/server/preflight_scep_intune_test.go: 3 named tests covering disabled-backward-compat, broken-config-with-PathID, expired-cert refusal. preflightSCEPIntuneTrustAnchor signature extended with pathID arg so error messages carry PathID= for operator log-grep. 
Phase G — docs/connectors.md Four new subsections under §EST/SCEP Integration: multi-profile dispatch + mTLS sibling route + Intune Connector dispatcher + SCEP probe in network scanner. Each has a one-paragraph operator explanation + an env-var or endpoint table. Phase H — Coverage uplift internal/service/scep_probe_persist_test.go: 5 unit tests on persistProbeResult (nil-safe + nil-repo-safe + repo-error swallow + nil-logger guard) + ListRecentSCEPProbes (empty-slice-not-nil + repo pass-through) + describeCertAlgorithm (RSA/ECDSA/QF1008-nil-curve defensive branch/Ed25519/DSA/empty). CI gates (service ≥70, handler ≥75) PASS at 70.9% / 79.3%. Phase I — deploy/test integration variant deploy/test/scep_intune_e2e_test.go (//go:build integration): TestSCEPIntuneEnrollment_Integration + _RateLimited_Integration against the live docker-compose certctl container. Skip-when- stack-missing semantics so sandbox + CI both work. deploy/docker-compose.test.yml: new e2eintune SCEP profile env vars + bind-mount of deploy/test/fixtures/. deploy/test/fixtures/README.md: documents the deterministic trust anchor regeneration recipe. VERIFICATION (sandbox): gofmt -d — clean for all changed files staticcheck — clean for intune + handler + config + service + cmd/server packages go vet — clean for the same packages go test -short — green for intune (95.3% cov), service (70.9%), handler (79.3%), config (94.0%), cmd/server (boot path; my preflight tests cover the directly- testable function), pkcs7 (80.5% informational) DEFERRED (per closure prompt §7 out-of-scope): - V3-Pro Conditional Access gating + Microsoft Graph integration - Standalone certctl-scan CLI binary - OCSP rate-limiting, OCSP stapling, delta CRLs Spec preserved at cowork/scep-bundle-gap-closure-prompt.md; journal at cowork/scep-rfc8894-intune/progress.md (audit-closure section appended).	2026-04-29 20:28:53 +00:00
shankar0123	506cff137d	feat(scep): SCEP probe in network scanner for fleet-readiness assessment Phase 11.5 of the SCEP RFC 8894 + Intune master bundle. Adds an operator-facing SCEP probe that issues GetCACaps + GetCACert against an arbitrary SCEP server URL and returns a structured posture snapshot (reachable + advertised caps + RFC 8894 / AES / POST / Renewal / SHA-256 / SHA-512 support flags + CA cert subject + issuer + NotBefore + NotAfter + days-to-expiry + algorithm + chain length). Two operator use cases per the master prompt: 1. Pre-migration assessment — probe an existing EJBCA / NDES SCEP server before switching to certctl to see what capabilities it advertises and what the CA cert looks like. 2. Compliance posture audits — periodic ad-hoc probes against the operator's own SCEP servers to flag drift. Capability-only — does NOT POST a CSR per the spec (would consume slot allocations on the target server + create audit noise). Standalone CLI binary explicitly out of scope (per the master prompt §11.5.6 and the operator's confirmation): the probe code lands inside certctl; a future thin Cobra wrapper is a separate decision. Backend (six new + one extended file): * internal/domain/network_scan.go — new SCEPProbeResult struct with every probe field documented for the GUI's display layer. * migrations/000021_scep_probe_results.up.sql + .down.sql — new scep_probe_results table with TEXT id, target_url, all probe flags, CA cert metadata, probed_at, probe_duration_ms, error. Two indexes: idx_scep_probe_results_probed_at (DESC) for the 'recent probes' GUI query, idx_scep_probe_results_target_url (target_url, probed_at DESC) for the future per-URL history view. * internal/repository/interfaces.go — new SCEPProbeResultRepository interface (Insert + ListRecent). * internal/repository/postgres/scep_probe_results.go — Postgres implementation. 
ListRecent clamps limit to [1, 200]; on read re-derives ca_cert_days_to_expiry against the query-time wall clock so 'X days remaining' stays fresh. * internal/service/scep_probe.go — ProbeSCEP(ctx, url) on NetworkScanService. Validation order: 1. Up-front URL validation via validation.ValidateSafeURL (defaults to validation.ValidateSafeURL but injectable for tests via the new scepValidateURL field on the service). 2. Dial-time SSRF re-check via SafeHTTPDialContext on the http.Transport (defends against DNS rebinding). 3. GET ?operation=GetCACaps + GET ?operation=GetCACert. GetCACert handles three response shapes: PKCS#7 SignedData certs-only envelope (multi-cert), raw DER (single-cert), and PEM-wrapped DER (non-conforming servers). Times out at 30s; uses a 1MB body cap for DoS defense; wraps the result + persists via the repo (nil-safe) before returning. describeCertAlgorithm helper returns 'RSA-N' / 'ECDSA-curve' / 'Ed25519' / 'DSA' for the GUI's algorithm column. * internal/service/network_scan.go — added scepProbeRepo + scepHTTPClient + scepValidateURL + scepIDFn + nowFn fields; SetSCEPProbeRepo wires the repo at startup. * internal/api/handler/network_scan.go — extended NetworkScanService interface with ProbeSCEP + ListRecentSCEPProbes; added two new HTTP handlers: POST /api/v1/network-scan/scep-probe (body {url}) GET /api/v1/network-scan/scep-probes (recent history) Synchronous probe; HTTP 200 with the result body for both success and reachable-but-failed cases (so the GUI can render the failure tone with the operator-actionable error message). * internal/api/router/router.go — registered the two routes inline after the existing network-scan target endpoints. * api/openapi.yaml — documented both endpoints (operationId probeSCEP + listSCEPProbes) with full schema + response codes. * cmd/server/main.go — wires the new SCEPProbeResultRepository onto the network scan service via SetSCEPProbeRepo right after the existing NewNetworkScanService construction. 
Backend tests (6 new — exit-criteria-named per the master prompt): * TestProbeSCEP_AdvertisesAllCaps — happy path, full RFC 8894 capability set, ECDSA P-256 CA cert, 365-day expiry. * TestProbeSCEP_MissingSCEPStandard — pre-RFC-8894 server (only POSTPKIOperation + SHA-1 + DES3); SupportsRFC8894 = false. * TestProbeSCEP_GetCACertExpired — CA cert NotAfter 30d in the past; CACertExpired = true. * TestProbeSCEP_Unreachable — connect to TCP port 1; probe returns Reachable=false + non-empty Error. * TestProbeSCEP_RejectsReservedIP — http://169.254.169.254/scep (EC2 metadata literal) rejected by the up-front validation.ValidateSafeURL gate; result captures the error without ever issuing the HTTP call. * TestProbeSCEP_PEMWrappedCert — server returns PEM instead of raw DER for GetCACert; the fallback parse path handles it. Frontend (one extended file + types/client): * web/src/api/types.ts — SCEPProbeResult + SCEPProbesResponse. * web/src/api/client.ts — probeSCEPServer + listSCEPProbes helpers. * web/src/pages/NetworkScanPage.tsx — new SCEPProbeSection component + ProbeResultPanel (with capability badges + CA cert details panel + raw caps line) + SCEPProbeHistoryTable. Form rejects empty URL with inline error before calling the API. Reload mutation goes through useTrackedMutation with explicit invalidates: [['scep-probes']] (M-009 contract). Frontend tests (5 new + 0 regressions): * Scep probe section header + form renders. * Empty URL is rejected with inline error and never calls the probe endpoint. * Successful probe renders capability badges + CA cert subject + days-remaining inline panel. * Probe-level errors are surfaced in the inline panel (no result panel rendered). * Recent-probes history table renders one row per probe. * (Existing 2 NetworkScanPage XSS-hardening tests stub the new listSCEPProbes endpoint to an empty list so they still pass.) Verification: * gofmt clean on touched files * go vet ./... 
clean * staticcheck on service+handler+router+repository+cmd-server clean * go test -short across service+handler+router+repository+cmd-server + integration: all green (existing + 6 new probe tests pass) * Frontend tsc --noEmit clean * Vitest: 7/7 NetworkScanPage tests pass (2 existing XSS + 5 new probe section) * G-3 docs-drift CI guard reproduced locally clean (no new env vars) * M-009 hard-zero useMutation guard clean (probe mutation goes through useTrackedMutation) * openapi-parity guard satisfied (both new routes documented) * The mockNetworkScanService in handler + integration packages extended with stub Probe methods; targeted coverage stays in scep_probe_test.go. Out of scope (per master prompt §11.5.6 + operator confirmation): * Standalone certctl-scan CLI binary — separate decision, ~1d of follow-up work when/if shipped. Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 11.5 cowork/scep-rfc8894-intune/progress.md	2026-04-29 18:51:57 +00:00
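The probe in the commit above turns a GetCACaps response into support flags. A minimal sketch of that step, assuming the usual newline-separated keyword body defined by RFC 8894, might look like this. The `capFlags` struct and `parseCACaps` function are hypothetical names, not the fields of `SCEPProbeResult`; treating the `SCEPStandard` keyword as the RFC 8894 indicator is the sketch's assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// capFlags is a hypothetical subset of the posture flags the probe reports.
type capFlags struct {
	RFC8894 bool // server advertises "SCEPStandard"
	AES     bool
	POST    bool // "POSTPKIOperation"
	Renewal bool
	SHA256  bool
	SHA512  bool
}

// parseCACaps reads one capability keyword per line. Unknown keywords
// (for example "DES3" or "SHA-1") are ignored by this sketch.
func parseCACaps(body string) capFlags {
	var f capFlags
	for _, line := range strings.Split(body, "\n") {
		switch strings.TrimSpace(line) { // TrimSpace also strips a trailing \r
		case "SCEPStandard":
			f.RFC8894 = true
		case "AES":
			f.AES = true
		case "POSTPKIOperation":
			f.POST = true
		case "Renewal":
			f.Renewal = true
		case "SHA-256":
			f.SHA256 = true
		case "SHA-512":
			f.SHA512 = true
		}
	}
	return f
}

func main() {
	legacy := parseCACaps("POSTPKIOperation\r\nSHA-1\r\nDES3\r\n")
	fmt.Printf("%+v\n", legacy)
}
```

The legacy input mirrors the TestProbeSCEP_MissingSCEPStandard scenario: POST support is detected, while the RFC 8894 flag stays false.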
shankar0123	0be889ff1d	refactor(scep-gui): rebrand SCEP admin surface to per-profile tabbed interface (Profiles + Intune + Recent Activity) Phase 9 follow-up to the SCEP RFC 8894 + Intune master bundle. The Phase 9.4 GUI shipped 'SCEP Intune Monitoring' at /scep/intune, which made the per-profile observability surface look Intune-only — operators running EJBCA + Jamf would never click that nav link expecting per- profile RA cert + mTLS observability. The page is per-profile keyed under the hood; this commit rebrands + restructures so the surface matches what operators actually need. Spec: cowork/scep-gui-restructure-prompt.md. User-visible change: - Nav link renamed: 'SCEP Intune' → 'SCEP Admin'. - Route: /scep is the new canonical path; /scep/intune kept as a backward-compat alias that lands directly on the Intune tab. - Page header: 'SCEP Administration'. - Three tabs: * Profiles (default) — per-profile lean cards with RA cert expiry countdown, mTLS sibling-route status badge, Intune enabled/disabled badge, challenge-password-set indicator. 'View Intune details →' link on Intune-enabled cards deep-links into the Intune tab. * Intune Monitoring — the existing Phase 9.4 deep-dive (per-status counters, trust anchor expiry, recent failures table, reload-trust button + confirmation modal). * Recent Activity — full SCEP audit log filter merging all four action codes (scep_pkcsreq + scep_renewalreq + scep_pkcsreq_intune + scep_renewalreq_intune); chip filters for All / Initial / Renewal / Intune / Static. Backend: * internal/service/scep.go — new SCEPProfileStatsSnapshot type + IntuneSection sub-block + ProfileStats(now) accessor. Adds raCertSubject/raCertNotBefore/raCertNotAfter + mtlsEnabled + mtlsTrustBundlePath fields with SetRACert + SetMTLSConfig setters. 
Existing IntuneStatsSnapshot + IntuneStats(now) preserved UNCHANGED for /admin/scep/intune/stats backward compat (the JSON shape stays byte-stable for external consumers — the aliasing approach the prompt initially suggested doesn't work because the new shape nests Intune while the old one is flat). ChallengePasswordSet is derived from challengePassword != '' (the secret value itself is never surfaced). * internal/api/handler/admin_scep_intune.go — new Profiles handler method on AdminSCEPIntuneHandler with the same M-008 admin gate. AdminSCEPIntuneServiceImpl extended (in place; same map[string]service.SCEPService) to satisfy the new AdminSCEPProfileService interface. Single handler file gets the third method so the M-008 pin entry count stays steady (no new file, no new triplet of admin-gate test files — just three new Profiles tests inside the existing test file). internal/api/router/router.go — one new route 'GET /api/v1/admin/scep/profiles' registered to reg.AdminSCEPIntune.Profiles. HandlerRegistry unchanged. * api/openapi.yaml — new operation 'listSCEPProfiles' documenting the request body / response shape / error mapping. Existing Intune entries unchanged. * cmd/server/main.go — per-profile loop now calls scepService.SetMTLSConfig(profile.MTLSEnabled, profile.MTLSClientCATrustBundlePath) right after SetPathID, and scepService.SetRACert(raCert) right after loadSCEPRAPair returns the leaf cert. Both setters are nil-safe. * internal/api/handler/m008_admin_gate_test.go — extended the existing admin_scep_intune.go entry's justification to mention the third endpoint. No new map entry needed (file already listed). Backend tests (8 new): * TestAdminSCEPProfiles_NonAdmin_Returns403 * TestAdminSCEPProfiles_AdminExplicitFalse_Returns403 * TestAdminSCEPProfiles_AdminPermitted_ForwardsActor — also pins that Intune-enabled profiles emit an 'intune' sub-block while Intune-disabled profiles OMIT it. 
* TestAdminSCEPProfiles_RejectsNonGetMethod * TestAdminSCEPProfiles_PropagatesServiceError * TestAdminSCEPProfilesServiceImpl_NilMapReturnsEmpty * (existing 16 Phase 9 admin tests still pass — backward-compat preserved) Frontend: * web/src/api/types.ts — new SCEPProfileStatsSnapshot + IntuneSection + SCEPProfilesResponse types. Existing IntuneStatsSnapshot et al unchanged. * web/src/api/client.ts — new getAdminSCEPProfiles helper. * web/src/pages/SCEPAdminPage.tsx — full rewrite as the tabbed surface. Reuses the existing ConfirmReloadModal and Intune deep-dive card components verbatim; adds ProfileSummaryCard (lean card for the Profiles tab) and ActivityTab. URL state sync via useSearchParams so deep links survive reloads + browser back/forward. The legacy /scep/intune route alias defaults the activeTab to 'intune' on mount. * web/src/main.tsx — new <Route path='scep' /> + preserved <Route path='scep/intune' /> alias. Both render SCEPAdminPage. * web/src/components/Layout.tsx — nav link rebranded: label 'SCEP Intune' → 'SCEP Admin', to '/scep/intune' → '/scep'. 
Frontend tests (20 — full rebuild): * Admin gate (non-admin sees gated banner + zero admin API calls) * Profiles tab default + Intune tab tabswitch + ?tab=intune deep link + legacy /scep/intune alias all land on Intune * Profiles tab status badges (Intune + mTLS + challenge-set) reflect each profile's flags * RA cert expiry tone bands (good ≥30d / warn 7-30d / bad <7d / EXPIRED) verified across three fixture profiles * 'View Intune details →' only renders for Intune-enabled profiles AND switches tabs on click * Empty-state banner when no profiles configured * Intune tab counters render with the existing Phase 9 deep-dive shape; reload modal Open/Confirm/Cancel/Error paths all pinned * Recent Activity tab merges all four SCEP audit actions across four parallel useQuery calls; filter chips (all/initial/renewal/intune/static) narrow correctly * Error path surfaces ErrorState on the active tab Docs: * docs/scep-intune.md — Operational monitoring section heading expanded to '(SCEP Administration → Intune Monitoring tab)'. Page-surface description rewritten for the tabbed shape; admin-endpoints list extended with the new /admin/scep/profiles entry. * docs/architecture.md — Microsoft Intune Connector trust anchor subsection updated to reference the Intune Monitoring tab inside the SCEP Administration page + lists all three admin endpoints. * docs/legacy-est-scep.md — forward-ref expanded with a parallel sentence for the per-profile observability surface (independent of Intune). * README.md — Enrollment Protocols bullet for Intune updated to 'admin GUI SCEP Administration page at /scep' with the three tabs called out. Verification: * gofmt clean on touched files * go vet ./... 
clean * staticcheck on intune+service+handler+router+cmd-server clean * go test -short across intune+service+handler+router+cmd-server: all green (existing Phase 9 tests + new Profiles tests) * Frontend tsc --noEmit clean * Vitest: 20/20 SCEPAdminPage tests + 3/3 sibling AuditPage tests pass * G-3 docs-drift CI guard reproduced locally: clean (no new env vars; existing CERTCTL_SCEP_ allowlist prefix covers everything) * M-009 hard-zero useMutation guard reproduced locally: clean (the existing reload mutation already used useTrackedMutation from the Phase 9 follow-up commit `28e277a`) * openapi-parity test green (new GET /api/v1/admin/scep/profiles operation documented) * M-008 admin-gate scanner green (existing admin_scep_intune.go entry covers all three handler methods; the test scanner enforces the triplet by file, not by endpoint, and the new Profiles triplet was added to the existing test file) Backward compat preserved: * /api/v1/admin/scep/intune/stats unchanged — same JSON shape, same error codes, same M-008 gate * /api/v1/admin/scep/intune/reload-trust unchanged * /scep/intune route still works (alias to /scep with activeTab=intune) * IntuneStatsSnapshot Go type unchanged * IntuneStats(now) accessor unchanged Refs: cowork/scep-gui-restructure-prompt.md cowork/scep-rfc8894-intune-master-prompt.md::Phase 9 Phase 11.5 (SCEP probe in scanner — opt-in) and Phase 12 (release prep + tag) of the master bundle resume after this.	2026-04-29 17:46:42 +00:00
shankar0123	e0d00717c7	feat(scep-intune): golden-file tests + e2e harness against fixture trust anchor Phase 10 of the SCEP RFC 8894 + Intune master bundle. Adds reproducible testdata fixtures + a hermetic end-to-end test that exercises the full handler → service → dispatcher → CertRep wire path. Phase 10.1 — Golden-file tests (internal/scep/intune/): * testdata/intune_trust_anchor.pem — deterministic ECDSA P-256 cert seeded from a constant byte string (sha256-derived PRNG); regenerates byte-identical PEM bytes across runs. * testdata/intune_challenge_golden_success.txt — valid challenge, iat/exp window covers goldenChallengeNow. * testdata/intune_challenge_golden_expired.txt — same trust anchor + payload shape but iat/exp shifted into the past. * testdata/intune_challenge_golden_tampered_sig.txt — payload bytes intact, last sig byte flipped. challenge_golden_test.go reads each fixture and asserts: - Success → ValidateChallenge returns a populated claim (DeviceName / Subject / SANDNS pinned to the documented values). - Expired → errors.Is(err, ErrChallengeExpired). - Tampered → errors.Is(err, ErrChallengeSignature). - Plus two defensive permutations: WrongAudienceReuse pins the audience-check ordering after a successful sig verify; RotatedTrustAnchorRejects pins the holder-rotation failure mode using a freshly-generated unrelated trust cert. golden_helper_test.go contains the deterministic-PRNG, ES256 signer, fixture-load helpers, and the regeneration target. Operators flip fixtures via: go test -run='^TestRegenerateGoldenFixtures$' ./internal/scep/intune/... -args -update-golden Why ECDSA + a deterministic seed: a hand-pasted base64 blob would break on every Go stdlib bump (json.Marshal field ordering, ASN.1 encoding edge cases). 
Generating from a pinned seed gives reproducible PEM bytes; only the ECDSA signature suffix varies across regenerations (Go's stdlib doesn't expose RFC 6979 deterministic-k cleanly), and ValidateChallenge re-verifies the signature on every read so it doesn't matter. intune package coverage: 95.2% (was 94.8%). Phase 10.2 — Hermetic end-to-end test (internal/api/handler/scep_intune_e2e_test.go): Departs from the spec's deploy/test/ location because the handler package already has the chromeOS-shape PKIMessage builders (buildTestCSR / buildEnvelopedDataForTest / buildSignedDataForTest / aesCBCEncrypt / postPKIOperation). Putting the e2e test in the handler package lets it reuse those helpers AND run in the default 'go test ./...' sweep — every CI run exercises the full Intune dispatcher chain. The deploy/test/ location is reserved for a future docker-compose-driven variant that would mount a fixture trust anchor into the running container; this hermetic version proves the wire works without that dependency. intuneE2EFixture stands up: - A real Intune Connector signing keypair (ECDSA P-256) + cert written to a temp PEM file the TrustAnchorHolder loads at startup. - A real RA pair the SCEPHandler decrypts EnvelopedData with. - A fixture issuer connector (intuneE2EIssuerConnector) that records every IssueCertificate call + returns a deterministic child cert chained to a fixture CA. Implements the full IssuerConnector interface (IssueCertificate / RenewCertificate / RevokeCertificate / GenerateCRL / SignOCSPResponse / GetRenewalInfo) with the non-issuance methods stubbed. - A capturing AuditRepository that records every Create call so the test can assert action='scep_pkcsreq_intune' was emitted. - A real SCEPService with SetIntuneIntegration wired to a real ReplayCache + PerDeviceRateLimiter. Three test scenarios: 1. TestSCEPIntuneEnrollment_E2E — the documented happy path. 
Forge a valid Intune-shaped challenge (ES256 signed, length > 200, two dots — satisfies looksIntuneShaped), build a CSR with CN matching the claim's device_name, POST through HandleSCEP, decode the CertRep, assert pkiStatus=SUCCESS + issuer.issued has one entry + audit log carries 'scep_pkcsreq_intune' + IntuneStats.counters[ 'success']==1. 2. TestSCEPIntuneEnrollment_ClaimMismatchRejected_E2E — same setup but CSR CN is 'attacker-host.example.com'. Dispatcher must reject with CertRep FAILURE+BadRequest (mapIntuneErrorToFailInfo: ErrClaimCNMismatch → BadRequest), no issuance, IntuneStats counters['claim_mismatch']==1. 3. TestSCEPIntuneEnrollment_TamperedSignature_E2E — flip a byte in the JWT signature segment of the Intune challenge before wrapping it in the PKIMessage. Dispatcher rejects with FAILURE+BadMessageCheck (signature errors → BadMessageCheck per the same mapping table). Important sanity learning during construction: the buildTestCSR helper from scep_chromeos_test.go does NOT populate DNSNames on the CSR. The success claim therefore omits san_dns to avoid tripping ErrClaimSANDNSMismatch (claim says ['x'], CSR has nothing). The claim_mismatch sibling test exercises the SAN-dimension via the CN mismatch path; coverage of explicit SANDNS mismatches stays in the unit tests in claim_test.go where the helper builds CSRs with full SANs. Verification: * gofmt clean on touched files * go vet ./internal/scep/intune/... ./internal/api/handler/...: clean * staticcheck: clean * go test -count=1 -cover ./internal/scep/intune/...: 95.2% * 5 golden tests + 3 e2e tests all pass * No new env vars (G-3 docs guard not triggered) * No new HTTP routes (openapi-parity guard not triggered) * Sibling test packages (service + router) still green Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 10 cowork/scep-rfc8894-intune/progress.md	2026-04-29 16:55:52 +00:00
shankar0123	77e0281a0e	feat(scep-intune): GUI monitoring tab + admin endpoints Phase 9 of the SCEP RFC 8894 + Intune master bundle. Lands the operator-facing Intune Monitoring tab plus the two admin-gated endpoints it reads from. Per the constitutional 'complete path' rule: counters tick on every typed dispatcher branch, the GUI poll is live (30s for stats, 60s for the audit log filter), and the SIGHUP-equivalent reload action is one click + a confirmation modal — no follow-up plumbing required. Backend (Phase 9.1 + 9.2 + 9.3): * internal/service/scep.go gains: - intuneCounterTab — atomic per-status counters keyed by the same labels intuneFailReason() emits (success / signature_invalid / expired / not_yet_valid / wrong_audience / replay / rate_limited / claim_mismatch / compliance_failed / malformed / unknown_version). Lock-free on the dispatcher hot path; snapshot() returns a zero-allocation map for the admin endpoint. - dispatchIntuneChallenge wires intuneCounters.inc(...) on every typed return path INCLUDING the success leg (credited before processEnrollment so a downstream issuer-connector failure doesn't double-count). - SetPathID + PathID accessors (so admin rows surface the SCEP profile path ID per row). - IntuneStatsSnapshot + IntuneTrustAnchorInfo public types, plus IntuneStats(now) accessor that walks the trust holder pool and packages a per-profile snapshot. ReloadIntuneTrust() is the typed wrapper around TrustAnchorHolder.Reload that returns ErrSCEPProfileIntuneDisabled when called on a profile where Intune isn't enabled (admin endpoint maps that to HTTP 409). * internal/api/handler/admin_scep_intune.go: - AdminSCEPIntuneService narrow interface (Stats + ReloadTrust) so the handler depends on a small surface; AdminSCEPIntuneServiceImpl is the production walker over the per-profile SCEPService map. 
- AdminSCEPIntuneHandler.Stats handles GET /api/v1/admin/scep/intune/stats with the M-008 admin gate (non-admin → 403 + service never invoked); returns {profiles, profile_count, generated_at}. - AdminSCEPIntuneHandler.ReloadTrust handles POST /api/v1/admin/scep/intune/reload-trust. Body is {path_id: '<id>'}; empty body targets the legacy /scep root profile. Returns 200 on success / 404 on unknown PathID / 409 when the profile is Intune-disabled / 500 on a parse error from intune.LoadTrustAnchor (the holder retains its previous pool — fail-safe). 400 on malformed JSON. - ErrAdminSCEPProfileNotFound typed error so the handler can distinguish 'wrong profile' from 'broken file'. * internal/api/router/router.go: HandlerRegistry gains AdminSCEPIntune; both routes registered as bearer-auth-required (the admin-gate is at the handler layer per the M-008 pattern). * cmd/server/main.go: declares scepServices map[string]service.SCEPService BEFORE HandlerRegistry construction so the same map can be referenced from both the admin handler (constructed early) and the SCEP startup loop (which populates it later by reference). The per-profile loop now calls scepService.SetPathID(profile.PathID) and stores the service pointer into the shared map. AdminSCEPIntune handler is constructed at the same time as AdminCRLCache. internal/api/handler/m008_admin_gate_test.go: AdminGatedHandlers map gains 'admin_scep_intune.go' with a one-line justification — the regression scanner enforces the per-handler test triplet (TestAdminSCEPIntune_NonAdmin_Returns403 + _AdminExplicitFalse_Returns403 + _AdminPermitted_ForwardsActor) plus their POST siblings for ReloadTrust. * api/openapi.yaml: documents both endpoints with request body / response shape / error mapping; openapi-parity-test now matches the registered routes. Frontend (Phase 9.4): * web/src/pages/SCEPAdminPage.tsx — single-page Intune Monitoring surface: - Per-profile cards (one card per SCEP profile). 
Enabled profiles get the full counter grid + trust-anchor-expiry badge tone (good ≥30d / warn 7-30d / bad <7d / EXPIRED). Disabled profiles get an off-state pill with the env-var hint to opt in. - Counters polled every 30s via TanStack Query against GET /admin/scep/intune/stats. - Recent failures table (last 50) populated from the audit log filtered to action=scep_pkcsreq_intune AND scep_renewalreq_intune; merged + sorted by timestamp descending. Polled every 60s. - Reload trust anchor button per profile + confirmation modal that explains the SIGHUP equivalence and the fail-safe behavior. onConfirm runs a TanStack mutation, refetches the stats query on success, surfaces the underlying error (eg 'trust anchor cert expired') in the modal on failure (modal stays open so operator can retry). - Admin gate: when authRequired && !admin the page renders an 'Admin access required' banner and the underlying admin API requests are never issued (React Query enabled flag gated on auth.admin) — server-side enforcement is M-008. * web/src/api/types.ts: IntuneStatsSnapshot + IntuneTrustAnchorInfo + IntuneStatsResponse + IntuneReloadTrustResponse. * web/src/api/client.ts: getAdminSCEPIntuneStats + reloadAdminSCEPIntuneTrust(pathID). * web/src/main.tsx: new route /scep/intune. The route is unconditional; the gating is at the page level so deep-links land cleanly. * web/src/components/Layout.tsx: 'SCEP Intune' nav link between Observability and Audit Trail with the appropriate sidebar icon. Tests (Phase 9.5): * internal/api/handler/admin_scep_intune_test.go (16 tests): - M-008 admin-gate triplet for both Stats (GET) and ReloadTrust (POST): NonAdmin / AdminExplicitFalse / AdminPermitted. - Method-gate tests (Stats rejects POST, ReloadTrust rejects GET). - Stats propagates service errors as 500. - ReloadTrust maps ErrAdminSCEPProfileNotFound→404, ErrSCEPProfileIntuneDisabled→409, generic err→500. - Empty body targets legacy root PathID. - Malformed JSON→400. 
- AdminSCEPIntuneServiceImpl handles nil map + unknown PathID. * web/src/pages/SCEPAdminPage.test.tsx (13 tests): - Admin gate (non-admin sees gated banner + zero admin API calls; admin sees the page; no-auth dev mode also passes). - Profile rendering (counters with correct labels, expiry badge tone for ≥30d / EXPIRED states, off-state pill for disabled profiles, empty-state banner when no profiles configured). - Reload modal (opens on click, calls mutation on Confirm, keeps modal open + shows error on failure, Cancel skips mutation). - Error path renders ErrorState with retry. - Audit log filter merges PKCSReq + RenewalReq events and sorts descending. Verification: * gofmt clean on touched files * go vet ./... clean * staticcheck on intune/service/api/cmd-server clean * go test -short across api+service+intune+cmd-server: all green * web tsc --noEmit clean * Vitest: SCEPAdminPage.test.tsx 13/13 + sibling page suites all pass * G-3 docs-drift CI guard: Phase 9 adds no new CERTCTL_* env vars so the guard does not fire * openapi-parity-test green (both new admin endpoints documented) * M-008 regression scanner enforces the per-handler test triplet — pin updated, all triplets present Refs: cowork/scep-rfc8894-intune-master-prompt.md::Phase 9 cowork/scep-rfc8894-intune/progress.md	2026-04-29 16:14:07 +00:00
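The per-status counter table described in this commit (atomic counters keyed by a fixed label set, lock-free on the hot path) might look roughly like the sketch below. The type and function names are hypothetical, not certctl's. Unlike the snapshot() the commit describes as zero-allocation, this sketch allocates a fresh map on every snapshot for simplicity.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Labels copied from the commit message.
var intuneLabels = []string{
	"success", "signature_invalid", "expired", "not_yet_valid",
	"wrong_audience", "replay", "rate_limited", "claim_mismatch",
	"compliance_failed", "malformed", "unknown_version",
}

// counterTab (hypothetical name) holds one atomic counter per label. The map
// is populated once at construction and never mutated afterwards, so
// concurrent lookups need no lock; only the counters themselves change.
type counterTab struct {
	counters map[string]*atomic.Uint64
}

func newCounterTab(labels []string) *counterTab {
	m := make(map[string]*atomic.Uint64, len(labels))
	for _, l := range labels {
		m[l] = new(atomic.Uint64)
	}
	return &counterTab{counters: m}
}

// inc bumps a known label; unknown labels are ignored in this sketch.
func (t *counterTab) inc(label string) {
	if c, ok := t.counters[label]; ok {
		c.Add(1)
	}
}

// snapshot copies the current values into a plain map for serialisation.
func (t *counterTab) snapshot() map[string]uint64 {
	out := make(map[string]uint64, len(t.counters))
	for l, c := range t.counters {
		out[l] = c.Load()
	}
	return out
}

// runDemo hammers the table from 8 goroutines and returns the final snapshot.
func runDemo() map[string]uint64 {
	tab := newCounterTab(intuneLabels)
	var wg sync.WaitGroup
	for g := 0; g < 8; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				tab.inc("success")
			}
			tab.inc("replay")
		}()
	}
	wg.Wait()
	return tab.snapshot()
}

func main() {
	snap := runDemo()
	fmt.Println(snap["success"], snap["replay"]) // 8000 8
}
```

Fixing the label set up front is what makes the lock-free read safe: a map that is only read after construction can be shared across goroutines without a mutex.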
shankar0123	a12a437664	feat(scep): mTLS sibling route /scep-mtls/<pathID> (opt-in) SCEP RFC 8894 + Intune master bundle — Phase 6.5 of 14 (opt-in, enterprise-procurement-checkbox). Closes the procurement-team objection that 'shared password authentication' is a checkbox-fail regardless of how strong the password is. The clean answer: a sibling route that adds client-cert auth at the handler layer AND keeps the challenge password (defense in depth, not replacement). Devices present a bootstrap cert from a trusted CA (e.g. a manufacturing-time cert), then SCEP-enroll for their long-lived cert. Same model Apple's MDM and Cisco's BRSKI use. internal/config/config.go * SCEPProfileConfig gains MTLSEnabled bool + MTLSClientCATrustBundlePath string. Indexed env-var loader reads CERTCTL_SCEP_PROFILE_<NAME>_MTLS_ENABLED + CERTCTL_SCEP_PROFILE_<NAME>_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH. * Validate() refuses MTLSEnabled=true with empty bundle path — structural defense in depth ahead of the file-content preflight. cmd/server/main.go * preflightSCEPMTLSTrustBundle: file existence + PEM parse + ≥1 CERTIFICATE block + non-expired check. Returns the parsed x509.CertPool ready to inject into the per-profile SCEPHandler. Failures os.Exit(1) with the offending PathID in the structured log. SCEP startup loop walks each profile; when MTLSEnabled, runs preflight, builds the per-profile pool, contributes the bundle's certs to the union pool that backs the TLS-layer VerifyClientCertIfGiven, clones the SCEPHandler with SetMTLSTrustPool, and registers the parallel sibling route via apiRouter.RegisterSCEPMTLSHandlers. * Union pool published to outer scope as scepMTLSUnionPoolForTLS; passed to buildServerTLSConfigWithMTLS so the listener serves both /scep[/<pathID>] (no client cert) and /scep-mtls/<pathID> (cert required at handler layer) on the same socket. 
* Final-handler dispatch gains /scep-mtls + /scep-mtls/* prefix routing through the no-auth chain (auth boundary is the client cert + challenge password, NOT a Bearer token). cmd/server/tls.go * New buildServerTLSConfigWithMTLS that wraps buildServerTLSConfig + sets ClientCAs + ClientAuth=VerifyClientCertIfGiven when a non-nil pool is passed. nil pool = identical TLS shape to the pre-Phase-6.5 builder (no behavior change for deploys without mTLS profiles). * Critical: VerifyClientCertIfGiven (NOT RequireAndVerifyClientCert) so a client that doesn't present a cert can still hit the standard /scep route. The per-profile gate at the handler layer enforces 'cert required' on /scep-mtls/<pathID>. internal/api/handler/scep.go * SCEPHandler gains mtlsTrustPool x509.CertPool field + SetMTLSTrustPool method. Per-profile pool injected by cmd/server/main.go after preflight. HandleSCEPMTLS wrapper: gates on r.TLS.PeerCertificates non-empty + per-profile cert.Verify against THIS profile's pool. Returns HTTP 401 for missing/untrusted cert (mTLS failure is auth, not authorization). Returns HTTP 500 if mtlsTrustPool is nil (deploy bug — the route shouldn't have been registered). On success delegates to HandleSCEP — defense in depth: mTLS is additive, NOT replacement; the standard SCEP code path including the challenge-password gate still executes. * Per-profile re-verification via cert.Verify(...) is critical: the TLS layer verified against the UNION pool, so a cert that chains to profile A's bundle would pass TLS even when targeting profile B. The handler-layer gate prevents cross-profile bleed-through. internal/api/router/router.go * AuthExemptDispatchPrefixes gains '/scep-mtls' (auth boundary is client cert + challenge password, NOT Bearer token). * RegisterSCEPMTLSHandlers parallel to RegisterSCEPHandlers: empty PathID maps to /scep-mtls root; non-empty maps to /scep-mtls/<pathID>. Each handler in the map MUST have had SetMTLSTrustPool called. 
internal/api/router/openapi_parity_test.go * SpecParityExceptions allowlists 'GET /scep-mtls' + 'POST /scep-mtls' since the wire format is identical to /scep — documenting both routes separately would duplicate every operation row with no information gain. Documented alternative in docs/legacy-est-scep.md. internal/api/handler/scep_mtls_test.go (new, ~210 LoC) * 6 tests + 2 helpers covering the auth contract: 1. RejectsMissingClientCert — request with r.TLS=nil → 401 2. RejectsUntrustedClientCert — cert chains to a different CA → 401 (per-profile re-verification works) 3. AcceptsTrustedClientCert — cert chains to THIS profile's pool → 200 (delegates to HandleSCEP) 4. StillRoutesThroughHandleSCEP — pin Content-Type + body come from HandleSCEP delegate (defense in depth pin) 5. NoTrustPool_Returns500 — handler with SetMTLSTrustPool never called → 500 (deploy-bug surface) 6. StandardRoute_StillNoMTLS — pin /scep keeps working without a client cert even when mTLS pool is set * genSelfSignedECDSACA + signECDSAClientCert helpers materialise real cert chains (trusted-bootstrap-ca + trusted-device, untrusted-attacker-ca + untrusted-device) so the Verify path exercises real x509 chain validation, not mocks. docs/features.md * SCEP env-vars table extended with the two new MTLS env vars (CERTCTL_SCEP_PROFILE_<NAME>_MTLS_ENABLED, CERTCTL_SCEP_PROFILE_<NAME>_MTLS_CLIENT_CA_TRUST_BUNDLE_PATH). Closes the G-3 'env var defined in Go but never documented' gate. docs/legacy-est-scep.md * New 'mTLS sibling route (Phase 6.5, opt-in)' section covering opt-in env vars, TLS server config (union pool + VerifyClientCertIfGiven), handler-layer per-profile gate, full auth chain on /scep-mtls/<pathID>, operator migration workflow from challenge-password-only to challenge+mTLS. cowork/CLAUDE.md::Active Focus * 'HALF 1 COMPLETE' updated from '(Phases 0-5 of 14 SHIPPED)' to '(Phases 0-6 + Phase 6.5 of 14 SHIPPED)'. 
Verification: * gofmt + go vet + staticcheck clean across api/handler / api/router / config / cmd/server. * go test -short -count=1 green across api/handler (with the new scep_mtls_test.go) / api/router / service / config / pkcs7 / cmd/server / connector/issuer/local. * G-3 docs-drift CI guard local check: empty in both directions after the new MTLS env vars landed in features.md. * The constitutional test ('can an operator flip the bit and observe the behavior change end-to-end?') is YES: setting CERTCTL_SCEP_PROFILE_<NAME>_MTLS_ENABLED=true plus the trust bundle path produces a working /scep-mtls/<pathID> endpoint that accepts trusted client certs + rejects untrusted ones, with no further code changes required. Phase 6.5 of 14 in SCEP RFC 8894 + Intune master bundle. Half 1 (Phases 0-6 + 6.5) is now FEATURE-COMPLETE for the ChromeOS / general-MDM use case. Half 2 (Phases 7-12) adds the Microsoft Intune dynamic-challenge layer.	2026-04-29 13:58:18 +00:00
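The handler-layer gate this commit describes (TLS verifies against the union pool, the handler re-verifies against the profile's own pool) can be sketched as below. gateStatus, issue, requestWith and demoStatuses are hypothetical names for illustration. The status codes follow the commit message: 401 for a missing or untrusted certificate, 500 when no pool was injected.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net/http"
	"time"
)

// gateStatus returns the HTTP status a per-profile mTLS gate would produce.
// 200 here means "delegate to the standard SCEP handler".
func gateStatus(r *http.Request, profilePool *x509.CertPool) int {
	if profilePool == nil {
		return http.StatusInternalServerError // route registered without a pool
	}
	if r.TLS == nil || len(r.TLS.PeerCertificates) == 0 {
		return http.StatusUnauthorized
	}
	opts := x509.VerifyOptions{
		Roots:     profilePool, // THIS profile's bundle, not the union pool
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	}
	if _, err := r.TLS.PeerCertificates[0].Verify(opts); err != nil {
		return http.StatusUnauthorized
	}
	return http.StatusOK
}

// must panics on error; acceptable in a demo, not in a real handler.
func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}

// issue creates a certificate. With parent == nil it self-signs a CA.
func issue(cn string, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey) {
	key := must(ecdsa.GenerateKey(elliptic.P256(), rand.Reader))
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(2),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	signer := parentKey
	if parent == nil { // self-signed CA
		tmpl.SerialNumber = big.NewInt(1)
		tmpl.IsCA, tmpl.BasicConstraintsValid = true, true
		tmpl.KeyUsage = x509.KeyUsageCertSign
		parent, signer = tmpl, key
	} else { // leaf usable for TLS client auth
		tmpl.KeyUsage = x509.KeyUsageDigitalSignature
		tmpl.ExtKeyUsage = []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth}
	}
	der := must(x509.CreateCertificate(rand.Reader, tmpl, parent, &key.PublicKey, signer))
	return must(x509.ParseCertificate(der)), key
}

// requestWith fakes a request whose TLS state carries the given peer cert.
func requestWith(cert *x509.Certificate) *http.Request {
	r := must(http.NewRequest(http.MethodPost, "https://scep.example.invalid/scep-mtls/profile-a", nil))
	if cert != nil {
		r.TLS = &tls.ConnectionState{PeerCertificates: []*x509.Certificate{cert}}
	}
	return r
}

// demoStatuses: trusted cert, cross-profile cert, no cert, missing pool.
func demoStatuses() [4]int {
	caA, keyA := issue("profile-a-bootstrap-ca", nil, nil)
	caB, _ := issue("profile-b-bootstrap-ca", nil, nil)
	deviceA, _ := issue("device-a", caA, keyA)
	poolA, poolB := x509.NewCertPool(), x509.NewCertPool()
	poolA.AddCert(caA)
	poolB.AddCert(caB)
	withCert := requestWith(deviceA)
	return [4]int{
		gateStatus(withCert, poolA),
		gateStatus(withCert, poolB),
		gateStatus(requestWith(nil), poolA),
		gateStatus(withCert, nil),
	}
}

func main() {
	fmt.Println(demoStatuses()) // [200 401 401 500]
}
```

The second result is the cross-profile case: device-a would pass a TLS-layer check against a union pool containing both CAs, but fails once verified against profile B's pool alone.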
shankar0123	b33b843908	feat(scep): RenewalReq + GetCertInitial + ChromeOS E2E + caps + must-staple SCEP RFC 8894 + Intune master bundle — Phase 4 + Phase 5 of 14. Half 1 of the bundle's two halves is now COMPLETE through Phase 5: the certctl SCEP server passes ChromeOS-shape hermetic E2E tests, advertises the right capabilities, dispatches PKCSReq / RenewalReq / GetCertInitial, and supports must-staple per-profile. == Phase 4: RenewalReq + GetCertInitial wiring ============================ internal/service/scep.go * RenewalReqWithEnvelope (RFC 8894 §3.3.1.2) — re-enrollment with an existing valid cert. Same contract as PKCSReqWithEnvelope but the service additionally verifies that envelope.SignerCert chains to the issuer's CA (verifyRenewalSignerCertChain). A self-signed throwaway cert (initial-enrollment shape) fails this check — that's an indicator the client meant PKCSReq, not RenewalReq. * GetCertInitialWithEnvelope (RFC 8894 §3.3.3) — polling stub. Returns FAILURE+badCertID for all polls because deferred-issuance isn't supported in v1 (every PKCSReq either succeeds or fails synchronously). Wiring stays in place for a future enhancement. * Audit actions: scep_pkcsreq vs scep_renewalreq — operators can grep the audit log to distinguish initial enrollments from renewals. internal/api/handler/scep.go * SCEPService interface gains RenewalReqWithEnvelope + GetCertInitialWithEnvelope. * pkiOperation RFC 8894 path now switches on envelope.MessageType: PKCSReq → PKCSReqWithEnvelope; RenewalReq → RenewalReqWithEnvelope; GetCertInitial → GetCertInitialWithEnvelope; unknown → CertRep+FAILURE+badRequest per RFC 8894 §3.3.2.2. == Phase 5.1: GetCACaps capability advertisement ========================= internal/service/scep.go * Caps string extended from 'POSTPKIOperation+SHA-256+AES+SCEPStandard' to add 'SHA-512' (modern digest alternative now implemented in the Phase 2 verifier) and 'Renewal' (the messageType-17 dispatch from Phase 4). 
ChromeOS specifically looks for these capabilities to negotiate the strongest available cipher + digest combo. * scep_test.go pins the new caps so a future 'simplify caps' refactor doesn't quietly remove ChromeOS-required negotiation flags. == Phase 5.2: ChromeOS-shape integration tests =========================== internal/api/handler/scep_chromeos_test.go (new, ~570 LoC) * 6 hermetic E2E tests + ~12 helpers. Builds a real PKIMessage in-test (acting as the ChromeOS client), POSTs through the handler, parses the CertRep response back via the same internal/pkcs7/ builders the handler uses. * TestSCEPHandler_ChromeOSPKIMessage_E2E — full RFC 8894 happy path: SignedData(SignerInfo(deviceCert, sig over auth-attrs)) wrapping EnvelopedData(KTRI(raCert), AES-CBC(CSR + challengePassword)) — POSTed; verifies CertRep parses + RA signature verifies. * TestSCEPHandler_ChromeOSPKIMessage_RenewalReq — pins messageType=17 routes to RenewalReqWithEnvelope, NOT PKCSReqWithEnvelope. * TestSCEPHandler_ChromeOSPKIMessage_GetCertInitial — pins polling returns CertRep with pkiStatus=FAILURE + failInfo=badCertID. * TestSCEPHandler_ChromeOSPKIMessage_BadPOPO — corrupted signerInfo signature falls through to MVP path (which also rejects since the encrypted EnvelopedData isn't a raw CSR). No silent acceptance. * TestSCEPHandler_ChromeOSPKIMessage_AESVariants — table-driven AES-128/192/256-CBC; ChromeOS picks based on GetCACaps response. * TestSCEPHandler_MVPCompat_StillWorks — pins the legacy MVP raw-CSR path keeps working when no RA pair is configured. Backward compat is non-negotiable. == Phase 5.6: must-staple per-profile policy field (RFC 7633) ============ internal/domain/profile.go * Added MustStaple bool to CertificateProfile. Default false; operators opt in once they've confirmed the TLS reverse proxy / load balancer staples OCSP responses (NGINX, HAProxy, Envoy support stapling but require explicit config). 
internal/connector/issuer/interface.go * IssuanceRequest + RenewalRequest gained MustStaple bool (additive field). Connectors that don't support extension injection (Vault, EJBCA, ACME, etc.) silently ignore it — must-staple is a local-issuer-only feature in V2 since upstream connectors enforce their own extension policy. internal/connector/issuer/local/local.go * Added oidMustStaple (1.3.6.1.5.5.7.1.24, id-pe-tlsfeature) + pre-encoded mustStapleExtensionValue (0x30 0x03 0x02 0x01 0x05 — SEQUENCE OF INTEGER {5}, the TLS Feature for status_request per RFC 7633 §6). * generateCertificate signature gained mustStaple bool; when true, appends pkix.Extension{Id: oidMustStaple, Critical: false, Value: mustStapleExtensionValue} to template.ExtraExtensions before x509.CreateCertificate. internal/connector/issuer/local/must_staple_test.go (new) * TestGenerateCertificate_MustStapleProfile_AddsExtension — end-to-end: IssueCertificate with MustStaple=true → walks issued cert's Extensions for the OID, verifies non-critical + DER bytes match the constant. * TestGenerateCertificate_NoMustStaple_OmitsExtension — pins the 'omit by default' contract (adding it by default would break customer deployments where the TLS path doesn't staple). * TestMustStapleConstants_PinExactRFC7633Bytes — locks the OID + DER bytes against RFC 7633 §6 verbatim; round-trips through asn1.Unmarshal as []int{5}. Note: full service-layer plumbing (CertificateProfile.MustStaple → IssuanceRequest.MustStaple → connector) flows through the issuer-side field already; the per-call profile.MustStaple read at the service layer (currently a no-op until SCEP/EST/CertificateService each plumb through their respective IssueCertificate adapters) lands as a follow-up. The load-bearing code path (the cert template) is correct TODAY; flipping the service-layer flag is the missing wire. 
== Phase 5.4: docs/legacy-est-scep.md ==================================== Added a new ~180-line section covering the SCEP RFC 8894 native implementation: required env vars (CERTCTL_SCEP_RA_CERT_PATH + _KEY_PATH), the openssl recipe for generating an RA pair, the GetCACaps capability list, supported messageTypes, the MVP backward-compat path, multi-profile dispatch (CERTCTL_SCEP_PROFILES + indexed per-profile envs), ChromeOS Admin Console integration pointer, RA cert rotation procedure, must-staple per-profile policy with the 'opt-in once your TLS path staples' caveat, operational notes (audit actions, body-size cap, HTTPS-only), and a forward reference to scep-intune.md (Phase 11). == Verification ========================================================== * gofmt + go vet clean for the files I touched. * staticcheck ./internal/api/handler/... clean (the SA1019 lint on extractChallengePasswordFromCSR uses the line-level //lint:ignore directive matching the M-028 audit closure precedent). * go test -short -count=1 green across api/handler / api/router / service / pkcs7 / connector/issuer/local / domain / cmd/server. * G-3 docs-drift CI guard local check: empty diff in both directions. Phase 4 + Phase 5 of 14 in SCEP RFC 8894 + Intune master bundle. Half 1 (Phases 0-5) is now feature-complete; Phase 6 (docs + smoke + audit deliverables) lands next; then Phase 6.5 (mTLS sibling route, opt-in) is independently shippable; then Half 2 (Phases 7-12) adds the Microsoft Intune dynamic-challenge layer. Living progress at cowork/scep-rfc8894-intune/progress.md.	2026-04-29 13:16:09 +00:00
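The must-staple constants quoted in this commit (OID 1.3.6.1.5.5.7.1.24 and the DER bytes 0x30 0x03 0x02 0x01 0x05) can be checked with the standard library alone. The sketch below reuses the OID, bytes and extension fields from the commit message; the function names are illustrative, and nothing here is taken from certctl's actual local.go.

```go
package main

import (
	"bytes"
	"crypto/x509/pkix"
	"encoding/asn1"
	"fmt"
)

// id-pe-tlsfeature, as quoted in the commit message.
var oidMustStaple = asn1.ObjectIdentifier{1, 3, 6, 1, 5, 5, 7, 1, 24}

// DER for SEQUENCE OF INTEGER {5}; 5 is the TLS status_request feature.
var mustStapleValue = []byte{0x30, 0x03, 0x02, 0x01, 0x05}

// mustStapleExtension builds the non-critical extension that would be
// appended to a template's ExtraExtensions before x509.CreateCertificate.
func mustStapleExtension() pkix.Extension {
	return pkix.Extension{Id: oidMustStaple, Critical: false, Value: mustStapleValue}
}

// roundTrips confirms the pre-encoded bytes decode to []int{5} and that
// re-encoding []int{5} reproduces exactly the same bytes.
func roundTrips() bool {
	var features []int
	rest, err := asn1.Unmarshal(mustStapleValue, &features)
	if err != nil || len(rest) != 0 {
		return false
	}
	if len(features) != 1 || features[0] != 5 {
		return false
	}
	enc, err := asn1.Marshal([]int{5})
	return err == nil && bytes.Equal(enc, mustStapleValue)
}

func main() {
	ext := mustStapleExtension()
	fmt.Println(ext.Id.String(), ext.Critical, roundTrips()) // 1.3.6.1.5.5.7.1.24 false true
}
```

Pinning the encoded bytes as a constant, then asserting the round trip in a test, guards against a later refactor silently changing the wire form.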
shankar0123	b540d4421e	feat(scep): CertRep PKIMessage response builder (RFC 8894 §3.3.2) SCEP RFC 8894 + Intune master bundle — Phase 3 of 14. Implements the SCEP CertRep response builder + wires it into the handler's RFC 8894 path. After this commit, certctl emits proper CertRep PKIMessage responses (signed by the RA key, with EnvelopedData encrypting the issued cert chain to the device's transient signing cert) for both success and failure outcomes — RFC 8894 §3.3 mandates a PKIMessage response on every PKIOperation request, including failure cases that carry pkiStatus=2 + failInfo. internal/pkcs7/certrep.go (new, ~370 LoC) * BuildCertRepPKIMessage: assembles the full ContentInfo → SignedData → {certs, signerInfo, encapContent} structure per RFC 8894 §3.3.2 + RFC 5652 §5+§6. * Success path: encrypts the issued cert chain (PKCS#7 certs-only) INSIDE an EnvelopedData targeting req.SignerCert (the device's transient cert, NOT the RA cert — response goes back to the device encrypted with its public key). AES-256-CBC + random 16-byte IV + PKCS#7 padding + RSA PKCS#1v1.5 keyTrans. * Failure path: encapContent is empty (no EnvelopedData); the failInfo auth-attr is populated. * Pending path: encapContent is empty; client polls via GetCertInitial. * Auth-attr ordering matches micromdm/scep for byte-level wire-format diffing (DER SET-OF normalises order anyway, but matching the reference implementation makes audit + manual inspection easier). * senderNonce is freshly generated from crypto/rand on every call. * RA key signs the canonical SET OF Attribute re-serialisation (RFC 5652 §5.4 quirk every CMS implementation hits — wire form is [0] IMPLICIT but the signature is computed over EXPLICIT SET OF). * Helper functions: buildCertRepAuthAttrs, buildSignerInfoCertRep, signCertRep, buildEncapContentInfo, buildEnvelopedDataAES256, all constructed via this package's existing ASN1Wrap primitives (avoids asn1.Marshal nuances with nested RawValues — same pattern Phase 2 settled on). 
internal/pkcs7/signedinfo.go (1-line tweak) * ParseSignedData no longer refuses when SignerInfos is empty. The degenerate certs-only SignedData form (RFC 8894 §3.5.1 GetCACert response, RFC 7030 EST cacerts, AND now the encrypted certs-only inner content of the CertRep EnvelopedData) is structurally valid with zero signers. Caller decides whether the lack of signers is an error in their context. internal/pkcs7/certrep_test.go (new, ~230 LoC) * TestBuildCertRepPKIMessage_Success_RoundTrip — full pipeline round-trip: build → ParseSignedData → VerifySignature → auth-attr extractors → ParseEnvelopedData(encapContent) → Decrypt with device key → ParseSignedData(innerCertsOnly) → assert issued cert CN. Catches drift between the build-side encoding and the parse-side decoding. * TestBuildCertRepPKIMessage_Failure_NoEncapContent — pkiStatus=2 + failInfo populated; encapContent empty. * TestBuildCertRepPKIMessage_FreshSenderNonceEachCall — pins the 'never reuse senderNonce' invariant from RFC 8894 §3.2.1.4.5 (replay defense). * TestBuildCertRepPKIMessage_RejectsNonRSADeviceCert — pins the RSA-only requirement on the device's transient cert (KTRI requires RSA pubkey for keyTrans encryption). * TestBuildCertRepPKIMessage_NilArgs_Refuses. internal/pkcs7/certrep_fuzz_test.go (new, ~150 LoC) * FuzzBuildCertRepPKIMessage — varies transactionID + senderNonce + signerCert; asserts no panic. When build succeeds for the success path, asserts round-trip soundness (output parses back via ParseSignedData). 6s seed-corpus run hit no panics. internal/api/handler/scep.go * pkiOperation now emits writeCertRepPKIMessage for the RFC 8894 path (both success AND failure). MVP path keeps writeSCEPResponse for backward compat with lightweight clients. * tryParseRFC8894 extended to extract the RFC 2985 §5.4.1 challengePassword attribute from the recovered CSR, so the service-layer's challenge-password gate can run on the RFC 8894 path the same way it does on the MVP path. 
Returns (envelope, csrPEM, challengePassword, ok) — was 3-tuple before. * extractChallengePasswordFromCSR helper mirrors the MVP path's extractCSRFields logic; same staticcheck SA1019 carve-out for the deprecated csr.Attributes API (RFC 2985 challengePassword has no non-deprecated stdlib API per the M-028 audit closure). * writeCertRepPKIMessage helper wraps pkcs7.BuildCertRepPKIMessage; on build failure (programmer/config bug) returns HTTP 500 rather than try a fallback PKIMessage that might re-trigger the same bug. Verification: * gofmt + go vet clean across pkcs7 / api/handler. * go test -short -count=1 green across pkcs7 / api/handler / api/router / service / cmd/server. * Coverage: pkcs7 80.5% (was 78.4% before Phase 3). Handler/service held steady. * Fuzz seed-corpus (6s): FuzzBuildCertRepPKIMessage — no panic; round-trip soundness invariant held for every successful build. Phase 3 of 14 in SCEP RFC 8894 + Intune master bundle. Living progress at cowork/scep-rfc8894-intune/progress.md.	2026-04-29 12:46:30 +00:00
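The content-encryption step named in this commit (AES-256-CBC, a random 16-byte IV, PKCS#7 padding) can be sketched with the Go standard library as below. It covers only the symmetric layer; the RSA keyTrans wrapping of the content key and the CMS encoding are omitted. The function names are hypothetical, and the decrypt side uses a deliberately simple padding check that is fine for a round-trip demo but not for untrusted input.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// encryptAES256CBC pads with PKCS#7 and encrypts under a fresh random IV.
func encryptAES256CBC(key, plaintext []byte) (iv, ciphertext []byte, err error) {
	if len(key) != 32 {
		return nil, nil, errors.New("AES-256 needs a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, nil, err
	}
	iv = make([]byte, aes.BlockSize) // 16 bytes
	if _, err = rand.Read(iv); err != nil {
		return nil, nil, err
	}
	// PKCS#7: always add 1..16 bytes, each equal to the pad length.
	padLen := aes.BlockSize - len(plaintext)%aes.BlockSize
	padded := append(append([]byte{}, plaintext...), bytes.Repeat([]byte{byte(padLen)}, padLen)...)
	ciphertext = make([]byte, len(padded))
	cipher.NewCBCEncrypter(block, iv).CryptBlocks(ciphertext, padded)
	return iv, ciphertext, nil
}

// decryptAES256CBC reverses encryptAES256CBC. The padding check here is
// NOT constant-time; it exists only to make the demo round-trip.
func decryptAES256CBC(key, iv, ciphertext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	if len(iv) != aes.BlockSize || len(ciphertext) == 0 || len(ciphertext)%aes.BlockSize != 0 {
		return nil, errors.New("bad input length")
	}
	plain := make([]byte, len(ciphertext))
	cipher.NewCBCDecrypter(block, iv).CryptBlocks(plain, ciphertext)
	padLen := int(plain[len(plain)-1])
	if padLen == 0 || padLen > aes.BlockSize {
		return nil, errors.New("bad padding")
	}
	return plain[:len(plain)-padLen], nil
}

// roundTripOK encrypts twice and checks decryption plus IV freshness.
func roundTripOK() bool {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		return false
	}
	msg := []byte("placeholder for the certs-only SignedData bytes")
	iv1, ct, err := encryptAES256CBC(key, msg)
	if err != nil {
		return false
	}
	iv2, _, err := encryptAES256CBC(key, msg)
	if err != nil {
		return false
	}
	got, err := decryptAES256CBC(key, iv1, ct)
	return err == nil && bytes.Equal(got, msg) && !bytes.Equal(iv1, iv2)
}

func main() {
	fmt.Println(roundTripOK())
}
```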
shankar0123	a546a1bbef	feat(scep): EnvelopedData decrypt + signerInfo POPO verify (RFC 8894 §3.2) SCEP RFC 8894 + Intune master bundle — Phase 2 of 14. Implements the new RFC 8894 PKIMessage parse path: EnvelopedData parser + decryptor, signerInfo parser + signature verifier, handler dispatch that tries the RFC 8894 path FIRST and falls through to the legacy MVP raw-CSR path on any parse failure. Backward compat with lightweight SCEP clients is preserved by design — no behavior change for any existing deploy that doesn't set CERTCTL_SCEP_RA_. internal/pkcs7/envelopeddata.go (new, ~330 LoC) * ParseEnvelopedData: parses CMS EnvelopedData per RFC 5652 §6.1, with optional outer ContentInfo unwrapping. Handles SET OF RecipientInfo + IssuerAndSerial form rid (RFC 8894 §3.2.2). * EnvelopedData.Decrypt: RSA PKCS#1 v1.5 key-trans + AES-CBC (128/192/256) or DES-EDE3-CBC content decryption with constant-time PKCS#7 padding strip (no branch on padding-byte values; closes the padding-oracle leak surface). Recipient mismatch is BadMessageCheck per RFC 8894 §3.3.2.2 (NOT BadCertID); every failure mode returns the same ErrEnvelopedDataDecrypt sentinel to close timing-leak legs of Bleichenbacher attacks. * Equivalent to micromdm/scep's cryptoutil/cryptoutil.go::DecryptPKCSEnvelope (cited in code comments; not vendored — fuzz-target ownership stays in this sub-package per the operating rule). internal/pkcs7/signedinfo.go (new, ~370 LoC) * ParseSignedData / ParseSignerInfos: parses CMS SignedData per RFC 5652 §5.3. Resolves each SignerInfo's SID (IssuerAndSerial v1 OR [0] SubjectKeyId v3) against the SignedData certificates SET to pluck the device's transient signing cert. * SignerInfo.VerifySignature: re-serialises signedAttrs as the canonical SET OF Attribute (the RFC 5652 §5.4 quirk every CMS implementation hits — wire form is [0] IMPLICIT but the signature is over EXPLICIT SET OF). Hashes with SHA-1/SHA-256/SHA-512 + verifies via RSA PKCS1v15 or ECDSA per the cert's pubkey type. 
* Auth-attr extractors: GetMessageType (PrintableString-decimal), GetTransactionID, GetSenderNonce, GetMessageDigest. SCEP attr OIDs pinned (RFC 8894 §3.2.1.4). internal/pkcs7/{envelopeddata,signedinfo}_fuzz_test.go (new) * FuzzParseEnvelopedData / FuzzParseSignedData / FuzzParseSignerInfos / FuzzVerifySignerInfoSignature — every parser certctl adds gets a panic-safety fuzzer (the fuzz-target-ownership rule from cowork/CLAUDE.md::Operating Rules). Local 5s runs hit ~270k executions per parser without panic. Errors are expected for arbitrary inputs; only panics are bugs. internal/pkcs7/{envelopeddata,signedinfo}_test.go (new) * Round-trip tests that materialise real RSA/ECDSA pairs, hand-build the wire bytes, parse + decrypt + verify, and assert plaintext / auth-attr equality. The build helpers use this package's ASN1Wrap primitives directly (asn1.Marshal of structs containing nested asn1.RawValue is finicky for mixed Class/Tag); gives byte-level control matching what real SCEP clients emit. * Negative tests: tampered ciphertext / tampered auth-attrs / wrong RA / wrong key / mismatched recipients / random garbage all return the appropriate sentinel error without panic. internal/service/scep.go * PKCSReqWithEnvelope: RFC 8894 envelope-aware variant. Returns SCEPResponseEnvelope (not error + SCEPEnrollResult) because RFC 8894 §3.3 mandates a CertRep PKIMessage on every response, even failures — the handler shouldn't translate Go errors into SCEP failInfo codes. Returns nil to signal 'invalid challenge password' so the caller can translate to HTTP 403 (matches MVP path's wire shape; RFC 8894 §3.3.1 is silent on this case). * mapServiceErrorToFailInfo: exact mapping table from the prompt (CSR parse → BadRequest, CSR sig → BadMessageCheck, crypto policy → BadAlg, default → BadRequest). internal/api/handler/scep.go * SCEPService interface gains PKCSReqWithEnvelope. * SCEPHandler now optionally carries an RA cert + key pair. 
SetRAPair upgrades the handler to the RFC 8894 path; without that call the handler stays MVP-only (the v2.0.x behavior). * pkiOperation: tries the RFC 8894 path FIRST when the RA pair is set. tryParseRFC8894 helper does the full pipeline (ParseSignedData → VerifySignature → extract auth-attrs → ParseEnvelopedData → Decrypt → x509.ParseCertificateRequest the recovered bytes). On any failure it falls through to the legacy extractCSRFromPKCS7 MVP path — backward compat is non-negotiable. * Phase 2 emits the legacy certs-only response on RFC 8894 success; Phase 3 (next commit) swaps in writeCertRepPKIMessage with the proper status / failInfo / nonce-echo wire shape. cmd/server/main.go * Per-profile loop now calls loadSCEPRAPair after preflight to load the cert + key + inject via SetRAPair. crypto + crypto/tls imports added. * loadSCEPRAPair helper: tls.X509KeyPair-based parse + leaf cert extraction. Failures here indicate TOCTOU between preflight + load. internal/api/handler/scep_handler_test.go + internal/api/router/router_scep_profiles_test.go * mockSCEPService / scepProfileMockService gain PKCSReqWithEnvelope stubs to satisfy the extended interface. Existing test cases unchanged (they exercise the MVP path; RA pair is unset). Verification: * gofmt + go vet clean for the files I touched. * go test -short -count=1 green across pkcs7 / api/handler / api/router / service / cmd/server. * Coverage: pkcs7 78.4% (was 100% — drops because new code includes paths the round-trip tests don't yet hit, like decryption alg fall-through and v3 SubjectKeyId SID matching). * Fuzz-target seed-corpus runs (5s each, ~270k execs/parser): no panic. Pre-merge fuzz-time bumps to 30s per the prompt's verification gate. Phase 2 of 14 in SCEP RFC 8894 + Intune master bundle. Living progress at cowork/scep-rfc8894-intune/progress.md.	2026-04-29 12:36:27 +00:00
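The "constant-time PKCS#7 padding strip (no branch on padding-byte values)" mentioned for EnvelopedData.Decrypt can be approximated with crypto/subtle as in the sketch below. It is one common way to write such a check, not certctl's implementation. unpadPKCS7ConstantTime and errDecrypt are hypothetical names, the latter standing in for the single-sentinel behaviour the commit attributes to ErrEnvelopedDataDecrypt.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
)

// errDecrypt is the single sentinel returned for every failure mode.
var errDecrypt = errors.New("decrypt failed")

// unpadPKCS7ConstantTime validates and strips PKCS#7 padding. It always
// inspects the final blockSize bytes and folds every comparison into one
// accumulator, so the loop never branches on a padding-byte value.
func unpadPKCS7ConstantTime(buf []byte, blockSize int) ([]byte, error) {
	n := len(buf)
	// Lengths are public information, so branching on them is fine.
	if blockSize <= 0 || blockSize > 255 || n == 0 || n%blockSize != 0 {
		return nil, errDecrypt
	}
	pad := int(buf[n-1])
	// good stays 1 only if 1 <= pad <= blockSize and all pad bytes match.
	good := subtle.ConstantTimeLessOrEq(1, pad) & subtle.ConstantTimeLessOrEq(pad, blockSize)
	for i := 1; i <= blockSize; i++ {
		inPad := subtle.ConstantTimeLessOrEq(i, pad) // 1 if byte i-from-end is padding
		eq := subtle.ConstantTimeByteEq(buf[n-i], byte(pad))
		// Inside the claimed padding the byte must equal pad; outside, ignore it.
		good &= eq | (inPad ^ 1)
	}
	// One branch on the final verdict; the caller learns only pass or fail.
	if good != 1 {
		return nil, errDecrypt
	}
	return buf[:n-pad], nil
}

func main() {
	// Block size 4 keeps the demo inputs short; AES would use 16.
	ok, err := unpadPKCS7ConstantTime([]byte{'a', 'b', 2, 2}, 4)
	fmt.Println(string(ok), err)
	_, err = unpadPKCS7ConstantTime([]byte{'a', 'b', 1, 2}, 4) // tampered pad byte
	fmt.Println(err)
}
```

Returning the same error for bad length, bad padding and every other failure is the other half of the defence: a uniform failure response gives an attacker nothing to tell the cases apart.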
shankar0123	a4df1f86ae	crl/ocsp: admin observability endpoint + Phase 6 e2e scaffold Phase 5 (admin endpoint slice) + Phase 6 (e2e test stub) of the CRL/OCSP responder bundle. Closes the deferred items from the backend-slice merge (`77d6326`). What landed: Phase 5 — admin observability: * GET /api/v1/admin/crl/cache (handler.AdminCRLCacheHandler): - Per-issuer cache state + most recent N generation events - Admin-gated via middleware.IsAdmin (M-003 pattern); non-admin callers get 403 + the service is never invoked - Reveals issuer set + CRL cadence, hence the gate - Returns CachePresent=false rows for never-generated issuers so the GUI can show 'not yet generated' instead of 404 - Per-issuer Get failures decorate the row's RecentEvents rather than failing the whole response * AdminCRLCacheServiceImpl: thin handler-side composition over repository.CRLCacheRepository + an issuer-IDs callback (avoids importing internal/service from internal/api/handler) * M-008 admin-gate pin updated: admin_crl_cache.go added to AdminGatedHandlers; full triplet of tests (NonAdmin_Returns403, AdminExplicitFalse_Returns403, AdminPermitted_ForwardsActor) + RejectsNonGetMethod + PropagatesServiceError * Router registration + HandlerRegistry field + main.go wiring (callback closure over issuerRegistry.List) * OpenAPI entry under CRL & OCSP tag Phase 6 — e2e scaffold: * deploy/test/crl_ocsp_e2e_test.go with TestCRLOCSPLifecycle + TestCRLOCSPPostEndpoint * Lifecycle test exercises issue → fetch OCSP (Good) → revoke → wait → fetch CRL (entry present) → fetch OCSP (Revoked) → verify dedicated responder cert + id-pkix-ocsp-nocheck * Helpers (issueLocalCert, revokeCertViaAPI, fetchCRL, fetchOCSP, fetchCACert) currently call t.Skip with TODO markers — sandbox has no Docker so the harness can't be wired end-to-end here; when CI / a fresh dev workstation runs, the implementer wires each helper to the existing integration_test.go primitives * Build-tagged //go:build integration so the standard go 
test sweep skips it; runs via the deploy/test integration workflow Coverage: handler 80.6% (above 75 floor; was 79.8% pre-Phase-5). All other packages unchanged. Backward compat: admin endpoint inert until an admin Bearer key is configured. The e2e test stub is no-op (skips) until wired. Deferred: * GUI cert-detail-page revocation panel — pure frontend work, no backend impact, separate session * E2E test helper wiring — depends on extracting the existing integration-test harness primitives into shared helpers; doable in a follow-up that has Docker available * V3-Pro polish (delta CRLs, OCSP rate-limiting, OCSP stapling)	2026-04-29 01:55:39 +00:00
shankar0123	dc1e0bfbaa	crl/ocsp: POST OCSP endpoint (RFC 6960 §A.1.1) + cache integration Phase 4 (final phase) of the CRL/OCSP responder bundle. Closes the backend slice; HTTP layer is now production-ready for relying parties. What landed: * POST /.well-known/pki/ocsp/{issuer_id} (handler.HandleOCSPPost) - Accepts binary application/ocsp-request body per RFC 6960 §A.1.1 - Tolerant of missing Content-Type (some clients omit); validates via ocsp.ParseRequest, returns 400 on malformed - Returns 415 on explicit wrong Content-Type - Reuses the existing service path (h.svc.GetOCSPResponse) — the only new logic is body decoding + serial-from-OCSPRequest extraction - GET form preserved unchanged for ad-hoc curl + human URL paths - Auth-exempt under /.well-known/pki/ prefix (already in AuthExemptDispatchPrefixes — no router changes for that) - 7 new tests: success, method-not-allowed, wrong content-type, missing content-type accepted, malformed body, missing issuer, service error propagation * router.go: r.Register("POST /.well-known/pki/ocsp/{issuer_id}", ...) * CertificateService.GenerateDERCRL — cache-aware: - New SetCRLCacheSvc(svc) setter (matches existing SetCAOperationsSvc pattern — optional dep) - When wired, GenerateDERCRL calls crlCacheSvc.Get → cheap DB read on cache hit, singleflight-coalesced regen on miss - When unwired, falls back to historical caSvc.GenerateDERCRL path - GET /.well-known/pki/crl/{issuer_id} handler unchanged — calls the same service method, gets cache benefit transparently when the cache service is wired in cmd/server/main.go Coverage: handler 79.8% (floor 75), service unchanged, scheduler 78%. What's deferred (intentional scope cut for this session): * cmd/server/main.go wiring of CRLCacheService + responder service setters into the local issuer factory + scheduler. 
The wiring is mechanical (NewCRLCacheService + scheduler.SetCRLCacheService call in the existing wiring block); deferring keeps this commit focused on the responder + cache primitives. Operator can wire when ready. * Phase 5 (GUI), Phase 6 (e2e test against kind), Phase 7 (release prep) — separate follow-up sessions. * OCSP cache integration: today's GET/POST OCSP path goes through the on-demand SignOCSPResponse (already cheap with the dedicated responder cert from Phase 2). A cached-OCSP path is V3-Pro polish. The bundle's V2 backend slice (Phases 0-4) is complete. All 4 phases shipped 4 commits + 1 amend on this branch. CI will validate the testcontainers repository tests on push.	2026-04-29 00:07:27 +00:00
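The Content-Type rules from the POST OCSP commit (missing header tolerated, explicit wrong type rejected with 415) can be sketched as a small pure function. This is an assumption-laden illustration: ocspContentTypeStatus is a hypothetical name, and the real handler additionally validates the body via ocsp.ParseRequest and returns 400 on malformed input, which is not shown here.

```go
package main

import (
	"fmt"
	"mime"
	"net/http"
)

// ocspRequestType is the media type RFC 6960 §A.1.1 assigns to POSTed requests.
const ocspRequestType = "application/ocsp-request"

// ocspContentTypeStatus (hypothetical helper) returns 200 when the header is
// acceptable and 415 when a Content-Type is present but wrong.
func ocspContentTypeStatus(contentType string) int {
	if contentType == "" {
		// Some clients omit the header; the body parse is the real check.
		return http.StatusOK
	}
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil || mediaType != ocspRequestType {
		return http.StatusUnsupportedMediaType
	}
	return http.StatusOK
}

func main() {
	for _, ct := range []string{"", "application/ocsp-request", "text/plain"} {
		fmt.Printf("%q -> %d\n", ct, ocspContentTypeStatus(ct))
	}
}
```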
shankar0123	8326d95210	Bundle N.C-extended (Coverage Audit Extension): service + handler round-out — M-002 + M-003 partial-closed Three new round-out test files targeting handler-interface delegators on CertificateService + AgentService + IssuerHandler/HealthCheckHandler. Coverage deltas ================= internal/service: 70.5% -> 73.4% (+2.9pp; 17 new tests) internal/api/handler: 79.4% -> 79.8% (+0.4pp; 4 new tests) Service round-out tests (certificate_round_out_test.go, ~165 LoC) ================= - GetCertificate (delegate-to-repo + NotFound) - CreateCertificate (defaults populated + repo error) - UpdateCertificate (patch merge + NotFound + repo error) - ArchiveCertificate (delegate + repo error) - GetCertificateVersions (pagination defaults + page-out-of-range + repo error) - SetJobRepo / SetKeygenMode (no-crash setters) Service round-out tests (agent_round_out_test.go, ~140 LoC) ================= - GetAgent (delegate) - RegisterAgent (defaults populated + repo error) - GetWork / GetWorkWithTargets (no-jobs path) - UpdateJobStatus (delegate to ReportJobStatus) - CSRSubmit / CSRSubmitForCert (invalid-CSR error) - CertificatePickup (agent-not-found) - GetAgentByAPIKey (unknown key) - GetCertificateForAgent (missing agent) - SetProfileRepo (no-crash) Handler round-out tests (round_out_test.go, ~40 LoC) ================= - NewIssuerHandlerWithLogger (logger wired through) - UpdateHealthCheck dispatch arm with bad ID - GetHealthCheckHistory dispatch arm with bad ID Why partial ================= M-002 / M-003 prescribed >=80%. Service at 73.4% and handler at 79.8% miss the gate by 6.6pp / 0.2pp respectively. The remaining service gap is in CSR-submit happy-path and large-population list-filter flows that need deeper repo plumbing (3-4 hr more focused work). The handler 0.2pp is in parseSignedDataForCSR (SCEP), DeleteHealthCheck, AcknowledgeHealthCheck — needs repo fixtures. These extensions are a meaningful step but don't fully close M-002 and M-003. 
Tracked as N.C-final follow-on; not blocking on a CI floor at 73 / 79. Audit deliverables ================= - gap-backlog.md M-002, M-003: partial-strikethrough with progress note + remaining-gap analysis - extension-progress.md: N.C-extended marked PARTIAL Closes (partial): M-002, M-003 Bundle: N.C-extended (Coverage Audit Extension)	2026-04-27 21:40:09 +00:00
shankar0123	62a412c488	Bundle C: Renewal/reliability cluster — 7 findings closed Closes M-006 + M-007 + M-008 + M-015 + M-016 + M-019 + M-020 from comprehensive-audit-2026-04-25. M-028 was already closed by the Bundle B CI follow-up. M-006 (CWE-913) — Idempotent migration 000014 migrations/000014_policy_violation_severity_check.up.sql: Prepended ALTER TABLE ... DROP CONSTRAINT IF EXISTS before the ADD. Mirrors the down migration's existing IF EXISTS shape and the M-7 idempotent-index idiom. Re-runs against partially-applied DBs now succeed. M-007 — Bulk-op partial-failure tests (3 new) internal/api/handler/bulk_partial_failure_test.go: TestBulkRevoke_PartialFailure_ReportsBoth TestBulkRenew_PartialFailure_ReportsBoth TestBulkReassign_PartialFailure_ReportsBoth Each asserts HTTP 200 + both success/failure counters round-trip + per-cert errors[] preserved with non-empty messages so operators can correlate each failure to its certificate ID. M-008 — Admin-gated handler enumeration pin (verified-already-clean) Recon: only one admin-gated handler — bulk_revocation.go — with full 3-branch test triplet already in place. health.go calls IsAdmin informationally to surface the flag to the GUI without gating. internal/api/handler/m008_admin_gate_test.go: Walks every handler .go file, asserts every middleware.IsAdmin call site is in AdminGatedHandlers (with required test triplet) or InformationalIsAdminCallers (justified). Adding a new admin gate without updating both the constant AND adding the test triplet fails CI. M-015 — Single-profile cardinality pin (verified-already-clean) Audit claim 'no cardinality validation' was wrong — enforced at struct level. domain.ManagedCertificate.{CertificateProfileID, RenewalPolicyID,IssuerID,OwnerID} and RenewalPolicy. CertificateProfileID are bare strings, not slices. internal/domain/m015_cardinality_test.go: reflect-based pin on kind=String. Schema change to N:N would have to update renewal.go's lookup loop in the same commit. 
M-016 (CWE-754) — Reap stale-agent jobs internal/repository/postgres/job.go::ListJobsWithOfflineAgents: JOIN jobs to agents on agent_id, filter (status=Running AND a.last_heartbeat_at < cutoff), exclude server-keygen jobs. internal/service/job.go::ReapJobsWithOfflineAgents: Flips matched jobs to Failed reason agent_offline so I-001 retry loop re-queues them on a healthy agent. Records audit event per reap. internal/scheduler/scheduler.go: Scheduler.runJobTimeout cycle now calls both reaper arms. agentOfflineJobTTL default 5min (5x agent-health-check default); SetAgentOfflineJobTTL knob for operator override. internal/service/job_offline_agent_reaper_test.go: 6 unit tests cover happy path, server-keygen-skip, non-Running-skip, non- positive-TTL fail-loud, repo-error propagation, audit-event recording. M-019 — Configurable ARI HTTP timeout Audit claim 'no fallback timeout' was wrong — ari.go:52 already had a 15s timeout. Bundle C makes it configurable. internal/connector/issuer/acme/acme.go: Config.ARIHTTPTimeoutSeconds field with env path CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS. internal/connector/issuer/acme/ari.go: Both HTTP clients (GetRenewalInfo + getARIEndpoint) now use the new ariHTTPTimeout() helper. Zero / negative / nil-config all fall back to the historic 15s default. ari_timeout_test.go: 4 dispatch arm tests. M-020 (CWE-770) — OCSP DoS hardening Pre-bundle the noAuthHandler chain had no rate limit. An attacker could DoS the OCSP responder, which for fail-open relying parties is a revocation bypass. cmd/server/main.go: noAuthHandler refactored from fixed middleware.Chain(...) to a conditional slice that appends middleware.NewRateLimiter when cfg.RateLimit.Enabled. Per-IP keying applies; OCSP/CRL/EST/SCEP are unauth. docs/security.md (NEW): Operator runbook documenting Must-Staple TLS Feature extension RFC 7633 as the architectural fix for fail-open relying parties. 
Profile-flip guidance + nginx/Apache/HAProxy/Envoy stapling snippets + explicit scope statement on what the rate limiter alone does NOT solve. Audit deliverables: cowork/comprehensive-audit-2026-04-25/audit-report.md: score 31/55 -> 38/55 closed (Medium 13/27 -> 20/27). cowork/comprehensive-audit-2026-04-25/findings.yaml: 7 status flips open -> closed with closure notes citing the Bundle C mechanism. certctl/CHANGELOG.md: Bundle C section under [unreleased]. Verification: go vet ./internal/service ./internal/scheduler ./internal/connector/issuer/acme ./internal/api/handler ./internal/domain ./cmd/server clean go test -count=1 -short on the same packages all green helm template + helm lint clean internal/repository/postgres setup-fail sandbox disk pressure (same on master HEAD before this branch)	2026-04-27 00:08:25 +00:00
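The M-019 fallback rule (zero, negative, or nil config all fall back to the historic 15s default) might look roughly like the sketch below. The Config struct here is a reduced, hypothetical stand-in that carries only the ARIHTTPTimeoutSeconds field named in the commit; the helper body is an assumption about how such a fallback is usually written.

```go
package main

import (
	"fmt"
	"time"
)

// Config is a trimmed-down, hypothetical stand-in for the ACME connector config.
type Config struct {
	ARIHTTPTimeoutSeconds int
}

// defaultARIHTTPTimeout mirrors the 15s default mentioned in the commit message.
const defaultARIHTTPTimeout = 15 * time.Second

// ariHTTPTimeout returns the configured timeout, or the default when the
// config is nil or the value is zero or negative.
func ariHTTPTimeout(cfg *Config) time.Duration {
	if cfg == nil || cfg.ARIHTTPTimeoutSeconds <= 0 {
		return defaultARIHTTPTimeout
	}
	return time.Duration(cfg.ARIHTTPTimeoutSeconds) * time.Second
}

func main() {
	fmt.Println(ariHTTPTimeout(nil))
	fmt.Println(ariHTTPTimeout(&Config{ARIHTTPTimeoutSeconds: -5}))
	fmt.Println(ariHTTPTimeout(&Config{ARIHTTPTimeoutSeconds: 30}))
}
```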
shankar0123	a172b6ed3b	Bundle B CI follow-up: G-3 env-var docs + M-028 closure (final 5 SA1019 sites) Two CI failures on master after Bundle B merge: 1. Frontend Build / G-3 env-var docs guardrail Bundle B introduced CERTCTL_RATE_LIMIT_PER_USER_RPS and CERTCTL_RATE_LIMIT_PER_USER_BURST without adding them to docs/features.md. The guardrail step that scans Go source for getEnv* calls and asserts each appears in a doc page failed. Fix: docs/features.md rate-limit section extended with both new env vars + a paragraph explaining the per-key keying contract from M-025. 2. Go Build & Test / staticcheck SA1019 hits (6 errors) The CI workflow runs staticcheck without continue-on-error. Bundle 7 opened M-028 to track 6 deprecated-API sites; Bundle 9 closed 1 of them (the elliptic.Marshal in local.go) but kept a deliberate regression-oracle reference in bundle9_coverage_test.go protected only by golangci-lint's //nolint comment — staticcheck-as-CLI does not honor that, only its native //lint:ignore directive. Closure of remaining 5 sites: cmd/server/main_test.go:47, 163, 192, 465 — 4 × middleware.NewAuth migrated to middleware.NewAuthWithNamedKeys with explicit NamedAPIKey entries. The auth=none case at line 465 maps to a nil NamedAPIKey slice (no-op pass-through, matches the NewAuthWithNamedKeys contract for empty input). Audit count was 3; recon found a 4th at line 465 that was missed. internal/api/handler/scep.go:266 — csr.Attributes is a real RFC 2985 §5.4.1 challengePassword carve-out. Go's stdlib deprecation note explicitly applies only to OID 1.2.840.113549.1.9.14 (requestedExtensions), NOT to OID 1.2.840.113549.1.9.7 (challengePassword), for which there is no non-deprecated stdlib API. Suppressed with native //lint:ignore SA1019 + comment block citing the RFC. internal/connector/issuer/local/bundle9_coverage_test.go:342 — deliberate regression-oracle that calls elliptic.Marshal to prove the new crypto/ecdh path is byte-identical. 
Comment converted from //nolint:staticcheck to native //lint:ignore SA1019 so staticcheck-as-CLI honors the suppression. Audit deliverables: cowork/comprehensive-audit-2026-04-25/audit-report.md: M-028 box flipped [x]; score 30/55 -> 31/55 (Medium 12/27 -> 13/27). cowork/comprehensive-audit-2026-04-25/findings.yaml: M-028 status partial_closed -> closed with closure note. Verification: go test -count=1 -short ./cmd/server ./internal/api/handler ./internal/connector/issuer/local ./internal/api/middleware ./internal/config — all green. staticcheck on each changed package — 0 SA1019 hits. Bundle C had M-028 in scope; this CI-fix lift moves it forward so master CI goes green immediately. Bundle C scope adjusts to remove M-028 and focuses on M-006 / M-015 / M-016 / M-019 / M-020 plus the M-007 / M-008 coverage gaps.	2026-04-26 23:35:13 +00:00
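The regression-oracle idea from the M-028 closure (call the deprecated elliptic.Marshal on purpose to prove the crypto/ecdh encoding is byte-identical, suppressed with staticcheck's native directive) can be sketched as below. The function name encodingsMatch is hypothetical and this is not the code in bundle9_coverage_test.go; it assumes a Go toolchain where ecdsa.PublicKey.ECDH is available (1.20 or later).

```go
package main

import (
	"bytes"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"fmt"
)

// encodingsMatch generates a P-256 key and compares the legacy uncompressed
// point encoding against the crypto/ecdh encoding of the same public key.
func encodingsMatch() (bool, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return false, err
	}
	pub, err := key.PublicKey.ECDH()
	if err != nil {
		return false, err
	}
	//lint:ignore SA1019 deliberate oracle: proves the ecdh path matches the deprecated encoder
	legacy := elliptic.Marshal(elliptic.P256(), key.PublicKey.X, key.PublicKey.Y)
	return bytes.Equal(legacy, pub.Bytes()), nil
}

func main() {
	ok, err := encodingsMatch()
	if err != nil {
		panic(err)
	}
	fmt.Println("encodings identical:", ok)
}
```

The //lint:ignore line is the form staticcheck reads when run as a CLI; a //nolint comment is only interpreted by golangci-lint.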
shankar0123	a2a82a6cf8	fix(bundle-5): CI green-up — drop unused sync.Once + document new env vars Two CI gate failures from the Bundle 5 push: 1. golangci-lint (unused) — agent_bootstrap.go declared `var bootstrapWarnOnce sync.Once` but never called .Do(). The one-shot WARN actually lives in cmd/server/main.go (per-process at startup, not per-request) so the handler-side variable was dead code. Dropped the var + sync import; left a comment explaining where the WARN lives. 2. G-3 env-var docs guardrail — Bundle 5 added two new env vars (CERTCTL_AGENT_BOOTSTRAP_TOKEN, CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS) but the G-3 closure CI step asserts every CERTCTL_* env defined in internal/config/config.go is mentioned in docs/features.md. Added three new sub-sections to docs/features.md after the Body Size Limits block: * Agent Bootstrap Token (H-007 contract + generation guidance) * Graceful Shutdown Audit Flush (M-011 timeout knob) * Liveness vs Readiness Probes (H-006 /health vs /ready table) No production behaviour change; pure CI-gate fix. Verification - go vet ./internal/api/handler/... → clean - go test -count=1 -run 'TestVerifyBootstrapToken\|TestRegisterAgent_BootstrapToken' ./internal/api/handler/... → all pass - grep CERTCTL_AGENT_BOOTSTRAP_TOKEN docs/features.md → present - grep CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS docs/features.md → present	2026-04-26 00:03:03 +00:00
shankar0123	85e60b24ec	fix(bundle-5): Operational Liveness + Bootstrap — 4 audit findings closed Closes Audit-2026-04-25 H-006 (High), H-007 (High), M-011 (Medium), L-006 (Low — verified-already-closed via C-1 master closure in v2.0.54). Hardens the orchestrator-facing surface — k8s probes, agent enrollment, shutdown audit drain, scheduler config plumbing. What changed - internal/api/handler/health.go — split contract: * /health stays shallow 200 (k8s liveness — process alive) * /ready accepts sql.DB; runs db.PingContext(2s); 503 on failure Nil DB path returns 200 + db=not_configured (test fixtures) - internal/api/handler/agent_bootstrap.go (NEW) — verifyBootstrapToken: * empty expected = warn-mode pass-through * non-empty = `Authorization: Bearer <token>` required * crypto/subtle.ConstantTimeCompare; length-mismatch path runs dummy compare to keep timing uniform * ErrBootstrapTokenInvalid sentinel - internal/api/handler/agents.go — RegisterAgent calls verifyBootstrapToken BEFORE body parse so unauth probes don't even allocate a JSON decoder - internal/config/config.go — two new env vars: * CERTCTL_AGENT_BOOTSTRAP_TOKEN (Auth.AgentBootstrapToken) * CERTCTL_AUDIT_FLUSH_TIMEOUT_SECONDS (Server.AuditFlushTimeoutSeconds) - cmd/server/main.go — 3 changes: * pass sql.DB into NewHealthHandler (H-006) pass cfg.Auth.AgentBootstrapToken into NewAgentHandler (H-007) * configurable shutdown audit-flush timeout (M-011) * one-shot startup WARN when bootstrap token unset (deprecation) - new tests: agent_bootstrap_test.go (full deny/accept/warn-mode coverage, constant-time compare path, length-mismatch); health_test.go extended with /ready DB-probe failure (503), nil-DB pass-through, /health-shallow L-006 verified - cmd/server/main.go:557 already calls sched.SetShortLivedExpiryCheckInterval(cfg.Scheduler.ShortLivedExpiryCheckInterval) per the C-1 master closure in v2.0.54. Bundle 5 confirms; no code change. Threat model: TB-1 (operator/orchestrator), TB-2 (Agent↔Server). 
- CWE-754 (Improper Check for Unusual or Exceptional Conditions) for H-006 - CWE-306 + CWE-288 (Missing Authentication for Critical Function) for H-007 Verification - go vet ./... → clean - go build ./... → clean - go test -short -count=1 ./... → all packages pass - targeted Bundle-5 regressions → all pass - npx tsc --noEmit (web) → clean - npx vitest run (web) → in-flight (sandbox 45s ceiling exceeded; no failure markers in dot stream; no frontend changes in this bundle so no regression risk) - python3 yaml.safe_load(api/openapi.yaml) → 89 paths Backward compatibility - Bootstrap token defaults to empty (warn-mode) — existing demo deployments unaffected. Server logs deprecation WARN; v2.2.0 will require it. - Audit flush timeout default 30s preserves prior behaviour. - Helm chart already routes readiness probe to /ready (no chart change needed); now /ready actually probes the DB. Bundle 5 of the 2026-04-25 comprehensive audit.	2026-04-25 23:54:18 +00:00
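The bootstrap-token contract from H-007 (empty expected token means warn-mode pass-through, otherwise a Bearer token compared in constant time, with a dummy compare on length mismatch) can be sketched as follows. The signature and body are assumptions for illustration; only the ErrBootstrapTokenInvalid sentinel name and the overall behavior come from the commit message.

```go
package main

import (
	"crypto/subtle"
	"errors"
	"fmt"
	"strings"
)

// ErrBootstrapTokenInvalid is the sentinel named in the commit message.
var ErrBootstrapTokenInvalid = errors.New("bootstrap token invalid")

// verifyBootstrapToken is a sketch with an assumed signature: it takes the raw
// Authorization header value and the configured token.
func verifyBootstrapToken(authHeader, expected string) error {
	if expected == "" {
		return nil // warn-mode: no token configured, pass through
	}
	const prefix = "Bearer "
	if !strings.HasPrefix(authHeader, prefix) {
		return ErrBootstrapTokenInvalid
	}
	got := []byte(strings.TrimPrefix(authHeader, prefix))
	want := []byte(expected)
	if len(got) != len(want) {
		// ConstantTimeCompare returns early on length mismatch, so run a
		// dummy compare of equal-length inputs to keep timing closer to uniform.
		subtle.ConstantTimeCompare(want, want)
		return ErrBootstrapTokenInvalid
	}
	if subtle.ConstantTimeCompare(got, want) != 1 {
		return ErrBootstrapTokenInvalid
	}
	return nil
}

func main() {
	fmt.Println(verifyBootstrapToken("Bearer s3cret", "s3cret"))
	fmt.Println(verifyBootstrapToken("Bearer nope", "s3cret"))
	fmt.Println(verifyBootstrapToken("", ""))
}
```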
shankar0123	1c099071d1	fix(bundle-4): EST/SCEP Attack Surface Hardening — 3 audit findings closed Closes 3 findings (1 High + 1 Medium + 1 Low) from /Users/shankar/Desktop/cowork/comprehensive-audit-2026-04-25/. Bundle 4 hardens the only attack surface reachable by an anonymous network attacker in certctl: the unauthenticated EST + SCEP enrollment endpoints. Findings closed: - H-004 (High): Hand-rolled ASN.1 parser had no fuzz target. The audit's original framing pointed at internal/pkcs7/, but recon confirmed that package is an ASN.1 ENCODER (BuildCertsOnlyPKCS7, ASN1Wrap, ASN1EncodeLength) — not a parser. The actual hand-rolled PKCS#7 PARSING reachable via anonymous network is in internal/api/handler/scep.go::extractCSRFromPKCS7 + parseSignedDataForCSR. Added native go fuzz targets: internal/api/handler/scep_fuzz_test.go::FuzzExtractCSRFromPKCS7 * internal/api/handler/scep_fuzz_test.go::FuzzParseSignedDataForCSR * internal/pkcs7/pkcs7_fuzz_test.go::FuzzPEMToDERChain (defense-in-depth) * internal/pkcs7/pkcs7_fuzz_test.go::FuzzASN1EncodeLength (defense-in-depth) Local 15s fuzz session: 150k execs on FuzzExtractCSRFromPKCS7, 937k on FuzzPEMToDERChain, 925k on FuzzASN1EncodeLength — zero panics. - M-021 (Medium): EST TLS-Unique channel binding (RFC 7030 §3.2.3). Added internal/api/handler/est.go::verifyESTTransport — defense-in-depth TLS pre-conditions (r.TLS != nil; HandshakeComplete; TLS ≥ 1.2). The full §3.2.3 channel binding only applies when EST mTLS is in use; certctl does not currently support EST mTLS, so the §3.2.3 requirement is moot today. RFC 9266 (TLS 1.3 tls-exporter) and EST mTLS are documented as deferred follow-ups in the verifyESTTransport doc comment. - L-005 (Low): EST/SCEP issuer-binding fail-loud at startup. Pre-Bundle-4 cmd/server/main.go validated that CERTCTL_EST_ISSUER_ID and CERTCTL_SCEP_ISSUER_ID existed in the registry but did NOT validate the issuer TYPE could emit a CA cert. 
An operator binding EST to an ACME issuer (whose GetCACertPEM returns explicit error) booted successfully and only failed at first /est/cacerts request. Post-Bundle-4: new preflightEnrollmentIssuer helper calls GetCACertPEM(ctx) at startup with a 10s timeout. Failure logs the connector error + the candidate issuer types and os.Exit(1). Tests added/modified: - internal/api/handler/est_transport_test.go (new) — 5 verifyESTTransport table cases covering plaintext-rejected, incomplete-handshake-rejected, TLS 1.0 rejected, TLS 1.2/1.3 accepted - cmd/server/preflight_test.go (new) — TestPreflightEnrollmentIssuer covering nil-connector, error-from-issuer, empty-PEM, valid cases - internal/api/handler/est_handler_test.go (modified) — 7 POST sites now stamp r.TLS to satisfy the new transport pre-condition - internal/integration/negative_test.go (modified) — setupTestServer wraps the test handler with a fake-TLS-state injector so the EST handler receives r.TLS != nil; production paths still rely on the real TLS listener Threat model reference: TB-11 (EST/SCEP client ↔ Server) per cowork/comprehensive-audit-2026-04-25/threat-model.md. Standards: RFC 7030 §3.2.3, RFC 8894 §3, RFC 5652, RFC 9266 (deferred).	2026-04-25 21:14:41 +00:00
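The three transport pre-conditions listed for verifyESTTransport (r.TLS != nil, HandshakeComplete, TLS ≥ 1.2) can be sketched against a tls.ConnectionState. The function below is a hypothetical reduction that takes the connection state directly instead of an *http.Request, and fakeState exists only to drive the demo; the error messages are invented for illustration.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
)

// verifyTransport mirrors the three checks named in the commit: TLS present,
// handshake complete, and protocol version at least TLS 1.2.
func verifyTransport(state *tls.ConnectionState) error {
	if state == nil {
		return errors.New("EST requires TLS: plaintext request rejected")
	}
	if !state.HandshakeComplete {
		return errors.New("EST requires a completed TLS handshake")
	}
	if state.Version < tls.VersionTLS12 {
		return errors.New("EST requires TLS 1.2 or newer")
	}
	return nil
}

// fakeState builds a connection state for tests and demos (hypothetical helper).
func fakeState(version uint16, complete bool) *tls.ConnectionState {
	return &tls.ConnectionState{Version: version, HandshakeComplete: complete}
}

func main() {
	fmt.Println(verifyTransport(nil))
	fmt.Println(verifyTransport(fakeState(tls.VersionTLS10, true)))
	fmt.Println(verifyTransport(fakeState(tls.VersionTLS13, true)))
}
```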
shankar0123	0e29c416b1	refactor(handler,repo): replace strings.Contains error dispatch with typed sentinels (S-2) Closes one 2026-04-24 audit finding (P2): - cat-s6-efc7f6f6bd50: 30 strings.Contains(err.Error(), ...) sites in internal/api/handler/ — brittle to repository-layer message changes, untyped against the actual failure mode. Approach (Option B from prompt design notes): - New typed sentinels in internal/repository/errors.go: ErrNotFound, ErrForeignKeyConstraint IsForeignKeyError(err) helper (the only place substring matching at the lib/pq boundary is allowed; isolates the DB-driver string knowledge to one function). - New typed sentinel in internal/domain/errors.go: ErrValidation (reserved for future per-entity validation wrappers; not yet used by all handlers). - 49 sites in internal/repository/postgres/*.go updated to wrap sql.ErrNoRows-derived errors via fmt.Errorf("...: %w", repository.ErrNotFound). - 18 not-found handler sites + 2 FK-constraint handler sites refactored to errors.Is(err, repository.ErrNotFound) / repository.IsForeignKeyError(err). - 23 inline `fmt.Errorf("X not found")` test fixtures across handler tests rewrapped to wrap repository.ErrNotFound. - test_utils.go::ErrMockNotFound rewrapped to wrap repository.ErrNotFound; renewal_policy.go closure docblock updated to reflect the new convention. - integration test mockJobRepository.Get wraps repository.ErrNotFound. CI regression guardrail: - .github/workflows/ci.yml::"Forbidden strings.Contains(err.Error()) regression guard (S-2)" greps for the three patterns ("not found", "violates foreign key", "RESTRICT") under internal/api/handler/ and fails the build on regression. Verification: - go build ./... — clean - go vet ./... — clean - go test ./... -short -count=1 — all packages pass (handler + repository + service + integration) - golangci-lint v2.11.4 run ./... 
— 0 issues - S-2 guardrail dry-run on post-fix tree → empty (good) - All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1, F-1, P-1) pass Audit findings closed: - cat-s6-efc7f6f6bd50 (P2) Deferred follow-ups: - 6 domain-specific substring patterns still inline in handlers ("cannot approve", "cannot reject", "cannot be parsed", "no certificates found", "challenge password", "invalid"/ "required" validation chains in profiles + agent_groups). Each needs its own typed sentinel, scoped per service. Documented by the S-2 CI guardrail's allowlist for closure-comments only. - Per-entity not-found sentinels (Option A — ErrCertificateNotFound, ErrAgentNotFound, etc.) deferred. Generic ErrNotFound covers the current dispatch needs; per-entity precision would let handlers return entity-aware error bodies without a domain.Type field, but not blocking.	2026-04-25 17:54:14 +00:00
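The S-2 pattern (repository wraps a typed sentinel with %w, handler dispatches with errors.Is instead of matching on err.Error()) can be sketched as below. ErrNotFound matches the sentinel name in the commit; getCertificate, statusFor, and the in-memory map are hypothetical stand-ins for the repository and handler layers.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrNotFound plays the role of repository.ErrNotFound.
var ErrNotFound = errors.New("not found")

// store is a hypothetical in-memory stand-in for the database.
var store = map[string]string{"mc-1": "example.com"}

// getCertificate wraps the sentinel with %w so callers can use errors.Is
// regardless of how the message text is worded.
func getCertificate(id string) (string, error) {
	cn, ok := store[id]
	if !ok {
		return "", fmt.Errorf("get certificate %q: %w", id, ErrNotFound)
	}
	return cn, nil
}

// statusFor maps an error to an HTTP status without inspecting its message.
func statusFor(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, ErrNotFound):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	_, err := getCertificate("mc-404")
	fmt.Println(statusFor(err), err)
}
```

Because the dispatch keys on the sentinel's identity, rewording the repository-layer message no longer changes the handler's behavior.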
shankar0123	f0865bb051	fix(api,web,mcp): add bulk-renew + bulk-reassign endpoints, drop client-side N×HTTP loops (L-1 master) Two audit findings, both category cat-l, both rooted in web/src/pages/CertificatesPage.tsx. Pre-L-1 the GUI looped per-cert HTTP calls — 100 selected certs = 100 sequential round-trips × ~50–200 ms each = a 5–20-second wedge during which the operator stared at a progress bar. Post-L-1 each workflow is a single POST. cat-l-fa0c1ac07ab5 [P1, primary] — bulk renew loop handleBulkRenewal: for/await triggerRenewal(id) cat-l-8a1fb258a38a [P2] — bulk reassign loop handleReassign: for/await updateCertificate(id, {owner_id}) The bulk-revoke endpoint (POST /api/v1/certificates/bulk-revoke + BulkRevocationCriteria/Result) already existed as the canonical shape in v2.0.x — L-1 ports that pattern to renew + reassign with per-action twists. Backend (Go) - internal/domain/bulk_renewal.go: BulkRenewalCriteria mirrors BulkRevocationCriteria (criteria + IDs modes); BulkRenewalResult envelope adds EnqueuedJobs[] for per-cert {certificate_id, job_id}; shared BulkOperationError type for all bulk paths. - internal/domain/bulk_reassignment.go: narrower shape — IDs-only, owner_id required, team_id optional. - internal/service/bulk_renewal.go::BulkRenewalService.BulkRenew: resolves criteria → status filter (Archived/Revoked/Expired/ RenewalInProgress all silent-skip) → per-cert status flip + job create. Keygen-mode-aware so jobs land in the same initial status as single-cert TriggerRenewal. Single bulk audit event per call, not N. - internal/service/bulk_reassignment.go::BulkReassignmentService. BulkReassign: validates owner_id upfront via the ErrBulkReassignOwnerNotFound typed sentinel — non-existent owner returns 400 before any cert is touched. Already-owned-by-target is silent-skip. Single bulk audit event. - internal/api/handler/{bulk_renewal,bulk_reassignment}.go: HTTP shape mirrors bulk_revocation.go. 
NOT admin-gated (renew is non-destructive; reassign is a common-case workflow). Sentinel-error → 400 mapping for OwnerNotFound. - internal/api/router/router.go: three bulk-* routes registered as a block before the {id} routes. HandlerRegistry gains BulkRenewal + BulkReassignment fields. - cmd/server/main.go: NewBulkRenewalService threads cfg.Keygen.Mode so bulk-renew jobs land in same initial state as single-cert path. Frontend - web/src/api/client.ts: bulkRenewCertificates(criteria) + bulkReassignCertificates(request) functions with full TS types. - web/src/pages/CertificatesPage.tsx: handleBulkRenewal + handleReassign rewritten from N-call loops to single calls. Result envelope drives progress UI; first-error message surfaced when total_failed > 0. Stale triggerRenewal + updateCertificate imports removed. MCP - internal/mcp/types.go: BulkRenewCertificatesInput + BulkReassignCertificatesInput. - internal/mcp/tools.go: certctl_bulk_renew_certificates + certctl_bulk_reassign_certificates tools mirroring the existing certctl_bulk_revoke_certificates pattern. OpenAPI - api/openapi.yaml: two new operations (bulkRenewCertificates, bulkReassignCertificates) under Certificates tag. Four new schemas (BulkRenewRequest, BulkRenewResult, BulkEnqueuedJob, BulkReassignRequest, BulkReassignResult). Tests - Domain: BulkRenewalCriteria.IsEmpty + BulkReassignmentRequest.IsEmpty contracts; JSON round-trip shape pinning. - Service: 7 BulkRenew tests (happy/criteria-mode/skips-RenewalInProgress/skips-revoked-archived/empty-criteria-error/partial-failure/audit-event-emitted) + 8 BulkReassign tests (happy/skips-already-owned/owner-required/empty-IDs/owner-not-found-sentinel/team-id-optional/team-id-provided/partial-failure/audit-event-emitted).
- Handler: 5 BulkRenew handler tests (happy/empty-body-400/wrong-method-405/actor-attribution/service-error-500) + 6 BulkReassign handler tests (happy/empty-IDs-400/missing-owner-400/owner-not-found-400-via-sentinel/wrong-method-405/generic-error-500). CI guardrail - .github/workflows/ci.yml: 'Forbidden client-side bulk-action loop regression guard (L-1)'. Greps web/src/pages/CertificatesPage.tsx for 'for(...) await triggerRenewal(...)' and 'for(...) await updateCertificate(...)' patterns; comment lines exempt; test files exempt. Verified locally (passes against post-fix tree, fires against synthetic regression). Counts (deltas) - Routes: 119 → 121 (+2) - OpenAPI operations: 123 → 125 (+2) - MCP tools: 83 → 85 (+2) Performance - 100-cert bulk-renew: ~10s of sequential HTTP → ~100ms (99% latency reduction on the canonical operator workflow). - Audit event volume: 1 + N per operation → 1. Out of scope (deferred follow-ups) - cat-b-31ceb6aaa9f1: updateOwner/updateTeam/updateAgentGroup orphan (different shape — wire existing PUT to GUI, not new bulk endpoint). - cat-k-e85d1099b2d7: CertificatesPage no pagination UI. - cat-i-b0924b6675f8: MCP missing claim/dismiss/acknowledge (L-1 added 2 new tools but does not close that finding). Verification - go build / vet / test -short / test -short -race all clean. - web tsc --noEmit + vitest run all clean (296 tests passing). - OpenAPI YAML parses (89 paths, 125 ops). - L-1 CI guardrail passes against post-fix tree, fires against synthetic regression. No push.	2026-04-25 14:33:02 +00:00
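The partial-failure envelope that the bulk endpoints return (success and failure counters plus per-certificate errors) can be sketched as a single server-side loop. The type and field names below are hypothetical simplifications, except total_failed, which the commit message mentions; failOn is a demo-only helper that simulates one failing certificate.

```go
package main

import (
	"errors"
	"fmt"
)

// BulkOperationError pairs a certificate ID with its failure message.
type BulkOperationError struct {
	CertificateID string `json:"certificate_id"`
	Error         string `json:"error"`
}

// BulkResult is a simplified, hypothetical result envelope.
type BulkResult struct {
	TotalMatched   int                  `json:"total_matched"`
	TotalSucceeded int                  `json:"total_succeeded"`
	TotalFailed    int                  `json:"total_failed"`
	Errors         []BulkOperationError `json:"errors,omitempty"`
}

// bulkApply runs op for every ID and records failures instead of aborting,
// so one bad certificate does not block the rest of the batch.
func bulkApply(ids []string, op func(id string) error) BulkResult {
	res := BulkResult{TotalMatched: len(ids)}
	for _, id := range ids {
		if err := op(id); err != nil {
			res.TotalFailed++
			res.Errors = append(res.Errors, BulkOperationError{CertificateID: id, Error: err.Error()})
			continue
		}
		res.TotalSucceeded++
	}
	return res
}

// failOn returns an op that fails only for the given ID (demo helper).
func failOn(bad string) func(string) error {
	return func(id string) error {
		if id == bad {
			return errors.New("issuer unavailable")
		}
		return nil
	}
}

func main() {
	res := bulkApply([]string{"mc-1", "mc-2", "mc-3"}, failOn("mc-2"))
	fmt.Printf("%+v\n", res)
}
```

A client receives one response for the whole batch and can correlate each entry in the errors list to its certificate ID.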
shankar0123	a3d8b9c607	fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master) GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build); postgres reported unhealthy indefinitely, dependent containers never started. Root cause: deploy/docker-compose.yml mounted a hand-curated subset of migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/. Postgres applied them at initdb time. Once seed.sql referenced columns added by migrations after the mounted cutoff (e.g., policy_rules.severity from migration 000013), initdb crashed mid-seed and the container loop wedged. Two sources of truth (compose mount list vs in-tree migration ladder) diverged the moment a seed-touching migration shipped, and the only thing that fixed it was hand-editing the compose file every release. Fix: remove the dual source. Postgres boots empty; the server applies migrations + seed at startup via RunMigrations + RunSeed. Helm has used this pattern since day one (postgres-init emptyDir); compose now matches. Bundled with four ride-along audit findings whose fixes share the same schema/db code surface, so operators take the schema-change pain only once: cat-u-seed_initdb_schema_drift [P1, primary] — initdb-mount fix cat-o-retry_interval_unit_mismatch [P1] — column rename minutes→seconds cat-o-notification_created_at_dead_field [P2] — add column + populate cat-o-health_check_column_orphans [P1] — drop unwired columns cat-u-no_version_endpoint [P2] — add /api/v1/version Single migration (000017_db_coupling_cleanup) bundles the three schema changes under a DO $$ guard so re-application is safe; reduces operator-visible 'schema-change releases' from four to one. Backend - internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed (gated by CERTCTL_DEMO_SEED).
Both idempotent (ON CONFLICT DO NOTHING in every shipped INSERT) so repeated boots are safe; missing-file is no-op so custom packaging that strips seeds still boots cleanly. - cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set) immediately after RunMigrations. - internal/repository/postgres/notification.go: NotificationRepository.Create now sets created_at (with time.Now() fallback when caller leaves it zero); scanNotification reads it back; List + ListRetryEligible SELECT extended. - internal/repository/postgres/renewal_policy.go: column references updated to retry_interval_seconds across SELECT/INSERT/UPDATE sites. - internal/api/handler/version.go: new VersionHandler exposes {version, commit, modified, build_time, go_version} from runtime/debug.ReadBuildInfo() with ldflags-supplied Version override. - internal/api/router/router.go: register GET /api/v1/version through the no-auth chain (CORS + ContentType) alongside /health, /ready, /api/v1/auth/info. - cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit ExcludePaths so rollout polling doesn't dominate the audit trail. - internal/config/config.go: add DatabaseConfig.DemoSeed + CERTCTL_DEMO_SEED env var. Migration - migrations/000017_db_coupling_cleanup.up.sql + .down.sql: (1) renewal_policies.retry_interval_minutes → retry_interval_seconds (DO \$\$ guard, idempotent re-application) (2) notification_events ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() (3) network_scan_targets DROP orphan health_check_enabled + health_check_interval_seconds - migrations/seed.sql: column reference updated to retry_interval_seconds. - migrations/seed_demo.sql: same column rename + applied at runtime now via RunDemoSeed (no longer initdb-mounted). Compose - deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files + seed.sql); add start_period: 30s to postgres + certctl-server healthchecks to absorb the runtime migration + seed application window on first boot. 
- deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount removed; that file never existed); same healthcheck start_period. - deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with CERTCTL_DEMO_SEED=true env var on certctl-server. Tests - internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo, TestVersion_RejectsNonGet, TestVersion_LdflagsOverride. - internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently, TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently, TestMigration000017_RetryIntervalRename, TestMigration000017_NotificationCreatedAt, TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips). - internal/repository/postgres/notification_test.go: TestNotificationRepository_CreatedAt_IsPersisted + TestNotificationRepository_CreatedAt_DefaultsToNow. CI guardrail - .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb (U-3)' step grep-fails the build if any migrations/*.sql or seed.sql re-appears in /docker-entrypoint-initdb.d in any compose file. Catches future drift before a fresh-clone operator hits it. Spec / Docs - api/openapi.yaml: add /api/v1/version operation under Health tag. - docs/architecture.md: replace the 'initdb may run the same SQL' paragraph with a post-U-3 single-source-of-truth explanation. - CHANGELOG.md: full unreleased-section entry covering all 5 closures, breaking changes, and the new env var. Audit doc - coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14 cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to ✅ RESOLVED with closure prose pointing at this commit. Verification: build/vet/test -short -race all clean across all touched packages locally; govulncheck reports 0 vulnerabilities affecting our code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the post-fix tree.	2026-04-25 13:29:23 +00:00
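The /api/v1/version endpoint described in the U-3 commit above can be sketched roughly as follows. This is a minimal illustration, not the actual internal/api/handler/version.go: the names `collectVersionInfo` and `versionHandler`, the `Modified` field being a boolean, and the mapping of `build_time` to the `vcs.time` build setting are all assumptions. Only the five JSON field names, the use of runtime/debug.ReadBuildInfo(), and the ldflags-supplied Version override come from the commit message.

```go
package main

import (
	"encoding/json"
	"net/http"
	"runtime/debug"
)

// Version is intended to be set at build time, for example with
// go build -ldflags "-X main.Version=v2.0.51". The variable name and
// package path are assumptions for this sketch.
var Version = ""

// versionInfo carries the five fields named in the commit message.
type versionInfo struct {
	Version   string `json:"version"`
	Commit    string `json:"commit"`
	Modified  bool   `json:"modified"`
	BuildTime string `json:"build_time"`
	GoVersion string `json:"go_version"`
}

// collectVersionInfo reads the embedded build info and then applies the
// ldflags override, so an explicit Version always wins.
func collectVersionInfo() versionInfo {
	info := versionInfo{Version: "dev"}
	if bi, ok := debug.ReadBuildInfo(); ok {
		info.GoVersion = bi.GoVersion
		if bi.Main.Version != "" {
			info.Version = bi.Main.Version
		}
		for _, s := range bi.Settings {
			switch s.Key {
			case "vcs.revision":
				info.Commit = s.Value
			case "vcs.time": // assumed source for build_time
				info.BuildTime = s.Value
			case "vcs.modified":
				info.Modified = s.Value == "true"
			}
		}
	}
	if Version != "" {
		info.Version = Version
	}
	return info
}

// versionHandler serves the build info as JSON and rejects non-GET methods.
func versionHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		w.Header().Set("Allow", http.MethodGet)
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(collectVersionInfo())
}
```

Registering it would look like `mux.HandleFunc("/api/v1/version", versionHandler)` on a standard http.ServeMux; how certctl's router wraps it in the no-auth chain is not shown here.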
shankar0123	87213128cc	fix(security,domain): redact Agent.APIKeyHash from JSON wire shape (G-2) Pre-G-2 internal/domain/connector.go::Agent::APIKeyHash was tagged `json:"api_key_hash"` and shipped on every wire surface that returned domain.Agent — GET /api/v1/agents (PagedResponse{Data: agents}), GET /api/v1/agents/{id}, GET /api/v1/agents/retired, and the POST /api/v1/agents registration response. Every authenticated client (browser, CLI --json, MCP tool calls) received the SHA-256-of-the-API-key string. The browser silently dropped it because web/src/api/types.ts omits the field, but CLI and MCP consumers print full JSON so the hash was visible there. Even though the value is a hash and not the plaintext key, shipping it gives an attacker an offline brute-force target if the API-key entropy is low (certctl doesn't enforce a minimum on operator- supplied keys), and there's no business reason for any client to ever receive it — the value is server-internal, used only for the lookup at internal/repository/postgres/agent.go::GetByAPIKey. (Audit: cat-s5-apikey_leak in coverage-gap-audit-2026-04-24-v5/unified-audit.md.) We chose the audit's recommended fix (json:"-") plus a defense-in-depth MarshalJSON plus a CI guardrail. Three layers because struct-tag redaction alone is one rebase away from being silently reverted, the custom MarshalJSON catches the case where a parent struct embeds Agent under a different tag, and the CI grep blocks reintroduction at the spec or frontend boundary even without a code review catching it. Files changed: Phase 1 — Domain redaction: - internal/domain/connector.go: APIKeyHash tag flipped from `json:"api_key_hash"` to `json:"-"`. New Agent.MarshalJSON with value receiver + type-alias-recursion-break that explicitly zeroes APIKeyHash on the marshal-time copy. 
Long-form docblock explaining the G-2 closure rationale + cross-references to service.RegisterAgent (populator), repository.AgentRepository::GetByAPIKey (consumer), docs/architecture.md (DB-shape vs API-shape distinction), and the audit finding. Phase 2 — Domain tests (5 test functions): - internal/domain/connector_test.go: TestAgent_MarshalJSON_RedactsAPIKeyHash pins the marshal-boundary contract on a value receiver. ...RedactsViaPointer pins the *Agent path. ...RedactsInSlice pins the []Agent path that the ListAgents handler actually emits via PagedResponse. ...DoesNotMutateReceiver pins the by-value-receiver contract so a future refactor that switches to pointer-receiver gets caught. ...RoundTrip pins the wire-shape guarantee that APIKeyHash is dropped on encode and cannot reappear on decode. Single sentinel value ("sha256:LEAKED-CREDENTIAL-DERIVATIVE-SENTINEL") flows through every fixture for grep-ability on regression. Phase 3 — Handler tests (4 test functions): - internal/api/handler/agent_handler_test.go: TestListAgents_DoesNotLeakAPIKeyHash, TestGetAgent_DoesNotLeakAPIKeyHash, TestRegisterAgent_DoesNotLeakAPIKeyHash, TestListRetiredAgents_DoesNotLeakAPIKeyHash. Each asserts (a) the literal substring "api_key_hash" is absent from the httptest-captured body, (b) the leak sentinel value is absent, (c) the non-leaked fields ARE present (sanity that the handler is serving real data, not just empty payloads). Shared sentinel "sha256:LEAKED-CREDENTIAL-DERIVATIVE-HANDLER-SENTINEL" so a single grep over a failing test's output identifies the leak surface immediately. Phase 4 — Spec / docs: - api/openapi.yaml: api_key_hash property REMOVED from Agent schema (was at line 3690). Inline G-2 comment naming the closure + the database-vs-API-shape distinction so a future spec edit doesn't silently re-introduce the field. - docs/architecture.md: ER-diagram block already documents the agents table including api_key_hash (DB shape — correct). 
Added a sibling note paragraph immediately below the diagram explaining that several columns are intentionally server-internal (api_key_hash redaction + issuers.config / deployment_targets.config encrypted shadow), with cross-references to the redaction enforcement site, the OpenAPI schema, the frontend interface, and the CI guardrail. - web/src/api/types.ts: Agent interface unchanged in shape (already omitted the field) but added a leading comment block explaining WHY the omission is intentional — stops a future frontend dev from "completing" the interface from the OpenAPI spec or the Go struct. Phase 5 — CI guardrail: - .github/workflows/ci.yml: new "Forbidden api_key_hash JSON-shape regression guard (G-2)" step. Scoped patterns catch the actual regression shapes — Go struct tag (json:"api_key_hash"), frontend interface declaration, OpenAPI schema property, YAML enum/array membership. Repository / migration / seed / service / integration / unit-test / comment lines exempt. Verified locally on the real tree (passes) and against 4 synthetic regression patterns (each fires the guardrail). Mirrors the G-1 pattern from .github/workflows/ ci.yml lines 47-108. Phase 5b — Sweep verification (no changes, results documented for the next reader): - internal/api/middleware/audit.go: doesn't serialize Agent struct; records request body only. No leak. - service.RegisterAgent audit-event payload: `map[string]interface{}{ "name": name, "hostname": hostname}` — name + hostname only, no APIKeyHash. No leak. - All 9 slog sites that mention agent: scalar attrs only ("agent_id", "error", "agent_hostname"), never the full struct. No leak. - internal/mcp, internal/cli, cmd/cli, cmd/mcp-server: zero matches for APIKeyHash / api_key_hash. Both pass server JSON verbatim, so the wire-side fix transitively closes them. Verification (all gates pass): - go build ./... - go vet ./... - go test -short ./... — every package green - go test -short -race ./internal/domain/... 
./internal/api/handler/... — clean - govulncheck ./... — no vulnerabilities in our code - helm lint deploy/helm/certctl/ — clean - helm template smoke render — succeeds - python3 yaml.safe_load on api/openapi.yaml — parses - OpenAPI Agent schema scan: no api_key_hash property - CI guardrail mirror: clean on real tree, fires on all 4 synthetic regression patterns - Domain pkg coverage: Agent.MarshalJSON 100%, connector.go total 87.5% - Handler pkg coverage: 79.2% Sample response body (httptest captured during verification, GET /api/v1/agents/{id} via the new handler test): {"id":"agent-demo","name":"demo-agent","hostname":"demo.host", "status":"Online","last_heartbeat_at":"2026-04-24T11:59:30Z", "registered_at":"2026-04-24T12:00:00Z","os":"linux", "architecture":"amd64","ip_address":"10.0.0.42", "version":"v2.0.49"} Note the absence of any api_key_hash key, even though the in-memory struct passed to the handler had APIKeyHash set to a sentinel. Out of scope (intentionally untouched): - internal/repository/postgres/agent.go SELECT/INSERT/UPDATE/scan paths and GetByAPIKey lookup — DB column stays, repo still populates the struct, auth lookup still works. The redaction is a marshal-boundary concern. - migrations/000001_initial_schema.up.sql + migrations/seed_*.sql — DB schema and seed data unchanged. - internal/service/agent.go::RegisterAgent — service-side hashing and persistence unchanged. - Other domain types with potential credential-derivative fields (Issuer.Config, DeploymentTarget.Config, notifier configs). Not flagged by the audit; some are already protected (e.g., DeploymentTarget.EncryptedConfig []byte `json:"-"`). File a separate audit pass if recon surfaces additional leaks. - Per-resource DTO layer across every handler. Single audit finding, single domain type. - A separate possible follow-up: the v2 RegisterAgent endpoint doesn't return the plaintext API key to the agent, which may mean self-bootstrap via POST /api/v1/agents is broken. 
Verified during recon; out of scope for G-2; should be its own ticket. Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-s5-apikey_leak Audit recommendation: 'json:"-" or API-response DTO excluding APIKeyHash' — went with the json:"-" + MarshalJSON defense-in-depth pair plus CI guardrail and structural docs.	2026-04-25 01:56:26 +00:00
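The two-layer redaction from the G-2 commit above (a `json:"-"` tag plus a value-receiver MarshalJSON that breaks recursion through a local type) can be illustrated with a reduced struct. This is a sketch under assumptions: the real domain.Agent has many more fields, and the `agentAlias` and `encode` names are hypothetical helpers introduced only for this example.

```go
package main

import "encoding/json"

// Agent is a cut-down stand-in for domain.Agent with only three fields.
type Agent struct {
	ID         string `json:"id"`
	Name       string `json:"name"`
	APIKeyHash string `json:"-"` // first layer: the tag hides the field
}

// MarshalJSON uses a value receiver, so Agent, *Agent and []Agent all go
// through it, and the zeroing below only touches the local copy.
func (a Agent) MarshalJSON() ([]byte, error) {
	// agentAlias has Agent's fields but none of its methods, so marshalling
	// it does not call this method again (the recursion break).
	type agentAlias Agent
	a.APIKeyHash = "" // second layer: explicit zeroing on the copy
	return json.Marshal(agentAlias(a))
}

// encode is a small helper that returns the JSON text for any value.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}
```

With `a := Agent{ID: "agent-1", Name: "demo", APIKeyHash: "sha256:secret"}`, each of `encode(a)`, `encode(&a)` and `encode([]Agent{a})` omits the hash, and `a.APIKeyHash` is still set afterwards because the receiver is a copy.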
shankar0123	9c1d446e40	fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1) The pre-G-1 config validator accepted CERTCTL_AUTH_TYPE=jwt and the startup log faithfully echoed 'authentication enabled type=jwt'. Reasonable people read that and concluded JWT auth was on. It wasn't. The auth-middleware wiring at cmd/server/main.go unconditionally routed every request through the api-key bearer middleware regardless of cfg.Auth.Type. So CERTCTL_AUTH_TYPE=jwt quietly compared the incoming 'Authorization: Bearer <token>' against whatever string the operator put in CERTCTL_AUTH_SECRET — real JWT clients got 401, and operators who treated CERTCTL_AUTH_SECRET as a signing secret (because they thought they were configuring JWT) had effectively handed an attacker an api-key. A security finding masquerading as a config option. We chose the audit-recommended structural fix: remove the option, fail fast at startup, and add the gateway-fronting pattern as the documented forward path. Implementing JWT middleware would have meant jwks vs static-secret rotation, claim mapping, expiry enforcement, audience and issuer validation, key rollover semantics, and regression coverage at the same depth as the existing api-key path — a feature, not a fix. Operators who genuinely need JWT/OIDC front certctl with an authenticating gateway (oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia) and run the upstream certctl with CERTCTL_AUTH_TYPE=none. Same shape works on docker-compose and Helm. The change is comprehensive across 7 phases — every surface that mentioned 'jwt' as a certctl-auth-type is updated, plus structural backstops (typed enum, runtime guard, helm template validation, CI grep guard) so the lie can't reappear. Files changed: Phase 1 — production code (typed enum + jwt removal): - internal/config/config.go: AuthType typed alias + AuthTypeAPIKey / AuthTypeNone constants + ValidAuthTypes() helper. 
Validate() routes literal 'jwt' through a dedicated multi-line diagnostic naming the authenticating-gateway pattern, then cross-checks against ValidAuthTypes(). Secret-required branch simplified to api-key-only. Field comment on AuthConfig.Type rewritten to drop jwt and point at the gateway pattern. - internal/api/middleware/middleware.go: AuthConfig.Type field comment references the typed config.AuthType constants. - internal/api/handler/health.go: same treatment for HealthHandler.AuthType. - cmd/server/main.go: defense-in-depth runtime switch immediately after config.Load() — exits 1 on any unsupported auth-type that bypassed the validator. Auth-disabled startup log explicitly names the authenticating-gateway pattern. Phase 2 — tests (Red→Green, contract pinning): - internal/config/config_test.go: TestValidate_JWTAuth_RejectedDedicated (two table rows pinning the dedicated G-1 error fires regardless of whether Secret is set), TestValidAuthTypesDoesNotContainJWT (property guard against future re-introduction), TestValidAuthTypesIsExactly_APIKey_None (allowed-set contract), TestValidate_GenericInvalidAuthType (pins non-jwt invalid values still hit the generic invalid-auth-type error). Removed the prior TestValidate_JWTAuth_MissingSecret happy-path since its premise is inverted post-G-1. - internal/api/handler/health_test.go: removed TestAuthInfo_ReturnsAuthType_JWT (which baked the silent-downgrade lie into the regression suite). Pre-existing _APIKey test continues to cover the api-key happy path. Phase 3 — spec, docs, env templates: - api/openapi.yaml: auth_type enum dropped to [api-key, none] with inline comment naming the G-1 closure. - .env.example (root): CERTCTL_AUTH_TYPE comment block rewritten to drop jwt and point at the gateway pattern; secret-required conditional simplified to api-key-only. 
- docs/architecture.md: middleware-stack bullet rewritten to drop the JWT mention; new H3 'Authenticating-gateway pattern (JWT, OIDC, mTLS)' section explaining the design rationale and listing oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium / Authelia / Caddy forward_auth / Apache mod_auth_openidc / nginx auth_request as the standard fronting options. - docs/upgrade-to-v2-jwt-removal.md (new ~125 lines): migration guide with preconditions, what-changes, both recovery paths, complete docker-compose oauth2-proxy walkthrough, Traefik ForwardAuth and Envoy ext_authz patterns, rollback posture. Phase 4 — Helm chart (template validation + docs): - deploy/helm/certctl/templates/_helpers.tpl: new certctl.validateAuthType helper mirroring the existing certctl.tls.required pattern. Fails template render on any server.auth.type outside {api-key, none} with a multi-line diagnostic. - deploy/helm/certctl/templates/server-deployment.yaml, server-configmap.yaml, server-secret.yaml: invoke the helper at the top of each template that depends on .Values.server.auth.type. - deploy/helm/certctl/values.yaml: auth: block comment expanded with the G-1 rationale and gateway-pattern cross-reference. - deploy/helm/CHART_SUMMARY.md: server.auth.type table row now surfaces the allowed set and points at the upgrade doc. - deploy/helm/certctl/README.md: new 'JWT / OIDC via authenticating gateway' section with a Kubernetes-flavored oauth2-proxy + certctl walkthrough. Phase 5 — release surface: - CHANGELOG.md: new [unreleased] top entry with Breaking / Removed / Added / Changed sections; explicit pointer at docs/upgrade-to-v2-jwt-removal.md from the Breaking subsection. Phase 6 — CI guardrail: - .github/workflows/ci.yml: new 'Forbidden auth-type literal regression guard (G-1)' step. Scoped patterns catch the actual regression shapes (map literal, slice literal, switch case, OpenAPI enum, env-file default, AuthType('jwt') cast). 
Comments and the dedicated rejection branch are intentionally exempt; connector-package JWT references (Google OAuth2 / step-ca) are exempt as out-of-scope external protocols. Verified locally: the guard passes on the actual tree and fires on all 4 synthetic regression patterns. Out of scope (explicitly untouched): - internal/connector/discovery/gcpsm/gcpsm.go — Google OAuth2 service- account JWT (external protocol). - internal/connector/issuer/googlecas/googlecas.go — same. - internal/connector/issuer/stepca/stepca.go — step-ca's provisioner one-time-token JWT for /sign API. - docs/test-env.md, docs/connectors.md, docs/features.md — describe external CAs' use of JWT, not certctl's auth shape. - Implementing actual JWT middleware. Feature, not a fix. Verification (all gates pass): - go build ./... — clean - go vet ./... — clean - go test -short ./... — every package green - go test -short -race ./internal/config/... ./internal/api/... — clean - govulncheck ./... — no vulnerabilities in our code - helm lint deploy/helm/certctl/ — clean - helm template with auth.type=api-key — renders OK - helm template with auth.type=none — renders OK - helm template with auth.type=jwt — fails with validateAuthType diagnostic (exit 1) - python3 yaml.safe_load on api/openapi.yaml — parses - CI guardrail mirror — clean on real tree, fires on all 4 synthetic regression patterns - Smoke test: 'CERTCTL_AUTH_TYPE=jwt ./certctl-server' exits non-zero with: 'Failed to load configuration: CERTCTL_AUTH_TYPE=jwt is no longer accepted (G-1 silent auth downgrade): no JWT middleware ships with certctl. To use JWT/OIDC, run an authenticating gateway (oauth2-proxy / Envoy ext_authz / Traefik ForwardAuth / Pomerium) in front of certctl and set CERTCTL_AUTH_TYPE=none on the upstream. See docs/architecture.md "Authenticating-gateway pattern" and docs/upgrade-to-v2-jwt-removal.md for the migration walkthrough' config pkg coverage: ValidAuthTypes 100%, Validate 94.7%, total 75.5%. 
Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md §2 P1 cluster, cat-g-jwt_silent_auth_downgrade Audit recommendation followed verbatim: 'Remove jwt from validAuthTypes until middleware ships'.	2026-04-25 00:22:23 +00:00
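The typed-enum validation from the G-1 commit above might look roughly like the sketch below. The identifiers AuthType, AuthTypeAPIKey, AuthTypeNone and ValidAuthTypes come from the commit message; the `validateAuth` function, its signature, and the error wording are assumptions made for illustration and are much shorter than the real multi-line diagnostic.

```go
package main

import "fmt"

// AuthType is the typed string used for the auth-type setting.
type AuthType string

const (
	AuthTypeAPIKey AuthType = "api-key"
	AuthTypeNone   AuthType = "none"
)

// ValidAuthTypes returns the full allowed set. "jwt" is deliberately absent.
func ValidAuthTypes() []AuthType {
	return []AuthType{AuthTypeAPIKey, AuthTypeNone}
}

// validateAuth is a hypothetical stand-in for the auth part of Validate().
// It gives the removed value its own diagnostic, then checks the allowed
// set, and requires a secret only for api-key.
func validateAuth(t AuthType, secret string) error {
	if t == "jwt" {
		return fmt.Errorf("CERTCTL_AUTH_TYPE=jwt is no longer accepted: " +
			"run an authenticating gateway in front of certctl and set CERTCTL_AUTH_TYPE=none")
	}
	for _, valid := range ValidAuthTypes() {
		if t != valid {
			continue
		}
		if t == AuthTypeAPIKey && secret == "" {
			return fmt.Errorf("auth type %q requires a secret", t)
		}
		return nil
	}
	return fmt.Errorf("invalid auth type %q (allowed: %v)", t, ValidAuthTypes())
}
```

Checking the removed literal first keeps the migration hint separate from the generic invalid-value error, which matches the two distinct test cases the commit describes.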
shankar0123	9834b4e4a4	G-1: renewal-policies API + frontend FK-drift fix Three frontend call sites (OnboardingWizard.tsx:603, CertificatesPage.tsx:52, CertificateDetailPage.tsx:169) populated the renewal_policy_id dropdown from getPolicies() — the compliance-rule endpoint returning pol-* IDs — which violated the FK managed_certificates.renewal_policy_id REFERENCES renewal_policies(id) ON DELETE RESTRICT. Create would fail pg 23503 at insert. Backend (new): - RenewalPolicyRepository CRUD + ListAll/ExistsByID (pg 23503 → ErrRenewalPolicyInUse → HTTP 409; pg 23505 → ErrRenewalPolicyDuplicateName → HTTP 409) - RenewalPolicyService with repo-only constructor. Service sentinels var-alias the repo sentinels so errors.Is walks across layers. - RenewalPolicyHandler with validation bounds: name 1–255; renewal_window_days [1,365] default 30; max_retries [0,10] not defaulted; retry_interval_seconds [60,86400] default 3600; alert_thresholds_days [0,365] default [30,14,7,0]. Auto-generated IDs rp-<slug(name)>. - Router registers 5 routes under /api/v1/renewal-policies[/{id}]. Frontend: - CertificatesPage/CertificateDetailPage/OnboardingWizard now call getRenewalPolicies() and render rp-* IDs. - client.ts adds getRenewalPolicies/createRenewalPolicy/updateRenewalPolicy/ deleteRenewalPolicy. types.ts adds the RenewalPolicy shape. OpenAPI: RenewalPolicies tag + 5 operations + 3 schemas (RenewalPolicy, RenewalPolicyCreateRequest, RenewalPolicyUpdateRequest). 409 responses on create/update duplicate-name and delete FK-in-use. No migration — renewal_policies table already exists from the initial schema (000001). Tests: - internal/service/renewal_policy_test.go: CRUD + validation + sentinel error wrapping. - internal/api/handler/renewal_policy_handler_test.go: handler endpoint contracts including 400/404/409. - web/src/api/client.test.ts: 4 subtests covering the 4 new API functions. 
Phase 3 gates all green: go vet, build, short tests, race tests (service/handler/router/scheduler), staticcheck (G-1 packages), govulncheck (0 reachable), coverage (service 69.7%, handler 79.0%, domain 86.9%, middleware 80.6% — all above thresholds), tsc, vitest (256 passed), vite build, OpenAPI structural validation.	2026-04-20 18:53:01 +00:00
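The handler validation bounds listed in the renewal-policies commit above can be sketched as a single normalize-and-check function. The struct and function names are hypothetical, and two behaviors are assumptions not stated in the commit: a zero value is treated as "not supplied" for the defaulted fields, and the name length is counted in bytes. alert_thresholds_days is left out to keep the example short.

```go
package main

import "fmt"

// renewalPolicyInput is a hypothetical request shape for this sketch.
type renewalPolicyInput struct {
	Name                 string
	RenewalWindowDays    int
	MaxRetries           int
	RetryIntervalSeconds int
}

// normalizeRenewalPolicy applies the defaults, then enforces the bounds
// quoted in the commit message.
func normalizeRenewalPolicy(in renewalPolicyInput) (renewalPolicyInput, error) {
	if n := len(in.Name); n < 1 || n > 255 {
		return in, fmt.Errorf("name length must be 1-255, got %d", n)
	}
	if in.RenewalWindowDays == 0 { // assumed: zero means unset
		in.RenewalWindowDays = 30
	}
	if in.RenewalWindowDays < 1 || in.RenewalWindowDays > 365 {
		return in, fmt.Errorf("renewal_window_days must be in [1,365], got %d", in.RenewalWindowDays)
	}
	// max_retries is not defaulted, so zero is a legitimate value.
	if in.MaxRetries < 0 || in.MaxRetries > 10 {
		return in, fmt.Errorf("max_retries must be in [0,10], got %d", in.MaxRetries)
	}
	if in.RetryIntervalSeconds == 0 { // assumed: zero means unset
		in.RetryIntervalSeconds = 3600
	}
	if in.RetryIntervalSeconds < 60 || in.RetryIntervalSeconds > 86400 {
		return in, fmt.Errorf("retry_interval_seconds must be in [60,86400], got %d", in.RetryIntervalSeconds)
	}
	return in, nil
}
```

A handler would typically map a non-nil error from this function to HTTP 400, leaving 409 for the repository-level duplicate-name and in-use sentinels.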
shankar0123	4e5522a999	F-001/F-002/F-003: CRL prefix-scan, digest error sanitization, ctx-aware sleeps F-001 (P3): GenerateDERCRL scoped to issuer via composite index - Add RevocationRepository.ListByIssuer leveraging migration 000012's idx_certificate_revocations_issuer_serial composite index as a prefix-scan target. Previously CAOperationsSvc.GenerateDERCRL called ListAll() and filtered by IssuerID in Go — O(total revocations) regardless of how many revocations belonged to the target issuer. - Rewrite GenerateDERCRL to call ListByIssuer(ctx, issuerID) so PostgreSQL drives a prefix scan of the composite index. Drops the in-memory filter. - New regression test in ca_operations_test.go asserts the CRL hot path invokes ListByIssuer exactly once and never ListAll, and that the issuerID is threaded through correctly. F-002 (P3): digest.go admin-auth endpoints no longer leak internal errors - PreviewDigest (GET /api/v1/digest/preview) and SendDigest (POST /api/v1/digest/send) previously wrote err.Error() into the HTTP response body on 500s. Replace with slog.Error server-side logging plus a generic "internal error" response body, matching the house pattern in certificates.go and export.go. F-003 (P4): three blocking time.Sleep sites now honor ctx cancellation - internal/connector/issuer/acme/acme.go:672 (DNS-01 propagation wait) now runs under a select{case <-ctx.Done(): CleanUp + return ctx.Err(); case <-time.After(d):} so graceful shutdown doesn't get stuck behind the propagation delay. - internal/connector/issuer/acme/acme.go:786 (dns-persist-01 propagation wait) same pattern, returns ctx.Err() on cancel. - cmd/agent/main.go:272 (polling backoff inside the heartbeat loop) now wraps the sleep in select{case <-ctx.Done(): continue; case <-time.After(backoff):} so the outer <-ctx.Done() case on the parent loop fires cleanly. Verification: build, vet, and race-enabled short tests green across all 55+ packages. govulncheck reports zero vulnerabilities in the code path. 
No migration needed — F-001 reuses the existing 000012 composite index. No frontend changes.	2026-04-20 16:51:52 +00:00
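The F-003 change above replaces blocking time.Sleep calls with a select on ctx.Done(). A minimal sketch of that pattern follows. The helper names `sleepCtx` and `interrupted` are hypothetical, and the sketch uses time.NewTimer with Stop rather than the time.After named in the commit, so the timer is released promptly when the context wins.

```go
package main

import (
	"context"
	"time"
)

// sleepCtx waits for d or until ctx is done, whichever comes first.
// It returns ctx.Err() when the wait was cut short and nil otherwise.
func sleepCtx(ctx context.Context, d time.Duration) error {
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-timer.C:
		return nil
	}
}

// interrupted is a demo helper: it reports whether a sleep of sleepMs
// milliseconds is cut short by a context that expires after cancelMs
// milliseconds.
func interrupted(sleepMs, cancelMs int) bool {
	ctx, cancel := context.WithTimeout(context.Background(),
		time.Duration(cancelMs)*time.Millisecond)
	defer cancel()
	return sleepCtx(ctx, time.Duration(sleepMs)*time.Millisecond) != nil
}
```

A caller such as a DNS propagation wait would run its cleanup and return the error when `sleepCtx` reports cancellation, which is what lets graceful shutdown proceed without waiting out the full delay.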
shankar0123	675b87ba63	I-005: notification retry loop + dead-letter queue Critical alerts can no longer be silently dropped by a transient notifier failure. Failed notification attempts now ride an exponential backoff retry loop, with a 5-attempt budget before promotion to the dead-letter queue for operator intervention. Schema (migration 000016, idempotent): - retry_count INTEGER NOT NULL DEFAULT 0 - next_retry_at TIMESTAMPTZ - last_error TEXT - idx_notification_events_retry_sweep partial index (next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL Dead rows clear next_retry_at so the index stops matching them. Service contract: - NotificationService.RetryFailedNotifications drives 2^n-minute exponential backoff capped at 1h (notifRetryBackoffCap) with 5-attempt budget (notifRetryMaxAttempts). - Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to status='dead' via MarkAsDead. - Non-terminal failures record via RecordFailedAttempt. - Success path promotes to 'sent' without touching retry_count (audit preserves "delivered on attempt N"). - Missing-notifier branch defensively promotes to 'sent' to avoid wedging a row on a deleted channel. - RequeueNotification operator escape hatch atomically resets retry_count -> 0, next_retry_at -> NULL, last_error -> NULL, status -> pending via notifRepo.Requeue. Scheduler: - New always-on notificationRetryLoop wired into the base loop set at CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m). - sync/atomic.Bool idempotency guard. - sync.WaitGroup shutdown drain via WaitForCompletion. StatsService: - SetNotifRepo setter pattern preserves 9 pre-existing NewStatsService call sites (main.go + stats_test.go + 8 digest tests) without touching the constructor signature. - DashboardSummary.NotificationsDead populated via notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired (reports zero on systems without a notification repository). 
- CountByStatus error is non-fatal (dashboard summary is best-effort for this field). - Prometheus certctl_notification_dead_total counter emitted from the same snapshot. Handler: - New POST /api/v1/notifications/{id}/requeue endpoint. - dead status surfaces to MCP + CLI. Frontend: - NotificationsPage gains two-tab toolbar ("All" / "Dead letter") with queryKey: ['notifications', activeTab] so switching tabs doesn't serve stale data until the 30s refetch. - Dead rows surface "Retry {n}/5" + truncated last_error with full-text title tooltip. - Requeue mutation wrapped as mutationFn: (id: string) => requeueNotification(id) to prevent react-query v5's positional context argument from leaking into the API client — pinned against future refactors by strict-match toHaveBeenCalledWith('notif-dead-001') in NotificationsPage.test.tsx:181. Closes I-005.	2026-04-19 15:17:27 +00:00
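The retry schedule from the I-005 commit above (2^n-minute backoff capped at one hour, with a 5-attempt budget) can be written down as two small functions. The constant names notifRetryBackoffCap and notifRetryMaxAttempts and the exhaustion condition come from the commit message; the function names are hypothetical, and whether n is the retry count before or after the failing attempt is an assumption here (before).

```go
package main

import "time"

const (
	notifRetryMaxAttempts = 5
	notifRetryBackoffCap  = time.Hour
)

// nextBackoff returns 2^retryCount minutes, capped at notifRetryBackoffCap.
func nextBackoff(retryCount int) time.Duration {
	if retryCount < 0 {
		retryCount = 0
	}
	// 2^6 minutes already exceeds the one-hour cap, so short-circuit to
	// avoid shifting by large counts.
	if retryCount >= 6 {
		return notifRetryBackoffCap
	}
	d := time.Duration(1<<retryCount) * time.Minute
	if d > notifRetryBackoffCap {
		return notifRetryBackoffCap
	}
	return d
}

// backoffMinutes expresses nextBackoff in whole minutes for readability.
func backoffMinutes(retryCount int) int {
	return int(nextBackoff(retryCount) / time.Minute)
}

// shouldMarkDead mirrors the exhaustion condition quoted in the commit:
// RetryCount >= notifRetryMaxAttempts-1.
func shouldMarkDead(retryCount int) bool {
	return retryCount >= notifRetryMaxAttempts-1
}
```

Under these assumptions the delays for retry counts 0 through 3 are 1, 2, 4 and 8 minutes, and a row whose retry count has reached 4 is promoted to the dead-letter state instead of being rescheduled, so the one-hour cap is a safety bound rather than a value hit in normal operation.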
shankar0123	0725713e19	Close I-004 (agent hard-delete cascades targets) coverage-gap finding Operator decision answered as full soft-delete with optional forced cascade — hard-delete is not reachable from any public surface. Prior to this commit, DELETE /agents/{id} ran a plain `DELETE FROM agents` whose schema-level `ON DELETE CASCADE` on deployment_targets.agent_id silently wiped every target, orphaning certs and aborting in-flight jobs. The finding closure reshapes the agent-removal contract around soft retirement with explicit preflight counts, an opt-in cascade gated by a mandatory reason, and unconditional protection for the four reserved sentinel agents used by discovery sources. Schema — migration 000015: migrations/000015_agent_retire.up.sql flips deployment_targets_agent_id_fkey from ON DELETE CASCADE to ON DELETE RESTRICT, so a stray `DELETE FROM agents` now errors at the DB boundary instead of quietly destroying targets. Both `agents` and `deployment_targets` grow a retired_at TIMESTAMPTZ + retired_reason TEXT pair (TEXT not VARCHAR so operator comments are never truncated), indexed via partial indexes WHERE retired_at IS NOT NULL. The migration is self-healing (ADD COLUMN IF NOT EXISTS, DROP CONSTRAINT IF EXISTS then ADD CONSTRAINT, CREATE INDEX IF NOT EXISTS) so repeated runs against partially-migrated databases converge. migrations/000015_agent_retire.down.sql restores CASCADE and drops the new columns for clean rollback. A dedicated repository-layer testcontainers test (internal/repository/postgres/migration_000015_test.go) asserts the before/after FK action, column presence, index presence, and round-trip idempotency under up→down→up. 
Domain — sentinel guard + dependency counts: internal/domain/connector.go gains IsRetired() on Agent, the exported SentinelAgentIDs slice listing server-scanner, cloud-aws-sm, cloud-azure-kv, cloud-gcp-sm verbatim (matching the four reserved IDs documented in CLAUDE.md and created at startup in cmd/server/main.go), IsSentinelAgent(id string) predicate, AgentDependencyCounts{ActiveTargets, ActiveCertificates, PendingJobs} with a HasDependencies() method, and ActorTypeAgent / ActorTypeSystem enum values used by audit emission downstream. Coverage locked down by internal/domain/connector_test.go. Service — 8-step ordered contract: internal/service/agent_retire.go:RetireAgent(ctx, id, actor, opts{Force, Reason}) enforces a fixed execution order: (1) sentinel guard — IsSentinelAgent(id) returns ErrAgentIsSentinel unconditionally; force=true does NOT bypass it. (2) fetch — ErrAgentNotFound on miss. (3) idempotency — if IsRetired() already, return AgentRetirementResult{AlreadyRetired: true} with no new audit event and no state change (safe to replay from flaky clients). (4) preflight counts — collectAgentDependencyCounts runs ActiveTargets, ActiveCertificates, PendingJobs sequentially (not in parallel; keeps the per-query timeout predictable and matches the repo's existing call-chain shape). (5) force-reason guard — opts.Force=true with empty Reason returns ErrForceReasonRequired (wired into the 400 status surface). (6) dependency guard — HasDependencies() with opts.Force=false returns BlockedByDependenciesError{Counts} (wired into the 409 body with per-bucket counts). (7) mutation — single pinned retiredAt := time.Now(); agent retirement first, then cascade target retirement if opts.Force, all under the repo's single transaction so the two retired_at stamps match to the second. (8) best-effort audit — agent_retired always; agent_retirement_ cascaded additionally on the force path. 
Actor is whatever the handler resolves from the request; actor type is mapped by resolveActorType (system/agent-prefix→Agent/else→User). Audit emission failures are logged via slog.Error but do not abort the retirement (matches the house convention used by every other scheduler-emitted event). BlockedByDependenciesError implements Error() as "active_targets=%d, active_certificates=%d, pending_jobs=%d" and Unwrap() → ErrBlockedByDependencies. The single struct satisfies errors.Is via Unwrap (used by scheduler-level tests) and errors.As via the concrete type (used by the handler to fish out Counts for the 409 body). ListRetiredAgents(page, perPage) adds a separate paginated accessor with page<1→1 and perPage<1→50 normalization so retired rows are queryable without polluting the default agent listing. Sentinel guard coverage is asymmetric by design: all four reserved IDs are protected, and force=true cannot override. Regression tests in internal/service/agent_retire_test.go assert each of the eight steps in order, plus sentinel bypass attempts and idempotency replay. Handler + router — status-code surface: internal/api/handler/agents.go:RetireAgent exposes seven status codes on DELETE /agents/{id}: 200 on a fresh retirement (body echoes AgentRetirementResult). 204 on idempotent replay (AlreadyRetired=true; no new audit). 400 on ErrForceReasonRequired. 403 on ErrAgentIsSentinel. 404 on ErrAgentNotFound. 409 on BlockedByDependenciesError, with a custom body shape {error, counts{active_targets, active_certificates, pending_jobs}} that bypasses the default ErrorWithRequestID envelope so callers get the per-bucket numbers directly. 500 on any other error. Heartbeat HandleHeartbeat returns 410 Gone when the agent is retired (ErrAgentRetired), signalling the agent to shut down. Query params `force=true` and `reason=<text>` drive the cascade path; both are forwarded as url.Values through the new MCP transport. 
internal/api/router/router.go registers GET /api/v1/agents/retired literal-path BEFORE /api/v1/agents/{id} — Go 1.22 ServeMux's literal-beats-pattern-var precedence routes "retired" to the paginated retired-agents listing instead of fetching a hypothetical agent named "retired". Agent binary — clean shutdown on 410: cmd/agent/main.go gains the ErrAgentRetired sentinel, a retiredOnce sync.Once, and a retiredSignal chan struct{}. A markRetired(source, statusCode, body) helper closes the channel exactly once; the Run() select loop observes the close and returns ErrAgentRetired; main() matches via errors.Is(err, ErrAgentRetired) and exits cleanly instead of spinning in the heartbeat retry loop. The 410 Gone surface is therefore terminal for the agent process. MCP transport: internal/mcp/client.go adds Client.DeleteWithQuery(path, query), a new additive transport method. Client.Delete is path-only; without this method the retire tool would silently drop `force` and `reason`, turning every cascade retire into a default soft-retire. The new method shares do()'s 204 normalization and 4xx/5xx error propagation so tool authors get one contract. internal/mcp/tools.go + internal/mcp/types.go expose the retire_agent tool with Force+Reason inputs wired through DeleteWithQuery. CLI: cmd/cli/main.go + internal/cli/client.go add two CLI surfaces: `agents list --retired` (client-side strip of --retired then delegation to ListRetiredAgents, sharing --page/--per-page parsing with the default listing) and `agents retire <id> [--force --reason "…"]` (mirrors ErrForceReasonRequired — force without reason is rejected client-side before the request is sent). JSON + table output modes both honor the new columns. Frontend: web/src/pages/AgentsPage.tsx surfaces retired/retire affordances. web/src/api/client.ts + web/src/api/types.ts expose the retire endpoint and the retired-listing. 4 new Vitest regression cases. 
OpenAPI: api/openapi.yaml documents DELETE /agents/{id} with all seven status codes, 410 on heartbeat, and the 409 per-bucket body shape.
Regression coverage (six new test files, all green):
  internal/service/agent_retire_test.go — 8-step contract + sentinel guards
  internal/api/handler/agent_retire_handler_test.go — 7-status-code surface + 410 heartbeat
  internal/mcp/retire_agent_test.go — DeleteWithQuery wire-through
  internal/cli/agent_retire_test.go — --retired listing + --force/--reason pairing
  internal/repository/postgres/migration_000015_test.go — FK flip + columns + indexes + up↔down
  internal/domain/connector_test.go — IsRetired, IsSentinelAgent, SentinelAgentIDs, HasDependencies
Files:
  api/openapi.yaml — DELETE + 410 + 409 body shape
  cmd/agent/main.go — ErrAgentRetired, markRetired, retiredSignal
  cmd/cli/main.go — handleAgents list/get/retire dispatch
  docs/architecture.md, docs/concepts.md, docs/testing-guide.md — retirement contract narrative
  internal/api/handler/agents.go — RetireAgent, status surface, 410 on heartbeat
  internal/api/handler/agent_handler_test.go — extended coverage
  internal/api/handler/agent_retire_handler_test.go — new
  internal/api/router/router.go — /agents/retired before /agents/{id}
  internal/cli/agent_retire_test.go — new
  internal/cli/client.go — ListRetiredAgents + RetireAgent
  internal/domain/connector.go — IsRetired, SentinelAgentIDs, IsSentinelAgent, AgentDependencyCounts, ActorTypeAgent/System
  internal/domain/connector_test.go — new
  internal/integration/lifecycle_test.go — retirement fixture
  internal/mcp/client.go — DeleteWithQuery additive transport
  internal/mcp/retire_agent_test.go — new
  internal/mcp/tools.go, internal/mcp/types.go — retire_agent tool + Force/Reason inputs
  internal/repository/interfaces.go — AgentRepository retirement methods
  internal/repository/postgres/agent.go — retire + cascade target retire + counts
  internal/repository/postgres/migration_000015_test.go — new
  internal/service/agent.go — wire into AgentService surface
  internal/service/agent_retire.go — new 8-step contract
  internal/service/agent_retire_test.go — new
  internal/service/deployment.go — skip retired agents
  internal/service/target.go — skip retired agents
  internal/service/testutil_test.go — shared mocks extended
  migrations/000015_agent_retire.up.sql — new
  migrations/000015_agent_retire.down.sql — new
  web/src/api/client.ts, types.ts + tests — retire endpoint wiring
  web/src/pages/AgentsPage.tsx — retire UI	2026-04-19 05:24:00 +00:00
shankar0123	fe7e766510	Close M-004 (OCSP issuer binding) and M-005 (discovery actor propagation) coverage-gap findings
M-004 — OCSP issuer binding (composite key):
The OCSP lookup path now binds (issuer_id, serial) as a composite key rather than resolving by serial alone. CertificateRepository and RevocationRepository gain GetByIssuerAndSerial methods; ca_operations.go scopes both lookups by the issuer_id path param. When no managed cert binds to that (issuer, serial) tuple, GetOCSPResponse constructs an RFC 6960 §2.2 'unknown' response (CertStatus=2) instead of the prior default 'good'. Short-lived cert exemption (profile TTL < 1h) is preserved. Real repo errors (non-sql.ErrNoRows) fail closed with a log.
Regression coverage: internal/service/ca_operations_test.go
- TestCAOperationsSvc_GetOCSPResponse_Unknown_CrossIssuer
- TestCAOperationsSvc_GetOCSPResponse_Unknown_UnknownSerial
M-005 — Discovery Claim/Dismiss actor propagation:
DiscoveryService.ClaimDiscovered and DismissDiscovered now accept an explicit 'actor string' parameter (propagation pattern mirrors bulk_revocation.go / revocation_svc.go). The handler layer passes resolveActor(r.Context()) — the named-key identity established by the M-002 auth unification — and the service falls back to 'api' (the same safe sentinel resolveActor uses when no auth context is present) only when the caller passes an empty string. Never falls back to 'operator'.
Regression coverage: internal/service/discovery_test.go
- TestDiscoveryService_ClaimDiscovered_AuditActor
- TestDiscoveryService_DismissDiscovered_AuditActor
- TestDiscoveryService_ClaimDiscovered_EmptyActorFallsBackToAPI
- TestDiscoveryService_DismissDiscovered_EmptyActorFallsBackToAPI
Each new test asserts event.Actor matches the caller-supplied string (or 'api' on empty input) and explicitly asserts event.Actor != 'operator' to lock in the historical fix intent.
Files:
  internal/api/handler/discovery.go — pass resolveActor(ctx)
  internal/api/handler/discovery_handler_test.go — updated call sites
  internal/integration/lifecycle_test.go — updated mock wiring
  internal/repository/interfaces.go — GetByIssuerAndSerial on CertificateRepository + RevocationRepository
  internal/repository/postgres/certificate.go — composite key lookup
  internal/service/ca_operations.go — (issuer_id, serial) scoping
  internal/service/ca_operations_test.go — 2 new M-004 tests
  internal/service/discovery.go — actor parameter + 'api' fallback
  internal/service/discovery_test.go — 4 new M-005 tests
  internal/service/shortlived_test.go — mock signature update
  internal/service/testutil_test.go — mock GetByIssuerAndSerial	2026-04-18 22:20:25 +00:00
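The M-004 change above can be illustrated with a small lookup keyed on (issuer_id, serial). This is a sketch under stated assumptions, not the repository's code: the in-memory certStore, the issuerSerial key type, and the ocspStatus method are hypothetical stand-ins for the Postgres-backed GetByIssuerAndSerial path. The numeric status values follow the CertStatus choice tags in RFC 6960 (good=0, revoked=1, unknown=2).

```go
package main

// CertStatus mirrors the RFC 6960 CertStatus choice tags.
type CertStatus int

const (
	StatusGood    CertStatus = 0
	StatusRevoked CertStatus = 1
	StatusUnknown CertStatus = 2
)

// issuerSerial is the composite key: a serial is only meaningful within
// the issuer that assigned it.
type issuerSerial struct {
	IssuerID string
	Serial   string
}

// certRecord is a hypothetical minimal record of a managed certificate.
type certRecord struct {
	Revoked bool
}

// certStore is a hypothetical in-memory stand-in for the repositories.
type certStore map[issuerSerial]certRecord

// ocspStatus resolves by (issuer, serial). A serial that exists under a
// different issuer, or not at all, yields 'unknown' rather than 'good'.
func (s certStore) ocspStatus(issuerID, serial string) CertStatus {
	rec, ok := s[issuerSerial{IssuerID: issuerID, Serial: serial}]
	if !ok {
		return StatusUnknown
	}
	if rec.Revoked {
		return StatusRevoked
	}
	return StatusGood
}
```

The sketch leaves out the short-lived exemption and the fail-closed handling of real repository errors that the commit also mentions.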
shankar0123	3287e174dc	Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001)
Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006) on top of the C-001/C-002 ownership + agent-FK contract fixes landed in `a53a4b8`. The work lands as a single commit spanning server, docs, tests, and the React client.
M-002 — Named API keys with per-key actor propagation
* Migration 000014 adds the 'api_keys' table (id, name, hash, principal, role, created_at, last_used_at, disabled_at) so every credential carries an identifiable principal instead of the opaque 'anonymous'/'api-key' sentinel.
* Auth middleware now rotates through configured keys, performs constant-time hash comparison, stamps 'last_used_at', and emits an actor struct via contextWithActor(). The audit middleware, bulk-revocation handler, approval handlers, and MCP tool layer now read the principal off the context and persist it on every audit_events row.
* Regression coverage:
  - internal/api/middleware/audit_test.go — actor propagation, principal redaction for disabled keys, anonymous fallback for unauthenticated endpoints.
  - internal/api/handler/bulk_revocation_handler_test.go, job_handler_test.go — principal-on-audit assertions.
M-003 — Authorization gates (Phase B)
* Approval handler rejects self-approval / self-rejection with 403 when the actor principal equals the job's requested_by field.
* Bulk revocation is gated behind the 'admin' role; operators and viewers receive 403.
* Regression coverage:
  - internal/service/job_test.go — TestApproveJob_NotSelf, TestRejectJob_NotSelf.
  - internal/api/handler/bulk_revocation_handler_test.go — TestBulkRevoke_RequiresAdmin, TestBulkRevoke_AdminSucceeds.
M-006 — RFC-compliant CRL/OCSP on the unauthenticated .well-known mux
* Per RFC 8615, relying parties cannot reasonably be asked to authenticate against the issuing certctl instance to retrieve revocation material. CRL and OCSP move off the authenticated '/api/v1/crl' and '/api/v1/ocsp/' paths onto:
    GET /.well-known/pki/crl/{issuer_id}            Content-Type: application/pkix-crl (RFC 5280 §5)
    GET /.well-known/pki/ocsp/{issuer_id}/{serial}  Content-Type: application/ocsp-response (RFC 6960)
* Non-standard JSON CRL shape is removed; only DER is served.
* Short-lived certificate exemption (profile TTL < 1h → skip CRL/OCSP) is preserved; the response simply omits the serial.
* Routes are registered on the unauthenticated 'finalHandler' mux in cmd/server/main.go alongside EST ('/.well-known/est/') and SCEP ('/scep'). Legacy authenticated paths return 404.
Regression coverage:
  - internal/api/handler/certificate_handler_test.go — content type, DER parseability, 404 for unknown issuer.
  - internal/api/handler/adversarial_path_test.go — unauthenticated access asserted for CRL, OCSP, EST, SCEP.
  - internal/api/router/router_test.go — route-table assertion that '.well-known/pki/', '.well-known/est/', and '/scep' are mounted on the unauthenticated branch.
M-001 — Auto-closed by M-002
EST and SCEP were already registered on the unauthenticated 'finalHandler' mux; the router comment at internal/api/router/router.go:247 now matches reality. The adversarial-path tests above lock the behavior in.
Verification (all gates green):
* go vet ./... — clean
* go build ./... — ok
* go test -short ./... (55+ packages) — all pass
* web/ : npm test (225 Vitest tests) — all pass
* web/ : npx tsc --noEmit — clean
* grep sweep for '/api/v1/(crl\|ocsp)' — 13 surviving hits, all intentional M-006 tombstone/relocation comments.
Documentation:
* coverage-gap-audit.md — status flips M-001/M-002/M-003/M-006 → Fixed, with per-finding resolution paragraphs citing regression test IDs. (Audit file lives outside this repo; see cowork root.)
* CLAUDE.md Project Status line updated with the auth-unification closure note.
* docs/features.md, docs/architecture.md, docs/quickstart.md, docs/concepts.md, docs/connectors.md, docs/test-env.md, docs/testing-guide.md, docs/compliance-.md, docs/demo-advanced.md — refreshed for the new '.well-known/pki/' namespace and named API keys.
* api/openapi.yaml — documents the new unauthenticated endpoints and removes the legacy '/api/v1/crl' + '/api/v1/ocsp/' paths.
.gitignore: adds '/.gocache/' and '/.gomodcache/' for the session-scoped Go caches so they never enter the tree.	2026-04-18 18:17:41 +00:00
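The M-002 notes above mention a constant-time hash comparison across the configured keys. One way that could look is sketched below. The commit does not name the hash function, so SHA-256 is an assumption here, as are the apiKey struct and the hashKey and matchKey helpers. crypto/subtle.ConstantTimeCompare is the standard-library primitive for the comparison itself.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
)

// apiKey is a hypothetical in-memory view of an api_keys row.
type apiKey struct {
	Name     string
	Hash     [32]byte
	Disabled bool
}

// hashKey derives the stored form of a raw key (SHA-256 is an assumption).
func hashKey(raw string) [32]byte {
	return sha256.Sum256([]byte(raw))
}

// matchKey walks every configured key without returning early, so the
// number of comparisons does not depend on which key matched. It returns
// the name of the matching enabled key, if any.
func matchKey(keys []apiKey, presented string) (string, bool) {
	h := hashKey(presented)
	name, found := "", false
	for _, k := range keys {
		if subtle.ConstantTimeCompare(h[:], k.Hash[:]) == 1 && !k.Disabled {
			name, found = k.Name, true
		}
	}
	return name, found
}
```

A real middleware would also stamp last_used_at and attach the principal to the request context, which this sketch leaves out.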
shankar0123	a53a4b845b	fix(gui,api): close C-001 + C-002 — ownership + agent FK contract
C-001 — CreateCertificate was server-accepted with null owner_id, team_id, renewal_policy_id because the GUI neither collected the fields nor enforced them, even though the backend's ManagedCertificate schema and handler contract treat them as required. Fix the contract at all four layers:
- web/src/pages/CertificatesPage.tsx: replace owner_id/team_id free-text inputs with <select> elements fed by getOwners/getTeams/getPolicies queries; mark all three required; gate the Create button on owner_id + team_id + renewal_policy_id being set.
- internal/api/handler/certificates.go: ValidateRequired for owner_id, team_id, renewal_policy_id on CreateCertificate so the handler returns HTTP 400 with the offending field name before the service layer is reached.
- internal/mcp/types.go: drop ',omitempty' from CreateCertificateInput.RenewalPolicyID so the MCP schema reflects the required contract; Update inputs keep partial-update semantics.
- api/openapi.yaml: 'required: [name, common_name, renewal_policy_id, issuer_id, owner_id, team_id]' was already present on the Create schema; clarified DeploymentTarget.agent_id description to note the FK contract.
C-002 — CreateTargetWizard accepted an empty or bogus agent_id and the service inserted directly, producing a Postgres 23503 FK-violation that bubbled out as a generic HTTP 500. The FK itself (migration 000001 line 104: agent_id TEXT NOT NULL REFERENCES agents(id)) is correct; we keep the schema strict and add validation at three layers:
- internal/service/target.go: introduce ErrAgentNotFound sentinel and pre-validate agent_id in TargetService.CreateTarget — empty string returns 'agent_id is required'; a nonexistent id returns the full 'referenced agent does not exist: <id>' error. Both wrap ErrAgentNotFound via fmt.Errorf %w so callers can use errors.Is.
- internal/api/handler/targets.go: ValidateRequired on agent_id; map errors.Is(err, service.ErrAgentNotFound) to HTTP 400 instead of letting it fall through to the generic 500 branch.
- internal/mcp/types.go: drop ',omitempty' from CreateTargetInput.AgentID to match the required contract.
- web/src/pages/TargetsPage.tsx: replace the free-text Agent ID input with a <select> populated from getAgents(); include agent in the canProceedToReview gate so Next is disabled until an agent is chosen.
Regression coverage (21 new subtests total):
- TestCreateCertificate_MissingRequiredField_Returns400 — 6 subtests, one per required field, each proves the handler guard fires before the mock service is called.
- TestCreateTarget_MissingAgentID_Returns400 — handler guard.
- TestCreateTarget_NonexistentAgent_Returns400 — pins the ErrAgentNotFound -> 400 translation.
- TestTargetService_CreateTarget_MissingAgentID — errors.Is sentinel.
- TestTargetService_CreateTarget_NonexistentAgentID — errors.Is.
- The existing TestTargetService_CreateTarget_Success, along with TestCreateTarget_{MissingName,MissingType,NameTooLong}_* handler tests, were updated to seed a real agent or include agent_id in the request body so the happy paths still run cleanly.
Gates (Phase 4):
- go build/vet/test/race: green
- go test -cover: internal/service 68.7% (gate 55%), internal/api/handler 78.9% (gate 60%)
- golangci-lint on service+handler+mcp: 0 issues
- govulncheck: no reachable vulns
- tsc --noEmit: clean
- vitest: 223/223 passing
See cowork/certctl-coverage-gap-audit.md entries C-001 and C-002.	2026-04-18 16:01:40 +00:00
shankar0123	b3cc7cbdb2	fix(policies): close the D-006 loop — TitleCase seed canonicals + severity-aware, config-consuming rule engine (D-008) D-008 was a three-part drift in the policy engine that made the D-005/D-006 remediation cosmetic below the DB layer: (a) migrations/seed.sql INSERTed rules with pre-D-005 lowercase types ('ownership', 'environment', 'lifetime', 'renewal_window') that the handler validator rejects on Create/Update but that raw SQL INSERTs bypassed entirely. At runtime evaluateRule's switch fell through to the default "unknown policy rule type" error branch on every demo rule × every cert × every cycle, flooding logs while emitting zero violations. (b) migrations/seed_demo.sql persisted lowercase severity values ('critical', 'error', 'warning') on policy_violations rows. INSERT succeeded because that column had no CHECK, but any frontend comparing against the canonical PolicySeverity enum mis-categorized every seeded violation. (c) evaluateRule hardcoded Severity: PolicySeverityWarning on every emitted violation and ignored rule.Config entirely — so the D-006 per-rule severity column (000013) and every per-arm Config JSON ({allowed_issuer_ids, allowed_domains, required_keys, allowed, lead_time_days, max_days}) was dead data below the evaluation layer. This commit lands (a)+(b)+(c) atomically. Shipping any subset leaves the feature half-working. ## Changes Domain (internal/domain/policy.go): * Add PolicyTypeCertificateLifetime as the 6th TitleCase canonical. Pre-D-008 the seeded "max-certificate-lifetime" rule had no engine arm — routing it through RenewalLeadTime would conflate "how close to expiry before we renew" with "how long can the cert possibly be", two distinct semantics. The new type accepts config {"max_days": int} and flags certs whose NotAfter - NotBefore exceeds the cap. 
Handler validator (internal/api/handler/validation.go): * ValidatePolicyType allowlist grown to 6 canonicals (AllowedIssuers, AllowedDomains, RequiredMetadata, AllowedEnvironments, RenewalLeadTime, CertificateLifetime). OpenAPI (api/openapi.yaml): * PolicyType enum grown to match domain. Frontend (web/src/api/types.ts, types.test.ts): * POLICY_TYPES tuple gains CertificateLifetime; pin test asserts all 6 canonicals and rejects casing drift. Migration 000014 (policy_violations severity CHECK): * Named CHECK constraint (policy_violations_severity_check) mirroring 000013's allowlist, defense-in-depth at the DB layer against future drift from bypassed writes (migrations, psql sessions, future callers). Symmetric down migration drops by name. Seed data: * migrations/seed.sql rewritten to emit TitleCase canonicals with per-arm config JSON that actually exercises the config-consuming paths (not the missing-field backstops): - pr-require-owner → RequiredMetadata {"required_keys":["owner"]} Warning - pr-allowed-environments → AllowedEnvironments {"allowed":["production","staging","development"]} Error - pr-max-certificate-lifetime → CertificateLifetime {"max_days":90} Critical - pr-min-renewal-window → RenewalLeadTime {"lead_time_days":14} Warning Severities are now differentiated per rule (D-006 intent). * migrations/seed_demo.sql violation rows flipped to TitleCase severity ('Critical', 'Error', 'Warning') so migration 000014 applies cleanly on upgrade paths. Engine rewrite (internal/service/policy.go): * evaluateRule rewritten. All six arms now: 1. Parse rule.Config into the per-arm typed struct. 2. Bad JSON → log at ValidateCertificate boundary and skip this rule (no co-located poisoning of other rules in the same batch). 3. Empty/null Config → emit the pre-D-008 missing-field violation (backwards compat invariant — operators who haven't reconfigured still see the same output). 4. 
Violations emitted carry rule.Severity (no more hardcoded Warning); D-006 column is now load-bearing. * CertificateLifetime arm reads NotBefore/NotAfter from the certificate's latest version via CertRepo. Injected via PolicyService.SetCertRepo() setter — avoids churning ~36 NewPolicyService call sites while keeping the lifetime arm optional (degrades to a log+skip if the setter is not wired). Server wiring (cmd/server/main.go): * policyService.SetCertRepo(certRepo) wired after construction. Tests (internal/service/policy_test.go): * 25 new subtests across 5 groups: - TestEvaluateRule_SeverityPassThrough (6): every rule type emits violations carrying rule.Severity, not hardcoded. - TestEvaluateRule_ConfigConsumed (12): every per-arm Config path exercised positive + negative. - TestEvaluateRule_EmptyConfig_BackCompat (3): empty/null Config still emits pre-D-008 missing-field violations. - TestEvaluateRule_BadConfig_SkipsRule: malformed JSON logs and skips cleanly without poisoning neighbors. - TestEvaluateRule_CertificateLifetime_RepoScenarios (3): ok when repo wired, log+skip when not, handles missing NotBefore/NotAfter edges. Provenance: D-008 surfaced during D-005/D-006 remediation review in `eef1db0`. That commit added persistence and CI pins for the severity field but did not re-verify the evaluation layer consumed it; this finding and fix close the audit-process gap.	2026-04-18 14:55:56 +00:00
shankar0123	eef1db0f0a	fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006) Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a single commit because they share the same root cause — policy CRUD sending values the backend silently rejects — and splitting them would leave a half-working UI between commits. ## D-005 (P0): PoliciesPage dropdown 400s every Create Policy Root cause ---------- `web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array. The backend's `internal/api/handler/validators.go::ValidatePolicyType` enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`, `RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in `internal/domain/policy.go`. Every Create Policy request was rejected with `400 invalid policy type`. The error surfaced only as a transient toast; the modal closed anyway. Silent user-visible failure. Fix --- - `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES` tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and `PolicyViolation.severity` to the literal-union types. Dropdown is now sourced from the tuple; casing drift becomes a compile error. - `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` / `severityDots` to the TitleCase values, added `humanize()` for display (AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral` fallback that was papering over the mismatch. - `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone edits one side of the frontend/backend contract without the other, CI fails with a clear assertion. Pure-TS vitest, no RTL dependency. ## D-006 (P1): `severity` field silently dropped on create/update Root cause ---------- `PolicyRule` had no `Severity` field in `internal/domain/policy.go`. 
The frontend has always sent `severity` on create/update, but Go's `json.Decoder` (default settings, no `DisallowUnknownFields`) silently dropped it. The value never reached PostgreSQL. Every rule rendered with the same severity because there was no severity — just a display computation downstream. Fix: option (b), full-stack schema add (not delete-the-field) ------------------------------------------------------------- - Migration `000013_policy_rule_severity` (up + down): adds `severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No index — three-value column on a low-thousands-rows table, planner will seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data. - `internal/domain/policy.go`: added `Severity PolicySeverity` field. - `internal/repository/postgres/policy.go`: plumbed `severity` through ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT, UpdateRule UPDATE (4 queries). - `internal/service/policy.go::UpdatePolicy`: if the client omits severity on a PUT (zero-value empty string), fetch the existing rule and preserve its severity. Without this, partial updates would trip the NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type (out of scope). - `internal/api/handler/policies.go::CreatePolicy`: default empty severity to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with clear message instead of 500 on CHECK violation. `UpdatePolicy`: validates severity only when provided. - `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional `severity` on the MCP `create_policy` / `update_policy` tool inputs so LLM callers stay in sync with the wire contract. - `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with the enum and default. 
Acceptance criterion (user-defined) ----------------------------------- "Create a rule with severity=Critical, reload the page, and still see Critical — no silent drops." Verified end-to-end: frontend sends `severity: "Critical"`, handler validates, service persists, DB stores, GET returns, React renders the correct badge. Seed data --------- `migrations/seed.sql`: four demo rules now have differentiated severities — `pr-require-owner` → Warning, `pr-allowed-environments` → Error, `pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` → Warning. The user called out that seeding all four at the same severity makes the feature look decorative; differentiation demonstrates the column carries real signal. ## Integration test fix (side effect of D-006) `internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy` was sending `"severity": "High"` — a value from the pre-audit severity vocabulary that the new `ValidatePolicySeverity` correctly rejects with 400. Changed to `"Error"` (closest semantic match in the new TitleCase allowlist). Only severity reference in the integration/ directory; verified via grep. ## Out of scope, logged for follow-up (d/D-008) Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly deferred per direction: 1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values (`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`). These are load-bearing on `internal/service/policy.go::evaluateRule`'s `switch rule.Type` (which also uses the lowercase strings). Migrating requires coordinated changes across seed + evaluation engine. 2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'` severity — will now fail the new CHECK constraint. Separate fix. 3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on emitted violations and ignores the configured `rule.Config`. The new severity column is read correctly on the CRUD path but not yet consulted during evaluation. 
## Verification Backend: - `go build ./...` — clean - `go vet ./...` — clean - `go test -short ./...` — all packages green, including `internal/service` (policy service), `internal/api/handler` (policy + MCP handler tests), `internal/integration` (e2e_test.go after fix), `internal/domain`, `internal/repository/postgres`. Frontend: - `tsc --noEmit` — clean - `vitest run` — 223/223 passing (4 new assertions in types.test.ts) - `vite build` — clean (only the pre-existing chunk-size warning)	2026-04-18 13:02:04 +00:00
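The D-006 root cause above, a JSON field silently dropped because the target struct has no matching field, is easy to reproduce with encoding/json. The sketch below uses a hypothetical ruleWithoutSeverity struct to mimic the pre-fix shape and contrasts the default decoder with one that calls DisallowUnknownFields. The commit fixed the problem by adding the field; the strict decoder is shown only to make the silent drop visible.

```go
package main

import (
	"encoding/json"
	"strings"
)

// ruleWithoutSeverity mimics the pre-fix struct: there is no Severity field.
type ruleWithoutSeverity struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

// decodeLenient uses default decoder settings: unknown keys such as
// "severity" are ignored without any error.
func decodeLenient(body string) (ruleWithoutSeverity, error) {
	var r ruleWithoutSeverity
	err := json.NewDecoder(strings.NewReader(body)).Decode(&r)
	return r, err
}

// decodeStrict rejects any key that has no matching struct field.
func decodeStrict(body string) (ruleWithoutSeverity, error) {
	var r ruleWithoutSeverity
	dec := json.NewDecoder(strings.NewReader(body))
	dec.DisallowUnknownFields()
	err := dec.Decode(&r)
	return r, err
}
```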
shankar0123	ccd89c348f	fix(m2-pr-d): thread ctx through Job/Notification/Audit services
Collapse CancelJobWithContext into CancelJob; eliminate 10 context.Background() hits across the Job+Notification+Audit service cluster by threading ctx through their handler-facing service interfaces.
Services (ctx-first):
- service/job.go: ListJobs, GetJob, CancelJob, ApproveJob, RejectJob now accept ctx; the CancelJobWithContext wrapper is removed (handler callers continue to invoke CancelJob, now ctx-aware).
- service/notification.go: ListNotifications, GetNotification, MarkAsRead accept ctx.
- service/audit.go: ListAuditEvents, GetAuditEvent accept ctx.
Handlers (interface + callsites):
- handler/jobs.go, handler/notifications.go, handler/audit.go: local service interfaces updated, r.Context() threaded at every callsite.
Tests:
- Mock services updated to match the new interfaces (ctx accepted and ignored via '_ context.Context' first parameter; Fn closure fields unchanged).
- job_test.go / notification_test.go callsites thread context.Background() to match production shape.
Verification:
  go build ./...              ok
  go vet ./...                ok
  go test -short ./...        ok
  go test -race -short ./...  ok
  golangci-lint run ./...     0 issues
Locked decisions from the M-2 plan:
  D-1 ctx-only signatures (no dual forms)
  D-4 preserve handler method names facing the router
  D-5 domain types stay ctx-free
Audit complete. Commit: `1f6cf0eafa`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:20:46 +00:00
shankar0123	2497be496d	M-2 PR-C: Collapse Policy/Profile/Owner/Team services to ctx-first signatures
- Add ctx first param to 21 service-layer handler-interface methods across policy.go (6), profile.go (5), owner.go (5), team.go (5)
- Replace 24 context.Background() call sites with received ctx; use context.WithoutCancel(ctx) for subsidiary audit-recording ops to preserve fire-and-forget audit semantics without inheriting caller cancellation
- Add ctx first param to 21 handler-interface method signatures across policies.go (6), profiles.go (5), owners.go (5), teams.go (5)
- Thread r.Context() through 21 HTTP handler sites (ListPolicies, GetPolicy, CreatePolicy, UpdatePolicy, DeletePolicy, ListViolations, ListProfiles, GetProfile, CreateProfile, UpdateProfile, DeleteProfile, ListOwners, GetOwner, CreateOwner, UpdateOwner, DeleteOwner, ListTeams, GetTeam, CreateTeam, UpdateTeam, DeleteTeam)
- Update MockPolicyService/MockProfileService/MockOwnerService/MockTeamService mock method impls with _ context.Context first param (Fn fields unchanged — closures do not need ctx); update mock impls in integration/lifecycle_test.go for all four services
- Update 12 service-layer test callsites (policy_test.go ×2, owner_test.go ×5, team_test.go ×5, profile_test.go ×13) to pass context.Background() at the call site
Audit complete. Commit: `1f6cf0eafa`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:10:06 +00:00
shankar0123	eb14236166	M-2 PR-B: Collapse IssuerService + TargetService to ctx-first signatures
- Delete bare TestConnection wrapper in IssuerService; rename TestConnectionWithContext → TestConnection
- Delete TestTargetConnection delegate shim in TargetService (canonical TestConnection already ctx-first)
- Add ctx first param to 10 handler-interface methods (ListIssuers/GetIssuer/CreateIssuer/UpdateIssuer/DeleteIssuer and ListTargets/GetTarget/CreateTarget/UpdateTarget/DeleteTarget)
- Replace 16 context.Background() call sites with received ctx
- Thread r.Context() through 12 HTTP handler sites in issuers.go and targets.go (outer TargetHandler.TestTargetConnection HTTP method name preserved for router compatibility)
- Update MockIssuerService, MockTargetService, and mockTargetService (integration) for ctx-first forwarding; update test callsite literals
Audit complete. Commit: `1f6cf0eafa`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 00:46:58 +00:00
shankar0123	cdc9d03d5b	fix(m-2): thread context through CertificateService cluster
Collapses CertificateService, RevocationSvc, and CAOperationsSvc to ctx-accepting method signatures. Removes context.Background() synthesis at 24 internal call sites across certificate.go, revocation_svc.go, and ca_operations.go.
- Primary repo calls inherit request cancellation via the passed ctx.
- Audit and notification dispatches use context.WithoutCancel(ctx) so they survive client disconnect.
- Collapses TriggerRenewal/TriggerRenewalWithActor, TriggerDeployment/TriggerDeploymentWithActor, and RevokeCertificate/RevokeCertificateWithActor sibling pairs into single canonical ctx-accepting methods (decisions D-1, D-2).
Handlers pass r.Context(). Mocks and tests updated to match new signatures. No HTTP surface change, no OpenAPI change.
PR 1 of 6 in the M-2 remediation chain. Master green at this commit.
Refs: certctl-audit-report.md M-2 (L143, L224)	2026-04-18 00:29:37 +00:00
shankar0123	13cd4d98ba	feat(V2.2): bulk revocation — filter-based fleet-wide certificate revocation Add POST /api/v1/certificates/bulk-revoke with filter criteria (profile_id, owner_id, agent_id, issuer_id, team_id, certificate_ids), partial-failure tolerance, and audit trail. Includes MCP tool, CLI command (certs bulk-revoke), server-side bulk modal in GUI replacing client-side sequential loop, OpenAPI spec, compliance mapping updates, and 21 new tests (12 service, 7 handler, 1 CLI, 1 frontend). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-16 00:06:34 -04:00
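The partial-failure tolerance mentioned in the bulk-revocation commit above can be sketched as a loop that records per-certificate failures instead of aborting on the first error. Everything in this block is hypothetical (the result shape, the revoke callback, and the errAlreadyRevoked sentinel); the commit message does not describe the actual response body.

```go
package main

import "errors"

// errAlreadyRevoked is a hypothetical per-certificate failure.
var errAlreadyRevoked = errors.New("certificate already revoked")

// BulkRevokeResult is a hypothetical summary of a bulk run.
type BulkRevokeResult struct {
	Revoked []string          // IDs revoked successfully
	Failed  map[string]string // ID -> error message
}

// bulkRevoke applies revoke to every ID and keeps going on failure, so one
// bad certificate does not block the rest of the batch.
func bulkRevoke(ids []string, revoke func(id string) error) BulkRevokeResult {
	res := BulkRevokeResult{Failed: map[string]string{}}
	for _, id := range ids {
		if err := revoke(id); err != nil {
			res.Failed[id] = err.Error()
			continue
		}
		res.Revoked = append(res.Revoked, id)
	}
	return res
}
```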
shankar0123	596d86a206	feat(M48): continuous TLS health monitoring — endpoint state machine, shared tlsprobe, 8 API endpoints, GUI
Adds continuous TLS endpoint health monitoring that closes the deploy→verify→monitor loop. After M25 verifies a deployment succeeded once, M48 continuously confirms it stays healthy.
Key components:
- Shared `internal/tlsprobe/` package extracted from network scanner for reuse
- Health status state machine: healthy → degraded (2 failures) → down (5 failures), plus cert_mismatch when served fingerprint differs from expected
- 8th scheduler loop (60s tick, per-endpoint configurable intervals)
- PostgreSQL migration 000011: endpoint_health_checks + endpoint_health_history tables
- 8 REST API endpoints (CRUD, history, acknowledge, summary)
- Health Monitor GUI page with summary bar, status table, create modal, auto-refresh
- 38 new tests (5 tlsprobe + 11 domain + 10 service + 8 handler + 4 frontend)
- All coverage thresholds maintained (service 68%, handler 83%, domain 87%, middleware 63%)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-15 21:45:45 -04:00
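The health state machine in the commit above (healthy → degraded at 2 failures → down at 5 failures, plus cert_mismatch) can be sketched as a fold over probe results. The commit does not say whether failures are counted consecutively or how recovery resets the counter; this sketch assumes consecutive failures and a full reset on the first success. The type and method names are hypothetical.

```go
package main

// HealthStatus is the endpoint health state.
type HealthStatus string

const (
	Healthy      HealthStatus = "healthy"
	Degraded     HealthStatus = "degraded"
	Down         HealthStatus = "down"
	CertMismatch HealthStatus = "cert_mismatch"
)

// Thresholds taken from the commit message.
const (
	degradedThreshold = 2
	downThreshold     = 5
)

// endpointState is a hypothetical per-endpoint record.
type endpointState struct {
	ConsecutiveFailures int
	Status              HealthStatus
}

// observe folds one probe result into the state. A failed probe bumps the
// failure counter; a successful probe resets it and then compares the
// served fingerprint against the expected one.
func (s *endpointState) observe(probeOK bool, servedFP, expectedFP string) {
	if !probeOK {
		s.ConsecutiveFailures++
		switch {
		case s.ConsecutiveFailures >= downThreshold:
			s.Status = Down
		case s.ConsecutiveFailures >= degradedThreshold:
			s.Status = Degraded
		}
		return
	}
	s.ConsecutiveFailures = 0
	if expectedFP != "" && servedFP != expectedFP {
		s.Status = CertMismatch
		return
	}
	s.Status = Healthy
}
```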


94 Commits