certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 15:41:41 +00:00

Author	SHA1	Message	Date
shankar0123	037876fa0f	fix(scheduler): SCALE-001 — cap ClaimPendingJobs per-tick (default 1000) Sprint 2 unified-master-audit closure. Pre-fix the scheduler invoked ClaimPendingJobs(ctx, "", 0). limit:0 loads every Pending row in a single transaction — a 100K-job burst (cert-fleet sweep, post-outage recovery, large agent-fleet first boot) marshalled the full queue into process memory before boundedFanOut's semaphore could back-pressure the upstream CAs. Fix: - SchedulerConfig.JobClaimLimit (env CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT, default 1000). ≤0 normalised to 1000 in SetClaimLimit — fail-safe vs. legacy unlimited semantics. - JobService.claimLimit threaded into the existing ProcessPendingJobs flow; ClaimPendingJobs(ctx, "", s.claimLimit). - cmd/server/main.go wires jobService.SetClaimLimit(cfg.Scheduler.JobClaimLimit). - 'processing pending jobs' log line now includes claim_limit so operators can spot the cap engaging (count == claim_limit ⇒ queue is running ahead of fan-out; bump CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT or CERTCTL_RENEWAL_CONCURRENCY). - Test wiring keeps the legacy zero-value (unlimited) for byte-for-byte compatibility with the existing 600+ JobService unit tests — only production code goes through SetClaimLimit. Regression coverage: - mockJobRepo.LastClaimLimit records the limit passed through ClaimPendingJobs so tests can pin the propagation. - TestProcessPendingJobs_RespectsClaimLimit: 10 Pending rows, SetClaimLimit(3), expect exactly 3 transition to Running plus LastClaimLimit=3 on the mock. - TestSetClaimLimit_NormalisesNonPositive: 0/-1/-1000 all normalise to 1000. Closes SCALE-001.	2026-05-16 04:00:49 +00:00
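The normalisation rule in the SCALE-001 commit above (non-positive limits fall back to 1000) and the "count == claim_limit" operator hint can be sketched as follows. This is a minimal illustration, not the repository's code: `normaliseClaimLimit` and `capEngaged` are hypothetical names, and only the default of 1000 and the ≤0 rule come from the commit message.

```go
package main

import "fmt"

// defaultClaimLimit mirrors the default of 1000 quoted in the commit message.
const defaultClaimLimit = 1000

// normaliseClaimLimit is a hypothetical stand-in for the normalisation that
// the commit attributes to SetClaimLimit: any non-positive value falls back
// to the default rather than meaning "unlimited".
func normaliseClaimLimit(n int) int {
	if n <= 0 {
		return defaultClaimLimit
	}
	return n
}

// capEngaged is a hypothetical helper for the operator hint in the commit:
// a tick that claims exactly claim_limit rows suggests the queue is running
// ahead of the fan-out.
func capEngaged(claimed, limit int) bool {
	return limit > 0 && claimed == limit
}

func main() {
	for _, configured := range []int{0, -1, -1000, 250} {
		fmt.Printf("configured=%d effective=%d\n", configured, normaliseClaimLimit(configured))
	}
	fmt.Println("cap engaged:", capEngaged(1000, normaliseClaimLimit(0)))
}
```

Treating ≤0 as "use the default" rather than "unlimited" keeps a mistyped environment variable from silently restoring the unbounded behaviour the commit removes.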
shankar0123	21aeed4f4e	legal: addlicense headers + normalize legacy variants (Phase 0 RED-4) Phase 0 closure (Path B2, post-rewrite): addlicense sweep — adds the canonical certctl LLC copyright + BUSL-1.1 SPDX header to every production Go file. Template: // Copyright 2026 certctl LLC. All rights reserved. // SPDX-License-Identifier: BUSL-1.1 Coverage: 338 / 338 production Go files (cmd/ + internal/, excluding _test.go and /testdata/). Pre-sweep coverage was 22 / 338 (6.5%); post-sweep is 338 / 338 (100%). Normalized 22 pre-existing legacy headers (`// Copyright (c) certctl` + `// SPDX-License-Identifier: BSL-1.1`) and 1 file using a `Certctl Contributors` attribution. The legacy SPDX ID `BSL-1.1` is non-standard; the official SPDX identifier for Business Source License 1.1 is `BUSL-1.1` (capital U). All 338 files now share the canonical form. Generated via: addlicense -c "certctl LLC" -y 2026 \ -f cowork/legal/copyright-header.tpl \ -ignore '/testdata/' -ignore '/_test.go' \ cmd/ internal/ Verification: find cmd internal -name '.go' -not -name '_test.go' \ -not -path '/testdata/' \ -exec grep -L '^// Copyright 2026 certctl LLC' {} \; \| wc -l Returns: 0 gofmt clean. Header additions are comments only, no compile impact. Closes: cowork/certctl-architecture-diligence-audit.html#fix-RED-4	2026-05-13 21:23:35 +00:00
shankar0123	02438ad9e1	ci: floor raise + doc drift (Phase 3 closure — TEST-H1/H2/M1/M2/M3/M4/L1, ARCH-H3/L1/L2/L3/L4) Twelve findings from the architecture diligence audit's Phase 3 bundle closed in one PR. All touch the CI workflows + small doc-drift fixes across the production Go tree + migration headers. CI workflow changes ==================== TEST-H1 — Race detection on ./... -short .github/workflows/ci.yml:106 was a 9-package explicit list. Audit finding TEST-H1 flagged that 25+ packages (internal/auth/, internal/repository/, internal/mcp, internal/scep, internal/pkcs7, internal/api/router, internal/api/acme, internal/cli, internal/cms, internal/config, internal/deploy, internal/integration, internal/ratelimit, internal/secret, internal/trustanchor, all of cmd/) silently dropped off race coverage. Post-fix: 'go test -race -short ./... -count=1 -timeout 600s'. 76 testing.Short() guards already cover testcontainers + live-DB integration suites, so -short keeps the long-running tests out. TEST-H2 — Cross-platform build matrix New 'cross-platform-build' job in ci.yml. Matrix: ubuntu-latest + windows-latest + macos-latest, fail-fast: false. Builds cmd/server + cmd/agent + cmd/cli + cmd/mcp-server on each. Catches Windows-specific regressions (path separators, file permissions, exec.Command semantics) the pre-Phase-3 Ubuntu-only CI missed. TEST-L1 — actions/setup-go cache: true (explicit) setup-go v5 defaults cache: true; making it explicit so a future setup-go upgrade can't silently flip it. Re-runs hit the Go module + build cache instead of recompiling cold. TEST-M1 — Mutation-testing floor at 55% security-deep-scan.yml::go-mutesting step rewritten. Removed continue-on-error + per-package '\|\| true'. New post-loop check extracts every 'The mutation score is X.YZ' line and fails the step if any package drops below 0.55. Floor rationale: starter ratio catches major regressions without rejecting the audit's 'this is OK' steady state; raise quarterly. 
TEST-M2 — 3 advisory deep-scan gates promoted to blocking Removed continue-on-error: true from: - gosec (filtered to G201/G202/G304/G108 high-signal rules: SQL-injection + path-traversal + pprof-exposed) - osv-scanner (multi-ecosystem CVE; complements govulncheck which is already blocking in ci.yml) - trivy image scan (--severity HIGH,CRITICAL --exit-code 1) continue-on-error count: 15 → 11. ZAP / schemathesis / nuclei / testssl stay advisory because their false-positive rates on https://localhost:8443-targeted DAST runs are high. TEST-M3 — Playwright harness stub web/package.json adds '@playwright/test' devDep + 'e2e' / 'e2e:install' npm scripts. web/playwright.config.ts ships single chromium project with webServer block pointing at 'npm run dev'. web/src/__tests__/e2e/smoke.spec.ts proves the harness wires through. The full 15-flow suite ships in frontend-design-audit Phase 8 (TEST-H1 in THAT audit); this is the wiring + a single smoke test as the regression floor. New Makefile target: 'make e2e-test'. Doc/code drift fixes ==================== TEST-M4 + ARCH-L2 — Skip inventory artifact + CI guard scripts/skip-inventory.sh walks every t.Skip site under cmd/ + internal/ + deploy/test/ and emits docs/testing/skip-inventory.md grouped by package with file:line:expression triples. Current inventory: 142 t.Skip sites, 76 testing.Short() guards. scripts/ci-guards/skip-inventory-drift.sh regenerates and fails on diff (excluding the 'Last reviewed' timestamp line which drifts daily). The Markdown is the canonical acquisition-diligence artifact for 'what tests are being skipped and why.' ARCH-H3 — MCP catalogue floor reconciliation Audit framing was '121 vs floor 150 — doc/code drift.' Live count via the test's actual regex over all 5 tool files (tools.go + tools_audit_fix.go + tools_auth.go + tools_auth_bundle2.go + tools_est.go): 155 unique 'Name: "certctl_*"' declarations. 
Pre-Phase-3 audit measured tools.go in isolation (121) and missed the other 4 files (+34 unique names). The test at internal/ciparity/surface_parity_test.go::TestSurfaceParity_MCP passes today (155 ≥ 150). Added a clarifying comment near mcpBaselineFloor explaining the measurement scope so future reviewers don't repeat the audit's framing error. STATUS: stale — no code drift, just a measurement scoping error in the audit. ARCH-L1 — panic() rationale comments 5 panic sites in production Go (excluding _test.go): - internal/repository/postgres/tx.go:84 - internal/service/issuer.go:861 (mustJSON) - internal/service/est.go:728 (mustParseTime) - internal/service/acme.go:1288 (rand source failure — already documented) - internal/pkcs7/certrep.go:270 (OID marshal — already documented) Added ARCH-L1 rationale comments to the 3 sites that didn't have them. All 5 are defensible impossible-path / rethrow / hardcoded-constant guards. ARCH-L3 — Migration IF-NOT-EXISTS carve-outs 4 migrations skip the literal 'IF NOT EXISTS' token but ARE idempotent via different Postgres patterns: - 000014_policy_violation_severity_check.up.sql: ALTER TABLE ADD CONSTRAINT CHECK doesn't accept IF NOT EXISTS; idempotency via DROP CONSTRAINT IF EXISTS preamble. - 000018_audit_events_worm.up.sql: CREATE OR REPLACE FUNCTION + DROP TRIGGER IF EXISTS + CREATE TRIGGER + DO $$ pg_roles existence check. CREATE TRIGGER doesn't take IF NOT EXISTS. - 000030_rbac_admin_perms.up.sql: INSERT ... ON CONFLICT DO NOTHING. - 000039_audit_crit1_perms.up.sql: same INSERT + ON CONFLICT pattern. Added ARCH-L3 header comments to each explaining the carve-out so reviewers don't flag the missing literal token. STATUS: largely stale — migrations are already idempotent. 
ARCH-L4 — TODO/FIXME → see #<descriptor> 5 TODOs rewritten to the allowed 'see #<descriptor>' pattern: - internal/repository/postgres/auth.go:220 → see #bundle-2-scope-fk - internal/connector/discovery/gcpsm/gcpsm.go:547 → see #gcpsm-pagination - internal/service/audit.go:244 → see #audit-pagination-count - internal/service/job.go:295, 299 → see #validation-job-impl New CI guard scripts/ci-guards/no-todo-in-prod.sh grep-fails any new TODO/FIXME in cmd/ + internal/ (excluding _test.go); allows 'see #N' / 'see #<descriptor>' patterns. Sandbox limitation ================== The 6.1 GB certctl working tree fills the sandbox volume; go1.25.10 toolchain download fails with 'no space left on device' (sandbox has 1.25.9; go.mod requires 1.25.10). Local 'go test' / 'go build' NOT run in this commit. Operator must run 'make verify' on their workstation before push per CLAUDE.md operating rules. The smoke.spec.ts NOT executed in the sandbox (no chromium installed). Operator runs 'cd web && npm install && npx playwright install --with-deps chromium && npm run e2e' on first wire-up. All CI guards (no-todo-in-prod, skip-inventory-drift, G-3 env-docs-drift, doc-rot-detector, and every existing guard) verified clean by running each individually. 
Closes: cowork/certctl-architecture-diligence-audit.html#fix-TEST-H1, cowork/certctl-architecture-diligence-audit.html#fix-TEST-H2, cowork/certctl-architecture-diligence-audit.html#fix-TEST-M1, cowork/certctl-architecture-diligence-audit.html#fix-TEST-M2, cowork/certctl-architecture-diligence-audit.html#fix-TEST-M3, cowork/certctl-architecture-diligence-audit.html#fix-TEST-M4, cowork/certctl-architecture-diligence-audit.html#fix-TEST-L1, cowork/certctl-architecture-diligence-audit.html#fix-ARCH-H3, cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L1, cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L2, cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L3, cowork/certctl-architecture-diligence-audit.html#fix-ARCH-L4	2026-05-13 20:10:08 +00:00
shankar0123	8b75e0311b	chore: rename Go module path to github.com/certctl-io/certctl Mechanical sed across the main go.mod's module declaration, the f5-mock-icontrol sub-module's go.mod, every Go file's import path (361 files), and a rebuild of the checked-in f5-mock-icontrol binary so its embedded build-info reflects the new module path. No behavior change. Choice B from cowork/transfer-certctl-to-org.md, executed 2026-05-04. Choice A (keep module path declared as github.com/shankar0123/certctl regardless of repo URL) shipped on the day of the org transfer (2026-05-03) since we had no external Go consumers; this commit closes that deferral. Backward-compat: GitHub HTTP redirects continue to forward github.com/shankar0123/certctl → github.com/certctl-io/certctl at the URL level, but Go's module proxy uses the path declared in go.mod as the canonical name. Pre-fix, anyone trying `go get github.com/certctl-io/certctl/...` hit a "module path mismatch" error because go.mod said github.com/shankar0123/certctl and the URL they fetched it from said certctl-io/certctl. Post-fix, the canonical name and the URL agree, so go get / go install / external Go consumers / Go-tooling integrations work cleanly via either the new path (preferred) or the old path (which redirects and Go follows the redirect for source fetch). Anyone still importing the old path inside their own code keeps working provided they update their go.mod's `require` line to match — the module path declared in their consumer's go.sum / go.mod is the authoritative import name, so a mass sed across their import statements is the migration on the consumer side. No external consumers exist today. 
Diff shape: 361 *.go files — import path replacement only 2 go.mod — module declaration replacement only 1 binary — deploy/test/f5-mock-icontrol/f5-mock-icontrol rebuilt so embedded build-info reflects the new path (8618965 vs 8618933 bytes; 32-byte diff is the build-info change) Total: 364 files, 730 insertions / 730 deletions, net-zero size, pure mechanical substitution. Verification: gofmt: 17 files needed re-alignment after sed (the new path is one char shorter than the old, so column-aligned import groups drifted). Applied `gofmt -w` to fix. go mod tidy: clean exit on both modules. go vet ./...: clean exit. go build ./...: clean exit. go test -short -count=1 on representative packages: all green (internal/domain, internal/validation, internal/crypto, internal/crypto/signer, cmd/agent). Test output now reads `ok github.com/certctl-io/certctl/...` confirming the module path resolves correctly. binary: f5-mock-icontrol rebuilt; `strings \| grep shankar0123` returns nothing; `strings \| grep certctl-io/certctl` shows the new module path embedded in build-info. Files intentionally NOT touched in this commit: README.md / CHANGELOG.md / docs/ / etc. — already swept to certctl-io URLs in commit `0729ee4` (the post-transfer URL refresh). This commit is purely the Go-tooling layer. Scarf pixels (`shankar0123.docker.scarf.sh/...`) — Scarf-account namespace, not a Go import or GitHub repo URL. Stays. This is a non-blocking, non-customer-impacting change. Operators pulling container images, running `make verify`, hitting the API, or installing the agent see no functional difference. Only Go-tooling consumers (none today) are affected, and they're enabled — not broken — by this commit.	2026-05-04 00:30:29 +00:00
shankar0123	35e18bfc56	scheduler: bound renewal concurrency via CERTCTL_RENEWAL_CONCURRENCY Closes the #9 acquisition-readiness blocker from the 2026-05-01 issuer coverage audit. Pre-fix, JobService.ProcessPendingJobs ran every claimed job sequentially in a single goroutine: safe but slow, and operators with large fleets had no lever to dial throughput up. Switching to fire-and-forget per-job goroutines would have unbounded the upstream-CA call rate and tripped DigiCert / Entrust / Sectigo rate limits — certctl's response to 429 was to retry on the next tick, re-fanning out the same calls and digging deeper into the limit. Operators need a knob. This commit: - Adds CERTCTL_RENEWAL_CONCURRENCY env var (default 25) loaded via the existing getEnvInt pattern in internal/config/config.go. Documented inline as the cap for the per-tick renewal/issuance/deployment goroutine fan-out, with operator-tuning guidance: permissive upstream limits + large fleets (>10k certs) → 100; strict limits or async-CA-heavy fleets → 25 or lower. - Wires golang.org/x/sync/semaphore.Weighted around the per-job goroutine launch in JobService.ProcessPendingJobs. Acquire(ctx, 1) is the load-bearing piece — it BLOCKS the loop when at the cap, providing real backpressure rather than fire-and-forget. The fan-out is split into processPendingJobsSequential (legacy, preserved for unit-test wiring that doesn't call SetRenewalConcurrency) and processPendingJobsConcurrent (production, delegates to a generic boundedFanOut helper). - boundedFanOut takes the per-job work as a closure so the cap can be tested directly without standing up the renewal/deployment service graph. processed/failed counters use atomic.Int64 to avoid mutex overhead on every job completion; final log line reads both AFTER wg.Wait so the counts reflect every dispatched job. 
ctx-aware Acquire ensures a shutdown ctx cancel interrupts the dispatch loop promptly; in-flight goroutines drain via Wait before the function returns so no goroutine outlives the scheduler tick. - shouldSkipJob extracted as a package-private helper so the agent-routed-deployment skip logic is shared between the sequential and concurrent paths byte-for-byte (the audit prompt's "channel-based semaphore without ctx-aware acquire" anti-pattern is explicitly avoided — semaphore.Weighted.Acquire returns on ctx done; channel <- struct{}{} would block forever). - SetRenewalConcurrency setter on JobService normalises ≤0 to 1. semaphore.NewWeighted(0) constructs a semaphore that blocks every Acquire forever; the normalisation prevents a misconfigured env var from wedging the scheduler. - cmd/server/main.go wires SetRenewalConcurrency(cfg.Scheduler.RenewalConcurrency) on the freshly-built jobService, immediately after SetAuditService. Production deployments always take the bounded path; tests that build JobService directly via NewJobService keep their strict-sequential behaviour because renewalConcurrency is the zero value. - Tests in internal/service/job_concurrency_test.go: * TestBoundedFanOut_CapHolds — primary regression guard. 50 jobs × 50ms work × cap=5 → asserts peak in-flight never exceeds 5 AND reaches 5 at least once (catches both upper-bound regressions and gates that incorrectly cap below the configured value). Lock-free max via CompareAndSwap so the measurement instrument doesn't itself constrain concurrency. * TestBoundedFanOut_AllJobsRun — lower-bound: every non-skipped job is dispatched. * TestBoundedFanOut_SkipsAgentRoutedDeployments — pins the shouldSkipJob contract. * TestBoundedFanOut_CtxCancelInterrupts — ctx cancellation interrupts a stuck fan-out within the timeout budget. * TestBoundedFanOut_FailedJobsCounted — per-job errors don't abort the fan-out. 
* TestSetRenewalConcurrency_NormalizesNonPositive — ≤0 → 1 fail-safe pinned across negative/zero/positive inputs. - docs/features.md: scheduler-loop table augmented with the concurrency-cap env-var pointer alongside the job-processor row. - docs/architecture.md: Concurrency Safety section gains a paragraph explaining the cap, the operator-tuning guidance, the ctx-aware Acquire semantics, and the audit reference. Operator-facing impact: the first big renewal sweep no longer takes down the upstream CA's rate-limit budget. Existing deployments get the bounded path automatically (default 25); operators can override via env var without code changes. Verified locally: - gofmt -l . clean - go vet ./... clean - staticcheck ./... clean - go test -short -count=1 across service / scheduler / config / integration: green - Six new tests under TestBoundedFanOut* + TestSetRenewalConcurrency*: green Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md Top-10 fix #9.	2026-05-02 14:12:30 +00:00
shankar0123	7cb453a336	chore(fmt): repo-wide gofmt -w sweep — close drift surfaced by ci-pipeline-cleanup Phase 4 Mechanical reformat. The new 'gofmt drift' CI step (added in ci-pipeline-cleanup Phase 4, commit `0f205a8`) surfaced 111 files with accumulated gofmt drift across cmd/, internal/, and deploy/test/. Each file's diff is gofmt-standard: whitespace adjustments, intra-group import sorting (alphabetical by import path within blank-line-separated groups), and struct-tag column alignment. No semantic changes — verified via 'git diff --ignore-all-space' which shows only the line-position deltas from import reordering. The gate stays in place after this commit. Going forward it catches gofmt drift at PR time.	2026-04-30 22:33:57 +00:00
shankar0123	62a412c488	Bundle C: Renewal/reliability cluster — 7 findings closed Closes M-006 + M-007 + M-008 + M-015 + M-016 + M-019 + M-020 from comprehensive-audit-2026-04-25. M-028 was already closed by the Bundle B CI follow-up. M-006 (CWE-913) — Idempotent migration 000014 migrations/000014_policy_violation_severity_check.up.sql: Prepended ALTER TABLE ... DROP CONSTRAINT IF EXISTS before the ADD. Mirrors the down migration's existing IF EXISTS shape and the M-7 idempotent-index idiom. Re-runs against partially-applied DBs now succeed. M-007 — Bulk-op partial-failure tests (3 new) internal/api/handler/bulk_partial_failure_test.go: TestBulkRevoke_PartialFailure_ReportsBoth TestBulkRenew_PartialFailure_ReportsBoth TestBulkReassign_PartialFailure_ReportsBoth Each asserts HTTP 200 + both success/failure counters round-trip + per-cert errors[] preserved with non-empty messages so operators can correlate each failure to its certificate ID. M-008 — Admin-gated handler enumeration pin (verified-already-clean) Recon: only one admin-gated handler — bulk_revocation.go — with full 3-branch test triplet already in place. health.go calls IsAdmin informationally to surface the flag to the GUI without gating. internal/api/handler/m008_admin_gate_test.go: Walks every handler .go file, asserts every middleware.IsAdmin call site is in AdminGatedHandlers (with required test triplet) or InformationalIsAdminCallers (justified). Adding a new admin gate without updating both the constant AND adding the test triplet fails CI. M-015 — Single-profile cardinality pin (verified-already-clean) Audit claim 'no cardinality validation' was wrong — enforced at struct level. domain.ManagedCertificate.{CertificateProfileID, RenewalPolicyID, IssuerID, OwnerID} and RenewalPolicy.CertificateProfileID are bare strings, not slices. internal/domain/m015_cardinality_test.go: reflect-based pin on kind=String. Schema change to N:N would have to update renewal.go's lookup loop in the same commit. 
M-016 (CWE-754) — Reap stale-agent jobs internal/repository/postgres/job.go::ListJobsWithOfflineAgents: JOIN jobs to agents on agent_id, filter (status=Running AND a.last_heartbeat_at < cutoff), exclude server-keygen jobs. internal/service/job.go::ReapJobsWithOfflineAgents: Flips matched jobs to Failed reason agent_offline so I-001 retry loop re-queues them on a healthy agent. Records audit event per reap. internal/scheduler/scheduler.go: Scheduler.runJobTimeout cycle now calls both reaper arms. agentOfflineJobTTL default 5min (5x agent-health-check default); SetAgentOfflineJobTTL knob for operator override. internal/service/job_offline_agent_reaper_test.go: 6 unit tests cover happy path, server-keygen-skip, non-Running-skip, non-positive-TTL fail-loud, repo-error propagation, audit-event recording. M-019 — Configurable ARI HTTP timeout Audit claim 'no fallback timeout' was wrong — ari.go:52 already had a 15s timeout. Bundle C makes it configurable. internal/connector/issuer/acme/acme.go: Config.ARIHTTPTimeoutSeconds field with env path CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS. internal/connector/issuer/acme/ari.go: Both HTTP clients (GetRenewalInfo + getARIEndpoint) now use the new ariHTTPTimeout() helper. Zero / negative / nil-config all fall back to the historic 15s default. ari_timeout_test.go: 4 dispatch arm tests. M-020 (CWE-770) — OCSP DoS hardening Pre-bundle the noAuthHandler chain had no rate limit. An attacker could DoS the OCSP responder, which for fail-open relying parties is a revocation bypass. cmd/server/main.go: noAuthHandler refactored from fixed middleware.Chain(...) to a conditional slice that appends middleware.NewRateLimiter when cfg.RateLimit.Enabled. Per-IP keying applies; OCSP/CRL/EST/SCEP are unauth. docs/security.md (NEW): Operator runbook documenting Must-Staple TLS Feature extension RFC 7633 as the architectural fix for fail-open relying parties. 
Profile-flip guidance + nginx/Apache/HAProxy/Envoy stapling snippets + explicit scope statement on what the rate limiter alone does NOT solve. Audit deliverables: cowork/comprehensive-audit-2026-04-25/audit-report.md: score 31/55 -> 38/55 closed (Medium 13/27 -> 20/27). cowork/comprehensive-audit-2026-04-25/findings.yaml: 7 status flips open -> closed with closure notes citing the Bundle C mechanism. certctl/CHANGELOG.md: Bundle C section under [unreleased]. Verification: go vet ./internal/service ./internal/scheduler ./internal/connector/issuer/acme ./internal/api/handler ./internal/domain ./cmd/server clean go test -count=1 -short on the same packages all green helm template + helm lint clean internal/repository/postgres setup-fail sandbox disk pressure (same on master HEAD before this branch)	2026-04-27 00:08:25 +00:00
shankar0123	1ee77c89f8	I-003: job timeout reaper closes AwaitingCSR/AwaitingApproval gap Add 11th always-on scheduler loop that transitions jobs stuck in AwaitingCSR (default 24h TTL) or AwaitingApproval (default 168h TTL) to Failed. I-001's retry loop then auto-promotes eligible Failed jobs back to Pending. No new status enum, no schema migration. - JobRepository.ListTimedOutAwaitingJobs with per-status cutoff WHERE - JobService.ReapTimedOutJobs mirrors RetryFailedJobs structure - Scheduler jobTimeoutLoop with atomic.Bool idempotency guard, 2m per-tick context, WaitGroup shutdown drain - Config: CERTCTL_JOB_TIMEOUT_INTERVAL (10m), CERTCTL_JOB_AWAITING_CSR_TIMEOUT (24h), CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT (168h) - Audit event per transition: actor=system, actorType=System, action=job_timeout, details={old_status, new_status, timeout_reason, age_hours} - 14 new tests: 3 config, 7 service, 4 scheduler	2026-04-19 01:37:18 +00:00
shankar0123	0200c7f4a4	Close I-001 (RetryFailedJobs never invoked) coverage-gap finding Operator decision answered as Option A: JobService.RetryFailedJobs is now wired into the scheduler as an always-on 10th loop. Prior to this commit the method was implemented, unit-tested, and exported but had zero runtime callers — any job that transitioned to status=Failed stayed Failed forever regardless of how many attempts it had remaining. Scheduler — 10th loop: internal/scheduler/scheduler.go grows a jobRetryLoop alongside the existing nine loops (renewal, jobs, health, notifications, short-lived, network scan, digest, health check, cloud discovery). The loop follows the established run-immediately-then-tick pattern (same shape as jobProcessorLoop), gated by a sync/atomic.Bool idempotency guard and joined into the scheduler's sync.WaitGroup so WaitForCompletion drains it on graceful shutdown. Each tick runs under a 2-minute context timeout mirroring jobProcessorLoop's opCtx budget. The runJobRetry helper invokes jobService.RetryFailedJobs(ctx, 3) — the advisory maxRetries cap is belt-and-suspenders; per-job eligibility is still enforced inside the service via Attempts < MaxAttempts. The JobServicer scheduler-interface gains RetryFailedJobs so the scheduler's dependency surface stays explicit and mockable. Service — audit trail per retry: internal/service/job.go:RetryFailedJobs now emits an audit event for every Failed→Pending transition. Following the house convention used by all scheduler-emitted events, actor='system' and actorType=domain.ActorTypeSystem; action='job_retry'; details capture old_status, new_status, attempts, max_attempts. JobService carries an optional *AuditService (SetAuditService) that nil-guards to preserve test-wiring ergonomics — existing tests that construct JobService without an audit service continue to pass unchanged. 
Config — env var with sane default: internal/config/config.go:SchedulerConfig grows RetryInterval, wired to CERTCTL_SCHEDULER_RETRY_INTERVAL with a 5-minute default. Validate rejects intervals below 1 second (matches other scheduler interval validators). Server wiring: cmd/server/main.go calls jobService.SetAuditService(auditService) after JobService construction and sched.SetJobRetryInterval(cfg.Scheduler.RetryInterval) alongside the other SetXxxInterval calls. Regression coverage: internal/service/job_test.go (3 new) - TestJobService_RetryFailedJobs_EligibleJobTransitionsAndAudits - TestJobService_RetryFailedJobs_SkipsJobsAtMaxAttempts - TestJobService_RetryFailedJobs_NoAuditServiceOK internal/scheduler/scheduler_test.go (3 new) - TestScheduler_JobRetryLoop_CallsService - TestScheduler_JobRetryLoop_IdempotencyGuard - TestScheduler_JobRetryLoop_WaitForCompletion The service tests assert status transitions, attempt-cap short-circuiting, and audit event shape (actor='system', action='job_retry', details keys). The scheduler tests assert the loop invokes the service, the atomic.Bool guard skips overlapping ticks with the expected 'still running, skipping tick' log, and WaitForCompletion drains the in-flight tick on Stop. Residual follow-up (not in scope for this commit): internal/service/renewal.go:RetryFailedJobs is a parallel dead-code duplicate of the same logic on RenewalService — untested and has no runtime caller. The audit finding called this out as 'implemented twice'. Removing it is a separate cleanup and does not block the Option-A wiring this commit delivers. 
Files: cmd/server/main.go — SetAuditService + SetJobRetryInterval internal/config/config.go — RetryInterval field + env + validate internal/scheduler/scheduler.go — 10th loop, interface, field, setter internal/scheduler/scheduler_test.go — 3 new scheduler-loop tests internal/service/job.go — RetryFailedJobs audit emission + SetAuditService internal/service/job_test.go — 3 new service-layer tests	2026-04-18 23:24:54 +00:00
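Two mechanisms from the I-001 commit above lend themselves to a short sketch: the `atomic.Bool` idempotency guard that skips a tick while the previous one is still running, and the `Attempts < MaxAttempts` eligibility rule. The `tickGuard` type and both function names are hypothetical; the repository's loop additionally handles tickers, context timeouts, and WaitGroup draining, which are omitted here.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tickGuard is a hypothetical wrapper around the atomic.Bool guard pattern:
// only one tick body may run at a time, and overlapping ticks are skipped.
type tickGuard struct {
	running atomic.Bool
}

// tryRun executes fn unless a previous tick is still in progress. It reports
// whether fn actually ran.
func (g *tickGuard) tryRun(fn func()) bool {
	if !g.running.CompareAndSwap(false, true) {
		return false // still running, skipping tick
	}
	defer g.running.Store(false)
	fn()
	return true
}

// eligibleForRetry mirrors the per-job rule quoted in the commit message.
func eligibleForRetry(attempts, maxAttempts int) bool {
	return attempts < maxAttempts
}

func main() {
	g := &tickGuard{}
	ran := g.tryRun(func() {
		// Simulate a second tick firing while the first is still in flight.
		overlapped := g.tryRun(func() {})
		fmt.Println("overlapping tick ran:", overlapped)
	})
	fmt.Println("first tick ran:", ran)
	fmt.Println("next tick ran:", g.tryRun(func() {}))
	fmt.Println("retry 2/3:", eligibleForRetry(2, 3), "retry 3/3:", eligibleForRetry(3, 3))
}
```

Using CompareAndSwap rather than a separate Load and Store closes the window in which two ticks could both observe the guard as free.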
shankar0123	3287e174dc	Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001) Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006) on top of the C-001/C-002 ownership + agent-FK contract fixes landed in `a53a4b8`. The work lands as a single commit spanning server, docs, tests, and the React client. M-002 — Named API keys with per-key actor propagation * Migration 000014 adds the 'api_keys' table (id, name, hash, principal, role, created_at, last_used_at, disabled_at) so every credential carries an identifiable principal instead of the opaque 'anonymous'/'api-key' sentinel. * Auth middleware now rotates through configured keys, performs constant-time hash comparison, stamps 'last_used_at', and emits an actor struct via contextWithActor(). The audit middleware, bulk-revocation handler, approval handlers, and MCP tool layer now read the principal off the context and persist it on every audit_events row. * Regression coverage: - internal/api/middleware/audit_test.go — actor propagation, principal redaction for disabled keys, anonymous fallback for unauthenticated endpoints. - internal/api/handler/bulk_revocation_handler_test.go, job_handler_test.go — principal-on-audit assertions. M-003 — Authorization gates (Phase B) * Approval handler rejects self-approval / self-rejection with 403 when the actor principal equals the job's requested_by field. * Bulk revocation is gated behind the 'admin' role; operators and viewers receive 403. * Regression coverage: - internal/service/job_test.go — TestApproveJob_NotSelf, TestRejectJob_NotSelf. - internal/api/handler/bulk_revocation_handler_test.go — TestBulkRevoke_RequiresAdmin, TestBulkRevoke_AdminSucceeds. M-006 — RFC-compliant CRL/OCSP on the unauthenticated .well-known mux * Per RFC 8615, relying parties cannot reasonably be asked to authenticate against the issuing certctl instance to retrieve revocation material. 
CRL and OCSP move off the authenticated '/api/v1/crl' and '/api/v1/ocsp/' paths onto: GET /.well-known/pki/crl/{issuer_id} Content-Type: application/pkix-crl (RFC 5280 §5) GET /.well-known/pki/ocsp/{issuer_id}/{serial} Content-Type: application/ocsp-response (RFC 6960) * Non-standard JSON CRL shape is removed; only DER is served. * Short-lived certificate exemption (profile TTL < 1h → skip CRL/OCSP) is preserved; the response simply omits the serial. * Routes are registered on the unauthenticated 'finalHandler' mux in cmd/server/main.go alongside EST ('/.well-known/est/') and SCEP ('/scep'). Legacy authenticated paths return 404. Regression coverage: - internal/api/handler/certificate_handler_test.go — content type, DER parseability, 404 for unknown issuer. - internal/api/handler/adversarial_path_test.go — unauthenticated access asserted for CRL, OCSP, EST, SCEP. - internal/api/router/router_test.go — route-table assertion that '.well-known/pki/', '.well-known/est/', and '/scep' are mounted on the unauthenticated branch. M-001 — Auto-closed by M-002 EST and SCEP were already registered on the unauthenticated 'finalHandler' mux; the router comment at internal/api/router/router.go:247 now matches reality. The adversarial-path tests above lock the behavior in. Verification (all gates green): * go vet ./... — clean * go build ./... — ok * go test -short ./... (55+ packages) — all pass * web/ : npm test (225 Vitest tests) — all pass * web/ : npx tsc --noEmit — clean * grep sweep for '/api/v1/(crl\|ocsp)' — 13 surviving hits, all intentional M-006 tombstone/relocation comments. Documentation: * coverage-gap-audit.md — status flips M-001/M-002/M-003/M-006 → Fixed, with per-finding resolution paragraphs citing regression test IDs. (Audit file lives outside this repo; see cowork root.) * CLAUDE.md Project Status line updated with the auth-unification closure note. 
* docs/features.md, docs/architecture.md, docs/quickstart.md, docs/concepts.md, docs/connectors.md, docs/test-env.md, docs/testing-guide.md, docs/compliance-.md, docs/demo-advanced.md — refreshed for the new '.well-known/pki/' namespace and named API keys. * api/openapi.yaml — documents the new unauthenticated endpoints and removes the legacy '/api/v1/crl' + '/api/v1/ocsp/' paths. .gitignore: adds '/.gocache/' and '/.gomodcache/' for the session- scoped Go caches so they never enter the tree.	2026-04-18 18:17:41 +00:00
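The M-002 and M-003 items in the commit above describe a key lookup with constant-time hash comparison followed by two authorization gates. A minimal Go sketch of that flow follows. It is an illustration, not the repository's code: the `APIKey`, `Actor`, `NewAPIKey`, `Authenticate`, `CanDecide`, and `CanBulkRevoke` names are hypothetical, and the use of SHA-256 is an assumption (the commit message does not name the hash algorithm).

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
)

// APIKey is a hypothetical in-memory view of one api_keys row.
type APIKey struct {
	Name      string
	Hash      [32]byte // SHA-256 of the secret (assumed algorithm)
	Principal string
	Role      string
	Disabled  bool
}

// Actor is the identity attached to a request once a key matches.
type Actor struct {
	Principal string
	Role      string
}

// NewAPIKey builds an enabled key record from a plaintext secret.
func NewAPIKey(name, secret, principal, role string) APIKey {
	return APIKey{Name: name, Hash: sha256.Sum256([]byte(secret)), Principal: principal, Role: role}
}

// Authenticate hashes the presented secret and compares it against every
// configured key with a constant-time comparison. It does not return early
// on a match, and disabled keys never authenticate.
func Authenticate(keys []APIKey, presented string) (Actor, bool) {
	sum := sha256.Sum256([]byte(presented))
	var actor Actor
	found := false
	for _, k := range keys {
		match := subtle.ConstantTimeCompare(sum[:], k.Hash[:]) == 1
		if match && !k.Disabled {
			actor = Actor{Principal: k.Principal, Role: k.Role}
			found = true
		}
	}
	return actor, found
}

// CanDecide reports whether the actor may approve or reject a job.
// An actor may not decide on a job they requested themselves.
func CanDecide(actor Actor, requestedBy string) bool {
	return actor.Principal != requestedBy
}

// CanBulkRevoke reports whether the actor holds the admin role.
func CanBulkRevoke(actor Actor) bool {
	return actor.Role == "admin"
}
```

In this sketch a handler would call `Authenticate` once per request, then return 403 whenever `CanDecide` or `CanBulkRevoke` reports false, which corresponds to the self-approval and admin-only behavior the commit describes.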
shankar0123	ccd89c348f	fix(m2-pr-d): thread ctx through Job/Notification/Audit services Collapse CancelJobWithContext into CancelJob; eliminate 10 context.Background() hits across the Job+Notification+Audit service cluster by threading ctx through their handler-facing service interfaces. Services (ctx-first): - service/job.go: ListJobs, GetJob, CancelJob, ApproveJob, RejectJob now accept ctx; the CancelJobWithContext wrapper is removed (handler callers continue to invoke CancelJob, now ctx-aware). - service/notification.go: ListNotifications, GetNotification, MarkAsRead accept ctx. - service/audit.go: ListAuditEvents, GetAuditEvent accept ctx. Handlers (interface + callsites): - handler/jobs.go, handler/notifications.go, handler/audit.go: local service interfaces updated, r.Context() threaded at every callsite. Tests: - Mock services updated to match the new interfaces (ctx accepted and ignored via '_ context.Context' first parameter; Fn closure fields unchanged). - job_test.go / notification_test.go callsites thread context.Background() to match production shape. Verification: go build ./... ok go vet ./... ok go test -short ./... ok go test -race -short ./... ok golangci-lint run ./... 0 issues Locked decisions from the M-2 plan: D-1 ctx-only signatures (no dual forms) D-4 preserve handler method names facing the router D-5 domain types stay ctx-free Audit complete. Commit: `1f6cf0eafa`. Sections: 12. Findings: 2/7/10/4/6.	2026-04-18 01:20:46 +00:00
shankar0123	89b910a8f1	security: atomic pending-job claim with FOR UPDATE SKIP LOCKED (H-6) Fixes H-6 (CWE-362) — GetPendingJobs returned pending rows without row locks, so two scheduler replicas in an HA deployment could both read the same row, both decide it was theirs, and race on UpdateStatus, producing duplicate Running jobs and duplicate certificate issuances. Remediation: a claim-style repository API that selects + transitions Pending -> Running in one transaction with SELECT ... FOR UPDATE SKIP LOCKED. Concurrent claimants observe disjoint row sets; no worker ever sees another worker's claimed row. Repository changes (internal/repository/postgres/job.go): - New ClaimPendingJobs(ctx, jobType, limit): BEGIN; SELECT id,... FROM jobs WHERE status='Pending' (optional type filter, optional LIMIT) FOR UPDATE SKIP LOCKED; UPDATE jobs SET status='Running', updated_at=NOW() WHERE id = ANY($ids); COMMIT. Returns the claimed rows with status already flipped. - New ClaimPendingByAgentID(ctx, agentID): mirrors M31 UNION ALL semantics (direct agent_id match, target->agent JOIN fallback, certificate->target->agent chain for AwaitingCSR) but wraps each branch in FOR UPDATE SKIP LOCKED and flips Deployment/Renewal rows to Running. AwaitingCSR rows are returned in place (state transition deferred until SubmitCSR, consistent with M8 semantics). - Existing GetPendingJobs / ListPendingByAgentID retained for legacy compatibility; their godoc now directs production callers to the Claim* variants. Production caller switches: - internal/service/job.go ProcessPendingJobs: ListByStatus(Pending) -> ClaimPendingJobs(ctx, "", 0). Eliminates the real scheduler race between two replicas tick-firing simultaneously. - internal/service/agent.go GetPendingWork: ListPendingByAgentID -> ClaimPendingByAgentID. Eliminates the race between two pollers for the same agent (e.g. brief network blip causing duplicate poll) and between a scheduler tick and an agent poll. 
Safety argument for pre-flipping Pending -> Running inside the claim transaction: ProcessRenewalJob and ProcessDeploymentJob both call UpdateStatus(Running) unconditionally on entry, so an early flip is idempotent. On panic, the scheduler's panic recovery leaves the job in Running which the existing stale-running reaper handles. Tests (internal/repository/postgres/repo_test.go, skipped in -short): - TestJobRepository_ClaimPendingJobs_FlipsToRunning: seed 5 Pending, claim once, assert all 5 returned + DB rows Running, residual claim returns 0. - TestJobRepository_ClaimPendingJobs_ConcurrentDisjoint: seed M=40 Pending Renewals, spawn N=8 goroutines each calling ClaimPendingJobs(_, JobTypeRenewal, 1) in a loop. Invariants: (a) no job ID claimed by more than one worker, (b) sum of claims == 40, (c) all 40 rows in Running state in the DB. Bounded empty-streak guard (20 iterations) covers SKIP LOCKED transient zeros under contention. - TestJobRepository_ClaimPendingByAgentID_TransitionsDeployments: seeds 2 Pending Deployment + 1 AwaitingCSR for agent A plus 1 Pending Renewal for agent B (scope check). Asserts deployments flip to Running, AwaitingCSR is returned but preserved, agent B's renewal never appears. Mock updates: testutil_test.go, lifecycle_test.go, verification_test.go gained ClaimPendingJobs/ClaimPendingByAgentID on their mock job repos mirroring the real Pending -> Running semantics. Mocks intentionally do NOT write to StatusUpdates (that map tracks UpdateStatus() call history specifically; the real claim path uses a bulk UPDATE, not UpdateStatus). Verification (CI-scope): - go build ./cmd/...: ok - go vet ./...: ok - go test -race -short on service, api/handler, api/middleware, scheduler, connector/..., domain, validation, tlsprobe: ok - Coverage gates: service 67.6% (>=55), handler 78.6% (>=60), middleware 80.0% (>=30), domain 92.7% (>=40). All hold. 
- golangci-lint 2.11.4: 0 issues - govulncheck: no vulnerabilities in call graph - Frontend: tsc clean, 218 vitest tests pass, vite build ok - helm lint + helm template: ok - Invariant sweeps: FOR UPDATE SKIP LOCKED present in job.go; H-1 through H-5 fixtures unchanged. Refs: H-6 in certctl-audit-report.md	2026-04-17 02:34:56 +00:00
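The claim semantics described in the H-6 commit above (select Pending rows and flip them to Running in one atomic step, so concurrent claimants receive disjoint sets) can be illustrated with an in-memory analogue. This is a sketch under stated assumptions, not the repository code: `Job`, `Queue`, `ClaimPending`, and `RunWorkers` are hypothetical names, and a mutex stands in for the database transaction. Unlike `FOR UPDATE SKIP LOCKED`, a mutex makes contenders wait instead of skipping locked rows, so the transient empty results mentioned in the commit cannot occur here.

```go
package main

import (
	"sync"
)

// Job is a hypothetical stand-in for one row of the jobs table.
type Job struct {
	ID     int
	Status string // "Pending" or "Running"
}

// Queue holds jobs in memory; the mutex plays the role of the
// claim transaction in the database-backed version.
type Queue struct {
	mu   sync.Mutex
	jobs []*Job
}

// ClaimPending selects up to limit Pending jobs and flips them to Running
// inside one critical section, so no two callers receive the same job.
// A limit <= 0 means no cap.
func (q *Queue) ClaimPending(limit int) []Job {
	q.mu.Lock()
	defer q.mu.Unlock()
	var claimed []Job
	for _, j := range q.jobs {
		if limit > 0 && len(claimed) >= limit {
			break
		}
		if j.Status == "Pending" {
			j.Status = "Running"
			claimed = append(claimed, *j)
		}
	}
	return claimed
}

// RunWorkers seeds total Pending jobs, starts the given number of goroutines
// that each claim one job at a time until nothing is left, and returns how
// many times each job ID was claimed.
func RunWorkers(total, workers int) map[int]int {
	q := &Queue{}
	for i := 1; i <= total; i++ {
		q.jobs = append(q.jobs, &Job{ID: i, Status: "Pending"})
	}
	counts := make(map[int]int)
	var countsMu sync.Mutex
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				got := q.ClaimPending(1)
				if len(got) == 0 {
					return // queue drained
				}
				countsMu.Lock()
				counts[got[0].ID]++
				countsMu.Unlock()
			}
		}()
	}
	wg.Wait()
	return counts
}
```

Calling `RunWorkers(40, 8)` mirrors the shape of the concurrent-disjoint test in the commit: every one of the 40 job IDs appears in the result exactly once, regardless of how the 8 goroutines interleave.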
shankar0123	b059ec930f	fix: end-to-end certificate lifecycle bugs + integration test environment Fixes 12 production bugs preventing the full issuance→deployment flow from working with ACME (Pebble/Let's Encrypt) and step-ca issuers: ACME connector (acme.go): - Save orderURI before WaitOrder overwrites it (Go crypto/acme bug) - Add CreateOrderCert fallback via WaitOrder+FetchCert - Remove defer-reset in ValidateConfig that caused nil pointer panic - Add Insecure TLS option for self-signed ACME servers (Pebble) step-ca connector (stepca.go, jwe.go): - Real JWE provisioner key loading + decryption (was using ephemeral keys) - Fix JWT audience (/1.0/sign), sha claim (key fingerprint), kid header - Custom root CA trust via RootCertPath config - Remove hardcoded 90-day validity default (let step-ca decide) NGINX target connector (nginx.go): - Use sh -c for validate/reload commands (shell interpretation) - Use filepath.Dir instead of fragile string slicing - Add private key file writing (agent-mode keys were never deployed) - Make chain_path write conditional Server/service layer: - TriggerRenewalWithActor now creates actual Job records (was no-op) - createDeploymentJobs falls back to DB query when cert.TargetIDs empty - ProcessPendingJobs skips agent-routed deployment jobs - Agent cert pickup path parsing: len(parts)<4 → len(parts)<3 - Health/ready/auth-info endpoints bypass auth middleware - Write timeout 15s→120s for ACME issuance - Cert fingerprint computed on CSR submission Integration test environment (deploy/test/): - 10-phase test script covering Local CA, ACME, step-ca, revocation, discovery, renewal, and API spot checks - Docker Compose with 7 containers (server, agent, postgres, nginx, pebble, challtestsrv, step-ca) on isolated network - TLS verification checks SAN (not just Subject CN) for modern CA compat Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-02 17:02:20 -04:00
shankar0123	b0549e6f05	feat: M11b — ownership tracking, agent groups, interactive renewal approval Ownership: owners/teams GUI pages, notification email resolution via resolveRecipient (owner_id → owner.email lookup). Agent groups: dynamic device grouping by OS/arch/IP CIDR/version with manual include/exclude membership, migration 000004, full CRUD stack (domain → repo → service → handler → frontend). Interactive approval: AwaitingApproval job state, approve/reject API endpoints with reason tracking. Tests: 12 agent group handler tests, 8 approve/reject job handler tests, integration tests updated for 13-param RegisterHandlers. Docs updated across architecture, concepts, and seed data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 21:02:35 -04:00
shankar0123	66f04f7afe	style: run gofmt -s across all Go files Fixes Go Report Card gofmt score from 52% to 100%. Pure formatting changes — no logic modifications. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 19:32:29 -04:00
shankar0123	ae67b10708	Complete M1, M1.1, M2: end-to-end lifecycle, agent deployment, ACME v2 - Wire issuer connector end-to-end with IssuerConnectorAdapter (dependency inversion) - Renewal/issuance job processor: RSA key + CSR generation, Local CA signing, cert version storage - Agent work API (GET /agents/{id}/work) and job status API (POST /agents/{id}/jobs/{job_id}/status) - Agent-side deployment: WorkItem enrichment with target type/config, NGINX/F5/IIS connector invocation - Full ACME v2 implementation: HTTP-01 challenge solving, account registration, order lifecycle - Update all docs (README, architecture, connectors, demo-advanced, quickstart) for M1-M2 - Fix go vet warning in deployment.go (non-constant format string) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 23:49:45 -04:00
shankar0123	9b4122b159	Fix runtime bugs, implement service layer, and overhaul documentation Runtime fixes: - Fix env var mismatch (CERTCTL_DB_URL → CERTCTL_DATABASE_URL) - Fix table name mismatches (certificates → managed_certificates, notifications → notification_events) - Add renewal_policy_id to certificate queries - Remove non-existent created_at from notification queries - Add env var fallback for agent CLI flags - Graceful degradation for missing notifiers/issuers in demo mode - Copy web/ directory in Dockerfile for dashboard serving Service layer: - Implement handler-service interface pattern across all services - Wire up certificate, agent, job, policy, team, owner, audit, notification services Documentation: - Add concepts.md: beginner-friendly guide to TLS, CAs, private keys - Rewrite quickstart.md with accurate API examples matching actual handlers - Add demo-advanced.md: interactive demo with cert issuance and automated script - Update architecture.md with correct table names and connector interfaces - Update connectors.md to match actual Go interface signatures - Update demo-guide.md with cross-references to new docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 21:38:11 -04:00
shankar0123	d395776a95	Initial scaffold: certificate control plane v0.1.0	2026-03-14 08:22:17 -04:00

18 Commits