certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-07 22:01:36 +00:00

Author	SHA1	Message	Date
shankar0123	21aeed4f4e	legal: addlicense headers + normalize legacy variants (Phase 0 RED-4) Phase 0 closure (Path B2, post-rewrite): addlicense sweep — adds the canonical certctl LLC copyright + BUSL-1.1 SPDX header to every production Go file. Template: // Copyright 2026 certctl LLC. All rights reserved. // SPDX-License-Identifier: BUSL-1.1 Coverage: 338 / 338 production Go files (cmd/ + internal/, excluding _test.go and /testdata/). Pre-sweep coverage was 22 / 338 (6.5%); post-sweep is 338 / 338 (100%). Normalized 22 pre-existing legacy headers (`// Copyright (c) certctl` + `// SPDX-License-Identifier: BSL-1.1`) and 1 file using a `Certctl Contributors` attribution. The legacy SPDX ID `BSL-1.1` is non-standard; the official SPDX identifier for Business Source License 1.1 is `BUSL-1.1` (capital U). All 338 files now share the canonical form. Generated via: addlicense -c "certctl LLC" -y 2026 \ -f cowork/legal/copyright-header.tpl \ -ignore '/testdata/' -ignore '/_test.go' \ cmd/ internal/ Verification: find cmd internal -name '.go' -not -name '_test.go' \ -not -path '/testdata/' \ -exec grep -L '^// Copyright 2026 certctl LLC' {} \; \| wc -l Returns: 0 gofmt clean. Header additions are comments only, no compile impact. Closes: cowork/certctl-architecture-diligence-audit.html#fix-RED-4	2026-05-13 21:23:35 +00:00
shankar0123	1b03d0c594	fix(repo/job): split UNION ALL + FOR UPDATE into two queries (Postgres-correctness) Phase-9 docker compose smoke surfaced a latent production-breaking bug introduced by commit `89b910a` (H-6 atomic pending-job claim). The ClaimPendingByAgentID query in internal/repository/postgres/job.go combined UNION ALL with FOR UPDATE SKIP LOCKED in a single statement. Postgres rejects this with: ERROR: FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT Every agent work-poll returns HTTP 500 in any real deployment where an agent is actually polling. From the compose log: request_id=6da47015-... GET /api/v1/agents/agent-demo-1/work status=500 duration_ms=2 The schema-per-test unit harness in internal/repository/postgres/ *_test.go never inserted jobs and polled, so the SQL execution path was never exercised. The bug has been latent in master since `89b910a` landed. Fix: split the UNION ALL into two separate FOR UPDATE SKIP LOCKED queries within the existing transaction. The H-6 atomicity invariant (concurrent pollers never see the same Pending row) is preserved because: 1. The two queries run inside the same transaction (tx). 2. Each query independently locks its result rows with FOR UPDATE SKIP LOCKED. 3. The subsequent UPDATE that flips Pending -> Running runs in the same transaction, so the rows stay invisible to concurrent callers from initial SELECT through final COMMIT. 4. The transaction is the unit of consistency, not the single SQL statement. Two queries: - Branch 1 (direct): jobs.agent_id = + status='Pending' + type='Deployment'. ORDER BY created_at ASC, FOR UPDATE SKIP LOCKED. - Branch 2 (fallback): jobs.agent_id IS NULL + INNER JOIN deployment_targets dt ON jobs.target_id = dt.id WHERE dt.agent_id = . ORDER BY j.created_at ASC, FOR UPDATE OF j SKIP LOCKED (FOR UPDATE OF needed because the join brings in dt). Branch 3 (AwaitingCSR) is unchanged — already a single SELECT, not affected by the UNION restriction. Inline comment explains the fix's load-bearing-ness so a future refactor doesn't merge them back into one UNION query. Verify (sandbox): go vet clean; go test -short -count=1 PASS on internal/repository/postgres/. Workstation re-runs 'docker compose up' to confirm the agent's GET /work returns 200 with the next pending-deployment claim. Note: this is NOT a regression introduced by Auth Bundle 2 or the 2026-05-11 audit fixes; it's a pre-existing latent defect from H-6. Including in v2.1.0 because shipping with a broken agent work-poll would block the demo path on day one of release.	2026-05-11 16:11:33 +00:00
shankar0123	8b75e0311b	chore: rename Go module path to github.com/certctl-io/certctl Mechanical sed across the main go.mod's module declaration, the f5-mock-icontrol sub-module's go.mod, every Go file's import path (361 files), and a rebuild of the checked-in f5-mock-icontrol binary so its embedded build-info reflects the new module path. No behavior change. Choice B from cowork/transfer-certctl-to-org.md, executed 2026-05-04. Choice A (keep module path declared as github.com/shankar0123/certctl regardless of repo URL) shipped on the day of the org transfer (2026-05-03) since we had no external Go consumers; this commit closes that deferral. Backward-compat: GitHub HTTP redirects continue to forward github.com/shankar0123/certctl → github.com/certctl-io/certctl at the URL level, but Go's module proxy uses the path declared in go.mod as the canonical name. Pre-fix, anyone trying `go get github.com/certctl-io/certctl/...` hit a "module path mismatch" error because go.mod said github.com/shankar0123/certctl and the URL they fetched it from said certctl-io/certctl. Post-fix, the canonical name and the URL agree, so go get / go install / external Go consumers / Go-tooling integrations work cleanly via either the new path (preferred) or the old path (which redirects and Go follows the redirect for source fetch). Anyone still importing the old path inside their own code keeps working provided they update their go.mod's `require` line to match — the module path declared in their consumer's go.sum / go.mod is the authoritative import name, so a mass sed across their import statements is the migration on the consumer side. No external consumers exist today. Diff shape: 361 *.go files — import path replacement only 2 go.mod — module declaration replacement only 1 binary — deploy/test/f5-mock-icontrol/f5-mock-icontrol rebuilt so embedded build-info reflects the new path (8618965 vs 8618933 bytes; 32-byte diff is the build-info change) Total: 364 files, 730 insertions / 730 deletions, net-zero size, pure mechanical substitution. Verification: gofmt: 17 files needed re-alignment after sed (the new path is one char shorter than the old, so column-aligned import groups drifted). Applied `gofmt -w` to fix. go mod tidy: clean exit on both modules. go vet ./...: clean exit. go build ./...: clean exit. go test -short -count=1 on representative packages: all green (internal/domain, internal/validation, internal/crypto, internal/crypto/signer, cmd/agent). Test output now reads `ok github.com/certctl-io/certctl/...` confirming the module path resolves correctly. binary: f5-mock-icontrol rebuilt; `strings \| grep shankar0123` returns nothing; `strings \| grep certctl-io/certctl` shows the new module path embedded in build-info. Files intentionally NOT touched in this commit: README.md / CHANGELOG.md / docs/ / etc. — already swept to certctl-io URLs in commit `0729ee4` (the post-transfer URL refresh). This commit is purely the Go-tooling layer. Scarf pixels (`shankar0123.docker.scarf.sh/...`) — Scarf-account namespace, not a Go import or GitHub repo URL. Stays. This is a non-blocking, non-customer-impacting change. Operators pulling container images, running `make verify`, hitting the API, or installing the agent see no functional difference. Only Go-tooling consumers (none today) are affected, and they're enabled — not broken — by this commit.	2026-05-04 00:30:29 +00:00
shankar0123	7cb453a336	chore(fmt): repo-wide gofmt -w sweep — close drift surfaced by ci-pipeline-cleanup Phase 4 Mechanical reformat. The new 'gofmt drift' CI step (added in ci-pipeline-cleanup Phase 4, commit `0f205a8`) surfaced 111 files with accumulated gofmt drift across cmd/, internal/, and deploy/test/. Each file's diff is gofmt-standard: whitespace adjustments, intra- group import sorting (alphabetical by import path within blank-line- separated groups), and struct-tag column alignment. No semantic changes — verified via 'git diff --ignore-all-space' which shows only the line-position deltas from import reordering. The gate stays in place after this commit. Going forward it catches gofmt drift at PR time.	2026-04-30 22:33:57 +00:00
shankar0123	62a412c488	Bundle C: Renewal/reliability cluster — 7 findings closed Closes M-006 + M-007 + M-008 + M-015 + M-016 + M-019 + M-020 from comprehensive-audit-2026-04-25. M-028 was already closed by the Bundle B CI follow-up. M-006 (CWE-913) — Idempotent migration 000014 migrations/000014_policy_violation_severity_check.up.sql: Prepended ALTER TABLE ... DROP CONSTRAINT IF EXISTS before the ADD. Mirrors the down migration's existing IF EXISTS shape and the M-7 idempotent-index idiom. Re-runs against partially-applied DBs now succeed. M-007 — Bulk-op partial-failure tests (3 new) internal/api/handler/bulk_partial_failure_test.go: TestBulkRevoke_PartialFailure_ReportsBoth TestBulkRenew_PartialFailure_ReportsBoth TestBulkReassign_PartialFailure_ReportsBoth Each asserts HTTP 200 + both success/failure counters round-trip + per-cert errors[] preserved with non-empty messages so operators can correlate each failure to its certificate ID. M-008 — Admin-gated handler enumeration pin (verified-already-clean) Recon: only one admin-gated handler — bulk_revocation.go — with full 3-branch test triplet already in place. health.go calls IsAdmin informationally to surface the flag to the GUI without gating. internal/api/handler/m008_admin_gate_test.go: Walks every handler .go file, asserts every middleware.IsAdmin call site is in AdminGatedHandlers (with required test triplet) or InformationalIsAdminCallers (justified). Adding a new admin gate without updating both the constant AND adding the test triplet fails CI. M-015 — Single-profile cardinality pin (verified-already-clean) Audit claim 'no cardinality validation' was wrong — enforced at struct level. domain.ManagedCertificate.{CertificateProfileID, RenewalPolicyID,IssuerID,OwnerID} and RenewalPolicy. CertificateProfileID are bare strings, not slices. internal/domain/m015_cardinality_test.go: reflect-based pin on kind=String. Schema change to N:N would have to update renewal.go's lookup loop in the same commit. M-016 (CWE-754) — Reap stale-agent jobs internal/repository/postgres/job.go::ListJobsWithOfflineAgents: JOIN jobs to agents on agent_id, filter (status=Running AND a.last_heartbeat_at < cutoff), exclude server-keygen jobs. internal/service/job.go::ReapJobsWithOfflineAgents: Flips matched jobs to Failed reason agent_offline so I-001 retry loop re-queues them on a healthy agent. Records audit event per reap. internal/scheduler/scheduler.go: Scheduler.runJobTimeout cycle now calls both reaper arms. agentOfflineJobTTL default 5min (5x agent-health-check default); SetAgentOfflineJobTTL knob for operator override. internal/service/job_offline_agent_reaper_test.go: 6 unit tests cover happy path, server-keygen-skip, non-Running-skip, non- positive-TTL fail-loud, repo-error propagation, audit-event recording. M-019 — Configurable ARI HTTP timeout Audit claim 'no fallback timeout' was wrong — ari.go:52 already had a 15s timeout. Bundle C makes it configurable. internal/connector/issuer/acme/acme.go: Config.ARIHTTPTimeoutSeconds field with env path CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS. internal/connector/issuer/acme/ari.go: Both HTTP clients (GetRenewalInfo + getARIEndpoint) now use the new ariHTTPTimeout() helper. Zero / negative / nil-config all fall back to the historic 15s default. ari_timeout_test.go: 4 dispatch arm tests. M-020 (CWE-770) — OCSP DoS hardening Pre-bundle the noAuthHandler chain had no rate limit. An attacker could DoS the OCSP responder, which for fail-open relying parties is a revocation bypass. cmd/server/main.go: noAuthHandler refactored from fixed middleware.Chain(...) to a conditional slice that appends middleware.NewRateLimiter when cfg.RateLimit.Enabled. Per-IP keying applies; OCSP/CRL/EST/SCEP are unauth. docs/security.md (NEW): Operator runbook documenting Must-Staple TLS Feature extension RFC 7633 as the architectural fix for fail-open relying parties. Profile-flip guidance + nginx/Apache/HAProxy/Envoy stapling snippets + explicit scope statement on what the rate limiter alone does NOT solve. Audit deliverables: cowork/comprehensive-audit-2026-04-25/audit-report.md: score 31/55 -> 38/55 closed (Medium 13/27 -> 20/27). cowork/comprehensive-audit-2026-04-25/findings.yaml: 7 status flips open -> closed with closure notes citing the Bundle C mechanism. certctl/CHANGELOG.md: Bundle C section under [unreleased]. Verification: go vet ./internal/service ./internal/scheduler ./internal/connector/issuer/acme ./internal/api/handler ./internal/domain ./cmd/server clean go test -count=1 -short on the same packages all green helm template + helm lint clean internal/repository/postgres setup-fail sandbox disk pressure (same on master HEAD before this branch)	2026-04-27 00:08:25 +00:00
shankar0123	0e29c416b1	refactor(handler,repo): replace strings.Contains error dispatch with typed sentinels (S-2) Closes one 2026-04-24 audit finding (P2): - cat-s6-efc7f6f6bd50: 30 strings.Contains(err.Error(), ...) sites in internal/api/handler/ — brittle to repository-layer message changes, untyped against the actual failure mode. Approach (Option B from prompt design notes): - New typed sentinels in internal/repository/errors.go: ErrNotFound, ErrForeignKeyConstraint IsForeignKeyError(err) helper (the only place substring matching at the lib/pq boundary is allowed; isolates the DB-driver string knowledge to one function). - New typed sentinel in internal/domain/errors.go: ErrValidation (reserved for future per-entity validation wrappers; not yet used by all handlers). - 49 sites in internal/repository/postgres/*.go updated to wrap sql.ErrNoRows-derived errors via fmt.Errorf("...: %w", repository.ErrNotFound). - 18 not-found handler sites + 2 FK-constraint handler sites refactored to errors.Is(err, repository.ErrNotFound) / repository.IsForeignKeyError(err). - 23 inline `fmt.Errorf("X not found")` test fixtures across handler tests rewrapped to wrap repository.ErrNotFound. - test_utils.go::ErrMockNotFound rewrapped to wrap repository.ErrNotFound; renewal_policy.go closure docblock updated to reflect the new convention. - integration test mockJobRepository.Get wraps repository.ErrNotFound. CI regression guardrail: - .github/workflows/ci.yml::"Forbidden strings.Contains(err.Error()) regression guard (S-2)" greps for the three patterns ("not found", "violates foreign key", "RESTRICT") under internal/api/handler/ and fails the build on regression. Verification: - go build ./... — clean - go vet ./... — clean - go test ./... -short -count=1 — all packages pass (handler + repository + service + integration) - golangci-lint v2.11.4 run ./... — 0 issues - S-2 guardrail dry-run on post-fix tree → empty (good) - All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1, F-1, P-1) pass Audit findings closed: - cat-s6-efc7f6f6bd50 (P2) Deferred follow-ups: - 6 domain-specific substring patterns still inline in handlers ("cannot approve", "cannot reject", "cannot be parsed", "no certificates found", "challenge password", "invalid"/ "required" validation chains in profiles + agent_groups). Each needs its own typed sentinel, scoped per service. Documented by the S-2 CI guardrail's allowlist for closure-comments only. - Per-entity not-found sentinels (Option A — ErrCertificateNotFound, ErrAgentNotFound, etc.) deferred. Generic ErrNotFound covers the current dispatch needs; per-entity precision would let handlers return entity-aware error bodies without a domain.Type field, but not blocking.	2026-04-25 17:54:14 +00:00
shankar0123	1ee77c89f8	I-003: job timeout reaper closes AwaitingCSR/AwaitingApproval gap Add 11th always-on scheduler loop that transitions jobs stuck in AwaitingCSR (default 24h TTL) or AwaitingApproval (default 168h TTL) to Failed. I-001's retry loop then auto-promotes eligible Failed jobs back to Pending. No new status enum, no schema migration. - JobRepository.ListTimedOutAwaitingJobs with per-status cutoff WHERE - JobService.ReapTimedOutJobs mirrors RetryFailedJobs structure - Scheduler jobTimeoutLoop with atomic.Bool idempotency guard, 2m per-tick context, WaitGroup shutdown drain - Config: CERTCTL_JOB_TIMEOUT_INTERVAL (10m), CERTCTL_JOB_AWAITING_CSR_TIMEOUT (24h), CERTCTL_JOB_AWAITING_APPROVAL_TIMEOUT (168h) - Audit event per transition: actor=system, actorType=System, action=job_timeout, details={old_status, new_status, timeout_reason, age_hours} - 14 new tests: 3 config, 7 service, 4 scheduler	2026-04-19 01:37:18 +00:00
shankar0123	89b910a8f1	security: atomic pending-job claim with FOR UPDATE SKIP LOCKED (H-6) Fixes H-6 (CWE-362) — GetPendingJobs returned pending rows without row locks, so two scheduler replicas in an HA deployment could both read the same row, both decide it was theirs, and race on UpdateStatus, producing duplicate Running jobs and duplicate certificate issuances. Remediation: a claim-style repository API that selects + transitions Pending -> Running in one transaction with SELECT ... FOR UPDATE SKIP LOCKED. Concurrent claimants observe disjoint row sets; no worker ever sees another worker's claimed row. Repository changes (internal/repository/postgres/job.go): - New ClaimPendingJobs(ctx, jobType, limit): BEGIN; SELECT id,... FROM jobs WHERE status='Pending' (optional type filter, optional LIMIT) FOR UPDATE SKIP LOCKED; UPDATE jobs SET status='Running', updated_at=NOW() WHERE id = ANY($ids); COMMIT. Returns the claimed rows with status already flipped. - New ClaimPendingByAgentID(ctx, agentID): mirrors M31 UNION ALL semantics (direct agent_id match, target->agent JOIN fallback, certificate->target->agent chain for AwaitingCSR) but wraps each branch in FOR UPDATE SKIP LOCKED and flips Deployment/Renewal rows to Running. AwaitingCSR rows are returned in place (state transition deferred until SubmitCSR, consistent with M8 semantics). - Existing GetPendingJobs / ListPendingByAgentID retained for legacy compatibility; their godoc now directs production callers to the Claim* variants. Production caller switches: - internal/service/job.go ProcessPendingJobs: ListByStatus(Pending) -> ClaimPendingJobs(ctx, "", 0). Eliminates the real scheduler race between two replicas tick-firing simultaneously. - internal/service/agent.go GetPendingWork: ListPendingByAgentID -> ClaimPendingByAgentID. Eliminates the race between two pollers for the same agent (e.g. brief network blip causing duplicate poll) and between a scheduler tick and an agent poll. Safety argument for pre-flipping Pending -> Running inside the claim transaction: ProcessRenewalJob and ProcessDeploymentJob both call UpdateStatus(Running) unconditionally on entry, so an early flip is idempotent. On panic, the scheduler's panic recovery leaves the job in Running which the existing stale-running reaper handles. Tests (internal/repository/postgres/repo_test.go, skipped in -short): - TestJobRepository_ClaimPendingJobs_FlipsToRunning: seed 5 Pending, claim once, assert all 5 returned + DB rows Running, residual claim returns 0. - TestJobRepository_ClaimPendingJobs_ConcurrentDisjoint: seed M=40 Pending Renewals, spawn N=8 goroutines each calling ClaimPendingJobs(_, JobTypeRenewal, 1) in a loop. Invariants: (a) no job ID claimed by more than one worker, (b) sum of claims == 40, (c) all 40 rows in Running state in the DB. Bounded empty-streak guard (20 iterations) covers SKIP LOCKED transient zeros under contention. - TestJobRepository_ClaimPendingByAgentID_TransitionsDeployments: seeds 2 Pending Deployment + 1 AwaitingCSR for agent A plus 1 Pending Renewal for agent B (scope check). Asserts deployments flip to Running, AwaitingCSR is returned but preserved, agent B's renewal never appears. Mock updates: testutil_test.go, lifecycle_test.go, verification_test.go gained ClaimPendingJobs/ClaimPendingByAgentID on their mock job repos mirroring the real Pending -> Running semantics. Mocks intentionally do NOT write to StatusUpdates (that map tracks UpdateStatus() call history specifically; the real claim path uses a bulk UPDATE, not UpdateStatus). Verification (CI-scope): - go build ./cmd/...: ok - go vet ./...: ok - go test -race -short on service, api/handler, api/middleware, scheduler, connector/..., domain, validation, tlsprobe: ok - Coverage gates: service 67.6% (>=55), handler 78.6% (>=60), middleware 80.0% (>=30), domain 92.7% (>=40). All hold. - golangci-lint 2.11.4: 0 issues - govulncheck: no vulnerabilities in call graph - Frontend: tsc clean, 218 vitest tests pass, vite build ok - helm lint + helm template: ok - Invariant sweeps: FOR UPDATE SKIP LOCKED present in job.go; H-1 through H-5 fixtures unchanged. Refs: H-6 in certctl-audit-report.md	2026-04-17 02:34:56 +00:00
shankar0123	11173a74c6	feat(M31): agent work routing — scope jobs to assigned agents Deployment jobs now set agent_id from target→agent relationship at creation time. GetPendingWork() uses ListPendingByAgentID() with a 3-way UNION query (direct match, legacy NULL fallback via target JOIN, AwaitingCSR via cert→target→agent chain) so each agent only receives its own jobs. - Added AgentID *string to Job domain struct - Added agent_id to all job SQL queries (5 SELECTs, INSERT, UPDATE, scanJob) - New ListPendingByAgentID() repository method - Rewrote GetPendingWork() from ~25 lines to single scoped query - 4 new Go tests (3 agent routing + 1 deployment agent_id) - Frontend: agent_id/target_id on Job type Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-30 14:10:42 -04:00
shankar0123	3a9fe8ba37	Complete V1 scaffold	2026-03-14 20:01:53 -04:00

10 Commits