certctl

mirror of https://github.com/shankar0123/certctl.git synced 2026-06-10 18:18:52 +00:00

Author	SHA1	Message	Date
Shankar	0a75a3065f	security: atomic pending-job claim with FOR UPDATE SKIP LOCKED (H-6) Fixes H-6 (CWE-362): GetPendingJobs returned pending rows without row locks, so two scheduler replicas in an HA deployment could both read the same row, both decide it was theirs, and race on UpdateStatus, producing duplicate Running jobs and duplicate certificate issuances. Remediation: a claim-style repository API that selects + transitions Pending -> Running in one transaction with SELECT ... FOR UPDATE SKIP LOCKED. Concurrent claimants observe disjoint row sets; no worker ever sees another worker's claimed row. Repository changes (internal/repository/postgres/job.go): - New ClaimPendingJobs(ctx, jobType, limit): BEGIN; SELECT id,... FROM jobs WHERE status='Pending' (optional type filter, optional LIMIT) FOR UPDATE SKIP LOCKED; UPDATE jobs SET status='Running', updated_at=NOW() WHERE id = ANY($ids); COMMIT. Returns the claimed rows with status already flipped. - New ClaimPendingByAgentID(ctx, agentID): mirrors M31 UNION ALL semantics (direct agent_id match, target->agent JOIN fallback, certificate->target->agent chain for AwaitingCSR) but wraps each branch in FOR UPDATE SKIP LOCKED and flips Deployment/Renewal rows to Running. AwaitingCSR rows are returned in place (state transition deferred until SubmitCSR, consistent with M8 semantics). - Existing GetPendingJobs / ListPendingByAgentID retained for legacy compatibility; their godoc now directs production callers to the Claim* variants. Production caller switches: - internal/service/job.go ProcessPendingJobs: ListByStatus(Pending) -> ClaimPendingJobs(ctx, "", 0). Eliminates the real scheduler race between two replicas tick-firing simultaneously. - internal/service/agent.go GetPendingWork: ListPendingByAgentID -> ClaimPendingByAgentID. Eliminates the race between two pollers for the same agent (e.g. brief network blip causing duplicate poll) and between a scheduler tick and an agent poll. Safety argument for pre-flipping Pending -> Running inside the claim transaction: ProcessRenewalJob and ProcessDeploymentJob both call UpdateStatus(Running) unconditionally on entry, so an early flip is idempotent. On panic, the scheduler's panic recovery leaves the job in Running which the existing stale-running reaper handles. Tests (internal/repository/postgres/repo_test.go, skipped in -short): - TestJobRepository_ClaimPendingJobs_FlipsToRunning: seed 5 Pending, claim once, assert all 5 returned + DB rows Running, residual claim returns 0. - TestJobRepository_ClaimPendingJobs_ConcurrentDisjoint: seed M=40 Pending Renewals, spawn N=8 goroutines each calling ClaimPendingJobs(_, JobTypeRenewal, 1) in a loop. Invariants: (a) no job ID claimed by more than one worker, (b) sum of claims == 40, (c) all 40 rows in Running state in the DB. Bounded empty-streak guard (20 iterations) covers SKIP LOCKED transient zeros under contention. - TestJobRepository_ClaimPendingByAgentID_TransitionsDeployments: seeds 2 Pending Deployment + 1 AwaitingCSR for agent A plus 1 Pending Renewal for agent B (scope check). Asserts deployments flip to Running, AwaitingCSR is returned but preserved, agent B's renewal never appears. Mock updates: testutil_test.go, lifecycle_test.go, verification_test.go gained ClaimPendingJobs/ClaimPendingByAgentID on their mock job repos mirroring the real Pending -> Running semantics. Mocks intentionally do NOT write to StatusUpdates (that map tracks UpdateStatus() call history specifically; the real claim path uses a bulk UPDATE, not UpdateStatus). Verification (CI-scope): - go build ./cmd/...: ok - go vet ./...: ok - go test -race -short on service, api/handler, api/middleware, scheduler, connector/..., domain, validation, tlsprobe: ok - Coverage gates: service 67.6% (>=55), handler 78.6% (>=60), middleware 80.0% (>=30), domain 92.7% (>=40). All hold. - golangci-lint 2.11.4: 0 issues - govulncheck: no vulnerabilities in call graph - Frontend: tsc clean, 218 vitest tests pass, vite build ok - helm lint + helm template: ok - Invariant sweeps: FOR UPDATE SKIP LOCKED present in job.go; H-1 through H-5 fixtures unchanged. Refs: H-6 in certctl-audit-report.md	2026-04-17 02:34:56 +00:00
Shankar	cf632c0af4	fix: end-to-end certificate lifecycle bugs + integration test environment Fixes 12 production bugs preventing the full issuance→deployment flow from working with ACME (Pebble/Let's Encrypt) and step-ca issuers: ACME connector (acme.go): - Save orderURI before WaitOrder overwrites it (Go crypto/acme bug) - Add CreateOrderCert fallback via WaitOrder+FetchCert - Remove defer-reset in ValidateConfig that caused nil pointer panic - Add Insecure TLS option for self-signed ACME servers (Pebble) step-ca connector (stepca.go, jwe.go): - Real JWE provisioner key loading + decryption (was using ephemeral keys) - Fix JWT audience (/1.0/sign), sha claim (key fingerprint), kid header - Custom root CA trust via RootCertPath config - Remove hardcoded 90-day validity default (let step-ca decide) NGINX target connector (nginx.go): - Use sh -c for validate/reload commands (shell interpretation) - Use filepath.Dir instead of fragile string slicing - Add private key file writing (agent-mode keys were never deployed) - Make chain_path write conditional Server/service layer: - TriggerRenewalWithActor now creates actual Job records (was no-op) - createDeploymentJobs falls back to DB query when cert.TargetIDs empty - ProcessPendingJobs skips agent-routed deployment jobs - Agent cert pickup path parsing: len(parts)<4 → len(parts)<3 - Health/ready/auth-info endpoints bypass auth middleware - Write timeout 15s→120s for ACME issuance - Cert fingerprint computed on CSR submission Integration test environment (deploy/test/): - 10-phase test script covering Local CA, ACME, step-ca, revocation, discovery, renewal, and API spot checks - Docker Compose with 7 containers (server, agent, postgres, nginx, pebble, challtestsrv, step-ca) on isolated network - TLS verification checks SAN (not just Subject CN) for modern CA compat Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-02 17:02:20 -04:00
Shankar	e445cbef22	feat: M11b — ownership tracking, agent groups, interactive renewal approval Ownership: owners/teams GUI pages, notification email resolution via resolveRecipient (owner_id → owner.email lookup). Agent groups: dynamic device grouping by OS/arch/IP CIDR/version with manual include/exclude membership, migration 000004, full CRUD stack (domain → repo → service → handler → frontend). Interactive approval: AwaitingApproval job state, approve/reject API endpoints with reason tracking. Tests: 12 agent group handler tests, 8 approve/reject job handler tests, integration tests updated for 13-param RegisterHandlers. Docs updated across architecture, concepts, and seed data. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-20 21:02:35 -04:00
Shankar	f1eff55894	style: run gofmt -s across all Go files Fixes Go Report Card gofmt score from 52% to 100%. Pure formatting changes — no logic modifications. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-17 19:32:29 -04:00
Shankar	ab79dead13	Complete M1, M1.1, M2: end-to-end lifecycle, agent deployment, ACME v2 - Wire issuer connector end-to-end with IssuerConnectorAdapter (dependency inversion) - Renewal/issuance job processor: RSA key + CSR generation, Local CA signing, cert version storage - Agent work API (GET /agents/{id}/work) and job status API (POST /agents/{id}/jobs/{job_id}/status) - Agent-side deployment: WorkItem enrichment with target type/config, NGINX/F5/IIS connector invocation - Full ACME v2 implementation: HTTP-01 challenge solving, account registration, order lifecycle - Update all docs (README, architecture, connectors, demo-advanced, quickstart) for M1-M2 - Fix go vet warning in deployment.go (non-constant format string) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 23:49:45 -04:00
Shankar	9918f2f5cb	Fix runtime bugs, implement service layer, and overhaul documentation Runtime fixes: - Fix env var mismatch (CERTCTL_DB_URL → CERTCTL_DATABASE_URL) - Fix table name mismatches (certificates → managed_certificates, notifications → notification_events) - Add renewal_policy_id to certificate queries - Remove non-existent created_at from notification queries - Add env var fallback for agent CLI flags - Graceful degradation for missing notifiers/issuers in demo mode - Copy web/ directory in Dockerfile for dashboard serving Service layer: - Implement handler-service interface pattern across all services - Wire up certificate, agent, job, policy, team, owner, audit, notification services Documentation: - Add concepts.md: beginner-friendly guide to TLS, CAs, private keys - Rewrite quickstart.md with accurate API examples matching actual handlers - Add demo-advanced.md: interactive demo with cert issuance and automated script - Update architecture.md with correct table names and connector interfaces - Update connectors.md to match actual Go interface signatures - Update demo-guide.md with cross-references to new docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-14 21:38:11 -04:00
shankar0123	d395776a95	Initial scaffold: certificate control plane v0.1.0	2026-03-14 08:22:17 -04:00

7 Commits
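
The claim semantics described in commit 0a75a3065f (select Pending rows and flip them to Running as one atomic step, so that concurrent claimants receive disjoint sets) can be modeled without a database. The sketch below is not the repository code from internal/repository/postgres/job.go: it swaps the Postgres transaction and FOR UPDATE SKIP LOCKED for an in-memory slice guarded by a mutex, and the names MemJobStore, ClaimPending, and RunClaimers are hypothetical. It only illustrates the invariants the commit message lists for the concurrent test (40 jobs, 8 workers, no job claimed twice).

```go
package main

import (
	"fmt"
	"sync"
)

// JobStatus models the two states relevant to the claim step.
type JobStatus string

const (
	StatusPending JobStatus = "Pending"
	StatusRunning JobStatus = "Running"
)

// Job is a hypothetical, trimmed-down job record.
type Job struct {
	ID     string
	Status JobStatus
}

// MemJobStore is an in-memory stand-in for the jobs table (an assumption for
// illustration). The mutex plays the role of the claim transaction: selecting
// rows and flipping Pending -> Running happen as one atomic step.
type MemJobStore struct {
	mu   sync.Mutex
	jobs []*Job
}

// ClaimPending returns up to limit Pending jobs (limit <= 0 means no limit)
// and flips them to Running before the lock is released.
func (s *MemJobStore) ClaimPending(limit int) []Job {
	s.mu.Lock()
	defer s.mu.Unlock()
	var claimed []Job
	for _, j := range s.jobs {
		if limit > 0 && len(claimed) >= limit {
			break
		}
		if j.Status == StatusPending {
			j.Status = StatusRunning
			claimed = append(claimed, *j)
		}
	}
	return claimed
}

// RunClaimers seeds m Pending jobs, starts n workers that each claim one job
// at a time until none remain, and reports the total number of claims and the
// number of job IDs that were claimed more than once.
func RunClaimers(m, n int) (total, duplicates int) {
	store := &MemJobStore{}
	for i := 0; i < m; i++ {
		store.jobs = append(store.jobs, &Job{ID: fmt.Sprintf("job-%d", i), Status: StatusPending})
	}
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		seen = make(map[string]int) // job ID -> number of times claimed
	)
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				got := store.ClaimPending(1)
				if len(got) == 0 {
					return // nothing left to claim
				}
				mu.Lock()
				for _, j := range got {
					seen[j.ID]++
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	for _, c := range seen {
		total += c
		if c > 1 {
			duplicates++
		}
	}
	return total, duplicates
}

func main() {
	total, dups := RunClaimers(40, 8)
	fmt.Printf("claimed=%d duplicates=%d\n", total, dups)
}
```

One difference from the Postgres behavior is worth keeping in mind: a mutex makes a contending claimant wait, whereas SKIP LOCKED makes it skip rows held by another transaction, so a real claimant can transiently see zero rows while work remains. That is why the commit's concurrent test carries a bounded empty-streak guard, which this model does not need.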