mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 17:12:04 +00:00
fe7e766510f7707b1ff5c437dc0f129d15bbb388
326 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
fe7e766510 |
Close M-004 (OCSP issuer binding) and M-005 (discovery actor propagation) coverage-gap findings
M-004 — OCSP issuer binding (composite key):
The OCSP lookup path now binds (issuer_id, serial) as a composite key
rather than resolving by serial alone. CertificateRepository and
RevocationRepository gain GetByIssuerAndSerial methods; ca_operations.go
scopes both lookups by the issuer_id path param. When no managed cert
binds to that (issuer, serial) tuple, GetOCSPResponse constructs an
RFC 6960 §2.2 'unknown' response (CertStatus=2) instead of the prior
default 'good'. Short-lived cert exemption (profile TTL < 1h) is
preserved. Real repo errors (non-sql.ErrNoRows) fail closed with a log.
Regression coverage: internal/service/ca_operations_test.go
- TestCAOperationsSvc_GetOCSPResponse_Unknown_CrossIssuer
- TestCAOperationsSvc_GetOCSPResponse_Unknown_UnknownSerial
M-005 — Discovery Claim/Dismiss actor propagation:
DiscoveryService.ClaimDiscovered and DismissDiscovered now accept an
explicit 'actor string' parameter (propagation pattern mirrors
bulk_revocation.go / revocation_svc.go). The handler layer passes
resolveActor(r.Context()) — the named-key identity established by the
M-002 auth unification — and the service falls back to 'api' (the same
safe sentinel resolveActor uses when no auth context is present) only
when the caller passes an empty string. Never falls back to 'operator'.
Regression coverage: internal/service/discovery_test.go
- TestDiscoveryService_ClaimDiscovered_AuditActor
- TestDiscoveryService_DismissDiscovered_AuditActor
- TestDiscoveryService_ClaimDiscovered_EmptyActorFallsBackToAPI
- TestDiscoveryService_DismissDiscovered_EmptyActorFallsBackToAPI
Each new test asserts event.Actor matches the caller-supplied string (or
'api' on empty input) and explicitly asserts event.Actor != 'operator'
to lock in the historical fix intent.
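The fallback rule for M-005 can be sketched as below. This is a hedged illustration: auditEvent, actorOrDefault, and claimDiscovered are hypothetical names and do not reproduce the DiscoveryService signatures; only the 'api' sentinel and the "never 'operator'" rule come from the message above.

```go
package main

import "fmt"

// fallbackActor is the sentinel used when the caller supplies no actor.
const fallbackActor = "api"

// auditEvent is a hypothetical, trimmed-down audit record.
type auditEvent struct {
	Actor  string
	Action string
}

// actorOrDefault returns the caller-supplied actor, or the 'api' sentinel
// when the caller passed an empty string.
func actorOrDefault(actor string) string {
	if actor == "" {
		return fallbackActor
	}
	return actor
}

// claimDiscovered shows the propagation: the actor arrives as an explicit
// parameter and is written to the audit event unchanged unless empty.
func claimDiscovered(actor, discoveredID string) auditEvent {
	return auditEvent{
		Actor:  actorOrDefault(actor),
		Action: "claim:" + discoveredID,
	}
}

func main() {
	fmt.Println(claimDiscovered("key-alice", "dc-1").Actor)
	fmt.Println(claimDiscovered("", "dc-1").Actor)
}
```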
Files:
internal/api/handler/discovery.go — pass resolveActor(ctx)
internal/api/handler/discovery_handler_test.go — updated call sites
internal/integration/lifecycle_test.go — updated mock wiring
internal/repository/interfaces.go — GetByIssuerAndSerial on
CertificateRepository +
RevocationRepository
internal/repository/postgres/certificate.go — composite key lookup
internal/service/ca_operations.go — (issuer_id, serial) scoping
internal/service/ca_operations_test.go — 2 new M-004 tests
internal/service/discovery.go — actor parameter + 'api' fallback
internal/service/discovery_test.go — 4 new M-005 tests
internal/service/shortlived_test.go — mock signature update
internal/service/testutil_test.go — mock GetByIssuerAndSerial
|
||
|
|
ff7357f889 |
fix(lint): godoc comment on NewAuthWithNamedKeys must lead with function name (ST1020)
CI failure on master (commit |
||
|
|
3287e174dc |
Unify API auth + RFC-compliant CRL/OCSP (M-002 + M-003 + M-006, auto-closes M-001)
Closes the remaining P1 gaps from coverage-gap-audit.md (M-001/M-002/M-003/M-006)
on top of the C-001/C-002 ownership + agent-FK contract fixes landed in
|
||
|
|
a53a4b845b |
fix(gui,api): close C-001 + C-002 — ownership + agent FK contract
C-001 — CreateCertificate was server-accepted with null owner_id,
team_id, renewal_policy_id because the GUI neither collected the fields
nor enforced them, even though the backend's ManagedCertificate schema
and handler contract treat them as required. Fix the contract at all
four layers:
- web/src/pages/CertificatesPage.tsx: replace owner_id/team_id free-
text inputs with <select> elements fed by getOwners/getTeams/
getPolicies queries; mark all three required; gate the Create
button on owner_id + team_id + renewal_policy_id being set.
- internal/api/handler/certificates.go: ValidateRequired for
owner_id, team_id, renewal_policy_id on CreateCertificate so the
handler returns HTTP 400 with the offending field name before the
service layer is reached.
- internal/mcp/types.go: drop ',omitempty' from
CreateCertificateInput.RenewalPolicyID so the MCP schema reflects
the required contract; Update inputs keep partial-update semantics.
- api/openapi.yaml: 'required: [name, common_name, renewal_policy_id,
issuer_id, owner_id, team_id]' was already present on the Create
schema; clarified DeploymentTarget.agent_id description to note the
FK contract.
C-002 — CreateTargetWizard accepted an empty or bogus agent_id and the
service inserted directly, producing a Postgres 23503 FK-violation that
bubbled out as a generic HTTP 500. The FK itself (migration 000001 line
104: agent_id TEXT NOT NULL REFERENCES agents(id)) is correct; we keep
the schema strict and add validation at three layers:
- internal/service/target.go: introduce
ErrAgentNotFound sentinel and pre-validate agent_id in
TargetService.CreateTarget — empty string returns
'agent_id is required'; a nonexistent id returns the full
'referenced agent does not exist: <id>' error. Both wrap
ErrAgentNotFound via fmt.Errorf %w so callers can use errors.Is.
- internal/api/handler/targets.go: ValidateRequired on agent_id; map
errors.Is(err, service.ErrAgentNotFound) to HTTP 400 instead of
letting it fall through to the generic 500 branch.
- internal/mcp/types.go: drop ',omitempty' from
CreateTargetInput.AgentID to match the required contract.
- web/src/pages/TargetsPage.tsx: replace the free-text Agent ID input
with a <select> populated from getAgents(); include agent in the
canProceedToReview gate so Next is disabled until an agent is
chosen.
Regression coverage (21 new subtests total):
- TestCreateCertificate_MissingRequiredField_Returns400 — 6 subtests,
one per required field, each proves the handler guard fires before
the mock service is called.
- TestCreateTarget_MissingAgentID_Returns400 — handler guard.
- TestCreateTarget_NonexistentAgent_Returns400 — pins the
ErrAgentNotFound -> 400 translation.
- TestTargetService_CreateTarget_MissingAgentID — errors.Is sentinel.
- TestTargetService_CreateTarget_NonexistentAgentID — errors.Is.
- The existing TestTargetService_CreateTarget_Success, along with
TestCreateTarget_{MissingName,MissingType,NameTooLong}_* handler
tests, were updated to seed a real agent or include agent_id in
the request body so the happy paths still run cleanly.
Gates (Phase 4):
- go build/vet/test/race: green
- go test -cover: internal/service 68.7% (gate 55%),
internal/api/handler 78.9% (gate 60%)
- golangci-lint on service+handler+mcp: 0 issues
- govulncheck: no reachable vulns
- tsc --noEmit: clean
- vitest: 223/223 passing
See cowork/certctl-coverage-gap-audit.md entries C-001 and C-002.
v2.0.44
|
||
|
|
9143da5fa8 | Merge branch 'fix/d-008-policy-engine-drift' | ||
|
|
b3cc7cbdb2 |
fix(policies): close the D-006 loop — TitleCase seed canonicals + severity-aware, config-consuming rule engine (D-008)
D-008 was a three-part drift in the policy engine that made the
D-005/D-006 remediation cosmetic below the DB layer:
(a) migrations/seed.sql INSERTed rules with pre-D-005 lowercase
types ('ownership', 'environment', 'lifetime', 'renewal_window')
that the handler validator rejects on Create/Update but that
raw SQL INSERTs bypassed entirely. At runtime evaluateRule's
switch fell through to the default "unknown policy rule type"
error branch on every demo rule × every cert × every cycle,
flooding logs while emitting zero violations.
(b) migrations/seed_demo.sql persisted lowercase severity values
('critical', 'error', 'warning') on policy_violations rows.
INSERT succeeded because that column had no CHECK, but any
frontend comparing against the canonical PolicySeverity enum
mis-categorized every seeded violation.
(c) evaluateRule hardcoded Severity: PolicySeverityWarning on
every emitted violation and ignored rule.Config entirely —
so the D-006 per-rule severity column (000013) and every
per-arm Config JSON ({allowed_issuer_ids, allowed_domains,
required_keys, allowed, lead_time_days, max_days}) was dead
data below the evaluation layer.
This commit lands (a)+(b)+(c) atomically. Shipping any subset
leaves the feature half-working.
## Changes
Domain (internal/domain/policy.go):
* Add PolicyTypeCertificateLifetime as the 6th TitleCase canonical.
Pre-D-008 the seeded "max-certificate-lifetime" rule had no engine
arm — routing it through RenewalLeadTime would conflate "how
close to expiry before we renew" with "how long can the cert
possibly be", two distinct semantics. The new type accepts
config {"max_days": int} and flags certs whose
NotAfter - NotBefore exceeds the cap.
Handler validator (internal/api/handler/validation.go):
* ValidatePolicyType allowlist grown to 6 canonicals
(AllowedIssuers, AllowedDomains, RequiredMetadata,
AllowedEnvironments, RenewalLeadTime, CertificateLifetime).
OpenAPI (api/openapi.yaml):
* PolicyType enum grown to match domain.
Frontend (web/src/api/types.ts, types.test.ts):
* POLICY_TYPES tuple gains CertificateLifetime; pin test asserts
all 6 canonicals and rejects casing drift.
Migration 000014 (policy_violations severity CHECK):
* Named CHECK constraint (policy_violations_severity_check)
mirroring 000013's allowlist, defense-in-depth at the DB layer
against future drift from bypassed writes (migrations, psql
sessions, future callers). Symmetric down migration drops by
name.
Seed data:
* migrations/seed.sql rewritten to emit TitleCase canonicals with
per-arm config JSON that actually exercises the config-consuming
paths (not the missing-field backstops):
- pr-require-owner → RequiredMetadata {"required_keys":["owner"]} Warning
- pr-allowed-environments → AllowedEnvironments {"allowed":["production","staging","development"]} Error
- pr-max-certificate-lifetime → CertificateLifetime {"max_days":90} Critical
- pr-min-renewal-window → RenewalLeadTime {"lead_time_days":14} Warning
Severities are now differentiated per rule (D-006 intent).
* migrations/seed_demo.sql violation rows flipped to TitleCase
severity ('Critical', 'Error', 'Warning') so migration 000014
applies cleanly on upgrade paths.
Engine rewrite (internal/service/policy.go):
* evaluateRule rewritten. All six arms now:
1. Parse rule.Config into the per-arm typed struct.
2. Bad JSON → log at ValidateCertificate boundary and skip
this rule (no co-located poisoning of other rules in the
same batch).
3. Empty/null Config → emit the pre-D-008 missing-field
violation (backwards compat invariant — operators who
haven't reconfigured still see the same output).
4. Violations emitted carry rule.Severity (no more hardcoded
Warning); D-006 column is now load-bearing.
* CertificateLifetime arm reads NotBefore/NotAfter from the
certificate's latest version via CertRepo. Injected via
PolicyService.SetCertRepo() setter — avoids churning ~36
NewPolicyService call sites while keeping the lifetime arm
optional (degrades to a log+skip if the setter is not wired).
Server wiring (cmd/server/main.go):
* policyService.SetCertRepo(certRepo) wired after construction.
Tests (internal/service/policy_test.go):
* 25 new subtests across 5 groups:
- TestEvaluateRule_SeverityPassThrough (6): every rule type
emits violations carrying rule.Severity, not hardcoded.
- TestEvaluateRule_ConfigConsumed (12): every per-arm Config
path exercised positive + negative.
- TestEvaluateRule_EmptyConfig_BackCompat (3): empty/null
Config still emits pre-D-008 missing-field violations.
- TestEvaluateRule_BadConfig_SkipsRule: malformed JSON logs
and skips cleanly without poisoning neighbors.
- TestEvaluateRule_CertificateLifetime_RepoScenarios (3):
ok when repo wired, log+skip when not, handles missing
NotBefore/NotAfter edges.
Provenance: D-008 surfaced during D-005/D-006 remediation review
in
|
||
|
|
eef1db0f0a |
fix(policies): stop 400ing the "+ New Policy" button + add per-rule severity (D-005, D-006)
Coverage Gap Audit findings D-005 (P0) + D-006 (P1) fixed together in a
single commit because they share the same root cause — policy CRUD sending
values the backend silently rejects — and splitting them would leave a
half-working UI between commits.
## D-005 (P0): PoliciesPage dropdown 400s every Create Policy
Root cause
----------
`web/src/pages/PoliciesPage.tsx` populated the Type `<select>` from a
hardcoded `['key_algorithm', 'ownership', 'allowed_issuers', ...]` array.
The backend's `internal/api/handler/validators.go::ValidatePolicyType`
enforces the TitleCase allowlist `AllowedIssuers`, `AllowedDomains`,
`RequiredMetadata`, `AllowedEnvironments`, `RenewalLeadTime` — defined in
`internal/domain/policy.go`. Every Create Policy request was rejected with
`400 invalid policy type`. The error surfaced only as a transient toast;
the modal closed anyway. Silent user-visible failure.
Fix
---
- `web/src/api/types.ts`: added `POLICY_TYPES` and `POLICY_SEVERITIES`
tuples with `as const` and narrowed `PolicyRule.type`, `.severity`, and
`PolicyViolation.severity` to the literal-union types. Dropdown is now
sourced from the tuple; casing drift becomes a compile error.
- `web/src/pages/PoliciesPage.tsx`: rekeyed `severityStyles` /
`severityDots` to the TitleCase values, added `humanize()` for display
(AllowedIssuers → "Allowed Issuers"), removed the `badge-neutral`
fallback that was papering over the mismatch.
- `web/src/api/types.test.ts` (new): pins both tuples exactly. If anyone
edits one side of the frontend/backend contract without the other, CI
fails with a clear assertion. Pure-TS vitest, no RTL dependency.
## D-006 (P1): `severity` field silently dropped on create/update
Root cause
----------
`PolicyRule` had no `Severity` field in `internal/domain/policy.go`. The
frontend has always sent `severity` on create/update, but Go's
`json.Decoder` (default settings, no `DisallowUnknownFields`) silently
dropped it. The value never reached PostgreSQL. Every rule rendered with
the same severity because there was no severity — just a display
computation downstream.
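The silent drop is standard `encoding/json` behaviour and can be reproduced in isolation. The sketch below is illustrative only: `policyRule` is a hypothetical struct that mimics a type with no severity field, and `decode` is a helper invented for the comparison.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// policyRule mimics a struct that has no Severity field (hypothetical).
type policyRule struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

// decode parses body into a policyRule. With strict=false (the decoder's
// default) unknown JSON keys are ignored; with strict=true they are errors.
func decode(body string, strict bool) (policyRule, error) {
	dec := json.NewDecoder(strings.NewReader(body))
	if strict {
		dec.DisallowUnknownFields()
	}
	var r policyRule
	err := dec.Decode(&r)
	return r, err
}

func main() {
	body := `{"name":"r1","type":"AllowedIssuers","severity":"Critical"}`

	r, err := decode(body, false)
	fmt.Println("default:", r, err) // severity is dropped without an error

	_, err = decode(body, true)
	fmt.Println("strict: ", err) // the unknown key is reported
}
```

Enabling `DisallowUnknownFields` would have surfaced the mismatch as a decode error; the commit instead adds the field end to end, so the value is accepted and persisted.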
Fix: option (b), full-stack schema add (not delete-the-field)
-------------------------------------------------------------
- Migration `000013_policy_rule_severity` (up + down): adds
`severity VARCHAR(50) NOT NULL DEFAULT 'Warning'` to `policy_rules` with
CHECK constraint `severity IN ('Warning', 'Error', 'Critical')`. No
index — three-value column on a low-thousands-rows table, planner will
seq-scan regardless. PG 11+ metadata-only ADD COLUMN, safe on live data.
- `internal/domain/policy.go`: added `Severity PolicySeverity` field.
- `internal/repository/postgres/policy.go`: plumbed `severity` through
ListRules SELECT + Scan, GetRule SELECT + Scan, CreateRule INSERT,
UpdateRule UPDATE (4 queries).
- `internal/service/policy.go::UpdatePolicy`: if the client omits
severity on a PUT (zero-value empty string), fetch the existing rule
and preserve its severity. Without this, partial updates would trip the
NOT NULL CHECK and 500. Preserves pre-existing behavior for Name/Type
(out of scope).
- `internal/api/handler/policies.go::CreatePolicy`: default empty severity
to `'Warning'`, then validate via `ValidatePolicySeverity`. 400 with
clear message instead of 500 on CHECK violation. `UpdatePolicy`:
validates severity only when provided.
- `internal/mcp/types.go` + `internal/mcp/tools.go`: added optional
`severity` on the MCP `create_policy` / `update_policy` tool inputs so
LLM callers stay in sync with the wire contract.
- `api/openapi.yaml`: added `severity` to the `PolicyRule` schema with
the enum and default.
Acceptance criterion (user-defined)
-----------------------------------
"Create a rule with severity=Critical, reload the page, and still see
Critical — no silent drops." Verified end-to-end: frontend sends
`severity: "Critical"`, handler validates, service persists, DB stores,
GET returns, React renders the correct badge.
Seed data
---------
`migrations/seed.sql`: four demo rules now have differentiated severities
— `pr-require-owner` → Warning, `pr-allowed-environments` → Error,
`pr-max-certificate-lifetime` → Critical, `pr-min-renewal-window` →
Warning. The user called out that seeding all four at the same severity
makes the feature look decorative; differentiation demonstrates the
column carries real signal.
## Integration test fix (side effect of D-006)
`internal/integration/e2e_test.go::TestCrossResourceWorkflow/CreatePolicy`
was sending `"severity": "High"` — a value from the pre-audit severity
vocabulary that the new `ValidatePolicySeverity` correctly rejects with
400. Changed to `"Error"` (closest semantic match in the new TitleCase
allowlist). Only severity reference in the integration/ directory;
verified via grep.
## Out of scope, logged for follow-up (d/D-008)
Three policy-engine drift issues orthogonal to D-005 + D-006, explicitly
deferred per direction:
1. `migrations/seed.sql` policy_rules INSERTs use lowercase TYPE values
(`'ownership'`, `'environment'`, `'lifetime'`, `'renewal_window'`).
These are load-bearing on `internal/service/policy.go::evaluateRule`'s
`switch rule.Type` (which also uses the lowercase strings). Migrating
requires coordinated changes across seed + evaluation engine.
2. `migrations/seed_demo.sql:482-483` contains lowercase `'critical'`
severity — will now fail the new CHECK constraint. Separate fix.
3. `evaluateRule` hardcodes `Severity: domain.PolicySeverityWarning` on
emitted violations and ignores the configured `rule.Config`. The new
severity column is read correctly on the CRUD path but not yet
consulted during evaluation.
## Verification
Backend:
- `go build ./...` — clean
- `go vet ./...` — clean
- `go test -short ./...` — all packages green, including
`internal/service` (policy service), `internal/api/handler` (policy +
MCP handler tests), `internal/integration` (e2e_test.go after fix),
`internal/domain`, `internal/repository/postgres`.
Frontend:
- `tsc --noEmit` — clean
- `vitest run` — 223/223 passing (4 new assertions in types.test.ts)
- `vite build` — clean (only the pre-existing chunk-size warning)
|
||
|
|
72f5246ce3 | Merge branch 'fix/m11-cosign-v3-sign-blob-bundle': M-11 cosign v3 sign-blob migration v2.0.43 | ||
|
|
cb308bb4c7 |
ci(release): migrate cosign sign-blob to --bundle (cosign v3.0)
Cosign v3.0 (shipped by default with sigstore/cosign-installer@cad07c2e,
release v3.0.5) removed --output-signature and --output-certificate from
the sign-blob subcommand. The replacement is a single --bundle flag that
emits a unified Sigstore bundle (.sigstore.json) containing the signature,
certificate chain, and Rekor inclusion proof in one file.
This change migrates both sign-blob invocations in
.github/workflows/release.yml (per-binary matrix signing and aggregate
checksums.txt signing), updates the artefact upload paths, the artefact
aggregation case filter, the GitHub Release asset list, and the
release-notes body verify-blob example. The README cosign verification
snippet and sidecar description are also updated to the
--bundle / .sigstore.json shape.
No cosign version pinning. No legacy fallback. OCI image signing (cosign
sign on image digest) is unchanged; only sign-blob flags changed in v3.0.
See M-11 in certctl-audit-report.md.
Verification gates:
- YAML parse: OK
- go vet ./...: exit 0
- go build ./...: exit 0
- grep 'cosign sign-blob' release.yml: 2 (expected: 2)
- grep '.sigstore.json' release.yml: 9 (expected: >=5)
- grep '.sig/.pem' release.yml non-comment: 0 (expected: 0)
- README legacy cosign refs: 0 (expected: 0)
- docs/ legacy cosign refs: 0 (expected: 0)
Coverage: unchanged (CI workflow edit + README; zero Go code touched).
|
||
|
|
ad93e99158 | Merge branch 'fix/m10-openapi-spec-drift': M-10 OpenAPI spec drift reconciliation v2.0.42 | ||
|
|
9d0c3dfa15 |
docs(openapi): reconcile api/openapi.yaml with router routes (M-10)
Add 9 missing operations to api/openapi.yaml that exist in router.go but
were absent from the spec. Spec-only change with no runtime Go code
changes; all 106 pre-existing operationIds preserved byte-identical.
New operationIds:
- testTargetConnection (POST /api/v1/targets/{id}/test)
- verifyDeployment (POST /api/v1/jobs/{id}/verify)
- getJobVerification (GET /api/v1/jobs/{id}/verification)
- estCACerts (GET /.well-known/est/cacerts)
- estSimpleEnroll (POST /.well-known/est/simpleenroll)
- estSimpleReEnroll (POST /.well-known/est/simplereenroll)
- estCSRAttrs (GET /.well-known/est/csrattrs)
- scepGet (GET /scep)
- scepPost (POST /scep)
Spec operations: 106 → 115 (matches 115 router routes exactly).
Verification:
- openapi-spec-validator: OK
- go build ./...: clean
- go vet ./...: clean
- go test -race -count=1 -short ./...: 54 packages ok, 0 FAIL
- golangci-lint run ./...: 0 issues
- govulncheck ./...: 0 vulnerabilities in our code
- tsc --noEmit: 0 errors
- vitest run: 3 files, 218 tests passed
sha256 before: 7c14f77107a86f8de82fe91b7f5e16cca11206d1e1fab7b7bd77ff396620fdf3
sha256 after: 87bd92d0407d63643bec612d27261bf489563beb90d0791ea71cde26346f83d3
|
||
|
|
2c9602db71 | Merge branch 'fix/m9-sentinel-discovery-log-levels': M-9 sentinel discovery log-level fix | ||
|
|
ef670fa6da |
fix(m-9): aggregate per-endpoint scan errors in NetworkScanService
Before this fix, RunScan declared `scanErrors []string` but never
appended to it. As a result:
- the summary Info log ("network target scan completed") always
reported `"errors": 0`, regardless of how many endpoints failed
- the DiscoveryReport's `Errors` field — stored on the scan record
and surfaced in the GUI scan history — was always nil
Operators who needed to understand scan failures had to enable Debug
logging and grep through the noise of expected sweep-scan connection
refusals. The per-endpoint log level (Debug) is deliberate and correct
— scanning a /24 typically produces 200+ connection-refused results,
and logging each at Warn would create massive log spam at default
verbosity. The bug was the silent loss of the aggregate count.
This commit:
- extracts the partitioning logic into `collectScanResults`, a pure
method that splits per-endpoint results into discovered certificate
entries and a list of endpoint error strings
- populates the errors list with "<address>: <error>" so the scan
record correlates failures back to specific endpoints
- preserves the existing Debug-level per-endpoint log (sweep noise
discipline) — no change to default-verbosity log output
The summary Info log's "errors" field and the DiscoveryReport's Errors
field now reflect the true failure count. Debug detail remains
available for operators diagnosing specific endpoints.
Audit scope note: the M-9 finding narrative implied broad Debug-level
hiding of real errors across AWS SM, Azure KV, GCP SM, and network
scan sentinel agents. On investigation, the three cloud-discovery
connectors (awssm, azurekv, gcpsm) already use appropriate Warn/Error
discipline for per-item and root-level failures. Only the network
scanner had a silent observability gap, and it was a missed append
rather than a misapplied log level. See audit resolution log for
full details.
CWE: CWE-778 (Insufficient Logging) — aggregate failure count lost.
Tests: 4 new unit tests on collectScanResults covering the
aggregation path (success + failure mix), all-success, all-failed,
and empty-input degenerate cases. All tests pass with -race.
Verification:
- go build ./cmd/server/... ./cmd/agent/... ./cmd/mcp-server/... ./cmd/cli/... exit 0
- go vet ./... exit 0
- go test -race -count=1 -timeout 300s [full CI race path] exit 0
- golangci-lint run ./... --timeout 5m (v2.11.4) 0 issues
- govulncheck ./... (@latest) 0 in-code vulnerabilities
- go test -count=1 -cover ./internal/service/... 68.0% (> 55% threshold)
Invariants preserved:
- collectScanResults signature: method on *NetworkScanService,
input []domain.NetworkScanResult, return ([]DiscoveredCertEntry, []string)
- Debug log key names unchanged ("address", "error")
- DiscoveryReport schema unchanged (Errors field already existed)
- Sentinel agent ID "server-scanner" unchanged
- No migration, no API, no wire-format change
Refs: M-9 Medium finding; audit resolution log appended in follow-up
commit on workspace-level audit report.
|
||
|
|
5a6ec39cfd | Merge branch 'fix/m2-pr-f-scheduler-contextcheck-audit-closeout' | ||
|
|
e3196e7b50 |
M-2 PR-F: Middleware/ACME ctx-propagation + contextcheck linter + audit closeout
Final PR in the six-commit M-2 sequence (PR-A: CertificateService cluster |
||
|
|
bea69efd12 |
Merge branch 'fix/m2-pr-e-agent-service'
PR-E of 6: AgentService ctx-first collapse.
Collapses the HeartbeatWithContext wrapper into a single Heartbeat
method. Handler-facing method name is preserved (D-4); the handler
service interface and mock already expected ctx-first, so this PR
touches only the service layer and its tests (4 files, 9+/15-).
Verification on the feature branch: build, vet, test (-short),
test -race, full-module test -short, and golangci-lint all clean.
Audit complete. Commit:
|
||
|
|
283ec27ca4 |
fix(m2-pr-e): collapse AgentService.HeartbeatWithContext into Heartbeat
PR-E of 6 in the M-2 end-to-end remediation sequence. Collapses the
HeartbeatWithContext wrapper into a single ctx-first Heartbeat method,
matching D-1 (ctx-only signatures, no dual forms). The handler-facing
method name is preserved (D-4) — internal/api/handler/agents.go already
declares `Heartbeat(ctx, ...)` on its local service interface, and the
handler mock at internal/api/handler/agent_handler_test.go already
takes `_ context.Context` as its first param, so no handler churn.
Changes
-------
internal/service/agent.go
- Delete the zero-body Heartbeat wrapper that forwarded to
HeartbeatWithContext with context.Background().
- Rename HeartbeatWithContext → Heartbeat (ctx-bearing body
folded directly into the canonical method).
internal/service/agent_test.go
- TestHeartbeat (L95) and TestHeartbeat_NotFound (L128):
agentService.HeartbeatWithContext(ctx, ...) → .Heartbeat(ctx, ...).
internal/service/concurrent_test.go
- L162: agentSvc.HeartbeatWithContext(ctx, agentID, metadata)
→ .Heartbeat(ctx, agentID, metadata).
internal/service/context_test.go
- L179 + L232: agentSvc.HeartbeatWithContext(ctx, ...) → .Heartbeat(...)
- L185 + L238 t.Logf strings: "HeartbeatWithContext with ..." →
"Heartbeat with ..." to match the collapsed method name.
Verification (Go 1.25.9 linux/arm64, CI-parity caches)
------------------------------------------------------
go build ./... clean
go vet ./... clean
go test -short ./internal/service/... ./internal/api/handler/... \
./internal/integration/... all ok
go test -race -short same set all ok
go test -short ./... all packages ok
golangci-lint run ./... 0 issues
Locked decisions from the M-2 plan:
D-1 ctx-only signatures (no dual forms)
D-4 preserve handler method names facing the router
D-5 domain types stay ctx-free
Audit complete. Commit:
|
||
|
|
a67a6b6c30 |
Merge branch 'fix/m2-pr-d-job-notification-audit'
PR-D: Thread ctx through Job + Notification + Audit service cluster.
Collapse CancelJobWithContext into CancelJob; eliminate 10
context.Background() hits.
Audit complete. Commit:
|
||
|
|
ccd89c348f |
fix(m2-pr-d): thread ctx through Job/Notification/Audit services
Collapse CancelJobWithContext into CancelJob; eliminate 10 context.Background()
hits across the Job+Notification+Audit service cluster by threading ctx
through their handler-facing service interfaces.
Services (ctx-first):
- service/job.go: ListJobs, GetJob, CancelJob, ApproveJob, RejectJob now
accept ctx; the CancelJobWithContext wrapper is removed (handler callers
continue to invoke CancelJob, now ctx-aware).
- service/notification.go: ListNotifications, GetNotification, MarkAsRead
accept ctx.
- service/audit.go: ListAuditEvents, GetAuditEvent accept ctx.
Handlers (interface + callsites):
- handler/jobs.go, handler/notifications.go, handler/audit.go: local
service interfaces updated, r.Context() threaded at every callsite.
Tests:
- Mock services updated to match the new interfaces (ctx accepted and
ignored via '_ context.Context' first parameter; Fn closure fields
unchanged).
- job_test.go / notification_test.go callsites thread context.Background()
to match production shape.
Verification:
go build ./... ok
go vet ./... ok
go test -short ./... ok
go test -race -short ./... ok
golangci-lint run ./... 0 issues
Locked decisions from the M-2 plan:
D-1 ctx-only signatures (no dual forms)
D-4 preserve handler method names facing the router
D-5 domain types stay ctx-free
Audit complete. Commit:
|
||
|
|
478a141498 | Merge branch 'fix/m2-pr-c-crud-cluster' | ||
|
|
2497be496d |
M-2 PR-C: Collapse Policy/Profile/Owner/Team services to ctx-first signatures
- Add ctx first param to 21 service-layer handler-interface methods
across policy.go (6), profile.go (5), owner.go (5), team.go (5)
- Replace 24 context.Background() call sites with received ctx; use
context.WithoutCancel(ctx) for subsidiary audit-recording ops to
preserve fire-and-forget audit semantics without inheriting caller
cancellation
- Add ctx first param to 21 handler-interface method signatures across
policies.go (6), profiles.go (5), owners.go (5), teams.go (5)
- Thread r.Context() through 21 HTTP handler sites (ListPolicies,
GetPolicy, CreatePolicy, UpdatePolicy, DeletePolicy, ListViolations,
ListProfiles, GetProfile, CreateProfile, UpdateProfile, DeleteProfile,
ListOwners, GetOwner, CreateOwner, UpdateOwner, DeleteOwner,
ListTeams, GetTeam, CreateTeam, UpdateTeam, DeleteTeam)
- Update MockPolicyService/MockProfileService/MockOwnerService/
MockTeamService mock method impls with _ context.Context first param
(Fn fields unchanged — closures do not need ctx); update mock impls
in integration/lifecycle_test.go for all four services
- Update 12 service-layer test callsites (policy_test.go ×2,
owner_test.go ×5, team_test.go ×5, profile_test.go ×13) to pass
context.Background() at the call site
Audit complete. Commit:
|
||
|
|
25dd6c07f3 | Merge branch 'fix/m2-pr-b-issuer-target' | ||
|
|
eb14236166 |
M-2 PR-B: Collapse IssuerService + TargetService to ctx-first signatures
- Delete bare TestConnection wrapper in IssuerService; rename
TestConnectionWithContext → TestConnection
- Delete TestTargetConnection delegate shim in TargetService (canonical
TestConnection already ctx-first)
- Add ctx first param to 10 handler-interface methods
(ListIssuers/GetIssuer/CreateIssuer/UpdateIssuer/DeleteIssuer and
ListTargets/GetTarget/CreateTarget/UpdateTarget/DeleteTarget)
- Replace 16 context.Background() call sites with received ctx
- Thread r.Context() through 12 HTTP handler sites in issuers.go and
targets.go (outer TargetHandler.TestTargetConnection HTTP method name
preserved for router compatibility)
- Update MockIssuerService, MockTargetService, and mockTargetService
(integration) for ctx-first forwarding; update test callsite literals
Audit complete. Commit:
|
||
|
|
bbb628243f | Merge branch 'fix/m2-pr-a-certificate-cluster' | ||
|
|
cdc9d03d5b |
fix(m-2): thread context through CertificateService cluster
Collapses CertificateService, RevocationSvc, and CAOperationsSvc to
ctx-accepting method signatures. Removes context.Background() synthesis at
24 internal call sites across certificate.go, revocation_svc.go, and
ca_operations.go.
- Primary repo calls inherit request cancellation via the passed ctx.
- Audit and notification dispatches use context.WithoutCancel(ctx) so they
survive client disconnect.
- Collapses TriggerRenewal/TriggerRenewalWithActor,
TriggerDeployment/TriggerDeploymentWithActor, and
RevokeCertificate/RevokeCertificateWithActor sibling pairs into single
canonical ctx-accepting methods (decisions D-1, D-2).
Handlers pass r.Context(). Mocks and tests updated to match new signatures.
No HTTP surface change, no OpenAPI change.
PR 1 of 6 in the M-2 remediation chain. Master green at this commit.
Refs: certctl-audit-report.md M-2 (L143, L224)
|
||
|
|
e951d319d0 |
Merge branch 'fix/m1-audit-shutdown-drain'
Resolves M-1 (Medium): Audit recorder shutdown drain. The API audit middleware's detached recording goroutines now drain during graceful shutdown via AuditMiddleware.Flush (sync.WaitGroup + timeout-aware select), called between http.Server.Shutdown and db.Close. Prevents silent audit-event loss on SIGTERM (CWE-662 / CWE-400). |
||
|
|
d14a45401b |
fix(audit): drain in-flight recording goroutines on shutdown (M-1)
Audit events spawned from the HTTP middleware ran in detached goroutines
using context.Background(). On SIGTERM the DB pool was closed before
those goroutines finished writing, silently dropping audit events
(CWE-662 Improper Synchronization / CWE-400 Uncontrolled Resource
Consumption).
NewAuditLog now returns an *AuditMiddleware struct that tracks every
spawned goroutine with sync.WaitGroup. Callers wire the middleware via
its Middleware method value (preserves the existing
func(http.Handler) http.Handler shape) and drain the WaitGroup with
Flush(ctx), which blocks until in-flight recordings complete or the
provided context is cancelled, mirroring scheduler.WaitForCompletion.
Flush is invoked in cmd/server/main.go between http.Server.Shutdown (no
new requests accepted) and db.Close (pool torn down), with a timeout
returning ErrAuditFlushTimeout wrapping ctx.Err().
Request-derived inputs (method, path, status) are snapshotted before the
goroutine spawn so the worker does not race with http.Server reusing r
after the handler returns.
Tests:
TestAuditLog_FlushDrainsInFlightGoroutines
TestAuditLog_FlushTimeoutReturnsErrAuditFlushTimeout
Verification:
go build ./... : 0
go vet ./... : 0
go test -race -short ./... : 0 (all packages)
go test -cover ./internal/api/middleware : 81.4%
golangci-lint run : 0 issues
govulncheck ./... : 0 vulns in called code
|
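A WaitGroup drain with a timeout-aware select, as described in this commit, might look roughly like the sketch below. The `Recorder` type, `ErrFlushTimeout`, and the demo helpers are hypothetical stand-ins rather than the actual AuditMiddleware code; wrapping two errors with `%w` assumes Go 1.20 or later.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// ErrFlushTimeout is a hypothetical sentinel standing in for
// ErrAuditFlushTimeout.
var ErrFlushTimeout = errors.New("audit flush timed out")

// Recorder tracks detached recording goroutines so shutdown can wait on them.
type Recorder struct {
	wg sync.WaitGroup
}

// Go runs fn in a goroutine that Flush will wait for.
func (r *Recorder) Go(fn func()) {
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		fn()
	}()
}

// Flush blocks until every tracked goroutine finishes or ctx is done.
func (r *Recorder) Flush(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		r.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return fmt.Errorf("%w: %w", ErrFlushTimeout, ctx.Err())
	}
}

// demoFlush starts n recordings and reports how many had completed by the
// time Flush returned, along with Flush's error.
func demoFlush(n int) (int, error) {
	var r Recorder
	results := make(chan struct{}, n)
	for i := 0; i < n; i++ {
		r.Go(func() { results <- struct{}{} })
	}
	err := r.Flush(context.Background())
	return len(results), err
}

// demoFlushTimeout reports whether Flush surfaces both the sentinel and the
// context error when a recording is still blocked at the deadline.
func demoFlushTimeout() bool {
	var r Recorder
	block := make(chan struct{})
	r.Go(func() { <-block })
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate an already-expired shutdown budget
	err := r.Flush(ctx)
	close(block) // let the stuck recording finish
	return errors.Is(err, ErrFlushTimeout) && errors.Is(err, context.Canceled)
}
```

In a shutdown sequence, a call like `Flush` would sit after the HTTP server stops accepting requests and before the database pool is closed, so in-flight writes still have a live connection.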
||
|
|
655e2879e6 |
feat(frontend): add Owner field to OnboardingWizard Certificate step
The first-run onboarding wizard's Certificate step now surfaces an Owner
dropdown (required) alongside Issuer and Profile, matching the ownership
model introduced in M11b. Prevents newly-created certs from being
unowned and bypassing notification routing.
- web/src/pages/OnboardingWizard.tsx: getOwners query, ownerId state,
Owner <select>, required-field guard (nextDisabled), empty-state link
to /owners page when no owners exist yet.
Frontend-only change; no backend wiring or schema impact. Separated from
the M-6 sentinel-agent idempotency commit per scope-guard.
|
||
|
|
e757ef1471 |
Merge branch 'fix/m6-sentinel-idempotent-create'
Resolves M-6 (Medium): swallowed sentinel agent INSERT errors. CWE-662 / CWE-209-adjacent. Shape A: CreateIfNotExists helper + 4 sentinel call sites. |
||
|
|
27afa4463d |
fix(repository): idempotent sentinel agent creation via ON CONFLICT (M-6)
Sentinel agents (server-scanner, cloud-aws-sm, cloud-azure-kv,
cloud-gcp-sm) were created on startup with a plain INSERT whose
duplicate-key error was swallowed unconditionally. That silenced every
other DB failure too (connectivity drop, permissions change, unrelated
constraint violation): a restart after the first boot quietly de-fanged
cloud discovery and the network scanner (CWE-662, CWE-209-adjacent).
Shape A: add AgentRepository.CreateIfNotExists using ON CONFLICT (id) DO
NOTHING RETURNING id + sql.ErrNoRows discrimination. This keeps the
strict Create semantics (duplicate-key is an error) intact for real
agent registration and gives sentinels their own idempotent path.
- repo: CreateIfNotExists returns (created bool, err error); false,nil
on pre-existing row; false,wrapped err on anything else.
- interface: CreateIfNotExists added to AgentRepository.
- main.go: 4 sentinel sites log Error/Info/Debug distinctly.
- mocks: service + integration mocks implement the new method.
- tests: 4 new testcontainers integration tests cover first-insert,
idempotent second-call, concurrent 16-goroutine race (exactly one
creator, no duplicate-key panic), and pre-cancelled context surfacing.
Coverage gates (go test -cover): service 67.6%/55, handler 78.6%/60,
domain 92.7%/40, middleware 80.0%/30, crypto 86.7%/85.
Race/vet/golangci-lint v2.11.4 (0 issues)/govulncheck v1.2.0 clean
across all touched packages.
|
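The ON CONFLICT ... RETURNING plus sql.ErrNoRows discrimination could be sketched as follows. The table and column names in the SQL, and the helper names `interpretInsert`, `createIfNotExists`, and `demoOutcomes`, are assumptions for illustration; only the (created bool, err error) contract is taken from the commit message.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// insertSentinelSQL uses assumed table and column names. When the row already
// exists, DO NOTHING suppresses the insert and RETURNING yields zero rows.
const insertSentinelSQL = `
INSERT INTO agents (id, name) VALUES ($1, $2)
ON CONFLICT (id) DO NOTHING
RETURNING id`

// interpretInsert maps the Scan error of the RETURNING row to (created, err):
// nil means the row was inserted, sql.ErrNoRows means it already existed,
// and anything else is a real failure that must not be swallowed.
func interpretInsert(scanErr error) (bool, error) {
	switch {
	case scanErr == nil:
		return true, nil
	case errors.Is(scanErr, sql.ErrNoRows):
		return false, nil
	default:
		return false, fmt.Errorf("create agent if not exists: %w", scanErr)
	}
}

// createIfNotExists runs the idempotent insert against a PostgreSQL handle.
func createIfNotExists(ctx context.Context, db *sql.DB, id, name string) (bool, error) {
	var returned string
	err := db.QueryRowContext(ctx, insertSentinelSQL, id, name).Scan(&returned)
	return interpretInsert(err)
}

// demoOutcomes exercises the three branches without a database.
func demoOutcomes() (firstCreated, secondCreated, secondOK, otherFails bool) {
	firstCreated, _ = interpretInsert(nil)
	secondCreated, err := interpretInsert(sql.ErrNoRows)
	secondOK = err == nil
	_, err = interpretInsert(errors.New("connection refused"))
	otherFails = err != nil
	return
}
```

The point of the split is that a pre-existing row and a broken connection no longer look the same to the caller, so each can be logged at its own level.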
||
|
|
80450c7180 |
fix(repository): populate TargetIDs in certificate scan helper (M-7)
scanCertificate never queried the certificate_target_mappings junction
table, so Certificate.TargetIDs was always nil on reads. This silently
broke deployment lookups, bulk revocation filters, cert detail pages,
and any code path that iterated TargetIDs to dispatch target work.
Fix:
- Convert scanCertificate to a receiver method (r *CertificateRepository)
so it has access to the DB for the secondary junction query.
- Get(): scan the row, then call r.getTargetIDs(ctx, certID) to populate
TargetIDs with a single targeted query.
- List() and GetExpiringCertificates(): inline the scan loop so we can
collect all certIDs first, then call getTargetIDsForCertificates once
with pq.Array(certIDs) to avoid N+1 round-trips. Build a map and
attach TargetIDs to each certificate in the result set.
- Default TargetIDs to []string{} (not nil) when a cert has no mappings
so JSON marshals as [] rather than null.
Tests:
- New integration test file certificate_targetids_test.go with 5
subtests exercising Get / List / GetExpiringCertificates single
and multi-target cases plus the empty-slice vs nil contract.
- Uses the shared testcontainers-go setupTestDB infrastructure and
skips under 'go test -short' so CI (which excludes ./internal/repository/...
from coverage paths anyway) stays green.
Addresses M-7 from certctl-audit-report.md.
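The batch attach step and the empty-slice contract can be illustrated without a database. In this sketch the `Certificate` struct is trimmed to two fields, and `mapping`, `attachTargetIDs`, and `demoTargetIDs` are hypothetical names; the junction rows are assumed to have been fetched already by one query over all certificate IDs.

```go
package main

import "encoding/json"

// Certificate is a trimmed, hypothetical version of the domain type.
type Certificate struct {
	ID        string   `json:"id"`
	TargetIDs []string `json:"target_ids"`
}

// mapping is one row of the certificate-to-target junction table.
type mapping struct {
	CertID   string
	TargetID string
}

// attachTargetIDs groups junction rows by certificate and attaches them.
// Certificates without mappings get an empty, non-nil slice so that JSON
// encoding produces [] instead of null.
func attachTargetIDs(certs []*Certificate, rows []mapping) {
	byCert := make(map[string][]string, len(certs))
	for _, m := range rows {
		byCert[m.CertID] = append(byCert[m.CertID], m.TargetID)
	}
	for _, c := range certs {
		if ids, ok := byCert[c.ID]; ok {
			c.TargetIDs = ids
		} else {
			c.TargetIDs = []string{}
		}
	}
}

// demoTargetIDs returns the JSON for one mapped and one unmapped certificate.
func demoTargetIDs() string {
	certs := []*Certificate{{ID: "c1"}, {ID: "c2"}}
	attachTargetIDs(certs, []mapping{{"c1", "t1"}, {"c1", "t2"}})
	out, _ := json.Marshal(certs)
	return string(out)
}
```

Here `demoTargetIDs()` yields `[{"id":"c1","target_ids":["t1","t2"]},{"id":"c2","target_ids":[]}]`, which is the `[]` versus `null` distinction the commit calls out.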
|
||
|
|
c655e0f8c5 |
fix(crypto/local-ca): reject expired or not-yet-valid sub-CA certificates on disk load (M-5)
loadCAFromDisk now validates the upstream sub-CA certificate's NotBefore
and NotAfter fields before accepting it, returning a fail-closed error
at server startup instead of silently loading an out-of-window CA.
Before this fix, loadCAFromDisk checked BasicConstraints.IsCA and
KeyUsage=CertSign but not the validity window. An expired enterprise
sub-CA (e.g. an ADCS subordinate whose rollover slipped) would load
without warning and the scheduler would mint child certs that every
RFC 5280 path validator rejects — outages show up at relying parties,
not at certctl, and only after thresholds trip.
CWE-672 (Operation on a Resource after Expiration or Release); secondary
CWE-295 (Improper Certificate Validation). Error strings include the CA
subject CommonName and both RFC3339 timestamps so the log line is
actionable in a 3am incident.
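A validity-window gate of this kind might be written along these lines. `checkCAValidity` and `demoValidity` are hypothetical names, and the error wording is an assumption modelled on the description above (CommonName plus RFC3339 timestamps), not the project's actual strings.

```go
package main

import (
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"time"
)

// checkCAValidity fails closed when now falls outside the certificate's
// [NotBefore, NotAfter] window.
func checkCAValidity(cert *x509.Certificate, now time.Time) error {
	cn := cert.Subject.CommonName
	nb := cert.NotBefore.Format(time.RFC3339)
	na := cert.NotAfter.Format(time.RFC3339)
	if now.Before(cert.NotBefore) {
		return fmt.Errorf("sub-CA %q is not yet valid (NotBefore=%s, NotAfter=%s)", cn, nb, na)
	}
	if now.After(cert.NotAfter) {
		return fmt.Errorf("sub-CA %q has expired (NotBefore=%s, NotAfter=%s)", cn, nb, na)
	}
	return nil
}

// demoValidity checks an expired, a not-yet-valid, and a barely valid CA
// against a fixed clock.
func demoValidity() (expiredRejected, notYetRejected, barelyValidOK bool) {
	now := time.Date(2026, 1, 1, 12, 0, 0, 0, time.UTC)
	mk := func(from, to time.Duration) *x509.Certificate {
		return &x509.Certificate{
			Subject:   pkix.Name{CommonName: "Test Sub-CA"},
			NotBefore: now.Add(from),
			NotAfter:  now.Add(to),
		}
	}
	expiredRejected = checkCAValidity(mk(-48*time.Hour, -time.Hour), now) != nil
	notYetRejected = checkCAValidity(mk(time.Hour, 48*time.Hour), now) != nil
	barelyValidOK = checkCAValidity(mk(-time.Minute, time.Hour), now) == nil
	return
}
```

Passing the clock in as a parameter keeps the check deterministic in tests; a production caller would pass `time.Now()`.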
Tests: TestSubCAMode gains three subtests exercising the new gate —
SubCA_ExpiredCert_IsRejected (CA expired 1h ago → error mentions
'expired' and the CN), SubCA_NotYetValid_IsRejected (CA valid +1h →
error mentions 'not yet valid' and the CN), and SubCA_BarelyValid_IsAccepted
(CA valid [now-1m, now+1h] → issuance succeeds, proving no
over-rejection). Adds generateTestSubCAWithValidity helper; the
original generateTestSubCA wrapper preserves the [now, now+5y] default
for existing tests.
Package coverage: 67.7% -> 68.3%.
Verification: go build, go vet, go test -race, go test -cover all
green locally; golangci-lint v2.11.4 clean; govulncheck clean. All CI
coverage floors met with margin (service 67.6/55, handler 78.6/60,
domain 92.7/40, middleware 80.0/30, crypto 86.7/85).
Parent:
|
||
|
|
5abeeb882b | fix(crypto): per-ciphertext PBKDF2 salt + v2 versioned format with v1 fallback (M-8) | ||
|
|
b1df6dab27 | ci(release): add CLI/MCP binaries, checksums, SBOM, Cosign, SLSA provenance (M-3) | ||
|
|
672e1d991d |
build: propagate HTTP_PROXY/HTTPS_PROXY/NO_PROXY through Docker build (M-4, Issue #9)
Addresses Medium finding M-4 in the audit report. The multi-stage
Dockerfiles previously had no ARG declarations for HTTP_PROXY,
HTTPS_PROXY, or NO_PROXY, so corporate-proxy environments silently
failed at 'npm ci' (frontend stage) and 'go mod download' (Go builder).
The npm retry idiom (`npm ci --include=dev || npm ci --include=dev`)
masked the failure because the upstream 'Exit handler never called!'
bug exits 0 despite the install crash.
Fix: thread HTTP_PROXY / HTTPS_PROXY / NO_PROXY ARGs through every
Docker build stage that performs network I/O, re-export them as ENV
with both upper- and lower-case aliases (apk/curl/npm read lowercase;
Go/Node read uppercase), and forward the host shell's environment via
`build.args:` in every compose file and `build-args:` in the release
workflow's docker/build-push-action steps. Defaults are empty strings
so un-proxied builds remain byte-identical to the pre-fix tree.
Scope: Dockerfile (frontend + Go builder stages), Dockerfile.agent
(Go builder stage), deploy/docker-compose.yml (server + agent),
deploy/docker-compose.dev.yml (server + agent), deploy/docker-compose.test.yml
(server + agent), .github/workflows/release.yml (both docker/build-push-action
v6 invocations). Zero Go, web, test, or runtime code changes. Zero
base-image changes. Existing npm `||` retry idiom and `ARG TARGETARCH`
preserved verbatim.
CWE-1173 (Improper Use of Validation Framework) / CWE-16 (Configuration).
Verification:
- YAML parses clean across all four compose files and release.yml.
- yamllint -d relaxed: clean exit across all five YAML files.
- All six `build.args:` blocks expose HTTP_PROXY, HTTPS_PROXY, NO_PROXY
with default-empty ${VAR:-} substitution.
- Both release.yml docker/build-push-action steps expose the same
three keys sourced from ${{ secrets.HTTP_PROXY }}, etc.
- Dockerfiles contain 9 proxy ARG declarations total (Dockerfile has 2
stages × 3 ARGs = 6 lines, Dockerfile.agent has 1 stage × 3 ARGs = 3
lines); lowercase ENV aliases verified present in every stage.
- git diff --shortstat: 6 files changed, 117 insertions(+), 0 deletions.
Pure additive.
Docker-live verification (`docker build`, `docker compose config`)
deferred to CI / post-commit smoke because the sandbox has no Docker
runtime. hadolint, go, golangci-lint, govulncheck likewise unavailable
in the sandbox; per-layer CI coverage gates (service 55%, handler 60%,
domain 40%, middleware 30%) are trivially unaffected as M-4 touches
zero Go source files.
v2.0.41
|
||
|
|
89b910a8f1 |
security: atomic pending-job claim with FOR UPDATE SKIP LOCKED (H-6)
Fixes H-6 (CWE-362) — GetPendingJobs returned pending rows without row
locks, so two scheduler replicas in an HA deployment could both read the
same row, both decide it was theirs, and race on UpdateStatus, producing
duplicate Running jobs and duplicate certificate issuances.
Remediation: a claim-style repository API that selects + transitions
Pending -> Running in one transaction with SELECT ... FOR UPDATE SKIP
LOCKED. Concurrent claimants observe disjoint row sets; no worker ever
sees another worker's claimed row.
Repository changes (internal/repository/postgres/job.go):
- New ClaimPendingJobs(ctx, jobType, limit): BEGIN; SELECT id,...
FROM jobs WHERE status='Pending' (optional type filter, optional
LIMIT) FOR UPDATE SKIP LOCKED; UPDATE jobs SET status='Running',
updated_at=NOW() WHERE id = ANY($ids); COMMIT. Returns the claimed
rows with status already flipped.
- New ClaimPendingByAgentID(ctx, agentID): mirrors M31 UNION ALL
semantics (direct agent_id match, target->agent JOIN fallback,
certificate->target->agent chain for AwaitingCSR) but wraps each
branch in FOR UPDATE SKIP LOCKED and flips Deployment/Renewal rows
to Running. AwaitingCSR rows are returned in place (state
transition deferred until SubmitCSR, consistent with M8 semantics).
- Existing GetPendingJobs / ListPendingByAgentID retained for legacy
compatibility; their godoc now directs production callers to the
Claim* variants.
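A claim transaction in the shape described above could be sketched with database/sql as follows. This is not the repository's implementation: the column names (`type`, `created_at`), the explicit casts, the `NULLIF` trick for "limit 0 means no limit", and the per-row UPDATE loop (used here instead of an `ANY($ids)` array bind to stay driver-independent) are all assumptions. A PostgreSQL driver must be registered for it to run.

```go
package main

import (
	"context"
	"database/sql"
)

// selectPendingSQL locks the rows it returns and skips rows that another
// transaction has already locked. Schema details are assumed.
const selectPendingSQL = `
SELECT id FROM jobs
WHERE status = 'Pending' AND ($1::text = '' OR type = $1::text)
ORDER BY created_at
LIMIT NULLIF($2::int, 0)
FOR UPDATE SKIP LOCKED`

// markRunningSQL flips one claimed row to Running.
const markRunningSQL = `
UPDATE jobs SET status = 'Running', updated_at = NOW() WHERE id = $1`

// claimPending selects pending jobs and transitions them to Running inside
// one transaction, returning the claimed IDs. Concurrent callers see
// disjoint sets because locked rows are skipped rather than waited on.
func claimPending(ctx context.Context, db *sql.DB, jobType string, limit int) ([]string, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return nil, err
	}
	defer tx.Rollback() // becomes a no-op once Commit succeeds

	rows, err := tx.QueryContext(ctx, selectPendingSQL, jobType, limit)
	if err != nil {
		return nil, err
	}
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			rows.Close()
			return nil, err
		}
		ids = append(ids, id)
	}
	if err := rows.Err(); err != nil {
		rows.Close()
		return nil, err
	}
	rows.Close() // release the cursor before issuing more statements on tx

	for _, id := range ids {
		if _, err := tx.ExecContext(ctx, markRunningSQL, id); err != nil {
			return nil, err
		}
	}
	if err := tx.Commit(); err != nil {
		return nil, err
	}
	return ids, nil
}
```

The row locks are held until Commit, so a second replica running the same SELECT in the meantime skips those rows; after Commit they no longer match `status = 'Pending'`.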
Production caller switches:
- internal/service/job.go ProcessPendingJobs: ListByStatus(Pending)
-> ClaimPendingJobs(ctx, "", 0). Eliminates the real scheduler
race between two replicas tick-firing simultaneously.
- internal/service/agent.go GetPendingWork: ListPendingByAgentID ->
ClaimPendingByAgentID. Eliminates the race between two pollers
for the same agent (e.g. brief network blip causing duplicate
poll) and between a scheduler tick and an agent poll.
Safety argument for pre-flipping Pending -> Running inside the claim
transaction: ProcessRenewalJob and ProcessDeploymentJob both call
UpdateStatus(Running) unconditionally on entry, so an early flip is
idempotent. On panic, the scheduler's panic recovery leaves the job
in Running which the existing stale-running reaper handles.
Tests (internal/repository/postgres/repo_test.go, skipped in -short):
- TestJobRepository_ClaimPendingJobs_FlipsToRunning: seed 5 Pending,
claim once, assert all 5 returned + DB rows Running, residual
claim returns 0.
- TestJobRepository_ClaimPendingJobs_ConcurrentDisjoint: seed M=40
Pending Renewals, spawn N=8 goroutines each calling
ClaimPendingJobs(_, JobTypeRenewal, 1) in a loop. Invariants:
(a) no job ID claimed by more than one worker, (b) sum of claims
== 40, (c) all 40 rows in Running state in the DB. Bounded
empty-streak guard (20 iterations) covers SKIP LOCKED transient
zeros under contention.
- TestJobRepository_ClaimPendingByAgentID_TransitionsDeployments:
seeds 2 Pending Deployment + 1 AwaitingCSR for agent A plus 1
Pending Renewal for agent B (scope check). Asserts deployments
flip to Running, AwaitingCSR is returned but preserved, agent B's
renewal never appears.
Mock updates: testutil_test.go, lifecycle_test.go, verification_test.go
gained ClaimPendingJobs/ClaimPendingByAgentID on their mock job repos
mirroring the real Pending -> Running semantics. Mocks intentionally
do NOT write to StatusUpdates (that map tracks UpdateStatus() call
history specifically; the real claim path uses a bulk UPDATE, not
UpdateStatus).
Verification (CI-scope):
- go build ./cmd/...: ok
- go vet ./...: ok
- go test -race -short on service, api/handler, api/middleware,
scheduler, connector/..., domain, validation, tlsprobe: ok
- Coverage gates: service 67.6% (>=55), handler 78.6% (>=60),
middleware 80.0% (>=30), domain 92.7% (>=40). All hold.
- golangci-lint 2.11.4: 0 issues
- govulncheck: no vulnerabilities in call graph
- Frontend: tsc clean, 218 vitest tests pass, vite build ok
- helm lint + helm template: ok
- Invariant sweeps: FOR UPDATE SKIP LOCKED present in job.go;
H-1 through H-5 fixtures unchanged.
Refs: H-6 in certctl-audit-report.md
|
||
|
|
6315ef102a |
security(globalsign): remove InsecureSkipVerify and pin CA pool (H-5)
The GlobalSign Atlas HVCA connector previously used
InsecureSkipVerify:true on its mTLS TLS config, disabling server
certificate validation and defeating the purpose of the client-side mTLS
handshake. This was a CWE-295 Improper Certificate Validation
vulnerability silently degrading trust on every production call to
GlobalSign's signing API.
Remediation (per H-5 audit finding, Lens 4.4):
- Remove InsecureSkipVerify from all three http.Client construction
sites (ValidateConfig, getHTTPClient, and legacy initialisation path).
- Introduce buildServerTLSConfig() helper that constructs tls.Config
with MinVersion: tls.VersionTLS12 (addresses adjacent L-1
recommendation).
- New optional config field `server_ca_path` (env:
CERTCTL_GLOBALSIGN_SERVER_CA_PATH). When unset the connector trusts the
system root CA bundle (correct default for GlobalSign's publicly-trusted
HVCA endpoints). When set the bundle is loaded via x509.NewCertPool() +
AppendCertsFromPEM, and only those roots are trusted (supports private
HVCA deployments and defence-in-depth root pinning).
- Error wrapping chain: "failed to read server CA bundle at %s" and "no
valid PEM certificates found in server CA bundle at %s" surface config
problems at ValidateConfig time instead of silently failing at request
time.
Docs, config, service env-seed, and GUI issuer type definition updated
to expose the new field.
Tests: 9 dead `InsecureSkipVerify: true` client TLSClientConfig blocks
(no-ops against httptest.NewServer plain-HTTP) replaced with bare
http.Client; new TestGlobalSign_ServerTLSConfig covers pinned-CA trust,
untrusted-server rejection, missing-file and invalid-PEM error paths.
Verification:
- go build ./... clean
- go vet ./... clean
- go test -race ./internal/connector/issuer/globalsign/...
./internal/config/... ./internal/service/... ok
- go test ./... (excluding testcontainers-gated repo layer) ok
- golangci-lint run ./... 0 issues
- govulncheck ./... 0 reachable vulns
- Per-layer coverage: service 68.7% (≥55), handler 83.6% (≥60), domain
82.0% (≥40), middleware 63.8% (≥30)
- globalsign package coverage: 75.9%
- Invariant sweep: 0 InsecureSkipVerify references remain in globalsign
package (only a test-file comment documenting the removal).
|
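A helper in the spirit of buildServerTLSConfig might look like the sketch below. The helper name and the two error strings are borrowed from the commit message, but the body and the `demoTLSConfig` function are a guess at the described behaviour (system roots when the path is empty, a pinned pool otherwise), not the connector's actual source.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
	"path/filepath"
)

// buildServerTLSConfig returns a TLS config that always verifies the server.
// An empty path leaves RootCAs nil, which makes crypto/tls use the system
// root pool; a non-empty path pins trust to the PEM bundle at that path.
func buildServerTLSConfig(serverCAPath string) (*tls.Config, error) {
	cfg := &tls.Config{MinVersion: tls.VersionTLS12}
	if serverCAPath == "" {
		return cfg, nil
	}
	pem, err := os.ReadFile(serverCAPath)
	if err != nil {
		return nil, fmt.Errorf("failed to read server CA bundle at %s: %w", serverCAPath, err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(pem) {
		return nil, fmt.Errorf("no valid PEM certificates found in server CA bundle at %s", serverCAPath)
	}
	cfg.RootCAs = pool
	return cfg, nil
}

// demoTLSConfig exercises the default, missing-file, and invalid-PEM paths.
func demoTLSConfig() (systemRoots, missingFails, invalidFails bool) {
	cfg, err := buildServerTLSConfig("")
	systemRoots = err == nil && cfg.RootCAs == nil && !cfg.InsecureSkipVerify

	_, err = buildServerTLSConfig(filepath.Join(os.TempDir(), "no-such-ca-bundle.pem"))
	missingFails = err != nil

	f, err := os.CreateTemp("", "bad-ca-*.pem")
	if err != nil {
		return
	}
	defer os.Remove(f.Name())
	f.WriteString("not a PEM bundle")
	f.Close()
	_, err = buildServerTLSConfig(f.Name())
	invalidFails = err != nil
	return
}
```

Calling a function like this from the config validation step is what would surface a bad bundle path at startup rather than at the first signing request.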
||
|
|
119986fa7e |
security: add SSRF defence-in-depth for webhook notifier (fixes H-4)
The webhook notifier would previously accept any operator-configured URL
and hand it to http.Client without validation. That exposed two
SSRF classes (CWE-918):
* Reserved-address reachability — a misconfigured or adversarial
webhook URL pointing at 127.0.0.1, ::1, 169.254.169.254 (cloud
metadata), or 0.0.0.0 would succeed, exfiltrating request bodies
to local services or leaking short-lived cloud credentials.
* DNS rebinding — a hostname resolving to a public IP at validation
time and to a reserved IP at dial time would bypass any
URL-string-only check.
Fix installs two independent layers:
* validation.ValidateSafeURL runs at config-ingest time and before
every outbound POST. It rejects non-HTTP(S) schemes, empty hosts,
and literal reserved-IP hosts with a clear operator-facing error.
This is a fast early diagnostic.
* validation.SafeHTTPDialContext is installed on the webhook
http.Transport. It re-resolves the host at dial time, rejects any
resolved address that lies in a reserved range (loopback,
link-local, multicast, broadcast, unspecified, IPv6
link-local/multicast), and pins the resolved IP into the final
dial address so the TLS handshake targets the exact IP the guard
approved. This is the authoritative, TOCTOU-safe defence against
DNS rebinding.
The two layers are complementary — validateURL fails fast on obvious
misconfiguration; SafeHTTPDialContext fails closed when DNS changes
between validation and dial.
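The dial-time layer can be sketched with the standard library alone. The names `isReserved`, `reservedAddr`, `safeDialContext`, and `newWebhookClient` are hypothetical, and the reserved list below is an approximation of the ranges named above (RFC 1918 ranges are deliberately left reachable); it is not the project's `validation` package.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	"time"
)

// isReserved reports whether ip is loopback, unspecified, link-local,
// multicast, or the IPv4 broadcast address. Private RFC 1918 ranges pass.
func isReserved(ip net.IP) bool {
	return ip.IsLoopback() || ip.IsUnspecified() ||
		ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() ||
		ip.IsMulticast() || ip.Equal(net.IPv4bcast)
}

// reservedAddr is a string convenience wrapper around isReserved.
func reservedAddr(s string) bool {
	ip := net.ParseIP(s)
	return ip != nil && isReserved(ip)
}

// safeDialContext resolves the host at dial time, refuses to connect if any
// resolved address is reserved, and dials the approved IP directly so the
// connection cannot be redirected by a second DNS answer.
func safeDialContext(ctx context.Context, network, addr string) (net.Conn, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	ips, err := net.DefaultResolver.LookupIPAddr(ctx, host)
	if err != nil {
		return nil, err
	}
	if len(ips) == 0 {
		return nil, fmt.Errorf("no addresses found for %s", host)
	}
	for _, ia := range ips {
		if isReserved(ia.IP) {
			return nil, fmt.Errorf("refusing to dial %s: resolves to reserved address %s", host, ia.IP)
		}
	}
	d := net.Dialer{Timeout: 10 * time.Second}
	return d.DialContext(ctx, network, net.JoinHostPort(ips[0].IP.String(), port))
}

// newWebhookClient installs the guarded dialer on an HTTP transport.
func newWebhookClient() *http.Client {
	return &http.Client{
		Timeout:   15 * time.Second,
		Transport: &http.Transport{DialContext: safeDialContext},
	}
}
```

Because only the dial address is pinned, the transport still uses the original hostname for SNI and certificate verification, so HTTPS webhooks keep validating against the name the operator configured.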
The existing unexported isReservedIP helper in
internal/service/network_scan.go is extracted into
internal/validation.IsReservedIP with byte-identical behaviour so the
webhook notifier and the network scanner share a single authoritative
reserved-address list. RFC 1918 ranges remain intentionally allowed
(certctl's self-hosted design). Broader unspecified / IPv6 link-local
coverage lives only in the stricter dial-time policy, where it belongs
for outbound HTTP egress.
Test seam: Connector gains an unexported validateURL func field and a
same-package newForTest constructor that installs a permissive
validator and the stdlib default transport. Production callers cannot
reach this constructor because it is unexported; only same-package
tests (package webhook) can use it. Same-package happy-path tests call
newForTest so they can point at httptest loopback servers without
being blocked by the production guard. The four SSRF-rejection tests
that verify the guard itself still call New so they exercise the real,
strict validator. This keeps the production SSRF defence
unconditionally on in real code while preserving legitimate unit-test
coverage.
Tests
-----
* internal/validation/ssrf_test.go (new) — 16-subtest pin on
IsReservedIP that is byte-identical with the original network-
scanner behaviour; ValidateSafeURL accept/reject matrix covering
HTTPS/HTTP, reserved-literal IPv4/IPv6, dangerous schemes
(file/gopher/ftp/javascript/data/ldap/dict/jar), missing hosts,
and malformed inputs; SafeHTTPDialContext rejects literal reserved
addresses and hosts resolving to reserved addresses (DNS-rebinding
coverage via localhost).
* internal/connector/notifier/webhook/webhook_test.go — happy-path
tests switched to newForTest; production-guard SSRF-rejection
tests (TestValidateConfig_RejectsReservedURLs,
TestValidateConfig_RejectsDangerousScheme,
TestPostWebhook_RejectsReservedURL,
TestPostWebhook_RejectsDangerousScheme) continue to call New so
they exercise the unconditionally-installed production validator.
Wire-format invariants preserved
--------------------------------
* Outbound HTTP request shape (method, headers, body, HMAC
signature) unchanged.
* network_scan.go behaviour unchanged — validation.IsReservedIP is
byte-identical with the deleted helper.
* RFC 1918 (10/8, 172.16/12, 192.168/16) remain allowed for both
outbound webhook and CIDR expansion, matching the self-hosted
design.
Verification
------------
* go test -race ./internal/validation/... ./internal/connector/
notifier/webhook/... ./internal/service/... — green.
* Full-suite go test -race ./... — green (GOTMPDIR=/dev/shm to
sidestep full /tmp on the sandbox host).
* Coverage gates pass: service 68.8% >= 55%, handler 83.6% >= 60%,
domain 82.0% >= 40%, middleware 63.8% >= 30%. Overall 67.8%.
Webhook package 91.5% line coverage; validation package
ValidateSafeURL/SafeHTTPDialContext 78-100% per function.
* govulncheck ./... — no vulnerabilities found.
* golangci-lint run on touched H-4 production code — clean. Pre-
existing errcheck/gosimple warnings in scope-adjacent files
(webhook_test.go:270 w.Write, network_scan.go:120/173/265/305)
verified against
|
||
|
|
3853b7460c |
security: reject CRLF/NUL in email headers to prevent SMTP injection (fixes H-3)
H-3 in certctl-audit-report.md: caller-supplied From/To/Subject were
interpolated directly into the SMTP DATA payload and handed to
client.Mail / client.Rcpt with no sanitization, allowing an attacker
who controls any of those values to inject extra headers (Bcc:,
Reply-To:), split the message body (CRLFCRLF), or tamper with the
SMTP envelope. CWE-113.
Fix:
- New package helper internal/validation.ValidateHeaderValue(field,
value). Rejects CR ("\r"), LF ("\n"), and NUL ("\x00") with an error
that names the offending field but does NOT echo the raw value,
so log readers cannot be attacked with injected content. Silent
stripping was considered and rejected: authentication-relevant
headers must fail visibly.
- Two-layer defense in internal/connector/notifier/email/email.go:
(1) primary guard at the top of sendEmail / sendHTMLEmail, which
blocks tampering of the SMTP envelope (client.Mail, client.Rcpt)
since net/smtp does not sanitize those arguments; and
(2) defense-in-depth guard inside formatEmailMessage /
formatHTMLEmailMessage, catching any future caller that
bypasses sendEmail. Both format functions now return an error.
- Body content is intentionally NOT validated — CR/LF in body is legal
RFC 5322 content and net/smtp handles dot-stuffing.
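The header check itself is small; a minimal sketch under assumed names follows. `validateHeaderValue` mirrors the described behaviour (reject CR, LF, and NUL; name the field; never echo the value), while the default field label and `demoHeaderValidation` are illustrative assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// validateHeaderValue rejects values containing CR, LF, or NUL. The error
// names the field but deliberately omits the offending value so that
// injected content never reaches the logs.
func validateHeaderValue(field, value string) error {
	if field == "" {
		field = "header" // assumed default label
	}
	if strings.ContainsAny(value, "\r\n\x00") {
		return fmt.Errorf("invalid %s: contains CR, LF, or NUL", field)
	}
	return nil
}

// demoHeaderValidation checks a safe value, an injection attempt, and that
// the error text does not leak the attacker-controlled input.
func demoHeaderValidation() (accepted, rejected, leaksValue bool) {
	accepted = validateHeaderValue("Subject", "Certificate expiring soon") == nil
	err := validateHeaderValue("To", "ops@example.com\r\nBcc: attacker@example.com")
	rejected = err != nil
	leaksValue = err != nil && strings.Contains(err.Error(), "attacker")
	return
}
```

A caller would run this on From, To, and Subject before touching the SMTP envelope, and again inside the message formatter as the second layer.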
Tests:
- internal/validation/headers_test.go: 3 functions (AcceptsSafeInput,
RejectsControlCharacters, DefaultFieldName) covering plain ASCII,
UTF-8 multibyte, tabs, typical email addresses, CRLF injection,
lone CR, lone LF, NUL, CRLFCRLF body split, trailing CR, leading LF.
Each reject case asserts the field name IS in the error and the
raw offending value IS NOT (anti-log-injection).
- internal/connector/notifier/email/email_test.go: added
TestEmail_FormatEmailMessage_RejectsCRLFInjection and
TestEmail_FormatHTMLEmailMessage_RejectsCRLFInjection. Existing
format tests updated for the new (bytes, error) signature.
Wire-format invariants preserved:
- SMTP DATA headers still use CRLF separators and RFC 1123Z Date
(unchanged).
- Content-Type headers unchanged (text/plain for plain, text/html +
MIME-Version: 1.0 for HTML).
- No change to message encoding or transport.
Verification (Go 1.25.9 linux-arm64, parent
|
||
|
|
e9947dc0fe |
docs: redact V3 feature specifics from README (fixes H-7)
Problem
-------
H-7 (CWE-200 / information disclosure, strategic-policy class): the
public README's V3 section enumerated the paid-tier feature set --
"Role-based access control with profile-gating", "Event-driven
architecture with real-time operational views", "Advanced search",
"compliance scoring", "HSM/TPM integration" -- violating the
CLAUDE.md directive "Keep V3+ deliberately vague -- one-liner
descriptions only. Don't telegraph the paid feature set." The prior
wording also carried factual drift: `compliance scoring` was pulled
forward to V2.2 per the V2.2 Roadmap, so pairing it with V3 in the
README misrepresented the open-core line.
Fix
---
Replace the two-sentence enumeration at README.md:322-323 with a
single deliberately-vague sentence:
Enterprise capabilities for larger deployments are available in
the commercial tier.
No named features. No SKU enumeration. Matches the policy one-liner
shape used in neighboring V1 / V2 / V4+ sections. Net -1 line of
prose.
Files
-----
README.md 1 -, 1 +
Wire-format invariants preserved
--------------------------------
This is a docs-only change. All protocol surfaces are byte-identical:
- RFC 7030 EST handler (internal/api/handler/est.go) -- untouched
- RFC 8894 SCEP handler (internal/api/handler/scep.go) -- untouched
- Shared internal/pkcs7/ package -- untouched
- H-1 revocation composite key (migration 000012) -- untouched
- H-2 SCEP challenge-password preflight + PKCSReq guard -- untouched
- C-2 AES-256-GCM config encryption contract -- untouched
- CRL DER bytes, OCSP response bytes -- untouched
Verification
------------
git diff
|
||
|
|
b813660c74 |
security: require SCEP challenge password when SCEP enabled (fixes H-2)
Problem (CWE-306 Missing Authentication for Critical Function):
internal/service/scep.go PKCSReq skipped the shared-secret check when
s.challengePassword was empty. An unconfigured-but-enabled SCEP server
accepted any unauthenticated client reaching /scep and issued a
certificate against the configured issuer for any CSR with a valid
signature. No audit trail distinguished authenticated from
unauthenticated enrollments. This matches the two-layer fail-closed
pattern already used for C-2 (
|
||
|
|
387fb555ac |
security: scope revocation unique index to (issuer_id, serial_number) (fixes H-1)
RFC 5280 §4.1.2.2 defines certificate serial number uniqueness per issuing CA,
not globally. The prior unique index on `certificate_revocations.serial_number`
enforced a stricter invariant than the spec: with 12 issuer connectors (Local
CA, ACME, Vault, step-ca, OpenSSL, DigiCert, Sectigo, Google CAS, AWS ACM PCA,
Entrust, GlobalSign, EJBCA), two distinct certificates legitimately issued by
different CAs can share a serial number. Recording a revocation for the second
colliding certificate was silently dropped via `ON CONFLICT DO NOTHING`, leaving
the second cert persistently absent from OCSP/CRL responses.
Changes:
- Migration 000012 drops `idx_certificate_revocations_serial` and creates
`idx_certificate_revocations_issuer_serial` UNIQUE ON (issuer_id,
serial_number). Adds a non-unique `idx_certificate_revocations_serial_lookup`
to preserve the serial-only fast path for OCSP/CRL probes that already know
the issuer scope.
- `CertificateRevocationRepository.Create` targets the new composite key in
`ON CONFLICT` — same-issuer idempotency preserved, cross-issuer collisions
now recorded as distinct rows.
- `GetBySerial(serial)` renamed `GetByIssuerAndSerial(issuerID, serial)` on
the interface and Postgres impl. All callers (OCSP responder, CRL
generator, short-lived-cert exemption check) already have `issuerID` in
scope because the protocol paths carry it (`/api/v1/ocsp/{issuer_id}/{serial}`,
`/api/v1/crl/{issuer_id}`).
- Repository integration test added: `TestRevocationRepository_CrossIssuerSerialCollision`
asserts that serial `CAFEBABE01` can be stored under two issuers
simultaneously, that lookups return the correct row per (issuer, serial),
and that same-issuer idempotency still works (re-inserting (issuer, serial)
does not error and does not duplicate).
- Existing tests and service/integration mocks updated for the rename.
Wire-format invariants preserved: CRL DER bytes, OCSP response bytes, and
AES-256-GCM config encryption are unaffected — this change touches only
revocation-record uniqueness scope.
CWE-664.
|
||
|
|
f549a7aa79 |
security: fail closed when CERTCTL_CONFIG_ENCRYPTION_KEY is unset (fixes C-2)
EncryptIfKeySet/DecryptIfKeySet in internal/crypto/encryption.go previously
returned plaintext + wasEncrypted=false when the operator had not configured
CERTCTL_CONFIG_ENCRYPTION_KEY. That produced a data-at-rest confidentiality
bypass (CWE-311): sensitive fields on dynamically-configured issuer and
target rows (source='database') were persisted to PostgreSQL without any
encryption, and no caller could distinguish the encrypted from the plaintext
branch at runtime. The only visible signal was a single warning log line
emitted once at startup.
Fail closed instead:
- EncryptIfKeySet / DecryptIfKeySet now return crypto.ErrEncryptionKeyRequired
(a new exported sentinel, errors.Is-unwrappable) when the key is empty or
nil, rather than silently emitting plaintext. The (result, wasEncrypted,
err) tuple signature is preserved for source compatibility; only the
semantics of the no-key branch changed.
- cmd/server/main.go grows a startup pre-flight check: if no encryption key
is configured the server lists issuers and targets, counts rows with
source='database', and refuses to start (os.Exit(1)) if any exist. Operators
must either configure CERTCTL_CONFIG_ENCRYPTION_KEY or remove the exposed
rows before the control plane can boot. The warning-only path is retained
for the clean-slate case (no database rows).
- internal/service/issuer.go's SeedFromEnvVars now guards the encryption call
with len(s.encryptionKey) > 0 so env-seeded rows (source='env', which are
reconstructable on every boot from process env) continue to persist as
plaintext in the 'config' column when no key is configured. Registry load
already falls through to cfg.Config when EncryptedConfig is nil. GUI/API
write paths (source='database') remain fail-closed via propagation of
ErrEncryptionKeyRequired.
- Integration tests that exercise CreateIssuer via the handler layer now
supply a real 32-byte AES-256 test key so the encrypt path runs instead of
returning ErrEncryptionKeyRequired. Same pattern in internal/service/
testutil_test.go for consolidated service-layer tests.
- internal/crypto/encryption_test.go grows regression guards:
TestEncryptIfKeySet_EmptyKeyFailsClosed (nil_key + empty_key subtests),
TestDecryptIfKeySet_EmptyKeyFailsClosed (nil_key + empty_key subtests),
TestEncryptDecryptIfKeySet_RoundTripProducesDifferentCiphertext,
TestDecryptIfKeySet_RejectsTamperedCiphertext, and
TestEncryptIfKeySet_PreservesErrEncryptionKeyRequiredSentinel (verifies
the sentinel unwraps through fmt.Errorf(%w)-style wrapping).
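The fail-closed branch and the nonce-prefixed AES-256-GCM layout can be sketched as below. This is a simplification: it takes a raw 32-byte key, whereas the commit describes deriving the key with PBKDF2, and the lowercase function names, `newGCM`, and `demoEncrypt` are hypothetical. Only the sentinel name and the (result, wasEncrypted, err) shape come from the message.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// ErrEncryptionKeyRequired is returned instead of silently passing plaintext.
var ErrEncryptionKeyRequired = errors.New("encryption key required")

// newGCM builds an AES-GCM AEAD; a 32-byte key selects AES-256.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) == 0 {
		return nil, ErrEncryptionKeyRequired
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, fmt.Errorf("new cipher: %w", err)
	}
	return cipher.NewGCM(block)
}

// encryptIfKeySet returns nonce||ciphertext||tag, or the sentinel when no
// key is configured.
func encryptIfKeySet(plaintext, key []byte) ([]byte, bool, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, false, err
	}
	nonce := make([]byte, gcm.NonceSize()) // 12 bytes for standard GCM
	if _, err := rand.Read(nonce); err != nil {
		return nil, false, fmt.Errorf("read nonce: %w", err)
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), true, nil
}

// decryptIfKeySet reverses encryptIfKeySet and rejects tampered input.
func decryptIfKeySet(ciphertext, key []byte) ([]byte, bool, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, false, err
	}
	n := gcm.NonceSize()
	if len(ciphertext) < n {
		return nil, false, errors.New("ciphertext too short")
	}
	pt, err := gcm.Open(nil, ciphertext[:n], ciphertext[n:], nil)
	if err != nil {
		return nil, false, fmt.Errorf("decrypt: %w", err)
	}
	return pt, true, nil
}

// demoEncrypt covers the no-key, round-trip, and tamper cases.
func demoEncrypt() (failsClosed, roundTrip, tamperRejected bool) {
	_, _, err := encryptIfKeySet([]byte("secret"), nil)
	failsClosed = errors.Is(err, ErrEncryptionKeyRequired)

	key := make([]byte, 32) // all-zero key, for demonstration only
	ct, _, err := encryptIfKeySet([]byte("secret"), key)
	if err != nil {
		return
	}
	pt, _, err := decryptIfKeySet(ct, key)
	roundTrip = err == nil && string(pt) == "secret"

	ct[len(ct)-1] ^= 0xff // flip a bit in the auth tag
	_, _, err = decryptIfKeySet(ct, key)
	tamperRejected = err != nil
	return
}
```

Callers can branch on `errors.Is(err, ErrEncryptionKeyRequired)` to distinguish a missing key from a genuine cryptographic failure.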
Wire format is unchanged: AES-256-GCM Encrypt/Decrypt/DeriveKey, the
12-byte nonce prefix, the GCM auth tag, the PBKDF2 salt
('certctl-config-encryption-v1'), and the 100,000 iteration count are all
byte-identical. Ciphertexts produced before this change remain decryptable.
Verified:
- go build ./... : clean
- go vet ./... : clean
- go test -race ./internal/crypto/... ./internal/service/... \
./internal/integration/... ./cmd/server/... : pass
- golangci-lint run ./... : 0 issues
- govulncheck ./... : 0 reachable vulnerabilities
- rg 'return plaintext, false, nil' internal/ : no matches
- Coverage: crypto 85.0% (unchanged), service 67.8% (was 67.9%, noise),
cmd/server 0.0% (unchanged baseline). All above CI thresholds.
See certctl-audit-report.md for the full finding record and resolution log.
|
||
|
|
b219e5d68a |
security: use crypto/rand for agent API keys (fixes C-1)
Replaces math/rand-based agent API key generation in
internal/service/agent.go with crypto/rand.Read over a 32-byte buffer
encoded with base64.RawURLEncoding, yielding a 43-character URL-safe
unpadded ASCII string (256 bits of entropy). generateAPIKey now returns
(string, error); Register and RegisterAgent propagate entropy-source
failures. hashAPIKey is unchanged: the SHA-256 hashed-at-rest invariant
is preserved.
Fixes C-1 (CWE-338: Use of Cryptographically Weak Pseudo-Random Number
Generator) from certctl-audit-report.md.
Changes:
- internal/service/agent.go: new imports (crypto/rand, encoding/base64);
generateAPIKey rewritten to return (string, error); Register and
RegisterAgent updated to propagate the error.
- internal/service/agent_test.go: TestGenerateAPIKey_Properties
regression test (non-empty, length 43, valid base64url, 32 decoded
bytes, no collisions over 64 calls). No entropy-failure test: Go 1.24+
(issue #66821) makes crypto/rand errors fatal, so that branch is
defensively unreachable.
Verification:
- go build ./cmd/server/... ./cmd/agent/... ./cmd/mcp-server/...
./cmd/cli/... → pass
- go vet ./... → pass
- go test -race (CI scope, 43 packages) → pass
- golangci-lint v2.11.4 run ./... → 0 issues
- govulncheck ./... → 0 vulnerabilities in certctl code
- Coverage: service 68.9% / handler 83.6% / domain 82.0% / middleware
63.8% (all above CI gates 55/60/40/30)
- grep math/rand in internal/ and cmd/ → zero production hits
- No caller assumes the old 32-char length or legacy charset
|
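The key generation described here reduces to a few lines. The sketch below reuses the generateAPIKey name from the commit message, but the body is a reconstruction from the description rather than the repository source.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

// generateAPIKey reads 32 bytes (256 bits) from the OS CSPRNG and encodes
// them as unpadded base64url. 32 bytes encode to ceil(32*8/6) = 43
// characters.
func generateAPIKey() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("read entropy: %w", err)
	}
	return base64.RawURLEncoding.EncodeToString(buf), nil
}
```

The 43-character length follows directly from the encoding: 256 bits at 6 bits per character is 42.67, rounded up, with no `=` padding because the Raw variant omits it.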
||
|
|
1f6cf0eafa |
fix: add npm ci retry and install verification for proxy environments (#9)
npm has a known bug where `npm ci` can crash with "Exit handler never called!" behind corporate proxies yet exit with code 0. This adds a single retry on failure and verifies tsc is actually installed before proceeding to build. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> v2.0.40 |
||
|
|
a49eae8155 |
fix: correct BSL 1.1 change date to March 14, 2033
why-certctl.md said March 1, CHART_SUMMARY.md said March 28. The LICENSE file is authoritative: Change Date is March 14, 2033. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
1c7d085f16 |
docs: move maintenance notice and quick start link above Documentation section
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
cc6eec3608 |
fix: merge npm install + build into single Docker layer (#9)
The previous fix (--include=dev) was necessary but insufficient. The real issue is that node_modules created by npm ci in one layer can be lost when COPY web/ . creates the next layer, depending on the Docker storage driver (fuse-overlayfs, vfs). Merging install and build into a single RUN eliminates the layer boundary entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> v2.0.39 |
||
|
|
86fb140414 |
fix: ensure devDependencies install in Docker build (#9)
npm ci skips devDependencies when NODE_ENV=production leaks from the host environment into the Docker build. This breaks the frontend stage because typescript and vite are devDependencies. Adding --include=dev makes the install hermetic regardless of host environment. Closes #9 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
13cd4d98ba |
feat(V2.2): bulk revocation — filter-based fleet-wide certificate revocation
Add POST /api/v1/certificates/bulk-revoke with filter criteria (profile_id, owner_id, agent_id, issuer_id, team_id, certificate_ids), partial-failure tolerance, and audit trail. Includes MCP tool, CLI command (certs bulk-revoke), server-side bulk modal in GUI replacing client-side sequential loop, OpenAPI spec, compliance mapping updates, and 21 new tests (12 service, 7 handler, 1 CLI, 1 frontend). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> v2.0.38 |
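The "partial-failure tolerance" mentioned above can be pictured with a small sketch: each certificate is revoked independently, and a failure is recorded rather than aborting the batch. This is an illustration under stated assumptions, not the repository's bulk_revocation.go. Every name here (bulkRevokeResult, bulkRevoke, summarize, errRevokeFailed) is hypothetical, and filter resolution and audit logging are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// errRevokeFailed is a hypothetical sentinel used only in this sketch.
var errRevokeFailed = errors.New("revocation failed")

// bulkRevokeResult is a hypothetical result shape: IDs that were
// revoked, plus a per-ID error message for those that were not.
type bulkRevokeResult struct {
	Revoked []string
	Failed  map[string]string
}

// bulkRevoke calls revoke for every ID. An error for one certificate
// is recorded in Failed and the loop continues with the next ID.
func bulkRevoke(ids []string, revoke func(id string) error) bulkRevokeResult {
	res := bulkRevokeResult{Failed: make(map[string]string)}
	for _, id := range ids {
		if err := revoke(id); err != nil {
			res.Failed[id] = err.Error()
			continue
		}
		res.Revoked = append(res.Revoked, id)
	}
	return res
}

// summarize renders the counts a caller might report back.
func summarize(res bulkRevokeResult) string {
	return fmt.Sprintf("revoked=%d failed=%d", len(res.Revoked), len(res.Failed))
}
```

Returning both lists lets a client retry only the failed IDs instead of resubmitting the whole filter.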