Commit Graph

4 Commits

Author SHA1 Message Date
shankar0123 5ea45a19b9 feat(security): Sprint 5 ACQ — RED-003 deny-empty flip + SEC-009/RED-005 RFC1918 opt-in
Acquisition-audit Sprint 5 ACQ closure (2026-05-16). Two
independent findings ship together because they share Load() /
main.go wiring; the closure comments tie each line to its finding.

PART A — RED-003 (agent-bootstrap deny-empty cutover)
=====================================================

Phase 2 SEC-H1 closure (2026-05-13) introduced the
CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY staged feature flag with
default `false` so v2.1.x operators wouldn't get a surprise
fail-closed on upgrade. This commit flips the default to `true`
(per the staged plan in the existing CHANGELOG "Breaking changes
(scheduled for v2.2.0)" block). Operators who haven't generated a
real bootstrap token yet keep the v2.1.x warn-mode pass-through
for one upgrade window by setting
CERTCTL_AGENT_BOOTSTRAP_TOKEN_DENY_EMPTY=false explicitly.

Demo-mode escape hatch: CERTCTL_DEMO_MODE_ACK=true skips the
fail-closed gate so the screenshot/demo path stays one-command-up.
The accompanying boot-banner WARN at cmd/server/main.go:124-126
keeps demo mode visible in every log scraper, so this override
cannot silently re-enable warn-mode in production.

internal/config/config.go
  - Load() default for AgentBootstrapTokenDenyEmpty flipped to true
  - Validate() gate now also checks !c.Auth.DemoModeAck so the demo
    override line up with the boot-banner WARN
  - Closure comment block updated to cross-reference Sprint 5 ACQ
    and the CHANGELOG v2.2.0 entry

cmd/server/main.go
  - Updated boot-time WARN message to reflect the new default
    (deny-empty=true) — the warn now fires only in the two
    explicit override scenarios (warn-mode opt-back or demo mode),
    and explains the operator action either way
  - Info-line on configured-token path unchanged

PART B — SEC-009 + RED-005 (opt-in RFC1918 outbound block)
==========================================================

internal/validation/ssrf.go::IsReservedIP has always intentionally
left RFC 1918 ranges (10/8, 172.16/12, 192.168/16) NOT-reserved
because certctl is designed to manage certificates inside private
networks. For operators on hosted IaaS where RFC1918 IS internal
trust (kubeadm-default 10.96.0.0/12 service CIDR exposes the
Kubernetes API on 10.96.0.1; cloud-provider internal monitoring;
hosted-bastion subnets), this default is a real exposure path.

Add a package-level atomic.Bool toggle in internal/validation/ssrf.go
that, when on, extends IsReservedIP to ALSO return true for the
three RFC1918 ranges. Every IsReservedIP-derived path
(SafeHTTPDialContext, ValidateSafeURL, the network scanner, the
webhook + OIDC + ACME callers) picks up the new policy
transitively without per-call-site changes.

internal/validation/ssrf.go
  - blockRFC1918Outbound atomic.Bool + SetBlockRFC1918Outbound /
    BlockRFC1918OutboundEnabled accessor pair
  - rfc1918Nets pre-parsed at package init (panic on parse failure
    surfaces a misconfigured ssrf package immediately, not via a
    silently disabled toggle)
  - IsReservedIP checks the toggle after the existing reserved-IP
    checks
  - Header comment rewritten to document the toggle + the
    transitive coverage

internal/config/config.go
  - New NetworkConfig sub-config; Config gains a Network field
  - Load() reads CERTCTL_BLOCK_RFC1918_OUTBOUND env var (default
    false; preserves the existing self-hosted threat model)
  - NetworkConfig docstring lists the operator-trap (enabling this
    also blocks RFC1918 from the network scanner) so an operator
    cert-discovering their own RFC1918 space doesn't get a
    silently-empty scan result

cmd/server/main.go
  - Wires validation.SetBlockRFC1918Outbound after config.Load and
    near the demo-mode banner / agent-bootstrap-token block; emits
    a one-shot INFO line when the toggle is enabled so the policy
    is visible in journals

Tests
=====

internal/config/config_test.go
  - TestLoad_AgentBootstrapTokenDenyEmpty_DefaultIsTrue — pins the
    default flip at the boot path (Load returns the flipped value)
  - TestValidate_DenyEmptyDefault_RefusesWithoutToken — pins the
    fail-closed behavior under the new default
  - TestValidate_DenyEmptyExplicitFalse_AllowsEmpty — pins the
    v2.1.x back-compat escape hatch
  - TestValidate_DenyEmpty_DemoModeAckOverride_AllowsEmpty — pins
    the demo-mode override

internal/validation/ssrf_test.go
  - TestIsReservedIP_RFC1918_OptIn — pins toggle-off / toggle-on
    behavior across all three RFC1918 ranges, edge cases
    immediately outside the ranges, and the toggle-back-off path
  - TestSafeHTTPDialContext_RFC1918_OptIn — pins that the toggle
    reaches the dial-time SSRF check transitively (not just
    IsReservedIP in isolation)

Test-helper updates (Sprint-5-induced churn):
  - internal/config/config_test.go::setMinimalValidEnv now sets
    CERTCTL_AGENT_BOOTSTRAP_TOKEN to a placeholder so Load()-based
    tests that don't specifically exercise the empty-token gate
    keep passing under the new fail-closed default. Tests that DO
    exercise the empty-token path explicitly override back to "".
  - internal/config/config_est_profiles_test.go +
    internal/config/config_scep_profiles_test.go: same placeholder
    fix for the four Load()-based EST/SCEP profile tests.
  - cmd/server/main_test.go::TestMain_ServerConfigFromEnvironment +
    TestMain_AuthTypeConfiguration: same fix at the main.go test
    layer with prior-value restore.

Verified locally: gofmt -l clean; go vet clean; staticcheck clean
across internal/config, internal/validation, cmd/server; short
tests green on all three packages; targeted -v run of all six new
test names confirms PASS.
2026-05-16 19:13:52 +00:00
shankar0123 21aeed4f4e legal: addlicense headers + normalize legacy variants (Phase 0 RED-4)
Phase 0 closure (Path B2, post-rewrite):

addlicense sweep — adds the canonical certctl LLC copyright + BUSL-1.1
SPDX header to every production Go file. Template:

  // Copyright 2026 certctl LLC. All rights reserved.
  // SPDX-License-Identifier: BUSL-1.1

Coverage: 338 / 338 production Go files (cmd/ + internal/, excluding
*_test.go and **/testdata/**). Pre-sweep coverage was 22 / 338 (6.5%);
post-sweep is 338 / 338 (100%).

Normalized 22 pre-existing legacy headers (`// Copyright (c) certctl`
+ `// SPDX-License-Identifier: BSL-1.1`) and 1 file using a
`Certctl Contributors` attribution. The legacy SPDX ID `BSL-1.1`
is non-standard; the official SPDX identifier for Business Source
License 1.1 is `BUSL-1.1` (capital U). All 338 files now share the
canonical form.

Generated via:
  addlicense -c "certctl LLC" -y 2026 \
    -f cowork/legal/copyright-header.tpl \
    -ignore '**/testdata/**' -ignore '**/*_test.go' \
    cmd/ internal/

Verification:
  find cmd internal -name '*.go' -not -name '*_test.go' \
    -not -path '*/testdata/*' \
    -exec grep -L '^// Copyright 2026 certctl LLC' {} \; | wc -l

  Returns: 0

gofmt clean. Header additions are comments only, no compile impact.

Closes: cowork/certctl-architecture-diligence-audit.html#fix-RED-4
2026-05-13 21:23:35 +00:00
shankar0123 9bc845304e acme-server: HTTP-01 + DNS-01 + TLS-ALPN-01 challenge validation (Phase 3/7)
Wires up the actual challenge-validation machinery so profiles in
acme_auth_mode='challenge' resolve end-to-end. After this commit,
cert-manager 1.15+ with `solver: http01: ingress` against a
challenge-mode profile completes a real HTTP-01 flow and gets a cert.
DNS-01 + TLS-ALPN-01 share the same code path with the appropriate
validator selection.

Architecture (the load-bearing parts):
  - 3 separate semaphore-bounded worker pools (one per challenge type),
    so HTTP-01 and DNS-01 can't starve each other under load. Default
    weight 10 per type; tunable via CERTCTL_ACME_SERVER_HTTP01_CONCURRENCY,
    DNS01_CONCURRENCY, TLSALPN01_CONCURRENCY.
  - 30s per-challenge timeout (configurable via PoolConfig.PerChallengeTimeout).
  - HTTP-01 validator runs validation.IsReservedIPForDial (newly
    exported wrapper preserving the existing private impl byte-for-byte
    for the network scanner + ValidateSafeURL paths) on the resolved
    IP — both at the initial dial and every redirect hop. SSRF probes
    into private IP space are refused before the connect.
  - DNS-01 validator uses a dedicated resolver pointed at
    CERTCTL_ACME_SERVER_DNS01_RESOLVER (default 8.8.8.8:53) — does
    NOT use the system resolver to keep behavior deterministic across
    deployments. Wildcard handling: `*.example.com` queries
    _acme-challenge.example.com.
  - TLS-ALPN-01 validator (RFC 8737) connects with ALPN `acme-tls/1`,
    inspects the id-pe-acmeIdentifier extension (OID 1.3.6.1.5.5.7.1.31),
    asserts the ASN.1 OCTET STRING value equals SHA-256 of the key
    authorization. Cert chain is intentionally NOT validated
    (InsecureSkipVerify=true is correct per RFC 8737 — the proof is
    in the extension, not the chain). Documented in docs/tls.md L-001
    table + the //nolint:gosec comment carries the justification.
    SSRF guard: same posture as HTTP-01.
  - Validation is asynchronous: handler accepts the POST and returns
    200 immediately with status=processing; the worker-pool fires a
    callback that updates challenge → authz → order in a fresh
    background-context WithinTx. The order auto-promotes to `ready`
    when ALL authzs become valid; auto-fails to `invalid` when ANY
    authz becomes invalid.

What ships:
  - internal/api/acme/challenge.go: KeyAuthorization (RFC 8555 §8.1) +
    DNS01TXTRecordValue (§8.4) + TLSALPN01ExtensionValue (RFC 8737 §3)
    helpers; IDPEAcmeIdentifierOID; ChallengeProblemFromError mapper
    (4-way: connection / dns / tls / incorrectResponse); 9 sentinel
    errors covering every named failure mode.
  - internal/api/acme/validators.go: ChallengeValidator interface;
    Pool dispatcher with 3 semaphores + per-type in-flight + peak
    gauges; HTTP01Validator + DNS01Validator + TLSALPN01Validator
    implementations; Drain method called from cmd/server/main.go's
    shutdown sequence.
  - internal/api/acme/validators_test.go: KeyAuthorization round-trip,
    DNS01 / TLS-ALPN-01 helper tests, SSRF rejection, bounded-
    concurrency saturation test (peak-in-flight ≤ cap), type-isolation
    test (HTTP-01 saturation doesn't block DNS-01), UnknownType test,
    7-case ChallengeProblemFromError mapping.
  - internal/repository/postgres/acme.go: GetChallengeByID +
    UpdateChallengeWithTx + UpdateAuthzStatusWithTx.
  - internal/service/acme.go: SetValidatorPool wires the *acme.Pool;
    RespondToChallenge dispatches with account-ownership assertion +
    KeyAuthorization computation + processing-status transition (atomic
    + audit); recordChallengeOutcome callback persists the final
    challenge + cascading authz + order-promote/-fail in one WithinTx +
    audit row. 4 new metrics.
  - internal/api/handler/acme.go: Challenge handler; round-trips
    account.JWKPEM through ParseJWKFromPEM to recover the *jose.JSONWebKey
    the validator pool needs.
  - internal/api/router/router.go + openapi_parity_test.go +
    api/openapi-handler-exceptions.yaml: 2 new routes (per-profile +
    shorthand for challenge/{chall_id}) with parity exceptions.
  - cmd/server/main.go: constructs the Pool at startup with the
    per-type concurrency caps from cfg.ACMEServer; ACMEService.ValidatorPool()
    accessor exposed for the shutdown drain sequence.
  - internal/validation/ssrf.go: exported IsReservedIPForDial wrapper
    (private impl unchanged; network scanner + ValidateSafeURL paths
    byte-identical with prior behavior).
  - docs/tls.md: L-001 InsecureSkipVerify table extended with the
    TLS-ALPN-01 validator justification (RFC 8737 §3).
  - docs/acme-server.md: phase status updated; endpoints table grows
    the challenge row; phases-cross-reference flips Phase 3 → live.

Tests:
  - 80%+ coverage on the new files.
  - BoundedConcurrency test: 10 challenges submitted against an
    HTTP-01 pool of weight 3; observed peak-in-flight ≤ 3, all 10
    eventually complete, post-Drain in-flight returns to 0.
  - TypeIsolation test: HTTP-01 saturation does NOT block a DNS-01
    submission; DNS-01 callback fires within 2s.
  - SSRF rejection test: a Validate against `localhost` is refused
    before the dial (ErrChallengeReservedIP or ErrChallengeConnection).

Engineering history: cowork/WORKSPACE-CHANGELOG.md "ACME-Server-3".
2026-05-03 14:09:00 +00:00
shankar0123 119986fa7e security: add SSRF defence-in-depth for webhook notifier (fixes H-4)
The webhook notifier would previously accept any operator-configured URL
and hand it to http.Client without validation. That exposed two
SSRF classes (CWE-918):

  * Reserved-address reachability — a misconfigured or adversarial
    webhook URL pointing at 127.0.0.1, ::1, 169.254.169.254 (cloud
    metadata), or 0.0.0.0 would succeed, exfiltrating request bodies
    to local services or leaking short-lived cloud credentials.
  * DNS rebinding — a hostname resolving to a public IP at validation
    time and to a reserved IP at dial time would bypass any
    URL-string-only check.

Fix installs two independent layers:

  * validation.ValidateSafeURL runs at config-ingest time and before
    every outbound POST. It rejects non-HTTP(S) schemes, empty hosts,
    and literal reserved-IP hosts with a clear operator-facing error.
    This is a fast early diagnostic.
  * validation.SafeHTTPDialContext is installed on the webhook
    http.Transport. It re-resolves the host at dial time, rejects any
    resolved address whose address lies in a reserved range (loopback,
    link-local, multicast, broadcast, unspecified, IPv6
    link-local/multicast), and pins the resolved IP into the final
    dial address so the TLS handshake targets the exact IP the guard
    approved. This is the authoritative, TOCTOU-safe defence against
    DNS rebinding.

The two layers are complementary — validateURL fails fast on obvious
misconfiguration; SafeHTTPDialContext fails closed when DNS changes
between validation and dial.

The existing unexported isReservedIP helper in
internal/service/network_scan.go is extracted into
internal/validation.IsReservedIP with byte-identical behaviour so the
webhook notifier and the network scanner share a single authoritative
reserved-address list. RFC 1918 ranges remain intentionally allowed
(certctl's self-hosted design). Broader unspecified / IPv6 link-local
coverage lives only in the stricter dial-time policy, where it belongs
for outbound HTTP egress.

Test seam: Connector gains an unexported validateURL func field and a
same-package newForTest constructor that installs a permissive
validator and the stdlib default transport. Production callers cannot
reach this constructor because it is unexported; only same-package
tests (package webhook) can use it. Same-package happy-path tests call
newForTest so they can point at httptest loopback servers without
being blocked by the production guard. The four SSRF-rejection tests
that verify the guard itself still call New so they exercise the real,
strict validator. This keeps the production SSRF defence
unconditionally on in real code while preserving legitimate unit-test
coverage.

Tests
-----
  * internal/validation/ssrf_test.go (new) — 16-subtest pin on
    IsReservedIP that is byte-identical with the original network-
    scanner behaviour; ValidateSafeURL accept/reject matrix covering
    HTTPS/HTTP, reserved-literal IPv4/IPv6, dangerous schemes
    (file/gopher/ftp/javascript/data/ldap/dict/jar), missing hosts,
    and malformed inputs; SafeHTTPDialContext rejects literal reserved
    addresses and hosts resolving to reserved addresses (DNS-rebinding
    coverage via localhost).
  * internal/connector/notifier/webhook/webhook_test.go — happy-path
    tests switched to newForTest; production-guard SSRF-rejection
    tests (TestValidateConfig_RejectsReservedURLs,
    TestValidateConfig_RejectsDangerousScheme,
    TestPostWebhook_RejectsReservedURL,
    TestPostWebhook_RejectsDangerousScheme) continue to call New so
    they exercise the unconditionally-installed production validator.

Wire-format invariants preserved
--------------------------------
  * Outbound HTTP request shape (method, headers, body, HMAC
    signature) unchanged.
  * network_scan.go behaviour unchanged — validation.IsReservedIP is
    byte-identical with the deleted helper.
  * RFC 1918 (10/8, 172.16/12, 192.168/16) remain allowed for both
    outbound webhook and CIDR expansion, matching the self-hosted
    design.

Verification
------------
  * go test -race ./internal/validation/... ./internal/connector/
    notifier/webhook/... ./internal/service/... — green.
  * Full-suite go test -race ./... — green (GOTMPDIR=/dev/shm to
    sidestep full /tmp on the sandbox host).
  * Coverage gates pass: service 68.8% >= 55%, handler 83.6% >= 60%,
    domain 82.0% >= 40%, middleware 63.8% >= 30%. Overall 67.8%.
    Webhook package 91.5% line coverage; validation package
    ValidateSafeURL/SafeHTTPDialContext 78-100% per function.
  * govulncheck ./... — no vulnerabilities found.
  * golangci-lint run on touched H-4 production code — clean. Pre-
    existing errcheck/gosimple warnings in scope-adjacent files
    (webhook_test.go:270 w.Write, network_scan.go:120/173/265/305)
    verified against 3853b74 to predate this commit; left alone per
    scope guard.

Operational notes
-----------------
  * No migration needed. The guard is pure Go code; existing webhook
    configs continue to work unless they point at reserved addresses,
    in which case they now fail closed with a clear error.
  * Existing operators who rely on webhook POST to 127.0.0.1 or
    ::1 (e.g., local receivers on the same host as certctl-server)
    must expose their receiver on an RFC 1918 address or public IP.
    This is deliberate — the threat model for webhook notifiers
    includes untrusted operator-supplied URLs.

Scope guard: H-4 only. H-5, H-6, M-*, L-*, and I-* findings remain
open and are tracked separately. No drive-by refactors.
2026-04-17 00:34:47 +00:00