certctl/docs/tls.md
shankar0123 e720474fb7 Bundle D: Documentation & transparency sweep — 8 findings closed
Closes H-009 + L-001 + L-007 + L-008 + L-016 + L-017 + L-018 + M-027
from comprehensive-audit-2026-04-25.

H-009 — README JWT verified-already-clean
  README has zero JWT mentions at audit time. docs/architecture.md
  correctly documents JWT/OIDC integration via authenticating-gateway
  pattern (line 905-912).
  .github/workflows/ci.yml: new step
    'Forbidden README JWT advertising regression guard (H-009)'
    greps README for JWT-as-supported phrasing; passes verbatim
    (gateway / pre-G-1) but fails build on net-new advertising.

L-001 (CWE-295) — InsecureSkipVerify per-site justification
  Audit count was 8; recon found 13 production sites.
  docs/tls.md: new 'InsecureSkipVerify justifications' table
    enumerates each site by file:line with per-site rationale.
  cmd/agent/verify.go:78, internal/tlsprobe/probe.go:54,
  internal/service/network_scan.go:460: each previously-bare
    InsecureSkipVerify: true now carries //nolint:gosec.
  .github/workflows/ci.yml: new step
    'Forbidden bare InsecureSkipVerify regression guard (L-001)'
    fails build if any net-new ISV lands in non-test .go without
    nolint:gosec on the same or preceding line.

L-007 — README dependency-audit commands
  README.md: new Dependencies section with go list -m all | wc -l,
    go mod why, govulncheck ./.... Honors operating-rules invariant.

L-008 — Release-time govulncheck gate
  .github/workflows/release.yml: new 'Install govulncheck' +
    'Run govulncheck (release gate)' steps in the matrix job.
    Pinned to same install path as ci.yml. Default exit code
    semantics (fail on called-vuln only, deferred-call advisories
    tracked on master via L-021) keeps the gate appropriate.

L-016 — architecture.md drift fixes
  docs/architecture.md: system-components diagram's '21 tables'
    annotation removed (current 23; replaced with TEXT-keys
    descriptor); connector-architecture '9 connectors' prose
    replaced with grep ref + current 12-issuer list (added
    Entrust/GlobalSign/EJBCA which were missing); API-design
    '97 operations / 107 total' replaced with grep commands.
  Connector subgraphs verified-current at 12/13/6.

L-017 — workspace CLAUDE.md verified-already-clean
  Bundle B's pre-commit-gate refactor already converted current-
  state numeric claims to grep commands. Phase 0 recon confirmed
  zero remaining hardcoded counts.

L-018 — Defect age table
  cowork/comprehensive-audit-2026-04-25/defect-age.md (NEW):
    Tabulates all 9 High findings with first-mentioned commit,
    closing bundle, days-open. Methodology snippet for re-running.
    Key finding: 8 of 9 closed within 24h of audit publication.

M-027 — OpenAPI parity verified-already-clean
  Audit's 'router 121 vs OpenAPI 125 — 4-op gap' was wrong
  methodology. The 4-op 'gap' was exactly the 4 routes registered
  via r.mux.Handle (auth-exempt allowlist) instead of r.Register.
  When you count both dispatch shapes the totals match exactly.
  internal/api/router/openapi_parity_test.go (NEW):
    TestRouter_OpenAPIParity AST-walks router.go for both
    Register and mux.Handle calls + walks api/openapi.yaml's
    path/method nesting + asserts the sets match. Adding a route
    without updating the spec fails CI permanently.

Audit deliverables:
  audit-report.md: score 38/55 -> 46/55 closed
    (High 7/9 -> 8/9; Medium 20/27 -> 21/27; Low 8/19 -> 14/19)
  findings.yaml: 8 status flips open -> closed
  defect-age.md: new file
  certctl/CHANGELOG.md: Bundle D section

Verification:
  TestRouter_OpenAPIParity                                   PASS
  L-001 grep guard self-test (after //nolint:gosec adds)     PASS
  H-009 grep guard self-test                                 PASS
  go test -count=1 -short on changed packages                green
2026-04-27 00:47:15 +00:00


TLS on the Control Plane

certctl's control plane is HTTPS-only as of v2.2. There is no plaintext http:// listener, no auto mode, no dual-listener bridge, no TLS 1.2 escape hatch. The server refuses to start without a cert+key pair, the agent/CLI/MCP clients reject http:// URLs at startup, and the Helm chart refuses to render without either an operator-supplied Secret or a cert-manager Certificate CR.

This doc covers four cert provisioning patterns, SIGHUP-based cert rotation, and the client-side CA-trust configuration agents and the CLI need to talk to the server. If you are upgrading from a pre-HTTPS release and want the step-by-step cutover procedure, read upgrade-to-tls.md first and come back here for reference.

What you get

The server negotiates TLS 1.3 only, with an explicit curve preference of [X25519, P-256]. Go's crypto/tls does not make TLS 1.3 cipher suites configurable (the three suites it implements, TLS_AES_128_GCM_SHA256, TLS_AES_256_GCM_SHA384, and TLS_CHACHA20_POLY1305_SHA256, are always offered), so there is no CipherSuites knob to misconfigure. No TLS 1.2 fallback is available.

Two env vars are required on the server:

  • CERTCTL_SERVER_TLS_CERT_PATH: filesystem path to the PEM-encoded server certificate
  • CERTCTL_SERVER_TLS_KEY_PATH: filesystem path to the PEM-encoded private key that matches the certificate's public key

Both paths are read during a fail-loud preflight in cmd/server/main.go (see preflightServerTLS in cmd/server/tls.go). If either is unset, unreadable, or the cert+key pair does not round-trip through tls.LoadX509KeyPair, the process refuses to start and emits a diagnostic pointing back at this doc. The rationale lives in §3 of the HTTPS-Everywhere milestone: a cert-lifecycle product should not silently bind plaintext.
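The preflight described above can be sketched as follows. This is an approximation under stated assumptions, not the body of preflightServerTLS: the function name preflightTLS and the exact error strings are illustrative, and the real server exits non-zero where this sketch only prints.

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"os"
)

// preflightTLS is a hypothetical stand-in for the real preflight.
// It checks, in order: both paths are set, both files can be stat'ed,
// and the pair loads through tls.LoadX509KeyPair.
func preflightTLS(certPath, keyPath string) error {
	if certPath == "" {
		return errors.New("CERTCTL_SERVER_TLS_CERT_PATH is empty (see docs/tls.md)")
	}
	if keyPath == "" {
		return errors.New("CERTCTL_SERVER_TLS_KEY_PATH is empty (see docs/tls.md)")
	}
	for _, p := range []string{certPath, keyPath} {
		if _, err := os.Stat(p); err != nil {
			return fmt.Errorf("TLS file %q unreadable: %w (see docs/tls.md)", p, err)
		}
	}
	if _, err := tls.LoadX509KeyPair(certPath, keyPath); err != nil {
		return fmt.Errorf("TLS cert/key pair invalid: %w (see docs/tls.md)", err)
	}
	return nil
}

func main() {
	err := preflightTLS(
		os.Getenv("CERTCTL_SERVER_TLS_CERT_PATH"),
		os.Getenv("CERTCTL_SERVER_TLS_KEY_PATH"),
	)
	if err != nil {
		// The real server refuses to start here; the sketch just reports.
		fmt.Println("preflight failed:", err)
		return
	}
	fmt.Println("preflight ok")
}
```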

Pattern 1 — Self-signed bootstrap for docker-compose demos

This is the default for the deploy/docker-compose.yml stack. It exists so docker compose up -d --build just works on a laptop without the operator standing up a CA first. It is not appropriate for any non-demo environment.

An init container named certctl-tls-init runs once before the server starts. It uses the alpine/openssl image and generates an ECDSA-P256 self-signed cert (SHA-256 signature):

openssl req -x509 -newkey ec \
  -pkeyopt ec_paramgen_curve:P-256 \
  -nodes \
  -keyout /etc/certctl/tls/server.key \
  -out   /etc/certctl/tls/server.crt \
  -days 3650 \
  -subj "/CN=certctl-server" \
  -addext "subjectAltName=DNS:certctl-server,DNS:localhost,IP:127.0.0.1,IP:::1"

Why ECDSA-P256 and not ed25519. The pre-v2.0.48 demo bootstrap used ed25519 (small keys, fast signatures). Apple's TLS stack — Safari Network Framework and the macOS-bundled LibreSSL 3.3.6 /usr/bin/curl — does not advertise ed25519 in the ClientHello signature_algorithms extension for server certs, so an ed25519 server cert was rejected at handshake with tls: peer doesn't support any of the certificate's signature algorithms on the server side (and the generic TLS handshake error on the client side). Homebrew OpenSSL 3.x, Chrome, Firefox, and Linux curl all accepted ed25519 — Apple was the outlier. ECDSA-P256 with SHA-256 is universally supported, so the demo bootstrap uses it by default. To pick up the new algorithm on an existing demo install, tear the volume down and rebuild: docker compose -f deploy/docker-compose.yml down -v && docker compose -f deploy/docker-compose.yml up -d --build. Helm and operator-supplied-Secret users (Patterns 2 and 3) are unaffected — they bring their own cert, and cmd/server/tls.go is algorithm-agnostic (TLS 1.3 with curve preference [X25519, P-256] for key exchange — no constraint on the server cert's signature algorithm).

The cert, its matching key, and a copy of the cert published as ca.crt land in a named volume (certs) mounted at /etc/certctl/tls/ in the server container (read-only) and the agent container (read-only). The bootstrap is idempotent — if server.crt, server.key, and ca.crt are already present on the volume, the init container logs TLS cert already present at … and exits cleanly.

Single-cert design. CN is certctl-server to match the Docker-network hostname. The SAN list is [certctl-server, localhost, 127.0.0.1, ::1], which covers both container-internal agent→server traffic and operator browser/curl access to https://localhost:8443. There is no separate intermediate/root chain — the server cert and the CA bundle are the same PEM. This is the whole point of a demo bootstrap.

To force regeneration (rotate the demo cert), tear the volume down: docker compose down -v. The next up re-runs the init container.

The server's Docker healthcheck and the agent both verify against /etc/certctl/tls/ca.crt; no -k / InsecureSkipVerify anywhere in the default stack.

Pattern 2 — Operator-supplied kubernetes.io/tls Secret (Helm)

This is the default path for Helm installs. The operator provisions a Secret of type kubernetes.io/tls holding tls.crt + tls.key (and optionally ca.crt for mounting a CA bundle to clients in the same cluster) from whatever source they already trust — their internal CA, a manually-issued cert, step-ca, AWS ACM PCA exported to PEM, or the output of the self-signed bootstrap pattern above copied into a cluster Secret.

kubectl create secret tls certctl-server-tls \
  --cert=server.crt \
  --key=server.key \
  --namespace certctl

Then:

helm install certctl deploy/helm/certctl \
  --namespace certctl \
  --set server.tls.existingSecret=certctl-server-tls

The Secret is mounted read-only at /etc/certctl/tls/ in the server pod. The CERTCTL_SERVER_TLS_CERT_PATH and CERTCTL_SERVER_TLS_KEY_PATH env vars are wired to tls.crt and tls.key keys inside that mount. If ca.crt is absent from the Secret, clients that need a CA bundle should use tls.crt as the bundle (self-signed case) or mount a separate ConfigMap with the root chain (operator-CA case).

If the operator sets neither server.tls.existingSecret nor server.tls.certManager.enabled=true, helm template / helm install fails at render-time with a diagnostic pointing at this doc. The guard is implemented in deploy/helm/certctl/templates/_helpers.tpl under the certctl.tls.required helper. This is deliberate: the HTTPS-only server would crash-loop on an empty path, so we fail earlier at Helm-render time.

Pattern 3 — cert-manager Certificate CR (Helm, opt-in)

For clusters that already run cert-manager, the chart can provision a Certificate CR that writes into the Secret the server pod reads from. This is opt-in — the default is server.tls.certManager.enabled: false — because not every cluster has cert-manager installed, and we refuse to ship a chart that silently depends on an external controller.

helm install certctl deploy/helm/certctl \
  --namespace certctl \
  --set server.tls.certManager.enabled=true \
  --set server.tls.certManager.issuerRef.name=my-cluster-issuer \
  --set server.tls.certManager.issuerRef.kind=ClusterIssuer

The rendered Certificate (see deploy/helm/certctl/templates/server-certificate.yaml) writes tls.crt + tls.key + ca.crt into the Secret named by server.tls.certManager.secretName (defaults to <fullname>-tls). The server pod reads from that same Secret; the agent DaemonSet mounts the same Secret as its CA bundle source.

cert-manager handles rotation. certctl-server handles in-place reload — see the SIGHUP section below.

The chart enforces that if server.tls.certManager.enabled=true, server.tls.certManager.issuerRef.name must also be set. An empty issuerRef.name makes helm template fail with a diagnostic naming the missing flag.

Pattern 4 — Manually-issued from an internal CA

For operators running neither Helm nor docker-compose (bare-metal / custom orchestration), the server just needs two files on disk pointed at by CERTCTL_SERVER_TLS_CERT_PATH and CERTCTL_SERVER_TLS_KEY_PATH. Issue the cert from your internal CA with:

  • CN matching the hostname your agents and operators use to dial the server (e.g., certctl.prod.example.com)
  • SAN list covering every hostname and IP that appears in CERTCTL_SERVER_URL values across your agent fleet
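  • Key usage: digital signature (key encipherment is only exercised by RSA key exchange in TLS 1.2 and earlier; it is harmless on an RSA cert but unnecessary for this TLS 1.3-only server)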
  • Extended key usage: server auth

Store the key with mode 0600 and owner matching the UID the server runs as (1000 in our shipped Dockerfile). The server process reads both files during preflightServerTLS at startup and again on every SIGHUP.

The full CA chain that signed the server cert should be distributed to agents, CLI operators, and MCP clients as their CERTCTL_SERVER_CA_BUNDLE_PATH — see the client section below.

SIGHUP cert rotation

The server wraps its cert+key pair in a *certHolder (see cmd/server/tls.go) that guards the loaded *tls.Certificate under a sync.Mutex. The *tls.Config wires GetCertificate to the holder, so every new inbound TLS handshake reads whatever cert the holder currently has.

Send SIGHUP to the server PID and the holder re-reads both files from disk. On success, the next new connection uses the new cert; in-flight requests finish on the previous cert. A log line goes out:

TLS cert reloaded via SIGHUP cert_path=/etc/certctl/tls/server.crt key_path=/etc/certctl/tls/server.key

On failure (missing file, malformed PEM, key does not match cert), the old cert is retained and an error is logged:

TLS cert reload failed; continuing with previous cert cert_path=… key_path=… error=…

This is deliberately fail-safe on reload (as opposed to fail-loud on startup). A cert-manager renewal race, a partially-copied file, a typo in a rotation script — none of those should crash a running server and drop every agent connection. The operator sees the error in logs, fixes the underlying issue, and sends another SIGHUP.

Pair with cert-manager, certbot --post-hook, or any rotation tool that can fire a signal. For docker-compose, docker compose kill -s HUP certctl-server works. For Kubernetes, reload is typically handled by cert-manager updating the Secret and the mounted file changing on the next kubelet sync; no explicit SIGHUP is needed as long as the volume is not mounted via subPath (subPath mounts do not receive Secret updates).

Startup is a different story. If the cert is missing or malformed at process start, the server exits non-zero rather than binding plaintext or attempting a retry loop. That's the HTTPS-only contract.

Client-side TLS: agents, CLI, MCP

Everything that talks to the server enforces HTTPS on the URL.

Agent

CERTCTL_SERVER_URL must be https://…. http://, bare hostnames, ftp://, ws://, and empty strings are rejected at startup by validateHTTPSScheme in cmd/agent/main.go with a diagnostic pointing at upgrade-to-tls.md. There is no warning-and-proceed path.
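A sketch of what such a check might look like with net/url. Only the function name comes from this doc; the body and the error strings are assumptions and may differ from cmd/agent/main.go.

```go
package main

import (
	"fmt"
	"net/url"
)

// validateHTTPSScheme rejects anything that is not an https:// URL with
// a host. Illustrative body; the real implementation may differ.
func validateHTTPSScheme(raw string) error {
	if raw == "" {
		return fmt.Errorf("server URL is empty (see upgrade-to-tls.md)")
	}
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("server URL %q is not parseable: %w", raw, err)
	}
	// A bare hostname parses with an empty scheme, and host:port parses
	// with the hostname as the scheme, so both fall through to this check.
	if u.Scheme != "https" || u.Host == "" {
		return fmt.Errorf("server URL %q must be https://host[:port] (see upgrade-to-tls.md)", raw)
	}
	return nil
}

func main() {
	for _, raw := range []string{
		"https://certctl.example.com:8443",
		"http://certctl.example.com:8443",
		"certctl.example.com",
		"ftp://certctl.example.com",
		"ws://certctl.example.com",
		"",
	} {
		fmt.Printf("%q -> %v\n", raw, validateHTTPSScheme(raw))
	}
}
```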

Two additional env vars control how the agent verifies the server cert:

  • CERTCTL_SERVER_CA_BUNDLE_PATH — filesystem path to a PEM-encoded CA bundle that signed the server cert. Loaded into *tls.Config.RootCAs on the agent's HTTP client. If unset, the agent falls back to the OS system trust store.
  • CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY — defaults to false. Setting it to true skips verification entirely. Dev-only escape hatch. The agent logs a prominent warning at startup (TLS certificate verification is disabled … never enable this in production). Use this only when dialing a demo server whose cert you haven't bothered to mount into the agent container.

Equivalent CLI flags: --ca-bundle <path> and --insecure-skip-verify.

If both the CA bundle and InsecureSkipVerify=true are set, InsecureSkipVerify wins — it's the whole point of the flag. Don't do this in production.
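How the two settings could map onto a client-side *tls.Config, as a sketch. The helper clientTLSConfig is hypothetical; it takes the bundle bytes rather than a path so the example stays self-contained.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// clientTLSConfig is a hypothetical helper. caBundlePEM is the content
// of the file named by CERTCTL_SERVER_CA_BUNDLE_PATH (empty if unset).
func clientTLSConfig(caBundlePEM []byte, insecureSkipVerify bool) (*tls.Config, error) {
	cfg := &tls.Config{}
	if len(caBundlePEM) > 0 {
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(caBundlePEM) {
			return nil, errors.New("CA bundle contains no usable PEM certificates")
		}
		cfg.RootCAs = pool
	}
	// A nil RootCAs makes crypto/tls fall back to the system trust store.
	// When InsecureSkipVerify is true, RootCAs is never consulted, which
	// is why the flag wins over the bundle.
	cfg.InsecureSkipVerify = insecureSkipVerify //nolint:gosec // dev-only escape hatch
	return cfg, nil
}

func main() {
	cfg, err := clientTLSConfig(nil, false)
	fmt.Println("system trust store:", err == nil && cfg.RootCAs == nil)

	_, err = clientTLSConfig([]byte("not a PEM bundle"), false)
	fmt.Println("bad bundle error:", err)
}
```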

CLI (certctl-cli)

Same contract as the agent:

  • CERTCTL_SERVER_URL defaults to https:// scheme; http:// rejected at startup
  • --ca-bundle <path> flag or CERTCTL_SERVER_CA_BUNDLE_PATH env var — CA bundle for server cert verification
  • --insecure flag or CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true — skip verification (dev only)
  • Error diagnostic on empty URL explicitly mentions both --server and CERTCTL_SERVER_URL so operators see the right knob to turn

The CLI shares the URL-scheme validation with the agent; the test pins in cmd/cli/main_test.go:TestValidateHTTPSScheme cover the full rejection matrix.

MCP server (certctl-mcp-server)

Same three controls as CLI, env-var-driven only (no flags — MCP runs as a stdio subprocess and inherits env from the launching LLM client):

  • CERTCTL_SERVER_URL must start with https://
  • CERTCTL_SERVER_CA_BUNDLE_PATH optional CA bundle
  • CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY optional skip

Claude Desktop / other MCP client configs should set all three in the tool's env block.

Troubleshooting: fail-loud preflight errors

Every preflight failure message ends with (see docs/tls.md) so this doc is the first hit when an operator searches. Common failures:

CERTCTL_SERVER_TLS_CERT_PATH is empty: HTTPS-only control plane refuses to start

Set the env var. For docker-compose this is already set to /etc/certctl/tls/server.crt in the shipped compose file; if you're seeing this, check the certctl-tls-init service logs to see why the init container didn't populate the volume. For Helm, check that server.tls.existingSecret or server.tls.certManager.enabled=true is set.

TLS cert file "…" unreadable: …

The cert path is set but os.Stat failed. Check filesystem permissions: the server runs as UID 1000 in our shipped Dockerfile, and the cert needs to be readable by that UID. Typos in the path also land here.

TLS cert/key pair invalid (cert="…" key="…"): …

Both files exist but tls.LoadX509KeyPair refused them. Typical causes: the private key does not match the certificate's public key, the key is encrypted with a passphrase (not supported; remove the passphrase with openssl pkey before mounting), or one of the two is DER-encoded instead of PEM. Re-issue the cert and key together and re-mount.

Client side: tls: failed to verify certificate: x509: certificate signed by unknown authority

The client did not trust the CA that signed the server cert. Either mount the CA bundle via CERTCTL_SERVER_CA_BUNDLE_PATH, add the CA to the system trust store on the client host, or (dev only) set CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=true.

Client side: tls: first record does not look like a TLS handshake

The client is speaking plaintext HTTP to an HTTPS server (or vice versa). Check that CERTCTL_SERVER_URL starts with https://. If you are upgrading from a pre-v2.2 release and your agents are old, they will surface this error until you roll the DaemonSet; see upgrade-to-tls.md.

InsecureSkipVerify justifications (Audit L-001)

crypto/tls.Config.InsecureSkipVerify short-circuits standard certificate chain validation. Each production use site below has a justification — the shape is "this code path is fundamentally pre-trust or trust-from-context, and chain validation in the stdlib path is not the right tool". Test-only sites are not enumerated here.

The CI grep guard Forbidden bare InsecureSkipVerify regression guard (L-001) in .github/workflows/ci.yml fails the build if any new InsecureSkipVerify: true lands in a non-test file without a //nolint:gosec comment carrying a justification — adding a new entry to this table is the right way to extend the surface.
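The rule the guard enforces can be restated in Go as a sketch. The real guard is a grep step in ci.yml; the function below is a hypothetical re-expression that assumes the annotation may sit on the same line or on the line immediately before.

```go
package main

import (
	"fmt"
	"strings"
)

// bareISVLines returns the 1-based line numbers where
// "InsecureSkipVerify: true" appears without "nolint:gosec" on the same
// line or on the immediately preceding line.
func bareISVLines(src string) []int {
	var bad []int
	lines := strings.Split(src, "\n")
	for i, line := range lines {
		if !strings.Contains(line, "InsecureSkipVerify: true") {
			continue
		}
		if strings.Contains(line, "nolint:gosec") {
			continue
		}
		if i > 0 && strings.Contains(lines[i-1], "nolint:gosec") {
			continue
		}
		bad = append(bad, i+1)
	}
	return bad
}

func main() {
	src := "a := &tls.Config{InsecureSkipVerify: true}\n" +
		"b := &tls.Config{InsecureSkipVerify: true} //nolint:gosec // discovery probe\n" +
		"//nolint:gosec // leaf-match verifier\n" +
		"c := &tls.Config{InsecureSkipVerify: true}\n"
	fmt.Println(bareISVLines(src)) // prints [1]
}
```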

| Site (file:line) | Trigger | Justification |
| --- | --- | --- |
| cmd/agent/main.go:59,125,136,1259,1262 | --insecure-skip-verify CLI flag | Dev escape hatch; docs/tls.md and the agent install script direct operators to use a real CA bundle in production. The agent emits a startup WARN when set. |
| cmd/agent/verify.go:70,78 | TLS deployment verification probe | The agent is verifying that its own freshly-deployed cert is being served. The chain may be self-signed or signed by an upstream the agent host doesn't trust; what matters is the leaf-cert match against what the agent just deployed. The verifier compares the served leaf bytes to the expected leaf, not the chain. |
| internal/tlsprobe/probe.go:33,47,54 | Network scanner / discovery probe | Discovery's job is to find every cert on the network, including expired, self-signed, and not-yet-deployed certs. Validating the chain would silently skip the broken-cert results that are precisely what operators want to know about. |
| internal/mcp/client.go:35 | MCP CLI --insecure flag | Dev escape hatch for local-only MCP testing against a self-signed control plane. |
| internal/cli/client.go:39 | certctl --insecure flag | Same shape as the agent flag: local dev only. |
| internal/connector/target/f5/f5.go:128 | F5 BIG-IP iControl REST | The F5 default install ships with a self-signed cert; operators who haven't replaced it use config.Insecure. The connector logs this on every dial, and the operator-facing config documentation notes it. |
| internal/connector/issuer/acme/acme.go:146 | Pebble (ACME test server) | Hard-coded for tests that drive against Pebble locally. Pebble issues self-signed; verifying the chain would defeat the purpose. |
| internal/service/network_scan.go:460 | Network scanner probe | Same rationale as tlsprobe/probe.go above: discovery surfaces broken certs by design. |

What is NOT covered by this list: *_test.go files use InsecureSkipVerify freely against httptest.Server instances; that's a test-fixture pattern, not a production trust decision. The grep guard ignores _test.go.

Related docs

  • upgrade-to-tls.md — one-step cutover from pre-HTTPS releases
  • quickstart.md — docker-compose walkthrough with HTTPS examples
  • test-env.md — integration test environment (also HTTPS-only)
  • security.md — overall security posture, OCSP Must-Staple guidance, encryption-at-rest spec
  • Milestone spec: prompts/https-everywhere-milestone.md (authoritative source for locked decisions)