Commit Graph

12 Commits

Author SHA1 Message Date
shankar0123 3e91c7a1f0 chore(security): bump Go toolchain 1.25.9 -> 1.25.10 + golang.org/x/net 0.49 -> 0.53
CI run #484's Go Build & Test job failed govulncheck (M-024 hard
gate). Six standard-library CVEs land in go1.25.9 + one
golang.org/x/net CVE in v0.49.0; all are fixed in go1.25.10 + x/net
v0.53.0 respectively. The advisories that fired were:

  GO-2026-4986  Quadratic string concat in net/mail.consumeComment
                — called via internal/api/handler/validation.go's
                ValidateCommonName -> mail.ParseAddress
  GO-2026-4977  Quadratic string concat in net/mail.consumePhrase
                — same call site
  GO-2026-4982  Bypass of meta-content URL escaping in html/template
                — called via internal/service/digest.go's
                RenderDigestHTML -> Template.Execute
  GO-2026-4980  Escaper bypass in html/template
                — same call site
  GO-2026-4971  Panic in net.Dial / LookupPort on Windows NUL bytes
                — many call sites (email notifier, SSH connector,
                ACME validators, validation.ValidateSafeURL, ...)
  GO-2026-4918  Infinite loop in net/http2 transport on bad
                SETTINGS_MAX_FRAME_SIZE
                — called via internal/connector/target/f5.go's
                F5Client.Authenticate -> http.Client.Do

Bumps applied:

* `go.mod`: `go 1.25.9` -> `go 1.25.10`; `golang.org/x/net v0.49.0`
  -> `v0.53.0` (kept indirect — the upgrade is force-pulled by the
  module-version directive; transitive deps will pick the higher).
* `.github/workflows/{ci,codeql,release}.yml`: setup-go pin and the
  release.yml `GO_VERSION` env var bumped to 1.25.10. The
  security-deep-scan.yml workflow uses the major-minor `1.25` pin
  which auto-resolves to the latest 1.25.x and is unaffected.
* `Dockerfile` + `Dockerfile.agent`: `golang:1.25-alpine@sha256:5caa...`
  re-pinned to `golang:1.25.10-alpine@sha256:8d22e29d960bc50cd0...`
  (digest looked up against `registry-1.docker.io/v2/library/golang/
  manifests/1.25.10-alpine`; verified by the digest-validity ci-guard).
  The explicit `1.25.10-alpine` tag form replaces the moving
  `1.25-alpine` pin so the image-spec is reproducible end-to-end
  even without the digest reference.
* `deploy/test/f5-mock-icontrol/Dockerfile`: `golang:1.25.9-bookworm
  @sha256:1a14...` re-pinned to `golang:1.25.10-bookworm@sha256:
  e3a54b77385b4f8a31c1...` (looked up the same way).
* `deploy/test/f5-mock-icontrol/go.mod`: `go 1.25.9` -> `go 1.25.10`.
* `internal/api/handler/version.go` + `api/openapi.yaml`: the
  `runtime.Version()`-shape comment + OpenAPI `example: go1.25.9`
  bumped to keep doc/example freshness.
* `docs/contributor/ci-pipeline.md` + `docs/reference/connectors/
  iis.md`: doc-only `Go 1.25.9` -> `Go 1.25.10` references.

Verification done in-tree:

* All `scripts/ci-guards/*.sh` pass locally including
  `digest-validity.sh` (the new digests resolve cleanly against
  Docker Hub).
* `S-1-hardcoded-source-counts.sh` clean (the false-positive on
  "Bundle 1 migrations" was fixed in the prior commit).

Operator step required post-push (sandbox has no Go toolchain):

  cd certctl && go mod tidy

This regenerates go.sum's `golang.org/x/net v0.49.0` h1: lines into
v0.53.0 ones. CI's `go mod tidy && git diff --exit-code go.mod
go.sum` step will catch the drift if missed; in that case run the
command, commit, and push the go.sum-only delta.
2026-05-09 21:35:46 -04:00
shankar0123 12003f5ca5 Bundle A: Container & supply-chain hardening — 3 findings closed; All High closed
Closes H-001 + M-012 + M-014 from comprehensive-audit-2026-04-25.

H-001 (CWE-829) — Container base images SHA-pinned
  Pre-bundle: 5 FROM lines pulled by tag only — registry-side tag
  swap could silently change the build.
  Post-bundle: every FROM pinned to immutable digest fetched live
  from Docker Hub at audit time:
    node:20-alpine@sha256:fb4cd12c85ee03686f6af5362a0b0d56d50c58a04632e6c0fb8363f609372293
    golang:1.25-alpine@sha256:5caaf1cca9dc351e13deafbc3879fd4754801acba8653fa9540cea125d01a71f (x2)
    alpine:3.19@sha256:6baf43584bcb78f2e5847d1de515f23499913ac9f12bdf834811a3145eb11ca1 (x2)
  Dockerfile header comment documents the operator bump procedure
  (quarterly cadence; docker manifest inspect or Hub Registry API).
  CI step Forbidden bare FROM regression guard (H-001) fails build
  if any new FROM lacks @sha256.

M-012 (CWE-250) — Verified-already-clean + USER guard
  Recon found both Dockerfile:75 and Dockerfile.agent:59 already
  carry USER certctl directives; pre-USER RUN calls are build-setup
  steps that legitimately need root, each happening before the
  USER drop.
  CI step Forbidden missing USER regression guard (M-012) greps
  every Dockerfile* for the LAST USER directive; fails build if
  missing OR equals root/0. Future Dockerfile additions must
  preserve the privilege drop.

M-014 — npm ci explicit retry helper
  Pre-bundle Dockerfile:25:
    RUN npm ci --include=dev || npm ci --include=dev && \
        tsc --version && npm run build
  Broken bash precedence: A || (B && C && D) means tsc+build only
  ran on success path of the second npm ci. A transient registry
  blip silently skipped the production step — build would succeed
  with no node_modules + no tsc verification.
  Post-bundle: deterministic 3-attempt retry loop with 5s backoff
  plus explicit [ -d node_modules ] post-check that fails loudly
  if directory wasn't created. Silent failure is now impossible.

Audit deliverables:
  audit-report.md: H-001/M-012/M-014 flipped [x] with closure
    notes; score 49/55 closed (High 9/9 = 100%; Medium 24/27;
    Low 19/19 with L-004 deferred). All High audit findings now
    closed for the first time.
  findings.yaml: 3 status flips
  CHANGELOG.md: Bundle A section

Verification:
  Self-test of both new CI guards locally — PASS for current state
  (every FROM has @sha256; every Dockerfile drops to non-root).
2026-04-27 01:28:38 +00:00
shankar0123 86fffa305a fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)
Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image
shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`.
The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone
(`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS
1.3 pinned), so the probe failed on every interval and Docker marked
the container `unhealthy` indefinitely. Operators inside docker-
compose / Helm / the example stacks were unaffected — compose overrides
the HEALTHCHECK with `--cacert + https://`, Helm uses explicit
`httpGet` probes that ignore Docker's HEALTHCHECK, and every example
compose file overrides with `curl -sfk https://localhost:8443/health`.
But anyone running bare `docker run` / Docker Swarm / Nomad / ECS —
exactly the "I just pulled the published image" path — saw permanent
`unhealthy` status and (depending on orchestrator policy) a restart-
loop. (Audit: cat-u-healthcheck_protocol_mismatch in
coverage-gap-audit-2026-04-24-v5/unified-audit.md.)

Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone
gap, both bundled into this commit because they share the same root
cause and the same operator surface:

  1. Helm chart `server.readinessProbe.httpGet.path` pointed at
     `/readyz`, the kube-flavored convention. The certctl server
     doesn't register `/readyz` (only `/health` and `/ready` are
     wired and bypass the auth middleware — see
     internal/api/router/router.go:81 and cmd/server/main.go:920).
     K8s readiness probes therefore got 401 (api-key auth rejection)
     or 404 (when auth was disabled), pods stayed `NotReady`
     indefinitely, and Helm rollouts stalled.

  2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all,
     so bare-`docker run` agents got zero health signal. The
     compose override at `deploy/docker-compose.yml:173` called
     `pgrep -f certctl-agent` against the agent image, but the
     agent image didn't ship `procps` — pgrep was missing too. The
     compose probe was a latent always-fail.

We fixed all three with the audit-recommended shape (option (a) — `-k`)
plus three structural backstops:

Files changed:

Phase 1 — Dockerfile fix:
- Dockerfile: HEALTHCHECK switched from `curl -f http://localhost:8443/
  health` to `curl -fsk https://localhost:8443/health`. `-k`
  (insecure) is acceptable because the probe is localhost-to-localhost:
  the same process serving the cert is being probed, no network hop.
  Pinning `--cacert` is not viable for the published image because
  the bootstrap cert is per-deploy (generated into the `certs` named
  volume on first up; operator-supplied via Helm's `existingSecret`
  or cert-manager). Long-form docblock cross-references the audit
  closure, the compose vs Helm vs examples coverage matrix, and the
  CI guardrail.
- Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent`
  matching the compose pattern. Added `procps` to the runtime apk
  install — fixes both the new image-level HEALTHCHECK AND the
  pre-existing compose probe that was silently failing.

Phase 2 — Helm readiness probe path:
- deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path
  changed from `/readyz` to `/ready`. Liveness probe path
  (`/health`) was correct and is unchanged. Probes block now carries
  an explanatory comment naming the registered no-auth probe routes
  and the U-2 closure rationale.

Phase 3 — Image-level integration tests:
- deploy/test/healthcheck_test.go (new, //go:build integration):
  TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server
  image, inspects `Config.Healthcheck.Test` via `docker inspect`,
  and asserts the array contains `https://localhost:8443/health` and
  `-k`, and does NOT contain `http://localhost:8443/health`
  (positive + negative regression contracts).
  TestPublishedAgentImage_HealthcheckSpecExists builds the agent image
  and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`.
  Both tests `t.Skip` cleanly when docker isn't available (sandbox /
  CI without docker-in-docker) — verified locally: tests skip with the
  diagnostic and the suite returns PASS.
  TestPublishedServerImage_HealthcheckTransitionsToHealthy is a
  documented `t.Skip` placeholder until the harness wires a sidecar
  postgres for image-level smoke; the spec-level tests above cover the
  audit-flagged regression.

Phase 4 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK
  regression guard (U-2)" step. Scoped patterns catch
  `HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health`
  in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md
  out of scope (the post-cutover invariant string at line 182 is
  intentionally a documented expected-failure assertion). Verified
  locally on the real tree (passes) and against synthetic regressions
  (each fires the guard).

Phase 5 — Docs sweep:
- docs/connectors.md: 15 stale curl examples updated from
  `http://localhost:8443/...` to `https://localhost:8443/...` with
  `--cacert "$CA"` injected on every site. Added a one-time
  introductory note documenting the `$CA` extraction with
  `docker compose ... exec ... cat /etc/certctl/tls/ca.crt`,
  matching the pattern in docs/quickstart.md. Pre-U-2 these examples
  silently failed against the HTTPS listener.

Phase 6 — Release surface:
- CHANGELOG.md: appended U-2 section to the existing [unreleased]
  block (immediately below the G-1 entry). Sections: explanatory
  blockquote covering all three bugs (primary + 2 adjacent), Fixed,
  Added, Changed.

Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go vet -tags integration ./deploy/test/ — clean
- go test -short ./... — every package green
- go test -tags integration -v -run TestPublishedServerImage|TestPublishedAgentImage ./deploy/test/ —
  three tests SKIP cleanly with "docker not available" diagnostic
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds; rendered Deployment carries
  `path: /ready` and zero `/readyz` matches
- python3 yaml.safe_load on api/openapi.yaml — parses
- govulncheck ./... — no vulnerabilities in our code
- CI guardrail mirror: clean on real tree, fires on synthetic
  regression patterns

Out of scope (intentionally untouched):
- cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct,
  this finding does NOT propose adding back a plaintext listener.
- deploy/docker-compose.yml:126 HEALTHCHECK — already correct.
- deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct.
- All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already
  correct (they ALSO use `-fsk https://localhost:8443/health`).
- Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` +
  `path: /health`, correct.
- docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health`
  invariant line — that's the expected-failure assertion for the
  post-cutover state ("plaintext is gone, expect Connection refused");
  intentionally left intact.
- Go production code — this is purely a deploy-image / probe / docs /
  Helm-chart fix.

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-healthcheck_protocol_mismatch
      Audit recommendation followed verbatim: 'change Dockerfile:80
      to CMD curl -kf https://localhost:8443/health'.
2026-04-25 12:02:18 +00:00
shankar0123 672e1d991d build: propagate HTTP_PROXY/HTTPS_PROXY/NO_PROXY through Docker build (M-4, Issue #9)
Addresses Medium finding M-4 in the audit report. The multi-stage
Dockerfiles previously had no ARG declarations for HTTP_PROXY,
HTTPS_PROXY, or NO_PROXY, so corporate-proxy environments silently
failed at 'npm ci' (frontend stage) and 'go mod download' (Go builder).
The npm retry idiom (`npm ci --include=dev || npm ci --include=dev`)
masked the failure because the upstream 'Exit handler never called!'
bug exits 0 despite the install crash.

Fix: thread HTTP_PROXY / HTTPS_PROXY / NO_PROXY ARGs through every
Docker build stage that performs network I/O, re-export them as ENV
with both upper- and lower-case aliases (apk/curl/npm read lowercase;
Go/Node read uppercase), and forward the host shell's environment via
`build.args:` in every compose file and `build-args:` in the release
workflow's docker/build-push-action steps. Defaults are empty strings
so un-proxied builds remain byte-identical to the pre-fix tree.

Scope: Dockerfile (frontend + Go builder stages), Dockerfile.agent
(Go builder stage), deploy/docker-compose.yml (server + agent),
deploy/docker-compose.dev.yml (server + agent), deploy/docker-compose.test.yml
(server + agent), .github/workflows/release.yml (both docker/build-push-action
v6 invocations). Zero Go, web, test, or runtime code changes. Zero
base-image changes. Existing npm `||` retry idiom and `ARG TARGETARCH`
preserved verbatim.

CWE-1173 (Improper Use of Validated Input) / CWE-16 (Configuration).

Verification:
- YAML parses clean across all four compose files and release.yml.
- yamllint -d relaxed: clean exit across all five YAML files.
- All six `build.args:` blocks expose HTTP_PROXY, HTTPS_PROXY, NO_PROXY
  with default-empty ${VAR:-} substitution.
- Both release.yml docker/build-push-action steps expose the same
  three keys sourced from ${{ secrets.HTTP_PROXY }}, etc.
- Dockerfiles contain 5 proxy ARG declarations total (Dockerfile has 2
  stages × 3 ARGs = 6 lines, Dockerfile.agent has 1 stage × 3 ARGs = 3
  lines); lowercase ENV aliases verified present in every stage.
- git diff --shortstat: 6 files changed, 117 insertions(+), 0 deletions.
  Pure additive.

Docker-live verification (`docker build`, `docker compose config`)
deferred to CI / post-commit smoke because the sandbox has no Docker
runtime. hadolint, go, golangci-lint, govulncheck likewise unavailable
in the sandbox; per-layer CI coverage gates (service 55%, handler 60%,
domain 40%, middleware 30%) are trivially unaffected as M-4 touches
zero Go source files.
2026-04-17 03:12:45 +00:00
shankar0123 1f6cf0eafa fix: add npm ci retry and install verification for proxy environments (#9)
npm has a known bug where `npm ci` can crash with "Exit handler never
called!" behind corporate proxies yet exit with code 0. This adds a
single retry on failure and verifies tsc is actually installed before
proceeding to build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 11:21:47 -04:00
shankar0123 cc6eec3608 fix: merge npm install + build into single Docker layer (#9)
The previous fix (--include=dev) was necessary but insufficient. The
real issue is that node_modules created by npm ci in one layer can be
lost when COPY web/ . creates the next layer — depending on the Docker
storage driver (fuse-overlayfs, vfs). Merging install and build into a
single RUN eliminates the layer boundary entirely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 10:52:50 -04:00
shankar0123 86fb140414 fix: ensure devDependencies install in Docker build (#9)
npm ci skips devDependencies when NODE_ENV=production leaks from the
host environment into the Docker build. This breaks the frontend stage
because typescript and vite are devDependencies. Adding --include=dev
makes the install hermetic regardless of host environment.

Closes #9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 10:00:06 -04:00
shankar0123 4049dc8c7f fix: bump Docker Go version from 1.22 to 1.25 to match go.mod
go.mod requires go >= 1.25.0 but both Dockerfiles used golang:1.22-alpine,
causing `go mod download` to fail during container build.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-24 12:03:36 -04:00
shankar0123 169516f0fb fix: add frontend build stage to Dockerfile
The Dockerfile was copying raw web/ source files but never building the
frontend. Since .gitignore excludes web/dist/, the Docker image had no
built frontend — only the Vite dev entry point (web/index.html) which
references /src/main.tsx and only works with the Vite dev server. This
caused a blank page when accessing the dashboard.

Fix: Add a Node.js build stage that runs npm ci && npm run build, then
copy only web/dist/ into the final image. Also add web/node_modules and
web/dist to .dockerignore to keep the build context clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-16 14:16:55 -04:00
shankar0123 9b4122b159 Fix runtime bugs, implement service layer, and overhaul documentation
Runtime fixes:
- Fix env var mismatch (CERTCTL_DB_URL → CERTCTL_DATABASE_URL)
- Fix table name mismatches (certificates → managed_certificates, notifications → notification_events)
- Add renewal_policy_id to certificate queries
- Remove non-existent created_at from notification queries
- Add env var fallback for agent CLI flags
- Graceful degradation for missing notifiers/issuers in demo mode
- Copy web/ directory in Dockerfile for dashboard serving

Service layer:
- Implement handler-service interface pattern across all services
- Wire up certificate, agent, job, policy, team, owner, audit, notification services

Documentation:
- Add concepts.md: beginner-friendly guide to TLS, CAs, private keys
- Rewrite quickstart.md with accurate API examples matching actual handlers
- Add demo-advanced.md: interactive demo with cert issuance and automated script
- Update architecture.md with correct table names and connector interfaces
- Update connectors.md to match actual Go interface signatures
- Update demo-guide.md with cross-references to new docs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-14 21:38:11 -04:00
shankar0123 3a9fe8ba37 Complete V1 scaffold 2026-03-14 20:01:53 -04:00
shankar0123 d395776a95 Initial scaffold: certificate control plane v0.1.0 2026-03-14 08:22:17 -04:00