docs: archive version-specific upgrade guides

upgrade-to-tls.md and upgrade-to-v2-jwt-removal.md are version-specific runbooks for past releases. Late upgraders still need them; current operators don't. Move both to docs/archive/upgrades/ with one-line archive headers pointing readers at the current canonical docs.

Renames:
  docs/upgrade-to-tls.md → docs/archive/upgrades/to-tls-v2.2.md
  docs/upgrade-to-v2-jwt-removal.md → docs/archive/upgrades/to-v2-jwt-removal.md

Each gets a top-of-doc archive notice with the date and a forward pointer to the relevant steady-state doc:
  to-tls-v2.2.md → docs/operator/tls.md
  to-v2-jwt-removal.md → docs/operator/security.md

The relative link inside to-v2-jwt-removal.md (was "upgrade-to-tls.md", now "to-tls-v2.2.md") is updated to point at its archived sibling. Cross-reference updates from other docs and README are still pending in Phase 11.
2026-06-09 13:08:53 +00:00 · 2026-05-05 02:50:14 +00:00
parent b375df767e
commit 32009cf7c8
2 changed files with 15 additions and 2 deletions
@@ -0,0 +1,200 @@
+# Upgrading to HTTPS-Everywhere (v2.2)
+
+> **Archived 2026-05-05.** This upgrade guide applies to certctl < v2.2.
+> Current operators on v2.2+ already have HTTPS-only control planes and
+> don't need this procedure. For the steady-state TLS reference, see
+> [`docs/operator/tls.md`](../../operator/tls.md). Preserved here for
+> late upgraders coming off pre-v2.2 releases.
+
+certctl's control plane is HTTPS-only as of v2.2. There is no `http` mode, no `auto` mode, no dual-listener bind, no N-release migration window. The cutover is a single step. Out-of-date agents that still point at `http://…` fail at the TCP/TLS handshake layer on first connect after the upgrade and stay `Offline` in the dashboard until their env block is updated and the fleet is rolled.
+
+This doc walks operators through the cutover for the two shipped deployment topologies — docker-compose and Helm — and documents the failure modes and rollback posture explicitly.
+
+For the deep-dive on cert provisioning patterns, SIGHUP cert reload, and client-side CA-trust configuration, read [`tls.md`](../../operator/tls.md). This doc is the narrow "how do I upgrade" procedure.
+
+## Preconditions
+
+Before you start, confirm:
+
+- **Shell access** to the server host and every agent host. The cutover requires you to restart the server and update every agent's env block.
+- **A cert+key source** for the server. Pick one:
+  - An internal CA that can issue a server cert (CN + SAN list covering every hostname / IP agents dial).
+  - A `cert-manager` install in the target Kubernetes cluster, plus a `ClusterIssuer` or `Issuer` you're willing to reference.
+  - Willingness to use the self-signed bootstrap that the shipped `deploy/docker-compose.yml` generates automatically. This is the right choice for dev and demo; it is the wrong choice for production.
+- **A maintenance window.** Out-of-date agents break at the TLS handshake and stay offline until rolled. Schedule the upgrade so the agent fleet can be updated in the same window as the server.
+- **Backups.** This is a one-way door (see the Rollback section below). Snapshot your PostgreSQL database before `docker compose down` or `helm upgrade`.
+
+There is no schema migration tied to this release; the only at-rest state that changes is the `certs` named volume (docker-compose) or the `tls.crt`/`tls.key` Secret (Helm).
+
+## Procedure — docker-compose operators
+
+The shipped `deploy/docker-compose.yml` includes a `certctl-tls-init` init container that self-signs an ECDSA-P256 (SHA-256 signature) cert on first boot and drops `server.crt`, `server.key`, and `ca.crt` into a named volume mounted read-only at `/etc/certctl/tls/` on the server and agent containers. No manual cert provisioning is required for the default stack. (Pre-v2.0.48 this was an ed25519 cert; see [`tls.md`](../../operator/tls.md) Pattern 1 for the rationale and the `down -v && up --build` migration note.)
+
+1. **Pull the HTTPS-everywhere release.** From the repo root:
+
+   ```
+   git pull
+   ```
+
+   Confirm you're on a tag or `master` that contains the `certctl-tls-init` service in `deploy/docker-compose.yml`. Grep for it: `grep certctl-tls-init deploy/docker-compose.yml` should hit.
+
+2. **Stop the old plaintext cluster.**
+
+   ```
+   docker compose -f deploy/docker-compose.yml down
+   ```
+
+   Do not pass `-v`; keeping the PostgreSQL volume preserves your cert inventory, audit trail, and job history across the upgrade.
+
+3. **Bring the cluster back up with the HTTPS build.**
+
+   ```
+   docker compose -f deploy/docker-compose.yml up -d --build
+   ```
+
+   The `certctl-tls-init` service runs once, generates the self-signed cert into the `certs` volume, and exits with code 0. The server container waits for `certctl-tls-init` via `depends_on: { condition: service_completed_successfully }` and only starts once the cert material is on disk. The server's Docker healthcheck now uses `curl --cacert /etc/certctl/tls/ca.crt -f https://localhost:8443/health`, so the container only becomes healthy once the HTTPS listener is up and serving the bundled cert correctly.
+
+4. **Verify the HTTPS endpoint from the host.**
+
+   ```
+   docker compose -f deploy/docker-compose.yml exec -T certctl-server cat /etc/certctl/tls/ca.crt > /tmp/certctl-ca.crt
+   curl --cacert /tmp/certctl-ca.crt https://localhost:8443/health
+   ```
+
+   Expect `{"status":"ok"}` with HTTP 200. If you get a TLS verification error, the CA bundle wasn't read correctly: `--cacert` takes a file path, not certificate contents, so save the `exec -T` output to a local file and pass that path. If you get `connection refused`, the server never finished startup; check `docker compose logs certctl-server` for a fail-loud preflight diagnostic pointing at `docs/tls.md`.
+
+5. **Confirm the bundled agent reconnects.** Agents inside the compose stack pick up the new URL (`CERTCTL_SERVER_URL=https://certctl-server:8443`) and the bundled CA (`CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/tls/ca.crt`) from their env block automatically — no per-agent change needed. Tail the agent log:
+
+   ```
+   docker compose -f deploy/docker-compose.yml logs -f certctl-agent
+   ```
+
+   You should see `heartbeat sent` within 30 seconds. In the dashboard (`https://localhost:8443`), the agent should show as `Online`.
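+
+The startup ordering in step 3 can be pictured with a compose fragment like the one below. This is a minimal sketch rather than a copy of the shipped file: the service names, the `certs` volume, the mount path, and the healthcheck command come from the text above, while the `interval` and `retries` values are assumptions chosen for illustration.
+
+```yaml
+services:
+  certctl-server:
+    depends_on:
+      certctl-tls-init:
+        # the server starts only after the init service exits with code 0
+        condition: service_completed_successfully
+    volumes:
+      - certs:/etc/certctl/tls:ro   # cert material written by certctl-tls-init
+    healthcheck:
+      test: ["CMD", "curl", "--cacert", "/etc/certctl/tls/ca.crt", "-f", "https://localhost:8443/health"]
+      interval: 10s   # assumed value
+      retries: 5      # assumed value
+
+volumes:
+  certs:
+```
+
+If the server container never turns healthy, checking that `certctl-tls-init` exited with code 0 is a reasonable first step.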
+
+**External agents** running outside the compose network (e.g., the `install-agent.sh`-installed systemd service on a separate host) need their env block updated manually before the cutover — see the Agent env block section below.
+
+## Procedure — Helm operators
+
+The Helm chart does not self-sign. It refuses to render (`helm template` exits non-zero) unless you configure one of two cert sources: an operator-supplied Secret, or a cert-manager `Certificate` CR. See [`tls.md`](../../operator/tls.md) for the full pattern catalog.
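+
+The two cert sources map onto a handful of chart values, which can be kept in a values file instead of `--set` flags. The fragment below is a sketch that only restates the value paths used in the steps that follow; the file name is hypothetical, and the chart's `values.yaml` remains the authority on the exact schema.
+
+```yaml
+# values-tls.yaml (hypothetical file name)
+server:
+  tls:
+    # Source 1: operator-supplied kubernetes.io/tls Secret
+    existingSecret: certctl-server-tls
+    # Source 2: cert-manager (configure this instead of existingSecret)
+    # certManager:
+    #   enabled: true
+    #   issuerRef:
+    #     name: my-cluster-issuer
+    #     kind: ClusterIssuer
+```
+
+Passing the file with `helm upgrade ... -f values-tls.yaml` keeps the chosen cert source under version control.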
+
+1. **Provision cert material.** Pick one of:
+
+   - **Operator-supplied Secret.** Issue a cert from your internal CA (or any other source) and load it into a `kubernetes.io/tls` Secret in the certctl namespace:
+
+     ```
+     kubectl create secret tls certctl-server-tls \
+       --cert=server.crt --key=server.key \
+       --namespace certctl
+     ```
+
+   - **cert-manager.** Set `server.tls.certManager.enabled=true` on the upgrade and reference an existing `ClusterIssuer` or `Issuer`:
+
+     ```
+     --set server.tls.certManager.enabled=true \
+     --set server.tls.certManager.issuerRef.name=my-cluster-issuer \
+     --set server.tls.certManager.issuerRef.kind=ClusterIssuer
+     ```
+
+2. **Upgrade the release.**
+
+   ```
+   helm upgrade certctl deploy/helm/certctl \
+     --namespace certctl \
+     --set server.tls.existingSecret=certctl-server-tls
+   ```
+
+   (Or the `certManager` variant.) If you omit both `server.tls.existingSecret` and `server.tls.certManager.enabled`, the chart fails at render time with a diagnostic pointing at `docs/tls.md`. That guard exists precisely so you catch the missing config at `helm upgrade` time, not at pod-crash-loop time.
+
+3. **Verify the HTTPS endpoint through a port-forward.** Forward the Service port to your workstation and curl with the CA bundle:
+
+   ```
+   kubectl port-forward -n certctl svc/certctl-server 8443:8443 &
+   kubectl get secret -n certctl certctl-server-tls -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/certctl-ca.crt
+   curl --cacert /tmp/certctl-ca.crt https://localhost:8443/health
+   ```
+
+   Expect `{"status":"ok"}`. If the Secret does not contain a `ca.crt` key (operator-supplied Secrets often don't), use `tls.crt` as the bundle instead — for a self-signed cert the two files are identical, and for a cert chained to an internal CA you should separately distribute the root CA bundle via ConfigMap or mounted file.
+
+4. **Update every agent manifest.** Agents outside this Helm release (or in a separately-managed DaemonSet) need their env block updated:
+
+   ```
+   - name: CERTCTL_SERVER_URL
+     value: "https://certctl-server.certctl.svc.cluster.local:8443"
+   - name: CERTCTL_SERVER_CA_BUNDLE_PATH
+     value: "/etc/certctl/tls/ca.crt"
+   ```
+
+   Mount the server's Secret (or a separate CA-bundle Secret / ConfigMap) at `/etc/certctl/tls/` as a read-only volume. If you bundle the agent via the shipped Helm chart's DaemonSet, the wiring is already done — set `agent.enabled=true` and the chart mounts the same Secret.
+
+5. **Roll the agent DaemonSet.**
+
+   ```
+   kubectl rollout restart ds/certctl-agent -n certctl
+   kubectl rollout status ds/certctl-agent -n certctl
+   ```
+
+   Every agent pod restarts with the new URL + CA bundle and reconnects on HTTPS. The dashboard shows agents flip from `Offline` to `Online` as pods finish rolling.
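+
+For agents managed outside the chart, the mount described in step 4 might look like the pod-spec fragment below. It is a sketch under assumptions: the container name `certctl-agent` and the volume name `certctl-ca` are hypothetical, the Secret name matches the `certctl-server-tls` example from step 1, and projecting only `ca.crt` assumes that key exists in the Secret (operator-supplied Secrets often lack it, as noted in step 3).
+
+```yaml
+spec:
+  template:
+    spec:
+      containers:
+        - name: certctl-agent            # hypothetical container name
+          env:
+            - name: CERTCTL_SERVER_URL
+              value: "https://certctl-server.certctl.svc.cluster.local:8443"
+            - name: CERTCTL_SERVER_CA_BUNDLE_PATH
+              value: "/etc/certctl/tls/ca.crt"
+          volumeMounts:
+            - name: certctl-ca           # hypothetical volume name
+              mountPath: /etc/certctl/tls
+              readOnly: true
+      volumes:
+        - name: certctl-ca
+          secret:
+            secretName: certctl-server-tls
+            items:
+              - key: ca.crt              # expose only the CA, not tls.key
+                path: ca.crt
+```
+
+Projecting a single key keeps the server's private key off agent pods; a dedicated CA-bundle ConfigMap achieves the same separation.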
+
+## Agent env block — external hosts
+
+Agents installed on bare-metal or VM hosts via `install-agent.sh` (systemd on Linux, launchd on macOS) read config from `/etc/certctl/agent.env` (Linux) or `~/Library/Application Support/certctl/agent.env` (macOS). On cutover, append or update:
+
+```
+CERTCTL_SERVER_URL=https://certctl.example.com:8443
+CERTCTL_SERVER_CA_BUNDLE_PATH=/etc/certctl/tls/ca.crt
+# CERTCTL_SERVER_TLS_INSECURE_SKIP_VERIFY=false    # Dev only. Never set to true in production.
+```
+
+Distribute the CA bundle (the same `ca.crt` the server holds, or the root chain if you issued the server cert from an intermediate) to every agent host. The path under `CERTCTL_SERVER_CA_BUNDLE_PATH` must be readable by the UID the agent service runs as.
+
+Restart the service after editing:
+
+- Linux: `systemctl restart certctl-agent`
+- macOS: `launchctl kickstart -k system/com.certctl.agent`
+
+The agent refuses to start on an `http://` URL and exits with a pre-flight diagnostic that names this doc. That rejection happens before any network call — no spurious half-connected state.
+
+## Failure mode
+
+Out-of-date agents still configured with `CERTCTL_SERVER_URL=http://…` fail on first reconnect after the cutover. The failure surfaces as one of:
+
+- `dial tcp …: connect: connection refused`: the server is no longer listening on a plaintext port. The new release binds only a TLS listener, so a connection attempt to the old plaintext port is refused by the kernel because nothing is listening there.
+- `tls: first record does not look like a TLS handshake`: depending on timing and proxy layers (e.g., a load balancer that accepts the TCP connection before forwarding), the client may complete the TCP handshake, send a plaintext HTTP request line, and have the server's TLS stack reject it.
+
+Agents in this state surface as `Offline` in the dashboard. They stay offline until their env block is updated and the service restarts. There is no graceful 400-with-migration-URL response because there is no HTTP listener to serve one from — the entire plaintext call path is removed by design.
+
+If you see an unexpected agent stay `Offline` past the cutover window, SSH to the host and check the agent log. On a systemd host:
+
+```
+journalctl -u certctl-agent -n 100
+```
+
+Look for `URL scheme "http" is not supported: HTTPS-only control plane refuses to start (see docs/upgrade-to-tls.md)`. That's the pre-flight rejection. Update `CERTCTL_SERVER_URL`, restart the service, and the agent reconnects.
+
+## Rollback
+
+**There is no rollback window.** The upgrade is a one-way door. The rationale lives in §3.7 of `prompts/https-everywhere-milestone.md`: a cert-lifecycle product that bridges back to plaintext after committing to HTTPS is advertising that its own security posture is negotiable.
+
+If you are not ready to run HTTPS, or need to back out after upgrading, you have two options:
+
+1. **Stay on the pre-HTTPS release.** Do not upgrade until you are ready to run HTTPS on the control plane. Pin your `docker-compose.yml` or `helm upgrade` command to the last pre-v2.2 tag.
+2. **Roll back the release.** `helm rollback certctl <previous-revision>` or `git checkout <previous-tag> && docker compose up -d --build`. This rolls back the server, the compose topology, and the Helm chart in lockstep. Your PostgreSQL volume (cert inventory, audit trail, jobs) survives the rollback; nothing in this milestone changes the database schema.
+
+Option 2 drops you back to the plaintext world. It should be treated as an emergency measure, not a supported migration path.
+
+## After the cutover
+
+Once every agent is `Online`, confirm a few invariants:
+
+- `curl -sS -o /dev/null -w "%{http_code}\n" http://localhost:8443/health` does not return `200`. Only the TLS listener owns the port, so the plaintext request is rejected before it reaches any handler (expect `400` or `000`, depending on how the TLS stack answers). Plaintext is gone.
+- `openssl s_client -connect localhost:8443 -tls1_2 </dev/null` fails the handshake. TLS 1.2 is rejected.
+- `openssl s_client -connect localhost:8443 -tls1_3 </dev/null` succeeds and prints the server certificate. TLS 1.3 is live. To inspect the SAN list, pipe the output through `openssl x509 -noout -ext subjectAltName`.
+- A cert rotation test: overwrite the server cert on disk, `kill -HUP` the server PID, confirm the new cert serves on the next `openssl s_client -connect … -showcerts` without a process restart. See the SIGHUP section in [`tls.md`](../../operator/tls.md).
+
+Update your runbooks. Every `http://certctl.example.com` URL in internal documentation, monitoring config, and on-call playbooks should become `https://certctl.example.com` plus a CA-trust note.
+
+## Related docs
+
+- [`tls.md`](../../operator/tls.md): cert provisioning patterns, SIGHUP rotation, troubleshooting
+- [`quickstart.md`](quickstart.md): docker-compose walkthrough (post-HTTPS)
+- [`test-env.md`](test-env.md): integration test environment (HTTPS-only)
+- Milestone spec: `prompts/https-everywhere-milestone.md`
@@ -0,0 +1,162 @@
+# Upgrading past G-1 — `CERTCTL_AUTH_TYPE=jwt` removal
+
+> **Archived 2026-05-05.** This upgrade guide applies to operators
+> upgrading past the G-1 milestone (the `CERTCTL_AUTH_TYPE=jwt` removal).
+> Current operators on post-G-1 releases don't need this. For the
+> steady-state security posture reference, see
+> [`docs/operator/security.md`](../../operator/security.md). Preserved
+> here for late upgraders.
+
+If your certctl deployment currently sets `CERTCTL_AUTH_TYPE=jwt` (or `server.auth.type=jwt` in Helm), the next certctl upgrade will fail-fast at startup with a dedicated diagnostic. This guide explains why, what to switch to, and how to keep JWT/OIDC at your edge.
+
+For everyone else — operators running `api-key` or `none` — this upgrade is a no-op. Skip to [`to-tls-v2.2.md`](to-tls-v2.2.md) for the v2.2 HTTPS-everywhere migration if you haven't done that one yet.
+
+## Why we removed it
+
+Pre-G-1, the config validator at `internal/config/config.go` accepted three values for `CERTCTL_AUTH_TYPE`: `api-key`, `jwt`, and `none`. The startup log line at `cmd/server/main.go` faithfully echoed `"authentication enabled" "type"="jwt"` when an operator picked `jwt`. Reasonable people read that and concluded JWT auth was on.
+
+It wasn't. Grep `internal/ cmd/` for `NewJWT`, `JWTMiddleware`, or `jwt.Parse` — pre-G-1, there were zero matches in production code. The auth-middleware wiring at `cmd/server/main.go:653` unconditionally called `middleware.NewAuthWithNamedKeys(namedKeys)` regardless of `cfg.Auth.Type`. So `CERTCTL_AUTH_TYPE=jwt` just routed every request through the api-key bearer middleware, comparing the incoming `Authorization: Bearer <something>` against whatever string the operator put in `CERTCTL_AUTH_SECRET`. Real JWT clients got 401 (the api-key middleware saw the JWT string as a literal token and compared bytes). Operators who treated `CERTCTL_AUTH_SECRET` as a JWT signing secret (and therefore handled it less carefully than an api-key) handed an attacker an api-key. Silent auth downgrade — a security finding masquerading as a config option.
+
+We chose to remove the option rather than implement JWT middleware. Implementing real JWT/OIDC requires a key-source decision (JWKS versus a static secret) with rotation and key-rollover semantics, claim mapping (which claim is the actor, and which is the admin flag?), expiry enforcement, audience and issuer validation, and regression coverage at the same depth as the existing api-key path. That's a feature, not a fix. The audit-recommended structural fix, and the one that actually closes the hazard, is to fail loudly instead of silently downgrading.
+
+## What changes at startup
+
+Post-G-1, a binary started with `CERTCTL_AUTH_TYPE=jwt` exits non-zero before opening the listener:
+
+```
+Failed to load configuration: CERTCTL_AUTH_TYPE=jwt is no longer accepted
+(G-1 silent auth downgrade): no JWT middleware ships with certctl. To use
+JWT/OIDC, run an authenticating gateway (oauth2-proxy / Envoy ext_authz /
+Traefik ForwardAuth / Pomerium) in front of certctl and set
+CERTCTL_AUTH_TYPE=none on the upstream. See docs/architecture.md
+"Authenticating-gateway pattern" and docs/upgrade-to-v2-jwt-removal.md
+for the migration walkthrough
+```
+
+Helm operators get the same shape at `helm install` / `helm upgrade` template time: `server.auth.type=jwt` is rejected by the chart's `certctl.validateAuthType` template helper before any Kubernetes object is rendered.
+
+The CI-side regression guard at `.github/workflows/ci.yml` blocks any future PR that re-introduces `"jwt"` as an auth-type literal in production code or spec.
+
+## Recovery — pick one
+
+### Option A — switch to `api-key` (you weren't actually using JWT)
+
+If your `CERTCTL_AUTH_SECRET` was a single high-entropy token and your clients sent it as `Authorization: Bearer <token>`, you were already using api-key auth — you just had `CERTCTL_AUTH_TYPE` set to the wrong string. Flip it:
+
+```
+# .env (docker-compose)
+CERTCTL_AUTH_TYPE=api-key
+CERTCTL_AUTH_SECRET=<your-existing-token>
+```
+
+```
+# Helm
+helm upgrade <release> deploy/helm/certctl/ \
+  --reuse-values \
+  --set server.auth.type=api-key \
+  --set server.auth.apiKey=<your-existing-token>
+```
+
+No client changes needed — the same Bearer token continues to work. The startup log will now read `"authentication enabled" "type"="api-key"`, which matches what was actually happening pre-G-1.
+
+### Option B — front certctl with an authenticating gateway
+
+If you genuinely need JWT, OIDC, mTLS, or SAML, run an authenticating gateway in front of certctl and let the gateway terminate the federated identity protocol. Configure certctl for `CERTCTL_AUTH_TYPE=none`:
+
+```
+CERTCTL_AUTH_TYPE=none
+```
+
+Then put an oauth2-proxy / Envoy `ext_authz` / Traefik `ForwardAuth` / Pomerium / Authelia (etc.) in the network path between operators and certctl. The gateway validates the identity and proxies the authenticated request to certctl as a same-origin call on a private network.
+
+### Concrete walkthrough — oauth2-proxy + certctl on docker-compose
+
+This is the simplest production-grade JWT/OIDC shape. It assumes you have an OIDC provider (Okta, Auth0, Google Workspace, Keycloak, Dex) and a registered client_id / client_secret.
+
+```yaml
+# deploy/docker-compose.gateway.yml — overlay on the base compose file
+services:
+  oauth2-proxy:
+    image: quay.io/oauth2-proxy/oauth2-proxy:latest
+    command:
+      - --provider=oidc
+      - --oidc-issuer-url=https://<your-issuer>/
+      - --client-id=${OIDC_CLIENT_ID}
+      - --client-secret=${OIDC_CLIENT_SECRET}
+      - --cookie-secret=${OAUTH2_PROXY_COOKIE_SECRET}  # openssl rand -base64 32
+      - --upstream=https://certctl-server:8443  # internal-network only; certctl is HTTPS-only on 8443, so the proxy must trust its CA
+      - --http-address=0.0.0.0:4180
+      - --email-domain=*
+      - --pass-access-token=true
+      - --pass-authorization-header=true
+      - --set-authorization-header=true       # forwards a bearer token upstream
+      - --skip-provider-button=true
+      - --reverse-proxy=true
+    ports:
+      - "443:4180"
+    depends_on:
+      - certctl-server
+    networks:
+      - certctl-network
+
+  certctl-server:
+    environment:
+      CERTCTL_AUTH_TYPE: none   # gateway terminates auth; see docs/archive/upgrades/to-v2-jwt-removal.md
+      # ... rest of the certctl env block unchanged
+```
+
+Operators hit `https://<your-host>/`, get redirected through the OIDC provider, land back at oauth2-proxy with a session cookie, and oauth2-proxy proxies their request to certctl on the internal Docker network. As written, the overlay publishes oauth2-proxy's plain-HTTP listener (`--http-address=0.0.0.0:4180`) on host port 443, so terminate TLS at oauth2-proxy or at a load balancer in front of it for that URL to work. certctl itself is HTTPS-only on `:8443` (TLS 1.3, see [`tls.md`](../../operator/tls.md)), but operator browsers never see that hop directly. Bind certctl-server's `:8443` to the internal Docker network only; do NOT publish it to the host. The audit trail will record the actor as the gateway-forwarded identity if you also configure a small bearer-token-mapping shim at the gateway (most production deployments do this with a per-user api-key issued by the gateway after OIDC validation).
+
+### Traefik ForwardAuth pattern (Kubernetes)
+
+Same shape, Kubernetes-flavored:
+
+```yaml
+apiVersion: traefik.io/v1alpha1
+kind: Middleware
+metadata:
+  name: oidc-forward-auth
+spec:
+  forwardAuth:
+    address: http://oauth2-proxy.auth.svc.cluster.local:4180
+    trustForwardHeader: true
+    authResponseHeaders:
+      - X-Auth-Request-User
+      - X-Auth-Request-Email
+      - Authorization
+---
+apiVersion: traefik.io/v1alpha1
+kind: IngressRoute
+metadata:
+  name: certctl
+spec:
+  routes:
+    - match: Host(`certctl.example.com`)
+      kind: Rule
+      middlewares:
+        - name: oidc-forward-auth
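+      services:
+        - name: certctl-server
+          port: 8443
+          scheme: https   # certctl is HTTPS-only; Traefik must also trust certctl's CA (e.g. via a ServersTransport)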
+```
+
+The certctl Helm release runs with `server.auth.type=none`. The Traefik IngressRoute attaches `oidc-forward-auth` as a middleware so every request is OIDC-validated by oauth2-proxy before reaching certctl.
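+
+In values-file form, the certctl side of this pattern is a single key. A minimal sketch, assuming the `server.auth.type` value path used elsewhere in this guide; the file name is hypothetical.
+
+```yaml
+# values-gateway.yaml (hypothetical file name)
+server:
+  auth:
+    type: none   # Traefik + oauth2-proxy authenticate requests before they reach certctl
+```
+
+Because certctl accepts every request it receives in this mode, the Service should stay reachable only through the IngressRoute.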
+
+### Envoy `ext_authz` pattern
+
+For service-mesh deployments (Istio, Consul, plain Envoy), the `ext_authz` filter calls out to an external authorization service per-request. Same outcome: certctl runs `CERTCTL_AUTH_TYPE=none` and Envoy + your authz service handle JWT/OIDC/mTLS at the mesh edge. See the [Envoy ext_authz docs](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_authz_filter) for the configuration surface.
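+
+As a rough illustration of that surface, an HTTP filter chain that calls an external authorization service before routing to certctl could look like the fragment below. It is a sketch, not a tested mesh config: the authz service address `authz.auth.svc.cluster.local:9000` and the cluster name `ext_authz` are hypothetical, and the referenced cluster has to be defined elsewhere in the Envoy bootstrap.
+
+```yaml
+http_filters:
+  - name: envoy.filters.http.ext_authz
+    typed_config:
+      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
+      http_service:
+        server_uri:
+          uri: http://authz.auth.svc.cluster.local:9000   # hypothetical authz service
+          cluster: ext_authz                              # hypothetical cluster name
+          timeout: 0.5s
+        authorization_response:
+          allowed_upstream_headers:
+            patterns:
+              - exact: authorization   # pass the validated identity header upstream
+      failure_mode_allow: false        # deny requests when the authz service is unreachable
+  - name: envoy.filters.http.router
+    typed_config:
+      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
+```
+
+Keeping `failure_mode_allow` at `false` matters here: certctl runs with `CERTCTL_AUTH_TYPE=none`, so a fail-open filter would leave the API unauthenticated during an authz outage.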
+
+## Rollback
+
+Pre-G-1 binaries silently accepted `CERTCTL_AUTH_TYPE=jwt` and routed through the api-key middleware. Downgrading the binary is the only mechanical rollback path, and it puts you back into the silent-downgrade state — which is exactly what the G-1 audit finding is about. We don't recommend it. If something is forcing your hand, capture the operational issue you're hitting and open a GitHub issue against the certctl repo with the SHAs involved; the Authenticating-gateway pattern was specifically designed to cover the use cases that historically led operators to set `CERTCTL_AUTH_TYPE=jwt`.
+
+There is no on-disk state that changes with this upgrade — no migrations to roll back, no encrypted config to re-encode, no certificates to re-issue. The change is entirely in the config-validation surface and the helm-chart template guard.
+
+## Cross-references
+
+- [`architecture.md`](architecture.md): "Authenticating-gateway pattern (JWT, OIDC, mTLS)" section.
+- [`tls.md`](../../operator/tls.md): TLS provisioning patterns. The gateway proxying to certctl-server still needs to trust certctl's TLS cert; same patterns apply.
+- [`../../../deploy/helm/certctl/README.md`](../../../deploy/helm/certctl/README.md): Helm-chart-flavored guidance.
+- `internal/config/config.go::ValidAuthTypes`: the single source of truth for what's accepted post-G-1.
+- `internal/repository/postgres/db.go::wrapPingError`: not part of this change, but the same pattern for diagnosing operator misconfiguration at runtime.
+- `coverage-gap-audit-2026-04-24-v5/unified-audit.md`: the audit finding (`cat-g-jwt_silent_auth_downgrade`).