shankar0123 af47d19ae2 fix(deploy,examples,env): close U-1 trap end-to-end across Helm, examples, and root env
Follow-up to cfc234e (U-1 docker-compose fix) — closes the remaining adjacent
code paths that share the postgres-first-boot-password-binding root cause but
were scoped out of the original commit.

The runtime diagnostic in internal/repository/postgres/db.go::wrapPingError
(landed in a911970) already covers every NewDB call site, so Helm operators
and example users hit the SQLSTATE 28P01 guidance for free at startup. What
was missing: deployment-shape-specific remediation guidance (kubectl vs
docker-compose), the hardcoded password in the *root* .env.example, and
shared ops notes for the 5 examples/ compose files. This commit closes all
three.

Files changed:

- .env.example (root) — line 16 had `postgres://certctl:certctl@...` with
  the password hardcoded literally instead of interpolating POSTGRES_PASSWORD.
  If a user copied this file as their .env (binary-direct deployment,
  not docker-compose) and rotated POSTGRES_PASSWORD on line 10, the URL on
  line 16 still carried 'certctl' — silent two-line drift. Replaced 'certctl'
  with the same default that line 10 carries ('change-me-in-production') and
  added an explanatory comment block describing the docker-compose
  override semantics, when this URL matters (binary-direct), and the
  cross-reference to the U-1 wrapPingError diagnostic. Also fixed an
  adjacent bug: line 31 CERTCTL_SERVER_URL was `http://localhost:8443`,
  which agents reject at startup since v2.2 (HTTPS-everywhere milestone made
  the control plane HTTPS-only with TLS 1.3 pinned). Updated to https://
  with a comment pointing operators at the bootstrap CA bundle.

- deploy/helm/certctl/values.yaml — postgresql.auth.password field had a
  one-line 'REQUIRED' comment. Expanded into a full WARNING block (~25
  lines) explaining the PVC retention semantics, the failure symptom,
  and both kubectl-flavored remediation paths: non-destructive
  (`kubectl exec ... ALTER ROLE`) preferred for environments with data,
  and destructive (`helm uninstall + kubectl delete pvc`) for dev/demo.
  Cross-references the wrapPingError runtime diagnostic.

- deploy/helm/certctl/README.md (new, ~115 lines) — chart-level operational
  guide. Covers quick install, both remediation paths with concrete
  kubectl commands, why-we-don't-fix-this-in-the-chart explanation,
  cross-references to the docker-compose docs, server API key rotation
  (the easy case — comma-separated key list), TLS provisioning shapes,
  embedded-vs-external postgres, and uninstall semantics with the PVC
  retention gotcha called out.

- examples/README.md (new, ~55 lines) — shared operational notes for the
  5 example deployments. Covers the postgres password rotation trap with
  example-flavored remediation paths (`docker compose -f examples/<x>/...`),
  the TLS warning, and teardown semantics. Replaces what would otherwise
  be 5x duplication across per-example READMEs.

- examples/{acme-nginx,acme-wildcard-dns01,multi-issuer,private-ca-traefik,
  step-ca-haproxy}/*.md — one-line cross-reference at the top of each
  example's primary doc, pointing at examples/README.md for the shared
  ops notes. Avoids 5x duplication of the same warning text while still
  surfacing the link in every operator's first-touch surface.

Verification:

- go build ./... — clean
- go vet ./... — clean
- go test -short ./internal/repository/postgres/ — 4/4 wrapPingError tests
  still passing (no production-code touch in this commit)
- helm lint deploy/helm/certctl/ — clean (1 INFO about chart icon, pre-existing)
- helm template smoke test — renders without error
- python3 yaml.safe_load on values.yaml — parses

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
      §2 P1 cluster, cat-u-quickstart_postgres_password_volume_trap
      Closes the three deliberate scope-outs from cfc234e (Helm,
      root .env.example, examples/) end-to-end.

      Adjacent bugs caught while in scope:
      - root .env.example:16 hardcoded password not matching line 10
      - root .env.example:31 http:// URL incompatible with HTTPS-only v2.2
2026-04-24 23:51:13 +00:00


# certctl + NGINX + Let's Encrypt
This example demonstrates certctl's core use case: **automatically manage TLS certificates for NGINX using Let's Encrypt (ACME HTTP-01 challenges).**
> **Operational notes** shared by every example (postgres password rotation trap, TLS provisioning, teardown semantics) live in [`../README.md`](../README.md). Read it first if you plan to change `DB_PASSWORD` after the initial `docker compose up` — the postgres volume binds the password on first boot only.
## What This Does
- Deploys certctl server (control plane) with PostgreSQL
- Deploys certctl agent on the same network (in production: on your NGINX server)
- Configures Let's Encrypt as the certificate issuer via ACME v2
- Demonstrates HTTP-01 challenge solving (requires port 80 open to the internet)
- Shows how to set up 3 example domains for certificate enrollment and renewal
- Automatically renews certificates 30 days before expiration
## Architecture
```mermaid
flowchart TD
A["Your Domain (example.com)"]
B["Let's Encrypt ACME"]
C["certctl Server (control plane)"]
D["certctl Agent (on NGINX server)"]
E["NGINX Reverse Proxy"]
B -->|HTTP-01 validation<br/>port 80| A
C -->|CSR submission| B
D -->|API polling| C
D -->|deploy cert+key| E
```
## Prerequisites
1. **Docker & Docker Compose** (v20.10+)
2. **A domain name** pointing to your server (e.g., `example.com`)
3. **Ports 80 and 443 open** to the internet (ACME HTTP-01 needs port 80)
4. **Valid email address** for Let's Encrypt account (errors and renewal notices)
If you don't have a real domain or can't open port 80, see [Customization Tips](#customization-tips) below.
## TLS Security
certctl is HTTPS-only as of v2.2. The demo compose stack provisions a self-signed certificate. When accessing `https://localhost:8443`, you can either:
- Use `curl --cacert ./deploy/test/certs/ca.crt ...` to pin the CA certificate
- Use `curl -k ...` for quick smoke tests (never in production)
- Import the CA at `./deploy/test/certs/ca.crt` into your OS trust store for browser visits
## Quick Start
### 1. Clone or copy this example
```bash
cd examples/acme-nginx
```
### 2. Create a `.env` file with your settings
```bash
cat > .env <<'EOF'
# Your email for Let's Encrypt account
ACME_EMAIL=admin@example.com
# Database password (change this in production!)
DB_PASSWORD=certctl-demo-password
# Agent API key (generate a real one in production)
AGENT_API_KEY=agent-demo-key
# Server port (certctl listens here internally on 8443; expose as needed)
SERVER_PORT=8443
EOF
```
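The demo values above are placeholders. As a minimal sketch, assuming a POSIX-like shell with `/dev/urandom` available, random replacements could be generated like this. The helper name `gen_secret` is hypothetical and not part of certctl.

```bash
# gen_secret [BYTES]: print BYTES random bytes (default 32) as lowercase hex.
# Hypothetical helper for illustration; not shipped with certctl.
gen_secret() {
  local bytes="${1:-32}"
  # od prints the bytes as space-separated hex; tr strips spaces and newlines.
  head -c "$bytes" /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}

# Print lines suitable for pasting into .env
printf 'DB_PASSWORD=%s\n' "$(gen_secret 24)"
printf 'AGENT_API_KEY=%s\n' "$(gen_secret 32)"
```

Set `DB_PASSWORD` before the first `docker compose up`, since the postgres volume binds the password on first boot only.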
### 3. (Optional) Create an NGINX config
If you have a real domain and want NGINX to route traffic:
```bash
cat > nginx.conf <<'EOF'
events {
    worker_connections 1024;
}
http {
    # HTTP block for ACME challenges
    server {
        listen 80;
        server_name example.com www.example.com api.example.com;
        # ACME challenge directory (certctl writes validation files here)
        location /.well-known/acme-challenge/ {
            root /var/www/certbot;
        }
        # Redirect HTTP to HTTPS
        location / {
            return 301 https://$server_name$request_uri;
        }
    }
    # HTTPS block (certificates deployed here by certctl agent)
    server {
        listen 443 ssl http2;
        server_name example.com www.example.com api.example.com;
        ssl_certificate /etc/nginx/ssl/example.com.crt;
        ssl_certificate_key /etc/nginx/ssl/example.com.key;
        ssl_protocols TLSv1.2 TLSv1.3;
        ssl_ciphers HIGH:!aNULL:!MD5;
        location / {
            proxy_pass http://upstream-service;
        }
    }
}
EOF
```
Or just accept the default empty NGINX config for demonstration.
### 4. Start the stack
```bash
docker compose up -d
```
Monitor logs:
```bash
docker compose logs -f certctl-server certctl-agent
```
### 5. Access the dashboard
Navigate to `https://localhost:8443` (or your `SERVER_PORT`)
You should see:
- An empty certificate inventory (no certs issued yet)
- One ACME issuer ("iss-acme") configured and ready
- One agent ("nginx-agent-01") online and heartbeating
### 6. Create a certificate profile
In the certctl dashboard:
1. Go to **Profiles** (sidebar)
2. Click **New Profile**
3. Set:
   - Name: `acme-prod`
   - Key Type: `RSA-2048` (or `ECDSA-P256`)
   - Max TTL: `90 days`
   - Allowed Key Types: `RSA-2048, ECDSA-P256`
4. Save
### 7. Request a certificate
In the certctl dashboard:
1. Go to **Certificates** (sidebar)
2. Click **Request New Certificate**
3. Set:
   - Common Name: `example.com`
   - SANs: `www.example.com`, `api.example.com` (optional)
   - Issuer: `iss-acme` (Let's Encrypt)
   - Profile: `acme-prod`
4. Click **Request**
Behind the scenes:
- Server creates an `Issuance` job
- Agent polls for work, fetches the job
- Agent generates a private key locally (never sent to server)
- Agent submits CSR to server
- Server sends CSR to Let's Encrypt ACME
- Let's Encrypt provides HTTP-01 challenge token
- Server downloads ACME challenge, returns to agent
- Agent deploys challenge file to NGINX `/.well-known/acme-challenge/`
- Let's Encrypt validates (HTTP GET to `http://example.com/.well-known/acme-challenge/...`)
- Let's Encrypt issues certificate
- Server receives certificate, passes to agent
- Agent deploys cert+key to `/etc/nginx/ssl/example.com.crt` + `.key`
- Agent reloads NGINX (`nginx -s reload`)
- Certificate is now active
### 8. View the certificate
In the dashboard:
1. Go to **Certificates**
2. Click the certificate to see:
   - Common name, SANs, serial number
   - Issuer (Let's Encrypt), not-before/after dates
   - Status (Active, Expiring in N days, Expired)
   - Deployment history (timestamps, agent name, target)
   - Next auto-renewal date (30 days before expiration)
### 9. Set up automatic renewal
The server automatically checks for certificates expiring within 30 days and triggers renewal. You can:
- Adjust the threshold in the certificate's policy
- Manually trigger renewal via dashboard button
- View renewal job status and history
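To make the 30-day threshold concrete, here is a minimal sketch of the date arithmetic involved. It is only an illustration under stated assumptions: the function names `days_left` and `should_renew` are hypothetical, timestamps are Unix epoch seconds, and the server's actual scheduling logic is not shown in this README.

```bash
# days_left NOT_AFTER_EPOCH NOW_EPOCH: whole days until the certificate expires.
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# should_renew NOT_AFTER_EPOCH NOW_EPOCH [THRESHOLD_DAYS]
# Succeeds (exit 0) when the certificate expires within THRESHOLD_DAYS (default 30).
should_renew() {
  local threshold="${3:-30}"
  [ "$(days_left "$1" "$2")" -le "$threshold" ]
}

# Example with fixed timestamps: a certificate with 29 days left is inside the window.
now=1700000000
if should_renew $(( now + 29 * 86400 )) "$now"; then
  echo "renewal due"
fi
```

For a certificate file on disk, `openssl x509 -checkend 2592000 -noout -in cert.pem` performs a comparable check (2592000 seconds is 30 days): it exits non-zero if the certificate expires within that window.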
## How It Works
### Certificate Lifecycle
1. **Request** — Operator creates certificate request via dashboard or API
2. **CSR Generation** — Agent generates private key locally, submits CSR to server
3. **ACME Challenge** — Server communicates with Let's Encrypt ACME, obtains challenge
4. **Challenge Proof** — Agent deploys challenge proof to NGINX
5. **Issuance** — Let's Encrypt validates, issues certificate
6. **Deployment** — Agent receives certificate, deploys to NGINX SSL directory
7. **Reload** — Agent signals NGINX to reload (`nginx -s reload`)
8. **Verification** — Agent optionally verifies the live TLS endpoint (handshake fingerprint)
9. **Renewal** — 30 days before expiration, process repeats automatically
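The verification step (8) can be approximated by hand with `openssl`. The sketch below compares the SHA-256 fingerprint of a certificate file with the one served on a live endpoint. It assumes `openssl` is installed; the function names are hypothetical and this is not the agent's own implementation.

```bash
# file_fingerprint CERT_PEM: SHA-256 fingerprint of a certificate file.
file_fingerprint() {
  openssl x509 -in "$1" -noout -fingerprint -sha256 | cut -d= -f2
}

# live_fingerprint HOST [PORT]: fingerprint of the leaf certificate served by HOST.
live_fingerprint() {
  local host="$1" port="${2:-443}"
  openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -fingerprint -sha256 | cut -d= -f2
}

# verify_deployed CERT_PEM HOST [PORT]: succeed only if both fingerprints
# are non-empty and identical.
verify_deployed() {
  local want got
  want="$(file_fingerprint "$1")"
  got="$(live_fingerprint "$2" "${3:-443}")"
  [ -n "$want" ] && [ "$want" = "$got" ]
}
```

Usage, with the certificate path from this example: `verify_deployed /etc/nginx/ssl/example.com.crt example.com && echo "live endpoint serves the deployed certificate"`.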
### HTTP-01 Challenge
ACME HTTP-01 works like this:
1. Let's Encrypt generates random token (e.g., `abc123def456`)
2. Server returns token to agent
3. Agent writes file: `/.well-known/acme-challenge/abc123def456` containing the key authorization (the token joined by a `.` to a thumbprint of the ACME account key)
4. Let's Encrypt performs HTTP GET to `http://example.com/.well-known/acme-challenge/abc123def456`
5. If content matches, domain ownership is proven
6. Certificate is issued
**Requirements:**
- Port 80 must be open to the internet
- DNS must resolve your domain to your server
- NGINX must serve `/.well-known/acme-challenge/` (or certctl mounts a separate directory)
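The requirements above can be checked from outside your network before requesting a certificate. This is a minimal sketch, assuming `curl` is installed; the helper names `challenge_url` and `probe_challenge` are hypothetical, and the token is whatever test file you place in the challenge directory.

```bash
# challenge_url DOMAIN TOKEN: the URL an ACME validator fetches for HTTP-01.
challenge_url() {
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

# probe_challenge DOMAIN TOKEN: fetch the file over plain HTTP on port 80,
# as a validator would. Fails on HTTP errors or after a 10 second timeout.
probe_challenge() {
  curl -fsS --max-time 10 "$(challenge_url "$1" "$2")"
}
```

For example, after placing a file named `test` in the challenge directory, `probe_challenge example.com test` should print its contents. A timeout usually points at a closed port 80 or a DNS record that does not resolve to your server.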
### Agent Key Generation
Keys are generated **on the agent**, never on the server:
1. Agent creates ECDSA P-256 keypair using `crypto/ecdsa`
2. Private key is stored locally on agent at `/var/lib/certctl/keys/` (readable only by certctl process)
3. Agent creates CSR (certificate signing request) with private key
4. Agent submits CSR to server
5. Server never sees the private key
6. Certificate is returned, agent stores it alongside key
7. Both key and cert used for NGINX deployment
This keeps private keys in the infrastructure where they're used, following zero-trust principles.
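To see why the server never needs the private key, the same flow can be reproduced by hand with `openssl`. This is an illustrative equivalent, not the agent's code (the agent uses Go's `crypto/ecdsa` as described above). The function name `make_csr` and the output file names are hypothetical, and `-addext` requires OpenSSL 1.1.1 or newer.

```bash
# make_csr DIR CN: generate an ECDSA P-256 key and a CSR for CN inside DIR.
# Only the .csr file would ever leave the machine.
make_csr() {
  local dir="$1" cn="$2"
  (
    umask 077  # key file is created readable by the owner only
    openssl ecparam -name prime256v1 -genkey -noout -out "$dir/$cn.key"
  ) || return 1
  # The CSR carries the public key and requested names, signed by the private key.
  openssl req -new -key "$dir/$cn.key" -subj "/CN=$cn" \
    -addext "subjectAltName=DNS:$cn" -out "$dir/$cn.csr"
}
```

Running `make_csr /tmp/demo example.com` (with `/tmp/demo` already created) leaves `example.com.key` and `example.com.csr` in that directory. `openssl req -in /tmp/demo/example.com.csr -noout -text` shows that the CSR contains only the public key.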
## Adding More Domains
### Option 1: Additional SANs on Same Certificate
Edit the existing certificate in the dashboard:
1. Click the certificate
2. Edit SANs to add `mail.example.com`, `ftp.example.com`, etc.
3. Trigger renewal
4. Agent generates new CSR with all SANs
5. Let's Encrypt validates each SAN (HTTP-01 for each)
6. Single certificate with multiple SANs is issued
### Option 2: Separate Certificates per Domain
If you want separate certificates (different issuance schedules, different targets):
1. Dashboard → **Certificates** → **Request New Certificate**
2. Common Name: `subdomain.example.com`
3. Set same issuer and profile
4. Request
Each domain gets its own cert, key, and renewal schedule.
### Wildcard Certificates (Not HTTP-01)
HTTP-01 does **not** support wildcard (`*.example.com`). To issue wildcards, use DNS-01 challenge (see [acme-wildcard-dns01](../acme-wildcard-dns01/) example).
## Customization Tips
### Using Let's Encrypt Staging (for testing)
Staging has much higher rate limits than production, but it still validates domain control, so the domain must resolve to your server:
```bash
# In .env or docker-compose.yml override:
CERTCTL_ACME_DIRECTORY_URL=https://acme-staging-v02.api.letsencrypt.org/directory
```
Staging certificates won't be trusted by browsers (fake CA), but you can test the full flow without hitting production rate limits.
### Disabling Port 80 Requirement (Demo Mode)
If you can't open port 80, use ACME DNS-01 instead (requires DNS provider integration). See [acme-wildcard-dns01](../acme-wildcard-dns01/) example.
Or use Local CA for internal testing:
```bash
# Switch issuer to Local CA (not public-trusted, but no challenge needed)
CERTCTL_ACME_DIRECTORY_URL= # Leave empty to disable ACME
# (then configure Local CA instead)
```
### Custom NGINX Config
Replace `nginx.conf` with your own before `docker compose up`. The agent doesn't manage the NGINX config — it only deploys certificates. You're responsible for:
- Configuring SSL paths (`ssl_certificate`, `ssl_certificate_key`)
- Setting up challenge directory (`/.well-known/acme-challenge/`)
- Pointing NGINX to agent-deployed certificates
### Database Persistence
PostgreSQL data is stored in the `postgres_data` volume. To reset:
```bash
docker compose down -v # Destroy all volumes
```
### Viewing Agent Logs
```bash
docker compose logs -f certctl-agent
```
Look for:
- `Heartbeat successful` — agent is communicating with server
- `CSR submitted` — key generation and CSR submission worked
- `Deployment succeeded` — certificate deployed to NGINX
- `NGINX reload` — signal sent to reload
### Testing ACME Without Real Domain
Use `nip.io` (free DNS service):
1. Deploy to a server with a public IP
2. Use domain: `<your-ip>.nip.io` (e.g., `203.0.113.45.nip.io`)
3. Let's Encrypt will validate to that IP
4. Change `ACME_EMAIL` to a real email you control
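The hostname in step 2 is just the public IPv4 address with `.nip.io` appended. Here is a small sketch, with a hypothetical helper name `nip_domain`, that builds it and rejects input that does not look like a dotted-quad address:

```bash
# nip_domain IPV4: print the nip.io hostname for a dotted-quad address.
nip_domain() {
  local ip="$1"
  # Shape check only: four groups of 1-3 digits (octet ranges are not validated).
  if ! printf '%s' "$ip" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
    echo "not an IPv4 address: $ip" >&2
    return 1
  fi
  printf '%s.nip.io\n' "$ip"
}
```

For example, `nip_domain 203.0.113.45` prints `203.0.113.45.nip.io`. Running `dig +short` on that name should return the same IP before you request a certificate.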
## Production Checklist
Before running in production:
- [ ] Change `DB_PASSWORD` to a strong random password
- [ ] Generate a real API key for the agent (don't use the demo key)
- [ ] Enable `CERTCTL_AUTH_TYPE=api-key` and enforce authentication
- [ ] Use Let's Encrypt production directory (not staging)
- [ ] Configure `CERTCTL_CORS_ORIGINS` to restrict cross-origin access
- [ ] Use `CERTCTL_KEYGEN_MODE=agent` (default, but verify)
- [ ] Set `CERTCTL_LOG_LEVEL=warn` to reduce log noise
- [ ] Configure email notifications for certificate expiration alerts
- [ ] Set up log aggregation (Datadog, ELK, Splunk, etc.)
- [ ] Use docker secrets or external secret manager for credentials (not .env)
- [ ] Run agent on actual NGINX servers (not co-located with server for HA)
- [ ] Set up monitoring and alerting on agent heartbeat and job completion
- [ ] Implement backup/restore for PostgreSQL
- [ ] Replace the demo self-signed server certificate with one issued by a CA your clients trust (certctl is HTTPS-only as of v2.2)
## Troubleshooting
### Agent heartbeat failing
```bash
docker compose logs certctl-agent
# Check: CERTCTL_SERVER_URL, CERTCTL_API_KEY, network connectivity
```
### ACME challenge failing
```bash
# Ensure port 80 is open: curl http://example.com/.well-known/acme-challenge/test
# Check NGINX is running and serving /.well-known/acme-challenge/
# Verify DNS resolves domain to your server: dig example.com
```
### NGINX reload failing
Check agent permissions on NGINX socket and that NGINX is reachable from agent container.
### Let's Encrypt rate limited
Let's Encrypt enforces rate limits (for example, 50 certificates per registered domain per 7 days). Use staging to test, or wait for the window to roll over.
### Certificate not deployed to NGINX
Check agent logs for deployment errors. Verify `/etc/nginx/ssl` volume is writable by agent container.
## Next Steps
- **Wildcard certificates**: See [acme-wildcard-dns01](../acme-wildcard-dns01/) example
- **Multiple issuers**: See [multi-issuer](../multi-issuer/) example
- **Private CA**: See [private-ca-traefik](../private-ca-traefik/) example
- **Dashboard deep dive**: Read [docs/quickstart.md](../../docs/quickstart.md)
- **REST API**: Explore [api/openapi.yaml](../../api/openapi.yaml)
## Support
For issues or questions:
- Check [docs/troubleshooting.md](../../docs/troubleshooting.md)
- Open an issue on GitHub
- Review server and agent logs: `docker compose logs -f`