mirror of https://github.com/shankar0123/certctl.git synced 2026-07-26 16:48:12 +00:00

shankar0123 86fffa305a fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)

Pre-U-2 the published `ghcr.io/shankar0123/certctl-server` image
shipped with `HEALTHCHECK CMD curl -f http://localhost:8443/health`.
The server has been HTTPS-only since the v2.2 HTTPS-Everywhere milestone
(`cmd/server/main.go::ListenAndServeTLS`, no plaintext fallback, TLS
1.3 pinned), so the probe failed on every interval and Docker marked
the container `unhealthy` indefinitely. Operators inside docker-
compose / Helm / the example stacks were unaffected — compose overrides
the HEALTHCHECK with `--cacert + https://`, Helm uses explicit
`httpGet` probes that ignore Docker's HEALTHCHECK, and every example
compose file overrides with `curl -sfk https://localhost:8443/health`.
But anyone running bare `docker run` / Docker Swarm / Nomad / ECS —
exactly the "I just pulled the published image" path — saw permanent
`unhealthy` status and (depending on orchestrator policy) a restart-
loop. (Audit: cat-u-healthcheck_protocol_mismatch in
coverage-gap-audit-2026-04-24-v5/unified-audit.md.)

Recon for U-2 surfaced two adjacent bugs from the same v2.2 milestone
gap, both bundled into this commit because they share the same root
cause and the same operator surface:

1. Helm chart `server.readinessProbe.httpGet.path` pointed at
`/readyz`, the kube-flavored convention. The certctl server
doesn't register `/readyz` (only `/health` and `/ready` are
wired and bypass the auth middleware — see
internal/api/router/router.go:81 and cmd/server/main.go:920).
K8s readiness probes therefore got 401 (api-key auth rejection)
or 404 (when auth was disabled), pods stayed `NotReady`
indefinitely, and Helm rollouts stalled.

2. The agent image (`Dockerfile.agent`) had no HEALTHCHECK at all,
so bare-`docker run` agents got zero health signal. The
compose override at `deploy/docker-compose.yml:173` called
`pgrep -f certctl-agent` against the agent image, but the
agent image didn't ship `procps` — pgrep was missing too. The
compose probe was a latent always-fail.

We fixed all three with the audit-recommended shape (option (a) — `-k`)
plus three structural backstops:

Files changed:

Phase 1 — Dockerfile fix:
- Dockerfile: HEALTHCHECK switched from
`curl -f http://localhost:8443/health` to
`curl -fsk https://localhost:8443/health`. `-k`
(insecure) is acceptable because the probe is localhost-to-localhost:
the same process serving the cert is being probed, no network hop.
Pinning `--cacert` is not viable for the published image because
the bootstrap cert is per-deploy (generated into the `certs` named
volume on first up; operator-supplied via Helm's `existingSecret`
or cert-manager). Long-form docblock cross-references the audit
closure, the compose vs Helm vs examples coverage matrix, and the
CI guardrail.
- Dockerfile.agent: added HEALTHCHECK using `pgrep -f certctl-agent`
matching the compose pattern. Added `procps` to the runtime apk
install — fixes both the new image-level HEALTHCHECK AND the
pre-existing compose probe that was silently failing.

Phase 2 — Helm readiness probe path:
- deploy/helm/certctl/values.yaml: server.readinessProbe.httpGet.path
changed from `/readyz` to `/ready`. Liveness probe path
(`/health`) was correct and is unchanged. Probes block now carries
an explanatory comment naming the registered no-auth probe routes
and the U-2 closure rationale.

Phase 3 — Image-level integration tests:
- deploy/test/healthcheck_test.go (new, //go:build integration):
TestPublishedServerImage_HealthcheckSpecUsesHTTPS builds the server
image, inspects `Config.Healthcheck.Test` via `docker inspect`,
and asserts the array contains `https://localhost:8443/health` and
`-k`, and does NOT contain `http://localhost:8443/health`
(positive + negative regression contracts).
TestPublishedAgentImage_HealthcheckSpecExists builds the agent image
and asserts the HEALTHCHECK uses `pgrep` against `certctl-agent`.
Both tests `t.Skip` cleanly when docker isn't available (sandbox /
CI without docker-in-docker) — verified locally: tests skip with the
diagnostic and the suite returns PASS.
TestPublishedServerImage_HealthcheckTransitionsToHealthy is a
documented `t.Skip` placeholder until the harness wires a sidecar
postgres for image-level smoke; the spec-level tests above cover the
audit-flagged regression.

Phase 4 — CI guardrail:
- .github/workflows/ci.yml: new "Forbidden plaintext HEALTHCHECK
regression guard (U-2)" step. Scoped patterns catch
`HEALTHCHECK.*http://` and `curl -f http://localhost:8443/health`
in any `Dockerfile*`. Comment lines exempt; docs/upgrade-to-tls.md
out of scope (the post-cutover invariant string at line 182 is
intentionally a documented expected-failure assertion). Verified
locally on the real tree (passes) and against synthetic regressions
(each fires the guard).

Phase 5 — Docs sweep:
- docs/connectors.md: 15 stale curl examples updated from
`http://localhost:8443/...` to `https://localhost:8443/...` with
`--cacert "$CA"` injected on every site. Added a one-time
introductory note documenting the `$CA` extraction with
`docker compose ... exec ... cat /etc/certctl/tls/ca.crt`,
matching the pattern in docs/quickstart.md. Pre-U-2 these examples
silently failed against the HTTPS listener.

Phase 6 — Release surface:
- CHANGELOG.md: appended U-2 section to the existing [unreleased]
block (immediately below the G-1 entry). Sections: explanatory
blockquote covering all three bugs (primary + 2 adjacent), Fixed,
Added, Changed.

Verification (all gates pass):
- go build ./... — clean
- go vet ./... — clean
- go vet -tags integration ./deploy/test/ — clean
- go test -short ./... — every package green
- go test -tags integration -v -run 'TestPublishedServerImage|TestPublishedAgentImage' ./deploy/test/ —
three tests SKIP cleanly with "docker not available" diagnostic
- helm lint deploy/helm/certctl/ — clean
- helm template smoke render — succeeds; rendered Deployment carries
`path: /ready` and zero `/readyz` matches
- python3 yaml.safe_load on api/openapi.yaml — parses
- govulncheck ./... — no vulnerabilities in our code
- CI guardrail mirror: clean on real tree, fires on synthetic
regression patterns

Out of scope (intentionally untouched):
- cmd/server/main.go::ListenAndServeTLS — HTTPS-only is correct,
this finding does NOT propose adding back a plaintext listener.
- deploy/docker-compose.yml:126 HEALTHCHECK — already correct.
- deploy/docker-compose.test.yml HEALTHCHECK blocks — already correct.
- All 5 examples/*/docker-compose.yml HEALTHCHECK overrides — already
correct (they ALSO use `-fsk https://localhost:8443/health`).
- Helm server.livenessProbe.httpGet — already uses `scheme: HTTPS` +
`path: /health`, correct.
- docs/upgrade-to-tls.md:182 `curl ... http://localhost:8443/health`
invariant line — that's the expected-failure assertion for the
post-cutover state ("plaintext is gone, expect Connection refused");
intentionally left intact.
- Go production code — this is purely a deploy-image / probe / docs /
Helm-chart fix.

Refs: coverage-gap-audit-2026-04-24-v5/unified-audit.md
§2 P1 cluster, cat-u-healthcheck_protocol_mismatch
Audit recommendation followed verbatim: 'change Dockerfile:80
to CMD curl -kf https://localhost:8443/health'.

2026-04-25 12:02:18 +00:00

Name	Last commit	Date
certctl	fix(deploy,helm,docs): published-image HEALTHCHECK speaks HTTPS + Helm /ready path + docs HTTPS sweep (U-2)	2026-04-25 12:02:18 +00:00
examples	feat(m28+m29+m30): ACME ARI, email digest, and Helm chart	2026-03-28 21:18:35 -04:00
CHART_SUMMARY.md	fix(security,config): remove unimplemented JWT auth-type, close silent downgrade (G-1)	2026-04-25 00:22:23 +00:00
DEPLOYMENT_GUIDE.md	v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP	2026-04-20 03:43:10 +00:00
INDEX.md	feat(m28+m29+m30): ACME ARI, email digest, and Helm chart	2026-03-28 21:18:35 -04:00
INSTALLATION.md	v2.0.47: HTTPS Everywhere — TLS-only control plane, agents/CLI/MCP	2026-04-20 03:43:10 +00:00
README.md	feat(m28+m29+m30): ACME ARI, email digest, and Helm chart	2026-03-28 21:18:35 -04:00

README.md

Certctl Helm Chart

Production-ready Helm chart for deploying certctl (self-hosted certificate lifecycle management platform) on Kubernetes.

Quick Start
Chart Features
Prerequisites
Installation
Configuration
Usage Examples
Upgrading
Uninstalling
Architecture
Security Considerations
Troubleshooting

Quick Start

# Add the chart repository (when available)
helm repo add certctl https://charts.example.com
helm repo update

# Install with default values
helm install certctl certctl/certctl \
  --set server.auth.apiKey="your-secure-api-key" \
  --set postgresql.auth.password="your-secure-password"

# Check installation status
kubectl get pods -l app.kubernetes.io/instance=certctl

Chart Features

Server Deployment — certctl control plane with configurable replicas
PostgreSQL StatefulSet — Persistent database with automatic schema migration
Agent DaemonSet or Deployment — Flexible agent deployment (per-node or custom replicas)
Ingress Support — Optional HTTPS ingress with cert-manager integration
Security Contexts — Non-root containers, read-only filesystems, minimal capabilities
Resource Limits — Configurable CPU and memory requests/limits
Health Checks — Liveness and readiness probes on all containers
ConfigMaps and Secrets — Centralized configuration management
Service Account and RBAC — Optional cluster role bindings
Pod Disruption Budgets — HA-ready with configurable disruption budgets
Monitoring — Optional Prometheus ServiceMonitor support

Prerequisites

Kubernetes 1.19 or later
Helm 3.0 or later
Optional: cert-manager (for automatic TLS certificate provisioning)
Optional: Prometheus (for metrics scraping)

Installation

1. Using Chart from Repository

helm repo add certctl https://charts.example.com
helm repo update
helm install certctl certctl/certctl -f my-values.yaml

2. Using Local Chart

cd deploy/helm
helm install certctl certctl/ \
  --set server.auth.apiKey="$(openssl rand -base64 32)" \
  --set postgresql.auth.password="$(openssl rand -base64 32)"

3. Minimal Production Installation

helm install certctl certctl/certctl \
  --namespace certctl \
  --create-namespace \
  --set server.auth.apiKey="change-me" \
  --set postgresql.auth.password="change-me" \
  --set server.replicas=2 \
  --set server.resources.requests.cpu=200m \
  --set server.resources.requests.memory=256Mi \
  --set ingress.enabled=true \
  --set ingress.className=nginx \
  --set 'ingress.hosts[0].host=certctl.example.com'   # quoted so the shell does not glob the brackets
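
Secrets passed with `--set` can end up in shell history or the process list. A minimal sketch of an alternative, assuming `head`, `base64` and `/dev/urandom` are available: generate both secrets into a values file that only the current user can read, then pass that file with `-f`. The function name `write_secret_values` and the file name `secret-values.yaml` are hypothetical; the keys `server.auth.apiKey` and `postgresql.auth.password` are the ones used in the commands above.

```shell
#!/bin/sh
# write_secret_values FILE
# Writes two freshly generated secrets to FILE as Helm values.
# Hypothetical helper; the value keys match the --set flags used above.
# The body runs in a subshell so the umask change does not leak to the caller.
write_secret_values() (
  umask 077   # a newly created file gets owner-only permissions (rw-------)
  api_key=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
  db_pass=$(head -c 32 /dev/urandom | base64 | tr -d '\n')
  cat > "$1" <<EOF
server:
  auth:
    apiKey: "$api_key"
postgresql:
  auth:
    password: "$db_pass"
EOF
)

# Usage (not executed here):
#   write_secret_values secret-values.yaml
#   helm install certctl certctl/certctl -f my-values.yaml -f secret-values.yaml
```

The umask only affects a file that does not exist yet, so point the helper at a new path rather than overwriting a file with looser permissions.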

Configuration

Server Configuration

server:
  replicas: 1                    # Number of server replicas
  port: 8443                     # Service port
  auth:
    type: api-key               # Authentication type
    apiKey: "your-api-key"      # REQUIRED for production
  logging:
    level: info                 # Log level (debug, info, warn, error)
    format: json                # Output format
  issuer:
    local:
      enabled: true             # Enable local CA issuer
    acme:
      enabled: false            # Enable ACME issuer
      directoryURL: ""          # ACME directory URL
      email: ""                 # ACME registration email
      challengeType: "http-01"  # Challenge type (http-01, dns-01, dns-persist-01)

PostgreSQL Configuration

postgresql:
  enabled: true                 # Use managed PostgreSQL
  auth:
    database: certctl
    username: certctl
    password: "your-password"   # REQUIRED
  storage:
    size: 10Gi                  # PVC size
    storageClass: ""            # Use default StorageClass

Agent Configuration

agent:
  enabled: true                 # Deploy agents
  kind: DaemonSet              # DaemonSet (one per node) or Deployment
  replicas: 1                  # For Deployment kind only
  discoveryDirs: ""            # Comma-separated cert discovery paths
  nodeSelector: {}             # Node selector restricting where agent pods are scheduled

Ingress Configuration

ingress:
  enabled: false
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: certctl.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: certctl-tls
      hosts:
        - certctl.example.com

See values.yaml for all available configuration options.
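
The sample values above use placeholder secrets. As a sketch, a pre-install check can refuse a values file that still contains one of them. The function name `check_placeholders` is hypothetical, and the pattern only covers the placeholder strings that appear in this README (`your-api-key`, `your-password`, `your-secure-api-key`, `your-secure-password`, `change-me`).

```shell
#!/bin/sh
# check_placeholders FILE
# Prints the offending lines (with line numbers) and returns non-zero if FILE
# still contains one of the placeholder secrets used in this README.
# Hypothetical helper, not part of the chart.
check_placeholders() {
  if grep -n -E 'your-(secure-)?(api-key|password)|change-me' "$1"; then
    echo "error: $1 still contains placeholder secrets" >&2
    return 1
  fi
  return 0
}

# Usage (not executed here):
#   check_placeholders my-values.yaml && \
#     helm install certctl certctl/certctl -f my-values.yaml
```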

Usage Examples

Example 1: High Availability Setup

# ha-values.yaml
server:
  replicas: 3
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: 1000m
      memory: 512Mi

postgresql:
  storage:
    size: 50Gi

podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values: [server]
      topologyKey: kubernetes.io/hostname

Deploy with:

helm install certctl certctl/certctl -f ha-values.yaml

Example 2: External PostgreSQL Database

# external-db-values.yaml
postgresql:
  enabled: false

server:
  env:
    CERTCTL_DATABASE_URL: "postgres://user:password@rds.example.com:5432/certctl?sslmode=require"

Deploy with:

helm install certctl certctl/certctl -f external-db-values.yaml

Example 3: ACME + Let's Encrypt

# acme-values.yaml
server:
  issuer:
    acme:
      enabled: true
      directoryURL: https://acme-v02.api.letsencrypt.org/directory
      email: admin@example.com
      challengeType: dns-01
      dnsPresentScript: /scripts/dns-present.sh
      dnsCleanupScript: /scripts/dns-cleanup.sh
      dnsPropagationWait: 30s

Example 4: Notifications via Slack + SMTP

# notifications-values.yaml
server:
  smtp:
    enabled: true
    host: smtp.example.com
    port: 587
    username: certctl@example.com
    password: "smtp-password"
    fromAddress: certctl@example.com
    useTLS: true

  notifiers:
    slack:
      enabled: true
      webhookUrl: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
      channel: "#certificates"

Upgrading

# Update chart repository
helm repo update

# Upgrade release
helm upgrade certctl certctl/certctl -f values.yaml

# View upgrade history
helm history certctl

# Roll back to a specific revision (here revision 1); omit the number to return to the previous release
helm rollback certctl 1

Uninstalling

# Delete the release (keeps data by default)
helm uninstall certctl

# Also delete persistent data (irreversible); kubectl rejects --all combined with a label selector
kubectl delete pvc -l app.kubernetes.io/instance=certctl

# Delete namespace
kubectl delete namespace certctl

Architecture

Components

┌──────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster                                           │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────┐                 ┌──────────────────┐  │
│  │ Ingress/LB      │                 │  Agent Pod 1     │  │
│  │ (optional)      │                 │  (DaemonSet)     │  │
│  └────────┬────────┘                 └──────────────────┘  │
│           │                                                  │
│           ▼                           ┌──────────────────┐  │
│  ┌─────────────────────────┐          │  Agent Pod 2     │  │
│  │ Server Deployment       │          │  (DaemonSet)     │  │
│  │ (1 to N replicas)       │          └──────────────────┘  │
│  │ - REST API              │                                 │
│  │ - Scheduler             │          ┌──────────────────┐  │
│  │ - UI Dashboard          │          │  Agent Pod N     │  │
│  └────────┬────────────────┘          │  (DaemonSet)     │  │
│           │                           └──────────────────┘  │
│           │                                                  │
│           ▼                                                  │
│  ┌──────────────────────────┐                               │
│  │ PostgreSQL StatefulSet   │                               │
│  │ - Database               │                               │
│  │ - PVC (persistent)       │                               │
│  └──────────────────────────┘                               │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Network Communication

Server → PostgreSQL: Internal cluster DNS (certctl-postgres:5432)
Agent → Server: Internal cluster DNS (certctl-server:8443)
External → Server: Via Ingress or Service (ClusterIP/LoadBalancer/NodePort)

Security Considerations

1. Secrets Management

All sensitive data is stored in Kubernetes Secrets:

PostgreSQL credentials
API keys
SMTP passwords
ACME account secrets

Best Practices:

Use sealed-secrets or external-secrets operator
Enable encryption at rest in etcd
Rotate secrets regularly

# Example: Using sealed-secrets
kubectl create secret generic certctl-api-key \
  --from-literal=api-key="$(openssl rand -base64 32)" \
  --dry-run=client -o yaml \
  | kubeseal -f - \
  | kubectl apply -f -

2. RBAC

The chart creates minimal RBAC by default:

ServiceAccount per release
ClusterRole (empty, extensible)
ClusterRoleBinding

To restrict further:

rbac:
  create: true
  # Add specific rules here

3. Pod Security

All containers run with:

Non-root user (UID 1000)
Read-only root filesystem
No privilege escalation
Dropped capabilities (ALL)

4. Network Policies

Restrict pod-to-pod communication:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: certctl-default-deny
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: certctl
  policyTypes:
    - Ingress
    - Egress
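  ingress:
    - from:
        # Matches namespaces labeled name=certctl; label the namespace
        # accordingly or the selector matches nothing.
        - namespaceSelector:
            matchLabels:
              name: certctl
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: certctl
    # Allow DNS lookups. Cluster DNS usually runs in another namespace
    # (commonly kube-system), so select all namespaces here; a bare
    # podSelector would only match pods in this policy's own namespace.
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 53  # DNS
        - protocol: UDP
          port: 53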

5. TLS/HTTPS

Enable HTTPS with cert-manager:

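# Newer cert-manager charts deprecate installCRDs in favor of crds.enabled
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true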

Then configure Ingress with TLS.

6. API Key Security

For production:

Generate a strong API key: openssl rand -base64 32
Store securely (Vault, sealed-secrets, etc.)
Never commit to Git
Rotate periodically

# Generate a new API key and patch it into the existing Secret
NEW_KEY=$(openssl rand -base64 32)
kubectl patch secret certctl-server -p "{\"data\":{\"api-key\":\"$(printf '%s' "$NEW_KEY" | base64 | tr -d '\n')\"}}"
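
The rotation can also be split into two steps so the patch body can be inspected before it is applied. A minimal sketch: the function name `api_key_patch` is hypothetical, while the Secret name `certctl-server` and the key `api-key` are taken from the command above. Whether running server pods pick up the new key without a restart is not stated here, so verify that for your deployment.

```shell
#!/bin/sh
# api_key_patch KEY
# Prints the JSON body for `kubectl patch`, with KEY base64-encoded under
# data.api-key. Hypothetical helper.
api_key_patch() {
  # printf is used instead of `echo -n`, whose behavior varies between shells;
  # tr strips the line wrapping some base64 implementations apply.
  encoded=$(printf '%s' "$1" | base64 | tr -d '\n')
  printf '{"data":{"api-key":"%s"}}\n' "$encoded"
}

# Usage (not executed here):
#   NEW_KEY=$(openssl rand -base64 32)
#   api_key_patch "$NEW_KEY"                      # inspect the body first
#   kubectl patch secret certctl-server -p "$(api_key_patch "$NEW_KEY")"
```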

Troubleshooting

1. Pods Not Starting

# Check pod status
kubectl get pods -l app.kubernetes.io/instance=certctl
kubectl describe pod <pod-name>
kubectl logs <pod-name>

2. Database Connection Issues

# Verify PostgreSQL is running
kubectl get pods -l app.kubernetes.io/component=postgres
kubectl logs -l app.kubernetes.io/component=postgres

# Test connection from server pod
kubectl exec -it <server-pod> -- \
  psql postgres://certctl:password@certctl-postgres:5432/certctl

3. Agent Not Connecting

# Check agent logs
kubectl logs -l app.kubernetes.io/component=agent

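# Verify server is reachable (the server is HTTPS-only; certificate verification is skipped for a quick check)
kubectl exec -it <agent-pod> -- \
  wget -q -O - --no-check-certificate https://certctl-server:8443/health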
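
According to the commit notes at the top of this page, the server listens on HTTPS only and serves `/health` without authentication. A sketch of a retrying probe built on `curl -fsk` follows. The function names `probe` and `wait_healthy` are hypothetical, and the presence of `curl` in the container is an assumption. Because `-k` skips certificate verification, this checks reachability, not trust.

```shell
#!/bin/sh
# probe URL: succeeds when the endpoint answers with a non-error HTTP status.
# -f fails on HTTP errors, -s silences progress, -k skips TLS verification.
probe() {
  curl -fsk --max-time 5 "$1" >/dev/null 2>&1
}

# wait_healthy URL [ATTEMPTS] [DELAY_SECONDS]
# Calls probe until it succeeds or ATTEMPTS is exhausted.
wait_healthy() {
  url=$1
  attempts=${2:-10}
  delay=${3:-3}
  i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$url"; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "still unhealthy after $attempts attempt(s)" >&2
  return 1
}

# Usage (not executed here):
#   wait_healthy https://certctl-server:8443/health 10 3
```

Keeping `probe` as a separate function makes it easy to swap in `wget` or a `--cacert`-based check without touching the retry loop.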

4. Persistent Data Loss

# Check PVC status
kubectl get pvc

# Verify data is being stored
kubectl exec -it <postgres-pod> -- \
  ls -lah /var/lib/postgresql/data/postgres

5. Permission Denied Errors

The chart runs containers as non-root (UID 1000). If you see permission errors:

# Temporarily allow root for debugging
server:
  securityContext:
    runAsUser: 0  # NOT FOR PRODUCTION

6. Out of Memory

Increase resource limits:

helm upgrade certctl certctl/certctl \
  --set server.resources.limits.memory=1Gi \
  --set postgresql.resources.limits.memory=2Gi

7. Certificate Validation Issues

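For self-signed certificates, pass the variable through env; kubectl exec runs the command without a shell, so a bare VAR=value prefix would be treated as the command name:

kubectl exec -it <pod> -- \
  env CERTCTL_TLS_INSECURE_SKIP_VERIFY=true <command>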

Common Issues and Solutions

Issue	Solution
`ImagePullBackOff`	Update `server.image.repository` to your registry
`CrashLoopBackOff`	Check logs with `kubectl logs <pod>`
`Pending` PVC	Check storage class availability
Connection timeout	Verify network policies and service DNS
High memory usage	Adjust `postgresql.resources.limits` and `server.resources.limits`
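
The pod-status rows of the table can be turned into a quick triage pass. The sketch below reads `kubectl get pods --no-headers` output on stdin and assumes the default column order (NAME, READY, STATUS, ...). The function name `triage_pods` is hypothetical, and the hints are the ones from the table.

```shell
#!/bin/sh
# triage_pods: reads `kubectl get pods --no-headers` lines on stdin and prints
# a hint for each pod whose STATUS (third column) is covered by the table.
# Hypothetical helper; pods in any other status are skipped silently.
triage_pods() {
  while read -r name ready status rest; do
    case $status in
      ImagePullBackOff)
        echo "$name: check the image repository (e.g. server.image.repository)" ;;
      CrashLoopBackOff)
        echo "$name: check logs with kubectl logs $name" ;;
    esac
  done
}

# Usage (not executed here):
#   kubectl get pods -l app.kubernetes.io/instance=certctl --no-headers | triage_pods
```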

Support and Contributing

For issues, questions, or contributions, visit:

GitHub: https://github.com/shankar0123/certctl
Documentation: https://github.com/shankar0123/certctl/tree/main/docs

License

BSL-1.1 (converts to Apache 2.0 in 2033)
