mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 15:01:32 +00:00
d6f4d5c5e8
Phase 4 of the certctl architecture diligence remediation closure.
Seven findings, all in deploy/helm/certctl/.
DEPL-H2 (High) — ship deploy/helm/certctl/templates/backup-cronjob.yaml
Operator opt-in via backup.enabled=true. Default OFF. CronJob runs
pg_dump --format=custom --no-owner --no-acl --dbname=certctl
matching the canonical shape in
docs/operator/runbooks/postgres-backup.md (so manual and
automated dumps are byte-identical). Sink: PVC (default) OR S3
via aws-cli. Documented as in-cluster-Postgres only — managed DB
deployments rely on their provider's PITR.
DEPL-M1 (Med) — Helm pre-install/pre-upgrade migration hook
deploy/helm/certctl/templates/migration-job.yaml — runs
`certctl-server --migrate-only` before the server Deployment
rolls. The --migrate-only flag (new in cmd/server/main.go) is a
hermetic schema-mutation pass: load config, open DB pool, run
RunMigrations + RunSeed, exit 0. No HTTP listener, no scheduler,
no signing setup.
Server's boot-time RunMigrations call is now gated on
CERTCTL_MIGRATIONS_VIA_HOOK — when set true, the server skips
the boot path (the hook owns the work). Default still runs at
boot, so Compose / VM / bare-metal deploys are unchanged.
migrations.viaHook: false in values.yaml (off by default).
DEPL-M4 (Med) — explicit Postgres StatefulSet strategy fields
deploy/helm/certctl/templates/postgres-statefulset.yaml adds:
spec.updateStrategy.type: OnDelete
spec.podManagementPolicy: OrderedReady
Operator-controlled Postgres upgrades (the OnDelete strategy
means a chart template tweak no longer triggers an immediate
Postgres restart). OrderedReady aligns with the standard
Postgres-on-Kubernetes pattern for any future HA work.
DEPL-M5 (Med) — per-fleet-size resource ladder documentation
deploy/helm/certctl/values.yaml — extended comments next to
server.resources + agent.resources documenting:
"≤ 500 certs / 100 agents" → defaults are validated
"5K certs / 1K agents" → starter suggestions, TBD Phase 8
"50K certs / 10K agents" → starter suggestions, TBD Phase 8
Numbers for the small-fleet case derive from the measured
baselines in docs/operator/performance-baselines.md
(50ms p50, < 3s for 1000-cert inventory walk, etc.). Larger
fleet numbers explicitly marked TBD pending Phase 8 load-test
runs — operators tune empirically until then.
DEPL-L1 (Low) — Helm rollback runbook
docs/operator/runbooks/rollback.md — covers helm rollback
mechanics, the schema-migration manual-cleanup path (when
*.down.sql files apply vs. when full restore is the only safe
path), and the per-migration-class safe-to-rollback table.
DEPL-L2 (Low) — Prometheus AlertManager rules
deploy/helm/certctl/templates/prometheusrules.yaml — opt-in via
monitoring.prometheusRules.enabled=true. Default OFF. Four
starter rules using verified metric names from
internal/api/handler/metrics.go:
CertctlCertificateExpiringSoon (certctl_certificate_expiring_soon)
CertctlAgentOffline ((agent_total - agent_online) > 0 for 1h)
CertctlJobFailureRateHigh (failure rate over 5% for 15m)
CertctlIssuanceFailures (any failures over 15m window)
All thresholds operator-tunable via
monitoring.prometheusRules.thresholds.* in values.
DEPL-L3 (Low) — Prometheus bearer-token setup runbook
docs/operator/runbooks/prometheus-bearer-token.md — documents
the API-key + Secret + values wiring for the RBAC-gated
/api/v1/metrics/prometheus scrape endpoint. End-to-end
procedure with troubleshooting steps + rotation guide.
CI guard: scripts/ci-guards/helm-templates-lint.sh
Six-combo matrix: defaults / backup PVC / backup S3 /
prometheusRules / migrations.viaHook / all-on. Each runs helm
template + checks render success. helm lint also gated.
Wired into the auto-pickup loop in .github/workflows/ci.yml;
azure/setup-helm@b9e51907 (v4.3.0, SHA-pinned per Phase 1
RED-2) installs helm v3.16.0 on the runner.
Verification (all pass):
ls deploy/helm/certctl/templates/{backup-cronjob,migration-job,prometheusrules}.yaml
grep -E 'updateStrategy|podManagementPolicy' deploy/helm/certctl/templates/postgres-statefulset.yaml # 2 matches
helm template deploy/helm/certctl/ --set backup.enabled=true \
--set monitoring.prometheusRules.enabled=true --set migrations.viaHook=true \
| grep -E "kind: (CronJob|PrometheusRule|Job)" # 3 matches
helm lint deploy/helm/certctl/ # 0 failed
ls docs/operator/runbooks/{rollback,prometheus-bearer-token}.md
bash scripts/ci-guards/helm-templates-lint.sh # 6/6 matrix combinations pass
Go build clean (cmd/server compiles, migrate-only path verified by
the build target). YAML validated.
Closes: cowork/certctl-architecture-diligence-audit.html#fix-DEPL-H2
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M1
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M4
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-M5
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L1
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L2
cowork/certctl-architecture-diligence-audit.html#fix-DEPL-L3
Certctl Helm Chart
Production-ready Helm chart for deploying certctl (self-hosted certificate lifecycle management platform) on Kubernetes.
Table of Contents
- Quick Start
- Chart Features
- Prerequisites
- Installation
- Configuration
- Usage Examples
- Upgrading
- Uninstalling
- Architecture
- Security Considerations
- Troubleshooting
Quick Start
# Add the chart repository (when available)
helm repo add certctl https://charts.example.com
helm repo update
# Install with default values
helm install certctl certctl/certctl \
--set server.auth.apiKey="your-secure-api-key" \
--set postgresql.auth.password="your-secure-password"
# Check installation status
kubectl get pods -l app.kubernetes.io/instance=certctl
Chart Features
- Server Deployment — certctl control plane with configurable replicas
- PostgreSQL StatefulSet — Persistent database with automatic schema migration
- Agent DaemonSet or Deployment — Flexible agent deployment (per-node or custom replicas)
- Ingress Support — Optional HTTPS ingress with cert-manager integration
- Security Contexts — Non-root containers, read-only filesystems, minimal capabilities
- Resource Limits — Configurable CPU and memory requests/limits
- Health Checks — Liveness and readiness probes on all containers
- ConfigMaps and Secrets — Centralized configuration management
- Service Account and RBAC — Optional cluster role bindings
- Pod Disruption Budgets — HA-ready with configurable disruption budgets
- Monitoring — Optional Prometheus ServiceMonitor support
Prerequisites
- Kubernetes 1.19 or later
- Helm 3.0 or later
- Optional: cert-manager (for automatic TLS certificate provisioning)
- Optional: Prometheus (for metrics scraping)
Installation
1. Using Chart from Repository
helm repo add certctl https://charts.example.com
helm repo update
helm install certctl certctl/certctl -f my-values.yaml
2. Using Local Chart
cd deploy/helm
helm install certctl certctl/ \
--set server.auth.apiKey="$(openssl rand -base64 32)" \
--set postgresql.auth.password="$(openssl rand -base64 32)"
3. Minimal Production Installation
helm install certctl certctl/certctl \
--namespace certctl \
--create-namespace \
--set server.auth.apiKey="change-me" \
--set postgresql.auth.password="change-me" \
--set server.replicas=2 \
--set server.resources.requests.cpu=200m \
--set server.resources.requests.memory=256Mi \
--set ingress.enabled=true \
--set ingress.className=nginx \
--set ingress.hosts[0].host=certctl.example.com
Configuration
Server Configuration
server:
replicas: 1 # Number of server replicas
port: 8443 # Service port
auth:
type: api-key # Authentication type
apiKey: "your-api-key" # REQUIRED for production
logging:
level: info # Log level (debug, info, warn, error)
format: json # Output format
issuer:
local:
enabled: true # Enable local CA issuer
acme:
enabled: false # Enable ACME issuer
directoryURL: "" # ACME directory URL
email: "" # ACME registration email
challengeType: "http-01" # Challenge type (http-01, dns-01, dns-persist-01)
PostgreSQL Configuration
postgresql:
enabled: true # Use managed PostgreSQL
auth:
database: certctl
username: certctl
password: "your-password" # REQUIRED
storage:
size: 10Gi # PVC size
storageClass: "" # Use default StorageClass
Agent Configuration
agent:
enabled: true # Deploy agents
kind: DaemonSet # DaemonSet (one per node) or Deployment
replicas: 1 # For Deployment kind only
discoveryDirs: "" # Comma-separated cert discovery paths
nodeSelector: {} # Node affinity for DaemonSet
Ingress Configuration
ingress:
enabled: false
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: certctl.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: certctl-tls
hosts:
- certctl.example.com
See values.yaml for all available configuration options.
Usage Examples
Example 1: High Availability Setup
# ha-values.yaml
server:
replicas: 3
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 1000m
memory: 512Mi
postgresql:
storage:
size: 50Gi
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/component
operator: In
values: [server]
topologyKey: kubernetes.io/hostname
Deploy with:
helm install certctl certctl/certctl -f ha-values.yaml
Example 2: External PostgreSQL Database
# external-db-values.yaml
postgresql:
enabled: false
server:
env:
CERTCTL_DATABASE_URL: "postgres://user:password@rds.example.com:5432/certctl?sslmode=require"
Deploy with:
helm install certctl certctl/certctl -f external-db-values.yaml
Example 3: ACME + Let's Encrypt
# acme-values.yaml
server:
issuer:
acme:
enabled: true
directoryURL: https://acme-v02.api.letsencrypt.org/directory
email: admin@example.com
challengeType: dns-01
dnsPresentScript: /scripts/dns-present.sh
dnsCleanupScript: /scripts/dns-cleanup.sh
dnsPropagationWait: 30s
Example 4: Email Notifications via Slack + SMTP
# notifications-values.yaml
server:
smtp:
enabled: true
host: smtp.example.com
port: 587
username: certctl@example.com
password: "smtp-password"
fromAddress: certctl@example.com
useTLS: true
notifiers:
slack:
enabled: true
webhookUrl: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
channel: "#certificates"
Upgrading
# Update chart repository
helm repo update
# Upgrade release
helm upgrade certctl certctl/certctl -f values.yaml
# View upgrade history
helm history certctl
# Rollback to previous version
helm rollback certctl 1
Uninstalling
# Delete the release (keeps data by default)
helm uninstall certctl
# Also delete persistent data
kubectl delete pvc --all -l app.kubernetes.io/instance=certctl
# Delete namespace
kubectl delete namespace certctl
Architecture
Components
┌──────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
├──────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────────┐ │
│ │ Ingress/LB │ │ Agent Pod 1 │ │
│ │ (optional) │ │ (DaemonSet) │ │
│ └────────┬────────┘ └──────────────────┘ │
│ │ │
│ ▼ ┌──────────────────┐ │
│ ┌─────────────────────────┐ │ Agent Pod 2 │ │
│ │ Server Deployment │ │ (DaemonSet) │ │
│ │ (1 to N replicas) │ └──────────────────┘ │
│ │ - REST API │ │
│ │ - Scheduler │ ┌──────────────────┐ │
│ │ - UI Dashboard │ │ Agent Pod N │ │
│ └────────┬────────────────┘ │ (DaemonSet) │ │
│ │ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ PostgreSQL StatefulSet │ │
│ │ - Database │ │
│ │ - PVC (persistent) │ │
│ └──────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
Network Communication
- Server → PostgreSQL: Internal cluster DNS (
certctl-postgres:5432) - Agent → Server: Internal cluster DNS (
certctl-server:8443) - External → Server: Via Ingress or Service (ClusterIP/LoadBalancer/NodePort)
Security Considerations
1. Secrets Management
All sensitive data is stored in Kubernetes Secrets:
- PostgreSQL credentials
- API keys
- SMTP passwords
- ACME account secrets
Best Practices:
- Use sealed-secrets or external-secrets operator
- Enable encryption at rest in etcd
- Rotate secrets regularly
# Example: Using sealed-secrets
kubectl create secret generic certctl-api-key --from-literal=api-key="$(openssl rand -base64 32)" --dry-run=client -o yaml | kubeseal -f - | kubectl apply -f -
2. RBAC
The chart creates minimal RBAC by default:
- ServiceAccount per release
- ClusterRole (empty, extensible)
- ClusterRoleBinding
To restrict further:
rbac:
create: true
# Add specific rules here
3. Pod Security
All containers run with:
- Non-root user (UID 1000)
- Read-only root filesystem
- No privilege escalation
- Dropped capabilities (ALL)
4. Network Policies
Restrict pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: certctl-default-deny
spec:
podSelector:
matchLabels:
app.kubernetes.io/instance: certctl
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: certctl
egress:
- to:
- namespaceSelector:
matchLabels:
name: certctl
- to:
- podSelector: {}
ports:
- protocol: TCP
port: 53 # DNS
- protocol: UDP
port: 53
5. TLS/HTTPS
Enable HTTPS with cert-manager:
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true
Then configure Ingress with TLS.
6. API Key Security
For production:
- Generate a strong API key:
openssl rand -base64 32 - Store securely (Vault, sealed-secrets, etc.)
- Never commit to Git
- Rotate periodically
# Generate and deploy API key
NEW_KEY=$(openssl rand -base64 32)
kubectl patch secret certctl-server -p "{\"data\":{\"api-key\":\"$(echo -n $NEW_KEY | base64)\"}}"
Troubleshooting
1. Pods Not Starting
# Check pod status
kubectl get pods -l app.kubernetes.io/instance=certctl
kubectl describe pod <pod-name>
kubectl logs <pod-name>
2. Database Connection Issues
# Verify PostgreSQL is running
kubectl get pods -l app.kubernetes.io/component=postgres
kubectl logs -l app.kubernetes.io/component=postgres
# Test connection from server pod
kubectl exec -it <server-pod> -- \
psql postgres://certctl:password@certctl-postgres:5432/certctl
3. Agent Not Connecting
# Check agent logs
kubectl logs -l app.kubernetes.io/component=agent
# Verify server is reachable
kubectl exec -it <agent-pod> -- \
wget -q -O - http://certctl-server:8443/health
4. Persistent Data Loss
# Check PVC status
kubectl get pvc
# Verify data is being stored
kubectl exec -it <postgres-pod> -- \
ls -lah /var/lib/postgresql/data/postgres
5. Permission Denied Errors
The chart runs containers as non-root (UID 1000). If you see permission errors:
# Temporarily allow root for debugging
server:
securityContext:
runAsUser: 0 # NOT FOR PRODUCTION
6. Out of Memory
Increase resource limits:
helm upgrade certctl certctl/certctl \
--set server.resources.limits.memory=1Gi \
--set postgresql.resources.limits.memory=2Gi
7. Certificate Validation Issues
For self-signed certificates:
kubectl exec -it <pod> -- \
CERTCTL_TLS_INSECURE_SKIP_VERIFY=true <command>
Common Issues and Solutions
| Issue | Solution |
|---|---|
ImagePullBackOff |
Update server.image.repository to your registry |
CrashLoopBackOff |
Check logs with kubectl logs <pod> |
Pending PVC |
Check storage class availability |
| Connection timeout | Verify network policies and service DNS |
| High memory usage | Adjust postgresql.resources.limits and server.resources.limits |
Support and Contributing
For issues, questions, or contributions, visit:
- GitHub: https://github.com/certctl-io/certctl
- Documentation: https://github.com/certctl-io/certctl/tree/main/docs
License
BSL-1.1