mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 16:21:30 +00:00
a41fc2d75c
Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.
Signature ground-truth (sprint 13.2 + 13.3)
===========================================
Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.
What changed
============
internal/config/server.go (+40 LOC):
- Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
time.Duration` to RateLimitConfig with full operator-facing
documentation of the two valid values (memory|postgres) +
when-to-use-which decision tree.
internal/config/config.go (+27 LOC):
- Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
- Validate() rejects anything other than ""/"memory"/"postgres"
(empty = memory equivalence for test-built Configs that bypass
Load()). Janitor interval must be ≥ 1 minute when set.
- Failure modes return clear ::error:: with the env-var name + the
valid values, so an operator typo ("postgress" → memory in a
3-replica cluster) fails fast at startup.
internal/ratelimit/factory.go (NEW, 67 LOC):
- NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
factory the 6 cmd/server/main.go call sites route through.
- Drop-in signature: same maxN/window/mapCap as
NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
— the rate_limit_buckets table grows until the janitor sweeps).
- Defensive panic on unknown backend (config.Validate is SoT;
this is belt-and-suspenders).
internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
- PostgresGC struct + NewPostgresGC + GarbageCollect.
- Single-statement DELETE FROM rate_limit_buckets WHERE
updated_at < NOW() - maxWindow. Idempotent.
- maxWindow <= 0 is a no-op (operator opt-out).
internal/scheduler/scheduler.go (+90 LOC):
- New RateLimitGarbageCollector interface (mirrors the
ACMEGarbageCollector / SessionGarbageCollector contracts).
- rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
on Scheduler.
- SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
Setters following the existing acmeGC/sessionGC pattern.
- rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
per-tick context.WithTimeout(1m). Logs row count at Debug.
- Loop counted in the Start() WaitGroup only when the GC is
non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
when backend=memory so the loop never launches for that case.
cmd/server/main.go (35 LOC diff):
- All 6 ratelimit.NewSlidingWindowLimiter call sites now route
through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
db, ...). Grep verification post-fix returns ZERO hits.
- Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
exportLimiter (1068), EST failed-basic (1535), EST per-principal
SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved
— its inner type-alias wrapper is the prompt's
out-of-scope (cmd/server/*.go only).
- Conditionally constructs PostgresGC + wires the scheduler janitor
when backend=postgres; logs the wiring decision either way so
operators see "rate-limit GC sweep enabled (postgres backend)"
or "in-memory backend self-prunes" in the boot log.
internal/api/handler/{est,export,certificates,auth_breakglass}.go:
- Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
with ratelimit.Limiter (the interface). Allow() satisfies the
same call shape on both backends; the in-memory tests that
construct *SlidingWindowLimiter still compile because the
concrete type satisfies the interface (compile-time check in
internal/ratelimit/limiter.go pins this).
docs/operator/observability.md (176 LOC diff):
- Replaced the "per-process, in-memory, reset-on-restart, not
shared across replicas" paragraph with the new
configurable-backend section: operator decision tree,
backend internals (memory vs postgres), janitor description,
falsifiable closure proof (the Sprint 13.2 integration test
name + invocation), helm chart wiring example.
- Updated inventory to reflect the actual handler file paths +
actual cap configurations (the prior doc said "60s window" for
several limiters that actually use 60m / 24h windows).
- Doc smoke confirmed: grep -c 'per-process, in-memory,
reset-on-restart' docs/operator/observability.md = 0.
deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
- Exposed server.rateLimiting.backend (default "memory") +
server.rateLimiting.janitorInterval (default "5m") under the
existing rateLimiting block.
- ConfigMap renders both as rate-limit-backend +
rate-limit-janitor-interval keys.
- Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
- Helm render: `helm template deploy/helm/certctl --set
server.rateLimiting.backend=postgres` shows the env-var on the
server-deployment.yaml output.
.github/workflows/ci.yml (+12 LOC):
- Added a new step in the Go Build & Test job that runs the
Sprint 13.2 multi-replica integration test
(TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
-tags=integration -race -timeout=300s. Fails the CI status check
if the cross-replica row lock ever stops arbitrating across
replicas — the ARCH-M1 closure regression gate.
Verification (all green locally; postgres integration via CI)
============================================================
$ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
(zero hits — Sprint 13.3 receipt)
$ go test -short -count=1 \
./internal/config/... ./internal/ratelimit/... \
./internal/scheduler/... ./internal/api/handler/... \
./cmd/server/...
ok internal/config 1.177s
ok internal/ratelimit 0.007s
ok internal/scheduler 9.165s
ok internal/api/handler 6.245s
ok cmd/server 0.390s
$ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
./internal/config/... ./internal/api/handler/... ./cmd/server/...
(clean)
$ gofmt -l internal/ cmd/server/
(clean)
$ grep -c 'per-process, in-memory, reset-on-restart' \
docs/operator/observability.md
0 (doc smoke — the audit's verbatim phrasing is gone)
$ bash scripts/ci-guards/G-3-env-docs-drift.sh
G-3 env-docs-drift: clean.
$ bash scripts/ci-guards/complete-path-config-coverage.sh
OK — every CERTCTL_* env var (197) has at least one non-config-
package consumer.
Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.
Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.
Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
Certctl Helm Chart
Production-ready Helm chart for deploying certctl (self-hosted certificate lifecycle management platform) on Kubernetes.
Table of Contents
- Quick Start
- Chart Features
- Prerequisites
- Installation
- Configuration
- Usage Examples
- Upgrading
- Uninstalling
- Architecture
- Security Considerations
- Troubleshooting
Quick Start
# Add the chart repository (when available)
helm repo add certctl https://charts.example.com
helm repo update
# Install with default values
helm install certctl certctl/certctl \
--set server.auth.apiKey="your-secure-api-key" \
--set postgresql.auth.password="your-secure-password"
# Check installation status
kubectl get pods -l app.kubernetes.io/instance=certctl
Chart Features
- Server Deployment — certctl control plane with configurable replicas
- PostgreSQL StatefulSet — Persistent database with automatic schema migration
- Agent DaemonSet or Deployment — Flexible agent deployment (per-node or custom replicas)
- Ingress Support — Optional HTTPS ingress with cert-manager integration
- Security Contexts — Non-root containers, read-only filesystems, minimal capabilities
- Resource Limits — Configurable CPU and memory requests/limits
- Health Checks — Liveness and readiness probes on all containers
- ConfigMaps and Secrets — Centralized configuration management
- Service Account and RBAC — Optional cluster role bindings
- Pod Disruption Budgets — HA-ready with configurable disruption budgets
- Monitoring — Optional Prometheus ServiceMonitor support
Prerequisites
- Kubernetes 1.19 or later
- Helm 3.0 or later
- Optional: cert-manager (for automatic TLS certificate provisioning)
- Optional: Prometheus (for metrics scraping)
Installation
1. Using Chart from Repository
helm repo add certctl https://charts.example.com
helm repo update
helm install certctl certctl/certctl -f my-values.yaml
2. Using Local Chart
cd deploy/helm
helm install certctl certctl/ \
--set server.auth.apiKey="$(openssl rand -base64 32)" \
--set postgresql.auth.password="$(openssl rand -base64 32)"
3. Minimal Production Installation
helm install certctl certctl/certctl \
--namespace certctl \
--create-namespace \
--set server.auth.apiKey="change-me" \
--set postgresql.auth.password="change-me" \
--set server.replicas=2 \
--set server.resources.requests.cpu=200m \
--set server.resources.requests.memory=256Mi \
--set ingress.enabled=true \
--set ingress.className=nginx \
--set ingress.hosts[0].host=certctl.example.com
Configuration
Server Configuration
server:
replicas: 1 # Number of server replicas
port: 8443 # Service port
auth:
type: api-key # Authentication type
apiKey: "your-api-key" # REQUIRED for production
logging:
level: info # Log level (debug, info, warn, error)
format: json # Output format
issuer:
local:
enabled: true # Enable local CA issuer
acme:
enabled: false # Enable ACME issuer
directoryURL: "" # ACME directory URL
email: "" # ACME registration email
challengeType: "http-01" # Challenge type (http-01, dns-01, dns-persist-01)
PostgreSQL Configuration
postgresql:
enabled: true # Use managed PostgreSQL
auth:
database: certctl
username: certctl
password: "your-password" # REQUIRED
storage:
size: 10Gi # PVC size
storageClass: "" # Use default StorageClass
Agent Configuration
agent:
enabled: true # Deploy agents
kind: DaemonSet # DaemonSet (one per node) or Deployment
replicas: 1 # For Deployment kind only
discoveryDirs: "" # Comma-separated cert discovery paths
nodeSelector: {} # Node affinity for DaemonSet
Ingress Configuration
ingress:
enabled: false
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
hosts:
- host: certctl.example.com
paths:
- path: /
pathType: Prefix
tls:
- secretName: certctl-tls
hosts:
- certctl.example.com
See values.yaml for all available configuration options.
Usage Examples
Example 1: High Availability Setup
# ha-values.yaml
server:
replicas: 3
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 1000m
memory: 512Mi
postgresql:
storage:
size: 50Gi
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/component
operator: In
values: [server]
topologyKey: kubernetes.io/hostname
Deploy with:
helm install certctl certctl/certctl -f ha-values.yaml
Example 2: External PostgreSQL Database
# external-db-values.yaml
postgresql:
enabled: false
server:
env:
CERTCTL_DATABASE_URL: "postgres://user:password@rds.example.com:5432/certctl?sslmode=require"
Deploy with:
helm install certctl certctl/certctl -f external-db-values.yaml
Example 3: ACME + Let's Encrypt
# acme-values.yaml
server:
issuer:
acme:
enabled: true
directoryURL: https://acme-v02.api.letsencrypt.org/directory
email: admin@example.com
challengeType: dns-01
dnsPresentScript: /scripts/dns-present.sh
dnsCleanupScript: /scripts/dns-cleanup.sh
dnsPropagationWait: 30s
Example 4: Email Notifications via Slack + SMTP
# notifications-values.yaml
server:
smtp:
enabled: true
host: smtp.example.com
port: 587
username: certctl@example.com
password: "smtp-password"
fromAddress: certctl@example.com
useTLS: true
notifiers:
slack:
enabled: true
webhookUrl: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
channel: "#certificates"
Upgrading
# Update chart repository
helm repo update
# Upgrade release
helm upgrade certctl certctl/certctl -f values.yaml
# View upgrade history
helm history certctl
# Rollback to previous version
helm rollback certctl 1
Uninstalling
# Delete the release (keeps data by default)
helm uninstall certctl
# Also delete persistent data
kubectl delete pvc --all -l app.kubernetes.io/instance=certctl
# Delete namespace
kubectl delete namespace certctl
Architecture
Components
┌──────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
├──────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌──────────────────┐ │
│ │ Ingress/LB │ │ Agent Pod 1 │ │
│ │ (optional) │ │ (DaemonSet) │ │
│ └────────┬────────┘ └──────────────────┘ │
│ │ │
│ ▼ ┌──────────────────┐ │
│ ┌─────────────────────────┐ │ Agent Pod 2 │ │
│ │ Server Deployment │ │ (DaemonSet) │ │
│ │ (1 to N replicas) │ └──────────────────┘ │
│ │ - REST API │ │
│ │ - Scheduler │ ┌──────────────────┐ │
│ │ - UI Dashboard │ │ Agent Pod N │ │
│ └────────┬────────────────┘ │ (DaemonSet) │ │
│ │ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ PostgreSQL StatefulSet │ │
│ │ - Database │ │
│ │ - PVC (persistent) │ │
│ └──────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────┘
Network Communication
- Server → PostgreSQL: Internal cluster DNS (
certctl-postgres:5432) - Agent → Server: Internal cluster DNS (
certctl-server:8443) - External → Server: Via Ingress or Service (ClusterIP/LoadBalancer/NodePort)
Security Considerations
1. Secrets Management
All sensitive data is stored in Kubernetes Secrets:
- PostgreSQL credentials
- API keys
- SMTP passwords
- ACME account secrets
Best Practices:
- Use sealed-secrets or external-secrets operator
- Enable encryption at rest in etcd
- Rotate secrets regularly
# Example: Using sealed-secrets
kubectl create secret generic certctl-api-key --from-literal=api-key="$(openssl rand -base64 32)" --dry-run=client -o yaml | kubeseal -f - | kubectl apply -f -
2. RBAC
The chart creates minimal RBAC by default:
- ServiceAccount per release
- ClusterRole (empty, extensible)
- ClusterRoleBinding
To restrict further:
rbac:
create: true
# Add specific rules here
3. Pod Security
All containers run with:
- Non-root user (UID 1000)
- Read-only root filesystem
- No privilege escalation
- Dropped capabilities (ALL)
4. Network Policies
Restrict pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: certctl-default-deny
spec:
podSelector:
matchLabels:
app.kubernetes.io/instance: certctl
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: certctl
egress:
- to:
- namespaceSelector:
matchLabels:
name: certctl
- to:
- podSelector: {}
ports:
- protocol: TCP
port: 53 # DNS
- protocol: UDP
port: 53
5. TLS/HTTPS
Enable HTTPS with cert-manager:
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--create-namespace \
--set installCRDs=true
Then configure Ingress with TLS.
6. API Key Security
For production:
- Generate a strong API key:
openssl rand -base64 32 - Store securely (Vault, sealed-secrets, etc.)
- Never commit to Git
- Rotate periodically
# Generate and deploy API key
NEW_KEY=$(openssl rand -base64 32)
kubectl patch secret certctl-server -p "{\"data\":{\"api-key\":\"$(echo -n $NEW_KEY | base64)\"}}"
Troubleshooting
1. Pods Not Starting
# Check pod status
kubectl get pods -l app.kubernetes.io/instance=certctl
kubectl describe pod <pod-name>
kubectl logs <pod-name>
2. Database Connection Issues
# Verify PostgreSQL is running
kubectl get pods -l app.kubernetes.io/component=postgres
kubectl logs -l app.kubernetes.io/component=postgres
# Test connection from server pod
kubectl exec -it <server-pod> -- \
psql postgres://certctl:password@certctl-postgres:5432/certctl
3. Agent Not Connecting
# Check agent logs
kubectl logs -l app.kubernetes.io/component=agent
# Verify server is reachable
kubectl exec -it <agent-pod> -- \
wget -q -O - http://certctl-server:8443/health
4. Persistent Data Loss
# Check PVC status
kubectl get pvc
# Verify data is being stored
kubectl exec -it <postgres-pod> -- \
ls -lah /var/lib/postgresql/data/postgres
5. Permission Denied Errors
The chart runs containers as non-root (UID 1000). If you see permission errors:
# Temporarily allow root for debugging
server:
securityContext:
runAsUser: 0 # NOT FOR PRODUCTION
6. Out of Memory
Increase resource limits:
helm upgrade certctl certctl/certctl \
--set server.resources.limits.memory=1Gi \
--set postgresql.resources.limits.memory=2Gi
7. Certificate Validation Issues
For self-signed certificates:
kubectl exec -it <pod> -- \
CERTCTL_TLS_INSECURE_SKIP_VERIFY=true <command>
Common Issues and Solutions
| Issue | Solution |
|---|---|
ImagePullBackOff |
Update server.image.repository to your registry |
CrashLoopBackOff |
Check logs with kubectl logs <pod> |
Pending PVC |
Check storage class availability |
| Connection timeout | Verify network policies and service DNS |
| High memory usage | Adjust postgresql.resources.limits and server.resources.limits |
Support and Contributing
For issues, questions, or contributions, visit:
- GitHub: https://github.com/certctl-io/certctl
- Documentation: https://github.com/certctl-io/certctl/tree/main/docs
License
BSL-1.1