Files
certctl/deploy/helm
shankar0123 a41fc2d75c feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)
Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.

Signature ground-truth (sprint 13.2 + 13.3)
===========================================
Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.

What changed
============

internal/config/server.go (+40 LOC):
  - Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
    time.Duration` to RateLimitConfig with full operator-facing
    documentation of the two valid values (memory|postgres) +
    when-to-use-which decision tree.

internal/config/config.go (+27 LOC):
  - Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
  - Validate() rejects anything other than ""/"memory"/"postgres"
    (empty = memory equivalence for test-built Configs that bypass
    Load()). Janitor interval must be ≥ 1 minute when set.
  - Failure modes return clear ::error:: with the env-var name + the
    valid values, so an operator typo ("postgress" → memory in a
    3-replica cluster) fails fast at startup.

internal/ratelimit/factory.go (NEW, 67 LOC):
  - NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
    factory the 6 cmd/server/main.go call sites route through.
  - Drop-in signature: same maxN/window/mapCap as
    NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
    — the rate_limit_buckets table grows until the janitor sweeps).
  - Defensive panic on unknown backend (config.Validate is SoT;
    this is belt-and-suspenders).

internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
  - PostgresGC struct + NewPostgresGC + GarbageCollect.
  - Single-statement DELETE FROM rate_limit_buckets WHERE
    updated_at < NOW() - maxWindow. Idempotent.
  - maxWindow <= 0 is a no-op (operator opt-out).

internal/scheduler/scheduler.go (+90 LOC):
  - New RateLimitGarbageCollector interface (mirrors the
    ACMEGarbageCollector / SessionGarbageCollector contracts).
  - rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
    on Scheduler.
  - SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
    Setters following the existing acmeGC/sessionGC pattern.
  - rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
    per-tick context.WithTimeout(1m). Logs row count at Debug.
  - Loop counted in the Start() WaitGroup only when the GC is
    non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
    when backend=memory so the loop never launches for that case.

cmd/server/main.go (35 LOC diff):
  - All 6 ratelimit.NewSlidingWindowLimiter call sites now route
    through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
    db, ...). Grep verification post-fix returns ZERO hits.
  - Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
    exportLimiter (1068), EST failed-basic (1535), EST per-principal
    SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
    intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved
    — its inner type-alias wrapper is the prompt's
    out-of-scope (cmd/server/*.go only).
  - Conditionally constructs PostgresGC + wires the scheduler janitor
    when backend=postgres; logs the wiring decision either way so
    operators see "rate-limit GC sweep enabled (postgres backend)"
    or "in-memory backend self-prunes" in the boot log.

internal/api/handler/{est,export,certificates,auth_breakglass}.go:
  - Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
    with ratelimit.Limiter (the interface). Allow() satisfies the
    same call shape on both backends; the in-memory tests that
    construct *SlidingWindowLimiter still compile because the
    concrete type satisfies the interface (compile-time check in
    internal/ratelimit/limiter.go pins this).

docs/operator/observability.md (176 LOC diff):
  - Replaced the "per-process, in-memory, reset-on-restart, not
    shared across replicas" paragraph with the new
    configurable-backend section: operator decision tree,
    backend internals (memory vs postgres), janitor description,
    falsifiable closure proof (the Sprint 13.2 integration test
    name + invocation), helm chart wiring example.
  - Updated inventory to reflect the actual handler file paths +
    actual cap configurations (the prior doc said "60s window" for
    several limiters that actually use 60m / 24h windows).
  - Doc smoke confirmed: grep -c 'per-process, in-memory,
    reset-on-restart' docs/operator/observability.md = 0.

deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
  - Exposed server.rateLimiting.backend (default "memory") +
    server.rateLimiting.janitorInterval (default "5m") under the
    existing rateLimiting block.
  - ConfigMap renders both as rate-limit-backend +
    rate-limit-janitor-interval keys.
  - Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
    CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
  - Helm render: `helm template deploy/helm/certctl --set
    server.rateLimiting.backend=postgres` shows the env-var on the
    server-deployment.yaml output.

.github/workflows/ci.yml (+12 LOC):
  - Added a new step in the Go Build & Test job that runs the
    Sprint 13.2 multi-replica integration test
    (TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
    -tags=integration -race -timeout=300s. Fails the CI status check
    if the cross-replica row lock ever stops arbitrating across
    replicas — the ARCH-M1 closure regression gate.

Verification (all green locally; postgres integration via CI)
============================================================

  $ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
    (zero hits — Sprint 13.3 receipt)

  $ go test -short -count=1 \
      ./internal/config/... ./internal/ratelimit/... \
      ./internal/scheduler/... ./internal/api/handler/... \
      ./cmd/server/...
    ok  internal/config       1.177s
    ok  internal/ratelimit    0.007s
    ok  internal/scheduler    9.165s
    ok  internal/api/handler  6.245s
    ok  cmd/server            0.390s

  $ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
      ./internal/config/... ./internal/api/handler/... ./cmd/server/...
    (clean)

  $ gofmt -l internal/ cmd/server/
    (clean)

  $ grep -c 'per-process, in-memory, reset-on-restart' \
      docs/operator/observability.md
    0   (doc smoke — the audit's verbatim phrasing is gone)

  $ bash scripts/ci-guards/G-3-env-docs-drift.sh
    G-3 env-docs-drift: clean.

  $ bash scripts/ci-guards/complete-path-config-coverage.sh
    OK — every CERTCTL_* env var (197) has at least one non-config-
    package consumer.

Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.

Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.

Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
2026-05-14 11:52:13 +00:00
..

Certctl Helm Chart

Production-ready Helm chart for deploying certctl (self-hosted certificate lifecycle management platform) on Kubernetes.

Table of Contents

  1. Quick Start
  2. Chart Features
  3. Prerequisites
  4. Installation
  5. Configuration
  6. Usage Examples
  7. Upgrading
  8. Uninstalling
  9. Architecture
  10. Security Considerations
  11. Troubleshooting

Quick Start

# Add the chart repository (when available)
helm repo add certctl https://charts.example.com
helm repo update

# Install with default values
helm install certctl certctl/certctl \
  --set server.auth.apiKey="your-secure-api-key" \
  --set postgresql.auth.password="your-secure-password"

# Check installation status
kubectl get pods -l app.kubernetes.io/instance=certctl

Chart Features

  • Server Deployment — certctl control plane with configurable replicas
  • PostgreSQL StatefulSet — Persistent database with automatic schema migration
  • Agent DaemonSet or Deployment — Flexible agent deployment (per-node or custom replicas)
  • Ingress Support — Optional HTTPS ingress with cert-manager integration
  • Security Contexts — Non-root containers, read-only filesystems, minimal capabilities
  • Resource Limits — Configurable CPU and memory requests/limits
  • Health Checks — Liveness and readiness probes on all containers
  • ConfigMaps and Secrets — Centralized configuration management
  • Service Account and RBAC — Optional cluster role bindings
  • Pod Disruption Budgets — HA-ready with configurable disruption budgets
  • Monitoring — Optional Prometheus ServiceMonitor support

Prerequisites

  • Kubernetes 1.19 or later
  • Helm 3.0 or later
  • Optional: cert-manager (for automatic TLS certificate provisioning)
  • Optional: Prometheus (for metrics scraping)

Installation

1. Using Chart from Repository

helm repo add certctl https://charts.example.com
helm repo update
helm install certctl certctl/certctl -f my-values.yaml

2. Using Local Chart

cd deploy/helm
helm install certctl certctl/ \
  --set server.auth.apiKey="$(openssl rand -base64 32)" \
  --set postgresql.auth.password="$(openssl rand -base64 32)"

3. Minimal Production Installation

helm install certctl certctl/certctl \
  --namespace certctl \
  --create-namespace \
  --set server.auth.apiKey="change-me" \
  --set postgresql.auth.password="change-me" \
  --set server.replicas=2 \
  --set server.resources.requests.cpu=200m \
  --set server.resources.requests.memory=256Mi \
  --set ingress.enabled=true \
  --set ingress.className=nginx \
  --set ingress.hosts[0].host=certctl.example.com

Configuration

Server Configuration

server:
  replicas: 1                    # Number of server replicas
  port: 8443                     # Service port
  auth:
    type: api-key               # Authentication type
    apiKey: "your-api-key"      # REQUIRED for production
  logging:
    level: info                 # Log level (debug, info, warn, error)
    format: json                # Output format
  issuer:
    local:
      enabled: true             # Enable local CA issuer
    acme:
      enabled: false            # Enable ACME issuer
      directoryURL: ""          # ACME directory URL
      email: ""                 # ACME registration email
      challengeType: "http-01"  # Challenge type (http-01, dns-01, dns-persist-01)

PostgreSQL Configuration

postgresql:
  enabled: true                 # Use managed PostgreSQL
  auth:
    database: certctl
    username: certctl
    password: "your-password"   # REQUIRED
  storage:
    size: 10Gi                  # PVC size
    storageClass: ""            # Use default StorageClass

Agent Configuration

agent:
  enabled: true                 # Deploy agents
  kind: DaemonSet              # DaemonSet (one per node) or Deployment
  replicas: 1                  # For Deployment kind only
  discoveryDirs: ""            # Comma-separated cert discovery paths
  nodeSelector: {}             # Node affinity for DaemonSet

Ingress Configuration

ingress:
  enabled: false
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - host: certctl.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: certctl-tls
      hosts:
        - certctl.example.com

See values.yaml for all available configuration options.

Usage Examples

Example 1: High Availability Setup

# ha-values.yaml
server:
  replicas: 3
  resources:
    requests:
      cpu: 250m
      memory: 256Mi
    limits:
      cpu: 1000m
      memory: 512Mi

postgresql:
  storage:
    size: 50Gi

podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: app.kubernetes.io/component
            operator: In
            values: [server]
      topologyKey: kubernetes.io/hostname

Deploy with:

helm install certctl certctl/certctl -f ha-values.yaml

Example 2: External PostgreSQL Database

# external-db-values.yaml
postgresql:
  enabled: false

server:
  env:
    CERTCTL_DATABASE_URL: "postgres://user:password@rds.example.com:5432/certctl?sslmode=require"

Deploy with:

helm install certctl certctl/certctl -f external-db-values.yaml

Example 3: ACME + Let's Encrypt

# acme-values.yaml
server:
  issuer:
    acme:
      enabled: true
      directoryURL: https://acme-v02.api.letsencrypt.org/directory
      email: admin@example.com
      challengeType: dns-01
      dnsPresentScript: /scripts/dns-present.sh
      dnsCleanupScript: /scripts/dns-cleanup.sh
      dnsPropagationWait: 30s

Example 4: Email Notifications via Slack + SMTP

# notifications-values.yaml
server:
  smtp:
    enabled: true
    host: smtp.example.com
    port: 587
    username: certctl@example.com
    password: "smtp-password"
    fromAddress: certctl@example.com
    useTLS: true

  notifiers:
    slack:
      enabled: true
      webhookUrl: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
      channel: "#certificates"

Upgrading

# Update chart repository
helm repo update

# Upgrade release
helm upgrade certctl certctl/certctl -f values.yaml

# View upgrade history
helm history certctl

# Rollback to previous version
helm rollback certctl 1

Uninstalling

# Delete the release (keeps data by default)
helm uninstall certctl

# Also delete persistent data
kubectl delete pvc --all -l app.kubernetes.io/instance=certctl

# Delete namespace
kubectl delete namespace certctl

Architecture

Components

┌──────────────────────────────────────────────────────────────┐
│ Kubernetes Cluster                                           │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────┐                 ┌──────────────────┐  │
│  │ Ingress/LB      │                 │  Agent Pod 1     │  │
│  │ (optional)      │                 │  (DaemonSet)     │  │
│  └────────┬────────┘                 └──────────────────┘  │
│           │                                                  │
│           ▼                           ┌──────────────────┐  │
│  ┌─────────────────────────┐          │  Agent Pod 2     │  │
│  │ Server Deployment       │          │  (DaemonSet)     │  │
│  │ (1 to N replicas)       │          └──────────────────┘  │
│  │ - REST API              │                                 │
│  │ - Scheduler             │          ┌──────────────────┐  │
│  │ - UI Dashboard          │          │  Agent Pod N     │  │
│  └────────┬────────────────┘          │  (DaemonSet)     │  │
│           │                           └──────────────────┘  │
│           │                                                  │
│           ▼                                                  │
│  ┌──────────────────────────┐                               │
│  │ PostgreSQL StatefulSet   │                               │
│  │ - Database               │                               │
│  │ - PVC (persistent)       │                               │
│  └──────────────────────────┘                               │
│                                                              │
└──────────────────────────────────────────────────────────────┘

Network Communication

  • Server → PostgreSQL: Internal cluster DNS (certctl-postgres:5432)
  • Agent → Server: Internal cluster DNS (certctl-server:8443)
  • External → Server: Via Ingress or Service (ClusterIP/LoadBalancer/NodePort)

Security Considerations

1. Secrets Management

All sensitive data is stored in Kubernetes Secrets:

  • PostgreSQL credentials
  • API keys
  • SMTP passwords
  • ACME account secrets

Best Practices:

  • Use sealed-secrets or external-secrets operator
  • Enable encryption at rest in etcd
  • Rotate secrets regularly
# Example: Using sealed-secrets
kubectl create secret generic certctl-api-key --from-literal=api-key="$(openssl rand -base64 32)" --dry-run=client -o yaml | kubeseal -f - | kubectl apply -f -

2. RBAC

The chart creates minimal RBAC by default:

  • ServiceAccount per release
  • ClusterRole (empty, extensible)
  • ClusterRoleBinding

To restrict further:

rbac:
  create: true
  # Add specific rules here

3. Pod Security

All containers run with:

  • Non-root user (UID 1000)
  • Read-only root filesystem
  • No privilege escalation
  • Dropped capabilities (ALL)

4. Network Policies

Restrict pod-to-pod communication:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: certctl-default-deny
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/instance: certctl
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: certctl
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: certctl
    - to:
        - podSelector: {}
      ports:
        - protocol: TCP
          port: 53  # DNS
        - protocol: UDP
          port: 53

5. TLS/HTTPS

Enable HTTPS with cert-manager:

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --set installCRDs=true

Then configure Ingress with TLS.

6. API Key Security

For production:

  1. Generate a strong API key: openssl rand -base64 32
  2. Store securely (Vault, sealed-secrets, etc.)
  3. Never commit to Git
  4. Rotate periodically
# Generate and deploy API key
NEW_KEY=$(openssl rand -base64 32)
kubectl patch secret certctl-server -p "{\"data\":{\"api-key\":\"$(echo -n $NEW_KEY | base64)\"}}"

Troubleshooting

1. Pods Not Starting

# Check pod status
kubectl get pods -l app.kubernetes.io/instance=certctl
kubectl describe pod <pod-name>
kubectl logs <pod-name>

2. Database Connection Issues

# Verify PostgreSQL is running
kubectl get pods -l app.kubernetes.io/component=postgres
kubectl logs -l app.kubernetes.io/component=postgres

# Test connection from server pod
kubectl exec -it <server-pod> -- \
  psql postgres://certctl:password@certctl-postgres:5432/certctl

3. Agent Not Connecting

# Check agent logs
kubectl logs -l app.kubernetes.io/component=agent

# Verify server is reachable
kubectl exec -it <agent-pod> -- \
  wget -q -O - http://certctl-server:8443/health

4. Persistent Data Loss

# Check PVC status
kubectl get pvc

# Verify data is being stored
kubectl exec -it <postgres-pod> -- \
  ls -lah /var/lib/postgresql/data/postgres

5. Permission Denied Errors

The chart runs containers as non-root (UID 1000). If you see permission errors:

# Temporarily allow root for debugging
server:
  securityContext:
    runAsUser: 0  # NOT FOR PRODUCTION

6. Out of Memory

Increase resource limits:

helm upgrade certctl certctl/certctl \
  --set server.resources.limits.memory=1Gi \
  --set postgresql.resources.limits.memory=2Gi

7. Certificate Validation Issues

For self-signed certificates:

kubectl exec -it <pod> -- \
  CERTCTL_TLS_INSECURE_SKIP_VERIFY=true <command>

Common Issues and Solutions

Issue Solution
ImagePullBackOff Update server.image.repository to your registry
CrashLoopBackOff Check logs with kubectl logs <pod>
Pending PVC Check storage class availability
Connection timeout Verify network policies and service DNS
High memory usage Adjust postgresql.resources.limits and server.resources.limits

Support and Contributing

For issues, questions, or contributions, visit:

License

BSL-1.1