# Certctl Architecture ## Overview Certctl is a certificate management platform with a **decoupled control-plane and agent architecture**. The control plane orchestrates certificate issuance and renewal, while stateless agents deployed across your infrastructure handle certificate generation, deployment, and renewal without exposing private keys to the control plane. ### Design Principles 1. **Zero Private Key Exposure** — Private keys generated and managed only on agents 2. **Decoupled Operations** — Agents operate autonomously; control plane is optional for agent function 3. **Audit-First** — Complete traceability of all issuance, deployment, and rotation events 4. **Connector Architecture** — Pluggable issuers, targets, and notifiers for extensibility 5. **Self-Hosted** — No cloud lock-in; run on Kubernetes, Docker, or bare metal --- ## System Components ### Control Plane The control plane is a REST API server backed by PostgreSQL. It: - **Manages state**: Certificates, agents, targets, issuers, policies - **Orchestrates issuance**: Coordinates with ACME/PKI issuers - **Tracks jobs**: Certificate issuance, renewal, and deployment workflows - **Audits all actions**: Immutable audit trail for compliance - **Dispatches work**: Schedules renewal checks and deployment jobs **Deployment Options**: Single binary, Docker container, Kubernetes deployment ### Agents Lightweight agents deployed on or near your infrastructure. They: - **Generate certificates**: Create private keys and certificate requests - **Deploy certificates**: Push certs to NGINX, F5, IIS, etc. - **Manage credentials**: Store and rotate API keys with control plane - **Report status**: Health checks and job completion status - **Operate independently**: Continue functioning even if control plane is unreachable **Deployment Options**: Container, systemd service, Kubernetes DaemonSet, Lambda ### PostgreSQL Database Persistent state store: ``` ├── Teams & Ownership │ ├── teams │ └── owners ├── Certificate Management │ ├── certificates │ ├── certificate_versions │ └── renewal_policies ├── Infrastructure │ ├── agents │ ├── targets │ └── target_connections ├── Issuance │ ├── issuers │ ├── jobs │ └── job_steps ├── Monitoring & Audit │ ├── audit_logs │ ├── notifications │ └── deployment_history └── Configuration ├── agent_api_keys └── connector_config ``` --- ## Data Flow: Certificate Lifecycle ### 1. **Create Managed Certificate** ``` User/API │ ├─→ POST /api/v1/certificates │ { │ "domain": "api.example.com", │ "issuer_id": "issuer-001", │ "target_ids": ["nginx-prod-01"], │ "renewal_days_before": 30 │ } │ └─→ Control Plane ├─ Insert certificate record ├─ Create initial job ├─ Log audit event └─ Return cert ID + API response ``` ### 2. **Agent Requests Certificate (CSR → Issuance)** ``` Agent Control Plane ACME Issuer │ │ │ ├─ POST /api/v1/csr │ │ │ { │ │ │ "cert_id": "cert-123", │ │ │ "csr": "-----BEGIN CSR..." │ │ │ } │ │ │ ├─ Validate CSR │ │ │ │ │ ├─ POST /directory/new-order │ │ ├──────────────────────────────→ │ │ │ │ │← Poll challenges │ │ ├──────────────────────────────→ │ │ │ │ ├─ POST /acme/finalize │ │ ├──────────────────────────────→ │ │ │ │← Certificate + chain │← Signed certificate │ ├─────────────────────────────────│ │ │ │ │ ├─ Store locally: │ │ │ /etc/certctl/api.example.com/ │ │ │ ├─ cert.pem │ │ │ ├─ key.pem (never sent back) │ │ │ └─ chain.pem │ │ │ │ │ └─ POST /api/v1/deployments │ │ { "cert_id", "status": "ok" } │ │ ├─ Update cert record │ ├─ Log "issued" event │ └─ Trigger deployment jobs │ ``` ### 3. **Deploy Certificate to Target** ``` Agent Target System │ ├─ Fetch target credentials from config │ ├─ Load certificate: │ - /etc/certctl/api.example.com/cert.pem │ - /etc/certctl/api.example.com/key.pem │ ├─ NGINX (SSH): │ ├─ scp cert.pem → /etc/nginx/ssl/ │ ├─ scp key.pem → /etc/nginx/ssl/ (restricted perms) │ ├─ ssh nginx -s reload │ └─ Verify: curl https://api.example.com/health │ ├─ F5 (HTTPS API): │ ├─ Authenticate with credentials │ ├─ POST /mgmt/tm/ltm/cert {"name": "api.example.com", "cert": "..."} │ ├─ PUT /mgmt/tm/ltm/virtual (update virtual server) │ └─ Verify: F5 configuration updated │ ├─ IIS (WinRM): │ ├─ Import cert to store: Import-PfxCertificate │ ├─ Bind to site: Set-WebBinding │ └─ Verify: Get-WebBinding │ └─ Report deployment status: POST /api/v1/deployments/{id}/status { "status": "success", "deployed_at": "..." } ``` ### 4. **Renewal Check & Rotation** ``` Scheduler (Control Plane) │ ├─ Every hour: SELECT certificates WHERE expiry_date < NOW() + 30 days │ ├─ For each certificate: │ │ │ ├─ Create renewal job │ ├─ Notify agent(s) │ │ │ └─ Agent flow: │ ├─ Generate new CSR │ ├─ Request new certificate │ ├─ Deploy new cert to targets │ ├─ Verify deployment │ └─ Delete old private key from agent │ ├─ Log completion └─ Notify via email/webhook ``` --- ## Connector Architecture Certctl uses **connector interfaces** for extensibility. Connectors are pluggable implementations of specific capabilities. ### Issuer Connector Handles certificate issuance from external PKI systems. ```go type IssuerConnector interface { // GetDirectory returns the ACME directory GetDirectory(ctx context.Context) (*ACMEDirectory, error) // NewAccount registers a new account NewAccount(ctx context.Context, email string) (*Account, error) // NewOrder creates a new certificate order NewOrder(ctx context.Context, identifiers []Identifier) (*Order, error) // GetAuthorization retrieves challenge info GetAuthorization(ctx context.Context, authURL string) (*Authorization, error) // FinalizeOrder submits CSR and gets certificate FinalizeOrder(ctx context.Context, orderURL, csr string) ([]byte, error) } ``` **Built-in Issuers**: - `acme` — ACME v2 protocol (Let's Encrypt, Sectigo, etc.) **Example Usage**: ```yaml issuer: type: acme config: directory_url: https://acme-v02.api.letsencrypt.org/directory email: admin@example.com ``` ### Target Connector Deploys certificates to infrastructure systems. ```go type TargetConnector interface { // Validate tests connectivity and credentials Validate(ctx context.Context) error // Deploy pushes certificate to target Deploy(ctx context.Context, cert *Certificate) error // Remove removes/revokes certificate from target Remove(ctx context.Context, domain string) error // GetStatus checks deployment status GetStatus(ctx context.Context, domain string) (string, error) } ``` **Built-in Targets**: - `nginx` — NGINX via SSH - `f5` — F5 BIG-IP via REST API - `iis` — Microsoft IIS via WinRM **Example Usage**: ```yaml target: type: nginx config: host: web01.prod.internal ssh_user: deploy ssh_key: /etc/certctl/keys/deploy.pem cert_path: /etc/nginx/ssl ``` ### Notifier Connector Sends notifications about certificate events. ```go type NotifierConnector interface { // Send delivers a notification Send(ctx context.Context, notif *Notification) error // Validate checks configuration Validate(ctx context.Context) error } ``` **Built-in Notifiers**: - `email` — SMTP email - `webhook` — HTTP webhooks **Example Usage**: ```yaml notifier: type: email config: smtp_host: smtp.example.com smtp_port: 587 username: alerts@example.com password: "***" from_address: certctl@example.com recipients: - ops@example.com - security@example.com ``` --- ## Job Lifecycle & States Jobs represent work to be done: certificate issuance, renewal, deployment, etc. ``` ┌──────────┐ │ PENDING │ Job created, waiting to be processed └────┬─────┘ │ ↓ ┌──────────┐ │ RUNNING │ Job in progress (CSR generation, issuance, deployment) └────┬─────┘ │ ├─→ SUCCESS ──→ COMPLETED (job done, no errors) │ ├─→ FAILURE ──→ FAILED (error occurred, may retry) │ └─→ CANCEL ───→ CANCELLED (user or scheduler cancelled) Additional states: • RETRY_WAIT — Backoff before retry • ABANDONED — Max retries exceeded ``` ### Job Steps Complex jobs are broken into steps: ``` Issuance Job │ ├─ Step 1: Notify agent of CSR request │ Status: COMPLETED │ ├─ Step 2: Wait for CSR from agent │ Status: RUNNING (timeout: 5 min) │ ├─ Step 3: Submit to ACME issuer │ Status: PENDING │ ├─ Step 4: Poll for certificate │ Status: PENDING │ └─ Step 5: Trigger deployment jobs Status: PENDING ``` --- ## Security Model ### Private Key Management ``` Private Key Lifecycle │ ├─ GENERATED on Agent (never sent to control plane) │ └─ Location: /etc/certctl/domains/{domain}/key.pem │ ├─ STORED on Agent │ ├─ File permissions: 0600 (agent user only) │ └─ Encrypted at rest (optional, per deployment) │ ├─ USED on Agent for: │ ├─ Deployment to targets │ └─ Certificate renewal │ └─ DELETED on Agent ├─ Old key deleted after successful renewal └─ Manual revocation on agent removal ``` ### Authentication & Authorization **Agent-to-Server**: - API Key (registered at agent creation) - mTLS optional for high-security deployments - All API calls include agent ID + API key **Server-to-External Systems**: - ACME: ACME protocol with account key - NGINX: SSH key authentication - F5: Username/password or token - IIS: WinRM with encrypted credentials ### Audit Logging Every action is logged: ```json { "id": "audit-98765", "timestamp": "2024-03-14T10:30:00Z", "actor": { "type": "agent", "id": "agent-prod-01" }, "action": "certificate_issued", "resource": { "type": "certificate", "id": "cert-api-example-com" }, "status": "success", "details": { "issuer": "acme/letsencrypt", "expiry": "2024-06-12T10:30:00Z", "deployed_to": ["nginx-prod-01"] } } ``` **Query examples**: - All actions by agent: `GET /audit/logs?actor_type=agent&actor_id=agent-001` - All deployments: `GET /audit/logs?action=certificate_deployed` - Last 30 days: `GET /audit/logs?from=2024-02-12` ### Data Encryption at Rest Optional encryption for sensitive fields: - Passwords in connector configs - API keys - ACME account keys Uses AES-256-GCM with per-row nonce. --- ## Scaling Considerations ### Control Plane Scaling **Single Server Limits**: - ~1000 agents (verified in testing) - ~10,000 managed certificates - ~100,000 audit log entries per day **Horizontal Scaling** (future): - Multiple server instances behind load balancer - Shared PostgreSQL backend - Distributed job queue (Redis/RabbitMQ) ### Agent Scaling Agents are stateless and scale horizontally: - Each agent processes certificates independently - Scheduler distributes renewal checks across agents - No inter-agent communication required ### Database Scaling For large deployments: - Vertical scaling: More CPU/RAM for PostgreSQL - Read replicas: For audit log queries - Partitioning: Audit logs by date - Connection pooling: PgBouncer --- ## Integration Points ### External Integrations ``` Certctl │ ├─→ ACME Servers │ ├─ Let's Encrypt │ ├─ Sectigo │ └─ Internal ACME (optional) │ ├─→ Infrastructure Targets │ ├─ NGINX (SSH) │ ├─ F5 (REST API) │ ├─ IIS (WinRM) │ └─ Kubernetes (future) │ ├─→ Notification Systems │ ├─ SMTP (email) │ ├─ HTTP webhooks │ └─ Slack (future) │ └─→ External Systems ├─ Vault (credential storage) ├─ HashiCorp Consul (service discovery) └─ Prometheus (metrics) ``` ### Internal Component Communication ``` Agent ← → Control Plane ├─ Agent registration ├─ CSR submission ├─ Certificate retrieval ├─ Deployment status └─ Health checks (bidirectional) Scheduler → Services ├─ Certificate renewal ├─ Job processing ├─ Notifications └─ Cleanup tasks ``` --- ## Deployment Topologies ### Single-Node (Development) ``` ┌────────────────────────────┐ │ Server + Agent │ │ ├─ HTTP API (8443) │ │ ├─ PostgreSQL │ │ └─ Agent (test mode) │ └────────────────────────────┘ ``` ### Docker Compose (Local Dev) ``` ┌─────────────────────────────────────┐ │ Docker Network │ │ ├─ certctl-server (8443) │ │ ├─ postgres (5432) │ │ ├─ certctl-agent (managed) │ │ └─ pgadmin (5050, optional) │ └─────────────────────────────────────┘ ``` ### Kubernetes (Production) ``` ┌──────────────────────────────────────────────────┐ │ Kubernetes Cluster │ │ ├─ Deployment: certctl-server (replicas=3) │ │ ├─ DaemonSet: certctl-agent (all nodes) │ │ ├─ StatefulSet: postgres (primary + replica) │ │ ├─ ConfigMap: connector configurations │ │ └─ Secret: API keys, credentials │ └──────────────────────────────────────────────────┘ ``` --- ## Performance Characteristics | Operation | Typical Duration | Bottleneck | |-----------|------------------|-----------| | Certificate request (CSR) | 100-500ms | Agent network latency | | ACME challenge (DNS) | 30-60s | DNS propagation | | ACME finalize | 1-5s | ACME server | | NGINX deployment | 500ms-2s | SSH latency + nginx reload | | F5 deployment | 2-10s | F5 API response | | IIS deployment | 3-15s | WinRM latency | --- ## Future Enhancements - **HSM Support**: Hardware security module integration for ACME account keys - **Multi-Region**: Control plane federation with local agents - **HA Control Plane**: Active-active with etcd-backed state - **Policy Engine**: Advanced renewal and deployment policies - **Certificate Pinning**: HPKP and pin validation - **Metrics**: Prometheus integration for observability --- See [README.md](../README.md) for quick start and [docs/](../) for additional guides.