Certctl Architecture

Overview

Certctl is a certificate management platform built on a decoupled control plane and agent architecture. The control plane orchestrates certificate issuance and renewal. Stateless agents deployed across your infrastructure generate private keys and CSRs, deploy the issued certificates, and carry out renewals, so private keys are never exposed to the control plane.

Design Principles

  1. Zero Private Key Exposure — Private keys generated and managed only on agents
  2. Decoupled Operations — Agents operate autonomously; control plane is optional for agent function
  3. Audit-First — Complete traceability of all issuance, deployment, and rotation events
  4. Connector Architecture — Pluggable issuers, targets, and notifiers for extensibility
  5. Self-Hosted — No cloud lock-in; run on Kubernetes, Docker, or bare metal

System Components

Control Plane

The control plane is a REST API server backed by PostgreSQL. It:

  • Manages state: Certificates, agents, targets, issuers, policies
  • Orchestrates issuance: Coordinates with ACME/PKI issuers
  • Tracks jobs: Certificate issuance, renewal, and deployment workflows
  • Audits all actions: Immutable audit trail for compliance
  • Dispatches work: Schedules renewal checks and deployment jobs

Deployment Options: Single binary, Docker container, Kubernetes deployment

Agents

Lightweight agents deployed on or near your infrastructure. They:

  • Generate keys and CSRs: Create private keys and certificate signing requests (CSRs)
  • Deploy certificates: Push certs to NGINX, F5, IIS, etc.
  • Manage credentials: Store and rotate the API keys used to authenticate with the control plane
  • Report status: Health checks and job completion status
  • Operate independently: Continue functioning even if control plane is unreachable

Deployment Options: Container, systemd service, Kubernetes DaemonSet, Lambda

PostgreSQL Database

Persistent state store:

├── Teams & Ownership
│  ├── teams
│  └── owners
├── Certificate Management
│  ├── certificates
│  ├── certificate_versions
│  └── renewal_policies
├── Infrastructure
│  ├── agents
│  ├── targets
│  └── target_connections
├── Issuance
│  ├── issuers
│  ├── jobs
│  └── job_steps
├── Monitoring & Audit
│  ├── audit_logs
│  ├── notifications
│  └── deployment_history
└── Configuration
   ├── agent_api_keys
   └── connector_config

Data Flow: Certificate Lifecycle

1. Create Managed Certificate

User/API
   │
   ├─→ POST /api/v1/certificates
   │    {
   │      "domain": "api.example.com",
   │      "issuer_id": "issuer-001",
   │      "target_ids": ["nginx-prod-01"],
   │      "renewal_days_before": 30
   │    }
   │
   └─→ Control Plane
        ├─ Insert certificate record
        ├─ Create initial job
        ├─ Log audit event
        └─ Return cert ID + API response
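
The request body above maps naturally onto a small Go type. The following is a minimal sketch rather than certctl's actual code: the type name `CreateCertificateRequest`, the helper `parseCreateRequest`, and the validation rules are assumptions; only the JSON field names come from the example payload.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// CreateCertificateRequest mirrors the example payload for
// POST /api/v1/certificates. The type name is hypothetical.
type CreateCertificateRequest struct {
	Domain            string   `json:"domain"`
	IssuerID          string   `json:"issuer_id"`
	TargetIDs         []string `json:"target_ids"`
	RenewalDaysBefore int      `json:"renewal_days_before"`
}

// Validate applies assumed sanity checks before a record would be inserted.
func (r CreateCertificateRequest) Validate() error {
	switch {
	case r.Domain == "":
		return errors.New("domain is required")
	case r.IssuerID == "":
		return errors.New("issuer_id is required")
	case r.RenewalDaysBefore <= 0:
		return errors.New("renewal_days_before must be positive")
	}
	return nil
}

// parseCreateRequest decodes and validates a raw JSON body.
func parseCreateRequest(body []byte) (CreateCertificateRequest, error) {
	var req CreateCertificateRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return req, fmt.Errorf("decode request: %w", err)
	}
	return req, req.Validate()
}

func main() {
	body := []byte(`{"domain":"api.example.com","issuer_id":"issuer-001","target_ids":["nginx-prod-01"],"renewal_days_before":30}`)
	req, err := parseCreateRequest(body)
	if err != nil {
		fmt.Println("invalid request:", err)
		return
	}
	fmt.Printf("create %s via %s, renew %d days before expiry\n", req.Domain, req.IssuerID, req.RenewalDaysBefore)
}
```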

2. Agent Requests Certificate (CSR → Issuance)

Agent                          Control Plane                    ACME Issuer
  │                                  │                               │
  ├─ POST /api/v1/csr               │                               │
  │  {                              │                               │
  │    "cert_id": "cert-123",       │                               │
  │    "csr": "-----BEGIN CSR..."   │                               │
  │  }                              │                               │
  │                                 ├─ Validate CSR                 │
  │                                 │                               │
  │                                 ├─ POST /directory/new-order    │
  │                                 ├──────────────────────────────→
  │                                 │                               │
  │                                 │← Poll challenges              │
  │                                 ├──────────────────────────────→
  │                                 │                               │
  │                                 ├─ POST /acme/finalize         │
  │                                 ├──────────────────────────────→
  │                                 │                               │
  │← Certificate + chain           │← Signed certificate           │
  ├─────────────────────────────────│                               │
  │                                 │                               │
  ├─ Store locally:                │                               │
  │  /etc/certctl/api.example.com/  │                               │
  │   ├─ cert.pem                   │                               │
  │   ├─ key.pem (never sent back)  │                               │
  │   └─ chain.pem                  │                               │
  │                                 │                               │
  └─ POST /api/v1/deployments      │                               │
     { "cert_id", "status": "ok" }  │                               │
                                    ├─ Update cert record           │
                                    ├─ Log "issued" event           │
                                    └─ Trigger deployment jobs      │

3. Deploy Certificate to Target

Agent                          Target System
  │
  ├─ Fetch target credentials from config
  │
  ├─ Load certificate:
  │  - /etc/certctl/api.example.com/cert.pem
  │  - /etc/certctl/api.example.com/key.pem
  │
  ├─ NGINX (SSH):
  │  ├─ scp cert.pem → /etc/nginx/ssl/
  │  ├─ scp key.pem → /etc/nginx/ssl/ (restricted perms)
  │  ├─ ssh nginx -s reload
  │  └─ Verify: curl https://api.example.com/health
  │
  ├─ F5 (HTTPS API):
  │  ├─ Authenticate with credentials
  │  ├─ POST /mgmt/tm/ltm/cert {"name": "api.example.com", "cert": "..."}
  │  ├─ PUT /mgmt/tm/ltm/virtual (update virtual server)
  │  └─ Verify: F5 configuration updated
  │
  ├─ IIS (WinRM):
  │  ├─ Import cert to store: Import-PfxCertificate
  │  ├─ Bind to site: Set-WebBinding
  │  └─ Verify: Get-WebBinding
  │
  └─ Report deployment status:
     POST /api/v1/deployments/{id}/status
     { "status": "success", "deployed_at": "..." }

4. Renewal Check & Rotation

Scheduler (Control Plane)
  │
  ├─ Every hour: SELECT * FROM certificates WHERE expiry_date < NOW() + INTERVAL '30 days'
  │
  ├─ For each certificate:
  │  │
  │  ├─ Create renewal job
  │  ├─ Notify agent(s)
  │  │
  │  └─ Agent flow:
  │     ├─ Generate new CSR
  │     ├─ Request new certificate
  │     ├─ Deploy new cert to targets
  │     ├─ Verify deployment
  │     └─ Delete old private key from agent
  │
  ├─ Log completion
  └─ Notify via email/webhook

Connector Architecture

Certctl uses connector interfaces for extensibility. Connectors are pluggable implementations of specific capabilities.
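
One common way to make implementations pluggable is a registry keyed by the `type` string used in connector configuration. The sketch below shows the idea only; `Registry`, `Factory`, and their methods are hypothetical names, and the document does not say how certctl actually wires connectors together.

```go
package main

import (
	"fmt"
	"sort"
)

// Factory builds a connector from its raw config map. The signature is an
// assumption made for this sketch.
type Factory func(config map[string]string) (any, error)

// Registry maps a connector type name (for example "acme" or "nginx")
// to the factory that builds it.
type Registry struct {
	factories map[string]Factory
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{factories: make(map[string]Factory)}
}

// Register adds a factory and rejects duplicate type names.
func (r *Registry) Register(name string, f Factory) error {
	if _, exists := r.factories[name]; exists {
		return fmt.Errorf("connector type %q already registered", name)
	}
	r.factories[name] = f
	return nil
}

// Build looks up the type and constructs a connector from config.
func (r *Registry) Build(name string, config map[string]string) (any, error) {
	f, ok := r.factories[name]
	if !ok {
		return nil, fmt.Errorf("unknown connector type %q", name)
	}
	return f(config)
}

// Types lists registered type names in sorted order.
func (r *Registry) Types() []string {
	names := make([]string, 0, len(r.factories))
	for n := range r.factories {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	reg := NewRegistry()
	_ = reg.Register("nginx", func(cfg map[string]string) (any, error) {
		return "nginx connector for " + cfg["host"], nil
	})
	conn, err := reg.Build("nginx", map[string]string{"host": "web01.prod.internal"})
	fmt.Println(conn, err)
	_, err = reg.Build("haproxy", nil)
	fmt.Println(err)
}
```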

Issuer Connector

Handles certificate issuance from external PKI systems.

type IssuerConnector interface {
    // GetDirectory returns the ACME directory
    GetDirectory(ctx context.Context) (*ACMEDirectory, error)

    // NewAccount registers a new account
    NewAccount(ctx context.Context, email string) (*Account, error)

    // NewOrder creates a new certificate order
    NewOrder(ctx context.Context, identifiers []Identifier) (*Order, error)

    // GetAuthorization retrieves challenge info
    GetAuthorization(ctx context.Context, authURL string) (*Authorization, error)

    // FinalizeOrder submits CSR and gets certificate
    FinalizeOrder(ctx context.Context, orderURL, csr string) ([]byte, error)
}

Built-in Issuers:

  • acme — ACME v2 protocol (Let's Encrypt, Sectigo, etc.)

Example Usage:

issuer:
  type: acme
  config:
    directory_url: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com

Target Connector

Deploys certificates to infrastructure systems.

type TargetConnector interface {
    // Validate tests connectivity and credentials
    Validate(ctx context.Context) error

    // Deploy pushes certificate to target
    Deploy(ctx context.Context, cert *Certificate) error

    // Remove removes/revokes certificate from target
    Remove(ctx context.Context, domain string) error

    // GetStatus checks deployment status
    GetStatus(ctx context.Context, domain string) (string, error)
}

Built-in Targets:

  • nginx — NGINX via SSH
  • f5 — F5 BIG-IP via REST API
  • iis — Microsoft IIS via WinRM

Example Usage:

target:
  type: nginx
  config:
    host: web01.prod.internal
    ssh_user: deploy
    ssh_key: /etc/certctl/keys/deploy.pem
    cert_path: /etc/nginx/ssl

Notifier Connector

Sends notifications about certificate events.

type NotifierConnector interface {
    // Send delivers a notification
    Send(ctx context.Context, notif *Notification) error

    // Validate checks configuration
    Validate(ctx context.Context) error
}

Built-in Notifiers:

  • email — SMTP email
  • webhook — HTTP webhooks

Example Usage:

notifier:
  type: email
  config:
    smtp_host: smtp.example.com
    smtp_port: 587
    username: alerts@example.com
    password: "***"
    from_address: certctl@example.com
    recipients:
      - ops@example.com
      - security@example.com

Job Lifecycle & States

Jobs represent work to be done: certificate issuance, renewal, deployment, etc.

┌──────────┐
│ PENDING  │  Job created, waiting to be processed
└────┬─────┘
     │
     ↓
┌──────────┐
│ RUNNING  │  Job in progress (CSR generation, issuance, deployment)
└────┬─────┘
     │
     ├─→ SUCCESS ──→ COMPLETED (job done, no errors)
     │
     ├─→ FAILURE ──→ FAILED (error occurred, may retry)
     │
     └─→ CANCEL ───→ CANCELLED (user or scheduler cancelled)

Additional states:
  • RETRY_WAIT  — Backoff before retry
  • ABANDONED   — Max retries exceeded
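
The state diagram can be encoded as a transition table so that illegal moves are rejected in one place. In this sketch the PENDING and RUNNING rows follow the diagram; the rows for FAILED and RETRY_WAIT are assumptions, since the document lists those states without drawing their edges. `JobState` and `Transition` are hypothetical names.

```go
package main

import "fmt"

// JobState names one state from the diagram above.
type JobState string

const (
	Pending   JobState = "PENDING"
	Running   JobState = "RUNNING"
	Completed JobState = "COMPLETED"
	Failed    JobState = "FAILED"
	Cancelled JobState = "CANCELLED"
	RetryWait JobState = "RETRY_WAIT"
	Abandoned JobState = "ABANDONED"
)

// allowed lists the legal next states. The Failed and RetryWait rows are
// assumed; states with no row are treated as terminal.
var allowed = map[JobState][]JobState{
	Pending:   {Running},
	Running:   {Completed, Failed, Cancelled},
	Failed:    {RetryWait, Abandoned},
	RetryWait: {Running},
}

// Transition returns the new state, or the old state and an error if the
// move is not allowed.
func Transition(from, to JobState) (JobState, error) {
	for _, next := range allowed[from] {
		if next == to {
			return to, nil
		}
	}
	return from, fmt.Errorf("illegal transition %s -> %s", from, to)
}

func main() {
	state := Pending
	for _, next := range []JobState{Running, Failed, RetryWait, Running, Completed} {
		var err error
		if state, err = Transition(state, next); err != nil {
			fmt.Println(err)
			return
		}
		fmt.Println("now", state)
	}
	_, err := Transition(Completed, Running)
	fmt.Println(err)
}
```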

Job Steps

Complex jobs are broken into steps:

Issuance Job
  │
  ├─ Step 1: Notify agent of CSR request
  │   Status: COMPLETED
  │
  ├─ Step 2: Wait for CSR from agent
  │   Status: RUNNING (timeout: 5 min)
  │
  ├─ Step 3: Submit to ACME issuer
  │   Status: PENDING
  │
  ├─ Step 4: Poll for certificate
  │   Status: PENDING
  │
  └─ Step 5: Trigger deployment jobs
      Status: PENDING

Security Model

Private Key Management

Private Key Lifecycle
  │
  ├─ GENERATED on Agent (never sent to control plane)
  │  └─ Location: /etc/certctl/{domain}/key.pem
  │
  ├─ STORED on Agent
  │  ├─ File permissions: 0600 (agent user only)
  │  └─ Encrypted at rest (optional, per deployment)
  │
  ├─ USED on Agent for:
  │  ├─ Deployment to targets
  │  └─ Certificate renewal
  │
  └─ DELETED on Agent
     ├─ Old key deleted after successful renewal
     └─ Manual revocation on agent removal

Authentication & Authorization

Agent-to-Server:

  • API Key (registered at agent creation)
  • mTLS optional for high-security deployments
  • All API calls include agent ID + API key

Control Plane and Agents to External Systems:

  • ACME (control plane): ACME protocol with account key
  • NGINX (agent): SSH key authentication
  • F5 (agent): Username/password or token
  • IIS (agent): WinRM with encrypted credentials

Audit Logging

Every action is logged:

{
  "id": "audit-98765",
  "timestamp": "2024-03-14T10:30:00Z",
  "actor": {
    "type": "agent",
    "id": "agent-prod-01"
  },
  "action": "certificate_issued",
  "resource": {
    "type": "certificate",
    "id": "cert-api-example-com"
  },
  "status": "success",
  "details": {
    "issuer": "acme/letsencrypt",
    "expiry": "2024-06-12T10:30:00Z",
    "deployed_to": ["nginx-prod-01"]
  }
}

Query examples:

  • All actions by agent: GET /audit/logs?actor_type=agent&actor_id=agent-001
  • All deployments: GET /audit/logs?action=certificate_deployed
  • Last 30 days (as of 2024-03-14): GET /audit/logs?from=2024-02-13

Data Encryption at Rest

Optional encryption for sensitive fields:

  • Passwords in connector configs
  • API keys
  • ACME account keys

Encrypted fields use AES-256-GCM with a unique nonce per row.


Scaling Considerations

Control Plane Scaling

Single Server Limits:

  • ~1000 agents (verified in testing)
  • ~10,000 managed certificates
  • ~100,000 audit log entries per day

Horizontal Scaling (future):

  • Multiple server instances behind load balancer
  • Shared PostgreSQL backend
  • Distributed job queue (Redis/RabbitMQ)

Agent Scaling

Agents are stateless and scale horizontally:

  • Each agent processes certificates independently
  • Scheduler distributes renewal checks across agents
  • No inter-agent communication required

Database Scaling

For large deployments:

  • Vertical scaling: More CPU/RAM for PostgreSQL
  • Read replicas: For audit log queries
  • Partitioning: Audit logs by date
  • Connection pooling: PgBouncer

Integration Points

External Integrations

Certctl
  │
  ├─→ ACME Servers
  │   ├─ Let's Encrypt
  │   ├─ Sectigo
  │   └─ Internal ACME (optional)
  │
  ├─→ Infrastructure Targets
  │   ├─ NGINX (SSH)
  │   ├─ F5 (REST API)
  │   ├─ IIS (WinRM)
  │   └─ Kubernetes (future)
  │
  ├─→ Notification Systems
  │   ├─ SMTP (email)
  │   ├─ HTTP webhooks
  │   └─ Slack (future)
  │
  └─→ External Systems
      ├─ HashiCorp Vault (credential storage)
      ├─ HashiCorp Consul (service discovery)
      └─ Prometheus (metrics, future)

Internal Component Communication

Agent ←→ Control Plane
  ├─ Agent registration
  ├─ CSR submission
  ├─ Certificate retrieval
  ├─ Deployment status
  └─ Health checks (bidirectional)

Scheduler → Services
  ├─ Certificate renewal
  ├─ Job processing
  ├─ Notifications
  └─ Cleanup tasks

Deployment Topologies

Single-Node (Development)

┌────────────────────────────┐
│ Server + Agent             │
│ ├─ HTTP API (8443)         │
│ ├─ PostgreSQL              │
│ └─ Agent (test mode)       │
└────────────────────────────┘

Docker Compose (Local Dev)

┌─────────────────────────────────────┐
│ Docker Network                      │
│ ├─ certctl-server (8443)            │
│ ├─ postgres (5432)                  │
│ ├─ certctl-agent (managed)          │
│ └─ pgadmin (5050, optional)         │
└─────────────────────────────────────┘

Kubernetes (Production)

┌──────────────────────────────────────────────────┐
│ Kubernetes Cluster                               │
│ ├─ Deployment: certctl-server (replicas=3)       │
│ ├─ DaemonSet: certctl-agent (all nodes)          │
│ ├─ StatefulSet: postgres (primary + replica)     │
│ ├─ ConfigMap: connector configurations           │
│ └─ Secret: API keys, credentials                 │
└──────────────────────────────────────────────────┘

Performance Characteristics

Operation                    Typical Duration    Bottleneck
Certificate request (CSR)    100-500ms           Agent network latency
ACME challenge (DNS)         30-60s              DNS propagation
ACME finalize                1-5s                ACME server
NGINX deployment             500ms-2s            SSH latency + nginx reload
F5 deployment                2-10s               F5 API response
IIS deployment               3-15s               WinRM latency

Future Enhancements

  • HSM Support: Hardware security module integration for ACME account keys
  • Multi-Region: Control plane federation with local agents
  • HA Control Plane: Active-active with etcd-backed state
  • Policy Engine: Advanced renewal and deployment policies
  • Certificate Pinning: HPKP and pin validation
  • Metrics: Prometheus integration for observability

See README.md for quick start and docs/ for additional guides.