Files
certctl/cmd/agent/deploy.go
T
shankar0123 3094010880 refactor(cmd/agent): split main.go into poll + deploy + discovery sibling files (Phase 9, 12 of N — LAST hotspot)
Phase 9 ARCH-M2 closure Sprint 12 — the LAST of the audit's named
hotspot sub-splits. Splits cmd/agent/main.go (1489 LOC, the
sixth-largest backend hotspot at audit time) via the Option B
sibling-file pattern (mirrors the Sprint 8 cmd/server cut). Package
stays `main`; every method is still defined on *Agent so each call
site continues to resolve through Go's same-package method-set —
no import-path or signature change.

Audit prescription vs reality
=============================
The audit's Tasks-Deferred row prescribed
"main + poll + deploy + register sibling files." The actual
cmd/agent/main.go has no `register` function — agent registration
happens via the control-plane REST API (POST /api/v1/agents)
before the agent process starts. The closest analogue in the agent
binary is the filesystem-discovery scan (runDiscoveryScan + the
parsePEMFile / parseDERFile / certToEntry / sha256Sum / certKeyInfo
helpers), which is the agent's other "outbound report-to-server"
surface alongside the inbound work-poll path.

Sprint 12 substitutes `discovery` for `register` in the prescription
and keeps the other three buckets as named: `main` (lifecycle + HTTP
infrastructure + entrypoint), `poll` (work-poll + CSR-job execution),
`deploy` (deployment-job execution + target connector factory).

What moved
==========

New `cmd/agent/poll.go` (279 LOC) — work-poll + CSR-job execution:
  - pollForWork: GET /api/v1/agents/{id}/work each tick; dispatches
    each returned JobItem to the right executor.
  - executeCSRJob: handles AwaitingCSR jobs by generating an ECDSA
    P-256 key locally, persisting it with 0600 permissions (key
    NEVER leaves the agent — CLAUDE.md "Agent-based key
    management"), creating + submitting the CSR.

New `cmd/agent/deploy.go` (443 LOC) — deployment + target factory:
  - executeDeploymentJob: handles Pending deployment jobs by
    fetching the cert PEM, loading the locally-held private key
    (agent keygen mode), instantiating the appropriate target
    connector, calling DeployCertificate, and reporting status.
  - createTargetConnector: the 170-LOC switch over target_type
    that instantiates 14 different target connectors (apache /
    awsacm / azurekv / caddy / envoy / f5 / haproxy / iis /
    javakeystore / k8ssecret / nginx / postfix / ssh / traefik /
    wincertstore). Context is threaded through to SDK-driven
    connectors (AWSACM, AzureKeyVault) per the contextcheck linter
    fix in CI commit 502823d.
  - splitPEMChain + fetchCertificate (deploy-only helpers).

New `cmd/agent/discovery.go` (275 LOC) — filesystem cert discovery:
  - runDiscoveryScan: walks each configured discovery directory,
    dispatches each candidate file to parsePEMFile / parseDERFile,
    batches the parsed entries, and POSTs them to
    /api/v1/agents/{id}/discoveries (the machine-to-machine surface
    that is intentionally NOT exposed via MCP).
  - parsePEMFile + parseDERFile + certToEntry + sha256Sum +
    certKeyInfo + the discoveredCertEntry struct that ties them
    together.

What stays in main.go (644 LOC, down from 1489)
================================================
  - Types: AgentConfig, Agent struct, ErrAgentRetired var,
    WorkResponse, JobItem.
  - Lifecycle: NewAgent constructor, Run, markRetired,
    sendHeartbeat, getOutboundIP, targetDeployMutex method.
  - Shared HTTP infrastructure: makeRequest (consumed by poll +
    deploy + discovery + lifecycle), reportJobStatus (consumed by
    poll + deploy).
  - Entrypoint: main(), getEnvDefault, getEnvBoolDefault,
    validateHTTPSScheme.

Side-effect import cleanup
==========================
21 imports drop from cmd/agent/main.go as a clean side effect:

Standard library (7):
  - crypto/ecdsa, crypto/elliptic (poll only)
  - crypto/rand (poll only)
  - crypto/rsa (discovery only)
  - crypto/sha256 (discovery only)
  - crypto/x509/pkix (poll only)
  - encoding/pem (poll + deploy + discovery)
  - path/filepath (poll + deploy + discovery)

Target connectors (14):
  - internal/connector/target + apache + awsacm + azurekv + caddy +
    envoy + f5 + haproxy + iis + javakeystore + k8ssecret + nginx +
    postfix + ssh + traefik + wincertstore — all 14 were used ONLY
    by createTargetConnector and moved with the factory to deploy.go.

The surviving main.go now imports 20 stdlib packages + zero
internal packages — the leanest the agent binary's entrypoint has
been since the agent first shipped target-connector orchestration.

Per-import audit on every new sibling file is in the diff:
  - poll.go: context, crypto/ecdsa, crypto/elliptic, crypto/rand,
    crypto/x509, crypto/x509/pkix, encoding/json, encoding/pem,
    fmt, io, net/http, os, path/filepath, strings (no sync — the
    sync.Once / sync.Mutex / sync.Map usages all live in the
    surviving main.go's lifecycle code).
  - deploy.go: context, encoding/json, encoding/pem, fmt, io,
    net/http, os, path/filepath, strings + target + 14 connector
    packages.
  - discovery.go: context, crypto/ecdsa, crypto/rsa, crypto/sha256,
    crypto/x509, encoding/pem, fmt, io, net/http, os,
    path/filepath, strings, time.

Net effect
==========
main.go: 1489 → 644 LOC (-845 = -56.7%). Three new sibling files at
997 LOC total (845 moved + ~152 LOC of header + Phase 9 doc-comment
overhead). Matches the Sprint 8 cmd/server pattern in shape (main +
wire + migrations) and size reduction (-23.8% there vs -56.7% here —
the agent had more concentrated single-purpose functions than the
server's wiring-heavy main).

Cumulative Phase 9 progress (all 6 named hotspots)
==================================================
  config.go          3403 → 1342 (-60.6%, Sprints 1-7)
  cmd/server/main.go 2966 → 2260 (-23.8%, Sprints 8 + 8b)
  service/acme.go    1965 → 1162 (-40.9%, Sprints 9 + 9b)
  mcp/tools.go       1867 →  109 (-94.2%, Sprint 10)
  auth_session_oidc  1577 →  452 (-71.3%, Sprint 11)
  cmd/agent/main.go  1489 →  644 (-56.7%, Sprint 12)
  TOTAL across 6 files: 13,267 → 5,969 LOC = -7,298 (-55.0%)

All 6 named hotspots from the audit's top-6 list are now below
1,500 LOC. The largest remaining hotspot from the top-6 is
cmd/server/main.go at 2,260 LOC (intentional — every backend
service the server wires is one line in main(), so the size is
roughly proportional to surface area, not concern-tangling).

Behavior preservation contract
==============================
1. gofmt -l clean across all 4 affected files.
2. go vet ./cmd/agent/... — no findings.
3. staticcheck ./cmd/agent/... — no findings.
4. go test -short -count=1 ./cmd/agent/... — green (includes
   agent_test.go 1716-LOC suite that pins every moved function:
   pollForWork / executeCSRJob / executeDeploymentJob /
   createTargetConnector / runDiscoveryScan plus dispatch_test.go,
   deploy_mutex_test.go, keymem_test.go).
5. Broader-importer build green: go build ./... .

Same-package resolution means every cross-file call (poll →
makeRequest, deploy → makeRequest + reportJobStatus + verifyAnd-
ReportDeployment in verify.go, discovery → makeRequest) resolves
through Go's package-level method-set with zero compile-time cost
+ zero runtime overhead. The public surface of the cmd/agent
binary is unchanged.

What this commit closes
=======================
Sprint 12 is the LAST of the audit's named top-6 hotspot sub-splits.
The ARCH-M2 finding now reflects:
  - 6 of 6 named backend hotspots below 1,500 LOC.
  - 24 of 24 named sub-splits shipped across Sprints 1-12 (config
    family ×7 + cmd/server ×2 + service/acme ×2 + mcp/tools ×6 +
    auth_session_oidc ×4 + cmd/agent ×3).
  - 7,298 LOC of code-locality concentration removed across the
    top 6 files.

Whether to flip ARCH-M2 from 🛠 Scaffolded to ✓ Shipped is now an
operator-discretion call — every named target landed, but the
finding's spirit ("split god-files by responsibility") is a
continuous discipline rather than a binary done/not-done.

Refs: ARCH-M2 (god-files), Phase 9 audit. Sprint 12 is the named-
hotspot conclusion of Phase 9.
2026-05-14 10:36:08 +00:00

444 lines
16 KiB
Go

// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1
package main
import (
"context"
"encoding/json"
"encoding/pem"
"fmt"
"io"
"net/http"
"os"
"path/filepath"
"strings"
"github.com/certctl-io/certctl/internal/connector/target"
"github.com/certctl-io/certctl/internal/connector/target/apache"
"github.com/certctl-io/certctl/internal/connector/target/awsacm"
"github.com/certctl-io/certctl/internal/connector/target/azurekv"
"github.com/certctl-io/certctl/internal/connector/target/caddy"
"github.com/certctl-io/certctl/internal/connector/target/envoy"
"github.com/certctl-io/certctl/internal/connector/target/f5"
"github.com/certctl-io/certctl/internal/connector/target/haproxy"
"github.com/certctl-io/certctl/internal/connector/target/iis"
jks "github.com/certctl-io/certctl/internal/connector/target/javakeystore"
k8s "github.com/certctl-io/certctl/internal/connector/target/k8ssecret"
"github.com/certctl-io/certctl/internal/connector/target/nginx"
pf "github.com/certctl-io/certctl/internal/connector/target/postfix"
sshconn "github.com/certctl-io/certctl/internal/connector/target/ssh"
"github.com/certctl-io/certctl/internal/connector/target/traefik"
wcs "github.com/certctl-io/certctl/internal/connector/target/wincertstore"
)
// Phase 9 ARCH-M2 closure Sprint 12 (2026-05-14): extracted from
// cmd/agent/main.go via the Option B sibling-file pattern.
//
// This file holds the DEPLOYMENT executor + the target connector
// factory + the deploy-only helpers:
//
// - executeDeploymentJob: handles Pending deployment jobs by
// fetching the cert PEM from the control plane, loading the
// locally-held private key (in agent keygen mode), instantiating
// the appropriate target connector via createTargetConnector,
// calling DeployCertificate on it, and reporting Completed or
// Failed back to the control plane.
// - createTargetConnector: the big switch over target_type that
// instantiates one of 14 target connectors (apache / awsacm /
// azurekv / caddy / envoy / f5 / haproxy / iis / javakeystore /
// k8ssecret / nginx / postfix / ssh / traefik / wincertstore).
// Context is threaded into SDK-driven connectors (AWSACM,
// AzureKeyVault) so credential resolution honors caller
// cancellation per the contextcheck linter — see CI commit
// 502823d.
// - splitPEMChain: split a PEM chain into (first cert, rest).
// - fetchCertificate: pull the PEM chain from
// GET /api/v1/certificates/{certID}/version.
//
// All 14 target-connector imports were used ONLY by
// createTargetConnector; moving the factory here also moved the
// 14 connector imports out of main.go, leaving the surviving
// cmd/agent/main.go with the minimal stdlib surface its lifecycle
// + HTTP infrastructure needs.
// executeDeploymentJob executes a deployment job by fetching the certificate and deploying it
// to the target system using the appropriate connector (NGINX, F5 BIG-IP, or IIS).
//
// For agent keygen mode, the private key is read from the local key store (keyDir/certID.key)
// rather than fetched from the server. The deployment includes the locally-held key.
//
// Flow:
// 1. Report job as Running
// 2. Fetch the certificate PEM from the control plane
// 3. Load local private key if it exists (agent keygen mode)
// 4. Instantiate the target connector based on target_type from the work response
// 5. Call DeployCertificate on the connector
// 6. Report job as Completed (or Failed)
func (a *Agent) executeDeploymentJob(ctx context.Context, job JobItem) {
a.logger.Info("executing deployment job",
"job_id", job.ID,
"certificate_id", job.CertificateID,
"target_type", job.TargetType)
// Report job as running
if err := a.reportJobStatus(ctx, job.ID, "Running", ""); err != nil {
a.logger.Error("failed to report job running", "error", err)
}
// Fetch the certificate from the control plane
certPEM, err := a.fetchCertificate(ctx, job.CertificateID)
if err != nil {
a.logger.Error("failed to fetch certificate",
"job_id", job.ID,
"error", err)
if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("cert fetch failed: %v", err)); reportErr != nil {
a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
}
return
}
a.logger.Info("certificate fetched for deployment",
"job_id", job.ID,
"cert_length", len(certPEM))
// Split PEM into cert and chain (separated by double newline between PEM blocks)
certOnly, chainPEM := splitPEMChain(certPEM)
// Check for locally-stored private key (agent keygen mode)
keyPath := filepath.Join(a.config.KeyDir, job.CertificateID+".key")
var keyPEM string
keyData, err := os.ReadFile(keyPath)
if err != nil {
a.logger.Error("failed to read local private key for deployment",
"job_id", job.ID,
"key_path", keyPath,
"error", err)
if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("key read failed: %v", err)); reportErr != nil {
a.logger.Error("failed to report job status to server", "job_id", job.ID, "error", reportErr)
}
return
}
keyPEM = string(keyData)
a.logger.Info("loaded local private key for deployment",
"job_id", job.ID,
"key_path", keyPath)
// Deploy to the target using the appropriate connector
if job.TargetType != "" {
connector, err := a.createTargetConnector(ctx, job.TargetType, job.TargetConfig)
if err != nil {
a.logger.Error("failed to create target connector",
"job_id", job.ID,
"target_type", job.TargetType,
"error", err)
if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("connector init failed: %v", err)); reportErr != nil {
a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
}
return
}
// Bundle 1 / RT-C1 closure (2026-05-12): defense in depth. The server
// runs internal/connector/target/configcheck.Validate on the way IN
// (Create/Update), and rejects shell metacharacters in command-bearing
// fields. Re-run the connector's full ValidateConfig here on the way
// OUT, before any DeployCertificate call. This catches (a) configs
// that pre-date the server-side guard, (b) corruption/tampering of
// the encrypted config blob, and (c) per-connector filesystem
// invariants (cert dir exists, paths writable) that the server can't
// check because the filesystem is on the agent host.
if err := connector.ValidateConfig(ctx, job.TargetConfig); err != nil {
a.logger.Error("connector config validation failed",
"job_id", job.ID,
"target_type", job.TargetType,
"error", err)
if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("%s config validation failed: %v", job.TargetType, err)); reportErr != nil {
a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
}
return
}
deployReq := target.DeploymentRequest{
CertPEM: certOnly,
KeyPEM: keyPEM,
ChainPEM: chainPEM,
TargetConfig: job.TargetConfig,
Metadata: map[string]string{
"certificate_id": job.CertificateID,
"job_id": job.ID,
},
}
// Phase 2 of the deploy-hardening I master bundle:
// per-target deploy mutex. Acquire BEFORE
// DeployCertificate so two concurrent renewals against
// the same target ID serialize. The lock is held for the
// full Deploy duration including PreCommit (validate),
// PostCommit (reload), and post-deploy verify (Phases
// 4-9). Released on every return path via defer.
var targetID string
if job.TargetID != nil {
targetID = *job.TargetID
}
if mu := a.targetDeployMutex(targetID); mu != nil {
mu.Lock()
defer mu.Unlock()
}
result, err := connector.DeployCertificate(ctx, deployReq)
if err != nil {
a.logger.Error("deployment failed",
"job_id", job.ID,
"target_type", job.TargetType,
"error", err)
if reportErr := a.reportJobStatus(ctx, job.ID, "Failed", fmt.Sprintf("deployment failed: %v", err)); reportErr != nil {
a.logger.Error("failed to report job status to server", "job_id", job.ID, "status", "Failed", "error", reportErr)
}
return
}
a.logger.Info("target connector deployment completed",
"job_id", job.ID,
"target_type", job.TargetType,
"success", result.Success,
"message", result.Message)
// If verification is enabled, verify the deployment by probing the live TLS endpoint
targetHost, targetPort, err := extractTargetHostAndPort(job.TargetConfig)
if err != nil {
a.logger.Warn("could not extract target host/port for verification",
"job_id", job.ID,
"error", err)
} else {
a.verifyAndReportDeployment(ctx, job, targetHost, targetPort, certOnly)
}
} else {
a.logger.Info("no target type specified, skipping connector invocation",
"job_id", job.ID)
}
// Report job as completed
if err := a.reportJobStatus(ctx, job.ID, "Completed", ""); err != nil {
a.logger.Error("failed to report job completed", "error", err)
return
}
a.logger.Info("deployment job completed", "job_id", job.ID)
}
// createTargetConnector instantiates the appropriate target connector based on type.
// ctx is threaded into SDK-driven connectors (AWSACM, AzureKeyVault) so credential
// resolution honors caller cancellation / deadlines instead of using a fresh
// context.Background() (the contextcheck linter enforces this — the original Rank 5
// implementation used Background() and tripped CI on commit 502823d).
func (a *Agent) createTargetConnector(ctx context.Context, targetType string, configJSON json.RawMessage) (target.Connector, error) {
switch targetType {
case "NGINX":
var cfg nginx.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid NGINX config: %w", err)
}
}
return nginx.New(&cfg, a.logger), nil
case "Apache":
var cfg apache.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid Apache config: %w", err)
}
}
return apache.New(&cfg, a.logger), nil
case "HAProxy":
var cfg haproxy.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid HAProxy config: %w", err)
}
}
return haproxy.New(&cfg, a.logger), nil
case "F5":
var cfg f5.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid F5 config: %w", err)
}
}
conn, err := f5.New(&cfg, a.logger)
if err != nil {
return nil, fmt.Errorf("failed to create F5 connector: %w", err)
}
return conn, nil
case "IIS":
var cfg iis.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid IIS config: %w", err)
}
}
return iis.New(&cfg, a.logger)
case "Traefik":
var cfg traefik.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid Traefik config: %w", err)
}
}
return traefik.New(&cfg, a.logger), nil
case "Caddy":
var cfg caddy.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid Caddy config: %w", err)
}
}
return caddy.New(&cfg, a.logger), nil
case "Envoy":
var cfg envoy.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid Envoy config: %w", err)
}
}
return envoy.New(&cfg, a.logger), nil
case "Postfix":
var cfg pf.Config
cfg.Mode = "postfix"
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid Postfix config: %w", err)
}
}
return pf.New(&cfg, a.logger), nil
case "Dovecot":
var cfg pf.Config
cfg.Mode = "dovecot"
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid Dovecot config: %w", err)
}
}
return pf.New(&cfg, a.logger), nil
case "SSH":
var cfg sshconn.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid SSH config: %w", err)
}
}
return sshconn.New(&cfg, a.logger)
case "WinCertStore":
var cfg wcs.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid WinCertStore config: %w", err)
}
}
return wcs.New(&cfg, a.logger)
case "JavaKeystore":
var cfg jks.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid JavaKeystore config: %w", err)
}
}
return jks.New(&cfg, a.logger), nil
case "KubernetesSecrets":
var cfg k8s.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid KubernetesSecrets config: %w", err)
}
}
return k8s.New(&cfg, a.logger)
case "AWSACM":
// Rank 5 of the 2026-05-03 Infisical deep-research deliverable.
// AWS Certificate Manager target — SDK-driven (no file I/O).
// LoadDefaultConfig handles the standard AWS credential chain
// (IRSA / EC2 instance profile / SSO / env vars) without any
// long-lived creds in connector Config.
var cfg awsacm.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid AWSACM config: %w", err)
}
}
return awsacm.New(ctx, &cfg, a.logger)
case "AzureKeyVault":
// Rank 5 of the 2026-05-03 Infisical deep-research deliverable.
// Azure Key Vault target — SDK-driven (no file I/O).
// DefaultAzureCredential handles the standard Azure credential
// chain (managed identity / workload identity / env vars / az
// CLI fallback). Long-lived service-principal secrets are
// supported but discouraged via the credential_mode config.
var cfg azurekv.Config
if len(configJSON) > 0 {
if err := json.Unmarshal(configJSON, &cfg); err != nil {
return nil, fmt.Errorf("invalid AzureKeyVault config: %w", err)
}
}
return azurekv.New(ctx, &cfg, a.logger)
default:
return nil, fmt.Errorf("unsupported target type: %s", targetType)
}
}
// splitPEMChain splits a PEM chain into the first certificate (cert) and the rest (chain).
// The control plane returns the full chain as a single string with PEM blocks concatenated.
func splitPEMChain(pemChain string) (string, string) {
data := []byte(pemChain)
block, rest := pem.Decode(data)
if block == nil {
return pemChain, ""
}
cert := string(pem.EncodeToMemory(block))
// Skip whitespace between cert and chain
chain := strings.TrimSpace(string(rest))
if chain == "" {
return cert, ""
}
return cert, chain
}
// fetchCertificate retrieves the certificate PEM chain from the control plane.
// GET /api/v1/agents/{agentID}/certificates/{certID}
func (a *Agent) fetchCertificate(ctx context.Context, certID string) (string, error) {
path := fmt.Sprintf("/api/v1/agents/%s/certificates/%s", a.config.AgentID, certID)
resp, err := a.makeRequest(ctx, http.MethodGet, path, nil)
if err != nil {
return "", fmt.Errorf("request failed: %w", err)
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
body, _ := io.ReadAll(resp.Body)
return "", fmt.Errorf("server returned %d: %s", resp.StatusCode, string(body))
}
var certResp struct {
CertificatePEM string `json:"certificate_pem"`
}
if err := json.NewDecoder(resp.Body).Decode(&certResp); err != nil {
return "", fmt.Errorf("failed to decode response: %w", err)
}
return certResp.CertificatePEM, nil
}