mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 22:21:30 +00:00
f5c67a51b2
Phase 1 of the deploy-hardening I master bundle. Closes the load-bearing prerequisite for the seven Bundle I items by extracting one canonical atomic-deploy primitive at internal/deploy/ that all 13 target connectors will consume in Phases 4-9. The package ships: - Plan + Apply API: write all File entries to sibling .certctl-tmp.<nanos> in the destination directory (same-filesystem guarantees os.Rename atomicity), call PreCommit (validate-with-the-target), atomic-rename all temps to final, call PostCommit (reload). On PostCommit failure, restore from pre-deploy backups + re-call PostCommit. If second PostCommit also fails, return ErrRollbackFailed (operator-actionable; documented loud). - AtomicWriteFile lower-level entry for connectors that don't fit the Plan model (F5, K8s — they ship bytes through APIs, not local files). - SHA-256 idempotency: every Apply short-circuits when all File destinations already match SHA-256 of new bytes. Defends against agent-restart retry storms hammering targets with no-op reloads. - Ownership + mode preservation: existing nginx:nginx 0640 stays nginx:nginx 0640 across renewals. Per-target FileDefaults applies for first-deploy. Per-File explicit Mode/Owner/Group overrides win over both. Closes the silent-failure mode where os.WriteFile(path, bytes, 0600) at apache.go:119 (et al.) clobbered worker access. - Backup retention janitor: pre-deploy backup at <path>.certctl-bak.<nanos>; default keeps last 3 (DefaultBackupRetention); BackupRetention=-1 disables backups (rollback impossible — documented foot-gun). - File-level mutex via sync.Map: two concurrent Apply calls touching the same destination serialize. Per-target serialization (Phase 2) is finer- grained at the agent dispatch layer; this is the file-level guard. - Sentinel errors for connector errors.Is checks: ErrPlanInvalid, ErrValidateFailed, ErrReloadFailed, ErrRollbackFailed. Tests (37 named cases across deploy_test.go + coverage_test.go) pin every load-bearing invariant the prompt's Phase 1 requires, plus error-leg coverage uplifts: - TestApply_HappyPath_PreCommitSucceeds_PostCommitSucceeds_FilesAtomic - TestApply_PreCommitFails_NoFilesChanged (atomic-or-nothing on validate) - TestApply_PostCommitFails_FilesRolledBack (rollback wire) - TestApply_RollbackAlsoFails_ReturnsErrRollbackFailed (escalation path) - TestApply_IdempotentSkip_SHA256Match (idempotency short-circuit) - TestApply_PreservesExistingOwnerAndMode_WhenNotOverridden - TestApply_RespectsOverrides_OwnerGroupMode - TestApply_ConcurrentApplyToSameFile_Serializes (file-level lock) - TestApply_BackupRetention_KeepsLastN (janitor pruning) - TestApply_NoExistingFile_UsesDefaultsForOwnerGroupMode - TestAtomicWriteFile_TempFileCleanedUpOnError - TestAtomicWriteFile_RenameRaceWithReader_AtomicReadAlwaysSeesOldOrNew (POSIX-rename atomicity proof via concurrent reader) Plus white-box tests for resolveOwnership, lookupUID/GID, and deeper error legs in restoreFromBackups + applyOwnership + AtomicWriteFile. Coverage 87.3% — practical ceiling without injecting a fault-aware FS abstraction (Write/Sync/Close OS errors are unreachable from go test without sudo'd disk-fill or a custom interface seam). Above the existing service-layer 70% floor; Phases 4-9 will lift this further as they exercise the package through real-connector use. Race detector clean; gofmt + go vet + golangci-lint v2.11.4 all 0 issues. The package is the load-bearing prerequisite for Phases 4-9. Phase 2 next: per-target deploy mutex in cmd/agent/main.go. Spec: cowork/deploy-hardening-i-prompt.md Baseline + recon: cowork/deploy-hardening-i/baseline.md
328 lines
9.5 KiB
Go
328 lines
9.5 KiB
Go
package deploy
|
|
|
|
import (
|
|
"context"
|
|
"errors"
|
|
"fmt"
|
|
"os"
|
|
"path/filepath"
|
|
"sort"
|
|
"time"
|
|
)
|
|
|
|
// Apply executes plan as one atomic deployment. See package doc and
|
|
// the Plan-type comments for the full algorithm contract; the
|
|
// summary:
|
|
//
|
|
// 1. Validate the plan shape (no empty paths, no dupes).
|
|
// 2. Per-file SHA-256 check; if every file already has identical
|
|
// bytes and !plan.SkipIdempotent, return early with
|
|
// SkippedAsIdempotent=true.
|
|
// 3. Lock every file path in the plan (sorted to avoid deadlocks
|
|
// when two concurrent Applies share some paths).
|
|
// 4. Backup every existing destination.
|
|
// 5. Write every file to its sibling .certctl-tmp.<unix-nanos>;
|
|
// apply ownership (chmod + chown) to each temp.
|
|
// 6. Call PreCommit(ctx, tempPaths). On error: clean up all temp
|
|
// files; backups stay (operator may want to restore manually).
|
|
// Return ErrValidateFailed.
|
|
// 7. os.Rename every temp → final, in plan-order. We don't try to
|
|
// "rollback" a partial rename mid-loop — we trust os.Rename to
|
|
// either succeed or fail-fast within the same filesystem; if a
|
|
// mid-loop rename fails, we attempt rollback of the renames
|
|
// that already succeeded.
|
|
// 8. Call PostCommit(ctx). On success: prune old backups; return.
|
|
// 9. On PostCommit error: restore each File from its backup;
|
|
// re-call PostCommit. If second PostCommit also fails, return
|
|
// ErrRollbackFailed (operator-actionable; deploy is in known-
|
|
// bad state).
|
|
//
|
|
// The PreCommit/PostCommit hooks may be nil; nil = "no-op step".
|
|
func Apply(ctx context.Context, plan Plan) (*Result, error) {
|
|
start := time.Now()
|
|
|
|
if err := validatePlan(plan); err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
// Lock every path in sorted order to defend against the
|
|
// classic AB/BA deadlock when two concurrent Applies overlap
|
|
// in their file sets.
|
|
absPaths := make([]string, len(plan.Files))
|
|
for i, f := range plan.Files {
|
|
abs, err := filepath.Abs(f.Path)
|
|
if err != nil {
|
|
return nil, fmt.Errorf("resolve path %s: %w", f.Path, err)
|
|
}
|
|
absPaths[i] = abs
|
|
}
|
|
sortedPaths := append([]string(nil), absPaths...)
|
|
sort.Strings(sortedPaths)
|
|
unlocks := make([]func(), 0, len(sortedPaths))
|
|
defer func() {
|
|
// Release in reverse order. Standard mutex hygiene.
|
|
for i := len(unlocks) - 1; i >= 0; i-- {
|
|
unlocks[i]()
|
|
}
|
|
}()
|
|
for _, p := range sortedPaths {
|
|
unlocks = append(unlocks, lockFile(p))
|
|
}
|
|
|
|
if err := ctx.Err(); err != nil {
|
|
return nil, err
|
|
}
|
|
|
|
res := &Result{
|
|
BackupPaths: make(map[string]string, len(plan.Files)),
|
|
}
|
|
|
|
// 2. Idempotency short-circuit.
|
|
if !plan.SkipIdempotent {
|
|
allMatch := true
|
|
for i, f := range plan.Files {
|
|
abs := absPaths[i]
|
|
existing, err := os.ReadFile(abs)
|
|
if err != nil {
|
|
allMatch = false
|
|
break
|
|
}
|
|
if !sha256Eq(existing, f.Bytes) {
|
|
allMatch = false
|
|
break
|
|
}
|
|
}
|
|
if allMatch {
|
|
res.SkippedAsIdempotent = true
|
|
res.Duration = time.Since(start)
|
|
return res, nil
|
|
}
|
|
}
|
|
|
|
// 3. For each file: stat existing, resolve ownership, prep
|
|
// the per-file work plan.
|
|
preps := make([]*filePrep, len(plan.Files))
|
|
for i, f := range plan.Files {
|
|
abs := absPaths[i]
|
|
stat, statErr := os.Stat(abs)
|
|
existed := statErr == nil
|
|
owner, err := resolveOwnership(f, plan.Defaults, ownershipStat(stat, statErr))
|
|
if err != nil {
|
|
return nil, fmt.Errorf("file %d (%s): resolve ownership: %w", i, abs, err)
|
|
}
|
|
preps[i] = &filePrep{
|
|
abs: abs,
|
|
file: f,
|
|
owner: owner,
|
|
hadOrig: existed,
|
|
}
|
|
}
|
|
|
|
// 4. Backup every existing destination BEFORE writing any
|
|
// temp file. If any backup fails, abort with no on-disk
|
|
// changes to live files.
|
|
if plan.BackupRetention != -1 {
|
|
for _, p := range preps {
|
|
if !p.hadOrig {
|
|
res.BackupPaths[p.abs] = ""
|
|
continue
|
|
}
|
|
backupPath, err := backupFile(p.abs)
|
|
if err != nil {
|
|
// Clean up any backups already taken.
|
|
cleanupBackups(res.BackupPaths)
|
|
return nil, fmt.Errorf("backup %s: %w", p.abs, err)
|
|
}
|
|
p.backupTo = backupPath
|
|
res.BackupPaths[p.abs] = backupPath
|
|
}
|
|
}
|
|
|
|
// 5. Write every file to a sibling temp + apply ownership.
|
|
tempPaths := make(map[string]string, len(preps))
|
|
cleanupTemps := func() {
|
|
for _, p := range preps {
|
|
if p.tempPath != "" {
|
|
_ = os.Remove(p.tempPath)
|
|
}
|
|
}
|
|
}
|
|
for _, p := range preps {
|
|
tempPath, err := writeTempFile(p.abs, p.file.Bytes)
|
|
if err != nil {
|
|
cleanupTemps()
|
|
return nil, fmt.Errorf("write temp for %s: %w", p.abs, err)
|
|
}
|
|
p.tempPath = tempPath
|
|
tempPaths[p.abs] = tempPath
|
|
if err := applyOwnership(tempPath, p.owner); err != nil {
|
|
cleanupTemps()
|
|
return nil, fmt.Errorf("apply ownership to temp for %s: %w", p.abs, err)
|
|
}
|
|
}
|
|
|
|
// 6. PreCommit (validate-with-the-target).
|
|
if plan.PreCommit != nil {
|
|
if err := plan.PreCommit(ctx, tempPaths); err != nil {
|
|
cleanupTemps()
|
|
return nil, fmt.Errorf("%w: %v", ErrValidateFailed, err)
|
|
}
|
|
}
|
|
res.ValidateOK = true
|
|
|
|
// 7. Atomic rename each temp → final. If a mid-loop rename
|
|
// fails, attempt to restore the renames that already
|
|
// succeeded (a degraded form of rollback — better than
|
|
// leaving a half-deployed state).
|
|
doneRenames := make([]*filePrep, 0, len(preps))
|
|
for _, p := range preps {
|
|
if err := os.Rename(p.tempPath, p.abs); err != nil {
|
|
// Mid-loop rename failure. Roll back what we did.
|
|
rollbackErr := restoreFromBackups(doneRenames)
|
|
cleanupTemps()
|
|
if rollbackErr != nil {
|
|
return res, fmt.Errorf("%w: rename %s mid-loop, rollback also failed: %v (rename: %v)", ErrRollbackFailed, p.abs, rollbackErr, err)
|
|
}
|
|
return res, fmt.Errorf("rename %s: %w", p.abs, err)
|
|
}
|
|
doneRenames = append(doneRenames, p)
|
|
}
|
|
|
|
// 8. PostCommit (reload).
|
|
if plan.PostCommit != nil {
|
|
if err := plan.PostCommit(ctx); err != nil {
|
|
// Rollback: restore + re-PostCommit.
|
|
rollbackErr := restoreFromBackups(preps)
|
|
if rollbackErr != nil {
|
|
res.Duration = time.Since(start)
|
|
return res, fmt.Errorf("%w: PostCommit failed (%v) AND rollback restore failed (%v)", ErrRollbackFailed, err, rollbackErr)
|
|
}
|
|
// Restore succeeded; re-call PostCommit against the
|
|
// previous bytes. This is the second PostCommit; if
|
|
// IT also fails, we're in operator-actionable state.
|
|
if err2 := plan.PostCommit(ctx); err2 != nil {
|
|
res.Duration = time.Since(start)
|
|
return res, fmt.Errorf("%w: PostCommit failed (%v) AND second PostCommit after restore also failed (%v)", ErrRollbackFailed, err, err2)
|
|
}
|
|
res.RolledBack = true
|
|
res.Duration = time.Since(start)
|
|
return res, fmt.Errorf("%w: %v", ErrReloadFailed, err)
|
|
}
|
|
}
|
|
res.Reloaded = true
|
|
|
|
// 9. Janitor: prune backups beyond retention.
|
|
retention := plan.BackupRetention
|
|
if retention == 0 {
|
|
retention = DefaultBackupRetention
|
|
}
|
|
if retention > 0 {
|
|
for _, p := range preps {
|
|
_ = pruneBackups(p.abs, retention)
|
|
}
|
|
}
|
|
|
|
res.Duration = time.Since(start)
|
|
return res, nil
|
|
}
|
|
|
|
// validatePlan rejects malformed plans before any I/O.
|
|
func validatePlan(plan Plan) error {
|
|
if len(plan.Files) == 0 {
|
|
return fmt.Errorf("%w: no files", ErrPlanInvalid)
|
|
}
|
|
seen := make(map[string]struct{}, len(plan.Files))
|
|
for i, f := range plan.Files {
|
|
if f.Path == "" {
|
|
return fmt.Errorf("%w: file %d has empty path", ErrPlanInvalid, i)
|
|
}
|
|
abs, err := filepath.Abs(f.Path)
|
|
if err != nil {
|
|
return fmt.Errorf("%w: file %d (%s): %v", ErrPlanInvalid, i, f.Path, err)
|
|
}
|
|
if _, dup := seen[abs]; dup {
|
|
return fmt.Errorf("%w: duplicate destination %s", ErrPlanInvalid, abs)
|
|
}
|
|
seen[abs] = struct{}{}
|
|
}
|
|
return nil
|
|
}
|
|
|
|
// filePrep is the per-file working state for one Apply call.
|
|
// Held by Apply's slice; passed to restoreFromBackups during
|
|
// rollback.
|
|
type filePrep struct {
|
|
abs string
|
|
file File
|
|
tempPath string
|
|
owner resolvedOwnership
|
|
hadOrig bool
|
|
backupTo string
|
|
}
|
|
|
|
// restoreFromBackups copies each prep's backup back into place.
|
|
// Used during rollback (PostCommit failure or mid-loop rename
|
|
// failure).
|
|
func restoreFromBackups(preps []*filePrep) error {
|
|
var firstErr error
|
|
for _, p := range preps {
|
|
if p.backupTo == "" {
|
|
// File didn't exist before deploy — restore = remove.
|
|
if err := os.Remove(p.abs); err != nil && !errors.Is(err, os.ErrNotExist) {
|
|
if firstErr == nil {
|
|
firstErr = err
|
|
}
|
|
}
|
|
continue
|
|
}
|
|
// Read backup; atomically rewrite destination via the
|
|
// same temp + rename dance so this restore is itself
|
|
// atomic. We DON'T call AtomicWriteFile because we want
|
|
// to skip the per-file mutex (we already hold it from
|
|
// the outer Apply) and skip the backup-of-the-restore
|
|
// (we don't want a backup chain explosion).
|
|
bytes, err := os.ReadFile(p.backupTo)
|
|
if err != nil {
|
|
if firstErr == nil {
|
|
firstErr = fmt.Errorf("read backup %s: %w", p.backupTo, err)
|
|
}
|
|
continue
|
|
}
|
|
tempPath, err := writeTempFile(p.abs, bytes)
|
|
if err != nil {
|
|
if firstErr == nil {
|
|
firstErr = fmt.Errorf("write restore temp for %s: %w", p.abs, err)
|
|
}
|
|
continue
|
|
}
|
|
// Reapply original ownership (preserved from existing
|
|
// stat at prep time).
|
|
if err := applyOwnership(tempPath, p.owner); err != nil {
|
|
_ = os.Remove(tempPath)
|
|
if firstErr == nil {
|
|
firstErr = fmt.Errorf("apply ownership during restore for %s: %w", p.abs, err)
|
|
}
|
|
continue
|
|
}
|
|
if err := os.Rename(tempPath, p.abs); err != nil {
|
|
_ = os.Remove(tempPath)
|
|
if firstErr == nil {
|
|
firstErr = fmt.Errorf("rename during restore for %s: %w", p.abs, err)
|
|
}
|
|
continue
|
|
}
|
|
}
|
|
return firstErr
|
|
}
|
|
|
|
// cleanupBackups removes a partial set of backups. Used when an
|
|
// early backup step fails — we want to leave the destination
|
|
// directory clean.
|
|
func cleanupBackups(backupPaths map[string]string) {
|
|
for _, bp := range backupPaths {
|
|
if bp != "" {
|
|
_ = os.Remove(bp)
|
|
}
|
|
}
|
|
}
|