javakeystore: pre-deploy export snapshot + on-import-failure rollback + argv-password operator note

Closes Bundle 8 of the 2026-05-02 deployment-target coverage audit
(see cowork/deployment-target-audit-2026-05-02/RESULTS.md). Pre-fix,
DeployCertificate at javakeystore.go:172-272 ran an irreversible
keytool -delete against the existing alias, then keytool
-importkeystore. If the import failed after the delete succeeded,
the keystore was missing the alias entirely — previous cert gone,
new cert never landed. docs/deployment-atomicity.md L94 promised
"keytool snapshot; rollback via keytool -delete + re-import"; the
code didn't deliver. Separately, the operator-facing keystore
password is passed via -storepass argv (a standard keytool
limitation) which is visible to ps(1) for the duration of each
subprocess; this was undocumented as an operator-playbook caveat.

This commit:

1. Pre-delete snapshot. When os.Stat(KeystorePath) succeeds,
   snapshotKeystore runs keytool -exportkeystore to
   <BackupDir>/.certctl-bak.<unix-nanos>.p12 BEFORE the existing
   -delete step. Backup path persisted in a local variable for
   the rollback path; export-step failure aborts the deploy
   entirely (no mutation has happened yet — the keystore is
   untouched). Snapshot skipped on first-time deploys (no
   keystore file = nothing to roll back to). The "alias not
   present in pre-existing keystore" case is recognised via the
   well-known keytool error string and treated as a clean
   first-time-on-existing-keystore signal — the deploy proceeds
   without a backup, and rollback (if needed) becomes the
   no-backup branch.

2. On-import-failure rollback. When keytool -importkeystore
   returns an error and a backup exists,
   rollbackImport(ctx, backupPath) runs:
   - keytool -delete -alias <Alias> ... (best-effort; the failed
     import may have created a partial alias entry).
   - keytool -importkeystore from the backup PKCS#12 to restore
     the previous state.
   On rollback success, the deploy returns a wrapped error
   noting "rolled back from <backup_path>". On rollback failure,
   it returns an operator-actionable wrapped error containing
   the import error, the rollback error, and the backup path so
   the operator can run keytool -importkeystore manually from
   the .p12 file to recover. With no backup in hand, the import
   error is returned as-is.

3. Backup retention. Successful deploys prune older
   .certctl-bak.*.p12 files beyond Config.BackupRetention.
   Sort by ModTime newest-first; keep most recent N. Defaults:
   BackupRetention=0  → keep most recent 3 (the default).
   BackupRetention=N  → keep most recent N.
   BackupRetention=-1 → opt out of pruning entirely (operators
                        that wire their own archival/rotation).
   Pruning runs in the success path AFTER the optional reload
   command so it doesn't interfere with deploy-time signals.
   ReadDir / Remove failures are non-fatal (debug log only) —
   the deploy already succeeded.

4. Config gains BackupRetention int and BackupDir string fields.
   BackupDir defaults to filepath.Dir(KeystorePath) so backups
   land on the same filesystem as the keystore (atomic-ish
   writes, disk-full failures fail fast at snapshot time).

5. Helper extraction. snapshotKeystore + rollbackImport +
   pruneBackups + backupDir are private methods on Connector.
   Constants backupFilePrefix=".certctl-bak." and
   backupFileSuffix=".p12" centralise the naming convention so
   the snapshot writer, the rollback reader, and the retention
   pruner all agree.

6. Operator-playbook section added to docs/connectors.md
   JavaKeystore section. Documents the standard keytool
   -storepass argv exposure: ps(1)-visible for the duration
   of each subprocess. Lists mitigations:
   - Restrict shell access to the agent host.
   - Linux user namespaces / AppArmor / systemd
     ProtectProc=invisible to deny ps-visibility.
   - Single-purpose container for proper PID-namespace
     isolation.
   - Post-deploy keystore password rotation via reload_command
     for high-security environments.
   - BCFKS keystore type for FIPS environments (same argv
     caveat applies).
   Also documents an "Atomic rollback" subsection covering the
   snapshot/rollback flow, the new backup_retention /
   backup_dir Config fields, and the design choice to reuse
   the keystore password for the snapshot (rather than
   generating a separate transient password) — operator
   already trusts the connector with this secret, surface area
   doesn't grow, rollback's matching -srcstorepass stays
   simple.

Tests added to javakeystore_test.go (7 new tests, ~430 LOC):

- TestJKS_Snapshot_RunsBefore_Delete: mock executor records call
  order; asserts -exportkeystore is call[0], -delete is call[1],
  -importkeystore is call[2]. The snapshot MUST run before the
  delete — otherwise the delete destroys the very state the
  snapshot is meant to capture.
- TestJKS_Snapshot_FirstTimeDeploy_NoExport: no keystore file
  pre-created; asserts exactly 1 keytool call (-importkeystore
  only), no -exportkeystore.
- TestJKS_ImportFails_RollsBack: happy rollback path with one
  same-Subject backup. Asserts rollback re-import references the
  same backup path the snapshot wrote (verified via arg
  comparison between call[0] and call[4]).
- TestJKS_ImportFails_RollbackAlsoFails_OperatorActionable:
  wrapped-error escalation with backup path in the error
  message.
- TestJKS_BackupRetention_PrunesOldBackups: 5 pre-existing
  staggered-ModTime backups + 1 deploy-created → retention=3 →
  exactly 3 newest survive (deploy-created + 2 newest
  pre-existing); 3 oldest pre-existing pruned.
- TestJKS_BackupRetention_Zero_DefaultsTo3: BackupRetention=0
  must default to 3 (not "keep none").
- TestJKS_BackupRetention_Negative_OptsOut: BackupRetention=-1
  pre-existing 5 + deploy 1 = 6 total, all 6 remain.
- TestJKS_Snapshot_AliasNotInKeystore_ProceedsCleanly: keystore
  exists but alias missing; -exportkeystore returns "alias does
  not exist" → snapshot helper recognises this signal and
  returns ("", nil) so the deploy proceeds cleanly.

mockExecutor extended with optional `onCall` hook so the
retention-pruning tests can simulate keytool -exportkeystore's
file-write side effect (via the simulateExportSideEffect helper
that parses -destkeystore from args and writes a placeholder
.p12 file). Existing tests that don't set onCall behave
identically to before — backward compatible.

docs/deployment-atomicity.md L94 unchanged from today's text —
Bundle 1 doc-realignment hasn't shipped, so the "keytool snapshot;
rollback via keytool -delete + re-import" line was never softened.
Post-Bundle-8 the claim is honest (was aspirational pre-fix).

Verified locally (sandbox lacks staticcheck install due to disk
pressure; CI runs the full lint gate):
- gofmt -l ./internal/connector/target/javakeystore/ clean
- go vet ./internal/connector/target/javakeystore/ clean
- go build ./cmd/agent/... clean
- go test -race -count=1 ./internal/connector/target/javakeystore/
  green (16 tests total: 9 pre-existing + 7 new)

Audit reference: cowork/deployment-target-audit-2026-05-02/RESULTS.md
Bundle 8.
shankar0123
2026-05-02 19:01:06 +00:00
parent 1dd1dd4e0a
commit 87e0009d97
3 changed files with 887 additions and 2 deletions
@@ -17,6 +17,7 @@ import (
"os/exec"
"path/filepath"
"regexp"
"sort"
"strings"
"time"
@@ -49,6 +50,23 @@ type Config struct {
	// KeytoolPath overrides the default keytool binary path.
	// Default: "keytool" (found via PATH).
	KeytoolPath string `json:"keytool_path,omitempty"`
	// BackupRetention controls how many .certctl-bak.<unix-nanos>.p12 backup
	// files to keep after a successful deploy. Bundle 8 (2026-05-02
	// deployment-target audit) introduced these backups for on-import-failure
	// rollback; without retention, every deploy adds another file and disks
	// fill up over time. Values:
	//   0  → use default of 3 (keep most recent 3 backups).
	//   N  → keep most recent N backups.
	//   -1 → opt out of pruning entirely (operators that wire their own
	//        archival/rotation logic).
	BackupRetention int `json:"backup_retention,omitempty"`
	// BackupDir overrides the directory where .certctl-bak.* files are
	// written and pruned from. Default: filepath.Dir(KeystorePath), the same
	// filesystem as the keystore itself, so backup writes are atomic-ish
	// and a full disk fails fast at snapshot time rather than mid-deploy.
	BackupDir string `json:"backup_dir,omitempty"`
}
// CommandExecutor abstracts command execution for testability.
@@ -168,7 +186,21 @@ func (c *Connector) ValidateConfig(ctx context.Context, config json.RawMessage)
}
// DeployCertificate imports a certificate and key into the Java Keystore.
// Flow: PEM → PKCS#12 temp file → keytool -importkeystore → cleanup temp → optional reload
//
// Bundle 8 of the 2026-05-02 deployment-target audit added a pre-delete
// snapshot + on-import-failure rollback wrapper around the original
// keytool flow:
// 1. Convert PEM to PKCS#12 temp file (transient password, never logged).
// 2. If the keystore exists, run `keytool -exportkeystore` to a sibling
// `.certctl-bak.<unix-nanos>.p12` BEFORE the irreversible -delete.
// Backup path persisted in a local variable for the rollback path.
// 3. Run the existing -delete (best-effort; alias may not exist).
// 4. Run keytool -importkeystore.
// 5. On import failure with a backup in hand, rollbackImport runs
// keytool -delete (clean up the alias the failed import may have
// created) + keytool -importkeystore from the backup PFX.
// 6. On success: compute thumbprint, run optional reload command,
// prune old backup files per Config.BackupRetention.
func (c *Connector) DeployCertificate(ctx context.Context, request target.DeploymentRequest) (*target.DeploymentResult, error) {
if request.KeyPEM == "" {
return nil, fmt.Errorf("private key is required for Java Keystore import")
@@ -204,8 +236,27 @@ func (c *Connector) DeployCertificate(ctx context.Context, request target.Deploy
}
tmpFile.Close()
// Step 2: Delete existing alias if keystore exists (keytool -delete)
// Bundle 8: pre-delete snapshot. When the keystore exists, run
// keytool -exportkeystore to capture the prior alias state into a
// sibling PKCS#12 backup file BEFORE the irreversible -delete step.
// Backup path is held in a local variable for the rollback path;
// snapshot failure aborts the deploy entirely (no mutation has
// happened yet, so the keystore is untouched).
//
// Empty backupPath = first-time deploy (keystore file doesn't exist
// yet) — rollback in that case has nothing to restore from; the
// failure path returns the import error verbatim.
var backupPath string
if _, err := os.Stat(c.config.KeystorePath); err == nil {
var snapErr error
backupPath, snapErr = c.snapshotKeystore(ctx)
if snapErr != nil {
return nil, fmt.Errorf("pre-deploy snapshot failed: %w", snapErr)
}
c.logger.Debug("pre-deploy snapshot captured", "backup_path", backupPath)
// Step 2: Delete existing alias (keytool -delete). Best-effort —
// the alias may not exist in this keystore.
deleteArgs := []string{
"-delete",
"-alias", c.config.Alias,
@@ -234,6 +285,30 @@ func (c *Connector) DeployCertificate(ctx context.Context, request target.Deploy
	output, err := c.executor.Execute(ctx, c.config.KeytoolPath, importArgs...)
	if err != nil {
		// Bundle 8: import failed. Roll back if we have a backup; otherwise
		// surface the import error verbatim (first-time deploy: nothing
		// to restore from, the failed import didn't write anything we can
		// undo at the alias level).
		if backupPath != "" {
			c.logger.Error("keytool import failed; attempting rollback",
				"error", err,
				"output", output,
				"backup_path", backupPath)
			rbErr := c.rollbackImport(ctx, backupPath)
			if rbErr != nil {
				// Operator-actionable: import AND rollback both failed.
				// Surface BOTH errors AND the backup path so the operator
				// can manually keytool -importkeystore from the .p12 file
				// to recover.
				combined := fmt.Errorf("keytool import failed (%w) AND rollback also failed (%v); manual operator inspection required (backup at %s)", err, rbErr, backupPath)
				c.logger.Error("JavaKeystore rollback also failed",
					"import_error", err,
					"rollback_error", rbErr,
					"backup_path", backupPath)
				return nil, combined
			}
			return nil, fmt.Errorf("keytool import failed; rolled back from %s: %s: %w", backupPath, output, err)
		}
		return nil, fmt.Errorf("keytool import failed: %s: %w", output, err)
	}
@@ -251,6 +326,12 @@ func (c *Connector) DeployCertificate(ctx context.Context, request target.Deploy
}
}
// Bundle 8: prune old backups on the success path so operator filesystems
// don't accumulate .certctl-bak.* files indefinitely. Failure here is
// non-fatal (debug log only) — the deploy succeeded, retention cleanup
// is housekeeping.
c.pruneBackups()
c.logger.Info("certificate imported to Java Keystore",
"keystore", c.config.KeystorePath,
"alias", c.config.Alias,
@@ -325,3 +406,186 @@ func (c *Connector) ValidateDeployment(ctx context.Context, request target.Valid
// Ensure Connector implements target.Connector.
var _ target.Connector = (*Connector)(nil)
// --- Bundle 8: pre-delete snapshot + on-import-failure rollback ---
// backupFilePrefix is the literal prefix on rollback-snapshot files.
// Centralised here so the snapshot writer, the rollback reader, and the
// retention pruner all agree on the naming convention.
//
// Bundle 8 of the 2026-05-02 deployment-target audit.
const backupFilePrefix = ".certctl-bak."
// backupFileSuffix is the literal suffix on rollback-snapshot files. Always
// PKCS#12 regardless of the source keystore type — `keytool -exportkeystore`
// destinations are PKCS#12 by convention because PKCS#12 is a standard,
// widely readable format, while JKS is a legacy Java-specific format.
const backupFileSuffix = ".p12"
// backupDir returns the directory rollback snapshots are written to.
// Operators can override via Config.BackupDir; default = same dir as the
// keystore so snapshots land on the same filesystem (atomic-ish writes,
// disk-full failures surface at snapshot time rather than mid-deploy).
func (c *Connector) backupDir() string {
	if c.config.BackupDir != "" {
		return c.config.BackupDir
	}
	return filepath.Dir(c.config.KeystorePath)
}
// snapshotKeystore runs `keytool -exportkeystore` to copy the existing alias
// into a new PKCS#12 file at <backupDir>/.certctl-bak.<unix-nanos>.p12.
// Returns the backup path on success; the caller persists it for the
// rollback path.
//
// The export password mirrors the keystore password — it's the same secret
// the operator already trusts the connector with, and avoiding a second
// transient password keeps the rollback's matching `-srcstorepass` simple.
//
// Bundle 8 of the 2026-05-02 deployment-target audit.
func (c *Connector) snapshotKeystore(ctx context.Context) (string, error) {
	backupPath := filepath.Join(
		c.backupDir(),
		fmt.Sprintf("%s%d%s", backupFilePrefix, time.Now().UnixNano(), backupFileSuffix),
	)
	exportArgs := []string{
		"-exportkeystore",
		"-srckeystore", c.config.KeystorePath,
		"-srcstoretype", c.config.KeystoreType,
		"-srcstorepass", c.config.KeystorePassword,
		"-srcalias", c.config.Alias,
		"-destkeystore", backupPath,
		"-deststoretype", "PKCS12",
		"-deststorepass", c.config.KeystorePassword,
		"-noprompt",
	}
	output, err := c.executor.Execute(ctx, c.config.KeytoolPath, exportArgs...)
	if err != nil {
		// keytool -exportkeystore returns non-zero when the alias isn't
		// present in the source keystore. That's a normal first-time-on-
		// existing-keystore signal, NOT an outage. Treat it as "no
		// snapshot to roll back to" and proceed cleanly: the import
		// will create the alias from scratch, and rollback (if the
		// import then fails) will be the no-backup path.
		lowerOut := strings.ToLower(output)
		if strings.Contains(lowerOut, "does not exist") || strings.Contains(lowerOut, "alias <") {
			c.logger.Debug("snapshot found no existing alias to export — first-time-on-keystore deploy",
				"alias", c.config.Alias,
				"output", output)
			return "", nil
		}
		return "", fmt.Errorf("keytool -exportkeystore: %s: %w", output, err)
	}
	return backupPath, nil
}
// rollbackImport restores the previous alias state from a snapshot PFX. Two
// keytool calls in order:
//  1. -delete the alias (best-effort: the failed import may or may not have
//     created an alias entry; we don't know which, so we always try).
//  2. -importkeystore from the backup PFX, restoring the original cert + key
//     under the original alias.
//
// Returns nil on success; wrapped error if the re-import fails. The
// caller surfaces the wrapped error to the operator alongside the import
// error and the backup path so manual recovery is possible.
//
// Bundle 8 of the 2026-05-02 deployment-target audit.
func (c *Connector) rollbackImport(ctx context.Context, backupPath string) error {
	// Step 1: best-effort delete (alias may not exist after a failed import).
	deleteArgs := []string{
		"-delete",
		"-alias", c.config.Alias,
		"-keystore", c.config.KeystorePath,
		"-storepass", c.config.KeystorePassword,
		"-storetype", c.config.KeystoreType,
		"-noprompt",
	}
	// Output and error are deliberately discarded: the delete is best-effort.
	_, _ = c.executor.Execute(ctx, c.config.KeytoolPath, deleteArgs...)
	// Step 2: re-import from the backup PKCS#12 to restore the previous state.
	importArgs := []string{
		"-importkeystore",
		"-srckeystore", backupPath,
		"-srcstoretype", "PKCS12",
		"-srcstorepass", c.config.KeystorePassword,
		"-destkeystore", c.config.KeystorePath,
		"-deststoretype", c.config.KeystoreType,
		"-deststorepass", c.config.KeystorePassword,
		"-srcalias", c.config.Alias,
		"-destalias", c.config.Alias,
		"-noprompt",
	}
	output, err := c.executor.Execute(ctx, c.config.KeytoolPath, importArgs...)
	if err != nil {
		return fmt.Errorf("rollback re-import: %s: %w", output, err)
	}
	c.logger.Info("JavaKeystore rollback completed", "backup_path", backupPath)
	return nil
}
// pruneBackups removes older `.certctl-bak.*.p12` files beyond the configured
// retention count so operator filesystems don't accumulate snapshots
// indefinitely. Best-effort: any error during the readdir / remove cycle
// is swallowed at debug level — the deploy already succeeded, retention
// cleanup is housekeeping.
//
// Retention semantics (per Config.BackupRetention):
// - 0 → default of 3 (keep most recent 3 backups).
// - N → keep most recent N backups.
// - -1 → opt out entirely (no pruning).
//
// "Most recent" is determined by file ModTime, not by the unix-nanos in the
// filename — ModTime is robust against system-clock changes between deploys
// and aligns with the actual filesystem ordering operators see in `ls -lt`.
//
// Bundle 8 of the 2026-05-02 deployment-target audit.
func (c *Connector) pruneBackups() {
	keep := c.config.BackupRetention
	if keep == 0 {
		keep = 3
	}
	if keep < 0 {
		return // operator opted out
	}
	dir := c.backupDir()
	entries, err := os.ReadDir(dir)
	if err != nil {
		c.logger.Debug("backup retention prune skipped: ReadDir failed",
			"dir", dir, "error", err)
		return
	}
	type backupFile struct {
		name    string
		modTime time.Time
	}
	var backups []backupFile
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		name := e.Name()
		if !strings.HasPrefix(name, backupFilePrefix) || !strings.HasSuffix(name, backupFileSuffix) {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		backups = append(backups, backupFile{name: name, modTime: info.ModTime()})
	}
	if len(backups) <= keep {
		return
	}
	// Sort newest-first by ModTime; older entries (the tail) get pruned.
	sort.Slice(backups, func(i, j int) bool {
		return backups[i].modTime.After(backups[j].modTime)
	})
	for _, b := range backups[keep:] {
		path := filepath.Join(dir, b.name)
		if err := os.Remove(path); err != nil {
			c.logger.Debug("backup retention prune: Remove failed",
				"path", path, "error", err)
		}
	}
}