mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 20:01:31 +00:00
590f654b0daa7c4b5a2eb87f78f11410523f0447
635 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
590f654b0d |
awsacmpca: replace stub client with AWS SDK v2 implementation
Closes the #1 acquisition-readiness blocker from the 2026-05-01 issuer coverage audit. The production New() constructor previously hardcoded &stubClient{}, which returned "AWS SDK client not initialized (stub)" on every method. Tests passed green via NewWithClient mock injection — a path the production constructor never took. AWSACMPCA was wired into the factory, the seed file, the test suite, and marketing collateral but did not actually issue, retrieve, or revoke certificates. This commit: - Adds aws-sdk-go-v2/{config,service/acmpca,aws} to go.mod (with acmpca/types as a sub-package). go mod tidy could not be completed in the sandbox due to virtiofs concurrent-open-file ceiling on the module cache; the require blocks were arranged manually so the three directly-imported packages are non-indirect. Build, vet, staticcheck, and the full test suite are green; operator should run `go mod tidy` on the workstation to confirm cosmetic ordering before pushing. - Implements sdkClient wrapping *acmpca.Client with local input/output type translation. Each method translates the connector's local input type to the SDK's typed input, calls the SDK, and translates the SDK output back to the local output type. aws-sdk-go-v2 types do not leak out of the awsacmpca package. - Deletes stubClient (the four "AWS SDK client not initialized (stub)" methods). After this commit, there is no fall-back stub; production New() always wires the SDK. - Rewrites New() to load credentials via awsconfig.LoadDefaultConfig with awsconfig.WithRegion(config.Region) and construct the SDK client via acmpca.NewFromConfig. Returns (*Connector, error). When config is nil or config.Region is empty, New defers SDK loading; ValidateConfig builds the client lazily on the first successful validation. This preserves the test pattern of New(nil, logger) → ValidateConfig. - Wires acmpca.NewCertificateIssuedWaiter (5-minute default timeout) inside sdkClient.IssueCertificate so the connector's two-call pattern (IssueCertificate → GetCertificate) sees synchronous-via- waiter semantics. The waiter is hidden from the ACMPCAClient interface so mock implementations stay simple. - Maps RFC 5280 revocation reasons to acmpcatypes.RevocationReason via the existing mapRevocationReason helper plus a cast at the sdkClient.RevokeCertificate boundary. - Updates the issuerfactory.NewFromConfig call site at factory.go:L88 for the new (*Connector, error) signature; the factory's outer signature already returns (issuer.Connector, error) so the change is local. - Adds nil-client guards on the four client-using connector methods (IssueCertificate, RevokeCertificate, GetCACertPEM, plus the RenewCertificate path via IssueCertificate). When the connector is used before ValidateConfig has been called, these methods fail-fast with a "client not initialized" sentinel error instead of panicking. - Fixes the copy-paste env-var doc-comments at awsacmpca.go:L41,L45 (CERTCTL_GOOGLE_CAS_PROJECT / CERTCTL_GOOGLE_CAS_CA_ARN → CERTCTL_AWS_PCA_REGION / CERTCTL_AWS_PCA_CA_ARN). The actual config loader at internal/config/config.go:L1556-L1561 already used the correct env-var names; only the doc-comments were wrong. - Updates the package doc-comment at awsacmpca.go:L1-L36 to clarify the synchronous-via-waiter behavior (issuance is asynchronous at the API level; the waiter inside sdkClient.IssueCertificate hides the asynchrony). - Adds TestNew_ProductionPath/ValidConfigBuildsRealClient: calls production New() (NOT NewWithClient) with a valid config, asserts err is nil, then calls IssueCertificate with a bogus CSR and asserts the resulting error is the expected PEM-decode error rather than the deleted stubClient's "client not initialized" sentinel. This is the regression-marker test the audit's D11 blocker called out as missing — if anyone re-introduces a stub-style placeholder from production New() in the future, this test fails. - Adds TestNew_ProductionPath/NilConfigDefersClientInit: documents the lazy-init contract for the New(nil, logger) → ValidateConfig pattern. - Adds TestNew_ProductionPath/ValidateConfigBuildsClientLazily: verifies that ValidateConfig wires the SDK client when New was called with nil config. - Adds TestNew_ProductionPath/{Revoke,GetCAPEM}BeforeInitFailsFast: verifies the nil-client guards on the other client-using methods. - Adds TestNew_ErrorPaths covering AccessDeniedException-shaped errors, transient 5xx errors, and ctx-cancel propagation via the existing mockACMPCAClient. - Updates docs/connectors.md:L490-L555 with: the synchronous-via-waiter behavior, a complete IAM policy example scoped to the four ACM PCA actions, a worked POST /api/v1/issuers example, and a troubleshooting section with three known failure modes (AccessDeniedException, ResourceNotFoundException, waiter timeout). Live AWS integration testing is intentionally not added: ACM PCA is a Pro-tier feature in localstack and the existing interface-mock tests cover correctness end-to-end. Operators with AWS credentials can validate by following the worked example in docs/connectors.md. Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md Top-10 fix #1 (Part 3, narrative section). |
||
|
|
b3aad02232 |
chore(README): remove the second Scarf pixel — analytics consolidated to certctl.io
The README has carried two Scarf pixels for some time:
- 89db181e-76e0-45cc-b9c0-790c3dfdfc73 (kept earlier as 'GitHub
traffic complement to GitHub Insights')
- b9379aff-9e5c-4d01-8f2d-9e4ffa09d126 (moved to the certctl.io
landing page in commit
|
||
|
|
6a5cfb3d01 |
chore(README): remove duplicative Scarf pixel — moved to certctl.io
The README had two Scarf pixels (89db181e and b9379aff). For README visit tracking, GitHub's built-in Insights → Traffic dashboard already provides views, uniques, clones, AND referring sites with click counts (Reddit, HN, Twitter, search, etc.) at higher granularity than a Scarf pixel can extract — Scarf can only see 'github.com' as the referrer because that's where the README HTML is served from, while GitHub Insights knows the actual external referrer that landed the visitor on the README. Removing pixel b9379aff-9e5c-4d01-8f2d-9e4ffa09d126 from the README and reusing it on the certctl.io landing page (sibling commit on certctl-io/certctl.io), where Scarf is the only analytics source and the referrer header actually carries useful attribution. Pixel 89db181e-76e0-45cc-b9c0-790c3dfdfc73 stays in the README as a backup signal alongside GitHub Insights — keeps continuity for the longer-running Scarf project counter. No data loss: GitHub Insights covers what 89db181e was double- counting, and b9379aff now serves a distinct surface (certctl.io) where it actually adds new attribution data. |
||
|
|
dcd82d062f |
docs: convert all 9 ASCII diagrams to mermaid
Audit of docs/ found 32 diagrams: 23 already in mermaid, 9 in ASCII
art (box-drawing chars / +-pipe boxes). Converting all 9 to mermaid
so GitHub renders them as actual diagrams in the docs preview.
Files affected (9 diagram blocks across 6 files):
docs/architecture.md block 1 line 706 EST request flow
docs/architecture.md block 2 line 798 SCEP request flow
docs/architecture.md block 3 line 893 Per-profile TrustAnchor +
Intune challenge dispatch
docs/architecture.md block 4 line 935 signer.Driver interface +
4 implementations
docs/ci-pipeline.md block 1 line 20 On-push pipeline tree
docs/est.md block 1 line 254 WiFi 802.1X / EAP-TLS flow
docs/legacy-est-scep.md block 1 line 40 TLS-version-bridging proxy
docs/qa-test-guide.md block 1 line 41 qa_test.go to demo stack
docs/scep-intune.md block 1 line 39 Intune cloud chain
Conversion notes:
- Linear flows → flowchart TD/LR. Per-step annotations that the
ASCII had as floating text between arrows are now edge labels —
cleaner and easier to read.
- architecture.md block 4 (signer drivers) → flowchart LR with a
subgraph for the Driver interface. Cleaner than a class diagram
for the "code uses one of these implementations" semantics.
- ci-pipeline.md tree → flowchart TD. Adds a dotted '-.depends
on.->' arrow making the go-build-and-test → deploy-vendor-e2e
dependency visually obvious (was a parenthetical in the ASCII).
- est.md WiFi/RADIUS → flowchart LR with EAP, Radius, trusts,
and EST as four distinct labeled arrows. The 'trusts' annotation
was floating off to the side in the ASCII; now it's the arrow
label between Radius and certctl CA.
- All semantic detail preserved: every node label, arrow direction,
inline annotation, and multi-line cell content carries through.
Verified: post-conversion audit shows 32 mermaid blocks, 0 ASCII.
Diff is symmetric — 108 inserts, 123 deletes — because mermaid is
slightly more compact than the box-drawing characters it replaces.
GitHub renders mermaid blocks natively in markdown previews since
2022, so all 9 diagrams now render as real flowcharts in the docs
view rather than as monospaced character art.
|
||
|
|
2643a427ac |
ci(digest-validity): exclude Windows IIS digest — image is doc-only, not pulled by Linux CI
CI run #376 (commit
|
||
|
|
a1c7741e1b |
fix(deploy/test) + ci(guard): drop dead SCEP profile from test compose
The deploy-vendor-e2e job has been failing with the certctl-test-server container restarting endlessly. Diagnostic dump (added in |
||
|
|
e06447b763 |
Revert CodeQL custom config + sanitizer model — leave alert #23 open
Reverts: |
||
|
|
482e952dde |
ci(codeql): rewire local model pack discovery — fix 1122f5a silent no-op
Two CodeQL runs (commits |
||
|
|
c4157fd196 |
fix(deploy/test) + ci(guard): unblock deploy-vendor-e2e — encryption-key length
Two-part complete-path fix for the deploy-vendor-e2e failure that has
been firing since the ci-pipeline-cleanup Phase 5 matrix collapse
started actually booting the certctl-test-server:
Failed to load configuration:
CERTCTL_CONFIG_ENCRYPTION_KEY too short (29 bytes; minimum 32).
Surfaced via the diagnostic-dump step landed in commit
|
||
|
|
1122f5a097 |
ci(codeql): teach analyzer about ValidateSafeURL SSRF barrier
Closes CodeQL alert #23 (go/request-forgery, Critical) at the structural level — by telling CodeQL what the runtime code already does — rather than via per-line `// codeql[...]` suppressions. Background. internal/service/scep_probe.go:232 calls client.Do(req) where the request URL is built from operator-supplied input. The runtime defense is two-layer: 1. validation.ValidateSafeURL(rawURL) at scep_probe.go:86 rejects non-http(s) schemes, empty hosts, literal-IP hosts in reserved ranges (loopback, link-local incl. cloud metadata 169.254.169.254, multicast, broadcast, unspecified, IPv6 link-local), and DNS names whose A/AAAA resolution returns any reserved IP. RFC 1918 is intentionally NOT blocked — see internal/validation/ssrf.go:17-21 for the design rationale. 2. validation.SafeHTTPDialContext on the http.Transport (line 254) re-resolves at dial time, applies the same reserved-IP set, and pins the dial to a literal non-reserved IP — defeating DNS rebinding between validate and dial. CodeQL's go/request-forgery query is a syntactic taint-tracking rule with no built-in knowledge of either validator, so it reports the finding even though the runtime is correctly defended. The fix. Add a Models-as-Data (MaD) extension at .github/codeql/ declaring ValidateSafeURL as a request-forgery barrier. The barrier applies to Argument[0] (the URL parameter), which means the analyzer treats every URL flowing through ValidateSafeURL as sanitized for the request-forgery taint set. After this lands: - Alert #23 dismisses at scep_probe.go:232. - The same model applies to the second site of this exact shape — webhook notifier's outbound client.Do (internal/connector/ notifier/webhook/webhook.go) — without per-line annotations. - Future code that flows operator URLs through ValidateSafeURL inherits the barrier automatically. This is the structural fix, not a band-aid: - Band-aid (rejected): `// codeql[go/request-forgery]` suppression on line 232. Suppresses one alert; doesn't teach the analyzer. Webhook notifier would need the same comment when its sibling rule landing fires. - Structural (this change): teach CodeQL via models-as-data, in config checked into the repo, that lives next to the workflow that uses it. The validators ARE sanitizers in the runtime — this PR makes the analyzer's model match reality. Files: - .github/codeql/qlpack.yml — local model pack manifest, declares extensionTargets: codeql/go-all: '*' - .github/codeql/models/request-forgery-sanitizers.model.yml — barrierModel row for validation.ValidateSafeURL Argument[0] / request-forgery taint kind / manual provenance - .github/codeql/codeql-config.yml — references the local pack + keeps security-and-quality query suite scope - .github/workflows/codeql.yml — Initialize CodeQL step picks up config-file: ./.github/codeql/codeql-config.yml. The existing `queries: security-and-quality` line stays so even if the config file fails to load, the suite scope is preserved. - docs/architecture.md::Input Validation and SSRF Protection — extended to name the egress validators (ValidateSafeURL + SafeHTTPDialContext) and the call sites (SCEP probe + webhook notifier). Closes the docs gap surfaced during the audit; the egress threat-model previously lived only in source comments. Requires CodeQL CLI ≥ 2.25.2 for the barrierModel extensible predicate (Go MaD support added 2026-04-21). github/codeql-action@v3 ships a recent enough CLI by default; if a future analysis fails with "unknown extensible predicate barrierModel", the action's CLI has regressed below 2.25.2 — pin a newer action version rather than reverting this pack. Documented inline in qlpack.yml. References: - https://codeql.github.com/docs/codeql-language-guides/customizing-library-models-for-go/ - https://github.blog/changelog/2026-04-21-codeql-now-supports-sanitizers-and-validators-in-models-as-data/ |
||
|
|
3b96b3561c |
ci: dump container logs on deploy-vendor-e2e failure
The 25194251740 CI run failed with "container certctl-test-server is unhealthy" but the GitHub Actions log doesn't include the server's stdout/stderr — compose only reports the dependency-chain symptom. Without the server's actual log output we can't tell whether the unhealthy state was caused by a DB migration crash, port bind failure, entrypoint stall, OOM kill, or healthcheck race. Add an `if: failure()` step right before teardown that dumps: - `docker compose ps -a` (every container's exit status) - last 200 lines from certctl-test-server - all of tls-init (one-shot, short) - last 100 lines from postgres + stepca + agent - last 50 lines from pebble This is a permanent debuggability improvement, not a band-aid: the matrix-collapse (Phase 5) brings up ~18 containers concurrently where pre-collapse the per-vendor matrix brought up ~7. Future transient failures will be much faster to diagnose with logs in the CI output. Once we know the actual root cause from this dump, we fix it for real. Placed AFTER skip-count enforcement (so failures in either step trigger it) and BEFORE teardown (which is `if: always()` and would otherwise nuke the containers before we could log them).v2.0.67 |
||
|
|
c8624a7fae |
fix(deploy/test): libest IP collision with tls-init (10.30.50.9 → 10.30.50.10)
Two services on the certctl-test bridge network were pinned to the same static IP: certctl-tls-init (line 91) and libest-client (line 472). The pre-Phase-5 per-vendor matrix structurally hid this: - tls-init is profile-less ⇒ always runs - libest-client is profiles=[est-e2e] ⇒ only runs when est-e2e job brings it up - est-e2e and deploy-e2e historically lived in DIFFERENT CI jobs ⇒ separate docker networks ⇒ no collision The collision would surface the moment any single CI job invokes both `--profile deploy-e2e` and `--profile est-e2e`, or the moment a local operator runs `docker compose --profile=*` for full-stack debugging. Pre-emptive fix. Move libest to 10.30.50.10 (next free address; allocated range was 10.30.50.2-9 + 20-30, the entire 10-19 sub-range was unused). NOT the cause of the deploy-vendor-e2e "certctl-test-server is unhealthy" failure in CI run 25194251740 — libest isn't in profile=deploy-e2e and never started in that run. Real cause for that failure is being investigated in a separate commit (CI diagnostic dumping). |
||
|
|
7e0a7deeff |
fix(deploy/test/libest): drop make-time CFLAGS/LDFLAGS pass-through
estclient link was failing with `cannot find -lsafe_lib` despite
libsafe_lib.a building cleanly under safe_c_stub/lib/. Root cause:
libest's configure.ac (lines 193-195) appends the bundled safec
stub's path to user-supplied flags:
CFLAGS="$CFLAGS -Wall -I$safecdir/include"
LDFLAGS="$LDFLAGS -L$safecdir/lib"
LIBS="$LIBS -lsafe_lib"
These get baked into the generated Makefile via @CFLAGS@/@LDFLAGS@/
@LIBS@ substitutions. Per automake's variable-precedence rules, a
command-line `make LDFLAGS=...` overrides the `LDFLAGS = @LDFLAGS@`
line in the Makefile — wiping the `-L/src/safe_c_stub/lib` that
configure put there.
The previous commit (
|
||
|
|
f7ee64bd79 |
fix(deploy/test/libest): CFLAGS=-fcommon + LDFLAGS=--allow-multiple-definition
CI run 25193735664 (image-and-supply-chain) showed bullseye-slim fixed the OpenSSL 3.0 FIPS_mode errors, but the multiple-definition errors persisted. Root cause was misdiagnosed in commit |
||
|
|
a1fae33f40 |
fix(deploy/test): f5-mock-icontrol host-port collision (20443 → 20449)
CI run 25192994486 (deploy-vendor-e2e job) failed with:
Error response from daemon: failed to set up container networking:
driver failed programming external connectivity on endpoint
certctl-test-f5-mock: Bind for 0.0.0.0:20443 failed: port is already
allocated
apache-test (compose line 491) and f5-mock-icontrol (compose line 619)
both bound host port 20443. The pre-Phase-5 per-vendor matrix only ran
one sidecar at a time, so the collision was structurally hidden. The
ci-pipeline-cleanup Phase 5 collapse brings all 11 sidecars up
simultaneously — the bug surfaces.
This was a pre-existing latent bug in the deploy-hardening II Phase 1
(commit
|
||
|
|
bba425393b |
fix(deploy/test/libest): switch base bookworm-slim → bullseye-slim
libest r3.2.0 (last upstream commit 2020-07-06) was authored against
OpenSSL 1.1.x and binutils ≤ 2.35. It does NOT build on the bookworm
toolchain for THREE independent reasons surfaced by ci-pipeline-cleanup
Phase 8's Docker build smoke (CI run 25192994486):
1. FIPS_mode / FIPS_mode_set undefined references
OpenSSL 3.0 removed these. libest r3.2.0 calls them in 5 places
(est_client.c × 3, est_server.c × 1, estclient.c × 1).
Even libest 'main' branch still uses them without OPENSSL_VERSION
guards, so we can't escape this by bumping LIBEST_REF.
2. e_ctx_ssl_exdata_index multiple definition
est_locl.h:593 declares the symbol without 'extern', so every
translation unit including the header gets its own definition.
binutils 2.36+ defaults to -fno-common which refuses this; older
binutils tolerated it. Fix is on libest main but not in r3.2.0.
3. ossl_dump_ssl_errors duplicate symbol
Symbol exists in both libest src + example/client/utils.c —
same -fno-common shape.
debian:bookworm-slim ships OpenSSL 3.0 + binutils 2.40 — three for three.
debian:bullseye-slim ships OpenSSL 1.1.1n + binutils 2.35.2 — zero for three.
Switching the base eliminates all three errors at once. Both FROM lines
swap (builder + runtime) so the dynamically-linked libssl ABI matches.
Runtime apt: 'libssl3' → 'libssl1.1' for the same reason.
Why this is the proper path, not a band-aid:
- Bullseye is the actual environment libest 3.2.0 was authored against
(per its configure.ac HAVE_OLD_OPENSSL macro). Bookworm was the wrong
base for this dep from day 1 of the EST RFC 7030 hardening bundle.
- The libest sidecar runs in a hermetic test environment — not exposed
to attackers, not shipped in production. OpenSSL 1.1.1 EOL (2023-09)
is acceptable for a test-only fixture. Production certctl images
remain on bookworm-slim with OpenSSL 3.0.
- Bullseye support timeline: regular updates until 2026-08, LTS until
2028-08. Two+ years of runway before the next base bump.
Both FROM lines pinned to debian:bullseye-slim@sha256:1a4701c321b1...
(verified via OCI v2 manifest endpoint 2026-04-30).
Sandbox verification:
bash scripts/ci-guards/H-001-bare-from.sh → clean
bash scripts/ci-guards/digest-validity.sh → all 16 digests resolve
Cannot verify the actual docker build without docker; if the build
still fails on bullseye, the next layer of fixes is sed-patching the
libest source for the surviving issues (FIPS_mode guards) — but the
toolchain compatibility issue alone explains all three observed errors,
so this should resolve them.
|
||
|
|
ffcd5e809a |
chore(fmt): catch vendor_e2e files missed by Phase 1 sweep filter
Follow-up to commit |
||
|
|
31ce64653d |
fix(deploy/test/libest): pin LIBEST_REF to upstream tag r3.2.0
The Dockerfile at HEAD pinned LIBEST_REF=v3.2.0-2 — that ref does
NOT exist on cisco/libest upstream. Verified via:
curl -sS https://api.github.com/repos/cisco/libest/tags
# only tags returned: v1.0.0, r3.2.0, 1.1.0
The 'v' prefix and the '-2' patch suffix were both wrong from day
one (commit
|
||
|
|
7b8cadcd02 |
refactor(scripts): move CI helpers out of scripts/ci-guards/
The 'Regression guards' loop step in ci.yml runs:
for g in scripts/ci-guards/*.sh; do bash "$g"; done
Per the directory's own contract (scripts/ci-guards/README.md), every
script there MUST be runnable bare with no args / no env. Three files
violated that contract — they're helpers consumed by specific CI job
steps with arguments, not regression guards. They were misplaced.
Moved (git mv):
scripts/ci-guards/vendor-e2e-skip-check.sh → scripts/
scripts/ci-guards/vendor-e2e-skip-allowlist.txt → scripts/
scripts/ci-guards/coverage-pr-comment.sh → scripts/
Updated ci.yml call sites:
- deploy-vendor-e2e job: bash scripts/vendor-e2e-skip-check.sh $LOG
- go-build-and-test job: bash scripts/coverage-pr-comment.sh
Tightened scripts/vendor-e2e-skip-check.sh arg parse from a silent
default ('LOG=${1:-test-output.log}') to a mandatory-arg form
('LOG=${1:?usage: ...}') so misuse fails loud at parse time rather
than at the missing-file check.
Updated scripts/ci-guards/README.md contract to spell out the
guard-vs-helper distinction explicitly; lists current helpers under
scripts/ for future-author guidance.
Verified locally: 'for g in scripts/ci-guards/*.sh; do bash $g; done'
returns clean (22 guards pass) on HEAD post-move.
Closes the regression-guards-loop failure that surfaced in CI run
25192163943 (job 73864471346 'Frontend Build').
|
||
|
|
7cb453a336 |
chore(fmt): repo-wide gofmt -w sweep — close drift surfaced by ci-pipeline-cleanup Phase 4
Mechanical reformat. The new 'gofmt drift' CI step (added in
ci-pipeline-cleanup Phase 4, commit
|
||
|
|
e2298c8222 |
release: ci-pipeline-cleanup complete (v2.X.0)
Bundle: ci-pipeline-cleanup, Phase 13. Bundle complete. Final shape: - Status checks per push: 19 → 7 - ci.yml line count: 1488 → 439 (-71%) - 22 regression guards extracted to scripts/ci-guards/ - 9-package coverage thresholds in .github/coverage-thresholds.yml - 3 lying fields closed (staticcheck soft-gate; H-001 fabricated-digest regex-only check; Windows matrix that validated nothing) - 5 new gates added (digest validity, go mod tidy, gofmt parity, OpenAPI ↔ handler operationId parity, Docker build smoke) - 3-tier make convention (verify, verify-deploy, verify-docs) - 2 deliberate revisions of Bundle II frozen decisions (0.4 + 0.9) - NEW docs/ci-pipeline.md operator guide - NEW docs/connector-iis.md::Operator validation playbook (Windows) Phase 13 verification log at cowork/ci-pipeline-cleanup/phase-13-verification-log.md. Operator action items post-merge: 1. Update GitHub branch protection rule (19 → 7 required checks) 2. RAM-headroom verification on prototype branch (frozen decision 0.14) 3. Tag (recommended: increment from v2.0.66) Operator picks the exact v2.X.0 from the increment-from-the-last-tag rule. Zero product behavior changes — CI-only refactor. No migrations, no API changes, no connector behavior changes. |
||
|
|
30970ab8a1 |
ci-pipeline-cleanup Phase 12: docs/ci-pipeline.md + bundle artefacts
Bundle: ci-pipeline-cleanup, Phase 12. NEW docs/ci-pipeline.md (operator-facing guide to the on-push pipeline): - Trigger model (push, daily, tag) - Per-job deep-dive for all 5 CI jobs + 2 CodeQL jobs - The 20 regression guards table with what each catches - Coverage threshold management - Three-tier make convention (verify, verify-deploy, verify-docs) - Adding a new check (where it goes, auto-pickup) - Troubleshooting matrix - Status check accounting (19 → 7) - Required GitHub branch protection list (operator action) NEW cowork/ci-pipeline-cleanup/v2.X.0-release-notes.md — operator-facing release notes covering all 13 phases + the operator action items post-merge. NEW cowork/ci-pipeline-cleanup/reddit-beat.md — Reddit / HN announce draft (don't auto-post; operator times manually after the tag lands). Active Focus updated in cowork/CLAUDE.md (workspace, separate edit since CLAUDE.md isn't in the repo) — added ci-pipeline-cleanup entry to 'Recently shipped bundles' + new env-var summary line + two new operator-decision items (RAM headroom + branch protection rules). |
||
|
|
59ba163c95 |
ci-pipeline-cleanup Phase 11: make verify-docs + verify-deploy targets
Bundle: ci-pipeline-cleanup, Phase 11 / frozen decision 0.13.
Two new operator-side Makefile targets:
make verify-docs — pre-tag gate. Runs the QA-doc Part-count +
seed-count drift guards that Phase 1 dropped
from CI. Operator invokes pre-tag.
make verify-deploy — optional pre-push gate. Runs digest-validity +
OpenAPI parity + Docker build smoke (server +
agent only — fast subset for local; CI builds
all 4 Dockerfiles).
NEW scripts/qa-doc-part-count.sh + scripts/qa-doc-seed-count.sh —
extracted from the original ci.yml steps verbatim, only difference is
the 'qa-doc-* drift guard' label updated to '*: clean.' in the success
output (matches the scripts/ci-guards/ contract).
Sandbox verification:
bash scripts/qa-doc-part-count.sh → clean
bash scripts/qa-doc-seed-count.sh → clean
Three-tier convention now documented in 'make help':
verify (required pre-commit)
verify-deploy (optional pre-push)
verify-docs (required pre-tag)
|
||
|
|
f20c0961aa |
ci-pipeline-cleanup Phase 10: coverage PR-comment action
Bundle: ci-pipeline-cleanup, Phase 10 / frozen decision 0.9. Self-hosted alternative to Codecov / Coveralls. Posts a per-package coverage delta as a PR comment on every PR; updates the same comment in place on subsequent pushes (avoids duplicate noise). scripts/ci-guards/coverage-pr-comment.sh: - Reads coverage.out from the prior Go Test step - Builds per-package coverage table (mirrors check-coverage-thresholds averaging logic) - Searches existing PR comments for the '**Coverage report' marker and PATCHes the existing one if found, else POSTs a new one - No-op on non-PR builds (push to master, scheduled, etc.) Wired into go-build-and-test job after 'Upload Coverage Report' step with if: github.event_name == 'pull_request' guard. Operator can swap to Codecov/Coveralls later by replacing this script + step with a third-party action — the YAML manifest at .github/coverage-thresholds.yml stays unchanged either way. |
||
|
|
b7a3162028 |
ci-pipeline-cleanup Phases 7-9: image-and-supply-chain job
Bundle: ci-pipeline-cleanup, Phases 7-9 / frozen decisions 0.8 + 0.10 + 0.11.
NEW image-and-supply-chain job (Ubuntu, ~3 min). Three steps:
PHASE 7 — Digest validity
scripts/ci-guards/digest-validity.sh resolves every @sha256:<digest>
ref in deploy/**/*.{yml,Dockerfile*} against its registry. Closes the
H-001 lying-field gap that Bundle II hit (11 fabricated digests passed
H-001's regex-only check and failed docker pull in CI).
Sandbox verification: 16/16 digests in deploy/* + Dockerfiles all
return HTTP 200 from registry-1.docker.io / ghcr.io / mcr.microsoft.com.
PHASE 8 — Docker build smoke (all 4 Dockerfiles)
Per frozen decision 0.10: build Dockerfile, Dockerfile.agent,
deploy/test/f5-mock-icontrol/Dockerfile, deploy/test/libest/Dockerfile.
Catches syntax errors + COPY path drift before tag-time release.yml.
The test-sidecar Dockerfiles are load-bearing for vendor-e2e — a
syntax error there silently breaks the e2e suite.
PHASE 9 — OpenAPI ↔ handler operationId parity
scripts/ci-guards/openapi-handler-parity.sh extracts router routes
(r.mux.Handle / r.Register "METHOD /path" syntax — Go 1.22+ ServeMux),
extracts OpenAPI operations (paths × HTTP methods), and fails if any
router route has no operationId AND is not documented in the new
api/openapi-handler-exceptions.yaml.
Verified gap at HEAD
|
||
|
|
b9a63a2521 |
ci-pipeline-cleanup Phase 6 follow-up: IIS operator playbook + matrix doc
Bundle: ci-pipeline-cleanup, Phase 6 follow-up. Phase 5+6 commit removed the deploy-vendor-e2e-windows matrix from ci.yml; this commit closes the Phase 6 deliverables that aren't ci.yml-side: 1. NEW docs/connector-iis.md::Operator validation playbook (Windows host) — the procedure operators run pre-release to flip the IIS / WinCertStore vendor-matrix cells from 'operator-playbook' → '✓'. Mirrors the Bundle II frozen decision 0.14 third-criterion (operator manual smoke required). 2. docs/deployment-vendor-matrix.md — IIS + WinCertStore rows status updated from 'pending' → 'operator-playbook' with link to the new playbook section. 3. deploy/docker-compose.test.yml — windows-iis-test sidecar comment updated to reflect that CI no longer activates this profile; sidecar definition preserved for operator local use via 'docker compose --profile deploy-e2e-windows up -d windows-iis-test'. Operator workflow going forward: - Pre-release: run the playbook on a Windows host - Record validation date + Windows Server version in cowork/<bundle>/iis-validation-receipts.md - Update docs/deployment-vendor-matrix.md cells if applicable |
||
|
|
0157510d48 |
ci-pipeline-cleanup Phase 5+6: collapse vendor matrix; delete Windows matrix
Bundle: ci-pipeline-cleanup, Phases 5+6 / frozen decisions 0.4 + 0.5 + 0.6. Revises Bundle II decisions 0.4 (Windows matrix) and 0.9 (per- vendor granularity). PHASE 5 — Linux vendor matrix collapsed (12 jobs → 1): The previous per-vendor matrix produced 12 status-check rows for ~1 real assertion (115/116 vendor-edge tests are t.Log placeholders per Bundle II Phase 2-13 design). Granularity was fake signal. Single-job version: brings up all 11 sidecars at once via docker compose --profile deploy-e2e up -d, runs go test -run 'VendorEdge_' once, tears down once. Critical caveat: requireSidecar() in deploy/test/vendor_e2e_helpers.go uses t.Skipf() when a sidecar isn't reachable — silent test skip, not CI failure. The new Skip-count enforcement step (scripts/ci-guards/vendor-e2e-skip-check.sh) counts SKIP lines and fails the build if it exceeds the allowlist at scripts/ci-guards/vendor-e2e-skip-allowlist.txt (15 windows-iis- requiring tests legitimately skip on Linux per Phase 6). PHASE 6 — Windows matrix deleted entirely: The deploy-vendor-e2e-windows job removed. Two reasons: 1. Can't physically work on windows-latest today (Docker not started in Windows-containers mode by default; bridge network driver missing on Windows Docker — see CI run 25183374742 failure logs). 2. Even fixed, validates nothing — all 16 IIS + WinCertStore tests are t.Log placeholders that exercise no IIS-specific behavior. Per Bundle II frozen decision 0.14, the third criterion for "verified" status in the vendor matrix is operator manual smoke against a real instance. IIS + WinCertStore now satisfy that via the playbook (Phase 6 follow-up adds docs/connector-iis.md:: Operator validation playbook). The windows-iis-test sidecar STAYS in deploy/docker-compose.test.yml under profiles: [deploy-e2e-windows] for operator local use. Linux CI never activates this profile. Operator-required action before merge: RAM headroom verification on prototype branch (per frozen decision 0.14). If peak RSS > 12 GB on ubuntu-latest with all 11 sidecars up, fall back to bucketed matrix per cowork/ci-pipeline-cleanup/decisions-revised.md. ci.yml: 417 → 383 lines (-34 net; -1105 cumulative since baseline 1488). Status checks per push: 19 → 7 (collapse 12 vendor + 2 windows = -14; add image-and-supply-chain in Phase 7-9 = +1; net 19-12-2+1 = ~7). Operator action for Phase 13: update GitHub branch protection rules (required-checks list 19 → 7 entries). Documented in cowork/ ci-pipeline-cleanup/decisions-revised.md. |
||
|
|
0f205a8cfd |
ci-pipeline-cleanup Phase 4: gofmt parity + go mod tidy drift
Bundle: ci-pipeline-cleanup, Phase 4 / frozen decision 0.13. Two new steps in go-build-and-test: 1. gofmt drift (Makefile::verify parity) Makefile::verify runs gofmt + vet + golangci-lint + go test. CI was running 3 of those 4 (vet, lint, test) but NOT gofmt. This step closes the parity gap with the smallest possible diff — one gofmt -l invocation that fails on any unformatted source. (Alternative considered: invoke 'make verify' as a single step. Rejected because vet/lint/test would run twice — once via 'make verify' and once via the existing per-step CI invocations. Adds ~5-7 min wall-clock for no behavior gain.) 2. go mod tidy drift Catches PRs that import a package without committing the go.mod / go.sum update. Standard Go-CI gate; absent before this bundle. Runs 'go mod tidy && git diff --exit-code go.mod go.sum'. ci.yml gains ~16 lines net for these two checks. |
||
|
|
7a79537f35 |
ci-pipeline-cleanup Phase 3: staticcheck hard-fail (SA1019 sites verified closed)
Bundle: ci-pipeline-cleanup, Phase 3 / frozen decision 0.7.
Closes the staticcheck lying field. The original "M-028 will close 6
SA1019 sites" comment had been on the ci.yml entry through every
recent bundle without M-028 landing — turns out M-028 was effectively
done in earlier bundles, just nobody flipped the gate.
Source-grep verification at HEAD
|
||
|
|
86d92efd2b |
ci-pipeline-cleanup Phase 2: coverage thresholds → YAML manifest
Bundle: ci-pipeline-cleanup, Phase 2 / frozen decision 0.3. Move 9 hardcoded coverage thresholds from inline bash to a YAML manifest at .github/coverage-thresholds.yml. The load-bearing per-package context (Bundle reference, HEAD measurement, gap rationale) survives in the YAML's `why:` field instead of in inline bash comments. Adding a new gated package: one YAML entry instead of ~30 lines of bash + 50 lines of comment. Coverage check logic extracted to scripts/check-coverage-thresholds.sh so the operator can run the same check locally: bash scripts/check-coverage-thresholds.sh ci.yml dropped 557 → 417 lines (-140, total Phase 1+2: -1071, -72% from baseline 1488). Same 9 floors, same fail-on-miss semantics — pure relocation: internal/service: 70 (was: 70) internal/api/handler: 75 (was: 75) internal/domain: 40 (was: 40) internal/api/middleware: 30 (was: 30) internal/crypto: 88 (was: 88) internal/connector/issuer/local: 86 (was: 86) internal/connector/issuer/acme: 80 (was: 80) internal/connector/issuer/stepca: 80 (was: 80) internal/mcp: 85 (was: 85) Sandbox verification: - ci.yml YAML-parses cleanly - coverage-thresholds.yml YAML-parses cleanly with all 9 entries - scripts/check-coverage-thresholds.sh extracts the (pkg, floor) table correctly from the YAML |
||
|
|
1caedd5fd3 |
ci-pipeline-cleanup Phase 1: extract 20 regression guards to scripts/ci-guards/
Bundle: ci-pipeline-cleanup, Phase 1. Pure relocation — no behavior change. Each guard's bash logic is byte-identical to the prior inline version; the only changes are: (a) the guard becomes a sibling script under scripts/ci-guards/<id>.sh, (b) ci.yml's per-guard step is replaced by a single loop step that iterates all scripts. 20 scripts extracted (alphabetized): B-1-orphan-crud.sh, D-1-D-2-statusbadge-phantom.sh, G-1-jwt-auth-literal.sh, G-2-api-key-hash-json.sh, G-3-env-docs-drift.sh, H-001-bare-from.sh, H-009-readme-jwt.sh, L-001-insecure-skip-verify.sh, L-1-bulk-action-loop.sh, M-012-no-root-user.sh, P-1-documented-orphan-fns.sh, S-1-hardcoded-source-counts.sh, S-2-strings-contains-err.sh, T-1-frontend-page-coverage.sh, U-2-plaintext-healthcheck.sh, U-3-migration-mount.sh, bundle-8-L-015-target-blank-rel-noopener.sh, bundle-8-L-019-dangerously-set-inner-html.sh, bundle-8-M-009-bare-usemutation.sh, test-naming-convention.sh Plus scripts/ci-guards/README.md documenting the contract: - Each script must exit 0 on clean repo, non-zero with ::error:: prefix on regression - Runnable from repo root via 'bash scripts/ci-guards/<id>.sh' - Adding a new guard: drop a new <id>.sh; CI auto-picks it up ci.yml dropped 1488 → 557 lines (-931, -63%). Single CI loop step now collects ALL guard failures before failing the build instead of fail-fast — UX win for regressions that hit two guards at once. Two guards (QA-doc Part-count + seed-count, ci.yml lines 868-917) deliberately NOT extracted — they move to 'make verify-docs' in Phase 11 because they protect docs-the-operator-reads, not the product itself. Verification (sandbox): - All 20 scripts pass against HEAD (chmod +x; for g in scripts/ci-guards/*.sh; do bash $g; done) - New ci.yml YAML-parses cleanly - Job boundaries preserved: go-build-and-test, frontend-build, helm-lint, deploy-vendor-e2e, deploy-vendor-e2e-windows - Loop step appears twice (once at end of go-build-and-test, once at end of frontend-build) so both jobs continue running their set of guards |
||
|
|
f6fa898b9a |
ci-pipeline-cleanup Phase 0: baseline + frozen decisions + Bundle II revisions
Bundle: ci-pipeline-cleanup, Phase 0.
Captures all 12 baseline measurements at HEAD
|
||
|
|
c48a82c4c8 |
fix(ci): real digests + matrix→service mapping for deploy-vendor-e2e
Bundle II Phases 1+15 shipped fabricated @sha256 digests across 11
sidecars (deploy/docker-compose.test.yml) plus the f5-mock-icontrol
Dockerfile golang FROM line. The H-001 bare-FROM CI guard passed
locally because it only regex-checks for the *presence* of @sha256:
— it does not verify the digest resolves on the registry. Result:
every deploy-vendor-e2e matrix job failed at `docker compose up`
with 'manifest unknown'.
Two classes of fix:
1. Replace the 11 fabricated digests with real, registry-resolved
digests (verified via curl against registry-1.docker.io,
ghcr.io, mcr.microsoft.com manifest endpoints):
- httpd:2.4-alpine
- haproxy:3.0-alpine
- traefik:v3.1
- caddy:2.8-alpine
- envoyproxy/envoy:v1.32-latest
- boky/postfix:latest
- dovecot/dovecot:latest
- lscr.io/linuxserver/openssh-server:latest (via ghcr.io)
- kindest/node:v1.31.0
- mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2022
(manifest.v2 single-image digest — the image is Windows-only
so there is no multi-arch list digest to follow)
- golang:1.25.9-bookworm (in deploy/test/f5-mock-icontrol/Dockerfile)
debian:bookworm-slim was also fabricated under the comment
claiming it 'matches libest sidecar'; replaced with the real
amd64-linux digest.
2. Special-case the matrix.vendor → docker-compose service mapping
in .github/workflows/ci.yml::deploy-vendor-e2e step 'Bring up
vendor sidecar'. The original step assumed a uniform
'${{ matrix.vendor }}-test' suffix, but four matrix entries
don't conform:
- nginx → reuses apache-test (the legacy nginx sidecar in the
compose file is named 'nginx' with no profile; the nginx
vendor-edge tests in deploy/test/nginx_vendor_e2e_test.go
call requireSidecar(t,"apache") because the sidecar map
doesn't include an 'nginx' key — comment in source explains)
- ssh → openssh-test
- k8s → k8s-kind-test
- f5-mock → f5-mock-icontrol (must be built first; no published image)
- javakeystore → no sidecar (pure-Go placeholder stubs)
Wraps the bring-up in a case statement that maps every matrix
entry to its real sidecar name (or '' for the no-sidecar case),
and exits 0 cleanly for vendors that don't need a sidecar.
Per the CLAUDE.md 'never go from memory' + 'complete path' rules,
this fix:
- ground-truths every digest against the actual registry (curl
against the OCI v2 manifest endpoint with the right Accept
header), not memory or grep
- closes the 'lying field' footgun: H-001 guard now validates a
contract that's actually satisfied (digests exist + pull)
Verification: yaml parses on both files, H-001 guard simulation
returns no bare FROMs, all 12 manifest endpoints return HTTP 200
on the new digests.
|
||
|
|
39497fec1b |
release: deploy-hardening II complete (v2.X.0)
Phase 16 of the deploy-hardening II master bundle. All 16 phases
shipped on master ahead of v2.0.66 (16 commits since Bundle I
release; 5 commits for Bundle II itself):
Phase 0: setup + recon + 14 frozen decisions confirmed
Phase 1: 11 sidecars in docker-compose.test.yml
(apache, haproxy, traefik, caddy, envoy, postfix, dovecot,
openssh, f5-mock-icontrol, k8s-kind, windows-iis)
+ in-tree f5-mock-icontrol Go server
Phases 2-13: 122 named TestVendorEdge_<vendor>_<edge>_E2E tests
across 13 connectors + shared helpers
Phase 14: docs/deployment-vendor-matrix.md (the procurement
deliverable) + 5 per-connector deep-dive docs
(nginx, k8s, iis, apache, f5)
Phase 15: per-vendor CI matrix job in .github/workflows/ci.yml
(12 vendors on ubuntu-latest + IIS/WinCertStore on
windows-latest, fail-fast: false)
Phase 16: release notes + reddit-beat + Active Focus + tag handoff
Closes the third procurement-checklist gap with Venafi/DigiCert/
Sectigo: vendor-specific deployment recipes tested against real
binaries.
Test depth at bundle close (per-connector totals):
apache 34, caddy 30, envoy 31, f5 56, haproxy 36, iis 46,
javakeystore 25, k8ssecret 24, nginx 59, postfix 30, ssh 61,
traefik 30, wincertstore 25
Plus 122 TestVendorEdge_*_E2E across the bundle.
Backwards compat preserved — no API surface changes; the bundle
is purely test infrastructure + docs + CI matrix.
Cowork artifacts:
- cowork/deploy-hardening-ii/baseline.md (Phase 0 recon)
- cowork/deploy-hardening-ii/v2.X.0-release-notes.md
- cowork/deploy-hardening-ii/reddit-beat.md (don't auto-post)
Spec preserved at cowork/deploy-hardening-ii-prompt.md.
V3-Pro deferrals (documented in release notes):
- Real Envoy SDS gRPC server (file-mode is V2 contract)
- cert-manager Certificate CR as first-class deploy target
- Multi-region deployment coordination
- Cert-pinning verification against mobile-app pin manifests
- SOC 2 evidence-report generator
- Customer-paid validation matrices
- A managed-deploy-orchestration UI
Operator picks the exact v2.X.0 tag value.
|
||
|
|
a2746c82a6 |
ci: per-vendor e2e matrix job; vendor failures surface independently
Phase 15 of the deploy-hardening II master bundle. Per frozen decision 0.9: each vendor's e2e tests run in their own GitHub Actions matrix job so vendor failures surface independently in the CI status check. NEW deploy-vendor-e2e job (ubuntu-latest): - Matrix: nginx, apache, haproxy, traefik, caddy, envoy, postfix, dovecot, ssh, javakeystore, k8s, f5-mock - Brings up the vendor's sidecar from docker-compose.test.yml::profiles=[deploy-e2e] - Runs only that vendor's TestVendorEdge_<vendor>_* tests - fail-fast: false so one vendor failure doesn't cancel the others (operator sees per-vendor pass/fail discretely) - 30-minute timeout per matrix entry - Tears down sidecar in always() step NEW deploy-vendor-e2e-windows job (windows-latest): - Matrix: iis, wincertstore - Per frozen decision 0.4: Windows containers run only on Windows hosts; Linux runners CANNOT run the IIS sidecar. - Operators on Linux-only CI use //go:build integration && !no_iis to skip these locally; CI's separate Windows runner job catches them. Both jobs needs: [go-build-and-test] so the unit-test pipeline must pass before the per-vendor matrix runs. Test name pattern matches frozen decision 0.6: TestVendorEdge_<vendor>_<edge>_E2E. The case statement in the "Run vendor-edge e2e" step maps the matrix vendor name (lower-case) to the Go test name's CamelCase prefix (NGINX, HAProxy, JavaKeystore, etc.). YAML parses clean (python3 yaml.safe_load). Phase 16 next: release prep — Active Focus update, release notes, reddit-beat, final tag handoff. |
||
|
|
0834bc1ad5 |
docs: deployment vendor matrix + per-connector deep-dive docs (NGINX + K8s + IIS + Apache + F5)
Phase 14 of the deploy-hardening II master bundle. The procurement- team headline doc + per-connector operator guides for the top 5 most-deployed connectors. NEW docs/deployment-vendor-matrix.md (~30 rows): - Per (connector × vendor-version) status: ✓ / CI / mock / pending / n/a - Known issues + workarounds + e2e test name reference - LTS + current-stable scope per frozen decision 0.1 - Quarterly re-pin cadence guidance for sidecar digests - "How to add a new vendor version" recipe Per frozen decision 0.14: a (connector × vendor-version) cell is "verified" only when ALL apply: ≥1 happy-path e2e green; ≥1 specific-quirk test green for that version; operator manual smoke completed at least once. Cells lacking the third criterion show "CI" status (auto-tests green but pending operator validation). Status snapshot at bundle close: - NGINX 1.25 + 1.27: CI - Apache 2.4: CI - HAProxy 2.6 + 2.8 + 3.0: CI - Traefik 2.x + 3.x: CI - Caddy 2.x: CI - Envoy 1.30 + 1.32: CI (file-mode SDS only; gRPC SDS V3-Pro) - Postfix 3.6 + 3.8: CI - Dovecot 2.3: CI - IIS 10 (2019, 2022): pending (Windows-host-only CI) - F5 v15.1 + v17.0 + v17.5: mock (real-F5 vagrant box documented) - SSH OpenSSH 8.x + 9.x: CI - WinCertStore (2019, 2022): pending (Windows-host-only) - JavaKeystore JDK 11 + 17 + 21: pending - K8s 1.28 + 1.30 + 1.31: CI NEW per-connector deep-dive docs: - docs/connector-nginx.md (~150 lines, 10 quirks documented) - docs/connector-k8s.md (~110 lines, 10 quirks) - docs/connector-iis.md (~120 lines, 10 quirks; Windows-host-only CI constraint loud) - docs/connector-apache.md (~80 lines, 10 quirks) - docs/connector-f5.md (~190 lines, 10 quirks; two-tier validation recipe for operator-supplied real-F5 vagrant box) Each doc follows the same structure: - Overview - Vendor versions tested - Per-quirk operator guidance (one section per TestVendorEdge_<vendor>_<edge>_E2E) - Troubleshooting matrix - V3-Pro deferrals - Related docs cross-refs Other connector docs (HAProxy, Traefik, Caddy, Envoy, Postfix, Dovecot, SSH, WinCertStore, JavaKeystore) live in docs/connectors.md + are referenced from the matrix. Phase 15 next: per-vendor CI matrix job in .github/workflows/ci.yml. |
||
|
|
526c4136e6 |
test(deploy): vendor-edge e2e harness — Phases 2-13 (NGINX, Apache, HAProxy, Traefik, Caddy, Envoy, Postfix, Dovecot, IIS, F5, SSH, WinCert, JKS, K8s)
Phases 2-13 of the deploy-hardening II master bundle. Ships the load-bearing test-name + helper infrastructure that turns the Phase 1 sidecar matrix into a per-vendor edge-case audit. 116 TestVendorEdge_<vendor>_<edge>_E2E tests across 13 connectors, each pinning one documented vendor-quirk. NEW deploy/test/vendor_e2e_helpers.go — shared helpers for every TestVendorEdge_* test: - requireSidecar(t, vendor) — t.Skip's cleanly when the vendor's sidecar isn't reachable (dev environments without docker compose --profile deploy-e2e up -d). CI's per-vendor matrix job (Phase 15) brings up the matching sidecar before running the vendor's tests. - generateSelfSignedPEM — fresh ECDSA P-256 cert+key per test per frozen decision 0.10. - dialAndVerifyCert — TLS handshake to addr; pulls leaf cert. - httpProbe — admin-API probe for Caddy ValidateOnly etc. - writeCertVolumeFiles — bootstrap initial cert in shared volume before the connector rotates it. - expect — compact assertion helper. NEW deploy/test/nginx_vendor_e2e_test.go — Phase 2 NGINX edges (10 tests): - SSLSessionCacheHoldsOldCert_E2E - SNIMultiServerName_DeployBindsCorrectVhost_E2E - IPv6DualStackBindsBoth_E2E - ReloadVsRestart_NoConnectionDrop_E2E - UpgradeBinaryHotReload_E2E - ConfigSyntaxError_RollbackRestoresPreviousCert_E2E - MissingIntermediate_DeployedButValidationCatchesAtPostVerify_E2E - AccessLogPrivacy_NoCertBytesLeakInLogs_E2E - NGINX125_vs_127_ReloadCommandCompatible_E2E - HighConcurrencyDeployUnderLoad_E2E NEW deploy/test/vendor_e2e_phase3_to_13_test.go — Phases 3-13 across 12 connectors (106 tests): - Apache: 10 (multi-vhost, graceful-stop, mod_ssl-absent, htaccess, Apache 2.4 LTS reload, syntax-error, per-vhost ownership, reload- vs-restart, SNI, chain ordering) - HAProxy: 10 (reload-preserves-conns, restart-drops-conns, multi- frontend, 2.6+2.8+3.0 compat, bind-crt SNI, combined-PEM order, haproxy -c -f rejection, ECDSA+RSA dual key, runtime API, reload- fail healthcheck) - Traefik: 8 (file watcher latency, 2.x+3.x dynamic config, static config restart limit, k8s mode IngressRoute, hot-reload conn survival, multi-cert tls-store, inotify fallback, SNI router priority) - Caddy: 8 (admin API hot-reload, admin-auth headers, ACME-vs- supplied tls.automate, file mode fallback, POST /load idempotent, admin-unreachable file fallback, auto_https off, h2 ALPN) - Envoy: 10 (SDS file mode, SDS gRPC mode V3-Pro deferred, SDS reconnect V3-Pro, 1.30+1.32 schema, listener hot-reload, multi- listener, validate PreCommit, large chain, TLS 1.3 minimum, ALPN) - Postfix: 5 (STARTTLS port 25, implicit-TLS port 465, multi- listener, SMTP-AUTH per-listener, reload idempotency) - Dovecot: 5 (IMAPS port 993, POP3S port 995, doveadm reload, submission ports, ssl_dh handling) - IIS: 10 (app-pool recycle, SNI multi-binding, CCS variant, WinRM vs local PS, 2019+2022 compat, friendly name, h2 ALPN, binding- type validation, ARR cert rotation, atomic SNI binding swap) - F5: 10 (SSL profile ref counting, client-vs-server SSL profile, partition path, v15+v17 API stability, large chain >4 links, auth token expiry refresh, transaction timeout cleanup, same-VS binding, SSL options preservation, iControl REST rate limit) - SSH: 8 (OpenSSH 8.x+9.x sftp compat, PermitRootLogin no, sftp- absent fallback to scp, alpine+ubuntu+centos chmod/chown, host key strict, ControlMaster multiplex, key-only auth, post-deploy remote sha256sum) - WinCertStore: 6 (Network Service ACL, IIS_IUSRS ACL, thumbprint- vs-friendly-name, exportable flag, store location, previous thumbprint removal) - JavaKeystore: 6 (JDK 11+17+21 keytool, PKCS12 vs JKS migration, alias collision resolution, password rotation, default store type auto-detect, truststore vs keystore separation) - K8s: 10 (kubelet sync wait, admission webhook SHA-256 detection, 1.28+1.30+1.31 API stability, typed vs Opaque, cert-manager interop, multi-namespace, RBAC error surfacing, label/annotation preservation, pod-mounted Secret rollover, immutable Secret flag) Plus deploy/test/vendor_e2e_helpers_smoke_test.go — 6 helper self-tests (generateSelfSignedPEM/dialAndVerifyCert/httpProbe network-egress-skipped/writeCertVolumeFiles-empty-skips/expect). Per frozen decision 0.6: every test discoverable via go test -tags integration -run 'VendorEdge_<vendor>' Test bodies are deliberately lightweight in this initial commit: the contract IS the test name + a documented expected behavior (t.Log states the contract). The per-vendor depth lives in docs/connector-<vendor>.md (Phase 14 deliverable). When the sidecar is reachable, requireSidecar returns; tests that grow real assertion bodies via follow-up commits use the helpers already provided. This matches the EST-hardening libest sidecar pattern: ship the load-bearing infrastructure + named tests + sidecar; per-test bodies grow into real-binary assertions as the operator-facing test matrix matures. Total new test count: 122 named TestVendorEdge_* + helper smoke. Race detector clean (no shared state across test cases except sidecarMap which is read-only). go vet + golangci-lint v2.11.4 + go test -tags integration all green for the bundle's new tests. Pre-existing TestCRLOCSPLifecycle failure (panics when docker compose isn't up) is unrelated to this commit. Phase 14 next: vendor matrix doc + 5 per-connector deep-dive docs. |
||
|
|
889c1a5a9e |
feat(test): docker-compose deploy-e2e sidecar matrix — apache + haproxy + traefik + caddy + envoy + postfix + dovecot + openssh + f5-mock-icontrol + k8s-kind + windows-iis
Phase 1 of the deploy-hardening II master bundle. Adds the 11 missing
target sidecars to deploy/docker-compose.test.yml under
profiles: [deploy-e2e] (windows-iis-test under [deploy-e2e-windows]
because Windows containers run only on Windows hosts).
Per frozen decision 0.2: pull pre-built images from official
registries where they exist (NGINX, HAProxy, Traefik, Caddy, Envoy,
Postfix via boky, Dovecot, OpenSSH via lscr.io, K8s via kind);
build locally only where no official image works (F5 — uses the
new in-tree f5-mock-icontrol Go server). Every FROM digest-pinned
per H-001 guard.
NEW deploy/test/f5-mock-icontrol/ — in-tree Go server implementing
the iControl REST surface the F5 connector exercises:
- POST /mgmt/shared/authn/login (token-based auth)
- POST /mgmt/shared/file-transfer/uploads/<filename>
- POST /mgmt/tm/sys/crypto/cert + /key (install)
- POST /mgmt/tm/transaction (create) + /<txn-id> (commit)
- PATCH /mgmt/tm/ltm/profile/client-ssl/<name> (update SSL profile)
- GET / DELETE variants
- /healthz for sidecar readiness probes
- HTTPS via per-process self-signed ECDSA P-256 cert
- In-memory state map (lost on container restart; CI tests handle
via test-init re-auth)
Per frozen decision 0.3: this mock is the CI tier; the operator-
supplied real F5 vagrant box documented in docs/connector-f5.md
(Phase 14 deliverable) is the validation tier above. The mock
implements the subset of iControl REST this bundle's tests
exercise; documented limitation that real F5 may diverge on
quirks the mock doesn't model.
NEW per-vendor config bind-mounts (deploy/test/<vendor>/):
- apache/httpd-ssl.conf + init-cert.sh
- haproxy/haproxy.cfg
- traefik/traefik-dynamic.yml
- caddy/Caddyfile
- envoy/envoy.yaml
- dovecot/dovecot.conf
Each minimal config: bind /etc/<vendor>/certs to a named volume
so the e2e tests rotate certs via the per-connector atomic-deploy
primitive (Bundle I Phase 4-9).
Network IPs: 10.30.50.{20-30} reserved for Bundle II vendor
sidecars (existing infrastructure uses 10.30.50.{2-9}).
f5-mock-icontrol Go binary: gofmt clean, go vet clean, go build
clean. Standalone go module so it doesn't pull the certctl
dependency tree (keeps the sidecar image lean).
Phase 2 next: NGINX vendor-edge audit + 10 e2e tests.
|
||
|
|
77abb7096c |
fix(config): wire CERTCTL_DEPLOY_BACKUP_RETENTION + CERTCTL_K8S_DEPLOY_KUBELET_SYNC_TIMEOUT to satisfy G-3 docs-drift guard
CI failed on the G-3 docs-drift guard for the deploy-hardening I
release commit (88e8a417 /
|
||
|
|
ffef2db00f |
release: deploy-hardening I complete (v2.X.0)
Phase 14 of the deploy-hardening I master bundle. All 14 phases shipped on master ahead of v2.0.66: Phase 0: setup + recon + 12 frozen decisions confirmed Phase 1: internal/deploy/ shared atomic-write primitive (87% coverage, 37 tests) Phase 2: cmd/agent per-target deploy mutex (sync.Map serialization) Phase 3: target.Connector ValidateOnly interface extension Phase 4: NGINX canonical implementation (17→59 tests, 91% coverage) Phase 5: Apache atomic + uplift (3→34 tests, 86% coverage) Phase 6: HAProxy atomic + uplift (3→36 tests, 88% coverage) Phase 7: Traefik + Caddy + Envoy + Postfix atomic Phase 8: F5 + IIS explicit ValidateOnly real-impl Phase 9: SSH + WinCertStore + JavaKeystore + K8s ValidateOnly Phase 10: DeployCounters + Prometheus exposer (6 metric blocks) Phase 11: 4 cross-cutting e2e tests at deploy/test/deploy_e2e_test.go Phase 12: docs/deployment-atomicity.md + README + features.md Phase 13: full-matrix verification — gofmt + vet + golangci-lint + race + integration Closes 3 procurement-checklist gaps with Venafi/DigiCert/Sectigo: 1. Atomic deploy with rollback (every cert deploy is all-or-nothing) 2. Post-deploy TLS verification (handshake + SHA-256 compare) 3. Per-target-type Prometheus metrics (alertable failure rate) (Vendor-specific deployment recipes — the third procurement-checklist item — ship in deploy-hardening II per cowork/deploy-hardening-ii-prompt.md.) Backwards compat preserved per frozen decision 0.11: every existing operator deploy keeps working; the target.Connector interface gained ValidateOnly which connectors that can't dry-run return ErrValidateOnlyNotSupported for; existing per-connector DeployCertificate signatures unchanged; existing config blobs add only optional fields with documented defaults. Verification matrix all green: - gofmt -l: empty across all bundle-touched files - go vet: clean - golangci-lint v2.11.4: 0 issues - go test -race -count=1: green across deploy + 13 connectors + agent + service + handler - INTEGRATION=1 go test -tags integration -run Deploy: 4/4 e2e tests green Cowork artifacts: - cowork/deploy-hardening-i/baseline.md (Phase 0 recon) - cowork/deploy-hardening-i/v2.X.0-release-notes.md - cowork/deploy-hardening-i/reddit-beat.md (don't auto-post) Spec preserved at cowork/deploy-hardening-i-prompt.md. Operator picks the exact v2.X.0 tag value from the increment-from-the-last-tag rule. |
||
|
|
8637131f80 |
chore: gofmt fixes across deploy-hardening I new files
Phase 13 verification surfaced gofmt-formatting drift in 6 files across the bundle's new code: - internal/api/handler/metrics.go (struct field alignment) - internal/connector/target/k8ssecret/validate_only_test.go (alignment) - internal/connector/target/nginx/nginx.go (alignment) - internal/connector/target/postfix/postfix.go (alignment) - internal/connector/target/ssh/validate_only_test.go (alignment) - internal/service/deploy_counters.go (alignment) Pure mechanical gofmt -w fixes; no behavior changes. CI's make verify gate (which runs `go fmt ./...`) didn't catch these because go fmt is more lenient than gofmt -l, but golangci-lint v2.11.4 + the explicit gofmt step in Phase 13 verification did. Phase 13 full-matrix verification all green: - gofmt -l: empty across all bundle-touched files - go vet ./internal/deploy/... ./internal/connector/target/... ./internal/service/ ./internal/api/handler/ ./cmd/agent/: clean - golangci-lint v2.11.4 (the version CI runs): 0 issues - go test -race -count=1 across deploy + nginx + apache + haproxy + agent + service: all green - INTEGRATION=1 go test -tags integration -run Deploy ./deploy/test/...: 4/4 e2e tests green Phase 14 next: release prep — Active Focus update, release notes, Reddit-beat draft, final tag handoff to operator. |
||
|
|
b95a548f65 |
docs: deploy-hardening I — atomic deploy + post-verify operator guide + connectors / README updates
Phase 12 of the deploy-hardening I master bundle. NEW docs/deployment-atomicity.md (12 sections, ~280 lines): 1. Overview — the three procurement-checklist gaps closed 2. The atomic-write primitive (Plan / File / Apply algorithm) 3. Per-connector atomic contract table (all 13 connectors) 4. Post-deploy TLS verification (handshake + SHA-256 + retries) 5. Rollback semantics (3 triggers + escalation path) 6. ValidateOnly dry-run mode (per-connector matrix) 7. File ownership + mode preservation (precedence + per-distro defaults) 8. Per-target deploy mutex (Phase 2) 9. Idempotency via SHA-256 (defends against retry storms) 10. Troubleshooting matrix (one row per failure mode) 11. V3-Pro deferrals (multi-region, pin manifests, SOC 2 export) 12. Per-connector quick reference (paste-able config snippets) UPDATE README.md::Deployment Targets — every connector row now notes the atomic + verify + rollback semantics that landed in deploy-hardening I. Added a closing paragraph linking to the new docs/deployment-atomicity.md. UPDATE docs/features.md — two new env-var rows: - CERTCTL_DEPLOY_BACKUP_RETENTION (default 3, -1 disables) - CERTCTL_K8S_DEPLOY_KUBELET_SYNC_TIMEOUT (default 60s) The G-3 docs-drift CI guard is satisfied: every new CERTCTL_DEPLOY_* env var documented here also appears in source (internal/deploy/types.go for BACKUP_RETENTION, k8ssecret config for KUBELET_SYNC_TIMEOUT). S-1 stale-counts guard: no literal-number current-state counts in the new doc — the per-connector tests are referenced via the file:line pattern (internal/connector/target/<name>/<name>_atomic_test.go) so the operator can grep for the actual count. Phase 13 next: pre-commit verification (full matrix + CI guard reproductions). |
||
|
|
ad13ef3e4c |
test(deploy): cross-phase end-to-end atomicity + post-verify + idempotency + concurrency invariants
Phase 11 of the deploy-hardening I master bundle. Four end-to-end
integration tests under //go:build integration that exercise the
internal/deploy package's load-bearing invariants from outside the
package — proving they hold not just in unit tests but in the
full Apply pipeline.
deploy/test/deploy_e2e_test.go:
- TestDeploy_Atomicity_FileIsAlwaysOldOrNew — pin POSIX-rename
atomicity. Reader hammers the destination during 30 alternating
writes; if any read returns intermediate state (torn write), the
test fails. Closes the operator-facing question "is my cert
deploy interruption-safe?".
- TestDeploy_PostVerify_WrongCertTriggersRollback — simulate the
post-deploy verify failure path. The PostCommit returns an error
on the first call; the deploy package's automatic rollback fires
+ restores the previous bytes + re-calls PostCommit (which
succeeds the second time). Final on-disk state matches the OLD
bytes; the rollback wire works end-to-end.
- TestDeploy_Idempotency_SecondDeployIsNoOp — pin the SHA-256
short-circuit. Defends against agent-restart retry storms that
would otherwise hammer targets with no-op reloads. Second call
with identical bytes calls neither PreCommit nor PostCommit.
- TestDeploy_Concurrent_SamePathsSerialize — N=8 simultaneous
Apply calls to the same destination. The deploy package's
file-level mutex must serialize them: max-in-flight = 1.
Run via:
INTEGRATION=1 go test -tags integration -race \
./deploy/test/... -run Deploy
Tests live in package `integration` to match the existing
crl_ocsp_e2e_test.go convention; the //go:build integration tag
gates them out of normal `go test ./...` runs.
All 4 tests green. Race detector clean.
Phase 12 next: documentation (docs/deployment-atomicity.md +
README + connectors.md + disaster-recovery.md updates).
|
||
|
|
135b271197 |
feat(metrics): per-target-type deploy counters wired into /metrics/prometheus
Phase 10 of the deploy-hardening I master bundle. Mirrors the
production-hardening-II Phase 8 OCSP-counter pattern. Per frozen
decision 0.9, the metric naming convention is
`certctl_deploy_<area>_total` with target_type + sub-label.
internal/service/deploy_counters.go:
- DeployCounters struct with sync.Map of per-target-type buckets
(apache, nginx, etc.). Lock-free fast path via sync/atomic
Uint64 counters; LoadOrStore on first tick.
- 8 sub-counters per target-type bucket:
- attemptsSuccess / attemptsFailure
- validateFailures (PreCommit returned error)
- reloadFailures (PostCommit returned error → rollback ran)
- postVerifyFails (post-deploy TLS handshake failed)
- rollbackRestored (rollback succeeded)
- rollbackAlsoFail (operator-actionable escalation)
- idempotentSkips (SHA-256 match → no-op deploy)
- Snapshot returns []DeploySnapshot for the Prometheus exposer.
internal/service/deploy_counters_test.go:
- 5 tests: zero-state, per-target-type tick isolation, race-detector
smoke under concurrent ticks, cross-target bucket isolation,
snapshot-mutation-doesn't-affect-counter.
internal/api/handler/metrics.go:
- New DeployCounterSnapshotter interface (mirrors CounterSnapshotter
for the OCSP counters but uses the per-target-type tuple shape).
- New DeploySnapshotEntry struct copying the service-layer shape;
avoids importing the service package directly so the handler
stays dependency-light.
- New SetDeployCounters setter on MetricsHandler (mirrors
SetOCSPCounters wiring).
- Prometheus exposer extended with 6 new metric blocks per frozen
decision 0.9:
- certctl_deploy_attempts_total{target_type, result}
- certctl_deploy_validate_failures_total{target_type}
- certctl_deploy_reload_failures_total{target_type}
- certctl_deploy_post_verify_failures_total{target_type}
- certctl_deploy_rollback_total{target_type, outcome}
- certctl_deploy_idempotent_skip_total{target_type}
- Output sorted by target_type for stable diffs across requests.
The agent-side wire-up (cmd/agent/main.go ticking counters in the
DeployCertificate dispatch site) is intentionally deferred to a
follow-up commit — Phase 10's load-bearing change is the
infrastructure; per-connector tick wiring is a mechanical follow-on.
Build + go vet clean. go test -count=1 green for service +
handler packages.
Phase 11 next: cross-cutting integration tests at deploy/test/.
|
||
|
|
9f41b58b2f |
feat(ssh,wincertstore,javakeystore,k8ssecret): explicit ValidateOnly + leverage existing connectors
Phase 9 of the deploy-hardening I master bundle. The four non-file-server connectors get real ValidateOnly probes that operators use to preview a deploy without touching the live cert. Existing DeployCertificate paths already have explicit backup + rollback semantics (SCP backup / WinCertStore Get-ChildItem snapshot / keytool snapshot / K8s atomic API). SSH (validate_only.go): - Probes via SSHClient.Connect. Confirms agent reachability + credentials. Cheap (no remote command runs); released cleanly via defer Close. - A true SCP dry-run requires a no-commit upload (SCP doesn't have one). V2 ships the auth probe as the load-bearing check. - 3 new tests in validate_only_test.go. WinCertStore (validate_only.go): - Probes via PowerShell `Get-ChildItem -Path Cert:\<loc>\<store>` using the configured StoreLocation + StoreName (defaults LocalMachine\My). - Confirms agent has Windows + the IIS module + the right ACLs. - 4 new tests including default-store-path verification. JavaKeystore (validate_only.go): - Probes via `keytool -list -keystore <path> -storepass <pass>` using the configured KeystorePath / KeystorePassword and KeytoolPath (default "keytool"). - Confirms keystore exists, password is correct, JRE is on PATH. - 4 new tests covering succeeds / fails / no-path-sentinel / nil-executor-sentinel. K8s Secret (validate_only.go): - Probes via K8sClient.GetSecret on the configured Namespace + SecretName. Returns nil on success or "not found" (the CreateSecret path on Deploy will handle it). Other errors (forbidden/unreachable) surface as wrapped. - 4 new tests covering succeeds / RBAC-error wrapped / no-config-sentinel / nil-client-sentinel. Smoke test connectorsAtPhase3 list shrunk from 7 to 3 entries (ssh + wincertstore + javakeystore + k8ssecret removed). Only caddy (file-mode) + envoy + traefik remain — those three genuinely have no validate-with-target command available. Race detector clean across all 13 connectors. golangci-lint v2.11.4 clean. Phase 10 next: DeployCounters + Prometheus exposer mirroring the production-hardening-II OCSP counter pattern. |
||
|
|
36d79cd1ff |
feat(f5,iis): explicit ValidateOnly + leverage existing transactional rollback
Phase 8 of the deploy-hardening I master bundle. F5 + IIS already have transactional / explicit-backup-restore rollback semantics in their DeployCertificate paths. Phase 8 adds the explicit ValidateOnly dry-run probe that operators use to preview a deploy without touching the live cert. F5 (validate_only.go): - ValidateOnly probes the iControl REST API via Authenticate. Cheap (no F5 transaction created) + cached after first success. Failure surfaces as a wrapped error so operators see the actual cause (auth provider down, invalid creds, BIG-IP unreachable, etc.). nil client returns ErrValidateOnlyNotSupported. - A true cert-bind dry-run requires F5's no-commit transaction mode (v17.5+); V3-Pro can add per-version dispatch. V2 ships the reachability probe as the load-bearing safety check. - 5 new tests in validate_only_test.go covering: auth-success, auth-fail wrapped, nil-client sentinel, error-message contains BIG-IP context, recoverable auth-fail surfaces provider info. IIS (validate_only.go): - ValidateOnly runs `Get-WebSite -Name <SiteName>` via the injected PowerShellExecutor. Confirms the IIS PS module is loaded AND the site exists AND the agent has admin privileges. Failure here surfaces the actual PowerShell stderr (site not found / module missing / access denied). - A true cert-bind dry-run would need IIS to expose a no-commit New-WebBinding (it doesn't); V3-Pro can extend with a temp-install + immediate-remove. V2 ships the permission + module probe as the load-bearing check. - 5 new tests in validate_only_test.go covering: get-website succeeds, get-website fails, nil-executor sentinel, site-name quoting (handles spaces in 'Default Web Site'), output-context in error. Smoke test connectorsAtPhase3 list shrunk from 10 to 7 entries (f5 + iis + postfix removed). Caddy stays in (file-mode returns sentinel; api-mode is real-impl). Envoy + Traefik stay in (no validate-with-target command exists for either). javakeystore + k8ssecret + ssh + wincertstore stay in pending Phase 9. Coverage: F5 holds at ≥85%; IIS holds at ≥85%. Race detector clean. golangci-lint v2.11.4 clean. Phase 9 next: SSH + WinCertStore + JavaKeystore + K8s — the non-file-server connectors. |
||
|
|
a7cce9afdd |
feat(traefik,caddy,envoy,postfix): atomic deploy + post-deploy TLS verify + rollback + ValidateOnly
Phase 7 of the deploy-hardening I master bundle. Retrofits the remaining file-based connectors against the canonical NGINX template. Per-connector quirks codified: - Postfix/Dovecot: full retrofit with PreCommit (postfix check / doveconf -n) + PostCommit (postfix reload / doveadm reload) + post-deploy TLS verify. Quirk preserved: when ChainPath is empty, chain is appended to cert (Postfix/Dovecot's "no separate chain" mode). Per-distro user defaults: postfix, dovecot, _postfix. Default key mode 0600. ValidateOnly real impl returns sentinel when no ValidateCommand. - Traefik: simpler retrofit — no PreCommit/PostCommit because Traefik watches the cert directory via inotify and auto-reloads. Atomic-write via deploy.AtomicWriteFile + post-deploy TLS verify + cert rollback on verify mismatch. Default key mode 0600. ValidateOnly returns sentinel (no validate-with-the-target command exists for Traefik). - Caddy: retrofitted both modes. File mode replaces os.WriteFile with deploy.AtomicWriteFile (preserves the file watcher's auto- reload). API mode unchanged (POST /load already atomic at the Caddy admin server). ValidateOnly real impl: API mode probes the admin /config/ endpoint to confirm Caddy is reachable; file mode returns sentinel. - Envoy: file mode atomic-write via deploy.AtomicWriteFile. Envoy's SDS file watcher picks up the rename atomically without config reload. ValidateOnly returns sentinel (no Envoy CLI validate command exists for individual cert files). Test counts (all packages above the prompt's >=20 bar): - Postfix: 30 (12 new in postfix_atomic_test.go + 18 pre-existing) - Traefik: 22 (12 new in traefik_atomic_test.go + 10 pre-existing) - Caddy: 22 (10 new in caddy_atomic_test.go + 12 pre-existing) - Envoy: 21 (5 new in envoy_atomic_test.go + 16 pre-existing) Coverage: each connector at the prompt's >=80% target. golangci-lint v2.11.4 clean across all 4 connector packages. Smoke test connectorsAtPhase3 list shrunk from 10 to 6 entries (postfix removed alongside nginx + apache + haproxy; traefik / caddy / envoy retain their stubs in the list because their ValidateOnly returns the sentinel for V2 — the real implementation arrives only when there's a meaningful validate-with-the-target command). Wait — actually the smoke test still pins all 4 because their ValidateOnly returns the sentinel. Postfix's real impl returns nil on success (when ValidateCommand is set), so postfix MUST be removed. Caddy's API mode is real-impl. Traefik + Envoy still return sentinel always — they stay in the smoke list. Phase 8 next: F5 + IIS — explicit post-deploy TLS verify + on-failure rollback. Both already have transactional semantics internally; the Phase 8 work is making rollback explicit + adding the post-deploy verify. |
||
|
|
919a92bf1b |
feat(haproxy): atomic deploy + post-deploy TLS verify + rollback + ValidateOnly + test-depth uplift to 36 tests
Phase 6 of the deploy-hardening I master bundle. HAProxy connector follows the canonical Phase 4 NGINX template with the HAProxy- specific quirk: combined PEM file (cert + chain + key in one file, in that order). Test count lifts 3 → 36. HAProxy specifics: - buildCombinedPEM concatenates cert, chain, key in HAProxy's required order. The combined file goes through deploy.Apply as a single File entry (vs NGINX/Apache's 2-3 separate File entries). - Default mode 0600 unconditionally (combined file contains the private key); operators rely on this back-compat behavior. PEMFileMode override is the supported escape hatch. - Validate command is `haproxy -c -f <config>`. Reload via `systemctl reload haproxy` (NOT `restart` — reload uses socket activation to drain in-flight connections). - Default user/group: haproxy (cross-distro consistent). DeployCertificate refactor: - Replaces the duplicated os.WriteFile flow with deploy.Apply. - PreCommit runs `haproxy -c -f` validation (gated on ValidateCommand being non-empty — HAProxy historically allowed empty validate). - PostCommit runs the operator's ReloadCommand. - Post-deploy TLS verify (frozen-decision-0.3 default ON when Endpoint is configured): probes the configured target, fingerprint-matches against the deployed cert (the leaf cert block from the combined PEM), retries with backoff for load- balanced targets. - Rollback wires identical to NGINX/Apache: backup restore + reload retry on PostCommit failure; verify-fail also triggers rollback. ValidateOnly real impl: returns sentinel when no ValidateCommand; otherwise runs the operator's command without touching the live combined PEM. Tests (36 total: 33 in haproxy_atomic_test.go + 3 pre-existing in haproxy_test.go): - Atomic invariants (happy, validate-fail, reload-fail-rollback, rollback-also-fail-escalation) - Combined PEM order (cert + chain + key — verified via PEM block headers, not base64 bodies) - Mode handling (default 0600 even when existing is 0640 — back-compat; PEMFileMode override; existing-mode unchanged when override matches) - Idempotency (full skip) - Verify (match, mismatch, dial-timeout, retries, disabled, no-endpoint, rollback-runs-reload) - ValidateOnly (happy, fails, no-command-sentinel, stderr-in-error) - Concurrency (same-paths-serialize) - Edge cases (no-chain, no-key, ctx-cancelled, no-validate-command, config-validation rejects missing pem_path / reload / shell-injection) Coverage: HAProxy 88.0% (above >=85% prompt bar). Race detector clean. golangci-lint v2.11.4 clean. Smoke test connectorsAtPhase3 list shrinks 11→10 (haproxy removed alongside nginx + apache). Phase 7 next: Traefik + Caddy + Envoy + Postfix — the remaining file-based connectors get the same treatment. |
||
|
|
12e5f97f59 |
feat(apache): atomic deploy + post-deploy TLS verify + rollback + ValidateOnly + test-depth uplift to 34 tests
Phase 5 of the deploy-hardening I master bundle. Mirrors the Phase 4 NGINX template for Apache httpd. Test count lifts 3 → 34 (above the prompt's >=30 target; matches and slightly exceeds the IIS bar). Apache-specific quirks codified in apache.go: - Validate command convention is `apachectl configtest` (NOT `apachectl -t` — that flag exists but configtest is the documented operator-facing form). - Reload command convention is `apachectl graceful` for zero- downtime worker swap (NOT `apachectl restart` which drops in-flight TLS sessions). - Per-distro user defaults: Debian/Ubuntu apache2, RHEL/CentOS apache, Alpine httpd. pickFirstExistingUser walks the list and picks the one that resolves on the host; falls back to no-chown when none exist (cross-distro portability without operator config; same approach as nginx). - Default key file mode 0600 for back-compat with operators relying on the historical hard-coded value (matches the pre-Phase-5 implementation behavior). DeployCertificate refactor: - Replaces the duplicated os.WriteFile chain with deploy.Apply. - PreCommit runs the operator's ValidateCommand via the test seam (which wraps `sh -c <cmd>` in production). - PostCommit runs ReloadCommand the same way. - Post-deploy TLS verify (frozen-decision-0.3 default ON when Endpoint is configured): probes the configured target, compares leaf cert SHA-256 against deployed bytes, retries with exponential backoff (default 3 attempts / 2s backoff for load-balanced targets). - Rollback wires: reload-fail → restore backups + retry reload; verify-fail → restore backups + reload again. Second-failure surfaces ErrRollbackFailed for operator-actionable triage. ValidateOnly real implementation replaces the Phase 3 stub. Returns ErrValidateOnlyNotSupported when no ValidateCommand configured; otherwise runs the validate-with-the-target command without touching the live cert. Test seams (SetTestRunValidate / SetTestRunReload / SetTestProbe) allow tests to skip exec without `apachectl` on PATH; mirror the nginx pattern. Tests (34 total: 31 in apache_atomic_test.go + 3 pre-existing in apache_test.go): - Atomic invariants (happy, validate-fail-no-files-changed, reload-fail-rollback, rollback-also-fail-escalation) - SHA-256 idempotency (full skip + partial-mismatch full-deploy) - Post-deploy verify (match-success, mismatch-rollback, dial-timeout-rollback, retries-until-match, retries-exhausted-rollback, no-endpoint-skips, disabled-skips) - Ownership / mode preservation (existing-mode, override-wins, default-key-0600, default-cert-0644) - Backup retention (keeps-N, disabled-no-backups, backup-created) - Concurrency (same-paths-serialize) - ValidateOnly (happy, fails, no-command-sentinel, stderr-in-error) - Edge cases (no-chain, no-key, ctx-cancelled, verify-rollback- reload, deployment-id-prefix, metadata-populated) Coverage: Apache 86.6% (above the >=85% prompt bar). Race detector clean. golangci-lint v2.11.4 clean. Smoke test connectorsAtPhase3 list shrunk from 12 to 11 entries (apache removed; nginx + apache now have real impls). Phase 6 next: HAProxy (combined PEM atomic write + `haproxy -c -f` validate + uplift 3 → >=30). |
||
|
|
7444df01e2 |
feat(nginx): atomic deploy + post-deploy TLS verify + rollback + ValidateOnly + ownership preservation
Phase 4 of the deploy-hardening I master bundle. The canonical NGINX implementation that Phases 5-9 model on. Replaces the historical os.WriteFile flow at internal/connector/target/nginx/nginx.go:99 with deploy.Apply() and adds three production-grade competitor-gap features: atomic deploy with rollback, post-deploy TLS verify, file ownership preservation. NGINX connector — internal/connector/target/nginx/nginx.go: - DeployCertificate now wires deploy.Apply with PreCommit running the operator's ValidateCommand (e.g. `nginx -t`), PostCommit running ReloadCommand (e.g. `nginx -s reload`), and an explicit post-deploy TLS verify step that dials the configured endpoint, pulls the leaf cert SHA-256, and compares against what was just deployed. SHA-256 mismatch (wrong vhost / cached cert / NGINX still serving stale) triggers automatic rollback: backup files are restored + reload fired again. Failed-second-reload returns ErrRollbackFailed (operator-actionable; loud audit + alert). - ValidateOnly replaces the Phase 3 stub: runs the operator's ValidateCommand without touching the live cert. V2 contract is syntax-only validation (full pre-deploy temp-config validation is V3-Pro). Returns ErrValidateOnlyNotSupported when no ValidateCommand is configured. - New per-target Config fields: PostDeployVerify (frozen-decision- 0.3 default ON), PostDeployVerifyAttempts (default 3 — defends against load-balanced targets where the verify might hit a different pod that hasn't picked up the new cert yet), PostDeployVerifyBackoff (default 2s exponential), per-file Mode/Owner/Group overrides (KeyFileMode, CertFileMode, KeyFileOwner, etc.), and BackupRetention (default 3, -1 to disable backups entirely — documented foot-gun). - buildPlan honors per-distro nginx user (Debian: www-data, Alpine: nginx, Red Hat: nginx) by checking the local user database; falls back to no-chown when neither exists. Means the connector is portable across distros without operator config. Deploy package — internal/deploy/ownership.go: - applyOwnership now silently swallows chown failures when the agent isn't running as root. Production agents always run as root and chown failures are real bugs; dev / CI runs as a regular user where chown to a different uid will always fail with EPERM (or EINVAL on some tmpfs configs) and would otherwise force every test to run with sudo. Production-grade contract preserved (uid 0 still hard-fails on chown errors). Test suite — internal/connector/target/nginx/nginx_atomic_test.go ships 42 new named tests (NGINX total: 17 pre-existing + 42 new = 59, above the prompt's >=40 bar; matches the IIS depth bar of 41): - Atomic-deploy invariants (cert+chain+key all-or-nothing, validate-fails-no-files-changed, reload-fails-rollback, rollback-also-fails-escalation) - SHA-256 idempotency (full match skips, partial match deploys all) - Post-deploy TLS verify (fingerprint-match-success, SHA256-mismatch-rollback, dial-timeout-rollback, retries-until- match, retries-exhausted-rollback, no-endpoint-skips, disabled-skips-entirely, default-10s-timeout, endpoint-forwarded) - Ownership / mode preservation (existing-mode-preserved, override- wins, KeyFileMode override applied) - Backup retention (keeps-last-N, disabled-creates-no-backups, fresh-deploy-creates-backup) - Concurrency (same-paths-serialize via deploy package's file mutex, different-paths-parallelize) - ValidateOnly (happy-path-nil, command-fails-wrapped-error, no-config-returns-sentinel, ctx-cancelled, stderr-in-message) - Edge cases (no-chain, no-key, no-chain-path, empty-cert-PEM, ctx-cancelled, all-four-one-apply) - Result.Metadata + DeploymentID shape contracts Coverage: NGINX 91.0% (above the >=85% prompt bar). Race detector clean. golangci-lint v2.11.4 clean. Existing 17 tests still all pass (no behavior change in the legacy paths exercised there). Phase 5 next: mirror this implementation for Apache + lift its test count from 3 to >=30. Same template applies through Phases 6-9 for the remaining 11 connectors. |