From 2643a427ace8af42e9bd5308a0b72b1c1c1a8018 Mon Sep 17 00:00:00 2001
From: shankar0123
Date: Fri, 1 May 2026 03:06:49 +0000
Subject: [PATCH] =?UTF-8?q?ci(digest-validity):=20exclude=20Windows=20IIS?=
 =?UTF-8?q?=20digest=20=E2=80=94=20image=20is=20doc-only,=20not=20pulled?=
 =?UTF-8?q?=20by=20Linux=20CI?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI run #376 (commit a1c7741, Frontend Build job) failed with:

  digest does not resolve: mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2022@sha256:8d0b0e651ad514e3fb05978db66f38036118812e1b9314a48f10419cad8a3462

A re-run with no code changes went green. The digest itself is fine —
verified against MCR directly (HTTP 200 from
mcr.microsoft.com/v2/windows/servercore/iis/manifests/sha256:8d0b...),
and the tag `:windowsservercore-ltsc2022` currently resolves to that
exact digest. Microsoft hasn't rotated.

Root cause is registry-side rate-limiting. MCR throttles
unauthenticated GET-by-digest requests by source IP. GitHub-hosted
runners share a small pool of egress IPs across many users; bursts
trip the throttle and return non-200. Re-run = different runner =
different IP = throttle window has reset = pass. This will recur on
roughly N% of pushes indefinitely, until either (a) Microsoft loosens
MCR rate limits, (b) GitHub buys more runner IPs, or (c) we stop
verifying digests CI doesn't actually use.

The deeper issue is structural, not transient. The Windows IIS image
is gated behind compose `profiles: [deploy-e2e-windows]`
(deploy/docker-compose.test.yml:700). The comment block above the
service definition (lines 675-691) explicitly says "Linux CI never
activates this profile." All 10 TestVendorEdge_IIS_*_E2E tests are on
scripts/vendor-e2e-skip-allowlist.txt because the sidecar is never
started.
The whole Windows matrix was DELETED in ci-pipeline-cleanup Phase 6 /
frozen decision 0.5 (revising Bundle II decision 0.4); IIS validation
moved to docs/connector-iis.md::Operator validation playbook.

So `digest-validity.sh` is verifying a digest that no CI job ever
pulls — paying CI brittleness against MCR rate-limiting we can't
control, for an image whose only purpose in compose is documentation
for an operator's manual workflow on a real Windows host.

The fix matches the guard's stated purpose ("every digest CI actually
depends on is valid"): exclude images CI never pulls.

Implementation. Add an EXCLUDED_PATTERNS array near the top of the
script with one entry — the IIS image path
`mcr.microsoft.com/windows/servercore/iis` — and a comment block above
it documenting:

- WHY it's excluded (gated profile, never started, all tests on
  skip-allowlist)
- WHEN it would need re-inclusion (if a Windows CI runner is added
  that actually starts the sidecar)
- WHAT this list is NOT for (transient flake silencing — that gets
  fixed via retry logic in the script, not via exclusion)

The match is by image-path substring, not by digest, so future
tag/digest updates of the same image still hit the exclusion without
needing this list to be re-edited.

Loop logic gains a short check that runs the exclusion match before
any registry work. Excluded refs log as "SKIP (excluded) <ref>" so
operator-facing CI logs stay informative — at a glance you can see
which digests were verified vs which were intentionally not.

The success message updates to differentiate verified vs excluded
counts: "digest-validity: clean — N verified, M excluded (CI never
pulls)" when M > 0; original message preserved when M == 0.

Verified manually:

- Clean repo: 15 verified, 1 excluded, exit 0.
- Fabricated bogus httpd digest: ::error:: emitted for the bad
  digest, IIS still SKIP-excluded, exit 1. (Real regressions still
  caught.)
- Restore: 15 verified, 1 excluded, exit 0 again.
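
The substring match described above can be tried in isolation. The
following is a minimal stand-alone sketch, not the script itself:
`is_excluded` is a hypothetical helper introduced only for this
illustration, and the tags and digests in `refs` are made-up
placeholders. It shows that a ref for the same image path stays
excluded after a tag or digest change, while an unrelated image is
still checked.

```shell
#!/usr/bin/env bash
# Stand-alone illustration of the image-path substring exclusion.
# Assumption: is_excluded is a hypothetical helper for this demo;
# digest-validity.sh inlines the same match inside its loop.

EXCLUDED_PATTERNS=(
  "mcr.microsoft.com/windows/servercore/iis"
)

# Return 0 when the ref contains any excluded image-path substring.
is_excluded() {
  local ref="$1" pat
  for pat in "${EXCLUDED_PATTERNS[@]}"; do
    if [[ "$ref" == *"$pat"* ]]; then
      return 0
    fi
  done
  return 1
}

# Placeholder refs: the tags and digests below are invented.
refs=(
  "mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2022@sha256:aaaa"
  "mcr.microsoft.com/windows/servercore/iis:some-future-tag@sha256:bbbb"
  "httpd:2.4@sha256:cccc"
)

for ref in "${refs[@]}"; do
  if is_excluded "$ref"; then
    echo "SKIP (excluded) $ref"
  else
    echo "CHECK $ref"
  fi
done
```

Run as-is, this prints two "SKIP (excluded)" lines for the IIS refs
and one "CHECK" line for the httpd ref. Matching on the image path
rather than the full ref is what keeps the list stable across digest
bumps.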

Other recurring MCR-hosted images would warrant the same treatment if
they get added later. The exclusion list pattern scales: each new
entry needs its own "WHY this is doc-only" justification block.

What this is NOT:

- Not a generic flake-silencer. The exclusion is justified by the
  image being doc-only, not by the test being noisy.
- Not a global retry/resilience layer. If MCR rate-limits an image CI
  DOES pull, that's a real CI dependency on an unreliable external
  service — fix by retry-with-backoff, not by excluding.
---
 scripts/ci-guards/digest-validity.sh | 60 +++++++++++++++++++++++++++-
 1 file changed, 59 insertions(+), 1 deletion(-)

diff --git a/scripts/ci-guards/digest-validity.sh b/scripts/ci-guards/digest-validity.sh
index abdea2c..b83fab2 100755
--- a/scripts/ci-guards/digest-validity.sh
+++ b/scripts/ci-guards/digest-validity.sh
@@ -30,8 +30,61 @@ if [ ${#REFS[@]} -eq 0 ]; then
   exit 0
 fi
 
+# ---------------------------------------------------------------------------
+# Excluded refs — digests for images CI never pulls.
+# ---------------------------------------------------------------------------
+# The guard's purpose is "every digest CI actually depends on is valid."
+# Images that exist in compose only as documentation for an operator's
+# manual workflow (e.g., Windows containers we cannot start on Linux
+# runners) shouldn't add CI brittleness against external-registry
+# rate-limiting we don't control.
+#
+# Each entry below is a substring matched against the full ref line
+# (`<image>:<tag>@sha256:<hex>`). When a ref matches, it is logged as
+# `SKIP (excluded)` and the loop continues. The match is by image-path
+# substring, not by digest, so a future tag/digest update still excludes
+# the right image without needing this list to be re-edited.
+#
+# Add an entry only with a documented reason in the comment block above
+# the entry. This list is NOT a place to silence transient flakes — those
+# get fixed by retries in the script itself, not by exclusion.
+EXCLUDED_PATTERNS=(
+  # mcr.microsoft.com/windows/servercore/iis
+  # Windows-only image gated behind compose profiles=[deploy-e2e-windows]
+  # (deploy/docker-compose.test.yml:700). Linux CI runners cannot start
+  # the windows-iis-test sidecar — the entire Windows matrix was deleted
+  # per ci-pipeline-cleanup Phase 6 / frozen decision 0.5, and IIS
+  # validation moved to docs/connector-iis.md::Operator validation
+  # playbook. All 10 TestVendorEdge_IIS_*_E2E tests are on
+  # scripts/vendor-e2e-skip-allowlist.txt for the same reason.
+  #
+  # Without this exclusion, Linux CI runners HEAD this digest from MCR
+  # on every push. MCR rate-limits unauthenticated requests by source IP;
+  # GitHub-hosted runner IPs are heavily reused across users; the result
+  # is ~one transient 4xx/5xx every N runs (CI run #376 hit it). Re-runs
+  # pass because runner IPs rotate. The image itself is fine — we just
+  # don't need Linux CI to verify it.
+  "mcr.microsoft.com/windows/servercore/iis"
+)
+
 fail=0
+verified=0
+skipped=0
 for ref in "${REFS[@]}"; do
+  # Apply exclusion list before any work on the ref.
+  excluded=0
+  for pat in "${EXCLUDED_PATTERNS[@]}"; do
+    if [[ "$ref" == *"$pat"* ]]; then
+      echo "SKIP (excluded) $ref"
+      excluded=1
+      skipped=$((skipped + 1))
+      break
+    fi
+  done
+  if [ "$excluded" -eq 1 ]; then
+    continue
+  fi
+
   digest="${ref##*@}"
   imgtag="${ref%@*}"
   tag="${imgtag##*:}"
@@ -96,9 +149,14 @@ for ref in "${REFS[@]}"; do
     fail=1
   else
     echo "OK $ref"
+    verified=$((verified + 1))
   fi
 done
 
 [ $fail -eq 0 ] || exit 1
 echo ""
-echo "digest-validity: clean — all ${#REFS[@]} digest references resolve."
+if [ "$skipped" -gt 0 ]; then
+  echo "digest-validity: clean — ${verified} verified, ${skipped} excluded (CI never pulls)."
+else
+  echo "digest-validity: clean — all ${verified} digest references resolve."
+fi