mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 13:41:30 +00:00
ci: dump container logs on deploy-vendor-e2e failure
The 25194251740 CI run failed with "container certctl-test-server is unhealthy", but the GitHub Actions log doesn't include the server's stdout/stderr; compose only reports the dependency-chain symptom. Without the server's actual log output we can't tell whether the unhealthy state was caused by a DB migration crash, port bind failure, entrypoint stall, OOM kill, or healthcheck race.

Add an `if: failure()` step right before teardown that dumps:

- `docker compose ps -a` (every container's exit status)
- last 200 lines from certctl-test-server
- all of tls-init (one-shot, short)
- last 100 lines from postgres + stepca + agent
- last 50 lines from pebble

This is a permanent debuggability improvement, not a band-aid: the matrix-collapse (Phase 5) brings up ~18 containers concurrently where pre-collapse the per-vendor matrix brought up ~7. Future transient failures will be much faster to diagnose with logs in the CI output. Once we know the actual root cause from this dump, we fix it for real.

Placed AFTER skip-count enforcement (so failures in either step trigger it) and BEFORE teardown (which is `if: always()` and would otherwise nuke the containers before we could log them).
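Several of the candidate causes listed in the message (OOM kill, a crashed entrypoint, a healthcheck race) can also be told apart from the container's recorded state, not only from its logs. The following is a minimal sketch of such a check, not part of this commit: `dump_state` and `DOCKER_BIN` are hypothetical names introduced here for illustration, while `.State.Status`, `.State.ExitCode`, `.State.OOMKilled` and `.State.Health` are fields of `docker inspect` output.

```shell
# Sketch only: print a container's recorded state alongside its logs.
# DOCKER_BIN and dump_state are hypothetical names, not part of the commit.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# dump_state NAME: print run status, exit code, OOM flag and healthcheck record.
dump_state() {
  name="$1"
  echo "=== ${name} state ==="
  # One line summarising whether the container is running, how it exited,
  # and whether the kernel OOM killer terminated it.
  "$DOCKER_BIN" inspect \
    --format 'status={{.State.Status}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}' \
    "$name" 2>&1 || true
  # The healthcheck record (status, failing streak, recent probe output),
  # only meaningful for containers that define a healthcheck.
  "$DOCKER_BIN" inspect --format '{{json .State.Health}}' "$name" 2>&1 || true
}
```

In a failure step this would be called as `dump_state certctl-test-server`. Like the log dump, every call ends in `|| true` so that a missing container does not mask the original failure.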
@@ -389,6 +389,40 @@ jobs:
         # on Linux per Phase 6 / frozen decision 0.5).
         run: bash scripts/vendor-e2e-skip-check.sh test-output.log

+      - name: Diagnostic dump on failure
+        # Prints container status + last 200 log lines from the certctl-server
+        # and base-stack containers when ANY previous step in this job fails.
+        # The matrix-collapse (Phase 5) brings up ~18 containers concurrently
+        # (vs 1 vendor sidecar at a time pre-collapse); transient failures
+        # surface most often as "container certctl-test-server is unhealthy"
+        # without any visible reason because compose only reports the
+        # dependency-chain symptom, not the root cause. Dumping logs here
+        # makes the underlying error (DB migration crash, port bind failure,
+        # entrypoint stall, OOM kill) visible in the GitHub Actions log
+        # without requiring a workstation reproduction.
+        if: failure()
+        run: |
+          echo "=== docker compose ps -a ==="
+          docker compose --profile deploy-e2e -f deploy/docker-compose.test.yml ps -a || true
+          echo ""
+          echo "=== certctl-test-server logs (last 200 lines) ==="
+          docker logs --tail 200 certctl-test-server 2>&1 || true
+          echo ""
+          echo "=== certctl-test-tls-init logs ==="
+          docker logs certctl-test-tls-init 2>&1 || true
+          echo ""
+          echo "=== certctl-test-postgres logs (last 100 lines) ==="
+          docker logs --tail 100 certctl-test-postgres 2>&1 || true
+          echo ""
+          echo "=== certctl-test-stepca logs (last 100 lines) ==="
+          docker logs --tail 100 certctl-test-stepca 2>&1 || true
+          echo ""
+          echo "=== certctl-test-pebble logs (last 50 lines) ==="
+          docker logs --tail 50 certctl-test-pebble 2>&1 || true
+          echo ""
+          echo "=== certctl-test-agent logs (last 100 lines) ==="
+          docker logs --tail 100 certctl-test-agent 2>&1 || true
+
       - name: Tear down sidecars
         if: always()
         run: docker compose --profile deploy-e2e -f deploy/docker-compose.test.yml down -v
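The dump step repeats the same banner-plus-`docker logs` pair for each container, varying only the name and the tail length. As a hedged sketch of how that could be written table-driven (this is not what the commit does): `dump_logs`, `dump_all` and `DOCKER_BIN` are hypothetical names introduced here, and the container names and line counts are copied from the step above.

```shell
# Sketch only: a table-driven variant of the diagnostic dump.
# DOCKER_BIN, dump_logs and dump_all are hypothetical names, not part of the commit.
DOCKER_BIN="${DOCKER_BIN:-docker}"

# dump_logs NAME [TAIL]: print a banner, then the container's logs.
# An empty TAIL means "all lines" (used for the one-shot tls-init container).
dump_logs() {
  name="$1"
  tail_n="${2:-}"
  if [ -n "$tail_n" ]; then
    echo "=== ${name} logs (last ${tail_n} lines) ==="
    "$DOCKER_BIN" logs --tail "$tail_n" "$name" 2>&1 || true
  else
    echo "=== ${name} logs ==="
    "$DOCKER_BIN" logs "$name" 2>&1 || true
  fi
  echo ""
}

# dump_all: walk the NAME:TAIL table in the same order as the workflow step.
dump_all() {
  for entry in \
    certctl-test-server:200 \
    certctl-test-tls-init: \
    certctl-test-postgres:100 \
    certctl-test-stepca:100 \
    certctl-test-pebble:50 \
    certctl-test-agent:100
  do
    # Text before the first ':' is the container name, text after it the tail length.
    dump_logs "${entry%%:*}" "${entry#*:}"
  done
}
```

Calling `dump_all` at the end of the `run:` block would produce the same sections as the explicit list. The explicit form in the commit has the advantage of being readable at a glance in the workflow file, so this variant is mainly worth considering if the container list keeps growing.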