mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 18:01:37 +00:00
40fd96a416bc484b4d1feeedfa880986bb60eb75
6 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
0e29c416b1 |
refactor(handler,repo): replace strings.Contains error dispatch with typed sentinels (S-2)
Closes one 2026-04-24 audit finding (P2):
- cat-s6-efc7f6f6bd50: 30 strings.Contains(err.Error(), ...) sites
in internal/api/handler/ — brittle to repository-layer message
changes, untyped against the actual failure mode.
Approach (Option B from prompt design notes):
- New typed sentinels in internal/repository/errors.go:
ErrNotFound, ErrForeignKeyConstraint
IsForeignKeyError(err) helper (the only place substring
matching at the lib/pq boundary is allowed; isolates the
DB-driver string knowledge to one function).
- New typed sentinel in internal/domain/errors.go:
ErrValidation (reserved for future per-entity validation
wrappers; not yet used by all handlers).
- 49 sites in internal/repository/postgres/*.go updated to wrap
sql.ErrNoRows-derived errors via fmt.Errorf("...: %w",
repository.ErrNotFound).
- 18 not-found handler sites + 2 FK-constraint handler sites
refactored to errors.Is(err, repository.ErrNotFound) /
repository.IsForeignKeyError(err).
- 23 inline `fmt.Errorf("X not found")` test fixtures across
handler tests rewrapped to wrap repository.ErrNotFound.
- test_utils.go::ErrMockNotFound rewrapped to wrap
repository.ErrNotFound; renewal_policy.go closure docblock
updated to reflect the new convention.
- integration test mockJobRepository.Get wraps repository.ErrNotFound.
CI regression guardrail:
- .github/workflows/ci.yml::"Forbidden strings.Contains(err.Error())
regression guard (S-2)" greps for the three patterns ("not found",
"violates foreign key", "RESTRICT") under internal/api/handler/
and fails the build on regression.
Verification:
- go build ./... — clean
- go vet ./... — clean
- go test ./... -short -count=1 — all packages pass (handler +
repository + service + integration)
- golangci-lint v2.11.4 run ./... — 0 issues
- S-2 guardrail dry-run on post-fix tree → empty (good)
- All sibling guardrails (S-1, G-3, D-1+D-2, B-1, L-1, H-1, C-1, F-1, P-1) pass
Audit findings closed:
- cat-s6-efc7f6f6bd50 (P2)
Deferred follow-ups:
- 6 domain-specific substring patterns still inline in handlers
("cannot approve", "cannot reject", "cannot be parsed",
"no certificates found", "challenge password", "invalid"/
"required" validation chains in profiles + agent_groups). Each
needs its own typed sentinel, scoped per service. Documented
by the S-2 CI guardrail's allowlist for closure-comments only.
- Per-entity not-found sentinels (Option A — ErrCertificateNotFound,
ErrAgentNotFound, etc.) deferred. Generic ErrNotFound covers the
current dispatch needs; per-entity precision would let handlers
return entity-aware error bodies without a domain.Type field,
but not blocking.
|
||
|
|
a3d8b9c607 |
fix(deploy,db,handler): close fresh-clone postgres init failure + 4 ride-along audit findings (U-3 master)
GitHub #10 reopened: operator mikeakasully cloned v2.0.50 fresh and ran the canonical quickstart (docker compose -f deploy/docker-compose.yml up -d --build); postgres reported unhealthy indefinitely, dependent containers never started. Root cause: deploy/docker-compose.yml mounted a hand-curated subset of migrations/*.up.sql + seed.sql into postgres /docker-entrypoint-initdb.d/. Postgres applied them at initdb time. Once seed.sql referenced columns added by migrations *after* the mounted cutoff (e.g., policy_rules.severity from migration 000013), initdb crashed mid-seed and the container loop wedged. Two sources of truth (compose mount list vs in-tree migration ladder) diverged the moment a seed-touching migration shipped, and the only thing that fixed it was hand-editing the compose file every release. Fix: remove the dual source. Postgres boots empty; the server applies migrations + seed at startup via RunMigrations + RunSeed. Helm has used this pattern since day one (postgres-init emptyDir); compose now matches. Bundled with four ride-along audit findings whose fixes share the same schema/db code surface, so operators take the schema-change pain only once: cat-u-seed_initdb_schema_drift [P1, primary] — initdb-mount fix cat-o-retry_interval_unit_mismatch [P1] — column rename minutes→seconds cat-o-notification_created_at_dead_field [P2] — add column + populate cat-o-health_check_column_orphans [P1] — drop unwired columns cat-u-no_version_endpoint [P2] — add /api/v1/version Single migration (000017_db_coupling_cleanup) bundles the three schema changes under a DO \$\$ guard so re-application is safe; reduces operator-visible 'schema-change releases' from four to one. Backend - internal/repository/postgres/db.go: add RunSeed (baseline) + RunDemoSeed (gated by CERTCTL_DEMO_SEED). Both idempotent (ON CONFLICT DO NOTHING in every shipped INSERT) so repeated boots are safe; missing-file is no-op so custom packaging that strips seeds still boots cleanly. - cmd/server/main.go: invoke RunSeed (always) + RunDemoSeed (when flag set) immediately after RunMigrations. - internal/repository/postgres/notification.go: NotificationRepository.Create now sets created_at (with time.Now() fallback when caller leaves it zero); scanNotification reads it back; List + ListRetryEligible SELECT extended. - internal/repository/postgres/renewal_policy.go: column references updated to retry_interval_seconds across SELECT/INSERT/UPDATE sites. - internal/api/handler/version.go: new VersionHandler exposes {version, commit, modified, build_time, go_version} from runtime/debug.ReadBuildInfo() with ldflags-supplied Version override. - internal/api/router/router.go: register GET /api/v1/version through the no-auth chain (CORS + ContentType) alongside /health, /ready, /api/v1/auth/info. - cmd/server/main.go: add /api/v1/version to no-auth dispatch + audit ExcludePaths so rollout polling doesn't dominate the audit trail. - internal/config/config.go: add DatabaseConfig.DemoSeed + CERTCTL_DEMO_SEED env var. Migration - migrations/000017_db_coupling_cleanup.up.sql + .down.sql: (1) renewal_policies.retry_interval_minutes → retry_interval_seconds (DO \$\$ guard, idempotent re-application) (2) notification_events ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT NOW() (3) network_scan_targets DROP orphan health_check_enabled + health_check_interval_seconds - migrations/seed.sql: column reference updated to retry_interval_seconds. - migrations/seed_demo.sql: same column rename + applied at runtime now via RunDemoSeed (no longer initdb-mounted). Compose - deploy/docker-compose.yml: drop ALL initdb mounts (10 migration files + seed.sql); add start_period: 30s to postgres + certctl-server healthchecks to absorb the runtime migration + seed application window on first boot. - deploy/docker-compose.test.yml: same drop (+ ghost seed_test.sql mount removed; that file never existed); same healthcheck start_period. - deploy/docker-compose.demo.yml: replace seed_demo.sql initdb mount with CERTCTL_DEMO_SEED=true env var on certctl-server. Tests - internal/api/handler/version_handler_test.go: TestVersion_ReturnsBuildInfo, TestVersion_RejectsNonGet, TestVersion_LdflagsOverride. - internal/repository/postgres/seed_test.go: TestRunSeed_AppliesIdempotently, TestRunSeed_MissingFileIsNoOp, TestRunDemoSeed_AppliesIdempotently, TestMigration000017_RetryIntervalRename, TestMigration000017_NotificationCreatedAt, TestMigration000017_HealthCheckOrphansDropped (testcontainers, -short skips). - internal/repository/postgres/notification_test.go: TestNotificationRepository_CreatedAt_IsPersisted + TestNotificationRepository_CreatedAt_DefaultsToNow. CI guardrail - .github/workflows/ci.yml: new 'Forbidden migration mount in compose initdb (U-3)' step grep-fails the build if any migrations/*.sql or seed*.sql re-appears in /docker-entrypoint-initdb.d in any compose file. Catches future drift before a fresh-clone operator hits it. Spec / Docs - api/openapi.yaml: add /api/v1/version operation under Health tag. - docs/architecture.md: replace the 'initdb may run the same SQL' paragraph with a post-U-3 single-source-of-truth explanation. - CHANGELOG.md: full unreleased-section entry covering all 5 closures, breaking changes, and the new env var. Audit doc - coverage-gap-audit-2026-04-24-v5/unified-audit.md: add new P1 #14 cat-u-seed_initdb_schema_drift; flip the 4 ride-along findings to ✅ RESOLVED with closure prose pointing at this commit. Verification: build/vet/test -short -race all clean across all touched packages locally; govulncheck reports 0 vulnerabilities affecting our code; OpenAPI YAML parses; CI U-3 grep guardrail clears against the post-fix tree. |
||
|
|
675b87ba63 |
I-005: notification retry loop + dead-letter queue
Critical alerts can no longer be silently dropped by a transient
notifier failure. Failed notification attempts now ride an exponential
backoff retry loop, with a 5-attempt budget before promotion to the
dead-letter queue for operator intervention.
Schema (migration 000016, idempotent):
- retry_count INTEGER NOT NULL DEFAULT 0
- next_retry_at TIMESTAMPTZ
- last_error TEXT
- idx_notification_events_retry_sweep partial index
(next_retry_at) WHERE status='failed' AND next_retry_at IS NOT NULL
Dead rows clear next_retry_at so the index stops matching them.
Service contract:
- NotificationService.RetryFailedNotifications drives 2^n-minute
exponential backoff capped at 1h (notifRetryBackoffCap) with
5-attempt budget (notifRetryMaxAttempts).
- Exhaustion (RetryCount >= notifRetryMaxAttempts-1) promotes to
status='dead' via MarkAsDead.
- Non-terminal failures record via RecordFailedAttempt.
- Success path promotes to 'sent' without touching retry_count
(audit preserves "delivered on attempt N").
- Missing-notifier branch defensively promotes to 'sent' to avoid
wedging a row on a deleted channel.
- RequeueNotification operator escape hatch atomically resets
retry_count -> 0, next_retry_at -> NULL, last_error -> NULL,
status -> pending via notifRepo.Requeue.
Scheduler:
- New always-on notificationRetryLoop wired into the base loop set at
CERTCTL_NOTIFICATION_RETRY_INTERVAL (default 2m).
- sync/atomic.Bool idempotency guard.
- sync.WaitGroup shutdown drain via WaitForCompletion.
StatsService:
- SetNotifRepo setter pattern preserves 9 pre-existing
NewStatsService call sites (main.go + stats_test.go + 8 digest
tests) without touching the constructor signature.
- DashboardSummary.NotificationsDead populated via
notifRepo.CountByStatus(ctx, "dead") — nil-safe when unwired
(reports zero on systems without a notification repository).
- CountByStatus error is non-fatal (dashboard summary is
best-effort for this field).
- Prometheus certctl_notification_dead_total counter emitted from
the same snapshot.
Handler:
- New POST /api/v1/notifications/{id}/requeue endpoint.
- dead status surfaces to MCP + CLI.
Frontend:
- NotificationsPage gains two-tab toolbar ("All" / "Dead letter")
with queryKey: ['notifications', activeTab] so switching tabs
doesn't serve stale data until the 30s refetch.
- Dead rows surface "Retry {n}/5" + truncated last_error with
full-text title tooltip.
- Requeue mutation wrapped as
mutationFn: (id: string) => requeueNotification(id)
to prevent react-query v5's positional context argument from
leaking into the API client — pinned against future refactors
by strict-match toHaveBeenCalledWith('notif-dead-001') in
NotificationsPage.test.tsx:181.
Closes I-005.
|
||
|
|
1d1b89c9b5 |
Implement M3: expiration threshold alerting with dedup and status transitions
- Add alert_thresholds_days JSONB column to renewal_policies (default [30,14,7,0]) - Add RenewalPolicy.AlertThresholdsDays field + EffectiveAlertThresholds() helper - Add RenewalPolicyRepository interface + postgres implementation - Rewrite CheckExpiringCertificates with per-policy threshold alerting - Add SendThresholdAlert + HasThresholdNotification for deduplication via [threshold:N] tags - Add Type and MessageLike filters to NotificationFilter + postgres query support - Auto-transition certs to Expiring (>0 days) or Expired (<=0 days) status - Record expiration_alert_sent audit events per threshold crossing - Fix .gitignore: allow SQL migration files, scope server/agent build artifact rules - Track previously untracked cmd/ and migrations/ directories - Update docs (README, architecture, demo-advanced) for threshold alerting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
9b4122b159 |
Fix runtime bugs, implement service layer, and overhaul documentation
Runtime fixes: - Fix env var mismatch (CERTCTL_DB_URL → CERTCTL_DATABASE_URL) - Fix table name mismatches (certificates → managed_certificates, notifications → notification_events) - Add renewal_policy_id to certificate queries - Remove non-existent created_at from notification queries - Add env var fallback for agent CLI flags - Graceful degradation for missing notifiers/issuers in demo mode - Copy web/ directory in Dockerfile for dashboard serving Service layer: - Implement handler-service interface pattern across all services - Wire up certificate, agent, job, policy, team, owner, audit, notification services Documentation: - Add concepts.md: beginner-friendly guide to TLS, CAs, private keys - Rewrite quickstart.md with accurate API examples matching actual handlers - Add demo-advanced.md: interactive demo with cert issuance and automated script - Update architecture.md with correct table names and connector interfaces - Update connectors.md to match actual Go interface signatures - Update demo-guide.md with cross-references to new docs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
3a9fe8ba37 | Complete V1 scaffold |