Commit Graph

2 Commits

Author SHA1 Message Date
shankar0123 62a412c488 Bundle C: Renewal/reliability cluster — 7 findings closed
Closes M-006 + M-007 + M-008 + M-015 + M-016 + M-019 + M-020 from
comprehensive-audit-2026-04-25. M-028 was already closed by the
Bundle B CI follow-up.

M-006 (CWE-913) — Idempotent migration 000014
  migrations/000014_policy_violation_severity_check.up.sql:
    Prepended ALTER TABLE ... DROP CONSTRAINT IF EXISTS before the
    ADD. Mirrors the down migration's existing IF EXISTS shape and
    the M-7 idempotent-index idiom. Re-runs against partially-applied
    DBs now succeed.

M-007 — Bulk-op partial-failure tests (3 new)
  internal/api/handler/bulk_partial_failure_test.go:
    TestBulkRevoke_PartialFailure_ReportsBoth
    TestBulkRenew_PartialFailure_ReportsBoth
    TestBulkReassign_PartialFailure_ReportsBoth
  Each asserts HTTP 200 + both success/failure counters round-trip
  + per-cert errors[] preserved with non-empty messages so operators
  can correlate each failure to its certificate ID.

M-008 — Admin-gated handler enumeration pin (verified-already-clean)
  Recon: only one admin-gated handler — bulk_revocation.go — with
  full 3-branch test triplet already in place. health.go calls
  IsAdmin informationally to surface the flag to the GUI without
  gating.
  internal/api/handler/m008_admin_gate_test.go:
    Walks every handler .go file, asserts every middleware.IsAdmin
    call site is in AdminGatedHandlers (with required test triplet)
    or InformationalIsAdminCallers (justified). Adding a new admin
    gate without updating both the constant AND adding the test
    triplet fails CI.

M-015 — Single-profile cardinality pin (verified-already-clean)
  Audit claim 'no cardinality validation' was wrong — enforced at
  struct level. domain.ManagedCertificate.{CertificateProfileID,
  RenewalPolicyID,IssuerID,OwnerID} and RenewalPolicy.
  CertificateProfileID are bare strings, not slices.
  internal/domain/m015_cardinality_test.go:
    reflect-based pin on kind=String. Schema change to N:N would
    have to update renewal.go's lookup loop in the same commit.

M-016 (CWE-754) — Reap stale-agent jobs
  internal/repository/postgres/job.go::ListJobsWithOfflineAgents:
    JOIN jobs to agents on agent_id, filter (status=Running AND
    a.last_heartbeat_at < cutoff), exclude server-keygen jobs.
  internal/service/job.go::ReapJobsWithOfflineAgents:
    Flips matched jobs to Failed reason agent_offline so I-001
    retry loop re-queues them on a healthy agent. Records audit
    event per reap.
  internal/scheduler/scheduler.go:
    Scheduler.runJobTimeout cycle now calls both reaper arms.
    agentOfflineJobTTL default 5min (5x agent-health-check default);
    SetAgentOfflineJobTTL knob for operator override.
  internal/service/job_offline_agent_reaper_test.go: 6 unit tests
  cover happy path, server-keygen-skip, non-Running-skip, non-
  positive-TTL fail-loud, repo-error propagation, audit-event
  recording.

M-019 — Configurable ARI HTTP timeout
  Audit claim 'no fallback timeout' was wrong — ari.go:52 already
  had a 15s timeout. Bundle C makes it configurable.
  internal/connector/issuer/acme/acme.go:
    Config.ARIHTTPTimeoutSeconds field with env path
    CERTCTL_ACME_ARI_HTTP_TIMEOUT_SECONDS.
  internal/connector/issuer/acme/ari.go:
    Both HTTP clients (GetRenewalInfo + getARIEndpoint) now use the
    new ariHTTPTimeout() helper. Zero / negative / nil-config all
    fall back to the historic 15s default.
  ari_timeout_test.go: 4 dispatch arm tests.

M-020 (CWE-770) — OCSP DoS hardening
  Pre-bundle the noAuthHandler chain had no rate limit. An attacker
  could DoS the OCSP responder, which for fail-open relying parties
  is a revocation bypass.
  cmd/server/main.go:
    noAuthHandler refactored from fixed middleware.Chain(...) to a
    conditional slice that appends middleware.NewRateLimiter when
    cfg.RateLimit.Enabled. Per-IP keying applies; OCSP/CRL/EST/SCEP
    are unauth.
  docs/security.md (NEW):
    Operator runbook documenting Must-Staple TLS Feature extension
    RFC 7633 as the architectural fix for fail-open relying parties.
    Profile-flip guidance + nginx/Apache/HAProxy/Envoy stapling
    snippets + explicit scope statement on what the rate limiter
    alone does NOT solve.

Audit deliverables:
  cowork/comprehensive-audit-2026-04-25/audit-report.md: score
    31/55 -> 38/55 closed (Medium 13/27 -> 20/27).
  cowork/comprehensive-audit-2026-04-25/findings.yaml: 7 status
    flips open -> closed with closure notes citing the Bundle C
    mechanism.
  certctl/CHANGELOG.md: Bundle C section under [unreleased].

Verification:
  go vet ./internal/service ./internal/scheduler ./internal/connector/issuer/acme
    ./internal/api/handler ./internal/domain ./cmd/server     clean
  go test -count=1 -short on the same packages              all green
  helm template + helm lint                                 clean
  internal/repository/postgres setup-fail                   sandbox disk
    pressure (same on master HEAD before this branch)
2026-04-27 00:08:25 +00:00
shankar0123 b3cc7cbdb2 fix(policies): close the D-006 loop — TitleCase seed canonicals + severity-aware, config-consuming rule engine (D-008)
D-008 was a three-part drift in the policy engine that made the
D-005/D-006 remediation cosmetic below the DB layer:

  (a) migrations/seed.sql INSERTed rules with pre-D-005 lowercase
      types ('ownership', 'environment', 'lifetime', 'renewal_window')
      that the handler validator rejects on Create/Update but that
      raw SQL INSERTs bypassed entirely. At runtime evaluateRule's
      switch fell through to the default "unknown policy rule type"
      error branch on every demo rule × every cert × every cycle,
      flooding logs while emitting zero violations.

  (b) migrations/seed_demo.sql persisted lowercase severity values
      ('critical', 'error', 'warning') on policy_violations rows.
      INSERT succeeded because that column had no CHECK, but any
      frontend comparing against the canonical PolicySeverity enum
      mis-categorized every seeded violation.

  (c) evaluateRule hardcoded Severity: PolicySeverityWarning on
      every emitted violation and ignored rule.Config entirely —
      so the D-006 per-rule severity column (000013) and every
      per-arm Config JSON ({allowed_issuer_ids, allowed_domains,
      required_keys, allowed, lead_time_days, max_days}) was dead
      data below the evaluation layer.

This commit lands (a)+(b)+(c) atomically. Shipping any subset
leaves the feature half-working.

## Changes

Domain (internal/domain/policy.go):
  * Add PolicyTypeCertificateLifetime as the 6th TitleCase canonical.
    Pre-D-008 the seeded "max-certificate-lifetime" rule had no engine
    arm — routing it through RenewalLeadTime would conflate "how
    close to expiry before we renew" with "how long can the cert
    possibly be", two distinct semantics. The new type accepts
    config {"max_days": int} and flags certs whose
    NotAfter - NotBefore exceeds the cap.

Handler validator (internal/api/handler/validation.go):
  * ValidatePolicyType allowlist grown to 6 canonicals
    (AllowedIssuers, AllowedDomains, RequiredMetadata,
    AllowedEnvironments, RenewalLeadTime, CertificateLifetime).

OpenAPI (api/openapi.yaml):
  * PolicyType enum grown to match domain.

Frontend (web/src/api/types.ts, types.test.ts):
  * POLICY_TYPES tuple gains CertificateLifetime; pin test asserts
    all 6 canonicals and rejects casing drift.

Migration 000014 (policy_violations severity CHECK):
  * Named CHECK constraint (policy_violations_severity_check)
    mirroring 000013's allowlist, defense-in-depth at the DB layer
    against future drift from bypassed writes (migrations, psql
    sessions, future callers). Symmetric down migration drops by
    name.

Seed data:
  * migrations/seed.sql rewritten to emit TitleCase canonicals with
    per-arm config JSON that actually exercises the config-consuming
    paths (not the missing-field backstops):
      - pr-require-owner         → RequiredMetadata     {"required_keys":["owner"]}                        Warning
      - pr-allowed-environments  → AllowedEnvironments  {"allowed":["production","staging","development"]} Error
      - pr-max-certificate-lifetime → CertificateLifetime {"max_days":90}                                   Critical
      - pr-min-renewal-window    → RenewalLeadTime      {"lead_time_days":14}                              Warning
    Severities are now differentiated per rule (D-006 intent).
  * migrations/seed_demo.sql violation rows flipped to TitleCase
    severity ('Critical', 'Error', 'Warning') so migration 000014
    applies cleanly on upgrade paths.

Engine rewrite (internal/service/policy.go):
  * evaluateRule rewritten. All six arms now:
      1. Parse rule.Config into the per-arm typed struct.
      2. Bad JSON → log at ValidateCertificate boundary and skip
         this rule (no co-located poisoning of other rules in the
         same batch).
      3. Empty/null Config → emit the pre-D-008 missing-field
         violation (backwards compat invariant — operators who
         haven't reconfigured still see the same output).
      4. Violations emitted carry rule.Severity (no more hardcoded
         Warning); D-006 column is now load-bearing.
  * CertificateLifetime arm reads NotBefore/NotAfter from the
    certificate's latest version via CertRepo. Injected via
    PolicyService.SetCertRepo() setter — avoids churning ~36
    NewPolicyService call sites while keeping the lifetime arm
    optional (degrades to a log+skip if the setter is not wired).

Server wiring (cmd/server/main.go):
  * policyService.SetCertRepo(certRepo) wired after construction.

Tests (internal/service/policy_test.go):
  * 25 new subtests across 5 groups:
      - TestEvaluateRule_SeverityPassThrough (6): every rule type
        emits violations carrying rule.Severity, not hardcoded.
      - TestEvaluateRule_ConfigConsumed (12): every per-arm Config
        path exercised positive + negative.
      - TestEvaluateRule_EmptyConfig_BackCompat (3): empty/null
        Config still emits pre-D-008 missing-field violations.
      - TestEvaluateRule_BadConfig_SkipsRule: malformed JSON logs
        and skips cleanly without poisoning neighbors.
      - TestEvaluateRule_CertificateLifetime_RepoScenarios (3):
        ok when repo wired, log+skip when not, handles missing
        NotBefore/NotAfter edges.

Provenance: D-008 surfaced during D-005/D-006 remediation review
in eef1db0. That commit added persistence and CI pins for the
severity field but did not re-verify the evaluation layer
consumed it; this finding and fix close the audit-process gap.
2026-04-18 14:55:56 +00:00