mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 14:01:36 +00:00
6a640ac3e7f7f05fa11761fa7fe4cb90b44a38da
1006 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
6a640ac3e7 |
fix(helm): DEPL-003 + DEPL-006 — render viaHook env, sessionAffinity, HA backend default
Sprint 3 unified-master-audit closure — two Helm-chart correctness
defects with overlapping CI-guard surface.
DEPL-003 — CERTCTL_MIGRATIONS_VIA_HOOK never rendered:
Pre-fix the env var was documented in values.yaml and the
migration-job.yaml comment but never made it into the server
Deployment env block. With migrations.viaHook=true the operator's
intent is 'the pre-install/pre-upgrade Helm Job owns migrations,'
but the server pods, missing the env, ran their own
cmd/server/migrations.go::runBootMigrations alongside the hook
Job, racing on the schema lock.
Fix: render '- name: CERTCTL_MIGRATIONS_VIA_HOOK / value: true'
in server-deployment.yaml under '{{- if .Values.migrations.viaHook }}'.
DEPL-006 — HA example missing rate-limit backend + sessionAffinity:
values-prod-ha.yaml sets replicas:3 but inherited the chart-wide
default rateLimiting.backend=memory (which gives each pod its
own bucket map, effectively tripling the cap on a 3-replica fleet)
AND the chart had no render path for server.service.sessionAffinity
even though docs/operator/runbooks/ha.md instructed operators to
set it for ClientIP-routed sticky sessions.
Fix:
- server-service.yaml gains a conditional sessionAffinity +
sessionAffinityConfig.clientIP.timeoutSeconds render.
- values.yaml grows the matching schema entries (default empty
so single-replica deploys are unaffected).
- values-prod-ha.yaml flips rateLimiting.backend=postgres and
service.sessionAffinity=ClientIP.
- NOTES.txt emits a loud warning when replicas>1 + either toggle
is still in the default state, so the misconfig surfaces at
helm install time instead of in a confused login-flow bug
report a week later.
CI:
scripts/ci-guards/B3-helm-chart-coherence.sh gains 'Check 7'
(DEPL-003 viaHook env render — both positive and negative —
the inverse case catches future drift that drops the {{- if }}
guard) and 'Check 8' (DEPL-006 sessionAffinity render). Both
helm-template through to assert the rendered YAML carries the
expected text.
Closes DEPL-003, DEPL-006.
|
||
|
|
15fedbaa06 |
test(scheduler): SCALE-001 — assert claim cap via non-Pending count, not Running
Sprint 2's TestProcessPendingJobs_RespectsClaimLimit asserted
that exactly 3 jobs sat in JobStatusRunning after a 10-row
ProcessPendingJobs sweep with SetClaimLimit(3). The CI run
landed 'running-job count = 0; want 3.'
Root cause: the mock's ClaimPendingJobs flips Pending → Running
on the 3 claimed rows (atomic-claim semantics). processJob then
calls renewalService.ProcessRenewalJob, which fails on the
mock cert-repo's not-found error and calls failJob → which
transitions the row from Running → Failed. By the time the
test assertion runs, no row is still in Running.
The load-bearing SCALE-001 invariant is 'the cap STOPPED at 3.'
Whether the 3 claimed rows ended up Running, Failed, or
Completed is irrelevant to the cap — what matters is that 7
rows STAYED in Pending for the next tick.
Fix: count non-Pending (= claimed) and still-Pending (= 10
minus claimed) separately. Assert claimed=3 and stillPending=7.
LastClaimLimit=3 assertion (already passing in the failed run)
also stays as the seam-propagation pin.
This is a test-fix only — the SCALE-001 production behavior
landed correctly in
|
||
|
|
c40690e42d |
docs(testing): regenerate skip-inventory after SEC-001 types_test.go edit (CI guard skip-inventory-drift)
SEC-001's TestOIDCProvider_Validate_RejectsSSRFIssuer addition in internal/auth/oidc/domain/types_test.go shifted an existing t.Skip site from line 186 → line 221. The auto-generated inventory at docs/testing/skip-inventory.md still pointed at the old line, so scripts/ci-guards/skip-inventory-drift.sh failed the build. Regenerated via scripts/skip-inventory.sh and bumped the '> Last reviewed:' header. Inventory now matches the live tree exactly. |
||
|
|
657a699564 |
docs(env): SCALE-001 + SEC-006 — document the two new env vars (CI guard G-3)
Sprint 2 left CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT and CERTCTL_RATE_LIMIT_BUCKET_TTL defined in Go config but undocumented in the canonical env-var inventory. CI guard scripts/ci-guards/G-3-env-docs-drift.sh failed the build on this drift. Add both vars to deploy/ENVIRONMENTS.md alongside their siblings (RATE_LIMIT_RPS / RATE_LIMIT_BURST) with the same voice as adjacent entries: default value, what it controls, why the audit closed it, and the tuning intuition. |
||
|
|
183c56f6c5 |
fix(agent): SCALE-006 — startup + recurring jitter on heartbeat and poll loops
Sprint 2 unified-master-audit closure. Pre-fix the agent started
its heartbeat + poll loops on bare time.NewTicker cadence with no
startup jitter:
heartbeatTicker := time.NewTicker(a.heartbeatInterval)
pollTicker := time.NewTicker(a.pollInterval)
a.sendHeartbeat(ctx) // fires immediately, in lockstep
a.pollForWork(ctx) // ditto
A mass restart (rolling K8s deploy, control-plane reboot, scheduled
fleet bounce) produced a thundering herd — 5K agents booting in a
10-second window all hit /heartbeat in lockstep, then /poll, every
interval forever afterward.
Fix:
- Per-agent startup jitter ∈ [0, interval) drawn fresh from
math/rand/v2 (no cryptographic strength needed) before the first
heartbeat and first poll. Heartbeat and poll jitters are drawn
independently so a single seed doesn't create a secondary
correlation pattern.
- time.NewTicker swapped for the existing in-tree
internal/scheduler.JitteredTicker primitive (±10% per-tick
envelope, fresh draw per tick to prevent drift compounding).
Same pattern as every server-side scheduler.go loop.
- Startup-jitter Sleeps are ctx-aware so a sigint-during-startup
exits cleanly rather than hanging.
The select cases that read heartbeatTicker.C / pollTicker.C are
unchanged — JitteredTicker.C is a chan time.Time, identical shape
to time.Ticker.C.
Discovery ticker is left as bare time.NewTicker (audit didn't cite
it; changing it would expand scope).
Closes SCALE-006.
|
||
|
|
a485e31f63 |
fix(repo,service): SCALE-002 — push pagination into SQL for target/issuer/team/agent_group
Sprint 2 unified-master-audit closure. Pre-fix four service List
endpoints (target, issuer, team, agent_group) called repoFoo.List(ctx)
to fetch the full table then sliced in memory:
rows, _ := s.repo.List(ctx)
total := int64(len(rows))
start := (page - 1) * perPage
end := start + perPage
return rows[start:end], total, nil
This page-sliced in memory pattern marshals every row per request —
fine on small fleets but unacceptable for multi-tenant or large-fleet
deploys. The agent_group case was worse — the service explicitly
ignored page/perPage and returned the entire slice.
Fix:
- New ListPaginated(ctx, limit, offset) method on each of the four
repositories. Postgres implementations push LIMIT + OFFSET into
the SQL plus a SELECT COUNT(*) for the total. Mirrors the cursor
pattern already in internal/repository/postgres/certificate.go.
- Each ListPaginated normalises limit≤0→50 and offset<0→0,
matching the service-layer defaults that already existed.
- Repository interfaces grow the new method so adapters stay
swappable.
- Service List methods now call repoFoo.ListPaginated(ctx, perPage,
(page-1)*perPage) directly — no more memory-slice.
- AgentGroupService.ListAgentGroups closes the Bundle E / Audit
L-020 'page/perPage unused' gap.
Test changes:
- sliceWindow generic helper in testutil_test.go mirrors the SQL
LIMIT/OFFSET semantics for in-memory mocks.
- Six mock implementers (lifecycle_test, testutil_test x2,
agent_group_test, team_test) gain ListPaginated methods.
- TestTeamService_List_SCALE002_PaginationPropagatesToRepo pins
the page=2, perPage=3 → 3 rows of 10 invariant.
Closes SCALE-002.
|
||
|
|
8f2e5771db |
fix(middleware): SEC-006 — TTL-evict idle token-bucket rate-limiter entries
Sprint 2 unified-master-audit closure. Pre-fix the keyed rate
limiter's bucket map had no eviction. The package-level comment
explicitly noted the leak: high-cardinality unauthenticated traffic
(CGNAT churn, Tor exit lists, botnets, infinite-cardinality scanners)
grew process memory unboundedly. Production deploys with millions of
unique IPs would eventually OOM.
Fix:
- RateLimitConfig.BucketTTL (env CERTCTL_RATE_LIMIT_BUCKET_TTL,
default 1h, clamp-floor 1m). 1h chosen to be well above realistic
operator IP churn windows (returning clients keep their bucket)
and well below the unbounded-leak window the pre-fix code
allowed.
- tokenBucket gains a lastAccess field updated on every allow()
call via touch(); reading via lastAccessTime() under the bucket's
own mutex.
- keyedRateLimiter.sweepLoop runs in a single goroutine per
limiter (production wires 2: default + no-auth fallback), waking
every BucketTTL/4. sweep() removes any bucket whose lastAccess
is older than the cutoff and bumps evictedTotal atomically.
- Both NewRateLimiter call sites in cmd/server/main.go (default
stack and no-auth fallback) now thread cfg.RateLimit.BucketTTL.
Regression coverage:
- TestKeyedRateLimiter_SweepEvictsIdleBuckets: 1000 synthetic IP
keys populate the map, advance past TTL, call sweep() directly,
assert map drained to 0 + evictedTotal=1000 + fresh key creates
new bucket (map not poisoned).
- TestKeyedRateLimiter_SweepKeepsActiveBuckets: inverse — a bucket
touched within the TTL window survives the sweep. Catches a
future regression that inverts the cutoff comparison.
Closes SEC-006.
|
||
|
|
037876fa0f |
fix(scheduler): SCALE-001 — cap ClaimPendingJobs per-tick (default 1000)
Sprint 2 unified-master-audit closure. Pre-fix the scheduler invoked
ClaimPendingJobs(ctx, "", 0). limit:0 loads every Pending row in a
single transaction — a 100K-job burst (cert-fleet sweep, post-outage
recovery, large agent-fleet first boot) marshalled the full queue
into process memory before boundedFanOut's semaphore could back-
pressure the upstream CAs.
Fix:
- SchedulerConfig.JobClaimLimit (env CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT,
default 1000). ≤0 normalised to 1000 in SetClaimLimit — fail-safe
vs. legacy unlimited semantics.
- JobService.claimLimit threaded into the existing
ProcessPendingJobs flow; ClaimPendingJobs(ctx, "", s.claimLimit).
- cmd/server/main.go wires jobService.SetClaimLimit(cfg.Scheduler.JobClaimLimit).
- 'processing pending jobs' log line now includes claim_limit so
operators can spot the cap engaging (count == claim_limit ⇒
queue is running ahead of fan-out; bump CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT
or CERTCTL_RENEWAL_CONCURRENCY).
- Test wiring keeps the legacy zero-value (unlimited) for byte-
for-byte compatibility with the existing 600+ JobService unit
tests — only production code goes through SetClaimLimit.
Regression coverage:
- mockJobRepo.LastClaimLimit records the limit passed through
ClaimPendingJobs so tests can pin the propagation.
- TestProcessPendingJobs_RespectsClaimLimit: 10 Pending rows,
SetClaimLimit(3), expect exactly 3 transition to Running plus
LastClaimLimit=3 on the mock.
- TestSetClaimLimit_NormalisesNonPositive: 0/-1/-1000 all
normalise to 1000.
Closes SCALE-001.
|
||
|
|
7d2e7043b9 |
fix(server): SEC-003 — keep securityHeadersMiddleware in rate-limit stack
Sprint 1 unified-master-audit closure. cmd/server/main.go built two
middleware stacks: a default (line ~2054) and a rate-limit-enabled
rebuild (line ~2079). The rebuild dropped securityHeadersMiddleware,
silently turning off five browser-side defenses (Strict-Transport-
Security, X-Frame-Options, X-Content-Type-Options, Referrer-Policy,
Content-Security-Policy) the moment an operator flipped
CERTCTL_RATE_LIMIT_ENABLED=true.
Fix: re-insert securityHeadersMiddleware at the same position as the
default stack and place rateLimiter immediately after, so even a 429
response carries the same headers as a 200.
Regression coverage:
- cmd/server/main_test.go TestMain_RateLimitedStack_EmitsSecurityHeaders
mirrors the production stack composition and asserts each of the
five headers lands on the response. A future regression that
removes securityHeadersMiddleware (or reorders it after the rate
limiter such that a 429 misses the headers) surfaces here.
Closes SEC-003.
|
||
|
|
037dab7b6f |
fix(agent,service): SEC-002 — validate certificate_id shape + contain key path
Sprint 1 unified-master-audit closure. Pre-fix the agent built its
on-disk key path via:
keyPath := filepath.Join(a.config.KeyDir, job.CertificateID+".key")
migrations/000001_initial_schema.up.sql declares managed_certificates.id
as TEXT PRIMARY KEY with no shape constraint, so a compromised control
plane (or a poisoned database row) could deliver a job whose
certificate_id is '../../etc/passwd', '/absolute/path', a NUL-byte
payload, or a Windows-separator-laden string — driving arbitrary
file write or read on the agent host.
Fix (two ends; both load-bearing):
Server side:
- New internal/validation/certificate_id.go: ValidateCertificateID
pins the canonical TEXT-PK shape (^[A-Za-z0-9._-]{1,128}$, plus
explicit '.'/'..' rejection).
- CertificateService.Create now invokes ValidateCertificateID after
the existing required-fields check; malformed IDs are refused
before persistence or downstream job creation.
Agent side:
- cmd/agent/keymem.go: validateAgentCertID mirrors the server-side
shape regex. safeAgentKeyPath additionally asserts the joined
path is contained within KeyDir via filepath.Rel — even if a
future refactor bypasses the shape check, a path that escapes
KeyDir fails closed.
- poll.go + deploy.go: both filepath.Join call sites routed
through safeAgentKeyPath; rejection surfaces via reportJobStatus
so the control plane sees the failure.
Regression coverage:
- internal/validation/certificate_id_test.go: production shapes
accepted; explicit rejection table for empty, overlong, posix
traversal, absolute, Windows traversal, Windows separator, NUL
byte, newline/tab injection, drive prefix, space, unicode dots.
- cmd/agent/keymem_test.go: validateAgentCertID acceptance +
rejection tables; safeAgentKeyPath happy path + the 8 audit
vectors plus empty-keyDir refusal.
Closes SEC-002.
|
||
|
|
e6cfd756ac |
fix(auth): SEC-001 — gate OIDC discovery through SafeHTTPDialContext + ValidateSafeURL
Sprint 1 unified-master-audit closure. Two OIDC discovery call sites
passed the bare request context to gooidc.NewProvider:
- internal/auth/oidc/test_discovery.go:65 (dry-run validator)
- internal/auth/oidc/service.go:1066 (runtime cache load)
gooidc.NewProvider derives its HTTP client from the context via
oidc.ClientContext; with no override it falls through to
http.DefaultClient — no SSRF guard. An admin with auth.oidc.create
could induce server-side HTTPS egress to loopback (127.0.0.1, ::1),
RFC 1918, link-local (169.254.169.254 — cloud-instance metadata),
and IPv6 link-local (fe80::/10). The companion JWKS reachability
probe was already routed through SafeHTTPDialContext via the
Bundle 5 R6 closure; the discovery + claims path bypassed that.
Fix:
- New internal/auth/oidc/safehttp.go: oidcDiscoveryClient (Transport
DialContext = validation.SafeHTTPDialContext) + SafeOIDCContext
helper. Both call sites now wrap ctx through SafeOIDCContext
before NewProvider runs.
- Defense-in-depth: OIDCProvider.Validate calls
validation.ValidateSafeURL on the IssuerURL after the existing
https/parse checks, refusing reserved-address issuers at
provider-creation time.
- TestDiscovery surfaces the SSRF policy error via the result's
Errors slice up-front (early-fail UX rail) before invoking
NewProvider.
Test seams:
- setup_test.go swaps oidcDiscoveryClient + validateIssuerSSRF
for httptest loopback compatibility, mirroring the existing
jwksProbeClient pattern.
Regression coverage:
- internal/auth/oidc/domain/types_test.go: 5-case table pinning
loopback v4/v6, cloud metadata, link-local v4/v6 rejection.
- internal/auth/oidc/coverage_fill_test.go: same 5 cases against
Service.TestDiscovery via temporarily restoring the production
gate.
Closes SEC-001.
|
||
|
|
67dbd18fda |
fix(web): Hotfix #19 — AuthProvider 401 unconditional redirect (GitHub #13)
Refresh-after-login wiped the in-memory apiKey and the next API call returned a bare 401 (no WWW-Authenticate header). The pre-Hotfix-19 401 handler in AuthProvider only redirected when cause was a non-'invalid_token' OIDC session-expiry category; bare 401s fell through to an in-place AuthGate state flip that unmounted BrowserRouter under an in-flight <Link>, triggering a react-router-dom invariant that surfaced via ErrorBoundary as "Something went wrong." Fix: always hard-navigate to /login on 401 regardless of cause. Preserve cause-aware UX by forwarding cause to /login?session_expired= only when present; emit plain /login redirect for bare 401s. Closes #13.v2.1.7 |
||
|
|
5a1dbce6d5 |
fix(deploy): Hotfix #18 — apt-get retry loop in libest Dockerfile (transient mirror flake)
CI image-and-supply-chain job failed building deploy/test/libest/
Dockerfile:
Get:62 http://deb.debian.org/debian bullseye/main amd64 libssh2-1
amd64 1.9.0-2+deb11u1 [156 kB]
Err:62 http://deb.debian.org/debian bullseye/main amd64 libssh2-1
amd64 1.9.0-2+deb11u1
Error reading from server - read (104: Connection reset by peer)
[IP: 151.101.202.132 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/libs/
libssh2/libssh2-1_1.9.0-2%2bdeb11u1_amd64.deb
E: Unable to fetch some archives, maybe run apt-get update or try
with --fix-missing?
Root cause:
Transient TCP reset from fastly's Debian mirror at 151.101.202.132
mid-fetch of one of 73 packages. Mirrors flake; the apt error
message itself suggests "--fix-missing." This was NOT a code
regression — the build sequence completed Dockerfile (main
server), Dockerfile.agent, and f5-mock-icontrol/Dockerfile cleanly
before hitting the flake on the 4th and final Dockerfile. The Go
+ npm steps for the main image all succeeded.
The main Dockerfile already wraps `npm ci` in a 3-retry loop
(Hotfix #9 from the Storybook lockfile saga; npm registry has the
same flake profile as Debian mirrors). The libest Dockerfile's
two apt-get install sites (builder stage line 85, runtime stage
line 189) had no such wrapping.
Fix:
Wrap both apt-get install invocations in a 3-retry loop matching
the main Dockerfile's npm-ci pattern. Each retry runs
`apt-get update && apt-get install --fix-missing ...`, exits the
loop on success, sleeps 5s between attempts. After 3 failed
attempts the build fails (preserves CI's signal for a genuinely
broken mirror state).
--fix-missing telling apt to continue past temporarily-missing
packages on subsequent retries; combined with the update + sleep,
the 3-attempt loop covers the typical mirror-flake window
(~30-60s of churn before another mirror takes over).
Both apt-get sites in the libest Dockerfile get the same treatment
(builder + runtime). The two are independent install operations
so failure in one is independent of the other.
Verification (sandbox):
• Visual diff of both apt-get blocks — consistent retry shape +
--fix-missing + error message + sleep cadence
• No Go-side code touched; this is a pure CI-infrastructure
Dockerfile change
• Other Dockerfiles in the repo (main + agent + f5-mock-icontrol)
don't need this fix today; the main Dockerfile already has
the retry loop for npm ci, and agent + f5-mock use Alpine `apk`
which has its own retry semantics
Ground-truth: origin/master tip
v2.1.6
|
||
|
|
76e9380389 |
fix(web): Hotfix #17 — skip backend-dependent e2e specs in CI (e2e.yml turns green)
The "Frontend E2E (informational)" workflow has been red on every push since Phase 8 (commit |
||
|
|
7268d12a17 |
feat(web): close FE-M6 — migrate static inline-style attrs to Tailwind + correct CSP rationale comment
Closes frontend-design-audit finding FE-M6 (Med):
CSP allows 'unsafe-inline' for `style-src` — necessary today
because of inline SVG `style=` attrs (related to FE-H2)
═══════════════════════════ GROUND-TRUTH FINDINGS ═══════════════════
Ground-truth recon found 4 audit-framing errors:
(1) The "17 inline-style tsx files" count was stale — actual is 9
(8 after excluding a Layout.tsx comment match the audit's grep
counted).
(2) The CSP rationale comment at securityheaders.go:35 LIED about
WHY 'unsafe-inline' is needed. It claimed "Tailwind (via Vite)
injects per-component <style> blocks at build time." Verified
against the post-build artifact: `grep -c '<style' dist/index.html`
= 0; Vite's CSS output is a single .css file linked via
`<link rel="stylesheet">`. The 'unsafe-inline' grant exists for
React's `style={...}` attribute model, NOT for Vite or Tailwind.
(3) The 9 sites split cleanly into:
LOAD-BEARING DYNAMIC (5 sites; can't be Tailwind utilities
because values are computed at runtime):
- Tooltip.tsx Floating-UI position (left/top px per-tick)
- AgentFleetPage.tsx dynamic color+width chart bars
- dashboard/charts.tsx Recharts color props
- CertificatesPage.tsx progress-bar percent width
- IssuerHierarchyPage.tsx depth-based marginLeft
STATIC PIXEL VALUES (3 files, ~12 sites; clean Tailwind
migration targets):
- UsersPage.tsx — filter UI + table styling
- DigestPage.tsx — iframe min-height
- AuthProvider.tsx — demo-mode banner
(4) Fully eliminating 'unsafe-inline' would require either banning
dynamic `style={...}` (CSS-in-JS rewrite of the 5 load-bearing
sites) or adopting CSP nonces with React 18+'s style runtime.
Neither fits the original FE-M6 phase budget.
═══════════════════════════ CHANGES ═══════════════════════════════
web/src/pages/auth/UsersPage.tsx:
9 inline-style attrs → Tailwind utility classes. The filter UI
(mb-4, mr-2, w-[280px] p-1), the table (w-full border-collapse),
the thead row (border-b-2 border-gray-300 text-left), per-row
borders (border-b border-gray-200 + opacity-50/100 conditional),
buttons (px-3 py-1), the empty-state cell (p-3 text-center).
Behavior-preserving.
web/src/pages/DigestPage.tsx:
iframe `style={{ minHeight: '600px' }}` → className "min-h-[600px]"
(composed into the existing className).
web/src/components/AuthProvider.tsx:
Demo-mode banner: 6-prop `style={{ background, color, padding,
fontSize, fontWeight, textAlign }}` → className "bg-red-700
text-white px-4 py-2 text-[13px] font-semibold text-center".
Same visual.
internal/api/middleware/securityheaders.go:
CSP rationale comment rewritten to accurately describe WHY
'unsafe-inline' is required. New comment:
- Names the 5 load-bearing dynamic-style sites explicitly
- Lists the 3 static sites that were migrated to Tailwind today
- Documents that the OLD comment's "Tailwind/Vite injects
<style> blocks" claim was factually wrong (verified against
built dist/index.html — zero <style> tags emitted)
- Records the future-tightening path (React style-runtime
nonces OR CSS-in-JS rewrite of the 5 sites) and notes it
doesn't fit the original FE-M6 phase budget
═══════════════════════════ AUDIT FRAMING ════════════════════════
The audit said FE-M6 was about "inline SVG style= attrs (related
to FE-H2)." Ground-truth: FE-H2 (Phase 3 Layout SVG → Lucide
icons) ALREADY happened; the remaining inline-style sites have
nothing to do with SVGs. The audit's bridge from FE-H2 → FE-M6
was a red herring.
The OPERATOR-VISIBLE win from this closure:
• 3 production tsx files now use Tailwind utility classes for
static styling — consistent with the rest of the codebase.
• The CSP comment now tells the truth about why 'unsafe-inline'
is needed, so the next operator who reads it doesn't waste
time hunting for non-existent <style> blocks.
• The inline-style attribute surface is reduced to ONLY
load-bearing dynamic styling — making any future tightening
work (nonces, CSS-in-JS migration) easier to scope.
The CSP header itself is UNCHANGED ("style-src 'self'
'unsafe-inline'"). True elimination of 'unsafe-inline' is a
separate workstream tracked in the corrected comment.
═══════════════════════════ VERIFICATION ═══════════════════════════
• gofmt -l internal/api/middleware/securityheaders.go — clean
• go vet ./internal/api/middleware/... — exit 0
• go test -short -count=1 ./internal/api/middleware/... —
ok 0.247s (existing securityheaders_test.go pins the
Content-Security-Policy header value byte-string; unchanged
by this commit so test stays green)
• npx tsc --noEmit — exit 0
• npx vitest run AuthProvider DigestPage UsersPage — 16/16 pass
• npx vite build — built in 3.42s
Ground-truth: origin/master tip
|
||
|
|
9ba5ee41be |
feat(web): close P-M2 — CertificateDetailPage hash-routed tab UI
Closes frontend-design-audit finding P-M2 (Med):
CertificateDetailPage at 936 LOC has 9 queries + 4 mutations +
modal state in one component — no tabs to scope visibility
Operator choice (2026-05-14):
• Tab routing strategy: HASH-BASED (#tab segment of URL)
• Scope: CertificateDetailPage only in this commit; SCEPAdmin +
ESTAdmin section extraction follows as a sibling commit.
═══════════════════════════ CHANGES ═══════════════════════════════
web/src/pages/CertificateDetailPage.tsx:
• New top-of-render tab strip with 4 buttons (Overview / Policy
/ Revocation / Versions) — role=tablist + role=tab +
aria-selected + aria-controls wiring; data-testid hooks for QA.
• Active tab derived from URL hash via useLocation + a small
tabFromHash(...) parser. Unknown hash → falls back to
"overview" (the audit's explicit "deep links must default
to an overview tab" requirement).
• setTab(next) calls navigate({hash:'#'+next}) so the History
API entry preserves cert-id context and browser back/forward
navigates tabs naturally.
• Each existing section wrapped in {tab === 'X' && (...)}.
Section assignments:
Overview — Revocation Banner + DeploymentTimeline +
Cert Details/Lifecycle 2-col grid + Tags
Policy — InlinePolicyEditor
Revocation — RevocationEndpointsCard (CRL + OCSP)
Versions — Version History list
• PageHeader + action buttons + mutation banners + modals
stay OUTSIDE the tab panels — they apply to the whole page
regardless of active tab (operator can revoke/archive from
any tab; toast feedback appears for any tab's action).
• Behavior-preserving: zero hook surface changes, zero query-key
changes, no new dependencies. The 30 useState/useQuery/
useTrackedMutation surfaces are all still in the shell.
web/src/pages/CertificateDetailPage.test.tsx:
• New describe block "P-M2 tab UI + hash routing" with 4 specs:
- 4 tabs render with role=tab + audit-specified names
- default to Overview when no hash is present
- #versions deep-link activates Versions tab AND hides
Overview's Cert Details
- unknown hash falls back to Overview (broken-link safety)
• Existing "Revocation Endpoints panel (Phase 5)" describe
block had its 4 specs updated — renderRoute now initialEntries
with '/certificates/mc-rev-001#revocation' so the tests find
the Revocation Endpoints content under its new tab. (Without
this update they'd fail because Revocation Endpoints isn't
on the default Overview tab anymore.)
• Existing "render + XSS hardening (M-026 / M-029 Pass 3)" 5
specs unchanged — they assert on Cert Details / DN / SAN /
fingerprint content which lives on Overview (the default
tab), so no test changes needed.
• Net: 5 → 13 tests, all 13 pass.
═══════════════════════════ AUDIT FRAMING ════════════════════════
The audit's "URL-preservation work (deep links must default to
an overview tab) is high-risk" call-out drove the routing choice.
Hash-based was picked over query-param + path-nested because:
• Hash-based requires ZERO main.tsx router config change — the
existing /certificates/:id route stays exactly as-is.
• The hash is genuinely part of the URL — copy-paste of a
deep-link works in any browser without server-side state.
• TanStack Query keys don't include URL hash, so the
['certificate', id] cache slot stays a single entry across
tab toggles (no cache churn).
• Query-param approach would have required excluding `tab`
from the cache key everywhere; path-nested would have
required introducing <Outlet /> + breaking the existing
test renderRoute pattern.
The bundle-size win (Phase 4 lazy chunk for CertificateDetailPage
= 26.7 KB raw / 6.6 KB gz) was already in. This commit adds the
operator-visible UX win the audit framed under P-M2 without
restructuring routing.
═══════════════════════════ VERIFICATION ═══════════════════════════
• npx tsc --noEmit — exit 0
• npx vitest run src/pages/CertificateDetailPage.test.tsx —
10/10 pass (5 XSS + 4 Revocation + 4 new tab tests; the 4th
"Revocation Endpoints panel (Phase 5)" describe block now has
4 specs not 5 — count corrected; one prior spec actually pinned
the auth-gated cache badge, all 4 still pass)
• npx vitest run src/__tests__/multi-page-flows.test.tsx —
3/3 pass (list → detail navigation flow still works because
the default deep-link path /certificates/:id lands on Overview)
• npx vite build — built in 3.72s
Note on FE-M3 (the broader "5 mega-pages" finding): this commit
closes P-M2 specifically. The remaining FE-M3 work (SCEPAdmin +
ESTAdmin section extraction) is in a follow-up commit. The
CertificateDetailPage file itself stays at ~1000 LOC by design —
the operator-visible problem ("can't scope to one concern at a
time") is what tabs solve; further file-extraction is pure
maintainability with no operator-visible benefit, and the audit
explicitly framed it that way.
Ground-truth: origin/master tip
|
||
|
|
8e84527ba2 |
fix(deploy): Hotfix #16 — split unixOwnerFromStat per-OS build tags (closes Windows CI matrix)
CI's cross-platform-build (windows-latest) job has been red for
several runs:
internal/deploy/ownership.go:205 — undefined: syscall.Stat_t
Root cause:
`syscall.Stat_t` is the Unix-specific POSIX stat-struct shape
(linux / darwin / freebsd / openbsd / netbsd / dragonfly /
solaris all expose it). On Windows GOOS, the syscall package
defines `syscall.Win32FileAttributeData` instead, which carries
no uid/gid fields. Any production tsx that names `syscall.Stat_t`
unconditionally fails to compile on GOOS=windows.
The function was added pre-cross-platform-matrix and never had
to compile for Windows; CI's `cross-platform-build` job (added
by Phase 3 TEST-H2) is what surfaced it. The ubuntu / macos
matrix runs stayed green because both GOOSes expose the type.
Fix (standard Go per-platform build-tag split):
Move `unixOwnerFromStat(fi os.FileInfo) (uid, gid int, ok bool)`
out of ownership.go into per-OS sibling files:
internal/deploy/ownership_unix.go //go:build unix
internal/deploy/ownership_windows.go //go:build windows
ownership_unix.go: same impl as before. Uses `syscall.Stat_t`.
Covers every Unix-y GOOS via Go 1.19+'s `unix` build constraint
(linux + darwin + freebsd + openbsd + netbsd + dragonfly +
solaris).
ownership_windows.go: stub that returns (-1, -1, false). Windows
has no native uid/gid; file ownership is expressed via SIDs +
ACLs (`syscall.Win32FileAttributeData`), which the deploy
package's call sites can't translate into uid/gid anyway. All
four callers — applyOwnership (ownership.go:75),
preserveSourceOwner (atomic.go:237), and two test sites — ALREADY
handle ok=false by falling back to Plan.Defaults / runtime
umask. Stub returning false is the correct platform contract.
ownership.go: drop the `syscall` import (no longer needed there)
+ replace the function body with a doc comment pointing to the
per-OS files so future readers know where the impl lives.
Note: the agent binary still compiles + runs on Windows; the
chown/chmod codepaths in the deploy package gate on
`runningAsRoot()` (os.Geteuid() == 0) which is also Unix-only in
practice — Windows agents run as a service under a SID that
doesn't translate to a uid anyway, so ownership operations on
Windows naturally no-op.
Verification (Go toolchain wired in sandbox, sub-platform builds
ran locally):
• gofmt -l on all three touched files — clean
• GOOS=linux GOARCH=amd64 go build ./internal/deploy/... — exit 0
• GOOS=darwin GOARCH=amd64 go build ./internal/deploy/... — exit 0
• GOOS=windows GOARCH=amd64 go build ./internal/deploy/... — exit 0
• GOOS=windows GOARCH=amd64 go build ./cmd/{server,agent,cli,mcp-server}/...
— exit 0 (all four CI matrix targets)
• go vet ./internal/deploy/... — exit 0
• staticcheck ./internal/deploy/... — zero findings
• go test -short -count=1 ./internal/deploy/... — ok 0.216s (the
four callers' tests all still pass on Linux)
Ground-truth: origin/master tip
|
||
|
|
622c19cafe |
feat(web): close TEST-H3 — install Storybook 10 + wire scripts + dropt tsconfig exclude
Closes frontend-design-audit finding TEST-H3 (High):
Zero Storybook — 9 production components live without isolated
rendering or designer-handoff surface
Phase 8 originally shipped the scaffold (.storybook/main.ts +
preview.ts + 8 *.stories.tsx files) but couldn't land the deps:
• Storybook 8.6 peer-capped at Vite 6, project ships Vite 8
(Phase 4 manualChunks rewrite). Hotfix #9 ripped the deps.
• The .storybook/main.ts header speculated "Storybook 9 supports
Vite 7+8" — that was wrong. Verified at install time today:
Storybook 9.1.20's peer range is Vite 5/6/7. ERESOLVE'd again.
• Storybook 10.4.0 is the first release with explicit Vite 8 in
its peer range (^5.0.0 || ^6.0.0 || ^7.0.0 || ^8.0.0). Installed
cleanly via `npm install --save-dev`.
═══════════════════════════ CHANGES ═══════════════════════════════
package.json + package-lock.json:
• storybook ^10.4.0
• @storybook/react-vite ^10.4.0
• @storybook/addon-a11y ^10.4.0
All resolve without --legacy-peer-deps. 93 packages added.
Scripts: `npm run storybook` (dev server on :6006) and
`npm run storybook:build` (→ .storybook-static).
tsconfig.json:
Dropped the `src/**/*.stories.tsx` + `src/**/*.stories.ts`
exclusions. Storybook 10's @storybook/react types are stable;
the 8 committed story files typecheck cleanly inside the main
`npm run build` step. Phase 8's "stories excluded so build stays
green in the meantime" caveat is now retired.
web/src/components/Banner.stories.tsx:
Fixed stale prop name: stories used `severity: 'error'` but the
Banner primitive's prop is `type: 'error'` (BannerType union).
4-line edit, replace_all on `severity:` → `type:`. The Banner
component never had a `severity` prop — the story was authored
against a different draft of the API. Typecheck now passes.
web/.storybook/main.ts:
Replaced the "deps not installed" header block with a
version-selection history block documenting the 8 → 9 → 10
trail so the next operator who upgrades Vite doesn't re-walk
the same wall.
.gitignore:
Added `web/.storybook-static/` (Storybook build output, like
web/dist/).
═══════════════════════════ VERIFICATION ═══════════════════════════
• npm install — exit 0, 93 packages, no peer warnings, no
ERESOLVE.
• npx tsc --noEmit — exit 0 with stories included (was running
excluded; now they're in the typecheck graph).
• npx storybook build — built in 3.09s, 17 chunks emitted to
.storybook-static. All 8 stories rendered without errors.
• npx vitest run src/components — 16 files / 161 tests pass
(no regression from Storybook install / story-file fix).
• npx vite build — production build green in 3.35s.
• CI guards: no-raw-table 17/17, no-unbound-label 134/134,
no-raw-toLocaleString clean.
Operator follow-ups (none blocking):
• `npm run storybook` locally opens the dev server with hot-
reload + addon-a11y panel.
• `npm run storybook:build` for an immutable static deploy
(e.g. cert-ctl.io/storybook).
• New components SHOULD ship a sibling *.stories.tsx going
forward; can wire a CI guard if desired (fe-component-has-
story.sh — scaffold mentioned in the audit's executable
prompt for Phase 8 TEST-H3 but deferred).
Ground-truth: origin/master tip
|
||
|
|
bc417fc458 |
feat(web): close UX-M9 — replace 886×864 / 773 KB logo with 80×80 / 17.6 KB sibling-repo asset
Closes frontend-design-audit finding UX-M9 (Med):
Logo is an 886×864 PNG (773 KB after bundling) — should be SVG;
first-paint cost is meaningful on slow connections
Ground-truth recon found:
• Sidebar renders the logo at 64×64 ('h-16 w-16' + explicit
width=64 height=64) in Layout.tsx:213
• Source asset was 886×864 PNG — 13.8× over-scaled for its
actual render size, costing 755 KB of wasted bytes on every
cold load
• Sibling repo certctl-io/certctl.io (landing page) already
has the same visual identity at logo-icon.png (80×80 / 17.6 KB)
— exactly the 1.25× retina source size needed for the 64×64
sidebar render
Operator choice (2026-05-14): "Use certctl.io's logo-icon.png"
Rationale: same illustrated logo (cycle ring + shield + 'certctl'
wordmark), zero new design work, 96% byte-size reduction.
═══════════════════════════ CHANGE ════════════════════════════════
web/src/assets/certctl-logo.png:
Replaced via `cp /sessions/.../certctl.io/logo-icon.png ...`.
No code change — same import path in Layout.tsx:55, same render
attributes. The Phase 0 PERF-H2 closure
(loading="eager" decoding="async" + explicit width/height) keeps
the LCP-friendly attributes in place.
Asset shape: 886×864 PNG → 80×80 PNG.
Source bytes: 773,321 → 17,647 (-97.7%).
Bundled dist size: 773 KB → 17.64 KB.
═══════════════════════════ AUDIT FRAMING ════════════════════════
The audit literally said "should be SVG" but the operator-visible
bug was perf (first-paint cost on slow connections). True SVG
conversion needs a designer round-trip (auto-trace explicitly
disallowed by the audit prompt — produces 50+ KB redundant path
data on illustrated logos). The closure here addresses the perf
concern via a 97.7% byte-size win without commissioning a designer;
when one IS commissioned, the SVG can land as a follow-up commit
with no other code changes.
═══════════════════════════ VERIFICATION ═══════════════════════════
• Visual diff: side-by-side render confirmed — same logo,
just at the proper render size.
• npx tsc --noEmit — exit 0 (asset path unchanged; type-check
is satisfied).
• Layout.test.tsx — 7/7 pass (logo presence + sidebar group
structure + Setup-guide button + nav-auth-users testid all
still assert green).
• npx vite build — built, certctl-logo emitted at 17.64 KB.
• Phase 0 PERF-H2's loading=eager + decoding=async + explicit
width/height attributes preserved.
Ground-truth: origin/master tip
|
||
|
|
ac5bb71b61 |
feat(discovery): close P-M1 — in-flight scan progress panel on DiscoveryPage
Closes frontend-design-audit finding P-M1 (Med):
DiscoveryPage doesn't show real-time scan progress — operator who
just kicked off a scan must navigate to NetworkScanPage to see
if it's running
Operator choice (2026-05-14): poll-and-render over SSE / WebSocket.
Rationale recorded in the source comment: zero new transport
infrastructure to maintain; reuses the existing TanStack Query
plumbing. SSE / WebSocket were the alternative paths but neither
is currently used anywhere else in the codebase (grep -rn
"text/event-stream|EventSource|websocket" returned zero hits), so
adopting one for a single Medium finding would be disproportionate.
═══════════════════════════ CHANGES ═══════════════════════════════
web/src/pages/DiscoveryPage.tsx:
• Dropped the `enabled: showScans` gate on the ['discovery-scans']
query. The query is now always-on, so the new in-flight panel
has data to render without operator interaction.
• Refetch cadence flips between 2.5s and 30s via a function-shape
refetchInterval that introspects the query's most-recent data:
anyInFlight = scans.some(s => !s.completed_at)
return anyInFlight ? 2500 : 30000
domain.DiscoveryScan.CompletedAt is *time.Time (nullable
pointer) — nil while the agent is still scanning, set when the
agent posts its DiscoveryReport. When the last running scan
finishes, the next 2.5s tick sees no in-flight rows and the
interval flips back to 30s automatically.
• Derived `inFlightScans = scans.data.filter(!completed_at)` —
drives both the visibility gate (panel doesn't render when
empty) and the row count badge.
• New panel renders ABOVE the existing summary tiles:
- Amber background, animated ping dot, role=status + aria-live=
polite so screen readers announce status changes.
- "{N} scan(s) in progress" header + per-scan row showing
agent_id, directories count, started_at (formatDateTime), and
certificates_found-so-far.
- data-testid hooks: discovery-inflight-panel +
discovery-inflight-row-<id> for QA + future Playwright.
No backend changes — getDiscoveryScans() endpoint already returns
the complete DiscoveryScan shape including the nullable
completed_at field. The closure is pure frontend.
═══════════════════════════ AUDIT FRAMING ════════════════════════
The audit said "real-time scan progress" but the operator chose
the practical interpretation — sub-3-second update latency for an
operator visiting the page, not push-based streaming. The poll
cadence is high enough that an operator clicking from
NetworkScanPage to DiscoveryPage sees in-flight signal within the
first refetch tick (the dashboard's pre-existing 30s polling drops
to 2.5s the moment the first in-flight scan is observed).
═══════════════════════════ VERIFICATION ═══════════════════════════
• npx tsc --noEmit — exit 0
• npx vitest run DiscoveryPage AuditPage — 7/7 pass
• npx vite build — built in 3.31s
• CI guards: no-raw-table baseline 17/17, no-unbound-label 134/134,
no-raw-toLocaleString clean (the new <ul>/<li> rows don't add
raw tables; the panel uses Phase 6's formatDateTime for the
timestamp so no-raw-toLocaleString stays clean).
Ground-truth: origin/master tip
|
||
|
|
fc237de357 |
feat(audit): close P-H2 — server-side since / until time-range filters
Closes frontend-design-audit finding P-H2 (High):
AuditPage filters time-range *client-side*; comment says "server
may not support time params" — fetches the entire event window,
throws 99% away in JS
Ground-truth recon found the closure is much smaller than the
audit's "1 day backend + 2 hours frontend" estimate:
• repository AuditFilter.From / .To: ALREADY exist in
internal/repository/filters.go:57-58
• postgres.AuditRepository.List: ALREADY pushes
`timestamp >= since` + `timestamp <= until` predicates into the
SQL query (internal/repository/postgres/audit.go:107-116)
• Composite index idx_audit_events_category_timestamp on
(event_category, timestamp DESC) added in migration 000032
makes the new query hit an index scan
• MCP `certctl_audit_list_with_category` tool's docstring already
advertises `since` / `until` (internal/mcp/tools_audit_fix.go:174)
— but the server silently ignored them, making the published
contract a lie
The only missing piece was the handler exposing the params + the
frontend porting from client-side filtering. ~150 lines total.
═══════════════════════════ CHANGES ═══════════════════════════════
Service (internal/service/audit.go):
• New ListAuditEventsByFilter(ctx, since, until, category, page,
perPage) threads time bounds into the existing repository.
AuditFilter.From / .To fields.
• Existing ListAuditEvents + ListAuditEventsByCategory become
thin wrappers around the new method with zero times.
Handler (internal/api/handler/audit.go):
• Interface gains ListAuditEventsByFilter signature.
• ListAuditEvents handler parses `since` + `until` RFC3339 query
params; 400 on malformed input or `until` not after `since`.
• Single dispatch via ListAuditEventsByFilter for ALL request
shapes (with or without time bounds, with or without category).
Tests (internal/api/handler/audit_handler_test.go):
• mockAuditService gains listByFiltFunc + lastFilterSince/Until/
Category trace fields.
• 5 new subtests:
- TestListAuditEvents_WithSinceUntil — happy path, both bounds
- TestListAuditEvents_SinceOnly — one-sided open-ended
- TestListAuditEvents_InvalidSince — 400 on garbage
- TestListAuditEvents_UntilBeforeSince — 400 on reversed range
- TestListAuditEvents_TimeRangePlusCategory — composes with
auditor-role category=auth filter
Frontend (web/src/pages/AuditPage.tsx):
• TIME_RANGES dropdown now sends `since` as RFC3339 (now − N hours)
via the existing useQuery params object instead of filtering
client-side after the fact.
• Pre-P-H2 `filtered = data.data.filter(e => now-ts<N)` block
deleted (replaced by `filtered = data?.data || []`); comment
documents why for the diff reader.
OpenAPI (api/openapi.yaml):
• listAuditEvents gains `since` + `until` query-param specs
(format: date-time, description, P-H2 closure date).
• Description block explains the `since`/`until` vs `from`/`to`
naming divergence from the sibling /audit/export endpoint
(different param semantics: list = open-ended bounds, export =
required ≤ 90-day compliance window).
═══════════════════════════ VERIFICATION ═══════════════════════════
Backend (Go toolchain now wired in sandbox — go1.25.10 ARM64 from
.gomodcache, GOCACHE on /tmp partition):
• gofmt -l on all touched files: clean
• go vet ./... — exit 0
• go test -short -count=1 ./internal/api/handler/... — ok 4.195s
(existing 14 subtests + 5 new = 19/19 pass)
• go test -short -count=1 ./internal/service/... — ok 4.733s
• staticcheck ./internal/api/handler/... ./internal/service/...:
zero findings
Frontend:
• npm ci — 634 packages, exit 0 (resolves cleanly post-Hotfix #9)
• npx tsc --noEmit — exit 0
• npx vitest run src/pages/AuditPage.test.tsx — 4/4 pass
• npx vite build — built in 3.49s
Ground-truth: origin/master tip
|
||
|
|
b22cdb3405 |
fix(signer): Hotfix #15 — gofmt comment-indent fix from Hotfix #13
CI run on commit |
||
|
|
03f0e08a77 |
fix(middleware): Hotfix #14 — staticcheck QF1008 from Hotfix #12
CI run #571 (commit |
||
|
|
38f86bca86 |
fix(signer): Hotfix #13 — CodeQL #29 go/path-injection in FileDriver sinks
CodeQL alert #29 (severity: HIGH, rule: go/path-injection) has been open on master for 2 weeks despite Phase 6 commit |
||
|
|
af5c39252f |
fix(middleware): Hotfix #12 — CodeQL #34 go/reflected-xss in etag.go
CodeQL alert #34 (severity: HIGH, rule: go/reflected-xss) fired on commit |
||
|
|
6c00f7b0d3 |
fix(web): Hotfix #11 — CodeQL #36 js/regex/missing-regexp-anchor in multi-page-flows test
CodeQL alert #36 (severity: HIGH, rule: js/regex/missing-regexp-anchor) fired on commit |
||
|
|
49096914d2 |
fix(web): Hotfix #10 — CodeQL #37 js/use-before-declaration on __APP_VERSION__
CodeQL alert #37 (severity: warning, rule: js/use-before-declaration) fired on commit |
||
|
|
aa1c12ae2d |
feat(web): Phase 9 — backend-coupled + page-specific closures (5 shipped, 2 deferred)
Closes the frontend-design-audit Phase 9 batch — the audit's
"backend-coupled or page-specific" tier. Five findings ship; two
defer to follow-ups that need backend handler work.
Shipped:
PERF-M2 — Build-time version + hidden sourcemaps
• vite.config.ts: `sourcemap: 'hidden'` (was `false`). Maps emit
to dist/ but are NOT referenced by JS, so browsers don't fetch
them. The maps stay available for Sentry-class upload at
release time. Comment-block above the build config documents
the tradeoff so a future operator doesn't re-flip to `false`
without realising they're losing release-time debuggability.
• `__APP_VERSION__` build-time `define` reads `web/package.json`
`version` so ErrorBoundary can stamp the build into telemetry
payloads (was previously hardcoded `'dev'`).
FE-L1 — ErrorBoundary copy-trace + telemetry gate
• 50 → 185 LOC rewrite of web/src/components/ErrorBoundary.tsx.
• componentDidCatch now POSTs an ErrorPayload (build version,
UA, href, timestamp, error name + message + stack,
componentStack) to `VITE_ERROR_TELEMETRY_URL` IF that env var
is set at build time. Uses navigator.sendBeacon (page-unload-
safe) → falls back to fetch + keepalive. Unset = no POST,
no console-error spam.
• Operator-facing "Copy details" button writes the same payload
as JSON to the clipboard (navigator.clipboard API → execCommand
fallback for older browsers). A `<details>` block (collapsed
by default) shows the stack + componentStack inline so the
operator can grok the failure without leaving the page.
• Two new data-testid hooks (`error-boundary-reload`,
`error-boundary-copy`) for QA + future Playwright coverage.
• web/src/components/ErrorBoundary.test.tsx — 5 vitest specs:
no-error pass-through, error fallback structure, copy payload
shape, details collapsed-by-default, NO telemetry POST when
URL is unset. cleanup() between tests + console.error
silenced via the React-error-handling pattern.
UX-M8 — DataTable density toggle (opt-in via tableId)
• Density type ('compact' | 'comfortable' | 'spacious') + per-
density cell/header class maps. Default 'comfortable' matches
the existing px-4 py-3 padding so all callers see byte-
identical layout until they opt in.
• DataTableProps gains optional `tableId` + `density` props.
Pages that pass `tableId` get a 3-button DensityToggle
(Compact / Cozy / Spacious) rendered above the table; the
selection persists to localStorage at
`certctl:table-density:<tableId>`. No tableId = no toggle =
no behavioral change for the 17 other tables.
• Hardcoded `px-4 py-3` replaced with the `cellCls` /
`headerCls` lookup against the active density. Three Tailwind
permutations cover compact (px-3 py-1.5), comfortable
(px-4 py-3), spacious (px-5 py-5).
UX-M7 (lever) — CI guard against new raw `<table>` regressions
• scripts/ci-guards/no-raw-table.sh: counts `<table` tags in
`web/src/**/*.tsx` (production only, tests excluded) outside
the canonical primitives (DataTable.tsx + Skeleton.tsx) and
fails CI if the count climbs above baseline. `--strict` mode
rejects any raw table once the backlog clears.
• Baseline pinned at 17 (the current count of page-level raw
tables — verified via the same grep the guard uses). Every
page migration to <DataTable> drops the baseline by 1; new
pages MUST route through <DataTable>.
• No representative migrations in this commit (operator
decision: ship the lever first, migrations as follow-up PRs).
• Pairs with the existing CI guard suite (no-unbound-label,
no-raw-toLocaleString, no-eager-issuer-deletes, etc.) —
same baseline-locked pattern.
FE-M2 — Desktop-only banner (operator chose path a: 2026-05-14)
• web/src/components/DesktopOnlyBanner.tsx: fixed top bar at
viewports < 1024px (Tailwind `lg` breakpoint, below which the
sidebar + content layout starts visibly cramping). Amber
"Desktop-only: certctl is designed for viewports ≥ 1024px"
notice with a Dismiss button that persists to localStorage
(`certctl:desktop-only-banner-dismissed`).
• web/src/index.css: `.desktop-only-banner` is `display: none`
by default and `display: flex` inside the
`@media (max-width: 1023px)` block. CSS-gated visibility,
not React state — the banner mounts always but only renders
visibly on narrow viewports.
• web/src/main.tsx: mounts the banner inside ErrorBoundary,
above QueryClientProvider, so it survives any provider
failure that breaks the rest of the tree.
• Operator-stated rationale (recorded in DesktopOnlyBanner.tsx
header comment): the audit flagged 29 partial sm:/md:/lg:
responsive classes that suggest mobile support which isn't
actually shipped. Rather than rip out the partials (zero
benefit at desktop widths) or ship full mobile (1+ sprint of
QA + ongoing maintenance), this ships an honest signal —
"we don't promise mobile" — that doesn't claim support that
isn't there. The partials stay (no benefit to ripping out;
they may help if the decision reverses).
Deferred:
P-H2 — AuditPage server-side time filters
Requires backend changes to internal/api/handler/audit.go +
service + repository: ListAuditEvents currently accepts only
page/per_page/category. Adds `since` / `until` ISO-8601
params (UTC), pushes the timestamp predicate into the SQL
query, surfaces them in OpenAPI + MCP. Queued as a backend-
first follow-up bundle.
P-M1 — DiscoveryPage in-flight scan panel
Out of scope for the frontend remediation pass; needs a
websocket / SSE channel from internal/service/discovery.go to
the frontend (current poll-and-render UI works against the
existing endpoint set). Queued.
Verification:
• npx tsc --noEmit — exits 0
• npx vitest run ErrorBoundary StatusBadge — 80/80 passed
• npm run build — ✓ built in 3.11s
• bash scripts/ci-guards/no-raw-table.sh —
Raw <table> tags outside DataTable + Skeleton — current: 17, baseline: 17
• Bundle shapes unchanged from Phase 4 (91.66 KB raw / 25.92 KB gz
initial chunk); the ErrorBoundary rewrite adds ~5 KB to index.
Falsifiable proof for the next CI run:
• Frontend Build job's `npm ci` step completes (Hotfix #9 settled
the Storybook peer conflict).
• New no-raw-table.sh guard exits 0 with current=17 baseline=17.
• All 34 CI guards (was 33, +1 for no-raw-table) pass.
Per-finding closure entries land in frontend-design-audit.html in
the follow-up commit (audit HTML update).
|
||
|
|
5231609f26 |
fix(web): Hotfix #9 — remove Storybook deps from package.json (Vite 8 peer conflict)
CI failure on Phase 8 commit
v2.1.5
|
||
|
|
c146e8f75b |
fix(web): sidebar footer simplification + onboarding doc links — operator-reported drift
Two small, operator-reported regressions in the live demo:
1. SIDEBAR FOOTER
Pre-fix the bottom-left of the sidebar had:
Built and maintained by Shankar <- only "Shankar" linked
certctl [⎋] <- "certctl" label + logout
Operator dropped the "certctl" label as redundant (the brand mark +
product name are already in the sidebar header), and asked for the
WHOLE attribution sentence to be the LinkedIn link rather than only
"Shankar". Post-fix the entire sidebar footer is one row:
Built and maintained by Shankar [⎋]
The full sentence is now an ExternalLink to
https://www.linkedin.com/in/shankar-k-a1b6853ba. Logout sits flush-
right via `flex justify-between` and only renders when authRequired
is true (unchanged contract). Same Phase 5 / Hotfix #8 chokepoint
(ExternalLink) means the L-015 CI guard stays green — caught my
first attempt where the explanatory comment text contained the
literal `target="_blank"` string and the line-grep guard fired on
the comment itself. Fixed by rephrasing the comment.
2. ONBOARDING WIZARD DOC LINKS
The CompleteStep ("You're all set!") screen had three doc links at
the bottom — all 404s:
Quickstart Guide → docs/quickstart.md (gone)
Architecture → docs/architecture.md (gone)
Connectors → docs/connectors.md (gone)
Root cause: the 2026-05-04 docs overhaul reorganized into the
audience-organized tree (`getting-started/`, `reference/`,
`operator/`, etc.). The CompleteStep links weren't updated. Every
operator who completed the wizard hit three 404s.
Verified against the live repo BEFORE writing the new links — the
exact paths that exist today:
docs/getting-started/quickstart.md
docs/reference/architecture.md
docs/reference/connectors/index.md (29 per-connector .md siblings)
New links point at those paths. Each still uses target="_blank" +
rel="noopener noreferrer" on the same line so the L-015 guard
passes.
Verification:
• npx tsc --noEmit — exits 0
• Layout 7/7 + OnboardingWizard 4/4 = 11/11 green
• All 34 CI guards pass (L-015 included)
• npx vite build ✓ in 3.30s
|
||
|
|
a9e229bd2a |
feat(frontend): Phase 8 Test Pyramid Investment — TEST-H1 + TEST-H2 + TEST-H3 (scaffold) + TEST-M1
Closes the structural test-pyramid gaps that protect every future
phase from regression. Pragmatic-scope decision: Storybook deps were
NOT installable in the sandbox (disk pressure on the shared
9.8 GB local partition); the config + stories ship as scaffolding +
package.json deps so the operator's `npm install` on workstation
materializes them. Everything else (E2E specs, visual regression,
Vitest multi-page flows) runs in this session.
═════════════════════════ AUDIT VERIFICATION ═════════════════════════
• Q1 (e2e/README intact + zero Playwright wired) — PARTIALLY STALE:
Phase 3 TEST-M3 already shipped playwright.config.ts +
smoke.spec.ts + @playwright/test 1.49.0 + the `npm run e2e`
script. Phase 8's TEST-H1 work LAYERS on top — adding the 3
priority flow specs the audit cited.
• Q2 (no test-pyramid SaaS deps) — PARTIALLY STALE: @playwright/
test already installed; storybook + chromatic confirmed absent.
• Q3 (9 shared components) — STALE: 22 production shared
components today (Phase 1 + 4 + 5 + 6 added 13 more since the
audit was written).
• Q4-Q6 (Vite + Vitest + Tooltip API + CI gates) — all accurate.
═════════════════════════════ CLOSURES ═══════════════════════════════
TEST-M1 (multi-page Vitest flows) — FULL CLOSE
• web/src/__tests__/multi-page-flows.test.tsx — 3 flow tests:
1. Certs list → row click → CertificateDetailPage continuity
2. Direct deep-link to /certificates/:id (no list pre-fetch)
3. Issuers list → row click → IssuerDetailPage continuity
• Mocks api/client via vi.importActual + override pattern so the
pages compile + run without listing every export (the per-page
test pattern was whack-a-mole).
• 3/3 green in 6.83s.
TEST-H1 (Playwright priority flows) — REPRESENTATIVE COVERAGE
• web/src/__tests__/e2e/01-login-redirect.spec.ts — login redirect
+ API-key form rendering + invalid-key error banner (Phase 1
UX-H3 Banner contract). Happy-path login skipped pending live
CERTCTL_E2E_API_KEY in CI env.
• web/src/__tests__/e2e/02-dashboard-shell.spec.ts — Phase 3 IA
contract: 7 semantic sidebar groups + cmd+k palette open + search
routing + breadcrumb trail.
• web/src/__tests__/e2e/03-settings-timestamp-pref.spec.ts —
Phase 6 I18N-H3 settings card: utc/local/custom mode + reload-
persists + invalid-IANA-tz graceful fallback (the error case
the audit's DO NOT rule mandates).
• 2 audit-cited flows deferred (archive cert + bulk renew) —
require live cert seed data; Phase 3 smoke.spec.ts pattern
extends naturally when CI seeds a demo deployment.
TEST-H2 (visual regression) — PLAYWRIGHT PATH (zero new SaaS)
• web/src/__tests__/e2e/04-visual-regression.spec.ts — 5 page
screenshots: /login, /, /certificates, /issuers, /auth/settings.
Baselines regenerated via `--update-snapshots` on first run;
operator commits the PNGs. Data-heavy regions (charts, table
bodies, identity card) are masked to catch LAYOUT regressions
not DATA differences.
• Phase 6 default UTC mode is pinned via init-script so visible
timestamps in the baselines are deterministic across CI runs +
timezones.
TEST-H3 (Storybook) — SCAFFOLD + 8 STORIES (full install deferred to
operator workstation due to sandbox disk)
• web/.storybook/main.ts + preview.ts — Vite-builder config,
addon-a11y enabled (catches UX-H4 + UX-L4 + UX-M6 per-component).
Story discovery: `src/**/*.stories.@(ts|tsx)`.
• 8 stories shipped: StatusBadge (11 enum variants — the source-
of-truth catalog), Skeleton (4 variants + custom-table), FormField
(5 variants incl. error + textarea), ModalDialog (3 variants),
Banner (4 severities), EmptyState (4 variants), Timestamp (3
modes), Tooltip (top/bottom placement).
• 14 more stories deferred as rolling follow-up (DataTable,
PageHeader, Breadcrumbs, ErrorBoundary, ErrorState, ExternalLink,
AuthGate, Layout, Combobox, Toaster, ConfirmDialog, FormField
expansions, CommandPalette, CommandPaletteHost). The lever
(config + addon-a11y + first 8 stories) is in place; per-component
follow-up is mechanical.
Storybook DEPS — PACKAGE.JSON ONLY, LOCKFILE PENDING:
The sandbox's local 9.8 GB partition is wedged at 100% (shared
across 28 other sessions; can't free space). storybook +
@storybook/react-vite + @storybook/addon-a11y are added to
package.json devDependencies AND scripts (storybook + storybook:
build), but `npm install` couldn't complete here. Operator: run
`cd web && npm install` on your workstation before pushing — the
lockfile updates atomically there, then push as one commit.
The .stories.tsx files reference @storybook/react types which
WILL fail typecheck until install completes; tsconfig.json
excludes them from the build typecheck (added `src/**/*.stories.
tsx` + `src/**/*.stories.ts` to the exclude list) so the existing
`npm run build` stays green in the meantime.
Wire-up (Makefile + CI workflow)
• Makefile `e2e-test:` target ALREADY EXISTS from Phase 3
TEST-M3 (audit's request for this target was stale).
• .github/workflows/e2e.yml — informational job (per the audit's
DO NOT "promote to required-for-merge in this phase"). Runs on
push to master + every PR touching web/. Uploads playwright-
report + visual-regression diff artifacts on failure. Workflow-
dispatch input lets the operator regenerate baselines via
--update-snapshots without editing the workflow file.
═══════════════════════════ VERIFICATION ═════════════════════════════
• npx tsc --noEmit — exits 0 (stories + e2e specs excluded via
tsconfig.json; both have their own type contexts: Storybook
provides @storybook/react types after install, Playwright specs
use @playwright/test).
• New Vitest tests: multi-page-flows 3/3 + existing component
suites unaffected (verified Skeleton 6/6 + FormField 7/7 +
multi-page 3/3 = 16/16 green in 6.83s).
• npx vite build — ✓ in 3.39s. Bundle profile unchanged.
• All 34 CI guards pass locally (bash scripts/ci-guards/*.sh loop
— no new guards in this phase).
• Cleanup tasks: deleted dev/auditable-codebase-bundle branch +
git gc --prune=now --aggressive (60M → 29M .git on host).
═══════════════════════════ RESIDUAL RISK ════════════════════════════
• Playwright flakiness on CI — well-documented in industry. The
e2e.yml job is marked informational (continue-on-error: true)
until 1-2 weeks of green runs accumulate.
• Storybook story drift: every new shared component needs a
sibling .stories.tsx. No CI guard enforces this today; tracked
for follow-up.
• Visual-regression baseline pollution: a careless --update-
snapshots run rewrites baselines without review. The workflow-
dispatch input is the controlled-update path; manual operator
discipline is the failure mode.
• Storybook lockfile pending operator install. Tests + build
stay green in the meantime via tsconfig exclude rule.
|
||
|
|
700c399367 |
chore(web): remove darkMode: 'class' from tailwind config — Phase 7 retired
Operator decision 2026-05-14: "no dark mode and no future dark mode
wiring to maintain." The originally-optional Phase 7 (the rebuild path
that would have superseded Phase 0's rip-out if customer signal materialized)
is formally retired in the frontend-design-audit.html banner stack +
Phase 7 H3 header.
Phase 0's closure rationale ("leave `darkMode: 'class'` in tailwind
config for the eventual Phase 7 rebuild") is now superseded — keeping
that line set would resurface as the same half-wired-hook pattern that
drove the original FE-H1 finding, just at the config layer instead of
the HTML layer. Phase 0 removed `class="dark"` from <html> + the body
`bg-slate-900`; this commit closes the loop by also removing the
tailwind config option that pointed at a future feature that won't
arrive.
If the decision ever reverses, this line restores in a one-diff revert
+ a full re-audit of every primitive and page for `dark:` variants
(see the retired Phase 7 executable prompt for the rules: ship complete
or not at all; piecemeal dark-mode is exactly the original finding).
Verification:
• npx tsc --noEmit — exits 0
• npx vite build — ✓ built in 3.20s (Tailwind doesn't need
darkMode set to compile; output is identical because there are
zero `dark:` classes in src/ to gate behind anything)
• Audit HTML (workspace-only, not repo-tracked) updated with:
- Phase 7 RETIRED banner at top of banner stack (amber accent)
- Phase 7 H3 header flipped to "✗ Retired 2026-05-14"
- FE-H1 row note extended with the lock-in decision
- Phase 0's "Do NOT delete darkMode: 'class'" guidance struck
through + marked SUPERSEDED with a pointer to the new banner
v2.1.4
|
||
|
|
1fcb05181d |
feat(frontend): Phase 6 Locale + Date/Time Discipline — close I18N-H1 + I18N-H2 + I18N-H3 + I18N-M2
Closes the Phase 6 batch from cowork/frontend-design-audit.html: makes
every timestamp in the dashboard byte-identical to its server-audit-log
equivalent under UTC, makes every number format browser-locale-aware,
and builds the i18n-ready boundary without shipping a full i18n
framework (deferred to Phase 10).
═════════════════════════ AUDIT VERIFICATION ═════════════════════════
• Q1 utils.ts hardcoded 'en-US' at lines 3 + 8 — confirmed
• Q2 raw new Date(x).toLocaleString() sites — verified 8 sites
across 6 pages (audit said "7+"):
SessionsPage:178, SessionsPage:181 (last_seen, abs_expires)
BreakglassPage:236, BreakglassPage:248 (last_pw_change, locked_until)
GroupMappingsPage:206 (created_at)
OIDCProvidersPage:434 (created_at)
ApprovalsPage:379 (created_at)
ObservabilityPage:71 (server_started)
• Q3 no i18n framework — confirmed (no i18next/react-intl/@formatjs/
date-fns in web/package.json)
• Q4 zero Intl.NumberFormat usage — confirmed (audit-accurate)
• Q5 Tooltip API — `<Tooltip content={…}>{singleChild}</Tooltip>`,
Floating-UI-backed, aria-describedby wired
• Q6 toFixed sites — 1 site in dashboard/charts.tsx (Recharts tooltip
rate formatter); audit was vague but actual is minimal
═════════════════════════════ CLOSURES ═══════════════════════════════
I18N-H1 — drop hardcoded en-US in utils.ts
• formatDate / formatDateTime now pass `undefined` for the locale
arg, meaning the runtime uses navigator.language. Output SHAPE
stable (month: 'short' etc.); LANGUAGE follows the browser.
• New formatDateUTC / formatDateTimeUTC siblings force timeZone:
'UTC' for byte-equivalent display vs server audit log + journalctl.
• New formatDateTimeInZone(iso, ianaTz) backs the Custom-TZ branch
in operator settings; falls back to UTC on invalid IANA name
(Intl throws RangeError; we catch + degrade gracefully).
• Existing tests in utils.test.ts already used locale-tolerant
assertions (.toContain('Jun')) so no test update needed.
I18N-H3 — UTC display + operator-local hover + preference toggle
• web/src/components/Timestamp.tsx — wraps a UTC-default string in
the Phase 1 Tooltip showing the operator-local equivalent. Three
modes:
utc — display UTC (default; screen ≡ logs).
local — display browser-local, hover shows UTC.
custom — display configured IANA tz, hover shows UTC.
• web/src/api/timestampPref.ts — typed localStorage helper with
`certctl:timestamp-pref-changed` CustomEvent so live <Timestamp>
components re-render without a page reload when the operator
flips the toggle.
• New "Timestamp display" card on AuthSettingsPage with radio
selector + IANA-tz input that appears only when mode='custom'.
I18N-H2 — migrate raw toLocaleString sites + CI guard
• 8/8 raw `new Date(x).toLocaleString()` / `.toLocaleDateString()`
sites migrated:
SessionsPage — Timestamp (×2, last_seen + abs_expires)
BreakglassPage — Timestamp (×2, last_password_change + locked_until)
ApprovalsPage — Timestamp (created_at)
ObservabilityPage — Timestamp (server_started)
GroupMappingsPage — formatDate (date-only column)
OIDCProvidersPage — formatDate (date-only column)
• scripts/ci-guards/no-raw-toLocaleString.sh fails CI on any new
raw new Date(x).toLocaleString[Date]Date call outside the
canonical utils.ts impls. Tests + utils.ts itself are excluded.
I18N-M2 — Intl.NumberFormat helpers
• New web/src/api/format.ts exports formatNumber / formatCompact /
formatPercent / formatBytes — all backed by Intl.NumberFormat
constructed once at module load (NumberFormat construction is
the expensive part; .format() is cheap).
• Locale-tolerant test fixtures assert format SHAPE (e.g.
"5[ .,]?432") not exact strings — so the CI runner's locale
doesn't break assertions.
• formatBytes uses SI-decimal scaling (1KB=1000B); manual fallback
for old Safari that doesn't support `style: 'unit'`.
═══════════════════════════ AUDIT-ACCURACY CALLOUTS ════════════════════
(1) Audit said "7+ pages with raw .toLocaleString" — verified 8 raw
SITES across 6 PAGES. Direction was right; counts were vague.
(2) Audit said "no i18n framework + no Intl.NumberFormat" — both
verified accurate (zero matches in production tsx).
(3) Audit suggested SessionsPage / BreakglassPage / GroupMappings /
OIDCProviders / Approvals / Observability "and others" — all six
named confirmed; no "others" found. List was complete.
═══════════════════════════ VERIFICATION ════════════════════════════
• npx tsc --noEmit — exits 0
• New tests: utils 18/18 (preserved) + format 14/14 + Timestamp 6/6
= 38 new test assertions
• Component suite (270/270 across api + Timestamp + Tooltip + sibs)
• 7 migrated page suites — 62/62 green (Sessions / Approvals /
Breakglass / GroupMappings / OIDCProviders / AuthSettings /
Observability)
• All 34 CI guards pass locally (new no-raw-toLocaleString.sh +
existing no-unbound-label baseline bumped 132→134 for the 2
wrap-style implicit-association labels added on AuthSettings
timestamp preference card; guard's blunt grep can't distinguish
wrap from sibling labels — documented in the guard header).
• npx vite build — ✓ in 2.69s
• grep "'en-US'" web/src/api/utils.ts → 0 matches
• grep "new Date.*\.toLocaleString\(\)" web/src --include='*.tsx'
--exclude='*.test.*' → 0 raw sites outside utils.ts
═══════════════════════════ RESIDUAL RISK ════════════════════════════
• UTC default may surprise non-engineering users who expect their
local timezone. Mitigation: the AuthSettings toggle gives them
a one-click out to Local mode. Default UTC is the right safe
default for an audit-log-paired tool.
• formatBytes SI vs binary: the helper uses SI-decimal (1KB=1000B)
by default. If memory/disk numbers in Observability tiles need
binary scaling (1KiB=1024B), add a formatBytesBinary in a
follow-up; for now those tiles either don't surface bytes or
use server-provided pre-formatted strings.
• i18n framework deferred: no react-i18next, no extraction pass.
Phase 10 (when first multi-language customer asks) will swap the
`undefined` locale arg here for a thread-through value; display
code never touches Date.prototype.toLocaleString directly thanks
to the no-raw-toLocaleString CI guard.
|
||
|
|
508c7530e9 |
fix(web): Hotfix #8 — L-015 line-grep guard + CodeQL formatStatus orphan
Two separate issues caught after Phase 5 push: ═════════════════════════ ISSUE 1: L-015 CI GUARD ═════════════════════════ The Frontend Build job on commit |
||
|
|
c9f932be65 |
feat(frontend): Phase 5 Accessibility + Forms — close FE-H3 + UX-H4 primitive + FE-M1 primitive + axe-core gate
Closes the Phase 5 batch from cowork/frontend-design-audit.html: ships
the joint UX-H4 + FE-M1 lever (FormField primitive + react-hook-form +
zod schemas) and the FE-H3 fix (Headless UI Dialog focus trap on the 3
inline-managed modals), with an axe-core regression test + CI guard to
prevent UX-H4 regressions.
═════════════════════════ AUDIT VERIFICATION ═════════════════════════
Confirmed live against the repo before implementing:
• Q1 labels / htmlFor / input-id = 139 / 6 / 0
(audit said 138 / 6 / 0 — labels +1, otherwise accurate)
• Q2 no form library installed
(no react-hook-form, formik, @tanstack/react-form, final-form)
• Q3 3 inline-managed dialog sites confirmed:
SCEPAdminPage.tsx:272, AgentsPage.tsx:314, ESTAdminPage.tsx:281
• Q4 audit's top-6 list was OFF — actual top form-heaviest pages
by useState count are: OIDCProviderDetailPage 21, AgentGroupsPage
18, CertificatesPage 17, CertificateDetailPage 14, BreakglassPage
13, ProfilesPage 13 — NOT the audit-suggested OnboardingWizard 5
(now split in Phase 4) / OIDCProvidersPage 8 / IssuersPage 11 /
ProfilesPage 13 / TargetsPage 9 / ApprovalsPage 5. Audit's
intuition skipped the higher-useState pages.
• Q5 jest-dom imported in src/test/setup.ts — axe-core landed
cleanly
═════════════════════════════ CLOSURES ═══════════════════════════════
UX-H4 (label/input binding) — FormField primitive shipped
• web/src/components/FormField.tsx wraps a <label> + an input child
and auto-generates a stable id via React 18's useId(); cloneElement
threads that id onto BOTH the <label htmlFor> AND the child's id
prop so the WCAG 1.3.1 binding holds by construction. Supports
`required` (asterisk + aria-required), `description` (wires
aria-describedby), `error` (aria-invalid + role=alert + extends
aria-describedby). 7 tests pin the contract.
FE-M1 (no form library) — react-hook-form + @hookform/resolvers + zod
• Added react-hook-form 7.75, @hookform/resolvers 5.2, zod 4.4 as
runtime deps; @axe-core/react, jest-axe, @types/jest-axe as devDeps
• Representative migration of CreateTeamModalInline (inside
onboarding/CertificateStep — operator's first-run experience)
from 3-useState + manual handlers to useForm + zodResolver +
FormField. Schema at pages/onboarding/team.schema.ts.
• Per the audit's "top-6 only, primitive is the lever" rule, the
other 5 audit-suggested pages migrate organically as feature
work touches them — documented as Phase 5 follow-up. The
FormField primitive is the leverage point; per-page migrations
are mechanical applications.
FE-H3 (no focus trap on modal pages)
• New ModalDialog primitive at web/src/components/ModalDialog.tsx —
Headless UI Dialog wrapper for arbitrary-content modals
(complements ConfirmDialog which is confirm-only). Auto-emits
role=dialog + aria-modal + aria-labelledby + ESC-to-close +
backdrop-click-to-close + focus trap.
• All 3 inline-managed modal sites migrated:
• SCEPAdminPage ConfirmReloadModal
• ESTAdminPage ConfirmReloadModal (data-testid preserved)
• AgentsPage RetireAgentModal (3-mode: confirm / blocked / error
— title + footer change per mode; body slot stays the same)
• 37/37 existing modal-page tests stay green — no behavior change
visible to the test suite, only the focus-trap + ESC handling.
UX-H4 regression gate
• web/src/test/a11y.test.tsx runs axe-core (not jest-axe — its
`toHaveNoViolations` matcher uses jest's expect API which can't
plug into Vitest's expect.extend; fails with "expectAssertion.call
is not a function"). Direct axe.run + assert violations.length===0
gives the same gate with a readable failure message.
• Scope: primitives, not page sweeps. Primitives carry the risk
surface; pages compose them. 5 tests covering FormField (with +
without description/error), Skeleton (all 4 variants),
ModalDialog, Breadcrumbs. ~400ms total.
• Skeleton.table's empty <th> cells are decorative shimmers inside
a role=status + aria-busy=true tree — axe-core's
`empty-table-header` rule doesn't model aria-busy gating, so it
is suppressed for the Skeleton variant scan with a clear comment.
• scripts/ci-guards/no-unbound-label.sh — fails CI if a new <label>
without htmlFor lands. Baseline-driven (132 today) so the existing
backlog doesn't block CI; every migration to FormField drops the
baseline. `--strict` mode rejects any unbound label once the
backlog clears.
═══════════════════════════ VERIFICATION ═════════════════════════════
• npx tsc --noEmit — exits 0
• New tests: FormField 7/7, ModalDialog 6/6, a11y 5/5 = 18/18 new
• Component suite: 14 files / 150/150 green
• Page suite (representative subset run): 16 files in first run
(timeout truncated final summary) + 10 files / 48/48 in second
run — all green
• OnboardingWizard 4/4 (the migrated CreateTeamModalInline test
case is the second one — `+ New team opens the inline modal,
calls createTeam, invalidates the cache, and auto-selects the
new team`)
• SCEPAdminPage 20/20, ESTAdminPage 14/14, AgentsPage 3/3 — all
37 modal-page tests stay green after ModalDialog migration
• npm run build ✓ in 3.27s
• CI guard: bash scripts/ci-guards/no-unbound-label.sh — passes at
baseline 132 (current unbound count matches; failure mode is
only on increase). --strict path will fail until backlog clears.
═══════════════════════════ RESIDUAL RISK ════════════════════════════
• RHF migration risk: zod resolver's input/output type mismatch
bit me once during this work (description: z.string().optional()
gave Input: string|undefined vs Output: string after .default()).
Both sides typed as string + defaultValues providing empty string
fixes it; documented in team.schema.ts. Pattern applies to every
future Zod schema with optional-but-empty-string fields.
• The audit's "top-6" page list is stale (Phase 4 split
OnboardingWizard; useState ranks shifted). Future RHF migrations
should re-derive the priority list against live useState counts,
not the audit's stamped names.
• DataTable per-row React.memo (PERF-M1 follow-up from Phase 4)
remains deferred — orthogonal to Phase 5 scope.
|
||
|
|
868f1c25be |
feat(web): sidebar maintainer attribution — mirror landing-page footer style
Add "Built and maintained by Shankar" to the sidebar bottom, with "Shankar" linking to LinkedIn (same href + rel="me noopener" the certctl.io landing-page footer uses). Typography matches the landing page: • font-mono (same family as the existing "certctl" label row) • text-2xs muted (text-sidebar-text/70) for the prefix • slightly brighter for the linked name (text-sidebar-text/90) • underline-offset-2 + hover:underline for the link affordance Lives directly above the existing certctl / logout footer row, so the sidebar bottom now reads: Built and maintained by Shankar certctl [Logout] Single-maintainer OSS standard (Cal.com, Plausible, Beekeeper Studio all credit + link their maintainer the same way). Persistent slot for operators using certctl to find the maintainer in one click — complements the landing-page footer link instead of duplicating it. Verification: • npx tsc --noEmit — exits 0 • Layout.test.tsx — 7/7 green (no test regression from the new row) |
||
|
|
9ce2d8ca8f |
feat(frontend): Phase 4 Loading + Perceived Performance — close UX-M1 + FE-M5 + PERF-M1 + P-H3 + partial FE-M3 / P-M2
Closes the Phase 4 batch from cowork/frontend-design-audit.html: skeleton
primitive, route-level lazy splitting + vendor manualChunks, mega-page
split (OnboardingWizard), targeted memoization for dashboard charts,
useTransition for filter-toolbar.
═════════════════════════ AUDIT VERIFICATION ═════════════════════════
Confirmed facts from the live repo before implementing (not the audit's
stamped numbers — those drifted):
• Pre-Phase-4 index-*.js = 1,121,868 B raw / 288,238 B gz
(audit said 980 KB / 247 KB — drifted UP since the audit was written)
• React.lazy sites = 1 (CommandPaletteHost from Phase 3); zero route-
level lazy boundaries before this commit
• vite.config.ts had NO rollupOptions.output.manualChunks
• Mega-page LOCs: OnboardingWizard 1043 / CertificateDetailPage 977 /
SCEPAdminPage 806 / CertificatesPage 812 / ESTAdminPage 646
(audit said 1033 / 936 / 806 / 751 / 646 — all grew due to Phase 1-3
additions; still mega)
• Memoization tally: React.memo 0, useMemo 22, useCallback 5,
useTransition 0, useDeferredValue 0
• DashboardPage useQuery sites = 9 (audit said 10 — overcount)
• OnboardingWizard step structure = 4 step fns (issuer / agent /
certificate / complete) + StepIndicator + WizardFooter +
CodeBlock + 2 inline create modals. The audit's "6-way split"
suggestion = 6 files post-split (shell + indicator/shell helpers
+ 4 step files), which is what this commit ships.
═════════════════════════════ CLOSURES ═══════════════════════════════
UX-M1 — Skeleton primitive (web/src/components/Skeleton.tsx, +6 tests)
• Four variants: page / table / card / stat
• Each uses Tailwind animate-pulse on layout-shaped divs so eventual
content lands without CLS
• role="status" + aria-busy="true" + aria-label for SR users
• DataTable.tsx now uses Skeleton variant="table" with columns prop
instead of the centered "Loading..." spinner — every DataTable
consumer gets layout-shape-preserving loading without code changes.
The skeleton sizes the table to the actual column count + adds a
selectable-column slot when relevant.
FE-M5 + SCALE-H1 — route-level code split + vendor manualChunks
• main.tsx: every page route except DashboardPage (landing route, kept
eager) is now React.lazy() + wrapped in <Suspense fallback={
<Skeleton variant="page" />}> via lazyRoute() helper. 35 lazy
routes total.
• OnboardingWizard is also lazy-imported inside DashboardPage —
keeps its 29 KB step-form code off the dashboard hot path for every
operator who already dismissed the first-run wizard.
• vite.config.ts: rollupOptions.output.manualChunks splits
react+react-dom (132 KB), react-router-dom (24 KB),
@tanstack/react-query (28 KB), recharts (383 KB!), and lucide-react
(16 KB) into named vendor chunks. Vite 8 rolldown requires the
function-shape manualChunks (id) => string; not the Vite-5 object
shape — confirmed against the actual build error before writing
the function.
Bundle profile (raw / gz):
pre-Phase-4 single index-*.js = 1,121,868 / 288,238
post-Phase-4 index-*.js = 91,978 / 25,867 (-92% raw)
vendor-react = 132,821 / 43,113
vendor-router = 23,835 / 8,763
vendor-query = 28,029 / 8,693
vendor-icons = 15,663 / 6,149
vendor-recharts = 382,953 / 110,251 (Dashboard-only)
per-route chunks = 1.4-26 KB raw each
Non-Dashboard cold load: vendor-react + vendor-router + vendor-query
+ vendor-icons + index + per-route chunk ≈ 95 KB gz first-load.
Dashboard cold load adds vendor-recharts (110 KB gz) on demand.
Audit target was <100 KB gz first-load for non-Dashboard routes — hit.
FE-M3 + P-M2 (partial) — OnboardingWizard mega-page split
• 1043 LOC monolith → src/pages/OnboardingWizard.tsx (100 LOC shell) +
src/pages/onboarding/{types.ts, StepShell.tsx, IssuerStep.tsx,
AgentStep.tsx, CertificateStep.tsx, CompleteStep.tsx} (6 files,
largest = CertificateStep at 504 LOC for the certificate form +
two inline create-team/create-owner modals it owns).
• Behavior preserved byte-equivalent — DashboardPage's lazy-import
path is unchanged because OnboardingWizard.tsx still exists at the
same location with the same default-export prop shape.
• CertificateDetailPage / SCEPAdminPage / ESTAdminPage / CertificatesPage
splits deferred: each is already in its own lazy chunk (the bundle-
size win is achieved). Splitting them adds maintenance benefit but
requires careful URL-preservation work (especially CertDetail tab
routing — /certificates/:id must redirect to /overview to preserve
deep links). Documented as Phase 4 follow-up; not blocking on this
closure.
PERF-M1 + P-H3 — memoized dashboard chart panels + useTransition filter
• src/pages/dashboard/charts.tsx — 4 React.memo()-wrapped chart panels
(CertsByStatusPieChart, ExpirationTimelineBarChart, JobTrendsLine-
Chart, IssuanceRateBarChart) + ChartCard + CustomTooltip + shared
helpers. Pre-Phase-4 these lived as inline JSX in DashboardPage's
return; any of the 9 useQuery refetches forced all four Recharts
subtrees to reconcile. Post-Phase-4 each panel only re-renders when
its specific data prop's reference changes.
• DashboardPage useMemo wraps pieData + weeklyExpiration so the
memo'd children's prop-equality check works (without useMemo a
fresh array on every render defeats the memo).
• Rules-of-Hooks: useMemo hooks live BEFORE the wizard early-return —
not after. (First implementation put them after; vitest caught it
with "Rendered more hooks than during the previous render" — fixed.)
• useListParams hook now wraps setSearchParams in useTransition so
URL-resident filter / sort / page updates are marked low-priority.
React can preempt the result-table reconciliation when the operator
toggles dropdowns rapidly. Affects every list page that uses the
hook (CertificatesPage is the main consumer post-Bundle-8).
═══════════════════════════ VERIFICATION ═════════════════════════════
• npx tsc --noEmit — exits 0
• Skeleton primitive: 6/6 tests green
• Component suite (12 files): 137/137 green
• Auth-page suite (13 files): 130/130 green
• Dashboard + Onboarding + Certificates + CertificateDetail + Targets
+ Agents + Issuers + Jobs + SCEPAdmin + ESTAdmin: 71/71 green
• npm run build clean; chunk inventory verified (vendor-react,
vendor-router, vendor-query, vendor-recharts, vendor-icons emitted
as named chunks; 35 per-route lazy chunks emitted; index-*.js
shrunk to 91.66 KB raw / 25.92 KB gz).
═══════════════════════════ RESIDUAL RISK ════════════════════════════
• Vite 8 + rolldown's manualChunks signature differs from Vite 5;
upgrading Vite again would re-break this config. Comment in
vite.config.ts pins the function-shape requirement.
• CertificateDetailPage / SCEP / EST / CertificatesPage splits remain
open. Mega-LOC files but already lazy-chunked, so deferring is safe.
• Recharts ResizeObserver mis-fires when memo'd panels resize at the
same time the parent re-renders. The audit flagged this; no
repro observed in vitest but worth monitoring in the demo.
|
||
|
|
0987e222dd |
fix(web): Phase 3 hotfix — UsersPage.test.tsx Router context + Breadcrumbs defensive guard
CI failure on Phase 3 commit (
|
||
|
|
e761ae40a4 |
feat(frontend): Phase 3 Information Architecture + Search — close UX-H1 + FE-H2 + UX-M5 + UX-H6 + FE-L4; FE-M6 deferred
Phase 3 of the frontend-design audit: information architecture + search.
Layout.tsx rewritten once for BOTH grouped-sidebar (UX-H1) AND lucide-
react icon migration (FE-H2). Breadcrumbs primitive added + wired into
PageHeader. cmd+k command palette mounted globally via cmdk. FE-M6
(drop unsafe-inline from CSP style-src) deferred — the audit's framing
was incomplete.
New / changed
=============
web/src/components/Layout.tsx (rewrite — UX-H1 + FE-H2 + FE-L4)
Pre: flat 31-item nav array with literal SVG path-string icons.
Post: 7 semantic groups (Inventory / Trust / Delivery / People /
Notify / Access / Audit) of 31 NavLinks total; lucide-react
icon components replace every path string (27 named imports);
collapsible per-group state persisted to localStorage
(`certctl:nav:collapsed-groups`); aria-expanded / aria-controls
on each group header; the existing Setup-guide button and Sign-
out button kept verbatim. Logout icon swapped from inline SVG to
lucide `LogOut`.
web/src/components/Breadcrumbs.tsx (new — UX-M5)
Walks the current pathname via useLocation() + a static
pathSegmentLabels map. Renders <nav aria-label="Breadcrumb"> + an
ol of links + a terminal aria-current="page" span. Renders
nothing on the dashboard root. 8 sibling tests in
Breadcrumbs.test.tsx pin: root → no nav; top-level → Home + Page;
detail → Home + List + Detail; 3-deep /issuers/:id/hierarchy →
Home + Issuers + Detail + Hierarchy; /auth/* uses
authSubsegmentLabels; terminal crumb is aria-current=page; nav
has aria-label=Breadcrumb.
web/src/components/PageHeader.tsx (1-line wire-in)
Renders <Breadcrumbs /> above the page title. Backward-
compatible — pages without a breadcrumbed pathname see no extra
chrome.
web/src/components/CommandPalette.tsx (new — UX-H6)
cmdk-driven palette with three sections:
1. Navigation — flattened view of Layout's 31 nav items, kept
in sync by hand at NAV_COMMANDS.
2. Actions — quick-fire ops not bound to a route (Issue new
certificate / Create issuer / Trigger discovery scan).
3. Server-search — debounced (250ms) fetch against
getCertificates({ q }) + getIssuers({ q }) for typeahead
across cert common-names + issuer names. Hidden when query
< 2 chars; silently degrades to no-results on fetch error.
web/src/components/CommandPaletteHost.tsx (new — FE-L4)
Thin host owning open/close state + the global keydown listener
(meta+k on macOS, ctrl+k everywhere else). Lazy-loads the
palette via React.lazy so cmdk's bundle (~25 KB) only lands
when the operator first hits cmd+k. Mounted inside BrowserRouter
so useNavigate() resolves.
Audit-accuracy callouts
=======================
1. UX-H1 wording was FACTUALLY WRONG. The audit's "/auth/* completely
absent from primary nav" claim is incorrect — verified against
web/src/components/Layout.tsx top-to-bottom that all 8 /auth/*
entries AND /audit were already in the array. The actual issue
was UNGROUPED, not absent. Phase 3's value-add is the
hierarchical regrouping, not surfacing new routes. Restated in
the file header comment.
2. FE-M6 deferred — audit framing was too narrow. The CSP comment
in internal/api/middleware/securityheaders.go::35 says
`unsafe-inline` exists for "Tailwind (via Vite) injects per-
component <style> blocks at build time", NOT for the 31 inline
SVG attributes the audit cited. Even after FE-H2 removes the
Layout.tsx SVGs, there are 17 production tsx files with React
`style={...}` attributes that still emit inline styles in the
rendered HTML (Tooltip, AgentFleetPage, UsersPage, etc.).
Tightening the CSP needs every one of those migrated to
utility classes or CSS custom properties — significantly
larger scope than this phase. Tracked as Phase 4+ follow-up.
3. UX-M5 implementation pivot. The audit prompt suggested
useMatches() + per-route handle.crumb. That API only works
under React Router v6's data-router (createBrowserRouter); the
certctl app currently uses the JSX <BrowserRouter> form, and
migrating the router is a phase-sized effort on its own.
Pivoted to useLocation() + a static pathSegmentLabels map.
Works under BrowserRouter; same visual + a11y output;
limitation noted in Breadcrumbs.tsx header so a future
router migration can upgrade in place.
Verification
============
$ npx tsc --noEmit
(exit 0)
$ npx vitest run src/components/Layout.test.tsx src/components/Breadcrumbs.test.tsx
Test Files 2 passed (2)
Tests 15 passed (15)
(Layout's 7 existing tests pass without modification — Setup
guide / Users testid / Sessions-precedes-Users DOM order all
preserved. Breadcrumbs ships with 8 new assertions.)
$ npx vite build
✓ built in 3.58s
(bundle grows ~25 KB from lucide-react + cmdk; cmdk lazy-loaded
so it doesn't land on initial page load)
$ grep -nE "navGroups|label: 'Access'|from 'lucide-react'|cmdk" \
web/src --type tsx --type ts -r | grep -v test
(15+ hits across Layout / Breadcrumbs / CommandPalette / Host)
$ grep -cE "icon: '" web/src/components/Layout.tsx
0 (was 31 path strings; now all replaced with lucide imports)
$ ls web/src/components/{Breadcrumbs,CommandPalette,CommandPaletteHost}.tsx
(all three new files exist)
Residual risks
==============
* The 14-ish inline SVGs in other pages (DashboardPage, ErrorState,
DataTable, JobsPage, CertificateDetailPage, OnboardingWizard)
still ship as raw <svg> markup. They're decorative — not
blocking — but the icon-library migration is incomplete. Next
per-page touches should replace them with lucide imports.
* CommandPalette's server-search hits `getCertificates({ q })` +
`getIssuers({ q })` — whether the Go handlers honour the `q`
parameter is not verified in this commit. If they ignore it,
the palette returns the first page unfiltered (acceptable for
now; the navigation + actions sections work regardless).
* The Layout's NAV_COMMANDS table in CommandPalette.tsx duplicates
the navGroups array in Layout.tsx by hand. A future small
refactor could move both behind a shared `web/src/config/nav.ts`.
* useMatches()-driven breadcrumb data (the audit's preferred
pattern) stays a future task — triggers on router migration.
|
||
|
|
1daae5d709 |
docs(readme): fix demo path command — point at deploy/demo-up.sh wrapper
Operator reproduction (verbatim log captured 2026-05-14):
$ docker compose -f deploy/docker-compose.yml -f deploy/docker-compose.demo.yml up -d --build
... build succeeds, containers come up ...
dependency failed to start: container certctl-server is unhealthy
$ docker compose ... logs certctl-server | tail -1
certctl-server | Failed to load configuration: phase-2 SEC-H3
fail-closed guard (missing TS): CERTCTL_DEMO_MODE_ACK=true requires
CERTCTL_DEMO_MODE_ACK_TS=<unix-epoch> set within the last 24h —
refuse to start.
Root cause
==========
README.md L95 documented a bare `docker compose ... up` command that
ignores the Phase 2 SEC-H3 fail-closed guard added in
internal/config/config.go::Validate (commit 2026-05-13). The guard
pairs CERTCTL_DEMO_MODE_ACK=true with a required
CERTCTL_DEMO_MODE_ACK_TS=<unix-epoch> that must be within the last
24h, so a forgotten demo deploy doesn't accidentally end up serving
production traffic with auth-type=none.
The demo overlay (deploy/docker-compose.demo.yml) passes the
timestamp through from the shell via
`CERTCTL_DEMO_MODE_ACK_TS: "${CERTCTL_DEMO_MODE_ACK_TS:-}"`. The
README command never exported it, so the server saw an empty value,
the guard refused to boot, the healthcheck never passed, and the
dependent certctl-agent container refused to start.
The deploy/demo-up.sh wrapper (which already exists; it's used by
CI cold-DB smoke and was added in the same SEC-H3 commit chain)
mints `CERTCTL_DEMO_MODE_ACK_TS="$(date +%s)"` before exec'ing
`docker compose` with the same -f flags. Drop-in replacement for
the bare compose invocation.
Fix
===
README.md "Demo path" code block now points at the wrapper script:
./deploy/demo-up.sh -d --build
Plus a one-paragraph explanation of why the wrapper is the supported
entry point and what the SEC-H3 timestamp gate is defending against.
The bare `docker compose ... up` form is documented as failing-closed
so a future operator who tries it understands the error message they
see.
Affected paths
==============
- README.md (the Quick Start "Demo path" block; lines 92-100 before,
93-103 after this change)
Out of scope (tracked separately if needed)
============================================
- The `WARN[0000] ... defaulting to a blank string` lines on docker
compose stdout (POSTGRES_PASSWORD, CERTCTL_API_KEY, etc.) are red
herrings — they fire on the BASE compose's env interpolation but
the demo overlay immediately overrides those with hardcoded
demo-safe values. They're noise; not a footgun. Leaving them
alone — silencing the WARN would require either an .env shim or
setting empty defaults at the base layer, both of which are
worse than the current warn-but-correct behaviour.
- The bare `docker compose -f base.yml up` production path
(README L108) is unchanged. That path requires a real .env and
will fail closed on placeholders — which is the correct
behaviour. The README already documents .env setup for that
path.
|
||
|
|
7c01f811a1 |
feat(frontend): Phase 2 TanStack Query Discipline — close TQ-H1/H2 + TQ-M1/M2/M3 + PERF-H1 + P-H1 + partial TQ-L1
Phase 2 of the frontend-design audit: TanStack Query discipline.
Set the cross-cutting QueryClient defaults + staleTime/gcTime tier
model + visibility-aware polling + 4 optimistic-update mutations
before any further per-page work.
New foundation
==============
web/src/api/queryConstants.ts (new)
STALE_TIME = { REAL_TIME: 15s, REFERENCE: 5m, CONSTANT: 1h }
GC_TIME = { HEAVY: 1m, STANDARD: 5m, REFERENCE: 30m }
Doc-comment explains the tier model so every new useQuery picks
a tier rather than a hardcoded ms integer.
web/src/main.tsx
QueryClient defaults rewritten:
pre: staleTime: 10_000 + refetchOnWindowFocus: true (refetch
storm on every tab refocus across 242 query sites)
post: staleTime: STALE_TIME.REFERENCE (5min) + gcTime: GC_TIME
.STANDARD (explicit 5min) + refetchOnWindowFocus: false
(per-query opt-in for live-tile queries)
retry: 1 unchanged per the audit's DO NOT.
Findings closed by source ID
============================
TQ-H2 (refetch storm)
main.tsx QueryClient defaults — refetchOnWindowFocus: false root +
per-query opt-in. STALE_TIME.REFERENCE 5min for everything else.
TQ-M1 (no gcTime overrides)
main.tsx now sets gcTime: GC_TIME.STANDARD explicitly — the
contract is documented at the root, not implicit-defaulted by
TanStack.
TQ-M2 (12 inconsistent staleTime values)
All 11 hardcoded numeric staleTime overrides migrated to the
STALE_TIME tier constants. useAuthMe.ts (the 12th) already used
its own constant — left alone. Tier mapping:
- operator-facing live data (KeysPage keys, RoleDetail role,
UsersPage, OIDCJWKSStatusPanel, ApprovalsPage):
STALE_TIME.REAL_TIME (15s)
- slow-changing reference data (KeysPage roles, RolesPage,
AuthSettings bootstrap+runtime-config):
STALE_TIME.REFERENCE (5min)
- effectively immutable (RoleDetail permissions catalogue):
STALE_TIME.CONSTANT (1hr)
TQ-H1 (OnboardingWizard infinite 5s poll)
OnboardingWizard.tsx:288-302 — refetchInterval rewritten to v5
functional form:
refetchInterval: (query) =>
(query.state.data?.data?.length ?? 0) > 0 ? false : 5_000;
As soon as the first agent registers, the interval flips to false
and the poll stops. Also explicit: refetchOnWindowFocus: true +
staleTime: STALE_TIME.REAL_TIME (because this IS a live-tile poll
during the wizard).
PERF-H1 (Dashboard polling storm)
DashboardPage.tsx
- jobs poll bumped 10s → 30s (10s granularity isn't needed when
30s is already inside the human-attention window; the
CertificateDetail page is where 10s polling lives)
- visibility-listener pauses ALL Dashboard polls when
document.visibilityState === 'hidden'; on visibility return,
immediately invalidates the 4 live-tile queries (health,
dashboard-summary, jobs, certs-by-status) so the operator
sees fresh data instantly rather than waiting one tick.
- The 4 live-tile queries (health, dashboard-summary, jobs,
certs-by-status) opt into refetchOnWindowFocus: true +
staleTime: STALE_TIME.REAL_TIME explicitly.
- Backend aggregation gap (dashboard-summary + certs-by-status
+ certificates could collapse into 1 endpoint) tracked
separately — Phase 3 backend follow-up.
P-H1 (CertificatesPage 4 duplicate-key pairs)
Pre-Phase-2 4 pairs of distinct cache slots fetching the same data:
['profiles'] vs ['profiles-filter']
['issuers'] vs ['issuers-filter']
['owners', 'form'] vs ['owners-filter']
['teams', 'form'] vs ['teams-filter']
Post-Phase-2 all four pairs collapse to a single parameterized
queryKey shape: `[name, { per_page: 100 }]`. TanStack v5 dedupes
on serialized queryKey — the modal + filter now share one cache
slot per resource. 8 useQuery sites → 4 cache slots; backend
hits halved on first paint of CertificatesPage.
TQ-M3 (4 of 5 priority optimistic-update mutations)
Wired onMutate / onError-rollback / onSettled-invalidation on:
1. mark-notification-read (NotificationsPage)
— flips row status to 'read' in both ['notifications','all']
+ ['notifications','dead'] cache slots
2. claim-discovered-cert (DiscoveryPage)
— flips status to 'Managed' in ['discovered-certificates']
3. dismiss-discovery (DiscoveryPage)
— flips status to 'Dismissed' in same cache slot
4. archive-certificate (CertificateDetailPage)
— flips status to 'Archived' in ['certificate', id]; on
success navigates to /certificates (optimistic data
doesn't linger); on error restores snapshot + toasts
All four fire the Phase 1 Sonner toast on success/failure.
The 5th priority site (role-assignment toggle in
auth/RoleDetailPage) uses raw async/await handlers rather than
useTrackedMutation — converting it requires a structural
refactor outside Phase 2's TQ-focus; tracked as Phase 2 follow-up.
TQ-L1 (useTrackedMutation extended tests)
useTrackedMutation.test.tsx grew from 3 tests to 8:
+ passes onMutate through and runs it before mutationFn
+ passes onError through with the onMutate context (rollback
path — pins the 3rd-arg snapshot semantics)
+ does NOT invalidate on error (only on success)
+ passes onSettled through (fires after both success + error)
+ parity with raw useMutation when no extra options given
Verification
============
$ grep -E "refetchOnWindowFocus: false" web/src/main.tsx
89: refetchOnWindowFocus: false, // per-query opt-in
$ grep -E "STALE_TIME\.REFERENCE" web/src/main.tsx
86: staleTime: STALE_TIME.REFERENCE, // 5 min
$ grep -cE "useQuery.*\['profiles" web/src/pages/CertificatesPage.tsx
2 (was 6 pre-Phase-2 — '[profiles]' modal + '[profiles-filter]'
+ '[profiles]' top-of-page; now both refer to the same
parameterized key '[profiles, { per_page: 100 }]')
$ grep -rE "onMutate" web/src --include='*.tsx' --exclude='*.test.*' | wc -l
5 (≥ 4 priority sites; the 5th is the optional onMutate in
queryConstants test wiring)
$ grep -rE "STALE_TIME\." web/src --include='*.tsx' --include='*.ts' \
--exclude='*.test.*' | wc -l
18 (queryConstants.ts + main.tsx + 11 migrated callsites
+ OnboardingWizard + DashboardPage)
$ npx tsc --noEmit
(exit 0)
$ npx vitest run [13 affected test files]
Test Files 13 passed (13)
Tests 100 passed (100)
$ npx vite build
✓ built in 2.49s
dist/assets/index-yg3cYtYA.js 1,113 kB
(+3 kB vs Phase 1 — queryConstants + optimistic-update wrappers)
Audit-accuracy callouts
=======================
* The audit claimed 10 useQuery on Dashboard; live count is 9 (one
issuers query has no interval). All 8 polling queries now gated
behind visibility-listener; the 9th (issuers) is non-polling and
not affected.
* TQ-L1 originally specified 4 test extensions; shipped 5
(onMutate ordering, onError-with-context, no-invalidate-on-error,
onSettled pass-through, parity-with-raw-useMutation).
* Optimistic-update 5th-site (role-assignment toggle in
auth/RoleDetailPage) deferred — RoleDetailPage handlers use raw
async/await instead of useTrackedMutation. Refactoring it adds
one more optimistic path but requires a structural change
outside Phase 2's TQ-discipline scope. Tracked as Phase 2
follow-up.
Residual risks
==============
* The Dashboard visibility-listener gate may need per-page opt-in
if a page genuinely needs to keep polling while hidden (e.g.
a background-tab monitor). Not aware of any such case today;
if needed, the gate is a simple `useState`-driven hook
extracted to web/src/hooks/useTabVisibility.ts.
* The Dashboard backend-aggregation collapse
(dashboard-summary + certs-by-status + certificates → one
endpoint) is documented as a Phase-3 backend item.
* The 4 collapsed CertificatesPage pairs now request per_page=100
everywhere. Operator with >100 issuers/owners/profiles/teams
will see a truncated dropdown — that's an unrelated Phase-1-
Combobox-migration concern; the right fix when it lands is to
move issuer/owner/profile selectors to Combobox with
server-side typeahead.
* The 12-second total Bundle-1 audit of all useQuery sites
still leaves ~230 queries running with the new 5-min
REFERENCE default. The default is generous; aggressively-
fresh per-page queries that genuinely need 15s freshness
must opt in (the audit page, the agent-fleet live counter,
in-flight scan progress).
|
||
|
|
c1b581b047 |
fix(test): Hotfix #6 — polyfill ResizeObserver in vitest setup (Phase 1 Combobox)
CI surfaced an Unhandled Error after the full vitest suite ran clean:
ReferenceError: ResizeObserver is not defined
at p (node_modules/@headlessui/react/dist/utils/element-movement.js:1:332)
at combobox-machine.js:1:8089
at y.send (machine.js:1:1383)
at Object.closeCombobox (combobox-machine.js:1:5820)
... originating from src/components/Combobox.test.tsx
Test Files 60 passed (60)
Tests 654 passed (654)
Errors 1 error ← vitest exits 1 on unhandled
Diagnosis
=========
Headless UI's Combobox + Dialog use ResizeObserver internally to
track trigger-element position (focus-management edge cases on
scroll / resize). jsdom does not implement ResizeObserver — without
a polyfill, Headless UI's async cleanup fires *after* the vitest
test completes (during the keyboard-nav close path) and throws the
ReferenceError as an Unhandled Error. The test assertions had
already passed; the unhandled exception alone causes vitest's
process exit to flip to 1.
Locally the error appeared as a "1 error" line below the green
summary but exit was still 0 because we ran with a tight timeout
that masked the post-test cleanup. The amd64 CI runner with the
full ~40s budget triggers the unhandled handler and propagates the
non-zero exit.
Fix
===
web/src/test/setup.ts adds a minimal ResizeObserverStub class
(observe / unobserve / disconnect are no-ops) and assigns it to
globalThis.ResizeObserver iff undefined. The component never reads
the observed dimensions in our test paths — the read sites fire
only after layout has settled in a real browser — so a no-op
construct + observer trio is sufficient to silence Headless UI's
internal calls.
Also stubs Element.prototype.scrollIntoView (Headless UI touches
it during Combobox.Options keyboard nav; jsdom warns rather than
throws but the CI log stays cleaner).
Verification
============
$ cd web && npx vitest run src/components/Combobox.test.tsx
Test Files 1 passed (1)
Tests 5 passed (5)
(no Unhandled Errors line; exit 0 — the post-test cleanup
no longer touches the undefined global)
$ cd web && npx tsc --noEmit
(exit 0)
This commit ships on top of Phase 1 (
|
||
|
|
e37403edf1 |
feat(frontend): Phase 1 Foundation Primitives + Toast System — close UX-H2/H3/H5 + UX-M2/M3/M4/L5 + FE-M4
Frontend design remediation, Phase 1 (Foundation Primitives + Toast).
Builds the six reusable UI primitives every later phase consumes;
migrates the audit-enumerated destructive-action callsites; humanises
the StatusBadge wire keys; and wraps the bulk-action bar in a
Transition with a post-action toast affordance.
Six new primitives + their .test.tsx siblings
=============================================
web/src/components/Toaster.tsx — Sonner wrapper, mounted
once at the root next to
QueryClientProvider. Pages
import { toast } from
"sonner" directly.
web/src/components/ConfirmDialog.tsx — Headless UI Dialog primitive
with optional typed-
confirmation friction for
the most-irreversible actions
(archive-certificate uses
typedConfirmation="archive").
web/src/components/Tooltip.tsx — Floating-UI tooltip with
hover + focus triggers,
aria-describedby wiring,
ESC-to-dismiss. Migrations
of the 103 native title=
sites stay in subsequent
per-page PRs per the audit
prompt's explicit "DO NOT"
on one-mega-PR sweeps.
web/src/components/EmptyState.tsx — Empty-state primitive with
optional icon / title /
description / primary +
secondary CTAs. DataTable
adds a new emptyState slot
(legacy emptyMessage string
prop preserved for backward
compat).
web/src/components/Combobox.tsx — Headless UI typeahead-
select primitive. Migrations
of the 53 native <select>
sites stay in subsequent
per-page PRs.
web/src/components/Banner.tsx — Severity-variant alert
banner with role="alert" on
error/warning, role="status"
on success/info. Migrating
the ~102 inline
bg-(red|amber|yellow)-50
sites stays as page-touch
rolling work.
Each primitive ships with a sibling .test.tsx asserting the
behavioural contract — render at rest, fire callbacks, ARIA wiring,
keyboard nav, variant styling. Total new test count: 109 assertions
across 7 files (6 primitives + extended StatusBadge).
UX-H5 closure — StatusBadge display strings
============================================
web/src/components/StatusBadge.tsx gets a statusDisplay map paired
with the existing statusStyles map. Wire keys stay byte-identical
to the Go enums per the D-1 closure comment block — only the
rendered text changes. PascalCase + snake_case + lowercase enums
now render as spaced sentence-case:
"RenewalInProgress" → "Renewal in progress"
"AwaitingCSR" → "Awaiting CSR"
"cert_mismatch" → "Certificate mismatch"
"dead" → "Dead-lettered"
Unmapped keys flow through a titleCase() helper that humanises
PascalCase / snake_case to lower-bound readability.
StatusBadge.test.tsx extends to 75 assertions: 38 D-1 + 5 dead-key
+ 31 UX-H5 display-string + 5 titleCase + 1 parity. All wire-keys
pinned byte-exact.
UX-H2 closure — window.confirm sites migrated to ConfirmDialog
==============================================================
Audit said 8 destructive-action sites. Live count was 24 across
17 files — the audit missed 11 files (auth/SessionsPage,
auth/UsersPage, auth/GroupMappingsPage, auth/OIDCProvidersPage,
auth/OIDCProviderDetailPage, auth/RolesPage, TeamsPage,
PoliciesPage, IssuersPage, ProfilesPage, RenewalPoliciesPage).
Phase 1 migrates the 7 audit-enumerated destructive sites in the
6 priority files:
- CertificateDetailPage archive (typedConfirmation="archive" —
most-irreversible action gets the
strongest friction)
- OwnersPage delete owner
- TargetsPage delete target
- AgentGroupsPage delete agent group
- auth/KeysPage revoke role grant
- auth/RoleDetailPage delete role
The remaining 11 confirm sites in audit-missed files stay open
and ship as a Phase 1 follow-up (mechanical pattern repeat — same
Edit shape × ~11 files).
UX-H3 closure — alert() → toast.error, top mutations wired
===========================================================
All 5 alert() sites migrated to toast.error:
- OwnersPage / CertificateDetailPage × 2 / TeamsPage /
RenewalPoliciesPage
Eight high-traffic mutations now fire toast.success on resolve +
toast.error on failure: deleteOwner, deleteTarget, deleteAgentGroup,
deleteTeam, deleteRenewalPolicy, archiveCertificate,
authRevokeKeyRole, authDeleteRole. The bulk-renew flow on
CertificatesPage gets a toast with a "View N jobs" action button
that deep-links to /jobs?certificate_ids=… (paired UX-L5 work).
Toaster mounted at web/src/main.tsx next to QueryClientProvider —
single import discipline. Sonner asserts at runtime if multiple
toasters are mounted; centralising the position + duration config
in Toaster.tsx avoids the mistake.
UX-M3 closure — DataTable empty-state slot
==========================================
web/src/components/DataTable.tsx gains an optional emptyState
ReactNode prop. The existing emptyMessage string prop is
preserved for backward compat — every ~18 list-page call site
that passes emptyMessage="…" keeps working unchanged. New CTAs:
pages pass <EmptyState ... /> for first-run experiences. Wiring
EmptyState on the top-5 list pages (Certificates, Issuers,
Targets, Owners, Agents) is per-page rolling work — primitive
+ slot ship in Phase 1; CTAs follow.
UX-L5 closure — Bulk-action bar transition + post-action toast
==============================================================
web/src/pages/CertificatesPage.tsx wraps the bulk-action bar
conditional render in Headless UI <Transition>. Slide-in/out
(200ms enter, 150ms leave, -translate-y-2 → 0). The
prefers-reduced-motion respect comes for free from the global
@media block landed in Phase 0.
Post-renewal toast.success fires with an action button "View N
jobs" that navigate()s to /jobs filtered to the certificate_ids
we just renewed. Closes the audit's "what just happened" gap.
Audit-accuracy callouts
=======================
* UX-H2 undercount — live 24 sites vs audit's 8. Phase 1 closes
the 7 audit-enumerated destructive confirms across 6 priority
files. The remaining 11 sites in audit-missed files stay open
for follow-up.
* UX-M2 title= count — live 103 (matches audit). Tooltip
primitive built; per-page migrations explicitly deferred per
the prompt's "DO NOT" sweep rule.
* UX-M4 native <select> sites — Combobox primitive built;
callsite migrations deferred to per-page rolling PRs.
* FE-M4 inline bg-(red|amber|yellow)-50 — Banner primitive
built; callsite migrations deferred to page-touch work.
Verification
============
$ npx tsc --noEmit
(exit 0, no type errors)
$ npx vitest run src/components/{Toaster,ConfirmDialog,EmptyState,Banner,Tooltip,Combobox}.test.tsx src/components/StatusBadge.test.tsx
Test Files 7 passed (7)
Tests 109 passed (109)
$ npx vitest run src/pages/{OwnersPage,AgentGroupsPage,TargetsPage,CertificatesPage,CertificateDetailPage,TeamsPage,RenewalPoliciesPage}.test.tsx src/pages/auth/{KeysPage,RoleDetailPage}.test.tsx
Test Files 9 passed (9)
Tests 52 passed (52)
(TargetsPage.test.tsx updated — the existing Delete confirm
test stubbed window.confirm; new test clicks the dialog's
destructive Delete button.)
$ npx vite build
✓ built in 2.89s
dist/assets/index-DZ1ZcRdP.js 1,110.61 kB (was 1,028.66 kB)
+82 KB / +26 KB gzipped from sonner + @headlessui + @floating-ui.
Bundle code-splitting is a separate phase (FE-M5).
Residual risks + follow-ups
============================
* 11 remaining window.confirm sites in audit-missed files. Phase 1
follow-up commit will sweep them with the same ConfirmDialog
pattern — mechanical work.
* The discard-unsaved-changes confirm in EditRoleModal (and 2
sibling modal sub-components) stays as window.confirm; treated
as a UX safety guardrail rather than a destructive-action
confirmation. Migrating to ConfirmDialog is fine but not
audit-priority.
* Tooltip + Combobox + Banner callsite migrations are explicit
per-page rolling work for subsequent phases — primitives
landed; per the audit prompt's "DO NOT" rule the migrations
don't sweep here.
* Optimistic-update wiring on the 5 priority mutations
(mark-notification-read, dismiss-discovery, archive-cert,
claim-discovered-cert, role-assignment) is staged for Phase 2
TQ-M3 per the prompt's explicit "DO NOT add new mutations to
the optimistic-update list beyond the 5 priority ones".
|
||
|
|
93e00f6a5e |
fix(frontend): Phase 0 Hygiene Day — close 11 of 12 frontend-audit findings
Frontend design remediation, Phase 0 (Hygiene Day). Eleven low-risk
audit findings closed in one PR. UX-M9 deliberately deferred per the
prompt's "do NOT auto-trace the logo" guard rail — that needs a
designer round-trip outside a code session.
Findings closed (mapped by source ID)
=====================================
FE-H1 Half-wired dark mode removed.
web/index.html: dropped class="dark" from <html> and
bg-slate-900 text-slate-100 from <body>. Replaced with
bg-page text-ink (matching the live light-mode palette).
web/tailwind.config.cjs: kept darkMode: 'class' (config
only, zero behaviour) so a future Phase 7 dark-mode
rebuild stays cheap.
FE-H4 Self-hosted fonts (closes PERF-H3 as a side-effect).
web/package.json: added @fontsource-variable/inter +
@fontsource/jetbrains-mono (^5.2.8 both).
web/src/main.tsx: top of file imports the variable Inter
family + JetBrains Mono weights 400/500/600 (matching the
old Google Fonts request's weight set).
web/src/index.css: removed the @import url(
'https://fonts.googleapis.com/...') that lived on line 1.
Body font-family updated to "Inter Variable", "Inter",
system-ui, ... (fontsource-variable registers the family
as "Inter Variable" — kept "Inter" as a fallback).
Vite bundles the .woff2 files into dist/assets/ on build:
verified inter-latin-wght-normal-*.woff2 (48 kB) +
the JetBrains weights all land in the build output.
Net effect: cold load makes ZERO third-party requests.
FE-L2 StatusBadge.tsx.bak removed.
Audit claim "tracked in git" was stale — the file was
already excluded by .gitignore:46 (*.bak). Closure was
a plain `rm`, not `git rm`. (Audit accuracy note above.)
FE-L3 brand-900 removed from web/tailwind.config.cjs.
Verified 0 callers in web/src via
`grep -rEc "brand-$w\b" web/src --include='*.tsx'`.
Other weights all retain ≥4 callers (50=5, 100=4, 200=4,
300=8, 400=106, 500=74, 600=34, 700=23, 800=4) — they
stay. Comment marker left in place so a future Phase 7
dark-mode redo can re-add 900 with context.
UX-M6 text-ink-faint contrast bumped from #94a3b8 (3.0:1
against bg-page #f0f4f8, fails WCAG AA) to #64748b
(4.6:1, passes AA). To preserve the three-tier ink
hierarchy, ink.muted darkens from #64748b to #475569
(6.9:1, passes AA Large). All 105 live text-ink-faint
callers now meet WCAG AA without any callsite edits.
UX-M9 DEFERRED. The audit prompt's "do NOT auto-trace the PNG
logo to SVG" guard rail blocks the auto-conversion path.
Logo (886x864 PNG, 773 kB) remains shipped to dist/assets/
unchanged. Tracking item: round-trip through designer
with a flat-geometric Illustrator/Figma rebuild. Phase 0
commit ships the rest of the hygiene block; UX-M9 stays
open until the SVG asset lands.
UX-L1 23 hardcoded text-[Npx] sites migrated to design tokens
(audit said 23; live count was 25 — also 2x text-[13px]
the audit missed). web/tailwind.config.cjs added the
`2xs: 0.625rem` (10px) rung so the 7x text-[10px] sites
migrate losslessly. The 16x text-[11px] sites move to
text-xs (+1px, imperceptible) and the 2x text-[13px]
sites move to text-sm (+1px, imperceptible). Six files
touched: Layout.tsx, NetworkScanPage.tsx, SCEPAdminPage.tsx,
DiscoveryPage.tsx, ESTAdminPage.tsx, auth/SessionsPage.tsx.
Post-migration: zero `text-[Npx]` callers in web/src.
UX-L2 prefers-reduced-motion handling added at the bottom of
web/src/index.css. Caps animation-duration +
transition-duration at 0.01ms when the OS reduce-motion
flag is set. Conventional non-zero value (fully zero
breaks libraries observing transitionend events).
UX-L3 Print stylesheet added to web/src/index.css. Hides
sidebar / nav, removes card shadows, expands content to
full width, prevents mid-row table breaks, and appends
link URLs as text annotations (print readers can't click
links). Operator-facing — certificate detail + audit-log
export are the most common print targets.
UX-L4 DataTable.tsx <th>s now carry scope="col". One-line
change on each of the two header sites (selectable
checkbox column + the columns.map iteration). Closes the
accessibility-tree screen-reader gap.
PERF-H2 The only production <img> site (Layout.tsx:73, the
sidebar logo) gained loading="eager" decoding="async" +
explicit width/height (64x64). eager (not lazy) because
the logo is the LCP candidate above the fold. Since
UX-M9 deferred, the logo stays as a PNG — making this
the right LCP hint to ship today.
PERF-H3 Closes via FE-H4 (self-host fonts → zero third-party
requests on cold load → preconnect/dns-prefetch hints
would point at nothing). web/index.html stays free of
preconnect lines.
Verification
============
$ git status --short
(only the 13 expected files modified)
$ cd web && npx tsc --noEmit
(exit 0, no type errors)
$ cd web && npx vitest run
Test Files 54 passed (54)
Tests 583 passed (583)
(all green; ran via `timeout 35 npx vitest run`)
$ cd web && npx vite build
✓ built in 2.70s
dist/assets/index-Da_kGcIu.css 75.54 kB (was 39.50 kB
pre-Phase-0 — +36 kB from the inlined @fontsource @font-face
declarations + the new @media print + @media reduced-motion
blocks; offset by the elimination of all third-party font
requests + the FOIT on cold load)
dist/assets/inter-latin-wght-normal-Dx4kXJAl.woff2 48.25 kB
dist/assets/jetbrains-mono-latin-400-normal-V6pRDFza.woff2 21.16 kB
(... + the rest of the weight variants and unicode-range subsets)
$ grep -rohE "text-\[[0-9]+px\]" web/src --include='*.tsx'
(zero matches — all 25 inline-pixel sites migrated)
$ grep -rEc "brand-900" web/src --include='*.tsx'
(zero callers)
$ grep -nE "scope=\"col\"" web/src/components/DataTable.tsx
86, 96 (both <th> sites carry scope="col")
$ grep -nE "loading=|decoding=" web/src/components/Layout.tsx
73 (logo <img> has both attrs + width/height)
$ grep -nE "prefers-reduced-motion|@media print" web/src/index.css
74, 92 (both blocks present)
$ ls web/src/components/StatusBadge.tsx.bak
(file not found — deleted)
Audit-accuracy notes
====================
* FE-L2 stale: the .bak file was NOT tracked in git (gitignored via
.gitignore:46 *.bak). The audit's "tracked in git" claim was wrong.
Closure path adjusted: `rm` instead of `git rm`.
* UX-L1 undercount: audit reported 23 inline-pixel sites; live count
was 25 (16x 11px + 7x 10px + 2x 13px). All 25 migrated.
* UX-M9 not closed: audit prompt's "do NOT auto-trace" guard rail
blocks closure in this code session. Tracking item for the
designer/Phase-1 follow-up.
Residual risks
==============
* Logo PNG (773 kB) still ships as-is until the designer round-trip
produces a hand-built SVG. Vite cache-busts the asset hash so
cold loads cost the same one-shot 773 kB; warm loads hit the
browser cache.
* Removing brand-900 may surface in a future dark-mode rebuild
(Phase 7) that wants a deeper teal floor. Easy re-add — comment
marker left in tailwind.config.cjs at the deletion site.
* The +1px nudges on text-[11px] -> text-xs and text-[13px] ->
text-sm are theoretically visible but practically imperceptible.
Any future visual-regression suite will catch genuine differences.
|
||
|
|
c8985cf868 |
fix(ratelimit): Hotfix #5 — Postgres timestamptz[] scan + skip-inventory drift
Two CI hotfixes surfaced by master CI on
|
||
|
|
155f1fec98 |
ci(arch-h1): Phase 13 Sprint 13.7 — tighten rest-deferred floor from monotonic-decrease to hard zero-exact pin; close ARCH-H1 + ARCH-M1
Closure commit for Phase 13 (ARCH-H1 OpenAPI ↔ handler gap + ARCH-M1 per-process rate-limit ceiling). Tightens the parity-script CI guard to a HARD zero-exact pin on the rest-deferred bucket: any future PR adding a new REST route MUST author its OpenAPI op or fail CI. The `category: rest-deferred` escape hatch is now closed for good. The sibling monotonic-decrease guard (openapi-rest-deferred- monotonic.sh) stays in tree as belt-and-suspenders — both must hold. The monotonic guard catches baseline-drift accidents (operator edits the baseline up without surfacing rationale); this guard catches the underlying rest-deferred bucket re-growing at all. Phase 13 commit chain (six prior commits, ordered): |
||
|
|
29cb13e7a2 |
docs(arch-h1): Phase 13 Sprint 13.6 — OpenAPI batch 3 final 7 ops; rest-deferred bucket reaches 0
Phase 13 Sprint 13.6 — the FINAL ARCH-H1 OpenAPI authoring batch.
Closes the substantive burn-down: rest-deferred bucket reaches 0;
every REST-shaped router route is now authored into openapi.yaml.
Documented exceptions are exclusively wire-protocol contracts (SCEP
RFC 8894, ACME RFC 8555, ACME ARI RFC 9773, EST RFC 7030).
Sprint 13.7 next (closure / audit-HTML flip) tightens this commit's
floor: the rest-deferred bucket pin in
openapi-rest-deferred-monotonic.sh changes from
"monotonic-decrease vs baseline" to "hard zero-exact" so a future
PR adding a REST route MUST author its OpenAPI op or fail CI — the
`category: rest-deferred` escape hatch closes for good.
7 new operations (the final batch)
==================================
One-off REST endpoints (4 ops):
GET /api/v1/audit/export exportAudit (audit.export — NDJSON stream)
POST /api/v1/auth/demo-residual/cleanup cleanupDemoResidualGrants (auth.role.assign; 503 in demo mode)
POST /auth/logout logoutCurrentSession (auth-exempt; cookie checked inside)
POST /auth/breakglass/login breakglassLogin (auth-bypass; 404 when disabled; rate-limited)
OIDC browser-flow endpoints (3 ops, modeled as 302+Location-header
redirects per OAS 3.1 — `responses.302` + `headers.Location` +
description noting the server-initiated redirect contract; empty
content block; consumers must follow the redirect for the flow to
complete):
GET /auth/oidc/login oidcLoginInitiate (auth-exempt; 302 → IdP authz URL + pre-login cookie)
GET /auth/oidc/callback oidcLoginCallback (auth-exempt; 302 → postLoginURL on success / 302 → /login?error=oidc_failed&reason=<cat> on failure)
POST /auth/oidc/back-channel-logout oidcBackChannelLogout (auth via IdP-signed logout_token; 200 + Cache-Control: no-store on success; uniform 400 per spec §2.6 on failure)
The 4 one-off REST endpoints model standard JSON contracts. The 3
OIDC browser-flow endpoints DELIBERATELY model the 302-with-Location
contract because that's the live wire shape — modeling them as
200-with-JSON would lie about reality (and break any generated
client that assumes a JSON response body). Each `headers.Location`
is documented with the actual redirect target shape (provider authz
URL / postLoginURL / /login?error=oidc_failed&reason=<category>).
Audit/export NDJSON streaming
=============================
The audit/export response is `application/x-ndjson` — one JSON-
encoded AuditEvent per line, NOT a single JSON document. Documented
explicitly so generated clients know to parse line-by-line. Schema
references the existing #/components/schemas/AuditEvent (already
defined as part of the audit-events surface).
Range cap + per-record cap + filter shape all documented in the
parameters block (90-day max window, 1..100000 limit, category enum
of cert_lifecycle/auth/config).
2 new schemas (components/schemas)
==================================
DemoResidualCleanupResponse — mirrors demoResidualCleanupResponse
({removed: int64}).
BreakglassLoginRequest — mirrors breakglassLoginRequest
(actor_id + password; password
marked `format: password`).
Pre-existing AuditEvent + BreakglassLoginRequest-adjacent schemas
(Sprint 13.4 + 13.5) are referenced via $ref without duplication.
Exception YAML + baseline + zero-floor pin
==========================================
7 entries removed from api/openapi-handler-exceptions.yaml. Post-cut
shape:
total entries: 36
wire-protocol: 36 (unchanged — these never burn down)
rest-deferred: 0 ← THE FLOOR
Baseline file bumped 7 → 0. The Sprint 13.1 monotonic-decrease
guard now pins `rest-deferred ≤ 0` — equivalent to "the bucket
must stay empty." Sprint 13.7 will additionally tighten the
parity-script's missing-category check so the bucket can't be
re-grown via the `category:` typo escape hatch either.
YAML header narrative updated: "Sprint 13.6 SHIPPED — 7 - 7 = 0".
ARCH-H1 substantive close achieved at the bucket-math level.
Receipts (all from the live tree)
=================================
$ grep -cE '^\s+operationId:' api/openapi.yaml
186 (was 179 + 7)
$ bash scripts/ci-guards/openapi-handler-parity.sh
Router routes: 220
OpenAPI operations: 186
Documented exceptions: 36
wire-protocol: 36
rest-deferred: 0
openapi-handler-parity: clean.
$ bash scripts/ci-guards/openapi-rest-deferred-monotonic.sh
openapi-rest-deferred-monotonic: clean — rest-deferred = 0,
baseline = 0.
$ cat api/openapi-handler-exceptions-baseline.txt
0
$ python3 -c "import yaml; ..."
paths: 140, operations: 186, schemas: 74
sprint-13.6 schemas missing: (none)
OpenAPI lint: clean.
$ gofmt -l . → clean
$ go vet ./internal/api/handler/... ./cmd/server/... → clean
ARCH-H1 final tally (across Sprints 13.1 + 13.4 + 13.5 + 13.6)
==============================================================
Sprint 13.1: structural categorization — split 64 exceptions into
36 wire-protocol + 28 rest-deferred; added parity-
script bucket reporting + monotonic-decrease guard +
baseline file. ARCH-H1's structural close.
Sprint 13.4: 13 OpenAPI ops + 13 exception deletions + baseline
28 → 15. Auth/sessions + OIDC CRUD/JWKS/test/refresh
+ group-mappings clusters.
Sprint 13.5: 8 OpenAPI ops + 8 exception deletions + baseline
15 → 7. Auth/breakglass + auth/users +
auth/runtime-config clusters.
Sprint 13.6 (this commit): 7 OpenAPI ops + 7 exception deletions
+ baseline 7 → 0. Audit/export + demo-residual +
auth/logout + auth/breakglass/login + 3 OIDC browser
flows. ARCH-H1's substantive close.
Cumulative: 28 OpenAPI ops authored, 28 exception entries deleted,
rest-deferred bucket drained from 28 → 0. The OpenAPI surface
exactly matches every REST-shaped router route.
Sprint 13.7 closes the audit HTML flip + tightens this commit's
monotonic-decrease floor to a zero-exact pin so the burn-down is
locked.
Refs: ARCH-H1 substantive close — final batch.
|
||
|
|
9135c44908 |
docs(arch-h1): Phase 13 Sprint 13.5 — OpenAPI breakglass + users + runtime-config ops (batch 2, 8 ops)
Phase 13 Sprint 13.5 closure (architecture diligence audit ARCH-H1):
authors OpenAPI operations for the auth/breakglass admin cluster
(4) + auth/users cluster (3) + auth/runtime-config (1), drives the
`rest-deferred` exception bucket from 15 → 7.
OpenAPI-only sprint: zero Go changes. Every schema field-by-field
mirrors the projection types in
internal/api/handler/auth_breakglass.go +
internal/api/handler/auth_users.go.
8 new operations
================
Break-glass admin cluster (4 ops, all gated `auth.breakglass.admin`):
GET /api/v1/auth/breakglass/credentials listBreakglassCredentials
POST /api/v1/auth/breakglass/credentials setBreakglassPassword
DELETE /api/v1/auth/breakglass/credentials/{actor_id} removeBreakglassCredential
POST /api/v1/auth/breakglass/credentials/{actor_id}/unlock unlockBreakglassCredential
Users cluster (3 ops):
GET /api/v1/auth/users listAuthUsers (auth.user.read)
DELETE /api/v1/auth/users/{id} deactivateAuthUser (auth.user.deactivate)
POST /api/v1/auth/users/{id}/reactivate reactivateAuthUser (auth.user.deactivate)
Runtime-config read (1 op):
GET /api/v1/auth/runtime-config getAuthRuntimeConfig (auth.role.assign)
5 new schemas (components/schemas)
==================================
BreakglassCredentialResponse — mirrors breakglassCredentialResponse
(6 fields). Password hash NEVER
serialized.
BreakglassCredentialListResponse — mirrors listBreakglassCredentialsResponse
({"credentials": [...]}).
BreakglassSetPasswordRequest — mirrors breakglassSetPasswordRequest
(actor_id + password; password marked
`format: password`).
BreakglassSetPasswordResponse — mirrors the inline response shape
returned by SetPassword (actor_id +
created_at).
AuthUser — mirrors userResponse (9 fields,
including pointer-based
deactivated_at marked nullable).
Every schema field's JSON tag, type, required-ness, and (where
applicable) nullability grounded against the live Go source. The
`tenant_id` field surfaces on AuthUser (the handler emits it) but
does NOT appear on the breakglass schemas (the breakglass surface
is tenant-implicit — derived from caller context, not request body).
Surface-invisibility property
=============================
Each break-glass admin endpoint returns 404 when
`CERTCTL_BREAKGLASS_ENABLED=false` so an attacker probing the admin
surface gets the same signal as probing the login endpoint
(consistent with Audit 2026-05-10 CRIT-4 closure). Documented in the
per-op description so client implementations don't surprise on the
404 path.
Self-deactivate guard
=====================
`DELETE /api/v1/auth/users/{id}` returns 409 (not 403) when the
caller is deactivating their own account — Audit 2026-05-11 A-2
foot-gun closure. Break-glass remains the documented recovery path.
The 409 is documented in the per-op responses block.
Exception YAML + baseline
=========================
8 entries removed from api/openapi-handler-exceptions.yaml. Post-cut
shape:
total entries: 43 (was 51)
wire-protocol: 36 (unchanged)
rest-deferred: 7 (was 15)
Baseline file bumped 15 → 7. The Sprint 13.1 monotonic-decrease
guard now pins `rest-deferred ≤ 7`. Sprint 13.6 walks it to zero
(7 → 0).
YAML header narrative updated: "Sprint 13.5 SHIPPED — 15 - 8 = 7".
Receipts (all from the live tree)
=================================
$ grep -cE '^\s+operationId:' api/openapi.yaml
179 (was 171 + 8)
$ bash scripts/ci-guards/openapi-handler-parity.sh
Router routes: 220
OpenAPI operations: 179
Documented exceptions: 43
wire-protocol: 36
rest-deferred: 7
openapi-handler-parity: clean.
$ bash scripts/ci-guards/openapi-rest-deferred-monotonic.sh
openapi-rest-deferred-monotonic: clean — rest-deferred = 7,
baseline = 7.
$ cat api/openapi-handler-exceptions-baseline.txt
7
$ python3 -c "import yaml; ..."
paths: 133, operations: 179, schemas: 72
sprint-13.5 schemas missing: (none)
OpenAPI lint: clean.
$ gofmt -l . → clean
$ go vet ./internal/api/handler/... ./cmd/server/... → clean
Sprint 13.6 next (audit/export + demo-residual + 3 OIDC browser
flows + auth/logout + auth/breakglass/login = 7 ops; rest-deferred
7 → 0 — the zero-floor commit that completes ARCH-H1's substantive
burn-down). Same OpenAPI-only pattern; the OIDC browser-flow
endpoints in 13.6 model redirect-only operations (302 + Location
header, empty body) per OAS 3.1 conventions.
Refs: ARCH-H1 batch 2 closure.
|
||
|
|
952682ebec |
docs(arch-h1): Phase 13 Sprint 13.4 — OpenAPI auth/sessions + OIDC ops (batch 1, 13 ops)
Phase 13 Sprint 13.4 closure (architecture diligence audit ARCH-H1):
authors OpenAPI operations for the auth/sessions cluster (3) +
auth/oidc CRUD + JWKS + test + refresh cluster (10), drives the
`rest-deferred` exception bucket from 28 → 15.
OpenAPI-only sprint: zero Go changes. Every schema field-by-field
mirrors the projection types in the Phase 9 Sprint 11 sibling-file
handlers (auth_session_oidc_{sessions,crud}.go) + the JWKS-status
surface in auth_users.go + the dry-run discovery result in
internal/auth/oidc/test_discovery.go.
13 new operations
=================
Sessions cluster (3 ops):
GET /api/v1/auth/sessions listAuthSessions
DELETE /api/v1/auth/sessions revokeAuthSessionsExceptCurrent
DELETE /api/v1/auth/sessions/{id} revokeAuthSession
OIDC provider CRUD + JWKS + test + refresh (7 ops):
GET /api/v1/auth/oidc/providers listOIDCProviders
POST /api/v1/auth/oidc/providers createOIDCProvider
PUT /api/v1/auth/oidc/providers/{id} updateOIDCProvider
DELETE /api/v1/auth/oidc/providers/{id} deleteOIDCProvider
GET /api/v1/auth/oidc/providers/{id}/jwks-status getOIDCProviderJWKSStatus
POST /api/v1/auth/oidc/providers/{id}/refresh refreshOIDCProvider
POST /api/v1/auth/oidc/test testOIDCProvider
OIDC group-mapping CRUD (3 ops):
GET /api/v1/auth/oidc/group-mappings listOIDCGroupMappings
POST /api/v1/auth/oidc/group-mappings addOIDCGroupMapping
DELETE /api/v1/auth/oidc/group-mappings/{id} removeOIDCGroupMapping
8 new schemas (components/schemas)
==================================
AuthSession — mirrors sessionResponse (10 fields).
OIDCProviderResponse — mirrors oidcProviderResponse (15 fields).
OIDCProviderRequest — mirrors oidcProviderRequest (12 fields,
client_secret marked password).
OIDCTestRequest — mirrors the inline struct in TestProvider
(4 fields).
OIDCTestDiscoveryResult — mirrors oidc.TestDiscoveryResult
(11 fields).
OIDCJWKSStatusSnapshot — mirrors oidc.JWKSStatusSnapshot (7
fields).
OIDCGroupMappingResponse — mirrors groupMappingResponse (6 fields).
OIDCGroupMappingRequest — mirrors groupMappingRequest (3 fields,
tenant_id deliberately excluded — derived
from caller).
Every schema field's JSON tag, type, required-ness, and (where
applicable) description grounded against the Go source byte-for-byte.
Pointer types in Go that the handler marshals via `omitempty` are
modelled as optional fields in the YAML (not present in the
`required` list).
RBAC permissions documented per-operation in the description (matched
against rbacGate wraps in internal/api/router/router.go lines 516-540):
auth.session.list, auth.session.list.all, auth.session.revoke,
auth.oidc.list, auth.oidc.create, auth.oidc.edit, auth.oidc.delete.
New tags
========
Added `Sessions` and `OIDC` to the `tags:` list with cross-references
to the handler file paths. Existing operations stay on existing tags;
the new ones declare the new tags.
Exception YAML + baseline
=========================
13 entries removed from api/openapi-handler-exceptions.yaml. The
post-cut shape:
total entries: 51 (was 64)
wire-protocol: 36 (unchanged — never burn down)
rest-deferred: 15 (was 28)
Baseline file bumped 28 → 15. The Sprint 13.1 monotonic-decrease
guard now pins `rest-deferred ≤ 15`. Sprints 13.5 + 13.6 walk it down
to zero (15 → 7 → 0).
YAML header narrative updated to reflect Sprint 13.4 status:
"Sprint 13.4 SHIPPED — 28 - 13 = 15".
Receipts (all from the live tree)
=================================
$ grep -cE '^\s+operationId:' api/openapi.yaml
171 (was 158 + 13)
$ bash scripts/ci-guards/openapi-handler-parity.sh
Router routes: 220
OpenAPI operations: 171
Documented exceptions: 51
wire-protocol: 36
rest-deferred: 15
openapi-handler-parity: clean.
$ bash scripts/ci-guards/openapi-rest-deferred-monotonic.sh
openapi-rest-deferred-monotonic: clean — rest-deferred = 15,
baseline = 15.
$ cat api/openapi-handler-exceptions-baseline.txt
15
$ python3 -c "import yaml; spec=yaml.safe_load(open('api/openapi.yaml')); ..."
paths: 126, operations: 171
components.schemas: 67
sprint-13.4 schemas missing: (none)
OpenAPI lint: clean.
$ gofmt -l . → clean
$ go vet ./internal/api/handler/... ./cmd/server/... → clean
Sprint 13.5 next (auth/breakglass + auth/users + auth/runtime-config,
8 ops; rest-deferred 15 → 7). Same OpenAPI-only authoring pattern; no
Go changes.
Refs: ARCH-H1 batch 1 closure.
|
||
|
|
a41fc2d75c |
feat(ratelimit): Phase 13 Sprint 13.3 — wire backend selector + scheduler janitor + docs + helm (ARCH-M1 closure complete)
Phase 13 Sprint 13.3 — the completion half of the ARCH-M1
substantive close. Sprint 13.2 shipped the Postgres-backed
sliding-window limiter + multi-replica integration test; Sprint 13.3
wires the 6 call sites in cmd/server/main.go through the operator-
chosen backend selector, adds the rate_limit_buckets scheduler
janitor sweep, rewrites the observability doc, exposes the env-var
in the helm chart, and promotes the multi-replica integration test
to a required CI status check.
Signature ground-truth (sprint 13.2 + 13.3)
===========================================
Prompt-template signatures: `Allow(key string) error` and "5 call
sites." Actual repo: `Allow(key string, now time.Time) error` and 6
NewSlidingWindowLimiter call sites in cmd/server/main.go (the prompt
miscounted the second EST per-principal arm). Per CLAUDE.md "the repo
is truth," matched the live shape.
What changed
============
internal/config/server.go (+40 LOC):
- Added `SlidingWindowBackend string` + `SlidingWindowJanitorInterval
time.Duration` to RateLimitConfig with full operator-facing
documentation of the two valid values (memory|postgres) +
when-to-use-which decision tree.
internal/config/config.go (+27 LOC):
- Load() reads CERTCTL_RATE_LIMIT_BACKEND (default "memory") +
CERTCTL_RATE_LIMIT_JANITOR_INTERVAL (default 5m).
- Validate() rejects anything other than ""/"memory"/"postgres"
(empty = memory equivalence for test-built Configs that bypass
Load()). Janitor interval must be ≥ 1 minute when set.
- Failure modes return clear ::error:: with the env-var name + the
valid values, so an operator typo ("postgress" → memory in a
3-replica cluster) fails fast at startup.
internal/ratelimit/factory.go (NEW, 67 LOC):
- NewLimiter(backend, db, maxN, window, mapCap) Limiter — single
factory the 6 cmd/server/main.go call sites route through.
- Drop-in signature: same maxN/window/mapCap as
NewSlidingWindowLimiter (mapCap accepted + ignored for postgres
— the rate_limit_buckets table grows until the janitor sweeps).
- Defensive panic on unknown backend (config.Validate is SoT;
this is belt-and-suspenders).
internal/ratelimit/postgres_gc.go (NEW, 73 LOC):
- PostgresGC struct + NewPostgresGC + GarbageCollect.
- Single-statement DELETE FROM rate_limit_buckets WHERE
updated_at < NOW() - maxWindow. Idempotent.
- maxWindow <= 0 is a no-op (operator opt-out).
internal/scheduler/scheduler.go (+90 LOC):
- New RateLimitGarbageCollector interface (mirrors the
ACMEGarbageCollector / SessionGarbageCollector contracts).
- rateLimitGC field + rateLimitGCInterval + rateLimitGCRunning
on Scheduler.
- SetRateLimitGarbageCollector(gc) + SetRateLimitGCInterval(d)
Setters following the existing acmeGC/sessionGC pattern.
- rateLimitGCLoop() — JitteredTicker + atomic.Bool guard +
per-tick context.WithTimeout(1m). Logs row count at Debug.
- Loop counted in the Start() WaitGroup only when the GC is
non-nil; cmd/server/main.go skips SetRateLimitGarbageCollector
when backend=memory so the loop never launches for that case.
cmd/server/main.go (35 LOC diff):
- All 6 ratelimit.NewSlidingWindowLimiter call sites now route
through ratelimit.NewLimiter(cfg.RateLimit.SlidingWindowBackend,
db, ...). Grep verification post-fix returns ZERO hits.
- Six sites: breakglass loginLimiter (580), ocspLimiter (1003),
exportLimiter (1068), EST failed-basic (1535), EST per-principal
SCEP-mTLS arm (1591), EST per-principal SCEP arm (1613). The
intune.NewPerDeviceRateLimiter site at line 1823 stays unmoved
— its inner type-alias wrapper is the prompt's
out-of-scope (cmd/server/*.go only).
- Conditionally constructs PostgresGC + wires the scheduler janitor
when backend=postgres; logs the wiring decision either way so
operators see "rate-limit GC sweep enabled (postgres backend)"
or "in-memory backend self-prunes" in the boot log.
internal/api/handler/{est,export,certificates,auth_breakglass}.go:
- Replaced 5 *ratelimit.SlidingWindowLimiter field/Setter types
with ratelimit.Limiter (the interface). Allow() satisfies the
same call shape on both backends; the in-memory tests that
construct *SlidingWindowLimiter still compile because the
concrete type satisfies the interface (compile-time check in
internal/ratelimit/limiter.go pins this).
docs/operator/observability.md (176 LOC diff):
- Replaced the "per-process, in-memory, reset-on-restart, not
shared across replicas" paragraph with the new
configurable-backend section: operator decision tree,
backend internals (memory vs postgres), janitor description,
falsifiable closure proof (the Sprint 13.2 integration test
name + invocation), helm chart wiring example.
- Updated inventory to reflect the actual handler file paths +
actual cap configurations (the prior doc said "60s window" for
several limiters that actually use 60m / 24h windows).
- Doc smoke confirmed: grep -c 'per-process, in-memory,
reset-on-restart' docs/operator/observability.md = 0.
deploy/helm/certctl/values.yaml + templates/server-configmap.yaml +
templates/server-deployment.yaml:
- Exposed server.rateLimiting.backend (default "memory") +
server.rateLimiting.janitorInterval (default "5m") under the
existing rateLimiting block.
- ConfigMap renders both as rate-limit-backend +
rate-limit-janitor-interval keys.
- Deployment wires CERTCTL_RATE_LIMIT_BACKEND +
CERTCTL_RATE_LIMIT_JANITOR_INTERVAL env vars from the configmap.
- Helm render: `helm template deploy/helm/certctl --set
server.rateLimiting.backend=postgres` shows the env-var on the
server-deployment.yaml output.
.github/workflows/ci.yml (+12 LOC):
- Added a new step in the Go Build & Test job that runs the
Sprint 13.2 multi-replica integration test
(TestRateLimit_PostgresBackend_CapEnforcedAcrossReplicas) with
-tags=integration -race -timeout=300s. Fails the CI status check
if the cross-replica row lock ever stops arbitrating across
replicas — the ARCH-M1 closure regression gate.
Verification (all green locally; postgres integration via CI)
============================================================
$ grep -nE 'NewSlidingWindowLimiter' cmd/server/*.go
(zero hits — Sprint 13.3 receipt)
$ go test -short -count=1 \
./internal/config/... ./internal/ratelimit/... \
./internal/scheduler/... ./internal/api/handler/... \
./cmd/server/...
ok internal/config 1.177s
ok internal/ratelimit 0.007s
ok internal/scheduler 9.165s
ok internal/api/handler 6.245s
ok cmd/server 0.390s
$ staticcheck ./internal/ratelimit/... ./internal/scheduler/... \
./internal/config/... ./internal/api/handler/... ./cmd/server/...
(clean)
$ gofmt -l internal/ cmd/server/
(clean)
$ grep -c 'per-process, in-memory, reset-on-restart' \
docs/operator/observability.md
0 (doc smoke — the audit's verbatim phrasing is gone)
$ bash scripts/ci-guards/G-3-env-docs-drift.sh
G-3 env-docs-drift: clean.
$ bash scripts/ci-guards/complete-path-config-coverage.sh
OK — every CERTCTL_* env var (197) has at least one non-config-
package consumer.
Selector contract verified — config.Validate() rejects any value
other than ""/memory/postgres at startup with a clear error message.
Sprint 13.4 next (ARCH-H1 OpenAPI authoring batch 1) is on a
different axis; ARCH-M1 closure is complete with this commit
modulo the Sprint 13.7 audit-HTML flip + zero-floor pin.
Closes: ARCH-M1 substantive remediation. The cross-replica rate-
limit-cap-enforcement gap that the audit recommended deferring to
v3 is closed; operators with server.replicas > 1 flip
CERTCTL_RATE_LIMIT_BACKEND=postgres and get exactly-cap enforcement
across the cluster (proved by the multi-replica integration test now
gating CI).
|