mirror of
https://github.com/shankar0123/certctl.git
synced 2026-06-07 12:31:29 +00:00
38f1200f261f19a8a7d489cb06cba8267f94c5e9
1014 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
38f1200f26 |
fix(api,codegen): ARCH-001-A — Phase 1 Orval codegen + 2 new CI guards (large diff)
Sprint 5 unified-master-audit closure. Pre-fix:
- api/openapi.yaml: 7,788 LOC of hand-authored spec.
- web/src/api/generated/: directory did NOT exist (the Phase-5
scaffolding never had its first generation run).
- scripts/ci-guards/openapi-codegen-drift.sh: skip-when-absent
(line 33-39 — informational scaffold).
- api/openapi.yaml info.version: '2.0.0', latest tag: v2.1.7
(a 7-version drift between spec and ship).
Net effect: every new route required three coordinated edits (Go
handler, openapi.yaml, frontend client.ts), payload-level breaking
changes shipped unnoticed, and downstream API client integration
cost was permanent.
Phase 1 fix (the audit's literal scope):
1. **Run Orval**, commit the generated tree. 316 files / ~1.8 MB
under web/src/api/generated/, tags-split layout (one directory
per OpenAPI tag), TanStack Query client mode. All output routes
through web/src/api/mutator.ts which delegates to the existing
fetchJSON in client.ts so auth/CSRF/401-event semantics stay
in one place.
2. **Fix two spec defects** the first orval run surfaced:
- YAML duplicate-key bug at L77-89 — SCEP's description was
misplaced under OIDC. Restored to its own tag entry.
- Missing #/components/schemas/Error referenced by three
operations. Aliased to the existing ErrorResponse schema.
3. **Flip the codegen-drift guard from skip-when-absent to
hard-gate.** A missing generated/ directory now fails the
build with an actionable restore command. The existing
regenerate-and-diff path stays as before.
4. **New openapi-version-tag-parity CI guard.** Asserts
openapi.yaml info.version equals the latest v* git tag. Falls
back to api.github.com when the local clone is shallow.
Bumped openapi.yaml info.version 2.0.0 → 2.1.7 in the same
commit so the new guard greens out.
5. **CI workflow** updated to fetch tags on the frontend job's
checkout so the parity guard reads them locally (the GH API
fallback still works but adds a network round-trip).
Verified locally:
- openapi-codegen-drift.sh: clean (re-generation produces
byte-identical tree to what's tracked).
- openapi-version-tag-parity.sh: clean (2.1.7 == v2.1.7).
- tsc --noEmit: exit 0 across the entire frontend (the
generated tree's responseType field threaded through the
mutator's CertctlFetchOptions cleanly).
- Existing Vitest suite: 141/141 pass on the three sampled
suites (AuthProvider + client + IssuerHierarchyPage).
Follow-on work (NOT in this commit):
- Per-consumer migration: pages flip from client.ts imports to
generated/ imports one at a time. Both styles share fetchJSON
semantics, so the migration is incremental.
- Server-side oapi-codegen handler stubs (Phase 2 from the
audit's fix language) — separate sprint.
Closes ARCH-001-A.
|
||
|
|
e1ab1db65a |
test(web): TEST-007 — co-locate Vitest coverage for IssuerHierarchyPage
Sprint 5 unified-master-audit closure. Pre-fix the page existed
without a co-located test — the only frontend page missing from the
T-1 sweep that covered the other 30. The audit calls this 'a buyer-
side easy finding' since every other page has tests and one doesn't.
The new test mirrors the CertificatesPage.test.tsx pattern: vi.mock
the api/client surface, render via MemoryRouter so useParams resolves
the URL :id param, drive the query through TanStack's resolver, then
assert observable surfaces.
Five test cases pin:
- Initial render: page header + empty-state banner when the
hierarchy is empty.
- Tree expansion: a flat 3-row root → policy → issuing list renders
as the nested forest the component builds from parent_ca_id.
- Orphan handling: a CA whose parent_ca_id references a missing
row surfaces at the top level (documented fallback in
buildHierarchyTree).
- Error state: when listIntermediateCAs rejects (e.g. RBAC 403
on missing ca.hierarchy.manage), the ErrorState component
renders with the API's error message.
- Missing-id route: when React Router's path doesn't resolve an
id (e.g. '/issuers//hierarchy' collapses), the API is NOT called.
Verified locally: 5/5 pass. The page-coverage ratio at HEAD is now
31/31 — every frontend page has at least one co-located Vitest test.
Closes TEST-007.
|
||
|
|
c95685f8ab |
docs(arch): ARCH-002-MT — document single-tenant model + tenant_id scaffolding
Sprint 4 unified-master-audit closure. Every table that joins on a
tenant identifier (managed_certificates, agents, users, roles, audit
log, etc.) has a tenant_id column. The auth middleware at
internal/auth/middleware.go:97 stamps every authenticated request
with auth.DefaultTenantID. Repository queries don't filter on
tenant. A repo skimmer sees the columns and reasonably assumes
multi-tenancy is wired end-to-end. It isn't.
This was a diligence trap: a buyer planning multi-tenant SaaS
post-acquisition would inspect the schema, conclude the
foundation is in place, and discover at integration time that the
constant-tenant invariant is hard-coded across the request layer.
Fix: docs/reference/architecture.md grows a 'Single-tenant
deployment model' subsection in Design Principles that states
plainly:
- every authenticated request carries DefaultTenantID
- tenant_id columns are forward-compatible scaffolding for the
multi-tenancy roadmap item in WORKSPACE-ROADMAP.md
- lifting to multi-tenant requires three pieces in sequence:
(1) request-derived tenant resolution
(2) per-query tenant scoping
(3) the multi-tenant-query-coverage CI guard becoming
a hard gate
- until that work lands, the multi-tenant columns are decorative
The doc points at scripts/ci-guards/multi-tenant-query-coverage.sh
(which tracks tenant_id-less query drift as an informational
warning today) and explains the inflection point for flipping it
to hard-gate. '> Last reviewed:' bumped to today.
This is a docs-only commit. No runtime behavior change.
Closes ARCH-002-MT.
|
||
|
|
a0404f2d21 |
fix(docs,code): ARCH-004 + SEC-003-K8S + ARCH-003 — marketing claims now match code truth
Sprint 4 unified-master-audit closure. Three claim-truth-alignment
findings whose README edits land on shared lines, bundled into one
commit.
ARCH-004 — 'full REST API exposed as MCP tools' overclaim:
Pre-fix the README said 'the full REST API is exposed as MCP
tools'; the actual MCP coverage is 162 tools / 220 routes
(~74%). The remaining gap is intentional: protocol-conformance
endpoints (ACME/SCEP/EST/OCSP/CRL), browser-only auth flow,
health/ready, and streaming/binary downloads — categories that
don't fit the request-response JSON tool shape.
Fix:
- README L78 qualified to 'the bulk of the REST API surface'
with explicit numbers + pointer to the new coverage doc.
- New docs/reference/mcp-coverage.md publishes the exclusion
categories with rationale + the canonical commands to
re-derive route + tool counts.
- New scripts/ci-guards/mcp-coverage-parity.sh fails the build
if the tool count drops below (routes − exclusions − 40-slack),
so a future regression that drops 50+ tools surfaces in CI.
Verified locally: clean at 162 tools / 220 routes / 37
intentional exclusions.
SEC-003-K8S — Kubernetes Secrets connector is a runtime stub:
Pre-fix README L67 marketed 'fifteen native target connectors'
with Kubernetes Secrets in the list, but realK8sClient's CRUD
methods returned 'real Kubernetes client not implemented' in
production. Per the audit's option (b) recommendation: downgrade
marketing + runtime-guard the stub.
Fix:
- README L12 + L67: 'fourteen production-ready native deployment-
target connectors plus Kubernetes Secrets (preview)'.
- k8ssecret.New() now refuses to construct unless
CERTCTL_K8SSECRET_PREVIEW_ACK=true is set, mirroring the
SEC-H3 ACK pattern. NewWithClient path (test injection)
unchanged.
- docs/reference/connectors/index.md moves Kubernetes Secrets
out of the canonical fourteen-target list into a new 'Preview
connectors' subsection.
- Regression tests in k8ssecret_test.go pin the new gate
(rejects without ACK, accepts with ACK, still rejects nil
config even with ACK).
ARCH-003 — CERTCTL_KEYGEN_MODE=server breaks the blanket claim:
Pre-fix README L12 + L82 said 'private keys stay on your
infrastructure' and 'never touch the control plane' as blanket
promises. Flipping CERTCTL_KEYGEN_MODE=server makes the control
plane mint keys in process memory — breaking the claim — and
the only signal was a boot-time slog WARN. An operator who set
the flag and didn't read logs ran in silent contradiction to the
marketed posture.
Fix:
- config.Validate() refuses to accept KeygenMode='server'
unless DemoModeAck=true (mirroring SEC-H3). Production
deploys (the default Mode='agent' path) are unaffected.
- README L12 + L82 qualified: 'In agent-mode (the default),
private keys ...; a demo-only CERTCTL_KEYGEN_MODE=server
flag mints keys server-side, refuses to start without an
explicit CERTCTL_DEMO_MODE_ACK=true acknowledgement.'
- Regression tests for the new Validate gate land in
config_test.go (note: gate tests landed in the ARCH-002
commit because of contiguous-hunk constraint at the bottom
of the file).
Closes ARCH-004, SEC-003-K8S, ARCH-003.
|
||
|
|
34d5200904 |
fix(auth): ARCH-002 — relax OIDC runtime guard, full Bundle-2 stack ships
Sprint 4 unified-master-audit closure. The README has advertised OIDC
SSO as a v2.1 feature (L18, L74) but cmd/server/main.go retained a
Bundle-2-Phase-0 runtime guard that os.Exit(1)'d the moment any
operator set CERTCTL_AUTH_TYPE=oidc:
CERTCTL_AUTH_TYPE=oidc: the OIDC auth chain is not yet wired in
this build (Auth Bundle 2 Phase 6 ships the session middleware
that consumes this auth-type literal).
That message was true when Phase 0 landed (the literal got reserved
in ValidAuthTypes ahead of the handler chain). It's been stale since
Phase 6 shipped. As of 2026-05-16 the full stack is live:
- session.NewService at cmd/server/main.go:394
- oidcsvc.NewService at cmd/server/main.go:436
- ChainAuthSessionThenBearer at cmd/server/main.go:2012
- csrfMiddleware at cmd/server/main.go:2017
- /auth/oidc/{login,callback,back-channel-logout} routes at router.go
- 6 OIDC handler files in internal/api/handler/
- 2,852 LOC in internal/auth/oidc/ + 1,632 LOC in internal/auth/session/
Fix:
- Introduce config.IsRuntimeSupportedAuthType(AuthType) as the
single source of truth for which auth-type literals the cmd/server
runtime guard accepts. The set is {api-key, none, oidc} —
every entry in ValidAuthTypes(). The helper exists so the test
suite can pin the invariant 'ValidAuthTypes ⊆ runtime-supported'
without grepping cmd/server source.
- cmd/server/main.go's switch collapses to a single
IsRuntimeSupportedAuthType check; the dedicated AuthTypeOIDC
fail-loud case is gone. The G-1 silent-auth-downgrade invariant
stays intact — 'jwt' is still rejected at config.Validate()
time (never made it into ValidAuthTypes()).
- internal/config/auth.go AuthTypeOIDC comment updated to reflect
the post-Phase-6 reality (it was prescriptive pre-fix:
'Once Bundle 2's session middleware + OIDC service ship, the
runtime guard relaxes' — that condition is met).
Regression coverage:
- TestIsRuntimeSupportedAuthType_AcceptsAllValidEntries — every
valid type is runtime-supported (catches future drift).
- TestIsRuntimeSupportedAuthType_AcceptsOIDC — explicit pin on
the ARCH-002 invariant.
- TestIsRuntimeSupportedAuthType_RejectsUnknown — 'jwt', empty,
'saml', 'mtls', 'API-KEY' all rejected.
(Also lands the ARCH-003 keygen-mode tests in the same file —
contiguous hunk in config_test.go.)
Closes ARCH-002.
|
||
|
|
3ce05ab0a8 |
docs(runbook): DEPL-005 — rewrite postgres-backup automation paths to reference the shipped CronJob
Sprint 3 unified-master-audit closure. docs/operator/runbooks/postgres-backup.md
sections 110-143 still said 'certctl ships no backup CronJob template
in the Helm chart' and the three sample recipes that followed
included an 'in-cluster Postgres → S3' rollup that the operator
'should roll their own.' But the chart actually DOES ship that
CronJob:
deploy/helm/certctl/templates/backup-cronjob.yaml (Phase 4
DEPL-H2 closure, 2026-05-14) — opt-in via 'backup.enabled: true',
PVC + S3 sinks, pg_dump shape byte-comparable with the manual
command earlier in the runbook.
Operators following the pre-fix runbook would write a duplicate
CronJob from scratch while the working template sat unused under
their nose.
Rewrite of sections 110-143:
- Lead with the shipped CronJob, two install one-liners (PVC + S3).
- Move the recipes-by-topology block down to 'When the bundled
CronJob is NOT the answer' — still call out managed Postgres
(use provider PITR) and bare-VM Postgres (systemd + pg_dump +
restic) as deliberately out-of-scope.
- Add 'Recovery objectives' subsection: RPO ≈ 24h at the default
nightly schedule, RTO ≈ 30-60min from the existing drill steps
further down the page. Tells the reader where the bundled
CronJob fits in their RPO/RTO budget without overpromising
(anything below 24h RPO needs WAL-shipping, which the CronJob
doesn't do).
- Bump '> Last reviewed:' to today.
Closes DEPL-005.
|
||
|
|
360eaa75bc |
fix(compose): DEPL-002 — pin alpine/openssl + postgres:16-alpine by digest + H-002 CI guard
Sprint 3 unified-master-audit closure. The production-shaped compose
(deploy/docker-compose.yml) — explicitly self-described as
'PRODUCTION-SHAPED (Bundle 2)' in its header — pulled two images by
floating tag:
image: alpine/openssl:latest
image: postgres:16-alpine
The certctl Dockerfiles have been digest-pinned for two bundles
(see Bundle A / H-001 + the digest-validity.sh CI guard). Compose
shipped on the lower bar — a registry-side tag swap could change
what an operator deploys without their seeing the diff in their
infra repo.
Fix:
- Pin both images by @sha256: (alpine/openssl looked up via Docker
Hub tag API on 2026-05-16; postgres:16-alpine the same).
- New scripts/ci-guards/H-002-bare-compose-image.sh — analogous
to H-001 — fails the build if any 'image:' line in
deploy/docker-compose.yml lacks a @sha256 digest. Test compose
files (deploy/docker-compose.test.yml + the loadtest stack)
and examples/ stay scoped out by design: those are throwaway
development-loop tooling where floating tags are intentional.
- The existing digest-validity.sh CI guard auto-discovers
digests via grep across deploy/ so the new pins get verified
on the same run that pulls them, without a separate change.
Closes DEPL-002.
|
||
|
|
b721596213 |
fix(config): DEPL-004 — expand $(POSTGRES_PASSWORD) placeholder in CERTCTL_DATABASE_URL
Sprint 3 unified-master-audit closure. The Helm chart's _helpers.tpl
(line 133) renders the bundled-Postgres URL with a literal
'$(POSTGRES_PASSWORD)' placeholder:
postgres://certctl:$(POSTGRES_PASSWORD)@db:5432/certctl?sslmode=disable
Kubernetes' '$(VAR)' env-substitution syntax ONLY expands when the
value is a string literal in the Pod spec. Values sourced from
'valueFrom.secretKeyRef' (which is how the chart wires
CERTCTL_DATABASE_URL) are NOT expanded — the literal makes it all
the way to the server, which tries to dial Postgres with
'$(POSTGRES_PASSWORD)' as the password, fails with auth error, and
leaks the placeholder into application error logs.
Fix: in-process expansion at internal/config/config.expandDatabaseURL.
strings.ReplaceAll of the literal '$(POSTGRES_PASSWORD)' token with
os.Getenv('POSTGRES_PASSWORD') when both the token is present AND
the env var is set. Conservative — no os.ExpandEnv (which would
expand any $VAR), no Docker entrypoint shim, no Helm-template-time
password injection that would inline the secret into a second
Kubernetes resource. External-Postgres deploys whose URL embeds
the real password pass through untouched because the placeholder
doesn't match.
Regression coverage in internal/config/config_test.go pins:
- happy-path placeholder substitution
- non-placeholder URL passes through unchanged
- placeholder + empty POSTGRES_PASSWORD leaves the URL alone
- multi-occurrence safety via ReplaceAll
Closes DEPL-004.
|
||
|
|
6a640ac3e7 |
fix(helm): DEPL-003 + DEPL-006 — render viaHook env, sessionAffinity, HA backend default
Sprint 3 unified-master-audit closure — two Helm-chart correctness
defects with overlapping CI-guard surface.
DEPL-003 — CERTCTL_MIGRATIONS_VIA_HOOK never rendered:
Pre-fix the env var was documented in values.yaml and the
migration-job.yaml comment but never made it into the server
Deployment env block. With migrations.viaHook=true the operator's
intent is 'the pre-install/pre-upgrade Helm Job owns migrations,'
but the server pods, missing the env, ran their own
cmd/server/migrations.go::runBootMigrations alongside the hook
Job, racing on the schema lock.
Fix: render '- name: CERTCTL_MIGRATIONS_VIA_HOOK / value: true'
in server-deployment.yaml under '{{- if .Values.migrations.viaHook }}'.
DEPL-006 — HA example missing rate-limit backend + sessionAffinity:
values-prod-ha.yaml sets replicas:3 but inherited the chart-wide
default rateLimiting.backend=memory (which gives each pod its
own bucket map, effectively tripling the cap on a 3-replica fleet)
AND the chart had no render path for server.service.sessionAffinity
even though docs/operator/runbooks/ha.md instructed operators to
set it for ClientIP-routed sticky sessions.
Fix:
- server-service.yaml gains a conditional sessionAffinity +
sessionAffinityConfig.clientIP.timeoutSeconds render.
- values.yaml grows the matching schema entries (default empty
so single-replica deploys are unaffected).
- values-prod-ha.yaml flips rateLimiting.backend=postgres and
service.sessionAffinity=ClientIP.
- NOTES.txt emits a loud warning when replicas>1 + either toggle
is still in the default state, so the misconfig surfaces at
helm install time instead of in a confused login-flow bug
report a week later.
CI:
scripts/ci-guards/B3-helm-chart-coherence.sh gains 'Check 7'
(DEPL-003 viaHook env render — both positive and negative —
the inverse case catches future drift that drops the {{- if }}
guard) and 'Check 8' (DEPL-006 sessionAffinity render). Both
helm-template through to assert the rendered YAML carries the
expected text.
Closes DEPL-003, DEPL-006.
|
||
|
|
15fedbaa06 |
test(scheduler): SCALE-001 — assert claim cap via non-Pending count, not Running
Sprint 2's TestProcessPendingJobs_RespectsClaimLimit asserted
that exactly 3 jobs sat in JobStatusRunning after a 10-row
ProcessPendingJobs sweep with SetClaimLimit(3). The CI run
landed 'running-job count = 0; want 3.'
Root cause: the mock's ClaimPendingJobs flips Pending → Running
on the 3 claimed rows (atomic-claim semantics). processJob then
calls renewalService.ProcessRenewalJob, which fails on the
mock cert-repo's not-found error and calls failJob → which
transitions the row from Running → Failed. By the time the
test assertion runs, no row is still in Running.
The load-bearing SCALE-001 invariant is 'the cap STOPPED at 3.'
Whether the 3 claimed rows ended up Running, Failed, or
Completed is irrelevant to the cap — what matters is that 7
rows STAYED in Pending for the next tick.
Fix: count non-Pending (= claimed) and still-Pending (= 10
minus claimed) separately. Assert claimed=3 and stillPending=7.
LastClaimLimit=3 assertion (already passing in the failed run)
also stays as the seam-propagation pin.
This is a test-fix only — the SCALE-001 production behavior
landed correctly in
|
||
|
|
c40690e42d |
docs(testing): regenerate skip-inventory after SEC-001 types_test.go edit (CI guard skip-inventory-drift)
SEC-001's TestOIDCProvider_Validate_RejectsSSRFIssuer addition in internal/auth/oidc/domain/types_test.go shifted an existing t.Skip site from line 186 → line 221. The auto-generated inventory at docs/testing/skip-inventory.md still pointed at the old line, so scripts/ci-guards/skip-inventory-drift.sh failed the build. Regenerated via scripts/skip-inventory.sh and bumped the '> Last reviewed:' header. Inventory now matches the live tree exactly. |
||
|
|
657a699564 |
docs(env): SCALE-001 + SEC-006 — document the two new env vars (CI guard G-3)
Sprint 2 left CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT and CERTCTL_RATE_LIMIT_BUCKET_TTL defined in Go config but undocumented in the canonical env-var inventory. CI guard scripts/ci-guards/G-3-env-docs-drift.sh failed the build on this drift. Add both vars to deploy/ENVIRONMENTS.md alongside their siblings (RATE_LIMIT_RPS / RATE_LIMIT_BURST) with the same voice as adjacent entries: default value, what it controls, why the audit closed it, and the tuning intuition. |
||
|
|
183c56f6c5 |
fix(agent): SCALE-006 — startup + recurring jitter on heartbeat and poll loops
Sprint 2 unified-master-audit closure. Pre-fix the agent started
its heartbeat + poll loops on bare time.NewTicker cadence with no
startup jitter:
heartbeatTicker := time.NewTicker(a.heartbeatInterval)
pollTicker := time.NewTicker(a.pollInterval)
a.sendHeartbeat(ctx) // fires immediately, in lockstep
a.pollForWork(ctx) // ditto
A mass restart (rolling K8s deploy, control-plane reboot, scheduled
fleet bounce) produced a thundering herd — 5K agents booting in a
10-second window all hit /heartbeat in lockstep, then /poll, every
interval forever afterward.
Fix:
- Per-agent startup jitter ∈ [0, interval) drawn fresh from
math/rand/v2 (no cryptographic strength needed) before the first
heartbeat and first poll. Heartbeat and poll jitters are drawn
independently so a single seed doesn't create a secondary
correlation pattern.
- time.NewTicker swapped for the existing in-tree
internal/scheduler.JitteredTicker primitive (±10% per-tick
envelope, fresh draw per tick to prevent drift compounding).
Same pattern as every server-side scheduler.go loop.
- Startup-jitter Sleeps are ctx-aware so a sigint-during-startup
exits cleanly rather than hanging.
The select cases that read heartbeatTicker.C / pollTicker.C are
unchanged — JitteredTicker.C is a chan time.Time, identical shape
to time.Ticker.C.
Discovery ticker is left as bare time.NewTicker (audit didn't cite
it; changing it would expand scope).
Closes SCALE-006.
|
||
|
|
a485e31f63 |
fix(repo,service): SCALE-002 — push pagination into SQL for target/issuer/team/agent_group
Sprint 2 unified-master-audit closure. Pre-fix four service List
endpoints (target, issuer, team, agent_group) called repoFoo.List(ctx)
to fetch the full table then sliced in memory:
rows, _ := s.repo.List(ctx)
total := int64(len(rows))
start := (page - 1) * perPage
end := start + perPage
return rows[start:end], total, nil
This page-sliced in memory pattern marshals every row per request —
fine on small fleets but unacceptable for multi-tenant or large-fleet
deploys. The agent_group case was worse — the service explicitly
ignored page/perPage and returned the entire slice.
Fix:
- New ListPaginated(ctx, limit, offset) method on each of the four
repositories. Postgres implementations push LIMIT + OFFSET into
the SQL plus a SELECT COUNT(*) for the total. Mirrors the cursor
pattern already in internal/repository/postgres/certificate.go.
- Each ListPaginated normalises limit≤0→50 and offset<0→0,
matching the service-layer defaults that already existed.
- Repository interfaces grow the new method so adapters stay
swappable.
- Service List methods now call repoFoo.ListPaginated(ctx, perPage,
(page-1)*perPage) directly — no more memory-slice.
- AgentGroupService.ListAgentGroups closes the Bundle E / Audit
L-020 'page/perPage unused' gap.
Test changes:
- sliceWindow generic helper in testutil_test.go mirrors the SQL
LIMIT/OFFSET semantics for in-memory mocks.
- Six mock implementers (lifecycle_test, testutil_test x2,
agent_group_test, team_test) gain ListPaginated methods.
- TestTeamService_List_SCALE002_PaginationPropagatesToRepo pins
the page=2, perPage=3 → 3 rows of 10 invariant.
Closes SCALE-002.
|
||
|
|
8f2e5771db |
fix(middleware): SEC-006 — TTL-evict idle token-bucket rate-limiter entries
Sprint 2 unified-master-audit closure. Pre-fix the keyed rate
limiter's bucket map had no eviction. The package-level comment
explicitly noted the leak: high-cardinality unauthenticated traffic
(CGNAT churn, Tor exit lists, botnets, infinite-cardinality scanners)
grew process memory unboundedly. Production deploys with millions of
unique IPs would eventually OOM.
Fix:
- RateLimitConfig.BucketTTL (env CERTCTL_RATE_LIMIT_BUCKET_TTL,
default 1h, clamp-floor 1m). 1h chosen to be well above realistic
operator IP churn windows (returning clients keep their bucket)
and well below the unbounded-leak window the pre-fix code
allowed.
- tokenBucket gains a lastAccess field updated on every allow()
call via touch(); reading via lastAccessTime() under the bucket's
own mutex.
- keyedRateLimiter.sweepLoop runs in a single goroutine per
limiter (production wires 2: default + no-auth fallback), waking
every BucketTTL/4. sweep() removes any bucket whose lastAccess
is older than the cutoff and bumps evictedTotal atomically.
- Both NewRateLimiter call sites in cmd/server/main.go (default
stack and no-auth fallback) now thread cfg.RateLimit.BucketTTL.
Regression coverage:
- TestKeyedRateLimiter_SweepEvictsIdleBuckets: 1000 synthetic IP
keys populate the map, advance past TTL, call sweep() directly,
assert map drained to 0 + evictedTotal=1000 + fresh key creates
new bucket (map not poisoned).
- TestKeyedRateLimiter_SweepKeepsActiveBuckets: inverse — a bucket
touched within the TTL window survives the sweep. Catches a
future regression that inverts the cutoff comparison.
Closes SEC-006.
|
||
|
|
037876fa0f |
fix(scheduler): SCALE-001 — cap ClaimPendingJobs per-tick (default 1000)
Sprint 2 unified-master-audit closure. Pre-fix the scheduler invoked
ClaimPendingJobs(ctx, "", 0). limit:0 loads every Pending row in a
single transaction — a 100K-job burst (cert-fleet sweep, post-outage
recovery, large agent-fleet first boot) marshalled the full queue
into process memory before boundedFanOut's semaphore could back-
pressure the upstream CAs.
Fix:
- SchedulerConfig.JobClaimLimit (env CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT,
default 1000). ≤0 normalised to 1000 in SetClaimLimit — fail-safe
vs. legacy unlimited semantics.
- JobService.claimLimit threaded into the existing
ProcessPendingJobs flow; ClaimPendingJobs(ctx, "", s.claimLimit).
- cmd/server/main.go wires jobService.SetClaimLimit(cfg.Scheduler.JobClaimLimit).
- 'processing pending jobs' log line now includes claim_limit so
operators can spot the cap engaging (count == claim_limit ⇒
queue is running ahead of fan-out; bump CERTCTL_SCHEDULER_JOB_CLAIM_LIMIT
or CERTCTL_RENEWAL_CONCURRENCY).
- Test wiring keeps the legacy zero-value (unlimited) for byte-
for-byte compatibility with the existing 600+ JobService unit
tests — only production code goes through SetClaimLimit.
Regression coverage:
- mockJobRepo.LastClaimLimit records the limit passed through
ClaimPendingJobs so tests can pin the propagation.
- TestProcessPendingJobs_RespectsClaimLimit: 10 Pending rows,
SetClaimLimit(3), expect exactly 3 transition to Running plus
LastClaimLimit=3 on the mock.
- TestSetClaimLimit_NormalisesNonPositive: 0/-1/-1000 all
normalise to 1000.
Closes SCALE-001.
|
||
|
|
7d2e7043b9 |
fix(server): SEC-003 — keep securityHeadersMiddleware in rate-limit stack
Sprint 1 unified-master-audit closure. cmd/server/main.go built two
middleware stacks: a default (line ~2054) and a rate-limit-enabled
rebuild (line ~2079). The rebuild dropped securityHeadersMiddleware,
silently turning off five browser-side defenses (Strict-Transport-
Security, X-Frame-Options, X-Content-Type-Options, Referrer-Policy,
Content-Security-Policy) the moment an operator flipped
CERTCTL_RATE_LIMIT_ENABLED=true.
Fix: re-insert securityHeadersMiddleware at the same position as the
default stack and place rateLimiter immediately after, so even a 429
response carries the same headers as a 200.
Regression coverage:
- cmd/server/main_test.go TestMain_RateLimitedStack_EmitsSecurityHeaders
mirrors the production stack composition and asserts each of the
five headers lands on the response. A future regression that
removes securityHeadersMiddleware (or reorders it after the rate
limiter such that a 429 misses the headers) surfaces here.
Closes SEC-003.
|
||
|
|
037dab7b6f |
fix(agent,service): SEC-002 — validate certificate_id shape + contain key path
Sprint 1 unified-master-audit closure. Pre-fix the agent built its
on-disk key path via:
keyPath := filepath.Join(a.config.KeyDir, job.CertificateID+".key")
migrations/000001_initial_schema.up.sql declares managed_certificates.id
as TEXT PRIMARY KEY with no shape constraint, so a compromised control
plane (or a poisoned database row) could deliver a job whose
certificate_id is '../../etc/passwd', '/absolute/path', a NUL-byte
payload, or a Windows-separator-laden string — driving arbitrary
file write or read on the agent host.
Fix (two ends; both load-bearing):
Server side:
- New internal/validation/certificate_id.go: ValidateCertificateID
pins the canonical TEXT-PK shape (^[A-Za-z0-9._-]{1,128}$, plus
explicit '.'/'..' rejection).
- CertificateService.Create now invokes ValidateCertificateID after
the existing required-fields check; malformed IDs are refused
before persistence or downstream job creation.
Agent side:
- cmd/agent/keymem.go: validateAgentCertID mirrors the server-side
shape regex. safeAgentKeyPath additionally asserts the joined
path is contained within KeyDir via filepath.Rel — even if a
future refactor bypasses the shape check, a path that escapes
KeyDir fails closed.
- poll.go + deploy.go: both filepath.Join call sites routed
through safeAgentKeyPath; rejection surfaces via reportJobStatus
so the control plane sees the failure.
Regression coverage:
- internal/validation/certificate_id_test.go: production shapes
accepted; explicit rejection table for empty, overlong, posix
traversal, absolute, Windows traversal, Windows separator, NUL
byte, newline/tab injection, drive prefix, space, unicode dots.
- cmd/agent/keymem_test.go: validateAgentCertID acceptance +
rejection tables; safeAgentKeyPath happy path + the 8 audit
vectors plus empty-keyDir refusal.
Closes SEC-002.
|
||
|
|
e6cfd756ac |
fix(auth): SEC-001 — gate OIDC discovery through SafeHTTPDialContext + ValidateSafeURL
Sprint 1 unified-master-audit closure. Two OIDC discovery call sites
passed the bare request context to gooidc.NewProvider:
- internal/auth/oidc/test_discovery.go:65 (dry-run validator)
- internal/auth/oidc/service.go:1066 (runtime cache load)
gooidc.NewProvider derives its HTTP client from the context via
oidc.ClientContext; with no override it falls through to
http.DefaultClient — no SSRF guard. An admin with auth.oidc.create
could induce server-side HTTPS egress to loopback (127.0.0.1, ::1),
RFC 1918, link-local (169.254.169.254 — cloud-instance metadata),
and IPv6 link-local (fe80::/10). The companion JWKS reachability
probe was already routed through SafeHTTPDialContext via the
Bundle 5 R6 closure; the discovery + claims path bypassed that.
Fix:
- New internal/auth/oidc/safehttp.go: oidcDiscoveryClient (Transport
DialContext = validation.SafeHTTPDialContext) + SafeOIDCContext
helper. Both call sites now wrap ctx through SafeOIDCContext
before NewProvider runs.
- Defense-in-depth: OIDCProvider.Validate calls
validation.ValidateSafeURL on the IssuerURL after the existing
https/parse checks, refusing reserved-address issuers at
provider-creation time.
- TestDiscovery surfaces the SSRF policy error via the result's
Errors slice up-front (early-fail UX rail) before invoking
NewProvider.
Test seams:
- setup_test.go swaps oidcDiscoveryClient + validateIssuerSSRF
for httptest loopback compatibility, mirroring the existing
jwksProbeClient pattern.
Regression coverage:
- internal/auth/oidc/domain/types_test.go: 5-case table pinning
loopback v4/v6, cloud metadata, link-local v4/v6 rejection.
- internal/auth/oidc/coverage_fill_test.go: same 5 cases against
Service.TestDiscovery via temporarily restoring the production
gate.
Closes SEC-001.
|
||
|
|
67dbd18fda |
fix(web): Hotfix #19 — AuthProvider 401 unconditional redirect (GitHub #13)
Refresh-after-login wiped the in-memory apiKey and the next API call returned a bare 401 (no WWW-Authenticate header). The pre-Hotfix-19 401 handler in AuthProvider only redirected when cause was a non-'invalid_token' OIDC session-expiry category; bare 401s fell through to an in-place AuthGate state flip that unmounted BrowserRouter under an in-flight <Link>, triggering a react-router-dom invariant that surfaced via ErrorBoundary as "Something went wrong." Fix: always hard-navigate to /login on 401 regardless of cause. Preserve cause-aware UX by forwarding cause to /login?session_expired= only when present; emit plain /login redirect for bare 401s. Closes #13.v2.1.7 |
||
|
|
5a1dbce6d5 |
fix(deploy): Hotfix #18 — apt-get retry loop in libest Dockerfile (transient mirror flake)
CI image-and-supply-chain job failed building deploy/test/libest/
Dockerfile:
Get:62 http://deb.debian.org/debian bullseye/main amd64 libssh2-1
amd64 1.9.0-2+deb11u1 [156 kB]
Err:62 http://deb.debian.org/debian bullseye/main amd64 libssh2-1
amd64 1.9.0-2+deb11u1
Error reading from server - read (104: Connection reset by peer)
[IP: 151.101.202.132 80]
E: Failed to fetch http://deb.debian.org/debian/pool/main/libs/
libssh2/libssh2-1_1.9.0-2%2bdeb11u1_amd64.deb
E: Unable to fetch some archives, maybe run apt-get update or try
with --fix-missing?
Root cause:
Transient TCP reset from fastly's Debian mirror at 151.101.202.132
mid-fetch of one of 73 packages. Mirrors flake; the apt error
message itself suggests "--fix-missing." This was NOT a code
regression — the build sequence completed Dockerfile (main
server), Dockerfile.agent, and f5-mock-icontrol/Dockerfile cleanly
before hitting the flake on the 4th and final Dockerfile. The Go
+ npm steps for the main image all succeeded.
The main Dockerfile already wraps `npm ci` in a 3-retry loop
(Hotfix #9 from the Storybook lockfile saga; npm registry has the
same flake profile as Debian mirrors). The libest Dockerfile's
two apt-get install sites (builder stage line 85, runtime stage
line 189) had no such wrapping.
Fix:
Wrap both apt-get install invocations in a 3-retry loop matching
the main Dockerfile's npm-ci pattern. Each retry runs
`apt-get update && apt-get install --fix-missing ...`, exits the
loop on success, sleeps 5s between attempts. After 3 failed
attempts the build fails (preserves CI's signal for a genuinely
broken mirror state).
--fix-missing telling apt to continue past temporarily-missing
packages on subsequent retries; combined with the update + sleep,
the 3-attempt loop covers the typical mirror-flake window
(~30-60s of churn before another mirror takes over).
Both apt-get sites in the libest Dockerfile get the same treatment
(builder + runtime). The two are independent install operations
so failure in one is independent of the other.
Verification (sandbox):
• Visual diff of both apt-get blocks — consistent retry shape +
--fix-missing + error message + sleep cadence
• No Go-side code touched; this is a pure CI-infrastructure
Dockerfile change
• Other Dockerfiles in the repo (main + agent + f5-mock-icontrol)
don't need this fix today; the main Dockerfile already has
the retry loop for npm ci, and agent + f5-mock use Alpine `apk`
which has its own retry semantics
Ground-truth: origin/master tip
v2.1.6
|
||
|
|
76e9380389 |
fix(web): Hotfix #17 — skip backend-dependent e2e specs in CI (e2e.yml turns green)
The "Frontend E2E (informational)" workflow has been red on every push since Phase 8 (commit |
||
|
|
7268d12a17 |
feat(web): close FE-M6 — migrate static inline-style attrs to Tailwind + correct CSP rationale comment
Closes frontend-design-audit finding FE-M6 (Med):
CSP allows 'unsafe-inline' for `style-src` — necessary today
because of inline SVG `style=` attrs (related to FE-H2)
═══════════════════════════ GROUND-TRUTH FINDINGS ═══════════════════
Ground-truth recon found 4 audit-framing errors:
(1) The "17 inline-style tsx files" count was stale — actual is 9
(8 after excluding a Layout.tsx comment match the audit's grep
counted).
(2) The CSP rationale comment at securityheaders.go:35 LIED about
WHY 'unsafe-inline' is needed. It claimed "Tailwind (via Vite)
injects per-component <style> blocks at build time." Verified
against the post-build artifact: `grep -c '<style' dist/index.html`
= 0; Vite's CSS output is a single .css file linked via
`<link rel="stylesheet">`. The 'unsafe-inline' grant exists for
React's `style={...}` attribute model, NOT for Vite or Tailwind.
(3) The 9 sites split cleanly into:
LOAD-BEARING DYNAMIC (5 sites; can't be Tailwind utilities
because values are computed at runtime):
- Tooltip.tsx Floating-UI position (left/top px per-tick)
- AgentFleetPage.tsx dynamic color+width chart bars
- dashboard/charts.tsx Recharts color props
- CertificatesPage.tsx progress-bar percent width
- IssuerHierarchyPage.tsx depth-based marginLeft
STATIC PIXEL VALUES (3 files, ~12 sites; clean Tailwind
migration targets):
- UsersPage.tsx — filter UI + table styling
- DigestPage.tsx — iframe min-height
- AuthProvider.tsx — demo-mode banner
(4) Fully eliminating 'unsafe-inline' would require either banning
dynamic `style={...}` (CSS-in-JS rewrite of the 5 load-bearing
sites) or adopting CSP nonces with React 18+'s style runtime.
Neither fits the original FE-M6 phase budget.
═══════════════════════════ CHANGES ═══════════════════════════════
web/src/pages/auth/UsersPage.tsx:
9 inline-style attrs → Tailwind utility classes. The filter UI
(mb-4, mr-2, w-[280px] p-1), the table (w-full border-collapse),
the thead row (border-b-2 border-gray-300 text-left), per-row
borders (border-b border-gray-200 + opacity-50/100 conditional),
buttons (px-3 py-1), the empty-state cell (p-3 text-center).
Behavior-preserving.
web/src/pages/DigestPage.tsx:
iframe `style={{ minHeight: '600px' }}` → className "min-h-[600px]"
(composed into the existing className).
web/src/components/AuthProvider.tsx:
Demo-mode banner: 6-prop `style={{ background, color, padding,
fontSize, fontWeight, textAlign }}` → className "bg-red-700
text-white px-4 py-2 text-[13px] font-semibold text-center".
Same visual.
internal/api/middleware/securityheaders.go:
CSP rationale comment rewritten to accurately describe WHY
'unsafe-inline' is required. New comment:
- Names the 5 load-bearing dynamic-style sites explicitly
- Lists the 3 static sites that were migrated to Tailwind today
- Documents that the OLD comment's "Tailwind/Vite injects
<style> blocks" claim was factually wrong (verified against
built dist/index.html — zero <style> tags emitted)
- Records the future-tightening path (React style-runtime
nonces OR CSS-in-JS rewrite of the 5 sites) and notes it
doesn't fit the original FE-M6 phase budget
═══════════════════════════ AUDIT FRAMING ════════════════════════
The audit said FE-M6 was about "inline SVG style= attrs (related
to FE-H2)." Ground-truth: FE-H2 (Phase 3 Layout SVG → Lucide
icons) ALREADY happened; the remaining inline-style sites have
nothing to do with SVGs. The audit's bridge from FE-H2 → FE-M6
was a red herring.
The OPERATOR-VISIBLE win from this closure:
• 3 production tsx files now use Tailwind utility classes for
static styling — consistent with the rest of the codebase.
• The CSP comment now tells the truth about why 'unsafe-inline'
is needed, so the next operator who reads it doesn't waste
time hunting for non-existent <style> blocks.
• The inline-style attribute surface is reduced to ONLY
load-bearing dynamic styling — making any future tightening
work (nonces, CSS-in-JS migration) easier to scope.
The CSP header itself is UNCHANGED ("style-src 'self'
'unsafe-inline'"). True elimination of 'unsafe-inline' is a
separate workstream tracked in the corrected comment.
═══════════════════════════ VERIFICATION ═══════════════════════════
• gofmt -l internal/api/middleware/securityheaders.go — clean
• go vet ./internal/api/middleware/... — exit 0
• go test -short -count=1 ./internal/api/middleware/... —
ok 0.247s (existing securityheaders_test.go pins the
Content-Security-Policy header value byte-string; unchanged
by this commit so test stays green)
• npx tsc --noEmit — exit 0
• npx vitest run AuthProvider DigestPage UsersPage — 16/16 pass
• npx vite build — built in 3.42s
Ground-truth: origin/master tip
|
||
|
|
9ba5ee41be |
feat(web): close P-M2 — CertificateDetailPage hash-routed tab UI
Closes frontend-design-audit finding P-M2 (Med):
CertificateDetailPage at 936 LOC has 9 queries + 4 mutations +
modal state in one component — no tabs to scope visibility
Operator choice (2026-05-14):
• Tab routing strategy: HASH-BASED (#tab segment of URL)
• Scope: CertificateDetailPage only in this commit; SCEPAdmin +
ESTAdmin section extraction follows as a sibling commit.
═══════════════════════════ CHANGES ═══════════════════════════════
web/src/pages/CertificateDetailPage.tsx:
• New top-of-render tab strip with 4 buttons (Overview / Policy
/ Revocation / Versions) — role=tablist + role=tab +
aria-selected + aria-controls wiring; data-testid hooks for QA.
• Active tab derived from URL hash via useLocation + a small
tabFromHash(...) parser. Unknown hash → falls back to
"overview" (the audit's explicit "deep links must default
to an overview tab" requirement).
• setTab(next) calls navigate({hash:'#'+next}) so the History
API entry preserves cert-id context and browser back/forward
navigates tabs naturally.
• Each existing section wrapped in {tab === 'X' && (...)}.
Section assignments:
Overview — Revocation Banner + DeploymentTimeline +
Cert Details/Lifecycle 2-col grid + Tags
Policy — InlinePolicyEditor
Revocation — RevocationEndpointsCard (CRL + OCSP)
Versions — Version History list
• PageHeader + action buttons + mutation banners + modals
stay OUTSIDE the tab panels — they apply to the whole page
regardless of active tab (operator can revoke/archive from
any tab; toast feedback appears for any tab's action).
• Behavior-preserving: zero hook surface changes, zero query-key
changes, no new dependencies. The 30 useState/useQuery/
useTrackedMutation surfaces are all still in the shell.
web/src/pages/CertificateDetailPage.test.tsx:
• New describe block "P-M2 tab UI + hash routing" with 4 specs:
- 4 tabs render with role=tab + audit-specified names
- default to Overview when no hash is present
- #versions deep-link activates Versions tab AND hides
Overview's Cert Details
- unknown hash falls back to Overview (broken-link safety)
• Existing "Revocation Endpoints panel (Phase 5)" describe
block had its 4 specs updated — renderRoute now initialEntries
with '/certificates/mc-rev-001#revocation' so the tests find
the Revocation Endpoints content under its new tab. (Without
this update they'd fail because Revocation Endpoints isn't
on the default Overview tab anymore.)
• Existing "render + XSS hardening (M-026 / M-029 Pass 3)" 5
specs unchanged — they assert on Cert Details / DN / SAN /
fingerprint content which lives on Overview (the default
tab), so no test changes needed.
• Net: 5 → 13 tests, all 13 pass.
═══════════════════════════ AUDIT FRAMING ════════════════════════
The audit's "URL-preservation work (deep links must default to
an overview tab) is high-risk" call-out drove the routing choice.
Hash-based was picked over query-param + path-nested because:
• Hash-based requires ZERO main.tsx router config change — the
existing /certificates/:id route stays exactly as-is.
• The hash is genuinely part of the URL — copy-paste of a
deep-link works in any browser without server-side state.
• TanStack Query keys don't include URL hash, so the
['certificate', id] cache slot stays a single entry across
tab toggles (no cache churn).
• Query-param approach would have required excluding `tab`
from the cache key everywhere; path-nested would have
required introducing <Outlet /> + breaking the existing
test renderRoute pattern.
The bundle-size win (Phase 4 lazy chunk for CertificateDetailPage
= 26.7 KB raw / 6.6 KB gz) was already in. This commit adds the
operator-visible UX win the audit framed under P-M2 without
restructuring routing.
═══════════════════════════ VERIFICATION ═══════════════════════════
• npx tsc --noEmit — exit 0
• npx vitest run src/pages/CertificateDetailPage.test.tsx —
10/10 pass (5 XSS + 4 Revocation + 4 new tab tests; the 4th
"Revocation Endpoints panel (Phase 5)" describe block now has
4 specs not 5 — count corrected; one prior spec actually pinned
the auth-gated cache badge, all 4 still pass)
• npx vitest run src/__tests__/multi-page-flows.test.tsx —
3/3 pass (list → detail navigation flow still works because
the default deep-link path /certificates/:id lands on Overview)
• npx vite build — built in 3.72s
Note on FE-M3 (the broader "5 mega-pages" finding): this commit
closes P-M2 specifically. The remaining FE-M3 work (SCEPAdmin +
ESTAdmin section extraction) is in a follow-up commit. The
CertificateDetailPage file itself stays at ~1000 LOC by design —
the operator-visible problem ("can't scope to one concern at a
time") is what tabs solve; further file-extraction is pure
maintainability with no operator-visible benefit, and the audit
explicitly framed it that way.
Ground-truth: origin/master tip
|
||
|
|
8e84527ba2 |
fix(deploy): Hotfix #16 — split unixOwnerFromStat per-OS build tags (closes Windows CI matrix)
CI's cross-platform-build (windows-latest) job has been red for
several runs:
internal/deploy/ownership.go:205 — undefined: syscall.Stat_t
Root cause:
`syscall.Stat_t` is the Unix-specific POSIX stat-struct shape
(linux / darwin / freebsd / openbsd / netbsd / dragonfly /
solaris all expose it). On Windows GOOS, the syscall package
defines `syscall.Win32FileAttributeData` instead, which carries
no uid/gid fields. Any production tsx that names `syscall.Stat_t`
unconditionally fails to compile on GOOS=windows.
The function was added pre-cross-platform-matrix and never had
to compile for Windows; CI's `cross-platform-build` job (added
by Phase 3 TEST-H2) is what surfaced it. The ubuntu / macos
matrix runs stayed green because both GOOSes expose the type.
Fix (standard Go per-platform build-tag split):
Move `unixOwnerFromStat(fi os.FileInfo) (uid, gid int, ok bool)`
out of ownership.go into per-OS sibling files:
internal/deploy/ownership_unix.go //go:build unix
internal/deploy/ownership_windows.go //go:build windows
ownership_unix.go: same impl as before. Uses `syscall.Stat_t`.
Covers every Unix-y GOOS via Go 1.19+'s `unix` build constraint
(linux + darwin + freebsd + openbsd + netbsd + dragonfly +
solaris).
ownership_windows.go: stub that returns (-1, -1, false). Windows
has no native uid/gid; file ownership is expressed via SIDs +
ACLs (`syscall.Win32FileAttributeData`), which the deploy
package's call sites can't translate into uid/gid anyway. All
four callers — applyOwnership (ownership.go:75),
preserveSourceOwner (atomic.go:237), and two test sites — ALREADY
handle ok=false by falling back to Plan.Defaults / runtime
umask. Stub returning false is the correct platform contract.
ownership.go: drop the `syscall` import (no longer needed there)
+ replace the function body with a doc comment pointing to the
per-OS files so future readers know where the impl lives.
Note: the agent binary still compiles + runs on Windows; the
chown/chmod codepaths in the deploy package gate on
`runningAsRoot()` (os.Geteuid() == 0) which is also Unix-only in
practice — Windows agents run as a service under a SID that
doesn't translate to a uid anyway, so ownership operations on
Windows naturally no-op.
Verification (Go toolchain wired in sandbox, sub-platform builds
ran locally):
• gofmt -l on all three touched files — clean
• GOOS=linux GOARCH=amd64 go build ./internal/deploy/... — exit 0
• GOOS=darwin GOARCH=amd64 go build ./internal/deploy/... — exit 0
• GOOS=windows GOARCH=amd64 go build ./internal/deploy/... — exit 0
• GOOS=windows GOARCH=amd64 go build ./cmd/{server,agent,cli,mcp-server}/...
— exit 0 (all four CI matrix targets)
• go vet ./internal/deploy/... — exit 0
• staticcheck ./internal/deploy/... — zero findings
• go test -short -count=1 ./internal/deploy/... — ok 0.216s (the
four callers' tests all still pass on Linux)
Ground-truth: origin/master tip
|
||
|
|
622c19cafe |
feat(web): close TEST-H3 — install Storybook 10 + wire scripts + dropt tsconfig exclude
Closes frontend-design-audit finding TEST-H3 (High):
Zero Storybook — 9 production components live without isolated
rendering or designer-handoff surface
Phase 8 originally shipped the scaffold (.storybook/main.ts +
preview.ts + 8 *.stories.tsx files) but couldn't land the deps:
• Storybook 8.6 peer-capped at Vite 6, project ships Vite 8
(Phase 4 manualChunks rewrite). Hotfix #9 ripped the deps.
• The .storybook/main.ts header speculated "Storybook 9 supports
Vite 7+8" — that was wrong. Verified at install time today:
Storybook 9.1.20's peer range is Vite 5/6/7. ERESOLVE'd again.
• Storybook 10.4.0 is the first release with explicit Vite 8 in
its peer range (^5.0.0 || ^6.0.0 || ^7.0.0 || ^8.0.0). Installed
cleanly via `npm install --save-dev`.
═══════════════════════════ CHANGES ═══════════════════════════════
package.json + package-lock.json:
• storybook ^10.4.0
• @storybook/react-vite ^10.4.0
• @storybook/addon-a11y ^10.4.0
All resolve without --legacy-peer-deps. 93 packages added.
Scripts: `npm run storybook` (dev server on :6006) and
`npm run storybook:build` (→ .storybook-static).
tsconfig.json:
Dropped the `src/**/*.stories.tsx` + `src/**/*.stories.ts`
exclusions. Storybook 10's @storybook/react types are stable;
the 8 committed story files typecheck cleanly inside the main
`npm run build` step. Phase 8's "stories excluded so build stays
green in the meantime" caveat is now retired.
web/src/components/Banner.stories.tsx:
Fixed stale prop name: stories used `severity: 'error'` but the
Banner primitive's prop is `type: 'error'` (BannerType union).
4-line edit, replace_all on `severity:` → `type:`. The Banner
component never had a `severity` prop — the story was authored
against a different draft of the API. Typecheck now passes.
web/.storybook/main.ts:
Replaced the "deps not installed" header block with a
version-selection history block documenting the 8 → 9 → 10
trail so the next operator who upgrades Vite doesn't re-walk
the same wall.
.gitignore:
Added `web/.storybook-static/` (Storybook build output, like
web/dist/).
═══════════════════════════ VERIFICATION ═══════════════════════════
• npm install — exit 0, 93 packages, no peer warnings, no
ERESOLVE.
• npx tsc --noEmit — exit 0 with stories included (was running
excluded; now they're in the typecheck graph).
• npx storybook build — built in 3.09s, 17 chunks emitted to
.storybook-static. All 8 stories rendered without errors.
• npx vitest run src/components — 16 files / 161 tests pass
(no regression from Storybook install / story-file fix).
• npx vite build — production build green in 3.35s.
• CI guards: no-raw-table 17/17, no-unbound-label 134/134,
no-raw-toLocaleString clean.
Operator follow-ups (none blocking):
• `npm run storybook` locally opens the dev server with hot-
reload + addon-a11y panel.
• `npm run storybook:build` for an immutable static deploy
(e.g. cert-ctl.io/storybook).
• New components SHOULD ship a sibling *.stories.tsx going
forward; can wire a CI guard if desired (fe-component-has-
story.sh — scaffold mentioned in the audit's executable
prompt for Phase 8 TEST-H3 but deferred).
Ground-truth: origin/master tip
|
||
|
|
bc417fc458 |
feat(web): close UX-M9 — replace 886×864 / 773 KB logo with 80×80 / 17.6 KB sibling-repo asset
Closes frontend-design-audit finding UX-M9 (Med):
Logo is an 886×864 PNG (773 KB after bundling) — should be SVG;
first-paint cost is meaningful on slow connections
Ground-truth recon found:
• Sidebar renders the logo at 64×64 ('h-16 w-16' + explicit
width=64 height=64) in Layout.tsx:213
• Source asset was 886×864 PNG — 13.8× over-scaled for its
actual render size, costing 755 KB of wasted bytes on every
cold load
• Sibling repo certctl-io/certctl.io (landing page) already
has the same visual identity at logo-icon.png (80×80 / 17.6 KB)
— exactly the 1.25× retina source size needed for the 64×64
sidebar render
Operator choice (2026-05-14): "Use certctl.io's logo-icon.png"
Rationale: same illustrated logo (cycle ring + shield + 'certctl'
wordmark), zero new design work, 96% byte-size reduction.
═══════════════════════════ CHANGE ════════════════════════════════
web/src/assets/certctl-logo.png:
Replaced via `cp /sessions/.../certctl.io/logo-icon.png ...`.
No code change — same import path in Layout.tsx:55, same render
attributes. The Phase 0 PERF-H2 closure
(loading="eager" decoding="async" + explicit width/height) keeps
the LCP-friendly attributes in place.
Asset shape: 886×864 PNG → 80×80 PNG.
Source bytes: 773,321 → 17,647 (-97.7%).
Bundled dist size: 773 KB → 17.64 KB.
═══════════════════════════ AUDIT FRAMING ════════════════════════
The audit literally said "should be SVG" but the operator-visible
bug was perf (first-paint cost on slow connections). True SVG
conversion needs a designer round-trip (auto-trace explicitly
disallowed by the audit prompt — produces 50+ KB redundant path
data on illustrated logos). The closure here addresses the perf
concern via a 97.7% byte-size win without commissioning a designer;
when one IS commissioned, the SVG can land as a follow-up commit
with no other code changes.
═══════════════════════════ VERIFICATION ═══════════════════════════
• Visual diff: side-by-side render confirmed — same logo,
just at the proper render size.
• npx tsc --noEmit — exit 0 (asset path unchanged; type-check
is satisfied).
• Layout.test.tsx — 7/7 pass (logo presence + sidebar group
structure + Setup-guide button + nav-auth-users testid all
still assert green).
• npx vite build — built, certctl-logo emitted at 17.64 KB.
• Phase 0 PERF-H2's loading=eager + decoding=async + explicit
width/height attributes preserved.
Ground-truth: origin/master tip
|
||
|
|
ac5bb71b61 |
feat(discovery): close P-M1 — in-flight scan progress panel on DiscoveryPage
Closes frontend-design-audit finding P-M1 (Med):
DiscoveryPage doesn't show real-time scan progress — operator who
just kicked off a scan must navigate to NetworkScanPage to see
if it's running
Operator choice (2026-05-14): poll-and-render over SSE / WebSocket.
Rationale recorded in the source comment: zero new transport
infrastructure to maintain; reuses the existing TanStack Query
plumbing. SSE / WebSocket were the alternative paths but neither
is currently used anywhere else in the codebase (grep -rn
"text/event-stream|EventSource|websocket" returned zero hits), so
adopting one for a single Medium finding would be disproportionate.
═══════════════════════════ CHANGES ═══════════════════════════════
web/src/pages/DiscoveryPage.tsx:
• Dropped the `enabled: showScans` gate on the ['discovery-scans']
query. The query is now always-on, so the new in-flight panel
has data to render without operator interaction.
• Refetch cadence flips between 2.5s and 30s via a function-shape
refetchInterval that introspects the query's most-recent data:
anyInFlight = scans.some(s => !s.completed_at)
return anyInFlight ? 2500 : 30000
domain.DiscoveryScan.CompletedAt is *time.Time (nullable
pointer) — nil while the agent is still scanning, set when the
agent posts its DiscoveryReport. When the last running scan
finishes, the next 2.5s tick sees no in-flight rows and the
interval flips back to 30s automatically.
• Derived `inFlightScans = scans.data.filter(!completed_at)` —
drives both the visibility gate (panel doesn't render when
empty) and the row count badge.
• New panel renders ABOVE the existing summary tiles:
- Amber background, animated ping dot, role=status + aria-live=
polite so screen readers announce status changes.
- "{N} scan(s) in progress" header + per-scan row showing
agent_id, directories count, started_at (formatDateTime), and
certificates_found-so-far.
- data-testid hooks: discovery-inflight-panel +
discovery-inflight-row-<id> for QA + future Playwright.
No backend changes — getDiscoveryScans() endpoint already returns
the complete DiscoveryScan shape including the nullable
completed_at field. The closure is pure frontend.
═══════════════════════════ AUDIT FRAMING ════════════════════════
The audit said "real-time scan progress" but the operator chose
the practical interpretation — sub-3-second update latency for an
operator visiting the page, not push-based streaming. The poll
cadence is high enough that an operator clicking from
NetworkScanPage to DiscoveryPage sees in-flight signal within the
first refetch tick (the dashboard's pre-existing 30s polling drops
to 2.5s the moment the first in-flight scan is observed).
═══════════════════════════ VERIFICATION ═══════════════════════════
• npx tsc --noEmit — exit 0
• npx vitest run DiscoveryPage AuditPage — 7/7 pass
• npx vite build — built in 3.31s
• CI guards: no-raw-table baseline 17/17, no-unbound-label 134/134,
no-raw-toLocaleString clean (the new <ul>/<li> rows don't add
raw tables; the panel uses Phase 6's formatDateTime for the
timestamp so no-raw-toLocaleString stays clean).
Ground-truth: origin/master tip
|
||
|
|
fc237de357 |
feat(audit): close P-H2 — server-side since / until time-range filters
Closes frontend-design-audit finding P-H2 (High):
AuditPage filters time-range *client-side*; comment says "server
may not support time params" — fetches the entire event window,
throws 99% away in JS
Ground-truth recon found the closure is much smaller than the
audit's "1 day backend + 2 hours frontend" estimate:
• repository AuditFilter.From / .To: ALREADY exist in
internal/repository/filters.go:57-58
• postgres.AuditRepository.List: ALREADY pushes
`timestamp >= since` + `timestamp <= until` predicates into the
SQL query (internal/repository/postgres/audit.go:107-116)
• Composite index idx_audit_events_category_timestamp on
(event_category, timestamp DESC) added in migration 000032
makes the new query hit an index scan
• MCP `certctl_audit_list_with_category` tool's docstring already
advertises `since` / `until` (internal/mcp/tools_audit_fix.go:174)
— but the server silently ignored them, making the published
contract a lie
The only missing piece was the handler exposing the params + the
frontend porting from client-side filtering. ~150 lines total.
═══════════════════════════ CHANGES ═══════════════════════════════
Service (internal/service/audit.go):
• New ListAuditEventsByFilter(ctx, since, until, category, page,
perPage) threads time bounds into the existing repository.
AuditFilter.From / .To fields.
• Existing ListAuditEvents + ListAuditEventsByCategory become
thin wrappers around the new method with zero times.
Handler (internal/api/handler/audit.go):
• Interface gains ListAuditEventsByFilter signature.
• ListAuditEvents handler parses `since` + `until` RFC3339 query
params; 400 on malformed input or `until` not after `since`.
• Single dispatch via ListAuditEventsByFilter for ALL request
shapes (with or without time bounds, with or without category).
Tests (internal/api/handler/audit_handler_test.go):
• mockAuditService gains listByFiltFunc + lastFilterSince/Until/
Category trace fields.
• 5 new subtests:
- TestListAuditEvents_WithSinceUntil — happy path, both bounds
- TestListAuditEvents_SinceOnly — one-sided open-ended
- TestListAuditEvents_InvalidSince — 400 on garbage
- TestListAuditEvents_UntilBeforeSince — 400 on reversed range
- TestListAuditEvents_TimeRangePlusCategory — composes with
auditor-role category=auth filter
Frontend (web/src/pages/AuditPage.tsx):
• TIME_RANGES dropdown now sends `since` as RFC3339 (now − N hours)
via the existing useQuery params object instead of filtering
client-side after the fact.
• Pre-P-H2 `filtered = data.data.filter(e => now-ts<N)` block
deleted (replaced by `filtered = data?.data || []`); comment
documents why for the diff reader.
OpenAPI (api/openapi.yaml):
• listAuditEvents gains `since` + `until` query-param specs
(format: date-time, description, P-H2 closure date).
• Description block explains the `since`/`until` vs `from`/`to`
naming divergence from the sibling /audit/export endpoint
(different param semantics: list = open-ended bounds, export =
required ≤ 90-day compliance window).
═══════════════════════════ VERIFICATION ═══════════════════════════
Backend (Go toolchain now wired in sandbox — go1.25.10 ARM64 from
.gomodcache, GOCACHE on /tmp partition):
• gofmt -l on all touched files: clean
• go vet ./... — exit 0
• go test -short -count=1 ./internal/api/handler/... — ok 4.195s
(existing 14 subtests + 5 new = 19/19 pass)
• go test -short -count=1 ./internal/service/... — ok 4.733s
• staticcheck ./internal/api/handler/... ./internal/service/...:
zero findings
Frontend:
• npm ci — 634 packages, exit 0 (resolves cleanly post-Hotfix #9)
• npx tsc --noEmit — exit 0
• npx vitest run src/pages/AuditPage.test.tsx — 4/4 pass
• npx vite build — built in 3.49s
Ground-truth: origin/master tip
|
||
|
|
b22cdb3405 |
fix(signer): Hotfix #15 — gofmt comment-indent fix from Hotfix #13
CI run on commit |
||
|
|
03f0e08a77 |
fix(middleware): Hotfix #14 — staticcheck QF1008 from Hotfix #12
CI run #571 (commit |
||
|
|
38f86bca86 |
fix(signer): Hotfix #13 — CodeQL #29 go/path-injection in FileDriver sinks
CodeQL alert #29 (severity: HIGH, rule: go/path-injection) has been open on master for 2 weeks despite Phase 6 commit |
||
|
|
af5c39252f |
fix(middleware): Hotfix #12 — CodeQL #34 go/reflected-xss in etag.go
CodeQL alert #34 (severity: HIGH, rule: go/reflected-xss) fired on commit |
||
|
|
6c00f7b0d3 |
fix(web): Hotfix #11 — CodeQL #36 js/regex/missing-regexp-anchor in multi-page-flows test
CodeQL alert #36 (severity: HIGH, rule: js/regex/missing-regexp-anchor) fired on commit |
||
|
|
49096914d2 |
fix(web): Hotfix #10 — CodeQL #37 js/use-before-declaration on __APP_VERSION__
CodeQL alert #37 (severity: warning, rule: js/use-before-declaration) fired on commit |
||
|
|
aa1c12ae2d |
feat(web): Phase 9 — backend-coupled + page-specific closures (5 shipped, 2 deferred)
Closes the frontend-design-audit Phase 9 batch — the audit's
"backend-coupled or page-specific" tier. Five findings ship; two
defer to follow-ups that need backend handler work.
Shipped:
PERF-M2 — Build-time version + hidden sourcemaps
• vite.config.ts: `sourcemap: 'hidden'` (was `false`). Maps emit
to dist/ but are NOT referenced by JS, so browsers don't fetch
them. The maps stay available for Sentry-class upload at
release time. Comment-block above the build config documents
the tradeoff so a future operator doesn't re-flip to `false`
without realising they're losing release-time debuggability.
• `__APP_VERSION__` build-time `define` reads `web/package.json`
`version` so ErrorBoundary can stamp the build into telemetry
payloads (was previously hardcoded `'dev'`).
FE-L1 — ErrorBoundary copy-trace + telemetry gate
• 50 → 185 LOC rewrite of web/src/components/ErrorBoundary.tsx.
• componentDidCatch now POSTs an ErrorPayload (build version,
UA, href, timestamp, error name + message + stack,
componentStack) to `VITE_ERROR_TELEMETRY_URL` IF that env var
is set at build time. Uses navigator.sendBeacon (page-unload-
safe) → falls back to fetch + keepalive. Unset = no POST,
no console-error spam.
• Operator-facing "Copy details" button writes the same payload
as JSON to the clipboard (navigator.clipboard API → execCommand
fallback for older browsers). A `<details>` block (collapsed
by default) shows the stack + componentStack inline so the
operator can grok the failure without leaving the page.
• Two new data-testid hooks (`error-boundary-reload`,
`error-boundary-copy`) for QA + future Playwright coverage.
• web/src/components/ErrorBoundary.test.tsx — 5 vitest specs:
no-error pass-through, error fallback structure, copy payload
shape, details collapsed-by-default, NO telemetry POST when
URL is unset. cleanup() between tests + console.error
silenced via the React-error-handling pattern.
UX-M8 — DataTable density toggle (opt-in via tableId)
• Density type ('compact' | 'comfortable' | 'spacious') + per-
density cell/header class maps. Default 'comfortable' matches
the existing px-4 py-3 padding so all callers see byte-
identical layout until they opt in.
• DataTableProps gains optional `tableId` + `density` props.
Pages that pass `tableId` get a 3-button DensityToggle
(Compact / Cozy / Spacious) rendered above the table; the
selection persists to localStorage at
`certctl:table-density:<tableId>`. No tableId = no toggle =
no behavioral change for the 17 other tables.
• Hardcoded `px-4 py-3` replaced with the `cellCls` /
`headerCls` lookup against the active density. Three Tailwind
permutations cover compact (px-3 py-1.5), comfortable
(px-4 py-3), spacious (px-5 py-5).
UX-M7 (lever) — CI guard against new raw `<table>` regressions
• scripts/ci-guards/no-raw-table.sh: counts `<table` tags in
`web/src/**/*.tsx` (production only, tests excluded) outside
the canonical primitives (DataTable.tsx + Skeleton.tsx) and
fails CI if the count climbs above baseline. `--strict` mode
rejects any raw table once the backlog clears.
• Baseline pinned at 17 (the current count of page-level raw
tables — verified via the same grep the guard uses). Every
page migration to <DataTable> drops the baseline by 1; new
pages MUST route through <DataTable>.
• No representative migrations in this commit (operator
decision: ship the lever first, migrations as follow-up PRs).
• Pairs with the existing CI guard suite (no-unbound-label,
no-raw-toLocaleString, no-eager-issuer-deletes, etc.) —
same baseline-locked pattern.
FE-M2 — Desktop-only banner (operator chose path a: 2026-05-14)
• web/src/components/DesktopOnlyBanner.tsx: fixed top bar at
viewports < 1024px (Tailwind `lg` breakpoint, below which the
sidebar + content layout starts visibly cramping). Amber
"Desktop-only: certctl is designed for viewports ≥ 1024px"
notice with a Dismiss button that persists to localStorage
(`certctl:desktop-only-banner-dismissed`).
• web/src/index.css: `.desktop-only-banner` is `display: none`
by default and `display: flex` inside the
`@media (max-width: 1023px)` block. CSS-gated visibility,
not React state — the banner mounts always but only renders
visibly on narrow viewports.
• web/src/main.tsx: mounts the banner inside ErrorBoundary,
above QueryClientProvider, so it survives any provider
failure that breaks the rest of the tree.
• Operator-stated rationale (recorded in DesktopOnlyBanner.tsx
header comment): the audit flagged 29 partial sm:/md:/lg:
responsive classes that suggest mobile support which isn't
actually shipped. Rather than rip out the partials (zero
benefit at desktop widths) or ship full mobile (1+ sprint of
QA + ongoing maintenance), this ships an honest signal —
"we don't promise mobile" — that doesn't claim support that
isn't there. The partials stay (no benefit to ripping out;
they may help if the decision reverses).
Deferred:
P-H2 — AuditPage server-side time filters
Requires backend changes to internal/api/handler/audit.go +
service + repository: ListAuditEvents currently accepts only
page/per_page/category. Adds `since` / `until` ISO-8601
params (UTC), pushes the timestamp predicate into the SQL
query, surfaces them in OpenAPI + MCP. Queued as a backend-
first follow-up bundle.
P-M1 — DiscoveryPage in-flight scan panel
Out of scope for the frontend remediation pass; needs a
websocket / SSE channel from internal/service/discovery.go to
the frontend (current poll-and-render UI works against the
existing endpoint set). Queued.
Verification:
• npx tsc --noEmit — exits 0
• npx vitest run ErrorBoundary StatusBadge — 80/80 passed
• npm run build — ✓ built in 3.11s
• bash scripts/ci-guards/no-raw-table.sh —
Raw <table> tags outside DataTable + Skeleton — current: 17, baseline: 17
• Bundle shapes unchanged from Phase 4 (91.66 KB raw / 25.92 KB gz
initial chunk); the ErrorBoundary rewrite adds ~5 KB to index.
Falsifiable proof for the next CI run:
• Frontend Build job's `npm ci` step completes (Hotfix #9 settled
the Storybook peer conflict).
• New no-raw-table.sh guard exits 0 with current=17 baseline=17.
• All 34 CI guards (was 33, +1 for no-raw-table) pass.
Per-finding closure entries land in frontend-design-audit.html in
the follow-up commit (audit HTML update).
|
||
|
|
5231609f26 |
fix(web): Hotfix #9 — remove Storybook deps from package.json (Vite 8 peer conflict)
CI failure on Phase 8 commit
v2.1.5
|
||
|
|
c146e8f75b |
fix(web): sidebar footer simplification + onboarding doc links — operator-reported drift
Two small, operator-reported regressions in the live demo:
1. SIDEBAR FOOTER
Pre-fix the bottom-left of the sidebar had:
Built and maintained by Shankar <- only "Shankar" linked
certctl [⎋] <- "certctl" label + logout
Operator dropped the "certctl" label as redundant (the brand mark +
product name are already in the sidebar header), and asked for the
WHOLE attribution sentence to be the LinkedIn link rather than only
"Shankar". Post-fix the entire sidebar footer is one row:
Built and maintained by Shankar [⎋]
The full sentence is now an ExternalLink to
https://www.linkedin.com/in/shankar-k-a1b6853ba. Logout sits flush-
right via `flex justify-between` and only renders when authRequired
is true (unchanged contract). Same Phase 5 / Hotfix #8 chokepoint
(ExternalLink) means the L-015 CI guard stays green — caught my
first attempt where the explanatory comment text contained the
literal `target="_blank"` string and the line-grep guard fired on
the comment itself. Fixed by rephrasing the comment.
2. ONBOARDING WIZARD DOC LINKS
The CompleteStep ("You're all set!") screen had three doc links at
the bottom — all 404s:
Quickstart Guide → docs/quickstart.md (gone)
Architecture → docs/architecture.md (gone)
Connectors → docs/connectors.md (gone)
Root cause: the 2026-05-04 docs overhaul reorganized into the
audience-organized tree (`getting-started/`, `reference/`,
`operator/`, etc.). The CompleteStep links weren't updated. Every
operator who completed the wizard hit three 404s.
Verified against the live repo BEFORE writing the new links — the
exact paths that exist today:
docs/getting-started/quickstart.md
docs/reference/architecture.md
docs/reference/connectors/index.md (29 per-connector .md siblings)
New links point at those paths. Each still uses target="_blank" +
rel="noopener noreferrer" on the same line so the L-015 guard
passes.
Verification:
• npx tsc --noEmit — exits 0
• Layout 7/7 + OnboardingWizard 4/4 = 11/11 green
• All 34 CI guards pass (L-015 included)
• npx vite build ✓ in 3.30s
|
||
|
|
a9e229bd2a |
feat(frontend): Phase 8 Test Pyramid Investment — TEST-H1 + TEST-H2 + TEST-H3 (scaffold) + TEST-M1
Closes the structural test-pyramid gaps that protect every future
phase from regression. Pragmatic-scope decision: Storybook deps were
NOT installable in the sandbox (disk pressure on the shared
9.8 GB local partition); the config + stories ship as scaffolding +
package.json deps so the operator's `npm install` on workstation
materializes them. Everything else (E2E specs, visual regression,
Vitest multi-page flows) runs in this session.
═════════════════════════ AUDIT VERIFICATION ═════════════════════════
• Q1 (e2e/README intact + zero Playwright wired) — PARTIALLY STALE:
Phase 3 TEST-M3 already shipped playwright.config.ts +
smoke.spec.ts + @playwright/test 1.49.0 + the `npm run e2e`
script. Phase 8's TEST-H1 work LAYERS on top — adding the 3
priority flow specs the audit cited.
• Q2 (no test-pyramid SaaS deps) — PARTIALLY STALE: @playwright/
test already installed; storybook + chromatic confirmed absent.
• Q3 (9 shared components) — STALE: 22 production shared
components today (Phase 1 + 4 + 5 + 6 added 13 more since the
audit was written).
• Q4-Q6 (Vite + Vitest + Tooltip API + CI gates) — all accurate.
═════════════════════════════ CLOSURES ═══════════════════════════════
TEST-M1 (multi-page Vitest flows) — FULL CLOSE
• web/src/__tests__/multi-page-flows.test.tsx — 3 flow tests:
1. Certs list → row click → CertificateDetailPage continuity
2. Direct deep-link to /certificates/:id (no list pre-fetch)
3. Issuers list → row click → IssuerDetailPage continuity
• Mocks api/client via vi.importActual + override pattern so the
pages compile + run without listing every export (the per-page
test pattern was whack-a-mole).
• 3/3 green in 6.83s.
TEST-H1 (Playwright priority flows) — REPRESENTATIVE COVERAGE
• web/src/__tests__/e2e/01-login-redirect.spec.ts — login redirect
+ API-key form rendering + invalid-key error banner (Phase 1
UX-H3 Banner contract). Happy-path login skipped pending live
CERTCTL_E2E_API_KEY in CI env.
• web/src/__tests__/e2e/02-dashboard-shell.spec.ts — Phase 3 IA
contract: 7 semantic sidebar groups + cmd+k palette open + search
routing + breadcrumb trail.
• web/src/__tests__/e2e/03-settings-timestamp-pref.spec.ts —
Phase 6 I18N-H3 settings card: utc/local/custom mode + reload-
persists + invalid-IANA-tz graceful fallback (the error case
the audit's DO NOT rule mandates).
• 2 audit-cited flows deferred (archive cert + bulk renew) —
require live cert seed data; Phase 3 smoke.spec.ts pattern
extends naturally when CI seeds a demo deployment.
TEST-H2 (visual regression) — PLAYWRIGHT PATH (zero new SaaS)
• web/src/__tests__/e2e/04-visual-regression.spec.ts — 5 page
screenshots: /login, /, /certificates, /issuers, /auth/settings.
Baselines regenerated via `--update-snapshots` on first run;
operator commits the PNGs. Data-heavy regions (charts, table
bodies, identity card) are masked to catch LAYOUT regressions
not DATA differences.
• Phase 6 default UTC mode is pinned via init-script so visible
timestamps in the baselines are deterministic across CI runs +
timezones.
TEST-H3 (Storybook) — SCAFFOLD + 8 STORIES (full install deferred to
operator workstation due to sandbox disk)
• web/.storybook/main.ts + preview.ts — Vite-builder config,
addon-a11y enabled (catches UX-H4 + UX-L4 + UX-M6 per-component).
Story discovery: `src/**/*.stories.@(ts|tsx)`.
• 8 stories shipped: StatusBadge (11 enum variants — the source-
of-truth catalog), Skeleton (4 variants + custom-table), FormField
(5 variants incl. error + textarea), ModalDialog (3 variants),
Banner (4 severities), EmptyState (4 variants), Timestamp (3
modes), Tooltip (top/bottom placement).
• 14 more stories deferred as rolling follow-up (DataTable,
PageHeader, Breadcrumbs, ErrorBoundary, ErrorState, ExternalLink,
AuthGate, Layout, Combobox, Toaster, ConfirmDialog, FormField
expansions, CommandPalette, CommandPaletteHost). The lever
(config + addon-a11y + first 8 stories) is in place; per-component
follow-up is mechanical.
Storybook DEPS — PACKAGE.JSON ONLY, LOCKFILE PENDING:
The sandbox's local 9.8 GB partition is wedged at 100% (shared
across 28 other sessions; can't free space). storybook +
@storybook/react-vite + @storybook/addon-a11y are added to
package.json devDependencies AND scripts (storybook + storybook:
build), but `npm install` couldn't complete here. Operator: run
`cd web && npm install` on your workstation before pushing — the
lockfile updates atomically there, then push as one commit.
The .stories.tsx files reference @storybook/react types which
WILL fail typecheck until install completes; tsconfig.json
excludes them from the build typecheck (added `src/**/*.stories.
tsx` + `src/**/*.stories.ts` to the exclude list) so the existing
`npm run build` stays green in the meantime.
Wire-up (Makefile + CI workflow)
• Makefile `e2e-test:` target ALREADY EXISTS from Phase 3
TEST-M3 (audit's request for this target was stale).
• .github/workflows/e2e.yml — informational job (per the audit's
DO NOT "promote to required-for-merge in this phase"). Runs on
push to master + every PR touching web/. Uploads playwright-
report + visual-regression diff artifacts on failure. Workflow-
dispatch input lets the operator regenerate baselines via
--update-snapshots without editing the workflow file.
═══════════════════════════ VERIFICATION ═════════════════════════════
• npx tsc --noEmit — exits 0 (stories + e2e specs excluded via
tsconfig.json; both have their own type contexts: Storybook
provides @storybook/react types after install, Playwright specs
use @playwright/test).
• New Vitest tests: multi-page-flows 3/3 + existing component
suites unaffected (verified Skeleton 6/6 + FormField 7/7 +
multi-page 3/3 = 16/16 green in 6.83s).
• npx vite build — ✓ in 3.39s. Bundle profile unchanged.
• All 34 CI guards pass locally (bash scripts/ci-guards/*.sh loop
— no new guards in this phase).
• Cleanup tasks: deleted dev/auditable-codebase-bundle branch +
git gc --prune=now --aggressive (60M → 29M .git on host).
═══════════════════════════ RESIDUAL RISK ════════════════════════════
• Playwright flakiness on CI — well-documented in industry. The
e2e.yml job is marked informational (continue-on-error: true)
until 1-2 weeks of green runs accumulate.
• Storybook story drift: every new shared component needs a
sibling .stories.tsx. No CI guard enforces this today; tracked
for follow-up.
• Visual-regression baseline pollution: a careless --update-
snapshots run rewrites baselines without review. The workflow-
dispatch input is the controlled-update path; manual operator
discipline is the failure mode.
• Storybook lockfile pending operator install. Tests + build
stay green in the meantime via tsconfig exclude rule.
|
||
|
|
700c399367 |
chore(web): remove darkMode: 'class' from tailwind config — Phase 7 retired
Operator decision 2026-05-14: "no dark mode and no future dark mode
wiring to maintain." The originally-optional Phase 7 (the rebuild path
that would have superseded Phase 0's rip-out if customer signal materialized)
is formally retired in the frontend-design-audit.html banner stack +
Phase 7 H3 header.
Phase 0's closure rationale ("leave `darkMode: 'class'` in tailwind
config for the eventual Phase 7 rebuild") is now superseded — keeping
that line set would resurface as the same half-wired-hook pattern that
drove the original FE-H1 finding, just at the config layer instead of
the HTML layer. Phase 0 removed `class="dark"` from <html> + the body
`bg-slate-900`; this commit closes the loop by also removing the
tailwind config option that pointed at a future feature that won't
arrive.
If the decision ever reverses, this line restores in a one-diff revert
+ a full re-audit of every primitive and page for `dark:` variants
(see the retired Phase 7 executable prompt for the rules: ship complete
or not at all; piecemeal dark-mode is exactly the original finding).
Verification:
• npx tsc --noEmit — exits 0
• npx vite build — ✓ built in 3.20s (Tailwind doesn't need
darkMode set to compile; output is identical because there are
zero `dark:` classes in src/ to gate behind anything)
• Audit HTML (workspace-only, not repo-tracked) updated with:
- Phase 7 RETIRED banner at top of banner stack (amber accent)
- Phase 7 H3 header flipped to "✗ Retired 2026-05-14"
- FE-H1 row note extended with the lock-in decision
- Phase 0's "Do NOT delete darkMode: 'class'" guidance struck
through + marked SUPERSEDED with a pointer to the new banner
v2.1.4
|
||
|
|
1fcb05181d |
feat(frontend): Phase 6 Locale + Date/Time Discipline — close I18N-H1 + I18N-H2 + I18N-H3 + I18N-M2
Closes the Phase 6 batch from cowork/frontend-design-audit.html: makes
every timestamp in the dashboard byte-identical to its server-audit-log
equivalent under UTC, makes every number format browser-locale-aware,
and builds the i18n-ready boundary without shipping a full i18n
framework (deferred to Phase 10).
═════════════════════════ AUDIT VERIFICATION ═════════════════════════
• Q1 utils.ts hardcoded 'en-US' at lines 3 + 8 — confirmed
• Q2 raw new Date(x).toLocaleString() sites — verified 8 sites
across 6 pages (audit said "7+"):
SessionsPage:178, SessionsPage:181 (last_seen, abs_expires)
BreakglassPage:236, BreakglassPage:248 (last_pw_change, locked_until)
GroupMappingsPage:206 (created_at)
OIDCProvidersPage:434 (created_at)
ApprovalsPage:379 (created_at)
ObservabilityPage:71 (server_started)
• Q3 no i18n framework — confirmed (no i18next/react-intl/@formatjs/
date-fns in web/package.json)
• Q4 zero Intl.NumberFormat usage — confirmed (audit-accurate)
• Q5 Tooltip API — `<Tooltip content={…}>{singleChild}</Tooltip>`,
Floating-UI-backed, aria-describedby wired
• Q6 toFixed sites — 1 site in dashboard/charts.tsx (Recharts tooltip
rate formatter); audit was vague but actual is minimal
═════════════════════════════ CLOSURES ═══════════════════════════════
I18N-H1 — drop hardcoded en-US in utils.ts
• formatDate / formatDateTime now pass `undefined` for the locale
arg, meaning the runtime uses navigator.language. Output SHAPE
stable (month: 'short' etc.); LANGUAGE follows the browser.
• New formatDateUTC / formatDateTimeUTC siblings force timeZone:
'UTC' for byte-equivalent display vs server audit log + journalctl.
• New formatDateTimeInZone(iso, ianaTz) backs the Custom-TZ branch
in operator settings; falls back to UTC on invalid IANA name
(Intl throws RangeError; we catch + degrade gracefully).
• Existing tests in utils.test.ts already used locale-tolerant
assertions (.toContain('Jun')) so no test update needed.
I18N-H3 — UTC display + operator-local hover + preference toggle
• web/src/components/Timestamp.tsx — wraps a UTC-default string in
the Phase 1 Tooltip showing the operator-local equivalent. Three
modes:
utc — display UTC (default; screen ≡ logs).
local — display browser-local, hover shows UTC.
custom — display configured IANA tz, hover shows UTC.
• web/src/api/timestampPref.ts — typed localStorage helper with
`certctl:timestamp-pref-changed` CustomEvent so live <Timestamp>
components re-render without a page reload when the operator
flips the toggle.
• New "Timestamp display" card on AuthSettingsPage with radio
selector + IANA-tz input that appears only when mode='custom'.
I18N-H2 — migrate raw toLocaleString sites + CI guard
• 8/8 raw `new Date(x).toLocaleString()` / `.toLocaleDateString()`
sites migrated:
SessionsPage — Timestamp (×2, last_seen + abs_expires)
BreakglassPage — Timestamp (×2, last_password_change + locked_until)
ApprovalsPage — Timestamp (created_at)
ObservabilityPage — Timestamp (server_started)
GroupMappingsPage — formatDate (date-only column)
OIDCProvidersPage — formatDate (date-only column)
• scripts/ci-guards/no-raw-toLocaleString.sh fails CI on any new
raw new Date(x).toLocaleString[Date]Date call outside the
canonical utils.ts impls. Tests + utils.ts itself are excluded.
I18N-M2 — Intl.NumberFormat helpers
• New web/src/api/format.ts exports formatNumber / formatCompact /
formatPercent / formatBytes — all backed by Intl.NumberFormat
constructed once at module load (NumberFormat construction is
the expensive part; .format() is cheap).
• Locale-tolerant test fixtures assert format SHAPE (e.g.
"5[ .,]?432") not exact strings — so the CI runner's locale
doesn't break assertions.
• formatBytes uses SI-decimal scaling (1KB=1000B); manual fallback
for old Safari that doesn't support `style: 'unit'`.
═══════════════════════════ AUDIT-ACCURACY CALLOUTS ════════════════════
(1) Audit said "7+ pages with raw .toLocaleString" — verified 8 raw
SITES across 6 PAGES. Direction was right; counts were vague.
(2) Audit said "no i18n framework + no Intl.NumberFormat" — both
verified accurate (zero matches in production tsx).
(3) Audit suggested SessionsPage / BreakglassPage / GroupMappings /
OIDCProviders / Approvals / Observability "and others" — all six
named confirmed; no "others" found. List was complete.
═══════════════════════════ VERIFICATION ════════════════════════════
• npx tsc --noEmit — exits 0
• New tests: utils 18/18 (preserved) + format 14/14 + Timestamp 6/6
= 38 new test assertions
• Component suite (270/270 across api + Timestamp + Tooltip + sibs)
• 7 migrated page suites — 62/62 green (Sessions / Approvals /
Breakglass / GroupMappings / OIDCProviders / AuthSettings /
Observability)
• All 34 CI guards pass locally (new no-raw-toLocaleString.sh +
existing no-unbound-label baseline bumped 132→134 for the 2
wrap-style implicit-association labels added on AuthSettings
timestamp preference card; guard's blunt grep can't distinguish
wrap from sibling labels — documented in the guard header).
• npx vite build — ✓ in 2.69s
• grep "'en-US'" web/src/api/utils.ts → 0 matches
• grep "new Date.*\.toLocaleString\(\)" web/src --include='*.tsx'
--exclude='*.test.*' → 0 raw sites outside utils.ts
═══════════════════════════ RESIDUAL RISK ════════════════════════════
• UTC default may surprise non-engineering users who expect their
local timezone. Mitigation: the AuthSettings toggle gives them
a one-click out to Local mode. Default UTC is the right safe
default for an audit-log-paired tool.
• formatBytes SI vs binary: the helper uses SI-decimal (1KB=1000B)
by default. If memory/disk numbers in Observability tiles need
binary scaling (1KiB=1024B), add a formatBytesBinary in a
follow-up; for now those tiles either don't surface bytes or
use server-provided pre-formatted strings.
• i18n framework deferred: no react-i18next, no extraction pass.
Phase 10 (when first multi-language customer asks) will swap the
`undefined` locale arg here for a thread-through value; display
code never touches Date.prototype.toLocaleString directly thanks
to the no-raw-toLocaleString CI guard.
|
||
|
|
508c7530e9 |
fix(web): Hotfix #8 — L-015 line-grep guard + CodeQL formatStatus orphan
Two separate issues caught after Phase 5 push: ═════════════════════════ ISSUE 1: L-015 CI GUARD ═════════════════════════ The Frontend Build job on commit |
||
|
|
c9f932be65 |
feat(frontend): Phase 5 Accessibility + Forms — close FE-H3 + UX-H4 primitive + FE-M1 primitive + axe-core gate
Closes the Phase 5 batch from cowork/frontend-design-audit.html: ships
the joint UX-H4 + FE-M1 lever (FormField primitive + react-hook-form +
zod schemas) and the FE-H3 fix (Headless UI Dialog focus trap on the 3
inline-managed modals), with an axe-core regression test + CI guard to
prevent UX-H4 regressions.
═════════════════════════ AUDIT VERIFICATION ═════════════════════════
Confirmed live against the repo before implementing:
• Q1 labels / htmlFor / input-id = 139 / 6 / 0
(audit said 138 / 6 / 0 — labels +1, otherwise accurate)
• Q2 no form library installed
(no react-hook-form, formik, @tanstack/react-form, final-form)
• Q3 3 inline-managed dialog sites confirmed:
SCEPAdminPage.tsx:272, AgentsPage.tsx:314, ESTAdminPage.tsx:281
• Q4 audit's top-6 list was OFF — actual top form-heaviest pages
by useState count are: OIDCProviderDetailPage 21, AgentGroupsPage
18, CertificatesPage 17, CertificateDetailPage 14, BreakglassPage
13, ProfilesPage 13 — NOT the audit-suggested OnboardingWizard 5
(now split in Phase 4) / OIDCProvidersPage 8 / IssuersPage 11 /
ProfilesPage 13 / TargetsPage 9 / ApprovalsPage 5. Audit's
intuition skipped the higher-useState pages.
• Q5 jest-dom imported in src/test/setup.ts — axe-core landed
cleanly
═════════════════════════════ CLOSURES ═══════════════════════════════
UX-H4 (label/input binding) — FormField primitive shipped
• web/src/components/FormField.tsx wraps a <label> + an input child
and auto-generates a stable id via React 18's useId(); cloneElement
threads that id onto BOTH the <label htmlFor> AND the child's id
prop so the WCAG 1.3.1 binding holds by construction. Supports
`required` (asterisk + aria-required), `description` (wires
aria-describedby), `error` (aria-invalid + role=alert + extends
aria-describedby). 7 tests pin the contract.
FE-M1 (no form library) — react-hook-form + @hookform/resolvers + zod
• Added react-hook-form 7.75, @hookform/resolvers 5.2, zod 4.4 as
runtime deps; @axe-core/react, jest-axe, @types/jest-axe as devDeps
• Representative migration of CreateTeamModalInline (inside
onboarding/CertificateStep — operator's first-run experience)
from 3-useState + manual handlers to useForm + zodResolver +
FormField. Schema at pages/onboarding/team.schema.ts.
• Per the audit's "top-6 only, primitive is the lever" rule, the
other 5 audit-suggested pages migrate organically as feature
work touches them — documented as Phase 5 follow-up. The
FormField primitive is the leverage point; per-page migrations
are mechanical applications.
FE-H3 (no focus trap on modal pages)
• New ModalDialog primitive at web/src/components/ModalDialog.tsx —
Headless UI Dialog wrapper for arbitrary-content modals
(complements ConfirmDialog which is confirm-only). Auto-emits
role=dialog + aria-modal + aria-labelledby + ESC-to-close +
backdrop-click-to-close + focus trap.
• All 3 inline-managed modal sites migrated:
• SCEPAdminPage ConfirmReloadModal
• ESTAdminPage ConfirmReloadModal (data-testid preserved)
• AgentsPage RetireAgentModal (3-mode: confirm / blocked / error
— title + footer change per mode; body slot stays the same)
• 37/37 existing modal-page tests stay green — no behavior change
visible to the test suite, only the focus-trap + ESC handling.
UX-H4 regression gate
• web/src/test/a11y.test.tsx runs axe-core (not jest-axe — its
`toHaveNoViolations` matcher uses jest's expect API which can't
plug into Vitest's expect.extend; fails with "expectAssertion.call
is not a function"). Direct axe.run + assert violations.length===0
gives the same gate with a readable failure message.
• Scope: primitives, not page sweeps. Primitives carry the risk
surface; pages compose them. 5 tests covering FormField (with +
without description/error), Skeleton (all 4 variants),
ModalDialog, Breadcrumbs. ~400ms total.
• Skeleton.table's empty <th> cells are decorative shimmers inside
a role=status + aria-busy=true tree — axe-core's
`empty-table-header` rule doesn't model aria-busy gating, so it
is suppressed for the Skeleton variant scan with a clear comment.
• scripts/ci-guards/no-unbound-label.sh — fails CI if a new <label>
without htmlFor lands. Baseline-driven (132 today) so the existing
backlog doesn't block CI; every migration to FormField drops the
baseline. `--strict` mode rejects any unbound label once the
backlog clears.
═══════════════════════════ VERIFICATION ═════════════════════════════
• npx tsc --noEmit — exits 0
• New tests: FormField 7/7, ModalDialog 6/6, a11y 5/5 = 18/18 new
• Component suite: 14 files / 150/150 green
• Page suite (representative subset run): 16 files in first run
(timeout truncated final summary) + 10 files / 48/48 in second
run — all green
• OnboardingWizard 4/4 (the migrated CreateTeamModalInline test
case is the second one — `+ New team opens the inline modal,
calls createTeam, invalidates the cache, and auto-selects the
new team`)
• SCEPAdminPage 20/20, ESTAdminPage 14/14, AgentsPage 3/3 — all
37 modal-page tests stay green after ModalDialog migration
• npm run build ✓ in 3.27s
• CI guard: bash scripts/ci-guards/no-unbound-label.sh — passes at
baseline 132 (current unbound count matches; failure mode is
only on increase). --strict path will fail until backlog clears.
═══════════════════════════ RESIDUAL RISK ════════════════════════════
• RHF migration risk: zod resolver's input/output type mismatch
bit me once during this work (description: z.string().optional()
gave Input: string|undefined vs Output: string after .default()).
Both sides typed as string + defaultValues providing empty string
fixes it; documented in team.schema.ts. Pattern applies to every
future Zod schema with optional-but-empty-string fields.
• The audit's "top-6" page list is stale (Phase 4 split
OnboardingWizard; useState ranks shifted). Future RHF migrations
should re-derive the priority list against live useState counts,
not the audit's stamped names.
• DataTable per-row React.memo (PERF-M1 follow-up from Phase 4)
remains deferred — orthogonal to Phase 5 scope.
|
||
|
|
868f1c25be |
feat(web): sidebar maintainer attribution — mirror landing-page footer style
Add "Built and maintained by Shankar" to the sidebar bottom, with "Shankar" linking to LinkedIn (same href + rel="me noopener" the certctl.io landing-page footer uses). Typography matches the landing page: • font-mono (same family as the existing "certctl" label row) • text-2xs muted (text-sidebar-text/70) for the prefix • slightly brighter for the linked name (text-sidebar-text/90) • underline-offset-2 + hover:underline for the link affordance Lives directly above the existing certctl / logout footer row, so the sidebar bottom now reads: Built and maintained by Shankar certctl [Logout] Single-maintainer OSS standard (Cal.com, Plausible, Beekeeper Studio all credit + link their maintainer the same way). Persistent slot for operators using certctl to find the maintainer in one click — complements the landing-page footer link instead of duplicating it. Verification: • npx tsc --noEmit — exits 0 • Layout.test.tsx — 7/7 green (no test regression from the new row) |
||
|
|
9ce2d8ca8f |
feat(frontend): Phase 4 Loading + Perceived Performance — close UX-M1 + FE-M5 + PERF-M1 + P-H3 + partial FE-M3 / P-M2
Closes the Phase 4 batch from cowork/frontend-design-audit.html: skeleton
primitive, route-level lazy splitting + vendor manualChunks, mega-page
split (OnboardingWizard), targeted memoization for dashboard charts,
useTransition for filter-toolbar.
═════════════════════════ AUDIT VERIFICATION ═════════════════════════
Confirmed facts from the live repo before implementing (not the audit's
stamped numbers — those drifted):
• Pre-Phase-4 index-*.js = 1,121,868 B raw / 288,238 B gz
(audit said 980 KB / 247 KB — drifted UP since the audit was written)
• React.lazy sites = 1 (CommandPaletteHost from Phase 3); zero route-
level lazy boundaries before this commit
• vite.config.ts had NO rollupOptions.output.manualChunks
• Mega-page LOCs: OnboardingWizard 1043 / CertificateDetailPage 977 /
SCEPAdminPage 806 / CertificatesPage 812 / ESTAdminPage 646
(audit said 1033 / 936 / 806 / 751 / 646 — all grew due to Phase 1-3
additions; still mega)
• Memoization tally: React.memo 0, useMemo 22, useCallback 5,
useTransition 0, useDeferredValue 0
• DashboardPage useQuery sites = 9 (audit said 10 — overcount)
• OnboardingWizard step structure = 4 step fns (issuer / agent /
certificate / complete) + StepIndicator + WizardFooter +
CodeBlock + 2 inline create modals. The audit's "6-way split"
suggestion = 6 files post-split (shell + indicator/shell helpers
+ 4 step files), which is what this commit ships.
═════════════════════════════ CLOSURES ═══════════════════════════════
UX-M1 — Skeleton primitive (web/src/components/Skeleton.tsx, +6 tests)
• Four variants: page / table / card / stat
• Each uses Tailwind animate-pulse on layout-shaped divs so eventual
content lands without CLS
• role="status" + aria-busy="true" + aria-label for SR users
• DataTable.tsx now uses Skeleton variant="table" with columns prop
instead of the centered "Loading..." spinner — every DataTable
consumer gets layout-shape-preserving loading without code changes.
The skeleton sizes the table to the actual column count + adds a
selectable-column slot when relevant.
FE-M5 + SCALE-H1 — route-level code split + vendor manualChunks
• main.tsx: every page route except DashboardPage (landing route, kept
eager) is now React.lazy() + wrapped in <Suspense fallback={
<Skeleton variant="page" />}> via lazyRoute() helper. 35 lazy
routes total.
• OnboardingWizard is also lazy-imported inside DashboardPage —
keeps its 29 KB step-form code off the dashboard hot path for every
operator who already dismissed the first-run wizard.
• vite.config.ts: rollupOptions.output.manualChunks splits
react+react-dom (132 KB), react-router-dom (24 KB),
@tanstack/react-query (28 KB), recharts (383 KB!), and lucide-react
(16 KB) into named vendor chunks. Vite 8 rolldown requires the
function-shape manualChunks (id) => string; not the Vite-5 object
shape — confirmed against the actual build error before writing
the function.
Bundle profile (raw / gz):
pre-Phase-4 single index-*.js = 1,121,868 / 288,238
post-Phase-4 index-*.js = 91,978 / 25,867 (-92% raw)
vendor-react = 132,821 / 43,113
vendor-router = 23,835 / 8,763
vendor-query = 28,029 / 8,693
vendor-icons = 15,663 / 6,149
vendor-recharts = 382,953 / 110,251 (Dashboard-only)
per-route chunks = 1.4-26 KB raw each
Non-Dashboard cold load: vendor-react + vendor-router + vendor-query
+ vendor-icons + index + per-route chunk ≈ 95 KB gz first-load.
Dashboard cold load adds vendor-recharts (110 KB gz) on demand.
Audit target was <100 KB gz first-load for non-Dashboard routes — hit.
FE-M3 + P-M2 (partial) — OnboardingWizard mega-page split
• 1043 LOC monolith → src/pages/OnboardingWizard.tsx (100 LOC shell) +
src/pages/onboarding/{types.ts, StepShell.tsx, IssuerStep.tsx,
AgentStep.tsx, CertificateStep.tsx, CompleteStep.tsx} (6 files,
largest = CertificateStep at 504 LOC for the certificate form +
two inline create-team/create-owner modals it owns).
• Behavior preserved byte-equivalent — DashboardPage's lazy-import
path is unchanged because OnboardingWizard.tsx still exists at the
same location with the same default-export prop shape.
• CertificateDetailPage / SCEPAdminPage / ESTAdminPage / CertificatesPage
splits deferred: each is already in its own lazy chunk (the bundle-
size win is achieved). Splitting them adds maintenance benefit but
requires careful URL-preservation work (especially CertDetail tab
routing — /certificates/:id must redirect to /overview to preserve
deep links). Documented as Phase 4 follow-up; not blocking on this
closure.
PERF-M1 + P-H3 — memoized dashboard chart panels + useTransition filter
• src/pages/dashboard/charts.tsx — 4 React.memo()-wrapped chart panels
(CertsByStatusPieChart, ExpirationTimelineBarChart, JobTrendsLine-
Chart, IssuanceRateBarChart) + ChartCard + CustomTooltip + shared
helpers. Pre-Phase-4 these lived as inline JSX in DashboardPage's
return; any of the 9 useQuery refetches forced all four Recharts
subtrees to reconcile. Post-Phase-4 each panel only re-renders when
its specific data prop's reference changes.
• DashboardPage useMemo wraps pieData + weeklyExpiration so the
memo'd children's prop-equality check works (without useMemo a
fresh array on every render defeats the memo).
• Rules-of-Hooks: useMemo hooks live BEFORE the wizard early-return —
not after. (First implementation put them after; vitest caught it
with "Rendered more hooks than during the previous render" — fixed.)
• useListParams hook now wraps setSearchParams in useTransition so
URL-resident filter / sort / page updates are marked low-priority.
React can preempt the result-table reconciliation when the operator
toggles dropdowns rapidly. Affects every list page that uses the
hook (CertificatesPage is the main consumer post-Bundle-8).
═══════════════════════════ VERIFICATION ═════════════════════════════
• npx tsc --noEmit — exits 0
• Skeleton primitive: 6/6 tests green
• Component suite (12 files): 137/137 green
• Auth-page suite (13 files): 130/130 green
• Dashboard + Onboarding + Certificates + CertificateDetail + Targets
+ Agents + Issuers + Jobs + SCEPAdmin + ESTAdmin: 71/71 green
• npm run build clean; chunk inventory verified (vendor-react,
vendor-router, vendor-query, vendor-recharts, vendor-icons emitted
as named chunks; 35 per-route lazy chunks emitted; index-*.js
shrunk to 91.66 KB raw / 25.92 KB gz).
═══════════════════════════ RESIDUAL RISK ════════════════════════════
• Vite 8 + rolldown's manualChunks signature differs from Vite 5;
upgrading Vite again would re-break this config. Comment in
vite.config.ts pins the function-shape requirement.
• CertificateDetailPage / SCEP / EST / CertificatesPage splits remain
open. Mega-LOC files but already lazy-chunked, so deferring is safe.
• Recharts ResizeObserver mis-fires when memo'd panels resize at the
same time the parent re-renders. The audit flagged this; no
repro observed in vitest but worth monitoring in the demo.
|
||
|
|
0987e222dd |
fix(web): Phase 3 hotfix — UsersPage.test.tsx Router context + Breadcrumbs defensive guard
CI failure on Phase 3 commit (
|
||
|
|
e761ae40a4 |
feat(frontend): Phase 3 Information Architecture + Search — close UX-H1 + FE-H2 + UX-M5 + UX-H6 + FE-L4; FE-M6 deferred
Phase 3 of the frontend-design audit: information architecture + search.
Layout.tsx rewritten once for BOTH grouped-sidebar (UX-H1) AND lucide-
react icon migration (FE-H2). Breadcrumbs primitive added + wired into
PageHeader. cmd+k command palette mounted globally via cmdk. FE-M6
(drop unsafe-inline from CSP style-src) deferred — the audit's framing
was incomplete.
New / changed
=============
web/src/components/Layout.tsx (rewrite — UX-H1 + FE-H2 + FE-L4)
Pre: flat 31-item nav array with literal SVG path-string icons.
Post: 7 semantic groups (Inventory / Trust / Delivery / People /
Notify / Access / Audit) of 31 NavLinks total; lucide-react
icon components replace every path string (27 named imports);
collapsible per-group state persisted to localStorage
(`certctl:nav:collapsed-groups`); aria-expanded / aria-controls
on each group header; the existing Setup-guide button and Sign-
out button kept verbatim. Logout icon swapped from inline SVG to
lucide `LogOut`.
web/src/components/Breadcrumbs.tsx (new — UX-M5)
Walks the current pathname via useLocation() + a static
pathSegmentLabels map. Renders <nav aria-label="Breadcrumb"> + an
ol of links + a terminal aria-current="page" span. Renders
nothing on the dashboard root. 8 sibling tests in
Breadcrumbs.test.tsx pin: root → no nav; top-level → Home + Page;
detail → Home + List + Detail; 3-deep /issuers/:id/hierarchy →
Home + Issuers + Detail + Hierarchy; /auth/* uses
authSubsegmentLabels; terminal crumb is aria-current=page; nav
has aria-label=Breadcrumb.
web/src/components/PageHeader.tsx (1-line wire-in)
Renders <Breadcrumbs /> above the page title. Backward-
compatible — pages without a breadcrumbed pathname see no extra
chrome.
web/src/components/CommandPalette.tsx (new — UX-H6)
cmdk-driven palette with three sections:
1. Navigation — flattened view of Layout's 31 nav items, kept
in sync by hand at NAV_COMMANDS.
2. Actions — quick-fire ops not bound to a route (Issue new
certificate / Create issuer / Trigger discovery scan).
3. Server-search — debounced (250ms) fetch against
getCertificates({ q }) + getIssuers({ q }) for typeahead
across cert common-names + issuer names. Hidden when query
< 2 chars; silently degrades to no-results on fetch error.
web/src/components/CommandPaletteHost.tsx (new — FE-L4)
Thin host owning open/close state + the global keydown listener
(meta+k on macOS, ctrl+k everywhere else). Lazy-loads the
palette via React.lazy so cmdk's bundle (~25 KB) only lands
when the operator first hits cmd+k. Mounted inside BrowserRouter
so useNavigate() resolves.
Audit-accuracy callouts
=======================
1. UX-H1 wording was FACTUALLY WRONG. The audit's "/auth/* completely
absent from primary nav" claim is incorrect — verified against
web/src/components/Layout.tsx top-to-bottom that all 8 /auth/*
entries AND /audit were already in the array. The actual issue
was UNGROUPED, not absent. Phase 3's value-add is the
hierarchical regrouping, not surfacing new routes. Restated in
the file header comment.
2. FE-M6 deferred — audit framing was too narrow. The CSP comment
in internal/api/middleware/securityheaders.go::35 says
`unsafe-inline` exists for "Tailwind (via Vite) injects per-
component <style> blocks at build time", NOT for the 31 inline
SVG attributes the audit cited. Even after FE-H2 removes the
Layout.tsx SVGs, there are 17 production tsx files with React
`style={...}` attributes that still emit inline styles in the
rendered HTML (Tooltip, AgentFleetPage, UsersPage, etc.).
Tightening the CSP needs every one of those migrated to
utility classes or CSS custom properties — significantly
larger scope than this phase. Tracked as Phase 4+ follow-up.
3. UX-M5 implementation pivot. The audit prompt suggested
useMatches() + per-route handle.crumb. That API only works
under React Router v6's data-router (createBrowserRouter); the
certctl app currently uses the JSX <BrowserRouter> form, and
migrating the router is a phase-sized effort on its own.
Pivoted to useLocation() + a static pathSegmentLabels map.
Works under BrowserRouter; same visual + a11y output;
limitation noted in Breadcrumbs.tsx header so a future
router migration can upgrade in place.
Verification
============
$ npx tsc --noEmit
(exit 0)
$ npx vitest run src/components/Layout.test.tsx src/components/Breadcrumbs.test.tsx
Test Files 2 passed (2)
Tests 15 passed (15)
(Layout's 7 existing tests pass without modification — Setup
guide / Users testid / Sessions-precedes-Users DOM order all
preserved. Breadcrumbs ships with 8 new assertions.)
$ npx vite build
✓ built in 3.58s
(bundle grows ~25 KB from lucide-react + cmdk; cmdk lazy-loaded
so it doesn't land on initial page load)
$ grep -nE "navGroups|label: 'Access'|from 'lucide-react'|cmdk" \
web/src --type tsx --type ts -r | grep -v test
(15+ hits across Layout / Breadcrumbs / CommandPalette / Host)
$ grep -cE "icon: '" web/src/components/Layout.tsx
0 (was 31 path strings; now all replaced with lucide imports)
$ ls web/src/components/{Breadcrumbs,CommandPalette,CommandPaletteHost}.tsx
(all three new files exist)
Residual risks
==============
* The 14-ish inline SVGs in other pages (DashboardPage, ErrorState,
DataTable, JobsPage, CertificateDetailPage, OnboardingWizard)
still ship as raw <svg> markup. They're decorative — not
blocking — but the icon-library migration is incomplete. Next
per-page touches should replace them with lucide imports.
* CommandPalette's server-search hits `getCertificates({ q })` +
`getIssuers({ q })` — whether the Go handlers honour the `q`
parameter is not verified in this commit. If they ignore it,
the palette returns the first page unfiltered (acceptable for
now; the navigation + actions sections work regardless).
* The Layout's NAV_COMMANDS table in CommandPalette.tsx duplicates
the navGroups array in Layout.tsx by hand. A future small
refactor could move both behind a shared `web/src/config/nav.ts`.
* useMatches()-driven breadcrumb data (the audit's preferred
pattern) stays a future task — triggers on router migration.
|
||
|
|
1daae5d709 |
docs(readme): fix demo path command — point at deploy/demo-up.sh wrapper
Operator reproduction (verbatim log captured 2026-05-14):
$ docker compose -f deploy/docker-compose.yml -f deploy/docker-compose.demo.yml up -d --build
... build succeeds, containers come up ...
dependency failed to start: container certctl-server is unhealthy
$ docker compose ... logs certctl-server | tail -1
certctl-server | Failed to load configuration: phase-2 SEC-H3
fail-closed guard (missing TS): CERTCTL_DEMO_MODE_ACK=true requires
CERTCTL_DEMO_MODE_ACK_TS=<unix-epoch> set within the last 24h —
refuse to start.
Root cause
==========
README.md L95 documented a bare `docker compose ... up` command that
ignores the Phase 2 SEC-H3 fail-closed guard added in
internal/config/config.go::Validate (commit 2026-05-13). The guard
pairs CERTCTL_DEMO_MODE_ACK=true with a required
CERTCTL_DEMO_MODE_ACK_TS=<unix-epoch> that must be within the last
24h, so a forgotten demo deploy doesn't accidentally end up serving
production traffic with auth-type=none.
The demo overlay (deploy/docker-compose.demo.yml) passes the
timestamp through from the shell via
`CERTCTL_DEMO_MODE_ACK_TS: "${CERTCTL_DEMO_MODE_ACK_TS:-}"`. The
README command never exported it, so the server saw an empty value,
the guard refused to boot, the healthcheck never passed, and the
dependent certctl-agent container refused to start.
The deploy/demo-up.sh wrapper (which already exists; it's used by
CI cold-DB smoke and was added in the same SEC-H3 commit chain)
mints `CERTCTL_DEMO_MODE_ACK_TS="$(date +%s)"` before exec'ing
`docker compose` with the same -f flags. Drop-in replacement for
the bare compose invocation.
Fix
===
README.md "Demo path" code block now points at the wrapper script:
./deploy/demo-up.sh -d --build
Plus a one-paragraph explanation of why the wrapper is the supported
entry point and what the SEC-H3 timestamp gate is defending against.
The bare `docker compose ... up` form is documented as failing-closed
so a future operator who tries it understands the error message they
see.
Affected paths
==============
- README.md (the Quick Start "Demo path" block; lines 92-100 before,
93-103 after this change)
Out of scope (tracked separately if needed)
============================================
- The `WARN[0000] ... defaulting to a blank string` lines on docker
compose stdout (POSTGRES_PASSWORD, CERTCTL_API_KEY, etc.) are red
herrings — they fire on the BASE compose's env interpolation but
the demo overlay immediately overrides those with hardcoded
demo-safe values. They're noise; not a footgun. Leaving them
alone — silencing the WARN would require either an .env shim or
setting empty defaults at the base layer, both of which are
worse than the current warn-but-correct behaviour.
- The bare `docker compose -f base.yml up` production path
(README L108) is unchanged. That path requires a real .env and
will fail closed on placeholders — which is the correct
behaviour. The README already documents .env setup for that
path.
|
||
|
|
7c01f811a1 |
feat(frontend): Phase 2 TanStack Query Discipline — close TQ-H1/H2 + TQ-M1/M2/M3 + PERF-H1 + P-H1 + partial TQ-L1
Phase 2 of the frontend-design audit: TanStack Query discipline.
Set the cross-cutting QueryClient defaults + staleTime/gcTime tier
model + visibility-aware polling + 4 optimistic-update mutations
before any further per-page work.
New foundation
==============
web/src/api/queryConstants.ts (new)
STALE_TIME = { REAL_TIME: 15s, REFERENCE: 5m, CONSTANT: 1h }
GC_TIME = { HEAVY: 1m, STANDARD: 5m, REFERENCE: 30m }
Doc-comment explains the tier model so every new useQuery picks
a tier rather than a hardcoded ms integer.
web/src/main.tsx
QueryClient defaults rewritten:
pre: staleTime: 10_000 + refetchOnWindowFocus: true (refetch
storm on every tab refocus across 242 query sites)
post: staleTime: STALE_TIME.REFERENCE (5min) + gcTime: GC_TIME
.STANDARD (explicit 5min) + refetchOnWindowFocus: false
(per-query opt-in for live-tile queries)
retry: 1 unchanged per the audit's DO NOT.
Findings closed by source ID
============================
TQ-H2 (refetch storm)
main.tsx QueryClient defaults — refetchOnWindowFocus: false root +
per-query opt-in. STALE_TIME.REFERENCE 5min for everything else.
TQ-M1 (no gcTime overrides)
main.tsx now sets gcTime: GC_TIME.STANDARD explicitly — the
contract is documented at the root, not implicit-defaulted by
TanStack.
TQ-M2 (12 inconsistent staleTime values)
All 11 hardcoded numeric staleTime overrides migrated to the
STALE_TIME tier constants. useAuthMe.ts (the 12th) already used
its own constant — left alone. Tier mapping:
- operator-facing live data (KeysPage keys, RoleDetail role,
UsersPage, OIDCJWKSStatusPanel, ApprovalsPage):
STALE_TIME.REAL_TIME (15s)
- slow-changing reference data (KeysPage roles, RolesPage,
AuthSettings bootstrap+runtime-config):
STALE_TIME.REFERENCE (5min)
- effectively immutable (RoleDetail permissions catalogue):
STALE_TIME.CONSTANT (1hr)
TQ-H1 (OnboardingWizard infinite 5s poll)
OnboardingWizard.tsx:288-302 — refetchInterval rewritten to v5
functional form:
refetchInterval: (query) =>
(query.state.data?.data?.length ?? 0) > 0 ? false : 5_000;
As soon as the first agent registers, the interval flips to false
and the poll stops. Also explicit: refetchOnWindowFocus: true +
staleTime: STALE_TIME.REAL_TIME (because this IS a live-tile poll
during the wizard).
PERF-H1 (Dashboard polling storm)
DashboardPage.tsx
- jobs poll bumped 10s → 30s (10s granularity isn't needed when
30s is already inside the human-attention window; the
CertificateDetail page is where 10s polling lives)
- visibility-listener pauses ALL Dashboard polls when
document.visibilityState === 'hidden'; on visibility return,
immediately invalidates the 4 live-tile queries (health,
dashboard-summary, jobs, certs-by-status) so the operator
sees fresh data instantly rather than waiting one tick.
- The 4 live-tile queries (health, dashboard-summary, jobs,
certs-by-status) opt into refetchOnWindowFocus: true +
staleTime: STALE_TIME.REAL_TIME explicitly.
- Backend aggregation gap (dashboard-summary + certs-by-status
+ certificates could collapse into 1 endpoint) tracked
separately — Phase 3 backend follow-up.
P-H1 (CertificatesPage 4 duplicate-key pairs)
Pre-Phase-2 4 pairs of distinct cache slots fetching the same data:
['profiles'] vs ['profiles-filter']
['issuers'] vs ['issuers-filter']
['owners', 'form'] vs ['owners-filter']
['teams', 'form'] vs ['teams-filter']
Post-Phase-2 all four pairs collapse to a single parameterized
queryKey shape: `[name, { per_page: 100 }]`. TanStack v5 dedupes
on serialized queryKey — the modal + filter now share one cache
slot per resource. 8 useQuery sites → 4 cache slots; backend
hits halved on first paint of CertificatesPage.
TQ-M3 (4 of 5 priority optimistic-update mutations)
Wired onMutate / onError-rollback / onSettled-invalidation on:
1. mark-notification-read (NotificationsPage)
— flips row status to 'read' in both ['notifications','all']
+ ['notifications','dead'] cache slots
2. claim-discovered-cert (DiscoveryPage)
— flips status to 'Managed' in ['discovered-certificates']
3. dismiss-discovery (DiscoveryPage)
— flips status to 'Dismissed' in same cache slot
4. archive-certificate (CertificateDetailPage)
— flips status to 'Archived' in ['certificate', id]; on
success navigates to /certificates (optimistic data
doesn't linger); on error restores snapshot + toasts
All four fire the Phase 1 Sonner toast on success/failure.
The 5th priority site (role-assignment toggle in
auth/RoleDetailPage) uses raw async/await handlers rather than
useTrackedMutation — converting it requires a structural
refactor outside Phase 2's TQ-focus; tracked as Phase 2 follow-up.
TQ-L1 (useTrackedMutation extended tests)
useTrackedMutation.test.tsx grew from 3 tests to 8:
+ passes onMutate through and runs it before mutationFn
+ passes onError through with the onMutate context (rollback
path — pins the 3rd-arg snapshot semantics)
+ does NOT invalidate on error (only on success)
+ passes onSettled through (fires after both success + error)
+ parity with raw useMutation when no extra options given
Verification
============
$ grep -E "refetchOnWindowFocus: false" web/src/main.tsx
89: refetchOnWindowFocus: false, // per-query opt-in
$ grep -E "STALE_TIME\.REFERENCE" web/src/main.tsx
86: staleTime: STALE_TIME.REFERENCE, // 5 min
$ grep -cE "useQuery.*\['profiles" web/src/pages/CertificatesPage.tsx
2 (was 6 pre-Phase-2 — '[profiles]' modal + '[profiles-filter]'
+ '[profiles]' top-of-page; now both refer to the same
parameterized key '[profiles, { per_page: 100 }]')
$ grep -rE "onMutate" web/src --include='*.tsx' --exclude='*.test.*' | wc -l
5 (≥ 4 priority sites; the 5th is the optional onMutate in
queryConstants test wiring)
$ grep -rE "STALE_TIME\." web/src --include='*.tsx' --include='*.ts' \
--exclude='*.test.*' | wc -l
18 (queryConstants.ts + main.tsx + 11 migrated callsites
+ OnboardingWizard + DashboardPage)
$ npx tsc --noEmit
(exit 0)
$ npx vitest run [13 affected test files]
Test Files 13 passed (13)
Tests 100 passed (100)
$ npx vite build
✓ built in 2.49s
dist/assets/index-yg3cYtYA.js 1,113 kB
(+3 kB vs Phase 1 — queryConstants + optimistic-update wrappers)
Audit-accuracy callouts
=======================
* The audit claimed 10 useQuery on Dashboard; live count is 9 (one
issuers query has no interval). All 8 polling queries now gated
behind visibility-listener; the 9th (issuers) is non-polling and
not affected.
* TQ-L1 originally specified 4 test extensions; shipped 5
(onMutate ordering, onError-with-context, no-invalidate-on-error,
onSettled pass-through, parity-with-raw-useMutation).
* Optimistic-update 5th-site (role-assignment toggle in
auth/RoleDetailPage) deferred — RoleDetailPage handlers use raw
async/await instead of useTrackedMutation. Refactoring it adds
one more optimistic path but requires a structural change
outside Phase 2's TQ-discipline scope. Tracked as Phase 2
follow-up.
Residual risks
==============
* The Dashboard visibility-listener gate may need per-page opt-in
if a page genuinely needs to keep polling while hidden (e.g.
a background-tab monitor). Not aware of any such case today;
if needed, the gate is a simple `useState`-driven hook
extracted to web/src/hooks/useTabVisibility.ts.
* The Dashboard backend-aggregation collapse
(dashboard-summary + certs-by-status + certificates → one
endpoint) is documented as a Phase-3 backend item.
* The 4 collapsed CertificatesPage pairs now request per_page=100
everywhere. Operator with >100 issuers/owners/profiles/teams
will see a truncated dropdown — that's an unrelated Phase-1-
Combobox-migration concern; the right fix when it lands is to
move issuer/owner/profile selectors to Combobox with
server-side typeahead.
* The 12-second total Bundle-1 audit of all useQuery sites
still leaves ~230 queries running with the new 5-min
REFERENCE default. The default is generous; aggressively-
fresh per-page queries that genuinely need 15s freshness
must opt in (the audit page, the agent-fleet live counter,
in-flight scan progress).
|
||
|
|
c1b581b047 |
fix(test): Hotfix #6 — polyfill ResizeObserver in vitest setup (Phase 1 Combobox)
CI surfaced an Unhandled Error after the full vitest suite ran clean:
ReferenceError: ResizeObserver is not defined
at p (node_modules/@headlessui/react/dist/utils/element-movement.js:1:332)
at combobox-machine.js:1:8089
at y.send (machine.js:1:1383)
at Object.closeCombobox (combobox-machine.js:1:5820)
... originating from src/components/Combobox.test.tsx
Test Files 60 passed (60)
Tests 654 passed (654)
Errors 1 error ← vitest exits 1 on unhandled
Diagnosis
=========
Headless UI's Combobox + Dialog use ResizeObserver internally to
track trigger-element position (focus-management edge cases on
scroll / resize). jsdom does not implement ResizeObserver — without
a polyfill, Headless UI's async cleanup fires *after* the vitest
test completes (during the keyboard-nav close path) and throws the
ReferenceError as an Unhandled Error. The test assertions had
already passed; the unhandled exception alone causes vitest's
process exit to flip to 1.
Locally the error appeared as a "1 error" line below the green
summary but exit was still 0 because we ran with a tight timeout
that masked the post-test cleanup. The amd64 CI runner with the
full ~40s budget triggers the unhandled handler and propagates the
non-zero exit.
Fix
===
web/src/test/setup.ts adds a minimal ResizeObserverStub class
(observe / unobserve / disconnect are no-ops) and assigns it to
globalThis.ResizeObserver iff undefined. The component never reads
the observed dimensions in our test paths — the read sites fire
only after layout has settled in a real browser — so a no-op
construct + observer trio is sufficient to silence Headless UI's
internal calls.
Also stubs Element.prototype.scrollIntoView (Headless UI touches
it during Combobox.Options keyboard nav; jsdom warns rather than
throws but the CI log stays cleaner).
Verification
============
$ cd web && npx vitest run src/components/Combobox.test.tsx
Test Files 1 passed (1)
Tests 5 passed (5)
(no Unhandled Errors line; exit 0 — the post-test cleanup
no longer touches the undefined global)
$ cd web && npx tsc --noEmit
(exit 0)
This commit ships on top of Phase 1 (
|