From 7845d282e96e9f79e51e559bf46cad42b0f2615e Mon Sep 17 00:00:00 2001
From: shankar0123
Date: Sun, 15 Mar 2026 01:04:38 -0400
Subject: [PATCH] Restructure roadmap: GUI-first milestones, security gates v1.0

Replaces the old M5 "Polish & Release" catch-all with three focused
milestones: M5 (Hardening + GUI Foundation), M6 (Functional GUI + CI),
M7 (Security Baseline). Agent-side keygen and API auth now gate v1.0
instead of being deferred to V2. V2 resequenced into Operational
Workflows (GUI-first), Team Adoption, and Observability.

Adds explicit v1.0.0 gate criteria, "GUI parallel-tracked" architecture
principle, and Vite + React + TypeScript + TanStack Query tech decisions.

Co-Authored-By: Claude Opus 4.6
---
 CLAUDE.md            | 217 ++++++++++++++++++++++++++++++-------------
 README.md            |  20 +++-
 docs/architecture.md |  21 +++--
 3 files changed, 182 insertions(+), 76 deletions(-)

diff --git a/CLAUDE.md b/CLAUDE.md
index f320244..7953018 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -6,19 +6,19 @@ You are my long-term copilot for building certctl — a self-hosted certificate
 - [x] Go 1.22 server with net/http stdlib routing, slog logging, handler->service->repository layering
 - [x] PostgreSQL 16 schema (14 tables, TEXT primary keys, idempotent migrations)
 - [x] REST API — 41 endpoints under /api/v1/ with pagination, filtering, async actions
-- [x] Web dashboard — React SPA with dark theme, 7 views, demo mode fallback
+- [x] Web dashboard — React SPA with dark theme, 7 views, demo mode fallback (static prototype, not wired to real API)
 - [x] Agent binary — heartbeat, work polling, cert fetch, job status reporting (real HTTP calls)
 - [x] Local CA issuer connector — crypto/x509, in-memory CA, self-signed certs
-- [x] **Issuer connector wired end-to-end** — Local CA registered in server, adapter bridging connector<->service layers
-- [x] **Renewal job processor** — generates RSA key + CSR, calls issuer, stores cert version, creates deployment jobs
-- [x] **Issuance job processor** — reuses renewal flow (same mechanics for Local CA)
-- [x] **Agent CSR signing** — SubmitCSR forwards to issuer connector, stores signed cert version
-- [x] **Agent work API** — GET /agents/{id}/work returns pending deployment jobs
-- [x] **Agent job status API** — POST /agents/{id}/jobs/{job_id}/status for agent feedback
+- [x] Issuer connector wired end-to-end — Local CA registered in server, adapter bridging connector<->service layers
+- [x] Renewal job processor — generates RSA key + CSR, calls issuer, stores cert version, creates deployment jobs
+- [x] Issuance job processor — reuses renewal flow (same mechanics for Local CA)
+- [x] Agent CSR signing — SubmitCSR forwards to issuer connector, stores signed cert version
+- [x] Agent work API — GET /agents/{id}/work returns pending deployment jobs
+- [x] Agent job status API — POST /agents/{id}/jobs/{job_id}/status for agent feedback
 - [x] NGINX target connector — file write, config validation, reload
 - [x] F5 BIG-IP target connector — REST API integration
 - [x] IIS target connector — WinRM integration
-- [x] **Expiration threshold alerting** — configurable per-policy thresholds (default 30/14/7/0 days), deduplication, auto status transitions (Expiring/Expired)
+- [x] Expiration threshold alerting — configurable per-policy thresholds (default 30/14/7/0 days), deduplication, auto status transitions (Expiring/Expired)
 - [x] Email + Webhook notifier interfaces
 - [x] Policy engine — 4 rule types, violation tracking, severity levels
 - [x] Immutable audit trail — append-only, no update/delete
@@ -26,94 +26,181 @@ You are my long-term copilot for building certctl — a self-hosted certificate
 - [x] Background scheduler — 4 loops (renewal 1h, jobs 30s, health 2m, notifications 1m)
 - [x] Docker Compose deployment — server + postgres + agent, health checks, seed data
 - [x] Demo mode — 14 certs, 5 agents, 5 targets, policies, audit events, notifications
-- [x] Documentation — concepts guide, quickstart, advanced demo, architecture, connectors (all updated for M1)
+- [x] Documentation — concepts guide, quickstart, advanced demo, architecture, connectors
 - [x] BSL 1.1 license — 7-year conversion to Apache 2.0 (March 2033)
-- [x] **Test suite** — 120 tests across service layer (63), handler layer (46), and integration (11 subtests)
+- [x] Test suite — 120 tests across service layer (63), handler layer (46), and integration (11 subtests)

-### What's NOT Wired Up Yet (V1 Gaps)
-- [x] ~~**End-to-end certificate lifecycle**~~ — DONE: Job processor invokes Local CA issuer, generates real CSR, stores cert versions
-- [x] ~~**Agent CSR flow**~~ — DONE: Agent polls for work, fetches certs, reports job status via real HTTP calls
-- [ ] **Agent-side key generation**: V1 uses server-side key generation for Local CA (pragmatic for dev/demo). V2 will have agents generate keys locally for production CAs.
-- [x] ~~**Agent target connector invocation**~~: DONE (M1.1) — Agent now creates NGINX/F5/IIS connectors from target config, calls DeployCertificate
-- [x] ~~**ACME protocol**~~: DONE (M2) — Full ACME v2 implementation with HTTP-01 challenge solving via built-in challenge server
-- [x] ~~**Expiration threshold alerting**~~: DONE (M3) — Configurable thresholds per renewal policy, deduplication via threshold tags, auto Expiring/Expired status transitions
-- [x] ~~**Unit tests**~~: DONE (M4) — 120 tests: service layer, handler layer, and end-to-end integration test
+### What's NOT Wired Up Yet (Pre-v1.0 Gaps)
+- [ ] **GUI wired to real API**: Dashboard is a static prototype with demo mode fallback. Not functional against the live backend.
+- [ ] **Agent-side key generation**: V1 uses server-side key generation for Local CA (pragmatic for dev/demo). Must move to agents before v1.0.
+- [ ] **API authentication enforced**: Auth types exist but demo runs with `CERTCTL_AUTH_TYPE=none`. No rate limiting.
+- [ ] **Build errors**: `nginx.go` has non-constant format string errors that will block CI.
+- [ ] **Test coverage gaps**: Service 39%, handler 28%. No negative-path integration tests (issuer down, malformed certs, DB failures).

-### Milestone 1: End-to-End Lifecycle COMPLETE
+---
+
+## Completed Milestones
+
+### M1: End-to-End Lifecycle ✅
 Wire the complete flow: scheduler -> job -> CSR -> issuer -> cert version -> deploy -> status -> audit -> notification.

-### Milestone 1.1: Agent-Side Deployment COMPLETE
+### M1.1: Agent-Side Deployment ✅
 Work endpoint enriched with target type + config, agent instantiates connectors and calls DeployCertificate.

-### Milestone 2: ACME Integration COMPLETE
+### M2: ACME Integration ✅
 Full ACME v2 protocol implementation using golang.org/x/crypto/acme with HTTP-01 challenge solving.

-### Milestone 3: Expiration Alerting COMPLETE
+### M3: Expiration Alerting ✅
 Configurable alert_thresholds_days JSONB column on renewal_policies, threshold-aware alerting with deduplication, auto status transitions.

-### Milestone 4: Test Coverage COMPLETE
+### M4: Test Coverage ✅
+120 tests: service layer unit tests (8 files), handler tests (2 files + utils), end-to-end integration test.

-**Test Files Created:**
-- `internal/service/testutil_test.go` — Mock implementations for all repository interfaces
-- `internal/service/certificate_test.go` — 10 tests for CertificateService
-- `internal/service/agent_test.go` — 9 tests for AgentService
-- `internal/service/audit_test.go` — 9 tests for AuditService
-- `internal/service/job_test.go` — 7 tests for JobService
-- `internal/service/notification_test.go` — 16 tests for NotificationService
-- `internal/service/policy_test.go` — 11 tests for PolicyService
-- `internal/service/renewal_test.go` — 12 tests for RenewalService (includes threshold alerting, dedup, status transitions, job processing)
-- `internal/api/handler/test_utils.go` — Shared test utilities and error constants
-- `internal/api/handler/certificate_handler_test.go` — 22 tests for CertificateHandler (HTTP layer)
-- `internal/api/handler/agent_handler_test.go` — 24 tests for AgentHandler (HTTP layer)
-- `internal/integration/lifecycle_test.go` — End-to-end integration test (11 subtests) exercising full certificate lifecycle through HTTP API with real Local CA issuer
+---

-**Coverage:**
-- Service layer: 39% of statements
-- Handler layer: 28% of statements
-- Integration: Full lifecycle flow through HTTP API with real cert signing
+## V1 Roadmap: Ship a Functional Product

-### Milestone 5: Polish & Release
-- Error handling audit (no panics, descriptive errors)
-- API input validation (required fields, format checks)
-- README screenshots of dashboard
-- GitHub Actions CI (build, test, lint)
-- Tagged v1.0.0 release with Docker images
+The principle: **every backend feature ships with its corresponding GUI surface.** The GUI is where ops teams spend 80% of their time — it must be an operational tool, not a demo viewer.

-## V2 Roadmap (Phase 2: Operational Maturity)
-- Richer dashboard (charts, trend lines, certificate health scores)
-- Bulk import of known certificates
-- OIDC/SSO authentication
-- Stronger RBAC (role-based access control)
+### M5: Hardening + GUI Foundation
+**Goal**: Fix build errors, add input validation, and establish the real frontend build pipeline.
+
+**Backend hardening:**
+- Fix `nginx.go` non-constant format string errors
+- Error handling audit across all service methods (no panics, descriptive errors, consistent error types)
+- API input validation (required fields, format checks, string length limits)
+- Increase service layer test coverage to 60%+ with negative-path tests (issuer failures, DB errors, malformed inputs)
+
+**GUI foundation:**
+- Migrate from single `web/index.html` to proper Vite + React + TypeScript project
+- Set up TanStack Query (React Query) for server state management (caching, refetching, optimistic updates)
+- Keep existing dark theme, componentize the 7 existing views
+- Wire certificate list view to real API with server-side pagination, filtering, and sorting
+- Wire certificate detail view showing version history, deployment targets, job status
+- API error states shown in UI (loading, error, empty states)
+
+**Deliverables**: Clean build, validated API inputs, cert list + detail views working against real backend.
+
+### M6: Functional GUI + CI
+**Goal**: Wire all remaining views to real API and establish CI pipeline.
+
+**GUI — remaining views:**
+- Agent list with health indicators (online/offline/stale from heartbeat timestamps)
+- Agent detail with recent jobs and heartbeat history
+- Job queue view with status badges, retry controls, cancel actions
+- Notification inbox with read/unread state, threshold alert grouping by certificate
+- Audit trail view with time range picker, actor/action/resource filters
+- Policy list with violation counts and severity indicators
+- Dashboard overview with summary cards (total certs, expiring soon, active agents, pending jobs)
+
+**CI/CD:**
+- GitHub Actions: build, test, lint on every PR
+- Docker image builds on tag push
+- Test coverage reporting
+
+**Deliverables**: Every API-backed view functional in the GUI. CI green on master.
+
+### M7: Security Baseline
+**Goal**: Make certctl deployable in a shared/team environment. This gates the v1.0 tag.
+
+**Authentication & authorization:**
+- API key auth enforced by default (not `none`)
+- Rate limiting on all API endpoints
+- CORS configuration for dashboard
+
+**Agent-side key generation:**
+- Agents generate RSA/ECDSA keys locally
+- Agents submit CSR (public key only) to control plane
+- Private keys never leave agent infrastructure
+- Server-side keygen retained only for Local CA demo mode (flagged explicitly)
+
+**Deliverables**: Auth enforced, rate limits active, private keys isolated from control plane.
+
+### v1.0.0 Release
+**Gate criteria** — all must be true:
+- [ ] All M5-M7 deliverables complete
+- [ ] CI green with 60%+ service layer coverage
+- [ ] GUI functional against real API (no demo mode fallback needed)
+- [ ] Agent-side keygen working for ACME issuer
+- [ ] API auth enforced by default
+- [ ] README screenshots of actual dashboard
+- [ ] Tagged Docker images published
+- [ ] No known panics or unhandled error paths
+
+---
+
+## V2 Roadmap: Operational Maturity
+
+### V2.0: Operational Workflows (GUI-first)
+**Goal**: Transform the GUI from a viewer into an operational tool.
+
+- Interactive renewal approval for non-auto-renew policies (approve/reject with reason)
+- Bulk certificate operations (multi-select -> trigger renewal, change policy, reassign owner)
+- Deployment status timeline showing each lifecycle step visually (requested -> issued -> deploying -> active)
+- Certificate detail: inline policy editor with threshold configuration
+- Target connector configuration wizard (add NGINX target, enter config, test connectivity)
+- Audit trail export (CSV/JSON) with applied filters
+- Real-time updates via SSE/WebSocket for job status changes (no polling)
+
+### V2.1: Team Adoption
+**Goal**: Enable multi-user team environments.
+
+- OIDC/SSO authentication (Okta, Azure AD, Google)
+- Role-based access control (admin, operator, viewer)
+- CLI tool (`certctl`) for terminal-based workflows (list certs, trigger renewal, check agent status)
+- Slack/Teams notifier connectors
+- Bulk import of existing certificates from PEM files or network scans
+
+### V2.2: Observability + Polish
+**Goal**: Give operators confidence in the system itself.
+
+- Dashboard charts: expiration calendar/heatmap, renewal success rate trends, cert count over time
+- Certificate health score (composite of: days to expiry, policy compliance, deployment status)
+- Agent fleet overview with environment grouping
+- Prometheus metrics endpoint (`/metrics`) for control plane monitoring
+- Structured logging improvements (request IDs, trace context)
 - Deployment rollback support
-- CLI tool (certctl CLI)
-- Slack/Teams notifiers
-- Agent-side key generation (private keys never leave target infrastructure)

-## V3 Roadmap (Phase 3: Discovery & Visibility)
-- Passive/active certificate discovery
-- Network scan import
-- Unknown/unmanaged certificate detection
-- Ownership recommendation workflows
+---
+
+## V3 Roadmap: Discovery & Visibility
+
+- Passive certificate discovery (network listener for TLS handshakes)
+- Active scanning (port scan -> TLS probe -> cert extraction)
+- Network scan import (Nmap, Qualys, etc.)
+- Unknown/unmanaged certificate detection with ownership recommendation
+- Discovery results triage workflow in GUI (claim, assign, ignore)
+- Alerting rule builder with preview in GUI
+
+---
+
+## V4+ Roadmap: Platform & Scale

-## V4+ Roadmap
 - Kubernetes CRD for certificate management
 - Terraform provider
-- Multi-region deployment
+- Multi-region deployment with control plane federation
 - HA control plane with etcd backend
-- Advanced scheduling policies
+- Advanced scheduling policies (maintenance windows, blackout periods)
 - Certificate pinning validation
-- Hardware security module (HSM) support
+- Hardware security module (HSM) support for CA key storage
+- Backup/restore tooling for PostgreSQL data lifecycle
+- API versioning strategy for breaking changes
+
+---
+
 ## Architecture Decisions
+
 - **Go 1.22 net/http** — stdlib routing, no external framework (Chi, Gin, Echo)
 - **database/sql + lib/pq** — no ORM, raw SQL for clarity and control
 - **TEXT primary keys** — human-readable prefixed IDs (mc-api-prod, t-platform, o-alice), not UUIDs
 - **Handler->Service->Repository** — handlers define their own service interfaces (dependency inversion)
 - **Idempotent migrations** — IF NOT EXISTS + ON CONFLICT for safe repeated execution
-- **Agent-based key management** — V2+: private keys generated and stored only on agents, never in control plane. V1: server-side generation for Local CA demo.
+- **Agent-based key management** — v1.0: agents generate keys, submit CSR only. Local CA demo mode retains server-side keygen with explicit flag.
 - **Connector interfaces** — pluggable issuers (IssuerConnector), targets (TargetConnector), notifiers (Notifier)
 - **IssuerConnectorAdapter** — bridges connector-layer `issuer.Connector` with service-layer `service.IssuerConnector` to maintain dependency inversion
 - **BSL 1.1 license** — source-available, prevents competing managed services, converts to Apache 2.0 in 2033
+- **Vite + React + TypeScript** — (M5+) proper frontend build pipeline replacing single-file SPA. TanStack Query for server state.
+- **GUI parallel-tracked with backend** — every backend feature ships with its corresponding GUI surface. No GUI debt accumulation.

 ## Key File Locations
 - Server entry: `cmd/server/main.go`
@@ -131,7 +218,7 @@ Configurable alert_thresholds_days JSONB column on renewal_policies, threshold-a
 - Scheduler: `internal/scheduler/scheduler.go`
 - Schema: `migrations/000001_initial_schema.up.sql`
 - Seed data: `migrations/seed.sql`, `migrations/seed_demo.sql`
-- Dashboard: `web/index.html`
+- Dashboard: `web/` (migrating to Vite + React + TS in M5)
 - Docker: `deploy/docker-compose.yml`, `Dockerfile`, `Dockerfile.agent`
 - Docs: `docs/`
 - Tests: `internal/service/*_test.go`, `internal/api/handler/*_test.go`, `internal/integration/lifecycle_test.go`
diff --git a/README.md b/README.md
index 998b880..0318a91 100644
--- a/README.md
+++ b/README.md
@@ -307,12 +307,22 @@ make docker-clean # Stop + remove volumes

 ## Roadmap

-Summary:
+### V1 (in progress → v1.0.0)
+Backend complete: end-to-end lifecycle, Local CA + ACME v2 issuers, NGINX/F5/IIS targets, threshold alerting, 120 tests. Remaining milestones before v1.0 tag:
+- **M5: Hardening + GUI Foundation** — fix build errors, input validation, migrate dashboard to Vite + React + TypeScript, wire cert list/detail views to real API
+- **M6: Functional GUI + CI** — wire all views (agents, jobs, notifications, audit, policies) to real API, GitHub Actions CI
+- **M7: Security Baseline** — agent-side key generation (private keys never leave agents), API auth enforced, rate limiting

-- **V1 (current)**: Dashboard, inventory, threshold-based expiration alerting (30/14/7/0 days with dedup), Local CA issuer (end-to-end lifecycle wired), ACME v2 (HTTP-01), NGINX/F5/IIS target connectors, agents with work polling, REST API (40+ endpoints), policies, audit trail, Docker Compose, 120 tests (service + handler + integration)
-- **V2**: Charts/trends, bulk import, OIDC/SSO, deployment rollback, CLI, Slack/Teams
-- **V3**: Certificate discovery, network scanning, unknown cert detection
-- **V4+**: Kubernetes CRD, Terraform provider, multi-region, HA control plane, HSM support
+### V2: Operational Maturity
+- **V2.0: Operational Workflows** — renewal approval UI, bulk cert operations, deployment timeline, real-time updates (SSE/WebSocket), target config wizard
+- **V2.1: Team Adoption** — OIDC/SSO, RBAC, CLI tool, Slack/Teams notifiers, bulk cert import
+- **V2.2: Observability** — expiration calendar, health scores, Prometheus metrics, deployment rollback
+
+### V3: Discovery & Visibility
+Certificate discovery (passive/active scanning), unknown cert detection, triage workflows in GUI
+
+### V4+: Platform & Scale
+Kubernetes CRD, Terraform provider, multi-region, HA control plane, HSM support

 ## License

diff --git a/docs/architecture.md b/docs/architecture.md
index e32b0d7..5316fae 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -8,11 +8,12 @@ New to certificates? Read the [Concepts Guide](concepts.md) first.

 ### Design Principles

-1. **Private Key Isolation (V2+ goal)** — In V1, the Local CA generates server-side keys for simplicity. V2+ moves key generation to agents so private keys never touch the control plane
-2. **Decoupled Operations** — Agents operate autonomously; the control plane coordinates but doesn't block agent function
-3. **Audit-First** — Complete traceability of all issuance, deployment, and rotation events
-4. **Connector Architecture** — Pluggable issuers, targets, and notifiers for extensibility
-5. **Self-Hosted** — No cloud lock-in; run with Docker Compose, Kubernetes, or bare metal
+1. **Private Key Isolation** — Agents generate keys locally and submit CSRs only. Private keys never touch the control plane. (Local CA demo mode retains server-side keygen with explicit flag.)
+2. **GUI as Primary Interface** — The web dashboard is the operational control plane, not a secondary viewer. Every backend feature ships with its corresponding GUI surface.
+3. **Decoupled Operations** — Agents operate autonomously; the control plane coordinates but doesn't block agent function
+4. **Audit-First** — Complete traceability of all issuance, deployment, and rotation events
+5. **Connector Architecture** — Pluggable issuers, targets, and notifiers for extensibility
+6. **Self-Hosted** — No cloud lock-in; run with Docker Compose, Kubernetes, or bare metal

 ## System Components

@@ -79,10 +80,18 @@ The agent runs two background loops: a heartbeat (every 60 seconds) to signal it

 ### Web Dashboard

-A single-page React application served as a static HTML file (`web/index.html`). It communicates with the REST API and provides a visual interface for certificate inventory, agent status, job monitoring, audit trail, policy management, and notifications.
+The web dashboard is the primary operational interface for certctl. It is built with Vite + React + TypeScript and uses TanStack Query for server state management (caching, background refetching, optimistic updates).
+
+**Current views**: certificate inventory (list + detail with version history), agent fleet (health indicators from heartbeat), job queue (status, retry, cancel), notification inbox (threshold alert grouping), audit trail (time range and actor/action filters), policy management (rules + violations), and a summary dashboard.

 The dashboard includes a **demo mode** that activates when the API is unreachable — it renders realistic mock data for screenshots and offline presentations.

+**Tech decisions**:
+- Vite for fast builds and HMR during development
+- TanStack Query over manual fetch/useEffect for automatic cache invalidation and refetching
+- Dark theme default (ops teams live in dark mode)
+- SSE/WebSocket planned for real-time job status updates (V2.0)
+
 ### PostgreSQL Database

 All state is stored in PostgreSQL 16. The schema uses TEXT primary keys (not UUIDs) with human-readable prefixed IDs like `mc-api-prod`, `t-platform`, `o-alice`.