Restructure roadmap: GUI-first milestones, security gates v1.0

Replaces the old M5 "Polish & Release" catch-all with three focused
milestones: M5 (Hardening + GUI Foundation), M6 (Functional GUI + CI),
M7 (Security Baseline). Agent-side keygen and API auth now gate v1.0
instead of being deferred to V2. V2 resequenced into Operational
Workflows (GUI-first), Team Adoption, and Observability.

Adds explicit v1.0.0 gate criteria, "GUI parallel-tracked" architecture
principle, and Vite + React + TypeScript + TanStack Query tech decisions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
shankar0123
2026-03-15 01:04:38 -04:00
parent 5553568495
commit 7845d282e9
3 changed files with 182 additions and 76 deletions
+152 -65
@@ -6,19 +6,19 @@ You are my long-term copilot for building certctl — a self-hosted certificate
- [x] Go 1.22 server with net/http stdlib routing, slog logging, handler->service->repository layering
- [x] PostgreSQL 16 schema (14 tables, TEXT primary keys, idempotent migrations)
- [x] REST API — 41 endpoints under /api/v1/ with pagination, filtering, async actions
- [x] Web dashboard — React SPA with dark theme, 7 views, demo mode fallback
- [x] Web dashboard — React SPA with dark theme, 7 views, demo mode fallback (static prototype, not wired to real API)
- [x] Agent binary — heartbeat, work polling, cert fetch, job status reporting (real HTTP calls)
- [x] Local CA issuer connector — crypto/x509, in-memory CA, self-signed certs
- [x] **Issuer connector wired end-to-end** — Local CA registered in server, adapter bridging connector<->service layers
- [x] **Renewal job processor** — generates RSA key + CSR, calls issuer, stores cert version, creates deployment jobs
- [x] **Issuance job processor** — reuses renewal flow (same mechanics for Local CA)
- [x] **Agent CSR signing** — SubmitCSR forwards to issuer connector, stores signed cert version
- [x] **Agent work API** — GET /agents/{id}/work returns pending deployment jobs
- [x] **Agent job status API** — POST /agents/{id}/jobs/{job_id}/status for agent feedback
- [x] Issuer connector wired end-to-end — Local CA registered in server, adapter bridging connector<->service layers
- [x] Renewal job processor — generates RSA key + CSR, calls issuer, stores cert version, creates deployment jobs
- [x] Issuance job processor — reuses renewal flow (same mechanics for Local CA)
- [x] Agent CSR signing — SubmitCSR forwards to issuer connector, stores signed cert version
- [x] Agent work API — GET /agents/{id}/work returns pending deployment jobs
- [x] Agent job status API — POST /agents/{id}/jobs/{job_id}/status for agent feedback
- [x] NGINX target connector — file write, config validation, reload
- [x] F5 BIG-IP target connector — REST API integration
- [x] IIS target connector — WinRM integration
- [x] **Expiration threshold alerting** — configurable per-policy thresholds (default 30/14/7/0 days), deduplication, auto status transitions (Expiring/Expired)
- [x] Expiration threshold alerting — configurable per-policy thresholds (default 30/14/7/0 days), deduplication, auto status transitions (Expiring/Expired)
- [x] Email + Webhook notifier interfaces
- [x] Policy engine — 4 rule types, violation tracking, severity levels
- [x] Immutable audit trail — append-only, no update/delete
@@ -26,94 +26,181 @@ You are my long-term copilot for building certctl — a self-hosted certificate
- [x] Background scheduler — 4 loops (renewal 1h, jobs 30s, health 2m, notifications 1m)
- [x] Docker Compose deployment — server + postgres + agent, health checks, seed data
- [x] Demo mode — 14 certs, 5 agents, 5 targets, policies, audit events, notifications
- [x] Documentation — concepts guide, quickstart, advanced demo, architecture, connectors (all updated for M1)
- [x] Documentation — concepts guide, quickstart, advanced demo, architecture, connectors
- [x] BSL 1.1 license — 7-year conversion to Apache 2.0 (March 2033)
- [x] **Test suite** — 120 tests across service layer (63), handler layer (46), and integration (11 subtests)
- [x] Test suite — 120 tests across service layer (63), handler layer (46), and integration (11 subtests)
### What's NOT Wired Up Yet (V1 Gaps)
- [x] ~~**End-to-end certificate lifecycle**~~ — DONE: Job processor invokes Local CA issuer, generates real CSR, stores cert versions
- [x] ~~**Agent CSR flow**~~ — DONE: Agent polls for work, fetches certs, reports job status via real HTTP calls
- [ ] **Agent-side key generation**: V1 uses server-side key generation for Local CA (pragmatic for dev/demo). V2 will have agents generate keys locally for production CAs.
- [x] ~~**Agent target connector invocation**~~: DONE (M1.1) — Agent now creates NGINX/F5/IIS connectors from target config, calls DeployCertificate
- [x] ~~**ACME protocol**~~: DONE (M2) — Full ACME v2 implementation with HTTP-01 challenge solving via built-in challenge server
- [x] ~~**Expiration threshold alerting**~~: DONE (M3) — Configurable thresholds per renewal policy, deduplication via threshold tags, auto Expiring/Expired status transitions
- [x] ~~**Unit tests**~~: DONE (M4) — 120 tests: service layer, handler layer, and end-to-end integration test
### What's NOT Wired Up Yet (Pre-v1.0 Gaps)
- [ ] **GUI wired to real API**: Dashboard is a static prototype with demo mode fallback. Not functional against the live backend.
- [ ] **Agent-side key generation**: V1 uses server-side key generation for Local CA (pragmatic for dev/demo). Must move to agents before v1.0.
- [ ] **API authentication enforced**: Auth types exist but demo runs with `CERTCTL_AUTH_TYPE=none`. No rate limiting.
- [ ] **Build errors**: `nginx.go` has non-constant format string errors that will block CI.
- [ ] **Test coverage gaps**: Service 39%, handler 28%. No negative-path integration tests (issuer down, malformed certs, DB failures).
### Milestone 1: End-to-End Lifecycle COMPLETE
---
## Completed Milestones
### M1: End-to-End Lifecycle ✅
Wire the complete flow: scheduler -> job -> CSR -> issuer -> cert version -> deploy -> status -> audit -> notification.
### Milestone 1.1: Agent-Side Deployment COMPLETE
### M1.1: Agent-Side Deployment
Work endpoint enriched with target type + config, agent instantiates connectors and calls DeployCertificate.
### Milestone 2: ACME Integration COMPLETE
### M2: ACME Integration
Full ACME v2 protocol implementation using golang.org/x/crypto/acme with HTTP-01 challenge solving.
### Milestone 3: Expiration Alerting COMPLETE
### M3: Expiration Alerting
Configurable alert_thresholds_days JSONB column on renewal_policies, threshold-aware alerting with deduplication, auto status transitions.
### Milestone 4: Test Coverage COMPLETE
### M4: Test Coverage
120 tests: service layer unit tests (8 files), handler tests (2 files + utils), end-to-end integration test.
**Test Files Created:**
- `internal/service/testutil_test.go` — Mock implementations for all repository interfaces
- `internal/service/certificate_test.go` — 10 tests for CertificateService
- `internal/service/agent_test.go` — 9 tests for AgentService
- `internal/service/audit_test.go` — 9 tests for AuditService
- `internal/service/job_test.go` — 7 tests for JobService
- `internal/service/notification_test.go` — 16 tests for NotificationService
- `internal/service/policy_test.go` — 11 tests for PolicyService
- `internal/service/renewal_test.go` — 12 tests for RenewalService (includes threshold alerting, dedup, status transitions, job processing)
- `internal/api/handler/test_utils.go` — Shared test utilities and error constants
- `internal/api/handler/certificate_handler_test.go` — 22 tests for CertificateHandler (HTTP layer)
- `internal/api/handler/agent_handler_test.go` — 24 tests for AgentHandler (HTTP layer)
- `internal/integration/lifecycle_test.go` — End-to-end integration test (11 subtests) exercising full certificate lifecycle through HTTP API with real Local CA issuer
---
**Coverage:**
- Service layer: 39% of statements
- Handler layer: 28% of statements
- Integration: Full lifecycle flow through HTTP API with real cert signing
## V1 Roadmap: Ship a Functional Product
### Milestone 5: Polish & Release
- Error handling audit (no panics, descriptive errors)
- API input validation (required fields, format checks)
- README screenshots of dashboard
- GitHub Actions CI (build, test, lint)
- Tagged v1.0.0 release with Docker images
The principle: **every backend feature ships with its corresponding GUI surface.** The GUI is where ops teams spend 80% of their time — it must be an operational tool, not a demo viewer.
## V2 Roadmap (Phase 2: Operational Maturity)
- Richer dashboard (charts, trend lines, certificate health scores)
- Bulk import of known certificates
- OIDC/SSO authentication
- Stronger RBAC (role-based access control)
### M5: Hardening + GUI Foundation
**Goal**: Fix build errors, add input validation, and establish the real frontend build pipeline.
**Backend hardening:**
- Fix `nginx.go` non-constant format string errors
- Error handling audit across all service methods (no panics, descriptive errors, consistent error types)
- API input validation (required fields, format checks, string length limits)
- Increase service layer test coverage to 60%+ with negative-path tests (issuer failures, DB errors, malformed inputs)
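For context on the `nginx.go` item, here is a hedged Go sketch of the kind of call that `go vet`'s printf check reports as a non-constant format string, along with two usual fixes. The real `nginx.go` is not shown in this commit, so `validationError`, `reloadError` and the sample message are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// validationError wraps raw command output in an error.
// Hypothetical helper; the real connector code may look different.
func validationError(output string) error {
	// Flagged form: fmt.Errorf(output)
	// The format argument is not a constant, and any '%' in the output
	// would be interpreted as a formatting verb.
	// Fix 1: use a constant format string and pass the text as an argument.
	return fmt.Errorf("nginx config validation failed: %s", output)
}

// reloadError shows fix 2: when no formatting is needed, skip fmt entirely.
func reloadError(msg string) error {
	return errors.New(msg)
}

func main() {
	fmt.Println(validationError("unexpected \"%\" in server block"))
	fmt.Println(reloadError("reload timed out"))
}
```

Both forms keep the original text intact even when it contains percent signs, which is the practical reason the check exists.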
**GUI foundation:**
- Migrate from single `web/index.html` to proper Vite + React + TypeScript project
- Set up TanStack Query (React Query) for server state management (caching, refetching, optimistic updates)
- Keep existing dark theme, componentize the 7 existing views
- Wire certificate list view to real API with server-side pagination, filtering, and sorting
- Wire certificate detail view showing version history, deployment targets, job status
- API request states surfaced in the UI (loading, error, and empty states)
**Deliverables**: Clean build, validated API inputs, cert list + detail views working against real backend.
### M6: Functional GUI + CI
**Goal**: Wire all remaining views to real API and establish CI pipeline.
**GUI — remaining views:**
- Agent list with health indicators (online/offline/stale from heartbeat timestamps)
- Agent detail with recent jobs and heartbeat history
- Job queue view with status badges, retry controls, cancel actions
- Notification inbox with read/unread state, threshold alert grouping by certificate
- Audit trail view with time range picker, actor/action/resource filters
- Policy list with violation counts and severity indicators
- Dashboard overview with summary cards (total certs, expiring soon, active agents, pending jobs)
**CI/CD:**
- GitHub Actions: build, test, lint on every PR
- Docker image builds on tag push
- Test coverage reporting
**Deliverables**: Every API-backed view functional in the GUI. CI green on master.
### M7: Security Baseline
**Goal**: Make certctl deployable in a shared/team environment. This gates the v1.0 tag.
**Authentication & authorization:**
- API key auth enforced by default (instead of `CERTCTL_AUTH_TYPE=none`)
- Rate limiting on all API endpoints
- CORS configuration for dashboard
**Agent-side key generation:**
- Agents generate RSA/ECDSA keys locally
- Agents submit CSR (public key only) to control plane
- Private keys never leave agent infrastructure
- Server-side keygen retained only for Local CA demo mode (flagged explicitly)
**Deliverables**: Auth enforced, rate limits active, private keys isolated from control plane.
### v1.0.0 Release
**Gate criteria** — all must be true:
- [ ] All M5-M7 deliverables complete
- [ ] CI green with 60%+ service layer coverage
- [ ] GUI functional against real API (no demo mode fallback needed)
- [ ] Agent-side keygen working for ACME issuer
- [ ] API auth enforced by default
- [ ] README screenshots of actual dashboard
- [ ] Tagged Docker images published
- [ ] No known panics or unhandled error paths
---
## V2 Roadmap: Operational Maturity
### V2.0: Operational Workflows (GUI-first)
**Goal**: Transform the GUI from a viewer into an operational tool.
- Interactive renewal approval for non-auto-renew policies (approve/reject with reason)
- Bulk certificate operations (multi-select -> trigger renewal, change policy, reassign owner)
- Deployment status timeline showing each lifecycle step visually (requested -> issued -> deploying -> active)
- Certificate detail: inline policy editor with threshold configuration
- Target connector configuration wizard (add NGINX target, enter config, test connectivity)
- Audit trail export (CSV/JSON) with applied filters
- Real-time updates via SSE/WebSocket for job status changes (no polling)
### V2.1: Team Adoption
**Goal**: Enable multi-user team environments.
- OIDC/SSO authentication (Okta, Azure AD, Google)
- Role-based access control (admin, operator, viewer)
- CLI tool (`certctl`) for terminal-based workflows (list certs, trigger renewal, check agent status)
- Slack/Teams notifier connectors
- Bulk import of existing certificates from PEM files or network scans
### V2.2: Observability + Polish
**Goal**: Give operators confidence in the system itself.
- Dashboard charts: expiration calendar/heatmap, renewal success rate trends, cert count over time
- Certificate health score (composite of: days to expiry, policy compliance, deployment status)
- Agent fleet overview with environment grouping
- Prometheus metrics endpoint (`/metrics`) for control plane monitoring
- Structured logging improvements (request IDs, trace context)
- Deployment rollback support
- CLI tool (certctl CLI)
- Slack/Teams notifiers
- Agent-side key generation (private keys never leave target infrastructure)
## V3 Roadmap (Phase 3: Discovery & Visibility)
- Passive/active certificate discovery
- Network scan import
- Unknown/unmanaged certificate detection
- Ownership recommendation workflows
---
## V3 Roadmap: Discovery & Visibility
- Passive certificate discovery (network listener for TLS handshakes)
- Active scanning (port scan -> TLS probe -> cert extraction)
- Network scan import (Nmap, Qualys, etc.)
- Unknown/unmanaged certificate detection with ownership recommendation
- Discovery results triage workflow in GUI (claim, assign, ignore)
- Alerting rule builder with preview in GUI
---
## V4+ Roadmap: Platform & Scale
## V4+ Roadmap
- Kubernetes CRD for certificate management
- Terraform provider
- Multi-region deployment
- Multi-region deployment with control plane federation
- HA control plane with etcd backend
- Advanced scheduling policies
- Advanced scheduling policies (maintenance windows, blackout periods)
- Certificate pinning validation
- Hardware security module (HSM) support
- Hardware security module (HSM) support for CA key storage
- Backup/restore tooling for PostgreSQL data lifecycle
- API versioning strategy for breaking changes
---
## Architecture Decisions
- **Go 1.22 net/http** — stdlib routing, no external framework (Chi, Gin, Echo)
- **database/sql + lib/pq** — no ORM, raw SQL for clarity and control
- **TEXT primary keys** — human-readable prefixed IDs (mc-api-prod, t-platform, o-alice), not UUIDs
- **Handler->Service->Repository** — handlers define their own service interfaces (dependency inversion)
- **Idempotent migrations** — IF NOT EXISTS + ON CONFLICT for safe repeated execution
- **Agent-based key management** — V2+: private keys generated and stored only on agents, never in control plane. V1: server-side generation for Local CA demo.
- **Agent-based key management** — v1.0: agents generate keys, submit CSR only. Local CA demo mode retains server-side keygen with explicit flag.
- **Connector interfaces** — pluggable issuers (IssuerConnector), targets (TargetConnector), notifiers (Notifier)
- **IssuerConnectorAdapter** — bridges connector-layer `issuer.Connector` with service-layer `service.IssuerConnector` to maintain dependency inversion
- **BSL 1.1 license** — source-available, prevents competing managed services, converts to Apache 2.0 in 2033
- **Vite + React + TypeScript** — (M5+) proper frontend build pipeline replacing single-file SPA. TanStack Query for server state.
- **GUI parallel-tracked with backend** — every backend feature ships with its corresponding GUI surface. No GUI debt accumulation.
## Key File Locations
- Server entry: `cmd/server/main.go`
@@ -131,7 +218,7 @@ Configurable alert_thresholds_days JSONB column on renewal_policies, threshold-a
- Scheduler: `internal/scheduler/scheduler.go`
- Schema: `migrations/000001_initial_schema.up.sql`
- Seed data: `migrations/seed.sql`, `migrations/seed_demo.sql`
- Dashboard: `web/index.html`
- Dashboard: `web/` (migrating to Vite + React + TS in M5)
- Docker: `deploy/docker-compose.yml`, `Dockerfile`, `Dockerfile.agent`
- Docs: `docs/`
- Tests: `internal/service/*_test.go`, `internal/api/handler/*_test.go`, `internal/integration/lifecycle_test.go`
+15 -5
@@ -307,12 +307,22 @@ make docker-clean # Stop + remove volumes
## Roadmap
Summary:
### V1 (in progress → v1.0.0)
Backend complete: end-to-end lifecycle, Local CA + ACME v2 issuers, NGINX/F5/IIS targets, threshold alerting, 120 tests. Remaining milestones before v1.0 tag:
- **M5: Hardening + GUI Foundation** — fix build errors, input validation, migrate dashboard to Vite + React + TypeScript, wire cert list/detail views to real API
- **M6: Functional GUI + CI** — wire all views (agents, jobs, notifications, audit, policies) to real API, GitHub Actions CI
- **M7: Security Baseline** — agent-side key generation (private keys never leave agents), API auth enforced, rate limiting
- **V1 (current)**: Dashboard, inventory, threshold-based expiration alerting (30/14/7/0 days with dedup), Local CA issuer (end-to-end lifecycle wired), ACME v2 (HTTP-01), NGINX/F5/IIS target connectors, agents with work polling, REST API (40+ endpoints), policies, audit trail, Docker Compose, 120 tests (service + handler + integration)
- **V2**: Charts/trends, bulk import, OIDC/SSO, deployment rollback, CLI, Slack/Teams
- **V3**: Certificate discovery, network scanning, unknown cert detection
- **V4+**: Kubernetes CRD, Terraform provider, multi-region, HA control plane, HSM support
### V2: Operational Maturity
- **V2.0: Operational Workflows** — renewal approval UI, bulk cert operations, deployment timeline, real-time updates (SSE/WebSocket), target config wizard
- **V2.1: Team Adoption** — OIDC/SSO, RBAC, CLI tool, Slack/Teams notifiers, bulk cert import
- **V2.2: Observability** — expiration calendar, health scores, Prometheus metrics, deployment rollback
### V3: Discovery & Visibility
Certificate discovery (passive/active scanning), unknown cert detection, triage workflows in GUI
### V4+: Platform & Scale
Kubernetes CRD, Terraform provider, multi-region, HA control plane, HSM support
## License
+15 -6
@@ -8,11 +8,12 @@ New to certificates? Read the [Concepts Guide](concepts.md) first.
### Design Principles
1. **Private Key Isolation (V2+ goal)** — In V1, the Local CA generates server-side keys for simplicity. V2+ moves key generation to agents so private keys never touch the control plane
2. **Decoupled Operations** — Agents operate autonomously; the control plane coordinates but doesn't block agent function
3. **Audit-First** — Complete traceability of all issuance, deployment, and rotation events
4. **Connector Architecture** — Pluggable issuers, targets, and notifiers for extensibility
5. **Self-Hosted** — No cloud lock-in; run with Docker Compose, Kubernetes, or bare metal
1. **Private Key Isolation** — Agents generate keys locally and submit CSRs only. Private keys never touch the control plane. (Local CA demo mode retains server-side keygen with explicit flag.)
2. **GUI as Primary Interface** — The web dashboard is the operational control plane, not a secondary viewer. Every backend feature ships with its corresponding GUI surface.
3. **Decoupled Operations** — Agents operate autonomously; the control plane coordinates but doesn't block agent function
4. **Audit-First** — Complete traceability of all issuance, deployment, and rotation events
5. **Connector Architecture** — Pluggable issuers, targets, and notifiers for extensibility
6. **Self-Hosted** — No cloud lock-in; run with Docker Compose, Kubernetes, or bare metal
## System Components
@@ -79,10 +80,18 @@ The agent runs two background loops: a heartbeat (every 60 seconds) to signal it
### Web Dashboard
A single-page React application served as a static HTML file (`web/index.html`). It communicates with the REST API and provides a visual interface for certificate inventory, agent status, job monitoring, audit trail, policy management, and notifications.
The web dashboard is the primary operational interface for certctl. It is built with Vite + React + TypeScript and uses TanStack Query for server state management (caching, background refetching, optimistic updates).
**Current views**: certificate inventory (list + detail with version history), agent fleet (health indicators from heartbeat), job queue (status, retry, cancel), notification inbox (threshold alert grouping), audit trail (time range and actor/action filters), policy management (rules + violations), and a summary dashboard.
The dashboard includes a **demo mode** that activates when the API is unreachable — it renders realistic mock data for screenshots and offline presentations.
**Tech decisions**:
- Vite for fast builds and HMR during development
- TanStack Query over manual fetch/useEffect for automatic cache invalidation and refetching
- Dark theme default (ops teams live in dark mode)
- SSE/WebSocket planned for real-time job status updates (V2.0)
### PostgreSQL Database
All state is stored in PostgreSQL 16. The schema uses TEXT primary keys (not UUIDs) with human-readable prefixed IDs like `mc-api-prod`, `t-platform`, `o-alice`.