Restructure v1 roadmap: split M7, add M9 test hardening milestone

Split the monolithic M7 (Security Baseline) into three focused milestones: M7 (Auth + Rate Limiting), M8 (Agent-Side Key Generation), and M9 (End-to-End Test Hardening). M9 adds handler tests for all 7 files, negative-path integration tests, scheduler/connector tests, and CI coverage gates (service 70%+, handler 60%+). Updated v1.0 gate criteria, replaced all stale V2+ references with M8, and added Testing Strategy section to architecture docs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-07-26 13:58:13 +00:00 · 2026-03-15 11:47:27 -04:00
parent f6139252e1
commit 2ba8245159
3 changed files with 99 additions and 29 deletions
@@ -33,9 +33,9 @@ You are my long-term copilot for building certctl — a self-hosted certificate
 - [x] GitHub Actions CI — parallel Go (build, vet, test+coverage) and Frontend (tsc, vite build) jobs

 ### What's NOT Wired Up Yet (Pre-v1.0 Gaps)
-- [ ] **Agent-side key generation**: V1 uses server-side key generation for Local CA (pragmatic for dev/demo). Must move to agents before v1.0.
 - [ ] **API authentication enforced**: Auth types exist but demo runs with `CERTCTL_AUTH_TYPE=none`. No rate limiting.
-- [ ] **Test coverage gaps**: Negative-path integration tests (issuer down, malformed certs, DB failures) still needed.
+- [ ] **Agent-side key generation**: V1 uses server-side key generation for Local CA (pragmatic for dev/demo). Must move to agents before v1.0.
+- [ ] **End-to-end test hardening**: Handler tests only cover 2 of 7 files. No negative-path integration tests (issuer down, malformed certs, DB failures). No scheduler or connector tests. No frontend tests.

 ---

@@ -68,29 +68,88 @@ All views wired to real API: agent detail page with heartbeat status + capabilit

 The principle: **every backend feature ships with its corresponding GUI surface.** The GUI is where ops teams spend 80% of their time — it must be an operational tool, not a demo viewer.

-### M7: Security Baseline
-**Goal**: Make certctl deployable in a shared/team environment. This gates the v1.0 tag.
+### M7: Auth + Rate Limiting
+**Goal**: Make the API production-safe for shared/team environments.

-**Authentication & authorization:**
-- API key auth enforced by default (not `none`)
-- Rate limiting on all API endpoints
-- CORS configuration for dashboard
+**Authentication:**
+- API key auth middleware enforced by default (`CERTCTL_AUTH_TYPE=api-key`)
+- Key generation and hashing (bcrypt/argon2) for stored keys
+- Auth bypass only with explicit `CERTCTL_AUTH_TYPE=none` flag
+- GUI: API key entry/login screen, key passed via `Authorization: Bearer` header
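
As a rough illustration of the authentication bullets above, here is a minimal Go sketch of a Bearer-key middleware. It is not certctl's implementation: `hashKey`, `validBearer`, and `requireAPIKey` are hypothetical names, and SHA-256 stands in for the bcrypt/argon2 hashing the roadmap calls for so that the example needs only the standard library.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"net/http"
	"strings"
)

// hashKey hashes an API key for storage. Assumption: SHA-256 is a
// stdlib-only stand-in for the bcrypt/argon2 hashing named in the roadmap.
func hashKey(key string) [32]byte {
	return sha256.Sum256([]byte(key))
}

// validBearer reports whether header has the form "Bearer <key>" and the
// hash of <key> matches the stored hash (compared in constant time).
func validBearer(header string, stored [32]byte) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	key := strings.TrimPrefix(header, prefix)
	if key == "" {
		return false
	}
	got := hashKey(key)
	return subtle.ConstantTimeCompare(got[:], stored[:]) == 1
}

// requireAPIKey wraps next and rejects requests without a valid key.
// Auth is skipped only when authType is exactly "none", mirroring the
// explicit CERTCTL_AUTH_TYPE=none bypass described above.
func requireAPIKey(authType string, stored [32]byte, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if authType != "none" && !validBearer(r.Header.Get("Authorization"), stored) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

A server would wrap its router once, for example `requireAPIKey(os.Getenv("CERTCTL_AUTH_TYPE"), stored, mux)`, so that any value other than `none` enforces the key check.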

-**Agent-side key generation:**
-- Agents generate RSA/ECDSA keys locally
-- Agents submit CSR (public key only) to control plane
-- Private keys never leave agent infrastructure
-- Server-side keygen retained only for Local CA demo mode (flagged explicitly)
+**Rate limiting:**
+- Token bucket rate limiter on all API endpoints (`golang.org/x/time/rate`)
+- Configurable per-endpoint or global limits via `CERTCTL_RATE_LIMIT_RPS`
+- 429 Too Many Requests response with `Retry-After` header
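
To make the token bucket behaviour concrete, the following Go sketch shows one way the limiter and the 429 response could fit together. The roadmap names `golang.org/x/time/rate`; this example hand-rolls a small bucket with the standard library instead so it stays self-contained. The `bucket` type, `newBucket`, `take`, and `rateLimit` are hypothetical names, and a single global bucket is assumed rather than per-endpoint limits.

```go
package main

import (
	"math"
	"net/http"
	"strconv"
	"sync"
	"time"
)

// bucket is a minimal token bucket. Time is tracked as seconds since an
// arbitrary start so the refill logic can be exercised without a real clock.
type bucket struct {
	mu     sync.Mutex
	rps    float64 // refill rate in tokens per second; must be > 0
	burst  float64 // maximum number of stored tokens
	tokens float64
	last   float64 // time of the last refill, in seconds
}

// newBucket returns a full bucket whose clock starts at zero.
func newBucket(rps float64, burst int) *bucket {
	return &bucket{rps: rps, burst: float64(burst), tokens: float64(burst)}
}

// take refills the bucket for the time elapsed since the last call, then
// spends one token if available. When the bucket is empty it returns false
// and the number of seconds until one full token has accrued.
func (b *bucket) take(nowSec float64) (bool, float64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if elapsed := nowSec - b.last; elapsed > 0 {
		b.tokens = math.Min(b.burst, b.tokens+elapsed*b.rps)
		b.last = nowSec
	}
	if b.tokens >= 1 {
		b.tokens--
		return true, 0
	}
	return false, (1 - b.tokens) / b.rps
}

// rateLimit rejects requests with 429 and a Retry-After header (whole
// seconds, rounded up) whenever the shared bucket is empty.
func rateLimit(b *bucket, next http.Handler) http.Handler {
	start := time.Now()
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ok, wait := b.take(time.Since(start).Seconds())
		if !ok {
			w.Header().Set("Retry-After", strconv.Itoa(int(math.Ceil(wait))))
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

With `newBucket(2, 2)`, two immediate requests pass, a third is rejected with a half-second wait, and a request half a second later passes again. A real implementation would likely read the rate from `CERTCTL_RATE_LIMIT_RPS` and could swap `bucket` for `rate.Limiter`.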

-**Deliverables**: Auth enforced, rate limits active, private keys isolated from control plane.
+**CORS:**
+- Configurable allowed origins for dashboard (`CERTCTL_CORS_ORIGINS`)
+- Sensible defaults for same-origin deployment
+
+**Deliverables**: Auth enforced by default, rate limits active, CORS configured. certctl deployable in shared environments.
+
+### M8: Agent-Side Key Generation
+**Goal**: Private keys never leave agent infrastructure. This is the crypto architecture gate for v1.0.
+
+**Agent key generation:**
+- Agent generates RSA-2048 or ECDSA P-256 key pair locally
+- Agent creates CSR (public key only) and submits via `POST /agents/{id}/csr`
+- Control plane signs CSR via issuer connector, returns cert + chain (no private key)
+- Agent stores key locally with file permissions 0600
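
The agent-side flow above can be sketched in Go with the standard library alone. This is an illustration under assumptions, not the planned agent code: `newKeyAndCSR`, `saveKey`, and `csrCommonName` are hypothetical names, only the ECDSA P-256 option is shown, and the CSR subject is reduced to a common name plus one DNS SAN.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"os"
)

// newKeyAndCSR generates an ECDSA P-256 key pair and a PEM-encoded CSR for
// commonName. Only csrPEM is meant to leave the agent.
func newKeyAndCSR(commonName string) (keyPEM, csrPEM []byte, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.CertificateRequest{
		Subject:  pkix.Name{CommonName: commonName},
		DNSNames: []string{commonName},
	}
	csrDER, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	if err != nil {
		return nil, nil, err
	}
	keyPEM = pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	csrPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: csrDER})
	return keyPEM, csrPEM, nil
}

// saveKey writes the private key with owner-only permissions (0600).
// The mode applies when the file is created; an existing file keeps its mode.
func saveKey(path string, keyPEM []byte) error {
	return os.WriteFile(path, keyPEM, 0o600)
}

// csrCommonName is what a control plane might do on receipt: decode the PEM,
// parse the CSR, verify its self-signature, and return the common name.
func csrCommonName(csrPEM []byte) (string, error) {
	block, _ := pem.Decode(csrPEM)
	if block == nil || block.Type != "CERTIFICATE REQUEST" {
		return "", errors.New("input is not a PEM certificate request")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return "", err
	}
	if err := csr.CheckSignature(); err != nil {
		return "", err
	}
	return csr.Subject.CommonName, nil
}
```

The agent would call `newKeyAndCSR`, persist the key with `saveKey`, and send only `csrPEM` to `POST /agents/{id}/csr`. The CSR carries the public key and a signature proving possession of the private key, which is why the control plane never needs the key itself.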
+
+**Server-side keygen flagging:**
+- Server-side keygen retained only for Local CA with explicit `--server-side-keygen` flag
+- Default behavior: reject issuance requests without agent-submitted CSR
+- Clear log warnings when server-side keygen is active
+
+**ACME integration:**
+- Agent handles ACME HTTP-01 challenge locally (challenge server on agent)
+- Or: agent submits CSR, server handles ACME flow, returns signed cert
+
+**Deliverables**: Private keys isolated from control plane for all production issuers. Server-side keygen flagged as demo-only.
+
+### M9: End-to-End Test Hardening
+**Goal**: Comprehensive test coverage across all layers as the final quality gate before v1.0.
+
+**Handler test expansion (target: all 7 handler files covered):**
+- Jobs handler tests — status transitions, cancel, filter by type/status
+- Notifications handler tests — list, mark-read, filter by type/channel
+- Policies handler tests — CRUD, violations endpoint
+- Issuers handler tests — list, create, test connectivity
+- Targets handler tests — list, create, config validation
+
+**Negative-path integration tests:**
+- Issuer unavailable / returns error mid-issuance
+- Malformed CSR submission (invalid PEM, wrong key type, missing fields)
+- Database connection failure / timeout during job processing
+- Agent heartbeat with invalid/expired API key
+- Rate limiter rejection under load
+- Deployment job with unreachable target
+
+**Scheduler tests:**
+- Renewal checker creates jobs for expiring certs only
+- Job processor respects max_attempts and backoff
+- Health checker marks stale agents offline
+- Notification processor sends pending, skips already-sent
+
+**Connector tests:**
+- IssuerConnectorAdapter bridges correctly for both Local CA and ACME
+- Target connector error handling (NGINX config validation failure, F5 API timeout, WinRM auth failure)
+
+**CI coverage enforcement:**
+- Coverage threshold check in CI (fail if service layer <70%, handler layer <60%)
+- Coverage trend reporting via artifact comparison
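
One possible shape for the threshold check is a small Go helper that reads the text report from `go tool cover -func` and fails below a minimum. The sketch assumes CI produces one coverage profile per layer (for example, one for `./internal/service/...` and one for `./internal/api/handler/...`) and runs the check once per profile. `totalCoverage` and `gate` are hypothetical names, and the threshold is a parameter rather than a fixed value.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// totalCoverage extracts the percentage from the "total:" line of a
// `go tool cover -func` report.
func totalCoverage(report string) (float64, error) {
	sc := bufio.NewScanner(strings.NewReader(report))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 3 && fields[0] == "total:" {
			pct := strings.TrimSuffix(fields[len(fields)-1], "%")
			return strconv.ParseFloat(pct, 64)
		}
	}
	return 0, errors.New("no total line in coverage report")
}

// gate returns an error when the report's total coverage is below
// minPercent, which a CI step can turn into a non-zero exit code.
func gate(report string, minPercent float64) error {
	got, err := totalCoverage(report)
	if err != nil {
		return err
	}
	if got < minPercent {
		return fmt.Errorf("coverage %.1f%% is below the %.1f%% gate", got, minPercent)
	}
	return nil
}
```

Keeping the parsing in a tested helper rather than an inline shell pipeline makes the gate itself easy to cover, and a missing `total:` line fails the build instead of passing silently.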
+
+**Deliverables**: All handler files tested, negative-path integration suite, scheduler and connector tests, CI coverage gates. Target: 70%+ service layer, 60%+ handler layer coverage.

 ### v1.0.0 Release
 **Gate criteria** — all must be true:
-- [ ] All M5-M7 deliverables complete
-- [ ] CI green with 60%+ service layer coverage
+- [ ] All M5–M9 deliverables complete
+- [ ] CI green with coverage gates passing (service 70%+, handler 60%+)
 - [ ] GUI functional against real API (no demo mode fallback needed)
 - [ ] Agent-side keygen working for ACME issuer
 - [ ] API auth enforced by default
+- [ ] Negative-path integration tests passing
 - [ ] README screenshots of actual dashboard
 - [ ] Tagged Docker images published
 - [ ] No known panics or unhandled error paths
@@ -8,7 +8,7 @@ A self-hosted certificate lifecycle platform. Track, renew, and deploy TLS certi

 ## What It Does

-certctl gives you a single pane of glass for every TLS certificate in your organization. The **web dashboard** shows your full certificate inventory — what's healthy, what's expiring, what's already expired, and who owns each one. The **REST API** (40+ endpoints) lets you automate everything. **Agents** deployed on your infrastructure handle certificate deployment, and in V2+ will handle key generation locally so private keys never leave your servers.
+certctl gives you a single pane of glass for every TLS certificate in your organization. The **web dashboard** shows your full certificate inventory — what's healthy, what's expiring, what's already expired, and who owns each one. The **REST API** (40+ endpoints) lets you automate everything. **Agents** deployed on your infrastructure handle certificate deployment, and key generation moves to agents in M8 so private keys never leave your servers.

 ```mermaid
 flowchart LR
@@ -115,7 +115,7 @@ flowchart TB

 ### Key Design Decisions

-- **Private keys isolated from the control plane (V2+ goal).** In V1, the Local CA issuer generates server-side keys for simplicity. V2+ moves key generation to agents — agents generate keys locally and submit CSRs (public key only). The architecture is designed for this separation; V1 takes a pragmatic shortcut for the built-in CA.
+- **Private keys isolated from the control plane (M8 goal).** Currently, the Local CA issuer generates server-side keys for simplicity. M8 moves key generation to agents — agents generate keys locally and submit CSRs (public key only). The architecture is designed for this separation; server-side keygen will be flagged as demo-only.
 - **TEXT primary keys, not UUIDs.** IDs are human-readable prefixed strings (`mc-api-prod`, `t-platform`, `o-alice`) so you can identify resource types at a glance in logs and queries.
 - **Handler → Service → Repository layering.** Handlers define their own service interfaces for clean dependency inversion. No global service singletons.
 - **Idempotent migrations.** All schema uses `IF NOT EXISTS` and seed data uses `ON CONFLICT (id) DO NOTHING`, safe for repeated execution.
@@ -293,7 +293,7 @@ make docker-clean       # Stop + remove volumes

 ### Private Key Management
 - **V1 (Local CA)**: The control plane generates ephemeral RSA-2048 keys server-side for certificate issuance. This simplifies the initial implementation but means private keys exist on the control plane temporarily. Keys are stored in certificate version records.
-- **V2+**: Private keys will be generated exclusively on agents, never sent to the control plane. Keys stored with file permissions 0600 and rotated after successful renewal.
+- **M8+**: Private keys will be generated exclusively on agents, never sent to the control plane. Keys stored with file permissions 0600 and rotated after successful renewal.

 ### Authentication
 - Agent-to-server: API key (registered at agent creation)
@@ -308,10 +308,10 @@ make docker-clean       # Stop + remove volumes
 ## Roadmap

 ### V1 (in progress → v1.0.0)
-Backend complete: end-to-end lifecycle, Local CA + ACME v2 issuers, NGINX/F5/IIS targets, threshold alerting, 120 tests. Remaining milestones before v1.0 tag:
-- **M5: Hardening + GUI Foundation** — fix build errors, input validation, migrate dashboard to Vite + React + TypeScript, wire cert list/detail views to real API
-- **M6: Functional GUI + CI** — wire all views (agents, jobs, notifications, audit, policies) to real API, GitHub Actions CI
-- **M7: Security Baseline** — agent-side key generation (private keys never leave agents), API auth enforced, rate limiting
+Backend complete: end-to-end lifecycle, Local CA + ACME v2 issuers, NGINX/F5/IIS targets, threshold alerting. GUI fully wired to real API with 11 views. CI pipeline running. Remaining milestones before v1.0 tag:
+- **M7: Auth + Rate Limiting** — API key auth enforced by default, token bucket rate limiting, CORS configuration, GUI login flow
+- **M8: Agent-Side Key Generation** — agents generate keys locally, submit CSR only, private keys never leave infrastructure, server-side keygen flagged as demo-only
+- **M9: End-to-End Test Hardening** — handler tests for all 7 files, negative-path integration tests (issuer down, malformed CSR, DB failure), scheduler and connector tests, CI coverage gates (service 70%+, handler 60%+)

 ### V2: Operational Maturity
 - **V2.0: Operational Workflows** — renewal approval UI, bulk cert operations, deployment timeline, real-time updates (SSE/WebSocket), target config wizard
@@ -8,7 +8,7 @@ New to certificates? Read the [Concepts Guide](concepts.md) first.

 ### Design Principles

-1. **Private Key Isolation** — Agents generate keys locally and submit CSRs only. Private keys never touch the control plane. (Local CA demo mode retains server-side keygen with explicit flag.)
+1. **Private Key Isolation** — Agents generate keys locally and submit CSRs only. Private keys never touch the control plane. (M8 milestone; currently server-side keygen for Local CA demo, flagged explicitly.)
 2. **GUI as Primary Interface** — The web dashboard is the operational control plane, not a secondary viewer. Every backend feature ships with its corresponding GUI surface.
 3. **Decoupled Operations** — Agents operate autonomously; the control plane coordinates but doesn't block agent function
 4. **Audit-First** — Complete traceability of all issuance, deployment, and rotation events
@@ -74,7 +74,7 @@ The server exposes a REST API under `/api/v1/` and optionally serves the web das

 ### Agents

-Lightweight Go processes that run on or near your infrastructure. Agents poll the control plane for pending deployment jobs, fetch signed certificates, deploy them to target systems, and report job status back. In V2+, agents will also generate private keys locally and create CSRs. Agents communicate with the control plane via HTTP and authenticate with API keys.
+Lightweight Go processes that run on or near your infrastructure. Agents poll the control plane for pending deployment jobs, fetch signed certificates, deploy them to target systems, and report job status back. In M8, agents will also generate private keys locally and create CSRs — private keys will never leave agent infrastructure. Agents communicate with the control plane via HTTP and authenticate with API keys.

 The agent runs two background loops: a heartbeat (every 60 seconds) to signal it's alive, and a work poll (every 30 seconds) to check for pending deployment jobs via `GET /api/v1/agents/{id}/work`. When a job is found, the agent fetches the certificate, executes the deployment, and reports status via `POST /api/v1/agents/{id}/jobs/{job_id}/status`.

@@ -236,7 +236,7 @@ sequenceDiagram

 #### V1: Server-Side Key Generation (Local CA)

-In V1, the control plane generates keys and CSRs server-side for the Local CA. This simplifies the initial implementation — the full agent-side key generation flow is planned for V2+.
+In V1, the control plane generates keys and CSRs server-side for the Local CA. This simplifies the initial implementation — the full agent-side key generation flow is planned for M8.

 ```mermaid
 sequenceDiagram
@@ -263,7 +263,7 @@ sequenceDiagram
    Note over SVC: Deployment jobs picked up by agents<br/>via GET /api/v1/agents/{id}/work
 ```

-#### V2+ (Planned): Agent-Side Key Generation
+#### M8 (Planned): Agent-Side Key Generation

 ```mermaid
 sequenceDiagram
@@ -457,14 +457,14 @@ flowchart LR

 **V1 (Current):** The Local CA issuer generates RSA-2048 keys and CSRs server-side within `RenewalService.ProcessRenewalJob`. Private key material is stored alongside the CSR in the `certificate_versions` table. This is a pragmatic V1 trade-off to get the end-to-end lifecycle working.

-**V2+ (Target Architecture):** Private keys follow a strict lifecycle on agents:
+**M8 (Target Architecture):** Private keys follow a strict lifecycle on agents:

 1. **Generated on the agent** — never sent to the control plane
 2. **Stored on the agent** — file permissions 0600, owned by the agent process user
 3. **Used by the agent** — for deployment to targets and CSR generation
 4. **Rotated by the agent** — old keys deleted after successful renewal

-The V2+ architecture ensures the control plane only handles public material: certificates, chains, and CSRs.
+The M8+ architecture ensures the control plane only handles public material: certificates, chains, and CSRs.

 ### Authentication

@@ -552,6 +552,17 @@ flowchart TB

 For production, you would also add an ingress controller, TLS termination for the certctl API itself, and external PostgreSQL (RDS, Cloud SQL, etc.).

+## Testing Strategy
+
+certctl uses a layered testing approach aligned with the handler → service → repository architecture:
+
+- **Service layer unit tests** (`internal/service/*_test.go`) — 74 test functions across 7 files with mock repositories. Tests all business logic: certificate CRUD, agent lifecycle, job state machine, policy evaluation, renewal/issuance flow, notification deduplication.
+- **Handler layer tests** (`internal/api/handler/*_test.go`) — 50 test functions using `httptest`. Currently covers certificates and agents; M9 expands to all 7 handler files.
+- **Integration tests** (`internal/integration/lifecycle_test.go`) — 11 subtests covering the full lifecycle from certificate creation through issuance, deployment, and status reporting. M9 adds negative-path scenarios (issuer failure, malformed CSR, DB timeout).
+- **CI pipeline** (`.github/workflows/ci.yml`) — Parallel Go (build, vet, test with coverage) and Frontend (TypeScript check, Vite build) jobs. M9 adds coverage threshold enforcement.
+
+Remaining gaps before v1.0 (M9): handler tests for jobs/notifications/policies/issuers/targets, negative-path integration tests, scheduler loop tests, connector error handling tests, and CI coverage gates.
+
 ## What's Next

 - [Quick Start](quickstart.md) — Get certctl running locally