feat: add network certificate discovery (M21) and Prometheus metrics (M22)

M21 adds server-side active TLS scanning of CIDR ranges with concurrent probing, sentinel agent pattern for pipeline reuse, and full CRUD API for scan targets. M22 adds Prometheus exposition format endpoint alongside existing JSON metrics. Comprehensive documentation audit updates all docs to reflect 91 endpoints, 19 tables, 6 scheduler loops, and 900+ tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-07 15:01:32 +00:00 · 2026-03-24 23:37:47 -04:00
parent d613d98c72
commit 4f90be9311
26 changed files with 2022 additions and 71 deletions
@@ -25,12 +25,12 @@ flowchart TB
        API["REST API\n(Go net/http, :8443)"]
        SVC["Service Layer"]
        REPO["Repository Layer\n(database/sql + lib/pq)"]
-        SCHED["Background Scheduler\n5 loops"]
+        SCHED["Background Scheduler\n6 loops"]
        DASH["Web Dashboard\n(React SPA)"]
    end

    subgraph "Data Store"
-        PG[("PostgreSQL 16\n18 tables\nTEXT primary keys")]
+        PG[("PostgreSQL 16\n19 tables\nTEXT primary keys")]
    end

    subgraph "Agent Fleet"
@@ -374,7 +374,7 @@ Short-lived certificates (those with profile TTL < 1 hour) return "good" from OC

 ### 4. Automatic Renewal

-The control plane runs a scheduler with five background loops:
+The control plane runs a scheduler with six background loops:

 ```mermaid
 flowchart LR
@@ -384,6 +384,7 @@ flowchart LR
        H["Agent Health\n⏱ every 2m"]
        N["Notification Processor\n⏱ every 1m"]
        SL["Short-Lived Expiry\n⏱ every 30s"]
+        NS["Network Scanner\n⏱ every 6h"]
    end

    R -->|"Find expiring certs\nCreate renewal jobs"| DB[("PostgreSQL")]
@@ -391,6 +392,7 @@ flowchart LR
    H -->|"Check heartbeat staleness\nMark agents offline"| DB
    N -->|"Send pending notifications\nEmail / Webhook / Slack"| DB
    SL -->|"Expire short-lived certs\nMark as Expired"| DB
+    NS -->|"Probe TLS endpoints\nStore discovered certs"| DB
 ```

 | Loop | Interval | Timeout | Purpose |
@@ -400,6 +402,7 @@ flowchart LR
 | Agent health check | 2 minutes | 1 minute | Marks agents as offline if heartbeat is stale |
 | Notification processor | 1 minute | 1 minute | Sends pending notifications via configured channels |
 | Short-lived expiry | 30 seconds | 30 seconds | Marks expired short-lived certificates (profile TTL < 1 hour) |
+| Network scanner | 6 hours | 30 minutes | Probes TLS endpoints on configured CIDR ranges, stores discovered certs (M21, opt-in via `CERTCTL_NETWORK_SCAN_ENABLED`) |

 Each operation has a context timeout to prevent indefinite hangs if external services become unresponsive.

@@ -605,7 +608,7 @@ All endpoints are under `/api/v1/` and follow consistent patterns:

 Resources: certificates, issuers, targets, agents, jobs, policies, profiles, teams, owners, agent-groups, audit, notifications.

-The full API is documented in an OpenAPI 3.1 specification at `api/openapi.yaml` with 78 documented operations (including health, readiness, and auth endpoints; 7 discovery endpoints from M18b pending spec update), all request/response schemas, and pagination conventions. See the [OpenAPI Guide](openapi.md) for usage with Swagger UI and SDK generation.
+The full API is documented in an OpenAPI 3.1 specification at `api/openapi.yaml` with 91 endpoints across 19 resource domains (including health, readiness, auth, 7 discovery endpoints from M18b, 6 network scan endpoints from M21, and Prometheus metrics from M22), all request/response schemas, and pagination conventions. See the [OpenAPI Guide](openapi.md) for usage with Swagger UI and SDK generation.

 Jobs support additional action endpoints: `POST /api/v1/jobs/{id}/cancel`, `POST /api/v1/jobs/{id}/approve`, `POST /api/v1/jobs/{id}/reject`.

@@ -703,54 +706,64 @@ flowchart TB

 For production, you would also add an ingress controller, TLS termination for the certctl API itself, and external PostgreSQL (RDS, Cloud SQL, etc.).

-## Discovery Data Flow (M18b)
+## Discovery Data Flow (M18b + M21)

-Certificate discovery enables operators to build a complete inventory of existing certificates before managing them with certctl. Here's how data flows through the system:
+Certificate discovery enables operators to build a complete inventory of existing certificates before managing them with certctl. There are two discovery modes that feed into the same pipeline:

 ```mermaid
 flowchart TB
-    AGENT["certctl-agent\n(on infrastructure)"]
-    SCAN["Filesystem Scanner\n(CERTCTL_DISCOVERY_DIRS)"]
+    subgraph "Discovery Sources"
+        AGENT["certctl-agent\n(filesystem discovery)"]
+        SCAN["Filesystem Scanner\n(CERTCTL_DISCOVERY_DIRS)"]
+        SERVER["certctl-server\n(network discovery)"]
+        NETSCAN["TLS Scanner\n(CIDR ranges + ports)"]
+    end
+
    EXTRACT["Extract Metadata\n(CN, SANs, serial, issuer, expiry, fingerprint)"]
-    REPORT["POST /api/v1/agents/{id}/discoveries\n(submit scan results)"]
-    HANDLER["Discovery Handler\n(parse request)"]
    SERVICE["Discovery Service\n(ProcessDiscoveryReport)"]
    REPO["Discovery Repository\n(upsert with fingerprint dedup)"]
    DB["PostgreSQL\ndiscovered_certificates\ndiscovery_scans tables"]
    AUDIT["Audit Service\n(RecordDiscoveryScanCompleted)"]
    API_LIST["GET /api/v1/discovered-certificates\n(list for triage)"]
-    API_CLAIM["POST /discovered-certificates/{id}/claim\n(operator claims cert)"]
-    API_DISMISS["POST /discovered-certificates/{id}/dismiss\n(operator dismisses)"]
-    UPDATE_STATUS["Update Status\n(Unmanaged → Managed/Dismissed)"]
+    API_CLAIM["POST /discovered-certificates/{id}/claim"]
+    API_DISMISS["POST /discovered-certificates/{id}/dismiss"]

    AGENT -->|"Scan loop\n(startup + 6h)"| SCAN
    SCAN --> EXTRACT
-    EXTRACT --> REPORT
-    REPORT --> HANDLER
-    HANDLER --> SERVICE
+    SERVER -->|"Scheduler loop\n(every 6h)"| NETSCAN
+    NETSCAN -->|"crypto/tls.Dial\n50 goroutines"| EXTRACT
+    EXTRACT --> SERVICE
    SERVICE --> REPO
-    REPO -->|"Dedup by fingerprint\n+ agent + path"| DB
+    REPO -->|"Dedup by fingerprint\n+ agent_id + source_path"| DB
    SERVICE --> AUDIT
-    AUDIT -->|"discovery_scan_completed"| DB
-    DB -->|"query unmanaged"| API_LIST
-    API_LIST -->|"operator reviews"| API_CLAIM
-    API_LIST -->|"operator reviews"| API_DISMISS
-    API_CLAIM --> UPDATE_STATUS
-    API_DISMISS --> UPDATE_STATUS
-    UPDATE_STATUS -->|"RecordDiscoveryCertClaimed\nRecordDiscoveryCertDismissed"| AUDIT
    AUDIT --> DB
+    DB --> API_LIST
+    API_LIST --> API_CLAIM
+    API_LIST --> API_DISMISS
 ```

-**Key steps:**
+**Filesystem Discovery (M18b):**

 1. **Agent-side discovery** — Agent scans `CERTCTL_DISCOVERY_DIRS` on startup and every 6 hours, walking directories recursively and parsing PEM/DER files
 2. **Metadata extraction** — For each certificate found, extract: common name, SANs, serial number, issuer DN, subject DN, expiration date, key algorithm, key size, is_ca flag, SHA-256 fingerprint (used as dedup key)
 3. **Server submission** — Agent POSTs scan results as `DiscoveryReport` to `POST /api/v1/agents/{id}/discoveries`
 4. **Deduplication** — Server uses fingerprint + agent ID + filesystem path as unique key; prevents duplicate records of the same cert on the same agent
-5. **Storage** — Records stored in `discovered_certificates` table with status = "Unmanaged"
-6. **Audit** — `discovery_scan_completed` event logged with agent ID, cert count, scan timestamp
-7. **Operator triage** — Operator queries `GET /api/v1/discovered-certificates?status=Unmanaged` to see new findings
-8. **Claim or dismiss** — For each unmanaged cert, operator either:
+
+**Network Discovery (M21):**
+
+1. **Target configuration** — Operator creates network scan targets via `POST /api/v1/network-scan-targets` with CIDR ranges, ports, and scan interval
+2. **CIDR expansion** — Ranges expanded to individual IPs with /20 safety cap (4096 IPs max)
+3. **TLS probing** — Server uses `crypto/tls.DialWithDialer` with `InsecureSkipVerify=true` to connect to each endpoint; 50 concurrent goroutines with configurable timeout
+4. **Certificate extraction** — Full X.509 metadata extracted from TLS handshake peer certificates
+5. **Sentinel agent** — Results submitted using `server-scanner` as virtual agent ID, with `source_path` set to `ip:port` and `source_format` set to `network`
+6. **Same pipeline** — Feeds into the same `DiscoveryService.ProcessDiscoveryReport()` as filesystem discovery — same dedup, same audit trail, same triage workflow
+
+**Common triage workflow (both sources):**
+
+1. **Storage** — Records stored in `discovered_certificates` table with status = "Unmanaged"
+2. **Audit** — `discovery_scan_completed` event logged with agent ID, cert count, scan timestamp
+3. **Operator triage** — Operator queries `GET /api/v1/discovered-certificates?status=Unmanaged` to see new findings
+4. **Claim or dismiss** — For each unmanaged cert, operator either:
   - **Claims it** via `POST /discovered-certificates/{id}/claim` — links to existing managed cert or creates new enrollment
   - **Dismisses it** via `POST /discovered-certificates/{id}/dismiss` — removes from triage, marked as "Dismissed"
 9. **Status tracking** — `discovery_cert_claimed` and `discovery_cert_dismissed` events audit the operator's decision
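
The CIDR expansion step with its /20 safety cap (network discovery, step 2) could look roughly like the Go sketch below. It is an illustration under stated assumptions rather than the code added in this commit: `expandCIDR` is a hypothetical name, only IPv4 is handled, and every address in the range is returned, including the network and broadcast addresses, which matches the "/20 = 4096 IPs" arithmetic.

```go
package main

import (
	"fmt"
	"net/netip"
)

// expandCIDR lists every address in an IPv4 prefix and refuses prefixes
// that would expand to more than maxHosts addresses.
func expandCIDR(cidr string, maxHosts int) ([]netip.Addr, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, fmt.Errorf("parse %q: %w", cidr, err)
	}
	prefix = prefix.Masked() // normalise to the first address of the range
	if !prefix.Addr().Is4() {
		return nil, fmt.Errorf("%s: only IPv4 is handled in this sketch", cidr)
	}

	hostBits := 32 - prefix.Bits()
	if count := 1 << hostBits; count > maxHosts {
		return nil, fmt.Errorf("%s expands to %d addresses, cap is %d", cidr, count, maxHosts)
	}

	var addrs []netip.Addr
	for a := prefix.Addr(); a.IsValid() && prefix.Contains(a); a = a.Next() {
		addrs = append(addrs, a)
	}
	return addrs, nil
}

func main() {
	addrs, err := expandCIDR("10.0.1.0/30", 4096)
	fmt.Println(addrs, err)

	// A /16 exceeds the 4096-address cap and is rejected.
	_, err = expandCIDR("10.0.0.0/16", 4096)
	fmt.Println(err)
}
```

Checking the prefix length before allocating means an oversized range is rejected without building the address list first.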
@@ -160,17 +160,20 @@ Each section includes:

 - **Health Endpoint** — `GET /health` returns 200 OK with service status. Consumed by Docker health checks and Kubernetes probes.
 - **Readiness Endpoint** — `GET /ready` returns 200 OK when the database is connected and migrations are applied.
-- **Background Scheduler Monitoring** — 5 background loops run on a fixed schedule:
+- **Background Scheduler Monitoring** — 6 background loops run on a fixed schedule:
  - Renewal loop: every 1 hour, scans for certificates approaching renewal threshold
  - Job processor loop: every 30 seconds, picks up pending/waiting jobs and advances their state
  - Health check loop: every 2 minutes, pings agents to detect downtime
  - Notification dispatcher loop: every 1 minute, sends queued alerts
  - Short-lived cert expiry loop: every 30 seconds, marks expired short-lived credentials
+  - Network scanner loop: every 6 hours, scans enabled TLS endpoints for certificate discovery
  Each loop includes error handling and logs failures via structured slog.
-- **JSON Metrics Endpoint** — `GET /api/v1/metrics` returns JSON object with:
-  - **Gauges** — `certificates_total`, `certificates_active`, `certificates_expiring_soon`, `agents_total`, `agents_healthy`, `pending_jobs`, `failed_jobs`
-  - **Counters** — `certs_issued_total`, `certs_renewed_total`, `certs_revoked_total`, `deployments_completed_total`, `deployments_failed_total`
-  - **Uptime** — `uptime_seconds` (seconds since server start)
+- **Metrics Endpoints** — Two formats for monitoring integration:
+  - `GET /api/v1/metrics` — JSON object with gauges, counters, and uptime for custom dashboards
+  - `GET /api/v1/metrics/prometheus` — Prometheus text exposition format (`text/plain; version=0.0.4`) for native scraping by Prometheus, Grafana Agent, Datadog, and other collectors that read the Prometheus text format
+  - **Gauges** — `certctl_certificate_total`, `certctl_certificate_active`, `certctl_certificate_expiring`, `certctl_certificate_expired`, `certctl_certificate_revoked`, `certctl_agent_total`, `certctl_agent_active`, `certctl_job_pending`
+  - **Counters** — `certctl_job_completed_total`, `certctl_job_failed_total`
+  - **Uptime** — `certctl_uptime_seconds` (seconds since server start)
  All values are point-in-time snapshots computed from database tables.
 - **Structured Logging** — All scheduler operations, API calls, and connector actions log via `slog` (Go's structured logger). Logs include timestamp, level (DEBUG/INFO/WARN/ERROR), structured fields (e.g., `actor`, `resource_id`, `latency_ms`), and request IDs for tracing.
 - **Request ID Propagation** — Each HTTP request gets a unique ID (`X-Request-ID` header). The ID is included in all correlated logs, making it easy to trace a single request through multiple service layers.
@@ -426,7 +429,7 @@ Each section includes:
 | | Metrics JSON Endpoint | `GET /api/v1/metrics` (gauges, counters, uptime) | ✅ | ✅ | Set thresholds, configure alerting |
 | | Stats API (time-series) | `GET /api/v1/stats/*` (summary, status, expiration, jobs, issuance) | ✅ | ✅ | Integrate into dashboards, SLO tracking |
 | | Structured Logging | `slog` middleware with request IDs | ✅ | ✅ | Aggregate logs to SIEM, define retention policy |
-| | Background Scheduler | 5 loops (renewal 1h, jobs 30s, health 2m, notifications 1m, short-lived 30s) | ✅ | ✅ | Alert on scheduler loop failures |
+| | Background Scheduler | 6 loops (renewal 1h, jobs 30s, health 2m, notifications 1m, short-lived 30s, network scan 6h) | ✅ | ✅ | Alert on scheduler loop failures |
 | **CC7.2** Anomaly Detection | Immutable API Audit Trail | `internal/api/middleware/audit.go`, `GET /api/v1/audit` | ✅ | Enhanced (SIEM export) | Integrate into SIEM, search for anomalies, archive long-term |
 | | Expiration Threshold Alerting | Configurable per-policy (default 30/14/7/0 days) | ✅ | ✅ | Configure thresholds, integrate notifications |
 | | Status Auto-Transitions | Active → Expiring (30d) → Expired (0d) | ✅ | ✅ | Monitor status changes in audit trail |
@@ -194,7 +194,7 @@ The MCP server is a separate binary (`cmd/mcp-server/`) that communicates via st

 Certificate discovery is the process of automatically finding existing certificates in your infrastructure — certificates you didn't issue through certctl, possibly issued by other CAs or tools. This is essential for building a complete inventory before you can manage everything.

-**How it works:** Agents can scan configured directories (configured via `CERTCTL_DISCOVERY_DIRS`) for certificate files. On startup and every 6 hours, the agent walks these directories recursively, parses PEM and DER files, extracts metadata (common name, SANs, expiration, issuer, key algorithm), and reports all findings to the control plane. The server deduplicates by fingerprint (prevents duplicate reports of the same cert) and stores them with a status: **Unmanaged** (discovered but not yet managed), **Managed** (linked to a control plane cert), or **Dismissed** (operator decided not to manage it).
+**How it works:** There are two discovery modes. *Filesystem discovery* — agents scan the directories listed in `CERTCTL_DISCOVERY_DIRS` for certificate files. On startup and every 6 hours, the agent walks those directories recursively, parses PEM and DER files, extracts metadata, and reports findings to the control plane. *Network discovery* — the control plane itself probes TLS endpoints across configured CIDR ranges and ports (enabled via `CERTCTL_NETWORK_SCAN_ENABLED=true`). It connects to each endpoint, extracts certificates from the TLS handshake, and feeds results into the same discovery pipeline, which finds certificates on hosts where no agent is installed. In both cases, the server deduplicates by fingerprint, agent ID, and source path, and stores discovered certs with a status: **Unmanaged** (discovered but not yet managed), **Managed** (linked to a control plane cert), or **Dismissed** (operator decided not to manage it).

 This gives you a three-step triage workflow:
 1. **Discover** — Agents find all existing certs on your infrastructure
@@ -205,7 +205,7 @@ This is a prerequisite for multi-CA migration, compliance audits, and building c

 ### Observability

-certctl exposes a JSON metrics endpoint at `GET /api/v1/metrics` with gauges (certificate totals by status, agent counts, pending jobs), counters (completed/failed jobs), and uptime. Five stats endpoints power the dashboard charts: summary statistics, certificates by status, expiration timeline, job trends, and issuance rate.
+certctl exposes metrics in two formats: a JSON endpoint at `GET /api/v1/metrics` and a Prometheus exposition format at `GET /api/v1/metrics/prometheus` (compatible with Prometheus, Grafana Agent, Datadog Agent, and Victoria Metrics). Both provide gauges (certificate totals by status, agent counts, pending jobs), counters (completed/failed jobs), and uptime. Five stats endpoints power the dashboard charts: summary statistics, certificates by status, expiration timeline, job trends, and issuance rate.

 The agent fleet overview page groups agents by OS, architecture, and version, showing distribution charts that help ops teams track fleet health and identify outdated agents. All API requests are logged via structured `slog` middleware with request IDs for correlation.

@@ -639,6 +639,84 @@ curl -s http://localhost:8443/api/v1/discovery-summary | jq .
 - **Compliance** — Detect rogue/unauthorized certificates in monitored directories
 - **Integration** — Pull certificate data from systems that pre-generate certs (e.g., Kubernetes CertManager)

+## Network Certificate Scanner (M21)
+
+The control plane includes a built-in active TLS scanner that probes network endpoints and discovers certificates without requiring agent deployment. This complements the agent-based filesystem discovery with network-level visibility.
+
+### Configuration
+
+Enable network scanning on the server:
+
+```bash
+export CERTCTL_NETWORK_SCAN_ENABLED=true
+export CERTCTL_NETWORK_SCAN_INTERVAL=6h  # default
+```
+
+### Creating Scan Targets
+
+Network scan targets define which CIDR ranges and ports to probe:
+
+```bash
+# Create a scan target for your internal network
+curl -s -X POST http://localhost:8443/api/v1/network-scan-targets \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "Production Web Servers",
+    "cidrs": ["10.0.1.0/24", "10.0.2.0/24"],
+    "ports": [443, 8443, 6443],
+    "enabled": true,
+    "scan_interval_hours": 6,
+    "timeout_ms": 5000
+  }' | jq .
+```
+
+### How It Works
+
+1. **Expand**: CIDR ranges are expanded to individual IPs (safety cap at /20 = 4096 IPs)
+2. **Probe**: Concurrent TLS connections (50 goroutines) with configurable timeout per endpoint
+3. **Extract**: Certificate metadata extracted from TLS handshake (CN, SANs, serial, issuer, key info, fingerprint)
+4. **Pipeline**: Results fed into the same `DiscoveryService.ProcessDiscoveryReport()` as filesystem discovery
+5. **Deduplicate**: Sentinel agent ID (`server-scanner`) with source_path as `ip:port` ensures proper dedup
+6. **Triage**: Discovered certs appear in `GET /api/v1/discovered-certificates` with `agent_id=server-scanner`
+
+### API Endpoints
+
+```bash
+# List all scan targets
+curl -s http://localhost:8443/api/v1/network-scan-targets | jq .
+
+# Create a scan target
+curl -s -X POST http://localhost:8443/api/v1/network-scan-targets \
+  -H "Content-Type: application/json" \
+  -d '{"name": "DMZ", "cidrs": ["172.16.0.0/24"], "ports": [443]}' | jq .
+
+# Get a specific target (includes last_scan_at, last_scan_certs_found)
+curl -s http://localhost:8443/api/v1/network-scan-targets/nst-dmz | jq .
+
+# Trigger an immediate scan (doesn't wait for scheduler)
+curl -s -X POST http://localhost:8443/api/v1/network-scan-targets/nst-dmz/scan | jq .
+
+# Update scan configuration
+curl -s -X PUT http://localhost:8443/api/v1/network-scan-targets/nst-dmz \
+  -H "Content-Type: application/json" \
+  -d '{"ports": [443, 8443, 9443], "timeout_ms": 3000}' | jq .
+
+# Delete a scan target
+curl -s -X DELETE http://localhost:8443/api/v1/network-scan-targets/nst-dmz
+```
+
+### Scheduler Integration
+
+When `CERTCTL_NETWORK_SCAN_ENABLED=true`, the server runs a 6th scheduler loop (alongside renewal, jobs, health, notifications, and short-lived expiry). It scans all enabled targets at the configured interval (default 6h). Each target tracks `last_scan_at`, `last_scan_duration_ms`, and `last_scan_certs_found` for monitoring scan health.
+
+### Use Cases
+
+- **Network inventory** — "What TLS certs are deployed across my network?" without deploying agents
+- **Shadow certificate detection** — Find certificates on services you didn't know were running TLS
+- **Compliance scanning** — Prove to auditors that all TLS endpoints are inventoried
+- **Migration assessment** — Scan a network range before onboarding to certctl management
+- **Expiration monitoring** — Discover soon-to-expire certs on network endpoints before they cause outages
+
 ## What's Next

 - [Architecture Guide](architecture.md) — Understanding the full system design
@@ -695,11 +695,14 @@ curl -s "$API/api/v1/stats/job-trends?days=30" | jq .
 # Issuance rate — new certificates per day over 30 days
 curl -s "$API/api/v1/stats/issuance-rate?days=30" | jq .

-# System metrics — gauges, counters, uptime
+# System metrics — gauges, counters, uptime (JSON)
 curl -s $API/api/v1/metrics | jq .
+
+# System metrics — Prometheus exposition format (for Prometheus/Grafana/Datadog scraping)
+curl -s $API/api/v1/metrics/prometheus
 ```

-**How it works:** The `StatsService` computes aggregations in Go from existing repository List methods — no additional SQL queries or materialized views. This keeps the database schema simple while providing real-time dashboard data. The metrics endpoint returns gauges (cert totals by status, agent counts, pending jobs), counters (completed/failed jobs), and server uptime.
+**How it works:** The `StatsService` computes aggregations in Go from existing repository List methods — no additional SQL queries or materialized views. This keeps the database schema simple while providing real-time dashboard data. The JSON metrics endpoint returns gauges (cert totals by status, agent counts, pending jobs), counters (completed/failed jobs), and server uptime. The Prometheus endpoint (`/api/v1/metrics/prometheus`) exposes the same data in Prometheus exposition format (`text/plain; version=0.0.4`) with `certctl_` prefixed metric names — ready for scraping by Prometheus, Grafana Agent, Datadog Agent, or Victoria Metrics.

 **In the dashboard**, these stats power four interactive charts: an expiration heatmap, renewal success rate trends, certificate status distribution, and issuance rate. The agent fleet overview page uses agent metadata to group by OS, architecture, and version.
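
For readers unfamiliar with the Prometheus text exposition format mentioned above, the Go sketch below renders a few samples in that layout. It is an assumption-laden illustration, not the handler from this commit: the `metric` type and `renderPrometheus` function are hypothetical, the help strings and values are invented, and only the metric names come from the documentation in this change.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// metric is a hypothetical in-memory sample used only for this sketch.
type metric struct {
	Name  string
	Help  string
	Kind  string // "gauge" or "counter"
	Value float64
}

// renderPrometheus writes one HELP line, one TYPE line, and one sample line
// per metric, which is the basic shape of the text exposition format.
func renderPrometheus(metrics []metric) string {
	var b strings.Builder
	for _, m := range metrics {
		fmt.Fprintf(&b, "# HELP %s %s\n", m.Name, m.Help)
		fmt.Fprintf(&b, "# TYPE %s %s\n", m.Name, m.Kind)
		fmt.Fprintf(&b, "%s %s\n", m.Name, strconv.FormatFloat(m.Value, 'g', -1, 64))
	}
	return b.String()
}

func main() {
	// Values are invented for illustration.
	fmt.Print(renderPrometheus([]metric{
		{Name: "certctl_certificate_total", Help: "Total certificates.", Kind: "gauge", Value: 42},
		{Name: "certctl_job_completed_total", Help: "Completed jobs.", Kind: "counter", Value: 1280},
		{Name: "certctl_uptime_seconds", Help: "Seconds since server start.", Kind: "gauge", Value: 3600.5},
	}))
}
```

A scraper would request this body with the `text/plain; version=0.0.4` content type noted above; labels and histograms are left out of the sketch.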

@@ -916,11 +919,13 @@ The MCP server is perfect for:

 ---

-## Part 16: Certificate Discovery (M18b)
+## Part 16: Certificate Discovery (M18b + M21)

-Agents can automatically discover existing certificates already deployed in your infrastructure. This is useful for building a baseline inventory before you start managing everything with certctl.
+certctl discovers existing certificates two ways: **filesystem scanning** (agents scan local directories) and **network scanning** (the server probes TLS endpoints). Both feed into the same triage pipeline.

-First, configure the demo agent to scan for certificates. In the Docker Compose setup, agents have a `/tmp/certs` directory (created by the seed script). Restart the agent with discovery enabled:
+### Filesystem Discovery (Agent-Side)
+
+Configure the demo agent to scan for certificates. In the Docker Compose setup, agents have a `/tmp/certs` directory (created by the seed script). Restart the agent with discovery enabled:

 ```bash
 # Stop the existing agent
@@ -936,17 +941,46 @@ Or with the CLI flag:
 certctl-agent --agent-id a-demo-1 --key-dir /tmp/keys --discovery-dirs /tmp/certs --server http://localhost:8443 --api-key test-key-123
 ```

-Now check what the agent discovered:
+### Network Discovery (Server-Side)
+
+The server can also discover certificates by actively probing TLS endpoints — no agent required. Create a scan target and trigger a scan:

 ```bash
-# List discovered certificates (should show unmanaged certs found on the agent)
+# Create a network scan target
+curl -s -X POST $API/api/v1/network-scan-targets \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "Demo Local Scan",
+    "cidrs": ["127.0.0.1/32"],
+    "ports": [8443],
+    "enabled": true,
+    "scan_interval_hours": 6,
+    "timeout_ms": 5000
+  }' | jq .
+
+# Trigger an immediate scan (otherwise runs every 6 hours)
+NST_ID=$(curl -s $API/api/v1/network-scan-targets | jq -r '.data[0].id')
+curl -s -X POST "$API/api/v1/network-scan-targets/$NST_ID/scan" | jq .
+
+# List scan targets and their results
+curl -s $API/api/v1/network-scan-targets | jq .
+```
+
+Network-discovered certificates appear in the same discovery pipeline as filesystem-discovered ones, with `agent_id=server-scanner` and `source_format=network`.
+
+### Triage Discovered Certificates
+
+Both discovery sources feed into the same triage workflow. Check what was found:
+
+```bash
+# List discovered certificates (should show unmanaged certs found by agents and network scans)
 curl -s "$API/api/v1/discovered-certificates?status=Unmanaged" | jq '.data[] | {id, common_name, expires_at, issuer_dn, status}'

 # Get a summary of all discoveries
 curl -s $API/api/v1/discovery-summary | jq .
 ```

-If the agent found certificates, you'll see entries with `status: "Unmanaged"`. Now triage them — claim the ones you want to manage or dismiss the ones you don't:
+If certificates were found, you'll see entries with `status: "Unmanaged"`. Triage them — claim the ones you want to manage or dismiss the ones you don't:

 ```bash
 # Claim a certificate (link it to a managed cert, or create new enrollment)
@@ -961,9 +995,9 @@ curl -s -X POST "$API/api/v1/discovered-certificates/$DISCOVERED_ID/dismiss" \
  -d '{"reason": "Self-signed test cert, not production"}' | jq .
 ```

-**How it works:** The agent scans `CERTCTL_DISCOVERY_DIRS` on startup and every 6 hours, extracts metadata (common name, SANs, issuer, expiration, key type, fingerprint) from all PEM and DER files, and POSTs the findings to `POST /api/v1/agents/{id}/discoveries`. The server deduplicates by fingerprint (prevents duplicate records) and stores results with a status: **Unmanaged** (discovered, not yet managed), **Managed** (linked to a control plane cert), or **Dismissed** (operator decided not to manage). This gives you a triage workflow: discover → review → claim or dismiss.
+**How it works:** Filesystem discovery: the agent scans `CERTCTL_DISCOVERY_DIRS` on startup and every 6 hours, extracts metadata (common name, SANs, issuer, expiration, key type, fingerprint) from all PEM and DER files, and POSTs findings to `POST /api/v1/agents/{id}/discoveries`. Network discovery: the server expands CIDR ranges (capped at /20 = 4096 IPs), connects to each IP:port via TLS, extracts the peer certificate chain, and stores results using `server-scanner` as a sentinel agent ID. Both sources deduplicate by fingerprint and store results with a status: **Unmanaged** (discovered, not yet managed), **Managed** (linked to a control plane cert), or **Dismissed** (operator decided not to manage). This gives you a triage workflow: discover → review → claim or dismiss.

-**In the dashboard**, the Discovery page (coming in future V2.x) will provide a visual triage interface for claiming and dismissing discovered certificates.
+**In the dashboard**, click "Discovered Certificates" in the sidebar to see what agents and network scans found — claim unmanaged certs to bring them under certctl's management, or dismiss them.

 ---

@@ -989,12 +1023,12 @@ flowchart TB
        API["REST API\nGo net/http"]
        SVC["Service Layer\nBusiness Logic"]
        REPO["Repository Layer\ndatabase/sql + lib/pq"]
-        SCHED["Scheduler\n5 background loops"]
+        SCHED["Scheduler\n6 background loops"]
        CONN["Connector Registry\nIssuer + Target + Notifier"]
    end

    subgraph "Data Store"
-        PG["PostgreSQL 16\n18 tables, TEXT PKs"]
+        PG["PostgreSQL 16\n19 tables, TEXT PKs"]
    end

    subgraph "Agent (certctl-agent)"
@@ -70,11 +70,11 @@ On the Certificates page, select multiple certificates using the checkboxes. A b
 Click any certificate, then scroll to the deployment timeline. A visual 4-step timeline shows the lifecycle: Requested → Issued → Deploying → Active. Previous versions show a rollback button.

 **11. "What about certificates already running in production?"**
-Enable discovery on agents by setting `CERTCTL_DISCOVERY_DIRS` to directories containing certificates (e.g., `/etc/nginx/certs`). Agents scan on startup and every 6 hours, report findings to the control plane. Click "Discovered Certificates" to see what agents found — claim unmanaged certs to bring them under certctl's management, or dismiss them.
+Enable discovery on agents by setting `CERTCTL_DISCOVERY_DIRS` to directories containing certificates (e.g., `/etc/nginx/certs`). Agents scan on startup and every 6 hours, report findings to the control plane. For network-based discovery without agents, enable `CERTCTL_NETWORK_SCAN_ENABLED=true` and configure scan targets via the API — the server probes TLS endpoints on configured CIDR ranges and ports. Click "Discovered Certificates" to see what agents and network scans found — claim unmanaged certs to bring them under certctl's management, or dismiss them.

 ## REST API Walkthrough

-The dashboard is backed by a real REST API (84 endpoints). Try these while the demo is running:
+The dashboard is backed by a real REST API (91 endpoints). Try these while the demo is running:

 ```bash
 # List all certificates
@@ -114,6 +114,7 @@ curl -s http://localhost:8443/api/v1/stats/expiration-timeline | jq .
 curl -s http://localhost:8443/api/v1/stats/job-trends | jq .
 curl -s http://localhost:8443/api/v1/stats/issuance-rate | jq .
 curl -s http://localhost:8443/api/v1/metrics | jq .
+curl -s http://localhost:8443/api/v1/metrics/prometheus  # Prometheus format

 # Certificate profiles
 curl -s http://localhost:8443/api/v1/profiles | jq .
@@ -135,6 +136,9 @@ curl -s http://localhost:8443/api/v1/discovered-certificates | jq .

 # Discovery summary (counts by status)
 curl -s http://localhost:8443/api/v1/discovery-summary | jq .
+
+# Network scan targets (active TLS scanning)
+curl -s http://localhost:8443/api/v1/network-scan-targets | jq .
 ```

 ## CLI Tool
@@ -236,7 +240,7 @@ If you're demoing to a team or customer, here's a suggested flow:
 7. **Show profiles** — "Certificate profiles enforce crypto constraints — key types, max TTL, compliance requirements"
 8. **Show policies** — "Guardrails prevent teams from going outside approved scope"
 9. **Show bulk operations** — "Select multiple certs, trigger renewal or revoke in bulk with progress tracking"
-10. **Show certificate discovery** — "Agents scan your infrastructure for existing certificates you're not managing yet. We automatically deduplicate by fingerprint, show you what we found, and let you claim them or dismiss them"
+10. **Show certificate discovery** — "We discover certificates two ways: agents scan local filesystems, and the server actively probes TLS endpoints on your network. We deduplicate by fingerprint, show you what we found, and let you claim them or dismiss them"
 11. **Show the immutable audit trail** — "Every action in the system is recorded: who did it, what they did, when, what changed. Export to CSV/JSON for compliance"
 12. **Show advanced query features** — "Sort by any field, filter by date range, paginate efficiently with cursor-based pagination, select just the fields you need"
 13. **Show the CLI and MCP server** — "Terminal users get `certctl-cli` with 10 subcommands. AI assistants get MCP integration with 76 tools. Everything is API-first"
@@ -7,7 +7,7 @@ Complete reference of all features shipped in the V2 release (as of March 2026).
 ## API Surface

 ### Overview
-- **84 endpoints** across 17 resource domains under `/api/v1/`
+- **91 endpoints** across 19 resource domains under `/api/v1/`
 - REST API with HTTP semantics (GET, POST, PUT, DELETE)
 - All endpoints require authentication by default (configurable)
 - OpenAPI 3.1 spec with full schema documentation
@@ -55,10 +55,11 @@ Complete reference of all features shipped in the V2 release (as of March 2026).
 | **Owners** | 5 | List, create, get, update, delete |
 | **Agent Groups** | 6 | List, create, get, update, delete, list agents in group |
 | **Discovery** | 7 | Submit scan results, list discovered certs, get detail, claim, dismiss, list scans, summary stats |
+| **Network Scan** | 6 | List targets, create, get, update, delete, trigger scan |
 | **Audit** | 3 | List events, list by resource, export (CSV/JSON) |
 | **Notifications** | 3 | List, get, mark as read |
 | **Stats** | 5 | Dashboard summary, certificates by status, expiration timeline, job trends, issuance rate |
-| **Metrics** | 1 | JSON metrics (gauges, counters, uptime) |
+| **Metrics** | 2 | JSON metrics (gauges, counters, uptime), Prometheus exposition format |
 | **Health** | 4 | Health check, readiness check, auth info, auth check |

 ---
@@ -411,6 +412,60 @@ Each discovered certificate is parsed and its metadata extracted:

 ---

+## Network Certificate Discovery (M21)
+
+### Overview
+Server-side active TLS scanning probes network endpoints across CIDR ranges, extracts certificate metadata from TLS handshakes, and feeds results into the existing filesystem discovery pipeline. No agent deployment required — the control plane scans directly.
+
+### Configuration
+- **Enable** — `CERTCTL_NETWORK_SCAN_ENABLED=true` (disabled by default)
+- **Scan Interval** — `CERTCTL_NETWORK_SCAN_INTERVAL=6h` (default 6 hours, configurable)
+
+### Network Scan Targets
+Scan targets define what CIDR ranges and ports to probe.
+
+| Field | Details | Example |
+|-------|---------|---------|
+| **ID** | Prefixed text PK (nst-xxx) | nst-datacenter-east |
+| **Name** | Human-readable target name | Datacenter East Production |
+| **CIDRs** | Array of CIDR ranges | ["10.0.1.0/24", "10.0.2.0/24"] |
+| **Ports** | Array of TCP ports | [443, 8443, 6443] |
+| **Enabled** | Toggle scanning on/off | true |
+| **Scan Interval Hours** | Per-target scan frequency | 6 |
+| **Timeout Ms** | Per-connection timeout | 5000 |
+
+### Scanning Behavior
+- **CIDR Expansion** — Ranges expanded to individual IPs; safety cap at /20 (4096 IPs) prevents accidental large scans
+- **Concurrent Probing** — 50 goroutines (semaphore-based), configurable timeout per TLS connection
+- **TLS Extraction** — `crypto/tls.DialWithDialer` with `InsecureSkipVerify: true` skips chain verification so the handshake completes and the presented certificate can be recorded even when it is self-signed, expired, or issued by an internal CA
+- **Sentinel Agent Pattern** — Uses `server-scanner` as virtual agent ID, reusing the existing `discovered_certificates` dedup constraint without schema changes
+- **Discovery Pipeline** — Scan results feed into `DiscoveryService.ProcessDiscoveryReport()` for fingerprint dedup, audit trail, and triage workflow
+
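The scanning behavior above can be sketched in Go as follows. This is a minimal illustration, not the certctl implementation: the names `expandCIDR`, `probeAll`, and `tlsFingerprint` are hypothetical, rejecting ranges over the cap (rather than truncating them) is an assumption, and the sketch keeps network and broadcast addresses because the source does not say whether they are skipped.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"errors"
	"fmt"
	"net"
	"net/netip"
	"sync"
	"sync/atomic"
	"time"
)

// maxHostBits encodes the cap: 12 host bits = 4096 addresses (a /20 in IPv4).
const maxHostBits = 12

// expandCIDR lists every address in cidr. Returning an error for oversized
// ranges is an assumption; the real scanner might truncate instead.
func expandCIDR(cidr string) ([]netip.Addr, error) {
	prefix, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	prefix = prefix.Masked() // normalize to the network address
	hostBits := prefix.Addr().BitLen() - prefix.Bits()
	if hostBits > maxHostBits {
		return nil, fmt.Errorf("%s is larger than the %d-address cap", cidr, 1<<maxHostBits)
	}
	addrs := make([]netip.Addr, 0, 1<<hostBits)
	for a := prefix.Addr(); a.IsValid() && prefix.Contains(a); a = a.Next() {
		addrs = append(addrs, a)
	}
	return addrs, nil
}

// probeAll calls probe for every address/port pair, with at most limit
// probes in flight. The buffered channel acts as a counting semaphore.
func probeAll(addrs []netip.Addr, ports []uint16, limit int, probe func(netip.AddrPort)) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, a := range addrs {
		for _, p := range ports {
			sem <- struct{}{} // blocks while limit probes are running
			wg.Add(1)
			go func(ep netip.AddrPort) {
				defer wg.Done()
				defer func() { <-sem }()
				probe(ep)
			}(netip.AddrPortFrom(a, p))
		}
	}
	wg.Wait()
}

// tlsFingerprint performs a handshake without chain verification and returns
// the SHA-256 fingerprint of the leaf certificate the endpoint presented.
func tlsFingerprint(ep netip.AddrPort, timeout time.Duration) (string, error) {
	dialer := &net.Dialer{Timeout: timeout}
	conn, err := tls.DialWithDialer(dialer, "tcp", ep.String(), &tls.Config{InsecureSkipVerify: true})
	if err != nil {
		return "", err
	}
	defer conn.Close()
	certs := conn.ConnectionState().PeerCertificates
	if len(certs) == 0 {
		return "", errors.New("no certificate presented")
	}
	sum := sha256.Sum256(certs[0].Raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	addrs, err := expandCIDR("10.0.1.0/30")
	if err != nil {
		panic(err)
	}
	var probed atomic.Int64
	// A stub probe keeps the demo offline; wrap tlsFingerprint in a closure to scan for real.
	probeAll(addrs, []uint16{443, 8443}, 50, func(netip.AddrPort) { probed.Add(1) })
	fmt.Println(len(addrs), "addresses,", probed.Load(), "probes")
}
```

To probe real endpoints, the stub would be replaced with a closure that calls `tlsFingerprint(ep, 5*time.Second)` and collects the results under a mutex or through a channel.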
+### Network Scan API Endpoints (M21)
+
+| Endpoint | Method | Purpose |
+|----------|--------|---------|
+| `/api/v1/network-scan-targets` | GET | List all scan targets with metrics |
+| `/api/v1/network-scan-targets` | POST | Create a new scan target |
+| `/api/v1/network-scan-targets/{id}` | GET | Get scan target details |
+| `/api/v1/network-scan-targets/{id}` | PUT | Update scan target configuration |
+| `/api/v1/network-scan-targets/{id}` | DELETE | Delete a scan target |
+| `/api/v1/network-scan-targets/{id}/scan` | POST | Trigger an immediate scan |
+
+### Scheduler Integration
+- **6th scheduler loop** — runs at configured interval (default 6h) alongside renewal (1h), jobs (30s), health (2m), notifications (1m), short-lived expiry (30s)
+- **Conditional** — only starts if `CERTCTL_NETWORK_SCAN_ENABLED=true` and the network scan service is initialized
+- **Scan Metrics** — each target tracks `last_scan_at`, `last_scan_duration_ms`, `last_scan_certs_found`
+
+### Use Cases
+- **Network Inventory** — "What TLS certs are deployed across my network?" without deploying agents
+- **Shadow Certificate Detection** — Find certificates on services you didn't know were running TLS
+- **Compliance Scanning** — Prove to auditors that all TLS endpoints are inventoried
+- **Migration Assessment** — Scan a network range before onboarding to certctl management
+- **Expiration Monitoring** — Discover soon-to-expire certs on network endpoints before they cause outages
+
+---
+
 ## Ownership & Accountability

 ### Teams
@@ -451,13 +506,23 @@ Live aggregated views of certificate and job metrics.
 | **Certificate Status Distribution** | Donut | Pie breakdown: Active, Expiring, Expired, Failed, Revoked, etc. |
 | **Issuance Rate** | Bar (30-day) | Certs issued per day; trend line |

-#### Metrics Endpoint
+#### Metrics Endpoints
+
+**JSON Format**
 - **URL** — `GET /api/v1/metrics`
 - **Format** — JSON with timestamp
 - **Gauges** — Certificate counts by status, agent count (online/offline), pending job count
 - **Counters** — Total jobs completed, total jobs failed, total renewals, total issuances
 - **Uptime** — Server uptime in seconds

+**Prometheus Exposition Format (M22)**
+- **URL** — `GET /api/v1/metrics/prometheus`
+- **Content-Type** — `text/plain; version=0.0.4; charset=utf-8`
+- **Compatible with** — Prometheus, Grafana Agent, Datadog Agent, VictoriaMetrics, OpenMetrics scrapers
+- **Naming** — `certctl_` prefix, snake_case (e.g., `certctl_certificate_total`, `certctl_agent_online`)
+- **11 Metrics** — 9 gauges (cert total/active/expiring/expired/revoked, agent total/online, job pending, uptime seconds) and 2 counters (job completed/failed totals)
+- **Scrape Config** — Add to `prometheus.yml`: `scrape_configs: [{job_name: certctl, static_configs: [{targets: ['localhost:8443']}], metrics_path: /api/v1/metrics/prometheus}]`
+
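For readers unfamiliar with the text exposition format, the following Go sketch renders a few metrics the way a scraper would expect them. It is an illustration under stated assumptions, not the certctl handler: the `metric` type, `render`, and `handler` are hypothetical, the help strings and sample values are invented, and `certctl_job_completed_total` is a guessed counter name (only `certctl_certificate_total` and `certctl_agent_online` appear in the documentation above).

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

// metric is a hypothetical in-memory sample with no labels.
type metric struct {
	name, help, kind string // kind is "gauge" or "counter"
	value            float64
}

// render writes one HELP line, one TYPE line, and one sample line per metric.
func render(metrics []metric) string {
	var b strings.Builder
	for _, m := range metrics {
		fmt.Fprintf(&b, "# HELP %s %s\n", m.name, m.help)
		fmt.Fprintf(&b, "# TYPE %s %s\n", m.name, m.kind)
		// 'f' formatting avoids exponent notation for large counts.
		fmt.Fprintf(&b, "%s %s\n", m.name, strconv.FormatFloat(m.value, 'f', -1, 64))
	}
	return b.String()
}

// sample holds made-up values for the demo.
var sample = []metric{
	{"certctl_certificate_total", "Total certificates (help text assumed).", "gauge", 42},
	{"certctl_agent_online", "Agents currently online (help text assumed).", "gauge", 3},
	{"certctl_job_completed_total", "Completed jobs (name assumed).", "counter", 1200},
}

// handler serves the rendered text with the content type listed above.
func handler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
	fmt.Fprint(w, render(sample))
}

func main() {
	// Print instead of starting a server so the sketch runs anywhere.
	fmt.Print(render(sample))
}
```

Hand-rendering keeps the sketch dependency-free; a production service might instead use the official Prometheus Go client, which handles escaping and labels.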
 #### Stats API (M14)
 Five parameterized endpoints for dashboard data.

@@ -541,7 +606,7 @@ Every API call recorded to immutable `audit_events` table.
 3. **Approve** → `POST /api/v1/jobs/{id}/approve` → Job → `Running`
 4. **Reject** → `POST /api/v1/jobs/{id}/reject` + reason → Job → `Cancelled`

-### Background Scheduler (5 loops)
+### Background Scheduler (6 loops)
 | Loop | Interval | Task |
 |------|----------|------|
 | **Renewal Checker** | 1 hour | Scan policies; trigger renewals if cert expires soon |
@@ -549,6 +614,7 @@ Every API call recorded to immutable `audit_events` table.
 | **Health Checker** | 2 minutes | Check agent heartbeat; mark offline if >3 missed |
 | **Notification Processor** | 1 minute | Send queued notifications (email, Slack, webhook, etc.) |
 | **Short-Lived Cleanup** | 30 seconds | Audit short-lived credential expirations |
+| **Network Scanner** | 6 hours | Scan enabled network targets; discover TLS certificates |

 All loops have configurable intervals via environment variables (`CERTCTL_SCHEDULER_*_INTERVAL`).

@@ -898,7 +964,7 @@ Each guide includes an evidence summary table mapping specific criteria to certc
 | Revocation (RFC 5280, CRL, OCSP) | ✓ | ✓ | Shipped |
 | Dashboard + 19 pages | ✓ | ✓ | Shipped |
 | Observability (charts, metrics, stats) | ✓ | ✓ | Shipped |
-| REST API (84 endpoints) | ✓ | ✓ | Shipped |
+| REST API (91 endpoints) | ✓ | ✓ | Shipped |
 | MCP server (76 tools) | ✓ | ✓ | Shipped v2.1 |
 | CLI tool (10 subcommands) | ✓ | ✓ | Shipped |
 | Compliance mapping docs (SOC 2, PCI-DSS, NIST) | ✓ | ✓ | Shipped |
@@ -295,8 +295,11 @@ curl -s "http://localhost:8443/api/v1/stats/expiration-timeline?days=90" | jq .
 # Job trends (last 30 days)
 curl -s "http://localhost:8443/api/v1/stats/job-trends?days=30" | jq .

-# System metrics
+# System metrics (JSON)
 curl -s http://localhost:8443/api/v1/metrics | jq .
+
+# System metrics (Prometheus format — for scraping by Prometheus, Grafana Agent, Datadog)
+curl -s http://localhost:8443/api/v1/metrics/prometheus
 ```

 ### Certificate profiles
@@ -364,6 +367,35 @@ curl -s -X POST "http://localhost:8443/api/v1/discovered-certificates/DISCOVERY_
  -d '{"managed_certificate_id": "mc-api-prod"}' | jq .
 ```

+### Network Certificate Discovery
+
+The server can also discover certificates by scanning TLS endpoints directly — no agent required:
+
+```bash
+# Enable network scanning (set in environment or docker-compose)
+export CERTCTL_NETWORK_SCAN_ENABLED=true
+
+# Create a scan target (e.g., scan your internal network on port 443)
+curl -s -X POST http://localhost:8443/api/v1/network-scan-targets \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "Internal Network",
+    "cidrs": ["10.0.1.0/24"],
+    "ports": [443, 8443],
+    "enabled": true,
+    "scan_interval_hours": 6,
+    "timeout_ms": 5000
+  }' | jq .
+
+# Trigger an immediate scan (substitute the `id` returned by the create call above)
+curl -s -X POST http://localhost:8443/api/v1/network-scan-targets/nst-internal-network/scan | jq .
+
+# List scan targets with results
+curl -s http://localhost:8443/api/v1/network-scan-targets | jq .
+```
+
+Discovered network certificates appear in the same `GET /api/v1/discovered-certificates` list as filesystem-discovered certs, with `agent_id=server-scanner` and `source_format=network`.
+
 ## What's Next

 - **[Advanced Demo](demo-advanced.md)** — Issue a real certificate via the Local CA and watch it appear in the dashboard