metrics: add per-issuer-type issuance counters, histogram, and failure classifier

Closes the #4 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Before this commit, certctl's Prometheus exposition had
zero per-issuer-type signal — operators answering "is DigiCert slow?"
or "is Sectigo failing more than ACME?" had to grep logs by issuer
name. This commit adds three series labelled by issuer type:

  certctl_issuance_total{issuer_type, outcome}
  certctl_issuance_duration_seconds{issuer_type}            (histogram)
  certctl_issuance_failures_total{issuer_type, error_class}

The histogram covers 0.05–120 second buckets to span the local-issuer
fast path and async-CA slow path (DigiCert/Sectigo/Entrust polling
can take minutes). error_class is a closed enum of eight values
(timeout, auth, rate_limited, validation, upstream_5xx, upstream_4xx,
network, other) classified once in service.ClassifyError. Cardinality
budget is ~276 new series, well within Prometheus's comfortable range.

Implementation:
- service.IssuanceMetrics is the thread-safe counter + histogram
  table. Three independent views (counters / failures / durations)
  exposed via SnapshotCounters / SnapshotFailures / SnapshotDurations.
  sync.RWMutex protects the map shape; per-key sync/atomic.Uint64
  primitives keep the recording hot path lock-free under concurrent
  service-layer goroutines.
- service.IssuanceCounterEntry / IssuanceFailureEntry /
  IssuanceDurationEntry / IssuanceMetricsSnapshotter live in service
  (not handler) to avoid an import cycle: handler already imports
  service for admin_est.go etc., so service can't import handler back.
  Handler's exposer takes the snapshotter via the service-defined
  interface.
- service.ClassifyError pure function maps error → error_class.
  context.DeadlineExceeded / context.Canceled → timeout; *net.OpError
  → network; substring matches against canonical AWS / DigiCert /
  Sectigo error shapes for auth / rate_limited / validation /
  upstream_5xx / upstream_4xx / network; unknown → other. Each branch
  has at least one representative test case in
  TestClassifyError.
- IssuerConnectorAdapter.SetMetrics wires per-adapter recording
  (issuerType + metrics). Existing 28+ test call sites of
  NewIssuerConnectorAdapter keep their one-arg signature; production
  wiring goes through SetMetrics post-construction.
- IssuerRegistry.SetIssuanceMetrics + Rebuild type-asserts to
  *IssuerConnectorAdapter and calls SetMetrics with the issuer type
  string. nil-guarded — tests that hand-build adapters without
  metrics get no-op recording.
- IssuerConnectorAdapter.IssueCertificate / RenewCertificate wrap the
  underlying connector call with start := time.Now() and
  recordIssuance(start, err). Renewal is recorded into the same
  certctl_issuance_* series as initial issuance — operationally,
  renewal IS issuance from the connector's perspective (matches the
  audit prompt's guidance on series naming).
- handler/metrics.go GetPrometheusMetrics gains a new exposer block
  emitting all three series in stable label order with correct
  Prometheus format (_bucket / _sum / _count for the histogram, +Inf
  bucket appended). Sorted via sort.Slice for stable output. nil-
  guarded so deploys without the wire produce clean exposition.
- formatLE helper trims trailing zeros from histogram bucket labels
  via strconv.FormatFloat(le, 'f', -1, 64) so the `le` labels match
  Prometheus client conventions ("0.05", "30", "120", not "0.0500"
  etc.).
- cmd/server/main.go wires a single IssuanceMetrics instance into
  both the IssuerRegistry (recording) and the MetricsHandler (exposer)
  using DefaultIssuanceBucketBoundaries.

Tests:
- TestIssuanceMetrics_RecordAndSnapshot — happy-path counter +
  histogram + failure recording, BucketBoundaries returns a copy
  (not shared storage).
- TestIssuanceMetrics_HistogramCumulative — pins the cumulative-buckets
  contract. 100ms observation lands in 0.1 bucket and every larger
  bucket; 750ms only in the 1.0 bucket. Off-by-one here would
  corrupt every quantile query downstream.
- TestIssuanceMetrics_Concurrency — 100 goroutines × 1000 ops under
  the race detector. Asserts atomic counter integrity across
  contended writes.
- TestClassifyError — 17 cases covering every branch of the closed
  enum plus the nil-error special case.

Implementation chooses the existing hand-rolled fmt.Fprintf
exposition pattern (no prometheus/client_golang dependency added)
to stay consistent with the OCSP / deploy counter blocks already in
the file.

Out of scope (separate follow-ups):
- Revocation metrics (certctl_revocation_*) — symmetric to issuance
  but the audit didn't ask; explicit follow-up commit.
- Discovery / health-check duration histograms.
- prometheus/client_golang migration.

Verified locally:
- gofmt clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/service/ green
- go test -short -count=1 -race -run TestIssuanceMetrics ./internal/service/ green
- go test -short -count=1 ./internal/api/handler/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #4 (Part 3, narrative section).
This commit is contained in:
shankar0123
2026-05-02 00:39:25 +00:00
parent b9d15c5dbf
commit 95c2bf9818
6 changed files with 738 additions and 4 deletions
+194
View File
@@ -0,0 +1,194 @@
// Copyright (c) certctl
// SPDX-License-Identifier: BSL-1.1
package service
import (
"context"
"errors"
"net"
"sync"
"testing"
"time"
)
// TestIssuanceMetrics_RecordAndSnapshot exercises the happy-path
// counter + histogram + failure recording. Asserts:
// - SnapshotCounters returns the expected (issuer_type, outcome, count) tuples
// - SnapshotDurations returns cumulative bucket counts
// - SnapshotFailures returns the expected (issuer_type, error_class, count) tuples
// - BucketBoundaries returns a copy that doesn't share backing storage
func TestIssuanceMetrics_RecordAndSnapshot(t *testing.T) {
m := NewIssuanceMetrics(DefaultIssuanceBucketBoundaries)
// Record three issuances: two success (one fast, one slow), one failure.
m.RecordIssuance("local", "success", 50*time.Millisecond) // 0.05 bucket
m.RecordIssuance("local", "success", 2*time.Second) // 2.5 bucket
m.RecordIssuance("digicert", "failure", 90*time.Second) // 120 bucket
m.RecordFailure("digicert", "rate_limited")
counters := m.SnapshotCounters()
if len(counters) != 2 {
t.Fatalf("expected 2 counter entries, got %d", len(counters))
}
for _, c := range counters {
switch {
case c.IssuerType == "local" && c.Outcome == "success":
if c.Count != 2 {
t.Errorf("local/success: want 2, got %d", c.Count)
}
case c.IssuerType == "digicert" && c.Outcome == "failure":
if c.Count != 1 {
t.Errorf("digicert/failure: want 1, got %d", c.Count)
}
default:
t.Errorf("unexpected counter entry: %+v", c)
}
}
failures := m.SnapshotFailures()
if len(failures) != 1 {
t.Fatalf("expected 1 failure entry, got %d", len(failures))
}
if failures[0].IssuerType != "digicert" || failures[0].ErrorClass != "rate_limited" || failures[0].Count != 1 {
t.Errorf("unexpected failure entry: %+v", failures[0])
}
durations := m.SnapshotDurations()
if len(durations) != 2 {
t.Fatalf("expected 2 duration entries, got %d", len(durations))
}
// BucketBoundaries: returned slice must be a copy.
b1 := m.BucketBoundaries()
b2 := m.BucketBoundaries()
if &b1[0] == &b2[0] {
t.Error("BucketBoundaries should return a copy, not shared storage")
}
}
// TestIssuanceMetrics_HistogramCumulative pins the cumulative-buckets
// contract. Prometheus histograms require buckets to be cumulative —
// `le=0.5` includes everything <= 0.5, including <= 0.05 and <= 0.1.
// Off-by-one here corrupts every quantile query downstream.
func TestIssuanceMetrics_HistogramCumulative(t *testing.T) {
m := NewIssuanceMetrics([]float64{0.1, 0.5, 1.0})
// Observe 100ms (= 0.1s exactly).
m.RecordIssuance("local", "success", 100*time.Millisecond)
durs := m.SnapshotDurations()
if len(durs) != 1 {
t.Fatalf("expected 1 duration entry, got %d", len(durs))
}
// Boundaries: [0.1, 0.5, 1.0]. 100ms falls into 0.1 bucket and
// every larger bucket (cumulative). Sum = 0.1, count = 1.
want := []uint64{1, 1, 1}
for i, w := range want {
if durs[0].Buckets[i] != w {
t.Errorf("bucket[%d]: want %d, got %d", i, w, durs[0].Buckets[i])
}
}
if durs[0].Sum < 0.099 || durs[0].Sum > 0.101 {
t.Errorf("sum: want ~0.1, got %v", durs[0].Sum)
}
if durs[0].Count != 1 {
t.Errorf("count: want 1, got %d", durs[0].Count)
}
// Observe 750ms — falls into 1.0 bucket only (>0.1, >0.5).
m.RecordIssuance("local", "success", 750*time.Millisecond)
durs = m.SnapshotDurations()
want = []uint64{1, 1, 2} // 100ms in all 3, 750ms in only the 1.0 bucket
for i, w := range want {
if durs[0].Buckets[i] != w {
t.Errorf("after 750ms — bucket[%d]: want %d, got %d", i, w, durs[0].Buckets[i])
}
}
}
// TestIssuanceMetrics_Concurrency stresses RecordIssuance under 100
// goroutines × 1000 ops to assert atomic counter integrity. Race-
// detector clean is non-optional for this test (the whole point of
// IssuanceMetrics is concurrent recording from many service
// goroutines).
func TestIssuanceMetrics_Concurrency(t *testing.T) {
m := NewIssuanceMetrics(DefaultIssuanceBucketBoundaries)
const goroutines = 100
const opsPerGoroutine = 1000
var wg sync.WaitGroup
wg.Add(goroutines)
for i := 0; i < goroutines; i++ {
go func() {
defer wg.Done()
for j := 0; j < opsPerGoroutine; j++ {
m.RecordIssuance("local", "success", 50*time.Millisecond)
}
}()
}
wg.Wait()
counters := m.SnapshotCounters()
if len(counters) != 1 {
t.Fatalf("expected 1 counter entry, got %d", len(counters))
}
wantTotal := uint64(goroutines * opsPerGoroutine)
if counters[0].Count != wantTotal {
t.Errorf("counter under contention: want %d, got %d", wantTotal, counters[0].Count)
}
durs := m.SnapshotDurations()
if durs[0].Count != wantTotal {
t.Errorf("histogram count under contention: want %d, got %d", wantTotal, durs[0].Count)
}
}
// TestClassifyError exercises every branch of the closed-enum
// classifier. The classification logic is the load-bearing piece of
// the failure metric — misclassification doesn't break operators, but
// it makes their alerts noisier. Each enum value has at least one
// representative input.
func TestClassifyError(t *testing.T) {
cases := []struct {
name string
err error
want string
}{
{"context_canceled", context.Canceled, "timeout"},
{"context_deadline", context.DeadlineExceeded, "timeout"},
{"timeout_substring", errors.New("operation deadline exceeded"), "timeout"},
{"i_o_timeout", errors.New("read tcp: i/o timeout"), "timeout"},
{"net_op_error", &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")}, "network"},
{"unauthorized_4xx", errors.New("DigiCert: 401 Unauthorized"), "auth"},
{"access_denied_aws", errors.New("AccessDeniedException: not authorized"), "auth"},
{"forbidden_403", errors.New("forbidden: insufficient permissions"), "auth"},
{"rate_limited_429", errors.New("Sectigo: 429 too many requests"), "rate_limited"},
{"throttled", errors.New("ThrottlingException: rate exceeded"), "rate_limited"},
{"validation_csr", errors.New("malformed CSR: invalid PEM block"), "validation"},
{"validation_invalid", errors.New("invalid signing algorithm"), "validation"},
{"upstream_503", errors.New("ServiceUnavailable: 503"), "upstream_5xx"},
{"upstream_500_internal", errors.New("Internal Server Error: 500"), "upstream_5xx"},
{"upstream_404", errors.New("NotFound: 404 cert not found"), "upstream_4xx"},
{"network_no_host", errors.New("dial tcp: no such host"), "network"},
{"other_unmatched", errors.New("something completely unexpected happened"), "other"},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
got := ClassifyError(tc.err)
if got != tc.want {
t.Errorf("ClassifyError(%q): want %q, got %q", tc.err.Error(), tc.want, got)
}
})
}
// Special case: nil → "" so callers that accidentally call us
// with a nil err don't bump the failure counter.
if got := ClassifyError(nil); got != "" {
t.Errorf("ClassifyError(nil): want \"\", got %q", got)
}
}