Files
certctl/internal/service/issuance_metrics_test.go
T
shankar0123 95c2bf9818 metrics: add per-issuer-type issuance counters, histogram, and failure classifier
Closes the #4 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Before this commit, certctl's Prometheus exposition had
zero per-issuer-type signal — operators answering "is DigiCert slow?"
or "is Sectigo failing more than ACME?" had to grep logs by issuer
name. This commit adds three series labelled by issuer type:

  certctl_issuance_total{issuer_type, outcome}
  certctl_issuance_duration_seconds{issuer_type}            (histogram)
  certctl_issuance_failures_total{issuer_type, error_class}

The histogram covers 0.05–120 second buckets to span the local-issuer
fast path and async-CA slow path (DigiCert/Sectigo/Entrust polling
can take minutes). error_class is a closed enum of eight values
(timeout, auth, rate_limited, validation, upstream_5xx, upstream_4xx,
network, other) classified once in service.ClassifyError. Cardinality
budget is ~276 new series, well within Prometheus's comfortable range.

Implementation:
- service.IssuanceMetrics is the thread-safe counter + histogram
  table. Three independent views (counters / failures / durations)
  exposed via SnapshotCounters / SnapshotFailures / SnapshotDurations.
  sync.RWMutex protects the map shape; per-key sync/atomic.Uint64
  primitives keep the recording hot path lock-free under concurrent
  service-layer goroutines.
- service.IssuanceCounterEntry / IssuanceFailureEntry /
  IssuanceDurationEntry / IssuanceMetricsSnapshotter live in service
  (not handler) to avoid an import cycle: handler already imports
  service for admin_est.go etc., so service can't import handler back.
  Handler's exposer takes the snapshotter via the service-defined
  interface.
- service.ClassifyError pure function maps error → error_class.
  context.DeadlineExceeded / context.Canceled → timeout; *net.OpError
  → network; substring matches against canonical AWS / DigiCert /
  Sectigo error shapes for auth / rate_limited / validation /
  upstream_5xx / upstream_4xx / network; unknown → other. Each branch
  has at least one representative test case in
  TestClassifyError.
- IssuerConnectorAdapter.SetMetrics wires per-adapter recording
  (issuerType + metrics). Existing 28+ test call sites of
  NewIssuerConnectorAdapter keep their one-arg signature; production
  wiring goes through SetMetrics post-construction.
- IssuerRegistry.SetIssuanceMetrics + Rebuild type-asserts to
  *IssuerConnectorAdapter and calls SetMetrics with the issuer type
  string. nil-guarded — tests that hand-build adapters without
  metrics get no-op recording.
- IssuerConnectorAdapter.IssueCertificate / RenewCertificate wrap the
  underlying connector call with start := time.Now() and
  recordIssuance(start, err). Renewal is recorded into the same
  certctl_issuance_* series as initial issuance — operationally,
  renewal IS issuance from the connector's perspective (matches the
  audit prompt's guidance on series naming).
- handler/metrics.go GetPrometheusMetrics gains a new exposer block
  emitting all three series in stable label order with correct
  Prometheus format (_bucket / _sum / _count for the histogram, +Inf
  bucket appended). Sorted via sort.Slice for stable output. nil-
  guarded so deploys without the wire produce clean exposition.
- formatLE helper trims trailing zeros from histogram bucket labels
  via strconv.FormatFloat(le, 'f', -1, 64) so the `le` labels match
  Prometheus client conventions ("0.05", "30", "120", not "0.0500"
  etc.).
- cmd/server/main.go wires a single IssuanceMetrics instance into
  both the IssuerRegistry (recording) and the MetricsHandler (exposer)
  using DefaultIssuanceBucketBoundaries.

Tests:
- TestIssuanceMetrics_RecordAndSnapshot — happy-path counter +
  histogram + failure recording, BucketBoundaries returns a copy
  (not shared storage).
- TestIssuanceMetrics_HistogramCumulative — pins the cumulative-buckets
  contract. 100ms observation lands in 0.1 bucket and every larger
  bucket; 750ms only in the 1.0 bucket. Off-by-one here would
  corrupt every quantile query downstream.
- TestIssuanceMetrics_Concurrency — 100 goroutines × 1000 ops under
  the race detector. Asserts atomic counter integrity across
  contended writes.
- TestClassifyError — 17 cases covering every branch of the closed
  enum plus the nil-error special case.

Implementation chooses the existing hand-rolled fmt.Fprintf
exposition pattern (no prometheus/client_golang dependency added)
to stay consistent with the OCSP / deploy counter blocks already in
the file.

Out of scope (separate follow-ups):
- Revocation metrics (certctl_revocation_*) — symmetric to issuance
  but the audit didn't ask; explicit follow-up commit.
- Discovery / health-check duration histograms.
- prometheus/client_golang migration.

Verified locally:
- gofmt clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/service/ green
- go test -short -count=1 -race -run TestIssuanceMetrics ./internal/service/ green
- go test -short -count=1 ./internal/api/handler/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #4 (Part 3, narrative section).
2026-05-02 00:39:25 +00:00

195 lines
6.8 KiB
Go
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
// Copyright (c) certctl
// SPDX-License-Identifier: BSL-1.1
package service
import (
"context"
"errors"
"net"
"sync"
"testing"
"time"
)
// TestIssuanceMetrics_RecordAndSnapshot exercises the happy-path
// counter + histogram + failure recording. Asserts:
// - SnapshotCounters returns the expected (issuer_type, outcome, count) tuples
// - SnapshotDurations returns cumulative bucket counts
// - SnapshotFailures returns the expected (issuer_type, error_class, count) tuples
// - BucketBoundaries returns a copy that doesn't share backing storage
func TestIssuanceMetrics_RecordAndSnapshot(t *testing.T) {
m := NewIssuanceMetrics(DefaultIssuanceBucketBoundaries)
// Record three issuances: two success (one fast, one slow), one failure.
m.RecordIssuance("local", "success", 50*time.Millisecond) // 0.05 bucket
m.RecordIssuance("local", "success", 2*time.Second) // 2.5 bucket
m.RecordIssuance("digicert", "failure", 90*time.Second) // 120 bucket
m.RecordFailure("digicert", "rate_limited")
counters := m.SnapshotCounters()
if len(counters) != 2 {
t.Fatalf("expected 2 counter entries, got %d", len(counters))
}
for _, c := range counters {
switch {
case c.IssuerType == "local" && c.Outcome == "success":
if c.Count != 2 {
t.Errorf("local/success: want 2, got %d", c.Count)
}
case c.IssuerType == "digicert" && c.Outcome == "failure":
if c.Count != 1 {
t.Errorf("digicert/failure: want 1, got %d", c.Count)
}
default:
t.Errorf("unexpected counter entry: %+v", c)
}
}
failures := m.SnapshotFailures()
if len(failures) != 1 {
t.Fatalf("expected 1 failure entry, got %d", len(failures))
}
if failures[0].IssuerType != "digicert" || failures[0].ErrorClass != "rate_limited" || failures[0].Count != 1 {
t.Errorf("unexpected failure entry: %+v", failures[0])
}
durations := m.SnapshotDurations()
if len(durations) != 2 {
t.Fatalf("expected 2 duration entries, got %d", len(durations))
}
// BucketBoundaries: returned slice must be a copy.
b1 := m.BucketBoundaries()
b2 := m.BucketBoundaries()
if &b1[0] == &b2[0] {
t.Error("BucketBoundaries should return a copy, not shared storage")
}
}
// TestIssuanceMetrics_HistogramCumulative pins the cumulative-buckets
// contract. Prometheus histograms require buckets to be cumulative —
// `le=0.5` includes everything <= 0.5, including <= 0.05 and <= 0.1.
// Off-by-one here corrupts every quantile query downstream.
func TestIssuanceMetrics_HistogramCumulative(t *testing.T) {
m := NewIssuanceMetrics([]float64{0.1, 0.5, 1.0})
// Observe 100ms (= 0.1s exactly).
m.RecordIssuance("local", "success", 100*time.Millisecond)
durs := m.SnapshotDurations()
if len(durs) != 1 {
t.Fatalf("expected 1 duration entry, got %d", len(durs))
}
// Boundaries: [0.1, 0.5, 1.0]. 100ms falls into 0.1 bucket and
// every larger bucket (cumulative). Sum = 0.1, count = 1.
want := []uint64{1, 1, 1}
for i, w := range want {
if durs[0].Buckets[i] != w {
t.Errorf("bucket[%d]: want %d, got %d", i, w, durs[0].Buckets[i])
}
}
if durs[0].Sum < 0.099 || durs[0].Sum > 0.101 {
t.Errorf("sum: want ~0.1, got %v", durs[0].Sum)
}
if durs[0].Count != 1 {
t.Errorf("count: want 1, got %d", durs[0].Count)
}
// Observe 750ms — falls into 1.0 bucket only (>0.1, >0.5).
m.RecordIssuance("local", "success", 750*time.Millisecond)
durs = m.SnapshotDurations()
want = []uint64{1, 1, 2} // 100ms in all 3, 750ms in only the 1.0 bucket
for i, w := range want {
if durs[0].Buckets[i] != w {
t.Errorf("after 750ms — bucket[%d]: want %d, got %d", i, w, durs[0].Buckets[i])
}
}
}
// TestIssuanceMetrics_Concurrency stresses RecordIssuance under 100
// goroutines × 1000 ops to assert atomic counter integrity. Race-
// detector clean is non-optional for this test (the whole point of
// IssuanceMetrics is concurrent recording from many service
// goroutines).
func TestIssuanceMetrics_Concurrency(t *testing.T) {
m := NewIssuanceMetrics(DefaultIssuanceBucketBoundaries)
const goroutines = 100
const opsPerGoroutine = 1000
var wg sync.WaitGroup
wg.Add(goroutines)
for i := 0; i < goroutines; i++ {
go func() {
defer wg.Done()
for j := 0; j < opsPerGoroutine; j++ {
m.RecordIssuance("local", "success", 50*time.Millisecond)
}
}()
}
wg.Wait()
counters := m.SnapshotCounters()
if len(counters) != 1 {
t.Fatalf("expected 1 counter entry, got %d", len(counters))
}
wantTotal := uint64(goroutines * opsPerGoroutine)
if counters[0].Count != wantTotal {
t.Errorf("counter under contention: want %d, got %d", wantTotal, counters[0].Count)
}
durs := m.SnapshotDurations()
if durs[0].Count != wantTotal {
t.Errorf("histogram count under contention: want %d, got %d", wantTotal, durs[0].Count)
}
}
// TestClassifyError exercises every branch of the closed-enum
// classifier. The classification logic is the load-bearing piece of
// the failure metric — misclassification doesn't break operators, but
// it makes their alerts noisier. Each enum value has at least one
// representative input.
func TestClassifyError(t *testing.T) {
cases := []struct {
name string
err error
want string
}{
{"context_canceled", context.Canceled, "timeout"},
{"context_deadline", context.DeadlineExceeded, "timeout"},
{"timeout_substring", errors.New("operation deadline exceeded"), "timeout"},
{"i_o_timeout", errors.New("read tcp: i/o timeout"), "timeout"},
{"net_op_error", &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")}, "network"},
{"unauthorized_4xx", errors.New("DigiCert: 401 Unauthorized"), "auth"},
{"access_denied_aws", errors.New("AccessDeniedException: not authorized"), "auth"},
{"forbidden_403", errors.New("forbidden: insufficient permissions"), "auth"},
{"rate_limited_429", errors.New("Sectigo: 429 too many requests"), "rate_limited"},
{"throttled", errors.New("ThrottlingException: rate exceeded"), "rate_limited"},
{"validation_csr", errors.New("malformed CSR: invalid PEM block"), "validation"},
{"validation_invalid", errors.New("invalid signing algorithm"), "validation"},
{"upstream_503", errors.New("ServiceUnavailable: 503"), "upstream_5xx"},
{"upstream_500_internal", errors.New("Internal Server Error: 500"), "upstream_5xx"},
{"upstream_404", errors.New("NotFound: 404 cert not found"), "upstream_4xx"},
{"network_no_host", errors.New("dial tcp: no such host"), "network"},
{"other_unmatched", errors.New("something completely unexpected happened"), "other"},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
got := ClassifyError(tc.err)
if got != tc.want {
t.Errorf("ClassifyError(%q): want %q, got %q", tc.err.Error(), tc.want, got)
}
})
}
// Special case: nil → "" so callers that accidentally call us
// with a nil err don't bump the failure counter.
if got := ClassifyError(nil); got != "" {
t.Errorf("ClassifyError(nil): want \"\", got %q", got)
}
}