metrics: add per-issuer-type issuance counters, histogram, and failure classifier

Closes the #4 acquisition-readiness blocker from the 2026-05-01 issuer
coverage audit. Before this commit, certctl's Prometheus exposition had
zero per-issuer-type signal — operators answering "is DigiCert slow?"
or "is Sectigo failing more than ACME?" had to grep logs by issuer
name. This commit adds three series labelled by issuer type:

  certctl_issuance_total{issuer_type, outcome}
  certctl_issuance_duration_seconds{issuer_type}            (histogram)
  certctl_issuance_failures_total{issuer_type, error_class}

The histogram covers 0.05–120 second buckets to span the local-issuer
fast path and async-CA slow path (DigiCert/Sectigo/Entrust polling
can take minutes). error_class is a closed enum of eight values
(timeout, auth, rate_limited, validation, upstream_5xx, upstream_4xx,
network, other) classified once in service.ClassifyError. Cardinality
budget is ~276 new series, well within Prometheus's comfortable range.

Implementation:
- service.IssuanceMetrics is the thread-safe counter + histogram
  table. Three independent views (counters / failures / durations)
  exposed via SnapshotCounters / SnapshotFailures / SnapshotDurations.
  sync.RWMutex protects the map shape; per-key sync/atomic.Uint64
  primitives keep the recording hot path lock-free under concurrent
  service-layer goroutines.
- service.IssuanceCounterEntry / IssuanceFailureEntry /
  IssuanceDurationEntry / IssuanceMetricsSnapshotter live in service
  (not handler) to avoid an import cycle: handler already imports
  service for admin_est.go etc., so service can't import handler back.
  Handler's exposer takes the snapshotter via the service-defined
  interface.
- service.ClassifyError pure function maps error → error_class.
  context.DeadlineExceeded / context.Canceled → timeout; *net.OpError
  → network; substring matches against canonical AWS / DigiCert /
  Sectigo error shapes for auth / rate_limited / validation /
  upstream_5xx / upstream_4xx / network; unknown → other. Each branch
  has at least one representative test case in
  TestClassifyError.
- IssuerConnectorAdapter.SetMetrics wires per-adapter recording
  (issuerType + metrics). Existing 28+ test call sites of
  NewIssuerConnectorAdapter keep their one-arg signature; production
  wiring goes through SetMetrics post-construction.
- IssuerRegistry.SetIssuanceMetrics and Rebuild type-assert to
  *IssuerConnectorAdapter and call SetMetrics with the issuer type
  string. nil-guarded: tests that hand-build adapters without
  metrics get no-op recording.
- IssuerConnectorAdapter.IssueCertificate / RenewCertificate wrap the
  underlying connector call with start := time.Now() and
  recordIssuance(start, err). Renewal is recorded into the same
  certctl_issuance_* series as initial issuance — operationally,
  renewal IS issuance from the connector's perspective (matches the
  audit prompt's guidance on series naming).
- handler/metrics.go GetPrometheusMetrics gains a new exposer block
  emitting all three series in stable label order with correct
  Prometheus format (_bucket / _sum / _count for the histogram, +Inf
  bucket appended). Sorted via sort.Slice for stable output. nil-
  guarded so deploys without the wire produce clean exposition.
- formatLE helper trims trailing zeros from histogram bucket labels
  via strconv.FormatFloat(le, 'f', -1, 64) so the `le` labels match
  Prometheus client conventions ("0.05", "30", "120", not "0.0500"
  etc.).
- cmd/server/main.go wires a single IssuanceMetrics instance into
  both the IssuerRegistry (recording) and the MetricsHandler (exposer)
  using DefaultIssuanceBucketBoundaries.
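
The RWMutex-plus-atomics arrangement described for
service.IssuanceMetrics can be sketched roughly as follows. The names
counterTable, counterKey, inc, and snapshot are hypothetical, and the
real type may structure its lookup differently; this version takes a
brief read lock to find the counter and then increments it atomically.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counterKey is a hypothetical label pair for one counter series.
type counterKey struct{ issuerType, outcome string }

// counterTable guards the map shape with an RWMutex; the values
// themselves are atomics, so increments never hold the write lock.
type counterTable struct {
	mu       sync.RWMutex
	counters map[counterKey]*atomic.Uint64
}

func newCounterTable() *counterTable {
	return &counterTable{counters: make(map[counterKey]*atomic.Uint64)}
}

// inc bumps the counter for (issuerType, outcome), creating it on first use.
func (t *counterTable) inc(issuerType, outcome string) {
	k := counterKey{issuerType, outcome}
	t.mu.RLock()
	c, ok := t.counters[k]
	t.mu.RUnlock()
	if !ok {
		// Slow path: take the write lock and re-check, since another
		// goroutine may have inserted the key in the meantime.
		t.mu.Lock()
		if c, ok = t.counters[k]; !ok {
			c = new(atomic.Uint64)
			t.counters[k] = c
		}
		t.mu.Unlock()
	}
	c.Add(1)
}

// snapshot copies the current values so callers never see shared storage.
func (t *counterTable) snapshot() map[counterKey]uint64 {
	t.mu.RLock()
	defer t.mu.RUnlock()
	out := make(map[counterKey]uint64, len(t.counters))
	for k, c := range t.counters {
		out[k] = c.Load()
	}
	return out
}

func main() {
	t := newCounterTable()
	var wg sync.WaitGroup
	for g := 0; g < 100; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				t.inc("acme", "success")
			}
		}()
	}
	wg.Wait()
	fmt.Println(t.snapshot()[counterKey{"acme", "success"}]) // 100000
}
```

The write lock is only taken the first time a label combination is
seen, so with a closed label set it drops out of the steady state.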

Tests:
- TestIssuanceMetrics_RecordAndSnapshot — happy-path counter +
  histogram + failure recording, BucketBoundaries returns a copy
  (not shared storage).
- TestIssuanceMetrics_HistogramCumulative — pins the cumulative-buckets
  contract. A 100ms observation lands in the 0.1 bucket and every
  larger bucket; a 750ms observation first lands in the 1.0 bucket
  (and every larger one), never in a smaller one. An off-by-one here
  would corrupt every quantile query downstream.
- TestIssuanceMetrics_Concurrency — 100 goroutines × 1000 ops under
  the race detector. Asserts atomic counter integrity across
  contended writes.
- TestClassifyError — 17 cases covering every branch of the closed
  enum plus the nil-error special case.

The implementation keeps the existing hand-rolled fmt.Fprintf
exposition pattern (no prometheus/client_golang dependency added)
to stay consistent with the OCSP / deploy counter blocks already in
the file.

Out of scope (separate follow-ups):
- Revocation metrics (certctl_revocation_*) — symmetric to issuance
  but the audit didn't ask; explicit follow-up commit.
- Discovery / health-check duration histograms.
- prometheus/client_golang migration.

Verified locally:
- gofmt clean
- go vet ./... clean
- staticcheck ./... clean
- golangci-lint run --timeout 5m ./... → 0 issues
- go test -short -count=1 ./internal/service/ green
- go test -short -count=1 -race -run TestIssuanceMetrics ./internal/service/ green
- go test -short -count=1 ./internal/api/handler/ green
- go build ./... success

Audit reference: cowork/issuer-coverage-audit-2026-05-01/RESULTS.md
Top-10 fix #4 (Part 3, narrative section).
shankar0123
2026-05-02 00:39:25 +00:00
parent b0efdbe2f8
commit 3b92048242
6 changed files with 738 additions and 4 deletions
+51 -3
@@ -2,6 +2,7 @@ package service
import (
"context"
"time"
"github.com/shankar0123/certctl/internal/connector/issuer"
)
@@ -9,15 +10,55 @@ import (
// IssuerConnectorAdapter bridges the connector-layer issuer.Connector interface with the
// service-layer IssuerConnector interface. This maintains dependency inversion: the service
// layer defines the interface it needs, and this adapter wraps the concrete connector.
//
// Metrics: when issuerType + metrics are set via SetMetrics, the adapter
// records every IssueCertificate / RenewCertificate call into the
// IssuanceMetrics tables (audit fix #4). Untyped or unmetricked
// adapters (test path) skip the recording — nil-guard everywhere.
type IssuerConnectorAdapter struct {
connector issuer.Connector
issuerType string
metrics *IssuanceMetrics
}
// NewIssuerConnectorAdapter wraps an issuer.Connector to implement
// service.IssuerConnector. Existing call sites (28+) keep this
// signature; metrics are wired via SetMetrics post-construction by
// the production code path (issuer_registry.go) so test sites that
// don't care about metrics stay one-arg.
func NewIssuerConnectorAdapter(c issuer.Connector) IssuerConnector {
return &IssuerConnectorAdapter{connector: c}
}
// SetMetrics wires per-issuer-type issuance metrics. issuerType is the
// factory key (e.g. "local", "acme", "digicert") — must match one of
// the closed-enum values the metrics doc references. metrics may be
// nil to disable recording. Closes the #4 audit-readiness blocker
// (per-issuer-type metrics).
func (a *IssuerConnectorAdapter) SetMetrics(issuerType string, metrics *IssuanceMetrics) {
a.issuerType = issuerType
a.metrics = metrics
}
// recordIssuance is the metrics-recording side effect at the adapter
// boundary. Bumps the issuance counter (success/failure) and the
// duration histogram; on failure also bumps the failure-by-error-class
// counter via ClassifyError.
//
// nil-guarded: when metrics or issuerType are unset, it's a no-op.
func (a *IssuerConnectorAdapter) recordIssuance(start time.Time, err error) {
if a.metrics == nil || a.issuerType == "" {
return
}
duration := time.Since(start)
if err != nil {
a.metrics.RecordIssuance(a.issuerType, "failure", duration)
a.metrics.RecordFailure(a.issuerType, ClassifyError(err))
} else {
a.metrics.RecordIssuance(a.issuerType, "success", duration)
}
}
// IssueCertificate delegates to the underlying connector's IssueCertificate method,
// translating between service-layer and connector-layer types.
//
@@ -26,6 +67,7 @@ func NewIssuerConnectorAdapter(c issuer.Connector) IssuerConnector {
// honors it (RFC 7633 id-pe-tlsfeature extension); upstream connectors
// silently ignore the field.
func (a *IssuerConnectorAdapter) IssueCertificate(ctx context.Context, commonName string, sans []string, csrPEM string, ekus []string, maxTTLSeconds int, mustStaple bool) (*IssuanceResult, error) {
start := time.Now()
result, err := a.connector.IssueCertificate(ctx, issuer.IssuanceRequest{
CommonName: commonName,
SANs: sans,
@@ -34,6 +76,7 @@ func (a *IssuerConnectorAdapter) IssueCertificate(ctx context.Context, commonNam
MaxTTLSeconds: maxTTLSeconds,
MustStaple: mustStaple,
})
a.recordIssuance(start, err)
if err != nil {
return nil, err
}
@@ -47,8 +90,12 @@ func (a *IssuerConnectorAdapter) IssueCertificate(ctx context.Context, commonNam
}
// RenewCertificate delegates to the underlying connector's RenewCertificate method,
// translating between service-layer and connector-layer types. Metrics:
// renewal is recorded into the same certctl_issuance_* series as
// initial issuance; operationally, renewal IS issuance from the
// connector's perspective.
func (a *IssuerConnectorAdapter) RenewCertificate(ctx context.Context, commonName string, sans []string, csrPEM string, ekus []string, maxTTLSeconds int, mustStaple bool) (*IssuanceResult, error) {
start := time.Now()
result, err := a.connector.RenewCertificate(ctx, issuer.RenewalRequest{
CommonName: commonName,
SANs: sans,
@@ -57,6 +104,7 @@ func (a *IssuerConnectorAdapter) RenewCertificate(ctx context.Context, commonNam
MaxTTLSeconds: maxTTLSeconds,
MustStaple: mustStaple,
})
a.recordIssuance(start, err)
if err != nil {
return nil, err
}