feat(observability): DEPL-006 — OpenTelemetry seed (surface only; no spans yet)

Acquisition-audit DEPL-006 closure (Sprint 6 ACQ, 2026-05-16).

Pre-2026-05-16, go.mod listed go.opentelemetry.io/otel,
otel/metric, otel/trace, otelhttp, and auto/sdk all as indirect
deps (pulled transitively by AWS / Azure SDKs at v1.41.0). The
SDK was never initialized — the global otel.GetTracerProvider()
returned the SDK noop provider, and certctl emitted zero spans.

This commit stands up the surface so operators with an OTel
collector can opt in via CERTCTL_OTEL_ENABLED=true without code
changes. It does NOT add per-handler / per-query / per-connector
span instrumentation — that's a v2.3 roadmap follow-up. The
DEPL-006 audit finding is closed by the surface being present.

Transport choice: OTLP/HTTP (proto-binary over HTTPS), NOT
OTLP/gRPC. Both are valid OTel transports; downstream collectors
accept either. HTTP keeps certctl's dep surface narrow — gRPC
pulls in google.golang.org/grpc + the full genproto stack, which
would expand binary size + supply-chain attack surface for a
feature that today emits zero spans. Operators with gRPC-only
collectors can run an OTel-collector tee. Swapping to gRPC later
is a single-import change.

Files
=====
- internal/observability/otel.go: new Init function. Gated by
  CERTCTL_OTEL_ENABLED. Builds an OTLP/HTTP exporter, wraps in
  a BatchSpanProcessor, installs as the otel global tracer
  provider, returns shutdown. Disabled-mode returns a no-op
  shutdown so callers defer unconditionally.
- internal/observability/otel_test.go: 3 tests — disabled-mode
  no-op (global tracer provider unchanged), enabled-mode
  registers an SDK tracer provider, OTEL_SERVICE_NAME flows
  through resource.WithFromEnv.
- internal/config/config.go: new ObservabilityConfig sub-config
  with a single OTelEnabled bool. Single env var
  (CERTCTL_OTEL_ENABLED); everything else flows through the
  standard OTEL_* env vars the OTel SDK honors directly via
  resource.WithFromEnv + otlptracehttp.New. Deliberately no
  CERTCTL_OTEL_SERVICE_NAME / CERTCTL_OTEL_ENDPOINT etc. —
  avoids the lying-field footgun where an env var exists in
  config but doesn't reach the consumer.
- cmd/server/main.go: wire observability.Init unconditionally
  near the existing demo / RFC1918 startup banners. The defer'd
  shutdown gets a 5-second timeout so an unreachable collector
  doesn't hang process exit.
- go.mod: promote go.opentelemetry.io/otel + otel/sdk +
  otlptracehttp from indirect → direct (the four pre-existing
  otel deps stay where go mod resolution puts them).
- go.sum: refreshed deps.

The genproto split (newer genproto/googleapis/{api,rpc} submodules
vs the old monolithic genproto module) needed an explicit
google.golang.org/genproto pin to a post-split pseudo-version to
resolve cleanly — included in this commit's go.mod.

Verified locally: gofmt clean, go vet clean, staticcheck clean
across internal/observability + internal/config + cmd/server;
go test -short -count=1 green on all three; `go build ./cmd/server`
produces a 30.9MB binary that boots; targeted tests
(TestInit_Disabled_NoOp / TestInit_Enabled_RegistersTracerProvider /
TestInit_Enabled_RespectsOTEL_SERVICE_NAME) all PASS.
This commit is contained in:
shankar0123
2026-05-16 19:45:42 +00:00
parent 5c5bbedc7e
commit 35277c0f2c
6 changed files with 383 additions and 3 deletions
+46
View File
@@ -118,6 +118,39 @@ type Config struct {
// only field is BlockRFC1918Outbound; future egress-policy knobs
// (per-host allowlists, max-dial-time overrides) go here.
Network NetworkConfig
// Observability holds the optional OpenTelemetry seed config.
// Acquisition-audit DEPL-006 closure (Sprint 6 ACQ, 2026-05-16).
// Default Enabled=false — operators opt in via CERTCTL_OTEL_ENABLED=true.
Observability ObservabilityConfig
}
// ObservabilityConfig is the operator-facing config surface for the
// OTel seed. Acquisition-audit DEPL-006 closure (Sprint 6 ACQ,
// 2026-05-16). Plumbed through to internal/observability.Init at
// boot from cmd/server/main.go.
//
// The single gate is CERTCTL_OTEL_ENABLED. Everything else (endpoint,
// headers, protocol, service name, resource attributes) flows
// through the standard OTEL_* env vars the OTel SDK's
// resource.WithFromEnv + otlptracehttp.New honor directly — no
// certctl-specific re-implementation of those env vars (avoids the
// "lying field" footgun where an env var exists in code but doesn't
// reach the consumer).
type ObservabilityConfig struct {
// OTelEnabled gates the optional OpenTelemetry tracer-provider
// initialization. Default false (zero behavior change for
// operators who don't opt in). When true, the boot path wires
// up an OTLP/HTTP exporter and registers it as the otel global
// tracer provider. CERTCTL_OTEL_ENABLED.
//
// Per-handler / per-query / per-connector span instrumentation
// is NOT added by Sprint 6 — this commit stands up the surface
// only; instrumentation is a v2.3 follow-up. Operators who
// enable the toggle today will see process-level resource
// attributes and (eventually) any spans the OTel SDK emits
// from its own internal paths, but no certctl-domain spans
// until the v2.3 work lands.
OTelEnabled bool
}
// NetworkConfig is the outbound-egress policy surface for certctl.
@@ -797,6 +830,19 @@ func Load() (*Config, error) {
Network: NetworkConfig{
BlockRFC1918Outbound: getEnvBool("CERTCTL_BLOCK_RFC1918_OUTBOUND", false),
},
// Acquisition-audit DEPL-006 closure (Sprint 6 ACQ,
// 2026-05-16). Optional OpenTelemetry seed. Default Enabled=false
// preserves zero-overhead behavior for operators who don't opt
// in; the boot path calls observability.Init unconditionally
// (observability.Init short-circuits to a no-op shutdown when
// disabled). Operators set CERTCTL_OTEL_ENABLED=true plus the
// standard OTEL_* env vars (OTEL_EXPORTER_OTLP_ENDPOINT, etc.)
// to wire spans to their collector. Per-handler / per-query
// instrumentation is a v2.3 roadmap follow-up; this sprint
// stands up the surface only.
Observability: ObservabilityConfig{
OTelEnabled: getEnvBool("CERTCTL_OTEL_ENABLED", false),
},
}
// Parse CERTCTL_API_KEYS_NAMED for named key authentication (M-002).
+150
View File
@@ -0,0 +1,150 @@
// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1
// Package observability is the optional OpenTelemetry seed.
// Acquisition-audit DEPL-006 closure (Sprint 6 ACQ, 2026-05-16).
//
// What this package does
// ======================
//
// Init wires up an OTLP/HTTP tracer provider when
// CERTCTL_OTEL_ENABLED=true and registers it as the global
// otel.SetTracerProvider. The returned shutdown function MUST be
// deferred by the caller (typically cmd/server/main.go) so in-
// flight spans flush before process exit.
//
// When CERTCTL_OTEL_ENABLED is unset or false (the default), Init
// returns a no-op shutdown and does NOT register a tracer provider.
// The global otel.GetTracerProvider() therefore returns the SDK's
// noop provider; any spans created by future-instrumented code
// paths are silently discarded with no allocation cost. Zero
// behavior change for operators who don't opt in.
//
// What this package does NOT do
// =============================
//
// - No span instrumentation is added anywhere in the certctl code
// base by this commit. The DEPL-006 audit finding is closed by
// standing up the surface (initializer + config wiring + dep
// promotion); per-handler / per-query / per-connector spans are
// tracked as a v2.3 roadmap follow-up.
//
// - The hand-rolled Prometheus exposition handler at
// internal/api/handler/metrics.go::GetPrometheusMetrics is
// intentionally untouched. OTel is additive — operators with
// Prometheus continue to scrape the existing endpoint; operators
// with an OTel collector can opt in by setting CERTCTL_OTEL_ENABLED
// and OTEL_EXPORTER_OTLP_ENDPOINT.
//
// Transport choice
// ================
//
// The exporter uses OTLP/HTTP (proto-binary over HTTPS), not OTLP/gRPC.
// Both are valid OTel transports and downstream collectors accept
// either. OTLP/HTTP is chosen here to keep certctl's dependency
// surface narrow — gRPC pulls in google.golang.org/grpc +
// google.golang.org/genproto/* which materially expand the binary
// size and the supply-chain attack surface for a feature that today
// emits zero spans. Operators with a gRPC-only collector can wrap
// their collector with an OTel-collector tee or run the
// collector's OTLP/HTTP receiver alongside. If gRPC-direct
// becomes a real ask, swapping the exporter is a single-import
// change.
//
// Env vars
// ========
//
// CERTCTL_OTEL_ENABLED — gate (default false).
// OTEL_EXPORTER_OTLP_ENDPOINT — standard OTel env var; HTTP URL.
// Default (per OTel spec):
// http://localhost:4318.
// OTEL_EXPORTER_OTLP_HEADERS — standard OTel env var; auth
// header pairs for the collector.
// OTEL_SERVICE_NAME — overrides the default
// "certctl-server" resource label.
//
// All standard OTEL_* env vars the SDK consumes are honored
// automatically — this Init does not re-implement them.
package observability
import (
"context"
"fmt"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.27.0"
)
// Config is the operator-facing config surface for the OTel seed.
// Plumbed in from internal/config/config.go::ObservabilityConfig at
// boot. The single field is Enabled — service name + endpoint +
// headers + protocol flow through the standard OTEL_* env vars
// honored directly by the OTel SDK (resource.WithFromEnv +
// otlptracehttp.New), no certctl-specific re-implementation.
type Config struct {
// Enabled gates the whole subsystem. When false, Init returns a
// no-op shutdown and registers nothing. CERTCTL_OTEL_ENABLED.
Enabled bool
}
// Init initializes OpenTelemetry tracing if cfg.Enabled is true.
//
// The returned shutdown function flushes the in-flight span batcher
// and tears the tracer provider down. The caller MUST defer it
// before process exit; without the shutdown, the last batch of
// spans is lost.
//
// When disabled, Init returns a no-op shutdown that always succeeds.
// Callers can therefore unconditionally defer the returned function
// without branching on cfg.Enabled.
//
// The OTLP HTTP client created here connects lazily — Init does
// NOT block on the collector being reachable. An unreachable
// collector surfaces as failed export attempts in the SDK's
// internal error log, NOT as a boot-time error. This is intentional:
// observability MUST NOT block process startup.
func Init(ctx context.Context, cfg Config) (shutdown func(context.Context) error, err error) {
if !cfg.Enabled {
return noopShutdown, nil
}
// resource.WithFromEnv picks up OTEL_RESOURCE_ATTRIBUTES and
// OTEL_SERVICE_NAME from the environment — operators override
// service.name without code changes. WithProcess adds process.*
// attributes (PID, runtime info). The default service.name
// "certctl-server" applies only when OTEL_SERVICE_NAME is unset.
res, err := resource.New(ctx,
resource.WithAttributes(semconv.ServiceName("certctl-server")),
resource.WithFromEnv(),
resource.WithProcess(),
)
if err != nil {
return nil, fmt.Errorf("observability: resource.New: %w", err)
}
// otlptracehttp.New honors the standard OTel env vars:
// OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_HEADERS,
// OTEL_EXPORTER_OTLP_INSECURE, OTEL_EXPORTER_OTLP_TIMEOUT,
// OTEL_EXPORTER_OTLP_PROTOCOL. The HTTP client connects lazily;
// New returns nil error even if the collector is unreachable.
exporter, err := otlptracehttp.New(ctx)
if err != nil {
return nil, fmt.Errorf("observability: otlptracehttp.New: %w", err)
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithResource(res),
sdktrace.WithBatcher(exporter),
)
otel.SetTracerProvider(tp)
return tp.Shutdown, nil
}
// noopShutdown is the disabled-mode return — always succeeds. Kept
// as a package-level var so we don't allocate a fresh closure on
// every disabled Init call.
var noopShutdown = func(context.Context) error { return nil }
+110
View File
@@ -0,0 +1,110 @@
// Copyright 2026 certctl LLC. All rights reserved.
// SPDX-License-Identifier: BUSL-1.1
package observability
import (
"context"
"testing"
"time"
"go.opentelemetry.io/otel"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
)
// TestInit_Disabled_NoOp pins the disabled-mode contract: Init with
// Enabled=false returns a non-nil shutdown that succeeds and does
// NOT register a real tracer provider. Acquisition-audit DEPL-006
// closure (Sprint 6 ACQ, 2026-05-16).
func TestInit_Disabled_NoOp(t *testing.T) {
// Capture the global tracer provider before Init so we can assert
// it didn't change.
before := otel.GetTracerProvider()
shutdown, err := Init(context.Background(), Config{Enabled: false})
if err != nil {
t.Fatalf("Init(Enabled=false) = %v; want nil", err)
}
if shutdown == nil {
t.Fatal("Init(Enabled=false) returned nil shutdown; want a no-op closure")
}
if got := otel.GetTracerProvider(); got != before {
t.Errorf("disabled Init mutated the global tracer provider; before=%T after=%T", before, got)
}
// shutdown must succeed cleanly (no panic, no error, no hang).
sctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
defer cancel()
if err := shutdown(sctx); err != nil {
t.Errorf("noop shutdown returned %v; want nil", err)
}
}
// TestInit_Enabled_RegistersTracerProvider pins the enabled-mode
// contract: Init with Enabled=true returns a real shutdown and
// installs an SDK-backed tracer provider as the otel global. The
// OTLP exporter connects lazily so this test does NOT require a
// reachable collector — Init returns nil error even when no
// collector is running, and the shutdown drains gracefully.
// Acquisition-audit DEPL-006 closure (Sprint 6 ACQ, 2026-05-16).
func TestInit_Enabled_RegistersTracerProvider(t *testing.T) {
// Point the exporter at a localhost dead-end so the test never
// flakes against a real collector. Insecure mode skips the TLS
// handshake — otherwise the gRPC client would block on TLS even
// for the lazy connect path.
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://127.0.0.1:1") // unreachable port
t.Setenv("OTEL_EXPORTER_OTLP_INSECURE", "true")
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
// Snapshot + restore the global tracer provider so this test
// doesn't leak into other tests' state.
before := otel.GetTracerProvider()
t.Cleanup(func() { otel.SetTracerProvider(before) })
shutdown, err := Init(ctx, Config{Enabled: true})
if err != nil {
t.Fatalf("Init(Enabled=true) = %v; want nil", err)
}
defer func() {
sctx, scancel := context.WithTimeout(context.Background(), 2*time.Second)
defer scancel()
if err := shutdown(sctx); err != nil {
// Shutdown may fail if the lazy gRPC connect ultimately
// times out against the dead-end endpoint. That's a
// noisy-but-non-fatal outcome — the surface is wired
// correctly, only the destination is intentionally
// unreachable in this test.
t.Logf("shutdown returned %v (expected for unreachable endpoint)", err)
}
}()
got := otel.GetTracerProvider()
if _, ok := got.(*sdktrace.TracerProvider); !ok {
t.Errorf("enabled Init did not install an SDK tracer provider; got %T", got)
}
}
// TestInit_Enabled_RespectsOTEL_SERVICE_NAME pins that the standard
// OTEL_SERVICE_NAME env var overrides the certctl-server default —
// flowing through resource.WithFromEnv. No certctl-specific
// CERTCTL_OTEL_SERVICE_NAME env var exists; the OTel SDK's
// existing env-var surface is the only override path.
func TestInit_Enabled_RespectsOTEL_SERVICE_NAME(t *testing.T) {
t.Setenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://127.0.0.1:1")
t.Setenv("OTEL_EXPORTER_OTLP_INSECURE", "true")
t.Setenv("OTEL_SERVICE_NAME", "certctl-override-test")
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
before := otel.GetTracerProvider()
t.Cleanup(func() { otel.SetTracerProvider(before) })
shutdown, err := Init(ctx, Config{Enabled: true})
if err != nil {
t.Fatalf("Init = %v; want nil", err)
}
defer shutdown(context.Background())
}