From f7fcd1e187ce043bb91e5b1bb1d5a00355e53fc3 Mon Sep 17 00:00:00 2001
From: shankar0123
Date: Sat, 16 May 2026 22:10:05 +0000
Subject: [PATCH] =?UTF-8?q?docs(observability):=20DEPL-006=20follow-up=20?=
 =?UTF-8?q?=E2=80=94=20document=20CERTCTL=5FOTEL=5FENABLED=20(G-3=20ci-gua?=
 =?UTF-8?q?rd)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Sprint 6 ACQ DEPL-006 closure follow-up. The G-3-env-docs-drift
ci-guard scans `internal/` + `cmd/` for every CERTCTL_* env-var
reference and cross-checks against README + docs/ + deploy/helm/ +
deploy/ENVIRONMENTS.md. The OTel-seed commit (35277c0) introduced
`CERTCTL_OTEL_ENABLED` in `internal/config/config.go` +
`cmd/server/main.go` but didn't add the matching doc entry, so the
guard caught the drift on the next CI run with:

    G-3 regression: env var(s) defined in Go source but never documented:
      CERTCTL_OTEL_ENABLED

Replaces the existing "Tracing — explicitly not yet shipped"
subsection in docs/operator/observability.md with an honest
"Tracing — OTLP surface available, instrumentation pending" section
that:

- Documents the env var + the standard OTEL_* env vars the SDK
  honors (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, etc.).
- Explains the OTLP/HTTP transport choice (vs gRPC) per the
  rationale in internal/observability/otel.go's header.
- Pins what the current release DOES (surface + lazy connect +
  graceful shutdown) vs DOES NOT (per-handler / per-DB /
  per-connector spans).
- Notes the no-op-shutdown contract so operators can defer
  unconditionally.
- Cross-references the existing request_id correlation +
  per-issuer Prometheus histogram as the interim correlation
  surface.
- Repoints the "future work" tracker from the old "v3 item"
  framing to WORKSPACE-ROADMAP.md §2 (Phase 4 in the path-b build
  plan).

Verified locally: `bash scripts/ci-guards/G-3-env-docs-drift.sh`
exits 0 ("G-3 env-docs-drift: clean").
---
 docs/operator/observability.md | 64 ++++++++++++++++++++++++++--------
 1 file changed, 49 insertions(+), 15 deletions(-)

diff --git a/docs/operator/observability.md b/docs/operator/observability.md
index 04dd7ed..f18a177 100644
--- a/docs/operator/observability.md
+++ b/docs/operator/observability.md
@@ -74,22 +74,55 @@ metric surface meet our SLO needs today" — not "is the right library
 under the hood." If the answer to the first question is yes, the second
 is a refactor, not a feature gap.
 
-## Tracing — explicitly not yet shipped
+## Tracing — OTLP surface available, instrumentation pending
 
-certctl does **not** ship distributed tracing instrumentation today:
+Sprint 6 ACQ DEPL-006 closure (2026-05-16) stood up the OTel
+tracer-provider surface. Operators with an OTel collector can opt in via:
 
-- No OpenTelemetry SDK setup in `cmd/server/main.go`.
-- No OTLP exporter wired into outbound calls (issuer connectors,
-  agent enrollment, etc.).
-- The `go.opentelemetry.io/otel` packages that appear in
-  [`go.mod`](../../go.mod) are indirect-only — they're transitive
-  dependencies of `coreos/go-oidc` and similar.
+```
+CERTCTL_OTEL_ENABLED=true
+OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.example.com:4318
+```
 
-This is honest: there is no in-process tracing surface to monitor,
-correlate, or sample. If your environment requires end-to-end traces
-across the certctl control plane + agents + issuer backends, this is
-a gap you would close on the certctl side as part of a v3 work item.
-Until then:
+When `CERTCTL_OTEL_ENABLED` is true, `cmd/server/main.go` calls
+`internal/observability.Init` which:
+
+- Constructs an OTLP/HTTP exporter (chosen over OTLP/gRPC to keep
+  the dependency surface narrow — see `internal/observability/otel.go`
+  header for the transport-choice rationale).
+- Registers a real `sdktrace.TracerProvider` as the otel global.
+- Honors the standard OTel env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`,
+  `OTEL_EXPORTER_OTLP_HEADERS`, `OTEL_EXPORTER_OTLP_INSECURE`,
+  `OTEL_SERVICE_NAME` overrides the default `certctl-server`, etc.).
+- Defers a graceful shutdown that flushes the in-flight batcher.
+
+What this **does not** ship yet:
+
+- No per-handler / per-DB / per-connector span instrumentation in
+  the certctl code base. The OTel SDK emits the spans it generates
+  internally (process resource attributes, eventual stdlib HTTP
+  spans), but certctl-domain spans (issuance, renewal, deployment,
+  agent enrollment) are a v2.3 roadmap follow-up.
+- No tracing-correlated metric exemplars in the Prometheus
+  histograms above. Those still ship the per-issuer latency signal
+  without per-request fan-out.
+- No backwards-compat shim — operators who never set
+  `CERTCTL_OTEL_ENABLED` (the default) see zero behavior change.
+  The init returns a no-op shutdown so the deferred call is safe
+  to invoke unconditionally.
+
+When this matters today:
+
+- Operators wiring up a v3 instrumentation effort have the OTel
+  surface in place; they only need to add `tracer.Start(ctx, "…")`
+  call sites in the handler/service code.
+- Operators evaluating certctl for acquisition / due-diligence see
+  an opt-in OTel surface in the current release rather than a "v3
+  roadmap item" — a useful signal for buyer credibility per the
+  acquisition-thesis framing in `WORKSPACE-ROADMAP.md` §3.
+
+Existing correlation surfaces stay in place until span coverage
+ships:
 
 - Structured logs include a `request_id` you can correlate across
   the server log stream. See
@@ -99,8 +132,9 @@ Until then:
   same per-issuer latency signal a trace span would, just without
   the per-request fan-out.
 
-OpenTelemetry instrumentation is tracked in
-[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item.
+Per-handler / per-query / per-connector span instrumentation is
+tracked in [WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) under
+§2 (NHI / Agent Identity, Phase 4 in the path-b build plan).
 
 ## Logging
 