docs(observability): DEPL-006 follow-up — document CERTCTL_OTEL_ENABLED (G-3 ci-guard)

Sprint 6 ACQ DEPL-006 closure follow-up. The G-3-env-docs-drift
ci-guard scans `internal/` + `cmd/` for every CERTCTL_*
env-var reference and cross-checks against README + docs/ +
deploy/helm/ + deploy/ENVIRONMENTS.md. The OTel-seed commit
(35277c0) introduced `CERTCTL_OTEL_ENABLED` in
`internal/config/config.go` + `cmd/server/main.go` but didn't
add the matching doc entry, so the guard caught the drift on
the next CI run with:

  G-3 regression: env var(s) defined in Go source but never documented:
  CERTCTL_OTEL_ENABLED
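
As a rough illustration of the cross-check described above, the Go sketch below collects CERTCTL_* names from source text and reports any that never appear in the docs. It is not the actual guard, which is the bash script `scripts/ci-guards/G-3-env-docs-drift.sh`. The regular expression, the function names `collectEnvVars` and `undocumented`, the in-memory file maps, and the `CERTCTL_LOG_LEVEL` variable are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// envVarPattern matches CERTCTL_-prefixed identifiers such as CERTCTL_OTEL_ENABLED.
// Assumption: the real guard may use a different pattern.
var envVarPattern = regexp.MustCompile(`CERTCTL_[A-Z0-9_]+`)

// collectEnvVars returns the set of CERTCTL_* names found in the given file contents.
func collectEnvVars(files map[string]string) map[string]bool {
	found := make(map[string]bool)
	for _, content := range files {
		for _, name := range envVarPattern.FindAllString(content, -1) {
			found[name] = true
		}
	}
	return found
}

// undocumented returns, in sorted order, every name referenced in source
// but absent from all documentation files.
func undocumented(source, docs map[string]string) []string {
	documented := collectEnvVars(docs)
	var missing []string
	for name := range collectEnvVars(source) {
		if !documented[name] {
			missing = append(missing, name)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// In-memory stand-ins for the scanned trees; a real check would walk the filesystem.
	// CERTCTL_LOG_LEVEL is a hypothetical variable used only for contrast.
	source := map[string]string{
		"internal/config/config.go": `os.Getenv("CERTCTL_OTEL_ENABLED")`,
		"cmd/server/main.go":        `os.Getenv("CERTCTL_LOG_LEVEL")`,
	}
	docs := map[string]string{
		"docs/operator/observability.md": "Set CERTCTL_LOG_LEVEL to adjust verbosity.",
	}

	if missing := undocumented(source, docs); len(missing) > 0 {
		fmt.Println("env var(s) defined in Go source but never documented:")
		fmt.Println(strings.Join(missing, "\n"))
	}
}
```

With these sample inputs only `CERTCTL_OTEL_ENABLED` is reported, because the docs map mentions the other variable.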

Replaces the existing "Tracing — explicitly not yet shipped"
subsection in docs/operator/observability.md with an honest
"Tracing — OTLP surface available, instrumentation pending"
section that:

- Documents the env var + the standard OTEL_* env vars the SDK
  honors (OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_SERVICE_NAME, etc.).
- Explains the OTLP/HTTP transport choice (vs gRPC) per the
  rationale in internal/observability/otel.go's header.
- Pins what the current release DOES (surface + lazy connect +
  graceful shutdown) vs DOES NOT (per-handler / per-DB /
  per-connector spans).
- Notes the no-op-shutdown contract so operators can defer
  unconditionally.
- Cross-references the existing request_id correlation + per-
  issuer Prometheus histogram as the interim correlation surface.
- Repoints the "future work" tracker from the old "v3 item"
  framing to WORKSPACE-ROADMAP.md §2 (Phase 4 in the path-b
  build plan).
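
The no-op-shutdown contract mentioned in the list above can be sketched as follows. This is a minimal, stdlib-only illustration, not the code in `internal/observability/otel.go`: `initTracing`, `ShutdownFunc`, and the `start` hook are hypothetical names, the exporter setup is replaced by a placeholder, and parsing the flag with `strconv.ParseBool` is an assumption about how the config is read.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"strconv"
	"time"
)

// ShutdownFunc flushes and stops whatever the init step started.
type ShutdownFunc func(ctx context.Context) error

// initTracing is a hypothetical stand-in for internal/observability.Init.
// start is a hypothetical hook that would build the exporter and tracer provider.
func initTracing(enabled bool, start func() (ShutdownFunc, error)) (ShutdownFunc, error) {
	if !enabled {
		// Disabled: hand back a no-op so callers never need a nil check.
		return func(context.Context) error { return nil }, nil
	}
	return start()
}

func main() {
	// Assumption: an unset or unparsable value is treated as false.
	enabled, _ := strconv.ParseBool(os.Getenv("CERTCTL_OTEL_ENABLED"))

	shutdown, err := initTracing(enabled, func() (ShutdownFunc, error) {
		// Placeholder for real exporter and provider setup.
		return func(ctx context.Context) error {
			fmt.Println("flushing spans")
			return nil
		}, nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "tracing init failed:", err)
		os.Exit(1)
	}

	// The defer is unconditional: it is safe whether or not tracing is enabled.
	defer func() {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		if err := shutdown(ctx); err != nil {
			fmt.Fprintln(os.Stderr, "tracing shutdown:", err)
		}
	}()

	fmt.Println("server running; tracing enabled:", enabled)
}
```

Returning a function in both cases keeps the call site free of branching, which is what lets the caller defer the shutdown unconditionally.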

Verified locally: `bash scripts/ci-guards/G-3-env-docs-drift.sh`
exits 0 ("G-3 env-docs-drift: clean").
shankar0123
2026-05-16 22:10:05 +00:00
parent 9155ec9174
commit f7fcd1e187
+49 -15
@@ -74,22 +74,55 @@ metric surface meet our SLO needs today" — not "is the right library
 under the hood." If the answer to the first question is yes, the
 second is a refactor, not a feature gap.

-## Tracing — explicitly not yet shipped
+## Tracing — OTLP surface available, instrumentation pending

-certctl does **not** ship distributed tracing instrumentation today:
+Sprint 6 ACQ DEPL-006 closure (2026-05-16) stood up the OTel tracer-
+provider surface. Operators with an OTel collector can opt in via:

-- No OpenTelemetry SDK setup in `cmd/server/main.go`.
-- No OTLP exporter wired into outbound calls (issuer connectors,
-  agent enrollment, etc.).
-- The `go.opentelemetry.io/otel` packages that appear in
-  [`go.mod`](../../go.mod) are indirect-only — they're transitive
-  dependencies of `coreos/go-oidc` and similar.
+```
+CERTCTL_OTEL_ENABLED=true
+OTEL_EXPORTER_OTLP_ENDPOINT=https://otel-collector.example.com:4318
+```

-This is honest: there is no in-process tracing surface to monitor,
-correlate, or sample. If your environment requires end-to-end traces
-across the certctl control plane + agents + issuer backends, this is
-a gap you would close on the certctl side as part of a v3 work item.
-Until then:
+When `CERTCTL_OTEL_ENABLED` is true, `cmd/server/main.go` calls
+`internal/observability.Init` which:
+
+- Constructs an OTLP/HTTP exporter (chosen over OTLP/gRPC to keep
+  the dependency surface narrow — see `internal/observability/otel.go`
+  header for the transport-choice rationale).
+- Registers a real `sdktrace.TracerProvider` as the otel global.
+- Honors the standard OTel env vars (`OTEL_EXPORTER_OTLP_ENDPOINT`,
+  `OTEL_EXPORTER_OTLP_HEADERS`, `OTEL_EXPORTER_OTLP_INSECURE`,
+  `OTEL_SERVICE_NAME` overrides the default `certctl-server`, etc.).
+- Defers a graceful shutdown that flushes the in-flight batcher.
+
+What this **does not** ship yet:
+
+- No per-handler / per-DB / per-connector span instrumentation in
+  the certctl code base. The OTel SDK emits the spans it generates
+  internally (process resource attributes, eventual stdlib HTTP
+  spans), but certctl-domain spans (issuance, renewal, deployment,
+  agent enrollment) are a v2.3 roadmap follow-up.
+- No tracing-correlated metric exemplars in the Prometheus
+  histograms above. Those still ship the per-issuer latency signal
+  without per-request fan-out.
+- No backwards-compat shim — operators who never set
+  `CERTCTL_OTEL_ENABLED` (the default) see zero behavior change.
+  The init returns a no-op shutdown so the deferred call is safe
+  to invoke unconditionally.
+
+When this matters today:
+
+- Operators wiring up a v3 instrumentation effort have the OTel
+  surface in place; they only need to add `tracer.Start(ctx, "…")`
+  call sites in the handler/service code.
+- Operators evaluating certctl for acquisition / due-diligence see
+  an opt-in OTel surface in the current release rather than a "v3
+  roadmap item" — a useful signal for buyer credibility per the
+  acquisition-thesis framing in `WORKSPACE-ROADMAP.md` §3.
+
+Existing correlation surfaces stay in place until span coverage
+ships:

 - Structured logs include a `request_id` you can correlate across
   the server log stream. See
@@ -99,8 +132,9 @@ Until then:
   same per-issuer latency signal a trace span would, just without
   the per-request fan-out.

-OpenTelemetry instrumentation is tracked in
-[WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) as a v3 item.
+Per-handler / per-query / per-connector span instrumentation is
+tracked in [WORKSPACE-ROADMAP.md](../../WORKSPACE-ROADMAP.md) under
+§2 (NHI / Agent Identity, Phase 4 in the path-b build plan).

 ## Logging