OpenTelemetry Graduated. The Architecture Shift Is Still Ahead.
OTel's CNCF graduation settles the standard question. It does not settle the ownership question: who manages the Collector pipeline, who enforces semantic conventions, and how the backend stays decoupled from every service that emits telemetry. Most teams have the SDK. Fewer have the architecture.
Antonio J. del Águila
Knaisoma
On May 21, the Cloud Native Computing Foundation announced the graduation of OpenTelemetry. The numbers behind the milestone are not modest: over 12,000 contributors from more than 2,800 companies, more than 1.36 billion downloads of the JavaScript API package in the past twelve months, and the second-highest project velocity in the CNCF ecosystem after Kubernetes. The announcement was made at the Observability Summit in Minneapolis, and the vendor ecosystem has been aligned on OTel as the instrumentation standard for long enough that the graduation reads more as formalization than revelation.
That is not a criticism. Graduation matters: it means the project completed CNCF’s governance and independent security review, and organizations waiting for that signal before committing now have it. But what graduation resolves is the standard question. The harder question, which graduation does not touch, is this: once every team has an OTel SDK installed, who owns the pipeline?
How organic adoption creates telemetry debt
Most organizations adopted OpenTelemetry the natural way. A service team learned about OTel, added an SDK, configured an OTLP exporter, and pointed it at their observability backend. Then a second team did the same, to a slightly different backend, with different resource attributes and a Collector sidecar deployed in their own Kubernetes namespace. A third team followed. Within a year, the platform team inherits an estate with seven Collector configurations, three versions of semantic conventions, two backends, and no shared place to apply a processing rule or change an export destination.
This is not a failure of OpenTelemetry. It is the predictable outcome of adopting a distributed standard without a centralized collection architecture. The OTel project’s own Blueprints initiative, launched earlier this month, names the failure mode explicitly: organizations that adopt OTel organically without centralized standards produce fragmented telemetry pipelines, inconsistent semantic conventions, and broken context propagation. The initiative offers prescriptive architectural guidance specifically because the pattern is widespread enough to warrant a project-level response.
The root cause is a misclassification. The Collector is not a deployment detail bundled with each service. It is the architectural boundary between what services know about themselves and what the observability platform knows about them. When each team manages their own Collector, there is no boundary, and the benefits of a shared standard dissolve into the same per-team configuration sprawl that the standard was supposed to prevent.
Three tiers, three owners
The pattern that holds up at scale separates telemetry responsibility into three tiers.
A sequence diagram with four participants: Application/SDK, Platform Collector, Processor Pipeline, and Observability Backend. The Application sends OTLP spans, metrics, and logs to the Platform Collector with a note indicating the application emits only business-level span attributes while the SDK handles auto-instrumentation. The Collector feeds the Processor Pipeline, which adds cloud metadata, cluster and namespace, and migrates semantic conventions. The Processor Pipeline sends normalized, sampled, enriched telemetry to the Observability Backend. A final note indicates that the backend is a configuration detail and applications do not know its address.
sequenceDiagram
participant App as Application / SDK
participant Col as Platform Collector
participant Pro as Processor Pipeline
participant Bak as Observability Backend
App->>Col: OTLP (spans, metrics, logs)
Note over App: Emits business-level span attributes.<br/>SDK handles auto-instrumentation.
Col->>Pro: Raw telemetry stream
Note over Col,Pro: Platform adds cloud metadata, cluster,<br/>namespace. Migrates semantic conventions.<br/>Applies sampling policy.
Pro->>Bak: Normalized, sampled, enriched telemetry
Note over Bak: Backend is a config detail in one file.<br/>Applications do not know its address.
The application tier is where most teams over-invest. The instinct is to instrument every internal method, add rich attributes to every span, and configure sampling at the SDK level. The result is that sampling policy, convention choices, and export configuration scatter across every service. When OTel releases a new semantic convention version, teams face coordinated SDK rollouts instead of a one-line Collector change. The application tier should own exactly what it uniquely knows: business context that the infrastructure layer cannot infer. Which customer placed the order, which feature variant was served, which batch job the task belongs to. HTTP spans, database queries, and framework lifecycle events belong to auto-instrumentation, not application code.
The platform tier is the Collector pipeline: receiving OTLP from all services, running resource detection to add cloud provider, region, cluster, and namespace attributes once across the estate, applying semantic convention normalization through a transform processor, making the sampling decision centrally, and routing to backends through an exporter that no application ever configured. The platform team owns this deployment, versions it through GitOps, and rolls it forward independently of any application release.
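One reason central sampling is tractable is that a probabilistic decision can be a pure function of the trace ID and a single percentage, so every span of a trace gets the same verdict wherever it is evaluated. The Python sketch below illustrates the idea only. It is not the Collector's probabilistic_sampler implementation; the function name and the SHA-256 bucketing are assumptions chosen for illustration.

```python
import hashlib


def keep_trace(trace_id: str, sampling_percentage: float) -> bool:
    """Decide whether to keep a trace, deterministically, from its ID alone.

    Hypothetical helper: hashes the trace ID into a 64-bit bucket and keeps
    the trace when the bucket falls below the configured percentage.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")            # 0 .. 2**64 - 1
    threshold = int(sampling_percentage / 100 * 2**64)    # share of the range to keep
    return bucket < threshold


# Every span of the same trace gets the same answer, on any Collector replica.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
assert keep_trace(trace_id, 2) == keep_trace(trace_id, 2)
```

Because the decision depends on nothing but the trace ID and one number, changing the sampling rate is a single configuration change rather than a per-service rollout.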
The backend tier consumes normalized, enriched telemetry and has no structural dependency on the services that produced it. The claim that OTel lets you change backends without rewriting instrumentation code is only true if the export address is configured in one place and applications do not know it. When services configure their own exporters, they are directly coupled to the backend, not decoupled from it.
What the platform tier looks like in practice
A minimal platform Collector configuration makes the ownership split concrete.
# otel-platform-collector.yaml
# Deployed and versioned by the platform team via GitOps.
# Applications set one env var: OTEL_EXPORTER_OTLP_ENDPOINT=<this address>
# Nothing else about the collection pipeline is their concern.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 512
  # Adds cloud provider, region, cluster, and namespace from the runtime env.
  # Applications cannot know these reliably; the platform can.
  resourcedetection/cloud:
    detectors: [env, gcp, eks, azure, k8snode]
    timeout: 2s
  # Promotes deprecated attribute names to current OTel semantic conventions.
  # One config change here replaces a coordinated SDK update across fifty services.
  # Example: net.peer.name -> server.address (HTTP attributes migration, OTel 1.21+)
  transform/semconv:
    trace_statements:
      - context: span
        statements:
          - set(attributes["server.address"], attributes["net.peer.name"])
            where attributes["net.peer.name"] != nil
          - set(attributes["network.peer.address"], attributes["net.peer.ip"])
            where attributes["net.peer.ip"] != nil
  # Central probabilistic sampling: one policy, one place, no per-service overrides.
  probabilistic_sampler:
    sampling_percentage: 2
  batch:
    timeout: 10s
    send_batch_size: 1024
exporters:
  otlphttp:
    endpoint: "${env:OTEL_BACKEND_ENDPOINT}"
    headers:
      Authorization: "Bearer ${env:OTEL_BACKEND_TOKEN}"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection/cloud, transform/semconv,
                   probabilistic_sampler, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection/cloud, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection/cloud, batch]
      exporters: [otlphttp]
The transform processor statements reflect real semantic convention changes: net.peer.name was deprecated in favor of server.address during the HTTP and RPC attribute migration introduced in version 1.21 of the OTel semantic conventions. A team whose services still ship an older SDK can run unchanged behind this Collector; the platform normalizes on their behalf. When the backend changes, one environment variable on the Collector changes. Applications never know.
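This configuration is intentionally minimal. Production deployments at scale need separate pipelines per tenant environment, more nuanced sampling policies, horizontal Collector scaling, and gateway-mode deployments in front of agent-mode sidecars for services that cannot accept a change to their export endpoint. One caveat: resource detection describes the environment the Collector itself runs in, so in a gateway deployment per-workload attributes such as namespace typically come from an agent-mode Collector on each node or from the k8sattributes processor. The structural principle is what matters: the platform owns everything downstream of the SDK’s OTLP emit.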
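To make the normalization step concrete, the effect of the two transform/semconv statements on a single span's attributes can be mimicked in a few lines of Python. This is a sketch for reasoning about the behavior, not how the Collector executes OTTL; the mapping table and function name are hypothetical.

```python
# Hypothetical mapping mirroring the two transform/semconv statements:
# deprecated attribute name -> current semantic convention name.
SEMCONV_RENAMES = {
    "net.peer.name": "server.address",
    "net.peer.ip": "network.peer.address",
}


def normalize_attributes(attributes: dict) -> dict:
    """Return a copy with current-convention keys set from deprecated ones.

    Like the set(...) where ... != nil statements, this copies the value to
    the new key only when the old key is present, and leaves the old key
    in place.
    """
    normalized = dict(attributes)
    for old_key, new_key in SEMCONV_RENAMES.items():
        value = attributes.get(old_key)
        if value is not None:
            normalized[new_key] = value
    return normalized


# A span from a service on an older SDK gains the current attribute name.
old_sdk_span = {"net.peer.name": "db.internal", "http.method": "GET"}
assert normalize_attributes(old_sdk_span)["server.address"] == "db.internal"

# A span that already uses the current name passes through unchanged.
new_sdk_span = {"server.address": "db.internal"}
assert normalize_attributes(new_sdk_span) == new_sdk_span
```

Services on old and new SDK versions end up queryable under the same attribute name, which is the property the backend depends on.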
Anti-patterns that sink platform observability
Four patterns appear reliably when OTel adoption reaches scale without this separation.
Collector sprawl is the most visible. Each team deploys and owns a Collector for their services. Sampling policies differ. Semantic conventions diverge. Export credentials expire at different times. When the backend migrates, the migration requires changes in as many places as there are Collector deployments. The symptom: nobody on the platform team can state the current sampling rate without asking three service teams.
Vendor migration theater is what OTel adoption produces when teams add the SDK explicitly to achieve backend independence but leave export configuration in application code or per-team Collector files. The services are technically OTel-instrumented. They are practically locked to the backend, because changing the export endpoint requires touching every service configuration. Backend independence is not a property of the SDK. It is a property of where the export configuration lives.
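A quick way to test for this anti-pattern is to check where each service's export endpoint actually points. The sketch below assumes the services' environment variables have already been collected into a dict (how they are gathered is out of scope); the Collector address, service names, and function name are all hypothetical.

```python
# Hypothetical address of the platform-owned Collector.
PLATFORM_COLLECTOR = "http://otel-collector.platform.svc:4317"


def find_backend_coupled(services: dict[str, dict[str, str]]) -> list[str]:
    """Return names of services whose OTLP endpoint is not the platform Collector.

    Services with no endpoint set are flagged too, since they fall back to
    whatever default their SDK uses rather than the platform address.
    """
    coupled = []
    for name, env in sorted(services.items()):
        endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
        if endpoint.rstrip("/") != PLATFORM_COLLECTOR:
            coupled.append(name)
    return coupled


# Illustrative inventory, not real data.
services = {
    "checkout": {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector.platform.svc:4317"},
    "search": {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://ingest.vendor.example/otlp"},
    "billing": {},
}
print(find_backend_coupled(services))  # ['billing', 'search']
```

Every name in the output is a service that would need to be touched in a backend migration.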
Metrics-without-traces is a subtler failure. Metrics alone can answer “how often is this happening”; explaining why requires traces correlated through the same pipeline. Teams frequently instrument metrics and logs through a central pipeline but leave traces in a vendor-specific agent, because that is how observability worked before OTel. The consequence: a latency spike appears in the metrics, and the slow traces that would explain it are in a different backend or absent. A single platform pipeline handling all three signals is what makes correlation queries reliable.
Context propagation leakage is the failure mode that surfaces after initial adoption. W3C Trace Context headers must work across every service hop. When one service in a distributed call is uninstrumented, or is instrumented without propagation configured, the trace breaks at that boundary. The downstream span appears in the backend as a disconnected root, or does not appear at all, leaving engineers to reconstruct a timeline from logs and metrics instead. A trace completeness metric at the Collector (spans that are either an expected entry-point root or whose parent span ID resolves within the same trace, divided by total inbound spans) surfaces propagation gaps across the estate without requiring per-trace inspection.
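The completeness ratio can be sketched over a window of buffered spans. The version below is one possible reading of the metric, not an established OTel definition: it assumes a hypothetical set of entry-point services whose root spans are legitimate, and treats any other root, or any span whose parent is missing from its trace, as a propagation gap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks a root span
    service: str


def trace_completeness(spans: list[Span], entry_services: set[str]) -> float:
    """Fraction of spans that are correctly attached to a trace."""
    if not spans:
        return 1.0
    known = {(s.trace_id, s.span_id) for s in spans}
    connected = 0
    for s in spans:
        if s.parent_span_id is None:
            # Roots are only expected from services that start requests.
            ok = s.service in entry_services
        else:
            # The parent must have been seen within the same trace.
            ok = (s.trace_id, s.parent_span_id) in known
        connected += ok
    return connected / len(spans)


spans = [
    Span("t1", "a", None, "gateway"),    # legitimate root
    Span("t1", "b", "a", "checkout"),    # parent present
    Span("t1", "d", "c", "payments"),    # parent "c" never arrived
    Span("t2", "e", None, "inventory"),  # unexpected root: context was dropped
]
print(trace_completeness(spans, entry_services={"gateway"}))  # 0.5
```

Tracked per service over time, a falling ratio points at the hop where propagation broke.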
A migration path
The Blueprints initiative reference implementations are a useful check for teams moving from fragmented organic adoption to a platform-owned pipeline. The sequence that works in practice has four stages.
First, audit the existing Collector topology. Count how many Collector deployments exist, who owns each configuration, and where they have diverged from each other. Most teams find this number higher than expected and the divergence deeper. This audit also produces the input for the trace completeness metric: a view of how many services are actually propagating context correctly versus emitting disconnected spans.
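The divergence part of the audit can be sketched as a small script. The example below is a minimal illustration, assuming each team's Collector configuration has already been parsed into a Python dict (for instance with a YAML library, not shown here); the team names, endpoint, and both helper functions are hypothetical.

```python
from collections import defaultdict


def summarize(config: dict) -> tuple:
    """Reduce a Collector config to the settings most likely to diverge."""
    processors = config.get("processors") or {}
    sampler = processors.get("probabilistic_sampler") or {}
    exporters = config.get("exporters") or {}
    endpoints = tuple(sorted((e or {}).get("endpoint", "") for e in exporters.values()))
    return sampler.get("sampling_percentage"), endpoints


def find_divergence(configs: dict[str, dict]) -> dict[tuple, list[str]]:
    """Group config owners by (sampling percentage, exporter endpoints)."""
    groups = defaultdict(list)
    for owner, config in sorted(configs.items()):
        groups[summarize(config)].append(owner)
    return dict(groups)


def make_config(percentage: int) -> dict:
    # Illustrative config shape only; the endpoint is a placeholder.
    return {
        "processors": {"probabilistic_sampler": {"sampling_percentage": percentage}},
        "exporters": {"otlphttp": {"endpoint": "https://backend.example"}},
    }


configs = {"team-a": make_config(2), "team-b": make_config(10), "team-c": make_config(2)}
for settings, owners in find_divergence(configs).items():
    print(settings, owners)
```

More than one group in the output means the estate has more than one effective policy, which is the number the audit is meant to surface.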
Second, deploy a centralized Collector as a new deployment without decommissioning anything existing. Migrate one service team at a time by changing the SDK export endpoint (a single environment variable) and validating that their telemetry arrives at the new Collector correctly. This is a soft migration with no application code changes required.
Third, consolidate resource detection and semantic convention normalization into the central Collector’s processor chain. As services migrate, the platform takes over attribute enrichment that was previously done inconsistently or not at all. Semantic convention normalization in particular has strong leverage here: one transform processor update covers every service that has migrated to the central Collector.
Fourth, as services migrate, remove the per-team Collectors. The end state is a single managed deployment with a GitOps pipeline, a defined upgrade cadence, and one address that every service in the estate points to.
Across the engineering teams we work with at various stages of this transition, the constraint that surprises most is not technical. It is prioritization. Platform observability is infrastructure investment with no immediately visible feature behind it. The case for it sits in two lagging indicators: the operational cost of debugging production incidents with fragmented, inconsistent telemetry, and the migration cost when a backend changes and every service has to be touched. Both costs are real. Both are invisible until they become urgent.
OTel graduation marks the moment a tool becomes infrastructure. Most engineering organizations are still treating it as a library. The gap between those two positions is the Collector pipeline, and closing it is a platform engineering decision, not an SDK upgrade.
If you are working through the transition from service-by-service OTel adoption to a platform-owned pipeline, and you are weighing where the platform boundary should sit and how to manage the migration without disrupting service teams, we have helped engineering teams navigate that decision and are glad to think through it with you.