Introduction to OpenTelemetry

OpenTelemetry is a CNCF (Cloud Native Computing Foundation) project that provides a single, vendor-neutral standard for collecting observability data from applications and infrastructure. Instead of integrating a different agent for each monitoring backend, you instrument your code once with OpenTelemetry and send the data wherever you need.

OpenTelemetry defines the APIs, SDKs, and wire protocols. It does not store or visualise data; that is the job of your chosen backend (Jaeger, Prometheus, Grafana, Datadog, and others).

The three signals

OpenTelemetry organises observability data into three complementary signal types.

Traces

A trace records the end-to-end journey of a single request as it travels through a distributed system. It is made up of spans: individual units of work, each with a start time, a duration, and attributes.

TraceID: abc123
  └── Span: http.server (frontend)         0ms → 120ms
        └── Span: grpc.client (API gateway) 5ms → 110ms
              └── Span: db.query (database) 20ms → 95ms

Traces answer: Where did this request spend its time, and where did it fail?
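To make the "where did this request spend its time" question concrete, here is a minimal sketch that models the trace from the diagram above as a tree and computes each span's self time (its duration minus the time covered by its direct children). The `Span` class and `walk` helper are hypothetical and are not part of the OpenTelemetry API; the sketch also assumes that sibling spans do not overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical, simplified span: a name, a time window, and child spans."""
    name: str
    start_ms: int
    end_ms: int
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

    @property
    def self_time_ms(self) -> int:
        # Time in this span that is not covered by its direct children.
        # Assumes the children do not overlap each other.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span: Span):
    """Yield every span in the tree, parents before children."""
    yield span
    for child in span.children:
        yield from walk(child)


# The three spans from the diagram above.
db = Span("db.query", 20, 95)
gateway = Span("grpc.client", 5, 110, [db])
root = Span("http.server", 0, 120, [gateway])

self_times = {s.name: s.self_time_ms for s in walk(root)}
slowest = max(self_times, key=self_times.get)
print(self_times)
print("most self time:", slowest)
```

In this example the database query accounts for most of the request, which is the kind of conclusion a trace viewer lets you reach at a glance.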

Metrics

Metrics are numeric measurements collected at regular intervals: counters, gauges, and histograms. Examples: request rate, error rate, CPU usage, queue depth.

Metrics answer: Is the system healthy right now, and is it trending in the right direction?
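The difference between the three instrument kinds is easiest to see side by side. The sketch below is a plain-Python illustration, not the OpenTelemetry metrics API: the class names are hypothetical, and the histogram assumes explicit bucket bounds that are treated as inclusive upper limits.

```python
from bisect import bisect_left
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonic total: only ever goes up (e.g. requests served)."""
    value: int = 0

    def add(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


@dataclass
class Gauge:
    """Last observed value; can go up or down (e.g. queue depth)."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket; bounds are inclusive upper limits."""
    bounds: list[float]
    counts: list[int] = field(init=False)

    def __post_init__(self) -> None:
        # One extra bucket catches everything above the last bound.
        self.counts = [0] * (len(self.bounds) + 1)

    def record(self, value: float) -> None:
        self.counts[bisect_left(self.bounds, value)] += 1


requests = Counter()
queue_depth = Gauge()
latency_ms = Histogram(bounds=[50, 100, 250])

for ms in (12, 50, 87, 120, 400):
    requests.add()
    latency_ms.record(ms)
queue_depth.set(3)

print(requests.value, queue_depth.value, latency_ms.counts)
```

A counter answers "how many", a gauge answers "how much right now", and a histogram answers "how are the values distributed", which is what percentile latency charts are built from.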

Logs

Logs are timestamped, structured records of discrete events. OpenTelemetry connects logs to the trace they belong to via TraceID and SpanID fields, making it possible to jump from a metric alert to the relevant trace and its logs in one step.

Logs answer: What exactly happened at this moment?
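The correlation described above comes down to every log record carrying the IDs of the span that emitted it. The following sketch filters structured log lines by trace ID. The records, field names (`trace_id`, `span_id`), timestamps, and the shortened IDs are all made up for illustration; real trace IDs are 32 hexadecimal characters.

```python
import json

# Illustrative JSON log lines; only the trace_id and span_id fields matter here.
raw_lines = [
    '{"ts": "2024-01-01T12:00:00.020Z", "level": "INFO", "msg": "query started", "trace_id": "abc123", "span_id": "s3"}',
    '{"ts": "2024-01-01T12:00:00.031Z", "level": "INFO", "msg": "cache warmed", "trace_id": "def456", "span_id": "s9"}',
    '{"ts": "2024-01-01T12:00:00.095Z", "level": "ERROR", "msg": "query timed out", "trace_id": "abc123", "span_id": "s3"}',
]


def logs_for_trace(lines, trace_id):
    """Return the parsed records that belong to one trace, oldest first."""
    records = (json.loads(line) for line in lines)
    matching = [r for r in records if r.get("trace_id") == trace_id]
    # ISO 8601 timestamps in the same timezone sort correctly as strings.
    return sorted(matching, key=lambda r: r["ts"])


related = logs_for_trace(raw_lines, "abc123")
messages = [r["msg"] for r in related]
print(messages)
```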

Architecture

graph LR
  App["Application\n(instrumented with OTel SDK)"]
  Collector["OTel Collector\n(agent or gateway)"]
  Jaeger["Jaeger\n(traces)"]
  Prometheus["Prometheus\n(metrics)"]
  Loki["Loki\n(logs)"]

  App -->|"OTLP (gRPC/HTTP)"| Collector
  Collector --> Jaeger
  Collector --> Prometheus
  Collector --> Loki

OTLP (OpenTelemetry Protocol) is the standard wire format used between all components. Any backend that speaks OTLP can receive data from any OTel-instrumented application.

Core components

| Component | Role |
|---|---|
| API | Language-specific interfaces: what your application code calls. Stable and minimal. Importing only the API adds no overhead if no SDK is loaded. |
| SDK | The implementation of the API. Handles sampling, batching, and exporting. Configured at process startup, not in library code. |
| Collector | A standalone process that receives, processes, and exports telemetry. Acts as a buffer, fan-out point, and place to apply sampling or filtering without touching application code. |
| Exporter | A plugin inside the SDK or Collector that translates telemetry data to a backend-specific format (Jaeger Thrift, Prometheus text, etc.). |
| Propagator | Injects and extracts trace context (W3C traceparent header) across process boundaries so spans from different services are linked into one trace. |

A minimal instrumentation example

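The following Python example shows the three steps every instrumented application takes: configure the SDK, get a tracer, and create spans with attributes. It assumes the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc packages are installed.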

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# 1 Configure the SDK once at startup
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

# 2 Get a tracer scoped to this component
tracer = trace.get_tracer("order-service")

# 3 Create spans around units of work
def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("order.source", "web")
        # business logic here

The SDK configuration (exporter endpoint, sampling rate) lives in environment variables or a config file, not in this code. Switching from a local Jaeger instance to a cloud backend requires no code change.
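As an illustration of configuration living outside the code, the sketch below overlays environment values on a set of defaults. The SDK does this lookup itself, so you would not normally write it; the point is only to show that the same program targets a different backend when the environment changes. The variable names are the standard OpenTelemetry ones, while `resolve_settings`, the default values shown, and the `otlp.example.com` endpoint are assumptions for this example.

```python
from typing import Mapping

# Assumed defaults for this sketch; check your SDK's documentation for the real ones.
DEFAULTS = {
    "OTEL_SERVICE_NAME": "unknown_service",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://localhost:4317",
    "OTEL_TRACES_SAMPLER": "parentbased_always_on",
}


def resolve_settings(env: Mapping[str, str]) -> dict:
    """Overlay values found in env on top of the defaults."""
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


# Local development: only the service name is set.
local = resolve_settings({"OTEL_SERVICE_NAME": "order-service"})

# Production: same code, different environment (hypothetical endpoint).
cloud = resolve_settings({
    "OTEL_SERVICE_NAME": "order-service",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://otlp.example.com:4317",
})

print(local["OTEL_EXPORTER_OTLP_ENDPOINT"])
print(cloud["OTEL_EXPORTER_OTLP_ENDPOINT"])
```

In a real process you would pass `os.environ` instead of a literal dictionary.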

Semantic conventions

OpenTelemetry publishes Semantic Conventions: standardised attribute names for common operations. Using them ensures your data is compatible with community dashboards, alerts, and auto-instrumentation libraries across HTTP, database, messaging, RPC, and resource signals.

For a complete attribute reference grouped by domain, see the Semantic Conventions guide.
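A small sketch of what adopting the conventions looks like in practice: ad-hoc attribute keys are renamed to standard ones before they are set on a span. The left-hand keys and the `to_semantic` helper are hypothetical; the right-hand names are taken from the HTTP semantic conventions as I understand the current stable version, so verify them against the published reference before relying on them.

```python
# Hypothetical ad-hoc keys mapped to names from the HTTP semantic conventions.
RENAMES = {
    "method": "http.request.method",
    "status": "http.response.status_code",
    "path": "url.path",
    "host": "server.address",
}


def to_semantic(attributes: dict) -> dict:
    """Rename known ad-hoc keys; leave unknown keys untouched."""
    return {RENAMES.get(key, key): value for key, value in attributes.items()}


span_attributes = to_semantic({
    "method": "GET",
    "status": 200,
    "path": "/orders",
    "order.id": "42",  # application-specific attribute, kept as-is
})
print(span_attributes)
```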

Why OpenTelemetry instead of vendor agents

| Vendor agent approach | OpenTelemetry approach |
|---|---|
| One agent per backend | One SDK, many exporters |
| Vendor-specific APIs in your code | Standard OTel API in your code |
| Switching backends requires code changes | Switching backends changes only the exporter config |
| Inconsistent attribute names across vendors | Semantic conventions standardise attribute names |
| Separate agents compete for resources | Single SDK, single Collector process |

Auto-instrumentation libraries exist for most frameworks (Django, Spring Boot, Express, gRPC, and others) and require zero code changes; they patch the framework at load time.

What to avoid

Do not import both the OTel SDK and a vendor-specific agent for the same signal. For example, using the Datadog APM library alongside the OTel SDK for traces will produce duplicate spans and conflicting context propagation. Choose one and configure it via the Collector or exporter.

Do not skip context propagation for cross-service calls. If the HTTP client or message producer does not inject the W3C traceparent header, the downstream service starts a new trace and the two spans are forever disconnected. Always enable propagation in the SDK configuration.
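The effect of a missing header can be shown with a minimal sketch of W3C trace context propagation. This is not the OpenTelemetry propagator API: `inject`, `extract`, and `start_server_span` are hypothetical helpers, and the sketch accepts only version `00` of the `traceparent` format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags).

```python
import re
import secrets

# version "00" - 32 hex trace id - 16 hex parent span id - 2 hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_id(num_bytes: int) -> str:
    """Random lowercase hex id: 16 bytes for a trace, 8 for a span."""
    return secrets.token_hex(num_bytes)


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Client side: write the current trace context into outgoing headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def extract(headers: dict):
    """Server side: return (trace_id, parent_span_id), or None if absent or invalid."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, _flags = match.groups()
    # All-zero ids are invalid in the W3C format.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id


def start_server_span(headers: dict) -> dict:
    """Continue the caller's trace if context arrived, else start a new one."""
    context = extract(headers)
    trace_id, parent_id = context if context else (new_id(16), None)
    return {"trace_id": trace_id, "span_id": new_id(8), "parent_id": parent_id}


client = {"trace_id": new_id(16), "span_id": new_id(8)}

with_header = {}
inject(with_header, client["trace_id"], client["span_id"])
linked = start_server_span(with_header)

orphan = start_server_span({})  # the client did not inject anything

print(linked["trace_id"] == client["trace_id"])
print(orphan["parent_id"])
```

The `linked` span shares the client's trace ID and names the client span as its parent, while `orphan` becomes the root of an unrelated trace.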

Do not configure exporters inside library code. Libraries should call only the OTel API and let the application configure the SDK. If a library sets up its own TracerProvider, it overrides the application’s configuration.

Do not sample at the SDK level without a strategy. Head-based sampling decides before the outcome of a request is known, so dropping 99% of traces also drops roughly 99% of the failures. Use tail-based sampling in the Collector to keep all traces that contain errors or slow spans.
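The difference between the two strategies can be sketched in a few lines. Both functions are simplified stand-ins rather than the SDK's or the Collector's samplers: `head_sample` maps the trace ID to a bucket out of 10,000, `tail_sample` keeps a trace when any span errored or exceeded an assumed 500 ms threshold, and the synthetic data uses sequential trace IDs with one failure in every 50 requests.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: decide from the trace id alone, before any span has run."""
    bound = round(ratio * 10_000)
    return int(trace_id, 16) % 10_000 < bound


def tail_sample(spans: list[dict], slow_ms: int = 500) -> bool:
    """Tail-based: decide after the trace is complete, using what happened."""
    has_error = any(s["status"] == "ERROR" for s in spans)
    is_slow = any(s["duration_ms"] >= slow_ms for s in spans)
    return has_error or is_slow


# Synthetic workload: 10,000 single-span traces, every 50th one fails.
traces = []
for n in range(10_000):
    failed = n % 50 == 0
    traces.append({
        "trace_id": f"{n:032x}",
        "spans": [{"status": "ERROR" if failed else "OK", "duration_ms": 40}],
    })

failures = [t for t in traces if any(s["status"] == "ERROR" for s in t["spans"])]
kept_by_head = [t for t in failures if head_sample(t["trace_id"], 0.01)]
kept_by_tail = [t for t in failures if tail_sample(t["spans"])]

print(len(failures), len(kept_by_head), len(kept_by_tail))
```

With this data, 1% head sampling retains only a handful of the failing traces, while the tail-based rule retains all of them. A production tail sampler would usually also keep a small share of healthy traces as a baseline.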

Do not rely on the Collector being available at startup. Configure the SDK exporter with a retry policy and a fallback queue so that a temporary Collector outage does not lose data or crash the application.
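The idea of a retry policy combined with a bounded fallback queue can be sketched as follows. `BufferedSender` is a hypothetical wrapper written for this example, not an OpenTelemetry class; the queue size, attempt count, and backoff values are arbitrary assumptions, and the `send` callable is assumed to raise `ConnectionError` when the Collector is unreachable.

```python
import time
from collections import deque


class BufferedSender:
    """Hypothetical wrapper: bounded buffer plus retry with exponential backoff."""

    def __init__(self, send, max_queue=2048, max_attempts=4,
                 base_delay_s=0.5, sleep=time.sleep):
        self._send = send                      # raises ConnectionError on failure
        self._queue = deque(maxlen=max_queue)  # oldest items are dropped when full
        self._max_attempts = max_attempts
        self._base_delay_s = base_delay_s
        self._sleep = sleep

    def enqueue(self, item) -> None:
        # Never blocks and never raises, so an outage cannot crash the caller.
        self._queue.append(item)

    def flush(self) -> bool:
        """Try to send everything queued; return True if the queue was drained."""
        for attempt in range(self._max_attempts):
            try:
                while self._queue:
                    self._send(self._queue[0])
                    self._queue.popleft()  # remove only after a successful send
                return True
            except ConnectionError:
                if attempt < self._max_attempts - 1:
                    self._sleep(self._base_delay_s * 2 ** attempt)
        return False  # items stay buffered; a later flush can retry


# Fake transport that fails twice, then recovers.
delivered, failures_left = [], [2]

def flaky_send(item):
    if failures_left[0] > 0:
        failures_left[0] -= 1
        raise ConnectionError("collector unavailable")
    delivered.append(item)

waits = []  # record the backoff delays instead of actually sleeping
sender = BufferedSender(flaky_send, sleep=waits.append)
for span_name in ("a", "b", "c"):
    sender.enqueue(span_name)
drained = sender.flush()

print(drained, delivered, waits)
```

Bounding the queue is the important design choice: during a long outage the application loses the oldest telemetry instead of growing memory without limit.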

Next steps