Observability Explained: Logs, Metrics, Traces, and Alerts

Observability is the capability to infer the internal state of a complex system from external outputs. In modern software systems—distributed, dynamic, and often ephemeral—observability is essential for understanding behavior, diagnosing issues, ensuring reliability, and improving performance. This article provides a deep, practical, and theoretically grounded exploration of observability with a focus on the four primary telemetry signals: logs, metrics, traces, and alerts. You’ll find history, core concepts, instrumentation patterns, storage and processing concerns, practical examples, best practices, key tools, and forward-looking trends.

Table of contents

  • Origins and theory of observability
  • Observability vs monitoring: differences and interplay
  • The four telemetry signals
    • Metrics
    • Logs
    • Traces (distributed tracing)
    • Alerts (and SLO-driven alerting)
  • How the signals relate and should be used together
  • Instrumentation and tooling (OpenTelemetry, Prometheus, ELK/Loki, Jaeger/Tempo)
  • Storage, scaling, and cost considerations (cardinality, sampling, retention)
  • Alerting lifecycle and incident response
  • Practical examples and code snippets
    • Instrumenting a microservice (Python examples)
    • Prometheus rules and PromQL examples
    • Sample structured log and Loki query
    • Tracing span example and Jaeger/Tempo concepts
  • Best practices and anti-patterns
  • Current state of the ecosystem
  • Future directions: AI, predictive observability, standards
  • Summary checklist

Origins and theory of observability

Observability is a concept borrowed from control theory. In control systems, a system is observable if one can deduce its internal state from outputs (measurements) over time. For software systems, the “internal state” might include application variables, queue lengths, dependency health, transaction contexts, and resource usage. Telemetry provides measurable outputs; observability is the property that these outputs enable accurate reasoning about system behavior.

Key theoretical points:

  • Observability is not only about collecting data; it’s about collecting the right data (high signal-to-noise ratio).
  • Observability enables state reconstruction and causal inference: you want to know why something happened, not just that it did.
  • Good observability supports hypothesis-driven debugging: form a hypothesis, gather data, test and refine.

Historically, monitoring focused on black-box symptoms—up or down, basic metrics—whereas observability emphasizes instrumentation and context to answer arbitrary questions, even ones not anticipated when the system was built.


Observability vs monitoring: differences and interplay

  • Monitoring: The practice of collecting predefined signals and raising alerts based on known failure modes. Typically goal-oriented (e.g., alert if CPU > 80% for 5m). Designed for detection.
  • Observability: The ability to explore and understand unknown unknowns. It emphasizes high-cardinality, high-dimensional telemetry, context propagation, and ad-hoc investigation. Designed for diagnosis and root cause analysis.

They complement each other: monitoring detects problems quickly; observability gives the ability to drill down, diagnose, and fix the problems.


The four telemetry signals

The telemetry signals are often described as “pillars” but are more accurately complementary signal types. Here we break them down and describe what unique value each brings.

Metrics

Metrics are numeric measurements sampled over time.

  • Common types:

    • Counter: monotonically increasing (e.g., total requests)
    • Gauge: snapshot value (e.g., memory usage, number of items in queue)
    • Histogram: distribution of values across buckets (useful for latency percentiles)
    • Summary: quantiles computed inside each client process; the results cannot be meaningfully aggregated across instances, so summaries are less favored at scale
  • Properties:

    • Low cardinality is desirable for storage and query performance.
    • Great for dashboards, alerting, long-term trends, SLOs.
    • Compact, efficient to store and query.
    • Time-series DBs: Prometheus, InfluxDB, VictoriaMetrics, Cortex, Thanos.
  • Use cases:

    • SLIs (e.g., request success rate, p99 latency)
    • Capacity planning
    • Anomaly detection on aggregates

Logs

Logs are timestamped event records, either structured (JSON) or free text, usually tied to a specific component or transaction.

  • Properties:

    • High cardinality, rich text context.
    • Best for detailed forensic analysis and investigating discrete events.
    • Can contain stack traces, user IDs, request payloads (watch for PII).
    • Storage: centralized systems like ELK (Elasticsearch), Loki, Splunk, or cloud offerings.
    • Query languages vary (Lucene, LogQL).
  • Use cases:

    • Error investigations, root cause analysis
    • Audit trails
    • Security incident forensics

Traces (Distributed Tracing)

Traces represent the lifecycle of a request (or transaction) as it flows through a distributed system, composed of spans (operations).

  • Span: an operation with start time, duration, name, attributes, and parent-child relationship.
  • Trace: a tree/DAG of spans representing an end-to-end request.
  • Context propagation: correlation IDs passed over network calls (W3C Trace Context).
  • Tools: Jaeger, Zipkin, Tempo, Lightstep, Datadog APM.
  • Sampling is often necessary due to high volume; adaptive sampling strategies exist.
  • Use cases:
    • Latency hotspots and bottleneck identification
    • Causal analysis across services
    • Visualizing end-to-end request flow
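
Context propagation is easier to picture with the header itself. The sketch below parses and formats a W3C Trace Context traceparent value (version, trace ID, parent span ID, flags) and derives a child span context. It handles only the version 00 layout and ignores tracestate; the names TraceParent, parse_traceparent, format_traceparent, and child_of are illustrative, not part of any library.

```python
import re
import secrets
from dataclasses import dataclass

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class TraceParent:
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header):
    """Return a TraceParent, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # All-zero IDs and version "ff" are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceParent(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx):
    """Serialize a context for an outgoing request header."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


def child_of(parent):
    """New span in the same trace: keep trace ID and sampling, mint a span ID."""
    return TraceParent(parent.trace_id, secrets.token_hex(8), parent.sampled)


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing_header = format_traceparent(child_of(incoming))
```

In practice an instrumentation library does this for you; the point is that every hop keeps the trace ID and replaces only the span ID.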

Alerts

Alerts are actionable notifications raised when telemetry data violates a defined condition.

  • Two main types:

    • Threshold-based alerts: static thresholds on resource or error metrics (e.g., high CPU); simple, but often noisy because they fire on causes rather than user impact.
    • SLO-driven alerts: alerting based on SLO burn rates (more resilient to noise).
  • Alert lifecycle: detection -> deduplication -> routing -> escalation -> on-call -> runbook -> resolution -> postmortem.

  • Platforms: Prometheus Alertmanager, Grafana Alerting, PagerDuty, Opsgenie.

  • Best practice: treat every alert as an interruption to a human; reduce alert fatigue by paging only on high-signal conditions (SLO-based).
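
The deduplication and routing steps of the lifecycle can be sketched in a few lines. This is a simplified illustration, not how any particular alert manager is implemented: the fingerprint scheme, the ROUTES table, and the channel names are all assumptions made for the example.

```python
import hashlib


def fingerprint(labels):
    """Stable identity for an alert, derived from its sorted label set."""
    canonical = ",".join(f"{key}={value}" for key, value in sorted(labels.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


# Hypothetical routing table: severity label -> notification channel.
ROUTES = {"page": "pagerduty", "ticket": "issue-tracker"}


def process(alerts, active=None):
    """Return (channel, labels) notifications for alerts not already firing."""
    active = set() if active is None else active
    notifications = []
    for labels in alerts:
        fp = fingerprint(labels)
        if fp in active:
            continue  # duplicate of an alert that is already firing
        active.add(fp)
        channel = ROUTES.get(labels.get("severity"), "chat")
        notifications.append((channel, labels))
    return notifications


incoming_alerts = [
    {"alertname": "HighRequestLatency", "service": "checkout", "severity": "page"},
    {"severity": "page", "service": "checkout", "alertname": "HighRequestLatency"},
    {"alertname": "DiskFilling", "service": "db", "severity": "ticket"},
]
notifications = process(incoming_alerts)  # two notifications; the duplicate is dropped
```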


How the signals relate and should be used together

  • Metrics detect trends and trigger alerts; they are your first notification system.
  • Traces show the end-to-end path for problematic requests flagged by metrics.
  • Logs provide the fine-grained context and payload for a particular trace/span or metric anomaly.
  • Alerts coordinate human action and link to dashboards, runbooks, traces, and logs.

Example workflow:

  1. Metric alert: p95 latency for service X > 1s.
  2. Look at trace samples for recent slow requests—identify a dependency or database query with high tail latency.
  3. Inspect logs for the trace IDs to find errors, stack traces, or explanatory messages.
  4. Fix, validate via metrics, and update runbook/alert conditions.

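Correlation is vital: include trace IDs and request IDs as fields in logs and as attributes on spans so you can pivot between signals. Keep them out of metric labels, where they would explode cardinality; where the backend supports it, attach a trace ID to a metric sample as an exemplar instead.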
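
The pivot from traces to logs in steps 2 and 3 is essentially a join on the trace ID. A minimal sketch, assuming spans and log records are already available as dictionaries with a trace_id field (the field names and the function logs_by_slow_trace are illustrative):

```python
from collections import defaultdict


def logs_by_slow_trace(spans, logs, threshold_ms):
    """Group log records under the trace IDs of spans slower than threshold_ms."""
    slow_ids = {s["trace_id"] for s in spans if s["duration_ms"] > threshold_ms}
    grouped = defaultdict(list)
    for record in logs:
        trace_id = record.get("trace_id")
        if trace_id in slow_ids:
            grouped[trace_id].append(record)
    return dict(grouped)


spans = [
    {"trace_id": "t1", "name": "checkout", "duration_ms": 1800},
    {"trace_id": "t2", "name": "checkout", "duration_ms": 120},
]
logs = [
    {"trace_id": "t1", "level": "error", "msg": "db timeout"},
    {"trace_id": "t2", "level": "info", "msg": "ok"},
    {"level": "info", "msg": "startup complete"},  # no trace context
]
suspects = logs_by_slow_trace(spans, logs, threshold_ms=1000)
```

Without a shared ID in both signals this join is impossible, which is why the correlation fields matter more than any individual tool.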


Instrumentation and tooling

Open standards and vendor-neutral tooling have matured—OpenTelemetry is the major standard for collecting traces, metrics, and logs.

OpenTelemetry (OTel)

  • Provides SDKs for automatic and manual instrumentation.
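  • Components: API (instrumentation interfaces), SDK (implementation: sampling, processing, exporters), Collector (standalone pipeline that receives, processes, and exports telemetry).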
  • Exports to backends: Prometheus, Jaeger, Tempo, Loki, vendor APMs.
  • Implements W3C Trace Context for interoperability.

Example OpenTelemetry architecture:

  • Application code uses OTel SDK to create spans, metrics, and logs.
  • SDK exports to an OTel Collector (agent or gateway).
  • Collector handles batching, sampling, enrichment, export to backend(s).

Metrics stack: Prometheus ecosystem

  • Prometheus scrapes metrics from instrumented endpoints (HTTP /metrics using Prometheus format).
  • PromQL allows flexible querying and alerting rules.
  • Long-term storage: Cortex, Thanos, Mimir for scale and durability.
  • Grafana: visualization and dashboards, integrates with Prometheus and other backends.

Logging stack

  • ELK (Elasticsearch, Logstash, Kibana) historically common.
  • Loki (Grafana Loki) offers log aggregation optimized for labels (lower cost), integrates with Prometheus labels.
  • Fluentd / Fluent Bit / Vector: ingestion and processing agents.
  • Structured logs (JSON) recommended for parsing and indexing.

Tracing stack

  • Instrumentation via OpenTelemetry or client libs (Jaeger, Zipkin).
  • Backends: Jaeger, Tempo, Lightstep, Honeycomb.
  • UI for trace visualization and latency waterfall charts.

Alerting & incident tools

  • Prometheus Alertmanager, Grafana Alerting, PagerDuty, Opsgenie, VictorOps.
  • Integrations for chatops (Slack, MS Teams), ticketing, phone/SMS.

Storage, scaling, and cost considerations

Observability data can be expensive. Understand where cost arises and strategies to manage it.

Key concerns:

  • Cardinality: number of unique label combinations for metrics or keys in logs/traces. High cardinality multiplies storage and query cost, often leading to performance problems.
  • Sampling: needed for traces; downsample logs/metrics or roll up metrics by reducing labels.
  • Retention: balance regulatory needs vs cost. Keep high-resolution recent data, aggregated older data (rollups).
  • Compression: use efficient storage formats (Parquet, columnar stores, compressed TSDB).
  • Rate limiting: protect collector/backends from burst floods.
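
A quick calculation shows why cardinality dominates cost. The worst-case number of time series for one metric is the product of the distinct values of each label. The label counts below are made-up numbers for illustration, and series_upper_bound is a hypothetical helper.

```python
from math import prod


def series_upper_bound(distinct_values_per_label):
    """Worst-case series count for one metric: product of distinct label values."""
    return prod(distinct_values_per_label.values())


low_cardinality = {"path": 20, "status": 5, "instance": 10}
with_user_id = dict(low_cardinality, user_id=100_000)

baseline = series_upper_bound(low_cardinality)  # 20 * 5 * 10 = 1000
exploded = series_upper_bound(with_user_id)     # 1000 * 100000 = 100000000
```

Real systems rarely hit the full product because not every combination occurs, but a single unbounded label is enough to move a metric from thousands of series to millions.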

Common patterns:

  • Use low-cardinality labels for metrics; avoid user_id, session_id labels.
  • Include identifiers in logs and traces instead of metric labels; link them via request_id.
  • Sample traces adaptively (retain more when errors or high latency).
  • Aggregate histograms on client or collector to reduce cardinality.

Alerting lifecycle and incident response

A robust alerting approach treats alerts as workflows, not mere messages.

Best practices:

  • Reduce noise: prefer SLO-based alerts for user impact rather than infrastructure-level churn.
  • Categorize alerts: P0/P1/P2 with defined response times and escalation.
  • Provide runbooks: each alert should include a runbook URL and diagnostic queries.
  • Route intelligently: route to the team owning the service or the owner on call.
  • Post-incident analysis: capture RCA, contributing factors, and remediation; track SLO implications and follow-up action items.
  • Automation: use auto-remediation for known recurrent issues (but with care: avoid masking root causes).

SLOs and error budgets:

  • Define SLI: metric representing service health (e.g., request success rate).
  • Define SLO: target for SLI over a window (e.g., 99.9% over 30 days).
  • Error budget: 1 - SLO; when exhausted, trigger mitigation like feature freezes or capacity changes.
  • Use error budget burn alerts instead of raw error counts to align engineering priorities.
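
The arithmetic behind error budgets and burn rates is short enough to write out. A 99.9% SLO over 30 days leaves 0.1% of 43,200 minutes, or about 43.2 minutes of full unavailability. The burn rate is the observed error ratio divided by the budget ratio; a burn rate of 1 spends the budget exactly over the window. The helper names below are illustrative.

```python
import math


def error_budget_minutes(slo, window_days):
    """Minutes of total unavailability the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(observed_error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return observed_error_ratio / (1 - slo)


def hours_to_exhaustion(rate, window_days):
    """Time until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return math.inf
    return window_days * 24 / rate


budget = error_budget_minutes(0.999, 30)   # about 43.2 minutes
rate = burn_rate(0.0144, 0.999)            # about 14.4
remaining = hours_to_exhaustion(rate, 30)  # about 50 hours
```

A burn-rate alert pages on the rate rather than on a raw error count, so a brief blip that barely dents the budget stays quiet while a sustained 1.44% error ratio pages quickly.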

Practical examples and code snippets

Below are concrete examples to illustrate how to instrument and use the signals.

Example 1: Structured logging (Python)

Write structured JSON logs with context (request_id). Use Python logging and structlog.

```python
import logging
import time
import uuid

import structlog

# Render each event as one JSON object per line with an ISO-8601 UTC timestamp.
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.make_filtering_bound_logger(logging.INFO),
)


def handle_request(user_id):
    request_id = str(uuid.uuid4())
    start = time.time()
    # Bind context once, before the try block, so the error path can use it too.
    log = structlog.get_logger().bind(request_id=request_id, user_id=user_id)
    try:
        log.info("request.start", path="/checkout", method="POST")
        # do work...
        log.info("db.query", query="SELECT ...", duration_ms=12)
        # success
        log.info("request.end", status=200, duration_ms=int((time.time() - start) * 1000))
    except Exception as e:
        log.error("request.error", error=str(e))
        raise


if __name__ == "__main__":
    handle_request(user_id="user-42")
```

Example log line (JSON):

```json
{
  "event": "request.start",
  "request_id": "4a5f0c2e-...",
  "user_id": "user-42",
  "path": "/checkout",
  "method": "POST",
  "timestamp": "2026-05-13T12:00:00Z"
}
```

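To correlate signals, attach the same request_id to trace spans as an attribute. Keep it out of metric labels, where one value per request would explode cardinality.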

Example 2: Prometheus metrics (Python client)

Expose metrics for Prometheus scraping.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter('myapp_requests_total', 'Total HTTP requests', ['path', 'status'])
REQUEST_LATENCY = Histogram('myapp_request_latency_seconds', 'Request latency', ['path'])


def handle_request(path):
    start = time.time()
    try:
        # simulate processing
        time.sleep(random.random() * 0.2)
        status = '200'
    except Exception:
        status = '500'
    finally:
        duration = time.time() - start
        REQUESTS.labels(path=path, status=status).inc()
        REQUEST_LATENCY.labels(path=path).observe(duration)


if __name__ == "__main__":
    start_http_server(8000)
    while True:
        handle_request("/checkout")
```

Alert example (Prometheus alert rule):

```yaml
groups:
- name: service.rules
  rules:
  - alert: HighRequestLatency
    expr: histogram_quantile(0.95, sum(rate(myapp_request_latency_seconds_bucket[5m])) by (le)) > 1
    for: 2m
    labels:
      severity: page
    annotations:
      summary: "p95 latency > 1s for myapp"
      description: "Investigate slow requests"
```

Example 3: Tracing (OpenTelemetry, Python)

Create spans and propagate context.

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

trace.set_tracer_provider(TracerProvider())
# Export over OTLP/gRPC, which Jaeger and Tempo both ingest; the dedicated
# Jaeger Thrift exporter for Python is deprecated. The endpoint assumes a
# local collector or Jaeger instance listening on the default OTLP gRPC port.
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

tracer = trace.get_tracer(__name__)


def process_payment(amount):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", amount)
        # call downstream service; propagate trace context over HTTP headers
        # ...
```

Span example (concept):

  • Trace ID: 4bf92f3577b34da6a3ce929d0e0e4736
  • Span:
    • span_id: 00f067aa0ba902b7
    • name: "http.server"
    • start: timestamp
    • duration: 150ms
    • attributes: {http.method: GET, http.status_code: 200}
    • parent: parent span ID (empty for the root span)

Example 4: Loki query

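Assume each log line is JSON with a request_id field, parsed at query time rather than indexed as a Loki label (request IDs are too high-cardinality for labels). Query recent errors for a request: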

Log line:

```json
{"level":"error","msg":"db timeout","request_id":"4a5f0c2e", "service":"payments"}
```

LogQL:

```logql
{service="payments"} |= "db timeout" | json | request_id="4a5f0c2e"
```

This can be linked from a trace UI: click request_id to open logs in Loki.
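
For readers new to LogQL, the stages of that query map onto three familiar steps: a substring filter on the raw line, JSON parsing, and an equality filter on an extracted field. The Python sketch below mimics those stages over lines assumed to come from the payments stream already; it is an analogy for reading the query, not a description of how Loki executes it, and filter_logs is a hypothetical helper.

```python
import json


def filter_logs(lines, contains, **field_equals):
    """Substring filter, then JSON parse, then field equality filters."""
    matches = []
    for line in lines:
        if contains not in line:  # like: |= "db timeout"
            continue
        try:
            record = json.loads(line)  # like: | json
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        # like: | request_id="4a5f0c2e"
        if all(record.get(key) == value for key, value in field_equals.items()):
            matches.append(record)
    return matches


payments_lines = [
    '{"level":"error","msg":"db timeout","request_id":"4a5f0c2e", "service":"payments"}',
    '{"level":"error","msg":"db timeout","request_id":"ffff0000", "service":"payments"}',
    'unstructured line mentioning db timeout',
    '{"level":"info","msg":"ok","request_id":"4a5f0c2e", "service":"payments"}',
]
hits = filter_logs(payments_lines, "db timeout", request_id="4a5f0c2e")
```

Putting the cheap substring filter first matters in both settings: it discards most lines before any parsing happens.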


Best practices and anti-patterns

Best practices:

  • Instrument early and consistently with OpenTelemetry compatible libs.
  • Use structured logs (JSON + fields) and include request_id and trace_id.
  • Keep metrics low-cardinality; prefer labels like endpoint, status_code, instance_role.
  • Use histograms for latency distributions and avoid client-side quantile summaries at scale.
  • Implement correlation: include trace_id in logs and propagate request_id across service boundaries; link metrics to traces with exemplars rather than ID labels.
  • Implement SLOs and SLI measurement; align alerts to SLOs.
  • Enforce sampling strategies for traces: preserve all error traces, sample successful traces.
  • Centralize pipelines with a collector: apply enrichment, filtering, redaction, and sampling at the collector.
  • Redact secrets and PII before exporting telemetry; use collector processors and privacy-aware transformations.
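
As an illustration of the redaction practice, the sketch below masks values under sensitive keys and scrubs email addresses from string fields before a record leaves the process. The key list and the email pattern are deliberately simple assumptions; a production policy would need a reviewed field inventory rather than this short list.

```python
import re

# Hypothetical deny-list of field names whose values must never be exported.
SENSITIVE_KEYS = {"password", "authorization", "card_number", "token"}
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(record):
    """Return a copy of a log record with sensitive values masked."""
    clean = {}
    for key, value in record.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)  # recurse into nested context
        elif isinstance(value, str):
            clean[key] = EMAIL.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean


raw = {
    "event": "login.failed",
    "password": "hunter2",
    "msg": "user a@example.com locked out",
    "ctx": {"token": "abc123", "attempts": 3},
}
safe = redact(raw)
```

Applying the same function in a central collector, rather than in each service, keeps the policy in one place.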

Anti-patterns:

  • Logging everything verbatim without structure—hard to query.
  • Adding high-cardinality labels to metrics (e.g., user_id) causing explosion.
  • Alerting on noisy infrastructure metrics (e.g., CPU spikes that do not affect users).
  • Sending raw logs/traces with sensitive data to untrusted third-party backends.
  • Not having runbooks or playbooks for alerts.

Current state of the ecosystem

  • OpenTelemetry is the dominant open standard for in-app instrumentation across traces, metrics, and logs; supported by major vendors.
  • Prometheus is the de facto standard for scraping metrics at scale; Grafana for visualization.
  • Jaeger and Tempo are popular open-source tracing backends; commercial APMs offer additional ML-driven root cause analysis.
  • Loki offers cost-effective logs for teams already using Grafana/Prometheus label paradigms.
  • Cloud providers provide integrated observability platforms (AWS CloudWatch/Observability, GCP Cloud Operations, Azure Monitor).
  • The industry is moving toward more integrated tooling and vendor-neutral ingestion with the OTel Collector.

Future directions: AI, predictive observability, standards

  • AI/ML in observability:
    • Anomaly detection: model-based detection for unknown failure modes.
    • Automated root cause analysis: correlate signals and propose probable causes.
    • Auto-generated runbooks and remediation suggestions.
  • Standardization and vendor neutrality:
    • Wider adoption of OpenTelemetry and W3C Trace Context.
    • Interchange formats and long-term storage formats (OTLP, Parquet-based archives).
  • Observability for new architectures:
    • Edge, IoT, serverless: require lightweight collectors, sampling, and offline buffering.
    • Ephemeral environments such as containers and short-lived compute require context propagation, plus short retention of raw data combined with enriched rollups.
  • Privacy-respecting telemetry:
    • Built-in scrubbing, transformation policies, and consent-aware telemetry collection.

Example incident walkthrough (practical narrative)

Scenario: Customers report slow checkout.

  1. Monitoring: An SLO burn-rate alert fires because checkout p99 latency exceeds its SLO target.
  2. Dashboard: Metrics show increased p99 latency; error rate slightly up.
  3. Traces: Look at traces for slow requests; identify that external payment gateway calls show high latency in a particular region.
  4. Logs: Using trace_ids, retrieve logs that show timeouts and a pattern of retries triggered by an expired cached token.
  5. Root cause: Token refresh logic triggers on expired tokens, resulting in synchronous external calls causing tail latency.
  6. Mitigation: Apply temporary rate-limiting and circuit breaker to the payment gateway calls; implement asynchronous token refresh.
  7. Postmortem: Document findings, change instrumentation to capture token expiry time as metric, set new alerting threshold for token refresh failures, and add unit/integration tests.

This example demonstrates the full chain: alert -> metrics -> traces -> logs -> remediation.


Checklist: Implementing robust observability

  • Instrument metrics for SLIs (success rate, latency, availability).
  • Add structured logging and include request_id and trace_id.
  • Use OpenTelemetry for traces, and ensure context propagation across services.
  • Implement Prometheus metrics with low-cardinality labels.
  • Add histograms for latency and measures for error rates.
  • Configure SLOs and use SLO-driven alerting for paging.
  • Create runbooks for critical alerts and integrate with on-call systems.
  • Implement adaptive trace sampling (capture all failures).
  • Centralize collectors to enforce redaction and sampling policies.
  • Monitor observability pipeline health (collector status, ingestion rates).
  • Regularly review alerts and reduce noise (alert fatigue).

Closing summary

Observability is essential for operating modern distributed systems. Metrics tell you something is wrong, traces show where to look, logs explain why, and alerts coordinate response. Effective observability requires thoughtful instrumentation, sensible data hygiene (cardinality, sampling, retention), and a culture of SLO-driven reliability. With standards like OpenTelemetry and mature tooling across metrics, logs, and tracing, organizations have the building blocks to turn telemetry into reliable, actionable insight. The future will bring AI assistance, predictive observability, and more standardized, privacy-aware telemetry pipelines—shifting the emphasis from collection to intelligent interpretation.
