Observability Explained: Logs, Metrics, Traces, and Alerts
Observability is the capability to infer the internal state of a complex system from external outputs. In modern software systems—distributed, dynamic, and often ephemeral—observability is essential for understanding behavior, diagnosing issues, ensuring reliability, and improving performance. This article provides a deep, practical, and theoretically grounded exploration of observability with a focus on the four primary telemetry signals: logs, metrics, traces, and alerts. You’ll find history, core concepts, instrumentation patterns, storage and processing concerns, practical examples, best practices, key tools, and forward-looking trends.
Table of contents
- Origins and theory of observability
- Observability vs monitoring: differences and interplay
- The four telemetry signals
  - Metrics
  - Logs
  - Traces (distributed tracing)
  - Alerts (and SLO-driven alerting)
- How the signals relate and should be used together
- Instrumentation and tooling (OpenTelemetry, Prometheus, ELK/Loki, Jaeger/Tempo)
- Storage, scaling, and cost considerations (cardinality, sampling, retention)
- Alerting lifecycle and incident response
- Practical examples and code snippets
  - Instrumenting a microservice (Python examples)
  - Prometheus rules and PromQL examples
  - Sample structured log and Loki query
  - Tracing span example and Jaeger/Tempo concepts
- Best practices and anti-patterns
- Current state of the ecosystem
- Future directions: AI, predictive observability, standards
- Summary checklist
Origins and theory of observability
Observability is a concept borrowed from control theory. In control systems, a system is observable if one can deduce its internal state from outputs (measurements) over time. For software systems, the “internal state” might include application variables, queue lengths, dependency health, transaction contexts, and resource usage. Telemetry provides measurable outputs; observability is the property that these outputs enable accurate reasoning about system behavior.
Key theoretical points:
- Observability is not only about collecting data; it’s about collecting the right data (high signal-to-noise ratio).
- Observability enables state reconstruction and causal inference: you want to know why something happened, not just that it did.
- Good observability supports hypothesis-driven debugging: form a hypothesis, gather data, test and refine.
Historically, monitoring focused on black-box symptoms—up or down, basic metrics—whereas observability emphasizes instrumentation and context to answer arbitrary questions, even ones not anticipated when the system was built.
Observability vs monitoring: differences and interplay
- Monitoring: The practice of collecting predefined signals and raising alerts based on known failure modes. Typically goal-oriented (e.g., alert if CPU > 80% for 5m). Designed for detection.
- Observability: The ability to explore and understand unknown unknowns. It emphasizes high-cardinality, high-dimensional telemetry, context propagation, and ad-hoc investigation. Designed for diagnosis and root cause analysis.
They complement each other: monitoring detects problems quickly; observability gives the ability to drill down, diagnose, and fix the problems.
The four telemetry signals
The telemetry signals are often described as “pillars” but are more accurately complementary signal types. Here we break them down and describe what unique value each brings.
Metrics
Metrics are numeric measurements sampled over time.
- Common types:
  - Counter: monotonically increasing value (e.g., total requests)
  - Gauge: snapshot value that can go up or down (e.g., memory usage, number of items in queue)
  - Histogram: distribution of observed values across buckets; percentiles such as p99 latency are estimated from the buckets at query time and can be aggregated across instances
  - Summary: quantiles computed inside each client process (less favored at scale because precomputed quantiles cannot be meaningfully aggregated across instances)
- Properties:
  - Low cardinality is desirable for storage and query performance.
  - Great for dashboards, alerting, long-term trends, SLOs.
  - Compact, efficient to store and query.
- Time-series DBs: Prometheus, InfluxDB, VictoriaMetrics, Cortex, Thanos.
- Use cases:
  - SLIs (e.g., request success rate, p99 latency)
  - Capacity planning
  - Anomaly detection on aggregates
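To make the histogram type more concrete, the following is a minimal sketch of how bucket counts can yield an approximate percentile, similar in spirit to what a query-time quantile function does. The `Histogram` class and the bucket bounds are hypothetical and written for illustration only; they are not the API of any real client library.

```python
import bisect
import math


class Histogram:
    """Toy bucketed histogram (hypothetical, for illustration only)."""

    def __init__(self, bounds):
        # Upper bounds of each bucket, plus a catch-all +Inf bucket.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, non-cumulative
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        # bisect_left returns the first bucket whose upper bound is >= value.
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        self.total += 1
        self.sum += value

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs."""
        out, running = [], 0
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out.append((bound, running))
        return out

    def quantile(self, q):
        """Estimate a quantile by linear interpolation inside one bucket."""
        if self.total == 0:
            return math.nan
        rank = q * self.total
        prev_bound, prev_cum = 0.0, 0
        for bound, cum in self.cumulative():
            if cum >= rank:
                bucket_count = cum - prev_cum
                # No interpolation is possible in +Inf or empty buckets.
                if math.isinf(bound) or bucket_count == 0:
                    return prev_bound
                fraction = (rank - prev_cum) / bucket_count
                return prev_bound + (bound - prev_bound) * fraction
            prev_bound, prev_cum = bound, cum
        return prev_bound


latency = Histogram([0.1, 0.5, 1.0])  # assumed bucket bounds, in seconds
for seconds in (0.05, 0.2, 0.3, 0.7, 2.0):
    latency.observe(seconds)
print(latency.cumulative())
print(latency.quantile(0.5))
```

The estimate is only as precise as the bucket layout: every value inside a bucket is assumed to be evenly spread, so bucket bounds should bracket the latencies you care about (for example, the SLO threshold).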
Logs
Logs are timestamped event records, either plain text or semi-structured (JSON), often associated with a specific component or transaction.
- Properties:
  - High cardinality, rich text context.
  - Best for detailed forensic analysis and investigating discrete events.
  - Can contain stack traces, user IDs, request payloads (watch for PII).
- Storage: centralized systems like ELK (Elasticsearch), Loki, Splunk, or cloud offerings.
- Query languages vary (Lucene, LogQL).
- Use cases:
  - Error investigations, root cause analysis
  - Audit trails
  - Security incident forensics
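As an illustration of structured logging, here is a small sketch that emits one JSON object per log line using only the Python standard library. The field names (`trace_id`, `span_id`, `service`), the logger name, and the sample trace ID are assumptions chosen for this example, not a required schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for key in ("trace_id", "span_id", "service"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info(
    "payment authorized",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "service": "checkout"},
)
print(stream.getvalue(), end="")
```

Because every line is valid JSON with stable keys, a log pipeline can parse fields without fragile regular expressions, and the `trace_id` field gives a direct pivot to the matching trace.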
Traces (Distributed Tracing)
Traces represent the lifecycle of a request (or transaction) as it flows through a distributed system, composed of spans (operations).
- Span: an operation with start time, duration, name, attributes, and parent-child relationship.
- Trace: a tree/DAG of spans representing an end-to-end request.
- Context propagation: trace and span IDs passed along with each network call, typically in request headers (W3C Trace Context).
- Tools: Jaeger, Zipkin, Tempo, Lightstep, Datadog APM.
- Sampling is often necessary due to high volume; adaptive sampling strategies exist.
- Use cases:
  - Latency hotspots and bottleneck identification
  - Causal analysis across services
  - Visualizing end-to-end request flow
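The span and context-propagation ideas above can be sketched in a few lines. This is a simplified model, not a tracing SDK: the `Span` dataclass and helper functions are hypothetical, and the header handling covers only the basic `traceparent` layout from W3C Trace Context (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) without full validation.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One operation within a trace (illustrative model only)."""
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str] = None
    start: float = field(default_factory=time.time)
    duration: Optional[float] = None
    attributes: dict = field(default_factory=dict)

    def finish(self):
        self.duration = time.time() - self.start


def start_span(name, parent=None):
    """Start a root span, or a child span that inherits the parent's trace ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


def to_traceparent(span, sampled=True):
    """Serialize the span context into a traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{span.trace_id}-{span.span_id}-{flags}"


def from_traceparent(header):
    """Parse a traceparent value into (trace_id, parent_span_id, sampled)."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent header")
    return trace_id, parent_id, (int(flags, 16) & 1) == 1


root = start_span("GET /checkout")       # hypothetical operation names
child = start_span("db.query", parent=root)
header = to_traceparent(child)           # sent with the outgoing request
child.finish()
root.finish()
print(header)
```

The receiving service would parse the header and create its own spans with the same trace ID, which is what lets a backend stitch spans from different services into one trace.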
Alerts
Alerts are actionable notifications raised when telemetry data violates a defined condition.
- Two main types:
  - Threshold-based alerts: static conditions on metrics or error counts (e.g., CPU > 80% for 5m).
  - SLO-driven alerts: fire on the error-budget burn rate rather than on a raw threshold (more resilient to noise).
- Alert lifecycle: detection -> deduplication -> routing -> escalation -> on-call -> runbook -> resolution -> postmortem.
- Platforms: Prometheus Alertmanager, Grafana Alerting, PagerDuty, Opsgenie.
- Best practice: treat every alert as an interruption to a human; reduce alert fatigue by keeping only high-signal, actionable alerts (preferably SLO-based).
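To show what alerting on a burn rate means in practice, here is a small worked sketch. The burn rate is the observed error rate divided by the error budget (1 minus the SLO target), so a value of 1.0 consumes the budget exactly over the SLO period. The function names are hypothetical, and the default threshold of 14.4 is an assumption borrowed from commonly cited multi-window guidance for a 30-day SLO; tune it to your own budget policy.

```python
def burn_rate(errors, total, slo_target):
    """Return how fast the error budget is being consumed.

    1.0 means the budget would be used up exactly at the end of the SLO period.
    """
    if total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only when both windows burn faster than the threshold.

    Each window is an (errors, total) tuple. Requiring the short window too
    helps the alert stop firing soon after the problem is fixed.
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )


SLO = 0.999  # assumed availability target
print(burn_rate(20, 1000, SLO))                 # 2% errors vs a 0.1% budget
print(should_page((20, 1000), (3, 100), SLO))   # both windows burning fast
print(should_page((20, 1000), (0, 100), SLO))   # short window has recovered
```

A 2% error rate against a 99.9% target burns the budget roughly 20 times faster than sustainable, which is why it pages even though 2% might look small on a dashboard.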
How the signals relate and should be used together
- Metrics detect trends and trigger alerts; they are your first notification system.
- Traces show the end-to-end path for problematic requests flagged by metrics.
- Logs provide the fine-grained context and payload for a particular trace/span or metric anomaly.
- Alerts coordinate human action and link to dashboards, runbooks, traces, and logs.
Example workflow:
- Metric alert: p95 latency for service X > 1s.
- Look at trace samples for recent slow requests—identify a dependency or database query with high tail latency.
- Inspect logs for the trace IDs to find errors, stack traces, or explanatory messages.
- Fix, validate via metrics, and update runbook/alert conditions.
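Correlation is vital: include trace IDs and request IDs as fields in logs and as attributes on spans so you can pivot between signals. Keep them out of metric labels, where per-request values explode cardinality; where the backend supports it, attach trace IDs to metrics as exemplars instead.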
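The pivot from slow traces to their logs can be sketched as a simple join on the trace ID. In practice the observability backend performs this lookup; the record shapes and the `logs_for_slow_traces` helper below are assumptions made purely for illustration.

```python
def logs_for_slow_traces(traces, logs, threshold_s=1.0):
    """Group log entries by trace ID for traces slower than the threshold."""
    slow_ids = {t["trace_id"] for t in traces if t["duration_s"] > threshold_s}
    grouped = {trace_id: [] for trace_id in slow_ids}
    for entry in logs:
        trace_id = entry.get("trace_id")
        if trace_id in grouped:
            grouped[trace_id].append(entry)
    return grouped


# Hypothetical sample data standing in for trace and log backends.
traces = [
    {"trace_id": "aaa", "duration_s": 0.2},
    {"trace_id": "bbb", "duration_s": 1.8},
]
logs = [
    {"trace_id": "aaa", "level": "INFO", "message": "order created"},
    {"trace_id": "bbb", "level": "ERROR", "message": "db timeout after 1500ms"},
    {"level": "INFO", "message": "health check ok"},  # no trace context
]

for trace_id, entries in logs_for_slow_traces(traces, logs).items():
    for entry in entries:
        print(trace_id, entry["level"], entry["message"])
```

The join only works if the log entries carry the trace ID in the first place, which is why consistent context propagation matters more than any particular tool.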
Instrumentation and tooling
Open standards and vendor-neutral tooling have matured. OpenTelemetry is now the most widely adopted standard for generating and collecting traces, metrics, and logs.
OpenTelemetry (OTel)
- Provides SDKs for automatic and manual instrumentation.
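- Components: API (the interfaces that instrumentation code calls), SDK (the implementation: sampling, processing, and exporters), Collector (a standalone pipeline of receivers, processors, and exporters).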
- Exports to backends: Prometheus, Jaeger, Tempo, Loki, vendor APMs.
- Implements W3C Trace Context for interoperability.
Example OpenTelemetry architecture:
- Application code uses OTel SDK to create spans, metrics, and logs.
- SDK exports to an OTel Collector (agent or gateway).
- Collector handles batching, sampling, enrichment, export to backend(s).
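The batching and enrichment role of a collector can be illustrated with a toy pipeline. This is a conceptual sketch only: the `BatchingPipeline` class is hypothetical and does not reflect the OpenTelemetry Collector's actual API or configuration format.

```python
class BatchingPipeline:
    """Toy receive -> enrich -> batch -> export pipeline (illustrative only)."""

    def __init__(self, exporter, max_batch=3, static_attrs=None):
        self.exporter = exporter              # callable that receives a list
        self.max_batch = max_batch
        self.static_attrs = static_attrs or {}
        self.buffer = []

    def receive(self, record):
        # Enrichment: add attributes the application does not know about,
        # such as the deployment environment.
        self.buffer.append({**record, **self.static_attrs})
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        # Export whatever is buffered as a single batch, then reset.
        if self.buffer:
            self.exporter(list(self.buffer))
            self.buffer.clear()


exported = []  # stands in for a real backend
pipeline = BatchingPipeline(exported.append, max_batch=2,
                            static_attrs={"env": "dev"})
for name in ("span-a", "span-b", "span-c"):
    pipeline.receive({"name": name})
pipeline.flush()  # flush the remainder, e.g. on shutdown or on a timer
print(exported)
```

Batching trades a little latency for far fewer network calls to the backend, and doing enrichment centrally keeps deployment-specific details out of application code.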
Metrics stack: Prometheus ecosystem
- Prometheus scrapes metrics over HTTP from instrumented endpoints (conventionally /metrics, in the Prometheus text exposition format).
- PromQL allows flexible querying and alerting rules.
- Long-term storage: Cortex, Thanos, Mimir for scale and durability.
- Grafana: visualization and dashboards, integrates with Prometheus and other backends.
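For a sense of what a scrape actually returns, the sketch below renders a metric in the Prometheus text exposition format using plain string formatting. The `render_metric` helper and the metric name are hypothetical, and the sketch skips label-value escaping; a real service would normally rely on a client library instead.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    Note: label values are not escaped here, which a real exporter must do.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(body, end="")
```

Each distinct label combination becomes its own time series on the server, which is the mechanism behind the cardinality concerns discussed elsewhere in this article.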
Logging stack
- ELK (Elasticsearch, Logstash, Kibana) historically common.
- Loki (Grafana Loki) indexes only a small set of labels rather than the full log text, which lowers storage cost; its label model mirrors Prometheus labels.
- Fluentd / Fluent Bit / Vector: ingestion and processing agents.
- Structured logs (JSON) recommended for parsing and indexing.
Tracing stack
- Instrumentation via OpenTelemetry or client libs (Jaeger, Zipkin).
- Backends: Jaeger, Tempo, Lightstep, Honeycomb.
- UI for trace visualization and latency waterfall charts.
Alerting & incident tools
- Prometheus Alertmanager, Grafana Alerting, PagerDuty, Opsgenie, VictorOps (now Splunk On-Call).
- Integrations for chatops (Slack, MS Teams), ticketing, phone/SMS.
Storage, scaling, and cost considerations
Observability data can be expensive. Understand where cost arises and strategies to manage it.
Key concerns:
- Cardinality: number of unique label combinations for metrics or keys in logs/traces. High cardinality multiplies storage and query cost, often leading to performance problems.
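- Sampling: usually required for traces because of volume; for logs and metrics, reduce volume by sampling or dropping low-value logs and by rolling up metrics to fewer labels or coarser resolution.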
- Retention: balance regulatory needs vs cost. Keep high-resolution recent data, aggregated older data (rollups).
- Compression: use efficient storage formats (Parquet, columnar stores, compressed TSDB).
- Rate limiting: protect collector/backends from burst floods.
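The multiplying effect of cardinality is easy to underestimate, so a quick calculation helps. The sketch below computes an upper bound on the number of time series for one metric as the product of distinct values per label; the label names and counts are made-up numbers for illustration.

```python
import math


def series_upper_bound(label_value_counts):
    """Upper bound on time series: product of distinct values per label.

    Real systems usually see fewer series, since not every combination occurs.
    """
    return math.prod(label_value_counts.values())


# Hypothetical label sets for a single request-count metric.
safe_labels = {"method": 5, "status": 6, "endpoint": 20}
risky_labels = {**safe_labels, "user_id": 10_000}

print(series_upper_bound(safe_labels))   # 5 * 6 * 20
print(series_upper_bound(risky_labels))  # the same, times 10,000 users
```

Adding one unbounded label turns a few hundred series into millions, which is why per-user or per-request identifiers belong in logs and traces rather than in metric labels.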
Common patterns:
- Use low-cardinality labels for metrics; avoid labels such as user_id or session_id.
- Include identifiers in logs and traces instead of metric labels; link them via request_id.
- Sample traces adaptively (retain more when errors or high latency).
- Aggregate histograms on client or collector to reduce cardinality.
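One way to picture adaptive trace sampling is as a decision made once a trace is complete: always keep errors and slow requests, and keep only a small fraction of the rest. The sketch below assumes that approach; the `keep_trace` function, its thresholds, and the 1% base rate are illustrative assumptions rather than settings from any particular tool.

```python
import hashlib


def keep_trace(trace_id, has_error, duration_s,
               slow_threshold_s=1.0, base_rate=0.01):
    """Decide whether to retain a finished trace.

    Errors and slow traces are always kept. Other traces are kept with
    probability `base_rate`, using a hash of the trace ID so that every
    collector instance reaches the same decision for the same trace.
    """
    if has_error or duration_s >= slow_threshold_s:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return bucket < base_rate


print(keep_trace("trace-1", has_error=True, duration_s=0.05))   # kept: error
print(keep_trace("trace-2", has_error=False, duration_s=2.3))   # kept: slow
print(keep_trace("trace-3", has_error=False, duration_s=0.05))  # depends on hash
```

Hashing the trace ID instead of calling a random number generator keeps the decision consistent across processes, so a trace is never half-retained.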
Alerting lifecycle and incident response
A robust alerting approach treats alerts as workflows, not mere messages.
Best ...