What Is Observability? The Three Pillars Explained
When your system goes down, you ask "what happened?" Monitoring tells you "something broke." Observability answers "why it broke, where it broke, and how to fix it."
Monitoring vs Observability
| Feature | Monitoring | Observability |
|---------|-----------|--------------|
| Focus | Track known issues | Discover unknown issues |
| Approach | Dashboards + alerts | Querying + exploration |
| Question | "Is the system running?" | "Why isn't it running?" |
Three Pillars
1. Metrics
Numerical measurements stored as time-series data:
http_requests_total{method="GET", status="200"} 15234
http_request_duration_seconds{quantile="0.99"} 0.250
cpu_usage_percent 45.2
- RED Method (request-oriented): Rate, Errors, Duration
- USE Method (resource-oriented): Utilization, Saturation, Errors
Tools: Prometheus, Grafana, Datadog, CloudWatch
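To make the RED method concrete, here is a minimal sketch that derives Rate, Errors, and Duration from raw request records for a single time window. The `Request` type, the `red_summary` function, and the choice to count only 5xx responses as errors are assumptions made for illustration; in practice a metrics backend such as Prometheus would compute these from counters and histograms.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time spent serving the request, in seconds


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration (p99) over one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p99_s": 0.0}
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank percentile: the smallest sample with at least
    # 99% of all samples at or below it.
    rank = math.ceil(99 * len(durations) / 100)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p99_s": durations[rank - 1],
    }


# 97 fast successes and 3 slow failures observed in a 10-second window.
sample = [Request(200, 0.020)] * 97 + [Request(500, 0.250)] * 3
summary = red_summary(sample, window_s=10.0)
print(summary)
```

In this sample the p99 lands on the slow failing requests, which is the kind of tail behavior an average would hide.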
2. Logs
Text-based records of events. Structured (JSON) logs are preferred:
{
  "timestamp": "2026-02-14T21:30:00Z",
  "level": "error",
  "service": "payment-service",
  "traceId": "abc-123",
  "message": "Payment failed",
  "userId": 42,
  "error": "Insufficient funds"
}
Tools: ELK Stack, Loki, Fluentd
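One way to produce a log entry shaped like the one above is a custom formatter on Python's standard `logging` module. This is a sketch under assumptions: the `JsonFormatter` class is hypothetical, the logger name is reused as the `service` field, and an in-memory stream stands in for stdout or a log shipper.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": record.name,  # assumption: logger name == service name
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("traceId", "userId", "error"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's output out of the root logger

logger.error(
    "Payment failed",
    extra={"traceId": "abc-123", "userId": 42, "error": "Insufficient funds"},
)
parsed = json.loads(stream.getvalue())
print(parsed)
```

Because every entry carries a `traceId`, a log query can be joined with the trace for the same request.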
3. Traces
A trace follows a single request end-to-end through the system. This is critical in distributed systems, where one request crosses several services:
[Client] → [API Gateway] → [Order Service] → [Payment Service]
0ms        5ms             15ms              45ms
Tools: Jaeger, Zipkin, OpenTelemetry, Datadog APM
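The idea behind a trace can be sketched with a small in-process tracer: every span shares one trace ID, records its parent, and measures its own duration. The `Tracer` and `Span` classes below are hypothetical and exist only for illustration; real tracers such as Jaeger or an OpenTelemetry SDK also propagate the trace ID across network calls and export spans to a backend.

```python
from __future__ import annotations

import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Span:
    trace_id: str          # shared by every span in one request
    name: str              # the service or operation being timed
    parent: Optional[str]  # name of the calling span, None for the root
    start_s: float
    end_s: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end_s - self.start_s) * 1000


class Tracer:
    """Hypothetical in-process tracer that keeps finished spans in memory."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.finished: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        parent = self._stack[-1].name if self._stack else None
        current = Span(self.trace_id, name, parent, time.perf_counter())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end_s = time.perf_counter()
            self._stack.pop()
            self.finished.append(current)


tracer = Tracer()
with tracer.span("api-gateway"):
    with tracer.span("order-service"):
        with tracer.span("payment-service"):
            time.sleep(0.01)  # stand-in for real work

for s in tracer.finished:
    print(f"{s.name:<16} parent={s.parent} {s.duration_ms:.1f}ms")
```

Spans finish innermost first, so the output lists `payment-service` before its callers. Comparing each span's duration with its children's shows where the time went.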
OpenTelemetry
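OpenTelemetry is a vendor-neutral open standard for generating, collecting, and exporting observability data (metrics, logs, traces). It reduces vendor lock-in by providing a common set of APIs and SDKs for all three signals, so the instrumentation in your code does not have to change when the backend does.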
Observability Tools
| Tool | Type | Strength |
|------|------|----------|
| Prometheus | Metrics | Open source, powerful queries |
| Grafana | Visualization | Multi-source dashboards |
| Jaeger | Tracing | Distributed tracing |
| ELK Stack | Logging | Full-text search |
| Datadog | All-in-one | Integrated solution |
Alerting Best Practices
- Only create actionable alerts
- Avoid alert fatigue
- Define severity levels (P1-P4)
- Prepare runbooks
- Implement on-call rotation
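Several of these practices can be encoded in the alert rule itself. The sketch below is an illustration under assumptions, not a real alerting API: the `AlertRule` type, the `evaluate` function, the thresholds, and the runbook path are all hypothetical. The rule fires only after the error ratio stays above a threshold for several consecutive windows, which filters out one-off spikes, and every alert carries a severity and a runbook pointer.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    name: str
    threshold: float  # error ratio above which a window counts as breached
    for_windows: int  # consecutive breached windows required (at least 1)
    severity: str     # "P1" (most urgent) through "P4"
    runbook: str      # where the responder finds the remediation steps


def evaluate(rule: AlertRule, error_ratios: list[float]) -> Optional[str]:
    """Return an alert message if the most recent windows all breach the rule."""
    recent = error_ratios[-rule.for_windows:]
    if len(recent) < rule.for_windows:
        return None  # not enough history yet
    if all(ratio > rule.threshold for ratio in recent):
        return f"[{rule.severity}] {rule.name} (runbook: {rule.runbook})"
    return None


rule = AlertRule(
    name="PaymentErrorRateHigh",
    threshold=0.05,
    for_windows=3,
    severity="P2",
    runbook="runbooks/payment-errors.md",  # hypothetical path
)

print(evaluate(rule, [0.01, 0.20, 0.02]))        # a single spike
print(evaluate(rule, [0.01, 0.06, 0.08, 0.10]))  # a sustained breach
```

Requiring a sustained breach trades a slightly slower page for fewer false alarms, which is one common way to limit alert fatigue.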
Conclusion
Observability is key to debugging and performance optimization in modern distributed systems: monitor with metrics, understand with logs, and find bottlenecks with traces.