What Is Monitoring & Alerting? Observability Guide
When something goes wrong, do you find out first or do your customers? Monitoring watches your system around the clock, and alerting notifies you when something needs attention, ideally before users notice.
The Three Pillars
1. Metrics
Numerical values stored as time series:
http_requests_total{status="200"} 15234
http_request_duration_seconds{quantile="0.99"} 0.245
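To make the sample format above concrete, here is a minimal sketch of how a labeled sample can be rendered as one line of text. `formatSample` is a hypothetical helper written for this article, not part of any Prometheus client library, and it assumes label values need no escaping.

```javascript
// Hypothetical helper: renders one sample as name{label="value",...} value.
// Assumes label values contain no quotes, backslashes, or newlines.
function formatSample(name, labels, value) {
  const pairs = Object.entries(labels)
    .map(([key, val]) => `${key}="${val}"`)
    .join(',');
  return pairs ? `${name}{${pairs}} ${value}` : `${name} ${value}`;
}

console.log(formatSample('http_requests_total', { status: '200' }, 15234));
// prints: http_requests_total{status="200"} 15234
```

In practice a client library produces this text for you; the sketch only shows what a scraper reads.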
2. Logs
Detailed text records of events.
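Logs are easier to search when each record is structured. The sketch below emits one JSON object per event; `logLine` and the field names (`orderId`, `durationMs`) are assumptions chosen for illustration, not a standard schema.

```javascript
// Hypothetical helper: builds one structured log record as a JSON string.
// The `now` parameter exists only so the timestamp can be fixed in tests.
function logLine(level, message, fields = {}, now = new Date()) {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...fields,
  });
}

console.log(logLine('error', 'payment failed', { orderId: 'A-1001', durationMs: 245 }));
```

Because every record is a single JSON object, a log backend can filter on `level` or `orderId` without parsing free text.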
3. Traces
A record of a single request's journey across services, broken into timed spans, so you can find the slow link.
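As a rough illustration of finding the slow link, the sketch below scans the spans of one trace and picks the longest child span. The span shape (`service`, `operation`, `startMs`, `endMs`) and the sample data are invented for this example; real tracing systems use richer span models.

```javascript
// Illustrative trace: the first span is the root, the rest are downstream calls.
const trace = [
  { service: 'api-gateway', operation: 'GET /checkout', startMs: 0, endMs: 480 },
  { service: 'cart', operation: 'loadCart', startMs: 10, endMs: 60 },
  { service: 'payments', operation: 'authorize', startMs: 65, endMs: 455 },
  { service: 'email', operation: 'enqueueReceipt', startMs: 460, endMs: 475 },
];

// Returns the non-root span with the longest duration.
// Assumes spans[0] is the root and at least one child span exists.
function slowestChild(spans) {
  const [, ...children] = spans;
  return children.reduce((worst, span) =>
    span.endMs - span.startMs > worst.endMs - worst.startMs ? span : worst
  );
}

const slow = slowestChild(trace);
console.log(`${slow.service} took ${slow.endMs - slow.startMs} ms`);
// prints: payments took 390 ms
```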
Golden Signals
The four golden signals from Google's Site Reliability Engineering book:
| Signal | Measures | Alert Threshold |
|--------|----------|----------------|
| Latency | Request duration | p99 > 500ms |
| Traffic | Request volume | Sudden drop/spike |
| Errors | Error rate | > 1% |
| Saturation | Resource usage | CPU > 85% |
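The thresholds in the table can be checked mechanically. The following is a minimal sketch, assuming a snapshot object with hypothetical fields (`p99LatencyMs`, `rps`, `errorRatePct`, `cpuPct`). The table does not quantify a "sudden" traffic change, so the 50% deviation from baseline used here is an assumption.

```javascript
// Latency, error, and CPU thresholds come from the table above.
// trafficChangeRatio (50% deviation from baseline) is an assumed value.
const thresholds = {
  p99LatencyMs: 500,
  errorRatePct: 1,
  cpuPct: 85,
  trafficChangeRatio: 0.5,
};

// Returns the names of the golden signals that currently breach a threshold.
function breachedSignals(current, baselineRps) {
  const breached = [];
  if (current.p99LatencyMs > thresholds.p99LatencyMs) breached.push('latency');
  const change = Math.abs(current.rps - baselineRps) / baselineRps;
  if (change > thresholds.trafficChangeRatio) breached.push('traffic');
  if (current.errorRatePct > thresholds.errorRatePct) breached.push('errors');
  if (current.cpuPct > thresholds.cpuPct) breached.push('saturation');
  return breached;
}

const snapshot = { p99LatencyMs: 620, rps: 95, errorRatePct: 0.4, cpuPct: 91 };
console.log(breachedSignals(snapshot, 100));
// prints: [ 'latency', 'saturation' ]
```

In a real setup these checks would normally live in alerting rules rather than application code; the sketch only spells out the logic.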
Prometheus + Grafana
Application Metrics (Node.js)
import { Counter, Histogram } from 'prom-client';

// Counts every HTTP request, labeled by method and response status.
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status']
});

// Records request durations; buckets are upper bounds in seconds.
const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
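It helps to know that Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and a final `+Inf` bucket counts them all. The sketch below reproduces that counting in plain JavaScript with the same bucket boundaries as above. `bucketCounts` and the sample durations are illustrative, not part of `prom-client`.

```javascript
// Same upper bounds (in seconds) as the histogram defined above.
const buckets = [0.01, 0.05, 0.1, 0.5, 1, 5];

// Returns cumulative counts per bucket, as exposed under the `le` label.
function bucketCounts(observations) {
  const counts = buckets.map((le) => ({
    le: String(le),
    count: observations.filter((seconds) => seconds <= le).length,
  }));
  counts.push({ le: '+Inf', count: observations.length });
  return counts;
}

// Five made-up request durations in seconds.
console.log(bucketCounts([0.004, 0.03, 0.2, 0.245, 7]));
```

With these five observations the counts are 1, 2, 2, 4, 4, 4 and, for `+Inf`, 5. The 7-second request appears only in the `+Inf` bucket, which is a hint that the largest boundary may be too low for this workload.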
PromQL Examples
# Error rate: percentage of requests that returned a 5xx status over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# p99 latency, aggregated across all series (instances, labels)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
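`histogram_quantile` does not see individual request durations. It estimates the quantile by linear interpolation inside the bucket where the target rank falls. The sketch below is a simplified version of that idea, assuming cumulative bucket counts sorted by upper bound with a final `+Inf` bucket and at least one observation. The function name and the sample counts are illustrative, and the real implementation handles more edge cases.

```javascript
// Simplified estimate of a quantile from cumulative histogram buckets.
// Assumes: buckets sorted by `le`, last bucket is +Inf, total count > 0.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // If the rank lands in the +Inf bucket, fall back to the highest finite bound.
  if (i === buckets.length - 1) return buckets[i - 1].le;
  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const lowerCount = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - lowerCount;
  // Assume observations are spread evenly within the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// Made-up cumulative counts for 100 requests.
const sample = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];

console.log(histogramQuantile(0.75, sample).toFixed(2));
```

For this sample the 75th percentile comes out at roughly 0.35 seconds. The estimate can never be more precise than the bucket boundaries, which is why bucket choice matters for latency alerts.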
SLA / SLO / SLI
| Concept | Definition | Example |
|---------|-----------|---------|
| SLI | Measured metric | p99 latency: 245ms |
| SLO | Internal target | p99 < 500ms |
| SLA | Customer commitment | 99.9% uptime |
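A percentage such as 99.9% is easier to reason about once converted into allowed downtime. The sketch below does that arithmetic; the 30-day window is an assumption, since the table does not say over what period uptime is measured.

```javascript
// Converts an availability target into allowed downtime for a given window.
function allowedDowntimeMinutes(targetPercent, windowDays) {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - targetPercent / 100);
}

// 99.9% over an assumed 30-day window.
console.log(allowedDowntimeMinutes(99.9, 30).toFixed(1));
// prints: 43.2
```

So a 99.9% commitment leaves about 43 minutes of downtime per 30 days, which is a useful number to keep in mind when deciding how quickly alerts must reach someone.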
Alerting Best Practices
- Actionable — The person who sees it knows what to do
- Severity levels — Critical, Warning, Info
- Group similar alerts — Reduce noise
- Runbook links — Every alert has a resolution doc
- Avoid alert fatigue — Too many false positives erode trust
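Several of these practices can be sketched together: each alert carries a severity and a runbook link, and alerts with the same name are grouped into one notification. The alert shape, the names, and the `runbooks.example.com` URLs below are all invented for illustration and do not reflect any particular alerting tool's format.

```javascript
// Illustrative firing alerts; names, instances, and runbook URLs are made up.
const firing = [
  { name: 'HighLatency', severity: 'warning', instance: 'api-1', runbook: 'https://runbooks.example.com/high-latency' },
  { name: 'HighLatency', severity: 'warning', instance: 'api-2', runbook: 'https://runbooks.example.com/high-latency' },
  { name: 'HighErrorRate', severity: 'critical', instance: 'api-1', runbook: 'https://runbooks.example.com/high-error-rate' },
];

const severityOrder = { critical: 0, warning: 1, info: 2 };

// Groups alerts by name and sorts the groups so the most severe come first.
function groupAlerts(alerts) {
  const groups = new Map();
  for (const alert of alerts) {
    const group = groups.get(alert.name) ?? {
      name: alert.name,
      severity: alert.severity,
      runbook: alert.runbook,
      instances: [],
    };
    group.instances.push(alert.instance);
    groups.set(alert.name, group);
  }
  return [...groups.values()].sort(
    (a, b) => severityOrder[a.severity] - severityOrder[b.severity]
  );
}

for (const group of groupAlerts(firing)) {
  console.log(`[${group.severity}] ${group.name} on ${group.instances.join(', ')} -> ${group.runbook}`);
}
```

Three firing alerts become two notifications, the critical one first, and each notification tells the reader where to look for the fix.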
Tools
| Tool | Area | Highlights |
|------|------|-----------|
| Prometheus | Metrics | Open source, PromQL |
| Grafana | Dashboards | Visualization |
| Loki | Logs | Grafana integration |
| Jaeger | Tracing | Distributed tracing |
| Datadog | Full-stack | SaaS, all-in-one |
Conclusion
Without monitoring and alerting, operations are flying blind. Watch the golden signals, set meaningful alerts, and avoid alert fatigue. The goal: detect issues before your customers do.