What Is Monitoring & Alerting? Observability Guide

When something goes wrong, do you find out first, or do your customers? Monitoring continuously collects data about your system's health, and alerting notifies you when that data crosses a threshold, ideally before users notice.

The Three Pillars

1. Metrics

Numerical values stored as time series:

http_requests_total{status="200"} 15234
http_request_duration_seconds{quantile="0.99"} 0.245

2. Logs

Detailed text records of events.
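Log records are easier to search when each event is written as one structured line. The following is a minimal sketch of that idea; the helper name `formatLogLine` and the field names `ts`, `level`, and `msg` are assumptions chosen for illustration, not a standard.

```javascript
// Hypothetical helper: serialize one event as a single-line JSON log record.
function formatLogLine(level, message, fields = {}, now = new Date()) {
  // Core fields are spread last so caller-supplied fields cannot overwrite them.
  return JSON.stringify({ ...fields, ts: now.toISOString(), level, msg: message });
}

// Example event with two illustrative context fields.
console.log(formatLogLine('error', 'payment failed', { orderId: 'A-1001', status: 502 }));
```

Writing one JSON object per line lets a log system filter on any field, such as `level` or `orderId`, instead of matching free text.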

3. Traces

The record of a single request's journey across services, showing where the time went so you can find the slow link.
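To make "find the slow link" concrete, here is a small sketch that picks the span with the most self time out of a trace. The span shape (`id`, `parentId`, `service`, `startMs`, `endMs`) and the service names are hypothetical, and the calculation assumes child spans run one after another rather than in parallel.

```javascript
// Hypothetical span shape: { id, parentId, service, startMs, endMs }.
function slowestSpan(spans) {
  // Sum the duration of each span's direct children.
  const childTime = new Map();
  for (const s of spans) {
    if (s.parentId !== null) {
      const duration = s.endMs - s.startMs;
      childTime.set(s.parentId, (childTime.get(s.parentId) ?? 0) + duration);
    }
  }
  // Self time = own duration minus time spent waiting on children
  // (assumes children run sequentially, not in parallel).
  let worst = null;
  for (const s of spans) {
    const selfMs = s.endMs - s.startMs - (childTime.get(s.id) ?? 0);
    if (worst === null || selfMs > worst.selfMs) worst = { service: s.service, selfMs };
  }
  return worst;
}

const trace = [
  { id: 'a', parentId: null, service: 'api-gateway', startMs: 0, endMs: 240 },
  { id: 'b', parentId: 'a', service: 'orders', startMs: 10, endMs: 230 },
  { id: 'c', parentId: 'b', service: 'inventory-db', startMs: 20, endMs: 200 },
];
console.log(slowestSpan(trace)); // { service: 'inventory-db', selfMs: 180 }
```

The gateway span is the longest overall at 240 ms, but almost all of that time is spent waiting on downstream calls; the self-time view points at the database instead.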

Golden Signals

Google SRE's four critical metrics:

| Signal | Measures | Alert Threshold |
|--------|----------|----------------|
| Latency | Request duration | p99 > 500ms |
| Traffic | Request volume | Sudden drop/spike |
| Errors | Error rate | > 1% |
| Saturation | Resource usage | CPU > 85% |
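As a rough illustration, the thresholds in the table can be checked against a single sample of measurements. The function name, the sample field names, and the 50% traffic deviation used to stand in for "sudden drop/spike" are all assumptions; real thresholds depend on the service.

```javascript
// Latency, error, and CPU thresholds come from the table above;
// trafficDeviation (50% away from baseline) is an assumed value.
const THRESHOLDS = { p99Ms: 500, errorRatePct: 1, cpuPct: 85, trafficDeviation: 0.5 };

// Returns the names of the golden signals that breach their threshold.
function checkGoldenSignals(sample, baselineRps) {
  const alerts = [];
  if (sample.p99Ms > THRESHOLDS.p99Ms) alerts.push('latency');
  // Relative change in requests per second compared with the baseline.
  const change = Math.abs(sample.rps - baselineRps) / baselineRps;
  if (change > THRESHOLDS.trafficDeviation) alerts.push('traffic');
  if (sample.errorRatePct > THRESHOLDS.errorRatePct) alerts.push('errors');
  if (sample.cpuPct > THRESHOLDS.cpuPct) alerts.push('saturation');
  return alerts;
}

console.log(checkGoldenSignals({ p99Ms: 620, rps: 95, errorRatePct: 0.4, cpuPct: 91 }, 100));
// [ 'latency', 'saturation' ]
```

In practice these checks usually live in the monitoring system's alert rules rather than in application code; the sketch only shows how the four signals map to conditions.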

Prometheus + Grafana

Application Metrics (Node.js)

import { Counter, Histogram } from 'prom-client';

// Counts handled requests, labeled by HTTP method and status code.
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status']
});

// Records request duration; each bucket value is an upper bound in seconds.
const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
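To see what the `buckets` option means, the following dependency-free sketch mimics how a Prometheus-style histogram counts observations. It is not prom-client's implementation; the class name `SketchHistogram` is made up for this example.

```javascript
// Dependency-free sketch of a cumulative histogram; not prom-client's implementation.
class SketchHistogram {
  constructor(bounds) {
    this.bounds = [...bounds].sort((a, b) => a - b);
    this.counts = new Array(this.bounds.length).fill(0); // one counter per upper bound
    this.count = 0; // total observations (what the +Inf bucket would hold)
    this.sum = 0;   // sum of all observed values
  }

  observe(value) {
    // Buckets are cumulative: a value counts toward every bound it does not exceed.
    this.bounds.forEach((le, i) => {
      if (value <= le) this.counts[i] += 1;
    });
    this.count += 1;
    this.sum += value;
  }
}

const h = new SketchHistogram([0.01, 0.05, 0.1, 0.5, 1, 5]);
[0.02, 0.3, 0.3, 7].forEach((v) => h.observe(v));
console.log(h.counts, h.count); // [ 0, 1, 1, 3, 3, 3 ] 4
```

The 7-second request exceeds every finite bound, so it only shows up in the total count. This is why the largest bucket should sit above the slowest latency you care about distinguishing.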

PromQL Examples

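# Error rate: percentage of requests that returned a 5xx status over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# p99 latency across all instances (bucket rates are summed by le before the quantile is estimated)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))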
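The p99 query above estimates a quantile from bucket counts rather than from raw request timings. The sketch below shows a simplified version of that linear interpolation, assuming `0 < q <= 1` and cumulative buckets sorted by upper bound with a final `Infinity` bucket; the function name and input shape are illustrative, and Prometheus handles more edge cases than this.

```javascript
// Simplified sketch of the linear interpolation behind histogram_quantile.
// buckets: cumulative counts sorted by upper bound, ending with le = Infinity.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);
  // If the rank lands in the +Inf bucket, report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

const observed = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.99, observed).toFixed(2)); // 0.95
```

Because the result is interpolated within a bucket, its accuracy depends on how finely the buckets are spaced around the latencies that matter.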

SLA / SLO / SLI

| Concept | Definition | Example |
|---------|-----------|---------|
| SLI | Measured metric | p99 latency: 245ms |
| SLO | Internal target | p99 < 500ms |
| SLA | Customer commitment | 99.9% uptime |
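An uptime percentage is easier to reason about once it is converted into allowed downtime, often called an error budget. The sketch below does that arithmetic for the 99.9% example; the 30-day window and the function name are assumptions for illustration.

```javascript
// Allowed downtime, in minutes, for an availability target over a window of days.
function errorBudgetMinutes(targetPct, windowDays) {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - targetPct / 100);
}

// 99.9% over an assumed 30-day window: 43,200 minutes * 0.001.
console.log(errorBudgetMinutes(99.9, 30).toFixed(1)); // 43.2
```

Put differently, a 99.9% commitment over 30 days leaves roughly 43 minutes for all incidents and maintenance combined.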

Alerting Best Practices

Actionable — The person who sees it knows what to do
Severity levels — Critical, Warning, Info
Group similar alerts — Reduce noise
Runbook links — Every alert has a resolution doc
Avoid alert fatigue — Too many false positives erode trust
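Several of these practices can be sketched together: grouping similar alerts, ordering by severity, and carrying a runbook link. The alert shape, the alert names, and the `example.com` runbook URLs below are hypothetical.

```javascript
// Hypothetical alert shape: { name, severity, instance, runbook }.
function groupAlerts(alerts) {
  const groups = new Map();
  for (const a of alerts) {
    // Alerts with the same name and severity become one notification.
    const key = `${a.severity}:${a.name}`;
    if (!groups.has(key)) {
      groups.set(key, { name: a.name, severity: a.severity, runbook: a.runbook, instances: [] });
    }
    groups.get(key).instances.push(a.instance);
  }
  // Critical first, then warning, then info.
  const order = { critical: 0, warning: 1, info: 2 };
  return [...groups.values()].sort((x, y) => order[x.severity] - order[y.severity]);
}

const firing = [
  { name: 'HighCpu', severity: 'warning', instance: 'web-1', runbook: 'https://example.com/runbooks/high-cpu' },
  { name: 'HighErrorRate', severity: 'critical', instance: 'web-1', runbook: 'https://example.com/runbooks/high-error-rate' },
  { name: 'HighErrorRate', severity: 'critical', instance: 'web-2', runbook: 'https://example.com/runbooks/high-error-rate' },
];
console.log(groupAlerts(firing).map((g) => `${g.severity} ${g.name} x${g.instances.length}`));
// [ 'critical HighErrorRate x2', 'warning HighCpu x1' ]
```

Three firing alerts collapse into two notifications, and the critical one is listed first with its runbook attached.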

Tools

| Tool | Area | Highlights |
|------|------|-----------|
| Prometheus | Metrics | Open source, PromQL |
| Grafana | Dashboards | Visualization |
| Loki | Logs | Grafana integration |
| Jaeger | Tracing | Distributed tracing |
| Datadog | Full-stack | SaaS, all-in-one |

Conclusion

Without monitoring and alerting, operations are flying blind. Watch the golden signals, set meaningful alerts, and avoid alert fatigue. The goal: detect issues before your customers do.

