
What Is Monitoring & Alerting? Prometheus, Grafana & Observability

F. Çağrı Bilgehan · 29 January 2026 · 11 min read
Tags: monitoring, prometheus, grafana, observability

What Is Monitoring & Alerting? Observability Guide

When something goes wrong, do you find out first or do your customers? Monitoring watches your system 24/7, and alerting catches issues before users notice.

The Three Pillars

1. Metrics

Numerical values stored as time series:

http_requests_total{status="200"} 15234
http_request_duration_seconds{quantile="0.99"} 0.245
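Because a counter such as `http_requests_total` only ever increases, the useful number is usually how fast it grows between two samples. The following is a minimal sketch of that calculation, assuming a hypothetical `Sample` shape and a simplified reset rule; it illustrates the idea rather than reproducing Prometheus's own `rate()` logic.

```typescript
// One point of a time series: a value observed at a point in time.
// The Sample shape is an assumption made for this illustration.
interface Sample {
  timestampSec: number;
  value: number;
}

// Per-second growth of a counter between two samples.
function perSecondRate(older: Sample, newer: Sample): number {
  const elapsed = newer.timestampSec - older.timestampSec;
  if (elapsed <= 0) {
    throw new Error("samples must be in time order");
  }
  // A counter that went down is treated as restarted from zero (simplified).
  const increase =
    newer.value >= older.value ? newer.value - older.value : newer.value;
  return increase / elapsed;
}

const earlierSample: Sample = { timestampSec: 0, value: 15000 };
const laterSample: Sample = { timestampSec: 60, value: 15234 };
console.log(perSecondRate(earlierSample, laterSample));
```

With these two samples the counter grew by 234 requests in 60 seconds, so the result is roughly 3.9 requests per second.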

2. Logs

Detailed text records of events.
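Logs are much easier to search and filter when each event is written as one JSON object per line. Here is a minimal sketch of such a formatter; the field names (`ts`, `level`, `msg`, `requestId`) are assumptions chosen for the example, not a required schema.

```typescript
type LogLevel = "info" | "warn" | "error";

// Hypothetical helper: turns one event into a single JSON log line.
function formatLog(
  level: LogLevel,
  msg: string,
  fields: Record<string, string | number> = {},
  now: Date = new Date()
): string {
  // Timestamp, level and message come first; extra fields follow.
  return JSON.stringify({ ts: now.toISOString(), level, msg, ...fields });
}

console.log(
  formatLog("error", "payment failed", { requestId: "req-123", httpStatus: 502 })
);
```

Passing the clock in as a parameter keeps the function easy to test with a fixed date.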

3. Traces

The record of a single request's journey across services, showing where the time went so you can find the slow link.
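To make "find the slow link" concrete, here is a small sketch that picks the longest step out of a trace. It assumes a flat list of spans with hypothetical service names; real tracing systems nest spans, which this example ignores.

```typescript
// A span is one timed step of a request inside one service.
interface Span {
  service: string; // hypothetical service name
  startMs: number; // offset from the start of the trace
  endMs: number; // offset from the start of the trace
}

// Return the span with the longest duration, or undefined for an empty trace.
function slowestSpan(spans: Span[]): Span | undefined {
  let slowest: Span | undefined;
  for (const span of spans) {
    if (
      !slowest ||
      span.endMs - span.startMs > slowest.endMs - slowest.startMs
    ) {
      slowest = span;
    }
  }
  return slowest;
}

const exampleTrace: Span[] = [
  { service: "gateway", startMs: 0, endMs: 20 },
  { service: "orders", startMs: 20, endMs: 60 },
  { service: "payments", startMs: 60, endMs: 240 },
];
console.log(slowestSpan(exampleTrace));
```

For the example trace, the `payments` span takes 180 ms of the 240 ms total, so it is the one returned.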

Golden Signals

Google SRE's four critical metrics:

| Signal | Measures | Alert Threshold |
|--------|----------|-----------------|
| Latency | Request duration | p99 > 500 ms |
| Traffic | Request volume | Sudden drop/spike |
| Errors | Error rate | > 1% |
| Saturation | Resource usage | CPU > 85% |
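The fixed thresholds in the table can be expressed as a simple check. This is only a sketch: the `Signals` shape and field names are assumptions, and traffic is left out because "sudden drop/spike" needs a baseline to compare against rather than a fixed number.

```typescript
// Hypothetical snapshot of the three signals with fixed thresholds.
interface Signals {
  p99LatencyMs: number;
  errorRatePct: number;
  cpuPct: number;
}

// Return the names of the signals that exceed the thresholds from the table.
function breachedSignals(s: Signals): string[] {
  const breached: string[] = [];
  if (s.p99LatencyMs > 500) breached.push("latency");
  if (s.errorRatePct > 1) breached.push("errors");
  if (s.cpuPct > 85) breached.push("saturation");
  return breached;
}

console.log(breachedSignals({ p99LatencyMs: 620, errorRatePct: 2.5, cpuPct: 60 }));
```

In practice these thresholds would come from your own SLOs; the values here simply mirror the table.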

Prometheus + Grafana

Application Metrics (Node.js)

import { Counter, Histogram } from 'prom-client';

// Counts every HTTP request, labeled by method and response status.
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status']
});

// Records request durations in seconds; bucket bounds span 10 ms to 5 s.
const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration',
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5]
});
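Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, plus a final `+Inf` bucket that counts everything. The sketch below reimplements that counting in plain TypeScript to show the idea; it is an illustration, not prom-client's internal code, and the sample durations are made up.

```typescript
// Same bucket bounds (in seconds) as the histogram definition above.
const exampleBounds = [0.01, 0.05, 0.1, 0.5, 1, 5];

// Cumulative count per bucket, followed by the +Inf bucket.
function bucketCounts(observations: number[], upperBounds: number[]): number[] {
  const counts = upperBounds.map(
    (bound) => observations.filter((value) => value <= bound).length
  );
  counts.push(observations.length); // +Inf bucket holds every observation
  return counts;
}

// Five made-up request durations in seconds.
console.log(bucketCounts([0.02, 0.08, 0.3, 0.3, 2], exampleBounds));
// → [0, 1, 2, 4, 4, 5, 5]
```

Bucket bounds matter: if most requests fall between two widely spaced bounds, percentile estimates for that range become coarse.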

PromQL Examples

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# p99 latency
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
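A percentile computed from a histogram is an estimate: the query finds the bucket that contains the requested rank and interpolates linearly inside it. Here is a simplified sketch of that interpolation, assuming a hypothetical `Bucket` shape with cumulative counts. It is meant to show why the result depends on bucket bounds, not to replicate every edge case of `histogram_quantile`.

```typescript
interface Bucket {
  le: number; // upper bound in seconds; Infinity for the +Inf bucket
  count: number; // cumulative count of observations <= le
}

// Estimate quantile q (0..1) by linear interpolation inside one bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].count >= rank) {
      // The +Inf bucket has no upper bound: fall back to the highest finite one.
      if (!isFinite(sorted[i].le)) {
        return i > 0 ? sorted[i - 1].le : NaN;
      }
      const lower = i > 0 ? sorted[i - 1].le : 0;
      const below = i > 0 ? sorted[i - 1].count : 0;
      const inBucket = sorted[i].count - below;
      if (inBucket === 0) return lower;
      return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
    }
  }
  return NaN;
}

// Made-up cumulative counts for 100 requests.
const exampleBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.99, exampleBuckets));
```

With this data the 99th request falls in the 0.5 s to 1 s bucket, so the estimate is about 0.95 s even though no request was necessarily that slow.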

SLA / SLO / SLI

| Concept | Definition | Example |
|---------|------------|---------|
| SLI | Measured metric | p99 latency: 245 ms |
| SLO | Internal target | p99 < 500 ms |
| SLA | Customer commitment | 99.9% uptime |
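An uptime percentage is easier to reason about once it is converted into allowed downtime. The following sketch does that arithmetic; the function name and the 30-day window are assumptions chosen for illustration.

```typescript
// Minutes of downtime allowed by an uptime target over a window of days.
function allowedDowntimeMinutes(uptimePercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - uptimePercent / 100);
}

console.log(allowedDowntimeMinutes(99.9, 30).toFixed(1)); // → "43.2"
```

So a 99.9% uptime commitment over 30 days leaves roughly 43 minutes of downtime, which is why the internal SLO is usually set tighter than the SLA.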

Alerting Best Practices

  1. Actionable — The person who sees it knows what to do
  2. Severity levels — Critical, Warning, Info
  3. Group similar alerts — Reduce noise
  4. Runbook links — Every alert has a resolution doc
  5. Avoid alert fatigue — Too many false positives erode trust
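Grouping and runbook links can be sketched together: alerts that share a name and severity collapse into one notification that lists the affected instances and carries a single runbook link. The `Alert` shape, the alert name, and the runbook URL below are hypothetical; dedicated tools such as Alertmanager handle grouping for you.

```typescript
type Severity = "critical" | "warning" | "info";

// Hypothetical alert shape for this illustration.
interface Alert {
  alertName: string;
  severity: Severity;
  instance: string;
  runbookUrl: string;
}

interface AlertGroup {
  alertName: string;
  severity: Severity;
  runbookUrl: string;
  instances: string[];
}

// Collapse alerts with the same name and severity into one group.
function groupAlerts(alerts: Alert[]): AlertGroup[] {
  const groups: Record<string, AlertGroup> = {};
  for (const alert of alerts) {
    const key = alert.alertName + "|" + alert.severity;
    if (!groups[key]) {
      groups[key] = {
        alertName: alert.alertName,
        severity: alert.severity,
        runbookUrl: alert.runbookUrl,
        instances: [],
      };
    }
    groups[key].instances.push(alert.instance);
  }
  return Object.keys(groups).map((key) => groups[key]);
}

// Placeholder runbook URL, not a real document.
const exampleRunbook = "https://runbooks.example.com/high-error-rate";
const exampleAlerts: Alert[] = [
  { alertName: "HighErrorRate", severity: "critical", instance: "api-1", runbookUrl: exampleRunbook },
  { alertName: "HighErrorRate", severity: "critical", instance: "api-2", runbookUrl: exampleRunbook },
  { alertName: "HighCpu", severity: "warning", instance: "api-1", runbookUrl: exampleRunbook },
];
console.log(groupAlerts(exampleAlerts));
```

Three raw alerts become two notifications here, and the on-call engineer sees both affected instances in one place.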

Tools

| Tool | Area | Highlights |
|------|------|------------|
| Prometheus | Metrics | Open source, PromQL |
| Grafana | Dashboards | Visualization |
| Loki | Logs | Grafana integration |
| Jaeger | Tracing | Distributed tracing |
| Datadog | Full-stack | SaaS, all-in-one |

Conclusion

Without monitoring and alerting, operations are flying blind. Watch the golden signals, set meaningful alerts, and avoid alert fatigue. The goal: detect issues before your customers do.

