SRE & Monitoring

Site reliability engineering, monitoring, alerting, and incident management

Overview

This section covers site reliability engineering (SRE) practices for GGNomad.

Repository

Monitoring Stack

Component       Technology
Metrics         Prometheus
Visualization   Grafana
Logging         Azure Log Analytics
Tracing         OpenTelemetry
Alerting        Prometheus Alertmanager

SLIs & SLOs

  • Availability targets
  • Latency targets
  • Error rate thresholds
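
As a rough illustration of how the three kinds of targets above might be checked against observed traffic, here is a minimal Python sketch. The target values are not stated on this page, so the `Slo` class, its field names, the `evaluate` function, and every number in the usage example are hypothetical placeholders rather than GGNomad's actual SLOs. The latency check uses a nearest-rank p95, which is one common choice and not necessarily the one used here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO targets; real values are not given in this document."""
    availability_target: float   # e.g. 0.999 means 99.9% of requests succeed
    latency_target_ms: float     # p95 latency must be at or below this value
    error_rate_threshold: float  # maximum tolerated fraction of failed requests


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding surprises.
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def evaluate(slo, total, failed, latencies_ms):
    """Compare observed SLIs for one window against the SLO targets."""
    availability = (total - failed) / total
    error_rate = failed / total
    p95 = percentile(latencies_ms, 95)
    # Error budget: how many failures the availability target allows in this window.
    budget = (1 - slo.availability_target) * total
    return {
        "availability_ok": availability >= slo.availability_target,
        "latency_ok": p95 <= slo.latency_target_ms,
        "error_rate_ok": error_rate <= slo.error_rate_threshold,
        "error_budget_remaining": budget - failed,
    }


if __name__ == "__main__":
    # Placeholder targets and traffic, for illustration only.
    slo = Slo(availability_target=0.999, latency_target_ms=300.0,
              error_rate_threshold=0.001)
    report = evaluate(slo, total=10_000, failed=5,
                      latencies_ms=[100.0] * 95 + [500.0] * 5)
    for name, value in report.items():
        print(f"{name}: {value}")
```

In practice the request counts and latency samples would come from the metrics backend (Prometheus in the stack listed above) rather than from in-memory lists; the sketch only shows the arithmetic behind the checks.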

On-Call

  • Incident response procedures
  • Runbooks
  • Escalation paths