# Monitoring Guide
Comprehensive monitoring setup and practices for Burdenoff products.
## Monitoring Architecture

```
Applications → Prometheus     → Grafana
             → OpenTelemetry  → Jaeger
             → Azure Monitor  → Log Analytics
```
## Prometheus Setup

### Installation

These values are passed to the Prometheus Helm chart at install time (via `-f helm/prometheus-values.yaml`):

```yaml
# helm/prometheus-values.yaml
prometheus:
  retention: 15d
  scrapeInterval: 30s
  evaluationInterval: 30s
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi
```
### Scrape Configuration

This configuration discovers Kubernetes pods and keeps only those annotated with `prometheus.io/scrape: "true"`:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```
## Application Instrumentation

### Python (FastAPI)

```python
import time

from prometheus_client import Counter, Histogram, Gauge

# Metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

active_users = Gauge(
    'active_users',
    'Number of active users',
    ['tenant_id']
)

# Middleware
@app.middleware("http")
async def metrics_middleware(request, call_next):
    start_time = time.time()
    response = await call_next(request)
    duration = time.time() - start_time

    request_duration.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)
    request_count.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    return response
```
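The snippet above records metrics but never exposes them. A minimal sketch of the missing pieces, using `prometheus_client`'s built-in ASGI app (the tenant id and value are illustrative):

```python
from prometheus_client import make_asgi_app

# Serve the Prometheus text exposition format at /metrics, where the
# scrape configuration above expects to find it.
app.mount("/metrics", make_asgi_app())

# Gauges are set to a current value rather than incremented, typically
# from a periodic background task.
active_users.labels(tenant_id="tenant-a").set(42)  # illustrative values
```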
### TypeScript (Node.js)

```typescript
import prometheus from 'prom-client';

// Metrics
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status'],
});

const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'endpoint'],
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestDuration
      .labels(req.method, req.path)
      .observe(duration);
    httpRequestsTotal
      .labels(req.method, req.path, String(res.statusCode))
      .inc();
  });

  next();
});
```
## Grafana Dashboards

### Application Dashboard

```json
{
  "dashboard": {
    "title": "Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(http_requests_total[5m])" }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(http_requests_total{status=~\"5..\"}[5m])" }
        ]
      },
      {
        "title": "Request Duration (p95)",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))" }
        ]
      }
    ]
  }
}
```
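The panel expressions can be sanity-checked outside Grafana against the Prometheus HTTP API. A sketch, assuming Prometheus is reachable at `http://prometheus:9090`:

```python
import requests

resp = requests.get(
    "http://prometheus:9090/api/v1/query",
    params={"query": 'rate(http_requests_total{status=~"5.."}[5m])'},
)
resp.raise_for_status()

# Each result carries the series labels and a (timestamp, value) pair.
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```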
### Database Dashboard
- Connection pool usage
- Query duration
- Active connections
- Slow queries
- Lock waits
- Transaction rate
### System Dashboard
- CPU utilization
- Memory usage
- Disk I/O
- Network throughput
- Pod status
- Node health
## Alert Rules

### High Error Rate

```yaml
groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} requests/second"
```

Note that the threshold is on the absolute 5xx rate; to alert on the error ratio instead, divide by `rate(http_requests_total[5m])`.
### Slow Response Time

```yaml
- alert: SlowResponseTime
  expr: |
    histogram_quantile(0.95,
      rate(http_request_duration_seconds_bucket[5m])
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Slow response time"
    description: "P95 latency is {{ $value }}s"
```
### Service Down

```yaml
- alert: ServiceDown
  expr: up{job="kubernetes-pods"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service is down"
    description: "{{ $labels.pod }} is not responding"
```
## Distributed Tracing

### OpenTelemetry Setup

```python
from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to the OpenTelemetry Collector over OTLP/gRPC
otlp_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4317"
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

# Instrument FastAPI
FastAPIInstrumentor.instrument_app(app)
```
### Custom Spans

```python
@app.get("/users/{user_id}")
async def get_user(user_id: str):
    with tracer.start_as_current_span("fetch_user") as span:
        span.set_attribute("user.id", user_id)
        user = await db.users.get(user_id)
        span.set_attribute("user.found", user is not None)
        return user
```
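Spans can also capture failures. A minimal sketch using the same `tracer`; `record_exception` and `set_status` are standard OpenTelemetry span APIs, while the `/orders` endpoint and `db.orders` handle are hypothetical:

```python
from opentelemetry.trace import Status, StatusCode

@app.get("/orders/{order_id}")  # hypothetical endpoint for illustration
async def get_order(order_id: str):
    with tracer.start_as_current_span("fetch_order") as span:
        try:
            return await db.orders.get(order_id)
        except Exception as exc:
            # Attach the exception as a span event and mark the span failed
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
```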
## Log Aggregation

### Structured Logging

```python
import structlog

logger = structlog.get_logger()

logger.info(
    "user_login",
    user_id=user.id,
    tenant_id=tenant.id,
    ip_address=request.client.host
)
```
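To make these events machine-parseable for the aggregation pipeline, structlog can render them as JSON. A minimal sketch using structlog's built-in processors; the exact processor chain is an assumption to adapt to your pipeline:

```python
import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,           # adds "level" to each event
        structlog.processors.TimeStamper(fmt="iso"),  # adds an ISO-8601 "timestamp"
        structlog.processors.JSONRenderer(),          # renders the event dict as JSON
    ]
)

# Example output (illustrative):
# {"event": "user_login", "user_id": "u-123", "level": "info", "timestamp": "..."}
```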
### Log Levels
- DEBUG: Detailed diagnostic information
- INFO: General informational messages
- WARNING: Warning messages
- ERROR: Error messages
- CRITICAL: Critical issues
### Log Retention
- DEBUG/INFO: 7 days
- WARNING: 30 days
- ERROR/CRITICAL: 90 days
## Health Checks

### Kubernetes Probes

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5
```
### Health Endpoint

```python
from fastapi import HTTPException

@app.get("/health/live")
async def liveness():
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    checks = {
        "database": await check_database(),
        "redis": await check_redis(),
        "storage": await check_storage()
    }
    if all(checks.values()):
        return {"status": "ready", "checks": checks}
    raise HTTPException(
        status_code=503,
        detail={"status": "not_ready", "checks": checks}
    )
```
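The `check_database`, `check_redis`, and `check_storage` helpers are referenced above but not defined. A minimal sketch of two of them; the `db.execute` method and the async Redis client are assumptions (e.g. `redis.asyncio`), and each check should stay cheap and bounded so the probe responds quickly:

```python
async def check_database() -> bool:
    try:
        await db.execute("SELECT 1")  # assumed async db handle, as above
        return True
    except Exception:
        return False

async def check_redis() -> bool:
    try:
        return await redis.ping()  # assumed async Redis client
    except Exception:
        return False
```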
## Performance Monitoring

### Key Metrics

- Apdex score (see the sketch after this list)
- User-centric metrics (FCP, LCP, CLS)
- Backend response times
- Database query times
- Cache hit rates
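Apdex summarizes response times against a target threshold T: requests under T count as satisfied, those under 4T as tolerating, the rest as frustrated, and the score is (satisfied + tolerating/2) / total. A minimal sketch; the 0.5 s default threshold is an assumption:

```python
def apdex(latencies_s: list[float], t: float = 0.5) -> float:
    """Apdex = (satisfied + tolerating / 2) / total, with tolerating = (t, 4t]."""
    satisfied = sum(1 for l in latencies_s if l <= t)
    tolerating = sum(1 for l in latencies_s if t < l <= 4 * t)
    return (satisfied + tolerating / 2) / len(latencies_s)

# e.g. apdex([0.2, 0.4, 1.2, 3.0], t=0.5) == (2 + 1 / 2) / 4 == 0.625
```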
### Profiling

- CPU profiling (see the sketch after this list)
- Memory profiling
- Query profiling
- Trace profiling
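For ad-hoc CPU profiling, the standard library is often enough. A sketch using `cProfile`'s context-manager API (Python 3.8+); `handle_request` is a hypothetical hot path:

```python
import cProfile
import pstats

with cProfile.Profile() as profiler:
    handle_request()  # hypothetical code path under investigation

# Print the ten functions with the highest cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(10)
```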
## Cost Monitoring

### Azure Cost Management
- Daily cost tracking
- Budget alerts
- Resource tagging
- Cost allocation
### Optimization
- Right-sizing resources
- Unused resource cleanup
- Reserved instances
- Spot instances