Monitoring Guide

Comprehensive monitoring setup and practices for Burdenoff products.

Monitoring Architecture

Applications → Prometheus     → Grafana
             → OpenTelemetry  → Jaeger
             → Azure Monitor  → Log Analytics

Prometheus Setup

Installation

# helm/prometheus-values.yaml
prometheus:
  retention: 15d
  scrapeInterval: 30s
  evaluationInterval: 30s
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      cpu: 2000m
      memory: 4Gi

Scrape Configuration

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
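
The relabel rule above only keeps pods that opt in to scraping via an annotation. A pod advertises itself like this (the port and path annotations are a common companion convention, but they need relabel rules of their own, not shown here):

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule above
    prometheus.io/port: "8000"     # conventional; requires a matching relabel rule
    prometheus.io/path: "/metrics" # conventional; requires a matching relabel rule
```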

Application Instrumentation

Python (FastAPI)

import time

from prometheus_client import Counter, Histogram, Gauge

# Metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

active_users = Gauge(
    'active_users',
    'Number of active users',
    ['tenant_id']
)

# Middleware
@app.middleware("http")
async def metrics_middleware(request, call_next):
    start_time = time.time()

    response = await call_next(request)

    duration = time.time() - start_time
    request_duration.labels(
        method=request.method,
        endpoint=request.url.path
    ).observe(duration)

    request_count.labels(
        method=request.method,
        endpoint=request.url.path,
        status=response.status_code
    ).inc()

    return response

TypeScript (Node.js)

import prometheus from 'prom-client';

// Metrics
const httpRequestsTotal = new prometheus.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status'],
});

const httpRequestDuration = new prometheus.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'endpoint'],
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestDuration
      .labels(req.method, req.path)
      .observe(duration);

    httpRequestsTotal
      .labels(req.method, req.path, String(res.statusCode))
      .inc();
  });

  next();
});

Grafana Dashboards

Application Dashboard

{
  "dashboard": {
    "title": "Application Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "rate(http_requests_total[5m])" }
        ]
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "rate(http_requests_total{status=~\"5..\"}[5m])" }
        ]
      },
      {
        "title": "Request Duration (p95)",
        "targets": [
          { "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))" }
        ]
      }
    ]
  }
}

Database Dashboard

- Connection pool usage
- Query duration
- Active connections
- Slow queries
- Lock waits
- Transaction rate
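
Example queries for this dashboard (metric names assume the community postgres_exporter is deployed; adjust for your database's exporter):

```promql
# Active connections by state
pg_stat_activity_count{state="active"}

# Transaction rate: commits plus rollbacks per second
rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])
```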

System Dashboard

- CPU utilization
- Memory usage
- Disk I/O
- Network throughput
- Pod status
- Node health
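
Example queries for this dashboard, assuming node_exporter is running on each node:

```promql
# CPU utilization per node: 1 minus the idle fraction
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Memory usage as a fraction of total
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```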

Alert Rules

High Error Rate

groups:
  - name: application_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} requests/second"

Slow Response Time

- alert: SlowResponseTime
  expr: |
    histogram_quantile(0.95,
      rate(http_request_duration_seconds_bucket[5m])
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Slow response time"
    description: "P95 latency is {{ $value }}s"

Low Availability

- alert: ServiceDown
  expr: up{job="kubernetes-pods"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service is down"
    description: "{{ $labels.pod }} is not responding"
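
The severity labels on these rules only matter if Alertmanager routes on them. A minimal routing sketch (receiver names are placeholders; the matchers syntax requires Alertmanager v0.22 or later):

```yaml
route:
  receiver: default
  group_by: [alertname]
  routes:
    - matchers: ['severity = "critical"']
      receiver: pagerduty

receivers:
  - name: default
  - name: pagerduty
```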

Distributed Tracing

OpenTelemetry Setup

from opentelemetry import trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to collector
otlp_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4317"
)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

# Instrument FastAPI
FastAPIInstrumentor.instrument_app(app)

Custom Spans

@app.get("/users/{user_id}")
async def get_user(user_id: str):
    with tracer.start_as_current_span("fetch_user") as span:
        span.set_attribute("user.id", user_id)

        user = await db.users.get(user_id)

        span.set_attribute("user.found", user is not None)

        return user

Log Aggregation

Structured Logging

import structlog

logger = structlog.get_logger()

logger.info(
    "user_login",
    user_id=user.id,
    tenant_id=tenant.id,
    ip_address=request.client.host
)

Log Levels

  • DEBUG: Detailed diagnostic information, useful mainly during development
  • INFO: Routine events that confirm normal operation
  • WARNING: Unexpected conditions that do not yet affect users
  • ERROR: Failures of an individual operation or request
  • CRITICAL: Failures that threaten the whole service and need immediate attention
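
These thresholds behave the same whether logs come from structlog or the standard library: records below the configured level are dropped. A minimal stdlib sketch demonstrating the filtering:

```python
import io
import logging

# Capture WARNING-and-above records in memory
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setLevel(logging.WARNING)

logger = logging.getLogger("burdenoff.demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

logger.info("user_login")      # below the handler threshold: dropped
logger.warning("slow_query")   # at the threshold: captured

captured = stream.getvalue()
```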

Log Retention

  • DEBUG/INFO: 7 days
  • WARNING: 30 days
  • ERROR/CRITICAL: 90 days

Health Checks

Kubernetes Probes

livenessProbe:
  httpGet:
    path: /health/live
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8000
  initialDelaySeconds: 5
  periodSeconds: 5

Health Endpoint

from fastapi import HTTPException

@app.get("/health/live")
async def liveness():
    return {"status": "alive"}

@app.get("/health/ready")
async def readiness():
    checks = {
        "database": await check_database(),
        "redis": await check_redis(),
        "storage": await check_storage()
    }

    if all(checks.values()):
        return {"status": "ready", "checks": checks}
    else:
        raise HTTPException(
            status_code=503,
            detail={"status": "not_ready", "checks": checks}
        )
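
The check_database/check_redis/check_storage helpers are not defined in this guide; each is expected to return a bool quickly. A minimal sketch of one such check, with a hypothetical _ping standing in for the real round-trip (e.g. a `SELECT 1`), bounded by a timeout so a hung dependency cannot stall the readiness endpoint:

```python
import asyncio

async def _ping():
    # Stand-in for a real round-trip, e.g. `SELECT 1` or a Redis PING
    await asyncio.sleep(0)

async def check_database() -> bool:
    # Bound the probe: a dependency that hangs counts as unhealthy
    try:
        await asyncio.wait_for(_ping(), timeout=2.0)
        return True
    except (asyncio.TimeoutError, ConnectionError):
        return False

healthy = asyncio.run(check_database())
```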

Performance Monitoring

Key Metrics

  • Apdex score
  • User-centric metrics (FCP, LCP, CLS)
  • Backend response times
  • Database query times
  • Cache hit rates
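
An Apdex-style score can be computed from the request duration histogram defined earlier. The 0.5s satisfied / 2s tolerated thresholds below are illustrative and must correspond to actual bucket boundaries:

```promql
(
  sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="2"}[5m]))
) / 2
/ sum(rate(http_request_duration_seconds_count[5m]))
```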

Profiling

  • CPU profiling
  • Memory profiling
  • Query profiling
  • Trace profiling
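
For CPU profiling, the standard library's cProfile is a reasonable starting point before reaching for a continuous profiler. A self-contained sketch:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately unoptimized loop to give the profiler something to measure
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# Render the five most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()
```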

Cost Monitoring

Azure Cost Management

  • Daily cost tracking
  • Budget alerts
  • Resource tagging
  • Cost allocation

Optimization

  • Right-sizing resources
  • Unused resource cleanup
  • Reserved instances
  • Spot instances

Next Steps