Observability
Operating software on cloud platforms requires three signals: metrics, logs, and traces. These signals let deployment automation decide when to roll back a release, and they make it possible to monitor the quality of AI agents.
Error rate example (PromQL)
# Fraction of requests over the last 5 minutes that returned a 5xx status
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
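The query above divides the rate of failed requests by the rate of all requests. As a minimal sketch of how deployment automation might act on that ratio, the Python below computes the same fraction from two request counts and compares it to a threshold. The function names, the 5% default threshold, and the choice to treat zero traffic as a 0.0 error ratio are all assumptions made for illustration, not part of any particular tool.

```python
def error_ratio(errors: float, total: float) -> float:
    """Return the fraction of requests that failed.

    Assumption: a window with no traffic counts as a 0.0 error ratio,
    so an idle service never triggers a rollback.
    """
    if total <= 0:
        return 0.0
    return errors / total


def should_roll_back(errors: float, total: float, threshold: float = 0.05) -> bool:
    """Hypothetical rollback gate: True when the error ratio exceeds the threshold."""
    return error_ratio(errors, total) > threshold


if __name__ == "__main__":
    # 12 failed requests out of 150 in the observation window (made-up numbers)
    print(should_roll_back(errors=12, total=150))
```

A real pipeline would likely read the ratio from the monitoring backend instead of raw counts, and would wait for enough traffic before trusting the result.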
What to instrument
| Signal | Examples |
|---|---|
| Metrics | Latency, traffic, errors, saturation |
| Logs | Structured JSON, correlation IDs |
| Traces | OpenTelemetry across API and agent tools |
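To make the "Logs" row concrete, here is a minimal sketch of structured JSON logging with a correlation ID, using only the Python standard library. The field names (`level`, `logger`, `message`, `correlation_id`) and the `build_logger` helper are hypothetical choices for this example; a real service would pick whatever schema its log pipeline expects.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by callers through the `extra` argument; None if absent.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "api") -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    # One ID per request, reused on every log line for that request.
    correlation_id = str(uuid.uuid4())
    log.info("request received", extra={"correlation_id": correlation_id})
    log.info("request completed", extra={"correlation_id": correlation_id})
```

Because every line for a request carries the same `correlation_id`, the lines can be grouped later even when they come from different components.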
SLO mindset
- Define SLOs per user journey (docs read, chat response, API call)
- Alert on error budget burn, not every blip
- Build dashboards that cover Kubernetes and serverless workloads alike
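Alerting on error budget burn can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), so a burn rate of 1.0 spends the budget exactly over the SLO period. The two-window check and the default threshold of 14.4 follow a commonly cited pattern for a 30-day SLO; treat both, along with the function names, as assumptions to tune rather than fixed rules.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how fast the error budget is being spent.

    1.0 means the budget lasts exactly the SLO period; higher is faster.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(
    short_window_ratio: float,
    long_window_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed default, not a universal constant
) -> bool:
    """Page only when both windows burn fast, which filters out brief blips."""
    return (
        burn_rate(short_window_ratio, slo_target) > threshold
        and burn_rate(long_window_ratio, slo_target) > threshold
    )


if __name__ == "__main__":
    # Made-up numbers: 2% errors in both windows against a 99.9% SLO.
    print(should_page(0.02, 0.02, slo_target=0.999))
```

Requiring both a short and a long window to exceed the threshold means a spike that has already ended does not page anyone, while a sustained problem does.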