- Handbook
- DevOps
- DevOps practices (blueprint)
- SRE and observability (blueprint)
SRE and observability (blueprint)
Site Reliability Engineering (SRE) and observability are complementary practices that ensure production systems are reliable, understandable, and continuously improving. SRE provides the operational framework (SLOs…
Guide · Updated · Source
This guide covers SLOs/SLIs/error budgets, the three pillars of observability, alerting philosophy, chaos engineering, on-call practices, and incident learning.
1. SRE foundations
SLOs, SLIs, and error budgets
| Concept | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of service behavior | Request latency P99, error rate, availability |
| SLO (Service Level Objective) | Target value for an SLI over a time window | P99 latency < 200 ms over 30 days; availability >= 99.9% per quarter |
| SLA (Service Level Agreement) | Contractual commitment (SLO + consequences) | 99.9% uptime; credits issued below threshold |
| Error budget | Allowed unreliability = 1 - SLO | 99.9% SLO → 0.1% error budget → ~43 min downtime/month |
Error budget policy
| Budget status | Action |
|---|---|
| Healthy (> 50% remaining) | Normal development velocity; feature work prioritized |
| Caution (25–50% remaining) | Increase review rigor; prioritize reliability improvements alongside features |
| Critical (< 25% remaining) | Freeze non-critical changes; dedicate engineering to reliability work |
| Exhausted (0%) | Stop all feature work until budget recovers; post-incident analysis for each new incident |
Toil
| Characteristic | Description |
|---|---|
| Manual | Requires human intervention, not automated |
| Repetitive | Happens over and over; not a one-time task |
| Automatable | Could be handled by software |
| Reactive | Triggered by alerts or requests, not proactive |
| Without enduring value | Does not improve the system permanently; must be done again |
Target: Keep toil below 50% of SRE team capacity; invest the remainder in automation and engineering.
2. Observability — the three pillars
Logs
| Concern | Guidance |
|---|---|
| Structured logging | JSON or key-value format; include trace ID, span ID, request ID, user ID (anonymized) |
| Log levels | ERROR (action required), WARN (attention needed), INFO (operational events), DEBUG (development) |
| Centralized aggregation | Ship to centralized platform (ELK, Loki, CloudWatch Logs); searchable and retainable |
| Retention policy | Hot (7–30 days searchable), warm (30–90 days), cold (archive for compliance) |
| Avoid | Logging PII, secrets, or high-cardinality unbounded fields |
Metrics
| Concern | Guidance |
|---|---|
| RED method (request-scoped) | Rate (requests/second), Errors (error rate), Duration (latency distribution) |
| USE method (resource-scoped) | Utilization, Saturation, Errors — for CPU, memory, disk, network |
| Cardinality | Control label cardinality; high-cardinality labels (user ID, request ID) belong in traces, not metrics |
| Instrumentation | Use client libraries (Prometheus, OpenTelemetry); instrument at service boundaries |
| Dashboards | SLO-based dashboards first; drill-down to RED/USE; avoid vanity dashboards |
Traces
| Concern | Guidance |
|---|---|
| Distributed tracing | Propagate trace context (W3C Trace Context, B3) across all service boundaries |
| Sampling | Head-based (decide at entry) or tail-based (decide after completion based on attributes); 100% for errors |
| Span attributes | HTTP method, status, route, database operation, queue name, error details |
| Trace-to-log correlation | Include trace ID in all log entries; link from trace to relevant logs |
| Tooling | OpenTelemetry (vendor-neutral), Jaeger, Tempo, Honeycomb, Datadog |
OpenTelemetry
| Component | Role |
|---|---|
| SDK | Instrument application code; auto-instrumentation for frameworks and libraries |
| Collector | Receive, process, and export telemetry (logs, metrics, traces) |
| Exporters | Send to backends — Prometheus, Jaeger, OTLP, vendor-specific |
| Semantic conventions | Standardized attribute names for consistent telemetry across services |
3. Alerting philosophy
SLO-based alerting
Alert on SLO burn rate rather than individual metric thresholds:
| Alert type | Window | Use case |
|---|---|---|
| Fast burn | 5 min rate over 1 hr budget | Severe incident — rapid budget consumption; page on-call |
| Slow burn | 30 min rate over 6 hr budget | Gradual degradation — create a ticket; investigate during business hours |
Alert design principles
| Principle | Description |
|---|---|
| Actionable | Every alert should have a clear response action; if no action, it should not page |
| Relevant | Alert the team that can fix it; avoid broadcasting to uninvolved teams |
| Proportional | Severity matches impact; critical = customer-facing SLO violation; warning = potential issue |
| Deduplicated | Group related alerts; avoid alert storms during cascading failures |
| Documented | Each alert links to a runbook with diagnosis steps and remediation |
4. Chaos engineering
Process
| Step | Activity | Output |
|---|---|---|
| 1. Hypothesis | Define steady state and expected behavior under fault | Written hypothesis |
| 2. Scope | Choose blast radius — start small (single pod/instance) | Experiment scope document |
| 3. Execute | Inject fault (network delay, pod kill, disk fill, DNS failure) | Experiment execution |
| 4. Observe | Monitor SLOs, alerts, dashboards during experiment | Observation log |
| 5. Learn | Compare actual vs expected behavior; document surprises | Findings and action items |
Common fault injections
| Fault | What it tests |
|---|---|
| Pod/process kill | Auto-recovery, load balancing, health checks |
| Network latency injection | Timeout configuration, circuit breakers, retry behavior |
| Network partition | Split-brain handling, data consistency, failover |
| Disk full | Graceful degradation, log rotation, database behavior |
| DNS failure | Service discovery resilience, caching behavior |
| Clock skew | Certificate validation, token expiry, time-dependent logic |
| Dependency failure | Circuit breaker activation, fallback behavior, error messaging |
5. On-call practices
| Practice | Description |
|---|---|
| Rotation schedule | Weekly rotation; minimum team size for sustainable coverage (5+ engineers) |
| Escalation policy | Primary → secondary → team lead → engineering manager; auto-escalate after timeout |
| Runbooks | Step-by-step diagnosis and remediation for each alert; maintain alongside alert definition |
| Handoff | End-of-rotation summary — open issues, recent changes, upcoming risks |
| Compensation | On-call compensation policy; acknowledge the burden; avoid burnout |
| Shadowing | New on-call engineers shadow experienced engineers before solo rotation |
Incident severity levels
| Level | Impact | Response | Example |
|---|---|---|---|
| P1 / Critical | Service down or data loss for many users | Immediate page; war room; status page update | Complete outage, data breach |
| P2 / High | Major feature degraded for significant users | Page; begin investigation within 15 min | Payment processing slow, search broken |
| P3 / Medium | Minor feature degraded; workaround available | Business hours; investigate same day | Export feature timeout, non-critical API errors |
| P4 / Low | Cosmetic or minor issue; no user impact | Queue for next sprint | Dashboard rendering glitch, non-customer log error |
6. Post-incident learning
Blameless postmortem structure
| Section | Content |
|---|---|
| Summary | What happened, impact, duration, severity |
| Timeline | Chronological events from detection to resolution (with timestamps) |
| Root cause | Contributing factors (not a single root cause — look for systemic issues) |
| What went well | Detection, response, communication that worked |
| What could be improved | Gaps in monitoring, response, communication, tooling |
| Action items | Specific, assigned, time-bound improvements (not "be more careful") |
| Lessons learned | Broader insights for the team and organization |
Learning culture
| Practice | Description |
|---|---|
| Blameless | Focus on systemic factors, not individual mistakes |
| Share widely | Publish postmortems to the organization; learning benefits everyone |
| Track action items | Follow through on postmortem action items; review completion rate |
| Recurring themes | Identify patterns across postmortems; address systemic issues |
| Game days | Periodic simulated incidents to practice response and validate runbooks |
External references
| Topic | URL | Why it is linked |
|---|---|---|
| Google SRE Books | https://sre.google/books/ | Foundational SRE practices — free online |
| OpenTelemetry | https://opentelemetry.io/ | Vendor-neutral observability framework |
| SLO Alerting (Google) | https://sre.google/workbook/alerting-on-slos/ | Multi-window burn rate alerting |
| Principles of Chaos Engineering | https://principlesofchaos.org/ | Chaos engineering methodology |
| Chaos Monkey (Netflix) | https://netflix.github.io/chaosmonkey/ | Pioneering chaos engineering tool |
| Incident.io Handbook | https://incident.io/guide/ | Modern incident management practices |
| Honeycomb Observability Guide | https://www.honeycomb.io/observability/ | Observability culture and practices |