Observability Beyond Logging: What I Wish I Knew Earlier
TL;DR
Observability is not about collecting more data — it's about being able to answer questions you didn't know you'd ask when everything was fine. Structure your logs as JSON from day one (I'm begging you), instrument with OpenTelemetry for distributed tracing, track RED metrics for services and USE metrics for infrastructure, and set SLOs before you set alerts. The three pillars — logs, metrics, traces — are useless unless they're correlated. And for the love of all that is holy, every alert needs a runbook.
For the first two years of my career, my entire observability strategy was console.log. I am not exaggerating. When something broke in production, I'd SSH into the server, tail the logs, and grep for the word "error." Sometimes I'd get fancy and grep for "ERROR" in all caps, because surely the important ones would be capitalized. (They were not.)
It worked! Kind of. In the way that duct tape on a leaking pipe "works" — right up until the water pressure changes and you're standing in a puddle wondering where it all went wrong.
The system that finally broke me was an AI inference pipeline serving real-time predictions. Latency would spike randomly. Users reported wrong results intermittently. The logs said everything was fine. I had monitoring that checked "is the server up?" and it always was. The server was up! It was just... wrong. Sometimes. For some users. In ways I couldn't reproduce. The problem was somewhere in the interaction between six services, two ML models, and a feature store, and I had absolutely no way to trace a single request through that chain. I was essentially doing archaeology — sifting through timestamps across six different log files, trying to reconstruct what happened to Request #Whatever at 14:32:07. It took me four days to find a bug that, with proper tracing, would have taken four minutes.
That's when I learned the difference between logging and observability. And that difference, let me tell you, is the difference between "I can see what happened" and "I can understand why it happened."
The Three Pillars (And Why They Must Be Connected)
Everyone talks about the three pillars of observability: logs, metrics, and traces. You've seen the blog posts. You've seen the Venn diagram. You might even have all three set up in your system right now. But here's what most guides completely miss — and this is the thing that would have saved me months of frustration: these pillars are almost useless in isolation. Their power comes from correlation.
Having logs without being able to connect them to traces is like having a detective novel where all the chapters are shuffled. You have all the information, technically, but good luck figuring out who did it.
┌─────────────────────────────────────────────────────────────────┐
│ The Three Pillars — Connected │
├─────────────────────────────────────────────────────────────────┤
│ │
│ METRICS TRACES LOGS │
│ ──────── ────── ──── │
│ "What is "Where did "What │
│ happening?" time go?" happened?" │
│ │
│ Error rate Request flow Detailed │
│ Latency p99 across services event record │
│ Throughput Bottleneck Stack traces │
│ identification Business events │
│ │
│ ┌──────────────────────────────┐ │
│ │ CORRELATION │ │
│ │ │ │
│ │ trace_id ←→ log entry │ │
│ │ trace_id ←→ metric tag │ │
│ │ request_id ←→ all three │ │
│ │ │ │
│ │ "Show me the logs and │ │
│ │ metrics for THIS specific │ │
│ │ slow request" │ │
│ └──────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
The key is a shared correlation ID — typically the trace ID from OpenTelemetry — that appears in every log entry, every metric tag, and every trace span. When a metric shows a latency spike, you should be able to click through to the specific traces that were slow, and from those traces to the log lines for each service involved. It should feel like zooming in on a map: start at the city level (metrics), zoom into the neighborhood (traces), then read the street signs (logs).
Without this correlation? You're just staring at three separate dashboards and trying to line up timestamps by eye. I've done this. For hours. While an incident was ongoing. It's like trying to solve a puzzle where the pieces are in three different rooms and you're not allowed to carry them.
Start With Correlation
Before you invest in fancy dashboards, make sure every log line includes trace_id and span_id. This single change transforms your debugging experience more than any tool purchase.
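A minimal sketch of that wiring, assuming pino's mixin() option and OpenTelemetry's API (the helper name and interface are mine, not a library API):

```typescript
// Hypothetical helper: extract the correlation fields every log line
// should carry, from a span-context shape like OpenTelemetry's.
interface SpanContextLike {
  traceId: string;
  spanId: string;
}

function correlationFields(ctx?: SpanContextLike): Record<string, string> {
  if (!ctx) return {}; // no active span, e.g. a cron job outside a request
  return { trace_id: ctx.traceId, span_id: ctx.spanId };
}

// With pino and @opentelemetry/api, the wiring is roughly:
//   const logger = pino({
//     mixin: () => correlationFields(trace.getActiveSpan()?.spanContext()),
//   });
// so every log call picks up trace_id and span_id automatically.
```

The mixin runs on every log call, which means you set this up once and never think about it again: that's the property you want for correlation.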
Structured Logging: Stop Using Printf
The first pillar to get right is logging, because it's the one you're already doing. Probably badly. (Don't feel attacked — I did it badly for years, and I have a PhD. Education does not immunize you against console.log("HERE 2 !!!!") at 11 PM.)
Here's the thing about unstructured logs: they are write-only data. You write them for comfort — that warm fuzzy feeling of "I'll be able to see what happened." And then you try to query them at scale and discover that parsing User usr_123 created order ord_456 for $99.99 across 500 million log lines with a regex is roughly as fun as doing your taxes in Roman numerals.
// Unstructured — useless at scale
console.log(`User ${userId} created order ${orderId} for $${total}`);
// Output: "User usr_123 created order ord_456 for $99.99"
// Good luck parsing that with a regex across 500 million log lines
// Structured — queryable, filterable, aggregatable
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level(label) { return { level: label }; },
},
timestamp: pino.stdTimeFunctions.isoTime,
});
// Every log entry is a JSON object
logger.info({
event: 'order_created',
user_id: 'usr_123',
order_id: 'ord_456',
total: 99.99,
currency: 'USD',
items_count: 3,
trace_id: context.traceId,
duration_ms: 145,
}, 'Order created successfully');
// Output:
// {
// "level": "info",
// "time": "2026-02-22T10:30:00.000Z",
// "event": "order_created",
// "user_id": "usr_123",
// "order_id": "ord_456",
// "total": 99.99,
// "currency": "USD",
// "items_count": 3,
// "trace_id": "abc123def456",
// "duration_ms": 145,
// "msg": "Order created successfully"
// }

Now I can query: "Show me all orders over $500 that took more than 200ms in the last hour." Try doing that with printf-style logs. Go ahead. I'll wait. (I'll be waiting a long time, because it's basically impossible without wanting to throw your laptop into the ocean.)
The switch to structured logging is one of those changes that feels like overhead when you're writing the code and feels like a superpower when you're debugging at 2 AM. The first time I ran a Kibana query that filtered by event: "order_created" AND total > 500 AND duration_ms > 200 and got results in milliseconds, I felt like I'd been living in a cave and someone just showed me electricity.
Log Level Discipline
Only log at ERROR level if it's something that needs human attention. I've seen systems where 40% of log volume was ERROR-level messages for expected conditions like "user not found" on a search endpoint. That noise makes real errors invisible.
True story: I once inherited a service where every 404 response was logged at ERROR level. The search endpoint returned 404 when no results matched, which was... most searches. The error rate dashboard was a solid wall of red. Actual errors — database connection failures, OOM kills, the stuff that matters — were completely invisible in the noise. The team had learned to ignore the error dashboard entirely. When a real database outage happened, nobody noticed the alerts for 45 minutes because they'd been conditioned to treat the error dashboard like a lava lamp: always moving, never meaningful.
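One way to enforce that discipline is to derive the log level from the response status instead of letting each call site choose. A small sketch with a hypothetical helper:

```typescript
// Pick a log level from an HTTP status code: expected conditions
// (4xx) are info/warn; only server-side failures are errors.
type LogLevel = 'info' | 'warn' | 'error';

function levelForStatus(status: number): LogLevel {
  if (status >= 500) return 'error'; // needs human attention
  if (status === 429) return 'warn'; // throttling is worth a look
  return 'info'; // 404s, validation failures: expected, not errors
}
```

A 404 on a search endpoint now lands at info, so the error dashboard only lights up for genuine failures.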
What to Log (And What Not To)
┌─────────────────────────────────────────────────────────────────┐
│ Logging Decision Guide │
├───────────────────────────┬─────────────────────────────────────┤
│ DO Log: │ DON'T Log: │
├───────────────────────────┼─────────────────────────────────────┤
│ • Business events │ • PII (names, emails, SSNs) │
│ (order created, paid) │ • Authentication tokens/secrets │
│ • State transitions │ • Full request/response bodies │
│ • Integration calls │ (log summaries instead) │
│ (duration, status) │ • Expected conditions at ERROR │
│ • Error context (what │ level (404s, validation fails) │
│ was the request?) │ • High-frequency health checks │
│ • Performance data │ • Duplicate info already in traces │
│ (latency, queue depth) │ • Temporary debug logs (remove!) │
└───────────────────────────┴─────────────────────────────────────┘
That "temporary debug logs (remove!)" entry is a personal attack against my past self. I once left a logger.debug("checking inventory", { items }) in production code that logged the entire inventory payload — including product descriptions — for every single order. Our log storage bill tripled. In one month. Our infra team sent me a Slack message that was just a screenshot of the billing dashboard with a single question mark. I deserved that question mark.
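On the PII side, it helps to redact defensively before anything reaches the logger. Here's a minimal sketch; the helper and the field list are illustrative (pino also ships a built-in `redact` option that does this by path):

```typescript
// Recursively replace known-sensitive fields before they hit the logger.
const SENSITIVE_KEYS = new Set(['email', 'ssn', 'password', 'authorization', 'token']);

function redactForLogging(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(obj)) {
    if (SENSITIVE_KEYS.has(key.toLowerCase())) {
      out[key] = '[REDACTED]'; // drop the value, keep the key for context
    } else if (value && typeof value === 'object' && !Array.isArray(value)) {
      out[key] = redactForLogging(value as Record<string, unknown>); // recurse into nested objects
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Do this in one shared logging wrapper, not at every call site: the whole failure mode of PII in logs is someone forgetting once.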
Distributed Tracing with OpenTelemetry
Distributed tracing changed how I debug production systems. I'm not being dramatic. It's the single biggest upgrade to my incident response capability in the last decade. Instead of correlating timestamps across six different log streams while squinting at my monitor and muttering "this one looks like it happened around the same time as that one," I see the entire request journey in one view. Start to finish. Every service hop. Every database call. Every cache lookup. With timing.
OpenTelemetry (OTel) is the standard now, and I say "now" because if you started this journey a few years ago, you might have scars from the OpenTracing/OpenCensus split. Good news: they merged. Bad news: if you have old Jaeger instrumentation lying around, it's time to migrate. (I had to do this migration. It was not fun. But it was worth it.)
OTel is vendor-neutral, well-supported, and — this is the part I really care about — the instrumentation you write today will work with Jaeger, Datadog, Honeycomb, Grafana Tempo, or whatever you're using next year when your company inevitably switches observability vendors. And they will switch. They always switch.
// Basic OpenTelemetry setup for Node.js
// tracing.ts — load this before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION }
from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: 'order-service',
[ATTR_SERVICE_VERSION]: '1.4.2',
'deployment.environment': process.env.NODE_ENV,
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
}),
instrumentations: [
getNodeAutoInstrumentations({
// Auto-instrument HTTP, Express, PostgreSQL, Redis, etc.
'@opentelemetry/instrumentation-fs': { enabled: false },
}),
],
});
sdk.start();

Auto-instrumentation gives you traces across HTTP calls, database queries, and cache lookups with zero code changes. That alone is worth the setup. But the real value — the stuff that makes you feel like you have x-ray vision into your production system — comes from custom spans around your business logic:
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('order-service');
async function processOrder(order: Order) {
return tracer.startActiveSpan('process_order', async (span) => {
span.setAttribute('order.id', order.id);
span.setAttribute('order.total', order.total);
span.setAttribute('order.items_count', order.items.length);
try {
// Each step gets its own child span automatically
const inventory = await checkInventory(order.items);
const payment = await chargePayment(order);
// Custom span for ML-based fraud check
await tracer.startActiveSpan('fraud_check', async (fraudSpan) => {
fraudSpan.setAttribute('model.version', 'fraud-v3.2');
const score = await fraudModel.predict(order);
fraudSpan.setAttribute('fraud.score', score);
fraudSpan.setAttribute('fraud.flagged', score > 0.8);
fraudSpan.end();
});
await fulfillOrder(order, inventory);
span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
// In TypeScript, a caught value is `unknown` — narrow it first
const err = error instanceof Error ? error : new Error(String(error));
span.setStatus({
code: SpanStatusCode.ERROR,
message: err.message,
});
span.recordException(err);
throw error;
} finally {
span.end();
}
});
}

That fraud_check span is a perfect example of why custom instrumentation matters. Auto-instrumentation would have shown me "an HTTP call happened" or "a model was loaded." But it wouldn't tell me the fraud score, the model version, or whether the order was flagged. When our fraud detection started returning weird scores after a model update, I could search for fraud.score > 0.9 AND model.version = 'fraud-v3.2' and immediately see which orders were affected. Without that custom span? I'd be back to grepping log files and reconstructing timelines by hand. Never again.
Trace Sampling Strategy
Don't trace 100% of requests in production — it's expensive and unnecessary. Sample 100% of errors and slow requests, 10-20% of normal traffic, and 100% of specific operations you're investigating. OpenTelemetry's tail-based sampling lets you decide after the request completes.
A word on sampling, because I learned this one the expensive way too: our first OTel deployment traced 100% of requests. In staging. Worked great. Then we turned it on in production and our tracing backend fell over within an hour. Turns out, 100% sampling on 50,000 requests per minute generates a LOT of data. Who knew? (Everyone knew. I should have known. The documentation literally says "don't do this." I did it anyway. Classic.)
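As a sketch of that sampling policy in the OpenTelemetry Collector's tail_sampling processor (the policy names and thresholds here are illustrative, and you'd tune decision_wait to how long your traces actually take to complete):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans until the whole trace is in
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow-requests
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline-sample
        type: probabilistic
        probabilistic: { sampling_percentage: 15 }
```

Policies are OR-ed: a trace is kept if any policy matches, so errors and slow requests survive at 100% while normal traffic is sampled down.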
Metrics That Matter: RED and USE
Here's a pattern I've seen at every company I've worked at: someone sets up Prometheus and Grafana, gets excited, and instruments EVERYTHING. CPU usage. Memory usage. Garbage collection pauses. Thread counts. File descriptor counts. Connection pool sizes. Cache hit ratios. JVM heap generations. The temperature of the server room (okay, maybe not that one, but I wouldn't be surprised).
Then an incident happens, and everyone stares at 47 dashboards and nobody can figure out which metric actually matters.
The RED and USE methods solve this. They give you a framework for what to actually measure, so you can stop playing "which of these 200 graphs is relevant right now" during an outage.
┌─────────────────────────────────────────────────────────────────┐
│ Metrics Frameworks │
├─────────────────────────────────────────────────────────────────┤
│ │
│ RED Method (for request-driven services): │
│ ────────────────────────────────────────── │
│ R — Rate: Requests per second │
│ E — Errors: Failed requests per second │
│ D — Duration: Latency distribution (p50, p95, p99) │
│ │
│ USE Method (for infrastructure resources): │
│ ────────────────────────────────────────── │
│ U — Utilization: % of resource capacity in use │
│ S — Saturation: Queue depth / work waiting │
│ E — Errors: Error count for the resource │
│ │
│ Apply RED to: API endpoints, background jobs, ML inference │
│ Apply USE to: CPU, memory, disk, network, DB connections │
│ │
└─────────────────────────────────────────────────────────────────┘
RED for your services, USE for your infrastructure. That's it. That's the tweet. (I mean, there's more nuance, but that framing alone will get you 80% of the way there.)
In practice, here's what I instrument for every service on day one — not "eventually" or "when we get to it," but before the service goes to production:
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('order-service');
// RED metrics for the order endpoint
const requestCounter = meter.createCounter('http.requests.total', {
description: 'Total HTTP requests',
});
const requestDuration = meter.createHistogram('http.request.duration_ms', {
description: 'HTTP request duration in milliseconds',
unit: 'ms',
});
const errorCounter = meter.createCounter('http.errors.total', {
description: 'Total HTTP errors',
});
// Middleware that captures RED metrics automatically
function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
const start = performance.now();
res.on('finish', () => {
const duration = performance.now() - start;
const labels = {
method: req.method,
path: req.route?.path || 'unknown',
status_code: res.statusCode.toString(),
};
requestCounter.add(1, labels);
requestDuration.record(duration, labels);
if (res.statusCode >= 500) {
errorCounter.add(1, labels);
}
});
next();
}

Histogram Bucket Trap
Default histogram buckets are rarely right for your service. If your API typically responds in 5-50ms, the default buckets (up to 10s) will lump all your normal traffic into one bucket. Configure buckets based on your actual latency distribution: [5, 10, 25, 50, 100, 250, 500, 1000, 2500].
That histogram bucket thing? Let me tell you about the time I spent TWO HOURS during an incident staring at a latency histogram that showed "everything is in the 0-1s bucket" and thinking "great, latency is fine!" It was not fine. Our p99 had spiked to 800ms. But because the default bucket boundaries were [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10], everything from 0ms to 1000ms was in the same bucket. The histogram was technically correct — the requests were under 1 second — but it was hiding a 16x increase in tail latency. Configure your buckets, people. Configure. Your. Buckets.
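To see the trap concretely, here's a toy bucket-assignment function (hypothetical, not a real Prometheus or OTel API) comparing the default second-based boundaries against tuned millisecond ones:

```typescript
// Which histogram bucket does an observation fall into?
function bucketFor(value: number, upperBounds: number[]): string {
  for (const bound of upperBounds) {
    if (value <= bound) return `<=${bound}`;
  }
  return '+Inf';
}

// Prometheus client defaults, in seconds:
const defaults = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10];
// A degraded 550ms request and a very degraded 800ms one are
// indistinguishable — both land in the "<=1" bucket:
bucketFor(0.55, defaults); // "<=1"
bucketFor(0.8, defaults);  // "<=1"

// Tuned for an API that normally answers in 5-50ms, in milliseconds:
const tuned = [5, 10, 25, 50, 100, 250, 500, 1000, 2500];
bucketFor(45, tuned);  // "<=50"  — normal traffic
bucketFor(800, tuned); // "<=1000" — tail latency now visible
```

If you're on OpenTelemetry metrics in Node, recent @opentelemetry/sdk-metrics versions let you override boundaries per instrument via a View with an explicit-bucket histogram aggregation; the mechanism differs by SDK, but the principle (boundaries must bracket your real latency distribution) doesn't.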
Alerting Without Fatigue
Alert fatigue is real, it's dangerous, and it will quietly destroy your on-call rotation from the inside. I've watched it happen. When your on-call engineer ignores the 50th alert this week — because the last 49 were false positives or non-actionable noise — the 51st might be an actual outage that costs real money and affects real users.
The rule I follow now (after learning the hard way, obviously — there's a theme here): every alert must be actionable, and every page must require a human decision within 15 minutes. If it can wait until morning, it's not a page. If there's nothing a human can do about it, it's not a page. If the system should auto-recover, it's not a page — it's a monitoring item. Simple, right? You'd be amazed how many teams page their on-call for "disk usage is at 82%." What is the on-call engineer supposed to do at 3 AM? Go buy a bigger hard drive?
┌─────────────────────────────────────────────────────────────────┐
│ Alerting Hierarchy │
├─────────────────────────────────────────────────────────────────┤
│ │
│ PAGE (wake someone up): │
│ • Error rate > 5% for 5 minutes │
│ • p99 latency > 2s for 10 minutes │
│ • SLO burn rate consuming daily budget in 1 hour │
│ • Zero successful requests for 2 minutes │
│ │
│ TICKET (fix during business hours): │
│ • Error rate > 1% for 30 minutes │
│ • Disk usage > 80% │
│ • Certificate expiring in < 14 days │
│ • Dependency deprecation warnings │
│ │
│ DASHBOARD ONLY (informational): │
│ • CPU utilization trends │
│ • Request rate changes │
│ • Cache hit ratios │
│ • Individual 500 errors (already counted in error rate) │
│ │
└─────────────────────────────────────────────────────────────────┘
The biggest lesson — and I cannot stress this enough — alert on symptoms, not causes. "Error rate is above 5%" is actionable. "CPU is at 80%" usually isn't, because maybe that's just what your service looks like under normal load during business hours. If high CPU is actually causing problems, the error rate alert or the latency alert will fire anyway. You don't need a CPU alert as a middleman; you need alerts that tell you users are affected.
I once worked on a team that had 147 alert rules. One hundred and forty-seven. Most of them were cause-based: "disk approaching 70%," "memory usage above 60%," "thread count increasing." The on-call person got paged about 8 times per day on a quiet day. They had developed a Pavlovian response to their phone buzzing — not urgency, but resignation. When we rewrote the alerting to be symptom-based and SLO-driven, we went down to about 3 pages per week. The same system. The same failure modes. Just better signal-to-noise ratio. The on-call engineers started sleeping again. Morale improved measurably.
# Prometheus alerting rules example
groups:
- name: order-service-slos
rules:
# Alert on high error rate (symptom-based)
- alert: HighErrorRate
expr: |
sum(rate(http_errors_total{service="order-service"}[5m]))
/
sum(rate(http_requests_total{service="order-service"}[5m]))
> 0.05
for: 5m
labels:
severity: page
annotations:
summary: "Order service error rate above 5%"
runbook: "https://wiki.internal/runbooks/order-service-errors"
# Alert on SLO burn rate (proactive)
- alert: SLOBurnRateHigh
expr: |
slo:burn_rate:5m{service="order-service"} > 14.4
and
slo:burn_rate:1h{service="order-service"} > 14.4
for: 2m
labels:
severity: page
annotations:
summary: "Order service burning through error budget too fast"

Runbooks Are Non-Negotiable
Every alert must link to a runbook. When you're paged at 3 AM, you should not have to think about what to check first. The runbook should list: what the alert means, what to check, common causes, and how to mitigate.
I once got paged for an alert that said "SLO burn rate high." That's it. No runbook link. No context. No suggested first steps. I spent the first 20 minutes of the incident just figuring out what the alert meant. Twenty minutes! During an active incident! The runbook thing isn't optional — it's the difference between a 20-minute resolution and a 60-minute "what am I even looking at" marathon. Write the runbook when you create the alert. Not "later." Not "when we have time." Now. Future-you-at-3-AM will be grateful.
SLOs, SLIs, and Error Budgets
SLOs (Service Level Objectives) transformed how my team thinks about reliability. Before SLOs, our reliability goal was the vague, impossible "make everything as reliable as possible." Which sounds noble but is operationally useless. How reliable is "as possible"? 100%? (That's not possible.) 99.99%? (That costs a fortune.) 99.9%? (That might be fine.) Without a number, "reliable" is just a feeling, and feelings don't help you make engineering trade-offs.
With SLOs, you have a concrete target and — this is the magic part — a budget for failure.
┌─────────────────────────────────────────────────────────────────┐
│ SLO Framework │
├─────────────────────────────────────────────────────────────────┤
│ │
│ SLI (Indicator): What you measure │
│ "Proportion of requests that return successfully │
│ within 500ms" │
│ │
│ SLO (Objective): Your target for the SLI │
│ "99.9% of requests succeed within 500ms over │
│ a 30-day rolling window" │
│ │
│ Error Budget: How much failure you can afford │
│ 0.1% = ~43 minutes of downtime per month │
│ or ~4,320 failed requests per 4.32M total │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Error Budget Remaining: 67% ████████████░░░░░░ │ │
│ │ Days into window: 15/30 │ │
│ │ Status: HEALTHY — safe to deploy │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ When budget is exhausted: │
│ → Freeze non-critical deployments │
│ → Focus engineering on reliability improvements │
│ → Conduct incident reviews for budget-burning events │
│ │
└─────────────────────────────────────────────────────────────────┘
The error budget concept is what makes SLOs practical instead of aspirational. Without it, reliability is a religion — "we must have zero errors, zero downtime, zero problems." That's not engineering; that's wishful thinking. With an error budget, reliability becomes an engineering trade-off — "we have room for X failures this month, so we can ship this feature and accept a little risk, OR we've burned through our budget and it's time to pause feature work and focus on stability."
This is genuinely the most powerful framing shift I've experienced in my career. It turns the "move fast vs. be reliable" argument from a philosophical debate into a conversation backed by data. Product wants to ship a risky feature? Check the error budget. Plenty of budget left? Ship it. Budget's thin? Maybe we harden the deployment or wait until next month. No feelings hurt, no turf wars — just math.
For my AI inference services, I define two SLOs:
// SLO definitions for an AI inference service
const slos = {
availability: {
sli: 'Proportion of inference requests that return a valid prediction',
target: 0.999, // 99.9%
window: '30d',
// Excludes: client errors (4xx), scheduled maintenance
},
latency: {
sli: 'Proportion of inference requests completing within threshold',
target: 0.99, // 99%
threshold_ms: 500,
window: '30d',
// p99 latency must stay under 500ms
},
};Why two SLOs? Because a service can be "available" (returning responses) while being unacceptably slow, and it can be "fast" while returning garbage. I learned this when our inference service was technically responding to 100% of requests — it's just that 5% of those responses were the model's default fallback prediction because the feature store was timing out. Availability: perfect. Usefulness: questionable. You need both dimensions.
Dashboard Design That Works
A wall of dashboards is not observability. It's interior decoration. I've seen NOCs (Network Operations Centers) with 30 screens showing beautiful graphs in glorious high-definition color that nobody — NOBODY — looks at during an actual incident. They look at Slack. They look at the alert. They look at the one Grafana panel they have bookmarked. The other 29 screens are expensive screen savers.
Every dashboard should answer a specific question. If you can't articulate what question a dashboard answers, delete it. (Yes, really.) I keep three levels, and I ruthlessly prune anything that doesn't fit:
┌─────────────────────────────────────────────────────────────────┐
│ Dashboard Hierarchy │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Level 1: Service Health (the "glance" dashboard) │
│ ───────────────────────────────────────────── │
│ • One row per service │
│ • SLO status (green/yellow/red) │
│ • Error budget remaining │
│ • Current error rate and p99 latency │
│ Purpose: "Is anything on fire right now?" │
│ │
│ Level 2: Service Deep Dive │
│ ───────────────────────── │
│ • RED metrics over time │
│ • Latency heatmap │
│ • Error breakdown by type │
│ • Dependency health │
│ • Recent deployments overlay │
│ Purpose: "This service has a problem, what kind?" │
│ │
│ Level 3: Investigation │
│ ──────────────────── │
│ • Trace search and exploration │
│ • Log search with filters │
│ • Database query performance │
│ • Infrastructure metrics (USE) │
│ Purpose: "I know the problem area, show me details" │
│ │
└─────────────────────────────────────────────────────────────────┘
Level 1 is the one I look at first during any incident. If all the rows are green, the problem probably isn't in our services (check the CDN, the DNS, the things outside your blast radius). If a row is red, I click into Level 2 for that service. Level 2 tells me what kind of problem — is it errors? latency? a dependency? Then Level 3 is where the actual detective work happens.
This hierarchy maps to the natural flow of incident response: "Is something wrong?" → "What service is it?" → "What specifically is broken?" If your dashboards don't follow this flow, people will skip them entirely and go straight to grepping logs. I've seen it happen a hundred times.
Deployment Markers
Always overlay deployment timestamps on your metrics dashboards. The most common cause of production issues is "we deployed something." Being able to visually correlate a latency spike with a deployment saves precious minutes during incidents.
The deployment markers thing is not optional. I'd estimate that 70% of the production incidents I've investigated were caused by, correlated with, or worsened by a recent deployment. Being able to look at a latency graph and immediately see "oh, there was a deploy 12 minutes before this spike" is worth its weight in gold. Without that marker, you'd spend 20 minutes running git log and Slack-searching "did anyone deploy recently?" while the incident clock is ticking.
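One lightweight way to get those markers is to post an annotation from your deploy pipeline. A sketch: the payload shape matches Grafana's annotations HTTP API (POST /api/annotations), but the helper and the environment-variable names are my assumptions:

```typescript
// Hypothetical deploy-time hook: build a Grafana annotation payload.
function deployAnnotation(service: string, version: string, timeMs: number) {
  return {
    time: timeMs,                     // epoch milliseconds
    tags: ['deployment', service],    // filter dashboards by these tags
    text: `Deployed ${service} ${version}`,
  };
}

// In CI/CD, after each deploy, roughly:
//   await fetch(`${process.env.GRAFANA_URL}/api/annotations`, {
//     method: 'POST',
//     headers: {
//       Authorization: `Bearer ${process.env.GRAFANA_TOKEN}`,
//       'Content-Type': 'application/json',
//     },
//     body: JSON.stringify(
//       deployAnnotation('order-service', '1.4.2', Date.now()),
//     ),
//   });
```

Tag the annotation with the service name so each service's dashboard only shows its own deploys, not every deploy across the company.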
On-Call That Doesn't Burn People Out
Observability is ultimately about people. I know that sounds like something you'd read on a motivational poster in a WeWork, but I mean it earnestly. The best instrumentation in the world doesn't help if your on-call engineer is burnt out from false alarms, sleep-deprived from unnecessary pages, and missing the context they need to actually fix problems when they're real.
I've been that burnt-out engineer. I've had on-call rotations where I dreaded the week so much that it affected my work the week before. Nobody does good engineering when they're anxious about their phone buzzing. Fixing this isn't just "nice to have" — it's a retention issue, a quality issue, and honestly, a human decency issue.
Practices that materially improved on-call quality for my teams:
- Every page gets a blameless postmortem if it lasts more than 15 minutes. Not to punish anyone — the word "blameless" is doing critical work in that sentence — but to improve the system. If the same alert fires three times in a month, the system has a bug, not the engineer.
- On-call handoff includes context: what alerts fired this week, what's currently degraded, what deployments are in flight. I've seen handoffs that were literally "good luck." That's not a handoff, that's abandonment. Write a paragraph. Share the open tickets. Mention the thing that's been flaky. It takes 10 minutes and saves hours.
- Shadow on-call for new team members: pair them with an experienced engineer for their first rotation. They observe, ask questions, and build confidence before flying solo. Throwing a junior engineer into on-call with no shadowing is how you get both bad incident response AND an updated resume on LinkedIn.
- Toil budget: if more than 30% of on-call time is spent on repetitive manual tasks, that's a signal to automate. Track it. "I restarted the cache server 4 times this week" is not an on-call experience — it's a cron job that hasn't been written yet.
- Compensatory time off: if someone gets paged at 3 AM and spends two hours fixing an issue, they should take time off the next day. On-call is not free labor. I know this sounds obvious, but I've worked at companies where it wasn't the norm, and the difference in morale between "take tomorrow morning off" and "see you at standup at 9" is enormous.
The Observability Maturity Journey
You don't need to implement everything in this article at once. Please don't try. I've seen teams attempt the "big bang observability overhaul" and end up with a half-configured OpenTelemetry setup, three different logging libraries, and a Grafana instance that nobody knows the password to. (I wish I was kidding.)
Here's the progression I recommend, based on watching multiple teams go through this:
Phase 1: Foundations (week 1-2)
- Structured JSON logging with correlation IDs
- Basic health check endpoints
- Error rate and latency dashboards
This alone will transform your debugging. Seriously. If you do nothing else, do this. The jump from console.log to structured logging with trace IDs is the single biggest bang-for-buck improvement in this entire article.
Phase 2: Tracing (week 3-4)
- OpenTelemetry auto-instrumentation
- Custom spans for critical business logic
- Trace-to-log correlation
This is where things start feeling like magic. The first time you click a trace and see the entire request flow across five services with timing breakdowns... you'll wonder how you ever debugged without it.
Phase 3: SLOs (month 2)
- Define SLIs and SLOs for each service
- Error budget tracking
- Symptom-based alerting tied to SLOs
This is where the cultural shift happens. You go from "vibes-based reliability" to "data-driven reliability." Product and engineering stop arguing about whether to ship or stabilize, because the error budget answers the question for them.
Phase 4: Culture (ongoing)
- Runbooks for every alert
- Blameless postmortems
- On-call quality improvements
- Regular SLO review meetings
This phase never ends, and that's fine. It's maintenance, not a project. Review your SLOs quarterly. Update your runbooks when things change. Keep making on-call suck less.
The tools matter less than the practices. I'll say that again because it's the most counterintuitive thing in this article: the tools matter less than the practices. I've seen teams with six-figure Datadog contracts who still debug by SSHing into servers and tailing logs, and teams running Prometheus + Grafana + Jaeger on a shoestring budget who resolve incidents in minutes. The difference is always culture and discipline, not technology. The expensive platform doesn't help if nobody writes structured logs. The open-source stack works great if people actually instrument their code and write runbooks.
Start with structured logs and correlation IDs. Everything else builds on that foundation. And the next time something breaks at 3 AM — because it will, that's the nature of distributed systems — you'll thank yourself for investing in observability before you needed it. Or more accurately, you'll think "at least I can actually figure out what's happening" instead of "I have no idea what's happening and I want to cry." Both are valid 3 AM emotions. But only one of them leads to a resolution.
Osvaldo Restrepo
Senior Full Stack AI & Software Engineer. Building production AI systems that solve real problems.