Observability in Production: Logs, Metrics, and Traces Done Right
A comprehensive guide to building a production observability stack with structured logging, meaningful metrics, distributed tracing, and effective alerting.
Monitoring Is Not Observability
Most engineering teams have monitoring. They have dashboards that show CPU usage, memory consumption, and request counts. They have alerts that fire when disk space runs low or a health check fails. And yet, when something goes wrong in production, they still spend hours guessing what happened.
The difference between monitoring and observability is the difference between knowing that something is wrong and understanding why it is wrong. Monitoring tells you that your API response time spiked to 3 seconds. Observability tells you that the spike was caused by a specific database query, triggered by a new feature flag that was enabled 20 minutes ago, affecting users in the EU region because that query joins a table that was recently migrated to a different partition scheme.
Observability is the ability to understand your system's internal state by examining its outputs. It rests on three pillars - logs, metrics, and traces - and the real power comes from correlating all three.
Pillar 1: Structured Logging
Logs are the most familiar observability signal, but most teams use them poorly. Unstructured log lines like "Error processing request for user 12345" are almost useless at scale. You cannot filter them, aggregate them, or correlate them with other signals without complex regex parsing.
How to Log Effectively
Use structured JSON format. Every log entry should be a JSON object with consistent fields:
{
  "timestamp": "2024-07-05T14:23:01.234Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-api",
  "request_id": "req_abc123",
  "user_id": "usr_789",
  "payment_id": "pay_456",
  "error_code": "STRIPE_CARD_DECLINED",
  "duration_ms": 1250
}
This format lets you search, filter, and aggregate logs programmatically. Find all errors for a specific user. Count payment failures by error code. Calculate average processing duration.
Establish mandatory fields. Every log entry across every service should include: timestamp, log level, service name, request ID, and message. Additional context fields vary by domain.
Propagate request IDs. Generate a unique request ID at your API gateway or load balancer and pass it through every service in the chain. When a user reports an issue, you can trace their entire request journey by filtering on a single ID.
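As a sketch of the two practices above — structured JSON output plus a propagated request ID — here is a minimal formatter using only the Python standard library. The field names mirror the example entry; the `JsonFormatter` class and service name are illustrative, not a prescribed API:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with the mandatory fields."""

    # Context fields that callers may attach per-entry via `extra=`
    CONTEXT_FIELDS = ("request_id", "user_id", "payment_id",
                      "error_code", "duration_ms")

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

logger = logging.getLogger("payment-api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payment-api"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The request ID arrives from the gateway and is attached to every entry.
logger.error("Payment processing failed",
             extra={"request_id": "req_abc123",
                    "error_code": "STRIPE_CARD_DECLINED"})
```

In a real service the `extra` dict would be populated by middleware from the incoming request, so individual call sites never handle the request ID directly.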
Log at the right level.
- ERROR: Something failed that should not have. Requires investigation. Examples: unhandled exceptions, failed external API calls that affect users, data integrity violations.
- WARN: Something unexpected happened but was handled. Worth monitoring for trends. Examples: a retry succeeded after an initial failure, a deprecated API endpoint was called, a rate limit is approaching its threshold.
- INFO: Significant business events. Examples: user registered, payment processed, deployment completed, feature flag changed.
- DEBUG: Detailed diagnostic information. Disabled in production by default, but can be enabled per service or per request for troubleshooting.
Control log volume. Logging everything at DEBUG level in production will bury useful information and inflate your observability costs. Use sampling for high-volume paths and ensure your logging framework supports dynamic log level changes so you can increase verbosity for specific services without redeploying.
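One way to implement in-process sampling — a sketch, not a substitute for collector-side sampling — is a logging filter that keeps every WARNING-and-above record but only a fraction of lower-level ones. The `SamplingFilter` name and the 10% rate are illustrative:

```python
import logging
import random

class SamplingFilter(logging.Filter):
    """Keep all WARNING+ records; pass lower levels through at a sampled rate."""

    def __init__(self, sample_rate):
        super().__init__()
        self.sample_rate = sample_rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return random.random() < self.sample_rate

# Sample 10% of INFO/DEBUG records on a high-volume path.
noisy_logger = logging.getLogger("checkout.hot_path")
noisy_logger.addFilter(SamplingFilter(0.1))
```

Because the filter is attached at runtime, the rate can be adjusted dynamically alongside log levels, without redeploying.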
Log Aggregation and Search
Centralize your logs in a searchable system. The major options:
- Datadog Logs - excellent search, correlation with metrics and traces, but can be expensive at high volume
- Grafana Loki - cost-effective log aggregation designed to work with Grafana dashboards, uses label-based indexing rather than full-text indexing
- Elasticsearch/OpenSearch - powerful full-text search, but requires significant operational expertise to run at scale
- AWS CloudWatch Logs - sufficient for smaller applications already on AWS, but limited query capabilities compared to dedicated tools
Pillar 2: Metrics That Drive Action
Metrics are numerical measurements collected at regular intervals. They tell you how your system is performing over time and are the foundation of alerting and capacity planning.
The Four Golden Signals
Google's Site Reliability Engineering book defines four golden signals that every service should track:
Latency: The time it takes to serve a request. Track both successful and failed requests separately - a fast error is very different from a slow success. Measure P50, P95, and P99 percentiles rather than averages. An average response time of 200ms can hide a P99 of 5 seconds that affects thousands of users daily.
Traffic: The demand on your system. For a web application, this is usually requests per second. For a message queue, messages per second. For a database, queries per second. Traffic metrics help you understand capacity needs and detect anomalies.
Errors: The rate of failed requests. Track both explicit errors (HTTP 5xx responses) and implicit errors (HTTP 200 responses with incorrect content, or requests that succeed but exceed an SLA threshold). Express error rate as a percentage of total traffic for meaningful comparison over time.
Saturation: How full your system is. CPU utilization, memory usage, disk I/O, database connection pool utilization, and queue depth are all saturation signals. Saturation metrics are your early warning system - they tell you about problems before they affect users.
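To make the latency point above concrete — an average near 200 ms hiding a 5-second P99 — here is a stdlib-only sketch over a hypothetical latency sample:

```python
import statistics

# Hypothetical sample: 97 fast requests plus a slow 5-second tail.
latencies_ms = [50] * 97 + [5000] * 3

mean_ms = statistics.mean(latencies_ms)
# quantiles(n=100) returns the 1st..99th percentile cut points.
cuts = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"mean={mean_ms:.1f}ms p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
# The mean looks healthy (~198 ms) while P99 exposes the real problem (5000 ms).
```

In production you would not compute percentiles from raw samples like this; a metrics system does it with histograms. The arithmetic, however, is exactly why dashboards should plot percentiles rather than means.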
Application-Specific Metrics
Beyond the golden signals, instrument metrics specific to your business logic:
- Business metrics: signups per minute, orders processed per hour, payment success rate
- Feature metrics: usage by feature flag, adoption of new features, A/B test cohort sizes
- Dependency metrics: external API response times, cache hit rates, database query duration by query type
- Queue metrics: queue depth, consumer lag, processing time per message
Metrics Infrastructure
Use a time-series database. Prometheus is the industry standard for metrics collection, often paired with Grafana for visualization. For managed solutions, Datadog, Grafana Cloud, or AWS CloudWatch Metrics eliminate operational overhead.
Adopt consistent naming conventions. Use a standard format like service_subsystem_metric_unit, and stick to one base unit throughout (the Prometheus convention is seconds). For example: payment_api_stripe_request_duration_seconds, auth_service_login_attempts_total, order_api_database_query_duration_seconds.
Set meaningful retention policies. Keep high-resolution data (every 10-15 seconds) for 30 days, downsampled data for 6-12 months, and aggregated data indefinitely. This balances storage cost with the ability to investigate recent incidents and analyze long-term trends.
Pillar 3: Distributed Tracing
In a distributed system - whether you have microservices, serverless functions, or even a monolith that calls external APIs - understanding the full journey of a request is critical. Distributed tracing provides this visibility.
How Tracing Works
A trace represents the complete journey of a request through your system. It consists of spans, each representing a unit of work: an API call, a database query, a cache lookup, or a message publish. Spans are nested to show parent-child relationships.
When a request enters your system, a unique trace ID is generated and propagated through every service call via HTTP headers (typically using the W3C Trace Context standard). Each service creates spans for its work and reports them to a trace collector that assembles the full picture.
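The W3C Trace Context header has the shape version-trace_id-span_id-flags. A minimal sketch of minting and propagating it — helper names are illustrative, and real services should use an OpenTelemetry propagator rather than hand-rolling this:

```python
import secrets

def new_traceparent():
    """Mint a traceparent header for a request entering the system at the edge."""
    trace_id = secrets.token_hex(16)  # 32 hex chars, shared by the whole trace
    span_id = secrets.token_hex(8)    # 16 hex chars, unique per span
    return f"00-{trace_id}-{span_id}-01"

def propagate(incoming):
    """Keep the trace ID from the incoming header; mint a new span ID downstream."""
    version, trace_id, _parent_span, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

header = new_traceparent()        # set at the API gateway
downstream = propagate(header)    # forwarded on every outgoing call
```

The key property is visible in the two values: the trace ID survives every hop, while each service contributes its own span ID, which is what lets the collector reassemble the tree.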
What to Trace
Instrument at service boundaries. Every HTTP request, gRPC call, and message queue interaction should automatically generate spans. Most tracing libraries and frameworks handle this automatically.
Add custom spans for critical operations. Database queries, cache operations, external API calls, and significant business logic steps should have their own spans. This is where the debugging value comes from - when a request is slow, you can see exactly which operation took the time.
Attach contextual attributes. Add metadata to spans: user ID, tenant ID, feature flags, request parameters (excluding sensitive data), and any other context that helps with debugging.
Tracing Infrastructure
OpenTelemetry has become the standard for instrumentation. It provides vendor-neutral SDKs for all major languages that can export traces (and metrics and logs) to any compatible backend. Invest in OpenTelemetry and you avoid vendor lock-in.
Backend options:
- Jaeger - open-source, strong for debugging, limited analytics
- Grafana Tempo - cost-effective trace storage designed for the Grafana ecosystem
- Datadog APM - excellent UX, good correlation with logs and metrics, premium pricing
- Honeycomb - purpose-built for observability with powerful query capabilities, particularly good for high-cardinality exploration
Connecting the Three Pillars
The real power of observability comes from correlating logs, metrics, and traces. When an alert fires on a latency metric:
- Start with the metric to understand the scope: which endpoints are affected, when did it start, how severe is it?
- Pivot to traces to find example slow requests and identify which span (operation) is responsible for the latency increase.
- Drill into logs for the specific trace ID to understand the detailed error messages, input parameters, and context that explain why that operation was slow.
This correlation requires shared context - specifically, the trace ID and request ID should appear in all three signals. When you configure your logging library to include the current trace ID in every log entry, and your tracing library to link to related log queries, you enable seamless investigation across all three pillars.
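A sketch of that shared-context requirement — stamping the active trace ID onto every log record — using contextvars plus a logging filter. Names are illustrative; OpenTelemetry's logging instrumentation does this automatically:

```python
import contextvars
import logging

# Set once per request; visible to everything running on that request's context.
current_trace_id = contextvars.ContextVar("trace_id", default=None)

class TraceContextFilter(logging.Filter):
    """Copy the active trace ID onto every record so logs can be joined to traces."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True

logger = logging.getLogger("order-api")
logger.addFilter(TraceContextFilter())

# Request middleware would set this from the incoming traceparent header.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
```

With the filter in place, a JSON formatter can emit trace_id on every entry, and filtering logs by one trace ID reconstructs everything that happened during that request.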
Alerting That Does Not Burn Out Your Team
Good observability data is wasted if your alerting strategy is wrong. Alert fatigue - too many noisy, non-actionable alerts - is one of the most common reasons teams disable alerts and miss real incidents.
Alerting Principles
Alert on symptoms, not causes. Alert when users are affected (high error rate, elevated latency) rather than on internal signals (high CPU usage) that may or may not affect users. CPU at 90% is not a problem if response times are normal.
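The symptom-versus-cause distinction can be expressed as a paging predicate. A hedged sketch with illustrative SLO thresholds — the function name and defaults are assumptions, not a standard:

```python
def should_page(total_requests, failed_requests, latency_p99_ms,
                error_budget=0.01, latency_slo_ms=500):
    """Page only on user-facing symptoms: error rate or tail latency over SLO.

    Note what is absent: CPU, memory, and other saturation signals belong in
    warning-level alerts, not pages, unless they surface here as symptoms.
    """
    error_rate = failed_requests / max(total_requests, 1)
    return error_rate > error_budget or latency_p99_ms > latency_slo_ms
```

A box running at 90% CPU with healthy error rates and latency never reaches the on-call engineer; the same box with a breached P99 does, for the right reason.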
Use severity levels deliberately. Define what each level means:
- Critical (pages on-call): User-facing impact right now. Service is down, error rate exceeds SLA, data is at risk.
- Warning (Slack notification): Something is trending in the wrong direction. Approaching capacity limits, elevated error rate below SLA threshold, degraded but functional.
- Informational (dashboard only): Noteworthy events with no action required. Deployment completed, autoscaling triggered, backup succeeded.
Require runbooks for every paging alert. When an engineer is woken at 3 AM, the alert should link to a runbook that explains: what the alert means, how to assess severity, what immediate actions to take, and when to escalate. No alert should require the on-call engineer to decode what is happening from scratch.
Review and prune alerts quarterly. If an alert has not fired in six months, consider removing it. If an alert fires frequently but never requires action, fix the underlying issue or remove the alert. Every alert should be actionable.
Building Your Observability Stack
For teams starting from zero, here is a practical implementation order:
- Week 1: Add structured logging with request ID propagation. Deploy a log aggregation tool.
- Week 2: Instrument the four golden signals for your primary services. Set up dashboards in Grafana or your metrics platform.
- Week 3: Add OpenTelemetry instrumentation for distributed tracing. Connect traces to logs via trace ID.
- Week 4: Configure alerting for critical user-facing symptoms. Write runbooks for each alert.
- Ongoing: Add application-specific metrics, expand trace coverage, and refine alerts based on operational experience.
This phased approach gives you meaningful observability quickly without overwhelming the team.
At InfoDive Labs, our cloud architecture and DevOps teams specialize in building production-grade observability stacks that give engineering teams the visibility they need to ship confidently and resolve incidents quickly. From OpenTelemetry instrumentation to Grafana dashboard design to alerting strategy reviews, we help companies move from reactive firefighting to proactive operational excellence. If your team is struggling with production visibility, we would be glad to help you build an observability practice that scales.