05 Jan 2026 8 min read observability

Observability in Software: Logs, Metrics & Traces

Learn what observability is, why it matters, and how logs, metrics, and traces help you understand system behavior and performance.


Observability is a must-have part of software development, and it shouldn't be ignored at any stage of the lifecycle. It helps you understand the system and how it behaves. More importantly, it answers questions about the current state of the system.

Observability should be considered during system design. You might not include it in the MVP, but you should plan how to add it later; otherwise, introducing it at later stages can become complex, costly, and error-prone.

📌

A business can't grow without customers and an understanding of their needs through feedback. Likewise, software can't grow without an understanding of how it works internally.

Questions

When developing software, you should be able to answer questions such as:

Are the services available?
How many RPS (requests per second) can the system withstand?
What is the RT (response time) of different parts of the system? How long does a user need to wait before a request is processed?
Are there traffic peaks when the system is under heavy load?
When does system performance degrade?
Which part of the system is the bottleneck?
Why do some requests fail?
Which HTTP status codes does the system return? What is the ratio of 4xx/5xx responses compared to 2xx?
Which part of the system should be optimized first?

As you can see, there are many questions you need to ask when developing software. Nothing provides better insights than the system itself.

📓

Large companies collect huge amounts of metrics and other observability data, often requiring petabytes (PB) of storage or more.

Example

For example, imagine you're developing an online store with a multi-step order flow that includes stock reservation, order creation, and communication with a bank. Suddenly, there are no orders for quite a long period of time, and you have also received some negative emails from customers. You check the website and try to place an order, but it fails with an unknown error.

As you remember, the order flow is quite complex, and without any observability it would be a tough task to find what's broken and what needs to be fixed.

Possible reasons for the failure could include:

Network issues preventing services from communicating with each other
Database being down or running out of space
The bank blocking your account
Requests timing out
The application being unavailable

Observability is a powerful tool that helps you understand how an application works and makes its behavior more predictable. There are three main pillars (logging, tracing, and metrics), which together provide most of what is needed to effectively observe and operate modern applications.

💡

Below is a short description of each observability pillar, but more detailed explanations will be covered in future articles.

Logging

Logging is the most well-known part of observability, and it's widely used across various applications. It provides a detailed description of events in the system. Usually, a log entry contains an event ID, a timestamp, a level, a message, and custom details.

Log types

Logs are divided into two types:

Unstructured (plain text) - stored as raw text
Structured - a data object with a defined serialization format (for example, JSON). Structured logs allow faster querying and can reduce storage costs.
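To make the difference concrete, here is a minimal sketch using Python's standard logging module. The JsonFormatter class, the "orders" logger name, and the convention of passing custom fields under extra={"details": ...} are assumptions of this example, not features of any particular logging backend.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one per line)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Custom details arrive via extra={"details": {...}} (a convention of this sketch).
        details = getattr(record, "details", None)
        if details:
            payload["details"] = details
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

# Unstructured: the order ID is buried in free text and needs a regex to extract.
unstructured_line = "2026-01-05 12:00:00 INFO order 42 created, total 19.99"

# Structured: the same event with named, machine-readable fields.
logger.info("order created", extra={"details": {"order_id": 42, "total": 19.99}})
structured_line = stream.getvalue().strip()
print(structured_line)
```

With the structured form, a backend can filter on details.order_id directly instead of parsing the message text.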

Log levels

Log levels describe the severity of a log entry. They're usually used for filtering and can be enabled or disabled at runtime. For example, the DEBUG level can be enabled for a short period of time to observe a specific request pipeline.
Log levels, from lowest to highest severity (more verbose → fewer events):

TRACE - every step of a process (method invocation, method returned)
DEBUG - diagnostic information (method arguments, intermediate values)
INFO - application lifecycle and business events (order created, user logged in)
WARN - unexpected situations that do not break the application (order not found, retries)
ERROR - failed operations (failed to process order, failed to accept payment)
FATAL - application is unusable or near failure (database unavailable, thread exhaustion, high memory usage)
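The filtering behavior can be sketched with Python's standard logging module. Note that this module has no built-in TRACE level and names the highest level CRITICAL (logging.FATAL is an alias for it). The ListHandler class and the "checkout" logger name are assumptions made for this illustration.

```python
import logging


class ListHandler(logging.Handler):
    """Keep emitted messages in memory so the effect of the level is visible."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


logger = logging.getLogger("checkout")
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

# Normal operation: INFO and above are recorded, DEBUG is filtered out.
logger.setLevel(logging.INFO)
logger.debug("cart contents: %s", ["book", "pen"])  # dropped
logger.info("order created")
logger.warning("payment retry %d", 1)

# During an investigation the level is lowered at runtime, without a restart.
logger.setLevel(logging.DEBUG)
logger.debug("cart contents: %s", ["book", "pen"])  # now recorded

print(handler.messages)
```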

Best practices

Best practices for what and when to log:

Log every exception (ERROR level)
Log unexpected situations (WARN level)
Log events that may help with incident investigation (INFO level)
Use DEBUG logs when an issue can't be reproduced locally or in staging
Avoid logging high-volume events, as they consume significant storage
Never log sensitive data, such as customer names, addresses, identification numbers, or phone numbers
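One way to enforce the last rule is to mask sensitive fields before a record reaches any handler. The sketch below uses a standard logging.Filter; the field names in SENSITIVE_KEYS and the "details" attribute are hypothetical and would need to match your own log schema.

```python
import logging

# Hypothetical field names; adjust to the fields your application actually logs.
SENSITIVE_KEYS = {"name", "address", "phone", "national_id"}


def redact(details: dict) -> dict:
    """Return a copy of `details` with sensitive values masked."""
    return {key: ("***" if key in SENSITIVE_KEYS else value) for key, value in details.items()}


class RedactingFilter(logging.Filter):
    """Mask sensitive fields on every record that carries a `details` dict."""

    def filter(self, record: logging.LogRecord) -> bool:
        details = getattr(record, "details", None)
        if isinstance(details, dict):
            record.details = redact(details)
        return True  # never drop the record, only clean it


record = logging.makeLogRecord(
    {"msg": "order created", "details": {"order_id": 42, "name": "Ann", "phone": "+100000000"}}
)
RedactingFilter().filter(record)
print(record.details)
```

Attaching the filter to a handler or logger with addFilter applies it to every record passing through, so individual call sites do not have to remember to redact.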

💡

Logs are not guaranteed to be stored. Applications usually write logs to local memory and then send them in batches to a logging backend, or logs are collected periodically by an agent. If an application crashes, unsent logs may be lost.

Log creation and collection must never block or slow down business-critical application logic.
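A common way to keep logging off the critical path is to enqueue records and let a background thread deliver them. The sketch below uses QueueHandler and QueueListener from Python's standard library; the StringIO sink stands in for a slow network or file destination and is an assumption of the example.

```python
import io
import logging
import logging.handlers
import queue

log_queue = queue.Queue()  # unbounded in-memory buffer

# The "slow" destination; a StringIO stands in for a network or file handler.
sink = io.StringIO()
sink_handler = logging.StreamHandler(sink)
sink_handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

# The listener drains the queue on a background thread.
listener = logging.handlers.QueueListener(log_queue, sink_handler)
listener.start()

logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)
logger.propagate = False
# Request-handling code only enqueues records, which is cheap.
logger.addHandler(logging.handlers.QueueHandler(log_queue))

logger.info("payment accepted for order %s", 42)

# On graceful shutdown, stop() processes whatever is still queued.
# After a hard crash this step never runs, so queued records are lost.
listener.stop()
print(sink.getvalue(), end="")
```

This also illustrates the trade-off from the note above: the buffer keeps business logic fast, but anything still in it when the process dies is gone.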

Example

For example, if a customer can't place an order and receives a 5xx error code, it usually means that the application threw an exception. This exception should be logged at the ERROR level. We can then analyze the log details to diagnose the issue and decide how to fix it.

Logging is especially useful when you need to dive deep into the details of each event in the system. It is the most verbose of the three pillars.
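A minimal sketch of this scenario in Python follows. PaymentError, charge, and place_order are hypothetical names, and the failure is simulated so the example is deterministic.

```python
import io
import logging

stream = io.StringIO()  # stands in for the real log destination
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

logger = logging.getLogger("order-api")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)


class PaymentError(Exception):
    """Hypothetical error raised when the bank rejects a payment."""


def charge(amount: float) -> None:
    # Simulated failure so the example always takes the error path.
    raise PaymentError("bank declined the payment")


def place_order(order_id: int, amount: float) -> int:
    """Return the HTTP status code the API would respond with."""
    try:
        charge(amount)
    except PaymentError:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("failed to process order %s", order_id)
        return 500
    return 201


status = place_order(order_id=42, amount=19.99)
log_output = stream.getvalue()
print(status)
print(log_output)
```

The log entry carries both the business context (the order ID) and the traceback, which is usually enough to start the investigation.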

Tracing

Tracing is another pillar of observability and the most complex one, as all services in the request pipeline must support it. It is used primarily in complex distributed systems, where you need to understand the exact flow of a request across multiple services and their dependencies, as well as the response time of each step.

Each trace has a unique ID, which correlates multiple spans into a single request lifecycle.

Use cases

Tracing can be used to analyze:

Which requests exceed the SLO (for example, 95th percentile latency < 50 ms)
Which requests fail, along with correlated logs
Which part of a request consumes the most response time
Dependencies between services involved in a request
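As a sketch of the first use case, the snippet below checks a 95th percentile latency against a 50 ms target and lists the traces worth inspecting. The trace IDs and durations are made-up sample data, and the nearest-rank percentile is one of several common definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# (trace ID, total duration in ms) for completed traces; hypothetical data.
traces = [
    ("t01", 12), ("t02", 18), ("t03", 22), ("t04", 25), ("t05", 31),
    ("t06", 35), ("t07", 40), ("t08", 44), ("t09", 48), ("t10", 120),
]

SLO_MS = 50
p95 = percentile([duration for _, duration in traces], 95)
slo_met = p95 < SLO_MS
# Individual traces at or above the target are the ones to open in the tracing UI.
slow_traces = [trace_id for trace_id, duration in traces if duration >= SLO_MS]

print(p95, slo_met, slow_traces)
```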

Trace structure

A single incoming request creates a unique trace ID and a root span.

When the request calls another service, the trace ID is propagated (via headers) and a child span is created.

Trace ID: 885bef8bbc18649c
 └─ Root span: HTTP POST /orders
      ├─ Span: StockService.Reserve
      ├─ Span: MoneyService.MakePayment
      └─ Span: OrdersService.CreateOrder
           └─ Span: DB query
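The propagation step can be sketched as follows. The header layout follows the W3C Trace Context traceparent format (version, 32-hex trace ID, 16-hex parent span ID, flags); the Span class and the helper functions are hypothetical and much simpler than a real tracing SDK.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))


def start_trace(name: str) -> Span:
    """Create the root span and a fresh trace ID for an incoming request."""
    return Span(name=name, trace_id=secrets.token_hex(16))


def inject(span: Span) -> dict:
    """Build the header the caller attaches to an outgoing request."""
    return {"traceparent": f"00-{span.trace_id}-{span.span_id}-01"}


def extract(headers: dict, name: str) -> Span:
    """Create a child span in the called service from the incoming header."""
    _version, trace_id, parent_id, _flags = headers["traceparent"].split("-")
    return Span(name=name, trace_id=trace_id, parent_id=parent_id)


root = start_trace("HTTP POST /orders")
headers = inject(root)                             # caller side
child = extract(headers, "StockService.Reserve")   # callee side

print(headers["traceparent"])
print(child.trace_id == root.trace_id, child.parent_id == root.span_id)
```

Because every span records the shared trace ID and its parent's span ID, the backend can rebuild the tree shown above from spans that arrive independently from different services.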

💡

A single request or transaction always corresponds to one trace. A trace can contain multiple spans, and spans can be nested to represent call hierarchy.

Multiple traces can be linked, but they remain independent.

How traces are collected

Applications create spans and traces locally, then export them to a tracing backend directly or via an intermediate collector.

💡

Traces are not guaranteed to be collected. In high-traffic systems, generating traces for every request is too expensive, so sampling is used. For example, you can configure sampling so that only 5% of traces are collected, while the remaining 95% are discarded.

Only completed spans are exported. Spans of incomplete or in-flight requests are not sent.
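A 5% sampling rule can be sketched as a deterministic function of the trace ID, so that every service reaches the same keep-or-drop decision for a given request. The should_sample function below is a simplified illustration, not the implementation of any specific tracing library.

```python
import random


def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide from the trace ID alone, so all services agree on the outcome."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the ID as a uniform number in [0, 2**64).
    bound = int(ratio * 2**64)
    low_64_bits = int(trace_id_hex[-16:], 16)
    return low_64_bits < bound


rng = random.Random(0)  # seeded so the run is repeatable
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = [trace_id for trace_id in trace_ids if should_sample(trace_id, 0.05)]

print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

The kept share lands close to 5% for random IDs. The downside is that a rare failing request may be among the discarded 95%, which is why traces alone are not a complete record.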

Tracing is especially useful when you need to understand end-to-end request latency, cross-service dependencies, and performance bottlenecks in distributed systems - something that logs or metrics alone cannot fully explain.

Metrics

Metrics are the most useful pillar of observability for understanding how an application operates over time. They provide aggregated and cumulative information about system behavior, but do not contain details about individual events.

Metric types

There are different types of metrics that are not interchangeable and must be used depending on the use case:

Counter - a monotonically increasing value (number of requests, number of created orders)
Gauge - a value at a specific point in time that can increase and decrease (memory usage, number of active users)
Histogram - a distribution of values across predefined buckets (request latency, order cost)
Summary - similar to a histogram, but quantiles are calculated on the client side
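The first three types can be sketched in a few lines to show why they are not interchangeable. These classes are simplified illustrations, not the API of a real metrics library; in particular, the histogram stores per-bucket counts, whereas some backends expose cumulative buckets.

```python
import bisect


class Counter:
    """Only ever goes up (until the process restarts)."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: float = 1) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """A point-in-time value that can go up or down."""

    def __init__(self) -> None:
        self.value = 0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; the last bucket catches overflow."""

    def __init__(self, upper_bounds) -> None:
        self.upper_bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # Index of the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        self.total += value


orders_created = Counter()
orders_created.inc()
orders_created.inc()

memory_mb = Gauge()
memory_mb.set(512)
memory_mb.set(480)  # gauges may decrease

latency_ms = Histogram([10, 50, 100])
for observed in (5, 10, 42, 75, 250):
    latency_ms.observe(observed)

print(orders_created.value, memory_mb.value, latency_ms.counts)
```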

How metrics are collected

There are two main strategies:

Pull - preferred approach
A scraper (the metrics backend itself or a collector agent) periodically reads metrics from an endpoint exposed by the application (usually /metrics) and stores or forwards them. The pull model provides a single point of synchronization, which helps avoid race conditions.
Push - should be used only when the pull strategy is not suitable
The application pushes metrics directly to the metrics backend. This approach is typically used for short-lived jobs, which push metrics at the end of their processing pipeline.
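For the pull strategy, the application's side of the contract is just an HTTP endpoint that returns current values. The sketch below renders counters in the Prometheus text exposition format using only the standard library; the metric names, the METRICS dictionary, and port 8000 are assumptions of the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Current counter values; in a real service these are updated by business code.
METRICS = {"orders_created_total": 3, "http_requests_total": 128}


def render_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name in sorted(counters):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {counters[name]}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Pull model: the application only exposes state; the scraper decides when to read it."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8000) -> HTTPServer:
    """Build (but do not start) the server; call serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), MetricsHandler)


print(render_metrics(METRICS), end="")
```

A short-lived batch job has no long-running endpoint to scrape, which is why it would instead send the same payload to the backend once before exiting.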

Visualization

Once metrics are collected, they can be visualized in a UI (most commonly Grafana) using predefined dashboards.

Alerts

Metrics allow you to detect abnormal situations and notify developers automatically through various channels (phone calls, messengers, application notifications or email), enabling faster incident response.

An alert can be configured with conditions such as:

Memory usage exceeds 95%
The condition persists for 5 minutes

If all conditions are met, the alert is triggered, and a notification is sent.
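The two conditions above combine into a single rule: the threshold must be breached continuously for the whole duration. A minimal sketch of that evaluation, with hypothetical sample data and a function name chosen for this example:

```python
def alert_firing(samples, threshold, for_seconds):
    """Return True if the value has stayed above `threshold` for at least `for_seconds`.

    `samples` is a list of (timestamp_seconds, value) pairs ordered by time.
    """
    breach_started = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = timestamp  # the breach begins here
        else:
            breach_started = None  # any recovery resets the timer
    if breach_started is None:
        return False
    return samples[-1][0] - breach_started >= for_seconds


# Memory usage in percent, sampled once a minute (hypothetical values).
sustained = [(0, 96), (60, 97), (120, 98), (180, 96), (240, 97), (300, 99)]
with_dip = [(0, 96), (60, 90), (120, 97), (180, 98), (240, 99), (300, 97)]

print(alert_firing(sustained, threshold=95, for_seconds=300))
print(alert_firing(with_dip, threshold=95, for_seconds=300))
```

The duration condition is what keeps a brief spike from paging someone: in the second series the dip at the 60-second mark resets the timer, so the alert does not fire.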

Example

Suppose we add a metric that increments a counter every time an order is created. This metric allows us to see how many orders are created over time.

Using this data, we can determine how many orders were created in the last 10 minutes (or any other time window).

If the application is expected to run 24/7 and orders should be created continuously, we can configure an alert to detect when no orders have been created for the past 5 minutes (or when the rate drops below the expected threshold).
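Because a counter only stores a running total, "orders in the last N minutes" is computed as the difference between samples. A simplified sketch follows; the sample values are made up, and counter resets after a restart are deliberately ignored.

```python
def increase(samples, now, window_seconds):
    """Growth of a counter within [now - window_seconds, now].

    `samples` is a list of (timestamp_seconds, counter_value) pairs ordered by time.
    Simplification: counter resets (for example after a restart) are not handled.
    """
    in_window = [value for timestamp, value in samples if now - window_seconds <= timestamp <= now]
    if len(in_window) < 2:
        return 0
    return in_window[-1] - in_window[0]


# Value of the orders-created counter, scraped once a minute (hypothetical data).
values = [100, 104, 109, 115, 120, 124, 124, 124, 124, 124, 124]
samples = [(index * 60, value) for index, value in enumerate(values)]

now = 600
orders_last_10_min = increase(samples, now, window_seconds=600)
orders_last_5_min = increase(samples, now, window_seconds=300)
no_orders_alert = orders_last_5_min == 0

print(orders_last_10_min, orders_last_5_min, no_orders_alert)
```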

This alert is high-level and does not explain why the problem occurred. Possible reasons could include:

The website is unavailable
Network issues between services
External dependencies are down

To improve observability, you can also configure more specific alerts. For example, if an external dependency becomes unavailable, you can increment a dedicated metric and trigger a targeted alert.

OpenTelemetry

OpenTelemetry is the de facto observability standard in modern software development. It is a collection of APIs, SDKs, and tools used to generate, collect, and export telemetry data - metrics, logs, and traces.

The problem

The observability landscape is highly fragmented. There are many monitoring systems for logs, metrics, and traces, each with its own data formats, APIs, agents, and visualization tools.
In large technology companies, this quickly becomes a serious issue: different teams, infrastructures, and open-source applications rely on different vendors, each requiring its own tooling and operational expertise.

As a result, organizations end up with a technology zoo that is costly to maintain and difficult to evolve. Despite these differences, all these systems essentially solve the same problem - collecting and visualizing telemetry data to understand system behavior.

The standard

OpenTelemetry addresses this fragmentation by standardizing how applications generate, collect, and export telemetry data, as well as defining common semantic conventions and data formats.

Adopting OpenTelemetry makes observability vendor-neutral. Applications are instrumented once and can export telemetry to any compatible backend. This eliminates vendor lock-in and allows teams to switch or add monitoring backends without modifying application code.
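The "instrument once, export anywhere" idea can be illustrated with a plain interface. The sketch below is not the OpenTelemetry API; TelemetryExporter, InMemoryExporter, ConsoleExporter, and create_order are hypothetical names used only to show the separation between instrumentation and backend.

```python
from typing import List, Protocol, Tuple


class TelemetryExporter(Protocol):
    """Hypothetical interface; a backend adapter only needs this one method."""

    def export(self, name: str, value: float) -> None: ...


class InMemoryExporter:
    """Keeps data in a list, for example for tests."""

    def __init__(self) -> None:
        self.items: List[Tuple[str, float]] = []

    def export(self, name: str, value: float) -> None:
        self.items.append((name, value))


class ConsoleExporter:
    """Prints data, standing in for a second, unrelated backend."""

    def export(self, name: str, value: float) -> None:
        print(f"{name}={value}")


def create_order(exporter: TelemetryExporter) -> None:
    # Business code is instrumented once, against the interface.
    exporter.export("orders_created", 1)


# The backend is chosen at startup, without touching create_order.
memory = InMemoryExporter()
create_order(memory)
create_order(ConsoleExporter())
```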

Summary

Observability is a critical capability in modern software development that cannot be ignored at any stage of the system lifecycle. It enables teams to understand system behavior, assess its current state, and answer essential questions about availability, performance, and failures.

Modern observability relies on three complementary pillars: logging, tracing, and metrics. Together, they provide deep visibility:

Logs capture detailed events
Traces explain end-to-end request flows across services
Metrics reveal aggregated trends and enable alerting

As systems grow, the observability ecosystem becomes fragmented, creating a costly technology zoo of incompatible tools, formats, and vendors. OpenTelemetry solves this fragmentation by defining a vendor-neutral standard for generating, collecting, and exporting telemetry data. With OpenTelemetry, applications are instrumented once and can send data to any compatible backend, eliminating vendor lock-in and simplifying observability at scale.