Traces / Spans / Events / Logs / Metrics
These terms come up repeatedly when working on observability and monitoring, two related but distinct concepts in the management and maintenance of software systems.
They both aim to ensure that systems are running smoothly and to detect and diagnose issues when they arise, but they approach this goal from different perspectives.
Monitoring is the process of collecting, analyzing, and using data to track the performance, availability, and health of a system. It involves setting up predefined metrics, logs, and alerts to keep an eye on the system’s operation.
Observability is the capability of a system to make its internal state easy to understand from the data it produces, such as logs, metrics, and traces. It is more about understanding the “why” behind system behavior. Its scope is broad, as it also aims to offer a view of the interactions within the system.
That said, the underlying terms are the following:
- Traces: A trace represents the journey of a request as it travels through the different components of a distributed system. Traces give a high-level overview of a transaction; they are composed of spans and show how different services interact with each other. An example of a trace is a user request (issued from a frontend) going through multiple backend microservices and data stores.
- Spans: They are the building blocks of traces, each representing a single unit of work or a single operation. They carry detailed information about an individual operation, including start and end time, duration, metadata, etc. One example of a span is a database query.
- Events: Time-stamped records of things that happen within a system. They help track a specific action or change in state at a more granular level. An event can be a simple click of a button, the creation of a database connection, an error occurring, etc.: whatever the developer might want to track precisely.
- Logs: Detailed, time-stamped records of events and state, typically in the form of text messages. They provide a historical record of what happened in a system. A log entry is typically quite precise: for instance, when an error occurs, it might contain the error message along with the user ID and the stack trace.
- Metrics: Numerical data points measuring various aspects of a system’s performance and health over time. They provide quantitative data for monitoring, alerting, and performance analysis. Metrics can include CPU usage, memory usage, request latency, error rates, throughput, etc.
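
To make the relationships between these terms more concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the API of any real tracing library: the `Span` class, the `handle_request` function, and every name, field, and value in it (`GET /orders`, `db.query`, `requests_total`, `user-42`, etc.) are hypothetical. It simulates one request that produces a trace made of two spans, one event, one log entry, and one metric update.

```python
import json
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    """One unit of work inside a trace, for example a database query."""
    name: str
    trace_id: str                      # shared by every span of the same trace
    parent_id: Optional[str] = None    # None for the root span
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = field(default_factory=time.time)
    end: Optional[float] = None
    events: List[dict] = field(default_factory=list)

    def add_event(self, name: str) -> None:
        """Record a time-stamped event that happened during this span."""
        self.events.append({"name": name, "timestamp": time.time()})

    def finish(self) -> None:
        """Mark the end time of the span."""
        self.end = time.time()


def handle_request(user_id: str) -> Tuple[List[Span], List[str], Dict[str, int]]:
    """Simulate one request and return the telemetry it produced."""
    trace_id = uuid.uuid4().hex        # one trace per incoming request
    metrics = {"requests_total": 0, "errors_total": 0}
    logs: List[str] = []

    # Trace: a root span plus a child span linked through parent_id.
    root = Span("GET /orders", trace_id)
    query = Span("db.query", trace_id, parent_id=root.span_id)
    query.add_event("db.connection.created")   # event attached to a span
    query.finish()
    root.finish()

    # Metric: a numerical data point, here a simple request counter.
    metrics["requests_total"] += 1

    # Log: a time-stamped text record; carrying the trace ID lets it be
    # correlated with the trace later.
    logs.append(json.dumps({
        "timestamp": time.time(),
        "level": "INFO",
        "message": "orders fetched",
        "user_id": user_id,
        "trace_id": trace_id,
    }))
    return [root, query], logs, metrics


spans, logs, metrics = handle_request("user-42")
for span in spans:
    print(span.name, span.parent_id, [e["name"] for e in span.events])
print(logs[0])
print(metrics)
```

In this sketch the trace is nothing more than the set of spans sharing one `trace_id`, and the parent/child link is what lets a viewer rebuild the request's path. Real instrumentation libraries typically handle ID generation, context propagation, and export for you.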