Logs Matter More Than Metrics
Michael Shi
Disclosure: I run an observability company, so this post is subject to some (heavy) bias. However, it also underscores why I wanted to work on HyperDX.
Metrics matter. Logs matter more.
But that’s not how most developers see it. Developers love metrics and put care and attention into them. They call meetings to figure out how to implement and interpret metrics. They readily show them to new hires: colorful dashboards with sprawling statistics measuring CPU, memory, and network health. Once, when demoing my product, I was told by an engineering director, “This is cool, but where are the fancy charts?”
Logs get none of that hype. They are the ugly stepchild of observability. They get implemented, but with the attitude that you’d treat a necessary evil. They don’t get meetings dedicated to them. They’re never flaunted to new hires. They just exist, quietly recording events in the background.
Here’s the irony: while metrics might have the aesthetic of a complex system, logs are more useful 80% of the time. When an incorrect behavior emerges, logs are more likely to explain what happened than any metric. Logs, particularly logs with high cardinality, provide a detailed recollection because they involve no dimension reduction. Metrics, by definition, do: they aggregate many events into a single number, so they offer only a crude read of a bug’s effect on an application.
Not All Logs Are Created Equal
The importance of logs is partially diminished because they are poorly implemented in many organizations. The difference between a good log and a great log is striking.
Great logs carry attributes that can tie an event to the source of an issue (e.g. a user_id, a payment, or a host). This is often framed as logs with high cardinality: the log includes multiple fields that take many distinct values across events. For example, a front-end logged event might include a session ID, a request ID, a user ID, an organization ID, a payment ID, a timestamp, and a network trace. High cardinality like this is a good heuristic for whether a log will actually be useful when an error occurs.
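To make that concrete, here is a minimal sketch of a high-cardinality log event written with Python’s standard library. The helper name build_log_event, the logger name, and every ID value are made up for illustration; this is not HyperDX’s format or anyone’s production schema.

```python
import json
import logging
from datetime import datetime, timezone


def build_log_event(message, **fields):
    """Return one log event as a JSON string, with arbitrary identifying fields."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "message": message,
    }
    event.update(fields)  # attach the high-cardinality attributes
    return json.dumps(event, sort_keys=True)


logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout")  # hypothetical service name

# All IDs below are invented placeholders.
line = build_log_event(
    "payment failed",
    session_id="sess_8f2c",
    request_id="req_41d9",
    user_id="user_1207",
    organization_id="org_33",
    payment_id="pay_9a71",
    trace_id="trace_c0ffee",
)
logger.info(line)
```

Each of those fields is something you can search or group by later, which is exactly what a bare "payment failed" line cannot offer.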
Tricky Bugs Where Logs Are the Saving Grace
I have two contrasting examples that illustrate the value of logs.
The Socket Timeout
A while ago, we had a weird issue with sockets: customers reported that certain queries would unpredictably time out. Our dashboard showed no failed ClickHouse queries, yet customers failed to get data that originated in ClickHouse. Looking through the traces associated with those specific customers and timestamps, we discovered the error: the ClickHouse query succeeded, but the load balancer’s socket timed out before ClickHouse could reply. This became obvious when we compared the timestamps of the socket timeout and the ClickHouse response, and observed the corresponding error returned by our API.
Using the logs, we were able to identify the types of requests that led to the same behavior. Additionally, on the ClickHouse side, we could determine which query properties caused sluggish performance. None of these details could have been recovered from a metric that only counts sporadic failures.
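The timestamp comparison described above can be sketched in a few lines of Python. The event shapes, field names, and values here are assumptions invented for the example, not our actual log schema; the point is only that a shared request ID lets you line up events from two systems.

```python
from datetime import datetime

# Hypothetical log events; field names and values are illustrative only.
events = [
    {"request_id": "req_1", "source": "load_balancer",
     "event": "socket_timeout", "ts": "2024-01-01T12:00:30+00:00"},
    {"request_id": "req_1", "source": "clickhouse",
     "event": "query_finished", "ts": "2024-01-01T12:00:42+00:00"},
    {"request_id": "req_2", "source": "clickhouse",
     "event": "query_finished", "ts": "2024-01-01T12:01:05+00:00"},
]


def find_late_replies(events):
    """Return request IDs whose query finished after the socket had timed out."""
    timeouts = {}
    finishes = {}
    for e in events:
        ts = datetime.fromisoformat(e["ts"])
        if e["event"] == "socket_timeout":
            timeouts[e["request_id"]] = ts
        elif e["event"] == "query_finished":
            finishes[e["request_id"]] = ts
    return sorted(
        rid for rid, finished in finishes.items()
        if rid in timeouts and finished > timeouts[rid]
    )


late = find_late_replies(events)
print(late)
```

A "failed queries" counter would stay at zero for req_1, because from the database's point of view the query succeeded.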
Glofox Fake “DDoS”
Pierre Vincent has a fantastic developer talk at InfoQ’s Dev Summit where he discusses logs versus metrics. Pierre works at Glofox, a gym management software company. A few years ago, they experienced an incident that highlighted how metrics could be misleading in the absence of great logs.
Because Glofox creates gym software, the pandemic significantly impacted their product’s usage. Gyms suddenly closed (and subsequently opened) on government orders. On one of these reopening dates, Glofox experienced a massive surge in requests, which lit up metrics.
Through metrics, Glofox appeared to be suffering from a DDoS attack originating in Singapore. The easy remedy would be blocking all the IPs dispatching thousands of requests. Singapore was also reopening gyms that day, and Pierre suspected the incident wasn’t actually an attack. But it also wasn’t just returning users; the requests were overwhelming.
By diving through logs, Glofox’s engineering team nailed the culprit: Glofox’s front-end had a bug where lengthy sessions would dispatch more and more requests due to an unintentional JS loop. Many of Glofox’s Singaporean customers had been shut down for months but had left tabs minimized. When those tabs were reopened, Glofox’s back end was inundated by months of quarantined requests, which imitated a DDoS attack.
Only because of logs was Glofox able to diagnose the problem and devise a quick remedy that enabled their application to persist on one of the most important days of the year.
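I don’t know exactly which queries Glofox ran, but here is a hedged sketch of the kind of question logs can answer and a per-IP request counter cannot. The log entries, the field names (including session_started_days_ago), and the numbers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical request log entries; every field name and value is made up.
request_logs = [
    {"ip": "203.0.113.5", "session_id": "s1", "session_started_days_ago": 92},
    {"ip": "203.0.113.5", "session_id": "s1", "session_started_days_ago": 92},
    {"ip": "203.0.113.5", "session_id": "s1", "session_started_days_ago": 92},
    {"ip": "198.51.100.7", "session_id": "s2", "session_started_days_ago": 0},
    {"ip": "203.0.113.9", "session_id": "s3", "session_started_days_ago": 88},
    {"ip": "203.0.113.9", "session_id": "s3", "session_started_days_ago": 88},
]


def requests_per_session(logs):
    """Count requests per session and keep each session's age next to the count."""
    summary = defaultdict(lambda: {"requests": 0, "age_days": 0})
    for entry in logs:
        stats = summary[entry["session_id"]]
        stats["requests"] += 1
        stats["age_days"] = entry["session_started_days_ago"]
    return dict(summary)


summary = requests_per_session(request_logs)
# Print sessions sorted by request volume, busiest first.
for session_id, stats in sorted(summary.items(), key=lambda kv: -kv[1]["requests"]):
    print(session_id, stats)
```

If the heaviest traffic comes from legitimate sessions that are months old, a front-end bug starts to look far more likely than an attack.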
Developer Religions
I’ll admit this debate hinges on some concept of developer religions—the idea that developers, myself included, have strong beliefs because of some hypothetical ideal. Some developers swear by the importance of metrics; I care more about capturing high cardinality data through logs.
But to be clear, it is ridiculous to believe one should exist at the expense of the other. It’s more a matter of foundations. In my worldview, high cardinality should be the north star for building a good observability stack; metrics should follow.
Funnily enough, I hold the opposite belief regarding our marketing strategy. For marketing, I care more about metrics than individual stories. That’s because marketing is an outcome-optimization problem: strategies succeed or fail on the basis of an aggregate. That mindset doesn’t hold when it comes to development, where the goal is to eliminate issues that any user is facing.
A Closing Thought
Logs matter. They matter in the same vein that testing matters, CI/CD matters, security matters. Without good logs, errors turn from nuisances to headaches. So the next time your team brings up the importance of metrics, push aside the hype of fancy charts and spend time improving your logs. Of course, you can take my opinion with a grain of salt (I run an observability company that’s built on good logs), but there’s a reason that I ended up in this space.