What's the Problem with OpenTelemetry?

Michael Shi • Jun 25, 2024

A week ago, David from Sentry (zeeg) wrote a popular post on his views on the problems of OpenTelemetry. It definitely highlights some fair points (I’ve agreed with him a lot about Otel’s issues in our IRL chats in the past!) - but at the same time the post felt like it had a narrow perspective, and I disagreed with the conclusion that the problem is OpenTelemetry having too broad of a goal. Coming back fresh from Otel community day, I wanted to share my own perspective on Otel and where the true problems are for the wider community.

Disclaimer: I have my own bias in building HyperDX, but hopefully I can lend a different perspective to anyone thinking about adopting OpenTelemetry as either an end-user, tool builder, or observability backend.

What is OpenTelemetry?

First of all, what is OpenTelemetry actually? Candidly, I think the site itself could do a better job of explaining it - so here’s my take on what it tangibly delivers to end users:

  1. A set of SDKs that allow you to collect logs, metrics and traces emitted from your application. The SDKs also support “auto-instrumentation” - which means they’ll hook into the existing libraries your application uses and create helpful metrics/traces (ex. HTTP response sizes) without extra work from the user (there’s a minimal sketch of this right after this list).
  2. The OpenTelemetry collector, a binary that acts as a traditional “agent”. It serves as a way to collect data (ex. tail log files, scrape metrics, or receive data from other Otel collectors/SDKs), apply transformations to it (ex. sanitize PII, batch), and then send it to a destination (another Otel collector, your vendor, S3, etc.).
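
To make the SDK piece concrete, here’s a minimal sketch of what wiring up the Node.js SDK with auto-instrumentation might look like. The service name and the local collector endpoint are made up, and the package names reflect the upstream Node SDK at the time of writing - treat this as illustrative, not gospel:

```ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Start the SDK before the rest of the app loads so auto-instrumentation
// can hook into libraries like http, express, pg, etc.
const sdk = new NodeSDK({
  serviceName: 'checkout-service', // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // ex. a local Otel collector
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

From there, incoming HTTP requests and outgoing calls to instrumented libraries show up as spans without further code changes - that’s the “without extra work from the user” part.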

However, those two pieces on their own aren’t really that interesting - every major observability vendor has implemented those two pieces of software for their users already. Instead, where OpenTelemetry really makes an impact is introducing a set of standards that allows everything from your infrastructure provider (ex. AWS) and your libraries (ex. Prisma) to your own code to emit telemetry that’s correlated and cohesive. It does this by broadly setting two standards:

  1. A standard schema for emitting logs, metrics, and traces. The standard covers both the transport side of things (ex. how the log message is transmitted) as well as the “semantic attributes” (ex. the http.response.status_code property in an event stores the HTTP status code) - there’s a concrete example right after this list.
  2. A standard for what SDKs should support to actually collect and correlate those logs, metrics and traces from your application and libraries.
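
For a rough sense of what those semantic attributes look like in practice, here’s a hand-instrumented span using the JS API. The tracer name and route are made up; the attribute keys follow the HTTP semantic conventions (and in practice auto-instrumentation sets them for you on HTTP server spans):

```ts
import { trace } from '@opentelemetry/api';

const tracer = trace.getTracer('checkout-service'); // hypothetical instrumentation name

function handleCheckout() {
  // Attribute keys like http.response.status_code come from the shared
  // semantic conventions, so any backend knows how to interpret them.
  tracer.startActiveSpan('GET /checkout', (span) => {
    span.setAttribute('http.request.method', 'GET');
    span.setAttribute('http.response.status_code', 200);
    span.end();
  });
}
```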

Together, this means that when you’re up at 2am figuring out why your servers’ p99 response latency is spiking, you can have all the logs/traces/metrics from your application, libraries, and infrastructure that are related to the incident correlated together - which is a huge win for an on-call engineer!

What’s the End Goal of OpenTelemetry?

Now that we know what OpenTelemetry broadly is, what’s the end game for it? I honestly think it’s pretty straightforward: the project is building a standard way of collecting and transmitting application telemetry (surprise!) - and that covers the many shapes telemetry takes: logs, metrics, traces, exceptions, session replays, profiling, and more.

This means that there’s one “common language” we can all speak in the ecosystem, so that if I have a troubled k8s.pod.name on hand, I can go find the logs, metrics, traces (and more) using that same exact property. No limitations based on vendor, on whether I wrote the telemetry myself or a library author emitted it, on which SDK was used, or anything else.
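
To sketch how that works mechanically: every signal a process emits carries the same set of resource attributes, and that shared set is what a backend joins on. The pod name below is made up, and the resources API has shifted a bit between SDK versions, so this is illustrative only:

```ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { Resource } from '@opentelemetry/resources';

// These resource attributes ride along with every log, metric, and trace
// this process emits, so k8s.pod.name becomes a join key across signals.
// In a real deployment a Kubernetes resource detector fills these in.
const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'checkout-service',          // hypothetical
    'k8s.pod.name': 'checkout-6f9c4d8b7d-x2k9q', // hypothetical
  }),
});

sdk.start();
```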

It also means that as a builder of HyperDX we can build experiences where your session replays, frontend logs/traces/exceptions, and backend logs/metrics/traces/exceptions are all automatically correlated, regardless of your choice of OpenTelemetry-compatible SDKs. We built this because our customers have a very real problem stitching the bits and pieces of the traditionally fragmented observability landscape together.

This is a pretty big evolution: observability no longer means having logs, metrics, and traces stored separately and copying + pasting timestamps and bespoke correlation IDs between everything (or having thin layers of duct tape between said products). It’s evolved into having the different signals correlated together by default, and using them together to understand your system.

But Do We Need Another Standard?

There’s an argument along the lines of XKCD 927:

competing standards xkcd

Yes, logging has a variety of long-lived formats, from syslog (which is a troubled format on its own already) to things like CLF (which is a format… but doesn’t describe anything beyond that). Metrics have a cleaner story, with Prometheus being one of the dominant formats after predecessors such as statsd.

However, the prior art in logging and metrics largely stops at describing the data format - what the various fields and metrics actually represent (a semantic convention!) is sorely lacking.

For example, if you were to adopt a stack of Syslog, Prometheus metrics, and Zipkin traces, it’d be impossible to globally define a way of joining the logs, metrics, and traces emitted from a single host - or even the logs and traces emitted from a single transaction - as neither Prometheus nor Syslog sets any standard for what the content of the telemetry is, much less how it’ll ever interoperate with the other observability standards.

You’d have to do it yourself - and every vendor and ecosystem will come up with some bespoke way of doing it that’s incompatible with the next (ex. see AWS X-Ray).
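
With OpenTelemetry, that join comes with the standard. For example, a log record emitted inside an active span gets stamped with the same trace context, so the backend can line them up without any bespoke correlation scheme. A rough sketch, assuming the Node logs SDK is wired up (names below are made up):

```ts
import { trace } from '@opentelemetry/api';
import { logs, SeverityNumber } from '@opentelemetry/api-logs';

const tracer = trace.getTracer('checkout-service'); // hypothetical
const logger = logs.getLogger('checkout-service');  // hypothetical

tracer.startActiveSpan('charge-card', (span) => {
  // Because this log is emitted while the span is active, the logs SDK
  // attaches the same trace_id/span_id, so a backend can join the log
  // and the trace from this transaction automatically.
  logger.emit({
    severityNumber: SeverityNumber.INFO,
    severityText: 'INFO',
    body: 'charging card for order',
  });
  span.end();
});
```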

Much like how Kubernetes allowed an entire ecosystem to be built on top of it because it created a standard for deploying containers (rather than being locked into a cloud-specific solution), OpenTelemetry is providing the same benefit to all of observability (and it’s the 2nd most active CNCF project, behind only Kubernetes). At the end of the day, that’s a win for customers.

Okay, but does it actually work?

Zeeg made a point that OpenTelemetry isn’t actually working for people - yet we’re seeing both new and existing projects leverage OpenTelemetry. That’s not to say there aren’t adoption issues to sort out (I’ll get to those in a bit). From libraries like Prisma and Temporal to all the major cloud vendors such as AWS, GCP, and Azure, everyone is baking OpenTelemetry instrumentation directly into their libraries, so that as soon as you adopt OpenTelemetry, you get a bunch of instrumentation for free.

otel npm trends

While OpenTelemetry is certainly younger than a number of other projects such as Zipkin, Prometheus, or Fluentd, it’s rapidly growing to become the standard for end users and library authors to instrument their code via its many SDKs, for telemetry hubs (ala CloudWatch) to transmit telemetry, and for backends like us to receive that telemetry. It’s defined a standard for the industry - and everyone is rallying around it.

Yes, there are problems

However, to be clear, OpenTelemetry definitely has areas to improve. In fact, they’ve conducted a survey themselves, and the results are incredibly clear.

The Docs are Bad

Objectively, the number one barrier to adoption is the lack of good documentation and examples of implementing OpenTelemetry.

otel survey results

From the Otel Getting Started Survey

It’s a complex problem, spanning everything from the SDKs themselves requiring too much boilerplate to the edge cases and version compatibility issues users can run into. There’s also the problem of educating people about observability itself - things like when to choose a log versus a span or metric, and how to implement the best-practice version of each (ex. span events vs logs). All of this gets bundled into “OpenTelemetry is hard to use” - which is totally fair.

Heck, even as a vendor in the space that lives and breathes OpenTelemetry, I frequently curse at how fragmented the documentation is between SDK guides, references, the actual implementations, and the backing standards (plus the various deprecations/promotions of standards that are happening).

Committee Driven Development

Beyond documentation, Otel is entering a point where a lot of the work to introduce new changes is slowing down and getting driven by committee. Decisions for a single SDK can often cut across a handful of different Otel stakeholder groups, which means buy-in (and ultimately implementation) takes a lot of time, with all the traditional issues of committee-driven development.

Bring Your Own Batteries

I would describe a lot of the upstream OpenTelemetry offerings as solidly in the batteries-not-included camp, due to poor documentation and the large configuration surface needed just to get started. The OpenTelemetry ecosystem has created a very extensible and flexible system, so anyone can take the parts they like and build exactly what they want. However, it also means that end users have had to piece together too much just to get the ball rolling. (I have to caveat that this is improving very quickly across the ecosystem - and vendor distributions tend to do a good job of bridging this gap for end users, something we’ve needed to do at HyperDX as well.)

But are those really fatal issues?

Overall, almost any large project tends to suffer from the problems above, and it’s not like OpenTelemetry isn’t listening or improving on them. In fact, at Otel community day a lot of the talks were explicitly about fixing these issues. From the end-user perspective, there’s been a huge amount of improvement every month on the problems of documentation and onboarding. Even with the issues of committee-driven development, the whole ecosystem has been set up so that good ideas can be proven outside the direct committee process and upstreamed back into OpenTelemetry if warranted.

Where does that leave us?

OpenTelemetry is an absolutely ambitious project, and I was personally dismissive of what it could be when the merger of the two nascent projects, OpenCensus and OpenTracing, was announced in 2019. However, over the ~4 years since, it’s been fulfilling its promise of unifying observability across different signals, languages, and platforms.

The project isn’t without its flaws - as with literally any other alternative you can think of - but it’s usable today. It’s fast-growing in adoption, and it continues to rapidly improve. In my view, it’s established itself as the new standard for how most telemetry will be generated and transmitted in the future, regardless of the few vendors that fail to believe in it.

Note: I’ve described “what is OpenTelemetry” in an incredibly lossy manner - there’s a bit more detail in the different parts of the SDK, the semantic conventions, the SDK standard, the signal specs themselves, the Otel transport protocol standards (defined in both Protobuf and JSON), and a few more bits and pieces. These largely get abstracted behind the SDKs + semantic conventions for the majority of end users.