Why is Sampling for Observability So Hard?


If you're an engineer building a popular application, you might be familiar with sampling in the context of observability. As applications scale, so does the number of events dispatched on a weekly, daily, hourly, and even per-minute basis. At some point, storing every single event becomes too costly, especially if the events carry high-cardinality data that bloats payload sizes. However, storing events to diagnose issues is extremely important.
The compromise is sampling. Sampling involves saving only a (relatively small) subset of events and discarding the rest. Many applications sample as little as 1% of events while maintaining statistical significance. But achieving this isn't a matter of naively rolling a 100-sided die; sampling requires a deliberate strategy that maximizes cost savings without sacrificing accuracy.
The general problem of sampling is ensuring that the events that matter, such as errors, slow sessions, or unexpected responses, are accounted for, especially when they make up only a tiny minority of traffic. Put simply, it's a question of capturing the 0.1% of events that actually matter when you're only saving 1% of events overall.
Let's discuss how this all works.
Who actually needs to sample?
Generally speaking, if you're generating a lot of data, you'll need to sample. This typically means running into one of two constraints: either your observability tooling struggles to handle that much data, or the cost of your observability infrastructure exceeds your budget.
This is a problem faced by a lot of companies. As production systems scale, they can generate hundreds of gigabytes, or even a few terabytes, of data per day. The exact volume depends on a myriad of factors: customer count, average actions taken per session, services used per action, and data saved per service. Point being, there are plenty of paths toward excessive data creation.
What determines a good sampling strategy?
A common misconception is that sampled data should proportionally represent all of the traces created. However, traces aren't equally interesting to engineers; anything that requires a fix or change is more relevant. Accordingly, your goal is typically to bias your sample toward interesting traces, while still preserving some ordinary traces as a baseline.
Simply put, for any important trace, there should be a high probability that it's included in your sample set. Meanwhile, how many traces are sampled is a function of volume: the higher the volume, the lower the sampling rate needed. You can also use different sampling rates for the various services in your distributed system based on each one's traffic volume.
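To make the volume point concrete, here's a minimal sketch of deriving a per-service sampling rate from traffic volume; the per-minute trace budget and the function itself are illustrative assumptions, not part of any particular tool.
```python
def sampling_rate(traces_per_minute: float, target_per_minute: float = 1000.0) -> float:
    """Pick a rate so a service contributes roughly `target_per_minute`
    sampled traces per minute, regardless of its raw volume."""
    if traces_per_minute <= target_per_minute:
        return 1.0  # low-volume service: keep everything
    return target_per_minute / traces_per_minute

# A service emitting 200,000 traces per minute gets sampled at 0.5%.
print(sampling_rate(200_000))  # 0.005
```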
Head-sampling vs tail-sampling
Head Sampling
In head sampling, you don't spend any compute inspecting the individual trace. Your sampling decision isn't based on the particulars of the data in the trace; you're effectively rolling a die and deciding the trace's fate.
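Here's a minimal sketch of a head sampler, assuming a hash-bucket scheme keyed on the trace ID; the 1% default rate is illustrative, not any library's default.
```python
import hashlib

def head_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Decide at the start of a trace using only its ID, never its contents.
    Hashing the trace ID (rather than calling random()) keeps the decision
    consistent across every service that handles the same trace."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < rate * 10_000
```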
Pros
- Very simple strategy to implement; doesn't require much ongoing maintenance.
- Very low compute cost.
Cons
- This results in your sampled data having the same distribution as your production traffic. If rare events happen in production, it's unlikely they'll be captured in your sampled set.
Tail Sampling
In tail sampling, you inspect the trace and decide whether or not to include it in your sampled data based on the particulars of the trace. To use this strategy, it's best to have some heuristics around which traces you care about and which you don't, based on properties of the trace: latency, error type, originating service, etc. For a given trace, if it falls within the set of traces you care about, you save it. If not, you discard it.
Within each case, you can also have a sampling probability: if a trace matches Case 1, there's a 25% chance it'll be saved; if Case 2, a 60% chance; and so on. This allows you to prioritize different types of data in a very granular manner.
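A minimal sketch of this kind of rule-based tail sampler, assuming hypothetical trace fields (`error`, `duration_ms`, `service`) and illustrative rates:
```python
import random

def tail_sample(trace: dict) -> bool:
    """Inspect the finished trace and keep it with a probability that depends
    on which case it falls into. Cases and rates here are illustrative."""
    if trace.get("error"):
        return True                    # always keep errors
    if trace.get("duration_ms", 0) > 2000:
        return random.random() < 0.60  # keep most slow traces
    if trace.get("service") == "checkout":
        return random.random() < 0.25  # business-critical path
    return random.random() < 0.01      # small baseline for everything else
```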
Pros
- Your sampled data won't have the same distribution as production traffic; it'll be biased toward the traces you care about.
Cons
- This is a much more computationally expensive approach.
- This sampling strategy can get quite complicated and require ongoing tuning as your distributed system changes.
TSA Analogy
- Head sampling: You've been chosen for random screening at the scanner. The selection has nothing to do with you in particular; it's just random.
- Tail sampling: Your bag goes through the scanner, and it gets routed for a deeper search if something of interest shows up.
Modern Implementations
Head Sampling Then Tail Sampling
Sometimes your production traffic is too high for tail sampling to work on its own. There are several ways to solve this, but one of them is to first apply head sampling with a relatively high sample rate, followed by tail sampling. Let's say your head sampling rate is 50%; this means your tail sampling step only has to look at 50% of the production traffic.
You can build a system that dynamically changes the head sampling rate based on the length of the tail sampling system's input queue. This insulates the tail sampling system from spikes in production traffic.
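A minimal sketch of that feedback loop, assuming hypothetical high and low watermarks for the queue; the thresholds and step sizes are illustrative.
```python
def adjust_head_rate(current_rate: float, queue_length: int,
                     high_water: int = 50_000, low_water: int = 5_000) -> float:
    """Back off head sampling when the tail sampler's input queue grows,
    and recover once it drains. Called periodically by the pipeline."""
    if queue_length > high_water:
        return max(0.05, current_rate * 0.5)  # shed load quickly
    if queue_length < low_water:
        return min(1.0, current_rate * 1.1)   # recover slowly
    return current_rate
```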
Sending All Traces to Cold Storage
There might be cases where you're interested in data that wasn't captured by your sampling strategy. If that data could still be important in the future, you can send all trace data to cold storage. If it's ever needed, you can retrieve it and find the necessary information. This is especially useful if you're required to retain all log data for contractual or regulatory reasons.
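A minimal sketch of that fan-out, using a local compressed file as a stand-in for object storage and a hypothetical `index_in_hot_storage` helper for the queryable backend.
```python
import gzip
import json

def archive_to_cold_storage(trace: dict, path: str = "traces.jsonl.gz") -> None:
    """Append every trace to a cheap, compressed, append-only archive.
    (Stand-in for object storage; the path and format are illustrative.)"""
    with gzip.open(path, "at") as f:
        f.write(json.dumps(trace) + "\n")

def index_in_hot_storage(trace: dict) -> None:
    """Hypothetical stand-in for the indexed, queryable backend."""
    print("indexed for querying:", trace.get("trace_id"))

def process_trace(trace: dict, sampled: bool) -> None:
    archive_to_cold_storage(trace)  # everything goes to cold storage
    if sampled:                     # decision from your head/tail sampler
        index_in_hot_storage(trace)
```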
Dynamic and Adaptive Sampling Techniques
Dynamic and adaptive sampling uses automation, from simple threshold rules to machine learning, to adjust sampling rates and rules on the fly based on current system conditions like error spikes, load increases, or unusual user behavior.
For example, if you experience a sudden increase in errors, adaptive sampling can automatically ramp up data collection to ensure you capture critical details during the incident.
Adaptive sampling can be configured using thresholds and triggers based on metrics such as latency, throughput, error rates, or specific event occurrences.
Example: If your system typically samples at 1%, but error rates spike above a certain threshold, adaptive sampling could automatically increase the sampling rate.
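A minimal sketch of that trigger, with illustrative rates and an illustrative threshold rather than anyone's defaults:
```python
BASE_RATE = 0.01        # normal conditions: sample 1%
BOOSTED_RATE = 0.25     # during an incident: sample far more aggressively
ERROR_THRESHOLD = 0.05  # trigger when more than 5% of requests error

def adaptive_rate(error_rate: float) -> float:
    """Return the sampling rate to use for the next window,
    based on the current error rate."""
    return BOOSTED_RATE if error_rate > ERROR_THRESHOLD else BASE_RATE
```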
Downsides of Adaptive Sampling:
- Higher computational overhead due to real-time processing of metrics and dynamic sampling decisions.
- Ongoing maintenance and tuning as the system evolves, requiring dedicated attention to maintain optimal performance.