Okay, so we started getting swamped with tracing data. Like, seriously buried. Our observability bill was going through the roof, and finding the important traces felt like searching for a needle in a haystack, especially during incidents. Standard sampling helped a bit, but we were still losing potentially critical traces sometimes, while keeping loads of routine stuff. We needed a smarter way to decide what traces to keep, giving preference to the important ones. We internally called this little project “firefly trace priority”.

Getting Started: The Problem
First thing, we sat down and really defined the problem. It wasn’t just about less data; it was about the right data. We had terabytes of trace info, but when the payment gateway hiccuped, digging through logs and traces was slow because 99% of the traces were just health checks or low-value requests. We needed to ensure the traces for critical paths, errors, or specific important transactions always made it through, even if we sampled heavily elsewhere.
Brainstorming and Planning
We looked at the tools we were using, mostly OpenTelemetry stuff. Head-based sampling (deciding at the start) was simple but dumb – it didn’t know if a trace would turn out to be important later (like hitting an error). Tail-based sampling (deciding at the end) was smarter but needed more resources on the collector side to hold onto traces before deciding.
We figured a mix was needed. We wanted to influence the sampling decision based on context we had early in the request. What makes a trace important for us?
- Any trace containing an error.
- Traces related to payment processing.
- Requests for certain high-value API endpoints.
- Maybe traces for specific beta testers or internal users for debugging.
So the plan was: tag traces with a priority level right at the source, in our application code, and then configure our tracing pipeline (specifically, the OpenTelemetry Collector) to use this tag when making sampling decisions.
Doing the Work: Implementation Steps
Step 1: Tagging Traces
This was the trickiest part, involving code changes. In our main services (mostly Go and some Java), we dug into the middleware or interceptors where requests first come in. We added logic there:
- Check the request path: If it matches `/api/payments/*` or `/api/admin/*`, add a tag like `priority = high`.
- Check user info (if available early): If the user ID is in a specific debug list, tag `priority = debug`.
- Later in the request lifecycle, if an error occurred, our error handling logic would add or update the tag: `priority = error`.
We used the standard OpenTelemetry baggage or span attributes to carry this priority tag along with the trace context.
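To make the tagging concrete, here’s a minimal Go sketch of that middleware idea using the OpenTelemetry API. The attribute key `priority`, the `X-User-ID` header, and the function names are illustrative stand-ins rather than our exact code; the same idea works with baggage instead of span attributes if the priority needs to travel to downstream services.

```go
package tracingmw

import (
	"net/http"
	"strings"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// priorityKey is the span attribute the collector policies match on.
// The name "priority" is illustrative; use whatever key your sampling
// policies are configured to look for.
var priorityKey = attribute.Key("priority")

// debugUserIDs stands in for the "specific debug list" of users.
var debugUserIDs = map[string]bool{"beta-tester-42": true}

// WithTracePriority tags the active server span early in the request,
// based on the request path and a hypothetical user ID header.
func WithTracePriority(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		span := trace.SpanFromContext(r.Context())

		if strings.HasPrefix(r.URL.Path, "/api/payments/") ||
			strings.HasPrefix(r.URL.Path, "/api/admin/") {
			span.SetAttributes(priorityKey.String("high"))
		}
		if debugUserIDs[r.Header.Get("X-User-ID")] {
			span.SetAttributes(priorityKey.String("debug"))
		}

		next.ServeHTTP(w, r)
	})
}

// MarkTraceError is called from error-handling code once we know the
// request failed, upgrading the priority so tail sampling keeps it.
func MarkTraceError(span trace.Span) {
	span.SetAttributes(priorityKey.String("error"))
}
```

Because the actual keep-or-drop decision happens later at the collector, it’s fine that the error tag only lands near the end of the request; the tail sampler sees the whole trace before deciding.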
Step 2: Configuring the Collector

Next, we tackled the OpenTelemetry Collector configuration. We were already using the `tailsamplingprocessor`. This thing is pretty powerful. We configured its policies:
- Define multiple policies with different sampling rates.
- A policy for errors: Match spans whose status code is `ERROR`, or that carry our custom `priority = error` tag. Set this to 100% sampling (keep all error traces).
- A policy for high priority: Match spans with `priority = high`. We set this to maybe 50% or even 100% initially, depending on volume.
- A policy for debug: Match `priority = debug`. Keep 100% of these.
- A fallback policy: For everything else, use a very low probability, like 1% or 5%, just to keep a general sense of traffic patterns.
Getting the policy definitions right took some trial and error. We had to carefully define the matching rules based on the attributes we were setting in our code.
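For reference, this is roughly the shape our collector config ended up taking. It’s a trimmed-down sketch, not our production file: the policy names, the `priority` attribute key, and the buffering numbers are illustrative, and you’d size `decision_wait` and `num_traces` for your own traffic and collector memory.

```yaml
processors:
  tail_sampling:
    decision_wait: 10s       # how long to buffer spans before deciding
    num_traces: 100000       # in-memory trace budget; size for your traffic
    policies:
      - name: keep-all-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-error-and-debug-tags
        type: string_attribute
        string_attribute:
          key: priority
          values: [error, debug]
      - name: keep-high-priority
        type: string_attribute
        string_attribute:
          key: priority
          values: [high]
      - name: low-rate-fallback
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Policies are evaluated independently, so a trace is kept if any one of them says yes; the probabilistic policy just supplies the low-rate baseline for everything else. If you want the 50% option for high-priority traffic instead of keeping all of it, the processor’s `and` policy type can combine a `string_attribute` match with a `probabilistic` rate.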
Testing and Rollout
We didn’t just flip the switch. We started testing in our staging environment. We generated specific traffic – simulated errors, calls to payment endpoints, normal background noise. Then we checked the backend storage: Were error traces present? Were payment traces being kept at the higher rate? Were the low-priority ones mostly gone? We looked at the trace counts and costs in staging.
It looked promising. Error traces were reliably captured. High-priority stuff showed up way more often. We tweaked the sampling percentages a bit based on the volume we saw.
Then we rolled it out slowly in production. Started with just one or two services, watched the monitoring dashboards like hawks. Checked system load on the collectors (tail sampling can add overhead). Checked trace counts and storage costs. Slowly expanded it to more services over a couple of weeks.
Challenges Faced
It wasn’t all smooth sailing.
- Consistency: Getting all the different teams (Go, Java) to implement the tagging consistently was a bit of a coordination challenge. We had to provide clear guidelines and code examples.
- Performance: Adding logic in the request path always carries a small risk. We benchmarked carefully to ensure we didn’t add noticeable latency. The collector also needed adequate memory/CPU for tail sampling.
- Defining “Priority”: Agreeing on what was truly “high priority” involved some debates between product owners, SREs, and developers. We had to keep the criteria simple initially.
The Result: What We Got
In the end, it worked out pretty well.
Trace volume dropped significantly. Our storage costs went down noticeably. But the key thing was that we felt much more confident that we were keeping the important traces. When an issue happened with payments, the relevant traces were almost always there. Debugging critical failures became faster because we weren’t wading through as much noise.

It wasn’t a magic bullet, and we still tweak the sampling rates and priority rules now and then as things change. But implementing this “firefly trace priority” system definitely made our tracing setup way more useful and cost-effective.