Implementation for span-derived primary tags #11358
Conversation
```java
public List<String> getTraceStatsAdditionalTags() {
  return tryMakeImmutableList(configProvider.getList(TRACE_STATS_ADDITIONAL_TAGS));
}
```
Instead of a List, where something like `region,tenant_id` and `tenant_id,region` would be seen as different values and duplicates would be preserved, could we define this as a Set, assuming order and duplicates do not matter? For example:
```java
public Set<String> getTraceStatsAdditionalTags() {
  return tryMakeImmutableSet(configProvider.getList(TRACE_STATS_ADDITIONAL_TAGS));
}
```
similar to https://github.com/DataDog/dd-trace-java/blob/master/internal-api/src/main/java/datadog/trace/api/Config.java#L5107. Claude recommended the following change as well to dd-trace-core/src/main/java/datadog/trace/common/metrics/ConflatingMetricsAggregator.java, so that additionalTagKeys is still iterable while maintaining no duplicates, no config-order sensitivity, and stable payload/key ordering:

```java
List<String> orderedAdditionalTagKeys = new ArrayList<>(additionalTagKeys);
Collections.sort(orderedAdditionalTagKeys);
this.additionalTagKeys = Collections.unmodifiableList(orderedAdditionalTagKeys);
```
and corresponding tests that check that duplicate keys, or the same keys re-ordered, return the expected result 🤔
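A minimal sketch of that normalization, showing the properties such tests would check (class and method names here are hypothetical, not from the PR):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class TagKeyNormalizer {
  // De-duplicate, then sort, so that "region,tenant_id" and
  // "tenant_id,region,region" produce the same iterable key list.
  static List<String> normalize(List<String> configuredKeys) {
    Set<String> unique = new LinkedHashSet<>(configuredKeys);
    List<String> ordered = new ArrayList<>(unique);
    Collections.sort(ordered);
    return Collections.unmodifiableList(ordered);
  }
}
```

Duplicates collapse and config order stops mattering, while the result stays an iterable, stably ordered List.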
```diff
  final boolean hasGrpcStatusCode = key.getGrpcStatusCode() != null;
  final int mapSize =
-     15
+     16
```
A future improvement could be to make this conditional, like hasServiceSource and hasHttpMethod below. Otherwise, most customers will always get an empty additional tag and thus an unnecessarily large payload.
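A hedged sketch of what that conditional sizing could look like (the helper, its parameters, and the base size of 15 are illustrative, taken from the diff above rather than the PR's actual code):

```java
class MapSizer {
  // Count optional entries only when present, so customers who don't
  // configure additional tags don't pay for an always-present empty field.
  static int mapSize(boolean hasHttpMethod, boolean hasAdditionalTags) {
    int size = 15; // required fields (illustrative base)
    if (hasHttpMethod) {
      size++;
    }
    if (hasAdditionalTags) {
      size++;
    }
    return size;
  }
}
```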
```java
    boolean includeEndpointInMetrics) {
  this.ignoredResources = ignoredResources;
  this.additionalTagKeys =
      additionalTagKeys == null ? Collections.emptyList() : additionalTagKeys;
```
We could also add a MAX_ADDITIONAL_TAG_KEYS value limiting the number of tag keys that customers can set, to prevent "infinite" tag keys and possible exploitation (not sure exploitation is actually possible here, but a limit plus a warning once you hit it seems safer).
We only allow 4 unique primary tag keys in the UI, so if a user is setting more than that, they are aggregating their stats on an additional dimension only for that value to get discarded in our stats pipeline.
IIRC we do increase the number of keys a customer can set, but it's rare; a max value of 10 I think makes sense here. It provides protection against the extreme while leaving some room for customers with an increased key count.
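A minimal sketch of such a cap (the constant name matches the suggestion above; the `cap` helper is illustrative, and the warning listing dropped keys is elided):

```java
import java.util.List;

class TagKeyCap {
  static final int MAX_ADDITIONAL_TAG_KEYS = 10;

  // Truncate the configured key list at the cap; a real implementation
  // would also log a warning listing the dropped keys.
  static List<String> cap(List<String> keys) {
    if (keys.size() <= MAX_ADDITIONAL_TAG_KEYS) {
      return keys;
    }
    return keys.subList(0, MAX_ADDITIONAL_TAG_KEYS);
  }
}
```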
dougqh left a comment
Given that we've already had issues with cardinality in this code, for me, a minimum viable solution needs to include a more robust approach to tag cardinality.
Right now there is a size cap, but when we hit it, we silently drop stats, which leads to a bad user experience.
I think we need to sort those issues out; otherwise, adding more dimensions will just exacerbate the problem.
```java
ADDITIONAL_TAG_VALUES_CACHE = DDCaches.newFixedSizeCache(64);
private static final Function<
        String, Pair<DDCache<String, UTF8BytesString>, Function<String, UTF8BytesString>>>
    ADDITIONAL_TAG_VALUES_CACHE_ADDER =
```
In hindsight, I wish we hadn't chosen to concatenate key + ":" + value in the payload. For keeping memory down, separate fields would probably have been better.
```java
  return Collections.emptyList();
}

private List<UTF8BytesString> getAdditionalTags(CoreSpan<?> span) {
```
We already have high allocation from creating MetricKey-s just to perform Map look-ups.
Producing an intermediate list with UTF8BytesStrings is going to compound that problem significantly.
I copied how we interact with peer tags, to keep behavior consistent. What do you propose I do differently?
```java
key ->
    Pair.of(
        DDCaches.newFixedSizeCache(512),
        value -> UTF8BytesString.create(key + ":" + value));
```
I think, as is, this degenerates into hot allocation when a tag has high cardinality. In a high-cardinality situation, we need to avoid the concatenation before it happens.
Right now, we're still paying the allocation cost and undoing the benefit of the cache.
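One way to avoid paying the concatenation on guaranteed misses is to stop admitting new values once a bound is reached, rather than allocating a string that a fixed-size cache would immediately evict. A single-threaded sketch with a plain `HashMap` standing in for `DDCache` (the class name and the behavior at the cap are assumptions, not the PR's code):

```java
import java.util.HashMap;
import java.util.Map;

class BoundedValueInterner {
  private final String keyPrefix;
  private final int maxEntries;
  private final Map<String, String> interned = new HashMap<>();

  BoundedValueInterner(String key, int maxEntries) {
    this.keyPrefix = key + ":";
    this.maxEntries = maxEntries;
  }

  // Concatenate only on a genuine first sighting; once the table is
  // full, refuse new values instead of allocating for each one.
  String intern(String value) {
    String existing = interned.get(value);
    if (existing != null) {
      return existing;
    }
    if (interned.size() >= maxEntries) {
      return null; // caller decides how to handle an over-cardinality value
    }
    String combined = keyPrefix + value;
    interned.put(value, combined);
    return combined;
  }
}
```

Under high cardinality this degrades to one map lookup per span with no allocation, instead of one concatenation per span.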
Just to verify, I ran this through Claude as well...
Vulnerabilities / Misconfiguration risk

- No per-value cardinality cap (acknowledged in PR description). The MAX_ADDITIONAL_TAG_KEYS = 10 cap protects against the number of configured keys; it does nothing about the number of unique values per key. A single misconfiguration (DD_TRACE_STATS_ADDITIONAL_TAGS=user_id, request_id, or trace_id) gives the customer an unbounded MetricKey set, unbounded keys-map growth, and an oversized payload to the agent. The PR explicitly defers cardinality protection ("requires an HLL/bounded-set estimator"), which is reasonable for MVP, but in its current form this config is a foot-gun and should not be enabled by default, which it isn't. Two suggestions:
  - Add a log.warn (one-shot, at startup) when additionalTagKeys is non-empty, calling out the cardinality risk and pointing at the future DD_TRACE_STATS_ADDITIONAL_TAGS_CARDINALITY_LIMIT config.
  - Consider a stop-gap "max unique MetricKeys before we stop adding" guard rather than waiting for the full HLL implementation.
- Tag values are not length-bounded before being interned and serialized. UTF8BytesString.create(key + ":" + value) will faithfully serialize whatever the customer set on the span tag. If application code stuffs a JSON body or a stack trace into a tag named in DD_TRACE_STATS_ADDITIONAL_TAGS, the per-value cache entry, the MetricKey, and the msgpack payload all carry it. This is pre-existing for peer tags too, but it is amplified here because the user controls which keys feed this path. A value.length() < N guard with a one-shot warn would be cheap insurance.
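A hedged sketch of that guard (the `blocked_by_tracer` sentinel and the 250-char limit mirror the values the PR description settles on; the class itself is illustrative and the one-shot warn is elided):

```java
class TagValueGuard {
  static final int MAX_ADDITIONAL_TAG_VALUE_LENGTH = 250;

  // Replace over-long values before they reach the per-value cache,
  // the MetricKey, and the msgpack payload.
  static String bound(String value) {
    if (value != null && value.length() > MAX_ADDITIONAL_TAG_VALUE_LENGTH) {
      return "blocked_by_tracer";
    }
    return value;
  }
}
```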
Overhead

- Cache footprint scales with MAX_ADDITIONAL_TAG_KEYS × per-key-cache-size. ADDITIONAL_TAG_VALUES_CACHE is newFixedSizeCache(64) (plenty), but each entry holds an inner newFixedSizeCache(512). With 10 configured keys that's up to 10 × 512 = 5,120 cached UTF8BytesString entries plus the lambda closures, vs. the existing peer-tags cache of similar size. Acceptable, but worth a comment near ADDITIONAL_TAG_VALUES_CACHE explaining the bound (mirroring the existing comment on PEER_TAGS_CACHE).
Also worth noting that the cache is still unbounded in terms of bytes. That has bitten us with SQL statements previously.
And I guess part of what Claude is pointing out is that tags may also contain complicated objects; we might need to filter those out, otherwise allocation could explode if misconfigured. Although that was definitely a pre-existing issue.
Added basic cardinality control and will add a char limit to the string values; would that address all your concerns here?
I'm experimenting with a revised approach.
I think the simplest solution might be to combine the cardinality tracking and the caching into a single class.
In hindsight, I'd argue that DDCache isn't a great fit for metrics. DDCache works well with low cardinality, but doesn't work well in situations where cardinality can be high.
In a strict sense, that's okay because the GC will be able to reclaim the memory, so memory isn't unbounded. However, a solution that deals with high cardinality more gracefully would be preferable.
```java
private final int limitPerTag;
private final HealthMetrics healthMetrics;
private final ConcurrentHashMap<String, Set<String>> seenValuesPerTag =
    new ConcurrentHashMap<>();
```
I think we're going to want a lighter weight collection for this purpose.
ConcurrentHashMap is very allocation heavy.
As mentioned elsewhere, I think a better approach would be to have an AdditionalTagCardinalityLimiter per tag. Then we don't need the extra allocation heft of the ConcurrentHashMap.
I'm also debating whether we can do something cheaper for the Set, but that I think we could leave for later.
I'm also hoping to get away from the extra coordination overhead needed for ConcurrentHashMap by moving more of the metrics work into a background thread.
What do you propose we use instead? An array of length 100 would work but then our lookup time is less efficient.
Not necessarily.
Pointer chasing actually isn't great on modern hardware.
And arrays benefit from prefetch into L3 cache, so often a small array beats a linked list.
Admittedly, I need to benchmark a bit to determine what's best here.
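A sketch of the small-array idea as a per-tag limiter (single-threaded, matching the plan to move metrics work onto a background thread; the class name and the semantics at the cap are assumptions, and no benchmarking claims are made):

```java
class PerTagValueLimiter {
  private final String[] seen;
  private int count;

  PerTagValueLimiter(int limit) {
    this.seen = new String[limit];
  }

  // Linear scan over a small array; for limits around 100 the scan is
  // cache-friendly and avoids ConcurrentHashMap's per-entry allocation.
  boolean admit(String value) {
    for (int i = 0; i < count; i++) {
      if (seen[i].equals(value)) {
        return true; // already admitted
      }
    }
    if (count < seen.length) {
      seen[count++] = value;
      return true;
    }
    return false; // at the cap; caller masks or drops the value
  }
}
```

One limiter per configured tag avoids the Map-of-Set shape entirely; resetting on flush is just zeroing `count`.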
```java
private final Set<String> ignoredResources;
private final List<String> additionalTagKeys;
private final AdditionalTagsCardinalityLimiter cardinalityLimiter;
```
Rather than a single AdditionalTagsCardinalityLimiter that handles all the extra tags, I think a better design would be a limiter per configured tag.
That would avoid the Map-of-Set design, which is adding undue allocation.
Although for the most part, I think we just need to take a step back and rework this code before we add more to it. Rather than continue to review this PR, I'd like to work on solutions to these issues more directly.
- TracerHealthMetrics.previousCounts: 51 -> 52 so the new stats.additional_tag.cardinality_blocked counter has a slot. Otherwise the flush task throws AIOOBE, which the catch logs as "previousCounts array needs resizing..."; that WARN lands in the middle of an otherwise-allowed openliberty exception block, breaking the system test's allowlist regex (test_java_logs).
- metadata/supported-configurations.json: DD_TRACE_STATS_ADDITIONAL_TAGS and DD_TRACE_STATS_ADDITIONAL_TAGS_CARDINALITY_LIMIT had invalid "type":"number" and version "B"; switched to "int" and "A" to match the format the validator expects.
What Does This Do
Adds an implementation of span-derived primary tags for client-side stats in the Java tracer (spec). Users can configure a list of span tag keys via `DD_TRACE_STATS_ADDITIONAL_TAGS`; the tracer extracts the matching values from each eligible span, adds them as an additional aggregation dimension on `MetricKey`, and emits them in the `v0.6/stats` payload under the new `AdditionalMetricTags` field (msgpack array of `"key:value"` strings, mirroring `PeerTags`).

Motivation
Customers want APM stats broken down by custom dimensions beyond the standard `env` primary tag (e.g. `region`, `tenant_id`) for more granular per-segment visibility. Previously this was only available via the deprecated agent-side config; this PR moves it tracer-side per the v1.3.0 spec.

Behavior summary
- Configured via `DD_TRACE_STATS_ADDITIONAL_TAGS=region,tenant_id` (env), `dd.trace.stats.additional.tags=region,tenant_id` (sysprop), or `trace.stats.additional.tags=region,tenant_id` in `dd-java-tracer.properties`. Comma-separated list of tag keys.
- Tag values are read via `span.unsafeGetTag(...)`; only present values are included.
- Spans with `region:us-east-1` aggregate separately from `region:eu-west-1`; spans without the configured tag aggregate together.
- Emitted as `AdditionalMetricTags` (repeated string, each entry `"key:value"`) in the per-metric msgpack map. The field is omitted when no tags were extracted, so the 99% of customers who don't configure the feature pay zero payload overhead.

Hardening
- Key cap: `MAX_ADDITIONAL_TAG_KEYS = 10`. Anything beyond the cap is dropped at startup (after sort) with a `log.warn` listing the dropped keys. Reasoning: the backend stats pipeline only supports a small number of primary tag dimensions (4 by default, up to ~10 for elevated quotas), so configuring more than 10 is misuse.
- Value length cap: `MAX_ADDITIONAL_TAG_VALUE_LENGTH = 250`. Tag values longer than 250 chars are replaced with `blocked_by_tracer`. Protects against accidental misconfigurations like stuffing stack traces, SQL statements, or JSON blobs into a configured tag key (a precedent in this codebase, per Doug's review).
- Cardinality limit: `DD_TRACE_STATS_ADDITIONAL_TAGS_CARDINALITY_LIMIT = 100` (configurable; ≤ 0 falls back to the default with a warn). One global counter per flush bucket, bumped each time a new MetricKey with additional tags is admitted. When the counter is at the cap, any new MetricKey has its additional tag values replaced with `blocked_by_tracer` (the dimension keys are preserved on the wire; blocked spans collapse into a small number of "shape" MetricKeys, not into the no-additional-tags base bucket). Spans whose full MetricKey already exists in the bucket continue to merge into it regardless of the cap. The counter resets on every flush (~10s).
- Health metric: an `onAdditionalTagValueCardinalityBlocked(String tagKey)` callback on `HealthMetrics`, surfaced through `TracerHealthMetrics` as a counter named `stats.additional_tag.cardinality_blocked`. Fires on both length-blocks and cardinality-blocks so operators can see when the feature is being suppressed.

Spec deviation: per-tag isolation
The CSS v1.3.0 spec requires per-tag isolation:
This PR uses a single global counter instead. The trade-off is explicitly accepted for simplicity: no per-tag bookkeeping, no Map of HashSets, just one `AtomicInteger`. A misconfigured `customer_id` can starve `region`'s admission of new MetricKeys within a bucket, but every span still gets emitted with its dimension keys preserved (values masked) and its base stats unchanged. Worth noting if reviewers want stricter spec adherence.

Allocation profile
Took two passes addressing review feedback on per-span allocation:
- `MetricKey.additionalTags` is a `String[]` parallel to the configured-keys list (null in slot `i` means the span didn't set tag `i`), rather than a `List<UTF8BytesString>` of pre-formatted `"key:value"` entries. No intermediate `ArrayList` per span; no per-(key, value) UTF8BytesString cache.
- The serializer writes `"key:value"` directly into the msgpack buffer using a new `MsgPackWriter.writeUTF8(byte[] prefix, byte separator, byte[] suffix)` helper that writes the string header for the combined length and puts the three byte ranges back-to-back. Tag-key bytes are pre-encoded once at construction. The only per-write allocation is `value.getBytes(UTF_8)` per non-null slot per MetricKey per flush, bounded by `maxAggregates × configured_tags` and dominated by per-span volume, which is now leaner.
- The `peerTags` field still uses the older `List<UTF8BytesString>` shape; the same optimization is applicable but out of scope here to keep this diff focused. Filed for a follow-up.

Additional Notes
Out of scope for this PR:

- A `setMeasuredTag(key, value)` programmatic API; the spec marks this as future work.
- The `peerTags` allocation refactor; the same `String[]` + direct-write pattern is applicable, deferred to a follow-up.

Contributor Checklist
- [ ] Add the `type:` and (`comp:` or `inst:`) labels in addition to any other useful labels
- [ ] Avoid using `close`, `fix`, or any linking keywords when referencing an issue
- [ ] Documented the `DD_TRACE_STATS_ADDITIONAL_TAGS` and `DD_TRACE_STATS_ADDITIONAL_TAGS_CARDINALITY_LIMIT` configs

Jira ticket: [PROJ-IDENT]