Support OTLP runtime metrics with OTel-native naming #11318
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 953c8710a6
mcculls left a comment:
JvmOtlpRuntimeMetrics needs to be moved under the agent-jmxfetch module. You can then start it from the JMXFetch class, which means you won't need to change anything in the Agent class.
One thought I did have is that we'd still be sending JMXFetch runtime metrics over DogStatsD, and that this somewhat duplicates the existing metrics managed there. Is there a plan to address this?
BTW another approach would be to write an implementation of AgentStatsdReporter (say AgentOtlpReporter) which sends metrics data from JMXFetch to the OTel API. We could pass it into the embedded JMXFetch service when OTLP is enabled for metrics. This would let you send the existing JMXFetch metrics over OTLP and avoid having two different pieces of runtime metrics code to maintain.
(Downside is that a couple of the JMXFetch metrics are different to the OTel metrics, but we might need to converge those anyway.)
Agreed that this is suboptimal. I'll pass in a new .yaml file w/
@mcculls The problem w/ using JMXFetch is that it is structured to handle DogStatsD metric formats, so to emit the OTel runtime metrics over OTLP we would still need to convert the results JMXFetch returns into what OTLP expects. This means that we would effectively duplicate the OTel runtime metrics information in two places: a .yaml file, and the tracer code that converts the JMXFetch results. IMO this is less optimal than just invoking the OTel Metrics API directly.
…ate from depending on otel-shim to otel-bootstrap
What Does This Do
Adds an OTLP runtime-metrics path that emits JVM runtime metrics with OTel semantic-convention names (jvm.*) through the agent's MeterProvider, instead of the proprietary DogStatsD names (jvm.heap_memory, jvm.thread_count, …).

When the three flags below are set together, JvmOtlpRuntimeMetrics.start() is invoked from JMXFetch.run() and registers 15 instruments backed by java.lang.management MXBean callbacks. They flow through the existing OTLP exporter, with no new transport. To avoid double-reporting, JMXFetch switches to jmxfetch-config-no-jvm-defaults.yaml (which sets collect_default_jvm_metrics: false) instead of the default config when OTLP runtime metrics are enabled.

JvmOtlpRuntimeMetrics lives in the agent-jmxfetch module. Starting it from JMXFetch.run() lets it ride the same delayed-start path as the rest of JMXFetch, avoiding the JMX side-effects that would occur if it were started from Agent.installDatadogTracer().

| Flag | Required value | Default |
|---|---|---|
| DD_RUNTIME_METRICS_ENABLED | true | true |
| DD_METRICS_OTEL_ENABLED | true | false |
| DD_METRICS_OTEL_EXPORTER | otlp | |

Instruments registered (15 total; Recommended + Development per the OTel JVM semconv):
- jvm.memory.used, jvm.memory.committed, jvm.memory.limit, jvm.memory.init, jvm.memory.used_after_last_gc
- jvm.buffer.memory.used, jvm.buffer.memory.limit, jvm.buffer.count
- jvm.thread.count
- jvm.class.loaded, jvm.class.count, jvm.class.unloaded
- jvm.cpu.time, jvm.cpu.count, jvm.cpu.recent_utilization

jvm.gc.duration is intentionally deferred. The spec requires a Histogram of per-collection pause durations, but GarbageCollectorMXBean only exposes cumulative collection time. Populating the histogram requires either subscribing to GarbageCollectionNotificationInfo via JMX (blocked by the bootstrap-class-loading constraints in docs/bootstrap_design_guidelines.md) or consuming JFR GarbageCollection events. Tracked as a follow-up.

Related system tests PR enabling tests: DataDog/system-tests#6800
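As a rough illustration of what "instruments backed by java.lang.management MXBean callbacks" can look like, the sketch below builds two callbacks for jvm.memory.used, one per jvm.memory.type value. It is not the PR's code: MemoryCallbacksSketch and memoryUsedCallbacks are hypothetical names, the callbacks are plain LongSupplier values rather than OTel observable instruments, and reading totals from MemoryMXBean (instead of per memory pool) is a simplifying assumption.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Hypothetical sketch; names and structure are assumptions, not taken from the PR.
final class MemoryCallbacksSketch {

  // Returns one callback per jvm.memory.type attribute value ("heap", "non_heap").
  // Each callback reads the current value from the MXBean when it is invoked,
  // which is how an observable (asynchronous) instrument would be fed.
  static Map<String, LongSupplier> memoryUsedCallbacks() {
    MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
    Map<String, LongSupplier> callbacks = new LinkedHashMap<>();
    callbacks.put("heap", () -> memory.getHeapMemoryUsage().getUsed());
    callbacks.put("non_heap", () -> memory.getNonHeapMemoryUsage().getUsed());
    return callbacks;
  }

  public static void main(String[] args) {
    // Simulates a single collection cycle by invoking every callback once.
    memoryUsedCallbacks().forEach((type, callback) ->
        System.out.println(
            "jvm.memory.used{jvm.memory.type=" + type + "} = " + callback.getAsLong()));
  }
}
```

Because the values are read lazily inside the callback, nothing is sampled until an exporter (or, here, main) asks for a reading.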
Motivation
Customers running with DD_METRICS_OTEL_EXPORTER=otlp route their telemetry to an OTel collector: there may not be a Datadog Agent on the path, and therefore nothing listening on the DogStatsD socket. Today the tracer's runtime metrics still emit through DogStatsD with proprietary names (jvm.heap_memory, …), so in those deployments runtime metrics silently go nowhere.
This change emits the same runtime metric data as OTLP instruments with OTel semantic-convention names through the OTel MeterProvider, so it travels the same OTLP pipeline the customer already configured. Customers who haven't opted into OTLP metrics see no change: the existing DogStatsD path is untouched.
Additional Notes
- start() is single-shot: an AtomicBoolean CAS guards against re-entry from re-init, and on failure we log and stop (partial registration is worse than a silent retry).
- java.lang.management.* plus com.sun.management.OperatingSystemMXBean for CPU. CPU instruments are skipped at registration time on JVMs where the com.sun bean isn't present. No javax.management.* is touched, keeping the constraints in docs/bootstrap_design_guidelines.md intact.
- agent-jmxfetch depends on otel-bootstrap at build time (compile-only). The OTel API is vendor-repackaged into otel-bootstrap at build time, so it won't conflict with anything in the customer app.
- OtelRunnableObservable (new, in otel-bootstrap) provides a Runnable-backed OtelObservable for lambda-style registration; it rate-limits exception logging from the callback.
- jmxfetch-config-no-jvm-defaults.yaml is registered as a GraalVM native-image resource in ResourcesFeatureInstrumentation so AOT/native-image builds can load it.
- JvmOtlpRuntimeMetricsTest (JUnit 5, opentelemetry-1.47 module) covers instrument surface, attribute keys (jvm.memory.type=heap|non_heap), positive values for live metrics (jvm.memory.used, jvm.thread.count), and idempotency of repeated start() calls.
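The single-shot behaviour of start() described in the notes above could be sketched roughly as follows. This is an illustrative assumption about the shape of such a guard, not the PR's implementation: SingleShotStarter and its registration callback are hypothetical, and logging is reduced to System.err.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a single-shot start guard; not the PR's actual class.
final class SingleShotStarter {
  private final AtomicBoolean started = new AtomicBoolean(false);
  private final Runnable registration;

  SingleShotStarter(Runnable registration) {
    this.registration = registration;
  }

  // Returns true only for the one call that ran the registration successfully.
  boolean start() {
    // compareAndSet succeeds for exactly one caller, even under concurrent re-init.
    if (!started.compareAndSet(false, true)) {
      return false;
    }
    try {
      registration.run();
      return true;
    } catch (RuntimeException e) {
      // Log and stop: the flag stays set, so a later start() does not retry.
      System.err.println("Runtime metrics registration failed: " + e);
      return false;
    }
  }

  public static void main(String[] args) {
    SingleShotStarter starter =
        new SingleShotStarter(() -> System.out.println("registering instruments"));
    System.out.println(starter.start()); // first call runs the registration
    System.out.println(starter.start()); // repeated call is a no-op
  }
}
```

Leaving the flag set after a failure is what makes the guard "log and stop" rather than retry on the next initialisation attempt.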