Telemetry: stop retrying into 429s, honour Retry-After, fix userAgent#354
Open
samikshya-db wants to merge 5 commits into
Open
Telemetry: stop retrying into 429s, honour Retry-After, fix userAgent#354samikshya-db wants to merge 5 commits into
samikshya-db wants to merge 5 commits into
Conversation
1300f1a to
fbb0263
Compare
After v1.11.0 enabled telemetry by default via the server feature flag, high-QPS workloads produced excessive 429s on /telemetry-ext. Three issues compounded: 1. Double-retry. The exporter ran its own retry loop on top of the retryablehttp-wrapped HTTP client (internal/client.RetryableClient), which already retries 429/5xx with Retry-After. Result: up to RetryMax * (MaxRetries+1) HTTP attempts per export, all collapsed into one circuit-breaker outcome — so the breaker barely opened. 2. Untraceable in access logs. Telemetry POSTs and feature-flag GETs sent no User-Agent, so 429s were tagged Go-http-client/1.1 and could not be attributed to godatabrickssqlconnector by version. 3. High request volume. FlushInterval=5s, BatchSize=100. Changes: - telemetry/exporter.go: drop the retry loop entirely. doExport now makes a single HTTP request; transient retries (429/5xx, Retry-After) are owned by the underlying retryablehttp client. Each export call → exactly one breaker outcome. - telemetry/exporter.go, telemetry/featureflag.go: set User-Agent header on telemetry POST and feature-flag GET. Built once at the connector site (buildUserAgent in connector.go), mirroring internal/client/client.go format (DriverName/DriverVersion + UserAgentEntry + agent product), and plumbed via TelemetryInitOptions.UserAgent. - telemetry/config.go: FlushInterval 5s → 30s, BatchSize 100 → 200. Remove MaxRetries/RetryDelay from telemetry.Config and TelemetryInitOptions; telemetry_retry_count/_delay DSN params still parse for backwards compat but are no-ops. - telemetry/circuitbreaker.go: lower minimumNumberOfCalls 20 → 10 (so low-traffic clients can still trip the breaker on a sustained outage now that each export is one signal), and raise waitDurationInOpenState 30s → 60s (respect typical Retry-After). - Tests: removed obsolete retry/backoff tests; added single-attempt assertion across 4xx/429/5xx; added User-Agent assertions on both endpoints. Co-authored-by: Isaac
fbb0263 to
d47bf99
Compare
… first flush The previous commit removed the exporter's inner retry loop, but the retryablehttp layer (RetryMax=4) was still amplifying each telemetry export into up to 5 attempts on a 429 — defeating the breaker because only the final outcome was observed. - internal/client: add WithSkipRateLimitRetry context helper. RetryPolicy treats 429 as non-retryable when the flag is set; 5xx and transport errors are unaffected. Set on telemetry POSTs and feature-flag GETs so each rate-limited request is exactly one HTTP transaction = one breaker signal. - telemetry/circuitbreaker: record server-provided Retry-After deltas from 429 responses; the open-state wait becomes max(waitDurationInOpenState, hint). Hint cleared on the next half-open -> closed transition. - telemetry/aggregator: offset the first flush by a random [0, FlushInterval) so a fleet of clients booted together does not phase-align bursts. Co-authored-by: Isaac
- Doc accuracy on retry semantics: WithSkipRateLimitRetry only suppresses 429s. Generic 5xx and transport errors are governed by ClientMethod, so telemetry contexts (which set no ClientMethod) get one-shot semantics on those failure modes; 503 still retries via the method-agnostic path. Updated SkipRateLimitRetry godoc and doExport comment to match. - Drop dead TelemetryRetryCount / TelemetryRetryDelay fields from UserConfig. The DSN keys are still consumed (so they do not leak into session params) but no longer stored. Cleaned up assertions in config_test.go and connector_test.go. - Clear retryAfterHint when the breaker transitions Open -> Half-Open, so a stale hint cannot extend a later reopen that did not carry its own Retry-After. Collapsed the two-step lock dance in execute() into one critical section. Updated godoc. - Guard rand.Int63n against a zero flushInterval in flushLoop. - Document the first-caller-wins User-Agent behaviour on the per-host telemetry client singleton, since later connections on the same host reuse the cached client. Co-authored-by: Isaac
- gofmt -s on connector_test.go, internal/config/config_test.go, telemetry/exporter.go (struct alignment + missing blank line). - internal/config/config.go: telemetry_retry_count is parsed for backwards compat then discarded, so call extract (2 returns) rather than extractAsInt (3 returns) to satisfy dogsled. - telemetry/circuitbreaker.go: drop the now-unused setState helper. execute() now transitions to half-open via setStateUnlocked under an already-held lock. Co-authored-by: Isaac
Collaborator
|
Must-fix
Should-fix
Test coverage
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
After
v1.11.0enabled telemetry by default via the server feature flag, high-QPS workloads produced excessive 429s on/telemetry-ext. Three issues compounded:internal/client.RetryableClient), which already retries 429/5xx withRetry-After. Result: up toRetryMax × (MaxRetries+1)HTTP attempts per export, all collapsed into a single circuit-breaker outcome — so the breaker barely opened against persistent throttling.User-Agent, so 429s landed in access logs tagged asGo-http-client/1.1and could not be attributed togodatabrickssqlconnectorby driver version.FlushInterval=5s/BatchSize=100.Changes
Retry behavior
telemetry/exporter.go— RemoveddoExport's retry loop entirely. doExport now makes a single HTTP request; transient retries (429/5xx,Retry-After) are owned by the underlying retryablehttp client. Eachexport()call now corresponds to exactly one HTTP transaction = one breaker outcome.telemetry/config.go,telemetry/driver_integration.go— RemovedMaxRetries/RetryDelayfromtelemetry.ConfigandTelemetryInitOptions.telemetry_retry_count/telemetry_retry_delayDSN params still parse without error for backwards compatibility but are no-ops.Identifiability
connector.go— NewbuildUserAgenthelper mirroringinternal/client/client.go:295-302exactly:DriverName/DriverVersion+ optionalUserAgentEntry+ agent product.telemetry/exporter.go,telemetry/featureflag.go— SetUser-Agenton telemetry POST and feature-flag GET. Plumbed viaTelemetryInitOptions.UserAgent.Cadence and breaker tuning
telemetry/config.go—FlushInterval5s → 30s,BatchSize100 → 200.telemetry/circuitbreaker.go—minimumNumberOfCalls20 → 10 (so low-traffic clients can trip the breaker now that each export is one signal),waitDurationInOpenState30s → 60s (respect typicalRetry-After).Tests
TestExport_RetryOn5xx,TestExport_ExponentialBackoff,TestIsRetryableStatus, retry-config parsing tests).TestExport_SingleAttemptPerExportcovering 4xx/429/5xx, asserting the exporter never retries.TestExport_SetsUserAgentandTestFetchFeatureFlag_SetsUserAgent.Mitigation
While this rolls out, users can opt out via DSN:
enableTelemetry=false. Server-side: disableenableTelemetryForGoDriverfor affected workspaces.Test plan
go test ./...— all green locally./telemetry-extand/api/2.0/connector-service/feature-flags/GOLANG/...requests carrygodatabrickssqlconnector/<version>inhttp_user_agent./telemetry-extdrops after rollout.This pull request and its description were written by Isaac.