Scheduler client reuse by SubhamSinghal · Pull Request #1796 · apache/datafusion-ballista

SubhamSinghal · 2026-05-31T05:23:41Z

Which issue does this PR close?

Rationale for this change

Today every query against a Ballista cluster opens a brand-new TCP+TLS+HTTP/2 connection to the scheduler, runs
execute_query plus a polling loop of get_job_status, and discards the connection. Even when the same
SessionContext runs many queries against one scheduler URL, the handshake is paid on every query.

tonic::transport::Channel is Clone and HTTP/2 multiplexes concurrent RPCs over one connection by design — so
the per-query reconnect is throwing away exactly the thing meant to be shared. For interactive REPLs and JDBC
sessions, this is a constant per-query latency floor (and worse on TLS-enabled deployments).

Two TODO reuse the scheduler to avoid connecting to the Ballista scheduler again and again markers flag this in
the source today:

ballista/core/src/execution_plans/distributed_query.rs:354 (execute_query_pull)
ballista/core/src/execution_plans/distributed_query.rs:527 (execute_query_push)

What changes are included in this PR?

Adds a per-planner, lazily-initialized, idle-evicting, single-flight SchedulerChannelCache that holds one
tonic::transport::Channel per BallistaQueryPlanner. The cache is constructed once per planner (≈ once per
SessionContext) and cloned into every DistributedQueryExec it builds, so all queries on a session share one
HTTP/2 connection.

Are these changes tested?

Yes:

9 unit tests on SchedulerChannelCache: lazy init, single-flight under 16 concurrent callers,
build-failure-doesn't-poison, invalidate() swap semantics, idle eviction with and without an in-flight build,
drop-of-cache exits the eviction task (proving the eviction task holds only a Weak), and the
is_transport_error predicate.
1 unit test in BallistaQueryPlanner asserting that two create_physical_plan calls produce exec nodes whose
caches share the same Arc<Inner> — the invariant that makes the feature actually work end-to-end.
All pre-existing ballista-core tests pass unchanged (108 / 108 lib tests green).

Are there any user-facing changes?

One new config option:

ballista.client.scheduler_channel_idle_timeout_seconds — idle timeout in seconds for the client-side cached
scheduler gRPC channel. After this many seconds with no query activity, the underlying TCP connection is closed
and the next query reconnects. Default 300.

milenkovicm · 2026-05-31T07:46:16Z

Thanks @SubhamSinghal for your contribution.

There is similar functionality used for executor client caching, I wonder if provided infrastructure could be reused for scheduler client caching?

Also, do you really see this as a performance bottleneck?

SubhamSinghal · 2026-05-31T14:50:40Z

@milenkovicm

Quick clarification on what you have in mind by "reuse"

Same pattern: a DashMap<scheduler_id, ...> field on the planner that mirrors SchedulerClients in executor_server.rs.
Lift to shared primitive: extract the existing executor-side caching into a reusable type in ballista-core and have both ExecutorServer and BallistaQueryPlanner use it.
Something else I'm missing.

Also, do you really see this as a performance bottleneck?

I haven't measured it as such. I noticed the per-query reconnect(explicit mentioned TODO) while reading the client path and the fix seemed self-contained.

Scheduler client reuse

b2dfac3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scheduler client reuse#1796

Scheduler client reuse#1796
SubhamSinghal wants to merge 1 commit into
apache:mainfrom
SubhamSinghal:scheduler-client-reuse

SubhamSinghal commented May 31, 2026

Uh oh!

milenkovicm commented May 31, 2026

Uh oh!

SubhamSinghal commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

SubhamSinghal commented May 31, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

milenkovicm commented May 31, 2026

Uh oh!

SubhamSinghal commented May 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants