[fix](join): avoid join amplification in scalar subquery window rewrite#63763
[fix](join): avoid join amplification in scalar subquery window rewrite#63763starocean999 wants to merge 2 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
There was a problem hiding this comment.
I found a remaining correctness hole in the scalar-subquery window rewrite. The new partitioning fixes the case where an outer-only relation has a visible distinguishing column, but the rule can still fire when the distinguishing columns have been pruned from the Apply left output, leaving only the correlated key and still allowing join amplification.
Critical checkpoint conclusions:
- Goal/test proof: The goal is to avoid join amplification in the scalar-subquery-to-window rewrite. The added test proves one case with a visible
dim.did, but does not prove cases where duplicate outer-only rows are not distinguished by the visible output. - Scope/focus: The change is focused, but the implementation is incomplete for pruned duplicate-distinguishing columns.
- Concurrency/lifecycle: No new concurrency, locking, or lifecycle concerns identified.
- Configuration/compatibility: No new configs or incompatible serialization/storage changes identified in the actual PR diff.
- Parallel paths: The relevant parallel path is the same rewrite when column pruning leaves only correlated slots from the outer-only relation; it still has the original amplification problem.
- Tests/results: Existing new regression output is deterministic, but coverage should add the non-visible duplicate-column case or the rule should bail out when it cannot partition by the full outer-only relation identity.
- Observability/transactions/persistence: Not applicable.
- Performance: Partitioning by extra slots can cost more, but no blocking performance issue beyond the correctness concern.
User focus points: No additional user-provided review focus was present.
|
/review |
There was a problem hiding this comment.
Review summary: no additional blocking issues found in the current PR diff beyond the existing inline thread already recorded for the scalar-subquery window rewrite.
Critical checkpoint conclusions:
- Goal and proof: The PR narrows the scalar-subquery-to-window rewrite by partitioning on the full visible outer-only relation row image and adds FE/regression coverage for the safe and pruned-output cases. The change appears to accomplish that goal in the reviewed diff.
- Scope: The current diff is small and focused on the rewrite and its tests.
- Concurrency: No new concurrency or shared mutable runtime state is introduced beyond rule-instance analysis state that is reset per check.
- Lifecycle/static initialization: No new static initialization or special lifecycle concern found.
- Configuration/compatibility/protocol: No new configs, storage format, function symbol, or FE-BE protocol compatibility changes in the current PR diff.
- Parallel paths: The relevant scalar-subquery rewrite path and regression coverage were reviewed; no additional parallel rule path requiring the same change was identified.
- Conditional checks: The new guard that disables the rewrite when the outer-only relation output is pruned is consistent with preserving row identity.
- Tests: Added unit and regression tests cover the intended safe rewrite and no-rewrite fallback; no additional user-provided review focus was specified.
- Observability/transactions/data writes/memory/performance: Not materially applicable to this optimizer rewrite; no new observability, transaction, write-path, memory-accounting, or obvious performance issue found.
Focus response: review_focus.txt contained no additional focus points.
|
run buildall |
TPC-H: Total hot run time: 31585 ms |
TPC-DS: Total hot run time: 171454 ms |
FE Regression Coverage ReportIncrement line coverage |
Problem Summary:
The rewrite in AggScalarSubQueryToWindowFunction.java partitioned the generated window function only by the correlated slots. That assumption is not sufficient when the outer query contains an additional relation whose rows are duplicated under the same correlated key. In that situation, multiple distinct outer rows are merged into the same window partition, so the rewritten window aggregate is evaluated over join-amplified rows rather than over the logical row set seen by the original correlated scalar subquery. The rule also kept mutable matching state on the rewriter instance, which needed to be reset per rewrite attempt.
Fix:
The fix changes the rewrite to partition the generated window function by all visible slots from the outer-only relation, so different outer rows that share the same correlated key remain isolated in separate window partitions. This preserves the original scalar-subquery semantics even when the outer plan has join-expanded rows.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)