Materialize Iceberg row lineage virtual columns #63787
Merged
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: IcebergTableReader v2 did not materialize ROW_LINEAGE_ROW_ID or ROW_LINEAGE_LAST_UPDATED_SEQUENCE_NUMBER. This adds file-local row position exposure from the file reader layer, records Parquet batch row positions after filtering, fills Iceberg row lineage columns from split metadata, and makes each ParquetReader read only row groups owned by its file range.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Added TableReaderTest.IcebergVirtualColumnsUseRowLineageMetadata
- Added TableReaderTest.ParquetReaderReadsOnlyRowGroupsInFileRange
- Ran git diff --check -- be/src/io/file_factory.h be/src/format/reader/table_reader.cpp be/src/format/reader/file_reader.h be/src/format/new_parquet/parquet_reader.h be/src/format/new_parquet/parquet_reader.cpp be/src/format/table/iceberg_reader_v2.h be/test/format/reader/table_reader_test.cpp
- Could not run ./run-be-ut.sh --run --filter=TableReaderTest.IcebergVirtualColumnsUseRowLineageMetadata -j 8 because the local environment has JDK 11 while the script requires JDK 17
- Could not run ./run-be-ut.sh --run --filter=TableReaderTest.ParquetReaderReadsOnlyRowGroupsInFileRange -j 8 because the local environment has JDK 11 while the script requires JDK 17
- Could not run build-support/clang-format.sh because llvm@16 is not installed locally
- Behavior changed: No
- Does this need documentation: No
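The summary above says each ParquetReader reads only the row groups owned by its file range, but it does not spell out the ownership rule. A minimal sketch, assuming the common midpoint rule (a row group belongs to the range that contains the midpoint of its bytes); `RowGroupMeta` and `owned_row_groups` are hypothetical names for illustration, not the Doris types:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified row-group metadata (not the Doris or Parquet structs).
struct RowGroupMeta {
    int64_t start_offset;  // byte offset where the row group begins in the file
    int64_t byte_size;     // total size of the row group in bytes
};

// Returns the indices of row groups whose midpoint lies in
// [range_start, range_start + range_size). Assumption: midpoint ownership.
std::vector<size_t> owned_row_groups(const std::vector<RowGroupMeta>& groups,
                                     int64_t range_start, int64_t range_size) {
    assert(range_start >= 0 && range_size >= 0);
    const int64_t range_end = range_start + range_size;
    std::vector<size_t> owned;
    for (size_t i = 0; i < groups.size(); ++i) {
        const int64_t mid = groups[i].start_offset + groups[i].byte_size / 2;
        if (mid >= range_start && mid < range_end) {
            owned.push_back(i);
        }
    }
    return owned;
}
```

Because the ranges of one file do not overlap, every row group has exactly one owner under this rule, so no row is read twice or skipped when a file is split across several scan ranges.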
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: IcebergTableReader v2 did not materialize ROW_LINEAGE_ROW_ID or ROW_LINEAGE_LAST_UPDATED_SEQUENCE_NUMBER. This adds a ParquetColumnReader-level virtual file row-position reader, lets IcebergTableReader inject that hidden file-local column only when row lineage needs it, fills Iceberg row lineage columns from split metadata, and makes each ParquetReader read only row groups owned by its file range.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Added TableReaderTest.IcebergVirtualColumnsUseRowLineageMetadata
- Added TableReaderTest.ParquetReaderReadsOnlyRowGroupsInFileRange
- Ran git diff --check -- be/src/format/new_parquet/column_reader.cpp be/src/format/new_parquet/column_reader.h be/src/format/new_parquet/parquet_reader.cpp be/src/format/new_parquet/parquet_reader.h be/src/format/reader/file_reader.h be/src/format/reader/table_reader.h be/src/format/table/iceberg_reader_v2.h
- Could not run ./run-be-ut.sh --run --filter=TableReaderTest.IcebergVirtualColumnsUseRowLineageMetadata -j 8 because the local environment has JDK 11 while the script requires JDK 17
- Could not run ./run-be-ut.sh --run --filter=TableReaderTest.ParquetReaderReadsOnlyRowGroupsInFileRange -j 8 because the local environment has JDK 11 while the script requires JDK 17
- Could not run build-support/clang-format.sh because llvm@16 is not installed locally
- Behavior changed: No
- Does this need documentation: No
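For rows whose lineage values are not stored in the data file, Iceberg derives the row ID as the file's first row ID plus the row's position in the file, and inherits the last-updated sequence number from the file. A rough sketch of that fill step, assuming both values arrive through split metadata; `RowLineageSplitMeta`, `RowLineageColumns`, and `materialize_row_lineage` are hypothetical names, and the sketch ignores values already stored in the file:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the per-split metadata (not the Thrift struct).
struct RowLineageSplitMeta {
    int64_t first_row_id;                  // row ID of the file's first row
    int64_t last_updated_sequence_number;  // sequence number inherited by all rows
};

// Hypothetical output: one value per returned row.
struct RowLineageColumns {
    std::vector<int64_t> row_id;
    std::vector<int64_t> last_updated_sequence_number;
};

// file_row_positions holds the file-local position of each surviving row.
RowLineageColumns materialize_row_lineage(const RowLineageSplitMeta& meta,
                                          const std::vector<int64_t>& file_row_positions) {
    RowLineageColumns out;
    out.row_id.reserve(file_row_positions.size());
    for (int64_t pos : file_row_positions) {
        assert(pos >= 0);
        out.row_id.push_back(meta.first_row_id + pos);
    }
    out.last_updated_sequence_number.assign(file_row_positions.size(),
                                            meta.last_updated_sequence_number);
    return out;
}
```

The row ID depends on the file-local position, which is why the reader needs a hidden row-position column only when row lineage is requested.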
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add BE unit tests covering Iceberg row lineage virtual columns when scan results are filtered by conjuncts and when Parquet row groups are pruned by column predicates.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Added TableReaderTest cases for Iceberg virtual columns with conjunct filtering and row-group predicate pruning. Targeted run was attempted but blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.
- Behavior changed: No
- Does this need documentation: No
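The property these tests check can be illustrated with a toy filter: rows that survive a conjunct must keep the row IDs derived from their original file positions rather than being renumbered. This is only a sketch of the expected behavior; `filter_by_mask` is a hypothetical helper, not the Doris filtering code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Keeps values[i] where keep[i] is non-zero, preserving order.
std::vector<int64_t> filter_by_mask(const std::vector<int64_t>& values,
                                    const std::vector<uint8_t>& keep) {
    assert(values.size() == keep.size());
    std::vector<int64_t> out;
    for (size_t i = 0; i < values.size(); ++i) {
        if (keep[i] != 0) {
            out.push_back(values[i]);
        }
    }
    return out;
}
```

Applying the same mask to the data columns and to the row ID column keeps them aligned: with row IDs 1000 through 1003 and a mask that keeps only the first and last rows, the result holds 1000 and 1003.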
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Add BE unit tests for the Parquet row position virtual column reader, including file-local row positions after expression selection and after selecting different scan ranges.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Added NewParquetReaderTest coverage for RowPositionReader and scan range row-group selection. Targeted run was attempted but blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.
- Behavior changed: No
- Does this need documentation: No
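File-local row positions must stay anchored to the start of the file even when a scan range selects only later row groups. A minimal sketch of that numbering, assuming positions are prefix sums of row-group row counts; `row_positions_for_row_groups` is a hypothetical helper, not the RowPositionReader API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the file-local positions of all rows in the selected row groups.
// row_counts[i] is the number of rows in row group i; selected holds
// row-group indices in ascending order.
std::vector<int64_t> row_positions_for_row_groups(const std::vector<int64_t>& row_counts,
                                                  const std::vector<size_t>& selected) {
    // first_row[i] is the file-local position of row group i's first row.
    std::vector<int64_t> first_row(row_counts.size(), 0);
    for (size_t i = 1; i < row_counts.size(); ++i) {
        first_row[i] = first_row[i - 1] + row_counts[i - 1];
    }
    std::vector<int64_t> positions;
    for (size_t g : selected) {
        assert(g < row_counts.size());
        for (int64_t r = 0; r < row_counts[g]; ++r) {
            positions.push_back(first_row[g] + r);
        }
    }
    return positions;
}
```

With row groups of 3, 2, and 4 rows, a reader that owns only the last row group still reports positions 5 through 8, not 0 through 3.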
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Fix Iceberg virtual column unit tests to validate last_updated_sequence_number after expanding const columns instead of casting the block column directly to ColumnNullable.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Targeted TableReaderTest run was attempted but blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Update Iceberg virtual column tests to validate row id values after expanding possible top-level const columns, matching how virtual columns can be materialized in table blocks.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Source assertions were updated for the reported failure. Local targeted run remains blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Fix Iceberg virtual column tests to set table_format_params through the TFileRangeDesc setter so IcebergTableReader can read row lineage metadata during prepare_split.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Updated test setup for the reported Iceberg virtual column failures. Local targeted run remains blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Iceberg row lineage row id materialization may receive the table block's default virtual column as a top-level ColumnConst. Expand const columns before mutating the nullable int64 payload.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Fixes the reported TableReaderTest Iceberg virtual row id bad cast. Local targeted run remains blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.
- Behavior changed: No
- Does this need documentation: No
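The bad cast arises because a const column stores a single value for all rows, so there is no per-row payload to mutate until the column is expanded. A toy model of the fix, using a hypothetical `ToyColumn` in place of the Doris column classes:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a column that may be const (one stored value, many rows).
struct ToyColumn {
    bool is_const;
    std::vector<int64_t> values;  // exactly one element when is_const
    size_t rows;                  // logical row count
};

// Expands a const column into a full column; returns full columns unchanged.
ToyColumn to_full(const ToyColumn& col) {
    if (!col.is_const) {
        return col;
    }
    assert(col.values.size() == 1);
    return ToyColumn{false, std::vector<int64_t>(col.rows, col.values[0]), col.rows};
}

// Writes first_row_id + i into every row, expanding a const column first.
void write_row_ids(ToyColumn& col, int64_t first_row_id) {
    col = to_full(col);
    assert(col.values.size() == col.rows);
    for (size_t i = 0; i < col.rows; ++i) {
        col.values[i] = first_row_id + static_cast<int64_t>(i);
    }
}
```

Writing through the per-row loop without the expansion step would touch only the single stored value, which is the toy equivalent of the failure this commit fixes.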
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Materialize Iceberg last_updated_sequence_number with an explicitly populated nullable data column before wrapping it in ColumnConst, so the virtual column is non-null for all returned rows.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Fixes the reported TableReaderTest Iceberg virtual sequence null-map failure. Local targeted run remains blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary: Row id materialization expands a default nullable const column whose null map is already sized and filled with NULL. Resize does not overwrite existing values, so explicitly clear the full null map before writing row id values.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- Fixes the reported TableReaderTest Iceberg row id null-map failure. Local targeted run remains blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.
- Behavior changed: No
- Does this need documentation: No
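The same pitfall can be reproduced with `std::vector`, whose `resize` initializes only newly added elements and leaves existing ones untouched. The sketch below uses a hypothetical `ToyNullableInt64` rather than the Doris nullable column:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy nullable int64 column: null_map[i] == 1 marks row i as NULL.
struct ToyNullableInt64 {
    std::vector<int64_t> data;
    std::vector<uint8_t> null_map;
};

// Fills row IDs and marks every row as non-null.
void fill_row_ids(ToyNullableInt64& col, int64_t first_row_id,
                  const std::vector<int64_t>& positions) {
    assert(col.data.size() == col.null_map.size());
    const size_t n = positions.size();
    col.data.resize(n);
    col.null_map.resize(n);  // keeps existing entries, including stale 1s
    // Explicitly clear the whole null map; resize alone would not do this.
    std::fill(col.null_map.begin(), col.null_map.end(), uint8_t{0});
    for (size_t i = 0; i < n; ++i) {
        col.data[i] = first_row_id + positions[i];
    }
}
```

If the column starts with three rows all marked NULL and three positions are written, omitting the `std::fill` call would leave every row NULL despite the data being populated.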
Merged commit 321134d into apache:refact_reader_branch
16 of 18 checks passed