Materialize Iceberg row lineage virtual columns #63787

Merged
Gabriel39 merged 12 commits into apache:refact_reader_branch from Gabriel39:refactor_0528 on May 28, 2026

Contributor

@Gabriel39 Gabriel39 commented May 28, 2026

What problem does this PR solve?

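  1. ParquetReader reads only the row groups that fall within its assigned file range.
  2. ParquetReader supports a virtual column reader that exposes file-local row positions (RowPosition).
  3. IcebergReader materializes the Iceberg row lineage virtual columns.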

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Gabriel39 added 2 commits May 28, 2026 12:04
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: IcebergTableReader v2 did not materialize ROW_LINEAGE_ROW_ID or ROW_LINEAGE_LAST_UPDATED_SEQUENCE_NUMBER. This adds file-local row position exposure from the file reader layer, records Parquet batch row positions after filtering, fills Iceberg row lineage columns from split metadata, and makes each ParquetReader read only row groups owned by its file range.
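
The "row groups owned by its file range" rule could be sketched as below. This is only an illustration: the PR text does not state the ownership rule, so the midpoint test used here (a convention found in other Parquet readers) is an assumption, and `RowGroupExtent` and `row_groups_owned_by_range` are hypothetical names, not Doris APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one Parquet row group's byte extent.
struct RowGroupExtent {
    int64_t start_offset;  // first byte of the row group in the file
    int64_t total_bytes;   // size of the row group in bytes
};

// Returns the indices of row groups whose midpoint lies inside
// [range_start, range_start + range_size). Assumption: ownership is decided
// by the midpoint, so every row group belongs to exactly one of a set of
// contiguous, non-overlapping ranges that cover the file.
std::vector<size_t> row_groups_owned_by_range(const std::vector<RowGroupExtent>& groups,
                                              int64_t range_start, int64_t range_size) {
    assert(range_size >= 0);
    const int64_t range_end = range_start + range_size;
    std::vector<size_t> owned;
    for (size_t i = 0; i < groups.size(); ++i) {
        const int64_t mid = groups[i].start_offset + groups[i].total_bytes / 2;
        if (mid >= range_start && mid < range_end) {
            owned.push_back(i);
        }
    }
    return owned;
}
```

With a rule of this shape, two readers given adjacent ranges of the same file never return the same row group twice, which is the property the new `ParquetReaderReadsOnlyRowGroupsInFileRange` test name suggests is being checked.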

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added TableReaderTest.IcebergVirtualColumnsUseRowLineageMetadata
    - Added TableReaderTest.ParquetReaderReadsOnlyRowGroupsInFileRange
    - Ran git diff --check -- be/src/io/file_factory.h be/src/format/reader/table_reader.cpp be/src/format/reader/file_reader.h be/src/format/new_parquet/parquet_reader.h be/src/format/new_parquet/parquet_reader.cpp be/src/format/table/iceberg_reader_v2.h be/test/format/reader/table_reader_test.cpp
    - Could not run ./run-be-ut.sh --run --filter=TableReaderTest.IcebergVirtualColumnsUseRowLineageMetadata -j 8 because the local environment has JDK 11 while the script requires JDK 17
    - Could not run ./run-be-ut.sh --run --filter=TableReaderTest.ParquetReaderReadsOnlyRowGroupsInFileRange -j 8 because the local environment has JDK 11 while the script requires JDK 17
    - Could not run build-support/clang-format.sh because llvm@16 is not installed locally
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: IcebergTableReader v2 did not materialize ROW_LINEAGE_ROW_ID or ROW_LINEAGE_LAST_UPDATED_SEQUENCE_NUMBER. This adds a ParquetColumnReader-level virtual file row-position reader, lets IcebergTableReader inject that hidden file-local column only when row lineage needs it, fills Iceberg row lineage columns from split metadata, and makes each ParquetReader read only row groups owned by its file range.
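
A minimal sketch of how the two lineage columns might be filled from split metadata plus the hidden file-local row position. It assumes the Iceberg v3 convention that a row without a stored row id gets `first_row_id + position` and inherits the file's sequence number; `SplitRowLineage`, `LineageColumns`, and `materialize_row_lineage` are hypothetical names for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-split metadata (assumed to arrive with the scan range).
struct SplitRowLineage {
    int64_t first_row_id;
    int64_t last_updated_sequence_number;
};

// Hypothetical output: one value per returned row.
struct LineageColumns {
    std::vector<int64_t> row_id;
    std::vector<int64_t> last_updated_sequence_number;
};

// file_positions holds the file-local position of every row that survived
// filtering, in output order.
LineageColumns materialize_row_lineage(const SplitRowLineage& split,
                                       const std::vector<int64_t>& file_positions) {
    LineageColumns out;
    out.row_id.reserve(file_positions.size());
    for (int64_t pos : file_positions) {
        assert(pos >= 0);
        out.row_id.push_back(split.first_row_id + pos);
    }
    // The sequence number is the same for every row of the split.
    out.last_updated_sequence_number.assign(file_positions.size(),
                                            split.last_updated_sequence_number);
    return out;
}
```

Because the row id depends on the position in the file rather than the position in the output batch, the row-position column is only needed when a query actually requests row lineage, which matches the "inject only when needed" wording above.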

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - Added TableReaderTest.IcebergVirtualColumnsUseRowLineageMetadata
    - Added TableReaderTest.ParquetReaderReadsOnlyRowGroupsInFileRange
    - Ran git diff --check -- be/src/format/new_parquet/column_reader.cpp be/src/format/new_parquet/column_reader.h be/src/format/new_parquet/parquet_reader.cpp be/src/format/new_parquet/parquet_reader.h be/src/format/reader/file_reader.h be/src/format/reader/table_reader.h be/src/format/table/iceberg_reader_v2.h
    - Could not run ./run-be-ut.sh --run --filter=TableReaderTest.IcebergVirtualColumnsUseRowLineageMetadata -j 8 because the local environment has JDK 11 while the script requires JDK 17
    - Could not run ./run-be-ut.sh --run --filter=TableReaderTest.ParquetReaderReadsOnlyRowGroupsInFileRange -j 8 because the local environment has JDK 11 while the script requires JDK 17
    - Could not run build-support/clang-format.sh because llvm@16 is not installed locally
- Behavior changed: No
- Does this need documentation: No
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Gabriel39 added 10 commits May 28, 2026 13:04
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add BE unit tests covering Iceberg row lineage virtual columns when scan results are filtered by conjuncts and when Parquet row groups are pruned by column predicates.
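
The behaviour these tests target can be pictured with a small sketch: after a conjunct filters a batch, the recorded row positions must be those of the surviving rows, not a fresh 0..n-1 sequence. The function name and the byte-per-row filter representation are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// batch_first_row: file-local position of the first row in the batch.
// filter: one byte per row in the batch, non-zero means the row is kept.
// Returns the file-local positions of the kept rows, in order.
std::vector<int64_t> surviving_row_positions(int64_t batch_first_row,
                                             const std::vector<uint8_t>& filter) {
    assert(batch_first_row >= 0);
    std::vector<int64_t> positions;
    for (size_t i = 0; i < filter.size(); ++i) {
        if (filter[i] != 0) {
            positions.push_back(batch_first_row + static_cast<int64_t>(i));
        }
    }
    return positions;
}
```

For a batch starting at file row 100 with filter `{1, 0, 0, 1, 1}`, the kept positions are 100, 103, and 104.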

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - Added TableReaderTest cases for Iceberg virtual columns with conjunct filtering and row-group predicate pruning. Targeted run was attempted but blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Add BE unit tests for the Parquet row position virtual column reader, including file-local row positions after expression selection and after selecting different scan ranges.
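
One way to keep positions file-local when a scan range starts at a later row group is to offset by the row counts of all preceding row groups, including those the reader skips. The sketch below assumes the per-row-group row counts are available from the file footer; `first_row_of_row_group` is a hypothetical helper, not a function from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// row_counts: number of rows in each row group, in file order.
// Returns the file-local position of the first row of row group `index`.
int64_t first_row_of_row_group(const std::vector<int64_t>& row_counts, size_t index) {
    assert(index <= row_counts.size());
    int64_t first = 0;
    for (size_t i = 0; i < index; ++i) {
        first += row_counts[i];
    }
    return first;
}
```

A reader that owns only the third row group of a file with row counts 100, 50, and 70 would then start numbering at 150 rather than 0.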

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - Added NewParquetReaderTest coverage for RowPositionReader and scan range row-group selection. Targeted run was attempted but blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix Iceberg virtual column unit tests to validate last_updated_sequence_number after expanding const columns instead of casting the block column directly to ColumnNullable.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - Targeted TableReaderTest run was attempted but blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Update Iceberg virtual column tests to validate row id values after expanding possible top-level const columns, matching how virtual columns can be materialized in table blocks.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - Source assertions were updated for the reported failure. Local targeted run remains blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Fix Iceberg virtual column tests to set table_format_params through the TFileRangeDesc setter so IcebergTableReader can read row lineage metadata during prepare_split.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - Updated test setup for the reported Iceberg virtual column failures. Local targeted run remains blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Iceberg row lineage row id materialization may receive the table block's default virtual column as a top-level ColumnConst. Expand const columns before mutating the nullable int64 payload.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - Fixes the reported TableReaderTest Iceberg virtual row id bad cast. Local targeted run remains blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Materialize Iceberg last_updated_sequence_number with an explicitly populated nullable data column before wrapping it in ColumnConst, so the virtual column is non-null for all returned rows.

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - Fixes the reported TableReaderTest Iceberg virtual sequence null-map failure. Local targeted run remains blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.

- Behavior changed: No

- Does this need documentation: No
### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: Row id materialization expands a default nullable const column whose null map is already sized and filled with NULL. Resize does not overwrite existing values, so explicitly clear the full null map before writing row id values.
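
The pitfall described above can be reproduced with plain `std::vector`, which has the same property: resizing to the current size leaves existing elements untouched. The toy type below is a stand-in, not the Doris `ColumnNullable` class, and the "1 means NULL" encoding is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a nullable int64 column (1 = NULL, 0 = not NULL).
struct ToyNullableInt64 {
    std::vector<int64_t> data;
    std::vector<uint8_t> null_map;
};

// Buggy variant: when null_map already has ids.size() entries set to 1,
// resize() changes nothing and every row stays NULL.
void write_row_ids_resize_only(ToyNullableInt64& col, const std::vector<int64_t>& ids) {
    col.data.assign(ids.begin(), ids.end());
    col.null_map.resize(ids.size(), 0);
}

// Fixed variant: assign() overwrites every entry, so all rows become non-NULL.
void write_row_ids_clearing(ToyNullableInt64& col, const std::vector<int64_t>& ids) {
    col.data.assign(ids.begin(), ids.end());
    col.null_map.assign(ids.size(), 0);
}
```

Starting from a three-row column whose null map is all ones, the first function leaves the map as `{1, 1, 1}` while the second yields `{0, 0, 0}`.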

### Release note

None

### Check List (For Author)

- Test: Unit Test

    - Fixes the reported TableReaderTest Iceberg row id null-map failure. Local targeted run remains blocked because JAVA_HOME points to JDK 11 and JDK_17 is not set.

- Behavior changed: No

- Does this need documentation: No
@Gabriel39 Gabriel39 merged commit 321134d into apache:refact_reader_branch May 28, 2026
16 of 18 checks passed
