[C++] How to concatenate multiple tables in one parquet? #41858

@zliucd

Hi,

Is it possible to write multiple tables into a single Parquet file by appending the rows of each table, where every table is read from its own Parquet file? All of the tables have the same columns. This is similar to pandas.concat([df1, df2]) in Python.

For example:

table1
Name   Age
Jim    36
Bill   30

table2
Name   Age
Sam    28
Joe    30

The concatenated table and parquet file should be:

Name   Age
Jim    36
Bill   30
Sam    28
Joe    30

We can concatenate tables using auto con_tables = arrow::ConcatenateTables(...), but I don't see how to write con_tables with parquet::arrow::WriteTable(), because the first parameter of WriteTable() is a single arrow::Table.

This post shows how to merge tables by appending columns, but I need to append rows:
https://stackoverflow.com/questions/71183352/merging-tables-in-apache-arrow

The following code is from Apache Arrow's arrow_reader_writer_test.cc. We could probably read the row groups and columns of each Parquet file manually like this, but is there an easier way to concatenate?

// Reads each row group into its own Table, then concatenates them row-wise.
::arrow::Result<std::shared_ptr<Table>> ReadTableManually(FileReader* reader) {
  std::vector<std::shared_ptr<Table>> tables;

  std::shared_ptr<::arrow::Schema> schema;
  RETURN_NOT_OK(reader->GetSchema(&schema));

  int n_row_groups = reader->num_row_groups();
  int n_columns = schema->num_fields();
  for (int i = 0; i < n_row_groups; i++) {
    // One ChunkedArray per column of row group i.
    std::vector<std::shared_ptr<ChunkedArray>> columns{static_cast<size_t>(n_columns)};

    for (int j = 0; j < n_columns; j++) {
      RETURN_NOT_OK(reader->RowGroup(i)->Column(j)->Read(&columns[j]));
    }

    tables.push_back(Table::Make(schema, columns));
  }

  // All per-row-group tables share the same schema, so they can be stacked.
  return ConcatenateTables(tables);
}

Thanks.


Note: I have implemented a simple version that reads each column from every table and concatenates them, but its performance is poor.

Component(s)

C++

Labels

Component: C++
Status: stale-warning (issues and PRs flagged as stale which are due to be closed if no indication otherwise)
Type: usage (issue is a user question)
