Hi,
Is it possible to write multiple tables to a single Parquet file by appending the rows read from individual Parquet files? All of the tables read from those files have the same columns. This is similar to pandas.concat([df1, df2]) in Python.
For example:
table1
Name Age
Jim 36
Bill 30
table2
Name Age
Sam 28
Joe 30
The concatenated table and parquet file should be:
Name Age
Jim 36
Bill 30
Sam 28
Joe 30
We can concatenate tables using auto con_tables = arrow::ConcatenateTables(...), but it does not seem possible to pass con_tables to parquet::arrow::WriteTable(): its first parameter is a single arrow::Table.
The following post shows how to merge tables by appending columns, but my case is appending rows:
https://stackoverflow.com/questions/71183352/merging-tables-in-apache-arrow
The following code is from Apache Arrow's arrow_reader_writer_test.cc. We could probably read the row groups and columns from each Parquet file (table) manually like this, but is there an easier way to concatenate?
::arrow::Result<std::shared_ptr<Table>> ReadTableManually(FileReader* reader) {
  std::vector<std::shared_ptr<Table>> tables;

  std::shared_ptr<::arrow::Schema> schema;
  RETURN_NOT_OK(reader->GetSchema(&schema));

  int n_row_groups = reader->num_row_groups();
  int n_columns = schema->num_fields();
  // Read every row group column by column and wrap it in its own table.
  for (int i = 0; i < n_row_groups; i++) {
    std::vector<std::shared_ptr<ChunkedArray>> columns{static_cast<size_t>(n_columns)};
    for (int j = 0; j < n_columns; j++) {
      RETURN_NOT_OK(reader->RowGroup(i)->Column(j)->Read(&columns[j]));
    }
    tables.push_back(Table::Make(schema, columns));
  }

  // Stitch the per-row-group tables back together row-wise.
  return ConcatenateTables(tables);
}
Thanks.
Note: I have implemented a simple version that reads each column from every table and concatenates them, but its performance is poor.
Component(s)
C++