Skip to content

ref(vortex-io): unify cloud object storage api#8259

Draft
m7kss1 wants to merge 10 commits into
vortex-data:developfrom
m7kss1:feat-cloud-object-store
Draft

ref(vortex-io): unify cloud object storage api#8259
m7kss1 wants to merge 10 commits into
vortex-data:developfrom
m7kss1:feat-cloud-object-store

Conversation

@m7kss1

@m7kss1 m7kss1 commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Cloud object store construction was duplicated across vortex-jni and vortex-duckdb,
each with hand-rolled per-scheme builders (AmazonS3Builder, MicrosoftAzureBuilder,
GoogleCloudStorageBuilder), hardcoded scheme lists, and inconsistent credential handling

Introduce FileLocation in vortex-io as the single resolution point for any
URL or path string - no manual builder code required

// Handles local paths, file://, s3://, gs://, az://, https:// - anything object_store supports
// Use credentials from env (AWS_ACCESS_KEY_ID etc.)
match FileLocation::resolve(url)? {
    FileLocation::Local(path) => { /* open with async_fs */ }
    FileLocation::Remote { store, path } => { /* open with ObjectStore */ }
}

// Explicit credentials override env (e.g. per-connection props passed)
// props example: {"access_key_id": "XXX", "secret_access_key": XXX", "region": "eu-west-X"}
FileLocation::resolve_with_props(url, java_properties.iter()
    .map(|(k, v)| (k.as_str(), v.as_str())))?

By the way, now object store works out of the box in CLI :

$ AWS_REGION=us-east-1 vx query s3://bucket/hits.vortex --sql "select count(*) from hits"
$ vx tree gs://bucket/hits.vortex

Closes: #000

Testing

AI disclosure: XXX

@m7kss1 m7kss1 force-pushed the feat-cloud-object-store branch 2 times, most recently from bd6efc0 to 70f771d Compare June 5, 2026 06:59
@m7kss1 m7kss1 changed the title [WIP] ref(vortex-io): unify cloud object storage api ref(vortex-io): unify cloud object storage api Jun 5, 2026
@m7kss1

m7kss1 commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

note: verified only on s3-compatible storages, azure and gcs remain untested for now

@AdamGS AdamGS self-assigned this Jun 5, 2026
@AdamGS

AdamGS commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Hi @m7kss1 thanks for your PR! I'll review it next week.

@m7kss1 m7kss1 requested a review from a team June 5, 2026 18:15
m7kss1 added 8 commits June 7, 2026 19:04
Consolidate cloud storage builders and resolvers from multiple integrations
(JNI, DuckDB, datafusion-bench) into a single canonical core module.

New:
- vortex-io/src/object_store/{cloud,registry}.rs with Registry and resolve_url
- VortexOpenOptions::open_url dispatcher for CLI and library users
- 'cloud' umbrella feature gating S3, GCS, Azure, HTTP support

Migrate:
- JNI, DuckDB, datafusion-bench to use core make_object_store
- CLI commands to accept remote URLs (s3://, gs://, az://)

Improvements:
- Consistent credential/endpoint handling across integrations
- First-class remote file support in vortex query/tree/browse/segments

Signed-off-by: Maxim Dergousov <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
@m7kss1 m7kss1 force-pushed the feat-cloud-object-store branch from 097a411 to c952c6d Compare June 7, 2026 19:08
@codspeed-hq

codspeed-hq Bot commented Jun 8, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 4 improved benchmarks
❌ 4 regressed benchmarks
✅ 1505 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation baseline_eq[4, 65536] 186.5 µs 239.2 µs -22.02%
Simulation baseline_lt[16, 65536] 219 µs 276.8 µs -20.87%
Simulation baseline_lt[4, 65536] 202.3 µs 254.4 µs -20.47%
Simulation baseline_eq[16, 65536] 231.4 µs 261.7 µs -11.57%
Simulation bitwise_not_vortex_buffer_mut[128] 275.3 ns 216.9 ns +26.89%
Simulation bitwise_not_vortex_buffer_mut[1024] 336.9 ns 278.6 ns +20.94%
Simulation bitwise_not_vortex_buffer_mut[2048] 400.6 ns 342.2 ns +17.05%
Simulation encode_varbin[(1000, 2)] 164.1 µs 143.1 µs +14.65%

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing m7kss1:feat-cloud-object-store (c952c6d) with develop (e06d80b)

Open in CodSpeed

@robert3005 robert3005 added action/benchmark-sql Trigger SQL benchmarks to run on this PR changelog/chore A trivial change labels Jun 9, 2026
@robert3005

Copy link
Copy Markdown
Contributor

You have one failing test case on windows object_store::cloud::tests::test_resolve. Otherwise I think this looks good.

Comment thread vortex-io/src/object_store/cloud.rs Outdated
Comment thread vortex-io/src/object_store/cloud.rs Outdated
.with_url(url.to_string())
// Use a generic S3 endpoint to avoid DNS resolution issues with
// region-specific endpoints.
.with_endpoint("https://s3.amazonaws.com")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be configurable

@m7kss1 m7kss1 Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AdamGS indeed, but if we make them configurable, should we keep the current ones as defaults + change api to allow use custom properties? in trunk, these settings are hard-coded, so this should be handled extra carefully (ref: https://github.com/vortex-data/vortex/blob/develop/vortex-jni/src/object_store.rs#L71-L78)

Comment thread vortex-io/src/object_store/registry.rs Outdated
Comment thread vortex-io/src/object_store/cloud.rs Outdated

@AdamGS AdamGS left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a few more focused comments, but I think there are two bigger issues here:

  1. I don't think this quite solves the issue you describe in the PR.
    While it does centralize the implementations (which I agree is desirable), it keeps some of inconsistencies between them. The registry is a lot more configurable than make_object_store, which hard-codes a few assumptions.
  2. It introduces a bunch of new public APIs like the Registry but more importantly ObjectStore, which is currently a crate we use in non-rust integrations and CAN work with for rust codebases when it makes sense for the user. I think our current IO traits will likely change, but still we should probably use them externally.

@m7kss1

m7kss1 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@AdamGS thanks, that's a good point. the idea behind this approach was to avoid making any major changes to the current workflow and reuse as much as possible from the original implementation, to unlock the possibility for future refactoring in other crates. though, i think we can make them directly inside this MR, so I've temporarily marked MR as a draft and will ping you once i've finished the changes

@m7kss1 m7kss1 marked this pull request as draft June 9, 2026 12:18
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
@m7kss1 m7kss1 force-pushed the feat-cloud-object-store branch from 48c5ed0 to b831585 Compare June 9, 2026 15:00
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
@m7kss1 m7kss1 force-pushed the feat-cloud-object-store branch from b831585 to 016c2c1 Compare June 9, 2026 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

action/benchmark-sql Trigger SQL benchmarks to run on this PR changelog/chore A trivial change

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants