ref(vortex-io): unify cloud object storage api#8259
Conversation
bd6efc0 to
70f771d
Compare
|
note: verified only on |
|
Hi @m7kss1 thanks for your PR! I'll review it next week. |
Consolidate cloud storage builders and resolvers from multiple integrations
(JNI, DuckDB, datafusion-bench) into a single canonical core module.
New:
- vortex-io/src/object_store/{cloud,registry}.rs with Registry and resolve_url
- VortexOpenOptions::open_url dispatcher for CLI and library users
- 'cloud' umbrella feature gating S3, GCS, Azure, HTTP support
Migrate:
- JNI, DuckDB, datafusion-bench to use core make_object_store
- CLI commands to accept remote URLs (s3://, gs://, az://)
Improvements:
- Consistent credential/endpoint handling across integrations
- First-class remote file support in vortex query/tree/browse/segments
Signed-off-by: Maxim Dergousov <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
097a411 to
c952c6d
Compare
Merging this PR will not alter performance
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | baseline_eq[4, 65536] |
186.5 µs | 239.2 µs | -22.02% |
| ❌ | Simulation | baseline_lt[16, 65536] |
219 µs | 276.8 µs | -20.87% |
| ❌ | Simulation | baseline_lt[4, 65536] |
202.3 µs | 254.4 µs | -20.47% |
| ❌ | Simulation | baseline_eq[16, 65536] |
231.4 µs | 261.7 µs | -11.57% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[128] |
275.3 ns | 216.9 ns | +26.89% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[1024] |
336.9 ns | 278.6 ns | +20.94% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[2048] |
400.6 ns | 342.2 ns | +17.05% |
| ⚡ | Simulation | encode_varbin[(1000, 2)] |
164.1 µs | 143.1 µs | +14.65% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing m7kss1:feat-cloud-object-store (c952c6d) with develop (e06d80b)
|
You have one failing test case on windows |
| .with_url(url.to_string()) | ||
| // Use a generic S3 endpoint to avoid DNS resolution issues with | ||
| // region-specific endpoints. | ||
| .with_endpoint("https://s3.amazonaws.com") |
There was a problem hiding this comment.
@AdamGS indeed, but if we make them configurable, should we keep the current ones as defaults + change api to allow use custom properties? in trunk, these settings are hard-coded, so this should be handled extra carefully (ref: https://github.com/vortex-data/vortex/blob/develop/vortex-jni/src/object_store.rs#L71-L78)
AdamGS
left a comment
There was a problem hiding this comment.
I've added a few more focused comments, but I think there are two bigger issues here:
- I don't think this quite solves the issue you describe in the PR.
While it does centralize the implementations (which I agree is desirable), it keeps some of inconsistencies between them. The registry is a lot more configurable thanmake_object_store, which hard-codes a few assumptions. - It introduces a bunch of new public APIs like the
Registrybut more importantlyObjectStore, which is currently a crate we use in non-rust integrations and CAN work with for rust codebases when it makes sense for the user. I think our current IO traits will likely change, but still we should probably use them externally.
|
@AdamGS thanks, that's a good point. the idea behind this approach was to avoid making any major changes to the current workflow and reuse as much as possible from the original implementation, to unlock the possibility for future refactoring in other crates. though, i think we can make them directly inside this MR, so I've temporarily marked MR as a draft and will ping you once i've finished the changes |
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
48c5ed0 to
b831585
Compare
Signed-off-by: Dergousov Maksim <dergousovmaxim99@gmail.com>
b831585 to
016c2c1
Compare
Summary
Cloud object store construction was duplicated across
vortex-jniandvortex-duckdb,each with hand-rolled per-scheme builders (
AmazonS3Builder,MicrosoftAzureBuilder,GoogleCloudStorageBuilder), hardcoded scheme lists, and inconsistent credential handlingIntroduce
FileLocationinvortex-ioas the single resolution point for anyURL or path string - no manual builder code required
By the way, now object store works out of the box in CLI :
$ AWS_REGION=us-east-1 vx query s3://bucket/hits.vortex --sql "select count(*) from hits" $ vx tree gs://bucket/hits.vortexCloses: #000
Testing
AI disclosure: XXX