Read remote SQLite databases through a DuckDB FileSystem VFS #154
Maxxen
left a comment
Hello!
Looks cool, but I wonder why this VFS is limited to HTTP. Why not just wrap DuckDB's regular filesystem? You could then let DuckDB handle the underlying abstraction and ensure SQLite can access files through all DuckDB-supported filesystems (e.g. compressed files, or regular files on WASM).
[Updated] @Maxxen Thanks for the excellent suggestions! I've re-implemented based on your feedback:
Integrated with DuckDB's filesystem and caching infrastructure
Simplified concurrency model
Adaptive read-ahead optimization
The implementation is much cleaner now - it properly delegates to DuckDB's existing infrastructure rather than reimplementing caching logic. Thanks again for the great foundation and suggestions!
a5fdad2 to da96c0b
This is very cool! Does it work with hosted SQLite solutions like Turso?
277539a to 49f7aba
Thank you @Alex-Monahan! I wasn't familiar with Turso, but it seems to implement a custom wire format rather than providing HTTP-like access to the stored file format, so regrettably, it likely wouldn't.
6d1c0cf to 9cc202e
3ce1800 to 063a5bd
063a5bd to cfd246b
cfd246b to 5d4e241
I've reorganized this PR into 5 logical commits to make the review process easier. Each commit is self-contained, compiles, and passes all tests:
Commit 1: Add ClientContext parameter to SQLiteDB::Open methods
Commit 2: Lazy initialization when opening remote SQLite databases
Commit 3: Add SQLite VFS implementation for remote file support
Commit 4: Enable HTTP/HTTPS SQLite database access via sqlite_scan
Commit 5: Add HTTP ATTACH support and improve error handling
Let me know if you'd like me to adjust the commit structure or if you have any questions about specific changes or feedback.
Curious if this is blocked by something? I would love to help get this merged if any additional contributions are necessary.
Are there any updates on this? It would be very useful for me.
A custom SQLite VFS that delegates file I/O to DuckDB's CachingFileSystem, so any filesystem DuckDB exposes (HTTP, S3, GCS, Azure, etc.) is reachable as a read-only SQLite database. Registered per ClientContext and unregistered on connection close.
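As a rough illustration of the idea in this commit (not the extension's C++ code), the sketch below shows a read-only reader that turns (offset, length) requests into aligned block fetches and caches the blocks, which is approximately what a VFS read callback forwarding to a caching filesystem has to do. `BlockCachedReader`, `fetch_range`, `file_fetcher`, and the 4096-byte block size are hypothetical names and values chosen for the example, and a local file stands in for the remote object. The only SQLite facts relied on are that the file starts with the 16-byte magic string `SQLite format 3\0` and stores the page size as a big-endian 2-byte integer at offset 16.

```python
import os
import sqlite3
import struct
import tempfile


class BlockCachedReader:
    """Read-only view of a remote object, fetched in aligned blocks.

    fetch_range(offset, length) -> bytes is a hypothetical callback standing
    in for whatever performs the real range request (HTTP Range, S3 GET, ...).
    """

    def __init__(self, fetch_range, size, block_size=4096):
        self.fetch_range = fetch_range
        self.size = size
        self.block_size = block_size
        self.cache = {}    # block index -> bytes
        self.fetches = 0   # number of backend round-trips, for illustration

    def _block(self, index):
        if index not in self.cache:
            start = index * self.block_size
            length = min(self.block_size, self.size - start)
            self.cache[index] = self.fetch_range(start, length)
            self.fetches += 1
        return self.cache[index]

    def read(self, offset, length):
        """Return up to `length` bytes at `offset`, touching only the blocks needed."""
        end = min(offset + length, self.size)
        if offset >= end:
            return b""
        out = bytearray()
        first = offset // self.block_size
        last = (end - 1) // self.block_size
        for index in range(first, last + 1):
            block_start = index * self.block_size
            block = self._block(index)
            out += block[max(offset - block_start, 0):end - block_start]
        return bytes(out)


def make_fixture(path):
    """Create a small local SQLite file to stand in for the remote object."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE t(x INTEGER)")
    con.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
    con.commit()
    con.close()


def file_fetcher(path):
    """Range fetcher backed by a local file (a stand-in for a network request)."""
    def fetch_range(offset, length):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
    return fetch_range


def demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "fixture.db")
        make_fixture(path)
        reader = BlockCachedReader(file_fetcher(path), os.path.getsize(path))
        header = reader.read(0, 100)   # SQLite's 100-byte database header
        magic = header[:16]
        (page_size,) = struct.unpack(">H", header[16:18])
        reader.read(16, 2)             # served from the cache, no new fetch
        return magic, page_size, reader.fetches
```

Calling `demo()` reads the header with a single backend fetch; the second, overlapping read is answered from the cached block. A real implementation would also need eviction and invalidation, which this sketch leaves out.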
SQLiteDB::Open now takes a ClientContext and routes paths DuckDB's FileSystem owns (remote, WASM) through the caching VFS, opened read-only; local files keep the native SQLite path unchanged. Open errors from the VFS path are enriched with the underlying filesystem error (HTTP status/URL).
Defer the connection open and BEGIN for remote (HTTP/S3) databases to first use (GetDB()), so attaching a remote database does not open it eagerly. Destroying a SQLiteTransaction runs sqlite3_close_v2, which can re-enter paths that take transaction_lock; extract the transaction under the lock and let it destruct after the lock is released to avoid a lock-order inversion.
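The locking change in this commit can be sketched in Python, with the caveat that the real code is C++ and relies on destructors, and that the sketch is simplified to a single non-reentrant lock. `TransactionManager`, `Transaction`, `note_closed`, and `close()` are hypothetical stand-ins. The point is only the shape: remove the object from shared state while holding the lock, then run its cleanup after the lock is released, because the cleanup may call back into code that takes the same lock.

```python
import threading


class Transaction:
    def __init__(self, manager, name):
        self.manager = manager
        self.name = name
        self.closed = False

    def close(self):
        # Stand-in for cleanup that re-enters the manager (and its lock),
        # as closing the SQLite connection can in the commit described above.
        self.manager.note_closed(self.name)
        self.closed = True


class TransactionManager:
    def __init__(self):
        self.transaction_lock = threading.Lock()   # non-reentrant on purpose
        self.transactions = {}
        self.closed_names = []

    def start(self, name):
        with self.transaction_lock:
            txn = Transaction(self, name)
            self.transactions[name] = txn
            return txn

    def note_closed(self, name):
        # Takes the same lock; calling this while already holding it would deadlock.
        with self.transaction_lock:
            self.closed_names.append(name)

    def finish(self, name):
        # Step 1: extract the transaction while holding the lock.
        with self.transaction_lock:
            txn = self.transactions.pop(name)
        # Step 2: run its cleanup only after the lock has been released.
        txn.close()
        return txn


def demo():
    manager = TransactionManager()
    manager.start("t1")
    txn = manager.finish("t1")
    return txn.closed, manager.closed_names, manager.transactions
```

If `finish` called `txn.close()` inside the `with` block instead, `note_closed` would block forever on the lock its own caller holds.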
A stdlib HTTP range server (and optional rclone http/s3 backends) serves committed SQLite fixtures over localhost, so the remote tests run hermetically. Adds the http_sqlite_* suite and a CI workflow that runs it under sanitizers.
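For readers unfamiliar with what a "stdlib HTTP range server" involves, here is a minimal sketch using only Python's `http.server`. It is not the script from this PR: `RangeHandler`, `FIXTURES`, `fetch_range`, and the in-memory fixture are assumptions made for the example, and only the simple `bytes=start-end` form of the `Range` header is handled (no suffix or multi-range requests).

```python
import re
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory fixture; a real test server would serve files from disk.
FIXTURES = {"/fixture.db": bytes(range(256)) * 4}

RANGE_RE = re.compile(r"bytes=(\d+)-(\d*)$")


class RangeHandler(BaseHTTPRequestHandler):
    def _send(self, include_body):
        data = FIXTURES.get(self.path)
        if data is None:
            self.send_error(404)
            return
        start, end, status = 0, len(data) - 1, 200
        match = RANGE_RE.match(self.headers.get("Range", ""))
        if match:
            start = int(match.group(1))
            if match.group(2):
                end = min(int(match.group(2)), len(data) - 1)
            if start > end:
                self.send_response(416)   # Range Not Satisfiable
                self.send_header("Content-Range", f"bytes */{len(data)}")
                self.end_headers()
                return
            status = 206                  # Partial Content
        body = data[start:end + 1]
        self.send_response(status)
        self.send_header("Accept-Ranges", "bytes")
        self.send_header("Content-Length", str(len(body)))
        if status == 206:
            self.send_header("Content-Range", f"bytes {start}-{end}/{len(data)}")
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self):
        self._send(include_body=True)

    def do_HEAD(self):
        self._send(include_body=False)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def fetch_range(url, start, end):
    """Request bytes start..end (inclusive), bypassing any configured proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with opener.open(request) as response:
        return response.status, response.headers["Content-Range"], response.read()


def demo():
    server = HTTPServer(("127.0.0.1", 0), RangeHandler)   # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/fixture.db"
        return fetch_range(url, 16, 19)
    finally:
        server.shutdown()
        server.server_close()
```

Note that `HTTPServer` handles one request at a time; `http.server.ThreadingHTTPServer` exists in the standard library, but whether either is robust enough for heavily concurrent tests is a separate question.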
5d4e241 to 648ee1a
Keeps the existing approach — remote databases open through a SQLite VFS over DuckDB's
Restructured into four focused commits and rebased onto current
The concurrent-scans test first-accesses a remote file from several connections at once, tripping a data race in DuckDB's external file cache. Upstream duckdb#22979 fixes it by requesting parallel access on the shared file handle. Advance the submodule to the merge commit that carries the fix.
CI surfaced a race in That race is fixed upstream in duckdb/duckdb#22979 (merged 2026-06-01). The pinned engine here, I can split the engine bump into its own PR if you'd rather land it separately.
Hi, thanks for the update! It is fine to have the
http_sqlite_02 opens several remote connections at once. CPython's stdlib HTTP server does not serve simultaneous connections reliably, so the test intermittently fails (~1% disk I/O errors) on the default server while passing cleanly under rclone serve http. Gate it on SQLITE_HTTP_ROBUST like the concurrent-lifecycle test, and widen the rclone CI steps to cover both.
Reading a SQLite database that lives in object storage or at a URL currently means downloading it first. This lets sqlite_scan and ATTACH open it in place:
A remote path is opened through a SQLite VFS that forwards reads to DuckDB's CachingFileSystem; local paths keep the native SQLite code path. Because it sits at the FileSystem layer, it works for any filesystem DuckDB exposes (HTTP, S3, GCS, Azure, …) and reuses DuckDB's existing file cache. Queries fetch byte ranges on demand rather than downloading the whole file.
Limitations
PRAGMA wal_checkpoint(TRUNCATE) or VACUUM INTO produces a servable copy.
SQLITE_IOCAP_IMMUTABLE). httpfs raises an error if the object's ETag changes mid-read; on a server that returns no stable ETag, a concurrent change can be read inconsistently.
Tests are hermetic (after setup): a stdlib Python HTTP server under scripts/ serves committed fixtures over localhost, so core CI needs no network. Optional S3 and high-concurrency paths are gated on environment flags and use a pinned rclone binary the CI fetches. The remote tests pull httpfs (pinned to the commit the bundled DuckDB engine coordinates, with its patches) as a test-only dependency; the WASM build skips it, since those tests don't run there and httpfs's OpenSSL dependency didn't build for emscripten in our build. The extension compiles for the WASM targets, where all paths route through the FileSystem; that path is not exercised at runtime.
Notes
-wal/-journal sidecar checks are not; each open re-probes them with a network HEAD, so repeated and parallel queries re-pay those round-trips.
Related Issues: #39, #141