Read remote SQLite databases through a DuckDB FileSystem VFS#154

Open
ak2k wants to merge 6 commits into duckdb:main from ak2k:feat/http-sqlite-support

ak2k commented Jun 5, 2025

Contributor

Reading a SQLite database that lives in object storage or at a URL currently means downloading it first. This PR lets sqlite_scan and ATTACH open it in place:

SELECT * FROM sqlite_scan('https://example.com/chinook.sqlite', 'Artist');

ATTACH 'https://example.com/chinook.sqlite' AS db (TYPE sqlite);
SELECT count(*) FROM db.Album;

A remote path is opened through a SQLite VFS that forwards reads to DuckDB's CachingFileSystem; local paths keep the native SQLite code path. Because it sits at the FileSystem layer, it works for any filesystem DuckDB exposes (HTTP, S3, GCS, Azure, …) and reuses DuckDB's existing file cache. Queries fetch byte ranges on demand rather than downloading the whole file.
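
For readers unfamiliar with the file format, here is a minimal sketch of what "byte ranges on demand" means at the SQLite level. It is an illustration only, not code from this PR: the helper names `read_page_size` and `page_range_header` are hypothetical, and the actual request sizes are decided by DuckDB's CachingFileSystem rather than by a one-page-per-request mapping like this one. The header layout (magic string, then a big-endian page size at bytes 16-17) is from the SQLite file format.

```python
import struct

SQLITE_MAGIC = b"SQLite format 3\x00"


def read_page_size(header: bytes) -> int:
    """Return the page size stored in a SQLite database header."""
    if header[:16] != SQLITE_MAGIC:
        raise ValueError("not a SQLite 3 database header")
    # Bytes 16-17 hold the page size as a big-endian 16-bit integer;
    # the stored value 1 stands for 65536.
    (raw,) = struct.unpack(">H", header[16:18])
    return 65536 if raw == 1 else raw


def page_range_header(page_number: int, page_size: int) -> str:
    """Build the HTTP Range header value covering one 1-based page."""
    if page_number < 1:
        raise ValueError("SQLite pages are numbered from 1")
    start = (page_number - 1) * page_size
    # HTTP byte ranges are inclusive on both ends.
    return f"bytes={start}-{start + page_size - 1}"
```

A query that touches only a few B-tree pages therefore needs only those pages' byte ranges, which is why a small lookup against a large remote file can avoid a full download.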

Limitations

  • Remote databases are read-only. As a consequence, SQLite's temp store stays in memory, so a large sort or hash that would normally spill to disk is kept in RAM.
  • The database must be checkpointed and self-contained; reading a remote WAL is not supported. PRAGMA wal_checkpoint(TRUNCATE) or VACUUM INTO produces a servable copy.
  • A query treats the remote object as immutable (the VFS reports SQLITE_IOCAP_IMMUTABLE). httpfs raises an error if the object's ETag changes mid-read; on a server that returns no stable ETag, a concurrent change can be read inconsistently.
  • Some network-error classification matches on httpfs message text, so it can be imprecise if those messages change.
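
As a sketch of the second limitation, the two routes to a servable file can be scripted with Python's stdlib sqlite3 module. The function names below are hypothetical and this is not part of the PR. It assumes a SQLite build new enough for VACUUM INTO (3.27 or later) and that the destination file does not already exist.

```python
import sqlite3


def make_servable_copy(src_path: str, dest_path: str) -> None:
    """Write a compact, self-contained copy of src_path to dest_path."""
    # isolation_level=None keeps the connection in autocommit mode,
    # because VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(src_path, isolation_level=None)
    try:
        conn.execute("VACUUM INTO ?", (dest_path,))
    finally:
        conn.close()


def checkpoint_in_place(path: str) -> None:
    """Fold any WAL content back into the main database file."""
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
    finally:
        conn.close()
```

The copy route leaves the live database untouched, which is usually the safer choice when the source is still being written to; the in-place checkpoint suits a database that is about to be uploaded as-is.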

Tests are hermetic (after setup): a stdlib Python HTTP server under scripts/ serves committed fixtures over localhost, so core CI needs no network. Optional S3 and high-concurrency paths are gated on environment flags and use a pinned rclone binary that CI fetches. The remote tests pull httpfs (pinned to the commit the bundled DuckDB engine coordinates, with its patches) as a test-only dependency. The WASM build skips it: those tests don't run there, and httpfs's OpenSSL dependency didn't build for emscripten in our build. The extension compiles for the WASM targets, where all paths route through the FileSystem, but that path is not exercised at runtime.
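
To make the test setup concrete, here is a minimal sketch of a stdlib-only HTTP server that honours single byte-range requests. It is not the script under scripts/ in this PR. The class name `RangeHandler`, the helper `start_server`, and the choice of `ThreadingHTTPServer` are assumptions made for illustration.

```python
import os
import re
import threading
from functools import partial
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

_RANGE = re.compile(r"bytes=(\d+)-(\d*)$")


class RangeHandler(BaseHTTPRequestHandler):
    """Serve files from one directory, honouring single byte-range requests."""

    protocol_version = "HTTP/1.1"

    def __init__(self, *args, root, **kwargs):
        # The base class handles the request inside __init__, so the root
        # directory has to be set before calling it.
        self.root = root
        super().__init__(*args, **kwargs)

    def _resolve(self):
        # basename() drops directory components, so a request cannot
        # escape the fixture directory.
        name = os.path.basename(self.path.split("?", 1)[0])
        path = os.path.join(self.root, name)
        return path if name and os.path.isfile(path) else None

    def _respond(self, send_body):
        path = self._resolve()
        if path is None:
            self.send_error(404)
            return
        size = os.path.getsize(path)
        start, end, status = 0, size - 1, 200
        match = _RANGE.match(self.headers.get("Range", ""))
        if match:
            start = int(match.group(1))
            end = min(int(match.group(2) or size - 1), size - 1)
            if start > end:
                self.send_error(416)
                return
            status = 206
        self.send_response(status)
        self.send_header("Accept-Ranges", "bytes")
        self.send_header("Content-Length", str(end - start + 1))
        if status == 206:
            self.send_header("Content-Range", f"bytes {start}-{end}/{size}")
        self.end_headers()
        if send_body:
            with open(path, "rb") as f:
                f.seek(start)
                self.wfile.write(f.read(end - start + 1))

    def do_GET(self):
        self._respond(send_body=True)

    def do_HEAD(self):
        self._respond(send_body=False)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_server(root):
    """Start on an ephemeral localhost port; return (server, base_url)."""
    handler = partial(RangeHandler, root=root)
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}"
```

Binding to port 0 lets the operating system pick a free port, so parallel test runs do not collide. Call `server.shutdown()` when the test finishes.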

Notes

Related Issues: #39, #141

Maxxen left a comment

Member

Hello!

Looks cool, but I wonder why this VFS is limited to HTTP. Why not just wrap DuckDB's regular filesystem? You could then let DuckDB handle the underlying abstraction and ensure SQLite can access files through all DuckDB-supported filesystems (e.g. compressed files, or regular files on WASM).

Comment thread src/sqlite_http_vfs.cpp Outdated
Comment thread src/include/sqlite_http_vfs.hpp Outdated

ak2k commented Jun 5, 2025

Contributor Author

[Updated]

@Maxxen Thanks for the excellent suggestions! I've re-implemented based on your feedback.

Integrated with DuckDB's filesystem and caching infrastructure

  • Now wraps DuckDB's FileSystem API for all remote file access (HTTP/HTTPS via httpfs)
  • Uses DuckDB's CachingFileSystem, which provides block caching and manages the cache

Simplified concurrency model

  • Removed per-file mutexes - file operations are lock-free
  • Only uses mutex for VFS registry operations (registration/unregistration)
  • Relies on CachingFileSystem's internal synchronization for thread safety

Adaptive read-ahead optimization

  • Implements dynamic read-ahead (1MB-128MB) to reduce network round trips
  • Adjusts read size based on sequential vs random access patterns
  • Works on top of CachingFileSystem's block caching

The implementation is much cleaner now - it properly delegates to DuckDB's existing infrastructure rather than reimplementing caching logic. Thanks again for the great foundation and suggestions!

ak2k force-pushed the feat/http-sqlite-support branch from a5fdad2 to da96c0b on June 5, 2025 14:54
@Alex-Monahan

This is very cool! Does it work with hosted SQLite solutions like Turso?

ak2k force-pushed the feat/http-sqlite-support branch 3 times, most recently from 277539a to 49f7aba, on June 5, 2025 15:42

ak2k commented Jun 6, 2025

Contributor Author

> This is very cool! Does it work with hosted SQLite solutions like Turso?

Thank you @Alex-Monahan! I wasn't familiar with Turso, but it seems to implement a custom wire format rather than providing HTTP-like access to the stored file format, so regrettably, it likely wouldn't.

ak2k requested a review from Maxxen on June 6, 2025 12:19
ak2k marked this pull request as draft on June 8, 2025 19:35
ak2k force-pushed the feat/http-sqlite-support branch 2 times, most recently from 6d1c0cf to 9cc202e, on June 14, 2025 04:07
ak2k marked this pull request as ready for review on June 14, 2025 04:08
ak2k force-pushed the feat/http-sqlite-support branch 5 times, most recently from 3ce1800 to 063a5bd, on June 14, 2025 04:56
ak2k changed the title from "feature: Add HTTP/HTTPS support for remote SQLite databases" to "feature: Add Remote (HTTP(S)) Support for SQLite Databases" on Jun 14, 2025
ak2k force-pushed the feat/http-sqlite-support branch from 063a5bd to cfd246b on June 14, 2025 05:10
ak2k force-pushed the feat/http-sqlite-support branch from cfd246b to 5d4e241 on June 27, 2025 22:48

ak2k commented Jun 27, 2025

Contributor Author

I've reorganized this PR into 5 logical commits to make review easier. Each commit is self-contained, compiles, and passes all tests:

Commit 1: Add ClientContext parameter to SQLiteDB::Open methods

  • Minimal API change to support future extensibility
  • Required for subsequent commits that need access to DuckDB's context

Commit 2: Lazy initialization when opening remote SQLite databases

  • Implements lazy initialization to prevent deadlocks during remote file access
  • Includes necessary validation logic for busy_timeout

Commit 3: Add SQLite VFS implementation for remote file support

  • Core VFS implementation that integrates with DuckDB's CachingFileSystem
  • Handles HTTP error mapping and adaptive read-ahead optimization
  • No user-facing changes yet

Commit 4: Enable HTTP/HTTPS SQLite database access via sqlite_scan

  • Activates HTTP/HTTPS support for sqlite_scan() function
  • Includes 7 working tests demonstrating basic functionality
  • ATTACH support intentionally deferred to next commit

Commit 5: Add HTTP ATTACH support and improve error handling

  • Completes the implementation with ATTACH functionality
  • Improves error handling throughout the PR
  • Adds remaining tests for complex queries

Let me know if you'd like me to adjust the commit structure or if you have any questions about specific changes or feedback.

@ngalluzzo

Curious if this is blocked by something? Would love to help get this merged if any additional contributions are necessary

FIGIO55 commented Apr 7, 2026

Trying to see if there are updates on this matter since it would be very useful for me

ak2k added 4 commits June 10, 2026 13:21

A custom SQLite VFS that delegates file I/O to DuckDB's CachingFileSystem, so any filesystem DuckDB exposes (HTTP, S3, GCS, Azure, etc.) is reachable as a read-only SQLite database. Registered per ClientContext and unregistered on connection close.

SQLiteDB::Open now takes a ClientContext and routes paths DuckDB's FileSystem owns (remote, WASM) through the caching VFS, opened read-only; local files keep the native SQLite path unchanged. Open errors from the VFS path are enriched with the underlying filesystem error (HTTP status/URL).

Defer the connection open and BEGIN for remote (HTTP/S3) databases to first use (GetDB()), so attaching a remote database does not open it eagerly. Destroying a SQLiteTransaction runs sqlite3_close_v2, which can re-enter paths that take transaction_lock; extract the transaction under the lock and let it destruct after the lock is released to avoid a lock-order inversion.

A stdlib HTTP range server (and optional rclone http/s3 backends) serves committed SQLite fixtures over localhost, so the remote tests run hermetically. Adds the http_sqlite_* suite and a CI workflow that runs it under sanitizers.

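
The "extract under the lock, destruct after release" fix from the commit messages above follows a general pattern that can be sketched in a few lines. This is a simplified illustration, not the PR's C++ code: it uses a single non-reentrant lock instead of two locks taken in conflicting order, and the `Transaction` and `TransactionManager` classes are hypothetical stand-ins.

```python
import threading


class Transaction:
    """Stand-in for a transaction whose teardown re-enters the manager."""

    def __init__(self, manager):
        self.manager = manager
        self.closed = False

    def close(self):
        # Teardown calls back into a path that takes the manager's lock,
        # loosely mirroring the re-entrancy described for sqlite3_close_v2.
        self.manager.active_count()
        self.closed = True


class TransactionManager:
    def __init__(self):
        self._lock = threading.Lock()  # non-reentrant on purpose
        self._transactions = {}

    def begin(self, key):
        txn = Transaction(self)
        with self._lock:
            self._transactions[key] = txn
        return txn

    def active_count(self):
        with self._lock:
            return len(self._transactions)

    def finish(self, key):
        # Detach the transaction while holding the lock...
        with self._lock:
            txn = self._transactions.pop(key)
        # ...and tear it down only after the lock is released, so close()
        # can take the lock again without deadlocking.
        txn.close()
        return txn
```

Calling `txn.close()` inside the `with self._lock:` block would hang in this sketch, because `close()` tries to take the same lock.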
ak2k force-pushed the feat/http-sqlite-support branch from 5d4e241 to 648ee1a on June 10, 2026 17:51

ak2k commented Jun 10, 2026

Contributor Author

This update keeps the existing approach (remote databases open through a SQLite VFS over DuckDB's CachingFileSystem) and revises the implementation:

  • drops the adaptive read-ahead layer (CachingFileSystem already caches);
  • expands the -wal/-journal fail-closed handling so an un-checkpointed remote database is refused rather than read inconsistently;
  • adds a hermetic stdlib HTTP test server in place of network-dependent tests;
  • narrows the diff to the feature (the earlier version also modified many existing storage tests).
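
One way to picture the fail-closed handling mentioned in the list above is a probe for SQLite's sidecar files before the database is opened. The sketch below is an assumption about the general shape of such a check, not the PR's implementation: `refuse_if_uncheckpointed`, `UncheckpointedDatabaseError`, and the `exists` callback are hypothetical names. Only the `-wal` and `-journal` suffixes come from SQLite itself.

```python
from typing import Callable

SIDECAR_SUFFIXES = ("-wal", "-journal")


class UncheckpointedDatabaseError(Exception):
    """Raised when a remote database cannot be shown to be self-contained."""


def refuse_if_uncheckpointed(db_url: str, exists: Callable[[str], bool]) -> None:
    """Raise unless every sidecar probe positively reports 'absent'."""
    for suffix in SIDECAR_SUFFIXES:
        sidecar = db_url + suffix
        try:
            present = exists(sidecar)
        except OSError as exc:
            # Fail closed: a probe that errors out is treated like a
            # sidecar that exists, because absence was not confirmed.
            raise UncheckpointedDatabaseError(
                f"could not rule out {sidecar}: {exc}"
            ) from exc
        if present:
            raise UncheckpointedDatabaseError(
                f"{sidecar} exists; checkpoint the database before serving it"
            )
```

The point of failing closed is that an ambiguous probe result leads to a clear refusal instead of a read that may silently miss committed data.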

Restructured into four focused commits and rebased onto current main (the prior base was about a year old).

ak2k changed the title from "feature: Add Remote (HTTP(S)) Support for SQLite Databases" to "Read remote SQLite databases through a DuckDB FileSystem VFS" on Jun 10, 2026

The concurrent-scans test first-accesses a remote file from several
connections at once, tripping a data race in DuckDB's external file cache.
Upstream duckdb#22979 fixes it by requesting parallel access on the shared
file handle. Advance the submodule to the merge commit that carries the fix.

ak2k commented Jun 10, 2026

Contributor Author

CI surfaced a race in http_sqlite_02_concurrent_scans. Under concurrent first access, CachingFileHandle::GetFileHandle() opened the shared underlying handle without FILE_FLAGS_PARALLEL_ACCESS, so parallel block fetches raced on the non-thread-safe HTTPFS handle and a cache block could be read with zero bytes. The failure surfaced in FileBufferHandleGroup::CopyTo.

That race is fixed upstream in duckdb/duckdb#22979 (merged 2026-06-01). The pinned engine here, a966898d (2026-05-28, the same commit main pins), predates it. This bumps the duckdb submodule to the #22979 merge commit to pick up the fix.

I can split the engine bump into its own PR if you'd rather land it separately.

@staticlibs

Member

Hi, thanks for the update! It is fine to have the duckdb submodule updated as part of this PR.

http_sqlite_02 opens several remote connections at once. CPython's stdlib
HTTP server does not serve simultaneous connections reliably, so the test
intermittently fails (~1% disk I/O errors) on the default server while
passing cleanly under rclone serve http. Gate it on SQLITE_HTTP_ROBUST like
the concurrent-lifecycle test, and widen the rclone CI steps to cover both.