feat: implement DuckDB filesystem integration for Vortex file handling #6198

base: develop

Conversation
| "http" | "https" | "s3" => { | ||
| let reader = open_duckdb_reader(client_ctx, &url); | ||
|
|
||
| // Fallback to the legacy object_store path for s3 if DuckDB fs isn't configured. |
When will that happen? I would really love to get rid of object_store here; it has some underlying dependencies on tokio that can cause some weird issues.
CodSpeed Performance Report

Merging this PR will degrade performance by 45.74%.
Comparing against the fallback base commit 68130ce (see footnote 1).

|   | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|------|-----------|------|------|------------|
| 🆕 | WallTime | 10M_50pct[5000000] | N/A | 278.7 µs | N/A |
| 🆕 | WallTime | 1M_90pct[1000000] | N/A | 35.2 µs | N/A |
| 🆕 | WallTime | 1M_50pct[500000] | N/A | 50.9 µs | N/A |
| 🆕 | WallTime | 10M_90pct[10000000] | N/A | 363 µs | N/A |
| 🆕 | WallTime | 10M_10pct[1000000] | N/A | 132.4 µs | N/A |
| ❌ | WallTime | u16_FoR[10M] | 8.3 µs | 10.2 µs | -18.59% |
| 🆕 | WallTime | 10M_10pct[1000000] | N/A | 219.9 µs | N/A |
| ❌ | WallTime | u8_FoR[10M] | 6.9 µs | 12.8 µs | -45.74% |
| 🆕 | WallTime | 1M_50pct[500000] | N/A | 24.5 µs | N/A |
| 🆕 | WallTime | 1M_90pct[1000000] | N/A | 57 µs | N/A |
| 🆕 | WallTime | 10M_50pct[5000000] | N/A | 278.7 µs | N/A |
| 🆕 | WallTime | 1M_90pct[1000000] | N/A | 57.6 µs | N/A |
| 🆕 | WallTime | 1M_50pct[500000] | N/A | 50.6 µs | N/A |
| 🆕 | WallTime | 1M_10pct[100000] | N/A | 23.9 µs | N/A |
| 🆕 | WallTime | 10M_90pct[10000000] | N/A | 363.4 µs | N/A |
| 🆕 | WallTime | 10M_10pct[1000000] | N/A | 219.8 µs | N/A |
| 🆕 | WallTime | 10M_50pct[5000000] | N/A | 158.2 µs | N/A |
| 🆕 | WallTime | 1M_10pct[100000] | N/A | 76.7 µs | N/A |
| 🆕 | WallTime | 10M_90pct[10000000] | N/A | 195.8 µs | N/A |
| 🆕 | WallTime | 1M_10pct[100000] | N/A | 45.6 µs | N/A |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Footnotes

1. No successful run was found on develop (db2e2af) during the generation of this report, so 68130ce was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.
2. 1323 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.
0ax1 left a comment:
Thanks for taking the time to look into this. There are a couple of issues around thread safety/locking, error handling, and memory leaks.
If you used AI to assist in writing the code for your PR, please mention this in your PR description (see: https://github.com/vortex-data/vortex/blob/develop/CONTRIBUTING.md).
| SESSION.write_options().write(&mut file, array_stream).await
| // Prefer DuckDB FS (httpfs/s3/etc.), fallback to local async fs if unavailable.
| let ctx_raw = ctx_ptr.0 as cpp::duckdb_vx_client_context;
| if let Ok(writer) = unsafe { duckdb_fs_create_writer(ctx_raw, &file_path) } {
We should probably only use the DuckDB FS and not fall back to our previous logic implicitly, but instead return an error if the new path fails to initialize. We can handle this in a follow-up on our end, though.
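A minimal sketch of the suggested shape: fail loudly when the DuckDB filesystem writer cannot be created instead of silently falling back. `DuckDbFsWriter`, the stub constructor, the example path, and the `std::io::Error` wrapping are hypothetical stand-ins, since only a fragment of the real FFI surface is quoted above.

```rust
use std::io;

// Hypothetical stand-ins for the writer handle and FFI constructor from this PR.
struct DuckDbFsWriter;

fn duckdb_fs_create_writer(_path: &str) -> io::Result<DuckDbFsWriter> {
    Err(io::Error::new(io::ErrorKind::Unsupported, "httpfs not loaded"))
}

/// Surface the initialization failure instead of silently falling back to the
/// legacy local/object_store write path.
fn open_writer(path: &str) -> io::Result<DuckDbFsWriter> {
    duckdb_fs_create_writer(path).map_err(|e| {
        io::Error::new(
            e.kind(),
            format!("DuckDB FS writer failed to initialize for {path}: {e}"),
        )
    })
}

fn main() {
    match open_writer("s3://bucket/example.vortex") {
        Ok(_) => println!("writer ready"),
        Err(e) => eprintln!("{e}"),
    }
}
```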
Thanks for all your comments, I will address them ASAP.
| *error_out = duckdb_vx_error_create(message.data(), message.size());
| }
| duckdb_state HandleException(duckdb_vx_error *error_out) {
This snippet does not really handle an exception, right? We take the error, call throw unconditionally ourselves, and then return DuckDBError.
| if (!error_out) {
| return;
| }
| *error_out = duckdb_vx_error_create(message.data(), message.size());
Where does this error get freed?
| using namespace duckdb;
| namespace {
Why the anonymous namespace?
| handle->Truncate(0);
| return reinterpret_cast<duckdb_vx_file_handle>(new FileHandleWrapper(std::move(handle)));
| } catch (...) {
| HandleException(error_out);
Did you mean to forward the caught exception to HandleException?
| /// A VortexReadAt implementation backed by DuckDB's filesystem (e.g., httpfs/s3).
| pub struct DuckDbFsReadAt {
| handle: Arc<Mutex<FsFileHandle>>,
Why does this need a Mutex, given that FsFileHandle is Send and Sync? If it is Send and Sync, an Arc<FsFileHandle> should be sufficient.
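A self-contained sketch of the Arc-only sharing being suggested; `FsFileHandle` below is a dummy stand-in, and the claim that the real handle tolerates concurrent positional reads is an assumption.

```rust
use std::sync::Arc;
use std::thread;

// Dummy stand-in; the assumption is that the real FsFileHandle is Send + Sync
// because the underlying DuckDB handle supports concurrent positional reads.
struct FsFileHandle;

impl FsFileHandle {
    fn read_at(&self, _offset: u64, _len: usize) {}
}

fn main() {
    // If the handle is Send + Sync, Arc alone is enough to share it across
    // threads; wrapping it in a Mutex only serializes otherwise-parallel reads.
    let handle = Arc::new(FsFileHandle);
    let background = Arc::clone(&handle);
    let worker = thread::spawn(move || background.read_at(0, 1024));
    handle.read_at(1024, 1024);
    worker.join().unwrap();
}
```

A Mutex (or locking inside the FFI layer) only becomes relevant if the handle carries mutable state, such as a seek-style cursor.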
| max_size: 8 * 1024 * 1024, // 8 MB
| };
| const DEFAULT_CONCURRENCY: usize = 64;
Can you give more context and add a comment on how you came up with this number?
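One possible shape for the requested comment; the rationale below is purely hypothetical and would need to be confirmed (or replaced) by the author.

```rust
/// Upper bound on range reads issued concurrently through the DuckDB file handle.
///
/// Hypothetical rationale (to be validated with benchmarks): high enough to keep
/// a high-latency HTTP/S3 backend busy, low enough to bound memory held by
/// in-flight coalesced ranges (64 * 8 MiB = 512 MiB worst case).
const DEFAULT_CONCURRENCY: usize = 64;
```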
| let mut size_out: cpp::idx_t = 0;
| let status = unsafe {
| cpp::duckdb_vx_fs_get_size(
| handle.lock().as_ptr(),
We only lock the handle for the duration of retrieving the pointer, but not for the duration of duckdb_vx_fs_get_size. But as questioned before, isn't the file handle thread-safe (Send + Sync) anyway?
| let mut out_len: cpp::idx_t = 0;
| let status = unsafe {
| cpp::duckdb_vx_fs_read(
| handle.lock().as_ptr(),
Same here: if we really need a lock, it would need to be held for the duration of duckdb_vx_fs_read.
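A self-contained sketch of holding the guard for the whole call, with `std::sync::Mutex`, a `Vec<u8>` payload, and a dummy `ffi_read` standing in for the real handle and `duckdb_vx_fs_read`.

```rust
use std::sync::{Arc, Mutex};

// Dummy stand-in for the FFI call; the real duckdb_vx_fs_read takes a raw handle pointer.
fn ffi_read(_handle: *const u8, buf: &mut [u8]) -> usize {
    buf.len()
}

fn read_locked(handle: &Arc<Mutex<Vec<u8>>>, buf: &mut [u8]) -> usize {
    // Bind the guard so the lock is visibly held for the whole FFI call,
    // not just for the pointer lookup.
    let guard = handle.lock().expect("lock poisoned");
    let n = ffi_read(guard.as_ptr(), buf);
    drop(guard); // released only after the read has completed
    n
}

fn main() {
    let handle = Arc::new(Mutex::new(vec![0u8; 16]));
    let mut buf = [0u8; 8];
    assert_eq!(read_locked(&handle, &mut buf), 8);
}
```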
| pub struct VortexCopyFunction;
| #[derive(Clone, Copy)]
| struct SendableClientCtx(usize);
Why do we encode the pointer as a usize?
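For context, one common alternative is to keep the raw pointer type and state the `Send` claim explicitly rather than laundering it through a `usize`. The sketch below is hypothetical; whether DuckDB's client context may actually be used from another thread is an assumption that would have to be verified.

```rust
use std::ffi::c_void;

/// Hypothetical alternative to `SendableClientCtx(usize)`: keep the pointer type.
#[derive(Clone, Copy)]
struct SendableClientCtx(*mut c_void);

// SAFETY: assumption for illustration only. This is sound only if DuckDB's
// client context may be used from whichever thread ends up consuming it.
unsafe impl Send for SendableClientCtx {}

fn main() {
    let ctx = SendableClientCtx(std::ptr::null_mut());
    std::thread::spawn(move || {
        // Destructure to recover the raw pointer on the worker thread.
        let SendableClientCtx(raw) = ctx;
        let _ = raw;
    })
    .join()
    .unwrap();
}
```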
I tested it locally