Column Statistics #4540
Replies: 4 comments 16 replies
-
Yep, the stats will be there once you build the zone map index. The data is written into an auxiliary index file at the table layer, like all other scalar indices.
The stats used to exist at the Lance file level, but they were removed recently due to requirement changes. Could you share more details on your use cases so folks can chime in?
-
Yes. We can use zone map indexes as an initial source of column statistics. Someday we might want something even smaller and lighter than a zone map (e.g. just one set of min / max / some unique values for the entire column) which we can compute for every column. Or maybe we just compute zone maps for every column. Or maybe we don't worry about statistics on every column, since they're only really needed on the columns we filter on and we probably want indexes on those columns anyway 🤷 The important part is probably just getting an API up for column statistics. We can use zone maps to develop that API and refactor later as we learn.
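To make the trade-off above concrete, here is a minimal pure-Python sketch of the two shapes being compared: a zone map (one min/max entry per run of rows) and a single whole-column summary derived from it. The `Zone` layout, the function names, and the zone size are all hypothetical and chosen for illustration; this is not Lance's actual zone map implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Zone:
    """Min/max summary for one contiguous run of rows (hypothetical layout)."""
    start: int                # offset of the first row in the zone
    length: int               # number of rows covered by the zone
    min: Optional[int]        # None when every value in the zone is null
    max: Optional[int]
    null_count: int


def build_zone_map(values: Sequence[Optional[int]], zone_size: int) -> List[Zone]:
    """Summarize a column as one Zone per `zone_size` rows."""
    zones = []
    for start in range(0, len(values), zone_size):
        chunk = values[start:start + zone_size]
        present = [v for v in chunk if v is not None]
        zones.append(Zone(
            start=start,
            length=len(chunk),
            min=min(present) if present else None,
            max=max(present) if present else None,
            null_count=len(chunk) - len(present),
        ))
    return zones


def column_summary(zones: List[Zone]) -> Zone:
    """Collapse a zone map into the lighter single-entry, whole-column form."""
    mins = [z.min for z in zones if z.min is not None]
    maxs = [z.max for z in zones if z.max is not None]
    return Zone(
        start=0,
        length=sum(z.length for z in zones),
        min=min(mins) if mins else None,
        max=max(maxs) if maxs else None,
        null_count=sum(z.null_count for z in zones),
    )


def zones_matching_gt(zones: List[Zone], threshold: int) -> List[Zone]:
    """Zones that may contain a value > threshold; all others can be skipped."""
    return [z for z in zones if z.max is not None and z.max > threshold]
```

The whole-column summary can only answer "can this column match at all?", while the zone map can also say which row ranges to skip, which is why the summary is cheap enough to keep for every column.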
Sure. You can put pretty much whatever arbitrary buffers you want into a Lance file (see https://github.com/lancedb/lance/blob/60711f360b7f8692df44a0e84c98c8fdff2897a3/rust/lance-file/src/v2/writer.rs#L491). However, if you want them to be visible to the Lance table format, we would probably need to do some more work. The table format does have a pretty generic "index store" concept which might be able to map to "the indexes are in file footers", or maybe we need some changes. Also, the zone map index currently expects the entire index to be in a single file. There are other indexes which support "parts", like the btree index. Or maybe a "partitioned zone map" is just another index type. Just to explain some of the Lance motivation here: we are trying to avoid having a column index spread across many files, because then it becomes a cold start problem when you need to open potentially thousands of files (if you have billions of rows) to read what is tens of MB of data at most.
Could this be done in the Iceberg writer? You can look at each batch before you send it to Lance and update the statistics and then, before calling the finish method, add them as a global buffer. Or do you think there is some change you would need in the Lance file writer? If you're interested in reusing the Rust code to collect statistics, then I think we can do that too. We can make some of the utilities in the zone map index into more general purpose "zone map calculating tools". IIRC we are using some datafusion internals to do the actual comparisons and aggregation. Slightly related / tangential: There's a discussion here around how we plan to handle statistics on write in Lance (though we are using a separate index file).
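The "look at each batch, then attach the result before finishing" flow could be sketched roughly as below. This is pure Python under stated assumptions: a batch is modeled as a plain dict of column name to list of values (standing in for an Arrow record batch), `ColumnStatsCollector` is a hypothetical name, and the JSON encoding is an arbitrary choice. The actual call that would add the bytes to the Lance file as a global buffer is deliberately not shown.

```python
import json
from typing import Dict, Iterable, List, Optional


class ColumnStatsCollector:
    """Keeps a running min / max / null count per column across batches."""

    def __init__(self, columns: Iterable[str]) -> None:
        self._stats: Dict[str, dict] = {
            name: {"min": None, "max": None, "null_count": 0, "num_rows": 0}
            for name in columns
        }

    def update(self, batch: Dict[str, List[Optional[int]]]) -> None:
        """Fold one batch into the running statistics (call before writing it)."""
        for name, stats in self._stats.items():
            values = batch[name]
            present = [v for v in values if v is not None]
            stats["num_rows"] += len(values)
            stats["null_count"] += len(values) - len(present)
            if present:
                lo, hi = min(present), max(present)
                stats["min"] = lo if stats["min"] is None else min(stats["min"], lo)
                stats["max"] = hi if stats["max"] is None else max(stats["max"], hi)

    def snapshot(self) -> Dict[str, dict]:
        """Copy of the statistics so far; usable before the writer is closed."""
        return {name: dict(stats) for name, stats in self._stats.items()}

    def to_bytes(self) -> bytes:
        """Serialize the statistics into an opaque buffer (JSON is an assumption)."""
        return json.dumps(self._stats, sort_keys=True).encode("utf-8")
```

With this shape, the writer wrapper would call `update` on every batch it forwards, and hand `to_bytes()` to whatever buffer mechanism the file writer exposes just before finishing. `snapshot()` covers the "get the statistics before the writer is closed" case from the original question.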
-
I've been thinking about column statistics recently and wanted to sketch up my current thinking around a possible design. It doesn't replace
Motivation (why indexes aren't enough)
Implementation
I think from above my key goal would be that column statistics are on-by-default, coarse grained, and always present once compaction has run. Given this a design could be...
-
I'm not sure that we can have a single dataset statistics file. How do you propose that



-
I've read the issue about column statistics (#4163) and the recently merged PR about zone maps (#4244).
It feels like the zone map code could also be used to generate fragment-level column statistics. Is there a plan to do that? And if there are column statistics, where would they be written?
I'm currently working on using the Lance file reader and writer to add a new file format for Iceberg. Would it be acceptable to add the column statistics to the footer of a Lance file, just like in a Parquet file?
Or could I generate the column statistics while the Lance file is being written, and expose a method to get these statistics before the writer is closed?