Fix: Can't parse stac-geoparquet struct columns #147
Closed
+9
−2
Problem
When reading Parquet files with struct columns where some child columns have lots of unique values and others don't, hyparquet throws a parquet struct parsing error. This happens because columns with many unique values end up with multiple data pages, while columns with few values fit in a single page.
This breaks reading STAC-geoparquet files and other Parquet files with nested struct columns like assets or bbox, where child columns have different cardinality.
In src/assemble.js, the invertStruct function assumes all struct children will have the same number of values. But when you read a row range from a file where struct children have different page structures, they end up with different array lengths. The current code just throws an error instead of handling it.
Solution
Instead of throwing when the array lengths don't match, we can just use the minimum length across all the struct children.
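To illustrate the idea, here is a minimal sketch of zipping struct children into rows while truncating to the shortest child. It is not the actual invertStruct code from src/assemble.js: the function name zipStructChildren and the input shape (an object mapping child names to value arrays) are assumptions made for this example only.

```javascript
// Hypothetical helper for illustration; NOT hyparquet's invertStruct.
// Assumed input shape: { childName: arrayOfValues, ... }
function zipStructChildren(children) {
  const keys = Object.keys(children)
  if (keys.length === 0) return []

  // Use the shortest child array instead of throwing on a length mismatch
  let length = Infinity
  for (const key of keys) {
    length = Math.min(length, children[key].length)
  }

  // Build one object per row from the values at the same index
  const rows = new Array(length)
  for (let i = 0; i < length; i++) {
    const row = {}
    for (const key of keys) {
      row[key] = children[key][i]
    }
    rows[i] = row
  }
  return rows
}

// Children with mismatched lengths: only the first two rows are emitted
const rows = zipStructChildren({
  href: ['a.tif', 'b.tif', 'c.tif'],
  type: ['image/tiff', 'image/tiff'],
})
```

Note that truncating means any values past the shortest child are silently dropped, so this only makes sense when the extra values fall outside the requested row range.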
Reproducing
This came up when trying to parse the aef_index_stac_geoparquet.parquet file on Source Coop here
cc @jedsundwall