@isaaccorley isaaccorley commented Dec 18, 2025

Problem

When reading Parquet files with struct columns where some child columns have many unique values and others have few, hyparquet throws a 'parquet struct parsing error'. This happens because columns with many unique values end up with multiple data pages, while columns with few unique values fit in a single page.

This breaks reading STAC-geoparquet files and other parquet files with nested struct columns like assets or bbox where child columns have different cardinality.

In src/assemble.js the invertStruct function assumes all struct children will have the same number of values. But when you read a row range from a file where struct children have different page structures, they end up with different array lengths. The current code just throws an error instead of handling it.

Solution

Instead of throwing when the child arrays have different lengths, we can use the minimum length across all the struct children.
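
A minimal sketch of the idea, for illustration only. The helper name `zipStructColumns` and its `strict` option are hypothetical and are not hyparquet's actual `invertStruct`; the sketch assumes each struct child has already been decoded into a plain array.

```javascript
// Hypothetical helper (not hyparquet's real invertStruct): turns an object
// of child arrays into an array of row objects.
function zipStructColumns(children, { strict = false } = {}) {
  const keys = Object.keys(children)
  if (keys.length === 0) return []

  const lengths = keys.map(key => children[key].length)
  const minLength = Math.min(...lengths)

  // strict mode mimics the old behavior: any length mismatch is an error
  if (strict && lengths.some(n => n !== minLength)) {
    throw new Error('parquet struct parsing error')
  }

  // lenient mode: only build rows that every child can fill
  const rows = []
  for (let i = 0; i < minLength; i++) {
    const row = {}
    for (const key of keys) row[key] = children[key][i]
    rows.push(row)
  }
  return rows
}

// Made-up data: children with mismatched lengths
const children = { href: ['a', 'b', 'c'], roles: ['data', 'thumbnail'] }
const rows = zipStructColumns(children)
console.log(rows.length) // 2
```

Note that in this sketch the third `href` value is silently dropped, so truncating to the minimum length trades an error for possibly incomplete rows.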

Reproducing

This came up when trying to parse the aef_index_stac_geoparquet.parquet file on Source Coop.

cc @jedsundwall

When struct child columns have significantly different cardinality,
they may have different numbers of data pages. The previous implementation
would throw a 'parquet struct parsing error' when the child arrays had
different lengths.

This fix uses the minimum length across all struct children, allowing
proper handling of page alignment issues. This enables reading of
STAC-geoparquet files and other complex Parquet structures with structs
that have child columns with varying page counts.
@platypii
Collaborator

Thanks @isaaccorley I'll take a look at this! The fix seems reasonable, but my concern is... why are the arrays different lengths? Pages should get concatenated. Does this mean we lost some data?

@isaaccorley
Author

> Thanks @isaaccorley I'll take a look at this! The fix seems reasonable, but my concern is... why are the arrays different lengths? Pages should get concatenated. Does this mean we lost some data?

In the case of stac-geoparquet, this is because we are forcing JSON data into a tabular format, where the "assets" column can hold nested values of different lengths.

E.g. {href: [href1, href2, ..., hrefN], roles: [role1, role2, role3]}

I'm not sure whether this has negative implications for other Parquet datasets, but this fix worked for my case.

@platypii
Collaborator

Hey @isaaccorley I fixed it a different way in #148 but I think it should fix your issue. Please try out hyparquet v1.23.3 on npm when you get a chance. I verified your test file on the demo page.
