Skip to content

[C++][Python] Incorrect results from hash_pivot_wider #48679

@dangotbanned

Description

@dangotbanned

Describe the bug, including details regarding any error messages, version, and platform.

I'm not sure exactly what's happening here.

Discovered while porting this polars.DataFrame.pivot test in (narwhals-dev/narwhals#3373)

Repro

from __future__ import annotations


def pivot_wider(on: str, index: str, values: str) -> pa.Table:
    import pyarrow as pa
    import pyarrow.compute as pc  # noqa: F811
    from pyarrow.acero import AggregateNodeOptions, Declaration, TableSourceNodeOptions

    data = {
        "foo": ["A", "A", "B", "B", "C"],
        "bar": ["k", "l", "m", "n", "o"],
        "N1": [1, 2, 2, 4, 2],
        "N2": [1, 2, 2, 4, 2],
    }

    table = pa.Table.from_pydict(data)
    source = Declaration("table_source", options=TableSourceNodeOptions(table))
    key_names = table.column(on).unique().to_pylist()
    options = pc.PivotWiderOptions(key_names, unexpected_key_behavior="raise")
    aggregates = [([on, values], "hash_pivot_wider", options, values)]
    pivot = Declaration("aggregate", AggregateNodeOptions(aggregates, [index]))
    result = Declaration.from_sequence([source, pivot]).to_table()
    return result.flatten()

The original test was for roughly this:

print(pivot_wider(on="bar", index="foo", values="N1").to_pydict())

And by swapping on and index, it looks like there's some sorting happening behind the scenes:

print(pivot_wider(on="foo", index="bar", values="N1").to_pydict())

Output

"n" and "o" appear in the wrong field:

{'foo': ['A', 'B', 'C'],
 'N1.k': [1, None, None],
 'N1.l': [2, None, None],
 'N1.m': [None, 2, None],
 'N1.n': [None, None, 2],
 'N1.o': [None, 4, None]}

Here, they're correct (refer to the right column) - but you can see the order has changed in the same way:

{'bar': ['k', 'l', 'm', 'o', 'n'],
 'N1.A': [1, 2, None, None, None],
 'N1.B': [None, None, 2, None, 4],
 'N1.C': [None, None, None, 2, None]}

Expected results

Ignore the difference in column names, but this:

import polars as pl

data = {
        "foo": ["A", "A", "B", "B", "C"],
        "bar": ["k", "l", "m", "n", "o"],
        "N1": [1, 2, 2, 4, 2],
        "N2": [1, 2, 2, 4, 2],
    }
df = pl.DataFrame(data)

print(df.pivot(on="bar", index="foo", values="N1").to_dict(as_series=False))
{'foo': ['A', 'B', 'C'],
 'k': [1, None, None],
 'l': [2, None, None],
 'm': [None, 2, None],
 'n': [None, 4, None],
 'o': [None, None, 2]}
print(df.pivot(on="foo", index="bar", values="N1").to_dict(as_series=False))
{'bar': ['k', 'l', 'm', 'n', 'o'],
 'A': [1, 2, None, None, None],
 'B': [None, None, 2, 4, None],
 'C': [None, None, None, None, 2]}

Component(s)

Python, C++

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions