-
Notifications
You must be signed in to change notification settings - Fork 4k
[python-package] add PyArrow Table support to get_data and add_features_from methods #6892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 4 commits
cf13aad
99ee66f
a3256a3
9de2650
01bc668
27e2545
219e61c
d4de309
6686a60
22ec314
c7594c8
e0fce82
1af15c6
b5395a0
5f067ec
eaf9510
04db4aa
e569ac5
5b4b197
e826a4f
17dbc43
8135e92
71a65e5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -46,6 +46,9 @@ | |||||
| pd_Series, | ||||||
| ) | ||||||
|
|
||||||
| if PYARROW_INSTALLED: | ||||||
| import pyarrow as pa | ||||||
|
|
||||||
| if TYPE_CHECKING: | ||||||
| from typing import Literal | ||||||
|
|
||||||
|
|
@@ -3293,6 +3296,8 @@ def get_data(self) -> Optional[_LGBM_TrainDataType]: | |||||
| self.data = self.data[self.used_indices, :] | ||||||
| elif isinstance(self.data, Sequence): | ||||||
| self.data = self.data[self.used_indices] | ||||||
| elif isinstance(self.data, pa_Table): | ||||||
| self.data = self.data.take(self.used_indices) | ||||||
| elif _is_list_of_sequences(self.data) and len(self.data) > 0: | ||||||
| self.data = np.array(list(self._yield_row_from_seqlist(self.data, self.used_indices))) | ||||||
| else: | ||||||
|
|
@@ -3480,6 +3485,22 @@ def add_features_from(self, other: "Dataset") -> "Dataset": | |||||
| elif isinstance(other.data, dt_DataTable): | ||||||
| _emit_datatable_deprecation_warning() | ||||||
| self.data = np.hstack((self.data, other.data.to_numpy())) | ||||||
| elif isinstance(other.data, pa_Table): | ||||||
| if not PYARROW_INSTALLED: | ||||||
| raise LightGBMError( | ||||||
| "Cannot add features to pyarrow.Table type of raw data " | ||||||
| "without pyarrow installed. " | ||||||
| "Install pyarrow and restart your session." | ||||||
| ) | ||||||
| else: | ||||||
|
||||||
| if not (PYARROW_INSTALLED and CFFI_INSTALLED): | |
| raise LightGBMError("Cannot init Dataset from Arrow without 'pyarrow' and 'cffi' installed.") |
That's enforced by convention today... if there's some ruff rule that could enforce that (like no-unnecessary-else or something), I'd support adding it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this commit, I removed the unnecessary else block as suggested.
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing more datatable code getting added here is making me think we should just do what xgboost did (dmlc/xgboost#11070) and just fully drop support for it now.
In #6662, I'd proposed having deprecation warnings for "2-3 releases", but I'm going to put up a PR just proposing dropping this in the next release. We got a deprecation warning into 4.6.0 (#6670), which was released about 2 months ago, and it'll probably be at least another 2 months until the next LightGBM release... I think that's enough time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #6894
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for thinking through this and for moving things forward!
I appreciate you taking the initiative to propose a clear path for dropping datatable support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This was definitely just something we'd missed.
Can you please add a test in https://github.com/microsoft/LightGBM/blob/master/tests/python_package_test/test_arrow.py just for this change to
get_data()? The other changes you made intest_basic.pydo not cover these changes.When you do that, please check that the content of
self.dataAND the returned value are correct (e.g., contain exactly the expected values and data types).If you'd like, I'd even support opening a new pull request that only has the
get_data()changes + test (and then making this PR only aboutadd_features_from()). Totally your choice, I want to be respectful of your time.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion!
I added tests for both get_data() and add_features_from() directly in
test_arrow.pyas part of this PR.Please let me know if there’s anything else you’d like me to adjust!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jameslamb
I’ve opened a new pull request(#6911) that includes only the changes to get_data() along with the corresponding test. This should help keep things focused. I’d appreciate it if you could take a look when you have time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I'll focus there.