Inconsistent performance using ordered categorical variables vs numerical equivalents with LightGBM #971
Unanswered
harshvardhaniimi asked this question in Q&A
Replies: 1 comment 1 reply
-
I have found that using the numerical equivalents of these ordered categorical variables consistently results in better performance than using the categorical variables directly. Here's a brief explanation of the issue:

In pandas, I can set an order for categorical variables (for example, with `pd.Categorical(..., ordered=True)` or `.cat.reorder_categories()`). This lets me define how the categories should be ordered when using `.cat.codes` to obtain their numerical equivalents. However, when I pass these categorical variables directly to LightGBM models, the performance is consistently worse than when I use the numerical equivalents of the ordered categoricals.

As far as I understand, `pd.Categorical` variables should be treated as numeric variables under the hood (please correct me if I'm wrong). Therefore, I am curious why the models using the numerical equivalents of the categorical variables consistently perform better. Is there an issue with how FLAML handles ordered categorical variables for LightGBM? Or is there a specific reason why the performance differs between these two approaches? Any help or explanation would be greatly appreciated.
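Here's a minimal, self-contained sketch of the kind of comparison I'm running (the dataset, the `size` column, and its levels are synthetic placeholders; the only difference between the two runs is the encoding):

```python
import numpy as np
import pandas as pd
from flaml import AutoML

rng = np.random.default_rng(0)
n = 5000
levels = ["small", "medium", "large", "xlarge"]

# Ordered categorical: the level order is meaningful (small < medium < ...).
size = pd.Categorical(rng.choice(levels, size=n), categories=levels, ordered=True)
# Synthetic target that actually depends on the order of the levels.
y = pd.Series(size.codes.astype(float) + rng.normal(scale=0.5, size=n))

def best_loss(X: pd.DataFrame) -> float:
    """Tune LightGBM via FLAML on X and return the best validation loss."""
    automl = AutoML()
    automl.fit(X, y, task="regression", estimator_list=["lgbm"],
               time_budget=30, verbose=0)
    return automl.best_loss

# Run 1: pass the ordered categorical column as-is.
print("categorical:  ", best_loss(pd.DataFrame({"size": size})))
# Run 2: pass the order-preserving integer codes instead.
print("numeric codes:", best_loss(pd.DataFrame({"size": size.codes})))
```

The sketch only reproduces the shape of the experiment; on my actual data, the numeric-codes run consistently reaches a lower loss.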
-
FLAML doesn't vary the categorical features. They are passed as-is to LightGBM. Do you observe the same issue when using LightGBM without FLAML?
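For instance, a quick way to run that check is to train LightGBM directly on both encodings, with FLAML out of the loop (synthetic data and default hyperparameters below; the column name is a placeholder):

```python
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
levels = ["small", "medium", "large", "xlarge"]
size = pd.Categorical(rng.choice(levels, size=n), categories=levels, ordered=True)
y = size.codes.astype(float) + rng.normal(scale=0.5, size=n)

for name, X in [
    ("categorical", pd.DataFrame({"size": size})),
    ("numeric codes", pd.DataFrame({"size": size.codes})),
]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    # LightGBM auto-detects pandas categorical columns and applies its own
    # categorical split finding, which does not use the declared level order;
    # the integer codes are instead treated as an ordinary numeric feature.
    model = lgb.LGBMRegressor(random_state=0)
    model.fit(X_tr, y_tr)
    print(f"{name}: test MSE = {mean_squared_error(y_te, model.predict(X_te)):.4f}")
```

If the gap shows up here too, it comes from LightGBM's own categorical handling rather than from anything FLAML does to the features.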