-
Notifications
You must be signed in to change notification settings - Fork 2.8k
[trainer] feat: allow override for reward_manager_worker in agent loop #4423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request allows subclasses of AgentLoopWorkerBase to provide their own reward_manager_worker instance. The change wraps the default initialization of self.reward_manager_worker in a hasattr check, which is a clean and effective way to allow for this customization. This pattern is already used for server_manager in the same class, making the change consistent with the existing codebase. The modification is straightforward and improves the flexibility of the agent loop. I find no issues with this implementation.
|
Could you share a bit more about the scenarios where you feel it might be necessary to override the reward manager worker, as these workers only maintains the initialization and computation logic for the reward components. |
|
This is mainly for flexibility.
The alternative is to allow us to define our own. Majority of "computation logic for the reward components" we find that it doesn't need to be changed at all, but we did find the need to want to change the initialization logic. For example the default one: Two things:
While these could be added to verl, allowing to override the class of the |
|
Understood. Users indeed need a high degree of flexibility for customizing the reward worker. |
|
But I also wonder whether there might be a unified approach that allows the existing reward-loop worker to cover most (all) use cases, while leaving the flexibility primarily to the user-customized reward manager rather than the worker. I consider the concrete scenarios you mention:
Future PRs may start addressing these aspects. Thanks and welcome contribution. |
What does this PR do?
Allow subclass to set reward_manager_worker
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)