[Core] Introduce local port service discovery #59065
Signed-off-by: yicheng <[email protected]>
…_client_proxy.py Signed-off-by: yicheng <[email protected]>
…r Windows handle inheritance Signed-off-by: yicheng <[email protected]>
Awesome @Yicheng-Lu-llll ❤️
Hey @edoakes, whenever you have a moment, I'd love to get your input on this. Currently, GDB debugging in Ray requires running the process inside tmux, as enforced here: ray/python/ray/_private/services.py, line 914 in d37dff6.
One side effect of this approach is that tmux breaks the parent–child relationship of the spawned process, which prevents pipe-based interactions from working as expected: ray/python/ray/_private/services.py, line 952 in d37dff6.
For reference, the related documentation is here:
Do you happen to have any background on why this tmux requirement was originally introduced? The code and docs seem to date back about five years, so I'm wondering whether this is still something we want to keep as-is, or if it might make sense to revisit it so that pipe-based interactions continue to work.
I don't know, or have since forgotten. Revisiting this to ensure a parent-child relationship sounds like the right thing to do.
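To make the parent–child point concrete, here is a minimal sketch of a pipe-based interaction, not Ray's actual code. The child script, `spawn_and_read_port`, and `stop` are hypothetical names used only for illustration. The parent holds the read end of the child's stdout pipe, which only works when the process is a direct child; a process relaunched inside tmux would not inherit that pipe.

```python
import subprocess
import sys

# Hypothetical child: binds an OS-assigned port, reports it on stdout, then
# stays alive until the parent closes its stdin.
CHILD_SOURCE = """
import socket, sys
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen()
print(server.getsockname()[1], flush=True)
sys.stdin.read()
"""


def spawn_and_read_port():
    """Start the child and read the port it reports over the stdout pipe."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Blocks until the child has bound its socket and printed the port.
    port = int(proc.stdout.readline())
    return proc, port


def stop(proc) -> int:
    """Close the child's stdin so it exits, then return its exit code."""
    proc.stdin.close()
    code = proc.wait(timeout=10)
    proc.stdout.close()
    return code


if __name__ == "__main__":
    child, reported_port = spawn_and_read_port()
    print("child reported port", reported_port)
    print("child exit code", stop(child))
```

The same pipes also give the parent a simple way to signal shutdown (closing stdin), which is another interaction that depends on the direct parent–child relationship.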
Description
Previously, if the user did not specify them, Ray preassigned the GCS port, dashboard agent port, runtime environment port, etc., and passed them to each component at startup. This created a race condition: Ray might believe a port is free, but by the time the port information is propagated to each component, another process may have already bound to that port.
This can cause user-facing issues, for example when Raylet heartbeat messages are missed frequently enough that the GCS considers the node unhealthy and removes it.
We originally did this because there was no standard local service discovery, so components had no way to know each other’s serving ports unless they were preassigned.
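The race described above is a check-then-bind gap. The following is a minimal sketch under stated assumptions, not the code in this PR: `probe_free_port`, `bind_ephemeral`, and `preassigned_bind_loses_race` are hypothetical helpers. It contrasts preassigning a port (probe, release, bind later) with letting the component bind port 0 itself and report whatever the OS assigned.

```python
import socket


def probe_free_port() -> int:
    """Preassignment pattern: ask the OS for a free port, then release it.

    The returned number is only a hint. Nothing reserves the port once the
    probe socket is closed.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


def bind_ephemeral() -> socket.socket:
    """Discovery pattern: bind to port 0 and keep the socket open."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen()
    return server


def preassigned_bind_loses_race() -> bool:
    """Simulate another process taking the port before the component binds."""
    port = probe_free_port()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as intruder:
        intruder.bind(("127.0.0.1", port))
        intruder.listen()
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as component:
            try:
                component.bind(("127.0.0.1", port))
            except OSError:
                # The "free" port was taken between the probe and the bind.
                return True
    return False


if __name__ == "__main__":
    print("preassigned bind lost the race:", preassigned_bind_loses_race())
    with bind_ephemeral() as server:
        print("component bound its own port:", server.getsockname()[1])
```

With the second pattern there is no window in which the port is known but unowned; the remaining problem is telling other components which port was chosen, which is what local service discovery provides.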
The final port discovery design is here:

This PR addresses port discovery for:
ray.init()) and a startup driver still receive all port information from the GCS ✅
Related issues
Closes #54321
Test
For GCS-related work, here is a detailed test I wrote that covers seven starting/connecting cases:
For runtime env agent:
Test that ray_client_server works correctly with dynamic runtime env agent port:
Follow up
Performance
After This PR:
The dominant cost is GCS startup: we wait for GCS to be ready and write the cluster info.
Port reporting itself is fast (measured as file appearance time − raylet start time).
https://github.com/ray-project/ray/blob/863ae9fd573b13a05dcae63b483e9b1eb0175571/python/ray/_private/node.py#L1365-L367
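As a rough illustration of file-based port reporting and of the latency measurement mentioned above, here is a minimal sketch. It is not the implementation in this PR: the file name `component_port` and the helpers `report_port`, `wait_for_port`, and `demo` are assumptions made for the example. The writer publishes the port with an atomic rename so a reader never sees a partially written file.

```python
import os
import socket
import tempfile
import time


def report_port(port_file: str, port: int) -> None:
    """Write the port to a temp file, then atomically rename it into place."""
    directory = os.path.dirname(port_file) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(str(port))
    os.replace(tmp_path, port_file)


def wait_for_port(port_file: str, timeout_s: float = 5.0, poll_s: float = 0.01) -> int:
    """Poll until the port file appears, then return the port it contains."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with open(port_file) as f:
                return int(f.read())
        except FileNotFoundError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{port_file} did not appear") from None
            time.sleep(poll_s)


def demo():
    """Bind, report, and discover a port; return (bound, discovered, elapsed)."""
    with tempfile.TemporaryDirectory() as session_dir:
        # Hypothetical file name, chosen only for this example.
        port_file = os.path.join(session_dir, "component_port")
        start = time.monotonic()
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
            server.bind(("127.0.0.1", 0))
            server.listen()
            bound = server.getsockname()[1]
            report_port(port_file, bound)
            discovered = wait_for_port(port_file)
        elapsed = time.monotonic() - start
    return bound, discovered, elapsed


if __name__ == "__main__":
    bound_port, discovered_port, seconds = demo()
    print(f"bound={bound_port} discovered={discovered_port} in {seconds:.4f}s")
```

The elapsed time here plays the role of "file appearance time − start time"; in a real cluster the writer and reader would be separate processes.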