@Yicheng-Lu-llll
Member

@Yicheng-Lu-llll Yicheng-Lu-llll commented Nov 28, 2025

Description

Previously, if the user did not specify them, Ray preassigned the GCS port, dashboard agent port, runtime environment port, etc., and passed them to each component at startup. This created a race condition: Ray might believe a port is free, but by the time the port information is propagated to each component, another process may have already bound to that port.

This can cause user-facing issues, for example when Raylet heartbeat messages are missed frequently enough that the GCS considers the node unhealthy and removes it.

We originally did this because there was no standard local service discovery, so components had no way to know each other’s serving ports unless they were preassigned.
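The race described above can be illustrated with a minimal sketch. The function names `preassign_free_port` and `bind_and_report` are hypothetical and are not taken from the Ray codebase; the sketch only contrasts probing for a free port ahead of time with letting the component bind port 0 itself and report whatever the OS assigned.

```python
import socket
from typing import Tuple


def preassign_free_port() -> int:
    """Race-prone pattern: probe for a free port, then release it.

    The port is only known to be free while the probe socket is open.
    Once this function returns, any other process may bind it before
    the real component starts.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind(("127.0.0.1", 0))
        return probe.getsockname()[1]


def bind_and_report() -> Tuple[socket.socket, int]:
    """Race-free pattern: the component binds port 0 and keeps the socket.

    The OS picks the port and the component holds it from that moment,
    so the reported port cannot be taken by another process.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen()
    return server, server.getsockname()[1]
```

In the second pattern the port number still has to reach the other components, which is the discovery problem this PR addresses.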

The final port discovery design is here:
[Image: port discovery design diagram]

This PR addresses port discovery for:

  • GCS reporting back to the startup script (driver) ✅
  • The runtime env agent reporting back to the raylet ✅
  • The dashboard agent reporting back to the raylet ✅
  • The Ray client server obtaining the runtime env agent port from the GCS ✅
  • Ensuring that both a connected-only driver (e.g., ray.init()) and a startup driver still receive all port information from the GCS ✅
  • Ensuring that GCS FT works by using the same GCS port as before ✅
  • Cleaning up the old cache port code ✅

Related issues

Closes #54321

Test

For GCS-related work, here is a detailed test I wrote that covers seven starting/connecting cases:

For runtime env agent:

Test that ray_client_server works correctly with dynamic runtime env agent port:

Follow up

  • The dashboard agent reporting back to the raylet
  • The dashboard agent now also writes to GCS, but we should allow only the raylet to write to GCS

Performance

After this PR:

[0.000s] Starting ray.init()...
[0.075s] Process: gcs_server
[0.075s] Session dir created
[0.075s] File: gcs_server_port.json = 39451
[6.976s] Process: raylet
[6.976s] Process: dashboard_agent
[6.976s] Process: runtime_env_agent
[7.576s] File: runtime_env_agent_port.json = 38747
[7.640s] File: metrics_agent_port.json = 40005
[8.083s] File: metrics_export_port.json = 44515
[8.083s] File: dashboard_agent_listen_port.json = 52365
2025-12-12 02:02:54,925 INFO worker.py:1998 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
/home/ubuntu/ray/python/ray/_private/worker.py:2046: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
[10.035s] ray.init() completed

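The log shows that most of the startup time is spent waiting for the GCS: the raylet and the agents do not start until 6.976 s, after the GCS is ready and has written the cluster info.
Port reporting itself is fast: each port file appears roughly 0.6 s to 1.1 s after the raylet starts (file appearance time − raylet start time).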
https://github.com/ray-project/ray/blob/863ae9fd573b13a05dcae63b483e9b1eb0175571/python/ray/_private/node.py#L1365-L367
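As a rough illustration of the file-based reporting visible in the log above, the following sketch shows how a reader might wait for a port file to appear in the session directory. The helpers `write_port_file` and `wait_for_port_file` are hypothetical, and the assumption that each file holds a single JSON number is a guess based on the log lines, not a description of the actual on-disk format.

```python
import json
import os
import time


def write_port_file(session_dir: str, name: str, port: int) -> None:
    """Write the port to a temp file, then rename it into place.

    os.replace is atomic on the same filesystem, so a reader never
    sees a partially written file under the final name.
    """
    path = os.path.join(session_dir, name)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(port, f)
    os.replace(tmp_path, path)


def wait_for_port_file(
    session_dir: str, name: str, timeout_s: float = 30.0, poll_s: float = 0.05
) -> int:
    """Poll until the port file exists and parses, or raise TimeoutError."""
    path = os.path.join(session_dir, name)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with open(path) as f:
                return int(json.load(f))
        except (FileNotFoundError, ValueError):
            # File not written yet (or not yet valid); retry shortly.
            time.sleep(poll_s)
    raise TimeoutError(f"port file {path} did not appear within {timeout_s}s")
```

The polling interval bounds how quickly the reader notices a new file, so a short interval keeps the added latency small relative to the startup times measured above.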

@Yicheng-Lu-llll Yicheng-Lu-llll changed the title from "[Core] Allow agents to self-assign ports and report via pipe" to "[Core] Allow GCS && runtime env agent to self-assign ports and report via pipe" Dec 2, 2025
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the agent-port-self-discovery branch from 78680b8 to 222b544 December 2, 2025 18:52
@Yicheng-Lu-llll Yicheng-Lu-llll marked this pull request as ready for review December 2, 2025 19:10
@Yicheng-Lu-llll Yicheng-Lu-llll requested a review from a team as a code owner December 2, 2025 19:10
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the agent-port-self-discovery branch from 697bd9c to 1cfdeb3 December 2, 2025 20:47
yicheng added 2 commits December 2, 2025 22:21
@ray-gardener ray-gardener bot added the core Issues that should be addressed in Ray Core label Dec 3, 2025
yicheng added 2 commits December 3, 2025 03:30
yicheng added 2 commits December 3, 2025 05:16
Signed-off-by: yicheng <[email protected]>
Signed-off-by: yicheng <[email protected]>
@testinfected

Awesome @Yicheng-Lu-llll ❤️

Signed-off-by: yicheng <[email protected]>
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the agent-port-self-discovery branch from 7338b27 to e8ac442 December 6, 2025 04:55
@Yicheng-Lu-llll
Member Author

Yicheng-Lu-llll commented Dec 6, 2025

Hey @edoakes, whenever you have a moment, I'd love to get your input on this.

Currently, GDB debugging in Ray requires running the process inside tmux, as enforced here:

"If 'use_gdb' is true, then 'use_tmux' must be true as well."

One side effect of this approach is that tmux breaks the parent–child relationship of the spawned process, which prevents pipe-based interactions from working as expected:

command = ["tmux", "new-session", "-d", f"{' '.join(command)}"]

For reference, the related documentation is here:
https://github.com/ray-project/ray/blob/master/doc/source/ray-contribute/debugging.rst

Do you happen to have any background on why this tmux requirement was originally introduced? The code and docs seem to date back about five years, so I’m wondering whether this is still something we want to keep as-is, or if it might make sense to revisit it so that pipe-based interactions continue to work.
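To illustrate why the parent-child relationship matters here, the following minimal sketch shows a pipe-based port report between a parent and a directly spawned child. It is not Ray's implementation: `CHILD_SOURCE` and `spawn_and_read_port` are hypothetical names, and `pass_fds` is POSIX-only. The child can write to the pipe only because it inherits the write end from its parent, which is the kind of link that a detached tmux session would not preserve.

```python
import os
import subprocess
import sys

# Hypothetical child program: binds port 0 and writes the assigned
# port to the file descriptor passed as its first argument.
CHILD_SOURCE = r"""
import os, socket, sys
fd = int(sys.argv[1])
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("127.0.0.1", 0))
s.listen()
os.write(fd, str(s.getsockname()[1]).encode())
os.close(fd)
"""


def spawn_and_read_port() -> int:
    """Spawn a child that self-assigns a port and reports it via a pipe."""
    read_fd, write_fd = os.pipe()
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE, str(write_fd)],
        pass_fds=(write_fd,),  # the child inherits only this descriptor
    )
    os.close(write_fd)  # the parent keeps only the read end
    with os.fdopen(read_fd, "rb") as reader:
        data = reader.read()  # returns at EOF, once the child closes its end
    proc.wait()
    return int(data)
```

If the command were instead handed to an intermediary that does not pass the descriptor along, the child would find no open descriptor at that number and the parent would read an empty result.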

Signed-off-by: yicheng <[email protected]>
@edoakes
Collaborator

edoakes commented Dec 9, 2025

Do you happen to have any background on why this tmux requirement was originally introduced? The code and docs seem to date back about five years, so I’m wondering whether this is still something we want to keep as-is, or if it might make sense to revisit it so that pipe-based interactions continue to work.

I don't know, or have since forgotten. Revisiting this to ensure a parent-child relationship sounds like the right thing to do.

@Yicheng-Lu-llll Yicheng-Lu-llll marked this pull request as draft December 10, 2025 03:41
@Yicheng-Lu-llll Yicheng-Lu-llll changed the title from "[Core] Allow GCS && runtime env agent to self-assign ports and report via pipe" to "[Core] Introduce local port service discovery" Dec 11, 2025
@Yicheng-Lu-llll Yicheng-Lu-llll changed the title from "[Core] Introduce local port service discovery" to "[Core][1/n] Introduce local port service discovery" Dec 11, 2025
@Yicheng-Lu-llll Yicheng-Lu-llll force-pushed the agent-port-self-discovery branch from 1e8c3f7 to 92e99ab December 11, 2025 23:10
@Yicheng-Lu-llll Yicheng-Lu-llll changed the title from "[Core][1/n] Introduce local port service discovery" to "[Core] Introduce local port service discovery" Dec 12, 2025
Labels

core Issues that should be addressed in Ray Core

Development

Successfully merging this pull request may close these issues.

[Core] Raylet heartbeat misses

5 participants