Commit 0c5ebd2
Manav Kumar
[yugabyte#27894] YSQL: Don't release the lock on route until server is closed in multi route pooling
Summary:
**Issue Summary**
A ConnectionBurst stress test triggered a core dump. With multi-route pooling enabled, the crash occurred in the od_backend_close_connection function. The stack trace is as follows:
frame #0: 0x00005601a62712bc odyssey`od_backend_close_connection [inlined] mm_tls_free(io=0x0000000000000000) at tls.c:91:10
frame #1: 0x00005601a62712bc odyssey`od_backend_close_connection [inlined] machine_io_free(obj=0x0000000000000000) at io.c:201:2
frame #2: 0x00005601a627129e odyssey`od_backend_close_connection [inlined] od_io_close(io=0x000031f53e72b8b8) at io.h:77:2
frame #3: 0x00005601a627128c odyssey`od_backend_close_connection(server=0x000031f53e72b880) at backend.c:56:2
frame #4: 0x00005601a6250de5 odyssey`od_router_attach(router=0x00007fff00dbeb30, client_for_router=0x000031f53e5df180, wait_for_idle=<unavailable>, external_client=0x000031f53ee30680) at router.c:1010:6
frame #5: 0x00005601a6258b1b odyssey`od_auth_frontend [inlined] yb_execute_on_control_connection(client=0x000031f53ee30680, function=<unavailable>) at frontend.c:2842:11
frame #6: 0x00005601a6258b0b odyssey`od_auth_frontend(client=0x000031f53ee30680) at auth.c:677:8
frame #7: 0x00005601a626782e odyssey`od_frontend(arg=0x000031f53ee30680) at frontend.c:2539:8
frame #8: 0x00005601a6290912 odyssey`mm_scheduler_main(arg=0x000031f53e390000) at scheduler.c:17:2
frame #9: 0x00005601a6290b77 odyssey`mm_context_runner at context.c:28:2
**Root Cause**
The crash originated from an improper lock release in the yb_get_idle_server_to_close function, introduced in commit 55beeb0 as part of the multi-route pooling implementation. The function released the lock on the route object, despite a comment explicitly warning against it. As a result, once the function returned to its caller, no lock was held on the route or idle_route. Other coroutines could then access and use the same route and its idle server, which the original coroutine intended to close. This race condition led to a crash due to an assertion failure during connection closure.
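The intended pattern can be sketched as follows. This is a minimal illustration, not the actual Odyssey code: `server_t`, `route_t`, `pick_idle_server_locked`, and `close_one_idle_server` are hypothetical stand-ins, and a plain pthread mutex is assumed in place of the real route lock. The point it shows is that the helper which selects the idle server returns with the route lock still held, and the caller releases it only after the server has been closed.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real Odyssey structures. */
typedef struct server {
    int closed;
    struct server *next;
} server_t;

typedef struct route {
    pthread_mutex_t lock;
    server_t *idle;      /* singly linked list of idle servers */
    int idle_count;
} route_t;

/*
 * Pops one idle server from the route.
 * The caller must already hold route->lock, and this function
 * deliberately does NOT release it before returning.
 */
static server_t *pick_idle_server_locked(route_t *route)
{
    assert(route != NULL);
    server_t *s = route->idle;
    if (s != NULL) {
        route->idle = s->next;
        s->next = NULL;
        route->idle_count--;
    }
    return s;
}

/* Closes one idle server. Returns 1 if a server was closed, 0 if none was idle. */
static int close_one_idle_server(route_t *route)
{
    pthread_mutex_lock(&route->lock);
    server_t *s = pick_idle_server_locked(route);
    if (s != NULL) {
        s->closed = 1;   /* stand-in for the real connection close path */
    }
    /* The lock is released only after the close has completed. */
    pthread_mutex_unlock(&route->lock);
    return s != NULL;
}
```

Calling `close_one_idle_server(&route)` on a route with two idle servers closes them one at a time. Because the lock spans both selection and close, no other thread can pick up the same server in between.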
**Note**
If all threads or processes acquire locks in the same order, differences in the release order alone cannot cause a deadlock. Deadlocks arise from circular dependencies during acquisition, not release.
In the connection manager code base:
- Locks are acquired in the order: router → route. This order must be strictly enforced everywhere to prevent deadlocks.
- Lock release order varies (e.g., router then route in od_router_route and yb_get_idle_server_to_close, versus the reverse elsewhere). This variation does not cause deadlocks, as release order is irrelevant to deadlock prevention.
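The ordering rule above can be illustrated with a small sketch. It assumes two plain pthread mutexes named `router_lock` and `route_lock` as stand-ins for the real locks; the worker functions and `run_both` are hypothetical names. Both workers acquire router then route, but release in opposite orders, and they still run to completion because no thread ever waits for the router lock while holding the route lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the router and route locks. */
static pthread_mutex_t router_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t route_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter = 0;   /* only modified while route_lock is held */

enum { ITERATIONS = 100000 };

/* Acquire router -> route, release router first. */
static void *release_router_first(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&router_lock);
        pthread_mutex_lock(&route_lock);
        shared_counter++;
        pthread_mutex_unlock(&router_lock);
        pthread_mutex_unlock(&route_lock);
    }
    return NULL;
}

/* Acquire router -> route (same order), release route first. */
static void *release_route_first(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&router_lock);
        pthread_mutex_lock(&route_lock);
        shared_counter++;
        pthread_mutex_unlock(&route_lock);
        pthread_mutex_unlock(&router_lock);
    }
    return NULL;
}

/* Runs both workers concurrently and returns the final counter value. */
static long run_both(void)
{
    pthread_t a, b;
    shared_counter = 0;
    int rc = pthread_create(&a, NULL, release_router_first, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, release_route_first, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return shared_counter;
}
```

`run_both()` returns once both threads have finished all iterations. If one worker instead acquired route before router, the two threads could each hold one lock while waiting for the other, which is the circular dependency the acquisition rule prevents.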
Jira: DB-17501
Test Plan: Jenkins: all tests
Reviewers: skumar, vikram.damle, asrinivasan, arpit.saxena
Reviewed By: skumar
Subscribers: svc_phabricator, yql
Differential Revision: https://phorge.dev.yugabyte.com/D45641
1 parent 914fc29 commit 0c5ebd2
1 file changed, with 0 additions and 1 deletion (original line 827 removed).