Did a couple of full runs on the maxi yesterday, after some generalizations in the network stack. No changes expected, none observed. 58 mins to 900k. But I also managed to get a good baseline for my new test WAN, which is now dual-homed with two 2.5 Gbps NICs, each on its own dedicated 2.5 Gbps router and modem, one to a fiber line and the other to cable. Each measures 2.3 Gbps independently on speedtest.net, though upload on the fiber is 10x that of the cable (400 Mbps vs. 40). The upload speed doesn't have a material impact on perf at either level.
The noticeable difference, now that each link is independently the same speed, is that the load is well balanced. However, the performance is not cumulative at all. It was almost identical to a single NIC, with bandwidth utilization at about 1.2 Gbps on each NIC. Hardware utilization is otherwise the same as on a single NIC: plenty of usable RAM (which it will use if it needs it), barely touching the SSD (periodic flush to keep RAM low), 30-40% CPU, and plenty of threads available and allocated. Nothing interesting in the heatmap. This indicates that it's still network limited, which is consistent with the current multi-homed setup.
Current:
┌────────────────────────────────────┐
│ Single io_context (all channels) │ ← One slow peer blocks *all*
└────────────────────────────────────┘
│
├──> NIC A: 2.3 Gbps → drops to 1.8 Gbps
└──> NIC B: 0.1 Gbps
Total: ~1.9 Gbps
Fixed:
┌─────────────────┐ ┌─────────────────┐
│ nic_a.ioc │ │ nic_b.ioc │ ← Fully independent
│ (channels on A) │ │ (channels on B) │
└─────────────────┘ └─────────────────┘
│ │
└────> 2.3 Gbps └────> 2.3 Gbps
Total: 4.6 Gbps
The diagrams above, which Grok drew for me, explain the issue. Each channel operates on its own proactor strand, but all channels share the same network thread pool. Inbound channels (not material here) are bound to a NIC. Outbound channels use boost::asio's default NIC selection. So where the software could otherwise easily run the two independent channels concurrently, it's effectively serialized. I also verified that the hardware I/O bus can handle it; it's not a problem at all, and can handle about 20x more than it's getting.
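For illustration, here is a minimal boost::asio sketch of that arrangement (names are hypothetical, not the actual channel classes): every channel gets its own strand, but all strands dispatch into one io_context served by one shared thread pool.

```cpp
#include <boost/asio.hpp>
#include <thread>
#include <vector>

namespace asio = boost::asio;

int main()
{
    asio::io_context ioc;

    // Each channel serializes its own handlers on a strand, but every strand
    // dispatches into the same io_context, and therefore the same thread pool.
    asio::strand<asio::io_context::executor_type> channel_a{ ioc.get_executor() };
    asio::strand<asio::io_context::executor_type> channel_b{ ioc.get_executor() };

    asio::post(channel_a, [] { /* channel A read/write handlers */ });
    asio::post(channel_b, [] { /* channel B read/write handlers */ });

    // The single shared network thread pool that runs everything above.
    auto work = asio::make_work_guard(ioc);
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i)
        pool.emplace_back([&ioc] { ioc.run(); });

    work.reset();
    for (auto& thread: pool)
        thread.join();
}
```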
The solution is to configure outbound NICs for explicit binding (and/or make explicit binding the default), and to put each NIC on its own thread pool. This can be generalized to inbound and outbound. Each inbound address binding has an independent acceptor object that iterates over individual new connection attempts (basically the same for manual outbound connections, using the connector). Inbound connections are tied to an address, which means a NIC, so the number of acceptors is limited. Outbound connections also use a connector to establish the connection, after which the connector goes out of scope. The connector and acceptor objects control socket construction and thread pool (asio io_context) association, with each socket establishing and owning the asio strand instance for all operations on the connection.
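A minimal sketch of the proposed shape, assuming one io_context and thread pool per NIC. The nic_pool type and addresses are illustrative, not the actual connector/acceptor classes, and the real stack would use async_connect with a per-socket strand rather than a blocking connect.

```cpp
#include <boost/asio.hpp>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

namespace asio = boost::asio;
using asio::ip::tcp;

// One io_context and thread pool per NIC (hypothetical structure).
struct nic_pool
{
    nic_pool(const std::string& local_address, std::size_t threads)
      : local{ asio::ip::make_address(local_address), 0 },
        work{ asio::make_work_guard(ioc) }
    {
        for (std::size_t i = 0; i < threads; ++i)
            runners.emplace_back([this] { ioc.run(); });
    }

    ~nic_pool()
    {
        work.reset();
        ioc.stop();
        for (auto& thread: runners)
            thread.join();
    }

    // Construct the outbound socket on this NIC's io_context and bind it to
    // the NIC's local address so the OS routes it over that interface.
    tcp::socket connect(const tcp::endpoint& peer)
    {
        tcp::socket socket{ ioc };
        socket.open(peer.protocol());
        socket.bind(local);   // explicit NIC selection
        socket.connect(peer);
        return socket;
    }

    asio::io_context ioc;
    tcp::endpoint local;
    asio::executor_work_guard<asio::io_context::executor_type> work;
    std::vector<std::thread> runners;
};

// Usage: two fully independent pools, one per NIC (addresses are examples).
// nic_pool fiber{ "192.168.1.10", 4 };
// nic_pool cable{ "192.168.2.10", 4 };
```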
So there is a master network strand, which drives the connection process and protects shared resources, like the address pool and certain config, and there is a strand for each channel. These all operate on the single network thread pool. There is only one other thread pool, which isolates CPU-intensive prevout query and script validation. The database is synchronous and lock-free, so all database calls are just simple function calls from strands on either of these pools. Channels directly read/write the store, with the major block commitments coming concurrently from the channels that download the blocks.
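A rough sketch of that layering (illustrative names only, not the actual classes): the master strand guards the shared address pool, each channel's strand guards only that channel, and both post onto the same network io_context.

```cpp
#include <boost/asio.hpp>
#include <utility>
#include <vector>

namespace asio = boost::asio;
using asio::ip::tcp;

struct network
{
    asio::io_context ioc;   // the single network thread pool runs this
    asio::strand<asio::io_context::executor_type> master{ ioc.get_executor() };
    std::vector<tcp::endpoint> address_pool;   // shared state, guarded by master

    // Channels request a peer address via the master strand; the handler runs
    // serialized with every other access to the shared pool. Database calls,
    // by contrast, are plain synchronous function calls from any strand.
    template <typename Handler>
    void fetch_address(Handler&& handler)
    {
        asio::post(master, [this, handler = std::forward<Handler>(handler)]()
        {
            handler(address_pool.empty() ? tcp::endpoint{} : address_pool.back());
        });
    }
};
```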
The important observation is that despite every channel sharing the same thread pool, they logically operate entirely independently, with each strand ensuring thread safety only internally to its channel. There are also chaser objects that manage cross-channel state computations, such as performance deviations and validation. These also operate on their own strands on the network thread pool, but are generally very low cost (apart from the aforementioned script validation, which uses an independent pool). So given that these are already logically independent (concurrent) strands of execution within a single thread pool, it is trivial to factor them into independent thread pools based on NIC selection.
The slightly more challenging issue is just determining which NIC is more heavily loaded. It requires cross-NIC monitoring, similar to what we presently do for cross-channel download performance monitoring. Given a balanced distribution of work across the thread pool-associated NICs, we should be able to run them in parallel and get most of the expected cumulative benefit. This is basically running a load balancer internally. This can be applied externally as well, but the OS doesn't do this.
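The selection step itself can be as simple as comparing per-NIC counters, analogous to the existing cross-channel performance monitoring. A hypothetical sketch (the counters and names are illustrative, not the actual monitoring code):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Per-NIC byte counter, updated by the channels as they transfer data.
struct nic_stats
{
    std::atomic<std::uint64_t> bytes{ 0 };
};

// Hand the next outbound connection to the less loaded NIC's pool.
inline std::size_t select_nic(std::array<nic_stats, 2>& nics)
{
    return nics[0].bytes.load() <= nics[1].bytes.load() ? 0 : 1;
}
```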