GH-135379: Top of stack caching for the JIT. #135465
Conversation
Fidget-Spinner left a comment:
This is really cool. I'll do a full review soon enough.
Force-pushed from 78489ea to 2850d72.
Performance is in the noise overall, but we would need a really big speedup of jitted code for it to be more than noise overall. The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac where it shows no speedup.
Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge fluctuations in performance due to a PGO bug.
Fidget-Spinner left a comment:
I need to review the cases generator later.
Fidget-Spinner left a comment:
Stylistic nits
Stats show that the spill/reload uops are about 13% of the total, and that we aren't spilling and reloading much more than the minimum.
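For readers who haven't followed the design discussion: the spill/reload uops write the cached top-of-stack values back to the in-memory evaluation stack and read them back again. Below is a minimal, self-contained toy sketch of the idea; it is not the generated CPython code, and the names (ToyFrame, CACHE_SIZE, spill, reload) are invented for illustration. The top few stack values live in local variables that the compiler can keep in registers, and they only touch memory around uops that need the materialized stack.

```c
/* Toy sketch of top-of-stack caching -- not the actual CPython JIT code.
 * ToyFrame, CACHE_SIZE, spill and reload are hypothetical names. */
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE 3              /* values kept in locals/registers */

typedef struct {
    intptr_t stack[64];           /* in-memory evaluation stack */
    intptr_t *sp;                 /* stack pointer for the memory part */
} ToyFrame;

/* Write the cached values back to memory, deepest first, so a uop that
 * needs the real stack (a call, an escaping uop, a side exit) sees a
 * fully materialized stack. */
static intptr_t *spill(intptr_t *sp, const intptr_t cache[], int depth)
{
    for (int i = depth - 1; i >= 0; i--) {
        *sp++ = cache[i];         /* cache[0] is the top of stack */
    }
    return sp;
}

/* Read the top `depth` values back into the cache afterwards. */
static intptr_t *reload(intptr_t *sp, intptr_t cache[], int depth)
{
    for (int i = 0; i < depth; i++) {
        cache[i] = *--sp;
    }
    return sp;
}

int main(void)
{
    ToyFrame f;
    f.sp = f.stack;

    intptr_t cache[CACHE_SIZE];

    /* "Push" two constants and add them: the add consumes two cached
     * values and produces one, so no stack memory is touched at all. */
    cache[1] = 2;                     /* second from top */
    cache[0] = 40;                    /* top of stack */
    cache[0] = cache[1] + cache[0];
    int depth = 1;

    /* A uop that needs the memory stack forces a spill before it and a
     * reload after it. */
    f.sp = spill(f.sp, cache, depth);
    printf("materialized top of stack: %ld\n", (long)f.sp[-1]);
    f.sp = reload(f.sp, cache, depth);

    printf("cached top of stack: %ld\n", (long)cache[0]);
    return 0;
}
```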
Fidget-Spinner left a comment:
I was discussing this with Brandt the other day, and I concluded that we need this PR to implement decref elimination with a lower maintenance burden.
For decref elimination, there are two ways:
1. Manually handwrite, or have the cases generator emit, a version of the instruction that has no PyStackref_CLOSE/DECREF_INPUTS.
2. Specialize for POP_TOP.
Option 2 is a lot more maintainable in the long run and doesn't involve any more cases generator magic, unlike option 1. It's also less likely to introduce a bug, again because there is less cases generator magic involved. So I'm more inclined towards option 2.
Option 2 requires TOS caching, otherwise it won't pay off. So this PR is needed, otherwise we're blocking other optimizations. Unless, of course, you folks don't mind me exercising some cases generator magic again 😉.
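To make option 2 concrete, here is a hypothetical toy sketch, not CPython's actual uop set or stackref API; ToyRef, toy_decref, pop_top and pop_top_nop are invented names. The idea is that when the optimizer can prove the value being popped owns no reference (for example it is borrowed or immortal), the generic POP_TOP, which must decref, can be replaced with a specialized variant that simply discards the value. Combined with TOS caching, that specialized pop never even touches the in-memory stack, which is why option 2 only pays off once TOS caching is in.

```c
/* Hypothetical toy sketch -- not CPython's uop set or stackref API.
 * ToyRef, toy_decref, pop_top and pop_top_nop are invented names. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int refcount;
    bool owns_no_reference;   /* what the optimizer would have proven */
} ToyRef;

static void toy_decref(ToyRef *v)
{
    if (--v->refcount == 0) {
        free(v);
    }
}

/* Generic pop: always pays for a decref, and without TOS caching also
 * for a read of the in-memory stack. */
static void pop_top(ToyRef *tos)
{
    toy_decref(tos);
}

/* Specialized pop, emitted only when the popped value is known not to
 * own a reference: nothing to do, and with TOS caching the value never
 * even reaches the in-memory stack. */
static void pop_top_nop(ToyRef *tos)
{
    (void)tos;
}

int main(void)
{
    ToyRef *owned = malloc(sizeof(ToyRef));
    owned->refcount = 1;
    owned->owns_no_reference = false;

    ToyRef shared = { .refcount = 1, .owns_no_reference = true };

    /* The "optimizer" picks the pop variant based on what it knows
     * about the value on top of the stack. */
    if (shared.owns_no_reference) {
        pop_top_nop(&shared);         /* decref elided */
    }
    pop_top(owned);                   /* generic path still decrefs */

    printf("shared refcount untouched: %d\n", shared.refcount);
    return 0;
}
```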
I've mentioned this before, but I'm uncomfortable adding all of this additional code/complexity without a more certain payoff. Personally, "13%-18% faster" on one benchmark doesn't sway me on its own: I understand that this should be faster, but the numbers don't show that it is, generally. At least not enough to justify the additional complexity.
I'm not sure that's the case. Yesterday, we benchmarked this approach together:
So the results seem to be more subtle and interconnected than "they both make each other faster". If anything (based just on the numbers I've seen), decref elimination makes TOS caching slower, and we shouldn't use it to justify merging this PR.
@brandtbucher that branch uses decref elimination via option 1, i.e. it's not scalable at all for the whole interpreter unless you let me go ham with the DSL.
But the effect is the same, right? Decref elimination seems to interact poorly with this branch for some reason (it's not quite clear why yet).
I can't say for sure whether the effect is the same.
One suspicion I have from examining the nbody traces is that the decref elimination is not actually improving the spilling. The problem is that we have too few TOS registers, so it spills regardless of whether we decref eliminate or not (right before the _BINARY_OP_SUBSCR_LIST_INT instruction, for example). So the decref elimination is not doing anything for the TOS caching at the moment. On the other hand, decref elimination is not increasing the spills either: the maximum number of spills should stay the same whether or not we decref eliminate. So the benchmark results are a little suspect to me.
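A small toy sketch of why decref elimination alone doesn't change the spilling here (again invented names, push_cached and a three-slot cache, rather than the real generated code): once more values are live than there are cache slots, the deepest cached value has to be flushed to the in-memory stack no matter what happens to reference counts later, so the number of spills is driven by stack depth, not by whether inputs get decrefed.

```c
/* Toy sketch of overflow spilling with a three-slot TOS cache.
 * CACHE_SIZE, push_cached and the surrounding names are invented. */
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE 3

typedef struct {
    intptr_t stack[64];         /* in-memory evaluation stack */
    intptr_t *sp;
    intptr_t cache[CACHE_SIZE]; /* cache[0] is the top of stack */
    int depth;                  /* how many cache slots are live */
    int spills;                 /* counted only for demonstration */
} ToyFrame;

/* Push a value into the cache; if the cache is already full, the
 * deepest cached value must be written to memory first. */
static void push_cached(ToyFrame *f, intptr_t v)
{
    if (f->depth == CACHE_SIZE) {
        *f->sp++ = f->cache[CACHE_SIZE - 1];   /* spill the deepest slot */
        f->spills++;
        f->depth--;
    }
    for (int i = f->depth; i > 0; i--) {       /* shift towards the bottom */
        f->cache[i] = f->cache[i - 1];
    }
    f->cache[0] = v;
    f->depth++;
}

int main(void)
{
    ToyFrame f = {0};
    f.sp = f.stack;

    /* Four live values (say: iterator, list, index, and one more) with
     * only three cache slots -- the fourth push forces a spill,
     * regardless of whether any later decref is eliminated. */
    for (intptr_t v = 1; v <= 4; v++) {
        push_cached(&f, v);
    }
    printf("spills forced by stack depth: %d\n", f.spills);
    return 0;
}
```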
One more reason why I'm a little suspicious of our benchmarks: the JIT performance fluctuates quite a bit. On the Meta runners, the JIT fluctuates by about 2% week to week (https://github.com/facebookexperimental/free-threading-benchmarking). On the MS runner, it's slightly better, at about 1% weekly (https://github.com/faster-cpython/benchmarking-public). However, we're basing our decision on numbers that I can't say I fully trust.
@brandtbucher so I decided to build 4 versions of CPython on my system, with the following configs:
All of them are standard benchmarking builds, i.e. PGO, LTO, JIT. These are the results for nbody:
So we do indeed see TOS caching and decref elimination helping each other out and compounding on my system.
The Meta benchmarking results concur with my own results above:
Most importantly:
The tail-calling CI seems to be failing because Homebrew changed where they install clang (yet again). I will put up a separate PR to fix that.
OK, I fixed the macOS CI on main. Please pull the changes in.
I thought that caching through side exits would speed things up, but it looks like it slows things down a bit, if anything. So I've reverted that change. I will rerun the benchmarks to confirm...
I was hoping to get a clear across-the-board speedup before merging this.
savannahostrowski left a comment:
Took a quick skim of this - very neat! Thanks for also including all the perf numbers in the discussion, which helps counteract some of the initial trepidation I had around the amount of new generated code. This lays pretty solid groundwork for future optimizations as well.
Just one comment about the type change in pycore_optimizer.h
Once we get Savannah's LLVM 21 PR in, we should experiment with setting the TOS cache size to 4. I observe a lot of spilling due to the loop iterator taking up some space. |
|
I think we will want to vary the cache size depending on both hardware and operating system. All that is for later PR(s), though.
Uses three registers to cache values at the top of the evaluation stack. This significantly reduces memory traffic for smaller, more common uops.
Force-pushed from 7d47f13 to 4fec38d.
Apologies for the force push. This needed to be manually rebased after the changes to the JIT frontend.
Fidget-Spinner left a comment:
I've read the code and mostly understood it, since I had to rebase it a few times on my own branch.
These are the latest benchmarking results for this branch: https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251210-3.15.0a2%2B-4b24c15-JIT/bm-20251210-vultr-x86_64-faster%252dcpython-tier_2_tos_caching-3.15.0a2%2B-4b24c15-vs-base.md
The 19% nbody speedup is still there, along with a modest ~7% speedup for richards.
So this bodes very well for the JIT. Considering it will unlock future optimizations in decref elimination, which would be very hard without this, we should just go ahead and merge it.
The stats need fixing and the generated tables could be more compact, but it works.