
Conversation

@markshannon (Member) commented Jun 13, 2025

The stats need fixing and the generated tables could be more compact, but it works.

@Fidget-Spinner (Member) left a comment

This is really cool. I'll do a full review soon enough.

@markshannon (Member, Author)

Performance is in the noise overall; we would need a really big speedup of jitted code for it to show as more than noise.

The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup.
I don't know why that would be, as I think we are using stock LLVM for Mac, not the Apple compiler.

@Fidget-Spinner (Member)

> The nbody benchmark, which spends a lot of time in the JIT, shows a 13-18% speedup, except on Mac, where it shows no speedup. I don't know why that would be, as I think we are using stock LLVM for Mac, not the Apple compiler.

Nice. We use Apple's compiler for the interpreter, though the JIT uses stock LLVM. Thomas previously showed that the version of the Apple compiler we use is subject to huge performance fluctuations due to a PGO bug.

@markshannon markshannon marked this pull request as ready for review June 20, 2025 15:04

@Fidget-Spinner (Member) left a comment

I need to review the cases generator later.

@Fidget-Spinner (Member) left a comment

Stylistic nits

@markshannon (Member, Author)

Stats show that the spill/reload uops are about 13% of the total, and that we aren't spilling and reloading much more than the minimum.
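
For intuition, here is a tiny standalone model of what spilling and reloading mean here. This is a sketch only: the names (`tos`, `spill`, `reload`) are illustrative, and the real uops are emitted by the cases generator.

```c
/* Toy model of TOS-register caching with spill/reload uops.
 * Illustrative only -- not the PR's generated code. */
#include <stdio.h>

static long stack[64];            /* in-memory evaluation stack */
static long *stack_pointer = stack;
static long tos;                  /* top-of-stack value cached "in a register" */

static void spill(void) {         /* register -> memory, freeing the register */
    *stack_pointer++ = tos;
}

static void reload(void) {        /* memory -> register */
    tos = *--stack_pointer;
}

int main(void) {
    tos = 42;                     /* e.g. the result of a LOAD_FAST-like uop */
    spill();                      /* a following uop expects no cached values */
    reload();                     /* bring it back before a consuming uop */
    printf("%ld\n", tos);         /* prints 42 */
    return 0;
}
```

The 13% figure counts these transfer uops; since we aren't spilling much more than the minimum, the overhead is close to what cache pressure alone forces.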

@Fidget-Spinner (Member) left a comment

I was discussing this with Brandt the other day, and I concluded that we need this to implement decref elimination with a lower maintenance burden.

For decref elimination, there are two ways:

  1. Manually hand-write (or have the cases generator emit) a version of each instruction that has no PyStackRef_CLOSE/DECREF_INPUTS.
  2. Specialize for POP_TOP.

Option 2 is a lot more maintainable in the long run and doesn't involve more cases-generator magic the way option 1 does. It's also less likely to introduce a bug, again because there is less cases-generator magic. So I'm more inclined toward option 2.

Option 2 requires TOS caching; otherwise it won't pay off. So this PR is needed, or else we're blocking other optimizations. Unless, of course, you folks don't mind me exercising some cases-generator magic again 😉.
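
To make option 2 concrete, here is a rough sketch in the style of the bytecodes.c DSL. The specialized variants and the conditions under which the optimizer would emit them are illustrative assumptions, not code from this PR:

```c
// Sketch only: specialized _POP_TOP variants the tier-2 optimizer
// could substitute when it has proven facts about the popped value.
// Names and details are illustrative, not from this PR.

op(_POP_TOP, (value --)) {
    PyStackRef_CLOSE(value);   // generic path: full decref machinery
}

op(_POP_TOP_NOP, (value --)) {
    // Value is known to be immortal or borrowed: no decref needed.
    (void)value;
}

op(_POP_TOP_INT, (value --)) {
    // Value is known to be an exact int: skip the generic
    // destructor dispatch.
    PyStackRef_CLOSE_SPECIALIZED(value, _PyLong_ExactDealloc);
}
```

With TOS caching, the popped value is usually already in a register, so these variants become nearly free; without it, the load/store traffic around the pop dominates, which is presumably why option 2 only pays off on top of this PR.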

@brandtbucher (Member)

I've mentioned this before, but I'm uncomfortable adding all of this additional code/complexity without a more certain payoff. Personally, "13-18% faster nbody on most platforms, and everything else in the noise" really just doesn't feel like enough. If this is really a good approach to register allocation, we should be able to see a general pattern of improvement across multiple JIT-heavy benchmarks. That could mean more tweaking of magic numbers, or a change in approach.

I understand that this should be faster. But the numbers don't show that it is, generally. At least not enough to justify the additional complexity.

> Option 2 requires TOS caching; otherwise it won't pay off. So this PR is needed, or else we're blocking other optimizations.

I'm not sure that's the case. Yesterday, we benchmarked this approach together:

  • Just decref elimination: 7% faster nbody
  • This PR: 13% faster nbody
  • Both PRs together: 9% faster nbody

So the results seem to be more subtle and interconnected than "they both make each other faster". If anything (based just on the numbers I've seen), decref elimination makes TOS caching slower, and we shouldn't use it to justify merging this PR.

@Fidget-Spinner (Member)

@brandtbucher that branch uses decref elimination via option 1, i.e. it's not scalable to the whole interpreter at all unless you let me go ham with the DSL.

@brandtbucher (Member)

But the effect is the same, right? Decref elimination seems to interact poorly with this branch for some reason (it's not quite clear why yet).

@Fidget-Spinner (Member)

I can't say for sure whether the effect is the same.

@Fidget-Spinner (Member) commented Jun 26, 2025

One suspicion I have from examining the nbody traces is that decref elimination is not actually improving the spilling. The problem is that we have too few TOS registers, so we spill regardless of whether we decref-eliminate or not (right before the _BINARY_OP_SUBSCR_LIST_INT instruction, for example). So decref elimination is not doing anything for TOS caching at the moment.

What is odd, however, is that decref elimination shouldn't add any spills either: the maximum number of spills should stay the same with or without it. So the benchmark results are a little suspect to me.
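
To illustrate what I mean by cache pressure forcing the spill, here is a hypothetical trace shape (illustrative only, not copied from the real nbody traces), assuming a three-register TOS cache:

```c
/* Hypothetical trace shape, assuming a 3-register TOS cache.
 * Illustrative only -- not from the actual nbody traces.
 *
 *   _LOAD_FAST                  a        (1 of 3 registers used)
 *   _LOAD_FAST                  a, b     (2 of 3)
 *   _LOAD_FAST                  a, b, c  (3 of 3 -- cache full)
 *   <spill>                     a moves to the in-memory stack,
 *                               freeing a register for the next push
 *   _LOAD_FAST                  b, c, d
 *   _BINARY_OP_SUBSCR_LIST_INT  b, c[d]
 *
 * The spill is driven purely by cache pressure; eliminating the later
 * decref of the result would not remove it. */
```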

@Fidget-Spinner (Member)

One more reason why I'm a little suspicious of our benchmarks: JIT performance fluctuates quite a bit. On the Meta runners, the JIT fluctuates by about 2% week to week: https://github.com/facebookexperimental/free-threading-benchmarking

On the Microsoft runner it's slightly better, at about 1% weekly: https://github.com/faster-cpython/benchmarking-public. However, we're basing our decision on these numbers, and I can't say I trust them fully.

@Fidget-Spinner (Member)

@brandtbucher so I decided to build four versions of CPython on my system, with the following configs:

  1. Base bda1218 (the reference)
  2. Decref elimination
  3. TOS caching
  4. TOS caching + decref elimination

All of them are standard benchmarking builds, i.e. PGO, LTO, and the JIT enabled.

These are the results for nbody:

Mean +- std dev: [bench-base-bda121862] 105 ms +- 1 ms -> [bench-decref-bda121862] 102 ms +- 1 ms: 1.03x faster
Mean +- std dev: [bench-base-bda121862] 105 ms +- 1 ms -> [bench-tos-caching-bda121862] 91.1 ms +- 0.5 ms: 1.16x faster
Mean +- std dev: [bench-base-bda121862] 105 ms +- 1 ms -> [bench-tos-caching-decref-bda121862] 85.8 ms +- 0.4 ms: 1.23x faster

So we do indeed see TOS caching and decref elimination helping each other out and compounding on my system.

@Fidget-Spinner (Member) commented Jul 3, 2025

The Meta benchmarking results concur with my own results above:

Most importantly:

  • nbody with TOS caching only: 1.18x faster
  • nbody with TOS caching + refcount elimination: 1.21x faster

So refcount elimination does indeed improve on top of TOS caching on the Meta runners.

@markshannon markshannon requested a review from diegorusso as a code owner July 21, 2025 09:46
@Fidget-Spinner (Member)

The tail-calling CI seems to be failing because Homebrew changed where it installs Clang (yet again). I'll put up a separate PR to fix that.

@Fidget-Spinner (Member)

OK, I fixed the macOS CI on main. Please pull in the changes.

@markshannon (Member, Author)

I thought that caching through side exits would speed things up, but it looks like it slows things down a bit, if anything.
https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20250822-3.15.0a0-efe4628-JIT

So, I've reverted that change.

Will rerun the benchmarks to confirm...

@markshannon (Member, Author)

I was hoping to get a clear across-the-board speedup before merging this.
But as it also enables decref removal and fixes the problem of intermediate values on the stack during tracing (#140277), I think we should merge this soon and tweak it later.

@savannahostrowski (Member) left a comment

Took a quick skim of this - very neat! Thanks for also including all the perf numbers in the discussion, which helps counteract some of the initial trepidation I had around the amount of new generated code. This lays pretty solid groundwork for future optimizations as well.

Just one comment about the type change in pycore_optimizer.h

@bedevere-app (bot) commented Nov 9, 2025

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@Fidget-Spinner (Member)

Once we get Savannah's LLVM 21 PR in, we should experiment with setting the TOS cache size to 4. I observe a lot of spilling because the loop iterator takes up one of the cached slots.

@markshannon (Member, Author) commented Nov 12, 2025

I think we will want to vary the cache size depending on both hardware and operating system.
We will want to vary it across hardware because of differing numbers of available registers.
We will want to vary it across operating systems because different OSes have different calling conventions, which changes the number of registers available to us.

All that for later PR(s) though.
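
As a sketch of what that could look like, assuming a hypothetical TOS_CACHE_SIZE macro (the name and the per-platform values are illustrative guesses, not from this PR):

```c
/* Hypothetical: the macro name and values are illustrative, not from
 * this PR.  The Windows x64 calling convention leaves fewer registers
 * free, while AArch64 has more to spare. */
#ifndef TOS_CACHE_SIZE
#  if defined(_WIN64)
#    define TOS_CACHE_SIZE 2
#  elif defined(__aarch64__)
#    define TOS_CACHE_SIZE 4
#  else
#    define TOS_CACHE_SIZE 3   /* what this PR uses today */
#  endif
#endif
```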

Uses three registers to cache values at the top of the evaluation stack.
This significantly reduces memory traffic for smaller, more common uops.
@markshannon (Member, Author)

Apologies for the force-push. This needed manual rebasing after the changes to the JIT frontend.

@Fidget-Spinner (Member) left a comment

I've read the code and mostly understand it, as I've had to rebase it a few times onto my own branch.

These are the latest benchmarking results for this branch: https://github.com/facebookexperimental/free-threading-benchmarking/blob/main/results/bm-20251210-3.15.0a2%2B-4b24c15-JIT/bm-20251210-vultr-x86_64-faster%252dcpython-tier_2_tos_caching-3.15.0a2%2B-4b24c15-vs-base.md

The 19% nbody speedup is still there, along with a modest ~7% speedup for richards.

So this bodes very well for the JIT. Considering it will unlock future optimizations in decref elimination, which would be very hard without this, we should just go ahead and merge it.

@markshannon markshannon merged commit 469f191 into python:main Dec 11, 2025
78 checks passed