@almayne
Contributor

@almayne almayne commented Dec 17, 2025

This change adds interleaving to sgemm and dgemm copies and kernels for ARMV8SVE.
This required a degree of disentangling symm and trmm kernels from gemm. It should now be much easier to apply further optimisations to gemm.

The addition of interleaving provides a ~1.4% speedup on c7g (V1), with negligible changes on c8g (V2).

Taken over square matrix operations with size 2->2014, stepsize = 1:
Geometric mean for interleave/c7g_dgemm.txt: 0.9859023206257058
Geometric mean for interleave/c7g_sgemm.txt: 0.9887890902680289
Geometric mean for interleave/c8g_dgemm.txt: 0.9970050554316875
Geometric mean for interleave/c8g_sgemm.txt: 0.9948135816755502

We see an increase in the sgemm speedup (~2.4%) on c7g for larger matrix sizes.

Taken over square matrix operations with size 2,000->10,000, stepsize = 1,000:
Geometric mean for 64thread_interleave/c7g_dgemm.txt: 0.9865252964543917
Geometric mean for 64thread_interleave/c7g_sgemm.txt: 0.9762227312411808
Geometric mean for 64thread_interleave/c8g_dgemm.txt: 0.9997186302044462
Geometric mean for 64thread_interleave/c8g_sgemm.txt: 0.9996022927667269
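
For reference, a minimal sketch of how the figures above map to the quoted percentages. It assumes each reported value is the geometric mean of per-size runtime ratios (interleaved time divided by baseline time), which is not stated explicitly in this PR; the function names are hypothetical and are not part of the benchmark scripts.

```python
import math


def geometric_mean(ratios):
    """Geometric mean computed as exp(mean(log(r))) for numerical stability."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def percent_time_saved(ratio):
    """Convert a runtime ratio into a percentage reduction.

    Assumption: ratio = time_with_interleave / time_baseline, so a value
    below 1.0 means the interleaved build is faster.
    """
    return (1.0 - ratio) * 100.0


# Geometric means copied from the results listed in this comment.
reported = {
    "c7g_dgemm (2->2014)": 0.9859023206257058,
    "c7g_sgemm (2->2014)": 0.9887890902680289,
    "c7g_dgemm (64 threads)": 0.9865252964543917,
    "c7g_sgemm (64 threads)": 0.9762227312411808,
}

for name, gm in reported.items():
    print(f"{name}: {percent_time_saved(gm):.2f}% less time")
```

A geometric mean is used rather than an arithmetic mean because the inputs are ratios: a 2x slowdown and a 2x speedup then cancel to 1.0 instead of averaging to 1.25.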

@aditew01
Contributor

aditew01 commented Dec 17, 2025

@martin-frbg @Mousius can you please have a look?

@martin-frbg
Collaborator

Thanks for looking into this - though I'm not immediately convinced that the speedup warrants the added complexity @Mousius ? At least I'd like to put this off until after the 0.3.31 release.

@Mousius
Contributor

Mousius commented Jan 6, 2026

> Thanks for looking into this - though I'm not immediately convinced that the speedup warrants the added complexity @Mousius ? At least I'd like to put this off until after the 0.3.31 release.

The speedup is one thing; the other is enabling GEMM kernels without also having to implement the other kernels. This would enable us to land existing SME kernels, as previously proposed here: #5011 (comment)

Putting it off until after 0.3.31 makes perfect sense to me, as it's a relatively high-risk change.

@almayne
Contributor Author

almayne commented Jan 6, 2026

> Thanks for looking into this - though I'm not immediately convinced that the speedup warrants the added complexity @Mousius ? At least I'd like to put this off until after the 0.3.31 release.

Hi Martin. Thanks for taking a look. I'm happy for this to go in after the release. Do you have a rough estimate for when that might be, so I can share internally?
