OpenBLAS bottlenecks multithreading benefits at 8 working threads despite setting BLAS threads to 1

My code makes heavy use of the LAPACK `syevr` routines via Julia, on relatively small matrices (at most 15x15). I'm processing chunks of video frames across multiple threads, and each thread performs millions of these operations. I've set the number of BLAS threads to 1, which I understand to mean that OpenBLAS does its work on the calling thread. (Setting it to anything above 1 generally tanks performance.)
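For concreteness, here is a minimal sketch of this kind of workload. It is not the actual video-processing code: `process_chunk` and `run_chunks` are hypothetical names, and random symmetric matrices stand in for the real per-frame data.

```julia
using LinearAlgebra

# Restrict BLAS to a single thread; parallelism comes from Julia tasks only.
BLAS.set_num_threads(1)

# Hypothetical stand-in for one chunk of frames: eigendecompose `nmats`
# random symmetric n-by-n matrices with syevr and accumulate the largest
# eigenvalue of each.
function process_chunk(nmats::Int, n::Int=15)
    acc = 0.0
    for _ in 1:nmats
        A = randn(n, n)
        S = A + A'                      # symmetric input
        # jobz='V' (values and vectors), range='A' (all), uplo='U';
        # vl, vu, il, iu are ignored for range='A'; abstol=-1.0 uses the default.
        vals, _ = LAPACK.syevr!('V', 'A', 'U', S, 0.0, 0.0, 0, 0, -1.0)
        acc += vals[end]                # eigenvalues are returned in ascending order
    end
    return acc
end

# Spawn one task per chunk and combine the per-chunk results.
function run_chunks(nchunks::Int, nmats::Int)
    tasks = [Threads.@spawn process_chunk(nmats) for _ in 1:nchunks]
    return sum(fetch.(tasks))
end
```

Calling `run_chunks(Threads.nthreads(), 100_000)` after starting Julia with `julia -t N` exercises the same pattern: many independent single-threaded `syevr` calls issued concurrently from different Julia threads.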

However, I've found that regardless of machine size, performance gains stop once I reach 8 working threads, and performance even degrades with many more. It seems that OpenBLAS, without doing any multithreaded computation itself, is somehow interfering with the higher-level multithreading?
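One way to check this kind of scaling in isolation is a weak-scaling timing loop: each task does a fixed amount of `syevr` work, so with ideal scaling the elapsed time should stay roughly flat as tasks are added. The sketch below is an assumption about how such a check could look, not the benchmark behind the plot; `eig_work` and `time_tasks` are hypothetical helpers.

```julia
using LinearAlgebra

BLAS.set_num_threads(1)

# Fixed amount of work per task: repeated eigenvalue-only syevr calls
# on one symmetric n-by-n matrix (syevr! overwrites its input, so copy first).
function eig_work(nmats::Int, n::Int=15)
    A = randn(n, n)
    S = A + A'
    buf = similar(S)
    for _ in 1:nmats
        copyto!(buf, S)
        LAPACK.syevr!('N', 'A', 'U', buf, 0.0, 0.0, 0, 0, -1.0)
    end
    return nothing
end

# Wall-clock time for `ntasks` concurrent tasks, each running eig_work(nmats).
function time_tasks(ntasks::Int, nmats::Int)
    eig_work(10)                        # warm-up so compilation is not timed
    return @elapsed begin
        @sync for _ in 1:ntasks
            Threads.@spawn eig_work(nmats)
        end
    end
end

# Only task counts up to the number of Julia threads are meaningful here.
for ntasks in (1, 2, 4, 8, 16, 32, 48)
    ntasks > Threads.nthreads() && break
    t = time_tasks(ntasks, 20_000)
    println("tasks = ", ntasks, "  elapsed = ", round(t; digits=3), " s")
end
```

If elapsed time grows noticeably beyond some task count even though enough cores are available, the slowdown lies in something shared between the calls rather than in the per-call arithmetic.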

If I switch to MKL with 1 thread, I see continued performance improvements through 48 CPUs.

I'm willing to poke around at this as much as I can myself; I'm just not sure where to begin. Where might the bottleneck be?

<img width="811" height="798" alt="Image" src="https://github.com/user-attachments/assets/db33c9c6-13ae-4405-9001-4c1facbb77be" />


OpenBLAS bottlenecks multithreading benefits at 8 working threads despite setting BLAS threads to 1 #5589
