I have code that makes heavy use of the LAPACK syevr functions via Julia. The matrices are relatively small (at most 15x15). I'm processing chunks of video frames across multiple threads, and each thread performs millions of these operations. I've set BLAS threads to 1, which I understand to mean that OpenBLAS just runs on the parent thread that calls it. (Setting it to anything more than 1 generally tanks performance.)
However, what I've found is that no matter how large a machine I run on, performance gains stop once I reach 8 worker threads, and performance even gets worse with many more. Somehow it seems that OpenBLAS, without doing any multithreaded computation itself, is interfering with the higher-level multithreading?
If I switch to MKL with 1 thread, I see continued performance improvements through 48 CPUs.
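For anyone who wants to reproduce the scaling behaviour, here is a minimal sketch rather than my actual code. It assumes Julia was started with multiple threads (for example `julia -t 48`), uses a random symmetric 15x15 matrix as placeholder data, and calls `LAPACK.syevr!` directly. The names `syevr_work` and `time_tasks` are hypothetical helpers made up for this example.

```julia
using LinearAlgebra

# One BLAS thread, matching the setup described above.
BLAS.set_num_threads(1)

# Hypothetical stand-in for the per-frame work: repeatedly solve a small
# symmetric eigenproblem with LAPACK's syevr.
function syevr_work(iters::Int, n::Int)
    B = rand(n, n)
    A = B + B'              # random symmetric matrix (placeholder data)
    buf = similar(A)
    acc = 0.0
    for _ in 1:iters
        copyto!(buf, A)     # syevr! overwrites its input, so work on a copy
        W, Z = LAPACK.syevr!('V', 'A', 'U', buf, 0.0, 0.0, 0, 0, -1.0)
        acc += W[1]         # use the result so the call is not optimized away
    end
    return acc
end

# Run `ntasks` independent copies of the work and return elapsed seconds.
function time_tasks(ntasks::Int; iters::Int = 20_000, n::Int = 15)
    return @elapsed begin
        @sync for _ in 1:ntasks
            Threads.@spawn syevr_work(iters, n)
        end
    end
end

time_tasks(1; iters = 100)  # warm-up so compilation is not timed

# Each task does the same amount of work, so with ideal scaling the elapsed
# time stays roughly flat as the task count grows.
for ntasks in filter(<=(Threads.nthreads()), [1, 2, 4, 8, 16, 32, 48])
    t = time_tasks(ntasks)
    println("tasks = ", ntasks, "  seconds = ", round(t; digits = 3))
end
```

Because every task does a fixed amount of work, the elapsed time should stay about constant while scaling holds and start to climb once it stops. Running the same script with MKL swapped in (for instance by loading the MKL.jl package first, assuming it is installed) should give a direct comparison between the two backends.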
I'm willing to poke around at this as much as I can myself; I'm just not sure where to begin. Where might the bottleneck be?
