Can you explain what you mean by this? Are you saying there's a correctness issue here? I only recall running into problems with MPI, where you typically run one MPI rank (process) per CPU core. If you combine that with a multi-threaded BLAS library, which typically starts one thread per core by default, you suddenly have N^2 BLAS threads fighting over N cores, and performance goes down the drain. The solution, like you say, is to use a single-threaded OpenBLAS, or alternatively the OpenMP build of OpenBLAS with OMP_NUM_THREADS=1.
I guess you'll have the same issue with threads: if you launch N CPU-bound threads and each of them calls BLAS, you end up with the same N^2 oversubscription as with MPI.
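To make the arithmetic concrete, here is a rough Python sketch of the oversubscription and of the usual workaround. The helper names `total_blas_threads` and `pin_blas_to_one_thread` are made up for illustration; `OMP_NUM_THREADS` and `OPENBLAS_NUM_THREADS` are the environment variables OpenBLAS reads, and I'm assuming they get set before the BLAS library is loaded.

```python
import os


def total_blas_threads(n_workers: int, blas_threads_per_worker: int) -> int:
    """Number of BLAS threads alive when every worker calls BLAS at once.

    A "worker" is an MPI rank or a CPU-bound application thread.
    """
    return n_workers * blas_threads_per_worker


def pin_blas_to_one_thread(env=os.environ) -> None:
    """Hypothetical helper: ask BLAS to stay single-threaded.

    Must run before the BLAS library is loaded (e.g. before importing
    NumPy), because the thread count is normally read at load time.
    """
    env["OMP_NUM_THREADS"] = "1"       # OpenMP build of OpenBLAS
    env["OPENBLAS_NUM_THREADS"] = "1"  # pthreads build of OpenBLAS


if __name__ == "__main__":
    n_cores = os.cpu_count() or 1
    # One worker per core, each with a BLAS pool sized to the core count.
    print("default:", total_blas_threads(n_cores, n_cores), "threads on", n_cores, "cores")
    # One worker per core, BLAS limited to a single thread per worker.
    print("pinned: ", total_blas_threads(n_cores, 1), "threads on", n_cores, "cores")
```

On an 8-core machine the first case gives 64 runnable BLAS threads competing for 8 cores, while the second gives exactly one per core.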