> This is your original comment, which as stated is simply incorrect
I apologize if I couldn't make myself understood, but I still believe I said it clearly, and it is simply correct.
Anyway, let me try to mansplain it again: "numpy isn't written in Python" - and numpy releases the GIL. So as long as Python code calls into numpy, the thread running the numpy call can proceed without the GIL, and the GIL can be taken by another thread to run Python code. I can easily have 100 threads running inside numpy without any GIL issue. BUT they eventually need to return to Python and retake the GIL. Say that for every 1 second of work, each such thread needs 0.1 seconds of pure Python code (and must hold the GIL for it). Please tell me how to scale this to >10 threads.
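The >10 threads limit above follows from a simple saturation argument; here is a minimal sketch (the function name `max_useful_threads` is my own, purely illustrative):

```python
# Back-of-envelope model of GIL saturation.
# Each thread needs `gil_fraction` of its wall time holding the GIL
# (pure Python code); the rest runs in GIL-released numpy calls.
def max_useful_threads(gil_fraction):
    # The GIL is a single shared resource: aggregate GIL demand across
    # N threads is N * gil_fraction, which cannot exceed 1.0. Beyond
    # that, threads queue on the GIL and extra threads add no throughput.
    return int(1.0 / gil_fraction)

print(max_useful_threads(0.1))  # -> 10: with 10% Python time per thread,
                                # the GIL saturates at ~10 threads
```

So even though 90% of the work is GIL-free, that remaining 10% caps useful concurrency at about 10 threads, no matter how many CPU cores are available.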
> Now you have concocted this arbitrary example of why you can't use multithreading that has nothing to do with your original comment or my response.
The example is not arbitrary at all. This is exactly the problem people are facing TODAY in ANY DL training written in PyTorch.
I have a Python thread driving GPUs asynchronously; it barely runs any Python code. All good, no problem.
Then, I need to load and preprocess data [1] for the GPUs to consume. I need very high velocity when changing this code, so it looks like a stupid script: it reads data from storage, then does some transformation using numpy / whatever else I decided to call. Unfortunately, since dealing with the data is largely where the magic happens in today's DL, the code spends non-trivial time (say, 10%) in pure Python code manipulating bullshit dicts in between all the numpy calls.
Compared to what happens on the GPUs this is pretty lightweight; it is not latency sensitive, I just need good-enough throughput, and there are always enough CPU cores sitting alongside the GPUs. So, ideally, I just dial up the concurrency. And then I hit the GIL wall.
> I don't think you understood my comment - or maybe you don't understand Python multithreading. If a C extension is single threaded but releases the GIL, you can use multithreading to parallelize it in Python. e.g. `ThreadPool(processes=100)` will create 100 threads within the current Python process and it will soak all the CPUs you have -- without additional Python processes. I have done this many times with numpy, numba, vector indexes, etc.
I don't think you understood what you did before, and you have already wasted a lot of CPUs.
I never talked about doing any SINGLE computation. Maybe you are one of those HPC gurus who care, and only care, about solving single very big problem instances? Otherwise I have no idea why you are even talking about hierarchical parallelization after I already said that a lot of problems are embarrassingly parallel, and they are important.
[1] Why don't I simply do it once and store the result? Because that's where the actual research is happening, and it is what a lot of experiments are about. Yeah, not model architectures, not hyper-parameters. Just how you massage your data.